
arXiv:1505.00273v4 [cs.IT] 20 Nov 2015

A Survey of Stochastic Simulation and Optimization Methods in Signal Processing

Marcelo Pereyra, Philip Schniter, Emilie Chouzenoux, Jean-Christophe Pesquet, Jean-Yves Tourneret, Alfred Hero, and Steve McLaughlin

Marcelo Pereyra is with the University of Bristol, School of Mathematics, University Walk, BS8 1TW, UK (e-mail: marcelo.pereyra@bristol.ac.uk). Philip Schniter is with The Ohio State University, Department of Electrical and Computer Engineering, 2015 Neil Ave., Columbus, OH 43210, USA (e-mail: [email protected]). Emilie Chouzenoux and Jean-Christophe Pesquet are with the Universite Paris-Est, Laboratoire d'Informatique Gaspard Monge, Champs sur Marne, France (e-mail: {emilie.chouzenoux,jean-christophe.pesquet}@univ-paris-est.fr). Jean-Yves Tourneret is with the University of Toulouse, INP-ENSEEIHT-IRIT-TeSA, France (e-mail: jean-yves.tourneret@enseeiht.fr). Alfred Hero is with the University of Michigan, Dept of Electrical Engineering and Computer Science, USA (e-mail: [email protected]). Steve McLaughlin is with Heriot Watt University, Engineering and Physical Sciences, Edinburgh, EH14 4AS, UK (e-mail: s.mclaughlin@hw.ac.uk).

This work was funded in part by the SuSTaIN program - EPSRC grant EP/D063485/1 - at the Department of Mathematics, University of Bristol. MP holds a Marie Curie Intra-European Fellowship for Career Development. SMcL acknowledges the support of EPSRC via grant EP/J015180/1. AH acknowledges the support of the NSF via grants CCF-1218754 and CCF-1018368, and of the ARO via grant W911NF-15-1-0479. Part of this work was also supported by the CNRS Imag'in project under grant 2015OPTIMISME, by the HYPANEMA ANR Project no. ANR-12-BS03-003, by the BNPSI ANR Project no. ANR-13-BS-03-0006-01, and by the project ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02 as part of the thematic trimester on image processing.

Abstract—Modern signal processing (SP) methods rely very heavily on probability and statistics to solve challenging SP problems. SP methods are now expected to deal with ever more complex models, requiring ever more sophisticated computational inference techniques. This has driven the development of statistical SP methods based on stochastic simulation and optimization. Stochastic simulation and optimization algorithms are computationally intensive tools for performing statistical inference in models that are analytically intractable and beyond the scope of deterministic inference methods. They have been recently successfully applied to many difficult problems involving complex statistical models and sophisticated (often Bayesian) statistical inference techniques. This survey paper offers an introduction to stochastic simulation and optimization methods in signal and image processing. The paper addresses a variety of high-dimensional Markov chain Monte Carlo (MCMC) methods as well as deterministic surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation and approximate message passing algorithms. It also discusses a range of optimization methods that have been adopted to solve stochastic problems, as well as stochastic methods for deterministic optimization. Subsequently, areas of overlap between simulation and optimization, in particular optimization-within-MCMC and MCMC-driven optimization are discussed.

Index Terms—Bayesian inference; Markov chain Monte Carlo; proximal optimization algorithms; variational algorithms; approximate inference.

I. INTRODUCTION

Modern signal processing (SP) methods (we use SP here to cover all relevant signal and image processing problems) rely very heavily on probabilistic and statistical tools; for example, they use stochastic models to represent the data observation process and the prior knowledge available, and they obtain solutions by performing statistical inference (e.g., using maximum likelihood or Bayesian strategies). Statistical SP methods are, in particular, routinely applied to many and varied tasks and signal modalities, ranging from resolution enhancement of medical images to hyperspectral image unmixing; from user rating prediction to change detection in social networks; and from source separation in music analysis to speech recognition.

However, the expectations and demands on the performance of such methods are constantly rising. SP methods are now expected to deal with challenging problems that require ever more complex models, and more importantly, ever more sophisticated computational inference techniques. This has driven the development of computationally intensive SP methods based on stochastic simulation and optimization. Stochastic simulation and optimization algorithms are computationally intensive tools for performing statistical inference in models that are analytically intractable and beyond the scope of deterministic inference methods. They have been recently successfully applied to many difficult SP problems involving complex statistical models and sophisticated (often Bayesian) statistical analyses. These problems can generally be formulated as inverse problems involving partially unknown observation processes and imprecise prior knowledge, for which they delivered accurate and insightful results. These stochastic algorithms are also closely related to the randomized, variational Bayes and message passing algorithms that are pushing forward the state of the art in approximate statistical inference for very large-scale problems. The key thread that makes stochastic simulation and optimization methods appropriate for these applications is the high complexity and dimensionality involved. For example, in the case of hyperspectral imaging the data being processed can involve images of 2048 by 1024 pixels across up to hundreds or thousands of wavelengths.
This survey paper offers an introduction to stochastic simulation and optimization methods in signal and image processing. The paper addresses a variety of high-dimensional Markov chain Monte Carlo (MCMC) methods as well as deterministic surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation and approximate message passing algorithms. It also discusses a range of optimization methods that have been adopted to solve stochastic problems, as well as stochastic methods for deterministic optimization. Subsequently, areas of overlap between simulation and optimization, in particular the use of optimization techniques within MCMC algorithms and MCMC-driven optimization, are discussed. Some methods such as sequential Monte Carlo methods or methods based on importance sampling are not considered in this survey mainly due to space limitations.

This paper seeks to provide a survey of a variety of the algorithmic approaches in a tutorial fashion, as well as to highlight the state of the art, relationships between the methods, and potential future directions of research. In order to set the scene and inform our notation, consider an unknown random vector of interest x = [x_1, ..., x_N]^T and an observed data vector y = [y_1, ..., y_M]^T, related to x by a statistical model with likelihood function p(y|x; θ) potentially parametrized by a deterministic vector of parameters θ. Following a Bayesian inference approach, we model our prior knowledge about x with a prior distribution p(x; θ), and our knowledge about x after observing y with the posterior distribution

  p(x|y; θ) = p(y|x; θ) p(x; θ) / Z(θ)   (1)

where the normalising constant

  Z(θ) = p(y; θ) = ∫ p(y|x; θ) p(x; θ) dx   (2)

is known as the "evidence", model likelihood, or the partition function. Although the integral in (2) suggests that all x_j are continuous random variables, we allow any random variable x_j to be either continuous or discrete, and replace the integral with a sum as required.

In many applications, we would like to evaluate the posterior p(x|y; θ) or some summary of it, for instance point estimates of x such as the conditional mean (i.e., MMSE estimate) E{x|y; θ}, uncertainty reports such as the conditional variance var{x|y; θ}, or expected log statistics as used in the expectation maximization (EM) algorithm [1–3]

  θ^(i+1) = arg max_θ E{ ln p(x, y; θ) | y; θ^(i) }   (3)

where the expectation is taken with respect to p(x|y; θ^(i)). However, when the signal dimensionality N is large, the integral in (2), as well as those used in the posterior summaries, are often computationally intractable. Hence the interest in computationally efficient alternatives. An alternative that has received a lot of attention in the statistical SP community is maximum-a-posteriori (MAP) estimation. Unlike other posterior summaries, MAP estimates can be computed by finding the value of x maximizing p(x|y; θ), which for many models is significantly more computationally tractable than numerical integration. In the sequel, we will suppress the dependence on θ in the notation, since it is not of primary concern.

The paper is organized as follows. After this brief introduction where we have introduced the basic notation adopted, Section II discusses stochastic simulation methods, and in particular a variety of MCMC methods. In Section III we discuss deterministic surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation, and provide a brief summary of approximate message passing algorithms. Section IV discusses a range of optimization methods for solving stochastic problems, as well as stochastic methods for solving deterministic optimization problems. Subsequently, in Section V we discuss areas of overlap between simulation and optimization, in particular the use of optimization techniques within MCMC algorithms and MCMC-driven optimization, and suggest some interesting areas worthy of exploration. Finally, in Section VI we draw together thoughts, observations and conclusions.
II. STOCHASTIC SIMULATION METHODS

Stochastic simulation methods are sophisticated random number generators that allow samples to be drawn from a user-specified target density π(x), such as the posterior p(x|y). These samples can then be used, for example, to approximate probabilities and expectations by Monte Carlo integration [4, Ch. 3]. In this section we will focus on Markov chain Monte Carlo (MCMC) methods, an important class of stochastic simulation techniques that operate by constructing a Markov chain with stationary distribution π. In particular, we concentrate on Metropolis-Hastings algorithms for high-dimensional models (see [5] for a more general recent review of MCMC methods). It is worth emphasizing, however, that we do not discuss many other important approaches to simulation that also arise often in signal processing applications, such as "particle filters" or sequential Monte Carlo methods [6, 7], and approximate Bayesian computation [8].

A cornerstone of MCMC methodology is the Metropolis-Hastings (MH) algorithm [4, Ch. 7][9, 10], a universal machine for constructing Markov chains with stationary density π. Given some generic proposal distribution x∗ ∼ q(·|x), the generic MH algorithm proceeds as follows.

Algorithm 1 Metropolis–Hastings algorithm (generic version)
  Set an initial state x^(0)
  for t = 1 to T do
    Generate a candidate state x∗ from a proposal q(·|x^(t−1))
    Compute the acceptance probability
      ρ^(t) = min{ 1, [π(x∗) q(x^(t−1)|x∗)] / [π(x^(t−1)) q(x∗|x^(t−1))] }
    Generate u_t ∼ U(0, 1)
    if u_t ≤ ρ^(t) then
      Accept the candidate and set x^(t) = x∗
    else
      Reject the candidate and set x^(t) = x^(t−1)
    end if
  end for

Under mild conditions on q, the chains generated by Algo. 1 are ergodic and converge to the stationary distribution π [11, 12]. An important feature of this algorithm is that computing the acceptance probabilities ρ^(t) does not require knowledge of the normalization constant of π (which is often not available in Bayesian inference problems). The intuition for the MH algorithm is that the algorithm proposes a stochastic perturbation to the state of the chain and applies a carefully defined decision rule to decide if this perturbation should be accepted or not. This decision rule, given by the random accept-reject step in Algo. 1, ensures that at equilibrium the samples of the Markov chain have π as marginal distribution.
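To make Algo. 1 concrete, the following minimal Python sketch implements the MH recursion with a symmetric Gaussian random-walk proposal, for which the proposal ratio in ρ^(t) cancels. The target density, step size, and chain length below are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_iter, step=0.5, rng=None):
    """Generic MH sampler (Algo. 1) with a symmetric Gaussian
    random-walk proposal, so the q-ratio in rho cancels."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_iter, x.size))
    log_px = log_target(x)
    for t in range(n_iter):
        x_prop = x + step * rng.standard_normal(x.size)   # candidate state
        log_pp = log_target(x_prop)
        # accept with probability min(1, pi(x*)/pi(x)); work in log domain
        if np.log(rng.uniform()) <= log_pp - log_px:
            x, log_px = x_prop, log_pp                    # accept
        samples[t] = x                                    # keep x on reject
    return samples

# Illustrative target: an unnormalized standard Gaussian density.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                            x0=np.zeros(2), n_iter=5000)
print(chain.mean(axis=0))   # approaches [0, 0] at equilibrium
```

Note that only the unnormalized log-density is needed, reflecting the fact that ρ^(t) does not depend on the normalization constant of π.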
The specific choice of q will of course determine the efficiency and the convergence properties of the algorithm. Ideally one should choose q = π to obtain a perfect sampler (i.e., with candidates accepted with probability 1); this is of course not practically feasible since the objective is to avoid the complexity of directly simulating from π. In the remainder of this section we review strategies for specifying q for high-dimensional models, and discuss relative advantages and disadvantages. In order to compare and optimize the choice of q, a performance criterion needs to be chosen. A natural criterion is the stationary integrated autocorrelation time for some relevant scalar summary statistic g : R^N → R, i.e.,

  τ_g = 1 + 2 Σ_{t=1}^∞ Cor{ g(x^(0)), g(x^(t)) }   (4)

with x^(0) ∼ π, and where Cor(·, ·) denotes the correlation operator. This criterion is directly related to the effective number of independent Monte Carlo samples produced by the MH algorithm, and therefore to the mean square error of the resulting Monte Carlo estimates. Unfortunately drawing conclusions directly from (4) is generally not possible because τ_g is highly dependent on the choice of g, with different choices often leading to contradictory results. Instead, MH algorithms are generally studied asymptotically in the infinite-dimensional model limit. More precisely, in the limit N → ∞, the algorithms can be studied using diffusion process theory, where the dependence on g vanishes and all measures of efficiency become proportional to the diffusion speed. The "complexity" of the algorithms can then be defined as the rate at which efficiency deteriorates as N → ∞, e.g., O(N) (see [13] for an introduction to this topic and details about the relationship between the efficiency of MH algorithms and their average acceptance probabilities or acceptance rates)¹.

¹Notice that this measure of complexity of MCMC algorithms does not take into account the computational costs related to generating candidate states and evaluating their Metropolis-Hastings acceptance probabilities, which typically also scale at least linearly with the problem dimension N.

Finally, it is worth mentioning that despite the generality of this approach, there are some specific models for which conventional MH sampling is not possible because the computation of ρ^(t) is intractable (e.g., when π involves an intractable function of x, such as the partition function of the Potts-Markov random field). This issue has received a lot of attention in the recent MCMC literature, and there are now several variations of the MH construction for intractable models [8, 14–16].

A. Random walk Metropolis-Hastings algorithms

The so-called random walk Metropolis-Hastings (RWMH) algorithm is based on proposals of the form x∗ = x^(t−1) + w, where typically w ∼ N(0, Σ) for some positive-definite covariance matrix Σ [4, Ch. 7.5]. This algorithm is one of the most widely used MCMC methods, perhaps because it has very robust convergence properties and a relatively low computational cost per iteration. It can be shown that the RWMH algorithm is geometrically ergodic under mild conditions on π [17]. Geometric ergodicity is important because it guarantees a central limit theorem for the chains, and therefore that the samples can be used for Monte Carlo integration. However, the myopic nature of the random walk proposal means that the algorithm often requires a large number of iterations to explore the parameter space, and will tend to generate samples that are highly correlated, particularly if the dimension N is large (the performance of this algorithm generally deteriorates at rate O(N), which is worse than other more advanced stochastic simulation MH algorithms [18]). This drawback can be partially mitigated by adjusting the proposal matrix Σ to approximate the covariance structure of π, and some adaptive versions of RWMH perform this adaptation automatically. For sufficiently smooth target densities, performance is further optimized by scaling Σ to achieve an acceptance probability of approximately 20% − 25% [18].
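In practice, the criterion (4) can be estimated numerically from a simulated chain (e.g., the RWMH chain above) by truncating the sum at a finite lag. The sketch below is a minimal estimator of τ_g and the resulting effective sample size; the truncation rule used here (stop at the first negative autocorrelation) is a common heuristic rather than something prescribed by the theory.

```python
import numpy as np

def autocorr_time(g_chain, max_lag=None):
    """Truncated estimate of the integrated autocorrelation time (4)
    for a scalar summary statistic evaluated along the chain."""
    g = np.asarray(g_chain, dtype=float)
    g = g - g.mean()
    T = g.size
    max_lag = T // 10 if max_lag is None else max_lag
    c0 = np.dot(g, g) / T                      # lag-0 autocovariance
    tau = 1.0
    for lag in range(1, max_lag):
        rho = np.dot(g[:-lag], g[lag:]) / ((T - lag) * c0)
        if rho < 0:        # heuristic: stop at the first negative lag
            break
        tau += 2.0 * rho
    return tau

# Example usage on the chain from the previous sketch, with g(x) = x_1:
#   tau = autocorr_time(chain[:, 0])
#   ess = chain.shape[0] / tau   # effective number of independent samples
```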
B. Metropolis adjusted Langevin algorithms

The Metropolis adjusted Langevin algorithm (MALA) is an advanced MH algorithm inspired by the Langevin diffusion process on R^N, defined as the solution to the stochastic differential equation [19]

  dX(t) = (1/2) ∇ log π(X(t)) dt + dW(t),  X(0) = x_0   (5)

where W is the Brownian motion process on R^N and x_0 ∈ R^N denotes some initial condition. Under appropriate stability conditions, X(t) converges in distribution to π as t → ∞, and is therefore potentially interesting for drawing samples from π. Since direct simulation from X(t) is only possible in very specific cases, we consider a discrete-time forward Euler approximation to (5) given by

  X^(t+1) ∼ N( X^(t) + (δ/2) ∇ log π(X^(t)), δ I_N )   (6)

where the parameter δ controls the discrete-time increment. Under certain conditions on π and δ, (6) produces a good approximation of X(t) and converges to a stationary density which is close to π. In MALA this approximation error is corrected by introducing an MH accept-reject step that guarantees convergence to the correct target density π. The resulting algorithm is equivalent to Algo. 1 above, with proposal

  q(x∗ | x^(t−1)) = (2πδ)^(−N/2) exp( −‖x∗ − x^(t−1) − (δ/2) ∇ log π(x^(t−1))‖² / (2δ) ).   (7)

By analyzing the proposal (7) we notice that, in addition to the Langevin interpretation, MALA can also be interpreted as an MH algorithm that, at each iteration t, draws a candidate from a local quadratic approximation to log π around x^(t−1), with δ⁻¹ I_N as an approximation to the Hessian matrix.

In addition, the MALA proposal can also be defined using a matrix-valued time step Σ. This modification is related to redefining (6) in an Euclidean space with the inner product ⟨w, Σ⁻¹x⟩ [20]. Again, the matrix Σ should capture the correlation structure of π to improve efficiency. For example, Σ can be the spectrally-positive version of the inverse Hessian matrix of log π [21], or the inverse Fisher information matrix of the statistical observation model [20]. Note that, in a similar fashion to preconditioning in optimization, using the exact full Hessian or Fisher information matrix is often too computationally expensive in high-dimensional settings and more efficient representations must be used instead. Alternatively, adaptive versions of MALA can learn a representation of the covariance structure online [22]. For sufficiently smooth target densities, MALA's performance can be further optimized by scaling Σ (or δ) to achieve an acceptance probability of approximately 50% − 60% [23].

Finally, there has been significant empirical evidence that MALA can be very efficient for some models, particularly in high-dimensional settings and when the cost of computing the gradient ∇ log π(x) is low. Theoretically, for sufficiently smooth π, the complexity of MALA scales at rate O(N^(1/3)) [23], comparing very favorably to the O(N) rate of RWMH algorithms. However, the convergence properties of the conventional MALA are not as robust as those of the RWMH algorithm. In particular, MALA may fail to converge if the tails of π are super-Gaussian or heavy-tailed, or if δ is chosen too large [19]. Similarly, MALA might also perform poorly if π is not sufficiently smooth, or multi-modal. Improving MALA's convergence properties is an active research topic. Many limitations of the original MALA algorithm can now be avoided by using more advanced versions [20, 24–27].
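For intuition, here is a minimal MALA sketch instantiating the Euler step (6) as an MH proposal corrected via (7). The user supplies the log-target and its gradient; the Gaussian example target and the value of δ are illustrative assumptions.

```python
import numpy as np

def mala(log_target, grad_log_target, x0, n_iter, delta=0.1, rng=None):
    """Metropolis adjusted Langevin algorithm: the Euler step (6) is
    used as an MH proposal and corrected with the ratio built from (7)."""
    rng = np.random.default_rng() if rng is None else rng

    def log_q(xp, x):   # log of the Gaussian proposal density (7), up to consts
        m = x + 0.5 * delta * grad_log_target(x)
        return -np.sum((xp - m) ** 2) / (2.0 * delta)

    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for t in range(n_iter):
        mean = x + 0.5 * delta * grad_log_target(x)        # drift step
        xp = mean + np.sqrt(delta) * rng.standard_normal(x.size)
        log_rho = (log_target(xp) - log_target(x)
                   + log_q(x, xp) - log_q(xp, x))          # MH correction
        if np.log(rng.uniform()) <= log_rho:
            x = xp
        samples[t] = x
    return samples

# Illustrative use on a standard Gaussian target:
# s = mala(lambda x: -0.5*np.sum(x**2), lambda x: -x, np.zeros(10), 5000)
```

In practice δ (or Σ) would be tuned toward the 50%–60% acceptance range discussed above.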
C. Hamiltonian Monte Carlo

The Hamiltonian Monte Carlo (HMC) method is a very elegant and successful instance of an MH algorithm based on auxiliary variables [28]. Let w ∈ R^N, Σ ∈ R^(N×N) positive definite, and consider the augmented target density π(x, w) ∝ π(x) exp(−(1/2) wᵀΣ⁻¹w), which admits the desired target density π(x) as marginal. The HMC method is based on the observation that the trajectories defined by the so-called Hamiltonian dynamics preserve the level sets of π(x, w). A point (x_0, w_0) ∈ R^(2N) that evolves according to the differential equations (8) during some simulation time period (0, t]

  dx/dt = −∇_w log π(x, w) = Σ⁻¹ w
  dw/dt = ∇_x log π(x, w) = ∇_x log π(x)   (8)

yields a point (x_t, w_t) that verifies π(x_t, w_t) = π(x_0, w_0). In MCMC terms, the deterministic proposal (8) has π(x, w) as invariant distribution. Exploiting this property for stochastic simulation, the HMC algorithm combines (8) with a stochastic sampling step, w ∼ N(0, Σ), that also has invariant distribution π(x, w), and that will produce an ergodic chain. Finally, as with the Langevin diffusion (5), it is generally not possible to solve the Hamiltonian equations (8) exactly. Instead, we use a leap-frog approximation detailed in [28]

  w^(t+δ/2) = w^(t) + (δ/2) ∇_x log π(x^(t))
  x^(t+δ) = x^(t) + δ Σ⁻¹ w^(t+δ/2)   (9)
  w^(t+δ) = w^(t+δ/2) + (δ/2) ∇_x log π(x^(t+δ))

where again the parameter δ controls the discrete-time increment. The approximation error introduced by (9) is then corrected with an MH step targeting π(x, w). This algorithm is summarized in Algo. 2 below (see [28] for details about the derivation of the acceptance probability).

Algorithm 2 Hamiltonian Monte Carlo (with leap-frog)
  Set an initial state x^(0), δ > 0, and L ∈ N.
  for t = 1 to T do
    Refresh the auxiliary variable w ∼ N(0, Σ).
    Generate a candidate (x∗, w∗) by propagating the current state (x^(t−1), w) with L leap-frog steps of length δ defined in (9).
    Compute the acceptance probability
      ρ^(t) = min{ 1, π(x∗, w∗) / π(x^(t−1), w) }.
    Generate u_t ∼ U(0, 1).
    if u_t ≤ ρ^(t) then
      Accept the candidate and set x^(t) = x∗.
    else
      Reject the candidate and set x^(t) = x^(t−1).
    end if
  end for

Note that to obtain samples from the marginal π(x) it is sufficient to project the augmented samples (x^(t), w^(t)) onto the original space of x (i.e., by discarding the auxiliary variables w^(t)). It is also worth mentioning that under some regularity condition on π(x), the leap-frog approximation (9) is time-reversible and volume-preserving, and that these properties are key to the validity of the HMC algorithm [28].

Finally, there has been a large body of empirical evidence supporting HMC, particularly for high-dimensional models. Unfortunately, its theoretical convergence properties are much less well understood [29]. It has been recently established that for certain target densities the complexity of HMC scales at rate O(N^(1/4)), comparing favorably with MALA's rate O(N^(1/3)) [30]. However, it has also been observed that, as with MALA, HMC may fail to converge if the tails of π are super-Gaussian or heavy-tailed, or if δ is chosen too large. HMC may also perform poorly if π is multi-modal, or not sufficiently smooth.

Of course, the practical performance of HMC also depends strongly on the algorithm parameters Σ, L and δ [29]. The covariance matrix Σ should be designed to model the correlation structure of π(x), which can be determined by performing pilot runs, or alternatively by using the strategies described in [20, 21, 31]. The parameters δ and L should be adjusted to obtain an average acceptance probability of approximately 60% − 70% [30]. Again, this can be achieved by performing pilot runs, or by using an adaptive HMC algorithm that adjusts these parameters automatically [32, 33].
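A bare-bones Python sketch of Algo. 2 with the simplifying assumption Σ = I_N is given below; the leap-frog loop is a direct transcription of (9), and the MH step targets the augmented density π(x, w). Values of δ and L are illustrative and would normally be tuned toward the 60%–70% acceptance range just mentioned.

```python
import numpy as np

def hmc(log_target, grad_log_target, x0, n_iter, delta=0.1, L=20, rng=None):
    """Hamiltonian Monte Carlo (Algo. 2) with Sigma = I and the
    leap-frog integrator (9)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for t in range(n_iter):
        w = rng.standard_normal(x.size)          # refresh momentum
        xp, wp = x.copy(), w.copy()
        for _ in range(L):                       # L leap-frog steps (9)
            wp += 0.5 * delta * grad_log_target(xp)
            xp += delta * wp
            wp += 0.5 * delta * grad_log_target(xp)
        # MH correction on the augmented target pi(x, w)
        log_rho = (log_target(xp) - 0.5 * np.dot(wp, wp)
                   - log_target(x) + 0.5 * np.dot(w, w))
        if np.log(rng.uniform()) <= log_rho:
            x = xp
        samples[t] = x                           # discard momentum (marginal)
    return samples
```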

D. Gibbs sampling

The Gibbs sampler (GS) is another very widely used MH algorithm which operates by updating the elements of x individually, or by groups, using the appropriate conditional distributions [4, Ch. 10]. This divide-and-conquer strategy often leads to important efficiency gains, particularly if the conditional densities involved are "simple", in the sense that it is computationally straightforward to draw samples from them. For illustration, suppose that we split the elements of x in three groups x = (x1, x2, x3), and that by doing so we obtain three conditional densities π(x1|x2, x3), π(x2|x1, x3), and π(x3|x1, x2) that are "simple" to sample. Using this decomposition, the GS targeting π proceeds as in Algo. 3.

Algorithm 3 Gibbs sampler algorithm
  Set an initial state x^(0) = (x1^(0), x2^(0), x3^(0))
  for t = 1 to T do
    Generate x1^(t) ∼ π( x1 | x2^(t−1), x3^(t−1) )
    Generate x2^(t) ∼ π( x2 | x1^(t), x3^(t−1) )
    Generate x3^(t) ∼ π( x3 | x1^(t), x2^(t) )
  end for

Somewhat surprisingly, the Markov kernel resulting from concatenating the component-wise kernels admits π(x) as joint invariant distribution, and thus the chain produced by Algo. 3 has the desired target density (see [4, Ch. 10] for a review of the theory behind this algorithm). This fundamental property holds even if the simulation from the conditionals is done by using other MCMC algorithms (e.g., RWMH, MALA or HMC steps targeting the conditional densities), though this may result in a deterioration of the algorithm convergence rate. Similarly, the property also holds if the frequency and order of the updates is scheduled randomly and adaptively to improve the overall performance.

As with other MH algorithms, the performance of the GS depends on the correlation structure of π. Efficient samplers seek to update simultaneously the elements of x that are highly correlated with each other, and to update "slow-moving" elements more frequently. The structure of π can be determined by pilot runs, or alternatively by using an adaptive GS that learns it online and that adapts the updating schedule accordingly as described in [34]. However, updating elements in parallel often involves simulating from more complicated conditional distributions, and thus introduces a computational overhead. Finally, it is worth noting that the GS is very useful for SP models, which typically have sparse conditional independence structures (e.g., Markovian properties) and conjugate priors and hyper-priors from the exponential family. This often leads to simple one-dimensional conditional distributions that can be updated in parallel by groups [16, 35].
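As a toy illustration of Algo. 3, the following sketch samples a correlated bivariate Gaussian by alternating between its two exact scalar conditionals; the target (zero mean, unit variances, correlation ρ) is an illustrative assumption chosen so that the conditionals have closed form.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_iter, rng=None):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit
    variances and correlation rho, using the exact conditionals
    x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()
        x2 = rho * x1 + s * rng.standard_normal()
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.95, n_iter=10000)
print(np.corrcoef(samples.T))
```

With ρ close to 1 this chain moves very slowly, illustrating why strongly correlated elements should be updated jointly or marginalized out, as in the partially collapsed sampler discussed next.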

E. Partially collapsed Gibbs sampling

The partially collapsed Gibbs sampler (PCGS) is a recent development in MCMC theory that seeks to address some of the limitations of the conventional GS [36]. As mentioned previously, the GS performs poorly if strongly correlated elements of x are updated separately, as this leads to chains that are highly correlated and to an inefficient exploration of the parameter space. However, updating several elements together can also be computationally expensive, particularly if it requires simulating from difficult conditional distributions. In collapsed samplers, this drawback is addressed by carefully replacing one or several conditional densities by partially collapsed, or marginalized conditional distributions.

For illustration, suppose that in our previous example the subvectors x1 and x2 exhibit strong dependencies, and that as a result the GS of Algo. 3 performs poorly. Assume that we are able to draw samples from the marginalized conditional density π(x1|x3) = ∫ π(x1, x2|x3) dx2, which does not depend on x2. This leads to the PCGS described in Algo. 4 to sample from π, which "partially collapses" Algo. 3 by replacing π(x1|x2, x3) with π(x1|x3).

Algorithm 4 Partially collapsed Gibbs sampler
  Set an initial state x^(0) = (x1^(0), x2^(0), x3^(0))
  for t = 1 to T do
    Generate x1^(t) ∼ π( x1 | x3^(t−1) )
    Generate x2^(t) ∼ π( x2 | x1^(t), x3^(t−1) )
    Generate x3^(t) ∼ π( x3 | x1^(t), x2^(t) )
  end for

Van Dyk and Park [36] established that the PCGS is always at least as efficient as the conventional GS, and it has been observed that the PCGS is remarkably efficient for some statistical models [37, 38]. Unfortunately, PCGSs are not as widely applicable as GSs because they require simulating exactly from the partially collapsed conditional distributions. In general, using MCMC simulation (e.g., MH steps) within a PCGS will lead to an incorrect MCMC algorithm [39]. Similarly, altering the order of the updates (e.g., by permuting the simulation of x1 and x2 in Algo. 4) may also alter the target density [36].

III. SURROGATES FOR STOCHASTIC SIMULATION

A. Variational Bayes

In the variational Bayes (VB) approach described in [40, 41], the true posterior p(x|y) is approximated by a density q⋆(x) ∈ Q, where Q is a subset of valid densities on x. In particular,

  q⋆(x) = arg min_{q∈Q} D( q(x) ‖ p(x|y) )   (10)

where D(q‖p) denotes the Kullback-Leibler (KL) divergence between p and q. As a result of the optimization in (10) over a function, this is termed "variational Bayes" because of the relation to the calculus of variations [42]. Recalling that D(q‖p) reaches its minimum value of zero if and only if p = q [43], we see that q⋆(x) = p(x|y) when Q includes all valid densities on x. However, the premise is that p(x|y) is too difficult to work with, and so Q is chosen as a balance between fidelity and tractability.

Note that the use of D(q‖p), rather than D(p‖q), implies a search for a q⋆ that agrees with the true posterior p(x|y) over the set of x where p(x|y) is large. We will revisit this choice when discussing expectation propagation in Section III-E.

Rather than working with the KL divergence directly, it is common to decompose it as follows

  D( q(x) ‖ p(x|y) ) = ∫ q(x) ln [ q(x) / p(x|y) ] dx = ln Z + F(q)   (11)

where

  F(q) ≜ ∫ q(x) ln [ q(x) / p(x, y) ] dx   (12)

is known as the Gibbs free energy or variational free energy. Rearranging (11), we see that

  −ln Z = F(q) − D( q(x) ‖ p(x|y) ) ≤ F(q)   (13)

as a consequence of D( q(x) ‖ p(x|y) ) ≥ 0. Thus, F(q) can be interpreted as an upper bound on the negative log partition. Also, because ln Z is invariant to q, the optimization (10) can be rewritten as

  q⋆(x) = arg min_{q∈Q} F(q),   (14)

which avoids the difficult integral in (2). In the sequel, we will discuss several strategies to solve the variational optimization problem (14).

B. The mean-field approximation

A common choice of Q is the set of fully factorizable densities, resulting in the mean-field approximation [44, 45]

  q(x) = ∏_{j=1}^N q_j(x_j).   (15)

Substituting (15) into (12) yields the mean-field free energy

  F_MF(q) ≜ ∫ [ ∏_{j=1}^N q_j(x_j) ] ln [ 1 / p(x, y) ] dx − Σ_{j=1}^N h(q_j)   (16)

where h(q_j) ≜ −∫ q_j(x_j) ln q_j(x_j) dx_j denotes the differential entropy. Furthermore, for j = 1,...,N, equation (16) can be written as

  F_MF(q) = D( q_j(x_j) ‖ g_j(x_j, y) ) − Σ_{i≠j} h(q_i)   (17)

  g_j(x_j, y) ≜ exp( ∫ [ ∏_{i≠j} q_i(x_i) ] ln p(x, y) dx_{\j} )   (18)

where x_{\j} ≜ [x_1,...,x_{j−1}, x_{j+1},...,x_N]ᵀ for j = 1,...,N. Equation (17) implies the optimality condition

  q_{j,⋆}(x_j) = g_{j,⋆}(x_j, y) / ∫ g_{j,⋆}(x_j′, y) dx_j′   ∀j = 1,...,N   (19)

where q⋆(x) = ∏_{j=1}^N q_{j,⋆}(x_j) and where g_{j,⋆}(x_j, y) is defined as in (18) but with q_{i,⋆}(x_i) in place of q_i(x_i). Equation (19) suggests an iterative coordinate-ascent algorithm: update each component q_j(x_j) of q(x) while holding the others fixed. But this requires solving the integral in (18). A solution arises when ∀j the conditionals p(x_j, y|x_{\j}) belong to the same exponential family of distributions [46], i.e.,

  p(x_j, y | x_{\j}) ∝ h(x_j) exp( η(x_{\j}, y)ᵀ t(x_j) )   (20)

where the sufficient statistic t(x_j) parameterizes the family. The exponential family encompasses a broad range of distributions, notably jointly Gaussian and multinomial p(x, y). Plugging p(x, y) = p(x_j, y | x_{\j}) p(x_{\j}, y) and (20) into (18) immediately gives

  g_j(x_j, y) ∝ h(x_j) exp( E{η(x_{\j}, y)}ᵀ t(x_j) )   (21)

where the expectation is taken over x_{\j} ∼ ∏_{i≠j} q_{i,⋆}(x_i). Thus, if each q_j is chosen from the same family, i.e., q_j(x_j) ∝ h(x_j) exp( γ_jᵀ t(x_j) ) ∀j, then (19) reduces to the moment-matching condition

  γ_{j,⋆} = E{ η(x_{\j}, y) }   (22)

where γ_{j,⋆} is the optimal value of γ_j.
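To illustrate the coordinate-ascent interpretation of (19), the sketch below runs the mean-field updates for a toy bivariate Gaussian posterior, a standard textbook example assumed here purely for illustration. For a Gaussian target with precision matrix P, each update (19) has a closed form driven by P, and the fixed point recovers the true means while underestimating the variances.

```python
import numpy as np

# Toy posterior: x ~ N(m, C) in 2D; mean-field VB fits q(x) = q1(x1) q2(x2).
m = np.array([1.0, -1.0])
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])
P = np.linalg.inv(C)                  # precision matrix of the target

mu = np.zeros(2)                      # means of q1, q2 (initialization)
for it in range(50):                  # coordinate-ascent sweeps, cf. (19)
    mu[0] = m[0] - (P[0, 1] / P[0, 0]) * (mu[1] - m[1])
    mu[1] = m[1] - (P[1, 0] / P[1, 1]) * (mu[0] - m[0])

var_q = 1.0 / np.diag(P)              # mean-field variances
print(mu)                 # converges to the true mean [1, -1]
print(var_q, np.diag(C))  # mean-field variances underestimate the true ones
```

The variance underestimation seen here is a well-known consequence of minimizing D(q‖p) with a fully factorized q.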

C. The Bethe approach

In many cases, the fully factored model (15) yields too gross of an approximation. As an alternative, one might try to fit a model q(x) that has a similar dependency structure as p(x|y). In the sequel, we assume that the true posterior factors as

  p(x|y) = Z⁻¹ ∏_α f_α(x_α)   (23)

where x_α are subvectors of x (sometimes called cliques or outer clusters) and f_α are non-negative potential functions. Note that the factorization (23) defines a Gibbs random field when p(x|y) > 0. When a collection of variables {x_n} always appears together in a factor, we can collect them into x_β, an inner cluster, although it is not necessary to do so. For simplicity we will assume that these x_β are non-overlapping (i.e., x_β ∩ x_β′ = ∅ ∀β ≠ β′), so that {x_β} represents a partition of x. The factorization (23) can then be drawn as a factor graph to help visualize the structure of the posterior, as in Figure 1.

[Figure 1: a factor graph with factor nodes f_a, f_b, f_c (boxes) connected to variable nodes x_1, x_2, x_3, x_4 (circles/ovals).]

Fig. 1. An example of a factor graph, which is a bipartite graph consisting of variable nodes (circles/ovals) and factor nodes (boxes). In this example, x_a = {x_1, x_2}, x_b = {x_1, x_3, x_4}, x_c = {x_2, x_3, x_4}. There are several choices for the inner clusters x_β. One is the full factorization x_1 = x_1, x_2 = x_2, x_3 = x_3, and x_4 = x_4. Another is the partial factorization x_1 = x_1, x_2 = x_2, and x_3 = {x_3, x_4}, which results in the "super node" in the dashed oval. Another is no factorization: x_1 = {x_1, x_2, x_3, x_4}, resulting in the "super node" in the dotted oval. In the latter case, we redefine each factor f_α to have the full domain x (with trivial dependencies where needed).

We now seek a tractable way to build an approximation q(x) with the same dependency structure as (23). But rather than designing q(x) as a whole, we design the cluster marginals, {q_α(x_α)} and {q_β(x_β)}, which must be non-negative, normalized, and consistent

  0 ≤ q_α(x_α),  0 ≤ q_β(x_β)   ∀α, β, x_α, x_β   (24)

  1 = ∫ q_α(x_α) dx_α,  1 = ∫ q_β(x_β) dx_β   ∀α, β   (25)

  q_β(x_β) = ∫ q_α(x_α) dx_{α\β}   ∀α, β ∈ N_α, x_β   (26)

where x_{α\β} gathers the components of x that are contained in the cluster α and not in the cluster β, and N_α denotes the neighborhood of the factor α (i.e., the set of inner clusters β connected to α).

In general, it is difficult to specify q(x) from its cluster marginals. However, in the special case that the factor graph has a tree structure (i.e., there is at most one path from one node in the graph to another), we have [47]

  q(x) = [ ∏_α q_α(x_α) ] / [ ∏_β q_β(x_β)^(N_β − 1) ]   (27)

where N_β = |N_β| is the neighborhood size of the cluster β. In this case, the free energy (12) simplifies to

  F(q) = Σ_α D( q_α ‖ f_α ) + Σ_β (N_β − 1) h(q_β) + const ≜ F_B( {q_α}, {q_β} )   (28)

where F_B is known as the Bethe free energy (BFE) [47]. Clearly, if the true posterior {f_α} has a tree structure, and no constraints beyond (24)-(26) are placed on the cluster marginals {q_α}, {q_β}, then minimization of F_B( {q_α}, {q_β} ) will recover the cluster marginals of the true posterior. But even when {f_α} is not a tree, F_B( {q_α}, {q_β} ) can be used as an approximation of the Gibbs free energy F(q), and minimizing F_B can be interpreted as designing a q that locally matches the true posterior.

The remaining question is how to efficiently minimize F_B( {q_α}, {q_β} ) subject to the (linear) constraints (24)-(26). Complicating matters is the fact that F_B( {q_α}, {q_β} ) is the sum of convex KL divergences and concave entropies. One option is direct minimization using a "double loop" approach like the concave-convex procedure (CCCP) [48], where the outer loop linearizes the concave term about the current estimate and the inner loop solves the resulting convex optimization problem (typically with an iterative technique). Another option is belief propagation, which is described below.

D. Belief propagation

Belief propagation (BP) [49, 50] is an algorithm for computing (or approximating) marginal probability density functions (pdfs) like q_β(x_β) and q_α(x_α)² by propagating messages on a factor graph. The standard form of BP is given by the sum-product algorithm (SPA) [51], which computes the following messages from each factor node f_α to each variable (super) node x_β and vice versa

  m_{α→β}(x_β) ← ∫ f_α(x_α) ∏_{β′∈N_α\β} m_{β′→α}(x_β′) dx_{α\β}   (29)

  m_{β→α}(x_β) ← ∏_{α′∈N_β\α} m_{α′→β}(x_β).   (30)

These messages are then used to compute the beliefs

  q_β(x_β) ∝ ∏_{α∈N_β} m_{α→β}(x_β)   (31)

  q_α(x_α) ∝ f_α(x_α) ∏_{β∈N_α} m_{β→α}(x_β)   (32)

which must be normalized in accordance with (25). The messages (29)-(30) do not need to be normalized, although it is often done in practice to prevent numerical overflow.

²Note that another form of BP exists to compute the maximum a posteriori (MAP) estimate arg max_x p(x|y), known as the "max-product" or "min-sum" algorithm [50]. However, this approach does not address the problem of computing surrogates for stochastic methods, and so is not discussed further.

When the factor graph {f_α} has a tree structure, the BP-computed marginals coincide with the true marginals after one round of forward and backward message passes. Thus, BP on a tree-graph is sometimes referred to as the forward-backward algorithm, particularly in the context of hidden Markov models [52]. In the tree case, BP is akin to a dynamic programming algorithm that organizes the computations needed for marginal evaluation in a tractable manner.

When the factor graph {f_α} has cycles or "loops," BP can still be applied by iterating the message computations (29)-(30) until convergence (not guaranteed), which is known as loopy BP (LBP). However, the corresponding beliefs (31)-(32) are in general only approximations of the true marginals. This suboptimality is expected because exact marginal computation on a loopy graph is an NP-hard problem [53]. Still, the answers computed by LBP are in many cases very accurate [54]. For example, LBP methods have been successfully applied to communication and SP problems such as: turbo decoding [55], LDPC decoding [49, 56], inference on Markov random fields [57], multiuser detection [58], and compressive sensing [59, 60].
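The sketch below applies the SPA (29)-(32) to a small chain-structured factor graph with discrete variables, where messages are finite vectors and the beliefs returned after one forward and one backward pass are exact. The chain length, state-space size, and random potentials are arbitrary choices made for illustration; a brute-force marginalization is included as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 3                                # 5 variables with 3 states each
unary = rng.random((n, K)) + 0.1           # node potentials
pair = [rng.random((K, K)) + 0.1 for _ in range(n - 1)]  # pairwise factors

# Forward messages m_f[i]: from factor f_{i-1,i} into variable i, cf. (29)
m_f = [np.ones(K) for _ in range(n)]
for i in range(1, n):
    m_f[i] = pair[i - 1].T @ (unary[i - 1] * m_f[i - 1])
# Backward messages m_b[i]: from factor f_{i,i+1} into variable i
m_b = [np.ones(K) for _ in range(n)]
for i in range(n - 2, -1, -1):
    m_b[i] = pair[i] @ (unary[i + 1] * m_b[i + 1])

# Beliefs (31): product of incoming messages, normalized per (25)
beliefs = unary * np.array(m_f) * np.array(m_b)
beliefs /= beliefs.sum(axis=1, keepdims=True)

# Sanity check against brute-force marginalization of the joint
joint = np.ones([K] * n)
for i in range(n):
    shape = [1] * n; shape[i] = K
    joint = joint * unary[i].reshape(shape)
for i in range(n - 1):
    shape = [1] * n; shape[i] = K; shape[i + 1] = K
    joint = joint * pair[i].reshape(shape)
marg0 = joint.sum(axis=tuple(range(1, n))); marg0 /= marg0.sum()
print(np.allclose(beliefs[0], marg0))      # True: BP is exact on a tree
```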
Although the excellent performance of LBP was at first a mystery, it was later established that LBP minimizes the constrained BFE. More precisely, the fixed points of LBP coincide with the stationary points of F_B( {q_α}, {q_β} ) from (28) under the constraints (24)-(26) [47]. The link between LBP and BFE can be established through the Lagrangian formalism, which converts constrained BFE minimization to an unconstrained minimization through the incorporation of additional variables known as Lagrange multipliers [61]. By setting the derivatives of the Lagrangian to zero, one obtains a set of equations that are equivalent to the message updates (29)-(30) [47]. In particular, the stationary-point versions of the Lagrange multipliers equal the fixed-point versions of the loopy SPA log-messages.

Note that, unlike the mean-field approach (15), the cluster-based nature of LBP does not facilitate an explicit description of the joint-posterior approximation q ∈ Q from (10). The reason is that, when the factor graph is loopy, there is no straightforward relationship between the joint posterior q(x) and the cluster marginals {q_α(x_α)}, {q_β(x_β)}, as explained before (27). Instead, it is better to interpret LBP as an efficient implementation of the Bethe approach from Section III-C, which aims for a local approximation of the true posterior.

In summary, by constructing a factor graph with low-dimensional {x_β} and applying BP or LBP, we trade the high-dimensional integral q_β(x_β) = ∫ p(x|y) dx_{\β} for a sequence of low-dimensional message computations (29)-(30). But (29)-(30) are themselves tractable only for a few families of {f_α}. Typically, {f_α} are limited to members of the exponential family closed under marginalization (see [62]), so that the updates of the message pdfs (29)-(30) reduce to updates of the natural parameters (i.e., η in (20)).
The two most common It can then be shown [62] that (36)-(38) reduce to instances are multivariate Gaussian pdfs and multinomial new T γ ← arg max γ Eb\α {tβ(xβ)}− cβ(γ) (44) probability mass functions (pmfs). For both of these cases, β γ q when LBP converges, it tends to be much faster than double- for all β ∈ Nα, which can be interpreted as the moment match- loop algorithms like CCCP (see, e.g., [63]). However, LBP ing condition Eqnew {tβ(xβ)} = Eqb\α {tβ(xβ)}. Furthermore, does not always converge [54]. (39) reduces to new new µα,β ← γβ − µα′,β (45) E. Expectation propagation ′ N α ∈Xβ \α Expectation propagation (EP) [64] (see also the overviews N new for all β ∈ α. Finally, (40) and (41) reduce to γβ ← γβ [62, 65]) is an of approximate inference that is new and µ ← µ , respectively, for all β ∈ Nα. reminiscent of LBP but has much more flexibility with regards α,β α,β Interestingly, in the case that the true factors {fα} are mem- to the modeling distributions. In EP, the true posterior p(x|y), bers of an exponential family closed under marginalization, the which is assumed to factorize as in (23), is approximated by version of EP described above is equivalent to the SPA up to q(x) such that a change in message schedule. In particular, for each given factor node , the SPA updates the outgoing message towards q(x) ∝ m (x ) (33) α α α one variable node per iteration, whereas EP simultaneously α β Y updates the outgoing messages in all directions, resulting in where xα are the same as in (23) and mα are referred to as new mα (see, e.g., [41]). By restricting the optimization in (39) “site approximations.” Although no constraints are imposed new to a single factor qβ , EP can be made equivalent to the SPA. on the true-posterior factors {fα}, the approximation q(x) is On the other hand, for generic factors {fα}, EP can be viewed restricted to a factorized exponential family. In particular, as a tractable approximation of the (intractable) SPA. Although the above form of EP iterates serially through q(x)= qβ(xβ) (34) β the factor nodes α, it is also possible to perform the updates Y in parallel, resulting in what is known as the expectation q (x ) = exp γT t (x ) − c (γ ) , ∀β, (35) β β β β β β β consistent (EC) approximation algorithm [66]. with some given base measure. We note that our description EP and EC have an interesting BFE interpretation. Whereas of EP applies to arbitrary partitions {xβ}, from the trivial the fixed points of LBP coincide with the stationary points of partition xβ = x to the full partition xβ = xβ. the BFE (28) subject to (24)-(25) and strong consistency (26), The EP algorithm iterates the following updates over all the fixed points of EP and EC coincide with the stationary factors α until convergence (not guaranteed) points of the BFE (28) subject to (24)-(25) and the weak consistency (i.e., moment-matching) constraint [67] q\α(x) ← q(x)/m (x ) (36) α α N \α \α Eqb\α {t(xβ)} =Eqβ {t(xβ)}, ∀α, β ∈ α. (46) q (x) ← q (x)fα(xα) (37) EP, like LBP, is not guaranteed to converge. Hence, provably qnew(x) ← arg min D q\α(x) q(x) (38) q∈Q convergence double-loop algorithms have been proposed that b mnew(x ) ← qnew(x)/q\α(x)  (39) directly minimize the weakly constrained BFE, e.g., [67]. α α b new q(x) ← q (x) (40) F. Approximate message passing new mα(xα) ← mα (xα) (41) So-called approximate message passing (AMP) algorithms where Q in (38) refers to the set of q(x) obeying (34)-(35). 
F. Approximate message passing

So-called approximate message passing (AMP) algorithms [59, 60] have recently been developed for the separable generalized linear model

  p(x) = ∏_{j=1}^N p_x(x_j),   p(y|x) = ∏_{m=1}^M p_{y|z}(y_m | a_mᵀ x)   (47)

where the prior p(x) is fully factorizable, as is the conditional pdf p(y|z) relating the observation vector y to the (hidden) transform output vector z ≜ Ax, where A ≜ [a_1, ..., a_M]ᵀ ∈ R^(M×N) is a known linear transform. Like EP, AMP allows tractable inference under generic p_x and p_{y|z}.³

³More precisely, the AMP algorithm [59] handles Gaussian p_{y|z} while the generalized AMP (GAMP) algorithm [60] handles arbitrary p_{y|z}.

AMP can be derived as an approximation of LBP on the factor graph constructed with inner clusters x_β = x_β for β = 1,...,N, with outer clusters x_α = x for α = 1,...,M and x_α = x_{α−M} for α = M+1,...,M+N, and with factors

  f_α(x_α) = { p_{y|z}(y_α | a_αᵀ x)   for α = 1,...,M
             { p_x(x_{α−M})           for α = M+1,...,M+N.   (48)

In the large-system limit (LSL), i.e., M, N → ∞ for fixed ratio M/N, the LBP beliefs q_β(x_β) simplify to

  q_β(x_β) ∝ p_x(x_β) N(x_β; r̂_β, τ)   (49)

where {r̂_β}_{β=1}^N and τ are iteratively updated parameters. Similarly, for m = 1,...,M, the belief on z_m, denoted by q_{z,m}(·), simplifies to

  q_{z,m}(z_m) ∝ p_{y|z}(y_m | z_m) N(z_m; p̂_m, ν)   (50)

where {p̂_m}_{m=1}^M and ν are iteratively updated parameters. Each AMP iteration requires only one evaluation of the mean and variance of (49)-(50), one multiplication by A and Aᵀ, and relatively few iterations, making it very computationally efficient, especially when these multiplications have fast implementations (e.g., using fast Fourier transforms and discrete wavelet transforms).
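For the special case of a Gaussian likelihood and a sparsity-promoting soft-thresholding denoiser, the AMP iteration takes the well-known form sketched below. This is a bare-bones illustration (the threshold schedule and stopping rule are simplistic assumptions); its distinguishing feature relative to plain iterative thresholding is the Onsager correction term added to the residual, which is what yields the Gaussian effective-noise behavior of (49) in the large-system limit.

```python
import numpy as np

def soft(v, lam):
    """Soft-thresholding denoiser (MAP estimate under a Laplace prior)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def amp_lasso(A, y, lam, n_iter=30):
    """AMP for y = Ax + noise with a soft-thresholding denoiser."""
    M, N = A.shape
    x = np.zeros(N)
    z = y.copy()
    for _ in range(n_iter):
        sigma = np.linalg.norm(z) / np.sqrt(M)     # effective noise level
        r = x + A.T @ z                            # pseudo-data, cf. (49)
        x_new = soft(r, lam * sigma)
        onsager = (np.count_nonzero(x_new) / M) * z
        z = y - A @ x_new + onsager                # corrected residual
        x = x_new
    return x

# Illustrative sparse-recovery run with an i.i.d. Gaussian A.
rng = np.random.default_rng(1)
M, N, k = 250, 500, 20
A = rng.standard_normal((M, N)) / np.sqrt(M)
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = A @ x0 + 0.01 * rng.standard_normal(M)
x_hat = amp_lasso(A, y, lam=2.0)
print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))
```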
In the LSL under i.i.d. sub-Gaussian A, AMP is fully characterized by a scalar state evolution (SE). When this SE has a unique fixed point, the marginal posterior approximations (49)-(50) are known to be exact [68, 69]. For generic A, AMP's fixed points coincide with the stationary points of an LSL version of the BFE [70, 71]. When AMP converges, its posterior approximations are often very accurate (e.g., [72]), but AMP does not always converge. In the special case of Gaussian likelihood p_{y|z} and prior p_x, AMP convergence is fully understood: convergence depends on the ratio of peak-to-average squared singular values of A, and convergence can be guaranteed for any A with appropriate damping [73]. For generic p_x and p_{y|z}, damping greatly helps convergence [74] but theoretical guarantees are lacking. A double-loop algorithm to directly minimize the LSL-BFE was recently proposed and shown to have global convergence for strictly log-concave p_x and p_{y|z} under generic A [75].

IV. OPTIMIZATION METHODS

A. Optimization problem

The Monte Carlo methods described in Section II provide a general approach for estimating reliably posterior probabilities and expectations. However, their high computational cost often makes them unattractive for applications involving very high dimensionality or tight computing time constraints. One alternative strategy is to perform inference approximately by using deterministic surrogates as described in Section III. Unfortunately, these faster inference methods are not as generally applicable, and because they rely on approximations, the resulting inferences can suffer from estimation bias. As already mentioned, if one focuses on the MAP estimator, efficient optimization techniques can be employed, which are often more computationally tractable than MCMC methods and for which strong guarantees of convergence exist. In many SP applications, the computation of the MAP estimator of x can be formulated as an optimization problem having the following form

  minimize_{x∈R^N}  ϕ(Hx, y) + g(Dx)   (51)

where ϕ : R^M × R^M → ]−∞, +∞], g : R^P → ]−∞, +∞], H ∈ R^(M×N), and D ∈ R^(P×N) with P ∈ N∗. For example, H may be a linear operator modeling a degradation of the signal of interest, y a vector of observed data, ϕ a least-squares criterion corresponding to the negative log-likelihood associated with an additive zero-mean white Gaussian noise, g a sparsity promoting measure, e.g., an ℓ1 norm, and D a frame analysis transform or a gradient operator.

Often, ϕ is an additively separable function, i.e.,

  ϕ(z, y) = (1/M) Σ_{i=1}^M ϕ_i(z_i, y_i)   ∀(z, y) ∈ (R^M)²   (52)

where z = [z_1, ..., z_M]ᵀ. Under this condition, the previous optimization problem becomes an instance of the more general stochastic one

  minimize_{x∈R^N}  Φ(x) + g(Dx)   (53)

involving the expectation

  Φ(x) = E{ ϕ_j(h_jᵀ x, y_j) }   (54)

where j, y, and H are now assumed to be random variables and the expectation is computed with respect to the joint distribution of (j, y, H), with h_jᵀ the j-th line of H. More precisely, when (52) holds, (51) is then a special case of (53) with j uniformly distributed over {1,...,M} and (y, H) deterministic. Conversely, it is also possible to consider that j is deterministic and that for every i ∈ {2,...,M}, ϕ_i = ϕ_1, and (y_i, h_i)_{1≤i≤M} are identically distributed random variables. In this second scenario, because of the separability condition (52), the optimization problem (51) can be regarded as a proxy for (53), where the expectation Φ(x) is approximated by a sample estimate under suitable mixing assumptions. All these remarks illustrate the existing connections between problems (51) and (53).

Note that the stochastic optimization problem defined in (53) has been extensively investigated in two communities: machine learning and adaptive filtering, often under quite different practical assumptions on the forms of the functions (ϕ_j)_{j≥1} and g. In machine learning [76–78], x indeed represents the vector of parameters of a classifier which has to be learnt, whereas in adaptive filtering [79, 80], it is generally the impulse response of an unknown filter which needs to be identified and possibly tracked. In order to simplify our presentation, in the rest of this section, we will assume that the functions (ϕ_j)_{j≥1} are convex and Lipschitz-differentiable with respect to their first argument (for example, they may be logistic functions).

B. Optimization algorithms for solving stochastic problems

The main difficulty arising in the resolution of the stochastic optimization problem (53) is that the integral involved in the expectation term often cannot be computed in practice since it is generally high-dimensional and the underlying probability measure is usually unknown. Two main computational approaches have been proposed in the literature to overcome this issue. The first idea is to approximate the expected loss function by using a finite set of observations and to minimize the associated empirical loss (51). The resulting deterministic optimization problem can then be solved by using either deterministic or stochastic algorithms, the latter being the topic of Section IV-C. Here, we focus on the second family of methods grounded in stochastic approximation techniques to handle the expectation in (54). More precisely, a sequence of identically distributed samples (y_j, h_j)_{j≥1} is drawn, which are processed sequentially according to some update rule. The iterative algorithm aims to produce a sequence of random iterates (x_j)_{j≥1} converging to a solution to (53).

We begin with a group of online learning algorithms based on extensions of the well-known stochastic gradient descent (SGD) approach. Then we will turn our attention to stochastic optimization techniques developed in the context of adaptive filtering.

1) Online learning methods based on SGD: Let us assume that an estimate u_j ∈ R^N of the gradient of Φ at x_j is available at each iteration j ≥ 1. A popular strategy for solving (53) in this context leverages the gradient estimates to derive a so-called stochastic forward-backward (SFB) scheme (also sometimes called the stochastic proximal gradient algorithm)

  (∀j ≥ 1)  z_j = prox_{γ_j g∘D}( x_j − γ_j u_j )
            x_{j+1} = (1 − λ_j) x_j + λ_j z_j   (55)

where (γ_j)_{j≥1} is a sequence of positive stepsize values and (λ_j)_{j≥1} is a sequence of relaxation parameters in ]0, 1]. Hereabove, prox_ψ(v) denotes the proximity operator at v ∈ R^N of a lower-semicontinuous convex function ψ : R^N → ]−∞, +∞] with nonempty domain, i.e., the unique minimizer of ψ + (1/2)‖· − v‖² (see [81] and the references therein), and g∘D(x) = g(Dx). A convergence analysis of the SFB scheme has been conducted in [82–85], under various assumptions on the functions Φ, g, on the stepsize sequence, and on the statistical properties of (u_j)_{j≥1}. For example, if x_1 is set to a given (deterministic) value, the sequence (x_j)_{j≥1} generated by (55) is guaranteed to converge almost surely to a solution of Problem (53) under the following technical assumptions [84]:

(i) Φ has a β⁻¹-Lipschitzian gradient with β ∈ ]0, +∞[, g is a lower-semicontinuous convex function, and Φ + g∘D is strongly convex.

(ii) For every j ≥ 1,
  E{‖u_j‖²} < +∞,  E{u_j | X_{j−1}} = ∇Φ(x_j),
  E{‖u_j − ∇Φ(x_j)‖² | X_{j−1}} ≤ σ²(1 + α_j ‖∇Φ(x_j)‖²)
where X_j = (y_i, h_i)_{1≤i≤j}, and α_j and σ are positive values such that γ_j ≤ (2 − ε)β(1 + 2σ²α_j)⁻¹ with ε > 0.

(iii) We have
  Σ_{j≥1} λ_j γ_j = +∞  and  Σ_{j≥1} χ_j² < +∞
where, for every j ≥ 1, χ_j² = λ_j γ_j²(1 + 2α_j ‖∇Φ(x̂)‖²) and x̂ is the solution of Problem (53).

When g ≡ 0, the SFB algorithm in (55) becomes equivalent to SGD [86–89]. According to the above result, the convergence of SGD is ensured as soon as Σ_{j≥1} λ_j γ_j = +∞ and Σ_{j≥1} λ_j γ_j² < +∞. In the unrelaxed case defined by λ_j ≡ 1, we then retrieve a particular case of the decaying condition γ_j ∝ j^(−1/2−δ) with δ ∈ ]0, 1/2] usually imposed on the stepsize in the convergence studies of SGD under slightly different assumptions on the gradient estimates (u_j)_{j≥1} (see [90, 91] for more details). Note also that better convergence properties can be obtained if a Polyak-Ruppert averaging approach is performed, i.e., the averaged sequence (x̄_j)_{j≥1}, defined as x̄_j = (1/j) Σ_{i=1}^j x_i for every j ≥ 1, is considered instead of (x_j)_{j≥1} in the convergence analysis [90, 92].
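As an illustration of (55), the sketch below runs an unrelaxed SFB iteration (λ_j ≡ 1) on a synthetic sparse regression stream, taking g = λ‖·‖₁ and D = I_N so that the proximity operator reduces to soft thresholding. The stepsize decay, regularization weight, and data model are illustrative assumptions.

```python
import numpy as np

def soft(v, t):
    """Proximity operator of t * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sfb(stream, N, lam, gamma0=0.02, n_iter=20000):
    """Stochastic forward-backward (55), unrelaxed (lambda_j = 1):
    x_{j+1} = prox_{gamma_j * lam * ||.||_1}(x_j - gamma_j * u_j),
    where u_j is an unbiased stochastic gradient of Phi at x_j."""
    x = np.zeros(N)
    for j in range(1, n_iter + 1):
        h, y = next(stream)                 # one fresh sample (y_j, h_j)
        u = 2.0 * (h @ x - y) * h           # gradient of (h^T x - y)^2
        gamma = gamma0 / j ** 0.6           # decaying stepsize, cf. (iii)
        x = soft(x - gamma * u, gamma * lam)
    return x

# Synthetic stream: y_j = h_j^T xbar + noise with a sparse xbar.
rng = np.random.default_rng(2)
N = 50
xbar = np.zeros(N); xbar[:5] = 1.0
def data_stream():
    while True:
        h = rng.standard_normal(N)
        yield h, h @ xbar + 0.1 * rng.standard_normal()

x_hat = sfb(data_stream(), N, lam=0.05)
print(np.linalg.norm(x_hat - xbar))
```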
We now comment on approaches related to SFB that have been proposed in the literature to solve (53). It should first be noted that a simple alternative strategy to deal with a possibly nonsmooth term g is to incorporate a subgradient step into the previously mentioned SGD algorithm [93]. However, this approach, like its deterministic version, may suffer from a slow convergence rate [94]. Another family of methods, close to SFB, adopt the regularized dual averaging (RDA) strategy, first introduced in [94]. The principal difference between SFB and RDA methods is that the latter rely on iterative averaging of the stochastic gradient estimates, which consists of replacing in the update rule (55) (u_j)_{j≥1} by (ū_j)_{j≥1} where, for every j ≥ 1, ū_j = (1/j) Σ_{i=1}^j u_i. The advantage is that it provides convergence guarantees for nondecaying stepsize sequences. Finally, the so-called composite mirror descent methods, introduced in [95], can be viewed as extended versions of the SFB algorithm where the proximity operator is computed with respect to a non-Euclidean distance (typically, a Bregman divergence).

In the last few years, a great deal of effort has been made to modify SFB when the proximity operator of g∘D does not have a simple expression, but when g can be split into several terms whose proximity operators are explicit. We can mention the stochastic proximal averaging strategy from [96], the stochastic alternating direction method of multipliers (ADMM) from [97–99] and the alternating block strategy from [100] suited to the case when g∘D is a separable function.

Another active research area addresses the search for strategies to improve the convergence rate of SFB. Two main approaches can be distinguished in the literature.

Another active research area addresses the search for strategies to improve the convergence rate of SFB. Two main approaches can be distinguished in the literature. The first, adopted for example in [83, 101–103], relies on subspace acceleration. In such methods, usually reminiscent of Nesterov's acceleration techniques in the deterministic case, the convergence rate is improved by using information from previous iterates in the construction of the new estimate. Another efficient way to accelerate the convergence of SFB is to incorporate into the update rule second-order information one may have on the cost functions. For instance, the method described in [104] incorporates quasi-Newton metrics into the SFB and RDA algorithms, and the natural gradient method from [105] can be viewed as a preconditioned SGD algorithm. The two strategies can be combined, as for example in [106].

2) Adaptive filtering methods: In adaptive filtering, stochastic gradient-like methods have been quite popular for a long time [107, 108]. In this field, the functions (ϕ_j)_{j≥1} often reduce to a least squares criterion

(∀j ≥ 1)   ϕ_j(h_j^T x, y_j) = (h_j^T x − y_j)²                                        (56)

where x is the unknown impulse response. However, a specific difficulty to be addressed is that the designed algorithms must be able to deal with dynamical problems, in which the optimal solution may be time-varying due to changes in the statistics of the available data. In this context, it may be useful to adopt a multivariate formulation by imposing, at each iteration j ≥ Q,

y_j ≃ H_j x                                        (57)

where y_j = [y_j, …, y_{j−Q+1}]^T, H_j = [h_j, …, h_{j−Q+1}]^T, and Q ≥ 1. This technique, reminiscent of mini-batch procedures in machine learning, constitutes the principle of affine projection algorithms, the purpose of which is to accelerate the convergence speed [109]. Our focus now switches to recent work which aims to impose some sparse structure on the desired solution.

A simple method for imposing sparsity is to introduce a suitable adaptive preconditioning strategy in the stochastic gradient iteration, leading to the so-called proportionate least mean square method [110, 111], which can be combined with affine projection techniques [112, 113]. Similarly to the work already mentioned that has been developed in the machine learning community, a second approach proceeds by minimizing penalized criteria such as (53), where g is a sparsity measure and D = I_N. In [114, 115], zero-attracting algorithms are developed which are based on the stochastic subgradient method (a sketch is given at the end of this subsection). These algorithms have been further extended to affine projection techniques in [116–118]. Proximal methods have also been proposed in the context of adaptive filtering, grounded on the use of the forward-backward algorithm [119], an accelerated version of it [120], or primal-dual approaches [121]. It is interesting to note that proportionate affine projection algorithms can be viewed as special cases of these methods [119]. Other types of algorithms have been proposed which provide extensions of the recursive least squares method, which is known for its fast convergence properties [106, 122, 123]. Instead of minimizing a sparsity promoting criterion, it is also possible to formulate the problem as a feasibility problem where, at iteration j ≥ Q, one searches for a vector x satisfying both sup_{j−Q+1≤i≤j} |y_i − h_i^T x| ≤ η and ‖x‖₁ ≤ ρ, where ‖·‖₁ denotes the (possibly weighted) ℓ₁ norm and (η, ρ) ∈ ]0,+∞[². Over-relaxed projection algorithms allow such kinds of problems to be solved efficiently [124, 125].
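As announced above, the following Python sketch implements iteration (58) for ϕ_i(h_i^T x, y_i) = (h_i^T x − y_i)² and g = λ‖·‖₁ with D = I_N. Storing the gradient table together with its running average is a standard implementation device; all names and parameter choices are our illustrative assumptions.

```python
import numpy as np

def saga_lasso(H, y, lam, gamma, n_iter=5000, seed=0):
    # Sketch of the SAGA iteration (58) with phi_i(h_i^T x, y_i) =
    # (h_i^T x - y_i)^2 and g = lam * ||.||_1, D = I_N.
    rng = np.random.default_rng(seed)
    M, N = H.shape
    x = np.zeros(N)
    # Stored gradients h_i * dphi_i(h_i^T z_i, y_i), initialized at z_i = x_1.
    table = 2.0 * H * (H @ x - y)[:, None]
    avg = table.mean(axis=0)                       # (1/M) sum of stored gradients
    for _ in range(n_iter):
        j = rng.integers(M)                        # j_n ~ uniform{1,...,M}
        g_new = 2.0 * H[j] * (H[j] @ x - y[j])     # gradient of phi_j at x_n
        u = g_new - table[j] + avg                 # gradient estimate u_n of (58)
        avg = avg + (g_new - table[j]) / M         # update the running average
        table[j] = g_new                           # memory update z_{j_n,n+1} = x_n
        v = x - gamma * u                          # forward (gradient) step
        x = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # prox step
    return x
```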

The relationship between Algorithm (58) and other stochastic incremental methods existing in the literature is worthy of comment. The main distinction arises in the way of building the gradient estimates (u_n)_{n≥1}. The standard incremental gradient algorithm, analyzed for instance in [129], relies on simply defining, at iteration n, u_n = h_{j_n} ∇ϕ_{j_n}(h_{j_n}^T x_n, y_{j_n}). However, this approach, while leading to a smaller computational complexity per iteration and to a lower memory requirement, gives rise to suboptimal convergence rates [91, 129], mainly due to the fact that its convergence requires a stepsize sequence (γ_n)_{n≥1} decaying to zero. Motivated by this observation, much recent work [126, 130–134] has been dedicated to the development of fast incremental gradient methods which benefit from the same convergence rates as batch optimization methods while using a randomized incremental approach. A first class of methods relies on a variance reduction approach [130, 132–134] which aims at diminishing the variance of the successive estimates (u_n)_{n≥1}. All of the aforementioned algorithms are based on iterations which are similar to (58). In the stochastic variance reduction gradient method and the semi-stochastic gradient descent method proposed in [133, 134], a full gradient step is made every K iterations, K ≥ 1, so that a single vector z̃_n is used instead of (z_{i,n})_{1≤i≤M} in the update rule (see the sketch at the end of this subsection). This so-called mini-batch strategy leads to a reduced memory requirement at the expense of more gradient evaluations. As pointed out in [132], the choice between one strategy or another may depend on the problem size and on the computer architecture. In the stochastic average gradient (SAG) algorithm from [130], a multiplicative factor 1/M is placed in front of the gradient differences, leading to a lower variance counterbalanced by a bias in the gradient estimates. It should be emphasized that the work in [130, 133] is limited to the case when g ≡ 0. A second class of methods, closely related to SAGA, consists of applying the proximal step to z̄_n − γ u_n, where z̄_n is the average of the variables (z_{i,n})_{1≤i≤M} (which thus need to be stored). This approach is retained for instance in the Finito algorithm [131] as well as in some instances of the minimization by incremental surrogate optimization (MISO) algorithm proposed in [126]. These methods are of particular interest when the extra storage cost is negligible with respect to the high computational cost of the gradients. Note that the MISO algorithm, relying on the majorization-minimization framework, employs a more generic update rule than Finito and has proven convergence guarantees even when g is nonzero.
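The variance-reduction idea of [133, 134] can be sketched as follows in the smooth case g ≡ 0: a full gradient is recomputed at a snapshot point every K inner iterations, and each inner step combines it with two component gradients. The least squares model and parameter values are illustrative assumptions.

```python
import numpy as np

def svrg_least_squares(H, y, gamma=1e-3, n_epochs=20, K=None, seed=0):
    # Sketch of the variance-reduced scheme of [133, 134] for g = 0:
    # a full gradient is recomputed at a snapshot every K inner steps;
    # each step uses grad_j(x) - grad_j(x_snap) + full_grad(x_snap).
    rng = np.random.default_rng(seed)
    M, N = H.shape
    K = K or M
    x = np.zeros(N)
    comp_grad = lambda v, j: 2.0 * H[j] * (H[j] @ v - y[j])
    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = 2.0 * H.T @ (H @ x_snap - y) / M   # snapshot full gradient
        for _ in range(K):
            j = rng.integers(M)
            u = comp_grad(x, j) - comp_grad(x_snap, j) + full_grad
            x = x - gamma * u
    return x
```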
2) Block coordinate approaches: In the spirit of the Gauss-Seidel method, an efficient approach for dealing with Problem (51) when N is large consists of resorting to block coordinate alternating strategies. Sometimes, such a block alternation can be performed in a deterministic manner [135, 136]. However, many optimization methods are based on fixed point algorithms, and it can be shown that with deterministic block coordinate strategies the contraction properties which are required to guarantee the convergence of such algorithms are generally no longer satisfied. In turn, by resorting to stochastic techniques, these properties can be retrieved in some probabilistic sense [85]. In addition, using stochastic rules for activating the different blocks of variables often turns out to be more flexible.

To illustrate why there is interest in block coordinate approaches, let us split the target variable x as [x_1^T, …, x_K^T]^T, where, for every k ∈ {1,…,K}, x_k ∈ R^{N_k} is the k-th block of variables with reduced dimension N_k (with N_1 + ··· + N_K = N). Let us further assume that the regularization function can be blockwise decomposed as

g(Dx) = Σ_{k=1}^{K} ( g_{1,k}(x_k) + g_{2,k}(D_k x_k) )                                        (59)

where, for every k ∈ {1,…,K}, D_k is a matrix in R^{P_k×N_k}, and g_{1,k}: R^{N_k} → ]−∞,+∞] and g_{2,k}: R^{P_k} → ]−∞,+∞] are proper lower-semicontinuous convex functions. Then, the stochastic primal-dual proximal algorithm allowing us to solve Problem (51) is given by

Algorithm 5 Stochastic primal-dual proximal algorithm
for n = 1, 2, … do
   for k = 1 to K do
      with probability ε_k ∈ (0,1] do
         v_{k,n+1} = (Id − prox_{τ^{−1} g_{2,k}})(v_{k,n} + D_k x_{k,n})
         x_{k,n+1} = prox_{γ g_{1,k}}( x_{k,n} − γ [ τ D_k^T (2 v_{k,n+1} − v_{k,n}) + (1/M) Σ_{i=1}^{M} h_{i,k} ∇ϕ_i( Σ_{k′=1}^{K} h_{i,k′}^T x_{k′,n}, y_i ) ] )
      otherwise
         v_{k,n+1} = v_{k,n},   x_{k,n+1} = x_{k,n}.
   end for
end for

In the algorithm above, for every i ∈ {1,…,M}, the scalar product h_i^T x has been rewritten in a blockwise manner as Σ_{k′=1}^{K} h_{i,k′}^T x_{k′}. Under some stability conditions on the choice of the positive step sizes τ and γ, x_n = [x_{1,n}^T, …, x_{K,n}^T]^T converges almost surely to a solution of the minimization problem as n → +∞ (see [137] for more technical details). It is important to note that the convergence result was established for arbitrary probabilities ε = [ε_1, …, ε_K]^T, provided that the block activation probabilities ε_k are positive and independent of n. Note that the various blocks can also be activated in a dependent manner at a given iteration n. Like its deterministic counterparts (see [138] and the references therein), this algorithm enjoys the property of not requiring any matrix inversion, which is of paramount importance when the matrices (D_k)_{1≤k≤K} are of large size and do not have simple forms.

When g_{2,k} ≡ 0, the random block coordinate forward-backward algorithm is recovered as an instance of Algorithm 5, since the dual variables (v_{k,n})_{1≤k≤K, n∈N} can be set to 0 and the constant τ becomes useless (a sketch of this special case is given at the end of this subsection). An extensive literature exists on the latter algorithm and its variants. In particular, its almost sure convergence was established in [85] under general conditions, whereas some worst case convergence rates were derived in [139–143]. In addition, if g_{1,k} ≡ 0, the random block coordinate descent algorithm is obtained [144].

When the objective function minimized in Problem (51) is strongly convex, the random block coordinate forward-backward algorithm can be applied to the dual problem, in a similar fashion to the dual forward-backward method used in the deterministic case [145]. This leads to so-called dual ascent strategies which have become quite popular in machine learning [146–149].

Random block coordinate versions of other proximal algorithms such as the Douglas-Rachford algorithm and ADMM have also been proposed [85, 150]. Finally, it is worth emphasizing that asynchronous distributed algorithms can be deduced from various randomly activated block coordinate methods [137, 151]. As well as dual ascent methods, the latter algorithms can also be viewed as incremental methods.
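As announced above, here is a minimal sketch of the random block coordinate forward-backward special case of Algorithm 5 (g_{2,k} ≡ 0), with g_{1,k} = λ‖·‖₁ on each block and each block activated independently with probability ε. For simplicity, the blocks are updated in parallel from the residual computed at the start of each sweep; this simplification, the block layout, and all names are our illustrative assumptions.

```python
import numpy as np

def random_block_fb(H, y, lam, blocks, gamma, eps=0.5, n_iter=2000, seed=0):
    # Sketch of the random block coordinate forward-backward algorithm
    # (Algorithm 5 with g_{2,k} = 0): each block k is activated
    # independently with probability eps and then performs a proximal
    # gradient step on its own coordinates, with g_{1,k} = lam * ||.||_1.
    rng = np.random.default_rng(seed)
    M, N = H.shape
    x = np.zeros(N)
    for _ in range(n_iter):
        r = H @ x - y                               # residual, refreshed once per sweep
        for blk in blocks:                          # blk: index array of block k
            if rng.random() < eps:                  # random block activation
                g_blk = 2.0 * H[:, blk].T @ r / M   # partial gradient wrt x_k
                v = x[blk] - gamma * g_blk
                x[blk] = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
    return x

# Example block layout: split a length-N variable into K contiguous blocks,
# blocks = np.array_split(np.arange(N), K)
```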

V. AREAS OF INTERSECTION: OPTIMIZATION-WITHIN-MCMC AND MCMC-DRIVEN OPTIMIZATION

There are many important examples of the synergy between stochastic simulation and optimization, including global optimization by simulated annealing, stochastic EM algorithms, and adaptive MCMC samplers [4]. In this section we highlight some of the interesting new connections between modern simulation and optimization that we believe are particularly relevant for the SP community, and that we hope will stimulate further research in this community.

A. Riemannian manifold MALA and HMC

Riemannian manifold MALA and HMC exploit differential geometry for the problem of specifying an appropriate proposal covariance matrix Σ that takes into account the geometry of the target density π [20]. These new methods stem from the observation that specifying Σ is equivalent to formulating the Langevin or Hamiltonian dynamics in a Euclidean parameter space with inner product ⟨w, x⟩_{Σ^{−1}} = w^T Σ^{−1} x. Riemannian methods advance this observation by considering a smoothly-varying, position-dependent matrix Σ(x), which arises naturally by formulating the dynamics in a Riemannian manifold. The choice of Σ(x) then becomes the more familiar problem of specifying a metric or distance for the parameter space [20]. Notice that the Riemannian and the canonical Euclidean gradients are related by ∇̃g(x) = Σ(x)∇g(x). Therefore this problem is also closely related to gradient preconditioning in gradient descent optimization, discussed in Sec. IV.B. Standard choices for Σ include for example the inverse Hessian matrix [21, 31], which is closely related to Newton's optimization method, and the inverse Fisher information matrix [20], which is the "natural" metric from an information geometry viewpoint and is also related to optimization by natural gradient descent [105]. These strategies have originated in the computational statistics community, and perform well in inference problems that are not too high-dimensional. Therefore, the challenge is to design new metrics that are appropriate for SP statistical models (see [152, 153] for recent work in this direction).
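As a simplified illustration of the role of the metric, the following sketch implements MALA with a constant preconditioning matrix Σ (for example, an inverse Hessian evaluated at the mode), i.e., a fixed-metric stand-in for the position-dependent Σ(x) of the full Riemannian samplers of [20]; the interface and parameter names are our illustrative assumptions.

```python
import numpy as np

def preconditioned_mala(log_pi, grad_log_pi, x0, Sigma, delta, n_samples, seed=0):
    # Sketch of MALA with a constant preconditioning matrix Sigma:
    # proposal N(x + (delta/2) Sigma grad log pi(x), delta * Sigma),
    # a fixed-metric simplification of Riemannian MALA [20].
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
    x = np.asarray(x0, dtype=float)
    samples = []
    drift = lambda v: v + 0.5 * delta * (Sigma @ grad_log_pi(v))
    def log_q(a, b):                               # log N(a; drift(b), delta*Sigma) + const
        z = np.linalg.solve(L, a - drift(b))
        return -0.5 * (z @ z) / delta
    for _ in range(n_samples):
        prop = drift(x) + np.sqrt(delta) * (L @ rng.standard_normal(x.size))
        log_alpha = log_pi(prop) - log_pi(x) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.random()) < log_alpha:       # Metropolis-Hastings correction
            x = prop
        samples.append(x.copy())
    return np.array(samples)
```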
B. Proximal MCMC algorithms

Most high-dimensional MCMC algorithms rely particularly strongly on differential analysis to navigate vast parameter spaces efficiently. Conversely, the potential of convex calculus for MCMC simulation remains largely unexplored. This is in sharp contrast with modern high-dimensional optimization described in Section IV, where convex calculus in general, and proximity operators [81, 154] in particular, are used extensively. This raises the question as to whether convex calculus and proximity operators can also be useful for stochastic simulation, especially for high-dimensional target densities that are log-concave, and possibly not continuously differentiable.

This question was studied recently in [24] in the context of Langevin algorithms. As explained in Section II.B, Langevin MCMC algorithms are derived from discrete-time approximations of the time-continuous Langevin diffusion process (5). Of course, the stability and accuracy of the discrete approximations determine the theoretical and practical convergence properties of the MCMC algorithms they underpin. The approximations commonly used in the literature are generally well-behaved and lead to powerful MCMC methods. However, they can perform poorly if π is not sufficiently regular, for example if π is not continuously differentiable, if it is heavy-tailed, or if it has lighter tails than a Gaussian distribution. This drawback limits the application of MCMC approaches to many SP problems, which rely increasingly on models that are not continuously differentiable or that involve constraints.

Using proximity operators, the following proximal approximation for the Langevin diffusion process (5) was recently proposed in [24]:

X^{(t+1)} ∼ N( prox_{−(δ/2) log π}(X^{(t)}), δ I_N )                                        (60)

as an alternative to the standard forward Euler approximation X^{(t+1)} ∼ N( X^{(t)} + (δ/2) ∇log π(X^{(t)}), δ I_N ) used in MALA (recall that prox_ϕ(v) denotes the proximity operator of ϕ evaluated at v ∈ R^N [81, 154]). Similarly to MALA, the time step δ can be adjusted online to achieve an acceptance probability of approximately 50%. It was established in [24] that when π is log-concave, (60) defines a remarkably stable discretization of (5) with optimal theoretical convergence properties. Moreover, the "proximal" MALA resulting from combining (60) with an MH step has very good geometric ergodicity properties. In [24], the algorithm's efficiency was demonstrated empirically on challenging models that are not well addressed by other MALA or HMC methodologies, including an image resolution enhancement model with a total-variation prior. Further practical assessments of proximal MALA algorithms would therefore be a welcome area of research; a toy implementation is sketched below.

Proximity operators have also been used recently in [155] for HMC sampling from log-concave densities that are not continuously differentiable. The experiments reported in [155] show that this approach can be very efficient, in particular for SP models involving high dimensionality and non-smooth priors. Unfortunately, theoretically analyzing HMC methods is difficult, and the precise theoretical convergence properties of this algorithm are not yet fully understood. We hope future work will focus on this topic.
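As announced above, here is a toy sketch of proximal MALA for the non-smooth log-concave target π(x) proportional to exp(−‖x‖₁), for which prox_{−(δ/2) log π} is soft-thresholding at level δ/2; the proposal (60) is combined with an MH correction. The target choice and all names are our illustrative assumptions, not the experiments of [24].

```python
import numpy as np

def proximal_mala(x0, delta, n_samples, seed=0):
    # Sketch of proximal MALA for pi(x) prop. to exp(-||x||_1): the
    # proposal mean (60) is prox_{-(delta/2) log pi}(x), here
    # soft-thresholding at delta/2, and an MH step corrects the bias.
    rng = np.random.default_rng(seed)
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    log_pi = lambda v: -np.abs(v).sum()
    log_q = lambda a, b: -((a - soft(b, delta / 2)) ** 2).sum() / (2 * delta)
    x = np.asarray(x0, dtype=float)
    chain = []
    for _ in range(n_samples):
        prop = soft(x, delta / 2) + np.sqrt(delta) * rng.standard_normal(x.size)
        log_alpha = log_pi(prop) - log_pi(x) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.random()) < log_alpha:
            x = prop
        chain.append(x.copy())
    return np.array(chain)
```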
C. Optimization-driven Gaussian simulation

The standard approach for simulating from a multivariate Gaussian distribution with precision matrix Q ∈ R^{N×N} is to perform a Cholesky factorization Q = L^T L, generate an auxiliary Gaussian vector w ∼ N(0, I_N), and then obtain the desired sample x by solving the linear system Lx = w [156]. The computational complexity of this approach generally scales at a prohibitive rate O(N³) with the model dimension N, making it impractical for large problems (note however that there are specific cases with lower complexity, for instance when Q is Toeplitz [157], circulant [158] or sparse [156]).

Optimization-driven Gaussian simulators arise from the observation that the samples can also be obtained by minimizing a carefully designed stochastic cost function [159, 160]. For illustration, consider a Bayesian model with Gaussian likelihood y|x ∼ N(Hx, Σ_y) and Gaussian prior x ∼ N(x_0, Σ_x), for some linear observation operator H ∈ R^{M×N}, prior mean x_0 ∈ R^N, and positive definite covariance matrices Σ_x ∈ R^{N×N} and Σ_y ∈ R^{M×M}. The posterior distribution p(x|y) is Gaussian with mean µ ∈ R^N and precision matrix Q ∈ R^{N×N} given by

Q = H^T Σ_y^{−1} H + Σ_x^{−1}
µ = Q^{−1} ( H^T Σ_y^{−1} y + Σ_x^{−1} x_0 ).

Simulating samples x|y ∼ N(µ, Q^{−1}) by Cholesky factorization of Q can be computationally expensive when N is large. Instead, optimization-driven simulators generate samples by solving the following "random" minimization problem

x = argmin_{u ∈ R^N}  (w_1 − Hu)^T Σ_y^{−1} (w_1 − Hu) + (w_2 − u)^T Σ_x^{−1} (w_2 − u)                                        (61)

with random vectors w_1 ∼ N(y, Σ_y) and w_2 ∼ N(x_0, Σ_x). It is easy to check that if (61) is solved exactly, then x is a sample from the desired posterior distribution p(x|y). From a computational viewpoint, however, it is significantly more efficient to solve (61) approximately, for example by using a few linear conjugate gradient iterations [160]. The approximation error can then be corrected by using an MH step [161], at the expense of introducing some correlation between the samples and therefore reducing the total effective sample size. Fortunately, there is an elegant strategy to determine automatically the optimal number of conjugate gradient iterations that maximizes the overall efficiency of the algorithm [161]. A minimal sketch of this perturbation-optimization approach is given below.
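As announced above, the following sketch draws the perturbations w₁ and w₂ and solves the normal equations of (61), Q u = H^T Σ_y^{−1} w₁ + Σ_x^{−1} w₂, with a truncated conjugate gradient loop. The MH correction of [161] is omitted, the perturbations are drawn here with dense Cholesky factors purely for clarity (in practice one would exploit the structure of the covariances), and all names are illustrative assumptions.

```python
import numpy as np

def perturb_and_solve(H, y, x0, Sy_inv, Sx_inv, n_cg=20, seed=0):
    # Sketch of an optimization-driven Gaussian simulator: draw
    # w1 ~ N(y, Sigma_y) and w2 ~ N(x0, Sigma_x), then solve the normal
    # equations of (61), Q u = H^T Sy_inv w1 + Sx_inv w2, with a few
    # conjugate gradient iterations (the MH correction of [161] is omitted).
    rng = np.random.default_rng(seed)
    M, N = H.shape
    # For clarity only: dense Cholesky factors of the covariances.
    w1 = y + np.linalg.cholesky(np.linalg.inv(Sy_inv)) @ rng.standard_normal(M)
    w2 = x0 + np.linalg.cholesky(np.linalg.inv(Sx_inv)) @ rng.standard_normal(N)
    Q = lambda v: H.T @ (Sy_inv @ (H @ v)) + Sx_inv @ v   # precision operator
    b = H.T @ (Sy_inv @ w1) + Sx_inv @ w2
    u = np.zeros(N)
    r = b - Q(u)
    p = r.copy()
    rs = r @ r
    for _ in range(n_cg):                                 # truncated conjugate gradient
        Qp = Q(p)
        alpha = rs / (p @ Qp)
        u = u + alpha * p
        r = r - alpha * Qp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u                                              # approximate draw from N(mu, Q^{-1})
```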
VI. CONCLUSIONS AND OBSERVATIONS

In writing this paper we have sought to provide an introduction to stochastic simulation and optimization methods in a tutorial format, while also raising some interesting topics for future research. We have addressed a variety of MCMC methods and discussed surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation, and approximate message passing. We also discussed a range of recent advances in optimization methods that have been proposed to solve stochastic problems, as well as stochastic methods for deterministic optimization. Subsequently, we highlighted new methods that combine simulation and optimization, such as proximal MCMC algorithms and optimization-driven Gaussian simulators.

Our expectation is that future methodologies will become more flexible. Our community has successfully applied computational inference methods, as we have described, to a plethora of challenges across an enormous range of application domains. Each problem offers different challenges, ranging from model dimensionality and complexity, data (too much or too little), inferences, accuracy and computation times. Consequently, it seems not unreasonable to speculate that the different computational methodologies discussed in this paper will evolve to become more adaptable, with their boundaries becoming less well defined, and with the development of algorithms that make use of simulation, variational approximations and optimization simultaneously. Such an approach is more likely to be able to handle an even wider range of models, datasets, inferences, accuracies and computing times in a computationally efficient way.

REFERENCES

[1] A. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc., vol. 39, pp. 1–17, 1977.
[2] R. Neal and G. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, M. I. Jordan, Ed. MIT Press, 1998, pp. 355–368.
[3] H. Attias, "A variational Bayesian framework for graphical models," in Proc. Neural Inform. Process. Syst. Conf., Denver, CO, USA, 2000, pp. 209–216.
[4] C. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. Springer-Verlag, 2004.
[5] P. Green, K. Łatuszyński, M. Pereyra, and C. P. Robert, "Bayesian computation: a summary of the current state, and samples backwards and forwards," Statistics and Computing, 2015, in press.
[6] A. Doucet and A. M. Johansen, "A tutorial on particle filtering and smoothing: Fifteen years later," in The Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovsky, Eds. Oxford University Press, 2011.
[7] A. Beskos, A. Jasra, E. A. Muzaffer, and A. M. Stuart, "Sequential Monte Carlo methods for Bayesian elliptic inverse problems," Statistics and Computing, vol. 25, 2015, in press.
[8] J. Marin, P. Pudlo, C. Robert, and R. Ryder, "Approximate Bayesian computational methods," Statistics and Computing, pp. 1–14, 2011.
[9] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1091, 1953.
[10] W. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970.
[11] S. Chib and E. Greenberg, "Understanding the Metropolis–Hastings algorithm," Ann. Mathemat. Statist., vol. 49, pp. 327–335, 1995.
[12] M. Bédard, "Weak convergence of Metropolis algorithms for non-i.i.d. target distributions," Ann. Appl. Probab., vol. 17, no. 4, pp. 1222–1244, 2007.
[13] G. O. Roberts and J. S. Rosenthal, "Optimal scaling for various Metropolis-Hastings algorithms," Statist. Sci., vol. 16, no. 4, pp. 351–367, 2001.
[14] C. Andrieu and G. Roberts, "The pseudo-marginal approach for efficient Monte Carlo computations," Ann. Statist., vol. 37, no. 2, pp. 697–725, 2009.
[15] C. Andrieu, A. Doucet, and R. Holenstein, "Particle Markov chain Monte Carlo (with discussion)," J. Royal Statist. Society Series B, vol. 72, no. 2, pp. 269–342, 2011.
[16] M. Pereyra, N. Dobigeon, H. Batatia, and J.-Y. Tourneret, "Estimating the granularity coefficient of a Potts-Markov random field within an MCMC algorithm," IEEE Trans. Image Processing, vol. 22, no. 6, pp. 2385–2397, June 2013.
[17] S. Jarner and E. Hansen, "Geometric ergodicity of Metropolis algorithms," Stochastic Processes and Their Applications, vol. 85, no. 2, pp. 341–361, 2000.
[18] A. Beskos, G. Roberts, and A. Stuart, "Optimal scalings for local Metropolis-Hastings chains on nonproduct targets in high dimensions," Ann. Appl. Probab., vol. 19, no. 3, pp. 863–898, 2009.
[19] G. Roberts and R. Tweedie, "Exponential convergence of Langevin distributions and their discrete approximations," Bernoulli, vol. 2, no. 4, pp. 341–363, 1996.
[20] M. Girolami and B. Calderhead, "Riemann manifold Langevin and Hamiltonian Monte Carlo methods," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, pp. 123–214, 2011.
[21] M. Betancourt, "A general metric for Riemannian manifold Hamiltonian Monte Carlo," in National Conference on the Geometric Science of Information, ser. Lecture Notes in Computer Science 8085, F. Nielsen and F. Barbaresco, Eds. Springer, 2013, pp. 327–334.

[22] Y. Atchadé, "An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift," Methodology and Computing in Applied Probability, vol. 8, no. 2, pp. 235–254, 2006.
[23] N. S. Pillai, A. M. Stuart, and A. H. Thiéry, "Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions," The Annals of Applied Probability, vol. 22, no. 6, pp. 2320–2356, 2012.
[24] M. Pereyra, "Proximal Markov chain Monte Carlo algorithms," Statistics and Computing, 2015, in press.
[25] T. Xifara, C. Sherlock, S. Livingstone, S. Byrne, and M. Girolami, "Langevin diffusions and the Metropolis-adjusted Langevin algorithm," Statistics & Probability Letters, vol. 91, pp. 14–19, 2014.
[26] B. Casella, G. Roberts, and O. Stramer, "Stability of partially implicit Langevin schemes and their MCMC variants," Methodology and Computing in Applied Probability, vol. 13, no. 4, pp. 835–854, 2011.
[27] A. Schreck, G. Fort, S. L. Corff, and E. Moulines, "A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection," arXiv preprint arXiv:1312.5658, 2013.
[28] R. Neal, "MCMC using Hamiltonian dynamics," in Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Eds. Chapman & Hall/CRC Press, 2013, pp. 113–162.
[29] M. J. Betancourt, S. Byrne, S. Livingstone, and M. Girolami, "The geometric foundations of Hamiltonian Monte Carlo," ArXiv e-prints, 2014.
[30] A. Beskos, N. Pillai, G. Roberts, J.-M. Sanz-Serna, and A. Stuart, "Optimal tuning of the hybrid Monte Carlo algorithm," Bernoulli, vol. 19, no. 5A, pp. 1501–1534, 2013.
[31] Y. Zhang and C. A. Sutton, "Quasi-Newton methods for Markov chain Monte Carlo," in Advances in Neural Information Processing Systems 24 (NIPS 2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., 2011, pp. 2393–2401.
[32] M. D. Hoffman and A. Gelman, "The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research, vol. 15, pp. 1593–1623, 2014. [Online]. Available: http://jmlr.org/papers/v15/hoffman14a.html
[33] M. Betancourt, S. Byrne, and M. Girolami, "Optimizing the integrator step size for Hamiltonian Monte Carlo," arXiv preprint arXiv:1411.6669, 2014.
[34] K. Łatuszyński, G. Roberts, and J. Rosenthal, "Adaptive Gibbs samplers and related MCMC methods," Ann. Appl. Probab., vol. 23, no. 1, pp. 66–98, 2013.
[35] C. Bazot, N. Dobigeon, J.-Y. Tourneret, A. K. Zaas, G. S. Ginsburg, and A. O. Hero, "Unsupervised Bayesian linear unmixing of gene expression microarrays," BMC Bioinformatics, vol. 14, no. 99, March 2013.
[36] D. A. van Dyk and T. Park, "Partially collapsed Gibbs samplers: Theory and methods," Journal of the American Statistical Association, vol. 103, pp. 790–796, 2008.
[37] T. Park and D. A. van Dyk, "Partially collapsed Gibbs samplers: Illustrations and applications," Journal of Computational and Graphical Statistics, vol. 103, pp. 283–305, 2009.
[38] G. Kail, J.-Y. Tourneret, F. Hlawatsch, and N. Dobigeon, "Blind deconvolution of sparse pulse sequences under a minimum distance constraint: a partially collapsed Gibbs sampler method," IEEE Trans. Signal Processing, vol. 60, no. 6, pp. 2727–2743, June 2012.
[39] D. A. van Dyk and X. Jiao, "Metropolis-Hastings within partially collapsed Gibbs samplers," Journal of Computational and Graphical Statistics, in press.
[40] M. Jordan, Z. Gharamani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, pp. 183–233, 1999.
[41] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2007.
[42] R. Weinstock, Calculus of Variations. New York: Dover, 1974.
[43] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
[44] C. Peterson and J. R. Anderson, "A mean field theory learning algorithm for neural networks," Complex Sys., vol. 1, pp. 995–1019, 1987.
[45] G. Parisi, Statistical Field Theory. Reading, MA: Addison-Wesley, 1988.
[46] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Found. Trends Mach. Learn., vol. 1, May 2008.
[47] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Trans. Inform. Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.
[48] A. L. Yuille and A. Rangarajan, "The concave-convex procedure (CCCP)," in Proc. Neural Inform. Process. Syst. Conf., vol. 2, Vancouver, BC, Canada, 2002, pp. 1033–1040.
[49] R. G. Gallager, "Low density parity check codes," IRE Trans. Inform. Theory, vol. 8, pp. 21–28, Jan. 1962.
[50] J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann, 1988.
[51] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, pp. 498–519, Feb. 2001.
[52] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[53] G. F. Cooper, "The computational complexity of probabilistic inference using Bayesian belief networks," Artificial Intelligence, vol. 42, pp. 393–405, 1990.
[54] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: An empirical study," in Proc. Uncertainty Artif. Intell., 1999, pp. 467–475.
[55] R. J. McEliece, D. J. C. MacKay, and J.-F. Cheng, "Turbo decoding as an instance of Pearl's 'belief propagation' algorithm," IEEE J. Sel. Areas Commun., vol. 16, no. 2, pp. 140–152, Feb. 1998.
[56] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. New York: Cambridge University Press, 2003.
[57] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning low-level vision," Intl. J. Computer Vision, vol. 40, no. 1, pp. 25–47, Oct. 2000.
[58] J. Boutros and G. Caire, "Iterative multiuser joint decoding: Unified framework and asymptotic analysis," IEEE Trans. Inform. Theory, vol. 48, no. 7, pp. 1772–1793, Jul. 2002.
[59] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing: I. Motivation and construction," in Proc. Inform. Theory Workshop, Cairo, Egypt, Jan. 2010, pp. 1–5.
[60] S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in Proc. IEEE Int. Symp. Inform. Theory, St. Petersburg, Russia, Aug. 2011, pp. 2168–2172, (full version at arXiv:1010.5141).
[61] D. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.
[62] T. Heskes, M. Opper, W. Wiegerinck, O. Winther, and O. Zoeter, "Approximate inference techniques with expectation constraints," J. Stat. Mech., vol. P11015, 2005.
[63] T. Heskes, O. Zoeter, and W. Wiegerinck, "Approximate expectation maximization," in Proc. Neural Inform. Process. Syst. Conf., Vancouver, BC, Canada, 2004, pp. 353–360.
[64] T. Minka, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, MIT, Cambridge, MA, Jan. 2001.
[65] M. Seeger, "Expectation propagation for exponential families," EPFL, Tech. Rep. 161464, 2005.
[66] M. Opper and O. Winther, "Expectation consistent free energies for approximate inference," in Proc. Neural Inform. Process. Syst. Conf., Vancouver, BC, Canada, 2005, pp. 1001–1008.
[67] T. Heskes and O. Zoeter, "Expectation propagation for approximate inference in dynamic Bayesian networks," in Proc. Uncertainty Artif. Intell., Alberta, Canada, 2002, pp. 313–320.
[68] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[69] A. Javanmard and A. Montanari, "State evolution for general approximate message passing algorithms, with applications to spatial coupling," Inform. Inference, vol. 2, no. 2, pp. 115–144, 2013.
[70] S. Rangan, P. Schniter, E. Riegler, A. Fletcher, and V. Cevher, "Fixed points of generalized approximate message passing with arbitrary matrices," in Proc. IEEE Int. Symp. Inform. Theory, Istanbul, Turkey, Jul. 2013, pp. 664–668, (full version at arXiv:1301.6295).
[71] F. Krzakala, A. Manoel, E. W. Tramel, and L. Zdeborová, "Variational free energies for compressed sensing," in Proc. IEEE Int. Symp. Inform. Theory, Honolulu, Hawaii, Jul. 2014, pp. 1499–1503, (see also arXiv:1402.1384).
[72] J. P. Vila and P. Schniter, "An empirical-Bayes approach to recovering linearly constrained non-negative sparse signals," IEEE Trans. Signal Process., vol. 62, no. 18, pp. 4689–4703, Sep. 2014, (see also arXiv:1310.2806).
[73] S. Rangan, P. Schniter, and A. Fletcher, "On the convergence of generalized approximate message passing with arbitrary matrices," in Proc. IEEE Int. Symp. Inform. Theory, Honolulu, Hawaii, Jul. 2014, pp. 236–240, (full version at arXiv:1402.3210).

[74] J. Vila, P. Schniter, S. Rangan, F. Krzakala, and L. Zdeborová, "Adaptive damping and mean removal for the generalized approximate message passing algorithm," in Proc. IEEE Int. Conf. Acoust. Speech & Signal Process., Brisbane, Australia, 2015.
[75] S. Rangan, A. K. Fletcher, P. Schniter, and U. Kamilov, "Inference for generalized linear models via alternating directions and Bethe free energy minimization," arXiv:1501.01797, Jan. 2015.
[76] L. Bottou and Y. LeCun, "Large scale online learning," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Vancouver, Canada, Dec. 13-18 2004.
[77] S. Sra, S. Nowozin, and S. J. Wright, Eds., Optimization for Machine Learning. Cambridge, MA: MIT Press, 2011.
[78] S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective. San Diego, CA: Academic Press, 2015.
[79] S. O. Haykin, Adaptive Filter Theory, 4th ed. Englewood Cliffs, NJ: Prentice Hall, 2002.
[80] A. H. Sayed, Adaptive Filters. Hoboken, NJ: John Wiley & Sons, 2011.
[81] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke, R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds. New York: Springer-Verlag, 2010, pp. 185–212.
[82] J. Duchi and Y. Singer, "Efficient online and batch learning using forward backward splitting," J. Mach. Learn. Res., vol. 10, pp. 2899–2934, 2009.
[83] Y. F. Atchadé, G. Fort, and E. Moulines, "On stochastic proximal gradient algorithms," 2014, http://arxiv.org/abs/1402.2365.
[84] L. Rosasco, S. Villa, and B. C. Vũ, "A stochastic forward-backward splitting method for solving monotone inclusions in Hilbert spaces," 2014, http://arxiv.org/abs/1403.7999.
[85] P. L. Combettes and J.-C. Pesquet, "Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping," SIAM J. Optim., vol. 25, no. 2, pp. 1221–1248, 2015.
[86] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statistics, vol. 22, pp. 400–407, 1951.
[87] J. M. Ermoliev and Z. V. Nekrylova, "The method of stochastic gradients and its application," in Seminar: Theory of Optimal Solutions. No. 1 (Russian). Akad. Nauk Ukrain. SSR, Kiev, 1967, pp. 24–47.
[88] O. V. Guseva, "The rate of convergence of the method of generalized stochastic gradients," Kibernetika (Kiev), no. 4, pp. 143–145, 1971.
[89] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM J. Optim., vol. 10, no. 3, pp. 627–642, 2000.
[90] F. Bach and E. Moulines, "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Granada, Spain, Dec. 12-15 2011, pp. 451–459.
[91] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2008.
[92] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855, Jul. 1992.
[93] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-gradient solver for SVM," in Proc. 24th Int. Conf. Mach. Learn., Corvalis, Oregon, Jun. 20-24 2007, pp. 807–814.
[94] L. Xiao, "Dual averaging methods for regularized stochastic learning and online optimization," J. Mach. Learn. Res., vol. 11, pp. 2543–2596, Dec. 2010.
[95] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, "Composite objective mirror descent," in Proc. Conf. Learn. Theory, Haifa, Israel, Jun. 27-29 2010, pp. 14–26.
[96] L. W. Zhong and J. T. Kwok, "Accelerated stochastic gradient method for composite regularization," J. Mach. Learn. Res., vol. 33, pp. 1086–1094, 2014.
[97] H. Ouyang, N. He, L. Tran, and A. Gray, "Stochastic alternating direction method of multipliers," in Proc. 30th Int. Conf. Mach. Learn., Atlanta, USA, Jun. 16-21 2013, pp. 80–88.
[98] T. Suzuki, "Dual averaging and proximal gradient descent for online alternating direction multiplier method," in Proc. 30th Int. Conf. Mach. Learn., Atlanta, USA, Jun. 16-21 2013, pp. 392–400.
[99] W. Zhong and J. Kwok, "Fast stochastic alternating direction method of multipliers," Beijing, China, Tech. Rep., Jun. 21-26 2014.
[100] Y. Xu and W. Yin, "Block stochastic gradient iteration for convex and nonconvex optimization," 2014, http://arxiv.org/pdf/1408.2597v2.pdf.
[101] C. Hu, J. T. Kwok, and W. Pan, "Accelerated gradient methods for stochastic optimization and online learning," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Vancouver, Canada, Dec. 11-12 2009, pp. 781–789.
[102] G. Lan, "An optimal method for stochastic composite optimization," Math. Program., vol. 133, no. 1-2, pp. 365–397, 2012.
[103] Q. Lin, X. Chen, and J. Peña, "A sparsity preserving stochastic gradient method for composite optimization," Comput. Optim. Appl., vol. 58, no. 2, pp. 455–482, June 2014.
[104] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[105] S.-I. Amari, "Natural gradient works efficiently in learning," Neural Comput., vol. 10, no. 2, pp. 251–276, Feb. 1998.
[106] E. Chouzenoux, J.-C. Pesquet, and A. Florescu, "A stochastic 3MG algorithm with application to 2D filter identification," in Proc. 22nd European Signal Processing Conference, Lisbon, Portugal, 1-5 Sept. 2014, pp. 1587–1591.
[107] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[108] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. New York: Springer-Verlag, 2003.
[109] S. L. Gay and S. Tavathia, "The fast affine projection algorithm," in Proc. Int. Conf. Acoust., Speech Signal Process., vol. 5, Detroit, MI, May 9-12 1995, pp. 3023–3026.
[110] A. W. H. Khong and P. A. Naylor, "Efficient use of sparse adaptive filters," in Proc. Asilomar Conf. Signal, Systems and Computers, Pacific Grove, CA, Oct. 29-Nov. 1 2006, pp. 1375–1379.
[111] C. Paleologu, J. Benesty, and S. Ciochina, "An improved proportionate NLMS algorithm based on the ℓ0 norm," in Proc. Int. Conf. Acoust., Speech Signal Process., Dallas, TX, March 14-19 2010.
[112] O. Hoshuyama, R. A. Goubran, and A. Sugiyama, "A generalized proportionate variable step-size algorithm for fast changing acoustic environments," in Proc. Int. Conf. Acoust., Speech Signal Process., vol. 4, Montreal, Canada, May 17-21 2004, pp. iv-161–iv-164.
[113] C. Paleologu, J. Benesty, and F. Albu, "Regularization of the improved proportionate affine projection algorithm," in Proc. Int. Conf. Acoust., Speech Signal Process., Kyoto, Japan, Mar. 25-30 2012, pp. 169–172.
[114] Y. Chen, Y. Gu, and A. O. Hero, "Sparse LMS for system identification," in Proc. Int. Conf. Acoust., Speech Signal Process., Taipei, Taiwan, Apr. 19-24 2009, pp. 3125–3128.
[115] ——, "Regularized least-mean-square algorithms," 2010, http://arxiv.org/abs/1012.5066.
[116] R. Meng, R. C. De Lamare, and V. H. Nascimento, "Sparsity-aware affine projection adaptive algorithms for system identification," in Proc. Sensor Signal Process. Defence, London, U.K., Sept. 27-29 2011, pp. 1–5.
[117] L. V. S. Markus, W. A. Martins, and P. S. R. Diniz, "Affine projection algorithms for sparse system identification," in Proc. Int. Conf. Acoust., Speech Signal Process., Vancouver, Canada, May 26-31 2013, pp. 5666–5670.
[118] L. V. S. Markus, T. N. Ferreira, W. A. Martins, and P. S. R. Diniz, "Sparsity-aware data-selective adaptive filters," IEEE Trans. Signal Process., vol. 62, no. 17, pp. 4557–4572, Sept. 2014.
[119] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, "A sparse adaptive filtering using time-varying soft-thresholding techniques," in Proc. Int. Conf. Acoust., Speech Signal Process., Dallas, TX, Mar. 14-19 2010, pp. 3734–3737.
[120] M. Yamagishi, M. Yukawa, and I. Yamada, "Acceleration of adaptive proximal forward-backward splitting method and its application to sparse system identification," in Proc. Int. Conf. Acoust., Speech Signal Process., Prague, Czech Republic, May 22-27 2011, pp. 4296–4299.
[121] S. Ono, M. Yamagishi, and I. Yamada, "A sparse system identification by using adaptively-weighted total variation via a primal-dual splitting approach," in Proc. Int. Conf. Acoust., Speech Signal Process., Vancouver, Canada, May 26-31 2013, pp. 6029–6033.
[122] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: where RLS meets the ℓ1-norm," IEEE Trans. Signal Process., vol. 58, no. 7, pp. 3436–3447, Jul. 2010.
[123] B. Babadi, N. Kalouptsidis, and V. Tarokh, "SPARLS: The sparse RLS algorithm," IEEE Trans. Signal Process., vol. 58, no. 8, pp. 4013–4025, Aug. 2010.
[124] Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Online sparse system identification and signal reconstruction using projections onto weighted ℓ1 balls," IEEE Trans. Signal Process., vol. 59, no. 3, pp. 936–952, Mar. 2011.
[125] K. Slavakis, Y. Kopsinis, S. Theodoridis, and S. McLaughlin, "Generalized thresholding and online sparsity-aware learning in a union of subspaces," IEEE Trans. Signal Process., vol. 61, no. 15, pp. 3760–3773, Aug. 2013.

[126] J. Mairal, "Incremental majorization-minimization optimization with application to large-scale machine learning," SIAM J. Optim., vol. 25, no. 2, pp. 829–855, 2015.
[127] A. Repetti, E. Chouzenoux, and J.-C. Pesquet, "A parallel block-coordinate approach for primal-dual splitting with arbitrary random block selection," in Proc. European Signal Process. Conf. (EUSIPCO 2015), Nice, France, Aug. 31 - Sep. 4 2015.
[128] D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with a constant step size," SIAM J. Optim., vol. 18, no. 1, pp. 29–51, 2007.
[129] D. P. Bertsekas, "Incremental gradient, subgradient, and proximal methods for convex optimization: A survey," 2010, http://www.mit.edu/~dimitrib/Incremental_Survey_LIDS.pdf.
[130] M. Schmidt, N. Le Roux, and F. Bach, "Minimizing finite sums with the stochastic average gradient," 2013, http://arxiv.org/abs/1309.2388.
[131] A. Defazio, J. Domke, and T. Caetano, "Finito: A faster, permutable incremental gradient method for big data problems," in Proc. 31st Int. Conf. Mach. Learn., Beijing, China, Jun. 21-26 2014, pp. 1125–1133.
[132] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Montreal, Canada, Dec. 8-11 2014, pp. 1646–1654.
[133] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Lake Tahoe, Nevada, Dec. 5-10 2013, pp. 315–323.
[134] J. Konečný, J. Liu, P. Richtárik, and M. Takáč, "mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting," 2014, http://arxiv.org/pdf/1410.4744v1.pdf.
[135] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," J. Optim. Theory Appl., vol. 109, no. 3, pp. 475–494, 2001.
[136] E. Chouzenoux, J.-C. Pesquet, and A. Repetti, "A block coordinate variable metric forward-backward algorithm," 2014, http://www.optimization-online.org/DB_HTML/2013/12/4178.html.
[137] J.-C. Pesquet and A. Repetti, "A class of randomized primal-dual algorithms for distributed optimization," J. Nonlinear Convex Anal., 2015, http://arxiv.org/abs/1406.6404.
[138] N. Komodakis and J.-C. Pesquet, "Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems," IEEE Signal Process. Mag., 2014, to appear, http://www.optimization-online.org/DB_HTML/2014/06/4398.html.
[139] P. Richtárik and M. Takáč, "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function," Math. Program., vol. 144, pp. 1–38, 2014.
[140] I. Necoara and A. Patrascu, "A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints," Comput. Optim. Appl., vol. 57, pp. 307–337, 2014.
[141] Z. Lu and L. Xiao, "On the complexity analysis of randomized block-coordinate descent methods," Math. Program., 2015, to appear.
[142] O. Fercoq and P. Richtárik, "Accelerated, parallel and proximal coordinate descent," 2014, http://arxiv.org/pdf/1312.5799v2.pdf.
[143] Z. Qu and P. Richtárik, "Coordinate descent with arbitrary sampling I: algorithms and complexity," 2014, http://arxiv.org/abs/1412.8060.
[144] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM J. Optim., pp. 341–362, 2012.
[145] P. L. Combettes, D. Dũng, and B. C. Vũ, "Dualization of signal recovery problems," Set-Valued Var. Anal., vol. 18, pp. 373–404, Dec. 2010.
[146] S. Shalev-Shwartz and T. Zhang, "Stochastic dual coordinate ascent methods for regularized loss minimization," J. Mach. Learn. Res., vol. 14, pp. 567–599, 2013.
[147] ——, "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization," to appear in Math. Program. Ser. A, Nov. 2014, http://arxiv.org/abs/1309.2375.
[148] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, "Communication-efficient distributed dual coordinate ascent," in Proc. Ann. Conf. Neur. Inform. Proc. Syst., Montreal, Canada, Dec. 8-11 2014, pp. 3068–3076.
[149] Z. Qu, P. Richtárik, and T. Zhang, "Randomized dual coordinate ascent with arbitrary sampling," 2014, http://arxiv.org/abs/1411.5873.
[150] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, "Asynchronous distributed optimization using a randomized alternating direction method of multipliers," in Proc. 52nd Conf. Decision Control, Florence, Italy, Dec. 10-13 2013, pp. 3671–3676.
[151] P. Bianchi, W. Hachem, and F. Iutzeler, "A stochastic coordinate descent primal-dual algorithm and applications to large-scale composite optimization," 2014, http://arxiv.org/abs/1407.0898.
[152] S. Allassonnière and E. Kuhn, "Convergent stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation," ArXiv e-prints, Jul. 2012.
[153] Y. Marnissi, A. Benazza-Benyahia, E. Chouzenoux, and J.-C. Pesquet, "Majorize-minimize adapted Metropolis-Hastings algorithm. Application to multichannel image recovery," in Proc. European Signal Process. Conf. (EUSIPCO 2014), Lisbon, Portugal, Sep. 1-5 2014, pp. 1332–1336.
[154] J. J. Moreau, "Fonctions convexes duales et points proximaux dans un espace hilbertien," C. R. Acad. Sci. Paris Sér. A, vol. 255, pp. 2897–2899, 1962.
[155] L. Chaari, J.-Y. Tourneret, C. Chaux, and H. Batatia, "A Hamiltonian Monte Carlo method for non-smooth energy sampling," ArXiv e-prints, Jan. 2014.
[156] H. Rue, "Fast sampling of Gaussian Markov random fields," J. Royal Statist. Society Series B, vol. 63, no. 2, pp. 325–338, 2001.
[157] W. F. Trench, "An algorithm for the inversion of finite Toeplitz matrices," J. Soc. Ind. Appl. Math., vol. 12, no. 3, pp. 515–522, 1964.
[158] D. Geman and C. Yang, "Nonlinear image recovery with half-quadratic regularization," IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946, Jul. 1995.
[159] G. Papandreou and A. L. Yuille, "Gaussian sampling by local perturbations," in Advances in Neural Information Processing Systems 23 (NIPS 2010), J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 1858–1866.
[160] F. Orieux, O. Feron, and J.-F. Giovannelli, "Sampling high-dimensional Gaussian distributions for general linear inverse problems," IEEE Signal Process. Lett., vol. 19, no. 5, pp. 251–254, May 2012.
[161] C. Gilavert, S. Moussaoui, and J. Idier, "Efficient Gaussian sampling for solving large-scale inverse problems using MCMC," IEEE Trans. Signal Process., vol. 63, no. 1, pp. 70–80, Jan. 2015.