Success and Failure of Adaptation-Diffusion Algorithms for Consensus in Multi-Agent Networks

To cite this version: Gemma Morral, Pascal Bianchi, Gersende Fort. Success and Failure of Adaptation-Diffusion Algorithms for Consensus in Multi-Agent Networks. 2014. hal-01078466.

HAL Id: hal-01078466, https://hal.archives-ouvertes.fr/hal-01078466. Preprint submitted on 29 Oct 2014.

Gemma Morral∗, Pascal Bianchi and Gersende Fort

∗ This work is supported by DGA (French Armement Procurement Agency), the Institut Mines-Télécom and by the ANR grant ODISSEE of the ASTRID program (ANR-13-ASTR-0030). G. Morral, P. Bianchi and G. Fort are with LTCI, Télécom ParisTech & CNRS, 46 rue Barrault, 75634 Paris Cedex 13, France ([firstname].[lastname]@telecom-paristech.fr).

Abstract—This paper investigates the problem of distributed stochastic approximation in multi-agent systems. The algorithm under study consists of two steps: a local stochastic approximation step and a diffusion step which drives the network to a consensus. The diffusion step uses row-stochastic matrices to weight the network exchanges. As opposed to previous works, the exchange matrices are not assumed to be doubly stochastic, and they may also depend on the past estimates. We prove that non-doubly stochastic matrices generally influence the limit points of the algorithm. Nevertheless, the limit points are not affected by the choice of the matrices provided that the latter are doubly stochastic in expectation. This conclusion legitimates the use of broadcast-like diffusion protocols, which are easier to implement. Next, by means of a central limit theorem, we prove that doubly stochastic protocols perform asymptotically as well as centralized algorithms, and we quantify the degradation caused by the use of non-doubly stochastic matrices. Throughout the paper, a special emphasis is put on the special case of distributed non-convex optimization as an illustration of our results.

I. INTRODUCTION

Distributed stochastic approximation has recently been proposed using different cooperative approaches. In the so-called incremental approach (see for instance [1]–[4]), a message containing an estimate of the quantity of interest iteratively travels all over the network. This paper focuses on another cooperative approach based on average consensus techniques, where the estimates computed locally by each agent are combined through the network.

Consider a network composed of N agents, or nodes. Agents seek to find a consensus on some global parameter by means of local observations and peer-to-peer communications. The aim of this paper is to analyze the behavior of the following distributed algorithm. Node i (i = 1, ..., N) generates an $\mathbb{R}^d$-valued stochastic process $(\theta_{n,i})_{n\ge 0}$. At time n, the update is in two steps:

[Local step] Node i generates a temporary iterate $\tilde\theta_{n,i}$ given by
$$\tilde\theta_{n,i} = \theta_{n-1,i} + \gamma_n Y_{n,i}\,, \qquad (1)$$
where $\gamma_n$ is a deterministic positive step size and where the $\mathbb{R}^d$-valued random process $(Y_{n,i})_{n\ge 1}$ represents the observations made by agent i.

[Gossip step] Node i observes the values $\tilde\theta_{n,j}$ of some other j's and computes the weighted average
$$\theta_{n,i} = \sum_{j=1}^N w_n(i,j)\,\tilde\theta_{n,j}\,, \qquad (2)$$
where the $w_n(i,j)$'s are scalar non-negative random coefficients such that $\sum_{j=1}^N w_n(i,j) = 1$ for any i. The sequence of random matrices $W_n := [w_n(i,j)]_{i,j=1}^N$ represents the time-varying communication network between the nodes. One simply sets $w_n(i,j) = 0$ whenever nodes i and j are unable to communicate at time n. The aim of this paper is to investigate the almost sure (a.s.) convergence of this algorithm as n tends to infinity, as well as its convergence rate. Our goal is in particular to quantify the effect of the sequence of matrices $(W_n)_{n\ge 1}$ on the convergence. The algorithm is initialized at some arbitrary $\mathbb{R}^d$-valued vectors $\theta_{0,1},\dots,\theta_{0,N}$.

Application to distributed optimization. The algorithm (1)-(2) under study is not new. The idea behind the algorithm traces back to [5], [6], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). More recently, numerous works extended this kind of algorithm to more involved multi-agent scenarios; see [7]–[19] as a non-exhaustive list. In this context, one seeks to minimize a sum of local private cost functions $f_i$ of the agents:
$$\min_{\theta\in\mathbb{R}^d}\ \sum_{i=1}^N f_i(\theta)\,, \qquad (3)$$
where for all i, the function $f_i$ is supposed to be unknown by any other agent $j \ne i$. To address this question, it is assumed that
$$Y_{n,i} = -\nabla f_i(\theta_{n-1,i}) + \xi_{n,i}\,, \qquad (4)$$
where $\nabla$ is the gradient operator and $\xi_{n,i}$ represents some random perturbation which possibly occurs when observing the gradient. In this paper, we handle the case where the functions $f_i$ are not necessarily convex. Of course, in that case, there is generally no hope to ensure convergence to a minimizer of (3). Instead, a more realistic objective is to reach critical points of the objective function, i.e., points $\theta$ such that $\sum_i \nabla f_i(\theta) = 0$.

Convergence to a global minimizer is shown in [20] assuming convex utility functions and bounded (sub)gradients. The results of [20] are extended in [21] to the stochastic descent case, i.e., when the observation of the utility functions is perturbed by a random noise. More recently, [14] investigated distributed stochastic approximation at large, providing stability conditions for the algorithm (1)-(2) while relaxing the bounded gradient assumption and including the case of random communication links. In [14], it is also proved under some hypotheses that the estimation error is asymptotically normal: the convergence rate and the asymptotic covariance are characterized. An enhanced averaging algorithm à la Polyak is also proposed to recover the optimal convergence rate.

Doubly and non-doubly stochastic matrices. In most works (see for instance [20]–[22]), the matrices $(W_n)_{n\ge 1}$ are assumed doubly stochastic, meaning that $W_n\mathbf{1} = W_n^T\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is the N×1 vector whose components are all equal to one and where $^T$ denotes transposition. Although row-stochasticity ($W_n\mathbf{1} = \mathbf{1}$) is rather easy to ensure in practice, column-stochasticity ($W_n^T\mathbf{1} = \mathbf{1}$) implies more stringent restrictions on the communication protocol. For instance, in [23], each one-way transmission from an agent i to another agent j requires at the same time a feedback link from j to i. As a matter of fact, double stochasticity prevents the use of natural broadcast schemes, in which a given node may transmit its local estimate to all neighbors without expecting any immediate feedback.

Remarkably, although generally assumed, double stochasticity of the matrices $W_n$ is in fact not mandatory. A couple of works (see e.g. [14], [24]) get rid of the column-stochasticity condition, but at the price of assumptions that may not always be satisfied in practice. Other works ([17], [25], [26]) manage to circumvent the use of feedback links by coupling the gradient descent with the so-called push-sum protocol [27]. The latter however introduces an additional communication of weights in the network in order to keep track of some summary of the past transmissions. In this paper, we address the following questions: What conditions on the sequence $(W_n)_{n\ge 1}$ are needed to ensure that Algorithm (1)-(2) drives all agents to a common critical point of $\sum_i f_i$? What happens if these conditions are not satisfied? How is the convergence rate influenced by the communication protocol?

Contributions.
1) Assuming that $(W_n)_{n\ge 1}$ forms an independent and identically distributed (i.i.d.) sequence of stochastic matrices, we prove under some technical hypotheses that Algorithm (1)-(2) leads the agents to a consensus, which is characterized. It is shown that the latter consensus does not necessarily coincide with a critical point of $\sum_i f_i$.
2) We provide sufficient conditions, either on the communication protocol $(W_n)_{n\ge 1}$ or on the functions $f_i$, which ensure that the limit points are the critical points of $\sum_i f_i$.
3) When such conditions are not satisfied, we also propose a simple modification of the algorithm which allows one to recover the sought behavior.
4) We extend our results to a broader setting, assuming that the matrices $(W_n)_{n\ge 1}$ are no longer i.i.d., but may depend on both the current observations and the past estimates. We also investigate a general stochastic approximation framework which goes beyond the model (4) and beyond the sole problem of distributed optimization.
5) We characterize the convergence rate of the algorithm in the form of a central limit theorem. Unlike [14], we address the case where the sequence $(W_n)_{n\ge 1}$ is not necessarily doubly stochastic. We show that non-doubly stochastic matrices have an influence on the asymptotic error covariance (even if they are doubly stochastic in average). On the other hand, we prove that when the matrix $W_n$ is doubly stochastic for all n, the asymptotic covariance is identical to the one obtained in a centralized setting.

The paper is organized as follows. Section II is a gentle presentation of our results in the special case of distributed optimization (see (3)), assuming in addition that the sequence $(W_n)_n$ is i.i.d. In Section III we provide the general setting used to study almost sure convergence. Almost sure convergence is studied in Section IV. Section V investigates convergence rates. Conclusions and numerical results complete the paper.

Notations. Throughout the paper, vectors are column vectors. The random variables $W_n \in \mathbb{R}^{N\times N}$ and $Y_n := (Y_{n,1}^T,\dots,Y_{n,N}^T)^T \in \mathbb{R}^{dN}$, $n\ge 1$, are defined on the same measurable space equipped with a probability $\mathbb{P}$; $\mathbb{E}$ denotes the associated expectation. For any $n\ge 1$, define the σ-field $\mathcal{F}_n := \sigma(\theta_0, W_1,\dots,W_n, Y_1,\dots,Y_n)$, where $\theta_0$ is the (possibly random) initial point of the algorithm. It is assumed that for any $i \in \{1,\dots,N\}$, $(\theta_{n,i})_{n\ge 0}$ satisfies the update equations (1)-(2); and we set
$$\theta_n := (\theta_{n,1}^T,\dots,\theta_{n,N}^T)^T\,.$$
For any vector $x \in \mathbb{R}^\ell$, $|x|$ represents the Euclidean norm of x. $I_N$ is the N×N identity matrix. $J := \mathbf{1}\mathbf{1}^T/N$ denotes the orthogonal projector onto the linear span of the all-one N×1 vector $\mathbf{1}$, and $J_\perp := I_N - J$. We denote by $\otimes$ the Kronecker product between matrices. For a matrix A, the spectral norm is denoted by $\|A\|$, and the spectral radius is denoted by $r(A)$ whenever A is a square matrix.
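Before specializing the framework, a minimal simulation may help fix ideas. The following Python sketch (our own illustration, not part of the paper: the quadratic local costs, the pairwise-averaging gossip matrices and the step size $\gamma_n = 0.1/n^{0.7}$ are all illustrative choices) runs iterations (1)-(2) with the gradient observations (4), for d = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_iter = 5, 20000
alpha = np.array([-3.0, 5.0, 5.0, 1.0, -3.0])   # illustrative quadratic targets
theta = np.zeros(N)                              # theta_{0,i} = 0, d = 1

def sample_W(rng):
    # Pairwise averaging of a uniformly drawn pair (i, j): the matrix is
    # row-stochastic (here even doubly stochastic, as in pairwise gossip).
    W = np.eye(N)
    i, j = rng.choice(N, size=2, replace=False)
    W[i, :] = W[j, :] = 0.0
    W[i, i] = W[i, j] = W[j, i] = W[j, j] = 0.5
    return W

for n in range(1, n_iter + 1):
    gamma = 0.1 / n ** 0.7                         # decreasing step size
    Y = -(theta - alpha) + rng.standard_normal(N)  # noisy gradients, model (4)
    theta_tilde = theta + gamma * Y                # local step (1)
    theta = sample_W(rng) @ theta_tilde            # gossip step (2)

print(theta)  # the N entries agree (consensus), close to the mean of alpha
```

Because pairwise averaging yields doubly stochastic matrices, the common limit here is a critical point of $\sum_i f_i$; Section II-C shows that this is not automatic for other protocols.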

II. DISTRIBUTED OPTIMIZATION

A. Framework

We first sketch our results in the special case of distributed optimization, i.e., when the "innovation" $Y_{n,i}$ of the algorithm in (1) has the form (4).

Assumption 1.
1) $f_i : \mathbb{R}^d \to \mathbb{R}$ is differentiable and $\nabla f_i$ is locally Lipschitz-continuous.
2) For any Borel set A of $\mathbb{R}^{dN}$, $\mathbb{P}[\xi_{n+1} \in A \mid \mathcal{F}_n] = \nu_{\theta_n}(A)$ almost surely (a.s.), where $(\nu_\theta)_{\theta\in\mathbb{R}^{dN}}$ is a family of probability measures such that $\int z\, d\nu_\theta(z) = 0$ and $\sup_{\theta\in K} \int |z|^2 d\nu_\theta(z) < \infty$ for any compact set $K \subset \mathbb{R}^{dN}$.

For simplicity, the matrix-valued process $(W_n)_n$ will be assumed i.i.d. and independent of both the process $(Y_n)_n$ and $\theta_0$. This assumption will be relaxed in Section III.

Assumption 2.
1) For any $n\ge 0$, conditionally to $\mathcal{F}_n$, the random variables $Y_{n+1}$ and $W_{n+1}$ are independent.
2) $(W_n)_{n\ge 1}$ is an i.i.d. sequence of row-stochastic matrices (i.e., $W_n\mathbf{1} = \mathbf{1}$ for any n) with non-negative entries.
3) The spectral radius of the matrix $\mathbb{E}[W_1^T J_\perp W_1]$ is strictly lower than 1.

The row-stochasticity assumption is a rather mild condition. In many works, it is also assumed that $W_n$ is column-stochastic, i.e., $\sum_i w_n(i,j) = 1$ for any j, though this assumption is not required in this work. Assumption 2-3) is a contraction condition which is required to drive the network to a consensus.

Assumption 3. The deterministic step-size sequence $(\gamma_n)_{n\ge 1}$ satisfies $\gamma_n > 0$ and:
1) $\lim_n \gamma_n/\gamma_{n+1} = 1$,
2) $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^{1+\lambda} < \infty$ for some $\lambda \in (0,1)$,
3) $\sum_n |\gamma_n - \gamma_{n-1}| < \infty$.

Polynomially decreasing sequences $\gamma_n \sim \gamma_\star/n^a$ as $n\to\infty$, for some $a \in (1/2, 1]$ and $\gamma_\star > 0$, satisfy Assumption 3.

Finally, we introduce a stability-like condition.

Assumption 4. Almost surely, there exists a compact set K of $\mathbb{R}^{dN}$ such that $\theta_n \in K$ for any $n\ge 0$.

Assumption 4 claims that the sequence $(\theta_n)_{n\ge 0}$ remains in a compact set; this compact set may depend on the path. It is implied by the stronger assumption "there exists a compact set K of $\mathbb{R}^{dN}$ such that, with probability one, $\theta_n \in K$ for any $n\ge 0$". Checking Assumption 4 is not always an easy task. As the main scope of this paper is the analysis of convergence rather than stability, it is taken for granted; we refer to [14] for sufficient conditions implying stability.

B. Results

The following lemma follows from standard algebra. Set $\bar W := \mathbb{E}[W_1]$.

Lemma 1. Under Assumptions 2-2) and 2-3), the N×1 vector v defined by $v^T := \frac{1}{N}\mathbf{1}^T \bar W (I_N - J_\perp \bar W)^{-1}$ is the unique non-negative vector satisfying $v^T = v^T \bar W$ and $v^T\mathbf{1} = 1$.

If A is a set, we say that $(x_n)_n$ converges to A if $\inf\{|x_n - y| : y \in A\}$ tends to zero as $n\to\infty$.

Theorem 1. Let Assumptions 1, 2, 3 and 4 hold true. Define the function $V : \mathbb{R}^d \to \mathbb{R}$ by
$$V(\theta) := \sum_{i=1}^N v_i f_i(\theta)\,, \qquad (5)$$
where $v = (v_1,\dots,v_N)$ is the vector defined in Lemma 1. Assume that the set $L = \{\theta \in \mathbb{R}^d : \nabla V(\theta) = 0\}$ of critical points of V is non-empty and included in some level set $\{\theta : V(\theta) \le C\}$, and that $V(L)$ has an empty interior. Assume also that the level sets $\{\theta : V(\theta) \le C\}$ are either empty or compact. The following holds with probability one:
1) The algorithm converges to a consensus, i.e., $\lim_n \max_{i,j} |\theta_{n,i} - \theta_{n,j}| = 0$.
2) The sequence $(\theta_{n,1})_{n\ge 0}$ converges to L as $n\to\infty$.

Theorem 1 is proved in Appendix A. Its proof consists in showing that it is a special case of the more general convergence result given by Theorem 2.
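Lemma 1 makes v computable whenever $\bar W = \mathbb{E}[W_1]$ is known. The sketch below (our own illustration; the matrix is the neighborhood-averaging matrix $W_1 = (I+D)^{-1}(I+A)$ reused in Section VII) evaluates the closed form and checks the fixed-point property, then computes the minimizer of the weighted objective (5) for quadratic costs.

```python
import numpy as np

# Mean matrix W_bar = (I + D)^{-1} (I + A) for the 5-node graph of Section VII
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
N = A.shape[0]
D = np.diag(A.sum(axis=1))
W_bar = np.linalg.inv(np.eye(N) + D) @ (np.eye(N) + A)  # row- but not column-stochastic

ones = np.ones(N)
J_perp = np.eye(N) - np.outer(ones, ones) / N

# Closed form of Lemma 1: v^T = (1/N) 1^T W_bar (I - J_perp W_bar)^{-1}
v = (ones @ W_bar @ np.linalg.inv(np.eye(N) - J_perp @ W_bar)) / N
assert np.allclose(v @ W_bar, v) and np.isclose(v.sum(), 1.0)

# For quadratic costs f_i = 0.5 (theta - alpha_i)^2, the minimizer of V is v^T alpha
alpha = np.array([-3.0, 5.0, 5.0, 1.0, -3.0])
print(v, v @ alpha)   # v is proportional to (1 + d_i); v @ alpha ≈ 1.24, <alpha> = 1
```

For this particular $\bar W$, v is the stationary distribution of a lazy random walk on the graph, i.e., proportional to one plus the node degree, which is why $v^T\alpha$ differs from the plain average of the $\alpha_i$'s.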

C. Success and Failure of Convergence

The algorithm converges to L, which in general is not the set of critical points of $\theta \mapsto \sum_i f_i(\theta)$. We discuss some special cases where both sets actually coincide.

Scenario 1. All functions $f_i$ are strictly convex and admit a (unique) common minimizer $\theta_\star$. This case is for instance investigated by [13] in the framework of statistical estimation in wireless sensor networks. The set L is formed by the minimizers of $\sum_i f_i$. Relaxing strict convexity, note that when the functions $f_i$ are just convex with a common minimizer and $v_i > 0$ for any i, then L is again formed by the minimizers of $\sum_i f_i$, and the same conclusion holds.

Scenario 2. $\bar W$ is column-stochastic, i.e., $\mathbf{1}^T\bar W = \mathbf{1}^T$. In this case, the vector v given by Lemma 1 is $\frac{1}{N}\mathbf{1}$. Consequently, $V = \frac{1}{N}\sum_i f_i$. Here again, L is the set of minimizers of $\sum_i f_i$. An example of a random communication protocol (see [28]) satisfying $\mathbf{1}^T\bar W = \mathbf{1}^T$ is the following: at time n, a single node i wakes up at random with probability $p_i$ and broadcasts its temporary update $\tilde\theta_{n,i}$ to all its neighbors $N_i$. Any neighbor j computes the weighted average $\theta_{n,j} = \beta\tilde\theta_{n,i} + (1-\beta)\tilde\theta_{n,j}$. On the other hand, any node k which does not belong to the neighborhood of i (including i itself) sets $\theta_{n,k} = \tilde\theta_{n,k}$. Then, given that i wakes up, the (k,ℓ)-th entry of $W_n$ is given by:
$$w_n(k,\ell) = \begin{cases} 1 & \text{if } k \notin N_i \text{ and } k = \ell\,,\\ \beta & \text{if } k \in N_i \text{ and } \ell = i\,,\\ 1-\beta & \text{if } k \in N_i \text{ and } k = \ell\,,\\ 0 & \text{otherwise.}\end{cases}$$
Here, $W_n$ is not doubly stochastic. However, when nodes wake up according to the uniform distribution ($p_i = 1/N$ for all i), it is easily seen that $\mathbf{1}^T\mathbb{E}[W_n] = \mathbf{1}^T$.

D. Enhanced Algorithm with Weighted Step Sizes

We end this section with a simple modification of the initial algorithm in the case where $v_i > 0$ for all i. Let us replace the local step (1) of the algorithm with
$$\tilde\theta_{n,i} := \theta_{n-1,i} + \gamma_n v_i^{-1} Y_{n,i}\,, \qquad (6)$$
where $Y_{n,i}$ is still given by (4). As an immediate corollary of Theorem 1, the algorithm (6)-(2) drives the agents to a consensus which coincides with the critical points of $\sum_i f_i$. Of course, this modification requires each node i to have some prior knowledge of the communication protocol through the coefficient $v_i$ (in that case, questions related to a distributed computation of the $v_i$'s would be of interest, but are beyond the scope of this paper).

III. DISTRIBUTED ROBBINS-MONRO ALGORITHM: GENERAL SETTING

In this section, we consider the general setting described by Algorithm (1)-(2) with weaker conditions on the distribution of the observations $Y_n$. We also weaken the assumptions on $(Y_{n+1}, W_{n+1})$: our general framework includes the case where the communication protocol is adapted at each time n.

We denote by $M_1$ the set of N×N non-negative row-stochastic matrices, and we endow $M_1$ with its Borel σ-field.

Assumption 5.
1) There exists a collection of distributions $(\mu_\theta)_{\theta\in\mathbb{R}^{dN}}$ on $\mathbb{R}^{dN}\times M_1$ such that a.s. for any Borel set A:
$$\mathbb{P}[(Y_{n+1}, W_{n+1}) \in A \mid \mathcal{F}_n] = \mu_{\theta_n}(A)\,.$$
In addition, the application $\theta \mapsto \mu_\theta(A)$, defined on $\mathbb{R}^{dN}$, is measurable for any A in the Borel σ-field of $\mathbb{R}^{dN}\times M_1$.
2) For any compact set $K \subset \mathbb{R}^{dN}$, $\sup_{\theta\in K} \int |y|^2 d\mu_\theta(y,w) < \infty$.

Assumption 5-1) means that the joint distribution of the r.v.'s $Y_{n+1}$ and $W_{n+1}$ depends on the past $\mathcal{F}_n$ only through the last value $\theta_n$ of the vector of estimates. It also implies that $W_n$ is almost-surely (a.s.) non-negative and row-stochastic. Since the variables $(Y_{n+1}, W_{n+1})$ are not necessarily independent conditionally to the past $\mathcal{F}_n$ and $(W_n)_{n\ge 1}$ are no longer i.i.d., the contraction condition on $J_\perp W_1$ is replaced with the following condition:

Assumption 6. For any compact set $K \subset \mathbb{R}^{dN}$, there exists $\rho_K \in (0,1)$ such that for all $\theta \in K$, $\phi \in \mathbb{R}^{dN}$ and $A \in \mathbb{R}^{dN\times dN}$,
$$\int \big|((J_\perp w)\otimes I_d)(\phi + Ay)\big|^2\, d\mu_\theta(y,w) \;\le\; \rho_K \int |\phi + Ay|^2\, d\mu_\theta(y,w)\,.$$

Assumption 6 is satisfied as soon as the spectral radius $r\big(\mathbb{E}[W_1^T J_\perp W_1 \mid \theta_0, Y_1]\big)$ is upper bounded, by a constant independent of $(\theta_0, Y_1)$ and strictly lower than one, whenever $\theta_0 \in K$. When $(W_n)_{n\ge 1}$ is an i.i.d. sequence, independent of the sequence $(Y_n)_{n\ge 1}$ and of $\theta_0$, the above condition reduces to $r(\mathbb{E}[W_1^T J_\perp W_1]) < 1$.
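This spectral condition is easy to check numerically for a given protocol. The sketch below (our own illustration, using the broadcast example of Section II-C with $\beta = 1/2$ and uniform wake-up on the Section VII graph) computes $\mathbb{E}[W_1^T J_\perp W_1]$ exactly by enumerating the N equally likely matrices, and also confirms the average column-stochasticity.

```python
import numpy as np

A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
N, beta = A.shape[0], 0.5
neighbors = [np.flatnonzero(A[i]) for i in range(N)]

def broadcast_W(i):
    """Matrix of the broadcast protocol of Sec. II-C when node i wakes up."""
    W = np.eye(N)
    for k in neighbors[i]:
        W[k, k] = 1.0 - beta   # neighbor keeps part of its own iterate
        W[k, i] = beta         # ... and mixes in the broadcast value
    return W

Ws = [broadcast_W(i) for i in range(N)]          # uniform wake-up: p_i = 1/N
W_mean = sum(Ws) / N
J_perp = np.eye(N) - np.ones((N, N)) / N
M = sum(W.T @ J_perp @ W for W in Ws) / N        # E[W_1^T J_perp W_1]

print(np.allclose(np.ones(N) @ W_mean, np.ones(N)))   # doubly stochastic in average
print(np.linalg.eigvalsh(M).max())                    # spectral radius < 1
```

Each realization $W_n$ is row- but not column-stochastic; only the mean matrix is doubly stochastic, which is exactly the situation covered by Scenario 2.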

IV. CONVERGENCE ANALYSIS

For any vector x of $\mathbb{R}^{dN}$ of the form $x = (x_1^T,\dots,x_N^T)^T$ with $x_i\in\mathbb{R}^d$, we define the vector of $\mathbb{R}^d$: $\langle x\rangle := (x_1 + \dots + x_N)/N = (\mathbf{1}^T\otimes I_d)x/N$. We extend the notation to matrices $X\in\mathbb{R}^{dN\times k}$ as $\langle X\rangle = \frac{1}{N}(\mathbf{1}^T\otimes I_d)X$. We write $\mathbf{J} := J\otimes I_d$ and $\mathbf{J}_\perp := J_\perp\otimes I_d$. Note that $\mathbf{J}x = \mathbf{1}\otimes\langle x\rangle$. Algorithm (1)-(2) can be written in matrix form as
$$\theta_n = \mathbf{W}_n(\theta_{n-1} + \gamma_n Y_n)\,, \quad\text{where } \mathbf{W}_n := W_n\otimes I_d\,. \qquad (7)$$

We decompose the estimate vector $\theta_n$ into two components, $\theta_n = \mathbf{1}\otimes\langle\theta_n\rangle + \mathbf{J}_\perp\theta_n$. In Section IV-A, we analyze the asymptotic behavior of the disagreement vector $\mathbf{J}_\perp\theta_n$. The study of the average vector $\langle\theta_n\rangle$ will be addressed in Section IV-B. These two sections are prefaced by a result which establishes the dynamics of these sequences. Set $\alpha_n := \gamma_n/\gamma_{n+1}$ and
$$\phi_n := \gamma_{n+1}^{-1}\,\mathbf{J}_\perp\theta_n\,. \qquad (8)$$
The following lemma is left to the reader.

Lemma 2. For each n, let $\theta_n$ be given by (7) and let $W_n$ be row-stochastic. Then,
$$\langle\theta_n\rangle = \langle\theta_{n-1}\rangle + \gamma_n\,\langle\mathbf{W}_n(Y_n + \phi_{n-1})\rangle\,, \qquad (9)$$
$$\phi_n = \alpha_n\,\mathbf{J}_\perp\mathbf{W}_n(\phi_{n-1} + Y_n)\,. \qquad (10)$$

A. Disagreement Vector

Lemma 3. Let Assumptions 3-1), 5 and 6 hold. Let $(\phi_n)_{n\ge 0}$ be the sequence given by (8). For any compact set $K\subset\mathbb{R}^{dN}$,
$$\sup_n\,\mathbb{E}\big[\,|\phi_n|^2\,\mathbb{I}_{\bigcap_{j\le n-1}\{\theta_j\in K\}}\big] < \infty\,.$$
The result is proved in Appendix B. This lemma implies that for any compact set, there exists C such that for any $n\ge 0$, $\mathbb{E}[|\mathbf{J}_\perp\theta_n|^2\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}}] \le C\gamma_{n+1}^2$.

Proposition 1 (Agreement). Under Assumptions 3-1), 3-2), 4, 5 and 6, $\lim_{n\to\infty}\mathbf{J}_\perp\theta_n = 0$ a.s.

Proof: Let $(K_m)_{m\ge 0}$ be an increasing sequence of compact subsets of $\mathbb{R}^{dN}$ such that $\bigcup_m K_m = \mathbb{R}^{dN}$. Under Assumption 4, it is equivalent to prove that for any $m\ge 0$, $\lim_n \mathbf{J}_\perp\theta_n\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}} = 0$ a.s. Let $m\ge 0$. Lemma 3 implies that there exists a constant C such that for any n, $\mathbb{E}[|\mathbf{J}_\perp\theta_n|^2\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}}] \le C\gamma_{n+1}^2$. By Assumption 3-2), this implies that $\sum_n\mathbb{E}[|\mathbf{J}_\perp\theta_n|^2\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}}]$ is finite; hence $\sum_n|\mathbf{J}_\perp\theta_n|^2\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}}$ is finite a.s., which yields $\lim_n\mathbf{J}_\perp\theta_n\,\mathbb{I}_{\bigcap_k\{\theta_k\in K_m\}} = 0$ a.s. ∎

B. Average Vector

We now study the long-time behavior of the average estimate $\langle\theta_n\rangle$. Define, for any $\theta\in\mathbb{R}^{dN}$:
$$\mathbf{W}_\theta := \int (w\otimes I_d)\,d\mu_\theta(y,w)\,, \qquad (11)$$
$$z_\theta := \int (w\otimes I_d)\,y\,d\mu_\theta(y,w)\,, \qquad (12)$$
and let us assume regularity-in-θ properties of these quantities.

Assumption 7. There exists $\lambda_\mu\in(1/2,1]$ such that for any compact set $K\subset\mathbb{R}^{dN}$, there exists a constant $C>0$ such that for any $\theta,\theta'\in K$,
$$\|\mathbf{W}_\theta - \mathbf{W}_{\theta'}\| \le C\,|\theta-\theta'|^{\lambda_\mu}\,, \qquad (13)$$
$$|\langle z_\theta\rangle - \langle z_{\mathbf{J}\theta}\rangle| \le C\,|\mathbf{J}_\perp\theta|^{\lambda_\mu}\,, \qquad (14)$$
$$|\mathbf{J}_\perp z_\theta - \mathbf{J}_\perp z_{\theta'}| \le C\,|\theta-\theta'|^{\lambda_\mu}\,. \qquad (15)$$

We define the mean field function $h:\mathbb{R}^d\to\mathbb{R}^d$ by
$$h(\vartheta) = \big\langle z_{\mathbf{1}\otimes\vartheta} + \mathbf{W}_{\mathbf{1}\otimes\vartheta}\,m^{(1)}_{\mathbf{1}\otimes\vartheta}\big\rangle\,, \qquad (16)$$
where $m^{(1)}_{\mathbf{1}\otimes\vartheta}$ is the expectation of the invariant distribution $\pi_{1,\mathbf{1}\otimes\vartheta}$, given by (see Proposition 4 in Appendix C)
$$m^{(1)}_\theta := (I_{dN} - \mathbf{J}_\perp\mathbf{W}_\theta)^{-1}\,\mathbf{J}_\perp z_\theta\,.$$
Note that under Assumption 6, this quantity is well defined since for any compact $K\subset\mathbb{R}^{dN}$, $\sup_{\theta\in K}\|\mathbf{J}_\perp\mathbf{W}_\theta\|\le\sqrt{\rho_K}$.

Assumption 8.
1) $h:\mathbb{R}^d\to\mathbb{R}^d$ is continuous.
2) There exists a continuously differentiable function $V:\mathbb{R}^d\to\mathbb{R}_+$ such that
a) there exists $M>0$ such that $L = \{\vartheta\in\mathbb{R}^d : \nabla V(\vartheta)^T h(\vartheta) = 0\}\subset\{V\le M\}$; in addition, $V(L)$ has an empty interior;
b) there exists $M' > M$ such that $\{V\le M'\}$ is a compact subset of $\mathbb{R}^d$;
c) for any $\vartheta\in\mathbb{R}^d\setminus L$, $\nabla V(\vartheta)^T h(\vartheta) < 0$.

Assumptions 5, 6 and 7 imply that $\vartheta\mapsto m^{(1)}_{\mathbf{1}\otimes\vartheta}$ is continuous on $\mathbb{R}^d$ (see Proposition 5 in Appendix C). Therefore, a sufficient condition for Assumption 8-1) is to strengthen the conditions (14)-(15) of Assumption 7 as follows: $|z_\theta - z_{\theta'}| \le C\,|\theta-\theta'|^{\lambda_\mu}$.

Proposition 2. Let Assumptions 3, 4, 5, 6, 7 and 8 hold true. Assume in addition that $\lambda\le\lambda_\mu$, where $\lambda,\lambda_\mu$ are resp. given by Assumptions 3 and 7. The average sequence $(\langle\theta_n\rangle)_n$ converges almost-surely to a connected component of L.

The proof of Proposition 2 is given in Appendix D. It consists in verifying the assumptions of [29, Theorem 2].
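The identities (9)-(10) can be confirmed numerically; the following sanity check (our own sketch; the row-stochastic matrices and observations are arbitrary, d = 1) runs the exact update (7) next to the recursions of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5

def random_row_stochastic(rng, N):
    W = rng.random((N, N))
    return W / W.sum(axis=1, keepdims=True)

J_perp = np.eye(N) - np.ones((N, N)) / N
gamma = [0.1 / n ** 0.7 for n in range(1, 4)]   # gamma_1, gamma_2, gamma_3

theta = rng.standard_normal(N)                  # theta_0
phi = J_perp @ theta / gamma[0]                 # phi_0 = J_perp theta_0 / gamma_1
avg = theta.mean()                              # <theta_0>

for n in range(1, 3):
    W, Y = random_row_stochastic(rng, N), rng.standard_normal(N)
    theta = W @ (theta + gamma[n - 1] * Y)              # exact update (7)
    alpha = gamma[n - 1] / gamma[n]
    avg = avg + gamma[n - 1] * np.mean(W @ (Y + phi))   # recursion (9)
    phi = alpha * J_perp @ W @ (phi + Y)                # recursion (10)
    assert np.isclose(avg, theta.mean())                # avg matches <theta_n>
    assert np.allclose(gamma[n] * phi, J_perp @ theta)  # phi matches (8)
```

Both assertions hold exactly (up to floating point): the recursions are algebraic identities, using only $W_n(\mathbf{1}\otimes u) = \mathbf{1}\otimes u$ for row-stochastic $W_n$.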
C. Main Convergence Result

As a trivial consequence of Propositions 1 and 2, we have:

Theorem 2. Let Assumptions 3, 4, 5, 6, 7 and 8 hold true. Assume in addition that $\lambda\le\lambda_\mu$, where $\lambda,\lambda_\mu$ are resp. given by Assumptions 3 and 7. The following holds with probability one:
1) The algorithm converges to a consensus, i.e., $\lim_{n\to\infty}\mathbf{J}_\perp\theta_n = 0$;
2) $\theta_{n,1}$ converges to a connected component of L.

V. CONVERGENCE RATE

A. Main Result

We derive the rate of convergence of the sequence $\{\theta_n, n\ge 0\}$ to $\mathbf{1}\otimes\theta_\star$ for some $\theta_\star$ satisfying:

Assumption 9. $\theta_\star$ is a root of h, i.e., $h(\theta_\star) = 0$. Moreover, h is twice continuously differentiable in a neighborhood of $\theta_\star$, and the Jacobian $\nabla h(\theta_\star)$ is a Hurwitz matrix. Denote by $-L$, $L>0$, the largest real part of its eigenvalues.

The moment conditions on the conditional distributions of the observations $Y_n$ and the contraction assumption on the network have to be strengthened as follows:

Assumption 10. There exists $\tau\in(0,2)$ such that for any compact set $K\subset\mathbb{R}^{dN}$, one has $\sup_{\theta\in K}\int|y|^{2+\tau}\,d\mu_\theta(y,w) < \infty$.

Assumption 11. Let τ be given by Assumption 10. For any compact set $K\subset\mathbb{R}^{dN}$, there exists $\tilde\rho_K\in(0,1)$ such that for any $\phi\in\mathbb{R}^{dN}$,
$$\sup_{\theta\in K}\int\big|((J_\perp w)\otimes I_d)\phi\big|^{2+\tau}\,d\mu_\theta(y,w) \le \tilde\rho_K\,|\phi|^{2+\tau}\,.$$

We also go further in the regularity-in-θ of the integrals w.r.t. $\mu_\theta$. More precisely:

Assumption 12. There exists $\lambda_\mu\in(1/2,1]$ such that for any compact set $K\subset\mathbb{R}^{dN}$ there exists a constant C such that:
1) for any $\theta,\theta'\in K$, $|\langle z_\theta\rangle - \langle z_{\theta'}\rangle| \le C\,|\theta-\theta'|^{\lambda_\mu}$;
2) set $Q_A(x,y,w) := (x+y)^T(w\otimes I_d)^T\mathbf{J}_\perp A\,\mathbf{J}_\perp(w\otimes I_d)(x+y)$ for some dN×dN matrix A; for any $\theta,\theta'\in K$, $x\in\mathbb{R}^{dN}$ and any matrix A such that $\|A\|\le 1$,
$$\Big|\int Q_A(x,y,w)\,d\mu_\theta(y,w) - \int Q_A(x,y,w)\,d\mu_{\theta'}(y,w)\Big| \le C\,|\theta-\theta'|^{\lambda_\mu}(1+|x|^2)\,.$$

We finally have to strengthen the conditions on the step-size sequence.

Assumption 13. Let τ (resp. $\lambda_\mu$) be given by Assumption 10 (resp. Assumption 12). As $n\to\infty$, $\gamma_n\sim\gamma_\star/n^a$ for some $a\in\big((1+\lambda_\mu)^{-1}\vee(1+\tau/2)^{-1}\,;\,1\big]$ and $\gamma_\star>0$. In addition, if $a=1$ then $\gamma_\star > 1/(2L)$, where L is given by Assumption 9.

Define $m_\star^{(1)} := (I_{dN} - \mathbf{J}_\perp\mathbf{W}_{\mathbf{1}\otimes\theta_\star})^{-1}\mathbf{J}_\perp z_{\mathbf{1}\otimes\theta_\star}$ and $m_\star^{(2)} := (I_{d^2N^2} - \Phi_\star)^{-1}\zeta_\star$, where $z_\theta$ is defined in (12), where
$$\Phi_\star := \int T(w)\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,,$$
$$\zeta_\star := \int T(w)\,\mathrm{vec}\big(yy^T + 2m_\star^{(1)}y^T\big)\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,,$$
and where we used the notation $T(w) := ((J_\perp w)\otimes I_d)\otimes((J_\perp w)\otimes I_d)$. As will be seen in the proofs, $m_\star^{(1)}$ and $m_\star^{(2)}$ represent the asymptotic first-order moment and (vectorized) second-order moment of the r.v. $\phi_n$ defined by (8). Define also $R_\star(w) := (w\otimes I_d) - \mathbf{W}_{\mathbf{1}\otimes\theta_\star}$ and $\upsilon_\star(y,w) := (w\otimes I_d)y - z_{\mathbf{1}\otimes\theta_\star}$. Finally, define
$$\mathbf{A}_\star := \Big(\frac{\mathbf{1}^T}{N}\otimes I_d\Big)\Big(I_{dN} + \mathbf{W}_{\mathbf{1}\otimes\theta_\star}\big(I_{dN} - \mathbf{J}_\perp\mathbf{W}_{\mathbf{1}\otimes\theta_\star}\big)^{-1}\mathbf{J}_\perp\Big)\,,$$
$$\mathbf{R}_\star := \int \big(R_\star(w)\otimes R_\star(w)\big)\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,,$$
$$\mathbf{T}_\star := \int \big(\upsilon_\star(y,w)\otimes R_\star(w)\big)\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,,$$
$$\mathbf{S}_\star := \int \mathrm{vec}\big(\upsilon_\star(y,w)\upsilon_\star(y,w)^T\big)\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,.$$

We establish in Appendix E the following result.

Theorem 3. Let Assumptions 5-1), 6, 7 and 9 to 13 hold true. Let $U_\star$ be the positive-definite matrix given by
$$\mathrm{vec}\,U_\star = (\mathbf{A}_\star\otimes\mathbf{A}_\star)\big(\mathbf{R}_\star\, m_\star^{(2)} + 2\,\mathbf{T}_\star\, m_\star^{(1)} + \mathbf{S}_\star\big)\,.$$
Then, conditionally to the event $\{\lim_n\theta_n = \mathbf{1}\otimes\theta_\star\}$, the sequence $\{\gamma_n^{-1/2}(\langle\theta_n\rangle - \theta_\star),\ n\ge 0\}$ converges in distribution to a zero-mean Gaussian distribution with covariance V, where V is the unique positive-definite matrix satisfying
$$\nabla h(\theta_\star)\,V + V\,\nabla h(\theta_\star)^T = -U_\star \quad\text{if } a<1\,,$$
$$\big(I_d + 2\gamma_\star\nabla h(\theta_\star)\big)V + V\big(I_d + 2\gamma_\star\nabla h(\theta_\star)\big)^T = -2\gamma_\star U_\star \quad\text{if } a=1\,.$$

B. A Special Case: Doubly Stochastic Matrices

In this paragraph, let us investigate the special case where the $(W_n)_n$ are N×N doubly stochastic matrices. Note that in this case, (9) becomes $\langle\theta_n\rangle = \langle\theta_{n-1}\rangle + \gamma_n\langle Y_n\rangle$ and the mean field function h is equal to $h(\vartheta) = \langle\int y\,d\mu_{\mathbf{1}\otimes\vartheta}(y,w)\rangle$. Since $W_n$ is column-stochastic, $\int w\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)$ is column-stochastic, and we have $\mathbf{A}_\star = \frac{\mathbf{1}^T}{N}\otimes I_d$. Then, it is not difficult to check that $\mathbf{A}_\star R_\star(w) = 0$, which implies that $\mathbf{R}_\star = \mathbf{T}_\star = 0$. This yields the following corollary.

Corollary 1. In addition to the assumptions of Theorem 3, assume that the $(W_n)_n$ are N×N doubly stochastic matrices and set $\bar y_\star := \int y\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)$. Then
$$U_\star = \int\langle y - \bar y_\star\rangle\langle y - \bar y_\star\rangle^T\,d\mu_{\mathbf{1}\otimes\theta_\star}(y,w)\,.$$
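Given $\nabla h(\theta_\star)$ and $U_\star$, the limit covariance V of Theorem 3 is obtained by a standard Lyapunov solve. The sketch below (our own illustration; the values of $\nabla h(\theta_\star)$ and $U_\star$ are made up) covers both step-size regimes with SciPy.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative inputs: H = grad h(theta_star) (Hurwitz), U_star positive definite
H = np.array([[-1.0, 0.2],
              [0.0, -2.0]])
U_star = np.array([[1.0, 0.1],
                   [0.1, 0.5]])
gamma_star = 1.0   # with L = 1 here, gamma_star > 1/(2L) as Assumption 13 requires

# Case a < 1:  H V + V H^T = -U_star
V_slow = solve_continuous_lyapunov(H, -U_star)

# Case a = 1:  (I + 2 g* H) V + V (I + 2 g* H)^T = -2 g* U_star
B = np.eye(2) + 2.0 * gamma_star * H
V_fast = solve_continuous_lyapunov(B, -2.0 * gamma_star * U_star)

print(V_slow, V_fast, sep="\n")
```

In the scalar example of Section VII, $\nabla h(\theta_\star) = -1$ and $a = 0.7 < 1$, so the first equation gives $V = U_\star/2$ directly.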
VI. CONCLUDING REMARKS

In this paragraph, we informally draw some general conclusions from our study. We assimilate the communication protocol to the selection of the sequence $(W_n)_n$, which we assume i.i.d. in this paragraph for simplicity. We say that a protocol is doubly stochastic if $W_n$ is doubly stochastic for each n. We say that a protocol is doubly stochastic in average if $\mathbb{E}[W_n]$ is doubly stochastic for each n.

1) Consensus is fast. Theorem 3 states that the average estimation error converges to zero at rate $\sqrt{\gamma_n}$. This result was actually expected, as $\sqrt{\gamma_n}$ is the well-known convergence rate of standard stochastic approximation algorithms. On the other hand, Lemma 3 suggests that the disagreement vector $\mathbf{J}_\perp\theta_n$ goes to zero at rate $\gamma_n$, that is, one order of magnitude faster. Asymptotically, the fluctuations of the normalized estimation error $(\theta_n - \mathbf{1}\otimes\theta_\star)/\sqrt{\gamma_n}$ are fully supported by the consensus space. This remark also suggests analyzing non-stationary communication protocols, for which the number of transmissions per unit of time decreases with n. This problem is addressed in [14].

2) Non-doubly stochastic protocols generally influence the limit points. This issue is discussed in Section II-C. The choice of the matrices $W_n$ is likely to have an impact on the set of limit points of the algorithm. This may be inconvenient, especially in distributed optimization tasks.

3) Protocols that are doubly stochastic "in average" all lead to the same limit points. In the framework of distributed optimization, the latter set of limit points precisely coincides with the sought critical points of the minimization problem. It means that non-doubly stochastic protocols can be used, provided that they are doubly stochastic in average.

4) Asymptotically, doubly stochastic protocols perform as well as a centralized algorithm. By Corollary 1, if $W_n$ is chosen to be doubly stochastic for all n, the asymptotic error covariance characterized in Theorem 3 does not depend on the specific choice of $W_n$. In distributed optimization, the asymptotic performance is identical to the performance that would have been obtained by replacing $W_n$ with the orthogonal projector J, which would lead to the centralized update $\langle\theta_n\rangle = \langle\theta_{n-1}\rangle + \frac{\gamma_n}{N}\sum_{i=1}^N Y_{n,i}$. On the opposite, protocols that are not doubly stochastic generally influence the asymptotic error covariance, even if they are doubly stochastic in average.

VII. NUMERICAL RESULTS

We illustrate the convergence results obtained in Section II-B and discussed in Sections II-C and VI. We depict a particular case of the distributed optimization problem described in Section II. Consider a network of N = 5 agents and, for any i = 1,...,5, define a private cost function $f_i:\mathbb{R}\to\mathbb{R}$. We address the following minimization problem:
$$\min_{\theta\in\mathbb{R}}\ \sum_{i=1}^5 \tfrac{1}{2}(\theta-\alpha_i)^2\,, \qquad (17)$$
where $\alpha^T = (-3, 5, 5, 1, -3)$. The minimizer of (17) is $\theta_f = \langle\alpha\rangle = 1$. The network is represented by an undirected graph G = (V,E) with vertices $\{1,\dots,N\}$ and 6 fixed edges E. The corresponding adjacency matrix is given by
$$A = \begin{pmatrix} 0&1&0&1&0\\ 1&0&1&0&0\\ 0&1&0&1&1\\ 1&0&1&0&1\\ 0&0&1&1&0\end{pmatrix}.$$
We choose $\theta_{0,i} = 0$ for each agent i and a step-size sequence of the form $\gamma_n = 0.1/n^{0.7}$. The observations $Y_{n,i}$ are defined as in (4): $(\xi_{n,i})_{n,i}$ is an i.i.d. sequence with Gaussian distribution $N(0,\sigma^2)$, where $\sigma^2 = 1$.

Figure 1 illustrates the two results of Theorem 1 for different gossip matrices $(W_n)_n$. First, Figure 1(a) addresses the convergence of the sequence $(\theta_{n,1})_{n\ge 0}$ as a function of n, to show the influence of the matrices $W_n$ on the limit points. In particular, the dashed curve corresponds to the algorithm (1)-(2) when $W_n$ is assumed to be fixed and deterministic ($W_n = W_1$ for all n); we select $W_1$ in such a way that each agent computes the average of the temporary estimates in its neighborhood. This is equivalent to setting $W_1 = (I_N + D)^{-1}(I_N + A)$, where D is the diagonal matrix containing the degrees, i.e., $D(i,i) = \sum_{j=1}^N A(i,j)$ for each agent i. Note that $W_1$ is not doubly stochastic since $\mathbf{1}^T W_1 \ne \mathbf{1}^T$. Computing the left Perron eigenvector defined by Lemma 1 yields the minimizer of $V = \sum_i v_i f_i$, namely $\theta_V = v^T\alpha = 1.24$. In that case, the sequence $(\theta_{n,1})_n$ converges to $\theta_\star = \theta_V$ instead of the desired $\theta_\star = \theta_f$. Figure 1(a) includes the trajectory of $\theta_{n,1}$ generated by Algorithm (6)-(2) with $W_1 = (I_N+D)^{-1}(I_N+A)$: as proposed in Section II-D, with the weighted step sizes $\gamma_n v_i^{-1}$ the sequence now converges to the sought value $\theta_f$.

Figure 1(a) also illustrates the convergence behavior in Scenario 2, where the limit point $\theta_\star$ of Algorithm (1)-(2) coincides with $\theta_f$. In that case, we consider two standard models for $W_n$, namely the pairwise gossip of [23] and the broadcast gossip of [28] (we set $\beta = \frac12$). Finally, the plain line in Figure 1(a) shows the performance of the algorithm proposed by [17] for distributed optimization, which is based on a synchronous version of the push-sum model of [27].

We conclude the illustration of Theorem 1 with the results on the consensus convergence for the same examples of $W_n$ considered in Figure 1(a). Figure 1(b) represents the norm of the scaled disagreement vector as a function of n. As expected from Theorem 1-1), consensus is asymptotically achieved independently of the limit point, i.e., $\theta_f$ or $\theta_V$. Note that the synchronous models of $W_1$ and [17] require N transmissions at each iteration n, whereas the gossip protocols of [23] and [28] only require two and one transmissions respectively, due to their asynchronous nature. This may explain the gap between the curves in Figure 1(b) regarding the convergence rate towards the consensus.
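This experiment is easy to reproduce. The compact sketch below (our own reimplementation; the random seed and the horizon are arbitrary) exhibits the two limit points: $\theta_V = v^T\alpha \approx 1.24$ for the plain algorithm with the fixed matrix $W_1$, and $\theta_f = 1$ for the weighted variant (6).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_iter = 5, 100000
alpha = np.array([-3.0, 5.0, 5.0, 1.0, -3.0])
A = np.array([[0,1,0,1,0], [1,0,1,0,0], [0,1,0,1,1], [1,0,1,0,1], [0,0,1,1,0]], float)
W1 = np.linalg.inv(np.eye(N) + np.diag(A.sum(axis=1))) @ (np.eye(N) + A)
v = (1.0 + A.sum(axis=1)) / (N + A.sum())   # Perron vector of Lemma 1 for this W1

def run(step_scale):
    theta = np.zeros(N)
    for n in range(1, n_iter + 1):
        gamma = 0.1 / n ** 0.7
        Y = -(theta - alpha) + rng.standard_normal(N)   # noisy gradients (4)
        theta = W1 @ (theta + gamma * step_scale * Y)   # (1)-(2) with W_n = W1
    return theta.mean()

print(run(np.ones(N)), v @ alpha)   # plain algorithm: limit near theta_V ≈ 1.24
print(run(1.0 / v))                 # weighted steps (6): limit near theta_f = 1
```

Swapping the fixed $W_1$ for a doubly stochastic sampler (e.g. pairwise gossip) makes the plain algorithm converge to $\theta_f$ as well, in line with Scenario 2.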

[Figure 1 about here]
Figure 1: Convergence results of Theorem 1 for different communication schemes $(W_n)_n$. (a) Trajectories of $\theta_{n,1}$ as a function of n: centralized algorithm; Algorithms (1)-(2) and (6)-(2) with $W_n = (I+D)^{-1}(I+A)$ fixed; Algorithm (1)-(2) with the pairwise gossip of [23] and the broadcast gossip of [28]; algorithm of [17]. (b) Scaled disagreement $\sqrt{\frac{1}{N}\sum_{i=1}^N(\theta_{n,i}-\langle\theta_n\rangle)^2}$ as a function of n, for the same schemes.

[Figure 2 about here]
Figure 2: Asymptotic analysis of the normalized average error $\frac{1}{\sqrt{\gamma_n}}(\langle\theta_n\rangle-\theta_\star)$ of Algorithm (1)-(2) for different communication schemes $(W_n)_n$, after n = 30000 iterations and over 100 independent Monte-Carlo runs. (a) Boxplots of the normalized average error. (b) Empirical distribution (dark bars) versus the theoretical distribution given by Theorem 3 (solid line), for the centralized scheme, the pairwise gossip of [23] and the broadcast gossip of [28].

The result of Theorem 3 is illustrated in Figure 2, which leads to the concluding remark 4) of Section VI. Figures 2(a) and 2(b) display the asymptotic analysis of the normalized average error $\gamma_n^{-1/2}(\langle\theta_n\rangle - \theta_\star)$. Indeed, once convergence is achieved, the asymptotic distribution can be characterized by the closed form of the variance $U_\star$. In this example, Theorem 3 states that $\gamma_n^{-1/2}(\langle\theta_n\rangle - \theta_\star)$ converges in distribution to a r.v. $\sim N(0,V)$, where $\nabla h(\theta_\star) = -1$ and thus the variance is $V = U_\star/2$. The first boxplot and the first histogram in Figure 2 are related to the algorithm implemented in a centralized manner. We consider the distributed algorithm (1)-(2) with different choices of $W_n$: the pairwise gossip of [23], the broadcast gossip of [28] and the fixed $W_1$ defined by $(I_N+D)^{-1}(I_N+A)$. The normal distribution predicted by Theorem 3 is coherent with the empirical results.

APPENDIX A
PROOF OF THEOREM 1

We prove that Assumptions 5 to 8 hold; Theorem 1 then follows from Theorem 2. For any $\theta = (\theta_1,\dots,\theta_N)\in\mathbb{R}^{dN}$ with $\theta_i\in\mathbb{R}^d$, define the $\mathbb{R}^{dN}$-valued function g by $g(\theta) := (-\nabla f_1(\theta_1)^T,\dots,-\nabla f_N(\theta_N)^T)^T$. Under Assumptions 2-1) and 2-2), for any Borel set A×B of $\mathbb{R}^{dN}\times M_1$, $\mathbb{P}[(Y_{n+1},W_{n+1})\in A\times B\mid\mathcal{F}_n] = \mathbb{P}[Y_{n+1}\in A\mid\mathcal{F}_n]\,\mathbb{P}[W_{n+1}\in B]$. In addition, by Assumption 1 and Eq. (4), $\mathbb{P}[Y_{n+1}\in A\mid\mathcal{F}_n] = \int\mathbb{I}_A(g(\theta_n)+z)\,d\nu_{\theta_n}(z)$. The above discussion provides the expression of $\mu_\theta$ in Assumption 5-1). In addition, under Assumption 1-2), for any compact set K of $\mathbb{R}^{dN}$,
$$\sup_{\theta\in K}\int|y|^2\,d\mu_\theta(y,w) = \sup_{\theta\in K}\Big(|g(\theta)|^2 + \int|z|^2\,d\nu_\theta(z)\Big) < \infty\,,$$
which proves Assumption 5-2). Assumption 6 easily follows from Assumption 2-3). The regularity conditions of Assumption 7 are satisfied with $\lambda_\mu = \delta$, where δ is the local Hölder exponent of the gradients ($\delta = 1$ under the Lipschitz condition of Assumption 1): observe indeed that the left-hand side of (13) is zero, and that (14) and (15) hold as soon as the $(\nabla f_i)_i$ are locally Hölder-continuous. Again, the expression of $\mu_\theta$ implies that $\mathbf{W}_\theta = \mathbb{E}[W_1]\otimes I_d$. Therefore, $h(\vartheta) = \big\langle\big(\mathbb{E}[W_1](I_N - J_\perp\mathbb{E}[W_1])^{-1}\otimes I_d\big)\,g(\mathbf{1}\otimes\vartheta)\big\rangle = -\sum_{i=1}^N v_i\nabla f_i(\vartheta)$, which completes the proof.

APPENDIX B
PROOF OF LEMMA 3

From (10), we compute $|\phi_n|^2 = \alpha_n^2(\phi_{n-1}+Y_n)^T\mathbf{W}_n^T\mathbf{J}_\perp\mathbf{W}_n(\phi_{n-1}+Y_n)$. Using Assumption 5-1), $\mathbb{E}[|\phi_n|^2\mid\mathcal{F}_{n-1}]$ is equal to
$$\alpha_n^2\int(\phi_{n-1}+y)^T(w\otimes I_d)^T\mathbf{J}_\perp(w\otimes I_d)(\phi_{n-1}+y)\,d\mu_{\theta_{n-1}}(y,w)\,.$$
By the Fubini theorem and Assumption 6, there exists $\rho_K\in(0,1)$ such that for any $n\ge 1$, $\mathbb{E}[|\phi_n|^2\mid\mathcal{F}_{n-1}] \le \alpha_n^2\rho_K\int|\phi_{n-1}+y|^2\,d\mu_{\theta_{n-1}}(y,w)$. By Assumption 5-2), there exists a constant C such that for any $n\ge 1$, almost-surely,
$$\mathbb{E}[|\phi_n|^2\mid\mathcal{F}_{n-1}]\,\mathbb{I}_{\theta_{n-1}\in K} \le \alpha_n^2\rho_K\big(|\phi_{n-1}|^2 + 2|\phi_{n-1}|\sqrt C + C\big)\,.$$
Set $U_n := |\phi_n|^2\,\mathbb{I}_{\bigcap_{j\le n-1}\{\theta_j\in K\}}$. Upon noting that $\mathbb{I}_{\bigcap_{j\le n-1}\{\theta_j\in K\}} \le \mathbb{I}_{\bigcap_{j\le n-2}\{\theta_j\in K\}}$, the previous inequality implies $\mathbb{E}[U_n] \le \alpha_n^2\rho_K\big(\mathbb{E}[U_{n-1}] + 2\sqrt{\mathbb{E}[U_{n-1}]}\sqrt C + C\big)$. Let $\delta\in(0,1-\rho_K)$. For any n large enough (say $n\ge n_0$), $\alpha_n^2\rho_K\le 1-\delta$, since $\lim_n\alpha_n = 1$ under Assumption 3-1). There exist positive constants M, b such that for any $n\ge n_0$,
$$\mathbb{E}[U_n] \le (1-\delta)\Big(\mathbb{E}[U_{n-1}] + 2\sqrt{\mathbb{E}[U_{n-1}]}\sqrt C + C\Big) \le \Big(1-\frac{\delta}{2}\Big)\mathbb{E}[U_{n-1}] + b\,\mathbb{I}\{\mathbb{E}[U_{n-1}]\le M\}\,.$$
A trivial induction implies that $\mathbb{E}[U_n] \le (1-\delta/2)^{n-n_0}\,\mathbb{E}[U_{n_0}] + 2b/\delta$, which concludes the proof.
APPENDIX C
PRELIMINARY RESULTS ON THE SEQUENCE $(\phi_n)_n$

Due to the coupling of the sequences $(\langle\theta_n\rangle)_n$ and $(\phi_n)_n$ (see Eq. (9)), the asymptotic analysis of $(\langle\theta_n\rangle)_n$ requires a more detailed understanding of the behavior of $\phi_n$. Note from Assumption 5-1) and (10) that $\{\phi_n, n\ge 0\}$ is a Markov chain w.r.t. the filtration $\{\mathcal{F}_n, n\ge 0\}$, with a transition kernel controlled by $\{\alpha_n,\theta_n, n\ge 0\}$ (see also (19) below).

Let us introduce some notations and definitions. If $(x,A)\mapsto P(x,A)$ is a probability transition kernel on $\mathbb{R}^{dN}$, then for any bounded continuous function $f:\mathbb{R}^{dN}\to\mathbb{R}$, $Pf$ is the measurable function $x\mapsto\int f(y)\,P(x,dy)$. If ν is a probability on $\mathbb{R}^{dN}$, $\nu P$ is the probability on $\mathbb{R}^{dN}$ given by $\nu P(A) = \int\nu(dx)\,P(x,A)$. For $n\ge 0$, the notation $P^n$ stands for the n-th iterate of the kernel, i.e., $P^nf(x) = \int P^{n-1}f(y)\,P(x,dy)$; by convention $P^0(x,A) = \mathbb{I}_A(x) = \delta_x(A)$. A measure π is said to be an invariant distribution w.r.t. P if $\pi P = \pi$. For $p\ge 0$, denote by $L_p(\mathbb{R}^{dN})$ the set of Lipschitz functions $f:\mathbb{R}^{dN}\to\mathbb{R}$ satisfying
$$[f]_p := \sup_{x,y\in\mathbb{R}^{dN}}\frac{|f(x)-f(y)|}{|x-y|\,(1+|x|^p+|y|^p)} < \infty\,.$$
We define $N_p(f) := \big(\sup_x\frac{|f(x)|}{1+|x|^{p+1}}\big)\vee[f]_p$ for $f\in L_p(\mathbb{R}^{dN})$. For any $\theta\in\mathbb{R}^{dN}$ and any $\alpha\ge 0$, define the probability transition kernel $P_{\alpha,\theta}$ on $\mathbb{R}^{dN}$ as
$$P_{\alpha,\theta}f(x) = \int f\big(\alpha\,\mathbf{J}_\perp(w\otimes I_d)(x+y)\big)\,d\mu_\theta(y,w)\,. \qquad (18)$$
This collection of kernels is related to the sequence $(\phi_n)_n$ since, by Assumption 5-1) and (10), for any measurable positive function f it holds almost-surely
$$\mathbb{E}[f(\phi_{n+1})\mid\mathcal{F}_n] = P_{\alpha_{n+1},\theta_n}f(\phi_n)\,. \qquad (19)$$

We start with a result which claims that any transition kernel $P_{\alpha,\theta}$ possesses a unique invariant distribution $\pi_{\alpha,\theta}$ and is ergodic at a geometric rate. This also implies that for a large family of functions f, a solution $f_{\alpha,\theta}$ to the Poisson equation
$$f - \pi_{\alpha,\theta}(f) = f_{\alpha,\theta} - P_{\alpha,\theta}f_{\alpha,\theta} \qquad (20)$$
exists, and is unique up to an additive constant.

Proposition 3. Let Assumptions 5 and 6 hold. Let $K\subset\mathbb{R}^{dN}$ be a compact set and let $\rho_K\in(0,1)$ be given by Assumption 6. The following holds for any $a\in(0,1/\sqrt{\rho_K})$.
1) For any $\theta\in K$ and $\alpha\in[0,a]$, $P_{\alpha,\theta}$ admits a unique invariant distribution $\pi_{\alpha,\theta}$, and $\sup_{\alpha\in[0,a],\theta\in K}\int|x|^2\,d\pi_{\alpha,\theta}(x) < \infty$.
2) For any $p\in[0,1]$, there exists a constant C such that for any $x\in\mathbb{R}^{dN}$ and any $f\in L_p(\mathbb{R}^{dN})$, $\sup_{\alpha\in[0,a],\theta\in K}|P^n_{\alpha,\theta}f(x) - \pi_{\alpha,\theta}(f)| \le C\,N_p(f)\,(a\sqrt{\rho_K})^n\,(1+|x|^{p+1})$.
3) For any $\alpha\in(0,a]$, $\theta\in K$, $p\in[0,1]$ and $f\in L_p(\mathbb{R}^{dN})$, the function $f_{\alpha,\theta}: x\mapsto\sum_{n\ge 0}\big(P^n_{\alpha,\theta}f(x) - \pi_{\alpha,\theta}(f)\big)$ exists, solves the Poisson equation (20), and is in $L_p(\mathbb{R}^{dN})$. In addition,
$$\sup_{\alpha\in[0,a],\theta\in K}|f_{\alpha,\theta}(x)| \le \frac{C\,N_p(f)}{1-a\sqrt{\rho_K}}\,(1+|x|^{p+1})\,.$$

Proof: Let K be a compact subset of $\mathbb{R}^{dN}$. Throughout this proof, for ease of notation, we write ρ instead of $\rho_K$. Let $a\in(0,1/\sqrt\rho)$ be fixed. We check the assumptions of [30, Proposition 2, p. 253], from which all the items follow.

We first prove [30, (2.1.10), p. 253]. By Assumption 6, for any $\alpha\in[0,a]$ and $\theta\in K$,
$$\int P_{\alpha,\theta}(x,dy)\,|y|^2 \le a^2\rho\Big(|x|^2 + \int|y|^2\,d\mu_\theta(y,w) + 2|x|\int|y|\,d\mu_\theta(y,w)\Big)\,;$$
by Assumption 5-2), for any $\bar\rho\in(a^2\rho,1)$, there exists a positive constant c such that for any $x\in\mathbb{R}^{dN}$,
$$\sup_{\alpha\in[0,a],\theta\in K}\int P_{\alpha,\theta}(x,dy)\,|y|^2 \le \bar\rho\,|x|^2 + c\,. \qquad (21)$$
This proves [30, (2.1.10), p. 253]. Note that iterating this inequality and applying Jensen's inequality yield, for any $n\ge 1$, $p\in[0,1]$, $x\in\mathbb{R}^{dN}$,
$$\sup_{\alpha\in[0,a],\theta\in K}\int P^n_{\alpha,\theta}(x,dy)\,|y|^{p+1} \le \Big(\bar\rho^{\,n}|x|^2 + \frac{c}{1-\bar\rho}\Big)^{\frac{p+1}{2}}\,.$$

We now prove [30, (2.1.9), p. 253]. Let $x,z\in\mathbb{R}^{dN}$, $\alpha\in[0,a]$ and $\theta\in K$. We consider a coupling of the distributions $P^n_{\alpha,\theta}(x,\cdot)$ and $P^n_{\alpha,\theta}(z,\cdot)$ defined as follows: $(\bar W_n,\bar Y_n)_{n\in\mathbb{N}}$ are i.i.d. random variables with distribution $\mu_\theta$; set $\bar{\mathbf{W}}_n := \bar W_n\otimes I_d$. The stochastic process $(\varphi^x_n)_{n\in\mathbb{N}}$ defined recursively by $\varphi^x_n = \alpha\,\mathbf{J}_\perp\bar{\mathbf{W}}_n(\varphi^x_{n-1}+\bar Y_n)$ and $\varphi^x_0 = x$ is a Markov chain with transition kernel $P_{\alpha,\theta}$ starting from x. We denote by $\mathbb{E}_{\alpha,\theta}$ the expectation on the associated canonical space. Let $p\in[0,1]$. For any $g\in L_p(\mathbb{R}^{dN})$, it holds
$$|P^n_{\alpha,\theta}g(x) - P^n_{\alpha,\theta}g(z)| = |\mathbb{E}_{\alpha,\theta}(g(\varphi^x_n)-g(\varphi^z_n))| \le [g]_p\Big\{\mathbb{E}_{\alpha,\theta}|\varphi^x_n-\varphi^z_n|^2\;\mathbb{E}_{\alpha,\theta}\big(1+|\varphi^x_n|^p+|\varphi^z_n|^p\big)^2\Big\}^{1/2}\,. \qquad (22)$$
By Assumption 6 combined with a trivial induction,
$$\big(\mathbb{E}_{\alpha,\theta}|\varphi^x_n-\varphi^z_n|^2\big)^{1/2} = \alpha\big(\mathbb{E}_{\alpha,\theta}\big[(\varphi^x_{n-1}-\varphi^z_{n-1})^T A_\theta(\varphi^x_{n-1}-\varphi^z_{n-1})\big]\big)^{1/2} \le (a\sqrt\rho)^n\,|x-z|\,, \qquad (23)$$
where $A_\theta := \int(w\otimes I_d)^T\mathbf{J}_\perp(w\otimes I_d)\,d\mu_\theta(y,w)$. Combining (21) and (23) shows that there exists $C>0$ such that for any $x,z\in\mathbb{R}^{dN}$, $g\in L_p(\mathbb{R}^{dN})$ and $n\ge 1$,
$$\sup_{\alpha\in[0,a],\theta\in K}|P^n_{\alpha,\theta}g(x) - P^n_{\alpha,\theta}g(z)| \le C\,[g]_p\,|x-z|\,(a\sqrt\rho)^n\,(1+|x|^p+|z|^p)\,. \qquad (24)$$
This proves [30, (2.1.9), p. 253]. Finally, we show that the transition kernels are weak Feller. From (18) and the dominated convergence theorem, it is easily checked that for any bounded continuous function f on $\mathbb{R}^{dN}$, $x\mapsto P_{\alpha,\theta}f(x)$ is continuous. Therefore, all the assumptions of [30, Proposition 2, p. 253] are verified. ∎

Proposition 4. Let Assumptions 5 and 6 hold. Let $\theta\in\mathbb{R}^{dN}$ and α be such that $\pi_{\alpha,\theta}$ exists.
1) The first-order moment $m^{(1)}_\theta(\alpha) := \int x\,d\pi_{\alpha,\theta}(x)$ of $\pi_{\alpha,\theta}$ is given by $m^{(1)}_\theta(\alpha) = (\alpha^{-1}I_{dN} - \mathbf{J}_\perp\mathbf{W}_\theta)^{-1}\,\mathbf{J}_\perp z_\theta$, where $\mathbf{W}_\theta$ and $z_\theta$ are given by (11) and (12).
2) Set $T(w) := ((J_\perp w)\otimes I_d)\otimes((J_\perp w)\otimes I_d)$. The vector $m^{(2)}_\theta(\alpha) := \mathrm{vec}\big(\int xx^T\,d\pi_{\alpha,\theta}(x)\big)$ is given by $m^{(2)}_\theta(\alpha) = \big(\alpha^{-2}I_{d^2N^2} - \Phi_\theta\big)^{-1}\zeta_\theta(\alpha)$, where $\Phi_\theta := \int T(w)\,d\mu_\theta(y,w)$ and $\zeta_\theta(\alpha) := \int T(w)\,\mathrm{vec}\big(yy^T + 2\,m^{(1)}_\theta(\alpha)\,y^T\big)\,d\mu_\theta(y,w)$.

Proof: Since $\pi_{\alpha,\theta} = \pi_{\alpha,\theta}P_{\alpha,\theta}$, we obtain
$$m^{(1)}_\theta(\alpha) = \alpha\iint\mathbf{J}_\perp(w\otimes I_d)(y+x)\,d\mu_\theta(y,w)\,d\pi_{\alpha,\theta}(x) = \alpha\,\mathbf{J}_\perp\big(z_\theta + \mathbf{W}_\theta\,m^{(1)}_\theta(\alpha)\big)\,,$$
which yields the expression of $m^{(1)}_\theta(\alpha)$. The proof of item 2) follows the same lines and is omitted. ∎

The proof of the following proposition is left to the reader.

Proposition 5. Let Assumptions 5, 6 and 7 hold. Let $K\subset\mathbb{R}^{dN}$ be a compact set and let $\rho_K\in(0,1)$ and $\lambda_\mu\in(1/2,1]$ be given resp. by Assumption 6 and Assumption 7. The following holds for any $a\in(0,1/\sqrt{\rho_K})$.
1) For any $f\in L_1(\mathbb{R}^{dN})$, there exists a constant $C_f$ such that for any $\alpha,\alpha'\in[0,a]$ and $\theta,\theta'\in K$,
$$\Big|\int f(x)\,\big(d\pi_{\alpha,\theta}(x) - d\pi_{\alpha',\theta'}(x)\big)\Big| \le C_f\big(|\alpha-\alpha'| + |\theta-\theta'|^{\lambda_\mu}\big)\,.$$
2) When f is the identity function $f(x) = x$, then for any $\alpha\in(0,a]$, $\theta\in K$, $x\in\mathbb{R}^{dN}$, one has
$$f_{\alpha,\theta}(x) = (I_{dN} - \alpha\,\mathbf{J}_\perp\mathbf{W}_\theta)^{-1}\big(x - m^{(1)}_\theta(\alpha)\big)\,. \qquad (25)$$
In addition, there exists a constant C such that for any $\alpha,\alpha'\in[0,a]$ and $\theta,\theta'\in K$, one has
$$|P_{\alpha,\theta}f_{\alpha,\theta}(x) - P_{\alpha',\theta'}f_{\alpha',\theta'}(x)| + |f_{\alpha,\theta}(x) - f_{\alpha',\theta'}(x)| \le C\big(|\alpha-\alpha'| + |\theta-\theta'|^{\lambda_\mu}\big)(1+|x|)\,.$$
3) For any function f of the form $x^TAx$, the Poisson solution $f_{\alpha,\theta}$ exists and there exists a constant C such that for any $\alpha,\alpha'\in[0,a]$, $\theta,\theta'\in K$, one has $|P_{\alpha,\theta}f_{\alpha,\theta}(x) - P_{\alpha',\theta'}f_{\alpha',\theta'}(x)| \le C\big(|\alpha-\alpha'| + |\theta-\theta'|^{\lambda_\mu}\big)(1+|x|^2)$.
This limn θn = θ⋆ almost-surely. Throughout the proof, we will concludes the first step. write that a sequence of r.v. (Z ) is O (1) iff sup Z < n n w.p.1 n | n| We now study η for any n n . By (14), there almost-surely; and (Zn)n is O 1 (1) iff sup E [ Zn ] < . n,2 0 ∞ L n | | ∞ I ≥ λµ I exists C such that un EK C J θn 1 EK Fix δ > 0. Set for any positive integers m k λµ | | ≤ | ⊥ − | ≤ ≤ λµ I E I A := θ θ 1 δ. From Section D-A, it holds Cγn φn 1 En−2,K . Therefore, ( EK n γn un ) m j m j ⋆ | 1+−λ| | | ≤ ≥ {| − | ≤ µ E I θn+1 T= θn + γn+1h( θn ) + γn+1En+1 + γn+1Rn+1 C γn supn ( φn 1 En−2,K ) whichP is finite by As- n − h i h i h i | | I where En+1 := Wn+1 (Yn+1 + φn) zθ + Wθ φn + sumptionP 3 and Lemma 3. Thus n γn un EK is a.s. finite. h i− h n i h n i | | WJ (f (φ ) P f (φ )) and where R := u + The term vn can be analyzedP similarly: by (13) ap- θn n n+1 n n n n+1 n h i − E K K J K vn + zn + cn+1 + sn+1 + tn. Note that [En+1 Fn] = 0 plied with θ,θ , there exists a con- | ← ∪I { ∈ } λµ I i.e., (En)n is a Fn-adapted martingale increment. From the stant C such that vn EK C J θn φn En−1,K λ | | ≤ | ⊥ | | | ≤ µ 1+λµ I I expression of fn = fαn+1,θn (see Proposition (25)), we have Cγn+1 φn En−1,K and the fact that n γn vn EK is finite a.s.| follows| from the same arguments as above.| | P fα,θ(y) Pα,θfα,θ(x)= Bα,θ y αJ Wθx αJ zθ (28) We now study z C m(1)(α ) m(1) (1) . By − − ⊥ − ⊥ n v θn n+1 Jθn | | ≤ | − | 1  Proposition 5-1), since α < a < 1/ ρK, there ex- n+1 √ with Bα,θ := IdN αJ Wθ − . Hence, E I ⊥ ists a constant C′ such that γ ( z K ) is no larger − n n | n| E  1+λµ λµ W W than C γ γ + γ sup E( φ I ). The En+1 = n+1 (Yn+1 + φn) zθn θn φn ′ n n n+1 n P k k Ek−1,K h i−h i−h i | − | | | B latter isP finite by Lemma 3 and Assumption 3. Hence, + WJθn αn+1,θn φn+1 αn+1J⊥ Wθn φn + zθn . h i − γ z I is finite a.s. n n| n| EK  P(en)n is a martingale-increment sequence: as above I A. Checking condition C2 of [32, Theorem 2.1] for the term ηn,1, n γnen EK is finite a.s. if E 1+λI sup ( en+1 En,K ) P< . This holds true by (26) We start with a preliminary Lemma which extends n | | ∞ and Lemma 3. Lemma 3. The proof follows the same line and is thus omitted. Let us now investigate cn+1. We write n n W Lemma 5. Let Assumptions 3-1), 5, 10 and 11 hold. Let k=1 γk+1ck+1 = k=2(γk+1 γk) Jθk fk 1(φk) W − h Wi − − (φn)n 0 be the sequence given by (8) and τ be given by Pγn+1 Jθn+1 fn(φn+1P) + γ2 Jθ1 f0(φ1) . ≥ h i h i Assumption 10. For any compact set K RdN , Using again (26) and Lemma 3, there exists ⊂ C > 0 such that n γ E ( c I ) k=1 k+1 k+1 EK sup E φ 2+τ I < . | | ≤ n Tj≤n−1 θj K n | | { ∈ } ∞ C k 1 γk+1 γk + γn +P 1 . The right hand side    ≥ | − |  I is finiteP by Assumption 3, thus implying that n γncn EK is Let ρ˜K be given by Assumption 11. For any a (0, 1/√ρ˜K), finite a.s. 2+τ ∈ P supα [0,a],θ K x dπα,θ(x) < . ∈ ∈ | | ∞ Consider the term sn+1. Following similar arguments and R Let m 1. From Assumption 10 and Lemma 5, it using (26) again, we obtain ≥ is easily seen from the above expression of En+1 that γ s I C γ WJ WJ (1 + φ )I 2+τ 1 k k EK k θk θk−1 k EK sup E En+1 T θ θ 1 δ < where τ is | | ≤ kh − ik | | n m≤j≤n j ⋆ kXn kXn h| | {| − |≤ }i ∞ ≤ ≤ given by Assumption 10. for some constant C which depends only on K. By condi- In order to derive the asymptotic covariance, we go I tion (13) and Lemma 4, one has WJθk WJθk−1 EK further in the expression of the conditional covariance λ kh − k ≤ 2 2 µ λµ λµ I E E F . We write E E F = Ξ(α ,θ ,φ ) CKγ Yk + φk 1 EK . 
By Cauchy-Schwarz in- n+1 n n+1 n n+1 n n k | | | − | | | 2 equality, Assumption 5 and Lemma 3, it can be proved that where Ξ(α,θ,x ) := (ξα,θ,x(y,w)) dµθ(y,w) R E I A sup [( Yk + φk 1 )(1 + φk ) EK ] < . (27) ξα,θ,x(y,w) := α,θ w Wθ x +(wy zθ) (29) | | | − | | | ∞ − − k   1T 1 E I A W J W − J By Assumption 3, ( γk sk EK ) is finite thus implying and α,θ := IdN + α Jθ IdN α θ . Set k | | N − ⊥ ⊥ I 1   that k 1 γksk EK existsP a.s. π⋆ := π1,θ⋆ and πn := παn+1,θn where πα,θ is defined by P ≥ 1 Proposition 3. We write such that almost-surely, for any n n0, tn Am λµ λ λ≥ | | ≤ C α α + γn Y µ + φ µ 1A . Ξ(αn+1,θn,φn)=Ξ(αn+1,θn,φn) Ξ(1,θn,φn) 3 | n+1 − n| | n| | n| m − Lemma 3, Assumption 13 and  imply that 1 λµ > 1/2 + Ξ(1,θn,x)dπn(x) Ξ(1,θ⋆ ,x)dπ⋆(x) 1 − tn+1 A = √γno(1)OL1 (1). By Assumption 7, Z Z | | m there exists a constant C4 > 0 such that almost-surely, + Ξ(1,θn,φn) Ξ(1,θn,x)dπn(x) λµ λ − u 1A C γn φ µ 1A . Lemma 3 and the property Z | n| m ≤ 4 | n| m λµ > 1/2 imply un = o(√γn)OL1 (1). Finally, by + Ξ(1,θ⋆1,x)dπ⋆(x) . Assumption 7, there exists a constant C such that almost- Z λ 1 µ 1+λµ 1 For any m 1, we have on the set A surely, vn Am Cγ φn Am so that by Lemma 3 ≥ m | | ≤ n+1| | again and the condition λµ > 1/2, vn = o(√γn)OL1 (1). The above discussion concludes the proof of (30). (Ξ(α ,θ ,φ ) Ξ(1,θ ,φ )) 0 a.s. n+1 n n − n n → The second step is to prove that for any m 1, n 1 ≥ √γn ck A = o(1)Ow.p.1.(1)OL1 (1). By (26), there Ξ(1,θn,x)dπn(x) Ξ(1,θ⋆1,x)dπ⋆(x) 0 a.s. k=1 m Z − Z  → existsP a constant C > 0 such that almost-surely, n E 1 n γn Ξ(1,θk,φk) Ξ(1,θk,x)dπl(x) Am 0 . 1 1  − Z  → ck Am C (1 + φ0 + φn ) Am . kX=1 ≤ | | | | kX=1 The detailed computations are given in Section E-D. This n implies that the key quantity involved in the asymptotic Lemma 3 implies that k=1 ck = OL1 (1). This concludes the proof of the condition C3 in [32]. covariance matrix is Ξ(1,θ⋆1,x)dπ⋆(x). P R B. Expression of U⋆ D. Detailed computations for verifying the condition C2 1 1 Set U⋆ := Ξ(1, θ⋆,x) dπ1, θ⋆ (x) . Lemma 6 gives The proof of the following lemma follows from standard ⊗ ⊗ an explicit expressionR for U⋆. computations and is thus omitted. Lemma 6. Under the assumptions of Theorem 3, vec U⋆ = Lemma 7. Let Assumptions 5, 11 and 12-1) to hold. Let δ > 0 A A R (2) T (1) S ( ⋆ ⋆)( ⋆ m⋆ + 2 ⋆m⋆ + ⋆). and set K := θ : θ θ 1 δ . Fix a (0, 1/√ρ˜K) where ⊗ { | − ⋆ |≤ } ∈ Proof: For simplicity, we use the notations ρ˜K be given by Assumption 11. There exists a constant C dN W such that for any θ,θ′ K, α,α′ [0,a], x,z,y R and Rθ(w) := w θ and υθ(y,w) := wy zθ and ∈ ∈ ∈ − − T w M T˜θ,x(y,w) := (Rθ(w)x + υθ(y,w))(Rθ(w)x + υθ(y,w)) . 1 T T ∈ Note that T˜θ,x(y,w) coincides with Rθ(w)xx Rθ(w) + T T ξα,θ,x(y,w) C (1 + y + x ) , 2Rθ(w)xυθ(y,w) + υθ(y,w)υθ(y,w) . From (29), | |≤ | | | | λµ ξ (y,w) = A (R (w)x + υ (y,w)) so that A A ′ ′ C α α′ + θ θ′ , α,θ,x α,θ θ θ k α,θ − α ,θ k≤ | − | | − | vec Ξ(α,θ,x)=(Aα,θ Aα,θ) vec T˜θ,x(y,w) dµθ(y,w) .   ⊗ ξα,θ,x(y,w) ξα′,θ′,x(y,w) Applying the vec operatorR on T˜θ,x(y,w) yields | − | T λµ (Rθ(w) Rθ(w))vec (xx ) + 2(υθ(y,w) Rθ(w))x + C α α′ + θ θ′ (1 + x + y ) , ⊗ T ⊗ ≤ | − | | − | | | | | vec (υθ(y,w)υθ(y,w) ) . When applied with α = 1 and   T ξα,θ,x(y,w) ξα,θ,z(y,w) C x z θ = θ⋆1, it holds vec Ξ(1,θ⋆1,x)=(A⋆ A⋆)(R⋆ vec (xx )+ | − |≤ | − | T S ⊗ 2 ⋆x + ⋆) . This yields the result by integrating x w.r.t. π⋆. where λµ is given by Assumptions 5 and 12-1). 1) First term: Ξ(α ,θ ,φ ) Ξ(1,θ ,φ ): It is suffi- n+1 n n − n n C. 
Checking condition C3 of [32, Theorem 2.1] cient to prove that this term converges almost-surely to zero A We first prove that for any m 1, along the event m, for any m 1; which is implied by the ≥ almost-sure convergence to zero≥ along the event θ K := 1 1 ∈ un + vn + zn + sn+1 + tn Am √γno(1)OL (1) . (30) θ : θ θ δ . Below, C is a constant whose value | | ≤ ⋆ m {may change| − | upon ≤ } each appearance. By using the inequality Let m 1. By (8) and Proposition 5-1), there a2 b2 a b ( a + b ), Assumption 10 and Lemma 7, there exists a≥ constant C such that almost-surely on 1 |exists− a|≤| constant− | | | |such| that for any close enough to 1 A λµ Cm α the set m, zn C1 αn+1 1 + J θn 2 | | ≤ | − | | ⊥ | ≤ and θ K, Ξ(α,θ,x) Ξ(1,θ,x) Cm 1+ x α 1 . λµ λµ C1 αn+1 1 + γ 1+ φn . Assumption 13, ∈ | − |≤ | | | − | n+1 By Lemma 5, for any ε > 0, there exists C m such | − |  | | Lemma 3 and λµ > 1/2 imply that zn 1A = P 2 1 m that supn ℓ(1 + φn ) αn+1 1 θn K ε is no larger | | ≥ | | | − | ∈ ≥ √γno(1)OL1 (1). By Assumption 7, Proposition 3-3) (1+τ/2) than Cm n ℓ αn+1 1 . The latter term converges and Lemma 4, there exist a constant C2 > 0 and n n0 to zero as ℓ ≥ | by Assumption− | 13. This implies that almost- 1 ≥ P such that almost-surely, for all n n0, sn+1 Am → ∞ 1 λ surely, limn Ξ(αn+1,θn,φn) Ξ(1,θn,φn) θn K = 0. µ λµ λµ ≥ | | ≤ ∈ C2γn Yn+1 + φn (1 + φn+1 ) 1A . | − | | | | | | | m 2) Second term: Ξ(1,θn,x)dπn(x) Assumption 5, Lemma 3 and the condition λµ > 1/2 − Ξ(1,θ⋆1,x)dπ⋆(x): We applyR the following lemma imply that sn+1 1A = √γnO 1 (1). By Proposition 5-2) | | m L R(see [33, Proposition 4.3.]). and Lemma 4, there exist a constant C3 > 0 and n0 Lemma 8. Let µ, µn,n 0 be probability distributions on Lemma 5 implies that there exists a> 1 such that dN { ≥ } R endowed with its Borel σ-field. Let hn,n 1 be an R{dN R≥ } 1 a equicontinuous family of functions from to . Assume sup θn K Ξ(1,θn,x) παn+1,θn (dx) < . n ∈ Z | | ∞ 1) the sequence µn,n 0 weakly converges to µ. { dN ≥ } e) Conclusion: We can apply Lemma 8; we have a.s., 2) for any x R , limn hn(x) exists, and there exists ∈ a limn Ξ(1,θn,x)dπn(x) Ξ(1,θ⋆1,x)dπ⋆(x) 1A = 0. a> 1 such that supn hn dµn + limn hn dµ < . − m | | | | ∞ 3) ThirdR term: R : We Then limn hndµn = limRn hn dµ. R Ξ(1,θn,φn) Ξ(1,θn,x)dπ n(x) prove that for any − R R m 1 R a) Almost-sure weak convergence: In our case µn πn ≥ ← n and µ π⋆ and µn is a random probability. Since the set of E 1 ← lim γn Ξ(1,θk,φk) Ξ(1,θk,x)dπk(x) Am =0 . bounded Lipschitz functions is convergence determining (see n " − # k=1  Z  e.g. [34, Theorem 11.3.3.]), we prove that for any bounded and X n Lipschitz function h, limn hdπn = hdπ⋆ almost-surely, Set k=1 Ξ(1,θk,φk) Ξ(1,θk,x)dπk(x ) = 3 (i) − with an almost-sure set whichR has to beR uniform for the set i=1 Tn P  R of bounded Lipschitz functions. Following the same lines as P n (1) in the proof of [33, Proposition 5.2.], this convergence occurs with T = Ξ(1,θk,φk) Ξ(1,θk−1,φk) n { − } almost-surely if and only if for any bounded Lipschitz function k=1 Xn h, there exists a full set such that on this set, limn hdπn = (2) Tn = Ξ(1,θk−1,φk) Ξ(1,θk−1,x)dπk−1(x) hdπ⋆. R − k=1   R Let h be a bounded Lipschitz function. Then h X Z dN ∈ (3) L (R ). By Proposition 5-1), there exists a constant C Tn = Ξ(1,θ0,x)dπ0(x) Ξ(1,θn,x)dπn(x) . 0 f − such that for any n large enough, on the set θn K Z Z { ∈ λµ } (1) hdπ hdπ C α 1 + θ θ 1 . 
D. Detailed computations for verifying the condition C2

The proof of the following lemma follows from standard computations and is thus omitted.

Lemma 7. Let Assumptions 5, 11 and 12-1) hold. Let δ > 0 and set K := {θ : |θ − θ_⋆1| ≤ δ}. Fix a ∈ (0, 1/√ρ̃_K), where ρ̃_K is given by Assumption 11. There exists a constant C such that for any θ, θ′ ∈ K, α, α′ ∈ [0,a], x, z, y ∈ R^{dN} and w ∈ M,

|ξ_{α,θ,x}(y,w)| ≤ C (1 + |y| + |x|) ,
‖A_{α,θ} − A_{α′,θ′}‖ ≤ C ( |α − α′| + |θ − θ′|^{λµ} ) ,
|ξ_{α,θ,x}(y,w) − ξ_{α′,θ′,x}(y,w)| ≤ C ( |α − α′| + |θ − θ′|^{λµ} ) (1 + |x| + |y|) ,
|ξ_{α,θ,x}(y,w) − ξ_{α,θ,z}(y,w)| ≤ C |x − z| ,

where λµ is given by Assumptions 5 and 12-1).

1) First term: Ξ(α_{n+1},θ_n,φ_n) − Ξ(1,θ_n,φ_n): It is sufficient to prove that this term converges almost-surely to zero along the event A_m, for any m ≥ 1; which is implied by the almost-sure convergence to zero along the event {θ_n ∈ K}, where K := {θ : |θ − θ_⋆1| ≤ δ}. Below, C_m is a constant whose value may change upon each appearance. By using the inequality |a² − b²| ≤ |a − b|(|a| + |b|), Assumption 10 and Lemma 7, there exists a constant C_m such that for any α close enough to 1 and θ ∈ K, |Ξ(α,θ,x) − Ξ(1,θ,x)| ≤ C_m (1 + |x|²) |α − 1|. By Lemma 5, for any ε > 0, there exists C_m such that P[ sup_{n≥ℓ} (1 + |φ_n|²) |α_{n+1} − 1| 1_{θ_n∈K} ≥ ε ] is no larger than C_m Σ_{n≥ℓ} |α_{n+1} − 1|^{1+τ/2}. The latter term converges to zero as ℓ → ∞ by Assumption 13. This implies that almost-surely, lim_n |Ξ(α_{n+1},θ_n,φ_n) − Ξ(1,θ_n,φ_n)| 1_{θ_n∈K} = 0.

2) Second term: ∫Ξ(1,θ_n,x)dπ_n(x) − ∫Ξ(1,θ_⋆1,x)dπ_⋆(x): We apply the following lemma (see [33, Proposition 4.3.]).

Lemma 8. Let {µ, µ_n, n ≥ 0} be probability distributions on R^{dN} endowed with its Borel σ-field. Let {h_n, n ≥ 1} be an equicontinuous family of functions from R^{dN} to R. Assume
1) the sequence {µ_n, n ≥ 0} weakly converges to µ;
2) for any x ∈ R^{dN}, lim_n h_n(x) exists, and there exists a > 1 such that sup_n ∫ |h_n|^a dµ_n + ∫ |lim_n h_n| dµ < ∞.
Then lim_n ∫ h_n dµ_n = ∫ lim_n h_n dµ.

a) Almost-sure weak convergence: In our case µ_n ← π_n and µ ← π_⋆, and µ_n is a random probability. Since the set of bounded Lipschitz functions is convergence determining (see e.g. [34, Theorem 11.3.3.]), we prove that for any bounded Lipschitz function h, lim_n ∫ h dπ_n = ∫ h dπ_⋆ almost-surely, with an almost-sure set which has to be uniform over the set of bounded Lipschitz functions. Following the same lines as in the proof of [33, Proposition 5.2.], this convergence occurs almost-surely if and only if for any bounded Lipschitz function h, there exists a full set such that on this set, lim_n ∫ h dπ_n = ∫ h dπ_⋆. Let h be a bounded Lipschitz function. Then h ∈ L_f(R^{dN}). By Proposition 5-1), there exists a constant C_f such that for any n large enough, on the set {θ_n ∈ K}, |∫ h dπ_n − ∫ h dπ_⋆| ≤ C_f ( |α_{n+1} − 1| + |θ_n − θ_⋆1|^{λµ} ). Since lim_n θ_n = θ_⋆1 almost-surely and lim_n α_n = 1, we have lim_n ∫ h dπ_n = ∫ h dπ_⋆ almost-surely. This concludes the proof of the a.s. weak convergence.

b) Equicontinuity of the family of functions: We prove that the family of functions {x ↦ Ξ(1,θ,x); θ ∈ K} is equicontinuous. Using again the inequality |a² − b²| ≤ |a − b|(|a| + |b|), Lemma 7 and Assumption 10, we know there exists a constant C_m such that for any θ ∈ K and x, z ∈ R^{dN}, |Ξ(1,θ,x) − Ξ(1,θ,z)| ≤ C_m (1 + |x| + |z|) |x − z|.

c) Almost-sure limit of Ξ(1,θ_n,x) when n → ∞: Let x be fixed. We write
|Ξ(1,θ,x) − Ξ(1,θ′,x)| ≤ ∫ | ξ_{1,θ,x}(y,w)^{⊗2} − ξ_{1,θ′,x}(y,w)^{⊗2} | dµ_{θ′}(y,w) + | ∫ ξ_{1,θ,x}(y,w)^{⊗2} dµ_θ(y,w) − ∫ ξ_{1,θ,x}(y,w)^{⊗2} dµ_{θ′}(y,w) | .
Let us consider the first term. Using again |a² − b²| ≤ |a − b|(|a| + |b|) and Lemma 7, there exists a constant C_m such that the first term is upper bounded by C_m (1 + |x|²) |θ − θ′|^{λµ} for any θ, θ′ ∈ K. For the second term, we use Assumption 12-2) and obtain the same upper bound. Then, there exists a constant C_m such that for any θ, θ′ ∈ K,
|Ξ(1,θ,x) − Ξ(1,θ′,x)| ≤ C_m (1 + |x|²) |θ − θ′|^{λµ} .   (31)
Since lim_n θ_n = θ_⋆1 almost-surely, the above discussion implies that for any fixed x, lim_n Ξ(1,θ_n,x) = Ξ(1,θ_⋆1,x) almost-surely on A_m.

d) Moment conditions: It is easily seen (using again Lemma 7) that there exists a constant C_m such that for any θ ∈ K, |Ξ(1,θ,x)| ≤ C_m (1 + |x|²). Therefore, Lemma 5 implies that ∫ |Ξ(1,θ_⋆1,x)| dπ_⋆(x) < ∞. In addition, for any θ ∈ K, α in a neighborhood of 1 and a > 1,
∫ |Ξ(1,θ,x)|^a π_{α,θ}(dx) ≤ C_m ( 1 + ∫ |x|^{2a} π_{α,θ}(dx) ) .
Lemma 5 implies that there exists a > 1 such that
sup_n 1_{θ_n∈K} ∫ |Ξ(1,θ_n,x)|^a π_{α_{n+1},θ_n}(dx) < ∞ .

e) Conclusion: We can apply Lemma 8; we have a.s.,
lim_n [ ∫ Ξ(1,θ_n,x) dπ_n(x) − ∫ Ξ(1,θ_⋆1,x) dπ_⋆(x) ] 1_{A_m} = 0 .

3) Third term: Ξ(1,θ_n,φ_n) − ∫Ξ(1,θ_n,x)dπ_n(x): We prove that for any m ≥ 1,
lim_n E[ γ_n | Σ_{k=1}^n ( Ξ(1,θ_k,φ_k) − ∫ Ξ(1,θ_k,x) dπ_k(x) ) | 1_{A_m} ] = 0 .
Set Σ_{k=1}^n ( Ξ(1,θ_k,φ_k) − ∫ Ξ(1,θ_k,x) dπ_k(x) ) = Σ_{i=1}^3 T_n^{(i)} with
T_n^{(1)} = Σ_{k=1}^n { Ξ(1,θ_k,φ_k) − Ξ(1,θ_{k−1},φ_k) } ,
T_n^{(2)} = Σ_{k=1}^n [ Ξ(1,θ_{k−1},φ_k) − ∫ Ξ(1,θ_{k−1},x) dπ_{k−1}(x) ] ,
T_n^{(3)} = ∫ Ξ(1,θ_0,x) dπ_0(x) − ∫ Ξ(1,θ_n,x) dπ_n(x) .
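This decomposition is exact: re-indexing the integral terms, the first two sums telescope into

\[
T_n^{(1)} + T_n^{(2)} \;=\; \sum_{k=1}^n \Xi(1,\theta_k,\varphi_k) \;-\; \sum_{k=0}^{n-1} \int \Xi(1,\theta_k,x)\, d\pi_k(x) ,
\]

and adding T_n^{(3)} removes the k = 0 integral and inserts the missing k = n one, which gives back Σ_{k=1}^n ( Ξ(1,θ_k,φ_k) − ∫ Ξ(1,θ_k,x) dπ_k(x) ) as claimed.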
a) Term T_n^{(1)}: By (31), there exists a constant C_m such that for any k ≥ m + 1, on the set A_m, |Ξ(1,θ_k,φ_k) − Ξ(1,θ_{k−1},φ_k)| ≤ C_m |θ_k − θ_{k−1}|^{λµ} (1 + |φ_k|²). Hence, by Lemma 4, on the set A_m, |Ξ(1,θ_k,φ_k) − Ξ(1,θ_{k−1},φ_k)| ≤ C_m γ_k^{λµ} (1 + |φ_k|²)( |Y_k|^{λµ} + |φ_{k−1}|^{λµ} ). By Assumption 10, Lemma 5 and Assumption 13, the sum
Σ_{k≥1} γ_k^{1+λµ} E[ (1 + |φ_k|²)( |Y_k|^{λµ} + |φ_{k−1}|^{λµ} ) 1_{A_m} ]
is finite, which implies lim_n γ_n E[ |T_n^{(1)}| 1_{A_m} ] = 0 by the Kronecker Lemma.

b) Term T_n^{(2)}: From the expression of ξ (see (29)), we have Ξ(1,θ,φ) − Ξ(1,θ,x) = φ^T C_θ φ − x^T C_θ x + (φ − x)^T D_θ with
C_θ := ∫ (w − W_θ)^T A_{1,θ}^T A_{1,θ} (w − W_θ) dµ_θ(y,w) ,
D_θ := 2 ∫ (w − W_θ)^T A_{1,θ}^T A_{1,θ} (w y − z_θ) dµ_θ(y,w) .
We detail the proof of the statement
lim_n E[ γ_n | Σ_{k=1}^n ( φ_k − ∫ x dπ_{α_k,θ_{k−1}}(x) )^T D_{θ_{k−1}} | 1_{A_m} ] = 0 .
The second statement, with the quadratic dependence on φ_k, is similar and omitted (its proof will use Proposition 5-3) and the condition lim_n γ_n n^{1/(1+τ/2)} = 0). Using again the Poisson solution f_n := f_{α_{n+1},θ_n} associated to the identity function and the kernel P_n := P_{α_{n+1},θ_n}, it holds by (28)

( φ_k − ∫ x dπ_{k−1}(x) )^T D_{θ_{k−1}}
  = ( f_{k−1}(φ_k) − P_{k−1} f_{k−1}(φ_{k−1}) )^T D_{θ_{k−1}}   (32)
  + ( P_{k−1} f_{k−1}(φ_{k−1})^T D_{θ_{k−1}} − P_k f_k(φ_k)^T D_{θ_k} )   (33)
  + ( P_k f_k(φ_k) − P_{k−1} f_{k−1}(φ_k) )^T D_{θ_k}   (34)
  + P_{k−1} f_{k−1}(φ_k)^T ( D_{θ_k} − D_{θ_{k−1}} ) .   (35)

From Assumption 12-2) and Lemma 7, there exists a constant C_m such that for any k,
|D_{θ_k}| 1_{A_m} ≤ C_m ,   (36)
|D_{θ_k} − D_{θ_{k−1}}| 1_{A_m} ≤ C_m |θ_k − θ_{k−1}|^{λµ} .   (37)
Let us control the first term (32). Upon noting that it is a martingale-increment, the Burkholder inequality (see e.g. [31, Theorem 2.10]) applied with p ← 2 + τ and Lemma 5 imply
E[ | Σ_{k=1}^n ( f_{k−1}(φ_k) − P_{k−1} f_{k−1}(φ_{k−1}) )^T D_{θ_{k−1}} | 1_{A_m} ] = O(√n) .
This term is o(1/γ_n) by Assumption 13. Let us consider (33):
Σ_{k=1}^n E[ ( P_{k−1} f_{k−1}(φ_{k−1})^T D_{θ_{k−1}} − P_k f_k(φ_k)^T D_{θ_k} ) 1_{A_m} ] = E[ ( P_0 f_0(φ_0)^T D_{θ_0} − P_n f_n(φ_n)^T D_{θ_n} ) 1_{A_m} ]
and this term is O(1) by Proposition 3-3), (36) and Lemma 5. Let us see the third term (34). By Proposition 5-2) and (36), we have
| Σ_{k=1}^n E[ ( P_k f_k(φ_k) − P_{k−1} f_{k−1}(φ_k) )^T D_{θ_k} 1_{A_m} ] | ≤ C_m Σ_{k=1}^n E[ ( |θ_k − θ_{k−1}|^{λµ} + |α_{k+1} − α_k| ) 1_{A_m} ] .
By Lemmas 4 and 5 and Assumptions 10 and 13, this term is o(1/γ_n). Finally, the same conclusion holds for (35) by using Proposition 3-3), Lemma 5 and (37). This concludes the proof of lim_n γ_n E[ |T_n^{(2)}| 1_{A_m} ] = 0.

c) Term T_n^{(3)}: By Lemma 7, there exists C_m such that for any θ ∈ K, |Ξ(1,θ,x)| ≤ C_m (1 + |x|²). By Lemma 5, for any a in a neighborhood of 1 we have sup_{α∈[0,a], θ∈K} ∫ |x|² π_{α,θ}(dx) < ∞. Since lim_n α_n = 1, we have sup_n | ∫ Ξ(1,θ_n,x) dπ_n(x) | 1_{θ_n∈K} < ∞, so that lim_n γ_n E[ |T_n^{(3)}| 1_{A_m} ] = 0.
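For completeness, we recall the version of the Kronecker Lemma invoked in the control of T_n^{(1)} above: for nonnegative (a_k) and positive nondecreasing b_n ↑ ∞,

\[
\sum_{k \ge 1} \frac{a_k}{b_k} < \infty \quad \Longrightarrow \quad \lim_n \frac{1}{b_n} \sum_{k=1}^n a_k = 0 .
\]

Assuming, as is usual for decreasing step sizes, that n ↦ γ_n is nonincreasing, it is applied with b_n := γ_n^{−1} and a_k := γ_k^{λµ} E[ (1 + |φ_k|²)( |Y_k|^{λµ} + |φ_{k−1}|^{λµ} ) 1_{A_m} ], for which Σ_k a_k/b_k is exactly the finite sum displayed in the control of T_n^{(1)}, so that γ_n E[ |T_n^{(1)}| 1_{A_m} ] ≤ C_m γ_n Σ_{k=1}^n a_k → 0.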