1

Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks Matthew Nokleby, Member, IEEE, and Waheed U. Bajwa, Senior Member, IEEE

T Abstract—Motivated by applications in T training samples {ξ(t) ∈ Υ}t=1 drawn independently from networks of sensors, internet-of-things (IoT) devices, and au- the distribution P (ξ), which is supported on a subset of Υ. tonomous agents, we propose techniques for distributed stochastic There are, in particular, two main categorizations of training convex learning from high-rate data streams. The setup involves a network of nodes—each one of which has a stream of data data that, in turn, determine the types of methods that can be arriving at a constant rate—that solve a stochastic convex used to find approximate solutions to the SO problem. These optimization problem by collaborating with each other over rate- are (i) batch training data and (ii) streaming training data. limited communication links. To this end, we present and analyze In the case of batch training data, where all T samples two algorithms—termed distributed {ξ(t)} are pre-stored and simultaneously available, a common mirror descent (D-SAMD) and accelerated distributed stochastic approximation mirror descent (AD-SAMD)—that are based on strategy is sample average approximation (SAA) (also referred two stochastic variants of mirror descent and in which nodes to as empirical risk minimization (ERM)), in which one collaborate via approximate averaging of the local, noisy subgra- minimizes the empirical average of the “risk” function φ(·, ·) dients using distributed consensus. Our main contributions are in lieu of the true expectation. In the case of streaming data, (i) bounds on the convergence rates of D-SAMD and AD-SAMD by contrast, the samples {ξ(t)} arrive one-by-one, cannot be in terms of the number of nodes, network topology, and ratio of the data streaming and communication rates, and (ii) sufficient stored in memory for long, and should be processed as soon conditions for order-optimum convergence of these algorithms. as possible. In this setting, stochastic approximation (SA) In particular, we show that for sufficiently well-connected net- methods—the most well known of which is stochastic gradient works, distributed learning schemes can obtain order-optimum descent (SGD)—are more common. Both SAA and SA have convergence even if the communications rate is small. Further a long history in the literature; see [2] for a historical survey we find that the use of accelerated methods significantly enlarges the regime in which order-optimum convergence is achieved; of SA methods, [3] for a comparative review of SAA and SA this is in contrast to the centralized setting, where accelerated techniques, and [4] for a recent survey of SO techniques. methods usually offer only a modest improvement. Finally, we Among other trends, the rapid proliferation of sensing demonstrate the effectiveness of the proposed algorithms using and wearable devices, the emergence of the internet-of-things numerical experiments. (IoT), and the storage of data across geographically-distributed Index Terms—Distributed learning, distributed optimization, data centers have spurred a renewed interest in development internet of things, machine learning, mirror descent, stochastic and analysis of new methods for learning from fast-streaming approximation, stochastic optimization, streaming data. and distributed data. The goal of this paper is to find a fast and efficient solution to the SO problem (1) in this setting I.INTRODUCTION of distributed, streaming data. 
In particular, we focus on Machine learning at its core involves solving stochastic geographically-distributed nodes that collaborate over rate- optimization (SO) problems of the form limited communication links (e.g., wireless links within an IoT infrastructure) and obtain independent streams of training data min ψ(x) min Eξ[φ(x, ξ)] (1) arriving at a constant rate. x∈X , x∈X arXiv:1704.07888v4 [stat.ML] 6 Aug 2018 The relationship between the rate at which communication to learn a “model” x ∈ X ⊂ Rn that is then used for takes place between nodes and the rate at which streaming tasks such as dimensionality reduction, classification, clus- data arrive at individual nodes plays a critical role in this tering, regression, and/or prediction. A primary challenge of setting. If, for example, data samples arrive much faster than machine learning is to find a solution to the SO problem nodes can communicate among themselves, it is difficult (1) without knowledge of the distribution P (ξ). This involves for the nodes to exchange enough information to enable an finding an approximate solution to (1) using a sequence of SA iteration on existing data in the network before new data arrives, thereby overwhelming the network. In order to Matthew Nokleby is with the Department of Electrical and Com- puter Engineering, Wayne State University, Detroit, MI 48202, USA address the challenge of distributed SO in the presence of a (email: [email protected]), and Waheed U. Bajwa is mismatch between the communications and streaming rates, with the Department of Electrical and Computer Engineering, Rut- we propose and analyze two distributed SA techniques, each gers University–New Brunswick, Piscataway, NJ 08854, USA (email: [email protected]). The work of WUB was supported in part based on distributed averaging consensus and stochastic mirror by the NSF under grant CCF-1453073, by the ARO under grant W911NF-17- descent. In particular, we present bounds on the convergence 1-0546, and by the DARPA Lagrange Program under ONR/SPAWAR contract rates of these techniques and derive conditions—involving the N660011824020. Some of the results reported here were presented at the 7th IEEE International Workshop on Computational Advances in Multi-Sensor number of nodes, network topology, the streaming rate, and Adaptive Processing (CAMSAP’17) [1]. the communications rate—under which our solutions achieve 2 order-optimum convergence speed. distributed stochastic dual averaging by computing approx- imate mini-batch averages of dual variables via distributed consensus. In addition, and similar to our work, [26] allows A. Relationship to Prior Work for a mismatch between the communications rate and the data SA methods date back to the seminal work of Robbins and streaming rate. Nonetheless, there are four main distinctions Monro [5], and recent work shows that, for stochastic convex between [26] and our work. First, we provide results for optimization, SA methods can outperform SAA methods [6], stochastic composite optimization, whereas [24], [26] suppose [7]. Lan [8] proposed accelerated stochastic mirror descent, a differentiable objective. Second, we consider distributed which achieves the best possible convergence rate for general mirror descent, which allows for a limited generalization stochastic convex problems. This method, which makes use of to non-Euclidean settings. 
Third, we explicitly examine the noisy subgradients of ψ(·) computed using incoming training impact of slow communications rate on performance, in par- samples, satisfies ticular highlighting the need for large mini-batches and their   impact on convergence speed when the communications rate ∗ L M + σ E[ψ(x(T )) − ψ(x )] ≤ O(1) + √ , (2) is slow. In [26], the optimum mini-batch size is first derived T 2 T from [24], after which the communications rate needed to where x∗ denotes the minimizer of (1), σ2 denotes variance facilitate distributed consensus at the optimum mini-batch size of the subgradient noise, and M and L denote the Lipschitz is specified. While it appears to be possible to derive some constants associated with the non-smooth (convex) component of our results from a re-framing of the results of [26], it of ψ and the gradient of the smooth (convex) component is crucial to highlight the trade-offs necessary under slow of ψ, respectively. Further assumptions such as smoothness communications, which is not done in prior works. Finally, and strong convexity of ψ(·) and/or presence of a structured this work also presents a distributed accelerated mirror descent regularizer term in ψ(·) can remove the dependence of the approach to distributed SA; a somewhat surprising outcome is convergence rate on M and/or improve the convergence rate that acceleration substantially improves the convergence rate to O(σ/T ) [6], [9]–[11]. in networks with slow communications. The problem of distributed SO goes back to the seminal work of Tsitsiklis et al. [12], which presents distributed first- order methods for SO and gives proofs of their asymptotic B. Our Contributions convergence. Myriad works since then have applied these In this paper, we present two strategies for distributed SA ideas to other settings, each with different assumptions about over networks with fast streaming data and slow communica- the type of data, how the data are distributed across the tions links: distributed stochastic approximation mirror descent network, and how distributed units process data and share (D-SAMD) and accelerated distributed stochastic approxima- information among themselves. In order to put our work in tion mirror descent (AD-SAMD). In both cases, nodes first context, we review a representative sample of these works. locally compute mini-batch stochastic subgradient averages to A recent line of work was initiated by distributed gradient accommodate a fast streaming rate (or, equivalently, a slow descent (DGD) [13], in which nodes descend using gradients communications rate), and then they collaboratively compute of local data and collaborate via averaging consensus [14]. approximate network subgradient averages via distributed con- More recent works incorporate accelerated methods, time- sensus. Finally, nodes individually employ mirror descent and varying or directed graphs, data structure, etc. [15]–[19]. These accelerated mirror descent, respectively, on the approximate works tend not to address the SA problem directly; instead, averaged subgradients for the next set of iterates. they suppose a linearly separable function consistent with SAA Our main theoretical contribution is the derivation of upper using local, independent and identically distributed (i.i.d.) data. bounds on the convergence rates of D-SAMD and AD-SAMD. 
The works [20]–[23] do consider SA directly, but suppose These bounds involve a careful analysis of the impact of that nodes engage in a single round of message passing per imperfect subgradient averaging on individual nodes’ iterates. stochastic subgradient sample. In addition, we derive sufficient conditions for order-optimum We conclude by discussing two lines of works [24]–[26] that convergence of D-SAMD and AD-SAMD in terms of the are most closely related to this work. In [24], nodes perform streaming and communications rates, the size and topology distributed SA by forming distributed mini-batch averages of of the network, and the data . stochastic gradients and using stochastic dual averaging. The Two key findings of this paper are that distributed methods main assumption in this work is that nodes can compute exact can achieve order-optimum convergence with small commu- stochastic gradient averages (e.g., via AllReduce in parallel nication rates, as long as the number of nodes in the network computing architectures). Under this assumption, it is shown in does not grow too quickly as a function of the number this work that there is an appropriate mini-batch size for which of data samples each node processes, and that accelerated the nodes’ iterates converge at the optimum (centralized) rate. methods seem to offer order-optimum convergence in a larger However, the need for exact averages in this work is not suited regime than D-SAMD, thus potentially accommodating slower to rate-limited (e.g., wireless) networks, in which mimicking communications links relative to the streaming rate. By con- the AllReduce functionality can be costly and challenging. trast, the convergence speeds of centralized stochastic mirror The need for exact stochastic gradient averages in [24] descent and accelerated stochastic mirror descent typically has been relaxed recently in [26], in which nodes carry out differ only in higher-order terms. We hasten to point out that 3 we do not claim superior performance of D-SAMD and AD- An important fact that will be used in this paper is that the SAMD versus other distributed methods. Instead, the larger subgradient g ∈ ∂h of a Lipschitz-continuous convex function goal is to establish the existence of methods for order-optimum h is bounded [27, Lemma 2.6]: stochastic learning in the fast-streaming, rate-limited regimes. kgk ≤ M, ∀g ∈ ∂h(y), y ∈ X. (9) D-SAMD and AD-SAMD should be best regarded as a proof ∗ of concept towards this end. Consequently, the gap between the subgradients of ψ is bounded: ∀x, y ∈ X and gx ∈ ∂h(x), gy ∈ ∂h(y), we have C. Notation and Organization k∂ψ(x) − ∂ψ(y)k∗ = k∇f(x) − ∇f(y) + gx − gyk∗ We typically use boldfaced lowercase and boldfaced capital ≤ k∇f(x) − ∇f(y)k∗ + kgx − gyk∗ letters (e.g., x and W) to denote (possibly random) vectors and ≤ L kx − yk + 2M. (10) matrices, respectively. Unless otherwise specified, all vectors are assumed to be column vectors. We use (·)T to denote the A. Distributed Stochastic Composite Optimization transpose operation and 1 to denote the vector of all ones. Further, we denote the expectation operation by E[·] and the Our focus in this paper is minimization of ψ(x) over a field of real numbers by R. We use ∇ to denote the gradient network of m nodes, represented by the undirected graph operator, while denotes the Hadamard product. Finally, G = (V,E). 
To this end, we suppose that nodes minimize given two functions p(r) and q(r), we write p(r) = O(q(r)) ψ collaboratively by exchanging subgradient information with if there exists a constant C such that ∀r, p(r) ≤ Cq(r), and their neighbors at each communications round. Specifically, we write p(r) = Ω(q(r)) if q(r) = O(p(r)). each node i ∈ V transmits a message at each communications The rest of this paper is organized as follows. In Section II, round to each of its neighbors, defined as we formalize the problem of distributed stochastic composite Ni = {j ∈ V :(i, j) ∈ E}, (11) optimization. In Sections III and IV, we describe D-SAMD and AD-SAMD, respectively, and also derive performance where we suppose that a node is in its own neighborhood, guarantees for these two methods. We examine the empirical i.e., i ∈ Ni. We assume that this message passing between performance of the proposed methods via numerical experi- nodes takes place without any error or distortion. Further, ments in Section V, and we conclude the paper in Section VI. we constrain the messages between nodes to be members of Proofs are provided in the appendix. the dual space of X and to satisfy causality; i.e., messages transmitted by a node can depend only on its local data and previous messages received from its neighbors. II.PROBLEM FORMULATION Next, in terms of data generation, we suppose that each The objective of this paper is order-optimal, distributed node i ∈ V queries a first-order stochastic “oracle” at a minimization of the composite function fixed rate—which may be different from the rate of message ψ(x) = f(x) + h(x), (3) exchange—to obtain noisy estimates of the subgradient of ψ at different query points in X. Formally, we use ‘t’ to where x ∈ X ⊂ Rn and X is convex and compact. The space index time according to data-acquisition rounds and define n R is endowed with an inner product h·, ·i that need not be {ξi(t) ∈ Υ}t≥1 to be a sequence of independent (with respect the usual one and a norm k·k that need not be the one induced to i and t) and identically distributed (i.i.d.) random variables by the inner product. In the following, the minimizer and the with unknown probability distribution P (ξ). At each data- minimum value of ψ are denoted as: acquisition round t, node i queries the oracle at search point x (s) to obtain a point G(x (s), ξ (t)) that is a noisy version x∗ arg min ψ(x), and ψ∗ ψ(x∗). (4) i i i , x∈X , of the subgradient of ψ at xi(s). Here, we use ‘s’ to index time according to search-point update rounds, with possibly We now make a few assumptions on the smooth (f(·)) and multiple data-acquisition rounds per search-point update. The non-smooth (h(·)) components of ψ. The function f : X → R reason for allowing the search-point update index s to be is convex with Lipschitz continuous gradients, i.e., different from the data-acquisition index t is to accommodate the setting in which data (equivalently, subgradient estimates) k∇f(x) − ∇f(y)k∗ ≤ L kx − xk , ∀ x, y ∈ X, (5) arrive at a much faster rate than the rate at which nodes can where k·k∗ is the dual norm associated with h·, ·i and k·k: communicate with each other; we will elaborate further on this in the next subsection. kgk∗ , sup hg, xi. (6) kxk≤1 Formally, G(x, ξ) is a Borel function that satisfies the following properties: The function h : X → R is convex and Lipschitz continuous: E[G(x, ξ)] , g(x) ∈ ∂ψ(x), and (12) kh(x) − h(y)k ≤ M kx − yk , ∀ x, y ∈ X. 
(7) 2 2 E[kG(x, ξ) − g(x)k∗] ≤ σ , (13) Note that h need not have gradients; however, since it is convex we can consider its subdifferential, denoted by ∂h(y): where the expectation is with respect to the distribution P (ξ). We emphasize that this formulation is equivalent to that in T ∂h(y) = {g : h(z) ≥ h(y) + g (z − y), ∀ z ∈ X}. (8) which the objective function is ψ(x) , E[φ(x, ξ)], and 4

where nodes in the network acquire data point {ξi(t)}i∈V at C. Problem Statement each data-acquisition round t that are then used to compute It is straightforward to see that mini-batching induces a the subgradients of φ(x, ξi(t)), which—in turn—are noisy performance trade-off: Averaging reduces subgradient noise subgradients of ψ(x). and processing time, but it also reduces the rate of search- point updates (and hence slows down convergence). This B. Mini-batching for Rate-Limited Networks trade-off depends on the relationship between the streaming A common technique to reduce the variance of the and communications rates. In order to carry out distributed (sub)gradient noise and/or reduce the computational burden in SA in an order-optimal manner, we will require that the centralized SO is to average “batches” of oracle outputs into a nodes collaborate by carrying out r ≥ 1 rounds of averaging single (sub)gradient estimate. This technique, which is referred consensus on their mini-batch averages θi(s) in each mini- to as mini-batching, is also used in this paper; however, batch round s (see Section III for details). In order to complete its purpose in our distributed setting is to both reduce the the r communication rounds in time for the next mini-batch subgradient noise variance and manage the potential mismatch round, we have the constraint between the communications rate and the data streaming rate. r ≤ bρ. (16) Before delving into the details of our mini-batch strategy, we present a simple model to parametrize the mismatch between If communications is faster, or if the mini-batch rounds are the two rates. Specifically, let ρ > 0 be the communications longer, nodes can fit in more rounds of information exchange ratio, i.e. the fixed ratio between the rate of communications between each mini-batch round or, equivalently, between each and the rate of data acquisition. That is, ρ ≥ 1 implies search-point update. But when the mismatch factor ρ is small, nodes engage in ρ rounds of message exchanges for every the mini-batch size b needed to enable sufficiently many data-acquisition round. Similarly, ρ < 1 means there is one consensus rounds may be so large that the reduction in communications round for every 1/ρ data-acquisition rounds. subgradient noise is outstripped by the reduction in search- We ignore rounding issues for simplicity. point updates and the resulting convergence speed is sub- The mini-batching in our distributed problem proceeds as optimum. In this context, our main goal is specification of follows. Each mini-batch round spans b ≥ 1 data-acquisition sufficient conditions for ρ such that the resulting convergence rounds and coincides with the search-point update round, i.e., speeds of the proposed distributed SA techniques are optimum. each node i updates its search point at the end of a mini-batch round. In each mini-batch round s, each node i uses its current III.DISTRIBUTED STOCHASTIC APPROXIMATION search point x (s) to compute an average of oracle outputs i MIRROR DESCENT sb 1 X In this section we present our first distributed SA algorithm, θ (s) = G(x (s), ξ (t)). (14) i b i i called distributed stochastic approximation mirror descent (D- t=(s−1)b+1 SAMD). This algorithm is based upon stochastic approximated This is followed by each node computing a new search point mirror descent, which is a generalized version of stochastic xi(s+1) using θi(s) and messages received from its neighbors. subgradient descent. 
Before presenting D-SAMD, we review In order to analyze the mini-batching distributed SA tech- a few concepts that underlie mirror descent. niques proposed in this work, we need to generalize the usual averaging property of variances to non-Euclidean norms. n A. Stochastic Mirror Descent Preliminaries Lemma 1: Let z1,..., zk be i.i.d. random vectors in R 2 2 with E[zi] = 0 and E[kzik∗] ≤ σ . There exists a constant Stochastic mirror descent, presented in [8], is a generaliza- C∗ ≥ 0, which depends only on k·k and h·, ·i, such that tion of stochastic subgradient descent. This generalization is  2 characterized by a distance-generating function ω : X → R k 2 1 X C∗σ that generalizes the Euclidean norm. The distance-generating E z ≤ . (15)  k i  k function must be continuously differentiable and strongly i=1 ∗ convex with modulus α, i.e. Proof: This follows directly from the property of norm 2 equivalence in finite-dimensional spaces. h∇ω(x) − ∇ω(y), x − yi ≥ α kx − yk , ∀ x, y ∈ X. (17) In order to illustrate Lemma 1, notice that when k·k = k·k , 1 In the convergence analysis, we will require two measures of i.e., the ` norm, and h·, ·i is the standard inner product, the 1 the “radius” of X that will arise in the convergence analysis, associated dual norm is the ` norm: k·k = k·k . Since ∞ ∗ ∞ defined as follows: kxk2 ≤ kxk2 ≤ n kxk2 , we have C = n in this case. ∞ 2 ∞ ∗ r Thus, depending on the norm in use, the extent to which q 2 Dω , max ω(x) − min ω(x), Ωω , Dω. averaging reduces subgradient noise variance may depend on x∈X x∈X α the dimension of the optimization space. The distance-generating function induces the prox function, In the following, we will use the notation zi(s) , θi(s) − V : X × X → 2 2 or the Bregman divergence R+, which g(xi(s)). Then, E[kzi(s)k∗] ≤ C∗σ /b. We emphasize that generalizes the Euclidean distance: the subgradient noise vectors zi(s) depend on the search points xi(s); we suppress this notation for brevity. V (x, z) = ω(z) − (ω(x) + h∇ω(x), z − xi). (18) 5

Fig. 1: The different time counters for the rate-limited framework of this paper, here with ρ = 1/2, b = 4, and r = 2. In this particular case, over T = 8 total data-acquisition rounds, each node receives 8 data samples and computes 8 (sub)gradients; it averages those (sub)gradients into S = 2 mini-batch (sub)gradients; it then engages in r = 2 rounds of consensus averaging to produce S = 2 locally averaged (sub)gradients; each of those (sub)gradients is finally used to update the search points twice, one for each 1 ≤ s ≤ S. Note that, while not explicitly shown in the figure, the search point xi(1) is used for all computations spanning the data-acquisition rounds 5 ≤ t ≤ 8.

The prox function V (x, ·) inherits strong convexity from ω(·), Euclidean setting. One can also show that when the distance- but it need not be symmetric or satisfy the triangle inequality. generating function ω(x) is an `p norm for p > 1, the resulting n We define the prox mapping Px : R → X as prox mapping is 1-Lipschitz continuous in x and y as required. However, not all Bregman divergences satisfy this condition. Px(y) = arg minhy, z − xi + V (x, z). (19) One can show that the K-L divergence results in a prox z∈X mapping that is not Lipschitz. Consequently, while we present The prox mapping generalizes the usual subgradient descent our results in terms of general prox functions, we emphasize step, in which one minimizes the local linearization of the that the results do not apply in all cases.1 One can think of objective function regularized by the Euclidean distance of the results primarily in the setting of Euclidean (accelerated) the step taken. In (centralized) stochastic mirror descent, one stochastic subgradient descent—for which case they are guar- computes iterates of the form anteed to hold—with the understanding that one can check on a case-by-case basis to see if they hold for a particular x(s + 1) = P (γ g(s)) x(s) s non-Euclidean setting. where g(s) is a stochastic subgradient of ψ(x(s)), and γs is a step size. These iterates have the same form as (stochastic) 1 2 B. Description of D-SAMD subgradient descent; indeed, choosing ω(x) = 2 kxk2 as well as h·, ·i and k·k to be the usual ones results in subgradient Here we present in detail D-SAMD, which generalizes descent iterations. stochastic mirror descent to the setting of distributed, stream- One can speed up convergence by choosing ω(·) to match ing data. In D-SAMD, nodes carry out iterations similar to the structure of X and ψ. For example, if the optimiza- stochastic mirror descent as presented in [8], but instead tion space X is the unit simplex over Rn, one can choose of using local stochastic subgradients associated with the P local search points, they carry out approximate consensus ω(x) = i xi log(xi) and k·k to be the `1 norm. This leads to V (x, z) = D(z||x)), where D(·||·) denotes the Kullback- to estimate the average of stochastic subgradients across the Leibler (K-L) divergence between z and x. This choice speeds network. This reduces the subgradient noise at each node and   up convergence on the order of O pn/ log(n) over using speeds up convergence. the Euclidean norm throughout. Along a similar vein, when ψ Let W be a symmetric, doubly-stochastic matrix consistent includes an ` regularizer to promote a sparse minimizer, one with the network graph G, i.e., [W]ij , wij = 0 if (i, j) ∈/ E. 1 T can speed up convergence by choosing ω(·) to be a p-norm Further suppose that W − 11 /n has spectral radius strictly with p = log(n)/(log(n) − 1). less than one, i.e. the second-largest eigenvalue magnitude is In order to guarantee convergence for D-SAMD, we need strictly less than one. This condition is guaranteed by choosing to restrict further the distance-generating function ω(·). In the diagonal elements of W to be strictly greater than zero. particular, we require that the resulting prox mapping be 1- Next, we focus on the case of constant step size γ and set it 2 0 0 n as 0 < γ ≤ α/(2L). For simplicity, we suppose that there is Lipschitz continuous in x, y pairs, i.e., ∀ x, x , y, y ∈ R , a predetermined number of data-acquisition rounds T , which 0 0 kPx(y) − Px0 (y )k ≤ kx − x k + ky − y0k . 
leads to S = T/b mini-batch rounds. We detail the steps of D-SAMD in Algorithm 1. Further, in Figure 1 we illustrate This condition is in addition to the conditions one usually the data acquisition round, mini-batch round, communication places on the Bregman divergence for stochastic optimization; we will use it to guarantee that imprecise gradient averages 1We note further that it is possible to relax the constraint that the best make a bounded perturbation in the iterates of stochastic mir- Lipschitz constant be no larger than unity. This worsens the scaling laws— ror descent. The condition holds whenever the prox mapping in particular, the required communications ratio ρ grows in T rather than decreases—and we omit this case for brevity’s sake. is the projection of a 1-Lipschitz function of x, y onto X. For 2It is shown in [8] that a constant step size is sufficient for order-optimal example, it is easy to verify that this condition holds in the performance, so we adopt such a rule here. 6

Algorithm 1 Distributed stochastic approximation mirror de- recalling that the subgradient noise zi(s) is defined with scent (D-SAMD) respect to the mini-batched subgradient θi(s). Also define Require: Doubly-stochastic matrix W, step size γ, number m 1 X of consensus rounds r, batch size b, and stream of mini- g(s) g(x (s)), G(s) [g(s),..., g(s)], and , m i , batched subgradients θi(s). i=1 1: for i = 1 : m do m 1 X 2: x (1) ← min ω(x) . Initialize search points z(s) zi(s), Z(s) [z(s), ··· , z(s)], i x∈X , m , 3: end for i=1 4: for s = 1 : S do where the matrices G(s), Z(s) ∈ Rn×n have identical 5: h0(s) ← θ (s) . i i Get mini-batched subgradients columns. Finally, define the matrices 6: for q = 1 : r, i = 1 : m do q q−1 7: h (s) ← P w h (s) . Consensus rounds r i j∈Ni ij j E(s) , G(s)W − G(s) and 8: end for ˜ r Z(s) , Z(s)W − Z(s) 9: for i = 1 : m do r 10: xi(s + 1) ← Pxi(s)(γhi (s)) . Prox mapping of average consensus error on the subgradients and subgradient av 1 Ps 11: xi (s + 1) ← s k=1 xi(k) . Average iterates noise, respectively. Then, the following facts are true. First, 12: end for one can write H(s) as 13: end for av H(s) = G(s) + E(s) + Z(s) + Z˜(s). (20) return xi (S + 1), i = 1, . . . , m. Second, the columns of Z(s) satisfy round, and search point update counters and their role in the C σ2 E[kz(s)k2] ≤ ∗ . (21) D-SAMD algorithm. ∗ mb In D-SAMD, each node i initializes its iterate at the mini- ˜ mizer of ω(·), which is guaranteed to be unique due to strong Finally, the ith columns of E(s) and Z(s), denoted by ei(s) ˜ convexity. At each mini-batch round s, each node i obtains and zi(s), respectively, satisfy its mini-batched subgradient and nodes engage in r rounds 2p r kei(s)k∗ ≤ max m C∗λ2 kgj(s) − gk(s)k (22) of averaging consensus to produce the (approximate) average j,k ∗ r subgradients hi (s). Then, each node i takes a mirror prox step, and r 2r 2 2 using hi (s) instead of its own mini-batched estimate. Finally, λ m C σ E[kz˜ (s)k2] ≤ ∗ , (23) each node keeps a running average of its iterates, which is i ∗ b well-known to speed up convergence [28]. where we have used gj(s) as a shorthand for g(xj(s)). The next step in the convergence analysis is to bound the C. Convergence Analysis distance between iterates at different nodes. As long as iterates The convergence rate of D-SAMD depends on the bias and are not too far apart, the subgradients computed at different r variance of the approximate subgradient averages hi (s). In nodes have sufficiently similar means that averaging them principle, averaging subgradients together reduces the noise together reduces the overall subgradient noise. variance and speeds up convergence. However, because aver- Lemma 3: Let as , maxi,j kxi(s) − xj(s)k. The moments aging consensus using only r communications rounds results of as follow: in approximate averages, each node takes a slightly different √ mirror prox step and therefore ends up with a different iterate. M + σ/ b 2p r s E[as] ≤ ((1 + αm C∗λ2) − 1), (24) At each mini-batch round s, nodes then compute subgradients L √ at different search points, leading to bias in the averages hr(s). (M + σ/ b)2 p i E[a2] ≤ ((1 + αm2 C λr)s − 1)2. (25) This bias accumulates at a rate that depends on the subgradient s L2 ∗ 2 noise variance, the topology of the network, and the number Now, we bound D-SAMD’s expected gap to optimality. of consensus rounds per mini-batch round. 
Theorem 1: For D-SAMD, the expected gap to optimality Therefore, the first step in bounding the convergence speed at each node i satisfies of D-SAMD is to bound the bias and the variance of the r subgradient estimates hi (s), which we do in the following av ∗ E[ψ(xi (S + 1))] − ψ ≤ lemma. r 2LΩ2 2(4M2 + 2∆2 ) rα Ξ D Lemma 2: Let 0 ≤ λ2 < 1 denote the magnitude of ω + S + S ω , (26) the second-largest (ordered by magnitude) eigenvalue of W. αS αS 2 L Define the matrices where H(s) [hr(s),..., hr (s)],   , 1 m σ 2p r Ξs , M + √ (1 + m C∗λ2)× G(s) , [g(x1(s)),..., g(xm(s))], and b 2p r s Z(s) , [z1(s),..., zm(s)], ((1 + αm C∗λ2) − 1) + 2M (27) 7 and Then, the condition on ρ essentially ensures that b = O(T 1/2)  2 as specified in [24]. 2 σ 4 2r ∆s , 2 M + √ (1 + m C∗λ2 )× We note that Corollary 1 imposes a strict requirement on M, b the Lipschitz constant of the non-smooth part of ψ. Essentially 2p r s 2 2 2 ((1 + αm C∗λ2) − 1) + 4C∗σ /(mb) the non-smooth part must vanish as m, T , or σ becomes large. 2r 2 2 + 4λ2 C∗σ m /b + 4M (28) This is because the contribution of h(x) to the convergence rate depends only on the number of iterations taken, not on quantify the moments of the effective subgradient noise. the noise variance. Reducing the effective subgradient noise The convergence rate proven in Theorem 1 is akin to that 2 via mini-batching has no impact on this contribution, so we provided in [8], with ∆s taking the role of the subgradient require the Lipschitz constant M to be small to compensate. noise variance. A crucial difference is the presence of the Finally, we note that Corollary 1 dictates the relationship final term involving Ξs. In [8], this term vanishes because between the size of the network and the number of data the intrinsic subgradient noise has zero mean. However, the samples obtained at each node. Leaving the terms besides equivalent gradient error in D-SAMD does not have zero mean m and T constant, Corollary 1 requires T = Ω(m), i.e. in general. As nodes’ iterates diverge, their subgradients differ, the number of nodes in the network should scale no faster and the nonlinear mapping between iterates and subgradients than the number of data samples processed per node. This results in noise with nonzero mean. is a relatively mild condition for big data applications; many The critical question is how fast communication needs to be applications involve data streams that are large relative to the for order-optimum convergence speed, i.e., the convergence size of the network. Furthermore, ignoring the log(mT ) term speed that one would obtain if nodes had access to other and assuming λ2 and σ to be fixed, Corollary 1 indicates nodes’ subgradient estimates at each round. After S mini- that a communication ratio of ρ = Ωpm/T  is sufficient batch rounds, the network has processed mT data samples. for order optimality; i.e., nodes need to communicate at least Centralized mirror descent, with access to all mT data samples Ωpm/T  times per data sample. This means that if T scales in sequence, achieves the convergence rate [8] faster than Ω(m) then the required communications ratio  L M + σ  approaches zero in this case as m, T → ∞. In particular, fast O(1) + √ . mT mT stochastic learning is possible in expander graphs, for which the spectral gap 1 − λ is bounded away from zero, even in The final term dominates the error as a function of m and T if 2 communication rate-limited scenarios. For graph families that σ2 > 0. 
In the following corollary we derive conditions under are poor expanders, however, the required communications which the convergence rate of D-SAMD matches this term. ratio depends on the scaling of λ as a function of m. Corollary 1: The optimality gap for D-SAMD satisfies 2   av ∗ M + σ IV. ACCELERATED DISTRIBUTED STOCHASTIC E[ψ(xi (S + 1))] − ψ = O √ , (29) mT APPROXIMATION MIRROR DESCENT provided the mini-batch size b, the communications ratio ρ, In this section, we present accelerated distributed stochastic the number of users m, and the Lipschitz constant M satisfy approximation mirror descent (AD-SAMD), which distributes the accelerated stochastic approximation mirror descent pro-  log(mT )  σT 1/2  b = Ω 1 + , b = O , posed in [8]. The centralized version of accelerated mirror 1/2 ρ log(1/λ2) m descent achieves the optimum convergence rate of  m1/2 log(mT )   m  ρ = Ω ,T = Ω , and  L M + σ2  σT 1/2 log(1/λ ) σ2 O(1) + √ . 2 T 2   1 1  T M = O min , √ . m mσ2T Consequently, we will see that the convergence rate of AD- SAMD has 1/S2 as its first term. This faster convergence in D. Discussion S allows for more aggressive mini-batching, and the resulting conditions for order-optimal convergence are less stringent. Corollary 1 gives new insights into influences of the com- munications and streaming rates, network topology, and mini- batch size on the convergence rate of distributed stochastic A. Description of AD-SAMD learning. In [24], a mini-batch size of b = O(T 1/2) is The setting for AD-SAMD is the same as in Section III. We prescribed—which is sufficient whenever gradient averages again suppose a distance function ω : X → R, its associated are perfect—and in [26] the number of imperfect consensus prox function/Bregman divergence V : X × X → R, and the n rounds needed to facilitate the mini-batch size b prescribed resulting (Lipschitz) prox mapping Px : R → X. in [24] is derived. By contrast, we derive a mini-batch As in Section III-B, we suppose a mixing matrix W ∈ condition sufficient to drive the effective noise variance to Rm×m that is symmetric, doubly stochastic, consistent with O(σ2/(mT )) while taking into consideration the impact of G, and has nonzero spectral gap. The main distinction be- imperfect subgradient averaging. This condition depends not tween accelerated and standard mirror descent is the way one 2 only on T but also on m, ρ, λ2, and σ —indeed, for all else averages iterates. Rather than simply average the sequence of constant, the optimum mini-batch size is merely Ω(log(T )). iterates, one maintains several distinct sequences of iterates, 8

Algorithm 2 Accelerated distributed stochastic approximation Theorem 2: For AD-SAMD, there exist step size sequences mirror descent (AD-SAMD) βs and γs such that the expected gap to optimality satisfies Require: Doubly-stochastic matrix W, step size sequences 8LD2 γs, βs, number of consensus rounds r, batch size b, and ag ∗ ω,X E[Ψ(xi (S + 1))] − Ψ ≤ 2 + stream of mini-batched subgradients θi(s). αS r 2 r 1: for i = 1 : m do 4M + ∆S 32 md ag 4Dω,X + Dω,X ΞS, (30) 2: xi(1), xi (1), xi (1) ← minx∈X ω(x) . Initialize αS α search points where 3: end for √ 4: for s = 1 : S do 2 2 2p r τ 2 ∆τ = 2(M + σ/ b) ((1 + 2γτ m C∗Lλ2) − 1) + 5: for i = 1 : m do 2 md −1 −1 ag 4C∗σ 2r 2 6: xi (s) ← βs xi(s) + (1 − βs )xi (s) (λ2 m + 1/m) + 4M. 0 b 7: hi (s) ← θi(s) . Get mini-batched subgradients 8: end for and √ 9: for q = 1 : r, i = 1 : m do p 2 r q P q−1 Ξτ = (M + σ/ b)(1 + C∗m λ2)× 10: hi (s) ← j∈N wijhj (s) . Consensus rounds i 2p r τ 11: end for ((1 + 2γτ m C∗Lλ2) − 1) + 2M. 12: for i = 1 : m do r As with D-SAMD, we study the conditions under which 13: xi(s + 1) ← Pxi(s)(γshi (s)) . Prox mapping ag ag AD-SAMD achieves order-optimum convergence speed. The 14: x (s + 1) ← β−1x (s + 1) + (1 − β−1)x (s) i s i s i centralized version of accelerated mirror descent, after pro- 15: end for cessing the mT data samples that the network sees after S 16: end for ag mini-batch rounds, achieves the convergence rate return xi (S + 1), i = 1, . . . , m.  L M + σ  O(1) + √ . (mT )2 mT carefully averaging them along the way. This involves two This is the optimum convergence rate under any circumstance. sequences of step sizes βs ∈ [1, ∞) and γs ∈ R, which are In the following corollary, we derive the conditions under not held constant. Again we suppose that the number of mini- which the convergence rate matches the second term, which batch rounds S = T/b is predetermined. We detail the steps usually dominates the error when σ2 > 0. of AD-SAMD in Algorithm 2. md ag Corollary 2: The optimality gap satisfies The sequences of iterates xi(s), xi (s), and xi (s) are M + σ  interrelated in complicated ways; we refer the reader to [8] E[ψ(xag(S + 1)] − ψ∗ = O √ , for an intuitive explanation of these iterations. i mT provided B. Convergence Analysis  log(mT )  σ1/2T 3/4  As with D-SAMD, the convergence analysis relies on b = Ω 1 + , b = O , ρ log(1/λ ) 1/4 bounds on the bias and variance of the averaged subgradients. 2 m  m1/4 log(mT )  m1/3  To this end, we note first that Lemma 2 also holds for AD- ρ = Ω ,T = Ω , 3/4 2 and SAMD, where H(s) has columns corresponding to noisy σT log(1/λ2) σ subgradients evaluated at xmd(s). Next, we bound the distance   1 1  i M = O min , √ . between iterates at different nodes. This analysis is somewhat m mσ2T more complicated due to the relationships between the three iterate sequences. C. Discussion Lemma 4: Let The crucial difference between the two schemes is that ag ag 2 as max xi (s) − xj (s) , AD-SAMD has a convergence rate of 1/S in the absence , i,j of noise and non-smoothness. This faster term, which is bs max kxi(s) − xj(s)k , and , i,j often negligible in centralized mirror descent, means that md md AD-SAMD tolerates more aggressive mini-batching without cs , max xi (s) − xj (s) . i,j impact on the order of the convergence rate. As a result, while

Then, the moments of as, bs, and cs satisfy: the condition on the mini-batch size b is the same in terms of ρ, √ the condition on ρ is relaxed by 1/4 in the exponents of m and M+σ/ b 2p r s T . This is because the condition b = O(T 1/2), which holds E[as],E[bs],E[cs] ≤ ((1+2γsm C∗Lλ2) −1), L √ for standard stochastic SO methods, is relaxed to b = O(T 3/4) 2 2 2 2 (M+σ/ b) 2p r s 2 for accelerated stochastic mirror descent. E[a ],E[b ],E[c ] ≤ ((1+2γsm C∗Lλ ) −1) . s s s L2 2 Similar to Corollary 1, Corollary 2 prescribes a relationship Now, we bound the expected gap to optimality of the AD- between m and T , but the relationship for AD-SAMD is SAMD iterates. T = Ω(m1/3), holding all but m, T constant. This again 9 is due to the relaxed mini-batch condition b = O(T 3/4) for Finally, we consider a communications-constrained adap- accelerated stochastic mirror descent. Furthermore, ignoring tation of distributed gradient descent (DGD), introduced in the log term, Corollary 2 indicates that a communications ratio [13], where local subgradient updates are followed by a single  m1/4  round of consensus averaging on the search points xi(s). ρ = Ω T 3/4 is needed for well-connected graphs such as expander graphs. In this case, as long as T grows faster than DGD implicitly supposes that ρ = 1. To handle the ρ < 1 the cube root of m, order-optimum convergence rates can be case, we consider two adaptations: naive DGD, in which obtained even for small communications ratio. Thus, the use data samples that arrive between communications rounds are of accelerated methods increases the domain in which order simply discarded, and mini-batched DGD, in which nodes optimum rate-limited learning is guaranteed. compute local mini-batches of size b = 1/ρ, take gradient updates with the local mini-batch, and carry out a consensus V. NUMERICAL EXAMPLE:LOGISTIC REGRESSION round. While it is not designed for the communications rate- To demonstrate the scaling laws predicted by Corollaries limited scenario, DGD has good performance in general, so it 1 and 2 and to investigate the empirical performance of D- represents a natural alternative against which to compare the SAMD and AD-SAMD, we consider supervised learning via performance of D-SAMD and AD-SAMD. binary logistic regression. Specifically, we assume each node A. Fully Connected Graphs observes a stream of pairs ξi(t) = (y(t), l(t)) of data points d yi(t) ∈ R and their labels li(t) ∈ {0, 1}, from which it learns First, we consider the simple case of a fully connected a classifier with the log-likelihood function graph, in which E is the set of all possible edges, and the obvious mixing matrix is W = 11T /n, which has λ = 0. F (x, x , y, l) = l(yT x + x ) − log(1 + exp(yT x + x )) 2 0 0 0 This represents a best-case scenario in which to validate the d where x ∈ R and x0 ∈ R are regression coefficients. theoretical claims made above. We choose ρ = 1/2 to examine The SO task is to learn the optimum regression coefficients the regime of low communications ratio, and we let m and d √ x, x0. In terms of the framework of this paper, Υ = (R × T grow according to two regimes: T = m, and T = m, {0, 1}), and X = Rd+1 (i.e., n = d + 1). We use the Eu- which are the regimes in which D-SAMD and AD-SAMD clidean norm, inner product, and distance-generating function are predicted to give order-optimum performance, respectively. to compute the prox mapping. 
The convex objective function The constraint on mini-batch size per Corollaries 1 and 2 is is the negative of the log-likelihood function, averaged over trivial, so we take b = 2 to ensure that nodes can average the unknown distribution of the data, i.e., each mini-batch gradient via (perfect) consensus. We select the following step-size parameter γ: 0.5 and 2 for (local and ψ(x) = −E [F (x, x , y, l)]. y,l 0 centralized) stochastic mirror descent (MD) and accelerated Minimizing ψ is equivalent to performing maximum likelihood stochastic mirror descent (A-MD), respectively; 5 for both estimation of the regression coefficients [29]. variants of DGD; and 5 and 20 for D-SAMD and AD-SAMD, We examine performance on synthetic data so that there respectively.4 These values were selected via trial-and-error to exists a “ground truth” distribution with which to compute give good performance for all algorithms; future work involves ψ(x) and evaluate empirical performance. We suppose that the use of adaptive step size rules such as AdaGrad and the data follow a Gaussian distribution. For li(t) ∈ {0, 1}, ADAM [30], [31]. 2 we let yi(t) ∼ N (µli(t), σr I), where µl(t) is one of two In Figure 2(a) and Figure 2(b), we plot the performance 2 3 mean vectors, and σr > 0 is the noise variance. For this averaged over 1200 and 2400 independent instances of the experiment, we draw the elements µ0 and µ1 randomly from problem, respectively.√ We also plot the order-wise theoretical 2 the standard normal distribution, let d = 20, and choose σr = performance 1/ mT , which has a constant slope on the log- 2. We consider several network topologies, as detailed in the log axis. As expected, the distributed methods significantly next subsections. outperform the local methods. The performance of the dis- We compare the performance of D-SAMD and AD-SAMD tributed methods is on par with the asymptotic theoretical against several other schemes. As a best-case scenario, we predictions, as seen by the slope of the performance curves, consider centralized mirror descent, meaning that at each data- with the possible exception of D-SAMD for T = m. However, acquisition round t all m data samples and their associated we observe that D-SAMD performance√ is at least as good as gradients are available at a single machine, which carries out predicted by theory for T = m, a regime in which optimality stochastic mirror descent and accelerated stochastic mirror is not guaranteed for D-SAMD. This suggests the possibility descent. These algorithms naturally have the best average that the requirement that T = Ω(m) for D-SAMD is an artifact performance. As a baseline, we consider local (accelerated) of the analysis, at least for this problem. stochastic mirror descent, in which nodes simply perform mirror descent on their own data streams without collabora- B. Expander Graphs tion. This scheme does benefit from an insensitivity to the For a more realistic setting, we consider expander graphs, communications ratio ρ, and no mini-batching is required, but which are families of graphs that have spectral gap 1 − λ it represents a minimum standard for performance in the sense 2 that it does not require collaboration among nodes. 4While the accelerated variant of stochastic mirror descent makes use of two sequences of step sizes, βs and γs, these two sequences can be expressed 3 2 2 The variance σr is distinct from the resulting gradient noise variance σ . as a function of a single parameter γ; see, e.g., the proof of Theorem 2. 10

100 100

10-1 10-1 * * ψ ψ

-2

(x) - (x) - 10 ψ ψ 10-2

10-3 D-SAMD D-SAMD Optimality gap: 10-3 AD-SAMD Optimality gap: AD-SAMD DGD DGD Mini-batch DGD Mini-batch DGD Local MD 10-4 Local MD Local A-MD Local A-MD Centralized MD Centralized MD 10-4 Centralized A-MD Centralized A-MD Constant × T-1 Constant × T-1

101 102 103 101 102 103 T T (a) T = m (a) T = m 100 100

10-1 10-1 * * ψ ψ

-2

(x) - (x) - 10 ψ ψ 10-2

10-3 D-SAMD D-SAMD Optimality gap: 10-3 AD-SAMD Optimality gap: AD-SAMD DGD DGD Mini-batch DGD Mini-batch DGD Local MD 10-4 Local MD Local A-MD Local A-MD Centralized MD Centralized MD 10-4 Centralized A-MD Centralized A-MD Constant × T-1.5 Constant × T-1.5

101 101 T T √ √ (b) T = m (b) T = m Fig. 2: Performance of different schemes for a fully connected graph Fig. 3: Performance of different schemes for 6-regular expander on log-log scale. The dashed lines (without markers) in (a) and graphs on log-log scale, for ρ = 1/2. Dashed lines once again (b) correspond to the asymptotic performance upper bounds for D- represent asymptotic theoretical upper bounds on performance. SAMD and AD-SAMD predicted by the theoretical analysis.

theoretical predictions. The performance of DGD, on the other bounded away from zero as m → ∞. In particular, we use hand, depends on the regime: For T = m, it appears√ to 6-regular graphs, i.e., regular graphs in which each node has have order-optimum performance, whereas for T = m six neighbors, drawn uniformly from the ensemble of such it has suboptimum performance on par with local methods. graphs. Because the spectral gap is bounded away from zero The reason for the dependency on regime is not immediately for expander graphs, one can more easily examine whether clear and suggests the need for further study into DGD-style performance of D-SAMD and AD-SAMD agrees with the methods in the case of rate-limited networks. ideal scaling laws discussed in Corollaries 1 and 2. At the same time, because D-SAMD and AD-SAMD make use of C. Erdos-Renyi˝ Graphs imperfect averaging, expander graphs also allow us to examine Finally, we consider Erdos-Renyi˝ graphs, in which a random non-asymptotic behavior of the two schemes. Per Corollaries 1 fraction (in this case 0.1) of possible edges are chosen. These and 2, we choose b = 1 log(mT ) . While this choice is guar- graphs are not expanders, and their spectral gaps are not 10 ρ log(1/λ2) anteed to be sufficient for optimum asymptotic performance, bounded. Therefore, order-optimum performance is not easy we chose the multiplicative constant 1/10 via trial-and-error to guarantee, since the conditions on the rate and the size of to give good non-asymptotic performance. the network depend on λ2, which is not guaranteed to be well behaved. We again take ρ = 1/2, consider the regimes T = m In Figure 3 we plot the performance averaged over 600 √ problem instances. We again take ρ = 1/2, and consider and T = m, and again we choose b = 1 log(mT ) . The √ 10 ρ log(1/λ2) the regimes T = m and T = m. The step sizes are the step sizes are chosen to be the same as for expander graphs same as in the previous√ subsection except that γ = 2.5 for in both regimes. D-SAMD when T = m, γ = 28 for AD-SAMD√ when Once again, we observe a clear distinction in performance T = m, and γ = 8 for AD-SAMD when T = m. Again, we between local and distributed methods; in particular, all see that AD-SAMD and D-SAMD outperform local methods, distributed methods (including DGD) appear to show near- while their performance is roughly in line with asymptotic optimum performance in both regimes. However, as expected 11

100 APPENDIX A. Proof of Lemma 2 10-1 The first claim follows immediately from the definitions *

ψ of the constituent matrices. The second claim follows from

-2 (x) - 10 Lemma 1. To establish the final claim, we first bound the ψ norm of the columns of E(s):

10-3 r D-SAMD E(s) , G(s)W − G(s) Optimality gap: AD-SAMD DGD r T Mini-batch DGD = G(s)(W − 11 /n), 10-4 Local MD r T Local A-MD = (G(s) − G(s))(W − 11 /n), (31) Centralized MD Centralized A-MD -1 Constant × T where the first equality follows from G(s) = G(s)11T /n by 101 102 103 T definition, and the second equality follows from the fact that W (a) T = m the dominant eigenspace of a column-stochastic matrix is 0 ¯ 10 the subspace of constant vectors, thus the rows of G(s) lie in the null space of Wr − 11T /n. We bound the norm of the columns of E(s) via the Frobenius norm of the entire matrix: 10-1 p

* kei(s)k∗ ≤ C1 kE(s)kF ψ

-2 p r T (x) - 10 = C (G(s) − G(s))(W − 11 /n) ψ 1 F p r ≤ C1mλ2 G(s) − G(s) F 10-3 2p r D-SAMD ≤ max m C∗λ kgj(s) − gk(s)k , Optimality gap: 2 AD-SAMD j,k ∗ DGD Mini-batch DGD 10-4 Local MD Local A-MD where C1 bounds the gap between the Euclidean and dual Centralized MD Centralized A-MD norms, C2 bounds the gap in the opposite direction, and thus Constant × T-1.5 C∗ ≤ C1C2. Then, we bound the variance of the subgradient 101 noise. Similar to the case of E(s), we can rewrite T √ (b) T = m Z˜(s) = Z(s)(Wr − 11T /n). Fig. 4: Performance of different schemes on Erdos-Renyi˝ graphs, for ρ = 1/2, displayed using log-log scale. We bound the expected squared norm of each column z˜i(s) in terms of the expected Frobenius norm: 2 E[k˜z (s)k2] ≤ C E[ Z(s)(Wr − 11T /n) ] the performance is somewhat more volatile than in the case i ∗ 1 F 2r 2 of expander graphs, especially for the case of T = m, and it ≤ mλ2 C1E[kZ(s)kF ] is possible that the trends seen in these plots will change as 2r 2 2 ≤ λ2 m C∗σ /b. T and m increase.

B. Proof of Lemma 3 VI.CONCLUSION By the Lipschitz condition on Px(y), We have presented two distributed schemes, D-SAMD and

AD-SAMD, for convex stochastic optimization over networks as+1 = max Px (s)(γhi(s)) − Px (s)(γhj(s)) i,j i j of nodes that collaborate via rate-limited links. Further, we ≤ max kxi(s) − xj(s)k + γ khi(s) − hj(s)k have derived sufficient conditions for the order-optimum con- i,j ∗ vergence of D-SAMD and AD-SAMD, showing that acceler- α ≤ max kxi(s) − xj(s)k + khi(s) − hj(s)k , ated mirror descent provides a foundation for distributed SO i,j 2L ∗ that better tolerates slow communications links. These results where the final inequality is due to the constraint γ ≤ α/(2L). characterize relationships between network communications Applying Lemma 2, we obtain speed and the convergence speed of stochastic optimization. A limitation of this work is that we are restricted to settings α as+1 ≤ as + max kei(s) − ej(s) + z˜i(s) − z˜j(s)k∗ , in which the prox mapping is Lipschitz continuous, which ex- i,j 2L cludes important Bregman divergences such as the Kullbeck- which, applying (22), yields Liebler divergence. Further, the conditions for optimum con- √ 2 r vergence restrict the Lipschitz constant of non-smooth compo- αm C∗λ2 as+1 ≤ as + max kgi(s) − gj(s)k + nent of the objective function to be small. Future work in this i,j 2L ∗ α direction includes study of the limits on convergence speed for kz˜ (s) − z˜ (s)k . more general divergences and composite objective functions. 2L i j ∗ 12

Applying (10), we get Define

2p r α as+1 ≤ as + αm C∗λ2(as + M/L) + max kz˜i(s)k∗ δi(τ) , hi(τ) − g(xi(τ)) i L 2p r α  2p r  ηi(τ) , g¯(τ) + ei(τ) − g(xi(τ)) = δi(τ) − z¯(τ) − z˜i(τ) = (1+αm C∗λ2)as+ m C∗λ2M+kz˜i∗ (s)k∗ , L ζ (τ) γhδ (τ), x∗ − x (τ)i. (32) i , i i

∗ Applying Lemmas 2 and 3, we bound E[kηi(τ)k∗]: where i ∈ V is the index that maximizes kz˜ik∗. The final expression (32) gives the recurrence relationship E[kη (τ)k ]=E [kg(τ) + ei(τ) − g(xi(τ))k ] that governs the divergence of the iterates, and it takes the i ∗ ∗ ≤E[kg(τ) − g(x (τ))k ] + E[ke (τ)k ] form of a standard first-order difference equation. Recalling i ∗ i ∗ 2p r the base case a1 = 0, we obtain the solution ≤LE[aτ ] + Lm C∗λ2E[aτ ] + 2M s−1 2p r s−1−τ ≤L(1 + m C∗λ2)E[aτ ] + 2M α X  2p r √ as ≤ 1 + αm C∗λ2 × 2p r L ≤(M+σ/ b)(1+m C∗λ )× τ=0 2   2p r s 2p r ((1+αm C∗λ2) −1)+2M. m C∗λ2M+kz˜i∗ (s)k∗ . (33)

Next, we bound the expected value of ζi(τ): To prove the lemma, we bound the moments of as using the 2 solution (33). First, we bound E[as]: ∗ E[ζi(τ)] = γE[hηi(τ) + z¯(τ) − z˜i(τ), x − xi(τ)i] " s−1 = γE[hη (τ), x∗ − x (τ)i]+ α X p i i E[a2] ≤ E (1 + αm2 C λr)s−1−τ × ∗ s L ∗ 2 γE[hz¯(τ) − z˜i(τ), x − xi(τ)i] τ=0 ∗ !2# = γE[hηi(τ), x − xi(τ)i]   2p r ≤ γE[kδ (τ)k kx∗ − x (τ)k] m C∗λ2M + kz˜i∗ (s)k∗ , i ∗ i ≤ γΞτ Ωω which we can rewrite via a change of variables as rα Ξ D ≤ τ ω , s−1 s−1 2 L  α 2 X X p E[a2] ≤ (1 + αm2 C λr)τ+q× s L ∗ 2 where the first inequality is due to the definition of the dual τ=0 q=0 norm, and the third equality follows because z¯(τ) and z˜i(τ)  2p r ∗ E (m C∗λ2M + kz˜i∗ (s − 1 − τ)k )× have zero mean and are independent of x − xi(τ). ∗ 2 2p r  Next, we bound E[kδi(τ)k∗]. By Lemma 2, (m C∗λ2M + kz˜i∗ (s − 1 − q)k∗) . h i Then, we apply the Cauchy-Schwartz inequality and (23) to 2 2 E[kδi(τ)k∗]=E kg(τ)+ei(τ)+z(τ)+z˜i(τ)−g(xi(τ))k∗ , obtain which we can bound via repeated uses of the triangle inequal- √  α 2 E[a2] ≤ λ2rC m4(M + σ/ b)2 × ity: s 2 ∗ L s−1 s−1 2 2 X 2p r τ X 2p r q E[kδi(τ)k∗] ≤ 2E[kg¯(τ) − g(xi(τ))k∗]+ (1 + αm C∗λ ) (1 + αm C∗λ ) . 2 2 2 2 2 τ=0 q=0 2E[kei(τ)k∗] + 4E[kz¯(τ)k∗ + kz˜t(τ)k∗]. Finally, we apply the geometric sum identity and simplify: Next, we apply Lemma 2: √  α 2 2 2r 4 2 2 E[as]≤λ2 C∗m (M+σ/ b) × 2 4 2r 2 L E[kδi(τ)k∗] ≤ 2L (1 + m C∗λ2 )E[aτ ]+ √ 2 2 2r 2 2 1−(1 + αm2 C λr)s  4C∗σ /(mb) + 4λ C∗σ m /b + 4M. √ ∗ 2 2 1−(1 + αm2 C λr) √ ∗ 2 Finally, we apply Lemma 3: 2 (M + σ/ b) 2p r s 2 = 2 ((1 + αm C∗λ2) − 1) . 2 2 4 2r 2 L E[kδi(τ)k∗] ≤ 2L (1 + m C∗λ2 )E[aτ ]+ 2 2 2r 2 2 2 This establishes the result for E[as], and the bound on E[as] 4C∗σ /(mb) + 4λ2 C∗σ m /b + 4M = ∆τ . follows a fortiori from Jensen’s inequality. Applying the proof of Theorem 1 of [8] to the D-SAMD iterates, we have that C. Proof of Theorem 1 av ∗ 2 The proof involves an analysis of mirror descent, with the Sγ(ψ(xi (S + 1)) − ψ ) ≤ Dω+ subgradient error quantified by Lemmas 2 and 3. As necessary, S  2  X 2γ  2 we will cite results from the proof of [8, Theorem 1] rather ζ (τ) + 4M2 + kδ (τ)k . i α i ∗ than reproduce its analysis. τ=1 13

C. Proof of Theorem 1

The proof involves an analysis of mirror descent, with the subgradient error quantified by Lemmas 2 and 3. As necessary, we will cite results from the proof of [8, Theorem 1] rather than reproduce its analysis. Define
\[
\delta_i(\tau) \triangleq h_i(\tau) - g(x_i(\tau)), \quad \eta_i(\tau) \triangleq \bar{g}(\tau) + e_i(\tau) - g(x_i(\tau)) = \delta_i(\tau) - \bar{z}(\tau) - \tilde{z}_i(\tau), \quad \zeta_i(\tau) \triangleq \gamma\langle\delta_i(\tau), x^* - x_i(\tau)\rangle.
\]
Applying Lemmas 2 and 3, we bound $\mathrm{E}[\|\eta_i(\tau)\|_*]$:
\[
\mathrm{E}[\|\eta_i(\tau)\|_*] = \mathrm{E}[\|\bar{g}(\tau) + e_i(\tau) - g(x_i(\tau))\|_*] \leq \mathrm{E}[\|\bar{g}(\tau) - g(x_i(\tau))\|_*] + \mathrm{E}[\|e_i(\tau)\|_*] \leq L\mathrm{E}[a_\tau] + Lm^2\sqrt{C_*}\,\lambda_2^r\mathrm{E}[a_\tau] + 2M
\]
\[
\leq L\left(1 + m^2\sqrt{C_*}\,\lambda_2^r\right)\mathrm{E}[a_\tau] + 2M \leq \left(M + \sigma/\sqrt{b}\right)\left(1 + m^2\sqrt{C_*}\,\lambda_2^r\right)\left(\left(1 + \alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^{\tau} - 1\right) + 2M \triangleq \Xi_\tau.
\]
Next, we bound the expected value of $\zeta_i(\tau)$:
\[
\mathrm{E}[\zeta_i(\tau)] = \gamma\mathrm{E}[\langle\eta_i(\tau) + \bar{z}(\tau) + \tilde{z}_i(\tau), x^* - x_i(\tau)\rangle] = \gamma\mathrm{E}[\langle\eta_i(\tau), x^* - x_i(\tau)\rangle] + \gamma\mathrm{E}[\langle\bar{z}(\tau) + \tilde{z}_i(\tau), x^* - x_i(\tau)\rangle]
\]
\[
= \gamma\mathrm{E}[\langle\eta_i(\tau), x^* - x_i(\tau)\rangle] \leq \gamma\mathrm{E}[\|\eta_i(\tau)\|_*\|x^* - x_i(\tau)\|] \leq \gamma\Xi_\tau\Omega_\omega \leq \sqrt{\frac{\alpha}{2}}\,\frac{\Xi_\tau D_\omega}{L},
\]
where the first inequality is due to the definition of the dual norm, and the third equality follows because $\bar{z}(\tau)$ and $\tilde{z}_i(\tau)$ have zero mean and are independent of $x^* - x_i(\tau)$.

Next, we bound $\mathrm{E}[\|\delta_i(\tau)\|_*^2]$. By Lemma 2,
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] = \mathrm{E}\left[\|\bar{g}(\tau) + e_i(\tau) + \bar{z}(\tau) + \tilde{z}_i(\tau) - g(x_i(\tau))\|_*^2\right],
\]
which we can bound via repeated uses of the triangle inequality:
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2\mathrm{E}[\|\bar{g}(\tau) - g(x_i(\tau))\|_*^2] + 2\mathrm{E}[\|e_i(\tau)\|_*^2] + 4\mathrm{E}[\|\bar{z}(\tau)\|_*^2 + \|\tilde{z}_i(\tau)\|_*^2].
\]
Next, we apply Lemma 2:
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2L^2\left(1 + m^4 C_*\lambda_2^{2r}\right)\mathrm{E}[a_\tau^2] + 4C_*\sigma^2/(mb) + 4\lambda_2^{2r}C_*\sigma^2 m^2/b + 4M^2.
\]
Finally, we apply Lemma 3 to bound $\mathrm{E}[a_\tau^2]$:
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2\left(M + \sigma/\sqrt{b}\right)^2\left(1 + m^4 C_*\lambda_2^{2r}\right)\left(\left(1 + \alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^{\tau} - 1\right)^2 + 4C_*\sigma^2/(mb) + 4\lambda_2^{2r}C_*\sigma^2 m^2/b + 4M^2 \triangleq \Delta_\tau^2.
\]
Applying the proof of Theorem 1 of [8] to the D-SAMD iterates, we have that
\[
S\gamma\left(\psi(x_i^{av}(S+1)) - \psi^*\right) \leq D_\omega^2 + \sum_{\tau=1}^{S}\left[\zeta_i(\tau) + \frac{2\gamma^2}{\alpha}\left(4M^2 + \|\delta_i(\tau)\|_*^2\right)\right].
\]
We take the expectation to obtain
\[
\mathrm{E}[\psi(x_i^{av}(S+1))] - \psi^* \leq \frac{D_\omega^2}{S\gamma} + \frac{1}{S}\sum_{\tau=1}^{S}\sqrt{\frac{\alpha}{2}}\,\frac{\Xi_\tau D_\omega}{L} + \frac{2\gamma}{\alpha S}\left(4SM^2 + \sum_{\tau=1}^{S}\Delta_\tau^2\right).
\]
Because $\{\Xi_\tau\}$ and $\{\Delta_\tau^2\}$ are increasing sequences, we can bound the preceding by
\[
\mathrm{E}[\psi(x_i^{av}(S+1))] - \psi^* \leq \frac{D_\omega^2}{S\gamma} + \frac{2\gamma}{\alpha}\left(4M^2 + \Delta_S^2\right) + \sqrt{\frac{\alpha}{2}}\,\frac{\Xi_S D_\omega}{L}.
\]
Minimizing this bound over $\gamma$ in the interval $(0, \alpha/(2L)]$, we obtain the following optimum value:
\[
\gamma^* = \min\left\{\frac{\alpha}{2L}, \sqrt{\frac{\alpha D_\omega^2}{2S\left(4M^2 + 2\Delta_S^2\right)}}\right\},
\]
which results in the bound
\[
\mathrm{E}[\psi(x_i^{av}(S+1))] - \psi^* \leq \frac{2LD_\omega^2}{\alpha S} + D_\omega\sqrt{\frac{2\left(4M^2 + 2\Delta_S^2\right)}{\alpha S}} + \sqrt{\frac{\alpha}{2}}\,\frac{\Xi_S D_\omega}{L}.
\]
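The optimum step size above is simple to evaluate in practice. The following is a minimal sketch, with assumed toy values for every parameter and an illustrative function name, that computes $\gamma^*$ and the two $\gamma$-dependent terms that it balances.

    import math

    def optimum_step_size(alpha, L, S, M, Delta_S, D_omega):
        """Evaluate gamma* = min{ alpha/(2L), sqrt(alpha*D^2 / (2S(4M^2 + 2Delta_S^2))) }
        from the proof of Theorem 1, plus the two gamma-dependent bound terms."""
        gamma = min(alpha / (2 * L),
                    math.sqrt(alpha * D_omega**2 / (2 * S * (4 * M**2 + 2 * Delta_S**2))))
        term1 = D_omega**2 / (S * gamma)                       # D_omega^2 / (S*gamma)
        term2 = (2 * gamma / alpha) * (4 * M**2 + Delta_S**2)  # (2*gamma/alpha)(4M^2 + Delta_S^2)
        return gamma, term1 + term2

    # assumed toy values, for illustration only
    print(optimum_step_size(alpha=1.0, L=10.0, S=1000, M=0.1, Delta_S=0.5, D_omega=1.0))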

D. Proof of Corollary 1

First, we make a few simplifying approximations to $\Xi_\tau$:
\[
\Xi_\tau = \left(M + \sigma/\sqrt{b}\right)\left(1 + m^2\sqrt{C_*}\,\lambda_2^r\right)\left(\left(1 + \alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^\tau - 1\right) + 2M \leq 2\left(M + \sigma/\sqrt{b}\right)\sqrt{C_*}\,m\left(1 + \alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^\tau + 2M
\]
\[
= 2\left(M + \sigma/\sqrt{b}\right)\sqrt{C_*}\,m\sum_{k=0}^{\tau}\binom{\tau}{k}\left(\alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^k + 2M \leq 2\left(M + \sigma/\sqrt{b}\right)\sqrt{C_*}\,m\sum_{k=0}^{\tau}\left(\tau\alpha m^2\sqrt{C_*}\,\lambda_2^r\right)^k + 2M
\]
\[
\leq 2\left(M + \sigma/\sqrt{b}\right)\sqrt{C_*}\,m(\tau+1)\left(\tau\alpha m^2\sqrt{C_*}\,\lambda_2^r\right) + 2M = O\left(\tau^2 m^5\lambda_2^r\left(M + \sigma/\sqrt{b}\right) + M\right),
\]
where the first inequality is trivial, the next two statements are due to the binomial theorem and the exponential upper bound on the binomial coefficient, and the final inequality is true if we constrain $S\alpha m^2\sqrt{C_*}\,\lambda_2^r \leq 1$, which is consistent with the order-wise constraints listed in the statement of this corollary. Recalling that $r \leq b\rho$, we obtain
\[
\Xi_S = O\left(S^2\lambda_2^{b\rho}m^5\left(M + \sigma/\sqrt{b}\right) + M\right). \tag{34}
\]
Now, we find the optimum scaling law on the mini-batch size $b$. To ensure $S\alpha m^2\sqrt{C_*}\,\lambda_2^r \leq 1$ for all $s$, we need
\[
b = \Omega\left(\frac{\log(mT)}{\rho\log(1/\lambda_2)}\right). \tag{35}
\]
In order to get optimum scaling we need $\Xi_S = O(\sigma/\sqrt{mT} + M)$, which yields a slightly stronger necessary condition:
\[
b = \Omega\left(1 + \frac{\log(mT)}{\rho\log(1/\lambda_2)}\right), \tag{36}
\]
where the first term is necessary because $b$ cannot be smaller than unity. Similar approximations establish that
\[
\Delta_S^2 = O\left(S^4 m^5\lambda_2^{2b\rho}\left(M^2 + \sigma^2/b\right) + \sigma^2/(mb) + \lambda_2^{2b\rho}\sigma^2 m^2/b + M^2\right).
\]
Thus, (36) ensures $\Delta_S^2 = O(\sigma^2/(mb) + M^2)$. The resulting gap to optimality scales as
\[
\mathrm{E}[\Psi(x_i^{av}(S+1))] - \Psi^* = O\left(\frac{b}{T} + \sqrt{\frac{bM^2}{T}} + \frac{\sigma}{\sqrt{mT}} + M\right).
\]
Substituting (36) yields
\[
\mathrm{E}[\Psi(x_i^{av}(S+1))] - \Psi^* = O\left(\max\left\{\frac{\log(mT)}{\rho T\log(1/\lambda_2)}, \frac{1}{T}\right\} + \max\left\{\sqrt{\frac{M^2\log(mT)}{\rho T\log(1/\lambda_2)}}, \sqrt{\frac{M^2}{T}}\right\} + \frac{\sigma}{\sqrt{mT}} + M\right).
\]
In order to achieve order optimality, we need the first term to be $O((M+\sigma)/\sqrt{mT})$, which requires
\[
b = O\left(\frac{\sigma T^{1/2}}{m^{1/2}}\right), \quad m = O(\sigma^2 T), \quad \rho = \Omega\left(\frac{m^{1/2}\log(mT)}{\sigma T^{1/2}\log(1/\lambda_2)}\right).
\]
Finally, we also need the second and fourth terms to be $O((M+\sigma)/\sqrt{mT})$, which requires
\[
M = O\left(\min\left\{\frac{1}{m}, \frac{1}{\sqrt{m\sigma^2 T}}\right\}\right). \tag{37}
\]
This establishes the result.
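These scaling laws translate directly into a parameter schedule. The following is a minimal sketch in which all constants are suppressed (so the returned values are order-wise only) and all inputs are assumed toy values; it evaluates the mini-batch floor from (36), the mini-batch ceiling, and the consensus-rate floor derived above. A network can sustain order-optimum convergence only when the floor does not exceed the ceiling.

    import math

    def dsamd_schedule(m, T, sigma, lam2, rho):
        """Order-wise parameter checks from Corollary 1 (constants suppressed)."""
        # b = Omega(1 + log(mT)/(rho*log(1/lambda_2))), condition (36)
        b_min = max(1.0, math.log(m * T) / (rho * math.log(1 / lam2)))
        # b = O(sigma*T^{1/2}/m^{1/2}), largest batch size allowing order optimality
        b_max = sigma * math.sqrt(T) / math.sqrt(m)
        # rho = Omega(m^{1/2}*log(mT)/(sigma*T^{1/2}*log(1/lambda_2)))
        rho_min = math.sqrt(m) * math.log(m * T) / (sigma * math.sqrt(T) * math.log(1 / lam2))
        return b_min, b_max, rho_min

    # assumed toy values: 100 nodes, 1e6 samples per node, unit noise, lambda_2 = 0.9
    print(dsamd_schedule(m=100, T=1_000_000, sigma=1.0, lam2=0.9, rho=0.05))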

E. Proof of Lemma 4

The sequences $a_s$, $b_s$, $c_s$ are interdependent. Rather than solving the complex recursion directly, we bound the dominant sequence. Define $d_s \triangleq \max\{a_s, b_s, c_s\}$. First, we bound $a_s$:
\[
a_{s+1} = \max_{i,j}\left\|\beta_{s+1}^{-1}\left(x_i(s+1) - x_j(s+1)\right) + \left(1 - \beta_{s+1}^{-1}\right)\left(x_i^{ag}(s) - x_j^{ag}(s)\right)\right\|,
\]
which, via the triangle inequality, is bounded by
\[
a_{s+1} \leq \max_{i,j}\beta_{s+1}^{-1}\left\|P_{x_i(s)}(\gamma_s h_i(s)) - P_{x_j(s)}(\gamma_s h_j(s))\right\| + \left(1 - \beta_{s+1}^{-1}\right)a_s.
\]
Because the prox-mapping is Lipschitz,
\[
a_{s+1} \leq \max_{i,j}\beta_{s+1}^{-1}\left(\|x_i(s) - x_j(s)\| + \gamma_s\|h_i(s) - h_j(s)\|_*\right) + \left(1 - \beta_{s+1}^{-1}\right)a_s.
\]
Applying the first part of Lemma 2 yields
\[
a_{s+1} \leq \max_{i,j}\beta_{s+1}^{-1}b_s + \left(1 - \beta_{s+1}^{-1}\right)a_s + \beta_{s+1}^{-1}\gamma_s\left\|e_i(s) - e_j(s) + \tilde{z}_i(s) - \tilde{z}_j(s)\right\|_*,
\]
and applying the second part and the triangle inequality yields
\[
a_{s+1} \leq \max_{i,j} d_s + 2\gamma_s m^2\sqrt{C_*}\,\lambda_2^r\left\|g_i(s) - g_j(s)\right\|_* + 2\gamma_s\|\tilde{z}_i(s)\|_*.
\]
We apply (10) and collect terms to obtain
\[
a_{s+1} \leq \max_i d_s + 2\gamma_s m^2\sqrt{C_*}\,\lambda_2^r\left(Lc_s + 2M\right) + 2\gamma_s\|\tilde{z}_i(s)\|_* \leq \max_i\left(1 + 2\gamma_s m^2\sqrt{C_*}\,L\lambda_2^r\right)d_s + 2\gamma_s\left(2m^2\sqrt{C_*}\,\lambda_2^r M + \|\tilde{z}_i(s)\|_*\right).
\]
Next, we bound $b_s$ via similar steps:
\[
b_{s+1} = \max_{i,j}\left\|P_{x_i(s)}(\gamma_s h_i(s)) - P_{x_j(s)}(\gamma_s h_j(s))\right\| \leq \max_{i,j}\|x_i(s) - x_j(s)\| + \gamma_s\|h_i(s) - h_j(s)\|_*
\]
\[
\leq \max_i b_s + 2\gamma_s\left(m^2\sqrt{C_*}\,\lambda_2^r\left(Lc_s + 2M\right) + \|\tilde{z}_i(s)\|_*\right) \leq \max_i\left(1 + 2\gamma_s m^2\sqrt{C_*}\,L\lambda_2^r\right)d_s + 2\gamma_s\left(2m^2\sqrt{C_*}\,\lambda_2^r M + \|\tilde{z}_i(s)\|_*\right).
\]
Finally, we bound $c_s$:
\[
c_{s+1} = \max_{i,j}\left\|\beta_s^{-1}\left(x_i(s+1) - x_j(s+1)\right) + \left(1 - \beta_s^{-1}\right)\left(x_i^{ag}(s) - x_j^{ag}(s)\right)\right\|.
\]
We bound this sequence in terms of $a_s$ and $b_s$:
\[
c_{s+1} \leq \max_{i,j}\beta_s^{-1}b_{s+1} + \left(1 - \beta_s^{-1}\right)a_s \leq \max_i\left(1 + 2\gamma_s m^2\sqrt{C_*}\,L\lambda_2^r\right)d_s + 2\gamma_s\left(2m^2\sqrt{C_*}\,\lambda_2^r M + \|\tilde{z}_i(s)\|_*\right).
\]
Comparing the three bounds, we observe that
\[
d_{s+1} \leq \max_i\left(1 + 2\gamma_s m^2\sqrt{C_*}\,L\lambda_2^r\right)d_s + 2\gamma_s\left(2m^2\sqrt{C_*}\,\lambda_2^r M + \|\tilde{z}_i(s)\|_*\right).
\]
Solving the recursion yields
\[
d_s \leq \sum_{\tau=0}^{s-1}\left(1 + 2m^2\gamma_s\sqrt{C_*}\,L\lambda_2^r\right)^{\tau}2\gamma_s\left(2m^2\sqrt{C_*}\,\lambda_2^r M + \|\tilde{z}_i(s)\|_*\right). \tag{38}
\]
Following the same argument as in Lemma 3, we obtain
\[
\mathrm{E}[d_s^2] \leq \frac{\left(M + \sigma/\sqrt{b}\right)^2}{L^2}\left(\left(1 + 2\gamma_s m^2\sqrt{C_*}\,L\lambda_2^r\right)^{s} - 1\right)^2.
\]
The bound on $\mathrm{E}[d_s]$ follows a fortiori.
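The dominant-sequence device used in this proof is easy to illustrate numerically. The following is a minimal sketch, with assumed toy values for the weights, forcing terms, and horizon, checking that when each of three coupled sequences is driven by their running maximum, the single recursion for $d_s$ indeed dominates all three.

    import numpy as np

    rng = np.random.default_rng(1)
    kappa, S = 0.03, 40                 # assumed expansion factor and horizon
    u = rng.uniform(0.0, 0.1, S)        # assumed forcing terms
    a = b = c = 0.0
    d_bound = 0.0                       # iterate of d_{s+1} = (1+kappa)*d_s + u_s
    for s in range(S):
        d = max(a, b, c)
        # each coupled sequence is driven by the dominant one, with assumed weights <= 1
        a = (1 + kappa) * d * rng.uniform(0.5, 1.0) + u[s]
        b = (1 + kappa) * d * rng.uniform(0.5, 1.0) + u[s]
        c = (1 + kappa) * d * rng.uniform(0.5, 1.0) + u[s]
        d_bound = (1 + kappa) * d_bound + u[s]
        assert max(a, b, c) <= d_bound + 1e-12   # dominant recursion upper-bounds all three
    print("dominant-sequence bound holds")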

F. Proof of Theorem 2

Similar to the proof of Theorem 1, the proof involves an analysis of accelerated mirror descent, and as necessary we will cite results from [8, Theorem 2]. Define
\[
\delta_i(\tau) \triangleq h_i(\tau) - g(x_i^{md}(\tau)), \quad \eta_i(\tau) \triangleq \bar{g}(\tau) + e_i(\tau) - g(x_i^{md}(\tau)) = \delta_i(\tau) - \bar{z}(\tau) - \tilde{z}_i(\tau), \quad \zeta_i(\tau) \triangleq \gamma_\tau\langle\delta_i(\tau), x^* - x_i(\tau)\rangle.
\]
Applying (10) and Lemmas 2 and 4, we bound $\mathrm{E}[\|\eta_i(\tau)\|_*]$:
\[
\mathrm{E}[\|\eta_i(\tau)\|_*] = \mathrm{E}[\|\bar{g}(\tau) + e_i(\tau) - g(x_i^{md}(\tau))\|_*] \leq \mathrm{E}[\|\bar{g}(\tau) - g(x_i^{md}(\tau))\|_*] + \mathrm{E}[\|e_i(\tau)\|_*]
\]
\[
\leq L\mathrm{E}[c_\tau] + L\sqrt{C_*}\,m^2\lambda_2^r\mathrm{E}[c_\tau] + 2M \leq \left(M + \sigma/\sqrt{b}\right)\left(1 + \sqrt{C_*}\,m^2\lambda_2^r\right)\left(\left(1 + 2\gamma_\tau m^2\sqrt{C_*}\,L\lambda_2^r\right)^{\tau} - 1\right) + 2M \triangleq \Xi_\tau.
\]
By the definition of the dual norm:
\[
\mathrm{E}[\zeta_i(\tau)] \leq \gamma_\tau\mathrm{E}[\|\eta_i(\tau)\|_*\|x^* - x_i(\tau)\|] \leq \gamma_\tau\Xi_\tau\Omega_\omega \leq \sqrt{\frac{2}{\alpha}}\,\gamma_\tau\Xi_\tau D_{\omega,X}.
\]
We bound $\mathrm{E}[\|\delta_i(\tau)\|_*^2]$ via the triangle inequality:
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2\mathrm{E}[\|\bar{g}(\tau) - g(x_i^{md}(\tau))\|_*^2] + 2\mathrm{E}[\|e_i(\tau)\|_*^2] + 4\mathrm{E}[\|\tilde{z}_i(\tau)\|_*^2] + 4\mathrm{E}[\|\bar{z}(\tau)\|_*^2],
\]
which by (10) and Lemma 2 is bounded by
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2L^2\left(1 + m^4 C_*\lambda_2^{2r}\right)\mathrm{E}[c_\tau^2] + \frac{4C_*\sigma^2}{b}\left(\lambda_2^{2r}m^2 + 1/m\right) + 4M^2.
\]
Applying Lemma 4, we obtain
\[
\mathrm{E}[\|\delta_i(\tau)\|_*^2] \leq 2\left(M + \sigma/\sqrt{b}\right)^2\left(\left(1 + 2\gamma_\tau m^2\sqrt{C_*}\,L\lambda_2^r\right)^{\tau} - 1\right)^2 + \frac{4C_*\sigma^2}{b}\left(\lambda_2^{2r}m^2 + 1/m\right) + 4M^2 \triangleq \Delta_\tau^2.
\]
From the proof of [8, Theorem 2], we observe that
\[
\left(\beta_{S+1} - 1\right)\gamma_{S+1}\left[\Psi(x_i^{ag}(S+1)) - \Psi^*\right] \leq D_{\omega,X}^2 + \sum_{\tau=1}^{S}\left[\zeta_i(\tau) + \frac{2}{\alpha}\left(4M^2 + \|\delta_i(\tau)\|_*^2\right)\gamma_\tau^2\right].
\]
Taking the expectation yields
\[
\left(\beta_{S+1} - 1\right)\gamma_{S+1}\left[\mathrm{E}[\Psi(x_i^{ag}(S+1))] - \Psi^*\right] \leq D_{\omega,X}^2 + \sqrt{\frac{2}{\alpha}}\,D_{\omega,X}\sum_{\tau=1}^{S}\gamma_\tau\Xi_\tau + \frac{2}{\alpha}\sum_{\tau=1}^{S}\gamma_\tau^2\left(4M^2 + \Delta_\tau^2\right).
\]
Letting $\beta_\tau = \frac{\tau+1}{2}$ and $\gamma_\tau = \frac{\tau+1}{2}\gamma$, observing that $\Delta_\tau$ and $\Xi_\tau$ are increasing in $\tau$, and simplifying, we obtain
\[
\mathrm{E}[\Psi(x_i^{ag}(S+1))] - \Psi^* \leq \frac{4D_{\omega,X}^2}{(S^2+1)\gamma} + \sqrt{\frac{32}{\alpha}}\,D_{\omega,X}\Xi_S + \frac{4\gamma S}{\alpha}\left(4M^2 + \Delta_S^2\right).
\]
Solving for the optimum $\gamma$ in the interval $0 \leq \gamma \leq \alpha/(2L)$, we obtain:
\[
\gamma^* = \min\left\{\frac{\alpha}{2L}, \sqrt{\frac{\alpha D_{\omega,X}^2}{S(S^2+1)\left(4M^2 + \Delta_S^2\right)}}\right\}.
\]
This gives the bound
\[
\mathrm{E}[\Psi(x_i^{ag}(S+1))] - \Psi^* \leq \frac{8LD_{\omega,X}^2}{\alpha(S+1)^2} + \sqrt{\frac{16SD_{\omega,X}^2\left(4M^2 + \Delta_S^2\right)}{\alpha(S+1)^2}} + \sqrt{\frac{32}{\alpha}}\,D_{\omega,X}\Xi_S,
\]
which simplifies to the desired bound
\[
\mathrm{E}[\Psi(x_i^{ag}(S+1))] - \Psi^* \leq \frac{8LD_{\omega,X}^2}{\alpha S^2} + 4D_{\omega,X}\sqrt{\frac{4M^2 + \Delta_S^2}{\alpha S}} + \sqrt{\frac{32}{\alpha}}\,D_{\omega,X}\Xi_S.
\]

G. Proof of Corollary 2

First, we bound $\Xi_\tau$ and $\Delta_\tau^2$ order-wise. Using steps similar to those in the proof of Corollary 1, one can show that
\[
\Xi_\tau = O\left(\tau^3 m^3\lambda_2^r\left(M + \sigma/\sqrt{b}\right) + M\right),
\]
and
\[
\Delta_\tau^2 = O\left(\tau^6 m^6\lambda_2^{2r}\left(M^2 + \sigma^2/b\right) + \frac{\sigma^2}{mb} + M^2\right).
\]
In order to achieve order-optimum convergence, we need $\Delta_S^2 = O(\sigma^2/(mb) + M^2)$. Combining the above equation with the constraint $r \leq b\rho$, we obtain the condition
\[
b = \Omega\left(1 + \frac{\log(mT)}{\rho\log(1/\lambda_2)}\right).
\]
This condition also ensures that $\Xi_S = O(\sigma/\sqrt{mT} + M)$, as is necessary for optimum convergence speed. This leads to a convergence speed of
\[
\mathrm{E}[\Psi(x_i^{ag}(S+1))] - \Psi^* = O\left(\frac{b^2}{T^2} + \sqrt{\frac{bM^2}{T}} + \frac{\sigma}{\sqrt{mT}} + M\right).
\]
In order to make the first term $O((M+\sigma)/\sqrt{mT})$, we need
\[
b = O\left(\frac{\sigma^{1/2}T^{3/4}}{m^{1/4}}\right), \quad m = O(\sigma^2 T), \quad \text{and} \quad \rho = \Omega\left(\frac{m^{1/4}\log(mT)}{\sigma^{1/2}T^{3/4}\log(1/\lambda_2)}\right).
\]
To make the second and fourth terms $O((M+\sigma)/\sqrt{mT})$, we need, as before,
\[
M = O\left(\min\left\{\frac{1}{m}, \frac{1}{\sqrt{m\sigma^2 T}}\right\}\right).
\]
This establishes the result.
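To see concretely how acceleration enlarges the regime of order-optimum convergence, one can compare the consensus-rate requirements of Corollaries 1 and 2. The following is a minimal sketch with constants suppressed and toy values assumed (the $\sigma^{1/2}$ exponent for AD-SAMD follows the reconstruction above); it evaluates both order-wise lower bounds on $\rho$, and AD-SAMD's requirement shrinks much faster as $T$ grows.

    import math

    def consensus_rate_requirements(m, T, sigma, lam2):
        """Order-wise lower bounds (constants suppressed) on rho needed for
        order-optimum convergence:
          D-SAMD  (Cor. 1): rho = Omega(m^{1/2} log(mT) / (sigma T^{1/2} log(1/lambda_2)))
          AD-SAMD (Cor. 2): rho = Omega(m^{1/4} log(mT) / (sigma^{1/2} T^{3/4} log(1/lambda_2)))
        """
        common = math.log(m * T) / math.log(1 / lam2)
        rho_dsamd = math.sqrt(m) / (sigma * math.sqrt(T)) * common
        rho_adsamd = m ** 0.25 / (math.sqrt(sigma) * T ** 0.75) * common
        return rho_dsamd, rho_adsamd

    # assumed toy values: for large T, AD-SAMD tolerates a much smaller rho
    print(consensus_rate_requirements(m=100, T=1_000_000, sigma=1.0, lam2=0.9))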
REFERENCES

[1] M. Nokleby and W. U. Bajwa, "Distributed mirror descent for stochastic learning over rate-limited networks," in Proc. 7th IEEE Intl. Workshop Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP'17), Dec. 2017, pp. 1–5.
[2] H. Kushner, "Stochastic approximation: A survey," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 1, pp. 87–96, 2010.
[3] S. Kim, R. Pasupathy, and S. G. Henderson, "A guide to sample average approximation," in Handbook of Simulation Optimization. Springer, 2015, pp. 207–243.
[4] M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. O. Hero, and S. McLaughlin, "A survey of stochastic simulation and optimization methods in signal processing," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 2, pp. 224–241, 2016.
[5] H. Robbins and S. Monro, "A stochastic approximation method," Annals of Mathematical Statistics, pp. 400–407, 1951.
[6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
[7] A. Juditsky, A. Nemirovski, and C. Tauvel, "Solving variational inequalities with stochastic mirror-prox algorithm," Stochastic Systems, vol. 1, no. 1, pp. 17–58, 2011.
[8] G. Lan, "An optimal method for stochastic composite optimization," Mathematical Programming, vol. 133, no. 1, pp. 365–397, 2012.
[9] C. Hu, W. Pan, and J. T. Kwok, "Accelerated gradient methods for stochastic optimization and online learning," in Proc. Advances in Neural Information Processing Systems (NIPS), 2009, pp. 781–789.
[10] L. Xiao, "Dual averaging methods for regularized stochastic learning and online optimization," Journal of Machine Learning Research, vol. 11, pp. 2543–2596, Oct. 2010.
[11] X. Chen, Q. Lin, and J. Pena, "Optimal regularized dual averaging methods for stochastic optimization," in Proc. Advances in Neural Information Processing Systems (NIPS), 2012, pp. 395–403.
[12] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Automat. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.
[13] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Automat. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[14] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, "Gossip algorithms for distributed signal processing," Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[15] K. Srivastava and A. Nedic, "Distributed asynchronous constrained stochastic optimization," IEEE J. Select. Topics Signal Processing, vol. 5, no. 4, pp. 772–790, Aug. 2011.
[16] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in Proc. 51st IEEE Conf. Decision and Control (CDC'12), Maui, HI, Dec. 2012, pp. 5453–5458.
[17] A. Mokhtari and A. Ribeiro, "DSA: Decentralized double stochastic averaging gradient algorithm," J. Mach. Learn. Res., vol. 17, no. 61, pp. 1–35, 2016.
[18] A. S. Bijral, A. D. Sarwate, and N. Srebro, "On data dependence in distributed stochastic optimization," arXiv preprint arXiv:1603.04379, 2016.
[19] J. Li, G. Chen, Z. Dong, and Z. Wu, "Distributed mirror descent method for multi-agent optimization with delay," Neurocomputing, vol. 177, pp. 643–650, 2016.
[20] S. Sundhar Ram, A. Nedic, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory App., vol. 147, no. 3, pp. 516–545, Jul. 2010.
[21] J. Duchi, A. Agarwal, and M. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Trans. Automat. Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
[22] M. Raginsky and J. Bouvrie, "Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence," in Proc. IEEE Annu. Conf. Decision and Control (CDC'12), Dec. 2012, pp. 6793–6800.
[23] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan, "Ergodic mirror descent," SIAM J. Opt., vol. 22, no. 4, pp. 1549–1578, 2012.
[24] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
[25] M. Rabbat, "Multi-agent mirror descent for decentralized stochastic optimization," in Proc. IEEE Intl. Workshop Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP'15), Dec. 2015, pp. 517–520.
[26] K. I. Tsianos and M. G. Rabbat, "Efficient distributed online prediction and stochastic optimization with approximate distributed averaging," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 489–506, Dec. 2016.
[27] S. Shalev-Shwartz, Online learning and online convex optimization, ser. Foundations and Trends in Machine Learning. Now Publishers, Inc., 2012, vol. 4, no. 2.
[28] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[29] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[30] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference for Learning Representations, 2015.

Matthew Nokleby (S'04–M'13) received the B.S. (cum laude) and M.S. degrees from Brigham Young University, Provo, UT, in 2006 and 2008, respectively, and the Ph.D. degree from Rice University, Houston, TX, in 2012, all in electrical engineering. From 2012–2015 he was a postdoctoral research associate in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. In 2015 he joined the Department of Electrical and Computer Engineering at Wayne State University, where he is an assistant professor. His research interests span machine learning, signal processing, and information theory, including distributed learning and optimization, sensor networks, and wireless communication. Dr. Nokleby received the Texas Instruments Distinguished Fellowship (2008–2012) and the Best Dissertation Award (2012) from the Department of Electrical and Computer Engineering at Rice University.

Waheed U. Bajwa (S'98–M'09–SM'13) received the BE (with Honors) degree in electrical engineering from the National University of Sciences and Technology, Pakistan in 2001, and the MS and PhD degrees in electrical engineering from the University of Wisconsin-Madison in 2005 and 2009, respectively. He was a Postdoctoral Research Associate in the Program in Applied and Computational Mathematics at Princeton University from 2009 to 2010, and a Research Scientist in the Department of Electrical and Computer Engineering at Duke University from 2010 to 2011. He has been with Rutgers University since 2011, where he is currently an associate professor in the Department of Electrical and Computer Engineering and an associate member of the graduate faculty of the Department of Statistics and Biostatistics. His research interests include statistical signal processing, high-dimensional statistics, machine learning, harmonic analysis, inverse problems, and networked systems. Dr. Bajwa has received a number of awards in his career including the Best in Academics Gold Medal and President's Gold Medal in Electrical Engineering from the National University of Sciences and Technology (2001), the Morgridge Distinguished Graduate Fellowship from the University of Wisconsin-Madison (2003), the Army Research Office Young Investigator Award (2014), the National Science Foundation CAREER Award (2015), Rutgers University's Presidential Merit Award (2016), Rutgers Engineering Governing Council ECE Professor of the Year Award (2016, 2017), and Rutgers University's Presidential Fellowship for Teaching Excellence (2017). He is a co-investigator on the work that received the Cancer Institute of New Jersey's Gallo Award for Scientific Excellence in 2017, a co-author on papers that received Best Student Paper Awards at IEEE IVMSP 2016 and IEEE CAMSAP 2017 workshops, and a Member of the Class of 2015 National Academy of Engineering Frontiers of Engineering Education Symposium. He served as an Associate Editor of the IEEE Signal Processing Letters (2014–2017), co-guest edited a special issue of Elsevier Physical Communication Journal on "Compressive Sensing in Communications" (2012), co-chaired CPSWeek 2013 Workshop on Signal Processing Advances in Sensor Networks and IEEE GlobalSIP 2013 Symposium on New Sensing and Statistical Inference Methods, and served as the Publicity and Publications Chair of IEEE CAMSAP 2015, General Chair of the 2017 DIMACS Workshop on Distributed Optimization, Information Processing, and Learning, and a Technical Co-Chair of the IEEE SPAWC 2018 Workshop. He is currently serving as a Senior Area Editor for IEEE Signal Processing Letters, an Associate Editor for IEEE Transactions on Signal and Information Processing over Networks, and a member of MLSP, SAM, and SPCOM Technical Committees of IEEE Signal Processing Society.