arXiv:1612.00903v1 [math.OC] 3 Dec 2016

Expander Graph and Communication-Efficient Decentralized Optimization

Yat-Tin Chow¹, Wei Shi², Tianyu Wu¹ and Wotao Yin¹

¹ Department of Mathematics, University of California, Los Angeles, CA. Emails: {ytchow,wuty11,wotaoyin}@math.ucla.edu
² Department of Electrical & Computer Engineering, Boston University, Boston, MA

This work was supported in part by NSF grant ECCS-1462398.

Abstract—In this paper, we discuss how to design the graph topology to reduce the communication complexity of certain algorithms for decentralized optimization. Our goal is to minimize the total communication needed to achieve a prescribed accuracy. We discover that the so-called expander graphs are near-optimal choices. We propose three approaches to construct expander graphs for different numbers of nodes and node degrees. Our numerical results show that the performance of decentralized optimization is significantly better on expander graphs than on other regular graphs.

Index Terms—decentralized optimization, expander graphs, Ramanujan graphs, communication efficiency.

I. INTRODUCTION

A. Problem and background

Consider the decentralized consensus problem:

  minimize_{x ∈ R^P}  f(x) = Σ_{i=1}^M f_i(x),   (1)

which is defined on a connected network of M agents who cooperatively solve the problem for a common minimizer x ∈ R^P. Each agent i keeps its private convex function f_i : R^P → R. We consider synchronized optimization methods, where all the agents carry out their iterations at the same set of times.

Decentralized algorithms are applied to solve problem (1) when the data are collected or stored in a distributed network and a fusion center is either infeasible or uneconomical. Therefore, the agents in the network must perform local computing and communicate with one another. Applications of decentralized computing are widely found in sensor-network information processing, multiple-agent control and coordination, and distributed machine learning, among others. Some important works include decentralized averaging [1],[2],[3], learning [4],[5],[6], estimation [7],[8],[9],[10], sparse optimization [11],[12], and low-rank matrix completion [13].

Communication is often the bottleneck of distributed algorithms since it can cost more time or energy than computation, so reducing communication is a primary concern in distributed computing.

In this paper, we focus on the communication efficiency of certain decentralized algorithms. Often, there is a trade-off between the communication requirement and the convergence rate of a decentralized algorithm. Our goal is to design a network of N nodes and a mixing matrix W such that state-of-the-art consensus optimization algorithms, such as EXTRA [14],[15] and ADMM [16],[17],[18], can solve problem (1) with a nearly minimal amount of total communication. Our target accuracy is ‖z^k − z^*‖ < ε, where ε is a threshold, k is the iteration index, {z^k} is the sequence of iterates, and z^* is its limit. We use z^k instead of x^k since EXTRA and ADMM iterate both primal and dual variables.

B. Problem reformulation

Consider an undirected graph G = (V, E) that represents the network topology, where V is the set of vertices and E is the set of edges. Take a symmetric and doubly-stochastic mixing matrix W = {W_ij} ∈ R^{M×M}; W encodes the graph topology since W_ij = 0 means no edge between the ith and the jth agents. Following [14], the optimization problem (1) can be rewritten as

  minimize_x  F(x) := Σ_{i=1}^M f_i(x_i)  subject to  Wx = x,   (2)

where x := [x₁; ...; x_M] ∈ R^{M×P} with rows x_i ∈ R^{1×P}. We focus on decentralized algorithms that apply multiplications by W to exchange information over the network.

C. Overview of our approach

We first reformulate problem (1) as (2) to include the graph topology. Let U be the amount of communication per iteration and K be the number of iterations needed to reach the specified accuracy. Apparently, U and K both depend on the network, and the total communication is their product UK. A dense network usually implies a large U and a small K. On the contrary, a highly sparse network typically leads to a small U and a large K. (There are special cases such as the star network.) The optimal density is typically neither of them; hence, we shall find the optimal balance. Directly connecting more pairs of agents, for example, generates more communication at each iteration, but each iteration also makes faster progress toward the solution and therefore tends to reduce the total number of iterations needed to reach the target accuracy. In this work, we argue that this balance is approximately reached by the so-called Ramanujan graphs, which is a type of expander graph.

To achieve this, we express U and K in the network parameters that we can tune, along with other parameters of the problem and the network that affect total communication but that we do not tune. Then, by finding the optimal values of the tuning parameters, we obtain the optimal network.

The dependence of U and K on the network can be complicated. Therefore, we make a series of simplifications to the problem and the network so that it becomes sufficiently simple to find the minimizer of (UK).
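To make the constraint in (2) concrete: for a connected graph, Wx = x holds exactly when all rows of x agree. The following minimal numpy sketch illustrates this; the 5-node ring and the equal neighbor weights are our own illustrative choices, not the mixing matrix used later in (5).

```python
import numpy as np

# A 5-node ring with equal neighbor weights (illustrative choice only).
M = 5
W = np.zeros((M, M))
for i in range(M):
    for j in ((i - 1) % M, (i + 1) % M):
        W[i, j] = 1.0 / 3.0              # weight to each of the two neighbors
    W[i, i] = 1.0 - W[i].sum()           # remainder kept on the diagonal

assert np.allclose(W, W.T)               # symmetric
assert np.allclose(W.sum(axis=1), 1.0)   # doubly stochastic (columns by symmetry)

# For a connected graph, Wx = x <=> x is a consensus vector: the eigenvalue 1
# of W is simple and its eigenspace is spanned by the all-ones vector.
vals, vecs = np.linalg.eigh(W)
v = vecs[:, -1]                          # eigenvector of the largest eigenvalue (= 1)
print(np.isclose(vals[-1], 1.0), np.allclose(v, v[0]))   # True True
```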
We mainly (but are not bound to) consider strongly convex objective functions and decentralized algorithms that use fixed step sizes (e.g., EXTRA and ADMM) because they have the linear convergence rate ‖z^k − z^*‖ ≤ C^k for some C < 1. Given a target accuracy ε, from C^K = ε we deduce the sufficient number of iterations K = log(ε^{−1})/log(C^{−1}). Since the constant C < 1 depends on network parameters, so does K.

We mainly work with three network parameters: the condition number of the graph Laplacian, the first non-trivial eigenvalue of the adjacency matrix, and the degree of each node. In this work, we restrict ourselves to d-regular graphs, i.e., every node is connected to exactly d other nodes. We further (but are not bound to) simplify the problem by assuming that all edges have the same cost of communication. Since most algorithms perform a fixed number (typically, one or two) of rounds of communication in each iteration along each edge, U equals a constant times d.

So far, we have simplified our problem to minimizing (Kd), which reduces to choosing d, deciding the topology of the d-regular graph, as well as setting the mixing matrix W.

For the two algorithms EXTRA and ADMM, we deduce (in Section II, from their existing analyses) that K is determined by the mixing matrix W and κ̃(L_G), where L_G denotes the graph Laplacian and κ̃(L_G) is its reduced condition number. We simplify the formula of K by relating W to L_G (thus eliminating W from the computation). Hence, our problem becomes minimizing (Kd), where K only depends on κ̃(L_G), and κ̃(L_G) further depends on the degree d (same for every node) and the topology of the d-regular graph.

By classical graph theory, a d-regular graph satisfies κ̃(L_G) ≤ (d + λ₁(A))/(d − λ₁(A)), where λ₁(A) is the first non-trivial eigenvalue of the adjacency matrix A, which we define later. We minimize λ₁(A) to reduce (d + λ₁(A))/(d − λ₁(A)) and thus κ̃(L_G). Interestingly, λ₁(A) is lower bounded by the Alon-Boppana formula in d (thus it also depends on d). The lower bound is nearly attained by Ramanujan graphs, a type of graph with a small number of edges but high connectivity and good spectral properties. Therefore, given a degree d, we shall choose Ramanujan graphs, because they approximately minimize λ₁(A), thus κ̃(L_G), and in turn (Kd) and (KU).

Unfortunately, Ramanujan graphs are not known for all values of (N, d). For large N, we follow [19] to construct a random network, which has a nearly minimal λ₁(A). For small N, we apply the Ramanujan sparsifier [20]. These approaches lead to near-optimal network topologies for a given d.

After the above deductions, minimizing the total communication simplifies to choosing the degree d. To do this, one must know the cost ratio between computation and communication, which depends on the application. Once these parameters are known, a simple one-dimensional search method will find d.

D. Communication reduction

We find that, in different problems, the graph topology affects total communication to different extents. When one function f_i(x) has a big influence on the consensus solution, total communication is sensitive to the graph topology. On the other hand, computing the consensus average of a set of similar numbers is insensitive to the graph topology, so a Ramanujan graph does not make much improvement. Therefore, our proposed graphs work well only for the former type of problem and data. See Section IV for more details and a comparison.

E. Limitations and possible resolutions

We assume nodes having the same degree, edges having the same communication cost, and a special mixing matrix W. Despite these assumptions, our approach may still serve as the starting point for reducing the total communication in more general settings, which is our future work.

F. Related work

Other works that have considered and constructed Ramanujan graphs include [21],[22]. Graph design for the optimal convergence rate of decentralized consensus averaging and optimization is considered in [3],[23],[24],[25],[26],[27],[28]. However, we suspect that the mixing matrices W recommended in the above works for average consensus might sometimes not be the best choices for decentralized optimization, since consensus averaging is far less communication demanding than decentralized optimization (cf. Section I-D), and it is the least sensitive to adding, removing, or changing one node. On the other hand, we agree that for consensus averaging, those recommended matrices serve as the best candidates.

II. COMMUNICATION COST IN DECENTRALIZED OPTIMIZATION

As described in Subsection I-C, we focus on a special class of decentralized algorithms that allows us to optimize the communication cost by designing a suitable graph topology. For notational convenience, let us recall several definitions and properties of a graph. See [29],[30],[31] for background.
The lower bound is matrix with components nearly attained by Ramanujan graphs, a type of graph with 1 if (i, j) E, a small number of edges but having high connectivity and Aij = ∈ good spectral properties! Therefore, given a degree d, we (0 if otherwise. shall choose Ramanujan graphs, because they approximately An equally important (oriented) incidence matrix is defined as minimize λ1(A), thus κ˜(LG), and in turn (Kd) and (KU). is a E V matrix such that Unfortunately, Ramanujan graphs are not known for all | | × | | values of (N, d). For large N, we follow [19] to construct a 1 if e = (i, j) for some j, where j > i , random network, which has nearly minimal λ (A). For small Bei = 1 if e = (i, j) for some j, where j < i , 1 − N, we apply the Ramanujan Sparsifier [20]. These approaches 0 otherwise. lead to near-optimal network topologies for a given d. The degree of a node i E, denoted as deg(i), is the number After the above deductions, minimizing the total communi-  of edges incident to the∈ node. Every node in a d-regular graph cation simplifies to choosing the degree d. To do this, one must has the same degree d. The graph Laplacian of G is: know the cost ratio between computation and communication, T which depends on applications. Once these parameters are LG := D A = B B, − 3

The matrix L_G is symmetric and positive semidefinite. The multiplicity of its trivial eigenvalue 0 equals the number of connected components of G. Therefore, for a connected graph, we consider its first nonzero eigenvalue, denoted λ_min(L_G), and its largest eigenvalue, denoted λ_max(L_G). The reduced condition number of L_G is defined as

  κ̃(L_G) := λ_max(L_G)/λ_min(L_G),   (3)

which is of crucial importance to our exposition. We also need a weighted graph and its graph Laplacian. A weighted graph G = (V, E, w) is a graph with nonnegative edge weights w = (w_e). Its graph Laplacian is defined as

  L_{G,w} := Bᵀ Diag(w) B.

Its eigenvalues and condition numbers are defined similarly; e.g., κ̃(L_{G,w}) is its reduced condition number.

Next we define a special class of decentralized algorithms, which simplifies the computation of the total communication cost.

Definition 1: We call a first-order decentralized algorithm for solving (2) communication-optimizable if any sequence {z^k} generated by the algorithm satisfies

  ‖z^k − z^*‖ ≤ R(Θ, Γ_f, κ̃(L_{G,w}), k),   (4)

for a certain norm ‖·‖ and function R, where Θ consists of all algorithmic parameters specified by the user and Γ_f is a constant depending on the objective function f. In addition, R is non-increasing as κ̃(L_{G,w}) decreases or as k increases.

The above definition means that there exists a convergence-speed bound that depends only on L_{G,w} and k while the other parameters Θ, Γ_f are fixed. This description of the convergence rate is met by many algorithms, mainly because κ̃(L_{G,w}) reflects the connectivity of the graph topology. In fact, if G is d-regular and weighted, then we have from [30],[31]:

  κ̃(L_G) ≤ (d + λ₁(A))/(d − λ₁(A)),

where λ₁(A) ≠ d is the first non-trivial eigenvalue of the adjacency matrix (the trivial eigenvalue λ₀(A) = d corresponds to the eigenvalue 0 of L_G); λ₁(A) is related to the connectivity of G by the Cheeger inequalities [32]:

  (1/2)(d − λ₁(A)) ≤ h(G) ≤ √(2d(d − λ₁(A))),

where h(G) is the Cheeger constant quantifying the connectivity of G. Therefore, saying that R(Θ, Γ_f, κ̃(L_{G,w}), k) is nondecreasing with κ̃(L_{G,w}) is related to saying that R is nonincreasing with the connectivity of G. (Higher connectivity lets the algorithm converge faster.) Hence, we hope to increase the number of edges of G, though a cost is associated with doing so.

With the help of L_{G,w}, for simplicity we may choose the mixing matrix W to take the form [23],[24]:

  W_{L_{G,w}} := I − L_{G,w}/((1 + Θ₁) λ_max(L_{G,w})),  for 0 < Θ₁ < 1,   (5)

where Θ₁ is among the set of algorithmic parameters Θ mentioned above. This choice of W is symmetric and doubly stochastic. Other choices are given in [25],[26],[27],[28],[3].
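The two quantities just defined, κ̃(L_{G,w}) from (3) and W from (5), are straightforward to compute numerically. A minimal sketch, assuming a connected weighted graph given by its incidence matrix (the triangle example and function names are ours):

```python
import numpy as np

def reduced_condition_number(B, w):
    """kappa~(L_{G,w}) = lambda_max/lambda_min over the nonzero spectrum of
    L_{G,w} = B^T Diag(w) B, for a connected weighted graph (cf. (3))."""
    L = B.T @ (w[:, None] * B)
    vals = np.linalg.eigvalsh(L)
    nonzero = vals[vals > 1e-9 * vals[-1]]   # drop the trivial eigenvalue 0
    return nonzero[-1] / nonzero[0]

def mixing_matrix(B, w, theta1=0.5):
    """W_{L_{G,w}} = I - L_{G,w} / ((1 + theta1) * lambda_max), as in (5);
    symmetric and doubly stochastic by construction."""
    L = B.T @ (w[:, None] * B)
    lam_max = np.linalg.eigvalsh(L)[-1]
    return np.eye(L.shape[0]) - L / ((1.0 + theta1) * lam_max)

# Triangle graph: edges (0,1), (1,2), (0,2), unit weights.
B = np.array([[1., -1., 0.], [0., 1., -1.], [1., 0., -1.]])
print(reduced_condition_number(B, np.ones(3)))     # 1.0: both nonzero eigenvalues = 3
W = mixing_matrix(B, np.ones(3))
print(np.allclose(W, W.T), np.allclose(W.sum(axis=1), 1.0))   # True True
```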

Under our choice of W, there are many examples of communication-optimizable algorithms, where the convergence rate depends only on κ̃(L_{G,w}). The convergence rate of EXTRA [14],[15] can be explicitly represented, after some straightforward simplifications obtained by letting W = W_{L_{G,w}} as in (5):

  ‖z^k − z^*‖_g / ‖z^0 − z^*‖_g ≤ (1 + min{ 1/(p κ̃(L_{G,w}) + q), 1/(r κ̃(L_{G,w})) })^{−k/2},

for some norm ‖·‖_g (where z is related to x); and that of decentralized ADMM [16],[17],[18]:

  ‖z^k − z^*‖ / ‖z^0 − z^*‖ ≤ (1 + p/(√(κ̃(L_{G,w}) + q) − √q))^{−k/2},

where p, q, r are coefficients depending only on Θ and Γ_f. Apparently, while fixing the other parameters, a smaller κ̃(L_{G,w}) tends to give faster convergence. We see the same in the extension of EXTRA to nonsmooth functions, PG-EXTRA [15].

In order to consider the communication cost of an optimization procedure, we need the following two quantities. The first one is the amount of communication per iteration U:

  U(µ, W) := Σ_{e∈E} µ_e |W_e|₀,   (6)

where µ = (µ_e) is the communication cost on each edge e, and |·|₀ is the 1/0 indicator of whether e is used for communication or not. When W is related to L_{G,w} via (5), we also write U(µ, L_{G,w}) := U(µ, W_{L_{G,w}}) for this quantity.

Next, we introduce the second quantity, the total communication cost of an optimization procedure. For this purpose, recall that Γ_f represents a set of constants depending only on the structure of the target function f, and Θ is a set of parameters not depending on either f or the graph G. We let K be the number of iterations to achieve the accuracy ‖z^k − z^*‖ ≤ ε:

  K = K(Θ, Γ_f, κ̃(L_{G,w}), ε),

as K clearly depends on Γ_f, Θ, κ̃(L_{G,w}), and ε. By the definition of R in (4), we have

  R(Θ, Γ_f, κ̃(L_{G,w}), K(Θ, Γ_f, κ̃(L_{G,w}), ε)) = ε.

Theorem 1: The total communication cost for (z^i)_{i=0}^k generated by a communication-optimizable algorithm to satisfy ‖z^k − z^*‖ ≤ ε, denoted J(Θ, Γ_f, W_{L_{G,w}}, ε), satisfies:

  J(Θ, Γ_f, W_{L_{G,w}}, ε) ≤ K(Θ, Γ_f, κ̃(L_{G,w}), ε) · U(µ, L_{G,w}).   (7)

The cost consists of two parts: the first part K decreases with the connectivity of G, while the other part U grows with it. In general, how to optimize J depends on the particular algorithm (which gives the function K) and the weights {µ_e}. Under the additional assumption that the {µ_e} are uniform, we can assume

  U(µ, L_{G,w}) ≈ F(d^ave),   (8)

where d^ave = Σ_i deg(i) / Σ_i 1 is the average degree, and F is a function that increases monotonically w.r.t. d^ave. Therefore, a reasonable approximation of the communication-efficiency optimization problem becomes

  min_G J(Θ, Γ_f, W_{L_{G,w}}, ε) ≈ min_{d^ave} H(d^ave) F(d^ave),   (9)

where H(d^ave) reads

  H(d^ave) := min_{(G_{d^ave}, w)} K(Θ, Γ_f, κ̃(L_{G_{d^ave},w}), ε),   (10)

where G_{d^ave} denotes an unknown graph whose nodes all have the same degree. Optimizing d^ave in (9) after knowing H(d^ave) and F(d^ave) can only be done on a problem-by-problem and algorithm-by-algorithm basis. However, the minimum arguments in the expression (10) in the definition of H(d^ave) are always graphs such that κ̃(L_{G_{d^ave},w}) is minimized. In light of this, we propose to promote communication efficiency by optimizing κ̃(L_{G_{d^ave},w}) in any case (i.e., whether we know the actual expression of K or not, or even in the case when {µ_e} is not so uniform).
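The bound (7) can be evaluated numerically for a linearly convergent method. A sketch under stated assumptions — the linear rate ‖z^k − z^*‖ ≤ C^k‖z^0 − z^*‖ from Section I-C, unit edge costs, and U = Nd transmissions per iteration for a d-regular graph (one round of communication, two transmissions per edge); the function name is ours:

```python
import numpy as np

def total_comm_bound(C, eps, N, d):
    """Bound (7), J <= K * U, for a linearly convergent method:
    K = log(1/eps)/log(1/C) iterations (Section I-C), and U = N*d for one
    round of communication per iteration on a d-regular graph, unit costs."""
    K = np.ceil(np.log(1.0 / eps) / np.log(1.0 / C))
    U = N * d
    return K * U
```

For EXTRA and ADMM, the constant C itself depends on κ̃(L_{G,w}) through the rates displayed above, which is how the graph topology enters the bound.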
III. GRAPH OPTIMIZATION (OF κ̃(L_G) OR κ̃(L_{G,w}))

A. Exact optimizer with specific nodes and degree (N, d): Ramanujan graphs

For a general d-regular graph, it is known from the Alon-Boppana Theorem in [33],[34],[35] that

  λ₁(A) ≥ 2√(d − 1) − (2√(d − 1) − 1)/⌊D/2⌋,

where D is the diameter of G.

Definition 2 ([29],[30],[31],[36]): A d-regular graph G is a Ramanujan graph if it satisfies

  λ₁(A_G) ≤ 2√(d − 1),   (11)

where A_G is the adjacency matrix of G.

In fact, a Ramanujan graph serves as an asymptotic minimizer of κ̃(L_{G_d}) over d-regular graphs G_d. For a Ramanujan graph, the reduced condition number of the Laplacian satisfies:

  κ̃(L_{G_d}) ≤ (d + 2√(d − 1))/(d − 2√(d − 1)).

If we use a Ramanujan graph that minimizes κ̃(L_{G_d}), we can then find an approximate minimizer in (10), and thus in (9), for the total communication cost by further finding d (or d^ave).

Explicit constructions of Ramanujan graphs are known only for some special (N, d), such as labelled Cayley graphs and the Lubotzky-Phillips-Sarnak construction [37], with some computations given in [38],[39], and the construction of bipartite Ramanujan graphs [40]. Interested readers may consult [30],[29],[41],[31],[37],[36] for more references.

B. Optimizer for large N: Random d-regular graphs

In some practical situations, however, we encounter pairs (N, d) for which an explicit Ramanujan graph is still unknown. When N is very large, we propose a random d-regular graph as a solution of near-optimal communication efficiency, with the randomness generated as follows [19] (where d shall be even and N can be an arbitrary integer greater than 2): choosing independently d/2 permutations π_j, j = 1, ..., d/2, of the numbers from 1 to N, with each permutation equally likely, a graph G = (V, E) with vertices V = {1, ..., N} is given by

  E = {(i, π_j(i)) : i = 1, ..., N, j = 1, ..., d/2}.

The validity of using random graphs as an approximate optimizer is ensured by the Friedman Theorem below:

Theorem 2 ([19]): For every ε > 0,

  P(λ₁(A) ≤ 2√(d − 1) + ε) = 1 − o_N(1),   (12)

where G is a random (N, d)-graph. (The notation g(N) = o_N(f(N)) means lim_{N→∞} g(N)/f(N) = 0.)

Roughly speaking, this theorem says that for a very large N, almost all random (N, d)-graphs are Ramanujan, i.e., they are nicely connected. Therefore, it is just fine to adopt a random (N, d)-graph when N is too large for a Ramanujan graph or sparsifier to be explicitly computed.

It is possible that some regular graphs are not connected. To address this issue, we first grow a random spanning tree of the underlying graph to ensure connectivity. A random generator for regular graphs is GGen [42].
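A sketch of the permutation model of [19] together with the Ramanujan test (11)/(12). Assumptions: we keep the self-loops and repeated edges that the model can create, and we omit the spanning-tree connectivity fix mentioned above.

```python
import numpy as np

def random_regular_adjacency(N, d, seed=0):
    """Random d-regular (multi)graph from d/2 permutations, as in [19]; d even."""
    rng = np.random.default_rng(seed)
    A = np.zeros((N, N))
    for _ in range(d // 2):
        pi = rng.permutation(N)
        for i in range(N):
            A[i, pi[i]] += 1.0          # edge (i, pi(i)); multi-edges and
            A[pi[i], i] += 1.0          # self-loops are kept, as in the model
    return A

A = random_regular_adjacency(N=500, d=6)
vals = np.sort(np.linalg.eigvalsh(A))   # vals[-1] = d is the trivial eigenvalue
lam1 = max(vals[-2], -vals[0])          # largest non-trivial |eigenvalue|
print(lam1 <= 2.0 * np.sqrt(6 - 1) + 0.5)   # Ramanujan up to slack, w.h.p. (Thm 2)
```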

C. Approximation for small N: 2-Ramanujan Sparsifier

For some small N, an explicit (N, d)-Ramanujan graph may still be unavailable. We then hope to construct an approximate Ramanujan graph, i.e., an expander which is sub-optimal but works well in practice. An example is given by the 2-Ramanujan sparsifier, which is the subgraph H in the following theorem:

Theorem 3 ([20]): For every d > 1, every undirected weighted graph G = (V, E, w) on N vertices contains a weighted subgraph H = (V, F, w′) with at most d(N − 1) edges (i.e., a graph of average degree at most 2d) that satisfies

  L_{G,w} ≼ L_{H,w′} ≼ ((d + 1 + 2√d)/(d + 1 − 2√d)) L_{G,w},   (13)

where the Loewner partial order relation A ≼ B indicates that B − A is a positive semidefinite matrix.

This theorem allows us to construct a sparsifier from an original graph G. In fact, the proof of the theorem provides us with an explicit algorithm for such a construction [20].

IV. NUMERICAL EXPERIMENTS

In this section, we illustrate the communication-efficiency optimization achieved by expander graphs.

Since the focus of our work is not to investigate various methods of producing expander graphs, we did not test Algorithm I. Rather, our focus is to illustrate that a smaller reduced condition number improves communication efficiency and that expander graphs are good choices to reduce communication in decentralized optimization. Therefore, we compare the difference in communication efficiency using existing graphs with different reduced condition numbers. We compare the convergence speeds on a family of (possibly non-convex, non-smooth) problems which are approximate formulations of finding the sup-norm of a vector l = (l₁, ..., l_M) (up to sign).

Example 1. We choose M = P = 1092. Our problem is:

  min_{x∈R^P}  f(x) = (1/M) Σ_{i=1}^M f_i(x),  where  f_i(x) = l_i x_i/(‖x‖₁ + ε),

for a small ε chosen as ε = 1 × 10^−5. One very important remark is that this optimization problem is non-convex and non-smooth, and the set of ε-optimizers occupies a tubular neighbourhood of a whole ray {−λ e_n : λ > 0}, where l_n = ‖l‖_∞ and e_n is the nth coordinate vector (0, 0, ..., 1, ..., 0).

TABLE I: Communication complexity in Example 1 for Ramanujan 30-graph LPS(29, 13), 30-reg, 60-reg, and 120-reg graphs.

| Graph             | κ̃(L_G)  | No. of Itr. | Total comm. complexity |
|-------------------|---------|-------------|------------------------|
| LPS(29, 13)       | 1.9538  | 52          | 1,703,520              |
| Regular 30 graph  | 30.5375 | 444         | 14,545,440             |
| Regular 60 graph  | 32.1103 | 494         | 32,366,880             |
| Regular 120 graph | 27.1499 | 454         | 59,492,160             |
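As a consistency check on Table I: each total equals (no. of iterations) × Nd with N = 1092, i.e., one round of communication per iteration with two transmissions per edge (our reading of the counts):

```python
# Each Table I total equals (no. of iterations) * N * d with N = 1092.
N = 1092
rows = [("LPS(29,13)", 30, 52, 1703520), ("regular-30", 30, 444, 14545440),
        ("regular-60", 60, 494, 32366880), ("regular-120", 120, 454, 59492160)]
for name, d, iters, total in rows:
    assert iters * N * d == total, name
print("all rows consistent")
```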

We solve the problem using the EXTRA algorithm [14]:

  x¹ = W x⁰ − α ∇f(x⁰),
  x^{k+2} = (W + I) x^{k+1} − ((W + I)/2) x^k − α [∇f(x^{k+1}) − ∇f(x^k)].
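A direct transcription of this recursion, assuming W̃ = (W + I)/2 and a grad_f that stacks the local (sub)gradients ∇f_i(x_i) into the rows of its output (a sketch; the f_i of Example 1 are nonsmooth, so subgradients are used in practice):

```python
import numpy as np

def extra(W, grad_f, x0, alpha, iters):
    """EXTRA [14]: x^{k+2} = (I + W) x^{k+1} - ((W + I)/2) x^k
                            - alpha * [grad f(x^{k+1}) - grad f(x^k)]."""
    I = np.eye(W.shape[0])
    x_prev, x = x0, W @ x0 - alpha * grad_f(x0)
    for _ in range(iters):
        x, x_prev = ((I + W) @ x - 0.5 * (W + I) @ x_prev
                     - alpha * (grad_f(x) - grad_f(x_prev))), x
    return x
```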

To compare speeds, we plot the following sequence:

  δ_k := (1/M) Σ_{i=1}^M F_i(x^k) − min_x F(x).

Figure 1 shows the sequence δ_k under 4 different graph topologies: the Ramanujan 30-graph LPS(29, 13), a regular-30 graph, a regular-60 graph, and a regular-120 graph. Our regular d-graphs (other than LPS(29, 13)) are generated by joining the ith node to the {(i + ⌊(N/d) k⌋) mod N : 0 ≤ k < d/2}th nodes, for all 0 ≤ i ≤ N − 1 (here we label the nodes from 0 to N − 1). The mixing matrices W that we use are generated as W_{L_{G,w}}, as described in (5), with w = (w_e) where w_e = 1 on all edges e ∈ E. We clearly see that the convergence rate of the algorithm under the Ramanujan 30-graph LPS(29, 13) is even better than that of the regular-120 graph in the first 40 iterations. Afterward, the curve of the Ramanujan 30-graph becomes flat, probably because it has already arrived at a small tubular neighborhood of the ray of ε-optimizers. Table I shows the number of iterations k₀ as well as the total communication complexity needed to achieve the stopping rule |δ_{k₀} − δ_{k₀−1}| < 1 × 10^−3. We see that an expander graph reduces communication complexity. One can observe the rapid drop of the red curve because of its efficiency in locating the active set.

Fig. 1: Convergence rate of EXTRA in Example 1 over the Ramanujan 30-graph LPS(29, 13) (red, the best), the 30-reg graph (blue), the 60-reg graph (green), and the 120-reg graph (black).

Example 2. In this case we again choose M = 1092, but P = 1 (i.e., a one-dimensional problem). Our problem has a form similar to that in Example 1, but with

  f_i(x) = χ_{{a ≥ √|l_i|}}(x) + |x|².

This is a convex non-smooth problem. We apply PG-EXTRA [15] to solve it. To proceed, we split f_i as

  s_i(x) := |x|²,
  r_i(x) := χ_{{a ≥ √|l_i|}}(x),

and apply the PG-EXTRA iteration:

  x^{1/2} = W x⁰ − α ∇s(x⁰),
  x¹ = arg min_x { r(x) + (1/2α) ‖x − x^{1/2}‖² },
  x^{k+3/2} = W x^{k+1} + x^{k+1/2} − ((W + I)/2) x^k − α [∇s(x^{k+1}) − ∇s(x^k)],
  x^{k+2} = arg min_x { r(x) + (1/2α) ‖x − x^{k+3/2}‖² }.
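The prox step in the PG-EXTRA iteration above separates across agents; for the indicator r_i = χ_{{a ≥ √|l_i|}} it is the projection max(·, √|l_i|). A sketch of one update (variable and function names are ours):

```python
import numpy as np

def prox_r(v, t):
    """Prox of r_i = indicator of {a : a >= t_i}: componentwise projection.
    (Independent of the step size, since r is an indicator.)"""
    return np.maximum(v, t)

def pg_extra_step(W, x, x_prev, x_half, grad_s, t, alpha):
    """One PG-EXTRA [15] update; x, x_prev, x_half hold x^{k+1}, x^k, x^{k+1/2}.
    Returns (x^{k+2}, x^{k+3/2})."""
    I = np.eye(W.shape[0])
    x_half_new = (W @ x + x_half - 0.5 * (W + I) @ x_prev
                  - alpha * (grad_s(x) - grad_s(x_prev)))
    return prox_r(x_half_new, t), x_half_new
```

Here t would be the vector (√|l₁|, ..., √|l_M|) of per-agent thresholds.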

To compare speeds, we plot the sequence

  δ_k := (1/M) Σ_{i=1}^M s_i(x_i^k) − min_x F(x),

which excludes the r_i(x) part. The mixing matrices that we use are the same as those described in Example 1. As shown in Figure 2, the convergence rate of the algorithm under the Ramanujan 30-graph LPS(29, 13) is still the best, though by a less outstanding margin. Table II shows the numbers of iterations k₀ as well as the total communication complexities needed to achieve |δ_{k₀} − δ_{k₀−1}| < 1 × 10^−3. The expander graph has the clear advantage.

TABLE II: Communication complexity in Example 2 for Ramanujan 30-graph LPS(29, 13), 30-reg, 60-reg, and 120-reg graphs.

| Graph             | κ̃(L_G)  | No. of Itr. | Total comm. complexity |
|-------------------|---------|-------------|------------------------|
| LPS(29, 13)       | 1.9538  | 811         | 26,568,360             |
| Regular 30 graph  | 30.5375 | 1001        | 32,792,760             |
| Regular 60 graph  | 32.1103 | 832         | 54,512,640             |
| Regular 120 graph | 27.1499 | 953         | 124,881,120            |

Fig. 2: Convergence rate of PG-EXTRA in Example 2 over the Ramanujan 30-graph LPS(29, 13) (in red), the 30-reg graph (in blue), the 60-reg graph (in green), and the 120-reg graph (in black).

REFERENCES

[1] Alexandros G. Dimakis, Soummya Kar, Jose M. F. Moura, Michael G. Rabbat, and Anna Scaglione, "Gossip algorithms for distributed signal processing," Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[2] Bjorn Johansson, "On distributed optimization in networked systems," 2008.
[3] Lin Xiao, Stephen Boyd, and Seung-Jean Kim, "Distributed average consensus with least-mean-square deviation," Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007.
[4] Pedro A. Forero, Alfonso Cano, and Georgios B. Giannakis, "Consensus-based distributed support vector machines," The Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010.
[5] Gonzalo Mateos, Juan Andres Bazerque, and Georgios B. Giannakis, "Distributed sparse linear regression," IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, 2010.
[6] Joel B. Predd, Sanjeev R. Kulkarni, and H. Vincent Poor, "A collaborative training algorithm for distributed learning," IEEE Transactions on Information Theory, vol. 55, no. 4, pp. 1856–1871, 2009.
[7] Juan Andres Bazerque and Georgios B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1847–1862, 2010.
[8] Juan Andres Bazerque, Gonzalo Mateos, and Georgios B. Giannakis, "Group-lasso on splines for spectrum cartography," IEEE Transactions on Signal Processing, vol. 59, no. 10, pp. 4648–4663, 2011.
[9] Vassilis Kekatos and Georgios B. Giannakis, "Distributed robust power system state estimation," IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1617–1626, 2013.
[10] Qing Ling and Zhi Tian, "Decentralized sparse signal recovery for compressive sleeping wireless sensor networks," IEEE Transactions on Signal Processing, vol. 58, no. 7, pp. 3816–3827, 2010.
[11] Qing Ling, Zaiwen Wen, and Wotao Yin, "Decentralized jointly sparse optimization by reweighted ℓq minimization," IEEE Transactions on Signal Processing, vol. 61, no. 5, pp. 1165–1170, 2013.
[12] Kun Yuan, Qing Ling, Wotao Yin, and Alejandro Ribeiro, "A linearized Bregman algorithm for decentralized basis pursuit," in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), IEEE, 2013, pp. 1–5.
[13] Qing Ling, Yangyang Xu, Wotao Yin, and Zaiwen Wen, "Decentralized low-rank matrix completion," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2012, pp. 2925–2928.
[14] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[15] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin, "A proximal gradient algorithm for decentralized nondifferentiable optimization," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2964–2968.
[16] Euhanna Ghadimi, Andre Teixeira, Michael G. Rabbat, and Mikael Johansson, "The ADMM algorithm for distributed averaging: Convergence rates and optimal parameter selection," in 2014 48th Asilomar Conference on Signals, Systems and Computers, IEEE, 2014, pp. 783–787.
[17] Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
[18] Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Puschel, "D-ADMM: A communication-efficient distributed algorithm for separable optimization," IEEE Transactions on Signal Processing, vol. 61, no. 10, pp. 2718–2723, 2013.
[19] Joel Friedman, A Proof of Alon's Second Eigenvalue Conjecture and Related Problems, American Mathematical Society, 2008.
[20] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava, "Twice-Ramanujan sparsifiers," SIAM Journal on Computing, vol. 41, no. 6, pp. 1704–1721, 2012.
[21] Yoonsoo Kim and Mehran Mesbahi, "On maximizing the second smallest eigenvalue of a state-dependent graph Laplacian," IEEE Transactions on Automatic Control, vol. 51, no. 1, pp. 116–120, 2006.
[22] Arpita Ghosh and Stephen Boyd, "Growing well-connected graphs," in Proceedings of the 45th IEEE Conference on Decision and Control, IEEE, 2006, pp. 6605–6611.
[23] Lin Xiao and Stephen Boyd, "Fast linear iterations for distributed averaging," Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
[24] Xiaochuan Zhao, Sheng-Yuan Tu, and Ali H. Sayed, "Diffusion adaptation over networks under imperfect information exchange and non-stationary data," IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3460–3475, 2012.
[25] Stephen Boyd, Persi Diaconis, and Lin Xiao, "Fastest mixing Markov chain on a graph," SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
[26] Kun Yuan, Qing Ling, and Wotao Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063, 2013.
[27] John Nikolas Tsitsiklis, "Problems in decentralized decision making and computation," Tech. Rep., DTIC Document, 1984.
[28] I-An Chen, Fast Distributed First-Order Methods, Ph.D. thesis, Massachusetts Institute of Technology, 2012.
[29] M. Ram Murty, "Ramanujan graphs," Journal of the Ramanujan Mathematical Society, vol. 18, no. 1, pp. 33–52, 2003.
[30] Michael William Newman, The Laplacian Spectrum of Graphs, Ph.D. thesis, University of Manitoba, 2000.
[31] Fan R. K. Chung, Spectral Graph Theory, vol. 92, American Mathematical Society, 1997.
[32] Noga Alon and Joel H. Spencer, The Probabilistic Method, John Wiley & Sons, 2004.
[33] Noga Alon, "On the edge-expansion of graphs," Combinatorics, Probability and Computing, vol. 6, no. 2, pp. 145–152, 1997.
[34] Noga Alon, Alexander Lubotzky, and Avi Wigderson, "Semi-direct product in groups and zig-zag product in graphs: connections and applications," in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, IEEE, 2001, pp. 630–637.
[35] Jozef Dodziuk, "Difference equations, isoperimetric inequality and transience of certain random walks," Transactions of the American Mathematical Society, vol. 284, no. 2, pp. 787–794, 1984.
[36] Alexander Lubotzky, Ralph Phillips, and Peter Sarnak, "Ramanujan graphs," Combinatorica, vol. 8, no. 3, pp. 261–277, 1988.
[37] Alexander Lubotzky, Ralph Phillips, and Peter Sarnak, "Explicit expanders and the Ramanujan conjectures," in Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, ACM, 1986, pp. 240–246.
[38] "Lubotzky-Phillips-Sarnak graphs," http://www.mast.queensu.ca/~ctardif/LPS.html, accessed 2010-09-30.
[39] Randy Elzinga, "Producing the graphs of Lubotzky, Phillips and Sarnak in Matlab," http://www.mast.queensu.ca/~ctardif/lps/LPSSup.pdf, 2010.
[40] Adam Marcus, Daniel A. Spielman, and Nikhil Srivastava, "Interlacing families I: Bipartite Ramanujan graphs of all degrees," in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2013, pp. 529–537.
[41] Shlomo Hoory, Nathan Linial, and Avi Wigderson, "Expander graphs and their applications," Bulletin of the American Mathematical Society, vol. 43, no. 4, pp. 439–561, 2006.
[42] Wei Shi, "GGen: Graph generation for network computing simulations," http://home.ustc.edu.cn/~shiwei00/html/GGen.html, 2014.