Totally Asynchronous Primal-Dual in Blocks

Katherine R. Hendrickson and Matthew T. Hale∗

Abstract

We present a parallelized primal-dual algorithm for solving constrained convex optimization problems. The algorithm is “block-based,” in that vectors of primal and dual variables are partitioned into blocks, each of which is updated only by a single processor. We consider four possible forms of asynchrony: in updates to primal variables, updates to dual variables, communications of primal variables, and communications of dual variables. We construct a family of explicit counterexamples to show the need to eliminate asynchronous communication of dual variables, though the other forms of asynchrony are permitted, all without requiring bounds on delays. A first-order primal-dual update law is developed and shown to be robust to asynchrony. We then derive convergence rates to a Lagrangian saddle point in terms of the operations agents execute, without specifying any timing or pattern with which they must be executed. These convergence rates include an “asynchrony penalty” that we quantify and present ways to mitigate. Numerical results illustrate these developments.

I. INTRODUCTION

A wide variety of machine learning problems can be formalized as convex programs [5], [8], [27], [28]. Large-scale machine learning then requires solutions to large-scale convex programs, which can be accelerated through parallelized solvers running on networks of processors. In large networks, it is difficult (or outright impossible) to synchronize their behaviors. The behaviors of interest are computations, which generate new information, and communications, which share this new information with other processors. Accordingly, we are interested in asynchrony-tolerant large-scale optimization. The challenge of asynchrony is that it causes disagreements among processors that result from generating and receiving different information at different times. One way to reduce disagreements is through repeated averaging of processors' iterates. This approach dates back several decades [31], and approaches of this class include primal [22]–[24], dual [11], [29], [30], and primal-dual [20], [34] algorithms. However, these averaging-based methods require bounded delays in some form, often through requiring connectedness of agents' communication graphs over intervals of a prescribed length [6, Chapter 7]. In some applications, delays are outside agents' control, e.g., in a contested environment where communications are jammed, and thus delay bounds cannot be reliably enforced. Moreover,

This work was supported by a task order contract from the Air Force Research Laboratory through Eglin AFB, by ONR under grant N00014-19-1-2543, and by AFOSR under grant FA9550-19-1-0169. ∗The authors are with the Department of Mechanical and Aerospace Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL 32611. Emails: {kat.hendrickson,matthewhale}@ufl.edu.

graph connectivity cannot be easily checked locally by individual agents, meaning that even satisfaction or violation of connectivity bounds is not readily ascertained. In addition, these methods require multiple processors to update each decision variable, which duplicates computations and increases processors' workloads. This can be prohibitive in large problems, such as learning problems with billions of data points.

Therefore, in this paper we develop a totally asynchronous parallelized primal-dual method for solving large constrained convex optimization problems. The term "totally asynchronous" dates back to Bertsekas [6] and includes scenarios where both computations and communications are performed asynchronously. Here, by "parallelized," we mean that each decision variable is updated only by a single processor. As problems grow, this has the advantage of keeping each processor's computational burden approximately constant. The decision variables assigned to each processor are referred to as a "block," and asynchronous block-based algorithms date back several decades as well [4], [31]. Those early works solve unconstrained or set-constrained problems, in addition to select problems with functional constraints [6]. To bring parallelization to arbitrary constrained problems, we develop a primal-dual approach that does not require constraints to have a specific form. Block-based methods have previously been shown to tolerate arbitrarily long delays in both communications and computations in some unconstrained problems [4], [19], [32], eliminating the need to enforce and verify delay boundedness assumptions. For constrained problems of a general form, block-based methods have been paired with primal-dual algorithms with centralized dual updates [15], [17] and/or synchronous primal updates [21]. To the best of our knowledge, arbitrarily asynchronous block-based updates have not been developed for convex programs of a general form.
A counterexample in [17] suggested that arbitrarily asynchronous communications of dual variables can preclude convergence, though that example leaves open the extent to which more limited dual asynchrony is compatible with convergence. In this paper, we present a primal-dual optimization algorithm that permits arbitrary asynchrony in primal variables, while accommodating dual asynchrony to the extent possible. Four types of asynchrony are possible: (i) asynchrony in primal computations, (ii) asynchrony in communicating primal variables, (iii) asynchrony in dual computations, and (iv) asynchrony in communicating dual variables.

The first contribution of this paper is to show that item (iv) is fundamentally problematic, using an explicit family of counterexamples that we construct. This family shows, in a precise way, that even small disagreements among dual variables can cause primal computations to diverge. For this reason, we rule out asynchrony in communicating dual variables. However, we permit all other forms of asynchrony, and, relative to existing work, this is the first algorithm to permit arbitrarily asynchronous computations of dual variables in blocks.

The second contribution of this paper is to establish convergence rates. These rates are shown to depend upon problem parameters, which lets us calibrate their values to improve convergence. Moreover, we show that convergence can be inexact due to dual asynchrony, and thus the scalability of parallelization comes at the expense of a potentially inexact solution. We term this inexactness the "asynchrony penalty," and we give an explicit bound on it as well as methods to mitigate it. Simulation results show convergence of this algorithm and illustrate that the asynchrony penalty is mild.

This paper is an expansion of work previously published as a conference paper [18]. In this paper, we expand

all of our previous results for scalar blocks to non-scalar blocks. Additionally, we suggest some ways to mitigate the asynchrony penalty. Finally, this paper provides full proofs of all results.

The rest of the paper is organized as follows. Section II provides the necessary background on convex optimization and gives the asynchronous primal-dual problem statement. Then Section III presents our asynchronous algorithm and provides a counterexample to total dual asynchrony. Primal and dual convergence rates are developed in Sections IV and V, respectively. We present the overall convergence rate and ways to mitigate the asynchrony penalty in Section VI. Section VII presents a numerical example with implications for relationships among parameters. Finally, we present our conclusions in Section VIII.

II. BACKGROUND AND PROBLEM STATEMENT

We study the following form of optimization problem.

Problem 1: Given f : R^n → R, g : R^n → R^m, and X ⊂ R^n, asynchronously solve

minimize f(x)

subject to g(x) ≤ 0

x ∈ X. ♦

We assume the following about the objective function f.

Assumption 1: f is twice continuously differentiable and convex. △

We make a similar assumption about the constraints g.

Assumption 2: g satisfies Slater's condition, i.e., there exists x̄ ∈ X such that g(x̄) < 0. For all j ∈ {1, . . . , m}, the function g_j is twice continuously differentiable and convex. △

Assumptions 1 and 2 permit a wide range of functions to be used, such as convex polynomials of any order. We impose the following assumption on the constraint set.

Assumption 3: X is non-empty, compact, and convex. It can be decomposed via X = X_1 × · · · × X_N. △

Assumption 3 permits many sets to be used, such as box constraints, which often arise in multi-agent optimization [25].

We will solve Problem 1 using a primal-dual approach. This allows the problem to be parallelized across many processors by re-encoding constraints through Karush-Kuhn-Tucker (KKT) multipliers. In particular, because the constraints g couple the processors' computations, they can be difficult to enforce in a distributed way. However, by introducing KKT multipliers to re-encode constraints, we can solve an equivalent, higher-dimensional unconstrained problem. An ordinary primal-dual approach would find a saddle point of the Lagrangian associated with Problem 1, defined as L(x, µ) = f(x) + µ^T g(x). That is, one would solve min_x max_µ L(x, µ), and, under Assumptions 1-3, this would furnish a solution to Problem 1. However, L is affine in µ, which implies that L(x, ·) is concave but not strongly concave. Strong convexity has been shown to provide robustness to asynchrony in minimization problems [6], and

July 23, 2021 DRAFT 4

thus we wish to endow the maximization over µ with strong concavity. We use a Tikhonov regularization [12, Chapter 12] in µ to form

Lδ(x, µ) = f(x) + µ^T g(x) − (δ/2)||µ||²,    (1)

where δ > 0 and || · || denotes the Euclidean norm. This ensures δ-strong concavity in µ. Instead of regularizing with respect to the primal variable x, we impose the following assumption in terms of the Hessian H(x, µ) = ∇²_x Lδ(x, µ). When convenient, we suppress the arguments x and µ and simply write H.

Assumption 4 (Diagonal Dominance): The n × n Hessian matrix H = ∇²_x Lδ(x, µ) is β-diagonally dominant. That is,

|H_ii| − β ≥ Σ_{j=1, j≠i}^n |H_ij|  for all i = 1, . . . , n. △

If this assumption does not hold, the Lagrangian can be regularized with respect to the primal variable as well to help provide H's diagonal dominance. Some problems satisfy this assumption without regularizing [14], and, for such problems, we proceed without regularizing to avoid unnecessarily introducing regularization error. Diagonal dominance has been shown to arise in text mining tasks, learning tasks in image retrieval, and bioinformatics, but has traditionally resulted in poor performance with Support Vector Machines [14]. When applying our asynchronous algorithm, however, diagonal dominance improves convergence. This follows examples in the literature that illustrate the benefit of diagonal dominance in optimization [10], especially when applied to sum-of-squares problems [1], [2] and distributed optimization [9], [13], [33]. Under Assumptions 1-4, Problem 1 is equivalent to the following saddle point problem.
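As a quick numerical check, the row-wise condition in Assumption 4 can be verified for a given Hessian. The following sketch uses an illustrative matrix and β of our choosing, not data from the paper.

```python
import numpy as np

def is_beta_diagonally_dominant(H, beta):
    """Check Assumption 4: |H_ii| - beta >= sum_{j != i} |H_ij| for every row i."""
    A = np.abs(np.asarray(H, dtype=float))
    off_diag = A.sum(axis=1) - np.diag(A)   # row sums of off-diagonal magnitudes
    return bool(np.all(np.diag(A) - beta >= off_diag))

# Illustrative Hessian: strongly diagonal, so dominance holds with beta = 1 but not beta = 3.
H = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 2.0],
              [0.5, 2.0, 6.0]])
assert is_beta_diagonally_dominant(H, beta=1.0)
assert not is_beta_diagonally_dominant(H, beta=3.0)
```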

Problem 2: Let Assumptions 1-4 hold and fix δ > 0. For Lδ defined in (1), asynchronously compute

(x̂δ, µ̂δ) := arg min_{x∈X} arg max_{µ∈R^m_+} Lδ(x, µ). ♦

It is in this form that we will solve Problem 1, and we present our algorithm for doing so in the next section.

III. ASYNCHRONOUS PRIMAL-DUAL ALGORITHM

One challenge of Problem 2 is that µ̂δ is contained in the unbounded domain R^m_+, which is the non-negative orthant of R^m. Because this domain is unbounded, gradients and other terms are unbounded, which makes convergence analysis challenging, as dual iterates may not be within a finite distance of the optimum. To remedy this problem, we next compute a non-empty, compact, and convex set M that contains µ̂δ.

Lemma 1: Let Assumptions 1-4 hold, let x̄ be a Slater point of g, and set f* := min_{x∈X} f(x). Then

µ̂δ ∈ M := { µ ∈ R^m_+ : ||µ||_1 ≤ (f(x̄) − f*) / min_{1≤j≤m} −g_j(x̄) }.

Proof: Follows Section II-C in [16]. ∎

Here, f* denotes the optimal unconstrained objective function value, though any lower bound for this value will also provide a valid M. In particular, f may be non-negative, and in such cases one can substitute 0 in place of f*.
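The bound in Lemma 1 is easy to evaluate once a Slater point is known. The sketch below (a toy problem of our choosing, not from the paper) computes the radius of M; as noted above, `f_lower` may be f* itself or any lower bound on it.

```python
import numpy as np

def dual_bound(f, g, x_bar, f_lower):
    """Radius of M from Lemma 1: (f(x_bar) - f_lower) / min_j -g_j(x_bar)."""
    margins = -np.asarray(g(x_bar))          # strictly positive at a Slater point
    assert np.all(margins > 0), "x_bar must satisfy g(x_bar) < 0"
    return (f(x_bar) - f_lower) / margins.min()

# Illustrative problem (ours): f(x) = ||x||^2, g(x) = x - 1 componentwise,
# Slater point x_bar = (0.5, 0.5); here f* = 0 is exact.
f = lambda x: float(np.dot(x, x))
g = lambda x: np.asarray(x) - 1.0
radius = dual_bound(f, g, np.array([0.5, 0.5]), 0.0)   # (0.5 - 0)/0.5 = 1.0
```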

DRAFT July 23, 2021 5

Solving Problem 2 asynchronously requires choosing an update law that we expect to be robust to asynchrony and simple to implement in a distributed fashion. In this context, first-order gradient-based methods offer both some degree of inherent robustness and computations that are simpler than those of other available methods, such as Newton-type methods. We apply a projected gradient method to both the primal and dual variables, based on the seminal Uzawa iteration [3],

x(k + 1) = ΠX [x(k) − γ(∇xLδ(x(k), µ(k)))] (2)

µ(k + 1) = ΠM[µ(k) + ρ(∇µLδ(x(k), µ(k)))] (3)

where γ and ρ are stepsizes, ΠX is the Euclidean projection onto X, and ΠM is the Euclidean projection onto M.
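On a small instance, the iteration (2)-(3) can be sketched as follows; the problem data, stepsizes, and the stand-in radius for M are our illustrative choices, not values from the paper.

```python
import numpy as np

# Toy instance (ours): f(x) = (x - 2)^2, g(x) = x - 1, X = [0, 3], so
# L_delta(x, mu) = (x - 2)^2 + mu*(x - 1) - (delta/2)*mu^2.
delta, gamma, rho = 0.01, 0.2, 0.05
x, mu = 0.0, 0.0
for _ in range(5000):
    grad_x = 2.0 * (x - 2.0) + mu                        # ∇x Lδ(x, µ)
    grad_mu = (x - 1.0) - delta * mu                     # ∇µ Lδ(x, µ) = g(x) − δµ
    x = float(np.clip(x - gamma * grad_x, 0.0, 3.0))     # (2): project onto X
    mu = float(np.clip(mu + rho * grad_mu, 0.0, 10.0))   # (3): project onto M (radius 10 here)

# The saddle point of Lδ solves 2(x − 2) + µ = 0 and x − 1 = δµ:
x_star = 1.0 + 2.0 * delta / (1.0 + 2.0 * delta)
mu_star = 2.0 / (1.0 + 2.0 * delta)
```

With δ small, the iterates land near the unregularized solution (x, µ) = (1, 2), illustrating the regularization error that δ introduces.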

A. Overview of Approach

We are interested in decentralizing (2) and (3) among a number of agents while allowing them to generate and share information as asynchronously as possible. We consider N agents indexed over i ∈ I := {1, . . . , N}. We also define the sets Ip := {1, . . . , Np} and Id := {1, . . . , Nd}, where Np is the number of primal agents, Nd is the number of dual agents¹, and Np + Nd = N. The set Ip contains indices of agents that update primal blocks (contained in x), while Id contains indices of agents that update dual blocks (contained in µ). Thus, x ∈ R^n is divided into Np blocks and µ ∈ R^m into Nd blocks.

Primal agent i is responsible for updating the i-th primal block, x[i], and dual agent c is responsible for updating the c-th dual block, µ[c]. Let n_i denote the length of primal agent i's block and m_c the length of dual agent c's block. Then n = Σ_{i=1}^{Np} n_i and m = Σ_{c=1}^{Nd} m_c. The block of constraints g for which dual agent c is responsible is denoted by g[c], and each dual agent projects its individual block onto a set Mc derived from Lemma 1, i.e.,

Mc = { ν ∈ R^{m_c}_+ : ||ν||_1 ≤ (f(x̄) − f*) / min_{1≤j≤m} −g_j(x̄) }.

Using a primal-dual approach, there are four behaviors that could be asynchronous: (i) computations of updates to primal variables, (ii) communications to share updated values of primal variables, (iii) computations of updates to dual variables, and (iv) communications to share updated values of dual variables. We examine each here:

(i) Computations of Updates to Primal Variables: When parallelizing Equation (2) across the Np primal agents, we index all primal agents' computations using the same iteration counter, k ∈ N. However, they may perform updates at different times. The subset of times at which primal agent i ∈ Ip computes an update is denoted by K^i ⊂ N. For distinct i, j ∈ Ip, K^i and K^j need not have any relationship.

(ii) Communications of Updated Primal Variables: Primal variable communications are also totally asynchronous. A primal block's current value may be sent to other primal and dual agents that need it at each time k.

¹Although the same index may be contained in both Ip and Id, we define the sets in this way to avoid non-consecutive numbering of primal and dual agents, which would be cumbersome in the forthcoming analysis. The meaning of an index will always be made unambiguous by specifying whether it is contained in Ip or Id.


We use the notation P^i_j to denote the set of times at which primal agent i sends values of its primal variables to primal agent j². Similarly, we use the notation D^i_c to denote the set of times at which primal agent i sends updated values to dual agent c ∈ Id. Note that P^i_j and D^i_c need not be related to K^i; updates and communications of primal blocks are totally asynchronous.

(iii) Computations of Updates to Dual Variables: Dual agents wait for each primal agent's updated state before computing an update to a dual variable. Dual agents may perform updates at different times because they may receive primal updates at different times. In some cases, a dual agent may receive multiple updates from a subset of primal agents prior to receiving all required primal updates. In this case, only the most recently received update from a primal agent will be used in the dual agent's computation. For all c ∈ Id, dual agent c keeps an iteration count t_c to track the number of updates it has completed.

(iv) Communications of Updated Dual Variables: Previous work [17] has shown that allowing primal agents to disagree arbitrarily about dual variables can preclude convergence. This is explained by the following: fix µ^1, µ^2 ∈ M. Then an agent with µ^1 onboard is minimizing L(·, µ^1), while an agent with µ^2 onboard is minimizing L(·, µ^2). If µ^1 and µ^2 are arbitrarily far apart, then it is not surprising that the minima of L(·, µ^1) and L(·, µ^2) are arbitrarily far apart. However, one may conjecture that small disagreements in dual variables lead to small distances between these minima. Below, we show that this conjecture is false and that even small disagreements in dual variables can lead to arbitrarily large distances between the minima they induce. Even limited asynchrony can lead to small disagreements in dual variables, and, in light of the above discussion, this can cause primal agents' computations to reach points that are arbitrarily far apart.

Therefore, we will develop an algorithm that proceeds with all primal agents agreeing on the value of µ while still allowing dual computations to be divided among dual agents. This is accomplished by allowing primal agents to work completely asynchronously (updates are computed and sent at different times to different agents) but requiring that primal agents use the same dual variable in their computations.

When a dual agent computes an update, it communicates its updated dual variable to all primal agents. Again, for simplicity, it is assumed that the update is received at the same time it is sent. When dual agent c sends its updated dual block to all primal agents, it also sends its iteration count t_c. This allows primal agents to annotate which version of µ is used in their updates, e.g., by appending the value of t_c to the end of the vector of primal variables they communicate. Other primal agents will disregard any received primal updates that use an outdated version of the dual variable to ensure that further primal updates rely upon the same dual value. For example, if a primal agent has a dual variable with an iteration counter of t_c for block c, this agent will ignore a primal update it receives if it was computed with an iteration counter of t_c − 1 for dual block c. The algorithm we present is unchanged if any other method is used to ensure that the primal agents use the same dual variable in their computations.

²For technical convenience, we suppose that there is no delay between sending and receiving messages. There is no loss of generality in our results because we can make P^i_j and D^i_c the times at which messages are received. However, assuming zero delays simplifies the forthcoming discussion and analysis, and this is done for the remainder of the paper.


B. Glossary of Notation in Algorithm 1

Every primal agent stores a local copy of x and µ for use in local computations, while every dual agent stores a local copy of x and only its own block of µ. The following glossary contains the notation used in our algorithm statement:

D^i_c: The times at which messages are sent by primal agent i and received by dual agent c.

k: The iteration count used by all primal agents.

K^i: The set of times at which primal agent i performs updates.

∇x[i]: The derivative with respect to the i-th block of x. That is, ∇x[i] := ∂/∂x[i].

Ni: Essential neighborhood of primal agent i. Primal agent j is an essential neighbor of primal agent i if ∇x[j] Lδ depends upon x[i].

Id: Set containing the indices of all dual agents.

Ip: Set containing the indices of all primal agents.

P^i_j: The times at which messages are sent by primal agent i and received by primal agent j.

τ^i_j(k): Time at which primal agent j originally computed the update that it sent to primal or dual agent i at time k. Note that τ^i_i(k) = k for all i ∈ Ip.

t: The vector of dual agent iteration counts. The c-th entry, t_c, is the iteration count for dual agent c's updates.

x^i_[j]: Primal or dual agent i's value for the primal block j, which is updated/sent by primal agent j. If agent i is primal, it is indexed by both k and t; if agent i is dual, it is indexed by t.

x^i_[i](k; t): Primal agent i's value for its own primal block i at primal time k, calculated with the value of µ from dual update t.

x̂δ: The primal component of the saddle point of Lδ. Part of the optimal solution pair (x̂δ, µ̂δ).

x̂δ^t: Given µ(t), x̂δ^t = arg min_{x∈X} Lδ(x, µ(t)).

µ^c_[d]: Primal or dual agent c's copy of dual block d, which is updated/sent by dual agent d.

µ̂δ: The dual component of the saddle point of Lδ, (x̂δ, µ̂δ).

Mc: The set { ν ∈ R^{m_c}_+ : ||ν||_1 ≤ (f(x̄) − f*) / min_j −g_j(x̄) }. This uses the upper bound in Lemma 1 to project individual blocks of µ.

C. Statement of Algorithm

Having defined our notation, we impose the following assumption on agents’ communications and computations.

Assumption 5: For all i ∈ Ip, the set K^i is infinite. If {k_n}_{n∈N} is an increasing set of times in K^i, then lim_{n→∞} τ^j_i(k_n) = ∞ for all j ∈ Ip and lim_{n→∞} τ^c_i(k_n) = ∞ for all c ∈ Id. △

This simply ensures that no agent ever stops computing or communicating, though delays can be arbitrarily large. We now state the asynchronous primal-dual algorithm.


Algorithm 1:

Step 0: Initialize all primal and dual agents with x(0) ∈ X and µ(0) ∈ M. Set t = 0 ∈ R^{Nd} and k = 0 ∈ N.

Step 1: For all i ∈ Ip and all j ∈ Ni, if k ∈ P^i_j, then primal agent i sends x^i_[i](k; t) to primal agent j.

Step 2: For all i ∈ Ip and all c ∈ Id, if primal agent i receives a dual variable update from dual agent c, it uses the accompanying value of t_c to update the c-th entry of the vector t and sets

µ^i_[c](t) = µ^c_[c](t_c).

Step 3: For all i ∈ Ip, if k ∈ K^i, primal agent i executes

x^i_[i](k+1; t) = Π_{Xi}[ x^i_[i](k; t) − γ∇x[i] Lδ(x^i(k; t), µ^i(t)) ].

If k ∉ K^i, then x^i_[i](k+1; t) = x^i_[i](k; t).

Step 4: For all i ∈ Ip and all j ∈ Ni,

x^i_[j](k+1; t) = x^j_[j](τ^i_j(k+1); t) if agent i receives x^j_[j] at time k+1, and x^i_[j](k+1; t) = x^i_[j](k; t) otherwise.

Step 5: For all i ∈ Ip and all c ∈ Id, if k+1 ∈ D^i_c, primal agent i sends x^i_[i](k+1; t) to dual agent c. Set k := k + 1.

Step 6: For c ∈ Id and i ∈ Ip, if dual agent c receives an update from primal agent i computed with dual update t, it sets

x^c_[i](t_c) = x^i_[i](τ^c_i(k); t).

Otherwise, x^c_[i](t_c) remains constant.

Step 7: For c ∈ Id, if dual agent c has received an update from every primal agent that was computed with the latest dual iteration t, dual agent c executes

µ^c_[c](t_c + 1) = Π_{Mc}[ µ^c_[c](t_c) + ρ ∂Lδ/∂µ[c] (x^c(t_c), µ^c(t_c)) ].

Step 8: If dual agent c updated in Step 7, it sends µ^c_[c](t_c + 1) to all primal agents. Set t_c := t_c + 1.

Step 9: Return to Step 1.
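A drastically simplified simulation can convey the structure of the steps above. The sketch below is our own construction: two primal agents and one dual agent, random (but seeded) computation and communication times for primal blocks, zero transmission delays as in footnote 2, and a dual agent that updates far less frequently and, for simplicity, uses the latest communicated blocks rather than literally waiting as in Step 7. All problem data are illustrative.

```python
import numpy as np

# Toy coupled instance (ours): f(x) = (x1-1)^2 + (x2-1)^2 + x1*x2, g(x) = x1 + x2 - 1,
# X = [0, 2]^2. The Hessian [[2, 1], [1, 2]] is diagonally dominant (beta = 1).
rng = np.random.default_rng(0)
delta, gamma, rho = 0.01, 0.2, 0.1

x_local = [np.zeros(2), np.zeros(2)]  # each primal agent's local copy of x
x_comm = np.zeros(2)                  # block values last communicated to the dual agent
mu = 0.0                              # single dual block, identical on all primal agents

for k in range(4000):
    for i in (0, 1):
        if rng.random() < 0.5:        # asynchronous computation (Step 3)
            xi = x_local[i]
            grad = 2.0 * (xi[i] - 1.0) + xi[1 - i] + mu      # ∇x[i] Lδ
            xi[i] = float(np.clip(xi[i] - gamma * grad, 0.0, 2.0))
        if rng.random() < 0.5:        # asynchronous communication (Steps 1, 5)
            x_comm[i] = x_local[i][i]
            x_local[1 - i][i] = x_local[i][i]
    if k % 25 == 0:                   # dual agent updates much less often (Step 7)
        mu = max(0.0, mu + rho * (x_comm.sum() - 1.0 - delta * mu))

# Saddle point of Lδ by symmetry: µ* = 1/(2 + 3δ), x1* = x2* = (2 − µ*)/3.
mu_star = 1.0 / (2.0 + 3.0 * delta)
x_star = (2.0 - mu_star) / 3.0
```

Despite the agents skipping roughly half of their computation and communication opportunities, the iterates settle near the regularized saddle point, consistent with the tolerance to primal asynchrony claimed above.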

D. Counterexample to Asynchronous Dual Communications

Below we show that communications to share updated values of dual variables cannot be asynchronous in general. The intuition here is as follows. In a primal-dual setup, one can regard each fixed choice of dual vector as specifying a problem to solve in a parameterized family of minimization problems. Formally, with µ fixed, agents solve

minimize_{x∈X}  Lδ(x, µ) := f(x) + µ^T g(x) − (δ/2)||µ||².

For two primal agents with different values of µ, denoted µ^1 and µ^2, they solve two different problems: agent 1 minimizes Lδ(·, µ^1), while agent 2 minimizes Lδ(·, µ^2). With a gradient-based method used to minimize over x,


gradients depend linearly upon µ. This may lead one to believe that for

x̂_1 := arg min_{x∈X} Lδ(x, µ^1)  and  x̂_2 := arg min_{x∈X} Lδ(x, µ^2),

having ||µ^1 − µ^2|| small implies that ||x̂_1 − x̂_2|| is also small. However, we show in the following theorem that this is false.

Theorem 1: Fix any ε > 0 and L > ε. Then, under Assumptions 1-4, there always exists a problem such that ||µ^1 − µ^2|| ≤ ε and ||x̂_1 − x̂_2|| ≥ L.

Proof: See Appendix A. ∎
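A stylized one-dimensional instance conveys the intuition behind Theorem 1, though it is our own illustration, not the construction of Appendix A, and it ignores the regularization and diagonal-dominance details: when the Lagrangian is nearly flat in x, the minimizer over X jumps across the whole set as µ crosses a threshold.

```python
import numpy as np

# Illustrative instance (ours): X = [0, L], f(x) = -x, g(x) = x - 1, so
# L(x, mu) = (mu - 1)*x - mu is linear in x and its minimizer over [0, L]
# jumps from L to 0 as mu crosses 1.
def x_hat(mu, L):
    xs = np.linspace(0.0, L, 10001)
    return xs[np.argmin((mu - 1.0) * xs - mu)]

L, eps = 100.0, 1e-3
x1 = x_hat(1.0 - eps / 2.0, L)   # slope negative -> minimizer at L
x2 = x_hat(1.0 + eps / 2.0, L)   # slope positive -> minimizer at 0
# |mu1 - mu2| = eps is tiny, yet |x1 - x2| = L can be made as large as desired.
```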

IV. PRIMAL CONVERGENCE

Towards defining an overall convergence rate to the regularized optimal solution (x̂δ, µ̂δ), we first find the primal convergence rate for a fixed dual variable.

Given a fixed µ(t), centralized projected gradient descent for minimizing Lδ(·, µ(t)) would be written as

h(x) = ΠX [x − γ∇xLδ(x, µ(t))] , (4)

where γ > 0. The fixed point of h is the minimizer of Lδ(·, µ(t)) and is denoted by x̂δ^t := arg min_{x∈X} Lδ(x, µ(t)). Leveraging some existing theoretical tools in the study of optimization algorithms [4], [7], we can study h in a way that elucidates its behavior under asynchrony in a distributed implementation. According to [7], the assumption of diagonal dominance guarantees that h has the contraction property

||h(x) − x̂δ^t|| ≤ α||x − x̂δ^t|| for all x ∈ X,

where α ∈ [0, 1) and x̂δ^t is a fixed point of h, which depends on the choice of the fixed µ(t). However, the value of α is not specified in [7], and it is precisely that value that governs the rate of convergence to a solution. We therefore compute α explicitly below. First, we bound the stepsize γ.

Definition 1: Let H be the Hessian matrix H = ∇²_x Lδ(x, µ). Choose the primal stepsize γ > 0 to satisfy

γ < 1 / ( max_i max_{x∈X} max_{µ∈M} Σ_{j=1}^n |H_ij(x, µ)| ).

Because x and µ both take values in compact sets, each entry Hij is bounded above and below, and thus the upper bound on γ is positive. Following the method in [4], two n × n matrices G and F must also be defined.

Definition 2: Define the n × n matrices G and F as

G = [  |H_11|   −|H_12|   · · ·   −|H_1n|
      −|H_21|    |H_22|   · · ·   −|H_2n|
         ⋮          ⋮       ⋱        ⋮
      −|H_n1|   −|H_n2|   · · ·    |H_nn| ]    and    F = I − γG,

where I is the n × n identity matrix.

We now have the following.


Lemma 2: Let h, G, and F be as above and let Assumptions 1-4 hold. Then |h(x) − h(y)| ≤ F|x − y| for all x, y ∈ R^n, where |v| denotes the element-wise absolute value of the vector v ∈ R^n and the inequality holds component-wise.

Proof: See Appendix B. ∎

To establish convergence in x, we will show that ||h(x) − h(x̂δ^t)|| ≤ α||x − x̂δ^t|| for all x ∈ X, where α ∈ [0, 1). Toward doing so, we define the following.

Definition 3: Let v ∈ R^n. Then ||v||_max := max_i |v_i|.

We next show that the gradient update law h in (4) converges with asynchronous, distributed computations. Furthermore, we quantify the rate of convergence in terms of γ and β.

Lemma 3: Let h, γ, and ||·||_max be as defined in (4) and Definitions 1 and 3. Let Assumptions 1-4 hold and fix µ(t) ∈ M. Then for the fixed point x̂δ^t of h and for all x in the set X,

||h(x) − h(x̂δ^t)||_max ≤ qp ||x − x̂δ^t||_max, where qp := (1 − γβ) ∈ [0, 1).
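The quantities in Definitions 1-2 and Lemma 3 can be computed directly for a given Hessian; the matrix and β below are our illustrative choices. The check at the end reflects the fact that the infinity-norm of F is at most the contraction factor qp = 1 − γβ.

```python
import numpy as np

# Illustrative Hessian (ours): diagonally dominant with beta = 1 since 3-1 >= 1 and 4-1 >= 1.
H = np.array([[3.0, 1.0],
              [1.0, 4.0]])
beta = 1.0
gamma = 0.9 / np.max(np.abs(H).sum(axis=1))   # satisfies the bound in Definition 1

G = -np.abs(H)                                # off-diagonal entries -|H_ij| ...
np.fill_diagonal(G, np.abs(np.diag(H)))       # ... and |H_ii| on the diagonal
F = np.eye(H.shape[0]) - gamma * G            # F = I - gamma*G as in Definition 2

qp = 1.0 - gamma * beta                       # contraction factor from Lemma 3
row_sums = np.abs(F).sum(axis=1)              # ||F||_inf, which drives Lemma 2's bound
```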

Proof: See Appendix C. ∎

The primal-only convergence rate can be computed by leveraging results in [17] in terms of the number of operations the primal agents have completed (counted in the appropriate sequence). We count operations as follows. Upon primal agents' receipt of a dual variable with iteration vector t, we set ops(k, t) = 0. Then, after all primal agents have computed an update to their decision variables and sent it to and had it received by all other primal agents that need it, say by time k′, we increment ops to ops(k′, t) = 1. After ops(k′, t) = 1, we then wait until all primal agents have subsequently computed a new update (still using the same dual variable indexed with t) and it has been received by all other primal agents that need it. If this occurs at time k′′, then we set ops(k′′, t) = 2, and this process continues. If at some time k′′′ primal agents receive an updated µ (whether just a single dual agent sent an update or multiple dual agents sent updates) with an iteration vector of t′, then the count begins again with ops(k′′′, t′) = 0.

Theorem 2: Let Assumptions 1-4 hold. Let µ(t) be the dual vector onboard all primal agents at some time k, and let k0^t denote the latest time at which any primal agent received µ(t). Then, with primal agents asynchronously executing the gradient update law h, agent i has

||x^i(k; t) − x̂δ^t||_max ≤ qp^{ops(k,t)} max_j ||x^j(k0^t; t) − x̂δ^t||_max,

where x̂δ^t is the fixed point of h with µ(t) held constant.

Proof: From Lemma 3, we see that h is a qp-contraction mapping with respect to the norm ||·||_max. From Section 6.3 in [7], this contraction property implies that there exist sets of the form

X(k) = { x ∈ R^n : ||x − x̂δ^t||_max ≤ qp^k max_j ||x^j(k0^t; t) − x̂δ^t||_max }

that satisfy the following criteria from [17]:


i. · · · ⊂ X(k + 1) ⊂ X(k) ⊂ · · · ⊂ X

ii. lim_{k→∞} X(k) = {x̂δ^t}

iii. For all i, there are sets X_i(k) ⊂ X_i satisfying X(k) = X_1(k) × · · · × X_N(k)

iv. For all y ∈ X(k) and all i ∈ Ip, h_i(y) ∈ X_i(k + 1), where h_i(y) = Π_{Xi}[ y_i − γ∇x[i] Lδ(y) ].

We will use these properties to compute the desired convergence rate. Suppose all agents have a fixed µ(t) onboard. Upon receipt of this µ(t), agent i has x^i(k0^t; t) ∈ X(0) by definition. Suppose at time ℓ_i that agent i computes a state update. Then x^i_[i](ℓ_i + 1; t) ∈ X_i(1). For m = max_{i∈Ip} ℓ_i + 1, we find that x^i_[i](m; t) ∈ X_i(1) for all i. Next, suppose that, after all updates have been computed, these updated values are sent to and received by all agents that need them, say at time m′. Then, for any i ∈ Ip, agent i has x^i_[j](m′; t) ∈ X_j(1) for all j ∈ Ip. In particular, x^i(m′; t) ∈ X(1), and this is satisfied precisely when ops has incremented by one. Iterating this

argument completes the proof. ∎

Of course, each dual update resets the value of ops to zero, which means that the interleaving of primal and dual updates has a significant impact on convergence. We study this next.
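The rate in Theorem 2 can be inverted to estimate how many completed operations are needed to reach a given tolerance; the numbers below are illustrative choices of ours, not from the paper.

```python
import math

# Theorem 2 gives ||x^i(k;t) - x_hat|| <= qp**ops(k,t) * D0, where D0 is the initial
# radius max_j ||x^j(k0;t) - x_hat||_max. Inverting for a target tolerance tol:
def ops_needed(qp, D0, tol):
    assert 0.0 < qp < 1.0 and 0.0 < tol < D0
    return math.ceil(math.log(tol / D0) / math.log(qp))

n_ops = ops_needed(qp=0.64, D0=10.0, tol=1e-3)   # illustrative numbers
```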

V. DUAL CONVERGENCE

In this section, we present two theorems governing dual convergence. The first looks at the behavior during a single dual update and how it affects the distance from the dual optimum. Our definition of the asynchrony penalty follows from this result. The second theorem upper bounds the distance from the optimal dual variable when considering multiple updates by a single agent. Towards defining the behavior during a single dual update, we consider the number of operations primal agents compute before communications are sent to a dual agent. In particular, we are interested in defining the oldest

primal block a dual agent uses in its own computations. Recall that dual agent c performs update t_c + 1 using its copy of the primal variable, x^c(t_c). This copy of the primal variable is comprised of updates received from primal agents prior to the time at which dual agent c computes update t_c + 1. Each of these received primal variables

was sent when some number of operations had been completed by primal agents using dual update tc. Towards quantifying this, we are interested in defining the primal computation time of the oldest primal block used by dual

agent c in update tc + 1.

Definition 4: For dual agent c computing update tc + 1, let κ(c, tc) denote the earliest primal computation time over all blocks in x^c(tc). That is, among all primal blocks used by dual agent c during update tc + 1, κ(c, tc) is the oldest primal time at which any of them was computed.

Thus, the minimum number of operations completed by any primal agent for the blocks used by dual agent c during update tc + 1 is equal to ops(κ(c, tc), t). We are now prepared to derive a block-wise convergence rate for the dual variable. That is, we bound the difference between dual agent c's updated c-th block, µ^c_[c], and the regularized optimal value for this block, µ̂δ,[c].


Theorem 3: Let Assumptions 1-4 hold. Let the dual stepsize ρ satisfy ρ < 2δ/(δ² + 2). Let tc ≥ 0 and consider the case where dual agent c performs a single update, denoted by the iteration counter tc + 1. Then the distance from the optimal value for block c is bounded by

||µ^c_[c](tc + 1) − µ̂δ,[c]||² ≤ qd ||µ^c_[c](tc) − µ̂δ,[c]||² + qp^{2 ops(κ(c,tc),t)} E1(t) + qp^{ops(κ(c,tc),t)} E2(t) + E3,

where E1(t), E2(t), and E3 are defined as

E1(t) = (qd − ρ²) n M²_[c] D²_t,

E2(t) = 2ρ² √n M²_[c] D_t D_max,

E3 = (qd − ρ²) M²_[c] D²_max,

for qd := (1 − ρδ)² + 2ρ² ∈ [0, 1), M_[c] := max_{x∈X} ||∇g_[c](x)||, D_t := max_j ||x^j(k0^t) − x̂δ(t)||_max, and D_max := max_{x,y∈X} ||x − y||, where t is the dual iteration count vector onboard the primal agents when sending updates to dual agent c, and n is the length of the primal variable x.

Proof: See Appendix D. ∎

Remark 1: The term $E_3 = (q_d - \rho^2)M_{[c]}^2 D_{\max}^2$ in Theorem 3 is termed the "asynchrony penalty," because it is an offset from reaching a solution. It is due to asynchronously computing dual variables and is absent when dual updates are centralized [17], [21]. Later, we suggest methods to mitigate this penalty.

We next present a simplified dual convergence rate.

Theorem 4: Let all conditions and definitions of Theorem 3 hold. Consider dual agent $c$ after it has performed some number of updates, $a$, in total. For each $j = 1, \ldots, a$, let $T^c_j$ denote the time at which dual agent $c$ computed its $j^{th}$ update, i.e., the first point in time at which $t_c = j$. Then Algorithm 1's convergence for dual agent $c$ obeys

$$\|\mu^c_{[c]}(a+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq q_d^{a+1}\|\mu^c_{[c]}(0) - \hat{\mu}_{\delta,[c]}\|^2 + \frac{1-q_d^{a+1}}{1-q_d}\bar{E},$$

where $\bar{E} := \max_{i\leq a}\, q_p^{2\,\mathrm{ops}(\kappa(c,a-i),T^c_{a-i})}E_1(T^c_{a-i}) + q_p^{\mathrm{ops}(\kappa(c,a-i),T^c_{a-i})}E_2(T^c_{a-i}) + E_3$.
Proof: See Appendix E. ∎
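The recursion underlying Theorem 4 and its closed-form geometric-series bound can be checked numerically. The following is a minimal sketch in which the values of $q_d$, $\bar{E}$, and the initial squared dual distance are hypothetical illustrative choices, not quantities from the paper:

```python
# Numeric check that iterating u_{k+1} = q_d * u_k + Ebar for a+1 steps
# matches the closed form q_d^(a+1) * u0 + (1 - q_d^(a+1))/(1 - q_d) * Ebar,
# the same structure as the bound in Theorem 4.
# q_d, Ebar, u0, and a below are illustrative assumptions.

q_d = 0.9      # contraction factor; must lie in [0, 1)
Ebar = 0.05    # worst-case per-update error term
u0 = 4.0       # initial squared dual distance
a = 30         # number of dual updates

u = u0
for _ in range(a + 1):
    u = q_d * u + Ebar

closed_form = q_d ** (a + 1) * u0 + (1 - q_d ** (a + 1)) / (1 - q_d) * Ebar
assert abs(u - closed_form) < 1e-9
# As a grows, the bound approaches the floor Ebar / (1 - q_d).
```

As the number of updates $a$ grows, the transient term vanishes geometrically and the bound settles at the floor $\bar{E}/(1-q_d)$, which is the steady-state effect of the asynchrony penalty.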

VI. OVERALL CONVERGENCE AND REDUCING THE ASYNCHRONY PENALTY

Theorems 3 and 4 can be used to give an overall primal convergence rate for $x^i(k;t)$ converging to $\hat{x}_\delta$.

Theorem 5: Let Assumptions 1-4 hold. Then for agents executing Algorithm 1, for all $i$, all $k$, and all $t$, we have

$$\|x^i(k;t) - \hat{x}_\delta\|_2 \leq K_1 q_p^{\mathrm{ops}(k,t)} + K_2\|\mu(t) - \hat{\mu}_\delta\|_2,$$

where $K_1 = \sqrt{n}\max_j\|x^j(k_0) - \hat{x}_\delta(t)\|_{\max}$ and $K_2 = \frac{1}{\beta}\max_{x\in X}\|\nabla g(x)\|$.
Proof: Follows the proof of Theorem 2 of [17]. ∎
The asynchrony penalty may be addressed in two ways: first, by choosing a sufficiently small $\rho$ for a desired error bound, and second, by implementing better dual update timing. Corollary 1 addresses the former and Remark 2 the latter.



Fig. 1. Network graph where Node 0 is the source, Node 1 is the target, and edge thicknesses correspond to flow capacities. The paths may be divided into three groups (green, pink, and orange) such that there are no shared edges for paths in different groups.

Corollary 1: Let all conditions and definitions of Theorem 3 hold. Let an error bound $\epsilon \leq 1$ for $E_3$ be fixed. Then there exist a dual stepsize $\rho$ and regularization parameter $\delta$ such that

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq q_d\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + q_p^{2\,\mathrm{ops}(\kappa(c,t_c),t)}E_1(t) + q_p^{\mathrm{ops}(\kappa(c,t_c),t)}E_2(t) + \epsilon.$$
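As a numeric illustration of the existence claim in Corollary 1, one can choose $\rho$ and $\delta$ jointly so that $\rho\delta = 1$, which makes the leading factor in $E_3$ small. The constants $M_{[c]}$, $D_{\max}$, and the target $\epsilon$ below are hypothetical illustrative values, and the particular choice $\rho = 1/\delta$ with a large $\delta$ is just one sketch of a valid selection:

```python
# Illustrative check that a pair (rho, delta) can drive the asynchrony
# penalty E3 = (q_d - rho^2) * M^2 * Dmax^2 below a target eps while
# still satisfying the stepsize condition of Theorem 3.
# M, Dmax, and eps_target are assumed problem constants (hypothetical).

M, Dmax = 2.0, 1.5
eps_target = 0.5

delta = 10.0              # large regularization parameter
rho = 1.0 / delta         # so that rho * delta = 1
assert rho < 2 * delta / (delta ** 2 + 2)   # stepsize condition

q_d = (1 - rho * delta) ** 2 + 2 * rho ** 2
E3 = (q_d - rho ** 2) * M ** 2 * Dmax ** 2

assert 0 <= q_d < 1
assert E3 <= eps_target
```

The tradeoff, of course, is that a larger $\delta$ moves the regularized saddle point further from the unregularized one, so $\epsilon$ and $\delta$ must be balanced for a given application.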

The additional terms $E_1(t)$ and $E_2(t)$ in Theorem 3 may be addressed by enforcing minor dual update timing restrictions.

Remark 2: Let all conditions and definitions of Theorem 3 hold. Because the terms $E_1(t)$ and $E_2(t)$ are multiplied by $q_p^{2\,\mathrm{ops}(\kappa(c,t_c),t)}$ and $q_p^{\mathrm{ops}(\kappa(c,t_c),t)}$, respectively, the overall error may be reduced by synchronizing and delaying dual updates. The $\mathrm{ops}(\kappa(c,t_c),t)$ term is reset with every dual update shared by any dual agent; by delaying dual updates, $\mathrm{ops}(\kappa(c,t_c),t)$ is allowed to grow, which reduces these error terms. Thus, if dual updates occur frequently and asynchronously, this ops term is also reset frequently. If, however, dual updates occur around the same time with a delay between clusters of updates, the ops term is allowed to grow larger, resulting in reduced error terms.
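The effect described in Remark 2 is easy to see numerically: the $E_1$ and $E_2$ contributions decay geometrically in the ops count, leaving only the additive penalty $E_3$. The values of $q_p$ and the error constants below are hypothetical illustrative choices:

```python
# Illustration of Remark 2: letting ops grow between (clusters of) dual
# updates shrinks the q_p-weighted error terms geometrically, so only the
# additive penalty E3 remains in the limit.
# q_p, E1, E2, E3 below are illustrative assumptions.

q_p = 0.95
E1, E2, E3 = 2.0, 1.0, 0.1

def error_terms(ops):
    """Total error bound contribution as a function of the ops count."""
    return q_p ** (2 * ops) * E1 + q_p ** ops * E2 + E3

# More primal operations between dual updates => strictly smaller error.
assert error_terms(100) < error_terms(10) < error_terms(1)
```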

VII. NUMERICAL EXAMPLE

We consider a network flow problem where agents are attempting to route data over given paths from the source (node 0) to the target (node 1).³ We consider an example whose network graph is given in Figure 1, where the edge widths represent the edge capacities.

³The network graph in Figure 1 was generated using [26]. All other code for this section can be found at www.github.com/kathendrickson/dacoa.

We consider a problem in which we must route 15 different paths from the source to the target along some combination of the 66 edges present. The primal variable is composed of the traffic assigned to each path (with $n = 15$), and the limits for traffic sent along each edge are the constraints (thus, $m = 66$). By construction, the paths and edges can be divided into three groups, with the edges in each group being used only by that group's paths. Each edge group (indicated by the green, pink, and orange colors in Figure 1) contains five paths, corresponding to five entries of the primal variable. The objective function is given by

$$f(x) = -\beta\sum_{i=1}^{n}\log(1+x_i),$$

where $\beta > 0$ is a constant that corresponds to the Hessian matrix's $\beta$-diagonal dominance. Primal variables $x_i$ are allowed to take any value between 0 and 10. The constraints are given by $Ax \leq b$, where the edge capacities, $b$, take random values between 5 and 40, with edges connected to the source and target having capacities of 50. The matrix $A$ is given by

$$A_{k,i} = \begin{cases} 1 & \text{if flow path } i \text{ traverses edge } k \\ 0 & \text{otherwise.} \end{cases}$$

Thus the regularized Lagrangian is given by

$$L_\delta(x,\mu) = -\beta\sum_{i=1}^{n}\log(1+x_i) + \mu^T(Ax - b) - \frac{\delta}{2}\|\mu\|^2,$$

where the Hessian matrix $H$ is $\beta$-diagonally dominant. We choose algorithm parameters $\gamma = 0.01$, $\delta = 0.1$, and $\rho = \frac{\delta}{\delta^2+2} \approx 0.0498$. Communications between agents occur with a random probability called the "communication rate," which we vary across simulation runs below.

We use this simulation example to explore the benefit achieved by using non-scalar blocks over our previous work with scalar blocks [18]. Additionally, we examine the effect that the magnitude of diagonal dominance has on convergence, and we vary the communication rate to study its effect on convergence. Finally, this example demonstrates the effectiveness of our algorithm on large-scale problems and the ease with which it is scaled and distributed.

We begin by comparing scalar blocks to non-scalar blocks. When using scalar blocks, we assign one primal agent to each flow path and one dual agent to each edge constraint. Thus, we have 15 primal agents and 66 dual agents in the scalar block case. Dividing among non-scalar blocks inherently presents an advantage: fewer agents are needed. We assign one primal agent to compute all five flow paths in each network group (denoted by the different colors in Figure 1), and we then assign a single dual agent to the edge constraints of each group. Thus, we have 3 primal agents and 3 dual agents in the non-scalar block case. For a communication rate of 0.75 (primal agents have a 75-percent probability of communicating the latest update to another agent at each iteration), non-scalar blocks provide an advantage in the number of iterations needed to converge, as demonstrated in Figure 2.

Next we vary $\beta$ over $\beta \in \{1, 10, 50\}$. Figure 3 plots the iteration number $k$ versus the Euclidean norm of the difference between the previous iteration's decision variable $x$ and the current decision variable for these values of $\beta$



Fig. 2. Convergence for scalar and non-scalar blocks. Non-scalar blocks provide a significant advantage over scalar blocks when considering the number of iterations needed to reach a solution, as indicated by the red dotted line versus the solid blue line.


Fig. 3. Effect of diagonal dominance on convergence. Here, we see that larger values of β tend to produce faster convergence but also experience large oscillations.

and a communication rate of 0.75. As predicted by Theorem 2, a larger β correlates with faster convergence in general.
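To make the simulation setup concrete, the following is a minimal synchronous Python sketch of the projected primal-dual update applied to a toy flow instance. The small random $A$ and $b$, the projection intervals, and the synchronous loop are illustrative assumptions; Algorithm 1 executes these updates asynchronously and block-wise across agents, and the full implementation is available at the repository linked in this section.

```python
import numpy as np

# Toy instance (illustrative; the paper's example uses n = 15 paths, m = 66 edges).
rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.integers(0, 2, size=(m, n)).astype(float)  # routing matrix: path i on edge k
b = np.full(m, 5.0)                                # edge capacities

beta, gamma, delta = 10.0, 0.01, 0.1
rho = delta / (delta ** 2 + 2)                     # satisfies rho < 2*delta/(delta^2 + 2)

x = np.zeros(n)   # primal variable: traffic per path
mu = np.zeros(m)  # dual variable: one multiplier per edge constraint
for _ in range(5000):
    # Gradient of L_delta(x, mu) = -beta*sum(log(1+x)) + mu^T(Ax - b) - (delta/2)*||mu||^2
    grad_x = -beta / (1.0 + x) + A.T @ mu
    x = np.clip(x - gamma * grad_x, 0.0, 10.0)     # projected primal descent onto [0, 10]^n
    mu = np.maximum(mu + rho * ((A @ x - b) - delta * mu), 0.0)  # projected dual ascent
```

Because of the $-\frac{\delta}{2}\|\mu\|^2$ regularization, the fixed point permits a small constraint slack on the order of $\delta\|\mu\|$; this is exactly the approximation error that the regularization trades for the robustness properties used in the analysis.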

As intuition suggests, varying the communication rate has a significant impact on the number of iterations required and on the behavior of the error in agents' updates, shown in Figure 4. A low communication rate also produces significant oscillations in the iterates' proximity to the optimum. This reveals that faster convergence can be achieved both by improving communication and by increasing the diagonal dominance of the problem, though both come at the cost of increased oscillations.



Fig. 4. Effect of communication rate on convergence. Less frequent communication leads to slower convergence and oscillations in proximity to the optimum.

VIII. CONCLUSION

After exploring a counterexample to asynchronous dual communications, we presented Algorithm 1, a primal-dual approach that is asynchronous in primal updates and communications and asynchronous in distributed dual updates. A numerical example illustrates the benefit of non-scalar blocks and the effect that diagonal dominance and other parameters have upon convergence. Future work will examine additional applications for the algorithm and implementation techniques to reduce error and improve convergence.

APPENDIX

A. Proof of Theorem 1:

Consider the quadratic program

$$\text{minimize } \frac{1}{2}x^TQx \quad \text{subject to } Ax \leq b, \; x \in X,$$

where $\mathbb{R}^{n\times n} \ni Q = Q^T \succ 0$, $A \in \mathbb{R}^{m\times n}$, and $b \in \mathbb{R}^m$ with $m < n$. We take $X$ sufficiently large in a sense to be described below.

Because $Q$ is symmetric and positive definite, its eigenvectors can be orthonormalized, and we denote these eigenvectors by $v_1, \ldots, v_n$. Suppose that these correspond to eigenvalues $\lambda_1(Q) \geq \lambda_2(Q) \geq \cdots \geq \lambda_n(Q) > 0$. To construct an example, suppose that $A \in \mathbb{R}^{m\times n}$ has row $i$ equal to the $i^{th}$ normalized eigenvector of $Q$. Take $\mu^1$ and $\mu^2$ such that $\|\mu^1 - \mu^2\| = \epsilon$. To compute $\hat{x}_1 := \arg\min_{x\in X} L_\delta(x,\mu^1)$, we set $\nabla_x L_\delta(\hat{x}_1,\mu^1) = 0$. Expanding


and solving, we find $\hat{x}_1 = -Q^{-1}A^T\mu^1$, where we assume that $X$ is large enough to contain this point. Similarly, we find $\hat{x}_2 = -Q^{-1}A^T\mu^2$, where we also assume $\hat{x}_2 \in X$. Then

$$\|\hat{x}_1 - \hat{x}_2\|_2 = \|Q^{-1}A^T(\mu^2 - \mu^1)\|_2 \geq \sigma_{\min}\big(Q^{-1}A^T\big)\|\mu^1 - \mu^2\|_2,$$

where $\sigma_{\min}(\cdot)$ is the minimum singular value. Expanding,

$$\sigma_i^2\big(Q^{-1}A^T\big) = \lambda_i\big(AQ^{-T}Q^{-1}A^T\big) = \lambda_i\big(AQ^{-2}A^T\big),$$

and $AQ^{-2}A^T = \mathrm{diag}\left(\frac{1}{\lambda_1(Q)^2}, \ldots, \frac{1}{\lambda_m(Q)^2}\right)$, which follows from the fact that the rows of $A$ are orthonormalized eigenvectors of $Q$. Then $\sigma_{\min}\big(Q^{-1}A^T\big) = \frac{1}{\lambda_1(Q)}$. Using this above, we find

$$\|\hat{x}_1 - \hat{x}_2\|_2 \geq \frac{1}{\lambda_1(Q)}\|\mu^1 - \mu^2\|_2.$$

To enforce $\|\hat{x}_1 - \hat{x}_2\| \geq L$, we ensure that $\frac{\epsilon}{\lambda_1(Q)} \geq L$, which is attained for any matrix satisfying $\lambda_i(Q) \leq \frac{\epsilon}{L}$ for all $i$. ∎
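The construction in this proof can be verified numerically. In the following sketch, the problem sizes, eigenvalues, and dual vectors are arbitrary illustrative choices:

```python
import numpy as np

# Numerical check of the Appendix A construction: with A's rows equal to
# orthonormalized eigenvectors of Q, the minimizers x_hat = -Q^{-1} A^T mu
# satisfy ||x1 - x2|| >= ||mu1 - mu2|| / lambda_1(Q).
# Sizes, eigenvalues, and dual vectors are illustrative assumptions.

rng = np.random.default_rng(2)
n, m = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal eigenvector basis
lams = np.array([4.0, 3.0, 2.5, 1.5, 1.0])         # lambda_1 >= ... >= lambda_n > 0
Q = V @ np.diag(lams) @ V.T
A = V[:, :m].T                                     # row i is the i-th eigenvector of Q

mu1 = rng.uniform(0, 1, m)
mu2 = rng.uniform(0, 1, m)
x1 = -np.linalg.solve(Q, A.T @ mu1)
x2 = -np.linalg.solve(Q, A.T @ mu2)

lhs = np.linalg.norm(x1 - x2)
rhs = np.linalg.norm(mu1 - mu2) / lams[0]
assert lhs >= rhs - 1e-9
```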

B. Proof of Lemma 2:

We proceed by showing the satisfaction of three conditions in [4]: (i) $\gamma$ is sufficiently small, (ii) $G$ is positive definite, and (iii) $F$ is positive definite.

(i) $\gamma$ is sufficiently small: Results in [7] require $\gamma\sum_{j=1}^{n}|H_{ij}| < 1$ for all $i \in \{1, \ldots, n\}$, which here follows immediately from Definition 1.

(ii) $G$ is positive definite: By definition, $G$ has only positive diagonal entries. By $H$'s diagonal dominance we have the following inequality for all $i \in \{1, \ldots, n\}$:

$$|G_{ii}| = |H_{ii}| \geq \sum_{j\neq i}|H_{ij}| + \beta > \sum_{j\neq i}|H_{ij}| = \sum_{j\neq i}|G_{ij}|.$$

Because $G$ has positive diagonal entries, is symmetric, and is strictly diagonally dominant, $G$ is positive definite by Gershgorin's Circle Theorem.

(iii) $F$ is positive definite: Definition 1 ensures the diagonal entries of $F$ are always positive, and $F$ is diagonally dominant if, for all $i \in \{1, \ldots, n\}$,

$$|F_{ii}| = 1 - \gamma|H_{ii}| > \gamma\sum_{j\neq i}|H_{ij}| = \sum_{j\neq i}|F_{ij}|.$$

This requirement can be rewritten as $\gamma\sum_{j=1}^{n}|H_{ij}| < 1$, which was satisfied under Condition (i). Because $F$ has positive diagonal entries, is symmetric, and is strictly diagonally dominant, $F$ is positive definite by Gershgorin's Circle Theorem. ∎
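The Gershgorin argument used twice in this proof is easy to sanity-check numerically: a symmetric matrix with positive diagonal that is strictly diagonally dominant has only positive eigenvalues. The matrix constructed below is an arbitrary illustrative example, not one from the paper:

```python
import numpy as np

# Sanity check of the Gershgorin Circle Theorem argument: enforcing
# |H_ii| >= sum_{j != i} |H_ij| + beta on a symmetric matrix guarantees
# every eigenvalue is at least beta > 0, hence positive definiteness.
# The matrix and beta are illustrative assumptions.

rng = np.random.default_rng(1)
n, beta = 6, 0.5
B = rng.uniform(-1, 1, size=(n, n))
H = (B + B.T) / 2.0                       # symmetric off-diagonal part
np.fill_diagonal(H, 0.0)
row_sums = np.sum(np.abs(H), axis=1)
np.fill_diagonal(H, row_sums + beta)      # enforce beta-diagonal dominance

# Gershgorin: every eigenvalue lies in [H_ii - R_i, H_ii + R_i], so >= beta.
assert np.min(np.linalg.eigvalsh(H)) >= beta - 1e-9
```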


C. Proof of Lemma 3:

Assumption 4 and the definition of $F$ give

$$\sum_{j=1}^{n}F_{ij} = 1 - \gamma\Big(|H_{ii}| - \sum_{j\neq i}|H_{ij}|\Big) \leq 1 - \gamma\beta.$$

This result, Definition 3, and Lemma 2 give

$$\|h(x) - h(\hat{x}^t_\delta)\|_{\max} = \max_i |h_i(x) - h_i(\hat{x}^t_\delta)| \leq \max_i \sum_{j=1}^{n}F_{ij}|x_j - \hat{x}^t_{\delta,j}| \leq \max_l |x_l - \hat{x}^t_{\delta,l}|\,\max_i \sum_{j=1}^{n}F_{ij} \leq \max_l |x_l - \hat{x}^t_{\delta,l}|(1-\gamma\beta) = (1-\gamma\beta)\|x - \hat{x}^t_\delta\|_{\max}.$$

All that remains is to show $(1-\gamma\beta) \in [0,1)$. By Definition 1 and the inequality $|H_{ii}| \geq \beta$, for all $x \in X$ and $\mu(t) \in M$,

$$\gamma\beta < \frac{\beta}{\max_i \sum_{j=1}^{n}|H_{ij}|} \leq \frac{\beta}{\max_i |H_{ii}|} \leq \frac{\beta}{\beta} = 1. \qquad \blacksquare$$
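For the special case of a linear update map $h(x) = (I - \gamma H)x$, the max-norm contraction of Lemma 3 can be checked directly. The matrix $H$, stepsize $\gamma$, dominance parameter $\beta$, and test points below are illustrative assumptions:

```python
import numpy as np

# Numerical check of the Lemma 3 contraction for the linear map
# h(x) = (I - gamma * H) x: with H symmetric, beta-diagonally dominant,
# and gamma * (row sums of |H|) < 1, we get
#   ||h(x) - h(y)||_max <= (1 - gamma * beta) * ||x - y||_max.
# H, gamma, beta, and the test points are illustrative assumptions.

rng = np.random.default_rng(3)
n, beta = 6, 0.5
B = rng.uniform(-1, 1, size=(n, n))
H = (B + B.T) / 2.0
np.fill_diagonal(H, 0.0)
off = np.sum(np.abs(H), axis=1)
np.fill_diagonal(H, off + beta)                     # beta-diagonal dominance
gamma = 0.9 / np.max(np.sum(np.abs(H), axis=1))     # gamma * row sums < 1

F = np.eye(n) - gamma * H
x, y = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.max(np.abs(F @ x - F @ y))
rhs = (1 - gamma * beta) * np.max(np.abs(x - y))
assert lhs <= rhs + 1e-9
```

Here the absolute row sums of $F = I - \gamma H$ equal exactly $1 - \gamma\beta$, which is where the contraction factor comes from.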

D. Proof of Theorem 3:

Define $\hat{x}^t_\delta = \arg\min_{x\in X} L_\delta(x,\mu(t))$ and $\hat{x}_\delta = \arg\min_{x\in X} L_\delta(x,\hat{\mu}_\delta)$. Let $x^c_t := x^c(t_c)$ for brevity of notation. Expanding the dual update law and using the non-expansiveness of $\Pi_M$, we find

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 = \|\Pi_{M_c}[\mu^c_{[c]}(t_c) + \rho(g_{[c]}(x^c_t) - \delta\mu^c_{[c]}(t_c))] - \Pi_{M_c}[\hat{\mu}_{\delta,[c]} + \rho(g_{[c]}(\hat{x}_\delta) - \delta\hat{\mu}_{\delta,[c]})]\|^2$$

$$\leq \|\mu^c_{[c]}(t_c) + \rho(g_{[c]}(x^c_t) - \delta\mu^c_{[c]}(t_c)) - \hat{\mu}_{\delta,[c]} - \rho(g_{[c]}(\hat{x}_\delta) - \delta\hat{\mu}_{\delta,[c]})\|^2$$

$$= \|(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}) - \rho(g_{[c]}(\hat{x}_\delta) - g_{[c]}(x^c_t))\|^2$$

$$\leq (1-\rho\delta)^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}_\delta)\|^2 - 2\rho(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})^T(g_{[c]}(\hat{x}_\delta) - g_{[c]}(x^c_t)).$$

Adding $g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(\hat{x}^t_\delta)$ inside the last set of parentheses gives

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq (1-\rho\delta)^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}_\delta)\|^2 - 2\rho(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})^T(g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)) - 2\rho(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})^T(g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)). \quad (5)$$

We can write

$$0 \leq \|(1-\rho\delta)(g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)) + \rho(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})\|^2,$$

which can be expanded and rearranged to give

$$-2\rho(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})^T(g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)) \leq (1-\rho\delta)^2\|g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)\|^2 + \rho^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2.$$

Similarly,

$$-2\rho(1-\rho\delta)(\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]})^T(g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)) \leq (1-\rho\delta)^2\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)\|^2 + \rho^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2.$$


Applying these inequalities to (5) gives

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq (1-\rho\delta)^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}_\delta)\|^2 + (1-\rho\delta)^2\|g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)\|^2 + 2\rho^2\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + (1-\rho\delta)^2\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)\|^2$$

$$\leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}_\delta)\|^2 + (1-\rho\delta)^2\|g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)\|^2 + (1-\rho\delta)^2\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)\|^2. \quad (6)$$

Expanding the $\rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}_\delta)\|^2$ term in (6) gives

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}^t_\delta)\|^2 + \rho^2\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(\hat{x}_\delta)\|^2 + 2\rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}^t_\delta)\|\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(\hat{x}_\delta)\| + (1-\rho\delta)^2\|g_{[c]}(\hat{x}_\delta) - g_{[c]}(\hat{x}^t_\delta)\|^2 + (1-\rho\delta)^2\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(x^c_t)\|^2.$$

Grouping terms, this inequality simplifies to

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \big((1-\rho\delta)^2 + \rho^2\big)\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}^t_\delta)\|^2 + 2\rho^2\|g_{[c]}(x^c_t) - g_{[c]}(\hat{x}^t_\delta)\|\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(\hat{x}_\delta)\| + \big((1-\rho\delta)^2 + \rho^2\big)\|g_{[c]}(\hat{x}^t_\delta) - g_{[c]}(\hat{x}_\delta)\|^2.$$

Using the Lipschitz property of $g_{[c]}$, we can write

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \big((1-\rho\delta)^2 + \rho^2\big)M_{[c]}^2\|x^c_t - \hat{x}^t_\delta\|^2 + 2\rho^2 M_{[c]}^2\|x^c_t - \hat{x}^t_\delta\|\|\hat{x}^t_\delta - \hat{x}_\delta\| + \big((1-\rho\delta)^2 + \rho^2\big)M_{[c]}^2\|\hat{x}^t_\delta - \hat{x}_\delta\|^2.$$

Using $\|\hat{x}^t_\delta - \hat{x}_\delta\| \leq D_{\max}$, the inequality simplifies to

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \big((1-\rho\delta)^2 + \rho^2\big)M_{[c]}^2\|x^c_t - \hat{x}^t_\delta\|^2 + 2\rho^2 M_{[c]}^2 D_{\max}\|x^c_t - \hat{x}^t_\delta\| + \big((1-\rho\delta)^2 + \rho^2\big)M_{[c]}^2 D_{\max}^2. \quad (7)$$

Using Definition 4, define $\tilde{x}^c(t_c)$ as the primal variable whose distance is greatest from the optimal value at primal time $\kappa(c,t_c)$. That is, $\tilde{x}^c(t_c) := \max_j \|x^j(\kappa(c,t_c),t_c) - \hat{x}^t_\delta\|$. Using this, the contraction property of primal updates from Theorem 2, and the definition of $D_t$, we can write

$$\|x^c_t - \hat{x}^t_\delta\| \leq \|\tilde{x}^c(t_c) - \hat{x}^t_\delta\| \leq \sqrt{n}\|\tilde{x}^c(t_c) - \hat{x}^t_\delta\|_{\max} \leq q_p^{\mathrm{ops}(\kappa(c,t_c),t)}\sqrt{n}\max_j\|x^j(k_0) - \hat{x}^t_\delta\|_{\max} = q_p^{\mathrm{ops}(\kappa(c,t_c),t)}\sqrt{n}\,D_t.$$

Applying this result to (7) gives

$$\|\mu^c_{[c]}(t_c+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq \big((1-\rho\delta)^2 + 2\rho^2\big)\|\mu^c_{[c]}(t_c) - \hat{\mu}_{\delta,[c]}\|^2 + \big((1-\rho\delta)^2 + \rho^2\big)nM_{[c]}^2 q_p^{2\,\mathrm{ops}(\kappa(c,t_c),t)}D_t^2 + 2\rho^2\sqrt{n}M_{[c]}^2 D_{\max}q_p^{\mathrm{ops}(\kappa(c,t_c),t)}D_t + \big((1-\rho\delta)^2 + \rho^2\big)M_{[c]}^2 D_{\max}^2.$$

Using $\rho < \frac{2\delta}{\delta^2+2}$, we have $q_d = (1-\rho\delta)^2 + 2\rho^2 \in (0,1)$, completing the proof. ∎


E. Proof of Theorem 4:

Recursively applying Theorem 3 gives

$$\|\mu^c_{[c]}(a+1) - \hat{\mu}_{\delta,[c]}\|^2 \leq q_d\|\mu^c_{[c]}(a) - \hat{\mu}_{\delta,[c]}\|^2 + q_p^{2\,\mathrm{ops}(\kappa(c,a),T^c_a)}E_1(T^c_a) + q_p^{\mathrm{ops}(\kappa(c,a),T^c_a)}E_2(T^c_a) + E_3$$

$$\leq q_d^2\|\mu^c_{[c]}(a-1) - \hat{\mu}_{\delta,[c]}\|^2 + q_d q_p^{2\,\mathrm{ops}(\kappa(c,a-1),T^c_{a-1})}E_1(T^c_{a-1}) + q_d q_p^{\mathrm{ops}(\kappa(c,a-1),T^c_{a-1})}E_2(T^c_{a-1}) + q_d E_3 + q_p^{2\,\mathrm{ops}(\kappa(c,a),T^c_a)}E_1(T^c_a) + q_p^{\mathrm{ops}(\kappa(c,a),T^c_a)}E_2(T^c_a) + E_3$$

$$\leq q_d^{a+1}\|\mu^c_{[c]}(0) - \hat{\mu}_{\delta,[c]}\|^2 + \sum_{i=0}^{a} q_d^i\Big(q_p^{2\,\mathrm{ops}(\kappa(c,a-i),T^c_{a-i})}E_1(T^c_{a-i}) + q_p^{\mathrm{ops}(\kappa(c,a-i),T^c_{a-i})}E_2(T^c_{a-i}) + E_3\Big)$$

$$\leq q_d^{a+1}\|\mu^c_{[c]}(0) - \hat{\mu}_{\delta,[c]}\|^2 + \bar{E}\sum_{i=0}^{a} q_d^i.$$

By definition, $q_d \in [0,1)$, and summing the geometric series completes the proof. ∎

REFERENCES

[1] Amir Ali Ahmadi and Georgina Hall. Sum of squares basis pursuit with linear and second order cone programming, pages 27-53. Contemporary Mathematics, 2017.
[2] Amir Ali Ahmadi and Anirudha Majumdar. DSOS and SDSOS optimization: More tractable alternatives to sum of squares and semidefinite optimization. SIAM Journal on Applied Algebra and Geometry, 3(2):193-230, 2019.
[3] Kenneth Joseph Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Studies in Linear and Non-linear Programming, volume 2. Stanford University Press, 1958.
[4] Dimitri P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27(1):107-120, Sep 1983.
[5] Dimitri P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Belmont, 2015.
[6] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., USA, 1989.
[7] Dimitri P. Bertsekas and John N. Tsitsiklis. Some aspects of parallel and distributed iterative algorithms—a survey. Automatica, 27(1):3-21, January 1991.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Qianqian Cai, Zhaorong Zhang, and Minyue Fu. A fast converging distributed solver for linear systems with generalised diagonal dominance, 2019.
[10] Michael B. Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton's method and interior point methods. CoRR, abs/1704.02310, 2017.
[11] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2011.
[12] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.
[13] Andreas Frommer. Generalized nonlinear diagonal dominance and applications to asynchronous iterative methods. Journal of Computational and Applied Mathematics, 38(1):105-124, 1991.
[14] Derek Greene and Pádraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, pages 377-384, 2006.
[15] M. T. Hale and M. Egerstedt. Cloud-based optimization: A quasi-decentralized approach to multi-agent coordination. In 53rd IEEE Conference on Decision and Control, pages 6635-6640, 2014.
[16] M. T. Hale, A. Nedić, and M. Egerstedt. Cloud-based centralized/decentralized multi-agent optimization with communication delays. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 700-705, Dec 2015.


[17] M. T. Hale, A. Nedić, and M. Egerstedt. Asynchronous multiagent primal-dual optimization. IEEE Transactions on Automatic Control, 62(9):4421-4435, Sep. 2017.
[18] Katherine R. Hendrickson and Matthew T. Hale. Towards totally asynchronous primal-dual convex optimization in blocks. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 3663-3668, 2020.
[19] S. Hochhaus and M. T. Hale. Asynchronous distributed optimization with heterogeneous regularizations and normalizations. In 2018 IEEE Conference on Decision and Control (CDC), pages 4232-4237, 2018.
[20] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068-3076, 2014.
[21] Jayash Koshal, Angelia Nedić, and Uday V. Shanbhag. Multiuser optimization: Distributed algorithms and error analysis. SIAM Journal on Optimization, 21(3):1046-1081, 2011.
[22] A. Nedić, A. Ozdaglar, and P. A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922-938, 2010.
[23] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.
[24] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601-615, 2015.
[25] Ivano Notarnicola and Giuseppe Notarstefano. Asynchronous distributed optimization via randomized dual proximal gradient. IEEE Transactions on Automatic Control, 62(5):2095-2106, 2016.
[26] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.
[27] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.
[28] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[29] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Push-sum distributed dual averaging for convex optimization. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 5453-5458, 2012.
[30] K. I. Tsianos and M. G. Rabbat. Distributed dual averaging for convex optimization under communication delays. In 2012 American Control Conference (ACC), pages 1067-1072, 2012.
[31] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803-812, 1986.
[32] M. Ubl and M. T. Hale. Totally asynchronous distributed quadratic programming with independent stepsizes and regularizations. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 7423-7428, 2019.
[33] Zhaorong Zhang and Minyue Fu. Convergence of message-passing for distributed convex optimisation with scaled diagonal dominance, 2019.
[34] Minghui Zhu and Sonia Martínez. On distributed convex optimization under inequality and equality constraints. IEEE Transactions on Automatic Control, 57(1):151-164, 2011.
