Asynchronous Algorithms for Large-Scale Optimization

Analysis and Implementation

ARDA AYTEKIN

Licentiate Thesis
Stockholm, Sweden 2017

KTH Royal Institute of Technology
School of Electrical Engineering
Department of Automatic Control
SE-100 44 Stockholm, Sweden

TRITA-EE 2017:021
ISSN 1653-5146
ISBN 978-91-7729-328-6

Academic dissertation which, with the permission of KTH Royal Institute of Technology (Kungliga Tekniska högskolan), is presented for public examination for the degree of Licentiate of Engineering in Electrical and Systems Engineering on Friday, April 7, 2017, at 10:00 in Q2, Q-huset, Kungliga Tekniska högskolan, Osquldas väg 10, Stockholm.

© Arda Aytekin, Apr 2017

Print: Universitetsservice US AB

Abstract

This thesis proposes and analyzes several first-order methods for convex optimization, designed for parallel implementation in shared and distributed memory architectures. The theoretical focus is on designing algorithms that can run asynchronously, allowing computing nodes to execute their tasks with stale information without jeopardizing convergence to the optimal solution.

The first part of the thesis focuses on shared memory architectures. We propose and analyze a family of algorithms to solve an unconstrained, smooth optimization problem consisting of a large number of component functions. Specifically, we investigate the effect of information delay, inherent in asynchronous implementations, on the convergence properties of the incremental prox-gradient descent method. Contrary to related proposals in the literature, we establish delay-insensitive convergence results: the proposed algorithms converge under any bounded information delay, and their constant step-size can be selected independently of the delay bound.

Then, we shift focus to solving constrained, possibly non-smooth, optimization problems in a distributed memory architecture. This time, we propose and analyze two important families of gradient descent algorithms: asynchronous mini-batching and incremental aggregated gradient descent. In particular, for asynchronous mini-batching, we show that, by suitably choosing the algorithm parameters, one can recover the best-known convergence rates established for delay-free implementations, and expect a near-linear speedup with the number of computing nodes. Similarly, for incremental aggregated gradient descent, we establish global linear convergence rates for any bounded information delay.

Extensive simulations and actual implementations of the algorithms in different platforms on representative real-world problems validate our theoretical results.

ACKNOWLEDGMENTS

First of all, I would like to express my gratitude to my main advisor, Mikael Johansson, and my co-advisors, Alexandre Proutiere and Dimos Dimarogonas, for accepting me as their Ph.D. student and giving me the opportunity of being part of such a great family at KTH Royal Institute of Technology. I would like to especially thank Mikael for his never-ending patience, professional-yet-friendly attitude, constant efforts in not only promoting my strengths but also improving my weaknesses, and his excellent guidance in research. Thanks to you, Mikael, I have learned a lot in research while working with you — from formulating problems and systematically analyzing them using the correct tools, to presenting the results of my research both in written and oral forms.

I am also indebted to my colleagues whom I have collaborated with. I would like to thank Hamid for all the fruitful discussions and his help in convex optimization and algorithm analysis. I am grateful to Burak for the interesting control problems we have worked on together: even though I have not covered them in this thesis, the time spent there has added to my knowledge and skills. Last, but not least, I feel lucky to have such great support from Cristian Rojas and my “partner in crime” Niklas in developing software tools to be (hopefully) presented at a workshop in the near future. In addition, I would also like to acknowledge Burak, Demia, Hamid, Martin Biel, Sadegh, Sarit and Vien for proofreading my thesis and providing me with constructive comments.

Automatic Control at KTH is a great family in terms of both quantity and quality. I am very fortunate to have spent my time among you all! Apologies in advance, should I forget to explicitly mention your names... I would like to start with thanking both the current — Demia “piccolo” Della Penda, Martin Biel, Max, Sarit, and Vien — and the former — António “W.M.” Gonga, Burak, Euhanna “the old chap” Ghadimi, Hamid, Jeff, Sadegh, and Themis “yet even older chap” Charalambous — members of our group for all the inspiring discussions we have had at the meetings and all the fun extracurricular activities we have done together. I thank you, my office mates, Jezdimir, Martin Andreasson, Martin Biel, Miguel, Mohamed, Niklas, and Valerio “Valerione” Turri, for creating a warm and relaxing working environment. Among the great people I have met at the department, I would like to thank, in particular, Burak, Demia, Hamid, Jeff, Kaveh, Martin Andreasson, Mohamed, Niclas, Niklas, Riccardo, Sadegh, Themis, and Valerio for not only being my colleagues but also being a part of my life as true friends!


Our administrators... Thank you, Anneli, Gerd, Hanna, Karin, Kristina and Silvia for being so helpful, positive and kind at all times. I am grateful to you all for fixing all the administrative issues, helping me with the paperwork, and spoiling us all with all the waffles and “semlor!”

Finally, the closest ones in Turkey... I thank you, my parents, Mine and Süreyya, for always believing in me and for your unconditional support in my efforts to achieve my goals! Similarly, special thanks go to our extended family members, Berrin and Rıdvan Tuğsuz, and Göksan and İhsan Hakyemez, for always being “there” together with my parents. Equally important are my friends Burak and Serdar Demirel, Mehmet Ayyıldız, Utku Boz and Begüm Yıldırım. I thank you all for all your support and for putting up with me whenever I was stressed out.

Arda Aytekin
Stockholm, March 2017.

CONTENTS

Acknowledgments

Contents

1 Introduction
   1.1 Motivation
   1.2 Contributions and Outline

2 Preliminaries
   2.1 Notation
   2.2 Preliminaries

3 Shared Memory Algorithms
   3.1 Problem Formulation
   3.2 Main Result
   3.3 Numerical Example
   3.4 Proofs

4 Distributed Memory Algorithms
   4.1 Problem Formulation
   4.2 Main Result
   4.3 Numerical Example
   4.4 Proofs

5 Conclusion

Bibliography

CHAPTER 1

INTRODUCTION

In this thesis, we will investigate the effect of information delay when designing and running asynchronous algorithms to solve optimization problems on a relatively large scale. Specifically, we will propose a family of parallel algorithms, analyze their convergence properties under stale information and verify the theoretical results by implementing the algorithms to solve some representative examples of optimization problems.

1.1 Motivation

An optimization problem is a problem of choosing the best element (with respect to some criterion) from a given set of elements. The standard way of writing optimization problems is

\[
\begin{aligned}
\underset{x \in \mathbb{X}}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & \tilde{h}_i(x) \leq 0 \,, \quad i = 1, \dots, I \,, \\
& \bar{h}_j(x) = 0 \,, \quad j = 1, \dots, J \,,
\end{aligned}
\]

where x denotes the decision variable defined in some given set 𝕏, f(x) is the objective, or cost, to be minimized, and h̃_i(x) and h̄_j(x) denote the inequality and equality constraints of the problem, respectively. The problem is said to be feasible if there exists a decision variable in the given set which satisfies all the constraints. If there are no constraints in the problem, the problem is said to be unconstrained.

Optimization problems are important in engineering applications. Engineers often find themselves in the loop of collecting data about processes, building representative mathematical models based on the collected data, formulating optimization problems to minimize a cost while meeting some design criteria, and solving the problems. In these problems, the cost usually relates to some penalty on the resources used or the deviation from a desired behavior. Then, the task is to come up with the best decision that minimizes this cost while fulfilling the design criteria dictated by the constraints of the problem.



Figure 1.1: A simplified block diagram representation of an MPC, employed in velocity control of vehicles. Given the linearized model (A_k, B_k) and the cost (Q_k, R_k, f), the MPC samples the current state x_{k_0} and solves an optimization problem to find the best input values that minimize the total cost while satisfying the state and input bounds (x̲_k ≤ x_k ≤ x̄_k, u̲_k ≤ u_k ≤ ū_k) up to a horizon of K sampling instances. Then, it sends the best input u_{k_0} to the vehicle and repeats the procedure in the next sampling interval.

Below are two illustrative, real-world examples of optimization problems encountered in engineering.

Example 1.1 (Model Predictive Control). Model predictive control (MPC) is an advanced, multivariable control algorithm that uses an internal dynamical model to predict the future behavior of a given process, and solves, at each sampling instance, an optimization problem to minimize a given cost while satisfying a set of constraints. For instance, an MPC algorithm employed in velocity control of vehicles (cf. Figure 1.1) can be written in the form

\[
\begin{aligned}
\underset{u_k}{\text{minimize}} \quad & \sum_{k=k_0}^{k_0+K-1} \left( x_k^\top Q_k x_k + u_k^\top R_k u_k \right) + f\left(x_{k_0+K}\right) \\
\text{subject to} \quad & x_{k+1} = A_k x_k + B_k u_k \,, \\
& \underline{u}_k \leq u_k \leq \bar{u}_k \,, \\
& \underline{x}_k \leq x_k \leq \bar{x}_k \,, \\
& k = k_0, \dots, k_0 + K - 1 \,,
\end{aligned}
\]

where u_k is the input to the vehicle, e.g., the fuel injection, and x_k is the state, e.g., the deviation from a set-point velocity. MPC samples the state of the vehicle periodically, as dictated by the sampling interval. Then, it tries to minimize the total cost, e.g., the penalty on not meeting the desired velocity and the fuel consumption, defined by the (Q_k, R_k) pair at each sampling instance up to a horizon of K sampling instances. It does so by using the internal model, linearized in the above example and represented by the (A_k, B_k) pair, to predict the future values of the state, while satisfying some lower and upper bound constraints. In some applications, the objective might also contain a terminal cost f for stability purposes.
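To make the structure of this finite-horizon problem concrete, the sketch below sets it up with CVXPY; all numerical values (the model matrices A and B, the weights Q and R, the bounds, and the horizon K) are hypothetical placeholders rather than data from the thesis.

```python
# A minimal sketch of the finite-horizon MPC problem above, using CVXPY.
# All numerical values (A, B, Q, R, bounds, horizon) are hypothetical.
import numpy as np
import cvxpy as cp

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # linearized model (A_k, B_k), held constant here
B = np.array([[0.005], [0.1]])
Q = np.diag([1.0, 0.1])                  # state cost Q_k
R = np.array([[0.01]])                   # input cost R_k
K = 20                                   # prediction horizon
x0 = np.array([1.0, 0.0])                # sampled current state x_{k0}

x = cp.Variable((2, K + 1))
u = cp.Variable((1, K))

cost = 0
constraints = [x[:, 0] == x0]
for k in range(K):
    cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
    constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],   # prediction model
                    cp.abs(u[:, k]) <= 1.0,                     # input bounds
                    cp.abs(x[:, k + 1]) <= 5.0]                 # state bounds

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()
print("best first input u_{k0}:", u[:, 0].value)
```

Only the first computed input would be applied to the vehicle; the problem is then re-solved at the next sampling instance with the newly measured state.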


Figure 1.2: Document classification. Here, the task is to build a learning model that classifies documents into one of “sports-related” (labeled v_n^[2] = 1) and “others” (v_n^[2] = −1). During the training phase, the model tries to learn a class indicator x that minimizes the empirical model error based on already labeled inputs. After the training, the model uses x to separate the “sports-related” documents from others in a set of previously unencountered (unlabeled) inputs. In this example, the model gives much more emphasis to the frequently occurring words “score,” “goal,” “win,” and “champ” than to others while determining whether a given document should be classified as “sports-related” or not. Please note that some weights are relatively small, possibly due to the regularization.

Example 1.2 (Binary Classification). In machine learning, classification is a type of supervised learning problem, where one tries to assign a class, or a category, to a given input based on similar, past observations obtained from a training set. When there are only two categories, the problem is a binary classification problem; otherwise, it is a multiclass classification problem.

A widely used example is document classification (see Figure 1.2). In document classification, one usually assumes that the input is a fixed-size bag of words representation of a document, i.e., a collection of words and their frequency of appearance in the corresponding document. The goal in binary classification is to build a learning model for singling out some documents, e.g., “sports-related” documents, from all others. This can be achieved, for instance, by solving the following regularized logistic regression problem

\[
\underset{x}{\text{minimize}} \quad \underbrace{\frac{1}{N} \sum_{n=1}^{N} \log\left( 1 + \exp\left( -v_n^{[2]} \left\langle v_n^{[1]}, x \right\rangle \right) \right)}_{\text{empirical model error}} + \underbrace{\lambda_1 \left\| x \right\|_1}_{\text{regularization}} \,,
\]

where N is the total number of documents in the training set, v_n^[1] is the bag of words for the n-th document, and v_n^[2] is the label for the category of the document. The idea here is that the learning model should learn a class indicator, x, to be used in classifying documents.

This is achieved by minimizing an empirical model error defined over the training set. These problems sometimes also involve a regularization term in the objective. In the above example, the regularization promotes sparsity in the class indicator, i.e., it favors class indicators that contain many small weights (cf. Figure 1.2).
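As a concrete instance of the objective above, the snippet below evaluates the regularized logistic loss and the gradient of its smooth part on synthetic data; the dataset and the regularization weight are invented for illustration.

```python
# Regularized logistic regression objective from Example 1.2 (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 20
V1 = rng.normal(size=(N, d))          # bag-of-words features v_n^[1] (here: random)
V2 = rng.choice([-1.0, 1.0], size=N)  # labels v_n^[2]
lam = 0.1                             # regularization weight (hypothetical)

def objective(x):
    margins = -V2 * (V1 @ x)
    empirical_error = np.mean(np.log1p(np.exp(margins)))
    return empirical_error + lam * np.sum(np.abs(x))

def smooth_gradient(x):
    # gradient of the empirical model error (the l1 term is handled separately, e.g., by a prox step)
    margins = -V2 * (V1 @ x)
    sigma = 1.0 / (1.0 + np.exp(-margins))
    return -(V1.T @ (V2 * sigma)) / N

x = np.zeros(d)
print(objective(x), np.linalg.norm(smooth_gradient(x)))
```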

Looking at the examples, we realize two important aspects of optimization problems. First, they can grow very large in size, i.e., in memory complexity, depending on the application. For instance, the document classification problem in a given language among different types of media, e.g., scientific journals, newspapers, magazines and books, can become very hard to solve, if not impossible, on general purpose computers. Second, even if the problems are moderate in size, there might exist constraints on the solution time, i.e., the time complexity, of a given problem. For example, in the velocity control problem above, the MPC algorithm has to deliver a solution within the sampling interval between consecutive sampling instances.

1.1.1 Emergence of “Big Data”

Although attempts to process and understand large amounts of available data have been around for long, the term “big data,” in its modern context, was coined by Roger Mougalas of O’Reilly Media in 2005 [1]. In its current context, it refers to large sets of information that are too large, or too complex, to handle by traditional means. We are hearing the term “big data” more often each day. The reason is partly due to the wide adoption of Internet-based, or cloud-based, technologies in a variety of engineering fields. It has become commonplace to collect, share and process large volumes of data over the Internet, thanks to the developments in communication infrastructures, data storage technologies as well as computing architectures. As a result, optimization problems have not only expanded in problem size but also evolved into a stage where the problem data is scattered among different geographical locations.

A wide variety of domains such as astronomy, biology and machine learning involve opti- mization problems defined over large datasets. Some of these datasets are also becoming publicly and easily accessible. For instance, Amazon [2] hosts freely-accessible, public datasets in its centralized data repositories using Apache’s Hadoop framework [3]. Hadoop is a resilient distributed dataset framework, i.e., a fault-tolerant abstraction for distributed data storage and processing, which takes advantage of data locality. More precisely, it splits the dataset into large blocks, and distributes them among a cluster of commodity hardware computers that are capable of executing complex mathematical operations on their portion of data. This, in return, allows for faster and more efficient processing of data as opposed to conventional approaches which define data and operations in different places.

Remarkable examples of hosted large datasets include Google Books Ngrams [4], NASA NEX [5], and 1000 Genomes [6]. Google Books Ngrams, which exceeds 2 terabytes, is a collection of n-grams, i.e., fixed size tuples of n items, appearing in books in different languages, which enables researchers, for instance, to make connections between the emotional output of a society and significant events in a given period. Similarly, NASA NEX is a collaborative platform that combines supercomputing, Earth system modeling and remote-sensing data, which exceeds 10 terabytes, from NASA’s satellites in order to provide scientists with tools to run and share modeling algorithms related to the Earth’s surface. Although it could be tempting to simply download 10 terabytes of data and save it on a couple of hard disks, given today’s connection and storage capabilities, it is worth noting that such an attempt would take 4 days to download the data with a decent 200 Mbit/sec Internet connection and 18 hours to read through it, rendering the problem impractical to solve on standard computers such as laptops. Finally, 1000 Genomes is an international research effort to establish a detailed catalog of human genetic variation in the hope of providing a valuable tool for all fields of biological science. It has a huge dataset, i.e., a data collection exceeding 300 terabytes, generated in the three phases of the project.

As can be seen, many challenging problems involve “big data” applications in the sense that they cannot be handled by traditional means, i.e., the use of a single, general purpose computer. The increased availability of freely-accessible, abundant data has created a strong interest in developing optimization algorithms that “parallelize well.”
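The back-of-the-envelope figures above are easy to reproduce. The short calculation below assumes a 10-terabyte dataset, a 200 Mbit/s connection, and a sustained sequential read rate of about 150 MB/s; the read rate is an assumption, not a number from the text.

```python
# Rough download and read times for a 10 TB dataset.
TB = 1e12                       # bytes
dataset = 10 * TB

download_rate = 200e6 / 8       # 200 Mbit/s expressed in bytes per second
read_rate = 150e6               # assumed sequential read rate, bytes per second

download_days = dataset / download_rate / 86400
read_hours = dataset / read_rate / 3600
print(f"download: {download_days:.1f} days, read: {read_hours:.1f} hours")
# download: ~4.6 days, read: ~18.5 hours
```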

1.1.2 Parallel Programming

Traditionally, computer programs are written in a serial fashion, i.e., performing the actions specified in a program one at a time. However, time and memory complexities of most problems necessitate the use of parallel programming. Parallel programming is a form of writing programs, in which one splits the overall problem into parts, each of which is handled by a separate computing node, and then, collects the partial results obtained from individual nodes. Here, the computing nodes can be multiple processors in a single computer or multiple computers connected together in some way. The approach should give a better performance; it should either yield faster results, enable solutions to bigger problems, or achieve both, which would have been unachievable otherwise. When writing parallel programs, one needs to think of the following aspects.

Speedup. Perhaps the most important performance measure for writing parallel programs is the potential speedup when using multiple computing nodes. Speedup is the relative improvement in execution time of a given computer program in its parallel implementation when compared to its best serial implementation. In other words, speedup is defined as S_W = t_1/t_W, where t_1 is the execution time of the best sequential algorithm and t_W is the execution time of the parallel implementation running on W worker (computing) nodes. The maximum speedup possible, then, is W, which is achievable only if the program can be split into equal-duration chunks and assigned to each of W workers without any additional overhead. Unfortunately, this is an idealized scenario. Usually, in parallel programs, there are times when some workers are simply idle waiting for others to finish their tasks, or extra computations are needed to recalculate some variables, or the local data of a program needs to be mapped to and collected from different worker nodes, which results in a communication overhead. As a result, most parallel programs cannot achieve the maximum possible speedup.

Efficiency. Another performance measure for parallel programs is the efficiency, which measures to which extent the worker nodes are utilized for doing useful work. Time spent in communication, for instance, as opposed to doing the actual computations, is not considered useful, and thus, results in a drop in the efficiency of the program. Efficiency is defined as E = SW ∕W , which can be regarded as the “relative speedup per worker.” Highly efficient parallel programs, e.g., E ≈ 1, utilize all the worker nodes for computation at all times, whereas those inefficient ones, e.g., E ≪ 1, require worker nodes to spend more time on communication. For instance, most Monte Carlo simulations are embarrassingly parallel — a term used for programs which require little or no communication — and they are more efficient than, say, parallelized partial differential equation solvers, which need to communicate the boundary information among worker nodes after each iteration.
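Both measures are simple ratios of measured execution times; the helper below computes them (the timing numbers in the example call are invented).

```python
# Speedup S_W = t1 / tW and efficiency E = S_W / W from measured run times.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, workers):
    return speedup(t_serial, t_parallel) / workers

# Hypothetical timings: 100 s serially, 15 s on 8 workers.
S = speedup(100.0, 15.0)        # ~6.7
E = efficiency(100.0, 15.0, 8)  # ~0.83, i.e., 83% of the ideal linear speedup
print(S, E)
```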

Scalability. Compared to the previous two aspects, scalability is rather imprecise [7]. From the hardware’s point of view, a parallel programming architecture is scalable if the architecture can maintain the performance improvement when the size, i.e., the number of worker nodes, is increased. Normally, when new workers are added to the architecture, communication among them will increase, resulting in delays and contentions, which in turn decrease the efficiency. Hence, a parallel architecture is scalable if it can maintain the efficiency when the size of the architecture is increased. Similarly, a parallel algorithm is scalable if increased data in the algorithm does not result in much increase in the computational steps needed. Combining the two notions of scalability, we realize that the goal, and the challenge, in parallel programming is to accommodate increased problem sizes with increased architecture sizes for a specific algorithm and architecture pair.

Optimization problems are also solved algorithmically on computing nodes, and it is natural to try to exploit parallelism when solving “big data” optimization problems. In fact, parallel optimization has a rich history [8], and many early results have been unified and significantly extended in the influential book by Bertsekas and Tsitsiklis [9]. However, it is still an important issue to tailor the algorithms to make the best use of current computing platforms. Today, modern computers consist of a multitude of computing nodes: multi-core central processing units (CPUs) provide tens of powerful computing units, while general purpose graphics processing units (GPGPUs) can contain thousands of relatively simpler nodes. In addition, commoditized distributed computing services such as Amazon’s Elastic Compute Cloud [2], Google’s Compute Engine [10] and Microsoft’s Azure [11] have made it relatively cheap and convenient to have access to the desired computation power, especially for short durations. These architectures certainly differ in scale, but they also tend to differ in their use of shared and distributed memory.


Figure 1.3: A simplified view of shared memory architectures. This architecture has 32 computing nodes connected to 16 memory nodes. Memory nodes together define a single memory space, i.e., each processing node can directly access data stored in any memory node.

Shared Memory Architectures

In a shared memory architecture, all the computing nodes have access to the same memory, i.e., a single memory space, where both the executable code and the data for a given program are stored (see Figure 1.3). The nodes and the memory space are connected through an interconnection network, also referred to as the memory bus in traditional multi-core, multiprocessor desktop computers.

From the programmer’s perspective, writing parallel programs for the shared memory architecture is attractive and relatively easy for two reasons. First, there are open standards [12] that enable programmers to “ask” the compilers of low-level languages such as C, C++, and Fortran to generate the parallel version of their serial algorithms suitable for shared memory architectures. Similarly, some languages also provide convenient threading support to help map portions of a program to individual nodes while using shared program data [13, 14]. Second, the convenience of having a single memory space shared among computing nodes removes a lot of burden from the programmer when accessing problem data.

Shared memory architectures have some drawbacks. Small-sized shared memory architectures are cost-effective. However, as the size increases, the memory bus needs to be expanded to provide enough bandwidth for the increasing number of computing nodes accessing the shared memory. Similarly, due to the limitations of physical connections, most large-sized shared memory architectures have a hierarchical structure, in which some computing nodes are closer to some parts of memory than others. This type of configuration results in the so-called nonuniform memory access (NUMA), as opposed to uniform memory access (UMA), among computing nodes, i.e., some workers have faster access to parts of data than others. To alleviate the problem of different memory access times, modern computing architectures provide caches in computing nodes to save local copies of frequently accessed data. Then, the challenge is to maintain cache coherency among computing nodes, i.e., to provide identical copies in caches of individual nodes when the cached memory has been modified.
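As a rough illustration of this shared-memory style of parallelism, the sketch below splits a large sum over data that every worker can read directly among a pool of threads. It is a toy Python example, not the C/C++/Fortran setting the text refers to.

```python
# Toy shared-memory parallelism: every thread reads the same array directly.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.random.default_rng(1).normal(size=1_000_000)  # lives in one shared memory space
workers = 4
chunks = np.array_split(data, workers)                   # map portions of the work to threads

def partial_sum(chunk):
    return float(np.sum(chunk))

with ThreadPoolExecutor(max_workers=workers) as pool:
    total = sum(pool.map(partial_sum, chunks))            # collect the partial results

print(total, float(np.sum(data)))                         # the two sums agree
```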


Figure 1.4: A view of distributed memory architectures. Here, the global data is distributed over N shared memory architectures. Each shared memory architecture may consist of multiple computing nodes that have direct access to only the local part of the data. These nodes need to use an interconnection layer to access other parts of the data. In cloud computing systems, the interconnection layer is usually the Internet, which is much slower compared to the memory buses.

Distributed Memory Architectures

In a distributed memory architecture, each computing entity has immediate access to its local memory, whereas some global information needs to be coordinated among entities using an interconnection network (cf. Figure 1.4). Such an architecture is inherent in desktop computers utilizing both CPUs and GPGPUs, i.e., hybrid memory architectures (compare Figure 1.5), but also in cloud computing systems.

Distributed memory architectures, specifically commodity computing systems, have some advantages over shared memory architectures. First, they scale better with the increasing demand for computing nodes than shared memory architectures do. For example, it is relatively cheaper to build an interconnection among already available computers than to buy a brand new computer to have a system consisting of 64 processors. In addition, distributed memory architectures may be the only solution when moving all the program data to a single location is not feasible due to either the volume (cf. the NASA NEX example above) or the related privacy issues concerning the data.

When writing parallel programs for distributed memory architectures, we should be aware of their limitations. First, the global information coordination might be tricky in these architectures. For instance, the information is coordinated on a relatively reliable, physical interconnection network in desktop computers, whereas the use of the Internet is required in cloud computing systems that involve computing nodes scattered in different geographical locations. The use of the Internet may lead to significant delays if computing nodes have slow connections or if some of the nodes simply fail due to, e.g., power outage or connection problems. Second, since there is no single memory space, local memories of individual nodes are not accessible by the others. As a result, some sort of message passing is needed when coordinating among nodes. This requires not only the use of third-party libraries such as OpenMPI [15] and ZeroMQ [16], but also relatively larger changes in the serial version of the program.

Figure 1.5: A schematic representation of hybrid memory architectures. Computers utilizing both CPUs and GPGPUs are traditionally distributed memory architectures (left), where computing nodes on CPUs and GPGPUs each have their own separate memory spaces. A physical interconnection layer (cf. Figure 1.4) is used for data exchange between the CPU and GPGPU memory spaces. Modern computers also support the so-called unified memory architecture (right), where part of the CPU memory is used efficiently by the GPGPU. This architecture can also be regarded as the distributed shared memory architecture [7], since the data exchange between different memory spaces is implicit.

Synchronous vs. Asynchronous Computations

Regardless of the parallel architecture used, parallel programs face the problem of controlling access to shared resources among the computing nodes. These resources could be files on disks, any physical device that the program has access to, or simply some data in memory relevant to the computations. Reading from shared resources normally does not pose any problems. However, when changing the resources’ state, e.g., writing to a file or changing a variable in memory, race conditions occur among computing nodes, in which the resulting state depends on the sequence of uncontrollable events. Sections of a program which contain shared resources are called critical sections. The programmer has to ensure consistent results by removing the race conditions via mutual exclusion.

Mutual exclusion is a kind of synchronization primitive which allows for controlled access to a shared resource. The first computing node that reaches the mutual exclusion gets hold of a lock before executing the critical section, does the necessary computations related to the shared resource, and finally releases the lock upon exiting the critical section, enabling other nodes to get the lock if they need to. This mechanism ensures that the critical section is processed by only one computing node at a time, resulting in consistency of the state of the resource. In shared memory architectures, uninterruptible, or atomic, operations are used to efficiently provide controlled access in critical sections of the program. In distributed memory architectures, however, the process is somewhat more involved. Since there is no global memory space which is shared among all the nodes, defining a global lock common to all the nodes is harder. Instead, one defines reader/writer policies when giving access to shared information. The simplest policy is to have a single reader/single writer policy, i.e., a master-worker framework, in which the master is responsible for all the reading and writing of the shared data, and the workers make requests to the master.

Synchronization mechanisms are also used to communicate global information that requires the attention of all computing nodes at the same time, also referred to as process synchronization, and to wait for a specific event to occur, i.e., event synchronization. Synchronization points in a program are often needed for consistency purposes. However, they reduce the performance, and hence the efficiency, of parallel programs due to the idle waiting times in critical sections. Algorithms that require a significant amount of synchronization among nodes are called synchronous algorithms, whereas those that can tolerate asynchrony are called asynchronous algorithms.
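The sketch below shows a race condition on a shared counter and the same update guarded by a lock; it is a deliberately simple illustration, not code from the thesis.

```python
# A race condition on a shared counter, and the same update guarded by a lock.
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write: a critical section without protection

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:            # mutual exclusion: one thread at a time
            counter += 1

def run(worker, n=100_000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts: t.start()
    for t in ts: t.join()
    return counter

print(run(unsafe_increment))  # may fall short of 400000 due to lost updates (interpreter-dependent)
print(run(safe_increment))    # always 400000
```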

Traditionally, optimization algorithms are designed under the assumption of serial and synchronous operations. At each iteration of an optimization algorithm, the goal is to find a new feasible decision variable that results in a decrease in the cost. In the popular gradient descent method, for example, this is achieved by computing the gradient of the cost at the current variable and then taking a step in its negative direction. When the decision variable is large, the task of computing the gradient can be split into smaller tasks and mapped to different computing nodes. In synchronous optimization algorithms, each node calculates its part of the gradient independently, and then synchronizes with the other nodes at the end of each iteration to calculate the new variable. Such a synchronization means that the algorithm will be running at the pace of the slowest computing node. Moreover, it has the risk of bringing the algorithm to a deadlock, a state in which each computing node is waiting for some other node, in case one of the nodes fails.

In contrast to synchronous algorithms, asynchrony allows the nodes to compute gradients at different rates without global synchronization, and lets each node perform its update independently of others by using out-of-date gradients. We can gain some advantages from asynchronous implementations of optimization algorithms. First, fewer global synchronization points will give reduced idle waiting times and alleviated congestion in interconnection networks. Second, fast computing nodes will be able to execute more updates in the algorithm. Similarly, the overall system will be more robust to individual node failures. However, on the negative side, asynchrony runs the risk of rendering an otherwise convergent algorithm divergent. Note that convergence analysis of asynchronous algorithms tends to be more challenging since their dynamics are much richer [17], and asynchronous optimization algorithms often converge under more restrictive conditions than their synchronous counterparts. Thus, tuning an algorithm to withstand large amounts of asynchrony will typically result in unnecessarily slow convergence if the actual implementation is synchronous.


Figure 1.6: A simple representation of the Parameter Server framework. Masters and workers can be located in different geographical locations such as the United States, Europe and Pacific Asia. Masters are responsible for all the reading and writing of the shared data, and workers make requests to the masters. Each worker is allowed to do operations on outdated information and at its own pace.

In the thesis, we analyze a family of asynchronous, gradient descent based algorithms for convex optimization problems. We investigate the effect of information delay, i.e., out-of-date gradients, in shared and distributed memory architectures. The main framework we will be following in implementing the algorithms is the Parameter Server framework [18].

Li et al. have proposed the Parameter Server framework (see Figure 1.6) to overcome the significant delays and possible failures that may occur in distributed memory architectures. In Parameter Server, optimization problems involving large parameter vectors (decision variables) and big training data sets are distributed among different general purpose computers in a master-worker setting (a generalization of the single reader/single writer policy). Masters can communicate among each other and have access to globally shared data, and they utilize different worker nodes to solve parts of the overall problem. Since different masters can utilize the same worker for different purposes, each worker is assigned a portion of the problem data and a copy of the decision variable. To avoid communication overhead, masters push the most up-to-date decision variable to workers only when needed. Hence, workers have delayed copies of the most recent decision variable, and they do their calculations based on these outdated values. Whenever a worker finishes the assigned work, it sends the result to the calling master, which updates the decision variable and synchronizes with the other masters.
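To fix ideas, the following toy sketch mimics the master-worker pattern in a single process: the master owns the decision variable, each worker computes a gradient of its data partition from a possibly stale copy, and the master applies whatever arrives. The data are synthetic, and the sketch is a schematic illustration rather than the Parameter Server implementation of [18].

```python
# A toy, single-process mimic of the master-worker pattern with stale copies.
import numpy as np

rng = np.random.default_rng(2)
d = 50
A = rng.normal(size=(1000, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=1000)
parts = np.array_split(np.arange(1000), 4)      # data partition, one block per worker

x_master = np.zeros(d)                           # decision variable owned by the master
copies = [x_master.copy() for _ in parts]        # (possibly stale) copies held by the workers
gamma = 0.02

for k in range(2000):
    w = rng.integers(len(parts))                 # the worker that reports next
    rows, x_stale = parts[w], copies[w]          # its copy is out of date by now
    g = A[rows].T @ (A[rows] @ x_stale - b[rows]) / len(rows)
    x_master -= gamma * g                        # master applies the (stale) partial gradient
    copies[w] = x_master.copy()                  # ...and pushes the fresh iterate back

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_master - x_star), np.linalg.norm(x_star))
```

By the time a worker is queried again, several other updates have usually been applied, so the copy it computes with is already outdated; this is exactly the staleness studied in the following chapters.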

1.2 Contributions and Outline

The thesis has the following structure.

Chapter 2: Preliminaries

In this chapter, we first introduce the notation that is used throughout the thesis. Then, we review some important definitions and relations in convex optimization to make the thesis self-contained.

Chapter 3: Shared Memory Algorithms

In Chapter 3, we present a new family of algorithms for solving unconstrained, smooth convex optimization problems in shared memory architectures. We motivate the design of the algorithms from basic observations, and then analyze their convergence properties under information delays. The chapter is a summary of the following works:

• Arda Aytekin, Hamid Reza Feyzmahdavian, and Mikael Johansson. “Asynchronous Incremental Block-Coordinate Descent”. In: 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton). Institute of Electrical and Electronics Engineers (IEEE), Sept. 2014. DOI: 10.1109/allerton.2014.7028430,

• Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “A Delayed Proximal Gradient Method with Linear Convergence Rate”. In: 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). Institute of Electrical and Electronics Engineers (IEEE), Sept. 2014. DOI: 10.1109/mlsp.2014.6958872.

Chapter 4: Distributed Memory Algorithms

In Chapter 4, we consider constrained, possibly non-smooth, optimization problems in distributed memory architectures. Motivated by the Parameter Server framework, we analyze a family of algorithms and their convergence properties. The chapter is a summary of the following works:

• Arda Aytekin, Hamid Reza Feyzmahdavian, and Mikael Johansson. “Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server”. In: (Oct. 18, 2016). arXiv: 1610.05507v1 [math.OC]¹,

• Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “An Asynchronous Mini-Batch Algorithm for Regularized Stochastic Optimization”. In: IEEE Transactions on Automatic Control (2016), pp. 1–15. DOI: 10.1109/tac.2016.2525015,

• Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “An asynchronous mini-batch algorithm for regularized stochastic optimization”. In: 2015 54th IEEE Conference on Decision and Control (CDC). Institute of Electrical and Electronics Engineers (IEEE), Dec. 2015. DOI: 10.1109/cdc.2015.7402404.

Chapter 5: Conclusion

We finalize the thesis with some possible future directions.

¹ Submitted to IEEE Transactions on Automatic Control. Under review.

CHAPTER 2

PRELIMINARIES

In this chapter, we first introduce the notation that is used in the thesis, and then review the key definitions and some important relations related to the convex optimization field that will be used throughout the thesis.

2.1 Notation

Throughout the thesis, we use blackboard font to denote sets. We reserve ℝ, ℕ and ℕ₀ for the set of real numbers, the set of natural numbers, and the set of natural numbers including zero, respectively. Superscripts in the set notation denote the dimension information. For instance, ℝ^d denotes the d-dimensional vector space, where a vector x = (x^(1) ⋯ x^(d))^⊤ has its coordinates from the set of real numbers, whereas ℝ^(d₁×d₂) denotes the set of real matrices having d₁ rows and d₂ columns. Subscripts, when used with variables, denote their value at a specific instance, e.g., x_k is the value of x at time instance k.

For a random variable v whose probability distribution μ is supported on a set 𝕍 ⊆ ℝ^(d₂), we use P_v[v̄] as a shorthand for P[v = v̄] to denote “the probability of v being equal to v̄.” Similarly, for a function F: ℝ^(d₁) × ℝ^(d₂) → ℝ, we use E_v[F(x, v)] to denote the expectation defined as

\[
\mathrm{E}_v\left[ F(x, v) \right] = \int_{\mathbb{V}} F(x, v) \, d\mu \,,
\]

and we drop the subscript when the random variable is obvious from the context. Last, to denote the conditional expectation with respect to the random variable v_k, given the filtration ℱ_k̄ generated by v_k up to some time instance k̄, we use the notation E_{v_k}[· | ℱ_k̄].

We denote the inner product of two vectors x and y in the d-dimensional vector space ℝ^d with ⟨x, y⟩. We assume that ℝ^d is endowed with a norm ‖·‖, and use ‖·‖_* to represent the corresponding dual norm, defined by

\[
\|y\|_* := \sup_{\|x\| \leq 1} \, \langle x, y \rangle \,.
\]

For a real-valued function f: ℝ^d → ℝ, we use ∇f(x) to denote its gradient evaluated at a point x. When a real-valued function F: ℝ^(d₁) × ℝ^(d₂) → ℝ is under consideration, we use ∇^(1) F and ∇^(2) F to denote its partial gradient with respect to the first and second vector of variables, respectively.

For any partition of x ∈ ℝ^d into B non-overlapping blocks, (x^[1], …, x^[B]), with x^[b] ∈ ℝ^(d_b), b = 1, …, B, and

\[
\sum_{b=1}^{B} d_b = d \,,
\]

we define U_[b] ∈ ℝ^(d×d_b) as the corresponding partition, i.e., set of columns, of the identity matrix:

\[
I_d = \left[ U_{[1]}, U_{[2]}, \dots, U_{[B]} \right] \in \mathbb{R}^{d \times d} \,.
\]

Then, the partial gradient of f: ℝ^d → ℝ with respect to x^[b] ∈ ℝ^(d_b) is defined as

\[
\nabla^{[b]} f(x) = U_{[b]}^\top \nabla f(x) \,.
\]
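The block notation is easy to mirror numerically. The snippet below builds the column partition U_[b] of the identity for a hypothetical block structure and checks that stacking the partial gradients recovers the full gradient.

```python
# Column partition of the identity and the corresponding partial gradients.
import numpy as np

d, block_sizes = 6, [2, 3, 1]                 # hypothetical partition with d_1 + d_2 + d_3 = d
I = np.eye(d)
offsets = np.cumsum([0] + block_sizes)
U = [I[:, offsets[b]:offsets[b + 1]] for b in range(len(block_sizes))]  # the matrices U_[b]

def grad_f(x):                                # gradient of f(x) = 0.5 * ||x||^2, as an example
    return x

x = np.arange(1.0, d + 1.0)
partial_grads = [U_b.T @ grad_f(x) for U_b in U]              # partial gradient for block b
print(np.allclose(np.concatenate(partial_grads), grad_f(x)))  # True
```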

2.2 Preliminaries

In the thesis, we focus on convex optimization problems. The optimization problem

\[
\begin{aligned}
\underset{x \in \mathbb{X}}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & \tilde{h}_i(x) \leq 0 \,, \quad i = 1, \dots, I \,, \\
& \bar{h}_j(x) = 0 \,, \quad j = 1, \dots, J \,,
\end{aligned}
\]

is a convex optimization problem if the cost function f and all the functions that define the inequalities, i.e., h̃_i, are convex functions, and all the functions that define the equalities, i.e., h̄_j, are linear functions of the decision variable x. We start with the definitions of a convex set and a convex function.

Definition 1 (Convex Set). A set C is convex if the line segment between any two points x and y in the set lies in the set, i.e., for any x, y ∈ C and any α ∈ (0, 1),

\[
\alpha x + (1 - \alpha) y \in C \,.
\]

Definition 2 (Convex Function). A function f: ℝ^d → ℝ is convex if dom f is convex, and

\[
f\left( \alpha x + (1 - \alpha) y \right) \leq \alpha f(x) + (1 - \alpha) f(y)
\]

is satisfied for any x, y ∈ dom f and any α ∈ (0, 1). In other words, f is convex if the chord from x to y lies above the graph of f. Furthermore, f is strictly convex if the above inequality holds strictly whenever x ≠ y.

Definition 3 (Strongly Convex Function). A continuously differentiable function f: ℝ^d → ℝ is μ-strongly convex with respect to the norm ‖·‖ if f is convex, and there exists a constant μ > 0 such that

\[
f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2
\]

holds for any x, y ∈ dom f.

Definition 4 (Smoothness). A function f: ℝ^d → ℝ is called L-smooth on dom f if ∇f is Lipschitz-continuous with Lipschitz constant L, defined as

\[
\|\nabla f(x) - \nabla f(y)\|_* \leq L \|x - y\|
\]

for any x, y ∈ dom f. The gradient of f is called block coordinate-wise Lipschitz continuous on dom f if there exist constants l₁, …, l_B such that

\[
\left\| \nabla^{[b]} f\left(x + U_{[b]} \Delta x^{[b]}\right) - \nabla^{[b]} f(x) \right\|_* \leq l_b \left\| \Delta x^{[b]} \right\|
\]

holds for all x ∈ dom f, b = 1, …, B, and Δx^[b] ∈ ℝ^(d_b) such that x + U_[b] Δx^[b] ∈ dom f.

Remark 2.1. When ∇f is block coordinate-wise Lipschitz continuous, we have

\[
f\left(x + U_{[b]} \Delta x^{[b]}\right) \leq f(x) + \left\langle \nabla^{[b]} f(x), \Delta x^{[b]} \right\rangle + \frac{l_b}{2} \left\| \Delta x^{[b]} \right\|^2 \tag{2.1}
\]

for all x ∈ dom f and all Δx^[b] ∈ ℝ^(d_b) such that x + U_[b] Δx^[b] ∈ dom f [24].

To make the thesis self-contained, we provide below some useful inequalities related to convex functions. The inequalities are all borrowed from [25, Chapter 2]. Interested readers should consult the book for the proofs. Moreover, other books such as [26–28] provide a thorough treatment of convex optimization problems and algorithms used therein.

Note 2.2 (Useful Inequalities). From convex analysis, we know that the following relations hold for an L-smooth, (possibly) μ-strongly convex function f: ℝ^d → ℝ:

\[
\begin{aligned}
\frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_*^2 &\leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle \leq \frac{L}{2} \|x - y\|^2 \,, \\
\frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_*^2 &\leq \langle \nabla f(x) - \nabla f(y), x - y \rangle \leq L \|x - y\|^2 \,, \\
\frac{\mu}{2} \|y - x\|^2 &\leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle \leq \frac{1}{2\mu} \|\nabla f(x) - \nabla f(y)\|_*^2 \,, \\
\mu \|y - x\|^2 &\leq \langle \nabla f(x) - \nabla f(y), x - y \rangle \leq \frac{1}{\mu} \|\nabla f(x) - \nabla f(y)\|_*^2 \,, \\
\frac{\mu L}{\mu + L} \|x - y\|^2 + \frac{1}{\mu + L} \|\nabla f(x) - \nabla f(y)\|_*^2 &\leq \langle \nabla f(x) - \nabla f(y), x - y \rangle \,,
\end{aligned}
\]

for any x, y ∈ dom f and 0 < μ ≤ L.

The following definition introduces subgradients of proper convex functions.

Definition 5 (Subgradient). For a convex function h: ℝ^d → ℝ ∪ {+∞}, a vector s_x is called a subgradient of h at x if

\[
h(y) \geq h(x) + \langle s_x, y - x \rangle
\]

holds for all y. The set of all subgradients of h at x is called the subdifferential of h at x, and is denoted by ∂h(x).

Remark 2.3. If ∂h(x) is non-empty for any x ∈ dom h, then h is a convex function [25]. In other words, subdifferentiability of a function implies convexity.

Convex functions are an important class of functions in that they are also utilized to define a generalized notion of distance between two points. The main motivation to use generalized distance functions, instead of the usual squared Euclidean distance, is to design optimization algorithms that can take advantage of the geometry of the feasible set (see, e.g., [29–32]).

Definition 6 (Generalized Distance Function). Every μ_ω-strongly convex function ω: ℝ^d → ℝ induces a generalized distance function defined as

\[
D_\omega(x, y) := \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y \rangle \,.
\]

Remark 2.4. Strong convexity of the distance generating function ω always ensures that

\[
D_\omega(x, y) \geq \frac{\mu_\omega}{2} \|x - y\|^2
\]

for all x, y ∈ dom ω, and D_ω(x, y) = 0 if and only if x = y.

Remark 2.5. Throughout the thesis, there is no loss of generality in assuming that μ_ω = 1. Indeed, if μ_ω ≠ 1, we can choose the scaled function ω̄(x) = (1/μ_ω) ω(x), which has modulus μ_ω̄ = 1, to generate the generalized distance function.

Note 2.6. In the thesis, we assume that the generalized distance function is induced by a strongly convex function ω. However, it is enough for ω to be a Bregman function, instead, to induce a generalized distance function that satisfies D_ω(x, y) ≥ 0 [8, Chapter 2].

Below, we provide three well-known examples of generalized distance functions.

Example 2.1 (Squared Euclidean Distance). Choosing the distance generating function as the squared Euclidean norm, i.e., ω(x) = ½‖x‖₂², which is 1-strongly convex with respect to the ℓ₂-norm over any convex set C, would result in D_ω(x, y) = ½‖x − y‖₂², i.e., the squared Euclidean distance.

Example 2.2 (Squared Mahalanobis Distance). The distance generating function ω(x) = ½⟨x, Qx⟩ for some positive definite matrix Q is λ_min-strongly convex with respect to the ℓ₂-norm over any convex set C, where λ_min is the minimum eigenvalue of Q. The resulting generalized distance function, D_ω(x, y) = ½⟨x − y, Q(x − y)⟩, which is a generalization of the squared Euclidean distance, is called the squared Mahalanobis distance.

Example 2.3 (Kullback-Leibler Divergence). Another common example of distance generating functions is the negative entropy function

\[
\omega(x) = \sum_{i=1}^{d} x^{(i)} \log x^{(i)} \,,
\]

which is 1-strongly convex with respect to the ℓ₁-norm over the probability simplex

\[
\mathbb{P} := \left\{ x \in \mathbb{R}^d : \sum_{i=1}^{d} x^{(i)} = 1 \,, \; x \geq 0 \right\} \,.
\]

Its associated generalized distance function is

\[
D_\omega(x, y) = \sum_{i=1}^{d} x^{(i)} \log \frac{x^{(i)}}{y^{(i)}} \,,
\]

which is called the KL divergence function.
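A quick numerical illustration of these three distances is given below: the generic Bregman construction D_ω(x, y) = ω(x) − ω(y) − ⟨∇ω(y), x − y⟩ is evaluated for the three generating functions, with arbitrary test vectors.

```python
# Bregman distances induced by the generating functions of Examples 2.1-2.3.
import numpy as np

def bregman(omega, grad_omega, x, y):
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])

# Example 2.1: squared Euclidean distance
euclid = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)

# Example 2.2: squared Mahalanobis distance, with an arbitrary positive definite Q
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
mahal = bregman(lambda z: 0.5 * z @ Q @ z, lambda z: Q @ z, x, y)

# Example 2.3: KL divergence on the probability simplex
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
kl = bregman(lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1.0, p, q)

print(euclid, mahal, kl)          # 1.0, 2.0, and the KL divergence
print(np.sum(p * np.log(p / q)))  # matches kl
```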

Using their definition, it is easy to verify the following important property of generalized distance functions.

Definition 7 (Four-Point Identity [33, Lemma 3.1]). Generalized distance functions possess the following identity

\[
D_\omega(x, y) - D_\omega(x, v) - D_\omega(z, y) + D_\omega(z, v) = \langle \nabla \omega(v) - \nabla \omega(y), x - z \rangle
\]

for any four points v, x, y, z ∈ dom ω. Choosing z = y results in the so-called three-point identity:

\[
D_\omega(x, y) - D_\omega(x, v) + D_\omega(y, v) = \langle \nabla \omega(v) - \nabla \omega(y), x - y \rangle \,.
\]

When the distance generating function is chosen as ω(x) = ½‖x‖₂², the two identities take the form

\[
\begin{aligned}
\frac{1}{2} \|x - y\|_2^2 - \frac{1}{2} \|x - v\|_2^2 - \frac{1}{2} \|z - y\|_2^2 + \frac{1}{2} \|z - v\|_2^2 &= \langle v - y, x - z \rangle \,, \\
\frac{1}{2} \|x - y\|_2^2 - \frac{1}{2} \|x - v\|_2^2 + \frac{1}{2} \|y - v\|_2^2 &= \langle v - y, x - y \rangle \,,
\end{aligned}
\]

for any v, x, y, z ∈ ℝ^d.

CHAPTER 3

SHARED MEMORY ALGORITHMS

Machine learning problems are constantly increasing in size. More and more often, we have access to more data than can be conveniently handled on a single computing node, and we would like to consider more features than are efficiently dealt with using traditional optimization techniques. This has caused a strong recent interest in designing computation frameworks and developing optimization algorithms in these frameworks that can deal with problems of truly huge scale.

A popular approach for dealing with huge decision vectors is to use coordinate descent methods, where only a subset of the decision variables are updated in each iteration. A recent breakthrough in the analysis of coordinate descent methods was obtained by Nesterov [24], who established global non-asymptotic convergence rates for a class of randomized coordinate descent methods for convex minimization. Nesterov’s results have since been extended in several directions, for both randomized and deterministic update orders (see, e.g., [34–37]).

In this chapter, we propose and analyze a family of algorithms to solve a global loss minimization problem over a (possibly huge and not necessarily sparse) decision vector using an incremental, prox-gradient descent method with delayed information. Although the algorithms covered in this chapter are best suited to shared memory architectures, they might benefit as well from parallelization in the distributed setting.

The chapter is structured as follows. In the next section, we first formulate the problem and give related prior work. In Section 3.2, we motivate the proposed family of algorithms from some basic observations, and then give two different implementations of the algorithms with their convergence properties. Later, in Section 3.3, we give a numerical example to verify our theoretical findings. We finish the chapter by providing the proofs in Section 3.4.


3.1 Problem Formulation

To deal with the abundance of data, it is customary to distribute the data on multiple (say N) computing nodes. More precisely, we consider the following unconstrained, smooth global loss minimization problem:

\[
\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x) := \frac{1}{N} \sum_{n=1}^{N} f_n(x) \tag{3.1}
\]

as an average of N losses. We impose the following set of basic assumptions on the problem:

Assumption 3.1 (Existence of a Minimum). The optimal set 𝕏^⋆, defined as

\[
\mathbb{X}^\star := \left\{ x^\star : f^\star = f\left(x^\star\right) \leq f(x) \,, \; \forall x \right\} \,,
\]

is nonempty.

Assumption 3.2 (Smoothness). Each f_n: ℝ^d → ℝ, for n = 1, …, N, is an L_n-smooth convex function on ℝ^d.

Note 3.1. Assumption 3.2 guarantees that f is L-smooth with a constant L ≤ L̄, where

\[
\bar{L} = \max_{1 \leq n \leq N} L_n \,.
\]

Assumption 3.3 (Strong Convexity). The overall objective function f: ℝ^d → ℝ is μ-strongly convex.

Note 3.2. Strong convexity of the overall objective function implies that 𝕏^⋆ is a singleton.

The understanding here is that node n maintains the data necessary to evaluate f_n(x) and to estimate ∇f_n(x). Even if a single computing node maintains the current iterate of the decision vector and orchestrates the gradient evaluations, there will be an inherent delay in querying the other nodes. Moreover, when the communication latency or the work load on the nodes change, so will the query delay. It is therefore important that techniques developed for this setup can handle time-varying delays [18, 38, 39]. Another challenge in this formulation is to balance the delay penalty of synchronizing with all nodes to collect gradients (to compute the gradient of the total loss) and the bias that occurs when the decision vector is updated incrementally (albeit at a faster rate) as soon as new partial gradient information from a node arrives (cf. [39]).

Several authors have proposed solutions that avoid synchronization among different computing nodes. In [40], an asynchronous incremental sub-gradient method has been studied in which gradient steps are taken using out-of-date gradients. Niu et al. have studied a lock-free approach to parallelizing the stochastic gradient descent method [39]. Their code, called HOGWILD!, uses atomic operations to avoid locking of loosely coupled memory locations in the minimization problem of sparse, separable loss functions, and has achieved linear speedup in the number of processors. Following a similar sparsity and separability assumption, Fercoq and Richtárik have proposed an accelerated, parallel and prox-coordinate descent method to better utilize the available processors and achieve even further speedups [36].

Contrary to related algorithms in the literature, we are able to establish a linear rate of convergence for minimization of strongly convex functions with Lipschitz-continuous gradients without any additional assumptions on boundedness of the gradients (e.g., [38, 39]). We believe that this is an important contribution, since many of the most basic machine learning problems (such as least-squares estimation) do not satisfy the assumption of bounded gradients used in earlier works.

Our algorithm is shown to converge for all upper bounds on the time-varying information delay that occurs when querying the individual nodes, and explicit expressions for the convergence rate are established. While the convergence rate depends on the maximum delay bound, the constant step-size of the algorithm does not. Similar to the related algorithms in the literature, our algorithm does not converge to the optimum unless additional assumptions are imposed on the individual loss functions (e.g., that gradients of individual loss functions all vanish at the optimum, cf. [41]). We also derive an explicit bound on the asymptotic error that reveals the trade-off between convergence speed and residual error. Extensive simulations show that the bounds are reasonably tight, and highlight the strengths of our method and its analysis compared to alternatives from the literature.

3.2 Main Result

The structure of our delayed prox-gradient descent method is motivated by some basic observations about the delay sensitivity of different alternative implementations of delayed gradient iterations. We first summarize these observations in this section.

Observations on Delayed Gradient Iterations

The most basic technique for minimizing a differentiable convex function f: ℝ^d → ℝ is to use the gradient iterations

\[
x_{k+1} = x_k - \gamma \nabla f\left(x_k\right) \,.
\]

If f is L-smooth, then these iterations converge to the optimum if the positive step-size γ is smaller than 2/L. If f is also μ-strongly convex, then the convergence rate is linear, and the optimal step-size is γ^⋆ = 2/(μ + L) [25].

In the asynchronous optimization setting that we are interested in, the gradient computations will be made available to the master node with an information delay. The corresponding gradient iterations then take the form

\[
x_{k+1} = x_k - \gamma \nabla f\left(x_{k - \tau_k}\right) \,, \tag{3.2}
\]

where τ_k ∈ ℕ₀ is the query delay. Iterations that combine current and delayed states are often hard to analyze, and are known to be sensitive to the time-delay. As an alternative, one could consider updating the iterates based on a delayed prox-step, i.e., based on the difference between the delayed state and the (scaled) gradient evaluated at this delayed state:

\[
x_{k+1} = x_{k - \tau_k} - \gamma \nabla f\left(x_{k - \tau_k}\right) \,. \tag{3.3}
\]

= − ∇ f xk+1 xk−k xk−k . (3.3)   One advantage of iterations (3.3) over (3.2) is that they are easier to analyze. Indeed, while we are unaware of any theoretical results that guarantee linear convergence rate of the delayed gradient iteration (3.2) under time-varying delays, we can give the following guarantees for the iteration (3.3): R R Proposition 3.1. Assume that f ∶ d → is L-smooth and -strongly convex. If 0 ≤  ≤ N k ̄ for all k ∈ 0, then the sequence of vectors generated by Iterations (3.3) with the optimal step-size ⋆ = 2∕( + L) satisfies

k ⋆  − 1 1+ ̄ N x − x ≤ , k ∈ 0 , k  + 1   where is the conditionô numberô of the problem.  = L∕ ô ô Note 3.3. The optimal step-size is independent of the delays, while the convergence rate depends on the upper bound on the time-varying delays. We will make similar observations for the delayed prox-gradient descent methods described and analyzed in this section.

Another advantage of the iterations (3.3) over (3.2) is that they tend to give a faster convergence rate. The following simple example illustrates this point.

Example 3.1. Consider minimizing the quadratic function

\[
f(x) = \frac{1}{2} \left( \mu \left(x^{(1)}\right)^2 + L \left(x^{(2)}\right)^2 \right) = \frac{1}{2} \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}^\top \begin{bmatrix} \mu & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix} \,,
\]

and assume that the gradients are computed with a fixed one-step delay, i.e., τ_k = 1 for all k ∈ ℕ₀. The corresponding iterations (3.2) and (3.3) can then be rewritten as linear iterations in terms of the augmented state vector

\[
\hat{x}_k = \begin{bmatrix} x_k^{(1)} & x_k^{(2)} & x_{k-1}^{(1)} & x_{k-1}^{(2)} \end{bmatrix}^\top \,,
\]


Figure 3.1: Comparison of the convergence factor  of the iterations (3.2) and (3.3) for different values of the condition number  ∈ [1, 10]. and studied by using the eigenvalues of the corresponding four-by-four matrices. Doing so, we find that

k ̂xk ≤  ̂x0 ,

ô ô ô ô (2−1) where  = ∕( + 1) for the delayedô gradientô ô iterationsô (3.2), while  = for the √(+1) delayed prox-step iterations (3.3). Clearly, the latter iterations have a smaller convergence factor, and hence, converge faster than the former, see Figure 3.1.
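To make the comparison concrete, the following Python snippet (an illustrative sketch, not part of the original experiments) builds the augmented matrix of the delayed prox-step iteration (3.3) for μ = 1 and L = 10 and checks numerically that its spectral radius matches the convergence factor √(κ² − 1)/(κ + 1) reported above.

import numpy as np

# Numerical check of the convergence factor of iteration (3.3) in Example 3.1.
mu, L = 1.0, 10.0
kappa = L / mu
gamma = 2.0 / (mu + L)                 # optimal step-size gamma* = 2/(mu + L)
Q = np.diag([mu, L])                   # Hessian of the quadratic in Example 3.1
I2, Z2 = np.eye(2), np.zeros((2, 2))

# One-step-delayed prox-step dynamics: [x_{k+1}; x_k] = M [x_k; x_{k-1}]
M = np.block([[Z2, I2 - gamma * Q], [I2, Z2]])
rho_numeric = max(abs(np.linalg.eigvals(M)))
rho_formula = np.sqrt(kappa ** 2 - 1) / (kappa + 1)
print(rho_numeric, rho_formula)        # both evaluate to roughly 0.9045

The factor κ/(κ + 1) reported for the delayed gradient iteration (3.2) can be checked in the same way, with the step-size tuned for that iteration.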

The combination of a more tractable analysis and potentially faster convergence rate leads us to develop an asynchronous optimization algorithm based on these iterations.

Prox-Gradient Descent Method

Leveraging the intuition developed for delayed gradient iterations, we develop an optimization algorithm for an objective function f of the form (3.1) under the assumption that N is large. In this case, it is natural to use a randomized incremental gradient method that operates on a single component f_n at each iteration, rather than on the entire objective function. We assume that Assumptions 3.1–3.3 hold, and we further impose the following assumption:

Assumption 3.4 (Bounded Delay). The information delay is bounded, i.e., τ_k ≤ τ̄ for all k ∈ N_0.

Under these assumptions, we can state our first main result:

Theorem 3.2. Suppose that Assumptions 3.1–3.4 hold, and that the step-size γ in Algorithm 3.1 satisfies

γ ∈ (0, μ/L̄²).

Then the sequence {x_k} generated by Algorithm 3.1 satisfies

E[f(x_k)] − f* ≤ ρ^k (f(x_0) − f*) + ε,

where f* is the optimal value of Problem (3.1),

ρ = (1 − 2γθμ(1 − γL̄²/μ))^(1/(1+τ̄)),

and

ε = (γL̄/(2N(μ − γL̄²))) Σ_{n=1}^N ‖∇f_n(x*)‖²_*.

Proof. See Section 3.4. ■

Theorem 3.2 shows that, even with a constant step-size that can be chosen independently of the maximum delay bound τ̄ and the number of objective function components N, the iterates generated by Algorithm 3.1 converge linearly to within some ball around the optimum. Note the inherent trade-off between ρ and ε: a smaller step-size γ yields a smaller residual error ε but also a larger convergence factor ρ. Algorithm 3.1 is closely related to HOGWILD! [39], which uses randomized delayed gradients as follows:

x_{k+1} = x_k − γ ∇f_{n_k}(x_{k−τ_k}).

As discussed in the previous section, iterates combining the current state and a delayed gradient are quite difficult to analyze. Thus, while HOGWILD! can also be shown to converge linearly to an ε-neighborhood of the optimal value, the convergence proof requires that the gradients are bounded, and the step-size that guarantees convergence depends on τ̄ and N, as well as the maximum bound on ‖∇f(x)‖_*.

Algorithm 3.1: Prox-Gradient Descent I

Input: The total number of functions, N; Lipschitz constant, L̄; strong-convexity parameter, μ; averaging parameter, θ ∈ (0, 1]; step-size, γ; and the maximum iteration count, K.
Output: The final decision vector, x_K.
Data: Individual loss functions, {f_n}.
Initialization: Initial decision vector, x_0 ∈ R^d; initial iteration count, k = 0.
while k < K do
  Pick n_k uniformly at random from {1, 2, …, N}, i.e., P_{n_k}[n] = 1/N,
  Calculate the intermediate variable
    x_{k+1/2} ← x_{k−τ_k} − γ ∇f_{n_k}(x_{k−τ_k}),
  Update the decision vector
    x_{k+1} ← (1 − θ) x_k + θ x_{k+1/2},
  Increment the iteration counter, k ← k + 1.
return x_K
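A compact, single-process Python sketch of Algorithm 3.1 is given below. It is only meant to illustrate the structure of the iteration: the delays are emulated by reading stale iterates from a history buffer, and all names (prox_gradient_descent_1, grads, tau_max) are hypothetical.

import numpy as np
import random

def prox_gradient_descent_1(grads, x0, gamma, theta, K, tau_max, rng=None):
    """Simulated, single-process sketch of Algorithm 3.1 (illustrative only).

    grads   : list of callables, grads[n](x) returns the gradient of f_n at x
    gamma   : constant step-size; theta: averaging parameter in (0, 1]
    tau_max : maximum information delay; delays are drawn uniformly at random
    """
    rng = rng or random.Random(0)
    N = len(grads)
    history = [np.array(x0, dtype=float)]        # past iterates, so that stale
    for k in range(K):                           # reads can be emulated
        tau_k = rng.randint(0, min(tau_max, k))  # artificial delay, as in Section 3.3
        x_stale = history[k - tau_k]
        n_k = rng.randrange(N)                   # uniform sampling of a component
        x_half = x_stale - gamma * grads[n_k](x_stale)   # intermediate prox-step
        history.append((1 - theta) * history[k] + theta * x_half)
    return history[-1]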

Prox-Coordinate Descent Method

In Algorithm 3.1, we have assumed that only the knowledge of L̄ is available, not that of the individual Lipschitz constants L_n of the loss functions. Hence, it is natural to sample the loss functions uniformly at random.

Now, we analyze the setting where all the Ln values are available to the computing node. In addition to Assumptions 3.1–3.4, we assume that Problem (3.1) satisfies the following assumption:

Assumption 3.5 (Block-Coordinate Smoothness). The gradient of f is block coordinate-wise Lipschitz continuous with constants l1,..., lB.

We consider a setup similar to the Parameter Server: a master node, which maintains and updates the current iterate of the decision vector x, and N worker nodes, each of which can compute the gradient of one of the component functions f_n, all in a shared memory architecture. The master and the workers follow the procedure in Algorithm 3.2. At every iteration, the master picks a worker n ∈ {1, …, N} at random and informs the worker about the current iterate x̂. Then, the worker chooses a block b ∈ {1, …, B} at random and evaluates a partial gradient mapping

x̂ − (γ/(L_n l_b)) U_[b] ∇^[b] f_n(x̂).

In other words, worker n only updates block b of x̂ by taking a step of length γ/(L_n l_b) in the direction −∇^[b] f_n(x̂). When the worker completes the computation at some later time k, it returns the partial gradient mapping to the master (with delay), which averages the update according to

x_{k+1} = (1 − θ) x_k + θ (x̂ − (γ/(L_n l_b)) U_[b] ∇^[b] f_n(x̂))

and passes the updated x_{k+1} back to the worker.
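The worker and master computations described above can be summarized by the following Python sketch (illustrative only; the helper names blocks, grad_block, and master_update are hypothetical).

import numpy as np

def partial_gradient_mapping(x_hat, n, b, blocks, grad_block, L, l, gamma):
    """Sketch of the worker computation in Algorithm 3.2 (illustrative only).

    blocks[b]          : index array selecting block b of the decision vector
    grad_block(n, b, x): partial gradient of f_n with respect to block b at x
    L[n], l[b]         : component and block-wise Lipschitz constants
    """
    z = x_hat.copy()
    idx = blocks[b]
    # step of length gamma/(L[n]*l[b]) along -grad^[b] f_n(x_hat), only in block b
    z[idx] = x_hat[idx] - (gamma / (L[n] * l[b])) * grad_block(n, b, x_hat)
    return z

def master_update(x_k, mapping, theta):
    """Averaging step at the master: x_{k+1} = (1 - theta)*x_k + theta*mapping."""
    return (1.0 - theta) * x_k + theta * mapping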

The next theorem shows that, under Assumptions 3.1–3.5, the expected value of f(x_k) converges linearly to a ball around the optimum.

Theorem 3.3. Assume that Assumptions 3.1–3.5 hold, and that the step-size γ in Algorithm 3.2 satisfies

γ ∈ (0, 2(κ + 1)μ/(κ + 2)).

Then, the sequence {x_k} generated by Algorithm 3.2 satisfies

E[f(x_k)] − f* ≤ ρ^k (f(x_0) − f*) + ε,

Algorithm 3.2: Prox-Coordinate Descent

Input: Lipschitz constants, L_1, …, L_N and l_1, …, l_B; strong-convexity parameter, μ; averaging parameter, θ ∈ (0, 1]; step-size, γ; and the maximum iteration count, K.
Output: The final decision vector, x_K.
Data: Individual loss functions, {f_n}.
Initialization: Initial decision vector, x_0 ∈ R^d; initial iteration count, k = 0.
while k < K do
  Pick n_k ∈ {1, …, N} with probability P_{n_k}[n] = L_n / Σ_{n=1}^N L_n,
  Pick b_k ∈ {1, …, B} with probability P_{b_k}[b] = l_b / Σ_{b=1}^B l_b,
  Update the intermediate variable
    x_{k+1/2} ← x_{k−τ_k} − (γ/(L_{n_k} l_{b_k})) U_[b_k] ∇^[b_k] f_{n_k}(x_{k−τ_k}),
  Update the decision vector
    x_{k+1} ← (1 − θ) x_k + θ x_{k+1/2},
  Increment the iteration counter, k ← k + 1.
return x_K

In Theorem 3.3, f* is the optimal value of Problem (3.1),

ρ = (1 − θγ (2μ(κ + 1) − γ(κ + 2)) / ((κ + 1) (1/N) Σ_{n=1}^N L_n Σ_{b=1}^B l_b))^(1/(1+τ̄)),   (3.4)

and

ε = (γ(κ + 1)(κ + 2) / (2μ(κ + 1) − γ(κ + 2))) · (1/(2N)) Σ_{n=1}^N (1/L_n) ‖∇f_n(x*)‖²_*.

Proof. See Section 3.4. ■

Theorem 3.3 establishes that the proposed algorithm has delay-insensitive convergence, in the sense that, as long as γ ∈ (0, 2(κ + 1)μ/(κ + 2)), the iterates will converge in expectation to a ball around the optimum regardless of how large τ̄ is. However, τ̄ does affect the convergence factor ρ, and hence the time it takes for the iterates to converge. Specifically, ρ is monotonically increasing in τ̄, and approaches one as τ̄ tends to infinity. Therefore, while the algorithm converges linearly to a ball around the optimal value for arbitrary bounded time-varying delays, the convergence speed deteriorates with increasing delays.

We also note that choosing γ involves a trade-off between accuracy and convergence speed: to decrease the residual error ε, one needs to decrease γ; but this increases ρ and yields slower convergence. The averaging parameter θ is subject to a similar trade-off.

Comparison to Asynchronous Incremental Gradient Descent

Randomized coordinate descent has been shown to be competitive with the classical gradient descent method, in the sense that it requires less work per iteration, but a comparable number of iterations to converge [24].

Now, we demonstrate that a similar property holds for the prox-coordinate descent method: if the amount of work required to evaluate a partial gradient is proportional to its block size, then the prox-coordinate descent method can always be expected to be more efficient than the corresponding incremental gradient descent algorithm. We establish this property by analyzing the prox-gradient method in Algorithm 3.3, obtained by restricting Algorithm 3.2 to use B = 1, and comparing the total work that each method requires to guarantee a given target error. Our comparison is based on the following result.

Theorem 3.4. If γ ∈ (0, 2(κ + 1)μ/(κ + 2)), then the sequence {x_k} generated by Algorithm 3.3 satisfies

E[f(x_k)] − f* ≤ ρ^k (f(x_0) − f*) + ε,   (3.5)

where

ρ = (1 − θγ (2μ(κ + 1) − γ(κ + 2)) / ((κ + 1) L (1/N) Σ_{n=1}^N L_n))^(1/(1+τ̄)),   (3.6)

and

ε = (γ(κ + 1)(κ + 2) / (2μ(κ + 1) − γ(κ + 2))) · (1/(2N)) Σ_{n=1}^N (1/L_n) ‖∇f_n(x*)‖²_*.   (3.7)

Proof. See Section 3.4. ■

Comparing Theorems 3.3 and 3.4, we see that, for any fixed γ ∈ (0, 2(κ + 1)μ/(κ + 2)) and θ ∈ (0, 1], the algorithms have the same bound on ε. Thus, partitioning the decision variables into several blocks does not affect the guaranteed residual error.

However, the two methods have different convergence factors, and will therefore require a different number of iterations to reach a guaranteed target error. To compare the two methods, we assume that the work required to evaluate a partial gradient is proportional to its block size, and we count every B iterations of Algorithm 3.2 as one iteration of Algorithm 3.3. After B iterations of Algorithm 3.2, by (3.4),

E[f(x_k)] − f* ≤ (1 − B θγ (2μ(κ + 1) − γ(κ + 2)) / ((τ̄ + 1)(κ + 1) (1/N) Σ_{n=1}^N L_n Σ_{b=1}^B l_b)) (f(x_0) − f*) + ε,

and after one iteration of Algorithm 3.3, from (3.6), we obtain

E[f(x_k)] − f* ≤ (1 − θγ (2μ(κ + 1) − γ(κ + 2)) / ((τ̄ + 1)(κ + 1) L (1/N) Σ_{n=1}^N L_n)) (f(x_0) − f*) + ε.

We can see that if

B / Σ_{b=1}^B l_b ≥ 1/L,   (3.8)

then the difference, in expectation, between f* and the function values f(x_k) generated by Algorithm 3.2 is smaller than that generated by Algorithm 3.3. Since l_b ≤ L for each b ∈ {1, …, B}, (3.8) always holds for all B ∈ N, which implies that for any fixed γ ∈ (0, 2(κ + 1)μ/(κ + 2)) and any fixed θ ∈ (0, 1], Algorithm 3.2 is always more efficient than Algorithm 3.3 (see Figure 3.2).

Algorithm 3.3: Prox-Gradient Descent II

Input: Lipschitz constants, L_1, …, L_N and L; strong-convexity parameter, μ; averaging parameter, θ ∈ (0, 1]; step-size, γ; and the maximum iteration count, K.
Output: The final decision vector, x_K.
Data: Individual loss functions, {f_n}.
Initialization: Initial decision vector, x_0 ∈ R^d; initial iteration count, k = 0.
while k < K do
  Pick n_k ∈ {1, …, N} with probability P_{n_k}[n] = L_n / Σ_{n=1}^N L_n,
  Update the intermediate variable
    x_{k+1/2} ← x_{k−τ_k} − (γ/(L_{n_k} L)) ∇f_{n_k}(x_{k−τ_k}),
  Update the decision vector
    x_{k+1} ← (1 − θ) x_k + θ x_{k+1/2},
  Increment the iteration counter, k ← k + 1.
return x_K

Figure 3.2: Convergence of Algorithms 3.2 and 3.3 with respect to the total number of operations required. In the example, the targeted error is 10% of the initial error. Clearly, Algorithm 3.3 is computationally more intensive in achieving the same targeted error.

Remark 3.4. Algorithms 3.1 and 3.3, and their analyses, differ from each other in the knowledge of the Lipschitz constants L_n available to the computing node. Algorithm 3.1 would have a convergence factor of the same form as (3.6), but with (1/N) Σ_{n=1}^N L_n replaced by L̄, where

L̄ = max_{1 ≤ n ≤ N} L_n.

Since (1/N) Σ_{n=1}^N L_n ≤ L̄ [42], it follows that the guaranteed upper bound in Theorem 3.4 improves upon the one in Theorem 3.2, especially for applications where the component functions vary substantially in smoothness.

3.3 Numerical Example

To evaluate the performance of Algorithms 3.1 and 3.2, we have focused on unconstrained quadratic programming (QP) problems, since they are frequently encountered in machine learning applications. We are thus interested in solving optimization problems of the form

minimize_{x ∈ R^d}  f(x) := (1/N) Σ_{n=1}^N ((1/2) x^⊤ Q_n x + q_n^⊤ x).

Figure 3.3: Convergence for different values of the maximum delay bound τ̄ and for a fixed step-size γ. The solid curves represent the theoretical upper bounds on the expected error, whereas the dashed curves represent the averaged experimental results for τ̄ = 1 (red) and τ̄ = 7 (blue).

We have chosen to use randomly generated instances, where the matrices Q_n and the vectors q_n are generated as explained in [43]. We have considered a scenario with N = 20 machines, each with a loss function defined by a randomly generated positive definite matrix Q_n ∈ R^{20×20}, whose condition numbers are linearly spaced in the range [1, 10] for Algorithm 3.1 and [1, 5] for Algorithm 3.2, and a random vector q_n ∈ R^{20}. We have set the block count for Algorithm 3.2 to B = 10. Since we have constructed this numerical example on our local computer, we have artificially introduced random time delays τ_k ∈ [0, τ̄] into our simulation code. We have simulated the algorithms with different τ̄ and γ values 1000 times, and we present the expected error versus the iteration count.
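For readers who wish to reproduce the flavour of this experiment, the snippet below generates QP components with prescribed condition numbers. It does not reproduce the exact generation procedure of [43]; it is an assumed, simplified stand-in.

import numpy as np

def random_qp_component(d=20, cond=10.0, rng=np.random.default_rng(0)):
    """Generate one QP component 0.5*x'Qx + q'x (illustrative sketch only).

    Builds a positive definite Q with a prescribed condition number via a
    random orthogonal basis, which is sufficient to mimic the test setup.
    """
    A = rng.standard_normal((d, d))
    U, _ = np.linalg.qr(A)                   # random orthogonal matrix
    eigs = np.linspace(1.0, cond, d)         # eigenvalues span [1, cond]
    Q = U @ np.diag(eigs) @ U.T
    q = rng.standard_normal(d)
    return Q, q

# N = 20 components with condition numbers linearly spaced in [1, 10]
components = [random_qp_component(cond=c) for c in np.linspace(1, 10, 20)]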

Prox-Gradient Descent Method

Figure 3.3 shows how Algorithm 3.1 converges to an ε-neighborhood of the optimal value, irrespective of the upper bound on the delays. The simulations, shown as dashed curves, confirm that the delays affect the convergence rate, but not the remaining error. The theoretical upper bounds derived in Theorem 3.2, shown as solid curves, are clearly valid.

As observed in Figure 3.4, there is a distinct trade-off in choosing γ: decreasing it reduces the remaining error at the expense of the convergence rate.

Figure 3.4: Convergence for two different choices of the step-size γ (0.0065561 and 0.065561). The dashed red curve represents the averaged experimental results for the larger step-size, whereas the solid blue curve corresponds to the smaller step-size.

We have also compared the performance of our method to that of HOGWILD!, with the parameters suggested by the theoretical analysis in [39]. To compute an upper bound on the gradients, required by the theoretical analysis in [39], we have assumed that the HOGWILD! iterates never exceed the initial value in norm. We have simulated the two methods for τ̄ = 7. Figure 3.5 shows that Algorithm 3.1 converges faster than HOGWILD! when the theoretically justified step-sizes are used. In the simulations, we have also noticed that the step-size for HOGWILD! could be increased (yielding faster convergence) on our quadratic test problems. However, for these step-sizes, the theory in [39] does not give any convergence guarantees.

Figure 3.5: Convergence of the two algorithms for τ̄ = 7 (step-sizes γ = 0.000897 and γ = 0.006556). The solid blue curve represents the averaged experimental results of HOGWILD!, whereas the dashed red curve represents that of our method. With theoretically justified step-sizes, our algorithm converges faster.

Figure 3.6: Convergence for different choices of the averaging parameter θ, when τ̄ = 1 and γ = 0.1. Solid curves represent the theoretical upper bounds on the expected error, whereas dashed curves represent the averaged experimental results for the averaging (θ = 0.2, dark blue) and non-averaging (θ = 1, dark red) cases. Light regions represent confidence intervals of one standard deviation.

Prox-Coordinate Descent Method

Figures 3.6 and 3.7 show how different choices of the parameters θ and γ in Algorithm 3.2 affect the convergence rate and the remaining error, whereas the effect of the magnitude of the time-varying delay is shown in Figure 3.8. As can be seen in Figure 3.6, an increase in θ results in faster convergence, but a larger residual error. A similar trade-off is also valid for γ, as shown in Figure 3.7. Finally, as the magnitude of the time-varying delay increases, the algorithm converges more slowly while the residual error remains unaffected (see Figure 3.8).

Although not so obvious due to the logarithmic y-axes in Figures 3.6–3.8, it is also worth noting that the variance of our sample iterates is sensitive to the averaging parameter θ. This is not surprising, as increasing (respectively, decreasing) θ results in an amplification (respectively, suppression) of the random, outdated information obtained from the workers.

Figure 3.7: Convergence for different choices of γ, when θ = 1 and τ̄ = 1. Solid curves represent the theoretical upper bounds on the expected error, whereas dashed curves represent the averaged experimental results for γ = 0.03 (dark blue) and γ = 0.1 (dark red). Light regions represent confidence intervals of one standard deviation.

Figure 3.8: Convergence for different values of the maximum time delay τ̄, when θ = 1 and γ = 0.1. Solid curves show the theoretical upper bounds on the expected error, while dashed curves represent averaged experimental results for τ̄ = 1 (dark blue) and τ̄ = 10 (dark red). Light regions represent confidence intervals of one standard deviation.

3.4 Proofs

In this section, we provide the proofs of Theorems 3.2–3.4. Before starting with the proofs, we state a key lemma that is instrumental in our argument. Lemma 3.5 allows us to quantify the convergence rates of discrete-time iterations with bounded time-varying delays:

Lemma 3.5. Let {V_k} be a sequence of real numbers satisfying

V_{k+1} ≤ a_1 V_k + a_2 max_{k−τ_k ≤ k̃ ≤ k} V_{k̃} + a_3,   k ∈ N_0,   (3.9)

for some nonnegative constants a_1, a_2, and a_3. If a_1 + a_2 < 1 and

0 ≤ τ_k ≤ τ̄,   k ∈ N_0,   (3.10)

then

V_k ≤ ρ^k V_0 + ε,   k ∈ N_0,   (3.11)

where ρ = (a_1 + a_2)^(1/(1+τ̄)) and ε = a_3/(1 − a_1 − a_2).

Proof of Lemma 3.5. Since a_1 + a_2 < 1, it holds that

1 ≤ (a_1 + a_2)^(−τ̄/(1+τ̄)),

which implies that

a_1 + a_2 ρ^(−τ̄) = a_1 + a_2 (a_1 + a_2)^(−τ̄/(1+τ̄))
              ≤ (a_1 + a_2)(a_1 + a_2)^(−τ̄/(1+τ̄))
              = (a_1 + a_2)^(1/(1+τ̄)) = ρ.   (3.12)

We now use induction to show that (3.11) holds for all k ∈ N_0. It is easy to verify that (3.11) is true for k = 0. Assume that the induction hypothesis holds for all k up to some k̄ ∈ N_0. Then,

V_{k̄} ≤ ρ^{k̄} V_0 + ε,
V_{k̃} ≤ ρ^{k̃} V_0 + ε,   k̃ = k̄ − τ_{k̄}, …, k̄.   (3.13)

From (3.9) and (3.13), we have

V_{k̄+1} ≤ a_1 (ρ^{k̄} V_0 + ε) + a_2 max_{k̄−τ_{k̄} ≤ k̃ ≤ k̄} (ρ^{k̃} V_0 + ε) + a_3
        ≤ a_1 ρ^{k̄} V_0 + a_2 ρ^{k̄−τ_{k̄}} V_0 + (a_1 + a_2) ε + a_3
        ≤ a_1 ρ^{k̄} V_0 + a_2 ρ^{k̄−τ̄} V_0 + (a_1 + a_2) ε + a_3
        = (a_1 + a_2 ρ^(−τ̄)) ρ^{k̄} V_0 + ε,

where we have used (3.10) and the fact that ρ ∈ [0, 1) to obtain the second and third inequalities, and ε = a_3/(1 − a_1 − a_2) to obtain the equality. It follows from (3.12) that

V_{k̄+1} ≤ ρ^{k̄+1} V_0 + ε,

which completes the induction proof. ■
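The lemma is also easy to check numerically. The following Python sketch (with arbitrarily chosen constants) simulates the delayed recursion (3.9) and verifies that the bound (3.11) holds along the trajectory.

import numpy as np

# Quick numerical sanity check of Lemma 3.5 (illustrative constants only).
rng = np.random.default_rng(0)
a1, a2, a3 = 0.5, 0.3, 0.1            # must satisfy a1 + a2 < 1
tau_bar, K = 3, 200
rho = (a1 + a2) ** (1.0 / (1 + tau_bar))
eps = a3 / (1 - a1 - a2)

V = [1.0]
for k in range(K):
    tau_k = rng.integers(0, tau_bar + 1)              # bounded time-varying delay
    V.append(a1 * V[k] + a2 * max(V[max(k - tau_k, 0):k + 1]) + a3)

bounds = [rho ** k * V[0] + eps for k in range(K + 1)]
print(all(v <= b + 1e-12 for v, b in zip(V, bounds)))  # expected output: True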

Now we are ready to prove Theorem 3.2.

Proof of Theorem 3.2. First, we analyze how the distance between f(x_k) and f* changes in each iteration. Since f is convex and θ ∈ (0, 1],

f(x_{k+1}) − f* = f((1 − θ)x_k + θx_{k+1/2}) − f*
             ≤ (1 − θ)f(x_k) + θf(x_{k+1/2}) − f*
             = (1 − θ)(f(x_k) − f*) + θ(f(x_{k+1/2}) − f*).   (3.14)

As f is L̄-smooth, it follows from Note 2.2 that

f(x_{k+1/2}) ≤ f(x_{k−τ_k}) + ⟨∇f(x_{k−τ_k}), x_{k+1/2} − x_{k−τ_k}⟩ + (L̄/2)‖x_{k+1/2} − x_{k−τ_k}‖².

Note that x_{k+1/2} = x_{k−τ_k} − γ∇f_{n_k}(x_{k−τ_k}). Thus,

f(x_{k+1/2}) ≤ f(x_{k−τ_k}) − γ⟨∇f(x_{k−τ_k}), ∇f_{n_k}(x_{k−τ_k})⟩ + (γ²L̄/2)‖∇f_{n_k}(x_{k−τ_k})‖²_*
           ≤ f(x_{k−τ_k}) − γ⟨∇f(x_{k−τ_k}), ∇f_{n_k}(x_{k−τ_k})⟩
             + γ²L̄‖∇f_{n_k}(x_{k−τ_k}) − ∇f_{n_k}(x*)‖²_* + γ²L̄‖∇f_{n_k}(x*)‖²_*,

where the second inequality holds since, for any vectors x and y and any norm ‖·‖, we have ‖x ± y‖² ≤ 2‖x‖² + 2‖y‖².

Each f_n, for n = 1, …, N, is convex and L_n-smooth. Therefore, according to Note 2.2, it holds that

‖∇f_{n_k}(x_{k−τ_k}) − ∇f_{n_k}(x*)‖²_* ≤ L_{n_k}⟨∇f_{n_k}(x_{k−τ_k}) − ∇f_{n_k}(x*), x_{k−τ_k} − x*⟩
                                    ≤ L̄⟨∇f_{n_k}(x_{k−τ_k}) − ∇f_{n_k}(x*), x_{k−τ_k} − x*⟩,

implying that

f(x_{k+1/2}) ≤ f(x_{k−τ_k}) − γ⟨∇f(x_{k−τ_k}), ∇f_{n_k}(x_{k−τ_k})⟩
             + γ²L̄²⟨∇f_{n_k}(x_{k−τ_k}) − ∇f_{n_k}(x*), x_{k−τ_k} − x*⟩ + γ²L̄‖∇f_{n_k}(x*)‖²_*.

Note that n_0, n_1, …, n_k are independent random variables. Moreover, x_k depends on n_0, n_1, …, n_{k−1}, but not on n_{k̄} for any k̄ ≥ k. Thus,

E_{n_k | k−1}[f(x_{k+1/2})] − f* ≤ f(x_{k−τ_k}) − f* − γ⟨∇f(x_{k−τ_k}), (1/N)Σ_{n=1}^N ∇f_n(x_{k−τ_k})⟩
                               + γ²L̄²⟨(1/N)Σ_{n=1}^N (∇f_n(x_{k−τ_k}) − ∇f_n(x*)), x_{k−τ_k} − x*⟩
                               + (γ²L̄/N)Σ_{n=1}^N ‖∇f_n(x*)‖²_*
                             = f(x_{k−τ_k}) − f* − γ‖∇f(x_{k−τ_k})‖²_*
                               + γ²L̄²⟨∇f(x_{k−τ_k}), x_{k−τ_k} − x*⟩ + (γ²L̄/N)Σ_{n=1}^N ‖∇f_n(x*)‖²_*.

As f is μ-strongly convex, it follows from Note 2.2 that

⟨∇f(x_{k−τ_k}), x_{k−τ_k} − x*⟩ ≤ (1/μ)‖∇f(x_{k−τ_k})‖²_*,

which implies that

E_{n_k | k−1}[f(x_{k+1/2})] − f* ≤ f(x_{k−τ_k}) − f* − γ(1 − γL̄²/μ)‖∇f(x_{k−τ_k})‖²_* + (γ²L̄/N)Σ_{n=1}^N ‖∇f_n(x*)‖²_*.

Moreover, according to [34, Theorem 3.2], it holds that

2μ(f(x_{k−τ_k}) − f*) ≤ ‖∇f(x_{k−τ_k})‖²_*.

Thus, if we take

γ ∈ (0, μ/L̄²),   (3.15)

we have

E_{n_k | k−1}[f(x_{k+1/2})] − f* ≤ (1 − 2γμ(1 − γL̄²/μ))(f(x_{k−τ_k}) − f*) + (γ²L̄/N)Σ_{n=1}^N ‖∇f_n(x*)‖²_*.   (3.16)

Define the sequence {V_k} as

V_k = E[f(x_k)] − f*,   k ∈ N_0.

Note that

V_{k+1} = E[f(x_{k+1})] − f* = E[E_{n_k | k−1}[f(x_{k+1})]] − f*.

Using this fact together with (3.14) and (3.16), we obtain

V_{k+1} ≤ a_1 V_k + a_2 V_{k−τ_k} + a_3,

with

a_1 = 1 − θ,   a_2 = θ(1 − 2γμ(1 − γL̄²/μ)),   a_3 = (θγ²L̄/N)Σ_{n=1}^N ‖∇f_n(x*)‖²_*.

One can verify that, for any γ satisfying (3.15),

a_1 + a_2 = 1 − 2γθμ(1 − γL̄²/μ) ∈ [1 − θμ²/(2L̄²), 1).

It now follows from Lemma 3.5 that

V_k ≤ ρ^k V_0 + ε,   k ∈ N_0,

where

ρ = (a_1 + a_2)^(1/(1+τ̄)) = (1 − 2γθμ(1 − γL̄²/μ))^(1/(1+τ̄)),

and

ε = a_3/(1 − a_1 − a_2) = (γL̄/(2N(μ − γL̄²)))Σ_{n=1}^N ‖∇f_n(x*)‖²_*.   ■

Following a similar approach, we prove Theorem 3.3.

Proof of Theorem 3.3. Since f is convex and θ ∈ (0, 1], we have

f(x_{k+1}) − f* = f((1 − θ)x_k + θx_{k+1/2}) − f*
             ≤ (1 − θ)f(x_k) + θf(x_{k+1/2}) − f*
             = (1 − θ)(f(x_k) − f*) + θ(f(x_{k+1/2}) − f*).   (3.17)

As the gradient of f is block coordinate-wise Lipschitz, it follows from (2.1) that

f(x_{k+1/2}) ≤ f(x_{k−τ_k}) − (γ/(L_{n_k} l_{b_k}))⟨∇^[b_k]f(x_{k−τ_k}), ∇^[b_k]f_{n_k}(x_{k−τ_k})⟩
             + (γ²/(2L_{n_k}² l_{b_k}))‖∇^[b_k]f_{n_k}(x_{k−τ_k})‖²_*
           ≤ f(x_{k−τ_k}) − (γ/(L_{n_k} l_{b_k}))⟨∇^[b_k]f(x_{k−τ_k}), ∇^[b_k]f_{n_k}(x_{k−τ_k})⟩
             + (γ²(1 + β)/(2L_{n_k}² l_{b_k}))‖∇^[b_k]f_{n_k}(x_{k−τ_k}) − ∇^[b_k]f_{n_k}(x*)‖²_*
             + (γ²(1 + 1/β)/(2L_{n_k}² l_{b_k}))‖∇^[b_k]f_{n_k}(x*)‖²_*,

where the second inequality uses the fact that, for any β > 0, any vectors x and y in R^d, and any norm ‖·‖, we have

‖x + y‖² ≤ (1 + β)‖x‖² + (1 + 1/β)‖y‖².

We use p̃_n and p̄_b to denote P_{n_k}[n] and P_{b_k}[b], respectively. We also use v_k as a shorthand for the random variable pair (n_k, b_k). Note that x_k depends on v_0, v_1, …, v_{k−1}, but not on v_{k̄} for any k̄ ≥ k. Thus,

E_{v_k | k−1}[f(x_{k+1/2})] − f*
  ≤ f(x_{k−τ_k}) − f* − Σ_{n=1}^N Σ_{b=1}^B (p̃_n p̄_b γ/(L_n l_b))⟨∇^[b]f(x_{k−τ_k}), ∇^[b]f_n(x_{k−τ_k})⟩
    + Σ_{n=1}^N Σ_{b=1}^B (p̃_n p̄_b γ²(1 + β)/(2L_n² l_b))‖∇^[b]f_n(x_{k−τ_k}) − ∇^[b]f_n(x*)‖²_*
    + Σ_{n=1}^N Σ_{b=1}^B (p̃_n p̄_b γ²(1 + 1/β)/(2L_n² l_b))‖∇^[b]f_n(x*)‖²_*
  = f(x_{k−τ_k}) − f* − γΛ‖∇f(x_{k−τ_k})‖²_*
    + (γ²Λ(1 + β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x_{k−τ_k}) − ∇f_n(x*)‖²_*
    + (γ²Λ(1 + 1/β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*,   (3.18)

where

Λ = N / (Σ_{n=1}^N L_n Σ_{b=1}^B l_b).

For each n = 1, …, N, f_n is convex and L_n-smooth. Therefore, according to Note 2.2, it holds that

‖∇f_n(x_{k−τ_k}) − ∇f_n(x*)‖²_* ≤ L_n⟨∇f_n(x_{k−τ_k}) − ∇f_n(x*), x_{k−τ_k} − x*⟩.

Substituting this inequality into (3.18) yields

E_{v_k | k−1}[f(x_{k+1/2})] − f* ≤ f(x_{k−τ_k}) − f* − γΛ‖∇f(x_{k−τ_k})‖²_*
                              + (γ²Λ(1 + β)/2)⟨∇f(x_{k−τ_k}), x_{k−τ_k} − x*⟩
                              + (γ²Λ(1 + 1/β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*.

Since f is μ-strongly convex, it follows from Note 2.2 that

⟨∇f(x_{k−τ_k}), x_{k−τ_k} − x*⟩ ≤ (1/μ)‖∇f(x_{k−τ_k})‖²_*,

which implies that

E_{v_k | k−1}[f(x_{k+1/2})] − f* ≤ f(x_{k−τ_k}) − f* − γΛ(1 − γ(1 + β)/(2μ))‖∇f(x_{k−τ_k})‖²_*
                              + (γ²Λ(1 + 1/β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*.

Moreover, according to [24, Theorem 2], it holds that

2μ(f(x_{k−τ_k}) − f*) ≤ ‖∇f(x_{k−τ_k})‖²_*.

Thus, if we take

γ ∈ (0, 2μ/(1 + β)),   (3.19)

we have

E_{v_k | k−1}[f(x_{k+1/2})] − f* ≤ (1 − 2γμΛ(1 − γ(1 + β)/(2μ)))(f(x_{k−τ_k}) − f*)
                              + (γ²Λ(1 + 1/β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*.   (3.20)

Define the sequence {V_k} by

V_k = E[f(x_k)] − f*,   k ∈ N_0,

and note that

V_{k+1} = E[f(x_{k+1})] − f* = E[E_{v_k | k−1}[f(x_{k+1})]] − f*.

By combining this fact with (3.17) and (3.20), we obtain

V_{k+1} ≤ a_1 V_k + a_2 V_{k−τ_k} + a_3,

with a_1 = 1 − θ, a_2 = θ(1 − 2γμΛ(1 − γ(1 + β)/(2μ))), and a_3 = θ(γ²Λ(1 + 1/β)/(2N))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*. One can verify that, for any γ satisfying (3.19),

a_1 + a_2 = 1 − 2γθμΛ(1 − γ(1 + β)/(2μ)) ∈ [1 − θμ²Λ/(1 + β), 1).

It now follows from Lemma 3.5 that

V_k ≤ ρ^k V_0 + ε,   k ∈ N_0,

where

ρ = (a_1 + a_2)^(1/(1+τ̄)) = (1 − 2γθμΛ(1 − γ(1 + β)/(2μ)))^(1/(1+τ̄)),

and

ε = a_3/(1 − a_1 − a_2) = (γ(1 + 1/β)/(2N(2μ − γ(1 + β))))Σ_{n=1}^N (1/L_n)‖∇f_n(x*)‖²_*.

Letting β = 1/(1 + κ) completes the proof. ■

Proof of Theorem 3.4. Letting B = 1, l_b = L, and P_{b_k}[b] = 1, the proof is similar to that of Theorem 3.3, and is thus omitted. ■

CHAPTER 4

DISTRIBUTED MEMORY ALGORITHMS

Many optimization problems that arise in machine learning, signal processing, and statistical estimation can be formulated as regularized optimization (also referred to as composite optimization) problems [44–46]. Specifically, regularized optimization problems are of the form

minimize_{x ∈ R^d}  φ(x) := f(x) + h(x),   (4.1)

where the first part of the objective function is typically smooth and models the empirical data loss, and the second term is a regularizer. Possible choices of the regularizer include:

• Unconstrained, smooth optimization: h(x) = 0,
• Constrained, smooth optimization: h is the indicator function of a nonempty closed convex set C ⊆ R^d, i.e., h(x) = I_C(x) := 0 if x ∈ C, and +∞ otherwise,
• Lasso regularization: h(x) = λ_1 ‖x‖_1 with λ_1 > 0,
• Tikhonov (or Ridge) regularization: h(x) = (λ_2/2) ‖x‖²_2 with λ_2 > 0,
• Elastic Net regularization: h(x) = λ_1 ‖x‖_1 + (λ_2/2) ‖x‖²_2 with λ_1, λ_2 > 0, and
• Constrained Lasso regularization: in this case, h(x) = λ_1 ‖x‖_1 + I_C(x) with λ_1 > 0.
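For several of the regularizers listed above, the proximal operators used later in the chapter have simple closed forms. The Python sketches below are illustrative only, with a box set standing in for a generic closed convex set C.

import numpy as np

def prox_lasso(z, gamma, lam1):
    """Soft-thresholding: prox of h(x) = lam1*||x||_1 with step-size gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam1, 0.0)

def prox_ridge(z, gamma, lam2):
    """Prox of h(x) = (lam2/2)*||x||_2^2 (simple shrinkage)."""
    return z / (1.0 + gamma * lam2)

def prox_elastic_net(z, gamma, lam1, lam2):
    """Prox of h(x) = lam1*||x||_1 + (lam2/2)*||x||_2^2."""
    return prox_lasso(z, gamma, lam1) / (1.0 + gamma * lam2)

def prox_box_indicator(z, lower, upper):
    """Prox of the indicator of the box C = [lower, upper]^d, i.e. projection onto C."""
    return np.clip(z, lower, upper)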

Stochastic gradient methods were among the first and the most commonly used algorithms developed for solving stochastic optimization problems [30, 47–52]. Their popularity comes mainly from the fact that they are easy to implement and have low computational cost per iteration. Stochastic gradient methods are inherently serial in the sense that the gradient computations take place on a single processor which has access to the whole dataset. However, it happens more and more often that one single computer is unable to store and handle the amounts of data that we encounter in practical problems. This has caused a strong interest in developing parallel optimization algorithms which are able to split the data and distribute the computation across multiple computing nodes (see, e.g., [18, 53–57] and references therein). The performance of Google’s DistBelief model [58] and Microsoft’s Project Adam [59] has proven that parallel stochastic gradient methods are remarkably effective in real-world machine learning problems such as training deep learning systems. For example, while training a neural network for the ImageNet task with 16 million images may take about two weeks on a modern GPU, Google’s DistBelief model can successfully utilize 16,000 cores in parallel and train the network in three days [58].

A common practical solution for parallelizing stochastic gradient methods is mini-batching (MB), where iterates are updated based on the average gradient with respect to multiple data points rather than based on gradients evaluated at a single data point at a time. Recently, Dekel et al. [60] proposed a parallel mini-batch algorithm for regularized stochastic optimization problems, in which multiple processors compute gradients in parallel using their own local data, and then aggregate the gradients up a spanning tree to obtain the averaged gradient. While this algorithm can achieve linear speedup in the number of processors, it has the drawback that the processors need to synchronize at each round and, hence, if one of them is slower than the rest, then the entire algorithm runs at the pace of the slowest processor. Furthermore, the need for global synchronization and the massive communication overhead make this method fragile to many types of failures that are common in distributed computing environments. For example, if one processor fails during the execution of the algorithm or is disconnected from the network connecting the processors, the algorithm will come to an immediate halt. Besides parallel mini-batch algorithms, a second family of parallel methods is based on asynchronous incremental gradient updates. When the loss functions are strongly convex, which is often the case, it has recently been observed that incremental aggregated gradient (IAG) methods outperform incremental gradient descent and are, in addition, able to converge to the true optimum even with a constant step-size. Gurbuzbalaban et al. [61] established linear convergence for an incremental aggregated gradient method suitable for implementation in the Parameter Server framework. However, the analysis does not allow for any regularization (or proximal) term, nor any additional convex constraints.

In the subsequent sections, we present two types of algorithms, namely asynchronous mini-batching (MB) and the asynchronous proximal incremental aggregated gradient (PIAG) method, to solve problems of the form (4.1), and discuss their implementation in the Parameter Server framework. The implementations, and thus the analyses, of the two algorithms differ from each other mainly in the aggregation part. In Section 4.1, we introduce the problem formulation, list the assumptions used in our analyses, and review prior work relevant to the algorithms. Then, in Section 4.2, we present the convergence rate analyses of MB (Section 4.2.1) and PIAG (Section 4.2.2). Later, in Sections 4.3.1 and 4.3.2, we provide different numerical examples to verify our theoretical findings related to MB and PIAG, respectively. We finish the chapter by providing the proofs in Section 4.4.

4.1 Problem Formulation

We impose the following two assumptions on Problem (4.1).

Assumption 4.1 (Existence of a Minimum). The optimal set X*, defined as

X* := {x* : φ* = φ(x*) ≤ φ(x), ∀x},

is nonempty.

Assumption 4.2 (Closedness of dom h). The function h is simple and lower semi-continuous, and its effective domain, dom h = {x ∈ R^d : h(x) < +∞}, is closed. Moreover, h(x) is subdifferentiable everywhere in its effective domain, i.e., for all x, y ∈ dom h,

h(x) ≥ h(y) + ⟨s_y, x − y⟩,   ∀s_y ∈ ∂h(y).

4.1.1 Asynchronous Mini-Batching

Algorithm Description

For analyzing mini-batching, we consider the following stochastic convex optimization problem:

minimize_{x ∈ R^d}  φ(x) := E_v[F(x, v)] + h(x).   (4.2)

Here, x is the decision variable, v is a random vector whose probability distribution μ is supported on a set V ⊆ R^{d_2}, F(·, v) is convex and differentiable for each v ∈ V, and h(x) is a proper convex function that may be nonsmooth and extended real-valued. Let us define

f(x) := E_v[F(x, v)] = ∫_V F(x, v) dμ(v).   (4.3)

Note that the expectation function f is convex and differentiable, and ∇f(x) = E_v[∇_(1)F(x, v)] [62]. Thus, ∇_(1)F(x, v) can be viewed as an unbiased estimate of ∇f(x).

A difficulty when solving Problem (4.2) is that the distribution μ is often unknown, so the expectation (4.3) cannot be computed. This situation occurs frequently in data-driven applications such as machine learning. To support these applications, we do not assume knowledge of f (or of μ), only access to a stochastic oracle. Each time the oracle is queried with an x ∈ R^d, it generates an independent and identically distributed (i.i.d.) sample v from μ and returns ∇_(1)F(x, v), which is a noise-corrupted version of ∇f(x).

The erroneous gradient ∇_(1)F(x, v) will be used in the update rule of our optimization algorithm instead of ∇f(x). We also impose the following assumptions on Problem (4.2).

Assumption 4.3 (Smoothness of F). For each v ∈ V, the function F(·, v) has Lipschitz continuous gradient with constant L. That is, for all x, y ∈ R^d,

‖∇_(1)F(x, v) − ∇_(1)F(y, v)‖_* ≤ L ‖x − y‖.

Please note that, under this assumption, f(x) is also L-smooth [49].

Assumption 4.4 (Bounded Gradient Variance). There exists a constant σ ≥ 0 such that

E_v[‖∇_(1)F(x, v) − ∇f(x)‖²_*] ≤ σ²,   ∀x ∈ R^d.

In the case when the gradients are evaluated without any errors, i.e., ∇f(x) = ∇_(1)F(x, v), we can set σ = 0. Several practical problems in machine learning, statistical applications, and signal processing satisfy Assumptions 4.1–4.4 (see, e.g., [44–46]). One such example is l1-regularized logistic regression for sparse binary classification. We are then given a large number of observations

v_n = [v_n^[1]  v_n^[2]]^⊤ : v_n^[1] ∈ R^d, v_n^[2] ∈ {−1, +1},   n = 1, …, N,

drawn i.i.d. from an unknown distribution μ, and want to solve the minimization Problem (4.2) with

F(x, v) = log(1 + exp(−v^[2] ⟨v^[1], x⟩))

and h(x) = λ‖x‖_1. The role of the l1 regularization is to produce sparse solutions.
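For this example, the stochastic oracle simply returns the gradient of the logistic loss at a single sample. A minimal Python sketch is given below (illustrative only; for the prox of h(x) = λ‖x‖_1, see the soft-thresholding operator sketched in the list of regularizers above).

import numpy as np

def logistic_grad(x, v1, v2):
    # Gradient with respect to x of F(x, v) = log(1 + exp(-v2*<v1, x>)),
    # where v1 is the feature vector and v2 in {-1, +1} is the label.
    z = -v2 * np.dot(v1, x)
    sigmoid = 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid evaluated at z
    return -v2 * sigmoid * v1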

In many emerging applications, such as large-scale machine learning and statistics, the size of the dataset is so large that it cannot fit on one computer. Hence, we need optimization algorithms that can be conveniently and efficiently executed in parallel on multiple processors. Our goal is (i) to develop an algorithm for solving regularized stochastic optimization problems which combines the strong performance guarantees of serial stochastic gradient methods, the parallelization benefits of mini-batching algorithms, and the speedups enabled by asynchronous implementations; (ii) to extend the analysis in [38] to the optimization Problem (4.2) with general regularization functions, not necessarily h(x) = I_C(x), without any additional assumption on the boundedness of either the gradients or the feasible sets; and (iii) to determine whether an asynchronous mini-batch algorithm achieves the optimal rate O(1/K) under the strong convexity assumption.

We assume that W workers have access to a shared decision variable x. The workers may have different capabilities (in terms of processing power and access to data) and are able to update x without the need for global coordination or synchronization among each other. Conceptually, the algorithm lets each worker run its own stochastic composite mirror descent process, repeating the following steps:

1. Receive a copy of x and load it into the local storage location x̂,
2. Sample b i.i.d. random variables v_1, …, v_b from the distribution μ,
3. Compute the averaged stochastic gradient vector

   ḡ = (1/b) Σ_{i=1}^b ∇_(1)F(x̂, v_i),

4. Update the current value of the shared x via

   x_new ← arg min_x { ⟨ḡ, x − x_old⟩ + (1/γ) D_ω(x, x_old) + h(x) }.

The algorithm can be implemented in many ways. One simple way is to use a master-worker setting, as depicted in Figure 4.1. In this case, each of the worker nodes retrieves x from the master in Step 1 and returns the averaged gradient to the master in Step 3; the fourth step (carrying out the minimization) is executed by the master.
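When the generalized distance is the squared Euclidean one, D_ω(x, y) = (1/2)‖x − y‖²_2, Step 4 reduces to a standard proximal step, as the following sketch illustrates (the names are hypothetical; prox_h is any routine evaluating the proximal operator of h).

import numpy as np

def composite_update(x_old, g_bar, gamma, prox_h):
    # Step 4 with D_w(x, y) = 0.5*||x - y||_2^2:
    #   argmin_x { <g_bar, x - x_old> + (1/gamma)*D_w(x, x_old) + h(x) }
    #     = prox_{gamma*h}(x_old - gamma*g_bar).
    return prox_h(x_old - gamma * g_bar, gamma)

# Example usage with the soft-thresholding operator for h(x) = lam*||x||_1:
# x_new = composite_update(x_old, g_bar, gamma, lambda z, g: prox_lasso(z, g, lam))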

Independently of how we choose to implement the algorithm, computing nodes may work at different rates: while one node updates the decision vector, others are generally busy computing averaged gradient vectors. The nodes that perform gradient evaluations do not need to be aware of updates to the decision vector, but can continue to operate on stale information about x. Therefore, unlike synchronous parallel mini-batch algorithms [60], there is no need for processors to wait for each other to finish the gradient computations. Moreover, the value x̂ at which the average of gradients is evaluated by a node may differ from the value of x to which the update is applied. Algorithm 4.1 describes the W workers that run asynchronously in parallel. To describe the progress of the overall optimization process, we introduce a shared counter k that is incremented each time x is updated. Assume that k̂ denotes the time at which the x̂ used to compute the averaged gradient involved in the update of x was read from the shared memory. It is clear that 0 ≤ k̂ ≤ k for all k ∈ N_0. The value

τ_k := k − k̂

can be viewed as the delay between reading and updating for worker nodes. Moreover, τ_k captures the staleness of the information used to compute the average of gradients for the kth update. We assume that the time-varying delay τ_k is bounded; this is stated in the following assumption.

Algorithm 4.1: Asynchronous MB (running on each worker w)

Input: positive step-sizes {γ_k}; batch size b ∈ N.
Initialization: x_0 ∈ dom h.
repeat
  Sample b inputs v_1, …, v_b i.i.d. from the distribution μ
  Calculate the averaged gradient, ḡ_{k−τ_k} ← (1/b) Σ_{i=1}^b ∇_(1)F(x_{k−τ_k}, v_i)
  Update the shared decision vector

    x_{k+1} ← arg min_x { ⟨ḡ_{k−τ_k}, x − x_k⟩ + (1/γ_k) D_ω(x, x_k) + h(x) }   (4.4)

  Increment the shared counter, k ← k + 1
until termination test satisfied

Assumption 4.5 (Bounded Delay). There is a non-negative integer τ̄ such that 0 ≤ τ_k ≤ τ̄ for all k ∈ N_0.

The value of τ̄ is an indicator of the amount of asynchrony in the algorithm and in the execution platform. In practice, τ̄ will depend on the number of parallel processors used in the algorithm [37, 39, 63]. Note that the cyclic-delay mini-batch algorithm [38], in which the workers are ordered and each worker w updates the decision variable under a fixed schedule, is a special case of Algorithm 4.1 with τ_k = W − 1 for all k.

Prior Work

There have been extensive studies on asynchronous stochastic optimization, but mostly under the assumption that the loss function is nonsmooth with bounded subgradients, see, e.g., [40, 64, 65]. The literature on asynchronous algorithms for smooth stochastic optimization is relatively sparse. In this chapter, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization problems with smooth loss functions that eliminates the overhead associated with global synchronization. Our algorithm allows multiple processors to work at different rates, perform computations independently of each other, and update global decision variables using out-of-date gradients. A similar model of parallel asynchronous computation was applied to coordinate descent methods for deterministic optimization in [37, 39, 63] and to mirror descent and dual averaging methods for stochastic optimization in [38]. In particular, Agarwal and Duchi [38] have analyzed the convergence of asynchronous mini-batch algorithms for smooth stochastic convex problems, and interestingly shown that bounded delays do not degrade the asymptotic convergence. However,

they only considered the case where the regularization term is the indicator function of a compact convex set. Moreover, convergence rates for strongly convex stochastic problems were not discussed in [38].

Figure 4.1: Illustration of one possible realization of Algorithm 4.1, using a single reader/single writer policy. worker-2 receives x_2 from the master and computes the averaged gradient vector ḡ_2 = (1/b) Σ_{i=1}^b ∇_(1)F(x_2, v_i). As the worker nodes are run without synchronization, the master writes x_3 and x_4 to the memory while worker-2 is evaluating ḡ_2. At time instance k = 5, the master updates the current x, i.e., x_4, using the out-of-date averaged gradient ḡ_2 received from worker-2.

We extend the results of [38] to general regularization functions (like the l1 norm, often used to promote sparsity), and establish a sharper expected-value type of convergence rate than the one given in [38]. Specifically, we make the following contributions:

(i) For general convex regularization functions, we show that when the feasible set is closed and convex (but not necessarily bounded), the running average of the iterates generated by our algorithm with constant step-sizes converges at rate O(1/K) to a ball around the optimum. We derive an explicit expression that quantifies how the convergence rate and the residual error depend on loss function properties and algorithm parameters such as the step-size and the maximum delay bound τ̄.

(ii) For general convex regularization functions and compact feasible sets, we prove that the running average of the iterates produced by our algorithm with a time-varying step-size converges to the true optimum (without residual error) at rate

O((τ̄ + 1)²/K + 1/√K).

As long as the number of processors is O(K^{1/4}), our algorithm enjoys near-linear speedup and converges asymptotically at a rate O(1/√K). This rate is known to be optimal for convex stochastic problems even in the absence of delays [49, 50].

(iii) When the regularization function is strongly convex and the feasible set is closed and convex, we establish that the iterates converge at rate

O((τ̄ + 1)⁴/K² + 1/K).

If the number of processors is of the order of O(K^{1/4}), this rate is O(1/K) asymptotically in K, which is the best known rate for strongly convex stochastic optimization problems in a serial setting [66–68].

In Section 4.2.1, we characterize the iteration complexity and the convergence rate of the proposed algorithm, and show that these compare favourably with the state of the art. Our approach is distinguished from recent work on stochastic optimization [38, 48–51, 60] in that it can deal with asynchrony, smooth objective functions, and general regularization functions at the same time, cf. Table 4.1. To the best of our knowledge, our asynchronous algorithm is the first to attain the optimal convergence rates for convex and strongly convex stochastic composite optimization in spite of time-varying delays.

Table 4.1: Comparison of our algorithm with selected recent algorithms in the literature ([50], [51], [49], [60], [38]) for stochastic convex optimization, along the following dimensions: regularized stochastic optimization beyond h(x) = I_C(x), parallel execution, asynchronous execution, and convergence rates for convex and for strongly convex problems.

4.1.2 Proximal Incremental Aggregated Gradient Descent

Algorithm Description

For analyzing PIAG, we consider a similar problem formulation:

minimize_{x ∈ R^d}  Σ_{n=1}^N f_n(x) + h(x).   (4.5)

In order to solve Problem (4.5), we are going to use the proximal incremental aggregated gradient method. In this method, at iteration k ∈ N, the gradients of all component functions f_n(x), possibly evaluated at stale information x_{k−τ_k^n}, are aggregated:

g_k = Σ_{n=1}^N ∇f_n(x_{k−τ_k^n}).

Then, a proximal step is taken based on the current vector x_k, the aggregated gradient g_k, and the nonsmooth term h(x):

x_{k+1} = arg min_x { ⟨g_k, x − x_k⟩ + (1/(2γ)) ‖x − x_k‖² + h(x) }.   (4.6)

The algorithm has a natural implementation in the parameter server framework. The master node maintains the iterate x and performs the proximal steps. Whenever a worker node reports new gradients, the master updates the iterate and informs the worker about the new iterate. Pseudo code for a basic parameter server implementation is given in Algorithms 4.2 and 4.3.
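A serial Python simulation of the PIAG iteration (4.6), with a fixed artificial staleness and cyclic component updates, is sketched below. It is illustrative only and does not model the actual master-worker communication of Algorithms 4.2 and 4.3.

import numpy as np

def piag(grads, prox_h, x0, gamma, tau, K):
    """Serial simulation of the PIAG iteration (4.6) (illustrative sketch only).

    grads[n](x)  : gradient of component f_n at x
    prox_h(z, g) : proximal operator of h with parameter g
    tau          : fixed artificial staleness applied to the reported gradients
    """
    N, x_hist = len(grads), [np.array(x0, dtype=float)]
    # table of the most recently reported gradients, one slot per component
    g_table = [grads[n](x_hist[0]) for n in range(N)]
    for k in range(K):
        stale = x_hist[max(k - tau, 0)]           # delayed copy of the iterate
        n = k % N                                  # cyclic component updates
        g_table[n] = grads[n](stale)               # worker n reports a stale gradient
        g_k = np.sum(g_table, axis=0)              # aggregated gradient
        x_hist.append(prox_h(x_hist[k] - gamma * g_k, gamma))  # proximal step (4.6)
    return x_hist[-1]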

To establish convergence of the iterates to the global optimum, we impose the following assumptions on Problem (4.5):

Assumption 4.6 (Smoothness). Each f_n : R^d → R, for n = 1, …, N, is an L_n-smooth convex function on R^d.

Note 4.1. Under this assumption, f is L-smooth with L ≤ Σ_{n=1}^N L_n.

Assumption 4.7 (Strong Convexity). The function f(x) := Σ_{n=1}^N f_n(x) is μ-strongly convex.

Assumption 4.8 (Bounded Delay). The time-varying delays τ_k^n are bounded, i.e., there is a non-negative integer τ̄ such that τ_k^n ∈ {0, 1, …, τ̄} holds for all k ∈ N_0 and n ∈ {1, …, N}.

Algorithm 4.2: IAG (Master Procedure)

Input: Lipschitz constant, L; strong-convexity parameter, μ; the maximum information delay, τ̄; step-size, γ; and the maximum iteration count, K > 0.
Output: The final decision vector, x_K.
Data: Buffers for the incremental gradients of each worker, {g_w : g_w ∈ R^d}; and the nonsmooth function, h(x).
Initialization: Initial (zero) buffers, {g_w}; and initial iteration count, k = 0.
while k < K do
  Wait until a set W of workers return their gradients
  forall w do
    if w ∈ W then
      Update the incremental gradient of the worker, g_w ← Σ_{n ∈ N_w} ∇f_n(x_{k−τ_k^n})
    else
      Keep the incremental gradient of the worker
  Aggregate the incremental gradients, g_k ← Σ_w g_w
  Solve (4.6) with g_k
  forall w ∈ W do
    Send x_{k+1} to worker w
  Increment the iteration counter, k ← k + 1
Signal EXIT
return x_K

Algorithm 4.3: IAG (Worker Procedure for each w)

Data: Buffer for the decision vector, x ∈ R^d; and loss functions, {f_n(x) : n ∈ N_w}, with ∪_w N_w = {1, …, N} and N_{w_1} ∩ N_{w_2} = ∅ for all w_1 ≠ w_2 ∈ W, where W := {1, …, W}.
repeat
  Receive the decision vector from the master, x ← x_{k+1}
  Calculate the incremental gradient, IG ← Σ_{n ∈ N_w} ∇f_n(x)
  Send IG to the master with a delay of τ_k
until EXIT received

Prior Work

Incremental gradient methods for smooth optimization problems have a long tradition, most notably in the training of neural networks via back-propagation. In contrast to gradient methods, which compute the full gradient of the loss function before updating the iterate, incremental gradient methods evaluate the gradients of a single, or possibly a few, component functions in each iteration. Incremental gradient methods can be computationally more efficient than traditional gradient methods since each step is cheaper but makes a comparable progress on average. However, for global convergence, the step-size needs to diminish to zero, which can lead to slow convergence [69]. If a constant step-size is used, only convergence to an approximate solution can be guaranteed in general [70]. Recently, Blatt, Hero, and Gauchman [71] proposed a method, the incremental aggregated gradient (IAG), that also computes the gradient of a single component function at each iteration. But rather than updating the iterate based on this information, it uses the sum of the most recently evaluated gradients of all component functions. Compared to the basic incremental gradient methods, IAG has the advantage that global convergence can be achieved using a constant step-size when each component function is convex quadratic. Later, Gurbuzbalaban, Ozdaglar, and Parrilo [61] proved linear convergence for IAG in a more general setting when component functions are strongly convex. In a more recent work, Vanli, Gurbuzbalaban, and Ozdaglar [72] analyzed the global convergence rate of proximal incremental aggregated gradient methods, where they can provide the linear convergence rate only after sufficiently many iterations. Our result differs from theirs in that we provide the linear convergence rate of the algorithm without any constraints on the iteration count, and we extend the result to general distance functions.

There has been some recent work on the stochastic version of the IAG method (called stochastic average gradient, or SAG) where we sample the component function to update instead of using a cyclic order [73–75]. Unlike the IAG method where the linear convergence rate depends on the number of passes through the data, the SAG method achieves a linear convergence rate that depends on the number of iterations. Further, when the number of training examples is sufficiently large, the SAG method allows the use of a very large step-size, which leads to improved theoretical and empirical performance. Our algorithm, presented in this chapter, can handle both general convex regularizers and convex constraints. We establish linear convergence when the empirical data loss is strongly convex, give explicit expressions for step-size choices that guarantee convergence to the global optimum and bound the associated convergence factors. These expressions have an explicit dependence on the degree of asynchrony and recover classical results under synchronous operation. We believe that this is a practically and theoretically important addition to existing optimization algorithms for the parameter server architecture. 4.2. MAIN RESULT 61

4.2 Main Result

4.2.1 Asynchronous Mini-Batching

General Convex Regularization

The following theorem establishes convergence properties of Algorithm 4.1 when a constant step-size is used.

Theorem 4.1. Let Assumptions 4.1–4.5 hold. Assume also that

γ_k = γ ∈ (0, 1/(L(τ̄ + 1)²)).   (4.7)

Then, for every K ∈ N and any optimizer x* of Problem (4.2), we have

E[φ(x̄_K) − φ*] ≤ D_ω(x*, x_0)/(Kγ) + γcσ²/(2b(1 − γL(τ̄ + 1)²)),

where x̄_K is the Cesáro average of the iterates, i.e.,

x̄_K := (1/K) Σ_{k=1}^K x_k,

b is the batch size, and c ∈ [1, b] is given by

c = 1 if ‖·‖_* = ‖·‖_2, and c = 2 max_{‖x‖≤1} ω(x) otherwise.

Proof. See Section 4.4. ■

Theorem 4.1 demonstrates that for any constant step-size γ satisfying (4.7), the running average of the iterates generated by Algorithm 4.1 converges in expectation to a ball around the optimum at a rate of O(1/K). The convergence rate and the residual error depend on the choice of γ: decreasing γ reduces the residual error, but it also results in slower convergence. We now describe a possible strategy for selecting the constant step-size. Let K_ε be the total number of iterations necessary to achieve an ε-optimal solution to Problem (4.2), that is, E[φ(x̄_K) − φ*] ≤ ε when K ≥ K_ε. If we pick

γ = ε/(εL(τ̄ + 1)² + cσ²/b),   (4.8)

then, using Theorem 4.1, the corresponding x̄_K satisfies

E[φ(x̄_K) − φ*] ≤ (Δ_0/K)(L(τ̄ + 1)² + cσ²/(bε)) + ε/2,

where Δ_0 = D_ω(x*, x_0). This inequality tells us that if the first term on the right-hand side is less than ε/2, i.e., if

K ≥ K_ε := 2Δ_0 (L(τ̄ + 1)²/ε + cσ²/(bε²)),

then E[φ(x̄_K) − φ*] ≤ ε. Hence, the iteration complexity of Algorithm 4.1 with the step-size choice (4.8) is given by

O(Δ_0 (L(τ̄ + 1)²/ε + cσ²/(bε²))).   (4.9)

As long as the maximum delay bound τ̄ is of the order 1/√ε, the first term in (4.9) is asymptotically negligible. In this case, the iteration complexity of Algorithm 4.1 is asymptotically O(cσ²/(bε²)), which is exactly the iteration complexity achieved by a serial mini-batch algorithm [60]. As discussed before, τ̄ is related to the number of processors. Therefore, if the number of processors is of the order of O(1/√ε), parallelization does not appreciably degrade the asymptotic convergence of Algorithm 4.1. Furthermore, as p processors are being run asynchronously and in parallel, updates may occur roughly p times as quickly, which means that a near-linear speedup in the number of processors can be expected.

Remark 4.2. Another strategy for the selection of the constant step-size in Algorithm 4.1 is to use a γ that depends on prior knowledge of the number of iterations to be performed. More precisely, assume that the number of iterations is fixed in advance, say equal to K̄. By choosing γ as

γ = 1/(L(τ̄ + 1)² + η√K̄)

for some η > 0, it follows from Theorem 4.1 that the running average of the iterates after K̄ iterations satisfies

E[φ(x̄_K̄) − φ*] ≤ L(τ̄ + 1)² D_ω(x*, x_0)/K̄ + (1/√K̄)(η D_ω(x*, x_0) + cσ²/(2ηb)).

The optimal choice of η, which minimizes the second term on the right-hand side of the above inequality, is

η* = σ √(c/(2b D_ω(x*, x_0))).

With this choice of η, we then have

E[φ(x̄_K̄) − φ*] ≤ L(τ̄ + 1)² D_ω(x*, x_0)/K̄ + σ √(2c D_ω(x*, x_0)/(bK̄)).

In the case that τ̄ = 0, the preceding guaranteed bound reduces to the one obtained in [50, Theorem 1] for the serial stochastic mirror descent algorithm with constant step-sizes. Note that, in order to implement Algorithm 4.1 with the optimal constant step-size policy, we need to estimate an upper bound on D_ω(x*, x_0), since D_ω(x*, x_0) is usually unknown.
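The constant step-size strategy above is easy to wrap in a small helper. The Python sketch below implements the step-size (4.8) and the iteration count K_ε as stated above; delta0 must be an upper bound on D_ω(x*, x_0), which in practice has to be estimated.

import math

def constant_stepsize_and_iterations(L, tau_bar, c, sigma, b, eps, delta0):
    """Step-size (4.8) and iteration bound K_eps for an eps-optimal running average.

    Illustrative helper: gamma = eps/(eps*L*(tau_bar+1)^2 + c*sigma^2/b), and
    K_eps = 2*delta0*(L*(tau_bar+1)^2/eps + c*sigma^2/(b*eps^2)).
    """
    gamma = eps / (eps * L * (tau_bar + 1) ** 2 + c * sigma ** 2 / b)
    K_eps = 2.0 * delta0 * (L * (tau_bar + 1) ** 2 / eps + c * sigma ** 2 / (b * eps ** 2))
    return gamma, math.ceil(K_eps)

# Example: L=10, tau_bar=4, c=1, sigma=1, b=32, eps=1e-2, delta0=1
print(constant_stepsize_and_iterations(10, 4, 1, 1.0, 32, 1e-2, 1.0))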

Theorem 4.2. Suppose that Assumptions 4.1–4.5 hold. In addition, suppose that dom h is compact and that is bounded on dom . Let D! (⋅, ⋅) h

2 R = max D! (x, y) . x,y∈dom h

If is set to −1 = ( + 1)2 + with k k L ̄ k

 c k + 1 k = , √R√ b then for all K ∈ N, the Cesáro average of the iterates√ generated by Algorithm 4.1 satisfies

2( + 1)2 2 2R c E ⋆ LR ̄  ̄xK −  ≤ + . K √ bK√   √ Proof. See Section 4.4. ■
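The time-varying step-size of Theorem 4.2 can be generated as follows (illustrative sketch; all parameters are assumed to be known or estimated).

import math

def stepsize_schedule(k, L, tau_bar, sigma, R, c, b):
    """Time-varying step-size gamma_k of Theorem 4.2 (illustrative sketch).

    gamma_k^{-1} = L*(tau_bar+1)^2 + eta_k with eta_k = (sigma/R)*sqrt(c*(k+1)/b);
    the constant part damps the effect of delays, the growing part controls noise.
    """
    eta_k = (sigma / R) * math.sqrt(c * (k + 1) / b)
    return 1.0 / (L * (tau_bar + 1) ** 2 + eta_k)

# Example: the schedule for the first few iterations
print([round(stepsize_schedule(k, L=10, tau_bar=4, sigma=1.0, R=1.0, c=1, b=32), 5)
       for k in range(5)])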

The time-varying step-size γ_k, which ensures the convergence of the algorithm, consists of two terms: the time-varying term η_k controls the errors from the stochastic gradient information, while the role of the constant term L(τ̄ + 1)² is to decrease the effects of asynchrony (bounded delays) on the convergence of the algorithm. According to Theorem 4.2, in the case that τ̄ = O(K^{1/4}), the delay becomes increasingly harmless as the algorithm progresses, and the expected function value evaluated at x̄_K converges asymptotically at a rate O(1/√K), which is known to be the best achievable rate of the mirror descent method for nonsmooth stochastic convex optimization problems [30]. For the special case of Problem (4.2) where h is restricted to be the indicator function of a compact convex set, Agarwal and Duchi [38, Theorem 2] showed that the convergence rate of the delayed stochastic mirror descent method with a time-varying step-size is

O( LR²/K + Rσ√c/√(bK) + LR²G²τ̄²b log K/(σ²K) ),

where G² is the maximum bound on E_v[‖∇_(1)F(x, v)‖²_*]. Compared with this result, instead of an asymptotic penalty of the form τ̄²(log K)/K due to the delays, we have the penalty τ̄²/K, which is much smaller for large K. Therefore, not only do we extend the result of [38] to general regularization functions, but we also obtain a sharper guaranteed convergence rate than the one presented in [38].

Strongly Convex Regularization

In this subsection, we restrict our attention to stochastic composite optimization problems with strongly convex regularization terms. Specifically, we assume that h is μ_h-strongly convex with respect to ‖·‖, that is, for any x, y ∈ dom h,

h(y) ≥ h(x) + ⟨s_x, y − x⟩ + (μ_h/2)‖y − x‖²,   ∀s_x ∈ ∂h(x).

The strong convexity of h implies that Problem (4.2) has a unique minimizer x* [76, Corollary 11.16]. Examples of strongly convex functions h include Tikhonov and Elastic Net regularization. In order to derive the convergence rate of Algorithm 4.1 for solving Problem (4.2) with a strongly convex regularization term, we need to assume that the generalized distance function D_ω(x, y) used in the algorithm satisfies the next assumption.

Assumption 4.9 (Quadratic Growth). For all x, y ∈ dom h, we have

(μ_ω/2)‖x − y‖² ≤ D_ω(x, y) ≤ (L_ω/2)‖x − y‖²,

with 0 < μ_ω ≤ L_ω.

Note 4.3. When the distance generating function is the squared Euclidean norm, i.e., ω(x) = (1/2)‖x‖²_2, the generalized distance function becomes D_ω(x, y) = (1/2)‖x − y‖²_2, with μ_ω = L_ω = 1. Assumption 4.9 will automatically hold whenever the distance generating function ω is L_ω-smooth [68].

The associated convergence result now reads as follows.

Theorem 4.3. Suppose that the regularization function h is μ_h-strongly convex and that Assumptions 4.2–4.5 and 4.9 hold. If γ_k is set to γ_k^{−1} = 2L(τ̄ + 1)² + η_k with

η_k = (μ_h/(3L_ω))(k + τ̄ + 1),

then, for K ∈ N, the iterates produced by Algorithm 4.1 satisfy

E[‖x* − x_K‖²] ≤ (4(6LL_ω(τ̄ + 1)²/μ_h + 1)²/(K + 1)²) D_ω(x*, x_0) + 18cL_ω σ²/(μ_h² b (K + 1)).

Proof. See Section 4.4. ■

An interesting point regarding Theorem 4.3 is that, for solving stochastic composite optimization problems with strongly convex regularization functions, the maximum delay bound $\bar{\tau}$ can be as large as $\mathcal{O}(K^{1/4})$ without affecting the asymptotic convergence rate of Algorithm 4.1. In this case, our asynchronous mini-batch algorithm converges asymptotically at a rate of $\mathcal{O}(1/K)$, which matches the best known rate achievable for strongly convex stochastic problems in a serial setting [66–68].

Running-time comparisons

Having derived the convergence rates for convex and strongly convex composite stochastic problems, we now explicitly compare the running times of the serial mini-batch algorithm ($W = 1$ and $\bar{\tau} = 0$) and the asynchronous mini-batch algorithm ($W > 1$ and $\bar{\tau} > 0$). We define a time-unit to be the time it takes a single processor to sample $v$ from the underlying distribution and evaluate $\nabla^{(1)} F(x, v)$; a worker thus needs $b$ time-units to process a batch of $b$ samples. We ignore the time required to update the current $x$ in the proximal step (4.4), since this step can usually be done very efficiently and requires negligible time compared to computing the averaged gradient, especially when $b$ is large [38]. Let $N_k$ be the number of time-units allocated to each algorithm. Since the serial algorithm uses $b$ samples to compute an averaged gradient and execute an update, it will be able to complete $N_k/b$ iterations in $N_k$ time-units. In the asynchronous algorithm, $W$ processors concurrently compute stochastic averaged gradients, so the master receives one averaged gradient vector every $b/W$ time-units. It follows that in $N_k$ time-units, the serial and asynchronous mini-batch algorithms perform $N_k/b$ and $W N_k/b$ iterations, respectively. Substituting these iteration counts into the guaranteed bounds provided by Theorems 4.2 and 4.3, together with the assumption that $\bar{\tau}$ is roughly proportional to $W$ [37, 39, 63], we can derive upper bounds on the expected optimization accuracy of each algorithm after $N_k$ time-units, cf. Table 4.2. We can see that if the number of processors is suitably chosen, then the asynchronous mini-batch algorithm enjoys asymptotically faster convergence times for regularized stochastic optimization problems.

Table 4.2: Upper bounds on E[φ(x̄_k) − φ*] for convex problems and E‖x_k − x*‖² for strongly convex problems after N_k time-units.

    Convergence rate for    Serial algorithm              Asynchronous algorithm
    convex                  O(b/N_k + σ/√N_k)             O(bW/N_k + σ/√(W·N_k))
    strongly convex         O(b²/N_k² + σ²/N_k)           O(b²W²/N_k² + σ²/(W·N_k))
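As a numerical sanity check on Table 4.2, the sketch below plugs the two iteration counts (serial: $N/b$; asynchronous: $WN/b$) into the Theorem 4.2 bound as stated above, under the assumption $\bar{\tau} \approx W$; all constants are illustrative placeholders.

```julia
# Sketch: Table 4.2 comparison — evaluate the Theorem 4.2 bound after N_units time-units
# for the serial (W = 1, tau_bar = 0) and asynchronous (W workers, tau_bar ≈ W) algorithms.
# All constants are illustrative placeholders.
L, sigma, c, R, b, N_units = 0.25, 1.0, 1.0, 10.0, 100, 1_000_000

bound(K, tau_bar) = L * R^2 * (tau_bar + 1)^2 / K + 2 * sqrt(2) * sigma * R * c / sqrt(b * K)

for W in (1, 2, 4, 8)
    K = W * N_units ÷ b                  # updates completed in N_units time-units
    tau_bar = W == 1 ? 0 : W             # serial run has no delay; otherwise assume tau_bar ≈ W
    println("W = ", W, "   expected accuracy ≲ ", round(bound(K, tau_bar), digits = 5))
end
```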

4.2.2 Proximal Incremental Aggregated Gradient Descent

First, we provide the convergence result for PIAG that uses the squared Euclidean distance in the update rule.

Theorem 4.4. Assume that Problem (4.5) satisfies Assumptions 4.2 and 4.6–4.8, and that the step-size $\gamma$ satisfies:

\[ \gamma \le \frac{1}{\mu}\left[\left(1 + \frac{\mu}{(\bar{\tau} + 1) L}\right)^{\frac{1}{\bar{\tau} + 1}} - 1\right], \]
where $L = \sum_{n=1}^{N} L_n$. Then, the iterates generated by Algorithms 4.2 and 4.3 satisfy
\[ \lVert x_k - x^\star \rVert^2 \le \left(\frac{1}{\gamma\mu + 1}\right)^{k} \lVert x_0 - x^\star \rVert^2 \]
for all $k \ge 0$.

Proof. See Section 4.4. ■

Remark 4.4. For the special case of Algorithms 4.2 and 4.3 where $\tau_k^n = 0$ for all $k, n$, Xiao and Zhang [42] have shown that the convergence rate of the serial proximal gradient method with a constant step-size $\gamma = 1/L$ is
\[ \left(\frac{L - \mu_f}{L + \mu_h}\right)^{k}, \]
where $\mu_f$ and $\mu_h$ are the strong convexity parameters of $f(x)$ and $h(x)$, respectively. It is clear that in the case $\bar{\tau} = 0$, the guaranteed bound in Theorem 4.4 reduces to the one obtained in [42].
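To make the update rule concrete, the following Julia sketch simulates PIAG in a serial loop with artificially delayed component gradients on a small ℓ1-regularized quadratic. The problem instance, the random delay model, and the use of the strong convexity constant of the smooth sum as $\mu$ are illustrative assumptions (this is not the thesis' parameter-server implementation); the step-size follows the Theorem 4.4 bound as stated above.

```julia
# Sketch: serial simulation of proximal incremental aggregated gradient (PIAG)
# with bounded, artificial delays. Illustrative problem: f_n(x) = 0.5*||A_n x - b_n||^2,
# h(x) = lambda*||x||_1. All data below are placeholders.
using LinearAlgebra, Random

Random.seed!(1)
N, d, lambda, tau_bar = 10, 5, 0.1, 3
A = [randn(d, d) + 2I for _ in 1:N]           # random component data
bvec = [randn(d) for _ in 1:N]

grad(n, x) = A[n]' * (A[n] * x - bvec[n])          # gradient of component n
soft(z, t) = sign.(z) .* max.(abs.(z) .- t, 0.0)   # prox of t*||.||_1

L = sum(opnorm(A[n]' * A[n]) for n in 1:N)         # sum of component Lipschitz constants
mu = minimum(eigvals(Symmetric(sum(A[n]' * A[n] for n in 1:N))))  # strong convexity of the sum
gamma = (1 / mu) * ((1 + mu / ((tau_bar + 1) * L))^(1 / (tau_bar + 1)) - 1)

function run_piag(iters)
    x = zeros(d)
    history = [copy(x)]                       # past iterates, used to emulate staleness
    for k in 1:iters
        # aggregated gradient: every component gradient is evaluated at a stale iterate
        g = sum(grad(n, history[max(1, k - rand(0:tau_bar))]) for n in 1:N)
        x = soft(x - gamma * g, gamma * lambda)   # proximal (soft-thresholding) step
        push!(history, copy(x))
    end
    return x
end

println("final iterate: ", round.(run_piag(500), digits = 4))
```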

The update rule of our algorithm can be easily extended to a non-Euclidean setting, by replacing the squared Euclidean distance in the proximal step (4.6) with a generalized distance function.

The associated convergence result now reads as follows.

Corollary 4.5. Consider using the following proximal gradient method to solve Problem (4.5):
\[ x_{k+1} = \operatorname*{arg\,min}_{x} \left\{ \langle g_k, x - x_k \rangle + \frac{1}{\gamma} D_\omega(x, x_k) + h(x) \right\}, \qquad g_k = \sum_{n=1}^{N} \nabla f_n\bigl(x_{k - \tau_k^n}\bigr), \tag{4.10} \]
where $D_\omega(\cdot,\cdot)$ satisfies Assumption 4.9. Assume also that the problem satisfies Assumptions 4.2 and 4.6–4.8, and that the step-size $\gamma$ satisfies:

\[ \gamma \le \frac{L_\omega}{\mu}\left[\left(1 + \frac{\mu\,\mu_\omega}{(\bar{\tau} + 1) L L_\omega}\right)^{\frac{1}{\bar{\tau} + 1}} - 1\right], \]
where $L = \sum_{n=1}^{N} L_n$. Then, the iterates generated by the method satisfy
\[ D_\omega(x^\star, x_k) \le \left(\frac{L_\omega}{\gamma\mu + L_\omega}\right)^{k} D_\omega(x^\star, x_0). \]
Proof. See Section 4.4. ■

4.3 Numerical Example

4.3.1 Asynchronous Mini-Batching

We have developed a complete master-worker implementation of our algorithm in C++ using the Message Passing Interface (MPI) library Open MPI [15], due to its flexibility in scaling the problem in distributed-memory environments. We have performed extensive experiments to show how our algorithm parameters can be selected based on our theoretical convergence analyses (Section 4.2.1), how our algorithm performs on convex and strongly convex stochastic optimization problems, and how the performance compares to that of the synchronous version. To this end, we have used two different datasets: rcv1 [77] and Epsilon [78]. The first one, rcv1, is the corrected version of Reuters' Text Categorization Test Collection, which consists of N = 804,414 documents, with d = 47,236 sparse (density: 0.16%) unique stemmed tokens spanning 103 topics. Out of these topics, we decided to single out all sports, government and disaster related documents. The second one, Epsilon, is a synthetic, dense dataset consisting of N = 500,000 samples with d = 2,000 features. The dataset is already divided into two classes. To evaluate our algorithm, we trained a sparse (binary) classifier by solving the following regularized logistic regression problem on the datasets:

\[ \operatorname*{minimize}_{x} \; \mathbb{E}_{v_n}\Bigl[\log\bigl(1 + \exp\bigl(-v_n^{[2]} \langle v_n^{[1]}, x \rangle\bigr)\bigr)\Bigr] + \lambda_1 \lVert x \rVert_1 + \frac{\lambda_2}{2} \lVert x \rVert_2^2 + I_C(x). \tag{4.11} \]
Here, $v_n^{[1]} \in \mathbb{R}^d$, $n = 1, \dots, N$, is the vector of sample features, $\lambda_1$ and $\lambda_2$ are two regularization parameters, and

\[ C = \bigl\{ x \in \mathbb{R}^d : \lVert x \rVert_2 \le R \bigr\}. \]

In rcv1, these feature vectors are sparse and contain the stemmed tokens in each sampled document, which have already been cosine-transformed to have a maximum norm of 1. In Epsilon, however, the feature vectors need to be manually scaled when importing from the dataset. The label $v_n^{[2]} \in \{-1, 1\}$ indicates whether a selected sample $n$ falls into the desired category or not. In rcv1, $v_n^{[2]} = 1$ if the sampled document is about sports, government or disasters, and $v_n^{[2]} = -1$ otherwise. To evaluate scalability, we used both the training and test sets available in rcv1 when solving the optimization problem. Finally, we have used the distance generating function $\omega(x) = \frac{1}{2}\lVert x \rVert_2^2$ in all experiments.

Algorithm Parameter Selection

To demonstrate how the algorithm parameters can be selected based on our theoretical convergence analyses and the problem data, we have chosen to solve Problem (4.11) with $W = 4$ workers, $\lambda_1 = 0.01$, $\lambda_2 = 0$ and $R = 10$ on the Epsilon dataset. Our goal is to achieve an error of $\epsilon = 0.4$ after $K = 2{,}500$ iterations, i.e., $\varphi(\bar{x}_{2500}) - \varphi^\star \le 0.4$. Since the feasible set is compact, according to Theorem 4.2, Algorithm 4.1 can be implemented with time-varying step-sizes. Thus, we should find the batch-size $b$ such that

\[ \frac{L R^2 (\bar{\tau} + 1)^2}{K} + \frac{2\sqrt{2}\,\sigma R c}{\sqrt{bK}} \le \epsilon. \]
Using the problem data $L = 0.25$ and $\sigma = 1$ together with $c = 1$, one can verify that $b$ should be at least 15. In our master-worker implementation, we have observed that choosing a mini-batch size of $b = 100$ balances the communication and computation times, resulting in a good overall performance. In Figure 4.2, we present the actual function value attained by the iterates of the algorithm together with the theoretical upper bound given in Theorem 4.2. As can be observed, the objective function value converges to within the desired tolerance of the optimum, and the theoretical upper bound is valid.
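As a concrete illustration of this calculation, the short Julia sketch below searches for the smallest batch size satisfying the bound above; the choice $\bar{\tau} = W = 4$ is an assumption made here for illustration.

```julia
# Sketch: smallest batch size b such that
#   L*R^2*(tau_bar + 1)^2/K + 2*sqrt(2)*sigma*R*c/sqrt(b*K) <= eps_tol.
# tau_bar = W = 4 is an illustrative assumption.
L, sigma, c, R, K, eps_tol, tau_bar = 0.25, 1.0, 1.0, 10.0, 2500, 0.4, 4

delay_term = L * R^2 * (tau_bar + 1)^2 / K              # part of the budget used by delays
noise_budget = eps_tol - delay_term                     # what remains for the stochastic term
b_min = ceil(Int, (2 * sqrt(2) * sigma * R * c / (noise_budget * sqrt(K)))^2)
println("delay term = ", delay_term, ",  minimal batch size b = ", b_min)   # prints b = 15
```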

[Figure 4.2 about here: φ(x_ave) − φ* versus iteration, showing the theoretical bound and the actual value; panel title: λ1 = 0.01, p = 4, b = 100.]

Figure 4.2: Convergence of the objective function value in Problem (4.11), evaluated at the Cesàro average of the iterates, to an $\epsilon = 0.4$ neighborhood of the optimum function value when $\lambda_1 = 0.01$ and $\lambda_2 = 0$.

Asynchronous Algorithm on Convex Problems

To evaluate the performance of our asynchronous algorithm on convex problems for both sparse and dense datasets, we implemented Algorithm 4.1 with the time-varying step-sizes given in Theorem 4.2, and used a batch size of b = 10,000 samples over the total sample size of N = 500,000. This time, we let the algorithm run for K = 200 iterations, which corresponds to touching all samples 4 times in expectation. For relative speedup comparison purposes, we run the algorithm on W = 1, 2, 4, 6, 8, 10 workers, and we assume that the true optimum function value is the one obtained when the algorithm is run serially (i.e., W = 1). We present the wall-clock times to achieve this optimum function value on the Epsilon dataset for different numbers of workers in Figure 4.3 (left).

Asynchronous Algorithm on Strongly Convex Problems

This time, we set the ℓ2-regularization parameter in Problem (4.11) to $\lambda_2 = 0.001$ while keeping the other variables the same as in Section 4.3.1. We run the same experiments with the step-sizes given in Theorem 4.3. We assume that the true optimizer for the problem is the one obtained by the algorithm in the 1-worker case, and present the wall-clock times to reach this optimizer for different numbers of workers in Figure 4.3 (right).

In Table 4.3, we summarize the relative speedup of the algorithm achieved in the preceding experiments with respect to the number of workers used.

[Figure 4.3 about here: left panel, φ(x_ave) − φ* versus time [sec] for λ2 = 0; right panel, ‖x_k − x*‖² versus time [sec] for λ2 = 0.001; curves for p = 1, 2, 4, 6, 8, 10 workers.]

Figure 4.3: (Left) Convergence of the objective function value in Problem (4.11), evaluated at the Cesàro average of the iterates, to the optimum function value when $\lambda_2 = 0$, and (right) convergence of the iterates to the optimizer of the same problem when $\lambda_2 = 0.001$; both for $\lambda_1 = 0.01$ on the Epsilon dataset.

The relative speedup of the algorithm on $W$ workers is defined as $S_W = t_1/t_W$, where $t_1$ and $t_W$ are the times it takes to run the corresponding algorithm (to $\epsilon$-accuracy) on 1 and $W$ workers, respectively. We observe a near-linear relative speedup, consistent with our theoretical results, when a relatively small number of workers is used. However, as the number of workers increases, the relative speedup starts saturating due to the communication overhead at the master side. To keep the relative speedup increasing, one needs to increase the mini-batch size so that the communication overhead is balanced against the increased computing times. Speedup values are averaged over 10 Monte Carlo simulations.

Table 4.3: Relative speedup of the asynchronous algorithm for convex (λ2 = 0) and strongly convex (λ2 = 0.001) objective functions, all with λ1 = 0.01.

                                 W = 1     2      4      6      8      10
    Epsilon   Convex              N/A      1     1.88   2.55   2.70   3.23
              Strongly convex     N/A      1     1.82   2.52   3.08   3.69
    rcv1      Convex              N/A      1     1.60   2.32   2.92   3.53
              Strongly convex     N/A      1     1.52   2.29   2.78   3.37

Remark 4.5. We have observed super-linear speedups in our experiments. Super-linear speedups are common in parallel computations and are often due to caching effects, especially when extensive matrix-matrix computations take place [79, 80]. One way to eliminate such caching effects when presenting the results is to scale the speedups with respect to that of the smallest number of processing units which results in a super-linear speedup [80]. For this reason, Table 4.3 presents speedup values relative to the two-worker case (W = 2).
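The normalization described in Remark 4.5 can be expressed in a couple of lines; the wall-clock times in the sketch below are made-up placeholders, not the measured values of Table 4.3.

```julia
# Sketch: relative speedups normalized to the two-worker run, as in Table 4.3.
# The wall-clock times below are made-up placeholders, not measured values.
workers = [2, 4, 6, 8, 10]
times   = [100.0, 55.0, 40.0, 32.5, 27.0]      # hypothetical seconds to reach the target accuracy

t_ref = times[1]                               # W = 2 is the reference run (cf. Remark 4.5)
speedups = round.(t_ref ./ times, digits = 2)
foreach((w, s) -> println("W = ", w, "   S_W = ", s), workers, speedups)
```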

Comparison to Synchronous Algorithm

Finally, we compare the performance of our asynchronous algorithm to that of the synchronous version. The synchronous version of Algorithm 4.1 can be implemented in the master-worker setting by forcing the master to wait for all the workers to return their gradients before updating the decision vector and sending it to each of the workers. We run the synchronous version of our algorithm on the Epsilon dataset with the same settings as in Section 4.3.1. Figure 4.4 shows the convergence time of the serial, synchronous and asynchronous implementations of the algorithm. We observe that although the synchronous algorithm can benefit from parallelization, asynchronous updates yield significant additional speedups.
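The following toy timing model illustrates one reason for the gap seen in Figure 4.4: a synchronous master waits for the slowest of the $W$ workers in every round, whereas an asynchronous master applies an update as soon as any worker finishes. The per-batch compute times are made-up random draws; this is a conceptual sketch of the straggler effect only, not the MPI implementation.

```julia
# Sketch: a toy timing model contrasting the synchronous and asynchronous masters.
# Worker compute times are made-up random draws; this is not the MPI implementation.
using Random
Random.seed!(2)

const W = 4
work_time() = 0.8 + 0.4 * rand()      # hypothetical seconds per mini-batch gradient

# Synchronous master: every update waits for the slowest of the W workers.
sync_sim(iters) = sum(maximum(work_time() for _ in 1:W) for _ in 1:iters)

# Asynchronous master: an update is applied whenever any worker finishes,
# and that worker immediately starts on a fresh mini-batch.
function async_sim(iters)
    next_done = [work_time() for _ in 1:W]
    t = 0.0
    for _ in 1:iters
        w = argmin(next_done)          # the worker that finishes first
        t = next_done[w]
        next_done[w] += work_time()
    end
    return t
end

iters = 1000
println("synchronous ≈ ", round(sync_sim(iters), digits = 1), " s,  ",
        "asynchronous ≈ ", round(async_sim(iters), digits = 1), " s for ", iters, " updates")
```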

[Figure 4.4 about here: φ(x_ave) − φ* versus time [sec] for λ1 = 0.01, λ2 = 0, comparing p = 1 (serial), p = 4 (synchronous) and p = 4 (asynchronous).]

Figure 4.4: Comparison of the synchronous and asynchronous parallel algorithms to the serial version. Relative speedups can be obtained with the synchronous parallel algorithm, and even faster speedups with the asynchronous version (two-fold in this setting compared to the synchronous version).

4.3.2 Proximal Incremental Aggregated Gradient Descent

In this section, we present numerical examples which verify our theoretical bound in different settings. First, we simulate the implementation of Algorithms 4.2 and 4.3 on a parameter server architecture to solve a small, toy problem. Then, we implement the framework on Amazon Elastic Compute Cloud (EC2) and solve a binary classification problem on three different real-world datasets.

Toy problem

To verify our theoretical bounds provided in Theorem 4.4 and Corollary 4.5, we consider solving (4.5) with

\[ f_n(x) = \begin{cases} \frac{1}{2}(x_n - c_1)^2 + \frac{1}{2}(x_{n+1} + c_1)^2, & n = 1, \\[2pt] \frac{1}{2}(x_{n-1} + c_1)^2 + \frac{1}{2}(x_n - c_1)^2, & n = N, \\[2pt] \frac{1}{2}(x_{n-1} + c_1)^2 + \frac{1}{2}(x_n - c_1)^2 + \frac{1}{2}(x_{n+1} + c_1)^2, & \text{otherwise}, \end{cases} \]
\[ h(x) = \lambda_1 \lVert x \rVert_1 + I_C(x), \qquad C = \{x \ge 0\}, \]
for some $c_1 \ge 0$. We use $D_\omega(x, x_k) = \frac{1}{2}\lVert x - x_k \rVert_p^2$ in the proximal step (4.10) and consider both $p = 1.5$ and $p = 2$.

It can be verified that $\nabla F(x)$ is $(N+1)$-Lipschitz continuous and $F(x)$ is 2-strongly convex, both with respect to $\lVert\cdot\rVert_2$, and that the optimizer of the problem is $x^\star = \frac{\max(0,\, c_1 - 1)}{3}\, e_1$, where $e_n$ denotes the $n$-th basis vector. Moreover, it can be shown that if $p \in (1, 2]$, then $\mu_\omega = 1$ and $L_\omega = N^{2/p - 1}$ satisfy Assumption 4.9 with respect to $\lVert\cdot\rVert_2$.

We select the problem parameters $N = 100$, $c_1 = 3$ and $\lambda_1 = 1$. We simulate solving the problem with $W = 4$ workers, where at each iteration $k$ a worker $w$ is selected uniformly at random to return its gradient information, evaluated at the stale iterate $x_{k - \tau_k^w}$, to the master. Here, at time $k$, $\tau_k^w$ is simply the number of iterations since the last time worker $w$ was selected. Each worker holds $N/W = 25$ component functions, and we tune the step-size based on the assumption that $\bar{\tau} = W$. Figure 4.5 shows the results of a representative simulation. As can be observed, the iterates converge to the optimizer and the theoretical bound derived is valid.

Binary classification on Actual Datasets

Next, we consider solving a regularized, sparse binary classification problem on three different datasets: epsilon (dense) [78], rcv1 (sparse) [77], and url (sparse) [81].

[Figure 4.5 about here: iterate convergence ‖x_k − x*‖₂² versus iteration k (·10³) for p = 1.5 and p = 2.0.]

Figure 4.5: Convergence of the iterates in the toy problem. Solid lines represent our theoretical upper bound, whereas dash-dotted lines represent simulation results.

The url dataset is a collection of data for identification of malicious URLs. It has N = 2,396,130 URL samples, each having d = 64 real-valued features out of a total of 3,231,961 attributes (density: 18.08%). We implement the parameter server framework in the Julia language, and instantiate it with Problem (4.5):

\[ f_n(x) = \frac{1}{N}\left[\log\bigl(1 + \exp\bigl(-v_n^{[2]} \langle v_n^{[1]}, x \rangle\bigr)\bigr) + \frac{\lambda_2}{2}\lVert x \rVert_2^2\right], \qquad h(x) = \lambda_1 \lVert x \rVert_1. \]
We pick $\lambda_1 = 10^{-5}$ and $\lambda_2 = 10^{-4}$ for the rcv1 and epsilon datasets, and $\lambda_1 = 10^{-3}$ and $\lambda_2 = 10^{-4}$ for url. rcv1 is already normalized to have unit norm in its samples; hence, we normalize the url and epsilon datasets to have comparable problem instances. For the rcv1 dataset, we choose to classify sports, disaster and government related articles from the corpus. The other two datasets are already divided into two classes. It can be verified that $\nabla F(x)$ is $(\frac{1}{4}\lVert A \rVert_2^2 + \lambda_2)$-Lipschitz continuous with $\lVert A \rVert_2 = 1$ in all the examples, and $F(x)$ is $\lambda_2$-strongly convex with respect to $\lVert\cdot\rVert_2$. We create three c4.2xlarge compute nodes at EC2. The compute nodes are physically located in Ireland (eu), North Virginia (us) and Tokyo (ap), respectively. Then, we assign one CPU from each node as a worker, resulting in a total of 3 workers, and we pick the master node at KTH in Sweden. We run a small number of iterations of the algorithms to obtain an a priori delay distribution of the workers in this setting, and we observe that $\bar{\tau} = 6$.
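For reference, the sketch below evaluates one component function and its gradient for the regularized logistic loss above; the sample data are random placeholders rather than entries of the benchmark datasets.

```julia
# Sketch: one component function and gradient for the regularized logistic loss above,
#   f_n(x) = (1/N) * ( log(1 + exp(-y_n * <a_n, x>)) + (lambda2/2) * ||x||_2^2 ).
# The sample (a_n, y_n) is a random placeholder, not an entry of the benchmark datasets.
using LinearAlgebra, Random
Random.seed!(3)

N, d, lambda2 = 1000, 20, 1e-4
a_n = randn(d) / sqrt(d)                 # placeholder feature vector (roughly unit norm)
y_n = rand(Bool) ? 1.0 : -1.0            # placeholder ±1 label

f_n(x) = (log1p(exp(-y_n * dot(a_n, x))) + lambda2 / 2 * norm(x)^2) / N

function grad_f_n(x)
    z = y_n * dot(a_n, x)
    # d/dx log(1 + exp(-z)) = -y_n * a_n / (1 + exp(z)), plus the Tikhonov term
    return (-(y_n / (1 + exp(z))) .* a_n .+ lambda2 .* x) ./ N
end

x = randn(d)
println("f_n(x) = ", f_n(x), "   ||grad f_n(x)|| = ", norm(grad_f_n(x)))
```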

[Figure 4.6 about here: iterate convergence ‖x_k − x*‖₂² versus iteration k (·10³) for the rcv1, url and epsilon experiments.]

Figure 4.6: Convergence of the iterates in EC2 experiments. Solid lines represent our theoretical upper bound, whereas dash-dotted lines represent experiment results.

[Figure 4.7 about here: mean worker delays τ with standard deviations for the eu, us and ap workers in the rcv1, url and epsilon experiments.]

Figure 4.7: Worker delays in EC2 experiments. Bars represent the mean delays, whereas vertical stacked lines represent the standard deviation. For each worker, from left to right, we present the delays obtained in the rcv1, url and epsilon experiments, respectively.

In Figures 4.6 and 4.7, we present the convergence results of our experiments and delay distributions of the workers, respectively. As in the previous example, the iterates converge to the optimizer and the theoretical bound derived is valid. Another observation worth noting is that the denser the datasets become, the smaller the gap between the actual iterates and the theoretical upper bound gets.

4.4 Proofs

In this section, we prove the theorems of this chapter. We first state three key lemmas which are instrumental in our argument. The following result establishes an important recursion for the iterates generated by Algorithm 4.1.

Lemma 4.6. Suppose Assumptions 4.1–4.5 hold. Then, the iterates $\{x_k\}_{k \in \mathbb{N}_0}$ generated by Algorithm 4.1 satisfy

\[
\begin{aligned}
\varphi(x_{k+1}) - \varphi^\star + \frac{1}{\gamma_k} D_\omega(x^\star, x_{k+1})
&\le \frac{1}{2\theta_k} \lVert e_{k-\tau_k} \rVert_*^2 + \bigl\langle e_{k-\tau_k},\, x_k - x^\star \bigr\rangle + \frac{1}{\gamma_k} D_\omega(x^\star, x_k) \\
&\quad + \frac{L(\bar{\tau} + 1)}{2} \sum_{j=0}^{\bar{\tau}} \lVert x_{k-j+1} - x_{k-j} \rVert^2 \\
&\quad - \frac{1}{2}\left(\frac{1}{\gamma_k} - \theta_k\right) \lVert x_{k+1} - x_k \rVert^2 - \frac{\mu_h}{2} \lVert x^\star - x_{k+1} \rVert^2,
\end{aligned} \tag{4.12}
\]
where $x^\star \in X^\star$, $\{\theta_k\}$ is a sequence of strictly positive numbers, and $e_k := \nabla f(x_k) - \bar{g}_k$ is the error in the gradient estimate.

Proof of Lemma 4.6. We start with the first-order optimality condition for the point xk+1 in ∈ h the minimization problem (4.4): there exists subgradient sxk+1 ) xk+1 such that for x ∈ h all dom , we have  1 0 + ∇ D + − ≤ ̄gk−k (1) ! xk+1, xk sxk+1 , x xk+1 , @ k A  Plugging the following equality

∇(1) D! xk+1, xk = ∇ ! xk+1 − ∇ ! xk ,    76 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS into the previous inequality and re-arranging terms gives 1 ∇ − ∇ − + − ! xk ! xk+1 , x xk+1 ≤ ̄gk−k sk+1, x xk+1 k     ( ) = − + − ̄gk−k , x xk+1 sk+1, x xk+1 ( ) − + h( ) − h  ≤ ̄gk−k , x xk+1 x xk+1 ( ℎ )2  − x − x +1 , (4.13) 2 k ô ô where the last inequality used ô ô

 2 h(x) ≥ h x + s , x − x + ℎ x − x , k+1 k+1 k+1 2 k+1    ô threeô point identity by the (strong) convexity of h. We now use the following well-knownô ô of the generalized distance function [33] to rewrite the left-hand side of (4.13):

∇ !(v) − ∇ !(y), x − y = D! (x, y) − D! (x, v) + D! (y, v) .

From this relation,⟨ with v = xk and y =⟩ xk+1, we have

∇ ! xk − ∇ ! xk+1 , x − xk+1 = D! x, xk+1 − D! x, xk + D! xk+1, xk .        Substituting the preceding equality into (4.13) and re-arranging terms result in 1 1 h x +1 − h(x) + D x, x +1 ≤ ̄g − , x − x +1 + D x, x k ! k k k k ! k k ( ) k   1 ℎ  2 − D! xk+1, xk − x − xk+1 . k 2  ô ô Since the distance generating function !(x) is 1-strongly convex, we haveô the lowerô bound

1 2 D x , x ≥ x − x , ! k+1 k 2 k+1 k  ô ô which implies that ô ô 1 1 h x +1 − h(x) + D x, x +1 ≤ ̄g − , x − x +1 + D x, x k ! k k k k ! k k ( ) k   1 2 ℎ  2 − xk+1 − xk − x − xk+1 . (4.14) 2 k 2 ô ô ô ô ô ô ô ô The essential idea in the rest of the proof is to use convexity and smoothness of the expectation function f to bound f xk+1 − f(x) for each x ∈ dom h. According to Assumption 4.3,  4.4. PROOFS 77

∇(1) F(x, v) and, hence, ∇ f(x) are Lipschitz continuous with the constant L. By using the L-Lipschitz continuity of ∇ f and then the convexity of f, we have

L 2 f x ≤ f x + ∇ f x , x − x + x − x k+1 k−k k−k k+1 k−k 2 k+1 k−k   (   ) 2  L ô ô ≤ f(x) + ∇ f xk− , xk+1 − x + xk+1 − xôk− , ô (4.15) k 2 ô k ô (   ) ô ô x ∈ h ô ô (x) = for any dom . Combining inequalities (4.14) andô(4.15), and recallingô that f(x) + h(x), we obtain 1 1  x +1 − (x) + D x, x +1 ≤ ∇ f x − − ̄g − , x +1 − x + D x, x k ! k k k k k k ! k k (   ) k   1 2 ℎ 2  − xk+1 − xk − x − xk+1 2 k 2 L 2 + xô − x ô . ô ô 2 ôk+1 k−ôk ô ô ô ô ô ô ô ô = ∇ f − We now rewrite the above inequality in terms of the error ek−k xk−k ̄gk−k as follows:   1 1  x +1 − (x) + D x, x +1 ≤ e − , x +1 − x + D x, x k ! k k k k ! k k ( ) k   1 2 ℎ  2 − xk+1 − xk − x − xk+1 2 k 2 L 2 + xô − x ô ô ô 2 ôk+1 k−ôk ô ô = ô − ô+ − ek−ôk , xk+1 xk ô ek−k , xk x ô ô «­­­­­­­­­­¯­­­­­­­­­­¬( ) ( ) Γ1 1 1 2 + D! x, xk − xk+1 − xk k 2 k  2 ℎ 2 Lô ô − x − x +1 + x +1 − x − . (4.16) 2 k 2 ô k kô k «­­­­­­­­¯­­­­­­­­¬ ô ô ô ô ô Γ2 ô ô ô ô ô

Γ Γ  N We will seek upper bounds on 1 and 2. Let k k∈ 0 be a sequence of positive numbers. For Γ1, we have

2 2 1 1 k Γ1 ≤ e − ,  x +1 − x ≤ e − + x +1 − x , (4.17) k k k k k 2 k k ∗ 2 k k óX k Yó k ó √  ó ó ó ô ô ô ô ó √ ó ô ô ô ô ó ó ô ô ô ô ó ó 78 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS where the second inequality follows from Fenchel’s inequality applied to the conjugate pair 1 ⋅ 2 1 ⋅ 2 i.e. 2 and 2 ∗, ,

1 2 1 2 ‖ ‖ ‖ ‖ a, b ≤ a + b . 2 ∗ 2

We turn to Γ2. Rewriting Γ2 givesð⟨ ⟩ð ‖ ‖ ‖ ‖

2 k 2 xk−j+1 − xk−j Γ2 =  + 1 . k  + 1 ôj=0 k ô  ôÉ ô ô ô ô ô Then, by the convexity of the norm ⋅ , weô conclude that ô ô ô k ̄ ‖ ‖ 2 2 Γ2 ≤ k + 1 xk−j+1 − xk−j ≤ ̄ + 1 xk−j+1 − xk−j , (4.18) j=0 j=0  É ô ô  É ô ô ô ô ô ôN where the last inequality comesô from our assumptionô that kô≤ ̄ for all k ∈ ô 0. Substituting inequalities (4.17) and (4.18) into the bound (4.16) and simplifying yield 1 1 2 1  x +1 − (x) + D x, x +1 ≤ e − + e − , x − x + D x, x k ! k 2 k k ∗ k k k ! k k k ( ) k   ô ô ̄  L(ô̄ + 1)ô 2 + ô ô x − x 2 k−j+1 k−j j=0 É ô ô 1 1 ô 2 ô  2 − −  ô x − x −ô ℎ x − x . 2 k k+1 k 2 k+1 0 k 1 X ô ô ô ô Setting x = x⋆, where x⋆ ∈ ⋆, completes the proof. ô ô ô ô ■

The next result follows from Lemma 4.6 by taking summation of the relations in (4.12).

Lemma 4.7. Let Assumptions 4.1–4.5 hold. Assume also that $\{\gamma_k\}_{k \in \mathbb{N}_0}$ is set to

\[ \gamma_k = \frac{1}{\eta_k + L(\bar{\tau} + 1)^2}, \qquad k \in \mathbb{N}_0, \]
where $\eta_k$ is positive for all $k$. Then, the iterates $\{x_k\}_{k \in \mathbb{N}_0}$ produced by Algorithm 4.1 satisfy

\[
\begin{aligned}
\sum_{k=0}^{K-1} \bigl(\varphi(x_{k+1}) - \varphi^\star\bigr)
&\le \sum_{k=0}^{K-1} \frac{1}{2\eta_k} \lVert e_{k-\tau_k} \rVert_*^2 + \sum_{k=0}^{K-1} \bigl\langle e_{k-\tau_k},\, x_k - x^\star \bigr\rangle + \frac{1}{\gamma_0} D_\omega(x^\star, x_0) \\
&\quad + \sum_{k=0}^{K-1} \left(\frac{1}{\eta_{k+1}} - \frac{1}{\eta_k}\right) D_\omega(x^\star, x_{k+1}) - \sum_{k=0}^{K-1} \frac{\mu_h}{2} \lVert x_{k+1} - x^\star \rVert^2.
\end{aligned}
\]

Proof of Lemma 4.7. Applying Lemma 4.6 with

1 2 k = − L( ̄ + 1) , k

−1 D ⋆ adding and subtracting k+1 ! x , xk+1 to the left-hand side of (4.13), and re-arranging terms, we obtain 

1 1 2 1 − ⋆ + D ⋆ + − ⋆ + D ⋆  xk+1  ! x , xk+1 ≤ ek−k ek−k , xk x ! x , xk +1 2 ∗ k k ( ) k   ô ô  ô1 ô1 ⋆ + ô −ô D! x , xk+1 0 k+1 k 1 ̄  L( ̄ + 1) 2 + xk−j+1 − xk−j 2 =0 Éj 2 ô ô L( ̄ + 1) ô 2 ô 2 − xô − x − ôℎ x⋆ − x . 2 k+1 k 2 k+1 ô ô ô ô ô Nô ô ô Summing the preceding inequality over k = 0, … ,K − 1, K ∈ , yields

K−1 ⋆ 1 ⋆  xk+1 −  + D! x , xK k=0 K É    K−1 K−1 1 2 1 + − ⋆ + D ⋆ ≤ ek−k ek−k , xk x ! x , x0 2 ∗ 0 k=0 k k=0 ( ) É ô ô É  K−1 ô ô ô1 ô1 ⋆ + − D! x , xk+1 k=0 0 k+1 k 1 É  K−1 ̄ 2 K−1 L( ̄ + 1) 2 L( ̄ + 1) 2 + x − x − x − x 2 k−j+1 k−j 2 k+1 k k=0 j=0 k=0 É É ô ô É K−1 ô ô ô ô ℎ ⋆ ô 2 ô ô ô − x − xk+1 2 =0 kÉ K−1 ô Kô−1 1 ô 2 ô 1 + − ⋆ + D ⋆ ≤ ek−k ek−k , xk x ! x , x0 2 ∗ 0 k=0 k k=0 ( ) É ô ô É  K−1 ô ô K−1 ô1 ô1 ⋆ ℎ ⋆ 2 + − D x +1, x − x − x +1 , (4.19) ! k 2 k k=0 0 k+1 k 1 k=0 É  É ô ô ô ô 80 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS where the second inequality used the facts

K−1 ̄ ̄ K−j−1 ̄ K−j−1 2 2 2 xk−j+1 − xk−j = xk+1 − xk = xk+1 − xk k=0 j=0 j=0 k=−j j=0 k=0 É É ô ô É É É É ô ô ô ô ̄ K−1 ô ô ô ô ô ô ô 2ô ≤ xk+1 − xk =0 =0 Éj kÉ Kô−1 ô ô ô 2 = ( ̄ + 1) xk+1 − xk , =0 kÉ ô ô and xk = x0 for all k ≤ 0. Dropping the second term on the left-handô side ofô(4.19) concludes the proof. ■ Rd Lemma 4.8. Let ⋅ be a norm over and let ⋅ ∗ be its dual norm. Let ! be a 1-strongly convex function with respect to over Rd. If Rd are zero-mean random ⋅ y1, … , yb ∈ variables drawn i.i.d.‖ ‖ from a distribution , then‖ ‖ ‖ ‖ 2 b b 1 c 2 E y ≤ E y , b i 2 i ∗ ô i=1 ô b i=1 ⎡ô É ô∗⎤ É   ⎢ô ô ⎥ ô ô ô ô ô ô where c ∈ [1, b] is given by ⎢ô ô ⎥ ⎣ô ô ⎦ 1 if ⋅ = ⋅ , = ∗ 2 c otherwise. T2 max x =1 !(x) ‖ ‖ ‖ ‖ ‖ ‖ Proof of Lemma 4.8. The result follows from [82, Lemma B.2] and convexity of the norm ⋅ ∗. For further details, see [60, Section 4.1]. ■
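A quick numerical illustration of Lemma 4.8 in the Euclidean case ($c = 1$), where the bound holds with equality in expectation: averaging $b$ i.i.d. zero-mean vectors shrinks the expected squared norm by a factor of $b$. This is a minimal Julia sketch with a placeholder sampling distribution, not part of the proofs.

```julia
# Sketch: Monte Carlo illustration of Lemma 4.8 for the Euclidean norm (c = 1):
#   E||(1/b) * sum_i y_i||_2^2 = (1/b^2) * sum_i E||y_i||_2^2 = E||y||_2^2 / b,
# so the two estimates below should roughly agree.
# The sampling distribution is an arbitrary placeholder.
using LinearAlgebra, Random, Statistics
Random.seed!(4)

d, b, trials = 50, 16, 10_000
draw() = randn(d) .* rand(d)              # zero-mean, non-Gaussian placeholder noise

lhs = mean(norm(sum(draw() for _ in 1:b) ./ b)^2 for _ in 1:trials)
rhs = mean(norm(draw())^2 for _ in 1:trials) / b
println("E||mean||^2 ≈ ", round(lhs, digits = 3), "   bound ≈ ", round(rhs, digits = 3))
```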

Now we are ready to prove Theorems 4.1–4.3.

Proof of Theorem 4.1. N Assume that the step-size k k∈ 0 is set to 1 = = k 2 ,  + L( ̄ + 1) for some  > 0. It is clear that satisfies (4.7). Applying Lemma 4.7 with ℎ = 0, k = and k = , we obtain

K−1 K−1 K−1 2 D ⋆ ⋆ 1 ⋆ ! x , x0  xk+1 −  ≤ ek− + ek− , xk − x + , 2 k ∗ k  k=0 k=0 k=0 ( ) É   É ô ô É ô ô (4.20) ô ô 4.4. PROOFS 81

N N for K ∈ . Each xk, k ∈ , is a deterministic function of the history v[k−1] ∶= vi(t) ∶ i = 1, … , b , t = 0, … , k − 1 but not of vi(k). Since ∇ f(x) = Ev ∇(1) F(x, v) ,   − ⋆ = 0 Ev[k−1] ek−k , xk x . ( ) Moreover, as vi and vj are independent whenever i ≠ j, it follows from Lemma 4.8 that 2 b 2 1 E e − = E ∇ f x − − ∇(1) F x − , v k k ∗ b k k k k i 4 5 ⎡ô i=1     ô∗⎤ ô ô ô É ô ô ô ⎢ôb ô ⎥ ô ô ⎢ô 2ô ⎥ c ô ∇ f − ∇ F ô ≤ 2⎣ô E xk−k (1) xk−k , vi ô ⎦ b i=1 4 ∗5 É ô    ô c2 ô ô ≤ , ô ô b ô ô where the last inequality follows from Assumption 4.4. Taking expectation on both sides of (4.20) and using the above observations yield

K 2 ⋆ c D! x , x0  x − ⋆ ≤ K + . E k 2 k=1 b  É    By the convexity of , we have

K K 1 1  ̄xK =  xk ≤  xk , HK k=1 I K k=1  É É  which implies that 2 D ⋆ ⋆ c ! x , x0 E  ̄xK −  ≤ + . 2b K   2  Substituting  = −1 − L( ̄ + 1) into the above inequality proves the theorem. ■

−1 Proof of Theorem 4.2. N = Assume that the step-size k k∈ 0 is chosen such that k 2 L( ̄ + 1) + k where

 c k + 1 k = . √R√ b √ 2 Since k is a non-increasing sequence, and D! (x, y) ≤ R for all x, y ∈ dom h, we have K−1 1 1 ⋆ 1 1 2 − D! x , xk+1 ≤ − R . k=0 0 k+1 k 1 0 K 0 1 É  82 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS

Applying Lemma 4.7 with ℎ = 0 and k = k, taking expectation, and using Lemma 4.8 completely identically to the proof of Theorem 4.1, we then obtain K K−1 R2 c2 1  x − ⋆ ≤ + . (4.21) E k 2 k=1 K b k=0 k É    É Viewing the sum as a lower-estimate of the integral of the function y(t) = 1∕ t + 1, one can verify that √ K−1 K−1 K−1 1 1 1 dt 2 K = ≤ 1 + ≤ , Ê k=0 k k=0 ̃ k + 1 ̃ H 0 t + 1I √̃ É É √ √ where ̃ =  c ∕ R b . Substituting this inequality into the bound (4.21), we obtain:  √   √  K 2 + 1 ⋆ 2 2 R c K E  xk −  ≤ LR ( ̄ + 1) + . k=1 √ √b É    √ Since 1 ≤ K, we have K + 1 ≤ 2K. Using this fact, we get the claimed guaranteed ■ bound. √ √

−1 Proof of Theorem 4.3. N = Assume that the step-size k k∈ 0 in Algorithm 4.1 is set to k 2 2L( ̄ + 1) + k, with ℎ k = (k + ̄ + 1) . 3L!

We first describe some important properties of k relevant to our proof. Clearly, k is non-increasing, i.e., 1 1 ≤ , (4.22) k k+1 ∈ N −1 −1 for all k 0. Since 0 ≤ k , we have

2 ℎ ̄ 1 2L( ̄ + 1) + ≤ . (4.23) 3L! k Moreover, one can easily verify that 1 1  4  2 − = ℎ L( + 1)2 + ℎ ( + ) + 1 2 2 ̄ k ̄ L! 3 3L! 3 k+1 k 0  1  2  ≤ ℎ 2L( ̄ + 1) + ℎ (k + ̄ + 1) 3 L! 0 L! 1  1 = ℎ , L! k 4.4. PROOFS 83 which implies that

1 1 1 ℎ ≤ + , (4.24) 2 L k+1 k 0 k ! 1 N for all k ∈ 0. Finally, by the definition of k, we have  ℎ ̄ 3L  ̄ k = 1 + ! ≤ 1 + ℎ , 2 ℎ 2 k+ ̄ 2L( ̄ + 1) + (k + ̄ + 1) 6LL!( ̄ + 1) 3L! and hence,

1  ̄ 1 1 + ℎ ≤ 2 . (4.25) k+ ̄ H 6LL!( ̄ + 1) I k

We are ready to prove Theorem 4.3. Applying Lemma 4.6 with

1 N k = , k ∈ 0 , 2 k and using the fact

L 2 D x⋆, x ≤ ! x⋆ − x , ! k+1 2 k+1  ô ô by Assumption 4.9, we obtain ô ô

⋆ 1 ℎ ⋆  xk+1 −  + + D! x , xk+1 0 k L! 1  2  ⋆ 1 ⋆ ≤ e − + e − , x − x + D x , x k k k ∗ k k k ! k ( ) k ô ô ̄  Lô ( ̄ + 1)ô 2 1 2 + ô ô xk−j+1 − xk−j − xk+1 − xk . 2 =0 4 k Éj ô ô ô ô ô ô ô ô Multiplying both sides of this relation by 1∕ô k, and then usingô (4.24), we have

1 ⋆ 1 ⋆  x +1 −  + D x , x +1 k 2 ! k k k+1   2  1 ⋆ 1 ⋆ ≤ ek− + ek− , xk − x + D! x , xk k ∗ k 2 k ( ) k ô ô ̄  ô L( ̄ ô+ 1) 2 1 2 +ô ô x − x − x − x . 2 k−j+1 k−j 4 2 k+1 k k j=0 k É ô ô ô ô ô ô ô ô ô ô 84 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS

N Summing the above inequality from k = 0 to k = K − 1, K ∈ , and dropping the first term on the left-hand side yield K−1 K−1 1 2 1 1 D x⋆, x ≤ e + e , x − x⋆ + D x⋆, x 2 ! K k−k ∗ k−k k 2 ! 0 k K k=0 k=0 ( ) 0  É ô ô É  ô ôK−1 ̄ K−1 Lô( ̄ + 1)ô 1 2 1 1 2 + x − +1 − x − − x +1 − x . 2 k j k j 4 2 k k k=0 j=0 k k=0 k É É ô ô É ô ô ô (4.26)ô ô ô ô ô What remains is to bound the fourth term on the right-hand side of (4.26). It follows from (4.22)–(4.25) that K−1 ̄ ̄ K−j−1 L( ̄ + 1) 1 2 L( ̄ + 1) 1 2 x − x = x − x 2 k−j+1 k−j 2 k+1 k k=0 j=0 k j=0 k=0 k+j É É ô ô É É ô ô ̄ K−1 ô ô ô ô L( ̄ + 1) 1 ô 2ô ≤ xk+1 − xk 2 =0 =0 k+j Éj kÉ ̄ K−1 ô ô (4.22) L( ̄ + 1) 1 ô ô2 ≤ xk+1 − xk 2 =0 =0 k+ ̄ Éj kÉ 2 K−1 ô ô L( ̄ + 1) 1 ô 2 ô = xk+1 − xk 2 =0 k+ ̄ kÉ 2  ̄ ℎ K−1ô ô (4.25) 2L( ̄ + 1) + ô ô 3L! 1 2 ≤ xk+1 − xk 4 =0 k kÉ K−1 ô ô (4.23) 1 1 2 ô ô − ≤ 2 xk+1 xk . 4 =0 kÉ k ô ô Substituting the above inequality into (4.26), and then takingô expectationô on both sides (similarly to the proof of Theorems 4.1 and 4.2), we have 1 2 1 D ⋆ c K + D ⋆ 2 E ! x , xK ≤ 2 ! x , x0 . (4.27) b 0 K    According to Remark 2.4, 1 2 x⋆ − x ≤ D x⋆, x . 2 K ! K  Moreover, by the definition of ôk, ô ô ô ℎ(K + 1) 1 ≤ K ≤ . 3L! K 4.4. PROOFS 85

Combining these inequalities with the bound (4.27), we conclude
\[ \mathbb{E}\lVert x^\star - x_K \rVert^2 \le \frac{4\left(\frac{6 L L_\omega (\bar{\tau} + 1)^2}{\mu_h} + 1\right)^2}{(K + 1)^2}\, D_\omega(x^\star, x_0) + \frac{18\, c^2 \sigma^2 L_\omega}{\mu_h^2\, b\, (K + 1)}. \]
■

Next, we prove Theorem 4.4 and Corollary 4.5. To this end, we first provide a lemma which is key to proving the main results.

Lemma 4.9. Assume that the non-negative sequences $\{V_k\}$ and $\{w_k\}$ satisfy the following inequality:

k

Vk+1 ≤ a1Vk − a2wk + a3 wj , (4.28) = − j Ék k0 N for some real numbers a1 ∈ (0, 1) and a2, a3 ≥ 0, and some integer k0 ∈ 0. Assume also that for , and that the following holds: wk = 0 k < 0

k0+1 a3 1 − a1 ≤ a2 . 1 − a1 k0 a1 Then, k for all . Vk ≤ a1V0 k ≥ 0

Proof of Lemma 4.9. To prove the linear convergence of the sequence, we divide both sides k+1 of (4.28) by a1 and take the sum:

K−1 K−1 K−1 K−1 k Vk+1 V w 1 ≤ k − a k + a w k+1 k 2 k+1 3 k+1 j =0 a =0 a =0 a =0 a = − kÉ 1 kÉ 1 kÉ 1 kÉ 1 j Ék k0 K−1 K−1 V w a3 = k − a k + w + w + ⋯ + w k 2 k+1 −k0 −k0+1 0 =0 a =0 a a1 kÉ 1 kÉ 1   a3 + w−k +1 + w−k +2 + ⋯ + w1 + ⋯ a2 0 0 1   a3 + wK−1−k + wK−1−k +1 + ⋯ + wK−1 aK 0 0 1   K−1 K−1 Vk 1 1 wk ≤ + a3 1 + + ⋯ + − a2 , (4.29) k k0 k+1 k=0 a a1 a k=0 a É 1 ⎛ ⎛ 1 ⎞ ⎞ É 1 ⎜ ⎜ ⎟ ⎟ where we have used the non-negativity⎜ ⎜ of w to obtain (4.29).⎟ ⎟ ⎝ ⎝ k ⎠ ⎠ 86 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS

If the coefficient of the second sum of the right-hand side of (4.29) is non-positive, i.e., if

k0+1 a3 a3 a3 1 − a1 a3 + + ⋯ + = ≤ a2 , a1 k0 1 − a1 k0 a1 a1 then inequality (4.29) implies that

V V −1 V1 V −1 V −2 V0 K + K + ⋯ + ≤ K + K + ⋯ + . K K−1 1 K−1 K−2 0 a1 a1 a1 a1 a1 a1

Hence, $V_K \le a_1^K V_0$ for any $K \ge 1$, and the desired result follows. ■
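As a sanity check on Lemma 4.9, the following Julia sketch iterates a recursion of the form (4.28) with a value of $a_3$ chosen strictly inside the lemma's condition and verifies numerically that $V_k \le a_1^k V_0$; the particular disturbance $w_k$ is an illustrative placeholder.

```julia
# Sketch: numerical check of Lemma 4.9. Build a recursion
#   V_{k+1} = a1*V_k - a2*w_k + a3*sum_{j=k-k0}^{k} w_j,   w_j = 0 for j < 0,
# with a3 strictly inside the lemma's condition, and verify V_k <= a1^k * V_0.
# The disturbance w_k below is an arbitrary non-negative placeholder.
a1, a2, k0 = 0.9, 1.0, 3
a3 = 0.9 * a2 * (1 - a1) * a1^k0 / (1 - a1^(k0 + 1))   # 90% of the largest admissible a3

K = 200
V = zeros(K + 1); V[1] = 1.0          # V[k+1] stores V_k, so V[1] is V_0
w = zeros(K)                          # w[k] stores w_{k-1}
for k in 1:K
    w[k] = 0.5 * V[k]                                  # some non-negative disturbance
    past = sum(w[max(1, k - k0):k])                    # sum_{j=k-k0}^{k} w_j (zero for j < 0)
    V[k + 1] = a1 * V[k] - a2 * w[k] + a3 * past
end
println("max_k  V_k / (a1^k * V_0) = ", maximum(V[k + 1] / (a1^k * V[1]) for k in 0:K))
```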

We are now ready to prove the main results.

Proof of Theorem 4.4. We start with analyzing each component function f n(x) to find upper bounds on the function values: 2 Ln f n xk+1 ≤ f n xk−n + ∇ f n xk−n , xk+1 − xk−n + xk+1 − xk−n k k k 2 k   (   ) 2  Ln ô ô ≤ f n(x) + ∇ f n xk−n , xk+1 − x + xk+1 − xkô−n ∀x , ô (4.30) k 2 ô k ô (   ) ô ô where the first and second inequalities use L -continuity andô convexity ofôf (x), respectively. n ô ôn Summing (4.30) over all component functions, we obtain:

N 2 Ln f xk+1 ≤ f(x) + gk, xk+1 − x + xk+1 − xk−n ∀x . (4.31) 2 k n=1    É ô ô ô ô ô ô Next, we seek an upper bound on the second term of the right-hand side of (4.31). Observe that the optimality condition of (4.6) implies: 1 g , x +1 − x ≤ x +1 − x , x − x +1 + s , x − x +1 k k k k k xk+1 k ( )   1 2 1  2 1 2 = x − x − x +1 − x − x − x +1 2 k 2 k k 2 k + ôs , x ô− x ô ô ô ô ôxk+1 ô k+1 ô ô ô ô 1( 2 1) 2 1 2 ≤ x − x − x +1 − x − x − x +1 + h(x) − h x +1 2 k 2 k k 2 k k  ô ô ô ô ô ô (4.32) ô ô ô ô ô ô C for all x ∈ . Here, we have used three-point identity and subdifferentiability of h in the second and third steps, respectively. 4.4. PROOFS 87

Plugging (4.32) in (4.31), and rearranging the terms, we obtain the following relation for all C x ∈ :

1 2 1 2 1 2 f x +1 + h x +1 + x − x +1 ≤ f(x) + h(x) + x − x − x +1 − x k k 2 k 2 k 2 k k   N ô ô ô 2ô ô ô ô ô Ln ô ô ô ô + xk+1 − xk−n . 2 k n=1 É ô ô ô ô ô ô ⋆ Using the strong convexity property on f xk+1 + h xk+1 above and choosing x = x gives:  

⋆ ⋆  ⋆ 2 1 ⋆ 2 ∇ f x + s ⋆ , x +1 − x + x +1 − x + x − x +1 x k 2 k 2 k    N ô ô ô ô 2 1 ⋆ 2 1 ô ô2 ôLn ô ≤ x − xk − xk+1 − xk + xk+1 − xk−n . (4.33) 2 2 2 k n=1 É ô ô ô ô ô ô ô ô ô ô ô ô ô ô Due to the optimality condition of (4.5), there exists a subgradient sx⋆ such that the first term on the left-hand side is non-negative. Using this particular subgradient, we drop the first term. The last term on the right-hand side of the inequality can be further upper-bounded using Jensen’s inequality as follows:

2 N N k k 2 2 Ln Ln L( ̄ + 1) xk+1 − xk−n = xj+1 − xj ≤ xj+1 − xj , 2 k 2 2 n=1 n=1 ôj=k−n ô j=k− ̄ ô k ô É ô ô É ô É ô É ô ô ô ô ô ô ô ô ô ô ô ô ô ô N ô ô where L = n=1 Ln. As a result, rearrangingô the termsô in (4.33), we obtain: ∑ ⋆ 2 1 ⋆ 2 1 2 x +1 − x ≤ x − x − x +1 − x k  + 1 k  + 1 k k k ô ô ( ̄ +ô 1)L ô ô2 ô ô ô + ô ô x − x ô . ô + 1 j+1 j  j=k− ̄ É ô ô ô ô ô ô 2 ⋆ 2 We note that xj+1 − xj = 0 for all j < 0. Using Lemma 4.9 with Vk = xk − x , = − 2 = = 1 = ( ̄+1)L = ■ wk xk+1 ôxk , a1 ô a2 +1 , a3 +1 and k0 ̄ completes the proof. ô ô   ô ô ô ô ô ô ô ô ô ô Proof of Corollary 4.5. The analysis is similar to that of Theorem 4.4. This time, the 88 CHAPTER 4. DISTRIBUTED MEMORY ALGORITHMS optimality condition of (4.10) implies: 1 g , x +1 − x ≤ ∇ ! x +1 − ∇ ! x , x − x +1 + s , x − x +1 k k k k k xk+1 k   1   1  1 ( ) = D x, x − D x +1, x − D x, x +1 ! k ! k k ! k    + − ∀ ∈ C sxk+1 , x xk+1 x . ( ) Choosing x = x⋆, and following the steps of the proof of Theorem 4.4, we obtain:

 ⋆ 2 1 ⋆ 1 ⋆ 1 x +1 − x + D x , x +1 ≤ D x , x − D x +1, x 2 k ! k ! k ! k k  k  ô ô L( ̄ + 1) 2 ô ô + x − x . 2 j+1 j j=k− ̄ É ô ô ô ô ô ô This time, using the upper and lower bounds in Assumption 4.9 on the left and right hand-side of the above inequality, respectively, and rearranging the terms, we arrive at:

⋆ L! ⋆ L! D! x , xk+1 ≤ D! x , xk − D! xk+1, xk  + L!  + L!  k   L( ̄ + 1) L + ! D x , x . + ! j+1 j  L! ! j=k− ̄ É 

Applying Lemma 4.9 with $V_k = D_\omega(x^\star, x_k)$, $w_k = D_\omega(x_{k+1}, x_k)$, $a_1 = a_2 = \frac{L_\omega}{\gamma\mu + L_\omega}$, $a_3 = \frac{\gamma L (\bar{\tau} + 1) L_\omega}{\mu_\omega (\gamma\mu + L_\omega)}$ and $k_0 = \bar{\tau}$ completes the proof. ■

CHAPTER 5

CONCLUSION

In the thesis, we have investigated asynchronous algorithms for large-scale optimization problems. Specifically, we have analyzed the convergence properties of a family of algorithms under bounded information delays.

In Chapter 3, we have proposed a new, flexible method for minimizing an average of a large number of smooth component functions based on partial gradient information under time-varying delays. The method covers delayed incremental gradient descent and delayed coordinate descent algorithms as special cases. Contrary to similar work in the literature, our method is delay-insensitive, in the sense that it converges linearly for any bounded, time-varying information delay, and its parameters can be chosen independently of the delay bound. Similarly, previous works in the literature require that the gradients of the individual loss functions be bounded. In the thesis, we have established linear convergence rates for the proposed method assuming only that the total objective function is strongly convex and the individual loss functions have Lipschitz-continuous gradients. Using extensive simulations, we have verified our theoretical bounds and shown that they are reasonably tight.

In Chapter 4, we have analyzed two different variants of incremental gradient descent methods for composite optimization problems. First, we have investigated an asynchronous mini-batch algorithm that exploits multiple processors to solve regularized stochastic optimization problems with smooth loss functions. We have established that for closed and convex feasible sets, the iteration complexity of the algorithm with constant step-sizes is asymptotically $\mathcal{O}(1/\epsilon^2)$. For compact feasible sets, we have proved that the running average of the iterates generated by our algorithm with time-varying step-sizes converges to the optimum at a rate $\mathcal{O}(1/\sqrt{K})$. When the regularization function is strongly convex and the feasible set is closed and convex, the algorithm achieves a rate of the order $\mathcal{O}(1/K)$. We have shown that the penalty in the convergence rate of the algorithm due to asynchrony is asymptotically negligible, and that a near-linear speedup in the number of processors can be expected. Then, we have studied the proximal incremental aggregated gradient method. We have shown that when the objective function is strongly convex, the iterates generated by the method converge linearly to the global optimum. We have also given a constant step-size rule

when the degree of asynchrony in the architecture is known. Moreover, we have validated our theoretical bounds through extensive experiments on reasonably large-scale problems.

Future Work

We conclude with some open issues for future work:

• Accelerated asynchronous methods. For general convex regularization functions, the convergence rate of our asynchronous algorithm is

\[ \mathcal{O}\!\left( \frac{L(\bar{\tau} + 1)^2}{K} + \frac{\sigma}{\sqrt{K}} \right). \]

The first term is related to the smooth component in the objective function and the existence of time-delays in gradient computations, while the second term is related to the variance in the stochastic gradients. As mentioned in Section 4.1, the accelerated stochastic approximation method proposed in [50] reduces the impact of the smooth component significantly in the absence of asynchrony and achieves the rate

\[ \mathcal{O}\!\left( \frac{L}{K^2} + \frac{\sigma}{\sqrt{K}} \right). \]

Hence, an interesting question is whether an asynchronous version of this method decreases or increases the effects of time-delays on the convergence rate. Answering this question is, however, challenging and non-trivial, since the convergence analysis of accelerated first-order methods even in a deterministic and serial setting is much more involved than that of non-accelerated methods [25].

• Non-i.i.d. sampling. In order to establish our results, we assume that the stochastic oracle can generate i.i.d. samples from the distribution over which we optimize. This assumption is commonly used in the analysis of stochastic optimization algorithms, for example, [30, 38, 48–52, 60, 66, 67, 83–85]. Recently, stochastic gradient methods for non-smooth stochastic optimization were developed for situations in which there is no access to i.i.d. samples from the desired distribution [86]. Under reasonable assumptions on the ergodicity of the stochastic process that generates the samples, [86] obtained convergence rates for serial mirror descent methods. It would be very interesting to extend this result to regularized stochastic optimization with smooth objective functions and investigate the convergence of asynchronous mini-batch algorithms when the random samples are dependent.

• Effect of sparsity. Throughout the thesis, we have not used any assumption on the density of the available data. However, in most applications, even though the dimensions of the optimization problems are large, the density of the data is low. Recent research [87] shows that carefully integrating the sparsity structure of the available data into the analysis helps obtain improved convergence properties. It would be interesting to extend the analyses in this thesis to benefit from the available structure of the data at hand.

BIBLIOGRAPHY

[1] Piyush Kumar Sinha et al. “Enhanced Single-Pass Algorithm for Efficient Indexing Using Hashing in Map Reduce Paradigm”. In: Intelligent Computing, Networking, and Informatics. Springer Nature, Dec. 2013, pp. 1195–1200. DOI: 10.1007/978- 81-322-1665-0_123. [2] Amazon Web Services, Inc. Elastic Compute Cloud (EC2) Cloud Server & Hosting. 2017. URL: https://aws.amazon.com/ec2/. [3] Apache. Hadoop. 2017. URL: https://hadoop.apache.org/. [4] Google. Ngram Viewer. 2017. URL: http://books.google.com/ngrams/. [5] NASA Earth Exchange (NEX). Downscaled Climate Projections (NEX-DCP30). 2017. URL: https://cds.nccs.nasa.gov/nex/. [6] Adam Auton et al. “A global reference for human genetic variation”. In: Nature 526.7571 (Sept. 2015), pp. 68–74. DOI: 10.1038/nature15393. [7] Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and Applica- tions Using Networked Workstations and Parallel Computers (2nd Edition). Pearson, 2004. ISBN: 978-0131405639. [8] Yair Censor and Stavros A. Zenios. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, 1997. ISBN: 0-19-510062-X. [9] Dimitri Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989. ISBN: 0-13-648700-9. [10] Google Cloud Platform. Compute Engine. 2017. URL: https://cloud.google. com/compute/. [11] Microsoft Azure. Cloud Computing Platform & Services. 2017. URL: https:// azure.microsoft.com/. [12] OpenMP. The OpenMP API Specification for Parallel Programming. 2017. URL: http://www.openmp.org/. [13] Standard C++ Foundation. Standard C++. 2017. URL: https://isocpp.org. [14] Julia. The Julia Language. 2017. URL: http://julialang.org/. [15] Open MPI. Open Source High Performance Computing. 2017. URL: https://www. open-mpi.org/.


[16] çMQ. Distributed Messaging. 2017. URL: http://zeromq.org/. [17] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015. ISBN: 1-886529-15-9. [18] Mu Li et al. “Parameter Server for Distributed Machine Learning”. In: Big Learning Workshop, Advances in Neural Information Processing Systems 26 (NIPS). 2013. URL: http://web.archive.org/web/20160304101521/http://www.biglearn. org/2013/files/papers/biglearning2013_submission_2.pdf. [19] Arda Aytekin, Hamid Reza Feyzmahdavian, and Mikael Johansson. “Asynchronous In- cremental Block-Coordinate Descent”. In: 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton). Institute of Electrical and Elec- tronics Engineers (IEEE), Sept. 2014. DOI: 10.1109/allerton.2014.7028430. [20] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “A Delayed Prox- imal Gradient Method with Linear Convergence Rate”. In: 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). Institute of Electrical and Electronics Engineers (IEEE), Sept. 2014. DOI: 10.1109/mlsp.2014.6958872. [21] Arda Aytekin, Hamid Reza Feyzmahdavian, and Mikael Johansson. “Analysis and Im- plementation of an Asynchronous Optimization Algorithm for the Parameter Server”. In: (Oct. 18, 2016). arXiv: 1610.05507v1 [math.OC]. [22] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “An Asyn- chronous Mini-Batch Algorithm for Regularized Stochastic Optimization”. In: IEEE Transactions on Automatic Control (2016), pp. 1–15. DOI: 10.1109/tac.2016. 2525015. [23] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. “An asynchronous mini-batch algorithm for regularized stochastic optimization”. In: 2015 54th IEEE Conference on Decision and Control (CDC). Institute of Electrical and Electronics Engineers (IEEE), Dec. 2015. DOI: 10.1109/cdc.2015.7402404. [24] Yurii Nesterov. “Efficiency of Coordinate Descent Methods on Huge-Scale Optimiza- tion Problems”. In: SIAM Journal on Optimization 22.2 (Jan. 2012), pp. 341–362. DOI: 10.1137/100802001. [25] Yurii Nesterov. Introductory Lectures on Convex Optimization. Springer US, 2004. DOI: 10.1007/978-1-4419-8853-9. [26] Dimitri P. Bertsekas. Nonlinear Programming: 3rd Edition. Athena Scientific, 2016. ISBN: 978-1-886529-05-2. [27] Dimitri P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015. ISBN: 1-886529-28-0. [28] Dimitri P. Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003. ISBN: 1-886529-45-0. [29] Amir Beck and Marc Teboulle. “Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization”. In: Operations Research Letters 31.3 (May 2003), pp. 167–175. DOI: 10.1016/s0167-6377(02)00231-6. BIBLIOGRAPHY 95

[30] A. Nemirovski et al. “Robust Stochastic Approximation Approach to Stochastic Programming”. In: SIAM Journal on Optimization 19.4 (Jan. 2009), pp. 1574–1609. DOI: 10.1137/070704277. [31] Paul Tseng. “Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization”. In: Mathematical Programming 125.2 (Aug. 2010), pp. 263–295. DOI: 10.1007/s10107-010-0394-2. [32] John Duchi et al. “Composite Objective Mirror Descent”. In: Proceedings of the 23rd Annual Conference on Learning Theory (COLT). Ed. by Adam Tauman Kalai and Mehryar Mohri. OmniPress, 2010, pp. 14–26. URL: http://colt2010.haifa.il. ibm.com/papers/057Duchi.pdf. [33] Gong Chen and Marc Teboulle. “Convergence Analysis of a Proximal-Like Mini- mization Algorithm using Bregman Functions”. In: SIAM Journal on Optimization 3.3 (Aug. 1993), pp. 538–543. DOI: 10.1137/0803026. [34] Amir Beck and Luba Tetruashvili. “On the Convergence of Block Coordinate Descent Type Methods”. In: SIAM Journal on Optimization 23.4 (Jan. 2013), pp. 2037–2060. DOI: 10.1137/120887679. [35] Zhaosong Lu and Lin Xiao. “On the Complexity Analysis of Randomized Block- Coordinate Descent Methods”. In: Mathematical Programming 152.1-2 (Aug. 2014), pp. 615–642. DOI: 10.1007/s10107-014-0800-2. [36] Olivier Fercoq and Peter Richtárik. “Accelerated, Parallel, and Proximal Coordinate Descent”. In: SIAM Journal on Optimization 25.4 (Jan. 2015), pp. 1997–2023. DOI: 10.1137/130949993. [37] Ji Liu and Stephen J. Wright. “Asynchronous Stochastic Coordinate Descent: Paral- lelism and Convergence Properties”. In: SIAM Journal on Optimization 25.1 (Jan. 2015), pp. 351–376. DOI: 10.1137/140961134. [38] Alekh Agarwal and John C. Duchi. “Distributed Delayed Stochastic Optimization”. In: Advances in Neural Information Processing Systems 24 (NIPS). Ed. by John Shawe- Taylor et al. Curran Associates, Inc., 2011, pp. 873–881. URL: http://papers.nips. cc/paper/4247-distributed-delayed-stochastic-optimization.pdf. [39] Benjamin Recht et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochas- tic Gradient Descent”. In: Advances in Neural Information Processing Systems 24 (NIPS). Ed. by John Shawe-Taylor et al. Curran Associates, Inc., 2011, pp. 693– 701. URL: http://papers.nips.cc/paper/4390-hogwild-a-lock-free- approach-to-parallelizing-stochastic-gradient-descent.pdf. [40] Angelia Nedić, Dimitri P. Bertsekas, and Vivek S. Borkar. “Distributed Asynchronous Incremental Subgradient Methods”. In: Studies in Computational Mathematics. Else- vier BV, 2001, pp. 381–407. DOI: 10.1016/s1570-579x(01)80023-9. [41] Mark Schmidt and Nicolas Le Roux. “Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition”. In: (Aug. 29, 2013). arXiv: 1308.6370v1 [math.OC]. 96 BIBLIOGRAPHY

[42] Lin Xiao and Tong Zhang. “A Proximal Stochastic Gradient Method with Progressive Variance Reduction”. In: SIAM Journal on Optimization 24.4 (Jan. 2014), pp. 2057– 2075. DOI: 10.1137/140961791. [43] Melanie L. Lenard and Michael Minkoff. “Randomly Generated Test Problems for Positive Definite Quadratic Programming”. In: ACM Transactions on Mathematical Software 10.1 (Jan. 1984), pp. 86–96. DOI: 10.1145/356068.356075. [44] Hui Zou and Trevor Hastie. “Regularization and Variable Selection via the Elastic Net”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (Apr. 2005), pp. 301–320. DOI: 10.1111/j.1467-9868.2005.00503.x. [45] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York, 2009. DOI: 10.1007/978-0-387-84858-7.

[46] Shai Shalev-Shwartz and Ambuj Tewari. “Stochastic Methods for l1-Regularized Loss Minimization”. In: Journal of Machine Learning Research (JMLR) 12 (June 2011), pp. 1865–1892. ISSN: 1533-7928. URL: http://www.jmlr.org/papers/ volume12/shalev-shwartz11a/shalev-shwartz11a.pdf. [47] Herbert Robbins and Sutton Monro. “A Stochastic Approximation Method”. In: Herbert Robbins Selected Papers. Springer Science + Business Media, 1985, pp. 102– 109. DOI: 10.1007/978-1-4612-5110-1_9. [48] Chonghai Hu, Weike Pan, and James T. Kwok. “Accelerated Gradient Methods for Stochastic Optimization and Online Learning”. In: Advances in Neural Infor- mation Processing Systems 22 (NIPS). Ed. by Yoshua Bengio et al. Curran Asso- ciates, Inc., 2009, pp. 781–789. URL: http://papers.nips.cc/paper/3817- accelerated- gradient- methods- for- stochastic- optimization- and- online-learning.pdf. [49] Lin Xiao. “Dual Averaging Method for Regularized Stochastic Learning and Online Optimization”. In: Advances in Neural Information Processing Systems 22 (NIPS). Ed. by Yoshua Bengio et al. Curran Associates, Inc., 2009, pp. 2116–2124. URL: http://papers.nips.cc/paper/3882- dual- averaging- method- for- regularized-stochastic-learning-and-online-optimization.pdf. [50] Guanghui Lan. “An Optimal Method for Stochastic Composite Optimization”. In: Mathematical Programming 133.1-2 (Jan. 2011), pp. 365–397. DOI: 10 . 1007 / s10107-010-0434-y. [51] Saeed Ghadimi and Guanghui Lan. “Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization, I: A Generic Algorithmic Framework”. In: SIAM Journal on Optimization 22.4 (Nov. 2012), pp. 1469–1492. DOI: 10.1137/110848864. BIBLIOGRAPHY 97

[52] Deanna Needell, Rachel Ward, and Nati Srebro. “Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz Algorithm”. In: Advances in Neural In- formation Processing Systems 27 (NIPS). Ed. by Zoubin Ghahramani et al. Curran Associates, Inc., 2014, pp. 1017–1025. URL: http://papers.nips.cc/paper/ 5355 - stochastic - gradient - descent - weighted - sampling - and - the - randomized-kaczmarz-algorithm.pdf. [53] Ilan Lobel and Asuman E. Ozdaglar. “Distributed Subgradient Methods for Convex Optimization over Random Networks”. In: IEEE Transactions on Automatic Control 56.6 (June 2011), pp. 1291–1306. DOI: 10.1109/tac.2010.2091295. [54] Pascal Bianchi and Jérémie Jakubowicz. “Convergence of a Multi-Agent Projected Stochastic Gradient Algorithm for Non-Convex Optimization”. In: IEEE Transactions on Automatic Control 58.2 (Feb. 2013), pp. 391–405. DOI: 10.1109/tac.2012. 2209984. [55] Benjamin Recht and Christopher Ré. “Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion”. In: Mathematical Programming Computation 5.2 (Apr. 2013), pp. 201–226. DOI: 10.1007/s12532-013-0053-8. [56] Martin Jaggi et al. “Communication-Efficient Distributed Dual Coordinate Ascent”. In: Advances in Neural Information Processing Systems 27 (NIPS). Ed. by Zoubin Ghahramani et al. Curran Associates, Inc., 2014, pp. 3068–3076. URL: http:// papers.nips.cc/paper/5599-communication-efficient-distributed- dual-coordinate-ascent.pdf. [57] Peter Richtárik and Martin Takáč. “Parallel Coordinate Descent Methods for Big Data Optimization”. In: Mathematical Programming 156.1-2 (Apr. 2015), pp. 433–484. DOI: 10.1007/s10107-015-0901-6. [58] Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. In: Advances in Neural Information Processing Systems 25 (NIPS). Ed. by Fernando Pereira et al. Curran Associates, Inc., 2012, pp. 1223–1231. URL: http://papers.nips.cc/paper/ 4687-large-scale-distributed-deep-networks.pdf. [59] Trishul Chilimbi et al. “Project Adam: Building an Efficient and Scalable Deep Learning Training System”. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Oct. 2014, pp. 571– 582. ISBN: 978-1-931971-16-4. URL: https://www.usenix.org/conference/ osdi14/technical-sessions/presentation/chilimbi. [60] Ofer Dekel et al. “Optimal Distributed Online Prediction using Mini-Batches”. In: Journal of Machine Learning Research (JMLR) 13 (2012), pp. 165–202. ISSN: 1533- 7928. URL: http://www.jmlr.org/papers/volume13/dekel12a/dekel12a. pdf. [61] Mert Gurbuzbalaban, Asuman E. Ozdaglar, and Pablo Parrilo. “On the Convergence Rate of Incremental Aggregated Gradient Algorithms”. In: (June 5, 2015). arXiv: 1506.02081v1 [math.OC]. 98 BIBLIOGRAPHY

[62] Ralph T. Rockafellar and Roger J. B. Wets. “On the Interchange of Subdifferentiation and Conditional Expectation for Convex Functionals”. In: Stochastics 7.3 (Jan. 1982), pp. 173–182. DOI: 10.1080/17442508208833217. [63] Ji Liu et al. “An Asynchronous Parallel Stochastic Coordinate Descent Algorithm”. In: Journal of Machine Learning Research (JMLR) 16 (2015), pp. 285–322. ISSN: 1533-7928. URL: http://jmlr.org/papers/volume16/liu15a/liu15a.pdf. [64] K. I. Tsianos and M. G. Rabbat. “Distributed Dual Averaging for Convex Opti- mization under Communication Delays”. In: 2012 American Control Conference (ACC). Institute of Electrical and Electronics Engineers (IEEE), June 2012. DOI: 10.1109/acc.2012.6315289. [65] Brendan McMahan and Matthew Streeter. “Delay-Tolerant Algorithms for Asyn- chronous Distributed Online Learning”. In: Advances in Neural Information Pro- cessing Systems 27 (NIPS). Ed. by Zoubin Ghahramani et al. Curran Associates, Inc., 2014, pp. 2915–2923. URL: http : / / papers . nips . cc / paper / 5242 - delay-tolerant-algorithms-for-asynchronous-distributed-online- learning.pdf. [66] Saeed Ghadimi and Guanghui Lan. “Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization, II: Shrinking Procedures and Optimal Algorithms”. In: SIAM Journal on Optimization 23.4 (Jan. 2013), pp. 2061– 2089. DOI: 10.1137/110848876. [67] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. “Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization”. In: Proceedings of the 29th International Conference on Machine Learning (ICML). Ed. by John Langford and Joelle Pineau. OmniPress, 2012, pp. 449–456. ISBN: 978-1-4503-1285-1. URL: http: //www.icml.cc/2012/papers/261.pdf. [68] Angelia Nedić and Soomin Lee. “On Stochastic Subgradient Mirror-Descent Algo- rithm with Weighted Averaging”. In: SIAM Journal on Optimization 24.1 (Jan. 2014), pp. 84–107. DOI: 10.1137/120894464. [69] Dimitri P. Bertsekas. “Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey”. In: (July 3, 2015). arXiv: 1507.01030v1 [cs.SY]. [70] Mikhail V. Solodov. “Incremental Gradient Algorithms with Stepsizes Bounded Away from Zero”. In: Computational Optimization and Applications 11.1 (1998), pp. 23–35. ISSN: 1573-2894. DOI: 10.1023/a:1018366000512. [71] Doron Blatt, Alfred O. Hero, and Hillel Gauchman. “A Convergent Incremental Gradient Method with a Constant Step Size”. In: SIAM Journal on Optimization 18.1 (Jan. 2007), pp. 29–51. DOI: 10.1137/040615961. [72] Nuri D. Vanli, Mert Gurbuzbalaban, and Asuman E. Ozdaglar. “Global Convergence Rate of Proximal Incremental Aggregated Gradient Methods”. In: (Aug. 4, 2016). arXiv: 1608.01713v1 [math.OC]. BIBLIOGRAPHY 99

[73] Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing Finite Sums with the Stochastic Average Gradient”. In: (Sept. 10, 2013). arXiv: 1309.2388v2 [math.OC]. [74] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A Fast Incre- mental Gradient Method with Support for Non-Strongly Convex Composite Ob- jectives”. In: Advances in Neural Information Processing Systems 27 (NIPS). Ed. by Zoubin Ghahramani et al. Curran Associates, Inc., 2014, pp. 1646–1654. URL: http : / / papers . nips . cc / paper / 5258 - saga - a - fast - incremental - gradient-method-with-support-for-non-strongly-convex-composite- objectives.pdf. [75] Julien Mairal. “Incremental Majorization-Minimization Optimization with Applica- tion to Large-Scale Machine Learning”. In: SIAM Journal on Optimization 25.2 (Jan. 2015), pp. 829–855. DOI: 10.1137/140957639. [76] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Oper- ator Theory in Hilbert Spaces. Springer New York, 2011. DOI: 10.1007/978-1- 4419-9467-7. [77] David D. Lewis et al. “RCV1: A New Benchmark Collection for Text Categorization Research”. In: Journal of Machine Learning Research (JMLR) 5 (2004), pp. 361–397. ISSN: 1532-4435. URL: http://www.jmlr.org/papers/volume5/lewis04a/ lewis04a.pdf. [78] TU Berlin. Pascal Large Scale Learning Challenge. 2017. URL: http://largescale. ml.tu-berlin.de/. [79] Harald Kosch, László Böszörményi, and Hermann Hellwagner, eds. Euro-Par 2003 Parallel Processing. Springer Berlin Heidelberg, 2003. DOI: 10.1007/b12024. [80] Henri Casanova, Yves Robert, and Arnaud Legrand. Parallel Algorithms. Taylor & Francis Inc, July 17, 2008. 360 pp. ISBN: 1584889454. [81] Justin Ma et al. “Identifying suspicious URLs: An Application of Large-Scale Online Learning”. In: Proceedings of the 26th International Conference on Machine Learn- ing (ICML). Association for Computing Machinery (ACM), 2009. DOI: 10.1145/ 1553374.1553462. [82] Andrew Cotter et al. “Better Mini-Batch Algorithms via Accelerated Gradient Meth- ods”. In: Advances in Neural Information Processing Systems 24 (NIPS). Ed. by John Shawe-Taylor et al. Curran Associates, Inc., 2011, pp. 1647–1655. URL: http: //papers.nips.cc/paper/4432-better-mini-batch-algorithms-via- accelerated-gradient-methods.pdf. [83] Yurii Nesterov. “Primal-Dual Subgradient Methods for Convex Problems”. In: Math- ematical Programming 120.1 (June 2007), pp. 221–259. DOI: 10.1007/s10107- 007-0149-x. 100 BIBLIOGRAPHY

[84] Xi Chen, Qihang Lin, and Javier Pena. “Optimal Regularized Dual Averaging Methods for Stochastic Optimization”. In: Advances in Neural Information Processing Systems 25 (NIPS). Ed. by Fernando Pereira et al. Curran Associates, Inc., 2012, pp. 395–403. URL: http://papers.nips.cc/paper/4543-optimal-regularized-dual- averaging-methods-for-stochastic-optimization.pdf. [85] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. “Randomized Smoothing for Stochastic Optimization”. In: SIAM Journal on Optimization 22.2 (Jan. 2012), pp. 674–701. DOI: 10.1137/110831659. [86] John C. Duchi et al. “Ergodic Mirror Descent”. In: SIAM Journal on Optimization 22.4 (Dec. 2012), pp. 1549–1578. DOI: 10.1137/110836043. [87] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. “Asaga: Asynchronous Parallel Saga”. In: (June 15, 2016). arXiv: 1606.04809v1 [math.OC].