Dynamic Parameter Allocation in Parameter Servers

Dynamic Parameter Allocation in Parameter Servers Alexander Renz-Wieland 1, Rainer Gemulla 2, Steffen Zeuch 1;3, Volker Markl 1;3 1Technische Universitat¨ Berlin, 2Universitat¨ Mannheim, 3German Research Center for Artificial Intelligence 1fi[email protected], [email protected], 3fi[email protected] ABSTRACT Classic PS (PS−Lite) To keep up with increasing dataset sizes and model complex- 4.5h ity, distributed training has become a necessity for large ma- 4h 200 chine learning tasks. Parameter servers ease the implemen- Classic PS tation of distributed parameter management|a key con- with fast local cern in distributed training|, but can induce severe com- access 2.4h munication overhead. To reduce communication overhead, 100 ● Dynamic Allocation PS (Lapse), distributed machine learning algorithms use techniques to 1.5h incl. fast local access increase parameter access locality (PAL), achieving up to 1.2h ● ● linear speed-ups. We found that existing parameter servers 0.6h ● 0 provide only limited support for PAL techniques, however, Epoch run time in minutes 0.4h 0.2h and therefore prevent efficient training. In this paper, we 1x4 2x4 4x4 8x4 explore whether and to what extent PAL techniques can Parallelism (nodes x threads) be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such Figure 1: Parameter server (PS) performance for a large knowledge graph embeddings task (RESCAL, dimen- a parameter server called Lapse, and experimentally com- pare its performance to existing parameter servers across a sion 100). The performance of the classic PSs falls behind the performance of a single node due to communication over- number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude head. In contrast, dynamic parameter allocation enables faster than existing parameter servers. Lapse to scale near-linearly. Details in Section 4.1. PVLDB Reference Format: Alexander Renz-Wieland, Rainer Gemulla, Steffen Zeuch, Volker primitives or delegate parameter management to a parame- Markl. Dynamic Parameter Allocation in Parameter Servers. ter server (PS). PSs provide primitives for reading and writ- PVLDB, 13(11): 1877-1890, 2020. ing parameters and handle partitioning and synchronization DOI: https://doi.org/10.14778/3407790.3407796 across nodes. Many ML stacks use PSs as a component, e.g., TensorFlow [1], MXNet [8], PyTorch BigGraph [27], 1. INTRODUCTION STRADS [24], STRADS-AP [23], or Project Adam [9], and To keep up with increasing dataset sizes and model com- there exist multiple standalone PSs, e.g., Petuum [17], PS- plexity, distributed training has become a necessity for large Lite [28], Angel [22], FlexPS [18], Glint [20], and PS2 [64]. machine learning (ML) tasks. Distributed ML allows (1) for As parameters are accessed by multiple nodes in the clus- models and data larger than the memory of a single ma- ter and therefore need to be transferred between nodes, dis- chine, and (2) for faster training by leveraging distributed tributed ML algorithms may suffer from severe communica- compute. In distributed ML, both training data and model tion overhead when compared to single-machine implemen- parameters are partitioned across a compute cluster. Each tations. Figure 1 shows exemplarily that the performance node in the cluster usually accesses only its local part of the of a distributed ML algorithm may fall behind the perfor- training data, but reads and/or updates most of the model mance of single machine algorithms when a classic PS such parameters. Parameter management is thus a key concern as PS-Lite is used. To reduce the impact of communica- in distributed ML. Applications either manage model pa- tion, distributed ML algorithms employ techniques [14, 63, rameters manually using low-level distributed programming 56, 4, 27, 44, 62, 41, 33, 15, 36] that increase parameter access locality (PAL) and can achieve linear speed-ups. In- tuitively, PAL techniques ensure that most parameter ac- This work is licensed under the Creative Commons Attribution- cesses do not require (synchronous) communication; exam- NonCommercial-NoDerivatives 4.0 International License. To view a copy ple techniques include exploiting natural clustering of data, of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For parameter blocking, and latency hiding. Algorithms that any use beyond those covered by this license, obtain permission by emailing use PAL techniques typically manage parameters manually [email protected]. Copyright is held by the owner/author(s). Publication rights using low-level distributed programming primitives. licensed to the VLDB Endowment. Most existing PSs are easy to use|there is no need for Proceedings of the VLDB Endowment, Vol. 13, No. 11 ISSN 2150-8097. low-level distributed programming|, but provide only lim- DOI: https://doi.org/10.14778/3407790.3407796 ited support for PAL techniques. One limitation, for exam- 1877 Table 1: Per-key consistency guarantees of PS architec- server thread tures, using representatives for types: PS-Lite [28] for classic worker thread 1 and Petuum [59] for stale. worker thread 2 parameters worker thread 3 Parameter Server Classic Lapse Stale process at node 1 Synchronous sync async sync async sync, async Location caches off on server thread server thread worker thread 1 worker thread 1 Eventual XXXXXX a b b PRAM [30] XX XX × X worker thread 2 worker thread 2 b b Causal [19] XX XX × × parameters worker thread 3 parameters worker thread 3 b b Sequential [26] XX XX × × process at node 2 process at node 3 Serializability × × × × × × a I.e., monotonic reads, monotonic writes, and read your writes b Figure 2: PS architecture with server and worker threads Assuming that the network layer preserves message order (which is true for Lapse and PS-Lite) co-located in one process per node. Lapse employs this architecture. to what extent it is supported in existing PSs and iden- tify which features would be required to enable or improve ple, is that they allocate parameters statically. Moreover, support. Finally, we introduce DPA, which enables PSs to existing approaches to reducing communication overhead in exploit PAL techniques directly (Section 2.3). PSs provide only limited scalability compared to using PAL techniques (e.g., replication and bounded staleness [17, 11]) 2.1 Basic PS Architectures or are not applicable to the ML algorithms that we study PSs [53, 2, 12, 17, 28] partition the model parameters (e.g., dynamically reducing cluster size [18]). across a set of servers. The training data are usually parti- In this paper, we explore whether and to what extent tioned across a set of workers. During training, each worker PAL techniques can be supported in PSs, and whether such processes its local part of the training data (often multiple support is beneficial. To improve PS performance and suit- times) and continuously reads and updates model parame- ability, we propose to integrate dynamic parameter alloca- ters. To coordinate parameter accesses across workers, each tion (DPA) into PSs. DPA dynamically allocates parame- parameter is assigned a unique key and the PS provides pull ters where they are accessed, while providing location trans- and push primitives for reads and writes, respectively; cf. Ta- parency and PS consistency guarantees, i.e., sequential con- ble 2. Both operations can be performed synchronously or sistency. By doing so, PAL techniques can be exploited di- asynchronously. The push operation is usually cumulative, rectly. We discuss design options for PSs with DPA and de- i.e., the client sends an update term to the PS, which then scribe an efficient implementation of such a PS called Lapse. adds this term to the parameter value. Figure 1 shows the performance of Lapse for the task of Although servers and workers may reside on different ma- training knowledge graph embeddings [40] using data clus- chines, they are often co-located for efficiency reasons (es- tering and latency hiding PAL techniques. In contrast to pecially when PAL techniques are used). Some architec- classic PSs, Lapse outperformed the single-machine base- tures [28, 20, 22] run one server process and one or more line and showed near-linear speed-ups. In our experimen- worker processes on each machine, others [17, 18] use a tal study, we observed similar results for multiple other ML single process with one server thread and multiple worker tasks (matrix factorization and word vectors): the classic PS threads to reduce inter-process communication. Figure 2 approach barely outperformed the single-machine baseline, depicts such a PS architecture with one server and three whereas Lapse scaled near-linearly, with speed-ups of up to worker threads per node. two orders of magnitude compared to classic PSs and up In the classic PS architecture, parameters are statically to one order of magnitude compared to state-of-the-art PSs. allocated to servers (e.g., via a range partitioning of the pa- Figure 1 further shows that|although critical to the perfor- rameter keys) and there is no replication. Thus precisely one mance of Lapse|fast local access alone does not alleviate server holds the current value of a parameter, and this server the communication overhead of the classic PS approach. is used for all pull and push operations on this parameter. In summary, our contributions are as follows. (i) We ex- Classic PSs typically guarantee sequential consistency [26] amine whether and to what extent existing PSs support us- for operations on the same key. This means that (1) each ing PAL techniques to reduce communication overhead. (ii) worker's operations are executed in the order specified by We propose to integrate DPA into PSs to be able to sup- the worker, and (2) the result of any execution is equivalent port PAL techniques directly.

Load more