Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Maxim Naumov∗, John Kim†, Dheevatsa Mudigere‡, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang and Mikhail Smelyanskiy

Facebook, 1 Hacker Way, Menlo Park, CA

Abstract—Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion – Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

∗[email protected], †[email protected], ‡[email protected]
†Currently at Korea Advanced Institute of Science and Technology (KAIST). Work done while at Facebook.

I. INTRODUCTION

Artificial intelligence (AI) applications are rapidly evolving and increasing the demands on hardware and systems. Machine learning (ML), and deep learning (DL) in particular, has been one of the driving forces behind the remarkable progress in AI and has become one of the most demanding workloads in terms of compute infrastructure in the data centers [8], [15], [27], [46]. Moreover, the continued growth of DL models in terms of complexity, coupled with the significant slowdown in transistor scaling, has necessitated going beyond traditional general-purpose processors and developing specialized hardware with holistic system-level solutions to improve performance, power, and efficiency [11], [33].

Within Facebook, DL is used across many social network services, including computer vision, i.e. image classification and object detection, as well as video understanding. In addition, it is used for natural language processing, i.e. translation and content understanding. However, some of the most important DL models within Facebook are the recommendation models used for ranking and click through rate (CTR) prediction, including News Feed and search services [24].

The use of DL models is often split into inference and training work categories [4], [26]. Details of inference at Facebook have been discussed earlier [23], [40]; in comparison, we address the challenges in training and, in particular, the scale-out requirements of the deep learning recommendation models (DLRMs) at Facebook [38].

Fig. 1. (a) Server compute demand for training and (b) number of distributed training workflows across Facebook data centers.

The increase in compute in Facebook data centers from training is shown in Figure 1(a). Across a period of 18 months, there has been over a 4× increase in the amount of compute resources utilized by training workloads. In addition, the number of workloads submitted for distributed training, as shown in Figure 1(b), has increased at an even higher rate – resulting in up to a 7× increase in the number of training workflows. Thus, the demand for training deep learning workloads is continuing to increase while the compute necessary to support it is also increasing proportionally.

The prior training platform from Facebook, e.g. Big Basin [31], consisted of GPUs. However, it did not leverage other accelerators and only had support for a limited number of CPUs. In contrast, the Zion next-generation training platform incorporates 8 CPU sockets, with a modular design having separate sub-components for CPUs (Angels Landing 8-socket system) and accelerators (Emeralds Pools 8-accelerator system). This provides sufficient general-purpose compute and, more importantly, additional memory capacity.

Fig. 2. High-level overview of DLRM.

Fig. 3. Increase in model complexity over time.

Zion also introduced the common form factor OCP Accelerator Module (OAM) [49]^1, which has been adopted by leading GPU vendors such as NVidia, AMD and Intel, as well as startups such as Habana, which was recently acquired by Intel. This is important for enabling consumers such as Facebook to build vendor-agnostic accelerator-based systems.

^1 Proposed and developed as part of the Open Compute Project (OCP).

In this work, we provide an overview of the DLRM workloads, including the description and analysis of
• Training at Facebook
• Zion hardware platform
• Impact on the accelerator fabric design
• Implications for future scale-out systems

Fig. 4. Communication patterns that are common in (a) data- and (b) model-parallelism. Both communication patterns need to be supported in DLRM.

II. BACKGROUND

A. Recommendation Model

Neural network-based recommendation models, which are used to address personalization for different services, have become an important class of DL algorithms within Facebook. A high-level block overview of a typical recommendation model is shown in Figure 2, while its implementation in the PyTorch framework has been publicly released as DLRM [38].

The inputs to the recommendation model include both dense and sparse features. The dense or continuous features are processed with a bottom multilayer perceptron (MLP), while the sparse or categorical features are processed using embeddings. The second-order interactions of different features are computed explicitly. Finally, the results are processed with a top MLP and fed into a sigmoid function in order to provide a probability of a click.
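For concreteness, the following is a minimal PyTorch-style sketch of this structure. The layer sizes, number of tables and the pairwise dot-product interaction are illustrative assumptions only; the open-source DLRM code [38] is the reference implementation.

```python
# Minimal DLRM-style model sketch (illustrative sizes, not a production config).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_dense=13, table_sizes=(1000, 1000, 1000), dim=16):
        super().__init__()
        # bottom MLP processes the dense (continuous) features
        self.bot_mlp = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        # one embedding table per sparse (categorical) feature
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(n, dim, mode="sum") for n in table_sizes]
        )
        # top MLP consumes the dense vector plus all pairwise dot-product interactions
        num_vecs = 1 + len(table_sizes)
        num_int = num_vecs * (num_vecs - 1) // 2
        self.top_mlp = nn.Sequential(nn.Linear(dim + num_int, 1))

    def forward(self, dense_x, sparse_ids, sparse_offsets):
        x = self.bot_mlp(dense_x)                              # [batch, dim]
        emb = [t(ids, off) for t, ids, off in
               zip(self.tables, sparse_ids, sparse_offsets)]   # list of [batch, dim]
        vecs = torch.stack([x] + emb, dim=1)                   # [batch, num_vecs, dim]
        inter = torch.bmm(vecs, vecs.transpose(1, 2))          # explicit 2nd-order interactions
        i, j = torch.triu_indices(vecs.size(1), vecs.size(1), offset=1)
        feats = torch.cat([x, inter[:, i, j]], dim=1)
        return torch.sigmoid(self.top_mlp(feats))              # click probability
```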
B. Increase in Complexity

In general, the model complexity in terms of the number of parameters increases by more than 2× over 2 years, as shown in Figure 3. Notice that the increase is not monotonic, because in order to alleviate pressure on the training, inference and serving resources, we are constantly evaluating novel techniques to improve efficiency, such as quantization and compression. However, over time the newly available complexity budget is often reused to improve and augment the more efficient model, therefore driving its size up again.

C. Distributed Training

Distributed training becomes particularly important in the context of increasing model sizes. It may rely on two types of parallelism: data- and model-parallelism. The former enables faster training as each worker trains on different data, while the latter enables a bigger model to train on the same data.

In data-parallelism: The input data samples are distributed across different nodes. Each node processes the input data independently with a replica of the parameters, and the local parameter updates are aggregated into a global update on all the nodes. This requires communicating only the updates between the nodes, but the communication volume increases due to replication as the number of nodes increases. Therefore, scaling out requires a large enough mini-batch size to provide sufficient parallelism and computation to hide the communication overhead.

In model-parallelism: The model weights corresponding to neural network layers are distributed across multiple nodes. Each node processes the entire mini-batch of data and communicates the activations forward or the error gradients backwards to other nodes. This introduces additional synchronization across all nodes after each distributed layer in the forward and backward pass. However, it allows us to fit the model into the aggregate memory of all distributed nodes.

Note that a single embedding table contains tens of millions of vectors, each with hundreds of elements. It requires significant memory capacity, on the order of GBs. Therefore, an embedding table often exists as a single instance and can not be replicated on multiple devices or nodes. In contrast, MLPs are relatively small and can be replicated many times. Therefore, we will leverage both types of parallelism while training DLRMs.
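As a rough, illustrative calculation (the numbers are hypothetical rather than a specific production table): a single table with 50 million embedding vectors of 128 single-precision elements already occupies

5 \times 10^{7} \cdot 128 \cdot 4\,\text{B} \approx 25.6\,\text{GB},

which is why such tables are sharded across devices (model-parallelism) rather than replicated (data-parallelism).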


Fig. 5. Performance as the number of trainers is increased.

Fig. 6. Quality as the number of trainers is increased.

III. TRAINING AT FACEBOOK

A. Overview of Training at Facebook

The training of recommendation models often requires distributing the model across multiple devices within a single node or across multiple nodes, hence requiring both data- and model-parallelism to scale the performance of training [38]. The distributed training can be performed using a combination of synchronous algorithms that produce results equivalent to a sequential run of the model [18], [39] or asynchronous algorithms that scale to a larger number of nodes [19], [42], [50]. In general, the asynchronous algorithms can perform a single step of forward and backward propagation faster, but may require more steps to achieve convergence or even converge to a sub-optimal minimum [9], [51].

Synchronous training relies on collective communication to achieve the model- and data-parallelism. The compute intensive MLPs are replicated across devices and work on parts of the mini-batch data samples. Notice that training of MLPs requires an allreduce communication primitive to synchronize their weights during backward propagation, as shown in Figure 4(a). Further, a model may contain tens of embedding tables, which can not be replicated due to memory capacity constraints. These tables are often distributed across devices and each of them processes an entire mini-batch of lookups. Then, an embedding lookup produces several vectors corresponding to the elements in the mini-batch. Let us use the same color to denote vectors resulting from a single embedding table. The need to exchange these vectors in the forward pass and their gradients in the backward pass gives rise to the alltoall communication primitive, as shown in Figure 4(b). In this setting, embedding lookups take advantage of the aggregate memory bandwidth across devices, while the full model exercises the interconnect, because multiple lookup results need to be communicated across it in order to be exchanged and computed on at once.

Asynchronous training is well suited for a disaggregated design with dedicated parameter servers and trainers that rely on point-to-point send/recv communication or Remote Procedure Calls (RPC) [10]. The compute intensive MLPs are replicated on different training processes and perform local weight updates based on the data samples they receive individually, only occasionally synchronizing with a master copy of the weights stored on the parameter server. The embedding tables are not replicated due to memory constraints, but are assigned to different training processes that receive asynchronous updates throughout training. Notice that because we use indices to access only a few of the embedding vectors in the forward pass, the simultaneous updates to an embedding table in the backward pass only collide when the indices used in the sparse lookups overlap between them.

It should be noted that both or a combination of these training algorithms can be used across different platforms, such as a single node with multiple devices, multiple nodes, or a disaggregated setup of multiple trainers and parameter servers.
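The synchronous mapping described above translates directly into standard collectives. The sketch below is schematic only and is not Facebook's training stack: it assumes torch.distributed has already been initialized with a backend that supports all_to_all_single (e.g. NCCL), one embedding-table shard per rank, and a replicated MLP whose gradients are averaged after the backward pass.

```python
# Schematic of per-iteration communication in hybrid-parallel DLRM training:
# alltoall exchanges embedding lookup results (model-parallel tables), while
# allreduce averages the gradients of the replicated dense MLPs (data-parallel).
import torch
import torch.distributed as dist

def exchange_embedding_lookups(local_lookups, dim):
    # local_lookups: [world_size * local_batch, dim] rows produced by the
    # embedding shard on this rank, laid out contiguously by destination rank.
    world = dist.get_world_size()
    received = torch.empty_like(local_lookups)
    dist.all_to_all_single(received, local_lookups)   # model-parallel exchange
    return received.view(world, -1, dim)              # grouped by source rank

def allreduce_dense_gradients(mlp):
    # data-parallel synchronization of the replicated MLP gradients
    world = dist.get_world_size()
    for param in mlp.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad)               # sum across replicas
            param.grad /= world                       # average
```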
B. Scalability & Challenges

The throughput of model training is important for fast prototyping and iterating on new ideas during model development. A single machine is not able to provide the throughput we need for our large recommendation models, and therefore we are heavily investing in scaling distributed training.

Figure 5 shows the training throughput of one of our recent DLRMs. Here we use asynchronous training to avoid being bottlenecked by slow machines and/or interconnects. Also, we make sure that the synchronization of the dense parameters is frequent enough, so that models will not diverge on different machines. As a result, we observe that the training throughput scales almost linearly with the number of hosts we use in one job, while model quality remains in an acceptable range, as shown in Figure 6.

In our experience, asynchronous training works well when a limited number of trainers is used, but with an increasing number of trainers we must incorporate synchronous training as an option into our system and hardware platforms.


Fig. 7. (a) High-level overview of Zion system integration and (b) detailed block diagram of the Zion platform.

C. Interconnection Networks

Aside from the standard send/recv communication primitives, the data- and model-parallelism used in distributed training give rise to two types of communication patterns: (i) the allreduce operation that is often used to aggregate local parameter updates in the backward pass and (ii) the alltoall operation that can be used to exchange the activations and error gradients across multiple nodes with different model weights.

The efficient synchronous or asynchronous implementation of these primitives relies on the support from the router or switch microarchitecture and the underlying interconnection network. As technology evolved through the years, the interconnection network fabric has varied. For example, the high performance computing (HPC) interconnects commonly used in the late 80s or 90s were often based on low-radix topologies, such as 2D or 3D mesh or torus networks [6], [16]. As the pin bandwidth has increased, it has been shown that the full bandwidth can be effectively utilized by partitioning it across an increasing number of ports, resulting in high-radix topologies [29] that were designed to reduce network diameter and cost [30]. Also, it is important to note that the HPC community has often relied on custom fabrics and protocols, while in the data centers, in order to drive down costs, commodity interconnects are much more prevalent [7].

IV. ZION SCALE-UP TRAINING

A. Overview

The building blocks of the Zion system consist of CPUs, accelerators and a flexible fabric that provides high performance while interconnecting these components [44]. Specifically, Zion decouples the memory, compute, and network components of the system, allowing each to scale independently, as shown in Figure 7. The baseline Zion system provides 8 NUMA CPU sockets with a large pool of DDR memory for capacity and 8 accelerators to provide high compute capacity and also high bandwidth memory.

One of the challenges with leveraging accelerators for a training platform is determining which ones to use, given the large number of them that are becoming available. It is infeasible to develop and enable a unique system for each one of the different accelerators. As a result, the Facebook-led Open Accelerator Infrastructure (OAI) initiative [3] proposed to define a vendor-agnostic common accelerator infrastructure, including a standard accelerator form factor, the Open Accelerator Module (OAM), that has been open sourced to the hardware community. The OAM form factor abstracts the various requirements to make it solution-agnostic and is defined as a common form factor that can be adopted by different accelerator vendors. The common form factor along with the baseboard enables using multiple accelerator alternatives with the same system design.

TABLE I
ZION DEVICE COMPARISONS.

                                            CPU     Accelerator
# of devices                                8       8
Compute in FP32 (TFlops), aggregate         ∼20     ∼100
Compute in FP16/BF16 (TFlops), aggregate    ∼50     ∼1000
Memory Capacity (TB), aggregate             ∼2      ∼0.2
Memory Bandwidth (TB/s), aggregate          ∼1      ∼8
Power (Watts), per device                   ∼100    ∼200

The comparison between the characteristics of typical CPUs and accelerators is shown in Table I. Notice that while the number of CPUs and accelerators in the system is the same, their compute and memory capabilities differ significantly. For example, the accelerator provides one or even two orders of magnitude higher compute (i.e., TFlops) as well as almost an order of magnitude higher memory bandwidth, all of which come at the cost of higher power. Also, accelerators rely on high-bandwidth memory (HBM) that does not necessarily provide the same capacity as the DDR memory used with the CPUs – thus, the CPUs provide an order of magnitude higher memory capacity.

Fig. 9. Different topology design space for an 8-node system, ranging from low-radix (ring, mesh, torus, hybrid cube mesh) to high-radix (twisted hypercube, fully-connected) topologies.

TABLE II
HIGH-LEVEL COMPARISON OF INTERCONNECTION NETWORKS.

                 High Performance Interconnect              Accelerator Fabric
Topology         low-diameter, high (bisection) bandwidth   high (node) bandwidth
Routing          adaptive routing                           deterministic routing
Flow Control     cut-through                                store & forward
Fabric Design    router centric                             node centric

Fig. 8. Accelerator fabric interconnecting layout in Zion.

B. Accelerator Fabric Topology

Another big challenge in designing a common multi-source accelerator platform is the interconnect, because different vendors have distinct solutions utilizing various topologies, protocols, and numbers of lanes/links. The Zion platform consists of 3 different types of interconnect fabrics – the CPU fabric, the accelerator fabric and the PCIe interconnect that provides connectivity between CPUs and accelerators, see Figure 7. The CPU fabric options are limited by the vendors, but both PCIe and the accelerator interconnect are flexible, allowing for co-design to meet application needs.

The main components of any interconnection network include the topology, routing, flow control, and router micro-architecture [17]. Table II summarizes how these components differ between conventional high-performance and accelerator fabric interconnection networks. Overall the objectives of the two fabrics are slightly different – HPC systems require low latency (and low network diameter) but also high bisection bandwidth. In comparison, accelerator fabrics are less latency sensitive but often require high node-to-node bandwidth to efficiently support collective communication.

The router micro-architecture impacts the topology and other components of the network. In HPC systems, given a topology, the routing algorithm, especially adaptive routing, is critical to exploit path diversity and maximize performance; however, most accelerator fabrics do not have hardware routers and therefore routing algorithms are not as critical because of the deterministic communication pattern.

In an accelerator fabric without hardware support for communication, flow control often resembles "store & forward" rather than "cut-through", since data is copied from one node to its neighboring node before being transmitted downstream. All nodes perform the same flow control at the same time, which results in high utilization across all nodes.

In addition to software, there are many other physical design constraints for vendor-agnostic topology design, because each offering differs on many aspects and there is no established standard. An example of one such constraint is the accelerator interconnect link mapping: for some offerings the links may fan out on both sides, while in other cases they can be only on one side of the accelerator chip. The baseboard design presents routing length and other challenges to deal with the interconnect link insertion loss, as well as PCB layer/cost considerations. Figure 8 is an illustration of a top-down view of the OAM layout in Zion showing the connections between each OAM, with each colored link representing a different routing layer. As per the OAM specification, each module has two mezzanine connectors (Conn 0/1) at the south and north of the module and a 1×16 PCIe link to the host (port P). Further, each module has 7 ×16 interconnect links, shown in Figure 8 as ports numbered from 1 through 7; among these, 1−6 are ×16 links each and 7 is split into 2 ×8 links to accommodate the different topologies shown in Figure 9.

C. Communication Pattern

The Zion platform is designed to support both asynchronous and synchronous training algorithms, including support for model- and data-parallelism. To support both types of parallelism we rely on the optimized allreduce and alltoall communication primitives implemented over the accelerator fabric topology.

To understand the impact of the accelerator topology on these communication primitives, we provide an analytical model comparing the added latency of going through a node for a low-radix ring topology and a high-radix fully-connected (FC) topology for the 8-node system shown in Figure 10. We plot the performance ratio between the execution times predicted by the model for the ring and FC topologies, ignoring the additional software overheads often found in practice [32]. In our model we fix the per-node bandwidth B, and vary the message size M and per-node latency α, to simulate the traffic pattern.


Fig. 11. System flexibility by leveraging only (a) a single 2-socket CPU system and (b) four 2-socket CPUs within the Zion platform. Even with a single 2-socket system, the interconnect of the PCIe switches enables full connectivity between the CPUs and the accelerators.

The execution time for allreduce (T^{AR}) for the two topologies, using a ring algorithm, can be summarized as

T^{AR}_{ring} = 2 \frac{M}{B} \frac{p-1}{p} + 2\alpha(p-1)    (1)

T^{AR}_{FC} = 2 \frac{M}{b} \frac{1}{p} + 2\alpha    (2)

where p is the number of nodes and b is the bandwidth per channel. In the comparison, p = 8 and we assume the total bandwidth from each accelerator node is constant – thus, the ring has "fatter" channels while the fully-connected topology has "thinner" channels. Hence, the bandwidth per channel in the FC is effectively b = B/(p − 1). By plugging this value into T^{AR}_{FC}, the bandwidth component of performance across the two topologies can be seen to be identical.

Fig. 10. Analytical performance model comparing ring vs fully connected topology for (a) the allreduce and (b) the alltoall communication primitive.

Figure 10(a) plots the ratio T^{AR}_{ring}/T^{AR}_{FC}. Note that a ratio higher than 1 represents the region where FC outperforms the ring topology. As the message size increases, the bandwidth component dominates and therefore the physical topology has no impact. However, as the message size decreases and the per-node latency increases, the latency component dominates and therefore the performance benefit from the FC increases significantly (it minimizes the network hop count).

The execution time for alltoall (T^{A2A}) for the two topologies, using a ring algorithm, can be summarized as

T^{A2A}_{ring} = \gamma \left( \frac{M}{Bp} + \alpha \right)    (3)

T^{A2A}_{FC} = \frac{M}{B} \frac{p-1}{p} + \alpha    (4)

where \gamma = 2 \sum_{h=1}^{q} h + r \frac{p}{2}, with quotient q = (p − 1)//2, remainder r = (p − 1)%2 and h the number of hops between nodes.

For the FC topology, the cost of T^{A2A} is approximately 1/2 of T^{AR} because alltoall communication only consists of a single phase, while allreduce based on the ring algorithm consists of 2 phases. In comparison, for the ring topology, alltoall requires each node to send messages to all other nodes and performance is proportional to the sum of the hop counts between each pair of nodes, γ, while for FC the network diameter is 1. For the ring topology, we assume alltoall communication is done in multiple steps where each node sends a message to the destination that is 1 hop away, then 2 hops away, etc., ensuring all channels are fully utilized. Thus, for large message sizes, the benefit from FC approaches γ/(p − 1), or ∼2.3 in the 8-node example. For smaller message sizes, the benefit from FC is much higher because of the latency component, as shown in Figure 10(b).

In distributed synchronous training of DLRMs the message size for allreduce is often around 10MB – thus, the physical topology does not have a significant impact on overall communication cost. However, for alltoall communication the message size is much smaller (e.g., on the order of 100KB), therefore the FC topology can provide up to 3× improvement in communication cost compared to the ring topology. This pushes distributed DLRM training characteristics towards HPC workloads that rely on global traffic patterns and where a low-diameter topology can provide significant benefits.
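The model in Eqs. (1)-(4) is easy to evaluate directly. The sketch below does so for the 8-node case; the bandwidth and latency values are illustrative assumptions rather than measured Zion parameters.

```python
# Evaluate the ring vs. fully-connected (FC) analytical model of Eqs. (1)-(4).
def gamma(p):
    # sum of hop counts from one node to all others on a bidirectional ring
    q, r = (p - 1) // 2, (p - 1) % 2
    return 2 * sum(range(1, q + 1)) + r * (p // 2)

def t_allreduce(M, B, alpha, p, topo):
    if topo == "ring":
        return 2 * (M / B) * (p - 1) / p + 2 * alpha * (p - 1)
    b = B / (p - 1)                       # FC splits node bandwidth over p-1 channels
    return 2 * (M / b) / p + 2 * alpha

def t_alltoall(M, B, alpha, p, topo):
    if topo == "ring":
        return gamma(p) * (M / (B * p) + alpha)
    return (M / B) * (p - 1) / p + alpha

if __name__ == "__main__":
    p, B, alpha = 8, 100e9, 1e-6          # assumed 100 GB/s per node, 1 us per hop
    for M in (100e3, 10e6, 256e6):        # 100KB, 10MB and 256MB messages
        ar = t_allreduce(M, B, alpha, p, "ring") / t_allreduce(M, B, alpha, p, "fc")
        a2a = t_alltoall(M, B, alpha, p, "ring") / t_alltoall(M, B, alpha, p, "fc")
        print(f"M = {M/1e6:g} MB: allreduce ring/FC = {ar:.2f}, alltoall ring/FC = {a2a:.2f}")
```

For large messages the alltoall ratio approaches γ/(p − 1) ≈ 2.3 while the allreduce ratio approaches 1, consistent with the discussion above.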
D. Hardware Design Flexibility

Although the Zion hardware is designed mainly for Facebook's deep learning recommendation workload, the design is formed by 4 identical dual-socket motherboard modules and is flexible enough to be configured, for example, to use 2 sockets plus 8 accelerators, as shown in Figure 11(a). This 2 CPU socket configuration may be used for other workloads which do not require a large memory footprint, such as computer vision and natural language processing. On the other hand, Figure 11(b) shows the standard 8 socket plus 8 accelerator configuration.

V. DISCUSSION ON SCALE-OUT TRAINING

The Zion system described earlier provides a significant amount of compute and memory capacity to support large neural network models. However, one limitation of its memory system is that the aggregate 1.5TB of DDR memory does not provide high enough memory bandwidth. Moreover, it is anticipated that model complexity and the amount of input data will continue to grow, as shown in Figure 3, resulting in the need for increased training throughput.

TABLE III
DESCRIPTION OF DIFFERENT SCALE-OUT TRAINING SYSTEMS.

                             Topology
Protocol          Flat                   Hierarchical
Homogeneous       TPU [48],              Habana HLS-1 [36],
                  Habana [36]            Intel SpringCrest [47]
Heterogeneous     N/A                    Zion [44],
                                         DGX SuperPod [1]

Fig. 12. Comparison of local vs global bandwidth for different scale-out systems. The TPU system does not differentiate between local and global bandwidth since it is a flat topology.

Fig. 13. A view of the interconnect software and hardware stack.

The increasing demand can be best addressed by scaling out training to multiple nodes and increasingly leveraging accelerators. This means going beyond a scale-up solution of a single super-node (e.g. Zion) to scale-out systems with multiple such super-nodes. However, distributed training introduces additional fabric and connectivity requirements across multiple nodes to efficiently support asynchronous and synchronous training with data- and model-parallelism. In this section, we discuss three vectors, namely the network topology, the network transport, and the implementation of optimized collective primitives, that have significant implications on the design of scale-out systems for synchronous training of DLRMs.

A. Network Topology

Different scale-out training systems have been proposed, developed and successfully deployed by the industry, as summarized in Table III. These systems can be classified as flat or hierarchical based on the fabric organization – a flat organization uses a single global topology (e.g., the 2D or 3D torus used in the TPU system [48]), while a hierarchical organization consists of a "supernode" with several accelerators, which is used as the building block to scale out to a larger number of nodes (e.g. the DGX SuperPod [1]).

Another difference in the scale-out fabric is whether the fabric is accelerator- or CPU-centric. We define an accelerator fabric similar to the definition in the Zion system – if the accelerators are used as the building block to scale out, the system can be referred to as using an accelerator-centric fabric, while if the CPU is used to scale out, the system is CPU-centric. While the Zion system itself uses an accelerator-centric fabric, the Zion system when scaling out is heterogeneous and CPU-centric, since the 100 GbE network interfaces of the CPUs are used to scale out.

The scale-out systems can further be categorized based on the communication protocol – a homogeneous system leverages the same link/interconnect technology for the entire system, while a heterogeneous system exploits different interconnect technologies. We note that while different interconnect protocols (e.g., Infiniband [43], RoCE [37], NVlink [21], etc.) are available, heterogeneous systems can be bottlenecked by the channels that provide the smallest amount of bandwidth.

Also, we delineate a distinction between global and local bandwidth in the scale-out systems. Local bandwidth refers to the bandwidth for communicating between physically neighboring nodes, while global bandwidth is used to provide connectivity across the (global) system. In Figure 12, we compare the ratio of local vs global bandwidth across different scale-out systems. For the NVidia GPU systems, the amount of local bandwidth dominates overall system bandwidth, as each of the GPU nodes has 300 GB/s of bandwidth (through NVLinks) for local communication while the scale-out bandwidth through Infiniband is 4x100 Gbps and shared across 8 GPUs. The amount of scale-out bandwidth increases for the GPU (Pod), but it is shared among a larger number of GPUs and thus the ratio is similar. For the Zion system, the fraction of global bandwidth is slightly higher since there are 8 CPU network interfaces to provide the scale-out bandwidth. In comparison, the HLS-1 system from Habana has 10 RoCE channels from each Gaudi accelerator and uses 7 channels for intra-box connectivity while 3 channels are used for scale-out or global bandwidth. In contrast, the flat topology (i.e. TPU) does not differentiate between local and global bandwidth since the fabric is organized as a single, global topology.
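As a rough back-of-the-envelope illustration of this imbalance (the exact accounting behind Figure 12 may differ): in the DGX-1 case the shared scale-out bandwidth is

4 \times 100\,\text{Gbps} = 50\,\text{GB/s} \approx 6.25\,\text{GB/s per GPU},

compared to roughly 300 GB/s of local NVLink bandwidth per GPU, i.e. the global bandwidth is on the order of 2% of the local bandwidth.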
B. Network Transport

The network transport layer is key to implementing communication primitives for a given fabric topology. Hyperscale data centers have typically used kernel-based TCP/IP and Ethernet as the network transport for decades. However, for an accelerator-centric fabric, efficient scale-out communication entails moving data directly from one accelerator to another without host CPU involvement – requiring a transport that supports Remote Direct Memory Access (RDMA) or similar technology between accelerators.

Infiniband and RDMA over Converged Ethernet (RoCE) are two of the most popular examples of RDMA transport in use today, see Figure 13.

These protocols are supported by RDMA Network Interface Controllers (RNICs), which are PCIe devices comprised of sophisticated DMA and transport engines, and include various management capabilities. RNICs use the Verbs interface, which provides operations for resource management and data transfer, including send/receive and RDMA read/write operations. Today's RNICs typically offer the following capabilities:
• Zero copy – direct data exchange between buffers
• Transport offload – segmentation/assembly, retransmission
• Kernel bypass – RNIC hardware access from user mode
• Flow control – credit-based control, end-to-end reliability

Further, support for RDMA and Verbs by itself is not sufficient for enabling optimized network transports for an accelerator-centric fabric. For instance, NVidia introduced a series of enhancements to enable RNICs to directly access GPU memory, culminating in GPUDirect RDMA (GDR) [2]. GDR provides a direct path for data movement between GPU memory and Infiniband/RoCE NICs. It eliminates significant overheads resulting from PCIe over-subscription^2 and an additional copy through the host/system memory.

To illustrate the performance advantages of using GDR, we evaluate GDR on an Infiniband fabric against a highly optimized Non-GDR baseline that requires one extra copy but also uses RDMA transport over a lossless Infiniband fabric^3. We run the benchmarks starting with one DGX-1 node (8 GPUs) all the way up to 32 nodes (256 GPUs). We force all communication via Infiniband to better isolate the benefits of GDR, since it primarily affects the scale-out data path. We collect performance data for 64KB to 256MB message sizes, representing typical message sizes found in machine learning workloads.
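For reference, a message-size sweep of this kind can be driven from a DL framework as sketched below. This is not the benchmark harness used for Figure 14, and whether GDR is actually exercised depends on the NCCL build and environment settings (e.g. NCCL_NET_GDR_LEVEL) rather than on the script itself.

```python
# Minimal allreduce timing sweep with PyTorch + NCCL; launch with torchrun so
# that RANK, WORLD_SIZE and LOCAL_RANK are set in the environment.
import os
import time
import torch
import torch.distributed as dist

def sweep_allreduce(sizes_bytes, iters=20, warmup=5):
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)
    for size in sizes_bytes:
        buf = torch.ones(size // 4, dtype=torch.float32, device=device)
        for _ in range(warmup):                 # exclude connection setup
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            avg_us = (time.time() - start) / iters * 1e6
            print(f"{size >> 10:>8} KB: {avg_us:10.1f} us")
    dist.destroy_process_group()

if __name__ == "__main__":
    # 64KB to 256MB, the message-size range discussed in the text
    sweep_allreduce([(64 << 10) << i for i in range(13)])
```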
Fig. 14. Speedup of GDR over Non-GDR allreduce scale-out communication.

In this scenario, Figure 14 presents the speedup of GDR over Non-GDR for the NCCL allreduce collective^4. Note that, given that NICs and GPUs are on the same PCIe switch with a single link to the CPU, in the non-GDR case send/recv flows go through the PCIe link to the CPU, causing the bandwidth to be halved. GDR avoids this problem, attaining 2× higher bandwidth and a corresponding speedup for large messages. On the other hand, for small message sizes, GDR also has an advantage in terms of latency because it avoids the intermediate copy to host/system memory, which explains why we see an even greater speedup in this case.

Thus RDMA transports (e.g. GDR) overcome fundamental limitations in host-accelerator connectivity and are important for enabling efficient scaling on an accelerator-centric fabric. They are also often used to build higher level abstractions that may be exposed in DL frameworks [5], [12], [25], [41].

C. Transport- and Topology-aware Collectives

Recall that to support both asynchronous and synchronous training with data- and model-parallelism we rely on the allreduce and alltoall communication primitives. These primitives have varying characteristics depending on the size of the data being communicated and the number and topology of the nodes involved in the communication.

For both allreduce and alltoall, the data sizes of interest vary from a few KBs to several MBs. Typically the smaller data sizes are more sensitive to latency, and as the data size increases bandwidth becomes more important. Also, unlike the collective communication requirements of allreduce, in alltoall all nodes need to communicate with all other nodes and thus the average hop count of the topology has a significant impact on overall performance.

In short, collective implementations need to be aware of (a) the network transport (GDR/RDMA vs. TCP) and (b) the network topology. For instance, the Message Passing Interface (MPI) and the NVIDIA Collective Communication Library (NCCL) implement topology- and transport-aware collectives today.

Future scale-out systems need to be co-designed taking into account all of these considerations, in particular including fabric topology, transport, and optimized collectives.

^2 On GPU systems, such as NVidia's DGX-1, one 1x16 PCIe link to the host is shared by three devices each driving 1x16 PCIe lanes. This PCIe over-subscription on the host PCIe link results in significant overheads when moving data to/from system memory.
^3 We point out that alternate host kernel based transports, such as TCP/IP over Ethernet, incur two copies.
^4 The NCCL all2all collective is not yet available.

VI. RELATED WORK

Given the growing importance of distributed training of deep learning models, there have been a number of papers on workload characterization and scale-out solutions [14], [22], [35], including Alibaba's Platform of Artificial Intelligence (PAI) [46] and EFLOPS [20], which detail algorithm and system co-design for high performance distributed training platforms. Other works, such as DaDianNao [13] and ScaleDeep [45], also showcase similar results. We provide an overview of some of the well established and more recent hardware platforms that have tried to address the corresponding challenges, with emphasis on their scale-out approach and communication.

NVidia GPUs have been commonly used for deep learning training, and are often organized into the DGX appliance. DGX-1 consists of 8 GPUs that are interconnected with high-bandwidth NVlink. DGX-2 was announced and provides 16 GPUs that are interconnected not only with NVlink for high bandwidth but also NVSwitch to provide connectivity between all 16 GPUs. Further, Infiniband may be used to scale beyond a single DGX-1 or DGX-2 system. We point out that the NVidia DGX Pod is based on the DGX-1 system.

The Google TPU Pod [28], [48] (v2 and v3) is an example of a DNN supercomputer that is tightly integrated with a custom accelerator (TPU), high-bandwidth memory and a custom interconnect to enable a supercomputing pod. The TPUs are interconnected with a high-speed link that supports a custom protocol, instead of a commonly available communication protocol, to reduce communication overhead, and an integrated router is supported in each TPU [11]. While the system is organized hierarchically, with a board consisting of 4 TPUs and a rack consisting of multiple boards, the global topology used to scale out is a low-radix 2D toroidal mesh.

The Habana Gaudi training processors [36] provide integrated compute and networking by supporting RoCE within the training processor. Each Gaudi has 10 ports of 100Gb Ethernet and each port can be used for either internal or external (scale-out) connectivity. The HLS-1 system with 8 Gaudi leverages 7 ports for a fully connected topology within the server, while the remaining 3 ports are used to scale out.

On the other hand, Cerebras has proposed a wafer-scale training system [34] and built one of the largest chips ever created, with over 1 trillion transistors. By integrating all the cores in a single wafer, it enables over 100 Pbit/s of fabric bandwidth and 8 PByte/s of memory bandwidth. The cores (or "chips") within the wafer are interconnected through a 2D mesh topology. While wafer-scale integration provides significant compute and bandwidth (both memory and interconnect), it presents some unique challenges of wafer-scale computing – including yield, power/thermal challenges, as well as packaging constraints.

Huawei has also recently announced its Ascend 910 [33] training solution. While some of the details are not clear, 8 Ascend 910 chips are integrated into a single "AI server," similar to an NVidia DGX-1, with a proprietary high-speed link technology (HCCS) providing 720 Gbps of bandwidth for NUMA connections within the AI server. In addition, 200 Gbps of bandwidth is provided through a RoCE interface for scale-out to create an Ascend cluster with 1K to 2K nodes.

Finally, Intel announced the Spring Crest Deep Learning Accelerator for training [47] to enable scalable deep learning. Each Spring Crest provides 64 lanes of SerDes for a total aggregate bandwidth of 3.58 Tbps, and each chip also has a fully programmable router that enables multiple "glueless" topologies – i.e., a server consisting of 8 Spring Crest chips can be directly interconnected to another server and can scale up to 1K nodes.

VII. SUMMARY

In this paper we have discussed different aspects of training of DLRMs. We have touched on asynchronous and synchronous training, including the mapping of MLPs and embeddings onto data- and model-parallel patterns of computation. Further, we have outlined how data- and model-parallelism patterns map to allreduce and all2all communication primitives, respectively. Also, we have shown how the performance of these primitives is influenced by fabric topology and interconnect design.

We have described the design of the Zion scale-up system and how it can be used for training of DLRMs. We have also reviewed and compared the organization of some of the existing scale-out systems. Finally, we have outlined several important considerations for future scale-out systems, including fabric organization, network transport and topology-aware communication primitives.

REFERENCES

[1] "NVidia DGX Pod," 2018. [Online]. Available: https://www.nvidia.com/en-us/data-center/dgx-pod-reference-architecture/
[2] "NVidia GPUDirect RDMA," 2019. [Online]. Available: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
[3] "Open Accelerator Infrastructure (OAI)," 2019. [Online]. Available: https://www.opencompute.org/wiki/Server/OAI
[4] "Parallel distributed deep learning: Machine learning framework from industrial practice," 2019.
[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.
[6] A. Agarwal, "Limits on interconnection network performance," IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, pp. 398–412, 1991.
[7] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proc. 32nd Int. Symp. Computer Architecture (ISCA), 2005.
[8] F. Borisyuk, A. Gordo, and V. Sivakumar, "Rosetta: Large scale system for text detection and recognition in images," in Proc. 24th Int. Conf. Knowledge Discovery & Data Mining (KDD), 2018, pp. 71–79.
[9] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," CoRR, vol. 1606.04838, 2016.
[10] B. Brock, Y. Chen, J. Yan, J. D. Owens, A. Buluç, and K. Yelick, "RDMA vs. RPC for implementing distributed data structures," CoRR, vol. 1910.02158, 2019.
[11] C. Chao and B. Saeta, "Cloud TPU: Codesigning architecture and infrastructure," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.

[12] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," CoRR, vol. 1512.01274, 2015.
[13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annual IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2014, pp. 609–622.
[14] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, "Project Adam: Building an efficient and scalable deep learning training system," in Proc. 11th Symp. Operating Systems Design and Implementation (OSDI), 2014, pp. 571–582.
[15] P. Covington, J. Adams, and E. Sargin, "Deep neural networks for YouTube recommendations," in Proc. 10th ACM Conf. Recommender Systems, 2016, pp. 191–198.
[16] W. J. Dally, "Performance analysis of k-ary n-cube interconnection networks," IEEE Transactions on Computers, vol. 39, no. 6, pp. 775–785, 1990.
[17] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.
[18] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey, "Distributed deep learning using synchronous stochastic gradient descent," CoRR, vol. 1602.06709, 2016.
[19] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," Advances in Neural Information Processing Systems (NIPS) 25, pp. 1223–1231, 2012.
[20] J. Dong, Z. Cao, T. Zhang, J. Ye, S. Wang, F. Feng, L. Zhao, X. Liu, L. Song, L. Peng, Y. Guo, X. Jiang, L. Tang, Y. Du, Y. Zhang, P. Pan, and Y. Xie, "EFLOPS: Algorithm and system co-design for a high performance distributed training platform," in Proc. IEEE Int. Symp. High-Performance Computer Architecture (HPCA), 2020.
[21] D. Foley and J. Danskin, "Ultra-performance Pascal GPU and NVLink interconnect," Proc. IEEE/ACM Int. Symposium on Microarchitecture (MICRO), vol. 37, pp. 7–17, 2017.
[22] A. Gibiansky, "Bringing HPC techniques to deep learning," Baidu technical blog, 2017.
[23] U. Gupta, X. Wang, M. Naumov, C. Wu, B. Reagen, D. Brooks, B. Cottel, K. M. Hazelwood, B. Jia, H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, and X. Zhang, "The architectural implications of Facebook's DNN-based personalized recommendation," CoRR, vol. 1906.03109, 2019.
[24] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, "Applied machine learning at Facebook: A datacenter infrastructure perspective," in Proc. IEEE Int. Symp. High Performance Computer Architecture (HPCA), 2018, pp. 620–629.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," CoRR, vol. 1408.5093, 2014.
[26] B. Jiang, C. Deng, H. Yi, Z. Hu, G. Zhou, Y. Zheng, S. Huang, X. Guo, D. Wang, Y. Song, et al., "XDL: An industrial deep learning framework for high-dimensional sparse data," in Proc. 1st Int. Workshop on Deep Learning Practice for High-Dimensional Sparse Data, 2019.
[27] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, "Google's multilingual neural machine translation system: Enabling zero-shot translation," CoRR, vol. 1611.04558, 2016.
[28] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a Tensor Processing Unit," in Proc. 44th Annual Int. Symp. Computer Architecture (ISCA), 2017, pp. 1–12.
[29] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," in Proc. 35th Annual Int. Symp. Computer Architecture (ISCA), 2008, pp. 77–88.
[30] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, "Microarchitecture of a high radix router," in Proc. 32nd Int. Symp. Computer Architecture (ISCA), 2005.
[31] K. Lee, "Introducing Big Basin: Our next-generation AI hardware," 2019. [Online]. Available: https://engineering.fb.com/data-center-engineering/introducing-big-basin-our-next-generation-ai-hardware
[32] A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. Tallent, and K. Barker, "Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect," CoRR, vol. 1903.04611, 2019.
[33] H. Liao, J. Tu, J. Xia, and X. Zhou, "DaVinci: A scalable architecture for neural network computing," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.
[34] S. Lie, "Wafer scale deep learning," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.
[35] R. Mayer and H.-A. Jacobsen, "Scalable deep learning on distributed infrastructures: Challenges, techniques and tools," ACM Computing Surveys, vol. 53, 2019.
[36] E. Medina, "Habana Labs approach to scaling AI training," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.
[37] R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Ratnasamy, and S. Shenker, "Revisiting network support for RDMA," CoRR, vol. 1806.08159, 2018.
[38] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, "Deep learning recommendation model for personalization and recommendation systems," CoRR, vol. 1906.00091, 2019. [Online]. Available: https://github.com/facebookresearch/dlrm
[39] X. Pan, J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous SGD," 2017.
[40] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. S. Khudia, J. Law, P. Malani, A. Malevich, N. Satish, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. M. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, and M. Smelyanskiy, "Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications," CoRR, vol. 1811.09886, 2018.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017. [Online]. Available: https://pytorch.org/
[42] A. Sergeev and M. D. Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," CoRR, vol. 1802.05799, 2018.
[43] T. Shanley, Infiniband Network Architecture. Addison-Wesley, 2002.
[44] M. Smelyanskiy, "Zion: Facebook next-generation large-memory unified training platform," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.
[45] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, "ScaleDeep: A scalable compute architecture for learning and evaluating deep networks," in Proc. 44th Annual Int. Symp. Computer Architecture (ISCA), 2017, pp. 13–26.
[46] M. Wang, C. Meng, G. Long, C. Wu, J. Yang, W. Lin, and Y. Jia, "Characterizing deep learning training workloads on Alibaba-PAI," CoRR, vol. 1910.05930, 2019.
[47] A. Yang, N. Garegrat, C. Miao, and K. Vaidyanathan, "Deep learning training at scale: Spring Crest deep learning accelerator," in Hot Chips 31 Symposium, Palo Alto, CA, USA, 2019.
[48] C. Young, "Evaluation of the Tensor Processing Unit: A deep neural network accelerator for the datacenter," in Hot Chips 29 Symposium, Palo Alto, CA, USA, 2017.
[49] W. Zhao, "OCP Accelerator Module (OAM)," 2019. [Online]. Available: https://engineering.fb.com/data-center-engineering/accelerator-modules/
[50] Q. Zheng, B.-Y. Su, J. Yang, A. Azzolini, Q. Wu, O. Jin, S. Karandikar, H. Lupesko, L. Xiong, and E. Zhou, "ShadowSync: Performing synchronization in the background for highly scalable distributed training," CoRR, vol. 2003.03477, 2020.
[51] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, "Parallelized stochastic gradient descent," in Advances in Neural Information Processing Systems (NIPS) 23. Curran Associates, Inc., 2010, pp. 2595–2603.
