Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Rajarshi Biswas,

Graduate Program in Department of Computer Science and Engineering

The Ohio State University

2018

Master’s Examination Committee:

Dr. Dhabaleswar K. (DK) Panda, Advisor
Dr. Christopher Charles Stewart
Dr. Xiaoyi Lu

© Copyright by

Rajarshi Biswas

2018

Abstract

Google's TensorFlow is one of the most popular Deep Learning (DL) frameworks available in the community. gRPC, a Remote Procedure Call (RPC) framework also developed by Google, is the main communication engine for distributed TensorFlow. TensorFlow primarily uses gRPC for exchanging tensors and communicating administrative tasks among different processes across the nodes. Tensor updates during the training phase are communication intensive, and thus TensorFlow's performance is heavily dependent on the underlying network and the efficacy of the communication engine. Apart from the default gRPC channel, TensorFlow supports various high-performance channels to efficiently transfer tensors, such as gRPC+Verbs and gRPC+MPI. However, at present, the community lacks a thorough characterization of these available distributed TensorFlow communication channels. This is critical to understand because high-performance Deep Learning with TensorFlow on modern HPC systems needs an efficient communication runtime.

In this work, we first conduct a meticulous analysis of the communication characteristics of distributed TensorFlow over all available channels. Based on these characteristics, we propose the TF-gRPC-Bench micro-benchmark suite, which enables system researchers to quickly understand the impact of the underlying network and communication runtime on DL workloads. We propose three micro-benchmarks that take TensorFlow's DL workload characteristics over gRPC into account. Furthermore, our characterization shows that none of the existing channels in TensorFlow can support adaptive and efficient communication for DL workloads with different message sizes. Moreover, the community needs to maintain these different channels, while users are also expected to tune them to get the desired performance. Therefore, this work proposes a unified approach: a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, and zero-copy transmission to make our runtime adaptive to different message sizes for DL workloads. Our evaluations show that AR-gRPC can significantly speed up gRPC performance, by up to 4.1x and 2.3x compared to the default gRPC design on IPoIB and another RDMA-based gRPC design in the community, respectively. By integrating AR-gRPC with TensorFlow, we can achieve up to 3x distributed training performance improvement over the default gRPC-IPoIB based TensorFlow.

To my family, friends, and mentors.

Acknowledgments

My deepest gratitude goes to my advisor, Dr. D. K. Panda, for the guidance and support he has given me throughout this thesis work. I am grateful to him for giving me the important opportunity to be part of the HiBD research group. His work ethic, commitment, and principles are big inspirations for me, and I will always strive to follow this path.

I also want to thank Dr. Christopher Charles Stewart for agreeing to be a committee member for my thesis defense exam, and for making it work despite his tight schedule and commitments.

I would like to give special thanks to Dr. Xiaoyi Lu, who has been my mentor and team lead. His technical guidance and encouragement throughout my tenure in the lab have been invaluable to me. His insightful and thought-provoking technical comments have helped me grow. His willingness to support me even in trying circumstances kept me moving forward. He brings a lot of positivity to the lab, and his commitment to his work is exceptional.

I’ve learnt a lot working closely with him.

From my family, I want to thank my mother, Mrs. Minati Biswas, and my father, Mr. Ranajit Kumar Biswas, for their continuous support and love. They have made many sacrifices for me, and I am grateful to have parents like them. I also want to thank my cousin Sampurna Biswas, who has been a great support throughout my graduate studies in the US.

Finally, I thank my lab colleagues and friends Shashank, Haseeb, Moniba, Sourav, Haiyang, Dipti, and others for the interactions we had. I would also like to thank all my friends back home, particularly Arka and Chandan, for encouraging me in my pursuit.

Vita

2011 ...... B.E., Information Technology, Jadavpur University, India
2011 - 2015 ...... Software Development Engineer, Citrix Research and Development, India
2015 - 2016 ...... Senior Software Development Engineer, Citrix Research and Development, India
2016 - Present ...... M.S., Computer Science and Engineering, The Ohio State University, USA
2017 - Present ...... Graduate Research Associate, The Ohio State University, USA

Publications

X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters, In IEEE Transactions on Multi-Scale Computing Systems, [June 2018].

R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, In 9th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-9), in conjunction with ASPLOS, [March 2018].

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks, In 25th Annual Symposium on High-Performance Interconnects (HOTI ’17), [August 2017].

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... x

List of Figures ...... xi

1. Introduction ...... 1

1.1 Motivation ...... 2
1.2 Organization of Thesis ...... 5

2. Background ...... 6

2.1 Overview of TensorFlow ...... 6
2.2 Overview of gRPC ...... 8
2.3 Overview of InfiniBand and RDMA ...... 9

3. Characterization of Distributed TensorFlow ...... 11

3.1 Distributed Execution of TensorFlow ...... 11
3.2 Methodology for Characterization ...... 12
3.3 Characterization for the gRPC Channel ...... 13
3.4 Characterization for the gRPC+Verbs Channel ...... 15
3.5 Characterization for the gRPC+MPI Channel ...... 15
3.6 Characteristics of TensorFlow Workload over gRPC Channel ...... 17
3.7 Summary ...... 18

4. Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow ...... 21

4.1 Introduction ...... 21
4.2 TensorFlow Deep Learning Micro-benchmarks for gRPC ...... 22
4.2.1 Design Considerations ...... 22
4.2.2 Design of TF-gRPC-Bench Micro-benchmark Suite ...... 25
4.3 Performance Evaluation ...... 29
4.3.1 Experimental Setup ...... 29
4.3.2 TF-gRPC-P2P-Latency (Serialized Mode) ...... 31
4.3.3 TF-gRPC-P2P-Latency (Non-serialized Mode) ...... 32
4.3.4 TF-gRPC-P2P-Bandwidth (Non-serialized Mode) ...... 33
4.3.5 TF-gRPC-PS-Throughput (Non-serialized Mode) ...... 34
4.4 Related Work ...... 35
4.5 Summary ...... 36

5. Accelerating TensorFlow with Adaptive RDMA-based gRPC (AR-gRPC)... 38

5.1 Introduction ...... 38
5.2 Proposed Design of AR-gRPC ...... 39
5.2.1 Architecture Overview of AR-gRPC ...... 39
5.2.2 Adaptive RDMA-based Communication ...... 41
5.3 Performance Evaluation ...... 44
5.3.1 Experimental Setup ...... 45
5.3.2 Evaluation of gRPC ...... 45
5.3.3 Evaluation of AR-gRPC Enhanced TensorFlow ...... 52
5.4 Related Work ...... 59
5.5 Summary ...... 61

6. Conclusion and Future Work ...... 62

Bibliography ...... 63

List of Tables

Table Page

1.1 Comparison with Related Work ...... 4

3.1 TensorFlow Performance for Resnet50 ...... 13

4.1 iovec Buffer Size Category ...... 28

4.2 Configurable Parameters for TF-gRPC-Bench Micro-benchmark Suite . . . 30

List of Figures

Figure Page

1.1 Contrast Between Current and Proposed Deep Learning Benchmarks ...... 3

2.1 Overview of TensorFlow ...... 7

2.2 Overview of gRPC Deployment ...... 9

3.1 Communication Pattern Between TensorFlow Parameter Servers and Workers 12

3.2 TensorFlow Payload Distribution and Communication Flow over gRPC channel ...... 14

3.3 TensorFlow Payload Distribution and Communication Flow over gRPC+Verbs channel ...... 16

3.4 TensorFlow Payload Distribution and Communication Flow over gRPC+MPI channel ...... 17

3.5 iovec Buffer Distribution Observed for TensorFlow training over gRPC . 18

4.1 Design Considerations for TF-gRPC-Bench Micro-benchmark ...... 23

4.2 TF-gRPC-Bench Micro-benchmark Design ...... 26

4.3 TF-gRPC-P2P-Latency (Serialized Mode) Evaluation on Cluster A with 64KBytes Payload ...... 31

4.4 TF-gRPC-P2P-Latency (Non-serialized Mode) ...... 32

4.5 TF-gRPC-P2P-Latency (Non-serialized Mode) Evaluation on Cluster A for Different iovec Counts ...... 33

4.6 TF-gRPC-P2P-Bandwidth (Non-serialized Mode) ...... 34

4.7 TF-gRPC-PS-Throughput (Non-serialized Mode) ...... 35

5.1 Overview of AR-gRPC and the Corresponding Communication in TensorFlow ...... 40

5.2 gRPC Point-to-Point Latency Evaluation on Cluster A ...... 46

5.3 gRPC Point-to-Point Latency Evaluation on Cluster B ...... 47

5.4 Analysis of Various gRPC Designs on Cluster A ...... 49

5.5 gRPC Single Server, Multiple Clients Throughput Evaluation on Cluster A . 50

5.6 Performance Comparison in Fully-Connected Architecture of gRPC . . . . 51

5.7 Inception4 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs ...... 54

5.8 Resnet152 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs ...... 55

5.9 Inception3 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs ...... 56

5.10 Resnet50 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs ...... 56

5.11 GoogleNet & AlexNet Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs ...... 56

5.12 AR-gRPC enhanced TensorFlow Speedup Compared to gRPC-IPoIB channel on Cluster A ...... 59

Chapter 1: Introduction

Deep Learning (DL), a subset of Machine Learning (ML) in Artificial Intelligence (AI), has attracted a lot of attention due to its inference accuracy. Many DL frameworks and tools have been proposed in the community, such as Caffe [26], Facebook Caffe2 [1], Microsoft CNTK [42], Intel BigDL [8], Google TensorFlow [16], and many others.

Google's TensorFlow is one of the most popular frameworks for performing distributed deep learning, and it has been gaining a lot of momentum recently in the Big Data, Deep Learning, and High-Performance Computing (HPC) communities. During DL model training and inference on TensorFlow, gradient updates (or tensor transmissions) are the critical, time-consuming steps that incur a massive volume of data transfer over the network. This becomes a major bottleneck in DL workloads. Increasing the mini-batch size is one solution, as it results in fewer gradient updates and longer local computation in TensorFlow. However, studies [21, 41, 53] have shown that this approach can increase the time for the DL model to converge. Other solutions have been proposed to accelerate TensorFlow by taking advantage of various high-performance technologies. For instance, the current open-source TensorFlow supports multiple ways of doing gradient updates: running the default gRPC [7] over TCP/IP or IPoIB (IP over InfiniBand), gRPC with a dedicated Verbs-based channel [15], and gRPC with a dedicated MPI-based channel [15, 40]. The main reason for bringing the Verbs and MPI based channels into TensorFlow is to utilize high-performance communication mechanisms such as Remote Direct Memory Access (RDMA) over high-speed interconnects, like InfiniBand and RDMA over Converged Ethernet (RoCE). However, these (Verbs and MPI) channels still use gRPC for administrative message communication among different remote processes, thus making gRPC an indispensable component of TensorFlow.

1.1 Motivation

The left side of Figure 1.1 shows the current approach to benchmarking deep learning frameworks. Most of the current DL models and benchmarks are oriented towards deep learning research. In order to reach the desired inference accuracy, the neural network typically needs a longer training time, which makes the benchmarks run longer. Models like VGG [44], AlexNet [29], and GoogLeNet [49] take several minutes, hours, or even days to train on real datasets like ImageNet [4].

However, many system researchers focus solely on improving the communication engine of deep learning frameworks to reduce the distributed training time. The right side of Figure 1.1 shows the proposed approach to benchmarking deep learning frameworks for system researchers: they need to consider only the factors affected by the underlying networks and communication subsystem.

Therefore, a micro-benchmark suite that enables system researchers to quickly evaluate, understand, and optimize the performance of the deep learning frameworks' communication substrate in a stand-alone manner, by capturing the characteristics of deep learning frameworks, is highly desirable.

Some benchmarks, like NCCL2 [11] or Baidu Allreduce [10], target the evaluation of reduce-tree-based collective communication performance. However, for the parameter server based approach, and especially for the gRPC-based communication runtime, micro-benchmarks are needed to evaluate the performance of the underlying networks and protocols.

Figure 1.1 Contrast Between Current and Proposed Deep Learning Benchmarks

Moreover, in order to achieve optimal communication performance on high-performance networks, the TensorFlow community maintains different channels, i.e., gRPC, Verbs, and MPI. However, users also need to understand and tune these channels on their platforms to get the desired performance. Such scenarios bring a lot of challenges for both developers and end users. This also motivates us to answer a broad question:

Can a unified approach be proposed to provide optimal performance for TensorFlow workloads?

Table 1.1 Comparison with Related Work

Work       Channel       Main Mechanism
[18]       gRPC          TCP/IP, IP-over-IB
           gRPC + Verbs  RDMA for Tensor transfers; gRPC for administrative tasks
           gRPC + MPI    MPI for Tensor transfers; gRPC for administrative tasks
[12]       gRPC          Replacing Sockets-based Send/Recv with Verbs Send/Recv
This work  AR-gRPC       Native RDMA; Adaptive communication for TensorFlow demands

To answer this question, we first conduct a survey of the existing solutions that have been proposed in the community for TensorFlow. Table 1.1 summarizes the comparison among these different solutions. From this table, we clearly see that the community is trying to run DL workloads on top of gRPC (with TCP/IP or IPoIB), Verbs (RDMA or Send/Recv), and MPI. In all these different solutions, gRPC is responsible for at least the administrative tasks, such as establishing the RDMA path, exchanging computation graphs, etc. Therefore, since gRPC is a compulsory component of TensorFlow, it makes more sense to bring RDMA capability directly into the gRPC runtime. This will allow TensorFlow to automatically benefit from an RDMA-enhanced gRPC. In fact, there is an existing version of RDMA-based gRPC [12] in the community, which indicates that researchers are investigating this direction. With these many available channels, there are several important questions we need to explore:

• Can these new channels bring benefits for the DL workloads?

• Which channel performs the best and why?

• Is there any need to propose a new RDMA-based gRPC runtime, which can provide better performance than the existing channels?

• If so, how much additional performance benefit can we gain through the proposed designs?

1.2 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the topics and concepts that are relevant to this thesis. In Chapter 3, we provide a thorough characterization of distributed TensorFlow, including an in-depth analysis of the existing distributed TensorFlow communication channels. In Chapter 4, we discuss the design of a micro-benchmark suite to evaluate gRPC for TensorFlow and provide results of our micro-benchmark suite on different clusters. Chapter 5 explores the design of Adaptive RDMA-based gRPC (AR-gRPC) and explains how AR-gRPC can accelerate TensorFlow; furthermore, we comprehensively evaluate AR-gRPC and AR-gRPC enhanced TensorFlow and show comparisons with other designs. Chapter 6 concludes the thesis and discusses future work.

Chapter 2: Background

In this chapter, we present an overview of TensorFlow, gRPC, InfiniBand, and RDMA.

2.1 Overview of TensorFlow

TensorFlow [18] is a widely adopted, open-source Deep Learning framework developed by the Google Brain team and open-sourced in November 2015.

TensorFlow leverages data flow graphs to perform distributed deep neural network training. Nodes in the graph represent mathematical operations, and the graph edges represent the multidimensional data arrays (i.e., tensors) communicated across the nodes. The execution model of distributed TensorFlow comprises four distinct components: a client, a master, a set of workers, and several Parameter Servers. Figure 2.1 illustrates the interaction among these components. The computational graph is built by a user-written client TensorFlow program. The client then creates a session with the master and sends the graph definition as a protocol buffer. Afterwards, the master delegates and coordinates the execution (after pruning and optimizing) of the subgraphs to a set of distributed worker and Parameter Server (PS) processes. Each of these processes can seamlessly leverage heterogeneous environments, for example, one or multiple devices (e.g., CPU, GPU, TPU [27]), to finish its tasks.

Figure 2.1 Overview of TensorFlow

The Parameter Servers are responsible for storing and updating the model parameters, while the workers send optimization updates of the model to the parameter servers and fetch the updated model from them. This parameter exchange (or tensor transmission) is the main communication phase, and the default open-source TensorFlow supports different communication channels, such as gRPC, gRPC+Verbs, and gRPC+MPI, to handle it, as shown in Figure 2.1.

To achieve parallelism, distributed TensorFlow supports both data parallel training and model parallel training [17]. In data parallel training, TensorFlow parallelizes the computation of the gradient for a mini-batch across mini-batch elements. This technique replicates the TensorFlow graph (which does the majority of the computation) across different nodes, and each of the replicas operates on a different set of data. The gradients computed by these replicated models are combined after each iteration, and the parameter updates can then be applied either synchronously or asynchronously. This has the same effect as running the sequential graph computation with the accumulated mini-batch size, but much faster. On the other hand, in model parallel training, for the same batch of data, different portions of the graph computation are done on different nodes.
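For concreteness, the following is a minimal sketch of such a Parameter Server deployment using the TensorFlow 1.x Python API. The host names, port, and toy model are hypothetical and only illustrate the roles described above; the protocol argument is what selects the gRPC, gRPC+Verbs, or gRPC+MPI channel discussed later, assuming TensorFlow was built with the corresponding channel enabled.

    import tensorflow as tf  # TensorFlow 1.x API

    # Hypothetical cluster description: one parameter server, two workers.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each process starts a server for its own task. The protocol argument
    # selects the communication channel ("grpc", "grpc+verbs", or "grpc+mpi").
    server = tf.train.Server(cluster, job_name="worker", task_index=0,
                             protocol="grpc")

    # Data parallel training: variables live on the parameter servers, while
    # each worker builds its own replica of the compute-heavy graph.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w = tf.get_variable("w", shape=[1024, 1024])
        x = tf.random_normal([1024, 1024])
        loss = tf.reduce_sum(tf.matmul(x, w))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.train.MonitoredTrainingSession(master=server.target) as sess:
        sess.run(train_op)  # gradient/variable exchange uses the selected channel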

2.2 Overview of gRPC

Remote Procedure Calls (RPC) have come a long way since their inception. gRPC [7], a modern open-source RPC framework developed by Google, can efficiently connect services in and across data centers, with pluggable support for load balancing, tracing, health checking, and authentication. It is the most fundamental communication mechanism in distributed TensorFlow: no matter which channel, as mentioned in Section 2.1, is selected in TensorFlow, gRPC will always be used to perform communication for administrative tasks and/or for exchanging model parameters. Beyond TensorFlow, many other production systems in companies like Netflix, Cisco, and Juniper use gRPC to connect multiple services in their environments. The use cases vary from connecting a handful of services to hundreds of services across various languages in native or cloud environments.

Figure 2.2 depicts a typical gRPC-based communication scenario where a gRPC Python client communicates with a gRPC server written in C++. The client and server communicate with each other using protocol buffers (i.e., Proto Request/Response). The gRPC core, written in C, handles the communication and thus is the main point of interest for us. The gRPC core defines an abstraction called Endpoint that encapsulates a channel between two communicating processes. An endpoint implements Write and Read callbacks for a specific transport protocol. For example, the default gRPC has endpoints implemented for the traditional TCP and UDP protocols. In Section 5.2, we illustrate how we extend this design paradigm to support RDMA-based endpoints.


Figure 2.2 Overview of gRPC Deployment

2.3 Overview of InfiniBand and RDMA

InfiniBand [24] is a computer networking communication standard used in many HPC clusters to achieve low latency and high throughput. It defines a switched network fabric for interconnecting I/O and compute nodes. One of the primary features of InfiniBand is Remote Direct Memory Access (RDMA). RDMA can be used by a process to remotely read or update the memory contents of another remote process without CPU involvement at the remote side. RDMA is extremely powerful and can be leveraged to build high-performance communication engines. In addition, InfiniBand provides channel semantics (Send-Receive) based communication, and it supports both Reliable Connection (RC) and Unreliable Datagram (UD) transports. For both of these communication paradigms, InfiniBand supports zero-copy transfer from source to destination buffers. Moreover, with the recent convergence of RDMA over Converged Enhanced Ethernet (RoCE), InfiniBand technology is paving its way into the commercial domain as well, beyond HPC.

The lowest access layer of InfiniBand is the Verbs layer, which is capable of transmitting data in an OS-bypassed fashion. Verbs exposes queue pair (or communication endpoint) semantics to the application layer. The application places a work request in the queue, which is subsequently processed by the Host Channel Adapter (HCA). Upon completion, a notification is placed in the completion queue, which the application periodically polls to detect new events. Apart from this, InfiniBand also features the Internet Protocol over InfiniBand (IPoIB or IP-over-IB) [2] protocol, which can be used to run traditional socket-based applications over InfiniBand hardware. This enables socket-based applications to easily use the InfiniBand HCA with an IP address.

Chapter 3: Characterization of Distributed TensorFlow

In this chapter, we meticulously analyze the characteristics of distributed TensorFlow. First, we discuss the execution scheme of distributed TensorFlow. Then, we present the methodology we use for the characterization, followed by a thorough analysis of the existing TensorFlow channels (i.e., gRPC, gRPC+Verbs, and gRPC+MPI). This characterization helps us identify possible bottlenecks present in the current channels. In addition, we analyze the TensorFlow workload characteristics over the gRPC channel.

3.1 Distributed Execution of TensorFlow

During DL training in a distributed TensorFlow cluster, the values of training variables are updated using aggregated gradients and deltas, represented as tensors. The most widely used approach in the community for managing the training variables is the Parameter Server [33]. Figure 3.1 depicts the communication pattern among TensorFlow parameter servers and workers. In a distributed TensorFlow cluster, the parameter server (PS) processes own the master copies of the variables, whereas the worker processes request those variables when needed. In addition, when a worker computes a new value of a variable (such as a gradient update), it sends an update to the specific PS process. The variable updates, also known as tensor updates, are communication intensive, and thus the performance is heavily dependent on the underlying networks, protocols, and the design of the communication subsystem.

Figure 3.1 Communication Pattern Between TensorFlow Parameter Servers and Workers

3.2 Methodology for Characterization

We characterize the three communication channels available in the open-source TensorFlow. The default gRPC channel runs over IPoIB, while the Verbs and MPI based channels use native RDMA-based communication for tensor transmission. We choose the MVAPICH2-2.3b and Intel-MPI-2018 libraries for the MPI channel and find that both MPI libraries provide similar results. We deploy a four-node TensorFlow cluster (i.e., Cluster A; see Section 5.3.1) in the Parameter Server (PS) mode. The PS is deployed on one node (using the CPU), while the workers are deployed on the rest (using GPUs). We synchronously train (with a batch size of 32 per GPU) a Resnet50 [22] DNN, available in the TensorFlow Convolutional Neural Network (CNN) benchmark [15]. This benchmark generates synthetic image data and measures performance by the total number of images processed per second. The Resnet50 DNN is a moderately complex network and thus is suitable for our analysis.

Table 3.1 TensorFlow Performance for Resnet50

Channel      Images/Sec
gRPC         91.06
gRPC+Verbs   103.21
gRPC+MPI     84.45

We have two important observations from the results summarized in Table 3.1. First, gRPC+Verbs performs slightly better than gRPC (i.e., 103.21 vs. 91.06 images/sec), but the benefit is not significant (around 13%). This implies that gRPC+Verbs can utilize the RDMA network more efficiently than the default gRPC, but the question is: can the performance be improved even further? Second, we are surprised to see that gRPC+MPI performs worse than the default gRPC. These two observations motivate us to further understand the in-depth designs and communication characteristics of TensorFlow with the different channels. The following sections present the characterization details.

3.3 Characterization for the gRPC Channel

To understand the communication characteristics, we first profile the payload sizes being transmitted during TensorFlow training. Figure 3.2(a) shows 2K samples of the payload distribution when the default gRPC channel is used. These snapshots are taken from one of the worker nodes, as the other nodes show similar traces due to the symmetrical characteristics of the workload. In Figure 3.2(a), we see that communication over the socket-based gRPC channel involves a lot of short as well as large (up to 4 MBytes) messages. The reason for this upper bound is that the default gRPC has a maximum payload limit of 4 MBytes. However, from later profiling results (see Figures 3.3(a) and 3.4(a)), we see that the actual payloads in the training of Resnet50 can be much larger than 4 MBytes. Clearly, such a naive chunking scheme in gRPC for transferring large messages with TCP/IP over a high-performance network is one of the major bottlenecks. To further identify potential bottlenecks, we analyze the communication flow for tensor transfer over the gRPC channel, as shown in Figure 3.2(b). TensorFlow uses a rendezvous protocol for tensor transmission: the TF (TensorFlow) sender always puts tensors in a local table, whereas the TF receiver actively requests a tensor only when needed. The default gRPC uses the sendmsg and recvmsg primitives for sending and receiving payloads. These primitives are useful for sending to or receiving from one or more buffers in a single function call; the payloads are constructed using Unix iovec structures. However, sendmsg internally copies all the data either into a pre-allocated buffer (for payloads of less than 2 KBytes) or into a newly allocated buffer. This extra copying and allocation of new memory can be a bottleneck for high-speed data transfer.


Figure 3.2 TensorFlow Payload Distribution and Communication Flow over gRPC channel
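As an aside, the scatter-gather interface used above can be illustrated with Python's socket.sendmsg, which mirrors the C sendmsg/iovec primitive; the sketch below uses a hypothetical peer address and buffer contents and is not part of gRPC itself.

    import socket

    # One payload handed to the transport as several separate buffers
    # (the Python analogue of a C iovec array).
    slices = [b"tensor-metadata", b"\x00" * 4096, b"tensor-chunk-tail"]

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("127.0.0.1", 50051))  # hypothetical peer

    # sendmsg() gathers all the buffers in a single call; inside gRPC's TCP
    # endpoint, the slices are first copied into one contiguous buffer, which
    # is the extra-copy overhead discussed above.
    sent = sock.sendmsg(slices)
    print("bytes written:", sent)
    sock.close()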

3.4 Characterization for the gRPC+Verbs Channel

Similarly, Figure 3.3(a) shows the payload distribution over gRPC+Verbs. Here, gRPC is responsible for administrative tasks, and the tensor transfers are done over Verbs, which is capable of RDMA. Figure 3.3(b) depicts the communication flow of tensor transfer over Verbs. The Verbs-based scheme writes all payloads by employing an RDMA Write operation. Figure 3.3(a) shows that the Verbs-based channel mostly sends 512-Byte chunk payloads. However, studies [34, 36] have shown that writing messages of this length using an RDMA rendezvous protocol is suboptimal. Also, using only RDMA Write for all payloads may not be the most efficient use of RDMA [36]. As shown in Figure 3.3(b), the TF sender and receiver maintain two message buffers, two ACK buffers, and many tensor buffers. These buffers are pre-pinned RDMA buffers. When the tensor size increases, the current buffer is discarded and a new, larger buffer is created and pinned. To request a tensor, the TF receiver sends a message to notify the TF sender. The TF sender first sends an ACK so that the TF receiver can mark the message buffer idle. Then the TF sender finds the tensor locally and places it in the corresponding RDMA tensor buffer for transmission. We notice that for a single tensor transfer, several RDMA writes are involved for flow control and payload transmission. We aim to design better protocols that minimize the number of RDMA operations to further improve the performance.

Figure 3.3 TensorFlow Payload Distribution and Communication Flow over gRPC+Verbs channel

3.5 Characterization for the gRPC+MPI Channel

In the gRPC+MPI channel, gRPC is still responsible for administrative operations, whereas the MPI channel (which is capable of RDMA) is used for transferring tensors. Figure 3.4(a) indicates a wide range of payloads, from 128 Bytes to 10 MBytes, over the MPI channel. In our experiments, the MPI channel needs a significant amount of tuning to achieve

acceptable TensorFlow performance. Figure 3.4(b) shows the communication flow for tensor exchange via MPI. A dedicated MPI thread handles all the MPI calls on both the sender and the receiver side. The TF receiver places tensor requests in a Request Queue, and the MPI thread sends the requests to the remote node using MPI_Isend. The remote node then forwards the request to the TensorFlow core, which finds the tensor in the local table. Afterwards, a callback places the tensor in the Send Queue of the MPI thread. The MPI thread checks for a new tensor request or tensor using MPI_Improbe and uses MPI_Mrecv (with MPI_ANY_SOURCE) to receive data. Based on studies [31, 32] in the HPC community, the default design of the MPI channel in TensorFlow has many bottlenecks. For example, the current communication flow relies heavily on the dedicated MPI thread, which can become a bottleneck due to multi-threading, locking, and context-switching overhead. The probing and blocking receives with MPI_ANY_SOURCE also incur overhead due to the internal wildcard matching logic in MPI. These are the reasons why we do not observe better performance for the MPI channel in Table 3.1. Although the message size distribution over gRPC remains almost the same for both the Verbs and MPI channels, interestingly, we observe that the gRPC+MPI channel has more message transmissions over gRPC. This is due to the additional gRPC messages needed to set up the different MPI ranks and assign tasks.


Figure 3.4 TensorFlow Payload Distribution and Communication Flow over gRPC+MPI channel

3.6 Characteristics of TensorFlow Workload over gRPC Channel

To further understand the pattern of the TensorFlow Deep Learning workload over the gRPC channel, we profile the iovec buffers transmitted during the training phase. For this experiment, we train the Resnet [22], VGG [44], AlexNet [29], and Inception [50] Deep Neural Networks (DNNs) available in the TensorFlow Convolutional Neural Network Benchmark [13] on Cluster A (see Section 5.3.1 for the specifics of this cluster). We deploy the TensorFlow cluster in Parameter Server mode across five nodes. Two of these nodes act as parameter servers, whereas the rest are workers. We use CPUs in the parameter servers and GPUs (for compute) in the workers. During training, we keep the batch size at 64. Figure 3.5 shows the common size-distribution patterns of iovec buffers that we observe in gRPC payloads. In Figure 3.5, Small, Medium, and Large indicate buffers that are a few Bytes, KBytes, and MBytes in length, respectively. As shown in the figure, a gRPC payload may contain a uniform distribution of such Small buffers. On the other hand, a lot of Large buffers and a few Small buffers may create a skewed distribution of buffers in one gRPC payload.

Figure 3.5 iovec Buffer Distribution Observed for TensorFlow training over gRPC

3.7 Summary

From the above analysis, we have the following key observations:

1) The training involves a wide range of message transfers. Communication optimization for large tensors (e.g., 10 MBytes for Resnet50) will reduce the training time; this is especially true if the DNN model is more complex.

2) The default designs of the three channels in TensorFlow still have bottlenecks in utilizing RDMA-capable high-performance networks, as discussed above.

3) For both the gRPC+Verbs and gRPC+MPI schemes, some messages (even though a small fraction) still go over the default, inefficient gRPC with TCP/IP. Also, both of these schemes need to maintain two separate communication runtimes co-existing in the same TensorFlow architecture. This may cause inefficient communication performance due to resource contention, unawareness of each other, and possible deadlocks.

4) None of these channels support adaptive communication for Deep Learning workloads with different message sizes.

5) The TensorFlow DL workload over the gRPC channel comprises iovec buffers of different lengths and patterns.

As we can see, to design a micro-benchmark suite that evaluates the DL communication substrate, we need to keep the distributed characteristics of TensorFlow in mind. Moreover, the workload profiling in Section 3.6 suggests that these data patterns need to be preserved to capture the essence of the DL workload. In the next chapter, we propose the design of a micro-benchmark suite to evaluate TensorFlow's primary communication engine, gRPC.

Furthermore, with all these different available channels, we see a clear challenge facing the TensorFlow community:

Can we propose a unified approach to have a single gRPC runtime in TensorFlow with adaptive and efficient RDMA protocols, which can resolve the bottlenecks mentioned above?

Although there are some initial attempts in the community to integrate gRPC with RDMA, the design of the existing version of RDMA-gRPC [12] is suboptimal for several reasons, such as the lack of one-sided RDMA operations, the absence of adaptive designs, interrupt-based signaling, etc. As we will see later in Section 5.3, that design is not suitable for the transmission patterns that Deep Learning applications demand. Therefore, in Chapter 5 we propose a highly optimized, adaptive, RDMA-based gRPC (i.e., AR-gRPC) that brings lower latency and higher throughput over high-speed interconnects.

Chapter 4: Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow

4.1 Introduction

Training Deep Learning models on TensorFlow may take significant time, ranging from several minutes to several hours, or even several days. Therefore, system researchers need to devote a lot of time to understanding the impact of communication on the overall performance. Thus, to quickly evaluate the impact of the underlying networks and communication subsystem on DL frameworks, we present the TF-gRPC-Bench micro-benchmark suite in this chapter. We focus on TensorFlow as the deep learning framework and gRPC as its communication substrate. To achieve our goal, we use the knowledge from our characterization (Chapter 3) of the TensorFlow workload during DL training over the gRPC communication engine. We design a set of micro-benchmarks that follow the distributed TensorFlow communication patterns. In addition, for each of these benchmarks, we propose different workload generation schemes that capture distributed TensorFlow's workload patterns over gRPC. Finally, we present the performance results of gRPC using the TF-gRPC-Bench micro-benchmark suite over different networks (such as Ethernet and IPoIB) and protocols (such as RDMA).

4.2 TensorFlow Deep Learning Micro-benchmarks for gRPC

In this section, we first discuss the design considerations for the desired micro-benchmarks.

Then, we discuss the actual design of our proposed TF-gRPC-Bench micro-benchmark suite.

4.2.1 Design Considerations

We use our analysis and observations from Chapter 3 to guide the design considerations for the TF-gRPC-Bench benchmarks. We essentially model the distributed TensorFlow communication pattern using only gRPC. The performance of gRPC can be measured in terms of latency, bandwidth, and throughput, and it can be significantly influenced by numerous factors, such as the number of parameter server and worker processes, the characteristics of the iovec buffers, whether the data is serialized or not, the nature of the underlying network, etc., as depicted in Figure 4.1.

Essentially, the efficiency of the network-intensive tensor transfers is determined by how fast the parameter servers and workers communicate. Based on these factors, we consider the following dimensions in the design of the TF-gRPC-Bench micro-benchmark suite.

Figure 4.1 Design Considerations for TF-gRPC-Bench Micro-benchmark

Number of Parameter Servers and Workers: As we have seen in Section 3.1, TensorFlow uses a Parameter Server architecture for distributed training and uses gRPC to communicate (e.g., for tensor transfer) among the remote processes. In such a deployment, the number of parameter servers and workers plays an important role in the overall training performance. For example, deploying only one parameter server for multiple workers may not be efficient. Similarly, deploying more parameter servers may also affect the training time adversely. Therefore, to capture this characteristic, our benchmark suite models the Parameter Server architecture solely using gRPC and provides the flexibility to tune the number of deployed parameter server and worker processes.

Distribution of iovec buffers: The default gRPC uses iovec buffers to construct a payload and uses the recvmsg and sendmsg primitives for communication. Section 3.6 provides insight into the distribution patterns of these buffers in a gRPC payload during TensorFlow training. To capture the characteristics of the TensorFlow deep learning workload, we need to consider these patterns when generating payloads for the micro-benchmarks.

Size and number of iovec buffers: The size of each individual buffer and the number of such buffers in one payload have a major impact on performance. For example, if the neural network built using TensorFlow is complex, with a large number of input parameters, the tensor updates between parameter servers and workers become increasingly involved. An increase in tensor size may translate into multiple large buffers in one gRPC payload. Thus, controlling these iovec buffers in the gRPC payload is of utmost importance when designing the benchmark. In TF-gRPC-Bench, we provide granular control over the buffer sizes and counts used to construct the gRPC payload.

Mode: gRPC uses the protocol buffer mechanism for serializing tensors. This serialization imposes a roughly constant overhead on the gRPC communication engine. Moreover, TensorFlow supports different types of tensor data, so the serialization time may vary, which can impact the performance of gRPC. However, to understand the true impact of the underlying network and communication engine on performance, eliminating the serialization overhead is crucial. Therefore, we consider both serialized and non-serialized modes in our micro-benchmark suite design. Since the impact of serialization on RPC frameworks is well studied [23, 35, 39, 47] in the community, in this work we primarily focus on the non-serialized mode.

Network Configuration: The most crucial operation in distributed TensorFlow training is the tensor update among different processes. These tensor updates result in many-to-many, network-intensive communication over gRPC. With the distributed training of large, convoluted deep neural networks, the tensor sizes also increase significantly. Therefore, different network configurations are important parameters to consider, especially when scaling out; this helps system researchers understand the impact of different networking interconnects and protocols on distributed training. TF-gRPC-Bench thus supports running over any network and cluster configuration.

Resource Utilization: During the process-to-process communication of distributed TensorFlow, the gRPC component has a major impact on various computing resources, such as CPU, memory, and network. With large tensor updates among many parameter servers and workers, the impact on these system resources increases significantly. Therefore, capturing the correlation among the utilization of different resources while performing network-intensive tensor updates over the gRPC channel is essential. Thus, our micro-benchmark suite provides the functionality to measure the utilization of different resources during the course of the tensor updates.

4.2.2 Design of TF-gRPC-Bench Micro-benchmark Suite

We take the above considerations into account to design our micro-benchmark suite, TF-gRPC-Bench. The design of the suite is depicted in Figure 4.2. Based on the user parameters, our benchmark suite first deploys a cluster in the Parameter Server architecture to exactly model the distributed TensorFlow communication pattern. We propose three different benchmarks to measure Point-to-Point latency, Point-to-Point bandwidth, and Parameter Server throughput.

TF-gRPC-Bench supports both serialized and non-serialized modes of payload transfer. For the serialized mode, we use gRPC's C++ language binding APIs to implement the benchmarks. To implement the non-serialized mode, we directly use gRPC's core C APIs to avoid any serialization overhead. Table 4.2 lists the parameters that can be configured in our benchmark suite.

Figure 4.2 TF-gRPC-Bench Micro-benchmark Design

TF-gRPC-P2P-Latency: This benchmark measures the Point-to-Point latency of payload transmission between a PS and a worker process. In this benchmark, the RPC procedure on the PS is an echo function that sends back the payload the worker sends; a sketch of this echo-based measurement is shown below. The processes can be deployed on the same node or on different nodes. Users can construct the payload to be similar to the deep learning workload pattern of TensorFlow over gRPC (discussed in more detail later in this section). In addition, users have the flexibility to choose the warm-up period, the total running period of the benchmark, etc., as indicated in Table 4.2.
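The sketch below is an illustration only; the actual suite is implemented on gRPC's C/C++ APIs, as noted above. It uses gRPC's Python binding with raw byte payloads (no protocol buffer schema) to mimic the non-serialized echo measurement, and the service name, method path, and port are hypothetical.

    import time
    from concurrent import futures

    import grpc

    # --- Parameter-server side: a raw-bytes echo service ---
    def _echo(request, context):
        # Echo the payload back unchanged, mirroring TF-gRPC-P2P-Latency's PS side.
        return request

    def start_ps(port=50051):
        handler = grpc.method_handlers_generic_handler(
            "Echo", {"Call": grpc.unary_unary_rpc_method_handler(_echo)})
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
        server.add_generic_rpc_handlers((handler,))
        server.add_insecure_port("[::]:%d" % port)
        server.start()
        return server

    # --- Worker side: time round trips of a DL-like payload ---
    def p2p_latency(target, payload, warmup_s=2, run_s=10):
        channel = grpc.insecure_channel(target)
        call = channel.unary_unary("/Echo/Call")  # bytes in, bytes out

        end = time.time() + warmup_s
        while time.time() < end:          # warm-up period (Table 4.2)
            call(payload)

        samples, end = [], time.time() + run_s
        while time.time() < end:          # measurement period
            t0 = time.perf_counter()
            call(payload)
            samples.append(time.perf_counter() - t0)
        return sum(samples) / len(samples)

    if __name__ == "__main__":
        server = start_ps()
        payload = b"\x00" * (1 << 20)     # e.g., one Large (1 MByte) buffer
        print("avg RTT: %.3f ms" % (p2p_latency("localhost:50051", payload) * 1e3))
        server.stop(0)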

TF-gRPC-P2P-Bandwidth: In this benchmark, we measure the Point-to-Point bandwidth. The worker invokes the remote procedure with a user-defined payload, and the PS acknowledges the worker's request. Similar to the previous benchmark, users have the flexibility to construct the payload, define the warm-up period, select the total running time, etc. This benchmark reports the bandwidth in MBytes per second.

TF-gRPC-PS-Throughput: This benchmark measures the throughput of the Parameter Server architecture over gRPC. Users can deploy multiple parameter server and worker processes, on the same node or on different nodes. The performance of the whole system is measured by the aggregate throughput, in terms of the number of remote procedures invoked by the workers per second. Each worker invokes remote procedures on all the parameter servers. This is necessary because, in distributed TensorFlow, each parameter server manages a certain portion of the variables; hence, all the workers need to communicate with all the parameter servers to perform updates to the entire variable set. Similar to the previous benchmarks, users have the option to choose the payload pattern, warm-up period, total running time, etc.
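A similarly hedged Python sketch of the throughput idea: each worker holds one stub per parameter server, and every RPC round touches all of them, mirroring the variable partitioning described above. It assumes echo services like the one sketched earlier are already running on the (hypothetical) parameter server ports.

    import threading
    import time

    import grpc

    RUN_S = 10  # measurement window in seconds

    def worker_throughput(ps_targets, payload, run_s=RUN_S):
        """Count the RPCs one worker completes against every parameter server."""
        channels = [grpc.insecure_channel(t) for t in ps_targets]
        calls = [ch.unary_unary("/Echo/Call") for ch in channels]  # one stub per PS
        done = 0
        end = time.time() + run_s
        while time.time() < end:
            for call in calls:        # touch every parameter server per round,
                call(payload)         # as each PS owns part of the variable set
                done += 1
        return done

    if __name__ == "__main__":
        ps_targets = ["localhost:50051", "localhost:50052"]   # hypothetical PSes
        payload = b"\x00" * (10 * 1024)
        counts = []
        threads = [threading.Thread(
                       target=lambda: counts.append(worker_throughput(ps_targets, payload)))
                   for _ in range(3)]  # three workers, as in Section 4.3.5
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("aggregate throughput: %.1f RPC/s" % (sum(counts) / float(RUN_S)))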

For each of the above benchmarks, the characteristics of the payload are of utmost importance: the essence of the TensorFlow deep learning workload must be captured in all the benchmark payloads. Our benchmark suite enables users to generate payloads containing Small, Medium, and Large buffers in any pattern. Table 4.1 presents the acceptable buffer size range for the Small, Medium, and Large categories. In addition to customized payload generation, our micro-benchmark suite provides the flexibility of automatic payload generation with little or no input from the users. These payloads are generated by taking the buffer patterns observed in Section 3.6 into consideration. Users have the option to choose any of the payload generation schemes described below.

Table 4.1 iovec Buffer Size Category

Category   Default Value   Value Range
Small      10 Bytes        [1 Byte - 1 KBytes)
Medium     10 KBytes       [1 KBytes - 1 MBytes)
Large      1 MByte         [1 MBytes - 10 MBytes]

Uniform: In this payload generation scheme, users can choose to construct the gRPC payload such that the iovec buffers are distributed uniformly. Users also have the flexibility to choose Small, Medium, or Large buffers, or a combination of them in any order. The default sizes of Small, Medium, and Large buffers are 10 Bytes, 10 KBytes, and 1 MByte, respectively, although these values are user tunable.

Random: In this scheme, the buffers are distributed randomly in a gRPC payload. By default, all three buffer categories are used; however, users can choose any types of buffers (at least two) to automatically generate payloads under this scheme.

Skew: This payload generation scheme distributes the iovec buffers unevenly in a gRPC payload. Users need to construct the payload with at least two different buffer categories; by default, this scheme chooses all three buffer types. The scheme biases the distribution towards Large buffers, because Large buffers are more important for deep learning workloads. For example, if users choose all three buffer categories, then one payload will have 60% Large buffers, 30% Medium buffers, and 10% Small buffers. Users also have the option to generate payloads biased towards Small or Medium buffers instead. A sketch of these three payload generation schemes is shown below.
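The following minimal Python sketch illustrates the three workload generation schemes. The real suite constructs iovec arrays for gRPC's core C API; here the buffer sizes follow the defaults in Table 4.1 and the 60/30/10 split described above, and the helper name is ours.

    import random

    # Default buffer sizes per category (Table 4.1).
    SIZES = {"small": 10, "medium": 10 * 1024, "large": 1 * 1024 * 1024}

    def make_payload(scheme, count=10, categories=("small", "medium", "large")):
        """Return a list of byte buffers that together form one gRPC payload."""
        if scheme == "uniform":
            # Buffers of the chosen categories distributed evenly, in order.
            cats = [categories[i % len(categories)] for i in range(count)]
        elif scheme == "random":
            cats = [random.choice(categories) for _ in range(count)]
        elif scheme == "skew":
            # Biased towards Large buffers: 60% large, 30% medium, 10% small.
            cats = (["large"] * int(0.6 * count) +
                    ["medium"] * int(0.3 * count) +
                    ["small"] * int(0.1 * count))
            cats += ["large"] * (count - len(cats))  # round up with Large buffers
        else:
            raise ValueError("unknown scheme: %s" % scheme)
        return [b"\x00" * SIZES[c] for c in cats]

    buffers = make_payload("skew")
    print(len(buffers), "buffers,", sum(len(b) for b in buffers), "bytes total")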

4.3 Performance Evaluation

In this section, we present comprehensive results from the TF-gRPC-Bench micro-benchmark suite. For each benchmark, we use the different workload generation schemes available in the suite and evaluate the performance of gRPC on different clusters and network interconnects. For workload generation, we use the default configurations unless specified otherwise. Each payload is constructed using all three buffer categories (Table 4.1) with their default sizes and count (the default is ten buffers in total per payload), and we use all three payload generation schemes for the distribution of the buffers.

Therefore, the skewed distribution generates the largest payload, as it contains more Large buffers. We run all experiments (with the default warm-up and running times) five times and report the average results. The experiment in Section 4.3.2 uses the serialized mode, whereas the rest of the experiments use the non-serialized mode. In all our experiments, we use gRPC version 1.5. For evaluating RDMA, we use our RDMA-based gRPC design (discussed in Chapter 5).

4.3.1 Experimental Setup

(1) Cluster A: Intel Broadwell Cluster (RI2-IB-EDR): The RI2 cluster comprises 20 nodes. Each node is provisioned with dual fourteen-core Intel Broadwell (E5-2680-v4) processors, an NVIDIA Tesla K80 GPU, 512 GB of memory, a Mellanox IB EDR (100 Gbps) HCA, and 40G Ethernet. The host processors run CentOS release 7.2.

Table 4.2 Configurable Parameters for TF-gRPC-Bench Micro-benchmark Suite

Parameter                    | Default Value                             | Value Range                                  | Description
Benchmark                    | TF-gRPC-P2P-Latency                       | TF-gRPC-P2P-Latency / Bandwidth / Throughput | Selects a benchmark
IP                           | localhost                                 | Valid IP range                               | Configures the IP of the parameter servers
Port                         | 50001                                     | Valid port range                             | Configures the port of the parameter servers
Number of parameter servers  | 1                                         | No limit                                     | Controls the number of parameter servers
Number of workers            | 1                                         | No limit                                     | Controls the number of workers
Mode                         | Non-Serialized                            | Non-Serialized / Serialized                  | Controls the payload serialization mode
Workload generation scheme   | Uniform                                   | Uniform, Random, Skew                        | Generates the payload using the given pattern
iovec buffer count           | 10                                        | No limit                                     | Controls the number of iovec buffers in a payload
iovec buffers' size          | All three categories with default values  | Depends on the benchmark                     | Controls the size of buffers in a payload
Warmup time                  | 2 sec                                     | No limit                                     | Controls warmup seconds for the benchmark
Total running time           | 10 sec                                    | No limit                                     | Controls total running time for the benchmark

(2) Cluster B: SDSC Comet (SDSC-Comet-IB-FDR): The Comet supercomputing system at SDSC has 1,984 compute nodes. Each node is provisioned with dual twelve-core Intel Haswell (E5-2680-v3) processors, 128 GB of memory, a 320 GB local SSD, a Mellanox IB FDR (56 Gbps) HCA, and 10G Ethernet. The host processors run CentOS release 6.7.

4.3.2 TF-gRPC-P2P-Latency (Serialized Mode)

First, we evaluate the Point-to-Point latency of gRPC when the data serialization mode is enabled on Cluster A. Figure 4.3 presents the evaluation results for a 64 KBytes payload over different communication interconnects. As expected, this figure suggests that the gRPC serialization overhead is constant irrespective of the underlying network. This experiment also shows that, for a 64 KBytes payload, the Point-to-Point latency is almost the same for 40G Ethernet and IPoIB on Cluster A. However, RDMA reduces the Point-to-Point latency by about 40% compared to 40G Ethernet and IPoIB.


Figure 4.3 TF-gRPC-P2P-Latency (Serialized Mode) Evaluation on Cluster A with 64KBytes Payload

4.3.3 TF-gRPC-P2P-Latency (Non-serialized Mode)

Next, we evaluate the performance of gRPC in terms of Point-to-Point latency in the non-serialized mode. Figures 4.4(a) and 4.4(b) depict the results of the TF-gRPC-P2P-Latency benchmark for different workloads on Cluster A and Cluster B, respectively.

Figure 4.4 TF-gRPC-P2P-Latency (Non-serialized Mode): (a) Evaluation on Cluster A; (b) Evaluation on Cluster B

On both clusters, we observe that the latency is higher for the skewed distribution scheme. This is expected, as the default skewed payload generation scheme is biased towards Large iovec buffers. We also observe that RDMA performs better for all the payload distribution schemes on both clusters. For example, on Cluster A, for skewed payloads, RDMA reduces the latency by 59% and 56% compared to 40G Ethernet and IPoIB, respectively. Similarly, on Cluster B, RDMA reduces the latency by 78% compared to 10G Ethernet and by 69% compared to IPoIB. In addition, we see that IPoIB (FDR, 56 Gbps) performs almost 27% better than 10G Ethernet on Cluster B.

Moreover, in Figure 4.5, we compare the latency of gRPC over IPoIB and RDMA on Cluster A with different uniformly generated payloads. We use only Large buffers (1 MByte each) and vary the iovec buffer count (from two to ten) to generate the payloads. Clearly, RDMA outperforms IPoIB for all payloads. In addition, IPoIB scales poorly with an increasing buffer count per payload (i.e., a larger total payload size).


Figure 4.5 TF-gRPC-P2P-Latency (Non-serialized Mode) Evaluation on Cluster A for Different iovec Counts

4.3.4 TF-gRPC-P2P-Bandwidth (Non-serialized Mode)

In this section, we present the results of the TF-gRPC-P2P-Bandwidth benchmark. Figure 4.6(a) shows the bandwidth obtained, in MBytes per second, for different payload generation schemes on Cluster A; similarly, Figure 4.6(b) presents the results of running the benchmark on Cluster B. On Cluster A, we again observe that gRPC achieves similar bandwidth when the underlying network is either 40G Ethernet or IPoIB, whereas with RDMA, gRPC achieves 2.14x the bandwidth of IPoIB for the skewed distribution scheme. Figure 4.6(a) suggests that for the random payload generation scheme, IPoIB achieves slightly less bandwidth than 40G Ethernet; this can be attributed to the random nature of the payload generation scheme. As expected, RDMA also outperforms the others on Cluster B. For example, we observe that RDMA achieves 3.2x the bandwidth of IPoIB for skewed data.

Figure 4.6 TF-gRPC-P2P-Bandwidth (Non-serialized Mode): (a) Evaluation on Cluster A; (b) Evaluation on Cluster B

4.3.5 TF-gRPC-PS-Throughput (Non-serialized Mode)

We show the performance of gRPC in terms of throughput in this section. Figures 4.7(a) and 4.7(b) present the results of running the TF-gRPC-PS-Throughput benchmark with two parameter servers and three workers on Cluster A and Cluster B, respectively. Analyzing the results of this benchmark is of utmost importance, as it essentially mimics the TensorFlow communication pattern. The figures show the throughput measured in terms of the total number of RPC calls, invoked by all the workers, per second. Figure 4.7(a) indicates that, on Cluster A, gRPC achieves 4.1x and 3.43x speedups for the uniform payload generation scheme when RDMA is used, compared to 40G Ethernet and IPoIB, respectively. On Cluster B, as Figure 4.7(b) suggests, RDMA-based gRPC again achieves the best performance; for example, gRPC achieves a 5.9x speedup with RDMA compared to 10G Ethernet.

These results indicate that if gRPC uses RDMA, TensorFlow can perform the most efficient tensor transfers and can therefore achieve lower training time. Moreover, using IPoIB provides better performance than 40G or 10G Ethernet.

[Figure: RPC calls/second vs. payload generation scheme (Uniform, Random, Skew); (a) Cluster A: 40G Ethernet, IPoIB, RDMA; (b) Cluster B: 10G Ethernet, IPoIB, RDMA]

Figure 4.7 TF-gRPC-PS-Throughput (Non-serialized Mode)

4.4 Related Work

Remote Procedure Call has come a long way since its inception. The first RPC implementation was presented by Birrell and Nelson [19]. Since then, RPC has evolved, and the community has proposed many open-source high-performance RPC systems such as gRPC [7], Avro [6], and Thrift [5], to name a few. Over the past years, the community has also proposed a number of benchmarks for evaluating RPC frameworks. For example, Lu et al. propose a micro-benchmark suite for Hadoop RPC [38].

Deep Learning is gaining a lot of attention due to its popularity in the Artificial Intelligence domain. As a result, benchmarking deep learning applications on different hardware platforms has become important. The official TensorFlow community provides the TensorFlow Convolutional Neural Network benchmark [13] for measuring distributed TensorFlow performance. Baidu Research proposes DeepBench [9] primarily to benchmark operations that are important to deep learning on different hardware platforms.

Workload generation is an important part of designing efficient deep learning benchmarks. For example, Wang et al. propose BigDataBench [52], a benchmark suite for Big Data computing that covers typical Internet service workloads and provides representative data sets and data generation tools. BigDataBench version 4 [14] has support for workload generation for different deep learning frameworks.

To the best of our knowledge, a micro-benchmark suite that evaluates the deep learning communication substrate in a stand-alone manner is not available in the community.

In this work, we propose TF-gRPC-Bench, a micro-benchmark suite for gRPC that takes the characteristics of deep learning workloads into account. This benchmark suite is designed especially from the perspective of system researchers who are focused solely on improving the communication substrate.

4.5 Summary

In this chapter, we propose the TF-gRPC-Bench micro-benchmark suite to measure the performance of gRPC over different network interconnects and protocols. We introduce benchmarks to measure Point-to-Point latency, Point-to-Point bandwidth, and Parameter Server throughput that model the distributed TensorFlow communication pattern. Moreover, we analyze the workload characteristics of distributed TensorFlow and design workloads for our benchmarks that capture these deep learning workload characteristics. TF-gRPC-Bench also provides users the flexibility to configure various parameters, including (but not limited to) the payload distribution and size and the number of parameter servers and workers. This benchmark suite dramatically reduces experimentation time and thus may help system researchers quickly evaluate novel communication protocols over different interconnects for deep learning. In the next chapter, we discuss the design of our RDMA-based gRPC, its impact on TensorFlow, and the corresponding evaluation results.

Chapter 5: Accelerating TensorFlow with Adaptive RDMA-based gRPC (AR-gRPC)

5.1 Introduction

Distributed TensorFlow supports various channels (see Table 1.1) to efficiently transfer tensors, such as gRPC over TCP/IP, gRPC+Verbs, and gRPC+MPI. The Verbs- and MPI-based channels are capable of RDMA operations and were introduced in TensorFlow primarily for high-performance communication. However, none of these channels is optimal for Deep Learning workloads. Also, in all these different solutions, gRPC is responsible for at least the administrative tasks, such as establishing the RDMA path, exchanging computation graphs, etc. Therefore, since gRPC is a compulsory component of TensorFlow, it makes more sense to bring RDMA capability directly into the gRPC runtime. This allows TensorFlow to automatically benefit from an RDMA-enhanced gRPC. Through our characterization in Chapter 3, we find many critical bottlenecks in all the existing channels in TensorFlow. We use these observations to guide the design of a new gRPC runtime, called AR-gRPC. In AR-gRPC, we propose designs such as hybrid RDMA communication protocols, message pipelining and coalescing, and zero-copy transmission to make our runtime adaptive to different message sizes for DL workloads (see Section 5.2). From our performance evaluations, we see that our proposed design can speed up gRPC performance by up to 4.1x and 2.3x compared to the default gRPC on IPoIB and the public RDMA-based gRPC, respectively. By integrating AR-gRPC with TensorFlow, we achieve up to 3x performance speedup for distributed training over the default gRPC-IPoIB based TensorFlow (see Section 5.3).

Through our proposed AR-gRPC design, TensorFlow can run with the gRPC channel alone and still achieve optimal performance. We believe that our proposed design will significantly reduce the maintenance work for the TensorFlow community as well as simplify usage for end users. AR-gRPC can also benefit other applications or systems that use gRPC.

5.2 Proposed Design of AR-gRPC

In this section, we present AR-gRPC, which brings high-performance RDMA-based communication over InfiniBand and RoCE. First, we discuss the key components of the AR-gRPC architecture in Section 5.2.1. Then, in Section 5.2.2, we discuss the associated optimizations for achieving high performance.

5.2.1 Architecture Overview of AR-gRPC

In AR-gRPC, we revamp the communication layer of the default gRPC architecture. We propose novel RDMA Endpoints that achieve low latency and high throughput. Figure 5.1(a) shows an overview of the AR-gRPC engine architecture.

RDMA-Endpoint: RDMA-Endpoint extends the core communication abstraction (i.e., Endpoint) in the default gRPC design and encapsulates an RDMA connection between an RDMA client and server. It provides functionalities such as write (RDMA-Endpoint-Write), read (RDMA-Endpoint-Read), and polling (RDMA-Polling). RDMA-Endpoint complies seamlessly with the default gRPC Endpoint architecture.
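As a rough illustration of how an RDMA-capable endpoint can plug into an Endpoint-style abstraction, the following C++ sketch defines a simplified interface with write and read hooks. This is a hypothetical simplification for exposition only; the actual gRPC core Endpoint is a C-style vtable with additional callbacks (pollsets, shutdown, etc.), and the class and method names below are not AR-gRPC's.

    #include <cstddef>

    // Simplified Endpoint-style interface (hypothetical; for exposition only).
    class Endpoint {
     public:
        virtual ~Endpoint() = default;
        virtual void Write(const void *buf, size_t len) = 0;   // send serialized protobuf bytes
        virtual void Read(void *buf, size_t len) = 0;          // deliver received bytes upward
    };

    // RDMA-Endpoint: same interface, backed by pinned buffers and an RDMA connection.
    class RdmaEndpoint : public Endpoint {
     public:
        void Write(const void *buf, size_t len) override {
            // RDMA-Endpoint-Write: copy/serialize into a pinned buffer, then post an
            // eager send or expose the buffer for a remote RDMA READ (Section 5.2.2).
            (void)buf; (void)len;
        }
        void Read(void *buf, size_t len) override {
            // RDMA-Endpoint-Read: build the payload from the pinned buffer handed
            // over by RDMA-Polling and pass it to deserialization.
            (void)buf; (void)len;
        }
    };

Because the RDMA implementation sits behind the same interface shape, the upper gRPC layers remain unaware of whether bytes travel over sockets or RDMA.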

[Figure: (a) Architecture of AR-gRPC: the gRPC caller's protobuf requests/responses pass through serialization/deserialization into the AR-gRPC core, which uses RDMA-Endpoint-Write/Read with RDMA buffers, RDMA-Polling, a global buffer pool, and the RDMA communication engine library over RDMA/InfiniBand/RoCE; (b) Communication flow between the gRPC caller and the gRPC server using RDMA Write/Read]

Figure 5.1 Overview of AR-gRPC and the Corresponding Communication in TensorFlow

RDMA-Endpoint-Write: As shown in Figure 5.1(a), serialized protobuf messages from the application layer are sent to the remote process via RDMA-Endpoint-Write. RDMA-Endpoint-Write uses a pinned RDMA buffer for the message transfer.

RDMA-Polling: Given the prevalence of multi-core processors on modern clusters, we dedicate one or more cores to RDMA completion queue polling. We employ a "busy polling" completion-detection strategy, that is, we repeatedly poll the completion queue until a completion (for sending or receiving a message) becomes available. In this way, core resources are used efficiently, which also helps achieve low-latency sends and receives over the network. All new incoming messages are kept in a global RDMA buffer pool. Once the polling engine detects a newly received message, it triggers RDMA-Endpoint-Read to consume the message.
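A minimal sketch of this busy-polling strategy, assuming libibverbs and a dedicated polling thread; the two completion handlers are hypothetical stand-ins for handing data to RDMA-Endpoint-Read and recycling send buffers.

    #include <infiniband/verbs.h>

    // Hypothetical handlers: the receive path would hand the pinned buffer to
    // RDMA-Endpoint-Read, and the send path would recycle the buffer to the pool.
    static void handle_recv_completion(const ibv_wc &wc) { (void)wc; }
    static void handle_send_completion(const ibv_wc &wc) { (void)wc; }

    // A dedicated core spins on the completion queue instead of waiting for events.
    void rdma_polling_loop(ibv_cq *cq) {
        ibv_wc wc;
        for (;;) {
            int n = ibv_poll_cq(cq, 1, &wc);       // non-blocking poll
            if (n < 0) break;                      // polling error
            if (n == 0) continue;                  // nothing yet; keep spinning
            if (wc.status != IBV_WC_SUCCESS) continue;
            if (wc.opcode & IBV_WC_RECV)
                handle_recv_completion(wc);        // new message in the global buffer pool
            else
                handle_send_completion(wc);        // send finished; reclaim pinned buffer
        }
    }

Pinning a core this way trades CPU cycles for latency; an event-driven completion channel would save CPU but add wake-up latency on every message.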

RDMA-Endpoint-Read: Messages received by the RDMA-Polling thread are given to the RDMA-Endpoint-Read handler. The RDMA-Endpoint-Read handler then constructs an application payload from the contents in the RDMA pinned buffer. Afterwards, it sends the payload to the upper layer, where it is subsequently deserialized and consumed by the application.

5.2.2 Adaptive RDMA-based Communication

As shown in Chapter 3, Deep Learning workloads on TensorFlow involve many message transfers of various sizes. All the existing designs in the different TensorFlow channels transfer messages in a fixed manner, which is not fully optimized. In this work, we propose adaptive RDMA-based communication schemes as follows.

Hybrid Communication: We choose the eager protocol for small messages (with two-sided RDMA Send/Recv) and the rendezvous protocol (with one-sided RDMA READ from the remote side) for the rest. The eager threshold on the message size is auto-tuned based on the underlying architecture. Moreover, this parameter is user tunable for added flexibility. Our proposed design uses RDMA operations more efficiently than the Verbs-based channel discussed in Section 3.4. This is mainly because of three reasons: 1) Our design can adapt to the message sizes and automatically choose the proper protocol, while the communication protocol in the default Verbs-based channel of TensorFlow is fixed. 2) We choose RDMA Read as it requires sending only an RTS message before the receiver reads from the remote memory. In contrast, the default Verbs-based channel in TensorFlow uses an RDMA Write based protocol, which needs to send multiple control messages (e.g., RTS and CTS) before writing to the remote memory. 3) Our design decouples buffer management from RDMA communication, which gives the best flexibility, while the default Verbs channel in TensorFlow has tightly coupled message buffers and ACK buffers for each RDMA connection [15]. Figure 5.1(b) represents the communication flow of a tensor transfer when AR-gRPC is used in TensorFlow. As we can see from the figure, small payloads are transferred to the remote side using the eager protocol, while large payloads (especially when sending the requested tensor) are chunked and transferred using non-blocking one-sided RDMA READ in a pipelined fashion (as discussed below).
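To illustrate the rendezvous path, the sketch below shows the receiver issuing a non-blocking one-sided RDMA READ after receiving an RTS control message. The rts_desc layout and the surrounding connection state are assumptions made for exposition; the ibverbs work-request fields are the standard ones.

    #include <infiniband/verbs.h>
    #include <cstdint>

    // Hypothetical RTS descriptor carried in the small eager control message:
    // where the sender's pinned data lives and the key needed to read it remotely.
    struct rts_desc {
        uint64_t remote_addr;   // sender-side virtual address of the (chunk of the) tensor
        uint32_t rkey;          // remote key from the sender's ibv_reg_mr()
        uint32_t len;           // chunk length
    };

    // Receiver side: post a non-blocking one-sided RDMA READ into a local pinned buffer.
    int post_rdma_read(ibv_qp *qp, const rts_desc &rts, void *local_buf, uint32_t lkey) {
        ibv_sge sge;
        sge.addr   = reinterpret_cast<uintptr_t>(local_buf);
        sge.length = rts.len;
        sge.lkey   = lkey;

        ibv_send_wr wr = {};
        wr.opcode              = IBV_WR_RDMA_READ;   // one-sided read from remote memory
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  // completion appears in the CQ
        wr.wr.rdma.remote_addr = rts.remote_addr;
        wr.wr.rdma.rkey        = rts.rkey;

        ibv_send_wr *bad = nullptr;
        return ibv_post_send(qp, &wr, &bad);         // non-blocking; RDMA-Polling sees completion
    }

Several such reads can be posted back-to-back for consecutive chunks, which is what enables the pipelined transfer of a large tensor described next.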

Message Pipelining and Coalescing: As discussed in Section 3.3, due to different sizes of tensors and control messages, the gRPC channel needs to handle the underlying iovec buffers (asymmetric in size) properly to get the best communication performance. We run multiple DL models (as discussed in Section 5.3) to identify the common patterns for these iovec buffers. Figure 3.5 shows the different patterns we observe during our experiments.

A naive design choice for RDMA-Endpoint-Write is to copy all these iovec buffers into a pinned RDMA buffer and then issue a blocking send operation (i.e., wait until completion). Even though this design can use RDMA, it suffers from a major drawback: if the application payload size is large, RDMA-Endpoint-Write will block for a long time, causing the achieved latency to be sub-optimal. To resolve this issue, we chunk large buffers into smaller payloads using a configurable threshold. The default chunk size is auto-tuned based on an architecture-aware algorithm. After the payload is chunked into smaller pieces, each of them is copied into the RDMA pinned buffer (we discuss how to avoid this copy later). In order to achieve efficient pipelined transfers, we send out these pieces to the remote process by leveraging a non-blocking rendezvous (RDMA Read by the remote process) protocol, as shown in Figure 5.1(b). The multiple non-blocking sends can saturate the network bandwidth as much as possible, which suits large-message transfers. On the receiver side, there is one major challenge for large message transfers: when the engine should trigger RDMA-Endpoint-Read to consume the received chunks. One solution is to wait for all the chunks to arrive and then trigger RDMA-Endpoint-Read. However, this solution prevents RDMA-Endpoint-Read from consuming partially arrived messages, which means RDMA-Endpoint-Read completely blocks until the entire message has arrived. To mitigate this issue, we devise a non-blocking RDMA-Endpoint-Read, which triggers receiving callbacks as soon as it receives a chunk of a message. Our design ensures that the order of each chunk in the message is preserved. This mechanism enables highly concurrent reception of an entire large message. The above design is suitable for transmitting large iovec buffers. When gRPC application payloads contain many small iovec buffers, sending these buffers individually would not be optimal. Instead, we coalesce them (up to the eager send threshold), maintaining their relative order, into a pinned RDMA buffer and send them using the eager protocol. For small message transfers, the eager protocol performs better than rendezvous [36] because it sends both control messages and payloads together. This increases the efficiency of our design.

Algorithm 1: Adaptive RDMA EndPoint Write

    function RDMA_EndPoint_Write()
        eager_thr <- detect_best_eager_thr()
        chunk_sz  <- detect_best_chunk_sz()
        acc_sz    <- 0
        for i = 0; i < iov_sz; i++ do
            if acc_sz + iov[i].sz <= eager_thr then
                accumulate(pinned_buff, iov[i], acc_sz)
            else if acc_sz != 0 then
                eager_send(pinned_buff, dst)
                acc_sz <- 0
                i--                      // consider the current buffer in the next iteration
            else
                RDMA_chunk_send(iov[i], chunk_sz, dst)
            end
        end
        if acc_sz != 0 then
            eager_send(pinned_buff, dst)
        end
    end

For the reader's better understanding, we present the pseudocode of our adaptive RDMA-based Endpoint Write in Algorithm 1. The function 'accumulate' accumulates the current iovec's data in a pinned buffer ('pinned_buff') and updates the current accumulated size ('acc_sz'). The communication functions in the algorithm are non-blocking.
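To make Algorithm 1 concrete, here is a minimal C++ rendering under the same structure; detect_best_eager_thr, detect_best_chunk_sz, eager_send, and rdma_chunk_send are stubs standing in for the corresponding AR-gRPC routines, and the tuned values shown are assumed placeholders.

    #include <sys/uio.h>
    #include <cstddef>
    #include <cstring>

    // Stub helpers standing in for the AR-gRPC routines referenced in Algorithm 1.
    static size_t detect_best_eager_thr() { return 8 * 1024; }    // assumed auto-tuned value
    static size_t detect_best_chunk_sz()  { return 1 << 20; }     // assumed auto-tuned value
    static void eager_send(const char *pinned, size_t len, int dst) { (void)pinned; (void)len; (void)dst; }
    static void rdma_chunk_send(const iovec &iov, size_t chunk_sz, int dst) { (void)iov; (void)chunk_sz; (void)dst; }

    // Adaptive write: coalesce small iovec buffers into one pinned buffer (eager
    // protocol) and hand large ones to the chunked, pipelined rendezvous path.
    void rdma_endpoint_write(const iovec *iov, size_t iov_cnt, char *pinned_buff, int dst) {
        const size_t eager_thr = detect_best_eager_thr();
        const size_t chunk_sz  = detect_best_chunk_sz();
        size_t acc_sz = 0;

        for (size_t i = 0; i < iov_cnt; ++i) {
            if (acc_sz + iov[i].iov_len <= eager_thr) {
                // accumulate(): copy this iovec behind the already coalesced data
                std::memcpy(pinned_buff + acc_sz, iov[i].iov_base, iov[i].iov_len);
                acc_sz += iov[i].iov_len;
            } else if (acc_sz != 0) {
                eager_send(pinned_buff, acc_sz, dst);    // flush coalesced small buffers
                acc_sz = 0;
                --i;                                     // reconsider current buffer next iteration
            } else {
                rdma_chunk_send(iov[i], chunk_sz, dst);  // large buffer: chunked rendezvous send
            }
        }
        if (acc_sz != 0)
            eager_send(pinned_buff, acc_sz, dst);        // flush any remaining coalesced data
    }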

Zero-Copy Transmission: Even though the above designs significantly improve the performance of gRPC, we notice that memory copying of large messages to the RDMA pinned buffer, and vice versa, becomes a bottleneck. This extra copy exists because there is a serialization and deserialization layer between TensorFlow and gRPC. To achieve zero-copy transmission, we need a transparent approach to remove this copy. By analyzing the TensorFlow design, we find that when TensorFlow sends a tensor, it first allocates a gRPC byte buffer, which can be directly backed by an RDMA pinned buffer. In this way, we do not need to change any code in TensorFlow; we only make small changes in the gRPC runtime to pick up an RDMA pinned buffer from the buffer pool and return it to TensorFlow. During tensor transmission, gRPC directly serializes the tensor into the RDMA buffer without any copy. Similarly, RDMA-Endpoint-Read also leverages the zero-copy design by constructing the application payload directly from the RDMA pinned buffer instead of allocating new memory and copying the content.
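A minimal sketch of the zero-copy idea, assuming the gRPC core slice API (grpc_slice_new) and libibverbs; buffer-pool recycling and memory-region deregistration are omitted for brevity, and in AR-gRPC the pinned buffer would come from the global pool rather than malloc.

    #include <grpc/slice.h>
    #include <infiniband/verbs.h>
    #include <cstdlib>

    // Destroy callback invoked when gRPC drops the last reference to the slice.
    // AR-gRPC would return the buffer to its pool (keeping the MR registered);
    // here we simply free it.
    static void release_pinned(void *p) { std::free(p); }

    // Hand the caller a byte buffer whose backing store is RDMA-registered memory,
    // so serialization writes directly into the buffer the NIC will read from.
    grpc_slice make_pinned_slice(ibv_pd *pd, size_t len, ibv_mr **mr_out) {
        void *buf = std::malloc(len);
        *mr_out = ibv_reg_mr(pd, buf, len,
                             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
        return grpc_slice_new(buf, len, release_pinned);   // no extra copy at send time
    }

Because the slice already points at registered memory, the eager or rendezvous send can operate on it directly, avoiding the staging copy into a separate pinned buffer.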

5.3 Performance Evaluation

This section presents detailed performance evaluations of AR-gRPC and its integration with TensorFlow. Broadly, our evaluations answer the following questions: (1) What is the improvement in the performance of AR-gRPC compared to other gRPC designs? (2) How much benefit can TensorFlow extract by using AR-gRPC?

5.3.1 Experimental Setup

We use the following two clusters in our evaluation:

(1) Cluster A: Intel Broadwell Cluster (RI2-IB-EDR): We use up to twelve nodes on the RI2 cluster. Each node is provisioned with dual fourteen-core Intel Broadwell (E5-2680-v4) processors, an NVIDIA Tesla K80 GPU, 512 GB of memory, and a Mellanox IB EDR (100 Gbps) HCA. The host processors run CentOS release 7.2.

(2) Cluster B: SDSC Comet (SDSC-Comet-IB-FDR): We use up to four nodes on this cluster. Each node is provisioned with dual twelve-core Intel Haswell (E5-2680-v3) processors, 128 GB of memory, and a Mellanox IB FDR (56 Gbps) HCA. The host processors run CentOS release 6.7.

In all of our experiments, we use gRPC 1.5; AR-gRPC is also based on this version. The public RDMA-gRPC [12] is based on gRPC r0.14. gRPC is evaluated on both Cluster A and Cluster B; however, the public RDMA-gRPC fails to run on Cluster B, as it hangs due to a race condition in its code. We use TensorFlow 1.3 in our experiments on Cluster A. We were unable to carry out TensorFlow experiments on Cluster B due to GLIBC dependency issues. As Cluster B is a public cluster, we do not have permission to install the required libraries.

5.3.2 Evaluation of gRPC

We implement three RPC micro-benchmarks [20] to evaluate different gRPC designs.

These benchmarks are: (1) Point-to-Point Latency, (2) Single Server, Multiple Clients Throughput, and (3) Performance Comparison in a Fully-Connected Architecture. We use the gRPC C++ APIs to design these benchmarks. We run the benchmarks 1K times and report the average results. Note that in all the experiments the default Socket-based gRPC runs over IPoIB for a fair comparison.
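As a rough sketch of how such a Point-to-Point latency measurement can be structured with the gRPC C++ synchronous API, consider a hypothetical Echo service whose request and response carry a payload of the chosen size; the proto file, service, method, and field names below are illustrative assumptions, not the actual TF-gRPC-Bench code.

    #include <grpcpp/grpcpp.h>
    #include <chrono>
    #include <string>
    #include "echo.grpc.pb.h"   // hypothetical proto: service Echo { rpc Ping(Payload) returns (Payload); }

    // Measure average round-trip latency (microseconds) for a payload size over 'iters' RPCs.
    double measure_latency(const std::string &target, size_t payload_bytes, int iters = 1000) {
        auto channel = grpc::CreateChannel(target, grpc::InsecureChannelCredentials());
        auto stub = bench::Echo::NewStub(channel);           // generated stub (hypothetical service)

        bench::Payload request, response;
        request.set_data(std::string(payload_bytes, 'x'));   // assumed 'bytes data' field, dummy content

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            grpc::ClientContext ctx;                         // fresh context per call
            grpc::Status status = stub->Ping(&ctx, request, &response);
            if (!status.ok()) return -1.0;
        }
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::micro>(end - start).count() / iters;
    }

The same loop structure works over any of the evaluated transports, since only the channel's underlying communication engine changes.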

[Figure: latency vs. payload size for default gRPC over IPoIB, public RDMA-gRPC, and AR-gRPC; (a) small payloads (2 Bytes-8 KBytes), (b) medium payloads (16 KBytes-512 KBytes), (c) large payloads (1 MBytes-8 MBytes)]

Figure 5.2 gRPC Point-to-Point Latency Evaluation on Cluster A

[Figure: latency vs. payload size for default gRPC over IPoIB and AR-gRPC; (a) small payloads (2 Bytes-8 KBytes), (b) medium payloads (16 KBytes-512 KBytes), (c) large payloads (1 MBytes-8 MBytes)]

Figure 5.3 gRPC Point-to-Point Latency Evaluation on Cluster B


Point-to-Point Latency: Figure 5.2 shows the comparison of Point-to-Point latency among the default gRPC, the public RDMA-gRPC, and AR-gRPC on Cluster A. Figure 5.3 shows the same comparison between the default gRPC and AR-gRPC on Cluster B. We categorize the payload sizes into three classes: small, medium, and large, ranging from 2 Bytes to 8 KBytes, 16 KBytes to 512 KBytes, and 1 MBytes to 8 MBytes, respectively. We choose these ranges because, from the characterization results in Chapter 3, we see that TensorFlow workloads contain all these message sizes.

We first compare the results of the default gRPC and AR-gRPC. Figure 5.2(a) shows that the latency of a 32 Bytes payload for the default gRPC is 35.09 µs, whereas AR-gRPC achieves 13.32 µs, resulting in a 2.6x speedup. Also, Figures 5.2(b) and 5.2(c) show that AR-gRPC reduces the latency of 64 KBytes and 1 MBytes payloads by 52% and 55%, respectively. Figure 5.3(a) shows that, on Cluster B, AR-gRPC reduces the 32 Bytes latency by 60%. Figures 5.3(b) and 5.3(c) show a speedup of about 2.5x and 4.1x for 64 KBytes and 1 MBytes payloads, respectively. This improvement over the default gRPC is mainly attributed to AR-gRPC's native RDMA design, which can perform much better than IPoIB.

Similarly, Figures 5.2(a) and 5.2(b) show that AR-gRPC achieves 1.3x and 1.5x speedup for 32 Bytes and 64 KBytes payloads, respectively, over the public RDMA-gRPC. As shown in Figure 5.2(c), the Point-to-Point latency for a 1 MBytes payload is 802.58 µs for the public RDMA-gRPC, whereas AR-gRPC incurs only 430.66 µs for the same payload. Thus, our design shows a significant speedup of about 1.8x. One key observation is that the performance of the public RDMA-gRPC degrades significantly as the message size increases. The primary reason is that the public RDMA-gRPC uses IBV_WR_SEND (similar to our eager send for small messages) for transmitting payloads of all sizes and does not have any advanced optimization.

To further analyze the benefits of our design, Figure 5.4 depicts a latency comparison, using 512 KBytes to 8 MBytes payloads, among the different AR-gRPC designs and the public RDMA-gRPC.

[Figure: latency (ms) vs. payload size (512 KBytes-8 MBytes) for public RDMA-gRPC and the incremental AR-gRPC designs: hybrid communication, message pipelining and coalescing, zero-copy transmission]

Figure 5.4 Analysis of Various gRPC Designs on Cluster A

In this figure, the top line corresponds to the public RDMA-gRPC and the rest depict the incremental AR-gRPC designs discussed in Section 5.2.2. The public RDMA-gRPC performs worse even when our design has only hybrid communication. This shows that one-sided RDMA operations perform better in terms of latency than two-sided Send-Receive for large messages. With incremental optimizations, we achieve even lower latency. The final version of AR-gRPC (bottom line in the figure) reduces the latency of an 8 MBytes message by 25% compared to the base AR-gRPC design (which has only hybrid communication).

Single Server, Multiple Clients Throughput: The throughput of an RPC system can be measured by the number of requests served per second by the server. Thus, this benchmark computes the total number of RPC requests handled by one server when multiple concurrent clients send requests. We use a fixed message size of 4 KBytes. The server runs on one node, while we vary the number of concurrent clients from 4 to 64 and distribute them uniformly among four different nodes.
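The measurement structure of this benchmark can be sketched as follows: each client thread issues back-to-back RPCs for a fixed duration, and the aggregate calls per second is reported. Here, issue_rpc is a hypothetical stand-in for one synchronous 4 KBytes gRPC call; the threading and counting logic is only an illustrative sketch.

    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for one synchronous 4 KBytes gRPC call to the single server.
    static void issue_rpc() { /* stub: blocking RPC with a 4 KBytes payload */ }

    // Run 'num_clients' concurrent client threads for 'seconds' and report calls/second.
    double measure_throughput(int num_clients, int seconds) {
        std::atomic<long> total_calls{0};
        auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(seconds);

        std::vector<std::thread> clients;
        for (int c = 0; c < num_clients; ++c) {
            clients.emplace_back([&] {
                long calls = 0;
                while (std::chrono::steady_clock::now() < deadline) {
                    issue_rpc();            // back-to-back requests, no think time
                    ++calls;
                }
                total_calls += calls;       // aggregate once per thread
            });
        }
        for (auto &t : clients) t.join();
        return static_cast<double>(total_calls.load()) / seconds;   // RPC calls served per second
    }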

Figure 5.5 presents the Single Server, Multiple Clients Throughput comparison among AR-gRPC, the default gRPC, and the public RDMA-gRPC on Cluster A. In our experiments, we achieve at least a 1.5x throughput speedup compared to the default gRPC. For example, AR-gRPC achieves a throughput of 112,751 calls/s for 64 concurrent clients, whereas the default gRPC reaches 74,903 calls/s, resulting in a 1.5x improvement.

[Figure: calls/second vs. number of clients (4-64) for default gRPC over IPoIB, public RDMA-gRPC, and AR-gRPC]

Figure 5.5 gRPC Single Server, Multiple Clients Throughput Evaluation on Cluster A

On the other hand, the public RDMA-gRPC fails to scale beyond 16 clients. For 64 clients, the AR-gRPC design achieves a significant 2.3x speedup compared to the public RDMA-gRPC design. We attribute this speedup to message pipelining and the optimized RDMA-Polling discussed in Section 5.2. High throughput is desirable for TensorFlow because, for example, in a large-scale cluster all the workers may need to update variables on the Parameter Server at once. Therefore, AR-gRPC is well-suited for such deployments.

Performance Comparison in Fully-Connected Architecture: In a TensorFlow cluster, node-to-node communication is built with a gRPC server and multiple channels that connect to the other workers' gRPC servers. This kind of deployment forms a fully-connected network architecture. For this experiment, the benchmark exactly models this communication pattern of TensorFlow.

[Figure: average latency (ms) and throughput (calls/second) vs. payload size (2 MBytes-8 MBytes) for default gRPC over IPoIB, public RDMA-gRPC, and AR-gRPC; (a) latency on Cluster A, (b) throughput on Cluster A, (c) latency on Cluster B, (d) throughput on Cluster B]

Figure 5.6 Performance Comparison in Fully-Connected Architecture of gRPC

We deploy four nodes, spawn a gRPC server on each node, and create three distinct gRPC channels per node that connect to the other nodes' servers. Since communication in distributed TensorFlow involves sending large tensors, we measure the performance by sending large payloads ranging from 2 MBytes to 8 MBytes. Figure 5.6 shows the performance comparison of the different gRPC designs in terms of latency and throughput averaged across all the processes.

As shown in Figure 5.6(a), on Cluster A, AR-gRPC reduces average latency by 45% compared to the default gRPC for a 2 MBytes payload. In addition, Figure 5.6(b) shows that AR-gRPC achieves about 1.8x and 1.18x average throughput speedup for a 2 MBytes payload compared to the default gRPC and the public RDMA-gRPC, respectively. Also, Figures 5.6(c) and 5.6(d) show that on Cluster B, AR-gRPC achieves a 60% reduction in average latency and a throughput speedup of about 2.68x for a 4 MBytes payload compared to the default gRPC.

The results from the above experiments are a clear indication that, compared with our proposed AR-gRPC design, neither the default gRPC on IPoIB nor the public RDMA-gRPC achieves optimal performance over a high-performance network. AR-gRPC outperforms the default gRPC because the bottlenecks of Socket-based communication, memory copying, etc., suppress the benefits of the high-performance network. In addition, as the public RDMA-gRPC implements RDMA sub-optimally, AR-gRPC's adaptive RDMA designs outperform it significantly.

5.3.3 Evaluation of AR-gRPC Enhanced TensorFlow

In this section, we evaluate AR-gRPC enhanced TensorFlow against the three other channels on Cluster A. We do not use the public RDMA-gRPC, as we cannot integrate it with TensorFlow 1.3 because of incompatibility. In our experiments, we deploy TensorFlow in the Parameter Server (PS) mode on up to twelve nodes. One node in the cluster hosts the PS and uses the CPU, while the other nodes host the workers and use NVIDIA Tesla K80 GPUs. We choose synchronous training over asynchronous for its performance benefits [18, 51]. Also, we select different batch sizes for a comprehensive performance analysis. Note that the larger the batch size, the fewer the parameter updates, but also the higher the number of iterations needed for convergence. The maximum batch size we use is 32/GPU due to the GPU memory limit; a batch size of 64/GPU causes out-of-GPU-memory errors in TensorFlow in most of our experiments. We experiment with different DNNs available in the TensorFlow Convolution Neural Net (CNN) benchmark [15]. This benchmark generates synthetic image data and measures performance by the total number of images processed. We run these tests five times and report the average result.

Inception4: Inception4 [48] is a low-computational-cost DNN. Figure 5.7 shows the results on up to 12 nodes of Cluster A.

We observe (Figures 5.7(a), 5.7(b), and 5.7(c)) that AR-gRPC improves TensorFlow performance by a maximum of 29%, 80%, and 144%, respectively, compared to the default gRPC. For example, Figure 5.7(c) shows an improvement of 80% (93 vs 51 images) for batch size 16/GPU (total 176) on 12 nodes. Moreover, in our experiments, AR-gRPC processes a maximum of 27%, 12%, and 31% more images than the Verbs channel, as shown in Figures 5.7(a), 5.7(b), and 5.7(c). Also, as shown in Figures 5.7(a), 5.7(b), and 5.7(c), AR-gRPC outperforms the MPI channel by a maximum of 29%, 151%, and 228% for 4, 8, and 12 nodes, respectively. In our experiments, TensorFlow scales poorly with the default MPI channel.

[Figure: images/second vs. batch size per GPU (8, 16, 32) for gRPC, gRPC+Verbs, gRPC+MPI, and AR-gRPC; (a) 4 nodes, (b) 8 nodes, (c) 12 nodes]

Figure 5.7 Inception4 Evaluation on Cluster A (Higher is Better); Total Batch Size = (Batch Size/GPU) × Num of GPUs

[Figure: images/second vs. batch size per GPU (8, 16, 32) for gRPC, gRPC+Verbs, gRPC+MPI, and AR-gRPC; (a) 4 nodes, (b) 8 nodes, (c) 12 nodes]

Figure 5.8 Resnet152 Evaluation on Cluster A (Higher is Better); Total Batch Size = (Batch Size/GPU) × Num of GPUs

[Figure: images/second vs. batch size per GPU (8, 16, 32) for gRPC, gRPC+Verbs, gRPC+MPI, and AR-gRPC; (a) 8 nodes, (b) 12 nodes]

Figure 5.9 Inception3 Evaluation on Cluster A (Higher is Better); Total Batch Size = (Batch Size/GPU) × Num of GPUs

[Figure: images/second vs. batch size per GPU (8, 16, 32) for gRPC, gRPC+Verbs, gRPC+MPI, and AR-gRPC; (a) 8 nodes, (b) 12 nodes]

Figure 5.10 Resnet50 Evaluation on Cluster A (Higher is Better); Total Batch Size = (Batch Size/GPU) × Num of GPUs

[Figure: images/second vs. batch size per GPU (8, 16, 32) for GoogleNet and AlexNet with gRPC and AR-gRPC; (a) 4 nodes, (b) 8 nodes]

Figure 5.11 GoogleNet & AlexNet Evaluation on Cluster A (Higher is Better); Total Batch Size = (Batch Size/GPU) × Num of GPUs

Resnet152: Resnet152 [22] is a popular residual DNN with a depth of 152 layers. Figure 5.8 presents the results of our experiments on up to twelve nodes of Cluster A.

Figure 5.8(a) shows that AR-gRPC accelerates TensorFlow by 62% (batch size 8/GPU) compared to the default gRPC. Also, Figure 5.8(b) shows that AR-gRPC improves Resnet152 performance by 32% (batch size 32/GPU) to 147% (batch size 8/GPU). In Figure 5.8(c), AR-gRPC achieves a maximum speedup of 3x (55 vs 18 images) compared to the default gRPC. Even for the higher batch size of 32/GPU (total 352), AR-gRPC improves TensorFlow performance by 82% (Figure 5.8(c)). Figures 5.8(a), 5.8(b), and 5.8(c) show that AR-gRPC processes a maximum of 40%, 35%, and 30% more images, respectively, than Verbs. For instance, AR-gRPC (batch size 32/GPU) improves performance by 23% (135 vs 108 images) on 12 nodes. In addition, as seen in Figures 5.8(a), 5.8(b), and 5.8(c), AR-gRPC achieves a maximum speedup of 1.61x, 3.3x, and 4.5x compared to the MPI channel.

Inception3: Figure 5.9(a) shows that, for a batch size of 16/GPU, AR-gRPC improves Inception3 [50] performance by 53%, 26%, and 52% over the default gRPC, Verbs, and MPI channels, respectively, on 8 nodes of Cluster A. Moreover, Figure 5.9(b) shows the performance of Inception3 on 12 nodes of Cluster A. As shown in the figure, for a batch size of 16/GPU, AR-gRPC enhanced TensorFlow processes 81%, 27%, and 146% more images than the gRPC-IPoIB, Verbs, and MPI based channels, respectively.

Resnet50: Figure 5.10 shows Resnet50 (Resnet with 50 layers) performance on 8 and 12 nodes of Cluster A. As shown in Figure 5.10(a), for a batch size of 32/GPU, AR-gRPC enhanced TensorFlow processes 47%, 20%, and 35% more images than the gRPC-IPoIB, Verbs, and MPI channels on 8 nodes. On 12 nodes (Figure 5.10(b)), AR-gRPC enhances the performance of TensorFlow by 62% and 58% compared to the default gRPC and MPI channels, respectively, although the Verbs channel performs similarly to AR-gRPC. In all these experiments, we notice that AR-gRPC is the only channel that delivers consistent performance.

GoogleNet and AlexNet: In this section, we compare the results of two drastically different CNNs: GoogleNet [49] and AlexNet [30]. GoogleNet has only 5 Million parameters, whereas AlexNet has about 60 Million parameters. Figures 5.11(a) and 5.11(b) show the comparison between gRPC and AR-gRPC on 4 and 8 nodes of Cluster A, respectively. First, we observe that AR-gRPC scales better with an increasing number of nodes. For example, on 4 nodes (as shown in Figure 5.11(a)), AR-gRPC has almost identical performance to the default gRPC for batch sizes of 16 and 32 per GPU. However, as we go to 8 nodes (Figure 5.11(b)), the performance improvement over the default gRPC becomes apparent. We have the same observation for AlexNet as well.

Moreover, as shown in Figure 5.11(b), AR-gRPC processes a maximum of 128% (batch size 8/GPU) more images than the default gRPC for GoogleNet. For the large batch size (32/GPU, total 224), though, the improvement is about 15% (597 vs 517). This is expected, as the higher batch size and the smaller number of parameters in GoogleNet result in less network-intensive gradient updates. In comparison, for the same batch size (32/GPU), AR-gRPC shows an 89% (124 vs 65) performance improvement for AlexNet compared to the default gRPC (Figure 5.11(b)). This shows that, even with a higher batch size, if the DNN has a large number of parameters, AR-gRPC can improve TensorFlow performance significantly.

The above experiments show that AR-gRPC has the potential to accelerate Deep Learning using TensorFlow compared to the other available channels. Moreover, we show that when the DNN is complex and performs frequent, network-intensive variable updates, AR-gRPC provides the optimal channel over RDMA networks. Also, AR-gRPC scales well compared to the other channels with an increasing number of nodes. In addition, Figure 5.12 summarizes the maximum speedup achieved in TensorFlow by the AR-gRPC channel compared to the default gRPC-over-IPoIB channel for training different CNNs.


Figure 5.12 AR-gRPC enhanced TensorFlow Speedup Compared to gRPC-IPoIB channel on Cluster A

5.4 Related Work

RPC over RDMA: Optimization of RPC is popular in the distributed computing field. The recent innovation in networking technologies powering large-scale data centers brings new challenges in terms of scaling and latency. Kalia et al. propose FaSST [28], an RPC system that leverages modern hardware. Stuedi et al. propose DaRPC [45] to implement tight integration between user space and RPC message passing. Moreover, Su et al. propose an RDMA-based paradigm named RFP [46] that supports traditional RPC and provides high performance.

The impact of high-performance RPC is also well studied. For example, Lu et al. show a high-performance Hadoop RPC [36] that benefits the entire Hadoop ecosystem. In this work, we select gRPC, as it is a modern RPC framework that satisfies the requirements of current data centers better than other available open-source RPC frameworks. Even though a public RDMA-gRPC is available, our design far exceeds that version in terms of performance.

Optimization of TensorFlow: Google's TensorFlow has been in the limelight for efficient Deep Learning in recent times. Vishnu et al. extend TensorFlow [51] to large-scale clusters using MPI. Also, Horovod [43] is an attempt from Uber to enhance distributed TensorFlow training using MPI. By leveraging the high-performance optimized communication offered by MPI, they show good performance improvements. Jia et al. [25] propose an RDMA-based TensorFlow similar to the official TensorFlow Verbs design. They report a 6x performance speedup over TCP/IP (1G Ethernet) gRPC, whereas AR-gRPC extended TensorFlow achieves a 12x speedup over TCP/IP (1G Ethernet) gRPC. We do not show comparison numbers against TCP/IP over Ethernet in this work in order to make a fair contrast. Lu et al. [37] have done a meticulous evaluation of popular Deep Learning frameworks over Big Data stacks on RDMA interconnects. They show that RDMA-based communication can lead to a significant speedup in training time. TensorFlowOnSpark [3] is a framework proposed by Yahoo! that allows execution of Deep Learning workloads with TensorFlow on existing Spark clusters.

Even though the primary and default communication of TensorFlow is powered by gRPC, TensorFlow's GitHub repository has contributions supporting MPI- and Verbs-based channels. The primary reason for supporting these channels is that they can leverage high-performance communication mechanisms such as RDMA. However, in this work, we present the potential of unified and adaptive designs. We argue that by making gRPC suitable for high-performance interconnects, we can achieve optimal performance for distributed TensorFlow training.

5.5 Summary

In this chapter, we propose a high-performance adaptive RDMA-based communication runtime with gRPC (i.e., AR-gRPC) for distributed TensorFlow. The results suggest that, by optimizing the performance of gRPC alone, we can achieve high performance while keeping a unified communication runtime throughout the TensorFlow stack. In this way, we can eliminate the need to maintain different server protocols for distributed TensorFlow. We perform a comprehensive analysis of the TensorFlow architecture and propose an adaptive RDMA-enhanced gRPC runtime specifically designed for Deep Learning applications. We demonstrate that our AR-gRPC achieves up to 4.1x speedup compared to the default gRPC on IPoIB. We also show that AR-gRPC can speed up performance by up to 2.3x over the public RDMA-gRPC design. Then, we show that our AR-gRPC design can benefit the runtime of distributed TensorFlow: we achieve up to 3x performance improvement when using the AR-gRPC channel compared to the default gRPC-over-IPoIB channel. Furthermore, AR-gRPC can benefit not only TensorFlow but other applications as well, such as micro-services running in modern data centers with gRPC as their common communication substrate.

Chapter 6: Conclusion and Future Work

With the ubiquity of massive computational power, Deep Learning is finally gaining momentum. The applications of Deep Learning are wide-ranging, from self-driving cars to assisting doctors in detecting early stages of cancer. At present, Google's TensorFlow is one of the most popular Deep Learning frameworks in the community. Even though distributed TensorFlow can be deployed on modern HPC systems seamlessly, the performance of TensorFlow depends heavily on the efficacy of its communication engine and the underlying network. As DL networks grow in depth and in the number of parameters, the variable updates between different nodes become a critical bottleneck. gRPC, a widely used Remote Procedure Call framework that enables client and server applications to communicate transparently, is the main communication engine of TensorFlow. Furthermore, TensorFlow supports the gRPC+Verbs and gRPC+MPI channels, which are capable of RDMA operations, to achieve high-performance communication of tensors. However, in these two channels gRPC is still responsible for communicating administrative tasks.

In this work, we first present a meticulous analysis of the distributed TensorFlow execution flow. Nowadays, a majority of existing clusters are equipped with high-performance interconnects such as InfiniBand, 10/25/40/80/100 GigE, RoCE, etc. In order to leverage the benefits of these low-latency, high-throughput networks, it is important to study their impact on the communication engine of DL frameworks. Based on our observation of distributed TensorFlow and its workload characteristics over the gRPC channel, we propose a micro-benchmark suite, TF-gRPC-Bench. This micro-benchmark suite includes benchmarks to measure Point-to-Point latency, Point-to-Point bandwidth, and Parameter Server throughput that model the TensorFlow communication pattern. These benchmarks aim to quickly measure the performance of gRPC for TensorFlow DL workloads. In the future, this work can be extended to support the other communication channels of distributed TensorFlow.

In addition, from our characterization of the different communication channels in distributed TensorFlow, we observe that none of the existing communication engines (i.e., gRPC, gRPC+Verbs, gRPC+MPI) efficiently achieves high-performance communication by leveraging modern interconnects. No matter which channel is used, gRPC remains an indispensable component of TensorFlow. We also show that the nature of TensorFlow DL workloads demands a novel communication engine design. Thus, after a comprehensive analysis of TensorFlow, we present an adaptive RDMA-based gRPC (i.e., AR-gRPC) that by itself is capable of accelerating TensorFlow. In our experiments, we observe a 4.1x speedup compared to the default gRPC on IPoIB. In this work, we also show that AR-gRPC enhanced TensorFlow can achieve up to 3x performance improvement compared to the default gRPC channel. Moreover, AR-gRPC outperforms the Verbs and MPI channels as well. As part of our future work, we plan to propose more designs in the TensorFlow runtime.

Bibliography

[1] Caffe2 - Lightweight, Modular, and Scalable Deep Learning Framework. https://github.com/caffe2/caffe2.

[2] IP over InfiniBand Working Group. http://www.ietf.org/html.charters/ipoib-charter.html.

[3] TensorFlowOnSpark. https://github.com/yahoo/TensorFlowOnSpark.

[4] The ImageNet Database. http://www.image-net.org/.

[5] Apache Thrift. http://thrift.apache.org/, 2007.

[6] Apache Avro. http://avro.apache.org/, 2010.

[7] gRPC - A High-Performance, Open-Source Universal RPC Framework. http://www.grpc.io/, 2015.

[8] BigDL: Distributed Deep Learning Library for Apache Spark. https://github.com/intel-analytics/BigDL, 2016.

[9] DeepBench Benchmark. https://github.com/baidu-research/DeepBench, 2016.

[10] Baidu Allreduce Benchmark. https://github.com/baidu-research/baidu-allreduce, 2017.

[11] NVIDIA Collective Communications Library. https://developer.nvidia.com/nccl, 2017.

[12] Public RDMA Based gRPC. https://github.com/CGCL-codes/Tensorflow-RDMA/tree/master/src/grpc.git, 2017.

[13] TensorFlow CNN Benchmark. https://github.com/tensorflow/benchmarks, 2017.

[14] BigDataBench Version 4. http://prof.ict.ac.cn/BigDataBench/wp-content/uploads/2018/BigDataBench4-TechnicalReport.pdf, 2018.

[15] TensorFlow. https://github.com/tensorflow/, 2018.

[16] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.

[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR, abs/1603.04467, 2016.

[18] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, volume 16, pages 265–283, 2016.

[19] Andrew D Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems (TOCS), 2(1):39–59, 1984.

[20] Rajarshi Biswas, Xiaoyi Lu, and Dhabaleswar K Panda. Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences. arXiv preprint arXiv:1804.01138, 2018.

[21] Suyog Gupta, Wei Zhang, and Fei Wang. Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. In International Conference on Data Mining (ICDM), pages 171–180. IEEE, 2016.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[23] Michael R Head, Madhusudhan Govindaraju, Aleksander Slominski, Pu Liu, Nayef Abu-Ghazaleh, Robert Van Engelen, Kenneth Chiu, and Michael J Lewis. A Benchmark Suite for SOAP-based Communication in Grid Web Services. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 19–19. IEEE, 2005.

[24] InfiniBand Trade Association. http://www.infinibandta.org.

[25] Chengfan Jia, Junnan Liu, Xu Jin, Han Lin, Hong An, Wenting Han, Zheng Wu, and Mengxian Chi. Improving the Performance of Distributed TensorFlow with RDMA. International Journal of Parallel Programming, pages 1–12, 2017.

[26] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[27] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12. ACM, 2017.

[28] Anuj Kalia, Michael Kaminsky, and David G Andersen. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In OSDI, pages 185–201, 2016.

[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[31] Mingzhe Li, Xiaoyi Lu, Khaled Hamidouche, Jie Zhang, and Dhabaleswar K Panda. Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA. In High Performance Computing (HiPC), 2016 IEEE 23rd International Conference, pages 42–51. IEEE, 2016.

[32] Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jithin Jose, Karen Tomko, and Dhabaleswar K Panda. Scalable Graph500 Design with MPI-3 RMA. In Cluster Computing (CLUSTER), 2014 IEEE International Conference, pages 230–238. IEEE, 2014.

[33] Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, David G Andersen, and Alexander Smola. Parameter Server for Distributed Machine Learning. In Big Learning NIPS Workshop, volume 6, page 2, 2013.

[34] Jiuxing Liu, Jiesheng Wu, Sushmitha P Kini, Pete Wyckoff, and Dhabaleswar K Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In Proceedings of the 17th Annual International Conference on Supercomputing, pages 295–304. ACM, 2003.

[35] Xiaoyi Lu, Jian Lin, Yongqiang Zou, Juan Peng, Xingwu Liu, and Li Zha. Investigating, Modeling, and Ranking Interface Complexity of Web Services on the World Wide Web. In Services (SERVICES-1), 2010 6th World Congress, pages 375–382. IEEE, 2010.

[36] Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Hari Subramoni, and Dhabaleswar K Panda. Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase. In Cloud Computing Technology and Science (CloudCom), 2016 IEEE International Conference, pages 310–317. IEEE, 2016.

[37] Xiaoyi Lu, Haiyang Shi, M Haseeb Javed, Rajarshi Biswas, and Dhabaleswar K Panda. Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks. In High-Performance Interconnects (HOTI), 2017 IEEE 25th Annual Symposium, pages 87–94. IEEE, 2017.

[38] Xiaoyi Lu, Md Wasi-ur Rahman, Nusrat Sharmin Islam, and Dhabaleswar K (DK) Panda. A Micro-benchmark Suite for Evaluating Hadoop RPC on High-performance Networks. In Workshop on Big Data Benchmarks, pages 32–42. Springer, 2013.

[39] Xiaoyi Lu, Yongqiang Zou, Fei Xiong, Jian Lin, and Li Zha. ICOMC: Invocation Complexity of Multi-Language Clients for Classified Web Services and its Impact on Large Scale SOA Applications. In Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference, pages 186–194. IEEE, 2009.

[40] MPI Forum. MPI: A Message Passing Interface. In Proceedings of Supercomputing, 1993.

[41] Yufei Ren, Xingbo Wu, Li Zhang, Yandong Wang, Wei Zhang, Zijun Wang, Michel Hack, and Song Jiang. iRDMA: Efficient Use of RDMA in Distributed Deep Learning Systems. In High Performance Computing and Communications, pages 231–238. IEEE, 2017.

[42] Frank Seide and Amit Agarwal. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.

[43] Alexander Sergeev and Mike Del Balso. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.

[44] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.

[45] Patrick Stuedi, Animesh Trivedi, Bernard Metzler, and Jonas Pfefferle. DaRPC: Data Center RPC. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 15:1–15:13, New York, NY, USA, 2014. ACM.

[46] Maomeng Su, Mingxing Zhang, Kang Chen, Zhenyu Guo, and Yongwei Wu. RFP: When RPC is Faster than Server-Bypass with RDMA. In EuroSys, pages 1–15, 2017.

[47] Toyotaro Suzumura, Toshiro Takase, and Michiaki Tatsubori. Optimizing Web Services Performance by Differential Deserialization. In Web Services, 2005. ICWS 2005. Proceedings. 2005 IEEE International Conference, pages 185–192. IEEE, 2005.

[48] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, volume 4, page 12, 2017.

[49] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going Deeper with Convolutions. In CVPR, 2015.

[50] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. CoRR, abs/1512.00567, 2015.

[51] Abhinav Vishnu, Charles Siegel, and Jeffrey Daily. Distributed TensorFlow with MPI. arXiv preprint arXiv:1603.02339, 2016.

[52] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, et al. BigDataBench: A Big Data Benchmark Suite from Internet Services. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium, pages 488–499. IEEE, 2014.

[53] Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware Async-SGD for Distributed Deep Learning. arXiv preprint arXiv:1511.05950, 2015.