Deployment Service for Scalable Distributed Deep Learning Training on Multiple Clouds

Javier Jorge 1, Germán Moltó 1, Damian Segrelles 1, João Pedro Fontes 2 and Miguel Angel Guevara 2
1 Instituto de Instrumentación para Imagen Molecular (I3M), Centro Mixto CSIC, Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia, Spain
2 Computer Graphics Center, University of Minho, Campus de Azurém, Guimarães, Portugal
ORCID: 0000-0002-9279-6768 (J. Jorge), 0000-0002-8049-253X (G. Moltó), 0000-0001-5698-7965 (D. Segrelles)

Keywords: Cloud Computing, Deep Learning, Multi-cloud.

Abstract: This paper introduces a platform based on open-source tools to automatically deploy and provision a distributed set of nodes that conduct the training of a deep learning model. To this end, the deep learning framework TensorFlow is used, as well as the Infrastructure Manager service to deploy complex infrastructures programmatically. The provisioned infrastructure addresses data handling, model training using these data, and the persistence of the trained model. For this purpose, public Cloud platforms such as Amazon Web Services (AWS), together with General-Purpose Computing on Graphics Processing Units (GPGPUs), are employed to dynamically and efficiently perform the workflow of tasks related to training deep learning models. This approach has been applied to real-world use cases to compare local training versus distributed training on the Cloud. The results indicate that the dynamic provisioning of GPU-enabled distributed virtual clusters in the Cloud introduces great flexibility to cost-effectively train deep learning models.

1 INTRODUCTION

Artificial Intelligence (AI), mainly through Deep Learning (DL) models, is currently dominating the field of classification tasks, especially after the disruptive results obtained in 2012 in the most challenging image classification task at that moment (Krizhevsky et al., 2012), defeating the SVM (Support Vector Machine) (Vapnik, 1998) approaches that had been leading those contests previously. DL models are trained using Backpropagation (Rumelhart et al., 1985), an iterative procedure that can take days or weeks depending on the volume of the data or the complexity of the model. Among the options to reduce the computational cost, the most common technique to parallelize this process is the data parallelism approach (Dean et al., 2012), as it is the most straightforward technique to implement.

Big tech companies and relevant research groups have open-sourced many of their algorithms and methods as AI frameworks for the community (e.g., TensorFlow (TF) (Abadi et al., 2016), PyTorch (PyTorch, 2018) or MXNet (MXNet, 2018)), enabling a quicker and easier evaluation of academic DL models and their deployment in production environments. These frameworks provide a uniform set of functions, such as parallelization on multiple nodes, flexibility to code using different interfaces, automatic differentiation, similar abstractions regarding the structure of the network, and “model zoos” to deploy an out-of-the-box model in production quickly.

Considering the interest of academia and industry in using DL models in their pipelines, it is of vital importance to provide easier, faster and better tools to deploy their algorithms and applications. To this end, research groups and companies are using Cloud services (such as Google’s Cloud AI (Cloud AI, 2018) or Amazon SageMaker (Amazon SageMaker, 2018)) to develop their DL solutions, since these are computationally demanding (both storage and processing) and require expensive and dedicated hardware such as General-Purpose Computing on Graphics Processing Units (GPGPUs), as is the case of DL techniques. However, these Cloud solutions result in the user being locked in to a particular Cloud and, therefore, losing the ability to perform infrastructure deployment on different or multiple Cloud providers.

The “Infrastructure as Code” (IaC) approach (Wittig and Wittig, 2016) arises to overcome these gaps, defining instead the customized computational environment required without interaction with the Cloud provider’s interfaces, and enabling non-expert users to define, in a Cloud-agnostic way, the virtual infrastructure as well as the process of managing and provisioning the computational resources from multiple Clouds for different trainings. Notice that a single training is performed within the boundaries of a single Cloud to achieve reduced latency.

This paper introduces a Cloud service that provides an automated DL pipeline, from the deployment and provisioning of the required infrastructure on Cloud providers, through an IaC approach, to its execution in production. It is based on open-source tools to deploy a customised computational environment based on TF, to perform distributed DL model training based on data parallelism, on Cloud infrastructures involving accelerated hardware (GPUs) for increased efficiency. As a use case to validate the service, we present the deployment, configuration and execution of a distributed cluster for training deep neural networks on a public Cloud provider. This uses resources that the user may not be able to afford otherwise, such as several nodes connected using one or more expensive GPUs. This approach is assessed by means of training several DL models on distributed clusters provisioned on Amazon Web Services (AWS). In addition, an actual research problem has been used to validate the proposed service, in order to test the system in the real use case documented in the work by Brito et al. (Brito et al., 2018).

2 RELATED WORK

Recently, approaches for efficient DL have been developed ad hoc for specific use cases, such as DeepCell (Bannon et al., 2018), a tool used to perform large-scale cellular image analysis, tuned for these particular problems. For general purposes, the alternative is using proprietary tools that the Cloud providers offer, tailored for a specific framework that usually depends on the provider, or a high-level API to access these services easily. A thorough review of this can be found in (Luckow et al., 2016), where the authors studied distributed DL tools for the automotive industry.

There are other approaches related to the use of multiple GPUs beyond using different nodes or more than one GPU per node. For instance, GPU virtualization provides the interface to use a variable number of GPUs, intercepting local calls and sending them to a GPU server farm to be processed, with thousands of GPUs working in parallel. An example of these proposals is rCUDA (Reaño et al., 2015). However, these approaches are either experimental or more expensive than public Clouds, and they do not provide additional services required for this kind of problem, such as storage resources.

The topic of distributed DL has been addressed in the past, as is the case of the work by Lee et al. (Lee et al., 2018), where a framework for DL across multiple machines using Apache Spark is introduced. There are also approaches that try to speed up this distributed computation by providing another layer of abstraction over TF, such as Horovod (Sergeev and Del Balso, 2018), which uses MPI (the commonly used Message Passing Interface, https://www.mpi-forum.org/docs/) to intercommunicate the nodes.

As opposed to previous works, we introduce an approach for distributed learning that relies exclusively on the capabilities of TF, thus minimizing external dependencies. Also, these distributed clusters can be dynamically deployed on different Cloud back-ends through an Infrastructure-as-Code approach, thus fostering the reproducibility of the distributed learning platform and the ability to easily profit from GPU resources provided by public Cloud platforms to accelerate training.

3 DEPLOYMENT SERVICE FOR DISTRIBUTED TensorFlow

This section identifies the components used to design and implement the proposed service. The goal is to train a deep neural network in a distributed way, fostering data parallelism, and considering that the training process can be split according to different shards of data. This scales very easily, and it is how some of the big companies such as Google achieved the amazing breakthroughs of the last years, such as AlphaGo (Silver et al., 2017). Among the different approaches for doing this, we have followed the scheme from (Dean et al., 2012), which is reproduced in Figure 1. This architecture comprises different nodes that can be identified as follows:

• Parameter server (PS): Node(s) that initialize the parameters, distribute them to the workers, gather the modifications of the model’s parameters, apply them, and scatter the updated parameters back to the workers.

• Model replicas (MR): Node(s) that store the model and carry out backpropagation’s forward and backward steps, finally sending the updates back to the PS. There will be a master among them in charge of persisting the model. These nodes are the ones that require more computational power, i.e., GPU units.

• Data shards: Nodes that provide data for the MRs.

Figure 1: Distributed training scheme: parameter server and model replicas, in a master-slave setup (Dean et al., 2012).

This scheme can work either synchronously or asynchronously, meaning that an MR node can wait for the rest of the MR nodes to finish their iteration before moving to the next one (training synchronously with the others), or it can continue with the next batch of data as soon as it completes its forward-backward step (training asynchronously).
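To make the two modes concrete, the following self-contained Python toy (illustrative only, not the paper’s code) trains a two-parameter linear model under both schemes: the synchronous variant applies the average of all workers’ gradients at once, while the asynchronous variant applies each worker’s gradient as soon as it is available. In a real deployment the asynchronous updates also arrive concurrently and may be computed from slightly stale parameters, which this single-threaded toy does not capture.

```python
import numpy as np

# Toy data-parallel setup: one "parameter server" weight vector, three data shards.
rng = np.random.RandomState(0)
w_true = np.array([2.0, -1.0])
shards = []
for _ in range(3):
    x = rng.randn(64, 2)
    y = x @ w_true + 0.1 * rng.randn(64)
    shards.append((x, y))

def gradient(w, x, y):
    # Gradient of the mean squared error of a linear model.
    return 2.0 * x.T @ (x @ w - y) / len(y)

def train(mode, steps=200, lr=0.05):
    w = np.zeros(2)                               # parameters held by the PS
    for _ in range(steps):
        if mode == "sync":
            # Synchronous: wait for every worker, then apply the averaged update.
            grads = [gradient(w, x, y) for x, y in shards]
            w -= lr * np.mean(grads, axis=0)
        else:
            # Asynchronous: apply each worker's update as soon as it arrives.
            for x, y in shards:
                w -= lr * gradient(w, x, y)
    return w

print("synchronous :", train("sync"))
print("asynchronous:", train("async"))
```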

Regarding the data shards, we propose the use of a distributed file system among the nodes, in order to share both data and model checkpoints. In particular, we have chosen the Hadoop Distributed File System (HDFS) (Shvachko et al., 2010), a distributed and scalable file system focused on storing massive amounts of data. It has been used mainly within the Java framework Hadoop (Hadoop, 2018) to implement the MapReduce paradigm (Dean and Ghemawat, 2008).

The proposed service deploys the Hadoop cluster in charge of data storage (HDFS) and the TF cluster in charge of processing the data on the same resources and nodes, coexisting alongside each other. Concerning training, each node runs a TF script that implements the aforementioned training scheme. As a summary, Figure 2 shows the proposed steps to prepare the architecture required to perform the training.

Figure 2: Step 1: Getting the data and uploading it to the HDFS storage. Step 2: Training starts; the PS node and MR nodes collaborate to train the model, using TF and the HDFS file system. Step 3: Once training is completed, a PS node stores the model in a persistent Cloud storage, such as an S3 bucket.

The following subsections present the tools employed by the proposed service in order to prepare the infrastructure required to support the architecture shown in Figure 2. After that, we present how distributed TF training is achieved through the service.

3.1 Infrastructure Manager

In order to deploy the nodes described above on the Cloud, the service uses an IaC approach relying on the Infrastructure Manager (IM) (Infrastructure Manager, 2018; Caballer et al., 2015), available at https://www.grycap.upv.es/im. The IM is an open-source tool to deploy on-demand customized virtual computing infrastructures. It allows the user to deploy complex infrastructures choosing among multiple Cloud providers, such as AWS or Google Cloud Platform. The IM is the orchestration tool adopted by the European Grid Infrastructure (EGI) (European Grid Infrastructure, 2018) to deploy Virtual Machines through the VMOps Dashboard on the EGI Federated Cloud that spans across Europe. This paves the way for an easy adoption of the approach described in this work by the European scientific community. The IM uses the “Resource and Application Description Language” (RADL) (Infrastructure Manager, 2018; Caballer et al., 2015), a high-level language to define virtual infrastructures and VM requirements.

The IM processes the RADL file and interacts with the Cloud providers to launch the Master VM that will orchestrate the deployment and configuration. A specific module called Configuration Manager configures this Master node using Ansible (Ansible, 2018), providing the contextualization and configuration files and installing the Contextualization Agent. This module will deploy, configure and provision the main roles, in our case the PS and MR nodes, including TF and the Hadoop cluster.
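Although the service drives the IM through its standard workflow, a deployment can also be requested programmatically via the IM REST API. The sketch below is only illustrative: the server URL and credentials are placeholders, cluster.radl stands for the actual infrastructure description, and the endpoint path and authorization-header format are assumptions that should be checked against the IM documentation.

```python
import requests

# Hypothetical IM endpoint; replace with the URL of an actual IM deployment.
IM_URL = "https://im.example.org:8800/infrastructures"

# IM authorization data: one line per credential set. The exact format is
# documented by the IM project; the values below are placeholders.
AUTH = ("type = InfrastructureManager; username = user; password = pass\\n"
        "id = ec2; type = EC2; username = ACCESS_KEY; password = SECRET_KEY")

with open("cluster.radl") as f:              # RADL describing the PS and MR nodes
    radl = f.read()

resp = requests.post(IM_URL,
                     headers={"Content-Type": "text/plain",
                              "Authorization": AUTH},
                     data=radl)
resp.raise_for_status()
print("New infrastructure:", resp.text)      # the IM returns the infrastructure URI
```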

3.2 Resources' and Operations' Definition

The RADL file and the Ansible roles required to deploy the whole infrastructure for the distributed training of a deep neural network describe the nodes' characteristics, the operations to perform upon them, and the system architecture.

The execution of the RADL through the IM prepares the whole environment (PS and MR nodes) to execute the training script. The complete version of the resource allocation configuration files and code is available on GitHub (https://github.com/JJorgeDSIC/Master-Thesis-Scalable-Distributed-Deep-Learning/) and thoroughly described in (Jorge Cano, 2019). We have relied on an Ansible role to deploy the Hadoop cluster in order to use HDFS as shared data storage among the nodes. Once this step is completed, the Hadoop cluster is fully deployed, and we can proceed with data handling, as well as configuring and installing TF. Amazon S3 is used as the initial storage for the data, and the dataset is retrieved from S3 and staged into HDFS upon deployment of the Hadoop cluster, but the data can come from any external source. After deploying the Hadoop cluster, the distributed TF script is prepared, installed and executed using another Ansible role. The set of parameters for the PS node is similar to that of the MR nodes, with some changes according to their function, such as the node type or whether a GPU is used or not. The role involves four parts: first, some environment preparation with paths and additional variables; second, the conditional installation of a specific TF version depending on whether the node has a GPU or not; third, cloning the actual training scripts and running them, depending on the role of the node in the cluster; finally, uploading the resulting model to an S3 bucket to persist it beyond the lifecycle of the dynamically provisioned Hadoop cluster, although any external platform or service can be selected to store the model.

3.3 Distributed TensorFlow Training

This section shows how we adapted the distributed TF code to be used in our pipeline. TF distributed training implements a scheme where one or more nodes act as PS nodes and the rest as MR nodes. The code is available on GitHub (https://github.com/JJorgeDSIC/DistributedTensorFlowCodeForIM).

This work uses the Estimator API in TF, which provides high-level functions that encapsulate the different parts of the pipeline: model, input, training, validation and evaluation. By providing common headers for these steps, training and inference can be managed better, as the whole pipeline is decomposed into isolated functions. By defining the pipeline in these terms, it is entirely managed by TF, without worrying about running iterations, evaluation steps, logging or saving the model manually. Another significant advantage is having a program that can run on a single node with CPU or GPU, with multiple GPUs, or in a distributed way, just indicating these variations through an environment variable called TF_CONFIG.
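The sketch below illustrates this mechanism with the TF 1.x Estimator API. It is not the code published in the repository above: the host names, model_fn and input_fn are placeholders, and only the "task" entry of TF_CONFIG would change from node to node (the PS process receives type "ps", the master type "chief", and the remaining MR nodes type "worker" with increasing indices).

```python
import json
import os
import tensorflow as tf

# Cluster description read by the Estimator API (TF 1.x); in the proposed
# service an Ansible role would set this variable on each node.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps":     ["ps0.example.org:2222"],
        "chief":  ["worker0.example.org:2222"],
        "worker": ["worker1.example.org:2222", "worker2.example.org:2222"],
    },
    # Every node runs this same script; only the "task" entry differs per node.
    "task": {"type": "worker", "index": 0},
})

def model_fn(features, labels, mode):
    # Toy model standing in for the real network.
    logits = tf.layers.dense(features["x"], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Random tensors standing in for the real input pipeline.
    x = tf.random_normal([128, 32])
    y = tf.random_uniform([128], maxval=10, dtype=tf.int64)
    return {"x": x}, y

# Checkpoints can live on HDFS if TF is built with HDFS support.
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   model_dir="hdfs://namenode:9000/model")
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=60000),
    tf.estimator.EvalSpec(input_fn=input_fn))
```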

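For completeness, the final step of the role described in Section 3.2 (Step 3 in Figure 2) persists the trained model outside the cluster. A rough Python equivalent with boto3 is sketched below; the bucket name and local model directory are hypothetical, and the actual role performs this copy from Ansible rather than with boto3.

```python
import os
import boto3

s3 = boto3.client("s3")                  # credentials taken from the environment
model_dir = "/tmp/exported_model"        # local copy of the trained model/checkpoints
bucket = "my-trained-models"             # hypothetical bucket name

# Upload every file of the exported model, preserving its relative layout.
for root, _, files in os.walk(model_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, model_dir)
        s3.upload_file(path, bucket, key)
```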
4 EXPERIMENTATION

In this section, we use the configuration files, code and scripts presented in the previous sections to evaluate our deployment and execution in terms of flexibility and timing. For doing this, we have selected a public Cloud provider to deploy our infrastructure and perform the evaluation. First, we consider executing the training on single nodes, on physical and virtual machines, to get the baseline that we can achieve with the resources at reach. Second, we study the use of a Cloud provider to deploy Virtual Machines with and without specialized hardware, to compare the results with both single and multiple node configurations.

To perform the local experimentation, we used a physical node with Ubuntu 16.04, NVIDIA CUDA 9, cuDNN 7 and TF r1.10. Regarding the hardware, the node includes an Intel(R) Xeon(R) CPU E5-1620 v3 (4 cores @ 3.50 GHz) with 128 GB of RAM, along with a GeForce GTX 1080 Ti GPU (11 GB, 3.5K cores @ 1.6 GHz).

We selected AWS as the public Cloud provider to use during the evaluation. An analysis of the instance types and pricing was carried out in order to choose cost-effective computing resources. To reduce costs, we have selected an Amazon Machine Image (AMI) to accelerate the deployment and avoid some time-consuming tasks, such as updating packages, that appear when using a fresh Ubuntu 16.04 VM. The AMI used is called Deep Learning Base AMI (Ubuntu) Version 10.0 (https://aws.amazon.com/marketplace/pp/B077GCZ4GR), and it is available in most AWS regions. This AMI comes with NVIDIA CUDA 9 and NVIDIA cuDNN 7 and can be used with several instance types, from a small CPU-only instance to the latest high-powered multi-GPU instances. The information on these instance families is available on the AWS website (https://aws.amazon.com/ec2/instance-types/). We have selected p2 instances, p2.xlarge in particular, which are composed of an Intel Xeon E5-2686 v4 (Broadwell) CPU and an NVIDIA K80 GPU (12 GiB, 2.5K cores). It is important to remark the years of development between the virtualized GPU (K80, 2014) and the physical GPU (GTX 1080 Ti, 2017), as their performance is not directly comparable.

For the CPU-based resources, we chose the instance types t2.micro and t2.small. These are very low-cost computing instances that support burstable CPU performance. Concerning software, we have used the same versions as in the aforementioned physical equipment in all the VMs.

4.1 Datasets

To perform the experiments we selected two tasks: a well-known benchmark and a real research problem. As the benchmark dataset, we selected CIFAR-10 (Krizhevsky, 2009), which comprises 60k 32x32 colour images with 10 different categories (6k images per category). The official split defines 50k samples to train the model and 10k to evaluate its accuracy in predicting the correct category among the 10 possible. Figure 3 shows some examples from this dataset with their categories.

Figure 3: Images from CIFAR-10 with their categories (Krizhevsky, 2009).

We have selected this dataset since it is handy and complicated enough to require a non-trivial neural network model. As this is a dataset of images, we can use a convolutional neural network, where GPUs, with their specialized hardware to process graphics, can take full advantage of the accelerated hardware.

Along with this well-known dataset, we have used an internal dataset provided by the "Centro de Computação Gráfica" (CCG, http://www.ccg.pt) from Portugal, in order to evaluate our software approach on an actual research problem. Our proposed deployment service was tested in a real use case documented in (Brito et al., 2018). This work is part of the AGATHA project (http://www.agatha-osi.com/), which consists of a solution providing image/video analysis for crime control and surveillance. For this collaboration, we have used a proprietary dataset provided by CCG, composed of pictures of faces labelled with the child, young, adult or elder classes, aiming to identify these features in a security surveillance system. The dataset is composed of around 65k RGB images with a resolution of 224x224. We train the model to predict the label among the set {child, young, adult, elder}. Some examples of these pictures are shown in Figure 4. The dataset is composed of different sources; the subset that is shown was extracted from the Internet Movie Database (IMDB) website (https://www.imdb.com). The 4.5 GB dataset takes an average of 13 minutes to be retrieved from S3 to HDFS inside the Hadoop cluster.

Figure 4: Images from the provided dataset.

4.2 TensorFlow Code

We chose an implementation provided in the TF official repository (https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator), using version r1.10. We executed it without any change, thanks to the Estimator API introduced in the previous section, to illustrate that code following this scheme can be trained inside this infrastructure with ease. For the same reason, we only included the required changes, just modifying the input pipeline, to adapt the age recognition model to be trained under this API. The model used in these experiments is the convolutional neural network known as Inception-ResNet-V1 (He et al., 2016). This model has 44 layers and millions of parameters, and it is based on one of the first huge deep learning models, called Inception (Szegedy et al., 2015).
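To give a flavour of the input-pipeline change mentioned above, the sketch below builds an Estimator input_fn that reads TFRecord files from HDFS with tf.data (TF 1.x). The hdfs:// path, feature keys and image size are assumptions for illustration and do not reflect the actual layout of the datasets; such a function could then be plugged into the TrainSpec/EvalSpec of the pipeline sketched in Section 3.3.

```python
import tensorflow as tf

def make_input_fn(pattern="hdfs://namenode:9000/data/train-*.tfrecord",
                  batch_size=128):
    """Returns an input_fn reading TFRecords from HDFS (placeholder schema)."""
    def parse(serialized):
        features = tf.parse_single_example(serialized, {
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })
        image = tf.image.decode_jpeg(features["image"], channels=3)
        image = tf.image.resize_images(image, [224, 224])
        return {"x": image}, features["label"]

    def input_fn():
        files = tf.data.Dataset.list_files(pattern)
        dataset = (files.flat_map(tf.data.TFRecordDataset)
                        .shuffle(10000)
                        .repeat()
                        .map(parse)
                        .batch(batch_size))
        return dataset.make_one_shot_iterator().get_next()

    return input_fn
```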

4.3 Results with CIFAR-10

First, we evaluated whether to conduct asynchronous or synchronous training, using a configuration with one parameter server and three GPU worker nodes. We carried out both experiments and show the results in Table 1, evaluating several iterations of the process. Regarding measurement, for these and the following experiments we have evaluated the number of global steps per second, that is, the number of batches processed per second, and the average number of examples per second, that is, the number of instances that the model can process per second.

Table 1: Comparison of both training schemes: synchronous and asynchronous.

  Training       G.step/sec   Avg.ex/sec
  Synchronous    3.26±1.14    485±67
  Asynchronous   10.04±5.59   2755±765

After getting better results with asynchronous training, we have chosen it for the following experiments. We have performed 60k steps globally over the training partition, with batches of 128 images, to carry out a fair comparison among the different architectures. With this training on a physical machine with one GPU, we have obtained a classification accuracy of 92.27%, which we used to measure the convergence of the other models. Therefore, we will use this value to compare timing measures over 60k steps or the equivalent in a distributed configuration, that is, 30k-30k if there are two worker nodes or 20k-20k-20k if there are three.

Regarding the single node experiment, we executed the training with different hardware configurations. We refer to the physical machines with the prefix phys and to the virtual ones with virt, while denoting the use of CPU or GPU by adding the suffix cpu or gpu, respectively. In addition to global steps and average examples per second, we also evaluated the total time, that is, how long the training takes to converge. This is not always feasible, considering that CPU-based nodes can take a considerable amount of time. Table 2 shows the results for the comparison among single nodes. The * means that training did not fully converge, and an estimation of the time to convergence is provided instead. We noticed the obvious performance boost that the GPU provides with this kind of data. These problems are extremely resource-consuming for CPUs, discouraging any possible training using this hardware. It is important to remark again the difference between the physical GPU and the virtual one. In the next experiment we explore the options to parallelize the problem.

Table 2: Single node results with different hardware (G = G.step/sec, A = Avg.ex/sec., T = T.time (min)).

  Machine    G           A            T
  phys cpu   0.44±0.02   54.94±2.78   2200 (*)
  virt cpu   0.18±0.02   25.12±0.92   4860 (*)
  phys gpu   22±1        2804±22      ~45
  virt gpu   7±0.56      850±15       ~142

The next experiments involve the use of a distributed configuration. For doing this, we have deployed the infrastructure on AWS and run the experiments, taking the same measures as before. Table 3 summarizes the results. We divided the number of steps among the workers accordingly to add up to 60k, and the values shown refer to the architecture as a whole.

Table 3: Multiple node results with different hardware (G = G.step/sec, A = Avg.ex/sec., T = T.time (min)).

  Machine                                 G            A          T
  virt cpu + virt gpu x 2 (1 PS - 2 WN)   7.68±2.95    1259±395   ~92
  virt gpu x 3 (1 PS - 2 WN)              13.04±0.48   1687±80    ~41
  virt cpu + virt gpu x 3 (1 PS - 3 WN)   10.04±5.59   2755±965   ~43

Some results show very high variance; this was a side-effect of using a lower-end instance type for the PS, such as t2.small. During the experiments with this kind of node, the global steps per second decreased gradually along the iterations until they reached lower values than in the virt gpu single-node training. Even if it seems counter-intuitive, using this fourth node as PS caused a bottleneck in the system. Figure 5 illustrates this fact, which we discovered after suspecting and discarding HDFS latency to access the data or the model as the cause. It seems that, at a certain point, the resources of the CPU-based PS were exhausted, and then the performance got worse. Changing the type of instance solved the problem, as can be seen in the improvement of the results using virt gpu x 3, which provides a stable global step/sec rate, at the expense of losing a GPU node for the computationally intensive work.

Figure 5: TensorBoard screenshot, showing the bottleneck in the middle of the training due to the use of a low-end CPU-based instance.

Regarding flexibility, we performed the previous experiments again with the distributed configuration, using our deployment system with the IM and Ansible, changing parameters easily and interacting just with the RADL file that defines the architecture. Considering this, we found that the best combination is the use of three nodes (a PS and two WNs) that are GPU-based. In the following section we explore some of these setups with the computer vision research problem provided by the CCG.

4.4 Results with Age Recognition Task

Table 4 shows the results of the best systems after the evaluation, considering phys gpu, virt gpu x 3 and virt cpu + virt gpu x 3, the setups that provided the best results among the virtualized systems.

Table 4: Single physical node and multiple node results on the Age dataset, with different hardware.

  Machine                                 G.step/sec   Avg.ex/sec
  phys gpu                                18.53±1.82   2217±190
  virt gpu x 3 (1 PS - 2 WN)              10.06±0.78   1353±47
  virt cpu + virt gpu x 3 (1 PS - 3 WN)   12.45±2.78   1498±317

Concerning the storage, we did not perceive any important latency caused by the HDFS system, neither for CIFAR-10 nor for the faces dataset. Once the dataset is uploaded, we performed experiments with and without distributing the data and the model, without noticing any remarkable difference. This may change with massive datasets, since they require more complicated management. In terms of the introduced latency, the dynamic deployment latency of the virtual infrastructure, although unavoidable, will be absorbed by the training when a real time-consuming task is carried out, which can imply hours or days of training. The advantage of performing all the steps without human intervention represents a significant step forward. It usually took between 15 and 20 minutes to deploy the configuration with one t2.small and three p2.xlarge instances. In summary, the results indicate that using automated deployment on the Cloud to take advantage of GPU-based nodes is a convenient approach for cost-effective training, as you can easily match your needs with your budget, enabling access to both basic units and high-performance GPUs.

4.5 Discussion

Increasing the throughput of the training of deep neural networks has become mandatory to solve today's challenging tasks. Parallelizing training using GPUs seems the way to go since the proposal in (Krizhevsky et al., 2012), where the process of sharing a model between GPUs in order to accelerate the training was done manually, with a lot of human effort. This has changed, and it is now very easy, as shown in this work, to code the training using TF to distribute the training effort among different distributed GPUs. In this regard, we have shown how to deploy the required infrastructure in a public Cloud provider to get the most out of the functionality that this framework provides.

The parallel training scheme suits several problems well, as shown in (Dean et al., 2012), but there are more approaches, such as Elastic Averaging SGD (Zhang et al., 2015). That work proposes a distributed optimization scheme designed to reduce the communication overhead with the parameter server. Ideas in that direction are concerned with the function that the parameter server computes in order to favour the convergence of the training. Our work aimed to develop the architecture and the deployment, as well as to implement a generic distributed training, so these kinds of evaluations regarding training convergence are out of its scope.

We have decided to focus on data parallelism, as our motivation was to provide the DL pipeline for problems that leverage this kind of scheme, i.e., several instances that are distributed in a data cluster, but there are tasks where this is not enough. Among these tasks, we can consider machine translation, where models, or compositions of them, are huge. This is the problem that the authors faced in (Wu et al., 2016), where they used model parallelism as well as data parallelism. For these kinds of models, there is no alternative to training them in a reasonable time.

5 CONCLUSIONS

In this work we provided and evaluated a deployment system to provision, configure and execute a distributed TF training pipeline in an unattended way, based on cluster computing in the Cloud. Using the IM and Ansible, we can modify our architecture and train easily just by modifying a configuration file, being able to deploy on different Cloud providers.

Regarding software, we have used state-of-the-art tools, such as TF and CUDA, performing experiments with two different problems: CIFAR-10 and the age recognition problem. We have shown that TF scripts can be adapted easily if they use the Estimator API, a high-level API that eases the development of deep learning models.

Concerning the results, we are close to the baseline obtained with the physical GPU, even considering that it is two generations ahead of the virtual GPUs that we used. However, we achieved a good performance considering the price of these instances. The present approach democratizes the access to cost-effective distributed computing based on TF on the Cloud to perform the training of deep learning models.

Future work includes using massive datasets to identify the bottlenecks of the proposed approach. Also, more performant instance types were not used due to budget constraints. This goes in line with using the p3 family of instance types, which feature more than one GPU per node, to evaluate in-node parallelization and between-node parallelization jointly.

ACKNOWLEDGEMENTS

The authors would like to thank the Spanish "Ministerio de Economía, Industria y Competitividad" for the project "BigCLOE" with reference number TIN2016-79951-R. The development of this work has been partially supported by the project "AGATHA: Intelligent Analysis System for surveillance/crime control on open information sources" (project No. 010822), financed by the European Regional Development Fund (ERDF) through COMPETE 2020 - Operational Program for Competitiveness and Internationalisation and Lisbon 2020 - Lisbon Regional Operational Program 2014-2020.

REFERENCES

Abadi, M. et al. (2016). TensorFlow: A system for large-scale machine learning. In Proc. of OSDI, 2016, pages 265–283.

Amazon SageMaker (2018). [online-Feb.2021] https://aws.amazon.com/sagemaker/.

Ansible (2018). [online-Feb.2021] https://www.ansible.com/.

Bannon, D. et al. (2018). DeepCell 2.0: Automated cloud deployment of deep learning models for large-scale cellular image analysis. bioRxiv, page 505032.

Brito, P. et al. (2018). Agatha: Face benchmarking dataset for exploring criminal surveillance methods on open source data. In Proc. of ICGI, 2018, pages 1–8. IEEE.

Caballer, M. et al. (2015). Dynamic management of virtual infrastructures. Journal of Grid Computing, 13(1):53–70.

Cloud AI (2018). [online-Feb.2021] https://cloud.google.com/products/ai/.

Dean, J. et al. (2012). Large scale distributed deep networks. In Proc. of NIPS, 2012, pages 1223–1231.

Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

European Grid Infrastructure (2018). [online-Feb.2021] https://www.egi.eu.

Hadoop (2018). [online-Feb.2021] http://hadoop.apache.org.

He, K. et al. (2016). Deep residual learning for image recognition. In Proc. of CVPR, 2016, pages 770–778.

Infrastructure Manager (2018). [online-Feb.2021] http://www.grycap.upv.es/im/.

Jorge Cano, J. (2019). Entrenamiento escalable de modelos de deep learning sobre infraestructuras cloud. [online-Feb.2021] https://riunet.upv.es/handle/10251/115454.

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A. et al. (2012). ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, 2012, pages 1097–1105.

Lee, S. et al. (2018). TensorLightning: A traffic-efficient distributed deep learning on commodity Spark clusters. IEEE Access, 6:27671–27680.

Luckow, A. et al. (2016). Deep learning in the automotive industry: Applications and tools. In Proc. of Big Data 2016, pages 3759–3768. IEEE.

MXNet (2018). [online-Feb.2021] https://mxnet.apache.org.

PyTorch (2018). [online-Feb.2021] https://pytorch.org.

Reaño, C. et al. (2015). Local and remote GPUs perform similar with EDR 100G InfiniBand. In Proceedings of the Industrial Track of the 16th International Middleware Conference, page 4. ACM.

Rumelhart, D. E. et al. (1985). Learning internal representations by error propagation. Technical report, Institute for Cognitive Science, California.

Sergeev, A. and Del Balso, M. (2018). Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.

Shvachko, K. et al. (2010). The Hadoop distributed file system. In Proc. of MSST, 2010, pages 1–10. IEEE.

Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.

Szegedy, C. et al. (2015). Going deeper with convolutions. In Proc. of CVPR, 2015, pages 1–9.

Vapnik, V. N. (1998). Statistical Learning Theory, volume 1. J. Wiley & Sons.

Wittig, M. and Wittig, A. (2016). Amazon Web Services in Action. Manning.

Wu, Y. et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Zhang, S. et al. (2015). Deep learning with elastic averaging SGD. In Proc. of NIPS, 2015, pages 685–693. Curran Associates, Inc.
