
Article

Empirical Performance Analysis of Collective Communication for Distributed Deep Learning in a Many-Core CPU Environment

Junghoon Woo, Hyeonseong Choi and Jaehwan Lee * School of Electronics and Information Engineering, Korea Aerospace University, 76, Hanggongdaehak-ro, Deogyang-gu, Gyeonggi-do, Goyang-si 10540, Korea; [email protected] (J.W.); [email protected] (H.C.) * Correspondence: [email protected]; Tel.: +82-2-300-0122

 Received: 28 August 2020; Accepted: 21 September 2020; Published: 25 September 2020 

Abstract: To accommodate large amounts of training data and complex training models, "distributed" deep learning training has become employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we propose a new collective communication method in a Python environment by utilizing Multi-Channel Dynamic Random Access Memory (MCDRAM) in Intel Xeon Phi Knights Landing processors. Major deep learning software platforms, such as TensorFlow and PyTorch, offer Python as a main development language, so we developed an efficient communication library by adapting the Memkind library, which is a C-based library for utilizing the high-performance memory MCDRAM. For performance evaluation, we tested the popular collective communication methods in distributed deep learning, such as Broadcast, Gather, and AllReduce. We conducted experiments to analyze the effect of high-performance memory and processor location on communication performance. In addition, we analyzed performance in a Docker environment, given the recent major trend of Cloud computing. Through extensive experiments on our testbed, we confirmed that communication with our proposed method showed performance improvements of up to 487%.

Keywords: Intel Knights Landing; distributed deep learning; Python; collective communication; high-bandwidth memory; Docker; MPI

1. Introduction

The performance of deep learning [1] has improved significantly, thanks to advanced neural network architectures, computer hardware (HW) performance improvements, and the utilization of large-scale datasets. To accommodate massive amounts of training data on large-scale neural network models, most deep learning platforms, such as TensorFlow [2] and PyTorch [3], offer "distributed" training methods to exploit multiple cores and machines. When executing distributed deep learning, the computation of gradients is executed by hundreds of cores in worker nodes, and the computed gradients must be shared with other nodes to keep the gradients up to date. Since fresh gradients must be propagated across nodes periodically, collective communication performance can be a major bottleneck in the distributed deep learning process [4]. Li et al. show a communication bottleneck when training deep learning models (AlexNet, ResNet-152, and VGG-16) through distributed deep learning in a five-machine cluster; in this training, about 75% of the total running time was communication time [4]. Therefore, efficient distributed deep learning requires reducing communication time by optimizing the communication layers.

Data parallelization is a representative distributed deep learning method. Data parallelization can be divided into a parameter server architecture and an AllReduce architecture according to the communication method. In the parameter server architecture, there are parameter server processes and worker processes. The parameter server manages model parameters, and workers compute gradients. In deep learning systems, parameters are the variables of the model to be trained. Gradients are the values used to find the optimal parameters, obtained after back-propagation. Therefore, to train the deep learning model in the parameter server architecture, communication between the parameter server and the workers is required. In contrast, the AllReduce architecture does not have a parameter server to manage model parameters. The gradients computed by each worker are shared with each other through peer-to-peer (P-to-P) communication. Therefore, when performing distributed deep learning, various types of communication are performed, and an efficient communication method is required to improve the performance of the distributed deep learning system.

Python is a library-rich, object-oriented language available on a variety of platforms [5]. It is used in many fields, including computer science, because it is easy to learn and use, especially for deep learning applications. TensorFlow and PyTorch, which are typical frameworks for deep learning, are also based on Python. For these reasons, Python needs to use multiple cores and simultaneous communications to support distributed deep learning efficiently. In Python-based distributed deep learning environments, we can use multi-processing or multi-threading methods. The major difference between the two methods is the way memory is shared; multi-threading can share memory more efficiently than multi-processing. However, in Python environments, multi-threading has the disadvantage of limiting the use of computing resources because of Python's Global Interpreter Lock (GIL) [6,7]. These characteristics make Python easier to use but also present obstacles to performance. Therefore, simply adopting multi-threaded communication is not efficient for distributed deep learning training workloads.

In this paper, we propose a novel method of efficiently exchanging data when performing distributed deep learning in Python and many-core CPU environments, and we compare and analyze the proposed method through extensive experiments in various environments. The contributions of this paper are as follows:

• First, we propose a new method of exchanging distributed deep learning data using the Message Passing Interface (MPI) [8,9], which performs multi-processing instead of multi-threading, which suffers from Python's GIL. We developed Application Programming Interfaces (APIs) for the Broadcast, Gather, and AllReduce collective communications [10], which are frequently used as MPI data communication methods in distributed deep learning. In addition, we compare and analyze Python's two MPI communication methods, buffer communication and object communication.
• Second, we improve our Python-based MPI library by leveraging the advanced performance of the on-package memory Multi-Channel Dynamic Random Access Memory (MCDRAM) of the Intel Xeon Phi Knights Landing processor [11], which is a typical many-core CPU used in supercomputing centers. To utilize MCDRAM, we adopted the Memkind library, a C-based library for memory allocation [12]. To use the Memkind library in a Python environment, Cython [13,14] is used to implement wrapping in Python.
• Third, for comparative analysis in various environments, we compare and analyze MPI communication performance through extensive experiments according to the Cluster mode and Memory mode of the Intel Xeon Phi Knights Landing processor. In addition, considering practical distributed deep learning environments, we also conducted the experiments on the open-source virtualization platform Docker [15] to see the performance difference in a Cloud computing environment.

The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 explains the background of our study. Section 4 explains how we developed a Python-based MPI library to exploit MCDRAM and how we bound it with Docker Containers. Section 5 describes the experiment setup and compares the results to evaluate the proposed method with existing methods. Section 6 summarizes the experiment results and discussions. Section 7 concludes this paper.

2. Related Work

There are studies that improve and evaluate the communication performance between nodes in distributed deep learning. Ahn et al. [16] propose a new virtual shared memory framework called Soft Memory Box (SMB) that can share the memory of remote nodes between distributed processes in order to reduce the communication overhead of large-scale distributed deep learning. SMB proposes a new communication technology other than MPI by constructing separate servers and virtual devices. Our research and SMB share the same goal of improving data communication performance for deep learning in a distributed environment. However, our work differs in that we propose a performance improvement based on utilizing the resources of high-performance computing together with the commonly used Python and MPI communication technologies.

There are studies that analyze and evaluate the performance of High-Bandwidth Memory (HBM). Peng et al. [17] compare the performance of HBM and DRAM and analyze the impact of HBM on High Performance Computing (HPC) applications. The results of this study show that applications using HBM achieve up to three times the performance of applications using only DRAM. We not only compare HBM and DRAM but also compare and analyze the performance of the two memories in a Docker Container environment.

There is a study that evaluates communication performance in many-core CPU and high-bandwidth memory environments. Cho et al. [18] compare MPI communication performance according to the MVAPICH2 shared memory technology and the location of the many-core CPU processor. In the experiment, intra-node communication and inter-node communication are compared. In addition, this study evaluates experiments according to the location of the memory and processor. Unlike this study, we propose to improve MPI performance in the Python environment, and we experiment and analyze according to the Cluster mode and Memory mode.

There is a study that evaluates the performance of the virtualization solutions Container and Hypervisor. Awan et al. [19] compare the performance overhead of Container-based and Hypervisor-based virtualization. The results of this study show that the average performance of Containers is superior to that of virtual machines. Therefore, in our study, we evaluate the performance in the Docker Container environment.

There are several studies evaluating communication performance using Containers in HPC environments. Zhang et al. [20] evaluate the performance of MPI in Singularity-based Containers. Experiments show that Singularity Container technology does not degrade MPI communication performance when using different Memory modes (Flat, Cache) on Intel Xeon and Intel Xeon Knights Landing (KNL) platforms. In contrast, our study evaluates and analyzes MPI performance in Docker Containers. In addition, Zhang et al. [21] propose communication through shared memory and Cross Memory Attach (CMA) channels when executing MPI tasks in a multi-container environment. According to this study, communication performance is improved in a large-scale Docker Container environment. Our study compares and analyzes multiple Docker Containers using a Python environment and high-performance memory.

Li et al. propose In-Network Computing to Exchange and Process Training Information Of Neural Networks (INCEPTIONN) to optimize communication in distributed deep learning [4]. They propose a compression algorithm to optimize communication and implement the algorithm in hardware.
In addition, through experiments, they show that INCEPTIONN can reduce communication time by up to 80.7% and improve speed by up to 3.1 times compared to conventional training systems. Unlike INCEPTIONN, in our study, we propose a software-based method to optimize communication in distributed deep learning by utilizing KNL's HBM and MPI.

3. Background

In this section, we describe distributed deep learning, the Python GIL, MPI, and Intel Knights Landing.

3.1. Distributed Deep Learning

To achieve high accuracy using deep learning, the size of datasets and training models is increasing. For example, the ImageNet dataset, one of the most used datasets, consists of over 1.2 million images [22]. In addition, the amount of data uploaded by Social Networking Service (SNS) users is increasing day by day [23]. In the case of deep learning models, the AmoebaNet model, which has shown the highest accuracy in recent years, consists of more than 550 M parameters [24]. Therefore, the computing power and memory resources required to perform deep learning are increasing [25,26]. To cope with this increase in required resources, distributed deep learning using multiple workers is performed.

The purpose of a deep learning system is to find the values of the parameters that yield the highest accuracy. In deep learning, the parameters are the weights of the features of the input data. Gradients are the derivatives of the loss function with respect to the model parameters. Training is therefore the process of finding parameters that minimize the loss function, and it is necessary to minimize the loss by applying gradients computed by distributed workers to the model parameters.

Distributed deep learning methods can be two-fold: data parallelization and model parallelization. In the case of data parallelization, the deep learning model is replicated across multiple workers [27,28]. Each worker trains the deep learning model with different input data. In the case of model parallelization, one training model is divided across multiple workers. Each worker trains part of the deep learning model with the same input data. In data parallelization, workers compute gradients using different input data. Therefore, it is necessary to aggregate the gradients computed by each worker and apply the aggregated gradients to the model parameters of all workers.

Data parallelization architectures can be divided according to the way gradient aggregation is performed. Typical data parallelization architectures are the parameter server and AllReduce architectures [29,30]. In the parameter server architecture, processes are divided into two types: parameter servers and workers. The parameter server updates model parameters using gradients computed by workers and sends the latest model parameters to the workers. The worker computes gradients to update the model parameters using the input data and the model parameters received from the parameter server, and then sends the gradients to the parameter server. The parameter server architecture can operate both synchronously and asynchronously. In the synchronous way, the parameter server broadcasts model parameters to all workers and gathers the computed gradients. In the asynchronous way, the parameter server exchanges model parameters and gradients with each worker through peer-to-peer communication. In the case of the AllReduce architecture, there is no server that collects the gradients computed by each worker and manages the model parameters. In the AllReduce architecture, each worker shares the gradients it computes with the others through AllReduce based on P-to-P communication. Therefore, the overhead due to communication between a parameter server and workers can be reduced. However, the AllReduce architecture has the limitation that it can operate only in a synchronous way.

As described above, distributed deep learning can be performed through various communication methods. Broadcast is a One-to-All communication method. In the synchronous parameter server architecture, the parameter server broadcasts the latest model parameters to all workers. Gather is an All-to-One communication. Gather is used when the parameter server collects the gradients computed by workers. AllReduce is an All-to-All communication. In the AllReduce architecture, AllReduce communication is used to share the gradients computed by each worker with the other workers. In addition, as the scale of the distributed deep learning system increases, the amount of communication increases. Therefore, in order to mitigate the performance degradation due to communication when performing distributed deep learning, an efficient communication method must be provided.
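To make the mapping between architectures and collective operations concrete, the following minimal sketch shows one synchronous data-parallel step in the AllReduce style, written with mpi4py and NumPy. The gradient values are dummies and the update rule is only illustrative; this is not the library evaluated later in this paper.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Replicated model parameters (data parallelization: every worker holds a copy).
params = np.zeros(1000, dtype=np.float32)

# Each worker computes a gradient on its own mini-batch (dummy values here).
local_grad = np.random.rand(1000).astype(np.float32)

# AllReduce architecture: sum the gradients of all workers, then average,
# so that every worker applies the same update without a parameter server.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
params -= 0.01 * (global_grad / size)   # simple SGD step with learning rate 0.01
```

In the parameter server architecture, the same step would instead be expressed as a Gather of gradients to the server followed by a Broadcast of the updated parameters.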

3.2. Python GIL

Python's Global Interpreter Lock (GIL) is a mutex that restricts access to Python objects to one thread at a time when multiple threads are used to execute Python code [6]. The GIL is needed because CPython's memory management is not thread-safe. CPython is the original implementation of Python, written in C [31], and it is the standard implementation of Python providing its basic functionality. In the early days of CPython's development, there was no concept of threads. Later, threads were adopted by many developers because CPython made them easy and concise to use. Over time, problems arose from the thread concept, and the GIL was created as a prompt solution to those problems. CPython threads need to acquire the GIL to run in one interpreter. When using asynchronous and multi-threaded code, Python developers can avoid process crashes caused by races or deadlocks on shared variables. For this reason, the GIL allows CPython to run only one thread at a time even when multiple threads are used. This is not a problem when implementing a single thread in Python, but it wastes resources when performing multi-threading in many-core CPU environments.

There are various methods to work around this problem. The first method, when performing multi-threading on a CPU, is to execute computation-heavy tasks in C code outside the GIL. The second method is to use multi-processing instead of multi-threading. Since CPython 2.6, the multiprocessing module has been included as a standard library. Multi-processing can share variables through queues or pipes, and it also has a lock object that locks the objects of the master process for use in other processes. However, multi-processing has one important disadvantage: it incurs significant overhead in memory usage and execution time. Therefore, in many-core environments, multi-processed Python execution is not an effective solution, especially when a large amount of in-memory data must be shared among processes.
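As a minimal illustration of the second workaround, the sketch below runs a CPU-bound function with the standard multiprocessing module; because each worker is a separate process with its own interpreter and its own GIL, the work can actually use several cores. The function and process count are arbitrary examples.

```python
import multiprocessing as mp

def cpu_bound(n):
    # Purely computational work; with threads, the GIL would serialize this.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Four worker processes, each with its own interpreter and GIL.
    with mp.Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [10**6] * 4)
    print(sum(results))
```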

3.3. MPI

The Message Passing Interface (MPI) is a library standard for message passing; it defines the basic functions, i.e., the Application Programming Interface (API), required for data exchange in distributed and parallel processing [9,10]. The standard supports both point-to-point and collective communication. An MPI application consists of several processes that exchange data with each other. Each process starts in parallel when the MPI application is launched and exchanges data with the others during execution. The advantage of MPI is that it can easily exchange data and information between multiple machines. MPI is used not only in parallel computers that exchange information at high speed, such as shared-memory systems or InfiniBand clusters, but also in PC clusters. MPI has various open implementations for different HW environments, such as MPICH, OpenMPI, and Intel MPI [32,33].

MPI allocates tasks on a per-process basis. Each process belongs to a set called a communicator. A communicator defines a group of processes that are able to communicate. Within the same communicator, messages can be exchanged with the other processes. In the communicator, processes are identified by rank. There are two types of communication methods in MPI: point-to-point communication and collective communication. MPI_Send and MPI_Recv are common point-to-point communication methods of MPI. Collective communication is communication in which all processes of a communicator participate together. MPI_Bcast, MPI_Gather, MPI_Allgather, MPI_Reduce, and MPI_Allreduce are representative collective communications. Figure 1 shows the procedure for collective communication. The experiments presented in this study perform MPI_Bcast, MPI_Gather, and MPI_Allreduce communication. MPI_Bcast is a Broadcast communication that sends data from the root process to all other processes in the same communicator. MPI_Gather gathers data from multiple processes in the communicator to the root process. MPI_Allreduce combines the values of all processes in the communicator and redistributes the result to all processes.


Figure 1. Message Passing Interface (MPI) collective communication.
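As a minimal sketch of how the three collectives used in this study look from Python, the example below uses mpi4py (assumed here as the Python MPI binding) with NumPy buffers; it would be launched with a command such as mpirun -np 4 python collectives.py.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD          # default communicator containing all processes
rank = comm.Get_rank()         # identity of this process within the communicator
size = comm.Get_size()

# MPI_Bcast: the root sends the same buffer to every process.
params = np.arange(4, dtype=np.float64) if rank == 0 else np.empty(4, dtype=np.float64)
comm.Bcast(params, root=0)

# MPI_Gather: the root collects one block from every process.
block = np.full(2, rank, dtype=np.int32)
gathered = np.empty(2 * size, dtype=np.int32) if rank == 0 else None
comm.Gather(block, gathered, root=0)

# MPI_Allreduce: element-wise reduction whose result is delivered to all processes.
local = np.full(4, float(rank))
total = np.empty(4, dtype=np.float64)
comm.Allreduce(local, total, op=MPI.SUM)
```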

3.4. Many-Core CPU: Knights Landing

The Intel Xeon Phi Knights Landing (KNL) processor is a many-core processor developed by Intel for use in supercomputers, high-performance servers, and advanced workstations [11]. Figure 2 shows the KNL diagram. A socket of KNL can have up to 36 active tiles, and each tile consists of two cores [34–36]. Thus, a Knights Landing socket can have up to 72 cores. The two cores in each tile communicate through the 2D Mesh Interconnect Architecture based on the Ring Architecture. The communication mesh is composed of four parallel networks. Each network provides different types of packet information, such as commands, data, and responses. It is optimized for Knights Landing traffic flows and protocols.

Figure 2. Diagram of Knights Landing (KNL) process architecture. Redrawn from [36], Sandia National Lab. (SNL-NM), 2017.

KNL has three main Cluster modes. The Cluster mode defines various methods in which memory requests generated by the cores are executed by the memory controllers. To maintain cache coherency, KNL has a Distributed Tag Directory (DTD) [37]. The DTD consists of a set of Tag Directories (TDs) for each tile that identify the state and location of the chips in the cache line. For every memory address, the hardware can identify the TD of that address as a hash function. There are three Cluster modes in KNL: All-to-All, Quadrant, and Sub-NUMA Clustering-4 (SNC-4). In All-to-All mode, memory addresses are evenly distributed across all TDs of a chip, and this may result in a long waiting time due to cache hits and cache misses. Quadrant mode divides the tile into four parts called quadrants. The chip is divided into quadrants but mapped to a single Non-Uniform Memory Access (NUMA) node. Memory addresses served by a memory controller in a quadrant are guaranteed to be mapped only to TDs contained in that quadrant. SNC-4 mode divides the chip into quadrants, and each quadrant is mapped to a different NUMA node. In this mode, NUMA-aware software can pin software threads to the same quadrant that contains the TD and accesses NUMA-local memory. SNC-4 mode has the memory controller and TD in the same area and has a shorter path than Quadrant mode. Because the memory controllers are physically close to each other, this mode has the lowest latency provided that communication stays within a NUMA domain. It is recommended that users set the Cluster mode according to the characteristics of the application. An appropriate mode should be set for the lowest latency with the cache and the largest communication bandwidth.

KNL has two types of memory, MCDRAM and Double Data Rate 4 (DDR4) [38]. Multi-Channel Dynamic Random-Access Memory (MCDRAM) is an on-package high-bandwidth memory. MCDRAM has about five times the read and write bandwidth of DDR4. MCDRAM has three Memory modes: Cache, Flat, and Hybrid. In Cache mode, the entire MCDRAM is used as a cache. In Flat mode, both MCDRAM and DDR4 memory are mapped to the same system address space. Hybrid mode is divided so that a part of MCDRAM is used as a cache, and the rest is used as in Flat mode.

4. Python-Based MPI Collective Communication Library

This section describes how we built the Python-based MPI collective communication library considering the features of the many-core architecture and MCDRAM. We used the Memkind library and the Cython wrapping method to use the high-performance memory MCDRAM in a Python environment. In addition, we discuss how to bind Docker Containers to cores and memory when performing the experiments in the Docker environment.

4.1. Memkind Library for Using MCDRAM

In order to use MCDRAM, the high-performance memory in the Intel Xeon Phi Knights Landing processor, we use the Memkind library, which is a user-extensible heap manager built on top of jemalloc [12]. The Memkind library can control memory characteristics and partition the heap between memory types. Memkind is implemented in user space. It allows the user to manage new kinds of memory, such as MCDRAM, without modifying the application software or operating system. The Memkind library can be generalized to any NUMA architecture, but in the case of Knights Landing processors, it is mainly used to manually allocate MCDRAM using special allocation calls for C/C++.

Figure 3 presents how the Memkind library allocates memory through two different interfaces, hbwmalloc and memkind. Both interfaces use the same backend. The memkind interface is used as a general interface, and the hbwmalloc interface is used for high-bandwidth memory like MCDRAM. The memkind interface uses the memkind.h header file, and the hbwmalloc interface uses the hbwmalloc.h header file. In this paper, the hbwmalloc interface is used because the experiments use MCDRAM.

Figure 3. Architecture difference between general memory allocation and Multi-Channel Dynamic Random Access Memory (MCDRAM) memory allocation using Memkind.

In general, when dynamically allocating memory in the C and C++ languages, memory is allocated using functions such as malloc, realloc, and calloc, and the free function is used to deallocate dynamic memory. For dynamic memory allocation using the hbwmalloc interface of the Memkind library, memory can be allocated in MCDRAM using the hbw_malloc, hbw_realloc, and hbw_calloc functions, and the hbw_free function deallocates the allocated memory. To implement code in the C and C++ languages using the hbwmalloc interface in a Linux environment, we use gcc with the -lmemkind option when compiling the code.
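For a quick check outside of the Cython wrapper described in the next subsection, the hbwmalloc functions can also be reached from Python through ctypes. The sketch below is only an illustration under assumptions: it assumes that libmemkind.so is installed and exports the hbw_* symbols, and it is not the wrapping approach used in this paper.

```python
import ctypes

# Assumption: the memkind package installs libmemkind.so, which exports the hbwmalloc API.
lib = ctypes.CDLL("libmemkind.so")

lib.hbw_check_available.restype = ctypes.c_int
lib.hbw_malloc.restype = ctypes.c_void_p
lib.hbw_malloc.argtypes = [ctypes.c_size_t]
lib.hbw_free.argtypes = [ctypes.c_void_p]

if lib.hbw_check_available() == 0:          # 0 means high-bandwidth memory is available
    ptr = lib.hbw_malloc(256 * 1024 * 1024) # allocate 256 MB in MCDRAM
    # ... use the buffer through ctypes or a NumPy wrapper ...
    lib.hbw_free(ptr)
```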

4.2. Wrapping Memkind

This subsection describes how we use the KNL processor's MCDRAM in a Python environment. The Numpy [39–41] and Memkind libraries are used to perform Python MPI communication using MCDRAM. Numpy is a Python library for mathematical and scientific operations that makes it easy to process matrices and large multidimensional arrays. Wrapping is implemented in Python using Cython [13,14] to use the Memkind library in Python environments. Figure 4 shows the wrapping method with Cython, which consists of the following four steps.

• Step 1. Implement C code that allocates memory using Memkind.
• Step 2. The Cython file has the .pyx extension. Wrap the C code implemented in Step 1 using Cython code.
• Step 3. Cython code must be compiled differently from Python. Implement a setup file to perform the compilation.
• Step 4. Import the C module defined in Cython.pyx and setup.py into the Python MPI code.

In the first step, hbw_malloc.c implements the code that allocates memory using the hbw_malloc function from the Memkind library. In the second step, Cython.pyx is defined to wrap the array whose memory was allocated with the hbw_malloc function in the first step into a Numpy array. To receive C code from a Python object, Cython.pyx needs to get a pointer to the PyObject structure. This step defines the arrays used to exchange data. We use PyArrayObject when defining an array. PyArrayObject is a C API provided by Numpy. Numpy provides a C API to access array objects for system extensions and use in other routines. PyArrayObject contains all the information necessary for a C structured array. We use the PyArray_SimpleNewFromData function on PyArrayObject and create a Python array wrapper for C [42]. In the third step, the Python code setup.py is implemented using the distutils package provided by Cython to compile the Cython.pyx file. The distutils package supports building and installing additional modules in Python. To compile the Cython.pyx file, setup.py contains all compile-related information. The setup.py uses the Cython compiler to compile the Cython.pyx file into a C extension file. After that, the C extension file is converted to a .so or .pyd extension file by the C compiler, and the .so file is used as a Python module. Finally, the fourth step is to implement the Python MPI code that exchanges the actual data. The MPI code defines the Numpy arrays used for data communication. In defining the Numpy array, the hbw_malloc module defined in Cython.pyx is used for the array data part of Numpy. MPI communication is performed with Python wrapped with Memkind and Numpy. Therefore, through this process, data exchange is performed using MCDRAM. A sketch of how the resulting wrapper can be used from Python MPI code is shown after Figure 4.

Figure 4. Memkind Cython Wrapping implementation architecture.
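The following sketch illustrates the fourth step, using the Cython-built wrapper from Python MPI code. The module and function names (hbw_numpy, hbw_empty) are placeholders for the wrapper described above, since the paper does not fix their names; mpi4py is assumed as the Python MPI binding.

```python
from mpi4py import MPI
import numpy as np
# Placeholder import: the extension module produced by compiling Cython.pyx with setup.py.
from hbw_numpy import hbw_empty   # assumed to return a NumPy array backed by hbw_malloc

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 64 * 1024 * 1024                # 256 MB of int32, as in the Broadcast experiment
buf = hbw_empty(n, dtype=np.int32)  # MCDRAM-backed array instead of np.empty(...)
if rank == 0:
    buf[:] = 1                      # the root fills the buffer to be broadcast

comm.Bcast(buf, root=0)             # buffer-based Bcast operates directly on the MCDRAM array
```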

4.3. Binding Memory and Core to Docker

This subsection describes how we conduct experiments in a Docker environment considering a distributed environment. Docker is an open-source platform that automates the task of placing Linux applications into software units called Containers. A Container is a standard software unit that packages code and all of its dependencies, which allows applications to run quickly and reliably from one computing environment to another.

In the experiments of this study, to analyze the optimal MPI communication performance on KNL, we experiment according to the Memory mode and Cluster mode. In the experiments, the Memory mode uses Flat mode and Cache mode, and the Cluster mode uses Quadrant mode and SNC-4 mode. In Quadrant mode, the chip is divided into four virtual quadrants but is exposed to the operating system as a single NUMA node [33]. In Quadrant mode, unlike SNC-4 mode, the chip is not divided into multiple NUMA nodes. In Flat mode, MCDRAM is used as addressable memory managed by the operating system; therefore, MCDRAM is exposed as a separate NUMA node. As shown in Figure 5, DDR4 and the cores are NUMA node 0, and MCDRAM is NUMA node 1. Therefore, when Flat mode is selected in Quadrant mode, there are two NUMA domains. Thus, in Quadrant mode, there is no need to allocate NUMA nodes to two Docker Containers or four Docker Containers.

Figure 5. Non-Uniform Memory Access (NUMA) configuration in Quadrant mode with High-Bandwidth Memory (HBM) in Flat mode.

On the other hand, in SNC-4 mode, each quadrant of the chip is exposed as a separate NUMA node to the operating system. SNC-4 mode looks like four sockets in succession because each quadrant appears as a separate NUMA domain to the operating system. The cores and DDR4 are divided into quarters and appear grouped into four nodes. In addition, the MCDRAM is divided into quarters, adding four more NUMA nodes. Therefore, a total of 8 NUMA nodes are exposed to the operating system. The divided NUMA nodes have a distance between nodes. For example, in Figure 6, node 0 is closer to node 4 than to the other MCDRAM nodes (node 5, node 6, node 7).

Figure 6. NUMA configuration in Sub-NUMA Clustering-4 (SNC-4) mode with HBM in Flat mode.

In order to obtain optimized performance in SNC-4 mode, code modification or NUMA environment control is required. In SNC-4 mode, when the path between the MCDRAM node and the DDR4 + core node is kept short, the waiting time is reduced and the performance is improved. Therefore, we assign a Container to two adjacent nodes, as in the architecture in Figure 7. The first Container binds to node 0 (cores, DDR4) and node 4 (MCDRAM). The second Container binds node 1 and node 5, the third Container binds node 2 and node 6, and the fourth Container binds node 3 and node 7. When creating a Docker Container, we declare the privileged option so that the Containers can access the Host's devices. We set up a port to communicate between the Containers. For the Host environment, we use the same binding policy for cores and memory in SNC-4 mode.

Figure 7. Four Docker Containers architecture bound to NUMA nodes in SNC-4 mode.

5. Performance Evaluation

In this section, we compare and analyze MPI communication performance in high-performance memory and Python environments through experiments. The server used in the experiments consists of a KNL processor with 68 cores, 16 GB of MCDRAM, and 96 GB of DDR4. For the software (SW) environment, we used Linux (Kernel Version 5.2.8) and Intel MPI (Version 2017 Update 3). Due to the GIL feature of Python, the experiments exchanged data using MPI, which is a multi-processing method, instead of multi-threading. We tested the Broadcast, Gather, and AllReduce collective communication methods, which are widely used MPI communication methods in parallel and distributed deep learning.

Python MPI communication has two methods: Object and Buffer communication [43]. Object communication provides Pickle-based communication to conveniently communicate objects in Python. Object communication uses all-lowercase methods, like bcast(), gather(), and allreduce(). Buffer communication is an efficient communication method for array data, such as NumPy arrays, and uses uppercase methods, like Bcast(), Gather(), and Allreduce(). The Pickle base of object communication converts object data into byte streams and byte streams back into object data [44]. On the other hand, buffer communication allows direct access to object data using the buffer interface [45]. The objects supported by the buffer interface are byte and array objects. In this study, buffer communication can be used because Numpy arrays are used in the experiments. Object communication is simple to use but has an overhead compared to buffer communication, which does not require a conversion process.
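The difference between the two methods can be seen in the following minimal sketch (using mpi4py; the payloads are arbitrary examples): the lowercase call pickles and unpickles an arbitrary Python object, whereas the uppercase call moves the NumPy buffer directly.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Object communication (lowercase): any picklable object, converted to a byte stream.
obj = {"step": 1, "lr": 0.01} if rank == 0 else None
obj = comm.bcast(obj, root=0)

# Buffer communication (uppercase): operates directly on the array buffer, no pickling.
arr = np.arange(8, dtype=np.float64) if rank == 0 else np.empty(8, dtype=np.float64)
comm.Bcast(arr, root=0)
```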
In Sections 5.1–5.3, to evaluate the proposed scheme, the Cluster mode of the KNL processor was tested in Quadrant mode. The Memory mode was set to Flat mode and Cache mode for comparison. Different combinations of DDR4 and MCDRAM were compared by changing the number of processes in the C and Python environments. In addition, to see the Python communication performance characteristics, we used both the Buffer and Object communication methods. In order to consider the distributed environment, we conducted experiments on distributed communication in Docker's virtual environment. In Section 5.4, we compare Quadrant mode and SNC-4 mode among the Cluster modes of the KNL processor.
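The communication times reported below can be obtained with a simple harness such as the following sketch; this is a plausible measurement procedure using MPI.Wtime and barriers, not necessarily the exact script used for the experiments.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
data = np.ones(64 * 1024 * 1024, dtype=np.int32)   # 256 MB payload, as in Section 5.1

comm.Barrier()                    # synchronize all processes before timing
t0 = MPI.Wtime()
comm.Bcast(data, root=0)
comm.Barrier()                    # ensure every process has finished the collective
elapsed = MPI.Wtime() - t0

if comm.Get_rank() == 0:
    print(f"Bcast of 256 MB took {elapsed:.4f} s")
```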

5.1. Broadcast

In the Broadcast experiment, the data type was Integer and the size of the transferred data was 256 MB. Figure 8 shows the Broadcast communication time results. In the experiment graphs of Figure 8, C represents MPI communication in the C environment, Bcast_B represents MPI buffer communication in the Python environment, and Bcast_O represents MPI object communication in the Python environment.


Figure 8. Broadcast communication time results according to the experiment environment: (a) Host; (b) one Docker Container; (c) two Docker Containers; (d) four Docker Containers.

Figure 8 shows the result of executing Broadcast. Figure 8a is the Host environment, Figure 8b is the one Docker Container environment, Figure 8c is the two Docker Containers environment, and Figure 8d is the four Docker Containers environment. In all environments, communication using MCDRAM in Flat mode showed the fastest communication time. The case of using DDR4 in Cache mode was second fastest. Finally, the slowest communication time was shown when DDR4 was used in Flat mode. The method proposed in this paper is Bcast_B performing Python buffer communication using MCDRAM in Flat mode. When using only DDR4 in Flat mode, Bcast_B showed slower times than C. However, in the Host and one Container environments, the communication times of Bcast_B and C were similar in MCDRAM of Flat mode. Bcast_B communication in MCDRAM of Flat mode was confirmed to improve performance by up to 487% compared to Bcast_B communication in DDR4 of Flat mode. This best case was shown when 16 processes were used in one Docker Container. In addition, the performances of the Host (Figure 8a) and one Container (Figure 8b) environments were similar. This result confirms Docker's claim that the performance difference between the bare Host and a Docker Container is not significant because Docker directly uses the Host's resources. However, as the number of Containers increased, Bcast_B in MCDRAM of Flat mode showed slower communication performance compared to C. Through this result, we confirmed that communication across multiple Docker Containers is relatively less efficient than communication in the Host environment.

As mentioned in the Background section, in Cache mode, MCDRAM is used as a cache. The MCDRAM cache is treated as a Last Level Cache (LLC), which is located between the L2 cache and addressable memory in the memory hierarchy of Knights Landing processors. Therefore, it can be seen that the Bcast_B communication in Cache mode is up to 88% faster than the Bcast_B communication in DDR4 of Flat mode.

5.2. Gather

In the Gather experiment, the data type was Integer and the size of the transferred data was 16 MB. Figure 9 shows the Gather communication time results. Like the Broadcast experiment, in the graphs of Figure 9, C represents MPI communication in the C environment, Gather_B represents MPI buffer communication in the Python environment, and Gather_O represents MPI object communication in the Python environment.


Figure 9. Gather communication time results according to the experiment environment: (a) Host; (b) one Docker Container; (c) two Docker Containers; (d) four Docker Containers.

Figure 9 shows the result of performing Gather communication. Figure 9a is the Host environment, Figure 9b is the one Docker Container environment, Figure 9c is the two Docker Containers environment, and Figure 9d is the four Docker Containers environment. Similar to the Broadcast experiment in Figure 8, communication in MCDRAM of Flat mode was the fastest and DDR4 of Flat mode was the slowest. In the Gather experiment, the data size is smaller than that of Broadcast and AllReduce because Gather communication collects data from many processes in a single process, and this feature of Gather communication generates Out Of Memory errors when the data size is large. The performance differences between configurations appear small because the data size is small. A notable feature of this experiment is that Gather_B in MCDRAM of Flat mode shows faster communication performance than C. It was confirmed that the Gather_B communication in MCDRAM of Flat mode has a performance improvement of up to 65% compared to the C communication in MCDRAM of Flat mode. In addition, the Gather_B communication in MCDRAM of Flat mode showed up to a 74% performance improvement over the Gather_B communication in DDR4 of Flat mode. It can also be seen that Gather_B communication in Cache mode is up to 13% faster than Gather_B communication in DDR4 of Flat mode. Like Broadcast, the Host environment (Figure 9a) and the one Container environment (Figure 9b) showed similar performance. In addition, communication performance degraded as the number of Containers increased.

5.3. AllReduce

In the AllReduce experiment, the data type was Integer and the size of the transferred data was 256 MB. Figure 10 shows the AllReduce communication time results. As in the previous experiments, in the graphs of Figure 10, C indicates MPI communication in the C environment, All_B indicates MPI buffer communication in the Python environment, and All_O indicates MPI object communication in the Python environment.

Figure 10. AllReduce communication time results according to the experiment environment: (a) Host; (b) one Docker Container; (c) two Docker Containers; (d) four Docker Containers.

Figure 10 shows the results of executing AllReduce. Figure 10a is the Host environment, Figure 10b is the one Docker Container environment, Figure 10c is the two Docker Containers environment, and Figure 10d is the four Docker Containers environment. Across the experimental environments, the results follow the same trend as in the previous experiments. With DDR4 in Flat mode, All_B showed slower performance than C; however, in MCDRAM of Flat mode, the communication performance of All_B and C was similar. A distinctive feature of the AllReduce experiment appears in Python's object communication. In most of the previous experiments, communication was fastest in MCDRAM of Flat mode, but in AllReduce, object communication All_O in Cache mode was faster than object communication All_O in MCDRAM of Flat mode. This result seems to stem from the nature of AllReduce, which combines values from all processes and distributes the result to all processes. Unlike the previous experiments, AllReduce includes a data reduction operation, and the cache appears to be utilized more heavily for this computation. Accordingly, we confirmed that object communication All_O in Cache mode improves performance by up to 76% compared with object communication All_O in MCDRAM of Flat mode.

All_B communication in MCDRAM of Flat mode showed up to 162% performance improvement over All_B communication in DDR4 of Flat mode, and All_B communication in Cache mode is up to 110% faster than DDR4 communication in Flat mode. The Host environment (Figure 10a) and the one Container environment (Figure 10b) showed almost the same performance. As in the previous experiments, communication performance degraded as the number of Containers increased.
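For reference, the following sketch (our illustration, not the authors' benchmark code; sizes are arbitrary) shows the All_B-style buffer Allreduce and the All_O-style object allreduce in mpi4py; the reduction operation (here MPI.SUM) is the extra computation step discussed above.

# Illustrative mpi4py sketch: buffer-based Allreduce (All_B-style) versus
# object-based allreduce (All_O-style). Not the benchmark code of this paper.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1024 * 1024                      # example element count per process (assumption)
local = np.full(n, rank, dtype=np.int32)

# Buffer communication: Allreduce sums the arrays element-wise and leaves
# the combined result on every process, without pickling.
result = np.empty_like(local)
comm.Allreduce(local, result, op=MPI.SUM)

# Object communication: allreduce pickles the Python object on every rank,
# which is convenient but slower for large data.
scalar_sum = comm.allreduce(rank, op=MPI.SUM)

print(rank, int(result[0]), scalar_sum)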

5.4. Cluster Mode

In this subsection, we compare and analyze the performance according to the Cluster mode. The previous experiments were performed with the Cluster mode set to Quadrant mode; in this subsection, the experiments are performed in SNC-4 mode, and we compare and analyze the performance differences according to the Cluster mode. The data type in the experiment was Integer, and the data size of Broadcast and AllReduce was 256 MB, while the data size of Gather was 16 MB. In the experiment graphs of Figure 11, C represents MPI communication in the C environment; Bcast_B, Gather_B, and All_B represent MPI buffer communication in the Python environment; and Bcast_O, Gather_O, and All_O represent MPI object communication in the Python environment. The comparison method is the same as in the previous sections; however, the Memory modes tested are DDR4 in Flat mode and MCDRAM in Flat mode. This section also differs from the previous ones in that we compare only two environment types, Host and four Docker Containers.

Figure 11. Broadcast, Gather, and AllReduce communication time results in the SNC-4 Cluster mode environment: (a) Broadcast Host; (b) Broadcast four Docker Containers; (c) Gather Host; (d) Gather four Docker Containers; (e) AllReduce Host; (f) AllReduce four Docker Containers.

Figure 11 shows the experiment results in the SNC-4 Cluster mode environment. Figure 11a is Broadcast in the Host environment, Figure 11b is Broadcast in the four Docker Containers environment, Figure 11c is Gather in the Host environment, Figure 11d is Gather in the four Docker Containers environment, Figure 11e is AllReduce in the Host environment, and Figure 11f is AllReduce in the four Docker Containers environment. As explained in the background section, SNC-4 mode divides the chip into quadrants, and each quadrant is mapped to a NUMA node. In this mode, the memory controller and the tag directory (TD) reside in the same quadrant, giving a shorter path than in Quadrant mode. For this reason, we expected SNC-4 mode to perform better than Quadrant mode. However, the experimental results showed little performance difference between SNC-4 mode and Quadrant mode. Due to the characteristics of the Broadcast, Gather, and AllReduce algorithms used in this paper, the advantages of SNC-4 Cluster mode could not be exploited. In SNC-4 mode, an MCDRAM NUMA node and its corresponding DDR4+core node are physically close, so SNC-4 mode shows the lowest latency when communication stays within a NUMA domain; however, when cache traffic crosses NUMA boundaries, the path is longer than in Quadrant mode. In this experiment, MPI collective communication creates a graph topology in a distributed manner, and each node defines its neighbors. Because of this MPI communication characteristic, it is difficult to exploit the advantages of SNC-4 mode when the communicating processes are not in the same domain. According to our experiments, MPI communication in Quadrant mode is effective, as it maintains transparency without code modification.
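As a side note on process and memory placement, the sketch below (our illustration, independent of the paper's experiment code) enumerates the NUMA nodes that Linux exposes; in Flat memory mode MCDRAM appears as memory-only NUMA nodes, and in SNC-4 mode each quadrant and its MCDRAM segment appear as separate nodes, which is why placement across domains matters.

# Illustrative sketch: list NUMA nodes via Linux sysfs to distinguish
# CPU+DDR4 nodes from memory-only MCDRAM nodes (Flat / SNC-4 modes).
# Standard sysfs paths; independent of the paper's experiment code.
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    with open(os.path.join(node, "meminfo")) as f:
        mem_kb = int(f.readline().split()[3])   # first line: "Node N MemTotal: X kB"
    kind = "CPU + memory (DDR4)" if cpus else "memory-only (MCDRAM in Flat mode)"
    print(f"{name}: cpus=[{cpus}] mem={mem_kb // 1024} MiB -> {kind}")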

6. Summary of Experimental Results

In this section, the results of the experiments performed in this paper are summarized. Tables 1–4 compare the performance improvement rates of Python communication and C communication in the Host and four Docker Containers environments. Table 1 shows the ratio of DDR4 to MCDRAM communication time when performing Python buffer communication in Flat mode. Table 2 compares the performance of C communication in Flat mode. Tables 3 and 4 compare the experimental results in the four Docker Containers environment. Table 5 compares the communication time of the Host environment with that of an increasing number of Docker Containers. Tables 6 and 7 compare the communication performance according to the Cluster mode. All experiments in the previous section varied the number of processes; in this summary, the communication time is compared at the number of processes that yielded the best performance improvement.

Table 1. Comparison of Python buffer communication time between Double Data Rate 4 (DDR4) and MCDRAM in Host environment (Flat mode).

            DDR4/MCDRAM (%)    Number of Processes
Broadcast   545                16
Gather      159                64
AllReduce   262                68

Table 2. Comparison of C communication time between DDR4 and MCDRAM in Host environment (Flat mode).

            DDR4/MCDRAM (%)    Number of Processes
Broadcast   271                16
Gather      103                16
AllReduce   201                32

Table 3. Comparison of Python buffer communication time between DDR4 and MCDRAM in four Docker Containers environment (Flat mode).

            DDR4/MCDRAM (%)    Number of Processes
Broadcast   202                68
Gather      131                64
AllReduce   217                64

Table 4. Comparison of C communication time between DDR4 and MCDRAM in four Docker Containers environment (Flat mode).

            DDR4/MCDRAM (%)    Number of Processes
Broadcast   105                68
Gather      108                64
AllReduce   102                32

Table 5. Communication time according to the increase in the number of Docker Containers.

                  Broadcast (s)   Gather (s)   AllReduce (s)
Host              0.285           0.194        1.261
One Container     0.288           0.199        1.271
Two Containers    1.991           0.508        4.447
Four Containers   2.231           0.628        4.677

Table 6. Communication time between Quadrant mode and SNC-4 mode in Host environment.

            Quadrant (s)   SNC-4 (s)
Broadcast   0.525          0.553
Gather      0.301          0.297
AllReduce   2.168          2.421

Table 7. Communication time of Quadrant mode and SNC-4 mode in four Docker Containers environment.

            Quadrant (s)   SNC-4 (s)
Broadcast   2.398          2.456
Gather      1.538          1.531
AllReduce   5.342          5.361

In Table 1, when Broadcast performed MPI collective communication with 16 processes, MCDRAM improved performance by 445% over DDR4. In addition, Gather improved performance by 59% with 64 processes, and AllReduce improved performance by 162% with 68 processes. As can be seen in Table 2, Broadcast showed a 171% performance improvement with 16 processes, Gather showed a 3% performance improvement with 16 processes, and AllReduce showed a 101% performance improvement with 32 processes. From Tables 1 and 2, it can be confirmed that, in the same Host experiment environment, the rate of performance improvement is greater for Python buffer communication than for C communication. Similarly, in Tables 3 and 4, Python buffer communication in the four Containers environment showed a higher performance improvement rate than C communication. Table 5 shows the communication time results of Python buffer communication with 32 processes and MCDRAM in Flat mode. In Table 5, the Host environment and one Container show almost the same communication time, and all collective communication times increase as the number of Containers increases. Table 5 thus confirms Docker's claim that there is little performance difference between the Host and a single Docker Container; however, multiple Containers communicate less efficiently than in the Host environment. The experimental results in Tables 6 and 7 are the communication times of Python buffer communication with 68 processes and MCDRAM in Flat mode. In Tables 6 and 7, Quadrant mode and SNC-4 mode are similar, or SNC-4 mode shows slower communication times. This result shows that the advantages of SNC-4 mode cannot be utilized: due to the MPI communication characteristics, they cannot be exploited when the communicating processes are not in the same domain. As a result, for MPI communication, Quadrant mode, which can maintain transparency, may be more efficient than SNC-4 mode.
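For clarity on how the table entries relate to the improvement percentages quoted in the text (our restatement of the correspondence between the two): the tables report the ratio T_DDR4 / T_MCDRAM expressed as a percentage, so the improvement quoted in the text is that ratio minus 100 percentage points. For example, the Broadcast ratio of 545% in Table 1 means MCDRAM communication is 5.45 times faster than DDR4, i.e., the 445% improvement stated above.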

Our performance evaluation can be summarized by the following three points:

• The combination of MCDRAM usage and Python buffer communication greatly improved the performance of Python's MPI collective communication. This result shows that MCDRAM was used efficiently in the Python environment by wrapping the Memkind library (see the sketch after this list). It was also confirmed that Python's buffer communication is more efficient than Python's object communication. In addition, among the Memory modes of the Knights Landing processor, using MCDRAM in Flat mode showed better performance than Cache mode, which uses MCDRAM as a cache (Tables 1–4).

• C communication is still fast in a distributed Docker environment. All experiments showed similar communication performance in the Host and one Docker Container environments. This confirms Docker's claim that the performance difference between the Host and a Docker Container is not significant, because a Container directly uses the Host's resources. However, multiple Containers communicate less efficiently than in the Host environment (Table 5).

• Quadrant mode showed more efficient experimental results than SNC-4 mode. SNC-4 mode has low latency only when communication stays within a NUMA node, so it is difficult to exploit SNC-4 mode when the communicating processes are not in the same domain. Due to this characteristic of MPI communication, the advantages of SNC-4 mode cannot be utilized. In Cluster mode, Quadrant mode, which ensures transparency, is an efficient method for MPI communication (Tables 6 and 7).
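To make the first point above more concrete, the following rough sketch (a hypothetical illustration, not the authors' Cython-based library; the memkind soname and the availability of HBW memory are assumptions about the runtime) allocates an MPI buffer from MCDRAM through the C Memkind API via ctypes and passes it to mpi4py's buffer interface.

# Hypothetical illustration: allocate an MPI send buffer from MCDRAM via the
# C memkind API (through ctypes) and use it with mpi4py buffer communication.
# This is NOT the authors' Cython-based wrapper; the soname and the presence
# of HBW memory are assumptions about the runtime environment.
import ctypes
import numpy as np
from mpi4py import MPI

libmemkind = ctypes.CDLL("libmemkind.so.0")          # assumed soname
libmemkind.memkind_malloc.restype = ctypes.c_void_p
libmemkind.memkind_malloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libmemkind.memkind_free.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
MEMKIND_HBW = ctypes.c_void_p.in_dll(libmemkind, "MEMKIND_HBW")

n = 1024 * 1024                                      # example element count
ptr = libmemkind.memkind_malloc(MEMKIND_HBW, n * 4)
if not ptr:
    raise MemoryError("no high-bandwidth (MCDRAM) memory available")

# Wrap the MCDRAM allocation as a NumPy view so mpi4py can use it as a buffer.
hbw_view = np.frombuffer((ctypes.c_int32 * n).from_address(ptr), dtype=np.int32)

comm = MPI.COMM_WORLD
hbw_view[:] = comm.Get_rank()
result = np.empty_like(hbw_view)                     # result buffer lands in ordinary DDR4 here
comm.Allreduce(hbw_view, result, op=MPI.SUM)

libmemkind.memkind_free(MEMKIND_HBW, ptr)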

7. Conclusions

In this paper, we proposed a method for efficiently exchanging data when performing distributed deep learning in Python and many-core CPU environments, and we compared and analyzed the proposed method through extensive experiments in various environments. For the performance evaluation, we compared performance according to KNL's Memory mode and Cluster mode, and also according to the Python MPI communication method. In addition, to consider a distributed environment, we experimented in a Docker environment. The combination of MCDRAM usage and Python buffer communication greatly improved the performance of Python's MPI collective communication. C communication is still fast in the distributed Docker environment, and similar communication performance was seen in the Host and one Docker Container environments. In addition, it was confirmed that multiple Docker Containers communicate relatively less efficiently than in the Host environment. In the Cluster mode experiment, Quadrant mode showed more efficient experimental results than SNC-4 mode. Due to the characteristics of MPI communication, the advantages of SNC-4 mode cannot be utilized; in Cluster mode, Quadrant mode, which ensures transparency, is an efficient method for MPI communication. In the future, we plan to apply our proposed MPI library to deep learning frameworks such as Google's TensorFlow and Facebook's PyTorch, and to further explore performance improvements.

Author Contributions: Conceptualization, J.W. and J.L.; methodology, J.W. and J.L.; software, J.W. and H.C.; writing—original draft preparation, J.W.; writing—review and editing, J.W., H.C. and J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by Next-Generation Information Computing Development Program (2015M3C4A7065646) and Basic Research Program (2020R1F1A1072696) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, GRRC program of Gyeong-gi province (No. GRRC-KAU-2020-B01, "Study on the Video and Space Convergence Platform for 360VR Services"), ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01423), the Korea Institute of Science and Technology Information (Grant No. K-20-L02-C08), and the National Supercomputing Center with supercomputing resources including technical support (KSC-2019-CRE-0101).

Conflicts of Interest: The authors declare no conflict of interest.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).