MPI Based Python Libraries for Data Science Applications

by John Scott Rodgers

A thesis submitted to the Department of Computer Science,

College of Natural Sciences and Mathematics

in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Chair of Committee: Dr. Edgar Gabriel

Committee Member: Dr. Shishir Shah

Committee Member: Dr. Martin Huarte Espinosa

University of Houston

May 2020

ACKNOWLEDGMENTS

The work in this thesis could not have been realized without the help and support of many people.

A very special thanks must be given to my thesis advisor, Dr. Edgar Gabriel, who provided constant guidance and insight, as well as provided the resources necessary to carry out this work.

Additional special thanks must be given to Angela Braun Esq., who was kind enough to act as my

first draft copy editor. I would also like to thank my thesis committee members Dr. Shishir Shah and Dr. Martin Huarte-Espinosa for serving on the committee and providing their domain specific guidance and support. Additionally, I would like to thank the University of Houston department of Computer Science teaching faculty for providing me with the experience and tools necessary to succeed in this effort. Lastly, I would like to thank the oracles of Computer Science, Google and

StackOverflow, for always being there in my time of need.

ABSTRACT

Tools commonly leveraged to tackle large-scale Data Science workflows have traditionally shied away from existing high performance computing paradigms, largely due to their lack of fault tolerance and computation resiliency. However, these concerns are typically only of critical importance to problems tackled by technology companies at the highest level. For the average Data Scientist, the benefits of resiliency may not be as important as the overall execution performance. To this end, the work of this thesis aimed to develop prototypes of tools favored by the Data Science community that function in a data-parallel environment, taking advantage of functionality commonly used in high performance computing. To achieve this goal, a prototype distributed clone of the

Python NumPy library and a select module from the SciPy library were developed, which leverage

MPI for inter-process communication and data transfers while abstracting away the complexity of

MPI programming from its users. Through various benchmarks, the overhead introduced by the logic necessary to function in a data-parallel environment, as well as the scalability of using parallel compute resources for routines commonly used by the emulated libraries, are analyzed.

For the distributed NumPy clone, it was found that for routines that could act solely on their local array contents, the impact of the introduced overhead was minimal; while for routines that required global scope of distributed elements, a considerable amount of overhead was introduced.

In terms of scalability, both the distributed NumPy clone and the select SciPy module, a distributed implementation of K-Means clustering, exhibited reasonably performant results; notably showing sensitivity to local process problem sizes and operations that required large amounts of collective communication/synchronization. As this work mainly focused on the initial exploration and prototyping of behavior, the results of the benchmarks can be used in future development efforts to target operations for refinement and optimization.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

1 INTRODUCTION
  1.1 Motivation of Work
  1.2 Goals of Thesis
  1.3 Existing Implementations
    1.3.1 DistArray
    1.3.2 D2O
    1.3.3 Dask
  1.4 Organization of Remainder

2 BACKGROUND
  2.1 Python
  2.2 MPI
    2.2.1 Communicators
    2.2.2 Point-to-Point Communication
    2.2.3 Collective Communication
  2.3 mpi4py
  2.4 NumPy
  2.5 SciPy
  2.6 K-Means Clustering

3 CONTRIBUTION
  3.1 MPInumpy
    3.1.1 MPIArray Attributes
    3.1.2 Data Distributions
    3.1.3 Creation Routines
    3.1.4 Reductions Routines
    3.1.5 Behavior & Operations
  3.2 MPIscipy
    3.2.1 Cluster

4 EVALUATION
  4.1 MPInumpy - Evaluation
    4.1.1 Single Process Performance
    4.1.2 Scalability
  4.2 MPIscipy K-Means Clustering - Evaluation
    4.2.1 Single Process Performance
    4.2.2 Scalability

5 CONCLUSIONS

BIBLIOGRAPHY

LIST OF TABLES

1 C like Supported NumPy Data Types
2 Current SciPy Modules
3 Global MPIArray Attributes
4 Useful Distributed MPIArray Attributes

LIST OF FIGURES

1 Bandwidth comparison between InfiniBand verbs and TCP over InfiniBand.
2 Bandwidth comparison between mpi4py and native Open MPI over a QDR InfiniBand and Gigabit Ethernet networks.
3 Example interactive Python 3 shell session.
4 Distributed memory architecture showing systems, with independent CPUs and memory, connected via a network interconnect.
5 Collection of MPI processes in default MPI communicator (black) and a sub-set of processes in a sub-communicator (red).
6 Broadcast of data from MPI process 1 to all other processes.
7 Scatter of data from MPI process 1 to all other processes.
8 Gather of data from all MPI processes to process 1.
9 All gather of data from all MPI processes to all processes.
10 All to all unique exchange of data from all MPI processes to all processes. Note: number shown on colored data elements represents MPI process ID of destination.
11 Reduction (Collective Sum) of data from all MPI processes to process 1.
12 Interactive mpi4py Python example using two tmux panes to demonstrate blocking Python object send from rank 0 (left) to rank 1 (right).
13 Interactive Python example of two dimensional array storage in memory.
14 Interactive Python example of using slicing notation to return the first and third rows of a 4x4 array of elements.
15 Example of K-Means clustering on simulated data, with three features, containing two clusters.
16 MPInumpy array UML object diagram.
17 Example block distribution of 1, 2, and 3 dimensional data among three MPI processes.
18 Example MPIArray creation routine for a 5x5 block partitioned array.
19 Example MPIArray reduction routine demonstrating how to normalize all columns of a 5x5 block partitioned array by the columns' respective arithmetic mean.
20 Example MPIArray accessing routines of a 5x5 block partitioned array demonstrating how to get the element in global position 0,0 and how to set the element in global position 4,4.
21 Interactive MPInumpy Python example using two tmux panes, one for MPI process/rank 0 (left) and one for MPI process/rank 1 (right), to demonstrate the accessing routine behavior shown in Figure 20.
22 Example MPIArray local and global row iteration of a 5x5 block partitioned array.
23 Example reshape operation of a block partitioned array of shape 7x3 to a new shape of 3x7.
24 Example usage of the MPIscipy K-Means clustering method on simulated data, with one feature, containing two clusters.
25 Example usage of the SciPy K-Means clustering method on simulated data, with one feature, containing two clusters.
26 NumPy vs. MPInumpy array creation execution performance (left) and overhead (right).
27 NumPy vs. MPInumpy arithmetic operation execution performance (left) and overhead (right).
28 NumPy vs. MPInumpy reduction operation execution performance (left) and overhead (right).
29 NumPy vs. MPInumpy local access operation execution performance (left) and overhead (right).
30 NumPy vs. MPInumpy global access operation execution performance (left) and overhead (right).
31 NumPy vs. MPInumpy reshape operation execution performance (left) and overhead (right).
32 MPInumpy array creation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
33 MPInumpy arithmetic operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
34 MPInumpy reduction operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
35 MPInumpy local access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
36 MPInumpy global access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
37 MPInumpy reshape operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
38 SciPy K-Means2 execution performance as a function of the number of features, observations, and cluster centroids.
39 MPIscipy K-Means execution performance as a function of the number of features, observations, and cluster centroids.
40 SciPy vs MPIscipy K-Means overhead as a function of the number of features, observations, and cluster centroids.
41 MPIscipy K-Means strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

1 Introduction

The field of Data Science has seen a recent explosive growth, mainly driven by the inundation of data that is produced and collected by computer systems and sensors around the world. Practitioners of the field of Data Science, known as Data Scientists, leverage analytical models and algorithms to extract insights from this collected data. In terms of the data they analyze, there is an inherent correlation between the sample size and the knowledge that can be extracted from it, as well as its statistical significance. As the amount of data grows beyond the processing capabilities of personal computers/workstations, the need for large scale parallel compute resources becomes a necessity.

Within the field of Data Science, these large scale problems fall under the category of Big Data

Analytics, which has resulted in the Big Data stack. The primary focus of this software stack has been the handling of the “4 V’s”: volume, variety, veracity, and velocity of data that is collected. Handling of the variety (variable formatting) and veracity (accuracy) of data is typically resolved in pre-processing stages prior to entering the processing pipeline. Once in the pipeline, large volumes of data are processed by frameworks such as Hadoop MapReduce [13] and Spark [92]. These processing frameworks typically offer resiliency in computation, meaning that a hardware failure of a system resource will not result in a failure to produce a result. Big

Data architectures, such as Lambda [24], are used to manage the velocity (speed) at which data is collected by orchestrating the processing pipeline.

In the traditional academic and scientific research domain, problems requiring large scale par- allel compute resources typically fall under the category of High Performance Computing (HPC), which leverages its own independent HPC software stack. In contrast to the Big Data software stack, the HPC software stack is primarily focused on raw performance with little concern for computation resiliency. This focus on performance is largely achievable due to the type of data that is processed: typically numeric in value, consistent in format, and of a known fixed size. The most prevalent HPC programming frameworks in use today include the Message Passing Interface

(MPI) [43], OpenMP [7], and CUDA [49]. The Big Data and HPC software stacks have largely

evolved independently of each other, as they focus on their domain specific problems [71].

1.1 Motivation of Work

Python [86, 85] is currently the programming language of choice within the Data Science community [25]. This is most likely due to the language’s ease of use and its rich ecosystem of open source scientific libraries. Additionally, non-profit organizations for the advancement of scientific computing, such as NUMFOCUS [52], have supported the development efforts of Python-based scientific libraries, including NumPy [83], Matplotlib [26], Pandas [59], and SciPy [88]; demonstrating an expected positive impact on the scientific community through the advancement of the Python ecosystem.

Within the Python ecosystem there currently exists a number of parallel Data Science libraries, such as PySpark [1] and Dask [11]. In an effort to ensure computation resiliency, these libraries have shied away from communication platforms that are not fault tolerant, instead relying on Internet

Protocol (IP) based protocols for resource communication and bulk data transfer. While resiliency is desirable for incredibly large scale problems, such as indexing the internet in the case of Google, for the average Data Scientist the benefits of resiliency may not be as important as the overall execution performance. Additionally, compute centers housing the parallel resources (clusters) often invest in expensive low-latency high-bandwidth network interconnects, such as Mellanox’s

InfiniBand (IB) [22] interconnect. To fully utilize these high-end networks, libraries must make use of the hardware-provided APIs. While these interconnects do support data communication via standard IP protocols (IP over IB), a blocking data exchange benchmark over a QDR InfiniBand network shown in Figure 1 demonstrates that a considerable amount of performance is not realized without making use of the full capabilities of the networking hardware.

Figure 1: Bandwidth comparison between InfiniBand verbs and TCP over InfiniBand.

One widely used parallel communication framework in the HPC software stack is MPI. The

Message Passing Interface (MPI) has traditionally been ignored by Big Data software developers due to its lack of fault tolerance. However, because MPI was developed within the HPC software stack, it makes full use of high-end network interconnects as well as fully leverages state-of-the-art compute systems. These qualities make MPI an excellent choice as the communication platform for this work.

Currently MPI does not include language bindings for Python. However, there are several libraries that provide Python MPI interfaces, the most popular of which is mpi4py [10]. The mpi4py library was built on top of the MPI-1/2/3 specifications and provides an interface similar to the

MPI-2 C++ bindings [9]. Additionally, the library makes full use of high-end network interconnects and, when leveraging contiguous memory buffers provided by the Python NumPy library, can obtain communication performance comparable to applications written in C. This performance is demonstrated in a blocking data exchange benchmark over a QDR Infiniband interconnect and a

Gigabit Ethernet interconnect shown in Figure 2 below.

Figure 2: Bandwidth comparison between mpi4py and native Open MPI over a QDR InfiniBand and Gigabit Ethernet networks.

Given all of the performance benefits of using MPI as the communication platform, why do we not find Python Data Science libraries that utilize it? As previously discussed, the most immediate reason is its lack of fault tolerance, a requirement for computation resiliency. An additional factor may be that writing MPI code is difficult, typically requiring a complete refactoring of an application to resolve data decomposition and communication patterns that are not typically found in the

Big Data software stack. These factors should not immediately disqualify the use of MPI; rather, this should be seen as a motivation for members of the HPC community to find appropriate applications/use cases to improve the computational performance of workflows for Data Scientists.

1.2 Goals of Thesis

The primary goal of this thesis is to contribute to the development of the Python MPI Data

Science (MPIDS) [18] library, which is currently under development by members of the Parallel

Software Technologies Laboratory [19] at the University of Houston. In accord with the motivation for this work, the library aims to leverage MPI for inter-process communication and data transfers while abstracting away the complexity of MPI programming from its users.

To achieve this goal, a prototype distributed clone of the Python NumPy library was created.

The created library, MPInumpy, emulates the usage and behavior of a NumPy array while focusing on the creation and management of a distributed array. MPInumpy reuses functionality found in NumPy when possible and re-implements functionality as required to function in a data-parallel environment. Additionally, a distributed version of a select SciPy module was developed to create portions of the MPIscipy library, which employed distributed arrays created by the MPInumpy library.

Code developed in this thesis was compared against the base libraries it intends to emulate.

This comparison involved profiling the developed code’s execution performance and scalability under typical use cases.

1.3 Existing Implementations

There are several Python libraries that have created distributed NumPy-like arrays. A subset of these implementations, which served as key inspirations in the development of the

MPInumpy library, is discussed in the following sections.

1.3.1 DistArray

The DistArray [29] library was developed by Enthought Inc. [15] in partnership with members of the (Py)Trilinos [82] project and the IPython project [31]. The goal of the project was to develop a multidimensional NumPy-like distributed array for Python that combined the interactivity of the IPython platform with the performance of the MPI framework.

To realize this goal, an extensive quantity of native NumPy functionality was re-implemented in a distributed framework. DistArray arrays support most unary and binary operations, indexing and slicing, and reductions commonly found in a NumPy ndarray object.

Within this library, the distribution of data is managed by an internal distribution object, which helps local arrays translate between local and global index spaces. It supports block, cyclic, and block-cyclic data distributions similar to the ones found in High Performance Fortran

(HPF) [37], as well as unstructured and non-distributed distributions. Mapping, or partitioning, of data to a given process is achieved by placing the processes on a cartesian grid, which is compatible with the MPI-2 virtual topologies [39] generated by MPI's MPI_Cart_create.

After the data is partitioned, knowledge of where a given data element resides among the pool of processes is resolved with the help of the distributed array protocol (DAP) [8]. This protocol was modeled after the existing Python PEP-3118 buffer protocol [56]. The DAP provides metadata containing the version of the DAP, the process local section of the distributed array(buffer), and a dimensional dictionary that describes how the data is distributed. Usage of this library requires that all producing and consuming operations have the ability to process the DAP.

Programs developed using the DistArray library require a master worker pattern in which a client process is designated as the master to delegate work to the worker processes, which are referred to as “engines”. When using IPython, communication between the client and engines is handled using the asynchronous messaging library ZeroMQ [93]. An alternative way to execute a

DistArray program is to use MPI-only mode. However, when using MPI-only mode, an additional

MPI process is still required to act as the client.

This final design choice, requiring an additional process to be created as the client (primarily for interactive usage), makes it unsuitable for our intended application. This is because the extra process will ultimately lower an application's parallel efficiency and introduce unnecessary communication/coordination overhead between the client and workers.

1.3.2 D2O

D2O [78] is another distributed NumPy-like array project developed by researchers at the Max

Planck Institute for Astrophysics. The goal of the project was to create a Python distributed array for use in the Numerical Information Field Theory (NIFTY) [77] library, used in information field theory to build signal inference algorithms [14].

Similar to the design of DistArray, this goal was realized through re-implementing native NumPy functionality in a distributed framework. However, given that this project was developed for a

specific use, some desirable functionality was never implemented. Most notably, this library lacks support for reshaping (or redistribution) of the array after creation, along with not supporting

NumPy-like object slicing.

In D2O, the distribution of data is managed by a “distribution factory”, which partitions data among processes based on the specified distribution. Supported distributions include: non-distributed, equal (blocked based on the first axis), fftw (optimized for FFT), and freeform (manual specification of distribution at creation).

Unlike the design of DistArray, a master-worker design pattern was not enforced on the library.

As a result, programs developed using D2O can execute seamlessly on a cluster environment using

MPI as its communication platform.

Comparing D2O to DistArray, this library was much closer to our intended application in terms of design. However, D2O was deemed unsuitable for our needs due to its lack of desirable functionality and its application-specific distribution patterns.

1.3.3 Dask

The goal of the Dask project is to leverage blocked algorithms and dynamic memory aware task scheduling to create a parallel out-of-core framework that enables existing data analytics libraries in the Python ecosystem to be scaled to multi-core and distributed systems [12, 72]. To realize this goal, the project has been developed in coordination with popular Python projects used in the field of data analytics such as NumPy, Pandas, and Scikit-Learn [5, 60]. By working with the larger

Python community, this project has been able to create distributed versions of these libraries with familiar APIs and usage, borrowing as much as it can from the base library it aims to enhance.

At its core, Dask represents all computations as a directed acyclic graph of tasks with data dependencies [72]. It uses a specialized dynamic scheduler to optimize and resolve the work necessary to complete a computation. Additionally, most tasks/computations use the lazy evaluation execution strategy, delaying the execution of a given task until its result is required by another computation or triggered by a compute() method call.

Focusing our attention on the project's distributed NumPy implementation, the Dask Array, the distribution and partitioning of data can either be handled automatically by the library or as specified by the user. Partitions are created in 'chunks', or blocks, of either a fixed amount of data (128 MiB by default) if done automatically, or of user-specified amounts of data items specified by dimension.
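As a brief illustration of this chunked model, the sketch below uses the public dask.array API (an illustrative example, not code from this thesis) to build a blocked array and lazily reduce it:

import dask.array as da

# Build a 10000x10000 array of ones, partitioned into 1000x1000 chunks.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# No computation happens yet; this only extends the task graph.
total = (x + 1).sum()

# Execution is deferred until explicitly requested.
print(total.compute())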

Programs using the Dask framework require a master-worker pattern, similar to DistArray, as well as a dedicated process to act as the task scheduler. When utilized in a distributed environment, communication and bulk data transfers are handled via TCP sockets. Additionally, while an MPI deployment mode is available for use on HPC clusters, MPI is only used to start the Dask cluster and not for inter-process communication.

The requirement of two additional processes, one for the client and one for the task scheduler, and the lack of support for the high-speed network interconnects that MPI can utilize, make this framework sub-optimal for our intended application.

1.4 Organization of Remainder

The organization of the remainder of this thesis is as follows: Chapter 2 will provide background on the libraries used in this work as well as their application domain, Chapter 3 will cover the contributions of this thesis’s work in terms of library development, Chapter 4 will provide an evaluation of work developed in Chapter 3, and finally Chapter 5 will cover final conclusions and outline the next steps for future work.

2 Background

This chapter will provide a brief background on the libraries, frameworks, and use cases used in this work.

2.1 Python

Python [86, 85] is an interpreted, object-oriented, general-purpose programming language that was created in the late 1980s [87]. The language’s initial development was heavily influenced by the

ABC programming language [21]. Python continues to be actively supported by the non-profit

Python Software Foundation [68] with developers all around the world.

Since its initial release, the language has had two major updates. The first, Python 2 [35], was released in 2000 and introduced the language's Python Enhancement Proposal (PEP) system for proposing new features within the development community. Notable added features in this release included: support for Unicode, list comprehensions, augmented assignment operators, and a cycle-detecting garbage collector. The second major update, released in 2008, Python 3 [84], was the language's first intentionally backwards incompatible release. Notable changes in this release included: the print function, syntax, comparison ordering, and the return behavior of builtin objects. As of the start of 2020, Python 2 has reached its end of life and users can only expect future support and development of Python 3.

Python offers an extensive standard library [69] as well as a repository of community developed open source packages called the Python Package Index (PyPI) [67], containing well over 200,000 projects. Packages from this repository can be installed via any platform's terminal using the

Python pip command (e.g., pip install <package name>).

Python is a fully object-oriented language, meaning that everything in Python is an object.

This includes typical elements of any language such as strings, lists, functions/methods, and even modules. While Python objects may not all have attributes, methods, or be subclassable, every object can be assigned to a variable or be passed to a function [62].

9 Code developed in Python is executed by an interpreter. CPython [6] is the core/reference interpreter for the language, and first compiles Python code into bytecode prior to interpreting it. Other notable Python interpreters include PyPy [65] (which features a just-in-time compiler for performance), Jython [32] (which compiles Python code to Java bytecode for use in JVM’s),

IronPython [30] (an implementation for the .NET framework), and IPython [61] (which focuses on extended interactivity beyond the standard interpreter). These interpreters are able to run Python code either directly as a script in a terminal (e.g., python <script name>.py), or interactively in a shell session by invoking the interpreter. An example of defining a function in an interactive shell session using the Python 3 interpreter is shown in Figure 3 below.

Figure 3: Example interactive Python 3 shell session.

As a consequence of Python's flexibility, the language is leveraged in a wide number of application domains. Python acts as a scripting language in web applications and major software products. Moreover, it is an effective platform in the scientific computing domain. Additionally, the language is not only a standard component of most Linux distributions, but is also sometimes used to create the installers [90].

2.2 MPI

The Message Passing Interface (MPI) [43] is a communication specification that was designed to be used on distributed memory architectures (clusters of computers), such as the one shown in Figure 4.

The development of MPI started as a workshop in April 1992. This workshop aimed to assess the need for a message-passing specification on distributed memory systems, and was attended by

members of universities, government laboratories, as well as hardware and software vendors. This exploratory workshop created a working group tasked with creating a practical specification [89].

In 1993, a draft of the MPI specification was presented to the high performance community at The

International Conference for High Performance Computing, Networking, Storage, and Analysis for input and feedback [80]. Since then, the MPI Forum [44] has led the efforts of updating and maintaining the specification.

Figure 4: Distributed memory architecture showing systems, with independent CPU’s and memory, connected via a network interconnect.

Version 1.0 of the MPI specification [42] was released in 1994 and outlined the nomenclature and expected features of the specification. These features included: bindings for C and FORTRAN, point-to-point communication, collective communications, process topologies, and environmental management. Version 2.0 of the specification [40] was later released in 1997 and was the first major revision. Along with this revision came new functionality such as: dynamic processes, one-sided communication, extended collective operations, C++ language bindings, and parallel I/O [2]. In

2012, Version 3.0 of the specification [43] was released, adding a considerable number of extensions to previous functionality [2]. The next major version of the specification, Version 4.0 [41], is in active development and hopes to add: extensions for hybrid programming models, support for fault tolerance, persistent collectives, and remote memory access. While the MPI specification mandates an expected API, it does not outline a specific means of implementation. As a result, there are

several implementations of MPI, such as MPICH [46], Open MPI [81], and Intel MPI [27].

The following sub-sections will cover some of the core elements of the MPI specification.

Note: This is a brief overview of core elements and by no means covers the vast capabilities outlined in the MPI specification.

2.2.1 Communicators

MPI communicators are objects used to define collections of processes/resources. The default communicator, MPI_COMM_WORLD, contains all processes in a program's execution. Within a communicator, individual processes are resolved using their assigned “rank”, which acts as a unique process ID with ordered integer values starting from zero. These communicators are used by the various message-passing routines to determine the start and end point of information exchanges.

The MPI specification also provides mechanisms to create sub-communicators, or subsets of processes that are grouped together with unique rank’s within that sub-communicator. These additional communicators can be dynamically created or destroyed during a program’s execution.

Figure 5 below provides a visual representation of MPI processes as viewed by the default global communicator and by a sub-communicator.

Figure 5: Collection of MPI processes in default MPI communicator (black) and a sub-set of processes in a sub-communicator (red).
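To make these concepts concrete, the following minimal mpi4py sketch (illustrative only, not taken from the thesis) queries the default communicator and splits it into two sub-communicators:

from mpi4py import MPI

comm = MPI.COMM_WORLD    # default communicator containing every process
rank = comm.Get_rank()   # unique process ID within the communicator
size = comm.Get_size()   # total number of processes

# Group even- and odd-ranked processes into two sub-communicators;
# each process is assigned a new rank within its sub-communicator.
sub_comm = comm.Split(color=rank % 2, key=rank)
print("global rank %d of %d -> sub-communicator rank %d" % (rank, size, sub_comm.Get_rank()))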

2.2.2 Point-to-Point Communication

The MPI specification outlines mechanisms for point-to-point communications between processes within a communicator. This can be thought of as a direct process to process exchange of data.

To perform this exchange, both the sending and receiving process require the following knowledge: location of the data (buffer) used for the exchange, the amount of data (count of elements), the type of the data (to resolve displacements), and the respective rank of the other process involved in operation.

There are several different types of point-to-point communication routines. They can be “blocking”, meaning both the sending and receiving process do not return from the exchange routine until the data buffer is safe to re-use, effectively halting the execution of the process. They can also be

“non-blocking”, meaning the sending and receiving process return from the exchange routine and continue onto the next instruction. This latter method is particularly useful when computations can be overlapped with message exchanges, allowing for a higher level of parallelism. Additional types of routines include synchronous sends, buffered sends, and combined send/receive operations.
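A minimal mpi4py sketch of a blocking buffer exchange and a non-blocking object exchange between two processes might look like the following (illustrative only; the library itself is covered in section 2.3):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(4, dtype=np.float64)
    comm.Send(data, dest=1, tag=0)                    # blocking buffer send
    req = comm.isend({"done": True}, dest=1, tag=1)   # non-blocking object send
    req.wait()                                        # safe to reuse the object after this
elif rank == 1:
    buf = np.empty(4, dtype=np.float64)
    comm.Recv(buf, source=0, tag=0)                   # blocking buffer receive
    msg = comm.recv(source=0, tag=1)                  # matching object receive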

2.2.3 Collective Communication

All collective communication routines act on all processes within a given communicator. These routines can be grouped into three different categories: synchronization routines, data movement routines, and computation routines.

Synchronization routines are used to control the flow of execution of an MPI program.

MPI_Barrier is one such routine, which will halt the execution of processes until all processes in a communicator reach the barrier.

Data movement routines are used to distribute or collect data among processes within a communicator. Examples of two distribution operations are shown in Figures 6 & 7. Figure 6 shows the result of a broadcast (MPI_Bcast) routine, which results in a copy of the data found on the broadcasting process being sent to all others. Figure 7 shows the result of a scatter (MPI_Scatter) routine, which results in the data found on the scattering process being distributed among the

available processes.

Figure 6: Broadcast of data from MPI process 1 to all other processes.

Figure 7: Scatter of data from MPI process 1 to all other processes.

Examples of two data collection operations are shown in Figures 8 & 9. Figure 8 shows the result of a gather (MPI_Gather) routine, which results in unique data from all processes being collected by a given process. Figure 9 shows a more complicated version of the gather routine, known as an all-gather (MPI_Allgather), which results in all processes collectively gathering the process unique data elements.

Figure 8: Gather of data from all MPI processes to process 1.

Figure 9: All gather of data from all MPI processes to all processes.

These data movement routines also come in combined forms, such as the operation shown in

Figure 10. This operation is known as an all-to-all (MPI_Alltoall) exchange of data. An all-to-all exchange is a collective scatter/gather routine where each process sends a unique piece of data to each process. This routine is particularly useful when needing to transpose elements of a matrix.

Figure 10: All to all unique exchange of data from all MPI processes to all processes. Note: number shown on colored data elements represents MPI process ID of destination.

Collective computation routines are used to determine a result from data found among processes within a communicator. These operations can be seen as reductions of the data elements using a specified operator. Figure 11 shows the result of performing an all-reduce (MPI_Allreduce) with summation specified as the operation. In addition to predefined reduction operators (MPI_MAX,

MPI_MIN, MPI_SUM, MPI_PROD, etc.), MPI supports user defined reduction functions.

Figure 11: Reduction (Collective Sum) of data from all MPI processes to process 1.
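These collective routines map directly onto mpi4py calls; the short sketch below (an illustrative example assuming NumPy buffers, not code from this thesis) exercises a broadcast, a scatter/gather pair, and an all-reduce:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Broadcast: every process receives a copy of the root's data.
settings = comm.bcast({"iterations": 10} if rank == 0 else None, root=0)

# Scatter/Gather: the root distributes one row per process, then collects them back.
rows = np.arange(size * 4, dtype=np.float64).reshape(size, 4) if rank == 0 else None
local_row = np.empty(4, dtype=np.float64)
comm.Scatter(rows, local_row, root=0)
gathered = np.empty((size, 4), dtype=np.float64) if rank == 0 else None
comm.Gather(local_row, gathered, root=0)

# Reduction: sum each process's local value; every process receives the result.
local_sum = np.array([local_row.sum()])
global_sum = np.empty(1)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)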

2.3 mpi4py

As mentioned in section 1.1, the MPI standard does not include language bindings for Python.

However, there are several existing Python libraries that provide MPI interfaces, such as pypar [64],

mympi [33], mpipy [63], and mpi4py [10]. The mpi4py library is currently the most popular among this group and provides MPI interfaces that resemble the MPI-2 C++ bindings. This is achieved by leveraging Cython [3] to create appropriate C extensions for use by the Python interpreter.

Additionally, with the help of mpi4py, existing MPI codes developed in C and FORTRAN can be executed in Python by taking advantage of the SWIG [79] and F2PY [16] code wrappers.

The mpi4py library supports nearly all of the MPI-1/2/3 standard routines. This includes the core elements discussed in the previous section, as well as dynamic process management, one-sided communications, parallel I/O, and environment management.

In terms of data communication, mpi4py supports the exchange of Python objects as well as contiguous memory buffers, such as the ones provided by the NumPy [83] package. However, when passing Python objects between MPI processes, a considerable amount of overhead is introduced due to the need to “pickle” (serialize) the objects prior to sending and un-pickling after receiving.

Complementing versions of message-passing routines are defined with varying case based on the object that will be sent (e.g., send(python_object, ...) vs. Send(numpy_object, ...)).
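The case convention is illustrated in the hedged sketch below (not the thesis's code): lowercase methods pickle generic Python objects, while the capitalized variants transfer contiguous NumPy buffers directly.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"labels": ["a", "b"]}, dest=1)     # generic Python object, pickled
    comm.Send(np.arange(10, dtype='i'), dest=1)   # contiguous NumPy buffer, no pickling
elif rank == 1:
    obj = comm.recv(source=0)                     # un-pickled on receipt
    arr = np.empty(10, dtype='i')
    comm.Recv(arr, source=0)                      # received directly into the buffer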

Code developed using mpi4py can be executed as a standalone Python script by pre-pending the Python interpreter call with a standard MPI executor (e.g., mpiexec -n <number of processes> python3 MPI_python_code.py). Additionally, mpi4py code can be developed/executed interactively either by using IPython's ipcluster tool [28] or individual xterm/tmux sessions (one per MPI process).

An example of using an interactive tmux session to perform a blocking send of a Python object, between two MPI processes, is shown in Figure 12 below.

Figure 12: Interactive mpi4py Python example using two tmux panes to demonstrate blocking Python object send from rank 0 (left) to rank 1 (right).

2.4 NumPy

NumPy [83] is one of the core packages of the SciPy [88] stack for scientific computing in Python.

The package was the result of efforts to unify SciPy's numeric array developers around a single package. This was achieved by porting functionality from Numarray [50] to Numeric [51], which later became NumPy in 2005 [57]. The package continues to be actively developed and supported by NUMFOCUS, the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, and

Tidelift.

The NumPy package provides a multidimensional array for efficient numerical computations.

These N-dimensional array (ndarray) objects are uniform collections of elements that are stored contiguously in memory. The computational efficiency of this package is realized by making use of routines written in lower-level languages, such as C and Fortran, which are commonly used in high-performance computations on numerical data. Additionally, NumPy makes use of the basic linear algebra subprograms (BLAS) [4] and the linear algebra package (LAPACK) [36] for efficient linear algebra computations. NumPy also provides a C-API [53] which allows developers to further extend and enhance the library.

ndarray objects are created via NumPy’s array creation routines. They take an array like object, such as a list of numerical data or nested lists of data, and return an ndarray object. Once created, the object describes the data in memory using the following attributes: a pointer to the

18 address of the first byte of the array, the data type of the elements of the array, the N-dimensional shape of the array, the strides (number of bytes) by dimension to traverse the array, and flags that provide auxiliary information on how the data is stored [83].

NumPy arrays support a wide range of predefined data types, as well as structured data types, which are a composition of data types organized as named fields [55]. A sample of supported data types that are similar to those in C, sourced from the developers' documentation [54], is shown in

Table 1.

Table 1: C like Supported NumPy Data Types

NumPy type      C type          Description
np.int8         int8_t          Byte (-128 to 127)
np.int16        int16_t         Integer (-32768 to 32767)
np.int32        int32_t         Integer (-2147483648 to 2147483647)
np.int64        int64_t         Integer (-9223372036854775808 to 9223372036854775807)
np.uint8        uint8_t         Unsigned integer (0 to 255)
np.uint16       uint16_t        Unsigned integer (0 to 65535)
np.uint32       uint32_t        Unsigned integer (0 to 4294967295)
np.uint64       uint64_t        Unsigned integer (0 to 18446744073709551615)
np.intp         intptr_t        Integer used for indexing, typically the same as ssize_t
np.uintp        uintptr_t       Integer large enough to hold a pointer
np.float32      float
np.float64      double          Note that this matches the precision of the builtin python float.
np.complex64    float complex   Complex number, represented by two 32-bit floats (real and imaginary components)
np.complex128   double complex  Note that this matches the precision of the builtin python complex.

As previously mentioned, ndarrays are stored contiguously in memory. Storage of the data elements follows a strided indexing scheme, meaning that for the N-dimensional index (n_0, n_1, ..., n_{N-1}) the corresponding offset (in bytes) from the start of the memory block can be resolved by the following [47]:

n_{offset} = \sum_{k=0}^{N-1} s_k n_k    (1)

where s_k in Equation 1 is the stride along axis k, and n_k is the position of the element along that axis. Using an interactive Python shell to inspect the inner workings of these arrays, as shown in Figure 13 below, provides a visual representation of this mapping.

Figure 13: Interactive Python example of two dimensional array storage in memory.

In Figure 13 above, we first use the NumPy array creation routine arange to generate an array containing 16 32-bit integers with values from 0 to 15. Note: by default, these arrays are created in C row-major ordering. At the time of creation, the array is reshaped to a 4x4 matrix using the reshape method. The 2-dimensional representation of this array can be seen after calling the created array arr. By calling arr.base one can see how the 2-dimensional array is being stored contiguously in memory, with each integer stored sequentially. Inspecting the strides, or the number of bytes necessary to traverse in either dimension to get to the next element, by calling arr.strides, one can see that along the zeroth axis (along rows) 16 bytes need to be traversed (4 [elements] * 32 [bits/element] / 8 [bits/byte] = 16 [bytes]) to go from one row to another, while along the first axis (along columns) 4 bytes need to be traversed (1 [element] * 32 [bits/element] / 8 [bits/byte] = 4 [bytes]) to move between the elements of a row.
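The stride arithmetic of Equation 1 can be verified directly from an ndarray's strides attribute; the snippet below mirrors the 4x4 example above (an illustrative check, not part of the referenced figure):

import numpy as np

arr = np.arange(16, dtype=np.int32).reshape(4, 4)
index = (2, 3)   # third row, fourth column

# Offset in bytes from the start of the memory block (Equation 1).
offset = sum(s * n for s, n in zip(arr.strides, index))
print(arr.strides, offset)                           # (16, 4) and 2*16 + 3*4 = 44 bytes
print(arr[index], arr.base[offset // arr.itemsize])  # both print 11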

NumPy arrays use zero-based semantics for accessing/indexing of elements. For a 1-dimensional vector of N elements the first element is at index 0, while the last element is at index N-1. Honoring the standard method for Python object indexing, elements are indexed by using square brackets on the array object (e.g., for a 1-Dimensional array, array_1D[0] returns the first element of the array). For multi-dimensional arrays, access of single elements is done by specifying the individual axis coordinates separated by commas (e.g., for a 2-Dimensional array of shape 4x4, array_2D[3, 3] returns the element in the fourth row and column). NumPy arrays also feature slicing and strided accessing using the notation array[start:stop:step], where start is the inclusive starting position along an axis, stop is the exclusive stopping position, and step is the stride taken from the starting position. An example of slicing the first and third row of a 4x4 NumPy array is shown in

Figure 14 below.

Figure 14: Interactive Python example of using slicing notation to return the first and third rows of a 4x4 array of elements.
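The slicing behavior illustrated in Figure 14 can be reproduced with a single expression; the snippet below is an illustrative sketch rather than the figure's exact session:

import numpy as np

arr = np.arange(16).reshape(4, 4)

# start=0, stop omitted (end of axis), step=2: returns rows 0 and 2 (the first and third rows).
print(arr[0::2])

# Single elements use per-axis coordinates; this returns the element in the fourth row and column.
print(arr[3, 3])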

2.5 SciPy

The SciPy [74] package, not to be confused with the SciPy [88] stack (collection of libraries), is one of the core packages of the SciPy software stack for scientific computing. The package was the result of merging scientific computation modules built on top of the Numeric package (predecessor to NumPy) into one single package [75]. Like NumPy, the project continues to be actively developed by an open source community and is supported by NUMFOCUS.

The SciPy library is built on top of the NumPy package to leverage NumPy's efficiency in numerical computations. It contains a wide variety of scientific modules that are commonly used in

Data Science and engineering workflows. Table 2 below provides an overview of the SciPy modules, sourced from the package's current reference API.

Table 2: Current SciPy Modules

SciPy Module   Description
cluster        Clustering algorithms: vector quantization, k-means, hierarchical
constants      Physical and mathematical constants and units
fft            Discrete Fourier transforms: FFTs and DST/DCT
fftpack        Legacy version of the fft module
integrate      Integration and ODE solvers
interpolate    Single- and multi-dimensional interpolation routines
io             Data input and output modules for a variety of file formats
linalg         Linear algebra routines
misc           Miscellaneous support routines
ndimage        Routines for multi-dimensional image processing
odr            Orthogonal distance regression routines
optimize       Optimization and root finding routines
signal         Signal processing routines
sparse         Two-dimensional sparse matrix routines for numeric data
spatial        Spatial algorithms and data structures
special        Special functions: elliptic functions and integrals, Bessel, raw statistical, information theory, gamma, etc.
stats          Statistical functions and a large number of probability distributions

One of the goals of this thesis is the creation of distributed versions of select SciPy modules.

The next section will focus on the tool provided by the SciPy cluster module, namely K-Means clustering.

2.6 K-Means Clustering

K-Means clustering is an algorithm commonly leveraged by Data Scientists in classification workflows. Fundamentally, the practice seeks to partition a set of observations into k clusters (or groups) that are most similar to each other based on observable features. The usage of this clustering technique is not limited to just Data Science applications; it is commonly used in image compression, digital signal processing, natural language processing, and many other domains.

At its core, the K-Means algorithm attempts to minimize intra-cluster variance, which can be represented by the squared error function [73, 34]:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2    (2)

where J in Equation 2 is the objective function we attempt to minimize, k is the number of clusters, n is the number of observations, x_i^{(j)} is an individual observation (feature vector) assigned to cluster j, and c_j is a given centroid (cluster center). Note: the term \| x_i^{(j)} - c_j \| in the above represents the Euclidean distance between a given observation and a given cluster centroid.
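Given a set of observations, centroids, and cluster labels, the objective in Equation 2 can be evaluated in a few lines of NumPy; the snippet below is a worked illustration only, not part of the MPIscipy implementation:

import numpy as np

def kmeans_objective(X, centroids, labels):
    # Sum of squared Euclidean distances between observations and their assigned centroids.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
centroids = np.array([[0.1, 0.05], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_objective(X, centroids, labels))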

There are four key stages in a typical K-Means clustering algorithm:

1. Initialize k cluster centroids.

2. Assign each of the observations to the nearest centroid based on their euclidean distance.

3. Recompute centroids from assigned observations.

4. Repeat steps 2 & 3 until centroids converge.

Pseudocode for a K-Means clustering algorithm implementing the above stages is shown in

Algorithm 1 below. Lines 1-3 correspond to stage 1, lines 6-14 correspond to stage 2, lines 15-

17 correspond to stage 3, and the termination condition, stage 4, corresponds to line 18. This algorithm is very sensitive to the initial centroid assignments and does not guarantee a globally optimal solution [23]. To address this, some implementations will randomly select centroids at the start of each iteration of stages 2 & 3. However, by doing so, additional terminating conditions, such as a maximum number of iterations, must be introduced to prevent the implementation from executing indefinitely.

Algorithm 1: K-Means Clustering

Data:   X = {x1, x2, ..., xn}   // set of observations
        k                       // number of clusters
Result: C = {c1, c2, ..., ck}   // set of cluster centroids
        L = {l1, l2, ..., ln}   // set of labels for X
 1  foreach ci ∈ C do
 2      ci ← xj ∈ X                              // select k initial centroids from X
 3  end
 4  repeat
 5      foreach xj ∈ X do
 6          minDist ← ∞
 7          foreach ci ∈ C do
 8              dist ← computeDistance(xj, ci)   // between centroid and observation
 9              if dist < minDist then
10                  lj ← i                       // assign cluster to observation
11                  minDist ← dist
12              end
13          end
14      end
15      foreach ci ∈ C do
16          updateCentroids(ci)                  // compute new center from closest observations
17      end
18  until ∆C = 0

An example of executing an implementation of the K-Means algorithm provided by the SciPy cluster module on simulated data with three features per observation is shown in Figure 15 below.

Figure 15: Example of K-Means clustering on simulated data, with three features, containing two clusters.
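Since the figure itself is not reproduced here, the sketch below shows a comparable use of SciPy's K-Means implementation on simulated data with three features and two clusters; the data generation and parameter choices are illustrative assumptions, not the exact session from Figure 15:

import numpy as np
from scipy.cluster.vq import kmeans2

# Simulate two well-separated clusters of observations, each with three features.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(100, 3))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(100, 3))
observations = np.vstack([cluster_a, cluster_b])

# kmeans2 returns the cluster centroids and a label for every observation.
centroids, labels = kmeans2(observations, 2, minit='points')
print(centroids)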

3 Contribution

The main contributions of this work involve the initial creation of the MPInumpy and MPIscipy libraries for use in the MPIDS [18] project. This chapter will cover the creation, design, and usage of those developed libraries.

All work presented in this chapter was written in Python 3 [85] and developed with a full suite of unit tests using the Python unittest library [70] to enable ease of future development. Additionally, all MPI based inter-process communications were handled using the mpi4py [10] library discussed in section 2.3. Execution of code developed using the libraries outlined in this chapter can be done in the same manner as code developed using mpi4py (e.g., mpiexec -n <number of processes> python3 MPIDS_python_code.py).

3.1 MPInumpy

Applying the lessons learned from section 1.3, the development of the MPInumpy library attempts to re-use as much functionality as possible from the base library it aims to emulate, NumPy [83], in a data-parallel environment. To achieve this goal, MPInumpy introduces the MPIArray object, which is subclassed from the NumPy ndarray object. Subclassing the ndarray object [48] enables the inheritance of the base class's pre-defined properties and operations. This allows development efforts to focus on the functionality that would need to be re-implemented in a distributed framework.

Figure 16 below presents a high-level Unified Modeling Language (UML) object diagram showing the main components of the MPInumpy library, as well as how they interact with each other.

Fundamentally, an array generated using this library creates an MPIArray object, which acts as a subclass of the NumPy ndarray object and a pseudo-abstract base class for the data distribution objects. Data distribution objects act as concrete implementations of the MPIArray, which resolve how native NumPy behavior is re-implemented in their respective data partitioning schemes. This design choice permits new data distributions to be integrated as desired, requiring only that the

expected MPIArray methods and properties are implemented for that partitioning scheme. Actual creation (processing of input and data partitioning among processes) of the array objects is managed by the various array creation routines. These main components of the MPInumpy library, as well as their supported operations and behavior, are discussed in greater detail in the remaining sections of this chapter.

Figure 16: MPInumpy array UML object diagram.

3.1.1 MPIArray Attributes

As previously discussed, the MPIArray object will natively inherit all of the key properties and components of its ndarray base class. These include attributes such as the data type, shape, size, number of bytes, and dimensions of the array object. However, since these array objects will be distributed among many processes, these inherited attributes will reflect only the properties of the array elements that are found locally on a given process. As a result, complementing versions of select attributes are introduced in the MPIArray object to reflect the global resolution of these properties.

These global properties are presented to the user as a name-shifted version of the base property they intend to resolve in a distributed environment (e.g., the global version of the shape property is globalshape). The means of accessing these introduced properties is implemented in a consistent manner, where a given attribute can be returned via MPIArray.<attribute name>. The actual computation of the global properties is handled by the data distribution objects implementing the MPIArray pseudo-abstract base class. Table 3 below provides a description of the expected global attributes to be resolved by concrete implementations of the MPIArray.

Table 3: Global MPIArray Attributes

Attribute      Description
globalshape    Global representation of shape of array distributed among processes
globalsize     Global representation of number of elements distributed among processes
globalnbytes   Global representation of number of bytes distributed among processes
globalndim     Global representation of number of dimensions distributed among processes

Beyond the attributes that needed to be introduced to emulate the ndarray object, select attributes that are useful in developing MPI code are introduced to the MPIArray object. These attributes include the MPI process communicator, properties useful for creating virtual process topologies, and a mechanism to resolve local array index space to the global array index space.

Table 4 below provides a description of these attributes.

Table 4: Useful Distributed MPIArray Attributes

Attribute        Description
comm             MPI communicator associated with distributed array object
comm_dims        Balanced distribution of processes when placed on a cartesian grid
comm_coord       Coordinates of a given process when placed on cartesian grid
local_to_global  Dictionary specifying local data index range in global index space by axis
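As a rough usage sketch (hypothetical, based only on the attribute names documented in Tables 3 and 4 rather than the MPIDS source), the local and global views of a distributed array could be inspected as follows; the creation routine used here is covered in section 3.1.3:

import mpids.MPInumpy as mpi_np

# Hypothetical sketch; attribute names follow Tables 3 and 4.
mpi_arr = mpi_np.zeros((5, 5), dist='b')

print(mpi_arr.shape)            # shape of the process-local block (inherited from ndarray)
print(mpi_arr.globalshape)      # shape of the full distributed array, e.g. (5, 5)
print(mpi_arr.comm_coord)       # this process's coordinates on the cartesian grid
print(mpi_arr.local_to_global)  # local-to-global index ranges, keyed by axis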

3.1.2 Data Distributions

The MPInumpy library currently supports two parallel data distributions, namely the Block and

Replicated classes of data partitioning. These classes of objects act as concrete implementations of the MPIArray object, requiring unique routines to resolve operations that are impacted in a data-parallel environment.

As the name implies, the Block data distribution assumes partitioning of data in blocks, or contiguous chunks of elements. An example of the expected block data distribution pattern for various dimensions of array data is shown in Figure 17 below. As highlighted in this example, the array elements are partitioned into separate chunks based on the leading dimension of the global array’s shape.

Figure 17: Example block distribution of 1, 2, and 3 dimensional data among three MPI processes.

One potential shortcoming of this implementation, as highlighted in the 1-D and 2-D cases found in Figure 17 above, is that the block partitioning scheme does not guarantee even distributions of data elements. Instead, the first MPI process (process 0) is responsible for additional elements/rows of the global array, commonly referred to as ‘over-partitioning’ of the data. Conversely, this implementation of the block distribution pattern is also susceptible to ‘under-partitioning’, which will occur when the number of MPI processes exceeds the number of elements in the leading dimension of the global array.

The second data distribution included in this work, the Replicated data distribution, assumes that every MPI process associated with the object's communicator has a duplicate copy of the array data. Compared to the block distribution, this pattern is easier to implement, as it effectively wraps a cloned ndarray object with MPI attributes.

An MPIArray object's data distribution can be resolved by looking at its dist attribute. Invoking this attribute will return a string with either a value of 'b' or 'r' for the blocked and replicated patterns respectively. Actual partitioning and distribution of data for these patterns is handled by the array creation routines discussed in the next section.

3.1.3 Creation Routines

The MPInumpy library provides several array creation routines that can produce concrete implementations of the MPIArray object. These routines can be grouped into two different categories: those that distribute existing array-like data and those that create the desired distributed array based on argument parameters.

Routines that work with existing data take an array-like object (e.g., list, list of lists, tuple, ndarray, etc.) and produce an MPIArray. Currently, only one routine has been implemented under this category: the MPInumpy.array() method. To emulate the expected behavior of using a NumPy.array(), this routine accepts the same initial parameters used in array creation, such as the array-like data, data type, storage order, minimum number of dimensions.

Routines that create a distributed array based on parameters take either the desired global shape or formatting information to produce an MPIArray. Implemented routines that work off of a desired shape include the MPInumpy.empty() (MPIArray with un-initialized values), MPInumpy.zeros() (MPIArray with all values initialized to zero), and MPInumpy.ones() (MPIArray with all values initialized to one) methods. The implemented routine that works off of formatting information is the MPInumpy.arange() method, which will create an MPIArray with evenly spaced values within a specified interval (e.g., interval start, stop, step). Like the first category of creation routines, these methods accept the same initial parameters as their respective NumPy counterparts.

In addition to the expected NumPy method parameters, all of these array creation methods

accept three additional parameters: the MPI communicator associated with the object, the root or process which has the local data, and the desired final distribution of the generated MPIArray.

A sample array creation routine using MPInumpy.array() is shown in Figure 18 below. Note: all of the additional MPInumpy parameters (comm, root, dist) are given their respective default values: MPI.COMM_WORLD, 0, and ‘b’.

1 import mpids.MPInumpy as mpi_np
2 import numpy as np
3 from mpi4py import MPI
4
5 comm = MPI.COMM_WORLD
6 rank = comm.Get_rank()
7 data = np.arange(25).reshape(5,5) if rank == 0 else None
8 mpi_arr = mpi_np.array(data, comm=comm, root=0, dist='b')

Figure 18: Example MPIArray creation routine for a 5x5 block partitioned array.

On lines 1-3 of Figure 18 above, the MPInumpy, NumPy, and mpi4py packages are imported.

On lines 5-6 the mpi4py library is used to resolve the default communicator for all processes and their respective ranks (process IDs). On line 7, the NumPy library is used to generate a 5x5 array of elements with values from 0 to 24 on the process with rank 0, storing the array in the variable data. For all other MPI processes/ranks, the variable data is initialized with the Python value of None. On line 8, the MPInumpy array() creation routine is used to distribute the array data among processes under a blocked partitioning scheme.

As mentioned previously, the actual partitioning and distribution of the data is managed in the array creation routines. Under the circumstance that the desired data distribution is replicated

(dist=‘r’), the values for the MPIArray attributes (comm_dims, comm_coord, local_to_global) are all initialized to None and an MPI broadcast operation is used to provide duplicate copies of the array data/shape information to all processes. If the desired distribution is the default blocked

(dist=‘b’) partitioning scheme, the stages outlined below are performed:

1. Resolve comm_dims and comm_coord from the number of MPI processes.

   (a) comm_dims is computed using the MPI routine MPI_Dims_create().

   (b) comm_coord is computed using an implementation based on the Open MPI virtual topology Cartesian coordinate creation method [45].

2. MPI broadcast the global array shape to all processes.

3. Determine the local array shape and local_to_global mapping on each MPI process using the expected global shape.

   (a) Local responsibility is resolved by distributing elements in the leading dimension among processes.

4. Based on the creation routine type:

   (a) If array-like data: MPI scatter the data to the respective processes and generate the local MPIArray honoring the expected local array shape.

       i. The previously determined local shapes are leveraged to compute the displacements of the transmission buffer for a personalized MPI_Scatterv operation.

   (b) If desired shape: create the MPIArray honoring the local array shape and the creation routine's initialization.
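A rough sketch of stages 3 and 4(a) for a two-dimensional array is given below, showing how per-process local shapes could be derived from the global shape and turned into the counts and displacements of a personalized MPI_Scatterv call via mpi4py. The helper name and the float64 data type are illustrative assumptions, not the MPInumpy implementation itself.

# Hypothetical sketch of stages 3-4(a) for a 2-D float64 array: derive each
# process's local shape from the global shape, then the element counts and
# displacements a personalized Scatterv would need.
import numpy as np
from mpi4py import MPI

def scatter_block(data, global_shape, comm=MPI.COMM_WORLD, root=0):
    rank, size = comm.Get_rank(), comm.Get_size()
    rows, cols = global_shape
    base, rem = divmod(rows, size)
    # rows owned by each rank (rank 0 absorbs the remainder, as described above)
    local_rows = [base + rem if r == 0 else base for r in range(size)]
    counts = [n_rows * cols for n_rows in local_rows]        # elements per rank
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))   # send-buffer offsets
    local = np.empty((local_rows[rank], cols), dtype=np.float64)
    # on the root, `data` is assumed to be a contiguous float64 array of global_shape
    sendbuf = [data, counts, displs, MPI.DOUBLE] if rank == root else None
    comm.Scatterv(sendbuf, local, root=root)
    return local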

3.1.4 Reduction Routines

Select common reduction routines found in the NumPy library are currently implemented in the

MPInumpy library. These include the min(), max(), mean(), sum(), and std() methods to compute an array’s minimum, maximum, arithmetic mean, summation, and standard deviation respectively. The results of these operations can be computed using the entire distributed array contents, or along a specified axis.

Like the previously discussed creation routines, the MPInumpy reduction routines attempt to emulate the expected behavior of using the base ndarray, accepting the same function parameters.

These parameters include the specified axis (default: use entire array contents) along which to compute the result and the output data type (default: maintain current data type).

For replicated arrays, the computation of the reduction routines is relatively straightforward, requiring only a local computation of the result using the base ndarray. For block partitioned arrays, the operations necessary to perform the reduction are shown in Algorithm 2 below. On line

1, a local reduction is performed on each process, resulting in a process local minimum, maximum, arithmetic mean, etc. The computation of the globally reduced result is dependent on the axis, if specified, as shown on lines 2-5. If the axis is not specified, or the axis is the axis of partitioning, an MPI_Allreduce is leveraged to compute the result from the distributed values. If the specified axis is not the partitioning axis, an MPI_Allgatherv is used to gather a process-varying (under- or over-partitioned) number of elements from the locally determined results. The final step, as shown on lines 7-8, involves the reshaping of the global result, should the original array have three or more dimensions. This final step is necessary because an ndarray reduction operation flattens the axis along which it performs its computation, effectively resulting in the axis being removed by the operation.

Algorithm 2: General N-Dimensional Reduction Routine
Data: localArray // process local MPIArray
      axis       // axis along which to perform reduction
Result: globalReduction // globally determined result
1 localReduction ← localArray.reduction(axis)
2 if axis = None ∨ axis = partitioningAxis then
3     globalReduction ← AllReduce(localReduction)
4 else
5     globalReduction ← AllGather(localReduction)
6 end
7 if localArray.globalndim > 2 then
8     reshapeArray(globalReduction, axis) // reshape array accounting for lost axis
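For illustration, a minimal mpi4py sketch of the Allreduce branch of Algorithm 2 (axis not specified) is shown below for a sum over a block-partitioned vector; the stand-in local chunk and the extra division used for the mean are assumptions for the example, not the library's code.

# Minimal sketch of Algorithm 2's Allreduce branch for a full-array sum on a
# block-partitioned vector; `local` is a stand-in for this process's chunk.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.arange(5, dtype=np.float64) + 5 * comm.Get_rank()  # stand-in local chunk

local_sum = local.sum()                                # Algorithm 2, line 1
global_sum = comm.allreduce(local_sum, op=MPI.SUM)     # lines 2-3 (axis = None path)
global_count = comm.allreduce(local.size, op=MPI.SUM)
global_mean = global_sum / global_count                # a mean needs one extra division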

Currently, all reduction routine results are returned as replicated MPIArray objects. A sample

reduction routine, using MPIArray.mean() to normalize each column of a 5x5 array by its arithmetic mean, is shown in Figure 19 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 mpi_arr_mean = mpi_arr.mean(axis=0)
5 mpi_arr_normalized = mpi_arr / mpi_arr_mean

Figure 19: Example MPIArray reduction routine demonstrating how to normalize all columns of a 5x5 block partitioned array by their respective arithmetic means.

In Figure 19 above, on line 3, the MPInumpy array creation routine arange() is used to generate a 5x5 array of elements with values from 0 to 24 with a block data partitioning scheme. On line

4, the arithmetic mean of each column of elements is computed by the MPIArray.mean() method by specifying a reduction axis of 0 (i.e., computation along the rows). Finally, on line 5, the 5x5 array is divided by the computed column mean values, normalizing each column by its respective arithmetic mean.

3.1.5 Behavior & Operations

As previously discussed, the MPIArray will naturally inherit all of the functionality of the base ndarray class it subclasses. This inherited functionality includes unary, binary, and comparison operators, as well as array object manipulation routines. These can be leveraged locally on each process by using the MPIArray's base or local attributes. This section covers the behavior and operations that were addressed in a data-distributed framework, which include the getting or setting of elements, slicing, distributed array reconstruction, iterating, and array reshaping.

Direct accessing, or getting/setting, of elements in an MPIArray object honors the Python object data model [66]. It does this by implementing custom __getitem__() and __setitem__() dunder

(double underscore) methods for respectively getting or setting elements within a container object.

When performing either operation, a ‘key’ or index must be supplied indicating the location of the desired element. For the setting operation, an additional parameter for the desired updated value is required. In this implementation, the ‘key’ in both cases represents the global position/index of the target element. A sample of both of these accessing routines is shown in Figure 20 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 mpi_arr_index_00 = mpi_arr[0,0] #Getter routine
5 mpi_arr[4,4] = 9999 #Setter routine

Figure 20: Example MPIArray accessing routines of a 5x5 block partitioned array demonstrating how to get the element in global position 0,0 and how to set the element in global position 4,4.

On line 4 of Figure 20 above, a getting routine is used to capture the element found at global index (‘key’) 0,0 and store it in the variable mpi_arr_index_00. On line 5, a setting routine is used to assign the element found at global index 4,4 a value of 9999, replacing its previous value of 24. An example of the expected behavior of the getting/setting routines in Figure 20 is highlighted in the interactive tmux session, using two MPI processes, shown in Figure 21 below.

Figure 21: Interactive MPInumpy Python example using two tmux panes, one for MPI process/rank 0 (left) and one for MPI process/rank 1 (right), to demonstrate accessing routine behavior shown in Figure 20.

In both accessing operations, a transformation from a global array index space to a local space

is required. For a replicated partitioning scheme, this mapping from global to local space results in no additional work, requiring only that the ‘key’ be passed to the base local array since each process has a duplicated copy of the MPIArray. For a blocked partitioning scheme, mapping from a global to local index space requires the transformation of a global ‘key’ to a local one. To perform this mapping, the MPIArray attribute local_to_global is leveraged to transform the global index to either a local index or a non-access ‘key’ (a slice that produces an empty result). Afterwards, either the element is updated (when setting) or the element is returned as a replicated MPIArray to all processes (when getting).
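The following is an illustrative sketch of that global-to-local transformation for an axis-0 block partition; the start/stop bounds stand in for the role the local_to_global attribute plays in the library, and the function name is hypothetical.

# Illustrative mapping of a global row index to a local one on a block-partitioned
# leading axis; `local_start`/`local_stop` play the role local_to_global fills in
# the library (names are assumptions, not the library's API).
def global_to_local_key(global_row, local_start, local_stop):
    if local_start <= global_row < local_stop:
        return global_row - local_start    # index into this process's local chunk
    return slice(0, 0)                     # 'non-access' key: produces an empty result

# a process owning global rows [3, 5): row 4 maps to local row 1, row 0 to nothing
print(global_to_local_key(4, 3, 5))  # 1
print(global_to_local_key(0, 3, 5))  # slice(0, 0, None)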

Slicing of an MPIArray object takes advantage of the previously outlined getting routine. Just like its emulated counterpart, the ndarray object, MPIArray objects support slicing with strided element access patterns using the notation MPIArray[start:stop:step], where start is the global inclusive starting position along an axis, stop is the global exclusive stopping position, and step is the global stride from the starting position. Like the getting routine, the result of a slice is a replicated MPIArray on all processes.

To produce a replicated MPIArray from block partitioned distributed data, an array reconstruction method was implemented. This method, MPIArray.collect_data(), takes no parameters and returns a reconstructed array to all processes associated with the object. The actual reconstruction of the array is resolved by using the MPI_Allgatherv operation to gather the process-varying number of elements stored locally on each process. This method is leveraged in the previously mentioned accessing routines, where an intermediate distributed MPIArray is created with the locally indexed/sliced results, after which the collect_data() method is invoked to produce the

final replicated result.
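A minimal mpi4py sketch of such an Allgatherv-based reconstruction for a block-distributed vector is shown below; it assumes float64 data and is only a model of what collect_data() is described as doing, not the library's implementation.

# Sketch of an Allgatherv-based reconstruction of a block-distributed vector.
import numpy as np
from mpi4py import MPI

def collect_vector(local, comm=MPI.COMM_WORLD):
    counts = comm.allgather(local.size)                      # per-process element counts
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))   # receive-buffer offsets
    full = np.empty(sum(counts), dtype=local.dtype)
    comm.Allgatherv(local, [full, counts, displs, MPI.DOUBLE])  # assumes float64 data
    return full                                              # replicated on every process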

Iteration of an MPIArray can be performed either locally by a given MPI process, or globally.

Iterating over a process's local elements is handled using a custom implementation of the __iter__() dunder method, which is the method invoked by a standard for loop operation (e.g., for element in collection:). This custom implementation of the iteration dunder method is defined in the

MPIArray pseudo-abstract base class, and invokes the iteration dunder method on the local ndarray

object. In addition to local iteration via the standard means of Python object interaction, the

MPIArray’s local or base attributes can also be used when coupled with a getter routine. In this alternative form of local iteration, resolution of the various indexes can be done by using the

MPIArray’s local shape or size properties.

The second means of iteration, globally traversing the distributed elements, can be achieved using the global shape of the MPIArray in conjunction with a getter routine. However, because the current implementation of the MPIArray getter routine returns a new replicated copy of the element, it cannot be used to update the value of the element in the global array that is being traversed. A sample of both iteration methods is shown in Figure 22 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 #Local iteration of array
5 for row_index, row in enumerate(mpi_arr):
6     print(row) #Method 1
7     print(mpi_arr.local[row_index]) #Method 2
8 #Global iteration of array
9 for row_index in range(mpi_arr.globalshape[0]):
10     print(mpi_arr[row_index])

Figure 22: Example MPIArray local and global row iteration of a 5x5 block partitioned array.

In Figure 22 above, lines 5-7 demonstrate how to locally iterate the elements, resulting in each row of the process local array being printed to standard output. Note that the Python built-in enumerate() method is used to produce a monotonically increasing index (row_index) along with the iterator produced result (row). Lines 9-10 demonstrate how the elements can be globally iterated, resulting in each row of data distributed among processes being printed to standard output.

The final re-implemented behavior carried out in this work is a routine that enables the reshaping of an MPIArray. This operation can be performed on any MPIArray object by calling its reshape()

method with a new shape that has dimensions compatible with its original shape (i.e., the product of the new dimensions' lengths must equal the product of the previous dimensions' lengths).

As seen in previous re-implemented operations, little work is required to perform this operation under a replicated partitioning scheme, requiring only that the local ndarray object's reshape method be invoked with the desired shape. For a block partitioning scheme, two key steps need to be performed, namely the distribution of the new desired shape and an all-to-all exchange of elements to match the new array shape.

The first of these two operations to reshape a block partitioned MPIArray uses the same mechanism leveraged in the shape based array creation routines to distribute the shape. The second stage of this process uses the current global shape and the desired new shape to resolve which elements need to be exchanged between processes. Once this information is locally determined, the necessary data is exchanged in an MPI_Alltoallv operation. An example of this reshaping operation is shown in Figure 23 below, where a block partitioned array of shape 7x3 is reshaped to a 3x7 array. As highlighted in the example, each process may end up with more or fewer elements depending on the final shape, with MPI process 0 losing 2 elements, while MPI processes 1 & 2 both gain an element.

Figure 23: Example reshape operation of a block partitioned array of shape 7x3 to a new shape of 3x7.
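The element exchange described above can be modeled for a flattened, C-ordered block distribution as follows: ownership boundaries are recomputed under the new leading-dimension split, the overlaps between old and new ranges give the per-rank send/receive counts, and an MPI_Alltoallv moves the data. The helper names and float64 assumption are illustrative only; this is not the MPInumpy reshape code.

# Hedged sketch of the block-reshape exchange on a flattened float64 array.
import numpy as np
from mpi4py import MPI

def block_bounds(n, size):
    """(start, stop) of the leading-dimension rows owned by each rank."""
    base, rem = divmod(n, size)
    counts = [base + rem if r == 0 else base for r in range(size)]
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return [(int(s), int(s + c)) for s, c in zip(starts, counts)]

def reshape_exchange(local_flat, old_shape, new_shape, comm=MPI.COMM_WORLD):
    rank, size = comm.Get_rank(), comm.Get_size()
    # flattened (C-order) element ranges owned under the old and new shapes
    old = [(s * old_shape[1], e * old_shape[1]) for s, e in block_bounds(old_shape[0], size)]
    new = [(s * new_shape[1], e * new_shape[1]) for s, e in block_bounds(new_shape[0], size)]
    mine_old, mine_new = old[rank], new[rank]
    # overlaps of ranges give how many elements go to / come from each rank
    send = [max(0, min(mine_old[1], n[1]) - max(mine_old[0], n[0])) for n in new]
    recv = [max(0, min(mine_new[1], o[1]) - max(mine_new[0], o[0])) for o in old]
    sdispl = np.concatenate(([0], np.cumsum(send)[:-1]))
    rdispl = np.concatenate(([0], np.cumsum(recv)[:-1]))
    out = np.empty(mine_new[1] - mine_new[0], dtype=np.float64)   # assumes float64 data
    comm.Alltoallv([local_flat, send, sdispl, MPI.DOUBLE],
                   [out, recv, rdispl, MPI.DOUBLE])
    return out.reshape(-1, new_shape[1])   # local block of the reshaped array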

3.2 MPIscipy

Akin to the SciPy [74] package building on top of NumPy as its numerical array core, the MPIscipy library developed in this work leverages MPInumpy to act as its distributed numerical array library.

Currently, the MPIscipy library only has one module, the MPIscipy.cluster module, which will be discussed in the remainder of this section.

3.2.1 Cluster

The MPIscipy module cluster currently contains only one computational kernel, namely kmeans().

The kmeans() kernel is a parallel MPI implementation of the K-Means clustering algorithm discussed in section 2.6. Fundamentally, the kernel carries out all of the stages listed in Algorithm 1 of section 2.6, but instead of the observations all residing on a single compute resource, the observations are distributed among MPI processes.

The MPIscipy.cluster.kmeans() method is based on an MPI-parallel K-Means implementation originally developed for cell classification on multispectral images of thyroid specimens [20]. Taking this code originally developed in C, the implementation was first ported to Python leveraging the mpi4py package, then updated to make use of the MPInumpy library. As a reminder, there are four key stages in the typical K-Means clustering algorithm:

1. Initialize k cluster centroids.

2. Assign each of the observations to the nearest centroid based on their Euclidean distance.

3. Recompute centroids from assigned observations.

4. Repeat steps 2 & 3 until centroids converge.

In this data parallel implementation, each of the observation vectors is distributed among the various MPI processes, while the k cluster centroids are replicated, or duplicated, among resources.

As a result, the execution parallelism is realized in stage 2 of the above sequence, where each process can independently assign its respective collection of observations to a given centroid. To facilitate the recomputation of the centroids in stage 3, a set of MPI_Reduce operations is leveraged to combine the process-independent results.
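A hedged sketch of one such data-parallel iteration (stages 2 and 3) is shown below, with observations distributed and centroids replicated. Per-cluster sums and counts are combined here with MPI_Allreduce for brevity so every rank ends up with the updated centroids, whereas the thesis kernel is described in terms of a set of MPI_Reduce operations; all names are illustrative.

# Illustrative single K-Means update: local assignment + global centroid recompute.
import numpy as np
from mpi4py import MPI

def kmeans_step(local_obs, centroids, comm=MPI.COMM_WORLD):
    """One iteration: local assignment (stage 2) + global centroid update (stage 3)."""
    k, n_features = centroids.shape
    # stage 2: nearest centroid for every locally held observation
    dists = np.linalg.norm(local_obs[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # stage 3: combine per-cluster sums and counts across all processes
    local_sums = np.zeros((k, n_features))
    local_counts = np.zeros(k)
    for j in range(k):
        members = local_obs[labels == j]
        local_sums[j] = members.sum(axis=0)
        local_counts[j] = members.shape[0]
    global_sums = np.empty_like(local_sums)
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_sums, global_sums, op=MPI.SUM)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    # guard against empty clusters with a minimum count of 1
    new_centroids = global_sums / np.maximum(global_counts, 1)[:, None]
    return new_centroids, labels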

The method parameters and returned result behavior of the implementation created in this work most closely match the SciPy K-Means2 implementation (scipy.cluster.vq.kmeans2()) [76], accepting the observation data and number of clusters (or seeds for the centroids), while returning the computed cluster centroids and observation labels. There are some notable differences between the two implementations, namely that the SciPy version permits variable means of cluster centroid initialization (selection), and does not use a user specified threshold as its terminating condition (instead using the number of iterations). The MPIscipy version, meanwhile, introduces an additional parameter for the MPI communicator associated with the distributed observation object, defaulting to MPI.COMM_WORLD as is done in the MPInumpy array creation routines.

A sample usage of the MPIscipy.cluster.kmeans() method on simulated data, with one dimension per feature, is shown in Figure 24 below. On lines 5-9, the simulated data is created with two clusters of points that are randomly selected from uniform distributions between -1 to -0.75 and 1 to 1.25 respectively. These sets of observations are then stored in a NumPy ndarray in the variable named np_1D_obs_features. On line 12, this feature vector is then distributed, under a block partitioning scheme, to the various MPI processes. Finally, on line 15, the parallel K-Means clustering kernel is executed using the distributed observations with two specified as the number of clusters.

For comparison, a sample usage of the SciPy scipy.cluster.vq.kmeans2() method on the same simulated data is shown in Figure 25 below. Examining the two code snippets, Figures 24 & 25, we

find that the distributed MPIscipy version requires only one additional line of code to distribute the observations prior to calling the K-Means clustering implementation.

1 import numpy as np
2 import mpids.MPInumpy as mpi_np
3 import mpids.MPIscipy.cluster as mpi_scipy_cluster
4 #Create simulated 1D observation vector
5 k, num_points, centers = 2, 50, [[-1, -0.75],
6                                  [1, 1.25]]
7 x0 = np.random.uniform(centers[0][0], centers[0][1], size=(num_points))
8 x1 = np.random.uniform(centers[1][0], centers[1][1], size=(num_points))
9 np_1D_obs_features = np.array(x0.tolist() + x1.tolist(), dtype=np.float64)
10
11 #Distribute observations among MPI processes
12 mpi_np_1D_obs_features = mpi_np.array(np_1D_obs_features, dist='b')
13
14 #Compute K-Means Clustering Result
15 centroids, labels = mpi_scipy_cluster.kmeans(mpi_np_1D_obs_features, k)

Figure 24: Example usage of the MPIscipy K-Means clustering method on simulated data, with one feature, containing two clusters.

1 import numpy as np
2 import scipy.cluster.vq as scipy_cluster
3 #Create simulated 1D observation vector
4 k, num_points, centers = 2, 50, [[-1, -0.75],
5                                  [1, 1.25]]
6 x0 = np.random.uniform(centers[0][0], centers[0][1], size=(num_points))
7 x1 = np.random.uniform(centers[1][0], centers[1][1], size=(num_points))
8 np_1D_obs_features = np.array(x0.tolist() + x1.tolist(), dtype=np.float64)
9
10 #Compute K-Means Clustering Result
11 centroids, labels = scipy_cluster.kmeans2(np_1D_obs_features, k)

Figure 25: Example usage of the SciPy K-Means clustering method on simulated data, with one feature, containing two clusters.

4 Evaluation

This chapter will evaluate the work developed in Chapter 3. In doing so, several benchmarks/profiles will be generated using the developed libraries along with comparisons to the base libraries they intend to emulate in a data-parallel environment.

The primary metrics used to evaluate performance are the execution time, the speed up (factor difference) of one implementation against another, and the scalability of a given implementation to multiple parallel compute resources. Formally, speed up is defined as follows:

SpeedUp = Time_X / Time_Y    (3)

where Time_X & Time_Y represent the execution times of two implementations performing the same task. A speed up factor greater than 1.0 reflects that the time required by implementation Y was less than that of X, while factors less than 1.0 reflect the opposite.

Scalability of a parallel implementation is typically analyzed in terms of its strong and weak scaling. Strong scaling involves the evaluation of an implementation's execution performance for a fixed problem size with varying numbers of resources (i.e., as resources increase, the amount of individual resource work decreases), whereas weak scaling is the evaluation of the execution performance for a fixed problem size per resource (i.e., as resources increase, the total problem size increases but the individual resource work remains the same).

A commonly discussed attribute of a parallel implementation's scalability is its parallel efficiency. Formally, parallel efficiency is defined as follows:

ParaEff(N) = SpeedUp_N / N = Time_1 / (N × Time_N)    (4)

where N is the number of parallel resources and SpeedUp_N is the achieved speed up using those resources (i.e., the factor difference between a single process and N parallel processes). Parallel efficiency is an important attribute to discuss as it reflects how well the additional resources are leveraged. In

an ideal case, doubling the number of parallel resources would halve the execution time, resulting in a speed up factor of two and a parallel efficiency of 1.0. In sub-optimal cases, such as the execution time remaining the same or increasing with additional resources, the speed up would remain constant or decrease and the parallel efficiency would have a value less than 1.0.
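As a toy illustration of equations (3) and (4), consider hypothetical timings of 12 s on a single process and 1.8 s on 8 processes:

# Hypothetical timings used only to illustrate equations (3) and (4).
time_1, time_n, n_procs = 12.0, 1.8, 8
speed_up = time_1 / time_n            # equation (3): ~6.67x
parallel_eff = speed_up / n_procs     # equation (4): ~0.83
print(f"speed up = {speed_up:.2f}, parallel efficiency = {parallel_eff:.2f}")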

The various trials presented in this chapter were executed using the Parallel Software Technologies Laboratory Crill cluster [17] at the University of Houston, making use of 15 nodes with four

12-core AMD Opteron 6174 processors (48 cores per node) and 64 GB of main memory per node.

These systems are connected via QDR Infiniband and Gigabit Ethernet network interconnects. The software stack used in these trials is as follows: openSUSE [58] version 42.3, Python3 [85] version

3.4.6, NumPy [83] version 1.9.3, SciPy [74] version 1.2.3, and mpi4py [10] version 3.0.0 running on top of Open MPI [81] version 3.0.1.

All measurements were executed fifty times and the arithmetic mean of all fifty measurements, after outlier rejection, is presented. The outlier rejection method used in this work is the Modified

Z-Score [91], which uses the sample's median value in lieu of the mean for a standard deviation based outlier classification, with a rejection threshold of 3.5. Additionally, all time-based measurements were taken using code-internal high-resolution clock timers provided by the mpi4py library

(MPI.Wtime()), which have a clock resolution of 1 nanosecond for the resources used in these measurements. Presented execution times that are near or below this stated clock resolution were executed a sufficient number of iterations to fall within the resolution of the timer; as a result, the presented times represent the measured time divided by the number of execution iterations.
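The measurement post-processing described above can be sketched as follows, assuming the standard 0.6745 scaling constant of the Modified Z-Score and a stand-in operation under test; this mirrors, but is not, the benchmarking harness used in the trials.

# Sketch of the measurement post-processing: Modified Z-Score rejection
# (threshold 3.5, standard 0.6745 constant) then the mean of retained timings.
import numpy as np
from mpi4py import MPI

def mean_without_outliers(samples, threshold=3.5):
    """Arithmetic mean of the samples that survive Modified Z-Score rejection."""
    samples = np.asarray(samples, dtype=np.float64)
    median = np.median(samples)
    mad = np.median(np.abs(samples - median))        # median absolute deviation
    if mad == 0:
        return samples.mean()
    modified_z = 0.6745 * (samples - median) / mad
    return samples[np.abs(modified_z) <= threshold].mean()

# timing pattern for one routine, repeated fifty times as described above
times = []
for _ in range(50):
    start = MPI.Wtime()
    np.arange(2**20, dtype=np.float64).sum()         # stand-in operation under test
    times.append(MPI.Wtime() - start)
print(mean_without_outliers(times))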

For trials involving more than one process, the MPI execution command mpiexec was provided with the --map-by node option (e.g., mpiexec -n <num processes> --map-by node python3

MPIDS_python_code.py). Supplying this option causes processes to be allocated in a round-robin fashion among available nodes, resulting in the specified number of processes being evenly distributed among the resources.

4.1 MPInumpy - Evaluation

The evaluation of the MPInumpy library created in this work will be done in two parts. The first will compare the execution performance/behavior of the NumPy and MPInumpy libraries using a single process (CPU core). The second will be an evaluation of the MPInumpy library's scalability.

The MPInumpy library includes two separate data distributions (Block & Replicated); however, this evaluation only covers the behavior of the block data partitioning scheme. This is because the replicated data distribution is effectively a wrapper for a cloned NumPy ndarray object duplicated among processes, mainly functioning as a return type of block distributed objects depending on the interaction.

4.1.1 Single Process Performance

To evaluate the single process performance of the NumPy and MPInumpy libraries, the execution times of various complementing routines are compared on vector arrays containing a variable number of elements. Vector array sizes ranged from 1 to 2^27 elements, increasing by factors of 2 between each sample. The elements of each vector consisted of 8-Byte floating point (dtype=np.float64) numbers with values that ranged from 0 to N − 1, where N is the total number of elements.

In addition to a comparison of the individual execution times, the overhead introduced by using the MPInumpy library, or the speed up of NumPy over MPInumpy, was computed. This overhead comparison provides insights into how much additional computational effort MPInumpy requires to resolve complementing operations in a data-parallel environment.

The comparison of complementing array creation routines, one of each type (from array-like data, shape, parameters), is shown in Figure 26 below. In the case of creating arrays from existing array-like data or parameters (array() & arange()), the routines exhibit an initial overhead of nearly 128 (2^7) times that of the complementing NumPy versions. This overhead diminishes with an increasing number of elements and eventually settles at around a factor of 2 or a non-significant amount of overhead for the array() and arange() routines respectively. The creation routine that uses shape information to generate an array (empty()) exhibits a near constant introduced overhead just shy of a factor of 64 (2^6). Interestingly, we find a change in the execution time behavior of the

NumPy version between roughly two million and four million elements, resulting in a diminishing introduced overhead with increasing number of elements.

Figure 26: NumPy vs. MPInumpy array creation execution performance (left) and overhead (right).

The overhead introduced in the array creation routines is primarily caused by the logic necessary to distribute the data/shape/parameters. This includes the work necessary to resolve partitioning, transmit the data/initialization information, and then locally initialize the array object on a given process.

The comparison of performing various arithmetic operations (addition, subtraction, multiplication, and division) on all elements in the vector array, generated by both libraries, is shown in

Figure 27 below. Similar to the array-like data and shape based array creation routines, the arithmetic operations exhibit an initial overhead which diminishes to a negligible overall factor increase with increasing numbers of elements. For all tested arithmetic operations, the point of comparable execution performance is achieved for vector array sizes above 65,536 (2^16) elements.

Figure 27: NumPy vs. MPInumpy arithmetic operation execution performance (left) and overhead (right).

A NumPy vector array containing 65,536 8-Byte floating point elements corresponds to a memory footprint of 512KiB, which happens to be the level 2 cache (on-chip memory) size of the resources used in this trial. As a result, the arithmetic overhead convergence point exhibited in the profile indicates that MPInumpy array objects cannot be fully contained in the higher levels of cache on the resources used in this evaluation.

The comparison of performing various reduction type operations, namely resolving the maximum, arithmetic mean, summation, and standard deviation of all elements in a vector array, is shown in Figure 28 below. Once again, we find that nearly all implementations exhibit an initial introduced overhead for the smallest tested sizes, which diminishes with increasing array sizes.

The exception is the standard deviation implementation, which maintains a factor increase of roughly 1.2 for the largest tested sizes.

Figure 28: NumPy vs. MPInumpy reduction operation execution performance (left) and overhead (right).

For the reduction routines, the introduced overhead is the result of having to locally compute the target reduction, followed by the logic to resolve the global result, and finally return a replicated

MPIArray object of the global result. For reduction operations that do not have complementing

MPI reduction routines (arithmetic mean and standard deviation), an additional computation requirement is introduced when resolving the global result. In the case of computing the arithmetic mean, this involves an additional division operation using the result of the summation reduction.

For the computation of the standard deviation, several expensive mathematical operations have to be leveraged to resolve the global result, resulting in the roughly 20% increase in execution time for the larger tested sizes.

The comparison of performing local access operations (leveraging an MPIArray's local property) on an MPInumpy array against a NumPy array is shown in Figure 29 below. Here we find a near constant overhead is introduced by the getting, setting, and slicing routines, exhibiting only a modest decrease in overhead for the largest tested sizes. For the iterating (vector array traversal) operation, we find that an initial overhead is introduced for the smallest tested sizes, which diminishes to a negligible overall factor increase for vector arrays with more than 1,024 (2^10) elements.

Figure 29: NumPy vs. MPInumpy local access operation execution performance (left) and overhead (right).

Local access getting, setting, slicing, and iterating operations all make use of the MPIArray's local property to act on the base ndarray object associated with the object. As a result, the introduced overhead for these operations is effectively the cost of a function call.

The comparison of performing global access operations on an MPInumpy array against a NumPy array is shown in Figure 30 below. Compared to the overhead found when leveraging local access operations, we find a considerable amount of overhead is introduced by interacting with an MPIArray in a global manner; introducing nearly a 50 and 1000 times factor increase in operation overhead for getting and setting routines respectively. Global slicing of the vector array initially exhibits a nearly constant introduced overhead up until 16,384 (2^14) elements, after which the overhead increases at a near constant rate with increasing numbers of elements. The most costly of these operations, global iteration (vector array traversal), initially introduces a roughly 100 times factor overhead for single elements, which increases up until 256 (2^8) elements, after which the introduced overhead remains nearly constant at a factor over 2,000 times that of the NumPy library. Note: the tested value range for the global iterating operation was reduced due to its excessive time requirements.

However, from its profiled behavior, it is expected that the execution time will continue to grow at a near constant rate with increasing numbers of elements.

Figure 30: NumPy vs. MPInumpy global access operation execution performance (left) and overhead (right).

All global access operations introduce a base overhead through the conversion of a global ‘key’ to a local one. In the case of the setting operation, this is the only introduced additional work.

For the getting and slicing routines, which return a replicated result to all processes, additional complexity is added through the process of collecting the distributed result. In the most costly case

(globally iterating), the operation effectively calls the global getting routine for every element in

the vector array, resulting in near constant growth in execution time with increasing problem sizes.

The last comparison reviewed will be that of the execution performance/introduced overhead of the reshape operation, shown in Figure 31 below. Here we find the MPInumpy library introduces an initial overhead of roughly a factor of 600 for the smallest array sizes, followed by modest increases in overhead until roughly 256 (2^8) elements, after which the profile exhibits a linear increase in overhead with increasing array sizes.

Figure 31: NumPy vs. MPInumpy reshape operation execution performance (left) and overhead (right).

For the reshape operation, the introduced overhead is the result of having to distribute the new desired shape, orchestrate an exchange among processes (resolving elements to be exchanged as well as source/destination pairs), and then finally return an MPIArray of the redistributed result.

As the overhead profile indicates, this is an incredibly costly operation when compared to the near constant execution time required by the base NumPy library.

In summary, we find a non-trivial overhead is introduced for most operations, mainly ones that interact with global entries of an MPInumpy MPIArray object, while for operations that can be leveraged on process local elements the overhead is still present but considerably less impactful.

Additionally, in the case of arithmetic and reduction operations, we find that the introduced overhead is minimized for the larger vector array sizes. This is a promising result, given that the target application involves large scale problems that would warrant parallel compute resources.

4.1.2 Scalability

To evaluate the scalability (performance using parallel compute resources) of the MPInumpy library, the execution time and speed up of various routines will be reviewed. Here the speed up metric

(factor change) will take the form of the execution time of using a single process to carry out an operation divided by the execution time of using N parallel processes. For all scalability measurements, the number of processes is varied from 1 to 512 MPI processes, increasing by factors of 2 between each sample.

Total problem sizes will vary based on the scalability metric that is being reviewed. In the case of strong scaling, a vector array containing 2^25 elements will be distributed on N processes, meaning that when N = 1, all 2^25 elements will be contained on a single process, while when

N = 512 each process will locally contain 2^16 elements. In the case of weak scaling, a vector array containing 2^16 elements will be stored on each process. Thus, when N = 1 the total problem size will be 2^16 elements, while when N = 512 the total problem size becomes 2^25 elements. In either scaling benchmark, the elements of each vector consist of 8-Byte floating point (dtype=np.float64) numbers with values that ranged from 0 to N − 1, where N is the total number of elements.

The scaling behavior of the MPInumpy array creation routines is shown in Figure 32 below.

Overall we find that all of the creation routines scale poorly with increasing numbers of processes, with only the empty() routine showing an initial performance improvement under strong scaling up to 128 processes. In the case of the array() routine, this behavior is expected, as this operation requires the distribution of existing data (resolution of target processes and transmitting of data) to be carried out by a single process. For the routines that generate arrays based on shapes/parameters (empty() & arange()), this poor scaling stems from a design decision related to the resolution/calculation of global attributes.

Figure 32: MPInumpy array creation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

An MPIArray object is created with its global attributes (globalshape, globalndim, etc.) initially set to None. A technique known as lazy evaluation is used to compute the necessary attribute only when an operation interacts with/invokes the property for the first time, after which the computed value is stored in the object, a behavior that is necessary to support slicing of MPIArray objects. As a result, when using an array creation routine, an additional overhead to compute these values (by invoking them in the routine) is incurred that increases with growing numbers of processes due to the reduction routines that have to be utilized.
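The lazy-evaluation pattern described above can be illustrated with a small, hypothetical class whose global size is computed by a reduction only on first access and cached thereafter; the attribute names are assumptions, not the library's API.

# Illustrative lazy-evaluation pattern: a global attribute starts as None and is
# computed (here via an allreduce over local sizes) only on first access.
from mpi4py import MPI

class LazyGlobalSize:
    def __init__(self, local_array, comm=MPI.COMM_WORLD):
        self._local = local_array
        self._comm = comm
        self._globalsize = None            # not computed at creation time

    @property
    def globalsize(self):
        if self._globalsize is None:       # first access triggers the reduction
            self._globalsize = self._comm.allreduce(self._local.size, op=MPI.SUM)
        return self._globalsize            # cached for every later access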

The scaling behavior of performing arithmetic operations on an MPIArray object is shown in

Figure 33 below. Here we find a nearly ideal scaling both in terms of strong and weak scalability.

Because these operations require no message/data exchanges to compute the result, this behavior is expected. Additionally, we find evidence of super-linear speed up (instances where the speed up passes the indicated ‘optimal’ line).

Figure 33: MPInumpy arithmetic operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various reduction routines on a block distributed MPIArray object is shown in Figure 34 below. Here we find the resolution of maximum, arithmetic mean, summation, and standard deviation of the entire distributed array continues to strongly scale up to

256 processes, after which the maximum, arithmetic mean, and summation implementations exhibit a decrease in parallel efficiency (i.e., the ratio of speed up to the number of parallel resources), while the standard deviation implementation shows minimal performance improvement from doubling the number of resources. For this total problem size, distributing the work among 256 processes highlights the point where the computation benefits of data-parallelization are outweighed by the increased communication costs (more processes exchanging messages) and resource requirements of the individual nodes (more processes competing for finite resources on nodes).

This negative impact on performance resulting from increased communication costs and resource utilization is further highlighted in the weak scaling. Each routine exhibits an increased performance penalty, independent of the local size, with increasing total problem sizes.

Figure 34: MPInumpy reduction operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various local access operations on an MPIArray is shown in

Figure 35 below. Here we find that getting, setting, and slicing operations result in a near constant execution time independent of the type of scaling, while for local iterating (traversal) of the block distributed MPIArray, near optimal speed up is achieved for both strong and weak scaling.

Figure 35: MPInumpy local access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various global access operations as well as the collect_data routine (collecting the entire array among all processes) on a block distributed MPIArray is shown in Figure 36 below. Excluding the setting operation, which exhibits a near constant execution time independent of the type of scaling, all global means of access exhibit a decrease in parallel efficiency with increasing numbers of resources. Note: global iterating was excluded from this analysis due to its excessive run time requirements.

Global getting, slicing, and iteration operations all initially resolve their process local results and store them in an intermediate distributed MPIArray object. Thereafter, the globally determined replicated result is produced by calling the collect_data() routine on that intermediate object. As a result, the influence of this routine is evident in the profile of the access operations that leverage it.

Figure 36: MPInumpy global access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The last implementation reviewed in terms of scalability will be that of the reshape operation shown in Figure 37 below. In terms of strong scaling, we find that the implementation scales reasonably well up to 128 processes, after which a diminishing return in parallel efficiency is observed.

In the case of weak scaling, we find a continued increase in execution time with an increase in the total problem size. Similar to the scaling of the reduction operations, here the shortcoming in terms of scalability can be attributed to increasing communication costs and resource requirements as the number of processes is increased.

Figure 37: MPInumpy reshape operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

Summarizing the findings of this section, we find that operations requiring large amounts of collective communication scale poorly with increasing resources, while operations that can solely operate on local elements scale very well. Additionally, we identify a handful of routines/design decisions, such as the collect_data routine and the resolution of global properties, that would be candidates to target for refactoring/optimization in future works.

4.2 MPIscipy K-Means Clustering - Evaluation

The evaluation of the MPIscipy K-Means clustering implementation created in this work will follow the same evaluation sequence as the MPInumpy library. First, the execution performance/behavior of the SciPy and MPIscipy K-Means clustering algorithms will be compared using a single process (CPU core). This will be followed by an evaluation of the scalability of the

MPIscipy K-Means implementation.

4.2.1 Single Process Performance

The influence of input parameters that are independent of the SciPy and MPIscipy K-Means clustering implementations will be the basis of the single process performance evaluation. These independent input parameters are the number of observations (data samples), the number of features per observation, and the number of cluster centroids to determine.

The number of observations in these trials is varied from 2^4 to 2^10, increasing by factors of

2 between samples. The number of features per observation is varied from 2 to 2^10, increasing by factors of 2 between samples. The final parameter, the number of clusters, is varied from 2 to 2^(log2(Num Obs)−1), increasing by factors of 2 between samples. The elements of each observation vector (vector of features) consist of 8-Byte floating point (dtype=np.float64) numbers with values populated by the Scikit Learn make_blobs() [38] method, which generates isotropic Gaussian blobs of points based on the specified number of samples (observations), features, and centers (centroids).

Additionally, the random_state parameter of the make_blobs() method was seeded with a constant value to ensure input consistency between trials.
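For reference, one sample point of that input sweep could be generated as follows; the specific parameter values and seed shown here are illustrative, not the exact values used in the trials.

# Illustrative generation of one benchmark input with scikit-learn's make_blobs;
# the parameter values and seed below are examples only.
from sklearn.datasets import make_blobs
import numpy as np

n_obs, n_features, n_centroids = 2**6, 2**3, 2**3
observations, _ = make_blobs(n_samples=n_obs, n_features=n_features,
                             centers=n_centroids, random_state=42)
observations = observations.astype(np.float64)   # 8-Byte floating point elements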

In addition to a comparison of the parameter dependent execution times, a rough estimate of the overhead introduced by the MPIscipy implementation, or rough speed up of SciPy’s kmeans2() method over MPIscipy’s kmeans() method, is computed.

To enable a reasonable comparison between the two implementations, the method for initial centroid selection in the SciPy kmeans2() method was set to ‘points’. Selecting this option results in the k initial centroid positions being chosen from the available observations, most closely

matching the initial centroid selection of the MPIscipy kmeans() method.

A fundamental difference in the heuristics of the implementations prevents a true apples-to-apples comparison of the two implementations. The SciPy kmeans2() method executes for a fixed number of iterations (left as the default iter=10 for these trials). As a result, the implementation can generate different final results for the centroids and labels depending on the number of iterations.

The MPIscipy kmeans() implementation executes as many iterations as necessary until the iteration-to-iteration position changes of the centroids fall below a specified threshold (left as the default thresh=1e-5 for these trials). Disregarding floating point rounding variances, this means that the implementation should generate the same results between independent executions, but is capable of executing significantly more iterations to come to its final result.
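The two terminating conditions contrasted above can be sketched as a simple driver loop around a single-iteration update (such as the hypothetical kmeans_step sketched earlier): a SciPy-style run executes a fixed number of iterations, while the threshold-based variant below stops once centroid movement falls below thresh. This is an illustration, not either library's actual driver.

# Illustrative threshold-based driver around a single-iteration K-Means update.
import numpy as np

def run_until_converged(step, local_obs, centroids, thresh=1e-5, max_iter=10_000):
    """Iterate a K-Means update until centroid movement falls below `thresh`."""
    labels = None
    for _ in range(max_iter):
        new_centroids, labels = step(local_obs, centroids)
        if np.linalg.norm(new_centroids - centroids) < thresh:
            return new_centroids, labels   # converged: movement below threshold
        centroids = new_centroids
    return centroids, labels               # fell back to the iteration cap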

The execution behavior, as a function of the number of features, observations, and centroids, for the SciPy kmeans2() and MPIscipy kmeans() implementations is shown in Figures 38 & 39 below.

Overall, the profiles indicate that the SciPy implementation's execution performance is a function of all three variable input parameters, whereas for the MPIscipy implementation, the execution performance is mainly influenced by the number of observations and the number of centroids.

Figure 38: SciPy K-Means2 execution performance as a function of the number of features, observations, and cluster centroids.

Figure 39: MPIscipy K-Means execution performance as a function of the number of features, observations, and cluster centroids.

The differences in execution behavior are highlighted in the overhead profile, shown in Figure 40 below. The profile indicates the variance in overhead is largest as the number of features increases and the number of observations and centroids decrease.

Figure 40: SciPy vs MPIscipy K-Means overhead as a function of the number of features, observations, and cluster centroids.

In terms of overall execution time, the SciPy implementation is considerably faster than the

MPIscipy implementation for all tested samples, completing the most computationally expensive sample (1024 (2^10) features, 1024 (2^10) observations, 512 (2^9) centroids) in under a second, whereas the MPIscipy implementation required just over two and a half minutes. Two factors mainly contribute to this difference in execution rate: the first, as previously mentioned, is the algorithm heuristics in terms of terminating condition; the second is that the SciPy implementation utilizes optimized kernels written in Cython [3] while the MPIscipy implementation is written entirely in

Python.

4.2.2 Scalability

To evaluate the scalability of the MPIscipy K-Means clustering implementation, the execution time and speed up will be reviewed. Similar to the MPInumpy library scalability evaluation, the speed up metric will take the form of the execution time of using a single process to carry out an operation divided by the execution time of using N parallel resources. For all scalability measurements, the number of processes will vary from 1 to 512 MPI processes, increasing by factors of 2 between each sample.

The MPIscipy K-Means clustering implementation achieves execution parallelism through distribution of the observations (feature vectors). As a result, the following trials examine the behavior of varying the number of observations as the total problem size. In the case of strong scaling, the observation vector size is varied from 2^15 to 2^20, increasing by factors of 2 between trials, with two features per observation. In the case of weak scaling, the local observation vector size is varied from 2^10 to 2^15, increasing by factors of 2 between trials, with two features per observation.

In all trials, the features of each observation vector consisted of 8-Byte floating point (dtype=np.float64) numbers with values populated by the Scikit Learn make_blobs() method (seeded with a constant value for consistency between trials). Additionally, the number of clusters to resolve in all trials was set to 2 with no initial value seeded, meaning the initial cluster centroids would

be selected from the available observations. The final non-MPI related parameter, used to specify the iteration-to-iteration centroid movement terminating condition threshold, was left at its default value (thresh=1e-5).

The strong and weak scaling behavior of using the MPIscipy K-Means implementation for varying sizes of observations is shown in Figure 41 below. In terms of strong scaling, the MPIscipy implementation scales reasonably well with increasing observation problem sizes, achieving nearly optimal speed up when distributed among 64 processes. After 64, the implementation exhibits a decrease in parallel efficiency (i.e., the ratio of speed up to the number of parallel resources). However, the general trend in performance for higher process counts seems to indicate that the main source of the loss in parallel efficiency is the increasingly small local problem sizes, as highlighted by the difference in strong scaling speed up between the smallest (2^15) and largest (2^20) tested observation problem sizes. This indicates that the performance decrease is likely due to more time being spent in costly communication operations than processing results.

To evaluate how performant the strong scaling results were in comparison to the sequential

SciPy implementation, the SciPy kmeans2() algorithm was executed on the same total problem sizes. For these trials, the number of iterations was set to 100 in an effort to ensure convergence of centroid positions, while, as done in the previous single process evaluation, the centroid selection method was set to ‘points’. It was found that for all total problem sizes, a minimum of 128 parallel processes were necessary for the MPIscipy implementation to meet or exceed the performance of the SciPy implementation.

In terms of weak scaling, an oscillatory behavior is present in the profile of the execution times.

As a result, a bimodal speed up profile (centered around roughly a factor of 0.95 and 1.15 depending on the single process execution time) is present for each tested problem size, which is relatively consistent up to 64 processes. Overall, the profile indicates the implementation scales reasonably well with increasing observation problem sizes. However, similar to the strong scaling, the weak scaling profile exhibits a decrease in parallel efficiency for larger numbers of processes, tapering off after

64 processes. Just as observed with the strong scaling, the magnitude of the parallel efficiency

loss is inversely proportional to the local problem size, once again indicating the time spent in communication is the primary culprit for the loss in performance.

Figure 41: MPIscipy K-Means strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

Summarizing the findings of this section, the MPIscipy K-Means implementation shows very promising results in terms of scalability, by achieving nearly optimal strong scaling speed up and consistently high weak scaling speed up for up to 64 processes. However, the scaling behavior indicates a limitation on the achievable parallel efficiency based on the local problem size of each process.

5 Conclusions

The work of this thesis aimed to create a distributed clone of the Python [86, 85] NumPy [83] library.

This library was developed to leverage MPI [43] for inter-process communication and data transfers while abstracting away the complexity of MPI programming from its users. Additionally, this work aimed to create a distributed version of a select SciPy [74] module, employing the distributed arrays provided by the previously developed library, while maintaining a high level of parallelism abstraction. To this end, the MPInumpy and MPIscipy libraries were developed.

In the case of the MPInumpy library, enabling this high level of parallelism abstraction introduced a computational overhead to emulated routines that now function in a data-parallel environment. For routines that could act solely on their local array contents, the impact of this overhead was minimal; while a considerable amount of overhead was introduced for routines that required global scope of the distributed array.

In terms of scalability to parallel compute resources, the MPInumpy library exhibited reasonably performant results; notably showing sensitivity to local process problem sizes and operations that require large amounts of collective communication and synchronization. Additionally, it was identified that the resolution of global array attributes had a considerable impact on array creation routines that otherwise would have scaled reasonably well.

For the implemented MPIscipy K-Means clustering module, a combination of algorithm heuristics and non-optimized internal operations resulted in the single process performance considerably under-performing the existing SciPy K-Means clustering implementation.

However, under the implementation's intended application scenario (i.e., instances where processing capabilities beyond a singular compute resource are required), the implementation exhibited very promising scalability.

The developed libraries are natively intended for application scenarios where problem sizes are too large to be handled by singular compute resources, or when the execution time of an operation is deemed too costly, warranting the utilization of parallel resources to reduce the execution time. For

the MPInumpy library, algorithms that involve raw numerical computations, as well as statistical aggregation of large amounts of data, would benefit from the library's usage. Meanwhile, algorithms that would require sequential (element-wise) access of a distributed array would be ill-suited to use the library. For the MPIscipy K-Means clustering module, scenarios that would benefit from its usage would be clustering of data sets that exceed the memory capabilities of singular compute resources, realizing its full potential when the number of features per observation is the source of the heavy memory requirements.

The libraries outlined in this work are still in their early prototype stages of development. For the MPInumpy library, the current design enables future development efforts to explore alternate data distribution/partitioning patterns, refine and optimize existing routines/kernels, and expand supported functionality provided by the emulated NumPy library. In terms of what to target, future development efforts would benefit from exploring additional SciPy modules to refine/guide the library's expected behavior and usage. For the implemented MPIscipy K-Means clustering module, both the heuristics (centroid selection method, execution terminating condition, etc.) and internal operations should be revisited for refinement/optimization. Fortunately, the full suite of unit tests included with the libraries permits efforts to explore optimizations and expanded functionality, knowing that integration issues will be identified. Lastly, further optimization beyond computational kernel refinement could be explored in the form of porting existing MPInumpy and

MPIscipy code written in Python to Cython [3].

Bibliography

[1] Apache Spark. Welcome to Spark Python API Docs! https://spark.apache.org/docs/latest/api/python/index.html#. Retrieved January 31, 2020.

[2] Barney, B. A Brief Word on MPI-2 and MPI-3. https://computing.llnl.gov/tutorials/mpi/#MPI2-3. Retrieved February 9, 2020.

[3] Behnel, S., Bradshaw, R., Seljebotn, D. S., Ewing, G., Stein, W., and Gellner, G. Welcome to Cython’s Documentation. https://cython.readthedocs.io/en/latest/. Retrieved February 11, 2020.

[4] BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/. Retrieved February 8, 2020.

[5] Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API Design for Machine Learning Software: Experiences from the SciKit-Learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013), pp. 108–122.

[6] The Python Programming Language. https://github.com/python/cpython. Retrieved February 5, 2020.

[7] Dagum, L., and Menon, R. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 5, 1 (Jan. 1998), pp. 46–55.

[8] Daily, J., Granger, B., Grant, R., Ragan-Kelley, M., Kness, M., Smith, K., and Spotz, B. Distributed Array Protocol. https://distributed-array-protocol.readthedocs.io/. Retrieved January 24, 2020.

[9] Dalcin, L. MPI for Python. https://mpi4py.readthedocs.io/en/stable/. Retrieved January 31, 2020.

[10] Dalcin, L. D., Paz, R. R., Kler, P. A., and Cosimo, A. Parallel Distributed Computing Using Python. Advances in Water Resources 34, 9 (2011), pp. 1124–1139.

[11] Dask Development Team. Dask: Library for Dynamic Task Scheduling. https://dask.org. Retrieved January 24, 2020.

[12] Dask Development Team. Why Dask? https://docs.dask.org/en/latest/why.html. Retrieved January 24, 2020.

[13] Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), pp. 107–113.

[14] Ensslin, T., Frommert, M., and Kitaura, F. Information Field Theory for Cosmological Perturbation Reconstruction and Non-Linear Signal Analysis. arXiv.org 80, 10 (Sept. 2009).

[15] Enthought. About. https://www.enthought.com/about/. Retrieved January 24, 2020.

[16] F2PY Users Guide and Reference Manual. https://numpy.org/doc/1.18/f2py/. Retrieved February 12, 2020.

[17] Gabriel, E. Crill Technical Data. http://pstl.cs.uh.edu/resources.shtml. Retrieved March 14, 2020.

[18] Gabriel, E. MPI Data Science Modules. https://github.com/edgargabriel/mpids. Retrieved February 2, 2020.

[19] Gabriel, E. Parallel Software Technologies Laboratory. http://pstl.cs.uh.edu/index.shtml. Retrieved February 2, 2020.

[20] Gabriel, E., Venkatesan, V., and Shah, S. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine 98, 3 (2010), pp. 231–240.

[21] Geurts, L., Meertens, L., and Pemberton, S. The ABC Programmers’ Handbook. https://homepages.cwi.nl/~steven/abc/programmers/handbook.html. Retrieved February 4, 2020.

[22] Grun, P. Introduction to InfiniBand for End Users. https://www.mellanox.com/pdf/whitepapers/Intro_to_IB_for_End_Users.pdf. Retrieved January 31, 2020.

[23] Hartigan, J. A., and Wong, M. A. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28, 1 (Mar. 1979), pp. 100–108.

[24] Hausenblas, M., and Bijnens, N. Lambda Architecture. http://lambda-architecture.net/. Retrieved February 2, 2020.

[25] Hayes, B. Programming Languages Most Used and Recommended by Data Scientists. https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/. Retrieved January 26, 2020.

[26] Hunter, J. D. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 3 (2007), pp. 90–95.

[27] Intel MPI Library. https://software.intel.com/en-us/mpi-library. Retrieved February 9, 2020.

[28] Starting the IPython Controller and Engines. https://ipython.org/ipython-doc/stable/parallel/parallel_process.html. Retrieved February 12, 2020.

[29] IPython, and Enthought. Distarray. https://distarray.readthedocs.io/. Retrieved January 24, 2020.

[30] IronPython the Python Programming Language for the .NET Framework. https://ironpython.net/. Retrieved February 5, 2020.

[31] Jupyter. About Us. https://jupyter.org/about.html. Retrieved January 24, 2020.

[32] What is Jython? https://www.jython.org/. Retrieved February 5, 2020.

[33] Kaiser, T., Brieger, L., and Healy, S. MYMPI - MPI Programming in Python. In PDPTA (Jan. 2006), pp. 458–464.

[34] Data Mining Algorithms In R/Clustering/K-Means. https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means. Retrieved February 12, 2020.

[35] Kuchling, A., and Zadka, M. What’s New in Python 2.0. https://docs.python.org/3/whatsnew/2.0.html. Retrieved February 4, 2020.

[36] LAPACK — Linear Algebra PACKage. http://www.netlib.org/lapack/. Retrieved February 8, 2020.

[37] Loveman, D. B. High Performance Fortran. IEEE Parallel Distributed Technology: Systems Applications 1, 1 (Feb. 1993), pp. 25–42.

[38] Sklearn Datasets Make Blobs. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. Retrieved March 16, 2020.

[39] Message Passing Interface Forum. 158. Virtual Topologies. https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node165.htm#Node165. Retrieved January 24, 2020.

[40] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf. Retrieved February 9, 2020.

[41] Message Passing Interface Forum. MPI 4.0. https://www.mpi-forum.org/mpi-40/. Retrieved February 9, 2020.

[42] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. https://www.mpi-forum.org/docs/mpi-1.0/mpi-10.ps. Retrieved February 9, 2020.

[43] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 3.0. https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Retrieved February 9, 2020.

[44] Message Passing Interface Forum. MPI Forum. https://www.mpi-forum.org/. Retrieved February 9, 2020.

[45] Open MPI - topo base cart coords.c. https://github.com/open-mpi/ompi/blob/master/ompi/mca/topo/base/topo_base_cart_coords.c. Retrieved February 25, 2020.

[46] MPICH. https://www.mpich.org/. Retrieved February 9, 2020.

[47] The N-dimensional Array (ndarray). https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html. Retrieved February 7, 2020.

[48] Subclassing ndarray. https://docs.scipy.org/doc/numpy/user/basics.subclassing.html. Retrieved February 18, 2020.

[49] Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable Parallel Programming with CUDA. Queue 6, 2 (Mar. 2008), pp. 40–53.

[50] Numarray 1.5.1. https://pypi.org/project/numarray/. Retrieved February 7, 2020.

[51] Numeric 24.2. https://pypi.org/project/Numeric/. Retrieved February 7, 2020.

[52] NUMFOCUS. NUMFOCUS Open Code for Better Science. https://numfocus.org/. Retrieved January 26, 2020.

[53] NumPy C-API. https://docs.scipy.org/doc/numpy/reference/c-api.html. Retrieved February 7, 2020.

[54] Data Types. https://numpy.org/devdocs/user/basics.types.html. Retrieved February 7, 2020.

[55] Structured Arrays. https://docs.scipy.org/doc/numpy/user/basics.rec.html. Retrieved February 7, 2020.

[56] Oliphant, T., and Banks, C. PEP 3118 – Revising the Buffer Protocol. https://www.python.org/dev/peps/pep-3118/, 2006. Retrieved January 24, 2020.

[57] Oliphant, T. E. Guide to NumPy. https://web.mit.edu/dvp/Public/numpybook.pdf. Retrieved February 7, 2020.

[58] OpenSUSE 42.3 Release Information. https://en.opensuse.org/Portal:42.3. Retrieved April 13, 2020.

[59] Pandas Core Team. The Pandas Project. https://pandas.pydata.org/about.html. Retrieved January 25, 2020.

[60] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[61] Pérez, F., and Granger, B. E. IPython: A System for Interactive Scientific Computing. Computing in Science and Engineering 9, 3 (May 2007), pp. 21–29.

[62] Pilgrim, M. 2.4. Everything Is an Object. https://linux.die.net/diveintopython/html/getting_to_know_python/everything_is_an_object.html. Retrieved February 5, 2020.

[63] MPI Python. https://sourceforge.net/projects/pympi/. Retrieved February 29, 2020.

[64] Pypar - Parallel Programming with Python. https://sourceforge.net/projects/pypar/files/pypar/pypar_1.9.3/. Retrieved February 29, 2020.

[65] PyPY. https://pypy.org/. Retrieved February 5, 2020.

[66] Python Software Foundation. 3. Data model. https://docs.python.org/3/reference/datamodel.html. Retrieved February 28, 2020.

[67] Python Software Foundation. Find, Install and Publish Python Packages with the Python Package Index. https://pypi.org/. Retrieved February 4, 2020.

[68] Python Software Foundation. Python Software Foundation. https://www.python.org/psf-landing/. Retrieved February 4, 2020.

[69] Python Software Foundation. The Python Standard Library. https://docs.python.org/3/library/. Retrieved February 4, 2020.

[70] Python Software Foundation. unittest — Unit Testing Framework. https://docs.python.org/3.8/library/unittest.html. Retrieved February 20, 2020.

[71] Reed, D., and Dongarra, J. Exascale Computing and Big Data. Communications of the ACM 58, 7 (June 2015), pp. 56–68.

[72] Rocklin, M. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference (2015), pp. 130–136.

[73] Sayad, S. K-Means Clustering. https://www.saedsayad.com/clustering_kmeans.htm. Retrieved February 12, 2020.

[74] SciPy. https://docs.scipy.org/doc/scipy/reference/. Retrieved February 8, 2020.

[75] SciPy: History of SciPy. https://scipy.github.io/old-wiki/pages/History_of_SciPy.html. Retrieved February 8, 2020.

[76] SciPy Cluster VQ Kmeans2. https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.vq.kmeans2.html. Retrieved February 28, 2020.

[77] Selig, M., Bell, M., Junklewitz, H., Oppermann, N., Reinecke, M., Greiner, M., Pachajoa, C., and Ensslin, T. NIFTY - Numerical Information Field Theory - A Versatile Python Library for Signal Inference. arXiv.org 554 (June 2013).

[78] Steininger, T., Greiner, M., Beaujean, F., and Enßlin, T. d2o : A Distributed Data Object for Parallel High-Performance Computing in Python. Journal of Big Data 3, 1 (Dec. 2016), pp. 1–34.

[79] 32 SWIG and Python. http://www.swig.org/Doc4.0/Python.html#Python. Retrieved February 12, 2020.

[80] The MPI Forum. MPI: A Message Passing Interface. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing (Dec. 1993), pp. 878–883.

[81] The Open MPI Project. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. Retrieved February 9, 2020.

[82] Trilinos. Pytrilinos. https://trilinos.github.io/pytrilinos.html. Retrieved January 24, 2020.

[83] Van Der Walt, S., Colbert, S. C., and Varoquaux, G. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering 13, 2 (Mar. 2011), pp. 22–30.

[84] Van Rossum, G. What’s New in Python 3.0. https://docs.python.org/3.0/whatsnew/3.0.html. Retrieved February 4, 2020.

[85] Van Rossum, G., and Drake, F. L. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009.

[86] Van Rossum, G., and Drake Jr, F. L. Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.

[87] Venners, B. The Making of Python. https://www.artima.com/intv/python.html. Retrieved February 4, 2020.

[88] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0 – Fundamental Algorithms for Scientific Computing in Python. arXiv e-prints (July 2019), arXiv:1907.10121.

[89] Walker, D. Standards for Message-Passing in a Distributed Memory Environment. https://www.osti.gov/servlets/purl/10170156. Retrieved February 9, 2020.

[90] Wikipedia. Python (programming language). https://en.wikipedia.org/wiki/Python_(programming_language). Retrieved February 5, 2020.

[91] Detection of Outliers. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm. Retrieved March 14, 2020.

[92] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (USA, 2010), p. 10.

[93] ZeroMQ. ZeroMQ Documentation. https://zeromq.org/get-started/. Retrieved January 24, 2020.
