MPI Based Python Libraries for Data Science Applications

by John Scott Rodgers

A thesis submitted to the Department of Computer Science,

College of Natural Sciences and Mathematics

in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Chair of Committee: Dr. Edgar Gabriel

Committee Member: Dr. Shishir Shah

Committee Member: Dr. Martin Huarte Espinosa

University of Houston

May 2020

ACKNOWLEDGMENTS

The work in this thesis could not have been realized without the help and support of many people.

A very special thanks must be given to my thesis advisor, Dr. Edgar Gabriel, who provided constant guidance and insight, as well as provided the resources necessary to carry out this work.

Additional special thanks must be given to Angela Braun Esq., who was kind enough to act as my

first draft copy editor. I would also like to thank my thesis committee members Dr. Shishir Shah and Dr. Martin Huarte-Espinosa for serving on the committee and providing their domain specific guidance and support. Additionally, I would like to thank the University of Houston department of Computer Science teaching faculty for providing me with the experience and tools necessary to succeed in this effort. Lastly, I would like to thank the oracles of Computer Science, Google and

StackOverflow, for always being there in my time of need.

ABSTRACT

Tools commonly leveraged to tackle large-scale Data Science workflows have traditionally shied away from existing high performance computing paradigms, largely due to their lack of fault tolerance and computation resiliency. However, these concerns are typically only of critical importance to problems tackled by technology companies at the highest level. For the average Data Scientist, the benefits of resiliency may not be as important as the overall execution performance. To this end, the work of this thesis aimed to develop prototypes of tools favored by the Data Science community that function in a data-parallel environment, taking advantage of functionality commonly used in high performance computing. To achieve this goal, a prototype distributed clone of the

Python NumPy library and a select module from the SciPy library were developed, which leverage

MPI for inter-process communication and data transfers while abstracting away the complexity of

MPI programming from its users. Through various benchmarks, the overhead introduced by the logic necessary to function in a data-parallel environment, as well as the scalability of using parallel compute resources for routines commonly used by the emulated libraries, are analyzed.

For the distributed NumPy clone, it was found that for routines that could act solely on their local array contents, the impact of the introduced overhead was minimal; while for routines that required global scope of distributed elements, a considerable amount of overhead was introduced.

In terms of scalability, both the distributed NumPy clone and the select SciPy module, a distributed implementation of K-Means clustering, exhibited reasonably performant results; notably showing sensitivity to local process problem sizes and operations that required large amounts of collective communication/synchronization. As this work mainly focused on the initial exploration and prototyping of behavior, the results of the benchmarks can be used in future development efforts to target operations for refinement and optimization.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

1 INTRODUCTION
  1.1 Motivation of Work
  1.2 Goals of Thesis
  1.3 Existing Implementations
    1.3.1 DistArray
    1.3.2 D2O
    1.3.3 Dask
  1.4 Organization of Remainder

2 BACKGROUND
  2.1 Python
  2.2 MPI
    2.2.1 Communicators
    2.2.2 Point-to-Point Communication
    2.2.3 Collective Communication
  2.3 mpi4py
  2.4 NumPy
  2.5 SciPy
  2.6 K-Means Clustering

3 CONTRIBUTION
  3.1 MPInumpy
    3.1.1 MPIArray Attributes
    3.1.2 Data Distributions
    3.1.3 Creation Routines
    3.1.4 Reductions Routines
    3.1.5 Behavior & Operations
  3.2 MPIscipy
    3.2.1 Cluster

4 EVALUATION
  4.1 MPInumpy - Evaluation
    4.1.1 Single Process Performance
    4.1.2 Scalability
  4.2 MPIscipy K-Means Clustering - Evaluation
    4.2.1 Single Process Performance
    4.2.2 Scalability

5 CONCLUSIONS

BIBLIOGRAPHY

LIST OF TABLES

1 C like Supported NumPy Data Types
2 Current SciPy Modules
3 Global MPIArray Attributes
4 Useful Distributed MPIArray Attributes

LIST OF FIGURES

1 Bandwidth comparison between InfiniBand verbs and TCP over InfiniBand.
2 Bandwidth comparison between mpi4py and native Open MPI over a QDR InfiniBand and Gigabit Ethernet networks.
3 Example interactive Python 3 shell session.
4 Distributed memory architecture showing systems, with independent CPUs and memory, connected via a network interconnect.
5 Collection of MPI processes in default MPI communicator (black) and a sub-set of processes in a sub-communicator (red).
6 Broadcast of data from MPI process 1 to all other processes.
7 Scatter of data from MPI process 1 to all other processes.
8 Gather of data from all MPI processes to process 1.
9 All gather of data from all MPI processes to all processes.
10 All to all unique exchange of data from all MPI processes to all processes. Note: number shown on colored data elements represents MPI process ID of destination.
11 Reduction (Collective Sum) of data from all MPI processes to process 1.
12 Interactive mpi4py Python example using two tmux panes to demonstrate blocking Python object send from rank 0 (left) to rank 1 (right).
13 Interactive Python example of two dimensional array storage in memory.
14 Interactive Python example of using slicing notation to return the first and third rows of a 4x4 array of elements.
15 Example of K-Means clustering on simulated data, with three features, containing two clusters.
16 MPInumpy array UML object diagram.
17 Example block distribution of 1, 2, and 3 dimensional data among three MPI processes.
18 Example MPIArray creation routine for a 5x5 block partitioned array.
19 Example MPIArray reduction routine demonstrating how to normalize all columns of a 5x5 block partitioned array by the columns' respective arithmetic mean.
20 Example MPIArray accessing routines of a 5x5 block partitioned array demonstrating how to get the element in global position 0,0 and how to set the element in global position 4,4.
21 Interactive MPInumpy Python example using two tmux panes, one for MPI process/rank 0 (left) and one for MPI process/rank 1 (right), to demonstrate the accessing routine behavior shown in Figure 20.
22 Example MPIArray local and global row iteration of a 5x5 block partitioned array.
23 Example reshape operation of a block partitioned array of shape 7x3 to a new shape of 3x7.
24 Example usage of the MPIscipy K-Means clustering method on simulated data, with one feature, containing two clusters.
25 Example usage of the SciPy K-Means clustering method on simulated data, with one feature, containing two clusters.
26 NumPy vs. MPInumpy array creation execution performance (left) and overhead (right).
27 NumPy vs. MPInumpy arithmetic operation execution performance (left) and overhead (right).
28 NumPy vs. MPInumpy reduction operation execution performance (left) and overhead (right).
29 NumPy vs. MPInumpy local access operation execution performance (left) and overhead (right).
30 NumPy vs. MPInumpy global access operation execution performance (left) and overhead (right).
31 NumPy vs. MPInumpy reshape operation execution performance (left) and overhead (right).
32 MPInumpy array creation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
33 MPInumpy arithmetic operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
34 MPInumpy reduction operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
35 MPInumpy local access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
36 MPInumpy global access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
37 MPInumpy reshape operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).
38 SciPy K-Means2 execution performance as a function of the number of features, observations, and cluster centroids.
39 MPIscipy K-Means execution performance as a function of the number of features, observations, and cluster centroids.
40 SciPy vs MPIscipy K-Means overhead as a function of the number of features, observations, and cluster centroids.
41 MPIscipy K-Means strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

1 Introduction

The field of Data Science has seen a recent explosive growth, mainly driven by the inundation of data that is produced and collected by computer systems and sensors around the world. Practitioners of the field of Data Science, known as Data Scientists, leverage analytical models and algorithms to extract insights from this collected data. In terms of the data they analyze, there is an inherent correlation between the sample size and the knowledge that can be extracted from it, as well as its statistical significance. As the amount of data grows beyond the processing capabilities of personal computers/workstations, the need for large scale parallel compute resources becomes a necessity.

Within the field of Data Science, these large scale problems fall under the category of Big Data

Analytics, which has resulted in the Big Data stack. The primary focus of this software stack has been the handling of the “4 V’s”: volume, variety, veracity, and velocity of data that is collected. Handling of the variety (variable formatting) and veracity (accuracy) of data is typically resolved in pre-processing stages prior to entering the processing pipeline. Once in the pipeline, large volumes of data are processed by frameworks such as Hadoop MapReduce [13] and Spark [92]. These processing frameworks typically offer resiliency in computation, meaning that a hardware failure of a system resource will not result in a failure to produce a result. Big

Data architectures, such as Lambda [24], are used to manage the velocity (speed) at which data is collected by orchestrating the processing pipeline.

In the traditional academic and scientific research domain, problems requiring large scale par- allel compute resources typically fall under the category of High Performance Computing (HPC), which leverages its own independent HPC software stack. In contrast to the Big Data software stack, the HPC software stack is primarily focused on raw performance with little concern for computation resiliency. This focus on performance is largely achievable due to the type of data that is processed: typically numeric in value, consistent in format, and of a known fixed size. The most prevalent HPC programming frameworks in use today include the Message Passing Interface

(MPI) [43], OpenMP [7], and CUDA [49]. The Big Data and HPC software stacks have largely

evolved independently of each other, as they focus on their domain specific problems [71].

1.1 Motivation of Work

Python [86, 85] is currently the programming language of choice within the Data Science community [25]. This is most likely due to the language’s ease of use and its rich ecosystem of open source scientific libraries. Additionally, non-profit organizations for the advancement of scientific computing, such as NUMFOCUS [52], have supported the development efforts of Python-based scientific libraries, including NumPy [83], Matplotlib [26], Pandas [59], and SciPy [88]; demonstrating an expected positive impact on the scientific community through the advancement of the Python ecosystem.

Within the Python ecosystem there currently exists a number of parallel Data Science libraries, such as PySpark [1] and Dask [11]. In an effort to ensure computation resiliency, these libraries have shied away from communication platforms that are not fault tolerant, instead relying on Internet

Protocol (IP) based protocols for resource communication and bulk data transfer. While resiliency is desirable for incredibly large scale problems, such as indexing the internet in the case of Google, for the average Data Scientist the benefits of resiliency may not be as important as the overall execution performance. Additionally, compute centers housing the parallel resources (clusters) often invest in expensive low-latency high-bandwidth network interconnects, such as Mellanox’s

InfiniBand (IB) [22] interconnect. To fully utilize these high-end networks, libraries must make use of the hardware-provided APIs. While these interconnects do support data communication via standard IP protocols (IP over IB), a blocking data exchange benchmark over a QDR InfiniBand network shown in Figure 1 demonstrates that a considerable amount of performance is not realized without making use of the full capabilities of the networking hardware.

Figure 1: Bandwidth comparison between InfiniBand verbs and TCP over InfiniBand.

One widely used parallel communication framework in the HPC software stack is MPI. The

Message Passing Interface (MPI) has traditionally been ignored by Big Data software developers due to its lack of fault tolerance. However, because MPI was developed within the HPC software stack, it makes full use of high-end network interconnects as well as fully leverages state-of-the-art compute systems. These qualities make MPI an excellent choice as the communication platform for this work.

Currently MPI does not include language bindings for Python. However, there are several libraries that provide Python MPI interfaces, the most popular of which is mpi4py [10]. The mpi4py library was built on top of the MPI-1/2/3 specifications and provides an interface similar to the

MPI-2 C++ bindings [9]. Additionally, the library makes full use of high-end network interconnects and, when leveraging contiguous memory buffers provided by the Python NumPy library, can obtain communication performance comparable to applications written in C. This performance is demonstrated in a blocking data exchange benchmark over a QDR Infiniband interconnect and a

Gigabit Ethernet interconnect shown in Figure 2 below.

Figure 2: Bandwidth comparison between mpi4py and native Open MPI over a QDR InfiniBand and Gigabit Ethernet networks.

Given all of the performance benefits of using MPI as the communication platform, why do we not find Python Data Science libraries that utilize it? As previously discussed, the most immediate reason is its lack of fault tolerance, a requirement for computation resiliency. An additional factor may be that writing MPI code is difficult, typically requiring a complete refactoring of an application to resolve data decomposition and communication patterns that are not typically found in the

Big Data software stack. These factors should not immediately disqualify the use of MPI; rather, this should be seen as a motivation for members of the HPC community to find appropriate applications/use cases to improve the computational performance of workflows for Data Scientists.

1.2 Goals of Thesis

The primary goal of this thesis is to contribute to the development of the Python MPI Data

Science (MPIDS) [18] library, which is currently under development by members of the Parallel

Software Technologies Laboratory [19] at the University of Houston. In accord with the motivation for this work, the library aims to leverage MPI for inter-process communication and data transfers while abstracting away the complexity of MPI programming from its users.

To achieve this goal, a prototype distributed clone of the Python NumPy library was created.

The created library, MPInumpy, emulates the usage and behavior of a NumPy array while focusing on the creation and management of a distributed array. MPInumpy reuses functionality found in NumPy when possible and re-implements functionality as required to function in a data-parallel environment. Additionally, a distributed version of a select SciPy module was developed to create portions of the MPIscipy library, which employed distributed arrays created by the MPInumpy library.

Code developed in this thesis was compared against the base libraries it intends to emulate.

This comparison involved profiling the developed code’s execution performance and scalability under typical use cases.

1.3 Existing Implementations

There are several Python libraries that have created distributed NumPy-like arrays. A subset of these implementations, which served as key inspirations in the development of the

MPInumpy library, is discussed in the following sections.

1.3.1 DistArray

The DistArray [29] library was developed by Enthought Inc. [15] in partnership with members of the (Py)Trilinos [82] project and the IPython project [31]. The goal of the project was to develop a multidimensional NumPy-like distributed array for Python that combined the interactivity of the IPython platform with the performance of the MPI framework.

To realize this goal, an extensive quantity of native NumPy functionality was re-implemented in a distributed framework. DistArray arrays support most unary and binary operations, indexing and slicing, and reductions commonly found in a NumPy ndarray object.

Within this library, the distribution of data is managed by an internal distribution object, which helps local arrays translate between local and global index spaces. It supports block, cyclic, and block-cyclic data distributions similar to the ones found in High Performance Fortran

(HPF) [37], as well as unstructured and non-distributed distributions. Mapping, or partitioning, of data to a given process is achieved by placing the processes on a cartesian grid, which is compatible with the MPI-2 virtual topologies [39] generated by MPI's MPI_Cart_create.

After the data is partitioned, knowledge of where a given data element resides among the pool of processes is resolved with the help of the distributed array protocol (DAP) [8]. This protocol was modeled after the existing Python PEP-3118 buffer protocol [56]. The DAP provides metadata containing the version of the DAP, the process local section of the distributed array(buffer), and a dimensional dictionary that describes how the data is distributed. Usage of this library requires that all producing and consuming operations have the ability to process the DAP.

Programs developed using the DistArray library require a master worker pattern in which a client process is designated as the master to delegate work to the worker processes, which are referred to as “engines”. When using IPython, communication between the client and engines is handled using the asynchronous messaging library ZeroMQ [93]. An alternative way to execute a

DistArray program is to use MPI-only mode. However, when using MPI-only mode, an additional

MPI process is still required to act as the client.

This final design choice, requiring an additional process to be created as the client (primarily for interactive usage), makes it unsuitable for our intended application. This is because the extra process will ultimately lower an application's parallel efficiency and introduce unnecessary communication/coordination overhead between the client and workers.

1.3.2 D2O

D2O [78] is another distributed NumPy-like array project developed by researchers at the Max

Planck Institute for Astrophysics. The goal of the project was to create a Python distributed array for use in the Numerical Information Field Theory (NIFTY) [77] library, used in information field theory to build signal inference algorithms [14].

Similar to the design of DistArray, this goal was realized through re-implementing native NumPy functionality in a distributed framework. However, given that this project was developed for a

specific use, some desirable functionality was never implemented. Most notably, this library lacks support for reshaping (or redistribution) of the array after creation, along with not supporting

NumPy-like object slicing.

In D2O, the distribution of data is managed by a “distribution factory”, which partitions data among processes based on the specified distribution. Supported distributions include: non-distributed, equal (blocked based on the first axis), fftw (optimized for FFT), and freeform (manual specification of distribution at creation).

Unlike the design of DistArray, a master-worker design pattern was not enforced on the library.

As a result, programs developed using D2O can execute seamlessly on a cluster environment using

MPI as its communication platform.

Comparing D2O to DistArray, this library was much closer to our intended application in terms of design. However, D2O was deemed unsuitable for our needs due to its lack of desirable functionality and its application-specific distribution patterns.

1.3.3 Dask

The goal of the Dask project is to leverage blocked algorithms and dynamic memory aware task scheduling to create a parallel out-of-core framework that enables existing data analytics libraries in the Python ecosystem to be scaled to multi-core and distributed systems [12, 72]. To realize this goal, the project has been developed in coordination with popular Python projects used in the field of data analytics such as NumPy, Pandas, and Scikit-Learn [5, 60]. By working with the larger

Python community, this project has been able to create distributed versions of these libraries with familiar APIs and usage, borrowing as much as it can from the base library it aims to enhance.

At its core, Dask represents all computations as a directed acyclic graph of tasks with data dependencies [72]. It uses a specialized dynamic scheduler to optimize and resolve the work necessary to complete a computation. Additionally, most tasks/computations use the lazy evaluation execution strategy, delaying the execution of a given task until its result is required by another computation or triggered by a compute() method call.

Focusing our attention on the project's distributed NumPy implementation, the Dask Array, the distribution and partitioning of data can either be handled automatically by the library or as specified by the user. Partitions are created in 'chunks', or blocks, of either a fixed amount of data (128 MiB by default) if done automatically, or of user-specified amounts of data items specified by dimension.
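As a brief illustration of this chunked model, the sketch below uses the public dask.array API (an illustrative example, not code from this thesis) to build a blocked array and lazily reduce it:

import dask.array as da

# Build a 10000x10000 array of ones, partitioned into 1000x1000 chunks.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# No computation happens yet; this only extends the task graph.
total = (x + 1).sum()

# Execution is deferred until explicitly requested.
print(total.compute())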

Programs using the Dask framework require a master-worker pattern, similar to DistArray, as well as a dedicated process to act as the task scheduler. When utilized in a distributed environment, communication and bulk data transfers are handled via TCP sockets. Additionally, while an MPI deployment mode is available for use on HPC clusters, MPI is only used to start the Dask cluster and not for inter-process communication.

The requirement of two additional processes, one for the client and one for the task scheduler, and the lack of support for the high-speed network interconnects that MPI can utilize, make this framework sub-optimal for our intended application.

1.4 Organization of Remainder

The organization of the remainder of this thesis is as follows: Chapter 2 will provide background on the libraries used in this work as well as their application domain, Chapter 3 will cover the contributions of this thesis’s work in terms of library development, Chapter 4 will provide an evaluation of work developed in Chapter 3, and finally Chapter 5 will cover final conclusions and outline the next steps for future work.

2 Background

This chapter will provide a brief background on the libraries, frameworks, and use cases used in this work.

2.1 Python

Python [86, 85] is an interpreted, object-oriented, general-purpose programming language that was created in the late 1980s [87]. The language’s initial development was heavily influenced by the

ABC programming language [21]. Python continues to be actively supported by the non-profit

Python Software Foundation [68] with developers all around the world.

Since its initial release, the language has had two major updates. The first, Python 2 [35], was released in 2000 and introduced the language's Python Enhancement Proposal (PEP) system for proposing new features within the development community. Notable added features in this release included: support for Unicode, list comprehensions, augmented assignment operators, and a cycle-detecting garbage collector. The second major update, released in 2008, Python 3 [84], was the language's first intentionally backwards incompatible release. Notable changes in this release included: the print function, syntax, comparison ordering, and the return behavior of builtin objects. As of the start of 2020, Python 2 has reached its end of life and users can only expect future support and development of Python 3.

Python offers an extensive standard library [69] as well as a repository of community developed open source packages called the Python Package Index (PyPI) [67], containing well over 200,000 projects. Packages from this repository can be installed via any platform's terminal using the

Python pip command (e.g., pip install <package name>).

Python is a fully object-oriented language, meaning that everything in Python is an object.

This includes typical elements of any language such as strings, lists, functions/methods, and even modules. While Python objects may not all have attributes, methods, or be subclassable, every object can be assigned to a variable or be passed to a function [62].

9 Code developed in Python is executed by an interpreter. CPython [6] is the core/reference interpreter for the language, and first compiles Python code into bytecode prior to interpreting it. Other notable Python interpreters include PyPy [65] (which features a just-in-time compiler for performance), Jython [32] (which compiles Python code to Java bytecode for use in JVM’s),

IronPython [30] (an implementation for the .NET framework), and IPython [61] (which focuses on extended interactivity beyond the standard interpreter). These interpreters are able to run Python code either directly as a script in a terminal (e.g., python <script name>.py), or interactively in a shell session by invoking the interpreter. An example of defining a function in an interactive shell session using the Python 3 interpreter is shown in Figure 3 below.

Figure 3: Example interactive Python 3 shell session.

As a consequence of Python's flexibility, the language is leveraged in a wide number of application domains. Python acts as a scripting language in web applications and major software products. Moreover, it is an effective platform in the scientific computing domain. Additionally, the language is not only a standard component of most Linux distributions, but is also sometimes used to create the installers [90].

2.2 MPI

The Message Passing Interface (MPI) [43] is a communication specification that was designed to be used on distributed memory architectures (clusters of computers), such as the one shown in Figure 4.

The development of MPI started as a workshop in April 1992. This workshop aimed to assess the need for a message-passing specification on distributed memory systems, and was attended by

members of universities, government laboratories, as well as hardware and software vendors. This exploratory workshop created a working group tasked with creating a practical specification [89].

In 1993, a draft of the MPI specification was presented to the high performance community at The

International Conference for High Performance Computing, Networking, Storage, and Analysis for input and feedback [80]. Since then, the MPI Forum [44] has led the efforts of updating and maintaining the specification.

Figure 4: Distributed memory architecture showing systems, with independent CPU’s and memory, connected via a network interconnect.

Version 1.0 of the MPI specification [42] was released in 1994 and outlined the nomenclature and expected features of the specification. These features included: bindings for C and FORTRAN, point-to-point communication, collective communications, process topologies, and environmental management. Version 2.0 of the specification [40] was later released in 1997 and was the first major revision. Along with this revision came new functionality such as: dynamic processes, one-sided communication, extended collective operations, C++ language bindings, and parallel I/O [2]. In

2012, Version 3.0 of the specification [43] was released, adding a considerable number of extensions to previous functionality [2]. The next major version of the specification, Version 4.0 [41], is in active development and hopes to add: extensions for hybrid programming models, support for fault tolerance, persistent collectives, and remote memory access. While the MPI specification mandates an expected API, it does not outline a specific means of implementation. As a result, there are

several implementations of MPI, such as MPICH [46], Open MPI [81], and Intel MPI [27].

The following sub-sections will cover some of the core elements of the MPI specification.

Note: This is a brief overview of core elements and by no means covers the vast capabilities outlined in the MPI specification.

2.2.1 Communicators

MPI communicators are objects used to define collections of processes/resources. The default communicator, MPI_COMM_WORLD, contains all processes in a program's execution. Within a communicator, individual processes are resolved using their assigned “rank”, which acts as a unique process ID with ordered integer values starting from zero. These communicators are used by the various message-passing routines to determine the start and end point of information exchanges.

The MPI specification also provides mechanisms to create sub-communicators, or subsets of processes that are grouped together with unique rank’s within that sub-communicator. These additional communicators can be dynamically created or destroyed during a program’s execution.

Figure 5 below provides a visual representation of MPI processes as viewed by the default global communicator and by a sub-communicator.

Figure 5: Collection of MPI processes in default MPI communicator (black) and a sub-set of processes in a sub-communicator (red).
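To make these concepts concrete, the following minimal mpi4py sketch (illustrative only, not taken from the thesis) queries the default communicator and splits it into two sub-communicators:

from mpi4py import MPI

comm = MPI.COMM_WORLD    # default communicator containing every process
rank = comm.Get_rank()   # unique process ID within the communicator
size = comm.Get_size()   # total number of processes

# Group even- and odd-ranked processes into two sub-communicators;
# each process is assigned a new rank within its sub-communicator.
sub_comm = comm.Split(color=rank % 2, key=rank)
print("global rank %d of %d -> sub-communicator rank %d" % (rank, size, sub_comm.Get_rank()))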

2.2.2 Point-to-Point Communication

The MPI specification outlines mechanisms for point-to-point communications between processes within a communicator. This can be thought of as a direct process to process exchange of data.

To perform this exchange, both the sending and receiving process require the following knowledge: location of the data (buffer) used for the exchange, the amount of data (count of elements), the type of the data (to resolve displacements), and the respective rank of the other process involved in operation.

There are several different types of point-to-point communication routines. They can be “blocking”, meaning both the sending and receiving process do not return from the exchange routine until the data buffer is safe to re-use, effectively halting the execution of the process. They can also be

“non-blocking”, meaning the sending and receiving process return from the exchange routine and continue onto the next instruction. This latter method is particularly useful when computations can be overlapped with message exchanges, allowing for a higher level of parallelism. Additional types of routines include synchronous sends, buffered sends, and combined send/receive operations.
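A minimal mpi4py sketch of a blocking buffer exchange and a non-blocking object exchange between two processes might look like the following (illustrative only; the library itself is covered in section 2.3):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(4, dtype=np.float64)
    comm.Send(data, dest=1, tag=0)                    # blocking buffer send
    req = comm.isend({"done": True}, dest=1, tag=1)   # non-blocking object send
    req.wait()                                        # safe to reuse the object after this
elif rank == 1:
    buf = np.empty(4, dtype=np.float64)
    comm.Recv(buf, source=0, tag=0)                   # blocking buffer receive
    msg = comm.recv(source=0, tag=1)                  # matching object receive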

2.2.3 Collective Communication

All collective communication routines act on all processes within a given communicator. These routines can be grouped into three different categories: synchronization routines, data movement routines, and computation routines.

Synchronization routines are used to control the flow of execution of an MPI program.

MPI_Barrier is one such routine, which will halt the execution of processes until all processes in a communicator reach the barrier.

Data movement routines are used to distribute or collect data among processes within a communicator. Examples of two distribution operations are shown in Figures 6 & 7. Figure 6 shows the result of a broadcast (MPI_Bcast) routine, which results in a copy of the data found on the broadcasting process being sent to all others. Figure 7 shows the result of a scatter (MPI_Scatter) routine, which results in the data found on the scattering process being distributed among the

available processes.

Figure 6: Broadcast of data from MPI process 1 to all other processes.

Figure 7: Scatter of data from MPI process 1 to all other processes.

Examples of two data collection operations are shown in Figures 8 & 9. Figure 8 shows the result of a gather (MPI_Gather) routine, which results in unique data from all processes being collected by a given process. Figure 9 shows a more complicated version of the gather routine, known as an all-gather (MPI_Allgather), which results in all processes collectively gathering the process unique data elements.

Figure 8: Gather of data from all MPI processes to process 1.

Figure 9: All gather of data from all MPI processes to all processes.

These data movement routines also come in combined forms, such as the operation shown in

Figure 10. This operation is known as an all-to-all (MPI_Alltoall) exchange of data. An all-to-all exchange is a collective scatter/gather routine where each process sends a unique piece of data to each process. This routine is particularly useful when needing to transpose elements of a matrix.

Figure 10: All to all unique exchange of data from all MPI processes to all processes. Note: number shown on colored data elements represents MPI process ID of destination.

Collective computation routines are used to determine a result from data found among processes within a communicator. These operations can be seen as reductions of the data elements using a specified operator. Figure 11 shows the result of performing an all-reduce (MPI_Allreduce) with summation specified as the operation. In addition to predefined reduction operators (MPI_MAX,

MPI_MIN, MPI_SUM, MPI_PROD, etc.), MPI supports user defined reduction functions.

Figure 11: Reduction (Collective Sum) of data from all MPI processes to process 1.
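These collective routines map directly onto mpi4py calls; the short sketch below (an illustrative example assuming NumPy buffers, not code from this thesis) exercises a broadcast, a scatter/gather pair, and an all-reduce:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Broadcast: every process receives a copy of the root's data.
settings = comm.bcast({"iterations": 10} if rank == 0 else None, root=0)

# Scatter/Gather: the root distributes one row per process, then collects them back.
rows = np.arange(size * 4, dtype=np.float64).reshape(size, 4) if rank == 0 else None
local_row = np.empty(4, dtype=np.float64)
comm.Scatter(rows, local_row, root=0)
gathered = np.empty((size, 4), dtype=np.float64) if rank == 0 else None
comm.Gather(local_row, gathered, root=0)

# Reduction: sum each process's local value; every process receives the result.
local_sum = np.array([local_row.sum()])
global_sum = np.empty(1)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)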

2.3 mpi4py

As mentioned in section 1.1, the MPI standard does not include language bindings for Python.

However, there are several existing Python libraries that provide MPI interfaces, such as pypar [64],

mympi [33], mpipy [63], and mpi4py [10]. The mpi4py library is currently the most popular among this group and provides MPI interfaces that resemble the MPI-2 C++ bindings. This is achieved by leveraging Cython [3] to create appropriate C extensions for use by the Python interpreter.

Additionally, with the help of mpi4py, existing MPI codes developed in C and FORTRAN can be executed in Python by taking advantage of the SWIG [79] and F2PY [16] code wrappers.

The mpi4py library supports nearly all of the MPI-1/2/3 standard routines. This includes the core elements discussed in the previous section, as well as dynamic process management, one-sided communications, parallel I/O, and environment management.

In terms of data communication, mpi4py supports the exchange of Python objects as well as contiguous memory buffers, such as the ones provided by the NumPy [83] package. However, when passing Python objects between MPI processes, a considerable amount of overhead is introduced due to the need to “pickle” (serialize) the objects prior to sending and un-pickling after receiving.

Complementing versions of message-passing routines are defined with varying case based on the object that will be sent (e.g., send(python_object, ...) vs. Send(numpy_object, ...)).
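The case convention is illustrated in the hedged sketch below (not the thesis's code): lowercase methods pickle generic Python objects, while the capitalized variants transfer contiguous NumPy buffers directly.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"labels": ["a", "b"]}, dest=1)     # generic Python object, pickled
    comm.Send(np.arange(10, dtype='i'), dest=1)   # contiguous NumPy buffer, no pickling
elif rank == 1:
    obj = comm.recv(source=0)                     # un-pickled on receipt
    arr = np.empty(10, dtype='i')
    comm.Recv(arr, source=0)                      # received directly into the buffer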

Code developed using mpi4py can be executed as a standalone Python script by pre-pending the Python interpreter call with a standard MPI executor (e.g., mpiexec -n <number of processes> python3 MPI_python_code.py). Additionally, mpi4py code can be developed/executed interactively either by using IPython's ipcluster tool [28] or individual xterm/tmux sessions (one per MPI process).

An example of using an interactive tmux session to perform a blocking send of a Python object, between two MPI processes, is shown in Figure 12 below.

Figure 12: Interactive mpi4py Python example using two tmux panes to demonstrate blocking Python object send from rank 0 (left) to rank 1 (right).

2.4 NumPy

NumPy [83] is one of the core packages of the SciPy [88] stack for scientific computing in Python.

The package was the result of efforts to unify SciPy's numeric array developers around a single package. This was achieved by porting functionality from Numarray [50] to Numeric [51], which later became NumPy in 2005 [57]. The package continues to be actively developed and supported by NUMFOCUS, the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, and

Tidelift.

The NumPy package provides a multidimensional array for efficient numerical computations.

These N-dimensional array (ndarray) objects are uniform collections of elements that are stored contiguously in memory. The computational efficiency of this package is realized by making use of routines written in lower-level languages, such as C and Fortran, which are commonly used in high-performance computations on numerical data. Additionally, NumPy makes use of the basic linear algebra subprograms (BLAS) [4] and the linear algebra package (LAPACK) [36] for efficient linear algebra computations. NumPy also provides a C-API [53] which allows developers to further extend and enhance the library.

ndarray objects are created via NumPy’s array creation routines. They take an array like object, such as a list of numerical data or nested lists of data, and return an ndarray object. Once created, the object describes the data in memory using the following attributes: a pointer to the

18 address of the first byte of the array, the data type of the elements of the array, the N-dimensional shape of the array, the strides (number of bytes) by dimension to traverse the array, and flags that provide auxiliary information on how the data is stored [83].

NumPy arrays support a wide range of predefined data types, as well as structured data types, which are a composition of data types organized as named fields [55]. A sample of supported data types that are similar to those in C, sourced from the developers' documentation [54], is shown in

Table 1.

Table 1: C like Supported NumPy Data Types

NumPy type      C type          Description
np.int8         int8_t          Byte (-128 to 127)
np.int16        int16_t         Integer (-32768 to 32767)
np.int32        int32_t         Integer (-2147483648 to 2147483647)
np.int64        int64_t         Integer (-9223372036854775808 to 9223372036854775807)
np.uint8        uint8_t         Unsigned integer (0 to 255)
np.uint16       uint16_t        Unsigned integer (0 to 65535)
np.uint32       uint32_t        Unsigned integer (0 to 4294967295)
np.uint64       uint64_t        Unsigned integer (0 to 18446744073709551615)
np.intp         intptr_t        Integer used for indexing, typically the same as ssize_t
np.uintp        uintptr_t       Integer large enough to hold a pointer
np.float32      float
np.float64      double          Note that this matches the precision of the builtin python float.
np.complex64    float complex   Complex number, represented by two 32-bit floats (real and imaginary components)
np.complex128   double complex  Note that this matches the precision of the builtin python complex.

As previously mentioned, ndarrays are stored contiguously in memory. Storage of the data elements follows a strided indexing scheme, meaning that for the N-dimensional index (n_0, n_1, ..., n_{N-1}) the corresponding offset (in bytes) from the start of the memory block can be resolved by the following [47]:

n_{offset} = \sum_{k=0}^{N-1} s_k n_k    (1)

where s_k in Equation 1 is the stride along axis k, and n_k is the position of the element along that axis. Using an interactive Python shell to inspect the inner workings of these arrays, as shown in Figure 13 below, provides a visual representation of this mapping.

Figure 13: Interactive Python example of two dimensional array storage in memory.

In Figure 13 above, we first use the NumPy array creation routine arange to generate an array containing 16 32-bit integers with values from 0 to 15. Note: by default, these arrays are created in C row-major ordering. At the time of creation, the array is reshaped to a 4x4 matrix using the reshape method. The 2-dimensional representation of this array can be seen after calling the created array arr. By calling arr.base one can see how the 2-dimensional array is being stored contiguously in memory, with each integer stored sequentially. Inspecting the strides, or the number of bytes necessary to traverse in either dimension to get to the next element, by calling arr.strides, one can see that along the zeroth axis (along rows) 16 bytes need to be traversed (4 [elements] * 32 [bits/element] / 8 [bits/byte] = 16 [bytes]) to go from one row to another, while along the first axis (along columns) 4 bytes need to be traversed (1 [element] * 32 [bits/element] / 8 [bits/byte] = 4 [bytes]) to move between the elements of a row.
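The stride arithmetic of Equation 1 can be verified directly from an ndarray's strides attribute; the snippet below mirrors the 4x4 example above (an illustrative check, not part of the referenced figure):

import numpy as np

arr = np.arange(16, dtype=np.int32).reshape(4, 4)
index = (2, 3)   # third row, fourth column

# Offset in bytes from the start of the memory block (Equation 1).
offset = sum(s * n for s, n in zip(arr.strides, index))
print(arr.strides, offset)                           # (16, 4) and 2*16 + 3*4 = 44 bytes
print(arr[index], arr.base[offset // arr.itemsize])  # both print 11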

NumPy arrays use zero-based semantics for accessing/indexing of elements. For a 1-dimensional vector of N elements the first element is at index 0, while the last element is at index N-1. Honoring the standard method for Python object indexing, elements are indexed by using square brackets on the array object (e.g., for a 1-Dimensional array, array_1D[0] returns the first element of the array). For multi-dimensional arrays, access of single elements is done by specifying the individual axis coordinates separated by commas (e.g., for a 2-Dimensional array of shape 4x4, array_2D[3, 3] returns the element in the fourth row and column). NumPy arrays also feature slicing and strided accessing using the notation array[start:stop:step], where start is the inclusive starting position along an axis, stop is the exclusive stopping position, and step is the stride taken from the starting position. An example of slicing the first and third row of a 4x4 NumPy array is shown in

Figure 14 below.

Figure 14: Interactive Python example of using slicing notation to return the first and third rows of a 4x4 array of elements.
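The slicing behavior illustrated in Figure 14 can be reproduced with a single expression; the snippet below is an illustrative sketch rather than the figure's exact session:

import numpy as np

arr = np.arange(16).reshape(4, 4)

# start=0, stop omitted (end of axis), step=2: returns rows 0 and 2 (the first and third rows).
print(arr[0::2])

# Single elements use per-axis coordinates; this returns the element in the fourth row and column.
print(arr[3, 3])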

2.5 SciPy

The SciPy [74] package, not to be confused with the SciPy [88] stack (collection of libraries), is one of the core packages of the SciPy software stack for scientific computing. The package was the result of merging scientific computation modules built on top of the Numeric package (predecessor to NumPy) into one single package [75]. Like NumPy, the project continues to be actively developed by an open source community and is supported by NUMFOCUS.

The SciPy library is built on top of the NumPy package to leverage NumPy's efficiency in numerical computations. It contains a wide variety of scientific modules that are commonly used in

Data Science and engineering workflows. Table 2 below provides an overview of the SciPy modules, sourced from the package's current reference API.

Table 2: Current SciPy Modules

SciPy Module   Description
cluster        Clustering algorithms: vector quantization, k-means, hierarchical
constants      Physical and mathematical constants and units
fft            Discrete Fourier transforms: FFTs and DST/DCT
fftpack        Legacy version of the fft module
integrate      Integration and ODE solvers
interpolate    Single- and multi-dimensional interpolation routines
io             Data input and output modules for a variety of file formats
linalg         Linear algebra routines
misc           Miscellaneous support routines
ndimage        Routines for multi-dimensional image processing
odr            Orthogonal distance regression routines
optimize       Optimization and root finding routines
signal         Signal processing routines
sparse         Two-dimensional sparse matrix routines for numeric data
spatial        Spatial algorithms and data structures
special        Special functions: elliptic functions and integrals, Bessel, raw statistical, information theory, gamma, etc.
stats          Statistical functions and a large number of probability distributions

One of the goals of this thesis is the creation of distributed versions of select SciPy modules.

The next section will focus on the tool provided by the SciPy cluster module, namely K-Means clustering.

2.6 K-Means Clustering

K-Means clustering is an algorithm commonly leveraged by Data Scientists in classification workflows. Fundamentally, the practice seeks to partition a set of observations into k clusters (or groups) that are most similar to each other based on observable features. The usage of this clustering technique is not limited to just Data Science applications; it is commonly used in image compression, digital signal processing, natural language processing, and many other domains.

At its core, the K-Means algorithm attempts to minimize intra-cluster variance, which can be represented by the squared error function [73, 34]:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2    (2)

where J in Equation 2 is the objective function we attempt to minimize, k is the number of clusters, n is the number of observations, x_i^{(j)} is an individual observation (feature vector) assigned to cluster j, and c_j is a given centroid (cluster center). Note: the term \| x_i^{(j)} - c_j \| in the above represents the Euclidean distance between a given observation and a given cluster centroid.
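Given a set of observations, centroids, and cluster labels, the objective in Equation 2 can be evaluated in a few lines of NumPy; the snippet below is a worked illustration only, not part of the MPIscipy implementation:

import numpy as np

def kmeans_objective(X, centroids, labels):
    # Sum of squared Euclidean distances between observations and their assigned centroids.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
centroids = np.array([[0.1, 0.05], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_objective(X, centroids, labels))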

There are four key stages in a typical K-Means clustering algorithm:

1. Initialize k cluster centroids.

2. Assign each of the observations to the nearest centroid based on their euclidean distance.

3. Recompute centroids from assigned observations.

4. Repeat steps 2 & 3 until centroids converge.

Pseudocode for a K-Means clustering algorithm implementing the above stages is shown in

Algorithm 1 below. Lines 1-3 correspond to stage 1, lines 6-14 correspond to stage 2, lines 15-

17 correspond to stage 3, and the termination condition, stage 4, corresponds to line 18. This algorithm is very sensitive to the initial centroid assignments and does not guarantee a globally optimal solution [23]. To address this, some implementations will randomly select centroids at the start of each iteration of stages 2 & 3. However, by doing so, additional terminating conditions, such as a maximum number of iterations, must be introduced to prevent the implementation from executing indefinitely.

Algorithm 1: K-Means Clustering

Data:   X = {x1, x2, ..., xn}   // set of observations
        k                       // number of clusters
Result: C = {c1, c2, ..., ck}   // set of cluster centroids
        L = {l1, l2, ..., ln}   // set of labels for X
 1  foreach ci ∈ C do
 2      ci ← xj ∈ X                              // select k initial centroids from X
 3  end
 4  repeat
 5      foreach xj ∈ X do
 6          minDist ← ∞
 7          foreach ci ∈ C do
 8              dist ← computeDistance(xj, ci)   // between centroid and observation
 9              if dist < minDist then
10                  lj ← i                       // assign cluster to observation
11                  minDist ← dist
12              end
13          end
14      end
15      foreach ci ∈ C do
16          updateCentroids(ci)                  // compute new center from closest observations
17      end
18  until ∆C = 0

An example of executing an implementation of the K-Means algorithm provided by the SciPy cluster module on simulated data with three features per observation is shown in Figure 15 below.

Figure 15: Example of K-Means clustering on simulated data, with three features, containing two clusters.
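Since the figure itself is not reproduced here, the sketch below shows a comparable use of SciPy's K-Means implementation on simulated data with three features and two clusters; the data generation and parameter choices are illustrative assumptions, not the exact session from Figure 15:

import numpy as np
from scipy.cluster.vq import kmeans2

# Simulate two well-separated clusters of observations, each with three features.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(100, 3))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(100, 3))
observations = np.vstack([cluster_a, cluster_b])

# kmeans2 returns the cluster centroids and a label for every observation.
centroids, labels = kmeans2(observations, 2, minit='points')
print(centroids)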

3 Contribution

The main contributions of this work involve the initial creation of the MPInumpy and MPIscipy libraries for use in the MPIDS [18] project. This chapter will cover the creation, design, and usage of those developed libraries.

All work presented in this chapter was written in Python 3 [85] and developed with a full suite of unit tests using the Python unittest library [70] to enable ease of future development. Additionally, all MPI based inter-process communications were handled using the mpi4py [10] library discussed in section 2.3. Execution of code developed using the libraries outlined in this chapter can be done in the same manner as code developed using mpi4py (e.g., mpiexec -n <number of processes> python3 MPIDS_python_code.py).

3.1 MPInumpy

Applying the lessons learned from section 1.3, the development of the MPInumpy library attempts to re-use as much functionality as possible from the base library it aims to emulate, NumPy [83], in a data-parallel environment. To achieve this goal, MPInumpy introduces the MPIArray object, which is subclassed from the NumPy ndarray object. Subclassing the ndarray object [48] enables the inheritance of the base class's pre-defined properties and operations. This allows development efforts to focus on the functionality that would need to be re-implemented in a distributed framework.

Figure 16 below presents a high-level Unified Modeling Language (UML) object diagram showing the main components of the MPInumpy library, as well as how they interact with each other.

Fundamentally, an array generated using this library creates an MPIArray object, which acts as a subclass of the NumPy ndarray object and a pseudo-abstract base class for the data distribution objects. Data distribution objects act as concrete implementations of the MPIArray, which resolve how native NumPy behavior is re-implemented in their respective data partitioning schemes. This design choice permits new data distributions to be integrated as desired, requiring only that the

expected MPIArray methods and properties are implemented for that partitioning scheme. Actual creation (processing of input and data partitioning among processes) of the array objects is managed by the various array creation routines. These main components of the MPInumpy library, as well as their supported operations and behavior, are discussed in greater detail in the remaining sections of this chapter.

Figure 16: MPInumpy array UML object diagram.

3.1.1 MPIArray Attributes

As previously discussed, the MPIArray object will natively inherit all of the key properties and components of its ndarray base class. These include attributes such as the data type, shape, size, number of bytes, and dimensions of the array object. However, since these array objects will be distributed among many processes, these inherited attributes will reflect only the properties of the array elements that are found locally on a given process. As a result, complementing versions of select attributes are introduced in the MPIArray object to reflect the global resolution of these properties.

These global properties are presented to the user as a name-shifted version of the base property they intend to resolve in a distributed environment (e.g., the global version of the shape property is globalshape). The means of accessing these introduced properties is implemented in a consistent manner, where a given attribute can be returned via MPIArray.<attribute name>. The actual computation of the global properties is handled by the data distribution objects implementing the MPIArray pseudo-abstract base class. Table 3 below provides a description of the expected global attributes to be resolved by concrete implementations of the MPIArray.

Table 3: Global MPIArray Attributes

Attribute      Description
globalshape    Global representation of shape of array distributed among processes
globalsize     Global representation of number of elements distributed among processes
globalnbytes   Global representation of number of bytes distributed among processes
globalndim     Global representation of number of dimensions distributed among processes

Beyond the attributes that needed to be introduced to emulate the ndarray object, select attributes that are useful in developing MPI code are introduced to the MPIArray object. These attributes include the MPI process communicator, properties useful for creating virtual process topologies, and a mechanism to resolve local array index space to the global array index space.

Table 4 below provides a description of these attributes.

Table 4: Useful Distributed MPIArray Attributes

Attribute        Description
comm             MPI communicator associated with distributed array object
comm_dims        Balanced distribution of processes when placed on a cartesian grid
comm_coord       Coordinates of a given process when placed on cartesian grid
local_to_global  Dictionary specifying local data index range in global index space by axis
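As a rough usage sketch (hypothetical, based only on the attribute names documented in Tables 3 and 4 rather than the MPIDS source), the local and global views of a distributed array could be inspected as follows; the creation routine used here is covered in section 3.1.3:

import mpids.MPInumpy as mpi_np

# Hypothetical sketch; attribute names follow Tables 3 and 4.
mpi_arr = mpi_np.zeros((5, 5), dist='b')

print(mpi_arr.shape)            # shape of the process-local block (inherited from ndarray)
print(mpi_arr.globalshape)      # shape of the full distributed array, e.g. (5, 5)
print(mpi_arr.comm_coord)       # this process's coordinates on the cartesian grid
print(mpi_arr.local_to_global)  # local-to-global index ranges, keyed by axis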

3.1.2 Data Distributions

The MPInumpy library currently supports two parallel data distributions, namely the Block and

Replicated classes of data partitioning. These classes of objects act as concrete implementations of the MPIArray object, requiring unique routines to resolve operations that are impacted in a data-parallel environment.

As the name implies, the Block data distribution assumes partitioning of data in blocks, or contiguous chunks of elements. An example of the expected block data distribution pattern for various dimensions of array data is shown in Figure 17 below. As highlighted in this example, the array elements are partitioned into separate chunks based on the leading dimension of the global array’s shape.

Figure 17: Example block distribution of 1, 2, and 3 dimensional data among three MPI processes.

One potential shortcoming of this implementation, as highlighted in the 1-D and 2-D cases found in Figure 17 above, is that the block partitioning scheme does not guarantee even distributions of data elements. Instead, the first MPI process (process 0) is responsible for additional elements/rows of the global array, commonly referred to as ‘over-partitioning’ of the data. Conversely, this implementation of the block distribution pattern is also susceptible to ‘under-partitioning’, which will occur when the number of MPI processes exceeds the number of elements in the leading dimension of the global array.

The second data distribution included in this work, the Replicated data distribution, assumes that every MPI process associated with the object's communicator has a duplicate copy of the array data. Compared to the block distribution, this pattern is easier to implement, as it effectively wraps a cloned ndarray object with MPI attributes.

An MPIArray object's data distribution can be resolved by looking at its dist attribute. Invoking this attribute will return a string with either a value of 'b' or 'r' for the blocked and replicated patterns respectively. Actual partitioning and distribution of data for these patterns is handled by the array creation routines discussed in the next section.

3.1.3 Creation Routines

The MPInumpy library provides several array creation routines that can produce concrete implementations of the MPIArray object. These routines can be grouped into two different categories: those that distribute existing array-like data and those that create the desired distributed array based on argument parameters.

Routines that work with existing data take an array-like object (e.g., list, list of lists, tuple, ndarray, etc.) and produce an MPIArray. Currently, only one routine has been implemented under this category: the MPInumpy.array() method. To emulate the expected behavior of using a NumPy.array(), this routine accepts the same initial parameters used in array creation, such as the array-like data, data type, storage order, minimum number of dimensions.

Routines that create a distributed array based on parameters take either the desired global shape or formatting information to produce an MPIArray. Implemented routines that work off of a desired shape include the MPInumpy.empty() (MPIArray with un-initialized values), MPInumpy.zeros() (MPIArray with all values initialized to zero), and MPInumpy.ones() (MPIArray with all values initialized to one) methods. The implemented routine that works off of formatting information is the MPInumpy.arange() method, which will create an MPIArray with evenly spaced values within a specified interval (e.g., interval start, stop, step). Like the first category of creation routines, these methods accept the same initial parameters as their respective NumPy counterparts.

In addition to the expected NumPy method parameters, all of these array creation methods

accept three additional parameters: the MPI communicator associated with the object, the root or process which has the local data, and the desired final distribution of the generated MPIArray.

A sample array creation routine using MPInumpy.array() is shown in Figure 18 below. Note: all of the additional MPInumpy parameters (comm, root, dist) are given their respective default values: MPI.COMM_WORLD, 0, and ‘b’.

1 import mpids.MPInumpy as mpi_np
2 import numpy as np
3 from mpi4py import MPI
4
5 comm = MPI.COMM_WORLD
6 rank = comm.Get_rank()
7 data = np.arange(25).reshape(5,5) if rank == 0 else None
8 mpi_arr = mpi_np.array(data, comm=comm, root=0, dist='b')

Figure 18: Example MPIArray creation routine for a 5x5 block partitioned array.

On lines 1-3 of Figure 18 above, the MPInumpy, NumPy, and mpi4py packages are imported.

On lines 5-6 the mpi4py library is used to resolve the default communicator for all processes and their respective ranks (process IDs). On line 7, the NumPy library is used to generate a 5x5 array of elements with values from 0 to 24 on the process with rank 0, storing the array in the variable data. For all other MPI processes/ranks, the variable data is initialized with the Python value of None. On line 8, the MPInumpy array() creation routine is used to distribute the array data among processes under a blocked partitioning scheme.

As mentioned previously, the actual partitioning and distribution of the data is managed in the array creation routines. Under the circumstance that the desired data distribution is replicated

(dist=‘r’), the values for the MPIArray attributes (comm_dims, comm_coord, local_to_global) are all initialized to None and an MPI broadcast operation is used to provide duplicate copies of the array data/shape information to all processes. If the desired distribution is the default blocked

(dist=‘b’) partitioning scheme, the stages outlined below are performed:

1. Resolve comm_dims and comm_coord from the number of MPI processes.

   (a) comm_dims is computed using the MPI routine MPI_Dims_create().

   (b) comm_coord is computed using an implementation based on the Open MPI virtual topology Cartesian coordinate creation method [45].

2. MPI broadcast the global array shape to all processes.

3. Determine the local array shape and local_to_global mapping on each MPI process using the expected global shape.

   (a) Local responsibility is resolved by distributing elements in the leading dimension among processes.

4. Based on the creation routine type:

   (a) If array-like data: MPI scatter the data to the respective processes and generate the local MPIArray honoring the expected local array shape.

       i. The previously determined local shapes are leveraged to compute the displacements of the transmission buffer for a personalized MPI_Scatterv operation.

   (b) If desired shape: create the MPIArray honoring the local array shape and the creation routine's initialization.
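A rough sketch of stages 3 and 4(a) for a two-dimensional array is given below, showing how per-process local shapes could be derived from the global shape and turned into the counts and displacements of a personalized MPI_Scatterv call via mpi4py. The helper name and the float64 data type are illustrative assumptions, not the MPInumpy implementation itself.

# Hypothetical sketch of stages 3-4(a) for a 2-D float64 array: derive each
# process's local shape from the global shape, then the element counts and
# displacements a personalized Scatterv would need.
import numpy as np
from mpi4py import MPI

def scatter_block(data, global_shape, comm=MPI.COMM_WORLD, root=0):
    rank, size = comm.Get_rank(), comm.Get_size()
    rows, cols = global_shape
    base, rem = divmod(rows, size)
    # rows owned by each rank (rank 0 absorbs the remainder, as described above)
    local_rows = [base + rem if r == 0 else base for r in range(size)]
    counts = [n_rows * cols for n_rows in local_rows]        # elements per rank
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))   # send-buffer offsets
    local = np.empty((local_rows[rank], cols), dtype=np.float64)
    # on the root, `data` is assumed to be a contiguous float64 array of global_shape
    sendbuf = [data, counts, displs, MPI.DOUBLE] if rank == root else None
    comm.Scatterv(sendbuf, local, root=root)
    return local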

3.1.4 Reduction Routines

Select common reduction routines found in the NumPy library are currently implemented in the

MPInumpy library. These include the min(), max(), mean(), sum(), and std() methods to compute an array’s minimum, maximum, arithmetic mean, summation, and standard deviation respectively. The results of these operations can be computed using the entire distributed array contents, or along a specified axis.

Like the previously discussed creation routines, the MPInumpy reduction routines attempt to emulate the expected behavior of using the base ndarray, accepting the same function parameters.

These parameters include the specified axis (default: use entire array contents) along which to compute the result and the output data type (default: maintain current data type).

For replicated arrays, the computation of the reduction routines is relatively straightforward, requiring only a local computation of the result using the base ndarray. For block partitioned arrays, the operations necessary to perform the reduction are shown in Algorithm 2 below. On line

1, a local reduction is performed on each process, resulting in a process local minimum, maximum, arithmetic mean, etc. The computation of the globally reduced result is dependent on the axis, if specified, as shown on lines 2-5. If the axis is not specified, or the axis is the axis of partitioning, an MPI_Allreduce is leveraged to compute the result from the distributed values. If the specified axis is not the partitioning axis, an MPI_Allgatherv is used to gather a process-varying (under- or over-partitioned) number of elements from the locally determined results. The final step, as shown on lines 7-8, involves the reshaping of the global result, should the original array have three or more dimensions. This final step is necessary because an ndarray reduction operation flattens the axis along which it performs its computation, effectively resulting in the axis being removed by the operation.

Algorithm 2: General N-Dimensional Reduction Routine
Data: localArray // process local MPIArray
      axis       // axis along which to perform reduction
Result: globalReduction // globally determined result
1 localReduction ← localArray.reduction(axis)
2 if axis = None ∨ axis = partitioningAxis then
3     globalReduction ← AllReduce(localReduction)
4 else
5     globalReduction ← AllGather(localReduction)
6 end
7 if localArray.globalndim > 2 then
8     reshapeArray(globalReduction, axis) // reshape array accounting for lost axis
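For illustration, a minimal mpi4py sketch of the Allreduce branch of Algorithm 2 (axis not specified) is shown below for a sum over a block-partitioned vector; the stand-in local chunk and the extra division used for the mean are assumptions for the example, not the library's code.

# Minimal sketch of Algorithm 2's Allreduce branch for a full-array sum on a
# block-partitioned vector; `local` is a stand-in for this process's chunk.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.arange(5, dtype=np.float64) + 5 * comm.Get_rank()  # stand-in local chunk

local_sum = local.sum()                                # Algorithm 2, line 1
global_sum = comm.allreduce(local_sum, op=MPI.SUM)     # lines 2-3 (axis = None path)
global_count = comm.allreduce(local.size, op=MPI.SUM)
global_mean = global_sum / global_count                # a mean needs one extra division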

Currently, all reduction routine results are returned as replicated MPIArray objects. A sample

reduction routine, using MPIArray.mean() to normalize each column of a 5x5 array by its arithmetic mean, is shown in Figure 19 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 mpi_arr_mean = mpi_arr.mean(axis=0)
5 mpi_arr_normalized = mpi_arr / mpi_arr_mean

Figure 19: Example MPIArray reduction routine demonstrating how to normalize all columns of a 5x5 block partitioned array by their respective arithmetic means.

In Figure 19 above, on line 3, the MPInumpy array creation routine arange() is used to generate a 5x5 array of elements with values from 0 to 24 with a block data partitioning scheme. On line

4, the arithmetic mean of each column of elements is computed by the MPIArray.mean() method by specifying a reduction axis of 0 (i.e., computation along the rows). Finally, on line 5, the 5x5 array is divided by the computed column mean values, normalizing each column by its respective arithmetic mean.

3.1.5 Behavior & Operations

As previously discussed, the MPIArray will naturally inherit all of the functionality of the base ndarray class it subclasses. This inherited functionality includes unary, binary, and comparison operators, as well as array object manipulation routines. These can be leveraged locally on each process by using the MPIArray's base or local attributes. This section covers the behavior and operations that were addressed in a data-distributed framework, which include the getting or setting of elements, slicing, distributed array reconstruction, iterating, and array reshaping.

Direct accessing, or getting/setting, of elements in an MPIArray object honors the Python object data model [66]. It does this by implementing custom __getitem__() and __setitem__() dunder

(double underscore) methods for respectively getting or setting elements within a container object.

When performing either operation, a ‘key’ or index must be supplied indicating the location of the desired element. For the setting operation, an additional parameter for the desired updated value is required. In this implementation, the ‘key’ in both cases represents the global position/index of the target element. A sample of both of these accessing routines is shown in Figure 20 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 mpi_arr_index_00 = mpi_arr[0,0] #Getter routine
5 mpi_arr[4,4] = 9999 #Setter routine

Figure 20: Example MPIArray accessing routines of a 5x5 block partitioned array demonstrating how to get the element in global position 0,0 and how to set the element in global position 4,4.

On line 4 of Figure 20 above, a getting routine is used to capture the element found at global index (‘key’) 0,0 and store it in the variable mpi_arr_index_00. On line 5, a setting routine is used to assign the element found at global index 4,4 a value of 9999, replacing its previous value of 24. An example of the expected behavior of the getting/setting routines in Figure 20 is highlighted in the interactive tmux session, using two MPI processes, shown in Figure 21 below.

Figure 21: Interactive MPInumpy Python example using two tmux panes, one for MPI process/rank 0 (left) and one for MPI process/rank 1 (right), to demonstrate accessing routine behavior shown in Figure 20.

In both accessing operations, a transformation from a global array index space to a local space

is required. For a replicated partitioning scheme, this mapping from global to local space results in no additional work, requiring only that the ‘key’ be passed to the base local array since each process has a duplicated copy of the MPIArray. For a blocked partitioning scheme, mapping from a global to local index space requires the transformation of a global ‘key’ to a local one. To perform this mapping, the MPIArray attribute local_to_global is leveraged to transform the global index to either a local index or a non-access ‘key’ (a slice that produces an empty result). Afterwards, either the element is updated (when setting) or the element is returned as a replicated MPIArray to all processes (when getting).
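The following is an illustrative sketch of that global-to-local transformation for an axis-0 block partition; the start/stop bounds stand in for the role the local_to_global attribute plays in the library, and the function name is hypothetical.

# Illustrative mapping of a global row index to a local one on a block-partitioned
# leading axis; `local_start`/`local_stop` play the role local_to_global fills in
# the library (names are assumptions, not the library's API).
def global_to_local_key(global_row, local_start, local_stop):
    if local_start <= global_row < local_stop:
        return global_row - local_start    # index into this process's local chunk
    return slice(0, 0)                     # 'non-access' key: produces an empty result

# a process owning global rows [3, 5): row 4 maps to local row 1, row 0 to nothing
print(global_to_local_key(4, 3, 5))  # 1
print(global_to_local_key(0, 3, 5))  # slice(0, 0, None)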

Slicing of an MPIArray object takes advantage of the previously outlined getting routine. Just like its emulated counterpart, the ndarray object, MPIArray objects support slicing with strided element access patterns using the notation MPIArray[start:stop:step], where start is the global inclusive starting position along an axis, stop is the global exclusive stopping position, and step is the global stride from the starting position. Like the getting routine, the result of a slice is a replicated MPIArray on all processes.

To produce a replicated MPIArray from block partitioned distributed data, an array reconstruction method was implemented. This method, MPIArray.collect_data(), takes no parameters and returns a reconstructed array to all processes associated with the object. The actual reconstruction of the array is resolved by using the MPI_Allgatherv operation to gather the process-varying number of elements stored locally on each process. This method is leveraged in the previously mentioned accessing routines, where an intermediate distributed MPIArray is created with the locally indexed/sliced results, after which the collect_data() method is invoked to produce the

final replicated result.
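A minimal mpi4py sketch of such an Allgatherv-based reconstruction for a block-distributed vector is shown below; it assumes float64 data and is only a model of what collect_data() is described as doing, not the library's implementation.

# Sketch of an Allgatherv-based reconstruction of a block-distributed vector.
import numpy as np
from mpi4py import MPI

def collect_vector(local, comm=MPI.COMM_WORLD):
    counts = comm.allgather(local.size)                      # per-process element counts
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))   # receive-buffer offsets
    full = np.empty(sum(counts), dtype=local.dtype)
    comm.Allgatherv(local, [full, counts, displs, MPI.DOUBLE])  # assumes float64 data
    return full                                              # replicated on every process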

Iteration of an MPIArray can be performed either locally by a given MPI process, or globally.

Iterating over a process's local elements is handled using a custom implementation of the __iter__() dunder method, which is the method invoked by a standard for loop operation (e.g., for element in collection:). This custom implementation of the iteration dunder method is defined in the

MPIArray pseudo-abstract base class, and invokes the iteration dunder method on the local ndarray

object. In addition to local iteration via the standard means of Python object interaction, the

MPIArray’s local or base attributes can also be used when coupled with a getter routine. In this alternative form of local iteration, resolution of the various indexes can be done by using the

MPIArray’s local shape or size properties.

The second means of iteration, globally traversing the distributed elements, can be achieved using the global shape of the MPIArray in conjunction with a getter routine. However, because the current implementation of the MPIArray getter routine returns a new replicated copy of the element, it cannot be used to update the value of the element in the global array that is being traversed. A sample of both iteration methods is shown in Figure 22 below.

1 import mpids.MPInumpy as mpi_np
2
3 mpi_arr = mpi_np.arange(25, dist='b').reshape(5,5)
4 #Local iteration of array
5 for row_index, row in enumerate(mpi_arr):
6     print(row) #Method 1
7     print(mpi_arr.local[row_index]) #Method 2
8 #Global iteration of array
9 for row_index in range(mpi_arr.globalshape[0]):
10     print(mpi_arr[row_index])

Figure 22: Example MPIArray local and global row iteration of a 5x5 block partitioned array.

In Figure 22 above, lines 5-7 demonstrate how to locally iterate the elements, resulting in each row of the process local array being printed to standard output. Note that the Python built-in enumerate() method is used to produce a monotonically increasing index (row_index) along with the iterator produced result (row). Lines 9-10 demonstrate how the elements can be globally iterated, resulting in each row of data distributed among processes being printed to standard output.

The final re-implemented behavior carried out in this work is a routine that enables the reshaping of an MPIArray. This operation can be performed on any MPIArray object by calling its reshape()

method with a new shape that has dimensions compatible with its original shape (i.e., the product of the new dimensions' lengths must equal the product of the previous dimensions' lengths).

As seen in previous re-implemented operations, little work is required to perform this operation under a replicated partitioning scheme, requiring only that the local ndarray object's reshape method be invoked with the desired shape. For a block partitioning scheme, two key steps need to be performed, namely the distribution of the new desired shape and an all-to-all exchange of elements to match the new array shape.

The first of these two operations to reshape a block partitioned MPIArray uses the same mechanism leveraged in the shape based array creation routines to distribute the shape. The second stage of this process uses the current global shape and the desired new shape to resolve which elements need to be exchanged between processes. Once this information is locally determined, the necessary data is exchanged in an MPI_Alltoallv operation. An example of this reshaping operation is shown in Figure 23 below, where a block partitioned array of shape 7x3 is reshaped to a 3x7 array. As highlighted in the example, each process may end up with more or fewer elements depending on the final shape, with MPI process 0 losing 2 elements, while MPI processes 1 & 2 both gain an element.

Figure 23: Example reshape operation of a block partitioned array of shape 7x3 to a new shape of 3x7.
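The element exchange described above can be modeled for a flattened, C-ordered block distribution as follows: ownership boundaries are recomputed under the new leading-dimension split, the overlaps between old and new ranges give the per-rank send/receive counts, and an MPI_Alltoallv moves the data. The helper names and float64 assumption are illustrative only; this is not the MPInumpy reshape code.

# Hedged sketch of the block-reshape exchange on a flattened float64 array.
import numpy as np
from mpi4py import MPI

def block_bounds(n, size):
    """(start, stop) of the leading-dimension rows owned by each rank."""
    base, rem = divmod(n, size)
    counts = [base + rem if r == 0 else base for r in range(size)]
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return [(int(s), int(s + c)) for s, c in zip(starts, counts)]

def reshape_exchange(local_flat, old_shape, new_shape, comm=MPI.COMM_WORLD):
    rank, size = comm.Get_rank(), comm.Get_size()
    # flattened (C-order) element ranges owned under the old and new shapes
    old = [(s * old_shape[1], e * old_shape[1]) for s, e in block_bounds(old_shape[0], size)]
    new = [(s * new_shape[1], e * new_shape[1]) for s, e in block_bounds(new_shape[0], size)]
    mine_old, mine_new = old[rank], new[rank]
    # overlaps of ranges give how many elements go to / come from each rank
    send = [max(0, min(mine_old[1], n[1]) - max(mine_old[0], n[0])) for n in new]
    recv = [max(0, min(mine_new[1], o[1]) - max(mine_new[0], o[0])) for o in old]
    sdispl = np.concatenate(([0], np.cumsum(send)[:-1]))
    rdispl = np.concatenate(([0], np.cumsum(recv)[:-1]))
    out = np.empty(mine_new[1] - mine_new[0], dtype=np.float64)   # assumes float64 data
    comm.Alltoallv([local_flat, send, sdispl, MPI.DOUBLE],
                   [out, recv, rdispl, MPI.DOUBLE])
    return out.reshape(-1, new_shape[1])   # local block of the reshaped array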

3.2 MPIscipy

Akin to the SciPy [74] package building on top of NumPy as its numerical array core, the MPIscipy library developed in this work leverages MPInumpy to act as its distributed numerical array library.

Currently, the MPIscipy library only has one module, the MPIscipy.cluster module, which will be discussed in the remainder of this section.

3.2.1 Cluster

The MPIscipy module cluster currently contains only one computational kernel, namely kmeans().

The kmeans() kernel is a parallel MPI implementation of the K-Means clustering algorithm discussed in section 2.6. Fundamentally, the kernel carries out all of the stages listed in Algorithm 1 of section 2.6, but instead of the observations all residing on a single compute resource, the observations are distributed among MPI processes.

The MPIscipy.cluster.kmeans() method is based on an MPI-parallel K-Means implementation originally developed for cell classification on multispectral images of thyroid specimens [20]. Taking this code originally developed in C, the implementation was first ported to Python leveraging the mpi4py package, then updated to make use of the MPInumpy library. As a reminder, there are four key stages in the typical K-Means clustering algorithm:

1. Initialize k cluster centroids.

2. Assign each of the observations to the nearest centroid based on their Euclidean distance.

3. Recompute centroids from assigned observations.

4. Repeat steps 2 & 3 until centroids converge.

In this data parallel implementation, each of the observation vectors is distributed among the various MPI processes, while the k cluster centroids are replicated, or duplicated, among resources.

As a result, the execution parallelism is realized in stage 2 of the above sequence, where each process can independently assign its respective collection of observations to a given centroid. To facilitate the recomputation of the centroids in stage 3, a set of MPI_Reduce operations is leveraged to combine the process-independent results.
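A hedged sketch of one such data-parallel iteration (stages 2 and 3) is shown below, with observations distributed and centroids replicated. Per-cluster sums and counts are combined here with MPI_Allreduce for brevity so every rank ends up with the updated centroids, whereas the thesis kernel is described in terms of a set of MPI_Reduce operations; all names are illustrative.

# Illustrative single K-Means update: local assignment + global centroid recompute.
import numpy as np
from mpi4py import MPI

def kmeans_step(local_obs, centroids, comm=MPI.COMM_WORLD):
    """One iteration: local assignment (stage 2) + global centroid update (stage 3)."""
    k, n_features = centroids.shape
    # stage 2: nearest centroid for every locally held observation
    dists = np.linalg.norm(local_obs[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # stage 3: combine per-cluster sums and counts across all processes
    local_sums = np.zeros((k, n_features))
    local_counts = np.zeros(k)
    for j in range(k):
        members = local_obs[labels == j]
        local_sums[j] = members.sum(axis=0)
        local_counts[j] = members.shape[0]
    global_sums = np.empty_like(local_sums)
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_sums, global_sums, op=MPI.SUM)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    # guard against empty clusters with a minimum count of 1
    new_centroids = global_sums / np.maximum(global_counts, 1)[:, None]
    return new_centroids, labels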

The method parameters and returned result behavior of the implementation created in this work most closely match the SciPy K-Means2 implementation (scipy.cluster.vq.kmeans2()) [76], accepting the observation data and number of clusters (or seeds for the centroids), while returning the computed cluster centroids and observation labels. There are some notable differences between the two implementations, namely that the SciPy version permits variable means of cluster centroid initialization (selection), and does not use a user specified threshold as its terminating condition (instead using the number of iterations). The MPIscipy version, meanwhile, introduces an additional parameter for the MPI communicator associated with the distributed observation object, defaulting to MPI.COMM_WORLD as is done in the MPInumpy array creation routines.

A sample usage of the MPIscipy.cluster.kmeans() method on simulated data, with one dimension per feature, is shown in Figure 24 below. On lines 5-9, the simulated data is created with two clusters of points that are randomly selected from uniform distributions between -1 to -0.75 and 1 to 1.25 respectively. These sets of observations are then stored in a NumPy ndarray in the variable named np_1D_obs_features. On line 12, this feature vector is then distributed, under a block partitioning scheme, to the various MPI processes. Finally, on line 15, the parallel K-Means clustering kernel is executed using the distributed observations with two specified as the number of clusters.

For comparison, a sample usage of the SciPy scipy.cluster.vq.kmeans2() method on the same simulated data is shown in Figure 25 below. Examining the two code snippets, Figures 24 & 25, we

find that the distributed MPIscipy version requires only one additional line of code to distribute the observations prior to calling the K-Means clustering implementation.

1 import numpy as np
2 import mpids.MPInumpy as mpi_np
3 import mpids.MPIscipy.cluster as mpi_scipy_cluster
4 #Create simulated 1D observation vector
5 k, num_points, centers = 2, 50, [[-1, -0.75],
6                                  [1, 1.25]]
7 x0 = np.random.uniform(centers[0][0], centers[0][1], size=(num_points))
8 x1 = np.random.uniform(centers[1][0], centers[1][1], size=(num_points))
9 np_1D_obs_features = np.array(x0.tolist() + x1.tolist(), dtype=np.float64)
10
11 #Distribute observations among MPI processes
12 mpi_np_1D_obs_features = mpi_np.array(np_1D_obs_features, dist='b')
13
14 #Compute K-Means Clustering Result
15 centroids, labels = mpi_scipy_cluster.kmeans(mpi_np_1D_obs_features, k)

Figure 24: Example usage of the MPIscipy K-Means clustering method on simulated data, with one feature, containing two clusters.

1 import numpy as np
2 import scipy.cluster.vq as scipy_cluster
3 #Create simulated 1D observation vector
4 k, num_points, centers = 2, 50, [[-1, -0.75],
5                                  [1, 1.25]]
6 x0 = np.random.uniform(centers[0][0], centers[0][1], size=(num_points))
7 x1 = np.random.uniform(centers[1][0], centers[1][1], size=(num_points))
8 np_1D_obs_features = np.array(x0.tolist() + x1.tolist(), dtype=np.float64)
9
10 #Compute K-Means Clustering Result
11 centroids, labels = scipy_cluster.kmeans2(np_1D_obs_features, k)

Figure 25: Example usage of the SciPy K-Means clustering method on simulated data, with one feature, containing two clusters.

4 Evaluation

This chapter will evaluate the work developed in Chapter 3. In doing so, several benchmarks/profiles will be generated using the developed libraries along with comparisons to the base libraries they intend to emulate in a data-parallel environment.

The primary metrics used to evaluate performance are the execution time, the speed up (factor difference) of one implementation against another, and the scalability of a given implementation to multiple parallel compute resources. Formally, speed up is defined as follows:

SpeedUp = Time_X / Time_Y    (3)

where Time_X & Time_Y represent the execution times of two implementations performing the same task. A speed up factor greater than 1.0 reflects that the time required by implementation Y was less than that of X, while factors less than 1.0 reflect the opposite.

Scalability of a parallel implementation is typically analyzed in terms of its strong and weak scaling. Strong scaling involves the evaluation of an implementation's execution performance for a fixed problem size with varying numbers of resources (i.e., as resources increase, the amount of individual resource work decreases), whereas weak scaling is the evaluation of the execution performance for a fixed problem size per resource (i.e., as resources increase, the total problem size increases but the individual resource work remains the same).

A commonly discussed attribute of a parallel implementation's scalability is its parallel efficiency. Formally, parallel efficiency is defined as follows:

ParaEff(N) = SpeedUp_N / N = Time_1 / (N × Time_N)    (4)

where N is the number of parallel resources and SpeedUp_N is the achieved speed up using those resources (i.e., the factor difference between a single process and N parallel processes). Parallel efficiency is an important attribute to discuss as it reflects how well the additional resources are leveraged. In

an ideal case, doubling the number of parallel resources would halve the execution time, resulting in a speed up factor of two and a parallel efficiency of 1.0. In sub-optimal cases, such as the execution time remaining the same or increasing with additional resources, the speed up would remain constant or decrease and the parallel efficiency would have a value less than 1.0.
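As a toy illustration of equations (3) and (4), consider hypothetical timings of 12 s on a single process and 1.8 s on 8 processes:

# Hypothetical timings used only to illustrate equations (3) and (4).
time_1, time_n, n_procs = 12.0, 1.8, 8
speed_up = time_1 / time_n            # equation (3): ~6.67x
parallel_eff = speed_up / n_procs     # equation (4): ~0.83
print(f"speed up = {speed_up:.2f}, parallel efficiency = {parallel_eff:.2f}")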

The various trials presented in this chapter were executed using the Parallel Software Technologies Laboratory Crill cluster [17] at the University of Houston, making use of 15 nodes with four

12-core AMD Opteron 6174 processors (48 cores per node) and 64 GB of main memory per node.

These systems are connected via QDR Infiniband and Gigabit Ethernet network interconnects. The software stack used in these trials is as follows: openSUSE [58] version 42.3, Python3 [85] version

3.4.6, NumPy [83] version 1.9.3, SciPy [74] version 1.2.3, and mpi4py [10] version 3.0.0 running on top of Open MPI [81] version 3.0.1.

All measurements were executed fifty times and the arithmetic mean of all fifty measurements, after outlier rejection, is presented. The outlier rejection method used in this work is the Modified

Z-Score [91], which uses the sample's median value in lieu of the mean for a standard deviation based outlier classification, with a rejection threshold of 3.5. Additionally, all time-based measurements were taken using code-internal high-resolution clock timers provided by the mpi4py library

(MPI.Wtime()), which have a clock resolution of 1 nanosecond for the resources used in these measurements. Presented execution times that are near or below this stated clock resolution were executed a sufficient number of iterations to fall within the resolution of the timer; as a result, the presented times represent the measured time divided by the number of execution iterations.
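The measurement post-processing described above can be sketched as follows, assuming the standard 0.6745 scaling constant of the Modified Z-Score and a stand-in operation under test; this mirrors, but is not, the benchmarking harness used in the trials.

# Sketch of the measurement post-processing: Modified Z-Score rejection
# (threshold 3.5, standard 0.6745 constant) then the mean of retained timings.
import numpy as np
from mpi4py import MPI

def mean_without_outliers(samples, threshold=3.5):
    """Arithmetic mean of the samples that survive Modified Z-Score rejection."""
    samples = np.asarray(samples, dtype=np.float64)
    median = np.median(samples)
    mad = np.median(np.abs(samples - median))        # median absolute deviation
    if mad == 0:
        return samples.mean()
    modified_z = 0.6745 * (samples - median) / mad
    return samples[np.abs(modified_z) <= threshold].mean()

# timing pattern for one routine, repeated fifty times as described above
times = []
for _ in range(50):
    start = MPI.Wtime()
    np.arange(2**20, dtype=np.float64).sum()         # stand-in operation under test
    times.append(MPI.Wtime() - start)
print(mean_without_outliers(times))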

For trials involving more than one process, the MPI execution command mpiexec was provided with the --map-by node option (e.g., mpiexec -n <num processes> --map-by node python3

MPIDS_python_code.py). Supplying this option causes processes to be allocated in a round-robin fashion among available nodes, resulting in the specified number of processes being evenly distributed among the resources.

4.1 MPInumpy - Evaluation

The evaluation of the MPInumpy library created in this work will be done in two parts. The first will compare the execution performance/behavior of the NumPy and MPInumpy libraries using a single process (CPU core). The second will be an evaluation of the MPInumpy library's scalability.

The MPInumpy library includes two separate data distributions (Block & Replicated); however, this evaluation only covers the behavior of the block data partitioning scheme. This is because the replicated data distribution is effectively a wrapper for a cloned NumPy ndarray object duplicated among processes, mainly functioning as a return type of block distributed objects depending on the interaction.

4.1.1 Single Process Performance

To evaluate the single process performance of the NumPy and MPInumpy libraries, the execution times of various complementing routines are compared on vector arrays containing a variable number of elements. Vector array sizes ranged from 1 to 2^27 elements, increasing by factors of 2 between each sample. The elements of each vector consisted of 8-Byte floating point (dtype=np.float64) numbers with values that ranged from 0 to N − 1, where N is the total number of elements.

In addition to a comparison of the individual execution times, the overhead introduced by using the MPInumpy library, or the speed up of NumPy over MPInumpy, was computed. This overhead comparison provides insights into how much additional computational effort MPInumpy requires to resolve complementing operations in a data-parallel environment.

The comparison of complementing array creation routines, one of each type (from array-like data, shape, parameters), is shown in Figure 26 below. In the case of creating arrays from existing array-like data or parameters (array() & arange()), the routines exhibit an initial overhead of nearly 128 (2^7) times that of the complementing NumPy versions. This overhead diminishes with an increasing number of elements and eventually settles at around a factor of 2 or a non-significant amount of overhead for the array() and arange() routines respectively. The creation routine that uses shape information to generate an array (empty()) exhibits a near constant introduced overhead just shy of a factor of 64 (2^6). Interestingly, we find a change in the execution time behavior of the

NumPy version between roughly two million and four million elements, resulting in a diminishing introduced overhead with increasing number of elements.

Figure 26: NumPy vs. MPInumpy array creation execution performance (left) and overhead (right).

The overhead introduced in the array creation routines is primarily caused by the logic necessary to distribute the data/shape/parameters. This includes the work necessary to resolve partitioning, transmit the data/initialization information, and then locally initialize the array object on a given process.

The comparison of performing various arithmetic operations (addition, subtraction, multiplication, and division) on all elements in the vector array, generated by both libraries, is shown in

Figure 27 below. Similar to the array-like data and shape based array creation routines, the arithmetic operations exhibit an initial overhead which diminishes to a negligible overall factor increase with increasing numbers of elements. For all tested arithmetic operations, the point of comparable execution performance is achieved for vector array sizes above 65,536 (2^16) elements.

Figure 27: NumPy vs. MPInumpy arithmetic operation execution performance (left) and overhead (right).

A NumPy vector array containing 65,536 8-Byte floating point elements corresponds to a memory footprint of 512KiB, which happens to be the level 2 cache (on-chip memory) size of the resources used in this trial. As a result, the arithmetic overhead convergence point exhibited in the profile indicates that MPInumpy array objects cannot be fully contained in the higher levels of cache on the resources used in this evaluation.

The comparison of performing various reduction type operations, namely resolving the maximum, arithmetic mean, summation, and standard deviation of all elements in a vector array, is shown in Figure 28 below. Once again, we find that nearly all implementations exhibit an initial introduced overhead for the smallest tested sizes, which diminishes with increasing array sizes.

The exception is the standard deviation implementation, which maintains a factor increase of roughly 1.2 for the largest tested sizes.

Figure 28: NumPy vs. MPInumpy reduction operation execution performance (left) and overhead (right).

For the reduction routines, the introduced overhead is the result of having to locally compute the target reduction, followed by the logic to resolve the global result, and finally return a replicated

MPIArray object of the global result. For reduction operations that do not have complementing

MPI reduction routines (arithmetic mean and standard deviation), an additional computation requirement is introduced when resolving the global result. In the case of computing the arithmetic mean, this involves an additional division operation using the result of the summation reduction.

For the computation of the standard deviation, several expensive mathematical operations have to be leveraged to resolve the global result, resulting in the roughly 20% increase in execution time for the larger tested sizes.

The comparison of performing local access operations (leveraging an MPIArray's local property) on an MPInumpy array against a NumPy array is shown in Figure 29 below. Here we find a near constant overhead is introduced by the getting, setting, and slicing routines, exhibiting only a modest decrease in overhead for the largest tested sizes. For the iterating (vector array traversal) operation, we find that an initial overhead is introduced for the smallest tested sizes, which diminishes to a negligible overall factor increase for vector arrays with more than 1,024 (2^10) elements.

Figure 29: NumPy vs. MPInumpy local access operation execution performance (left) and overhead (right).

Local access getting, setting, slicing, and iterating operations all make use of the MPIArray's local property to act on the base ndarray object associated with the object. As a result, the introduced overhead for these operations is effectively the cost of a function call.

The comparison of performing global access operations on an MPInumpy array against a NumPy array is shown in Figure 30 below. Compared to the overhead found when leveraging local access operations, we find a considerable amount of overhead is introduced by interacting with an MPIArray in a global manner; introducing nearly a 50 and 1000 times factor increase in operation overhead for getting and setting routines respectively. Global slicing of the vector array initially exhibits a nearly constant introduced overhead up until 16,384 (2^14) elements, after which the overhead increases at a near constant rate with increasing numbers of elements. The most costly of these operations, global iteration (vector array traversal), initially introduces a roughly 100 times factor overhead for single elements, which increases up until 256 (2^8) elements, after which the introduced overhead remains nearly constant at a factor over 2,000 times that of the NumPy library. Note: the tested value range for the global iterating operation was reduced due to its excessive time requirements.

However, from its profiled behavior, it is expected that the execution time will continue to grow at a near constant rate with increasing numbers of elements.

Figure 30: NumPy vs. MPInumpy global access operation execution performance (left) and overhead (right).

All global access operations introduce a base overhead through the conversion of a global ‘key’ to a local one. In the case of the setting operation, this is the only introduced additional work.

For the getting and slicing routines, which return a replicated result to all processes, additional complexity is added through the process of collecting the distributed result. In the most costly case

(globally iterating), the operation effectively calls the global getting routine for every element in

the vector array, resulting in near constant growth in execution time with increasing problem sizes.

The last comparison reviewed will be that of the execution performance/introduced overhead of the reshape operation, shown in Figure 31 below. Here we find the MPInumpy library introduces an initial overhead of roughly a factor of 600 for the smallest array sizes, followed by modest increases in overhead until roughly 256 (2^8) elements, after which the profile exhibits a linear increase in overhead with increasing array sizes.

Figure 31: NumPy vs. MPInumpy reshape operation execution performance (left) and overhead (right).

For the reshape operation, the introduced overhead is the result of having to distribute the new desired shape, orchestrate an exchange among processes (resolving elements to be exchanged as well as source/destination pairs), and then finally return an MPIArray of the redistributed result.

As the overhead profile indicates, this is an incredibly costly operation when compared to the near constant execution time required by the base NumPy library.

In summary, we find a non-trivial overhead is introduced for most operations, mainly ones that interact with global entries of an MPInumpy MPIArray object, while for operations that can be leveraged on process local elements the overhead is still present but considerably less impactful.

Additionally, in the case of arithmetic and reduction operations, we find that the introduced overhead is minimized for the larger vector array sizes. This is a promising result, given that the target application involves large scale problems that would warrant parallel compute resources.

4.1.2 Scalability

To evaluate the scalability (performance using parallel compute resources) of the MPInumpy library, the execution time and speed up of various routines will be reviewed. Here the speed up metric

(factor change) will take the form of the execution time of using a single process to carry out an operation divided by the execution time of using N parallel processes. For all scalability measurements, the number of processes is varied from 1 to 512 MPI processes, increasing by factors of 2 between each sample.

Total problem sizes will vary based on the scalability metric that is being reviewed. In the case of strong scaling, a vector array containing 2^25 elements will be distributed on N processes, meaning that when N = 1, all 2^25 elements will be contained on a single process, while when

N = 512 each process will locally contain 2^16 elements. In the case of weak scaling, a vector array containing 2^16 elements will be stored on each process. Thus, when N = 1 the total problem size will be 2^16 elements, while when N = 512 the total problem size becomes 2^25 elements. In either scaling benchmark, the elements of each vector consist of 8-Byte floating point (dtype=np.float64) numbers with values that ranged from 0 to N − 1, where N is the total number of elements.

The scaling behavior of the MPInumpy array creation routines is shown in Figure 32 below.

Overall we find that all of the creation routines scale poorly with increasing numbers of processes, with only the empty() routine showing an initial performance improvement under strong scaling up to 128 processes. In the case of the array() routine, this behavior is expected, as this operation requires the distribution of existing data (resolution of target processes and transmitting of data) to be carried out by a single process. For the routines that generate arrays based on shapes/parameters (empty() & arange()), this poor scaling stems from a design decision related to the resolution/calculation of global attributes.

Figure 32: MPInumpy array creation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

An MPIArray object is created with its global attributes (globalshape, globalndim, etc.) initially set to None. A technique known as lazy evaluation is used to compute the necessary attribute only when an operation interacts with/invokes the property for the first time, after which the computed value is stored in the object, a behavior that is necessary to support slicing of MPIArray objects. As a result, when using an array creation routine, an additional overhead to compute these values (by invoking them in the routine) is incurred that increases with growing numbers of processes due to the reduction routines that have to be utilized.
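The lazy-evaluation pattern described above can be illustrated with a small, hypothetical class whose global size is computed by a reduction only on first access and cached thereafter; the attribute names are assumptions, not the library's API.

# Illustrative lazy-evaluation pattern: a global attribute starts as None and is
# computed (here via an allreduce over local sizes) only on first access.
from mpi4py import MPI

class LazyGlobalSize:
    def __init__(self, local_array, comm=MPI.COMM_WORLD):
        self._local = local_array
        self._comm = comm
        self._globalsize = None            # not computed at creation time

    @property
    def globalsize(self):
        if self._globalsize is None:       # first access triggers the reduction
            self._globalsize = self._comm.allreduce(self._local.size, op=MPI.SUM)
        return self._globalsize            # cached for every later access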

The scaling behavior of performing arithmetic operations on an MPIArray object is shown in

Figure 33 below. Here we find a nearly ideal scaling both in terms of strong and weak scalability.

Because these operations require no message/data exchanges to compute the result, this behavior is expected. Additionally, we find evidence of super-linear speed up (instances where the speed up passes the indicated ‘optimal’ line).

Figure 33: MPInumpy arithmetic operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various reduction routines on a block distributed MPIArray object is shown in Figure 34 below. Here we find the resolution of maximum, arithmetic mean, summation, and standard deviation of the entire distributed array continues to strongly scale up to

256 processes, after which the maximum, arithmetic mean, and summation implementations exhibit a decrease in parallel efficiency (i.e., the ratio of speed up to the number of parallel resources), while the standard deviation implementation shows minimal performance improvement from doubling the number of resources. For this total problem size, distributing the work among 256 processes highlights the point where the computation benefits of data-parallelization are outweighed by the increased communication costs (more processes exchanging messages) and resource requirements of the individual nodes (more processes competing for finite resources on nodes).

This negative impact on performance resulting from increased communication costs and resource utilization is further highlighted in the weak scaling. Each routine exhibits an increased performance penalty, independent of the local size, with increasing total problem sizes.

Figure 34: MPInumpy reduction operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various local access operations on an MPIArray is shown in

Figure 35 below. Here we find that getting, setting, and slicing operations result in a near constant execution time independent of the type of scaling, while for local iterating (traversal) of the block distributed MPIArray, near optimal speed up is achieved for both strong and weak scaling.

Figure 35: MPInumpy local access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The scaling behavior of performing various global access operations as well as the collect_data routine (collecting the entire array among all processes) on a block distributed MPIArray is shown in Figure 36 below. Excluding the setting operation, which exhibits a near constant execution time independent of the type of scaling, all global means of access exhibit a decrease in parallel efficiency with increasing numbers of resources. Note: global iterating was excluded from this analysis due to its excessive run time requirements.

Global getting, slicing, and iteration operations all initially resolve their process local results and store them in an intermediate distributed MPIArray object. Thereafter, the globally determined replicated result is produced by calling the collect_data() routine on that intermediate object. As a result, the influence of this routine is evident in the profile of the access operations that leverage it.

Figure 36: MPInumpy global access operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

The last implementation reviewed in terms of scalability will be that of the reshape operation shown in Figure 37 below. In terms of strong scaling, we find that the implementation scales reasonably well up to 128 processes, after which a diminishing return in parallel efficiency is observed.

In the case of weak scaling, we find a continued increase in execution time with an increase in the total problem size. Similar to the scaling of the reduction operations, here the shortcoming in terms of scalability can be attributed to increasing communication costs and resource requirements as the number of processes is increased.

Figure 37: MPInumpy reshape operation strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

Summarizing the findings of this section, we find that operations requiring large amounts of collective communication scale poorly with increasing resources, while operations that can solely operate on local elements scale very well. Additionally, we identify a handful of routines/design decisions, such as the collect_data routine and the resolution of global properties, that would be candidates to target for refactoring/optimization in future works.

4.2 MPIscipy K-Means Clustering - Evaluation

The evaluation of the MPIscipy K-Means clustering implementation created in this work will follow the same evaluation sequence as the MPInumpy library. First, the execution performance/behavior of the SciPy and MPIscipy K-Means clustering algorithms will be compared using a single process (CPU core). This will be followed by an evaluation of the scalability of the

MPIscipy K-Means implementation.

4.2.1 Single Process Performance

The influence of input parameters that are independent of the SciPy and MPIscipy K-Means clustering implementations will be the basis of the single process performance evaluation. These independent input parameters are the number of observations (data samples), the number of features per observation, and the number of cluster centroids to determine.

The number of observations in these trials is varied from 2^4 to 2^10, increasing by factors of

2 between samples. The number of features per observation is varied from 2 to 2^10, increasing by factors of 2 between samples. The final parameter, the number of clusters, is varied from 2 to 2^(log2(Num Obs)−1), increasing by factors of 2 between samples. The elements of each observation vector (vector of features) consist of 8-Byte floating point (dtype=np.float64) numbers with values populated by the Scikit Learn make_blobs() [38] method, which generates isotropic Gaussian blobs of points based on the specified number of samples (observations), features, and centers (centroids).

Additionally, the random_state parameter of the make_blobs() method was seeded with a constant value to ensure input consistency between trials.
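For reference, one sample point of that input sweep could be generated as follows; the specific parameter values and seed shown here are illustrative, not the exact values used in the trials.

# Illustrative generation of one benchmark input with scikit-learn's make_blobs;
# the parameter values and seed below are examples only.
from sklearn.datasets import make_blobs
import numpy as np

n_obs, n_features, n_centroids = 2**6, 2**3, 2**3
observations, _ = make_blobs(n_samples=n_obs, n_features=n_features,
                             centers=n_centroids, random_state=42)
observations = observations.astype(np.float64)   # 8-Byte floating point elements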

In addition to a comparison of the parameter dependent execution times, a rough estimate of the overhead introduced by the MPIscipy implementation, or rough speed up of SciPy’s kmeans2() method over MPIscipy’s kmeans() method, is computed.

To enable a reasonable comparison between the two implementations, the method for initial centroid selection in the SciPy kmeans2() method was set to ‘points’. Selecting this option results in the k initial centroid positions being chosen from the available observations, most closely

matching the initial centroid selection of the MPIscipy kmeans() method.

A fundamental difference in the heuristics of the implementations prevents a true apples-to-apples comparison of the two implementations. The SciPy kmeans2() method executes for a fixed number of iterations (left as the default iter=10 for these trials). As a result, the implementation can generate different final results for the centroids and labels depending on the number of iterations.

The MPIscipy kmeans() implementation executes as many iterations as necessary until the iteration-to-iteration position changes of the centroids fall below a specified threshold (left as the default thresh=1e-5 for these trials). Disregarding floating point rounding variances, this means that the implementation should generate the same results between independent executions, but is capable of executing significantly more iterations to come to its final result.
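The two terminating conditions contrasted above can be sketched as a simple driver loop around a single-iteration update (such as the hypothetical kmeans_step sketched earlier): a SciPy-style run executes a fixed number of iterations, while the threshold-based variant below stops once centroid movement falls below thresh. This is an illustration, not either library's actual driver.

# Illustrative threshold-based driver around a single-iteration K-Means update.
import numpy as np

def run_until_converged(step, local_obs, centroids, thresh=1e-5, max_iter=10_000):
    """Iterate a K-Means update until centroid movement falls below `thresh`."""
    labels = None
    for _ in range(max_iter):
        new_centroids, labels = step(local_obs, centroids)
        if np.linalg.norm(new_centroids - centroids) < thresh:
            return new_centroids, labels   # converged: movement below threshold
        centroids = new_centroids
    return centroids, labels               # fell back to the iteration cap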

The execution behavior, as a function of the number of features, observations, and centroids, for the SciPy kmeans2() and MPIscipy kmeans() implementations is shown in Figures 38 & 39 below.

Overall, the profiles indicate that the SciPy implementation's execution performance is a function of all three variable input parameters, whereas for the MPIscipy implementation, the execution performance is mainly influenced by the number of observations and the number of centroids.

Figure 38: SciPy K-Means2 execution performance as a function of the number of features, observations, and cluster centroids.

Figure 39: MPIscipy K-Means execution performance as a function of the number of features, observations, and cluster centroids.

The differences in execution behavior are highlighted in the overhead profile, shown in Figure 40 below. The profile indicates the variance in overhead is largest as the number of features increases and the number of observations and centroids decrease.

Figure 40: SciPy vs MPIscipy K-Means overhead as a function of the number of features, observations, and cluster centroids.

In terms of overall execution time, the SciPy implementation is considerably faster than the

MPIscipy implementation for all tested samples, completing the most computationally expensive sample (1024 (2^10) features, 1024 (2^10) observations, 512 (2^9) centroids) in under a second, whereas the MPIscipy implementation required just over two and a half minutes. Two factors mainly contribute to this difference in execution rate: the first, as previously mentioned, is the algorithm heuristics in terms of terminating condition; the second is that the SciPy implementation utilizes optimized kernels written in Cython [3] while the MPIscipy implementation is written entirely in

Python.

4.2.2 Scalability

To evaluate the scalability of the MPIscipy K-Means clustering implementation, the execution time and speed up will be reviewed. Similar to the MPInumpy library scalability evaluation, the speed up metric will take the form of the execution time of using a single process to carry out an operation divided by the execution time of using N parallel resources. For all scalability measurements, the number of processes will vary from 1 to 512 MPI processes, increasing by factors of 2 between each sample.

The MPIscipy K-Means clustering implementation achieves execution parallelism through distribution of the observations (feature vectors). As a result, the following trials examine the behavior of varying the number of observations as the total problem size. In the case of strong scaling, the observation vector size is varied from 2^15 to 2^20, increasing by factors of 2 between trials, with two features per observation. In the case of weak scaling, the local observation vector size is varied from 2^10 to 2^15, increasing by factors of 2 between trials, with two features per observation.

In all trials, the features of each observation vector consisted of 8-Byte floating point (dtype=np.float64) numbers with values populated by the Scikit Learn make_blobs() method (seeded with a constant value for consistency between trials). Additionally, the number of clusters to resolve in all trials was set to 2 with no initial value seeded, meaning the initial cluster centroids would

be selected from the available observations. The final non-MPI related parameter, used to specify the iteration-to-iteration centroid movement terminating condition threshold, was left at its default value (thresh=1e-5).

The strong and weak scaling behavior of using the MPIscipy K-Means implementation for varying sizes of observations is shown in Figure 41 below. In terms of strong scaling, the MPIscipy implementation scales reasonably well with increasing observation problem sizes, achieving nearly optimal speed up when distributed among 64 processes. After 64, the implementation exhibits a decrease in parallel efficiency (i.e., the ratio of speed up to the number of parallel resources). However, the general trend in performance for higher process counts seems to indicate that the main source of the loss in parallel efficiency is the increasingly small local problem sizes, as highlighted by the difference in strong scaling speed up between the smallest (2^15) and largest (2^20) tested observation problem sizes. This indicates that the performance decrease is likely due to more time being spent in costly communication operations than processing results.

To evaluate how performant the strong scaling results were in comparison to the sequential

SciPy implementation, the SciPy kmeans2() algorithm was executed on the same total problem sizes. For these trials, the number of iterations was set to 100 in an effort to ensure convergence of centroid positions, while, as done in the previous single process evaluation, the centroid selection method was set to ‘points’. It was found that for all total problem sizes, a minimum of 128 parallel processes were necessary for the MPIscipy implementation to meet or exceed the performance of the SciPy implementation.

In terms of weak scaling, an oscillatory behavior is present in the profile of the execution times.

As a result, a bimodal speed up profile (centered around roughly a factor of 0.95 and 1.15 depending on the single process execution time) is present for each tested problem size, which is relatively consistent up to 64 processes. Overall, the profile indicates the implementation scales reasonably well with increasing observation problem sizes. However, similar to the strong scaling, the weak scaling profile exhibits a decrease in parallel efficiency for larger numbers of processes, tapering off after

64 processes. Just as observed with the strong scaling, the magnitude of the parallel efficiency

loss is inversely proportional to the local problem size, once again indicating the time spent in communication is the primary culprit for the loss in performance.

Figure 41: MPIscipy K-Means strong (top) and weak (bottom) scaling execution performance (left) and speed up (right).

Summarizing the findings of this section, the MPIscipy K-Means implementation shows very promising results in terms of scalability, by achieving nearly optimal strong scaling speed up and consistently high weak scaling speed up for up to 64 processes. However, the scaling behavior indicates a limitation on the achievable parallel efficiency based on the local problem size of each process.

5 Conclusions

The work of this thesis aimed to create a distributed clone of the Python [86, 85] NumPy [83] library.

This library was developed to leverage MPI [43] for inter-process communication and data transfers while abstracting away the complexity of MPI programming from its users. Additionally, this work aimed to create a distributed version of a select SciPy [74] module, employing the distributed arrays provided by the previously developed library, while maintaining a high level of parallelism abstraction. To this end, the MPInumpy and MPIscipy libraries were developed.

In the case of the MPInumpy library, enabling this high level of parallelism abstraction introduced a computational overhead to emulated routines that now function in a data-parallel environment. For routines that could act solely on their local array contents, the impact of this overhead was minimal; while a considerable amount of overhead was introduced for routines that required global scope of the distributed array.

In terms of scalability to parallel compute resources, the MPInumpy library exhibited reasonably performant results; notably showing sensitivity to local process problem sizes and operations that require large amounts of collective communication and synchronization. Additionally, it was identified that the resolution of global array attributes had a considerable impact on array creation routines that otherwise would have scaled reasonably well.

For the implemented MPIscipy K-Means clustering module, a combination of algorithm heuristics and non-optimized internal operations resulted in the single process performance considerably under-performing the existing SciPy K-Means clustering implementation.

However, under the implementation's intended application scenario (i.e., instances where processing capabilities beyond a singular compute resource are required), the implementation exhibited very promising scalability.

The developed libraries are natively intended for application scenarios where problem sizes are too large to be handled by singular compute resources, or when the execution time of an operation is deemed too costly, warranting the utilization of parallel resources to reduce the execution time. For

the MPInumpy library, algorithms that involve raw numerical computations, as well as statistical aggregation of large amounts of data, would benefit from the library's usage. Meanwhile, algorithms that would require sequential (element-wise) access of a distributed array would be ill-suited to use the library. For the MPIscipy K-Means clustering module, scenarios that would benefit from its usage would be clustering of data sets that exceed the memory capabilities of singular compute resources, realizing its full potential when the number of features per observation is the source of the heavy memory requirements.

The libraries outlined in this work are still in their early prototype stages of development. For the MPInumpy library, the current design enables future development efforts to explore alternate data distribution/partitioning patterns, refine and optimize existing routines/kernels, and expand supported functionality provided by the emulated NumPy library. In terms of what to target, future development efforts would benefit from exploring additional SciPy modules to refine/guide the library's expected behavior and usage. For the implemented MPIscipy K-Means clustering module, both the heuristics (centroid selection method, execution terminating condition, etc.) and internal operations should be revisited for refinement/optimization. Fortunately, the full suite of unit tests included with the libraries permits efforts to explore optimizations and expanded functionality, knowing that integration issues will be identified. Lastly, further optimization beyond computational kernel refinement could be explored in the form of porting existing MPInumpy and

MPIscipy code written in Python to Cython [3].

Bibliography

[1] Apache Spark. Welcome to Spark Python API Docs! https://spark.apache.org/docs/latest/api/python/index.html#. Retrieved January 31, 2020.

[2] Barney, B. A Brief Word on MPI-2 and MPI-3. https://computing.llnl.gov/tutorials/mpi/#MPI2-3. Retrieved February 9, 2020.

[3] Behnel, S., Bradshaw, R., Seljebotn, D. S., Ewing, G., Stein, W., and Gellner, G. Welcome to Cython’s Documentation. https://cython.readthedocs.io/en/latest/. Retrieved February 11, 2020.

[4] BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/. Retrieved February 8, 2020.

[5] Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API Design for Machine Learning Software: Experiences from the SciKit-Learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013), pp. 108–122.

[6] The Python Programming Language. https://github.com/python/cpython. Retrieved February 5, 2020.

[7] Dagum, L., and Menon, R. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 5, 1 (Jan. 1998), pp. 46–55.

[8] Daily, J., Granger, B., Grant, R., Ragan-Kelley, M., Kness, M., Smith, K., and Spotz, B. Distributed Array Protocol. https://distributed-array-protocol.readthedocs.io/. Retrieved January 24, 2020.

[9] Dalcin, L. MPI for Python. https://mpi4py.readthedocs.io/en/stable/. Retrieved January 31, 2020.

[10] Dalcin, L. D., Paz, R. R., Kler, P. A., and Cosimo, A. Parallel Distributed Computing Using Python. Advances in Water Resources 34, 9 (2011), pp. 1124–1139.

[11] Dask Development Team. Dask: Library for Dynamic Task Scheduling. https://dask.org. Retrieved January 24, 2020.

[12] Dask Development Team. Why Dask? https://docs.dask.org/en/latest/why.html. Retrieved January 24, 2020.

[13] Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), pp. 107–113.

[14] Ensslin, T., Frommert, M., and Kitaura, F. Information Field Theory for Cosmological Perturbation Reconstruction and Non-Linear Signal Analysis. arXiv.org 80, 10 (Sept. 2009).

[15] Enthought. About. https://www.enthought.com/about/. Retrieved January 24, 2020.

[16] F2PY Users Guide and Reference Manual. https://numpy.org/doc/1.18/f2py/. Retrieved February 12, 2020.

[17] Gabriel, E. Crill Technical Data. http://pstl.cs.uh.edu/resources.shtml. Retrieved March 14, 2020.

[18] Gabriel, E. MPI Data Science Modules. https://github.com/edgargabriel/mpids. Retrieved February 2, 2020.

[19] Gabriel, E. Parallel Software Technologies Laboratory. http://pstl.cs.uh.edu/index.shtml. Retrieved February 2, 2020.

[20] Gabriel, E., Venkatesan, V., and Shah, S. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine 98, 3 (2010), pp. 231–240.

[21] Geurts, L., Meertens, L., and Pemberton, S. The ABC Programmers’ Handbook. https://homepages.cwi.nl/~steven/abc/programmers/handbook.html. Retrieved February 4, 2020.

[22] Grun, P. Introduction to InfiniBand for End Users. https://www.mellanox.com/pdf/whitepapers/Intro_to_IB_for_End_Users.pdf. Retrieved January 31, 2020.

[23] Hartigan, J. A., and Wong, M. A. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28, 1 (Mar. 1979), pp. 100–108.

[24] Hausenblas, M., and Bijnens, N. Lambda Architecture. http://lambda-architecture.net/. Retrieved February 2, 2020.

[25] Hayes, B. Programming Languages Most Used and Recommended by Data Scientists. https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/. Retrieved January 26, 2020.

[26] Hunter, J. D. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 3 (2007), pp. 90–95.

[27] Intel MPI Library. https://software.intel.com/en-us/mpi-library. Retrieved February 9, 2020.

[28] Starting the IPython Controller and Engines. https://ipython.org/ipython-doc/stable/parallel/parallel_process.html. Retrieved February 12, 2020.

[29] IPython, and Enthought. Distarray. https://distarray.readthedocs.io/. Retrieved January 24, 2020.

[30] IronPython the Python Programming Language for the .NET Framework. https://ironpython.net/. Retrieved February 5, 2020.

[31] Jupyter. About Us. https://jupyter.org/about.html. Retrieved January 24, 2020.

[32] What is Jython? https://www.jython.org/. Retrieved February 5, 2020.

[33] Kaiser, T., Brieger, L., and Healy, S. MYMPI - MPI Programming in Python. In PDPTA (Jan. 2006), pp. 458–464.

[34] Data Mining Algorithms In R/Clustering/K-Means. https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means. Retrieved February 12, 2020.

[35] Kuchling, A., and Zadka, M. What’s New in Python 2.0. https://docs.python.org/3/whatsnew/2.0.html. Retrieved February 4, 2020.

[36] LAPACK — Linear Algebra PACKage. http://www.netlib.org/lapack/. Retrieved February 8, 2020.

[37] Loveman, D. B. High Performance Fortran. IEEE Parallel Distributed Technology: Systems Applications 1, 1 (Feb. 1993), pp. 25–42.

[38] Sklearn Datasets Make Blobs. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. Retrieved March 16, 2020.

[39] Message Passing Interface Forum. 158. Virtual Topologies. https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node165.htm#Node165. Retrieved January 24, 2020.

[40] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf. Retrieved February 9, 2020.

[41] Message Passing Interface Forum. MPI 4.0. https://www.mpi-forum.org/mpi-40/. Retrieved February 9, 2020.

[42] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. https://www.mpi-forum.org/docs/mpi-1.0/mpi-10.ps. Retrieved February 9, 2020.

[43] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 3.0. https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Retrieved February 9, 2020.

[44] Message Passing Interface Forum. MPI Forum. https://www.mpi-forum.org/. Retrieved February 9, 2020.

[45] Open MPI - topo base cart coords.c. https://github.com/open-mpi/ompi/blob/master/ompi/mca/topo/base/topo_base_cart_coords.c. Retrieved February 25, 2020.

[46] MPICH. https://www.mpich.org/. Retrieved February 9, 2020.

[47] The N-dimensional Array (ndarray). https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html. Retrieved February 7, 2020.

[48] Subclassing ndarray. https://docs.scipy.org/doc/numpy/user/basics.subclassing.html. Retrieved February 18, 2020.

[49] Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable Parallel Programming with CUDA. Queue 6, 2 (Mar. 2008), pp. 40–53.

[50] Numarray 1.5.1. https://pypi.org/project/numarray/. Retrieved February 7, 2020.

[51] Numeric 24.2. https://pypi.org/project/Numeric/. Retrieved February 7, 2020.

[52] NUMFOCUS. NUMFOCUS Open Code for Better Science. https://numfocus.org/. Retrieved January 26, 2020.

[53] NumPy C-API. https://docs.scipy.org/doc/numpy/reference/c-api.html. Retrieved February 7, 2020.

[54] Data Types. https://numpy.org/devdocs/user/basics.types.html. Retrieved February 7, 2020.

[55] Structured Arrays. https://docs.scipy.org/doc/numpy/user/basics.rec.html. Retrieved February 7, 2020.

[56] Oliphant, T., and Banks, C. PEP 3118 – Revising the Buffer Protocol. https://www.python.org/dev/peps/pep-3118/, 2006. Retrieved January 24, 2020.

[57] Oliphant, T. E. Guide to NumPy. https://web.mit.edu/dvp/Public/numpybook.pdf. Retrieved February 7, 2020.

[58] OpenSUSE 42.3 Release Information. https://en.opensuse.org/Portal:42.3. Retrieved April 13, 2020.

[59] Pandas Core Team. The Pandas Project. https://pandas.pydata.org/about.html. Retrieved January 25, 2020.

[60] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[61] Pérez, F., and Granger, B. E. IPython: A System for Interactive Scientific Computing. Computing in Science and Engineering 9, 3 (May 2007), pp. 21–29.

[62] Pilgrim, M. 2.4. Everything Is an Object. https://linux.die.net/diveintopython/html/getting_to_know_python/everything_is_an_object.html. Retrieved February 5, 2020.

[63] MPI Python. https://sourceforge.net/projects/pympi/. Retrieved February 29, 2020.

[64] Pypar - Parallel Programming with Python. https://sourceforge.net/projects/pypar/files/pypar/pypar_1.9.3/. Retrieved February 29, 2020.

[65] PyPY. https://pypy.org/. Retrieved February 5, 2020.

[66] Python Software Foundation. 3. Data model. https://docs.python.org/3/reference/datamodel.html. Retrieved February 28, 2020.

[67] Python Software Foundation. Find, Install and Publish Python Packages with the Python Package Index. https://pypi.org/. Retrieved February 4, 2020.

[68] Python Software Foundation. Python Software Foundation. https://www.python.org/psf-landing/. Retrieved February 4, 2020.

[69] Python Software Foundation. The Python Standard Library. https://docs.python.org/3/library/. Retrieved February 4, 2020.

[70] Python Software Foundation. unittest — Unit Testing Framework. https://docs.python.org/3.8/library/unittest.html. Retrieved February 20, 2020.

[71] Reed, D., and Dongarra, J. Exascale Computing and Big Data. Communications of the ACM 58, 7 (June 2015), pp. 56–68.

[72] Rocklin, M. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference (2015), pp. 130–136.

[73] Sayad, S. K-Means Clustering. https://www.saedsayad.com/clustering_kmeans.htm. Retrieved February 12, 2020.

[74] SciPy. https://docs.scipy.org/doc/scipy/reference/. Retrieved February 8, 2020.

[75] SciPy: History of SciPy. https://scipy.github.io/old-wiki/pages/History_of_SciPy.html. Retrieved February 8, 2020.

[76] SciPy Cluster VQ Kmeans2. https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.vq.kmeans2.html. Retrieved February 28, 2020.

[77] Selig, M., Bell, M., Junklewitz, H., Oppermann, N., Reinecke, M., Greiner, M., Pachajoa, C., and Ensslin, T. NIFTY - Numerical Information Field Theory - A Versatile Python Library for Signal Inference. arXiv.org 554 (June 2013).

[78] Steininger, T., Greiner, M., Beaujean, F., and Enßlin, T. d2o : A Distributed Data Object for Parallel High-Performance Computing in Python. Journal of Big Data 3, 1 (Dec. 2016), pp. 1–34.

[79] 32 SWIG and Python. http://www.swig.org/Doc4.0/Python.html#Python. Retrieved February 12, 2020.

[80] The MPI Forum. MPI: A Message Passing Interface. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing (Dec. 1993), pp. 878–883.

[81] The Open MPI Project. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. Retrieved February 9, 2020.

[82] Trilinos. Pytrilinos. https://trilinos.github.io/pytrilinos.html. Retrieved January 24, 2020.

[83] Van Der Walt, S., Colbert, S. C., and Varoquaux, G. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering 13, 2 (Mar. 2011), pp. 22–30.

[84] Van Rossum, G. What’s New in Python 3.0. https://docs.python.org/3.0/whatsnew/3.0.html. Retrieved February 4, 2020.

[85] Van Rossum, G., and Drake, F. L. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009.

[86] Van Rossum, G., and Drake Jr, F. L. Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.

[87] Venners, B. The Making of Python. https://www.artima.com/intv/python.html. Retrieved February 4, 2020.

[88] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0 – Fundamental Algorithms for Scientific Computing in Python. arXiv e-prints (July 2019), arXiv:1907.10121.

[89] Walker, D. Standards for Message-Passing in a Distributed Memory Environment. https://www.osti.gov/servlets/purl/10170156. Retrieved February 9, 2020.

[90] Wikipedia. Python (programming language). https://en.wikipedia.org/wiki/Python_(programming_language). Retrieved February 5, 2020.

[91] Detection of Outliers. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm. Retrieved March 14, 2020.

[92] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (USA, 2010), p. 10.

[93] ZeroMQ. ZeroMQ Documentation. https://zeromq.org/get-started/. Retrieved January 24, 2020.
