AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL by Bryant K. Nelson, B.S.
A Dissertation In COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Submitted to:
Dr. Nelson Rushton, Chair of Committee
Dr. Richard Watson
Dr. Bradley Nemanich
Dr. Yong Chen
Dr. Mark Sheridan, Dean of the Graduate School
August, 2016

© Bryant K. Nelson, 2016

Texas Tech University, Bryant Nelson, August 2016
For Dr. Dan Cooke, in memoriam.
ACKNOWLEDGMENTS
There are many people who made getting to where I am today possible. While I can never truly express the extent of my gratitude to these people, I would like to take some time here to acknowledge them and their contributions.

First, I would like to acknowledge Dr. Nelson Rushton, the chair of my committee, my academic advisor, and my friend. Four years ago Dr. Rushton took me on as a PhD student, fresh out of my bachelor's programs. Over these past four years he has been there to guide me and offer advice as both a mentor and a friend, from collaborating on papers and guiding me down research paths to helping me get my grouping tighter and draw faster. Without Dr. Rushton's guidance and insight, I doubt that I would have made it this far in my academic career.

Second, I would like to acknowledge my friend and colleague, Josh Archer. Josh and I started our undergraduate degrees at the same time and went through the same program, dual majoring in Computer Science and Mathematics. However, it was not until graduate school that Josh and I became friends, and I am thankful we did. We have collaborated on numerous projects, papers, and endeavors, both academic and entertaining. Josh is often able to provide an alternate opinion or view on a subject, which has influenced the way I view the world.

Third, I would like to acknowledge Dr. Bradley Nemanich, my colleague, mentor, and friend. When I first heard of Brad I knew of him only as the person Dr. Cooke would communicate with to get me a copy of SequenceL. Over the years Brad has become somewhat of an intellectual role model, providing me with invaluable advice and insight. He was always available to listen and offer suggestions whenever I encountered one of the numerous roadblocks along the way. It is Brad's research that laid the groundwork for the work I have done; without his shoulders to stand on
I would never have made it this far.

I would like to thank my friend JD Davenport for providing me access to the HPC cluster used to test the work done in this dissertation. This made the testing process much simpler and less painful.

My family has provided encouragement and support throughout the years. They keep me grounded and are always there when I need them. I would like to specifically thank my brother, Tyler Nelson, who was there for me when I was going through a particularly difficult time.

Last, and far from least, I would like to acknowledge my girlfriend, Amber Crandall. She stood by me through the toughest parts of my research. Through the late nights and long days, she was there to keep me focused and was patient with me when I was distant. For that I am thankful.
Bryant K. Nelson
Texas Tech University
August 11, 2016
CONTENTS
ACKNOWLEDGMENTS ...... iii
ABSTRACT ...... ix
LIST OF FIGURES ...... xi
LIST OF TABLES ...... xiii
NOMENCLATURE ...... xiv
I INTRODUCTION ...... 1
1.1 Motivation ...... 1
1.2 Problem Statement ...... 2
1.3 Dissertation Overview ...... 3
II BACKGROUND ...... 5
2.1 SequenceL ...... 5
2.1.1 Normalize Transpose (NT) ...... 6
2.1.2 Indexed Functions ...... 7
2.1.3 Consume Simplify Produce (CSP) ...... 8
2.1.4 Entry Points and Intended Uses ...... 9
2.1.5 SequenceL Compiler Overview ...... 10
2.1.6 SequenceL Runtime Library ...... 10
2.1.7 SequenceL C++ Driver Library ...... 11
2.2 Parallel Computing Architectures ...... 11
2.2.1 Shared Memory ...... 11
2.2.1.1 Pthreads ...... 13
2.2.1.2 OpenMP ...... 15
2.2.1.3 TBB ...... 17
2.2.2 Distributed Memory ...... 19
2.2.2.1 MPI ...... 19
III RELATED WORK ...... 24
3.1 Parallel Programming Languages ...... 24
3.1.1 NESL ...... 24
3.1.2 Sisal ...... 24
3.2 Manual Heterogeneous Computing ...... 25
3.3 Automatic Distributed Computing ...... 26
IV EXPERIMENTAL DESIGN ...... 27
4.1 Environment ...... 27
4.2 Metrics ...... 28
4.3 Test Programs ...... 29
V PHASE 1 ...... 32
5.1 Introduction ...... 32
5.2 Framework Design ...... 32
5.2.1 Target Programs ...... 32
5.2.2 C++ Driver Library Additions ...... 33
5.2.2.1 Distributed Sequence Class ...... 34
5.2.2.2 Utility Functions ...... 37
5.2.3 Distributed Execution ...... 37
5.3 Experimental Design ...... 38
5.3.1 Test Programs ...... 38
5.3.1.1 First Class of Test Problems ...... 38
5.3.1.2 Second Class of Test Problems ...... 38
5.3.2 Experimental Results ...... 39
5.3.2.1 Matrix Multiply ...... 39
5.3.2.2 Monte Carlo Mandelbrot Area ...... 42
5.3.2.3 Pi Approximation ...... 46
5.4 Conclusion and Next Steps ...... 49
5.4.1 Next Steps ...... 49
5.4.1.1 Improvements in the Runtime ...... 50
5.4.1.2 Additional Distribution Targets ...... 50
VI PHASE 2 ...... 52
6.1 Introduction ...... 52
6.2 Compiler Modifications ...... 52
6.2.1 Program Targets ...... 52
6.2.1.1 Parallelization Targets ...... 52
6.2.1.2 Excluded Programs ...... 54
6.2.2 Runtime Additions ...... 55
6.2.3 Generated Code Additions ...... 56
6.3 Experimental Design ...... 57
6.3.1 Test Programs ...... 57
6.3.2 Experimental Results ...... 57
6.3.2.1 Monte Carlo Mandelbrot Area ...... 58
6.3.2.2 Matrix Multiply ...... 62
6.3.2.3 LU Factorization ...... 64
6.3.2.4 PI Approximation ...... 66
VII CONCLUSIONS & FUTURE WORK ...... 70
7.1 Conclusions ...... 70
7.1.1 Contributions ...... 70
7.2 Future Work ...... 71
7.2.1 Optimizations ...... 71
7.2.2 Improvements ...... 72
INDEX ...... 74
BIBLIOGRAPHY ...... 82
APPENDIX ...... 83
A. SequenceL Grammar ...... 84
B. Experimental Data ...... 87
C. Test Programs ...... 117
ABSTRACT
Hybrid parallel programming, consisting of a distributed memory model for internode communication in combination with a shared-memory model to manage intranode parallelisms, is now a common method of achieving scalable parallel performance. Such a model burdens developers with the complexity of managing two parallel programming systems in the same program. I have hypothesized that it is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language. Scalable here means a distributed core-speedup that is no more than an order of magnitude worse than shared memory core-speedup. This dissertation reports the results of testing this hypothesis by extending the SequenceL compiler to automatically generate C++ which uses a combination of MPI and Intel's TBB to achieve scalable distributed and shared memory parallelizations.
LIST OF FIGURES
2.1 Normalize-Transpose Illustration [Nelson & Rushton, 2013] ...... 6
2.2 Consume Simplify Produce Illustration [Nelson & Rushton, 2013] ...... 9
2.3 Shared Memory System Illustration ...... 12
2.4 Distributed Memory System Illustration ...... 20
5.1 Illustration of the Distributed Sequence Distribution Structure ...... 35
5.2 Phase 1: MM on HPC Cluster Performance Graphs ...... 42
5.3 Phase 1: MAN on HPC Cluster Performance Graphs ...... 46
5.4 Phase 1: PI on HPC Cluster Performance Graphs ...... 49
6.1 Generic Call Graph ...... 53
6.2 quicksort Operation Tree ...... 55
6.3 Extended Parallel For-Loop Illustration ...... 56
6.4 Illustration of the Phase 2 Distribution Scheme ...... 58
6.5 Phase 2: MAN on HPC Cluster Performance Graphs ...... 59
6.6 Phase 2: MM on HPC Cluster Performance Graphs ...... 62
6.7 Phase 2: LU on HPC Cluster Performance Graphs ...... 64
6.8 Phase 2: PI on HPC Cluster Performance Graphs ...... 67
B.1 Phase 1: 2DFFT on Virtual Network Performance Graphs ...... 87
B.2 Phase 1: BHUT on Virtual Network Performance Graphs ...... 88
B.3 Phase 1: GOL on Virtual Network Performance Graphs ...... 89
B.4 Phase 1: LU on Virtual Network Performance Graphs ...... 90
B.5 Phase 1: 2DFFT on Networked PCs Performance Graphs ...... 91
B.6 Phase 1: BHUT on Networked PCs Performance Graphs ...... 92
B.7 Phase 1: GOL on Networked PCs Performance Graphs ...... 93
B.8 Phase 1: LU on Networked PCs Performance Graphs ...... 94
B.9 Phase 1: 2DFFT on HPC Cluster Performance Graphs ...... 95
B.10 Phase 1: BHUT on HPC Cluster Performance Graphs ...... 97
B.11 Phase 1: GOL on HPC Cluster Performance Graphs ...... 99
B.12 Phase 1: LU on HPC Cluster Performance Graphs ...... 101
B.13 Phase 2: 2DFFT on Networked PCs Performance Graphs ...... 103
B.14 Phase 2: BHUT on Networked PCs Performance Graphs ...... 104
B.15 Phase 2: GOL on Networked PCs Performance Graphs ...... 105
B.16 Phase 2: LU on Networked PCs Performance Graphs ...... 106
B.17 Phase 2: MM on Networked PCs Performance Graphs ...... 107
B.18 Phase 2: MAN on Networked PCs Performance Graphs ...... 108
B.19 Phase 2: PI on Networked PCs Performance Graphs ...... 109
B.20 Phase 2: 2DFFT on HPC Cluster Performance Graphs ...... 110
B.21 Phase 2: BHUT on HPC Cluster Performance Graphs ...... 112
B.22 Phase 2: GOL on HPC Cluster Performance Graphs ...... 114
LIST OF TABLES
5.1 Phase 1: MM on HPC Cluster ...... 41
5.2 Phase 1: MAN on HPC Cluster ...... 45
5.3 Phase 1: PI on HPC Cluster ...... 48
6.1 Phase 2: Monte Carlo Mandelbrot Area on HPC Cluster ...... 60
6.2 Phase 2: Matrix Multiply on HPC Cluster ...... 63
6.3 Phase 2: LU Factorization on HPC Cluster ...... 65
6.4 Phase 2: Pi Approximation on Server ...... 68
B.1 Phase 1: 2DFFT on Virtual Network ...... 87
B.2 Phase 1: BHUT on Virtual Network ...... 88
B.3 Phase 1: GOL on Virtual Network ...... 89
B.4 Phase 1: LU on Virtual Network ...... 90
B.5 Phase 1: 2DFFT on Networked PCs ...... 91
B.6 Phase 1: BHUT on Networked PCs ...... 92
B.7 Phase 1: GOL on Networked PCs ...... 93
B.8 Phase 1: LU on Networked PCs ...... 94
B.9 Phase 1: 2DFFT on HPC Cluster ...... 96
B.10 Phase 1: BHUT on HPC Cluster ...... 98
B.11 Phase 1: GOL on HPC Cluster ...... 100
B.12 Phase 1: LU on HPC Cluster ...... 102
B.13 Phase 2: 2DFFT on Networked PCs ...... 103
B.14 Phase 2: BHUT on Networked PCs ...... 104
B.15 Phase 2: GOL on Networked PCs ...... 105
B.16 Phase 2: LU on Networked PCs ...... 106
B.17 Phase 2: MM on Networked PCs ...... 107
B.18 Phase 2: MAN on Networked PCs ...... 108
B.19 Phase 2: PI on Networked PCs ...... 109
B.20 Phase 2: 2DFFT on HPC Cluster ...... 111
B.21 Phase 2: BHUT on HPC Cluster ...... 113
B.22 Phase 2: GOL on HPC Cluster ...... 115
NOMENCLATURE
Table Headings

Throughout this dissertation, performance data is presented in tables with the following headings:

Nodes | Cores per Node | Setup Time | Compute Time | Total Time | Speedup | Core Efficiency
The column headings are defined as follows:
• Nodes: The number of nodes the test run was executed across, specified using mpiexec.

• Cores per Node: The core count used on each node, specified using SequenceL's thread argument.

• Setup Time: The amount of time required to set up the program's input.

• Compute Time: The amount of time spent performing the program's computations.

• Total Time: The total time the executable took to run.

• Speedup: The ratio of the total time on a single core of a single node to the total time of the current run.

• Core Efficiency: The speedup expressed as a percentage of the total number of cores used (nodes × cores per node).
CHAPTER I INTRODUCTION
1.1 Motivation
Hochstein found in [Hochstein et al., 2005] that significantly more lines of code are required to implement parallel programs than serial programs, and that the cost per line of parallel code is greater than the cost per line of serial code. In fact, Pancake estimates that the development of parallel code costs, on average, $800 per line of code [Pancake, 1999]. Writing an efficient, scalable program is harder still, where scalability means that the performance of a program increases as the number of processor cores increases [Reinders, 2007].

Problems often grow to the point where, due to memory or performance constraints, they would benefit from a distributed architecture. Writing software for distributed systems introduces an entirely new set of complications, including distributed memory management and network communication. Such complications have been shown to make the level of effort required to write distributed memory code greater than the level of effort required for shared memory code [Hochstein et al., 2005]. This added complexity makes programming for distributed memory systems an excellent candidate for automation.

The SequenceL compiler currently simplifies the task of writing parallel software, but is limited to shared memory architectures. It was hypothesized that SequenceL would lend itself to distributed environments as well [Andersen & Cooke, 2002], and a prototype distributed interpreter was even implemented [Andersen et al., 2006]. However, no further work was done on the distributed interpreter, while significant advances were made in the SequenceL shared memory compiler [Nelson & Rushton, 2013].
The SequenceL shared memory compiler has been used successfully in a wide range of domains. These include the implementation of guidance, navigation, and control systems for NASA [Cooke & Rushton, 2009], the WirelessHART mesh network algorithm [Han et al., 2012], an answer set solver [Nelson et al., 2013], the Easel game engine [Nelson et al., 2014, Archer et al., 2014], and a particle-fluid flow simulation [Başağaoğlu et al., 2016]. The success which SequenceL has found in shared memory systems [Nelson & Rushton, 2013] is the main motivation behind this research into using the semantics of SequenceL to automatically generate C++ for a hybrid system made up of distributed memory nodes, each with shared memory parallel capabilities.
1.2 Problem Statement
Hybrid parallel programming, consisting of a distributed memory model for internode communication in combination with a shared-memory model to manage intranode parallelisms, has become a common method of achieving scalable parallel performance. Such a model burdens developers with the complexity of managing two parallel programming systems in the same program.

SequenceL is a simple, high-level, purely functional language whose semantics allow for automatic compilation to parallel executables [Cooke et al., 2008]. This allows the programmer to focus on problem solving, leaving the low-level optimization to the compiler. The primary focus, thus far, has been having the SequenceL compiler produce parallel C++ code which runs on shared memory architectures, typically multiple processors or cores in a single machine [Nemanich et al., 2010].

Hypothesis: It is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language. Here scalable means a distributed core-speedup that is no more than an order of
magnitude worse than shared memory core-speedup.

This dissertation reports the results of research in testing this hypothesis by extending the SequenceL compiler to automatically generate C++ which uses a combination of MPI and Intel's TBB to achieve scalable distributed and shared memory parallelizations. The specific contributions of this research are as follows.
• Extensions to the SequenceL runtime library allowing the user to make small changes to their C++ driver program, enabling it to run in an arbitrary distributed environment.
• Modifications to the SequenceL code generator which allow it to automatically produce hybrid distributed & shared memory C++ code from any SequenceL program.
• Extensions to the SequenceL runtime library which facilitate the efficient distribution of compiled SequenceL programs.
• Definition of a metric to predict the performance of this generated code.
• Targets for future performance improvements.
• Discovery that the hypothesis is true for a certain class of programs.
1.3 Dissertation Overview
This dissertation consists of seven chapters. Chapter II provides background knowledge that is useful in understanding the information presented in later chapters, including details on the SequenceL programming language and parallel computing architectures. Chapter III details related work in the fields of automatically parallelizing programming languages and hybrid shared memory and distributed memory computing. Chapter IV describes the environments, metrics, and programs used to
test the modifications to the SequenceL compiler presented in Chapters V and VI. Chapter V describes and presents the results of a manual approach to extending SequenceL programs to execute in an arbitrary distributed memory environment. Chapter VI describes and presents the results of extending the SequenceL compiler to enable compiled SequenceL programs to automatically execute in an arbitrary distributed memory environment; the primary contributions of this work are described there. Chapter VII summarizes the results and impact of this work and presents additions and extensions that are planned for future work.
CHAPTER II BACKGROUND
2.1 SequenceL
SequenceL [Cooke et al., 2010] is a syntactically and semantically simple, statically typed, high-level, purely functional programming language [Nemanich et al., 2010]. Development on SequenceL started in 1991, originally under the name BagL, with the proof of Turing-completeness published in 1995 [Friesen, 1995]. Though SequenceL was not originally designed with parallel programming in mind [Cooke et al., 2006], it was discovered that the semantics of SequenceL allow for the automatic generation of parallel executables [Nemanich et al., 2010].

SequenceL was designed to be a concise language and remains simple and straightforward. This is evidenced by the ability to describe the entire syntax and semantics of the language in about 20 pages [Nelson & Rushton, 2013]. The grammar for SequenceL, which is presented in Appendix A, is small, considering that the grammar for Java contains over 150 rules [Alves-Foss, 1999].

Due to its declarative nature, SequenceL has no built-in I/O [Cooke & Rushton, 2005]. Therefore, to make a complete executable, a driver which handles the I/O for the program must be written in a procedural language. The current SequenceL compiler compiles SequenceL programs into parallel C++ code, capable of running on an arbitrary number of shared memory cores, which must then be linked with a C++ driver [Nelson & Rushton, 2013].

SequenceL uses three key semantics to automatically derive parallelisms: Normalize-Transpose (NT), Indexed Functions, and Consume-Simplify-Produce (CSP) [Nelson & Rushton, 2013].
Figure 2.1: Normalize-Transpose Illustration [Nelson & Rushton, 2013]
2.1.1 Normalize Transpose (NT)
A primary source of parallelisms in SequenceL is the Normalize-Transpose (NT) semantic. This semantic allows the programmer to apply any function, including user-defined functions, to sequences of inputs, as illustrated in Figure 2.1. Consider the following SequenceL function definition:
1
2 divides(d(0), n(0)) := true when n mod d = 0 else false;
3

The function divides takes two scalar arguments, d and n, and is defined to be true when d evenly divides n and false otherwise. The following expressions illustrate normal (non-NT) calls to divides:
1
2 divides(2, 4)    // Result: true
3 divides(3, 7)    // Result: false
4

In contrast, the following expressions illustrate various NT'd calls to divides:
1
2 divides([2,3,5], 10)          // Result: [true, false, true]
3 divides(3, [10,15,20])        // Result: [false, true, false]
4 divides([2,3,5], [10,14,20])  // Result: [true, false, true]
5
Function arguments which are of a greater depth than is expected by the function are said to be overtyped [Waugh, 2016]. In the code segment above, the first argument of the expression on line 2 is overtyped, the second argument of the expression on line 3 is overtyped, and both arguments of the expression on line 4 are overtyped.
Definition 1 (Normalize Transpose). Let f and the x_i be identifiers, the d_i be integers, and the L_i be expressions. Given the SequenceL function definition:

f(x_1(d_1), ..., x_n(d_n)) := ⟨expression⟩

and the expression:

f(L_1, ..., L_n)

if at least one L_i is overtyped and all overtyped arguments are of length M, then f(L_1, ..., L_n) is equal to the list of length M whose kth element is: