AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL

by

Bryant K. Nelson, B.S.

A Dissertation In COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Submitted to:

Dr. Nelson Rushton, Chair of Committee

Dr. Richard Watson

Dr. Bradley Nemanich

Dr. Yong Chen

Dr. Mark Sheridan, Dean of the Graduate School

August, 2016

© Bryant K. Nelson, 2016

For Dr. Dan Cooke, in memoriam.


ACKNOWLEDGMENTS

There are many people who made getting to where I am today possible. While I can never truly express the extent of my gratitude to these people, I would like to take some time here to acknowledge them and their contributions.

First, I would like to acknowledge Dr. Nelson Rushton, the chair of my committee, my academic advisor, and my friend. Four years ago Dr. Rushton took me on as a PhD student, fresh out of my bachelor's programs. Over these past four years he has been there to guide me and offer advice as both a mentor and a friend, from collaborating on papers and guiding me down research paths to helping me get my grouping tighter and draw faster. Without Dr. Rushton's guidance and insight, I doubt that I would have made it this far in my academic career.

Second, I would like to acknowledge my friend and colleague, Josh Archer. Josh and I actually started our undergraduate degrees at the same time and went through the same program, dual majoring in Computer Science and Mathematics. However, it was not until graduate school that Josh and I became friends, and I am thankful we did. We have collaborated on numerous projects, papers, and endeavors, both academic and entertaining. Josh is often able to provide an alternate opinion or view on a subject, which has influenced the way I view the world.

Third, I would like to acknowledge Dr. Bradley Nemanich, my colleague, mentor, and friend. When I first heard of Brad I knew of him only as the person with whom Dr. Cooke would communicate to get me a copy of SequenceL. Over the years Brad has become somewhat of an intellectual role model, providing me with invaluable advice and insight. He was always available to listen and offer suggestions whenever I encountered one of the numerous roadblocks along the way. It is Brad's research that laid the groundwork for the work I have done; without his shoulders to stand on, I would never have made it this far.

I would like to thank my friend JD Davenport for providing me access to the HPC cluster used to test the work done in this dissertation. This made the testing process much simpler and less painful.

My family has provided encouragement and support throughout the years. They keep me grounded and are always there when I need them. I would like to specifically thank my brother, Tyler Nelson, who was there for me when I was going through a particularly difficult time.

Last, and far from least, I would like to acknowledge my girlfriend, Amber Crandall. She stood by me through the toughest parts of my research. Through the late nights and long days, she was there to keep me focused and was patient with me when I was distant. For that I am thankful.

Bryant K. Nelson
Texas Tech University
August 11, 2016


CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
I INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 Dissertation Overview
II BACKGROUND
  2.1 SequenceL
    2.1.1 Normalize Transpose (NT)
    2.1.2 Indexed Functions
    2.1.3 Consume Simplify Produce (CSP)
    2.1.4 Entry Points and Intended Uses
    2.1.5 SequenceL Compiler Overview
    2.1.6 SequenceL Runtime Library
    2.1.7 SequenceL C++ Driver Library
  2.2 Parallel Computing Architectures
    2.2.1 Shared Memory
      2.2.1.1 Pthreads
      2.2.1.2 OpenMP
      2.2.1.3 TBB
    2.2.2 Distributed Memory


      2.2.2.1 MPI
III RELATED WORK
  3.1 Parallel Programming Languages
    3.1.1 NESL
    3.1.2 Sisal
  3.2 Manual Heterogeneous Computing
  3.3 Automatic Distributed Computing
IV EXPERIMENTAL DESIGN
  4.1 Environment
  4.2 Metrics
  4.3 Test Programs
V PHASE 1
  5.1 Introduction
  5.2 Framework Design
    5.2.1 Target Programs
    5.2.2 C++ Driver Library Additions
      5.2.2.1 Distributed Sequence Class
      5.2.2.2 Utility Functions
    5.2.3 Distributed Execution
  5.3 Experimental Design
    5.3.1 Test Programs
      5.3.1.1 First Class of Test Problems
      5.3.1.2 Second Class of Test Problems
    5.3.2 Experimental Results
      5.3.2.1 Matrix Multiply
      5.3.2.2 Monte Carlo Mandelbrot Area


      5.3.2.3 Pi Approximation
  5.4 Conclusion and Next Steps
    5.4.1 Next Steps
      5.4.1.1 Improvements in the Runtime
      5.4.1.2 Additional Distribution Targets
VI PHASE 2
  6.1 Introduction
  6.2 Compiler Modifications
    6.2.1 Program Targets
      6.2.1.1 Parallelization Targets
      6.2.1.2 Excluded Programs
    6.2.2 Runtime Additions
    6.2.3 Generated Code Additions
  6.3 Experimental Design
    6.3.1 Test Programs
    6.3.2 Experimental Results
      6.3.2.1 Monte Carlo Mandelbrot Area
      6.3.2.2 Matrix Multiply
      6.3.2.3 LU Factorization
      6.3.2.4 PI Approximation
VII CONCLUSIONS & FUTURE WORK
  7.1 Conclusions
    7.1.1 Contributions
  7.2 Future Work
    7.2.1 Optimizations
    7.2.2 Improvements


INDEX
BIBLIOGRAPHY
APPENDIX
  A. SequenceL Grammar
  B. Experimental Data
  C. Test Programs


ABSTRACT

Hybrid parallel programming, consisting of a distributed memory model for internode communication in combination with a shared-memory model to manage intranode parallelisms, is now a common method of achieving scalable parallel performance. Such a model burdens developers with the complexity of managing two parallel programming systems in the same program. I have hypothesized that it is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language. Scalable here means a distributed core-speedup that is no more than an order of magnitude worse than shared memory core-speedup. This dissertation reports the results of testing this hypothesis by extending the SequenceL compiler to automatically generate C++ which uses a combination of MPI and Intel's TBB to achieve scalable distributed and shared memory parallelizations.


LIST OF FIGURES

2.1 Normalize-Transpose Illustration [Nelson & Rushton, 2013]
2.2 Consume Simplify Produce Illustration [Nelson & Rushton, 2013]
2.3 Shared Memory System Illustration
2.4 Distributed Memory System Illustration

5.1 Illustration of the Distributed Sequence Distribution Structure
5.2 Phase 1: MM on HPC Cluster Performance Graphs
5.3 Phase 1: MAN on HPC Cluster Performance Graphs
5.4 Phase 1: PI on HPC Cluster Performance Graphs

6.1 Generic Call Graph
6.2 quicksort Operation Tree
6.3 Extended Parallel For-Loop Illustration
6.4 Illustration of the Phase 2 Distribution Scheme
6.5 Phase 2: MAN on HPC Cluster Performance Graphs
6.6 Phase 2: MM on HPC Cluster Performance Graphs
6.7 Phase 2: LU on HPC Cluster Performance Graphs
6.8 Phase 2: PI on HPC Cluster Performance Graphs

B.1 Phase 1: 2DFFT on Virtual Network Performance Graphs
B.2 Phase 1: BHUT on Virtual Network Performance Graphs
B.3 Phase 1: GOL on Virtual Network Performance Graphs
B.4 Phase 1: LU on Virtual Network Performance Graphs
B.5 Phase 1: 2DFFT on Networked PCs Performance Graphs
B.6 Phase 1: BHUT on Networked PCs Performance Graphs


B.7 Phase 1: GOL on Networked PCs Performance Graphs
B.8 Phase 1: LU on Networked PCs Performance Graphs
B.9 Phase 1: 2DFFT on HPC Cluster Performance Graphs
B.10 Phase 1: BHUT on HPC Cluster Performance Graphs
B.11 Phase 1: GOL on HPC Cluster Performance Graphs
B.12 Phase 1: LU on HPC Cluster Performance Graphs
B.13 Phase 2: 2DFFT on Networked PCs Performance Graphs
B.14 Phase 2: BHUT on Networked PCs Performance Graphs
B.15 Phase 2: GOL on Networked PCs Performance Graphs
B.16 Phase 2: LU on Networked PCs Performance Graphs
B.17 Phase 2: MM on Networked PCs Performance Graphs
B.18 Phase 2: MAN on Networked PCs Performance Graphs
B.19 Phase 2: PI on Networked PCs Performance Graphs
B.20 Phase 2: 2DFFT on HPC Cluster Performance Graphs
B.21 Phase 2: BHUT on HPC Cluster Performance Graphs
B.22 Phase 2: GOL on HPC Cluster Performance Graphs


LIST OF TABLES

5.1 Phase 1: MM on HPC Cluster
5.2 Phase 1: MAN on HPC Cluster
5.3 Phase 1: PI on HPC Cluster

6.1 Phase 2: Monte Carlo Mandelbrot Area on HPC Cluster
6.2 Phase 2: Matrix Multiply on HPC Cluster
6.3 Phase 2: LU Factorization on HPC Cluster
6.4 Phase 2: Pi Approximation on Server

B.1 Phase 1: 2DFFT on Virtual Network
B.2 Phase 1: BHUT on Virtual Network
B.3 Phase 1: GOL on Virtual Network
B.4 Phase 1: LU on Virtual Network
B.5 Phase 1: 2DFFT on Networked PCs
B.6 Phase 1: BHUT on Networked PCs
B.7 Phase 1: GOL on Networked PCs
B.8 Phase 1: LU on Networked PCs
B.9 Phase 1: 2DFFT on HPC Cluster
B.10 Phase 1: BHUT on HPC Cluster
B.11 Phase 1: GOL on HPC Cluster
B.12 Phase 1: LU on HPC Cluster
B.13 Phase 2: 2DFFT on Networked PCs
B.14 Phase 2: BHUT on Networked PCs
B.15 Phase 2: GOL on Networked PCs


B.16 Phase 2: LU on Networked PCs
B.17 Phase 2: MM on Networked PCs
B.18 Phase 2: MAN on Networked PCs
B.19 Phase 2: PI on Networked PCs
B.20 Phase 2: 2DFFT on HPC Cluster
B.21 Phase 2: BHUT on HPC Cluster
B.22 Phase 2: GOL on HPC Cluster


NOMENCLATURE

Table Headings. Throughout this dissertation, performance data is presented in tables with the following column headings:

Nodes | Cores per Node | Setup Time | Compute Time | Total Time | Speedup | Core Efficiency

The column headings are defined as follows:

• Nodes: The number of nodes the test run was executed across, specified using mpiexec.

• Cores: The core count used on each node, specified using SequenceL's thread-count (numThreads) argument.

• Setup Time: The amount of time required to set up the input to the program.

• Compute Time: The amount of time spent performing the program's computations.

• Total Time: The total time the executable took to run.

• Speedup: The ratio of the total time on a single core of a single node to the total time of the current run.

• Core Efficiency: The speedup divided by the total number of cores, expressed as a percentage.


CHAPTER I
INTRODUCTION

1.1 Motivation

Hochstein et al. found that significantly more lines of code are required to implement parallel programs than serial programs, and that the cost per line of parallel code is greater than the cost per line of serial code [Hochstein et al., 2005]. In fact, Pancake estimates that the development of parallel code costs, on average, $800 per line [Pancake, 1999]. Writing an efficient, scalable parallel program is harder still, where scalability means that the performance of a program increases as the number of processor cores increases [Reinders, 2007]. Problems often grow to the point where, due to memory or performance constraints, they would benefit from a distributed architecture. Writing software for distributed systems introduces an entirely new set of complications, including distributed memory management and network communication. Such complications have been shown to make the level of effort required to write distributed memory code greater than that required for shared memory code [Hochstein et al., 2005]. This added complexity makes programming for distributed memory systems an excellent candidate for automation. The SequenceL compiler currently simplifies the task of writing parallel software, but is limited to shared memory architectures. It was hypothesized that SequenceL would lend itself to distributed environments as well [Andersen & Cooke, 2002], and a prototype distributed interpreter was even implemented [Andersen et al., 2006]. However, no further work was done on the distributed interpreter, while significant advances were made in the SequenceL shared memory compiler [Nelson & Rushton, 2013].


The SequenceL shared memory compiler has been used successfully in a wide range of domains. These include the implementation of guidance, navigation, and control systems for NASA [Cooke & Rushton, 2009], the WirelessHART mesh network algorithm [Han et al., 2012], an answer set solver [Nelson et al., 2013], the Easel game engine [Nelson et al., 2014, Archer et al., 2014], and a particle-fluid flow simulation [Başağaoğlu et al., 2016]. The success SequenceL has found in shared memory systems [Nelson & Rushton, 2013] is the main motivation behind this research into using the semantics of SequenceL to automatically generate C++ for a hybrid system made up of distributed memory nodes, each with shared memory parallel capabilities.

1.2 Problem Statement

Hybrid parallel programming, consisting of a distributed memory model for internode communication in combination with a shared-memory model to manage intranode parallelisms, has become a common method of achieving scalable parallel performance. Such a model burdens developers with the complexity of managing two parallel programming systems in the same program. SequenceL is a simple, high-level, purely functional language whose semantics allow for automatic compilation to parallel executables [Cooke et al., 2008]. This allows the programmer to focus on problem solving, leaving the low-level optimization to the compiler. The primary focus, thus far, has been having the SequenceL compiler produce parallel C++ code which runs on shared memory architectures, typically multiple processors or cores on a single machine [Nemanich et al., 2010].

Hypothesis: It is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language. Here scalable means a distributed core-speedup that is no more than an order of magnitude worse than shared memory core-speedup.

This dissertation reports the results of research in testing this hypothesis by extending the SequenceL compiler to automatically generate C++ which uses a combination of MPI and Intel's TBB to achieve scalable distributed and shared memory parallelizations. The specific contributions of this research are as follows.

• Extensions to the SequenceL runtime library allowing the user to make small changes to their C++ driver program enabling it to run in an arbitrary distributed environment.

• Modifications to the SequenceL code generator which allow it to automatically produce hybrid distributed & shared memory C++ code from any SequenceL program.

• Extensions to the SequenceL runtime library which facilitate the efficient distribution of compiled SequenceL programs.

• Definition of a metric to predict the performance of this generated code.

• Targets for future performance improvements.

• Discovery that the hypothesis is true for a certain class of programs.

1.3 Dissertation Overview

This dissertation consists of 7 chapters. Chapter II provides background knowledge that is useful in understanding the information presented in later chapters. This includes details on the SequenceL programming language and parallel computing architectures. Chapter III details related work in the fields of automatically parallelizing programming languages and hybrid shared memory and distributed memory computing. Chapter IV describes the environments, metrics, and programs used to test the modifications to the SequenceL compiler presented in Chapters V and VI. Chapter V describes and presents the results of a manual approach to extending SequenceL programs to execute in an arbitrary distributed memory environment. Chapter VI describes and presents the results of extending the SequenceL compiler to enable compiled SequenceL programs to automatically execute in an arbitrary distributed memory environment. The primary contributions of this work are described in Chapter VI. Chapter VII summarizes the results and impact of this work and presents additions and extensions that are planned for future work.


CHAPTER II
BACKGROUND

2.1 SequenceL

SequenceL [Cooke et al., 2010] is a syntactically and semantically simple, statically typed, high-level, purely functional language [Nemanich et al., 2010]. Development on SequenceL started in 1991, originally under the name BagL, with the proof of Turing-completeness being published in 1995 [Friesen, 1995]. Though SequenceL was not originally designed with parallel programming in mind [Cooke et al., 2006], it was discovered that the semantics of SequenceL allow for the automatic generation of parallel executables [Nemanich et al., 2010]. SequenceL was designed to be a concise language and remains simple and straightforward. This is evidenced by the ability to describe the entire syntax and semantics of the language in about 20 pages [Nelson & Rushton, 2013]. The grammar for SequenceL, which is presented in Appendix A, is small; by comparison, the grammar for Java contains over 150 rules [Alves-Foss, 1999]. Due to its declarative nature, SequenceL has no built-in I/O [Cooke & Rushton, 2005]. Therefore, to make a complete executable, a driver that handles the I/O for the program must be written in a procedural language. The current SequenceL compiler compiles SequenceL programs into parallel C++ code, capable of running on an arbitrary number of shared memory cores, which must then be linked with a C++ driver [Nelson & Rushton, 2013]. SequenceL uses three key semantics to automatically derive parallelisms: Normalize-Transpose (NT), Indexed Functions, and Consume-Simplify-Produce (CSP) [Nelson & Rushton, 2013].


Figure 2.1: Normalize-Transpose Illustration [Nelson & Rushton, 2013]

2.1.1 Normalize Transpose (NT)

A primary source of parallelisms in SequenceL is the Normalize-Transpose (NT) semantic. This semantic allows the programmer to apply any function, including user-defined functions, to sequences of inputs, as illustrated in Figure 2.1. Consider the following SequenceL function definition:

divides(d(0), n(0)) := true when n mod d = 0 else false;

The function divides takes two scalar arguments, n and d, and is defined to be true when d evenly divides n and false otherwise. The following expressions illustrate normal (non-NT) calls to divides:

divides(2, 4)    // Result: true
divides(3, 7)    // Result: false

In contrast, the following expressions illustrate various NT'd calls to divides:

divides([2,3,5], 10)           // Result: [true, false, true]
divides(3, [10,15,20])         // Result: [false, true, false]
divides([2,3,5], [10,14,20])   // Result: [true, false, true]


Function arguments which are of a greater depth than is expected by the function are said to be overtyped [Waugh, 2016]. In the code segment above, the first argument of the first expression is overtyped, the second argument of the second expression is overtyped, and both arguments of the third expression are overtyped.

Definition 1 (Normalize Transpose). Let $f, x_i$ be identifiers, $d_i$ be integers, and $L_i$ be expressions. Given the SequenceL function definition

$$f(x_1(d_1), \ldots, x_n(d_n)) := \langle \text{expression} \rangle$$

and the expression $f(L_1, \ldots, L_n)$: if at least one $L_i$ is overtyped and all overtyped arguments are of length $M$, then $f(L_1, \ldots, L_n)$ is equal to the list of length $M$ whose $k$th element is $f(L_1^k, \ldots, L_n^k)$, where

$$L_i^k = \begin{cases} L_i[k] & \text{if } L_i \text{ is overtyped} \\ L_i & \text{otherwise} \end{cases}$$

In short,

$$f(L_1, \ldots, L_n) = \left[ f(L_1^1, \ldots, L_n^1), \; f(L_1^2, \ldots, L_n^2), \; \ldots, \; f(L_1^M, \ldots, L_n^M) \right]$$

2.1.2 Indexed Functions

Indexed functions are a syntactic construct which allows the developer to define a function that returns a sequence, element-wise [Cooke, 1998]. The following SequenceL code snippet defines the function Identity, which returns the $N \times N$ identity matrix, namely:

$$\mathrm{Identity}_{N \times N} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}}_{N}$$

Identity(N(0))[i, j] :=
    1 when i = j
    else
    0
    foreach i within 1 ... N,
            j within 1 ... N;

The indexed function construct allows the developer to define nonscalars in an intuitive way, similar to the way they are defined in informal mathematical discourse. The ranges of the indexes can also be inferred from their usage. The following is an example of a SequenceL indexed function which infers the ranges of its indexes: i ranges over 1 ... size(A) and j ranges over 1 ... size(B).

foo(A(1), B(1))[i, j] := A[i] + B[j];
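For example, with $i$ ranging over the indexes of A and $j$ ranging over the indexes of B, the call foo([1,2], [10,20]) produces the 2×2 result

$$\mathrm{foo}([1,2], [10,20]) = [[1+10,\ 1+20],\ [2+10,\ 2+20]] = [[11, 21], [12, 22]]$$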

2.1.3 Consume Simplify Produce (CSP)

Figure 2.2: Consume Simplify Produce Illustration [Nelson & Rushton, 2013]

The third semantic which is a source of automatic parallelizations in SequenceL is Consume-Simplify-Produce (CSP). CSP allows the parameters of a function call to be evaluated in parallel, as illustrated in Figure 2.2 [Cooke, 2002]. CSP can be viewed as a simultaneous, eager beta reduction. Eager here means that all input arguments to functions and operations are evaluated before the function is executed. This is in contrast to lazy reduction, which only evaluates inputs as they are needed. Simultaneous means that all input arguments can be evaluated at the same time. Consider the following SequenceL expression:

(2 * 3) + (4 * 2)

The evaluation trace of this expression looks like:

(2 × 3) + (4 × 2)
→ 6 + 8
→ 14

2.1.4 Entry Points and Intended Uses

When a SequenceL program is compiled the user must specify the signatures of all the functions that will be called from the driver program. These functions are


called entry points and the specified signatures are called intended uses. Compiling a SequenceL function will generate a C++ function of the following form:

sl_<name>(<arg1>, <arg2>, ..., <argn>, numThreads, result)

where <name> is the name of the SequenceL function, each <argi> is an input to the SequenceL function, numThreads is the number of threads to use while executing, and result is where to store the result of the SequenceL function.
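As a concrete, hypothetical illustration: compiling the divides function from Section 2.1.1 with the non-overtyped intended use divides(int(0), int(0)) would yield an entry point along the following lines. The exact parameter and result types here are assumptions for illustration; the generated header's actual type conventions are not reproduced in this document.

// Hypothetical sketch of the shape of a generated entry point,
// following the sl_<name>(<args>..., numThreads, result) pattern above.
void sl_divides(int d, int n, int numThreads, bool& result);

// A driver would call it roughly as follows:
// bool out;
// sl_divides(2, 10, 4, out);   // out == true, computed with 4 threads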

2.1.5 SequenceL Compiler Overview

The first thing the SequenceL compiler does to a given SequenceL program is convert it into a simple procedural intermediate language. Optimizations and static code analyses are performed on the program while it is in this intermediate form. While the details of the SequenceL intermediate language are not important for understanding this dissertation, it is necessary to be aware of its existence. After the optimizations and analyses are performed, the intermediate code is translated into C++. This C++ code is the output of the SequenceL compiler and is contained in a .cpp file and a .h file. These two files contain the entry point functions that were specified at compile time and all functions and objects on which they depend. When writing the C++ driver necessary to run their compiled SequenceL program, the developer includes the .h file produced by the SequenceL compiler.

2.1.6 SequenceL Runtime Library

The SequenceL runtime library is an extensive collection of classes and functions which are used by the generated C++ code. These include abstractions over Intel's Threading Building Blocks, extensive sequence control constructs, and memory management tools. Most of the functionality of the runtime library is shielded from the user, as it is only intended to be used by the generated code.

2.1.7 SequenceL C++ Driver Library

The SequenceL runtime library also provides a library of functions and classes which allows the programmer to develop C++ drivers for compiled SequenceL programs. Included in this library are various classes and utility functions that enable and simplify the creation of the C++ drivers. A key class included in this library is the sequence class, the C++ analogue of the primary data structure that exists within SequenceL. The following C++ code snippet shows the creation of the C++ version of the SequenceL sequence [1,2,3,4,5,6,7,8,9,10]:

Sequence<float> floatSequence;
floatSequence.setSize(10);
for (int i = 1; i <= floatSequence.size(); i++)
{
    floatSequence[i] = i;
}

This driver library also provides access to utility functions which allow the developer to construct and manipulate sequences and profile their program.

2.2 Parallel Computing Architectures

2.2.1 Shared Memory

A shared memory system is primarily characterized by the fact that all of an executable's parallel executing threads have access to the same memory space [Kametani, 1999]. This common memory access is one of the strengths of shared memory architectures, freeing the developer from the need to communicate data between executing threads. Common shared memory systems are PCs with multicore processors and larger servers with multiple processors.

A drawback of shared memory systems is that limits on the number of cores and the amount of memory a system can have are quickly encountered. Monetary cost limitations often arise due to the increased cost of adding cores and memory to a shared memory system. Even if cost does not become an issue, hardware limitations restrict the number of cores and amount of memory that can be added. Due to these limitations, distributed memory architectures are used to obtain greater performance.

Three commonly used approaches to manually writing programs that take advantage of shared memory systems are Pthreads, OpenMP, and TBB [Mogill & Haglin, 2010, Tousimojarad & Vanderbauwhede, 2014]. All of these methods provide the developer with means to manage the thread and task scheduling necessary to take advantage of a shared memory parallel system. In all of these approaches the programmer must explicitly specify and manage the parallel control structures.

Figure 2.3: Shared Memory System Illustration


2.2.1.1 Pthreads

Pthreads, or POSIX Threads, is a programming interface specified in POSIX (Portable Operating System Interface), section 1003.1c [Lewine, 1991]. This interface is a standardized model for dividing a program into subtasks which can be executed in parallel [Buttlar et al., 1996]. It is typically accessed via runtime library and operating system calls [Mogill & Haglin, 2010]. The following is an example of matrix multiply using Pthreads [Today, ].

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10

int A[M][K] = { {1,4}, {2,5}, {3,6} };
int B[K][N] = { {8,7,6}, {5,4,3} };
int C[M][N];

struct v {
    int i; /* row */
    int j; /* column */
};

void *runner(void *param); /* the thread */

int main(int argc, char *argv[]) {

    int i, j, count = 0;
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            /* Assign a row and column for each thread */
            struct v *data = (struct v *) malloc(sizeof(struct v));
            data->i = i;
            data->j = j;
            /* Now create the thread passing it data as a parameter */
            pthread_t tid;        /* Thread ID */
            pthread_attr_t attr;  /* Set of thread attributes */
            /* Get the default attributes */
            pthread_attr_init(&attr);
            /* Create the thread */
            pthread_create(&tid, &attr, runner, data);
            /* Make sure the parent waits for all threads to complete */
            pthread_join(tid, NULL);
            count++;
        }
    }

    /* Print out the resulting matrix */
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
}

/* The thread will begin control in this function */
void *runner(void *param) {
    struct v *data = param; /* the structure that holds our data */
    int n, sum = 0;         /* the counter and sum */

    /* Row multiplied by column */
    for (n = 0; n < K; n++) {
        sum += A[data->i][n] * B[n][data->j];
    }
    /* Assign the sum to its coordinate */
    C[data->i][data->j] = sum;

    /* Exit the thread */
    pthread_exit(0);
}


2.2.1.2 OpenMP

OpenMP has become the standard method for programming for shared memory systems. OpenMP, which often uses Pthreads as its underlying infrastructure, in fact supports a subset of the functionality of Pthreads [Mogill & Haglin, 2010]. The API provided by OpenMP uses the fork-join model, allowing threads to communicate via the sharing of variables [Tousimojarad & Vanderbauwhede, 2014]. More recent versions of OpenMP also support task-based parallelisms [Ayguadé et al., 2009]. The following is an example of matrix multiply using OpenMP [Barney, ].

/*******************************************************************
 * FILE: omp_mm.c
 * DESCRIPTION:
 *   OpenMP Example - Matrix Multiply - C Version
 *   Demonstrates a matrix multiply using OpenMP. Threads share
 *   row iterations according to a predefined chunk size.
 * AUTHOR: Blaise Barney
 * LAST REVISED: 06/28/05
 *******************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NRA 62   /* number of rows in matrix A */
#define NCA 15   /* number of columns in matrix A */
#define NCB 7    /* number of columns in matrix B */

int main (int argc, char *argv[])
{
    int tid, nthreads, i, j, k, chunk;
    double a[NRA][NCA],  /* matrix A to be multiplied */
           b[NCA][NCB],  /* matrix B to be multiplied */
           c[NRA][NCB];  /* result matrix C */

    chunk = 10;  /* set loop iteration chunk size */

    /*** Spawn a parallel region explicitly scoping all variables ***/
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Starting matrix multiply example with %d threads\n",
                   nthreads);
            printf("Initializing matrices...\n");
        }
        /*** Initialize matrices ***/
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++)
                a[i][j] = i + j;
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < NCA; i++)
            for (j = 0; j < NCB; j++)
                b[i][j] = i * j;
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCB; j++)
                c[i][j] = 0;

        /*** Do matrix multiply sharing iterations on outer loop ***/
        /*** Display who does which iterations for demo purposes ***/
        printf("Thread %d starting matrix multiply...\n", tid);
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < NRA; i++)
        {
            printf("Thread=%d did row=%d\n", tid, i);
            for (j = 0; j < NCB; j++)
                for (k = 0; k < NCA; k++)
                    c[i][j] += a[i][k] * b[k][j];
        }
    } /*** End of parallel region ***/

    /*** Print results ***/
    printf("******************************************************\n");
    printf("Result Matrix:\n");
    for (i = 0; i < NRA; i++)
    {
        for (j = 0; j < NCB; j++)
            printf("%6.2f ", c[i][j]);
        printf("\n");
    }
    printf("******************************************************\n");
    printf("Done.\n");
}

2.2.1.3 TBB

Intel's TBB (Threading Building Blocks) is a higher-level, task-based parallel library designed to abstract away platform details and threading mechanisms [Reinders, 2007]. TBB's tasks and work-stealing task scheduler are paramount to its performance, allowing the developer to use tasks to express parallelisms [Kim & Voss, 2011]. TBB is, at its root, a runtime library which provides numerous data structures and algorithms to be used in parallel programs. While TBB is a level of abstraction over standard threading models, it is still a low-level model, requiring the developer to manually detect and specify potential parallelisms [Tousimojarad & Vanderbauwhede, 2014]. The following is an example of matrix multiply using TBB [Computing, ].

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

class Multiply
{
public:
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < size; ++j) {
                for (int k = 0; k < size; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    parallel_for(blocked_range<int>(0, size), Multiply());

    return 0;
}


2.2.2 Distributed Memory

A distributed memory system is characterized by the fact that each processor can directly address only a portion of the total memory system; therefore, executing threads do not necessarily operate in the same memory space [Callahan & Kennedy, 1988]. Due to this segmented memory access, the developer must handle the distribution of data in addition to the distribution of computation.

Common distributed memory systems include networks of PCs and high performance computing clusters. While it is often less costly to construct a distributed memory system with computational ability superior to that of a shared memory system, the difficulty lies in the development required to make use of it. Since the data is distributed across multiple nodes, the developer must handle communicating the needed data to the correct nodes [Bondhugula, 2011].

Figure 2.4: Distributed Memory System Illustration

2.2.2.1 MPI

The Message Passing Interface (MPI) [mpi, 2012] is a widely used parallel programming model and has been the dominant one since the mid-1990s [Hoefler et al., 2013]. MPI is a standardized and portable message-passing system designed to operate on a wide variety of parallel computers [Snir et al., 1998]. The standard defines the syntax and semantics of a library which is useful to a wide range of users writing portable distributed memory programs [Gropp et al., 1998]. There are numerous MPI implementations which are actively being developed and used, both commercial and open source [Snir et al., 1998].

The basic approach of MPI is that programs are parallelized by the explicit use of collective and point-to-point message passing library functions. These library functions abstract away things like sockets, buffered data copying, and message routing. While this method of parallel expression can create extremely efficient parallel programs, it is very difficult to design and develop programs using MPI [Pacheco, 1997]. Pacheco states that MPI "has been called the assembly language of parallel computing because it forces the programmer to deal with so much detail" [Pacheco, 1997]. The developer is required to explicitly handle every point in the program where data sharing is needed. The following is an example of matrix multiply using MPI [Amza, ].

/*
 * mmult.c: matrix multiplication using MPI.
 * There are some simplifications here. The main one is that
 * matrices B and C are fully allocated everywhere, even
 * though only a portion of them is used by each processor
 * (except for processor 0)
 */

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 8  /* Size of matrices */

int A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

void fill_matrix(int m[SIZE][SIZE])
{
    static int n = 0;
    int i, j;
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            m[i][j] = n++;
}

void print_matrix(int m[SIZE][SIZE])
{
    int i, j = 0;
    for (i = 0; i < SIZE; i++) {
        printf("\n\t| ");
        for (j = 0; j < SIZE; j++)
            printf("%2d ", m[i][j]);
        printf("|");
    }
}

int main(int argc, char *argv[])
{
    int myrank, P, from, to, i, j, k;
    int tag = 666;  /* any value will do */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* who am i */
    MPI_Comm_size(MPI_COMM_WORLD, &P);       /* number of processors */

    /* To use the simple variants of MPI_Gather and MPI_Scatter we */
    /* impose that SIZE is divisible by P. */

    if (SIZE % P != 0) {
        if (myrank == 0)
            printf("Matrix size not divisible by number of processors\n");
        MPI_Finalize();
        exit(-1);
    }

    from = myrank * SIZE / P;
    to = (myrank + 1) * SIZE / P;

    /* Process 0 fills the input matrices and broadcasts them */
    /* (only the relevant stripe of A is sent to each process) */

    if (myrank == 0) {
        fill_matrix(A);
        fill_matrix(B);
    }

    MPI_Bcast(B, SIZE*SIZE, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, SIZE*SIZE/P, MPI_INT, A[from], SIZE*SIZE/P,
                MPI_INT, 0, MPI_COMM_WORLD);

    printf("computing slice %d (from row %d to %d)\n",
           myrank, from, to - 1);
    for (i = from; i < to; i++)
        for (j = 0; j < SIZE; j++) {
            C[i][j] = 0;
            for (k = 0; k < SIZE; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

    MPI_Gather(C[from], SIZE*SIZE/P, MPI_INT, C, SIZE*SIZE/P,
               MPI_INT, 0, MPI_COMM_WORLD);

    if (myrank == 0) {
        printf("\n\n");
        print_matrix(A);
        printf("\n\n\t     *\n");
        print_matrix(B);
        printf("\n\n\t     =\n");
        print_matrix(C);
        printf("\n\n");
    }

    MPI_Finalize();
    return 0;
}


CHAPTER III
RELATED WORK

3.1 Parallel Programming Languages

Cooke and Andersen claim that a problem with the majority of high-level parallel programming languages is that they end up forcing the developer into explicitly coding the data decomposition [Cooke & Andersen, 2000]. The examples below demonstrate this.

3.1.1 NESL

NESL is a high-level functional language, a sugared, typed λ-calculus with an explicit parallel map over arrays [Blelloch & Greiner, 1996]. The following is an example of matrix multiply in NESL.

function matrix_multiply(A, B) =
{
  {
    sum({x * y : x in rowA; y in columnB})
    : columnB in transpose(B)
  }
  : rowA in A
}

In NESL, the programmer must indicate which parallelisms to exploit through the use of curly brackets "{}".

3.1.2 Sisal

Sisal (Streams and Iteration in a Single Assignment Language) is a general-purpose applicative language [Feo et al., 1990]. The goal of Sisal is to provide a general-purpose user interface for a wide range of parallel processing platforms [Gaudiot et al., 2001]. While this goal may seem parallel to that of SequenceL, Sisal attempts to accomplish it in a very different manner. Sisal requires programmers to explicitly express loops, as can be seen in the following example of matrix multiply in Sisal.

function dot_product( x, y : array [ real ] returns real )
  for a in x dot b in y
    returns value of sum a * b
  end for
end function % -- dot_product

type One_Dim_R = array [ real ];

type Two_Dim_R = array [ One_Dim_R ];

function matrix_mult( x, y_transposed : Two_Dim_R returns Two_Dim_R )
  for x_row in x                  % -- for all rows of x
      cross y_col in y_transposed % -- & all columns of y
    returns array of dot_product(x_row, y_col)
  end for
end function % -- matrix_mult

Sisal is similar to NESL in that it requires the developer to explicitly provide the control structures necessary to traverse the input data and construct the output data.

3.2 Manual Heterogeneous Computing

Shared memory systems have been on the rise and have become a regular occurrence in the HPC environment; it is now ordinary for a distributed memory system to be composed of shared memory nodes. While MPI is capable of directly handling this development by assigning multiple MPI processes per node, the performance of MPI on a shared memory system is often less than the performance of a direct shared memory approach. Because of this, developers have begun to move towards a hybrid model, mixing MPI with some shared memory model, as seen in [Hoefler et al., 2013] and [Rabenseifner et al., 2009].

These hybrid approaches still require the developer to manually parallelize their software, confounding the solving of the problem; having to manage both levels of parallelism causes even more difficulty. Getting good performance on hybrid architectures depends on how well the programming models at one's disposal fit the hierarchical hardware [Rabenseifner et al., 2009]. The work of Hoefler [Hoefler et al., 2013] attempts to extend MPI in such a way that it can be efficiently used to manage shared memory parallelizations in addition to distributed memory parallelizations.

The traditional approach to programming for hybrid shared and distributed memory systems is to use a combination of MPI and OpenMP, with OpenMP used for the intranode parallelisms and MPI used for the internode parallelisms. OpenMP performs better on finer-grained problem sizes than pure MPI, allowing this hybrid approach to make better use of the shared memory nodes than a pure MPI approach [Smith & Bull, 2001].
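To make the division of labor in this traditional hybrid model concrete, the following is a minimal sketch (illustrative only; not code from this project or the cited works). MPI splits an iteration space into per-node stripes, OpenMP splits each node's stripe across its cores, and an MPI reduction combines the per-node partial results. It can be built with, e.g., mpic++ -fopenmp.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

#define N 1000000

int main(int argc, char *argv[])
{
    int rank, procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    /* Distributed memory level: each node owns a contiguous stripe. */
    long from = (long)rank * N / procs;
    long to   = (long)(rank + 1) * N / procs;

    /* Shared memory level: the stripe is split across the node's cores. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = from; i < to; i++)
        local += 1.0 / (double)(i + 1);

    /* Distributed memory level again: combine per-node partial sums. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("partial harmonic sum over %d terms = %f\n", N, total);

    MPI_Finalize();
    return 0;
}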

3.3 Automatic Distributed Computing

An often-sought-after approach to automatic parallel and distributed programming is the automatic transformation of existing serial code into parallel or distributed programs [Pande et al., 1993]. Such an automatic transformation tool could be used to parallelize the unimaginable amount of legacy code in existence. However, such approaches have not gained general use. Automatic transformation approaches have severe limitations, such as only targeting affine loop nests [Bondhugula, 2011]. For some serial algorithms, determining how to distribute the serial data structures to enable parallel execution is an NP-complete problem [Kessler, 1996].


CHAPTER IV
EXPERIMENTAL DESIGN

4.1 Environment

The following environments were used throughout this research to test the various phases of the project. In every environment the operating system used was Ubuntu 14.04 LTS. There are plans to extend the project to other operating systems, but this initial use of Ubuntu simplified the implementation of the project, the setting up of networks, and the testing of the project.

1. Network of Virtual Machines (Virtual Network)
Throughout this dissertation, a large amount of the testing took place on a network of virtual machines hosted on a machine with an Intel Core i7-4702MQ at 2.2GHz and 16GB of memory. This virtual network is well suited for feasibility and correctness testing, allowing for quick and easy modifications to system specs like changing the number of processors, the amount of memory, or the number of machines in the network. Using the virtual network for correctness testing allowed for more rapid prototyping and iteration.

2. Network of Physical PCs (Physical Network)
A simple network of 2 physical PCs, connected by a 100Mbps switch, was used for preliminary performance testing. This environment served as a staging environment: it was more readily accessible than the larger testing server, so it was simpler to test and iterate on. Since it was a physical network, it was able to reveal performance and correctness issues related to communication latency over a physical network. However, due to its limited number of nodes and performance capabilities, it was not suited for high-intensity profiling and testing.

Machine 1: Intel Core i7-4702MQ at 2.2GHz with 16GB of memory.

Machine 2: AMD 6-core with 8GB of memory.

3. HPC Cluster

Performance testing and profiling was done on a Dell 6100 server, which is a cluster of 3 machines, each with an Intel Xeon with 8 cores at 2.7GHz and 12GB of memory. The node interconnect is a 100Mbps switch.

4.2 Metrics

The early stages of the project were focused primarily on feasibility, seeking to evaluate the benefit of continuing this project and extending the functionality to other parts of SequenceL. At each phase of the project, a number of performance tests were conducted to evaluate the effectiveness of the distributed extensions. The following metrics are used to quantify the results and analyze the effectiveness of the modifications.

Definition 2 (Core Speedup). The $n$-core speedup for a given program $P$ is equal to $\frac{R_1(P)}{R_n(P)}$, where $R_c(P)$ is the runtime of $P$ on $c$ cores.

In other words, $n$-core speedup is the ratio of a program's runtime on 1 core to its runtime on $n$ cores. Ideal, or perfect, $n$-core speedup is equal to $n$, meaning that the program was perfectly parallelized.

Definition 3 (Core Efficiency). The $n$-core efficiency for a given program $P$ is equal to $\frac{S_n(P)}{n}$, where $S_n(P)$ is the $n$-core speedup of program $P$.

Ideal core efficiency is 100%, meaning that running the program on n cores resulted in an n× reduction in runtime.
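As a concrete illustration of Definitions 2 and 3, with made-up timings: a program that runs in 100 seconds on a single core and in 25 seconds on 8 cores has

$$S_8(P) = \frac{R_1(P)}{R_8(P)} = \frac{100}{25} = 4, \qquad \text{8-core efficiency} = \frac{S_8(P)}{8} = \frac{4}{8} = 50\%$$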


Definition 4 (Effective Core Speedup). Effective core speed-up is the core speedup of a given program based on the total number of across-node cores being used.

Definition 5 (Effective Core Efficiency). Effective core efficiency is the core usage efficiency based on the total number of across-node cores being used.

For example, effective 4-core efficiency compares the core usage efficiency of 4 cores on 1 node, 2 cores on each of 2 nodes, and 1 core on each of 4 nodes. Comparing the effective core efficiency on a single node to the same effective core efficiency across more nodes allows us to compare the performance of the distributed memory extensions to the performance of the existing shared memory system.
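For instance, with hypothetical timings: if the 1-node, 1-core run of a program takes 100 seconds and a 2-node, 2-cores-per-node run takes 30 seconds, then the distributed run uses 4 total cores and

$$\text{effective 4-core speedup} = \frac{100}{30} \approx 3.33, \qquad \text{effective 4-core efficiency} = \frac{3.33}{4} \approx 83\%$$

Comparing this 83% to the core efficiency of the same program on 1 node with 4 cores shows how much performance is lost to distribution.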

4.3 Test Programs

The programs used to test various phases of this project are slight variations on programs chosen from the SequenceL heatmap. The heatmap is a set of benchmarks used to regularly test the performance of compiled SequenceL programs. These benchmarks have been chosen essentially at random from SequenceL programs written over the past three years [Nelson & Rushton, 2013]. The test programs were chosen based on the following criteria:

• Wide availability of the algorithms used.

• Able to easily generate large amounts of test data.

The following is a list of the programs used.

1. 2D Fast Fourier Transformation (2DFFT)
2DFFT is implemented based on [Cooley & Tukey, 1965]. The complete implementation can be found in Appendix C. An input to 2DFFT is a 2-dimensional sequence of floats representing a matrix of real numbers.


2. Barnes-Hut N-Body (BHUT)
BHUT is implemented based on [Barnes & Hut, 1986]. The complete implementation can be found in Appendix C. An input to BHUT consists of an 8×n sequence of floats representing physical bodies, a 2×2 sequence of floats representing the bounds of the simulation, and an integer specifying the number of simulation steps to execute.

3. Conway's Game of Life (GOL)
GOL is a standard implementation of John Conway's cellular automaton, Life, based on [Gardner, 1970]. The complete implementation can be found in Appendix C. An input to GOL consists of a 2-dimensional sequence of 1's and 0's, representing live cells and dead cells, and an integer specifying the number of simulation steps to execute.

4. LU Factorization (LU) The complete implementation of LU can be found in Appendix C. An input to LU is a 2-dimensional sequence of floats representing a matrix of real numbers.

5. Matrix Multiply (MM)
MM is an implementation of matrix multiplication, based on the following definition: the product of an $m \times p$ matrix $A$ and a $p \times n$ matrix $B$ is the $m \times n$ matrix denoted $AB$ whose entries are given by $(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$. The Phase 1 implementation can be found in Section 5.3.2.1, and the complete implementation can be found in Appendix C. An input to MM consists of two 2-dimensional sequences of floats, representing matrices of real numbers.

6. Monte Carlo Mandelbrot Area (MAN)


MAN is an implementation of a Monte Carlo sampling method for approximating the area of the Mandelbrot set. The random samples are generated within the rectangle ranging over (−2.0, 0.5) on the real axis and (0.0, 1.125) on the imaginary axis, which covers the top half of the Mandelbrot set. The Phase 1 implementation can be found in Section 5.3.2.2, and the complete implementation can be found in Appendix C. An input to MAN consists of a single integer which specifies the number of samples to generate.

7. PI Approximation (PI)
PI is an implementation of a numeric integration method for approximating π, based on [Mattson, ]. The Phase 1 implementation can be found in Section 5.3.2.3, and the complete implementation can be found in Appendix C. An input to PI consists of a single integer which specifies the number of terms to include in the partial sum.
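The numeric integration behind PI is presumably the classic identity below; the exact quadrature used by the test program is in Appendix C, and the midpoint-rule form shown here is an assumption based on common formulations of this benchmark:

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \frac{1}{n} \sum_{i=1}^{n} \frac{4}{1 + \left(\frac{i - 0.5}{n}\right)^2}$$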


CHAPTER V
PHASE 1

5.1 Introduction

The first phase of the project began as a feasibility study with a small selection of target programs, chosen more for simplicity than usefulness. However, after implementation was finished, it was discovered that the changes made in this phase allowed the developer to easily distribute, with one or two lines of changes, some useful, though embarrassingly parallel, programs. This confirmed the feasibility of continuing the project toward more complex distribution targets. The experiments conducted during this phase show that the distributed versions of the test programs demonstrate speedups similar to their shared memory counterparts. This shows that, even at this stage, the project provides value to the developer, allowing them to use a network of low-cost shared memory systems and obtain performance similar to more expensive shared memory machines with a comparable number of cores.

5.2 Framework Design

5.2.1 Target Programs

The target programs for this phase of the project are those which have an NT over the entry point function, which we call an entry-level NT or an NT’d entry-point. When compiling a function using the SequenceL compiler, specifying an overtyped intended use will generate an NT’d entry-point. For example, consider the following SequenceL function definition


//-- PrimeFilter takes an integer input and returns a sequence
//-- containing all numbers between 1 and n which are prime.
PrimeFilter(n(0)) :=
    let
        divisors := [2] ++ ((1 ... floor(sqrt(n)/2)) * 2 + 1);
    in
        n when n = 2 or n > 1 and none(n mod divisors = 0);

which has the following signature:

PrimeFilter: int(0) -> int(0);

Compiling this function with the intended use

-f "PrimeFilter(int(1))"

will generate a C++ function sl_PrimeFilter, expecting as input a Sequence of integers over which it will NT the function PrimeFilter. We call this NT an entry-level NT. An NT'd entry-point can be thought of as the application of a SequenceL function to multiple inputs simultaneously. Programs with entry-level NTs were chosen as targets for this phase of the project for two main reasons. First, because nodes do not need to communicate during distributed execution of the compiled SequenceL functions, this initial implementation is a simpler starting point. Second, the changes required for these functions were primarily limited to the SequenceL C++ Driver library.

5.2.2 C++ Driver Library Additions

MPI has been chosen as the message passing backend. Its wide implementation and support make the generated programs portable and able to benefit from any of its advances. The Boost::MPI library (BoostMPI) has been used to interface with OpenMPI where it is possible and efficient. This library provides a slightly higher-level and more robust interface to the OpenMPI framework, though it is restricted to MPI 1.1. This version of MPI is missing some of the newer features. As the project progresses, it may become necessary to make more calls directly to OpenMPI.

The modifications required by this phase of the project were limited to extending the sequence class and adding functionality to the SequenceL C++ Driver library. These modifications work as a layer of abstraction, allowing the developer to create and interact with distributed sequences, the primary data structure for SequenceL, without the need to directly interact with OpenMPI.

5.2.2.1 Distributed Sequence Class

The primary addition to the SequenceL C++ Driver library is a distributed wrapper for the sequence class. This wrapper class enables the developer to create and use a distributed sequence in the same manner they would a normal sequence, without any knowledge that it is distributed. There is currently a distributed sequence class for each of the following depths:

• Depth 1: DistributedSequence

• Depth 2: DistributedSequenceSequence

• Depth 3: DistributedSequenceSequenceSequence

In future phases of the project these classes will be combined and rolled into the standard sequence class. During development and iteration it is useful for them to be distinct.


Figure 5.1: Illustration of the Distributed Sequence Distribution Structure

When the developer creates a distributed sequence, portions of it are stored as standard sequences on each node of the network. In the current implementation, distributed sequences are evenly divided across the nodes (a sketch of this block partitioning follows the driver example below). There are some immediately apparent drawbacks to this naïve method of distribution. For example, it is possible that one node is higher performance than the others (e.g. more memory or more processors) or that the work does not distribute evenly in this manner. Further discussion of these shortcomings is presented in Section 5.4. Extending the PrimeFilter example, the following shows the use of the distributed sequence wrapper.


int main(int argc, char** argv)
{
    int threads = 0;
    DistributedSequence input;
    DistributedSequence result;

    sl_init(threads);

    DistributedEllipsis(1, atoi(argv[1]), 1, input);

    sl_isPrime(input.LocalSequence, threads, result.LocalSequence);

    rcout << "Result: " << result.size() << '\n';

    sl_done();

    return 0;
}

Get & Set The get and set member functions provide a means for the user to read from and write to indexes in the distributed sequence. These functions operate on the base type of the sequence. For example, the get function of the DistributedSequenceSequence object expects two inputs, row and column. There is currently no way for the user to get an entire row at once. This was done to prevent the user from doing the following.

sequence.get(1)[1]

This would result in the entire first row being transferred when all the user wanted was the first element of that row.

The early versions of these functions consumed a majority of the runtime in setting up inputs and displaying outputs. This was partly due to the fact that the distributed bounds of each node's local sequence were being computed at every call to get or set. Those bounds are now cached and only recomputed when invalidated.
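The following is a self-contained sketch of that caching, assuming the even block division described earlier; all names here are illustrative, not the driver library's actual internals.

#include <mpi.h>

// Illustrative sketch: the global bounds of this node's block are
// cached and recomputed only when marked invalid (e.g., after a resize).
struct DistSeqBounds {
    long fullSize;               // full, across-node size
    long localBegin, localEnd;   // cached bounds, [localBegin, localEnd)
    bool boundsValid;

    void recomputeBounds() {
        int rank, nodes;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);
        long base = fullSize / nodes, extra = fullSize % nodes;
        localBegin = rank * base + (rank < extra ? rank : extra);
        localEnd   = localBegin + base + (rank < extra ? 1 : 0);
        boundsValid = true;
    }

    bool ownsIndex(long i) {     // used by both get and set
        if (!boundsValid) recomputeBounds();
        return i >= localBegin && i < localEnd;
    }
};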

Size The size member function provides a means for the user to determine the full, across-node size of the distributed sequence. Again, in an attempt to mitigate performance costs, the result of size is cached and only recomputed after changes to the sequence which require it to be updated.

5.2.2.2 Utility Functions

The addition of certain distributed utility functions to the SequenceL C++ Driver library provides efficient means for the user to perform common tasks. These utility functions include:

• DistributedSum

• DistributedProduct

• DistributedEllipsis

rcout The rcout object is similar to the standard C++ cout except that it only writes to stdout on the root node. It is used in the same manner as cout and is useful for displaying status messages and final results.
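A minimal sketch of rcout's behavior follows, assuming the root node is MPI rank 0; the real object likely handles flushing and stream manipulators as well.

#include <iostream>
#include <mpi.h>

// Sketch: a stream-like object that forwards to stdout only on rank 0.
struct RootOstream {
    template <typename T>
    RootOstream& operator<<(const T &v) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) std::cout << v;
        return *this;
    }
};
RootOstream rcout; // used exactly like cout, but silent off the root node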

5.2.3 Distributed Execution

In the C++ driver code, the user constructs the distributed sequences that will be used as overtyped inputs to the generated NT'd entry point and a distributed sequence that will hold the result of the function call. The local sequences of the distributed inputs, the local sequence of the output, and the number of shared memory threads to use are passed as arguments to the generated function. Each node handles the execution of the NT'd function across the portion of the distributed sequence inputs that resides on that node and stores the results in the portion of the distributed sequence output that resides on that node.

5.3 Experimental Design

5.3.1 Test Programs

The programs used to test this phase of the project are slight variations on programs from the SequenceL heatmap. There are two classes of test programs. The first class is composed of the programs which distributed and ran on multiple problem instances; this was the original target of this phase of the project. However, it was discovered that the modifications made in this phase also enabled the user to easily distribute some embarrassingly parallel problems.

5.3.1.1 First Class of Test Problems

1. Conway’s Game of Life (GOL)

2. Barnes-Hut N-Body (BHUT)

3. 2D Fast Fourier Transformation (2DFFT)

4. LU Factorization (LU)

5.3.1.2 Second Class of Test Problems

1. Matrix Multiply (MM)

2. Monte Carlo Mandelbrot Area (MAN)

3. PI Approximation (PI)


5.3.2 Experimental Results

A complete listing of the experimental data is presented in Appendix B. The results discussed in this section are those that have the most impact, specifically the results of the second class of programs.

5.3.2.1 Matrix Multiply

Implementation The following is the SequenceL code for Matrix Multiply. No change was needed to enable it to run distributed.

mm(A(2), B(2))[i,j] := sum(A[i] * transpose(B)[j]);

The following is the C++ driver for the distributed version of Matrix Multiply.

By changing only the first input matrix (A) to a DistributedSequenceSequence, the program is capable of distributing A across any number of networked nodes, executing fully in parallel on all of their shared memory cores.


int threads = 2; if(argc > 1) threads = atoi(argv[1]);
int seed = 12345; if(argc > 2) seed = atoi(argv[2]);
int m1 = 100; if(argc > 3) m1 = atoi(argv[3]);
int n = 100; if(argc > 4) n = atoi(argv[4]);
int m2 = 100; if(argc > 5) m2 = atoi(argv[5]);

DistributedSequenceSequence A;
Sequence< Sequence<SL_FLOAT> > B;
DistributedSequenceSequence result;

sl_init(threads);
srand(seed);

A.setFullSize(m1, n);
B.setSize(n);
for(int y = 1; y <= m1; y++)
{
    for(int x = 1; x <= n; x++)
    {
        A.set(y, x, ((double)rand() / RAND_MAX) * 100000.0);
    }
}
rcout << "Matrix A: " << A << "\n";

for(int y = 1; y <= n; y++)
{
    B[y].setSize(m2);
    for(int x = 1; x <= m2; x++)
    {
        B[y][x] = ((double)rand() / RAND_MAX) * 100000.0;
    }
}
rcout << "Matrix B: " << B << "\n";

sl_mm(A.LocalSequence, B, threads, result.LocalSequence);

rcout << "Result: " << result << "\n";

sl_done();


Performance Table 5.1 shows the results of testing the distributed matrix multiply, multiplying a 10000×2000 matrix by a 2000×2000 matrix on the HPC Cluster. Figure 5.2(a) shows that the single-node speedup curve was maintained across the various node counts. Refer to the NOMENCLATURE section on page xiv for a description of the table headings.
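For reference, the derived columns in these tables follow the standard definitions below; this is an inference consistent with the tabulated data rather than a definition quoted from the NOMENCLATURE section (e.g., 1.855/2 ≈ 92.766% in the table's second row):

\[
\text{Speedup}(c) = \frac{T_{\text{total}}(1)}{T_{\text{total}}(c)},
\qquad
\text{Core Efficiency}(c) = \frac{\text{Speedup}(c)}{c} \times 100\%,
\]

where \(c\) is the total core count, i.e., nodes times cores per node.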

Table 5.1: Phase 1: MM on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.860606    52.3489       53.2095      1.000   100.000%
1      2               0.936909    27.7425       28.6795      1.855    92.766%
1      3               0.985987    17.6191       18.6051      2.860    95.331%
1      4               0.991891    12.6391       13.631       3.904    97.589%
1      6               0.955561     8.67681       9.63237     5.524    92.067%
1      8               0.995945     7.28221       8.27815     6.428    80.346%
2      1               0.818587    26.1842       27.0028      1.971    98.526%
2      2               0.889703    12.3157       13.2054      4.029   100.734%
2      3               0.938569     8.21848       9.15705     5.811    96.846%
2      4               0.885259     6.46505       7.35031     7.239    90.489%
2      6               0.88857      4.42761       5.31618    10.009    83.408%
2      8               0.910978     3.44006       4.35104    12.229    76.432%
3      1               0.842537    16.3933       17.2358      3.087   102.905%
3      2               0.879904     8.30592       9.18582     5.793    96.543%
3      3               0.902477     5.50112       6.4036      8.309    92.326%
3      4               0.927242     4.16876       5.096      10.441    87.012%
3      6               0.919147     2.9435        3.86265    13.775    76.530%
3      8               0.927108     2.53876       3.46586    15.352    63.969%


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 5.2: Phase 1: MM on HPC Cluster Performance Graphs

5.3.2.2 Monte Carlo Mandelbrot Area

Implementation The following is the SequenceL code for the distributed Monte Carlo Mandelbrot Area program. The primary change which was required for this program was to move some work that had been done in the SequenceL entry point up to the C++ driver.


iterate_point( point, orig, iterations, threshold ) :=
    let
        next_point := (r: (point.r*point.r) - (point.i*point.i) + orig.r,
                       i: point.r*point.i * 2.0 + orig.i);
    in
        1
            when point.r * point.r + point.i * point.i > threshold
        else
            iterate_point(next_point, orig, iterations-1, threshold)
            when iterations > 0
        else
            0;

compute_outside(real(1), imaginary(1), max_iter, threshold ) :=
    let
        points := newPoint(real, imaginary);
    in
        [sum(iterate_point(points, points, max_iter, threshold))];

The following is the C++ driver for the distributed Monte Carlo Mandelbrot Area program. The expression on line 32 was moved from the original SequenceL entry point to enable the distributed execution of this program.


1
2  int threads = 2; if(argc > 1) threads = atoi(argv[1]);
3  int seed = 12345; if(argc > 2) seed = atoi(argv[2]);
4  int numPoints = 100; if(argc > 3) numPoints = atoi(argv[3]);
5  int maxIters = 100; if(argc > 4) maxIters = atoi(argv[4]);
6  SL_FLOAT threshold = 2.0; if(argc > 5) threshold = atof(argv[5]);
7
8  DistributedSequence pointsR;
9  DistributedSequence pointsI;
10 DistributedSequence partialResults;
11 SL_FLOAT result;
12
13 sl_init(threads);
14
15 srand(seed);
16
17 pointsR.setSize(numPoints);
18 pointsI.setSize(numPoints);
19 for(int i = 1; i <= numPoints; i++)
20 {
21     pointsR.set(i, -2.0 + 2.5 * (double)rand() / RAND_MAX);
22     pointsI.set(i, 1.125 * (double)rand() / RAND_MAX);
23 }
24
25 sl_compute_outside(pointsR.LocalSequence,
26                    pointsI.LocalSequence,
27                    maxIters,
28                    threshold,
29                    threads,
30                    partialResults.LocalSequence);
31
32 result = 2.0 * ( 2.5 * 1.125 ) *
33          (numPoints - DistributedSum(partialResults)) / numPoints;
34
35 rcout << result << "\n";
36 sl_done();

Performance Table 5.2 shows the results of testing the distributed Monte Carlo Mandelbrot area program with an input of 1,000,000 initial points, 10,000 iterations, and a threshold of 2 on the HPC Cluster. Figure 5.3(a) shows that the single-node speedup curve was maintained across the various node counts. The speedups demonstrated by this heatmap program are historically better than matrix multiply on shared memory architectures, and that trend continues on distributed memory. Refer to the NOMENCLATURE section on page xiv for a description of the table headings.

Table 5.2: Phase 1: MAN on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.081291    61.8043       61.8856      1.000   100.000%
1      2               0.0905659   30.5102       30.6008      2.022   101.118%
1      3               0.0921061   20.1409       20.233       3.059   101.955%
1      4               0.0931771   15.1549       15.2481      4.059   101.464%
1      6               0.0942321   10.6068       10.701       5.783    96.386%
1      8               0.0889051    8.37354       8.46244     7.313    91.412%
2      1               0.074542    30.6087       30.6832      2.017   100.846%
2      2               0.0840609   15.3007       15.3847      4.023   100.564%
2      3               0.0828881   10.1122       10.1951      6.070   101.169%
2      4               0.08392      7.63235       7.71627     8.020   100.252%
2      6               0.0885892    5.31314       5.40173    11.457    95.472%
2      8               0.088207     4.22767       4.31588    14.339    89.619%
3      1               0.0890419   20.4789       20.568       3.009   100.294%
3      2               0.104837    10.1873       10.2921      6.013   100.215%
3      3               0.106986     6.76662       6.87361     9.003   100.037%
3      4               0.103352     5.1901        5.29345    11.691    97.425%
3      6               0.100534     3.58529       3.68583    16.790    93.279%
3      8               0.101922     3.04689       3.14881    19.654    81.890%


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 5.3: Phase 1: MAN on HPC Cluster Performance Graphs

5.3.2.3 Pi Approximation

Implementation The following is the SequenceL code for the Pi Approximation program. The primary change was to move the ellipsis to the C++ driver and have findPI expect a list as input.

helpFindPI ( w(0), N(0) ) :=
    let
        local := ( 1.0 * w + 0.5 ) / N;
    in
        4.0 / ( 1.0 + local * local );

findPI( n(1), N(0) ) :=
    [sum( helpFindPI( n, N ) ) * 1.0 / N];


The following is the C++ driver for the Pi Approximation program. The ellipsis on line 12 and the sum on line 17 would have originally been done in the SequenceL entry point, but were moved up to the C++ driver to enable distribution.

1
2  int threads = 2; if(argc > 1) threads = atoi(argv[1]);
3  int seed = 12345; if(argc > 2) seed = atoi(argv[2]);
4  int n = 100; if(argc > 3) n = atoi(argv[3]);
5
6  DistributedSequence input;
7  DistributedSequence partialResults;
8  SL_FLOAT result;
9
10 sl_init(threads);
11
12 DistributedEllipsis(0, n-1, 0, input);
13
14 sl_findPI(input.LocalSequence, threads,
15           partialResults.LocalSequence);
16
17 result = DistributedSum(partialResults);
18
19 rcout << result << "\n";
20
21 sl_done();

Performance Table 5.3 shows the results of testing the distributed pi approximation program with an input of 800,000,000 on the HPC Cluster. Figure 5.4(a) again shows that the single-node speedup curve was maintained across the various node counts. Refer to the NOMENCLATURE section on page xiv for a description of the table headings.


Table 5.3: Phase 1: PI on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               1.26397     23.5341       24.798       1.000   100.000%
1      2               0.637848    11.7721       12.41        1.998    99.911%
1      3               0.454338     8.23917       8.69351     2.852    95.082%
1      4               0.348546     6.09318       6.44172     3.850    96.240%
1      6               0.346751     4.58136       4.92811     5.032    83.866%
1      8               0.352632     3.40921       3.76184     6.592    82.400%
2      1               0.655289    11.8338       12.489       1.986    99.279%
2      2               0.320398     5.89049       6.21088     3.993    99.817%
2      3               0.235915     4.0588        4.29471     5.774    96.235%
2      4               0.205209     3.10516       3.31037     7.491    93.638%
2      6               0.160028     2.21201       2.37204    10.454    87.119%
2      8               0.178973     1.74991       1.92888    12.856    80.351%
3      1               0.439698     7.84821       8.2879      2.992    99.736%
3      2               0.237294     4.09861       4.3359      5.719    95.320%
3      3               0.179649     2.76623       2.94588     8.418    93.532%
3      4               0.145049     2.03635       2.1814     11.368    94.733%
3      6               0.156169     1.50945       1.66562    14.888    82.712%
3      8               0.137357     1.19075       1.32811    18.672    77.799%


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 5.4: Phase 1: PI on HPC Cluster Performance Graphs

5.4 Conclusion and Next Steps

The results from this phase of the project were very promising. They suggest that it is feasible for a compiler to specify a combination of compile-time and run-time heuristics which, on average, allow efficient heterogeneous "across-node" and "across-core" parallel C++ to be generated from a program written in a high-level functional language.

5.4.1 Next Steps

This is the first of at least three phases of this project, and as such, there are many extensions and improvements planned.


5.4.1.1 Improvements in the Runtime

Future versions of the distributed runtime will allow the developer to specify distribution patterns for the distributed sequences, thus allowing them to take advantage of knowledge of the hardware and problem to improve the performance of the program. The current distributed sequence is not capable of distributing a 2D sequence such that a row spans nodes, or a 3D sequence such that a sub-matrix spans nodes. Allowing this will give the programmer more control over the distribution of work and will eventually be necessary as more parts of SequenceL become targets for distribution.

While giving the programmer more control over how to distribute data may allow them to improve performance, a large motivation behind this project is to take some of this decision making out of the programmer's hands. Therefore, a method of across-node work stealing will need to be implemented. This will allow nodes that finish sooner, either due to having better hardware or less time-consuming divisions of work, to assist other nodes in their tasks. A great deal of work will have to be put into ensuring that the communications from this distributed work stealing do not outweigh the performance gains.

5.4.1.2 Additional Distribution Targets

The next phase of the project will focus on the distribution of arbitrary NTs and simple indexed functions. This will require changes in the code generator portion of the SequenceL compiler to enable the automatic production of distributed C++ code. A simple indexed function is one with a simple memory access pattern which could easily be replaced with an NT.

The third phase will focus on the distribution of more complex indexed functions. These are indexed functions which access wide ranges of indexes from a single iteration of the indexed function. The implication of this access pattern is that nodes may need to access data that is stored on another node. This communication will have to be handled through a combination of compile-time and runtime heuristics.


CHAPTER VI PHASE 2

6.1 Introduction

This phase of the project extended the SequenceL compiler to enable the automatic distribution of computation for all SequenceL programs which contain viable NT'd operations or Indexed Function applications. On the distributed memory nodes, computations are performed utilizing automatic shared memory parallelizations. This is accomplished through a combination of compile-time and rudimentary run-time heuristics. At this stage, complete copies of the data exist on every node; only the computations are divided across the nodes. Even with these limitations, considerable performance was achieved on select programs, and opportunities for enhancements were discovered which would improve the performance of a larger class of programs.

6.2 Compiler Modifications

6.2.1 Program Targets

The modifications made to the compiler in this phase allow it to automatically generate correct hybrid code from any SequenceL program. However, only the computations of viable distribution sources, defined below, are executed in a distributed manner.

6.2.1.1 Parallelization Targets

Recall that SequenceL treats CSPs, NT'd operations, and the applications of Indexed Functions as sources of parallelizations.

Figure 6.1: Generic Call Graph

A viable parallelization source is an operation which is a parallelization source that meets a runtime complexity heuristic and does not have an operation ancestor which is a viable parallelization source. This runtime heuristic attempts to estimate the amount of work required to complete a given operation, in order to prevent the parallelization of a very small operation, where the overhead of parallelization would outweigh the operation itself.

Viable distribution sources are NT'd operations or applications of Indexed Functions that are themselves viable parallelization sources. In the intermediate code, NTs and Indexed Functions are treated in a similar manner: both are ultimately represented as some type of basic for-loop. It is for this reason that both of these parallel sources were targeted by this phase. It is also due to this similarity that, in the current implementation, all data exists on every distributed node and only computations are distributed. It is left as a future endeavor to produce heuristics which eliminate the need to have all data on all nodes.
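A sketch of the viability test described above follows; the operation type, work estimator, and threshold are stand-ins, since the exact form of the heuristic is not given here.

// Hypothetical types and helpers standing in for the compiler's own.
struct Op;                            // node in the operation tree
bool isParallelSource(const Op &op);  // CSP, NT'd op, or indexed function
long estimatedWork(const Op &op);     // runtime complexity estimate
bool hasViableAncestor(const Op &op); // any enclosing viable source?

const long kMinWork = 10000;          // assumed threshold

bool isViableParallelSource(const Op &op)
{
    return isParallelSource(op)
        && estimatedWork(op) >= kMinWork
        && !hasViableAncestor(op);
}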


6.2.1.2 Excluded Programs

There are two classes of programs which are excluded as targets of distribution in this phase of the project. The first is programs which have no sources of parallelisms. The second class consists of those whose only viable parallel sources are CSPs. SequenceL programs which contain no sources of parallelisms are trivial. The following are some examples of SequenceL functions which, if compiled as entry points, contain no sources of parallelisms.

constVal := sqrt(3);

myMax(a(0), b(0)) := a when a >= b else b;

myIndex(A(1), i(0)) := A[i];

The following code example is an implementation of quicksort, which is an example of a function whose only parallel sources are CSPs. While filtermin and filtermax on lines 6 and 7 appear to be viable Indexed Function parallelization sources, they are referenced from line 9, which is a viable CSP parallelization source. Figure 6.2 shows the operation tree for this function. The indexed function operations (B, C) occur below the CSP operation (A).

1
2  qsort(x(1)) :=
3      let
4          pivot := head(x);
5          rest := tail(x);
6          filtermin[i] := rest[i] when rest[i] < pivot;
7          filtermax[i] := rest[i] when rest[i] >= pivot;
8      in
9          qsort(filtermin) ++ [pivot] ++ qsort(filtermax)
10             when size(x) > 1
11         else
12             x;


Figure 6.2: quicksort Operation Tree

6.2.2 Runtime Additions

Further abstractions were added to the parallel for-loop used by the generated code. Instead of just subdividing the work across shared memory threads, each node now first computes which part of the sequence it will work across. Then each node calls the previous implementation of the TBB parallel for-loop abstraction to iterate over its computed subsequence. This is illustrated in Figure 6.3. This parallelization scheme differs from the standard MPI+X hybrid parallelization method. In most manually coded hybrid systems there is an outer loop which is parallelized using MPI and some inner loop which is parallelized using a shared memory parallelization method. There does not appear to be any immediate performance difference between the two approaches. Depending on the results of future investigations, it may in fact be beneficial to have the compiler generate code which more closely resembles the common manual approach.
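A sketch of the extended loop follows, under the assumption that the node-level split is the even block division from phase 1 and that the inner loop is TBB's range-based parallel_for; the actual abstraction's interface is not shown here.

#include <algorithm>
#include <mpi.h>
#include <tbb/parallel_for.h>

// Each node takes its block of [0, total), then TBB splits that block
// across the node's shared memory cores.
template <typename Body>
void distributedParallelFor(long total, const Body &body)
{
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    long base = total / nodes, extra = total % nodes;
    long begin = rank * base + std::min((long)rank, extra);
    long end = begin + base + (rank < extra ? 1 : 0);
    tbb::parallel_for(begin, end, [&](long i) { body(i); });
}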


Figure 6.3: Extended Parallel For-Loop Illustration

6.2.3 Generated Code Additions

As stated previously, the current implementation expects all data to be on every node. This is a simplifying approach to handling the unpredictable nature of indexed functions. While NT'd operations have a well defined data dependency, specifically that the ith element of the result depends only on the ith element of the overtyped sequence, indexed functions are not as predictable. In fact, the data dependency of each member of the result of an indexed function is, in general, undecidable. In the following example it is impossible to determine, a priori, the data dependency of any member of the result.

randIndexedFunction(A(2))[i,j] :=
    let
        sourceI := rand(1, size(A));
        sourceJ := rand(1, size(A[sourceI]));
    in
        A[sourceI, sourceJ];

However, there are some cases in which even a compile-time heuristic could determine the data dependency. In the following SequenceL function, it is obvious that every ith row of the output depends on both the ith and (i − 1)th rows of A.


upSum(A(2))[i,j] :=
    A[i-1,j] + A[i,j] when i > 1
    else
    A[i,j];

Due to the requirement of having all data on every node, all non-distributed computations take place on each node, and the results from each distributed node must be gathered on every node after the distributed execution of each viable distribution source. When a viable distribution source is distributed, each node executes its portion of the operation in parallel. The generated code uses a simple abstraction of TBB to automatically parallelize the operations in the shared memory environment. Figure 6.4 illustrates the overarching design of the code generated by the SequenceL compiler.
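A sketch of that gather step for a depth-1 floating point result follows, assuming the counts and displacements per node are derived from the same block partition used to divide the work.

#include <mpi.h>
#include <vector>

// After each node computes its block, every node receives the full
// result. counts[r] and displs[r] describe rank r's block.
void gatherEverywhere(const std::vector<double> &localPart,
                      std::vector<double> &fullResult,
                      const std::vector<int> &counts,
                      const std::vector<int> &displs)
{
    MPI_Allgatherv(localPart.data(), (int)localPart.size(), MPI_DOUBLE,
                   fullResult.data(), counts.data(), displs.data(),
                   MPI_DOUBLE, MPI_COMM_WORLD);
}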

6.3 Experimental Design

6.3.1 Test Programs

The programs used to test this phase of the project are mostly the same programs used in phase 1, however, with a key difference. In phase 1, changes had to be made to some of the programs to allow them to make use of the contributions of that phase. No changes need to be made to SequenceL programs or their C++ drivers for them to make use of the contributions of this phase. It is, however, possible that the contributions negatively impact the performance of a program under certain conditions.

6.3.2 Experimental Results

A complete listing of the experimental data is presented in Appendix B. The results of a few test programs are discussed in this section to illustrate the primary discoveries.

Figure 6.4: Illustration of the Phase 2 Distribution Scheme

6.3.2.1 Monte Carlo Mandelbrot Area

The MAN program was the best performing test program during this phase of the project. Figure 6.5(a) shows that the shared memory speed-up curve is maintained on 1, 2, and 3 nodes. Figure 6.5(b) shows that the core efficiency was, at worst, ≈60% on 24 cores.


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 6.5: Phase 2: MAN on HPC Cluster Performance Graphs


Table 6.1: Phase 2: Monte Carlo Mandelbrot Area on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.103356    33.8961       33.9995      1.000   100.000%
1      2               0.103063    17.1236       17.2267      1.974    98.683%
1      3               0.101596    11.3806       11.4822      2.961    98.702%
1      4               0.102975     8.52675       8.62972     3.940    98.495%
1      6               0.102693     5.89462       5.99731     5.669    94.485%
1      8               0.102228     4.48969       4.59191     7.404    92.553%
2      1               0.103675    17.6914       17.795       1.911    95.531%
2      2               0.103586     9.20653       9.31011     3.652    91.297%
2      3               0.103036     6.41649       6.51952     5.215    86.917%
2      4               0.104103     4.97243       5.07654     6.697    83.717%
2      6               0.10264      3.70381       3.80645     8.932    74.434%
2      8               0.105743     2.94783       3.05357    11.134    69.590%
3      1               0.102298    12.0829       12.1852      2.790    93.008%
3      2               0.104547     6.48857       6.59311     5.157    85.947%
3      3               0.103119     4.5806        4.68371     7.259    80.657%
3      4               0.104655     3.62716       3.73182     9.111    75.923%
3      6               0.104542     2.76444       2.86898    11.851    65.837%
3      8               0.103473     2.28939       2.39286    14.209    59.203%

The following program is the SequenceL code from the MAN program used in this phase. Line 7, which is an NT’d operation over the sequence points, is the source of all distributed parallelisms in this program. Due to the simplification assumptions made in this phase, the generated code must gather each floating point element of the result of line 7 on each node after the distributed computation. The reason this program performs well appears to be that there are a large number of operations for each element that must be communicated. In fact there are approximately 10,000 floating point operations per element of the result of line 7.


1
2  Complex ::= (r : float, i : float);
3
4  //-- entry point
5  compute_set_area( points(1), maxIter, threshold ) :=
6      let
7          outsideP := iterate_point(points, points, maxIter, threshold);
8          outside := sum(outsideP);
9          total := size( points );
10         inside := total - outside;
11         area := 2.0 * ( 2.5 * 1.125 ) * inside / total;
12     in
13         area;
14
15 iterate_point( point, original, iterations, threshold ) :=
16     let
17         next_point := ( r: (point.r ^ 2) - (point.i ^ 2) + original.r,
18                         i: point.r * point.i * 2.0 + original.i);
19     in
20         1 when (point.r ^ 2) + (point.i ^ 2) > threshold
21         else
22             iterate_point( next_point, original, iterations-1, threshold )
23             when iterations > 0
24         else
25             0;

Interestingly, though MAN is the best performing program, it also contains an example of a future planned optimization. We can see that outsideP, the result of the distribution source on line 7, is only used as an input to the sum function on line 8. In a situation like this, where the result of the distributed work is only used as input to some reduce operation, it would make sense to distribute the reduce. Such a change would allow the generated code to only have to gather a single floating point number, instead of the entire sequence.
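A sketch of that planned optimization: reduce locally first, then combine one scalar per node, rather than gathering the whole sequence and summing it on every node. The helper name is illustrative.

#include <mpi.h>
#include <vector>

// Local reduce first; only one double per node crosses the network.
double sumThenCombine(const std::vector<double> &localPart)
{
    double local = 0.0, total = 0.0;
    for (double x : localPart) local += x;   // node-local reduce
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;                            // identical on every node
}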


6.3.2.2 Matrix Multiply

MM was the second best performing program during this phase. While the speed-up curve is not as good as MAN's, we can see from Figure 6.6(a) that the speed-up curves on 2 and 3 nodes are scaled versions of the single node speed-up curve. This means that the distributed memory cores are utilized as effectively as the shared memory cores.

[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 6.6: Phase 2: MM on HPC Cluster Performance Graphs


Table 6.2: Phase 2: Matrix Multiply on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               2.801       15.9054       18.7064      1.000   100.000%
1      2               2.80337      8.58798      11.3914      1.642    82.108%
1      3               2.81318      6.47493       9.28812     2.014    67.134%
1      4               2.81006      5.60102       8.41108     2.224    55.600%
1      6               2.79742      4.8082        7.60562     2.460    40.992%
1      8               2.79653      4.75609       7.55262     2.477    30.960%
2      1               2.76837      9.2865       12.0549      1.552    77.588%
2      2               2.78198      5.06071       7.84269     2.385    59.630%
2      3               2.78977      4.02625       6.81602     2.744    45.741%
2      4               2.77265      3.32532       6.09796     3.068    38.346%
2      6               2.76819      2.94612       5.71431     3.274    27.280%
2      8               2.81801      2.6923        5.51031     3.395    21.217%
3      1               2.76854      7.327        10.0955      1.853    61.765%
3      2               2.789        3.98609       6.77509     2.761    46.018%
3      3               2.77285      3.23233       6.00518     3.115    34.612%
3      4               2.78227      2.68573       5.46801     3.421    28.509%
3      6               2.78875      2.2829        5.07165     3.688    20.491%
3      8               2.79039      2.31706       5.10745     3.663    15.261%

The following program is the SequenceL code from the MM program used in this phase.

mm(A(2), B(2))[i,j] := sum(A[i] * transpose(B)[j]);

This single line is responsible for generating a distributed + shared memory matrix multiply application capable of utilizing any number of nodes and cores. There is only a single viable distribution source in this program, which is the entry-point indexed function. As with MAN, this program performs well because it has a high number of operations for each element that must be communicated. If A is an m × p depth-2 sequence and B is a p × n depth-2 sequence, then there is a dot product of vectors of size p for each of the (m × n) elements that must be communicated.

6.3.2.3 LU Factorization

LU was the test program which demonstrated the worst performance in this phase. For all node counts greater than 1, the core efficiency is less than 1%.

[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 6.7: Phase 2: LU on HPC Cluster Performance Graphs


Table 6.3: Phase 2: LU Factorization on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.00163102  0.183863      0.185494     1.000   100.000%
1      2               0.00167608  0.132071      0.133747     1.387    69.345%
1      3               0.0021379   0.115873      0.118011     1.572    52.395%
1      4               0.00205398  0.114877      0.116931     1.586    39.659%
1      6               0.00202417  0.104502      0.106526     1.741    29.022%
1      8               0.00165296  0.109255      0.110908     1.673    20.906%
2      1               0.00163102  43.995        43.9966      0.004     0.211%
2      2               0.00169516  44.025        44.0267      0.004     0.105%
2      3               0.00167513  44.0009       44.0025      0.004     0.070%
2      4               0.00167584  43.9993       44.0009      0.004     0.053%
2      6               0.00166893  44.0314       44.0331      0.004     0.035%
2      8               0.00166297  44.0231       44.0248      0.004     0.026%
3      1               0.00166821  68.1026       68.1042      0.003     0.091%
3      2               0.00170684  67.4825       67.4842      0.003     0.046%
3      3               0.00167203  67.6178       67.6195      0.003     0.030%
3      4               0.0016849   66.1936       66.1953      0.003     0.023%
3      6               0.00168514  65.1418       65.1434      0.003     0.016%
3      8               0.0016911   65.5315       65.5332      0.003     0.012%

The following program is the SequenceL code from the LU program used in this phase. The only viable distribution sources are the calls to RMod and RScl, which occur on lines 10, 11, 16 and 19. Investigating these functions reveals the reason for such poor performance: each distribution source performs between 0 and 2 operations per element that must be communicated.


1
2  LU(X(2)) := Gen(X, size(X));
3
4  //-- Gen should keep returning a modified version of the matrix
5  //-- l is the row we are modifying
6  Gen(X(2), l(0)) :=
7      let
8          n := size(X);
9          A := Gen(X, l-1);
10         Q := RScl(A[l], l);
11         Q1 := RScl(X[1], 1);
12     in
13         A[1 ... l-1] ++ [Q]
14             when l = n
15         else
16             A[1 ... l-1] ++ [Q] ++ RMod(A[l+1 ... n], Q, l)
17             when l > 1
18         else
19             [Q1] ++ RMod(X[2 ... n], Q1, 1);
20
21 //-- R is the kth row, being modified. V is the previously done row
22 RMod(R(1), V(1), k(0))[i] :=
23     R[i] - R[k]*V[i] when i > k
24     else
25     R[i];
26
27 //-- Divides the row k to the right of the diagonal by the diagonal element
28 RScl(X(1), k(0))[i] := X[i]/X[k] when i > k else X[i];

6.3.2.4 PI Approximation

PI performs slightly better than LU, but still has sub 1% core efficiency for most core counts on node counts larger than 1.


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, and 3 nodes.]

Figure 6.8: Phase 2: PI on HPC Cluster Performance Graphs


Table 6.4: Phase 2: Pi Approximation on Server

Nodes  Cores per Node  Setup Time    Compute Time  Total Time  Speedup  Core Efficiency
1      1               0             2.48061       2.48061      1.000   100.000%
1      2               0             1.24411       1.24411      1.994    99.694%
1      3               9.53674E-07   0.864749      0.86475      2.869    95.620%
1      4               0             0.670534      0.670534     3.699    92.486%
1      6               0             0.479438      0.479438     5.174    86.233%
1      8               0             0.402257      0.402257     6.167    77.084%
2      1               0            71.3098       71.3098       0.035     1.739%
2      2               0            70.8853       70.8853       0.035     0.875%
2      3               0            70.4524       70.4524       0.035     0.587%
2      4               0            70.3217       70.3217       0.035     0.441%
2      6               0            70.499        70.499        0.035     0.293%
2      8               0            70.3717       70.3717       0.035     0.220%
3      1               0            74.2537       74.2537       0.033     1.114%
3      2               0            74.4895       74.4895       0.033     0.555%
3      3               9.53674E-07  73.6386       73.6386       0.034     0.374%
3      4               0            73.5555       73.5555       0.034     0.281%
3      6               0            73.5194       73.5194       0.034     0.187%
3      8               0            73.4172       73.4172       0.034     0.141%

The following program is the SequenceL code from the PI program used in this phase. The only viable distribution source is the NT of the function helpFindPI over the sequence 0 ... N−1 on line 3. This means that there are 6 operations per element in the communicated result, which is slightly more than LU, explaining the slightly better distributed performance.


1
2  piApprox( N(0) ) :=
3      sum( helpFindPI( 0 ... N-1, N ) ) * 1.0 / N;
4
5  helpFindPI ( w(0), N(0) ) :=
6      let
7          local := ( 1.0 * w + 0.5 ) / N;
8      in
9          4.0 / ( 1.0 + local * local );

Comparing the results from this phase to the results from phase 1, presented in section 5.3.2.3, shows the viability of the previously mentioned post-reduce-gather optimization. The primary difference between the code generated by the compiler changes in phase 2 and the hand coded program using the functionality from phase 1 is that, in the latter, the sum is done before the gather. This means that, in the automatically generated code from phase 2, the entire sequence result of helpFindPI( 0 ... N-1, N ) is communicated to every node, only to be immediately summed down to a single number. An optimization which moved the gather after reduce functions would increase the number of operations per communicated element from 6 to 7N/⟨nodeCount⟩. In the above case of 3 nodes, the 6 operations would increase to just over 233 million operations.


CHAPTER VII CONCLUSIONS & FUTURE WORK

This chapter presents conclusions on the results of this research and outlines opportunities for future research. The goal of this research was to extend the SequenceL compiler, enabling it to produce scalable hybrid distributed and shared memory C++. This work accomplishes that goal for a specific class of SequenceL programs and provides a path forward to increase the number of programs for which the compiler works.

7.1 Conclusions

The results of this work, primarily phase 2, prove the hypothesis for a certain class of SequenceL programs, specifically those with a high ratio of operations per communicated result. This shows that it is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language.

It is to be expected that such an automatic approach would work for only a subclass of problems, especially at such an early stage. The current version of the compiler also has a few key limitations that are addressed in Section 7.2.2, along with improvements planned to address them.

7.1.1 Contributions

The result of this research is the first SequenceL compiler which targets hybrid distributed and shared memory architectures. This compiler is capable of creating executables that take advantage of an arbitrary distributed environment, meaning a network of any number of nodes, each containing any number of cores. To reiterate, the specific contributions of this work are as follows.

1. Extensions to the SequenceL runtime library allowing the user to make small changes to their C++ driver program, enabling it to run in an arbitrary distributed environment.

2. Modifications to the SequenceL code generator which allow it to automatically produce hybrid distributed & shared memory C++ code from any SequenceL program.

3. Extensions to the SequenceL runtime library which facilitate the efficient distribution of compiled SequenceL programs.

4. Definition of a metric to predict the performance of this generated code.

5. Targets for future performance improvements.

6. Discovery that the hypothesis is true for a certain class of programs.

7.2 Future Work

7.2.1 Optimizations

It is apparent that improved performance is dependent upon minimizing the amount of data transfer required per unit of work. In order to improve performance either the amount of data transfer must be reduced or the amount of distributed work must be increased. Due to this, the following optimizations have been targeted as future improvements.

As discussed in Section 6.3.2, moving reduce operations (e.g., sum, product, etc.) before the communication step would greatly reduce the amount of communicated data. This Post-Reduce-Gather optimization could also be applied to user-defined functions that meet certain conditions.

Some combination of static and dynamic analysis could, in some cases, allow each node to transmit only necessary data to each node. This is in contrast to the current model, which transmits the results from each node to every other node. This optimization is closely related to the removal of the need to have complete copies of all data on every node.

7.2.2 Improvements

The following improvements are planned to address the limitations of the current compiler.

As was mentioned previously, the current runtime requires that all data is present on every node. The removal of this assumption would both improve performance, by reducing the amount of data that must be communicated, and allow programs to operate on much larger data sets. Early in phase 2, this was attempted using MPI's RMA (remote memory access) functionality. However, it was soon discovered that no implementation of MPI met the thread safety requirements of RMA. It is possible that some approximation of RMA could be implemented to allow the direct distribution of data. Another approach is a combination of static and dynamic analysis to determine data dependencies. Most likely, a combination of these two approaches will provide adequate improvements in performance and memory usage.

Non-uniform nodes and non-uniform computations cause problems for the current version of the compiler. The current runtime divides work across nodes by evenly dividing the indexes of the result. Such a distribution scheme frequently causes load imbalance. Some way to specify the relative power of each node would allow the runtime to more evenly divide work across the nodes. However, this would only address issues of uniform computations. Some form of distributed work stealing would be needed to completely address the issue of across-node load balancing.

SequenceL has only just begun to make an impact, providing programmers with a new and effective method for solving many difficult parallel programming problems. This work has added to the development of the SequenceL language. It will be interesting to see the results of continued research into extensions to SequenceL and its uses as it makes strides into the future.


INDEX

C
Consume Simplify Produce, 8
core efficiency, 28

E
effective core efficiency, 29
effective core speed-up, 29
entry point, 10
entry-level NT, 32

I
indexed function, 7
Intel's TBB, 17
intended use, 10
intermediate language, 10

M
Message Passing Interface, 19

N
NT'd entry-point, 32

O
OpenMP, 15
overtyped, 7

P
Pthreads, 13

S
sequence class, 11
SequenceL, 5
SequenceL runtime library, 10
speed-up, 28

V
viable distribution source, 53
viable parallelization source, 52


APPENDIX A SequenceL Grammar


SequenceL Grammar

⟨program⟩ ::= ⟨line⟩+

⟨line⟩ ::= ⟨public⟩ | ⟨import⟩ | ⟨signature⟩ | ⟨foreignFunc⟩ | ⟨typeDef⟩ | ⟨functionDef⟩

⟨public⟩ ::= 'public' ⟨id⟩+ ';'

⟨import⟩ ::= 'import' ⟨from⟩ ⟨file⟩ ⟨as⟩ ';'

⟨file⟩ ::= ⟨string⟩ | '<' ⟨char⟩* '>'

⟨from⟩ ::= ⟨id⟩ 'from' | '*' 'from' | ⟨empty⟩

⟨as⟩ ::= 'as' ⟨id⟩ | 'as' ⟨id⟩ '::' ⟨id⟩ | ⟨empty⟩

⟨signature⟩ ::= ⟨id⟩ ⟨typeParams⟩ ':' ⟨type⟩+ '->' ⟨type⟩ ';'
             | ⟨id⟩ ⟨typeParams⟩ ':' ⟨typeBase⟩ '(' ⟨int⟩ ')' ';'
             | ⟨id⟩ ⟨typeParams⟩ ':' ⟨typeID⟩ ';'

⟨type⟩ ::= ⟨typeBase⟩ | ⟨typeBase⟩ '(' ⟨int⟩ ')'

⟨typeBase⟩ ::= '(' ⟨type⟩+ '->' ⟨type⟩ ')' ⟨typeInputs⟩ | ⟨typeID⟩

⟨typeID⟩ ::= ⟨id⟩ '::' ⟨id⟩ ⟨typeInputs⟩ | ⟨id⟩ ⟨typeInputs⟩

⟨typeDef⟩ ::= ⟨id⟩ ⟨typeParams⟩ '::= (' ⟨typeVals⟩ ');'

⟨typeVals⟩ ::= ⟨id⟩ ':' ⟨type⟩ ',' ⟨typeVals⟩ | ⟨id⟩ ':' ⟨type⟩

⟨typeParams⟩ ::= '<' ⟨id⟩+ '>' | ⟨empty⟩

⟨typeInputs⟩ ::= '<' ⟨type⟩+ '>' | ⟨empty⟩

⟨foreignFunc⟩ ::= ⟨id⟩ '(' ⟨type⟩* ')' '->' ⟨type⟩ ';' | ⟨id⟩ '->' ⟨type⟩ ';'

⟨functionDef⟩ ::= ⟨id⟩ ⟨inputs⟩ ⟨indexes⟩ ':=' ⟨letBody⟩ ⟨exp⟩ ⟨foreach⟩ ';'

⟨letBody⟩ ::= 'let' ⟨let⟩+ 'in' | ⟨empty⟩

⟨let⟩ ::= ⟨id⟩ ⟨indexes⟩ ':=' ⟨exp⟩ ⟨foreach⟩ ';'

⟨inputs⟩ ::= '(' ⟨arg⟩* ')' | ⟨empty⟩

⟨arg⟩ ::= ⟨id⟩ '(' ⟨int⟩ ')' | ⟨id⟩

⟨indexes⟩ ::= '[' ⟨id⟩+ ']' | ⟨empty⟩

⟨foreach⟩ ::= 'foreach' ⟨within⟩+ | ⟨empty⟩

⟨within⟩ ::= ⟨id⟩ 'within' ⟨exp⟩

⟨exp⟩ ::= ⟨exp⟩ 'when' ⟨exp⟩ 'else' ⟨exp⟩
       | ⟨exp⟩ 'when' ⟨exp⟩
       | ⟨exp⟩ '[' ⟨exp⟩+ ']'
       | ⟨exp⟩ '.' ⟨id⟩
       | ⟨exp⟩ ⟨inOp⟩ ⟨exp⟩
       | ⟨preOp⟩ ⟨exp⟩
       | ⟨exp⟩ '(' ⟨exp⟩* ')'
       | ⟨exp⟩ '->' ⟨type⟩
       | ⟨id⟩
       | ⟨id⟩ '::' ⟨id⟩
       | ⟨sequence⟩
       | ⟨struct⟩
       | ⟨int⟩ | ⟨float⟩ | ⟨string⟩ | ''' ⟨char⟩ ''' | ⟨bool⟩

⟨sequence⟩ ::= '[' ⟨exp⟩* ']'

⟨struct⟩ ::= '(' ⟨structVal⟩+ ')'

⟨structVal⟩ ::= ⟨id⟩ ':' ⟨exp⟩

⟨inOp⟩ ::= '+' | '-' | '*' | '/' | '^' | '...' | '++' | 'mod' | 'and' | 'or' | '>' | '<' | '=' | '>=' | '<='

⟨preOp⟩ ::= '+' | '-' | 'not'

One to many (+) and zero to many (*) have implied delimiters when necessary, usually commas.


APPENDIX B Experimental Data


Phase 1

Virtual Network Results

[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, 3, and 4 nodes.]

Figure B.1: Phase 1: 2DFFT on Virtual Network Performance Graphs

Table B.1: Phase 1: 2DFFT on Virtual Network

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               3.18575     87.9887       91.1744      1.000   100.000%
1      2               4.15407     45.3765       49.5306      1.841    92.038%
1      3               4.16444     31.0377       35.2022      2.590    86.334%
1      4               4.46466     23.8007       28.2654      3.226    80.641%
2      1               2.92359     45.0825       48.0061      1.899    94.961%
2      2               3.80764     23.5212       27.3289      3.336    83.405%
2      3               3.80362     22.0301       25.8337      3.529    58.821%
2      4               4.1535      19.3877       23.5412      3.873    48.412%
3      1               2.99301     31.6698       34.6628      2.630    87.677%
3      2               3.8761      21.3462       25.2223      3.615    60.247%
4      1               2.94368     23.4463       26.39        3.455    86.372%
4      2               4.81491     19.735        24.5499      3.714    46.423%


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, 3, and 4 nodes.]

Figure B.2: Phase 1: BHUT on Virtual Network Performance Graphs

Table B.2: Phase 1: BHUT on Virtual Network

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.00730491  57.2833       57.2906      1.000   100.000%
1      2               0.00986099  30.1736       30.1834      1.898    94.904%
1      3               0.010277    20.2618       20.272       2.826    94.203%
1      4               0.010452    16.259        16.2695      3.521    88.034%
2      1               0.0526149   29.311        29.3637      1.951    97.553%
2      2               0.009974    16.7222       16.7322      3.424    85.599%
2      3               0.01037     14.7571       14.7675      3.880    64.658%
2      4               0.0108259   13.6333       13.6441      4.199    52.487%
3      1               0.086616    20.3969       20.4836      2.797    93.230%
3      2               0.045918    15.0668       15.1127      3.791    63.182%
4      1               0.106832    18.325        18.4318      3.108    77.706%
4      2               0.0624242   15.8894       15.9519      3.591    44.893%


[Figure: two panels, (a) Speedup vs. Total Core Count and (b) Core Efficiency (%) vs. Total Core Count, with curves for 1, 2, 3, and 4 nodes.]

Figure B.3: Phase 1: GOL on Virtual Network Performance Graphs

Table B.3: Phase 1: GOL on Virtual Network

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.309608    78.7004       79.01       1.000    100.000%
1      2               0.43928     40.0815       40.5208     1.950    97.493%
1      3               0.426923    27.9888       28.4157     2.781    92.684%
1      4               0.442657    21.4728       21.9154     3.605    90.131%
2      1               0.352934    40.1058       40.4587     1.953    97.643%
2      2               0.391961    20.9084       21.3003     3.709    92.733%
2      3               0.398523    21.9276       22.3262     3.539    58.982%
2      4               0.38385     20.7646       21.1485     3.736    46.700%
3      1               0.556993    27.8046       28.3615     2.786    92.861%
3      2               0.524958    21.8347       22.3596     3.534    58.893%
4      1               0.618151    22.0277       22.6458     3.489    87.224%
4      2               0.61639     20.3765       20.9929     3.764    47.046%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-4 nodes.]

Figure B.4: Phase 1: LU on Virtual Network Performance Graphs

Table B.4: Phase 1: LU on Virtual Network

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.840606    100.146       100.986     1.000    100.000%
1      2               0.461647    50.6314       51.093      1.977    98.826%
1      3               0.463328    35.6726       36.1359     2.795    93.154%
1      4               0.467158    27.9252       28.3923     3.557    88.920%
2      1               0.503873    51.3219       51.8258     1.949    97.428%
2      2               0.385339    28.4725       28.8579     3.499    87.486%
2      3               0.384207    27.6261       28.0103     3.605    60.089%
2      4               0.385248    25.9308       26.3161     3.837    47.968%
3      1               0.601708    36.2652       36.8669     2.739    91.307%
3      2               0.381801    28.1864       28.5682     3.535    58.915%
4      1               0.698166    30.7561       31.4543     3.211    80.264%
4      2               0.683098    26.6732       27.3563     3.692    46.144%

Networked PCs Results

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.5: Phase 1: 2DFFT on Networked PCs Performance Graphs

Table B.5: Phase 1: 2DFFT on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               2.99549     83.5385       86.5339     1.000    100.000%
1      2               3.93834     42.9897       46.928      1.844    92.199%
1      4               3.97626     22.9186       26.8949     3.217    80.437%
1      7               3.93744     16.9633       20.9007     4.140    59.146%
2      1               3.62163     43.1578       46.7794     1.850    92.491%
2      2               5.13442     22.7006       27.8351     3.109    77.720%
2      4               5.14956     13.0687       18.2183     4.750    59.373%
2      7               5.16318     9.26352       14.4267     5.998    42.844%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.6: Phase 1: BHUT on Networked PCs Performance Graphs

Table B.6: Phase 1: BHUT on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.0110981   53.9749       53.986      1.000    100.000%
1      2               0.00925493  28.1387       28.1479     1.918    95.897%
1      4               0.0166941   17.9364       17.9531     3.007    75.176%
1      7               0.017149    11.8905       11.9077     4.534    64.767%
2      1               0.054904    27.3356       27.3905     1.971    98.549%
2      2               0.013567    15.0842       15.0977     3.576    89.394%
2      4               0.014364    8.92134       8.93571     6.042    75.520%
2      7               0.0147569   7.11351       7.12827     7.574    54.096%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.7: Phase 1: GOL on Networked PCs Performance Graphs

Table B.7: Phase 1: GOL on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.338985    73.0183       73.3573     1.000    100.000%
1      2               0.395673    37.5999       37.9956     1.931    96.534%
1      4               0.39677     20.2381       20.6349     3.555    88.875%
1      7               0.392261    19.4002       19.7925     3.706    52.947%
2      1               0.37765     52.8252       53.2028     1.379    68.941%
2      2               0.522473    27.6775       28.1999     2.601    65.033%
2      4               0.515173    16.1255       16.6406     4.408    55.104%
2      7               0.507177    11.4833       11.9904     6.118    43.700%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.8: Phase 1: LU on Networked PCs Performance Graphs

Table B.8: Phase 1: LU on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.377332    92.9747       93.352      1.000    100.000%
1      2               0.407887    48.1769       48.5848     1.921    96.071%
1      4               0.411591    26.3635       26.7751     3.487    87.163%
1      7               0.448391    22.178        22.6264     4.126    58.940%
2      1               0.448483    71.167        71.6155     1.304    65.176%
2      2               0.458509    37.6597       38.1182     2.449    61.225%
2      4               0.439309    21.2979       21.7372     4.295    53.682%
2      7               0.44791     15.2419       15.6898     5.950    42.499%

HPC Cluster Results

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.9: Phase 1: 2DFFT on HPC Cluster Performance Graphs

Table B.9: Phase 1: 2DFFT on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               1.71168     967.148       968.859     1.000    100.000%
1      2               1.82188     484.377       486.199     1.993    99.636%
1      3               1.915       322.826       324.741     2.983    99.449%
1      4               1.80683     239.366       241.172     4.017    100.432%
1      6               1.89583     170.265       172.161     5.628    93.794%
1      8               1.8457      136.988       138.833     6.979    87.232%
2      1               1.48596     479.349       480.835     2.015    100.748%
2      2               2.07797     242.575       244.653     3.960    99.003%
2      3               1.65573     162.72        164.376     5.894    98.236%
2      4               1.66913     123.857       125.526     7.718    96.480%
2      6               1.62614     89.765        91.3912     10.601   88.344%
2      8               2.10392     73.2143       75.3183     12.864   80.397%
3      1               1.43835     324.243       325.682     2.975    99.162%
3      2               2.18025     162.544       164.724     5.882    98.029%
3      3               2.10992     113.647       115.757     8.370    92.997%
3      4               2.16489     86.0461       88.211      10.983   91.529%
3      6               2.12935     60.017        62.1464     15.590   86.611%
3      8               2.02814     51.6235       53.6516     18.058   75.243%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.10: Phase 1: BHUT on HPC Cluster Performance Graphs

Table B.10: Phase 1: BHUT on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.0283668   919.332       919.36      1.000    100.000%
1      2               0.0296638   467.26        467.29      1.967    98.371%
1      3               0.0290241   309.675       309.705     2.969    98.950%
1      4               0.0308888   232.444       232.475     3.955    98.867%
1      6               0.0344329   163.578       163.613     5.619    93.652%
1      8               0.0327809   129.351       129.384     7.106    88.821%
2      1               0.042603    461.826       461.868     1.991    99.526%
2      2               0.0444262   233.127       233.171     3.943    98.571%
2      3               0.022047    155.219       155.241     5.922    98.702%
2      4               0.019726    117.379       117.399     7.831    97.888%
2      6               0.020216    82.4519       82.4721     11.148   92.896%
2      8               0.0229199   65.1127       65.1357     14.115   88.216%
3      1               0.049849    313.983       314.032     2.928    97.587%
3      2               0.041414    155.835       155.877     5.898    98.300%
3      3               0.06374     104.843       104.906     8.764    97.374%
3      4               0.0444741   77.6884       77.7328     11.827   98.560%
3      6               0.06054     55.9507       56.0112     16.414   91.188%
3      8               0.0512991   44.3217       44.373      20.719   86.329%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.11: Phase 1: GOL on HPC Cluster Performance Graphs

Table B.11: Phase 1: GOL on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               3.59295     1001.93       1005.52     1.000    100.000%
1      2               3.87529     508.809       512.684     1.961    98.064%
1      3               3.97689     339.077       343.054     2.931    97.703%
1      4               4.08498     253.242       257.327     3.908    97.689%
1      6               3.86303     181.239       185.102     5.432    90.537%
1      8               3.88707     141.394       145.281     6.921    86.515%
2      1               3.06548     527.028       530.094     1.897    94.844%
2      2               3.47607     259.742       263.218     3.820    95.503%
2      3               3.44719     170.188       173.635     5.791    96.517%
2      4               3.47428     127.059       130.534     7.703    96.289%
2      6               3.47319     91.5147       94.9879     10.586   88.215%
2      8               3.46338     71.6609       75.1243     13.385   83.655%
3      1               2.90471     350.255       353.16      2.847    94.907%
3      2               3.20054     192.27        195.47      5.144    85.735%
3      3               3.21783     114.331       117.549     8.554    95.045%
3      4               3.21912     87.0142       90.2333     11.144   92.863%
3      6               3.41491     64.5358       67.9507     14.798   82.210%
3      8               3.19079     49.017        52.2078     19.260   80.250%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.12: Phase 1: LU on HPC Cluster Performance Graphs

Table B.12: Phase 1: LU on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               1.8474      892.886       894.733     1.000    100.000%
1      2               1.84473     444.219       446.063     2.006    100.292%
1      3               1.83086     294.426       296.257     3.020    100.671%
1      4               1.84063     222.476       224.317     3.989    99.717%
1      6               1.83259     158.27        160.102     5.589    93.142%
1      8               1.85116     124.53        126.381     7.080    88.496%
2      1               1.44043     449.362       450.802     1.985    99.238%
2      2               1.44891     223.926       225.375     3.970    99.249%
2      3               1.44626     148.142       149.588     5.981    99.689%
2      4               1.46207     111.346       112.808     7.931    99.143%
2      6               1.44428     78.7933       80.2375     11.151   92.925%
2      8               1.44469     63.1795       64.6242     13.845   86.532%
3      1               1.30904     298.618       299.927     2.983    99.439%
3      2               1.36914     149.806       151.175     5.919    98.642%
3      3               1.34805     101.284       102.632     8.718    96.865%
3      4               1.33738     78.0381       79.3755     11.272   93.935%
3      6               1.31007     53.878        55.1881     16.212   90.069%
3      8               1.33851     41.9547       43.2932     20.667   86.112%

Phase 2

Networked PCs

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.13: Phase 2: 2DFFT on Networked PCs Performance Graphs

Table B.13: Phase 2: 2DFFT on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.545915    23.8616       24.4076     1.000    100.000%
1      2               0.546412    12.7861       13.3325     1.831    91.534%
1      3               0.512367    11.1571       11.6695     2.092    69.719%
1      4               0.545614    8.36125       8.90686     2.740    68.508%
1      6               0.546068    7.02937       7.57544     3.222    53.699%
2      1               0.546295    18.5179       19.0642     1.280    64.014%
2      2               0.546768    12.8004       13.3472     1.829    45.717%
2      3               0.511924    11.9861       12.498      1.953    32.549%
2      4               0.545241    10.475        11.0203     2.215    27.685%
2      7               0.511812    20.0539       20.5657     1.187    8.477%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.14: Phase 2: BHUT on Networked PCs Performance Graphs

Table B.14: Phase 2: BHUT on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.0127161   46.1251       46.1378     1.000    100.000%
1      2               0.0128019   26.1269       26.1397     1.765    88.252%
1      3               0.0130088   19.3068       19.3198     2.388    79.604%
1      4               0.013025    16.3662       16.3792     2.817    70.421%
1      6               0.0131621   12.7968       12.8099     3.602    60.029%
2      1               0.0129352   26.3471       26.36       1.750    87.515%
2      2               0.0131669   15.7793       15.7924     2.922    73.038%
2      3               0.0182052   12.8192       12.8374     3.594    59.900%
2      4               0.012557    11.063        11.0756     4.166    52.071%
2      7               0.0126212   8.79704       8.80966     5.237    37.408%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.15: Phase 2: GOL on Networked PCs Performance Graphs

Table B.15: Phase 2: GOL on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               3.0375      3.21103       6.24853     1.000    100.000%
1      2               3.05685     1.74295       4.79981     1.302    65.091%
1      3               3.03934     1.31318       4.35253     1.436    47.854%
1      4               2.86046     1.18227       4.04273     1.546    38.641%
1      6               2.84363     1.03923       3.88286     1.609    26.821%
2      1               3.05438     9.30915       12.3635     0.505    25.270%
2      2               2.8431      8.286         11.1291     0.561    14.036%
2      3               2.84233     7.98811       10.8304     0.577    9.616%
2      4               3.03951     7.89348       10.933      0.572    7.144%
2      7               2.84499     7.79574       10.6407     0.587    4.194%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.16: Phase 2: LU on Networked PCs Performance Graphs

Table B.16: Phase 2: LU on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.00155902  0.243575      0.245134    1.000    100.000%
1      2               0.00156689  0.157385      0.158952    1.542    77.109%
1      3               0.00155902  0.137407      0.138966    1.764    58.800%
1      4               0.00203085  0.136871      0.138902    1.765    44.120%
1      6               0.0016911   0.138976      0.140667    1.743    29.044%
2      1               0.00154209  4.81718       4.81872     0.051    2.544%
2      2               0.00201797  4.811         4.81302     0.051    1.273%
2      3               0.00201297  4.81258       4.81459     0.051    0.849%
2      4               0.00157189  4.80908       4.81065     0.051    0.637%
2      7               0.00157595  4.81493       4.81651     0.051    0.364%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.17: Phase 2: MM on Networked PCs Performance Graphs

Table B.17: Phase 2: MM on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               3.61968     14.7373       18.357      1.000    100.000%
1      2               3.40318     8.13283       11.536      1.591    79.564%
1      3               3.40566     6.43653       9.84219     1.865    62.171%
1      4               3.40704     6.32603       9.73306     1.886    47.151%
1      6               3.41402     6.48873       9.90275     1.854    30.895%
2      1               3.40929     8.34482       11.7541     1.562    78.088%
2      2               3.64225     4.53478       8.17703     2.245    56.124%
2      3               3.40784     4.52429       7.93213     2.314    38.571%
2      4               3.40634     3.60711       7.01345     2.617    32.717%
2      7               3.62763     3.56108       7.18871     2.554    18.240%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.18: Phase 2: MAN on Networked PCs Performance Graphs

Table B.18: Phase 2: MAN on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.118973    22.286        22.4049     1.000    100.000%
1      2               0.126268    11.2246       11.3509     1.974    98.692%
1      3               0.126243    7.84829       7.97453     2.810    93.652%
1      4               0.126466    6.32729       6.45375     3.472    86.790%
1      6               0.129372    4.36242       4.4918      4.988    83.133%
2      1               0.126132    11.2297       11.3558     1.973    98.650%
2      2               0.126662    5.74693       5.87359     3.815    95.363%
2      3               0.126212    4.17044       4.29665     5.215    86.908%
2      4               0.126068    3.23659       3.36266     6.663    83.286%
2      7               0.126933    2.26256       2.38949     9.376    66.975%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-2 nodes.]

Figure B.19: Phase 2: PI on Networked PCs Performance Graphs

Table B.19: Phase 2: PI on Networked PCs

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               6           1.56296       1.56296     1.000    100.000%
1      2               7           0.566848      0.566848    2.757    137.864%
1      3               0           0.501775      0.501775    3.115    103.829%
1      4               0           0.366938      0.366938    4.259    106.487%
1      6               0           0.35553       0.35553     4.396    73.269%
2      1               6           8.58587       8.58587     0.182    9.102%
2      2               7           7.86934       7.86934     0.199    4.965%
2      3               0           117.949       117.949     0.013    0.221%
2      4               0           178.023       178.023     0.009    0.110%
2      7               0           8.56171       8.56171     0.183    1.304%

HPC Cluster

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.20: Phase 2: 2DFFT on HPC Cluster Performance Graphs

Table B.20: Phase 2: 2DFFT on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.470559    34.3722       34.8428     1.000    100.000%
1      2               0.469162    17.4731       17.9422     1.942    97.097%
1      3               0.469891    11.4274       11.8973     2.929    97.621%
1      4               0.470218    8.63592       9.10614     3.826    95.657%
1      6               0.469979    6.0915        6.56148     5.310    88.503%
1      8               0.470594    4.83822       5.30881     6.563    82.040%
2      1               0.46846     76.5561       77.0246     0.452    22.618%
2      2               0.467929    67.5216       67.9896     0.512    12.812%
2      3               0.467906    64.4614       64.9293     0.537    8.944%
2      4               0.468017    63.068        63.536      0.548    6.855%
2      6               0.468981    62.0125       62.4815     0.558    4.647%
2      8               0.468252    61.3059       61.7741     0.564    3.525%
3      1               0.467521    71.4259       71.8934     0.485    16.155%
3      2               0.467885    65.468        65.9359     0.528    8.807%
3      3               0.467485    63.5698       64.0373     0.544    6.046%
3      4               0.468674    63.322        63.7907     0.546    4.552%
3      6               0.474506    62.1635       62.638      0.556    3.090%
3      8               0.46928     61.7451       62.2143     0.560    2.334%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.21: Phase 2: BHUT on HPC Cluster Performance Graphs

Table B.21: Phase 2: BHUT on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               0.01947     17.9564       17.9758     1.000    100.000%
1      2               0.0186908   9.6395        9.65819     1.861    93.060%
1      3               0.01968     6.38983       6.40951     2.805    93.485%
1      4               0.0184259   4.93965       4.95807     3.626    90.639%
1      6               0.0194139   3.64521       3.66463     4.905    81.754%
1      8               0.0199661   2.992         3.01196     5.968    74.602%
2      1               0.017477    10.5543       10.5718     1.700    85.018%
2      2               0.0178051   6.19462       6.21243     2.894    72.338%
2      3               0.0172551   4.55802       4.57528     3.929    65.482%
2      4               0.0176408   3.84379       3.86143     4.655    58.190%
2      6               0.0177631   3.07681       3.09458     5.809    48.407%
2      8               0.0174029   2.78545       2.80286     6.413    40.084%
3      1               0.0176339   8.40781       8.42545     2.134    71.117%
3      2               0.0177319   5.40249       5.42022     3.316    55.274%
3      3               0.0173349   4.27951       4.29684     4.183    46.483%
3      4               0.0176671   3.81795       3.83562     4.687    39.055%
3      6               0.0175881   3.42628       3.44387     5.220    28.998%
3      8               0.017262    3.16511       3.18237     5.649    23.536%

[Plot omitted: (a) Speedup vs. Total Core Count; (b) Core Efficiency (%) vs. Total Core Count; one curve each for 1-3 nodes.]

Figure B.22: Phase 2: GOL on HPC Cluster Performance Graphs

Table B.22: Phase 2: GOL on HPC Cluster

Nodes  Cores per Node  Setup Time  Compute Time  Total Time  Speedup  Core Efficiency
1      1               2.3301      4.96949       7.29959     1.000    100.000%
1      2               2.3468      2.5795        4.9263      1.482    74.088%
1      3               2.34669     1.83221       4.1789      1.747    58.226%
1      4               2.3294      1.44623       3.77563     1.933    48.334%
1      6               2.32971     1.0289        3.35861     2.173    36.223%
1      8               2.34394     0.892344      3.23629     2.256    28.194%
2      1               2.34436     72.4513       74.7956     0.098    4.880%
2      2               2.32878     70.9246       73.2534     0.100    2.491%
2      3               2.32759     70.2176       72.5452     0.101    1.677%
2      4               2.34547     69.9661       72.3115     0.101    1.262%
2      6               2.33939     69.8249       72.1643     0.101    0.843%
2      8               2.32761     69.6029       71.9305     0.101    0.634%
3      1               2.33289     73.4442       75.7771     0.096    3.211%
3      2               2.34435     71.8298       74.1741     0.098    1.640%
3      3               2.32798     71.1064       73.4344     0.099    1.104%
3      4               2.33867     70.9994       73.338      0.100    0.829%
3      6               2.33168     70.7754       73.1071     0.100    0.555%
3      8               2.33281     70.8083       73.1411     0.100    0.416%

APPENDIX C
Test Programs


Make.inc

CPP = mpic++
SLC = $(SL_HOME)/bin/slc
SL_FLAGS = -p
C_OPT_FLAGS = -std=c++11 -O3
LINKER_OPT_FLAGS = -O3

all: $(TEST)

SL_Generated.cpp: $(TEST).sl
	$(SLC) $(SL_FLAGS) -c $(TEST).sl $(INTENDED_USE) -o SL_Generated

SL_Generated.o: SL_Generated.cpp
	$(CPP) $(C_OPT_FLAGS) -msse3 -c SL_Generated.cpp -I$(SL_HOME)/include

$(TEST).o: $(TEST).cpp SL_Generated.cpp
	$(CPP) $(C_OPT_FLAGS) -msse3 -c $(TEST).cpp -I$(SL_HOME)/include

$(TEST): $(TEST).o SL_Generated.o
	$(CPP) $(LINKER_OPT_FLAGS) $^ -o $(TEST) -L$(SL_HOME)/lib -L/usr/local/lib -lslrt -ltbb -ltbbmalloc -lpthread

clean:
	rm -rf *.o SL_Generated.* $(TEST)
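Each test directory's Makefile (shown with each program below) defines TEST and INTENDED_USE and then includes this file: slc first translates $(TEST).sl into SL_Generated.cpp for the requested entry point, and mpic++ then compiles both translation units and links them against the SequenceL runtime (-lslrt) and TBB.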

Execution Command

mpiexec -x LD_LIBRARY_PATH --hostfile ../mpi_hosts -np $nodes --bind-to none ./$TEST $cores $SEED $input
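Here $nodes is the number of MPI processes launched (one per node), while the driver arguments are the thread count per process ($cores), the random seed ($SEED), and the test-specific problem-size arguments ($input). A hypothetical GOL invocation, matching the driver argument order shown later in this appendix, would be:

    mpiexec -x LD_LIBRARY_PATH --hostfile ../mpi_hosts -np 2 --bind-to none ./gol 4 12345 100 100 1000 0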


2D Fast Fourier Transformation

SequenceL Source

import ;
import ;
import ;

//-- Evens and Odds
even : T(1) -> T(1);
even(s(1)) :=
    let
        n := size(s)/2;
    in
        s[(1...n)*2-1];

odd : T(1) -> T(1);
odd(s(1)) :=
    let
        n := size(s)/2;
    in
        s[(1...n)*2];

//-- Misc Functions
r2c(a(0)) := (Real:a, Imaginary:0.0);

pad2n : float(1) * int -> float(1);
pad2n(s(1),n(0)) :=
    let
        m := size(s);
    in
        s ++ duplicate(0.0, n-m);

cpad2n : Complex(1) * int -> Complex(1);
cpad2n(s(1),n(0)) :=
    let
        m := size(s);
    in
        s ++ duplicate((Real:0.0, Imaginary:0.0), n-m);

nxtpof2 : number -> int;
nxtpof2(n(0)) := nxtp2r(n,1);

nxtp2r : number * number -> number;
nxtp2r(n(0),a(0)) := a when a >= n else nxtp2r(n,a*2);

//-- Computing "Twiddle" factors
we : number * number -> Complex;
we(i(0),N(0)) :=
    (Real:cos(2.0*pi*i/N), Imaginary:sin(2.0*pi*i/N));
wc : int -> Complex(1);
wc(N(0)) := we(0...(N-1),N);

//-- Recursive FFT - Decimation in Time
bfly : Complex(1) * Complex(1) * Complex(1) -> Complex(1);
bfly(x(1),y(1),w(1)) :=
    let
        z := complexMultiply(y,w);
    in
        complexAdd(x,z) ++ complexSubtract(x,z);

fftr : Complex(1) * Complex(1) -> Complex(1);
fftr(x(1),w(1)) :=
    let
        m := size(x)/2;
        w2 := even(w);
        w1 := w[1...m];
        f1 := fftr(even(x),w2);
        f2 := fftr(odd(x),w2);
    in
        x when size(x) < 2
        else
        bfly(f1,f2,w1);

//-- fft1D
//-- arbitrary n, real seq
//-- arbitrary n, cplx seq
fft : float(1) * Complex(1) -> Complex(1);
fft(x(1),w(1)) :=
    let
        n := nxtpof2(size(x));
        x2 := r2c(pad2n(x/size(x),n));
    in
        fftr(x2,w);

fftc : Complex(1) * Complex(1) -> Complex(1);
fftc(x(1),w(1)) :=
    let
        n := nxtpof2(size(x));
        x2 := cpad2n(complexScale(x, 1.0/size(x)),n);
    in
        fftr(x2,w);

//-- 2D fft
fft2dt : float(2) * Complex(1) -> Complex(2);
fft2dt(z(2),w(1)) := transpose(fft(z,w));

fft2dtc : Complex(2) * Complex(1) -> Complex(2);
fft2dtc(z(2),w(1)) := transpose(fftc(z,w));

fft2d : float(2) -> float(2);
fft2d(x(2)) :=
    let
        n := size(x[1]);
        m := size(x);
    in
        complexMagnitude(fft2dtc(fft2dt(x,wc(nxtpof2(n))), wc(nxtpof2(m))));
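As a small sanity check of the decimation step (example values assumed, not from the test inputs), even and odd select the odd- and even-positioned elements respectively:

    //-- even([10,20,30,40]) = [10,30]  (positions 1 and 3)
    //-- odd([10,20,30,40])  = [20,40]  (positions 2 and 4)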

C++ Driver Source

#include <iostream>
#include <cstdlib>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence< Sequence<SL_FLOAT> > input;
    Sequence< Sequence<SL_FLOAT> > result;
    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int numCols = 100; if(argc > 3) numCols = atoi(argv[3]);

    int numRows = 100; if(argc > 4) numRows = atoi(argv[4]);

    bool print = false; if(argc > 5) print = (atoi(argv[5]) != 0);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();

    input.setSize(numRows);
    for(int y = 1; y <= numRows; y++)
    {
        input[y].setSize(numCols);
        for(int x = 1; x <= numCols; x++)
        {
            input[y][x] = (rand() % 2) * 10;
        }
    }

    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_fft2d(input, threads, result);
    T.stop();
    compTime = T.getTime();

    T.start();
    if(print) rcout << result << "\n";
    T.stop();
    printTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = fft2d
INTENDED_USE = -f"fft2d(double(2))"

include ../make.inc


Barnes-Hut N-Body

SequenceL Source

Quadrant ::=
    (mnPt : float(1), mxPt : float(1), velX : float, velY : float,
     velZ : float, bodyM : float, bodyX : float, bodyY : float,
     bodyZ : float, children : Quadrant(1));

// Some constants
GCONST := 0.000000000066728; //-- meters^3/(kg * sec^2)
TIMESLICE := 1.0;            //-- secs
THETA := 0.5;
//-- Group bodies with (region width / distance to region)

//-- Simple accessors.
Mass : Quadrant -> float;
Mass(octree) := octree.bodyM;
CofGx : Quadrant -> float;
CofGx(octree) := octree.bodyM * octree.bodyX;
CofGy : Quadrant -> float;
CofGy(octree) := octree.bodyM * octree.bodyY;
CofGz : Quadrant -> float;
CofGz(octree) := octree.bodyM * octree.bodyZ;
VcityX : Quadrant -> float;
VcityX(octree) := octree.bodyM * octree.velX;
VcityY : Quadrant -> float;
VcityY(octree) := octree.bodyM * octree.velY;
VcityZ : Quadrant -> float;
VcityZ(octree) := octree.bodyM * octree.velZ;

//-- Given a min point (x,y,z) and max point,
//-- and a list of body lists [mass, x, y, z, vX, vY, vZ],
//-- returns the subset of bodies whose x,y,z values lie
//-- within the region defined by min/max
envelops : float(1) * float(1) * float(1) -> float(1);
envelops(mn(1), mx(1), body(1)) :=
    let mnX := mn[1];
        mnY := mn[2];
        mnZ := mn[3];
        mxX := mx[1];
        mxY := mx[2];
        mxZ := mx[3];
        bodyX := body[2];
        bodyY := body[3];
        bodyZ := body[4];
        nX := bodyX > mnX and bodyX <= mxX;
        nY := bodyY > mnY and bodyY <= mxY;
        nZ := (bodyZ > mnZ and bodyZ <= mxZ) or
              (mnZ = 0 and mxZ = 0 and bodyZ = 0);
    in
        body when nX and nY and nZ;

//-- Contains the base cases for octree generation,
//-- calls newOctree() for the recursive case
genOctree : float(1) * float(1) * float(2) -> Quadrant;
genOctree(mn(1), mx(1), allObjects(2)) :=
    let bodies := envelops(mn, mx, allObjects);
        numBodies := size(bodies);
        mdX := (mn[1] + mx[1]) / 2;
        mdY := (mn[2] + mx[2]) / 2;
        mdZ := (mn[3] + mx[3]) / 2;
        mxX := mx[1];
        mxY := mx[2];
        mxZ := mx[3];
        mnX := mn[1];
        mnY := mn[2];
        mnZ := mn[3];
    in
        ( mnPt: mn, mxPt: mx, velX: 0.0, velY: 0.0, velZ: 0.0,
          bodyM: 0.0, bodyX: 0.0, bodyY: 0.0, bodyZ: 0.0,
          children: [] ) when numBodies = 0
        else
        ( mnPt: mn, mxPt: mx, velX: bodies[1][5], velY: bodies[1][6],
          velZ: bodies[1][7], bodyM: bodies[1][1], bodyX: bodies[1][2],
          bodyY: bodies[1][3], bodyZ: bodies[1][4], children: [] )
          when numBodies = 1
        else
        newOctree(mn, mx, bodies);

//-- Given a region and list of bodies, splits the region
//-- into 8 parts, calls genOctree() on each, and aggregates
//-- their centers of gravity, velocities, and masses into
//-- the parent octree
newOctree : float(1) * float(1) * float(2) -> Quadrant;
newOctree(mn(1), mx(1), bodies(2)) :=
    let
        mxX := mx[1];
        mxY := mx[2];
        mxZ := mx[3];
        mnX := mn[1];
        mnY := mn[2];
        mnZ := mn[3];
        mdX := (mnX + mxX) / 2;
        mdY := (mnY + mxY) / 2;
        mdZ := (mnZ + mxZ) / 2;

        octB1 := genOctree([mdX, mdY, mdZ], [mxX, mxY, mxZ], bodies);
        octB2 := genOctree([mnX, mdY, mdZ], [mdX, mxY, mxZ], bodies);
        octB3 := genOctree([mnX, mnY, mdZ], [mdX, mdY, mxZ], bodies);
        octB4 := genOctree([mdX, mnY, mdZ], [mxX, mdY, mxZ], bodies);
        octF1 := genOctree([mdX, mdY, mnZ], [mxX, mxY, mdZ], bodies);
        octF2 := genOctree([mnX, mdY, mnZ], [mdX, mxY, mdZ], bodies);
        octF3 := genOctree([mnX, mnY, mnZ], [mdX, mdY, mdZ], bodies);
        octF4 := genOctree([mdX, mnY, mnZ], [mxX, mdY, mdZ], bodies);
        octrees := [octB1, octB2, octB3, octB4,
                    octF1, octF2, octF3, octF4];

        totalM := sum(Mass(octrees));
        cogX := sum(CofGx(octrees)) / totalM;
        cogY := sum(CofGy(octrees)) / totalM;
        cogZ := sum(CofGz(octrees)) / totalM;
        vX := sum(VcityX(octrees)) / totalM;
        vY := sum(VcityY(octrees)) / totalM;
        vZ := sum(VcityZ(octrees)) / totalM;
    in
        (mnPt: mn, mxPt: mx, velX: vX, velY: vY, velZ: vZ,
         bodyM: totalM, bodyX: cogX, bodyY: cogY, bodyZ: cogZ,
         children: octrees);

distance : float(1) * Quadrant -> float;
distance(body(1), node) :=
    let x1 := body[2];
        x2 := node.bodyX;
        y1 := body[3];
        y2 := node.bodyY;
        z1 := body[4];
        z2 := node.bodyZ;
    in
        sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2);

sl_abs : number -> number;
sl_abs(x) := -x when x < 0 else x;

//-- an internal node could possibly have the same center of mass as
//-- one of its external nodes, but not also the same mass
octreeIsBody : float(1) * Quadrant -> bool;
octreeIsBody(body(1), octree) :=
    ( body[1] = octree.bodyM ) and ( body[2] = octree.bodyX ) and
    ( body[3] = octree.bodyY ) and ( body[4] = octree.bodyZ );

//-- calculates the forces on a body from a single node
force : float(1) * Quadrant -> float(1);
force(body(1), node) :=
    let bM := body[1];
        nM := node.bodyM;
        xComp := sl_abs(body[2]-node.bodyX);
        yComp := sl_abs(body[3]-node.bodyY);
        zComp := sl_abs(body[4]-node.bodyZ);
        dist := distance(body, node);
        totalForce := (GCONST * bM * nM) / (dist^2);
        fX := totalForce * xComp / dist;
        fY := totalForce * yComp / dist;
        fZ := totalForce * zComp / dist;
    in
        [fX, fY, fZ];

//-- THIS IS THE MAIN ROUTINE...
//-- calculates the forces on a body from all of the
//-- bodies within an octree (node)
calcForceOn : Quadrant * float(1) * float(1) -> float(1);
calcForceOn(node, body(1), netForce(1)) :=
    let
        externalNode := size(node.children) = 0;
        farAway :=
            THETA >= (node.mxPt[1] - node.mnPt[1]) /
                     sqrt((body[2]-node.bodyX)^2 +
                          (body[3]-node.bodyY)^2 +
                          (body[4]-node.bodyZ)^2);
    in
        netForce when octreeIsBody(body, node)
        else
        netForce + force(body, node) when farAway or externalNode
        else
        sum(transpose(calcForceOn(node.children, body, netForce)));

//-- First clause: a body doesn't act on itself, so just return
//--   the pre-existing forces (netForce)
//-- Second clause: if it's "far away" (i.e. s/d <= THETA), then
//--   just use the center of mass of that region for the
//--   calculation; if it's an external (empty) node, add the
//--   netForce to it
//-- Third clause: it's close enough that we have to actually
//--   recurse on the region and add all the individual forces.
//--   The current forces (netForce) are passed to the recursive
//--   call, so they are not added in again...

//-- calculates the new position of a body based on the forces
//-- it's under over a TIMESLICE interval
//-- F = ma, so given the Force and mass, calculate the acceleration,
//-- then determine the velocity at the end of the TIMESLICE, and
//-- then average the velocity to get the distance travelled
//-- in that TIMESLICE
move : float(1) * float(1) -> float(1);
move(body(1), force(1)) :=
    let
        bM := body[1];
        bX := body[2];
        bY := body[3];
        bZ := body[4];
        vX := body[5];
        vY := body[6];
        vZ := body[7];
        fX := force[1];
        fY := force[2];
        fZ := force[3];
        accX := fX/bM;
        accY := fY/bM;
        accZ := fZ/bM;
        velX := vX + (accX * TIMESLICE);
        velY := vY + (accY * TIMESLICE);
        velZ := vZ + (accZ * TIMESLICE);
        newX := bX + (vX + velX) / 2;
        newY := bY + (vY + velY) / 2;
        newZ := bZ + (vZ + velZ) / 2;
    in
        [ bM, newX, newY, newZ, velX, velY, velZ ];

//-- generate the Octrees, then calculate the forces on all bodies,
//-- calculate their velocities and new positions,
//-- recursive for N generations
go : float(2) * float(2) * int -> float(2);
go(universe(2), bodies(2), iterationsLeft(0)) :=
    let
        octree := genOctree(universe[1], universe[2], bodies);
        netForces := calcForceOn(octree, bodies, [0.0, 0.0, 0.0]);
    in
        bodies when iterationsLeft = 0 else
        go(universe, move(bodies, netForces), iterationsLeft - 1);

bhutmain : float(2) * float(2) * int -> float(2);
bhutmain(mybodies(2), myuniverse(2), n(0)) := go(myuniverse, mybodies, n);

C++ Driver Source

#include <cstdlib>
#include <cmath>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence< Sequence< SL_FLOAT > > input;
    Sequence< Sequence< SL_FLOAT > > result;
    Sequence< Sequence<SL_FLOAT> > univ;
    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int bodyCount = 100; if(argc > 3) bodyCount = atoi(argv[3]);

    int iterations = 100; if(argc > 4) iterations = atoi(argv[4]);

    double max_bound = pow(10.0, 10.0);
    if(argc > 5) max_bound = atof(argv[5]);

    bool print = false; if(argc > 6) print = (atoi(argv[6]) != 0);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();
    int nSizeCol = 7;

    univ.setSize(2);
    // Universe minX,minY,minZ,maxX,maxY,maxZ - Format:
    // [[minX, minY, minZ],[maxX, maxY, maxZ]]
    univ[1].setSize(3);
    univ[2].setSize(3);
    univ[1][1] = 0 - max_bound;
    univ[1][2] = 0 - max_bound;
    univ[1][3] = 0 - max_bound;
    univ[2][1] = max_bound;
    univ[2][2] = max_bound;
    univ[2][3] = max_bound;

    input.setSize(bodyCount);
    for(int y = 1; y <= bodyCount; y++)
    {
        input[y].setSize(nSizeCol);
        for(int x = 1; x <= nSizeCol; x++)
        {
            input[y][x] = ((double)rand() / RAND_MAX) * 100000.0;
        }
    }

    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_bhutmain(input, univ, iterations, threads, result);
    sl_SyncNodes();
    T.stop();
    compTime = T.getTime();

    T.start();
    if(print) rcout << result << "\n";
    T.stop();
    printTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = bhut
INTENDED_USE = -f"bhutmain(double(2),double(2),int(0))"

include ../make.inc


Conway’s Game of Life

SequenceL Source

evolve : int(2) -> int(2);
evolve(Cells(2))[X,Y] :=
    let
        n := neighbors(Cells,X,Y);
        old_val := Cells[X,Y];
        border := X=1 or X=size(Cells) or Y=1 or Y=size(Cells[X]);
    in
        old_val when border
        else
        1 when (n = 3) or (3 = (old_val + n))
        else
        0;

neighbors : int(2) * int * int -> int;
neighbors(A(2),i,j) :=
    A[i-1,j-1] + A[i-1,j] + A[i-1,j+1] + A[i,j-1] +
    A[i,j+1] + A[i+1,j-1] + A[i+1,j] + A[i+1,j+1];

life : int(2) * int -> int(2);
life(world(2),n) := life(evolve(world), n-1) when n > 0 else world;
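A small worked case (values assumed): in a 3x3 world every cell except the center lies on the border and is frozen, so for the vertical bar

    blinker := [[0,1,0],
                [0,1,0],
                [0,1,0]];
    //-- The center cell has n = 2 live neighbors and old_val = 1,
    //-- so 3 = (old_val + n) holds and it survives; all other cells
    //-- are border cells and keep their values, so
    //-- evolve(blinker) = blinker.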

C++ Driver Source

#include <iostream>
#include <cstdlib>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence< Sequence< int > > world;

    Sequence< Sequence< int > > result;
    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int gridX = 100; if(argc > 3) gridX = atoi(argv[3]);

    int gridY = 100; if(argc > 4) gridY = atoi(argv[4]);

    int iterations = 1000; if(argc > 5) iterations = atoi(argv[5]);

    bool display = 0; if(argc > 6) display = atoi(argv[6]) != 0;

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();

    world.setSize(gridY);
    for(int y = 1; y <= gridY; y++)
    {
        world[y].setSize(gridX);
        for(int x = 1; x <= gridX; x++)
        {
            world[y][x] = rand() % 2;
        }
    }

    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_life(world, iterations, threads, result);
    T.stop();
    compTime = T.getTime();

    if(display)
    {
        for(int y = 1; y <= gridY; y++)
        {
            for(int x = 1; x <= gridX; x++)
            {
                rcout << (result[y][x] == 1 ? "O" : " ");
            }
            rcout << "\n";
        }
    }

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = gol
INTENDED_USE = -f"life(int(2),int)"

include ../make.inc


LU Factorization

SequenceL Source

//-- R is the kth row, being modified.
//-- V is the previously done row
RMod : float(1) * float(1) * int -> float(1);
RMod(R(1),V(1),k(0))[i] := R[i] - R[k]*V[i] when i > k else R[i];

//-- Divides the part of row k to the right of the
//-- diagonal by the diagonal element
RScl : float(1) * int -> float(1);
RScl(X(1),k(0))[i] := X[i]/X[k] when i > k else X[i];

//-- X is the matrix, l is the row we are modifying
//-- Gen should keep returning a modified version of the matrix
Gen : float(2) * int -> float(2);
Gen(X(2),l(0)) :=
    let
        n := size(X);
        A := Gen(X,l-1);
        Q := RScl(A[l],l);
        Q1 := RScl(X[1],1);
    in
        A[1 ... l-1] ++ [Q] when l = n
        else
        A[1 ... l-1] ++ [Q] ++ RMod(A[l+1 ... n],Q,l) when l > 1
        else
        [Q1] ++ RMod(X[2 ... n],Q1,1);
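The Makefile below names the entry point LU(double(2)), which this listing does not define; a minimal wrapper consistent with Gen, sketched here as an assumption rather than taken from the original source, would reduce every row:

    LU : float(2) -> float(2);
    LU(X(2)) := Gen(X, size(X)); //-- assumed wrapper, not in the printed listing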

C++ Driver Source

#include <cstdlib>
#include <cmath>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence< Sequence< SL_FLOAT > > input;
    Sequence< Sequence< SL_FLOAT > > result;
    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int numCols = 100; if(argc > 3) numCols = atoi(argv[3]);

    int numRows = 100; if(argc > 4) numRows = atoi(argv[4]);

    bool print = false; if(argc > 5) print = (atoi(argv[5]) != 0);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();

    // Build a tridiagonal test matrix: 10^9 on the diagonal,
    // 10^8 on the sub- and super-diagonals, zero elsewhere.
    input.setSize(numRows);
    for(int y = 1; y <= numRows; y++)
    {
        input[y].setSize(numCols);
        for(int x = 1; x <= numCols; x++)
        {
            if (x == y)          input[y][x] = pow(10,9);
            else if (x - y == 1) input[y][x] = pow(10,8);
            else if (y - x == 1) input[y][x] = pow(10,8);
            else                 input[y][x] = 0.0;
        }
    }
    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_LU(input, threads, result);
    T.stop();
    compTime = T.getTime();

    T.start();
    if(print) rcout << result << "\n";
    T.stop();
    printTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = lu
INTENDED_USE = -f"LU(double(2))"

include ../make.inc


Matrix Multiply

SequenceL Source

mm : number(2) * number(2) -> number(2);
mm(A(2),B(2))[i,j] := sum(A[i] * transpose(B)[j]);
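Each entry [i,j] is the dot product of row i of A with column j of B; a quick check with assumed values:

    //-- mm([[1,2],[3,4]], [[5,6],[7,8]]) = [[19,22],[43,50]]
    //-- e.g. entry [1,1] = sum([1,2] * [5,7]) = 5 + 14 = 19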

C++ Driver Source

#include <iostream>
#include <cstdlib>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence< Sequence< SL_FLOAT > > A;
    Sequence< Sequence< SL_FLOAT > > B;

    Sequence< Sequence< SL_FLOAT > > result;
    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int m1 = 100; if(argc > 3) m1 = atoi(argv[3]);

    int n = 100; if(argc > 4) n = atoi(argv[4]);

    int m2 = 100; if(argc > 5) m2 = atoi(argv[5]);

    bool print = false; if(argc > 6) print = (atoi(argv[6]) != 0);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();

    A.setSize(m1);
    for(int y = 1; y <= m1; y++)
    {
        A[y].setSize(n);
        for(int x = 1; x <= n; x++)
        {
            A[y][x] = ((double)rand() / RAND_MAX) * 100000.0;
        }
    }

    B.setSize(n);
    for(int y = 1; y <= n; y++)
    {
        B[y].setSize(m2);
        for(int x = 1; x <= m2; x++)
        {
            B[y][x] = ((double)rand() / RAND_MAX) * 100000.0;
        }
    }

    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_mm(A, B, threads, result);
    T.stop();
    compTime = T.getTime();

    T.start();
    if(print) rcout << result << "\n";
    T.stop();
    printTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = mm
INTENDED_USE = -f"mm(double(2), double(2))"

include ../make.inc


Monte Carlo Mandelbrot Area

SequenceL Source

Complex ::= (r : float, i : float);

//-- Monte Carlo sampling
//-- play with a point until max iterations expire
//-- or a threshold is hit so we know it is outside
iterate_point : Complex * Complex * int * float -> int;
iterate_point( point, orig, iterations, threshold ) :=
    let
        next_point := ( r: (point.r ^ 2) - (point.i ^ 2) + orig.r,
                        i: point.r * point.i * 2.0 + orig.i );
    in
        1 when point.r * point.r + point.i * point.i > threshold
        else
        iterate_point( next_point, orig, iterations - 1, threshold )
            when iterations > 0
        else
        0;

//-- calculate set area
compute_set_area : Complex(1) * int * float -> float;
compute_set_area( points(1), max_iter, threshold ) :=
    let
        outside :=
            sum( iterate_point( points, points, max_iter, threshold ) );
        total := size( points );
        inside := total - outside;
        area := 2.0 * ( 2.5 * 1.125 ) * inside / total;
    in
        area;
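The constants in area trace back to the sampling region set up in the driver that follows: each point has r drawn uniformly from [-2.0, 0.5] and i from [0, 1.125], a rectangle of area 2.5 x 1.125 covering the upper half of the set's bounding box. The leading factor 2.0 exploits the set's symmetry about the real axis, giving area ~= 2 * (2.5 * 1.125) * inside/total.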


C++ Driver Source

#include <iostream>
#include <cstdlib>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    Sequence<_sl_Complex> points;
    SL_FLOAT result;

    double setupTime = 0;
    double compTime = 0;
    double printTime = 0;

    int numPoints = 100; if(argc > 3) numPoints = atoi(argv[3]);

    int maxIters = 100; if(argc > 4) maxIters = atoi(argv[4]);

    SL_FLOAT threshold = 100; if(argc > 5) threshold = atof(argv[5]);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();

    // Sample uniformly from the rectangle [-2, 0.5] x [0, 1.125].
    points.setSize(numPoints);
    for(int i = 1; i <= numPoints; i++)
    {
        points[i].r = -2.0 + 2.5 * (double)rand() / RAND_MAX;
        points[i].i = 1.125 * (double)rand() / RAND_MAX;
    }

    sl_SyncNodes();
    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_compute_set_area(points, maxIters, threshold, threads, result);
    sl_SyncNodes();
    T.stop();
    compTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = man
INTENDED_USE = -f"compute_set_area(Complex(1), int, float)"

include ../make.inc


PI Approximation

SequenceL Source

helpFindPI : int * int -> float;
helpFindPI( w(0), N(0) ) :=
    let
        local := ( 1.0 * w + 0.5 ) / N;
    in
        4.0 / ( 1.0 + local * local );

findPI : int(1) * int -> float(1);
findPI( n(1), N(0) ) :=
    [sum( helpFindPI( n, N ) ) * 1.0 / N];
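findPI is a midpoint-rule estimate of pi = ∫ from 0 to 1 of 4/(1+x^2) dx. Assuming n ranges over 0 ... N-1 (the listing does not show the caller constructing it), helpFindPI evaluates the integrand at the midpoints x = (w + 0.5)/N, so that

    pi ~= (1/N) * sum over w = 0..N-1 of 4 / (1 + ((w + 0.5)/N)^2).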

C++ Driver Source

#include <iostream>
#include <cstdlib>
#include "SL_Generated.h"

using namespace std;

int main(int argc, char** argv)
{
    int threads = 2; if(argc > 1) threads = atoi(argv[1]);

    int seed = 12345; if(argc > 2) seed = atoi(argv[2]);

    int input = 100; SL_FLOAT result;

    double setupTime = 0; double compTime = 0; double printTime = 0;

    if(argc > 3) input = atoi(argv[3]);

    SLTimer T;

    sl_init(threads);
    MPI_Init(&argc, &argv);

    srand(seed);

    T.start();
    T.stop();
    setupTime = T.getTime();

    T.start();
    sl_findPI(input, threads, result);
    T.stop();
    compTime = T.getTime();

    rcout << "," << threads << "," << setupTime << ",";
    rcout << compTime << "," << setupTime + compTime << '\n';

    MPI_Finalize();
    sl_done();

    return 0;
}

Makefile

TEST = pi
INTENDED_USE = -f"findPI(int(0))"

include ../make.inc
