SIMULATION METHODOLOGY AND TOOLS FOR THE

DEVELOPMENT OF NOVEL PROGRAM EXECUTION MODELS

AND ARCHITECTURES

by

Robert Pavel

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Spring 2015

© 2015 Robert Pavel
All Rights Reserved

SIMULATION METHODOLOGY AND TOOLS FOR THE

DEVELOPMENT OF NOVEL PROGRAM EXECUTION MODELS

AND ARCHITECTURES

by

Robert Pavel

Approved: Kenneth E. Barner, Ph.D. Chair of the Department of Electrical and Computer Engineering

Approved: Babatunde A. Ogunnaike, Ph.D. Dean of the College of Engineering

Approved: James G. Richards, Ph.D. Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Guang R. Gao Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Xiaoming Li Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Chengmo Yang Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Jingyi Yu Member of dissertation committee

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my parents for supporting me throughout my academic studies. Without them I would not have been able to pursue an advanced degree. I would also like to thank my adviser, Professor Gao, for allowing me to be a researcher and graduate student in his group and for supporting and directing my research endeavours. Similarly, I would like to thank Daniel Orozco and Elkin Garcia for guiding me and providing advice as to how to be a productive graduate student.

Additional thanks go to Allen McPherson, Christoph Junghans, Timothy Germann, and Ben Bergen of Los Alamos National Laboratory for giving me wonderful internship and career opportunities and for teaching me how to apply the skills I have learned in academia to more practical real world applications.

Finally, I would like to thank my friends and colleagues at the University of Delaware. In particular, I would like to thank Joshua Landwehr, Aaron Landwehr, and Phil Saponaro for going to lunch with me frequently.

TABLE OF CONTENTS

LIST OF FIGURES ...... xi
ABSTRACT ...... xiv

Chapter

1 INTRODUCTION ...... 1

2 BACKGROUND INFORMATION AND WORK ...... 5

2.1 Origins of Parallel Discrete Event Simulation for Computer Architectures ...... 5

2.1.1 Packet Communication Architectures ...... 5
2.1.2 Chandy and Misra's Model ...... 7
2.1.3 Common Terminology ...... 9

2.1.3.1 Physical System ...... 9
2.1.3.2 Logical System ...... 9
2.1.3.3 Host System ...... 9
2.1.3.4 Component ...... 10
2.1.3.5 Link ...... 10
2.1.3.6 Distributed ...... 10
2.1.3.7 Virtual Time ...... 10
2.1.3.8 Message ...... 11

2.1.4 A Comparison of Models ...... 11

2.2 Common Synchronization Techniques ...... 12

2.2.1 General Tools and Techniques ...... 12

2.2.1.1 Lamport Timestamps ...... 12
2.2.1.2 Lookahead Techniques ...... 13

2.2.2 Conservative Synchronization ...... 13

2.2.2.1 Lockstep Execution through Global Time ...... 13
2.2.2.2 Null Messages ...... 14
2.2.2.3 Misra's Marker ...... 14
2.2.2.4 Demand Driven Deadlock Free Synchronization ...... 15

2.2.3 Optimistic Synchronization ...... 15

2.2.3.1 Time Warp: Reverting to Resolve Issues ...... 15
2.2.3.2 Optimistic Time Windows ...... 16
2.2.3.3 Lax Synchronization ...... 17

2.2.4 Influential Frameworks ...... 17

2.2.4.1 MPISim ...... 17
2.2.4.2 BigSim ...... 17

2.2.5 Queuing Networks ...... 18

2.3 Background of Program Execution Models ...... 18

2.3.1 Common Terminology ...... 19

2.3.1.1 Program Execution Model ...... 19
2.3.1.2 Runtime ...... 19
2.3.1.3 Codelet ...... 19

2.3.2 Novel Program Execution Models ...... 20

2.3.2.1 Distributed Shared Memories ...... 20
2.3.2.2 MapReduce ...... 20
2.3.2.3 Asynchronous Task Based Models ...... 21

2.4 Porting TiNy Threads to Distributed System ...... 21
2.5 Studying and Using TIDEFlow ...... 23
2.6 Work Toward Enhancing Scientific Simulations through Co-Design ...... 24

2.6.1 Heterogeneous Multiscale Modeling ...... 24
2.6.2 Adaptive Mesh Refinement ...... 25

3 PICASIM: A MODEL FOR THE DEVELOPMENT AND SIMULATION OF RUNTIMES AND ARCHITECTURES ...... 28

3.1 Why PICASim ...... 28
3.2 The PICASim Model ...... 29

3.2.1 Firing Rules of PICASim ...... 30
3.2.2 PICASim Synchronization Model ...... 34

3.3 Goals of the PICASim Model ...... 36

3.3.1 Composability Due to a Modular Design ...... 37
3.3.2 A Task Based Framework ...... 37
3.3.3 Completely Decentralized Methodology ...... 38
3.3.4 Terminating Simulation ...... 38

3.4 Representing a System in PICASim ...... 38
3.5 Suitability Toward Exascale Research ...... 39

3.5.1 Exascale Challenges Addressed ...... 40
3.5.2 Uniqueness and Novelty ...... 40
3.5.3 Applicability ...... 41
3.5.4 Maturity ...... 41
3.5.5 Strengths and Weaknesses of PICASim Model ...... 42

4 THE PICASIM FRAMEWORK ...... 43

4.1 Language and Libraries Used ...... 43
4.2 Shared Memory Implementation of PICASim ...... 43

4.2.1 Implementation of Task Queue ...... 43
4.2.2 Packets and Envelopes ...... 45
4.2.3 Implementation of the CHU ...... 46

4.2.3.1 Specialized CHUs ...... 47

4.2.4 Implementation of the Link ...... 48

4.2.4.1 Termination Signal Links ...... 50

4.2.5 Preliminary Performance Data ...... 51

4.3 Communication in a Distributed System ...... 55
4.4 Using PICASim ...... 56

5 PICASIM TO SIMULATE NOVEL ARCHITECTURES ...... 57

5.1 Modeling an Architecture: A Case Study ...... 57
5.2 Performance and Power Modeling with Petri Nets ...... 61

5.2.1 Petri Nets ...... 61
5.2.2 Implementation in PICASim ...... 62
5.2.3 Verifying Model ...... 63
5.2.4 Extrapolating Results ...... 67
5.2.5 Studying New Algorithms ...... 67

5.3 Validation via Intel Xeon Phi ...... 71

5.3.1 Intel Xeon Phi ...... 71
5.3.2 Modeling Intel Xeon Phi ...... 71
5.3.3 Simulation Results ...... 71

6 PICASIM TO DEVELOP ARCHITECTURES AND PROGRAM EXECUTION MODELS ...... 75

6.1 Dennis's Fresh Breeze ...... 75

6.1.1 The Fresh Breeze Program Execution Model ...... 75
6.1.2 The Fresh Breeze Architecture ...... 76

6.2 Fresh Breeze: System One X ...... 79

6.2.1 Goals of System One X ...... 79
6.2.2 Implementation in PICASim ...... 79
6.2.3 Preliminary Timing Data ...... 80
6.2.4 Benchmark Applications ...... 82
6.2.5 Architectural Modifications ...... 82

6.2.5.1 Scalar Dot Product ...... 85
6.2.5.2 1D Heat Distribution ...... 85

6.2.6 Performance Evaluation ...... 88

6.2.6.1 Scalar Dot Product ...... 88
6.2.6.2 1D Heat Distribution ...... 88

7 A LANGUAGE AND COMPILER FOR AUTOMATED DISTRIBUTION OF SIMULATION ...... 90

7.1 Rationale Behind The LADS ...... 90
7.2 The LADS Grammar ...... 90
7.3 Current Applications of the LADS ...... 93
7.4 Additional Applications of the LADS ...... 94
7.5 The LADSPiler ...... 94

7.5.1 Lexical Analysis and Syntactic Analysis ...... 94
7.5.2 A Graph Based IR ...... 95
7.5.3 Applying Optimizations ...... 97
7.5.4 Code Generation ...... 97

8 RELATED WORKS ...... 103

8.1 Dennis's Framework ...... 103
8.2 μπ ...... 104
8.3 Graphite ...... 104
8.4 SST: The Structural Simulation Toolkit ...... 105
8.5 COTson ...... 106
8.6 COREMU ...... 106
8.7 Flow stream processing system simulator ...... 106
8.8 SystemC ...... 107
8.9 ROSS ...... 107

9 CONCLUSIONS AND CLOSING THOUGHTS ...... 108

9.1 Conclusions ...... 108
9.2 Continuation of Work ...... 109

9.2.1 PICASim Model and Framework ...... 110
9.2.2 LADS and LADSpiler ...... 110

9.3 Closing Thoughts ...... 111

BIBLIOGRAPHY ...... 113

Appendix

A COPYRIGHT INFORMATION ...... 125

A.1 Permission from IEEE ...... 125
A.2 Permissions from Springer ...... 126
A.3 Papers I Own the Copyright to ...... 126

LIST OF FIGURES

2.1 Example of Bryant Function Component ...... 6
2.2 Example of Bryant Switch Component ...... 7
2.3 Example of Bryant Arbiter Component ...... 7
2.4 Example of Chandy and Misra's Logical Processes ...... 8
2.5 Scalability of Read and Write Operations in TDSM (© MTAAP 2010 [124]) ...... 22
2.6 Tiling of Centralized Shock Discontinuity [96] ...... 26
2.7 High Level Overview of Adaptive Mesh Refinement of a Hydrodynamics Application in Intel's Concurrent Collections [96] ...... 27
3.1 Example of FPU CHU using FMADD Instruction ...... 32
3.2 Example of FPU CHU using FADD Instruction ...... 33
3.3 Simple PICASim Graph ...... 35
3.4 An Example of a Graph of a System in PICASim ...... 39
4.1 UML Class Diagram of Abstract Class SimulatableTask ...... 44
4.2 UML Class Diagram of TaskQueue ...... 44
4.3 UML Class Diagram of Packet and Envelope ...... 45
4.4 UML Class Diagram of CHU ...... 46
4.5 Algorithm to Check and Fire CHU ...... 47
4.6 UML Class Diagram of PiCHU ...... 48
4.7 UML Class Diagram of Link Types ...... 49
4.8 Algorithm to Push message (pktPtr) into Link ...... 49
4.9 Algorithm to Pop Message from Link ...... 50
4.10 UML Class Diagram of Termination Signal Link ...... 51
4.11 256 CHUs and Variable Task Lengths on a Single Node of a Commodity Cluster [99] ...... 52
4.12 217 Messages per CHU and a Variable Number of CHUs on a Single Node of a Commodity Cluster [99] ...... 53
4.13 Scalability of Tasking and Control Framework on a Single Node of a Commodity Cluster [99] ...... 54
4.14 UML Class Diagram of Router Message ...... 55
5.1 Layout of a Single Cyclops-64 Node [99] ...... 58
5.2 Possible Mapping of a Single Node. Each box represents a potential CHU [99] ...... 60
5.3 Petri Net Representation of DGEMM on IBM Cyclops-64 [45] ...... 63
5.4 Petri Net Representation of DGEMM on IBM Cyclops-64 [45] ...... 64
5.5 Verification of Model for DGEMM Optimized for On-Chip Memory [45] ...... 65
5.6 Verification of Model for DGEMM Optimized for Off-Chip Memory [45] ...... 66
5.7 Simulation of Off-Chip DGEMM on Modified C64 [45] ...... 68
5.8 Simulation of LU Decomposition Optimized for On-Chip Memory of Varying Sizes [45] ...... 69
5.9 Simulation of LU Decomposition Optimized for On-Chip Memory with Varying Numbers of TUs [45] ...... 70
5.10 Simulated Execution Time versus Measured Execution Time of FDTD Microbenchmark on a Single Intel Xeon Phi Coprocessor 5110P ...... 72
5.11 Error of Simulated Results with respect to Measured Results ...... 73
6.1 Fresh Breeze: System One [97] ...... 77
6.2 Fresh Breeze: System Two [97] ...... 78
6.3 Testbed for Modifications to Fresh Breeze Architecture [97] ...... 81
6.4 Scalar Dot Product on Fresh Breeze SysOneX [97] ...... 83
6.5 1D Heat Distribution on Fresh Breeze SysOneX [97] ...... 84
6.6 Queue Efficiency Tests Using Scalar Dot Product on Fresh Breeze SysOneX [97] ...... 86
6.7 Queue Efficiency Tests Using Ten Iterations of 1D Heat Distribution on Fresh Breeze SysOneX [97] ...... 87
7.1 Backus-Naur Form Grammar of LADS ...... 91
7.2 Simple Graph Modelled in LADS: LADS Source ...... 92
7.3 Simple Graph Modelled in LADS: Top Level ...... 93
7.4 Simple Graph Modelled in LADS: Effects of Flat Modifier ...... 93
7.5 Simple Graph Modelled in LADS: Flattened With Redundant Ports Removed ...... 97
7.6 Subset of Simplified Output of LADSpiler for Example: Part 1 ...... 99
7.7 Subset of Simplified Output of LADSpiler for Example: Part 2 ...... 100
7.8 Expanded Output of LADSpiler in dot language: Part 1 ...... 100
7.9 Expanded Output of LADSpiler in dot language: Part 2 ...... 101
7.10 Expanded Output of LADSpiler in graphical form ...... 102

ABSTRACT

The exascale era of high performance computing will be defined by novel architectures with a focus on power efficiency and resiliency as well as program execution models with a focus on usability and scalability. To develop and evaluate these architectures and program execution models, simulation and modelling tools must be used. However, existing simulation and modelling tools are very complex and tend to emphasize high performance simulations. And to achieve these speeds, two general methods are used. Many simulators simply eschew accuracy in the name of performance. Others still are the product of large amounts of time and effort to optimize the simulation of a specific class of architecture or system, which is perfect for studying existing systems and evolutionary designs but not for studying revolutionary designs.

To this end, I have developed PICASim as a model and framework for the development and study of novel architectures and program execution models. The PICASim model's compromise between accuracy and performance can be adjusted during the early development of exascale systems to obtain highly accurate simulations of smaller, but representative, systems to aid in the study of the impact that new architectural features and runtime capabilities will have on the performance of a full exascale system. Additionally, its modular nature and focus on composability will allow for new architectural features to be evaluated and studied with ease.

Additionally, I have developed a language to express system graphs and the interconnection of components at a high level: the LADS. The LADS is a language used to express PICASim graphs but is already being extended to support codelet-based languages as well. The LADSpiler is a source-to-source compiler designed to transform a graph expressed in the LADS to source code with all interconnections represented and additional optimizations, based on graph theory, applied.

The contributions of my dissertation are the following:

1. PICASim: A model for Parallel Discrete Event Simulation that combines the flexibility of the Chandy and Misra model with the modularity and intuitiveness of the Bryant model

2. The PICASim Framework: A Fully-Distributed PDES framework designed to study and develop novel architectures and program execution models at an indicative scale prior to devoting resources toward full-scale simulation

3. Results demonstrating PICASim's ability to model said architectures and program execution models with a high degree of accuracy and flexibility.

4. Results demonstrating PICASim’s use in the study and development of novel program execution models and architectures

5. The LADS: A language used to express system graphs, including but not limited to PICASim graphs

6. The LADSpiler: A source-to-source compiler that can optimize and translate systems expressed in LADS to the target language.

Chapter 1

INTRODUCTION

The world is in the midst of a race to develop a supercomputer capable of exascale performance. This race covers all facets of computer science and computer engineering. New architectures are needed that place an even greater emphasis on power efficiency and resiliency [67, 36]. New program execution models will be needed to take advantage of these platforms. And new algorithms will be needed that will be able to scale to the size of these systems and that will be able to take advantage of the heterogeneity that is expected of these systems [22, 24].

This race to exascale performance is punctuated by the need for new algorithms and programming models that can fully utilize the revolutionary architectures that will be required to maximize power efficiency and resiliency. These architectures are expected to possess a high degree of heterogeneity and place an even greater burden on the user and programming model to efficiently utilize resources and provide a high degree of resilience while minimizing power consumption [67, 36]. Many government funded projects from DARPA [22] as well as the Department of Energy [24] have been structured around these issues.

Much of this burden is placed on the program execution model as it must bridge the gap between the domain expert with an algorithm and the underlying architecture. Even more importantly, these program execution models must assist in the management of power efficiency and resiliency due to heat issues and the high probability of faults [67, 11]. Now, more than ever, the scientific community is in desperate need of new program execution models.

Due to the unprecedented scale and complexity of exascale systems, simulation and modelling tools are of the utmost importance in achieving these goals [102, 120, 75, 57, 108]. By using software for evaluation, the cost in terms of time and resources associated with hardware can be avoided to a large degree. For example, if the benefits of an architectural feature are found to be lacking in a software simulation, there is no need to design and fabricate a hardware implementation.

However, many existing tools sacrifice accuracy in the name of performance [81, 56] so as to allow for high level evaluation of systems. Others rely heavily on additional, equally complex, tools to model the system, which restricts the study of novel and revolutionary systems to groups with the resources to develop high performance simulation and modelling tools from the ground up.

To this end, I have developed the PCA Inspired Computer Architecture Simulator (PICASim) model and framework to provide an easily modifiable tool for parallel discrete event simulation. While initially intended for computer architecture simulation, it has evolved into a tool to aid in the development and evaluation of novel architectures and programming models.

The bottom-up design of the PICASim framework addresses the challenges of developing new architectures in conjunction with novel program execution models for the exascale era. The PICASim model's compromise between accuracy and performance can be adjusted during the early development of exascale systems to obtain highly accurate simulations of smaller, but representative, systems to aid in the study of the impact that new architectural features and runtime capabilities will have on the performance of a full exascale system.

Other projects also address the issue of the development of exascale architectures. However, these are not complete solutions. Many existing tools sacrifice accuracy in the name of performance [81, 56] so as to allow for high level evaluation of systems. These tools are best suited toward studying systems where communication and memory access patterns are a known quantity. Similarly, many existing solutions take advantage of direct execution [77] or existing sequential simulators [57, 3]. These techniques are highly effective and may even be essential to simulate a full exascale system. However, they all utilize highly efficient and optimized tools as their building blocks and, as such, are best suited toward studying architectures of an evolutionary nature, as much of the work in optimizing these building blocks will already have been completed. Or the work is limited to large groups with the infrastructure to develop highly specialized simulators for a specific target architecture [18].

PICASim is novel in its approach in that it is designed to allow for the study of the design space of radically different systems with an emphasis on accuracy and ease of use over performance. It too is not a complete solution and is not intended as one. Instead, PICASim is designed to be used as a tool for early research into new architectures and to study how new program execution models can be adapted to utilize these new architectures. By evaluating and exploring architectures and program execution models at these early stages, research is able to be directed to the areas of greatest importance. Once the viability of a design is confirmed at a representative scale, high performance implementations can be made that will utilize more robust frameworks [57] for exascale simulation.

PICASim has been implemented as a distributed framework in C++ and MPI, which I discuss in Chapter 4. In Chapter 5 I show how PICASim has been used to simulate the execution of real applications on real architectures. In Chapter 6 I discuss how PICASim has been used to aid in the study and development of novel program execution models and architectures, particularly the Fresh Breeze architecture [74].

Over the course of developing PICASim and modeling systems with it, it became painfully obvious that additional tools would be needed to improve the usability of PICASim. To achieve this, I developed the Language for Automated Distribution of Simulation, or LADS for short. Much as with PICASim, LADS is largely a legacy name, as the tool was found to have far greater potential than was initially anticipated; its potential use extends to any graph-based, or codelet-based, system. This is described in Chapter 7. Additionally, a compiler for the LADS, the LADSpiler, has been implemented using a Python implementation of lex and yacc, as explained in Chapter 7.5.

My dissertation provides the following contributions to the field of electrical and computer engineering:

1. PICASim: A model and framework for the development and study of novel architectures and program execution models.

2. Results demonstrating PICASim's ability to model said architectures and program execution models with a high degree of accuracy and flexibility.

3. Results demonstrating PICASim’s use in the study and development of novel program execution models and architectures

4. The LADS: A language used to express system graphs, including but not limited to PICASim graphs

5. The LADSpiler: A source-to-source compiler that can convert systems expressed in LADS to code ready to be simulated using PICASim

Chapter 2

BACKGROUND INFORMATION AND WORK

In this chapter, I will provide a brief background of parallel discrete event simulation and modern asynchronous task based program execution model research while also defining terminology that will be vital moving forward. I will then describe my own work and how it has impacted my studies and, in some cases, influenced the development of the very works I would later study and model.

2.1 Origins of Parallel Discrete Event Simulation for Computer Architectures

I will first discuss the origins of modern discrete event simulation for computer architectures, specifically the Bryant model [13] and the Chandy and Misra model [19], discuss their similarities and differences, and then develop a common terminology moving forward.

2.1.1 Packet Communication Architectures

In 1977, Randal Bryant developed a distributed simulation methodology based upon the concept of a Packet Communication Architecture, or PCA. Bryant's model [13] treated the system as a number of independent processor modules, referred to as "Components", that communicate by sending messages, or "Packets" of information, to one another. Bryant's PCA model has three types of Components: Functions, Switches, and Arbiters. Each Component has a number of input and output ports, and the Components are classified as follows:

t1, it expects no messages and outputs messages to ports X and Y. In the interim, the LP advances the simulation as much as it can without any additional inputs. With respect to synchronization and termination, Chandy and Misra rely on a simple approach using timestamped packets (i.e. Lamport Timestamps [69]), null messages [82], and a specific number of cycles to run the simulation for. However, as with Bryant's model, other synchronization and termination schemes may be employed.

2.1.3 Common Terminology

As with many aspects of computer science, there is much ambiguity with respect to the meaning of specific terms. To properly perform a comparison, and later explain how my work fits in, I will briefly define the following terms and how they relate to the Bryant model [13] and the Chandy and Misra model [19]. Where ambiguity was not an issue, I used the original term. Otherwise, I focused on consistency.

2.1.3.1 Physical System

The Physical System is what is being modelled by the simulation. In the case of an architecture simulation this would be the computer system, for example a single x86 processor, that is being studied. Whereas in a program execution model or runtime simulation it would be the underlying abstract machine model.

2.1.3.2 Logical System

The Logical System is the simulation itself. Under the Bryant model this is the collection of PCA components, and under the Chandy and Misra model this is the collection of Logical Processes.

2.1.3.3 Host System

The Host System is the machine that is physically running the simulation. For example: if one were to run a simulation of an x86 processor using a PCA-based simulator running on a laptop (commodity hardware), then the logical system would be the simulation, the physical system would be the x86 processor, and the host system would be the laptop.

2.1.3.4 Component

A Component is an element of the Logical System that communicates with other Components in the system. In the case of Bryant's model, this is a PCA component. In the case of the Chandy and Misra model, this is a Logical Process.

2.1.3.5 Link

A Link is the path along which Components in a Logical System communicate and synchronize.

2.1.3.6 Distributed

As per Chandy and Misra, "programs which consist of two or more cooperating processes which communicate with each other exclusively through messages (...) with no central control process" [19]. This is independent of whether the Host System consists of a single core or a cluster of multiprocessors.

2.1.3.7 Virtual Time

Virtual Time is a flexible abstraction of real time [60] that represents the time of the Logical System, but not necessarily that of the host system. Virtual Time is used for the purpose of synchronization and ensuring causality in a logical system as well as for the purposes of providing the output of the simulation. Additionally, in a distributed system, each component may have a different virtual time.

A simple conceptual example is that the host system may have run the simulation for ten thousand cycles but component i in the logical system may be at cycle five hundred of the simulation while j is at cycle six hundred. Synchronization is used to ensure that causality is maintained between components i and j.

2.1.3.8 Message

A Message is data, generally timestamped, that is sent along Links between Components. A message may be strictly synchronization information or it may also contain data being transferred between Components.

2.1.4 A Comparison of Models

Using the above terminology, we are able to compare the two models that form the foundations of modern parallel discrete event simulation, the Bryant model [13] and the Chandy and Misra model [19]. At a high level, both models consist of representing a physical system as a logical system made up of components with strictly defined interconnections. Messages are sent along these interconnections via links. In this regard, both models allow for the physical system to be decomposed into a logical system in a highly intuitive manner. Similarly, both models are inherently distributed and rely solely upon point to point communication.

The difference is primarily in terms of the behaviour and granularity of a given component. The PCA model has strictly defined firing rules wherein a component only advances simulation if the firing conditions of the component are satisfied. Whereas the Chandy and Misra model is comparable to the CSP model [52] in which components execute in parallel until synchronization is required.

In terms of components, the PCA model is represented by a component that becomes active, consumes messages, advances simulation, emits new messages, and switches to an idle state. Whereas the Chandy and Misra model is represented as a component that is active throughout all execution, generating output messages as per its behavior, and only stalls while waiting for input messages.

Thus, it becomes clear that the primary difference between the two models is in terms of the granularity of a given component and how structured the logical system is. The Bryant model results in a system with simpler synchronization requirements but considerably more components and potentially more communication. Whereas the Chandy and Misra model potentially has more complex synchronization requirements but the number of components can be more closely mapped to the available resources of the host system.

2.2 Common Synchronization Techniques

As alluded to in the previous section, each model has synchronization requirements. Synchronization is required to avoid deadlock and to ensure causality. Following the development of the Bryant model and the Chandy and Misra model, the distributed simulation community built on these simple, distributed synchronization models and developed further techniques to increase the efficiency of the simulation as a whole. This, in turn, led to two schools of thought: Conservative Synchronization, described in Chapter 2.2.2, and Optimistic Synchronization, described in Chapter 2.2.3.

Conservative approaches guarantee that all simulated events occur in the correct order. Whereas Optimistic approaches assume that the order in which messages arrive, and events occur, is the correct ordering. If this assumption fails, error avoidance and correction techniques are utilized as needed. But first, I will briefly describe a subset of the techniques used to implement these different approaches.

2.2.1 General Tools and Techniques

While most of these methods are more commonly associated with Conservative models, they are also commonly used to verify correctness in Optimistic models.

2.2.1.1 Lamport Timestamps

Lamport Timestamps [69] is a simple algorithm that is frequently used to determine the order of events in a distributed system. Through the use of Lamport Timestamps, a partial ordering of events can be obtained, even if the processes involved are not perfectly synchronized.

Put simply, every process possesses a counter (clock). When a process generates a message, it uses the counter value as a timestamp. When a process receives a message, it updates its internal counter if the message's timestamp had a higher value. This very simple algorithm serves to create a partial ordering of all operations on a given path. By utilizing this approach on every process, a partial ordering of all events in the system can be generated and obeyed.
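For concreteness, the update rule can be sketched in a few lines of C++. This follows the standard formulation of Lamport's algorithm; the class and method names are illustrative only and are not part of any framework discussed in this dissertation.

    #include <algorithm>
    #include <cstdint>

    // Minimal sketch of the Lamport timestamp rule described above.
    // Names (LamportClock, onSend, onReceive) are illustrative only.
    class LamportClock {
    public:
        // Called when this process emits a message: advance the local
        // counter and stamp the outgoing message with its new value.
        uint64_t onSend() { return ++counter_; }

        // Called when a message arrives: take the maximum of the local
        // counter and the message's timestamp, then advance past it.
        void onReceive(uint64_t msgTimestamp) {
            counter_ = std::max(counter_, msgTimestamp) + 1;
        }

        uint64_t now() const { return counter_; }

    private:
        uint64_t counter_ = 0;
    };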

2.2.1.2 Lookahead Techniques

However, Lamport Timestamps alone are very prone to deadlock in the case of asynchronous merges. If a process receives no input messages, it will never be able to update its own counter to continue working. This is where Lookahead [86] techniques come into play.

With a Lookahead Technique, the component uses knowledge of its inputs and itself to determine when the next message will be available, and acts accordingly. A simple example would be to consider a component, A, that is fed a constant stream of input values. The most recently arrived message has a timestamp of T. However, knowledge of the system also indicates that it requires at least t_Source units of time for the source to generate another message. Thus, A is able to operate under the assumption that the next output message will have a timestamp greater than or equal to T + t_Source. This, in turn, allows components dependent upon the outputs of A to simulate and process all events that occur prior to T + t_Source with the knowledge that there will be no causality issues.
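A trivial sketch of this bound follows; the names are hypothetical and the check simply restates the reasoning above.

    #include <cstdint>

    // Lookahead sketch: an event is safe to process if it occurs strictly
    // before T + t_Source, the earliest timestamp any future message from
    // the source could carry.
    bool safeToProcess(uint64_t eventTime,
                       uint64_t lastTimestamp /* T */,
                       uint64_t minSourceDelay /* t_Source */) {
        return eventTime < lastTimestamp + minSourceDelay;
    }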

2.2.2 Conservative Synchronization

As previously mentioned, Conservative Synchronization is where techniques are used to ensure that all events occur in the correct order. This has the advantage of allowing for very high accuracy with little to no redundant computation, but suffers from much higher synchronization overheads [87].

2.2.2.1 Lockstep Execution through Global Time

The simplest form of Conservative Synchronization is to ensure that the clock of every process is consistent. While there are many approaches to this, they all largely consist of advancing the time of each process by a certain amount, simulating all events that occur during that period, waiting at a barrier, and then continuing on. A good example of a parallel discrete event simulation framework that uses such an approach is the FAST simulator [26] for the IBM Cyclops-64.

2.2.2.2 Null Messages

In the event of systems with asynchronous merges and cycles, methods described in 2.2 are used to avoid deadlock. While taking advantage of Lookahead Techniques is sufficient for advancing the state of a process, it may not be enough to advance the state of the simulation as a whole unless the process generates a message. Null Messages [82] resolve this issue. In the event that Process A advances its own internal clock but doesn't generate a message, it will instead generate an empty message, or Null Message, consisting of nothing but the current clock. This, in turn, will be used by processes that depend upon Process A to advance their own simulations, often through the use of Lookahead Techniques. With this approach, many deadlocks are avoided through the propagation of system state, even when no actual data needs to be communicated.

A later expansion on Null Messages is proposed by Cai [15] in which additional information, specifically related to exploiting lookahead, is attached to the Null Messages.
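The idea can be sketched as follows; the Message type and function are hypothetical illustrations, not taken from any of the cited frameworks.

    #include <cstdint>
    #include <optional>

    // Sketch of the null-message idea described above.
    struct Message {
        uint64_t timestamp;                 // Lamport-style timestamp
        std::optional<uint64_t> payload;    // empty => null message
    };

    // Called when a process advances its clock. If no real data is ready,
    // a null message carrying only the clock is emitted so that successors
    // can still advance their own simulations via lookahead.
    Message emitOrNull(uint64_t clock, std::optional<uint64_t> data) {
        if (data.has_value()) {
            return Message{clock, data};        // normal, data-carrying message
        }
        return Message{clock, std::nullopt};    // null message: timestamp only
    }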

2.2.2.3 Misra's Marker

A similar solution to deadlocks is the use of a special message, referred to as a marker, that was proposed by Misra [82]. This algorithm, used to detect deadlocks in cyclic paths, involves determining a cyclic path and passing a special message, the marker, along that path in finite time. As a process receives the marker, it sets its color to white. Upon receiving or sending a subsequent message, it sets its color to red. This knowledge can then be used to detect which processes are still white and which processes must advance the state of the simulation by processing an event.

2.2.2.4 Demand Driven Deadlock Free Synchronization

A different approach is proposed by Bain and Scott [4]. In a manner similar to the use of Null Messages and Lookahead Techniques, a Process uses knowledge of the system to advance simulation as far as possible and to avoid deadlock. Where this differs is that the process that is unable to continue signals its predecessors with a request for the updated time.

Specifically, Process A sends a request to its predecessor, Process B, for the ability to progress to time t. Process B will then signal Process A when it can guarantee that no messages will arrive with a timestamp less than time t. Specifically, Process B will either respond with yes, indicating that it can offer that guarantee, no, indicating that it cannot and another request must be made, or ryes, which indicates that it has conditionally reached time t and is used to resolve cycles.

2.2.3 Optimistic Synchronization

Unfortunately, the aforementioned Conservative Synchronization comes at a cost. While redundant computation and error correction is avoided, the methodologies used often result in decreased parallelism and higher synchronization overheads when compared to Optimistic approaches [87].

Under an Optimistic approach, the assumption is made that most events in the simulation will occur in the same order as the physical system being modelled. In many cases, this is a very safe assumption and there are few issues. However, much research has been done on how to resolve problems as they arise.

2.2.3.1 Time Warp: Reverting to Resolve Issues

One of the earliest solutions for Optimistic Synchronization was that of the Time Warp, proposed by Jefferson [60]. While similar to the work of Chandy and Misra in its use of processes and synchronization messages, Jefferson's work differs in that simulation continues, where possible, regardless of input messages. If a predecessor A sends a message to Process B indicating an event with a timestamp earlier than Process B's clock, then Process B will revert to a previously saved state and recompute the simulation with the knowledge of the event.

In addition to this, messages must be sent to successor processes indicating that a rollback has occurred, which can result in what is referred to as a rollback chain. One solution is Gafni's Lazy Cancellation [39]. Using the above example, Process B will first recompute its state using the updated information from Predecessor A. If the messages that it sent to its successors still hold true, no message will be sent to the successors and simulation will continue. Otherwise, the invalidated simulation state will cascade.

Prakash and Subramanian [104] propose a solution to the issue of the rollback chain that is similar to Cai's enhancements to the Null Message [15]. By providing additional state information in the rollback message, processes are able to avoid recomputing messages based on obsolete states that will likely result in another rollback.

In systems where the processes simulate at approximately the same rate, this has the potential to allow for a very high rate of simulation as little to no time is spent waiting for events and many events, even those with a causal relationship, are able to be executed in parallel. This leads to what is referred to as Supercritical Speed-up [59], which is a situation where even the critical path is able to be executed in parallel.
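A highly simplified sketch of the state-saving and rollback step described above follows; the names are hypothetical, and a real Time Warp implementation would also handle anti-messages to successors and fossil collection of old checkpoints.

    #include <cstdint>
    #include <iterator>
    #include <map>

    // Placeholder for whatever a logical process would need to checkpoint.
    struct State { /* process-specific simulation state */ };

    class OptimisticProcess {
    public:
        // Save a checkpoint of the state at a given virtual time.
        void checkpoint(uint64_t time, const State& s) { saved_[time] = s; }

        // A straggler message arrived with a timestamp in our past:
        // restore the latest checkpoint at or before that time. The caller
        // must then re-execute events and notify successors of the rollback.
        void rollback(uint64_t stragglerTime) {
            auto it = saved_.upper_bound(stragglerTime);
            if (it != saved_.begin()) {
                --it;
                current_ = it->second;
                // Discard checkpoints newer than the restored one.
                saved_.erase(std::next(it), saved_.end());
            }
        }

    private:
        std::map<uint64_t, State> saved_;  // checkpoints keyed by virtual time
        State current_;
    };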

2.2.3.2 Optimistic Time Windows

A more conservative Optimistic method involves time windows. Sokol, Briscoe, and Wieland [113] proposed a scheme with what is referred to as a moving time window. Put simply, a Process with time t_LP is only able to simulate and process events where t_event ≤ t_LP + δ. This threshold is advanced through additional logic, but the intuitive approach is that this greatly limits the amount of work that will need to be recomputed in the event of an out of order event.

2.2.3.3 Lax Synchronization

A comparatively recent work by Miller et al. involves the use of what is referred to as "Lax Synchronization" [81]. Lax synchronization builds upon the previous work with time windows to further avoid the need for rollbacks by processing events in the order in which they are received, not in the order in which they occur [20]. While this removes functional accuracy, it greatly increases performance and, in tests [81, 20], still yields indicative results with respect to cycle accuracy. Lax synchronization is further enhanced through its use of point-to-point synchronization of the clocks of each process so as to maintain the time window.

2.2.4 Influential Frameworks

Over the decades of research in the field, many works have been developed. Most research has built upon the work of Chandy and Misra by using a sequential simulator to act as a component in the logical system with hooks to a distributed framework. More contemporary models are discussed in Chapter 8.

2.2.4.1 MPISim

MPISim [105] was an early simulation library designed to simulate the behaviour and latencies of the MPI communication library itself and was used in conjunction with other simulators. This approach is still popular because such libraries provide a familiar interface to programmers on distributed machines. By treating interprocess communication strictly as MPI operations, the simulation framework is able to represent a wide variety of architectures and interconnection frameworks. The limitation of this approach is that it restricts the user to a specific programming model, which may not be conducive to the research and development of new programming models.

2.2.4.2 BigSim

BigSim [129] restricts simulation to specific programming models, in this case CHARM++ and AMPI, to increase simulation speed by greatly reducing the complexity of simulating individual instructions and to avoid the need to implement a software stack for the simulated architecture. Instead, the latencies of function calls are determined based upon the simulated architecture while the behaviour is determined by the host system. BigSim also lowered the overhead of optimistic synchronization by taking advantage of determinacy in the simulated programs.

2.2.5 Queuing Networks

A different approach is the use of queuing networks. Because of the nature of modern interconnects, queuing systems [65, 72] are frequently used to simulate traffic in a system at a high level. Traditionally, each input to the interconnect will be modelled as a function, often a stochastic process, that generates traffic. The queuing system will then be used to model the latency of the system as a whole. Jacquet et al. demonstrated how the concept of percolation on a novel architecture can be studied through the use of queuing network-based simulation [56].

2.3 Background of Program Execution Models

As computer research has advanced, the focus has shifted from comparatively low level tools, such as MPI [79] and POSIX Threads [84], toward higher level abstractions referred to as "runtimes" [22, 24]. These runtimes, and their underlying models, abstract away the need to manually move data and distribute work and provide a simpler interface for the programmer. This is part of a major push toward usability and a decoupling of domain experts and tuning experts.

A key part of my work has been the evaluation of computer systems for the future of scientific computing, and thus the study and evaluation of novel runtimes and models. To this end, I will once again define a set of terms common to various models and then describe many of the common tools and models.

2.3.1 Common Terminology

With respect to runtimes and programming models, there is much more commonality between approaches. As such, the number of terms I will define is much smaller.

2.3.1.1 Program Execution Model

As established in the field, a Program Execution Model (PXM) is a low-level abstraction of the system architecture upon which the programming model, runtime system, and other software are developed. It is commonly defined in the context of

• Threading Model: How is the application expressed and mapped to available system resources

• Synchronization Model: How are resources and data shared between parallel threads of the application

• Memory Model: How do the parallel threads of the application interact with the system memory.

Specifically, we primarily focus on PXMs built around asynchronous tasks, which are lightweight tasks that are scheduled based upon the availability of data dependencies.

2.3.1.2 Runtime

In the context of this dissertation, a runtime is the software implementation of a program execution model.

2.3.1.3 Codelet

The term "codelet" has been used heavily over the past few years by a variety of sources [92, 38, 29, 70]. The majority of works agree on a codelet being a comparatively lightweight task with strictly defined data dependencies in the form of inputs and outputs. The difference lies in the granularity and whether or not execution of the codelet may be pre-empted.

As such, for the purpose of this thesis, a codelet is a lightweight task to be scheduled by an asynchronous task based PXM.
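As a conceptual illustration of this definition (and not the interface of any particular runtime), a codelet can be thought of as the following structure, with illustrative names.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Conceptual sketch of a codelet: a lightweight task with explicit
    // input and output dependencies, fired by the runtime once all of its
    // inputs are available. Names are illustrative only.
    struct Codelet {
        std::vector<int> inputSlots;     // data dependencies (inputs)
        std::vector<int> outputSlots;    // data produced (outputs)
        std::function<void()> body;      // work to run when fired
        std::size_t pendingInputs;       // counts unsatisfied inputs

        // Called by the runtime each time one input dependency is satisfied;
        // the codelet becomes ready to schedule when this reaches zero.
        bool satisfyOneInput() { return --pendingInputs == 0; }
    };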

2.3.2 Novel Program Execution Models

A major push in the development of modern program execution models has been toward usability and a decoupling of domain experts and tuning experts. As such, much research has been performed on this subject and, in this section, I will briefly describe a few categories of modern program execution models and list representative examples of each.

2.3.2.1 Distributed Shared Memories

Many PXMs are built around the concept of a globally addressed distributed shared memory. Under these models, the user is able to access distributed memory in a manner similar to shared memory. This is handled by performing all distributed memory calls and synchronizations in the background.

Scioto [34], a framework for global-view task parallelism, was developed by the Ohio State University. It is designed around performing load balancing of tasks while also providing a global memory through Global Arrays [88]. Via Global Arrays, any data that is flagged as global will be automatically shared among all processors. In this regard, this work builds upon work such as UPC [16].

2.3.2.2 MapReduce

Another popular approach is to take advantage of the MapReduce [25] algorithm. Under the MapReduce algorithm, a given task is intrinsically coupled to the associated data. A given operation creates a pool of tasks, "maps" them to available resources, executes the tasks, and performs a "reduction" to collect the results. Spark [126], Pathos [78], and Mesos [51] are built around this model. Tasks, and their data, are mapped to available processors and, after execution, the results are migrated to the appropriate destination.
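As a generic, single-process illustration of the map and reduce steps described above (plain C++, not the API of any of the cited frameworks):

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> data{1, 2, 3, 4};

        // "Map": apply a task to each element; in a real framework each task
        // would be shipped to the resource that holds its data.
        std::vector<int> mapped;
        for (int x : data) mapped.push_back(x * x);

        // "Reduce": collect the partial results into a single answer.
        int result = std::accumulate(mapped.begin(), mapped.end(), 0);
        std::cout << result << "\n";   // prints 30
        return 0;
    }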

2.3.2.3 Asynchronous Task Based Models

Modern asynchronous task based models build upon the previous two approaches. They often provide a global shared memory of some form, but more as a way of providing a layer of abstraction between the application and the hardware. And, much like MapReduce models, the task is intrinsically tied to the data and is migrated to the desired destination. However, the tasks and dependencies generally have a much finer granularity.

Popular examples include the "codelet" models: the University of Delaware Codelet Model [42], ET International's SWARM [70], MIT's Fresh Breeze model [29], and ParalleX [62]. These models, as well as others such as Intel's Concurrent Collections [14], tend to rely upon fine grain dependencies and comparatively short tasks.

More coarse grained approaches that are still based around asynchronous tasks and data and task migrations include Charm++ [63], Swift [128], and Scala [90] with Akka [10] or DFScala [47]. These approaches tend to be more conventional, with a focus on legacy code, and have coarser grained tasks.

2.4 Porting TiNy Threads to Distributed System

In early 2010, I worked as part of a team to design and implement a distributed shared memory [89] to support endeavours to adapt the TiNy Threads program execution model [26] to a distributed memory system [124], specifically the BlueGene/P [116]. We focused primarily on building a shared memory layer that encapsulated distributed memory and the underlying message passing mechanisms, DCMF in this case [110], in order to provide the programmer, and the program execution model, with a shared memory view. This allows the user, and the program execution model, to leverage the usability of traditional shared memory parallel programming models on distributed systems.

The TNT Distributed Shared Memory, or TDSM, was evaluated with respect to memory latencies when run at scale. Figure 2.5 shows our preliminary scaling results.

Figure 2.5: Scalability of Read and Write Operations in TDSM (© MTAAP 2010 [124])

These results are for a benchmark that reserved a fixed size, 1024 bytes, of memory on each node to create a shared logical address space. This benchmark then ran a loop wherein each iteration performed a read or write operation on the entire logical address space to ensure that each remote node is accessed. The execution time was then recorded and divided by the number of iterations to determine average access time. These results were presented at MTAAP 2010 [124].

While the project to port TiNy Threads to BG/P did not continue, the methodologies used and explored are still a key part of modern program execution model research, as discussed in Chapter 2.3, with many of the discussed models and frameworks being built around providing a layer of abstraction so as to treat a distributed memory system as one with a shared memory.
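A pseudocode sketch of the benchmark loop described above is given below; the tdsm_read and tdsm_write calls, and their stub bodies, are hypothetical stand-ins rather than the actual TDSM interface.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for the TDSM read/write operations; in the real
    // system these would be backed by DCMF message passing.
    void tdsm_read(std::size_t, void*, std::size_t) { /* stub */ }
    void tdsm_write(std::size_t, const void*, std::size_t) { /* stub */ }

    // Sketch of the access-time benchmark: each node contributes 1024 bytes
    // to the shared logical address space, and each iteration touches the
    // entire space so that every remote node is accessed.
    double averageAccessTime(std::size_t numNodes, std::size_t iterations, bool doWrites) {
        const std::size_t spaceSize = 1024 * numNodes;
        std::vector<std::uint8_t> buffer(spaceSize);

        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < iterations; ++i) {
            if (doWrites) tdsm_write(0, buffer.data(), spaceSize);
            else          tdsm_read(0, buffer.data(), spaceSize);
        }
        auto end = std::chrono::steady_clock::now();

        double total = std::chrono::duration<double>(end - start).count();
        return total / static_cast<double>(iterations);   // average per iteration
    }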

2.5 Studying and Using TIDEFlow

Similarly, I assisted Daniel Orozco and Elkin Garcia in the development of the Time Iterated Dependency Flow (TIDeFlow) program execution model [92]. Primarily, my work was focused on implementing common high performance computing (HPC) algorithms and applications under the TIDeFlow model.

TIDeFlow is a parallel execution model that was designed to efficiently express and execute traditional HPC programs. TIDeFlow leveraged three common characteristics of traditional HPC programs: the abundance of parallel loops, easily expressed dependencies between parallel loops, and the composability of said programs. Using this, it introduces weighted actors and weighted arcs to address and express the concurrent execution of tasks as well as the overlapping of communication and computation through double buffering and pipelining [93].

TIDeFlow was primarily focused on shared memory systems with a specialized implementation available for the IBM Cyclops-64 and a more general implementation that employs only features supported by GCC, the GNU C compiler [114].

TIDeFlow itself was born out of early work on ET International's SWARM model [70] and was inspired, in large part, by previous work on the EARTH-MANNA multithreaded system [53, 118]. Similarly, it and much of the work it builds on [46, 43] also went on to influence the development of the University of Delaware Codelet Model [42]. So while it is true that TIDeFlow is no longer being actively developed, its impact on the field and the future of program execution models can still be felt.

2.6 Work Toward Enhancing Scientific Simulations through Co-Design

Much of my work on evaluating program execution models was performed as an intern at Los Alamos National Laboratory under Allen McPherson, Tim Germann, Christoph Junghans, and Ben Bergen as part of a Co-Design Summer School. The purpose of this internship was to group physicists and computer scientists, as domain experts and tuning experts, to evaluate program execution models and how they fulfill the needs of scientific computing. Through doing this, I was able to gain a deeper understanding of the needs of scientific computing.

2.6.1 Heterogeneous Multiscale Modeling

My first task was to evaluate modern program execution models with respect to a physics application modelling a shockwave propagation through heterogeneous multiscale modelling. This consisted of treating the system as a finite element problem with the small scale simulations performed using a molecular dynamics library [83].

Specifically, I closely interacted with physicists to study the problem and determine what runtime or program execution model best suited the problem. The models studied and evaluated include, but are not limited to, Charm++ [63], Scioto [34], Mesos [51], Spark [126], Pathos [78], Swift [128], Intel's Concurrent Collections [66, 14], and Scala [90] with Akka [10].

After much study, we primarily focused on Scioto [34], Spark [126], and Intel's Concurrent Collections [66, 14] as they all provided simplified interfaces that allowed for rapid development, in which the computer scientists would create a fast version which would allow the physicists to study the problem and add to the simulation. After which, the computer scientists would optimize the new additions to repeat the cycle.

This work is still ongoing, but preliminary results [109] demonstrate the benefits of using proxy applications to evaluate program execution models and runtimes [64].

2.6.2 Adaptive Mesh Refinement

Following this, we investigated a different, yet still representative, scientific application. Specifically, we developed a scheme for adaptive mesh refinement [8] of a hydrodynamics simulation [71, 1]. This was chosen as it is representative of several applications and interests in scientific computing and has a tendency for very large scale simulations of non-uniform systems.

Adaptive mesh refinement is an important part of scientific computing as, in many simulations, only a portion of the physical region requires a detailed analysis. Thus, through adaptive mesh refinement, the computational power of the host can be focused on the most pertinent regions. An example of this can be seen in Figure 2.6, with Figure 2.6a showing a visualization of a single timestep of the simulation and Figure 2.6b showing how the tiles can be broken up to prioritize computation.

We worked closely with domain experts to develop our simulation and then, using hashed quad-trees [111, 121], implemented a simple program execution model neutral library and implementation. Using this PXM-neutral implementation I then adapted and implemented the application in Intel's Concurrent Collections, as seen in Figure 2.7, while my colleagues worked on applying the same library and algorithm to other models such as Charm++ [63] and ParalleX [62].

While Figure 2.7 is specifically a Concurrent Collections graph, the algorithm and library remained largely unchanged between all implementations. First, the mesh is broken up into tiles which exchange ghost cells as needed. Then, a one dimensional sweep of the tiles is performed, with flux corrections exchanged between tiles to reduce the error generated through the differing levels of refinement between tiles. Afterwards, the algorithm is repeated for the second dimension of the problem, and the end result is a full time step of the hydrodynamics problem in two dimensions. This work was presented at the 2014 workshop on Concurrent Collections [96].

Figure 2.6: Tiling of Centralized Shock Discontinuity [96]. (a) Without Adaptive Mesh Refinement; (b) With Adaptive Mesh Refinement.

This work was particularly important as it demonstrated that scientific applications of great interest can not only be implemented in modern asynchronous task based program execution models but that said applications can be implemented in a largely PXM-neutral manner so that the same scientific libraries can be used with a wide range of runtimes.


Chapter 3

PICASIM: A MODEL FOR THE DEVELOPMENT AND SIMULATION OF RUNTIMES AND ARCHITECTURES

In this chapter, I will describe the model for PICASim: The PCA Inspired Computer Architecture Simulator. PICASim is a tool for parallel discrete event simulation of systems including, but not limited to, computer architectures, abstract machines, and program execution models.

3.1 Why PICASim

PICASim was originally developed as a tool to facilitate the development and simulation of computer architectures. Hence the "CA" in PICASim. By using a simulator, researchers are able to rapidly study and evaluate new hardware and systems while spending minimal time and money on the hardware implementation. In the case of a new processor, this is essential as the process of fabrication can often take months or even years. And to find a bug in the final chip can be devastating due to the costs of re-fabricating the chip after fixing the problem.

However, over the years the focus of parallel discrete event simulation has largely shifted toward speed, rather than accuracy. Many works, such as MIT's Graphite [81], outright ignore functional accuracy in the name of high performance. And many of the accurate simulators are based around using modified versions of currently existing sequential simulators of currently existing architectures, such as the heavy reliance on tools such as AMD's SimNow [5] to model x86 and AMD64 architectures. This extends toward the growing trend of program execution models designed with novel architectures in mind [28, 42], where a runtime designed to run on existing technology may not present a full picture of the effectiveness and value of the proposed model.

To this end, I have developed the PICASim model and framework to provide a highly accurate and easily modifiable tool to aid in the development and evaluation of these architectures and programming models. The bottom-up design of the PICASim framework addresses the challenges of developing new architectures in conjunction with novel program execution models for the exascale era. The PICASim model's compromise between accuracy and performance can be adjusted during the early development of exascale systems to obtain highly accurate simulations of smaller, but representative, systems to aid in the study of the impact that new architectural features and runtime capabilities will have on the performance of a full exascale system.

The PICASim model is primarily designed to explore the design space of novel architectures at an early stage. It is not intended to replace tools intended for larger scale simulation [57, 50], but to instead perform research prior to investing time and resources into the development of high performance sequential simulators that are vital to larger scale experiments.

3.2 The PICASim Model

The PICASim model is a hybrid of the Chandy and Misra [19] model that incorporates the Bryant [13] model's concept of components that only fire when their input dependencies are satisfied. Thus, the PICASim model can also be viewed as one in which a physical system is modelled as a logical system which, in turn, is a directed graph of Components that communicate via Messages sent on Links.

Specifically, the PICASim model is built around the concept of the Computer Hardware Unit, or CHU. The CHU is the fundamental component and acts as a hybrid of the Bryant component and the Chandy and Misra process. A CHU is defined as an entity with zero or more input ports, zero or more output ports, and a behaviour that defines how packets on the former are consumed to generate packets on the latter. In a manner similar to Chandy and Misra's LPs, the CHU is used to model the behaviour of a portion of the system that is being simulated.

CHUs are connected via, and communicate along, Links. A Link is defined as a form of point-to-point communication between the ports of CHUs and is characterized by FIFO behavior. Together, the CHUs and Links are used to build directed graphs that represent the system to be simulated.

The CHU combines the structure and intuitive nature of Bryant's PCA Components [13] with the flexibility and freedom of Chandy and Misra's Logical Process (LP) [19]. Like the Bryant component, the CHU may only fire if its input dependencies are satisfied. Similarly, the CHU only communicates with other CHUs via timestamped messages sent on Links. However, unlike the Bryant model, a CHU's firing rules are purely tied to the behaviour and role of the CHU and are subject to change based upon the state of a given CHU. Thus, rather than model a component of the physical system as a combination of arbiters, switches, and functions, the PICASim model allows for a 1:1 mapping of logical components to physical components. This allows the user to fit the model to the physical system instead of vice versa. Similarly, synchronization is primarily handled through the use of timestamped messages on Links, via the use of Lamport Timestamps [69].

3.2.1 Firing Rules of PICASim

While the firing rules of PICASim are inspired by the PCA Model, modifications were made to increase the flexibility and usability of the system. Under the PCA model, the same firing rules are defined for each type of component, i.e. all Functions share the same firing rules, which differ from the ones shared by all Switches and those shared by all Arbiters. These firing rules are a function of the timestamp and the availability of messages on the input link.

In PICASim, the firing rules of CHUs are also dependent on the timestamp and availability of messages on the input link. However, each CHU can have its own distinct set of firing rules, so long as it is solely a function of packets obtained from input links and, optionally, the internal state of the CHU itself. As such, the output message(s) of a CHU are a function of only the consumed input message(s) and the internal state of the CHU itself (i.e. the firing of a CHU has no side effects on any other CHU).

The flexibility of this schema allows for the modelling of systems with great ease and flexibility. At its simplest, one can choose to replicate the PCA model by restricting CHUs to three sets of firing rules: Function CHUs that only fire when a message is available on each input link, Arbiter CHUs that act as asynchronous merges, and Switch CHUs that route messages accordingly. This allows for the modelling of complex systems through the tight interconnection of Function CHUs and the modelling of contention for resources in highly parallel systems through asynchronous merge operators that maintain causality while serializing access.

However, PICASim's strength comes from the ability to consider CHUs with firing rules that are a hybrid of the Function and Arbiter, in addition to even more complex approaches. And the use of internal state simplifies the modelling of memory as well as allowing for more advanced behaviour and the simulation of self-aware systems.

Figures 3.1 and 3.2 show a very simple example of the benefits of this approach with a CHU representing one method of implementing a floating point unit (FPU). Figures 3.1a and 3.2a show the initial state of the CHU, with an input on every port. In practice, only the FMADD instruction would have a packet on port C, but a packet is used for the sake of this example. Figures 3.1b and 3.2b show the CHU after the first step, in which the operator is consumed. Note that no output Packet is generated and only the "Op" message is consumed. After this step, the operation to perform is determined and the internal state of the CHU is updated to reflect this. As such, Figures 3.1c and 3.2c show the result of the next step, in which the specified Packets are consumed and an output is generated.


Figure 3.1c shows this for the fused multiply and add operation, in which all three operands are used, whereas Figure 3.2c shows a simple floating point addition that only uses two of the three operand ports. As a result, the physical system is modelled in a highly intuitive manner, with distinct firing rules and a clear interface, without becoming overly complex.
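To make the distinction concrete, the two baseline firing rules can be sketched in a few lines of C++11. This is a minimal illustration, not the PICASim source; the InputLinkView structure is a hypothetical stand-in for whatever accessors the real Link class exposes.

```cpp
#include <vector>

// Hypothetical minimal view of an input link: whether a packet is waiting
// and the timestamp currently known for that link.
struct InputLinkView {
    bool has_packet;
    unsigned long long link_time;
};

// Function-style firing rule: the CHU may fire only once every input link
// has a packet available (cf. Bryant's Function component).
bool functionStyleReady(const std::vector<InputLinkView>& inputs) {
    for (const InputLinkView& in : inputs)
        if (!in.has_packet) return false;
    return true;
}

// Arbiter-style firing rule: the CHU may fire as soon as any input link has
// a packet; the message with the earliest timestamp is consumed first.
bool arbiterStyleReady(const std::vector<InputLinkView>& inputs) {
    for (const InputLinkView& in : inputs)
        if (in.has_packet) return true;
    return false;
}
```

A hybrid rule, such as the two-phase FPU example above, is simply a different predicate over the same inputs plus the CHU's internal state.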

3.2.2 PICASim Synchronization Model

PICASim employs a fine-grain conservative synchronization model based on Lamport Timestamps [69] with heavy use of lookahead techniques [106]. The theory behind the simulation methodology is based on the approaches of Bryant, Chandy, and Misra, with modifications to accommodate the flexible yet structured CHU. Upon firing, a CHU will generate zero or more timestamped packets on its output ports. This timestamp is referred to as T_CHU. However, all output ports must propagate this time, regardless of whether a packet is generated or not. Each Link then records the time, referred to as T_Link.

Periodically, the CHU will use its input links and T_Behavior, the minimum amount of time that will be required to process inputs to generate an output, to determine the earliest possible firing time, to which T_CHU is then updated. The equation for this is as follows:

T_CHU = max(T_Input, T_CHU) + T_Behavior    (3.1)

The new T_CHU is then propagated to become the new T_Link for all output Links so as to avoid deadlocks.

However, this glosses over the issue of computing T_Input. This varies depending upon the behaviour of the CHU, but is defined as the maximum time of all consumed packets. To explain this, I will use the Function and Arbiter components of the PCA model as illustrative examples. In the case of a Function component, a packet is consumed on all input ports.

Thus, T_Input would be defined as the maximum of T_Link over all input links.
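As a concrete illustration, the update in Equation 3.1 for a Function-style CHU can be written as a small C++11 helper. This is a sketch with illustrative names, not code from the PICASim framework.

```cpp
#include <algorithm>
#include <vector>

// Computes the earliest possible firing time of a Function-style CHU per
// Equation 3.1, where T_Input is the maximum of T_Link over all input links.
unsigned long long nextFiringTime(const std::vector<unsigned long long>& inputLinkTimes,
                                  unsigned long long tChu,
                                  unsigned long long tBehavior) {
    unsigned long long tInput = 0;
    for (unsigned long long t : inputLinkTimes)
        tInput = std::max(tInput, t);
    // T_CHU = max(T_Input, T_CHU) + T_Behavior
    return std::max(tInput, tChu) + tBehavior;
}
```

For an Arbiter-style CHU, only the timestamp of the single consumed message would contribute to T_Input.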

indicates that the Link is empty. The label on each Link corresponds to the timestamp of the Message in the case of a filled Link, or T_Link in the case of an empty Link. Each CHU, except for Node E, consumes one message from each input Link prior to firing, whereas CHU E behaves as an asynchronous merge operator and merely consumes the earliest message to ensure FIFO behaviour. Node B and Node E are able to fire independently from one another, and Node G only requires the results of B and E to fire. And, because of the FIFO nature of the Links, Nodes B and E may fire multiple times prior to G consuming any messages.

As a result of this decentralized approach, PICASim lends itself well to being run across multiple hosts in a distributed environment. This is because, outside of periodic synchronization regarding the number of termination signals obtained, each task queue is able to operate independently. Thus, there is no need for every task queue to be on the same physical host. Instead, they can operate on separate hosts and communicate over a network through message passing. The same is true of CHUs; there is no need for each end of a link to be on the same host.

3.3 Goals of the PICASim Model

As previously discussed, the PICASim model was designed to explore the design space of novel architectures at an early stage. To do this, the model was designed with three key features in mind:

1. a high degree of composability due to a modular design

2. support of task-based execution to utilize a wide variety of host systems.

3. a fully decentralized methodology for parallel discrete-event simulation that represents the system as a directed graph with firing rules

By targeting these goals, the PICASim model was developed to allow for the rapid study and exploration of the design space of novel architectures while still supporting commodity host systems.

3.3.1 Composability Due to a Modular Design

As previously mentioned, a system simulated in PICASim is divided into CHUs. Because each CHU only communicates with the system as a whole through input and output Links, our framework also allows for a high degree of composability. Individual CHUs can be replaced as long as the input and output behaviour remains similar. Thus, different architectural features can be investigated with minimal alterations to the simulated system. An example of the advantages of this approach would be investigating the benefits of a different interconnect design or an altered memory hierarchy in a many-core chip.

The composability of the system offers additional benefits to aid in the collection of data during a simulation. Cycle-accuracy has been a goal for architectural simulation, but has often been ignored in favor of functional accuracy and indicative timing results due to the high cost of a cycle-accurate simulation [26, 81].

However, different applications require different levels of accuracy. The PICASim framework is designed so that implementations with varying levels of performance and accuracy can be freely interchanged (e.g. a communication-bound program would benefit from a cycle-accurate interconnect, but may still use a functionally accurate processor and memory). In a similar way, this capability can be further extended to power-accurate simulation [44] or a wide range of other metrics of interest and value.

3.3.2 A Task Based Framework

PICASim is specifically designed to provide support for a wide variety of host platforms by allowing each CHU to be simulated independently of the rest through the use of Links for synchronization. As such, each CHU is well suited to being treated as a recurring unit of work in a tasking system with multiple workers corresponding to available threads.

This tasking system provides many advantages to usability and flexibility by avoiding the need to map the directed graph to the host system. As such, the system graph can be designed to express the inherent parallelism of the simulated system, not the host.

This also has the added benefit of allowing for dynamic scheduling to be used to avoid stalls due to inter-CHU dependencies [43].

3.3.3 Completely Decentralized Methodology

As previously mentioned, the termination of the simulation as a whole is handled by allowing specific CHUs to signal the framework upon reaching a pre-defined internal state (e.g. catching a signal to return from the main method of a program). Once all expected termination signals have been received, the simulation itself terminates. Again, this avoids the need for any centralized constructs to maintain global state, with all information being passed in a point-to-point fashion.

3.3.4 Terminating Simulation

This decentralized and point-to-point approach extends to detecting when the simulation should terminate. PICASim utilizes a special CHU with the ability to terminate the simulation of the system. Links are used to allow other CHUs throughout the system to signal the Termination CHU when a user-specified state has been reached. Upon receiving a pre-determined number of signals, the Termination CHU signals the PICASim framework to end simulation.

3.4 Representing a System in PICASim

As indicated by the name, PICASim draws inspiration from the early works of Bryant, Chandy, and Misra. This is most evident in the fundamental unit of PICASim: the Computer Hardware Unit or "CHU". CHUs are connected via, and communicate along, Links. A Link is defined as a form of point-to-point communication between the ports of CHUs and is characterized by FIFO behavior. Together, the CHUs and Links are used to build directed graphs that represent the system to be simulated. An example of such a system can be seen in Figure 3.4, where each node of the graph represents a CHU and each edge represents a Link, with the arrowhead on the side that feeds into an input port.


3.5.1 Exascale Challenges Addressed

The bottom-up design of the PICASim framework addresses the challenges of developing new architectures in conjunction with novel program execution models for the exascale era. The PICASim model's compromise between accuracy and performance can be adjusted during the early development of exascale systems to obtain highly accurate simulations of smaller, but representative, systems to aid in the study of the impact that new architectural features and runtime capabilities will have on the performance of a full exascale system.

3.5.2 Uniqueness and Novelty

Although other projects address the issue of the development of exascale architectures, PICASim is novel in its approach. Traditional approaches, such as SST [57] and COTSon [3], provide frameworks to utilize existing, high-performance solutions to great effect. However, by relying on existing solutions for simulation, creativity is potentially limited to modifying existing tools as opposed to creating new ones. Thus, researchers become restricted by the resources they can expend toward making low level simulation tools. This promotes evolutionary, as opposed to revolutionary, designs. And, due to the complexity of these existing solutions, smaller research groups, such as those in academia, are greatly limited in what they can study.

Instead, PICASim is designed to allow for the study of radically different systems, with an emphasis on accuracy and ease of use over performance. In doing this, revolutionary designs can be proposed and studied at a representative scale prior to developing specialized tools for a high-performance solution for a full exascale simulation.

Similarly, while PICASim builds upon, and takes inspiration from, the works of Chandy, Misra [19], and Bryant [13], it fills a gap left by both models. The Bryant model's reliance on representing systems with its three components (Function, Arbiter, and Switch) results in a very fine grain simulation with a high degree of additional communication due to synchronization between components.

The Chandy and Misra model, on the other hand, has much coarser grained components, which avoids communication between the building block components but potentially results in additional overhead due to the coarser grained components. PICASim is intended to still have finer grained components than the Chandy and Misra model while avoiding the overhead of individual Bryant components, at the cost of potentially losing some of Bryant's guarantees regarding deadlock avoidance.

3.5.3 Applicability

While the PICASim framework was designed with the challenges faced by the development of exascale architectures and new programming models in mind [33], it has already demonstrated its usefulness in the study of pre-exascale architectures [99] and the study of new paradigms [28]. Furthermore, although PICASim has been designed for the simulation and study of computer architectures and runtime systems, it is also a general PDES (Parallel Discrete Event Simulator) tool capable of simulating more general systems. PICASim has already demonstrated its capabilities for Timed Petri Net modelling of systems [45].

3.5.4 Maturity

While still under development, the PICASim framework has already been used to simulate the interconnect and the performance of highly tuned algorithms on a modern manycore architecture [99, 45]. This work in particular demonstrates the strengths and applicability of PICASim: it is able to model the performance of the application to a very close degree, and it allows for the extrapolation of performance on theoretical systems with additional architectural features. Additionally, as discussed in Chapter 5, PICASim is already being used in the active development and study of novel architectures and program execution models.

3.5.5 Strengths and Weaknesses of the PICASim Model

As previously discussed, PICASim is a powerful model, but it is not a complete solution. It has both strengths and weaknesses and many limitations.

In terms of strengths, PICASim is highly effective at modeling revolutionary architectures and program execution models, as shown in Chapters 5 and 6. Similarly, because of its highly modular nature, we are able to easily build experiments to explore the design space of architectures and program execution models at an early stage, as shown in Chapter 6.

However, as mentioned, PICASim is not a complete solution. Because of its focus on accuracy and its reliance on conservative synchronization, it is not reasonable to scale PICASim to the level required to perform large scale simulations. For those applications, tools such as Sandia's SST [57, 50] are a much better solution. Similarly, if the architecture is a more evolutionary design, existing tools such as AMD's SimNow [5] are more than sufficient for early research.

Chapter 4

THE PICASIM FRAMEWORK

In this chapter, I will go into detail on how the PICASim model is implemented, as well as provide insight into how PICASim is intended to be used.

4.1 Language and Libraries Used

PICASim itself is implemented entirely in C++. Specifically, I take advantage of the C++11 standard [54] for the purpose of utilizing the features added for concurrency, as well as the object oriented nature of C++ due to its historical benefits with respect to modeling simulations [21]. Additionally, I utilize the Boost.Lockfree library [2] for its implementation of lock-free queues and MPI [79] for communication in a distributed system.

4.2 Shared Memory Implementation of PICASim

The foundation of the PICASim framework is the shared memory implementation.

4.2.1 Implementation of Task Queue

In the PICASim framework, the fundamental unit is not actually the CHU: it is the SimulatableTask. The SimulatableTask, described in Figure 4.1, is used to represent an entity in the Task Queue. The advantages of this abstraction are further described in Chapter 4.2.3. The task queue itself is a fairly simple implementation that consists of three aspects:

• A Task Pool that consists of a collection of instances of the SimulatableTask class.

• A Thread Pool that is used to execute the tasks.

• A Counter to keep track of the number of termination signals (see Chapter 3.3.4) received globally.

Figure 4.1: UML Class Diagram of Abstract Class SimulatableTask (PICASim::SimulatableTask, with the single method +simulate(): SimulatableTask::ReturnStatus)

Figure 4.2: UML Class Diagram of TaskQueue (PICASim::TaskQueue, with #termCounter: std::atomic<unsigned int>, +getTask(): SimulatableTask, +giveTask(SimulatableTask): void, +spawnThreads(numThreads: unsigned int), +joinThreads(): void, and +fireThreads(): void)

The UML class diagram of the task queue is shown in Figure 4.2. Tasks are added to the pool via the giveTask() method, and are retrieved through the getTask() method. For the sake of convenience and to simplify the system, the TaskQueue is also responsible for spawning and maintaining threads, which are spawned with spawnThreads() and terminated with joinThreads(). Additionally, for timing purposes, the threads do not poll the queue for tasks until the fireThreads() method is invoked. As a thread is fired, it requests a task from its parent queue using TaskQueue::getTask(), invokes the task’s SimulatableTask::simulate(), and either re-enqueues the task or discards it, depending upon the returned value. Once the task is complete, another task is requested through another invocation of TaskQueue::getTask(). In the current PICASim framework implementation, the Boost.Lockfree library’s lockfree queue is used and the threads are provided by the C++11 standard.
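The control flow described above can be sketched in self-contained C++11 as follows. This is an illustration only: the real TaskQueue uses the Boost.Lockfree queue and C++11 threads, whereas the sketch substitutes a mutex-protected std::queue to stay compact, and all names besides simulate(), getTask(), giveTask(), and termCounter are hypothetical.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <queue>

// Abstract task in the spirit of Figure 4.1.
struct SimulatableTask {
    enum class ReturnStatus { Requeue, Discard };
    virtual ~SimulatableTask() {}
    virtual ReturnStatus simulate() = 0;
};

// Simplified task queue in the spirit of Figure 4.2.
class TaskQueue {
public:
    void giveTask(std::unique_ptr<SimulatableTask> t) {
        std::lock_guard<std::mutex> lock(mtx_);
        tasks_.push(std::move(t));
    }
    std::unique_ptr<SimulatableTask> getTask() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (tasks_.empty()) return nullptr;
        std::unique_ptr<SimulatableTask> t = std::move(tasks_.front());
        tasks_.pop();
        return t;
    }
    std::atomic<unsigned int> termCounter{0};  // termination signals received

private:
    std::mutex mtx_;
    std::queue<std::unique_ptr<SimulatableTask>> tasks_;
};

// Each worker thread repeatedly pulls a task, simulates it, and either
// re-enqueues or discards it, until enough termination signals arrive.
void workerLoop(TaskQueue& q, unsigned int expectedSignals) {
    while (q.termCounter.load() < expectedSignals) {
        std::unique_ptr<SimulatableTask> task = q.getTask();
        if (!task) continue;                            // queue momentarily empty
        if (task->simulate() == SimulatableTask::ReturnStatus::Requeue)
            q.giveTask(std::move(task));                // the task will fire again later
        // otherwise the unique_ptr goes out of scope and the task is discarded
    }
}
```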

Figure 4.3: UML Class Diagram of Packet and Envelope (PICASim::Packet, with +payload: char[32]; PICASim::Envelope, with +payload: PICASim::Packet * and timeStamp: long long unsigned)

In the current PICASim framework, every node in a distributed system runs one instance of the TaskQueue, with all available thread units assigned to the queue. Preliminary results (Chapter 4.2.5) suggest this is viable for common multicore systems.

4.2.2 Packets and Envelopes

Conceptually, all messages between CHUs are transferred along Links as packets. And this is true in the implementation as well, with a few modifications. The PICASim::Packet represents a single unit of data. I currently implement it as a 32 byte block, as this allows for a variety of data to be encoded and for most "small" messages to fit in a single Packet.

However, not all messages are small enough to fit in a single Packet. This is where the PICASim::Envelope becomes useful. The Envelope acts similarly to the Chandy and Misra tuple [19] in that it contains data and a timestamp. To handle variable length messages, the payload is a pointer to an array of Packets. In a shared memory implementation, the memory is allocated, the message is written, and then the Envelope is sent. Figure 4.3 shows the UML diagram of the Packet and Envelope.

In the common case, the destination port of a Link is aware of the size of the message and no additional information is required. In the case of interconnections between CHUs where a variable amount of data will pass, this is resolved in one of two ways:

• The first is to create an agreement between both CHUs to transfer data in multiples of packets. Thus, every Envelope contains N packets, and all messages are rounded up to the nearest multiple of N. This is a very simple approach but has the issue of redundant communication and synchronization.

• The alternative, and preferred, scheme is to adopt the methodology used by protocols such as UDP [103] and encode the size of the message into the first packet. Thus, the destination CHU is able to determine the size of a message merely by reading the first packet (see the sketch below).
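The size-prefix scheme can be sketched as follows. The Packet layout matches Figure 4.3 (a fixed 32-byte payload), while encodeMessage() and decodeSize() are hypothetical helper names used only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Packet layout as in Figure 4.3: a fixed 32-byte unit of data.
struct Packet {
    char payload[32];
};

constexpr std::size_t kPayloadBytes = sizeof(Packet::payload);

// Encode a variable-length message so that the first Packet carries the
// message size, letting the receiving CHU work out how many Packets follow.
std::vector<Packet> encodeMessage(const char* data, std::uint64_t size) {
    std::vector<Packet> packets;

    Packet header = Packet();
    std::memcpy(header.payload, &size, sizeof(size));   // size in the first packet
    packets.push_back(header);

    for (std::uint64_t offset = 0; offset < size; offset += kPayloadBytes) {
        Packet p = Packet();
        std::size_t chunk =
            static_cast<std::size_t>(std::min<std::uint64_t>(kPayloadBytes, size - offset));
        std::memcpy(p.payload, data + offset, chunk);
        packets.push_back(p);
    }
    return packets;
}

// The destination CHU reads the size from the first packet and then consumes
// ceil(size / 32) further packets to reassemble the message.
std::uint64_t decodeSize(const Packet& first) {
    std::uint64_t size = 0;
    std::memcpy(&size, first.payload, sizeof(size));
    return size;
}
```

Either side of the Link only needs to agree on this convention; the Link itself still moves opaque 32-byte Packets.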

4.2.3 Implementation of the CHU

As previously mentioned, the CHU inherits from SimulatableTask, as seen in Figure 4.4.

Figure 4.4: UML Class Diagram of CHU (PICASim::CHU, a subclass of PICASim::SimulatableTask, with #terminationLink: PICASim::TermSignalLink *, #theID: unsigned int, #curTime: unsigned long long, a constructor CHU(numInLinks: int, numOutLinks: int, termLink: PICASim::TermSignalLink *, theID: unsigned int), the methods +addInLink(link: PICASim::LinkInterface *, index: int): bool, +addOutLink(link: PICASim::LinkInterface *, index: int): bool, +getInLink(index: int): PICASim::LinkInterface *, +getOutLink(index: int): PICASim::LinkInterface *, +getTerminator(): PICASim::LinkInterface *, and +simulate(): PICASim::SimulatableTask::ReturnStatus, plus the protected methods #assignLinkTypes(): void and #pullAndCheckLinks(): bool)

As can be seen, the CHU is still built around the simulate() method from the SimulatableTask. However, it also adds a requirement for the maintenance of termination and data links (described in Chapter 4.2.4). While primarily self-explanatory, there are two methods that warrant further discussion; both are tied to the problem of a CHU being executed before its firing rules are satisfied:

• assignLinkTypes(): This is tied to variable length messages and is used as a way to generate buffers for messages and resolve issues related to variable length messages. This method must allocate packet buffers for each link such that the message is able to be fully buffered.

• pullAndCheckLinks(): This method iterates over every input link, checks for data (if the buffer is not already full), and updates the known time of said Link if needed.

While the behaviour of the simulate() method is not specified, the suggested behaviour, and the behaviour used in all currently implemented CHUs, is shown by the algorithm in Figure 4.5.

Call pullAndCheckLinks() to collect data and update time on input Links as needed
if Firing Rules are Satisfied then
    Fire the CHU, consuming and generating messages as needed
else
    Update the CHU's curTime as needed
end if
for all Output Links do
    if an Output Message has been Generated then
        Generate and Output an Envelope
    else
        Output the Time with a Null Message
    end if
end for

Figure 4.5: Algorithm to Check and Fire a CHU
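For concreteness, the flow in Figure 4.5 is sketched below for a toy two-input adder CHU. Every name here is illustrative (the real CHU operates on Links, Packets, and Envelopes rather than on bare std::deque queues), and the lookahead handling of Chapter 3.2.2 is reduced to simply re-propagating the current time.

```cpp
#include <algorithm>
#include <deque>

// A timestamped value standing in for an Envelope in this sketch.
struct TimedValue {
    unsigned long long time;
    double value;
};

// A link reduced to a FIFO of real messages plus the last known time.
struct SimpleLink {
    std::deque<TimedValue> fifo;
    unsigned long long linkTime = 0;
};

class AdderCHU {
public:
    AdderCHU(SimpleLink& a, SimpleLink& b, SimpleLink& out)
        : inA_(a), inB_(b), out_(out) {}

    // One pass of the algorithm in Figure 4.5.
    void simulateOnce() {
        // Function-style firing rule: a message on every input link.
        if (!inA_.fifo.empty() && !inB_.fifo.empty()) {
            TimedValue a = inA_.fifo.front(); inA_.fifo.pop_front();
            TimedValue b = inB_.fifo.front(); inB_.fifo.pop_front();
            // Equation 3.1 with T_Behavior fixed at one time unit.
            curTime_ = std::max(std::max(a.time, b.time), curTime_) + 1;
            out_.fifo.push_back(TimedValue{curTime_, a.value + b.value});
        }
        // Whether or not the CHU fired, propagate its current time downstream;
        // when no message was produced this acts as the null message that keeps
        // the conservative scheme deadlock-free.
        out_.linkTime = curTime_;
    }

private:
    SimpleLink& inA_;
    SimpleLink& inB_;
    SimpleLink& out_;
    unsigned long long curTime_ = 0;
};
```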

4.2.3.1 Specialized CHUs

During the course of implementation, I also created a subclass of CHUs for the purpose of representing systems in a manner similar to Bryant's PCAs [13]. Specifically, I implemented the same simplified PCA Model as Dennis [33]. I refer to these as the PCA implementation CHUs, or "PiCHUs" for short. Figure 4.6 shows a simplified UML class diagram of these CHUs to demonstrate their relationship. The PCAComponent implements all methods specified in Figure 4.4 and specifies a function checkReady() that is used to determine if a CHU is able to fire.

Figure 4.6: UML Class Diagram of PiCHU (PICASim::PCAComponent inherits from PICASim::CHU and declares checkReady(): bool; PICASim::PCAFunctionComponent and PICASim::PCAArbiterComponent each implement checkReady(): bool)

PCAFunctionComponent and PCAArbiterComponent implement this function as per the requirements of a Function Component and Arbiter.

4.2.4 Implementation of the Link

Figure 4.7 shows the simplified UML class diagram of the LinkInterface. As can be seen, the Link provides a very simple interface for pushing and popping messages from the internal queue.

Figure 4.7: UML Class Diagram of Link Types (PICASim::LinkInterface, with #theID: unsigned int, a constructor LinkInterface(theID: unsigned int, initialTime: long long unsigned int), +pushPacket(pktPtr: PICASim::Packet *, timeStamp: long long unsigned int): void, and +pop(): PICASim::Envelope; PICASim::SeqLink and PICASim::TSLink are its two subclasses)

Specifically, I promote two major types of Links: the Sequential Link (SeqLink) and the Thread Safe Link (TSLink). The latter is required any time inter-CHU communication occurs, as such communication must be thread-safe. The former is for the case where a SimulatableTask may be comprised of multiple CHUs for the purpose of creating a larger task. In that case, thread safety is not a requirement and a much simpler internal queue may be used.

In both cases, the internal queue is an effectively boundless queue provided by either the C++ Standard Template Library [115] or the Boost.Lockfree library, depending upon whether thread safety is required. This has the benefit of requiring no confirmation that an Envelope has reached its destination and allows for CHUs to fire multiple times if sufficient input data is available.

In either case, the push() method can be used either to propagate data by pushing an Envelope worth of information or to propagate the current time of a CHU through the use of a Null Message, which is handled in a special manner. The algorithm for a push() is shown in Figure 4.8.

if pktPtr ≠ nullptr then
    Build an Envelope to encapsulate pktPtr and its timestamp
    Push said Envelope into the Link's internal FIFO
end if
Update the variable containing the time of the Link with the timestamp

Figure 4.8: Algorithm to Push message (pktPtr) into Link

The pop() algorithm is shown in Figure 4.9. Through this, null messages are able to be propagated through the system but are only acknowledged if no meaningful message is available.

if the Internal Queue is not empty then
    Pop and return an Envelope from the Queue
else
    Create an Envelope with nullptr for the payload and the Link's time for the Timestamp
    Return the Envelope
end if

Figure 4.9: Algorithm to Pop a Message from a Link

One issue is that there is the potential for a race condition. There is no guarantee that the time of the Link and the time of the most recent packet are the same, and it is possible that a CHU will check an input Link and receive an out of date timestamp because the actual payload has not yet propagated. However, in practice, this is a benign race condition, as one of two scenarios can occur:

• The pop() occurs before the internal queue has completely enqueued the Envelope with T_New. The destination CHU receives a null message with an outdated timestamp T_Old. By definition, T_Old ≤ T_New, so the CHU will only be able to proceed if the Envelope with timestamp T_New was unnecessary to begin with. A subsequent pop() will yield the Envelope with time T_New.

• The pop() occurs after the source CHU has pushed a null message and updated the time of the Link. This is not an issue as the timestamp on the Envelope is used.

Even in the event of a distributed host system, this is not an issue, for reasons that will be discussed later in the chapter (Chapter 4.3).
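Taken together, Figures 4.8 and 4.9 translate into a Link along the following lines. This sketch protects the FIFO with a mutex purely to stay self-contained; the real TSLink uses the Boost.Lockfree queue, which is exactly why the benign race described above can arise there.

```cpp
#include <mutex>
#include <queue>

// Packet and Envelope follow the shapes in Figure 4.3.
struct Packet { char payload[32]; };

struct Envelope {
    Packet* payload;                   // nullptr marks a null (time-only) message
    unsigned long long timeStamp;
};

class SketchLink {
public:
    explicit SketchLink(unsigned long long initialTime) : linkTime_(initialTime) {}

    // Figure 4.8: enqueue real packets, but always record the new time.
    void pushPacket(Packet* pktPtr, unsigned long long timeStamp) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (pktPtr != nullptr)
            fifo_.push(Envelope{pktPtr, timeStamp});
        linkTime_ = timeStamp;
    }

    // Figure 4.9: return a queued Envelope if there is one, otherwise a
    // null message carrying the Link's current time.
    Envelope pop() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (!fifo_.empty()) {
            Envelope e = fifo_.front();
            fifo_.pop();
            return e;
        }
        return Envelope{nullptr, linkTime_};
    }

private:
    std::mutex mtx_;
    std::queue<Envelope> fifo_;
    unsigned long long linkTime_;
};
```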

4.2.4.1 Termination Signal Links

While implemented separately from the LinkInterface for performance reasons, the Termination Signal Link operates in a similar manner and is also used for point-to-point communication between CHUs. Specifically, it is used to signal when a CHU reaches a pre-designated state that can be used to terminate the simulation as a whole. The UML class diagram can be seen in Figure 4.10. The Termination Signal Link behaves similarly to the Thread Safe Link, with the exception of not relying on timestamped Envelopes and instead containing a single monotonically increasing variable that is atomically incremented and read as needed.

Figure 4.10: UML Class Diagram of Termination Signal Link (PICASim::TermSignalLink, with #numSignals: std::atomic<unsigned int>, +TermSignalLink(), +signalTermination(): void, and +collectSignals(): unsigned int)
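Because the Termination Signal Link is essentially an atomic counter, it reduces to a few lines of C++11. The sketch below mirrors the names in Figure 4.10, but it is an illustration rather than the PICASim source.

```cpp
#include <atomic>

// Minimal sketch of the Termination Signal Link: a shared, monotonically
// increasing counter instead of a FIFO of timestamped Envelopes.
class TermSignalLinkSketch {
public:
    void signalTermination() { numSignals_.fetch_add(1); }   // called by a CHU
    unsigned int collectSignals() const { return numSignals_.load(); }

private:
    std::atomic<unsigned int> numSignals_{0};
};
```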

4.2.5 Preliminary Performance Data

To get an idea of the efficiency of the PICASim framework, I developed a simple microbenchmark designed to test the scalability of the PICASim framework on a single node of a cluster. Intra-node communication is the common case for PICASim and must demonstrate good performance.

The host system used to run the following benchmarks is a 64-bit Intel Core i7-920, running at 2.67 GHz, with 6 GB of RAM, running Ubuntu Linux with kernel version 3.0.0-16. This system was chosen because it is representative of a researcher's workstation or a single node of a low end commodity cluster. Because the Intel Core i7-920 employs hyperthreading [76] to spread eight logical processors across four physical processors, I use a dashed red line to indicate when hyperthreading is used. The PICASim framework and all benchmarks were compiled with g++ version 4.6.1 with -O3 as the optimization flag.

To do this, I created a system consisting of parallel accumulators with no data or control dependencies between CHUs. I then seeded the input link to each CHU with a number of messages to process to simulate a data stream. Due to the lack of control and data dependencies, the PICASim framework will process every single input message for a given CHU as a single task. Once a CHU has processed every input message, it will signal for termination. Once all CHUs have signalled termination, the simulation completes. With this, we were able to test the scalability of the system in two directions: task length and number of tasks. In the first test, I fixed the number of accumulators (CHUs) and varied the

Figure 4.13: Scalability of Tasking and Control Framework on a Single Node of a Commodity Cluster [99] (normalized execution time plotted against messages per task and number of tasks)

For all tests, the execution time is directly proportional to the length and number of tasks. For any tested task length, the experiments with 512 tasks take twice as long as those with 256 tasks, and four times as long as those with 128. Similarly, for any tested number of tasks, the experiments with 2^16 messages per task take twice as long as those with 2^15 messages per task, and half as long as those with 2^17 messages per task. This further demonstrates the scalability and low overhead of our system on a single node and shows that execution time is primarily a function of the amount of work and the number of tasks, and that any slowdown in the system will be a result of communication.

My initial experiments demonstrate close to linear speed-up for all four threads mapped to physical processors, and continuing speed-up even when experiencing resource contention due to hyperthreading. These results show that our one-queue-per-node implementation is a sufficient foundation for the distributed implementation, even in the context of fine grain tasks.

Figure 4.14: UML Class Diagram of Router Message (PICASim::RouterMessage, with +sourceID: unsigned int, +destID: unsigned int, and +payload: PICASim::Envelope)

4.3 Communication in a Distributed System

PICASim also supports distributed hosts. This is primarily handled through what the PICASim framework refers to as a "Router". The Router behaves in a manner similar to routers and switches in conventional networks and is used to allow CHUs on separate physical hosts to communicate over Links. Each node of the distributed system contains one Router, which is implemented as a PICASim::SimulatableTask and executed in the same Task Queue as the CHUs.

The Router itself communicates in terms of RouterMessages, which are described in Figure 4.14. Essentially, a RouterMessage is an Envelope with additional information regarding the source and destination IDs of the Links. These IDs are used so that an Envelope can be pop()'d out of a Link by a Router, sent to another Router on a remote host, and then push()'d into the appropriate Link so that the target CHU receives the message. Thus, the Router primarily consists of a large table of source and destination Links for any communication between CHUs on separate hosts.

Also, the use of Links and an intermediary buffer, the Router, means that any race conditions remain benign (Chapter 4.2.4). To improve performance, the Router aggregates messages wherever possible. For example, if four Links on Node A have destinations on Node B, the associated RouterMessages are grouped together and sent in a single non-blocking send.

However, it is worth noting that communication over a network can be problematic for such a fine-grained system in which a very large number of messages may be sent along links. The true challenge will instead be in the distribution of work across the system so as to minimize inter-process communication while still ensuring a sufficiently large workload. However, the issue of grouping and partitioning a graph is a well known and well researched subject with a wide variety of literature [35, 112]. And should this prove insufficient, the time tested technique of work stealing [9] can be employed. As such, our goal has primarily been to focus on functionality.
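The aggregation idea can be sketched with MPI as follows. The wire format, the helper name sendAggregated(), and the per-rank byte buffers are assumptions made for this illustration; the real Router must additionally manage request completion and the matching receives on the remote host.

```cpp
#include <mpi.h>
#include <map>
#include <vector>

// Hypothetical wire format: a header followed by the message's 32-byte
// Packets, repeated once per RouterMessage folded into the buffer.
struct RouterMessageHeader {
    unsigned int sourceID;            // ID of the source Link
    unsigned int destID;              // ID of the destination Link
    unsigned long long timeStamp;
    unsigned int numPackets;          // number of Packets that follow
};

// Send one aggregated buffer per destination rank with a single
// non-blocking send, regardless of how many RouterMessages it contains.
void sendAggregated(const std::map<int, std::vector<char>>& buffersByRank,
                    int tag, std::vector<MPI_Request>& requests) {
    for (const auto& entry : buffersByRank) {
        const int destRank = entry.first;
        const std::vector<char>& buffer = entry.second;
        MPI_Request req;
        MPI_Isend(const_cast<char*>(buffer.data()), static_cast<int>(buffer.size()),
                  MPI_BYTE, destRank, tag, MPI_COMM_WORLD, &req);
        requests.push_back(req);
    }
}
```

The caller is expected to keep each buffer alive until the corresponding request completes (for example, via MPI_Waitall).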

4.4 Using PICASim

A detailed description of the use of PICASim is beyond the scope of this dissertation, as it is very much implementation work. Instead, this has been provided elsewhere in the form of CAPSL Technical Note 24 [98].

Chapter 5

PICASIM TO SIMULATE NOVEL ARCHITECTURES

PICASim was originally conceived as a tool for simulating and studying novel computer architectures. While it has since been extended well beyond that role, it is still important to evaluate it in the context of simulating these architectures.

5.1 Modeling an Architecture: A Case Study

As a case study, I investigated how to represent a modern many-core architecture with the PICASim model. PICASim is designed to provide maximum flexibility, so it is up to the designer of the simulated system to make decisions regarding performance and accuracy. Because of the completely distributed synchronization and control structure, the designer can easily provide multiple implementations while maximizing reuse. In this section, we will consider a possible representation of the IBM Cyclops64 (C64) architecture [27].

As shown in Figure 5.1, a single C64 chip contains 80 processors, each with two Thread Units (TU), making for 160 independent hardware threads. Furthermore, each processor has two 30 KB SRAM memory banks divided into scratchpad memory (SP) and globally addressable interleaved shared memory, a shared FPU, and a port to the on-chip interconnect. Every 5 processors share an I-cache of 30 KB (not shown) and there is no data cache. A processing node consists of the previously described C64 chip and external off-chip memory (DRAM). A 96-port crossbar network with a bandwidth of 4 GB/s per port connects all TUs and on-chip memory banks [27].

First, consider the processor. Intuitively, one can treat an entire processor as a single CHU. This has the advantage of being able to leverage existing work to define the behaviour of the CHU.

Figure 5.1: Layout of a Single Cyclops-64 Node [99] (the chip's 80 processors, each with two Thread Units, SP/SRAM banks, and a shared FPU, connect via the crossbar and control networks to the DDR2 SDRAM controllers and off-chip memory, the A-Switch for the 3D mesh, and the host interface)

Potentially, the entire processor can be modeled using a traditional sequential simulator with hooks to communicate with the system. However, depending upon the purpose of the simulation, it may be beneficial to treat the TUs, FPU, and SPs as separate CHUs for the purpose of easily interchanging different implementations to explore the benefit of specific modifications to the processor and to allow for varying degrees of accuracy and precision.

Both TUs can be separately modelled so as to simplify the logic of the technique used to process instructions and generate signals while also obtaining more specific power and heat statistics for each thread unit on the chip. And further still, the potential for stalls due to a shared floating point unit can be modelled by taking advantage of a separate FPU CHU. While all of this can be modelled by a single CHU with sufficiently complex logic, by separating it we increase the potential throughput while greatly simplifying the implementation and allowing for new architectural features to be studied with minimal effort.

In keeping with the advantages of a modular design, PICASim allows the user

to swap out implementations of the CHUs themselves depending upon the desired behaviour or level of accuracy. By ensuring that each CHU implementation has the same inputs and outputs (i.e. Links to and from the memory and signal bus), we are able to interchange CHU implementations on a problem by problem basis.

Next, consider the memory. Again, intuition would suggest either having a single CHU for all shared memory or a CHU for every single bank of on-chip SRAM. The latter has the benefit of maximizing parallelism and more easily allowing for a study of a highly heterogeneous system, while the former will decrease overhead through coarser tasks. Similarly, an inspection of how DRAM is accessed indicates that all accesses pass through a DDR2 SDRAM Controller. Thus, one can treat the controller and the Off-Chip Memory as a single CHU, referred to as "DRAM", with ease.

Finally, consider the interconnect network. An inspection reveals that all intra-chip communication is handled by the Crossbar Network. The crossbar, described by Zhang et al. [127], is a simple yet effective structure that guarantees sequential consistency [68] between the TUs of a chip while also promoting fairness. In a PICASim implementation, a simplified version can be used such that sequential consistency is guaranteed, but fairness may not be. But if the goal of the simulation is to study congestion on the interconnect, a detailed, but slower, implementation may be needed. For the purpose of simulating a single node, the mechanisms that provide off-chip communication are ignored for this specific example.

Figure 5.2 presents a possible mapping of a single C64 node to the PICASim model. In this case, high performance is desired, so the entire processor is treated as a single CHU with the TUs and SRAM modelled internally, and each bank of DRAM is a separate CHU. In this case, the inputs and outputs to the Node and DRAM CHUs would be ports of the crossbar, with all communication, much as in the real C64 chip, handled through the interconnect itself.

Figure 5.2: Possible Mapping of a Single Node. Each box represents a potential CHU [99] (C64 Node CHUs containing the TUs and SRAM, DRAM Bank CHUs, and a 96-Port Crossbar Arbiter connecting them)

5.2 Performance and Power Modeling with Petri Nets

However, such detail is often not required when studying an architecture. An example of this was the use of PICASim as a tool to facilitate Garcia et al.'s performance modelling on the IBM Cyclops-64 [45]. One programming model used on the IBM Cyclops-64 is the use of codelets [42], which are collections of machine instructions that can be scheduled atomically as a unit of computation. They are more fine-grained than a traditional thread and they have proven themselves, under multiple execution models, to be capable of keeping compute cores on a chip occupied with useful work when assigned and scheduled at runtime [92, 70]. Because of the nature of the codelet model [42], we modelled the execution of applications with Timed Petri Nets.

5.2.1 Petri Nets

Petri Nets are a mathematical tool, proposed by Carl Adam Petri [101], that use directed, weighted, bipartite graphs to model and analyze parallel, concurrent, asynchronous, or stochastic systems [85]. A Petri Net graph is composed of two kinds of nodes: Places and Transitions. When modelling systems, places are used to represent conditions that must be met as well as available resources. A place contains a finite number of tokens that represent the available number of a given resource or if a specific state has been reached. In graphical representations, a place is represented as a circle, with tokens optionally represented either as dots within places or as a numeric count. Transitions represent the actual operations. A transition consumes a specified number of tokens from each input Place. In graphical representations, a Transition is represented as a rectangle. The dependencies between a Place and a Transition are specified via arcs, which are represented as edges between nodes with a weight indicating the number of tokens that are consumed by a given arc.

A given state of a Petri Net is referred to as a marking. The marking represents the number of tokens in each place. The operational semantics of a Petri Net is defined by three simple rules:

• A Transition is enabled if each input arc is connected to a Place with the required number of tokens

• Enabled transitions may or may not fire

• When fired, a transition consumes the number of tokens specified on each input arc and generates the number of tokens specified on each output arc.

For example, a place may be used to represent the available memory bandwidth while a second place may represent the state of the simulation itself. A transition representing a memory operation that occurs during a specific stage of the simulation would require at least one token from the memory bandwidth Place in addition to a token from the Place representing the appropriate state. With these simple rules, a wide range of systems are able to be modelled [85]. Additionally, we utilized Timed Petri Nets [107] for the purpose of modelling execution time. Timed Petri Nets are, put simply, Petri Nets with the addition of virtual time, which allows the amount of time taken by the modelled system to be captured.
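The semantics above fit in a few lines of C++. The structures below are illustrative only (Garcia's actual models, and the PICASim CHUs that execute them, are considerably richer), with the timed extension reduced to a fixed per-transition delay.

```cpp
#include <cstddef>
#include <vector>

// An arc connects a transition to a place and carries a token weight.
struct Arc {
    std::size_t place;        // index of the connected place
    unsigned int weight;      // tokens consumed or produced per firing
};

struct Transition {
    std::vector<Arc> inputs;
    std::vector<Arc> outputs;
    unsigned long long delay; // virtual time added when the transition fires
};

// A transition is enabled if every input place holds enough tokens.
bool enabled(const Transition& t, const std::vector<unsigned int>& marking) {
    for (const Arc& a : t.inputs)
        if (marking[a.place] < a.weight) return false;
    return true;
}

// Firing consumes input tokens, produces output tokens, and advances time.
void fire(const Transition& t, std::vector<unsigned int>& marking,
          unsigned long long& virtualTime) {
    for (const Arc& a : t.inputs)  marking[a.place] -= a.weight;
    for (const Arc& a : t.outputs) marking[a.place] += a.weight;
    virtualTime += t.delay;
}
```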

5.2.2 Implementation in PICASim

After implementing a scheme for simulating Petri Nets using PICASim, I used Garcia's models for Dense Matrix Multiplication (DGEMM) [46] and the LU Decomposition [45] to simulate the execution of those applications on the IBM Cyclops-64. Each algorithm was implemented as a Petri Net, with the different types of Codelets acting as operations. A simplified diagram of each can be seen in Figures 5.3 and 5.4. To obtain the duration of the Compute and Copy Codelets, we profiled the execution of these algorithms on the FAST simulator [26] and used mathematical models to determine the duration as a function of the problem size and other system parameters. For other actors, we used parameters of the architecture to compute latencies. For example, the dynamic scheduling was implemented with the in-memory atomic addition.

Figure 5.3: Petri Net Representation of DGEMM on IBM Cyclops-64 [45]

The latency of this operation is 3 cycles, and the number of memory banks is 8 for the Cyclops-64 architecture. Through this approach, we were able to rapidly determine the parameters of our simulation.

5.2.3 Verifying Model

First, we used Garcia's previous data, collected on the physical chip, to verify the accuracy of the model. To do this, we compared our results obtained with PICASim to those that had previously been obtained with the FAST simulator. First, we compared our results to those generated for a highly optimized version of Dense Matrix Multiplication that was specifically optimized for IBM Cyclops-64's on-chip memory. This is seen in Figure 5.5. Similarly, we compared the results for a Dense Matrix Multiplication that utilizes off-chip memory (DRAM) with our Petri Net model for the same, as seen in Figure 5.6.

In both cases, our Petri Net model matches the performance on the architecture to a very high degree. Of particular note is that our model of the specialized on-chip memory version matches the "jagged" line seen in the experimental results. This shows that our Petri Net model closely matches the behaviour of the algorithm, as opposed to just the overall trend. This behaviour is essential to properly study a system.

Figure 5.4: Petri Net Representation of the LU Decomposition on IBM Cyclops-64 [45] (stages for initializing and computing the diagonal block, the row and column blocks, and the inner blocks, with clean-up transitions between iterations)

Figure 5.5: Verification of Model for DGEMM Optimized for On-Chip Memory [45] (performance in GFLOPS versus matrix size m, comparing the measured results with the performance model using Petri Nets)

Figure 5.6: Verification of Model for DGEMM Optimized for Off-Chip Memory [45] (performance in GFLOPS versus the number of thread units, comparing the measured results with the performance model)

The average error for our matrix multiply simulation using on-chip memory is 2.5%, and 1.0% for off-chip memory.

5.2.4 Extrapolating Results

After verifying the accuracy of our model, we then began to study hypothetical systems. One of the key advantages of using a software simulator is that a wide range of systems can be studied. First, we extrapolated the DGEMM optimized for on-chip memory to an IBM Cyclops-64 with more SRAM, as seen in Figure 5.5. Similarly, we considered a larger chip for the purposes of studying the behaviour of the DGEMM optimized for off-chip memory, as seen in Figure 5.7. We consider systems with more threads as well as larger memory and memory with greater bandwidth. This allows us to study the limiting factors of the algorithm and potentially improve it to increase the lifetime of our solution.

5.2.5 Studying New Algorithms

However, all of the results presented thus far were for an application that had already been optimized for years. So we next considered the case of the LU Decomposition, an application that was still being studied to discover how best to optimize it for the IBM Cyclops-64. Using techniques similar to the DGEMM implementation, Garcia developed a codelet graph that we then converted into a Petri Net model and simulated using PICASim. The results can be seen in Figures 5.8 and 5.9. Because of the proven accuracy of the approach taken to model these systems, we can use these results to further optimize the LU Decomposition with a fair degree of confidence. For example, we can see that common lookahead techniques employed in the LU decomposition only provide benefits for larger matrix sizes and may not actually demonstrate any benefit on a single 160 thread Cyclops-64 chip. Thus, time can be better spent focusing on different optimization schemes.

Figure 5.7: Simulation of Off-Chip DGEMM on Modified C64 [45] (performance in GFLOPS versus thread units, comparing peak performance, the performance model using C64 features, the model with double the on-chip memory, and the model with double the memory bandwidth)

Figure 5.8: Simulation of LU Decomposition Optimized for On-Chip Memory of Varying Sizes [45] (speed-up versus matrix size for peak performance, the naïve LU, and the lookahead LU)


5.3 Validation via Intel Xeon Phi

To further evaluate PICASim, we developed a timing-accurate simulation of the Intel Xeon Phi Coprocessor 5110P. As the purpose of PICASim is the rapid development of simulations to study architectures and systems, we employed a simplified model for congestion on the Intel Xeon Phi, derived by analyzing data collected from microbenchmarks. Similarly, we took advantage of the MPI programming model's clearly defined synchronization and communication mechanics.

5.3.1 Intel Xeon Phi

As a platform, we chose the Intel Xeon Phi Coprocessor 5110P [58], which consists of 60 cores at 1.053 GHz. We chose to use the Intel Xeon Phi as it is a coprocessor used to accelerate scientific computing while still allowing for the use of conventional programming models, in this case MPI [79].

5.3.2 Modeling Intel Xeon Phi

Thus, we use a single CHU to represent each MPI process, in this case a core of the Intel Xeon Phi. Each CHU executes a trace corresponding to the application. The input and output messages of the CHU correspond to communication and synchronization in the MPI programming model, and a CHU is able to execute until a blocking operation is reached. We connect these CHUs via our interconnection model so that we may execute the traces and obtain the simulated execution time.
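The trace-driven behaviour can be sketched as follows; the record format and the replayUntilBlocking() helper are assumptions made purely for illustration and are not the format actually used in this study.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-process trace record: either local computation or a
// communication/synchronization event that the interconnect model handles.
struct TraceRecord {
    enum class Kind { Compute, Send, Recv, Barrier } kind;
    unsigned long long cycles;   // cost of a Compute record
    int peerRank;                // partner rank for Send/Recv
    std::size_t bytes;           // message size for Send/Recv
};

// Replays records until a communication or synchronization operation;
// returns the index at which the CHU must wait for the interconnect model.
std::size_t replayUntilBlocking(const std::vector<TraceRecord>& trace,
                                std::size_t start,
                                unsigned long long& localTime) {
    for (std::size_t i = start; i < trace.size(); ++i) {
        if (trace[i].kind == TraceRecord::Kind::Compute)
            localTime += trace[i].cycles;   // pure computation advances local time
        else
            return i;                       // hand off to the interconnect model
    }
    return trace.size();                    // trace exhausted
}
```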

5.3.3 Simulation Results

To test our simulation, we implemented a finite difference time domain (FDTD) solution of Maxwell's equations in one dimension [125], as FDTD is representative of many applications in scientific computing. This was implemented using MPI. We allowed for the number of processors and the problem size to be varied, and ran each test case ten times and averaged the results. We then compared the results predicted by PICASim with the data collected from running the benchmark on the actual machine.
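For reference, the core of a one-dimensional FDTD time step has the following leapfrog structure. This is a generic textbook-style sketch with normalized coefficients and a hypothetical hard source at the midpoint; it is not taken from the benchmark's actual source, which, being distributed with MPI, additionally has to communicate boundary values between ranks.

```cpp
#include <cstddef>
#include <vector>

// One normalized 1D FDTD step: update E from the spatial difference of H,
// inject a source, then update H from the spatial difference of E.
void fdtd1dStep(std::vector<double>& ez, std::vector<double>& hy, double sourceValue) {
    const std::size_t n = ez.size();
    if (n < 2) return;                        // nothing to update
    for (std::size_t i = 1; i < n; ++i)
        ez[i] += 0.5 * (hy[i - 1] - hy[i]);   // E update from the curl of H
    ez[n / 2] += sourceValue;                 // simple hard source at the midpoint
    for (std::size_t i = 0; i + 1 < n; ++i)
        hy[i] += 0.5 * (ez[i] - ez[i + 1]);   // H update from the curl of E
}
```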


Figure 5.11: Error of Simulated Results with respect to Measured Results

Taken together, these results show that PICASim can be used to obtain accurate and precise results for existing systems running code indicative of a class of scientific applications (stencils). This, in turn, demonstrates PICASim's viability as a tool to study new architectures and systems.

Chapter 6

PICASIM TO DEVELOP ARCHITECTURES AND PROGRAM EXECUTION MODELS

Thus far I have only discussed studying existing architectures and models through the use of PICASim. Such endeavours are effective for the purpose of validating a methodology, but they do not necessarily teach us much; in most cases, it is actually easier to run code on the chip itself. In this chapter, I will investigate the use of PICASim for its intended purpose: studying and developing novel architectures and program execution models (PXMs). To this end, we study the Fresh Breeze memory model, program execution model, and architecture as proposed and developed by Dennis [28].

6.1 Dennis’s Fresh Breeze The Fresh Breeze model is based on a write-once memory and is intended to allow for massively parallel computation by avoiding the coherence issues associated with shared memory in multi- and many-core systems. The Fresh Breeze model is described in greater detail in previous work by Dennis [28, 30, 31]. For the purposes of this dissertation, I will only explain a high level overview of the system as it pertains to modeling it with PICASim.

6.1.1 The Fresh Breeze Program Execution Model

The Fresh Breeze model has been developed to be a massively parallel system architecture built around a program execution model designed to exploit the natural parallelism inherent in many applications [74]. This model is built around asynchronous tasks with explicit data dependencies and a focus on functional programming and write-once memory, to resolve many of the issues facing the development of program execution models for the coming era [74]. The asynchronous tasks are referred to as "codelets" and are based on research along those lines [29, 117], and the memory model is based around a tree of write-once "chunks", with each "chunk" consisting of sixteen 64-bit values.

6.1.2 The Fresh Breeze Architecture

Dennis's work on the Fresh Breeze architecture utilizes a bottom-up approach in which a novel and highly scalable architecture is built from a single processor and expanded upon [74]. The single processor system is referred to as "System One" and consists of four components:

• The Processor Core is a single processor with multiple execution slots. A codelet is assigned to each slot, and the active slot is determined based upon available data dependencies. Multiple slots are used to avoid stalls during high latency operations.

• The Core Scheduler is a hardware unit that assigns codelets to execution slots.

• The AutoBuffer is a small, low latency memory bank that stores recently accessed chunks. This is the Fresh Breeze model's alternative to a cache.

• Finally, the Memory Unit is the larger, higher latency, memory bank that stores all chunks.

The interconnection of these components can be seen in Figure 6.1. The next evolution of the Fresh Breeze Architecture is the multiprocessor version, referred to as "System Two" and shown in Figure 6.2. In addition to multiple instances of the PC and AB, this adds two components:

• The Load Balancer is a hardware unit that assigns codelets to Core Schedulers

• The InterConnect: A hardware interconnect to allow communication between AutoBuffers and the Memory Unit.

System Three (not pictured), in turn, is used to model a full system with a collection of FB chips.


6.2 Fresh Breeze: System One X

Currently, Fresh Breeze System One and Two have been implemented to varying degrees and studied using a proprietary tool [74]. As such, the PICASim-based research has focused more on researching modifications to the Fresh Breeze architecture and on more closely studying its needs. New features and design choices can be studied and evaluated outside of the critical path and can be integrated at a later date. To this end, I have developed and implemented the "System One X" testbed in the PICASim framework. As the specific latencies of the Fresh Breeze architecture are still unknown, SysOneX has instead been built around providing a functionally accurate, instruction set architecture compatible implementation of the Fresh Breeze Architecture while still allowing for performance trends to be studied.

6.2.1 Goals of System One X

The purpose of System One X is to evaluate modifications to the architecture and to provide a framework in which different latencies can be used to determine where research needs to be focused. We have specifically targeted Fresh Breeze System One to begin with, but by ensuring that applications written for Fresh Breeze System One and Two are compatible with System One X, we will be able to expand upon the design as needed. Once System One has been studied sufficiently, the modifications and testbed will be extended to a testbed of System Two, referred to as System Two X.

6.2.2 Implementation in PICASim

The PICASim framework is well suited to this variety of research. Its modular nature allows us to switch in and out various implementations of each component. Not only does this allow for research into different architectural features, such as a more cache-like AutoBuffer, but it also ensures that any modifications to the architecture that we study will be comparatively easy to integrate into the ongoing research into the Fresh Breeze architecture.

To that end, the general layout of System One X is identical to that in Figure 6.1. Therefore, we implement the PC, AB, CS, and MU as individual CHUs. However, to perform these experiments we have taken advantage of the modular nature of PICASim to employ a modified Processor Core. Whereas in Fresh Breeze System One the PC is a single execution unit with multiple slots for codelets, we instead re-envision it as a single processor with multiple hardware threads, or Execution Units, as depicted in Figure 6.3. The interface of the PC remains unchanged, but internally each Execution Unit executes its assigned codelet once that codelet's input dependencies are satisfied.

6.2.3 Preliminary Timing Data

The true challenge of this implementation has been determining reasonable numbers for timing purposes. As the Fresh Breeze architecture is still largely theoretical, no measured timing data exists. Similarly, much of the Fresh Breeze architecture depends heavily on hardware implementations of operations that are traditionally resolved at the software level. We therefore chose an existing and heavily researched architecture for our initial estimates of latencies and timing data: the IBM Cyclops 64 (C64) [27]. This architecture was chosen because, as one of the early manycore architectures, it represents a similar level of complexity to a single Fresh Breeze chip while also being a platform on which many of the essential operations have already been implemented and timed [91]. And, as an older architecture, it should provide a reasonable estimate upon which research can be directed.

To that end, latencies have been selected based on this research. Queue-related operations in the Core Scheduler use the average execution time of the nonblocking concurrent queue put forward by Michael and Scott [80], as implemented on the C64 [91]. Memory accesses in the AutoBuffer are based on access times for SRAM on the C64, whereas memory accesses at the Memory Unit level are based on access times for DRAM on the C64. Finally, all other operations (e.g. ALU and FPU operations) are based upon their cost, in cycles, on the C64.


With these numbers, we are given a starting point for our experiments and research.
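The mapping from C64 measurements to simulation parameters can be thought of as a small configuration table. The sketch below is only an illustration of that parameterization: the key names are mine and the cycle counts are placeholders, not the values actually taken from the C64 studies [80, 91] and used in these experiments.

# Illustrative parameter table in the spirit of the C64-derived latencies
# described above. All cycle counts below are placeholders.
C64_DERIVED_LATENCIES = {
    "core_scheduler_enqueue": 500,  # average MS-Queue operation cost on the C64 (placeholder)
    "core_scheduler_dequeue": 500,
    "autobuffer_access": 2,         # on-chip SRAM access time on the C64 (placeholder)
    "memory_unit_access": 36,       # off-chip DRAM access time on the C64 (placeholder)
    "alu_operation": 1,             # per-instruction cost, in cycles, on the C64 (placeholder)
    "fpu_operation": 1,
}

def cost(operation):
    """Look up the simulated cost, in cycles, of a primitive operation."""
    return C64_DERIVED_LATENCIES[operation]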

6.2.4 Benchmark Applications

Two benchmarks were chosen to evaluate the benefits of our modifications: a scalar dot product and a one-dimensional steady-state heat distribution using the Jacobi method [55, 122]. The former is a heavily studied algorithm on the Fresh Breeze architecture [74] and the latter represents an application with memory access and synchronization patterns that are similar to more advanced scientific applications. Specifically, our scalar dot product is between two vectors with 256 elements each so as to take advantage of the Fresh Breeze memory model. The heat distribution algorithm also operates on a 256-element vector, but runs for a variable number of iterations.
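For reference, the two kernels reduce to a few lines each. The sketch below is a plain functional restatement in Python, assuming fixed boundary values for the heat distribution; it is not the codelet decomposition that actually runs on System One X.

def dot_product(a, b):
    # Scalar dot product of two equal-length vectors (256 elements in our runs).
    return sum(x * y for x, y in zip(a, b))

def jacobi_1d(u, iterations):
    # One-dimensional steady-state heat distribution via Jacobi sweeps.
    # The boundary values u[0] and u[-1] are held fixed; every interior point
    # is repeatedly replaced by the average of its two neighbours.
    u = list(u)
    for _ in range(iterations):
        new_u = u[:]
        for i in range(1, len(u) - 1):
            new_u[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new_u
    return u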

6.2.5 Architectural Modifications

We took advantage of the PICASim framework's highly modular nature to examine multiple implementations of two key components of Fresh Breeze System One: the AutoBuffer and the Processor Core.

In the Fresh Breeze architecture, the Processor Core (PC) consists of multiple execution slots, with each slot corresponding to a codelet. When its dependencies are satisfied, the codelet is executed. The AutoBuffer also consists of multiple slots, with each slot corresponding to an execution slot on the PC. Each AutoBuffer slot is independent of the other slots, meaning that each slot must pay the cost of a Memory Unit (MU) lookup for a given data chunk.

As part of System One X, we first consider allowing a PC to execute multiple execution slots simultaneously. Additionally, because of the write-once nature of the Fresh Breeze memory model, we also remove the concept of AutoBuffer slots and instead have a pool that is shared between execution slots. The advantage of the


write-once scheme is that the coherency and synchronization issues associated with multiple threads accessing the same buffer are greatly reduced.

With these modifications, we ran each benchmark for a varying number of execution units, both with and without AutoBuffer slots enabled.
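The reasoning behind the shared pool can be illustrated with a toy model. The sketch below is illustrative only and is not the PICASim implementation: each organisation is reduced to a dictionary keyed by chunk identifier, and the MemoryUnit class is a hypothetical stand-in for a Memory Unit request. Because chunks are write-once, a hit in the shared pool can never return stale data, which is why the shared organisation needs no per-slot copies or invalidations.

# Toy model of the two AutoBuffer organisations studied; illustrative only.
class MemoryUnit:
    """Stub backing store; read() stands in for a high-latency chunk fetch."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, chunk_id):
        return self.chunks[chunk_id]

class SlottedAutoBuffer:
    """One private chunk store per execution slot, so each slot pays the
    Memory Unit lookup cost even if another slot already fetched the chunk."""
    def __init__(self, num_slots):
        self.slots = [dict() for _ in range(num_slots)]

    def lookup(self, slot, chunk_id, memory_unit):
        store = self.slots[slot]
        if chunk_id not in store:
            store[chunk_id] = memory_unit.read(chunk_id)  # per-slot miss
        return store[chunk_id]

class SharedAutoBuffer:
    """A single pool shared by all slots. Chunks are write-once, so a chunk
    fetched on behalf of one slot can safely be reused by every other slot."""
    def __init__(self):
        self.pool = dict()

    def lookup(self, slot, chunk_id, memory_unit):
        if chunk_id not in self.pool:
            self.pool[chunk_id] = memory_unit.read(chunk_id)  # one miss, shared by all
        return self.pool[chunk_id]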

6.2.5.1 Scalar Dot Product

For the Scalar Dot Product, shown in Figure 6.4, we measured speed-up relative to the simulated execution time of the single-execution-unit run with AutoBuffer slots enabled. As expected, having more execution units increases the throughput of the system and thus lowers execution time. Similarly, removing the concept of separate AutoBuffer slots for each execution unit decreases execution time by allowing us to take advantage of temporal locality between codelets. However, it is when both are combined that we are able to draw more meaningful conclusions. While the overhead of the task-based framework is costly in both cases, the lack of AutoBuffer slots results in performance increasing almost linearly up to six execution units, whereas the version with AutoBuffer slots shows diminishing returns as soon as four execution units are employed.

6.2.5.2 1D Heat Distribution

For the 1D Heat Distribution, shown in Figure 6.5, we measured speed-up relative to the simulated execution time of the single-execution-unit run, with AutoBuffer slots enabled, for each number of iterations. We see similar results for each test case regardless of whether AutoBuffer slots are enabled. This is logical, as there is more work per codelet as well as more synchronization, both of which better mask the cost of memory accesses. However, with AutoBuffer slots disabled, speed-up continues to increase out to eight execution units as opposed to five.


6.2.6 Performance Evaluation

From these experiments, we are able to see that, for these applications, removing the concept of AutoBuffer slots allows for greater speed-up. We also see that performance seems limited less by memory and more by synchronization. As such, it is important that we explore the benefits of a more efficient Core Scheduler. As previously mentioned, we based the efficiency of our Core Scheduler on previous implementations of the MS-Queue [80] on the IBM Cyclops 64 [91]. It is important to note that this was a software implementation, that more advanced queues with similar properties have been developed over the years, and that it was merely meant to act as a baseline to start from.

For our experiments, we used the same benchmarks described in Section 6.2.4 with AutoBuffer slots disabled. We then varied the latency of the underlying queue operations in the Core Scheduler. The results can be seen in Figures 6.6 and 6.7, with speed-up reported relative to the simulated single-execution-unit results using the baseline queue performance.

6.2.6.1 Scalar Dot Product

Figure 6.6 shows the results for the Scalar Dot Product. Because of the simplicity of the benchmark and the low execution time of a given codelet, we see minimal differences. A more efficient queue results in a shorter execution time, but the pattern remains the same.

6.2.6.2 1D Heat Distribution

Figure 6.7 provides the more interesting results by examining the 1D Heat Distribution problem. At the baseline queue latency, we see contention becoming a problem at nine execution units, but we also see the increase in performance drop noticeably as early as four. By reducing the queue latency to just 0.75x, we are able to see continued performance increases out to nine execution units. Additionally, the 0.25x and 0.10x queue

latency cases have very similar performance, which suggests that Core Scheduler efficiency is no longer the bottleneck at this point. From these results, we are able to determine that, even at the single-chip level, queue latency is a very important problem. However, we are also able to see that, for these very small applications, comparatively modest reductions in queue latency suffice. And, by comparing the two benchmarks, it is confirmed that simply increasing the computational weight of a given codelet will resolve many of these problems.

Chapter 7

A LANGUAGE AND COMPILER FOR AUTOMATED DISTRIBUTION OF SIMULATION

In the course of developing PICASim and modelling systems with it, it became painfully obvious that additional tools would be needed to improve its ease of use. To this end, I developed the Language for Automated Distribution of Simulation, or LADS for short. Much as with PICASim, LADS is largely a legacy name, as the tool was found to have far greater potential than was initially anticipated.

7.1 Rationale Behind The LADS

As the name suggests, The LADS was developed as a way to simplify the layout of logical systems in PICASim. Specifically, it was designed with the distribution of CHUs between nodes of a commodity cluster in mind. The goal was to develop a language that would aid in the layout of CHUs and the establishment of Links between the CHUs. Initially, VHDL [94] was considered as a basis. However, during preliminary research it was determined that it would be better to have the LADS language focus on expressing dependencies as opposed to semantic behaviour. This allows the LADS language to be used with more than just PICASim and to instead be a general language for expressing dependencies between components in a graph. As such, the dot language [40] was chosen as a basis instead.

7.2 The LADS Grammar

The LADS language is, at its base, an extension of a subset of the dot language [40]. The dot language was chosen as a basis because its ability to efficiently specify graphs is proven and it is a fairly simple language. The current Backus-Naur Form of the LADS grammar is shown in Figure 7.1.

system      ::= system graph | graph
graph       ::= graph label { declist edgelist }
declist     ::= declist declaration | declaration
edgelist    ::= edgelist edge | edge
declaration ::= node label [ nodetype , number , number ] ;
nodetype    ::= module label , ( proplist ) | module label | subgraph label , ( proplist )
edge        ::= edge label = arc weight ;
arc         ::= port -> port
port        ::= portlabel [ number ]
portlabel   ::= label _IN | label _OUT | label
weight      ::= ( number )
proplist    ::= property , proplist | property | λ
property    ::= label = number | label
label       ::= [a-zA-Z0-9]+
number      ::= [0-9]+

Figure 7.1: Backus-Naur Form Grammar of LADS

As can be seen, the LADS language differs from dot in three key aspects:

First, the LADS language forces all nodes to be declared prior to use in edges. This was done because such behaviour is common to languages such as C and C++. Furthermore, the LADS requires the information present in declarations for the purpose of recursively generating subgraphs and performing code generation at a later stage. Additionally, every node has an optional property list that can be used for the purpose of optimization and code generation.

Second, the LADS language employs a different methodology for specifying ports in an edge. Again, this is done to ensure that all pertinent information is explicitly provided. All edges must provide information regarding the specific port numbers, the weight of the edge, and a variable name for the given edge. The weight is used for the distribution

graph GLOBAL
{
    node alu0 [ subgraph adder, (), 2, 1 ];
    node alu1 [ subgraph adder, (), 2, 1 ];
    node alu2 [ subgraph adder, (flat), 2, 1 ];

    edge varA = GLOBAL_IN[0] -> alu0[0] (2);
    edge varB = GLOBAL_IN[1] -> alu0[1] (5);
    edge varC = GLOBAL_IN[2] -> alu1[0] (4);
    edge varD = GLOBAL_IN[3] -> alu1[1] (7);

    edge varE = alu0[0] -> alu2[0] (2);
    edge varF = alu1[0] -> alu2[1] (2);

    edge varG = alu2[0] -> GLOBAL_OUT[0] (1);
}

graph adder
{
    node alu [ module wire, 2, 1 ];

    edge varA = adder_IN[0] -> alu[0] (1);
    edge varB = adder_IN[1] -> alu[1] (1);
    edge varC = alu[0] -> adder_OUT[0] (1);
}

Figure 7.2: Simple Graph Modelled in LADS: LADS Source

and is described in Chapter 7.3. The port numbers are used to match up ports on the CHU to edges in the graph.

Third, the LADS language provides more explicit support for subgraphs and recursion. This is best shown in Figure 7.2, which is a very simple graph that demonstrates many of the features of LADS. The LADS language requires that all systems have a top-level graph named "GLOBAL". The graph listed in Figure 7.2 initially expands to three CHUs, as seen in Figure 7.3, with circles representing nodes and boxes representing special ports. However, each instance of an "alu" is actually a subgraph of type "adder". This use of subgraphs simplifies the placement of repeated structures. For example, the

work at the process level may not be feasible. But determining what to distribute where is not an insubstantial task. To this end, the LADS was developed.

By representing the system graph as a weighted graph, we are able to employ simple graph-cutting techniques, such as normalized cuts [112], to efficiently divide the graph so as to minimize inter-process communication while ensuring each process has a sufficient amount of work. This is where the flattening and the weights shown in Figure 7.2 are employed. The weights represent the "weight" of communication along a Link, and subgraphs that are not flattened are treated as a single unit of work for the purpose of distribution. This can be employed to ensure that CHUs corresponding to a single thread unit are not spread out among processes, or to group infrequently firing CHUs as a single task to avoid starvation. Once the optimizations have been applied (see Chapter 7.5.3), the rest of the graph is flattened and code is generated and written to source files.
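As a concrete illustration of this distribution step, the sketch below bisects a small weighted graph with NetworkX. The Kernighan-Lin heuristic is used here only because it is readily available in the library; it stands in for the normalized-cut formulation [112], and the graph is a hand-built toy loosely based on Figure 7.2 rather than actual LADSpiler output.

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Toy weighted system graph; edge weights stand in for the communication
# "weight" of each Link.
g = nx.Graph()
g.add_edge("GLOBAL_IN", "alu0", weight=7)   # 2 + 5 in the example of Figure 7.2
g.add_edge("GLOBAL_IN", "alu1", weight=11)  # 4 + 7 in the example of Figure 7.2
g.add_edge("alu0", "alu2", weight=2)
g.add_edge("alu1", "alu2", weight=2)
g.add_edge("alu2", "GLOBAL_OUT", weight=1)

# Bisect the graph so that heavily communicating CHUs stay in the same process.
part_a, part_b = kernighan_lin_bisection(g, weight="weight")
print(sorted(part_a), sorted(part_b))

Un-flattened subgraphs would appear as single nodes in such a graph, so a cut can never split them across processes, which is exactly the grouping behaviour described above.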

7.4 Additional Applications of the LADS

However, this is not the only use of the LADS. Many highly anticipated programming models employ codelets, which can be expressed as a graph [92, 70, 42, 29]. Most of these models help to alleviate the high cost of data movement by using containers for codelets with data dependencies. Using the LADS, those data dependencies can be used in place of the weights, and similar techniques can be employed to optimally allocate codelets to containers.

7.5 The LADSpiler

In this section, I will explain how the LADS is implemented as a compiler, referred to as the LADSpiler.

7.5.1 Lexical Analysis and Syntactic Analysis

The first step for any compiler is the transformation of the input, in this case source code written in the LADS language, into an intermediate representation that

can be manipulated. This is commonly achieved by tokenizing the input via lexical analysis and then using syntactic analysis to parse those results. To accomplish this, I used PLY [6]. PLY, or Python Lex-Yacc, is a Python implementation of the traditional lex [73] and yacc [61] tools. For the purpose of lexical analysis, PLY, like lex, uses user-provided regular expressions to convert an input stream into a sequence of tokens which are then passed on to the parser. PLY then implements an LR parser in a manner similar to yacc. Through the use of a grammar, specifically the one shown in Figure 7.1, the sequence of tokens is given semantic meaning and an intermediate representation, or IR, is generated. After much research, I chose PLY for this stage for three key reasons:

1. Python's file I/O and string manipulation libraries are very flexible, which allowed me to preprocess the input stream as needed with minimal difficulty.

2. Python is strongly typed but also employs dynamic typing. This vastly simplifies the generation of the IR and further eases the implementation of later steps.

3. Finally, the use of PLY results in a much simpler methodology for building and debugging. PLY does not require external tools to be used and, by default, generates a log of all shifts and reduces performed by the LR parser.

PLY provides all of the above features while still maintaining an interface similar to lex and yacc so as to allow later work to use more conventional tools when the focus shifts to an optimization of the compiler as opposed to the system being modelled.
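To illustrate how little glue code PLY requires, the sketch below defines a lexer and a parser for a single production of the grammar in Figure 7.1: an edge declaration. It is a deliberately cut-down example, so the token set, regular expressions, and parser actions are simplified stand-ins rather than the LADSpiler's actual rules.

import ply.lex as lex
import ply.yacc as yacc

# --- Lexer for a tiny fragment of LADS: edge declarations only ---
tokens = ("EDGE", "LABEL", "NUMBER", "EQUALS", "ARROW",
          "LBRACKET", "RBRACKET", "LPAREN", "RPAREN", "SEMI")

t_EQUALS = r"="
t_ARROW = r"->"
t_LBRACKET = r"\["
t_RBRACKET = r"\]"
t_LPAREN = r"\("
t_RPAREN = r"\)"
t_SEMI = r";"
t_ignore = " \t\n"

def t_NUMBER(t):
    r"[0-9]+"
    t.value = int(t.value)
    return t

def t_LABEL(t):
    r"[a-zA-Z_][a-zA-Z0-9_]*"
    if t.value == "edge":       # remap the keyword to its own token type
        t.type = "EDGE"
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

# --- Parser: edge LABEL = port -> port ( NUMBER ) ; ---
def p_edge(p):
    "edge : EDGE LABEL EQUALS port ARROW port LPAREN NUMBER RPAREN SEMI"
    p[0] = ("edge", p[2], p[4], p[6], p[8])

def p_port(p):
    "port : LABEL LBRACKET NUMBER RBRACKET"
    p[0] = (p[1], p[3])

def p_error(p):
    raise SyntaxError("parse error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()

print(parser.parse("edge varE = alu0[0] -> alu2[0] (2);", lexer=lexer))

The parse produces a small tuple describing the edge, which is the kind of structure the LADSpiler then inserts into the graph-based IR described next.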

7.5.2 A Graph Based IR

The LADSpiler utilizes a graph-based IR so as to better match the PICASim graph and to potentially match codelet-based programming languages. Graph-based IRs are not a new concept and have been studied in the past [37]. In particular, they are a perfect fit for the LADSpiler as it is specifically designed with graph-based languages in mind, allowing for a 1:1 matching of nodes/vertices of the input graph to nodes/vertices of the IR. This has additional benefits during optimization and code generation, as will be discussed.

Specifically, I use the NetworkX library [49] for the graph IR. I chose NetworkX due to its support for directed graphs with weighted edges and its flexibility in terms of what can be stored in both edges and nodes of the graph. A node in the IR consists of the following:

• Label: This is the label of a given node and is primarily used for code generation purposes as well as to provide a unique identifier for every node.

• Behaviour: This is used to contain either the module label or the subgraph label. The former is stored until code generation, whereas the latter is used during expansion. There is also a third possibility: a unique string that is used internally to indicate a node that has been generated during expansion.

• Subgraph: A boolean value used to indicate if this is a subgraph.

• Property List: Either a boolean or integer value used to indicate a property of a node. Currently only "Flatten" is used, which indicates if a subgraph should be flattened before or after optimization, but other examples include the computational or energy cost of a given operation.

• Number of Inputs: An integer value used to indicate the number of inputs. This is stored until code generation.

• Number of Outputs: An integer value used to, similarly, indicate the number of outputs. This is stored until code generation.

Similarly, an edge between two nodes consists of the following:

• Label: A list of labels for all variables represented by a given edge. This is primarily used for code generation purposes.

• Weight: The sum of the weights of all connections between the nodes.

• Remote: A boolean value used to indicate if this is a remote link that will be processed through a router. This is optionally set during optimization, as discussed in Chapter 7.5.3.

• Ports: A list of tuples containing the index of the output port of the source node and the input port of the destination node. This is used so that no information is lost when edges are merged together.

Because of the internal structure used by NetworkX, adding additional properties is a trivial matter, and the amount of data that can be stored is limited solely by the memory of the host system.
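To make the structure of the IR concrete, the sketch below builds two nodes and one edge carrying the attributes listed above. The attribute names are spelled out for readability and need not match the LADSpiler's internal naming; the values are taken loosely from the example in Figure 7.2.

import networkx as nx

ir = nx.DiGraph()

ir.add_node("alu0",
            label="alu0",
            behaviour="adder",     # subgraph label, used during expansion
            subgraph=True,
            properties={},         # e.g. {"Flatten": True} to flatten before optimization
            num_inputs=2,
            num_outputs=1)

ir.add_node("alu2_alu",
            label="alu2_alu",
            behaviour="wire",      # module label, kept until code generation
            subgraph=False,
            properties={},
            num_inputs=2,
            num_outputs=1)

ir.add_edge("alu0", "alu2_alu",
            labels=["varE"],       # all variables represented by this edge
            weight=2,              # sum of the weights of the merged connections
            remote=False,          # set during optimization if routed between processes
            ports=[(0, 0)])        # (source output port, destination input port)

print(ir.nodes["alu0"]["behaviour"], ir.edges["alu0", "alu2_alu"]["weight"])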

to compiling C and C++ code. However, using the LADSpiler for other codelet-based languages may very well benefit from outputting to assembly or bytecode.

Also, the LADSpiler is fully capable of generating debug output. In fact, Figures 7.3, 7.4, and 7.5 are all tweaked versions of the debug output of the LADSpiler at varying stages of the compilation of the code in Figure 7.2. While of limited use with large systems, this allows for visual inspection to be used as a tool in the debugging process. In this simple example it is actually the same graph as Figure 7.3, but note that each node has a unique label that is created based on its label in the code and its subgraph. "alu2" was expanded into a subgraph containing a node called "alu" and is now known as "alu2_alu".

Once the final graph is obtained, it is simply a process of iterating through the list of nodes and the list of edges. Each node corresponds to a CHU, and the behaviour and number of inputs/outputs are used as parameters to the constructor. Each edge corresponds to a Link. A simplified version of the output can be seen in Figures 7.6 and 7.7. A more visual version of the output can be seen in Figures 7.8 and 7.9, which show the debug output using the dot language with all variables listed. The resulting graph can then be seen in Figure 7.10.

As can be seen, even a subset of this very simple, almost trivial, example results in many calls and many points where a programmer can make a mistake and connect the wrong Link to the wrong CHU. By using the LADSpiler, the focus can be put on implementing the CHUs themselves and modelling the interconnections of the system without needing to worry about problems that will arise from a faulty connection.
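A simplified, hypothetical version of that generation pass is sketched below, assuming an IR with the attributes shown in the previous sketch. The real emitter also generates the special ports, unique identifiers, and distribution-related code; its actual (simplified) output appears in Figures 7.6 and 7.7.

def generate_picasim_source(ir):
    # Hypothetical emitter: one CHU per node, one Link per variable on each
    # edge, then one addOutLink/addInLink call per (output port, input port) pair.
    lines = []
    for name, attrs in ir.nodes(data=True):
        lines.append("PICASim::CHU *%s = new %s(%d, %d, termLink, id_%s);"
                     % (name, attrs["behaviour"], attrs["num_inputs"],
                        attrs["num_outputs"], name))
    for src, dst, attrs in ir.edges(data=True):
        for var, (out_port, in_port) in zip(attrs["labels"], attrs["ports"]):
            lines.append("PICASim::LinkInterface *%s = new PICASim::TSLink(id_%s, 0);"
                         % (var, var))
            lines.append("%s->addOutLink(%s, %d);" % (src, var, out_port))
            lines.append("%s->addInLink(%s, %d);" % (dst, var, in_port))
    return "\n".join(lines)

Calling generate_picasim_source(ir) on the IR from the previous sketch returns declarations and connection calls of the same general shape as those shown in Figures 7.6 and 7.7.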

int main(int argc, char ** argv)
{
    ///Code to initialize PICASim Framework

    //Set up CHUs
    PICASim::CHU *alu0 = new adder(2, 1, termLink, id0);
    PICASim::CHU *alu1 = new adder(2, 1, termLink, id1);
    PICASim::CHU *alu2_adder_in = new SPECIAL(2, 2, termLink, id2);
    PICASim::CHU *alu2_alu = new wire(2, 1, termLink, id3);

    //Set up Links
    PICASim::LinkInterface *varA = new PICASim::TSLink(id4, 0);
    PICASim::LinkInterface *varB = new PICASim::TSLink(id5, 0);
    PICASim::LinkInterface *varC = new PICASim::TSLink(id6, 0);
    PICASim::LinkInterface *varD = new PICASim::TSLink(id7, 0);
    PICASim::LinkInterface *varE = new PICASim::TSLink(id8, 0);
    PICASim::LinkInterface *varF = new PICASim::TSLink(id9, 0);
    PICASim::LinkInterface *alu2_varA = new PICASim::TSLink(id10, 0);
    PICASim::LinkInterface *alu2_varB = new PICASim::TSLink(id11, 0);
    PICASim::LinkInterface *alu2_varC = new PICASim::TSLink(id12, 0);

Figure 7.6: Subset of Simplified Output of LADSpiler for Example: Part 1

    //Connect Links to CHUs
    alu0->addInLink(varA, 0);
    alu0->addInLink(varB, 1);
    alu0->addOutLink(varC, 0);
    alu1->addInLink(varC, 0);
    alu1->addInLink(varD, 1);
    alu1->addOutLink(varF, 0);
    alu2_adder_in->addInLink(varE, 0);
    alu2_adder_in->addInLink(varF, 1);
    alu2_adder_in->addOutLink(alu2_varA, 0);
    alu2_adder_in->addOutLink(alu2_varB, 1);
    alu2_alu->addInLink(alu2_varA, 0);
    alu2_alu->addInLink(alu2_varB, 1);
    alu2_alu->addOutLink(alu2_varC, 0);

    ///Code to simulate generated system and collect data
    return 0;
}

Figure 7.7: Subset of Simplified Output of LADSpiler for Example: Part 2

digraph G {
    varA [shape=box, style=filled, color=violet];
    varB [shape=box, style=filled, color=violet];
    varC [shape=box, style=filled, color=violet];
    varD [shape=box, style=filled, color=violet];
    varE [shape=box, style=filled, color=violet];
    varF [shape=box, style=filled, color=violet];
    alu2_varA [shape=box, style=filled, color=violet];
    alu2_varB [shape=box, style=filled, color=violet];
    alu2_varC [shape=box, style=filled, color=violet];

Figure 7.8: Expanded Output of LADSpiler in dot language: Part 1

    GLOBAL_in [label="call", shape=diamond, style=filled, color=lightblue];
    GLOBAL_in->varA;
    GLOBAL_in->varB;
    GLOBAL_in->varC;
    GLOBAL_in->varD;
    alu0 [label="adder", shape=oval, style=filled, color=forestgreen];
    varA->alu0;
    varB->alu0;
    alu0->varE;
    alu1 [label="adder", shape=oval, style=filled, color=forestgreen];
    varC->alu1;
    varD->alu1;
    alu1->varF;
    alu2_adder_in [label="call", shape=diamond, style=filled, color=lightblue];
    varE->alu2_adder_in;
    varF->alu2_adder_in;
    alu2_adder_in->alu2_varA;
    alu2_adder_in->alu2_varB;
    alu2_alu [label="wire", shape=oval, style=filled, color=forestgreen];
    alu2_varA->alu2_alu;
    alu2_varB->alu2_alu;
    alu2_alu->alu2_varC;
    GLOBAL_out [label="call", shape=diamond, style=filled, color=lightblue];
    alu2_varC->GLOBAL_out;
}

Figure 7.9: Expanded Output of LADSpiler in dot language: Part 2


Chapter 8

RELATED WORKS

As the research and development of program execution models and architectures for the next generation of high performance computing is of great interest, the development of tools for the modelling and study of these systems has been researched heavily. In this chapter, I will discuss several modern simulation tools with goals similar to PICASim's and compare them to it.

8.1 Dennis's Framework

Professor Jack Dennis of the Massachusetts Institute of Technology, in conjunction with the University of Delaware, has developed a simulation framework designed to facilitate the comparative evaluation of alternative program execution models [33, 74]. Dennis utilizes a simplified version of Bryant's PCA model in which there are only two components: Function and Merge. However, Dennis also modifies the firing rules as follows:

• A Function Component requires a packet on each input port but is not required to output a packet on each output port.

• The Merge component behaves identically to Bryant’s original model.

Through this, Dennis is able to drastically simplify the graphs of the system while maintaining many of the guarantees provided by Bryant's PCA model. Dennis further increases the usability of his model through two concepts: Modules and Ensembles. A Module is a collection of components and links that can be

quickly placed and reproduced. Similarly, an Ensemble acts as a vector of components, or modules, that can be used to more easily identify and interact with repeated structures.

PICASim and Dennis's framework were designed with similar goals in mind, and both take inspiration from the work of Bryant [13]. However, Dennis's approach is much more faithful to the Bryant model, whereas PICASim sacrifices some of the benefits of the Bryant model in the name of performance and flexibility. The biggest difference, though, is that the Dennis Framework currently supports only sequential execution, whereas PICASim is designed for multithreaded execution from the ground up.

8.2 μπ

The Micro Parallel Performance Investigation system, μπ [100], is designed to simulate MPI-based programs running on large-scale distributed-memory architectures. μπ simulates the behaviour of the MPI library in order to efficiently execute unmodified MPI programs with one or more simulated cores for every physical core, resulting in a highly efficient simulator that allows for the research and optimization of existing MPI programs. PICASim sacrifices the performance gains of targeting a single programming model in the interest of flexibility and of studying program execution models not based on MPI.

8.3 Graphite

MIT's Graphite [81] is a highly efficient distributed simulator that employs novel optimistic synchronization techniques to provide a powerful tool for research into shared-memory architectures. To do this, Graphite provides the simulated threaded applications with a consistent operating system interface in addition to a threading interface. However, a lack of cycle accuracy and of support for distributed-memory systems limits its potential for the research of new architectures in high performance computing.

PICASim, in contrast, emphasizes accuracy over performance.

8.4 SST: The Structural Simulation Toolkit

SST [57, 50] is designed for the co-design of extreme-scale architectures through simulation. For small-scale research, SST Micro is used. SST Micro, much like PICASim, is designed around modelling a computer architecture in a highly modular manner. Once the architecture has been studied at a small scale, SST Macro is employed. SST Macro consists of a tightly coupled collection of modifiable components, each representing and simulating a hardware system or resource. SST has thus far proven itself to be highly scalable and efficient and has been widely used. SST Macro has a focus on simulating MPI programs and, as such, benefits the study of how existing software and applications will perform on the next generation of computer systems. Furthermore, SST's focus on high performance allows for said computer systems to be simulated at a very large scale.

PICASim largely differs from SST in how this goal is achieved. SST is primarily focused on providing a highly scalable framework to study these architectures at large scales. Thus, the focus is primarily on acting as a framework for the interoperability of existing simulation tools as "node components" [50]. This is essential for large-scale simulations. However, the benefit of this approach diminishes as architectures become more and more unconventional. As such, while both PICASim and SST work toward many of the same goals, they are best suited to different stages in the development of a computer architecture. PICASim's homogeneous interface for the simulation of any given component of the system makes it best suited to early development and the exploration of the design space, whereas SST's scalability and use of powerful tools make it essential for larger-scale research of a more stable architecture.

8.5 COTSon

COTSon [3] uses a pluggable architecture based on modules and existing simulators that allows the simulation of the entire system, thus providing a platform to research the entire system stack while utilizing existing tools wherever possible. PICASim is designed with the development of new tools, to study revolutionary models and architectures, in mind.

8.6 COREMU

COREMU [120] is a parallel simulator that uses multiple instances of QEMU [7] and an additional layer of libraries to achieve parallel simulation. This approach uses highly efficient sequential simulators to simulate physical processors and additional libraries to ensure synchronization and to simulate shared resources. While highly efficient, QEMU obtains its best performance when simulating systems similar to its host, and the use of multiple instances of a simulator greatly coarsens the granularity of the system and limits parallelism in the execution of the simulation. Again, PICASim is designed to be used to study systems for which highly efficient sequential simulators do not yet exist.

8.7 Flow Stream Processing System Simulator

Park et al.'s Flow [95] executes StreamIt [119] programs to analyze stream processing applications. As per StreamIt, the system is modelled as a directed graph where arcs represent streams. Flow performs simulation by executing the specified stream processing applications with consumption rates and stochastic delays modelled on the behaviour of the target architecture. Flow simulates the execution of specific classes of applications on architectures, which allows for an emphasis on acyclic graphs and an avoidance of asynchronous merge operators to improve performance. Finally, Flow eschews functional and cycle accuracy for high performance. PICASim is designed for accuracy and the modelling of cyclic graphs.

8.8 SystemC

Over the past decade, SystemC [48] has been developed as a tool to represent event-driven systems in C++. SystemC is designed for event-driven simulation at all levels. SystemC is beneficial for lower-level simulation of architectures, whereas PICASim is designed to study the interface of the program execution model and the architecture.

8.9 ROSS

Rensselaer's Optimistic Simulation System (ROSS) [17] is an extremely modular kernel designed around the Time Warp [60] concept. It is highly efficient and built around coupling an efficient pointer-based framework [23] with the reversal of causality errors through reverse computation. This approach has the potential for high performance. However, the need to implement reverse computation for all major operations increases the complexity of modelling the system.

Chapter 9

CONCLUSIONS AND CLOSING THOUGHTS

The world is racing toward the exascale era, and this race covers all facets of computer science and computer engineering. To compete, simulation and modelling tools must be developed and refined so as to not limit the imagination and daring of computer scientists and engineers. In this, the final chapter of my thesis, I will summarize my conclusions, describe how my work may be continued, and then provide my closing thoughts at the end of this journey.

9.1 Conclusions

Due to the unprecedented scale and complexity of exascale systems, simulation and modelling tools are of the utmost importance in achieving the goals of the exascale era [102, 120, 75, 57, 108]. By using software for evaluation, the cost in terms of time and resources associated with hardware can be avoided to a large degree. For example, if the benefits of an architectural feature are found to be lacking in a software simulation, there is no need to design and fabricate a hardware implementation. However, many existing tools sacrifice accuracy in the name of performance [81, 56] so as to allow for high-level evaluation of systems. Others rely heavily on additional, equally complex, tools to model the system, which restricts the study of novel and revolutionary systems to groups with the resources to develop high performance simulation and modelling tools from the ground up. The goal of my dissertation has been to develop tools that place the emphasis on accuracy so as to allow for an exploration of the design space of these architectures and program execution models. These needs have driven my work over the past five years.

In this dissertation, I have presented the PICASim model and framework (Chapters 3 and 4) as tools for the development and evaluation of program execution models and architectures. I have demonstrated the accuracy and effectiveness of the PICASim model on real architectures and systems when using novel programming models (Chapter 5). Furthermore, by using PICASim I have been able to contribute to the study and development of radically new program execution models and architectures (Chapter 6) in a manner that allows research to be directed toward the areas of greatest importance. Additionally, I have presented The LADS and The LADSpiler (Chapters 7 and 7.5), a language and a tool, respectively, to increase the usability of PICASim and to allow graph theory to be used to optimize simulations and codelet-based languages. I have demonstrated how the compiler is able to perform source-to-source translation (Chapter 7.5) as well as demonstrated its benefits with respect to optimization, with subgraph flattening as the current example.

My dissertation has provided the following contributions to the field of electrical and computer engineering:

1. PICASim: A model and framework for the development and study of novel ar- chitectures and program execution models.

2. Results demonstrating PICASim’s ability to model said architectures and pro- gram execution models with a high degree of accuracy and flexibility.

3. Results demonstrating PICASim's use in the study and development of novel program execution models and architectures.

4. The LADS: A language used to express system graphs, including but not limited to PICASim graphs.

5. The LADSpiler: A source-to-source compiler that can convert systems expressed in LADS to code ready to be simulated using PICASim.

9.2 Continuation of Work

As previously discussed, the field of high performance and scientific computing is an ever-changing one, and as each problem is solved many more become

known. To that end, I will briefly describe how the work in my dissertation can be continued and applied to new projects and applications.

9.2.1 PICASim Model and Framework

While the PICASim model and framework are complete, there is room for improvement with respect to three key points: performance, fairness, and re-usable libraries of CHUs.

From a performance standpoint, there is much room for improvement. While PICASim will never, and was never intended to, compete with tools such as SST, it can still be made faster. Already, we have gained great performance increases through the Boost.Lockfree library [2], and more efficient lock-free queues can further increase the performance of the FIFOs. Additionally, better ways to hook into existing tools have the potential to increase performance for common architectural features such as interconnects.

From a fairness standpoint, a standardized way to resolve merges is essential. Currently, if a CHU performing an asynchronous merge sees two Messages with the same timestamp, the serialization of the Messages is very arbitrary. A more standardized approach will allow PICASim to better model non-deterministic systems.

Finally, additional re-usable libraries of CHUs are a must. Currently we have the PCA-inspired CHUs, or the PiCHUs, which are valuable for modelling and verifying against more faithful PCA-based simulations. A library of CHUs to model common interconnect technologies and generic processors would further increase the usability of PICASim and allow for easier adoption.
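One possible direction for the fairness point above, sketched purely as an illustration and not as something PICASim currently implements, is to break timestamp ties with a stable secondary key such as the identifier of the Link a Message arrived on. The Message fields used below are hypothetical.

from collections import namedtuple

Message = namedtuple("Message", ["timestamp", "link_id", "payload"])

def merge_order(messages):
    # Serialize merged Messages deterministically: order by timestamp first,
    # and break ties with the (stable) identifier of the incoming Link.
    return sorted(messages, key=lambda m: (m.timestamp, m.link_id))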

9.2.2 LADS and LADSpiler

The LADS and LADSpiler are at a much earlier stage of research, and there is even more room for improvement. However, the fundamental LADS language is highly extensible through the use of property lists. The focus of work must instead be on the development of a standardized list of properties that can, in turn, be used to apply

standardized optimizations to a wide variety of input graphs. Once this is complete, the LADSpiler can be better integrated into more languages and systems.

9.3 Closing Thoughts

Over the course of my time in this research group, I have worked on a wide range of tasks. In this final section, I will briefly describe this work and how it has influenced my dissertation.

Some of my earliest work involved the development of new tiling schemes under the direct supervision of Daniel Orozco. Through this work, I learned the value of using real scientific applications as benchmarks. Doing so made my work more credible and the results of greater interest to the scientific community.

Later work, as published at MTAAP 2010 [124], involved the porting of a program execution model from a single host machine to a distributed platform. This work provided me with a greater understanding of the difference between a program execution model and a runtime, and of the need for both.

I then took a detour and worked on improving the performance of the Fast Fourier Transform on the IBM Cyclops 64. This work taught me the value of a cycle-accurate simulator, as our results were unpublishable due to the performance and behaviour of the simulator vastly differing from those of the actual chip.

I then spent the next few years working closely with Daniel Orozco and Elkin Garcia on a wide range of applications involving TIDEFlow and linear algebra on the IBM Cyclops 64. This work taught me how to be a graduate student and a researcher, and the importance of constantly working toward a goal with a set of experiments in mind.

I then had the privilege of working with Professor Jack Dennis, of MIT, on the simulation of his Fresh Breeze architecture. While my contributions were, ultimately, unused, I was given access to an architecture with a radical design that forms the basis of a significant portion of my contributions.

During this time, I was also fortunate enough to have an internship at Los Alamos National Laboratory under Allen McPherson, Tim Germann, Christoph Junghans, and Ben Bergen as part of a Co-Design Summer School. This internship allowed me to work directly with domain experts and to gain an even deeper understanding of how to evaluate and study a program execution model from the perspective of those who will actually use it.

Through all of this, I have matured as a competent researcher, and I look forward to continuing to identify and solve the problems facing scientific and high performance computing.

BIBLIOGRAPHY

[1] Riemann Solvers and Numerical Methods for Fluid Dynamics - A Practical Introduction.

[2] Boost.Lockfree Library. September 2013. http://www.boost.org/doc/libs/1_54_0/doc/html/lockfree.html.

[3] Eduardo Argollo, Ayose Falcn, Paolo Faraboschi, Matteo Monchiero, and Daniel Ortega. COTSon: Infrastructure for Full System Simulation. SIGOPS Oper. Syst. Rev., 43(1):52–61, January 2009.

[4] William L. Bain and David S. Scott. An algorithm for time synchronization in distributed discrete event simulation. In SCS Multiconference on Distributed Simulation, pages 30–33, 1988.

[5] Robert Bedichek. SimNow: Fast platform simulation purely in software. In Hot Chips, volume 16, 2004.

[6] Shannon Behrens. Prototyping Interpreters Using Python Lex-Yacc. Dr. Dobb's Journal: Software Tools for the Professional Programmer, pages 30–35, 2004.

[7] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46, 2005.

[8] Marsha J Berger and Joseph Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, March 1984.

[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. J. ACM, 46(5):720–748, September 1999.

[10] J. Bonér, V. Klang, R. Kuhn, and others. Akka library. http://akka.io.

[11] R. Brightwell, B.W. Barrett, K.S. Hemmert, and K.D. Underwood. Challenges for High-Performance Networking for Exascale Computing. In 2010 Proceedings of 19th International Conference on Computer Communications and Networks (ICCCN), pages 1–6, August 2010.

[12] J. Dean Brock and William B. Ackerman. Scenarios: A model of non-determinate computation. In J. Díaz and I. Ramos, editors, Formalization of Programming Concepts, number 107 in Lecture Notes in Computer Science, pages 252–259. Springer Berlin Heidelberg, 1981.

[13] R. E. Bryant. Simulation of Packet Communication Architecture Computer Systems. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA, 1977.

[14] Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and others. Concurrent collections. Scientific Programming, 18(3):203–217, 2010.

[15] Wentong Cai and Stephen J. Turner. An Algorithm for Distributed Discrete-event Simulation: The "Carrier Null Message" Approach. University of Exeter, Department of Computer Science, 1989.

[16] William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Center for Computing Sciences, Institute for Defense Analyses, 1999.

[17] Christopher D. Carothers, David Bauer, and Shawn Pearce. ROSS: A high-performance, low-memory, modular Time Warp system. Journal of Parallel and Distributed Computing, 62(11):1648–1669, November 2002.

[18] Nicholas P. Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua Fryman, Ivan Ganev, Roger A. Golliver, Rob Knauerhase, Richard Lethin, Benoit Meister, Asit K. Mishra, Wilfred R. Pinfold, Justin Teller, Josep Torrellas, Nicolas Vasilache, Ganesh Venkatesh, and Jianping Xu. Runnemede: An architecture for Ubiquitous High-Performance Computing. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), volume 0, pages 198–209, Los Alamitos, CA, USA, 2013. IEEE Computer Society.

[19] K.M. Chandy and J. Misra. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Transactions on Software Engineering, SE-5(5):440–452, September 1979.

[20] Jianwei Chen, Murali Annavaram, and Michel Dubois. SlackSim: A Platform for Parallel Simulations of CMPs on CMPs. SIGARCH Comput. Archit. News, 37(2):20–29, July 2009.

[21] OJ Dahl, B Myhrhaug, and K Nygaard. SIMULA 67 Common Base Language. Technical report, 1967.

[22] DARPA-BAA-10-37. UHPC: Ubiquitous High Performance Computing. DARPA, Arlington VA, USA, 2010.

[23] S. Das, R. Fujimoto, K. Panesar, D. Allison, and M. Hybinette. GTW: a time warp system for shared memory multiprocessors. In Simulation Conference Proceedings, 1994. Winter, pages 1332–1339, December 1994.

[24] DE-FOA-0000R619-X-Stack. Department of Energy. DOE, Washington, DC, USA, 2011.

[25] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, January 2008.

[26] Juan Del Cuvillo, Weirong Zhu, Ziang Hu, and Guang R. Gao. FAST: A functionally accurate simulation toolset for the Cyclops64 cellular architecture. In Workshop on Modeling, Benchmarking and Simulation (MoBS05) of ISCA, volume 5, 2005.

[27] Monty Denneau. Cyclops. In David Padua, editor, Encyclopedia of Parallel Computing: SpringerReference (www.springerreference.com). Springer-Verlag Berlin Heidelberg, 2011.

[28] Jack B. Dennis. Fresh Breeze: A Multiprocessor Chip Architecture Guided by Modular Programming Principles. SIGARCH Comput. Archit. News, 31(1):7–15, March 2003.

[29] Jack B. Dennis. Compiling Fresh Breeze Codelets. In Proceedings of Programming Models and Applications on Multicores and Manycores, PMAM'14, pages 51:51–51:60, New York, NY, USA, 2014. ACM.

[30] Jack B. Dennis. The fresh breeze project: A multi-core chip supporting composable parallel programming. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–5. IEEE, 2008.

[31] Jack B. Dennis, Guang R. Gao, and Xiao X. Meng. Experiments with the Fresh Breeze tree-based memory model. Computer Science - Research and Development, 26(3-4):325–337, June 2011.

[32] Jack B. Dennis, Guang R. Gao, Chengmo Yang, Xiaoming Li, Robert Pavel, Aaron Landwehr, Daniel Orozco, and Kelly Livingston. A Fresh Foundation for Software/Hardware Co-Design of Exascale Computing Systems. Technical report, CAPSL Technical Memo 112, 2012.

[33] Jack B. Dennis, Robert Pavel, and Guang R. Gao. Comparative Evaluation of Alternative Program Execution Models. Technical report, CAPSL Technical Memo 109, 2011.

[34] James Dinan, Sriram Krishnamoorthy, D. Brian Larkins, Jarek Nieplocha, and Ponnuswamy Sadayappan. Scioto: A framework for global-view task parallelism. In Parallel Processing, 2008. ICPP'08. 37th International Conference on, pages 586–593. IEEE, 2008.

[35] C.H.Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and H.D. Simon. A min-max cut algorithm for graph partitioning and data clustering. In ICDM 2001, Proceedings IEEE International Conference on Data Mining, 2001, pages 107–114, 2001.

[36] Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad van der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. The International Exascale Software Project roadmap. International Journal of High Performance Computing Applications, 25(1):3–60, February 2011.

[37] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987.

[38] M. Frigo and S.G. Johnson. FFTW: an adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, volume 3, pages 1381–1384 vol.3, May 1998.

[39] Anat Gafni. Rollback mechanisms for optimistic distributed simulation systems. In SCS Multiconference on Distributed Simulation, pages 61–67, 1988.

[40] Emden R. Gansner and Stephen C. North. An open graph visualization system and its applications to software engineering. Software Practice and Experience, 30(11):1203–1233, 2000.

[41] Gao, Yates, Dennis, and Mullin. A strict monolithic array constructor. In Parallel and Distributed Processing, IEEE Symposium on, volume 0, pages 596–603, Los Alamitos, CA, USA, 1990. IEEE Computer Society.

[42] Guang R. Gao, Joshua Suetterlein, and Stephane Zuckerman. Toward an execution model for extreme-scale systems-runnemede and beyond. Technical report, CAPSL Technical Memo 104, 2011.

[43] E. Garcia, D. Orozco, R. Khan, IE. Venetisz, K. Livingston, and G.R. Gao. A dynamic schema to increase performance in many-core architectures through percolation operations. In 2013 20th International Conference on High Performance Computing (HiPC), pages 276–285, December 2013.

[44] Elkin Garcia, Daniel Orozco, and Guang R. Gao. Energy efficient tiling on a Many-Core Architecture. In Proceedings of 4th Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG 2011); 6th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), January 2011.

[45] Elkin Garcia, Robert Pavel, Daniel Orozco, and Guang R. Gao. Performance Modeling of Fine Grain Task Execution Models with Resource Constraints on Many-core Architectures. Technical report, CAPSL Technical Memo 118, 2012.

[46] Elkin Garcia, Ioannis E. Venetis, Rishi Khan, and Guang R. Gao. Optimized Dense Matrix Multiplication on a Many-Core Architecture. In Pasqua D'Ambra, Mario Rosario Guarracino, and Domenico Talia, editors, Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part II, volume 6272 of Lecture Notes in Computer Science, pages 316–327. Springer, 2010.

[47] D. Goodman, S. Khan, C. Seaton, Y. Guskov, B. Khan, M. Lujan, and I. Watson. DFScala: High Level Dataflow Support for Scala. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2012, pages 18–26, September 2012.

[48] Thorsten Grotker, Stan Liao, Grant Martin, and Stuart Swan. System Design with SystemC. Springer Publishing Company, Incorporated, 1st edition, 2010.

[49] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL), 2008.

[50] Simon Hammond, K. Scott Hemmert, Suzanne Kelly, Arun Rodrigues, Sudhakar Yalmanchili, and Jun Wang. Towards a standard architectural simulation framework. In Workshop on Modeling & Simulation of Exascale Systems & Applications, September 2013.

[51] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI, volume 11, pages 22–22, 2011.

[52] C. A. R. Hoare. Communicating Sequential Processes. Commun. ACM, 21(8):666–677, August 1978.

[53] Herbert HJ Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren. A study of the EARTH-MANNA multithreaded system. International Journal of Parallel Programming, 24(4):319–348, 1996.

[54] ISO. ISO/IEC 14882:2011 Information technology Programming languages C++. International Organization for Standardization, Geneva, Switzerland, February 2012.

[55] C. G. J. Jacobi. Uber eine neue auflosungsart der bei der methode der kleinsten quadrate workommenden linearen gleichungen. Astronomische Nachrichten, 22:297–303, 1845.

[56] A Jacquet, V. Janot, C. Leung, G.R. Gao, R. Govindarajan, and T.L. Sterling. An executable analytical performance evaluation approach for early performance prediction. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 8 pp.–, April 2003.

[57] Curtis L. Janssen, Helgi Adalsteinsson, Scott Cranford, Joseph P. Kenny, Ali Pinar, David A. Evensky, and Jackson Mayo. A Simulator for Large-Scale Parallel Computer Architectures:. International Journal of Distributed Systems and Technologies, 1(2):57–73, 2010.

[58] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High-Performance Programming. Newnes, February 2013.

[59] David Jefferson and Peter Reiher. Supercritical speedup. In Simulation Symposium, 1991., Proceedings of the 24th Annual, pages 159–168. IEEE, 1991.

[60] David R. Jefferson. Virtual Time. ACM Trans. Program. Lang. Syst., 7(3):404–425, July 1985.

[61] Stephen C. Johnson. Yacc: Yet another compiler-compiler, volume 32. Bell Laboratories Murray Hill, NJ, 1975.

[62] H. Kaiser, M. Brodowicz, and T. Sterling. ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications. In International Conference on Parallel Processing Workshops, 2009. ICPPW ’09, pages 394–401, September 2009.

[63] Laxmikant V Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++, volume 28. ACM, 1993.

[64] I. Karlin, A. Bhatele, J. Keasler, B.L. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, D. Richards, M. Schulz, and C.H. Still. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application. In 2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS), pages 919–932, May 2013.

[65] Leonard Kleinrock. Queueing Systems, Volume 1: Theory. Wiley-Interscience, 1975.

[66] Kathleen Knobe. Ease of use with concurrent collections (CnC). Hot Topics in Parallelism, 2009.

[67] Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, W. Carson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, and others. Exascale computing study: Technology challenges in achieving exascale systems. 2008.

[68] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, C-28(9):690–691, September 1979.

[69] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, July 1978.

[70] Christopher Lauderdale and Rishi Khan. Towards a Codelet-based Runtime for Exascale Computing: Position Paper. In Proceedings of the 2Nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, EXADAPT ’12, pages 21–26, New York, NY, USA, 2012. ACM.

[71] Peter Lax and Burton Wendroff. Systems of conservation laws. Communications on Pure and Applied Mathematics, 13(2):217–237, May 1960.

[72] Edward D. Lazowska, John Zahorjan, G. Scott Graham, and Kenneth C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1984.

[73] Michael E. Lesk and Eric Schmidt. Lex: A lexical analyzer generator. Bell Laboratories, Murray Hill, NJ, 1975.

[74] Xiaoming Li, Jack Dennis, Guang R. Gao, Willie Lim, Haitao Wei, Chao Yang, and Robert Pavel. FreshBreeze: A Data Flow Approach for Meeting DDDAS Challenges. In Dynamic Data Driven Applications Systems and LargeScaleBigData & LargeScaleBigComputing (DDDAS/InfoSymbioticSystems 2015); International Conference on (ICCS), 2015. Manuscript submitted for publication.

[75] M. Lis, Pengju Ren, Myong Hyon Cho, Keun Sup Shim, C.W. Fletcher, O. Khan, and S Devadas. Scalable, accurate multicore simulation in the 1000-core era. In 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 175–185, April 2011.

[76] Deborah T Marr, Frank Binns, David L Hill, Glenn Hinton, David A Koufaty, J Alan Miller, and Michael Upton. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal, 6(1), 2002.

[77] I Mathieson and R. Francis. A dynamic-trace-driven simulator for evaluating parallelism. In Architecture Track, Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences, 1988. Vol.I, volume 1, pages 158–166, 1988.

[78] Michael M. McKerns, Leif Strand, Tim Sullivan, Alta Fang, and Michael A. G. Aivazis. Building a Framework for Predictive Science. CoRR, abs/1202.1056, 2012.

[79] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 2.2. Specification, High Performance Computing Center Stuttgart (HLRS), September 2009.

[80] Maged M Michael and Michael L Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, pages 267–275. ACM, 1996.

[81] J.E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A Agarwal. Graphite: A distributed parallel simulator for multicores. In 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, January 2010.

[82] Jayadev Misra. Distributed Discrete-event Simulation. ACM Comput. Surv., 18(1):39–65, March 1986.

[83] J Mohd-Yusof and N Sakharnykh. Optimizing CoMD: A Molecular Dynamics Proxy Application Study. In GPU Technology Conference (GTC), 2014.

[84] Frank Mueller and others. A Library Implementation of POSIX Threads under UNIX. In USENIX Winter, pages 29–42, 1993.

[85] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541–580, April 1989.

[86] David M. Nicol. Parallel discrete-event simulation of FCFS stochastic queueing networks, volume 23. ACM, 1988.

[87] David M. Nicol. The Cost of Conservative Synchronization in Parallel Discrete Event Simulations. J. ACM, 40(2):304–333, April 1993.

[88] Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10(2):169–189, June 1996.

[89] Bill Nitzberg and Virginia Lo. Distributed shared memory: A survey of issues and algorithms. Distributed Shared Memory-Concepts and Systems, pages 42–50, 1991.

[90] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. An overview of the Scala programming language. Technical report, EPFL, 2004.

[91] Daniel Orozco, Elkin Garcia, Rishi Khan, Kelly Livingston, and Guang R. Gao. Toward High-throughput Algorithms on Many-core Architectures. ACM Trans. Archit. Code Optim., 8(4):49:1–49:21, January 2012.

[92] Daniel Orozco, Elkin Garcia, Robert Pavel, Rishi Khan, and Guang Gao. TIDeFlow: The Time Iterated Dependency Flow execution model. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2011 First Workshop on, pages 1–9. IEEE, 2011.

[93] Daniel A. Orozco. TIDeFlow: A Dataflow-Inspired Execution Model for High Performance Computing Programs. PhD thesis, University of Delaware, 2012.

[94] IEEE Design Automation Standards Committee and others. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), pages c1–626, January 2009.

[95] Alfred J. Park, Cheng-Hong Li, Ravi Nair, Nobuyuki Ohba, Uzi Shvadron, Ayal Zaks, and Eugen Schenfeld. Towards flexible exascale stream processing system simulation. Simulation, 2011.

[96] Robert Pavel, Robert Bird, Pascal Grosset, Ken Czuprynski, Andrew Reisner, Erin Carrier, Christoph Junghans, Benjamin Bergen, and Allen L. McPherson. Adaptive Mesh Refinement under the Concurrent Collections Programming Model. In The Sixth Annual Concurrent Collections Workshop (CnC-2014), 2014.

[97] Robert Pavel, Sergio Pino, Jaime Arteaga, and Guang R. Gao. Evaluation and Development of Novel Architectures through Discrete Event Simulation. Submitted to Euro-Par 2015. Springer, 2015.

[98] Robert S. Pavel and Guang R. Gao. Guide to Use of PICASim Framework. Technical report, CAPSL Technical Note 24, 2015.

[99] Robert S. Pavel, Elkin Garcia, Daniel Orozco, and Guang R. Gao. Toward a Highly Parallel Framework for Discrete-Event Simulation. Technical report, CAPSL Technical Memo 113, 2012.

[100] Kalyan S. Perumalla. μπ: A Scalable and Transparent System for Simulating MPI Programs. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, SIMUTools ’10, pages 62:1–62:6, Brussels, Belgium, 2010. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).

[101] Carl Adam Petri. Kommunikation mit Automaten. PhD thesis, Universität Hamburg, 1962.

[102] Antoni Portero, Alberto Scionti, Zhibin Yu, Paolo Faraboschi, Caroline Concatto, Luigi Carro, Arne Garbade, Sebastian Weis, Theo Ungerer, and Roberto Giorgi. Simulating the Future Kilo-x86-64 Core Processors and Their Infrastructure. In Proceedings of the 45th Annual Simulation Symposium, ANSS ’12, pages 9:1–9:7, San Diego, CA, USA, 2012. Society for Computer Simulation International.

[103] Jon Postel. User Datagram Protocol. RFC 768, Information Sciences Institute (ISI), 1980.

[104] Atul Prakash and Rajalakshmi Subramanian. Filter: An Algorithm for Reducing Cascaded Rollbacks in Optimistic Distributed Simulations. In Proceedings of the 24th Annual Symposium on Simulation, ANSS ’91, pages 123–132, Los Alamitos, CA, USA, 1991. IEEE Computer Society Press.

[105] Sundeep Prakash and Rajive L. Bagrodia. MPI-SIM: Using Parallel Simulation to Evaluate MPI Programs. In Proceedings of the 30th Conference on Winter Simulation, WSC ’98, pages 467–474, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.

[106] Bruno R. Preiss and Wayne M. Loucks. The impact of lookahead on the performance of conservative distributed simulation. In Modelling and Simulation, Proc. of the European Simulation Multiconference, pages 204–209. Citeseer, 1990.

[107] C. Ramchandani. Analysis of Asynchronous Concurrent Systems by Timed Petri Nets. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA, 1974.

[108] Juergen Ributzka, Yuhei Hayashi, Fei Chen, and Guang R. Gao. DEEP: An Iterative FPGA-based Many-core Emulation System for Chip Verification and Architecture Research. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, pages 115–118, New York, NY, USA, 2011. ACM.

[109] Bertrand Rouet-Leduc, Kipton Barros, Emmanuel Cieren, Venmugil Elango, Christoph Junghans, Turab Lookman, Jamaludin Mohd-Yusof, Robert S. Pavel, Axel Y. Rivera, Dominic Roehm, Allen L. McPherson, and Timothy C. Germann. Spatial adaptive sampling in multiscale simulation. Computer Physics Communications, 185(7):1857–1864, July 2014.

[110] K. D. Ryu, T. A. Inglett, R. Bellofatto, M. A. Blocksome, T. Gooding, S. Kumar, A. R. Mamidala, M. G. Megerian, S. Miller, M. T. Nelson, B. Rosenburg, B. Smith, J. Van Oosten, A. Wang, and R. W. Wisniewski. IBM Blue Gene/Q system software stack. IBM Journal of Research and Development, 57(1/2):5:1–5:12, January 2013.

[111] Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Comput. Surv., 16(2):187–260, June 1984.

[112] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.

[113] Lisa M. Sokol, Duke P. Briscoe, and Alexis P. Wieland. NTW: A strategy for scheduling discrete simulation events for concurrent execution. In SCS Multiconference on Distributed Simulation, pages 34–44, 1988.

[114] Richard M. Stallman and others. Using and porting the GNU compiler collection. Free Software Foundation, 1999.

[115] Alexander Stepanov and Meng Lee. The Standard Template Library, volume 1501. Hewlett Packard Laboratories, Palo Alto, CA, 1995.

[116] Rick Stevens. The LLNL/ANL/IBM Collaboration to Develop BG/P and BG/Q. DOE ASCAC Report, 1(2):3, 2006.

[117] Joshua Suettlerlein, Stéphane Zuckerman, and Guang R. Gao. An implementation of the codelet model. In Euro-Par 2013 Parallel Processing, pages 633–644. Springer, 2013.

[118] Kevin Bryan Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, 1999.

[119] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In R. Nigel Horspool, editor, Compiler Construction, number 2304 in Lecture Notes in Computer Science, pages 179–196. Springer Berlin Heidelberg, January 2002.

[120] Zhaoguo Wang, Ran Liu, Yufei Chen, Xi Wu, Haibo Chen, Weihua Zhang, and Binyu Zang. COREMU: A Scalable and Portable Parallel Full-system Emulator. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11, pages 213–222, New York, NY, USA, 2011. ACM.

[121] M. S. Warren and J. K. Salmon. A Parallel Hashed Oct-Tree N-body Algorithm. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, Supercomputing ’93, pages 12–21, New York, NY, USA, 1993. ACM.

[122] James Hardy Wilkinson. The Algebraic Eigenvalue Problem, volume 87. Clarendon Press, Oxford, 1965.

[123] Robert Kim Yates. Networks of real-time processes. In Eike Best, editor, CONCUR ’93, number 715 in Lecture Notes in Computer Science, pages 384–397. Springer Berlin Heidelberg, 1993.

[124] Handong Ye, Robert S. Pavel, Aaron Landwehr, and Guang R. Gao. TiNy threads on BlueGene/P: Exploring many-core parallelisms beyond the traditional OS. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Workshop Proceedings, pages 1–8. IEEE, 2010.

[125] Kane Yee. Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Transactions on Antennas and Propagation, 14(3):302–307, May 1966.

[126] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10–10, 2010.

[127] Ying Ping Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, R. Nitzsche, and G. R. Gao. A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pages 10 pp.–, April 2006.

[128] Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor Von Laszewski, Veronika Nefedova, Ioan Raicu, Tiberiu Stef-Praun, and Michael Wilde. Swift: Fast, reliable, loosely coupled parallel computation. In Services, 2007 IEEE Congress on, pages 199–206. IEEE, 2007.

[129] Gengbin Zheng, Gunavardhan Kakulapati, and L. V. Kale. BigSim: A parallel simulator for performance prediction of extremely large parallel machines. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pages 78–, April 2004.

Appendix A

COPYRIGHT INFORMATION

This thesis contains, in part, results, figures, tables, and text written by me and published in scientific journals, conference proceedings, or technical memos. In some cases, the copyright for the figures, tables, and text belongs to the publisher of a particular paper. Because those parts have been used in this thesis, I have obtained permission to reproduce them. This appendix contains the relevant details of the copyright permissions obtained.

A.1 Permission from IEEE

The IEEE does not require the authors of its papers to obtain a formal reuse license for their theses. Its formal policy states:

“Requirements to be followed when using any portion (e.g., figure, graph, table, or textual material) of an IEEE copyrighted paper in a thesis:

1. In the case of textual material (e.g., using short quotes or referring to the work within these papers) users must give full credit to the original source (author, paper, publication) followed by the IEEE copyright line © 2011 IEEE.

2. In the case of illustrations or tabular material, we require that the copyright line © [Year of original publication] IEEE appear prominently with each reprinted figure and/or table.

3. If a substantial portion of the original paper is to be used, and if you are not the senior author, also obtain the senior author's approval.

Requirements to be followed when using an entire IEEE copyrighted paper in a thesis:

1. The following IEEE copyright/credit notice should be placed prominently in the references: © [year of original publication] IEEE. Reprinted, with permission, from [author names, paper title, IEEE publication title, and month/year of publication].

2. Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your thesis on-line.

3. In placing the thesis on the author's university website, please display the following message in a prominent place on the website: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of [university/educational entity's name goes here]'s products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.

If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of the dissertation.”

The copyright permission from IEEE applies to my work presented at the MTAAP 2010 workshop [124].

A.2 Permissions from Springer

A license for my Euro-Par 2015 paper [97] could not be obtained because it was still under review at the time this dissertation was written. A license will be obtained upon its acceptance and publication by Springer.

A.3 Papers I Own the Copyright to

Minor portions of CAPSL Technical Memos 109 [33], 112 [32], 113 [99], and 118 [45], and of CAPSL Technical Note 24 [98], were reproduced as part of this thesis. As their author, I own the copyright to them.
