SCHEDULING TASKS ON HETEROGENEOUS CHIP MULTIPROCESSORS WITH RECONFIGURABLE HARDWARE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Justin Stevenson Teller, B.S., M.S.

*****

The Ohio State University

2008

Dissertation Committee:

Prof. Füsun Özgüner, Adviser
Prof. Ümit Çatalyürek
Prof. Eylem Ekici

Approved by

Adviser
Graduate Program in Electrical and Computer Engineering

© Copyright by

Justin Stevenson Teller

2008

ABSTRACT

This dissertation presents several methods to more efficiently use the computational resources available on a Heterogeneous Chip Multiprocessor (H-CMP). Using task scheduling techniques, three challenges to the effective usage of H-CMPs are addressed: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.

To utilize reconfigurable hardware, we introduce the Mutually Exclusive Processor Groups reconfiguration model, and an accompanying task scheduler, the Heterogeneous Earliest Finish Time with Mutually Exclusive Processor Groups (HEFT-MEG) scheduling heuristic. HEFT-MEG schedules reconfigurations using a novel backtracking algorithm to evaluate how different reconfiguration decisions affect previously scheduled tasks. In both simulation and real execution, HEFT-MEG successfully schedules reconfiguration, allowing the architecture to adapt to changing application requirements.

After an analysis of IBM’s Cell Processor NoC and generation of a simple stochastic model, we propose a hybrid task scheduling system using a Compile- and Run-time Scheduler (CtS and RtS) that work in concert. The CtS, Contention Aware HEFT (CA-HEFT), updates task start and finish times when scheduling to account for network contention. The RtS, the Contention Aware Dynamic Scheduler (CADS), adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times, using a novel dynamic block algorithm. We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor.

To enhance fault tolerance, we modify the previously proposed hybrid scheduling system to accommodate variability in processor availability. The RtS is divided into two portions, the Fault Tolerant Re-Mapper (FTRM) and the Reconfiguration and Recovery Scheduler (RRS). FTRM examines the current processor availability and remaps tasks to the available set of processors. RRS changes the reconfiguration schedule so that the reconfigurations more accurately reflect the new hardware capabilities. The proposed hybrid scheduling system enables application performance to gracefully degrade when processor availability diminishes, and increase when processor availability increases.

Dedicated to my wonderful wife, Lindsay.

ACKNOWLEDGMENTS

I would like to thank Prof. Füsun Özgüner for being my adviser, and providing me with the guidance to finish my graduate degree. Especially, I want to thank you for recruiting me. My Ph.D. topic would be vastly different had I not been able to come to and work at Ohio State.

I would also like to sincerely thank Prof. Ümit Çatalyürek and Prof. Eylem Ekici. You are truly among the best professors I have had the honor to study with in my graduate work. Your contributions to my education cannot be overstated.

I would like to thank Tim Hartley for extremely constructive discussions concerning the Cell processor, parallel processing, and StarCraft.

I would like to sincerely thank Dr. Robert Ewing, AFRL, for his insightful conversations and guidance when working on base. I am grateful to Al Scarpelli, AFRL, for his support and help in providing access to the TRIPS system and developers.

Of course, none of this would have been possible without the love and support of my family. My wife Lindsay was incredibly supportive, and I especially want to thank my parents, brothers, and all of the Highfields: my “Columbus family.”

Finally, I would like to acknowledge the Dayton Area Graduate Studies Institute for providing support for my Ph.D. studies through a joint research fellowship.

VITA

April 19, 1980 ...... Born – Downer’s Grove, Illinois
2002 ...... B.S. in Electrical Engineering, Ohio University, Athens, Ohio
2004 ...... M.S. in Electrical Engineering, University of Maryland, College Park, Maryland
2004 ...... Givens Associate in parallel processing at the MCS division at Argonne National Laboratory
2005 – present ...... Air Force Research Laboratory/Dayton Area Graduate Studies Institute Fellow

PUBLICATIONS

1. Justin Teller, Füsun Özgüner, and Robert Ewing, “Scheduling Task Graphs on Reconfigurable Hardware.” to appear in the 37th International Conference on Parallel Processing (ICPP-08), SRMPDS workshop, Portland, Oregon, September 2008.

2. Justin Teller, Füsun Özgüner, and Robert Ewing, “Optimization at Runtime on a Nanoprocessor Architecture.” to appear in the 31st IEEE Annual Midwest Symposium on Circuits and Systems, Knoxville, Tennessee, August 2008.

3. Justin Teller, Füsun Özgüner, and Robert Ewing, “Scheduling Reconfiguration at Runtime on the TRIPS Processor.” in Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), RAW workshop, Miami, Florida, April 2008.

4. Justin Teller, “Matching and Scheduling on a Heterogeneous Chip Multi-Processor.” presentation at the ASME Dayton Engineering Sciences Symposium, October 29, 2007.

5. Justin Teller, “Reconfiguration at Runtime with the Nanoprocessor Architecture.” presentation at the ASME Dayton Engineering Sciences Symposium, October 30, 2006. Selected for an Outstanding Presentation Award.

6. Justin Teller, Füsun Özgüner, and Robert Ewing, “The Morphable Nanoprocessor Architecture: Reconfiguration at Runtime.” in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS ’06), San Juan, Puerto Rico, August 6-9, 2006.

7. Justin Teller, Füsun Özgüner, and Robert Ewing, “What are the Building Blocks of a Nanoprocessor Architecture?” in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS ’05), Cincinnati, Ohio, August 7-10, 2005.

8. Justin Teller, Charles B. Silio, and Bruce Jacob, “Performance Characteristics of MAUI: An Intelligent Memory System Architecture.” in Proceedings of the 3rd ACM SIGPLAN Workshop on Memory Systems Performance (MSP 2005), Chicago, Illinois, June 12, 2005.

9. Mark Hereld, Rick Stevens, Justin Teller, Wim van Drongelen, and Hyong Lee, “Large Neural Simulations on Large Parallel Computers.” International Journal of Bioelectromagnetism (IJBEM), vol. 7, no. 1, May 2005.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Studies in: Parallel Processing

TABLE OF CONTENTS


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Current Trends
      1.1.1 Chip Multiprocessors
      1.1.2 Heterogeneous Processing Cores
      1.1.3 Reconfigurable Hardware in General Purpose Computing
      1.1.4 Intermittent Hardware Faults
   1.2 Summary

2. Background, Prior Work, and Motivation
   2.1 Reconfigurable Hardware
      2.1.1 Scheduling on Reconfigurable Hardware
   2.2 Task Scheduling for Heterogeneous Systems
      2.2.1 Matching and Scheduling Heuristics
      2.2.2 HEFT List Scheduler
      2.2.3 Scheduling Network Access
      2.2.4 Dynamic Schedulers
   2.3 Intermittent Faults
      2.3.1 Sources of Faults
      2.3.2 Fault Tolerance in Chip Multiprocessors
   2.4 Motivation
      2.4.1 GPS Acquisition on the TRIPS Processor
      2.4.2 RDA on the Cell Processor

3. Scheduling on Reconfigurable Hardware
   3.1 Introduction
   3.2 Reconfiguration Model: Mutually Exclusive Processor Groups
   3.3 HEFT with Mutually Exclusive Processor Groups
      3.3.1 -MEG Scheduling Extension
      3.3.2 Generating New Configurations
      3.3.3 HEFT-MEG Time Complexity
   3.4 Results
      3.4.1 Simulation Results
      3.4.2 Results on TRIPS

4. The Modeling and Scheduling of Network Access
   4.1 Introduction
   4.2 The Cell Processor’s Network on a Chip
      4.2.1 Cell’s NoC: The EIB
      4.2.2 Cell EIB: In-Network Contention
   4.3 Model
      4.3.1 Calculating End-Point Contention
      4.3.2 Calculating NoC Contention
      4.3.3 Experimental Verification: NoC Contention
   4.4 System Overview
   4.5 Scheduling on the Cell Processor
      4.5.1 Compile Time Scheduling
      4.5.2 Run Time Scheduling
   4.6 Scheduling Results

5. Fault Tolerance with Reconfigurable Hardware
   5.1 Introduction
   5.2 Proposed Failure Model
   5.3 Mutually Exclusive Processor Groups Revisited
   5.4 Run-Time Scheduler
      5.4.1 Fault Tolerant Re-mapper
      5.4.2 Reconfiguration and Recovery Scheduler
   5.5 Simulation Results

6. Conclusions
   6.1 Contributions
   6.2 Future Work

Bibliography

LIST OF TABLES


5.1 Results for four node system, CCR = 1.0
5.2 Results for two node system, CCR = 1.0
5.3 Results for four node system, CCR = 0.25
5.4 Results for two node system, CCR = 0.25

LIST OF FIGURES


1.1 Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown.
2.1 A chromosome for the partitioning algorithm in Mei, et al. [70].
2.2 Partitioning a DAG into blocks [68].
2.3 Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor.
2.4 Comparing the performance of Cell’s SPE to Intel’s processors [81] on the RDA application.
3.1 Illustrating mutually exclusive processors with a group of possible configurations for an FPGA.
3.2 Illustrating mutually exclusive processors with the TRIPS processor configurations.
3.3 Scheduling a DAG fragment onto RH using HEFT-MEG.
3.4 Illustrating the FindSmartConfs algorithm.
3.5 Continuation of Figure 3.4. Illustrating the generation of m − 1 other configurations, and their testing in HEFT-MEG.
3.6 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of nodes in the architecture.
3.7 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of tasks in the DAG.
3.8 Normalized schedule length for random DAGs while varying the number of tasks between 50 and 550, on a one node architecture.
3.9 Normalized schedule length for random DAGs while varying the number of tasks between 50 and 550, on a two node architecture.
3.10 Normalized schedule length for random DAGs varying the number of nodes in the architecture between 1 and 4.
3.11 Normalized schedule length for random DAGs varying the relative reconfiguration time.
3.12 Normalized schedule length vs. matrix size for Laplace Transform DAGs.
3.13 Normalized schedule length vs. matrix size for LU Decomposition DAGs.
3.14 Normalized schedule length vs. matrix size for Gaussian Elimination DAGs.
3.15 Directed Acyclic Task Graph (DAG) of the GPS Acquisition algorithm.
3.16 GPS Acquisition’s schedule when HEFT-MEG is used for scheduling.
3.17 Comparing the runtime of GPS Acquisition using different schedules.
4.1 Block level diagram illustrating the topology of the Cell Processor’s NoC.
4.2 Illustrating the operation of the Cell NoC. The latencies of messages 0 → 3, 1 → 4, and 2 → 6 depend on ordering by the arbiter and sharing of links on the NoC.
4.3 Illustrating concurrent, independent messages reducing the realized bandwidth of other messages in the system.
4.4 Illustrating how two messages can overlap with a test message. a) Neither message affects the test message. b) One message affects the test message. c) Both messages overlap, independently. d) One message overlaps, but both messages share an end-point. e) Both messages overlap with the test message over one link.
4.5 Comparing the predicted pdf and experimental relative frequency of a test message’s latency for 2, 3, and 5 concurrent messages.
4.6 System overview. Applications are represented as a task graph.
4.7 Operation of the CADS re-mapper.
4.8 Normalized schedule length for random DAGs varying the CCR between 0.01 and 10.
4.9 Normalized schedule length for random DAGs varying the number of tasks between 200 and 800.
4.10 Normalized schedule length for Gaussian elimination DAGs varying the CCR between 0.01 and 10.
4.11 Normalized schedule length for Gaussian elimination DAGs varying the matrix size between 5 and 45.
4.12 Normalized schedule length for LU decomposition DAGs varying the CCR between 0.01 and 10.
4.13 Normalized schedule length for LU decomposition DAGs varying the matrix size between 5 and 45.
4.14 Normalized schedule length for Laplace transform DAGs varying the CCR between 0.01 and 10.
4.15 Normalized schedule length for Laplace transform DAGs varying the matrix size between 5 and 45.
5.1 Illustrating processor availability changes on an FPGA.
5.2 System overview.
5.3 Operation of the FTRM re-mapper. a) The original schedule, annotated to indicate the active block after t1 is scheduled. b) FTRM decides to schedule task t3 to processor P3. c) The scheduling decision is not reflected in the original schedule, but the active block is updated.
5.4 Illustrating the extraction of the configuration schedule.

CHAPTER 1

INTRODUCTION

Consisting of a mix of processing units (cores) that are targeted for different types of computations, Heterogeneous Chip Multiprocessors (H-CMPs) can efficiently run a diverse mix of applications [7, 45, 48, 58, 91, 97, 99, 101, 103, 104]. Figure 1.1 illustrates a hypothetical H-CMP containing twelve processing cores of four types: simple processing cores (in-order, short pipeline, etc.), vector processors, a complex processing core (out-of-order, deep pipeline, etc.), and a Reconfigurable Hardware (RH) processor.

In this dissertation, we present several methods to more efficiently use the computational resources available on an H-CMP. Using scheduling techniques, we address three challenges to the effective usage of H-CMPs: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.

1.1 Current Trends

1.1.1 Chip Multiprocessors

Multi-core processors and Chip multiprocessors (CMPs) are becoming more commonplace, as even commodity off the shelf (COTS) processors are integrating several

Figure 1.1: Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown.

processing cores onto a single chip [81, 99, 57]. CMP architectures have demonstrated benefits for processing and power efficiency [48, 59, 104, 110]. Additionally, there are proposed architectures that utilize multiple cores for redundant processing to recover from transient faults and radiation induced errors [66, 24, 35].

H-CMPs targeted to general purpose and high-performance computing have already been introduced or proposed [7, 48, 58]. As future solutions integrate more processing cores onto a single chip, managing the computational resources becomes more difficult [1, 5, 9, 49, 57]. While a number of CMPs are currently being marketed and researched, the development of quality software tools to enable efficient utilization of CMPs is expected to be a significant roadblock to their future use [87].

1.1.2 Heterogeneous Processing Cores

State of the art H-CMPs having slightly different cores have already been introduced or proposed. One example is General Purpose computation on Graphics Processing Unit (GPGPU) architectures, such as nVidia’s G80 GPU core using CUDA (Compute Unified Device Architecture), an interface that allows users to write high-performance programs for any compute-intensive task in the standard C language [86]. Also, the paper by Kumar, et al. [58] proposes a single-ISA multi-core architecture with cores of varying sizes, performance, and power consumption as a way to provide significantly higher performance in the same area as a conventional chip multiprocessor. There has been significant work into what combination of core types and interconnects yields the highest performance [59, 7, 106, 102], indicating that future solutions are moving towards more heterogeneity as more specialized cores are integrated onto a single chip.

One important commercial example of an H-CMP is the Cell processor, developed jointly by IBM, Sony, and Toshiba and originally designed for the Sony Playstation3 gaming system [48]. The Cell processor consists of nine processing cores of two different types: a single Power Processing Element (PPE) and eight Synergistic Processing Elements (SPE) [45] connected by a high-speed NoC [55]. The PPE is a traditional 64-bit PowerPC processor; it runs the operating system and can be programmed using a traditional compiler tool-chain. Conversely, each SPE is a high-performance vector engine, lacking traditional caches and branch prediction units [78]. Instead of a traditional cache, each SPE uses a software managed Local Store (LS) memory [45, 78]. The SPE is therefore optimized for data-parallel code with simple control structures, making it a promising architecture for a variety of applications [111, 110, 76].

1.1.3 Reconfigurable Hardware in General Purpose Computing

Reconfigurable hardware is attractive for general purpose computing, as the performance and flexibility of Field Programmable Gate Arrays (FPGAs) and other reconfigurable architectures have enabled system developers to achieve high levels of performance for a variety of applications [88, 77, 85, 100]. The promise of high performance coupled with low power consumption has inspired several commercial offerings coupling FPGAs with general purpose processors, including the Cray XD1 and XT5h [80] and SRC Mapstation 7 [82]. The SRC-7 system is one interesting architecture and programming environment integrating an FPGA-based reconfigurable hardware system into the memory system of a traditional personal computer (PC) architecture [82]. With the addition of the SRC’s Carte programming environment, an application developer can focus on the utilization of the FPGA for application acceleration, resulting in significant performance benefits in general purpose and high-performance computing applications [72].

Polymorphous Computing Architectures (PCAs) are a second class of reconfigurable computing used in general purpose computing. PCAs reconfigure in a coarse grained manner and target applications showing high variability in computational requirements [21]. The TRIPS processor, developed at UT at Austin [19], is an important PCA architecture. Tiled to support both instruction and thread level parallelism, the TRIPS processor has two different configurations, or “morphs:” the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph).

1.1.4 Intermittent Hardware Faults

As more processing resources are integrated onto a single chip, the possibility of experiencing faults increases [14, 29, 15]. These hardware errors can have effects lasting a wide range of time scales, and effectively make the logical processors available for execution a dynamic quantity that can both decrease and increase during an application’s execution. Even though the exact rate of faults for future processors is not known, it is expected that the rate of intermittent hardware faults will increase in the future due to increased cross-talk, voltage and temperature variations, and decreased noise margins [25, 14].

1.2 Summary

In this dissertation, we propose scheduling methods for Heterogeneous Chip Multiprocessors (H-CMPs) that address three important areas: utilization of reconfigurable hardware for general purpose computing, consideration of shared network on a chip resources when scheduling, and fault tolerance. The dissertation is composed of three main parts.

In Chapter 3, we address the problem of scheduling applications represented as directed acyclic task graphs (DAGs) onto architectures with reconfigurable processing cores. We introduce the Mutually Exclusive Processor Groups reconfiguration model, a novel reconfiguration model that captures many different modes of reconfiguration. Additionally, we propose the Mutually Exclusive Processor Groups (-MEG) list scheduling extension. The -MEG extension uses a novel backtracking algorithm to schedule reconfigurations and evaluate how different reconfiguration decisions affect previously scheduled tasks. While the -MEG extension can be used with any list scheduler, we demonstrate our scheduler by extending HEFT (proposed by Topcuoglu et al. [108]) to create HEFT-MEG. We find that HEFT-MEG generates higher quality schedules than the hardware-software co-scheduler proposed by Mei, et al. [70] and HEFT [108] using a single configuration in simulation by choosing efficient configurations for different application phases. Additionally, we used HEFT-MEG to schedule for the polymorphous TRIPS processor. In actual execution, we found that using the HEFT-MEG scheduler improves the performance of GPS Acquisition, a software radio application, by about 20%, compared to the best single-configuration schedule on the same hardware.

In Chapter 4, we perform an analysis of the Cell processor NoC and introduce a simple stochastic model to predict message latency based on the number of other competing messages communicating concurrently on the network. Using this model, we propose a hybrid scheduling system using a Compile-time Scheduler (CtS) and Run-time Scheduler (RtS) that work in concert. The proposed CtS is built using a novel Contention Aware (CA-) list scheduling extension. While the CA- extension could be used with any list scheduler, we demonstrate the scheduling extension using the HEFT scheduler proposed by Topcuoglu et al. [108], to create CA-HEFT. Next, we propose the Contention Aware Dynamic Scheduler (CADS) runtime re-mapper as the RtS. At runtime, CADS adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times. CADS uses a novel dynamic block algorithm that updates the active block of tasks depending on run time scheduling decisions, the schedule generated by CA-HEFT, and actual task finish times. We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor. As the Communication to Computation Ratio (CCR) increases, the performance benefit of using CA-HEFT and CADS to schedule “around” communication contention increases, resulting in up to a 60% reduction in execution time.

In Chapter 5, we introduce a fault tolerant extension to the Mutually Exclusive Processor Groups model. We expand on the hybrid scheduler proposed in Chapter 4, using HEFT-MEG as the CtS portion of the hybrid scheduler. The RtS is divided into two portions: a high-cost recovery scheduler and a low-cost re-mapper. The low-cost re-mapper redirects tasks based on actual system conditions. Named the Fault-Tolerant Re-Mapper (FTRM), the re-mapper examines the current processor availability and, using the schedule generated at compile time, remaps tasks to the available set of processors. The high-cost recovery scheduler is named the Reconfiguration and Recovery Scheduler (RRS) and specifically addresses the opportunities when designing a fault tolerant system for reconfigurable hardware. RRS examines the changes in processor availability and determines a new configuration schedule, inserting new reconfiguration tasks into the task graph. The recovery can take a relatively long time (total reconfiguration on an FPGA can take upwards of 10 ms [28, 34]), but allows the RtS to adjust the configuration schedule to account for changes in processor availability.

CHAPTER 2

BACKGROUND, PRIOR WORK, AND MOTIVATION

2.1 Reconfigurable Hardware

While FPGAs are an important class of fine-grained Reconfigurable Hardware (RH), Polymorphous Computing Architectures (PCAs) represent a different class of reconfigurable computing. PCAs can reconfigure in a coarse grained manner and target applications showing high variability in computational requirements [21]. Compared with FPGA RH, a PCA’s organization enables faster reconfiguration times and clock speeds at the expense of fewer possible configurations [21, 46, 73]. One important PCA architecture is the TRIPS processor, developed at UT at Austin [19]. The TRIPS processor’s current implementation has two different configurations, or “morphs:” the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph). The D-Morph allocates all on-chip resources to a single thread, using the resources to support a large number of in-flight instructions for speculation. Conversely, the T-Morph statically allocates on-chip resources to four threads, so each thread is allocated 1/4 of the on-chip resources. This limits the amount of speculative execution available to each thread in the T-Morph, as compared to the D-Morph [19, 91]. Due to these differences, the D-Morph efficiently executes applications with high Instruction Level Parallelism (ILP), while the T-Morph efficiently executes applications with high Thread Level Parallelism (TLP). We obtained access to a TRIPS evaluation board through our collaboration with Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105].

2.1.1 Scheduling on Reconfigurable Hardware

While reconfiguration at runtime has been previously studied, most studies focus on offloading specific functions onto FPGAs [20, 47, 112] or determining an efficient partitioning of work between a microprocessor and some number of FPGA soft-processors (sometimes categorized as hardware-software co-design) [70, 89, 90]. Additionally, a number of examples in the literature propose scheduling methods that target only the FPGA [28, 32, 34].

One interesting hardware-software partitioning and scheduling approach was proposed in Mei, et al. [70]. The scheduler proposed in [70] uses a Genetic Algorithm (GA) that searches for a good partitioning of tasks between a single microprocessor and some number of soft-processors on a single FPGA. Ensuing use of the term Mei00 will refer to the scheduler described by Mei, et al. [70]. Mei00 uses a simple gene structure to describe the mapping of tasks to either a general purpose CPU or the FPGA, as shown in Figure 2.1. Mei00 then determines the most fit individuals in a particular generation using a cost function based on accumulated violation, or tardiness. The tardiness for a particular task is the amount of time it misses its deadline after being scheduled. The goal of Mei00 is to find a schedule with zero tardiness, which is also a solution that meets all timing constraints [70].

Mei00’s GA has the following main steps [70]:

Figure 2.1: A chromosome for the partitioning algorithm in Mei, et al. [70].

1. Initialization. To start with a diverse initial population, each individual’s chromosome is randomly generated by setting each gene to either 1 or 0.

2. Evaluation and Fitness. The scheduler is invoked, and the tardiness of each individual’s resulting schedule is calculated.

3. Selection. Reproduction trials are run on chromosomes using the normal tournament selection strategy.

4. Crossover and Mutation. Crossover and mutation operations are applied on selected parent individuals.

5. Update Population. New individual fitness values are recalculated and lower fitness individuals are discarded.

6. Stop Criteria. If one of the stop criteria is met (either the maximum number of generations or a solution with zero tardiness), the algorithm stops. Otherwise, it repeats steps 3 through 6.
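The loop above can be made concrete. The following Python sketch is our own minimal illustration of Mei00’s loop structure, not the implementation from [70]; the schedule_and_tardiness fitness function stands in for the list scheduler described below, and the population size, crossover rate, and mutation rate are assumed values.

    import random

    def mei00_ga(num_tasks, schedule_and_tardiness, pop_size=50,
                 max_generations=200, p_crossover=0.8, p_mutation=0.05):
        # Step 1, Initialization: each gene maps a task to the CPU (0) or FPGA (1).
        pop = [[random.randint(0, 1) for _ in range(num_tasks)]
               for _ in range(pop_size)]
        best = pop[0]
        for _generation in range(max_generations):
            # Step 2, Evaluation and Fitness: invoke the list scheduler on each
            # individual; lower accumulated tardiness is better.
            scored = [(schedule_and_tardiness(ind), ind) for ind in pop]
            best_tardiness, best = min(scored, key=lambda pair: pair[0])
            # Step 6, Stop Criteria: zero tardiness meets all timing constraints.
            if best_tardiness == 0:
                break

            def tournament():
                # Step 3, Selection: binary tournament between two individuals.
                a, b = random.sample(scored, 2)
                return (a if a[0] <= b[0] else b)[1]

            children = []
            while len(children) < pop_size:
                # Step 4, Crossover and Mutation on selected parents.
                parent1, parent2 = tournament(), tournament()
                if random.random() < p_crossover and num_tasks > 1:
                    cut = random.randrange(1, num_tasks)
                    child = parent1[:cut] + parent2[cut:]
                else:
                    child = list(parent1)
                children.append([g ^ 1 if random.random() < p_mutation else g
                                 for g in child])
            # Step 5, Update Population: offspring replace the old population.
            pop = children
        return best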

Mei00 uses a list scheduler to evaluate individuals in the GA [70]. Described in more detail in the next section, list schedulers in general schedule tasks by listing the priority of each task, choosing the task with the highest priority, and placing that task on a particular processor to execute at a particular time. Mei00 uses a dynamic priority scheme, given by:

priority(t) = −( ASAP_dyna(t) + ALAP(t) )    (2.1)

The ASAP_dyna value is the earliest a task could possibly execute, based on processor availability, while ALAP is the negative of the task’s “distance” from the bottom of the graph [70]. Unlike static priority calculations, once a task is scheduled, all ASAP_dyna values are recalculated to reflect the current status. Larger ASAP times mean the task must be scheduled later, so it has a lower priority. Similarly, larger ALAP values mean the task can be executed later, so the task has a lower priority.

Then, priority(t) is further modified to account for the reconfiguration overhead.

Basically, if the task can reuse a configuration on an FPGA, it is given higher priority when scheduling [70].
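For illustration, Equation 2.1 and the reuse adjustment could be evaluated as in the short Python sketch below; the sketch is ours, and the reuse_bonus magnitude is an assumed value, since the exact weighting from [70] is not reproduced here.

    def mei00_priority(asap_dyna, alap, reuses_configuration, reuse_bonus=10.0):
        # Equation 2.1: larger ASAP_dyna (must start later) and larger ALAP
        # (can run later) both lower the task's priority.
        priority = -(asap_dyna + alap)
        # Tasks that can reuse an already-loaded FPGA configuration avoid a
        # reconfiguration penalty, so they are promoted in the list.
        if reuses_configuration:
            priority += reuse_bonus
        return priority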

The paper by Mei, et al. [70] schedules 3, 4, or 5 task graphs consisting of an average of 10 tasks onto a single microprocessor, single FPGA system. They chose a mix of task graphs because it resembles several periodic real-time tasks, each with deadlines.

In Chapter 3 we compare our RH scheduler to Mei00. This choice was made because of Mei00’s flexibility [70]. Although it was originally targeted to a system with a single FPGA and microprocessor, Mei00 can easily be extended to multiple microprocessor, multiple FPGA systems by changing the gene representation in Figure 2.1 to include more than a single bit per task. Secondly, changing the fitness value to the overall schedule length allows Mei00 to be re-targeted to task graphs where individual tasks have no deadline, and the goal is to reduce the time needed to execute a particular set of tasks.

The Reconfigurable computing Co-Scheduler (ReCoS) is another co-scheduler that targets single microprocessor, single FPGA workstations [89]. ReCoS is a clustering scheduler that groups tasks to execute on a particular processor. In this case, ReCoS chooses tasks for a particular cluster based on their similarity and possibility to co-execute on the FPGA [88]. Then, ReCoS iterates over the clustering, redistributing tasks to try to minimize the time required for execution on the microprocessor and FPGA and maximize the FPGA utilization [90]. We chose not to compare our reconfigurable scheduler to ReCoS in the forthcoming chapters. Because ReCoS was targeted only to the scheduling and placement of logical processors within the FPGA, it was not flexible enough to schedule within our proposed reconfiguration model.

2.2 Task Scheduling for Heterogeneous Systems

An important part of the parallelization process is allocating tasks to processors and determining the order of execution. This scheduling can either be performed before the application executes (called compile-time) or while the application is executing (called run-time). Compile-time scheduling is designated as static scheduling, and uses estimations of task execution and communication time when scheduling. Run-time scheduling is called dynamic scheduling, and actual application behavior can be used when scheduling. While dynamic schedulers can use more accurate information when scheduling than their static counterparts, dynamic schedulers need to have real-time response to be useful. To have real-time response, dynamic schedulers perform lower complexity analysis than static schedulers.

2.2.1 Matching and Scheduling Heuristics

Historically called Mapping and Scheduling for homogeneous systems and Matching and Scheduling for heterogeneous systems, scheduling a task graph representing an application is a well studied problem [95, 50, 39]. For scheduling on a homogeneous parallel system, an application is represented as a Directed Acyclic Task Graph (DAG), G = (V, E, w, c), where the nodes V represent the application tasks and the edges E the communications (data dependencies) between tasks. The weight w(v) associated with node v ∈ V represents its computation cost, and the weight c(e) associated with e ∈ E represents its communication cost. The model is similar for heterogeneous systems, except that the task’s computation and communication costs depend on the processor executing the task [18, 43]. Unfortunately, the optimal scheduling of an arbitrary DAG onto a limited number of processors is NP-hard [83], so most solutions present in the literature propose heuristics to find near-optimal solutions.
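To make the model concrete, a heterogeneous task graph can be represented as follows. This Python container is our own sketch of G = (V, E, w, c), with per-processor computation costs as used in the heterogeneous case; the field layout is an assumption for illustration.

    from dataclasses import dataclass

    @dataclass
    class TaskGraph:
        """G = (V, E, w, c): tasks V, dependence edges E, computation costs w,
        and communication costs c. In the heterogeneous model, w[v][p] is the
        cost of task v on processor p; c[(u, v)] is the edge's average cost."""
        tasks: list
        edges: list          # (producer, consumer) pairs
        w: dict              # w[v][p] -> computation cost of v on processor p
        c: dict              # c[(u, v)] -> communication cost of edge (u, v)

        def successors(self, v):
            return [b for (a, b) in self.edges if a == v]

        def predecessors(self, v):
            return [a for (a, b) in self.edges if b == v]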

Static scheduling heuristics can be loosely broken into three categories: guided stochastic search, clustering, and list-based schedulers. Guided stochastic search schedulers use genetic [113, 38], simulated annealing [115], or other randomized search methods to search through possible schedules for near-optimal solutions. Clustering heuristics have two steps: first, application tasks are clustered together to run on a single processor in an attempt to reduce communication time; then the execution order is defined [12, 27].

List scheduling heuristics are a common framework to use when scheduling. A list scheduler’s basic idea is to generate a scheduling list (a sequence of nodes for scheduling) ordered by some priority, then repeatedly execute the following two steps until all the nodes in the DAG are scheduled [60, 95]:

1. Remove the first node from the scheduling list.

2. Allocate the node to a processor that minimizes some cost function.

List schedulers differ by the definition of the listing priority and the scheduling cost function. List scheduling is a simple, well performing, and well studied scheduling algorithm with a large number of list schedulers present in the literature [11, 37, 41, 60, 61, 64, 71, 75]. Another important category of list schedulers attempts to avoid interprocessor communication by duplicating task execution [2, 6, 16, 17, 36, 53].

2.2.2 HEFT List Scheduler

Heterogeneous Earliest Finish Time (HEFT) is one heuristic often used as a benchmark to evaluate other heterogeneous scheduling heuristics, for its simplicity and ability to generate high-quality schedules [107, 108]. HEFT is a static list scheduler, so task priorities do not change while scheduling, and are only calculated once. A task’s listing priority is its bottom rank, defined as:

rank_b(n_i) = w_i + max_{n_j ∈ succ(n_i)} ( c_{i,j} + rank_b(n_j) )    (2.2)

where succ(n_i) is the set of immediate successors of task n_i, c_{i,j} is the average communication cost of the edge between n_i and n_j, and w_i is the average computation cost of task n_i. Exit tasks (tasks without successors) have the bottom rank equal to:

rank_b(n_exit) = w_exit    (2.3)

Before defining HEFT’s scheduling cost function, we define several other functions. HEFT uses an insertion-based policy that considers inserting tasks into idle time slots between two already-scheduled tasks on a processor, as originally described in [108]. Assuming that I_j is the set of idle time slots on processor p_j and each time slot s has a start time of s_s and an end time of s_e, we define the set of appropriate idle time slots for task n_i on processor p_j as:

A_j = { s : s ∈ I_j ∧ (t_m + w_i(p_j)) ≤ s_e }    (2.4)

where w_i(p_j) is the runtime of task n_i on processor p_j, and t_m is defined as:

t_m = max{ t_r(n_i, p_j), s_s }    (2.5)

t_r(n_i, p_j) is the time all data generated by n_i’s immediate predecessors would be available to processor p_j.

HEFT then defines the scheduling cost function as the Earliest Finish Time (EFT) of task n_i on processor p_j, defined as [107]:

EFT(n_i, p_j) = min_{s ∈ A_j} { t_m + w_i(p_j) }    (2.6)

Using EFT as a cost function allows HEFT to schedule tasks onto heterogeneous processors and networks, as execution time differences are taken into account when scheduling. Algorithm 1 is a pseudo-code representation of the HEFT scheduling heuristic.

Algorithm 1 HEFT Scheduling Heuristic [108]
 1: procedure HEFT(G = (V, E, w, c))            ▷ G is a task graph
 2:   Compute rank_b for all tasks t ∈ V        ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank_b and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, n_i, and remove from list
 6:     for all processors p_j do
 7:       Evaluate EFT(n_i, p_j), saving the minimum EFT    ▷ Using Equation 2.6
 8:     end for
 9:     Schedule task n_i on the processor p_x with the minimum EFT
10:   end while
11: end procedure
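The following Python sketch renders Algorithm 1 using the TaskGraph container sketched earlier in this section. It is our own illustration, not the reference implementation from [108]: for brevity it places each task at the end of a processor’s queue rather than searching idle slots, so the insertion-based policy of Equations 2.4–2.6 is simplified to t_m = max(t_r, processor-ready time).

    def heft(g, processors):
        # Average computation cost per task, used by the ranking step.
        avg_w = {v: sum(g.w[v].values()) / len(g.w[v]) for v in g.tasks}

        rank_b = {}
        def rank(v):
            # Equations 2.2 and 2.3: recursive bottom rank.
            if v not in rank_b:
                rank_b[v] = avg_w[v] + max(
                    (g.c[(v, s)] + rank(s) for s in g.successors(v)),
                    default=0.0)
            return rank_b[v]

        # Decreasing rank_b is a valid topological order for positive costs.
        order = sorted(g.tasks, key=rank, reverse=True)
        proc_ready = {p: 0.0 for p in processors}
        placement, finish = {}, {}
        for v in order:
            best_eft, best_p = None, None
            for p in processors:
                # t_r: when all predecessor data reaches p (no communication
                # cost when producer and consumer share a processor).
                t_r = max((finish[u] +
                           (0.0 if placement[u] == p else g.c[(u, v)])
                           for u in g.predecessors(v)), default=0.0)
                eft = max(t_r, proc_ready[p]) + g.w[v][p]    # Equation 2.6
                if best_eft is None or eft < best_eft:
                    best_eft, best_p = eft, p
            placement[v], finish[v] = best_p, best_eft
            proc_ready[best_p] = best_eft
        return placement, finish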

2.2.3 Scheduling Network Access

The literature contains a number of heuristics that consider network contention when scheduling. A simple model examines end-point contention when scheduling. One such example is the one-port model, which models the network port of a processor as able to accommodate only a single input or output at a time [10]. This effectively limits the total I/O bandwidth available to each processor when scheduling, and forces the scheduling heuristic to schedule access to each processor’s network port, without having to consider the network itself. Similarly, the parameter g in the LogP model models the amount of communication a processor can accommodate simultaneously [31].
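As a concrete reading of the one-port restriction, the sketch below serializes messages through each end-point’s single network port; the bookkeeping is our own illustration of the model in [10], not a published scheduler.

    def one_port_finish_times(messages):
        """Each (src, dst, ready, duration) message must wait until both
        end-point ports are free, since a port carries one transfer at a time."""
        port_free = {}
        finish = []
        for src, dst, ready, duration in messages:    # in issue order
            start = max(ready, port_free.get(src, 0.0), port_free.get(dst, 0.0))
            end = start + duration
            port_free[src] = port_free[dst] = end     # both ports busy until done
            finish.append(end)
        return finish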

The literature also contains a number of other scheduling models that consider other modes of network contention. A number of approaches model edges on the network graph as processors that only execute communication tasks. These edge-based schemes schedule access to the network, and can more accurately model actual network conditions [22, 40, 61, 94, 96].

Unlike previous work, in Chapter 4 we consider only a single communication architecture (Cell Processor’s NoC), and model the end-point contention more faithfully than the one-port and LogP models. Our approach also differs from previous work by introducing a stochastic network model, since the considered processing model allows remapping of tasks to processors.

2.2.4 Dynamic Schedulers

Static schedules are not always efficient in unpredictable computational environments, as the estimated execution and communication time used when scheduling may not be accurate. Dynamic matching and scheduling algorithms generate the schedule at runtime, so the scheduling heuristics can use more accurate information about the running application. A number of dynamic schedulers have been proposed in the relevant literature [3, 26, 56, 50, 54, 39, 114].

Using run-time information as it becomes available forces a dynamic scheduler to make scheduling decisions in real-time. A main challenge to the development of a dynamic scheduler is limiting its complexity to ensure real-time response. One approach to limiting runtime complexity while generating high quality schedules is to utilize a hybrid scheduler. A hybrid scheduler takes a statically generated schedule as an input, and tasks are selectively rescheduled using runtime information [13, 68, 67].

As one example, Maheswaran and Siegel [68] propose a dynamic re-mapper that uses a statically generated schedule as an input. The first phase in the scheduling uses the initial static mapping generated by the compile time scheduler and partitions the DAG into B blocks numbered consecutively from 0 to B − 1. Blocks are generated such that all tasks within a block are independent, and inter-block data dependencies are monotonically increasing. In other words, all subtasks that send data to tasks in block k must be partitioned into blocks 0 to k − 1. The (B − 1)-th block includes all tasks without successors and the 0-th block includes all tasks without predecessors [68]. Generating three blocks from a seven node DAG is shown in Figure 2.2. Once the tasks in the DAG are partitioned, they are scheduled at runtime based on their block. Blocks are scheduled consecutively from block 0 to B − 1. When tasks from block i are being executed, the re-mapper is scheduling block i + 1 [68]. Work extending the hybrid re-mapper in [68] merges blocks together at runtime to consider a larger number of tasks when scheduling, reducing the resulting schedule length [67, 13]. These extensions operate on largely the same principle, however.
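One simple way to realize this partitioning assigns each task to the block equal to its longest path length (in edges) from the entry tasks: tasks sharing a block are then independent, and every predecessor of a block-k task falls in blocks 0 through k − 1. The Python sketch below is our own rendering of this rule; the re-mapper in [68] additionally places all exit tasks in the final block, which is omitted here.

    from collections import deque

    def partition_into_blocks(tasks, edges):
        preds = {t: [] for t in tasks}
        succs = {t: [] for t in tasks}
        for u, v in edges:
            preds[v].append(u)
            succs[u].append(v)
        indegree = {t: len(preds[t]) for t in tasks}
        block = {t: 0 for t in tasks}                  # block 0: no predecessors
        queue = deque(t for t in tasks if indegree[t] == 0)
        while queue:                                   # Kahn-style traversal
            u = queue.popleft()
            for v in succs[u]:
                block[v] = max(block[v], block[u] + 1)
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
        return block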

The runtime schedulers in Chapters 4 and 5 differ from previous hybrid schedulers in the focus on contention in the network and fault tolerance as the rationale for the dynamic portion of the scheduling system. Additionally, unlike previously proposed schedulers, we consider “dynamic” blocks when scheduling the DAG at runtime, where the block membership depends on what tasks have already been scheduled, as well as a task’s level in the DAG.

2.3 Intermittent Faults

2.3.1 Sources of Faults

As more processing cores are integrated into a single system, the cores are becoming more susceptible to hardware errors. Particularly, intermittent hardware faults can cause hardware errors that occur in bursts. These faults are often caused by process variation combined with voltage and temperature fluctuations (also denoted as PVT fluctuations) [14, 29].

Figure 2.2: Partitioning a DAG into blocks [68].

Because the underlying cause of intermittent faults can vary widely, so can the duration and number of cores affected by the fault. Different software phases can exercise different portions of a core, causing intermittent faults depending on application behavior [104, 92, 33]. Voltage fluctuations can affect a number of cores, but the effects last on the order of nanoseconds [15]. Temperature fluctuations can be localized to a single processor or group of processors, causing faults that can last up to several seconds [79].

In addition to hardware faults, if a mechanism were in place to allow the software to recover from hardware faults, it would free operating system, firmware, or hypervisor modules to make decisions that affect processor availability. For instance, the operating system could decide to limit the number of processing cores available to an application to limit power consumption, or dedicate certain resources to a high-priority application [35, 25]. This type of behavior would be enabled by software that can recover from changes in processor availability.

2.3.2 Fault Tolerance in Chip Multiprocessors

The literature contains a number of examples that provide fault tolerance for a CMP system. For instance, Ding et al. [35] propose a helper-thread based scheme that aims to reduce the energy-delay product (EDP) when processor availability can change during application execution. The helper threads execute in parallel to the application threads, gathering energy-delay product statistics during an application’s execution using hardware performance counters. The system then uses this information to scale the number of active processors and threads to minimize the EDP [35].

Chakraborty, et al. [24] propose an Over-provisioned Multi-core System (OPMS) as a way to provide fault-tolerance and reduce power consumption. In an OPMS, the number of available processing cores is larger than the number of simultaneously active cores allowed by thermal or power constraints of a chip. Chakraborty, et al. [24] use a lightweight Virtual Machine Monitor (VMM) to perform dynamic task reassignment by mapping computation fragments to processors as processor availability changes during execution.

While Ding et al. [35] do not consider how faults are detected (it is assumed that the Operating System notifies the application when a fault occurs), other schemes consider the mechanism for detecting transient hardware faults as well as proposing solutions to increase fault-tolerance [25, 23]. The methods proposed in Chapter 5 differ from those found in the literature, as our method targets CMPs with reconfigurable hardware. The opportunity to reconfigure allows the architecture to find more efficient configurations when processor availability changes.

2.4 Motivation

Our initial work programming the TRIPS and Cell processors showed the need for tools managing the parallel and reconfigurable resources available in these processors. The next two subsections overview preliminary work developing applications for the TRIPS and Cell processors and explain how this work led to the work in the remainder of the dissertation.

2.4.1 GPS Acquisition on the TRIPS Processor

Figure 2.3 illustrates the motivation for scheduling reconfiguration on the TRIPS processor. In Figure 2.3, one can see several high level phase changes when executing GPS Acquisition on TRIPS, as indicated by changes in the average number of Instructions executed Per Cycle (IPC). GPS Acquisition is a real-world software radio application [63, 109]. A TRIPS processor consists of sixteen processing tiles (cores), so an average IPC of eight means that one half of the processing tiles remain idle (or are busy communicating) on average at any time, an average IPC of four means three-quarters of the tiles are idle, etc. The graph shows three distinct high level phases, as detected by examining average IPC. The first phase runs from 9 million cycles to about 41 million cycles. The second phase shows higher and more variable average IPC, and lasts from 41 million to 48.6 million cycles. The final phase is clearly composed of shorter sub-phases and continues through the remainder of the experiment.

Figure 2.3: Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor.

The first phase shown in Figure 2.3 utilizes fewer processing resources than the two subsequent phases, indicating that those processing resources could be used for other tasks. Similarly, phase three shows high variability in average IPC, but average IPC over the entire phase is significantly lower than the peak IPC value. The trends shown in Figure 2.3 show that the single-threaded usage of the TRIPS processor changes dynamically. This work led us to develop the reconfiguration scheduler described in Chapter 3. After breaking the GPS Acquisition application into tasks, several tasks that utilize fewer tiles can be executed under TRIPS’s T-Morph, which runs four threads simultaneously on the same hardware [19], without reducing the per-task performance significantly. Then tasks that utilize more tiles can be executed under TRIPS’s D-Morph, which runs a single thread [19], to get the highest single task performance possible.

2.4.2 RDA on the Cell Processor

We ran a number of performance tests using IBM’s Cell processor. Figure 2.4 shows our tests using the Cell’s SPE as an accelerator for the Robust Data Alignment (RDA) application, a computer vision application [51, 52]. Our work showed the performance potential of the Cell processor. Using a single SPE yielded an approximately 4x performance increase compared to comparably clocked Intel processors [81]. As there are 8 SPEs on a Cell processor, we expected a significant increase in performance as we increased the number of SPEs used. However, the actual performance using multiple SPEs was significantly lower due to memory and NoC contention. This realization led us to develop the contention aware scheduling algorithms presented in Chapter 4.

Additionally, our original development for the Cell processor led us to several other conclusions. First, the Cell processor’s organization, specifically the explicitly distributed on-chip memory instead of a logically shared cache, enabled very high performance. However, high performance was difficult to obtain, resulting in a fair amount of performance fragility when manual or ad-hoc methods are used. This reinforced work done by others stating that software will be the important consideration in the efficient use of future CMP designs [87]. The Cell processor was originally designed for applications with regular memory accesses, where the SPE’s Local Store (LS) memory can be most effectively leveraged [48], such as graphics or other “streaming” applications. However, there are several examples of work trying to fit more irregular applications, like graph exploration, to the Cell’s organization [76, 110, 111]. Unfortunately, these efforts largely used ad-hoc methods to overlap computation and communication on the Cell’s SPEs, further illustrating the need for novel tools to ease the development of software for H-CMPs. The work presented in Chapter 4 addresses a subset of the problems facing the development of high-performance applications for the Cell processor.

Figure 2.4: Comparing the performance of Cell’s SPE to Intel’s processors [81] on the RDA application.

CHAPTER 3

SCHEDULING ON RECONFIGURABLE HARDWARE

3.1 Introduction

One of the more difficult problems facing the use of Reconfigurable Hardware (RH) for general purpose computing is the efficient management of reconfigurable resources. To enable the scheduling of application tasks onto RH resources and scheduling reconfiguration at runtime, this chapter introduces the Mutually Exclusive Processor Groups reconfiguration model. The Mutually Exclusive Processor Groups model is simple, but it still captures many different modes of reconfiguration, ranging from Polymorphous Computing Architecture (PCA) processors to Field-Programmable Gate Arrays (FPGAs). Next, we propose a reconfiguration aware list scheduler extension named the Mutually Exclusive Processor Groups (-MEG) extension. Our goal is to have the -MEG extension choose the most efficient configuration for each application phase and schedule the appropriate reconfigurations. Using any list scheduler as a “base” scheduler, -MEG schedules hardware reconfiguration using a novel backtracking algorithm. While the -MEG extension could be used with any list scheduler, we demonstrate the -MEG extension using HEFT [108] as our base scheduler to create HEFT-MEG.

Section 3.4.1 discusses our results using HEFT-MEG to schedule randomly generated, LU decomposition, Laplace Transform, and Gaussian Elimination task graphs onto a number of architectures consisting of a mix of microprocessors and Field Programmable Gate Array (FPGA) RH processors. In simulation, we find that using HEFT-MEG to evaluate reconfiguration decisions generates schedules that are about 20% shorter than HEFT [108] using a single configuration, and about 50% shorter than a previously proposed Genetic Algorithm (GA) based hardware-software co-scheduler [70] for graphs with larger numbers of tasks. Section 3.4.2 discusses our results using HEFT-MEG to schedule GPS Acquisition [63, 109] (a software radio application) onto the reconfigurable TRIPS processor [91] (developed at UT at Austin). We obtained access to a TRIPS evaluation board through our collaboration with Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105]. In actual execution, we find that HEFT-MEG successfully schedules reconfigurations to occur at runtime, reducing the execution time of GPS Acquisition by about 20% compared to the best performing single configuration schedule.

3.2 Reconfiguration Model: Mutually Exclusive Processor Groups

When an RH resource has more than one configuration, each configuration is composed of one or more logical processors. Obviously, it is not possible for two different configurations using the same underlying hardware to execute tasks concurrently; we define the logical processors that use the same underlying hardware to be Mutually Exclusive Processors. Mutually Exclusive Processors are processors that, while logically distinct, cannot be used concurrently. For our model, an RH does not need to instantiate an entire instruction based architecture to be considered a logical processor. Rather, any computational function that can be realized by an RH is considered a logical processor (such as an ALU or multiplier). This way, any hardware block that can execute a task in the DAG can be utilized in our reconfiguration model. Ensuing use of the term processor will refer to a logical processor.

Figure 3.1 shows an example of how we define the relationships among processors that can be instantiated by an FPGA using our Mutually Exclusive Processor Groups model. All configurations for a particular RH belong to a single SuperGroup. Each configuration is represented as a single SubGroup. Processors belonging to the same SubGroup can be used concurrently; logical processors in the same SuperGroup and in different SubGroups cannot be used concurrently and are mutually exclusive.

Figure 3.1 illustrates how a set of possible configurations for an FPGA map to Mutually Exclusive Processor Groups. Figure 3.1.a shows three possible configurations for an FPGA. The possible configurations are composed of soft-processors of five types, V–Z. Across all the configurations, there are thirteen logically separate processors. Figure 3.1 illustrates that a processor type can be present in multiple configurations and more than one instance of a processor type can be present in a single configuration. Figure 3.1.b shows how the three possible configurations are mapped to a single SuperGroup (Super) that contains three SubGroups (S1, S2, and S3). The group membership defines which processors are mutually exclusive. For instance, processor X in S1 is mutually exclusive with processor Y in S3, because these processors belong to different SubGroups within the same SuperGroup.

Figure 3.1: Illustrating mutually exclusive processors with a group of possible configurations for an FPGA.

Figure 3.2 illustrates how the TRIPS processor’s configurations map to Mutually Exclusive Processor Groups. A TRIPS processing core has two possible configurations, the D-Morph and the T-Morph. The D-Morph runs a single thread, while the T-Morph runs four threads simultaneously. Therefore, the D-Morph consists of a single logical processor, while the T-Morph is modeled as four logical processors. Based on this, D-Morph’s processor (X) is mutually exclusive with T-Morph’s processors (X′).

Figure 3.2: Illustrating mutually exclusive processors with the TRIPS processor configurations.

A strength of the Mutually Exclusive Processor Groups model is that it captures many different kinds of reconfiguration, ranging from PCA computing cores to FPGAs, but the model remains simple. However, the proposed model requires all configurations that will be considered in scheduling to be enumerated, and the relationships among all the possible configurations need to be specified before scheduling. Because of this, it is likely that the system designer or programmer will choose a set of promising configurations to be considered during scheduling.

3.3 HEFT with Mutually Exclusive Processor Groups

3.3.1 -MEG Scheduling Extension

We propose the Mutually Exclusive Processor Groups (-MEG) scheduling exten- sion as a means to augment any list scheduler with the ability to schedule for RH resources. When scheduling, the goal is to have the -MEG extension choose the most

30 efficient available configuration for each application phase. This is done by using the

-MEG scheduling extension to explore the reconfiguration space while the base sched- uler decides the mapping of tasks to processors. While the -MEG extension could be applied to any list scheduler, we demonstrate the -MEG extension using HEFT, orig- inally proposed by Topcuoglu et al. [107, 108]. HEFT with the Mutually Exclusive

Processor Groups extension (HEFT-MEG) analyzes an application at compile time and generates a runtime schedule.

The -MEG extension uses a novel backtracking algorithm to evaluate the performance impact of different reconfiguration decisions. After each task is scheduled,

-MEG finds a number of candidate reconfiguration times over a programmer-controllable window of size ws. For each candidate reconfiguration time tk, -MEG backtracks by unscheduling all tasks that finish after tk. Based on the properties of the unscheduled tasks, -MEG chooses a number of new configurations. For each new configuration, a reconfiguration task is inserted at tk, and the unscheduled tasks are rescheduled using the base scheduler under the new configuration. For each configuration and candidate reconfiguration time combination, -MEG tentatively reschedules the tasks and only keeps the partial schedule that has the shortest makespan. By doing this, the -MEG scheduling extension iteratively refines the reconfiguration schedule with each scheduled task. A pseudo-code representation of HEFT-MEG is shown in Algorithm 2, where lines 10 through 21 are additions to the original HEFT algorithm.

The scheduling and cost functions HEFT uses are detailed in Chapter 2.

HEFT-MEG specifically uses the bottom level rank (rankb as defined in Equation

2.2) in the listing step and Earliest Finish Time (EFT, as defined in Equation 2.6) in the placement step. Also note that HEFT-MEG uses the same insertion-based policy

Algorithm 2 HEFT-MEG Algorithm
 1: procedure HEFT-MEG(G = (V, E, w, c))   ▷ G is a task graph
 2:   Compute rank for all tasks t ∈ V   ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, ni, and remove from list
 6:     for all processors pj do
 7:       Evaluate EFT(ni, pj), saving the minimum EFT   ▷ Using Equation 2.6
 8:     end for
 9:     Schedule task ni on the processor px with the minimum EFT
10:     Save the minimum EFT as EFTcurr
11:     Find candidate reconfiguration times between EFTcurr and EFTcurr − ws and put in listtimes
12:       ▷ Candidate reconfiguration times found using Equation 3.1
13:     for all times tk in listtimes do
14:       Generate possible reconfigurations for tk and put in listr
15:       for all reconfiguration possibilities rw in listr do
16:         Unschedule tasks scheduled between tk and EFTcurr, put in list2
17:         Insert reconfiguration rw at tk
18:         Perform HEFT with tasks in list2 using the new configuration
19:       end for
20:     end for
21:     Choose the schedule (from all considered configurations) that minimizes the partial schedule's makespan
22:   end while
23: end procedure

as the original HEFT. HEFT-MEG distinguishes itself from the scheduler proposed by

Topcuoglu et al. [107] in its consideration of reconfigurable computational resources.

We define a candidate reconfiguration time as the point in time that HEFT-MEG will

evaluate a reconfiguration possibility. C is the set of candidate reconfiguration times and is defined as:

C = (ET ∪ ST) ∩ {t : t > EFTcurr − ws}    (3.1)

Where ET and ST are, respectively, the end and start times of all tasks in

the partial schedule, EFTcurr is the EFT of the last scheduled task, and ws is the

programmer-defined window size.
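For illustration, the set C of Equation 3.1 can be computed directly from a partial schedule. The sketch below (in Python, with an assumed list of (start, finish) pairs rather than our actual data structures) mirrors the set operations above.

def candidate_reconfig_times(partial_schedule, eft_curr, ws):
    """Equation 3.1: task start/end times later than eft_curr - ws."""
    ends   = {finish for (start, finish) in partial_schedule}  # ET
    starts = {start for (start, finish) in partial_schedule}   # ST
    return sorted(t for t in (ends | starts) if t > eft_curr - ws)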

Figure 3.3 illustrates how the -MEG evaluates different reconfiguration options

collaboratively with the base scheduler. In Figure 3.3, the target hardware includes a

single RH chip with its configurations represented by a single SuperGroup with three

SubGroups and a microprocessor which (not being reconfigurable) is represented by

a single SubGroup within a SuperGroup. In Figure 3.3.a, S1 represents the current

configuration of the FPGA, and tasks c and d have been scheduled on f10 and f21, respectively. Then, in 3.3.b, HEFT-MEG backtracks the schedule to a point within ws, inserts a reconfiguration task across all processors part of the SuperGroup that is being reconfigured, and reschedules tasks b, c, d, and e under the new configuration using the base scheduler. The reconfiguration task's virtual execution time is the time it takes the architecture to reconfigure, so reconfiguration time is considered during scheduling. In this case, the reconfiguration causes the new partial schedule's makespan to be less than the previous configuration's partial schedule, so the reconfiguration is saved and used in the next scheduling step. In more complicated examples, the number of candidate reconfiguration times and configurations considered increases.

3.3.2 Generating New Configurations

In Algorithm 2, the generation of possible configurations in line 14 is undefined.

The naive version of HEFT-MEG iterates over all possible configurations. However,

Figure 3.3: Scheduling a DAG fragment onto RH using HEFT-MEG. For each candidate reconfiguration time: a) backtrack the schedule and insert a reconfiguration task, then b) reschedule tasks b, c, d, and e.

an architecture with pr RH resources and q total configurations per RH can exist in one of q^pr different configurations.

We define a function FindSmartConfs with the goal of building a “good” config-

uration at a particular candidate reconfiguration time, t, based on a partial schedule

that extends to some time after t. To do this, each task in the partial schedule that finishes after the time t “chooses” a processor to add to the configuration. A pseudo-code representation of the FindSmartConfs configuration building heuristic

is shown in Algorithm 3. In order to find a “good” configuration, choosing the same

processor for independent tasks is penalized. The rationale for this choice is to have

independent tasks (that will likely be scheduled to execute in parallel) choose dif-

ferent processors to add to the configuration. This way, portions of the task graph

with more parallelism will likely build a configuration with more processors. Simi-

larly, portions of the task graph that are more serial in nature will build a higher

performance configuration, regardless of the number of processors.

To implement the processor choice heuristic, we define the Processor Penalty (PP)

cost function. PP penalizes processors that have already been “chosen” for the current

configuration by another task, unless task ni directly depends on the task that has chosen the processor. First we define Qi,j as the set of tasks that the source task ni

does not depend on that have already "chosen" processor pj. Qi,j is defined more formally as:

Qi,j = { nk : nk ∉ pred(ni) ∧ ϱk(pj) }    (3.2)

Algorithm 3 Function to find a "good" set of configurations
 1: function FindSmartConfs(t, m)   ▷ t is the time the configurations will be inserted in the schedule
 2:     ▷ m is the number of configurations that will be generated
 3:   Find all tasks that finish after t and put them in listt, ordered by rank
 4:   for all tasks ni in listt do
 5:     for all processors pj do
 6:       Evaluate PP(ni, pj), saving the minimum PP   ▷ Using Equation 3.3
 7:     end for
 8:     pk is the processor having the minimum PP(ni, pk)
 9:     if the current configuration does not contain pk's SuperGroup then
10:       Add pk to Confcurr
11:     end if
12:   end for
13:   Ensure Confcurr contains one SubGroup from each SuperGroup
14:       ▷ Choose one random SubGroup for each unrepresented SuperGroup
15:   Add Confcurr to ConfSet
16:   for i = 1 : m − 1 do
17:     Choose a random configuration in ConfSet and copy it to κi
18:     Swap one random SubGroup in κi with another SubGroup in κi's SuperGroup
19:     Add κi to ConfSet
20:   end for
21: end function

Where pred(ni) is the set of all of ni's predecessors, and ϱk(pj) is true iff task nk has already "chosen" processor pj for the current configuration. Next, we define PP as:

PP(ni, pj) = wi(pj) + max_{nk ∈ Qi,j} {wk(pj)}    (3.3)

Where ni is the source task, pj is the processor being tested, and wi(pj) is the runtime of task ni on processor pj.
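A minimal sketch of Equations 3.2 and 3.3 follows; all names are illustrative rather than taken from our scheduler. It assumes w[i][j] holds task ni's runtime on processor pj (with float('inf') marking processors on which ni cannot execute), pred(i) returns ni's predecessor set, and chosen_by[j] lists the tasks that have already "chosen" pj for the configuration being built.

def processor_penalty(i, j, w, pred, chosen_by):
    # Q_{i,j} (Equation 3.2): tasks that chose pj and on which ni does not depend
    q = [k for k in chosen_by[j] if k not in pred(i)]
    # Equation 3.3: ni's own runtime plus the largest runtime among tasks in Q
    penalty = max((w[k][j] for k in q), default=0)
    return w[i][j] + penalty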

In addition to a single "good" configuration (κ), FindSmartConfs also generates m−1 random configurations "around" κ (where m is programmer defined). A pseudo-code representation of this process is shown in lines 16-20 of Algorithm 3. The function FindSmartConfs(t, m) avoids the exponential growth in configuration evaluation

Figure 3.4: Illustrating FindSmartConfs choosing a "good" configuration based on the partial schedule. a) For each candidate reconfiguration time, b) unschedule tasks after the candidate reconfiguration time, and c) determine the new configuration using PP. Note 1: the PP for tasks c and d on processor f21 is increased because task g has already chosen processor f21, and c and d do not depend on g. Note 2: the PP for task e running on processors cpu1 and f110 is not increased, even though the processors were already chosen, because task e depends on the tasks that chose processors cpu1 and f110.

Figure 3.5: Continuation of Figure 3.4. Illustrating the generation of m − 1 other configurations, and their testing in HEFT-MEG. d) Generate m − 1 other configurations around the "good" configuration found in 3.4.c: for each new configuration, choose one random SuperGroup and swap the chosen SubGroup for another SubGroup. e) For each new configuration, insert a reconfiguration task and reschedule the tasks unscheduled in (b). f) Among all candidate reconfiguration times and configurations (including the originating configuration), choose the partial schedule with the shortest schedule length; in this example, Config2 yields the shortest schedule length. g) Schedule a new task from the task graph, find new candidate reconfiguration times, then for each candidate reconfiguration time, repeat starting at (3.4.a).

as the number of reconfigurable processors grows by generating a constant number of

configurations for each scheduling step.

Figures 3.4 and 3.5 illustrate how FindSmartConfs works with the HEFT-MEG

algorithm to choose a set of “good” configurations to test during scheduling. Figure

3.4.a shows a partial schedule with 9 tasks (a–i), scheduled on a two microprocessor,

two FPGA system. In Figure 3.4.b, tasks b–e and g–i are unscheduled, and Figure

3.4.c illustrates how FindSmartConfs uses the PP cost function to build a new config-

uration based on the tasks unscheduled in Figure 3.4.b. FindSmartConfs fills in the

table in Figure 3.4.c from top to bottom, where the tasks along the left are prioritized

by rank, using Equation 2.2. First, FindSmartConfs calculates task g’s PP for all

processors. Notice that g’s PP is ∞ for processors on which g cannot execute. Then,

g chooses the processor with the minimum PP (f21) to add to the new configura-

tion. Also, all other processors part of f21’s SubGroup are also added to the new

configuration, and processors that are mutually exclusive with f21’s SubGroup are

excluded from the new configuration. The same process is subsequently used for the

other tasks. The entire configuration is specified after task h chooses a processor, but the final two tasks are included for completeness.

Figure 3.5 continues the example in Figure 3.4. Figure 3.5.d illustrates how Find-

SmartConfs generates the additional m−1 other configurations to test. To generate a

new configuration, one SubGroup that is part of the new configuration is chosen and swapped with another SubGroup from the same SuperGroup. Figures 3.5.e–g illustrate how

the new configurations generated by FindSmartConfs are considered by HEFT-MEG

in generating a schedule. A good choice for m depends on architectural features and

the target application. We found that setting m to 5 resulted in high quality sched-

ules. In Figure 3.5.e, the tasks unscheduled in Figure 3.4.b are rescheduled using the

new configurations. In this example, Figure 3.5.e shows that Config2 yields a shorter

schedule than either Config1 or the original configuration. Then, Config2 is saved for the next step in scheduling, so the reconfiguration schedule is iteratively refined in each scheduling step.
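The perturbation step in lines 16–20 of Algorithm 3 can be sketched as follows; representing a configuration as a mapping from each SuperGroup to its active SubGroup, and the helper subgroups_of, are assumptions made for illustration only.

import random

def neighbor_configs(good_conf, subgroups_of, m):
    """Lines 16-20 of Algorithm 3: m-1 random configurations "around" good_conf."""
    conf_set = [dict(good_conf)]
    for _ in range(m - 1):
        kappa = dict(random.choice(conf_set))   # copy a configuration in ConfSet
        sg = random.choice(list(kappa))         # pick one random SuperGroup
        others = [s for s in subgroups_of(sg) if s != kappa[sg]]
        if others:                              # swap in a different SubGroup
            kappa[sg] = random.choice(others)
        conf_set.append(kappa)
    return conf_set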

3.3.3 HEFT-MEG Time Complexity

To calculate the runtime of HEFT-MEG, first we define several variables. An application consists of n total tasks, and the number of edges in the graph is O(n^2) in the worst case of a very dense task graph. The architecture is composed of a total of p distinct logical processors with pr RH processors, and p ≥ pr. Each RH processor has q possible configurations, so the total number of possible configurations

is equal to q^pr. Additionally, with the worst case of an infinite ws, the number of candidate reconfiguration times in line 11 of Algorithm 2 is O(n). Therefore, the naive implementation of HEFT-MEG has a worst case runtime of O(n^4 · p · q^pr). HEFT-

MEG using FindSmartConfs (from Algorithm 3) reduces the runtime significantly.

First, Algorithm 3 has a runtime of O(n^2 · p) and generates a constant number of

configurations (m). Therefore, HEFT-MEG using FindSmartConfs has a worst case

runtime of O(n^4 · p).

If p ≪ n, using a smaller ws reduces the time complexity further. If the ws is

less than or equal to the execution time of the shortest running task, the number

of tasks that can be scheduled on a limited number of processors (p) in the ws is

≤ p. Then, listtimes at line 11 of Algorithm 2 would contain at most p candidate

reconfiguration times. Considering at most p candidate reconfiguration times reduces the loop bounds in line 13 in Algorithm 2 and the number of tasks considered in lines

14 and 18 in Algorithm 2 to O(p). Additionally, the number of tasks considered in

Algorithm 3 is also O(p), because the tasks were extracted from a smaller window in the schedule. Therefore, HEFT-MEG using FindSmartConfs has an O(n^2 · p + n · p^4) time complexity, for a smaller ws. Limiting the ws, even if it is longer than the shortest

running task, still reduces the number of candidate reconfiguration times and number of

tasks considered in Algorithm 3, reducing the execution time of HEFT-MEG. A good

choice for ws depends on architectural features and the target application. In our

case, we wanted to evaluate different reconfiguration decisions considering a number

of different tasks at once, so we found that a ws at least equal to the execution time

of the longest running task generated high quality schedules.

Figures 3.6 and 3.7 compare the actual execution time for HEFT, HEFT-MEG

with the naive reconfiguration generation, HEFT-MEG using FindSmartConfs, and

the hardware-software co-scheduler proposed by Mei et al. [70] (designated as Mei00).

All scheduling tests were performed on a machine with a 1.5 GHz G4 processor, 1.5GB

of RAM, and running Mac OS X 10.4.11 as the OS. All schedulers were written in

Java, and used the 1.5.0_13 Java(TM) 2 Runtime Environment, Standard Edition

virtual machine. In both Figures 3.6 and 3.7, FindSmartConfs generated 5 config-

urations in each scheduling step. In Figure 3.6, each node consists of one FPGA

and one microprocessor, and the task graphs are randomly generated and contain 150

tasks. In Figure 3.6, one can see that HEFT-MEG using FindSmartConfs completes

generating a schedule significantly earlier than HEFT-MEG naive as we increase the

number of nodes in the architecture. This is because HEFT-MEG using FindSmartConfs

Figure 3.6: Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of nodes in the architecture.

Figure 3.7: Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of tasks in the DAG.

avoids the exponential growth in the search space by generating a constant number

of configurations to explore at each scheduling step.

In Figure 3.7, the architecture consists of 2 nodes for all tests, and the task graphs

are randomly generated. Figure 3.7 also shows that the execution time of HEFT-MEG

using FindSmartConfs is shorter than HEFT-MEG naive as we increase the number of tasks in the DAG. However, the difference is not as great as when we increase the number of nodes. In all cases however, both Figures 3.6 and 3.7 show that HEFT-

MEG using FindSmartConfs takes significantly less time to generate a schedule than

the hardware-software co-scheduler proposed by Mei et al. [70].

3.4 Results

3.4.1 Simulation Results

In our simulations, we compared the performance of both versions of HEFT-MEG

to a single configuration HEFT scheduler and a hardware-software co-scheduler pro-

posed in Mei et al. [70]. In our tests, we designate HEFT-MEG naive as the version

of HEFT-MEG that performs an exhaustive search of the configuration space each

scheduling step and HEFT-MEG w/SmartConf as the version of HEFT-MEG that

utilizes the FindSmartConfs function (Algorithm 3) to reduce the amount of the con-

figuration space considered each scheduling step. HEFT using a single configuration

performs an exhaustive search of the configuration space and chooses the configura-

tion that yields the shortest schedule, and is designated as HEFT SingleConf. The

scheduler proposed by Mei et al. is a Genetic Algorithm (GA) that partitions the

tasks between an FPGA and a microprocessor and uses a list scheduler to determine

task execution order [70], which we designate as Mei00. The original Mei00 was

targeted to scheduling a task graph with task deadlines onto a single FPGA, single microprocessor system [70]. We extended Mei00 to schedule onto larger numbers of processors by changing its gene representation to include more than one bit per task, so that a gene represents the processor mapping for each task. Similarly, we changed the fitness function so that Mei00 favored schedules that generated a shorter schedule length to fit into our scheduling model (of tasks without deadlines).

The test architecture is composed of compute nodes, with each node containing a single microprocessor and a single FPGA. For our tests, we generated a random set of configurations for each experiment. The configurations are composed of nine soft-processor types, and each soft-processor utilized (on average) 50% of the total

available FPGA area. A single FPGA can have several soft-processors executing concurrently.

Unless otherwise mentioned, the test DAGs were generated with a communication to computation ratio (CCR) of 1.0, and contained 150 tasks. We performed experiments where we varied the CCR between 0.1 and 10; however, we find the schedule length results do not depend on the task graph's CCR. As HEFT-MEG uses the same list ranking, processor selection, and placement algorithms as HEFT SingleConf, this result is not surprising. We also assume that the probability that a particular task could execute on an FPGA soft-processor was 80% and that the FPGA implementation of a task executes ten times faster than the microprocessor version on average (with a maximum twenty times speedup). Finally, we assume the reconfiguration time is equal to the average task execution time on the microprocessor. For all tests, the schedule length was normalized against HEFT SingleConf.

The results on randomly generated DAGs (Figures 3.8, 3.9, 3.10, and 3.11) show that HEFT-MEG generates the shortest schedules (on average) among all the tested scheduling algorithms. Figure 3.8 plots the normalized schedule length while varying the number of tasks between 50 and 550 on a single computational node. A node is composed of one microprocessor and one FPGA. While Mei00 generates schedules of approximately equal length to those generated by HEFT-MEG for a smaller number of tasks on a single processor, HEFT-MEG significantly outperforms both Mei00 and

HEFT SingleConf as the number of tasks increases. Figure 3.8 shows both versions of HEFT-MEG outperforming HEFT SingleConf by approximately 35% on a single node, and Figure 3.9 shows HEFT-MEG outperforming HEFT SingleConf by just over 20% on two nodes. HEFT-MEG's advantage over HEFT SingleConf comes from its ability to generate schedules that include reconfiguration.

Figures 3.8 and 3.9 show that HEFT-MEG significantly outperforms Mei00, with the performance difference growing as we increase the number of tasks. The effect of

HEFT-MEG outperforming Mei00 for larger numbers of tasks can be explained by two factors. First, as the number of tasks grows, the effect of scheduling with the non-insertion based list scheduler part of Mei00 is cumulative with each successively scheduled task. Secondly, as we increase the number of tasks in the DAG, Mei00’s search space grows exponentially, but Mei00 uses a constant number of individuals and generations in evaluating the GA.

Additionally, examining Figures 3.8 and 3.9, one can see that HEFT-MEG w/SmartConf performs very close to HEFT-MEG naive, despite HEFT-MEG w/SmartConf considering a much smaller number of possible configurations in each scheduling step. In Figure 3.10, we explore the effect of the number of nodes in

Figure 3.8: Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a one node architecture.

the architecture on the schedule length. As the number of nodes in the architecture increases, the relative performance of HEFT-MEG decreases. This effect is because as the number of reconfigurable processors in the architecture grows, the advantage of being able to reconfigure at runtime decreases. In our tests, we assumed the FPGA configurations were composed of 9 soft-processor types. With a 4 node architecture

(containing 4 FPGAs) most (if not all) soft-processor types can be contained in a single configuration. Because of this, HEFT SingleConf can generate schedules with only slightly longer lengths than HEFT-MEG. Additionally, Figure 3.10 shows that as the number of nodes increases, HEFT-MEG w/SmartConf’s performance suffers slightly compared with HEFT-MEG naive. The reduction in performance is due to

Figure 3.9: Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a two node architecture.

the exponential growth of the size of the configuration space (as discussed in Section

3.3.3) and HEFT-MEG w/SmartConf continuing to test only a constant number of configurations each scheduling step. However, the difference in average schedule length between HEFT-MEG w/SmartConf and HEFT-MEG naive is less than 6% for a 4 node architecture, demonstrating that FindSmartConfs generates high-quality configurations likely to reduce overall schedule length.

Figure 3.11 explores how the relative reconfiguration time affects average schedule length. In Figure 3.11, the x axis varies the reconfiguration time, as compared to the average execution time. HEFT-MEG continues to generate schedules that are approximately 20% shorter than HEFT SingleConf until the reconfiguration time is

Figure 3.10: Normalized schedule length for random DAGs varying the number of nodes in the architecture between 1 and 4.

Figure 3.11: Normalized schedule length for random DAGs varying the relative reconfiguration time.

10^4 times greater than the average task execution time, at which point the schedule length of HEFT-MEG is longer than HEFT SingleConf. For longer reconfiguration times, HEFT-MEG does generate single configuration schedules. The reduction in performance from HEFT SingleConf to HEFT-MEG is because HEFT considers the entire DAG in the determination of the configuration, while HEFT-MEG effectively considers only the "top" of the DAG when generating the first configuration; once the first configuration is chosen, changing the configuration is so expensive that reconfiguration tasks are not added to the schedule. Note that the schedules generated using Mei00 are much longer than those of the HEFT schedulers for longer reconfiguration times, and are truncated in Figure 3.11. For a reconfiguration time of 10^6 times longer

Figure 3.12: Normalized schedule length vs. matrix size for Laplace Transform DAGs.

Figure 3.13: Normalized schedule length vs. matrix size for LU Decomposition DAGs.

Figure 3.14: Normalized schedule length vs. matrix size for Gaussian Elimination DAGs.

than the average task execution time, Mei00 generates schedules that are about 245 times longer than HEFT SingleConf.

Figure 3.12 shows that HEFT-MEG generates shorter schedules than both HEFT

SingleConf and Mei00 when scheduling Laplace Transform DAGs. Figure 3.12 shows a similar trend to Figure 3.9, where the schedules generated by HEFT-MEG are approximately 20% shorter than HEFT SingleConf, and do not change much as the matrix size and number of tasks in the DAG increase. Similarly, Mei00's schedule length increases substantially as the matrix size increases. There are similar trends in scheduling LU Decomposition DAGs, shown in Figure 3.13, with HEFT-MEG

generating schedule lengths up to almost 35% shorter than HEFT SingleConf. Interestingly, for matrix sizes between 25 and 45, HEFT-MEG w/SmartConf generates schedules with shorter length than HEFT-MEG naive. This effect is seen because

FindSmartConfs generates configurations that are effective for significant portions of the DAG, and allows HEFT-MEG w/SmartConf to avoid “local minima” when scheduling. However, as HEFT-MEG naive iterates over all possible configurations at each scheduling step, it becomes “stuck” with configurations that may be locally better performing, but increase schedule length overall.

Figure 3.14 shows that HEFT-MEG generates schedules with shorter length than

HEFT SingleConf and Mei00 for Gaussian Elimination graphs. Similar to Figure

3.13, HEFT-MEG w/SmartConf generates shorter schedules than HEFT-MEG naive for several matrix sizes. Again, this is because FindSmartConfs generates effective configurations and allows HEFT-MEG to avoid “local minima.”

3.4.2 Results on TRIPS

We performed experiments running on the TRIPS hardware using GPS Acquisition, a real-world software radio application [109]. First, we decomposed GPS Acquisition into a Directed Acyclic Task Graph (DAG), as shown in Figure 3.15. Note that this DAG is fairly coarse grained, as several tasks (including correlate, integrate, and inter-bin SNR) perform one-dimensional FFTs of several thousand points.

We executed GPS Acquisition on a TRIPS processor evaluation board located at

Air Force Research Labs (AFRL) at Wright Patterson Air Force Base (WPAFB). To obtain task runtime estimates, we measured the runtime of each task when executed

Figure 3.15: Directed Acyclic Task Graph (DAG) of the GPS Acquisition algorithm. The tasks are: Signal Prep, Generate PhaseShift, Generate Digold, Correlate, Integrate, Inter-Bin SNR, and Output.

under both configurations. When a task was executed under the T-Morph configuration, three other copies of the task were executed in the remaining thread slots, to take resource contention into account when determining task runtime estimations. Recon-

figuring the TRIPS processor is a logical reconfiguration involving only writing to a configuration register, so reconfiguration time on TRIPS is relatively short. Recon-

figuration is analogous to flushing the processor’s pipeline and servicing a processor interrupt, taking on the order of several hundred processor cycles. During scheduling,

we used a conservative reconfiguration time estimate of 700 cycles, which corresponds

to about 2µs. We then used HEFT-MEG to generate a task schedule for one iteration of the main loop of the GPS Acquisition algorithm. The resulting schedule is shown in Figure 3.16. In Figure 3.16, a task's execution time is approximately proportional to its length in the time dimension in the schedule.

We compared the runtime of GPS Acquisition using several schedules: the schedule generated with HEFT-MEG in Figure 3.16, and two schedules that utilized one configuration exclusively. The schedule using the D-Morph was trivial, as all tasks are scheduled on a single logical processor, X. To schedule GPS Acquisition onto the

T-Morph, we found an optimal schedule using an exhaustive search of the scheduling space. We then executed all three schedules and compared the total execution time to complete one iteration of the GPS Acquisition algorithm. The current state of the

TRIPS programming tools does not allow the configuration to be changed at runtime, even though the hardware is technically capable. For our tests, we executed each configuration phase independently, and calculated the total runtime by summing the execution time for each phase with the reconfiguration time for each reconfiguration task (assumed to be 700 cycles, or about 2µs). The runtime results are shown in Figure 3.17.

As Figure 3.17 shows, using the HEFT-MEG generated schedule reduces the runtime

of GPS Acquisition by about 20% compared to the D-Morph only schedule and over

50% compared to the T-Morph only schedule.

Figure 3.16: GPS Acquisition's schedule when HEFT-MEG is used for scheduling. Legend: SignalPrep: SP; Generate Digold: GD; Generate PhaseShift: PS; Correlate: C; Integrate: I; Inter-Bin SNR: SNR.

Figure 3.17: Comparing the runtime of GPS Acquisition using different schedules.

CHAPTER 4

THE MODELING AND SCHEDULING OF NETWORK ACCESS

4.1 Introduction

When creating parallel programs to execute on an H-CMP, NoC usage can become a first level design consideration. Contention for the NoC can drastically change the throughput, latency, and efficiency of communication. Unfortunately, a priori determination of contention on the NoC is difficult. This determination is made more difficult when techniques for load-balancing and multi-application execution are applied.

This chapter examines the Cell Processor in particular. Developed jointly by

IBM, Sony, and Toshiba for the Sony Playstation3 gaming system, the Cell processor is an example of an H-CMP [48, 45, 55] that promises impressive performance across a range of application types [111, 110]. The chapter begins by introducing a stochastic model for the contention on the Cell processor's NoC. The chapter then proposes a combination of compile and run time scheduling methods. This hybrid scheduling system uses two schedulers that work in concert, a Compile-time Scheduler (CtS) and a Run-time Scheduler (RtS). As the basis of the CtS, we propose the

Contention Aware (CA-) list scheduler extension. To demonstrate the CA- extension,

we modify the HEFT scheduler, proposed by Topcuoglu et al. [108], to create CA-

HEFT. While we focus on CA-HEFT, the CA- extension can be used in conjunction with any list scheduler with minimal modifications. Next, we propose the Contention

Aware Dynamic Scheduler (CADS) as the RtS. At runtime, CADS adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times. Even though the accuracy of the proposed stochastic model degrades as more messages communicate concurrently, the stochastic model allows CA-HEFT and CADS to schedule "around" network contention and generate higher-quality schedules. We find that using a CtS and RtS in concert improves the performance of several application types in real execution. As the Communication to Computation Ratio (CCR) increases, the performance benefit of using CA-HEFT and CADS to schedule "around" communication contention increases, resulting in up to a 60% reduction in execution time.

4.2 The Cell Processor’s Network on a Chip

Figure 4.1 illustrates the NoC of one example of an H-CMP: the Cell Processor

[48]. The Cell processor is composed of two different processing core types: a single

Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs). The

PPE and SPEs communicate with memory using the Memory Interface Controller

(MIC) and I/O devices using two I/O controllers (I/O 0 and 1). The PPE is a traditional 64-bit PowerPC processor and runs the operating system. The SPEs are high performance vector engines, but each SPE lacks a traditional cache and branch prediction unit; therefore the SPE is optimized for data-parallel code with simple control structures.

Figure 4.1: Block level diagram illustrating the topology of the Cell Processor's NoC.

4.2.1 Cell’s NoC: The EIB

The Cell Processor’s NoC is called the Element Interconnect (EIB). Despite its name, the EIB is organized as four concentric point-to-point rings. The EIB is clocked at 1.6GHz, or half the processing cores’ [48]. Each link in the EIB is capable of transferring 16Bytes per clock cycle, for a peak communication bandwidth of 25.6 GB/s/link [55]. The SPEs and PPE communicate over the EIB using explicit

DMAs. Each SPE includes a Memory Flow Controller (MFC) to manage outstanding

DMA requests, and the MFC can service up to 16Bytes of input and output each clock cycle for a peak simultaneous 25.6 GB/s input and output bandwidth [55].

When communicating over the EIB, the MFC breaks each DMA into packets that are 128Bytes or less in length [55]. The EIB has a centralized data bus arbiter that processes packet requests and decides which ring each packet takes. The arbiter always

selects one of the rings that has the shortest path, ensuring that data will not travel

more than half way around the ring. Also, the arbiter ensures that a new packet will

not interfere with other in-flight transactions by either scheduling the packet on a free

ring or delaying the packet until there is a free ring. Packet requests are scheduled

in a round-robin fashion. As a DMA message is likely to be significantly larger than

a single packet, the round-robin scheduling allows multiple DMAs to fairly share the

NoC bandwidth [55]. The centralized arbiter ensures that there are no conflicts or

collisions on the NoC, so there are no in-network packet buffers; rather new packet

requests are delayed at the source until the arbiter decides to schedule the packet

request. The EIB’s maximum data bandwidth is limited by the rate the centralized

arbiter can handle new packet requests, which is one per clock cycle. Each request can

transfer up to 128Bytes, so the theoretical peak bandwidth on the EIB is 128Bytes

× 1.6 GHz = 204.8 GB/s [55].

4.2.2 Cell EIB: In-Network Contention

A message’s packets are scheduled to the EIB’s links in a round-robin fashion,

allowing multiple messages to share links in a time-slice fashion [55]. Even when

two messages do not share a source or destination, it is still possible for them to contend on

the NoC. Figure 4.2 illustrates how messages with different sources and destinations

can still compete for NoC resources, and how it affects the total message latency.

Figure 4.2 does not show the MIC or I/O cores; the PPE and SPEs are all viewed

as identical as far as communication is concerned, and all messages are assumed to

be 512Bytes. In Figure 4.2.a, three messages are initiated simultaneously: 0 → 3,

1 → 4, and 2 → 6. Because all three messages use the link between 2 and 3 and the

Figure 4.2: Illustrating the operation of the Cell NoC. The latencies of messages 0 → 3, 1 → 4, and 2 → 6 depend on ordering by the arbiter and sharing of links on the NoC. a) Three messages (0 → 3, 1 → 4, and 2 → 6) are initiated, but the NoC can only transmit two of the flows simultaneously; b) messages 0 → 3 and 1 → 4 communicate concurrently while 2 → 6 blocks waiting for access.

61 Cell’s NoC can support only two concurrent messages on one “hop” simultaneously,

the centralized arbiter time-slices the communication so that the messages can share

the links over time. Figure 4.2.b shows how the three messages are transmitted on

the NoC over time. Each horizontal line in Figure 4.2.b represents one clock cycle

on the EIB. Each message is 512Bytes, which corresponds to four 128 Byte packets.

Each packet takes lp cycles to transmit:

lp = 128 Bytes / (16 Bytes/cycle) + m = 8 + m    (4.1)

Where m is the number of hops the message needs to reach its destination. The

m term in Equation 4.1 is added to the time it takes the system to transmit each

packet due to “pipelining” in the network. It takes m cycles for the first 16 Bytes in

the message to reach the destination; after the first 16 Bytes arrives, a new 16 Byte

portion of the data arrives every cycle. As an example, m = 3 for 0 → 3 and 1 → 4

and m = 4 for 2 → 6.
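As a small illustration of Equation 4.1 and the packet accounting used in this example (the helper names below are ours, not from the chapter):

def packet_latency(m):
    """Equation 4.1: cycles to transmit one 128-byte packet over m hops."""
    return 128 // 16 + m            # 8 cycles of data plus m cycles of pipeline fill

def packets_in_message(size_bytes):
    """Number of 128-byte packets the MFC splits a DMA into."""
    return -(-size_bytes // 128)    # ceiling division: 512 bytes -> 4 packets

# For Figure 4.2: packet_latency(3) = 11 cycles for 0->3 and 1->4,
# and packet_latency(4) = 12 cycles for 2->6.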

For the example shown in Figure 4.2, the message 0 → 3 has a latency of 57 cycles, while the latencies for the messages 1 → 4 and 2 → 6 are 68 and 70 cycles, respectively. Note that these latencies are only valid assuming that all three messages arrive simultaneously, and the centralized arbiter chooses the message originating from 0 first. If the message 0 → 3 arrives even one cycle later, 0 → 3 is blocked

first, leading to 1 → 4 and 2 → 6 having shorter latencies than 0 → 3. While a message’s total latency can depend on the order the centralized arbiter grants link access to the messages, the dependence decreases as the message size increases. Note that the 512Bytes message size considered in Figure 4.2 is smaller than the message size the Cell processor’s programming tools recommend for high performance [98], so

the effect of message ordering by the arbiter is lower than that shown in Figure 4.2 for most messages.

4.3 Communication Model

Based on the Cell Processor’s characteristics [55], our communication model as- sumptions are:

1. Assuming no contention, each message transmits at 25.6 GB/s.

2. Each processor can inject up to 25.6 GB/s of data into the network and ingest

up to 25.6 GB/s of data from the network simultaneously.

3. The Cell’s EIB is modeled as a bidirectional ring, with each link capable of

transmitting 51.2 GB/s in each direction.

4. A message’s bandwidth is equal for each hop in the network.

5. When multiple messages communicate over the same link, they fairly share the

link’s bandwidth.

6. When multiple messages share an end-point, they fairly share the end-point’s

maximum input or output bandwidth.

7. A message always takes the shortest path on the EIB.

Note that our assumptions model an approximation of the Cell NoC. Notably, we are not considering the overheads involved in accessing the centralized data arbiter.

Also, we assume that messages completely share communication links, even though sharing is done on a round-robin, time-sliced basis with packets that are 128Bytes or less [55].

4.3.1 Calculating End-Point Contention

Using our assumptions, it is possible to calculate a message’s requested bandwidth

(BWr). A message’s requested bandwidth is the highest possible bandwidth achiev-

able for the particular message, assuming no contention within the EIB, but taking

into account each processor’s input/output limitations.

First, we define a set of BWm inequalities, which describe that the sum of all messages' bandwidths coming into or out of a node is limited by that node's maximum bandwidth. There is one BWm inequality for each port, and for processor m, its two

BWm inequalities are:

BWm ≥ Σ_{msgi ∈ out(m)} BWr(msgi)    (4.2)

BWm ≥ Σ_{msgj ∈ in(m)} BWr(msgj)    (4.3)

BWm is the maximum bandwidth an EIB element can inject or ingest from the

network, and is equal to 25.6 GB/s. Then, in(m) is the set of all messages whose

destinations are processor m, out(m) is the set of all messages whose sources are

processor m, and BWr(msgi) is the requested bandwidth for message i.

Algorithm 4 solves the BWm inequalities to accurately model how a node’s in-

put/output bandwidth is allocated to messages in the Cell’s NoC. A message’s re-

quested bandwidth depends on its endpoint with the highest contention. Algorithm 4

solves the BWm inequalities by starting with the port that has the most messages ac-

cessing it concurrently, equally dividing its bandwidth among all the messages access-

ing it. Algorithm 4 then iteratively solves the remaining inequalities in order of the number of messages, substituting previously found BWr(msgi) values in each iteration.

Algorithm 4 Algorithm for solving the set of requested bandwidth inequalities
 1: procedure Solve-Inequalities(Msgs)   ▷ Msgs is the set of concurrently communicating messages
 2:   for all msgi ∈ Msgs do
 3:     Initialize BWr(msgi) to NaN
 4:   end for
 5:   InEqs := set of inequalities   ▷ From Equations 4.2 and 4.3
 6:   Sort InEqs by number of messages
 7:   while there are un-solved inequalities in InEqs do
 8:     Select ineqi ∈ InEqs with the most messages
 9:     Remove ineqi from InEqs
10:     Find all messages in ineqi with BWr = NaN and put in list unsolved
11:     Find all messages in ineqi with BWr ≠ NaN and put in list solved
12:     BWrem := BWm − Σ_{mi ∈ solved} BWr(mi)
13:     for all messages msgj ∈ unsolved do
14:         ▷ Equally divide the remaining bandwidth
15:       BWr(msgj) := BWrem / length(unsolved)
16:     end for
17:   end while
18: end procedure
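The following is a minimal, runnable sketch of Algorithm 4 in Python, assuming each message is represented as a (src, dst) tuple; because the per-port message counts do not change while solving, sorting the inequalities once is equivalent to repeatedly selecting the one with the most messages.

BW_MAX = 25.6  # GB/s, per-port injection/ingestion limit

def solve_inequalities(msgs):
    """Requested bandwidth BW_r for each message, given as (src, dst) tuples."""
    bw_r = [None] * len(msgs)              # None plays the role of NaN
    ports = {}                             # one inequality per port
    for k, (src, dst) in enumerate(msgs):
        ports.setdefault((src, 'out'), []).append(k)
        ports.setdefault((dst, 'in'), []).append(k)
    # message counts never change, so one sort equals repeated max-selection
    for port in sorted(ports, key=lambda p: -len(ports[p])):
        unsolved = [k for k in ports[port] if bw_r[k] is None]
        if not unsolved:
            continue
        solved_sum = sum(bw_r[k] for k in ports[port] if bw_r[k] is not None)
        for k in unsolved:                 # equally divide the remainder
            bw_r[k] = (BW_MAX - solved_sum) / len(unsolved)
    return bw_r

# For example, two DMAs leaving the same node fairly split its output port:
# solve_inequalities([(0, 2), (0, 5)]) returns [12.8, 12.8].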

4.3.2 Calculating NoC Contention

The H-CMP’s fine-grained and tightly integrated nature enable a scheduler to exert significant control of where and when tasks execute and communicate over the

NoC. In our intended model however, a Run-time Scheduler (RtS) can modify where tasks execute within the H-CMP, so the exact communication pattern cannot be determined at compile-time. Also, even slight differences between the predicted and actual task execution time and communication times can drastically change transient

NoC contention. Additionally, programming tools currently available for the Cell processor do not allow for tasks to be deterministically mapped to a particular processing core [98].

Figure 4.3 illustrates how concurrent communications in the Cell Processor affect a message's realized bandwidth. We used a Sony Playstation3 and a custom communication benchmarking application to test the realized vs. requested bandwidth of a single message. For the test, the concurrent, independent communications did not share a source or destination with the test message. All messages' sources and destinations were otherwise uniformly distributed in the architecture. Note that with the limitations of the Playstation3, the test only utilizes 6 SPEs, as the final two SPEs are disabled for user access.

Figure 4.3: Illustrating concurrent, independent communications reducing the realized bandwidth of other messages in the system.

Figure 4.3 illustrates how the realized bandwidth of a test message depends on

the number of concurrently communicating messages. In Figure 4.3, the error bars represent one standard deviation from the mean, showing that the realized bandwidth is a stochastic process that depends on the requested bandwidth and the other mes-

sages in the network. Figure 4.3 illustrates the possible value in treating a message’s

delay as a random variable.

Calculating the pdf describing the maximally loaded link

In the previous section, we calculated the effect end-point contention has on a

message’s requested bandwidth. In this section, we calculate the probability density

function (pdf) describing a message’s latency, based on the number of messages in

the network. Because the RtS and the programming tools for the Cell processor can

remap tasks to processors on the EIB, we model a processor’s position on the EIB

as a uniformly distributed random variable. This is an approximation of the actual

system, as PPE threads can only execute on the PPE, and the SPE threads cannot be

mapped to the PPE. Additionally, we assume that when two or more messages’ paths

share a link, all messages evenly share the link’s bandwidth. Our model does not

take ordering by the arbiter into account, or that link access is shared in a time-sliced

fashion. Finally, we assume that all processors can send only one message at a time,

but processors can receive multiple messages simultaneously. For the cases when

multiple messages share destinations, we use the end-point contention model and

calculation presented in the previous section to determine each message’s requested

bandwidth.

To calculate the pdf describing a message’s latency, we first define the test message as the message of interest, or the message whose latency will be calculated. For

67 our calculations, we assume that no other messages share an end-point with the

test message, since the effect of end-point contention was already determined in the

previous section. Also, for simplicity, we assume that all other messages contend for

the entire time the test message is communicating.

Let the random variable x represent the sum of the requested bandwidths by all messages using the maximally loaded link on the test message’s path. The maximally loaded link determines the bandwidth a message achieves. When the test message is the only message in the system, the pdf of x (f1(x)) is given as:

f1(x) = 1 · δ(x − 25.6GB/s), (4.4)

indicating that the probability is 1 that the maximally loaded link has a load of 25.6

GB/s. The pdf of x for the case with two messages (f2(x)) depends on the probability

that the second message overlaps part of its path with the first message. An analytical

solution is fairly simple, with the pdf f2(x) found as:

f2(x) = (1 − p) · δ(x − 25.6GB/s) + p · δ(x − 51.2GB/s) (4.5) where p is the probability two messages overlap on a bidirectional ring, given that the messages do not share end-points. For the Cell processor in the Playstation3, p ≈ 13%.
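The overlap probability p can be estimated with a small Monte Carlo sketch like the one below; the 8-position ring (the PPE, six SPEs, and main memory), the uniform end-point distribution, and the clockwise tie-breaking for equal-length arcs are assumptions made for illustration, so the estimate only approximates the enumeration that produced p ≈ 13%.

import random

N = 8  # ring positions: the PPE, six SPEs, and main memory (an assumption)

def path_edges(src, dst):
    """Directed links used by the shortest arc from src to dst on the ring."""
    cw = (dst - src) % N
    step = 1 if cw <= N - cw else -1   # ties broken clockwise (an assumption)
    edges, node = set(), src
    while node != dst:
        nxt = (node + step) % N
        edges.add((node, nxt))         # direction matters: opposite-direction
        node = nxt                     # traffic rides different rings
    return edges

def estimate_overlap_probability(trials=100_000):
    hits = 0
    for _ in range(trials):
        a, b, c, d = random.sample(range(N), 4)   # four distinct end-points
        if path_edges(a, b) & path_edges(c, d):
            hits += 1
    return hits / trials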

Calculating f(x) with three messages

The pdf of x for the case with three messages (f3(x)) depends on how three messages can overlap on the NoC. For three messages, Figure 4.4 illustrates the five cases of how two messages can overlap with a test message on the Cell's EIB. First is the case where neither additional message affects the test message (for the maximum

Figure 4.4: Illustrating how two messages can overlap with a test message. a) Neither message affects the test message. b) One message affects the test message. c) Both messages overlap, independently. d) One message overlaps, but both messages share an end-point. e) Both messages overlap with the test message over one link.

69 load of 25.6 GB/s), with an example shown in Figure 4.4.a. Figure 4.4.b illustrates the second case, where one message overlaps the path with the test message and the second message does not, making the maximally loaded link have the sum of the requested bandwidth equal 51.2 GB/s. The third case is illustrated in Figure 4.4.c, where both messages’ paths overlap the test message’s path, but since the additional messages’ paths do not overlap, it has the same effect as the case illustrated in Figure

4.4.b. The fourth case is where one message overlaps with the test message, but that message's requested bandwidth is reduced because it shares a destination with the other message. As illustrated in Figure 4.4.d, the two messages "chain" together to affect the test message. Here, the two additional messages share a destination, so their requested bandwidth is limited by end-point contention; therefore, the maximally loaded link is loaded with 25.6 GB/s + 25.6/2 GB/s = 38.4 GB/s of requested bandwidth. The last case is where the two additional messages overlap with the test message, and all messages overlap on one or more links, making the maximally loaded link's sum of its requested bandwidth 76.8 GB/s. Illustrated in Figure 4.4.e, this case is the one that has the largest effect on the test message's latency.

Calculating f(x) with more than three messages

Calculating fm(x) for larger numbers of messages is similar to the case with 3 total messages, but the addition of each message increases the number of cases that need to be calculated. We wrote a program to enumerate every possible scenario and determine the probability of each case. This was feasible only because the targeted system was the Cell processor in the Playstation3, which contains only 7 total processors

(one PPE and 6 SPEs). For larger numbers of processors, a brute-force determina-

tion would quickly become intractable. The results for several examples with more

than three total messages are shown in Figure 4.5.

Calculating message latency using f(x)

Calculating the pdf describing a message’s latency using f(x) is then straightfor-

ward. We model each link on the NoC as able to transmit at 51.2 GB/s, and multiple

messages can fairly share a link's bandwidth. Then, we define y as a random variable

describing the test message’s latency. With x describing the sum of the requested

bandwidths of the maximally loaded link on the test message’s path,

y = s / ( min(1, 51.2 GB/s / x) · BWr ) + l    (4.6)

where l is the network latency overhead, and s is the data size of the message. The mapping from the maximally loaded link on the test message's path to the latency is a simple one, overlooking several factors. First, the number of links shared on the NoC between two messages can affect the total latency, with longer overlapping paths introducing more overhead on the EIB's centralized arbiter [55]. Additionally, we assume that all the messages contend for the entire time the test message is executing, even though some messages may finish earlier or start later. Finally, all messages executing concurrently contend for access to the EIB's centralized arbiter.

We account for the final factor in our definition of l. To account for contention at the central arbiter, l is defined as:

l = (s / 128 Bytes) · (1 / 1.6 GHz) · ( x / (2 · (8 + h)) + h )    (4.7)

This describes how access to the centralized arbiter can affect a message’s latency

on average. A message needs to query the EIB’s centralized arbiter to transmit each

71 packet. It takes an average of 8+h cycles for the EIB to transfer a 128 byte packet [55],

where h is the average number of hops in the network a packet takes. Then, all the

messages are attempting to access the centralized arbiter approximately every 8 + h

cycles. Therefore, the number of messages attempting to access the EIB's centralized arbiter in a cycle is x/(8 + h) on average. For the Cell processor in the Playstation3, h ≈ 2.28, as there are eight possible sources or destinations for a message (one PPE, six SPEs, and main memory). Finally, the average number of cycles a message waits for the arbiter is x/(2 · (8 + h)), as access to the arbiter is granted in a round-robin fashion, and the EIB is clocked at 1.6 GHz on the Cell processor [55].
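Putting Equations 4.6 and 4.7 together, a minimal sketch is shown below. Treating x in the arbiter term as an effective message count (x divided by 25.6 GB/s) is our reading of the derivation above, not something the text states explicitly, and the constants follow the Playstation3 values used in this chapter.

LINK_BW = 51.2      # GB/s per ring direction
EIB_CLOCK = 1.6e9   # Hz
H_AVG = 2.28        # average hop count on the Playstation3's EIB

def arbiter_overhead(s, x, h=H_AVG):
    """Equation 4.7: average arbiter delay (seconds) for an s-byte message."""
    n_eff = x / 25.6                            # effective message count (our reading)
    cycles_per_packet = n_eff / (2 * (8 + h)) + h
    return (s / 128.0) * cycles_per_packet / EIB_CLOCK

def message_latency(s, x, bw_r):
    """Equation 4.6: expected latency (seconds), given link load x in GB/s."""
    realized_bw = min(1.0, LINK_BW / x) * bw_r  # GB/s the message achieves
    return s / (realized_bw * 1e9) + arbiter_overhead(s, x)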

4.3.3 Experimental Verification: NoC Contention

Figure 4.5 shows that our model predicts a message’s likely latency using only

the number of other concurrent messages in the network as an input. As there are

only 8 possible sources and destinations for a message (including the PPE, all SPEs,

and main memory), we do not test message latency for more than 8 concurrent

messages. Figure 4.5 compares the predicted and measured latencies for messages

that are 16KB in length; results for other message sizes were similar. If the

predicted probability would result in a relative frequency of less than one result per

2500 trials, the data point is not shown in Figure 4.5.

In Figure 4.5.a, the predicted latency is based on only a few possibilities. In the

actual system however, the delay is spread over a larger range. This is because our

mapping from the pdf describing how messages share links to the delay is a simple one,

not taking several factors into account. For instance, the ordering the arbiter places

on access to the EIB can affect the message’s latency, and our model assumes the

Figure 4.5: Comparing the predicted pdf and experimental relative frequency of a test message's latency for 2, 3, and 5 concurrent messages.

other messages communicate for the entire time the test message is communicating.

Therefore, the predicted data points are an aggregation of a number of points over a range of actual latencies. Even with the simple mapping however, the model does predict expected latency well. A similar trend is seen in Figure 4.5.b. In Figure 4.5.c, the experimental data begins to be spread more uniformly over a larger range of latencies as the secondary effects we are not accounting for become more important.

Although the simple stochastic model we use does not accurately predict the test message’s actual probability density function, the predicted average latency is similar to the actual average latency.

4.4 Software System Overview

Figure 4.6 illustrates the operation and interaction of the proposed compile and runtime scheduling systems. As shown in Figure 4.6, the scheduling system is broken into two parts: the Compile-time Scheduler (CtS) and the Run-time Scheduler (RtS).

We decided to implement a hybrid scheduling system because scheduling at compile-time enables the use of more complex scheduling algorithms, but not all relevant information is available at compile-time, so the proposed system augments the CtS with an RtS that modifies the schedule at runtime.

In Figure 4.6.a, we assume that the application is represented as a directed acyclic task graph (DAG). While several automatic task graph generation techniques have been proposed [4, 30], we assume that the application’s task graph is created by an

“expert” developer. Then in Figure 4.6.b the RtS modifies the compile-time schedule to actual execution conditions at runtime. In addition to adapting the schedule at runtime to actual execution conditions, the RtS could also be used to merge the

Figure 4.6: System overview. Applications are represented as a task graph. a) The Compile-time Scheduler turns the application's task graph into a task schedule; b) at runtime, the Run-time Scheduler combines that schedule with run-time information from the Cell processor to produce a modified schedule.

schedules of several different applications, or it could be used to schedule the portions of an application whose execution cannot be represented as a DAG.

4.5 Scheduling on the Cell Processor

4.5.1 Compile Time Scheduling

In this section we define our CtS scheduling heuristic: Contention Aware Heterogeneous Earliest Finish Time (CA-HEFT). CA-HEFT is based on the HEFT scheduling

heuristic, a list scheduler described in [107, 108]. While this chapter uses HEFT as the "base scheduler", the CA- scheduling extension can be applied to any list scheduler with minimal changes. The CA- scheduling extension updates task start and end times based on the communication model proposed earlier, informing the base scheduler of how network contention affects communication time and task start and end times.

Algorithm 5 CA-HEFT Algorithm
 1: procedure CA-HEFT(G = (V, E, w, c))   ▷ G is a task graph
 2:   Compute rank for all tasks t ∈ V   ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, ni, and remove from list
 6:     Insert the task ni on the processor pj that minimizes the EFT value of ni   ▷ Equation 2.6
 7:     for all tasks nm where C(ni, nm) = 1 do   ▷ Equation 4.8
 8:       Recompute start and finish times for nm according to the network model
 9:       Propagate any finish time changes to all descendants of nm
10:     end for
11:   end while
12: end procedure

Like HEFT, CA-HEFT is a static scheduler, analyzing an application at compile time and generating a runtime schedule. Tasks are scheduled in order by rank, as defined in Equation 2.2, and the scheduling cost function is EFT, as defined in Equation 2.6. As it uses HEFT as the base scheduler, CA-HEFT also uses an insertion-based policy that considers inserting tasks into idle time slots between two already-scheduled tasks on a processor, as originally described in [108]. When calculating tr(ni, pj) (the time all data generated by ni's immediate predecessors would be available to processor pj), the CA- scheduling extension uses the Cell processor network model detailed in the previous sections to calculate expected communication time.

To allow the CA- extension to inform the base scheduler of contention on the network, we define C(ni, nj), which is used when recalculating task finish times:

$$C(n_i, n_j) = \begin{cases} 0 & \text{if } n_i = n_j \text{, or if } n_i \text{ does not communicate concurrently with } n_j \\ 1 & \text{otherwise} \end{cases} \qquad (4.8)$$

Then, when a task, ni, is scheduled, previously scheduled tasks for which C(ni, nj) is equal to 1 have their finish times recalculated according to the Cell processor communication model. For this model, we schedule assuming that the communication time is equal to the expected value of the message's latency, using the pdf calculated in the above section. Before running CA-HEFT, we calculate the expected value of the maximally loaded link for all relevant numbers of messages in the network, so that calculating the expected value of a message's latency is a constant-time operation.
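As a hedged illustration of that precomputation, the sketch below builds a lookup table so each latency query is a constant-time operation; expected_latency_per_byte is a stand-in for the stochastic model of the earlier sections, and its constants are invented.

```python
def expected_latency_per_byte(nu: int) -> float:
    """Placeholder for the stochastic NoC model: expected per-byte latency
    when nu other messages are using the network (constants invented)."""
    return 1.0 + 0.35 * nu

MAX_CONCURRENT = 12   # assumed bound on simultaneous messages

# Precompute once, before CA-HEFT runs, so each lookup is O(1).
LATENCY_TABLE = [expected_latency_per_byte(nu) for nu in range(MAX_CONCURRENT + 1)]

def expected_comm_time(volume: float, nu: int) -> float:
    """Expected communication time for a message of `volume` bytes."""
    return volume * LATENCY_TABLE[min(nu, MAX_CONCURRENT)]
```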

The CA-HEFT scheduling algorithm is detailed in Algorithm 5. The CA- extension's worst-case time complexity is O(n^2), and the CA- extension is executed after the base scheduler schedules each task. Therefore, the runtime of CA-HEFT is O(n^2 p + n^3) for scheduling a task graph with n tasks onto p processors. The CA- extension to HEFT [108] appears in the added lines 7–10 of Algorithm 5, which update task start and finish times according to the network model.
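The sketch below is one hedged reading of lines 7–10: after a task is placed, every previously scheduled task whose message overlaps the new task's message in time (the C(ni, nj) = 1 case of Equation 4.8) has its times re-derived. The interval representation and the recompute/propagate callbacks are assumptions, not the actual implementation.

```python
def concurrent(interval_a, interval_b) -> bool:
    """C(ni, nj) of Equation 4.8 as an interval test: 1 (True) iff the two
    messages are in flight at the same time."""
    (s1, e1), (s2, e2) = interval_a, interval_b
    return s1 < e2 and s2 < e1

def ca_update(scheduled, new_task, comm_interval, recompute, propagate):
    """Lines 7-10 of Algorithm 5: revisit tasks that now contend with new_task.
    `recompute` re-derives a task's start/finish from the network model;
    `propagate` pushes finish-time changes to the task's descendants."""
    for other in scheduled:
        if other is new_task:
            continue
        if concurrent(comm_interval[new_task], comm_interval[other]):
            recompute(other)
            propagate(other)
```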

4.5.2 Run Time Scheduling

The goal of the Run-time Scheduler (RtS) is to adapt the schedule generated by the CtS to actual execution conditions. As Figure 4.6 shows, the RtS uses the schedule generated by the CtS as an input to perform scheduling at runtime. This approach has the advantage that the CtS can perform relatively expensive task-graph analysis off-line to generate a high-quality schedule. Then, the RtS updates the schedule based on actual system conditions. Additionally, this hybrid system can enable the flexibility to run multiple applications simultaneously on the same processing cores by performing transforms to merge multiple schedules, and the ability to schedule applications that cannot be represented a priori as a DAG. However, this chapter only explores the first point: using actual system conditions to update the schedule at runtime.

Figure 4.7: Operation of the CADS re-mapper.

We propose the Contention Aware Dynamic Scheduler (CADS) as the RtS in our scheduling system. This chapter presents CADS as a companion scheduler for CA-HEFT, but CADS can be used with any static scheduler that prioritizes tasks. Figure 4.7 illustrates the operation of the CADS re-mapper, and how it interacts with the CtS.

While previously proposed hybrid schedulers use static blocks when remapping [13, 67, 68], CADS introduces the concept of dynamic blocks when scheduling at runtime. At each scheduling decision, CADS examines only the tasks in the active block, choosing one task in the block to map next. When using dynamic blocks, active block membership depends on what tasks have already been scheduled and the statically generated schedule (taken as an input from the CtS). In Figure 4.7.a, CADS has already scheduled task t1, so the active block at this point in scheduling consists of {t2, t3, t4}. In Figure 4.7.b, CADS decides to remap task t4 to execute on processor P2. Figure 4.7.c illustrates how the active block's membership is updated to include task t6, as it is the task that was scheduled to execute immediately after task t4.

Before describing CADS in more detail, we first define the cost function Ω. Ω uses the bottom-level rank (rankb) of a task as calculated by CA-HEFT (rankb is defined in Equation 2.2). CADS uses Ω to rate a task ni when scheduling, defined as:

$$\Omega(n_i, p, \nu_c) = \gamma \cdot \nu_c \cdot c_i + (-1) \cdot \frac{\mathrm{rank}_b(n_i)}{P(n_i, p)} + R(n_i) \qquad (4.9)$$

where p is the processor that task ni is being tested for and νc is the current number of tasks communicating in the system. Then, γ is the user-defined penalty constant, ci is the estimated total communication time for ni, and rankb(ni) is the rank used by the CtS when scheduling the task. The variables ci and rankb(ni) are calculated by the CtS ahead of time and are stored in the input DAG. R(ni) is used to determine if the task ni is ready to execute, and is defined as:

$$R(n_i) = \begin{cases} 0 & \text{if } n_i\text{'s predecessors have finished} \\ \infty & \text{otherwise} \end{cases} \qquad (4.10)$$

Finally, the CADS processor penalty P(ni, p) is defined as:

$$P(n_i, p) = \begin{cases} 1 & \text{if task } n_i \text{ was scheduled on } p \\ 1 + \Delta & \text{otherwise} \end{cases} \qquad (4.11)$$

with ∆ being another user-defined penalty constant. The CADS cost function Ω adjusts each task's rank depending on the current machine conditions, namely the number of concurrently communicating tasks and whether the processor under consideration is the one on which the task was originally scheduled.
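For illustration, the sketch below renders Equations 4.9–4.11 directly in Python; the TaskInfo fields are assumed stand-ins for the values the CtS stores in each task, and γ and ∆ are set to the 0.15 used in the experiments of Section 4.6.

```python
import math
from dataclasses import dataclass

GAMMA = 0.15   # gamma, as set in Section 4.6
DELTA = 0.15   # Delta, as set in Section 4.6

@dataclass
class TaskInfo:
    """Values the CtS is assumed to save per task (illustrative names)."""
    rank_b: float             # bottom-level rank from Equation 2.2
    comm_time: float          # estimated total communication time c_i
    scheduled_on: str         # processor chosen by CA-HEFT
    preds_finished: bool = True

def R(t: TaskInfo) -> float:
    """Equation 4.10: 0 once every predecessor has finished, else infinity."""
    return 0.0 if t.preds_finished else math.inf

def P(t: TaskInfo, proc: str) -> float:
    """Equation 4.11: no penalty on the originally scheduled processor."""
    return 1.0 if t.scheduled_on == proc else 1.0 + DELTA

def omega(t: TaskInfo, proc: str, nu_c: int) -> float:
    """Equation 4.9: CADS rating of mapping t to proc; lower is better."""
    return GAMMA * nu_c * t.comm_time - t.rank_b / P(t, proc) + R(t)
```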

Algorithm 6 CADS Algorithm
 1: procedure CADS(S)
 2:     ▷ S is a schedule generated by a CtS
 3:     Initialize active block, A
 4:     while there are unscheduled tasks do
 5:         Wait for an idle processor, p
 6:         Read the number of communicating tasks, νc
 7:         ωmin := ∞
 8:         while ωmin = ∞ do
 9:             for all tasks ni in active block A do
10:                 ωi := Ω(ni, p, νc)            ▷ Equation 4.9
11:                 if ωi < ωmin then
12:                     ωmin := ωi
13:                     nmin := ni
14:                 end if
15:             end for
16:         end while
17:         Schedule nmin to execute on p
18:         Remove nmin from active block A
19:         Add nextTask(p, nmin) to active block A
20:     end while
21: end procedure

Algorithm 6 is a pseudo-code representation of CADS. Here, we assume that when CADS begins running, no tasks from the task graph have begun execution, and that CA-HEFT saved many of the values it calculated within each task's structure. We also assume that there is some other mechanism in place to alert a task when all of its input data is ready, and that counts the number of concurrently communicating tasks. In Algorithm 6, line 3, the active block is initialized to hold the first task each processor is scheduled to execute. CADS runs until every task has been scheduled to execute on a processor. In Algorithm 6, line 5, CADS waits for a processor to become idle before the next scheduling decision is made. Since Ω (Equation 4.9) returns ∞ for tasks that have a predecessor that has not finished execution, once a processor becomes idle, CADS waits at the while loop at line 8 until there is a ready task. Finally, the function nextTask(p, ni) returns the task that was scheduled after ni on processor p.

Assuming there is at least one task with all of its inputs available, the runtime of Algorithm 6 is O(p), where p is the number of processors. This worst-case runtime arises because calculating Ω is a constant-time operation, adding the next task to the active block is also a constant-time operation, the loop at line 9 iterates over all tasks in the active block, and the number of tasks in the active block is always ≤ p.

4.6 Scheduling Results

We ran our tests using a Sony Playstation3 as a Cell processor evaluation platform. Again, the Playstation3 can only utilize six SPEs, as two SPEs are disabled for user access in the Playstation3. Additionally, while SPE-to-SPE communication is possible on the Cell processor, we utilized the Accelerated Library Framework to write the task management and data-movement code, so all communication was performed through main memory [98]. Finally, during execution, the SPEs executed all tasks while the CADS re-mapper executed on the PPE. All tests were actual executions on the Cell processor. For the tests, we set the user-defined constants γ (from Equation 4.9) and ∆ (from Equation 4.11) to 0.15. In our experience this value yielded good results across all test applications.

The test applications were Gaussian elimination, Laplace transform, LU decomposition, and random task graphs. To adjust the communication-to-computation ratio (CCR), we scaled the task run times to appropriate values for the communication volume before execution. This was done through the use of a "dummy" loop in the main body of the task execution to take up more or less time as needed for the particular test. Our tests used random data, so the output at the end of our tests was not meaningful; however, real communication patterns and data sizes were used in all tests. All execution times were averaged over 20 trials and normalized against the Reference scheduler. The Reference scheduler is the default scheduler used by the Accelerated Library Framework; developed by IBM and used in its system software, it is expected to be a high quality scheduler [98]. For our results, execution times using only the CA-HEFT CtS are marked as CA-HEFT, and execution times when using both our proposed CtS and RtS are marked as CA-HEFT+CADS.

Figures 4.8 and 4.9 compare the execution times of randomly generated DAGs executing on the Cell platform when scheduled using the three different schedulers. Figure 4.8 plots the normalized execution time versus the communication-to-computation ratio (CCR) for DAGs with 500 tasks. One can see that as the CCR increases, both CA-HEFT and CA-HEFT+CADS generate higher quality schedules compared to the Reference scheduler. It is interesting to note that the benefit from using CADS starts at a lower CCR than the benefit from using CA-HEFT only. This is because CADS is able to better schedule "around" contention on the network and react to actual system conditions. Figure 4.9 plots the normalized execution time versus the number of tasks in the randomly generated DAGs with a CCR of 1/2. Here, we see that the performance of CA-HEFT compared with the Reference scheduler stays fairly constant as we increase the number of tasks in the graph. However, the benefit of using the CADS re-mapper increases slightly as we increase the number of tasks in the graph, resulting in over a 20% reduction in execution time for large task graphs.

Figure 4.8: Normalized schedule length for random DAGs varying the CCR between 0.01 and 10.

Figure 4.9: Normalized schedule length for random DAGs varying the number of tasks between 200 and 800.

Figures 4.10 and 4.11 compare the execution times of Gaussian elimination DAGs using the proposed schedulers to the Reference scheduler. In Figure 4.10, one can see that CA-HEFT generates schedules that result in a little less than a 20% reduction in execution time, regardless of the DAG's CCR. However, the addition of CADS decreases the execution time as we increase the CCR, to about a 60% reduction for a CCR of 10. In Figure 4.10, we can also see that CA-HEFT likely does not successfully schedule around all network contention, because the execution time does not improve as we increase the CCR. However, CADS does successfully handle higher rates of communication, as the execution time benefit grows as the CCR increases. Also, it is interesting to note that for low CCRs, CADS's additional overhead can adversely affect execution time, although the penalty seen in Figure 4.10 is only about 6% over the execution time of using CA-HEFT alone. Figure 4.11 plots the normalized execution time versus the matrix size of the input to the Gaussian elimination DAG, where the DAG has a CCR of 1/2. For small matrix sizes, performance with CA-HEFT is worse than the Reference scheduler, but CA-HEFT's performance improves to a bit less than a 20% reduction in execution time as we increase the matrix size. CADS reduces the execution time even further, with better performance as the matrix size increases.

Figure 4.10: Normalized schedule length for Gaussian elimination DAGs varying the CCR between 0.01 and 10.

Figure 4.11: Normalized schedule length for Gaussian elimination DAGs varying the matrix size between 5 and 45.

Figures 4.12 and 4.13 show almost identical trends to Figures 4.10 and 4.11 across different CCRs and matrix sizes.

Figure 4.12: Normalized schedule length for LU decomposition DAGs varying the CCR between 0.01 and 10.

Figure 4.13: Normalized schedule length for LU decomposition DAGs varying the matrix size between 5 and 45.

Figures 4.14 and 4.15, however, show different trends. In Figure 4.14, one can see that neither CA-HEFT nor CA-HEFT+CADS conveys any benefit in execution time for smaller CCRs. However, as we increase the CCR, one can see that both CA-HEFT and CADS decrease the execution time significantly, with CA-HEFT reducing the execution time by up to almost 40% and CA-HEFT+CADS by up to about 60%. The increase in performance as the CCR increases arises because both CA-HEFT and CA-HEFT+CADS are able to more efficiently utilize the Cell processor NoC by avoiding some network accesses when the network is more heavily loaded. Figure 4.15 shows that, generally, the performance benefit increases for the largest task graph sizes.

Figure 4.14: Normalized schedule length for Laplace transform DAGs varying the CCR between 0.01 and 10.

Figure 4.15: Normalized schedule length for Laplace transform DAGs varying the matrix size between 5 and 45.

CHAPTER 5

FAULT TOLERANCE WITH RECONFIGURABLE HARDWARE

5.1 Introduction

As the number of processing cores integrated into a system grows, changes in processor availability become more likely. Processor availability could change for a number of reasons, including an increase in transient errors in a core, the result of an operating system (OS) decision, a pending thermal or electrical "emergency" [14, 29], or the sharing of virtualized hardware. The proposed approach to fault tolerance can be applied to any of these cases.

Hardware reliability techniques show significant promise at tolerating low-level faults [42, 62], even hiding the faults from system- and user-level software [93]. However, completely hardware-based solutions can have weaknesses in covering all possible faults; also, as more devices are integrated into a single system, the ways faults arise will likely increase. Faults are expected to increase in future technology generations from sources such as increased cross-talk, increased PVT variations, and decreased noise margins [14, 29].

Allowing the performance of the system to gracefully and predictably degrade under changes in processor availability opens up additional options when designing the rest of the system. We expect that the implementation of a fault-tolerant system resembling the proposed solution would be useful for low-level system software. For a simple example, the ability to suspend execution on a particular core could be used to reduce both the dynamic and static power consumption of a CMP by allowing one or more processing cores to be turned off. Using this mechanism, the OS or layer that operates "under" the proposed fault-tolerant system would be able to manage thermal properties by removing active cores when the chip becomes too hot and adding cores when the chip cools down. Similarly, a virtualization layer could share, among a group of applications, processing resources (such as RH) that cannot be shared through time-slicing as traditional microprocessors can.

Unlike previous proposals to deal with changes in processor availability [8, 23, 25, 35], we propose using reconfigurable hardware (RH) to allow the architecture to adapt to the new conditions. In this chapter, we extend the Mutually Exclusive Processor Groups reconfiguration model originally introduced in Chapter 3 to include changes in processor availability. Next, we propose a fault-tolerant extension of the hybrid scheduling system proposed in Chapter 4. We use the HEFT-MEG heuristic for the Compile-time Scheduler (CtS), scheduling reconfiguration tasks along with application tasks. For the Run-time Scheduler (RtS) we propose a novel two-part scheduler. The first part is the Fault-Tolerant Re-Mapper (FTRM). The FTRM resembles the CADS re-mapper in Chapter 4, but extends its functionality to accommodate changes in processor availability. The second part of the RtS is the Reconfiguration and Recovery Scheduler (RRS). The RRS modifies the future reconfiguration schedule when changes in processor availability occur. In this way, the RRS addresses the opportunity, when using RH in a fault-tolerant system, to adapt the hardware to changes in processing capability. Finally, the last section of the chapter shows how using the FTRM and RRS schedulers in tandem allows execution to continue when changes in processor availability occur, and how application performance degrades gracefully as the amount of available hardware decreases.

5.2 Proposed Failure Model

We assume a fault in any portion of a processor results in that processor being unable to execute any task. We model transient faults, with faults lasting from several nanoseconds to several seconds. A processor that is unavailable for execution due to a fault may become available at a later time. A processor that is unable to execute tasks due to a fault is designated as "unavailable"; all other processors are designated as "available."

Our failure model assumes the OS or other underlying software layer is responsible for failure detection, and the underlying software notifies our system when processor availability changes. This approach is flexible, being applicable to a wide range of fault scenarios, including "faults" that are the result of voluntary changes in hardware availability from the underlying layer. We assume that faults can occur at any time, and that all intermediate results in a task are lost if a fault happens during its execution. Finally, we assume that the underlying software layer presents the processors that are currently unavailable as an "unavailable group."

5.3 Mutually Exclusive Processor Groups Revisited

This section presents an extension to the Mutually Exclusive Processor Groups model to include changes in processor availability. The premise behind the Mutually Exclusive Processor Groups model is that it is not possible for two different configurations using the same underlying hardware to execute tasks concurrently; logical processors that use the same underlying hardware are defined to be Mutually Exclusive Processors. All logical processors bound to a particular RH are grouped together into a SuperGroup, while logical processors part of different configurations of the same hardware compose a SubGroup. Group membership describes what processors can be used concurrently: logical processors in different SubGroups but in the same SuperGroup are mutually exclusive. Ensuing uses of the term processor will refer to logical processors.

When the availability of the underlying hardware changes, this can change the associations among the SubGroups. As illustrated in Figure 5.1.a, part of the FPGA becomes unavailable. This impacts the availability of all processors that used that portion of the underlying hardware. We extend the Mutually Exclusive Processor Groups model to describe this with Unavailable Groups. Here, the Unavailable Group contains all the groups that are currently unavailable due to processor availability changes. At any point in time, there is only a single unavailable group within the architecture, containing all the currently unavailable processors across all SuperGroups. Figure 5.1.a shows how the unavailable portion of the FPGA impacts every SubGroup, and Figure 5.1.b illustrates the creation of the Unavailable Group.

Using the Mutually Exclusive Processor Groups model in a runtime, fault-tolerant scheduler has one major advantage. Because all possible configurations are defined at compile time, the runtime system does not have to perform the (often very expensive) place-and-route procedure to change the configuration. Rather, the RtS can choose from a set of configurations to more quickly change the configuration at runtime.

Figure 5.1: Illustrating processor availability changes on an FPGA.
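To make the group relationships concrete, the sketch below encodes SuperGroups, SubGroups, and the single Unavailable Group as plain Python data; the structure and names are our illustration of the model, not an implementation from this work.

```python
from dataclasses import dataclass

@dataclass
class SubGroup:
    """One configuration of a SuperGroup's hardware and its logical processors."""
    name: str
    processors: set

@dataclass
class SuperGroup:
    """All logical processors bound to one piece of reconfigurable hardware."""
    name: str
    subgroups: list

def mutually_exclusive(p1: str, p2: str, supergroups: list) -> bool:
    """True iff p1 and p2 sit in different SubGroups of the same SuperGroup,
    i.e. they are alternative configurations of the same hardware."""
    for sg in supergroups:
        sub1 = next((s for s in sg.subgroups if p1 in s.processors), None)
        sub2 = next((s for s in sg.subgroups if p2 in s.processors), None)
        if sub1 is not None and sub2 is not None:
            return sub1 is not sub2
    return False

# The single, architecture-wide Unavailable Group.
unavailable_group: set = set()

def mark_unavailable(impacted: set) -> None:
    """Add every logical processor impacted by a fault to the Unavailable Group."""
    unavailable_group.update(impacted)

# Example: one FPGA with two configurations sharing the same fabric.
fpga1 = SuperGroup("FPGA1", [SubGroup("S0", {"f1", "f2"}),
                             SubGroup("S1", {"f3"})])
assert mutually_exclusive("f1", "f3", [fpga1])       # different configurations
assert not mutually_exclusive("f1", "f2", [fpga1])   # same configuration
```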

5.4 Run-Time Scheduler

Figure 5.2 illustrates the interaction of the compile-time scheduler and two runtime schedulers for fault tolerance on an H-CMP architecture. As Figure 5.2.a illustrates, the fault-tolerant system uses HEFT with Mutually Exclusive Groups (HEFT-MEG) as the Compile-time Scheduler (CtS). The CtS takes an application represented as a DAG as an input and generates a schedule mapping tasks to processors, including a reconfiguration schedule. In Figure 5.2.b, the OS notifies the proposed system that a portion of the H-CMP is no longer available. In Figure 5.2.b, the Run-time Scheduler (RtS) is composed of two portions: the Fault Tolerant Re-Mapper (FTRM) and the Reconfiguration and Recovery Scheduler (RRS). When a change in processor availability occurs, the FTRM remaps tasks based on runtime conditions, remapping tasks originally scheduled for currently unavailable processors and balancing the load among the remaining processing resources. The RRS examines the future application requirements and generates a new reconfiguration schedule based on the new processor availability. For both the FTRM and the RRS we assume that a change in processor availability does not result in an architecture that is unable to execute the application DAG.

We develop a centralized RtS, and we assume that the RtS executes on some processor in the architecture. Previous work shows that helper-thread schemes can be implemented with acceptable overhead [35, 65], so a helper-thread-based scheme could be used to run the RtS. The functionality of the RtS is divided into two schedulers because the functions of the two schedulers are different. The FTRM is a relatively light-weight remapping scheduler, while the RRS performs significantly more involved calculations to optimize the reconfiguration schedule. In a real system, the FTRM would allow a system to respond quickly to changes in processor availability, while the RRS takes longer to finish but allows the architecture to adapt to changes in processor availability. Besides this significant difference, the RtS closely resembles the RtS proposed in Chapter 4, and can use very similar mechanisms.

Figure 5.2: System overview.

5.4.1 Fault Tolerant Re-mapper

The FTRM is a hybrid re-mapper similar to the Contention Aware Dynamic Scheduler introduced in Chapter 4. FTRM takes a statically generated schedule, and adapts it to actual execution conditions at runtime. While previously proposed hybrid schedulers use static blocks when remapping [13, 67, 68], FTRM uses dynamic blocks when scheduling at runtime. When using dynamic blocks, the set of tasks considered when scheduling changes as each task is scheduled. We designate the block of tasks from which the FTRM chooses tasks to schedule as the active block. The active block holds the tasks in the next "level" of the schedule that was generated at compile time; after a task is scheduled, it is removed from the active block, and the next task in the schedule is added to the active block.

Figure 5.3 illustrates the operation of FTRM when processor availability is reduced. In Figure 5.3.a, processor P2 is no longer available, and task t1 has already been scheduled to execute. Shown in Figure 5.3.b, FTRM chooses a task from the active block to execute next based on the current system conditions. In this example, FTRM decides to execute task t3 on processor P3. Then, in Figure 5.3.c, FTRM updates the active block by removing task t3 and adding task t6 to the active block.

Figure 5.3: Operation of the FTRM re-mapper. a) The original schedule, annotated to indicate the active block after t1 is scheduled. b) FTRM decides to schedule task t3 to processor P3. c) The scheduling decision is not reflected in the original schedule, but the active block is updated.

The FTRM uses a schedule generated at compile time as the basis for generating the run-time schedule. While we use HEFT-MEG as the compile-time scheduler, any scheduler that prioritizes tasks can be used as the compile-time scheduler. FTRM uses the bottom-level rank (rankb) as calculated by HEFT-MEG to prioritize tasks, defined in Equation 2.2. Then, using rankb, we define the cost function FTRM uses to evaluate the cost of mapping a task ni onto a particular processor pj as:

$$\Psi(n_i, p_j) = (-1) \cdot \frac{\mathrm{rank}_b(n_i)}{P_{ft}(n_i, p_j)} + w_i(p_j) + R(n_i) \qquad (5.1)$$

where wi(pj) is the expected execution time of task ni on processor pj, and R(ni) is used to determine if task ni is ready to execute, defined as:

$$R(n_i) = \begin{cases} 0 & \text{if } n_i\text{'s predecessors have finished} \\ \infty & \text{otherwise} \end{cases} \qquad (5.2)$$

Pft(n, p) is the fault-tolerant processor penalty, defined as:

$$P_{ft}(n, p) = \begin{cases} 1 & \text{if task } n \text{ was scheduled on } p \\ 1 & \text{if } n \text{ was scheduled on a currently unavailable processor and } p \text{ is in the same SubGroup as } s(n) \\ 1 + \Delta & \text{otherwise} \end{cases} \qquad (5.3)$$

where ∆ is a user-defined penalty constant and s(n) returns the processor on which n was originally scheduled to execute. The goal when designing Equation 5.1 was to use the schedule generated at compile time to inform the decisions made at runtime, while enabling the scheduler to react to runtime information. In Equation 5.1, a task's rankb is multiplied by (−1) to reduce the relative cost of tasks that had a high priority in the CtS. Then, that value is divided by Pft(n, p) to penalize mapping tasks to a processor different from the CtS's mapping. Equation 5.3 also favors mapping tasks to the originally scheduled processor's SubGroup when the originally scheduled processor is unavailable. Processors in the same SubGroup are part of the same RH resource, and in most configurations are able to communicate at a lower cost than between different SubGroups. HEFT-MEG is likely to schedule tasks that communicate heavily to be located physically close to reduce communication costs, and Equation 5.3 attempts to preserve an approximation of that mapping even when the processor availability changes. Finally, in Equation 5.1, we add the task ni's expected execution time on processor pj (wi(pj)) to the negative, scaled rankb. This favors mapping tasks to the processors on which they will execute faster.
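As a hedged illustration, the sketch below renders Equations 5.1–5.3 in Python, reusing the style of the Chapter 4 sketch; the task fields, the same_subgroup helper, and the data layout are assumptions, with ∆ set to the 1.0 reported in Section 5.5.

```python
import math

DELTA_FT = 1.0   # Delta; 1.0 yielded the best performance in our experiments

def p_ft(task, proc: str, unavailable: set, same_subgroup) -> float:
    """Equation 5.3: fault-tolerant processor penalty. `task.scheduled_on`
    plays the role of s(n); `same_subgroup` is an assumed helper."""
    orig = task.scheduled_on
    if orig == proc:
        return 1.0
    if orig in unavailable and same_subgroup(proc, orig):
        return 1.0                     # keep tasks near their original SubGroup
    return 1.0 + DELTA_FT

def r_ready(task) -> float:
    """Equation 5.2: 0 once all predecessors have finished, else infinity."""
    return 0.0 if task.preds_finished else math.inf

def psi(task, proc: str, unavailable: set, same_subgroup) -> float:
    """Equation 5.1: FTRM cost of mapping `task` onto `proc`; lower is better.
    `task.w[proc]` is the expected execution time w_i(p_j)."""
    return (-task.rank_b / p_ft(task, proc, unavailable, same_subgroup)
            + task.w[proc] + r_ready(task))
```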

Algorithm 7 FTRM Algorithm
 1: procedure FTRM(S)
 2:     ▷ S is a schedule generated by HEFT-MEG
 3:     Initialize active block, A
 4:     while there are unscheduled tasks do
 5:         Wait for an idle processor, pj
 6:         ψmin := ∞
 7:         while ψmin = ∞ do
 8:             for all tasks ni in active block A do
 9:                 ψi := Ψ(ni, pj)            ▷ Equation 5.1
10:                 if ψi < ψmin then
11:                     ψmin := ψi
12:                     nmin := ni
13:                 end if
14:             end for
15:         end while
16:         Schedule nmin to execute on pj
17:         Remove nmin from active block A
18:         Add nextTask(pj, nmin) to active block A
19:     end while
20: end procedure

Algorithm 7 is a pseudo-code representation of FTRM. In Algorithm 7 we assume that the scheduler is only used when the processor availability is less than the number of processors considered by the CtS. Upon startup, FTRM initializes the active block to hold the next task each processor was scheduled to execute before the processor availability change. This takes O(p) time. In the main loop from lines 7–19, FTRM waits for an idle processor on which to place the next task. Processors that are unavailable are considered busy, so they are never chosen for scheduling. In the statically generated schedule, reconfiguration tasks are scheduled on every processor and depend on all tasks that execute before them. This way, FTRM strictly follows the reconfiguration schedule, whether it was statically generated or generated by the RRS, described in the next section.

5.4.2 Reconfiguration and Recovery Scheduler

The second half of the fault-tolerant RtS is the Reconfiguration and Recovery Scheduler (RRS). RRS examines the changes in processor availability and determines a new configuration schedule, inserting new reconfiguration tasks into the task graph. Although it considers tasks when generating the new configuration schedule, RRS does not determine a new task schedule. Because of this, RRS can generate new configurations that would violate the correctness of the task mapping. Therefore, RRS relies on the FTRM scheduler to change the processor mapping at runtime and generate a correct and feasible schedule on processors that are part of the current configuration.

To reduce the size of the reconfiguration space RRS explores, RRS uses the reconfiguration schedule generated by the CtS (HEFT-MEG) as the starting point to generate a new configuration schedule. Figure 5.4 illustrates the process RRS uses to extract the configuration schedule from the total application's schedule. This extraction can be done at compile time, to reduce RRS's overhead when a change in processor availability occurs.

Figure 5.4.a shows the schedule as generated by HEFT-MEG, showing several reconfiguration tasks executing on both FPGAs. Figure 5.4.b illustrates the SubGroup that is made active for each reconfiguration, with the initial configuration shown at the top of the graph. Figure 5.4.c shows the reconfiguration schedule. The reconfiguration schedule is a chain of reconfigurations, ordered by the time they appear in the original schedule. Not illustrated in Figure 5.4 is the process of merging configurations. If two reconfigurations in the configuration schedule have no tasks scheduled to execute between them, they are merged.

Figure 5.4: Illustrating the extraction of the configuration schedule.
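A minimal sketch of that merge step follows; it assumes each reconfiguration is a dict mapping SuperGroups to their chosen SubGroups and that the schedule records how many tasks run before the next reconfiguration. The representation is ours, for illustration only.

```python
def merge_reconfigurations(schedule: list) -> list:
    """Each entry is (config, n_tasks), where `config` maps SuperGroup ->
    SubGroup and `n_tasks` counts tasks before the next reconfiguration.
    Adjacent reconfigurations with no tasks between them are folded
    together, the later per-SuperGroup choice winning."""
    merged = []
    for config, n_tasks in schedule:
        if merged and merged[-1][1] == 0:
            prev_config, _ = merged[-1]
            merged[-1] = ({**prev_config, **config}, n_tasks)
        else:
            merged.append((dict(config), n_tasks))
    return merged

# No tasks run between the second and third reconfigurations, so they merge.
sched = [({"FPGA1": "S1"}, 4), ({"FPGA2": "S2"}, 0), ({"FPGA1": "S3"}, 2)]
print(merge_reconfigurations(sched))
# -> [({'FPGA1': 'S1'}, 4), ({'FPGA2': 'S2', 'FPGA1': 'S3'}, 2)]
```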

When a change in processor availability occurs, RRS examines the configuration schedule from the point of the availability change onwards. Each reconfiguration task after the change is reconsidered, and the new processor availability and the tasks that were scheduled between reconfiguration tasks are used to generate the new reconfiguration task. Figure 5.4.c illustrates how tasks scheduled between the reconfiguration tasks are associated with different reconfiguration tasks. For instance, the four tasks that were scheduled between Reconfiguration 1 and Reconfiguration 2 are associated with Reconfiguration 1. In the example given in Figure 5.4, three reconfiguration tasks are reconsidered, as the change in processor availability occurs after the initialization reconfiguration but before the first reconfiguration task.

RRS uses an election-based algorithm to determine the new configuration. Each task associated with a reconfiguration task requests a processor to be added to the new configuration. Tasks request a processor by voting for one processor. To prioritize tasks, a task's vote value is its rankb, as computed by HEFT-MEG (defined in Equation 2.2). Then, the configuration is built from the processors that have accumulated the highest number of votes during the election stage of the algorithm. For a particular task nj, we define the function proc(nj) as returning the available processor that yields the lowest runtime for task nj, modified by the current number of votes for that processor. More formally, proc(nj) is defined as:

$$\mathrm{proc}(n_j) = \operatorname*{arg\,min}_{p_i \in P} \left\{ w_j(p_i) + \mathit{votes}[p_i] \right\} \qquad (5.4)$$

where P is the set of all logical processors that are available (that is, all processors that have not failed).

Algorithm 8 RRS Algorithm
 1: procedure RRS(Rs)                ▷ Rs is a reconfiguration schedule
 2:     r := head(Rs)
 3:     while r occurs before change in processor availability do
 4:         r := next reconfiguration task in Rs
 5:     end while
 6:     while r is a valid reconfiguration task do
 7:         Clear votes
 8:         Order tasks N associated with r by rankb, put in list
 9:         for all tasks nj in list do
10:             votes[proc(nj)] += rankb(nj)
11:         end for
12:         Build new configuration rn with the processors with the most votes
13:         Ensure rn is a complete configuration
14:             ▷ Ensure each SubGroup is complete
15:             ▷ For each unrepresented SuperGroup, choose the SubGroup in the previous reconfiguration task
16:         Insert rn into the new reconfiguration schedule
17:         r := next reconfiguration task in Rs
18:     end while
19: end procedure

Algorithm 8 is a pseudo-code representation of RRS. First, the while loop from lines 3 to 5 "fast-forwards" the schedule until after the time processor availability changes. The definition of proc(nj) in Equation 5.4, in conjunction with considering tasks in order by rankb, causes tasks with lower rank values to favor requesting processors that have not been voted on. This helps RRS to generate configurations that have a number of different processing cores, avoiding the case where every task votes for the same processor, which would convey little information about what configuration may yield higher performance.

In Algorithm 8, line 10, the vote value for the processor chosen by nj is incremented by its bottom-level rank, as calculated by HEFT-MEG. Then, the reconfiguration is built in lines 12 and 13 by iteratively adding the processors with the highest number of "votes" to the current configuration until the configuration is specified. To ensure that the configuration is valid, RRS inserts the SubGroup from the previous reconfiguration task for each unrepresented SuperGroup in the newly created configuration. In other words, if no tasks vote for any processor in a particular SuperGroup, RRS chooses the configuration for that SuperGroup so that the SuperGroup does not need to be reconfigured.

Assuming that the original reconfiguration schedule is generated by the CtS, including the ordering of all tasks associated with each reconfiguration by rankb, RRS has a worst-case runtime of O(n · p) with n tasks and p processors.
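The election stage can be sketched compactly; the dict-based data layout below is assumed for illustration, and rrs_election implements line 10 of Algorithm 8 together with the proc(nj) rule of Equation 5.4.

```python
from collections import defaultdict

def rrs_election(tasks: list, available: set, runtime: dict, rank_b: dict) -> dict:
    """One reconfiguration's election. `tasks` are ordered by decreasing
    rank_b; `runtime[t][p]` is w_j(p_i). Returns vote totals per processor,
    from which the new configuration is built."""
    votes = defaultdict(float)
    for t in tasks:
        # Equation 5.4: minimize runtime plus the votes already cast, so
        # low-rank tasks drift toward processors no one has claimed yet.
        chosen = min(available, key=lambda p: runtime[t][p] + votes[p])
        votes[chosen] += rank_b[t]          # Algorithm 8, line 10
    return dict(votes)

# Example: t1's heavy vote for f1 pushes t2 toward the slower cpu0.
runtime = {"t1": {"cpu0": 8.0, "f1": 1.0}, "t2": {"cpu0": 9.0, "f1": 1.5}}
rank_b = {"t1": 30.0, "t2": 12.0}
print(rrs_election(["t1", "t2"], {"cpu0", "f1"}, runtime, rank_b))
# -> {'f1': 30.0, 'cpu0': 12.0}
```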

5.5 Simulation Results

We simulated the effect of using the FTRM and RRS hybrid schedulers when a system with reconfigurable processors undergoes a change in processor availability. For the simulation, we generated a Gaussian elimination DAG for the application, and used HEFT-MEG to schedule the application onto two and four node systems. The Gaussian elimination DAG was generated for a matrix size of 35, and contained 662 tasks. We assumed that executing a task on an RH is ten times faster than the microprocessor version on average, when an RH version of the task is available. RH versions of tasks are available for 80% of all tasks. For the reconfigurable architecture, each node consists of a general-purpose microprocessor coupled with a reconfigurable processor. In each case, the reconfigurable processor could choose from nine processor types, with each logical processor occupying 30% of a reconfigurable processor's area on average. Therefore, there are about three logical processors per reconfigurable processor per configuration, on average. We found that a penalty constant of ∆ equal to 1.0, used in Equation 5.3, yielded the best performance in our experiments.

As illustrated in Figure 5.1, when a fault occurs in a portion of the RH, we assume other logical processors part of the same RH, but not using the faulty area, remain unaffected. We also ensured that at least one general-purpose processor was always available, as some tasks can only execute on that processor type. In evaluating the performance of the fault-tolerant hybrid scheduler, we compared the time it would have taken the original schedule to complete all tasks executed during a transient fault to the simulated execution time. For our tests, we assume a single fault occurs, it starts after 10% of the tasks have finished, and the fault continues through the end of execution. Finally, we assume that both portions of the RtS have no overhead.

Table 5.1 shows the performance results for various faults on a four node system. SlowDown is defined as

$$\mathrm{SlowDown} = \frac{t_{sim}}{T_{orig}}$$

where t_sim is the total time of the transient failure and T_orig is the time the original schedule would take to execute all tasks scheduled in simulation. A lower SlowDown indicates higher performance.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 2.775       2.856
2 µP                 3.561       3.846
1/2 of 1 RH          2.548       2.729
1/2 of 2 RHs         2.840       2.593

Table 5.1: Results for four node system, CCR = 1.0.

In Table 5.1, one can see that as the number of failed processors increases, the SlowDown also increases, indicating that the performance of the system degrades with the decrease in processing resources. Also, one can see that the RRS reconfiguration rescheduler actually degrades performance when a microprocessor fails, but increases performance when the failure is in the RH. This behavior is not surprising, as there is more opportunity to increase performance through reconfiguration when a failure occurs in the reconfigurable hardware. For tasks originally scheduled on the microprocessor, it is unlikely that changing the configuration will result in those tasks being remapped favorably to an RH processor. In these cases, using RRS to change the reconfiguration schedule penalizes the performance of applications running on RH processors.

Table 5.2 shows the results for simulating various faults on a two node system. The results for the two node system are similar to the results for the four node system, except the use of RRS is more beneficial when RH is part of the fault and more detrimental when the microprocessors are part of the fault.

Tables 5.3 and 5.4 show the results for simulating the execution of a Gaussian DAG with a CCR of 0.25 under various faults. Results are similar to when the CCR is 1.0, although the benefit of using RRS is decreased when multiple RHs fail.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 3.999       5.137
1/2 of 1 RH          2.915       3.045
1/2 of 2 RHs         4.057       3.568

Table 5.2: Results for two node system, CCR = 1.0.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 2.119       3.192
2 µP                 2.946       4.353
1/2 of 1 RH          2.227       2.537
1/2 of 2 RHs         2.416       2.343

Table 5.3: Results for four node system, CCR = 0.25.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 3.519       5.255
1/2 of 1 RH          2.740       3.284
1/2 of 2 RHs         3.888       3.563

Table 5.4: Results for two node system, CCR = 0.25.

The results show that, when using the FTRM, application performance is higher when fewer processing resources fail, allowing performance to degrade as the number of active processors is reduced and increase as the number of active processors increases. However, performance is lower than expected. For example, when 1/2 of 2 RHs fail on a four node system, the failure reduces the amount of RH resources by 25% and does not affect the number of microprocessors available. Despite this, using FTRM takes almost 3x longer to execute the Gaussian DAG with a CCR of 1.0 than if there were no failure and the CtS schedule were used. We believe that the FTRM cost function may too often map tasks to processors on which they execute slower, reducing application performance. As a task's rankb is used when scheduling with the FTRM, and a task's rankb is likely to be significantly larger than its execution time, a task with a high rankb will likely be chosen to execute on a processor regardless of its execution time.

CHAPTER 6

CONCLUSIONS

6.1 Contributions

This dissertation presented solutions to three challenges to the efficient usage of current and future Heterogeneous Chip Multiprocessor (H-CMP) architectures, specifically the efficient use of Reconfigurable Hardware (RH) resources, the efficient use of Network on a Chip (NoC) resources, and how to accommodate expected increases in transient faults.

In Chapter 3, we developed the Mutually Exclusive Processor Groups reconfiguration model; the proposed model captures a wide range of Reconfigurable Hardware (RH) types, from reconfiguration using FPGAs to the more coarsely reconfigurable polymorphous computer architectures (PCAs). Based on this model, we proposed the Mutually Exclusive Processor Groups (-MEG) scheduling extension. The -MEG extension evaluates reconfiguration decisions during scheduling using a novel backtracking algorithm. To reduce the runtime of -MEG, we further propose a method to choose only "good" configurations to evaluate, reducing the configuration search space when scheduling. While -MEG can be used to extend any list scheduler, we extend the HEFT scheduler (proposed by Topcuoglu et al. [107, 108]) to create HEFT-MEG.

Testing the proposed scheduler on randomly generated, Gaussian elimination, LU decomposition, and Laplace transform task graphs, we find that HEFT-MEG using FindSmartConfs generates schedules that are about 20% shorter on average than the best single-configuration, HEFT-generated schedules. HEFT-MEG generates higher quality schedules than the single-configuration HEFT because the -MEG scheduling extension explores the reconfiguration space, adapting the architecture's configuration to transient application requirements.

We also find that HEFT-MEG significantly outperforms the previously proposed scheduler of Mei et al. [70] (Mei00) as we increase the number of tasks in the DAG and the number of processors in the architecture. HEFT-MEG outperforms Mei00 for two central reasons. First, the -MEG scheduling extension generates higher quality reconfiguration schedules. The -MEG extension iteratively refines the reconfiguration schedule, building off of higher performing previously found partial reconfiguration schedules. In contrast, the GA used by Mei00 determines what configuration executes each task, effectively determining the entire reconfiguration schedule at once. Secondly, the GA used by Mei00 only determines the mapping of tasks to processors, using a list scheduler to determine execution ordering and evaluate each individual's fitness. Because the list scheduler is used to evaluate each individual's fitness in a population, it is necessarily simple to reduce execution time, and it is outperformed by the single-configuration HEFT in many cases, despite Mei00's ability to adapt the architecture to transient application requirements. Even with a reduced-complexity list scheduler, we found that Mei00 takes significantly longer to generate a schedule than HEFT-MEG using FindSmartConfs or the single-configuration HEFT.

Finally, we demonstrate that HEFT-MEG can also generate high quality schedules for the TRIPS processor³. Using HEFT-MEG increases GPS Acquisition performance by about 20% versus the best single-configuration schedule. The performance advantage arises because HEFT-MEG can change the TRIPS configuration to match transient application requirements.

When writing programs for an H-CMP system, utilization of the Network on a Chip (NoC) can become a first-level design consideration. As the NoC is a shared resource, contention for the NoC can drastically change the throughput, latency, and efficiency of the NoC. Chapter 4 presented a network model for the Cell processor's NoC⁴ and a compile-time scheduler (CtS) and run-time scheduler (RtS) using this model. First, we developed a simple stochastic model to predict the expected message latency using only the number of other messages concurrently using the NoC. Then, we introduced the Contention Aware (CA-) list scheduling extension. The CA- extension informs a base scheduler how scheduling decisions affect NoC contention and influence previously scheduled tasks' start and finish times. Although any list scheduler can be extended using the proposed CA- extension, we demonstrate the CA- extension using the HEFT scheduler, proposed by Topcuoglu et al. [107, 108], as the base scheduler to create Contention Aware HEFT (CA-HEFT). Next, we introduced the Contention Aware Dynamic Scheduler (CADS) RtS. Despite being based on a very simple stochastic model that disregards several secondary effects, using CA-HEFT and CADS in concert as a hybrid scheduler results in significant improvements to application run times. Comparing the proposed hybrid scheduler to the default scheduler used by the Accelerated Library Framework [98], CA-HEFT+CADS can decrease actual execution time on the Cell processor by about 60% for task graphs with a high communication-to-computation ratio (CCR). CA-HEFT reduces actual execution time by scheduling tasks "around" contention on the NoC. CADS further reduces execution time by reacting to actual system conditions, further improving the task schedule.

³The TRIPS processor was developed at the University of Texas at Austin [19, 91, 44].
⁴The Cell processor was jointly developed by IBM, Sony, and Toshiba for the Sony Playstation3 gaming console [48, 45].

As the number of processing elements integrated onto a single chip increases, the likelihood that faults will occur increases. Additionally, future process technologies are expected to suffer from higher transient fault rates due to increasing voltage and temperature fluctuations (PVT variations) [14, 29]. Modeling intermittent faults as changes in processor availability, Chapter 5 modified the hybrid scheduling system from Chapter 4 to accommodate variability in the processors available for computation. We used the same CtS as in Chapter 3 (HEFT-MEG), where the CtS assumes that all processors will be available for computation during application execution. Chapter 5 introduced a novel two-part RtS. The Fault Tolerant Re-Mapper (FTRM) resembles the CADS re-mapper as presented in Chapter 4. The FTRM examines the current processor availability and, using the schedule generated at compile time, remaps tasks to the available set of processors. The second portion of the RtS, the Reconfiguration and Recovery Scheduler (RRS), specifically addresses the opportunities when designing a fault-tolerant system for RH. When a change in processor availability occurs, the RRS changes the reconfiguration schedule so that the reconfigurations more accurately reflect the new hardware capabilities. The proposed hybrid scheduling system enables application performance to gracefully degrade when processor availability diminishes, and increase when processor availability increases.

6.2 Future Work

This section proposes several directions for further research building on the work in this dissertation.

In Chapter 3, we explored scheduling reconfiguration on a single-user/single-application system. Generating a rigid schedule of configurations becomes less useful when multiple applications that utilize the RH resources are executing concurrently. Some of the concepts presented in Chapter 5 for fault tolerance could be applied to multi-application access to RH, especially the concept of unavailable processing resources. An interesting avenue to study this further is in the area of a reconfiguration-aware OS, and the interaction of compile- and run-time scheduling with the OS scheduling of hardware resources. Some work has already been done developing a reconfiguration-aware OS [84, 74], but more work will need to be completed to fully utilize the capabilities present in future H-CMP architectures.

We compared our proposed HEFT-MEG scheduler to two other schedulers: a single-configuration HEFT [108] and Mei00 [70]. Further work will explore how the -MEG extension coupled with other list schedulers affects the resulting schedule quality, and will develop extensions to other existing hardware-software co-schedulers to compare HEFT-MEG to a greater variety of other schedulers. Also, future work will develop a branch-and-bound-like version of the -MEG extension to further reduce execution time by discontinuing repeated exploration of reconfiguration decisions that have previously been found to be non-beneficial.

In the future, scheduling NoC access can be expanded in several areas. First, using a hybrid scheduler can allow the execution of applications that are not specified as directed acyclic graphs (DAGs). Future work may include tools to allow a program to dynamically generate tasks to add to the task graph at runtime, using CADS to schedule the new tasks. Additionally, a hybrid re-mapper can also be used to merge the schedules of more than one application at runtime. Work exploring the performance and fairness concerns when scheduling multiple, independent task graphs onto an H-CMP is another avenue of possible future work.

In the stochastic model demonstrated in Chapter 4, we disregard several secondary effects that can become more important as the number of simultaneous communications increases. Further work will explore whether considering the secondary effects is beneficial to the schedule resulting from using the CA- extension.

Also, we plan to explore the effect of using different schedulers with the CA- extension, determining the best combination of CtS and RtS heuristics for our hybrid scheduling system. Chapter 4 demonstrates that the overhead of the CADS RtS is acceptable on the Cell processor, but increasing the number of processing cores increases the Run-time Scheduler (RtS) overhead. Current technological trends indicate larger numbers of processing cores will be integrated into future H-CMP architectures, and the runtime of the CADS RtS (which is O(p)) could become unacceptable in the future. A possible solution would be to implement a partitioned, hierarchical version of the CADS RtS. In hierarchical CADS, processors would be grouped together, with a separate CADS scheduler for each processor group. Then, a higher-level scheduler would be executed relatively infrequently, and would perform load balancing (and possibly other) operations across groups and assign blocks of tasks to groups. An interesting avenue to begin the implementation of the hierarchical CADS RtS would be to examine operating system schedulers, such as the ULE scheduler that is part of the FreeBSD operating system [69], which utilizes a per-processor list of threads, where each processor makes local scheduling decisions but a higher-level scheduler decides which processor gets which threads.

In Chapter 5 we demonstrate, in simulation, that the fault-tolerant hybrid scheduler gracefully degrades application performance when processor availability decreases. First, the FTRM and RRS schedulers should be optimized to increase the performance of the application after a change in processor availability. Future work would include studies measuring the real-world overhead involved in implementing the RtS on an H-CMP with reconfigurable hardware. The FTRM re-mapper suffers from the same worst-case runtime (O(p)) as CADS, which may become an unacceptable overhead as the number of processors in the system increases. Future work would investigate a similar partitioned, hierarchical RtS as suggested for the CADS scheduler to reduce its overhead. Similarly, the RRS scheduler examines reconfiguration decisions for the entire reconfiguration schedule on a change in processor availability. For transient faults that are expected to last only short amounts of time, the RRS could limit the size of the reconfiguration area it searches. Another avenue to reduce the overhead involved with the RRS would be to start the computation of a new reconfiguration schedule when a change occurs, but continue execution in the meantime using the FTRM to remap tasks originally scheduled on the unavailable processors. This would enable more complex heuristics to be used, as the response time of the RRS would not be as important. Additionally, the work presented in Chapter 5 only considers when the underlying architecture provides fewer processing cores than an application was originally scheduled to run on. One interesting avenue for future work would study how to "scale up" the number of cores at runtime in a way that increases the application performance in a predictable manner. Finally, the RRS scheduler was developed with FPGA soft-cores in mind, where a particular task can either be executed in hardware on the FPGA or in software on a microprocessor. Future enhancements could target RRS to situations where the choice is between a small number of more powerful cores or a large number of less powerful cores. The current implementation of RRS would most likely pick a small number of more powerful cores, even if the other configuration (or something in between) would have higher performance.

BIBLIOGRAPHY

[1] Anant Agarwal, Liewei Bao, John Brown, Bruce Edwards, Matt Mattina, Chyi-Chang Miao, Carl Ramey, and David Wentzlaff. Tile Processor: Embedded Multicore for Networking and Multimedia. In HotChips: A Symposium on High Performance Chips, Stanford, California, August 2007.

[2] Ishfaq Ahmad and Yu-Kwong Kwok. On Exploiting Task Duplication in Parallel Program Scheduling. IEEE Transactions on Parallel and Distributed Systems, 9(9):872–892, September 1998.

[3] S. Ali, J. Kim, H. Siegel, A. Maciejewski, Y. Yu, S. Gundala, S. Gertphol, and V. Prasanna. Greedy Heuristics for Resource Allocation in Dynamic Distributed Real-time Systems. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '02), pages 519–530, 2002.

[4] Randy Allen, David Callahan, and Ken Kennedy. Automatic Decomposition of Scientific Programs for Parallel Execution. In Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pages 63–76, Munich, West Germany, 1987.

[5] Murali Annavaram, Ed Grochowski, and John Shen. Mitigating Amdahl's Law Through EPI Throttling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 298–309, Madison, Wisconsin, June 2005.

[6] Rashmi Bajaj and Dharma Agrawal. Improving Scheduling of Tasks in a Heterogeneous Environment. IEEE Transactions on Parallel and Distributed Systems, 15(2):107–118, February 2004.

[7] Saisanthosh Balakrishnan, Ravi Rajwar, Mike Upton, and Konrad Lai. The Impact of Performance Asymmetry in Emerging Multi-core Architectures. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 506–519, Madison, Wisconsin, June 2005.

[8] Michel Banatre and Peter A. Lee. Hardware and Software Architectures for Fault Tolerance: Experiences and Perspectives. Lecture Notes in Computer Science. Springer-Verlag, 1994.

[9] Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), pages 282–293, Vancouver, Canada, June 2000.

[10] O. Beaumont, V. Boudet, and Y. Robert. A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS ’02 Workshops), page 37, 2002.

[11] O. Beaumont, V. Boudet, and Y. Robert. The iso-level scheduling heuristic for heterogeneous processors. In PDP'2002, 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, pages 335–350, Canary Islands, Spain, 2002. IEEE Press.

[12] Cristina Boeres, Jose Filho, and Vinod Rebello. A Cluster-based Strategy for Scheduling Task on Heterogeneous Processors. In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'04), pages 214–221, October 2004.

[13] Cristina Boeres and Alexandre Lima. Hybrid Task Scheduling: Integrating Static and Dynamic Heuristics. In Proceedings of the 15th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'03), pages 199–206, November 2003.

[14] Shekhar Borkar. Microarchitecture and Design Challenges for Gigascale Integration: Keynote. In Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO), 2004.

[15] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi, and Vivek De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th Annual Conference on Design Automation, pages 338–342, June 2003.

[16] Doruk Bozdag, Umit Catalyurek, and Fusun Ozguner. A Task Duplication Based Bottom-Up Scheduling Algorithm for Heterogeneous Environments. In Proceedings of 20th International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[17] Doruk Bozdag, Fusun Ozguner, Eylem Ekici, and Umit Catalyurek. A Task Duplication Based Scheduling Algorithm Using Partial Schedules. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP), pages 630–637, June 2005.

[18] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau Bölöni, Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, and Bin Yao. A taxonomy for describing matching and scheduling heuristics for mixed-machine heterogeneous computing systems. In Symposium on Reliable Distributed Systems, pages 330–335, 1998.

[19] Doug Burger, Stephen Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, W. Yoder, and the TRIPS Team. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, pages 44–55, July 2004.

[20] Jim Burns, Adam Donlin, Jonathan Hogg, Satnam Singh, and Mark de Wit. A Dynamic Reconfiguration Run-Time System. In 5th IEEE Symposium on FPGA-Based Custom Computing Machines (FCCM '97), pages 66–75, Napa Valley, CA, April 1997.

[21] Daniel P. Campbell, Dennis M. Cottel, Randall R. Judd, and Mark A. Richards. Introduction to Morphware: Software Architecture for Polymorphous Computing Architectures. Technical report, Georgia Institute of Technology and Space and Naval Warfare Systems Center, San Diego, February 2004.

[22] Umit Catalyurek and Cevdet Aykanat. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distrib. Syst., 10(7):673–693, 1999.

[23] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 283–292, San Jose, California, 2006.

[24] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. A Case for an Over-provisioned Multicore System: Energy Efficient Processing of Multithreaded Programs. Technical report, University of Wisconsin Computer Sciences Technical Reports, 2007.

[25] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Adapting to Intermittent Faults in Multicore Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIII), 2007.

[26] Hongtu Chen and M. Maheswaran. Distributed Dynamic Scheduling of Composite Tasks on Grid Computing Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02), pages 89–97, 2002.

[27] Bertrand Cirou and Emmanuel Jeannot. Triplet: A Clustering Scheduling Algorithm for Heterogeneous Systems. In Proceedings of the International Conference on Parallel Processing Workshops (ICPPW'01), pages 231–236, 2001.

[28] Katherine Compton, Zhiyuan Li, James Cooley, and Scott Hauck. Configuration Relocation and Defragmentation for Run-time Reconfigurable Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(3), June 2002.

[29] Cristian Constantinescu. Intermittent faults in VLSI circuits. In Proceedings of the IEEE Workshop on Silicon Errors in Logic - System Effects, 2007.

[30] M. Cosnard and M. Loi. Automatic Task Graph Generation Techniques. In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, pages 113–122, 1995.

[31] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. ACM SIGPLAN Notices, Proc. Symp. Principles and Practice of Parallel Programming, 28(7):1–12, July 1993.

[32] Bharat P. Dave. CRUSADE: Hardware/Software Co-Synthesis of Dynamically Reconfigurable Heterogeneous Real-Time Distributed Embedded Systems. In Proceedings of Design, Automation and Test in Europe (DATE), pages 97–104, March 1999.

[33] A. Dhodapkar and J. E. Smith. Comparing Program Phase Detection Techniques. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 217, 2003.

[34] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic scheduling of tasks on partially reconfigurable FPGAs. IEE Proceedings - Computers and Digital Techniques, 147(3):181–188, May 2000.

[35] Yang Ding, Mahmut Kandemir, Padma Raghavan, and Mary Jane Irwin. A Helper Thread Based EDP Reduction Scheme for Adapting Application Execution in CMPs. In Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

[36] Atakan Dogan and Füsun Özgüner. LDBS: A Duplication Based Scheduling of Independent Tasks with QoS Requirements in Grid Computing with Time-Varying Resource Prices. In Proceedings of the International Conference on Parallel Processing, August 2002.

[37] Atakan Dogan and Füsun Özgüner. Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):308–323, March 2002.

[38] Atakan Dogan and Füsun Özgüner. Genetic Algorithm Based Scheduling of Meta-Tasks with Stochastic Execution Times in Heterogeneous Computing Systems. Cluster Computing (Kluwer), 2(7), 2004.

[39] Atakan Dogan and Füsun Özgüner. Scheduling of a Meta-Task with QoS Requirements in Heterogeneous Computing Systems. Journal of Parallel and Distributed Computing (Elsevier), 66(12):181–196, 2006.

[40] H. El-Rewini and T.G. Lewis. Scheduling Parallel Program Tasks onto Arbitrary Target Machines. Journal of Parallel and Distributed Computing, 9(2):138–153, June 1990.

[41] Mohammed Eltayeb, Atakan Dogan, and Füsun Özgüner. Concurrent Scheduling: Efficient Heuristics for Online Large-Scale Data Transfers in Distributed Real-Time Environments. IEEE Transactions on Parallel and Distributed Systems, 17(11):1348–1359, November 2006.

[42] Daniel Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Toan Pham, Rajeev Rao, Conrad Ziesler, David Blaauw, Todd Austin, and Trevor Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO), pages 7–18, December 2003.

[43] M. M. Eshaghian. Heterogeneous Computing. Prentice-Hall, Inc., 1996.

[44] Mark Gebhart and Steve Keckler. Large Matrix Multiplication on the TRIPS SVM System. Technical report, Department of Computer Sciences, The University of Texas at Austin, December 2005.

[45] Michael Gschwind, H. Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, and Takeshi Yamazaki. Synergistic Processing in Cell’s Multicore Architecture. IEEE Micro, 26(2):10–24, 2006.

[46] Charles R. Hardnett, Ajay Jayaraj, Tushar Kumar, Krishna V. Palem, and Sudhakar Yalamanchili. Compiling Stream Kernels for Polymorphous Computing Architectures. In The 12th International Conference on Parallel Architectures and Compilation Techniques (PACT-2003), New Orleans, Louisiana, September 2003.

[47] J. Harkin, T.M. McGinnity, and L.P. Maguire. Partitioning Methodology for Dynamically Reconfigurable Embedded Systems. IEE Proceedings - Computers and Digital Techniques, 147(6):391–396, November 2000.

[48] H. Peter Hofstee. Power Efficient Processor Architecture and The Cell Processor. In Proceedings of the 11th IEEE International Symposium on High-Performance Computer Architecture (HPCA-11 2005), pages 258–262, San Francisco, California, February 2005.

[49] Yatin Hoskote, Sriram Vangal, Nitin Borkar, and Shekhar Borkar. Teraflop Prototype Processor with 80 Cores. In HotChips: A Symposium on High Performance Chips, Stanford, California, August 2007.

[50] M.A. Iverson and Füsun Özgüner. Dynamic, Competitive Scheduling of Multiple DAGs in a Distributed Heterogeneous Environment. In Proceedings of the Heterogeneous Processing Workshop, pages 70–78, March 1998.

[51] Sangil Jwa and Ümit Özgüner. Multi-UAV Sensing Over Urban Areas via Layered Data Fusion. In IEEE/SP 14th Workshop on Statistical Signal Processing (SSP'07), pages 576–580, August 2007.

[52] Sangil Jwa, Zhijun Tang, and Ümit Özgüner. Robust Data Alignment Based on Information Theory and its Applications in Road Following Situation. In Intelligent Transportation Systems Conference (ITSC), pages 1328–1333, 2006.

[53] Vida Kianzad and Shuvra Bhattacharyya. Efficient Techniques for Clustering and Scheduling onto Embedded Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 17(7):667–680, July 2006.

[54] Jong-Kook Kim, S. Shivle, H.J. Siegel, A.A. Maciejewski, T.D. Braun, M. Schneider, S. Tideman, R. Chitta, R.B. Dilmaghani, R. Joshi, A. Kaul, A. Sharma, S. Sripada, P. Vangari, and S.S. Yellampalli. Dynamic Mapping in a Heterogeneous Environment with Tasks Having Priorities and Multiple Deadlines. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), page 15, April 2003.

[55] Michael Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May-June 2006.

[56] Kathleen Knobe, James M. Rehg, Arun Chauhan, Rishiyur S. Nikhil, and Umakishore Ramachandran. Scheduling Constrained Dynamic Applications on Clusters. In Proceedings of Supercomputing (SC’99), pages 46–61, Portland, Oregon, November 1999.

[57] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, March-April 2005.

[58] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multi-threaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA'04), pages 64–75, Munich, Germany, June 2004.

[59] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 408–419, Madison, Wisconsin, June 2005.

[60] Yu-Kwong Kwok and Ishfaq Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Computing Surveys, 31(4):406–471, December 1999.

[61] Yu-Kwong Kwok and Ishfaq Ahmad. Link Contention-Constrained Scheduling and Mapping of Tasks and Messages to a Network of Heterogeneous Processors. Cluster Computing: J. Networks, Software Tools, and Applications, 3(2):113–124, 2000.

[62] Xiaoyao Liang and David Brooks. Mitigating the Impact of Process Variations on Register Files and Execution Units. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 504–514, 2006.

[63] David M. Lin, James B.Y. Tsui, Lee L. Liou, and Y. T. Jade Morton. Sensitivity Limit of A Stand-Alone GPS Receiver and An Acquisition Method. In Proceedings of ION GPS, pages 1663–1667, Portland, Oregon, September 2002.

[64] G. Q. Liu, K. L. Poh, and M. Xie. Iterative list scheduling for heterogeneous computing. Journal of Parallel and Distributed Computing, 65(5):654–665, 2005.

[65] Jiwei Lu, Abhinav Das, Wei-Chung Hsu, Khoa Nguyen, and Santosh G. Abraham. Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pages 12–23, Barcelona, Spain, November 2005.

[66] Niti Madan and Rajeev Balasubramonian. Power-Efficient Approaches to Redundant Multithreading. IEEE Transactions on Parallel and Distributed Systems, 18(8):1066–1079, August 2007.

[67] Muthucumaru Maheswaran, Shoukat Ali, Howard Siegel, Debra Hensgen, and Richard F. Freund. Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems. In Proceedings of the 8th Heterogeneous Computing Workshop (HCW'99), pages 30–44, San Juan, Puerto Rico, April 1999.

[68] Muthucumaru Maheswaran and Howard Siegel. A Dynamic Matching and Scheduling Algorithm for Heterogeneous Computing Systems. In Proceedings of the Seventh Heterogeneous Computing Workshop, pages 57–69, 1998.

[69] Marshall McKusick and George V. Neville-Neil. The Design and Implementation of the FreeBSD Operating System. Addison-Wesley Professional, 2004.

[70] Bingfeng Mei, Patrick Schaumont, and Serge Vernalde. A Hardware-Software Partitioning and Scheduling Algorithm for Dynamically Reconfigurable Embedded Systems. In 11th ProRISC workshop on Circuits, Systems and Signal Processing, November 2000.

[71] Daniel A. Menasce, Debanjan Saha, Stella C. da Silva Porto, Virgilio A. F. Almeida, and Satish K. Tripathi. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. Journal of Parallel and Distributed Computing, 28(1):1–18, 1995.

[72] Gerald R. Morris, Viktor K. Prasanna, and Richard D. Anderson. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer. In 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), pages 3–12, Napa, California, April 2006.

[73] Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, and Stephen W. Keckler. Design Space Evaluation of Grid Processor Architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), pages 44–51, December 2001.

[74] V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Designing an operating system for a heterogeneous reconfigurable SoC. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), April 2003.

[75] Hyunok Oh and Soonhoi Ha. A static scheduling heuristic for heterogeneous processors. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing, Volume II, pages 573–577, London, UK, 1996. Springer-Verlag.

[76] Leonid Oliker, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski, Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier, Stephane Ethier, and Tom Goodale. Scientific Application Performance on Candidate PetaScale Platforms. In IPDPS '07: Proceedings of the Parallel and Distributed Processing Symposium, March 2007.

[77] Jonathan Phillips, Matthew Areno, Chris Rogers, Aravind Dasu, and Brandon Eames. A Reconfigurable Load Balancing Architecture for Molecular Dynamics. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07), March 2007.

[78] F. Pollack. New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies. In 32nd Annual International Symposium on Microarchitecture (MICRO-32), page 2, November 1999.

[79] Michael D. Powell, Mohamed Gomaa, and T. N. Vijaykumar. Heat-and-Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 260–270, Boston, Massachusetts, October 2004.

[80] Cray Inc. Products. Webpage. http://www.cray.com/products.

[81] Intel Corp. Products. Webpage. http://www.intel.com/products.

[82] SRC Computers Inc. Products. Webpage. http://www.srccomputers.com/products.

[83] M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill Book Company, 1993.

[84] Vincenzo Rana, Marco Santambrogio, Donatella Sciuto, Boris Kettelhoit, Markus Koester, Mario Porrmann, and Ulrich Rückert. Partial Dynamic Reconfiguration in a Multi-FPGA Clustered Architecture Based on Linux. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07), March 2007.

[85] F. Rodriguez-Henriquez, N.A. Saqib, and A. Diaz-Perez. 4.2 Gbit/s single-chip FPGA implementation of AES algorithm. Electronics Letters, 39(15):1115–1116, July 2003.

[86] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73–82, 2008.

[87] Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan, Richard L. Hudson, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Eric Sprangle, Anwar Rohillah, Doug Carmean, and Jesse Fang. Enabling Scalability and Performance in a Large Scale CMP Environment. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 73–86, Lisbon, Portugal, 2007.

[88] Proshanta Saha and Tarek El-Ghazawi. A Methodology for Automating Co-Scheduling for Reconfigurable Computing Systems. In MEMOCODE'07: 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pages 159–168, May 2007.

[89] Proshanta Saha and Tarek El-Ghazawi. Extending Embedded Computing Scheduling Algorithms for Reconfigurable Computing Systems. In SPL '07: Proceedings of the 3rd Southern Conference on Programmable Logic, pages 87–92, February 2007.

[90] Proshanta Saha and Tarek El-Ghazawi. Software/Hardware Co-Scheduling for Reconfigurable Computing Systems. In FCCM'07: 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 299–300, April 2007.

[91] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, and Charles R. Moore. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03), pages 422–433, San Diego, California, June 2003.

[92] T. Sherwood, S. Sair, and B. Calder. Phase Tracking and Prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA-30), pages 336–349, 2003.

[93] Smitha Shyam, Kypros Constantinides, Sujay Phadke, Valeria Bertacco, and Todd Austin. Ultra low-cost defect protection for microprocessor pipelines. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, 2006.

[94] Gilbert C. Sih and Edward A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distributed Systems, 4(2):175–187, 1993.

[95] Oliver Sinnen. Task Scheduling for Parallel Systems. Wiley Series on Parallel and Distributed Computing. John Wiley & Sons, Inc., Hoboken, New Jersey, 2007.

[96] Oliver Sinnen and L. Sousa. Communication Contention in Task Scheduling. IEEE Transactions on Parallel and Distributed Systems, 16(6):503–515, June 2005.

[97] Lodewijk Smit, Johann Hurink, and Gerard Smit. Run-time Mapping of Applications to a Heterogeneous SoC. In Proceedings of the International Symposium on System-on-Chip, pages 78–81, November 2005.

[98] Software Development Kit for Multicore Acceleration, Version 3.0. Accelerated Library Framework for Cell Broadband Engine Programmer’s Guide and API Reference. Technical report, IBM Corp., 2007.

[99] Jon Stokes. A Closer Look at AMD’s CPU/GPU Fusion, November 2006. http://arstechnica.com/news.ars/post/20061119-8250.html.

[100] E.J. Swankoski, R.R. Brooks, V. Narayanan, M. Kandemir, and M.J. Irwin. A Parallel Architecture for Secure FPGA Symmetric Encryption. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), 2004.

[101] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the RAW Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA'04), pages 2–13, Munich, Germany, June 2004.

[102] Justin Teller. Performance Characteristics of an Intelligent Memory System. Master’s thesis, University of Maryland, August 2004.

[103] Justin Teller, Robert Ewing, and Füsun Özgüner. What are the Building Blocks of a Nanoprocessor Architecture? In Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS'05), Cincinnati, Ohio, August 2005.

[104] Justin Teller, Füsun Özgüner, and Robert Ewing. The Morphable Nanoprocessor Architecture: Reconfiguration at Runtime. In Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS'06), San Juan, Puerto Rico, August 2006.

[105] Justin Teller, Füsun Özgüner, and Robert Ewing. Scheduling Reconfiguration at Runtime on the TRIPS Processor. In Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

[106] Justin Teller, Charles B. Silio, and Bruce Jacob. Performance Characteristics of MAUI: An Intelligent Memory System. In Proceedings of the 3rd ACM SIGPLAN Workshop on Memory System Performance (MSP 2005), Chicago, Illinois, June 2005.

[107] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Task scheduling algorithms for heterogeneous processors. In Proceedings of the Heterogeneous Computing Workshop (HCW '99), pages 3–14, San Juan, Puerto Rico, 1999.

[108] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, March 2002.

[109] James B. Tsui. Fundamentals of Global Positioning System Receivers: A Software Approach. Wiley-Interscience: John Wiley & Sons, Inc., 2005.

[110] Oreste Villa, Daniele Paolo Scarpazza, Fabrizio Petrini, and Juan Fernandez Peinador. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-Core Processors. In IPDPS '07: Proceedings of the Parallel and Distributed Processing Symposium, pages 1–10, March 2007.

[111] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 9–20, Ischia, Italy, 2006. ACM Press.

[112] Michael J. Wirthlin, Brad L. Hutchings, and Kent L. Gilson. The Nano Processor: a Low Resource Reconfigurable Processor. In IEEE Workshop on FPGAs for Custom Computing Machines, pages 23–30, Napa, California, April 1994.

[113] Annie Wu, Han Yu, Shiyuan Jin, Kuo-Chi Lin, and Guy Schiavone. An Incremental Genetic Algorithm Approach to Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 15(9):824–834, September 2004.

[114] Min-You Wu. On Runtime Parallel Scheduling for Processor Load Balancing. IEEE Transactions on Parallel and Distributed Systems, 8(2):173–186, February 1997.

[115] Asim YarKhan and Jack Dongarra. Experiments with scheduling using simulated annealing in a grid environment. In GRID '02: Proceedings of the Third International Workshop on Grid Computing, pages 232–242, London, UK, 2002. Springer-Verlag.
