SCHEDULING TASKS ON HETEROGENEOUS CHIP MULTIPROCESSORS WITH RECONFIGURABLE HARDWARE
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Justin Stevenson Teller, B.S., M.S.
*****
The Ohio State University
2008
Dissertation Committee:

Prof. Füsun Özgüner, Adviser
Prof. Ümit Çatalyürek
Prof. Eylem Ekici

Approved by

Adviser
Graduate Program in Electrical and Computer Engineering

© Copyright by
Justin Stevenson Teller
2008

ABSTRACT
This dissertation presents several methods to more efficiently use the computational resources available on a Heterogeneous Chip Multiprocessor (H-CMP). Using task scheduling techniques, three challenges to the effective usage of H-CMPs are addressed: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.
To utilize reconfigurable hardware, we introduce the Mutually Exclusive Processor
Groups reconfiguration model, and an accompanying task scheduler, the Heterogeneous Earliest Finish Time with Mutually Exclusive Processor Groups (HEFT-MEG) scheduling heuristic. HEFT-MEG schedules reconfigurations using a novel back-tracking algorithm to evaluate how different reconfiguration decisions affect previously scheduled tasks. In both simulation and real execution, HEFT-MEG successfully schedules reconfigurations, allowing the architecture to adapt to changing application requirements.
After an analysis of IBM's Cell Processor NoC and generation of a simple stochastic model, we propose a hybrid task scheduling system using a Compile- and Run-time Scheduler (CtS and RtS) that work in concert. The CtS, Contention Aware HEFT (CA-HEFT), updates task start and finish times when scheduling to account for network contention. The RtS, the Contention Aware Dynamic Scheduler (CADS), adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times, using a novel dynamic block algorithm.
We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor.
To enhance fault tolerance, we modify the previously proposed hybrid scheduling system to accommodate variability in processor availability. The RtS is divided into two portions, the Fault Tolerant Re-Mapper (FTRM) and the Reconfiguration and Recovery Scheduler (RRS). FTRM examines the current processor availability and remaps tasks to the available set of processors. RRS changes the reconfiguration schedule so that the reconfigurations more accurately reflect the new hardware capabilities. The proposed hybrid scheduling system enables application performance to gracefully degrade when processor availability diminishes, and increase when processor availability increases.
Dedicated to my wonderful wife, Lindsay.
ACKNOWLEDGMENTS
I would like to thank Prof. Füsun Özgüner for being my adviser, and for providing me with the guidance to finish my graduate degree. Especially, I want to thank you for recruiting me. My Ph.D. topic would be vastly different had I not been able to come to and work at Ohio State.
I would also like to sincerely thank Prof. Ümit Çatalyürek and Prof. Eylem Ekici. You are truly among the best professors I have had the honor to study with in my graduate work. Your contributions to my education cannot be overstated.
I would like to thank Tim Hartley, for extremely constructive discussions concerning the Cell processor, parallel processing, and StarCraft.
I would like to sincerely thank Dr. Robert Ewing, AFRL, for his insightful conversations and guidance when working on base. I am grateful to Al Scarpelli, AFRL, for his support and help in providing access to the TRIPS system and developers.
Of course, none of this would have been possible without the love and support
of my family. My wife Lindsay was incredibly supportive, and I especially want to
thank my parents, brothers, and all of the Highfields: my “Columbus family.”
Finally, I would like to acknowledge the Dayton Area Graduate Studies Institute
for providing support for my Ph.D. studies through a joint research fellowship.
VITA
April 19, 1980 ...... Born - Downer’s Grove, Illinois
2002 ...... B.S. in Electrical Engineering, Ohio University, Athens, Ohio
2004 ...... M.S. in Electrical Engineering, University of Maryland, College Park, Maryland
2004 ...... Givens Associate in parallel processing at the MCS division at Argonne National Laboratory
2005 – present ...... Air Force Research Laboratory/Dayton Area Graduate Studies Institute Fellow
PUBLICATIONS
1. Justin Teller, Füsun Özgüner, and Robert Ewing, "Scheduling Task Graphs on Reconfigurable Hardware," to appear in the 37th International Conference on Parallel Processing (ICPP-08), SRMPDS Workshop, Portland, Oregon, September 2008.

2. Justin Teller, Füsun Özgüner, and Robert Ewing, "Optimization at Runtime on a Nanoprocessor Architecture," to appear in the 31st IEEE Annual Midwest Symposium on Circuits and Systems, Knoxville, Tennessee, August 2008.

3. Justin Teller, Füsun Özgüner, and Robert Ewing, "Scheduling Reconfiguration at Runtime on the TRIPS Processor," in Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), RAW Workshop, Miami, Florida, April 2008.

4. Justin Teller, "Matching and Scheduling on a Heterogeneous Chip Multi-Processor," presentation at the ASME Dayton Engineering Sciences Symposium, October 29, 2007.

5. Justin Teller, "Reconfiguration at Runtime with the Nanoprocessor Architecture," presentation at the ASME Dayton Engineering Sciences Symposium, October 30, 2006. Selected for an Outstanding Presentation Award.

6. Justin Teller, Füsun Özgüner, and Robert Ewing, "The Morphable Nanoprocessor Architecture: Reconfiguration at Runtime," in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS '06), San Juan, Puerto Rico, August 6-9, 2006.

7. Justin Teller, Füsun Özgüner, and Robert Ewing, "What are the Building Blocks of a Nanoprocessor Architecture?" in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS '05), Cincinnati, Ohio, August 7-10, 2005.

8. Justin Teller, Charles B. Silio, and Bruce Jacob, "Performance Characteristics of MAUI: An Intelligent Memory System Architecture," in Proceedings of the 3rd ACM SIGPLAN Workshop on Memory Systems Performance (MSP 2005), Chicago, Illinois, June 12, 2005.

9. Mark Hereld, Rick Stevens, Justin Teller, Wim van Drongelen, and Hyong Lee, "Large Neural Simulations on Large Parallel Computers," International Journal of Bioelectromagnetism (IJBEM), vol. 7, no. 1, May 2005.
FIELDS OF STUDY
Major Field: Electrical and Computer Engineering
Studies in: Parallel Processing; Computer Architecture
TABLE OF CONTENTS
Page
Abstract ...... ii
Dedication ...... iv
Acknowledgments ...... v
Vita ...... vi
List of Tables ...... xi
List of Figures ...... xii
Chapters:
1. Introduction ...... 1
1.1 Current Trends ...... 1
1.1.1 Chip Multiprocessors ...... 1
1.1.2 Heterogeneous Processing Cores ...... 3
1.1.3 Reconfigurable Hardware in General Purpose Computing ...... 4
1.1.4 Intermittent Hardware Faults ...... 5
1.2 Summary ...... 5
2. Background, Prior Work, and Motivation ...... 8
2.1 Reconfigurable Hardware ...... 8
2.1.1 Scheduling on Reconfigurable Hardware ...... 9
2.2 Task Scheduling for Heterogeneous Systems ...... 12
2.2.1 Matching and Scheduling Heuristics ...... 13
2.2.2 HEFT List Scheduler ...... 14
2.2.3 Scheduling Network Access ...... 16
2.2.4 Dynamic Schedulers ...... 17
2.3 Intermittent Faults ...... 18
2.3.1 Sources of Faults ...... 18
2.3.2 Fault Tolerance in Chip Multiprocessors ...... 20
2.4 Motivation ...... 21
2.4.1 GPS Acquisition on the TRIPS Processor ...... 21
2.4.2 RDA on the Cell Processor ...... 23
3. Scheduling on Reconfigurable Hardware ...... 26
3.1 Introduction ...... 26
3.2 Reconfiguration Model: Mutually Exclusive Processor Groups ...... 27
3.3 HEFT with Mutually Exclusive Processor Groups ...... 30
3.3.1 -MEG Scheduling Extension ...... 30
3.3.2 Generating New Configurations ...... 33
3.3.3 HEFT-MEG Time Complexity ...... 40
3.4 Results ...... 43
3.4.1 Simulation Results ...... 43
3.4.2 Results on TRIPS ...... 52
4. The Modeling and Scheduling of Network Access ...... 57
4.1 Introduction ...... 57
4.2 The Cell Processor's Network on a Chip ...... 58
4.2.1 Cell's NoC: The EIB ...... 59
4.2.2 Cell EIB: In-Network Contention ...... 60
4.3 Communication Model ...... 63
4.3.1 Calculating End-Point Contention ...... 64
4.3.2 Calculating NoC Contention ...... 65
4.3.3 Experimental Verification: NoC Contention ...... 72
4.4 Software System Overview ...... 74
4.5 Scheduling on the Cell Processor ...... 75
4.5.1 Compile Time Scheduling ...... 75
4.5.2 Run Time Scheduling ...... 77
4.6 Scheduling Results ...... 81
5. Fault Tolerance with Reconfigurable Hardware ...... 89
5.1 Introduction ...... 89
5.2 Proposed Failure Model ...... 91
5.3 Mutually Exclusive Processor Groups Revisited ...... 91
5.4 Run-Time Scheduler ...... 94
5.4.1 Fault Tolerant Re-mapper ...... 96
5.4.2 Reconfiguration and Recovery Scheduler ...... 100
5.5 Simulation Results ...... 104
6. Conclusions ...... 109
6.1 Contributions ...... 109 6.2 Future Work ...... 113
Bibliography ...... 117
LIST OF TABLES
Table Page
5.1 Results for four node system, CCR = 1.0...... 106
5.2 Results for two node system, CCR = 1.0 ...... 107
5.3 Results for four node system, CCR = 0.25...... 107
5.4 Results for two node system, CCR = 0.25 ...... 107
LIST OF FIGURES
Figure Page
1.1 Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown. ...... 2
2.1 A chromosome for the partitioning algorithm in Mei, et al [70]. . . . . 10
2.2 Partitioning a DAG into blocks [68] ...... 19
2.3 Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor...... 22
2.4 Comparing the performance of Cell’s SPE to Intel’s processors [81] on the RDA application...... 24
3.1 Illustrating mutually exclusive processors with a group of possible configurations for an FPGA. ...... 29
3.2 Illustrating mutually exclusive processors with the TRIPS processor configurations...... 30
3.3 Scheduling a DAG fragment onto RH using HEFT-MEG...... 34
3.4 Illustrating FindSmartConfs algorithm...... 37
3.5 Continuation of Figure 3.4. Illustrating the generation of m − 1 other configurations, and their testing in HEFT-MEG...... 38
3.6 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of nodes in the architecture...... 42
3.7 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of tasks in the DAG. ...... 42
3.8 Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a one node architecture...... 46
3.9 Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a two node architecture...... 47
3.10 Normalized schedule length for random DAGs varying the number of nodes in the architecture between 1 and 4. ...... 48
3.11 Normalized schedule length for random DAGs varying the relative reconfiguration time. ...... 49
3.12 Normalized schedule length vs. matrix size for Laplace Transform DAGs. 50
3.13 Normalized schedule length vs. matrix size for LU Decomposition DAGs. 50
3.14 Normalized schedule length vs. matrix size for Gaussian Elimination DAGs...... 51
3.15 Directed Acyclic Task Graph (DAG) of the GPS Acquisition algorithm. 53
3.16 GPS Acquisition’s schedule when HEFT-MEG is used for scheduling. 55
3.17 Comparing the runtime of GPS Acquisition using different schedules. 56
4.1 Block level diagram illustrating the topology of the Cell Processor’s NoC...... 59
4.2 Illustrating the operation of the Cell NoC. The latencies of messages 0 → 3, 1 → 4, and 2 → 6 depend on ordering by the arbiter and sharing of links on the NoC...... 61
4.3 Illustrating concurrent, independent communications reducing the realized bandwidth of other messages in the system. ...... 66
4.4 Illustrating how two messages can overlap with a test message. a) Neither message affects the test message. b) One message affects the test message. c) Both messages overlap, independently. d) One message overlaps, but both messages share an end-point. e) Both messages overlap with the test message over one link. ...... 69
4.5 Comparing the predicted pdf and experimental relative frequency of a test message’s latency for 2, 3, and 5 concurrent messages...... 73
4.6 System overview. Applications are represented as a task graph. . . . 75
4.7 Operation of the CADS re-mapper...... 78
4.8 Normalized schedule length for random DAGs varying the CCR between 0.01 and 10. ...... 83
4.9 Normalized schedule length for random DAGs varying the number of tasks between 200 and 800. ...... 84
4.10 Normalized schedule length for Gaussian elimination DAGs varying the CCR between 0.01 and 10...... 85
4.11 Normalized schedule length for Gaussian elimination DAGs varying the matrix size between 5 and 45...... 85
4.12 Normalized schedule length for LU decomposition DAGs varying the CCR between 0.01 and 10...... 86
4.13 Normalized schedule length for LU decomposition DAGs varying the matrix size between 5 and 45...... 86
4.14 Normalized schedule length for Laplace transform DAGs varying the CCR between 0.01 and 10...... 87
4.15 Normalized schedule length for Laplace transform DAGs varying the matrix size between 5 and 45...... 88
5.1 Illustrating processor availability changes on an FPGA...... 93
5.2 System overview...... 95
5.3 Operation of the FTRM re-mapper. a) The original schedule, annotated to indicate the active block after t1 is scheduled. b) FTRM decides to schedule task t3 to processor P3. c) The scheduling decision is not reflected in the original schedule, but the active block is updated. ...... 97
5.4 Illustrating the extraction of the configuration schedule...... 101
CHAPTER 1
INTRODUCTION
Consisting of a mix of processing units (cores) that are targeted for different types of computations, Heterogeneous Chip Multiprocessors (H-CMPs) can efficiently run a diverse mix of applications [7, 45, 48, 58, 91, 97, 99, 101, 103, 104]. Figure 1.1 illustrates a hypothetical H-CMP containing twelve processing cores of four types: simple processing cores (in-order, short pipeline, etc.), vector processors, a complex processing core (out-of-order, deep pipeline, etc.), and a Reconfigurable Hardware
(RH) processor.
In this dissertation, we present several methods to more efficiently use the computational resources available on an H-CMP. Using scheduling techniques, we address three challenges to the effective usage of H-CMPs: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.
1.1 Current Trends
1.1.1 Chip Multiprocessors
Multi-core processors and chip multiprocessors (CMPs) are becoming more commonplace, as even commodity off-the-shelf (COTS) processors are integrating several
Figure 1.1: Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown.
processing cores onto a single chip [81, 99, 57]. CMP architectures have demonstrated benefits for processing throughput and power efficiency [48, 59, 104, 110]. Additionally, there are proposed architectures that utilize multiple cores for redundant processing to recover from transient faults and radiation-induced errors [66, 24, 35].
H-CMPs targeted to general purpose and high-performance computing have already been introduced or proposed [7, 48, 58]. As future solutions integrate more processing cores onto a single chip, managing the computational resources becomes more difficult [1, 5, 9, 49, 57]. While a number of CMPs are currently being marketed and researched, the development of quality software tools to enable efficient utilization of CMPs is expected to be a significant roadblock to their future use [87].

1.1.2 Heterogeneous Processing Cores
State-of-the-art H-CMPs with slightly different cores have already been introduced or proposed. One example is General Purpose computation on Graphics Processing Unit (GPGPU) architectures, such as nVidia's G80 GPU core using CUDA (Compute Unified Device Architecture), an interface that allows users to write high-performance programs for any compute-intensive task in the standard C language [86]. Also, the paper by Kumar, et al. [58] proposes a single-ISA multi-core architecture with cores of varying sizes, performance, and power consumption as a way to provide significantly higher performance in the same area as a conventional chip multiprocessor. There has been significant work into what combination of core types and interconnects yields the highest performance [59, 7, 106, 102], indicating that future solutions are moving towards more heterogeneity as more specialized cores are integrated onto a single chip.

One important commercial example of an H-CMP is the Cell processor, developed jointly by IBM, Sony, and Toshiba and originally designed for the Sony PlayStation 3 gaming system [48]. The Cell processor consists of nine processing cores of two different types: a single Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs) [45] connected by a high-speed NoC [55]. The PPE is a traditional 64-bit PowerPC processor; it runs the operating system and can be programmed using a traditional compiler tool-chain. Conversely, each SPE is a high-performance vector engine lacking traditional cache and branch prediction units [78]. Instead of a traditional cache, each SPE uses a software-managed Local Store (LS) memory [45, 78]. The SPE is therefore optimized for data-parallel code with simple control structures, making it a promising architecture for a variety of applications [111, 110, 76].
1.1.3 Reconfigurable Hardware in General Purpose Computing
Reconfigurable hardware is attractive for general purpose computing, as the performance and flexibility of Field Programmable Gate Arrays (FPGAs) and other reconfigurable architectures have enabled system developers to achieve high levels of performance for a variety of applications [88, 77, 85, 100]. The promise of high performance coupled with low power consumption has inspired several commercial offerings coupling FPGAs with general purpose processors, including the Cray XD1 and XT5h [80] and the SRC Mapstation 7 [82]. The SRC-7 system is one interesting architecture and programming environment, integrating an FPGA-based reconfigurable hardware system into the memory system of a traditional personal computer (PC) architecture [82]. With the addition of SRC's Carte programming environment, an application developer can focus on the utilization of the FPGA for application acceleration, resulting in significant performance benefits in general purpose and high-performance computing applications [72].
Polymorphous Computing Architectures (PCAs) are a second class of reconfigurable computing used in general purpose computing. PCAs reconfigure in a coarse-grained manner and target applications showing high variability in computational requirements [21]. The TRIPS processor, developed at the University of Texas at Austin [19], is an important PCA architecture. Tiled to support both instruction- and thread-level parallelism, the TRIPS processor has two different configurations, or "morphs": the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph).
1.1.4 Intermittent Hardware Faults
As more processing resources are integrated onto a single chip, the possibility of experiencing faults increases [14, 29, 15]. These hardware errors can have effects lasting a wide range of time scales, and effectively make the set of logical processors available for execution a dynamic quantity that can both decrease and increase during an application's execution. Even though the exact rate of faults for future processors is not known, the rate of intermittent hardware faults is expected to increase in the future due to increased cross-talk, voltage and temperature variations, and decreased noise margins [25, 14].
1.2 Summary
In this dissertation, we propose scheduling methods for Heterogeneous Chip Multiprocessors (H-CMPs) that address three important areas: utilization of reconfigurable hardware for general purpose computing, consideration of shared network-on-a-chip resources when scheduling, and fault tolerance. The dissertation is composed of three main parts.
In Chapter 3, we address the problem of scheduling applications represented as directed acyclic task graphs (DAGs) onto architectures with reconfigurable processing cores. We introduce the Mutually Exclusive Processor Groups reconfiguration model, a novel reconfiguration model that captures many different modes of reconfiguration. Additionally, we propose the Mutually Exclusive Processor Groups (-MEG) list scheduling extension. The -MEG extension uses a novel back-tracking algorithm to schedule reconfigurations and evaluate how different reconfiguration decisions affect previously scheduled tasks. While the -MEG extension can be used with any list
scheduler, we demonstrate our scheduler by extending HEFT (proposed by Topcuoglu et al. [108]) to create HEFT-MEG. In simulation, we find that HEFT-MEG generates higher quality schedules than the hardware-software co-scheduler proposed by Mei, et al. [70] and than HEFT [108] using a single configuration, by choosing efficient configurations for different application phases. Additionally, we used HEFT-MEG to schedule for the polymorphous TRIPS processor. In actual execution, we found that using the HEFT-MEG scheduler improves the performance of GPS Acquisition, a software radio application, by about 20% compared to the best single-configuration schedule on the same hardware.
In Chapter 4, we perform an analysis of the Cell processor NoC and introduce
a simple stochastic model to predict message latency based on the number of other
competing messages communicating concurrently on the network. Using this model,
we propose a hybrid scheduling system using a Compile-time Scheduler (CtS) and
Run-time Scheduler (RtS) that work in concert. The proposed CtS is built using a novel Contention Aware (CA-) list scheduling extension. While the CA- extension could be used with any list scheduler, we demonstrate the scheduling extension using the HEFT scheduler proposed by Topcuoglu et al. [108], to create CA-HEFT. Next, we propose the Contention Aware Dynamic Scheduler (CADS) runtime re-mapper as the RtS. At runtime, CADS adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times. CADS uses a novel dynamic block algorithm that updates the active block of tasks depending on run-time scheduling decisions, the schedule generated by CA-HEFT, and actual task finish times. We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor. As
the Communication to Computation Ratio (CCR) increases, the performance benefit of using CA-HEFT and CADS to schedule "around" communication contention increases, resulting in up to a 60% reduction in execution time.
In Chapter 5, we introduce a fault tolerant extension to the Mutually Exclusive
Processor Groups model. We expand on the hybrid scheduler proposed in Chapter
4, using HEFT-MEG as the CtS portion of the hybrid scheduler. The RtS is divided into two portions: a high-cost recovery scheduler and a low-cost re-mapper. The low-cost re-mapper redirects tasks based on actual system conditions. Named the Fault-Tolerant Re-Mapper (FTRM), the re-mapper examines the current processor availability and, using the schedule generated at compile time, remaps tasks to the available set of processors. The high-cost recovery scheduler is named the Reconfiguration and Recovery Scheduler (RRS) and specifically addresses the opportunities that arise when designing a fault-tolerant system for reconfigurable hardware. RRS examines the changes in processor availability and determines a new configuration schedule, inserting new reconfiguration tasks into the task graph. The recovery can take a relatively long time (total reconfiguration of an FPGA can take upwards of 10 ms [28, 34]), but allows the RtS to adjust the configuration schedule to account for changes in processor availability.
CHAPTER 2
BACKGROUND, PRIOR WORK, AND MOTIVATION
2.1 Reconfigurable Hardware
While FPGAs are an important class of fine-grained Reconfigurable Hardware
(RH), Polymorphous Computing Architectures (PCAs) represent a different class of reconfigurable computing. PCAs can reconfigure in a coarse-grained manner and target applications showing high variability in computational requirements [21]. Compared with FPGA RH, a PCA's organization enables faster reconfiguration times and clock speeds at the expense of fewer possible configurations [21, 46, 73]. One important PCA architecture is the TRIPS processor, developed at the University of Texas at Austin [19].
The TRIPS processor's current implementation has two different configurations, or "morphs": the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph). The D-Morph allocates all on-chip resources to a single thread, using the resources to support a large number of in-flight instructions for speculative execution. Conversely, the T-Morph statically allocates on-chip resources to four threads, so each thread is allocated 1/4 of the on-chip resources. This limits the amount of speculative execution available to each thread in the T-Morph, as compared to the D-Morph [19, 91]. Due to these differences, the D-Morph efficiently executes applications with Instruction Level Parallelism (ILP), while the T-Morph efficiently executes applications with high Thread Level Parallelism (TLP). We obtained access to a TRIPS evaluation board through our collaboration with the Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105].
2.1.1 Scheduling on Reconfigurable Hardware
While reconfiguration at runtime has been previously studied, most studies focus on offloading specific functions onto FPGAs [20, 47, 112] or determining an efficient partitioning of work between a microprocessor and some number of FPGA soft-processors (sometimes categorized as hardware-software co-design) [70, 89, 90]. Additionally, a number of examples in the literature propose scheduling methods that target only the FPGA [28, 32, 34].
One interesting hardware-software partitioning and scheduling approach was proposed in Mei, et al. [70]. The scheduler proposed in [70] uses a Genetic Algorithm (GA) that searches for a good partitioning of tasks between a single microprocessor and some number of soft-processors on a single FPGA. Ensuing use of the term Mei00 will refer to the scheduler described by Mei, et al. [70]. Mei00 uses a simple gene structure to describe the mapping of tasks to either a general purpose CPU or the FPGA, as shown in Figure 2.1. Mei00 then determines the most fit individuals in a particular generation using a cost function based on accumulated violation, or tardiness. The tardiness of a particular task is the amount of time by which it misses its deadline after being scheduled. The goal of Mei00 is to find a schedule with zero tardiness, which is also a solution that meets all timing constraints [70].
Mei00’s GA main loop has the following steps [70]:
Figure 2.1: A chromosome for the partitioning algorithm in Mei, et al. [70].
1. Initialization. To seed the initial population with diversity, each individual's chromosome is randomly generated by setting each gene to either 1 or 0.

2. Evaluation and Fitness. The scheduler is invoked, and the tardiness of each individual's resulting schedule is calculated.

3. Selection. Reproduction trials are run on chromosomes using the standard tournament selection strategy.

4. Crossover and Mutation. Crossover and mutation operations are applied to selected parent individuals.

5. Update Population. New individual fitness values are recalculated and lower-fitness individuals are discarded.

6. Stop Criteria. If one of the stop criteria is met (either the maximum number of generations or a solution with zero tardiness), the algorithm stops. Otherwise, it repeats steps 3 through 6.
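Treating the list scheduler as a black-box tardiness function, the loop above can be sketched as follows; all names and parameter values are illustrative, not taken from [70]:

```python
import random

def ga_partition(num_tasks, tardiness_of, pop_size=20, max_gens=100,
                 p_cross=0.9, p_mut=0.05):
    """Sketch of the Mei00 GA main loop. tardiness_of(chromosome)
    stands in for the list scheduler: it returns the accumulated
    deadline violation of the schedule implied by the chromosome."""
    # Step 1 -- Initialization: random 0/1 genes (0 = CPU, 1 = FPGA).
    pop = [[random.randint(0, 1) for _ in range(num_tasks)]
           for _ in range(pop_size)]
    for _ in range(max_gens):
        # Step 2 -- Evaluation: lower tardiness means higher fitness.
        best = min(pop, key=tardiness_of)
        # Step 6 -- Stop criterion: a zero-tardiness schedule.
        if tardiness_of(best) == 0:
            return best
        # Step 3 -- Selection: binary tournament.
        def pick():
            a, b = random.sample(pop, 2)
            return a if tardiness_of(a) <= tardiness_of(b) else b
        # Step 4 -- Crossover and mutation on selected parents.
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if random.random() < p_cross:
                cut = random.randrange(1, num_tasks)
                p1 = p1[:cut] + p2[cut:]
            children.append([1 - g if random.random() < p_mut else g
                             for g in p1])
        # Step 5 -- Update: discard the lower-fitness individuals.
        pop = sorted(pop + children, key=tardiness_of)[:pop_size]
    return min(pop, key=tardiness_of)
```

Re-targeting the fitness to overall schedule length, as described later in this section, requires changing only the `tardiness_of` callback.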
Mei00 uses a list scheduler to evaluate individuals in the GA [70]. Described in more detail in the next section, list schedulers in general operate by computing a priority for each task, choosing the task with the highest priority, and placing that task on a particular processor to execute at a particular time. Mei00 uses a dynamic priority scheme, given by:
priority(t) = −(ASAP_dyna(t) + ALAP(t))    (2.1)
The ASAP_dyna value is the earliest time a task could possibly execute, based on processor availability, while ALAP is the negative of the task's "distance" from the bottom of the graph [70]. Unlike static priority calculations, once a task is scheduled, all ASAP_dyna values are recalculated to reflect the current status. A larger ASAP time means the task must be scheduled later, so it has a lower priority. Similarly, a larger ALAP value means the task can be executed later, so the task has a lower priority. Then, priority(t) is further modified to account for the reconfiguration overhead: basically, if the task can reuse a configuration already on the FPGA, it is given higher priority when scheduling [70].
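Equation 2.1 can be read as the following sketch, where `succ` maps each task to its successors and `w` gives execution times; taking a task's "distance" from the bottom to be the longest path to an exit task, including the task's own cost, is our assumption, since [70] does not pin down the exact measure here:

```python
def bottom_distance(t, succ, w):
    """Longest path from task t to the bottom of the graph,
    counting t's own execution time (illustrative definition;
    a memo table would be needed for large DAGs)."""
    if not succ[t]:
        return w[t]
    return w[t] + max(bottom_distance(s, succ, w) for s in succ[t])

def priority(t, asap_dyna, succ, w):
    # ALAP is the negative of the task's distance from the bottom;
    # larger ASAP_dyna or ALAP values yield a lower priority.
    alap = -bottom_distance(t, succ, w)
    return -(asap_dyna[t] + alap)
```

For a two-task chain a → b with w = {a: 2, b: 3} and ASAP_dyna = {a: 0, b: 2}, task a gets priority 5 and b gets 1, so a is chosen first.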
The paper by Mei, et al. [70] schedules 3, 4, or 5 task graphs consisting of an average of 10 tasks onto a single-microprocessor, single-FPGA system. They chose a mix of task graphs because it resembles several periodic real-time tasks, each with its own deadline.
In Chapter 3 we compare our RH scheduler to Mei00. This choice was made because of Mei00's flexibility [70]. Although it was originally targeted to a system with a single FPGA and microprocessor, Mei00 can easily be extended to multiple-microprocessor, multiple-FPGA systems by changing the gene representation in Figure 2.1 to include more than a single bit per task. Secondly, changing the fitness value to the overall schedule length allows Mei00 to be re-targeted to task graphs where individual tasks have no deadline, and the goal is to reduce the time needed to execute a particular set of tasks.
The Reconfigurable computing Co-Scheduler (ReCoS) is another co-scheduler that targets single-microprocessor, single-FPGA workstations [89]. ReCoS is a clustering scheduler that clusters tasks to execute on a particular processor. In this case, ReCoS chooses tasks for a particular cluster based on their similarity and possibility to co-execute on the FPGA [88]. Then, ReCoS iterates over the clustering, redistributing tasks to try to minimize the time required for execution on the microprocessor and FPGA and to maximize FPGA utilization [90]. We chose not to compare our reconfigurable scheduler to ReCoS in the forthcoming chapters. Because ReCoS was targeted only to the scheduling and placement of logical processors within the FPGA, it was not flexible enough to schedule within our proposed reconfiguration model.
2.2 Task Scheduling for Heterogeneous Systems
An important part of the parallelization process is allocating tasks to processors and determining the order of execution. This scheduling can be performed either before the application executes (at compile time) or while the application is executing (at run time). Compile-time scheduling is designated static scheduling, and uses estimates of task execution and communication times when scheduling.
Run-time scheduling is called dynamic scheduling, and actual application behavior can be used when scheduling. While dynamic schedulers can use more accurate information when scheduling than their static counterparts, they must respond in real time to be useful. To achieve real-time response, dynamic schedulers perform lower-complexity analysis than static schedulers.
2.2.1 Matching and Scheduling Heuristics
Historically called Mapping and Scheduling for homogeneous systems and Matching and Scheduling for heterogeneous systems, scheduling a task graph representing an application is a well-studied problem [95, 50, 39]. For scheduling on a homogeneous parallel system, an application is represented as a Directed Acyclic Task
Graph (DAG), G = (V, E, w, c), where the nodes V represent the application tasks and the edges E the communications (data dependencies) between tasks. The weight w(v) associated with node v ∈ V represents its computation cost, and the weight
c(e) associated with e ∈ E represents its communication cost. The model is similar
for heterogeneous systems, except that the task’s computation and communication
costs depend on the processor executing the task [18, 43]. Unfortunately, the optimal
scheduling of an arbitrary DAG onto a limited number of processors is NP-hard [83],
so most solutions present in the literature propose heuristics to find near-optimal
solutions.
Static scheduling heuristics can be loosely broken into three categories: guided stochastic search, clustering, and list-based schedulers. Guided stochastic search schedulers use genetic [113, 38], simulated annealing [115], or other randomized search
methods to search through possible schedules for near-optimal solutions. Clustering heuristics have two steps: first, application tasks are clustered together to run on a single processor in an attempt to reduce communication time; then, the execution order is defined [12, 27].
List scheduling heuristics are a common framework for scheduling. A list scheduler's basic idea is to generate a scheduling list (a sequence of nodes ordered by some priority), then repeatedly execute the following two steps until all the nodes in the DAG are scheduled [60, 95]:
1. Remove the first node from the scheduling list.
2. Allocate the node to a processor that minimizes some cost function.
List schedulers differ in the definition of the listing priority and the scheduling cost function. List scheduling is a simple, well-performing, and well-studied scheduling approach, with a large number of list schedulers present in the literature [11, 37, 41, 60, 61, 64, 71, 75]. Another important category of list schedulers attempts to avoid interprocessor communication by duplicating task execution [2, 6, 16, 17, 36, 53].
2.2.2 HEFT List Scheduler
Heterogeneous Earliest Finish Time (HEFT) is a heuristic often used as a benchmark to evaluate other heterogeneous scheduling algorithms, owing to its simplicity and ability to generate high-quality schedules [107, 108]. HEFT is a static list scheduler, so task priorities do not change while scheduling and are calculated only once. A task's listing priority is its bottom rank, defined as:
rankb(ni) = wi + max_{nj ∈ succ(ni)} (ci,j + rankb(nj))    (2.2)
where succ(ni) is the set of immediate successors of task ni, ci,j is the average communication cost of the edge between ni and nj, and wi is the average computation cost of task ni. Exit tasks (tasks without successors) have a bottom rank equal to:
rankb(nexit) = wexit (2.3)
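To illustrate Equations 2.2 and 2.3, bottom ranks can be computed by recursing over successors and memoizing the results. The Python sketch below uses a small hypothetical four-task graph; the task names and average costs are purely illustrative and do not come from the dissertation's experiments:

```python
# Sketch: computing bottom ranks (Equations 2.2 and 2.3) by recursion
# over successors. Task names, costs, and the graph are illustrative.
from functools import lru_cache

w = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}           # average computation costs
c = {("A", "B"): 2.0, ("A", "C"): 1.0,                 # average communication costs
     ("B", "D"): 3.0, ("C", "D"): 2.0}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

@lru_cache(maxsize=None)
def rank_b(ni):
    if not succ[ni]:                  # exit task: rank_b = w_i (Equation 2.3)
        return w[ni]
    # Equation 2.2: w_i plus the costliest (edge + successor rank) path
    return w[ni] + max(c[(ni, nj)] + rank_b(nj) for nj in succ[ni])

# The scheduling list orders tasks by decreasing bottom rank
order = sorted(w, key=rank_b, reverse=True)
print(order)   # -> ['A', 'B', 'C', 'D']
```

Here rank_b("D") = 1, rank_b("B") = 3 + (3 + 1) = 7, rank_b("C") = 5, and rank_b("A") = 4 + max(2 + 7, 1 + 5) = 13, so the entry task is listed first.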
Before defining HEFT’s scheduling cost function, we define several other functions.
HEFT uses an insertion-based policy that considers inserting tasks into idle time slots between two already-scheduled tasks on a processor, as originally described in [108]. Assuming that Ij is the set of idle time slots on processor pj and each time slot s has a start time of ss and an end time of se, we define the set of appropriate idle time slots for task ni on processor pj as:
Aj = { s : s ∈ Ij ∧ (tm + wi(pj)) ≤ se }    (2.4)
where wi(pj) is the runtime of task ni on processor pj, and tm is defined as:
tm = max{tr(ni, pj), ss} (2.5)
tr(ni, pj) is the time all data generated by ni’s immediate predecessors would be
available to processor pj.
HEFT then defines the scheduling cost function to be the Earliest Finish Time (EFT) of task ni on processor pj [107]:

EFT(ni, pj) = min_{s ∈ Aj} { tm + wi(pj) }    (2.6)
Using EFT as a cost function allows HEFT to schedule tasks onto heterogeneous
processors and networks, as execution time differences are taken into account when
scheduling. Algorithm 1 is a pseudo-code representation of the HEFT scheduling
heuristic.
Algorithm 1 HEFT Scheduling Heuristic [108]
1: procedure HEFT(G = (V, E, w, c))    ▷ G is a task graph
2:     Compute rankb for all tasks t ∈ V    ▷ Using Equation 2.2
3:     Sort the tasks in decreasing order by rankb and put in list
4:     while there are unscheduled tasks in list do
5:         Select the first task in the list, ni, and remove it from list
6:         for all processors pj do
7:             Evaluate EFT(ni, pj), saving the minimum EFT    ▷ Using Equation 2.6
8:         end for
9:         Schedule task ni on the processor px with the minimum EFT
10:     end while
11: end procedure
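The insertion-based placement of Equations 2.4 through 2.6 can be sketched for a single processor as follows. This is our own illustrative Python reading, not code from [107, 108]: the processor's schedule is held as a list of busy intervals, and the idle slots are the gaps between them (plus the open-ended tail). All numbers are made up.

```python
# Sketch of HEFT's insertion-based placement (Equations 2.4-2.6).
# busy: list of (start, end) busy intervals on one processor; the idle
# slot before each busy interval runs from the previous interval's end
# (the slot start, ss) to that interval's start (the slot end, se).

def eft(busy, t_ready, w_ij):
    """Earliest finish time of a task with runtime w_ij whose input data
    is ready at t_ready, considering insertion into idle slots."""
    prev_end = 0.0                              # current idle slot starts here (ss)
    for b_start, b_end in sorted(busy) + [(float("inf"), float("inf"))]:
        t_m = max(t_ready, prev_end)            # Equation 2.5: tm = max{tr, ss}
        if t_m + w_ij <= b_start:               # slot end se = b_start (Equation 2.4)
            return t_m + w_ij                   # Equation 2.6: earliest feasible finish
        prev_end = b_end
    return float("inf")                         # unreachable: the tail slot always fits

# A gap [3, 7) lies between the busy intervals below; a task of runtime 2
# that is data-ready at time 4 can be inserted there and finish at 6.
busy = [(0.0, 3.0), (7.0, 9.0)]
print(eft(busy, t_ready=4.0, w_ij=2.0))   # -> 6.0
```

A task of runtime 5 would not fit in the gap and would be placed after the last busy interval, finishing at 14.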
2.2.3 Scheduling Network Access
The literature contains a number of heuristics that consider network contention when scheduling. A simple approach examines only end-point contention.
One such example is the one-port model, which treats the network port of a processor as able to accommodate only a single input or output at a time [10]. This effectively limits the total I/O bandwidth available to each processor, and forces the scheduling heuristic to schedule access to each processor's network port without having to consider the network topology. Similarly, the parameter g in the LogP model captures the amount of communication a processor can accommodate simultaneously [31].
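To make the end-point restriction concrete, the sketch below serializes transfers under a one-port-style rule: a transfer may begin only once both end-points' ports are free. This is our own minimal Python reading of the model, not code from [10] or [31]; processor names and times are illustrative.

```python
# Sketch: serializing transfers under the one-port model. Each
# processor's single network port handles one transfer at a time, so a
# transfer starts only when both end-points' ports are free.

def schedule_transfers(transfers):
    """transfers: list of (src, dst, release_time, duration).
    Returns start times under one-port end-point contention,
    scheduling in list order (a simple FIFO policy)."""
    port_free = {}                    # processor -> time its port frees up
    starts = []
    for src, dst, release, dur in transfers:
        start = max(release, port_free.get(src, 0.0), port_free.get(dst, 0.0))
        port_free[src] = port_free[dst] = start + dur
        starts.append(start)
    return starts

# Two transfers out of P0 cannot overlap, even to different destinations:
print(schedule_transfers([("P0", "P1", 0.0, 4.0),
                          ("P0", "P2", 0.0, 3.0)]))   # -> [0.0, 4.0]
```

Transfers with disjoint end-points, by contrast, proceed in parallel, which is exactly the topology-free behavior the one-port model assumes.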
The literature also contains scheduling models that consider other modes of network contention. A number of approaches model edges on the network graph as processors that execute only communication tasks. These hypergraph-based schemes schedule access to the network, and can more accurately model actual network conditions [22, 40, 61, 94, 96].
Unlike previous work, in Chapter 4 we consider only a single communication architecture (the Cell processor's NoC), and model end-point contention more faithfully than the one-port and LogP models do. Our approach also differs from previous work by introducing a stochastic network model, since the considered processing model allows remapping of tasks to processors.
2.2.4 Dynamic Schedulers
Static schedules are not always efficient in unpredictable computational environments, as the estimated execution and communication times used when scheduling may not be accurate. Dynamic matching and scheduling algorithms generate the schedule at runtime, so the scheduling heuristics can use more accurate information about the running application. A number of dynamic schedulers have been proposed in the relevant literature [3, 26, 56, 50, 54, 39, 114].
Using run-time information as it becomes available forces a dynamic scheduler
to make scheduling decisions in real-time. A main challenge to the development of
a dynamic scheduler is limiting its complexity to ensure real-time response. One
approach to limiting runtime complexity while generating high quality schedules is to
utilize a hybrid scheduler. A hybrid scheduler takes a statically generated schedule as
an input, and tasks are selectively rescheduled using runtime information [13, 68, 67].
As one example, Maheswaran and Siegel [68] propose a dynamic re-mapper that
uses a statically generated schedule as an input. The first phase in the scheduling uses
the initial static mapping generated by the compile-time scheduler and partitions the
DAG into B blocks numbered consecutively from 0 to B − 1. Blocks are generated such that all tasks within a block are independent, and inter-block data dependencies are monotonically increasing. In other words, all subtasks that send data to tasks in
block k must be partitioned into blocks 0 to k − 1. The (B − 1)th block includes all tasks without successors and the 0-th block includes all tasks without predecessors
[68]. Figure 2.2 illustrates the generation of three blocks from a seven-node DAG. Once the tasks in the DAG are partitioned, they are scheduled at runtime based on their block.
Blocks are scheduled consecutively from block 0 to B − 1. While tasks from block i are being executed, the re-mapper is scheduling block i + 1 [68]. Work extending the hybrid re-mapper in [68] merges blocks together at runtime to consider a larger number of tasks when scheduling, reducing the resulting schedule length [67, 13].
These extensions operate on largely the same principle, however.
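The blocking constraints above (tasks within a block are mutually independent, and every predecessor of a block-k task lies in blocks 0 through k − 1) can be satisfied by grouping tasks according to their longest-path depth from an entry task. The Python sketch below is our own illustration of that idea with a hypothetical seven-task graph; it is not the exact partitioning algorithm of [68], which fixes the number of blocks B and places all exit tasks in the final block.

```python
# Sketch: partitioning a DAG into blocks in the style of the hybrid
# re-mapper [68]. Every task's predecessors land in strictly earlier
# blocks, so tasks can be grouped by longest-path depth. Two tasks at
# the same depth cannot depend on each other, so blocks are independent.

def partition_blocks(preds):
    """preds: task -> list of predecessor tasks. Returns task -> block."""
    block = {}
    def depth(t):
        if t not in block:
            block[t] = 0 if not preds[t] else 1 + max(depth(p) for p in preds[t])
        return block[t]
    for t in preds:
        depth(t)
    return block

# Illustrative seven-task DAG (not Figure 2.2 itself)
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"],
         "E": ["B", "C"], "F": ["D"], "G": ["E", "F"]}
blocks = partition_blocks(preds)
print(blocks["A"], blocks["E"], blocks["G"])   # -> 0 2 3
```

Since a task's depth always exceeds each of its predecessors' depths, inter-block dependencies are monotonically increasing, as the scheme requires.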
The runtime schedulers in Chapters 4 and 5 differ from previous hybrid schedulers in their focus on network contention and fault tolerance as the rationale for the dynamic portion of the scheduling system. Additionally, unlike previously proposed schedulers, we consider “dynamic” blocks when scheduling the DAG at runtime, where block membership depends on which tasks have already been scheduled, as well as on a task's level in the DAG.
2.3 Intermittent Faults
2.3.1 Sources of Faults
As more processing cores are integrated into a single system, the cores are becoming more susceptible to hardware errors. In particular, intermittent hardware faults can cause hardware errors that occur in bursts. These faults are often caused by process variation combined with voltage and temperature fluctuations (also denoted PVT fluctuations) [14, 29].
Figure 2.2: Partitioning a DAG into blocks [68]
Because the underlying causes of intermittent faults vary widely, so do the duration of a fault and the number of cores it affects. Different software phases can exercise different portions of a core, causing intermittent faults that depend on application behavior [104, 92, 33]. Voltage fluctuations can affect a number of cores, but their effects last on the order of nanoseconds [15]. Temperature fluctuations can be localized to a single processor or group of processors, causing faults that can last up to several seconds [79].
In addition to hardware faults, if a mechanism were in place to allow the software to recover from hardware faults, it would free operating system, firmware, or hypervisor modules to make decisions that affect processor availability. For instance, the operating system could decide to limit the number of processing cores available to an application to limit power consumption, or to dedicate certain resources to a high-priority application [35, 25]. This type of behavior would be enabled by software that can recover from changes in processor availability.
2.3.2 Fault Tolerance in Chip Multiprocessors
The literature contains a number of examples that provide fault tolerance for a
CMP system. For instance, Ding et al. [35] propose a helper-thread-based scheme that aims to reduce the energy-delay product (EDP) when processor availability can change during application execution. The helper threads execute in parallel with the application threads, gathering EDP statistics during an application's execution using hardware performance counters. The system then uses this information to scale the number of active processors and threads to minimize the EDP [35].
Chakraborty et al. [24] propose an Over-provisioned Multi-core System (OPMS) as a way to provide fault tolerance and reduce power consumption. In an OPMS, the number of available processing cores is larger than the number of simultaneously active cores allowed by the thermal or power constraints of the chip. Chakraborty et al. [24] use a lightweight Virtual Machine Monitor (VMM) to perform dynamic task reassignment, mapping computation fragments to processors as processor availability changes during execution.
While Ding et al. [35] do not consider how faults are detected (it is assumed that the operating system notifies the application when a fault occurs), other schemes consider the mechanism for detecting transient hardware faults as well as proposing solutions to increase fault tolerance [25, 23]. The methods proposed in Chapter 5 differ from those found in the literature in that our method targets CMPs with reconfigurable hardware. The opportunity to reconfigure allows the architecture to find more efficient configurations when processor availability changes.
2.4 Motivation
Our initial work programming the TRIPS and Cell processors showed the need for tools to manage the parallel and reconfigurable resources available in these processors.
The next two subsections overview our preliminary work developing applications for the TRIPS and Cell processors and explain how it motivated the remainder of the dissertation.
2.4.1 GPS Acquisition on the TRIPS Processor
Figure 2.3 illustrates the motivation for scheduling reconfiguration on the TRIPS processor. In Figure 2.3, one can see several high level phase changes when executing
GPS Acquisition on TRIPS, as indicated by changes in the average number of Instructions executed Per Cycle (IPC). GPS Acquisition is a real-world software radio application [63, 109]. A TRIPS processor consists of sixteen processing tiles (cores), so an average IPC of eight means that one half of the processing tiles remain idle
(or are busy communicating) on average at any time; an average IPC of four means that three-quarters of the tiles are idle, and so on. The graph shows three distinct high-level phases, as detected by examining average IPC. The first phase runs from 9 million cycles to about 41 million cycles. The second phase shows higher and more variable average IPC, and lasts from 41 million to 48.6 million cycles. The final phase is clearly composed of shorter sub-phases and continues through the remainder of the experiment.

Figure 2.3: Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor.
The first phase shown in Figure 2.3 utilizes fewer processing resources than the two subsequent phases, indicating that those processing resources could be used for other tasks. Similarly, phase three shows high variability in average IPC, but its average IPC over the entire phase is significantly lower than the peak IPC value. The trends shown in Figure 2.3 show that the single-threaded usage of the TRIPS processor changes dynamically. This work led us to develop the reconfiguration scheduler described in Chapter 3. After breaking the GPS Acquisition application into tasks, several tasks that utilize fewer tiles can be executed under TRIPS's T-Morph, which runs four threads simultaneously on the same hardware [19], without significantly reducing per-task performance. Tasks that utilize more tiles can then be executed under TRIPS's D-Morph, which runs a single thread [19], to obtain the highest possible single-task performance.
2.4.2 RDA on the Cell Processor
We ran a number of performance tests using IBM's Cell processor. Figure 2.4 shows our tests using the Cell's SPE as an accelerator for the Robust Data Alignment (RDA) application, a computer vision application [51, 52]. Our work showed the performance potential of the Cell processor: using a single SPE yielded an approximately 4x performance increase compared to comparably clocked Intel processors [81]. As there are 8 SPEs on a Cell processor, we expected a significant further increase in performance as we increased the number of SPEs used. However, the actual performance using multiple SPEs was significantly lower due to memory and NoC contention. This realization led us to develop the contention-aware scheduling algorithms presented in Chapter 4.
Figure 2.4: Comparing the performance of Cell's SPE to Intel's processors [81] on the RDA application.

Additionally, our original development for the Cell processor led us to several other conclusions. First, the Cell processor's organization, specifically the explicitly distributed on-chip memory instead of a logically shared cache, enabled very high performance. However, high performance was difficult to obtain, resulting in a fair amount of performance fragility when manual or ad-hoc methods are used. This reinforced work done by others stating that software will be the important consideration in the efficient use of future CMP designs [87]. The Cell processor was originally designed for applications with regular memory accesses, where the SPE's Local Store (LS) memory can be most effectively leveraged [48], such as graphics or other “streaming” applications. However, there are several examples of work trying to fit more irregular applications, like graph exploration, to the Cell's organization [76, 110, 111].
Unfortunately, these efforts largely used ad-hoc methods to overlap computation and communication on the Cell's SPEs, further illustrating the need for novel tools to ease the development of software for H-CMPs. The work presented in Chapter 4 addresses a subset of the problems facing the development of high-performance applications for the Cell processor.
CHAPTER 3
SCHEDULING ON RECONFIGURABLE HARDWARE
3.1 Introduction
One of the more difficult problems facing the use of Reconfigurable Hardware (RH) for general purpose computing is the efficient management of reconfigurable resources.
To enable the scheduling of application tasks onto RH resources and the scheduling of reconfiguration at runtime, this chapter introduces the Mutually Exclusive Processor Groups reconfiguration model. The Mutually Exclusive Processor Groups model is simple, yet it captures many different modes of reconfiguration, ranging from Polymorphous Computing Architecture (PCA) processors to Field-Programmable Gate Arrays (FPGAs). Next, we propose a reconfiguration-aware list scheduler extension named the Mutually Exclusive Processor Groups (-MEG) extension. Our goal is for the -MEG extension to choose the most efficient configuration for each application phase and to schedule the appropriate reconfigurations. Using any list scheduler as a “base” scheduler, -MEG schedules hardware reconfiguration using a novel backtracking algorithm. While the -MEG extension could be used with any list scheduler, we demonstrate it using HEFT [108] as our base scheduler to create
HEFT-MEG.
Section 3.4.1 discusses our results using HEFT-MEG to schedule randomly generated, LU decomposition, Laplace Transform, and Gaussian Elimination task graphs onto a number of architectures consisting of a mix of microprocessors and Field-Programmable Gate Array (FPGA) RH processors. In simulation, we find that using HEFT-MEG to evaluate reconfiguration decisions generates schedules that are about 20% shorter than HEFT [108] using a single configuration, and about 50% shorter than a previously proposed Genetic Algorithm (GA) based hardware-software co-scheduler [70] for graphs with larger numbers of tasks. Section 3.4.2 discusses our results using HEFT-MEG to schedule GPS Acquisition [63, 109] (a software radio application) onto the reconfigurable TRIPS processor [91] (developed at UT Austin).
We obtained access to a TRIPS evaluation board through our collaboration with the Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105]. In actual execution, we find that HEFT-MEG successfully schedules reconfigurations to occur at runtime, reducing the execution time of GPS Acquisition by about 20% compared to the best-performing single-configuration schedule.
3.2 Reconfiguration Model: Mutually Exclusive Processor Groups
When an RH resource has more than one configuration, each configuration is composed of one or more logical processors. Obviously, it is not possible for two different configurations using the same underlying hardware to execute tasks concurrently; we define the logical processors that use the same underlying hardware to be Mutually Exclusive Processors. Mutually Exclusive Processors are processors that, while logically distinct, cannot be used concurrently. For our model, an RH does not need to
instantiate an entire instruction-based architecture to be considered a logical processor. Rather, any computational function that can be realized by an RH is considered a logical processor (such as an ALU or a multiplier). This way, any hardware block
that can execute a task in the DAG can be utilized in our reconfiguration model.
Subsequent use of the term processor refers to a logical processor.
Figure 3.1 shows an example of how we define the relationships among processors
that can be instantiated by an FPGA using our Mutually Exclusive Processor Groups
model. All configurations for a particular RH belong to a single SuperGroup. Each
configuration is represented as a single SubGroup. Processors belonging to the same
SubGroup can be used concurrently; logical processors in the same SuperGroup and
in different SubGroups cannot be used concurrently and are mutually exclusive.
Figure 3.1 illustrates how a set of possible configurations for an FPGA map to
Mutually Exclusive Processor Groups. Figure 3.1.a shows three possible configurations for an FPGA. The possible configurations are composed of soft-processors of five types, V–Z. Across all the configurations, there are thirteen logically separate processors. Figure 3.1 illustrates that a processor type can be present in multiple configurations, and that more than one instance of a processor type can be present in a single configuration. Figure 3.1.b shows how the three possible configurations are mapped to a single SuperGroup (Super) that contains three SubGroups (S1, S2, and
S3 ). The group membership defines which processors are mutually exclusive. For
instance, processor X in S1 is mutually exclusive with processor Y in S3, because
these processors belong to different SubGroups within the same SuperGroup.
Figure 3.1: Illustrating mutually exclusive processors with a group of possible configurations for an FPGA. (a) Three possible FPGA configurations composed of soft-processors of types V–Z. (b) The corresponding SuperGroup (Super) containing SubGroups S1, S2, and S3.
Figure 3.2 illustrates how the TRIPS processor's configurations map to Mutually Exclusive Processor Groups. A TRIPS processing core has two possible configurations, the D-Morph and the T-Morph. The D-Morph runs a single thread, while the T-Morph runs four threads simultaneously. Therefore, the D-Morph consists of a single logical processor, while the T-Morph is modeled as four logical processors. Based on this, the D-Morph's processor (X) is mutually exclusive with the T-Morph's processors (X').
Figure 3.2: Illustrating mutually exclusive processors with the TRIPS processor configurations. The D-Morph SubGroup (S1) contains the single processor X; the T-Morph SubGroup (S2) contains the four processors X'; both belong to the SuperGroup Super.
A strength of the Mutually Exclusive Processor Groups model is that it captures many different kinds of reconfiguration, ranging from PCA computing cores to FPGAs, while remaining simple. However, the proposed model requires that all configurations to be considered in scheduling be enumerated, and that the relationships among all the possible configurations be specified, before scheduling. Because of this, it is likely that the system designer or programmer will choose a set of promising configurations to be considered during scheduling.
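Under the stated assumptions, the model reduces to a simple membership table: two logical processors are mutually exclusive exactly when they share a SuperGroup but belong to different SubGroups. The Python sketch below is our own minimal encoding; the processor and group names loosely follow Figure 3.1's example and are illustrative.

```python
# Sketch: the Mutually Exclusive Processor Groups model as a lookup
# table. Processor and group names are illustrative, not prescriptive.

class ProcessorGroups:
    def __init__(self):
        self.membership = {}   # processor -> (supergroup, subgroup)

    def add(self, proc, supergroup, subgroup):
        self.membership[proc] = (supergroup, subgroup)

    def mutually_exclusive(self, p, q):
        """True iff p and q share a SuperGroup but lie in different SubGroups."""
        sup_p, sub_p = self.membership[p]
        sup_q, sub_q = self.membership[q]
        return sup_p == sup_q and sub_p != sub_q

groups = ProcessorGroups()
groups.add("X@S1", "Super", "S1")    # processor X in configuration S1
groups.add("Y@S3", "Super", "S3")    # processor Y in configuration S3
groups.add("CPU0", "Fixed", "F1")    # a fixed microprocessor, never reconfigured

print(groups.mutually_exclusive("X@S1", "Y@S3"))   # -> True
print(groups.mutually_exclusive("X@S1", "CPU0"))   # -> False
```

A scheduler using this table would reject any schedule that places concurrent tasks on two processors for which mutually_exclusive returns True, unless a reconfiguration task separates them.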
3.3 HEFT with Mutually Exclusive Processor Groups
3.3.1 -MEG Scheduling Extension
We propose the Mutually Exclusive Processor Groups (-MEG) scheduling extension as a means to augment any list scheduler with the ability to schedule for RH resources. When scheduling, the goal is to have the -MEG extension choose the most efficient available configuration for each application phase. This is done by using the -MEG scheduling extension to explore the reconfiguration space while the base scheduler decides the mapping of tasks to processors. While the -MEG extension could be applied to any list scheduler, we demonstrate it using HEFT, originally proposed by Topcuoglu et al. [107, 108]. HEFT with the Mutually Exclusive Processor Groups extension (HEFT-MEG) analyzes an application at compile time and generates a runtime schedule.
The -MEG extension uses a novel backtracking algorithm to evaluate the performance impact of different reconfiguration decisions. After each task is scheduled, -MEG finds a number of candidate reconfiguration times over a programmer-controllable window ws in time. For each candidate reconfiguration time tk, -MEG backtracks by unscheduling all tasks that finish after tk. Based on the properties of the unscheduled tasks, -MEG chooses a number of new configurations. For each new configuration, a reconfiguration task is inserted at tk, and the unscheduled tasks are rescheduled using the base scheduler under the new configuration. For each combination of configuration and candidate reconfiguration time, -MEG tentatively reschedules the tasks and keeps only the partial schedule with the shortest makespan. In this way, the -MEG scheduling extension iteratively refines the reconfiguration schedule with each scheduled task. A pseudo-code representation of HEFT-MEG is shown in Algorithm 2, where lines 10 through 21 are additions to the original HEFT algorithm.
The scheduling and cost functions HEFT uses are detailed in Chapter 2.
HEFT-MEG specifically uses the bottom level rank (rankb as defined in Equation
2.2) in the listing step and the Earliest Finish Time (EFT, as defined in Equation 2.6) in the placement step.

Algorithm 2 HEFT-MEG Algorithm
1: procedure HEFT-MEG(G = (V, E, w, c))    ▷ G is a task graph
2:     Compute rankb for all tasks t ∈ V    ▷ Using Equation 2.2
3:     Sort the tasks in decreasing order by rankb and put in list
4:     while there are unscheduled tasks in list do
5:         Select the first task in the list, ni, and remove it from list
6:         for all processors pj do
7:             Evaluate EFT(ni, pj), saving the minimum EFT    ▷ Using Equation 2.6
8:         end for
9:         Schedule task ni on the processor px with the minimum EFT
10:        Save the minimum EFT as EFTcurr
11:        Find candidate reconfiguration times between EFTcurr and EFTcurr − ws and put in listtimes
12:            ▷ Candidate reconfiguration times are found using Equation 3.1
13:        for all times tk in listtimes do
14:            Generate possible reconfigurations for tk and put in listr
15:            for all reconfiguration possibilities rw in listr do
16:                Unschedule tasks scheduled between tk and EFTcurr, put in list2
17:                Insert reconfiguration rw at tk
18:                Perform HEFT with the tasks in list2 using the new configuration
19:            end for
20:        end for
21:        Choose the schedule (from all considered configurations) that minimizes the partial schedule's makespan
22:    end while
23: end procedure

Also note that HEFT-MEG uses the same insertion-based policy as the original HEFT. HEFT-MEG distinguishes itself from the scheduler proposed by
Topcuoglu et al. [107] in its consideration of reconfigurable computational resources.
We define a candidate reconfiguration time as a point in time at which HEFT-MEG will evaluate a reconfiguration possibility. C is the set of candidate reconfiguration times and is defined as: