SCHEDULING TASKS ON HETEROGENEOUS CHIP MULTIPROCESSORS WITH RECONFIGURABLE HARDWARE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Justin Stevenson Teller, B.S., M.S.

*****

The Ohio State University

2008

Dissertation Committee:

Prof. Füsun Özgüner, Adviser
Prof. Ümit Çatalyürek
Prof. Eylem Ekici

Approved by

Adviser
Graduate Program in Electrical and Computer Engineering

© Copyright by

Justin Stevenson Teller

2008

ABSTRACT

This dissertation presents several methods to more efficiently use the computational resources available on a Heterogeneous Chip Multiprocessor (H-CMP). Using task scheduling techniques, three challenges to the effective usage of H-CMPs are addressed: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.

To utilize reconfigurable hardware, we introduce the Mutually Exclusive Processor Groups reconfiguration model, and an accompanying task scheduler, the Heterogeneous Earliest Finish Time with Mutually Exclusive Processor Groups (HEFT-MEG) scheduling heuristic. HEFT-MEG schedules reconfigurations using a novel backtracking algorithm to evaluate how different reconfiguration decisions affect previously scheduled tasks. In both simulation and real execution, HEFT-MEG successfully schedules reconfiguration, allowing the architecture to adapt to changing application requirements.

After an analysis of IBM’s Cell Processor NoC and generation of a simple stochastic model, we propose a hybrid task scheduling system using a Compile- and Run-time Scheduler (CtS and RtS) that work in concert. The CtS, Contention Aware HEFT (CA-HEFT), updates task start and finish times when scheduling to account for network contention. The RtS, the Contention Aware Dynamic Scheduler (CADS), adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times, using a novel dynamic block algorithm. We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor.

To enhance fault tolerance, we modify the previously proposed hybrid scheduling system to accommodate variability in processor availability. The RtS is divided into two portions, the Fault Tolerant Re-Mapper (FTRM) and the Reconfiguration and Recovery Scheduler (RRS). FTRM examines the current processor availability and remaps tasks to the available set of processors. RRS changes the reconfiguration schedule so that the reconfigurations more accurately reflect the new hardware capabilities. The proposed hybrid scheduling system enables application performance to gracefully degrade when processor availability diminishes, and increase when processor availability increases.

Dedicated to my wonderful wife, Lindsay.

ACKNOWLEDGMENTS

I would like to thank Prof. Füsun Özgüner for being my adviser, and providing me with the guidance to finish my graduate degree. Especially, I want to thank you for recruiting me. My Ph.D. topic would be vastly different had I not been able to come to and work at Ohio State.

I would also like to sincerely thank Prof. Ümit Çatalyürek and Prof. Eylem Ekici. You are truly among the best professors I have had the honor to study with in my graduate work. Your contributions to my education cannot be overstated.

I would like to thank Tim Hartley for extremely constructive discussions concerning the Cell processor, parallel processing, and StarCraft.

I would like to sincerely thank Dr. Robert Ewing, AFRL, for his insightful conversations and guidance when working on base. I am grateful to Al Scarpelli, AFRL, for his support and help in providing access to the TRIPS system and developers.

Of course, none of this would have been possible without the love and support of my family. My wife Lindsay was incredibly supportive, and I especially want to thank my parents, brothers, and all of the Highfields: my “Columbus family.”

Finally, I would like to acknowledge the Dayton Area Graduate Studies Institute for providing support for my Ph.D. studies through a joint research fellowship.

VITA

April 19, 1980 ...... Born – Downer’s Grove, Illinois
2002 ...... B.S. in Electrical Engineering, Ohio University, Athens, Ohio
2004 ...... M.S. in Electrical Engineering, University of Maryland, College Park, Maryland
2004 ...... Givens Associate in parallel processing at the MCS division at Argonne National Laboratory
2005 – present ...... Air Force Research Laboratory/Dayton Area Graduate Studies Institute Fellow

PUBLICATIONS

1. Justin Teller, Füsun Özgüner, and Robert Ewing, “Scheduling Task Graphs on Reconfigurable Hardware.” to appear in the 37th International Conference on Parallel Processing (ICPP-08), SRMPDS workshop, Portland, Oregon, September 2008.

2. Justin Teller, Füsun Özgüner, and Robert Ewing, “Optimization at Runtime on a Nanoprocessor Architecture.” to appear in the 31st IEEE Annual Midwest Symposium on Circuits and Systems, Knoxville, Tennessee, August 2008.

3. Justin Teller, Füsun Özgüner, and Robert Ewing, “Scheduling Reconfiguration at Runtime on the TRIPS Processor.” in Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), RAW workshop, Miami, Florida, April 2008.

4. Justin Teller, “Matching and Scheduling on a Heterogeneous Chip Multi-Processor.” presentation at the ASME Dayton Engineering Sciences Symposium, October 29, 2007.

5. Justin Teller, “Reconfiguration at Runtime with the Nanoprocessor Architecture.” presentation at the ASME Dayton Engineering Sciences Symposium, October 30, 2006. Selected for an Outstanding Presentation Award.

6. Justin Teller, Füsun Özgüner, and Robert Ewing, “The Morphable Nanoprocessor Architecture: Reconfiguration at Runtime.” in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS ’06), San Juan, Puerto Rico, August 6-9, 2006.

7. Justin Teller, Füsun Özgüner, and Robert Ewing, “What are the Building Blocks of a Nanoprocessor Architecture?” in Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS ’05), Cincinnati, Ohio, August 7-10, 2005.

8. Justin Teller, Charles B. Silio, and Bruce Jacob, “Performance Characteristics of MAUI: An Intelligent Memory System Architecture.” in Proceedings of the 3rd ACM SIGPLAN Workshop on Memory Systems Performance (MSP 2005), Chicago, Illinois, June 12, 2005.

9. Mark Hereld, Rick Stevens, Justin Teller, Wim van Drongelen, and Hyong Lee, “Large Neural Simulations on Large Parallel Computers.” International Journal of Bioelectromagnetism (IJBEM), vol. 7, no. 1, May 2005.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Studies in: Parallel Processing

TABLE OF CONTENTS


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Current Trends
      1.1.1 Chip Multiprocessors
      1.1.2 Heterogeneous Processing Cores
      1.1.3 Reconfigurable Hardware in General Purpose Computing
      1.1.4 Intermittent Hardware Faults
   1.2 Summary

2. Background, Prior Work, and Motivation
   2.1 Reconfigurable Hardware
      2.1.1 Scheduling on Reconfigurable Hardware
   2.2 Task Scheduling for Heterogeneous Systems
      2.2.1 Matching and Scheduling Heuristics
      2.2.2 HEFT List Scheduler
      2.2.3 Scheduling Network Access
      2.2.4 Dynamic Schedulers
   2.3 Intermittent Faults
      2.3.1 Sources of Faults
      2.3.2 Fault Tolerance in Chip Multiprocessors
   2.4 Motivation
      2.4.1 GPS Acquisition on the TRIPS Processor
      2.4.2 RDA on the Cell Processor

3. Scheduling on Reconfigurable Hardware
   3.1 Introduction
   3.2 Reconfiguration Model: Mutually Exclusive Processor Groups
   3.3 HEFT with Mutually Exclusive Processor Groups
      3.3.1 -MEG Scheduling Extension
      3.3.2 Generating New Configurations
      3.3.3 HEFT-MEG Time Complexity
   3.4 Results
      3.4.1 Simulation Results
      3.4.2 Results on TRIPS

4. The Modeling and Scheduling of Network Access
   4.1 Introduction
   4.2 The Cell Processor’s Network on a Chip
      4.2.1 Cell’s NoC: The EIB
      4.2.2 Cell EIB: In-Network Contention
   4.3 Model
      4.3.1 Calculating End-Point Contention
      4.3.2 Calculating NoC Contention
      4.3.3 Experimental Verification: NoC Contention
   4.4 System Overview
   4.5 Scheduling on the Cell Processor
      4.5.1 Compile Time Scheduling
      4.5.2 Run Time Scheduling
   4.6 Scheduling Results

5. Fault Tolerance with Reconfigurable Hardware
   5.1 Introduction
   5.2 Proposed Failure Model
   5.3 Mutually Exclusive Processor Groups Revisited
   5.4 Run-Time Scheduler
      5.4.1 Fault Tolerant Re-mapper
      5.4.2 Reconfiguration and Recovery Scheduler
   5.5 Simulation Results

6. Conclusions
   6.1 Contributions
   6.2 Future Work

Bibliography

LIST OF TABLES


5.1 Results for four node system, CCR = 1.0
5.2 Results for two node system, CCR = 1.0
5.3 Results for four node system, CCR = 0.25
5.4 Results for two node system, CCR = 0.25

LIST OF FIGURES


1.1 Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown.
2.1 A chromosome for the partitioning algorithm in Mei, et al. [70].
2.2 Partitioning a DAG into blocks [68].
2.3 Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor.
2.4 Comparing the performance of Cell’s SPE to Intel’s processors [81] on the RDA application.
3.1 Illustrating mutually exclusive processors with a group of possible configurations for an FPGA.
3.2 Illustrating mutually exclusive processors with the TRIPS processor configurations.
3.3 Scheduling a DAG fragment onto RH using HEFT-MEG.
3.4 Illustrating the FindSmartConfs algorithm.
3.5 Continuation of Figure 3.4. Illustrating the generation of m − 1 other configurations, and their testing in HEFT-MEG.
3.6 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of nodes in the architecture.
3.7 Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of tasks in the DAG.
3.8 Normalized schedule length for random DAGs while varying the number of tasks between 50 and 550, on a one node architecture.
3.9 Normalized schedule length for random DAGs while varying the number of tasks between 50 and 550, on a two node architecture.
3.10 Normalized schedule length for random DAGs varying the number of nodes in the architecture between 1 and 4.
3.11 Normalized schedule length for random DAGs varying the relative reconfiguration time.
3.12 Normalized schedule length vs. matrix size for Laplace Transform DAGs.
3.13 Normalized schedule length vs. matrix size for LU Decomposition DAGs.
3.14 Normalized schedule length vs. matrix size for Gaussian Elimination DAGs.
3.15 Directed Acyclic Task Graph (DAG) of the GPS Acquisition algorithm.
3.16 GPS Acquisition’s schedule when HEFT-MEG is used for scheduling.
3.17 Comparing the runtime of GPS Acquisition using different schedules.
4.1 Block level diagram illustrating the topology of the Cell Processor’s NoC.
4.2 Illustrating the operation of the Cell NoC. The latencies of messages 0 → 3, 1 → 4, and 2 → 6 depend on ordering by the arbiter and sharing of links on the NoC.
4.3 Illustrating concurrent, independent messages reducing the realized bandwidth of other messages in the system.
4.4 Illustrating how two messages can overlap with a test message. a) Neither message affects the test message. b) One message affects the test message. c) Both messages overlap, independently. d) One message overlaps, but both messages share an end-point. e) Both messages overlap with the test message over one link.
4.5 Comparing the predicted pdf and experimental relative frequency of a test message’s latency for 2, 3, and 5 concurrent messages.
4.6 System overview. Applications are represented as a task graph.
4.7 Operation of the CADS re-mapper.
4.8 Normalized schedule length for random DAGs varying the CCR between 0.01 and 10.
4.9 Normalized schedule length for random DAGs varying the number of tasks between 200 and 800.
4.10 Normalized schedule length for Gaussian elimination DAGs varying the CCR between 0.01 and 10.
4.11 Normalized schedule length for Gaussian elimination DAGs varying the matrix size between 5 and 45.
4.12 Normalized schedule length for LU decomposition DAGs varying the CCR between 0.01 and 10.
4.13 Normalized schedule length for LU decomposition DAGs varying the matrix size between 5 and 45.
4.14 Normalized schedule length for Laplace transform DAGs varying the CCR between 0.01 and 10.
4.15 Normalized schedule length for Laplace transform DAGs varying the matrix size between 5 and 45.
5.1 Illustrating processor availability changes on an FPGA.
5.2 System overview.
5.3 Operation of the FTRM re-mapper. a) The original schedule, annotated to indicate the active block after t1 is scheduled. b) FTRM decides to schedule task t3 to processor P3. c) The scheduling decision is not reflected in the original schedule, but the active block is updated.
5.4 Illustrating the extraction of the configuration schedule.

CHAPTER 1

INTRODUCTION

Consisting of a mix of processing units (cores) that are targeted for different types of computations, Heterogeneous Chip Multiprocessors (H-CMPs) can efficiently run a diverse mix of applications [7, 45, 48, 58, 91, 97, 99, 101, 103, 104]. Figure 1.1 illustrates a hypothetical H-CMP containing twelve processing cores of four types: simple processing cores (in-order, short pipeline, etc.), vector processors, a complex processing core (out-of-order, deep pipeline, etc.), and a Reconfigurable Hardware (RH) processor.

In this dissertation, we present several methods to more efficiently use the computational resources available on an H-CMP. Using scheduling techniques, we address three challenges to the effective usage of H-CMPs: the emergence of reconfigurable hardware in general purpose computing, utilization of the network on a chip (NoC), and fault tolerance.

1.1 Current Trends

1.1.1 Chip Multiprocessors

Multi-core processors and Chip multiprocessors (CMPs) are becoming more commonplace, as even commodity off the shelf (COTS) processors are integrating several

Figure 1.1: Hypothetical H-CMP consisting of processing cores optimized for different computation types. The on-chip network is not shown.

processing cores onto a single chip [81, 99, 57]. CMP architectures have demonstrated benefits for processing and power efficiency [48, 59, 104, 110]. Additionally, there are proposed architectures that utilize multiple cores for redundant processing to recover from transient faults and radiation induced errors [66, 24, 35].

H-CMPs targeted to general purpose and high-performance computing have already been introduced or proposed [7, 48, 58]. As future solutions integrate more processing cores onto a single chip, managing the computational resources becomes more difficult [1, 5, 9, 49, 57]. While a number of CMPs are currently being marketed and researched, the development of quality software tools to enable efficient utilization of CMPs is expected to be a significant roadblock to their future use [87].

1.1.2 Heterogeneous Processing Cores

State of the art H-CMPs having slightly different cores have already been introduced or proposed. One example is General Purpose computation on Graphics Processing Unit (GPGPU) architectures, such as nVidia’s G80 GPU core using CUDA (Compute Unified Device Architecture), an interface that allows users to write high-performance programs for any compute-intensive task in the standard C language [86]. Also, the paper by Kumar, et al. [58] proposes a single-ISA multi-core architecture with cores of varying sizes, performance, and power consumption as a way to provide significantly higher performance in the same area as a conventional chip multiprocessor. There has been significant work into what combination of core types and interconnects yields the highest performance [59, 7, 106, 102], indicating that future solutions are moving towards more heterogeneity as more specialized cores are integrated onto a single chip.

One important commercial example of an H-CMP is the Cell processor, developed jointly by IBM, Sony, and Toshiba and originally designed for the Sony Playstation3 gaming system [48]. The Cell processor consists of nine processing cores of two different types: a single Power Processing Element (PPE) and eight Synergistic Processing Elements (SPE) [45] connected by a high-speed NoC [55]. The PPE is a traditional 64-bit PowerPC processor; it runs the operating system and can be programmed using a traditional compiler tool-chain. Conversely, each SPE is a high-performance vector engine, lacking traditional caches and branch prediction units [78]. Instead of a traditional cache, each SPE uses a software managed Local Store (LS) memory [45, 78]. The SPE is therefore optimized for data-parallel code with simple control structures, making it a promising architecture for a variety of applications [111, 110, 76].

1.1.3 Reconfigurable Hardware in General Purpose Computing

Reconfigurable hardware is attractive for general purpose computing, as the performance and flexibility of Field Programmable Gate Arrays (FPGAs) and other reconfigurable architectures have enabled system developers to achieve high levels of performance for a variety of applications [88, 77, 85, 100]. The promise of high performance coupled with low power consumption has inspired several commercial offerings coupling FPGAs with general purpose processors, including the Cray XD1 and XT5h [80] and SRC Mapstation 7 [82]. The SRC-7 system is one interesting architecture and programming environment integrating an FPGA-based reconfigurable hardware system into the memory system of a traditional personal computer (PC) architecture [82]. With the addition of the SRC’s Carte programming environment, an application developer can focus on the utilization of the FPGA for application acceleration, resulting in significant performance benefits in general purpose and high-performance computing applications [72].

Polymorphous Computing Architectures (PCAs) are a second class of reconfigurable computing used in general purpose computing. PCAs reconfigure in a coarse grained manner and target applications showing high variability in computational requirements [21]. The TRIPS processor, developed at UT at Austin [19], is an important PCA architecture. Tiled to support both instruction and thread level parallelism, the TRIPS processor has two different configurations, or “morphs:” the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph).

1.1.4 Intermittent Hardware Faults

As more processing resources are integrated onto a single chip, the possibility of experiencing faults increases [14, 29, 15]. These hardware errors can have effects lasting a wide range of time scales, and effectively make the logical processors available for execution a dynamic quantity that can both decrease and increase during an application’s execution. Even though the exact rate of faults for future processors is not known, it is expected that the rate of intermittent hardware faults will increase in the future due to increased cross-talk, voltage and temperature variations, and decreased noise margins [25, 14].

1.2 Summary

In this dissertation, we propose scheduling methods for Heterogeneous Chip Multiprocessors (H-CMPs) that address three important areas: utilization of reconfigurable hardware for general purpose computing, consideration of shared network on a chip resources when scheduling, and fault tolerance. The dissertation is composed of three main parts.

In Chapter 3, we address the problem of scheduling applications represented as directed acyclic task graphs (DAGs) onto architectures with reconfigurable processing cores. We introduce the Mutually Exclusive Processor Groups reconfiguration model, a novel reconfiguration model that captures many different modes of reconfiguration. Additionally, we propose the Mutually Exclusive Processor Groups (-MEG) list scheduling extension. The -MEG extension uses a novel backtracking algorithm to schedule reconfigurations and evaluate how different reconfiguration decisions affect previously scheduled tasks. While the -MEG extension can be used with any list scheduler, we demonstrate our scheduler by extending HEFT (proposed by Topcuoglu et al. [108]) to create HEFT-MEG. We find that HEFT-MEG generates higher quality schedules than the hardware-software co-scheduler proposed by Mei, et al. [70] and HEFT [108] using a single configuration in simulation by choosing efficient configurations for different application phases. Additionally, we used HEFT-MEG to schedule for the polymorphous TRIPS processor. In actual execution, we found that using the HEFT-MEG scheduler improves the performance of GPS Acquisition, a software radio application, by about 20%, compared to the best single-configuration schedule on the same hardware.

In Chapter 4, we perform an analysis of the Cell processor NoC and introduce a simple stochastic model to predict message latency based on the number of other competing messages communicating concurrently on the network. Using this model, we propose a hybrid scheduling system using a Compile-time Scheduler (CtS) and Run-time Scheduler (RtS) that work in concert. The proposed CtS is built using a novel Contention Aware (CA-) list scheduling extension. While the CA- extension could be used with any list scheduler, we demonstrate the scheduling extension using the HEFT scheduler proposed by Topcuoglu et al. [108], to create CA-HEFT. Next, we propose the Contention Aware Dynamic Scheduler (CADS) runtime re-mapper as the RtS. At runtime, CADS adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times. CADS uses a novel dynamic block algorithm that updates the active block of tasks depending on run time scheduling decisions, the schedule generated by CA-HEFT, and actual task finish times. We find that using a CtS and RtS in concert improves the performance of several application types in real execution on the Cell processor. As the Communication to Computation Ratio (CCR) increases, the performance benefit of using CA-HEFT and CADS to schedule “around” communication contention increases, resulting in up to a 60% reduction in execution time.

In Chapter 5, we introduce a fault tolerant extension to the Mutually Exclusive Processor Groups model. We expand on the hybrid scheduler proposed in Chapter 4, using HEFT-MEG as the CtS portion of the hybrid scheduler. The RtS is divided into two portions: a high-cost recovery scheduler and a low-cost re-mapper. The low-cost re-mapper redirects tasks based on actual system conditions. Named the Fault-Tolerant Re-Mapper (FTRM), the re-mapper examines the current processor availability and, using the schedule generated at compile time, remaps tasks to the available set of processors. The high-cost recovery scheduler is named the Reconfiguration and Recovery Scheduler (RRS) and specifically addresses the opportunities when designing a fault tolerant system for reconfigurable hardware. RRS examines the changes in processor availability and determines a new configuration schedule, inserting new reconfiguration tasks into the task graph. The recovery can take a relatively long time (total reconfiguration on an FPGA can take upwards of 10 ms [28, 34]), but allows the RtS to adjust the configuration schedule to account for changes in processor availability.

CHAPTER 2

BACKGROUND, PRIOR WORK, AND MOTIVATION

2.1 Reconfigurable Hardware

While FPGAs are an important class of fine-grained Reconfigurable Hardware (RH), Polymorphous Computing Architectures (PCAs) represent a different class of reconfigurable computing. PCAs can reconfigure in a coarse grained manner and target applications showing high variability in computational requirements [21]. Compared with FPGA RH, a PCA’s organization enables faster reconfiguration times and clock speeds at the expense of fewer possible configurations [21, 46, 73]. One important PCA architecture is the TRIPS processor, developed at UT at Austin [19]. The TRIPS processor’s current implementation has two different configurations, or “morphs:” the Desktop Morph (D-Morph) and the Threaded Morph (T-Morph). The D-Morph allocates all on-chip resources to a single thread, using the resources to support a large number of in-flight instructions for speculation. Conversely, the T-Morph statically allocates on-chip resources to four threads, so each thread is allocated 1/4 of the on-chip resources. This limits the amount of speculative execution available to each thread in the T-Morph, as compared to the D-Morph [19, 91]. Due to these differences, the D-Morph efficiently executes applications with high Instruction Level Parallelism (ILP), while the T-Morph efficiently executes applications with high Thread Level Parallelism (TLP). We obtained access to a TRIPS evaluation board through our collaboration with Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105].

2.1.1 Scheduling on Reconfigurable Hardware

While reconfiguration at runtime has been previously studied, most studies focus on offloading specific functions onto FPGAs [20, 47, 112] or determining an efficient partitioning of work between a microprocessor and some number of FPGA soft-processors (sometimes categorized as hardware-software co-design) [70, 89, 90]. Additionally, a number of examples in the literature propose scheduling methods that target only the FPGA [28, 32, 34].

One interesting hardware-software partitioning and scheduling approach was proposed in Mei, et al. [70]. The scheduler proposed in [70] uses a Genetic Algorithm (GA) that searches for a good partitioning of tasks between a single microprocessor and some number of soft-processors on a single FPGA. Ensuing use of the term Mei00 will refer to the scheduler described by Mei, et al. [70]. Mei00 uses a simple gene structure to describe the mapping of tasks to either a general purpose CPU or the FPGA, as shown in Figure 2.1. Mei00 then determines the most fit individuals in a particular generation using a cost function based on accumulated violation, or tardiness. The tardiness for a particular task is the amount of time it misses its deadline after being scheduled. The goal of Mei00 is to find a schedule with zero tardiness, which is also a solution that meets all timing constraints [70].

Mei00’s GA has the following main steps [70]:

Figure 2.1: A chromosome for the partitioning algorithm in Mei, et al. [70].

1. Initialization. To start with a diverse initial population, each individual’s chromosome is randomly generated by setting each gene to either 1 or 0.

2. Evaluation and Fitness. The scheduler is invoked, and the tardiness of each individual’s resulting schedule is calculated.

3. Selection. Reproduction trials are run on chromosomes using the normal tournament selection strategy.

4. Crossover and Mutation. Crossover and mutation operations are applied on selected parent individuals.

5. Update Population. New individual fitness values are recalculated and lower fitness individuals are discarded.

6. Stop Criteria. If one of the stop criteria is met (either the maximum number of generations or a solution with zero tardiness), the algorithm stops. Otherwise, it repeats steps 3 through 6.
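The loop above can be made concrete. The following Python sketch is our own minimal illustration of Mei00’s loop structure, not the implementation from [70]; the schedule_and_tardiness fitness function stands in for the list scheduler described below, and the population size, crossover rate, and mutation rate are assumed values.

    import random

    def mei00_ga(num_tasks, schedule_and_tardiness, pop_size=50,
                 max_generations=200, p_crossover=0.8, p_mutation=0.05):
        # Step 1, Initialization: each gene maps a task to the CPU (0) or FPGA (1).
        pop = [[random.randint(0, 1) for _ in range(num_tasks)]
               for _ in range(pop_size)]
        best = pop[0]
        for _generation in range(max_generations):
            # Step 2, Evaluation and Fitness: invoke the list scheduler on each
            # individual; lower accumulated tardiness is better.
            scored = [(schedule_and_tardiness(ind), ind) for ind in pop]
            best_tardiness, best = min(scored, key=lambda pair: pair[0])
            # Step 6, Stop Criteria: zero tardiness meets all timing constraints.
            if best_tardiness == 0:
                break

            def tournament():
                # Step 3, Selection: binary tournament between two individuals.
                a, b = random.sample(scored, 2)
                return (a if a[0] <= b[0] else b)[1]

            children = []
            while len(children) < pop_size:
                # Step 4, Crossover and Mutation on selected parents.
                parent1, parent2 = tournament(), tournament()
                if random.random() < p_crossover and num_tasks > 1:
                    cut = random.randrange(1, num_tasks)
                    child = parent1[:cut] + parent2[cut:]
                else:
                    child = list(parent1)
                children.append([g ^ 1 if random.random() < p_mutation else g
                                 for g in child])
            # Step 5, Update Population: offspring replace the old population.
            pop = children
        return best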

Mei00 uses a list scheduler to evaluate individuals in the GA [70]. Described in more detail in the next section, list schedulers in general schedule tasks by listing the priority of each task, choosing the task with the highest priority, and placing that task on a particular processor to execute at a particular time. Mei00 uses a dynamic priority scheme, given by:

priority(t) = −( ASAP_dyna(t) + ALAP(t) )    (2.1)

The ASAP_dyna value is the earliest a task could possibly execute, based on processor availability, while ALAP is the negative of the task’s “distance” from the bottom of the graph [70]. Unlike static priority calculations, once a task is scheduled, all ASAP_dyna values are recalculated to reflect the current status. Larger ASAP times mean the task must be scheduled later, so it has a lower priority. Similarly, larger ALAP values mean the task can be executed later, so the task has a lower priority.

Then, priority(t) is further modified to account for the reconfiguration overhead.

Basically, if the task can reuse a configuration on an FPGA, it is given higher priority when scheduling [70].
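For illustration, Equation 2.1 and the reuse adjustment could be evaluated as in the short Python sketch below; the sketch is ours, and the reuse_bonus magnitude is an assumed value, since the exact weighting from [70] is not reproduced here.

    def mei00_priority(asap_dyna, alap, reuses_configuration, reuse_bonus=10.0):
        # Equation 2.1: larger ASAP_dyna (must start later) and larger ALAP
        # (can run later) both lower the task's priority.
        priority = -(asap_dyna + alap)
        # Tasks that can reuse an already-loaded FPGA configuration avoid a
        # reconfiguration penalty, so they are promoted in the list.
        if reuses_configuration:
            priority += reuse_bonus
        return priority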

The paper by Mei, et al. [70] schedules 3, 4, or 5 task graphs consisting of an average of 10 tasks onto a single microprocessor, single FPGA system. They chose a mix of task graphs because it resembles several periodic real-time tasks, each with deadlines.

In Chapter 3 we compare our RH scheduler to Mei00. This choice was made because of Mei00’s flexibility [70]. Although it was originally targeted to a system with a single FPGA and microprocessor, Mei00 can easily be extended to multiple microprocessor, multiple FPGA systems by changing the gene representation in Figure 2.1 to include more than a single bit per task. Secondly, changing the fitness value to the overall schedule length allows Mei00 to be re-targeted to task graphs where individual tasks have no deadline, and the goal is to reduce the time needed to execute a particular set of tasks.

The Reconfigurable computing Co-Scheduler (ReCoS) is another co-scheduler that targets single microprocessor, single FPGA workstations [89]. ReCoS is a clustering scheduler that groups tasks to execute on a particular processor. In this case, ReCoS chooses tasks for a particular cluster based on their similarity and possibility to co-execute on the FPGA [88]. Then, ReCoS iterates over the clustering, redistributing tasks to try to minimize the time required for execution on the microprocessor and FPGA and maximize the FPGA utilization [90]. We chose not to compare our reconfigurable scheduler to ReCoS in the forthcoming chapters. Because ReCoS was targeted only to the scheduling and placement of logical processors within the FPGA, it was not flexible enough to schedule within our proposed reconfiguration model.

2.2 Task Scheduling for Heterogeneous Systems

An important part of the parallelization process is allocating tasks to processors and determining the order of execution. This scheduling can either be performed before the application executes (called compile-time) or while the application is executing (called run-time). Compile-time scheduling is designated as static scheduling, and uses estimations of task execution and communication time when scheduling. Run-time scheduling is called dynamic scheduling, and actual application behavior can be used when scheduling. While dynamic schedulers can use more accurate information when scheduling than their static counterparts, dynamic schedulers need to have real-time response to be useful. To have real-time response, dynamic schedulers perform lower complexity analysis than static schedulers.

2.2.1 Matching and Scheduling Heuristics

Historically called Mapping and Scheduling for homogeneous systems and Matching and Scheduling for heterogeneous systems, scheduling a task graph representing an application is a well studied problem [95, 50, 39]. For scheduling on a homogeneous parallel system, an application is represented as a Directed Acyclic Task Graph (DAG), G = (V, E, w, c), where the nodes V represent the application tasks and the edges E the communications (data dependencies) between tasks. The weight w(v) associated with node v ∈ V represents its computation cost, and the weight c(e) associated with e ∈ E represents its communication cost. The model is similar for heterogeneous systems, except that the task’s computation and communication costs depend on the processor executing the task [18, 43]. Unfortunately, the optimal scheduling of an arbitrary DAG onto a limited number of processors is NP-hard [83], so most solutions present in the literature propose heuristics to find near-optimal solutions.
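To make the model concrete, a heterogeneous task graph can be represented as follows. This Python container is our own sketch of G = (V, E, w, c), with per-processor computation costs as used in the heterogeneous case; the field layout is an assumption for illustration.

    from dataclasses import dataclass

    @dataclass
    class TaskGraph:
        """G = (V, E, w, c): tasks V, dependence edges E, computation costs w,
        and communication costs c. In the heterogeneous model, w[v][p] is the
        cost of task v on processor p; c[(u, v)] is the edge's average cost."""
        tasks: list
        edges: list          # (producer, consumer) pairs
        w: dict              # w[v][p] -> computation cost of v on processor p
        c: dict              # c[(u, v)] -> communication cost of edge (u, v)

        def successors(self, v):
            return [b for (a, b) in self.edges if a == v]

        def predecessors(self, v):
            return [a for (a, b) in self.edges if b == v]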

Static scheduling heuristics can be loosely broken into three categories: guided stochastic search, clustering, and list-based schedulers. Guided stochastic search schedulers use genetic [113, 38], simulated annealing [115], or other randomized search methods to search through possible schedules for near-optimal solutions. Clustering heuristics have two steps: first, application tasks are clustered together to run on a single processor in an attempt to reduce communication time; then the execution order is defined [12, 27].

List scheduling heuristics are a common framework to use when scheduling. A list scheduler’s basic idea is to generate a scheduling list (a sequence of nodes for scheduling) ordered by some priority, then repeatedly execute the following two steps until all the nodes in the DAG are scheduled [60, 95]:

1. Remove the first node from the scheduling list.

2. Allocate the node to a processor that minimizes some cost function.

List schedulers differ by the definition of the listing priority and the scheduling cost function. List scheduling is a simple, well performing, and well studied scheduling algorithm with a large number of list schedulers present in the literature [11, 37, 41, 60, 61, 64, 71, 75]. Another important category of list schedulers attempts to avoid interprocessor communication by duplicating task execution [2, 6, 16, 17, 36, 53].

2.2.2 HEFT List Scheduler

Heterogeneous Earliest Finish Time (HEFT) is one heuristic often used as a benchmark to evaluate other heterogeneous scheduling heuristics, for its simplicity and ability to generate high-quality schedules [107, 108]. HEFT is a static list scheduler, so task priorities do not change while scheduling, and are only calculated once. A task’s listing priority is its bottom rank, defined as:

rank_b(n_i) = w_i + max_{n_j ∈ succ(n_i)} ( c_{i,j} + rank_b(n_j) )    (2.2)

where succ(n_i) is the set of immediate successors of task n_i, c_{i,j} is the average communication cost of the edge between n_i and n_j, and w_i is the average computation cost of task n_i. Exit tasks (tasks without successors) have the bottom rank equal to:

rank_b(n_exit) = w_exit    (2.3)

Before defining HEFT’s scheduling cost function, we define several other functions. HEFT uses an insertion-based policy that considers inserting tasks into idle time slots between two already-scheduled tasks on a processor, as originally described in [108]. Assuming that I_j is the set of idle time slots on processor p_j and each time slot s has a start time of s_s and an end time of s_e, we define the set of appropriate idle time slots for task n_i on processor p_j as:

A_j = { s : s ∈ I_j ∧ (t_m + w_i(p_j)) ≤ s_e }    (2.4)

where w_i(p_j) is the runtime of task n_i on processor p_j, and t_m is defined as:

t_m = max{ t_r(n_i, p_j), s_s }    (2.5)

t_r(n_i, p_j) is the time all data generated by n_i’s immediate predecessors would be available to processor p_j.

HEFT then defines the scheduling cost function as the Earliest Finish Time (EFT) of task n_i on processor p_j, defined as [107]:

EFT(n_i, p_j) = min_{s ∈ A_j} { t_m + w_i(p_j) }    (2.6)

Using EFT as a cost function allows HEFT to schedule tasks onto heterogeneous processors and networks, as execution time differences are taken into account when scheduling. Algorithm 1 is a pseudo-code representation of the HEFT scheduling heuristic.

Algorithm 1 HEFT Scheduling Heuristic [108]
 1: procedure HEFT(G = (V, E, w, c))            ▷ G is a task graph
 2:   Compute rank_b for all tasks t ∈ V        ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank_b and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, n_i, and remove from list
 6:     for all processors p_j do
 7:       Evaluate EFT(n_i, p_j), saving the minimum EFT    ▷ Using Equation 2.6
 8:     end for
 9:     Schedule task n_i on the processor p_x with the minimum EFT
10:   end while
11: end procedure
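The following Python sketch renders Algorithm 1 using the TaskGraph container sketched earlier in this section. It is our own illustration, not the reference implementation from [108]: for brevity it places each task at the end of a processor’s queue rather than searching idle slots, so the insertion-based policy of Equations 2.4–2.6 is simplified to t_m = max(t_r, processor-ready time).

    def heft(g, processors):
        # Average computation cost per task, used by the ranking step.
        avg_w = {v: sum(g.w[v].values()) / len(g.w[v]) for v in g.tasks}

        rank_b = {}
        def rank(v):
            # Equations 2.2 and 2.3: recursive bottom rank.
            if v not in rank_b:
                rank_b[v] = avg_w[v] + max(
                    (g.c[(v, s)] + rank(s) for s in g.successors(v)),
                    default=0.0)
            return rank_b[v]

        # Decreasing rank_b is a valid topological order for positive costs.
        order = sorted(g.tasks, key=rank, reverse=True)
        proc_ready = {p: 0.0 for p in processors}
        placement, finish = {}, {}
        for v in order:
            best_eft, best_p = None, None
            for p in processors:
                # t_r: when all predecessor data reaches p (no communication
                # cost when producer and consumer share a processor).
                t_r = max((finish[u] +
                           (0.0 if placement[u] == p else g.c[(u, v)])
                           for u in g.predecessors(v)), default=0.0)
                eft = max(t_r, proc_ready[p]) + g.w[v][p]    # Equation 2.6
                if best_eft is None or eft < best_eft:
                    best_eft, best_p = eft, p
            placement[v], finish[v] = best_p, best_eft
            proc_ready[best_p] = best_eft
        return placement, finish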

2.2.3 Scheduling Network Access

The literature contains a number of heuristics that consider network contention when scheduling. A simple model examines end-point contention when scheduling. One such example is the one-port model, which models the network port of a processor as able to accommodate only a single input or output at a time [10]. This effectively limits the total I/O bandwidth available to each processor when scheduling, and forces the scheduling heuristic to schedule access to each processor’s network port, without having to consider the network itself. Similarly, the parameter g in the LogP model models the amount of communication a processor can accommodate simultaneously [31].
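As a concrete reading of the one-port restriction, the sketch below serializes messages through each end-point’s single network port; the bookkeeping is our own illustration of the model in [10], not a published scheduler.

    def one_port_finish_times(messages):
        """Each (src, dst, ready, duration) message must wait until both
        end-point ports are free, since a port carries one transfer at a time."""
        port_free = {}
        finish = []
        for src, dst, ready, duration in messages:    # in issue order
            start = max(ready, port_free.get(src, 0.0), port_free.get(dst, 0.0))
            end = start + duration
            port_free[src] = port_free[dst] = end     # both ports busy until done
            finish.append(end)
        return finish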

The literature also contains a number of other scheduling models that consider other modes of network contention. A number of approaches model edges on the network graph as processors that only execute communication tasks. These edge-based schemes schedule access to the network, and can more accurately model actual network conditions [22, 40, 61, 94, 96].

Unlike previous work, in Chapter 4 we consider only a single communication architecture (Cell Processor’s NoC), and model the end-point contention more faithfully than the one-port and LogP models. Our approach also differs from previous work by introducing a stochastic network model, since the considered processing model allows remapping of tasks to processors.

2.2.4 Dynamic Schedulers

Static schedules are not always efficient in unpredictable computational environments, as the estimated execution and communication time used when scheduling may not be accurate. Dynamic matching and scheduling algorithms generate the schedule at runtime, so the scheduling heuristics can use more accurate information about the running application. A number of dynamic schedulers have been proposed in the relevant literature [3, 26, 56, 50, 54, 39, 114].

Using run-time information as it becomes available forces a dynamic scheduler to make scheduling decisions in real-time. A main challenge to the development of a dynamic scheduler is limiting its complexity to ensure real-time response. One approach to limiting runtime complexity while generating high quality schedules is to utilize a hybrid scheduler. A hybrid scheduler takes a statically generated schedule as an input, and tasks are selectively rescheduled using runtime information [13, 68, 67].

As one example, Maheswaran and Siegel [68] propose a dynamic re-mapper that uses a statically generated schedule as an input. The first phase in the scheduling uses the initial static mapping generated by the compile time scheduler and partitions the DAG into B blocks numbered consecutively from 0 to B − 1. Blocks are generated such that all tasks within a block are independent, and inter-block data dependencies are monotonically increasing. In other words, all subtasks that send data to tasks in block k must be partitioned into blocks 0 to k − 1. The (B − 1)-th block includes all tasks without successors and the 0-th block includes all tasks without predecessors [68]. Generating three blocks from a seven node DAG is shown in Figure 2.2. Once the tasks in the DAG are partitioned, they are scheduled at runtime based on their block. Blocks are scheduled consecutively from block 0 to B − 1. When tasks from block i are being executed, the re-mapper is scheduling block i + 1 [68]. Work extending the hybrid re-mapper in [68] merges blocks together at runtime to consider a larger number of tasks when scheduling, reducing the resulting schedule length [67, 13]. These extensions operate on largely the same principle, however.
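One simple way to realize this partitioning assigns each task to the block equal to its longest path length (in edges) from the entry tasks: tasks sharing a block are then independent, and every predecessor of a block-k task falls in blocks 0 through k − 1. The Python sketch below is our own rendering of this rule; the re-mapper in [68] additionally places all exit tasks in the final block, which is omitted here.

    from collections import deque

    def partition_into_blocks(tasks, edges):
        preds = {t: [] for t in tasks}
        succs = {t: [] for t in tasks}
        for u, v in edges:
            preds[v].append(u)
            succs[u].append(v)
        indegree = {t: len(preds[t]) for t in tasks}
        block = {t: 0 for t in tasks}                  # block 0: no predecessors
        queue = deque(t for t in tasks if indegree[t] == 0)
        while queue:                                   # Kahn-style traversal
            u = queue.popleft()
            for v in succs[u]:
                block[v] = max(block[v], block[u] + 1)
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
        return block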

The runtime schedulers in Chapters 4 and 5 differ from previous hybrid schedulers in the focus on contention in the network and fault tolerance as the rationale for the dynamic portion of the scheduling system. Additionally, unlike previously proposed schedulers, we consider “dynamic” blocks when scheduling the DAG at runtime, where the block membership depends on what tasks have already been scheduled, as well as a task’s level in the DAG.

2.3 Intermittent Faults

2.3.1 Sources of Faults

As more processing cores are integrated into a single system, the cores are becoming more susceptible to hardware errors. Particularly, intermittent hardware faults can cause hardware errors that occur in bursts. These faults are often caused by process variation combined with voltage and temperature fluctuations (also denoted as PVT fluctuations) [14, 29].

Figure 2.2: Partitioning a DAG into blocks [68].

Because the underlying cause of intermittent faults can vary widely, so can the duration and number of cores affected by the fault. Different software phases can exercise different portions of a core, causing intermittent faults depending on application behavior [104, 92, 33]. Voltage fluctuations can affect a number of cores, but the effects last on the order of nanoseconds [15]. Temperature fluctuations can be localized to a single processor or group of processors, causing faults that can last up to several seconds [79].

In addition to hardware faults, if a mechanism were in place to allow the software to recover from hardware faults, it would free operating system, firmware, or hypervisor modules to make decisions that affect processor availability. For instance, the operating system could decide to limit the number of processing cores available to an application to limit power consumption, or dedicate certain resources to a high-priority application [35, 25]. This type of behavior would be enabled by software that can recover from changes in processor availability.

2.3.2 Fault Tolerance in Chip Multiprocessors

The literature contains a number of examples that provide fault tolerance for a CMP system. For instance, Ding et al. [35] propose a helper-thread based scheme that aims to reduce the energy-delay product (EDP) when processor availability can change during application execution. The helper threads execute in parallel to the application threads, gathering energy-delay product statistics during an application’s execution using hardware performance counters. The system then uses this information to scale the number of active processors and threads to minimize the EDP [35].

Chakraborty, et al. [24] propose an Over-provisioned Multi-core System (OPMS) as a way to provide fault-tolerance and reduce power consumption. In an OPMS, the number of available processing cores is larger than the number of simultaneously active cores allowed by thermal or power constraints of a chip. Chakraborty, et al. [24] use a lightweight Virtual Machine Monitor (VMM) to perform dynamic task reassignment by mapping computation fragments to processors as processor availability changes during execution.

While Ding et al. [35] do not consider how faults are detected (it is assumed that the Operating System notifies the application when a fault occurs), other schemes consider the mechanism for detecting transient hardware faults as well as proposing solutions to increase fault-tolerance [25, 23]. The methods proposed in Chapter 5 differ from those found in the literature, as our method targets CMPs with reconfigurable hardware. The opportunity to reconfigure allows the architecture to find more efficient configurations when processor availability changes.

2.4 Motivation

Our initial work programming the TRIPS and Cell processors showed the need for tools managing the parallel and reconfigurable resources available in these processors. The next two subsections overview preliminary work developing applications for the TRIPS and Cell processors and explain how this work led to the work in the remainder of the dissertation.

2.4.1 GPS Acquisition on the TRIPS Processor

Figure 2.3 illustrates the motivation for scheduling reconfiguration on the TRIPS processor. In Figure 2.3, one can see several high level phase changes when executing GPS Acquisition on TRIPS, as indicated by changes in the average number of Instructions executed Per Cycle (IPC). GPS Acquisition is a real-world software radio application [63, 109]. A TRIPS processor consists of sixteen processing tiles (cores), so an average IPC of eight means that one half of the processing tiles remain idle (or are busy communicating) on average at any time, an average IPC of four means three-quarters of the tiles are idle, etc. The graph shows three distinct high level phases, as detected by examining average IPC. The first phase runs from 9 million cycles to about 41 million cycles. The second phase shows higher and more variable average IPC, and lasts from 41 million to 48.6 million cycles. The final phase is clearly composed of shorter sub-phases and continues through the remainder of the experiment.

Figure 2.3: Graph illustrating three distinct phases executing GPS acquisition on the TRIPS processor.

The first phase shown in Figure 2.3 utilizes fewer processing resources than the two subsequent phases, indicating that those processing resources could be used for other tasks. Similarly, phase three shows high variability in average IPC, but average IPC over the entire phase is significantly lower than the peak IPC value. The trends shown in Figure 2.3 show that the single-threaded usage of the TRIPS processor changes dynamically. This work led us to develop the reconfiguration scheduler described in Chapter 3. After breaking the GPS Acquisition application into tasks, several tasks that utilize fewer tiles can be executed under TRIPS’s T-Morph, which runs four threads simultaneously on the same hardware [19], without reducing the per-task performance significantly. Then tasks that utilize more tiles can be executed under TRIPS’s D-Morph, which runs a single thread [19], to get the highest single task performance possible.

2.4.2 RDA on the Cell Processor

We ran a number of performance tests using IBM’s Cell processor. Figure 2.4 shows our tests using the Cell’s SPE as an accelerator for the Robust Data Alignment (RDA) application, a computer vision application [51, 52]. Our work showed the performance potential of the Cell processor. Using a single SPE yielded an approximately 4x performance increase compared to comparably clocked Intel processors [81]. As there are 8 SPEs on a Cell processor, we expected a significant increase in performance as we increased the number of SPEs used. However, the actual performance using multiple SPEs was significantly lower due to memory and NoC contention. This realization led us to develop the contention aware scheduling algorithms presented in Chapter 4.

Additionally, our original development for the Cell processor led us to several other conclusions. First, the Cell processor’s organization, specifically the explicitly distributed on-chip memory instead of a logically shared cache, enabled very high performance. However, high performance was difficult to obtain, resulting in a fair amount of performance fragility when manual or ad-hoc methods are used. This reinforced work done by others stating that software will be the important consideration in the efficient use of future CMP designs [87]. The Cell processor was originally designed for applications with regular memory accesses, where the SPE’s Local Store (LS) memory can be most effectively leveraged [48], such as graphics or other “streaming” applications. However, there are several examples of work trying to fit more irregular applications, like graph exploration, to the Cell’s organization [76, 110, 111]. Unfortunately, these efforts largely used ad-hoc methods to overlap computation and communication on the Cell’s SPEs, further illustrating the need for novel tools to ease the development of software for H-CMPs. The work presented in Chapter 4 addresses a subset of the problems facing the development of high-performance applications for the Cell processor.

Figure 2.4: Comparing the performance of Cell’s SPE to Intel’s processors [81] on the RDA application.

CHAPTER 3

SCHEDULING ON RECONFIGURABLE HARDWARE

3.1 Introduction

One of the more difficult problems facing the use of Reconfigurable Hardware (RH) for general purpose computing is the efficient management of reconfigurable resources. To enable the scheduling of application tasks onto RH resources and scheduling reconfiguration at runtime, this chapter introduces the Mutually Exclusive Processor Groups reconfiguration model. The Mutually Exclusive Processor Groups model is simple, but it still captures many different modes of reconfiguration, ranging from Polymorphous Computing Architecture (PCA) processors to Field-Programmable Gate Arrays (FPGAs). Next, we propose a reconfiguration aware list scheduler extension named the Mutually Exclusive Processor Groups (-MEG) extension. Our goal is to have the -MEG extension choose the most efficient configuration for each application phase and schedule the appropriate reconfigurations. Using any list scheduler as a “base” scheduler, -MEG schedules hardware reconfiguration using a novel backtracking algorithm. While the -MEG extension could be used with any list scheduler, we demonstrate the -MEG extension using HEFT [108] as our base scheduler to create HEFT-MEG.

Section 3.4.1 discusses our results using HEFT-MEG to schedule randomly generated, LU decomposition, Laplace Transform, and Gaussian Elimination task graphs onto a number of architectures consisting of a mix of microprocessors and Field Programmable Gate Array (FPGA) RH processors. In simulation, we find that using HEFT-MEG to evaluate reconfiguration decisions generates schedules that are about 20% shorter than HEFT [108] using a single configuration, and about 50% shorter than a previously proposed Genetic Algorithm (GA) based hardware-software co-scheduler [70] for graphs with larger numbers of tasks. Section 3.4.2 discusses our results using HEFT-MEG to schedule GPS Acquisition [63, 109] (a software radio application) onto the reconfigurable TRIPS processor [91] (developed at UT at Austin). We obtained access to a TRIPS evaluation board through our collaboration with Air Force Research Laboratory (AFRL), and used this evaluation board for some of our experiments [105]. In actual execution, we find that HEFT-MEG successfully schedules reconfigurations to occur at runtime, reducing the execution time of GPS Acquisition by about 20% compared to the best performing single configuration schedule.

3.2 Reconfiguration Model: Mutually Exclusive Processor Groups

When an RH resource has more than one configuration, each configuration is composed of one or more logical processors. Obviously, it is not possible for two different configurations using the same underlying hardware to execute tasks concurrently; we define the logical processors that use the same underlying hardware to be Mutually Exclusive Processors. Mutually Exclusive Processors are processors that, while logically distinct, cannot be used concurrently. For our model, an RH does not need to instantiate an entire instruction based architecture to be considered a logical processor. Rather, any computational function that can be realized by an RH is considered a logical processor (such as an ALU or multiplier). This way, any hardware block that can execute a task in the DAG can be utilized in our reconfiguration model. Ensuing use of the term processor will refer to a logical processor.

Figure 3.1 shows an example of how we define the relationships among processors that can be instantiated by an FPGA using our Mutually Exclusive Processor Groups model. All configurations for a particular RH belong to a single SuperGroup. Each configuration is represented as a single SubGroup. Processors belonging to the same SubGroup can be used concurrently; logical processors in the same SuperGroup and in different SubGroups cannot be used concurrently and are mutually exclusive.

Figure 3.1 illustrates how a set of possible configurations for an FPGA map to Mutually Exclusive Processor Groups. Figure 3.1.a shows three possible configurations for an FPGA. The possible configurations are composed of soft-processors of five types, V–Z. Across all the configurations, there are thirteen logically separate processors. Figure 3.1 illustrates that a processor type can be present in multiple configurations and more than one instance of a processor type can be present in a single configuration. Figure 3.1.b shows how the three possible configurations are mapped to a single SuperGroup (Super) that contains three SubGroups (S1, S2, and S3). The group membership defines which processors are mutually exclusive. For instance, processor X in S1 is mutually exclusive with processor Y in S3, because these processors belong to different SubGroups within the same SuperGroup.

Figure 3.1: Illustrating mutually exclusive processors with a group of possible configurations for an FPGA.

Figure 3.2 illustrates how the TRIPS processor’s configurations map to Mutually Exclusive Processor Groups. A TRIPS processing core has two possible configurations, the D-Morph and the T-Morph. The D-Morph runs a single thread, while the T-Morph runs four threads simultaneously. Therefore, the D-Morph consists of a single logical processor, while the T-Morph is modeled as four logical processors. Based on this, D-Morph’s processor (X) is mutually exclusive with T-Morph’s processors (X′).

Figure 3.2: Illustrating mutually exclusive processors with the TRIPS processor configurations.

A strength of the Mutually Exclusive Processor Groups model is that it captures many different kinds of reconfiguration, ranging from PCA computing cores to FPGAs, but the model remains simple. However, the proposed model requires all configurations that will be considered in scheduling to be enumerated, and the relationships among all the possible configurations need to be specified before scheduling. Because of this, it is likely that the system designer or programmer will choose a set of promising configurations to be considered during scheduling.

3.3 HEFT with Mutually Exclusive Processor Groups

3.3.1 -MEG Scheduling Extension

We propose the Mutually Exclusive Processor Groups (-MEG) scheduling exten- sion as a means to augment any list scheduler with the ability to schedule for RH resources. When scheduling, the goal is to have the -MEG extension choose the most

30 efficient available configuration for each application phase. This is done by using the

-MEG scheduling extension to explore the reconfiguration space while the base sched- uler decides the mapping of tasks to processors. While the -MEG extension could be applied to any list scheduler, we demonstrate the -MEG extension using HEFT, orig- inally proposed by Topcuoglu et al. [107, 108]. HEFT with the Mutually Exclusive

Processor Groups extension (HEFT-MEG) analyzes an application at compile time and generates a runtime schedule.

The -MEG extension uses a novel backtracking algorithm to evaluate the performance impact of different reconfiguration decisions. After each task is scheduled,

-MEG finds a number of candidate reconfiguration times over a programmer-controllable window of size ws. For each candidate reconfiguration time tk, -MEG backtracks by unscheduling all tasks that finish after tk. Based on the properties of the unscheduled tasks, -MEG chooses a number of new configurations. For each new configuration, a reconfiguration task is inserted at tk, and the unscheduled tasks are rescheduled using the base scheduler under the new configuration. For each configuration and candidate reconfiguration time combination, -MEG tentatively reschedules the tasks and only keeps the partial schedule that has the shortest makespan. By doing this, the -MEG scheduling extension iteratively refines the reconfiguration schedule with each scheduled task. A pseudo-code representation of HEFT-MEG is shown in Algorithm 2, where lines 10 through 21 are additions to the original HEFT algorithm.

The scheduling and cost functions HEFT uses are detailed in Chapter 2.

HEFT-MEG specifically uses the bottom level rank (rankb as defined in Equation

2.2) in the listing step and Earliest Finish Time (EFT, as defined in Equation 2.6) in the placement step. Also note that HEFT-MEG uses the same insertion-based policy

Algorithm 2 HEFT-MEG Algorithm
 1: procedure HEFT-MEG(G = (V, E, w, c))   ▷ G is a task graph
 2:   Compute rank for all tasks t ∈ V   ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, ni, and remove from list
 6:     for all processors pj do
 7:       Evaluate EFT(ni, pj), saving the minimum EFT   ▷ Using Equation 2.6
 8:     end for
 9:     Schedule task ni on the processor px with the minimum EFT
10:     Save the minimum EFT as EFTcurr
11:     Find candidate reconfiguration times between EFTcurr and EFTcurr − ws and put in listtimes
12:       ▷ Candidate reconfiguration times found using Equation 3.1
13:     for all times tk in listtimes do
14:       Generate possible reconfigurations for tk and put in listr
15:       for all reconfiguration possibilities rw in listr do
16:         Unschedule tasks scheduled between tk and EFTcurr, put in list2
17:         Insert reconfiguration rw at tk
18:         Perform HEFT with tasks in list2 using the new configuration
19:       end for
20:     end for
21:     Choose the schedule (from all considered configurations) that minimizes the partial schedule's makespan
22:   end while
23: end procedure

as the original HEFT. HEFT-MEG distinguishes itself from the scheduler proposed by

Topcuoglu et al. [107] in its consideration of reconfigurable computational resources.

We define a candidate reconfiguration time as the point in time that HEFT-MEG will

evaluate a reconfiguration possibility. C is the set of candidate reconfiguration times and is defined as:

C = (ET ∪ ST) ∩ {t : t > EFTcurr − ws}    (3.1)

Where ET and ST are, respectively, the end and start times of all tasks in

the partial schedule, EFTcurr is the EFT of the last scheduled task, and ws is the

programmer-defined window size.
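For illustration, the set C of Equation 3.1 can be computed directly from a partial schedule. The sketch below (in Python, with an assumed list of (start, finish) pairs rather than our actual data structures) mirrors the set operations above.

def candidate_reconfig_times(partial_schedule, eft_curr, ws):
    """Equation 3.1: task start/end times later than eft_curr - ws."""
    ends   = {finish for (start, finish) in partial_schedule}  # ET
    starts = {start for (start, finish) in partial_schedule}   # ST
    return sorted(t for t in (ends | starts) if t > eft_curr - ws)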

Figure 3.3 illustrates how the -MEG evaluates different reconfiguration options

collaboratively with the base scheduler. In Figure 3.3, the target hardware includes a

single RH chip with its configurations represented by a single SuperGroup with three

SubGroups and a microprocessor which (not being reconfigurable) is represented by

a single SubGroup within a SuperGroup. In Figure 3.3.a, S1 represents the current

configuration of the FPGA, and tasks c and d have been scheduled on f10 and f21, respectively. Then, in 3.3.b, HEFT-MEG backtracks the schedule to a point within ws, inserts a reconfiguration task across all processors part of the SuperGroup that is being reconfigured, and reschedules tasks b, c, d, and e under the new configuration using the base scheduler. The reconfiguration task's virtual execution time is the time it takes the architecture to reconfigure, so reconfiguration time is considered during scheduling. In this case, the reconfiguration causes the new partial schedule's makespan to be less than the previous configuration's partial schedule, so the reconfiguration is saved and used in the next scheduling step. In more complicated examples, the number of candidate reconfiguration times and configurations considered increases.

3.3.2 Generating New Configurations

In Algorithm 2, the generation of possible configurations in line 14 is undefined.

The naive version of HEFT-MEG iterates over all possible configurations. However,

Figure 3.3: Scheduling a DAG fragment onto RH using HEFT-MEG. For each candidate reconfiguration time: a) backtrack the schedule and insert a reconfiguration task, then b) reschedule tasks b, c, d, and e.

an architecture with pr RH resources and q total configurations per RH can exist in one of q^pr different configurations.

We define a function FindSmartConfs with the goal of building a “good” config-

uration at a particular candidate reconfiguration time, t, based on a partial schedule

that extends to some time after t. To do this, each task in the partial schedule that finishes after the time t “chooses” a processor to add to the configuration. A pseudo-code representation of the FindSmartConfs configuration building heuristic

is shown in Algorithm 3. In order to find a “good” configuration, choosing the same

processor for independent tasks is penalized. The rationale for this choice is to have

independent tasks (that will likely be scheduled to execute in parallel) choose dif-

ferent processors to add to the configuration. This way, portions of the task graph

with more parallelism will likely build a configuration with more processors. Simi-

larly, portions of the task graph that are more serial in nature will build a higher

performance configuration, regardless of the number of processors.

To implement the processor choice heuristic, we define the Processor Penalty (PP)

cost function. PP penalizes processors that have already been “chosen” for the current

configuration by another task, unless task ni directly depends on the task that has chosen the processor. First we define Qi,j as the set of tasks that the source task ni

does not depend on that have already "chosen" processor pj. Qi,j is defined more formally as:

Qi,j = { nk : nk ∉ pred(ni) ∧ ϱk(pj) }    (3.2)

Algorithm 3 Function to find a "good" set of configurations
 1: function FindSmartConfs(t, m)   ▷ t is the time the configurations will be inserted in the schedule
 2:     ▷ m is the number of configurations that will be generated
 3:   Find all tasks that finish after t and put them in listt, ordered by rank
 4:   for all tasks ni in listt do
 5:     for all processors pj do
 6:       Evaluate PP(ni, pj), saving the minimum PP   ▷ Using Equation 3.3
 7:     end for
 8:     pk is the processor having the minimum PP(ni, pk)
 9:     if the current configuration does not contain pk's SuperGroup then
10:       Add pk to Confcurr
11:     end if
12:   end for
13:   Ensure Confcurr contains one SubGroup from each SuperGroup
14:       ▷ Choose one random SubGroup for each unrepresented SuperGroup
15:   Add Confcurr to ConfSet
16:   for i = 1 : m − 1 do
17:     Choose a random configuration in ConfSet and copy it to κi
18:     Swap one random SubGroup in κi with another SubGroup in κi's SuperGroup
19:     Add κi to ConfSet
20:   end for
21: end function

Where pred(ni) is the set of all of ni's predecessors, and ϱk(pj) is true iff task nk has already "chosen" processor pj for the current configuration. Next, we define PP as:

PP(ni, pj) = wi(pj) + max_{nk ∈ Qi,j} {wk(pj)}    (3.3)

Where ni is the source task, pj is the processor being tested, and wi(pj) is the runtime of task ni on processor pj.
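A minimal sketch of Equations 3.2 and 3.3 follows; all names are illustrative rather than taken from our scheduler. It assumes w[i][j] holds task ni's runtime on processor pj (with float('inf') marking processors on which ni cannot execute), pred(i) returns ni's predecessor set, and chosen_by[j] lists the tasks that have already "chosen" pj for the configuration being built.

def processor_penalty(i, j, w, pred, chosen_by):
    # Q_{i,j} (Equation 3.2): tasks that chose pj and on which ni does not depend
    q = [k for k in chosen_by[j] if k not in pred(i)]
    # Equation 3.3: ni's own runtime plus the largest runtime among tasks in Q
    penalty = max((w[k][j] for k in q), default=0)
    return w[i][j] + penalty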

In addition to a single "good" configuration (κ), FindSmartConfs also generates m−1 random configurations "around" κ (where m is programmer defined). A pseudo-code representation of this process is shown in lines 16-20 of Algorithm 3. The function FindSmartConfs(t, m) avoids the exponential growth in configuration evaluation

Figure 3.4: Illustrating FindSmartConfs choosing a "good" configuration based on the partial schedule. a) For each candidate reconfiguration time, b) unschedule tasks after the candidate reconfiguration time, and c) determine the new configuration using PP. Note 1: the PP for tasks c and d on processor f21 is increased because task g has already chosen processor f21, and c and d do not depend on g. Note 2: the PP for task e running on processors cpu1 and f110 is not increased, even though the processors were already chosen, because task e depends on the tasks that chose processors cpu1 and f110.

Figure 3.5: Continuation of Figure 3.4. Illustrating the generation of m − 1 other configurations, and their testing in HEFT-MEG. d) Generate m − 1 other configurations around the "good" configuration found in 3.4.c: for each new configuration, choose one random SuperGroup and swap the chosen SubGroup for another SubGroup. e) For each new configuration, insert a reconfiguration task and reschedule the tasks unscheduled in (b). f) Among all candidate reconfiguration times and configurations (including the originating configuration), choose the partial schedule with the shortest schedule length; in this example, Config2 yields the shortest schedule length. g) Schedule a new task from the task graph, find new candidate reconfiguration times, then for each candidate reconfiguration time, repeat starting at (3.4.a).

as the number of reconfigurable processors grows by generating a constant number of

configurations for each scheduling step.

Figures 3.4 and 3.5 illustrate how FindSmartConfs works with the HEFT-MEG

algorithm to choose a set of “good” configurations to test during scheduling. Figure

3.4.a shows a partial schedule with 9 tasks (a–i), scheduled on a two microprocessor,

two FPGA system. In Figure 3.4.b, tasks b–e and g–i are unscheduled, and Figure

3.4.c illustrates how FindSmartConfs uses the PP cost function to build a new config-

uration based on the tasks unscheduled in Figure 3.4.b. FindSmartConfs fills in the

table in Figure 3.4.c from top to bottom, where the tasks along the left are prioritized

by rank, using Equation 2.2. First, FindSmartConfs calculates task g’s PP for all

processors. Notice that g’s PP is ∞ for processors on which g cannot execute. Then,

g chooses the processor with the minimum PP (f21) to add to the new configura-

tion. Also, all other processors part of f21’s SubGroup are also added to the new

configuration, and processors that are mutually exclusive with f21’s SubGroup are

excluded from the new configuration. The same process is subsequently used for the

other tasks. The entire configuration is specified after task h chooses a processor, but the final two tasks are included for completeness.

Figure 3.5 continues the example in Figure 3.4. Figure 3.5.d illustrates how Find-

SmartConfs generates the additional m−1 other configurations to test. To generate a

new configuration, one SubGroup that is part of the new configuration is chosen and swapped with another SubGroup from the same SuperGroup. Figures 3.5.e–g illustrate how

the new configurations generated by FindSmartConfs are considered by HEFT-MEG

in generating a schedule. A good choice for m depends on architectural features and

the target application. We found that setting m to 5 resulted in high quality sched-

ules. In Figure 3.5.e, the tasks unscheduled in Figure 3.4.b are rescheduled using the

new configurations. In this example, Figure 3.5.e shows that Config2 yields a shorter

schedule than either Config1 or the original configuration. Then, Config2 is saved for the next step in scheduling, so the reconfiguration schedule is iteratively refined in each scheduling step.
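The perturbation step in lines 16–20 of Algorithm 3 can be sketched as follows; representing a configuration as a mapping from each SuperGroup to its active SubGroup, and the helper subgroups_of, are assumptions made for illustration only.

import random

def neighbor_configs(good_conf, subgroups_of, m):
    """Lines 16-20 of Algorithm 3: m-1 random configurations "around" good_conf."""
    conf_set = [dict(good_conf)]
    for _ in range(m - 1):
        kappa = dict(random.choice(conf_set))   # copy a configuration in ConfSet
        sg = random.choice(list(kappa))         # pick one random SuperGroup
        others = [s for s in subgroups_of(sg) if s != kappa[sg]]
        if others:                              # swap in a different SubGroup
            kappa[sg] = random.choice(others)
        conf_set.append(kappa)
    return conf_set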

3.3.3 HEFT-MEG Time Complexity

To calculate the runtime of HEFT-MEG, first we define several variables. An application consists of n total tasks, and the number of edges in the graph is O(n^2) in the worst case of a very dense task graph. The architecture is composed of a total of p distinct logical processors with pr RH processors, and p ≥ pr. Each RH processor has q possible configurations, so the total number of possible configurations

is equal to q^pr. Additionally, with the worst case of an infinite ws, the number of candidate reconfiguration times in line 11 of Algorithm 2 is O(n). Therefore, the naive implementation of HEFT-MEG has a worst case runtime of O(n^4 · p · q^pr). HEFT-

MEG using FindSmartConfs (from Algorithm 3) reduces the runtime significantly.

First, Algorithm 3 has a runtime of O(n^2 · p) and generates a constant number of

configurations (m). Therefore, HEFT-MEG using FindSmartConfs has a worst case

runtime of O(n^4 · p).

If p ≪ n, using a smaller ws reduces the time complexity further. If the ws is

less than or equal to the execution time of the shortest running task, the number

of tasks that can be scheduled on a limited number of processors (p) in the ws is

≤ p. Then, listtimes at line 11 of Algorithm 2 would contain at most p candidate

reconfiguration times. Considering at most p candidate reconfiguration times reduces the loop bounds in line 13 in Algorithm 2 and the number of tasks considered in lines

14 and 18 in Algorithm 2 to O(p). Additionally, the number of tasks considered in

Algorithm 3 is also O(p), because the tasks were extracted from a smaller window in the schedule. Therefore, HEFT-MEG using FindSmartConfs has an O(n^2 · p + n · p^4) time complexity, for a smaller ws. Limiting the ws, even if it is longer than the shortest

running task, still reduces the number of candidate reconfiguration times and number of

tasks considered in Algorithm 3, reducing the execution time of HEFT-MEG. A good

choice for ws depends on architectural features and the target application. In our

case, we wanted to evaluate different reconfiguration decisions considering a number

of different tasks at once, so we found that a ws at least equal to the execution time

of the longest running task generated high quality schedules.

Figures 3.6 and 3.7 compare the actual execution time for HEFT, HEFT-MEG

with the naive reconfiguration generation, HEFT-MEG using FindSmartConfs, and

the hardware-software co-scheduler proposed by Mei et al. [70] (designated as Mei00).

All scheduling tests were performed on a machine with a 1.5 GHz G4 processor, 1.5GB

of RAM, and running Mac OS X 10.4.11 as the OS. All schedulers were written in

Java, and used the 1.5.0_13 Java(TM) 2 Runtime Environment, Standard Edition

virtual machine. In both Figures 3.6 and 3.7, FindSmartConfs generated 5 config-

urations in each scheduling step. In Figure 3.6, each node consists of one FPGA

and one microprocessor, and the task graphs are randomly generated and contain 150

tasks. In Figure 3.6, one can see that HEFT-MEG using FindSmartConfs completes

generating a schedule significantly earlier than HEFT-MEG naive as we increase the

number of nodes in the architecture. This is because HEFT-MEG using FindSmartConfs

Figure 3.6: Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of nodes in the architecture.

Figure 3.7: Comparing the runtime of HEFT-MEG to reconfiguration oblivious HEFT [108] and Mei00 [70] while varying the number of tasks in the DAG.

avoids the exponential growth in the search space by generating a constant number

of configurations to explore at each scheduling step.

In Figure 3.7, the architecture consists of 2 nodes for all tests, and the task graphs

are randomly generated. Figure 3.7 also shows that the execution time of HEFT-MEG

using FindSmartConfs is shorter than HEFT-MEG naive as we increase the number of tasks in the DAG. However, the difference is not as great as when we increase the number of nodes. In all cases however, both Figures 3.6 and 3.7 show that HEFT-

MEG using FindSmartConfs takes significantly less time to generate a schedule than

the hardware-software co-scheduler proposed by Mei et al. [70].

3.4 Results

3.4.1 Simulation Results

In our simulations, we compared the performance of both versions of HEFT-MEG

to a single configuration HEFT scheduler and a hardware-software co-scheduler pro-

posed in Mei et al. [70]. In our tests, we designate HEFT-MEG naive as the version

of HEFT-MEG that performs an exhaustive search of the configuration space each

scheduling step and HEFT-MEG w/SmartConf as the version of HEFT-MEG that

utilizes the FindSmartConfs function (Algorithm 3) to reduce the amount of the con-

figuration space considered each scheduling step. HEFT using a single configuration

performs an exhaustive search of the configuration space and chooses the configura-

tion that yields the shortest schedule, and is designated as HEFT SingleConf. The

scheduler proposed by Mei et al. is a Genetic Algorithm (GA) that partitions the

tasks between an FPGA and a microprocessor and uses a list scheduler to determine

task execution order [70], which we designate as Mei00. The original Mei00 was

targeted to scheduling a task graph with task deadlines onto a single FPGA, single microprocessor system [70]. We extended Mei00 to schedule onto larger numbers of processors by changing its gene representation to include more than one bit per task, so that a gene represents the processor mapping for each task. Similarly, we changed the fitness function so that Mei00 favored schedules that generated a shorter schedule length to fit into our scheduling model (of tasks without deadlines).

The test architecture is composed of compute nodes, with each node containing a single microprocessor and a single FPGA. For our tests, we generated a random set of configurations for each experiment. The configurations are composed of nine soft-processor types, and each soft-processor utilized (on average) 50% of the total

available FPGA area. A single FPGA can have several soft-processors executing concurrently.

Unless otherwise mentioned, the test DAGs were generated with a communication to computation ratio (CCR) of 1.0, and contained 150 tasks. We performed experiments where we varied the CCR between 0.1 and 10; however, we find the schedule length results do not depend on the task graph's CCR. As HEFT-MEG uses the same list ranking, processor selection, and placement algorithms as HEFT SingleConf, this result is not surprising. We also assume that the probability that a particular task could execute on an FPGA soft-processor was 80% and that the FPGA implementation of a task executes ten times faster than the microprocessor version on average (with a maximum twenty times speedup). Finally, we assume the reconfiguration time is equal to the average task execution time on the microprocessor. For all tests, the schedule length was normalized against HEFT SingleConf.

The results on randomly generated DAGs (Figures 3.8, 3.9, 3.10, and 3.11) show that HEFT-MEG generates the shortest schedules (on average) among all the tested scheduling algorithms. Figure 3.8 plots the normalized schedule length while varying the number of tasks between 50 and 550 on a single computational node. A node is composed of one microprocessor and one FPGA. While Mei00 generates schedules of approximately equal length to those generated by HEFT-MEG for a smaller number of tasks on a single processor, HEFT-MEG significantly outperforms both Mei00 and

HEFT SingleConf as the number of tasks increases. Figure 3.8 shows both versions of HEFT-MEG outperforming HEFT SingleConf by approximately 35% on a single node, and Figure 3.9 shows HEFT-MEG outperforming HEFT SingleConf by just over 20% on two nodes. HEFT-MEG's advantage over HEFT SingleConf comes from its ability to generate schedules that include reconfiguration.

Figures 3.8 and 3.9 show that HEFT-MEG significantly outperforms Mei00, with the performance difference growing as we increase the number of tasks. The effect of

HEFT-MEG outperforming Mei00 for larger numbers of tasks can be explained by two factors. First, as the number of tasks grows, the effect of scheduling with the non-insertion based list scheduler part of Mei00 is cumulative with each successively scheduled task. Secondly, as we increase the number of tasks in the DAG, Mei00’s search space grows exponentially, but Mei00 uses a constant number of individuals and generations in evaluating the GA.

Additionally, examining Figures 3.8 and 3.9, one can see that HEFT-MEG w/SmartConf performs very close to HEFT-MEG naive, despite HEFT-MEG w/SmartConf considering a much smaller number of possible configurations in each scheduling step. In Figure 3.10, we explore the effect of the number of nodes in

Figure 3.8: Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a one node architecture.

the architecture on the schedule length. As the number of nodes in the architecture increases, the relative performance of HEFT-MEG decreases. This effect is because as the number of reconfigurable processors in the architecture grows, the advantage of being able to reconfigure at runtime decreases. In our tests, we assumed the FPGA configurations were composed of 9 soft-processor types. With a 4 node architecture

(containing 4 FPGAs) most (if not all) soft-processor types can be contained in a single configuration. Because of this, HEFT SingleConf can generate schedules with only slightly longer lengths than HEFT-MEG. Additionally, Figure 3.10 shows that as the number of nodes increases, HEFT-MEG w/SmartConf’s performance suffers slightly compared with HEFT-MEG naive. The reduction in performance is due to

Figure 3.9: Normalized schedule length for random DAGs while varying number of tasks between 50 and 550, on a two node architecture.

the exponential growth of the size of the configuration space (as discussed in Section

3.3.3) and HEFT-MEG w/SmartConf continuing to test only a constant number of configurations each scheduling step. However, the difference in average schedule length between HEFT-MEG w/SmartConf and HEFT-MEG naive is less than 6% for a 4 node architecture, demonstrating that FindSmartConfs generates high-quality configurations likely to reduce overall schedule length.

Figure 3.11 explores how the relative reconfiguration time affects average schedule length. In Figure 3.11, the x axis varies the reconfiguration time, as compared to the average execution time. HEFT-MEG continues to generate schedules that are approximately 20% shorter than HEFT SingleConf until the reconfiguration time is

Figure 3.10: Normalized schedule length for random DAGs varying the number of nodes in the architecture between 1 and 4.

Figure 3.11: Normalized schedule length for random DAGs varying the relative reconfiguration time.

10^4 times greater than the average task execution time, at which point the schedule length of HEFT-MEG is longer than HEFT SingleConf. For longer reconfiguration times, HEFT-MEG does generate single configuration schedules. The reduction in performance from HEFT SingleConf to HEFT-MEG is because HEFT considers the entire DAG in the determination of the configuration, while HEFT-MEG effectively considers only the "top" of the DAG when generating the first configuration; once the first configuration is chosen, changing the configuration is so expensive that reconfiguration tasks are not added to the schedule. Note that the schedules generated using Mei00 are much longer than those of the HEFT schedulers for longer reconfiguration times, and are truncated in Figure 3.11. For a reconfiguration time of 10^6 times longer

Figure 3.12: Normalized schedule length vs. matrix size for Laplace Transform DAGs.

Figure 3.13: Normalized schedule length vs. matrix size for LU Decomposition DAGs.

Figure 3.14: Normalized schedule length vs. matrix size for Gaussian Elimination DAGs.

than the average task execution time, Mei00 generates schedules that are about 245 times longer than HEFT SingleConf.

Figure 3.12 shows that HEFT-MEG generates shorter schedules than both HEFT

SingleConf and Mei00 when scheduling Laplace Transform DAGs. Figure 3.12 shows a similar trend to Figure 3.9, where the schedules generated by HEFT-MEG are approximately 20% shorter than HEFT SingleConf, and do not change much as the matrix size and number of tasks in the DAG increase. Similarly, Mei00's schedule length increases substantially as the matrix size increases. There are similar trends in scheduling LU Decomposition DAGs, shown in Figure 3.13, with HEFT-MEG

generating schedule lengths up to almost 35% shorter than HEFT SingleConf. Interestingly, for matrix sizes between 25 and 45, HEFT-MEG w/SmartConf generates schedules with shorter length than HEFT-MEG naive. This effect is seen because

FindSmartConfs generates configurations that are effective for significant portions of the DAG, and allows HEFT-MEG w/SmartConf to avoid “local minima” when scheduling. However, as HEFT-MEG naive iterates over all possible configurations at each scheduling step, it becomes “stuck” with configurations that may be locally better performing, but increase schedule length overall.

Figure 3.14 shows that HEFT-MEG generates schedules with shorter length than

HEFT SingleConf and Mei00 for Gaussian Elimination graphs. Similar to Figure

3.13, HEFT-MEG w/SmartConf generates shorter schedules than HEFT-MEG naive for several matrix sizes. Again, this is because FindSmartConfs generates effective configurations and allows HEFT-MEG to avoid “local minima.”

3.4.2 Results on TRIPS

We performed experiments running on the TRIPS hardware using GPS Acquisition, a real-world software radio application [109]. First, we decomposed GPS Acquisition into a Directed Acyclic Task Graph (DAG), as shown in Figure 3.15. Note that this DAG is fairly coarse grained, as several tasks (including correlate, integrate, and inter-bin SNR) perform one-dimensional FFTs of several thousand points.

We executed GPS Acquisition on a TRIPS processor evaluation board located at

Air Force Research Labs (AFRL) at Wright Patterson Air Force Base (WPAFB). To obtain task runtime estimates, we measured the runtime of each task when executed

Figure 3.15: Directed Acyclic Task Graph (DAG) of the GPS Acquisition algorithm. The tasks are: Signal Prep, Generate PhaseShift, Generate Digold, Correlate, Integrate, Inter-Bin SNR, and Output.

under both configurations. When a task was executed under the T-Morph configuration, three other copies of the task were executed in the remaining thread slots, to take resource contention into account when determining task runtime estimations. Recon-

figuring the TRIPS processor is a logical reconfiguration involving only writing to a configuration register, so reconfiguration time on TRIPS is relatively short. Recon-

figuration is analogous to flushing the processor’s pipeline and servicing a processor interrupt, taking on the order of several hundred processor cycles. During scheduling,

we used a conservative reconfiguration time estimate of 700 cycles, which corresponds

to about 2µs. We then used HEFT-MEG to generate a task schedule for one iteration of the main loop of the GPS Acquisition algorithm. The resulting schedule is shown in Figure 3.16. In Figure 3.16, a task's execution time is approximately proportional to its length in the time dimension in the schedule.

We compared the runtime of GPS Acquisition using several schedules: the schedule generated with HEFT-MEG in Figure 3.16, and two schedules that utilized one configuration exclusively. The schedule using the D-Morph was trivial, as all tasks are scheduled on a single logical processor, X. To schedule GPS Acquisition onto the

T-Morph, we found an optimal schedule using an exhaustive search of the scheduling space. We then executed all three schedules and compared the total execution time to complete one iteration of the GPS Acquisition algorithm. The current state of the

TRIPS programming tools does not allow the configuration to be changed at runtime, even though the hardware is technically capable. For our tests, we executed each configuration phase independently, and calculated the total runtime by summing the execution time for each phase with the reconfiguration time for each reconfiguration task (assumed to be 700 cycles, or about 2µs). The runtime results are shown in Figure 3.17.

As Figure 3.17 shows, using the HEFT-MEG generated schedule reduces the runtime

of GPS Acquisition by about 20% compared to the D-Morph only schedule and over

50% compared to the T-Morph only schedule.

Figure 3.16: GPS Acquisition's schedule when HEFT-MEG is used for scheduling. Legend: SignalPrep: SP; Generate Digold: GD; Generate PhaseShift: PS; Correlate: C; Integrate: I; Inter-Bin SNR: SNR.

Figure 3.17: Comparing the runtime of GPS Acquisition using different schedules.

CHAPTER 4

THE MODELING AND SCHEDULING OF NETWORK ACCESS

4.1 Introduction

When creating parallel programs to execute on an H-CMP, NoC usage can become a first level design consideration. Contention for the NoC can drastically change the throughput, latency, and efficiency of communication. Unfortunately, a priori determination of contention on the NoC is difficult. This determination is made more difficult when techniques for load-balancing and multi-application execution are applied.

This chapter examines the Cell Processor in particular. Developed jointly by

IBM, Sony, and Toshiba for the Sony Playstation3 gaming system, the Cell processor is an example of an H-CMP [48, 45, 55] that promises impressive performance across a range of application types [111, 110]. The chapter begins by introducing a stochastic model for the contention on the Cell processor's NoC. The chapter then proposes a combination of compile and run time scheduling methods. This hybrid scheduling system uses two schedulers that work in concert, a Compile-time Scheduler (CtS) and a Run-time Scheduler (RtS). As the basis of the CtS, we propose the

Contention Aware (CA-) list scheduler extension. To demonstrate the CA- extension,

we modify the HEFT scheduler, proposed by Topcuoglu et al. [108], to create CA-

HEFT. While we focus on CA-HEFT, the CA- extension can be used in conjunction with any list scheduler with minimal modifications. Next, we propose the Contention

Aware Dynamic Scheduler (CADS) as the RtS. At runtime, CADS adjusts the schedule generated by CA-HEFT to account for variation in the communication pattern and actual task finish times. Even though the accuracy of the proposed stochastic model degrades as more messages communicate concurrently, the stochastic model allows CA-HEFT and CADS to schedule "around" network contention and generate higher-quality schedules. We find that using a CtS and RtS in concert improves the performance of several application types in real execution. As the Communication to Computation Ratio (CCR) increases, the performance benefit of using CA-HEFT and CADS to schedule "around" communication contention increases, resulting in up to a 60% reduction in execution time.

4.2 The Cell Processor’s Network on a Chip

Figure 4.1 illustrates the NoC of one example of an H-CMP: the Cell Processor

[48]. The Cell processor is composed of two different processing core types: a single

Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs). The

PPE and SPEs communicate with memory using the Memory Interface Controller

(MIC) and I/O devices using two I/O controllers (I/O 0 and 1). The PPE is a traditional 64-bit PowerPC processor and runs the operating system. The SPEs are high performance vector engines, but each SPE lacks a traditional cache and branch prediction unit; therefore the SPE is optimized for data-parallel code with simple control structures.

Figure 4.1: Block level diagram illustrating the topology of the Cell Processor's NoC.

4.2.1 Cell’s NoC: The EIB

The Cell Processor’s NoC is called the Element Interconnect (EIB). Despite its name, the EIB is organized as four concentric point-to-point rings. The EIB is clocked at 1.6GHz, or half the processing cores’ [48]. Each link in the EIB is capable of transferring 16Bytes per clock cycle, for a peak communication bandwidth of 25.6 GB/s/link [55]. The SPEs and PPE communicate over the EIB using explicit

DMAs. Each SPE includes a Memory Flow Controller (MFC) to manage outstanding

DMA requests, and the MFC can service up to 16Bytes of input and output each clock cycle for a peak simultaneous 25.6 GB/s input and output bandwidth [55].

When communicating over the EIB, the MFC breaks each DMA into packets that are 128Bytes or less in length [55]. The EIB has a centralized data bus arbiter that processes packet requests and decides which ring each packet takes. The arbiter always

selects one of the rings that has the shortest path, ensuring that data will not travel

more than half way around the ring. Also, the arbiter ensures that a new packet will

not interfere with other in-flight transactions by either scheduling the packet on a free

ring or delaying the packet until there is a free ring. Packet requests are scheduled

in a round-robin fashion. As a DMA message is likely to be significantly larger than

a single packet, the round-robin scheduling allows multiple DMAs to fairly share the

NoC bandwidth [55]. The centralized arbiter ensures that there are no conflicts or

collisions on the NoC, so there are no in-network packet buffers; rather new packet

requests are delayed at the source until the arbiter decides to schedule the packet

request. The EIB’s maximum data bandwidth is limited by the rate the centralized

arbiter can handle new packet requests, which is one per clock cycle. Each request can

transfer up to 128Bytes, so the theoretical peak bandwidth on the EIB is 128Bytes

× 1.6 GHz = 204.8 GB/s [55].

4.2.2 Cell EIB: In-Network Contention

A message’s packets are scheduled to the EIB’s links in a round-robin fashion,

allowing multiple messages to share links in a time-slice fashion [55]. Even when

two messages do not share a source or destination, it is still possible for them to contend on

the NoC. Figure 4.2 illustrates how messages with different sources and destinations

can still compete for NoC resources, and how it affects the total message latency.

Figure 4.2 does not show the MIC or I/O cores; the PPE and SPEs are all viewed

as identical as far as communication is concerned, and all messages are assumed to

be 512Bytes. In Figure 4.2.a, three messages are initiated simultaneously: 0 → 3,

1 → 4, and 2 → 6. Because all three messages use the link between 2 and 3 and the

Figure 4.2: Illustrating the operation of the Cell NoC. The latencies of messages 0 → 3, 1 → 4, and 2 → 6 depend on ordering by the arbiter and sharing of links on the NoC. a) Three messages (0 → 3, 1 → 4, and 2 → 6) are initiated, but the NoC can only transmit two of the flows simultaneously; b) messages 0 → 3 and 1 → 4 communicate concurrently while 2 → 6 blocks waiting for access.

61 Cell’s NoC can support only two concurrent messages on one “hop” simultaneously,

the centralized arbiter time-slices the communication so that the messages can share

the links over time. Figure 4.2.b shows how the three messages are transmitted on

the NoC over time. Each horizontal line in Figure 4.2.b represents one clock cycle

on the EIB. Each message is 512Bytes, which corresponds to four 128 Byte packets.

Each packet takes lp cycles to transmit:

lp = 128 Bytes / (16 Bytes/cycle) + m = 8 + m    (4.1)

Where m is the number of hops the message needs to reach its destination. The

m term in Equation 4.1 is added to the time it takes the system to transmit each

packet due to “pipelining” in the network. It takes m cycles for the first 16 Bytes in

the message to reach the destination; after the first 16 Bytes arrives, a new 16 Byte

portion of the data arrives every cycle. As an example, m = 3 for 0 → 3 and 1 → 4

and m = 4 for 2 → 6.
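As a small illustration of Equation 4.1 and the packet accounting used in this example (the helper names below are ours, not from the chapter):

def packet_latency(m):
    """Equation 4.1: cycles to transmit one 128-byte packet over m hops."""
    return 128 // 16 + m            # 8 cycles of data plus m cycles of pipeline fill

def packets_in_message(size_bytes):
    """Number of 128-byte packets the MFC splits a DMA into."""
    return -(-size_bytes // 128)    # ceiling division: 512 bytes -> 4 packets

# For Figure 4.2: packet_latency(3) = 11 cycles for 0->3 and 1->4,
# and packet_latency(4) = 12 cycles for 2->6.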

For the example shown in Figure 4.2, the message 0 → 3 has a latency of 57 cycles, while the latencies for the messages 1 → 4 and 2 → 6 are 68 and 70 cycles, respectively. Note that these latencies are only valid assuming that all three messages arrive simultaneously, and the centralized arbiter chooses the message originating from 0 first. If the message 0 → 3 arrives even one cycle later, 0 → 3 is blocked

first, leading to 1 → 4 and 2 → 6 having shorter latencies than 0 → 3. While a message’s total latency can depend on the order the centralized arbiter grants link access to the messages, the dependence decreases as the message size increases. Note that the 512Bytes message size considered in Figure 4.2 is smaller than the message size the Cell processor’s programming tools recommend for high performance [98], so

the effect of message ordering by the arbiter is lower than that shown in Figure 4.2 for most messages.

4.3 Communication Model

Based on the Cell Processor’s characteristics [55], our communication model as- sumptions are:

1. Assuming no contention, each message transmits at 25.6 GB/s.

2. Each processor can inject up to 25.6 GB/s of data into the network and ingest

up to 25.6 GB/s of data from the network simultaneously.

3. The Cell’s EIB is modeled as a bidirectional ring, with each link capable of

transmitting 51.2 GB/s in each direction.

4. A message’s bandwidth is equal for each hop in the network.

5. When multiple messages communicate over the same link, they fairly share the

link’s bandwidth.

6. When multiple messages share an end-point, they fairly share the end-point’s

maximum input or output bandwidth.

7. A message always takes the shortest path on the EIB.

Note that our assumptions model an approximation of the Cell NoC. Notably, we are not considering the overheads involved in accessing the centralized data arbiter.

Also, we assume that messages completely share communication links, even though sharing is done on a round-robin, time-sliced basis with packets that are 128Bytes or less [55].

4.3.1 Calculating End-Point Contention

Using our assumptions, it is possible to calculate a message’s requested bandwidth

(BWr). A message’s requested bandwidth is the highest possible bandwidth achiev-

able for the particular message, assuming no contention within the EIB, but taking

into account each processor’s input/output limitations.

First, we define a set of BWm inequalities, which describe that the sum of all messages' bandwidths coming into or out of a node is limited by that node's maximum bandwidth. There is one BWm inequality for each port, and for processor m, its two

BWm inequalities are:

BWm ≥ Σ_{msgi ∈ out(m)} BWr(msgi)    (4.2)

BWm ≥ Σ_{msgj ∈ in(m)} BWr(msgj)    (4.3)

BWm is the maximum bandwidth an EIB element can inject or ingest from the

network, and is equal to 25.6 GB/s. Then, in(m) is the set of all messages whose

destinations are processor m, out(m) is the set of all messages whose sources are

processor m, and BWr(msgi) is the requested bandwidth for message i.

Algorithm 4 solves the BWm inequalities to accurately model how a node’s in-

put/output bandwidth is allocated to messages in the Cell’s NoC. A message’s re-

quested bandwidth depends on its endpoint with the highest contention. Algorithm 4

solves the BWm inequalities by starting with the port that has the most messages ac-

cessing it concurrently, equally dividing its bandwidth among all the messages access-

ing it. Algorithm 4 then iteratively solves the remaining inequalities in order of the number of messages, substituting previously found BWr(msgi) values in each iteration.

Algorithm 4 Algorithm for solving the set of requested bandwidth inequalities
 1: procedure Solve-Inequalities(Msgs)   ▷ Msgs is the set of concurrently communicating messages
 2:   for all msgi ∈ Msgs do
 3:     Initialize BWr(msgi) to NaN
 4:   end for
 5:   InEqs := set of inequalities   ▷ From Equations 4.2 and 4.3
 6:   Sort InEqs by number of messages
 7:   while there are un-solved inequalities in InEqs do
 8:     Select ineqi ∈ InEqs with the most messages
 9:     Remove ineqi from InEqs
10:     Find all messages in ineqi with BWr = NaN and put in list unsolved
11:     Find all messages in ineqi with BWr ≠ NaN and put in list solved
12:     BWrem := BWm − Σ_{mi ∈ solved} BWr(mi)
13:     for all messages msgj ∈ unsolved do
14:         ▷ Equally divide the remaining bandwidth
15:       BWr(msgj) := BWrem / length(unsolved)
16:     end for
17:   end while
18: end procedure
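The following is a minimal, runnable sketch of Algorithm 4 in Python, assuming each message is represented as a (src, dst) tuple; because the per-port message counts do not change while solving, sorting the inequalities once is equivalent to repeatedly selecting the one with the most messages.

BW_MAX = 25.6  # GB/s, per-port injection/ingestion limit

def solve_inequalities(msgs):
    """Requested bandwidth BW_r for each message, given as (src, dst) tuples."""
    bw_r = [None] * len(msgs)              # None plays the role of NaN
    ports = {}                             # one inequality per port
    for k, (src, dst) in enumerate(msgs):
        ports.setdefault((src, 'out'), []).append(k)
        ports.setdefault((dst, 'in'), []).append(k)
    # message counts never change, so one sort equals repeated max-selection
    for port in sorted(ports, key=lambda p: -len(ports[p])):
        unsolved = [k for k in ports[port] if bw_r[k] is None]
        if not unsolved:
            continue
        solved_sum = sum(bw_r[k] for k in ports[port] if bw_r[k] is not None)
        for k in unsolved:                 # equally divide the remainder
            bw_r[k] = (BW_MAX - solved_sum) / len(unsolved)
    return bw_r

# For example, two DMAs leaving the same node fairly split its output port:
# solve_inequalities([(0, 2), (0, 5)]) returns [12.8, 12.8].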

4.3.2 Calculating NoC Contention

The H-CMP’s fine-grained and tightly integrated nature enable a scheduler to exert significant control of where and when tasks execute and communicate over the

NoC. In our intended model however, a Run-time Scheduler (RtS) can modify where tasks execute within the H-CMP, so the exact communication pattern cannot be determined at compile-time. Also, even slight differences between the predicted and actual task execution time and communication times can drastically change transient

NoC contention. Additionally, programming tools currently available for the Cell processor do not allow for tasks to be deterministically mapped to a particular processing core [98].

Figure 4.3 illustrates how concurrent communications in the Cell Processor affect a message's realized bandwidth. We used a Sony Playstation3 and a custom communication benchmarking application to test the realized vs. requested bandwidth of a single message. For the test, the concurrent, independent communications did not share a source or destination with the test message. All messages' sources and destinations were otherwise uniformly distributed in the architecture. Note that with the limitations of the Playstation3, the test only utilizes 6 SPEs, as the final two SPEs are disabled for user access.

Figure 4.3: Illustrating concurrent, independent communications reducing the realized bandwidth of other messages in the system.

Figure 4.3 illustrates how the realized bandwidth of a test message depends on

the number of concurrently communicating messages. In Figure 4.3, the error bars represent one standard deviation from the mean, showing that the realized bandwidth is a stochastic process that depends on the requested bandwidth and the other mes-

sages in the network. Figure 4.3 illustrates the possible value in treating a message’s

delay as a random variable.

Calculating the pdf describing the maximally loaded link

In the previous section, we calculated the effect end-point contention has on a

message’s requested bandwidth. In this section, we calculate the probability density

function (pdf) describing a message’s latency, based on the number of messages in

the network. Because the RtS and the programming tools for the Cell processor can

remap tasks to processors on the EIB, we model a processor’s position on the EIB

as a uniformly distributed random variable. This is an approximation of the actual

system, as PPE threads can only execute on the PPE, and the SPE threads cannot be

mapped to the PPE. Additionally, we assume that when two or more messages’ paths

share a link, all messages evenly share the link’s bandwidth. Our model does not

take ordering by the arbiter into account, or that link access is shared in a time-sliced

fashion. Finally, we assume that all processors can send only one message at a time,

but processors can receive multiple messages simultaneously. For the cases when

multiple messages share destinations, we use the end-point contention model and

calculation presented in the previous section to determine each message’s requested

bandwidth.

To calculate the pdf describing a message’s latency, we first define the test message as the message of interest, or the message whose latency will be calculated. For

67 our calculations, we assume that no other messages share an end-point with the

test message, since the effect of end-point contention was already determined in the

previous section. Also, for simplicity, we assume that all other messages contend for

the entire time the test message is communicating.

Let the random variable x represent the sum of the requested bandwidths by all messages using the maximally loaded link on the test message’s path. The maximally loaded link determines the bandwidth a message achieves. When the test message is the only message in the system, the pdf of x (f1(x)) is given as:

f1(x) = 1 · δ(x − 25.6GB/s), (4.4)

indicating that the probability is 1 that the maximally loaded link has a load of 25.6

GB/s. The pdf of x for the case with two messages (f2(x)) depends on the probability

that the second message overlaps part of its path with the first message. An analytical

solution is fairly simple, with the pdf f2(x) found as:

f2(x) = (1 − p) · δ(x − 25.6GB/s) + p · δ(x − 51.2GB/s) (4.5) where p is the probability two messages overlap on a bidirectional ring, given that the messages do not share end-points. For the Cell processor in the Playstation3, p ≈ 13%.
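The overlap probability p can be estimated with a small Monte Carlo sketch like the one below; the 8-position ring (the PPE, six SPEs, and main memory), the uniform end-point distribution, and the clockwise tie-breaking for equal-length arcs are assumptions made for illustration, so the estimate only approximates the enumeration that produced p ≈ 13%.

import random

N = 8  # ring positions: the PPE, six SPEs, and main memory (an assumption)

def path_edges(src, dst):
    """Directed links used by the shortest arc from src to dst on the ring."""
    cw = (dst - src) % N
    step = 1 if cw <= N - cw else -1   # ties broken clockwise (an assumption)
    edges, node = set(), src
    while node != dst:
        nxt = (node + step) % N
        edges.add((node, nxt))         # direction matters: opposite-direction
        node = nxt                     # traffic rides different rings
    return edges

def estimate_overlap_probability(trials=100_000):
    hits = 0
    for _ in range(trials):
        a, b, c, d = random.sample(range(N), 4)   # four distinct end-points
        if path_edges(a, b) & path_edges(c, d):
            hits += 1
    return hits / trials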

Calculating f(x) with three messages

The pdf of x for the case with three messages (f3(x)) depends on how three messages can overlap on the NoC. For three messages, Figure 4.4 illustrates the five cases of how two messages can overlap with a test message on the Cell's EIB. First is the case where neither additional message affects the test message (for the maximum

Figure 4.4: Illustrating how two messages can overlap with a test message. a) Neither message affects the test message. b) One message affects the test message. c) Both messages overlap, independently. d) One message overlaps, but both messages share an end-point. e) Both messages overlap with the test message over one link.

69 load of 25.6 GB/s), with an example shown in Figure 4.4.a. Figure 4.4.b illustrates the second case, where one message overlaps the path with the test message and the second message does not, making the maximally loaded link have the sum of the requested bandwidth equal 51.2 GB/s. The third case is illustrated in Figure 4.4.c, where both messages’ paths overlap the test message’s path, but since the additional messages’ paths do not overlap, it has the same effect as the case illustrated in Figure

4.4.b. The fourth case is where one message overlaps with the test message, but that message's requested bandwidth is reduced because it shares a destination with the other message. As illustrated in Figure 4.4.d, the two messages "chain" together to affect the test message. Here, the two additional messages share a destination, so their requested bandwidth is limited by end-point contention; therefore, the maximally loaded link is loaded with 25.6 GB/s + 25.6/2 GB/s = 38.4 GB/s of requested bandwidth. The last case is where the two additional messages overlap with the test message, and all messages overlap on one or more links, making the maximally loaded link's sum of its requested bandwidth 76.8 GB/s. Illustrated in Figure 4.4.e, this case is the one that has the largest effect on the test message's latency.

Calculating f(x) with more than three messages

Calculating fm(x) for larger numbers of messages is similar to the case with 3 total messages, but the addition of each message increases the number of cases that need to be calculated. We wrote a program to enumerate every possible scenario and determine the probability of each case. This was feasible only because the targeted system was the Cell processor in the Playstation3, which contains only 7 total processors

(one PPE and 6 SPEs). For larger numbers of processors, a brute-force determina-

tion would quickly become intractable. The results for several examples with more

than three total messages are shown in Figure 4.5.

Calculating message latency using f(x)

Calculating the pdf describing a message’s latency using f(x) is then straightfor-

ward. We model each link on the NoC as able to transmit at 51.2 GB/s, and multiple

messages can fairly share a link's bandwidth. Then, we define y as a random variable

describing the test message’s latency. With x describing the sum of the requested

bandwidths of the maximally loaded link on the test message’s path,

y = s / ( min(1, 51.2 GB/s / x) · BWr ) + l    (4.6)

where l is the network latency overhead, and s is the data size of the message. The mapping from the maximally loaded link on the test message's path to the latency is a simple one, overlooking several factors. First, the number of links shared on the NoC between two messages can affect the total latency, with longer overlapping paths introducing more overhead on the EIB's centralized arbiter [55]. Additionally, we assume that all the messages contend for the entire time the test message is executing, even though some messages may finish earlier or start later. Finally, all messages executing concurrently contend for access to the EIB's centralized arbiter.

We account for the final factor in our definition of l. To account for contention at the central arbiter, l is defined as:

l = (s / 128 Bytes) · (1 / 1.6 GHz) · ( x / (2 · (8 + h)) + h )    (4.7)

This describes how access to the centralized arbiter can affect a message’s latency

on average. A message needs to query the EIB’s centralized arbiter to transmit each

71 packet. It takes an average of 8+h cycles for the EIB to transfer a 128 byte packet [55],

where h is the average number of hops in the network a packet takes. Then, all the

messages are attempting to access the centralized arbiter approximately every 8 + h

cycles. Therefore, the number of messages attempting to access the EIB's centralized arbiter in a cycle is x/(8 + h) on average. For the Cell processor in the Playstation3, h ≈ 2.28, as there are eight possible sources or destinations for a message (one PPE, six SPEs, and main memory). Finally, the average number of cycles a message waits for the arbiter is x/(2 · (8 + h)), as access to the arbiter is granted in a round-robin fashion, and the EIB is clocked at 1.6 GHz on the Cell processor [55].
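Putting Equations 4.6 and 4.7 together, a minimal sketch is shown below. Treating x in the arbiter term as an effective message count (x divided by 25.6 GB/s) is our reading of the derivation above, not something the text states explicitly, and the constants follow the Playstation3 values used in this chapter.

LINK_BW = 51.2      # GB/s per ring direction
EIB_CLOCK = 1.6e9   # Hz
H_AVG = 2.28        # average hop count on the Playstation3's EIB

def arbiter_overhead(s, x, h=H_AVG):
    """Equation 4.7: average arbiter delay (seconds) for an s-byte message."""
    n_eff = x / 25.6                            # effective message count (our reading)
    cycles_per_packet = n_eff / (2 * (8 + h)) + h
    return (s / 128.0) * cycles_per_packet / EIB_CLOCK

def message_latency(s, x, bw_r):
    """Equation 4.6: expected latency (seconds), given link load x in GB/s."""
    realized_bw = min(1.0, LINK_BW / x) * bw_r  # GB/s the message achieves
    return s / (realized_bw * 1e9) + arbiter_overhead(s, x)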

4.3.3 Experimental Verification: NoC Contention

Figure 4.5 shows that our model predicts a message’s likely latency using only

the number of other concurrent messages in the network as an input. As there are

only 8 possible sources and destinations for a message (including the PPE, all SPEs,

and main memory), we do not test message latency for more than 8 concurrent

messages. Figure 4.5 compares the predicted and measured latencies for messages

that are 16KB in length; results for other message sizes were similar. If the

predicted probability would result in a relative frequency of less than one result per

2500 trials, the data point is not shown in Figure 4.5.

In Figure 4.5.a, the predicted latency is based on only a few possibilities. In the

actual system however, the delay is spread over a larger range. This is because our

mapping from the pdf describing how messages share links to the delay is a simple one,

not taking several factors into account. For instance, the ordering the arbiter places

on access to the EIB can affect the message’s latency, and our model assumes the

Figure 4.5: Comparing the predicted pdf and experimental relative frequency of a test message's latency for 2, 3, and 5 concurrent messages.

other messages communicate for the entire time the test message is communicating.

Therefore, the predicted data points are an aggregation of a number of points over a range of actual latencies. Even with the simple mapping however, the model does predict expected latency well. A similar trend is seen in Figure 4.5.b. In Figure 4.5.c, the experimental data begins to be spread more uniformly over a larger range of latencies as the secondary effects we are not accounting for become more important.

Although the simple stochastic model we use does not accurately predict the test message’s actual probability density function, the predicted average latency is similar to the actual average latency.

4.4 Software System Overview

Figure 4.6 illustrates the operation and interaction of the proposed compile and runtime scheduling systems. As shown in Figure 4.6, the scheduling system is broken into two parts: the Compile-time Scheduler (CtS) and the Run-time Scheduler (RtS).

We decided to implement a hybrid scheduling system because scheduling at compile-time enables the use of more complex scheduling algorithms, but not all relevant information is available at compile-time, so the proposed system augments the CtS with an RtS that modifies the schedule at runtime.

In Figure 4.6.a, we assume that the application is represented as a directed acyclic task graph (DAG). While several automatic task graph generation techniques have been proposed [4, 30], we assume that the application’s task graph is created by an

“expert” developer. Then in Figure 4.6.b the RtS modifies the compile-time schedule to actual execution conditions at runtime. In addition to adapting the schedule at runtime to actual execution conditions, the RtS could also be used to merge the

Figure 4.6: System overview. Applications are represented as a task graph. a) The Compile-time Scheduler turns the application's task graph into a task schedule; b) at runtime, the Run-time Scheduler combines that schedule with run-time information from the Cell processor to produce a modified schedule.

schedules of several different applications, or it could be used to schedule the portions of an application whose execution cannot be represented as a DAG.

4.5 Scheduling on the Cell Processor

4.5.1 Compile Time Scheduling

In this section we define our CtS scheduling heuristic: Contention Aware Heterogeneous Earliest Finish Time (CA-HEFT). CA-HEFT is based on the HEFT scheduling

heuristic, a list scheduler described in [107, 108]. While this chapter uses HEFT as the "base scheduler", the CA- scheduling extension can be applied to any list scheduler with minimal changes. The CA- scheduling extension updates task start and end times based on the communication model proposed earlier, informing the base scheduler of how network contention affects communication time and task start and end times.

Algorithm 5 CA-HEFT Algorithm
 1: procedure CA-HEFT(G = (V, E, w, c))   ▷ G is a task graph
 2:   Compute rank for all tasks t ∈ V   ▷ Using Equation 2.2
 3:   Sort the tasks in decreasing order by rank and put in list
 4:   while there are unscheduled tasks in list do
 5:     Select the first task in the list, ni, and remove from list
 6:     Insert the task ni on the processor pj that minimizes the EFT value of ni   ▷ Equation 2.6
 7:     for all tasks nm where C(ni, nm) = 1 do   ▷ Equation 4.8
 8:       Recompute start and finish times for nm according to the network model
 9:       Propagate any finish time changes to all descendants of nm
10:     end for
11:   end while
12: end procedure

Like HEFT, CA-HEFT is a static scheduler, analyzing an application at compile time and generating a runtime schedule. Tasks are scheduled in order by rank, as defined in Equation 2.2, and the scheduling cost function is EFT, as defined in Equation 2.6. As it uses HEFT as the base scheduler, CA-HEFT also uses an insertion-based policy that considers inserting tasks into idle time slots between two already-scheduled tasks on a processor, as originally described in [108]. When calculating tr(ni, pj) (the time all data generated by ni's immediate predecessors would be available to processor pj), the CA- scheduling extension uses the Cell processor network model detailed in the previous sections to calculate expected communication time.

To allow the CA- extension to inform the base scheduler of contention on the network, we define C(ni, nj), which is used when recalculating task finish times:

$$C(n_i, n_j) = \begin{cases} 0 & \text{if } n_i = n_j \text{, or if } n_i \text{ does not communicate concurrently with } n_j \\ 1 & \text{otherwise} \end{cases} \qquad (4.8)$$

Then, when a task, ni, is scheduled, previously scheduled tasks for which C(ni, nj) is equal to 1 have their finish times recalculated according to the Cell processor communication model. For this model, we schedule assuming that the communication time is equal to the expected value of the message's latency, using the pdf calculated in the above section. Before running CA-HEFT, we calculate the expected value of the maximally loaded link for all relevant numbers of messages in the network, so that calculating the expected value of a message's latency is a constant-time operation.
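As a hedged illustration of that precomputation, the sketch below builds a lookup table so each latency query is a constant-time operation; expected_latency_per_byte is a stand-in for the stochastic model of the earlier sections, and its constants are invented.

```python
def expected_latency_per_byte(nu: int) -> float:
    """Placeholder for the stochastic NoC model: expected per-byte latency
    when nu other messages are using the network (constants invented)."""
    return 1.0 + 0.35 * nu

MAX_CONCURRENT = 12   # assumed bound on simultaneous messages

# Precompute once, before CA-HEFT runs, so each lookup is O(1).
LATENCY_TABLE = [expected_latency_per_byte(nu) for nu in range(MAX_CONCURRENT + 1)]

def expected_comm_time(volume: float, nu: int) -> float:
    """Expected communication time for a message of `volume` bytes."""
    return volume * LATENCY_TABLE[min(nu, MAX_CONCURRENT)]
```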

The CA-HEFT scheduling algorithm is detailed in Algorithm 5. The CA- extension's worst-case time complexity is O(n^2), and the CA- extension is executed after the base scheduler schedules each task. Therefore, the runtime of CA-HEFT is O(n^2 p + n^3) for scheduling a task graph with n tasks onto p processors. The CA- extension to HEFT [108] appears in the added lines 7–10 of Algorithm 5, which update task start and finish times according to the network model.
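The sketch below is one hedged reading of lines 7–10: after a task is placed, every previously scheduled task whose message overlaps the new task's message in time (the C(ni, nj) = 1 case of Equation 4.8) has its times re-derived. The interval representation and the recompute/propagate callbacks are assumptions, not the actual implementation.

```python
def concurrent(interval_a, interval_b) -> bool:
    """C(ni, nj) of Equation 4.8 as an interval test: 1 (True) iff the two
    messages are in flight at the same time."""
    (s1, e1), (s2, e2) = interval_a, interval_b
    return s1 < e2 and s2 < e1

def ca_update(scheduled, new_task, comm_interval, recompute, propagate):
    """Lines 7-10 of Algorithm 5: revisit tasks that now contend with new_task.
    `recompute` re-derives a task's start/finish from the network model;
    `propagate` pushes finish-time changes to the task's descendants."""
    for other in scheduled:
        if other is new_task:
            continue
        if concurrent(comm_interval[new_task], comm_interval[other]):
            recompute(other)
            propagate(other)
```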

4.5.2 Run Time Scheduling

The goal of the Run-time Scheduler (RtS) is to adapt the schedule generated by the CtS to actual execution conditions. As Figure 4.6 shows, the RtS uses the schedule generated by the CtS as an input to perform scheduling at runtime. This approach has the advantage that the CtS can perform relatively expensive task-graph analysis off-line to generate a high-quality schedule. Then, the RtS updates the schedule based on actual system conditions. Additionally, this hybrid system can enable the flexibility to run multiple applications simultaneously on the same processing cores by performing transforms to merge multiple schedules, and the ability to schedule applications that cannot be represented a priori as a DAG. However, this chapter only explores the first point: using actual system conditions to update the schedule at runtime.

Figure 4.7: Operation of the CADS re-mapper.

We propose the Contention Aware Dynamic Scheduler (CADS) as the RtS in our scheduling system. This chapter presents CADS as a companion scheduler for CA-HEFT, but CADS can be used with any static scheduler that prioritizes tasks. Figure 4.7 illustrates the operation of the CADS re-mapper, and how it interacts with the CtS.

While previously proposed hybrid schedulers use static blocks when remapping [13, 67, 68], CADS introduces the concept of dynamic blocks when scheduling at runtime. At each scheduling decision, CADS examines only the tasks in the active block, choosing one task in the block to map next. When using dynamic blocks, active block membership depends on what tasks have already been scheduled and the statically generated schedule (taken as an input from the CtS). In Figure 4.7.a, CADS has already scheduled task t1, so the active block at this point in scheduling consists of {t2, t3, t4}. In Figure 4.7.b, CADS decides to remap task t4 to execute on processor P2. Figure 4.7.c illustrates how the active block's membership is updated to include task t6, as it is the task that was scheduled to execute immediately after task t4.

Before describing CADS in more detail, we first define the cost function Ω. Ω uses the bottom-level rank (rankb) of a task as calculated by CA-HEFT (rankb is defined in Equation 2.2). CADS uses Ω to rate a task ni when scheduling, defined as:

$$\Omega(n_i, p, \nu_c) = \gamma \cdot \nu_c \cdot c_i + (-1) \cdot \frac{\mathrm{rank}_b(n_i)}{P(n_i, p)} + R(n_i) \qquad (4.9)$$

where p is the processor that task ni is being tested for and νc is the current number of tasks communicating in the system. Then, γ is the user-defined penalty constant, ci is the estimated total communication time for ni, and rankb(ni) is the rank used by the CtS when scheduling the task. The variables ci and rankb(ni) are calculated by the CtS ahead of time and are stored in the input DAG. R(ni) is used to determine if the task ni is ready to execute, and is defined as:

$$R(n_i) = \begin{cases} 0 & \text{if } n_i\text{'s predecessors have finished} \\ \infty & \text{otherwise} \end{cases} \qquad (4.10)$$

Finally, the CADS processor penalty P(ni, p) is defined as:

$$P(n_i, p) = \begin{cases} 1 & \text{if task } n_i \text{ was scheduled on } p \\ 1 + \Delta & \text{otherwise} \end{cases} \qquad (4.11)$$

with ∆ being another user-defined penalty constant. The CADS cost function Ω adjusts each task's rank depending on the current machine conditions, namely the number of concurrently communicating tasks and whether the processor under consideration is the one on which the task was originally scheduled.
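For illustration, the sketch below renders Equations 4.9–4.11 directly in Python; the TaskInfo fields are assumed stand-ins for the values the CtS stores in each task, and γ and ∆ are set to the 0.15 used in the experiments of Section 4.6.

```python
import math
from dataclasses import dataclass

GAMMA = 0.15   # gamma, as set in Section 4.6
DELTA = 0.15   # Delta, as set in Section 4.6

@dataclass
class TaskInfo:
    """Values the CtS is assumed to save per task (illustrative names)."""
    rank_b: float             # bottom-level rank from Equation 2.2
    comm_time: float          # estimated total communication time c_i
    scheduled_on: str         # processor chosen by CA-HEFT
    preds_finished: bool = True

def R(t: TaskInfo) -> float:
    """Equation 4.10: 0 once every predecessor has finished, else infinity."""
    return 0.0 if t.preds_finished else math.inf

def P(t: TaskInfo, proc: str) -> float:
    """Equation 4.11: no penalty on the originally scheduled processor."""
    return 1.0 if t.scheduled_on == proc else 1.0 + DELTA

def omega(t: TaskInfo, proc: str, nu_c: int) -> float:
    """Equation 4.9: CADS rating of mapping t to proc; lower is better."""
    return GAMMA * nu_c * t.comm_time - t.rank_b / P(t, proc) + R(t)
```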

Algorithm 6 CADS Algorithm
 1: procedure CADS(S)
 2:     ▷ S is a schedule generated by a CtS
 3:     Initialize active block, A
 4:     while there are unscheduled tasks do
 5:         Wait for an idle processor, p
 6:         Read the number of communicating tasks, νc
 7:         ωmin := ∞
 8:         while ωmin = ∞ do
 9:             for all tasks ni in active block A do
10:                 ωi := Ω(ni, p, νc)            ▷ Equation 4.9
11:                 if ωi < ωmin then
12:                     ωmin := ωi
13:                     nmin := ni
14:                 end if
15:             end for
16:         end while
17:         Schedule nmin to execute on p
18:         Remove nmin from active block A
19:         Add nextTask(p, nmin) to active block A
20:     end while
21: end procedure

Algorithm 6 is a pseudo-code representation of CADS. Here, we assume that when CADS begins running, no tasks from the task graph have begun execution, and that CA-HEFT saved many of the values it calculated within each task's structure. We also assume that there is some other mechanism in place to alert a task when all of its input data is ready, and that counts the number of concurrently communicating tasks. In Algorithm 6, line 3, the active block is initialized to hold the first task each processor is scheduled to execute. CADS runs until every task has been scheduled to execute on a processor. In Algorithm 6, line 5, CADS waits for a processor to become idle before the next scheduling decision is made. Since Ω (Equation 4.9) returns ∞ for tasks that have a predecessor that has not finished execution, once a processor becomes idle, CADS waits at the while loop at line 8 until there is a ready task. Finally, the function nextTask(p, ni) returns the task that was scheduled after ni on processor p.

Assuming there is at least one task with all of its inputs available, the runtime of Algorithm 6 is O(p), where p is the number of processors. This worst-case runtime arises because calculating Ω is a constant-time operation, adding the next task to the active block is also a constant-time operation, the loop at line 9 iterates over all tasks in the active block, and the number of tasks in the active block is always ≤ p.

4.6 Scheduling Results

We ran our tests using a Sony Playstation3 as a Cell processor evaluation platform. Again, the Playstation3 can only utilize six SPEs, as two SPEs are disabled for user access in the Playstation3. Additionally, while SPE-to-SPE communication is possible on the Cell processor, we utilized the Accelerated Library Framework to write the task management and data-movement code, so all communication was performed through main memory [98]. Finally, during execution, the SPEs executed all tasks while the CADS re-mapper executed on the PPE. All tests were actual executions on the Cell processor. For the tests, we set the user-defined constants γ (from Equation 4.9) and ∆ (from Equation 4.11) to 0.15. In our experience this value yielded good results across all test applications.

The test applications were Gaussian elimination, Laplace transform, LU decomposition, and random task graphs. To adjust the communication-to-computation ratio (CCR), we scaled the task run times to appropriate values for the communication volume before execution. This was done through the use of a "dummy" loop in the main body of the task execution to take up more or less time as needed for the particular test. Our tests used random data, so the output at the end of our tests was not meaningful; however, real communication patterns and data sizes were used in all tests. All execution times were averaged over 20 trials and normalized against the Reference scheduler. The Reference scheduler is the default scheduler used by the Accelerated Library Framework; developed by IBM and used in its system software, it is expected to be a high quality scheduler [98]. For our results, execution times using only the CA-HEFT CtS are marked as CA-HEFT, and execution times when using both our proposed CtS and RtS are marked as CA-HEFT+CADS.

Figures 4.8 and 4.9 compare the execution times of randomly generated DAGs executing on the Cell platform when scheduled using the three different schedulers. Figure 4.8 plots the normalized execution time versus the communication-to-computation ratio (CCR) for DAGs with 500 tasks. One can see that as the CCR increases, both CA-HEFT and CA-HEFT+CADS generate higher quality schedules compared to the Reference scheduler. It is interesting to note that the benefit from using CADS starts at a lower CCR than the benefit from using CA-HEFT only. This is because CADS is able to better schedule "around" contention on the network and react to actual system conditions. Figure 4.9 plots the normalized execution time versus the number of tasks in the randomly generated DAGs with a CCR of 1/2. Here, we see that the performance of CA-HEFT compared with the Reference scheduler stays fairly constant as we increase the number of tasks in the graph. However, the benefit of using the CADS re-mapper increases slightly as we increase the number of tasks in the graph, resulting in over a 20% reduction in execution time for large task graphs.

Figure 4.8: Normalized schedule length for random DAGs varying the CCR between 0.01 and 10.

Figure 4.9: Normalized schedule length for random DAGs varying the number of tasks between 200 and 800.

Figures 4.10 and 4.11 compare the execution times of Gaussian elimination DAGs using the proposed schedulers to the Reference scheduler. In Figure 4.10, one can see that CA-HEFT generates schedules that result in a little less than a 20% reduction in execution time, regardless of the DAG's CCR. However, the addition of CADS decreases the execution time as we increase the CCR, to about a 60% reduction for a CCR of 10. In Figure 4.10, we can also see that CA-HEFT likely does not successfully schedule around all network contention, because the execution time does not improve as we increase the CCR. However, CADS does successfully handle higher rates of communication, as the execution time benefit grows as the CCR increases. Also, it is interesting to note that for low CCRs, CADS's additional overhead can adversely affect execution time, although the penalty seen in Figure 4.10 is only about 6% over the execution time of using CA-HEFT alone. Figure 4.11 plots the normalized execution time versus the matrix size of the input to the Gaussian elimination DAG, where the DAG has a CCR of 1/2. For small matrix sizes, performance with CA-HEFT is worse than the Reference scheduler, but CA-HEFT's performance improves to a bit less than a 20% reduction in execution time as we increase the matrix size. CADS reduces the execution time even further, with better performance as the matrix size increases.

Figure 4.10: Normalized schedule length for Gaussian elimination DAGs varying the CCR between 0.01 and 10.

Figure 4.11: Normalized schedule length for Gaussian elimination DAGs varying the matrix size between 5 and 45.

Figures 4.12 and 4.13 show almost identical trends to Figures 4.10 and 4.11 across different CCRs and matrix sizes.

Figure 4.12: Normalized schedule length for LU decomposition DAGs varying the CCR between 0.01 and 10.

Figure 4.13: Normalized schedule length for LU decomposition DAGs varying the matrix size between 5 and 45.

Figures 4.14 and 4.15, however, show different trends. In Figure 4.14, one can see that neither CA-HEFT nor CA-HEFT+CADS conveys any benefit in execution time for smaller CCRs. However, as we increase the CCR, one can see that both CA-HEFT and CADS decrease the execution time significantly, with CA-HEFT reducing the execution time by up to almost 40% and CA-HEFT+CADS by up to about 60%. The increase in performance as the CCR increases arises because both CA-HEFT and CA-HEFT+CADS are able to more efficiently utilize the Cell processor NoC by avoiding some network accesses when the network is more heavily loaded. Figure 4.15 shows that, generally, the performance benefit increases for the largest task graph sizes.

Figure 4.14: Normalized schedule length for Laplace transform DAGs varying the CCR between 0.01 and 10.

Figure 4.15: Normalized schedule length for Laplace transform DAGs varying the matrix size between 5 and 45.

CHAPTER 5

FAULT TOLERANCE WITH RECONFIGURABLE HARDWARE

5.1 Introduction

As the number of processing cores integrated into a system grows, changes in processor availability become more likely. Processor availability could change for a number of reasons, including an increase in transient errors in a core, the result of an operating system (OS) decision, a pending thermal or electrical "emergency" [14, 29], or the sharing of virtualized hardware. The proposed approach to fault tolerance can be applied to any of these cases.

Hardware reliability techniques show significant promise at tolerating low-level faults [42, 62], even hiding the faults from system- and user-level software [93]. However, completely hardware-based solutions can have weaknesses in covering all possible faults; also, as more devices are integrated into a single system, the ways faults arise will likely increase. Faults are expected to increase in future technology generations from sources such as increased cross-talk, increased PVT variations, and decreased noise margins [14, 29].

Allowing the performance of the system to gracefully and predictably degrade under changes in processor availability opens up additional options when designing the rest of the system. We expect that the implementation of a fault-tolerant system resembling the proposed solution would be useful for low-level system software. For a simple example, the ability to suspend execution on a particular core could be used to reduce both the dynamic and static power consumption of a CMP by allowing one or more processing cores to be turned off. Using this mechanism, the OS or layer that operates "under" the proposed fault-tolerant system would be able to manage thermal properties by removing active cores when the chip becomes too hot and adding cores when the chip cools down. Similarly, a virtualization layer could share, among a group of applications, processing resources (such as RH) that cannot be shared through time-slicing as traditional microprocessors can.

Unlike previous proposals to deal with changes in processor availability [8, 23, 25, 35], we propose using reconfigurable hardware (RH) to allow the architecture to adapt to the new conditions. In this chapter, we extend the Mutually Exclusive Processor Groups reconfiguration model originally introduced in Chapter 3 to include changes in processor availability. Next, we propose a fault-tolerant extension of the hybrid scheduling system proposed in Chapter 4. We use the HEFT-MEG heuristic for the Compile-time Scheduler (CtS), scheduling reconfiguration tasks along with application tasks. For the Run-time Scheduler (RtS) we propose a novel two-part scheduler. The first part is the Fault-Tolerant Re-Mapper (FTRM). The FTRM resembles the CADS re-mapper in Chapter 4, but extends its functionality to accommodate changes in processor availability. The second part of the RtS is the Reconfiguration and Recovery Scheduler (RRS). The RRS modifies the future reconfiguration schedule when changes in processor availability occur. In this way, the RRS addresses the opportunity, when using RH in a fault-tolerant system, to adapt the hardware to changes in processing capability. Finally, the last section of the chapter shows how using the FTRM and RRS schedulers in tandem allows execution to continue when changes in processor availability occur, and how application performance degrades gracefully as the amount of available hardware decreases.

5.2 Proposed Failure Model

We assume a fault in any portion of a processor results in that processor being unable to execute any task. We model transient faults, with faults lasting from several nanoseconds to several seconds. A processor that is unavailable for execution due to a fault may become available at a later time. A processor that is unable to execute tasks due to a fault is designated as "unavailable"; all other processors are designated as "available."

Our failure model assumes the OS or other underlying software layer is responsible for failure detection, and the underlying software notifies our system when processor availability changes. This approach is flexible, being applicable to a wide range of fault scenarios, including "faults" that are the result of voluntary changes in hardware availability from the underlying layer. We assume that faults can occur at any time, and that all intermediate results in a task are lost if a fault happens during its execution. Finally, we assume that the underlying software layer presents the processors that are currently unavailable as an "unavailable group."

5.3 Mutually Exclusive Processor Groups Revisited

This section presents an extension to the Mutually Exclusive Processor Groups model to include changes in processor availability. The premise behind the Mutually Exclusive Processor Groups model is that it is not possible for two different configurations using the same underlying hardware to execute tasks concurrently; logical processors that use the same underlying hardware are defined to be Mutually Exclusive Processors. All logical processors bound to a particular RH are grouped together into a SuperGroup, while logical processors part of different configurations of the same hardware compose a SubGroup. Group membership describes what processors can be used concurrently: logical processors in different SubGroups but in the same SuperGroup are mutually exclusive. Ensuing uses of the term processor will refer to logical processors.

When the availability of the underlying hardware changes, this can change the associations among the SubGroups. As illustrated in Figure 5.1.a, part of the FPGA becomes unavailable. This impacts the availability of all processors that used that portion of the underlying hardware. We extend the Mutually Exclusive Processor Groups model to describe this with Unavailable Groups. Here, the Unavailable Group contains all the groups that are currently unavailable due to processor availability changes. At any point in time, there is only a single unavailable group within the architecture, containing all the currently unavailable processors across all SuperGroups. Figure 5.1.a shows how the unavailable portion of the FPGA impacts every SubGroup, and Figure 5.1.b illustrates the creation of the Unavailable Group.

Using the Mutually Exclusive Processor Groups model in a runtime, fault-tolerant scheduler has one major advantage. Because all possible configurations are defined at compile time, the runtime system does not have to perform the (often very expensive) place-and-route procedure to change the configuration. Rather, the RtS can choose from a set of configurations to more quickly change the configuration at runtime.

Figure 5.1: Illustrating processor availability changes on an FPGA.
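To make the group relationships concrete, the sketch below encodes SuperGroups, SubGroups, and the single Unavailable Group as plain Python data; the structure and names are our illustration of the model, not an implementation from this work.

```python
from dataclasses import dataclass

@dataclass
class SubGroup:
    """One configuration of a SuperGroup's hardware and its logical processors."""
    name: str
    processors: set

@dataclass
class SuperGroup:
    """All logical processors bound to one piece of reconfigurable hardware."""
    name: str
    subgroups: list

def mutually_exclusive(p1: str, p2: str, supergroups: list) -> bool:
    """True iff p1 and p2 sit in different SubGroups of the same SuperGroup,
    i.e. they are alternative configurations of the same hardware."""
    for sg in supergroups:
        sub1 = next((s for s in sg.subgroups if p1 in s.processors), None)
        sub2 = next((s for s in sg.subgroups if p2 in s.processors), None)
        if sub1 is not None and sub2 is not None:
            return sub1 is not sub2
    return False

# The single, architecture-wide Unavailable Group.
unavailable_group: set = set()

def mark_unavailable(impacted: set) -> None:
    """Add every logical processor impacted by a fault to the Unavailable Group."""
    unavailable_group.update(impacted)

# Example: one FPGA with two configurations sharing the same fabric.
fpga1 = SuperGroup("FPGA1", [SubGroup("S0", {"f1", "f2"}),
                             SubGroup("S1", {"f3"})])
assert mutually_exclusive("f1", "f3", [fpga1])       # different configurations
assert not mutually_exclusive("f1", "f2", [fpga1])   # same configuration
```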

5.4 Run-Time Scheduler

Figure 5.2 illustrates the interaction of the compile-time scheduler and two runtime schedulers for fault tolerance on an H-CMP architecture. As Figure 5.2.a illustrates, the fault-tolerant system uses HEFT with Mutually Exclusive Groups (HEFT-MEG) as the Compile-time Scheduler (CtS). The CtS takes an application represented as a DAG as an input and generates a schedule mapping tasks to processors, including a reconfiguration schedule. In Figure 5.2.b, the OS notifies the proposed system that a portion of the H-CMP is no longer available. In Figure 5.2.b, the Run-time Scheduler (RtS) is composed of two portions: the Fault Tolerant Re-Mapper (FTRM) and the Reconfiguration and Recovery Scheduler (RRS). When a change in processor availability occurs, the FTRM remaps tasks based on runtime conditions, remapping tasks originally scheduled for currently unavailable processors and balancing the load among the remaining processing resources. The RRS examines the future application requirements and generates a new reconfiguration schedule based on the new processor availability. For both the FTRM and the RRS we assume that a change in processor availability does not result in an architecture that is unable to execute the application DAG.

We develop a centralized RtS, and we assume that the RtS executes on some processor in the architecture. Previous work shows that helper-thread schemes can be implemented with acceptable overhead [35, 65], so a helper-thread-based scheme could be used to run the RtS. The functionality of the RtS is divided into two schedulers because the functions of the two schedulers are different. The FTRM is a relatively light-weight remapping scheduler, while the RRS performs significantly more involved calculations to optimize the reconfiguration schedule. In a real system, the FTRM would allow a system to respond quickly to changes in processor availability, while the RRS takes longer to finish but allows the architecture to adapt to changes in processor availability. Besides this significant difference, the RtS closely resembles the RtS proposed in Chapter 4, and can use very similar mechanisms.

Figure 5.2: System overview.

5.4.1 Fault Tolerant Re-mapper

The FTRM is a hybrid re-mapper similar to the Contention Aware Dynamic Scheduler introduced in Chapter 4. FTRM takes a statically generated schedule, and adapts it to actual execution conditions at runtime. While previously proposed hybrid schedulers use static blocks when remapping [13, 67, 68], FTRM uses dynamic blocks when scheduling at runtime. When using dynamic blocks, the set of tasks considered when scheduling changes as each task is scheduled. We designate the block of tasks from which the FTRM chooses tasks to schedule as the active block. The active block holds the tasks in the next "level" of the schedule that was generated at compile time; after a task is scheduled, it is removed from the active block, and the next task in the schedule is added to the active block.

Figure 5.3 illustrates the operation of FTRM when processor availability is reduced. In Figure 5.3.a, processor P2 is no longer available, and task t1 has already been scheduled to execute. Shown in Figure 5.3.b, FTRM chooses a task from the active block to execute next based on the current system conditions. In this example, FTRM decides to execute task t3 on processor P3. Then, in Figure 5.3.c, FTRM updates the active block by removing task t3 and adding task t6 to the active block.

Figure 5.3: Operation of the FTRM re-mapper. a) The original schedule, annotated to indicate the active block after t1 is scheduled. b) FTRM decides to schedule task t3 to processor P3. c) The scheduling decision is not reflected in the original schedule, but the active block is updated.

The FTRM uses a schedule generated at compile time as the basis for generating the run-time schedule. While we use HEFT-MEG as the compile-time scheduler, any scheduler that prioritizes tasks can be used as the compile-time scheduler. FTRM uses the bottom-level rank (rankb) as calculated by HEFT-MEG to prioritize tasks, defined in Equation 2.2. Then, using rankb, we define the cost function FTRM uses to evaluate the cost of mapping a task ni onto a particular processor pj as:

$$\Psi(n_i, p_j) = (-1) \cdot \frac{\mathrm{rank}_b(n_i)}{P_{ft}(n_i, p_j)} + w_i(p_j) + R(n_i) \qquad (5.1)$$

where wi(pj) is the expected execution time of task ni on processor pj, and R(ni) is used to determine if task ni is ready to execute, defined as:

$$R(n_i) = \begin{cases} 0 & \text{if } n_i\text{'s predecessors have finished} \\ \infty & \text{otherwise} \end{cases} \qquad (5.2)$$

Pft(n, p) is the fault-tolerant processor penalty, defined as:

$$P_{ft}(n, p) = \begin{cases} 1 & \text{if task } n \text{ was scheduled on } p \\ 1 & \text{if } n \text{ was scheduled on a currently unavailable processor and } p \text{ is in the same SubGroup as } s(n) \\ 1 + \Delta & \text{otherwise} \end{cases} \qquad (5.3)$$

where ∆ is a user-defined penalty constant and s(n) returns the processor on which n was originally scheduled to execute. The goal when designing Equation 5.1 was to use the schedule generated at compile time to inform the decisions made at runtime, while enabling the scheduler to react to runtime information. In Equation 5.1, a task's rankb is multiplied by (−1) to reduce the relative cost of tasks that had a high priority in the CtS. Then, that value is divided by Pft(n, p) to penalize mapping tasks to a processor different from the CtS's mapping. Equation 5.3 also favors mapping tasks to the originally scheduled processor's SubGroup when the originally scheduled processor is unavailable. Processors in the same SubGroup are part of the same RH resource, and in most configurations are able to communicate at a lower cost than between different SubGroups. HEFT-MEG is likely to schedule tasks that communicate heavily to be located physically close to reduce communication costs, and Equation 5.3 attempts to preserve an approximation of that mapping even when the processor availability changes. Finally, in Equation 5.1, we add the task ni's expected execution time on processor pj (wi(pj)) to the negative, scaled rankb. This favors mapping tasks to the processors on which they will execute faster.
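As a hedged illustration, the sketch below renders Equations 5.1–5.3 in Python, reusing the style of the Chapter 4 sketch; the task fields, the same_subgroup helper, and the data layout are assumptions, with ∆ set to the 1.0 reported in Section 5.5.

```python
import math

DELTA_FT = 1.0   # Delta; 1.0 yielded the best performance in our experiments

def p_ft(task, proc: str, unavailable: set, same_subgroup) -> float:
    """Equation 5.3: fault-tolerant processor penalty. `task.scheduled_on`
    plays the role of s(n); `same_subgroup` is an assumed helper."""
    orig = task.scheduled_on
    if orig == proc:
        return 1.0
    if orig in unavailable and same_subgroup(proc, orig):
        return 1.0                     # keep tasks near their original SubGroup
    return 1.0 + DELTA_FT

def r_ready(task) -> float:
    """Equation 5.2: 0 once all predecessors have finished, else infinity."""
    return 0.0 if task.preds_finished else math.inf

def psi(task, proc: str, unavailable: set, same_subgroup) -> float:
    """Equation 5.1: FTRM cost of mapping `task` onto `proc`; lower is better.
    `task.w[proc]` is the expected execution time w_i(p_j)."""
    return (-task.rank_b / p_ft(task, proc, unavailable, same_subgroup)
            + task.w[proc] + r_ready(task))
```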

Algorithm 7 FTRM Algorithm
 1: procedure FTRM(S)
 2:     ▷ S is a schedule generated by HEFT-MEG
 3:     Initialize active block, A
 4:     while there are unscheduled tasks do
 5:         Wait for an idle processor, pj
 6:         ψmin := ∞
 7:         while ψmin = ∞ do
 8:             for all tasks ni in active block A do
 9:                 ψi := Ψ(ni, pj)            ▷ Equation 5.1
10:                 if ψi < ψmin then
11:                     ψmin := ψi
12:                     nmin := ni
13:                 end if
14:             end for
15:         end while
16:         Schedule nmin to execute on pj
17:         Remove nmin from active block A
18:         Add nextTask(pj, nmin) to active block A
19:     end while
20: end procedure

Algorithm 7 is a pseudo-code representation of FTRM. In Algorithm 7 we assume that the scheduler is only used when the processor availability is less than the number of processors considered by the CtS. Upon startup, FTRM initializes the active block to hold the next task each processor was scheduled to execute before the processor availability change. This takes O(p) time. In the main loop from lines 7–19, FTRM waits for an idle processor on which to place the next task. Processors that are unavailable are considered busy, so they are never chosen for scheduling. In the statically generated schedule, reconfiguration tasks are scheduled on every processor and depend on all tasks that execute before them. This way, FTRM strictly follows the reconfiguration schedule, whether it was statically generated or generated by the RRS, described in the next section.

5.4.2 Reconfiguration and Recovery Scheduler

The second half of the fault-tolerant RtS is the Reconfiguration and Recovery Scheduler (RRS). RRS examines the changes in processor availability and determines a new configuration schedule, inserting new reconfiguration tasks into the task graph. Although it considers tasks when generating the new configuration schedule, RRS does not determine a new task schedule. Because of this, RRS can generate new configurations that would violate the correctness of the task mapping. Therefore, RRS relies on the FTRM scheduler to change the processor mapping at runtime and generate a correct and feasible schedule on processors that are part of the current configuration.

To reduce the size of the reconfiguration space RRS explores, RRS uses the reconfiguration schedule generated by the CtS (HEFT-MEG) as the starting point to generate a new configuration schedule. Figure 5.4 illustrates the process RRS uses to extract the configuration schedule from the total application's schedule. This extraction can be done at compile time, to reduce RRS's overhead when a change in processor availability occurs.

Figure 5.4.a shows the schedule as generated by HEFT-MEG, showing several reconfiguration tasks executing on both FPGAs. Figure 5.4.b illustrates the SubGroup that is made active for each reconfiguration, with the initial configuration shown at the top of the graph. Figure 5.4.c shows the reconfiguration schedule. The reconfiguration schedule is a chain of reconfigurations, ordered by the time they appear in the original schedule. Not illustrated in Figure 5.4 is the process of merging configurations. If two reconfigurations in the configuration schedule have no tasks scheduled to execute between them, they are merged.

Figure 5.4: Illustrating the extraction of the configuration schedule.
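A minimal sketch of that merge step follows; it assumes each reconfiguration is a dict mapping SuperGroups to their chosen SubGroups and that the schedule records how many tasks run before the next reconfiguration. The representation is ours, for illustration only.

```python
def merge_reconfigurations(schedule: list) -> list:
    """Each entry is (config, n_tasks), where `config` maps SuperGroup ->
    SubGroup and `n_tasks` counts tasks before the next reconfiguration.
    Adjacent reconfigurations with no tasks between them are folded
    together, the later per-SuperGroup choice winning."""
    merged = []
    for config, n_tasks in schedule:
        if merged and merged[-1][1] == 0:
            prev_config, _ = merged[-1]
            merged[-1] = ({**prev_config, **config}, n_tasks)
        else:
            merged.append((dict(config), n_tasks))
    return merged

# No tasks run between the second and third reconfigurations, so they merge.
sched = [({"FPGA1": "S1"}, 4), ({"FPGA2": "S2"}, 0), ({"FPGA1": "S3"}, 2)]
print(merge_reconfigurations(sched))
# -> [({'FPGA1': 'S1'}, 4), ({'FPGA2': 'S2', 'FPGA1': 'S3'}, 2)]
```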

When a change in processor availability occurs, RRS examines the configuration schedule from the point of the availability change onwards. Each reconfiguration task after the change is reconsidered, and the new processor availability and the tasks that were scheduled between reconfiguration tasks are used to generate the new reconfiguration task. Figure 5.4.c illustrates how tasks scheduled between the reconfiguration tasks are associated with different reconfiguration tasks. For instance, the four tasks that were scheduled between Reconfiguration 1 and Reconfiguration 2 are associated with Reconfiguration 1. In the example given in Figure 5.4, three reconfiguration tasks are reconsidered, as the change in processor availability occurs after the initialization reconfiguration but before the first reconfiguration task.

RRS uses an election-based algorithm to determine the new configuration. Each task associated with a reconfiguration task requests a processor to be added to the new configuration. Tasks request a processor by voting for one processor. To prioritize tasks, a task's vote value is its rankb, as computed by HEFT-MEG (defined in Equation 2.2). Then, the configuration is built from the processors that have accumulated the highest number of votes during the election stage of the algorithm. For a particular task nj, we define the function proc(nj) as returning the available processor that yields the lowest runtime for task nj, modified by the current number of votes for that processor. More formally, proc(nj) is defined as:

$$\mathrm{proc}(n_j) = \operatorname*{arg\,min}_{p_i \in P} \left\{ w_j(p_i) + \mathit{votes}[p_i] \right\} \qquad (5.4)$$

where P is the set of all logical processors that are available (that is, all processors that have not failed).

Algorithm 8 RRS Algorithm
 1: procedure RRS(Rs)                ▷ Rs is a reconfiguration schedule
 2:     r := head(Rs)
 3:     while r occurs before change in processor availability do
 4:         r := next reconfiguration task in Rs
 5:     end while
 6:     while r is a valid reconfiguration task do
 7:         Clear votes
 8:         Order tasks N associated with r by rankb, put in list
 9:         for all tasks nj in list do
10:             votes[proc(nj)] += rankb(nj)
11:         end for
12:         Build new configuration rn with the processors with the most votes
13:         Ensure rn is a complete configuration
14:             ▷ Ensure each SubGroup is complete
15:             ▷ For each unrepresented SuperGroup, choose the SubGroup in the previous reconfiguration task
16:         Insert rn into the new reconfiguration schedule
17:         r := next reconfiguration task in Rs
18:     end while
19: end procedure

Algorithm 8 is a pseudo-code representation of RRS. First, the while loop from lines 3 to 5 "fast-forwards" the schedule until after the time processor availability changes. The definition of proc(nj) in Equation 5.4, in conjunction with considering tasks in order by rankb, causes tasks with lower rank values to favor requesting processors that have not been voted on. This helps RRS to generate configurations that have a number of different processing cores, avoiding the case where every task votes for the same processor, which would convey little information about what configuration may yield higher performance.

In Algorithm 8, line 10, the vote value for the processor chosen by nj is incremented by its bottom-level rank, as calculated by HEFT-MEG. Then, the reconfiguration is built in lines 12 and 13 by iteratively adding the processors with the highest number of "votes" to the current configuration until the configuration is specified. To ensure that the configuration is valid, RRS inserts the SubGroup from the previous reconfiguration task for each unrepresented SuperGroup in the newly created configuration. In other words, if no tasks vote for any processor in a particular SuperGroup, RRS chooses the configuration for that SuperGroup so that the SuperGroup does not need to be reconfigured.

Assuming that the original reconfiguration schedule is generated by the CtS, including the ordering of all tasks associated with each reconfiguration by rankb, RRS has a worst-case runtime of O(n · p) with n tasks and p processors.
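The election stage can be sketched compactly; the dict-based data layout below is assumed for illustration, and rrs_election implements line 10 of Algorithm 8 together with the proc(nj) rule of Equation 5.4.

```python
from collections import defaultdict

def rrs_election(tasks: list, available: set, runtime: dict, rank_b: dict) -> dict:
    """One reconfiguration's election. `tasks` are ordered by decreasing
    rank_b; `runtime[t][p]` is w_j(p_i). Returns vote totals per processor,
    from which the new configuration is built."""
    votes = defaultdict(float)
    for t in tasks:
        # Equation 5.4: minimize runtime plus the votes already cast, so
        # low-rank tasks drift toward processors no one has claimed yet.
        chosen = min(available, key=lambda p: runtime[t][p] + votes[p])
        votes[chosen] += rank_b[t]          # Algorithm 8, line 10
    return dict(votes)

# Example: t1's heavy vote for f1 pushes t2 toward the slower cpu0.
runtime = {"t1": {"cpu0": 8.0, "f1": 1.0}, "t2": {"cpu0": 9.0, "f1": 1.5}}
rank_b = {"t1": 30.0, "t2": 12.0}
print(rrs_election(["t1", "t2"], {"cpu0", "f1"}, runtime, rank_b))
# -> {'f1': 30.0, 'cpu0': 12.0}
```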

5.5 Simulation Results

We simulated the effect of using the FTRM and RRS hybrid schedulers when a system with reconfigurable processors undergoes a change in processor availability. For the simulation, we generated a Gaussian elimination DAG for the application, and used HEFT-MEG to schedule the application onto two and four node systems. The Gaussian elimination DAG was generated for a matrix size of 35, and contained 662 tasks. We assumed that executing a task on an RH is ten times faster than the microprocessor version on average, when an RH version of the task is available. RH versions of tasks are available for 80% of all tasks. For the reconfigurable architecture, each node consists of a general-purpose microprocessor coupled with a reconfigurable processor. In each case, the reconfigurable processor could choose from nine processor types, with each logical processor occupying 30% of a reconfigurable processor's area on average. Therefore, there are about three logical processors per reconfigurable processor per configuration, on average. We found that a penalty constant of ∆ equal to 1.0, used in Equation 5.3, yielded the best performance in our experiments.

As illustrated in Figure 5.1, when a fault occurs in a portion of the RH, we assume other logical processors part of the same RH, but not using the faulty area, remain unaffected. We also ensured that at least one general-purpose processor was always available, as some tasks can only execute on that processor type. In evaluating the performance of the fault-tolerant hybrid scheduler, we compared the time it would have taken the original schedule to complete all tasks executed during a transient fault to the simulated execution time. For our tests, we assume a single fault occurs, it starts after 10% of the tasks have finished, and the fault continues through the end of execution. Finally, we assume that both portions of the RtS have no overhead.

Table 5.1 shows the performance results for various faults on a four node system. SlowDown is defined as

$$\mathrm{SlowDown} = \frac{t_{sim}}{T_{orig}}$$

where t_sim is the total time of the transient failure and T_orig is the time the original schedule would take to execute all tasks scheduled in simulation. A lower SlowDown indicates higher performance.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 2.775       2.856
2 µP                 3.561       3.846
1/2 of 1 RH          2.548       2.729
1/2 of 2 RHs         2.840       2.593

Table 5.1: Results for four node system, CCR = 1.0.

In Table 5.1, one can see that as the number of failed processors increases, the SlowDown also increases, indicating that the performance of the system degrades with the decrease in processing resources. Also, one can see that the RRS reconfiguration rescheduler actually degrades performance when a microprocessor fails, but increases performance when the failure is in the RH. This behavior is not surprising, as there is more opportunity to increase performance through reconfiguration when a failure occurs in the reconfigurable hardware. For tasks originally scheduled on the microprocessor, it is unlikely that changing the configuration will result in those tasks being remapped favorably to an RH processor. In these cases, using RRS to change the reconfiguration schedule penalizes the performance of applications running on RH processors.

Table 5.2 shows the results for simulating various faults on a two node system. The results for the two node system are similar to the results for the four node system, except the use of RRS is more beneficial when RH is part of the fault and more detrimental when the microprocessors are part of the fault.

Tables 5.3 and 5.4 show the results for simulating the execution of a Gaussian DAG with a CCR of 0.25 under various faults. Results are similar to when the CCR is 1.0, although the benefit of using RRS is decreased when multiple RHs fail.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 3.999       5.137
1/2 of 1 RH          2.915       3.045
1/2 of 2 RHs         4.057       3.568

Table 5.2: Results for two node system, CCR = 1.0.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 2.119       3.192
2 µP                 2.946       4.353
1/2 of 1 RH          2.227       2.537
1/2 of 2 RHs         2.416       2.343

Table 5.3: Results for four node system, CCR = 0.25.

                            SlowDown
Fail Type            FTRM        FTRM & RRS
1 µP                 3.519       5.255
1/2 of 1 RH          2.740       3.284
1/2 of 2 RHs         3.888       3.563

Table 5.4: Results for two node system, CCR = 0.25.

The results show that, when using the FTRM, application performance is higher when fewer processing resources fail, allowing performance to degrade as the number of active processors is reduced and increase as the number of active processors increases. However, performance is lower than expected. For example, when 1/2 of 2 RHs fail on a four node system, the failure reduces the amount of RH resources by 25% and does not affect the number of microprocessors available. Despite this, using FTRM takes almost 3x longer to execute the Gaussian DAG with a CCR of 1.0 than if there were no failure and the CtS schedule were used. We believe that the FTRM cost function may too often map tasks to processors on which they execute slower, reducing application performance. As a task's rankb is used when scheduling with the FTRM, and a task's rankb is likely to be significantly larger than its execution time, a task with a high rankb will likely be chosen to execute on a processor regardless of its execution time.

CHAPTER 6

CONCLUSIONS

6.1 Contributions

This dissertation presented solutions to three challenges to the efficient usage of current and future Heterogeneous Chip Multiprocessor (H-CMP) architectures, specifically the efficient use of Reconfigurable Hardware (RH) resources, the efficient use of Network on a Chip (NoC) resources, and how to accommodate expected increases in transient faults.

In Chapter 3, we developed the Mutually Exclusive Processor Groups reconfiguration model; the proposed model captures a wide range of Reconfigurable Hardware (RH) types, from reconfiguration using FPGAs to the more coarsely reconfigurable polymorphous computer architectures (PCAs). Based on this model, we proposed the Mutually Exclusive Processor Groups (-MEG) scheduling extension. The -MEG extension evaluates reconfiguration decisions during scheduling using a novel backtracking algorithm. To reduce the runtime of -MEG, we further propose a method to choose only "good" configurations to evaluate, reducing the configuration search space when scheduling. While -MEG can be used to extend any list scheduler, we extend the HEFT scheduler (proposed by Topcuoglu et al. [107, 108]) to create HEFT-MEG.

Testing the proposed scheduler on randomly generated, Gaussian elimination, LU decomposition, and Laplace transform task graphs, we find that HEFT-MEG using FindSmartConfs generates schedules that are about 20% shorter on average than the best single-configuration, HEFT-generated schedules. HEFT-MEG generates higher quality schedules than the single-configuration HEFT because the -MEG scheduling extension explores the reconfiguration space, adapting the architecture's configuration to transient application requirements.

We also find that HEFT-MEG significantly outperforms the previously proposed scheduler of Mei et al. [70] (Mei00) as we increase the number of tasks in the DAG and the number of processors in the architecture. HEFT-MEG outperforms Mei00 for two central reasons. First, the -MEG scheduling extension generates higher quality reconfiguration schedules. The -MEG extension iteratively refines the reconfiguration schedule, building off of higher performing previously found partial reconfiguration schedules. In contrast, the GA used by Mei00 determines what configuration executes each task, effectively determining the entire reconfiguration schedule at once. Secondly, the GA used by Mei00 only determines the mapping of tasks to processors, using a list scheduler to determine execution ordering and evaluate each individual's fitness. Because the list scheduler is used to evaluate each individual's fitness in a population, it is necessarily simple to reduce execution time, and it is outperformed by the single-configuration HEFT in many cases, despite Mei00's ability to adapt the architecture to transient application requirements. Even with a reduced-complexity list scheduler, we found that Mei00 takes significantly longer to generate a schedule than HEFT-MEG using FindSmartConfs or the single-configuration HEFT.

Finally, we demonstrate that HEFT-MEG can also generate high quality schedules for the TRIPS processor³. Using HEFT-MEG increases GPS Acquisition performance by about 20% versus the best single-configuration schedule. The performance advantage arises because HEFT-MEG can change the TRIPS configuration to match transient application requirements.

When writing programs for an H-CMP system, utilization of the Network on a Chip (NoC) can become a first-level design consideration. As the NoC is a shared resource, contention for the NoC can drastically change the throughput, latency, and efficiency of the NoC. Chapter 4 presented a network model for the Cell processor's NoC⁴ and a compile-time scheduler (CtS) and run-time scheduler (RtS) using this model. First, we developed a simple stochastic model to predict the expected message latency using only the number of other messages concurrently using the NoC. Then, we introduced the Contention Aware (CA-) list scheduling extension. The CA- extension informs a base scheduler how scheduling decisions affect NoC contention and influence previously scheduled tasks' start and finish times. Although any list scheduler can be extended using the proposed CA- extension, we demonstrate the CA- extension using the HEFT scheduler, proposed by Topcuoglu et al. [107, 108], as the base scheduler to create Contention Aware HEFT (CA-HEFT). Next, we introduced the Contention Aware Dynamic Scheduler (CADS) RtS. Despite being based on a very simple stochastic model that disregards several secondary effects, using CA-HEFT and CADS in concert as a hybrid scheduler results in significant improvements to application run times. Comparing the proposed hybrid scheduler to the default scheduler used by the Accelerated Library Framework [98], CA-HEFT+CADS can decrease actual execution time on the Cell processor by about 60% for task graphs with a high communication-to-computation ratio (CCR). CA-HEFT reduces actual execution time by scheduling tasks "around" contention on the NoC. CADS further reduces execution time by reacting to actual system conditions, further improving the task schedule.

³The TRIPS processor was developed at the University of Texas at Austin [19, 91, 44].
⁴The Cell processor was jointly developed by IBM, Sony, and Toshiba for the Sony Playstation3 gaming console [48, 45].

As the number of processing elements integrated onto a single chip increases, the likelihood that faults will occur increases. Additionally, future process technologies are expected to suffer from higher transient fault rates due to increasing voltage and temperature fluctuations (PVT variations) [14, 29]. Modeling intermittent faults as changes in processor availability, Chapter 5 modified the hybrid scheduling system from Chapter 4 to accommodate variability in the processors available for computation. We used the same CtS as in Chapter 3 (HEFT-MEG), where the CtS assumes that all processors will be available for computation during application execution. Chapter 5 introduced a novel two-part RtS. The Fault Tolerant Re-Mapper (FTRM) resembles the CADS re-mapper as presented in Chapter 4. The FTRM examines the current processor availability and, using the schedule generated at compile time, remaps tasks to the available set of processors. The second portion of the RtS, the Reconfiguration and Recovery Scheduler (RRS), specifically addresses the opportunities when designing a fault-tolerant system for RH. When a change in processor availability occurs, the RRS changes the reconfiguration schedule so that the reconfigurations more accurately reflect the new hardware capabilities. The proposed hybrid scheduling system enables application performance to gracefully degrade when processor availability diminishes, and increase when processor availability increases.

6.2 Future Work

This section proposes several directions for further research building on the work in this dissertation.

In Chapter 3, we explored scheduling reconfiguration on a single-user/single-application system. Generating a rigid schedule of configurations becomes less useful when multiple applications that utilize the RH resources are executing concurrently. Some of the concepts presented in Chapter 5 for fault tolerance could be applied to multi-application access to RH, especially the concept of unavailable processing resources. An interesting avenue to study this further is in the area of a reconfiguration-aware OS, and the interaction of compile- and run-time scheduling with the OS scheduling of hardware resources. Some work has already been done developing a reconfiguration-aware OS [84, 74], but more work will need to be completed to fully utilize the capabilities present in future H-CMP architectures.

We compared our proposed HEFT-MEG scheduler to two other schedulers: a single-configuration HEFT [108] and Mei00 [70]. Further work will explore how the -MEG extension coupled with other list schedulers affects the resulting schedule quality, and will develop extensions to other existing hardware-software co-schedulers to compare HEFT-MEG to a greater variety of other schedulers. Also, future work will develop a branch-and-bound-like version of the -MEG extension to further reduce execution time by discontinuing repeated exploration of reconfiguration decisions that have previously been found to be non-beneficial.

In the future, scheduling NoC access can be expanded in several areas. First, using a hybrid scheduler can allow the execution of applications that are not specified as directed acyclic graphs (DAGs). Future work may include tools to allow a program to dynamically generate tasks to add to the task graph at runtime, using CADS to schedule the new tasks. Additionally, a hybrid re-mapper can also be used to merge the schedules of more than one application at runtime. Work exploring the performance and fairness concerns when scheduling multiple, independent task graphs onto an H-CMP is another avenue of possible future work.

In the stochastic model demonstrated in Chapter 4, we disregard several secondary effects that can become more important as the number of simultaneous communications increases. Further work will explore whether considering the secondary effects is beneficial to the schedule resulting from using the CA- extension.

Also, we plan to explore the effect of using different schedulers with the CA- extension, determining the best combination of CtS and RtS heuristics for our hybrid scheduling system. Chapter 4 demonstrates that the overhead of the CADS RtS is acceptable on the Cell processor, but increasing the number of processing cores increases the Run-time Scheduler (RtS) overhead. Current technological trends indicate larger numbers of processing cores will be integrated into future H-CMP architectures, and the runtime of the CADS RtS (which is O(p)) could become unacceptable in the future. A possible solution would be to implement a partitioned, hierarchical version of the CADS RtS. In hierarchical CADS, processors would be grouped together, with a separate CADS scheduler for each processor group. Then, a higher-level scheduler would be executed relatively infrequently, and would perform load balancing (and possibly other) operations across groups and assign blocks of tasks to groups. An interesting avenue to begin the implementation of the hierarchical CADS RtS would be to examine operating system schedulers, such as the ULE scheduler that is part of the FreeBSD operating system [69], which utilizes a per-processor list of threads, where each processor makes local scheduling decisions but a higher-level scheduler decides which processor gets which threads.

In Chapter 5 we demonstrate, in simulation, that the fault-tolerant hybrid scheduler gracefully degrades application performance when processor availability decreases. First, the FTRM and RRS schedulers should be optimized to increase the performance of the application after a change in processor availability. Future work would include studies measuring the real-world overhead involved in implementing the RtS on an H-CMP with reconfigurable hardware. The FTRM re-mapper suffers from the same worst-case runtime (O(p)) as CADS, which may become an unacceptable overhead as the number of processors in the system increases. Future work would investigate a similar partitioned, hierarchical RtS as suggested for the CADS scheduler to reduce its overhead. Similarly, the RRS scheduler examines reconfiguration decisions for the entire reconfiguration schedule on a change in processor availability. For transient faults that are expected to last only short amounts of time, the RRS could limit the size of the reconfiguration area it searches. Another avenue to reduce the overhead involved with the RRS would be to start the computation of a new reconfiguration schedule when a change occurs, but continue execution in the meantime using the FTRM to remap tasks originally scheduled on the unavailable processors. This would enable more complex heuristics to be used, as the response time of the RRS would not be as important. Additionally, the work presented in Chapter 5 only considers when the underlying architecture provides fewer processing cores than an application was originally scheduled to run on. One interesting avenue for future work would study how to "scale up" the number of cores at runtime in a way that increases the application performance in a predictable manner. Finally, the RRS scheduler was developed with FPGA soft-cores in mind, where a particular task can either be executed in hardware on the FPGA or in software on a microprocessor. Future enhancements could target RRS to situations where the choice is between a small number of more powerful cores or a large number of less powerful cores. The current implementation of RRS would most likely pick a small number of more powerful cores, even if the other configuration (or something in between) would have higher performance.

BIBLIOGRAPHY

[1] Anant Agarwal, Liewei Bao, John Brown, Bruce Edwards, Matt Mattina, Chyi-Chang Miao, Carl Ramey, and David Wentzlaff. Tile Processor: Embedded Multicore for Networking and Multimedia. In HotChips: A Symposium on High Performance Chips, Stanford, California, August 2007.

[2] Ishfaq Ahmad and Yu-Kwong Kwok. On Exploiting Task Duplication in Parallel Program Scheduling. IEEE Transactions on Parallel and Distributed Systems, 9(9):872–892, September 1998.

[3] S. Ali, J. Kim, H. Siegel, A. Maciejewski, Y. Yu, S. Gundala, S. Gertphol, and V. Prasanna. Greedy Heuristics for Resource Allocation in Dynamic Distributed Real-time Systems. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '02), pages 519–530, 2002.

[4] Randy Allen, David Callahan, and Ken Kennedy. Automatic Decomposition of Scientific Programs for Parallel Execution. In Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pages 63–76, Munich, West Germany, 1987.

[5] Murali Annavaram, Ed Grochowski, and John Shen. Mitigating Amdahl's Law Through EPI Throttling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 298–309, Madison, Wisconsin, June 2005.

[6] Rashmi Bajaj and Dharma Agrawal. Improving Scheduling of Tasks in a Heterogeneous Environment. IEEE Transactions on Parallel and Distributed Systems, 15(2):107–118, February 2004.

[7] Saisanthosh Balakrishnan, Ravi Rajwar, Mike Upton, and Konrad Lai. The Impact of Performance Asymmetry in Emerging Multi-core Architectures. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 506–519, Madison, Wisconsin, June 2005.

[8] Michel Banatre and Peter A. Lee. Hardware and Software Architectures for Fault Tolerance: Experiences and Perspectives. Lecture Notes in Computer Science. Springer-Verlag, 1994.

[9] Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), pages 282–293, Vancouver, Canada, June 2000.

[10] O. Beaumont, V. Boudet, and Y. Robert. A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS ’02 Workshops), page 37, 2002.

[11] O. Beaumont, V. Boudet, and Y. Robert. The iso-level scheduling heuristic for heterogeneous processors. In PDP'2002, 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, pages 335–350, Canary Islands, Spain, 2002. IEEE Press.

[12] Cristina Boeres, Jose Filho, and Vinod Rebello. A Cluster-based Strategy for Scheduling Task on Heterogeneous Processors. In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'04), pages 214–221, October 2004.

[13] Cristina Boeres and Alexandre Lima. Hybrid Task Scheduling: Integrating Static and Dynamic Heuristics. In Proceedings of the 15th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'03), pages 199–206, November 2003.

[14] Shekhar Borkar. Microarchitecture and Design Challenges for Gigascale Integration: Keynote. In Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO), 2004.

[15] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi, and Vivek De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th Annual Conference on Design Automation, pages 338–342, June 2003.

[16] Doruk Bozdag, Umit Catalyurek, and Fusun Ozguner. A Task Duplication Based Bottom-Up Scheduling Algorithm for Heterogeneous Environments. In Proceedings of 20th International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[17] Doruk Bozdag, Fusun Ozguner, Eylem Ekici, and Umit Catalyurek. A Task Duplication Based Scheduling Algorithm Using Partial Schedules. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP), pages 630–637, June 2005.

[18] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau Bölöni, Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, and Bin Yao. A taxonomy for describing matching and scheduling heuristics for mixed-machine heterogeneous computing systems. In Symposium on Reliable Distributed Systems, pages 330–335, 1998.

[19] Doug Burger, Stephen Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, W. Yoder, and the TRIPS Team. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, pages 44–55, July 2004.

[20] Jim Burns, Adam Donlin, Jonathan Hogg, Satnam Singh, and Mark de Wit. A Dynamic Reconfiguration Run-Time System. In 5th IEEE Symposium on FPGA-Based Custom Computing Machines (FCCM '97), pages 66–75, Napa Valley, CA, April 1997.

[21] Daniel P. Campbell, Dennis M. Cottel, Randall R. Judd, and Mark A. Richards. Introduction to Morphware: Software Architecture for Polymorphous Computing Architectures. Technical report, Georgia Institute of Technology and Space and Naval Warfare Systems Center, San Diego, February 2004.

[22] Umit Catalyurek and Cevdet Aykanat. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distrib. Syst., 10(7):673–693, 1999.

[23] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 283–292, San Jose, California, 2006.

[24] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. A Case for an Over-provisioned Multicore System: Energy Efficient Processing of Multithreaded Programs. Technical report, University of Wisconsin Computer Sciences Technical Reports, 2007.

[25] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Adapting to Intermittent Faults in Multicore Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIII), 2007.

[26] Hongtu Chen and M. Maheswaran. Distributed Dynamic Scheduling of Composite Tasks on Grid Computing Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02), pages 89–97, 2002.

[27] Bertrand Cirou and Emmanuel Jeannot. Triplet: A Clustering Scheduling Algorithm for Heterogeneous Systems. In Proceedings of the International Conference on Parallel Processing Workshops (ICPPW'01), pages 231–236, 2001.

[28] Katherine Compton, Zhiyuan Li, James Cooley, and Scott Hauck. Configuration Relocation and Defragmentation for Run-time Reconfigurable Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(3), June 2002.

[29] Cristian Constantinescu. Intermittent faults in VLSI circuits. In Proceedings of the IEEE Workshop on Silicon Errors in Logic - System Effects, 2007.

[30] M. Cosnard and M. Loi. Automatic Task Graph Generation Techniques. In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, pages 113–122, 1995.

[31] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. ACM SIGPLAN Notices, Proc. Symp. Principles and Practice of Parallel Programming, 28(7):1–12, July 1993.

[32] Bharat P. Dave. CRUSADE: Hardware/Software Co-Synthesis of Dynamically Reconfigurable Heterogeneous Real-Time Distributed Embedded Systems. In Proceedings of Design, Automation and Test in Europe (DATE), pages 97–104, March 1999.

[33] A. Dhodapkar and J. E. Smith. Comparing Program Phase Detection Techniques. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 217, 2003.

[34] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic scheduling of tasks on partially reconfigurable FPGAs. IEE Proceedings - Computers and Digital Techniques, 147(3):181–188, May 2000.

[35] Yang Ding, Mahmut Kandemir, Padma Raghavan, and Mary Jane Irwin. A Helper Thread Based EDP Reduction Scheme for Adapting Application Execution in CMPs. In Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

[36] Atakan Dogan and Füsun Özgüner. LDBS: A Duplication Based Scheduling of Independent Tasks with QoS Requirements in Grid Computing with Time-Varying Resource Prices. In Proceedings of the International Conference on Parallel Processing, August 2002.

[37] Atakan Dogan and Füsun Özgüner. Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):308–323, March 2002.

[38] Atakan Dogan and Füsun Özgüner. Genetic Algorithm Based Scheduling of Meta-Tasks with Stochastic Execution Times in Heterogeneous Computing Systems. Cluster Computing (Kluwer), 2(7), 2004.

[39] Atakan Dogan and Füsun Özgüner. Scheduling of a Meta-Task with QoS Requirements in Heterogeneous Computing Systems. Journal of Parallel and Distributed Computing (Elsevier), 66(12):181–196, 2006.

[40] H. El-Rewini and T.G. Lewis. Scheduling Parallel Program Tasks onto Arbitrary Target Machines. Journal of Parallel and Distributed Computing, 9(2):138–153, June 1990.

[41] Mohammed Eltayeb, Atakan Dogan, and Füsun Özgüner. Concurrent Scheduling: Efficient Heuristics for Online Large-Scale Data Transfers in Distributed Real-Time Environments. IEEE Transactions on Parallel and Distributed Systems, 17(11):1348–1359, November 2006.

[42] Daniel Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Toan Pham, Rajeev Rao, Conrad Ziesler, David Blaauw, Todd Austin, and Trevor Mudge. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO), pages 7–18, December 2003.

[43] M. M. Eshaghian. Heterogeneous Computing. Prentice-Hall, Inc., 1996.

[44] Mark Gebhart and Steve Keckler. Large Matrix Multiplication on the TRIPS SVM System. Technical report, Department of Computer Sciences, The University of Texas at Austin, December 2005.

[45] Michael Gschwind, H. Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, and Takeshi Yamazaki. Synergistic Processing in Cell’s Multicore Architecture. IEEE Micro, 26(2):10–24, 2006.

[46] Charles R. Hardnett, Ajay Jayaraj, Tushar Kumar, Krishna V. Palem, and Sudhakar Yalamanchili. Compiling Stream Kernels for Polymorphous Computing Architectures. In The 12th International Conference on Parallel Architectures and Compilation Techniques (PACT-2003), New Orleans, Louisiana, September 2003.

[47] J. Harkin, T.M. McGinnity, and L.P. Maguire. Partitioning Methodology for Dynamically Reconfigurable Embedded Systems. IEE Proceedings - Computers and Digital Techniques, 147(6):391–396, November 2000.

[48] H. Peter Hofstee. Power Efficient Processor Architecture and The Cell Processor. In Proceedings of the 11th IEEE International Symposium on High-Performance Computer Architecture (HPCA-11 2005), pages 258–262, San Francisco, California, February 2005.

[49] Yatin Hoskote, Sriram Vangal, Nitin Borkar, and Shekhar Borkar. Teraflop Prototype Processor with 80 Cores. In HotChips: A Symposium on High Performance Chips, Stanford, California, August 2007.

[50] M.A. Iverson and Füsun Özgüner. Dynamic, Competitive Scheduling of Multiple DAGs in a Distributed Heterogeneous Environment. In Proceedings of the Heterogeneous Processing Workshop, pages 70–78, March 1998.

[51] Sangil Jwa and Ümit Özgüner. Multi-UAV Sensing Over Urban Areas via Layered Data Fusion. In IEEE/SP 14th Workshop on Statistical Signal Processing (SSP'07), pages 576–580, August 2007.

[52] Sangil Jwa, Zhijun Tang, and Ümit Özgüner. Robust Data Alignment Based on Information Theory and its Applications in Road Following Situation. In Intelligent Transportation Systems Conference (ITSC), pages 1328–1333, 2006.

[53] Vida Kianzad and Shuvra Bhattacharyya. Efficient Techniques for Clustering and Scheduling onto Embedded Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 17(7):667–680, July 2006.

[54] Jong-Kook Kim, S. Shivle, H.J. Siegel, A.A. Maciejewski, T.D. Braun, M. Schneider, S. Tideman, R. Chitta, R.B. Dilmaghani, R. Joshi, A. Kaul, A. Sharma, S. Sripada, P. Vangari, and S.S. Yellampalli. Dynamic Mapping in a Heterogeneous Environment with Tasks Having Priorities and Multiple Deadlines. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), page 15, April 2003.

[55] Michael Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May-June 2006.

[56] Kathleen Knobe, James M. Rehg, Arun Chauhan, Rishiyur S. Nikhil, and Umakishore Ramachandran. Scheduling Constrained Dynamic Applications on Clusters. In Proceedings of Supercomputing (SC’99), pages 46–61, Portland, Oregon, November 1999.

[57] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, March-April 2005.

[58] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multi-threaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA'04), pages 64–75, Munich, Germany, June 2004.

[59] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), pages 408–419, Madison, Wisconsin, June 2005.

[60] Yu-Kwong Kwok and Ishfaq Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Computing Surveys, 31(4):406–471, December 1999.

[61] Yu-Kwong Kwok and Ishfaq Ahmad. Link Contention-Constrained Scheduling and Mapping of Tasks and Messages to a Network of Heterogeneous Processors. Cluster Computing: J. Networks, Software Tools, and Applications, 3(2):113–124, 2000.

[62] Xiaoyao Liang and David Brooks. Mitigating the Impact of Process Variations on Register Files and Execution Units. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 504–514, 2006.

[63] David M. Lin, James B.Y. Tsui, Lee L. Liou, and Y. T. Jade Morton. Sensitivity Limit of A Stand-Alone GPS Receiver and An Acquisition Method. In Proceedings of ION GPS, pages 1663–1667, Portland, Oregon, September 2002.

[64] G. Q. Liu, K. L. Poh, and M. Xie. Iterative list scheduling for heterogeneous computing. Journal of Parallel and Distributed Computing, 65(5):654–665, 2005.

[65] Jiwei Lu, Abhinav Das, Wei-Chung Hsu, Khoa Nguyen, and Santosh G. Abraham. Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pages 12–23, Barcelona, Spain, November 2005.

[66] Niti Madan and Rajeev Balasubramonian. Power-Efficient Approaches to Redundant Multithreading. IEEE Transactions on Parallel and Distributed Systems, 18(8):1066–1079, August 2007.

[67] Muthucumaru Maheswaran, Shoukat Ali, Howard Siegel, Debra Hensgen, and Richard F. Freund. Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems. In Proceedings of the 8th Heterogeneous Computing Workshop (HCW'99), pages 30–44, San Juan, Puerto Rico, April 1999.

[68] Muthucumaru Maheswaran and Howard Siegel. A Dynamic Matching and Scheduling Algorithm for Heterogeneous Computing Systems. In Proceedings of the Seventh Heterogeneous Computing Workshop, pages 57–69, 1998.

[69] Marshall McKusick and George V. Neville-Neil. The Design and Implementation of the FreeBSD Operating System. Addison-Wesley Professional, 2004.

[70] Bingfeng Mei, Patrick Schaumont, and Serge Vernalde. A Hardware-Software Partitioning and Scheduling Algorithm for Dynamically Reconfigurable Embedded Systems. In 11th ProRISC workshop on Circuits, Systems and Signal Processing, November 2000.

[71] Daniel A. Menasce, Debanjan Saha, Stella C. da Silva Porto, Virgilio A. F. Almeida, and Satish K. Tripathi. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. Journal of Parallel and Distributed Computing, 28(1):1–18, 1995.

[72] Gerald R. Morris, Viktor K. Prasanna, and Richard D. Anderson. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer. In 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), pages 3–12, Napa, California, April 2006.

[73] Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, and Stephen W. Keckler. Design Space Evaluation of Grid Processor Architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), pages 44–51, December 2001.

[74] V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Designing an operating system for a heterogeneous reconfigurable SoC. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), April 2003.

[75] Hyunok Oh and Soonhoi Ha. A static scheduling heuristic for heterogeneous processors. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing, Volume II, pages 573–577, London, UK, 1996. Springer-Verlag.

[76] Leonid Oliker, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski, Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier, Stephane Ethier, and Tom Goodale. Scientific Application Performance on Candidate PetaScale Platforms. In IPDPS '07: Proceedings of the Parallel and Distributed Processing Symposium, March 2007.

[77] Jonathan Phillips, Matthew Areno, Chris Rogers, Aravind Dasu, and Brandon Eames. A Reconfigurable Load Balancing Architecture for Molecular Dynamics. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07), March 2007.

[78] F. Pollack. New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies. In 32nd Annual International Symposium on Microarchitecture (MICRO-32), page 2, November 1999.

[79] Michael D. Powell, Mohamed Gomaa, and T. N. Vijaykumar. Heat-and-Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 260–270, Boston, Massachusetts, October 2004.

[80] Cray Inc. Products. Webpage. http://www.cray.com/products.

[81] Intel Corp. Products. Webpage. http://www.intel.com/products.

[82] SRC Computers Inc. Products. Webpage. http://www.srccomputers.com/products.

[83] M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill Book Company, 1993.

[84] Vincenzo Rana, Marco Santambrogio, Donatella Sciuto, Boris Kettelhoit, Markus Koester, Mario Porrmann, and Ulrich Rückert. Partial Dynamic Reconfiguration in a Multi-FPGA Clustered Architecture Based on Linux. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07), March 2007.

[85] F. Rodriguez-Henriquez, N.A. Saqib, and A. Diaz-Perez. 4.2 Gbit/s single-chip FPGA implementation of AES algorithm. Electronics Letters, 39(15):1115–1116, July 2003.

[86] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73–82, 2008.

[87] Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan, Richard L. Hudson, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Eric Sprangle, Anwar Rohillah, Doug Carmean, and Jesse Fang. Enabling Scalability and Performance in a Large Scale CMP Environment. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 73–86, Lisbon, Portugal, 2007.

[88] Proshanta Saha and Tarek El-Ghazawi. A Methodology for Automating Co-Scheduling for Reconfigurable Computing Systems. In MEMOCODE'07: 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pages 159–168, May 2007.

[89] Proshanta Saha and Tarek El-Ghazawi. Extending Embedded Computing Scheduling Algorithms for Reconfigurable Computing Systems. In SPL '07: Proceedings of the 3rd Southern Conference on Programmable Logic, pages 87–92, February 2007.

[90] Proshanta Saha and Tarek El-Ghazawi. Software/Hardware Co-Scheduling for Reconfigurable Computing Systems. In FCCM'07: 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 299–300, April 2007.

[91] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, and Charles R. Moore. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03), pages 422–433, San Diego, California, June 2003.

[92] T. Sherwood, S. Sair, and B. Calder. Phase Tracking and Prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA-30), pages 336–349, 2003.

[93] Smitha Shyam, Kypros Constantinides, Sujay Phadke, Valeria Bertacco, and Todd Austin. Ultra low-cost defect protection for microprocessor pipelines. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, 2006.

[94] Gilbert C. Sih and Edward A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distributed Systems, 4(2):175–187, 1993.

[95] Oliver Sinnen. Task Scheduling for Parallel Systems. Wiley Series on Parallel and Distributed Computing. John Wiley & Sons, Inc., Hoboken, New Jersey, 2007.

[96] Oliver Sinnen and L. Sousa. Communication Contention in Task Scheduling. IEEE Transactions on Parallel and Distributed Systems, 16(6):503–515, June 2005.

[97] Lodewijk Smit, Johann Hurink, and Gerard Smit. Run-time Mapping of Applications to a Heterogeneous SoC. In Proceedings of the International Symposium on System-on-Chip, pages 78–81, November 2005.

[98] Software Development Kit for Multicore Acceleration, Version 3.0. Accelerated Library Framework for Cell Broadband Engine Programmer’s Guide and API Reference. Technical report, IBM Corp., 2007.

[99] Jon Stokes. A Closer Look at AMD’s CPU/GPU Fusion, November 2006. http://arstechnica.com/news.ars/post/20061119-8250.html.

[100] E.J. Swankoski, R.R. Brooks, V. Narayanan, M. Kandemir, and M.J. Irwin. A Parallel Architecture for Secure FPGA Symmetric Encryption. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), 2004.

[101] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the RAW Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA'04), pages 2–13, Munich, Germany, June 2004.

[102] Justin Teller. Performance Characteristics of an Intelligent Memory System. Master’s thesis, University of Maryland, August 2004.

[103] Justin Teller, Robert Ewing, and Füsun Özgüner. What are the Building Blocks of a Nanoprocessor Architecture? In Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS'05), Cincinnati, Ohio, August 2005.

[104] Justin Teller, Füsun Özgüner, and Robert Ewing. The Morphable Nanoprocessor Architecture: Reconfiguration at Runtime. In Proceedings of the International Midwest Symposium on Circuits and Systems (MWSCAS'06), San Juan, Puerto Rico, August 2006.

[105] Justin Teller, Füsun Özgüner, and Robert Ewing. Scheduling Reconfiguration at Runtime on the TRIPS Processor. In Proceedings of the Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

[106] Justin Teller, Charles B. Silio, and Bruce Jacob. Performance Characteristics of MAUI: An Intelligent Memory System. In Proceedings of the 3rd ACM SIGPLAN Workshop on Memory System Performance (MSP 2005), Chicago, Illinois, June 2005.

[107] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Task scheduling algorithms for heterogeneous processors. In Proceedings of the Heterogeneous Computing Workshop (HCW '99), pages 3–14, San Juan, Puerto Rico, 1999.

[108] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, March 2002.

[109] James B. Tsui. Fundamentals of Global Positioning System Receivers: A Software Approach. Wiley-Interscience: John Wiley & Sons, Inc., 2005.

[110] Oreste Villa, Daniele Paolo Scarpazza, Fabrizio Petrini, and Juan Fernandez Peinador. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-Core Processors. In IPDPS '07: Proceedings of the Parallel and Distributed Processing Symposium, pages 1–10, March 2007.

[111] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 9–20, Ischia, Italy, 2006. ACM Press.

[112] Michael J. Wirthlin, Brad L. Hutchings, and Kent L. Gilson. The Nano Processor: a Low Resource Reconfigurable Processor. In IEEE Workshop on FPGAs for Custom Computing Machines, pages 23–30, Napa, California, April 1994.

[113] Annie Wu, Han Yu, Shiyuan Jin, Kuo-Chi Lin, and Guy Schiavone. An Incremental Genetic Algorithm Approach to Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 15(9):824–834, September 2004.

[114] Min-You Wu. On Runtime Parallel Scheduling for Processor Load Balancing. IEEE Transactions on Parallel and Distributed Systems, 8(2):173–186, February 1997.

[115] Asim YarKhan and Jack Dongarra. Experiments with scheduling using simulated annealing in a grid environment. In GRID '02: Proceedings of the Third International Workshop on Grid Computing, pages 232–242, London, UK, 2002. Springer-Verlag.
