ABSTRACT

Title of dissertation: POWER AND PERFORMANCE STUDIES OF THE EXPLICIT MULTI-THREADING (XMT) ARCHITECTURE

Fuat Keceli, Doctor of Philosophy, 2011

Dissertation directed by: Professor Uzi Vishkin
Department of Electrical and Computer Engineering

Power and thermal constraints gained critical importance in the design of microprocessors over the past decade. Chipmakers failed to keep power at bay while sustaining the performance growth of serial computers at the rate expected by consumers. As an alternative, they turned to fitting an increasing number of simpler cores on a single die. While this is a step forward for relaxing the constraints, the issue of power is far from resolved, and it is joined by new challenges which we explain next.

As we move into the era of many-cores, processors consisting of 100s, even 1000s, of cores, single-task parallelism is the natural path for building faster general-purpose computers. Alas, the introduction of parallelism to the mainstream general-purpose domain brings another long elusive problem into focus: ease of parallel programming. The result is a dual challenge in which power efficiency and ease-of-programming are vital for the prevalence of up-and-coming many-core architectures.

The observations above led to the lead goal of this dissertation: a first order validation of the claim that even under power/thermal constraints, ease-of-programming and competitive performance need not be conflicting objectives for a massively-parallel general-purpose processor. As our platform, we choose the eXplicit Multi-Threading (XMT) many-core architecture for fine-grained parallel programs developed at the University of Maryland. We hope that our findings will be a trailblazer for future commercial products.

XMT scales up to a thousand or more lightweight cores and aims at improving single-task execution time while making the task of the programmer as easy as possible. Performance advantages and ease-of-programming of XMT have been shown in a number of publications, including a study that we present in this dissertation. Feasibility of the hardware concept has been exhibited via FPGA and ASIC (per our partial involvement) prototypes.

Our contributions target the study of power and thermal envelopes of an envisioned 1024-core XMT chip (XMT1024) under programs that exist in popular parallel benchmark suites. First, we compare XMT against an area- and power-equivalent commercial high-end many-core GPU. We demonstrate that XMT can provide an average speedup of 8.8x in irregular parallel programs that are common and important in general-purpose computing. Even under the worst-case power estimation assumptions for XMT, the average speedup is only reduced by half. We further this study by experimentally evaluating the performance advantages of Dynamic Thermal Management (DTM) when applied to XMT1024. DTM techniques are frequently used in current single- and multi-core processors; however, until now their effects on single-tasked many-cores have not been examined in detail. It is our purpose to explore how existing techniques can be tailored for XMT to improve performance. Performance improvements of up to 46% over a generic global management technique have been demonstrated. The insights we provide can guide designers of other similar many-core architectures.

A significant infrastructure contribution of this dissertation is a highly configurable cycle-accurate simulator, XMTSim. To our knowledge, XMTSim is currently the only publicly-available shared-memory many-core simulator with extensive capabilities for estimating power and temperature, as well as evaluating dynamic power and thermal management algorithms.
As a major component of the XMT programming toolchain, it is not only used as the infrastructure in this work, but has also contributed to other publications and dissertations.

POWER AND PERFORMANCE STUDIES OF THE EXPLICIT MULTI-THREADING (XMT) ARCHITECTURE

by

Fuat Keceli

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2011

Advisory Committee:
Professor Uzi Vishkin, Chair/Advisor
Assistant Professor Tali Moreshet
Associate Professor Manoj Franklin
Associate Professor Gang Qu
Associate Professor William W. Pugh

© Copyright by Fuat Keceli 2011

To my loving mother, Gulbun, who never stopped believing in me.

Anneme. (To my mother.)

Acknowledgments

I would like to sincerely thank my advisors Dr. Uzi Vishkin and Dr. Tali Moreshet for their guidance. It has been a long but rewarding journey through which they have always been understanding and patient with me. Dr. Vishkin's invaluable experience, unique insight and perseverance not only shaped my research, but also gave me an everlasting perspective on defining and solving problems. Dr. Moreshet selflessly spent countless hours in discussions with me, during which we cultivated the ideas that were collected in this dissertation. Her advice as a mentor and a friend kept me focused and taught me how to convey ideas clearly. It was a privilege to have both as my advisors.

I am indebted to the past and present members of the XMT team, especially my good friends George C. Caragea and Alexandros Tzannes, with whom I worked closely and puzzled over many problems. James Edwards' collaboration and feedback was often crucial. Aydin Balkan, Xingzhi Wen and Michael Horak were exceptional colleagues during their time in the team. I would not have made it to the finish line without their help.

It was the sympathetic ear of many friends that kept me sane through the course of difficult years. They helped me deal with the setbacks along the way, pushed me through the pain of my father's long illness, and gave me a space where I could let off steam. Thank you (in no particular order), Apoorva Prasad, Thanos Chrysis, Harsh Dhundia, Nimi Dvir, Orkan Dere, Bulent Boyaci, and everybody else that I may have inadvertently left out. Apoorva Prasad deserves a special mention, for he took the time to proofread most of this dissertation during his brief visit from overseas.

Finally and most importantly, I would like to express my heartfelt gratitude to my family. I am lucky that my mother Gulbun and my brother Alp were, are, and always will be with me. Alas, my father Ismail, who was the first engineer that I knew and a very good one, passed away in 2009; he lives in our memories and our hearts. I would not be the person that I am today without their unwavering support, encouragement, and love. I owe many thanks to Tina Peterson for being my family away from home, and for her continuous support and concern. Lastly, I will always remember my grandfather, Muzaffer Tuncel, as the person who planted the seeds of scientific curiosity in me.

Table of Contents

List of Tables ix

List of Figures x

List of Abbreviations xiii

1 Introduction 1

2 Background on Power/Temperature Estimation and Management 5
2.1 Sources of Power Dissipation in Digital Processors ...... 5
2.2 Dynamic Power Consumption ...... 6
2.2.1 Clock Gating ...... 6
2.2.2 Estimation of Dynamic Power in Simulation ...... 7
2.3 Leakage Power Consumption ...... 9
2.3.1 Management of Leakage Power ...... 10
2.3.2 Estimation of Leakage Power in Simulation ...... 11
2.4 Power Consumption in On-Chip Memories ...... 12
2.5 Power and Clock Speed Trade-offs ...... 12
2.6 Dynamic Scaling of Voltage and Frequency ...... 14
2.7 Technology Scaling Trends ...... 16
2.8 Thermal Modeling and Design ...... 18
2.9 Power and Thermal Constraints in Recent Processors ...... 21
2.10 Tools for Area, Power and Temperature Estimation ...... 22

3 The Explicit Multi-Threading (XMT) Platform 24
3.1 The XMT Architecture ...... 25
3.1.1 Memory Organization ...... 26
3.1.2 The Mesh-of-Trees Interconnect (MoT-ICN) ...... 26
3.2 Programming of XMT ...... 28
3.2.1 The PRAM Model ...... 28
3.2.2 XMTC – Enhanced C Programming for XMT ...... 29
3.2.3 The Prefix-Sum Operation ...... 30

3.2.4 Example Program ...... 31
3.2.5 Independence-of-Order and No-Busy-Wait ...... 31
3.2.6 Ease-of-Programming ...... 32
3.3 Thread Scheduling in XMT ...... 34
3.4 Performance Advantages ...... 36
3.5 Power Efficiency of XMT and Design for Power Management ...... 36
3.5.1 Suitability of the Programming Model ...... 37
3.5.2 Re-designing Thread Scheduling for Power ...... 37
3.5.3 Low Power States for Clusters ...... 40
3.5.4 Power Management of the Synchronous MoT-ICN ...... 41
3.5.5 Power Management of Shared Caches – Dynamic Cache Resizing ...... 42

4 XMTSim – The Cycle-Accurate Simulator of the XMT Architecture 44
4.1 Overview of XMTSim ...... 45
4.2 Simulation Statistics and Runtime Control ...... 48
4.3 Details of Cycle-Accurate Simulation ...... 49
4.3.1 Discrete-Event Simulation ...... 49
4.3.2 Concurrent Communication of Data Between Components ...... 52
4.3.3 Optimizing the DE Simulation Performance ...... 54
4.3.4 Simulation Speed ...... 56
4.4 Cycle Verification Against the FPGA Prototype ...... 58
4.5 Power and Temperature Estimation in XMTSim ...... 61
4.5.1 The Power Model ...... 62
4.6 Dynamic Power and Thermal Management in XMTSim ...... 65
4.7 Other Features ...... 65
4.8 Features under Development ...... 66
4.9 Related Work ...... 67

5 Enabling Meaningful Comparison of XMT with Contemporary Platforms 68
5.1 The Compared Architecture – NVIDIA Tesla ...... 69
5.1.1 Tesla/CUDA Framework ...... 70
5.1.2 Comparison of the XMT and the Tesla Architectures ...... 71
5.2 Silicon Area Feasibility of 1024-TCU XMT ...... 74

5.2.1 ASIC Synthesis of a 64-TCU Prototype ...... 74
5.2.2 Silicon Area Estimation for XMT1024 ...... 74
5.3 Benchmarks ...... 78
5.4 Performance Comparison of XMT1024 and the GTX280 ...... 82
5.5 Conclusions ...... 83

6 Power/Performance Comparison of XMT1024 and GTX280 84
6.1 Power Model Parameters for XMT1024 ...... 84
6.2 First Order Power Comparison of XMT1024 and GTX280 ...... 86
6.3 GPU Measurements and Simulation Results ...... 87
6.3.1 Benchmarks ...... 88
6.3.2 GPU Measurements ...... 88
6.3.3 XMT Simulations and Comparison with GTX280 ...... 89
6.4 Sensitivity of Results to Power Model Errors ...... 91
6.4.1 Clusters, Caches and Memory Controllers ...... 91
6.4.2 Interconnection Network ...... 93
6.4.3 Putting it together ...... 95
6.5 Discussion of Detailed Data ...... 95
6.5.1 Sensitivity to ICN and Cluster Clock Frequencies ...... 95
6.5.2 Power Breakdown for Different Cases ...... 97

7 Dynamic Thermal Management of the XMT1024 Processor 99
7.1 Thermal Simulation Setup ...... 100
7.2 Benchmarks ...... 102
7.2.1 Benchmark Characterization ...... 103
7.3 Thermally Efficient Floorplan for XMT1024 ...... 108
7.3.1 Evaluation of Floorplans without DTM ...... 112
7.4 DTM Background ...... 115
7.4.1 Control of Temperature via PID Controllers ...... 116
7.5 DTM Algorithms and Evaluation ...... 118
7.5.1 Analysis of DTM Results ...... 119
7.5.2 CG-DDVFS ...... 123
7.5.3 FG-DDVFS ...... 124

7.5.4 LP-DDVFS ...... 124
7.5.5 Effect of floorplan ...... 125
7.6 Future Work ...... 125
7.7 Related Work ...... 127

8 Conclusion 129

A Basics of Digital CMOS Logic 131
A.1 The MOSFET ...... 131
A.2 A Simple CMOS Logic Gate: The Inverter ...... 131
A.3 Dynamic Power ...... 133
A.3.1 Switching Power ...... 134
A.3.2 Short Circuit Power ...... 135
A.4 Leakage Power ...... 135
A.4.1 Subthreshold Leakage ...... 136
A.4.2 Leakage due to Gate Oxide Scaling ...... 139
A.4.2.1 Junction Leakage ...... 139

B Extended XMTSim Documentation 140
B.1 General Information and Installation ...... 140
B.1.1 Dependencies and install ...... 140
B.2 XMTSim Manual ...... 142
B.3 XMTSim Configuration Options ...... 149

C HotSpotJ 157
C.1 Installation ...... 157
C.1.1 Software Dependencies ...... 157
C.1.2 Building the Binaries ...... 158
C.2 Limitations ...... 159
C.3 Summary of Features ...... 160
C.4 HotSpotJ Terminology ...... 161
C.4.1 Creating/Running Experiments and Displaying Results ...... 163
C.5 Tutorial – Floorplan of a 21x21 many-core processor ...... 164
C.5.1 The Java code for the 21x21 Floorplan ...... 164

C.6 HotSpotJ Command Line Options ...... 166

D Alternative Floorplans for XMT1024 169

Bibliography 171

List of Tables

2.1 A survey of thermal design powers...... 22

4.1 Advantages and disadvantages of DE vs. DT simulation...... 52
4.2 Simulated throughputs of XMTSim...... 58
4.3 The configuration of XMTSim that is used in validation against Paraleap. .. 58
4.4 Microbenchmarks used in cycle verification ...... 60

5.1 Implementation differences between XMT and Tesla...... 73
5.2 Hardware specifications of the GTX280 and the simulated XMT configuration. 76
5.3 The detailed specifications of XMT1024...... 76
5.4 The area estimation for a 65 nm XMT1024 chip...... 77
5.5 Benchmark properties in XMTC and CUDA...... 81
5.6 Percentage of time XMT spent on various types of instructions...... 83

6.1 Power model parameters for XMT1024...... 85
6.2 Benchmarks and results of experiments...... 88

7.1 Benchmark properties ...... 107
7.2 The baseline clock frequencies ...... 119

C.1 Specifications of the HotSpotJ test system...... 158

List of Figures

2.1 Addition of power gating to a logic circuit...... 10
2.2 Demonstration of dynamic energy savings with DVFS...... 14
2.3 VF curve for Pentium M765 and AMD Athlon 4000+ processors...... 16
2.4 Modeling of temperature and heat analogous to RC circuits...... 19
2.5 Side view of a chip with packaging and heat sink, and its simplified RC model. 20
2.6 Thermal image of a single core of IBM PowerPC 970 processor...... 21

3.1 Overview of the XMT architecture...... 25
3.2 Bit fields in an XMT memory address...... 26
3.3 The Mesh-of-Trees interconnect...... 27
3.4 Building blocks of MoT-ICN...... 27
3.5 XMTC programming...... 32
3.6 Flowchart for starting and distributing threads...... 33
3.7 Execution of a parallel section with 7 threads on a 4 TCU XMT system. ... 35
3.8 Sleep-wake mechanism for thread ID check in TCUs...... 38
3.9 Addition of thread gating to thread scheduling of XMT...... 39
3.10 The state diagram for the activity state of TCUs...... 40
3.11 Modification of address bit-fields for cache resizing...... 42

4.1 XMT overview from the perspective of XMTSim software structure...... 46
4.2 Overview of the simulation mechanism, inputs and outputs...... 48
4.3 The overview of DE scheduling architecture of the simulator...... 50
4.4 Main loop of execution for discrete-time vs. discrete-event simulation. ... 51
4.5 Example of discrete-time pipeline simulation...... 53
4.6 Example of discrete-event pipeline simulation...... 54
4.7 Example of discrete-event pipeline simulation with the addition of priorities. 54
4.8 Example implementation of a MacroActor...... 57
4.9 Operation of the power/thermal-estimation plug-in...... 62
4.10 Operation of a DTM plug-in...... 65

5.1 Overview of the NVIDIA Tesla architecture...... 70

5.2 Speedups of the 1024-TCU XMT configuration with respect to GTX280. ... 82

6.1 Speedups of XMT1024 with respect to GTX280...... 90
6.2 Ratio of benchmark energy on GTX280 to XMT1024 with respect to GTX280. 90
6.3 Decrease in XMT vs. GPU speedups with average case and worst case assumptions for power model parameters...... 92
6.4 Increase in benchmark energy on XMT with average case and worst case assumptions for power model parameters...... 92
6.5 Degradation in the average speedup with different ICN power scenarios (1). 94
6.6 Degradation in the average speedup with different chip power scenarios (2). 95
6.7 Degradation in the average speedup with different cluster and ICN clock frequencies...... 96
6.8 Degradation in the average speedup with different cluster frequencies when ICN frequency is held constant and vice-versa...... 98
6.9 Power breakdown of the XMT chip for different cases...... 98

7.1 Degree of parallelism in the benchmarks...... 105
7.2 The activity plot of the variable activity benchmarks...... 106
7.3 The dance-hall floorplan (FP2) for the XMT1024 chip...... 109
7.4 The checkerboard floorplan (FP1) for the XMT1024 chip...... 110
7.5 The cluster/cache tile for FP2...... 110
7.6 Partitioning the MoT-ICN for distributed ICN floorplan...... 111
7.7 Mapping of the partitioned MoT-ICN to the floorplan...... 112
7.8 Temperature data from execution of the power-virus program on FP1, displayed as a thermal map...... 113
7.9 Temperature data from execution of the power-virus program on FP2, displayed as a thermal map...... 113
7.10 Execution time overhead on FP2 compared to FP1...... 115
7.11 PID controller for one core or cluster and one thermal sensor...... 117
7.12 Benchmark speedups on FP1 with DTM...... 121
7.13 Benchmark speedups on FP2 with DTM...... 122
7.14 Execution time overheads on FP2 compared to FP1 under DTM...... 126

A.1 The MOSFET transistor...... 132
A.2 The CMOS inverter...... 132
A.3 Overview of dynamic currents on a CMOS inverter...... 134
A.4 Overview of leakage currents in a MOS transistor...... 136
A.5 Overview of leakage currents on a CMOS inverter...... 136
A.6 Drain current versus gate voltage in an nMOS transistor...... 137

C.1 Workflow with the command line script of HotSpotJ...... 161
C.2 21x21 many-core floorplan viewed in the floorplan viewer of HotSpotJ. ... 165

D.1 The first alternative floorplan for the XMT1024 chip...... 170
D.2 Another alternative tiled floorplan for the XMT1024 chip...... 170

List of Abbreviations

DTM Dynamic Thermal Management
ICN Interconnection Network
MoT-ICN Mesh-of-Trees Interconnection Network
PRAM Parallel Random Access Machine
TCU Thread Control Unit
XMT Explicit Multi-Threading
XMT1024 A 1024-TCU XMT processor

Chapter 1

Introduction

Microprocessors enjoyed a 1000-fold performance growth over two decades, fueled by transistor speed and the scaling of energy [BC11]. The increase in transistor density following Moore's Law [Moo65] enabled the integration of microarchitectural techniques which contributed further to performance. Nonetheless, this too-good-to-be-true scaling of performance has reached its practical limit with the advance of technology into the deep sub-micron era. The "power wall" has stagnated the progress of processor clock frequency, and complex microarchitectural optimizations are now deemed inefficient, as they do not provide energy-proportional performance. Instead, vendors currently depend on increasing the number of computing cores on a chip for sustaining the performance growth across generations of products. Recent industry road-maps indicate a popular trend of sacrificing core complexity for quantity, hence vitalizing many-core and heterogeneous computers [Bor07, HM08]. The arrival of many-core GPUs [NVIb, AMD10b] and the ongoing development of other commercial processors (e.g., Intel Larrabee [SCS+08]) support this observation.

Moore's Law and the many-core paradigm alone cannot provide the recipe for supporting the performance growth of general-purpose processors. The performance of a single-tasked parallel computer depends on programmers' ability to extract parallelism from applications, which has historically been limited by ease-of-programming. Parallel architectures and programming models should be co-designed with ease-of-programming as the common goal; however, contemporary architectures have fallen short of accomplishing this objective (see [Pat10, FM10]). The Explicit Multi-Threading (XMT) architecture [VDBN98, NNTV01], built at the University of Maryland, has emerged as a new approach towards solving this long-standing problem.

The XMT architecture was developed and optimized with the purpose of achieving strong performance for Parallel Random Access Model/Machine (PRAM) algorithms [JáJ92, KR90, EG88, Vis07]. PRAM is accepted as an easy model for parallel algorithmic thinking and it is accompanied by a rich body of algorithmic theory, second only to its serial counterpart, known as the "von-Neumann" architecture. XMT is a highly-scalable shared-memory architecture with a heavy-duty serial processor and many lightweight parallel cores called Thread Control Units (TCUs). The current hardware platform of XMT consists of 64-TCU FPGA and ASIC prototypes [WV08a, WV07, WV08b], and the next defining step for the project would be to build a 1024-TCU XMT processor. The contributions of this dissertation significantly strengthen the claim that the 1024-TCU XMT processor is feasible and capable of outperforming other many-cores in its class.

For an industrial-grade processor, commitment to silicon is costly and demands an extensive study that examines the feasibility of its implementation, as well as the programmability and performance advantages of the approach. First, it should be shown that the concept of the design does not impose any constraints that are fundamentally unrealistic to implement. For XMT, the FPGA and the ASIC prototypes serve this purpose. Additionally, the merit of the new architecture should be demonstrated against existing ones via simulations or projections from the prototype. XMT exhibits superior performance in irregular parallel programs [CSWV09, CKTV10, Edw11, CV11], while not significantly falling behind in others, and it requires a much lower learning and programming effort [TVTE10, VTEC09, HBVG08, PV11].

Our contributions within this framework can be summarized as follows:

• A configurable versatile cycle-accurate simulator that facilitated most of the remainder of this thesis as well as other threads of research within the XMT project. To our knowledge, XMTSim is currently the only publicly available academic tool that is capable of simulating distributed dynamic thermal and power management algorithms on a many-core environment.

• Performance comparison of XMT1024 against a state-of-the-art many-core GPU with and without power envelope constraints. Derivation of the design specifications for a 1024-TCU XMT chip (XMT1024) that fits on the same die and power envelope as the baseline GPU.

• Evaluation of various dynamic thermal management algorithms for improving the performance of XMT1024 without increasing its thermal design power (TDP).

• Synthesis and gate level simulations of the 64-core ASIC chip in 90nm IBM technology.

Among our contributions, XMTSim stands out: not only does it enable the rest of this dissertation, it has also been instrumental in other publications that are outside the scope of our work [Car11, CTK+10, DLW+08]. These publications were important milestones for demonstrating the merit of the XMT architecture. Moreover, XMTSim can be configured to simulate other shared memory architectures (for example, the Plural architecture [Gin11]) and as such can be an important asset in architectural exploration.

The performance of XMT1024 was compared against the GPU in two steps. The first step, a joint effort between two dissertation projects, was a comparison between area-equivalent configurations. Our contribution to this step consisted of establishing the XMT configuration, executing the experiments and collecting data from XMTSim. Preparation of the benchmarks and the experimental methodology was a part of the work in [Car11]. For a meaningful comparison, it was essential that the simulated XMT chip be area-equivalent to the GPU.

The second part extended the comparison by adding power constraints. We discussed earlier that power is a primary constraint in the design of processors and that a meaningful comparison between two processors requires both similar silicon areas and power envelopes. In this comparison, we repeated the experiments for different scenarios, accounting for the possibility of different degrees of error in estimating the power of XMT1024.

Finally, we further the performance study of XMT by adding dynamic thermal management (DTM) to the simulation of the XMT1024 chip. With DTM, the chip utilizes the power envelope more efficiently for better performance in the average case. DTM has previously been implemented in multi-core processors; however, our work is the first to analyze it in a 1000+ core context.

This thesis is organized as follows: Following the introduction, we discuss the background on power/temperature estimation and management in Chapter 2. In Chapter 3, we review the XMT architecture and provide insights on how power management can be implemented in its various components. Chapter 4 introduces the cycle-accurate XMT simulator, XMTSim, and its power model. Chapter 5 presents the performance comparison of the envisioned XMT1024 chip with a state-of-the-art many-core GPU. This chapter establishes the feasibility of the proposed XMT chip and sets the full specifications of XMT1024, which are needed in the following chapters. In Chapter 6, we extend the performance study to include power constraints, and in Chapter 7, we simulate various thermal management algorithms and evaluate their effectiveness on XMT. Finally, we conclude in Chapter 8.

Chapter 2

Background on Power/Temperature Estimation and Management

In this chapter, we give a brief overview of the topic of power consumption in digital processors. We start the overview with the sources, management and modeling of dynamic and leakage power in Sections 2.1 through 2.4. In Section 2.5, we explain power and clock speed trade-offs, which is followed by a discussion of dynamic voltage and frequency scaling (DVFS). We continue with a summary of the trends in the design of modern processors (Section 2.7), using the perspective given in the previous sections. Section 2.8 provides the background on thermal modeling of a chip. We conclude with power and thermal constraints in recent processors (Section 2.9) and a survey of tools supplementary to simulators for estimating area, power and temperature (Section 2.10). The basic intuition we convey in this overview is required for the work we present in the subsequent chapters. Simulation is our main evaluation methodology in this thesis and as a general theme, most sections include notes about simulating for power estimation and management.

Throughout this chapter, unless otherwise noted, digital/CMOS refers to synchronous digital CMOS (Complementary Metal Oxide Semiconductor) logic; more information on CMOS than is given in this chapter can be found in Appendix A.

2.1 Sources of Power Dissipation in Digital Processors

Digital processors dissipate power in two forms: dynamic and static. Dynamic power is generated due to the switching activity in the digital circuits, and static power is caused by leakage in the transistors and is spent regardless of the switching activity. While dynamic power has always been present in CMOS circuits, leakage power has gained importance with shrinking transistor feature sizes and is a major contributor to power in the deep sub-micron era. Initially, static power was projected to overrun dynamic power in high performance processors by the 65nm technology node. This prediction was averted only because the industry backed away from aggressive scaling of the threshold voltage and incorporated various technologies such as stronger doping profiles, silicon-on-insulator (SOI) [SMD06] and high-k metal gates [CBD+05]. We will discuss these trends further in Section 2.7.

The next two sections will focus on the specifics of dynamic and leakage power. Each section contains a subsection on a power model that can be used in simulators, which we will combine in a unified model in Section 4.5.1 to be used in our simulator.

2.2 Dynamic Power Consumption

The dynamic power of processors is dominated by the switching power, Psw, which is described as follows:

P_{sw} \propto C_L \, V_{dd}^{2} \, f \, \alpha \qquad (2.1)

C_L is the average load capacitance of the logic gates, V_{dd} is the supply voltage, f is the clock frequency, and α is the average switching probability of the logic gate output nodes.

In a pipelined digital design, dynamic power is spent at pipeline stage registers, combinatorial (stateless) logic circuits between stages, and the clock distribution network that distributes the clock signal to the registers. While the clock distribution does not directly contribute to the computation, combined with the pipeline registers, it can form up to 70% of the dynamic power of a modern processor [JBH+05]. In the next subsection, we will see how dynamic power can be reduced by turning off parts of the clock tree.

2.2.1 Clock Gating

As stated in Equation (2.1), the dynamic power of a sequential logic circuit such as a pipeline is directly proportional to the average switching activity of its internal and output nodes. Ideally, no switching activity should be observed if the circuit is not performing any computation; however, this is usually not the case. Pipeline registers and combinatorial logic gates might continue switching even if the inputs and the outputs of the system are stable, due to feedback paths between different pipeline stages.

Clock gating is an optimization procedure for reducing the erroneous switching activity that wastes dynamic power. It selectively freezes the clock inputs of pipeline registers that are not involved in carrying out useful computation, and thus forces them to cease redundant activity. Clock gating can be applied at coarse or fine grain [JBH+05].

Coarse-grained clock gating (at the unit level): All pipeline stages of a unit are gated if there is no instruction or data present in any of the stages. Unit level clock gating has the advantage of simpler implementation. It is also possible to turn off the last few levels of the clock tree along with the register clock inputs. Since a major part of the clock power is dissipated close to the leaf nodes, substantial savings are possible via this method.

Fine-grained clock gating (at the stage level): Only the pipeline stages that are occupied are clocked and the rest are gated. Intuitively, fine-grained clock gating results in larger power savings compared to unit-level gating, especially if a unit is always active but with low pipeline occupancy. However, it is more complex to implement: it might incur a power overhead that offsets the savings and might even require slowing down the clock. For these reasons, fine-grained clock gating might not be suitable for microarchitectural components such as pipelined interconnection networks with simple stages that are distributed across the chip.

Clock gating can reduce the core power by 20-30% [JBH+05], but it also has the drawback of involving difficulties in the testing and verification of VLSI circuits. Insertion of additional logic on the path of the clock signal complicates the verification of timing constraints by CAD tools. Moreover, turning the clock signal of a unit on or off in a short amount of time may lead to large surge currents, reducing circuit reliability and increasing manufacturing costs [LH03].
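To make the unit-level variant concrete, the sketch below shows one way a cycle-level simulator could track how long a unit's clock tree stays on; the resulting duty cycle can feed the DUTY_{clk} term of the power model in Section 2.2.2. This is an illustrative Java sketch only; the class and method names are ours and do not correspond to the actual XMTSim interfaces.

public final class UnitClockGate {
    private long totalCycles = 0;
    private long activeCycles = 0;

    // Called once per simulated cycle with the number of instructions or data
    // items currently occupying any pipeline stage of the unit.
    public void tick(int occupancy) {
        totalCycles++;
        if (occupancy > 0) {
            activeCycles++; // clock tree on: at least one stage holds work
        }
        // occupancy == 0: unit-level gating freezes the clock for this cycle
    }

    // Fraction of the sampling period during which the clock tree was running.
    public double clockDutyCycle() {
        return totalCycles == 0 ? 1.0 : (double) activeCycles / totalCycles;
    }

    public static void main(String[] args) {
        UnitClockGate gate = new UnitClockGate();
        for (int c = 0; c < 1000; c++) {
            gate.tick(c % 4 == 0 ? 1 : 0); // assumed toy occupancy pattern
        }
        System.out.printf("DUTY_clk = %.2f%n", gate.clockDutyCycle());
    }
}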

2.2.2 Estimation of Dynamic Power in Simulation

In this section, we will describe how to model dynamic power of a pipelined microarchitectural unit in a high level architectural simulator by only observing its inputs and outputs. We assume that the peak power of the unit is given as a constant (Pdyn,max).

The switching activity of a combinatorial digital circuit is a function of the bit transition patterns at its inputs [Rab96]. However, bit level estimations can be prohibitively expensive to compute and architecture simulators typically take the energy to carry out one computation as a constant. This simplification can also be applied to the pipelined circuits: a pipeline with a single input will be at its peak power if it processes one instruction per clock cycle (maximum throughput).

Under ideal assumptions, a pipeline should consume dynamic power proportional to the work it does. We can approximate work (or activity – ACT, as we call it in Section 4.5.1) as the average number of inputs a unit processes per clock cycle, divided by the number of input ports. However, we have discussed earlier that sequential circuits continue consuming dynamic power even if they are not performing any computation. We would like to model this wasted power in the simulation; therefore, we introduce a parameter, the activity correlation factor (CF). Finally, we express dynamic power as:

P_{dyn} = P_{dyn,max} \cdot ACT \cdot CF + P_{dyn,max} \cdot (1 - CF) \qquad (2.2)

If CF is set to 1, this represents the ideal case where no dynamic power is wasted. The worst case corresponds to CF = 0, for which dynamic power is always constant.

Fine-grained clock gating affects Equation (2.2) by increasing the correlation factor and bringing Pdyn closer to ideal. On the other hand, unit level clock gating (and voltage gating, which we will see in Section 2.3.1) creates a case where Pdyn is 0 if ACT = 0:

P_{dyn} = \begin{cases} 0 & \text{if } ACT = 0, \\ P_{dyn,max} \cdot ACT \cdot CF + P_{dyn,max} \cdot (1 - CF) & \text{if } 0 < ACT \le 1. \end{cases} \qquad (2.3)

If unit level clock gating (or voltage gating) is applied only for a part of the sampling period in a simulation:

P_{dyn} = P_{dyn,max} \cdot ACT \cdot CF + DUTY_{clk} \cdot P_{dyn,max} \cdot (1 - CF) \qquad (2.4)

DUTY_{clk} is the duty cycle of the unit clock, i.e., the fraction of the time that the clock tree of the unit is active.
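As an illustration of Equations (2.2)-(2.4), the following Java sketch evaluates the dynamic power of one unit from its peak power, activity, correlation factor and clock duty cycle. The numbers in main() are made-up example values rather than XMT1024 parameters, and the code is an illustration, not an excerpt from XMTSim.

public final class DynamicPowerModel {

    /**
     * Dynamic power of a unit per Equation (2.4).
     * Equation (2.2) is the special case dutyClk = 1 (no unit-level gating),
     * and Equation (2.3) follows for act = 0 with the clock fully gated.
     *
     * @param pDynMax peak dynamic power of the unit, in Watts
     * @param act     inputs processed per cycle divided by input ports, in [0,1]
     * @param cf      activity correlation factor, in [0,1]
     * @param dutyClk fraction of the sampling period the clock tree was active
     */
    public static double dynamicPower(double pDynMax, double act, double cf, double dutyClk) {
        double useful = pDynMax * act * cf;             // work-proportional part
        double wasted = dutyClk * pDynMax * (1.0 - cf); // accrues only while clocked
        return useful + wasted;
    }

    public static void main(String[] args) {
        // Assumed example: 10 W peak, 40% activity, CF = 0.8, clock on 70% of the time.
        System.out.printf("P_dyn = %.2f W%n", dynamicPower(10.0, 0.4, 0.8, 0.7));
    }
}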

2.3 Leakage Power Consumption

An ideal logic gate is not expected to conduct any current in a stable state. In reality, this assumption does not hold and, in addition to the active power, the gate consumes power due to various leakage currents in the transistor switches. Currently, subthreshold leakage power is the dominant one among the various leakage components; however, gate-oxide leakage has also gained importance with the scaling of transistor gate oxide thickness (see Appendix A for details).

Subthreshold leakage power is related to supply voltage (V), temperature (T) and MOS transistor threshold voltage (vth) via a complex set of equations that we review in Appendix A. The following is a simplified form that explains these dependencies:

P_{sub} \propto TECH \cdot \rho(T) \cdot V \cdot \exp(V) \cdot \exp\!\left(-\frac{v_{th0}}{T}\right) \cdot \exp\!\left(-\frac{1}{T}\right) \qquad (2.5)

Here, \exp(.) signifies an exponential dependency of the form \exp(x) = e^{kx}, where k is a constant. TECH is a technology-node-dependent constant which, among other factors, contains the effect of the geometry of the transistor (gate oxide thickness, transistor channel width and length).

The \exp(-v_{th0}/T) term in Equation (2.5) signifies the importance of the threshold voltage for P_{sub}. At low v_{th} values, P_{sub} becomes prohibitively high, which is a limiting factor in technology scaling, as we will discuss in Section 2.7. The temperature-related terms are often aggregated into a super-linear form for normal operating ranges [SLD+03]. The temperature/power relationship implied by this function is a concern for system designers. Strict control of the temperature requires expensive cooling solutions; however, inadequate cooling might create a feedback loop where a temperature increase causes a rise in power and vice-versa. Lastly, the V \cdot \exp(V) term also reflects a strong dependence on supply voltage and is usually approximated by V^2 for typical operating ranges.
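The sketch below turns the simplifications just stated (the V·exp(V) term approximated by V^2, the temperature terms aggregated into a super-linear form) into a scaling rule that rescales a reference leakage value to a new voltage and temperature. The reference point and the exponent are assumptions chosen purely for illustration; they are not calibrated values from this dissertation.

public final class LeakageScaling {
    // Assumed super-linear temperature exponent; a real model would calibrate it.
    private static final double BETA = 1.5;

    /** Rescale a reference subthreshold leakage (W) to a new supply voltage and temperature (K). */
    public static double scale(double pLeakRef, double vRef, double tRef, double v, double t) {
        double voltageTerm = (v / vRef) * (v / vRef);   // V * exp(V) approximated by V^2
        double tempTerm = Math.pow(t / tRef, BETA);     // super-linear temperature dependence
        return pLeakRef * voltageTerm * tempTerm;
    }

    public static void main(String[] args) {
        // Assumed example: 10 W of leakage at 1.0 V and 350 K; the same block at 0.9 V and 370 K.
        System.out.printf("P_sub = %.2f W%n", scale(10.0, 1.0, 350.0, 0.9, 370.0));
    }
}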

The total leakage power of a logic gate depends on its logic state. In different states, different sets of transistors will be off and leaking. Leakage power varies among transistors because of sizing differences reflected in the TECH constant and the threshold voltage differences between pMOS and nMOS transistors.

In optimizing VLSI circuits, high clock speed and low leakage power are usually competing objectives. Faster designs require use of low threshold transistors, which increase leakage power. Most fabrication processes provide two types of gates for the designers to choose from: low threshold (low vth) and high threshold (high vth). CAD tools place low vth gates on critical delay paths that directly affect the clock frequency and use high vth gates for the rest. It was observed that, for most designs with reasonable clock frequency objectives, CAD tools tend to choose gates so that the leakage power is 30% of the total power at maximum power consumption [NS00].

2.3.1 Management of Leakage Power

Voltage gating (also called power gating) is a technique that is commonly incorporated for reducing leakage power. An example is depicted in Figure 2.1. When the sleep signal is high, the sleep transistors are switched off and the core circuit is disconnected from the supply rails. Otherwise, the sleep transistors conduct and the core circuit is connected. For power gating to be efficient, the sleep transistors should have superior leakage characteristics, which can be achieved by using high threshold transistors for the sleep circuit [MDM+95]. The high threshold transistors will switch slower, but this is not a problem in most cases since sleep state transitions can be performed slower than the core clock.

Figure 2.1: Addition of power gating to a logic circuit.

Threshold Voltage Scaling (TVS) is another technique to reduce leakage power. TVS is typically applied to the core logic transistors, unlike power gating, which does not touch the core logic. The general idea of TVS is to take advantage of the dependence of leakage power on the threshold voltage. A higher threshold voltage reduces the leakage power; nevertheless, it also increases the gate delays and hence requires the system clock to be slowed down. The threshold voltage of a transistor can be changed during runtime via the Adaptive Body Bias technique (ABB) [KNB+99, MFMB02] to match a slower reference clock. ABB can be enabled during periods when the system is relatively underloaded or clock speed is not crucial in computation.

Leakage power is also dependent on the supply voltage. Lowering the supply voltage reduces leakage. In Section 2.6, we will review Dynamic Voltage and Frequency Scaling (DVFS), which dynamically adjusts the voltage and frequency of the system in order to choose a different trade-off point between clock speed and dynamic power. DVFS can also be effective in leakage power management.

A caveat of both TVS and DVFS is the fact that they both adjust the clock frequency dynamically, which takes time and can be limiting for fine-grained control purposes. In Section 2.6, we discuss methods for faster clock frequency switching.

2.3.2 Estimation of Leakage Power in Simulation

In this section, we introduce a simulation power model for leakage that complements the model for dynamic power in Section 2.2.2.

It was previously mentioned that the leakage power of a logic gate depends on its state. Nevertheless, as for dynamic power, bit-level estimations are unsuitable for high-level simulators and leakage is approximated as a constant which is the average of the values from all states. If the voltage gating technique from Section 2.3.1 is applied, the average leakage power can be computed as:

P_{leak} = DUTY_V \times P_{leak,max} \qquad (2.6)

where DUTY_V is the duty cycle of the unit, i.e., the fraction of time that it is not voltage gated, and P_{leak,max} is the maximum leakage power, a constant in simulation.
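For instance, under assumed numbers, a unit with P_{leak,max} = 5 W that is voltage gated for 60% of a sampling period contributes

P_{leak} = DUTY_V \times P_{leak,max} = 0.4 \times 5\,\mathrm{W} = 2\,\mathrm{W}.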

2.4 Power Consumption in On-Chip Memories

Most on-chip memories (caches, register files, etc.) are implemented with Static Random Access Memory (SRAM) cells, which are essentially subject to the same power consumption and modeling equations as the CMOS logic circuits. Details of power modeling for SRAM memories can be found in [MBJ05] and [BTM00]. In the context of the model given in Section 2.2.2, the dynamic activity of an SRAM memory (ACT_{mem}) can be expressed as:

ACT_{mem} = \frac{\text{Number of requests}}{\text{Maximum number of requests}} \qquad (2.7)

The maximum number of requests is equal to the number of memory access ports times the clock cycles in the measured time period.
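As a small worked instance of Equation (2.7) with assumed numbers: a cache bank with two access ports observed over 10,000 cycles can serve at most 20,000 requests; if it served 5,000 of them, its activity is

ACT_{mem} = \frac{5{,}000}{2 \times 10{,}000} = 0.25.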

The switching activity of caches is typically not as high as that of the core logic; hence, the dynamic power of caches is usually not significant compared to the rest of the chip. However, while the logic circuits can be turned off during inactive phases to save leakage power, caches usually have to be kept alive in order to retain their data. As a result, energy due to the leakage power of caches can add up to significant amounts over time. A solution is the threshold voltage scaling that was previously mentioned in Section 2.3.1. A cache that is in a low-power stand-by mode preserves its state; however, returning to an active state, in which data can be read from it again, may require an overhead. DVFS, which we will discuss in Section 2.6, is another technique to reduce cache leakage, with the same overhead issue.

2.5 Power and Clock Speed Trade-offs

The delay of a logic gate (t_d) is a function of its supply voltage, the transistor threshold voltage and a technology dependent constant, a (for details see Appendix A.2):

t_d \propto \frac{V_{dd}}{(V_{dd} - v_{th})^{a}} \qquad (2.8)

In pipelined synchronous logic, the clock period is determined by the worst-case combinatorial path (a chain of stateless gates) between any pair of pipeline registers.¹ If the supply and the threshold voltages are adjusted globally, this will affect the worst-case path along with the rest of the chip. Therefore, the clock period is directly proportional to the factors that change gate delay. Clock frequency (f_{clock}), which is the inverse of the clock period, can then be expressed from Equation (2.8) as follows:

f_{clock} \propto \frac{(V_{dd} - v_{th})^{a}}{V_{dd}} \approx V_{dd} \qquad (2.9)

where a is set to the typical value of 2 and we assume that V_{dd} \gg v_{th}.

The following relationship between power and clock frequency can be deduced from the above equation and Equation (2.1) (P_{sw} \propto V_{dd}^2 \cdot f). Assume a digital circuit that is optimized for power, i.e., the lowest supply voltage is chosen for the desired clock frequency. If the design constraints can be relaxed in favor of a slower clock and a lower V_{dd}, dynamic power consumption decreases proportionally to V_{dd}^3. V_{dd} is upper-bounded by velocity saturation and lower-bounded by noise margins. Lowering V_{dd} while keeping v_{th} constant increases noise susceptibility due to the shrinking value of V_{dd} - v_{th}.
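Making this step explicit by combining Equations (2.1) and (2.9):

P_{sw} \propto V_{dd}^{2}\, f \quad \text{and} \quad f \propto V_{dd} \quad \Longrightarrow \quad P_{sw} \propto V_{dd}^{3}.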

In order to reduce power, one can lower the supply and the threshold voltages together and still be able to keep the clock frequency at the same value or lower. This was the main driver of technology scaling for two decades, until the practical limit of threshold voltage scaling was reached. The limit is basically due to the leakage power: in Section 2.3 (Equation (2.5)), the subthreshold leakage was shown to depend exponentially on the threshold voltage.

Equation (2.9) relates the clock frequency to the supply and the threshold voltages for a fixed technology node. Between technology nodes, the die area that the same circuit occupies shrinks because of transistor feature scaling. Lower transistor and wire area induce proportionally lower capacitance and gate delay. As per intuition, we can say that smaller feature sizes reduce the electrical charge required to switch the logic states, and hence the time it takes to charge/discharge with the same drive strength.

¹A detailed discussion of pipelining is beyond the scope of this introduction and can be found in textbooks such as [Rab96].

The channel width (W) and length (L) are the most typical (and non-trivial) parameters in optimizing the performance of a single gate at the transistor level. The drive current of a MOS transistor is directly proportional to the W/L ratio and, consequently, so is its ability to switch the state of the next transistor in the chain. But increasing W (assuming L is kept minimum for smaller sizes) adversely affects the parasitic/load capacitances in the system, which, in turn, might slow down other parts of the circuit and also increase dynamic power consumption. Moreover, W/L is one of the factors that affect leakage power. Logic synthesis tools usually include circuit libraries that are W/L optimized for performance, so transistor sizing, in most cases, is not of concern to system designers.

2.6 Dynamic Scaling of Voltage and Frequency

Dynamic voltage and/or frequency scaling (DVFS) is routinely incorporated in recent processor designs as a technique for dynamically choosing a trade-off point between clock speed and power. As we will show in Chapter 7, DVFS can be used to resolve thermal emergencies without having to halt the computation and can also reduce the total task energy in an energy-constrained environment.


Figure 2.2: Demonstration of dynamic energy savings with DVFS. (a) A task finishes on a serial core in time T at a 1GHz clock and 1.2V supply voltage. (b) The same task takes twice the time at half the clock frequency but consumes 1/4 of the initial energy.

Figure 2.2 demonstrates the power reduction and energy savings made possible by scaling the voltage and frequency of a serial core running a single task. At a clock frequency of F GHz and a supply voltage of V, the task finishes in a time duration of T. Assume that the frequency and voltage are lowered to half of their initial values. At the new frequency, the power scales down by 1/8 and the task takes twice the time to finish. The total energy will be reduced to 1/8 × 2 = 1/4 of the initial energy. It should be noted that this example is excessively optimistic in assuming (a) that the power consists only of the dynamic portion, and (b) that the voltage can scale at the same rate as the frequency.
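The arithmetic behind the figure, written out under the idealized assumptions above (purely dynamic power, voltage scaling together with frequency):

E = P \cdot T, \qquad P' \propto \left(\frac{V_{dd}}{2}\right)^{2} \frac{f}{2} = \frac{P}{8}, \qquad T' = 2T \quad \Longrightarrow \quad E' = P'\,T' = \frac{P}{8} \cdot 2T = \frac{E}{4}.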

Scaling of voltage lowers leakage power as well, but at a rate slower than it does for dynamic power (see Equation (2.5)). In some cases it might be more beneficial to finish computation faster and use voltage gating introduced in Section 2.3.1 to cut off leakage power for the rest of the time.

From a simulation point of view DVFS is characterized via two parameters: the switching overhead and the voltage-frequency (VF) curve. Next, we will elaborate on these factors.

Switching overhead. The overhead of DVFS depends on the implementation of the voltage and frequency switching mechanisms. As a rule of thumb, if the voltage or frequency can be chosen from a continuous range of values, more efficient algorithms can be implemented. However, continuous voltage and frequency converters may require a significant amount of area and power, and the time for a transition to occur can be on the order of microseconds, milliseconds or more [KGyWB08, FWR+11]. Continuous converters are usually suitable for global control mechanisms (as in [MPB+06]), whereas for multi- and many-core processors the cost of implementing a continuous converter per core can be prohibitively expensive. Moreover, because of the high time overhead, fine-grained application of continuous DVFS may not be effective.

A fast switching mechanism (with an overhead on the order of a few clock cycles) allows choosing from a limited number of frequencies and voltages. For the clock frequency, switching is done either by choosing one of multiple constant clock generators or by using frequency dividers on a reference clock. Voltage is usually switched between multiple existing voltage rails. Example implementations can be found in [LCVR03, TCM+09, Int08b, AMD04, FWR+11].

VF curve. The bulk of the savings in DVFS comes from the reduction in voltage whenever frequency is scaled down. We used the published data on the Intel Pentium M765 and AMD Athlon 4000+ processors [LLW10] to determine the minimum feasible voltage for a given clock frequency in GHz. The VF curves for both processors are plotted in the same graph in Figure 2.3. The data was fitted to the following formula via linear regression:

V = 0.22 f + 0.86 \qquad (2.10)

where f is the clock frequency in GHz and V is the voltage in Volts. We use this relation in the implementation of DVFS for our simulator.

Figure 2.3: VF curve for Pentium M765 and AMD Athlon 4000+ processors.
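The Java sketch below shows one way Equation (2.10) could drive a DVFS model: a requested frequency is quantized to one of a small set of discrete levels and the corresponding minimum feasible voltage is derived from the fitted curve. The frequency levels are assumptions chosen for illustration and do not correspond to the settings used in our experiments.

public final class VfCurve {
    // Hypothetical discrete clock levels in GHz (illustrative only).
    private static final double[] FREQ_LEVELS_GHZ = {0.4, 0.8, 1.2, 1.6, 2.0};

    /** Minimum feasible supply voltage (V) for a clock frequency in GHz, per Equation (2.10). */
    public static double voltageFor(double freqGHz) {
        return 0.22 * freqGHz + 0.86;
    }

    /** Largest available level that does not exceed the requested frequency. */
    public static double quantize(double requestedGHz) {
        double chosen = FREQ_LEVELS_GHZ[0];
        for (double level : FREQ_LEVELS_GHZ) {
            if (level <= requestedGHz) {
                chosen = level;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        double f = quantize(1.5); // a 1.5 GHz request settles at the 1.2 GHz level
        System.out.printf("f = %.1f GHz, V = %.2f V%n", f, voltageFor(f));
    }
}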

2.7 Technology Scaling Trends

The microprocessor industry has been following a trend that has survived for the majority of the past 45 years: in 1965 Gordon Moore observed that the number of transistors on a die doubles with every cycle of process technology improvement (which is approximately 2 years). This trend was made possible by the scaling of transistor dimensions by 30% every generation (die size has been growing as well, but at a slower rate).

What translated “Moore’s Law” into performance scaling of serial processors was the simultaneous scaling of power. With every generation, circuits ran 40% faster, transistor integration doubled and power stayed the same. Below is a summary of how that was made possible.

Area: 30% reduction in transistor dimensions (0.7x scaling). Area scales by 0.7 × 0.7 ≈ 0.5.

Capacitance: 30% reduction in transistor dimensions and t_{ox}. Total capacitance scales by C_{ox} × W/L = 0.7 × 0.7/0.7 = 0.7. Fringing capacitance also scales 0.7x (proportional to wire lengths).

Speed: Transistor delay scales 0.7x. Speed increases 1.4x.

Transistor power: Voltage is reduced by 30%. P_{sw} \propto C_L V_{dd}^{2} f = 0.7 \times 0.7^{2} \times 1.4 \approx 0.5.

Total power: Power per transistor is scaled 0.5x and the transistor count is doubled. P_{total} \propto 0.5 \times 2 = 1.

In addition to the 40% boost in clock speed above, the doubling of the transistor count is also reflected in the performance via Pollack's Rule. Pollack's Rule [BC11] suggests that doubling the transistor count will increase performance by √2 due to microarchitectural improvements.
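Combining the two effects gives the per-generation growth implied by the ideal scaling recipe above:

\text{speedup per generation} \approx 1.4 \times \sqrt{2} \approx 2.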

A few points should be noticed in the above discussion: it assumes that the transistor power consists only of switching power (Equation (2.1), with α constant) and that the reduction in supply voltage does not reduce drive power. The former was reasonable before leakage power became significant, and the latter was maintained by reducing the threshold voltage along with the supply voltage. As we have seen, these assumptions do not hold in the deep sub-micron technologies.

It was the exponential relationship between threshold voltage and subthreshold leakage power (Equation (2.5)) that broke the recipe above. Threshold voltage can no longer be scaled down without significantly increasing the leakage power. Thus, keeping the drive power (and the speed) constant requires the supply voltage to stay constant as well. If the supply voltage is not scaled, clock frequency cannot be boosted without increasing dynamic power. Note that Moore's Law is still being followed today (even though it has its own challenges), but performance growth can no longer rely on clock frequency and inefficient microarchitectural techniques. Instead, silicon resources are used towards increasing on-chip parallelism. Parallel machines can provide performance in the form of multi-tasking; however, increasing the performance of a single task no longer comes at no cost to the programmers, since the programs have to be parallelized.

2.8 Thermal Modeling and Design

Temperature rises on a die as a result of heat generation, which is due to the power dissipated by the circuits on it. The relationship between power and temperature is often modeled analogously to the voltage and current relationship in a resistive/capacitive (RC) circuit [SSH+03]. Figure 2.4(a) shows such an equivalent RC circuit. i is the input current and V is the output voltage. The counterparts of i and V for thermal modeling are power and temperature, respectively. Rdie and Rhs are the die and heat sink thermal resistances (K/W is the unit of thermal resistance). Temperature responds gradually to a sudden change in power, as voltage does with current in RC circuits. The gradual response of temperature to power filters out short spikes in power, which is called temporal filtering.

The example in Figure 2.4(c) demonstrates the case where the power is a rectangle function (two step functions). The initial phase before temperature stabilizes is named the transient-response and the stable value that comes after the transient is the steady-state response. In the steady-state, the circuit is equivalent to a pure resistive network since capacitances act as open circuits when they are charged. The resistive equivalent of Figure 2.4(a) is given in Figure 2.4(b). As a result, steady-state temperature is proportional to the power (V = i · R). In typical microchips, the transient response may last in the order of milliseconds.
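The Java sketch below integrates the single-node analogy of Figure 2.4(a) with an explicit Euler step: power plays the role of current, temperature the role of voltage, and the R and C values set the steady-state level and the length of the transient. All constants are illustrative assumptions (they are not the XMT1024 package parameters used later), and the code is independent of HotSpot's distributed model.

public final class LumpedRcThermal {
    public static void main(String[] args) {
        double rThermal = 0.5;   // lumped die-to-ambient thermal resistance (K/W), assumed
        double cThermal = 50.0;  // lumped thermal capacitance (J/K), assumed
        double tAmbient = 45.0;  // ambient temperature (C)
        double power    = 100.0; // step power input (W)
        double dt       = 1e-3;  // integration time step (s)

        double temp = tAmbient;  // die starts in thermal equilibrium with the ambient
        for (int step = 0; step < 10_000; step++) {
            // dT/dt = (P - (T - T_amb)/R) / C  -- the RC analogy of Figure 2.4(a)
            temp += dt * (power - (temp - tAmbient) / rThermal) / cThermal;
        }
        // The steady state approaches T_amb + P*R = 45 + 100*0.5 = 95 C; after 10 s
        // (time constant R*C = 25 s) the transient is still far from settled.
        System.out.printf("T after 10 s = %.1f C%n", temp);
    }
}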

The discussion above focused on the time-response of the temperature to a point heat source. It is also important to analyze the spatial distribution of temperature on the silicon die where the circuits are printed. A common microchip is a three dimensional structure that consists of a silicon die, a heat spreader and a heat sink. Temperature estimation tools such as the one in [SSH+03] model this structure as a distributed RC network, solution methods for which are well known. Figure 2.5 is an example of a chip with the cooling system and its RC equivalent.

Figure 2.4: Modeling of temperature and heat analogous to RC circuits. (a) Equivalent RC circuit. (b) Equivalent RC circuit in steady-state. (c) Time response of the circuit to a rectangle function (the combination of a rising and a falling step function).

In this thesis, we only consider cooling systems that are mounted on the surface of the chip packaging, heatsinks, which are dominant in commercial personal computers. The efficiency of a cooling system is measured in the amount of power density that it can remove while keeping the chip below a feasible temperature. The most affordable of surface-mounted systems is air cooling and it is estimated to last in the market until the power densities reach approximately 1.5W/mm2 [Nak06]. Currently, 0.5W/mm2 is typical for high-end processors (see next section). An equivalent measure of the efficiency of a heatsink is the convection resistance (Rc) between the heatsink and the environment.

We will give typical values for Rc in Section 2.9.

Temperature constraints can be very different than power constraints, especially if power is distributed unevenly on the die. Temperature rises more at areas that have higher power density. The hottest areas of a chip are termed thermal hot-spots, and the cooling system should be designed so that it is able to remove the heat from hot-spots. Even though the hot-spots dictate the cost of the cooling mechanism, they might cover only a small portion of the chip area, which is the reason for the difference between power and thermal constraints. For example, assume that a 200mm2 chip has an average power density of 0.1W/mm2 and a hot-spot that dissipates 0.5W/mm2 (numbers are chosen for illustrative purposes). The total power of the chip is 20W; however, the cooling mechanism should be chosen for 0.5W/mm2 and be capable of removing 100W. This observation motivated a large body of research in managing chip temperature, which we will review in Chapter 7.

Figure 2.5: (a) Side view of a typical chip with packaging and heat sink. (b) Simplified RC model for the chip.

Hot-spots can be severe especially in superscalar architectures, where the power dissipated by different architectural blocks can vary by large amounts. An example is demonstrated for an IBM PowerPC 970 processor (based on the Power 4 system [BTR02]) in Figure 2.6 [HWL+07]. The PowerPC 970 core consists of a vector engine, two FXUs (fixed-point integer units), an ISU (instruction sequencing unit), two FPUs (floating-point units), two LSUs (load-store units), an IFU (instruction fetch unit) and an IDU (instruction decode unit). The thermal image was taken during the execution of a high power workload and clearly shows that there is a drastic temperature difference between the core and the caches. This behavior is quite common in processors where caches and logic-intensive cores are isolated.

In many-cores with simpler computing cores and distributed caches, the cores can be scattered across the chip, alternating with cache modules. Due to the spatial filtering of temperature [HSS+08], the issue of thermal hot-spots may not be as critical for such floorplans. Basically, when high power small heat sources (i.e., cores) are padded with low power spaces in between (i.e., caches) the temperature spreads evenly. The peak temperature will not be as high as if the cores were to be lumped together. The floorplan that we propose for XMT in Section 7.3 is motivated by this observation.

Figure 2.6: Thermal image of a single core of the IBM PowerPC 970 processor. The thermal image and floorplan overlay are taken from [HWL+07].

2.9 Power and Thermal Constraints in Recent Processors

Processor cooling systems are usually designed for the “typical worst-case” power consumption. It is very rare that all, or even most, of the sub-systems of a processor are at their maximum activity simultaneously. A 1W increase in the specifications of the cooling system costs on the order of $1-3 or more per chip when average power exceeds 40W [Bor99]. Therefore, it would not be cost efficient to design the cooling system for the absolute worst-case. The highest feasible power for the processor is defined as the thermal design power (TDP). Most low to mid-grade computer systems implement an emergency halt mechanism to deal with the unlikely event of exceeding the TDP. More advanced processors continuously monitor and control temperature.

Table 2.1 is a survey of a representative set of commercial processors as of 2011. The GPUs (GTX 580 and Radeon HD 6970) are many-core processors with lightweight cores and the remainder are more traditional multi-cores. The power densities range from 0.44W/mm2 to 0.64W/mm2. The maximum temperature listed for all of these processors is close to 100°C.

The value of the heatsink convection resistance is a controlled parameter in our simulations, reflecting the effect of low-, mid- and high-grade air cooling mechanisms. These values are 0.5K/W, 0.1K/W, and 0.05K/W, which are representative of commercial products.
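For a rough sense of scale (our own illustrative numbers, using the standard steady-state relation between heatsink power and temperature rise), the temperature difference between the heatsink and the ambient is ΔT = Rconv · P; for a chip dissipating P = 100W, the three resistances above correspond to ΔT values of 50K, 10K and 5K, respectively.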

Processor                    TDP      Max. Clock Freq.   Die Area   Cores
Core i7-2600 [Inta]          95W      1.35GHz            216mm2     4
Power7 750 [BPV10]           ∼350W    3.55GHz            567mm2     8
Phenom II X4 840 [AMD10a]    95W      3.2GHz             169mm2     4
GTX 580 [NVIb]               244W     1.5GHz             520mm2     512
Radeon HD 6970 [AMD10b]      250W     880MHz             389mm2     1536

Table 2.1: A survey of thermal design powers.

2.10 Tools for Area, Power and Temperature Estimation

Academia and industry have developed a number of tools to aid researchers who develop simulators for early-stage exploration of architectures. We list the ones that stand out as they are highly cited in research papers. Architecture simulators, including our own XMTSim described in Chapter 4, can interface with these tools to generate runtime power and temperature estimates.

Cacti [WJ96, MBJ05] and McPAT [LAS+09] estimate the latency, area and power of processors under user-defined constraints. More specifically, Cacti models memory structures, mainly caches, and McPAT explores full multi-core systems. McPAT internally uses Cacti for caches and projects the rest of the chip from existing commercial processors. McPAT cannot be configured to estimate the power of an XMT chip directly; however, it can still be used to generate parameters for microarchitectural components simulated in XMTSim, including execution units and register files. Section 6.1 is an example of how McPAT and Cacti outputs can be used in XMTSim.

HotSpot [HSS+04, HSR+07, SSH+03] is an accurate and fast thermal model that can be included in an architecture simulator to generate the input for thermal management algorithms or for other stages that are temperature dependent, for example leakage power estimation. It views the chip area as a finite number of blocks, each of which is a two-dimensional heat source. In order to solve for the temperatures based on the power values, it borrows from the concepts that describe the voltage and current relationships in distributed RC (resistive/capacitive) circuits. HotSpot has inherent shortcomings because of the difficulty of estimating, or even directly measuring, the temperature in complex systems, and it can only model air cooling solutions. However, it is still the most frequently used publicly-available temperature model for academic high-level simulators.
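In its simplest single-node form (our own summary of the standard RC analogy rather than HotSpot's full distributed model), the temperature T of a block with thermal resistance R and thermal capacitance C dissipating power P evolves as C · dT/dt = P − (T − Tamb)/R, the thermal analogue of a capacitor charged by a current source through a resistor.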

HotLeakage [ZPS+03] is an architectural model for subthreshold and gate leakage in MOS circuits. As its input, it takes the die temperature, supply voltage, threshold voltage and other process technology dependent parameters and estimates the leakage variation based on these inputs. In Section 2.3, we have briefly reviewed the leakage power equation which was derived from the HotLeakage model.

Chapter 3

The Explicit Multi-Threading (XMT) Platform

The primary goal of the eXplicit Multi-Threading (XMT) general-purpose computer platform [NNTV01, VDBN98] has been improving single-task performance through parallelism. XMT was designed from the ground up to capitalize on the huge on-chip resources becoming available in order to support the formidable body of knowledge known as Parallel Random Access Machine (PRAM) algorithmics, and the latent, though not widespread, familiarity with it. Driven by the repeated programming difficulties of parallel machines, ease-of-programming was a leading design objective of XMT. The XMT architecture has been prototyped on a field-programmable gate array (FPGA) as a part of the University of Maryland PRAM-On-Chip project [WV08a, WV07, WV08b]. The FPGA prototype is a 64-core, 75MHz computer. In addition to the FPGA, the main medium for running XMT programs is a highly configurable cycle-accurate simulator, which is one of the contributions of this dissertation.

The PRAM model of computation [JáJ92, KR90, EG88, Vis07] was developed during the 1980s and early 1990s to address the question of how to program parallel algorithms and proved to be very successful on an abstract level. PRAM provides an intuitive abstraction for developing parallel algorithms, which led to an extremely rich algorithmic theory, second in magnitude only to its serial counterpart for the “von Neumann” architecture. Motivated by its success, a number of projects attempted to carry PRAM to practice via multi-chip parallelism. These projects include the NYU Ultracomputer [GGK+82] and the Tera/Cray MTA [ACC+90] in the 1980s and the SB-PRAM [BBF+97, KKT01, PBB+02] in the 1990s. However, the bottlenecks caused by the speed, latency and bandwidth of communication across chip boundaries made this goal difficult to accomplish (as noted in [CKP+93, CGS97]). It was not until the 2000s that technological advances allowed fitting multiple computation cores on a chip, relieving the communication bottlenecks.

Figure 3.1: Overview of the XMT architecture.

The first three sections of this chapter give an overview of the XMT architecture, its programming and its performance advantages. Section 3.5 discusses various aspects of XMT from a power efficiency and management perspective.

3.1 The XMT Architecture

The XMT architecture, depicted in Figure 3.1, consists of an array of lightweight cores, the Thread Control Units (TCUs), and a serial core with its own cache (the Master TCU). TCUs are arranged in clusters, which are connected to the shared cache layer by a high-throughput Mesh-of-Trees interconnection network (MoT-ICN) [BQV09]. Within a cluster, a compiler-managed Read-Only Cache is used to store values that are constant across all threads. TCUs incorporate dedicated lightweight ALUs, but the more expensive Multiply/Divide Units (MDUs) and Floating-Point Units (FPUs) are shared by all TCUs in a cluster.

The memory hierarchy of XMT is explained in detail in Section 3.1.1. The remaining on-chip components are an instruction and data broadcast mechanism, a global register file and a prefix-sum unit. Prefix-sum (PS) is a powerful primitive similar in function to the NYU Ultracomputer Fetch-and-Add [GGK+82]; it provides constant, low overhead inter-thread coordination, a key requirement for implementing efficient intra-task parallelism (see Section 3.2.3).

Figure 3.2: Bit fields in an XMT memory address (bits 31–12: address inside the cache module; bits 11–5: module ID; bits 4–0: address within the cache line).

3.1.1 Memory Organization

The XMT memory hierarchy does not include private writable caches (except for the Master TCU). The shared cache is partitioned into mutually exclusive modules, sharing several off-chip DRAM memory channels. A single monolithic cache with many read/write ports is not an option since the cache speed is inversely proportional to the cache size and number of ports. Cache coherence is not an issue for XMT as the cache modules are mutually exclusive in the addresses that they can accommodate. Caches can handle new requests while buffering previous misses in order to achieve better memory access latency.

XMT is a Uniform Memory Access (UMA) architecture: all TCUs are conceptually at the same distance from all cache modules, connected to the caches through a symmetrical interconnection network. TCUs also feature prefetch buffers, which are utilized via a compiler optimization to hide memory latencies.

Contiguous memory addresses are distributed among cache modules uniformly to avoid memory hotspots. They are distributed to shared cache modules at the granularity of cache lines: addresses from consecutive cache lines reside in different cache modules. The purpose of this scheme is to increase cache parallelism and reduce conflicts on cache modules and DRAM ports. Figure 3.2 shows the bit fields in a 32-bit memory address for an XMT configuration with 128 cache modules and 32-byte cache lines. The least significant 5 bits are reserved for the address within the cache line. The next 7 bits are reserved for addressing cache modules, and the remainder is used for the address inside a cache module.
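As an illustration, the following C sketch decomposes a 32-bit address according to the bit fields of Figure 3.2. The field widths are the ones given above (5 line-offset bits, 7 module-ID bits); the macro and function names are ours and not part of the XMT toolchain.

    #include <stdint.h>

    #define LINE_OFFSET_BITS 5   /* 32-byte cache lines            */
    #define MODULE_ID_BITS   7   /* 2^7 = 128 shared cache modules */

    /* Cache module that serves the address: bits [11:5]. */
    static inline uint32_t cache_module_id(uint32_t addr) {
        return (addr >> LINE_OFFSET_BITS) & ((1u << MODULE_ID_BITS) - 1);
    }

    /* Address inside the selected module: bits [31:12] concatenated with the line offset. */
    static inline uint32_t address_in_module(uint32_t addr) {
        uint32_t line   = addr >> (LINE_OFFSET_BITS + MODULE_ID_BITS);
        uint32_t offset = addr & ((1u << LINE_OFFSET_BITS) - 1);
        return (line << LINE_OFFSET_BITS) | offset;
    }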

3.1.2 The Mesh-of-Trees Interconnect (MoT-ICN)

The interconnection network of XMT complements the shared cache organization in supporting the memory traffic requirements of XMT programs. The MoT-ICN is specifically designed to support irregular memory traffic and, as such, contributes considerably to the ease-of-programming and performance of XMT. It is guaranteed that, unless the memory access traffic is extremely unbalanced, packets between different sources and destinations will not interfere. Therefore, the per-cycle throughput provided by the MoT network is very close to its peak throughput and it displays low contention under scattered and non-uniform requests.

Figure 3.3: The concept of Mesh-of-Trees demonstrated on a 4-in, 4-out configuration (a fan-out layer, an N x N connection and a fan-in layer).

Figure 3.4: Building blocks of the MoT-ICN: (a) fan-out tree, (b) fan-in tree.

Figure 3.3 is a high-level overview of the 4-to-4 MoT topology. The building blocks of the MoT-ICN, binary fan-out and fan-in trees, are depicted in Figures 3.4(a) and 3.4(b). The network consists of a layer of fan-out trees at its inputs and a layer of fan-in trees at its outputs. For an n-to-n network, each fan-out tree is a 1-to-n router and each fan-in tree is an n-to-1 arbiter. The fan-in and fan-out trees are connected so that there is a path from each input to each output. There is a unique path between each source and each destination. This simplifies the operation of the switching circuits and allows a faster implementation, which translates into improved throughput when a path is pipelined. More information about the implementation of the MoT-ICN can be found in [Bal08].

3.2 Programming of XMT

3.2.1 The PRAM Model

A parallel random access machine employs a collection of synchronous processors. Processors are assumed to access a shared global memory in unit time. In addition, every processor contains a local memory (i.e., registers) for calculations within the thread it executes. Among the various strategies for resolving access conflicts to the shared memory, the arbitrary CRCW (concurrent read, concurrent write) and QRQW (queue read, queue write) rules are relevant to our discussion, since the XMT processor follows a model that is a hybrid of the two. Under the arbitrary CRCW rule, concurrent write requests by multiple processors result in the success of an arbitrary one. Concurrent reads are assumed to be allowed at no expense. The prefix-sum instruction of XMT, explained in Section 3.2.3, provides CRCW-like access to memory. On the other hand, the QRQW rule does not allow concurrent reads or writes; instead, requests that arrive at the same time are queued in an arbitrary order. All memory accesses in XMT other than prefix-sum are QRQW-like. These two strategies have the same consequence, that is, only one succeeds among all requests that arrive at the same time. However, the execution mechanism and latencies are different (i.e., concurrent access versus queuing).

The next example shows a short code snippet that adds one to each element in array B and writes the result into array A. Arrays A and B are assumed to be of size n. According to the PRAM model, this operation executes in constant time, given that each parallel thread i is carried out by a separate processor.

for 1 ≤ i ≤ n do in parallel
    A(i) := B(i) + 1

XMT aims to give performance that is proportional to the theoretical performance of PRAM algorithms. The programmer's workflow of XMT provides the means to start from a PRAM algorithm and produce a performance-optimized program [Vis11], analogous to the traditional programming of serial programs. XMT does not try to implement PRAM exactly. In particular, certain assumptions of PRAM are not realistic from an implementation point of view. These assumptions are:

• Constant access time to the shared memory. While this assumption is unrealistic for any computer system, XMT attempts to minimize access time by shared caches specifically designed for serving multiple misses simultaneously. True constant access time on a limited number of global registers is implemented via prefix-sum operation (see Section 3.2.3).

• Unlimited number of virtual processors. XMT programs are independent of the number of processors in the system. Hardware automatically schedules the threads to run on a limited number of physical TCUs.

• Lockstep Synchronization. The lockstep synchronization of each parallel step in PRAM is not feasible for large systems from a performance and power point of view. The XMT programming model relaxes this specification of PRAM.

3.2.2 XMTC – Enhanced C Programming for XMT

The parallel programs of XMT are written in XMTC, a modest extension of the C language with alternating serial and parallel execution modes. The spawn statement introduces parallelism in XMTC. It is a type of parallel “loop” whose “iterations” can be executed in parallel. It takes two arguments, low and high, and a block of code, the spawn block. The block is concurrently executed on (high-low+1) virtual threads. The ID of each virtual thread can be accessed using the dollar sign ($) and takes integer values within the range low ≤ $ ≤ high. Variables declared in the spawn block are private to each virtual thread. All virtual threads must complete before serial execution resumes after the spawn block. In other words, a spawn statement introduces an implicit synchronization point. The number of virtual threads created by a spawn statement is independent of the number of TCUs in the XMT system. XMT allows concurrent instantiation of as many threads as the number of available processors. Threads are efficiently started and distributed thanks to the use of prefix-sum for fast dynamic allocation of work, and to a dedicated instruction broadcast bus. The high-bandwidth interconnection network and the low-overhead creation of many threads facilitate effective support of fine-grained parallelism. An algorithm designed following the XMT workflow [Vis11] permits each virtual thread to progress at its own speed, without ever having to busy-wait for other virtual threads.
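For instance, the PRAM snippet of Section 3.2.1 maps to the following XMTC fragment (a minimal sketch in the syntax described above; N is assumed to be a constant and the arrays to be initialized elsewhere):

    int A[N], B[N];
    spawn(0, N-1) {
        A[$] = B[$] + 1;   /* each of the N virtual threads handles one element */
    }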

3.2.3 The Prefix-Sum Operation

In XMTC programming, PS operations are typically used for load balancing and inter-thread synchronization. They are also implicitly used in thread scheduling, which is explained in Section 3.3. PS is an atomic operation that increments the value in a global base register G by the value in a local thread register R and overwrites the value in R with the previous value of G.

G_{n+1} ← G_n + R_n
R_{n+1} ← G_n

Even though this operation is similar to Fetch-and-Add in [GGK+82], its novelty lies in the fact that a PS operation completes execution in constant time, independent of the number of parallel threads concurrently trying to write to that location. The PS hardware only ensures atomicity but the commit order is arbitrary for concurrent requests. For example, a group of concurrent prefix-sum operations from different TCUs with local registers 0, 1 and 2 to a global base register G will result in

G_{n+1} ← G_n + R_{n,0} + R_{n,1} + R_{n,2}
R_{n+1,0} ← G_n
R_{n+1,1} ← G_n + R_{n,0}
R_{n+1,2} ← G_n + R_{n,0} + R_{n,1}

Any other order is also possible, as in

G_{n+1} ← G_n + R_{n,0} + R_{n,1} + R_{n,2}
R_{n+1,2} ← G_n
R_{n+1,0} ← G_n + R_{n,2}
R_{n+1,1} ← G_n + R_{n,2} + R_{n,0}

Due to hardware implementation challenges, the increment values are limited to 0 and 1. Also, the base register can only be a global register. XMT also provides an unrestricted version of PS, prefix-sum to memory (PSM), for which the base can be any memory address and the increment values are not limited. However, the PSM operation does not give the same performance as PS, since PSM is essentially an atomic memory operation that is serialized with other memory references at the shared caches.
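Functionally (and ignoring the constant-time hardware guarantee), a single PS operation can be modeled by the following C sketch; the function name is ours and the atomicity of the hardware instruction is not captured:

    /* R receives the old value of the base; the base is incremented by R. */
    static inline void ps_model(int *base, int *R) {
        int old = *base;
        *base += *R;     /* in the hardware PS instruction, *R is restricted to 0 or 1 */
        *R = old;
    }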

3.2.4 Example Program

An example of a simple XMTC program, Array Compaction, is provided in Figure 3.5(a). In the program, the non-zero elements of array A are copied into an array B, with the order not necessarily preserved. The spawn instruction creates N virtual threads; $ refers to the unique identifier of each thread. The prefix-sum statement ps(inc,base) is executed as an atomic operation. The base variable is incremented by inc and the original value of base is assigned to the inc variable. The parallel code section ends with an implicit join instruction. Figure 3.5(b) illustrates the flow of an XMT program with two parallel spawn sections.

    int A[N], B[N], base = 0;
    spawn(0, N-1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);
            B[inc] = A[$];
        }
    }

Figure 3.5: (a) XMTC program example: Array Compaction. (b) Execution of a sequence of spawn and join commands, alternating serial and parallel sections.

3.2.5 Independence-of-Order and No-Busy-Wait

XMT programs usually do not include coordination code other than prefix-sums, and threads can only communicate through shared memory and global registers via the CRCW/QRQW memory model. Due to these properties, threads can be abstracted as virtual No-Busy-Wait (NBW) finite state machines, where each TCU progresses at an independent rate without blocking others. The No-Busy-Wait paradigm yields much better performance results than tightly coupled parallel architectures, such as Vector or VLIW processors. In addition to NBW, XMTC programs also exhibit Independence of Order Semantics (IOS): the correctness of an XMTC program should be independent of the progression rate and completion order of individual threads.

3.2.6 Ease-of-Programming

Ease-of-programming (EoP) is a goal that has long eluded the field of parallel programming. The emergence of on-chip parallel computers in the general-purpose domain exacerbates the problem now that parallel architectures are targeting a large programmer base and a wider variety of applications. Among these, programs with irregular memory access and parallelism patterns are frequent in the general-purpose domain and they are considered among the hardest problems in parallel computing today. These programs defy optimizations, such as programming for locality, that are common in typical many-cores, and require significant programming effort to obtain minor performance improvements. More information and examples on irregular versus regular programs will be given in Section 5.3.

EoP is one of the main objectives of XMT: a considerable amount of evidence was developed on ease of teaching [TVTE10, VTEC09] and improved development time with XMT relative to alternative parallel approaches including MPI [HBVG08], OpenMP [PV11] and CUDA (experiences in [CKTV10]). XMT provides a programmer's workflow for deriving efficient programs from PRAM algorithms, and for reasoning about their execution time [VCL07] and correctness. The architecture of XMT was specifically built to handle irregular parallel programs efficiently, which is one of the reasons for its success in EoP. Complex optimizations are often not needed in order to obtain performance advantages over serial code for these programs.

Figure 3.6: Flowchart for starting and distributing threads.

In a joint teaching experiment between the University of Illinois and the University of Maryland comparing OpenMP and XMTC programming [PV11], none of the 42 students achieved speedups using OpenMP programming on the simple irregular problem of breadth-first search (Bfs) using an 8-processor SMP, but all reached speedups of 8x to 25x on XMT. Moreover, the PRAM/XMT part of the joint course was able to convey algorithms for more advanced problems than the other parts.

3.3 Thread Scheduling in XMT

XMT features a novel lightweight thread scheduling and distribution mechanism. Figure 3.6 provides the algorithm for starting a parallel section, distributing the threads among the TCUs and returning control to the Master TCU upon completion. The detail level of the algorithm roughly matches the assembly language, meaning that each stage in the algorithm corresponds to an assembly instruction. The most important of these mechanisms is the prefix-sum (PS), on which thread scheduling is based.

The algorithm in Figure 3.6 is for one spawn block with thread ID (TID) numbers from IDfirst to IDlast (recall that XMT threads are labeled with contiguous ID numbers). The Master TCU first initializes two special global registers, TIDlow and TIDhigh. These two registers contain the range of ID numbers of the threads that have not yet been picked up by TCUs. Spawn broadcasts the parallel instructions and sends a start signal to all TCUs present in the system. The Master TCU broadcasts the value of IDfirst by embedding it into the parallel instructions. Each TCU starts executing parallel code with an assigned thread ID of $ = IDfirst + TCUID, so that the initial thread IDs across the TCUs range from IDfirst to IDfirst + N − 1, where N is the number of TCUs. TCUID is the hardcoded identifier of a TCU, a number between 0 and N − 1. TCUs with invalid thread IDs (i.e., $ > TIDhigh) will wait for new work to become available (it is possible that an active thread modifies TIDhigh and therefore dynamically changes the total number of threads to be executed). $ = PS(TIDlow, 1) serves the double purpose of returning a new thread ID for a TCU that just finished a thread and updating the range of threads that are not yet picked up. When all TCUs are idle, which implies TIDlow − TIDhigh = N + 1, the parallel section ends and the Master TCU proceeds with serial execution.
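The per-TCU logic of this flowchart can be paraphrased by the C-style sketch below. This is our own illustration, not XMT assembly: TID_low and TID_high stand for the global registers, PS() for the prefix-sum instruction, and run_thread() and spawn_done() are hypothetical stand-ins for executing a virtual thread and for the Master TCU's end-of-section condition.

    extern volatile int TID_high;                 /* highest thread ID currently available      */
    extern volatile int TID_low;                  /* next thread ID to be handed out            */
    extern int  PS(volatile int *base, int inc);  /* atomic prefix-sum (returns old base value) */
    extern void run_thread(int id);               /* execute one virtual thread                 */
    extern int  spawn_done(void);                 /* Master TCU detected TIDlow - TIDhigh > N   */

    void tcu_run(int TCUID, int ID_first) {
        int id = ID_first + TCUID;                /* initial thread ID ($)                      */
        for (;;) {
            while (id > TID_high) {               /* invalid ID: wait (or sleep) for new work   */
                if (spawn_done()) return;         /* ...or for the parallel section to end      */
            }
            run_thread(id);
            id = PS(&TID_low, 1);                 /* atomically grab the next unassigned thread */
        }
    }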

Figure 3.7 is an example of how seven threads, with IDs from 0 to 6, are started and scheduled on a 4-TCU XMT configuration. Below are the explanations that correspond to each step in the figure.

a) TCUs initially start with thread IDs equal to their TCUIDs; therefore, the threads that are not yet picked up are 4 to 6.

b) In this example, TCU 2 finishes its thread first, so it picks up the next thread ID, 4.

c) Then TCUs 1 and 3 finish simultaneously and pick up threads with IDs 5 and 6 in arbitrary order.


Figure 3.7: Execution of a parallel section with 7 threads on a 4-TCU XMT system. N is the number of TCUs in the system. Refer to text for a detailed explanation of the example.

d) TCU 0 finishes thread 0 and picks up thread ID 7, which is invalid at the moment. TCU 0 starts waiting, in case one of the currently executing threads creates thread 7 by incrementing TIDhigh.

e) The remaining TCUs finish their threads and pick up invalid thread IDs as well.

f) At the final state, all TCUs are idling and TIDlow − TIDhigh = 5 = N + 1. This is the condition to end a parallel section, and control returns to the Master TCU.

3.4 Performance Advantages

It was shown that a cycle-accurate 64-core FPGA hardware XMT prototype [WV07, WV08a] outperforms an Intel Core 2 Duo processor [CSWV09], despite the fact that the Intel processor uses more silicon resources. A comparison of FFT (the Fast Fourier Transform) on XMT and on multi-cores showed that XMT can both get better speedups and achieve them with less application parallelism [STBV09]. In Section 5.4, we will present a study in which we simulate a 1024-core XMT chip that is silicon-area and power equivalent to an NVIDIA GTX280 many-core GPU. We show that, in addition to being easier to program than the GPU, XMT also has the potential of coming out ahead in performance.

Many doubt the practical relevance of PRAM algorithms, and past work provided very limited evidence to alleviate these doubts; [CB05] reported speedups of up to 4x on biconnectivity using a 12-processor Sun machine and [HH10] up to 2.5x on maximum flow using a hybrid CPU-GPU implementation when compared to best serial implementations. New results, however, show that parallel graph algorithms derived from the PRAM theory can provide significantly better speedups than alternative algorithms. These results include potential speedups of 5.4x to 73x on breadth-first search (Bfs) and 2.2x to 4x on graph connectivity when compared with optimized GPU implementations. Also, with respect to best serial implementations on modern CPU architectures, we observed potential speedups of 9x to 33x on biconnectivity [Edw11], and up to 108x on maximum flow [CV11].

3.5 Power Efficiency of XMT and Design for Power Management

Several architectural decisions in XMT, such as the lightweight cores, thread synchronization (via the join mechanism) and thread scheduling, as well as the lack of private local caches (and hence of power-hungry cache coherence), are geared towards energy efficient computation. It is the synergy of the cores, rather than the performance of a single thread, that accelerates a parallel program over a serial counterpart. In the remainder of this section, we overview several aspects of XMT from a power efficiency and management point of view.

3.5.1 Suitability of the Programming Model

In Chapter 7, we will see that efficient dynamic thermal and power management of a parallel processor may require the ability to independently alter the execution of its components (for example, cores). In XMT, execution of the TCUs can be individually modified without creating a global bottleneck. This is facilitated by its programming model, which we explain next.

The No-Busy-Wait paradigm and Independence of Order Semantics of XMT naturally fit the Globally Asynchronous, Locally Synchronous (GALS) style of design [MVF00], in which each one of the clusters operates in a dedicated clock domain. Moreover, the ICN and the shared caches can operate in separate clock domains, as there is no practical reason for any of these subsystems to be tightly synchronized. Clusters, caches and the ICN can easily be designed to interface with each other via FIFOs. As explained above, the relaxed synchrony does not hurt the XMT programming model. In fact, XMT was built on the idea of relaxing the strict synchrony of the PRAM model. In Chapter 7 we will evaluate various clocking schemes for XMT inspired by the ideas above. The programming model of XMT is also suitable for a power-efficient asynchronous ICN, as demonstrated recently in [HNCV10].

As we will discuss in the next section, XMT threads can easily be dispatched to different parts of the die by temporarily preventing the TCUs from requesting threads from the pool, should it be necessary to avoid thermal emergencies. It is important to note that this is possible because a) XMTC programs do not rely on the number of TCUs for correct and efficient implementation, b) threads in XMTC programs are typically short compared to thermal time constants, and c) there is no complex work distribution algorithm that interferes with the routing of threads.

3.5.2 Re-designing Thread Scheduling for Power

The implementation of thread scheduling in the FPGA prototype, reviewed in Section 3.3, is not optimized for power efficiency, and power management was not one of its design specifications. In this section, we list two potential improvements: sleep-wake for reducing the power of idling TCUs and thread gating for controlling the assignment of threads to TCUs. Both of these mechanisms are incorporated into our cycle-accurate simulator. The effect of sleep-wake is included by default in the results presented in Chapters 6 and 7. Thread gating is applied to distribute the power equally across the cores whenever the number of active threads is less than the number of TCUs.

Figure 3.8: Sleep-wake mechanism for thread ID check in TCUs.

Sleep-Wake vs. Polling for Thread ID Check

The first change that we propose is an energy efficient implementation of the TCU thread ID check step ($ ≤ TIDhigh) in Figure 3.6. In the current implementation, the corresponding assembly instruction is a branch that continuously polls on the condition until it is satisfied. This polling mechanism, in addition to wasting dynamic power, also prevents a potential power management algorithm from putting the TCU to a low-power mode to save static power.

Figure 3.8 (based on Figure 3.6) illustrates the proposed sleep-wake mechanism. When a TCU reaches the condition and fails at the first attempt, it sends its unique ID (TCUID in Figure 3.6) along with the new thread ID ($) to a global controller and stalls. The global controller monitors TIDhigh for changes and “wakes up” the TCU if its stored thread ID becomes active.

Implementation of a sleep-wake mechanism for the parallel TCUs is especially important since the cost of polling is multiplied by the number of TCUs. Another step in Figure 3.6 where polling appears is the TIDlow − TIDhigh > N condition of the Master TCU. However, optimization of this step is not as critical, since it is only executed by the Master TCU.

Figure 3.9: Addition of thread gating to the thread scheduling of XMT. The original mechanism was given in Figure 3.8.

Thread Gating

A straightforward addition to the thread scheduling mechanism is the ability to modulate the distribution of threads to the clusters dynamically during the runtime in order to reduce the power at certain areas of the chip. We will call this feature thread gating.

The change proposed for incorporating thread gating into the mechanism of Figure 3.8 is shown in Figure 3.9. The general idea is to reduce the number of active TCUs and limit the execution to a subset of all TCUs in the system. Given a TCU selected for thread gating, the algorithm will let the TCU run to the completion of its current thread and then prevent it from picking up a new one. A thread-gated TCU can be put in a low-power stand-by mode and later be awakened to proceed. This system is deadlock-free: the TCU is stopped right before it picks up a new thread, so none of the virtual threads created so far, or that may yet be created dynamically, resides in the gated TCU.

It should be noted that thread gating relies on the fact that in fine-grained parallelism, threads are usually short and the control algorithm can afford to wait until the executing threads are finished before the selected TCUs can be gated. This assumption holds in the majority of the cases, especially for thermal management. The time scale of temperature changes, as we discussed in Section 2.8, is usually orders of magnitude larger than the length of typical XMTC threads. Nevertheless, if thread gating is used for a critical task such as the prevention of thermal emergencies, a fall-back should be implemented. In most processors the fall-back is a system-wide halt of the clock.

Figure 3.10: The state diagram for the activity state of TCUs.

3.5.3 Low Power States for Clusters

We envision that an industry grade XMT system will be capable of stopping the clock of the TCUs that are waiting on memory operations or not executing a thread. This can provide substantial savings in dynamic power. If none of the TCUs in a cluster is executing threads, the cluster can be voltage gated for saving leakage power. In this section, we discuss how these optimizations can be implemented. These optimizations are incorporated into the simulations of Chapter 7.

We assume that a TCU can be in one of the following four states: (a) active – executing an XMTC thread and not waiting on a memory operation, (b) mwait – waiting on a memory operation, (c) idle – blocked at the thread ID check (see Figure 3.6), and (d) t-gated1 – the TCU is gated by global control as explained in the previous subsection. A t-gated TCU does not contain any state. In the idle state, the values of the program counter (PC) and the thread ID ($) should be kept alive; in the mwait state, the values in all registers and the PC should be kept alive. Figure 3.10 summarizes the activity states of the TCUs.

1 We use t-gated to prevent confusion with clock or voltage gating. It stands for thread-gated.

Any TCU that is not in the active state can individually be clock gated. Non-active states (especially mwait) can happen at short time intervals and clock gating can be applied successfully since it typically does not cost additional clock cycles. The power model that we later introduce in Section 4.5 assumes that clock gating is implemented for TCUs.

Voltage gating is usually implemented at a coarser grain and causes the information in the registers to be lost. For XMT, cluster granularity is suitable. For a cluster to be voltage gated, all TCUs in it should be in either the t-gated or the idle state. The mwait state, which requires the registers and the PC to be kept alive, is not considered for this mechanism. Even though the thread ID should not be lost in the idle state, it is already saved to the thread monitor (see Figure 3.9), so voltage gating an idle TCU does not lose it.

A voltage gated cluster can also turn off the other components in it, such as the functional units and the ICN access port. If the instruction cache is also turned off, the instructions in it will be lost, and upon returning to the active state it will have to be loaded again, at a cost in performance.

3.5.4 Power Management of the Synchronous MoT-ICN

We will elaborate on challenges and ideas in managing the dynamic and leakage power of the ICN. Note that efficient management of dynamic and leakage power in interconnection networks is an open research question not only for XMT but for most architectures [MOP+09]. For this reason, the future plan for XMT is a power efficient asynchronous ICN [HNCV10].

Dynamic Power. The MoT-ICN trees (depicted in Figure 3.4) consist of lightweight nodes that are distributed along the routing paths. Due to the simplicity of the nodes, the trees might not benefit from fine-grained clock gating (see Section 2.2.1). On the other hand, clock gating at a coarser grain, for example at the tree level, might introduce other difficulties. Turning the clock of a whole tree on and off might require more than one clock cycle because the trees, on average, are distributed over a large area. As a result, the potential efficiency of clock gating in the ICN is not clear.

Figure 3.11: Modification of address bit-fields for cache resizing. The original example was given in Figure 3.2. (a) All 128 cache modules in use (7-bit module ID field, bits 11–5). (b) Cache size reduced by half to 64 modules: the module ID field shrinks by one bit (bits 10–5), shifting the boundary of the address-inside-module field.

Leakage Power. Intuitively, methods such as voltage gating (see Section 2.3.1) can only have a marginal effect on the leakage power of the MoT-ICN if the memory traffic is balanced: most trees in the MoT-ICN will have flits in transfer. Therefore, applying voltage gating at the granularity of trees might not produce many opportunities for turning off a tree to prevent leakage (and finer-grained voltage gating might not be feasible). However, if the memory traffic is not balanced, voltage gating can be beneficial. Voltage gating also saves power if certain cache modules are disabled via cache resizing (Section 3.5.5); the ICN trees that are attached to the disabled modules can be gated.

3.5.5 Power Management of Shared Caches – Dynamic Cache Resizing

We explain a scheme in which cache modules can selectively be turned off, effectively reducing the total cache size of the system. This can reduce the overall energy for programs that do not require the full cache size, for example if they operate on data sets that fit into a subset of the cache. Also, reducing the total cache size can still be beneficial for programs that are computation heavy and not sensitive to memory bottlenecks. Cache resizing can help reduce the ICN power as well, as we discussed in the previous section.

There are several trade-offs related to dynamic cache resizing. First, changing the cache size dynamically requires flushing the cached data and incurs warm-up cache misses. Second, if a smaller cache size reduces program performance, the increased runtime might result in a larger overall energy because of the power of the other components.

Figure 3.11 shows how to modify the selection of bit-fields in a memory address to reduce the cache size by half. The original example (Figure 3.2) was given for an XMT system with 128 cache modules and 32-byte cache lines. Initially, 7 bits are used to address cache modules, and we reduce that to 6 bits. Consequently, only 2^6 = 64 modules can receive memory references and the rest can be disabled. Note that, as a side effect of this modification, the length of the field for addresses inside cache modules changes. Therefore, cache tags in this system need to be designed for the worst case (i.e., the smallest dynamic cache size).
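Continuing the address-decomposition sketch of Section 3.1.1 (again our own illustration, not XMT toolchain code), halving the cache amounts to selecting the module with one fewer bit:

    #include <stdint.h>

    #define LINE_OFFSET_BITS 5

    /* module_id_bits is 7 when all 128 modules are enabled, 6 after resizing to 64. */
    static inline uint32_t cache_module_id_resized(uint32_t addr, unsigned module_id_bits) {
        return (addr >> LINE_OFFSET_BITS) & ((1u << module_id_bits) - 1);
    }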

Reducing the power consumption of caches is not a focus of this work since it is not on the critical path of power/thermal feasibility. Therefore evaluation of cache resizing is left to future work.

Chapter 4

XMTSim – The Cycle-Accurate Simulator of the XMT Architecture

In this chapter, we present XMTSim, a highly configurable cycle-accurate simulator of the XMT computer architecture. XMTSim features a power model and a thermal model, and it provides the means to simulate dynamic power and thermal management algorithms. These features are essential for the subsequent chapters. We made XMTSim publicly available as a part of the XMT programming toolchain [CKT10], which also includes an optimizing compiler [TCVB11].

XMT envisions bringing efficient on-chip parallel programming to the mainstream, and the toolchain is instrumental in obtaining results to validate these claims, as well as in making a simulated XMT platform accessible from any personal computer. XMTSim is useful to a range of communities, such as system architects, teachers of parallel programming and algorithm developers, for the following four reasons:

1. Opportunity to evaluate alternative system components. XMTSim allows users to change the parameters of the simulated architecture, including the number of functional units and the organization of the parallel cores. It is also easy to add new functionality to the simulator, making it an ideal platform for evaluating both architectural extensions and algorithmic improvements that depend on the availability of hardware resources. For example, Caragea et al. [CTK+10] search for the optimal size and replacement policy for prefetch buffers given limited transistor resources. Furthermore, to our knowledge, XMTSim is the only publicly available many-core simulator that allows evaluation of architectural mechanisms/features such as dynamic power and thermal management. Finally, the capabilities of our toolchain extend beyond specific XMT choices: system architects can use it to explore a much greater design-space of shared memory many-cores.

2. Performance advantages of XMT and PRAM algorithms. In Section 3.4, we listed publications that not only establish the performance advantages of XMT compared to existing parallel architectures, but also document the interest of the academic community in such results. XMTSim was the enabling factor for the publications that investigate planned/future configurations. Moreover, despite past doubts in the practical relevance of PRAM algorithms, results facilitated by the toolchain showed not only that theory-based algorithms can provide good speedups in practice, but that sometimes they are the only ones to do so.

3. Teaching and experimenting with on-chip parallel programming. As a part of the XMT toolchain, XMTSim contributed to the experiments that established the ease-of-programming of XMT. These experiments were presented in publications [TVTE10, VTEC09, HBVG08, PV11] and conducted in courses taught to graduate, undergraduate, high-school and middle-school students, including at Thomas Jefferson High School, Alexandria, VA. In addition, the XMT toolchain provides a convenient platform for teaching parallel algorithms and programming, because students can install and use it on any personal computer to work on their assignments.

4. Guiding researchers for developing similar tools. This chapter also documents our experiences on constructing a simulator for a highly-parallel architecture, which, we believe, will guide other researchers who are in the process of developing similar tools.

The remainder of this chapter is organized as follows. Section 4.1 gives an overview of the simulator. Section 4.2 elaborates on the mechanisms that enable users to customize the reported statistics and modify the execution of the simulator during runtime. Sections 4.3 and 4.4 describe the details of the cycle-accurate simulation and present the cycle verification against the FPGA prototype. Power and thermal models are explained in Section 4.5 and the dynamic management extensions are explained in Section 4.6. Sections 4.7 and 4.8 list the miscellaneous features that are not mentioned in other sections and the future work.

4.1 Overview of XMTSim

XMTSim accurately models the interactions between the high-level micro-architectural components of XMT shown in Figure 4.1, i.e., the TCUs, functional units, caches, interconnection network, etc. Currently, only on-chip components are simulated, and DRAM is modeled as a simple latency. XMTSim is highly configurable and provides control over many parameters, including the number of TCUs, the cache size, the DRAM bandwidth and the relative clock frequencies of components. XMTSim is verified against the 64-TCU FPGA prototype of the XMT architecture.

Figure 4.1: XMT overview from the perspective of the XMTSim software structure.

The software structure of XMTSim is geared towards providing a suitable environment for easily evaluating additions and alternative designs. XMTSim is written in the Java programming language and the object-oriented coding style isolates the code of major components in individual units (Java classes). Consequently, system architects can override the model of a particular component, such as the interconnection network or the shared caches, by only focusing on the relevant parts of the simulator. Similarly, a new assembly instruction can be added via a two step process: (a) modify the assembly language definition file of the front-end, and (b) create a new Java class for the added instruction. The new class should extend Instruction, one of the core Java classes of the simulator, and follow its application programming interface (API) in defining its functionality and type (ALU, memory, etc.).

46 Each solid box in Figure 4.1 corresponds to a Java object in XMTSim. Simulated assembly instruction instances are wrapped in objects of type packet. An instruction packet originates at a TCU, travels through a specific set of cycle-accurate components according to its type (e.g., memory, ALU) and expires upon returning to the commit stage of the originating TCU. A cycle-accurate component imposes a delay on packets that travel through it. In most cases, the specific amount of the delay depends on the previous packets that entered the component. In other words, these components are state machines, where the state input is the instruction/data packets and the output is the delay amount. The inputs and the states are processed at transaction-level rather than bit-level accuracy, a standard practice which significantly improves the simulation speed in high-level architecture simulators. The rest of the boxes in Figure 4.1 denote either the auxiliary classes that help store the state or the classes that enclose collections of other classes.

Figure 4.2 is the conceptual overview of the simulation mechanism. The inputs and outputs are outlined with dashed lines. A simulated program consists of assembly and memory map files that are typically provided by the XMTC compiler. A memory map file contains the initial values of global variables. The current version of the XMT toolchain does not include an operating system, therefore global variables are the only way to provide input to XMTC programs, since OS dependent features such as file I/O are not yet supported. The front-end that reads the assembly file and instantiates the instruction objects is developed with SableCC, a Java-based parser-generator [GH98]. The simulated XMT configuration is determined by the user, typically via configuration files and/or command line arguments. The built-in configurations include models of the 64-TCU FPGA prototype (also used in the verification of the simulator) and an envisioned 1024-TCU XMT chip.

XMTSim is execution-driven (versus trace-driven). This means that instruction traces are not known ahead of time, but instructions are generated and executed by a functional model during simulation. The functional model contains the operational definition of the instructions, as well as the state of the registers and the memory. The core of the simulator is the cycle-accurate model, which consists of the cycle-accurate components and an event scheduler engine that controls the flow of simulation. The cycle-accurate model fetches the instructions from the functional model and returns the expired instructions to the functional model for execution, which is illustrated in Figure 4.2.

Figure 4.2: Overview of the simulation mechanism, inputs and outputs.

The simulator can be set to run in a fast functional mode, in which the cycle-accurate model is replaced by a simplified mechanism that serializes the parallel sections of code. The functional simulation mode does not provide any cycle-accurate information, hence it is faster by orders of magnitude than the cycle-accurate mode and can be used as a fast, limited debugging tool for XMTC programs. However, the functional mode cannot reveal any concurrency bugs that might exist in a parallel program since it serializes the execution of the spawn blocks. Another potential use for the functional simulation mode is fast-forwarding through time consuming steps (e.g., OS boot, when made available in future releases), which would not be possible in the cycle-accurate mode due to simulation speed constraints.

4.2 Simulation Statistics and Runtime Control

As shown in Figure 4.2, XMTSim features built-in counters that keep record of the executed instructions and the activity of the cycle-accurate components. Users can customize the instruction statistics reported at the end of the simulation via external filter plug-ins. For example, one of the default plug-ins in XMTSim creates a list of the most frequently accessed locations in the XMT shared memory space. This plug-in can help a programmer find the lines of assembly code in an input file that cause memory bottlenecks, which in turn can be referred back to the corresponding XMTC lines of code by the compiler. Furthermore, instruction and activity counters can be read at regular intervals during the simulation time via the activity plug-in interface. Activity counters monitor many state variables; examples are the number of instructions executed in functional units and the amount of time that TCUs wait for memory operations.

A feature unique to XMTSim is the capability to evaluate runtime systems for dynamic power and thermal management. The activity plug-in interface is a powerful mechanism that renders this feature possible. An activity plug-in can generate execution profiles of XMTC programs over simulated time, showing memory and computation intensive phases, power, etc. Moreover, it can change the frequencies of the clock domains assigned to clusters, the interconnection network, shared caches and DRAM controllers, or even enable and disable them. The simulator provides an API for modifying the operation of the cycle-accurate components during runtime in such a way. In Section 4.5, we will provide more information on the power/thermal model and management in XMTSim.

4.3 Details of Cycle-Accurate Simulation

In this section, we explain various aspects of how cycle-accurate simulation is implemented in XMTSim, namely the simulation strategy, which is discrete-event based, and the communication of data between simulated components. We then discuss the factors that affect the speed of simulation. Finally, we demonstrate discrete-event simulation on an example.

4.3.1 Discrete-Event Simulation

Discrete-event (DE) simulation is a technique that is often used for understanding the behavior of complex systems [BCNN04]. In DE simulation, a system is represented as a collection of blocks that communicate and change their states via asynchronous events. XMTSim was designed as a DE simulator for two main reasons. First is its suitability for large object-oriented designs: a DE simulator does not require a global picture of the system, and the programming of the components can be handled independently. This is a desirable strategy for XMTSim, as explained earlier. Second, DE simulation allows modeling not only synchronous (clocked) components but also asynchronous components that require a continuous time concept as opposed to discretized time steps. This property enabled the ongoing asynchronous interconnect modeling work mentioned in Section 4.8.

The building blocks of the DE simulation implementation in XMTSim are actors, which are objects that can schedule events. Events are scheduled at the DE scheduler, which maintains a chronological order of events in an event list. An actor is notified by the DE scheduler via a callback function when the time of an event it previously scheduled expires, and as a result the actor executes its action code. Some of the typical actions are to schedule another event, trigger a state change or move data between the cycle-accurate components. A cycle-accurate component in XMTSim might extend the actor type, contain one or more actor objects or exist as a part of an actor, which is a decision that depends on factors such as simulation speed, code clarity and maintainability.
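The following C sketch illustrates the event-list idea in a language-agnostic way (XMTSim itself is written in Java; the type and function names below are ours, and the evaluate/update priorities discussed in Section 4.3.2 are omitted for brevity):

    #include <stdlib.h>

    /* A minimal event-list sketch: an actor schedules an event with a timestamp,
     * and the scheduler repeatedly pops the earliest event and notifies its actor. */
    typedef struct actor Actor;
    typedef struct event {
        long          time;              /* simulated time of the event        */
        Actor        *actor;             /* actor to notify                    */
        struct event *next;              /* singly linked, kept sorted by time */
    } Event;

    struct actor {
        void (*notify)(Actor *self, long time);  /* action code callback       */
    };

    static Event *event_list = NULL;

    /* Insert an event in chronological order. */
    void schedule(Actor *a, long time) {
        Event *e = malloc(sizeof *e);
        e->time = time; e->actor = a;
        Event **p = &event_list;
        while (*p && (*p)->time <= time) p = &(*p)->next;
        e->next = *p; *p = e;
    }

    /* Main simulation loop: advance to the next event and notify its actor. */
    void run(void) {
        while (event_list) {
            Event *e = event_list;
            event_list = e->next;
            e->actor->notify(e->actor, e->time);
            free(e);
        }
    }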

Figure 4.3: The overview of the DE scheduling architecture of the simulator (actors schedule events at the DE scheduler, which inserts them in order into the event list and notifies the actors as events are returned; Actor 1 models a single component, while Actor 2 is a macro-actor that iterates over components 2.a, 2.b, ...).

Figure 4.3 is an example of how actors schedule events and are then notified of events. The DE scheduler is the manager of the simulation; it keeps the events in a list-like data structure, the event list, ordered according to their schedule times and priorities. In this example, Actor 1 models a single cycle-accurate component, whereas Actor 2 is a macro-actor, which schedules events and contains the action code for multiple components.

(a) Discrete-time simulation:

    int time = 0;
    while (true) {
        ...
        if (...) break;
        time++;
    }

(b) Discrete-event simulation:

    int time;
    while (true) {
        Event e = eventList.next();
        time = e.time();
        e.actor().notify();
        if (...) break;
    }

Figure 4.4: Main loop of execution for (a) discrete-time simulation, (b) discrete-event simulation.

It should be noted that XMTSim diverges from discrete-time (DT) architecture simulators such as SimpleScalar [ALE02]. The difference is illustrated in Figure 4.4. The DT simulation runs in a loop that polls through all the modeled components and increments the simulated time at the end of each iteration. Simulation ends when a certain criterion is satisfied, for example when a halt assembly instruction is encountered. On the other hand, the main loop of the DE simulator handles one actor per iteration by calling its notify method. Unlike DT simulation, simulated time does not necessarily progress at even intervals. Simulation is terminated when a specific type of event, namely the stop event, is reached. The advantages of DE simulation were mentioned at the beginning of this section. However, DT simulation may still be desirable in some cases due to its speed advantages and simplicity in modeling small to medium sized systems. Only the former is a concern in our case, and we elaborate further on simulation speed issues in Section 4.3.4.

A brief comparison of discrete-time versus discrete-event simulation is given in Table 4.1. As indicated in the table, DT simulation is preferable for the simulation of up to mid-size synchronous systems, and the resulting code is often more compact compared to DE simulation code. For larger systems, DT simulation might require an extensive case study to ensure correctness. Also, for cases in which many components are defined but only a few of them are active every cycle, DT simulation typically wastes computation time on conditional statements that do not fall through. The advantages of DE simulation were discussed earlier in this section. The primary concern about DE simulation is its performance, which may fall behind DT simulation, as demonstrated in the next section.

Table 4.1: Advantages and disadvantages of DE vs. DT simulation.

Discrete-Time Simulation
  Pros: efficient if a lot of work is done for every simulated cycle; more compact code for smaller simulations.
  Cons: requires complex case analysis for a large simulator; slow if not all components do work every clock cycle.

Discrete-Event Simulation
  Pros: naturally suitable for an object-oriented structure; can simulate asynchronous logic; more flexible in the quantization of simulated time.
  Cons: event list operations are expensive; might require more work for emulating one clock cycle.

4.3.2 Concurrent Communication of Data Between Components

In DE simulation, if the movement of data between various related components is triggered by concurrent events, special care must be taken to ensure the correctness of the simulation. As a result, DE simulation might require more work than DT simulation. We demonstrate this statement with an example of simulating a simple pipeline. We first show how the simulation is executed on a DT simulator, as it is the simpler case, and then move to the DE simulator.

Figure 4.5 illustrates how a 3-stage pipeline with a packet at each stage advances one clock cycle in the case of no stalls. Figure 4.5(a) is the initial status. Figures 4.5(b), 4.5(c) and 4.5(d) show the steps that the simulation takes in order to emulate one clock cycle. In the first step, the packet at the last stage (packet 3) is removed as the output. Then packets 2 and 1 are moved to the next stage, in that order. By starting at the end, it is ensured that packets are not unintentionally overwritten.

Figure 4.6 shows the same 3 stage pipeline example of Figure 4.5 in DE simulation. We assume that each stage of the pipeline is defined as an actor. For advancing the pipeline, each actor will schedule an event at time T to pass its packet to the next stage. In DE simulation, however, there is no mechanism to enforce an order between the notify calls to the actors that schedule events for the same time (i.e., concurrent events). For example, the actors can be notified in the order of stages 1, 2 and 3. As the figure exhibits, this will cause accidental deletion of packets 2 and 3.

Figure 4.7 repeats the DE simulation, this time with intermediate storage for each pipeline stage, denoted by the smaller white boxes in Figures 4.7(b) and 4.7(c). For this solution to work, we also have to incorporate the concept of priorities into the simulation.

Figure 4.5: Example of discrete-time pipeline simulation. Panels: (a) initial state; (b) and (c) intermediate steps; (d) after one simulated cycle.

We define two priorities, evaluate and update. The event list is ordered such that the evaluate events of a simulation instant come before the update events of the same instant. In the example, at T-1 (initial state, Figure 4.7(a)) all actors schedule events for T.evaluate. At the evaluate phase of T (Figure 4.7(b)), they move packets to intermediate storage and schedule events for T.update. At the update phase of T (Figure 4.7(c)), they pass the packets to the next stage.

Next, we compare the work involved in simulating the 3-stage pipeline in the DT and DE systems. In DT simulation, 3 move operations are performed to emulate one clock cycle. In DE simulation, 6 move operations and 6 events are required. Clearly, DE simulation would be slower in this example, not only because of the larger number of move operations but also because creating events is expensive: events have to be sorted as they are inserted into the event list. This example supports the simulation speed argument in Table 4.1.

Figure 4.6: Example of discrete-event pipeline simulation. The simulation produces wrong output because the order of the notify calls to the actors causes packets to be overwritten. Panels: (a) initial state; (b) stage 1 passes packet 1 to stage 2; (c) stage 2 passes packet 1 to stage 3; (d) end of the simulated cycle: packet 1 is read as the output.

Figure 4.7: Example of discrete-event pipeline simulation with the addition of priorities. Intermediate storage is used to prevent accidental deletion of packets. Panels: (a) time T-1, initial state; (b) time T, after the evaluate phase; (c) time T, after the update phase (end of the simulated cycle).

4.3.3 Optimizing the DE Simulation Performance

As mentioned earlier, DT simulation may be considerably faster than DE simulation, most notably when many actions fall at exactly the same moment in simulated time. A DT simulator polls through all the actions in one sweep, whereas XMTSim would have to schedule and handle a separate event for each one (see Figure 4.4), which is a costly operation. A way around this problem is grouping closely related components into one large actor and letting that actor handle and combine events from these components. An example is the macro-actor in Figure 4.3. A macro-actor contains the code for many components and iterates through them at every simulated clock cycle. The action code of the macro-actor resembles the DT simulation code in Figure 4.4a, except that the while loop is replaced by a callback from the scheduler. This style is advantageous when the average number of events that would be scheduled per cycle without grouping the components (i.e., each component is an actor) passes a threshold. For a simple experiment conducted with components that contain no action code, this threshold was 800 events per cycle. In more realistic cases, the threshold would also depend on the amount of action code. A minimal sketch of the grouping idea is given below.
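The sketch below illustrates how a macro-actor can sweep over many grouped sub-components in a single callback per simulated cycle, instead of scheduling one event per component. The Component interface and the class name are illustrative assumptions; Figure 4.8 shows the fuller, port-based coding style used in XMTSim.

// Sketch of grouping sub-components into one macro-actor that is notified
// once per simulated clock cycle. The names below are illustrative and do
// not reproduce the exact XMTSim classes.
class MacroActorGroupingSketch {

    interface Component { boolean doCycle(); }   // returns true if still busy

    private final Component[] parts;

    MacroActorGroupingSketch(Component[] parts) { this.parts = parts; }

    // Called back by the scheduler once per cycle, replacing the DT while-loop.
    // Returns true if an event should be scheduled for the next clock cycle.
    boolean onClockCycle() {
        boolean busy = false;
        for (Component c : parts)   // one sweep over all grouped components:
            busy |= c.doCycle();    // a single event serves the whole group,
        return busy;                // instead of one event per component
    }
}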

In XMTSim, the clusters and shared caches, as well as each of the interconnection network (ICN) send and return paths, are designed as macro-actors. This organization not only improves performance, but also facilitates the maintainability of the simulator code and provides a convenient means to replace any component with an alternative model, if needed. We define the following mechanism to formalize the coding of macro-actors.

Ports: A macro-actor accepts inputs via its Port objects. Port is a Java interface that defines two methods: (a) available(): returns a boolean value indicating whether the port can be written to; (b) write(obj): accepts an object as its parameter, which should be processed by the actor; it can only be called if the available method returns true. A sketch of this interface is given below.
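The contract described above can be captured as follows. This is a sketch rather than the exact XMTSim declaration; in particular, the generic payload type is our simplification (Figure 4.8 writes objects of a specific InputJob type).

// Sketch of the Port contract. The generic type parameter is an assumption;
// the actual XMTSim interface may fix the payload type.
interface Port<T> {
    // Returns true if the port can accept a write in the current evaluate
    // phase; the result must stay stable for that phase (see the rules below).
    boolean available();

    // Hands an object to the owning actor for processing. May only be called
    // during the evaluate phase, and only if available() returned true for
    // the same simulation instant (see the rules below).
    void write(T obj);
}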

2-phase simulation: The phases refer to the priorities (evaluate and update) that were defined in the previous section. The evaluate phase of a simulation instant is the set of all events with evaluate priority at that instant. The update phase is defined similarly. Below are the rules for coding a macro-actor within the 2-phase framework.

1. The available and write methods can only be called during the evaluate phase.

2. The output of the available method should be stable during the evaluate phase until it is guaranteed that there will be no calls to the write method of the port.

3. The write method should not be called if a prior call to available for the same simulation instant returns false.

4. If the write method of a port is called multiple times during the evaluate phase of a simulation instant, it is not guaranteed that any of the writes will succeed. Typically the write method should be called at most once for a simulation instant. However, specific implementations of the Port interface might relax this requirement, which should be noted in the API documentation for the class.

5. Typical actions of a macro-actor in the update phase are moving the inputs away from its ports, updating the outputs of the available methods of the ports, and scheduling the evaluate event for the next clock cycle (in XMTSim, evaluate phase comes before the update phase). However, these actions are not requirements.

An example implementation of a MacroActor is given in Figure 4.8.

4.3.4 Simulation Speed

Simulation speed can be the limiting factor, especially in the evaluation of power and thermal control mechanisms, as these experiments usually require simulation of relatively large benchmarks. We evaluated the speed of simulation in terms of simulated instructions per second and simulated clock cycles per second on an Intel Xeon 5160 Quad-Core Server clocked at 3GHz. The simulated configuration was a 1024-TCU XMT, and for measuring the speed we simulated various hand-written microbenchmarks. Each benchmark is either serial or parallel, and either computation or memory intensive. The results are averaged over similar types and given in Table 4.2. We observe that the average instruction throughput of computation-intensive benchmarks is much higher than that of memory-intensive benchmarks. This is because the cost of simulating a memory instruction involves the expensive interconnection network model. Execution profiling of XMTSim reveals that, for real-life XMTC programs, up to 60% of the time can be spent simulating the interconnection network. When it comes to simulated clock cycle throughput, the difference between the memory- and computation-intensive benchmarks is not as significant, since memory instructions incur significantly more clock cycles than computation instructions, which boosts the cycle throughput.

class ExampleMacroActor extends Actor {
    // The only input port of the actor. It takes objects of type
    // InputJob as input.
    Port inputPort;

    // Temporary storage for the input jobs passed via inputPort.
    InputJob inputPortIn, inputPortOut;

    // Constructor -- contains the initialization code for a new
    // object of type ExampleMacroActor.
    ExampleMacroActor() {
        inputPort = new Port() {
            public void write(InputJob job) {
                inputPortIn = job;
                // Upon receiving a new input, the actor should make sure
                // that it will receive a callback at the next update
                // phase. That code goes here.
            }
            public boolean available() {
                return inputPortIn == null;
            }
        };
    }

    // Implementation of the callback function (called by the scheduler).
    // The Event object that caused the callback is passed as a parameter.
    void notifyActor(Event e) {
        switch (e.priority()) {
        case EVALUATE:
            // Main action code of the actor, which processes
            // inputPortOut. The actor might write to the ports of
            // other actors. For example:
            //   if (anotherActor.inputPort.available())
            //       anotherActor.inputPort.write(...);
            // The actor schedules the next evaluate phase if there is
            // more work to be done.
            break;
        case UPDATE:
            if (inputPortOut == null && inputPortIn != null) {
                inputPortOut = inputPortIn;
                inputPortIn = null;
            }
            // Here the actor schedules the next evaluate phase, if there
            // is more work to be done. For example:
            //   scheduler.schedule(new Event(scheduler.time + 1),
            //                      Event.EVALUATE);
            break;
        }
    }
}

Figure 4.8: Example implementation of a MacroActor.

Table 4.2: Simulated throughputs of XMTSim.

Benchmark Group                 | Instructions/sec | Cycles/sec
Parallel, memory intensive      | 98K              | 5.5K
Parallel, computation intensive | 2.23M            | 10K
Serial, memory intensive        | 76K              | 519K
Serial, computation intensive   | 1.7M             | 4.2M

Table 4.3: The configuration of XMTSim that is used in validation against Paraleap.

Principal Computational Resources
  Cores: 8 TCUs in 8 clusters; 32-bit RISC ISA; 5-stage pipeline (4th stage may be shared and variable length)
  Integer Units: 64 ALUs (one per TCU), 8 MDUs and 8 FPUs (one each per cluster)
On-chip Memory
  Registers: 8 KB integer and 8 KB FP (32 integer and 32 FP registers per TCU)
  Prefetch Buffers: 1 KB (4 buffers per TCU)
  Shared caches: 256 KB total (8 modules, 32 KB each, 2-way associative, 8-word lines)
  Read-only caches: 512 KB (8 KB per cluster)
  Global registers: 8 registers
Other
  Interconnection Network (ICN): 8 x 8 Mesh-of-Trees
  Memory controllers: 1 controller, 32-bit (1 word) bus width
  Clock frequency: ICN, shared caches and the cores run at the same frequency. Memory controllers and DRAM run at 1/4 of the core clock to emulate the core-to-memory-controller clock ratio of the FPGA.

4.4 Cycle Verification Against the FPGA Prototype

We validated the cycle-accurate model of XMTSim against the 64-core FPGA XMT prototype, Paraleap. The configuration of XMTSim that matches Paraleap is given in Table 4.3. In addition to serving as a proof-of-concept implementation for XMT, Paraleap was also set up to emulate the operation of an 800 MHz XMT computer with a DDR DRAM. The clock of the memory controller was purposefully slowed down so that its ratio to the core clock frequency matches that of the emulated system. The simulator configuration reflects this adjustment.

Even though XMTSim was based on the hardware-description-language specification of Paraleap, discrepancies between the two exist:

• Due to its development status, certain specifications of Paraleap do not exactly match those of the envisioned XMT chip modeled by XMTSim. Given the same amount of effort and on-chip resources that are put towards an industrial-grade ASIC product (as opposed to a limited FPGA prototype), these limitations would not exist. Some examples are:

– Paraleap is spread over multiple FPGA chips and requires additional buffers at the chip boundaries which add to the ICN latency. These buffers are not necessary for modeling an ASIC XMT chip, and are not included in XMTSim.

– Due to die size limitations, Paraleap utilizes a butterfly interconnection network instead of the MoT used in XMTSim.

– The sleep-wake mechanism proposed in Section 3.5 is implemented in XMTSim but not in Paraleap.

• Our experience shows that some implementation differences do not cause a significant cycle-count benefit or penalty; however, their inclusion in the simulator would add code complexity and slow down the simulation significantly (as well as require a considerable amount of development effort). Note that one of the major purposes of the XMT simulator is architectural exploration, and therefore simulation speed, code clarity, modularity, self-documentation and extensibility are important factors. Going into too much detail for no clear benefit conflicts with these objectives. For example, some of the cycle-accurate features of the Master TCU are currently under development. The benchmarks that we use in our experiments usually have insignificant serial sections; therefore, the inefficiencies of the Master TCU should not affect simulation results significantly.

• An accurate external DRAM model, DRAMSim [WGT+05], is currently being incorporated into XMTSim. Meanwhile, XMTSim models DRAM communication as a constant latency.

• Currently, XMT does not feature an OS; therefore, I/O operations such as printf calls and file operations cannot be simulated in a cycle-accurate way.

Another difficulty in validating XMTSim against Paraleap is related to the non-determinism in the execution of parallel programs. A parallel program can take many execution paths based on the order of concurrent reads or writes (via prefix-sum or memory operations in XMT). If the program is correct, it is implied that all these paths will give correct results; however, the cycle or assembly instruction count statistics will not necessarily match between different paths. The order of these concurrent events is arbitrary, and there is no reliable way to determine whether Paraleap and XMTSim will take the same paths when more than one path is possible. For this reason, a benchmark used for validation purposes should be guaranteed to yield near-identical cycle and instruction counts for different execution paths.

Table 4.4: Microbenchmarks used in cycle verification.

Name      | Description                                                                                         | Cycles (Paraleap) | Cycles (XMTSim) | Diff.
MicrPar0  | Start 1024 threads; each thread runs a 50000-iteration loop with a single add instruction in it.   | 1600513           | 1600327         | <1%
MicrPar1  | Start 102400 threads; each thread issues a sw instruction to address 0.                            | 204943            | 204918          | <1%
MicrPar2  | Start 1024 threads; each thread runs an 18000-iteration loop with an add and a mult instruction in it. | 3456482       | 3456318         | <1%
MicrPar3  | Start 1024 threads; each thread runs a 150-iteration loop with an add and a sw to address 0 instruction in it. | 307349  | 307710          | <1%
MicroPar4 | Start 1024 threads; each thread runs a 1800-iteration loop with a sw instruction (and wrapper code) in it. Sw instructions from different TCUs are spread across the memory modules. | 935225 | 626908 | -33%
MicroPar5 | Start 1024 threads; each thread runs a 10K-iteration loop with an add, a mult and a divide instruction in it. | 8320486  | 8320318         | <1%
MicroSer6 | Measure the time to execute the start and termination of 200K threads.                             | 6226029           | 4587533         | -26%

Table 4.4 lists the first set of micro-benchmarks we used in verification. Each benchmark is hand-coded in assembly language for stressing a different component of the parallel TCUs as described in the table. The difference in cycle counts is calculated as

Difference = (CYC_{sim} − CYC_{fpga}) / CYC_{fpga} × 100    (4.1)

where CYC_{sim} and CYC_{fpga} are the cycle counts obtained on the simulator and on Paraleap, respectively. For example, for MicroPar4 the difference is (626908 − 935225)/935225 × 100 ≈ −33%. These benchmarks fulfill the determinism requirement noted earlier, and the only significant deviation in cycle counts among the parallel benchmarks is observed for MicroPar4 (-33%). This deviation can be explained by the differences between the interconnect structures of XMTSim and Paraleap mentioned above.

4.5 Power and Temperature Estimation in XMTSim

Power and temperature estimation in XMTSim is implemented using the activity plug-in mechanism, which was first mentioned in Section 4.2. By default, XMTSim contains a power/temperature-estimation (PTE) plug-in for a 1024-TCU configuration. We explain how we compute the power model parameters for this configuration in Chapter 6. Other configurations can easily be added; however, power model parameters must also be created for the new configurations.

As the thermal model, we incorporated HotSpot¹. HotSpot is written in the C language, and in order to make it available to XMTSim we created HotSpotJ, a Java Native Interface (JNI) [Lia99] wrapper for HotSpot. HotSpotJ is available as a part of XMTSim, but it is also a standalone tool and can be used with any Java-based simulator. We extended HotSpotJ with a floorplan tool, FPJ, that we use as an interface between XMTSim and HotSpotJ. FPJ is essentially a hierarchical floorplan creator in which the floorplan blocks are represented as Java objects. A floorplan is created using the FPJ interface and passed to the simulator at the beginning of the simulation. During the simulation, it is used as a medium to pass power and temperature data of the floorplan blocks between XMTSim and HotSpotJ. More information on HotSpotJ (and FPJ) is given in Appendix C.

Figure 4.9 illustrates the operation of XMTSim with the PTE plug-in. The steps of estimation for one sampling interval are indicated in the figure. XMTSim starts execution by scheduling an initial event for a callback to the PTE plug-in. When the PTE plug-in receives the callback, it interacts with the activity trace interface to collect the statistics that will be explained in the next section and resets the associated counters. Then, it converts these statistics, also called activity traces, to power consumption values according to the power model. It sets the power of each floorplan module on the floorplan object and passes it to HotSpotJ. HotSpotJ computes temperatures for the modules. Finally, the PTE plug-in schedules the next callback from the simulator. Users can create their own plug-ins with models other than the one that we explain next, as long as the model can be described in terms of the statistics reported by the activity trace interface.
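The callback sequence of Figure 4.9 can be sketched as follows for one sampling interval. All type and method names below (ActivityTrace, PowerModel, Floorplan, HotSpotJ, Scheduler) are placeholders we introduce for illustration, not the exact XMTSim or HotSpotJ API.

// Sketch of one sampling interval of the PTE plug-in, following the numbered
// steps of Figure 4.9. The nested interfaces are assumed placeholders.
class PtePluginSketch {

    interface ActivityTrace { double[] readAndResetCounters(); }            // step 2
    interface PowerModel    { double[] toPower(double[] activity); }        // step 3
    interface Floorplan     { void setBlockPowers(double[] watts); }        // step 4
    interface HotSpotJ      { double[] computeTemperatures(Floorplan fp); } // step 5
    interface Scheduler     { void scheduleCallback(PtePluginSketch p, long time); }

    private final ActivityTrace trace;
    private final PowerModel model;
    private final Floorplan floorplan;
    private final HotSpotJ hotspot;
    private final Scheduler scheduler;
    private final long samplingInterval;   // in simulated clock cycles

    PtePluginSketch(ActivityTrace t, PowerModel m, Floorplan f, HotSpotJ h,
                    Scheduler s, long interval) {
        trace = t; model = m; floorplan = f; hotspot = h;
        scheduler = s; samplingInterval = interval;
    }

    // Step 1: the scheduler calls back into the plug-in.
    void onCallback(long now) {
        double[] activity = trace.readAndResetCounters();         // step 2
        double[] watts = model.toPower(activity);                  // step 3
        floorplan.setBlockPowers(watts);                           // step 4
        double[] temps = hotspot.computeTemperatures(floorplan);   // step 5
        // temps could be logged here or handed to a DPTM policy (Section 4.6)
        scheduler.scheduleCallback(this, now + samplingInterval);  // step 6
    }
}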

1HotSpot [HSS+04, HSR+07, SSH+03] was previously mentioned in Section 2.10

Figure 4.9: Operation of the power/thermal-estimation plug-in (steps 1-6: callback for power estimation; read and reset counters from the components; pass activity values to the power model; set power on the floorplan modules and pass them to HotSpotJ; report temperature; schedule the next callback).

4.5.1 The Power Model

For the power model of XMTSim, we combine the models explained in Sections 2.2.2 and 2.3.2 to fit them into the framework proposed by Martonosi and Isci [IM03]. According to their model, the simulation provides the access rate for each component C_i, which is a value between 0 and 1. The power of a component is a linear function of the access rate with a constant offset:

Power(C_i) = AccessRate(C_i) · MaxActPower(C_i) + Const(C_i)    (4.2)

C is the set of microarchitectural components for which the power is estimated. We will give the exact definition of AccessRate(C_i) shortly. MaxActPower(C_i) is the upper bound on the power that is proportional to the activity, and Const(C_i) is the power that a component spends regardless of its activity.

According to Sections 2.2.2 and 2.3.2, the total power of a component, which is the sum of its dynamic power (Equation (2.4)) and leakage power (Equation (2.6)), is expressed as:

P = P_{dyn,max} · ACT · CF + DUTY_{clk} · P_{dyn,max} · (1 − CF) + DUTY_{V} · P_{leak,max}    (4.3)

The configuration parameters in the simulation are P_{dyn,max} and P_{leak,max}, which are the maximum dynamic and leakage powers, and CF, which is the activity correlation factor. ACT is identical to AccessRate(C_i) above: it is the average activity of the component over the duration of the sampling period, obtained from simulation. XMTSim utilizes internal counters that monitor the activity of each architectural component. We will discuss the definition of activity on a per-component basis in the remainder of this section. DUTY_{clk} and DUTY_{V} are the clock and voltage duty cycles, which were explained in Sections 2.2.2 and 2.3.2. Note that DUTY_{V} is always at least as large as DUTY_{clk}, since voltage gating implies that the clock is also stopped.

If we assume that no voltage gating or coarse grain clock gating is applied (i.e., both duty cycles are 1), Equation (4.3) can be simplified to:

P = P_{dyn,max} · ACT · CF + P_{dyn,max} · (1 − CF) + P_{leak,max}    (4.4)

In this form, the P_{dyn,max} · ACT · CF and P_{dyn,max} · (1 − CF) + P_{leak,max} terms are the equivalents of MaxActPower(C_i) and Const(C_i) in Equation (4.2), respectively. If CF is less than 1, Const(C_i) contains not only the leakage power but also a part of the dynamic power. A common value of the activity correlation factor (CF) for aggressively fine-grained clock-gated circuits is 0.9, which is the same assumption as the Wattch power simulator [BTM00] uses.
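As a concrete illustration of Equation (4.4), the following sketch computes the power of a single component from its access rate, assuming no voltage gating and no coarse-grain clock gating. The class, field and parameter names are ours, and the numbers in the usage note below are hypothetical values, not parameters of the XMT1024 model.

// Power of one component per Equation (4.4). Parameter names are ours; the
// values would come from a per-configuration power model such as Table 6.1.
final class ComponentPowerSketch {
    final double pDynMax;   // maximum dynamic power of the component (W)
    final double pLeakMax;  // maximum leakage power of the component (W)
    final double cf;        // activity correlation factor, e.g., 0.9

    ComponentPowerSketch(double pDynMax, double pLeakMax, double cf) {
        this.pDynMax = pDynMax; this.pLeakMax = pLeakMax; this.cf = cf;
    }

    // act is the access rate (activity) in [0, 1] over the sampling period.
    // The first term is the activity-proportional part of Equation (4.2);
    // the remaining terms form the constant Const(Ci) part.
    double watts(double act) {
        return pDynMax * act * cf        // activity-proportional dynamic power
             + pDynMax * (1.0 - cf)      // non-gated (residual) dynamic power
             + pLeakMax;                 // leakage power
    }
}

For instance, with the hypothetical values P_{dyn,max} = 10 W, P_{leak,max} = 2 W and CF = 0.9, an access rate of 0.5 yields 10 · 0.5 · 0.9 + 10 · 0.1 + 2 = 7.5 W.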

Next, we provide details on the activity models of the microarchitectural components in XMTSim. As we discussed in Section 2.10, parameters required to convert the activity to power values can be obtained using tools such as McPAT 0.9 [LAS+09] and Cacti 6.5 [WJ96,MBJ05].

Computing Clusters. The power dissipation of an XMT cluster is calculated as the sum of the power of the individual elements in it, as follows. The access rate of a TCU pipeline is calculated according to the number of instructions that are fetched and executed, which is a simple but sufficiently accurate approximation. For the integer and floating-point units (including arbitration), access rates are the ratio of their throughputs to the maximum throughput. The remaining units are all memory array structures, and their access rates are computed according to the number of reads and writes they serve.

Memory Controllers, DRAM and Global Shared Caches. The access rate of these components is calculated as the ratio of the requests served to the maximum number of requests that can be served over the sampling period (i.e., one request per cycle).
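A minimal sketch of this access-rate definition is given below; the helper name and the clamping to the [0, 1] range are our assumptions about how the counter values would be normalized.

// Access rate of a memory-side component over one sampling period: requests
// served divided by the maximum number that could be served (one per cycle).
final class AccessRateSketch {
    static double accessRate(long requestsServed, long cyclesInPeriod) {
        if (cyclesInPeriod <= 0) return 0.0;
        double rate = (double) requestsServed / cyclesInPeriod;
        return Math.min(1.0, rate);   // AccessRate(Ci) is defined on [0, 1]
    }
}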

Global Operations and Serial Processor. We omit the power spent on global operations, since the total gate counts of the circuits that perform these operations were found to be insignificant with respect to the other components, and these operations make up a negligible portion of execution time. In fact, prefix-sum operations and global register file accesses make up less than 1.5% of the total number of instructions among all the benchmarks. We also omit the power of the XMT serial processor, which is only active during serial sections of the XMTC code and when parallel TCUs are inactive. None of our benchmarks contain significant portions of serial code: the number of serial instructions, in all cases, is less than 0.005% of the total number of instructions executed.

Interconnection Network. The power of the Mesh-of-Trees (MoT) ICN includes the total cost of communication from all TCUs to the shared caches and back. The access rate for the ICN is equal to its throughput ratio, which is the ratio of the packets transferred between TCUs and shared caches to the maximum number of packets that can be transferred over the sampling period.

The power cost of the ICN can be broken into various parts [KLPS11], which fit into the framework of Equation (4.2) in the following way. The power spent by the reading and writing of registers, and the charge/discharge of capacitive loads due to wires and repeater inputs can be modeled as proportional to the activity (i.e. number of transferred packets). All packets travel the same number of buffers in the MoT-ICN, and the wire distance they travel can be approximated as a constant which is the average of all possible paths. The power of the arbiters is modeled as a worst-case constant, as it is not feasible to model it accurately in a fast simulator.

The power of the interconnection network (ICN) is a central theme in Section 6.4, which will explore various scenarios considering possible estimation errors in the power model.

Figure 4.10: Operation of a DPTM plug-in. The flow follows the same steps as Figure 4.9, with a DPTM algorithm stage added: after the temperature is reported (step 5), the clocks are modified via the clock frequency interface (step 6) before the next callback is scheduled (step 7).

4.6 Dynamic Power and Thermal Management in XMTSim

Dynamic Power and Thermal Management (DPTM) in XMTSim works in a similar way to the PTE plug-in with the addition of a DPTM algorithm stage. The DPTM algorithm changes the clock frequencies of the microarchitectural blocks in reaction to the state of the simulated chip with respect to the constraints.

Figure 4.10 shows the changes to the PTE plug-in. The two mechanisms are identical up to step 5, at which point the PTE plug-in finishes the sampling period, while the DPTM plug-in modifies the clocks according to the chosen algorithm. The clocks are modified via the standardized API of the clocked components.
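As an illustration of the kind of policy that could be plugged into the DPTM algorithm stage, the sketch below throttles all clocks when any floorplan block exceeds a temperature threshold and restores them otherwise. The ClockedComponent interface, the threshold and the two frequency levels are illustrative assumptions; this is not one of the specific management techniques evaluated in the later chapters.

// Sketch of a simple global thermal-management policy for the DPTM stage of
// Figure 4.10. All names and constants are illustrative assumptions.
class GlobalThrottlingSketch {

    interface ClockedComponent { void setFrequency(double hz); }

    private final ClockedComponent[] components;
    private final double thresholdCelsius;   // thermal constraint, e.g., 85.0
    private final double nominalHz, throttledHz;

    GlobalThrottlingSketch(ClockedComponent[] components, double thresholdCelsius,
                           double nominalHz, double throttledHz) {
        this.components = components;
        this.thresholdCelsius = thresholdCelsius;
        this.nominalHz = nominalHz;
        this.throttledHz = throttledHz;
    }

    // Called once per sampling interval with the block temperatures from step 5.
    void adjustClocks(double[] blockTemperatures) {
        double max = Double.NEGATIVE_INFINITY;
        for (double t : blockTemperatures) max = Math.max(max, t);
        double target = (max > thresholdCelsius) ? throttledHz : nominalHz;
        for (ClockedComponent c : components)
            c.setFrequency(target);   // step 6: modify clocks via the clock API
    }
}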

4.7 Other Features

In this section we will summarize some of the additional features of XMTSim.

Execution traces. XMTSim generates execution traces at various detail levels. At the functional level, only the results of executed assembly instructions are displayed. The more detailed cycle-accurate level reports the components through which the instruction and data packets travel. Traces can be limited to specific instructions in the assembly input and/or to specific TCUs.

Floorplan visualization. The FPJ package of the HotSpotJ tool can be used for purposes other than interfacing between XMTSim and HotSpotJ. The amount of simulation output can be overwhelming, especially for a configuration that contains

many TCUs. FPJ allows displaying data for each cluster or cache module on an XMT floorplan, in colors or text. It can be used as a part of an activity plug-in to animate statistics obtained during a simulation run. Example outputs from FPJ can be found in Chapter 7 (for example, Figure 7.8). FPJ is explained in further detail in Appendix C.

Checkpoints. XMTSim supports simulation checkpoints, i.e., the state of the simulation can be saved at a point that is given by the user ahead of time or determined by a command line interrupt during execution. Simulation can be resumed at a later time. This is a feature which, among other practical uses, can facilitate dynamically load balancing a batch of long simulations running on multiple computers.

4.8 Features under Development

XMTSim is an experimental tool that is under active development; as such, some features and improvements are either currently being tested or are on its future roadmap.

More accurate DRAM model. As mentioned earlier, an accurate external DRAM model, DRAMSim [WGT+05], is currently being incorporated into XMTSim.

Phase sampling. Programs with very long execution times usually consist of multiple phases where each phase is a set of intervals that have similar behavior [HPLC05]. An extension to the XMT system can be tested by running the cycle-accurate simulation for a few intervals on each phase and fast-forwarding in-between. Fast-forwarding can be done by switching to a fast mode that will estimate the state of the simulator if it were run in the cycle-accurate mode. Incorporating features that will enable phase sampling will allow simulation of large programs and improve the capabilities of the simulator as a design space exploration tool.

Asynchronous interconnect. Use of asynchronous logic in the interconnection network design might be preferable for its advantages in power consumption. Following up on [HNCV10], work in progress with our Columbia University partner compares the synchronous versus asynchronous implementations of the interconnection network modeled in XMTSim.

Increasing simulation speed via parallelism. The simulation speed of XMTSim can be improved by parallelizing the scheduling and processing of discrete events [Fuj90]. It would also be intriguing to run XMTSim, as well as the computation-hungry simulation of the interconnection network component, on XMT itself. We are exploring both.

4.9 Related Work

Cycle-accurate architecture simulators are particularly important for evaluating the performance of parallel computers due to the great variation in the systems that are being proposed. Many of the earlier projects simulating multi-core processors extended the popular uniprocessor simulator, SimpleScalar [ALE02]. However, as parallel architectures started deviating from the model of simply duplicating serial cores, other multi/many-core simulators such as ManySim [ZIM+07], FastSim [CLRT11] and TPTS [CDE+08] were built. XMTSim differs from these simulators in that it targets shared-memory many-cores, a domain that is currently underrepresented. The GPU simulators Barra [SCP09], Ocelot [KDY09] and GPGPUSim [BYF+09] are closer to XMTSim in the architectures that they simulate, but they are limited by the programming models of these architectures. Also, Barra and Ocelot are functional simulators, i.e., they do not report cycle-accurate measures. Kim et al. extended Ocelot with a power model; however, it is not possible to simulate dynamic power and thermal management with this system.

Cycle-accurate architecture simulators can also be built on top of existing simulation frameworks such as SystemC. An example is the simulator presented by Lebreton et al. [LV08]. Instead, we chose to build our own infrastructure, since XMTSim is intended as a highly configurable simulator that serves multiple research communities. Our infrastructure gives us the flexibility to incorporate second-party tools, for example SableCC [GH98], which is the front end for reading the input files. In this case, SableCC enabled easy addition of new assembly instructions as needed by users of XMTSim.

Simulation speed is an issue, especially in evaluating thermal management. Atienza et al. [AVP+06] presented a hardware/software framework that featured an FPGA for fast simulation of a 4-core system. Nevertheless, it is not feasible to fit a 1024-TCU XMT processor on current FPGAs.

Chapter 5

Enabling Meaningful Comparison of XMT with Contemporary Platforms

In the previous chapters, we discussed the XMT architecture and introduced its cycle-accurate simulator. In this chapter, we first enable a meaningful comparison of XMT with contemporary industry platforms by establishing a hardware configuration of XMT that is the silicon-area equivalent of a modern many-core GPU, the NVIDIA GTX280¹ [NVIa]. Then, using the simulator, we compare the runtime performance of the two processors on a range of irregular and regular parallel benchmarks (see Section 5.3 for definitions of regular and irregular). The 1024-TCU XMT configuration (XMT1024) we present here is employed in the following chapters, and the comparison against GTX280, besides demonstrating the performance advantages of XMT, also confirms that those chapters target a realistic XMT platform.

The highlights of this work are:

• A meaningful performance comparison of a state-of-the-art GPU to XMT, a general-purpose highly parallel architecture, on a range of both regular and irregular benchmarks. We show via simulation that XMT can outperform GPUs on irregular applications, while not falling behind significantly on regular benchmarks.

• Beyond the specific comparison to XMT, the results demonstrate that an easy to program, truly general-purpose architecture can challenge a performance-oriented architecture – a GPU – once applications exceed a specific scope of the latter.

In the course of this chapter we list the specifications for an XMT processor area-equivalent to the GTX280 GPU, configure and use XMTSim to simulate the envisioned XMT processor, collect results from simulations and compare them against the

1 At the time of this writing, GTX280 was the most advanced commercially available GPU.

results from the GPU runs. Preparation of the XMT and GPU benchmarks for this work and the compiler optimizations for improving XMT performance were the topic of another dissertation [Car11]. The analysis of the results was a collaboration between the two dissertation projects.

The organization of this chapter is as follows. In Section 5.1, we review NVIDIA Tesla, the architecture of the GTX280 GPU, and compare its high-level design specifications with those of XMT. In Section 5.2, we present a configuration of XMT that is feasible to implement, in terms of silicon area, using current technology. In Section 5.3, we introduce the benchmarks used to perform the comparison. Finally, in Section 5.4, we report the performance comparison data.

5.1 The Compared Architecture – NVIDIA Tesla

In this section, we go over the NVIDIA Tesla architecture, and the CUDA programming environment and compare it against the XMT architecture. We start by explaining the importance of the GPU platform.

GPUs are the main example of many-cores that are not typically confined to traditional architectures and programming models, and use hundreds of lightweight cores in order to provide better speedups. In this respect, they are similar to the design of XMT. However, GPUs perform best on applications with very high degrees of parallelism; at least 5,000 – 10,000 threads according to [SHG09], whereas XMT parallelism scales down to provide speedups for programs with only a few threads.

Advances in GPU programming languages (CUDA by the GPU vendor NVIDIA [NBGS08], Brook by AMD [BFH+04], and the upcoming OpenCL standard [Mun09]) and architecture upgrades have led to strong performance demonstrated for a considerable range of software. When all optimizations are applied correctly by the programmer, GPUs provide remarkable speedups for certain types of applications. As of January 2010, the NVIDIA CUDA Zone website [NVI10] lists 198 CUDA applications, 28 of which report speedups of 100× or more. On the other hand, the programming effort required to extract this performance can be quite significant. The fact that implementations of basic algorithms on GPUs, such as sorting, merit so many research papers (e.g., [BG09, CT08, SA08]) affirms that. Nevertheless, the notable performance benefits led some researchers to regard GPUs as the most promising solution for the pervasive computing platform of the future. The emergence of General-Purpose GPU (GPGPU) communities is perhaps one indication of this belief.

Figure 5.1: Overview of the NVIDIA Tesla architecture.

5.1.1 Tesla/CUDA Framework

In recent years, GPU architectures have evolved from purely fixed-function devices to increasingly flexible, massively parallel programmable processors. The CUDA [NBGS08, NVI09] programming environment together with the NVIDIA Tesla [LNOM08] architecture is one example of a GPGPU system gaining acceptance in the parallel computing community.

Fig. 5.1 depicts an overview of the Tesla architecture. It consists of an array of Streaming Multiprocessors (SMs), connected through an interconnection network to a number of memory controllers and off-chip DRAM modules. Each SM contains a shared register file, shared memory, constant and instruction caches, special function units and several Streaming Processors (SPs) with integer and floating point ALU pipelines. SFUs are 4-wide vector units that can handle complex floating point operations. The CUDA programming and execution model are discussed elsewhere [LNOM08].

A CUDA program consists of serial parts running on the CPU, which call parallel kernels offloaded to a GPU. A kernel is organized as a grid (1, 2 or 3-dimensional) of thread blocks. A thread block is a set of concurrent threads that can cooperate among

themselves through a block-private shared memory and barrier synchronization.

A global scheduler assigns thread blocks to SMs as they become available. Thread blocks are partitioned into fixed-size warps (e.g., 32 threads). At each instruction issue time, a scheduler selects a warp that is ready to execute and issues the next instruction to all the threads in the warp. Threads proceed in a lock-step manner; this execution model is called SIMT (Single Instruction, Multiple Threads).

The CUDA framework provides a relatively familiar environment for developers, which has led to an impressive number of applications being ported since its introduction [NVI10]. Nevertheless, a non-trivial development effort is required when optimizing an application in the CUDA model. Some of the considerations that must be addressed in order to get significant performance gains are:

Degree of parallelism: A minimum of 5,000 - 10,000 threads need to be in-flight for achieving good hardware utilization and latency hiding.

Thread divergence: In the CUDA Single Instruction Multiple Threads (SIMT) model, divergent control flow between threads causes serialization, and programmers are encouraged to minimize it.

Shared memory: No standard caches are included at the SMs. Instead, a small user-controlled scratch-pad shared memory per SM is provided, and these memories are subject to bank conflicts that limit the performance if the references are not properly optimized. SMs also feature constant and texture caches but these memories are read-only and use separate address spaces.

Memory request coalescing: Better bandwidth utilization is achieved when data layout and memory requests follow a number of temporal and spatial locality guidelines.

Bank conflicts: Concurrent requests to one bank of the shared memory incur serialization, and should be avoided in the code, if possible.

5.1.2 Comparison of the XMT and the Tesla Architectures

The key issues that affect the design of the XMT and the Tesla architectures, and the main differences between them, are summarized in Table 5.1.

The fundamental difference between the Tesla and the XMT architectures is their target applications. GPUs such as Tesla aim to provide high peak performance for regular parallel programs, given that they are optimized using the specific guidelines provided with the programming model. For example, global memory access patterns need to be coordinated for high interconnect utilization and scratchpad memory accesses should be arranged to not cause conflicts on memory banks. However, GPUs can yield suboptimal performance for low degrees of parallelism and irregular memory access patterns.

XMT is designed for high performance on workloads that fall outside the realm of GPUs. XMT excels on programs that show irregular parallelism patterns, such as control flow that diverges between threads or irregular memory accesses. XMT also scales down well on programs with low degrees of parallelism. Hence, the specifications of XMT are not geared towards high peak performance. However, as we will see in following sections, XMT still does not fall behind on regular programs in a significant way.

Table 5.1: Implementation differences between XMT and Tesla. FPU and MDU stand for floating-point and multiply/divide units, respectively.

Memory Latency Hiding and Reduction
  Tesla: Heavy multithreading (requires large register files and a state-aware scheduler); limited local shared scratchpad memory; no coherent private caches at SM or SP.
  XMT: Large globally shared cache; no coherent private TCU or cluster caches; software prefetching.
Memory and Cache Bandwidth
  Tesla: Memory access patterns need to be coordinated by the user for efficiency (request coalescing); scratchpad memories prone to bank conflicts; high-bandwidth interconnection network.
  XMT: Caches relax the need for user-coordinated DRAM access; address hashing for avoiding memory module hotspots; Mesh-of-Trees interconnect able to handle irregular communication efficiently.
Functional Unit (FU) Allocation
  Tesla: Dedicated FUs for SPs and SFUs; less arbitration logic required; higher theoretical peak performance.
  XMT: Heavyweight FUs (FPU/MDU) are shared through arbitrators; lightweight FUs (ALU and branch unit) are allocated per TCU (ALUs do not include multiply/divide functionality).
Control Flow and Synchronization
  Tesla: Single instruction cache and issue per SM to save resources; warps execute in lock-step (penalizes diverging branches); efficient local synchronization and communication within blocks, while global communication is expensive; switching between serial and parallel modes (i.e., passing control from CPU to GPU) requires off-chip communication.
  XMT: One instruction cache and program counter per TCU enables independent progression of threads; coordination of threads can be performed via constant-time prefix-sum, while other forms of thread communication are done over the shared cache; dynamic hardware support for fast switching between serial and parallel modes and for load balancing of virtual threads.

5.2 Silicon Area Feasibility of 1024-TCU XMT

In this section, we aim to establish a configuration of XMT that is feasible to implement, in terms of silicon area, using current technology. The issue of power is investigated separately in Chapter 6. Since we would like to compare the performance of XMT against the GTX280 GPU (see Section 5.4), we assume that a 576 mm2 die, which is the area of GTX280, is also available to us in 65 nm ASIC technology.

We start the section with information on the ASIC implementation of the 64-TCU XMT prototype. This is followed by the area estimation of a 1024-TCU XMT processor configuration (XMT1024), complete with number of TCUs, functional units, shared cache size, and interconnection network (ICN) specifications. The area estimation includes a subsection that contains the details, such as dimensions of modules, required for constructing the XMT1024 floorplan in Section 7.3.

5.2.1 ASIC Synthesis of a 64-TCU Prototype

The 64-TCU integer-only 200MHz XMT ASIC prototype, fabricated in 90 nm IBM technology, serves the dual purpose of constituting a proof-of-concept for XMT and providing the basis for the area estimation of clusters and shared caches (see Table 5.4) presented in the remainder of this section. The power data obtained from the gate level simulations is also utilized in Section 6.1.

The ASIC prototype was a collaborative effort among the members of the XMT research team including the author of this dissertation. We used the post-layout area reported in 90 nm technology for projecting the area of the XMT1024 chip. Synopsys [Syn] and Cadence [Cad] logic synthesis, place and route, physical verification and gate level simulation tools were used for the tape-out.

5.2.2 Silicon Area Estimation for XMT1024

We need to determine a power-of-two configuration of XMT whose chip resources are in the same ballpark as those of the GTX280. We base our estimation on the detailed data from the ASIC implementation of the MoT interconnection network [BQV08, BQV09, Bal08] and the 64-TCU ASIC prototype. Our calculations below show that, using the same generation technology as the GTX280, a 1024-TCU XMT configuration could be fabricated.

Overview and Intuition

A 1024-TCU XMT configuration requires 16 times the cluster and cache module resources of the 64-TCU XMT ASIC prototype, reported by the design tools as 47 mm2. The area of the MoT interconnection network for the envisioned XMT configuration can be estimated as 170 mm2 in 90nm using the data from [Bal08]. Applying a theoretical area scaling factor of 0.5 from the 90nm to the 65nm technology feature size, we obtain an area estimate of (47 × 16 + 170) × 0.5 = 461 mm2. We assume that, with the same amount of engineering and optimization effort put behind it as for the GPUs, XMT could support a comparable clock frequency and the addition of floating point units (whose count, per Table 5.2, is around 20% of that of the GTX280) without a significant increase in the area budget. This is why the XMT clock frequency considered in the comparison is the same as the shader clock frequency (SPs and SFUs) of GTX280, which is 1.3GHz. Note that this estimation does not include the cost of memory controllers. The published die area of GTX280 is 576 mm2 in 65nm technology, and approximately 10% of this area is allocated for memory controllers [Kan08]. It is reasonable to assume that the difference between the GPU area and the estimated XMT area, which is also approximately 20%, would account for the addition of the same number of controllers to XMT. We expect that very limited area will be needed for XMT beyond the sum of these components, since they comprise a nearly full design.

Table 5.2 gives a comparative summary of the hardware specifications of an NVIDIA GTX280 and the simulated XMT configuration. The sharp differences in this table are due to the different architectural design decisions summarized in Table 5.1. From these calculations, we conclude that overall, the configurations of these very different architectures appear to use roughly the same amount of resources.

Detailed Area Estimation

Clusters, cache modules and the ICN are the components that take up the majority of the die area in an XMT chip. In this section, we will project their dimensions in 65nm technology based on our 90nm ASIC prototype implementation reviewed in Section 5.2.1.

                          GTX280                          | XMT-1024
Principal Computational Resources
  Cores                   240 SP, 60 SFU                  | 1024 TCU
  Integer Units           240 ALU+MDU                     | 1024 ALU, 64 MDU
  Floating Point Units    240 FPU, 60 SFU                 | 64 FPU
On-chip Memory
  Registers               1920KB                          | 128KB
  Prefetch Buffers        –                               | 32KB
  Regular caches          480KB                           | 4104KB
  Constant cache          240KB                           | 128KB
  Texture cache           480KB                           | –
Other Parameters
  Pipeline clk. freq.     1.3 GHz                         | 1.3 GHz
  Interconnect clk. freq. 650 MHz                         | 1.3 GHz
  Voltage                 ?                               | 1.15V
  Bandwidth to DRAM       141.7 GB/sec (peak theoretical)
  Fab. technology         65 nm
  Silicon area            576 mm2

Table 5.2: Hardware specifications of the GTX280 and the simulated XMT configuration. In each category, the emphasized side marks the more area-intensive implementation. Values that span both columns are common to GTX280 and XMT1024. The GTX280 voltage was not listed in any of the sources.

Table 5.3: The detailed specifications of XMT1024.

Processing Clusters: 64 clusters × 16 TCUs; in-order 5-stage pipelines; 2-bit branch prediction; 16 prefetch buffers per TCU; 64 MDUs, 64 FPUs; 2K read-only cache per cluster.
Interconnection Network: 64-to-128 Mesh-of-Trees.
Shared Parallel Cache: 128 modules × 32K (4MB total); 2-way associative.

Module  | Area in 95 nm         | Area estimated for 65 nm    | Total area for all modules
Cluster | 2 × 3.6 = 7.2 mm2     | 7.2 × 0.5 × 1.1 = 3.96 mm2  | 3.96 × 64 = 253.4 mm2
Cache   | 1.33 × 1.7 = 2.26 mm2 | 2.26 × 0.5 × 1.1 = 1.24 mm2 | 1.24 × 128 = 159.1 mm2
ICN     | 26.3 + 5.2 = 31.5 mm2 | [26.3 × (52/34) × 0.7 + 5.2 × (52/34)^2 × 0.5] × 1.25 × 2 = 85.2 mm2 (both send and return networks)
Total   |                       |                             | 497.7 mm2

Table 5.4: The area estimation for a 65 nm XMT1024 chip.

Table 5.3 lists the specifications of XMT1024 in detail, and the calculations that lead to the estimate of the total chip area are summarized in Table 5.4. For clusters and caches, we start with the dimensions measured post-layout in 90nm. Cluster dimensions are 2 mm × 3.6 mm, and cache module dimensions are 1.33 mm × 1.7 mm. We apply a factor of 0.5 for estimating the area in the 65nm technology node (see Section 2.7 for area scaling between technology nodes). In future implementations of the XMT processor (e.g., XMT1024), the design of the clusters and the caches is not anticipated to change significantly. We only foresee the inclusion of floating point capability in the clusters (only single-precision for the purposes of this work). Although area optimization was not an objective for the prototype, we predict that an aggressively optimized cluster design would be able to accommodate the addition of a floating point unit within the same area. Moreover, the place-and-route results showed that the space within the cluster bounding box was not fully utilized. To be on the safe side, we add a 10% margin to the area of clusters and cache modules (the 1.1 multiplier in column 3 of the table), which could compensate further for the floating point unit and also for the internal interconnect routing costs expected for a larger system. The last column of the table gives the total area for 64 clusters and 128 cache modules.

The area of the interconnection network is estimated based on the work in [Bal08], which reports the area of a 64 terminal 34-bit ICN in 95 nm. The total logic area was found to be 26.3 mm2 and the wire area was 5.2 mm2. The area of a different configuration can be estimated using the following relation (see [Bal08]):

Wire area ∝ (w_c · d_w)^2    (5.1)
Cell area ∝ w_c · k

where w_c is the bit count, d_w is the wire pitch (at 95nm and 65nm) and k is the transistor feature size (65nm, 95nm, etc.). Another factor that could increase the ICN area is the inclusion of repeaters. Repeaters are added in order to satisfy the timing constraints for high clock frequencies. In [Bal08], it was reported that the area overhead of repeaters for a 16-terminal ICN to attain a clock frequency of 1GHz is 5%, and for 32 terminals the overhead is 12%. The overhead increases roughly linearly with the number of terminals, so we assume a 25% overhead for the 64-terminal ICN we are considering (the 1.25 multiplier in the ICN row of Table 5.4). The XMT chip that we simulate contains two identical interconnection networks, one for the path from the clusters to the caches and another for the way back. Therefore, we multiply the ICN area by 2 to find the total. It should be noted that the XMT chip can work with a single ICN serving both directions, thus saving area at the cost of increased traffic.

This total of 497.7 mm2, compared to GTX280's 576 mm2 die area (our baseline), is reasonable assuming the remaining area is used for the memory controllers. The reason for the difference between the 497.7 mm2 estimated here and the 461 mm2 in this section's overview is the additional 10% margin added to each cluster and cache module for the floating point unit and internal interconnect routing.

5.3 Benchmarks

One of our goals is to characterize a range of single-task parallel workloads for a general-purpose many-core platform such as XMT. We consider it essential for a general-purpose architecture to perform well on a range of applications. Therefore we include both regular applications, such as graphics processing, and irregular benchmarks, such as graph algorithms. In a typical regular benchmark, memory access addresses are predictable and there is no variability in control flow. In an irregular benchmark, memory access addresses and the control flow (if it is data dependent) are less predictable.

We briefly describe the benchmarks used for the comparison next. Since it is our purpose to make the results relevant to other many-core platforms, we select benchmarks that commonly appear in the public domain such as the parallel benchmark suites [CBM+09,NVI09, HB09,SHG09,BG09]. Where applicable, benchmarks use single precision floating point format.

78 Breadth-First Search (Bfs) A traversal of all the connected components in a graph. Several scientific and engineering applications as well as more recent web search engines and social networking applications involve graphs of millions of vertices. Bfs is an irregular application because of its memory access patterns and data dependent control flow.

Back Propagation (Bprop) A machine-learning algorithm that trains the weights of connecting nodes on a layered neural network. It consists of two phases, forward and backward, both of which are parallelizable. The data set contains 64K nodes. Bprop is an irregular parallel application and causes heavy memory queuing.

Image Convolution (Conv) An image convolution kernel with a separable filter. The data set consists of a 1024 by 512 input matrix. Convolution is a typical regular benchmark.

Mergesort (Msort) The classical merge-sort algorithm. It is the preferred technique for external sorting, as well as for sequences which do not permit direct manipulation of keys. The data set consists of 1M keys. Mergesort is an application with irregular memory access patterns.

Needleman-Wunsch (Nw) A global optimization method for DNA sequence alignment, using dynamic programming and a trace-back process to search the optimal alignment. The data set consists of 2x2048 sequences. Nw is an irregular parallel program with a varying amount of parallelism between the iterations of a large number of synchronization steps (i.e. parallel sections in XMTC).

Parallel Reduction (Reduct) Reduction is a common and important parallel primitive in a large number of applications. We implemented a simple balanced k-ary tree algorithm. By varying the arity, we observed that the optimal speed was obtained with k = 128 for the 16M element dataset simulated. Reduction is a regular benchmark.

Sparse Matrix - Vector Multiply (Spmv) One of the most important operations in sparse matrix computations, such as iterative methods to solve large linear systems and eigenvalue problems. The operation performed is y ← Ax + y, where A is a large sparse matrix, and x and y are column vectors. We implemented a naïve algorithm, which uses the compressed sparse row (CSR) sparse matrix representation and runs one thread per row. This is an irregular application due to irregular memory accesses. The dataset is a 36K by

36K matrix with 4M non-zero elements.
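To make the row-per-thread decomposition concrete, the sketch below implements y ← Ax + y over a CSR matrix. It is written in Java purely for illustration (the benchmark itself is coded in XMTC and CUDA), and the class and field names are ours; the body of the outer loop corresponds to one parallel (virtual) thread.

// Sketch of the Spmv kernel described above: y <- A*x + y with A in
// compressed sparse row (CSR) form and one thread per row. Written in Java
// only to illustrate the access pattern.
final class CsrSpmvSketch {
    final int[] rowPtr;     // index of the first non-zero of each row (length n+1)
    final int[] colIdx;     // column index of each non-zero
    final float[] values;   // value of each non-zero

    CsrSpmvSketch(int[] rowPtr, int[] colIdx, float[] values) {
        this.rowPtr = rowPtr; this.colIdx = colIdx; this.values = values;
    }

    void multiplyAdd(float[] x, float[] y) {
        int n = rowPtr.length - 1;
        for (int row = 0; row < n; row++) {        // one thread per row in XMTC/CUDA
            float sum = 0f;
            for (int k = rowPtr[row]; k < rowPtr[row + 1]; k++)
                sum += values[k] * x[colIdx[k]];   // irregular accesses into x
            y[row] += sum;
        }
    }
}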

A summary of benchmark characteristics in XMTC and CUDA is given in Table 5.5. The table includes the lines of code for each benchmark, in order to give an idea of the effort that goes into programming it. Also, the number of parallel sections (spawns for XMT, kernels for CUDA) and the number of threads per section are listed to illustrate the amount of parallelism extracted from each benchmark. The runtime characteristics of the benchmarks are examined in further detail in Chapter 7.

Table 5.5: Benchmark properties in XMTC and CUDA.

Name   | Description                                    | CUDA implementation source                                    | Lines of code (CUDA/XMT) | Dataset              | Parallel sections (CUDA/XMT) | Threads/section (CUDA/XMT)
Bfs    | Breadth-First Search on graphs                 | Harish and Narayanan [HN07], Rodinia benchmark suite [CBM+09] | 290 / 86                 | 1M nodes, 6M edges   | 25 / 12                      | 1M / 87.4K
Bprop  | Back Propagation machine learning algorithm    | Rodinia benchmark suite [CBM+09]                              | 960 / 522                | 64K nodes            | 2 / 65                       | 1.04M / 19.4K
Conv   | Image convolution kernel with separable filter | NVIDIA CUDA SDK [NVI09]                                       | 283 / 87                 | 1024x512             | 2 / 2                        | 131K / 512K
Msort  | Merge-sort algorithm                           | Thrust library [HB09, SHG09]                                  | 966 / 283                | 1M keys              | 82 / 140                     | 32K / 10.7K
Nw     | Needleman-Wunsch sequence alignment            | Rodinia benchmark suite [CBM+09]                              | 430 / 129                | 2x2048 sequences     | 255 / 4192                   | 1.1K / 1.1K
Reduct | Parallel reduction (sum)                       | NVIDIA CUDA SDK [NVI09]                                       | 481 / 59                 | 16M elts.            | 3 / 3                        | 5.5K / 44K
Spmv   | Sparse matrix - vector multiplication          | Bell and Garland [BG09]                                       | 91 / 34                  | 36Kx36K, 4M non-zero | 1 / 1                        | 30.7K / 36K

Figure 5.2: Speedups of the 1024-TCU XMT configuration with respect to GTX280. A value less than 1 denotes slowdown.

5.4 Performance Comparison of XMT1024 and the GTX280

Figure 5.2 presents the speedups of all the benchmarks on a 1024-TCU XMT configuration relative to GTX280. Speedups range between 2.05× and 8.10× for highly parallel irregular benchmarks. The two regular benchmarks (Conv and Reduct) show slowdowns. This is due to the nature of the code, which exhibits regular patterns that the GPUs are optimized to handle, while XMT's ability to dynamically handle less predictable execution flow goes underused. Moreover, Conv on CUDA uses the specialized Tesla multiply-add instruction, while on XMT two instructions are needed.

Table 5.5 shows the number of parallel sections executed and the average number of threads per parallel section for each benchmark. Table 5.6 provides the percentage of the execution time spent executing instructions in different categories as reported by the XMT Simulator. To the best of our knowledge, there is no way of gathering such detailed data from the NVIDIA products at this time.

We observed that benchmarks with irregular memory access patterns such as Bfs, Spmv and Msort spend a significant amount of their time in memory operations. We believe that the high amount of time spent by Bprop is due to the amount of memory queuing in this benchmark. Conv is highly regular with lots of data reuse, and spends less than half of its time on memory accesses; however, it performs a non-trivial amount of floating-point computation (more than 50% of the remaining time).

Table 5.6 shows that in the Nw benchmark, a significant amount of time is spent idling by the TCUs. From Table 5.5, we observe that the number of threads per parallel section is relatively low in this benchmark. In spite of this high idling time, XMT outperforms the GPU by a factor of 7.36x on this benchmark, illustrating the fact that XMT performs well even on code with relatively low amounts of parallelism. The very large number of parallel sections executed for the Nw benchmark (required by the lock-step nature of the dynamic programming algorithm) favors XMT and its low-overhead synchronization mechanism, and explains the good speedup.

Table 5.6: Percentage of time on XMT spent executing memory instructions (MEM), idling (due to low parallelism), integer arithmetic (ALU), floating-point (FPU), integer multiply-divide (MD) and other.

Name   | MEM  | Idle | ALU  | FPU  | MD  | Misc
Spmv   | 71.8 | 2.1  | 6.1  | 19.1 | 0.0 | 0.9
Nw     | 34.7 | 50.0 | 6.0  | 0.0  | 3.2 | 6.2
Bfs    | 94.7 | 1.2  | 2.4  | 0.0  | 0.0 | 1.7
Bprop  | 93.4 | 1.4  | 0.6  | 1.8  | 1.0 | 1.9
Msort  | 63.7 | 21.1 | 4.5  | 3.2  | 1.0 | 6.6
Conv   | 41.1 | 0.2  | 14.4 | 31.5 | 0.0 | 12.8
Reduct | 71.0 | 0.9  | 3.2  | 23.0 | 0.0 | 1.8

5.5 Conclusions

In this chapter, we compared XMT, a general-purpose parallel architecture, with a recent NVIDIA GPU programmed using the CUDA framework. We showed that when using an equivalent configuration, XMT outperformed the GPU on all irregular workloads considered. Performance results on regular workloads show that even though GPUs are optimized for these kinds of applications, XMT does not fall behind significantly – not an unreasonable price to pay for ease of programming and programmer's productivity.

This raises for consideration a promising candidate for the general-purpose pervasive platform of the future: a system consisting of an easy-to-program, highly parallel general-purpose CPU coupled with (some form of) a parallel GPU – a possibility that appears to be underrepresented in the current debate. XMT has a big advantage in ease-of-programming, offers compatibility with serial code and rewards even small amounts of parallelism with speedups over uni-processing, while the GPU could be used for applications where it has an advantage.

Chapter 6

Power/Performance Comparison of XMT1024 and GTX280

In Chapter 5, we compared the performance of a 1024-TCU XMT processor (XMT1024) with a silicon area-equivalent NVIDIA GTX280 GPU. The purpose of this chapter is to show that the speedups remain significant under the constraint of power envelope obtained from GTX280. Given the recent emphasis on power and thermal issues, this study is essential for a complete comparison between the two processors. For many-cores, the power envelope is a suitable metric that is closely related to feasibility and cooling costs of individual chips or large systems consisting of many processors.

Early estimates in a simulation based architecture study are always prone to errors due to possible deviations in the parameters used in the power model. Therefore, we consider various scenarios that represent potential errors in the model. In the most optimistic scenario we assume that the model parameters, which are collected from a number of sources, are correct and we use them as-is. In more cautious scenarios (which we call average case and worst case), we compensate for hypothetical errors by adding different degrees of error margins to these parameters. We show that, for the optimistic scenario, XMT1024 outperforms GTX280 by an average of 8.8x on irregular benchmarks and 6.4x overall. Speedups are only reduced by an average of 20% for the average-case scenario and approximately halved for the worst-case. We also compare the energy consumption per benchmark on both chips, which follows the same trend as the speedups.

6.1 Power Model Parameters for XMT1024

In this section, we establish the simulation power model parameters for the XMT1024 processor specified in Table 5.2. The XMTSim power and thermal models were introduced in Section 4.5. The external inputs of the power model are the Const(Ci) and MaxActPower(Ci) parameters (see Equation (4.2)), where C is the set of microarchitectural components, Const(Ci) is the power of a component that is spent regardless of its activity, and MaxActPower(Ci) is the upper bound on the power that is proportional to the activity.

Component | MaxActPower | Const | Source
Computing Clusters
TCU Pipeline | 51.2W | 13.3W | McPAT
ALU | 122.9W | 20.5W | McPAT
MDU | 21.1W | 6.4W | McPAT, ASIC
FPU | 29W | 3.2W | McPAT, ASIC
Register File | 30.8W | ~0W | McPAT
Instr. Cache | 15.4W | 1W | Cacti
Read-only Cache | 2.7W | 300mW | Cacti
Pref. Buffer | 18.4W | 2W | McPAT
Memory System
Interconnect | 28.5W | 7.2W | ASIC [Bal08]
Mem. Contr./DRAM | 44.8W | 104mW | McPAT, [Sam]
Shared Cache | 58.9W | 19.2W | ASIC

Table 6.1: Power model parameters for XMT1024.

Table 6.1 lists the cumulative values of the Const(Ci) and MaxActPower(Ci) parameters for all groups of simulated components. In most cases, parameter values were obtained from McPAT 0.9 [LAS+09] and Cacti 6.5 [WJ96, MBJ05]. The maximum power of the Samsung GDDR3 modules is given in [Sam], and we base the DRAM parameters in the table on this information.

The design of the interconnection network (ICN) is unique to XMT and its power model parameters cannot be reliably estimated using the above tools. This is also the case for the shared caches of XMT as they contain a significant amount of logic circuitry for serving multiple outstanding requests simultaneously. In fact, Cacti estimates for the shared caches were found to be much lower than our estimates. Instead, the cache parameters are estimated from our ASIC prototype (Section 5.2.1), and the ICN parameters are based on the MoT implementation introduced in [Bal08], as indicated in Table 6.1. The power values obtained from the ASIC prototype were adjusted to the voltage and clock frequency values in Table 5.2 using Equations (2.2) and (2.5). The voltage of the XMT chip in the table was estimated via Equation (2.10). For scaling from 90nm to 65nm, we initially used an ideal factor of 0.5, but this assumption is subject to further analysis in Section 6.4.
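As an illustration of how the ASIC-derived values are adjusted, the sketch below applies the textbook CMOS dynamic-power relation P ~ V^2 * f together with a multiplicative technology-scaling factor. It is only an approximation of the adjustment actually performed with Equations (2.2), (2.5) and (2.10), and all numbers in the example are hypothetical.

// Illustrative sketch only: scales a power value measured on the 90nm ASIC
// prototype to a new voltage/frequency point and technology node, assuming
// the textbook dynamic-power relation P ~ V^2 * f and a multiplicative
// technology-scaling factor. All numbers below are hypothetical.
public class PowerScalingSketch {
    static double scaleDynamicPower(double measuredPower,
                                    double vOld, double vNew,
                                    double fOld, double fNew,
                                    double techScale) {
        double voltageRatio = (vNew / vOld) * (vNew / vOld); // V^2 dependence
        double frequencyRatio = fNew / fOld;                 // linear in f
        return measuredPower * voltageRatio * frequencyRatio * techScale;
    }

    public static void main(String[] args) {
        // Hypothetical example: a 10W block measured at 1.2V/800MHz on 90nm,
        // scaled to 1.0V/1.3GHz with an ideal 0.5 factor for 90nm -> 65nm.
        double scaled = scaleDynamicPower(10.0, 1.2, 1.0, 0.8e9, 1.3e9, 0.5);
        System.out.printf("Scaled power estimate: %.2f W%n", scaled);
    }
}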

6.2 First Order Power Comparison of XMT1024 and GTX280

The purpose of this section is to convey the basic intuition of why XMT1024 should not require a higher power envelope than GTX280. While the remainder of this chapter investigates this question via simulations and measurements, in this section we start with a simple analysis.

As we previously discussed in Section 5.1, XMT is particularly strong in improving the performance of irregular parallel programs, and it achieves this in hardware via a flexible interconnection network topology; TCUs that allow independent execution of threads; and fast switching between parallel and serial modes (i.e., fast and effortless synchronization). Even though XMT contains a very high number of TCUs (and simple ALUs), it does not try to keep all TCUs busy at all times. Recall that each computing cluster features only one port to the interconnection network. The XMT configuration we simulate is organized in 64 clusters of 16 TCUs (Table 5.2), which means that only one of the 16 TCUs in a cluster receives new data to process every clock cycle. As a result, the computation and memory communication cycles of the TCUs in a cluster automatically shift out of phase.

On the other hand, GTX280 features a smaller number of cores and a higher number of floating point units in the same die area. Its interconnection network is designed so that, with proper optimizations, all cores and execution units can be kept busy every clock cycle. As a result, the theoretical peak performance of GTX280 is higher than that of XMT1024.

A simple calculation is as follows. In [Dal], it is indicated as a rule-of-thumb that movement of data required for 1 floating point operation (FLOP) costs 10x the energy of the FLOP itself. If the data is brought from off chip, the additional cost is 20x the energy of a FLOP.

For XMT1024 the maximum number of words that can be brought to the clusters every clock cycle is 64. In the worst, i.e., most power intensive case, every word is brought from off-chip at the 64 words per cycle throughput. Also, assume that only one operand is needed per operation (for example, the second one might be a constant that already resides in the clusters). The energy spent per clock will be equivalent to the energy of 64 × (1 + 10 + 20) = 1984 FLOPS.

Now, we repeat the calculation for GTX280. Assume that the GTX280 ICN can bring 240 words to the stream processors every clock cycle at an energy cost of 30 FLOPS each (10 for the ICN and 20 for the off-chip communication). The energy spent per clock will be equivalent to the energy of 240 × (1 + 10 + 20) = 7440 FLOPS, which is 3.75 times what we estimated for XMT1024.
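The back-of-envelope comparison above can be reproduced with a few lines of arithmetic. The sketch below simply encodes the rule-of-thumb energy weights from [Dal] (1 for the FLOP, 10 for on-chip data movement, 20 for off-chip movement) and the per-cycle word throughputs of the two chips; it is an illustration of the calculation, not a model of either processor.

// Reproduces the first-order energy comparison of Section 6.2.
// Energy is expressed in FLOP-equivalents per clock cycle, assuming one
// operand brought from off chip per operation.
public class FirstOrderEnergy {
    static double flopEquivalentsPerCycle(int wordsPerCycle) {
        final double FLOP = 1.0, ON_CHIP = 10.0, OFF_CHIP = 20.0;
        return wordsPerCycle * (FLOP + ON_CHIP + OFF_CHIP);
    }

    public static void main(String[] args) {
        double xmt = flopEquivalentsPerCycle(64);   // 64 clusters, 1 word each
        double gpu = flopEquivalentsPerCycle(240);  // 240 stream processors
        System.out.println("XMT1024: " + xmt + " FLOP-equivalents/cycle"); // 1984
        System.out.println("GTX280:  " + gpu + " FLOP-equivalents/cycle"); // 7440
        System.out.println("Ratio:   " + gpu / xmt);                       // 3.75
    }
}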

6.3 GPU Measurements and Simulation Results

Power envelope is the main constraint in our study. In this section we aim to show that the XMT1024’s performance advantage (previously demonstrated in [CKTV10]) holds true despite this constraint. First, we provide a list of our benchmarks, followed by the results from the GTX280 measurement setup and the XMT simulation environment. We report the benchmark execution times, power dissipation values and average temperatures on both platforms. We also compare the execution time and energy consumption.

As mentioned earlier, the power envelope of many-core chips is more closely related to their thermal feasibility than is the case with large serial processors. Many-cores can be organized in thermally efficient floorplans such as the one in [HSS+08]. As a result, activity is less likely to be focused on a particular area of the chip for power intensive workloads, and temperature is more likely to be uniformly distributed (unlike serial processors, in which hotspots are common). In some cases, as observed in [HK10], the temperature of the memory controllers might surpass the temperature of the rest of the chip. However, this typically happens for low power benchmarks such as Bfs, which are heavy on memory operations but not on computation.

We thermally simulated an XMT floorplan, in which caches and clusters are organized in a checkerboard pattern and the ICN is routed through dedicated strips. For the most power-hungry benchmarks, the maximum variation between adjacent blocks was only 1C. However, from the midpoint of the chip towards the edges, temperature can gradually drop by up to 4C. This is due to the greater lateral dissipation of heat at the edges and does not change the relationship between power and temperature. These results, as well as the linear relationship between the power and temperature of the GTX280 chip [HK10], strengthen the argument for using the power envelope as a relevant metric.

Name | Type | GTX280 Time | GTX280 Power | GTX280 Tempr. | GTX280 Pidle | XMT1024 Time | XMT1024 Power | XMT1024 Tempr.
Bfs | Irregular | 16.3ms | 155W | 74C | 81W | 1.34ms | 161W | 70C
Bprop | Irregular | 15ms | 97W | 61C | 76W | 2.26ms | 98W | 65C
Conv | Regular | 0.18ms | 180W | 78C | 83W | 0.69ms | 179W | 81C
Msort | Irregular | 33.3ms | 120W | 70C | 79W | 2.77ms | 144W | 71C
Nw | Irregular | 13.4ms | 116W | 67C | 78W | 1.46ms | 136W | 71C
Reduct | Regular | 0.1ms | 156W | 74C | 81W | 0.52ms | 165W | 75C
Spmv | Irregular | 0.9ms | 200W | 80C | 85W | 0.23ms | 189W | 78C

Table 6.2: Benchmarks and results of experiments.

6.3.1 Benchmarks

We use the same benchmarks as in Chapter 5. The details of these benchmarks were given in Section 5.3. Table 6.2 lists the data collected from the runs on GTX280 and XMT simulations along with the parallelism type of each benchmark. Particulars of the measurement and simulation setups will be explained in the subsequent sections. As we will discuss in Section 6.3.3, there exists a correlation between the type and power consumption of a benchmark. The characteristics of the benchmarks are examined in further detail in Chapter 7.

6.3.2 GPU Measurements

A P4460 Power Meter [Intb] is used to measure the total power of the computer system in our experiments. The system is configured with a dual-core AMD Opteron 2218 CPU and an NVIDIA GeForce GTX 280 GPU card, running the RedHat Linux 5.6 OS. The temperatures of the CPU cores and the GPU are sampled every second via the commands sensors and nvclock -T. The clock frequencies of the CPU cores and the GPU are monitored via the commands cat /proc/cpuinfo and nvclock -s.

Our preliminary experiments show that the effect of the CPU performance is negligible on the runtime of our benchmarks. Therefore, for reducing its effect on overall power, the

clock frequency of the CPU was set to its minimum value, 1GHz. While the GPU card does not provide an interface for manually configuring its clock, we observed that during the execution of the benchmarks, core and memory controller frequencies remain at their maximum values (1.3 GHz).

The GTX280 column of Table 6.2 lists the data collected from the execution of the benchmarks on the GPU. The power of a benchmark is computed by subtracting the idle power of the system without the GPU card (98W) from the measured total power. Each benchmark is modified to execute in a loop long enough to let the system reach a stable temperature, and the execution time is reported per iteration. The initialization phases, during which the input data is read and moved to the GPU memory, are omitted from the measurements.1 The idle power of the GPU card is measured at the operating temperature of each benchmark in order to demonstrate its dependency on the operating temperature. The CPU core temperatures deviate at most 2C from the initial temperature, which is not expected to affect the leakage power of the CPU significantly.

6.3.3 XMT Simulations and Comparison with GTX280

The simulation results for XMT1024 are given in the XMT1024 column of Table 6.2. The heatsink thermal resistance was set to 0.15 K/W in order to follow the power-temperature trend observed for GTX280, and the ambient temperature was set to 45C.

We need to ensure that the two most power intensive benchmarks on XMT1024, Spmv and Conv, do not surpass the maximum power on GTX280, which is 200W for Spmv. Under these restrictions, we determined that the XMT1024 chip can be clocked at the same frequency as GTX280, 1.3 GHz.

Figure 6.1 presents the speedups of the benchmarks on XMT1024 relative to GTX280. Figure 6.2 then shows the ratio of benchmark energy on GTX280 to those on XMT1024.

As expected, XMT1024 performance exceeds GTX280 on irregular benchmarks (8.8x speedup) while GTX280 performs better for the regular benchmarks (0.24x slowdown). The trends of the speedups match those demonstrated in Chapter 5; the differences in the exact values are caused by improved simulation models and the newer version of the CUDA compiler used in our experiments.

1 In XMT, the Master TCU and the TCUs share the same memory space, and no explicit data move operations are required. However, this advantage of XMT over Tesla is not reflected in our experiments.

Figure 6.1: Speedups of XMT1024 with respect to GTX280. A value less than 1 denotes slowdown.

Figure 6.2: Ratio of benchmark energy on GTX280 to that on XMT1024.

The two chips show similar power trends among the benchmarks. On XMT1024, the average power of the irregular benchmarks, 138W, is lower than the average power of the two regular benchmarks, which is 168W. A similar trend can be observed for GTX280. In general, we can expect the irregular programs to dissipate less power, as they are usually not computation heavy. However, Spmv is an exception to this rule, as it is the most power intensive benchmark on both XMT and the GPU. In the case of Spmv, irregularity is caused by the complexity of the memory addressing, whereas the high density of floating point operations elevates the power.

The energy comparison yields results similar in trend to the speedup results (a ratio of 8.1 for irregular and 0.22 for regular benchmarks). Energy is the product of power and execution time, and the relative power dissipation among the benchmarks is similar on the two chips, as can be seen in Table 6.2. Therefore, the energy ratios are roughly proportional to the speedups.
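As a concrete example of this proportionality, the sketch below computes the GTX280-to-XMT1024 energy ratio for Bfs directly from the Table 6.2 entries (energy = power × time). It only illustrates the calculation and is not part of the measurement flow.

// Energy ratio example using the Bfs row of Table 6.2 (E = P * t).
public class EnergyRatioExample {
    public static void main(String[] args) {
        double gpuEnergy = 155.0 * 16.3e-3;   // GTX280: 155W for 16.3ms -> ~2.53 J
        double xmtEnergy = 161.0 * 1.34e-3;   // XMT1024: 161W for 1.34ms -> ~0.22 J
        System.out.printf("Bfs energy ratio (GTX280/XMT1024): %.1f%n",
                          gpuEnergy / xmtEnergy);   // ~11.7, close to the 12.2x speedup
    }
}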

6.4 Sensitivity of Results to Power Model Errors

Early estimates in a simulation based architecture study are always prone to errors due to possible deviations in the parameters used in the power model. In this section we will modify some of the assumptions in Section 6.1 that led to the results in Section 6.3.3. These modifications will increase the estimated power dissipation, which will in turn cause the maximum power observed for the benchmarks to surpass that of GTX280. To satisfy the power envelope constraint in our experiments, we will reduce the clock frequency and voltage of the XMT computing clusters and the ICN according to Equation (2.10).2 We show that XMT1024 still provides speedups over GTX280, even at reduced clock frequencies.

The assumptions that we will challenge are the accuracy of the parameter values obtained from the McPAT and Cacti tools, the ideal scaling factor of 0.5 used to scale the power of the shared caches from 90nm to 65nm, and the overall interconnection network power.

6.4.1 Clusters, Caches and Memory Controllers

In their validation study, the authors of McPAT observed that the total power dissipation of the modeled processors may exceed the predictions by up to 29%. Therefore, we added 29% to the value of all the parameters obtained from McPAT to account for the worst-case error. As a worst-case assumption we also set the technology scaling factor to 1, which results in no scaling. In addition to the worst-case assumptions, we explored an average case, for which we used a 15% prediction correction for McPAT and Cacti and a technology scaling factor of 0.75.

Figure 6.3 shows that the average speedups decrease by 5.6% and 21.5% for the average and the worst cases, respectively. The energy ratios of the benchmarks increase by an average of 15.5% for the average case and 42.7% for the worst case (Figure 6.4). For each case, we ran an exhaustive search over cluster and ICN frequencies between 650MHz and 1.3GHz and chose the frequencies that give the maximum average speedup while staying under the power envelope of 200W. For the average case, the cluster and ICN frequencies were 1.1GHz and 1.3GHz, respectively. For the worst case, they were 650MHz and 1.3GHz. The optimization tends to lower the cluster frequencies and keep the ICN frequency as high as possible, since the irregular benchmarks are more sensitive to the rate of data flow than to computation. Also, with the current parameters, the ICN contributes less to the total power than the clusters, so the effect of reducing the ICN frequency on power is relatively lower.

2 We assume that clusters and the ICN are in separate voltage and frequency domains, as in GTX280.

Figure 6.3: Decrease in XMT vs. GPU speedups with average case and worst case assumptions for power model parameters.

Figure 6.4: Increase in benchmark energy on XMT with average case and worst case assumptions for power model parameters.
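The exhaustive frequency search described above can be summarized by the following sketch. The helper methods estimateMaxBenchmarkPower() and simulateAverageSpeedup() are hypothetical stand-ins for full XMTSim runs with the power model of Section 6.1, and the 50MHz step is an assumption made for illustration.

// Hypothetical sketch of the frequency search; not actual XMTSim code.
public class FrequencySearchSketch {
    static final double POWER_ENVELOPE_W = 200.0;   // GTX280 power envelope

    public static void main(String[] args) {
        double bestSpeedup = 0;
        int bestClusterMHz = 0, bestIcnMHz = 0;
        for (int clusterMHz = 650; clusterMHz <= 1300; clusterMHz += 50) {
            for (int icnMHz = 650; icnMHz <= 1300; icnMHz += 50) {
                // Discard design points whose most power-intensive benchmark
                // exceeds the power envelope.
                if (estimateMaxBenchmarkPower(clusterMHz, icnMHz) > POWER_ENVELOPE_W)
                    continue;
                double speedup = simulateAverageSpeedup(clusterMHz, icnMHz);
                if (speedup > bestSpeedup) {
                    bestSpeedup = speedup;
                    bestClusterMHz = clusterMHz;
                    bestIcnMHz = icnMHz;
                }
            }
        }
        System.out.println("Best point: cluster=" + bestClusterMHz
                + " MHz, ICN=" + bestIcnMHz + " MHz");
    }

    // Placeholder stubs; in practice these wrap simulator runs.
    static double estimateMaxBenchmarkPower(int clusterMHz, int icnMHz) { return 0; }
    static double simulateAverageSpeedup(int clusterMHz, int icnMHz) { return 0; }
}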

Conv and Spmv are the two most power intensive benchmarks and they are also the most affected by lowering the cluster frequency. Under best-case assumptions, Conv is very balanced in the ratio of computation to memory operations, and Spmv has a relatively high number of FP and integer operations that it cannot overlap with memory operations. Reducing the cluster frequency slows down the computation phases in both Conv and Spmv. Msort and Nw are programmed with a high number of parallel sections (i.e., high synchronization) and they are also affected by the lower cluster frequency, as it slows down the synchronization process. Bfs, Bprop and Reduct are not sensitive to the cluster frequency since they spend most of the time in memory operations. The trend of

the energy increase data in Figure 6.4 is similar to the data in Figure 6.3; however, unlike Figure 6.3, all benchmarks are affected. This is expected, as the average and worst case assumptions essentially increase the overall power dissipation.

6.4.2 Interconnection Network

The ICN model parameters used in Section 6.1 may be inaccurate due to a number of factors. First, the implementation on which we based our model [Bal08] was placed and routed for a smaller chip area than we anticipate for XMT1024, and therefore might underestimate the power required to drive longer wires. Second, as previously mentioned, the ideal technology scaling factor we used in estimating the parameters might not be realistic. To account for these inaccuracies, we run a study to show the sensitivity of the previously presented results to possible errors in the ICN power estimation.

We assume that the errors could manifest themselves in the form of two parameters that we will explore. The first is Pmax, the ICN power at maximum activity and clock frequency. In terms of the power model parameters in Equation (4.2), Pmax is given by:

Pmax = MaxActPower(ICN) + Const(ICN)    (6.1)

The second is the activity correlation factor (CF) introduced in Equation (2.2). The motivation for exploring CF as a parameter arises from the fact that the ICN is the only major distributed component in the XMT chip. As discussed in Section 3.5.4, efficient management of power (including dynamic power) in interconnection networks is an open research question [MOP+09], which affects the activity-power correlation implied by CF. For example, if clock gating is not implemented very efficiently, the dynamic power may contain a large constant part. The other distributed components are the prefix-sum network and the parallel instruction broadcast, which do not contribute to power significantly. The remainder of the components in the chip are off-the-shelf parts, for which optimal designs exist.
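Equation (2.2) itself is not reproduced in this chapter. Purely for illustration, the sketch below assumes one plausible reading of CF in which a fraction CF of the activity-proportional power tracks the component's activity and the remaining (1 − CF) is dissipated regardless of activity. This assumed form is consistent with the discussion above and with Equation (6.1), but it is not necessarily the exact model used in XMTSim.

// Assumed illustrative form of the role of CF (not the exact Equation (2.2)).
// At activity = 1 the result reduces to MaxActPower + Const, matching the
// definition of Pmax in Equation (6.1); a lower CF raises the power spent at
// low activity, e.g., when clock gating is inefficient.
static double icnPower(double activity, double maxActPower,
                       double constPower, double cf) {
    double activityTracked = cf * activity * maxActPower;
    double activityIndependent = (1.0 - cf) * maxActPower;
    return constPower + activityTracked + activityIndependent;
}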

The values of Pmax and CF in Figures 6.5 and 6.6 (which will be explained next) are given for the maximum clock frequency of 1.3GHz. However, the ICN frequency may be reduced as a part of the optimization process, which will change the effective values of these parameters. For example, assume that MaxActPower(ICN) + Const(ICN) is set to 125W. When the power due to the other components is included, the power envelope required with these parameter values will be more than 200W, which is our constraint. An exhaustive search looking for the maximum speedup point within the power envelope is performed, and the ICN and cluster clock frequencies are then both set to 650MHz. Since the ICN clock and voltage are both lowered, the sum of MaxActPower(ICN) + Const(ICN) decreases to 78W.

Figure 6.5: Degradation in the average speedup with different ICN power scenarios. Pmax and the activity-power correlation (CF) are regarded as variables. Their values are with respect to the 1.3GHz clock frequency.

Fig. 6.5 shows the average speedups for different scenarios of the ICN power model. In order to compress the large amount of data, we only show the average of the speedups of all benchmarks. We plot two CF values, 0.9 (which is the default) and 0.5. As in the previous section, we ran an exhaustive search for cluster and ICN frequencies ranging from 650MHz to 1.3GHz to find the suitable design point. The change in speedups for the CF = 0.9 series of the plot is relatively low, whereas the decline for the CF = 0.5 series is faster. A lower value of CF will cause the ICN to spend more power regardless of its activity, which will increase the overall power dissipation. As a result, the solutions found have lower clock frequencies.

Figure 6.6: Degradation in the average speedup with different chip power scenarios. Pmax for the ICN and best, average and worst case for the rest of the chip are regarded as variables. The ICN activity-power correlation (CF) is set to 0.5. Pmax and CF are with respect to the 1.3GHz clock frequency.

6.4.3 Putting it together

In Fig. 6.6, we combine various scenarios from the previous two sections, namely the variations in Pmax of ICN with the best, average and worst case scenarios for the rest of the chip. The CF parameter for ICN is set to the worst case value of 0.5. For the data points that do not exist in the plot, no solution exists within our search space. It should be noted that even for the worst scenarios the average speedup of XMT is greater than 3x.

6.5 Discussion of Detailed Data

In this section, we will present more detailed data about the characteristics of the simulated benchmarks in order to better explain the results in the previous section.

6.5.1 Sensitivity to ICN and Cluster Clock Frequencies

We have mentioned earlier that, on average, the benchmarks are more sensitive to the ICN clock frequency than to the cluster clock frequency. For this reason, an optimizing search looking for the maximum performance within a fixed power envelope tends to reduce the cluster frequency first. Figures 6.7 and 6.8 show the data that support this observation.

Figure 6.7 is a contour graph of the speedups versus cluster and ICN frequencies. Each point on the plot corresponds to a speedup value for a different set of cluster and ICN frequencies. Both clock frequencies are swept over the range of values that was considered in previous sections (650 MHz to 1.3 GHz). As expected, the lowest speedup, approximately 3.5, is observed when both clocks are at 650 MHz, whereas the highest speedup is 6.3, when they are at 1.3 GHz.

Figure 6.7: Degradation in the average speedup with different cluster and ICN clock frequencies.

Figure 6.8 is extracted from the Ficn = 1.3 GHz and Fcluster = 1.3 GHz intercepts of the plot in Figure 6.7. In other words, first the ICN clock is kept constant at its maximum and the cluster clock is varied, and then this is reversed. The two superimposed plots clearly demonstrate the sensitivity of the speedups to the ICN and cluster frequencies.

A more detailed look at the benchmarks reveals that Bfs, Bprop, Msort and Reduct are the benchmarks that are more sensitive to the ICN frequency. As a regular benchmark with heavy computation, Conv is more sensitive to the cluster frequency. Lastly, Nw and Spmv, which are the most balanced in computation and communication, are equally sensitive to both.

6.5.2 Power Breakdown for Different Cases

Figure 6.9 demonstrates the average of the XMT chip power breakdown over all benchmarks. In the first case (Figure 6.9(a)), the best case assumptions are used for the clusters, caches and the memory controllers. The ICN power is on the optimistic side, with Pmax = 33W and CF = 0.9. For this case, we observe that most of the power is spent on the clusters. In the second case (Figure 6.9(b)), Pmax for the ICN is increased 2.5 times to 83W and CF is set to 0.5. For the rest of the chip we used the average case assumptions. As we increased the ICN power significantly, its percentage of the total grew from 10% to 28%, and the cluster share shrank to 50%. In the last case, Pmax for the ICN is set to 66W, CF to 0.5, and worst case assumptions are used for the rest of the chip. The difference between this case and the previous one is not significant, except that the cluster percentage increased slightly as we lowered the ICN power. The total of the cache and memory controller powers stays at almost the same percentage throughout these experiments.

Figure 6.8: Degradation in the average speedup with different cluster frequencies when ICN frequency is held constant and vice-versa.


Figure 6.9: Power breakdown of the XMT chip for different cases. (a) CF = 0.9, Pmax = 33W and best case assumptions for the rest of the chip. (b) CF = 0.5, Pmax = 83W and average case assumptions for the rest of the chip. (c) CF = 0.5, Pmax = 66W and worst case assumptions for the rest of the chip.

Chapter 7

Dynamic Thermal Management of the XMT1024 Processor

Dynamic Thermal Management (DTM) is the general term for various algorithms used to more efficiently utilize the power envelope without exceeding a limit temperature at any location on the chip. In this chapter, we evaluate the potential benefits of several DTM techniques on XMT. The results that we present demonstrate how the runtime performance can be improved on the 65 nm 1024-TCU XMT processor (XMT1024) of Chapter 6, or any similar architecture that targets single task fine-grained parallelism. The relevance and the novelty of this work can be better understood by answering the following two questions.

Why is single task fine-grained parallelism important? On a general-purpose many-core system, the number of concurrent tasks is unlikely to often reach the number of cores (i.e., thousands). Parallelizing the most time consuming tasks is a sensible way both to improve performance and to take advantage of the plurality of cores. The main obstacle then is the difficulty of programming for single-task parallelism. Scalable fine-grained parallelism is natural for easy-to-program approaches such as XMT.

What is new in many-core DTM? DTM on current multi-cores mainly capitalizes on the fact that cores show different activity levels under multi-tasked workloads [DM06]. In a single-tasked many-core, the source of imbalance is likely to lie in the structures that did not exist in the former architectures such as a large scale on-chip interconnection network (ICN) and distributed shared caches.

Using XMTSim, we measure the performance improvements introduced by several DTM techniques for XMT1024. We compare techniques that are tailored for a many-core architecture against a global DTM technique (GDVFS), which is not specific to many-cores. Following are the highlights of the insights we provide: (a) Workloads with scattered irregular memory accesses benefit more from a dedicated ICN controller (up to 46% runtime improvement over GDVFS). (b) In XMT, cores are arranged in clusters.

Distributed DTM decisions at the clusters provide up to 22% improvement over GDVFS for high-computation parallel programs, yet overall performance may not justify the implementation overhead.

Our conclusions apply to architectures that consider similar design choices as XMT (for example the Plural system [Gin11]) which promote the ability to handle both regular and irregular parallel general-purpose applications competitively (see Section 5.3 for a definition of regular and irregular). These design choices include an integrated serial processor, no caches that are local to parallel cores, and a parallel core design that provides for a true SPMD implementation. We aim to establish high-level guidelines for designers of such systems. While a comprehensive body of previous work is dedicated to dynamic power and thermal management techniques for multi-core processors [KM08], to our knowledge, our work is among the first to evaluate DTM techniques on a many-core processor for single task parallelism.

The floorplan of a processor, i.e., the placement of the hierarchical blocks on the die, is an important factor that affects the processor's thermal efficiency. In the earlier stages of the XMT project (i.e., the 64-TCU XMT prototype), thermal constraints were not a priority. The work in this chapter is the first time that floorplanning for thermal efficiency is taken into consideration for XMT. We simulate two floorplans in order to evaluate the efficiency of the DTM algorithms at different design points.

This chapter is organized as follows. In Section 7.1, we explain the specifics of simulation for thermal estimation and DTM. In Section 7.2, we list our benchmarks and their characteristics. Section 7.3 introduces the floorplans that we evaluate, and Section 7.4 provides background on DTM. Section 7.5 discusses the DTM algorithms and their evaluation. Section 7.6 gives possibilities for future work and Section 7.7 summarizes the related work.

7.1 Thermal Simulation Setup

In this section, we list the parameters used in thermal simulations and elaborate on two simulation related issues that affect evaluation of thermal management: simulation speed and thermal estimation points.

Thermal Simulation and Management Parameters. In most cases, we run simulations for two convection resistance values observed in typical (Rc = 0.1 K/W) and advanced (Rc = 0.05 K/W) cooling solutions (see Section 2.9). In our experience, these two conditions are reasonable representatives of strict and moderate thermal constraints for comparison purposes. Ambient temperature is set to 45C, which is a typical value. For the DTM algorithms, the thermal limit is set at 65C.

Simulation Speed. The primary bottleneck in simulating a many-core architecture for temperature management is the simulation speed. Transient thermal simulations1 usually require milliseconds of simulated time for generating meaningful results, which corresponds to millions of clock cycles at a 1GHz clock. The performance of XMTSim reported in Section 4.3.4 is on par with existing uni-core and multi-core simulators when measured in simulated instructions per second. When measured in simulated clock cycles per second, XMTSim, just as any many-core simulator, is at an obvious disadvantage: the amount of simulation work per cycle increases with the number of simulated cores.

As we have discussed in Section 4.8, simulation speed improvements are on the roadmap of XMTSim. In the meantime, we circumvent the limitations by replacing the transient temperature estimation step in the DTM loop with steady-state estimation. Despite the fact that steady-state is an approximation to transient solutions only for very long intervals with steady inputs, it still fits our case due to the nature of our benchmarks. We observed that the behaviors of the kernels we simulated do not change significantly with larger data sets, except that the phases of consistent activity (which we will discuss with respect to Figure 7.2) stretch in time. Therefore, we interpret the steady-state results obtained from simulating relatively short kernels with narrow sampling intervals as indicators of potential results from longer kernels.

On-chip Thermal Estimation Points. We lump the power of the subcomponents of a cluster together to estimate an overall temperature for the cluster. A higher resolution is not required due to the spatial low-pass filtering effect, which means that a tiny heat source with a high power density cannot raise the temperature significantly by itself [HSS+08]. XMT cores are not large enough to impact the thermal analysis individually. Likewise, one temperature per cache module is estimated. The ICN area is divided into 40 regions with equal power densities, which can be seen in the floorplan figures in Section 7.3. Their power consumption adds up to the total power of the ICN. We estimate the temperature for all regions and report the highest as the overall ICN temperature.

1 For transient versus steady-state simulation, see Section 2.8.

The DTM algorithms that we will evaluate take the temperatures of the clusters and the ICN as input. While optimization of the count and placement of the on-chip thermal sensors is beyond the scope of our work, examples from industry indicate that one sensor per cluster is feasible (for example, IBM Power 7 with 44 temperature sensors [WRF+10]). The ICN temperature can be measured at a few key points that consistently report the highest temperatures.
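As a small illustration of these granularity choices, the sketch below lumps hypothetical per-subcomponent cluster powers into a single value and reduces the 40 per-region ICN temperature estimates to their maximum. The arrays and method names are assumptions made for this sketch, not XMTSim interfaces.

// Illustrative sketch only (assumed data structures, not XMTSim code).
// One lumped power value is produced per cluster, and the hottest of the
// 40 ICN regions is reported as the overall ICN temperature.
static double lumpedClusterPower(double[] clusterSubcomponentPowers) {
    double total = 0.0;
    for (double p : clusterSubcomponentPowers) total += p;   // lump subcomponents
    return total;
}

static double overallIcnTemperature(double[] icnRegionTemperatures) {
    double hottest = Double.NEGATIVE_INFINITY;
    for (double t : icnRegionTemperatures) hottest = Math.max(hottest, t);
    return hottest;                                          // report the maximum
}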

7.2 Benchmarks

We use the same benchmarks that we listed in Section 5.3 with the addition of two regular benchmarks (described below) and a new dataset for the Bfs benchmark. Where applicable, benchmarks use single precision floating point format, except FFT which is fixed point.

Fixed-point Fast Fourier Transform (FFT) A 1-D Fast Fourier Transform with log(N)/2 levels of 4-point butterfly computations, where N = 1M for the dataset. 4-point butterfly operations are preferred over 2-point operations in order to reduce memory accesses. The computations are fixed point and therefore utilize the integer functional units instead of the FP units. The program includes a phase where the twiddle factors are calculated. The simulated data set of FFT results in a high cache hit rate, and therefore we classify it as regular. Yet, its cluster activity is lower than that of the other regular benchmarks, partly because it uses less time consuming integer operations rather than floating point operations.

Matrix Multiplication (Mmult) This is a straightforward implementation of the multiplication of two integer matrices, 512 by 512 elements. Each thread handles one row; therefore, only half of the TCUs are utilized.2 Matrix multiplication is a very regular benchmark.

2 It is possible to implement Mmult with up to 512²-way parallelism; however, the performance does not improve.

We simulated two datasets for Bfs. The first dataset, denoted as Bfs-I, is heavily parallel and contains 1M nodes and 6M edges. The second dataset, Bfs-II, has a low degree of parallelism and contains 200K nodes and 1.2M edges.

7.2.1 Benchmark Characterization

Table 7.1 provides a summary of the benchmarks that we use in our experiments, along with their execution times, distinguishing parallelism, activity and regularity characteristics, and average power and temperature data. Execution times and average power and temperature data are obtained from simulating the benchmarks with no thermal constraints. A global clock frequency of 1.3 GHz and Rc = 0.15 K/W were assumed for these measurements.

The parallelism column reports three types of data: the total number of threads, the total number of spawn (parallel) sections, and a comment on how the threads are distributed among the parallel sections, reflecting the degree of parallelism. The activity column categorizes each benchmark according to its cluster and ICN activity. The regularity column describes benchmark regularity in terms of memory accesses and/or parallelism. More detail on these will follow next.

The degree of parallelism for a benchmark is defined to be low (low-P) if the number of TCUs executing threads is significantly smaller than the total number of TCUs when averaged over the execution time of the benchmark. Otherwise the benchmark is categorized as highly parallel (hi-P). According to Table 7.1, three of our benchmarks, Bfs-II, Mmult and Nw, are identified as low-P. In Mmult, the size of the multiplied matrices is 512 × 512 and each thread handles one row, therefore only half of the 1024 TCUs are utilized. Bfs-II shows a random distribution of threads between 1 and 11 in each one of its 300K parallel sections. Nw shows varying amount of parallelism between the iterations of a large number of synchronization steps (i.e., parallel sections in XMTC).
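The low-P/hi-P distinction above can be phrased as a simple check on simulator statistics. The sketch below assumes a hypothetical per-cycle trace of busy TCUs and an arbitrary 50% utilization threshold, so it only illustrates the definition rather than reproducing the exact classification procedure.

// Illustrative classification only; the busy-TCU trace and the 0.5 threshold
// are assumptions made for this sketch.
static boolean isHighlyParallel(int[] busyTcusPerCycle, int totalTcus) {
    long busyCycleSum = 0;
    for (int busy : busyTcusPerCycle) busyCycleSum += busy;
    double avgUtilization =
            (double) busyCycleSum / busyTcusPerCycle.length / totalTcus;
    return avgUtilization >= 0.5;    // hi-P if most TCUs are busy on average
}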

Among the benchmarks in Table 7.1, we can immediately determine that Spmv has a high degree of parallelism (hi-P) since it has one parallel section with 36K threads. On the contrary, Mmult is a low-P benchmark, containing, again, one parallel section but with

only 512 threads. An inspection of the simulation outputs from the runs of Conv and Reduct, which are summarized in the table, reveals that they are both of hi-P type. Bfs-II shows a random distribution of threads between 1 and 11 per parallel section, hence it is of low-P type. The rest of the benchmarks contain a high number of parallel sections and threads are not distributed to the sections randomly, thus they require a more methodical inspection. For these benchmarks we refer the readers to the plots in Figure 7.1. Every integer point on the x-axis of this figure denotes one parallel section in the order that it appears in the benchmark. The left y-axis is the number of simultaneous virtual threads in a parallel section and the right y-axis is the execution depth of the section as the percentage of the total execution depth.

Three of the parallel sections in Bfs-I (7, 8, and 9) comprise more than 90% of the execution depth and each of these parallel sections execute 100K+ threads. Similarly, approximately 75% of the execution depth of Bprop is spent in the last parallel section, which contains more than 1M threads. Finally, almost all parallel sections in FFT and Msort contain more than 1K threads. Therefore, we categorize the above benchmarks as high-P. Inspecting Nw, we see that half of the parallel sections execute less than 1K threads, which indicates that Nw is a low-P benchmark.

We then proceed to profile the cluster and ICN activity of each benchmark under no thermal constraints. Figure 7.2 provides a representative sample of this data. We observe that some benchmarks consist of relatively steady activity levels, which can be high (Conv), moderate (FFT), or low. Others, such as Msort, have several execution phases, each with a different activity characteristic. Note that cluster and ICN activities are determined by various factors: the computation to memory operation ratio of the threads, the queuing on common memory locations, DRAM bottlenecks, parallelism, etc. For example, the low activity of Bfs-II is due to low parallelism, whereas in case of Bprop the cause is memory queuing (both benchmarks are listed in Table 7.1). The activity profile of a benchmark plays an important role in the behavior of the system under thermal constraints, as we demonstrate in the next section.

[Figure 7.1 panels: Bfs-I, FFT, Nw, Bprop, Msort. x-axis: Parallel Section; left y-axis: Num. Threads; right y-axis: Exec. Depth.]

Figure 7.1: Degree of parallelism in the benchmarks. The benchmarks that are not included in this figure are explained in Table 7.1.

[Figure 7.2 panels: Conv, FFT, Msort, Nw. x-axis: Million Clock cycles; y-axis: Activity (A_CL, A_IC).]

Figure 7.2: The activity plots of the variable activity benchmarks. A_CL stands for the cluster activity and A_IC for the interconnect activity.

Table 7.1: Benchmark properties

Name | Data Set | Sim. time (cycles) | Avg. power | Avg. tempr. | Parallelism | Activity | Type
Bfs-I | 1M nodes, 6M edges | 1.825M | 161W | 70C | T=1M, P=12. Hi-P (see Fig. 7.1). | Cluster: 0.1, ICN: 0.32 (avg) | Irregular
Bfs-II | 200K nodes, 1.2M edges | 135.2M | 97W | 65C | T=700K, P=300K. Low-P: 1 to 11 thrd. per sect. | Very low activity. | Irregular
Bprop | 64K nodes | 3.990M | 98W | 65C | T=1.2M, P=65. Hi-P (see Fig. 7.1). | Cluster=0.12, ICN=0.06 | Irregular
Conv | 1024x512 | 0.885M | 179W | 81C | T=1M, P=2. Hi-P: 500K thrd. per section. | High activity in two phases (see Fig. 7.2). | Regular
FFT | 1M points | 4.905M | 160W | 77C | T=12.7M, P=23. Hi-P (see Fig. 7.1). | Moderate act. w/ multiple phases (see Fig. 7.2). | Regular
Mmult | 512x512 elts. | 10.6M | 165W | 78C | T=512, P=1. Low-P: 512 TCUs utilized. | Cluster=0.5, ICN=0.42 | Regular
Msort | 1M keys | 3.625M | 144W | 71C | T=1.5M, P=140. Hi-P (see Fig. 7.1). | Variable moderate to low act. (see Fig. 7.2). | Irregular
Nw | 2x2048 seqs. | 1.725M | 136W | 71C | T=4.2M, P=4192. Low-P (see Fig. 7.1). | Variable moderate to low act. (see Fig. 7.2). | Irregular
Reduct | 16M elts. | 0.67M | 165W | 75C | T=132K, P=3. Hi-P: 99% of thrds in first sect. | Cluster=0.55, ICN=0.4 | Regular
Spmv | 36Kx36K, 4M non-zero | 0.31M | 189W | 78C | T=36K, P=1. Hi-P. | Cluster=0.5, ICN=0.5 | Irregular

7.3 Thermally Efficient Floorplan for XMT1024

The floorplan of an industrial grade processor is designed to accommodate a number of constraints. Thermal efficiency is usually one of them, and it is what we focus on in this chapter. A thermally efficient floorplan is able to fit more power, hence more work, within a fixed thermal envelope. However, there might be other concerns that take priority over thermal considerations, such as routing complexity. Also, the floorplan might be imposed as a part of a larger system. For example, XMT cores and interconnect might be incorporated, as a co-processor, on top of an existing CPU with a fixed cache hierarchy. Therefore, in our experiments we simulate two floorplans for XMT1024 that depict different design points. The first one (FP1) is a thermally efficient floorplan that we propose, and the second (FP2) represents a compromise in a case where the floorplan organization is restricted by existing constraints. FP1 came ahead in efficiency by a close margin among a number of floorplans we inspected. The other floorplans yielded results close to FP1 and are therefore not included in this chapter; they can be found in Appendix D.

XMT1024 contains 64 clusters of 16 TCUs, and 128 cache modules of 32 KB each (see Table 5.3). The post-layout areas of a cluster and a shared cache module scaled for the 65 nm technology node were given in Section 5.2.2 as 3.96 mm2 and 1.24 mm2 respectively. In the same section total area of the interconnection network (ICN) was calculated as 85.2 mm2. Floorplans discussed in this section preserve the total area of 497.7 mm2 for the clusters, caches and the ICN. The memory controllers are not included in the figures for brevity, but in a full chip, the 8 memory controllers will be aligned along the edges.

We start with FP2 (shown in Figure 7.3), as it is the simpler of the two floorplans, and a direct projection of the 64-TCU prototype floorplan to XMT10243. In this arrangement (also called dance-hall floorplan), the ICN is in the middle, clusters are placed on the left side and the cache modules are placed on the right side of the ICN. The dark box at the top edge of the floorplan is reserved for the Master TCU. FP2 represents a case in which thermal considerations are not a priority.

Figure 7.4 depicts FP1. It is inspired by a study by Huang et al. [HSS+08], who analytically found that the thermal design power (TDP) of a many-core chip can be improved with a checkerboard design. In the checkerboard design, caches and clusters are placed in an alternating pattern. Recall from Section 2.4 that caches can be placed between power intensive cores to alleviate hotspots, as they have lower peak power. The basic building block of this floorplan is a tile of one cluster and two cache modules. (We explain the structure of a tile shortly.) The vertical orientation of every other tile is reversed in order to distribute the caches and the clusters in the chip more evenly. The ICN is split into 3 strips, with the purpose of preventing hotspots that may be caused by programs that are heavy on communication. One strip is placed in the middle of the left half, another in the middle of the right half, and the last one at the center of the chip.

3 All floorplan figures are generated with the HotSpotJ software introduced in Section 4.7.

Figure 7.3: The dance-hall floorplan (FP2) for the XMT1024 chip.

The composition of a tile is given in Figure 7.5. Each cluster is paired with two cache modules, and together they form a near perfect square. The aspect ratios of the clusters and the caches are kept approximately the same as in their 90 nm versions. It is important to note that the physical proximity of the caches to the clusters does not imply reduced memory access times in the programming of XMT. XMT is still a uniform memory access (UMA) architecture.

Next, we will show that the Mesh-of-Trees (MoT) ICN of XMT can feasibly be split into the three strips in FP1 (Figure 7.4) and this division will not add significant complexity to the ICN routing. A brief background on the MoT-ICN can be found in Section 3.1.2.

Figure 7.6 shows conceptually how to divide the Mesh-of-Trees (MoT) interconnect topology into three independent groups of subtrees.4 In each sub-figure, clusters (and cache modules) are split into two groups. Each group represents the clusters (cache modules) on one side of the chip in Figure 7.4. In this example, we focus on the ICN from the clusters to the cache modules; however, the solution can also be applied to the reverse direction. The number of cache modules is twice the number of clusters, so two cache modules share one ICN port via arbitration. The problem can now be reformulated as "Can we partition the ICN so that we have independent circuits that route cluster group A to cache group A (A → A), cluster group B to cache group B (B → B), cluster group A to cache group B (A → B), and finally cluster group B to cache group A (B → A)?". If the answer is yes, the A → A and B → B partitions can be placed on the sides and the A → B and B → A partitions can be placed at the center.

4 The MoT topology was reviewed in Section 3.3.

Figure 7.4: The checkerboard floorplan (FP1) for the XMT1024 chip.

Figure 7.5: The cluster/cache tile for FP1. [Tile dimensions: cluster 2.54 mm × 1.56 mm; each cache module 1.27 mm × 0.98 mm.]

Figure 7.6: Partitioning the MoT-ICN for the distributed ICN floorplan: (a) Example partitioning of the routing from a cluster on the left hand side of the floorplan to all cache modules. (b) Example partitioning of the routing from a cluster on the right hand side of the floorplan to all cache modules.

For the purposes of a simple demonstration, we assume that the fan-in tree of a cache module resides right in front of it and that the fan-out trees from all the clusters span the chip to connect to it. This means that one leaf from each cluster fan-out tree should reach the physical location of the cache module in order to connect to its fan-in tree. An inspection of the MoT topology will show that this can be assumed without loss of generality.

Figures 7.6(a) and (b) show the fan-out trees of Clusters 0 and 32. More specifically, the root and its immediate subtrees are illustrated. As should be clear from the figure, each fan-out tree has one subtree that can be put in one of the A → A or B → B partitions, and one subtree that can be put in one of the A → B or B → A partitions. The only wires for which the routing distance might increase are the ones that start at the tree roots and cross from left to right or right to left (marked with red in the figure). This might incur additional latencies for certain routing paths because of the additional pipeline stages required to meet the timing constraints. However, this is a reasonable cost and can be alleviated via floorplan optimizations. Finally, Figure 7.7 shows how the connections can be mapped onto the floorplan.

Figure 7.7: Mapping of the partitioned MoT-ICN to the floorplan.

7.3.1 Evaluation of Floorplans without DTM

We claimed that, between the two floorplans we consider in this section, FP1 is superior to FP2 in terms of thermal efficiency. Now, we verify this claim via simulation. We define the peak temperature as the highest temperature observed among the on-chip temperature sensors. Assuming that our claim is correct, for a given program FP2 would yield a higher peak temperature than FP1. If the temperature is a constraint, we would have to bring the peak temperature of FP2 down by slowing its clock until the temperature is the same as for FP1. As a result, the program will take longer to finish on FP2. We measure the overhead of a program on FP2, compared to FP1, as:

Overhead = (Runtime_FP2 − Runtime_FP1) / Runtime_FP1    (7.1)

Figure 7.8: Temperature data from the execution of the power-virus program on FP1, displayed as a thermal map. Brighter colors indicate hotter areas; the highest and lowest temperatures are marked with yellow and blue boxes.

Figure 7.9: Temperature data from the execution of the power-virus program on FP2, displayed as a thermal map. Brighter colors indicate hotter areas; the highest and lowest temperatures are marked with yellow and blue boxes.

We repeat experiments for the two heatsink convection resistance values of 0.05 K/W and 0.1 K/W. The initial experiment with an artificial power-virus program shows that

FP2 suffers from a 22% execution time overhead compared to FP1 (for both Rc values) due to the reduced clock frequency required to meet the TDP. The program we simulated in this experiment contains only arithmetic operations and no memory instructions.

Figures 7.8 and 7.9 display the temperature data from the execution of the power-virus program on FP1 and FP2 with Rc = 0.1 K/W. This time the clock frequency was kept the same, at 1.3GHz, in both experiments. Each figure shows maximum, minimum and average temperatures. Brighter colors indicate hotter areas; the highest and lowest temperatures are marked with yellow and blue boxes. The data indicates that the highest temperature in FP2 (370.95K = 97.80C) is approximately 5C higher than it is in FP1. Moreover, the difference between the highest and the lowest temperatures is 18C in FP2, whereas it is almost half that, 9.7C, in FP1. The average temperatures are approximately the same in FP1 and FP2. It is the more uniform distribution of temperature in FP1 that makes it more thermally efficient.

A more realistic evaluation is shown in Fig. 7.10, where we repeat the same analysis on our benchmark set. We choose a baseline clock frequency to accommodate the worst case thermal limitations, as demonstrated by the most power consuming benchmark (Conv).

The limit temperature (Tlim) was set to 65C. The average execution time overhead is 13% for Rc = 0.05 K/W and 16% for Rc = 0.1 K/W.

[Figure 7.10 panels: Rc = 0.05 K/W and Rc = 0.1 K/W. y-axis: Overhead; x-axis: Bfs-II, Bfs, Bprop, Conv, FFT, Mmult, Msort, Nw, Reduct, Spmv.]

Figure 7.10: Execution time overhead on FP2 compared to FP1.

7.4 DTM Background

DTM interacts with the system during runtime and usually operates at the architectural level by utilizing tools such as dynamic voltage and frequency scaling (DVFS) and clock and power gating (reviewed in Chapter 2). These tools can be applied globally or in a distributed manner, where different parts of the chip are on different clock (and, if available, voltage) domains. A comprehensive survey of various power and thermal management techniques in uni-processors and multi-cores can be found in [KM08]. While distributed control is generally more beneficial, it also comes at a higher design effort and implementation cost:

Clock domains: Separate clocks on a chip can be implemented via multiple PLL clock generators and/or clock dividers. Alternatively, Truong et al. [TCM+09] managed to clock the 167 cores on their many-core chip independently using a ring oscillator and a set of clock dividers per core.

Voltage Domains: One of the strategies for multiple on-chip voltage domains is voltage islands [LZB+02]. Voltage islands are silicon areas in a chip that are fed by a common supply source through the same package pin. The advantage of voltage islands is that separate islands can be managed independently according to their activation patterns. For the routing to be feasible, the transistors in the same voltage group should be physically placed together. Another strategy is to distribute two or more power networks to the chip in parallel [TCM+09]. In this scenario, the design is considerably simpler but the voltage choices are limited to two. The two voltages can still be modified dynamically at the global scale.

Distributed power management has become common among state-of-the-art parallel processors. For instance, Intel Nehalem is capable of power gating a subset of cores, which then enables it to raise the voltage and frequency of the active cores to boost their performance without exceeding the power budget [Int08a]. Intel’s 48-core processor goes even further, supporting DVFS with 8 voltage and 28 frequency islands, and allowing software control of frequency scaling [HD+10].

7.4.1 Control of Temperature via PID Controllers

Proportional-Integral-Derivative (PID) controllers are the most common form of feedback control and they are extensively used in a variety of applications from industrial automation to microprocessors [Ben93]. The reason behind their popularity is their simplicity and effectiveness.

PID controllers are also commonly utilized in power and thermal management of processors [DM06,SAS02]. In the DTM algorithms that we introduce in the next section, the PID controller is one of the building blocks and it is defined by the following equation (for one core and one thermal sensor):

u[n] = kp·e[n] + ki·Σ(y=1..n) e[y] + kd·(e[n] − e[n−1])    (7.2)
e[n] = Target − SensorReading

where kp, ki and kd are constants called the proportional, integral and derivative gains, respectively. u[n] is the controller output, Target is the objective temperature, and the error, e[n], is the distance of the current temperature reading from the target. The first term in the equation is the proportional term, the second is the integral term and the third is the derivative term. At the steady state, the proportional and derivative terms are 0 and the integral term alone is equal to the output.

The concept of the PID controller is illustrated in Figure 7.11. An actual implementation of a PID controller requires additional considerations from a practical point of view. First, the output should be filtered from possible invalid values. For example, at times the output might turn to very small negative values due to quantization errors, which should not be allowed. Second, the integral term should be protected against a phenomenon called integral windup. This happens in cases where the power dissipation is so low that the core works at the maximum frequency but the temperature reading is still under the target value (i.e., the error is non-zero). The error will continuously accumulate in the integral term, causing it to increase arbitrarily. Eventually, if power dissipation does increase and the temperature overshoots the target, it will take the integral output a long time to "unwind", i.e., return to a value that is within the meaningful range. Integral windup can be avoided by limiting the range of the integral term. The code given next represents our implementation of the PID controller in XMTSim.

Figure 7.11: PID controller for one core or cluster and one thermal sensor.

// k_p, k_i, and k_d are constant parameters.
// target is the target temperature.
// error and integral persist between calls (controller state).
double controller(double sensor_reading) {
    double last_error = error;
    error = target - sensor_reading;
    // Calculate the proportional term.
    double proportional = k_p * error;
    // Accumulate the integral term.
    integral += k_i * error;
    // Prevent the integral windup.
    if (integral > integral_freeze) integral = integral_freeze;
    if (integral < 0) integral = 0;
    // Calculate the derivative term.
    double derivative = k_d * (error - last_error);
    double controller_output = proportional + integral + derivative;
    // Controller output should not take an invalid value even if it is negligible.
    if (controller_output < lowerBound) controller_output = lowerBound;
    if (controller_output > upperBound) controller_output = upperBound;
    return controller_output;
}

For complex systems, the values of the parameters k_p, k_i, and k_d are determined through well-established procedures. In our case, simple experimentation was sufficient to derive parameter values for a fast-converging control mechanism.

Other controllers (e.g., model predictive control [BCTB11]) have also been proposed in the literature. These controllers can yield better convergence times compared to PID controllers; however, convergence was not an immediate issue in our experiments, therefore we leave incorporating these controllers as future work.

7.5 DTM Algorithms and Evaluation

In this section, we evaluate a set of DTM techniques that can potentially improve the performance of the XMT1024 processor. Our main objective is obtaining the shortest execution time for a benchmark without exceeding a predetermined temperature limit. Note that energy efficiency is not within the scope of this objective.

In our experiments, we evaluated the following DTM techniques that are motivated by the previous work on single and multi-cores [KM08]. We adapted these techniques to our many-core platform.

GDVFS (Global DVFS): All clusters, caches and the ICN are controlled by the same controller. The input of the controller is the maximum temperature among the clusters and the output is the global voltage and frequency values. Global DVFS is the simplest DTM technique in terms of physical implementation; therefore, any other technique should perform better in order to justify its incorporation.
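To make the control flow concrete, the following is a minimal sketch of one GDVFS control step, assuming a PID controller like the one listed in Section 7.4.1 wrapped to return a normalized output in [0, 1]; the class and method names are illustrative and do not correspond to the actual XMTSim code.

import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of a global DVFS (GDVFS) policy: one PID controller,
// driven by the maximum cluster temperature, picks a single normalized
// voltage/frequency setting applied to all clusters, caches and the ICN.
public class GlobalDvfs {
    private final DoubleUnaryOperator pid;   // maps max temperature -> output in [0, 1]
    private final double[] freqLevels;       // discrete normalized DVFS steps

    public GlobalDvfs(DoubleUnaryOperator pid, double[] freqLevels) {
        this.pid = pid;
        this.freqLevels = freqLevels;
    }

    // One periodic control step; clusterTemps holds the latest sensor reading
    // of every cluster. Returns the normalized frequency to apply globally.
    public double controlStep(double[] clusterTemps) {
        double maxTemp = Double.NEGATIVE_INFINITY;
        for (double t : clusterTemps) {
            maxTemp = Math.max(maxTemp, t);
        }
        double raw = pid.applyAsDouble(maxTemp);
        // Snap the controller output to the closest supported frequency level.
        double best = freqLevels[0];
        for (double level : freqLevels) {
            if (Math.abs(level - raw) < Math.abs(best - raw)) {
                best = level;
            }
        }
        return best;
    }
}

CG-DDVFS and FG-DDVFS reuse the same building block, but with separate controller instances for the ICN and, in the latter case, for every cluster.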

CG-DDVFS (Coarse-Grain Distributed DVFS): In addition to the global controller for clusters, the ICN is controlled by a dedicated controller. The input of this controller is the maximum temperature measured over the ICN area. Some many-cores, such as GTX280, already have separate clock domains for the computation elements and the interconnection network, leading us to conclude that the implementation cost of this technique is acceptable.

FG-DDVFS (Fine-Grain Distributed DVFS): Each cluster has a separate voltage island and is controlled by an independent DVFS controller. The input of a controller is the

temperature of the cluster and the output is the voltage and frequency values. The cache frequency is equal to the average frequency of the clusters. The ICN is controlled by a dedicated controller as in CG-DDVFS. The implementation of this technique may be prohibitively expensive on large systems due to the number of voltage islands.

Table 7.2: The baseline clock frequencies

        Rc = 0.05 K/W    Rc = 0.1 K/W
FP1          69%              34%
FP2          60%              29%

LP-DDVFS (Distributed DVFS for Low-Parallelism): This technique is only relevant for the benchmarks that have significant portions of parallel execution during which the number of threads is less than the number of TCUs. In XMT, threads are spread out to as many clusters as possible in order to reduce resource contention. However, this approach prevents us from placing the unused TCUs in the off state (i.e. voltage gating) and the wasted static power cannot be used towards increasing the dynamic power envelope. LP-DDVFS groups the threads into a minimal number of clusters. The unutilized clusters are then placed in the off state, and the saved power is utilized by increasing the clock frequency of the active clusters. The implementation cost of this technique is proportional to the complexity of thread scheduling, which is very low on XMT.
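As a rough illustration of the LP-DDVFS grouping decision, the sketch below computes the minimal number of active clusters for a given thread count and the frequency headroom gained by gating the rest; the identifiers and the linear power model are assumptions made for this example, not the XMTSim implementation.

// Hypothetical sketch of the LP-DDVFS grouping arithmetic.
public final class LpGrouping {
    // Minimal number of clusters needed to host the given number of threads.
    static int activeClusters(int numThreads, int tcusPerCluster, int numClusters) {
        int needed = (numThreads + tcusPerCluster - 1) / tcusPerCluster; // ceiling
        return Math.min(needed, numClusters);
    }

    // Normalized frequency boost for the active clusters, assuming the static
    // power of the gated clusters is returned to the budget and that cluster
    // power scales roughly linearly with frequency at a fixed voltage.
    static double boostedFrequency(int active, int total,
                                   double staticFraction, double maxBoost) {
        double savedBudget = staticFraction * (total - active) / (double) total;
        double boost = 1.0 + savedBudget * total / (double) active;
        return Math.min(boost, maxBoost); // cannot exceed the maximum supported clock
    }

    public static void main(String[] args) {
        int active = activeClusters(256, 16, 64);   // 256 threads -> 16 clusters
        System.out.println(active + " clusters stay on, frequency x"
                + boostedFrequency(active, 64, 0.3, 1.5));
    }
}

With the 1024-TCU configuration (64 clusters of 16 TCUs), a 256-thread phase would occupy 16 clusters and the remaining 48 could be gated; how much of the resulting boost is usable depends on how far the baseline clock sits below the maximum.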

7.5.1 Analysis of DTM Results

To form a baseline for assessing the efficiency of the DTM techniques listed above, we simulated all the benchmarks on an XMT configuration with no thermal management. We assume that such a system is optimized to run at the fastest clock frequency that is thermally feasible for the worst case (i.e. the most active, most power consuming) benchmark, which was shown to be Conv in Section 7.2. Table 7.2 lists these clock frequencies as a percentage of the maximum cluster clock frequency. The table contains one entry per convection resistance and floorplan combination, as they determine the thermal response of the chip. The entry for FP2 and Rc = 0.1K/W requires the lowest clock frequency therefore it constitutes the strictest thermal constraint. Similarly, the entry for FP1 and Rc = 0.05K/W constitutes the least restrictive thermal constraint.
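To get a feel for the size of this penalty, take the strictest entry of Table 7.2 (FP2, Rc = 0.1 K/W), where the baseline runs at 29% of the maximum cluster clock:

\[
\frac{f_{max}}{f_{base}} = \frac{1}{0.29} \approx 3.4
\]

A benchmark that stays well below the thermal limit can therefore, in principle, gain up to roughly 3.4x once DTM restores the full clock frequency, before counting any additional benefit from separately controlled ICN or cache clocks; this is the headroom exploited by the low activity benchmarks discussed next.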

In Figures 7.12 and 7.13, we list the benchmark speedups when simulated with the various DTM techniques. The speedup of a benchmark is calculated using the following formula:

S = \frac{Exec_{base}}{Exec_{dtm}}    (7.3)

where Exec_base and Exec_dtm are the execution times on the baseline system and with thermal management, respectively. The benchmarks are divided into graphs according to their overall activity levels, where the top row is for the high activity benchmarks and the bottom row is for the low activity benchmarks. Within each row there are four subgraphs, each of which corresponds to one of the four entries in Table 7.2.

As a general trend, the benchmarks that benefit the most from the DTM techniques are benchmarks with low cluster activity factors, namely Bfs-I, Bfs-II, Bprop, Nw and Msort (note the scale difference between the y axes of the two rows of Figures 7.12 and 7.13). The highest speedups are observed for Bfs-II with the DTM techniques that incorporate a dedicated ICN controller, CG-DDVFS and FG-DDVFS (up to 46% better speedup than GDVFS). There are two underlying reasons for this behavior. First, Bfs-II is mainly bound by memory latency, hence it benefits greatly from the increased ICN frequency. Moreover, it has very low overall activity and runs at the maximum clock speed without even nearing the limit temperature. On the other extreme, Conv has the highest cluster activity and was used as the worst case in determining the feasible baseline clock frequency, and therefore shows the least improvement in most experiments.

We also observe that as the thermal constraints become stricter, the average speedup increases and the variation between speedups widens. This trend is clearly visible when comparing the low activity benchmarks with the two Rc values. In order to clarify the cause for this phenomenon, recall that the benchmark runtimes with DTM are compared against the runtimes with the baseline clock frequency, which is optimized for the worst case. The no-DTM case penalizes the lowest activity benchmarks unnecessarily, and this penalty increases further as the thermal constraints tighten (as can be seen from Table 7.2). On the other hand, if DTM is present, the overhead of the low activity benchmarks will not change as significantly with tighter thermal constraints.

In the remainder of this section we will concentrate on the performance of individual DTM techniques and conclude by revisiting the effect of floorplan on performance.

[Plots: top row High Activity Benchmarks (spmv, reduct, conv, mmult, fft), bottom row Low Activity Benchmarks (bfs-i, bfs-ii, bprop, nw, msort); panels for Rc = 0.05 K/W and Rc = 0.1 K/W; bars for GDVFS, CG-DDVFS and FG-DDVFS; y-axis: speedup.]

Figure 7.12: Benchmark speedups on FP1 with DTM. Each graph shows results with a different convection resistance (Rc) and floorplan combination. The benchmarks are grouped into high (top graphs) and low (bottom graphs) cluster activity. Note that the two groups have different y-axis ranges.

[Plots: top row High Activity Benchmarks (spmv, reduct, conv, mmult, fft), bottom row Low Activity Benchmarks (bfs-i, bfs-ii, bprop, nw, msort); panels for Rc = 0.05 K/W and Rc = 0.1 K/W; bars for GDVFS, CG-DDVFS and FG-DDVFS; y-axis: speedup.]

Figure 7.13: Benchmark speedups on FP2 with DTM. Each graph shows results with a different convection resistance (Rc) and floorplan combination. The benchmarks are grouped into high (top graphs) and low (bottom graphs) cluster activity. Note that the two groups have different y-axis ranges.

7.5.2 CG-DDVFS

The CG-DDVFS technique is only effective when the thermal constraints are not very tight (i.e., Rc = 0.05K/W) and generates good results on a subset of the low activity benchmarks: Bfs-I, Bfs-II, Nw and Msort, which run below the thermal limit for the majority of their execution time. This allows them to maximize the ICN clock frequency, providing speedups of 17% (Msort), 25% (Bfs-I), 30% (Nw) and 46% (Bfs-II) over GDVFS on both floorplans. With Rc = 0.1K/W, all benchmarks but Bfs-II reach the thermal limit temperature. In this case, the gains are not significant except for a few noticeable cases: Bfs-I (14% over GDVFS) and Nw (10% over GDVFS). Bfs-II's gain stays at 46% as it is still under the thermal limit. A common property of these benchmarks, aside from the low overall activity factors, is how the cluster and ICN activities compare: their ICN activity is higher than their cluster activity, as opposed to the rest of the benchmarks, for which it is the opposite. At the source of this observation lies the fact that a higher ICN-to-cluster activity ratio is common in programs with irregular memory accesses and low computation. An exception is the Bprop benchmark, for which the bottleneck is queuing on data. For the remainder of the benchmarks, we observed that CG-DDVFS may hurt performance by up to 12% with respect to GDVFS.

Insight: For a system with a central interconnection component such as XMT, workloads that are characterized by scattered irregular memory accesses usually benefit more from dedicated interconnect thermal monitoring and control. This information can influence the choice of DTM for a system that targets such workloads. We also saw that dedicated ICN control based on thermal input can hurt performance for regular parallel benchmarks with high computation ratios. This observation provides a general-purpose system with a good decision mechanism for identifying the instances when CG-DDVFS should be applied. We incorporated dynamic activity monitoring into the control algorithm of CG-DDVFS, which falls back to GDVFS whenever the ICN activity is lower than the cluster activity, as sketched below. The speedup values in Figures 7.12 and 7.13 reflect this addition to CG-DDVFS.
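The following is a minimal sketch of that fallback decision, under the assumption that per-interval activity counters for the ICN and the clusters are available to the DTM algorithm; the identifiers are illustrative rather than the actual XMTSim code.

// Hypothetical sketch of the activity-based fallback added to CG-DDVFS.
// When the monitored ICN activity falls below the average cluster activity,
// the dedicated ICN controller is bypassed and the ICN follows the global setting.
public final class CgDdvfsFallback {
    static double icnFrequency(double icnActivity, double avgClusterActivity,
                               double dedicatedIcnFreq, double globalFreq) {
        if (icnActivity >= avgClusterActivity) {
            return dedicatedIcnFreq;  // CG-DDVFS: keep the dedicated ICN controller
        }
        return globalFreq;            // fall back to GDVFS behavior
    }

    public static void main(String[] args) {
        // Example: ICN activity 0.15 vs. cluster activity 0.40 -> use the global setting.
        System.out.println(icnFrequency(0.15, 0.40, 1.0, 0.6));
    }
}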

7.5.3 FG-DDVFS

As indicated in Section 7.2, the activity of clusters does not vary significantly over the chip area at a given time, when considering a single task and a time window comparable to the time it takes for a significant change in temperature to occur. Therefore, the only benefit that FG-DDVFS can provide to a single-tasking system is a result of the temperature difference between the middle of the die, where the clusters are hotter, and the edges. A distributed DTM technique requires fine granularity of hardware control for voltage and frequency, which is costly in terms of added hardware complexity. Consequently, the associated benefits of such a scheme need to justify the added overhead. As can be seen in Figures 7.12 and 7.13, the added value of FG-DDVFS is more apparent for higher activity benchmarks such as Spmv, Reduct, Conv and FFT. For these benchmarks the combined effect of ICN and distributed cluster scaling provides a speedup of up to 22% (Reduct on FP2 with Rc = 0.05K/W) over GDVFS.

Insight: Individual temperature monitoring and control for computing clusters may be worthwhile even in a single-tasking system with fairly uniform workload distribution. The gains are noteworthy for regular parallel programs with high amounts of computation. Conversely, the overall performance of FG-DDVFS on the low activity benchmarks may not justify its added cost for some systems.

An interesting insight into the operation of FG-DDVFS is as follows. Typically, with a global cluster controller, the edges of the chip will be cooler than the thermal limit temperature. A per-cluster controller mechanism will try to pick up this thermal slack by increasing the clock frequency at the edge clusters. However, as the temperature of the edge clusters rises, so does the temperature of the center clusters, and the controllers will respond by converging at lower clock frequencies.

7.5.4 LP-DDVFS

The LP-DDVFS technique is relevant only for a small subset of the simulated benchmarks - those with low parallelism. The only two benchmarks to which LP-DDVFS would apply are Mmult and Nw, both of which were classified as low-p type (see Table 7.1). The third low parallelism benchmark, Bfs-II, would not benefit from this technique for the simulated parameters. This is because Bfs-II already operates at the maximum clock frequency, meaning that the reduced power consumption resulting from voltage gating the clusters in the off state cannot be translated into a performance gain. The LP-DDVFS technique provides performance gains of 10%-12% over GDVFS for the low parallelism benchmarks.

Insight: In a clustered architecture such as XMT, threads of programs with low levels of parallelism can be assigned to clusters in two ways with conflicting benefits. On the one hand, distributing threads between clusters increases performance due to reduced contention on shared resources. On the other hand, when threads are grouped, off clusters can be voltage gated, allowing us to route the saved static power towards increasing the clock frequency of the active ones. The optimal choice is benchmark dependent. For example, a benchmark such as Bfs-II does not benefit from the latter. This fact demonstrates the need for dynamic control of assignment of threads to clusters.

7.5.5 Effect of Floorplan

We revisit Figure 7.10, this time including the effect of thermal management. Figure 7.14 compares the runtimes of the benchmarks on the two floorplans with DTM. As expected, the low-activity benchmarks show the least difference between the two floorplans since they operate at near maximum clock speeds (except for Bfs-II when Rc = 0.05K/W , which is bound by the ICN activity). For the rest of the benchmarks, application of DTM seems to lessen the differences between the floorplans.

7.6 Future Work

XMTSim opens up a range of possibilities for evaluating power and thermal management algorithms in the many-core context. In this chapter, we evaluated thermal management algorithms in a single-tasked environment. Some other possibilities are as follows.

New metrics for management. Total energy consumption and power envelope are some of the other relevant metrics that can be considered for dynamic management. Total energy is especially important for mobile devices, which are limited by battery life, and

data centers, where the cost of energy adds up to significant amounts. Power envelope, similar to the thermal envelope, is related to the feasibility of the system. In this case, the power plan for the processor may enforce a specific power envelope. Also, in large systems such as server farms, it might be desirable to modify the power density for certain groups of processors dynamically.

[Two panels: Rc = 0.05 K/W and Rc = 0.1 K/W; y-axis: execution time overhead; bars for GDVFS, CG-DDVFS and FG-DDVFS per benchmark.]
Figure 7.14: Execution time overheads on FP2 compared to FP1 under DTM.

Dynamic thermal/power management in a multi-tasking environment. Consider an XMT system with multiple master TCUs. Such a system would be capable of running multiple parallel programs concurrently, each program assigned to one master TCU. The total number of parallel TCUs would be shared among the master TCUs statically or dynamically. In this environment, dynamic management becomes an optimization problem with many variables. Given a set of parallel tasks, the algorithm should first choose the tasks that should be run at the same time. Then, it should decide on an appropriate power/performance point for each task. The parameters for fixing the power/performance point are the number of parallel TCUs to be allocated for the task and the frequency/voltage values of those TCUs.

7.7 Related Work

A vast amount of literature is dedicated to the thermal management of systems with dual-, quad- or 8-core processors (for a summary, see the book by Kaxiras and Martonosi [KM08]). However, the interest in the thermal management of more than tens of cores on a chip is relatively recent. A factor that stands in the way of extended research in this area is the scarcity of widely accepted power/thermal-enabled simulators, as well as the lack of consensus on the programming model and architecture of such processors. Many of the current research papers on the thermal management of many-cores start with fewer than 100 cores. They assume a pool of uncorrelated serial benchmarks as their workload and capitalize on the variance in the execution profiles of these benchmarks. Some examples are [LZWQ10, KRU09, GMQ10]. However, it is simply not realistic to assume that an OS will provide a sufficient number of independent tasks to occupy all the cores in a general-purpose many-core with 1000s of cores. It is evident that single-task parallelism should be included in the formulation of many-core thermal management. Even though it focuses on power rather than thermal management and considers only up to 128 cores, the study by Ma et al. [MLCW11] should be mentioned, as they simulate a set of parallel benchmarks and in this respect are closer to our vision.

Chapter 8

Conclusion

In this thesis we investigated the power consumption and the thermal properties of the Explicit Multi-Threading (XMT) architecture from a high-level perspective. XMT is a highly-scalable on-chip many-core platform for improving single-task parallelism. One of the fundamental objectives of XMT is ease-of-programming, and our study aims to advance the high impact message that it is possible to build a highly scalable general-purpose many-core computer that is easy to program, excels in performance and is competitive on power. We hope that our findings will be a trailblazer for future commercial products. More concretely, the contributions of this thesis can be summarized as follows:

• We implemented XMTSim, a highly-configurable cycle-accurate simulator of the XMT architecture, complete with power estimation and dynamic power/thermal management features. XMTSim has been vital in exploring the design space for the XMT computer architecture, as well as establishing the ease-of-programming and competitive performance claims of the XMT project. Furthermore, the capabilities of XMTSim extend beyond the scope of the XMT vision. It can be used to explore a much greater design space of shared memory many-cores by a range of researchers such as algorithm developers and system architects.

• We used XMTSim to compare the performance of an envisioned 1024-TCU XMT processor (XMT1024) with an NVIDIA GTX280 many-core GPU. We enabled a meaningful comparison by establishing the XMT configuration that is silicon area-equivalent to GTX280. Results from the tape-out of a 64-TCU XMT ASIC prototype, to which we contributed at the synthesis stage, were partly used in the derivation of the XMT configuration. Simulations show that the XMT1024 provides 2.05x to 8.10x speedups over GTX280 on irregular parallel programs, and slowdowns of 0.23x to 0.74x on regular programs. In view of these results, XMT is well-positioned for the mainstream general-purpose platform of the future, especially when coupled with a GPU.

• We extended the XMT versus GPU comparison by adding a power constraint, which is crucial for supporting the performance advantage claims of XMT. A complete comparison of XMT1024 and GTX280 should essentially show that not only are they area-equivalent, but also that XMT1024 does not require a higher power envelope than GTX280 for achieving the above speedups. Our initial experiments suggested this is indeed the case; however, power estimation of a simulated processor is subject to unexpected errors. Therefore we repeated the experiments assuming that our initial power model was imperfect in various aspects. We show that for the best-case scenario XMT1024 outperforms GTX280 by an average of 8.8x on irregular benchmarks and 6.4x overall. Speedups are only reduced by an average of 20% for the average-case scenario and approximately halved for the worst case. Even for the worst case, XMT is a viable option as a general-purpose processor given its ease-of-programming.

• We explored to what extent various dynamic thermal management (DTM) algorithms could improve the performance of the XMT1024 processor. DTM is well studied in the context of multi-cores up to 10s of cores but we are among the first to evaluate it for a many-core processor with 1000+ cores. We observed that in the XMT1024 processor with fine-grained parallel workloads, the dominant source of thermal imbalance is often between the cores and the interconnection network. For instance, a DTM technique that exploits this imbalance by individually managing the interconnect can perform up to 46% better than the global DTM for irregular parallel benchmarks. We provided several other high-level insights on the effect of individually managing the interconnect and the computing clusters.

The material in Chapters 4 and 5 has appeared in [KTC+11] and [CKTV10] as peer reviewed publications. The work presented in Chapters 6 and 7 is currently under review for publication.

Appendix A

Basics of Digital CMOS Logic

The majority of modern commercial processors are designed with Metal-Oxide Semiconductor (MOS) transistors. Complementary-MOS (CMOS) is also the most common circuit design methodology used in constructing logic gates with MOS transistors. In this appendix, we give a brief background on the power dissipation and speed of CMOS circuits.

A.1 The MOSFET

Metal-Oxide Semiconductor Field Effect Transistors (MOSFETs, or MOS transistors) are the building blocks of CMOS circuits. For the purposes of digital circuits, a MOSFET is expected to act as an ideal switch. The vertical cross section of a MOSFET as well as the circuit symbols for n and p types (explained next) are given in Figure A.1. The gate is the control terminal and when an ideal transistor is on, it forms a low-resistance conductive path (i.e., channel) between the source and the drain terminals1. The connection does not exist when the transistor is off. Two types of MOS transistors are used in CMOS circuits: (i) an n-channel MOS (nMOS) transistor is on when high voltage is applied to its gate (VGS = (VG − VS) > vth) and, (ii) a p-channel MOS (pMOS) transistor is on at low voltage input (VGS = (VG − VS) < −vth). vth is the transistor threshold voltage and it is a positive number. The value of VGS − vth for nMOS or −VGS − vth for pMOS is called the gate overdrive. Higher gate overdrive values result in faster and more ideal switch behavior.

A.2 A Simple CMOS Logic Gate: The Inverter

Figure A.2 shows a CMOS inverter, which is the simplest of CMOS logic gates. We use it to demonstrate the working principles of CMOS gates.

1 The fourth terminal, the body (or substrate) connection, is not crucial for the purposes of this introduction.


Figure A.1: (a) Vertical cross-section of a MOSFET along its channel. (b) Circuit symbols of pMOS and nMOS transistors.


Figure A.2: The CMOS inverter.

The inverter consists of two transistors: the nMOS (M2 in the figure) is on when its gate voltage is high and the pMOS (M1 in the figure) is on when its gate voltage is low. Since binary logic assumes only two voltage levels - low and high - it is clear that only one transistor can be on at a given time. Ideally, the one that is off acts as an open circuit and, at any given time, the output is connected to either Vdd (when the input is low and M1 is on) or to ground (when the input is high and M2 is on). For the transistor that is on, the value of |VG − VS| is equal to Vdd, therefore the gate overdrive is Vdd − vth.

Other logic functions can be obtained by replacing the pMOS and nMOS transistors of the inverter with more complex pMOS and nMOS transistor networks, so-called pull-up and pull-down networks. The overall circuit still operates according to the same principle: only one of the pull-up and pull-down networks can have a conductive path at any given time.

Average time that it takes for a CMOS gate to switch between states (td) depends on the drive strength of the transistors in it. The drive strength is a function of the supply voltage and the gate overdrive [SN90]:

t_d \propto \frac{V_{dd}}{(V_{dd} - v_{th})^a}    (A.1)

The constant a is a technology-dependent parameter between 1 and 2, with a typical value close to 2. As a result, the gate delay decreases (and the achievable clock frequency increases) as the supply voltage is raised. The equation ceases to hold for very high values of Vdd due to the velocity saturation phenomenon, which causes the effective value of a to drop down to 1.
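As a numeric illustration of Equation (A.1), assume (hypothetically) v_th = 0.3V and a = 2; lowering the supply from 1.2V to 0.9V then stretches the gate delay by

\[
\frac{t_d(0.9\,\mathrm{V})}{t_d(1.2\,\mathrm{V})} = \frac{0.9/(0.9-0.3)^2}{1.2/(1.2-0.3)^2} \approx \frac{2.50}{1.48} \approx 1.7,
\]

which is why dynamic voltage scaling must lower the clock frequency together with the supply voltage.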

Switching time, td, is also proportional to the gate load capacitance. The output of a CMOS gate is essentially connected to the inputs of other gates. The total capacitance at the output node, which is the sum of the parasitic capacitances of all connected gates, is called the load capacitance. When the logic state of the gate changes, the output capacitance charges (or discharges) to the new voltage. Thus, the switching speed of a gate is determined by its load capacitance together with its drive strength, which, as noted above, is a function of the supply voltage and the gate overdrive [SN90].

A.3 Dynamic Power

Charging and discharging of capacitive loads (switching power) and short-circuit currents are the two mechanisms that cause dynamic power dissipation in CMOS circuits. In modern circuits, short-circuit currents are usually negligible and switching power is dominant.

Figure A.3 illustrates the series of events that lead to dynamic power dissipation in an inverter gate. The low voltage state (ideally 0V) is assumed to correspond to binary value 0 and the high voltage state (ideally Vdd) to binary value 1. At the initial and final states (Figs. A.3(a) and A.3(d)), no dynamic power is spent. Short circuit current is observed only during the brief time that both transistors conduct, as shown in Fig. A.3(b). Switching current exists during the entire transition. In the following subsections, we explain the switching power and short circuit power in detail.


Figure A.3: Dynamic currents during the logic state transition of an inverter from 0 to 1: (a) initial state, no dynamic power is spent, (b) short circuit (Isc) and switching current (Isw) observed during transition, (c) Isw continues until the transition is completed, (d) new state.

A.3.1 Switching Power

In CMOS logic gates, binary states are represented with high and low voltage levels and switching from one level to another results in charging or discharging of the load capacitance at the output of the gate. In an aggressively optimized logic circuit, the switching power is proportional to the amount of work performed. Unlike the other power components, it cannot be brought down to a negligible value via conservative design constraints or advanced technologies.

The overall switching power of a logic circuit consisting of many gates, Psw, is described as follows [Rab96]:

P_{SW} \propto C_L V_{dd}^2 f \alpha    (A.2)

CL is the average load capacitance of the gates, Vdd is the supply voltage, f is the clock frequency and α is the average switching probability of the logic gate output nodes.
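For example, if a DVFS step lowers both the supply voltage and the clock frequency by 20% (assuming, as is common in DVFS, that the two are scaled together), Equation (A.2) predicts that the switching power drops to roughly half:

\[
\frac{P_{SW}'}{P_{SW}} = (0.8)^2 \cdot (0.8) \approx 0.51.
\]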

A.3.2 Short Circuit Power

During the logic state switch of a gate, for a brief moment the input voltage sweeps intermediate values. Intermediate levels at the input of a gate cause both the pull-up and pull-down circuits (M1 and M2 for the inverter in Figure A.2) to conduct for a short period of time as illustrated in Figure A.3(b). The resulting power consumption is described by the following equation [Vee84].

P_{SC} \propto \frac{W}{L} \, (V_{dd} - 2 V_{th})^3 \, \tau f    (A.3)

W and L are the channel width and length in a CMOS transistor, Vth is the threshold voltage and τ is the average input rise and fall time. As mentioned earlier, short circuit power is negligible in comparison to the switching power.

A.4 Leakage Power

An ideal MOSFET switch is expected not to conduct any current between its drain and source terminals in the off state. Also, the gate of the transistor is intended to act as an ideal insulator at all times. In reality, these assumptions do not hold and in addition to the active power, the transistor consumes power due to various leakage currents. The amount of leakage has become significant in the deep sub-micron era. Below is a list of the leakage currents that we discuss in this section and Figure A.4 illustrates these currents on a vertical cross-section of a MOS transistor.

I1 – Subthreshold leakage

I2 – Gate oxide tunneling leakage

I3 – Gate induced drain leakage (GIDL)

I4 – Junction leakage


Figure A.4: Overview of leakage currents in a MOS transistor.


Figure A.5: Subthreshold (Isub) and gate (Igate) leakage currents during inactive states of a CMOS inverter (a) transistor output is logic-low (b) transistor output is logic-high.


Among the currents listed above, subthreshold leakage has historically been the dominant one. Gate-oxide tunneling has gained importance with the scaling of transistor gate oxides. Junction leakage along with GIDL becomes significant in the presence of body biasing techniques that might reduce other types of leakage (discussed in Section 2.3.1).

Figure A.5 demonstrates the subthreshold and gate leakage currents on logic-high and logic-low states of an inverter gate. As the figure shows, subthreshold leakage causes a constant current path from supply voltage to ground and gate leakage induces current through the gate of the off transistor.

A.4.1 Subthreshold Leakage

Subthreshold leakage power (Psub) typically dominates the off-state power in modern devices. It is caused by the inability of the transistor to switch off the path between its

source and drain terminals and, as we will show, it is strongly related to the threshold voltage. Figure A.6 depicts the drain to source current (IDS) as a function of gate voltage (VGS). The section of the curve where VGS is lower than the threshold voltage plots the subthreshold leakage current (Isub).

Figure A.6: Drain current versus gate voltage in an nMOS transistor with constant VDS. The inverse of the subthreshold slope has typical values ranging from 80 to 120 mV per decade.

Psub and Isub are modeled via the following equations [ZPS+03].

P_{sub} = I_{sub} \cdot V_{dd}    (A.4)

I_{sub} = \mu_0 \cdot C_{ox} \cdot \frac{W}{L} \cdot e^{b(V_{dd} - V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-V_{dd}/v_t}\right) \cdot e^{\frac{-|v_{th}| - V_{off}}{n \cdot v_t}}    (A.5)

According to Equation (A.4), leakage power is a function of several parameters, most importantly the supply voltage (Vdd), the temperature, which is included in the thermal voltage (vt = kT/q, where k is the Boltzmann constant and q is the electron charge), and the MOS threshold voltage (vth). The parameters vt and vth are both functions of temperature. We will further elaborate on this equation next.
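As a quick check of the magnitudes involved, the thermal voltage at room temperature and at the 110C upper end considered later in this appendix is

\[
v_t = \frac{kT}{q} \approx 25.9\,\mathrm{mV} \ \text{at } T = 300\,\mathrm{K} \qquad \text{and} \qquad v_t \approx 33\,\mathrm{mV} \ \text{at } T = 383\,\mathrm{K},
\]

so the exponential terms in the model are divided by a quantity of only a few tens of millivolts, which is what makes subthreshold leakage so sensitive to both the threshold voltage and the temperature.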

Literature contains a vast number of research papers written on the subject of subthreshold leakage (and leakage in general), which represent its dependence on the parameters listed above in various ways (e.g., [ZPS+03, BS00]). This is a result of the inevitable complexity of the model and the hidden interdependencies between the parameters of Equation (A.4). The changes brought by advancing technologies and the miniaturization of transistors further exacerbate the complexity. For example, the charge mobility, µ0, is a function of temperature; however, some papers list it as a constant, which is an adequate approximation in some cases. For our purposes, we will reduce Equation (A.4) to a simpler form which is sufficient to explain trends in computer design.

The term Cox · W/L contains the effect of the geometry of the transistor (gate oxide thickness, transistor channel width and length) and can be interpreted as a technology node dependent constant, TECH. µ0, as mentioned earlier, is cited either as a constant or as proportional to a power of the temperature (T^{-1.5} in [GST07]), depending on the context. Combined with v_t^2, we will express this term as a polynomial function of temperature, ρ(T). The term (1 − e^{−V_{dd}/v_t}) is approximately 1 for the values of Vdd ([0.9V, 1.2V]) and T ([27C, 110C]) we consider, so it will be omitted. The effective threshold voltage of the transistor, vth, is a function of temperature and is expressed as

v_{th} = v_{th0} - c \cdot (T - T_0)    (A.6)

where vth0 is the base threshold voltage, c is a constant and T0 is the initial temperature. b, Vdd0, n, and Voff are technology dependent constants (Voff might depend on temperature but is taken as a constant here). Also, recall that vt ∝ T. Finally, all terms of Equation (A.4) are arranged into the following relation (exp(.) signifies an exponential dependency in the form of exp(x) = e^{kx}, where k is a constant).

P_{sub} \propto TECH \cdot \rho(T) \cdot V_{dd} \cdot \exp(V_{dd}) \cdot \exp\!\left(-\frac{v_{th0}}{T}\right) \cdot \exp\!\left(-\frac{1}{T}\right)    (A.7)

It should be noted that, while the simpler equation explains the behavior of subthreshold leakage for past and recent technologies, it might not hold below the 32nm technology node with the addition of factors such as short channel effects. Also, so far we have only focused on a single transistor. Further steps are required to carry the analysis to the chip scale since the coefficients of the equation vary for different transistors on the chip. Nevertheless, the observations we listed for Equation (A.7) hold at the macro level, which is the purpose of this discussion.

A.4.2 Leakage due to Gate Oxide Scaling

Gate oxide thickness (see Figure A.1) is a technology parameter that has been aggressively scaled to control the threshold voltage and improve switching performance. However, scaling of the gate oxide comes at the cost of triggering additional leakage mechanisms that have not been of concern before. These mechanisms are gate oxide tunneling leakage and gate induced drain leakage (GIDL). Explicit equations for the associated currents are difficult to obtain; hence, we only give the intuition on the factors that affect these currents.

Gate oxide tunneling leakage is a result of quantum tunneling, where electrical charges tunnel through an insulator (barrier). This phenomenon is especially detectable for thin barriers (i.e., gate oxide) and high potential energy differences (i.e., voltage), therefore it is very sensitive to gate oxide thickness (tox) and supply voltage. It has a weak dependency on temperature. At 70nm, with tox = 1.2nm (a typical value), Vdd = 0.9V and room temperature (300K), gate leakage is on the order of 40nA/µm [ZPS+03].

GIDL is caused by the tunneling effect at the overlap of gate and drain and it can become a limiting factor for the adaptive body biasing technique which is mentioned in Section 2.3.1.

A.4.2.1 Junction Leakage

Junction leakage is observed at the drain/body and source/body junctions. There can be multiple mechanisms contributing to junction leakage such as tunneling effects or reverse bias junction leakage. Similar to GIDL, the effect of junction leakage becomes relevant especially if reverse body biasing technique is used [MFMB02, KNB+99].

Appendix B

Extended XMTSim Documentation

This appendix contains detailed documentation of XMTSim, including installation instructions, a command line usage manual, a software architecture overview, the programming API, and coding examples for creating new actors, activity monitors, etc.

B.1 General Information and Installation

XMTSim is typically used with the XMTC compiler, which is a separate download package. It can be used standalone in cases where the user directly writes XMT assembly code.

To use XMTSim, you must:

• Download and install the XMTC compiler (typically).

• Download and install XMT memory tools (optional).

• Build/install XMTSim.

The XMT toolchain can be found at

http://www.umiacs.umd.edu/users/vishkin/XMT/index.shtml#sw-release

and also on Sourceforge

http://sourceforge.net/projects/xmtc/

B.1.1 Dependencies and install

The pre-compiled binary distribution consists of a Java jar file and a bash script file. Due to the platform-independent nature of Java, this distribution is platform independent as well. The cygwin/linux dependent bash script is distributed for convenience and is not an absolute requirement. In future distributions, the simulator may include platform dependent components.

System requirements:

• You must have Sun Java 6 (JRE - Java Runtime Environment) or higher on your system. The Java executable must be on your PATH, i.e., when you type "java -version" on the command line you should see the correct version of Java. XMTSim might work with other implementations of Java that are equivalent to Sun Java 6 (or higher); however, it is only tested with the Sun version of Java. Note that the XMTC compiler and memory tools come with their own set of system requirements that are independent of the simulator.

• In order to use the script provided in the package, you must have bash installed on your system. XMTSim can directly be run via the "java -jar" command. Read the xmtsim script if you would like to use the simulator in such a way.

Follow these steps to install XMTSim:

a) Create a new directory of your choice and place the contents of this package in the directory. Example: ~/xmtsim

b) Make sure java is on the PATH:

> java -version

The command should display the correct java version.

c) Add the new directory to the PATH. Example (bash):

> export PATH=~/xmtsim:$PATH

d) Test your installation:

> xmtsim -version
> xmtsim -check

Type “xmtsim -help” and “xmtsim -info” for information on how to use the simulator. For detailed examples see the XMTC Manual and Tutorial [CTBV10].

B.2 XMTSim Manual

This manual lists the usage of all command line controls of XMTSim and also includes a brief manual of the trace tool. The manual is also available on the command line and can be displayed via xmtsim -info all.

Usage: xmtsim [ | -infile ] [-cycle ] [-conf ] [-confprm ] [-timer ] [-interrupt ] [-starttrace ] [-savestate ] [-checkpoint ,...] [-stop | <+num>] [-activity | -activity=] [-actout ] [-actsw] [-preloadL1 | -preloadL1=] [-debug | -debug=] [-randomize ] [-count | -count=] [-trace | -trace=] [-mem | simple | paged>] [-binload ] [-textload ] [-loadbase

] [-bindump ] [-textdump ] [-hexdump ] [-dumprange ] [-dumpvar ] [-printf ] [-out ] [-traceout ] [-argfile ]

For running a program, choose from the appropriate options listed in square braces ([...]). A pipe character (|) means one of the multiple variants should be chosen. Mandatory parameters to an option are indicated with angle braces (<...>). Optional parameters are indicated with angle braces with a question mark (). Options indented under ’cycle’ option can only be used when ’cycle’ is specified.

Below are different forms of the xmtsim command that show options that should be used standalone and not with each other or with the ones above.

xmtsim -resume xmtsim (-v|-version) xmtsim (-h|-help) xmtsim -info [

OPTIONS

-h -help Display short help message. -info Display this info message. -v -version Display the version number. -check Runs a simple self test. -diagnose [-conf ] Runs a simple cycle-accurate diagnostics program to show how fast the user system is under full load (all TCUs working). Depending on the number (0 - default, 1, 2, etc.) a different test will be run. -conf can optionally be used to change the default configuration. -diagnoseasm Runs a simple diagnostics program in assembly simulation mode to show how fast the user system is under full load. Depending on the number (0 - default, 1, 2, etc.) a different test will be run. -conftemplate Creates a new template file that can be modified by the user and passed to the ’conf’ option. The file created will have the name ’configName.xmtconf’. -checkconf As input, takes a file that is typically passed to the ’conf’ option. Checks if the field types and values are all correctly defined, if all fields exist and if all the configuration parameters are set in the input file. -listconf As input, takes a file that is typically passed to the ’conf’ option. Lists all the field names and their values sorted according to names. Can be used to compare two conf files. For this option to return without an error the conf file should pass checkconf with no errors. -argfile Reads arguments from a text file and inserts them on the command

143 line at the location that this argument is defined. Lines starting with the ’#’ character are ignored. Multiple argument files can be defined using multiple occurrences of this parameter, ex: -argfile file1.prm -argfile file2.prm. The parameters from these file will be inserted in the order that they appear on command line. -cycle Runs the cycle accurate simulation instead of the assembly simulation. For obtaining timing results, this option should be used. It comes with a list of sub-options. -timer Provides an updated cycle count every 5 seconds. The default value of 5 can be overridden by passing an integer number after the timer option. Timer option can be used in tandem with the count (or detailed count) option to display the detailed instruction counts as well as the cycle count. If passed integer is followed with a ’~’ character (ex. -timer 2500~), the information will be printed periodically in simulation clock cycles rather than real time. -interrupt If this option is used, the cycle accurate simulation will be interrupted before completion and the simulator will exit with an error value. Interrupt will happen after N minutes after the simulation starts where N is the value of this parameter. If the simulation is completed before the set value, it will exit normally. -conf Reads the simulator cycle accurate hardware configuration. This might be a built-in configuration or an externally provided configuration file. The search order is as follows: 1. Search among the built-in configurations. 2. Search for the file .xmtconf 3. Search for the file The built-in configurations are ’1024’, ’512’, ’256’ (names signify the tcu counts in the configuration) and ’fpga’ (64 tcu configuration, similar to the Paraleap FPGA computer). -confprm Sets the value of a configuration parameter on command line. It will overwrite the values set by the -conf option. -starttrace Starts the tracing (as specified by the -trace option) only after cycles instead of from the beginning of the simulation. -stop Schedules the simulation to stop at a used defined time. If actsw+num is passed the stop time will be relative to the actsw instruction. The latter requires actgate option. If +num is used in conjunction with the resume option, simulation will stop at current time plus num. -activity -activity= This is an experimental option that logs activity of actors that implement ActivityCollectorInterface. For a list of

144 available options, try -activity=help. Requires -cycle option. -actout Redirects the output of the activity option to a file. If no activity option is defined, this option will not have an effect except for creating an empty file. -actgate Gates the output of the activity collector unless its state is on. The state can be switched on and off via the actsw instruction in assembly. Initial state is off. The activity collection mechanism still works in the background but it doesn’t print its output. -savestate Dumps the state of the simulator in a file. This option is used in order to pause simulation and restart it at a later time. The paused simulation is resumed via the resume option. Simulation can be stopped and state can be saved in three different ways: simulation can terminate naturally at the end of the execution of the input (meaning a halt, hex/textdump, bindump, ... instruction is encountered), at a user defined time via the ’stop’ argument or a via a SIGINT (CTRL-C) interrupt. If the simulation ends naturally the state dump is only useful for inspection with a debugging tool since there is nothing to resume. The output file should not already exist or the command will fail with an error. All command line arguments that are used during this call will be lost during the save operation. Exceptions are the arguments that directly affect the state of the simulation such as the input file, ’binload’, ’conf’, etc. Other exceptions are the ’count’ and ’activity’ arguments. If the activity option is using a user provided plug-in that cannot be saved (not implementing the serializable interface), it will be lost as well. Arguments that have been lost can be redefined while the simulation is resumed (see the ’resume’ argument). Also see: checkpoint -checkpoint ,... Used to dump states during execution without quitting the simulation. Should be used with the savestate option. The names of the state files are based on the filename passed to savestate. They will be appended with .[time] suffix. Checkpoint option takes a mandatory comma separated list of numbers which represent the clock cycles that the states will be saved. The state dump events have low priority, meaning the state will be saved after all events of a clock cycle are processed. The state will still be saved at the end as if savestate option was used stand-alone. Also see: savestate -resume Resumes a simulation that has been paused by the ’savestate’ option. Implies the ’cycle’ flag. This argument can be used with other command line arguments such as trace, count, bindump, dumpvar, out, printf. This is how a user can redefine the options that were lost during save state (see savestate argument). However command line

145 arguments that directly affect the state of the simulation should not be used; the results are undefined. Such arguments are redefinition of input file, infile, conf, preloadL1, check/warnmemreads, bin/textload and loadbase. If count and activity arguments were defined in the original run, redefining them during resume will not have an effect. "-activity=disable" option can be used to remove an activity collection plug-in that was saved from the original run. -trace -trace= By default dumps out the instruction results filtering out all instructions marked as skipped. When used with additional options -trace is a powerful tool to monitor the system for tracing instructions through the hardware and reading results of instructions. For more information see the "Trace Manual" section below. -count -count= Displays the number of instructions executed. In case of parallel programs, this will be the total number of instructions executed by all TCUs. The instructions that are marked by the @skip directive are not counted. The simulator can have only one counter. To change the default counter specify the plugin:[class path]. The full path for the built-in counters (ones in the utility package of the simulator) is not required, only the class name is sufficient. For a list of available options, try -count=help. -binload Load the data memory image from a binary file. This option is compatible with the XMT Memory Map Creator tool. -textload Load the data memory image from a text file. The format of the text file is, numerical values of consecutive words separated by white spaces. Each word is a 64-bit signed integer.This option is compatible with theXMT Memory Map Creator tool. -loadbase

Loads the data file specified by a -binload or -textload option at the address
. Default address is 0. -bindump Dump the contents of the data memory to the given file in little-endian binary format. This option is compatible with the XMT Memory Map Reader tool. Either a range of addresses using -dumprange, or a global variable using -dumpvar needs to be specified. -textdump Dump the contents of the data memory to the given file in text format. The output file is in the same format described for ’textload’ option. Either a range of addresses using -dumprange, or a global variable using -dumpvar needs to be specified. -hexdump Dump the contents of the data memory to the given file in hex text format.

146 Either a range of addresses using -dumprange, or a global variable using -dumpvar needs to be specified. -dumprange Defines the start and end addresses of the memory section that will be dumped via ’hex/textdump’ or ’bindump’ options. Without the ’hex/textdump’ or ’bindump’ options, this parameter has no effect. -dumpvar Marks a global variable to be dumped after the execution via ’hex/textdump’ or ’bindump’ options. This option can be repeated for all the variables that need to be dumped. Without the ’hex/textdump’ or ’bindump’ options, this parameter has no effect. -out Write stderr and stdout to an output file. The display order of stderr and stdout will be preserved. If the printf option is defined, output of printf instructions will not be included. If the traceout option is defined, output of traces will not be included. -printf Write the output of printf instructions to an output file. -traceout Write the output of traces to an output file. -mem | simple> Sets the memory implementation used internally. Default is paged. Paged memory allocates memory locations in pages as they are needed. This allows non-contiguous accesses over a wide range (i.e. up to 4GB) without having to allocate the whole memory. For example, if the top of stack (tos) is set to 4GB-1, simulator will allocate one page that contains the tos and one page that contains address 0 at the beginning instead of allocating 4GB of memory. In this memory implementation all addresses are automatically initialized to 0, however it is considered bad coding style to rely on this fact. This is the default memory type. Debug parameter is used to keep track of initialized memory addresses for code debugging purposes. Paged and simple memory implementations do not report if an uninitialized memory location is being read (remember that they automatically initialize all addresses to 0), whereas the debug implementation reports a warning or an error. If no additional parameter is passed to debug, simulator will print out a warning whenever an uninitialized address is read. If ’err’ is passed as a parameter (-mem debug err), simulation will quit with an error for such accesses. If a decimal integer address is passed as a parameter a warning will be displayed every time this address is accessed (read or write) regardless of its initialization status. Users should be aware that underlying memory implementation for debug is a hashtable which is quite inefficient in terms of storage/speed, therefore it should not be used for programs with large data sets. Simple memory allocates a one dimensional memory with no

147 paging. It remains as an option for internal development and should not be chosen by regular users. -infile If the name of the input file starts with a ’-’ character it can be passed through this argument. Otherwise this argument is not required. -preloadL1 -preloadL1= Preloads the L1 cache with data in order to start the cache warm. If used with no number it should be used with -textload or -binload, in which case the passed binary data will be preloaded into L1 caches. If a number is provided and no -textload or -binload is passed, given number of words will be assumed preloaded with valid garbage (!) startng from the data memory start address. Latter case is intended for debugging assembly etc. If the data to be preloaded is larger than the total L1 size, smaller addresses will be overwritten. -debug -debug= Used to interrupt the execution of a simulation with the debug mode. If a cycle time is specified, the debug mode will be started at the given time. If not, the user can interrupt execution to start the debug mode by typing ’stop’ or just simply ’s’ and pressing enter. In the debug mode, a prompt will be displayed, in which debugging commands can be entered. Debug mode allows stepping through simulation and printing the states of objects in the simulation. For a list of commands, type ’help’. Commands in debugging mode can be abbreviated. -randomize Used with cycle-accurate simulation to introduce some deterministic variations. The flag expects one argument that will be used as the seed for the pseudo-random generators.

TRACE TOOL MANUAL

Trace option can take additional options in the form below

-trace=

Each option is separated with commas and no white spaces exist. Following are the list of trace options:

track Displays instruction package paths through all the hardware actors. This option cannot be used unless the -cycle option is specified for the simulator. result Displays the dynamic instruction traces. tcu= Limits the instructions traced to the ones that are generated from the TCU with the given hardcoded ID. directives

Displays only the instructions that are marked in the assembly source (see below for a list of directives). This option can only be used with tcu=. It will not display an instruction if it is marked as ’skip’.

’-trace’ with no additional options is equivalent to ’-trace=result’.

What are the assembly trace directives? Assembly programmers can manually add prefixes to lines of assembly to track specific instructions. This feature is activated from command line via the -trace=directives option. Otherwise all such directives are ignored.

A trace directive should be prepended to an assembly line. Example:

@track addi $1, $0, 0

Following is the list of directives:

@skip: Excludes the instruction from execution and job traces. Check for the specific trace command line option you are using for exceptions.

@track : Turns on the track suboption just for this instruction. If a number is specified, only the instructions that are generated from the TCU with the given hardcoded ID will be tracked.

@track : Turns on the track suboption just for this instruction. If a number is specified, only the instructions that are generated from the TCU with the hardcoded ID that is given with this directive and/or on command line via tcu= will be tracked.

@result : Turns on the result suboption just for this instruction. If a number is specified, only the instructions that are generated from the TCU with the hardcoded ID that is given with this directive and/or on command line via tcu= will be tracked.

B.3 XMTSim Configuration Options

The configuration options of XMTSim are passed in a text file via the conf command line option or one by one via the confprm option. The default configuration file, which is listed in this section, can also be generated using the conftemplate option of XMTSim.

The initial configuration given here models a 1024-core XMT in 64 clusters. The DRAM clock frequency is 1/4 of the cluster clock frequency, to emulate a system with an 800MHz core clock frequency and a 200MHz DRAM controller. It features 8 DRAM ports with a 20 DRAM-clock-cycle latency.

# Number of clusters in the XMT chip excluding the master TCU cluster.
int NUM_OF_CLUSTERS 64

# Number of TCUs per cluster.
int NUM_TCUS_IN_CLUSTER 16

# Number of pipeline stages in a TCU before the execute stage.
# (Initially the stages are IF/ID and ID/EX).
int NUM_TCU_PIPELINE_STAGES 3

# ******************************
# PARAMETERS
# ******************************

# Period of the Cluster clock. This number is an integer that
# is relative to the other constants ending with _T.
int CLUSTER_CLOCK_T 1

# Period of the interconnection network clock. This number is an
# integer that is relative to the other constants ending with _T.
int ICN_CLOCK_T 1

# Period of the L1 cache clock. This number is an integer that
# is relative to the other constants ending with _T.
int SC_CLOCK_T 1

# Period of the DRAM clock for the simplified DRAM model. This
# number is an integer that is relative to the other constants
# ending with _T.
int DRAM_CLOCK_T 4

# ******************************
# FU PARAMETERS
# ******************************

# Number of ALUs per cluster. Has to be in the interval
# (0, NUM_TCUS_IN_CLUSTER].
int NUM_OF_ALU 16

# Number of shift units per cluster. Has to be in the interval
# (0, NUM_TCUS_IN_CLUSTER].
int NUM_OF_SFT 16

# Number of branch units per cluster. Has to be in the interval
# (0, NUM_TCUS_IN_CLUSTER].
int NUM_OF_BR 16

# Number of multiply/divide units per cluster. Has to be in the
# interval (0, NUM_TCUS_IN_CLUSTER].
int NUM_OF_MD 1

# Latency in terms of Cluster clock cycles.
int DECODE_LATENCY 1

# Latency in terms of Cluster clock cycles. Does not
# include the arbitration latency.
int ALU_LATENCY 1

# Latency in terms of Cluster clock cycles. Does not
# include the arbitration latency.
int SFT_LATENCY 1

# Latency in terms of Cluster clock cycles. Does not
# include the arbitration latency.
int BR_LATENCY 1

# If true, branch prediction in TCUs will be turned on.
boolean BR_PREDICTION true

# The size of the branch prediction buffer (i.e. maximum number
# of branch PCs for which prediction can be made).
int BRANCH_PREDICTOR_SIZE 4

# The number of bits for the branch predictor counter.
int BRANCH_PREDICTOR_BITS 2

# Latency in terms of Cluster clock cycles. Does not
# include the arbitration latency. Does not include
# the MD register file latency.
int MUL_LATENCY 6

# Latency in terms of Cluster clock cycles. Does not
# include the arbitration latency. Does not include
# the MD register file latency.
int DIV_LATENCY 36

# Latency of mflo/mtlo/mfhi/mthi operations in terms of
# Cluster clock cycles. Does not include the arbitration
# latency. Does not include the MD register file latency.
int MDMOVE_LATENCY 1

# Latency of the MD unit internal register file in terms
# of Cluster clock cycles.
int MDREG_LATENCY 1

# Latency in terms of base cluster clock cycles.
int PS_LATENCY 12

# This is the half-penalty for a TCU that is requesting
# a PS to a global register that is not the one that is
# currently being handled. See the simulator technical
# report for details. This is in terms of base cluster
# clock cycles.
int PS_REG_MATCH_PENALTY 4

# Latency at the cluster input in terms of interconnection
# network clock cycles.
int LS_RETURN_LATENCY 1

# Latency in terms of base cluster clock cycles.
# Found empirically from the FPGA via microbenchmarks.
# This is an average but might not exactly match all cases due
# to mechanism differences between the simulator and FPGA.
int SPAWN_START_LATENCY 23

# Latency in terms of Cluster clock cycles. This is the latency
# of the SJ unit to return to serial mode after all TCUs go idle.
# Cannot be 0.
int SPAWN_END_LATENCY 1

# *******************************************
# FLOATING POINT FUNCTIONAL UNITS PARAMETERS
# *******************************************

# Number of Floating Point ALUs per cluster.
int NUM_OF_FPU_F 1

# All following latencies are in terms of Cluster clock cycles.
# They do not include the arbitration latency.
int MOV_F_LATENCY 1
int ADD_SUB_F_LATENCY 11
int MUL_F_LATENCY 6
int DIV_F_LATENCY 28
int CMP_F_LATENCY 2
int CVT_F_LATENCY 6
int ABS_NEG_F_LATENCY 1

# ******************************
# MEMORY PARAMETERS
# ******************************

# Total number of memory ports. Has to be a power-of-two multiple of
# NUM_OF_CLUSTERS.
int NUM_CACHE_MODULES 128

# Total number of DRAM ports. Has to be of the form
# NUM_CACHE_MODULES / 2^k with k >= 0. Default is one DRAM port per
# cache module (no contention). This parameter is ignored if the
# memory model does not simulate DRAM.
# NUM_CACHE_MODULES should be a multiple of this parameter. If they
# are not equal, the type of the DRAM port that will be instantiated
# is SharedSimpleDRAMActor; otherwise it is SimpleDRAMActor.
int NUM_DRAM_PORTS 8

# If the ICN_MODEL parameter is set to "const", this value
# will be used as the constant ICN latency. The delay time
# will be equal to ICN_CLOCK_T x CONST_ICN_LATENCY.
int CONST_ICN_LATENCY 50

# ******************************************
# ADDRESS AND CACHE PARAMETERS
# ******************************************

# The number of words (not bytes) that a cache line contains.
# The number of bytes in a word is defined by MEM_BYTE_WIDTH in
# xmtsim.core.Constants.
int CACHE_LINE_WIDTH 8

# Master Cache parameters

# Size of the MasterCache in bytes.
int MCACHE_SIZE 16384

# Master cache can serve only this many different cache line misses. For
# example, if the master cache receives 5 store word instructions all of
# which are misses to different cache lines, the 5th instruction will stall.
int MCACHE_NUM_PENDING_CACHE_LINES 4

# Master cache can serve only this many different misses for a
# given cache line. For example, if the master cache receives 9 store
# word instructions all of which are misses to the same cache line,
# the 9th instruction will stall.
int MCACHE_NUM_PENDING_REQ_FOR_CACHE_LINE 8

# L1 parameters

# Size of the L1 cache in bytes (per module).
int L1_SIZE 32768

# Associativity of the L1 cache. 1 for direct mapped and
# Integer.MAX_VALUE for fully associative.
# Note that setting this value to Integer.MAX_VALUE has the same
# effect as L1_SIZE / (CACHE_LINE_WIDTH * MEM_BYTE_WIDTH).
int L1_ASSOCIATIVITY 2

# L1 cache can serve at most this many different pending cache line
# misses. For example, if the master cache sends 9 store word instructions
# all of which are misses to different cache lines, the 9th instruction
# will stall.
int L1_NUM_PENDING_CACHE_LINES 8

# L1 cache can serve at most this many different pending misses for a
# given cache line. For example, if the L1 cache receives 9 store word
# instructions all of which are misses to the same cache line,
# the 9th instruction will stall.
int L1_NUM_PENDING_REQ_FOR_CACHE_LINE 8

# The size of the DRAM request buffer, which is the module where the
# requests from L1 to DRAM wait until they are picked up by DRAM.
# NOTE: Setting this to a size that is too small (8) causes deadlocks.
# Deadlocks can be encountered even with bigger sizes if the
# DRAM_LATENCY is not large enough.
# NOTE2: In Xingzhi's thesis, this buffer is called L2_REQ_BUFFER for
# historical reasons.
int DRAM_REQ_BUFFER_SIZE 16

# The size of the DRAM response buffer, which is the module where the
# responses from DRAM to L1 wait until they are picked up by L1.
# NOTE: In Xingzhi's thesis, this buffer is called L2_RSPS_BUFFER for
# historical reasons.
int DRAM_RSPS_BUFFER_SIZE 2

# ******************************************
# ADDRESS HASHING PARAMETERS
# ******************************************

# If this is set to true, hashing will be applied on physical memory
# addresses before they get sent over the ICN. This variable does not
# change the cycle-accurate delay that is incurred by the hashing unit
# but it turns on/off the actual hashing of the address.
boolean MEMORY_HASHING true

# Constant set by the operating system (?) for hashing.
# See the ASIC document for the algorithm.
int HASHING_S_CONSTANT 63

# *******************************************
# PREFETCHING PARAMETERS
# *******************************************

# The number of words that fit in the TCU prefetch buffer.
int PREFETCH_BUFFER_SIZE 16

# The replacement policy for the prefetch buffer unit:
#   RR  - RoundRobin
#   LRU - Least Recently Used
#   MRU - Most Recently Used
String PREFETCH_BUFFER_REPLACEMENT_POLICY RR

# The number of words that fit in the read-only buffer.
int ROB_SIZE 2048

# The maximum number of pending requests to the ROB per TCU.
int ROB_MAX_REQ_PER_TCU 16

# ******************************************
# MISC PARAMETERS
# ******************************************

# The interconnection network model used in the simulation.
#   const - Constant delay model. If this model is chosen, the cache and
#           DRAM models will be ignored. The amount of delay is taken
#           from CONST_ICN_LATENCY.
#   mot   - Separate send and receive Mesh-of-Trees networks between
#           clusters and caches.
String ICN_MODEL mot

# The shared cache model used in the simulation. This has no effect if
# the const model is chosen for the ICN model.
#   const  - Constant delay model. If this model is chosen, the DRAM
#            model will be ignored. The amount of delay is taken from
#            CONST_SC_LATENCY.
#   L1     - One-layer shared cache as it is implemented in the Paraleap
#            FPGA computer.
#   L1_old - One-layer shared cache as in the L1 model. This is an old
#            implementation of the L1 cache and should not be used by the
#            typical user due to possible bugs.
String SHARED_CACHE_MODEL L1

# The DRAM model used in the simulation. This has no effect if either
# the const model is chosen for the ICN model or the const model is
# chosen for the shared cache model.
#   const - Constant delay model. The amount of delay is taken from
#           CONST_DRAM_LATENCY.
String DRAM_MODEL const

# The memory model for the master TCU:
#   hit  - All memory requests will be hits in the cache. The amount of
#          the delay can be set through MCACHE_HIT_LATENCY.
#   miss - All memory requests will go through the ICN model defined in
#          MEMORY_MODEL. The master cache will add two clock cycles to
#          the ICN latency (one on the way out, one on the way back).
#   full - Full MCACHE implementation.
String MCLUSTER_MEMORY_MODEL miss

# The number of CLUSTER clock cycles that the Master cache serves a cache
# hit in case of the 'hit' value of MCLUSTER_MEMORY_MODEL.
int MCACHE_HIT_LATENCY 1

# The number of clock cycles that a shared cache module serves a request
# for the constant delay implementation of the shared cache. The delay time
# will be equal to SC_CLOCK_T x CONST_SC_LATENCY.
# See SHARED_CACHE_MODEL.
int CONST_SC_LATENCY 1

# The number of DRAM cycles that DRAM serves a memory request in case
# of the simplified implementation of DRAM. See the DRAM_MODEL parameter.
# Also see the note at DRAM_RSPS_BUFFER_SIZE.
# The typical latency of DRAM for DDR2 at 200MHz is about 20 cycles. If
# scaling this number for a higher clock frequency, the latency should
# also be increased proportionally to give the same delay in absolute
# time.
int CONST_DRAM_LATENCY 20

# If 'grouped', cluster 0 will get TCUs 0, 1, 2, ..., (T-1) and
# cluster 1 will get T, T+1, T+2, T+3, etc. T is the number of TCUs in
# a cluster.
# If 'distributed', cluster 0 will get TCUs 0, N, 2N and cluster 1 will
# get TCUs 1, N+1, 2N+1, etc. N is the number of clusters.
# This parameter makes a difference in programs with low parallelism.
# The 'grouped' option might be used if power is a concern; otherwise
# the 'distributed' option should result in better performance.
String TCU_ID_ASSIGNMENT distributed

Appendix C

HotSpotJ

HotSpotJ is a Java interface for HotSpot [HSS+04, HSR+07, SSH+03], an accurate and fast thermal model that is typically used in conjunction with architecture simulators. Even though we use HotSpotJ with XMTSim, it can be used with any other Java-based simulator.

HotSpot is written in C and so far has been available for use with C-based simulators. We originally developed HotSpotJ as an Application Programming Interface (API) to bridge between HotSpot and Java-based architecture simulators; eventually, it became a supporting tool that enhances the workflow with HotSpot by offering features such as alternative input forms, a floorplan GUI that can display color-coded temperature and power values, etc.

The current version of HotSpotJ is based on HotSpot version 4.1. This documentation assumes that readers are familiar with the concepts of HotSpot.

C.1 Installation

C.1.1 Software Dependencies

HotSpotJ relies on the Java Native Interface (JNI) [Lia99] for interfacing with C code, which, unlike pure Java code, is platform dependent. Development and testing are done under the Linux OS. In earlier development stages it was also tested on Windows/Cygwin, and the build system still supports compilation under Cygwin. It is very probable that the Cygwin build still works without problems; however, we do not actively support it. Table C.1 lists the specifications of the system under which HotSpotJ is tested.

Linux OS                          kernel 2.6.27-13, distribution Ubuntu 8.10
Bash                              3.2.48
GNU Make                          3.81
Sun Java Development Kit (JDK)    1.6.0
GNU C Compiler (GCC)              4.3.3

Table C.1: Specifications of the HotSpotJ test system.

The HotSpotJ package contains a copy of the HotSpot source code; therefore, a separate HotSpot installation is not required.

C.1.2 Building the Binaries

HotSpotJ is built from its source code. Prior to running the build, you should make sure that the required tools are installed and that their binaries are on the PATH environment variable (see Table C.1 for the list).

Following are the steps to build HotSpotJ. Each step includes example commands for the bash shell.

• Download the source package at http://www.ece.umd.edu/~keceli/web/software/HotSpotJ/.

• Uncompress the package in a directory of your choice (/opt in our example, xxx is the version). A hotspotj directory will be created.

> tar xzvf hotspotj_xxx.tgz -C /opt/

• Make sure that the javac, java and javah executables are on the path (you can check this using the Linux which command). If not, set the PATH environment variable as in the example below.

> export PATH=$PATH:/usr/lib/jvm/java-6-sun/bin

• Include the bin directory under the HotSpotJ installation in the PATH environment variable.

> export PATH=$PATH:/opt/hotspotj/bin

• Run the make command in the installation directory.

> cd /opt/hotspotj
> make

For proper operation, the PATH variable should be set every time before the tool is used (this can be done in the .bashrc file for the bash shell). You can turn on a math acceleration engine for HotSpot by editing the hotspotj/hotspotcsrc/Makefile file. For more information on the math acceleration engines that can be used with HotSpot, see the HotSpot documentation.

Compiling the floorplans written in Java requires the CLASSPATH environment variable to include the HotSpotJ java package. For example,

> export CLASSPATH=$CLASSPATH:/opt/hotspotj

If you are using Sun Java for MS Windows under Cygwin, the colon character should be replaced with backslash and semicolon characters (\;).

In order to use HotSpotJ as an API in other Java-based software (e.g. a cycle-accurate simulator or a custom experiment), you should set the LD_LIBRARY_PATH environment variable to include the bin directory of the HotSpotJ installation, which contains the native library. For example,

> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hotspotj/bin

C.2 Limitations

A known limitation of HotSpotJ is that no floorplan layer configuration can be provided with the grid model. Users are limited to the default number of layers provided by HotSpot, which is a base layer (layer 0) and a thermal interface material layer1. Consequently, only experiments where the base layer alone dissipates power are supported. Other limitations of which we are not aware may exist and may be revealed over time. HotFloorplan, another tool that comes with HotSpot, is not supported through HotSpotJ.

1For the specifications of these layers, see the populate_default_layers function in temperature_grid.c file of HotSpot.

C.3 Summary of Features

With HotSpotJ you can:

• Create floorplans in an object oriented way with Java or directly in a text file (using the FLP format of HotSpot),

• View a floorplan in the GUI and save the floorplan as an image or a FLP file,

• Run steady or transient temperature analysis experiments on a floorplan on the command line – this feature uses HotSpot as the analysis engine,

• Interface with a Java based cycle-accurate simulator in order to feed the HotSpot engine with power values for an experiment,

• View the results of an experiment in the temperature/power viewer GUI, save the results as image files or data files that can later be opened in the GUI.

A noteworthy item in this list is the methodology to express a HotSpot floorplan with object-oriented Java programming, which is particularly useful in constructing repetitive floorplans that contain a large number of blocks. The HotSpotJ API defines methods to construct hierarchical blocks, which can be replicated and shifted to fill a 2-D grid. As a result, a floorplan that contains a few thousand blocks can easily be expressed in under a hundred lines of Java code.

There are two ways to use the HotSpotJ software. The first is to call the hotspotj script to process your input, which is in the form of a compiled Java floorplan/experiment or text files describing the floorplan and power consumption. The workflow with this option is shown in Figure C.1. The second is to write your own Java executable that utilizes the HotSpotJ API, in which case all the functionality of the first option (and more) is provided in the form of function calls. Incorporating HotSpotJ into your cycle-accurate simulator falls into the second category.

Figure C.1: Workflow with the command line script of HotSpotJ.

C.4 HotSpotJ Terminology

In HotSpotJ, the building blocks of a floorplan are called simple boxes. A simple box is identified by its location, dimensions and power consumption value. While location and dimensions are immutable, power consumption value can change over time in the context of a transient experiment with consecutive runs.

In order to introduce the concept of hierarchy, the HotSpotJ API defines composite boxes, which can contain other simple or composite boxes. We refer to the highest-level composite box in the hierarchy as the floorplan. When a box is nested in a composite box, it is said to be added to a parent. The hierarchy graph of a floorplan should always be a connected tree, i.e. each box should have exactly one parent except the floorplan, which has none. On the other hand, there is no limit to the number of children a composite box can have. Hierarchical boxes can be cloned at different locations, which allows for easy construction of repetitive floorplans with a large number of elements.

The hierarchy concept of HotSpotJ is just an abstraction for convenience. Internally, a floorplan is stripped of its hierarchical structure (in other words, flattened) before it is passed to the HotSpot engine for temperature analysis. Therefore, attributes such as location and color are not relevant for composite boxes.

A simple box has the following attributes:

• Name: Box names do not need to be unique. Each simple box is assigned an index according to the order in which it is added to its parent. In cases where uniqueness is required, this index is appended to the name of the box.

• Dimensions: Simple box dimensions are set through its constructor in micrometers. The resolution is 1µm.

• Location: The location of a box is defined as the coordinates of its corner with the smallest coordinates on a 2-D Cartesian system. When a box is created it is initially located at the origin. It can later be shifted to any location on the coordinate system (negative coordinate values are allowed). A box can be shifted multiple times, in which case the effect is cumulative. In the GUI, boxes are displayed in an upside-down Cartesian coordinate system, i.e. the positive-x direction is east and the positive-y direction is south. Coordinates are set in micrometers at a resolution of 1µm.

• Power: The total power dissipated (static and dynamic) in watts.

• Color: The color attribute is used by the GUI to display a simple box in color.

The above attributes (except for the location and the GUI color, as explained before) are valid for a composite box as well. However, the dimensions and the power of a composite box are automatically derived from its sub-boxes and therefore cannot be directly set. The dimensions of a composite box are defined as the dimensions of its bounding box, which is the smallest rectangle that covers all the boxes in it. Similarly, the power of a composite box is defined as the sum of the powers of all the simple box instances that it contains. The only user-definable attribute of a composite box is its name, which is passed in the constructor override (the super(...) call).

A floorplan is valid if its bounding box is completely covered by the simple boxes in it and none of the simple boxes overlap. The CompositeBox class API provides two geometric check methods to ensure validity: checkArea reports an error if gaps or overflows exist in the floorplan, by comparing the bounding box area of the floorplan with the total area of the simple boxes in the floorplan, and checkIntersections checks for overlaps between boxes. These two methods form a comprehensive geometric check. However, they might require a considerable amount of computation, especially for floorplans that consist of many simple boxes. This overhead can be a problem if numerous experiments are to be performed on a floorplan; therefore, one might choose to remove the checks after the initial run to optimize performance.

The relevant Java classes for creating floorplans are SimpleBox, CompositeBox and Box. A floorplan in a CompositeBox object can be displayed in a GUI via the showFloorplan method of the FloorplanVisualizationPanel class (which is equivalent to the -fp option of the hotspotj script). In the GUI, each SimpleBox object of the floorplan will be shown in the color that it is assigned (or gray if no color is assigned). Options for converting the image to grayscale and displaying box names are provided. The floorplan can be exported as an image file (jpeg, gif, etc.) from the GUI.

C.4.1 Creating/Running Experiments and Displaying Results

The command line of HotSpotJ allows running steady-state and transient experiments on a user-provided floorplan without further setup. The input should either be in the form of a compiled Java floorplan class extending CompositeBox or a HotSpot floorplan (FLP) file. HotSpot configuration values and the initial temperature/power consumption values can be set from text files or directly in the constructor of the Java class. For details, see the documentation of the -steady and -transient options of the hotspotj script.
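As a concrete illustration, the tutorial floorplan of Section C.5 (whose power values are set in its constructor) can be analyzed from the command line as sketched below, assuming the class has been compiled and the CLASSPATH is set up as in Section C.1.2:

> hotspotj -steady tutorial.ManyCore21x21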

A typical experiment is built as follows. First the HotSpot engine and the data structures that will be used as the communication medium between the C and the Java code are initialized. Then the stages of the experiment, which can be any combination and number of steady-state and transient calculations, are executed. Finally the resources used by the HotSpot engine are deallocated and results are displayed and/or saved.

C.5 Tutorial – Floorplan of a 21x21 many-core processor

In this tutorial, we will show how to construct a floorplan using the HotSpotJ Java API, compile it, check it for geometric errors and view it using the GUI.

The examples that we will demonstrate are taken from a paper by Huang et al. [HSS+08], in which they investigate the thermal efficiency of a processor with 220 simple cores and 221 cache modules. Cores and caches are assumed to be shaped as squares and they are placed in a 20mm by 20mm die using a checkerboard layout. 1W power is applied to each core and caches do not dissipate any power. Figure C.2 shows the floorplan.

Below is the self-documented Java code for this floorplan (the associated Java file can be found at tutorial/ManyCore21x21.java). It should be noted that the code contains fewer than 20 statements if the inline comments are ignored.

In the code, first, one non-hierarchical box (also called simple box) per cache module and core is created and its attributes are set. Each simple box is then added to its parent level hierarchical box (or composite box). As the last step of the code, the floorplan is checked for geometric errors.

C.5.1 The Java code for the 21x21 Floorplan

package tutorial;

import java.awt.Color;

import hotspotjsrc.CompositeBox;
import hotspotjsrc.SimpleBox;

public class ManyCore21x21 extends CompositeBox {

  private static final long serialVersionUID = 1L;

  public ManyCore21x21() {
    // Pass the name of the floorplan to the constructor of CompositeBox.
    super("21x21 Many-core");

    // A building block is a cache or a core.
    // This is the dimension of the rectangular unit in um.
    int BOX_DIM = (int)((20.0/21.0) * 1000);

    // Total power of one core in watts.
    double CORE_POWER = 1;

    // This is the two-level loop where the floorplan is created.
    for (int x = 0; x < 21; x++) {
      for (int y = 0; y < 21; y++) {
        SimpleBox bb;
        if ((y + x*21) % 2 == 1) {
          // The odd numbered elements are the cores.
          // Build a non-hierarchical square box for a core.
          bb = new SimpleBox("Core", BOX_DIM, BOX_DIM);
          // Cores dissipate power.
          bb.setPower(CORE_POWER);
          // Set the color of the cores to red in the GUI.
          bb.setFloorplanColor(Color.red);
        } else {
          // The even numbered elements are the caches.
          // Build a non-hierarchical square box for a cache module.
          bb = new SimpleBox("Cache", BOX_DIM, BOX_DIM);
          // The caches do not dissipate power.
          bb.setPower(0.0);
          // Set the color of the caches to white in the GUI.
          bb.setFloorplanColor(Color.white);
        }
        // Shift the box to the correct location in the checkerboard grid.
        bb.shift(x * BOX_DIM, y * BOX_DIM);
        // Add the new simple box to the ManyCore21x21 object.
        addBox(bb);
      }
    }

    // Check the floorplan for geometric errors. These checks
    // might take a long time for a large floorplan.
    checkArea();
    checkIntersections();
  }
}

Figure C.2: 21x21 many-core floorplan viewed in the floorplan viewer of HotSpotJ. Red boxes denote the cores and white boxes are the cache modules. The GUI displays information about a box as a tooltip when the mouse pointer is held steady over it.
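To compile the tutorial class and view the floorplan shown in Figure C.2 in the GUI, a command sequence along the following lines can be used. The inclusion of the current directory in the CLASSPATH and the exact directory layout are assumptions; see Section C.1 for the environment setup.

> export CLASSPATH=$CLASSPATH:/opt/hotspotj:.
> javac tutorial/ManyCore21x21.java
> hotspotj -fp tutorial.ManyCore21x21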

C.6 HotSpotJ Command Line Options

-fp <class>
Loads a CompositeBox class to view its floorplan in the HotSpotJ GUI. The class path should be expressed in Java notation, the associated class file should have been previously compiled, and the CLASSPATH environment variable should be set appropriately in order to load the class. See the HotSpotJ tutorials for detailed examples.

-steady <class>
Loads a CompositeBox class to run a steady-state experiment on it. It is assumed that the power values are already set in the floorplan. The class path should be expressed in Java notation, the associated class file should have been previously compiled, and the CLASSPATH environment variable should be set appropriately in order to load the class. See the HotSpotJ tutorials for detailed examples.

This option internally uses the steadySolve method of the CompositeBox class.

-transient <class> <num>
Loads a CompositeBox class to run a transient experiment with constant power values on it. It is assumed that the power values are already set in the floorplan. The number of iterations is set by the num parameter. The iteration period is controlled by the const_sampling_intvl field of the HotSpotConfiguration class, which can be set in the floorplan file at compilation time.

The class path should be expressed in Java notation, the associated class file should have been previously compiled, and the CLASSPATH environment variable should be set appropriately in order to load the class. See the HotSpotJ tutorials for detailed examples.

This option internally uses the transientSolve method of the CompositeBox class.

-loadhjd [<file>]
Loads a previously saved data (HJD) file. If no data file is specified, a GUI window will be brought up to select a file from the file system.

-experiment <classpath> [-save [-base <name>]] [-showinfo]
This option is used to call the run method of a class that extends Experiment. The classpath argument should point to the full Java name of the Experiment class (for example tutorial.TutorialExperiment), which should be on the CLASSPATH. If the save option is given, the returned panels will be saved in HJD files. File names will be derived from the name of the class. If the base option is given, the file names will use it as the base. If no save option is given, panels will be shown in the GUI. The showinfo option prints the info text (see the hjd2info option) for each panel to standard output.

See HotSpotJ documentation for more information on setting up experiments.
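For instance, a hypothetical run of the example experiment class mentioned above, saving the returned panels to HJD files that use a chosen base name, might look like this (the base name tutorial_run is a placeholder):

> hotspotj -experiment tutorial.TutorialExperiment -save -base tutorial_run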

-showflp <file>
Reads a HotSpot floorplan (FLP) file and loads it into the floorplan viewer GUI.

-fp2flp <class>
Converts a CompositeBox class to a HotSpot floorplan (FLP) file representation and prints it on standard output. This option can be used to write a complex floorplan in HotSpotJ and then work on it in HotSpot.

-hjd2image [-type <type>] [-savedir <dir>] [-mintemp <value>] [-maxtemp <value>] [-minpower <value>] [-maxpower <value>] <input files>
This is a batch processing option that saves the power and temperature map images in the HJD files to image files. Multiple image types are supported (platform dependent) and a list of supported types can be obtained via the "hotspotj -hjd2image -type list" command. The image type is set with the -type option. If a type is not provided, the default is jpeg.

By default, an output file is written to the same directory that the associated input file is read from. This can be changed via the savedir option. The relative paths of the input files will be preserved in the save directory.

Arguments that are not options are considered as input files.

One purpose of batch-converting HJD files to images is to assemble movies that show the change in power and temperature maps over the course of an experiment. For this, the experiment should periodically save the data. The user can then use the hjd2image option to convert the data to image files, and these files can be converted to an MPEG movie (a script that does this conversion is provided in the HotSpotJ package).

Note that the color scheme in a temperature or power map is derived relative to the minimum and maximum values in the map (i.e. the minimum and maximum will be at opposite ends of the color spectrum). However, in order to make a meaningful movie, the color schemes should be consistent between all maps. Therefore, the options below are provided for the user to set the minimum and maximum values for the temperature and power map color schemes:

[-mintemp <value>] [-maxtemp <value>] [-minpower <value>] [-maxpower <value>]

Temperature values are in Kelvin. If a value is not provided, it will be set from the minimum/maximum found in the associated map. If a value is outside the user-provided range, it will be truncated to the minimum/maximum.
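As an illustration, the hypothetical command below converts a set of saved HJD files to PNG images (assuming png is among the supported types on the platform) in a separate directory, with a fixed temperature scale so that the resulting frames can be assembled into a movie. All file and directory names and the 320-360K range are placeholders:

> hotspotj -hjd2image -type png -savedir frames -mintemp 320 -maxtemp 360 run_000.hjd run_001.hjd run_002.hjd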

-hjd2info [-intbase <value>]
This is a batch processing option that works in the same way as the hjd2image option, but instead of saving map images it prints the information of each input HJD file on standard output. The information is the text displayed in the power and temperature tabs below the maps.

If intbase is specified, the base value for the temperature integral will be set. See the HotSpotJ documentation for information on the temperature integral.

Appendix D

Alternative Floorplans for XMT1024

In Chapter 7, one of the two floorplans we simulated was a thermally efficient checkerboard design (FP1, Figure 7.4). In the process of constructing that floorplan, we inspected two others, which are given in this appendix. These floorplans use the same basic tile structure explained in Section 7.3 (Figure 7.5). In terms of efficiency, they are close to the checkerboard floorplan; therefore, should the constraints dictate, they can be substituted for it without much difference.

Both alternative floorplans place the ICN in the middle of the chip as in FP2 of Figure 7.3. The first one in Figure D.1 features a grid structure similar to FP1, without the alternating orientations of tiles, and the second one (Figure D.2) features a grid with alternating tiles.


Figure D.1: The first alternative floorplan for the XMT1024 chip.


Figure D.2: Another alternative tiled floorplan for the XMT1024 chip; tiles are placed in alternating vertical orientations.

Bibliography

[ACC+90] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings of the International Conference on Supercomputing, 1990.

[ALE02] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2):59–67, 2002.

[AMD04] AMD. Cool'n'Quiet Technology. www.amd.com/us/products/technologies/cool-n-quiet/Pages/cool-n-quiet.aspx, 2004.

[AMD10a] AMD. Phenom II. www.amd.com/us/products/desktop/processors/phenom-ii/Pages/phenom-ii.aspx, 2010.

[AMD10b] AMD. Radeon HD 6970. www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6970/Pages/amd-radeon-hd-6970-overview.aspx, 2010.

[AVP+06] David Atienza, Pablo G. Del Valle, Giacomo Paci, et al. HW-SW Emulation Framework for Temperature-Aware Design in MPSoCs. In Proceedings of the Design Automation Conference, 2006.

[Bal08] Aydin O. Balkan. Mesh-of-trees Interconnection Network for an Explicitly Multi-threaded Parallel Computer Architecture. PhD thesis, University of Maryland, 2008.

[BBF+97] P. Bach, M. Braun, A. Formella, J. Friedrich, T. Grun, and C. Lichtenau. Building the 4 processor SB-PRAM prototype. In Proceedings of the International Conference on System Sciences, 1997.

[BC11] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Communications of the ACM, 54:67–77, 2011.

[BCNN04] Jerry Banks, John S. Carson, Barry L. Nelson, and David M. Nicol. Discrete-Event System Simulation. Prentice Hall, Upper Saddle River, NJ, USA, 4th edition, 2004.

[BCTB11] Andrea Bartolini, Matteo Cacciari, Andrea Tilli, and Luca Benini. A distributed and self-calibrating model-predictive controller for energy and thermal management of high-performance multicores. In Proceedings of the Design, Automation, and Test in Europe, 2011.

[Ben93] S. Bennett. Development of the PID controller. IEEE Control Systems Magazine, 13:58–65, 1993.

[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, 2004.

[BG09] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2009.

[Bor99] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, Jul-Aug 1999.

[Bor07] Shekhar Borkar. Thousand core chips: a technology perspective. In Proceedings of the Design Automation Conference, 2007.

[BPV10] Luigi Brochard, Raj Panda, and Sid Vemuganti. Optimizing performance and energy of hpc applications on . In International Conference on Energy-Aware High Performance Computing, 2010.

[BQV08] A. O. Balkan, Gang Qu, and U. Vishkin. An area-efficient high-throughput hybrid interconnection network for single-chip parallel processing. In Proceedings of the Design Automation Conference, pages 435–440, June 2008.

[BQV09] Aydin O. Balkan, Gang Qu, and Uzi Vishkin. Mesh-of-trees and alternative interconnection networks for single-chip parallelism. IEEE Transactions on VLSI Systems, 17(10):1419–1432, 2009.

[BS00] J.A. Butts and G.S. Sohi. A static power model for architects. In Proceedings of the International Symposium on , 2000.

[BTM00] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of the International Symposium on Computer Architecture, 2000.

[BTR02] D.C. Bossen, J.M. Tendler, and K. Reick. Power4 system design for high reliability. IEEE Micro, 22:16 – 24, 2002.

[BYF+09] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[Cad] Cadence. NC-Verilog Simulator. www.cadence.com/datasheets/IncisiveVerilog_ds.pdf.

[Car11] George C. Caragea. Optimizing for a Many-core Architecture Without Compromising Ease of Programming. PhD thesis, University of Maryland, 2011.

[CB05] G. Cong and D.A. Bader. An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs). In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2005.

[CBD+05] Robert Chau, Justin Brask, Suman Datta, et al. Application of high-k gate dielectrics and metal gate electrodes to enable silicon and non-silicon logic nanotechnology. Microelectronic Engineering, 80:1–6, 2005.

[CBM+09] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization, 2009.

[CDE+08] Sangyeun Cho, S. Demetriades, S. Evans, Lei Jin, Hyunjin Lee, Kiyeon Lee, and M. Moeng. TPTS: A novel framework for very fast manycore processor architecture simulation. In Proceedings of the International Conference on Parallel Processing, 2008.

[CGS97] David E. Culler, Anoop Gupta, and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1997.

[CKP+93] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: towards a realistic model of parallel computation. SIGPLAN Notes, 28:1–12, 1993.

[CKT10] George C. Caragea, Fuat Keceli, and Alexandros Tzannes. Software release of the XMT programming environment. www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html, 2008 – 2010.

[CKTV10] George Caragea, Fuat Keceli, Alexandros Tzannes, and Uzi Vishkin. General-purpose vs. GPU: Comparison of many-cores on irregular workloads. In Proceedings of the USENIX Workshop on Hot Topics in Parallelism, 2010.

[CLRT11] Olivier Certner, Zheng Li, Arun Raman, and Olivier Temam. A very fast simulator for exploring the many-core future. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2011.

[CSWV09] George C. Caragea, Beliz Saybasili, Xingzhi Wen, and Uzi Vishkin. Performance potential of an easy-to-program pram-on-chip prototype versus state-of-the-art processor. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures, 2009.

[CT08] Daniel Cederman and Philippas Tsigas. On sorting and load balancing on gpus. SIGARCH Comput. Archit. News, 36(5):11–18, 2008.

[CTBV10] George C. Caragea, Alexandros Tzannes, Aydin O. Balkan, and Uzi Vishkin. XMT Toolchain Manual for XMTC Language, XMTC Compiler, XMT Simulator and Paraleap XMT FPGA Computer. sourceforge.net/projects/xmtc/files/xmtc-documentation/, 2010.

[CTK+10] George Caragea, Alexandre Tzannes, Fuat Keceli, Rajeev Barua, and Uzi Vishkin. Resource-aware compiler prefetching for many-cores. In Proceedings of the International Symposium on Parallel and Distributed Computing, 2010.

[CV11] G.C. Caragea and U. Vishkin. Better speedups for parallel max-flow. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures, 2011.

[Dal] William Dally. Power, programmability, and granularity: The challenges of exascale computing. Talk at the 2011 IEEE International Symposium on Parallel and Distributed Processing.

[DLW+08] T. M. DuBois, B. Lee, Yi Wang, M. Olano, and U. Vishkin. XMT-GPU: A PRAM architecture for graphics computation. In Proceedings of the International Conference on Parallel Processing, 2008.

[DM06] James Donald and Margaret Martonosi. Techniques for multicore thermal management: Classification and new exploration. In Proceedings of the International Symposium on Computer Architecture, 2006.

[Edw11] James Edwards. Can PRAM graph algorithms provide practical speedups on many-core machines? DIMACS Workshop on Parallelism: A 2020 Vision, March 2011.

[EG88] D. Eppstein and Z. Galil. Parallel algorithmic techniques for combinatorial computation. Annual review of computer science, 3:233–283, 1988.

[FM10] Samuel H. Fuller and Lynette I. Millett, editors. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, December 2010. Computer Science and Telecommunications Board.

[Fuj90] Richard M. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33:30–53, 1990.

[FWR+11] M. Floyd, M. Ware, K. Rajamani, T. Gloekler, et al. Adaptive energy-management features of the IBM POWER7 chip. IBM Journal of Research and Development, 55(8):1–18, 2011.

[GGK+82] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU ultracomputer: designing a MIMD, shared-memory parallel machine (extended abstract). In Proceedings of the International Symposium on Computer Architecture, 1982.

[GH98] Etienne M. Gagnon and Laurie J. Hendren. SableCC, an Object-Oriented Compiler Framework. In Proceedings of the Technology of Object-Oriented Languages, 1998.

[Gin11] R. Ginosar. The plural architecture. www.plurality.com, 2011. Also see course on Parallel Computing, Electrical Engineering, Technion http://webee.technion.ac.il/courses/048874.

[GMQ10] Yang Ge, P. Malani, and Qinru Qiu. Distributed task migration for thermal management in many-core systems. In Proceedings of the Design Automation Conference, 2010.

[GST07] B. Greskamp, S.R. Sarangi, and J. Torrellas. Threshold voltage variation effects on aging-related hard failure rates. In Proceedings of the IEEE International Symposium on Circuits and Systems, 2007.

[HB09] Jared Hoberock and Nathan Bell. Thrust: A parallel template library, 2009. Version 1.1.

[HBVG08] Lorin Hochstein, Victor R. Basili, Uzi Vishkin, and John Gilbert. A pilot study to compare programming effort for two parallel programming models. Journal of Systems and Software, 81(11):1920–1930, 2008.

[HD+10] Jason Howard, Saurabh Dighe, et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Proceedings of the IEEE Solid-State Circuits Conference, 2010.

[HH10] Z. He and Bo Hong. Dynamically Tuned Push-Relabel Algorithm for the Maximum Flow Problem on CPU-GPU-Hybrid Platforms. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2010.

[HK10] Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In Proceedings of the International Symposium on Computer Architecture, 2010.

[HM08] Mark D. Hill and Michael R. Marty. Amdahl's law in the multicore era. IEEE Computer, 41(7):33–38, 2008.

[HN07] Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of High Performance Computing - HiPC, pages 197–208, 2007.

[HNCV10] Michael N. Horak, Steven M. Nowick, Matthew Carlberg, and Uzi Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessors. In Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip, 2010.

[HPLC05] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. SimPoint 3.0: Faster and more flexible program analysis. Journal of Instruction Level Parallelism, 7, Sept. 2005.

[HSR+07] W. Huang, K. Sankaranarayanan, R. J. Ribando, M. R. Stan, and K. Skadron. An Improved Block-Based Thermal Model in HotSpot 4.0 with Granularity Considerations. In Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking, 2007.

[HSS+04] Wei Huang, M.R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy. Compact thermal modeling for temperature-aware design. In Proceedings of the Design Automation Conference, 2004.

[HSS+08] W. Huang, M. R. Stan, K. Sankaranarayanan, Robert J. Ribando, and K. Skadron. Many-core design from a thermal perspective. In Proceedings of the Design Automation Conference, 2008.

[HWL+07] H. F. Hamann, A. Weger, J. A. Lacey, Z. Hu, P. Bose, E. Cohen, and J. Wakil. Hotspot-limited microprocessors: Direct temperature and power distribution measurements. IEEE Journal of Solid-State Circuits, 42:56–65, 2007.

[IM03] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proceedings of the International Symposium on Microarchitecture, 2003.

[Inta] Intel. Core i7 2600. ark.intel.com/ProductCollection.aspx?series=53250.

[Intb] P3 International. P4460 Electricity Usage Monitor. www.p3international.com/products/p4460.html.

[Int08a] Intel. Core i7 (Nehalem) Dynamic Power Management. White paper, 2008.

[Int08b] Intel. Turbo Boost Technology in Intel Core Micro-architecture (Nehalem) Based Processors. White paper, 2008.

[JáJ92] J. JáJá. An introduction to parallel algorithms. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1992.

[JBH+05] H. Jacobson, P. Bose, Zhigang Hu, A. Buyuktosunoglu, V. Zyuban, R. Eickemeyer, L. Eisen, J. Griswell, D. Logan, Balaram Sinharoy, and J. Tendler. Stretching the limits of clock-gating efficiency in server-class processors. In Proceedings of the International Symposium on High-Performance Computer Architecture, 2005.

[Kan08] David Kanter. NVIDIA’s GT200: Inside a Parallel Processor. Physical Implementation. http://www.realworldtech.com/page.cfm? ArticleID=RWT090808195242&p=11, September 2008.

[KDY09] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. A characterization and analysis of PTX kernels. In Proceedings of the IEEE International Symposium on Workload Characterization, 2009.

[KGyWB08] Wonyoung Kim, Meeta S. Gupta, Gu yeon Wei, and David Brooks. System level analysis of fast, per-core dvfs using on-chip switching regulators. In Proceedings of the International Symposium on High-Performance Computer Architecture, 2008.

[KKT01] Jorg Keller, Christopher Kessler, and Jesper Larsson Traeff. Practical PRAM Programming. John Wiley & Sons, Inc., New York, NY, USA, 2001.

[KLPS11] Andrew B. Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. Orion 2.0: A power-area simulator for interconnection networks. IEEE Transactions on VLSI Systems, 2011. to appear.

[KM08] Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for Power-Efficiency. Morgan and Claypool Publishers, 2008.

[KNB+99] Ali Keshavarzi, Siva Narendra, Shekhar Borkar, Charles Hawkind, Kaushik Roy, and Vivek De. Technology scaling behavior of optimum reverse body bias for standby leakage power reduction. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 252–254, 1999.

[KR90] R. M. Karp and V. Ramachandran. Handbook of theoretical computer science (vol. A), chapter Parallel algorithms for shared-memory machines, pages 869–941. MIT Press, Cambridge, MA, USA, 1990.

[KRU09] Michael Kadin, Sherief Reda, and Augustus Uht. Central vs. distributed dynamic thermal management for multi-core processors: which one is better? In Proceedings of the Great Lakes Symposium on VLSI, 2009.

[KTC+11] Fuat Keceli, Alexandros Tzannes, George Caragea, Uzi Vishkin, and Rajeev Barua. Toolchain for programming, simulating and studying the XMT many-core architecture. In Proceedings of the International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2011. in conj. with IPDPS.

[LAS+09] Sheng Li, Jung H. Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the International Symposium on Microarchitecture, 2009.

[LCVR03] Hai Li, Chen-Yong Cher, T. N. Vijaykumar, and Kaushik Roy. VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, page 19, 2003.

[LH03] Weiping Liao and Lei He. Power modeling and reduction of vliw processors. In Compilers and operating systems for low power, pages 155–171. Kluwer Academic Publishers, Norwell, MA, USA, 2003.

[Lia99] Sheng Liang. Java Native Interface: Programmer’s Guide and Reference. Addison-Wesley Longman Publishing, 1999.

[LLW10] Yu Liu, Han Liang, and Kaijie Wu. Scheduling for energy efficiency and fault tolerance in hard real-time systems. In Proceedings of the Design, Automation, and Test in Europe, 2010.

[LNOM08] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008.

[LV08] H. Lebreton and P. Vivet. Power Modeling in SystemC at Transaction Level, Application to a DVFS Architecture. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2008.

[LZB+02] David E. Lackey, Paul S. Zuchowski, Thomas R. Bednar, Douglas W. Stout, Scott W. Gould, and John M. Cohn. Managing power and performance for system-on-chip designs using voltage islands. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 195–202, 2002.

[LZWQ10] Shaobo Liu, Jingyi Zhang, Qing Wu, and Qinru Qiu. Thermal-aware job allocation and scheduling for three dimensional chip multiprocessor. In Proceedings of the International Symposium on Quality Electronic Design, 2010.

[MBJ05] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. CACTI 6.0: A Tool to Model Large Caches. Technical report, HP Laboratories, 2005.

[MDM+95] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada. 1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS. IEEE Journal of Solid-State Circuits, 8(30):847–854, 1995.

[MFMB02] Steven M. Martin, Krisztian Flautner, Trevor Mudge, and David Blaauw. Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2002.

[MLCW11] Kai Ma, Xue Li, Ming Chen, and Xiaorui Wang. Scalable power control for many-core architectures running multi-threaded applications. In Proceedings of the International Symposium on Computer Architecture, 2011.

[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics Magazine, 38(8):4, 1965.

[MOP+09] R. Marculescu, U.Y. Ogras, Li-Shiuan Peh, N.E. Jerger, and Y. Hoskote. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Transactions on CAD of Integrated Circuits and Systems, 28:3 – 21, 2009.

[MPB+06] R. McGowen, C.A. Poirier, C. Bostak, J. Ignowski, M. Millican, W.H. Parks, and S. Naffziger. Power and temperature control on a 90-nm itanium family processor. IEEE Journal of Solid-State Circuits, 41(1):229–237, 2006.

[Mun09] A. Munshi. OpenCL specification version 1.0. Technical report, Khronos OpenCL Working Group, 2009.

[MVF00] Jens Muttersbach, Thomas Villiger, and Wolfgang Fichtner. Practical design of globally-asynchronous locally-synchronous systems. In Proceedings of the IEEE International Symposium on Asynchronous Circuits and Systems, 2000.

[Nak06] Wataru Nakayama. Exploring the limits of air cooling. www.electronics-cooling.com/2006/08/ exploring-the-limits-of-air-cooling/, 2006.

[NBGS08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.

[NNTV01] Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, and Uzi Vishkin. Towards a first vertical prototyping of an extremely fine-grained parallel programming approach. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures, 2001.

[NS00] Koichi Nose and Takayasu Sakurai. Optimization of vdd and vth for low-power and high speed applications. In ASP-DAC ’00: Proceedings of the 2000 conference on Asia South Pacific design automation, pages 469–474, New York, NY, USA, 2000. ACM.

[NVIa] NVIDIA. GeForce GTX280. www.nvidia.com/object/product_geforce_gtx_280_us.html.

[NVIb] NVIDIA. GeForce GTX580. www.nvidia.com/object/product-geforce-gtx-580-us.html.

[NVI09] NVIDIA. CUDA SDK 2.3, 2009.

[NVI10] NVIDIA. CUDA Zone. www.nvidia.com/cuda, 2010.

[Pat10] David Patterson. The trouble with multicore: Chipmakers are busy designing microprocessors that most programmers can't handle. IEEE Spectrum, July 2010.

[PBB+02] Wolfgang J. Paul, Peter Bach, Michael Bosch, Jörg Fischer, Cédric Lichtenau, and Jochen Röhrig. Real PRAM Programming. In Proceedings of the International Euro-Par Conference on Parallel Processing, 2002.

[PV11] D. Padua and U. Vishkin. Joint UIUC/UMD parallel algorithms/programming course. In Proceedings of the NSF/TCPP Workshop on Parallel and Distributed Computing Education, 2011. In conjunction with IPDPS.

[Rab96] Jan M. Rabaey. Digital integrated circuits: a design perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.

[SA08] Erik Sintorn and Ulf Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 68(10):1381–1388, 2008.

[Sam] Samsung. Samsung Green GDDR5. www.samsung.com/global/business/semiconductor/Greenmemory/Downloads/Documents/downloads/green_gddr5.pdf.

[SAS02] K. Skadron, T. Abdelzaher, and M.R. Stan. Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 17–28, 2002.

[SCP09] Sylvain Collange, Marc Daumas, David Defour, and David Parello. Barra, a modular functional GPU simulator for GPGPU. Technical report, Université de Perpignan, 2009.

[SCS+08] Larry Seiler, Doug Carmean, Eric Sprangle, et al. Larrabee: A many-core x86 architecture for visual computing. In SIGGRAPH, 2008.

[SHG09] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2009.

[SLD+03] Haihua Su, F. Liu, A. Devgan, E. Acar, and S. Nassif. Full chip leakage-estimation considering power supply and temperature variations. In Proceedings of the International Symposium on Low Power Electronics and Design, 2003.

[SMD06] Takayasu Sakurai, Akira Matsuzawa, and Takakuni Douseki. Fully-Depleted SOI CMOS Circuits and Technology for Ultralow-Power Applications. Springer, 2006.

[SN90] T. Sakurai and A.R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits, 25:584–594, April 1990.

[SSH+03] Kevin Skadron, Mircea R. Stan, Wei Huang, Sivakumar Velusamy, Karthik Sankaranarayanan, and David Tarjan. Temperature-aware microarchitecture. In Proceedings of the International Symposium on Computer Architecture, 2003.

[STBV09] A. Beliz Saybasili, Alexandros Tzannes, Bernard R. Brooks, and Uzi Vishkin. Highly parallel multi-dimentional fast fourier transform on fine- and coarse-grained many-core approaches. In IASTED International Conference on Parallel and Distributed Computing and Systems, 2009.

[Syn] Synopsys. PrimeTime. www.synopsys.com/Tools/Implementation/SignOff/Pages/PrimeTime.aspx.

[TCM+09] D. N. Truong, W. H. Cheng, T. Mohsenin, Zhiyi Yu, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, Zhibin Xiao, E. W. Work, J. W. Webb, P. V. Mejia, and B. M. Baas. A 167-processor computational platform in 65 nm CMOS. IEEE Journal of Solid-State Circuits, 44(4):1130–1144, April 2009.

[TCVB11] Alexandros Tzannes, George C. Caragea, Uzi Vishkin, and Rajeev Barua. The compiler for the XMTC parallel language: Lessons for compiler developers and in-depth description. Technical Report UMIACS-TR-2011-01, University of Maryland Institute for Advanced Computer Studies, 2011.

[TVTE10] Shane Torbert, Uzi Vishkin, Ron Tzur, and David J. Ellison. Is teaching parallel algorithmic thinking to high-school students possible? One teacher’s experience. In Proceedings of the ACM Technical Symposium on Computer Science Education, 2010.

[VCL07] Uzi Vishkin, George C. Caragea, and Bryant C. Lee. Handbook of Parallel Computing: Models, Algorithms and Applications, chapter Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform. CRC Press, 2007.

[VDBN98] Uzi Vishkin, Shlomit Dascal, Efraim Berkovich, and Joseph Nuzman. Explicit multi-threading (XMT) bridging models for instruction parallelism. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures, 1998.

[Vee84] Harry J. M. Veendrick. Short circuit power dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, 19:468–473, August 1984.

[Vis07] U. Vishkin. Thinking in parallel: Some basic data-parallel algorithms and techniques. www.umiacs.umd.edu/users/vishkin/PUBLICATIONS/classnotes.pdf, 2007. In use as class notes since 1993.

[Vis11] Uzi Vishkin. Using simple abstraction to guide the reinvention of computing for parallelism. Communications of the ACM, 54(1):75–85, Jan. 2011.

[VTEC09] Uzi Vishkin, Ron Tzur, David Ellison, and George C. Caragea. Programming for high schools. www.umiacs.umd.edu/~vishkin/XMT/CS4HS_PATfinal.ppt, July 2009. Keynote, The CS4HS Workshop.

[WGT+05] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Kathleen Baynes, Aamer Jaleel, and Bruce Jacob. DRAMsim: a memory system simulator. SIGARCH Computer Architecture News, 33:100–107, 2005.

[WJ96] S.J.E. Wilton and N.P. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996.

[WRF+10] Malcolm Ware, Karthick Rajamani, Michael Floyd, Bishop Brock, Juan C. Rubio, Freeman Rawson, and John B. Carter. Architecting for power management: The POWER7 approach. In Proceedings of the International Symposium on High-Performance Computer Architecture, 2010.

[WV07] Xingzhi Wen and Uzi Vishkin. PRAM-on-chip: first commitment to silicon. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures, 2007.

[WV08a] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM on-chip processor. In Proceedings of the ACM Computing Frontiers, 2008.

[WV08b] Xingzhi Wen and Uzi Vishkin. The XMT FPGA prototype/cycle-accurate simulator hybrid. In Proceedings of the Workshop on Architectural Research Prototyping, 2008.

[ZIM+07] Li Zhao, R. Iyer, J. Moses, R. Illikkal, S. Makineni, and D. Newell. Exploring large-scale CMP architectures using ManySim. IEEE Micro, 27:21–33, 2007.

[ZPS+03] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects. Technical report, Univ. of Virginia Dept. of Computer Science, 2003.
