UCAM-CL-TR-870 Technical Report ISSN 1476-2986

Number 870

Computer Laboratory

Accelerating control-flow intensive code in spatial hardware

Ali Mustafa Zaidi

May 2015

15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom. Phone +44 1223 763500. http://www.cl.cam.ac.uk/ © 2015 Ali Mustafa Zaidi

This technical report is based on a dissertation submitted February 2014 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. Edmund’s College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet:

http://www.cl.cam.ac.uk/techreports/

Abstract

Designers are increasingly utilizing spatial (e.g. custom and reconfigurable) architectures to improve both efficiency and performance in increasingly heterogeneous systems-on-chip. Unfortunately, while such architectures can provide orders of magnitude better efficiency and performance on numeric applications, they exhibit poor performance when implementing sequential, control-flow intensive code. This thesis studies the problem of improving sequential code performance in spatial hardware without sacrificing its inherent efficiency advantage.

I propose (a) switching from a statically scheduled to a dynamically scheduled, dataflow execution model, and (b) utilizing a newly developed intermediate representation (IR) designed to expose ILP in spatial hardware, even in the presence of complex control flow. I describe this new IR – the Value State Flow Graph (VSFG) – and how it statically exposes ILP from control-flow intensive code by enabling control-dependence analysis, execution along multiple flows of control, as well as aggressive control-flow speculation. I also present a High-Level Synthesis (HLS) toolchain that compiles unmodified high-level language code to dataflow custom hardware, via the LLVM compiler infrastructure.

I show that for control-flow intensive code, VSFG-based custom hardware performance approaches, or even exceeds, the performance of a complex superscalar processor, while consuming only 1/4× the energy of an efficient in-order processor, and 1/8× that of a complex out-of-order processor. I also present a discussion of compile-time optimizations that may be attempted to further improve both efficiency and performance for VSFG-based hardware, including using alias analysis to statically partition and parallelize memory operations.

This work demonstrates that it is possible to use custom and/or reconfigurable hardware in heterogeneous systems to improve the efficiency of frequently executed sequential code, without compromising performance relative to an energy-inefficient out-of-order superscalar processor.

Acknowledgements

First and foremost, my sincerest thanks to my supervisor, David Greaves, for his careful and invaluable guidance during my PhD, especially for challenging me to explore new and unfamiliar topics and broaden my perspective. My heartfelt thanks also to Professor Alan Mycroft for his advice and encouragement as my second advisor, as well as for his enthusiasm for my work. I would also like to acknowledge Robert Mullins for his encouragement, many insightful conversations, and especially for running the CompArch reading group. Sincere thanks also to Professor Simon Moore, for seeding the novel and exciting Communication-centric Computer Design project, and for making me a part of his amazing team.

I would also like to acknowledge my colleagues for making the Computer Laboratory a fun and engaging workplace, and especially for all of the illuminating discussions on diverse topics, from the Fermi paradox to the tractability of magic roundabouts. In particular, I would like to thank Daniel Bates, Alex Bradbury, Andreas Koltes, Alan Mujumdar, Matt Naylor, Robert Norton, Milos Puzovic, Charlie Reams, and Jonathan Woodruff.

My gratitude is also due to my friends at St. Edmund's College, as well as around Cambridge, for helping me to occasionally escape from work to try something less stressful, like running a political discussion forum! In particular, my heartfelt thanks go to the co-founders of the St. Edmund's Political Forum, as well as to my friends Parul Bhandari, Taylor Burns, Ali Khan, Tim Rademacher, and Mohammad Razai.

Last but not least, I am most grateful to my wife Zenab for her boundless patience and support during my PhD, to my son Mahdi for bringing both meaning and joy to the life of a humble student, and to my parents for their unwavering faith, encouragement and support.

Contents

1 Introduction  11
  1.1 Thesis Statement  13
  1.2 Contributions  14
  1.3 Publications and Awards  15

2 Technical Background  17
  2.1 The Uniprocessor Era  17
  2.2 The Multicore Era  20
  2.3 The Dark Silicon Problem  21
    2.3.1 Insufficient Explicit Parallelism  21
    2.3.2 The Utilization Wall  23
    2.3.3 Implications of Dark Silicon  24
  2.4 The Spatial Computation Model  26
    2.4.1 Advantages of Spatial Computation  26
    2.4.2 Issues with Spatial Computation  28
    2.4.3 A Brief Survey of Spatial Architecture Research  31
  2.5 Summary  36

3 Statically Exposing ILP from Sequential Code  39
  3.1 The Nature of Imperative Code  39
  3.2 Exposing ILP from Imperative Code  42
    3.2.1 False or Name Dependencies  42
    3.2.2 Overcoming Control Flow  44
    3.2.3 Pointer Arithmetic and Memory Disambiguation  46
  3.3 The Superscalar Performance Advantage  47
    3.3.1 Case Study 1: Outer-loop Pipelining  49
  3.4 Limitations of Superscalar Performance  52
    3.4.1 Case Study 2: Multiple Flows of Control  52
  3.5 Improving Sequential Performance for Spatial Hardware  53
    3.5.1 Why the Static Dataflow Execution Model?  54
    3.5.2 Why a VSDG-based Compiler IR?  55
  3.6 Overcoming Control-flow with the VSDG  55
    3.6.1 Defining the VSDG  55
    3.6.2 Revisiting Case Studies 1 and 2  64
  3.7 Related Work on Compiler IRs  70
  3.8 Summary  72

4 Definition and Semantics of the VSFG  75
  4.1 The VSFG as Custom Hardware  75
  4.2 Modeling Execution with Petri-Nets  77
    4.2.1 Well-behavedness in Dataflow Graphs  79
  4.3 Operational Semantics for the VSFG-S  82
    4.3.1 Semantics for Basic Operations  84
    4.3.2 Compound Operations: Nested Acyclic Subgraphs  89
    4.3.3 Compound Operations: Nested Loop Subgraphs  92
  4.4 Comparison with Existing Dataflow Models  100
    4.4.1 Comparison with Pegasus  100
    4.4.2 Relation to Original Work on Dataflow Computing  101
  4.5 Limitations of Static Dataflow Execution  103
  4.6 Summary  107

5 A VSFG-Based High-Level Synthesis Toolchain  109
  5.1 The Toolchain  109
  5.2 Conversion from LLVM to VSFG-S  111
    5.2.1 Convert Loops to Tail-Recursive Functions  111
    5.2.2 Implement State-edges between State Operations  116
    5.2.3 Generate Block Predicate Expressions  117
    5.2.4 Replace each φ-node with a MUX  119
    5.2.5 Construct the VSFG-S  119
  5.3 Conversion from VSFG-S to Bluespec  120
  5.4 Current Limitations  122
  5.5 Summary  124

6 Evaluation Methodology and Results  125
  6.1 Evaluation Methodology  125
    6.1.1 Comparison with an Existing HLS Tool  126
    6.1.2 Comparison with Pegasus/CASH  126
    6.1.3 Comparison with Conventional Processors  127
    6.1.4 Selected Benchmarks  128
  6.2 Results  132
    6.2.1 Cycle Counts  132
    6.2.2 Frequency and Delay  139
    6.2.3 Resource Requirements  141
    6.2.4 Power and Energy  144
  6.3 Estimating ILP  150
  6.4 Summary  151

7 Conclusions and Future Work  153
  7.1 Future Work  154
    7.1.1 Incremental Enhancements  154
    7.1.2 Mitigating the Effects of Dark Silicon  154

Bibliography  170

CHAPTER 1

Introduction

Over the past two decades, pervasive, always-on computing services have become an integral part of our lives. Not only are we using increasingly portable devices like tablets and smartphones, there is also an increasing reliance on cloud computing: server-side computation and services, like web search, mail, and social media. On the client side, there is an ever growing demand for increased functionality and diversity of applications, as well as an expectation of continued performance scaling with every new technology generation.

Designers incorporate increasingly powerful processors and systems-on-chip into such devices to meet this demand. However, the key trade-off in employing high-performance processors is the high energy cost they incur [GA06]: for increasingly portable devices, in addition to the demand for ever higher performance, users have an expectation of a minimum time that their battery should last under normal use. On the server side, power dissipation and cooling infrastructure costs are growing, currently accounting for more than 40% of the running costs for datacenters [Ham08], and around 10% of the total lifetime cost [KBPS09]. To meet the growing demand for computational capacity in datacenters, computer architects are striving to develop processors capable of not only providing higher throughput and performance, but also achieving high energy efficiency.

Unfortunately, for the past decade, architects have had to struggle with several key issues that hinder their ability to continue scaling performance with Moore's Law, while also improving the energy efficiency of computation. Poor wire-scaling, together with the need to limit power dissipation and improve energy efficiency, have driven a push towards ever more decentralized, modular, multicore processors that rely on explicit parallelism for performance instead of frequency scaling and increasingly complex uniprocessor microarchitectures.

Instead of the dynamic, run-time effort of exposing and exploiting parallelism in increasingly complex processors, in the multicore era the responsibility of exposing further parallelism to scale performance rests primarily with the programmer. Nevertheless, despite the increased programming costs and complexity, performance has continued to scale for application domains that have abundant, easy-to-express parallelism, in particular for server-side applications such as web and database servers, scientific and high-performance computing, etc.

The Dark Silicon Problem: On the other hand, for client-side, general-purpose applications, performance scaling on explicitly parallel architectures has been severely limited due to Amdahl's Law, as such applications exhibit limited coarse-grained (data or task-level) parallelism that could be cost-effectively exposed by a programmer [BDMF10]. Furthermore, more recently, a new issue has been identified that limits performance scaling on multicore architectures, even for applications with abundant parallelism: due to the end of Dennard scaling [DGnY+74], on-chip power dissipation is growing in proportion to the number of on-chip transistors, meaning that for a fixed power budget, the proportion of on-chip resources that can be actively utilized at any given time decreases with each technology generation. This problem is known as the Utilization Wall [VSG+10]. Together, the utilization wall and Amdahl's law problems lead to the issue of Dark Silicon, where a growing fraction of on-chip resources will have to remain switched off, either due to power dissipation constraints, or simply because of insufficient parallelism in the application itself. A recent study has shown that with future generations, even as Moore's Law provides a 32× increase in on-chip resources, dark silicon will limit effective performance scaling to only about 3–8× [EBSA+11].

The Potential of Spatial Computation: To mitigate the effects of the utilization wall, it is essential to make the most efficient use of the fraction of transistors that can be active at any given time. Architects are doing exactly this as they build increasingly heterogeneous systems incorporating spatial computation hardware such as custom or reconfigurable logic1. Unlike conventional processors, spatial hardware relegates much of the effort of exposing and exploiting concurrency to the compiler or programmer. Spatial hardware is also highly specialized, tailored to the specific application being implemented, thereby providing orders-of-magnitude improvements in energy efficiency and performance [HQW+10]. Examples of such hardware include video codecs and image processing implemented as part of heterogeneous systems-on-chip commonly used in modern smartphones and tablets. By implementing specialized hardware designed for a small subset of tasks, architects essentially trade relatively inexpensive and abundant transistor resources for essential improvements in energy-efficiency.

Current Limitations of Spatial Computation: To mitigate the effects of Amdahl's Law and continue scaling performance with Moore's law, it is essential to also aggressively exploit implicit fine-grained parallelism from otherwise sequential code, and to do so with high energy efficiency to avoid running into the utilization wall. Recent work has attempted to implement sequential, general-purpose code using spatial hardware, in order to improve energy efficiency [VSG+10, BVCG04]. Unfortunately, sequential code exhibits poor performance in custom hardware, meaning that for performance scaling under Amdahl's Law, architects must employ conventional, complex, and energy-inefficient out-of-order processors [BAG05].

1Unlike the temporal execution model of conventional processors, wherein intermediate operands are communicated between operations through a centralized memory abstraction such as a register file, spatial computation utilizes a point-to-point interconnect to communicate intermediate operands directly between producing and consuming processing elements. Consequently, unlike with conventional processors, placement/mapping of operations to processing elements must be determined before program execution. Spatial Computation is described in greater detail in Section 2.4.

Not only does this affect usability by reducing the battery life of portable devices, it also means that overall performance scaling would be further limited due to the utilization wall limiting the amount of parallel processing resources that can be activated within the remaining power budget. To overcome this Catch-22 situation, it is essential that new approaches be found to implement such sequential code with high performance, without incurring the energy costs of conventional processors.

This dissertation focuses on combining the high energy efficiency of spatial computation with the high sequential-code performance of conventional superscalar processors. Success in this endeavour should have a significant positive impact on a diverse range of computational domains in different ways. For instance, embedded systems would be able to sustain higher performance within a given power budget, potentially also reducing the effort required to optimize code. For example, the primary energy consumption in a smartphone is typically not due to the application processor. Instead, subsystems like high-resolution displays, or radio signalling and processing, consume a majority of the power budget. As a result, even an order of magnitude improvement in computational efficiency would not significantly affect how frequently a user is expected to charge their phone. However, the increased efficiency could instead be utilized to undertake more complex computation within the same power budget, perhaps to provide a better user experience.

Conversely, cloud and datacenter infrastructure could directly take advantage of the increased efficiency to reduce energy costs. As the key reasons for the high energy cost in server-side systems are (a) power consumed by processors, and (b) the cooling infrastructure needed to dissipate this power, more efficient processing elements would simultaneously reduce the operating costs due to both of these factors without compromising computational capacity.

1.1 Thesis Statement

My main thesis is that by statically overcoming the limitations on fine-grained parallelism due to control-flow, the sequential code performance of energy-efficient spatial architectures can be improved to match or even exceed the performance of dynamic, out-of-order superscalar processors, without incurring the latter's energy cost.

To achieve this, this dissertation focuses on the development of a new compiler intermediate representation that accelerates control-intensive sequential code by enabling aggressive control-flow speculation, control-dependence analysis, and exploitation of multiple flows of control in spatial hardware. In order to demonstrate my thesis, this dissertation is structured as follows:

• Chapter 2: A brief overview of the energy and performance issues faced by computer architects is presented, followed by an introduction to spatial computation, along with a brief survey of existing spatial architectures, demonstrating the current issues with sequential code performance.

• Chapter 3: I study the key underlying reasons for the performance advantage of complex, out-of-order superscalar processors over spatial hardware when implementing general-purpose sequential code. The goal is to understand how to overcome these limitations without compromising the inherent energy-efficiency of spatial hardware.

• Chapter 4: I then develop a new compiler intermediate representation called the Value State Flow Graph that simplifies the static exposition of fine-grained instruction level parallelism from control-flow intensive sequential code. The VSFG is designed so that it can be used as an intermediate representation for compiling to a wide variety of spatial architectures and substrates, including a direct implementation as application-specific custom hardware.

• Chapter 5: A high-level synthesis toolchain using the VSFG representation is developed that allows the compilation of high-level language code to high-performance custom hardware.

• Chapter 6: Finally, results from benchmarks compiled using this toolchain demonstrate that in most cases, the performance of the generated custom hardware matches, or even exceeds, the performance of a complex superscalar processor, while incurring a fraction of its energy cost. I highlight the fact that performing compile-time optimizations on the VSFG can easily improve both performance and energy-efficiency even further.

• Chapter 7 concludes the dissertation, and highlights some areas for future research on spatial architectures and their compilation.

1.2 Contributions

This thesis makes the following contributions:

• A new low-level compiler intermediate representation (IR), called the Value State Flow Graph (VSFG), is presented that exposes ILP from sequential code even in the presence of complex control flow. It achieves this by enabling aggressive control-flow speculation, control dependence analysis, as well as execution along multiple flows of control. As conventional processors are typically unable to take advantage of the last two features, the VSFG can potentially expose far greater ILP from sequential code [LW92].

• The VSFG representation is also designed to be directly implementable as custom hardware, replacing the traditionally used CDFG (Control-Data Flow Graph) [NRE04]. The VSFG is defined formally, including the development of eager (dataflow) operational semantics. A discussion of how the VSFG compares to existing representations of dataflow computation is also presented.

• To test this new IR, a new high-level synthesis (HLS) tool-chain has been implemented that compiles from the LLVM IR to the VSFG, then implements the latter as a hardware description in Bluespec SystemVerilog [Nik04]. Unlike the statically-scheduled execution model of traditional custom hardware [CM08], I employ a dynamically-scheduled static-dataflow execution model for our implementation [Bud03, BVCG04], allowing for better tolerance of variable latencies and statically unpredictable behaviour.

• Custom hardware generated by this new tool-chain is shown to achieve an average speedup of 1.55× (max 4.05×) over equivalent hardware generated by LegUp, an established CDFG-based high-level synthesis tool [CCA+11]. Furthermore, VSFG-based hardware is able to approach (in some cases even improve upon) the cycle-counts of a Nehalem Core i7 processor on control-flow intensive benchmarks. While this performance incurs an average 3× higher energy cost than LegUp, the VSFG-based hardware's energy dissipation is still only 1/4× that of a highly optimized in-order Altera Nios II/f processor (and 1/8× that of a Core i7-like out-of-order processor).

• I provide recommendations for how both the energy efficiency and performance of our hardware may be further improved by implementing simple compiler optimizations, such as performing alias-analysis to partition and parallelize memory accesses, as well as how to reduce the energy overheads of speculation.

1.3 Publications and Awards

• Paper (to appear): Ali Mustafa Zaidi, David Greaves, “A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware”, 21st Reconfigurable Architectures Workshop (RAW 2014), associated with the 28th Annual International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, Phoenix, Arizona, USA.

• Poster: Ali Mustafa Zaidi, David Greaves, “Exposing ILP in Custom Hardware with a Dataflow Compiler IR”, The 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT 2013), September 2013, Edinburgh, UK.

– Award: Awarded Gold Medal at the PACT 2013 ACM Student Research Competition.

• Paper: Ali Mustafa Zaidi, David Greaves, “Achieving Superscalar Performance without Superscalar Overheads – A Dataflow Compiler IR for Custom Computing”, The 2013 Imperial College Computing Students Workshop (ICCSW’13), September 2013, London, UK.

• Award: Qualcomm Innovation Fellowship 2012, Cambridge, UK. Awarded for research proposal titled: “Mitigating the Effects of Dark Silicon”.

CHAPTER 2

Technical Background

This chapter presents a brief history of computer architecture, highlighting the technical and design challenges architects have faced previously, as well as those that must be addressed today, such as dark silicon. I establish the need for achieving both high sequential performance, as well as much higher energy efficiency, in order to mitigate the effects of dark silicon. This chapter also presents a survey of prior work on spatial computation, establishing its scalability, efficiency and performance advantages for the numeric application domain, as well as its shortcomings with respect to implementing and accelerating sequential code. This dissertation attempts to overcome these shortcomings with the development of a new dataflow compiler intermediate representation which will be discussed in Chapter 3, and described formally in Chapter 4.

2.1 The Uniprocessor Era

For over two decades, Moore's Law enabled exponential scaling of uniprocessor performance. Computer architects used the ever growing abundance of on-chip resources to build increasingly sophisticated uniprocessors that operated at very high frequencies. Starting in the mid 1980s, uniprocessor performance improved by three orders of magnitude, at approximately 52% per year (Figure 2.1, taken from [HP06]), until around 2004. Of this, two orders of magnitude can be attributed to improvements in fabrication technology leading to higher operating frequencies, while the remaining 10× improvement is attributed to microarchitectural enhancements for dynamically exposing and exploiting fine-grained instruction level parallelism (ILP), enabled by an abundant transistor budget [BC11].

Programmers would code using largely sequential programming models, while architects utilized ever more complex techniques to maximize ILP: incorporating deeper pipelining, superscalar as well as out-of-order execution to accelerate true dependences, register renaming to overcome false dependences, as well as aggressive branch prediction and misspeculation recovery mechanisms to overcome control dependences.

While the benefits of explicit parallel programming were known due to extensive work done in the high-performance and supercomputing domains [TDTD90, DD10], there was little incentive for programmers to utilize explicit parallelism in the general-purpose computing domain, since uniprocessor performance scaling effectively provided a ‘free lunch’: doubling observed performance every 18 months, with no effort required on the part of the programmer [Sut05].

Figure 2.1: Uniprocessor Performance Scaling from 1978 to 2006. Figure taken from [HP06]

Thus all the ‘heavy lifting’ of exposing and exploiting concurrency was left to the hardware, leading to increasingly complex mechanisms for instruction stream management at the microarchitecture level. Utilizing the exponentially growing on-chip resources to develop ever more complicated uniprocessors ultimately proved to be unsustainable. Around 2004, this trend came to an end due to the confluence of several issues, commonly known as the ILP, Power, Memory, and Complexity Walls [OH05].

1. The ILP Wall: In the early 1990s, limit studies carried out by David Wall [Wal91] and Monica Lam [LW92] determined that, with the exception of numeric applications1, the amount of ILP that can be dynamically extracted from a sequential instruction stream by a uniprocessor is fundamentally limited to about 4-8 instructions per cycle (IPC). Lam noted that this ILP Wall is not due to reaching the limit of available ILP in the code at runtime, but rather because control-flow remains a key performance bottleneck despite aggressive branch prediction, particularly as uniprocessors are limited to exploiting ILP by speculatively executing independent instructions from a single flow of control. By enabling the identification of multiple independent regions of code through control dependence analysis, and then allowing their concurrent execution (i.e. exploiting multiple flows of control), Lam observed that ILP could again be increased by as much as an order of magnitude in the limit [LW92].

2. The Memory Wall: As transistor dimensions shrank, both processor and DRAM clock rates improved exponentially, but the rate of improvement for processors far outpaced that for main memory. This meant that the cycle latency for accessing main memory grew exponentially [WM95]. This issue was mitigated to an extent through the use of larger last-level caches and deeper cache hierarchies, but at a significant area cost: the fastest processors dedicated as much as half of total die area to caches.

1Applications with abundant data level parallelism, and often regular, predictable control-flow. Examples include signal processing, compression, and multimedia.

Despite this, it was expected that DRAM access latency would ultimately become the primary bottleneck to performance, given the historic reliance on ever higher clock rates for uniprocessor performance improvement.

3. The Complexity Wall: While transistor dimensions and performance have scaled with Moore's Law, the performance of wires has diminished as feature sizes shrink. This is partly due to the expectation that each process generation will enable higher frequency operation, so the distance that signals can propagate in a single clock cycle is reduced [AHKB00]. Furthermore, narrower, thinner and more tightly packed wires exhibit higher resistance (R) and capacitance (C) per unit length, and thus increased signaling delay [HMMH01]. Uniprocessors have heavily relied on monolithic, broadcast resource abstractions such as centralized register files and broadcast buses in their designs, primarily in order to maintain a unified program state and support precise exceptions. However, such resources scale poorly when increased performance is required [ZK98, TA03]. With poor wire scaling limiting improvements in clock frequency, together with the ILP wall limiting improvements in IPC, designers observed severely diminishing returns in overall performance, even as design complexity, costs and effort continued to grow [BMMR05, PJS97].

4. The Power Wall: With each process generation, Moore's law enables a quadratic growth in the number of transistors per unit area, as well as allowing these transistors to operate at higher clock rates. For a given die size, this represents a significant increase in the number of circuit elements that can switch per unit time, potentially increasing power dissipation proportionally. Thankfully, total power dissipation per unit area could be kept constant thanks to Dennard scaling [DGnY+74], which posits that both per-transistor load capacitance and supply voltage can be lowered each generation. (This is described in more detail in Section 2.3.2.) However, Dennard scaling did not take into account the increased complexity of newer processor designs, as well as the poor scaling of wires, both of which contributed to an overall increase in power dissipation. Furthermore, as transistor dimensions shrink and supply voltage (and therefore threshold voltage) are reduced, there is an exponential increase in leakage current, and therefore relative static power dissipation increases as well [BS00]. Until about 1999, static power remained a small fraction of total power dissipation, but was becoming increasingly more severe with each process generation [TPB98]. A combination of these factors has meant that the total power dissipation of uniprocessor designs continued to grow to such an extent that chip power densities began to approach those of nuclear reactor cores [Pol99].

Due to the ILP, Memory, Complexity, and Power Walls, further scaling of performance could no longer be achieved by simply relying on faster clock rates and increasingly complex uniprocessors. Ending frequency scaling became necessary to keep memory access latency and architectural complexity from worsening, as well as to compensate for power increases due to leakage, poor wire scaling, and growing microarchitectural complexity.

19 with successive process generations. This meant that further performance scaling would have to rely solely on the increased exploitation of concurrency. In order to overcome the ILP Wall and improve IPC, processors would need to exploit parallelism from multiple, independent regions of code. Exploitation of more coarse- grained and/or more explicit concurrency became essential, but must be achieved without further increasing design complexity. To address poor wire scaling and the complexity wall, decentralized, modular, highly scalable architectures must be devised, so that the worst-case wire lengths do not have to scale with the amount of resources available. Instead of relying on the simple unified abstraction provided by non-scalable centralized memory or broadcast interconnect structures, cross-chip communication must now be explicitly managed between modular components. Since 2004, computer architecture has developed into two distinct directions. Multi- core architectures are primarily utilized for the general-purpose computing domain, that includes desktop and server applications. Alternatively, spatial architectures, discussed in Section 2.4 are increasingly being utilized to accelerate numeric, data-intensive ap- plications, particularly in situations where both high performance and/or high energy efficiency are required.

2.2 The Multicore Era

The need for modularity and explicit concurrency was answered with an evolutionary switch to multicore architectures. Instead of having increasingly complex uniprocessors, designers chose to implement multiple copies of conventional processors on the same die. In most cases, architects relied on the shared-memory programming model to enable programmers to write parallel code, as this model was seen as an extension of the Von Neumann architecture that programmers were already familiar with.

Wire scaling and complexity issues were mitigated thanks to the modular nature of multicore design, while the memory wall was addressed by ending frequency scaling. The ILP Wall would be avoided by relying on explicitly parallel programming to identify and execute ‘multiple flows of control’ organised into threads, communicating and synchronizing via shared memory. Ideally, the Moore's Law effect of exponential growth in transistors would then be extended into an exponential growth in the number of cores on chip.

Multicore processors are able to provide high performance scaling for embarrassingly parallel application domains that have abundant data or task level parallelism. This is often facilitated with the help of domain-specific programming models like MapReduce [RRP+07], that are used for web and database servers and other datacenter applications, or OpenCL [LPN+13], which is useful for accelerating highly numeric applications such as games, or multimedia and signal processing2.

However, for many non-numeric, consumer-side applications, performance scaling on multicore architectures has proven far more difficult. Such applications are characterized by low data- or task-level parallelism, complex and often data-dependent control-flow, and irregular memory access patterns3. Previously, programmers had relied on the fine-grained ILP exploitation capabilities of out-of-order superscalar processors to achieve high performance on such code [SL05].

2Graphics Processors or GPUs can be considered a highly specialized form of multicore processor, designed to accelerate such data-parallel applications. 3In this thesis, I refer to such code as belonging to the client-side, consumer, or general-purpose application domains, or simply as sequential code.

The shared memory programming model has proven to be very difficult for programmers to utilize in this domain, particularly when constrained to exploiting such fine-grained parallelism [HRU+07]. Programmers are required to not only explicitly expose concurrency in their code by partitioning it into threads, but also to manually manage communication and synchronization between threads at run-time. The shared-memory threaded programming model is also highly non-deterministic, since the programmer is largely unaware of the order in which concurrent threads will be scheduled, and hence alter shared state, at runtime. This non-determinism further increases the complexity of debugging such applications, as observed behavior may change with each run of the application [Lee06].

Thus, despite the decade-old push towards multicore architectures, the degree of threaded parallelism in consumer workloads remains very low. Blake et al. observed that over a period of 10 years, the number of concurrent threads in non-numeric applications has been limited to about two [BDMF10]. As a result, performance scaling for such applications remains far below what users have come to expect over the past decades. In part due to this insufficient explicit parallelism in applications, a new threat to continued performance scaling with Moore's Law has recently been identified, called Dark Silicon.

2.3 The Dark Silicon Problem

Despite ongoing exponential growth of on-chip resources with Moore's Law, the performance scalability of future designs will be increasingly restricted. This is because the total usable on-chip resources will be growing at a much slower rate. This problem is known as ‘Dark Silicon’, and is caused by two factors [EBSA+11]:

1. Amdahl’s Law and performance saturation due to insufficient explicit parallelism, and

2. the end of Dennard Scaling, together with limited power budgets leading to the Utilization Wall.

This section describes these issues in more detail, and discusses current strategies for addressing them.

2.3.1 Insufficient Explicit Parallelism

Amdahl's Law: As mentioned in the last section, the general-purpose, or consumer, application domain exhibits low degrees of parallelism. Amdahl's Law [Amd67] governs the performance scaling of parallel applications by considering them composed of a parallel and a sequential fraction, and states that for applications with insufficient parallelism, achievable speedup will be strictly constrained by the performance of the sequential fraction.

\mathrm{Perf}(f, n, S_{seq}, S_{par}) = \frac{1}{\frac{1-f}{S_{seq}} + \frac{f}{n \cdot S_{par}}} \qquad (2.1)

A generalized form of Amdahl's Law for multicore processors is shown in Equation 2.1 (adapted from [HM08]), where f is the fraction of code that is perfectly parallel, thus (1 − f) is the fraction of sequential code. S_seq is the speedup that a particular architecture provides for the sequential portion of code, n is the number of parallel processors, and S_par is the speedup each parallel processor provides when executing a thread from the parallel region of code.

Figure 2.2 shows a plot of the relative speedup of a machine with high S_seq, versus a machine with low S_seq, as f and n are varied (assume S_par = 1 for both machines). The vertical axis is the ratio of speedup of a machine with S_seq = 4 to a machine with S_seq = 1. Figure 2.2 shows that even with moderate amounts of parallelism (0.7 ≤ f ≤ 0.9), overall speedup is highly dependent on the speedup of the sequential fraction of code even as the number of parallel threads is increased. Thus achieving high sequential performance through dynamic exploitation of implicit, instruction-level parallelism remains important for scaling performance with Moore's Law. However, this must be achieved without again running into the ILP, Complexity, Power and Memory Walls.

Figure 2.2: Plot showing the importance of sequential performance to overall speedup. The y-axis measures the ratio of performance between two machines, one with high sequential performance (S_seq = 4), vs. one with low sequential performance (S_seq = 1), with all other factors being identical.
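As a rough illustration, the following Python sketch evaluates Equation 2.1 for a few assumed values of f and n, mirroring the trend plotted in Figure 2.2: the advantage of the high-S_seq machine persists even as more parallel processors are added. The specific parameter values are illustrative assumptions, not measurements.

    # Illustrative sketch of the generalized Amdahl's Law (Equation 2.1).
    # Parameter values are assumptions chosen to mirror Figure 2.2.

    def perf(f, n, s_seq, s_par):
        """Overall speedup: parallel fraction f on n processors, sequential part
        sped up by s_seq, each parallel worker sped up by s_par."""
        return 1.0 / ((1.0 - f) / s_seq + f / (n * s_par))

    # Ratio of a machine with S_seq = 4 to one with S_seq = 1 (S_par = 1 for both).
    for f in (0.7, 0.8, 0.9):
        for n in (4, 16, 64):
            ratio = perf(f, n, 4, 1) / perf(f, n, 1, 1)
            print(f"f = {f:.1f}, n = {n:3d}: high-Sseq machine is {ratio:.2f}x faster overall")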

A more comprehensive analysis of multicore speedups under Amdahl's Law is presented by Hill and Marty [HM08]. They consider various configurations of multicore processors given a fixed resource constraint: fewer coarse-grained cores, many fine-grained cores, as well as asymmetric and dynamic multicore processors. They find that while performance scaling is still limited by the sequential region, the best potential for speed-up arises from the dynamic multicore configuration, where many smaller cores may be combined into a larger core for accelerating sequential code, assuming minimal overheads for such reconfiguration. It is important to note that theirs is a highly optimistic analysis, as it assumes that sequential performance can be scaled indefinitely (proportional to √n, where n is the number of execution resources per complex processor), whereas Wall [Wal91] notes a limit to ILP scaling for conventional processors.

Esmaelzadeh et al. identified insufficient parallelism in applications as the primary source of dark silicon [EBSA+11]: with the sequential fraction of code limiting overall performance, most of the exponentially growing cores will remain unused unless the degree of parallelism can be dramatically increased.

Brawny Cores vs. Wimpy Cores: Even for server and datacenter applications that exhibit very high parallelism, and thus are less susceptible to being constrained by sequential performance, per-thread sequential performance remains essential [H10]. This is because of a variety of practical concerns not considered under Amdahl's Law – the explicit parallelization, communication, synchronization and runtime scheduling overheads of many fine-grained threads can often negate the area and efficiency advantages of wimpy, or energy-efficient, cores. Consequently, it is often better for overall cost and performance to have fewer threads running on fewer brawny cores than to have a fine-grained manycore in most cases [LNC13]. Add to this the fact that a vast amount of legacy code remains largely sequential, and we find that achieving high sequential performance will remain critical for performance scaling for the foreseeable future. Unfortunately, currently the only means of achieving high performance on general-purpose sequential code is through the use of complex, energy-inefficient out-of-order superscalar processors.

2.3.2 The Utilization Wall

The average power dissipation of CMOS circuits is given by Equation 2.2, where n is the total number of transistors, α is the average activity ratio for each transistor, C is the average load capacitance, V_DD is the supply voltage, and f is the operating frequency. P_static represents static, or leakage, power dissipation that occurs independently of any switching activity, while I_leakage is the leakage current, and k_design is a constant factor. This model for P_static is taken from [BS00].

P_{total} = P_{dynamic} + P_{static} = n \cdot \alpha \cdot C \cdot V_{DD}^{2} \cdot f + n \cdot V_{DD} \cdot I_{leakage} \cdot k_{design} \qquad (2.2)

The effect of Moore's Law and Dennard Scaling on power dissipation is described as a first-order approximation in [Ven11], and is adapted and briefly summarized here:

If a new process generation allows transistor dimensions to be reduced by a scaling factor of S (i.e. transistor width and length are both reduced by 1/S, where S > 1), then the number of transistors on chip (n) grows by S², while operating frequency (f) also improves by S. This implies that the total switching activity per unit time should increase by S³, for a fixed die size. However, chip power dissipation would also increase by S³.

In 1974, Robert Dennard observed that not only does scaling transistor dimensions also scale their capacitance (C) by 1/S, but that it is also possible to scale V_DD by the same factor [DGnY+74]. This meant that P_dynamic could be kept largely constant, even as circuit performance per unit area improved by S³!

However, as V_DD is lowered, the threshold voltage of transistors (V_th) must be lowered as well, and this leads to an exponential increase in leakage current (I_leakage) [BS00]. Although I_leakage was rising exponentially, P_static accounted for a very small fraction of total power until about 1999, but has been increasingly significant since then [TPB98].

This has meant that Dennard scaling effectively ended with the 90nm process technology in about 2004, because if V_DD was lowered further, P_static would be a significant and exponentially increasing fraction of total power dissipation [TPB98]. Consequently, with only transistor capacitance scaling, chip power dissipation would increase by S² each process generation if operated at full frequency.

This issue resulted in the Power Wall described in Section 2.1. Switching to multicore and ending frequency scaling meant that power would now only scale with S. In addition, enhancements in fabrication technology such as FinFET/Tri-gate transistors and the use of high-k dielectrics allowed designers to avoid this power wall at least temporarily [RM11, AAB+12, HLK+99]. Unfortunately, the end of Dennard scaling has another implication: for a fixed power budget, with each process generation only an ever decreasing fraction of on-chip resources may be active at any time, even if frequency scaling is ended. This problem is known as the Utilization Wall, and is exacerbated even further by the growing performance demands of increasingly portable yet functional devices like tablets, smartphones and smart-watches, that have ever more limited power budgets.
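To make this first-order argument concrete, the following sketch applies the dynamic-power term of Equation 2.2 across successive generations, contrasting classical Dennard scaling with post-Dennard scaling where V_DD can no longer be lowered. The scaling factor S = 1.4 per generation and the six-generation horizon are illustrative assumptions.

    # First-order sketch of the utilization wall, using the dynamic term of Equation 2.2.
    # S = 1.4 per generation (~2x transistors) and 6 generations are assumed values.

    S = 1.4
    generations = 6   # roughly 45nm down to 8nm, as considered in [EBSA+11]

    for g in range(1, generations + 1):
        n = S ** (2 * g)    # transistor count grows by S^2 per generation
        c = S ** (-g)       # per-transistor capacitance shrinks by 1/S
        f = S ** g          # achievable frequency grows by S
        # Dennard scaling: VDD also shrinks by 1/S, so dynamic power stays flat.
        p_dennard = n * c * (S ** (-g)) ** 2 * f
        # Post-Dennard: VDD roughly fixed.
        p_full_freq = n * c * f        # grows by S^2 per generation at full frequency
        p_fixed_freq = n * c           # grows by S once frequency scaling is abandoned
        # Fraction of the chip usable within the original (generation-0) power budget.
        usable = min(1.0, 1.0 / p_fixed_freq)
        print(f"gen {g}: Dennard P={p_dennard:.2f}, post-Dennard P={p_full_freq:.2f} "
              f"(fixed-f: {p_fixed_freq:.2f}), usable fraction ~{usable:.0%}")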

2.3.3 Implications of Dark Silicon

The issue of insufficient parallelism, together with the Utilization Wall, means that despite ongoing exponential growth in transistors or cores on-chip with Moore's Law, a growing proportion of these resources must frequently remain un-utilized, or ‘dark’. Firstly, performance scaling will primarily be limited due to poor sequential performance scaling in the consumer domain. Secondly, even for applications with abundant data or task level parallelism, such as in the server/datacenter or multimedia domains, the Utilization Wall limits the amount of parallelism that can be exploited in a given power budget.

A comprehensive analysis of these two factors by Esmaelzadeh et al. found that in 6 fabrication process generations, from 45nm to 8nm, while available on-chip resources grow by 32×, dark silicon will limit the ideal case performance scaling to only about 7.9× for highly parallel workloads, with a more realistic estimate being about 3.7×, or only 14% per year – well below the 52% we have been used to for most of the past three decades [EBSA+11]. This analysis assumed ideal per-benchmark multicore configurations from among those described by Hill and Marty [HM08], so actual performance scaling on a fixed, realistic architecture can be expected to be even lower.

Overcoming the effects of dark silicon would require addressing each of the constituent issues. Breakthroughs in auto-parallelisation or an industry-wide switch to novel programming models that can effectively expose fine-grained parallelism from sequential code would be required to effectively exploit available parallel resources. Overcoming the Utilization Wall, returning to Dennardian scaling, and re-enabling the full use of all on-chip resources would likely require a switch to a new post-CMOS fabrication technology that either avoids the leakage current issue, or is just inherently far more efficient overall [Tay12]. Barring such breakthroughs however, the best that architects can attempt is to mitigate the effects of each of these factors.

The Need to Accelerate Sequential Code

To accelerate sequential code, architects are currently developing heterogeneous multicore architectures, composed of different types of processor cores. One approach is to implement an Asymmetric Multicore processor, that combines a few complex out-of-order superscalar cores for accelerating sequential code, with many simpler in-order cores for running parallel code more efficiently [JSMP13]. Another is the Single-ISA Heterogeneous Multicore4 approach, where cores of different performance and efficiency characteristics but implementing the same ISA cooperate in the execution of a single thread – performance-critical code can be run on the complex out-of-order processors, but during phases that do not require as much processing power (e.g. I/O intensive code), execution seamlessly switches over to the simpler core for efficiency [KFJ+03, KTR+04].

An example of such a design is ARM's big.LITTLE, which is composed of two different processor types: large out-of-order superscalar Cortex-A15 cores, as well as small and efficient Cortex-A7 cores [Gre11]. big.LITTLE can be operated either as an asymmetric multicore processor, with all cores active, or as a single-ISA heterogeneous multicore, where each Cortex-A15 core is paired with a Cortex-A7 in such a way that only one of them is active at a time, depending on the needs of the scheduled threads.

However, running sequential code faster on complex cores also inevitably means a decrease in energy efficiency, thus such architectures essentially trade off between performance and energy by running non-critical regions of code at lower performance. Grochowski and Annavaram observed that after abstracting away implementation technology differences for Intel processors, a linear increase in sequential performance leads to a power-law increase in power dissipation, given by the following equation [GA06]:

Pwr = Perf^{\alpha}, \quad \text{where } 1.75 \le \alpha \le 2.25 \qquad (2.3)

This puts architects between a rock and a hard place – without utilizing complex processors, performance scaling is limited by Amdahl's Law and practical concerns, but with such processors, their high power dissipation means that the Utilization Wall limits speedup by limiting the number of active parallel resources at one time. Esmaelzadeh et al. note that in order to truly mitigate the effects of dark silicon: “Clearly, architectures that move well past the Pareto-optimal frontier of energy/performance of today's designs will be necessary” [EBSA+11].
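For intuition, Equation 2.3 implies that energy per operation grows as Perf^(α−1), so even modest sequential speedups are expensive in efficiency. The short sketch below works this out for a few assumed speedups and values of α within the quoted range.

    # Illustrative use of Equation 2.3: Pwr = Perf^alpha, with 1.75 <= alpha <= 2.25.
    # Energy per unit of work then scales as Pwr / Perf = Perf^(alpha - 1).
    # Speedup values are assumptions chosen only to show the trend.

    for alpha in (1.75, 2.0, 2.25):
        for speedup in (1.5, 2.0, 3.0):
            power = speedup ** alpha
            energy_per_op = power / speedup
            print(f"alpha = {alpha:>4}: {speedup:.1f}x faster -> "
                  f"{power:.2f}x power, {energy_per_op:.2f}x energy per operation")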

The Need for High Energy Efficiency

To mitigate the effects of the Utilization Wall, it is essential to make the most efficient use possible of the fraction of on-chip resources that can be activated at any given time. Recently, architects have been increasingly relying on custom hardware and/or reconfigurable architectures, incorporated as part of heterogeneous systems-on-chip, in order to achieve high performance and high energy efficiency on computationally intensive operations such as video codecs or image processing.

4Although both involve combining fast cores with simple cores on the same multicore system-on-chip, a subtle distinction is made between asymmetric and heterogeneous multicores, primarily due to the different use cases these terms are associated with in the cited literature. The former expects the sequential/low-parallelism fraction of an application to run on fewer large cores, with the scalable parallel code running on many smaller cores, as a direct response to Amdahl's Law, whereas the latter involves switching a single sequential thread from a small core to a large core in order to trade off energy with sequential performance, as needed. Asymmetric multicores view all cores as available to a single parallel application, whereas the heterogeneous multicores approach typically makes different kinds of cores seamlessly available to a single sequential thread of execution.

This trend has largely been driven by growing demand for highly portable (thus power limited) computing devices like smartphones and tablets, with strong multimedia capabilities. For such applications, custom hardware is able to provide as much as three orders of magnitude improvements in performance and energy efficiency [HQW+10].

Custom and reconfigurable hardware are types of spatial computing architectures. The next section describes spatial computation, and considers how its advantages may be utilized to mitigate the effects of dark silicon. I also describe the current limitations of spatial computation that need to be overcome in order to be truly useful for addressing the dark silicon problem, and how several research projects have attempted to do so.

2.4 The Spatial Computation Model

Conventional processors rely on an imperative programming and execution model, where communication of intermediate operands between instructions occurs via a centralized memory abstraction, such as a register file or addressed memory location. For such processors, the spatial locality between dependent instructions – i.e. where they execute in hardware relative to each other – is largely irrelevant. Instead, what matters is their correct temporal sequencing – when an instruction executes such that correct state can be maintained in the shared memory abstraction.

Custom and reconfigurable hardware on the other hand utilize a more dataflow-oriented, spatial execution model. Dataflow graphs of applications are mapped onto a collection of processing resources laid out in space, with intermediate operands directly communicated between producers and consumers using point-to-point wires, instead of through a centralized memory abstraction. As a result, where in space an operation is placed is crucial for achieving high efficiency and performance – dependent instructions are frequently placed close to each other in hardware in order to minimize wiring lengths in the spatial circuit.

2.4.1 Advantages of Spatial Computation

Scalability

As the number of operations that can execute in parallel is increased, the complexity of memory elements such as register files in conventional architectures grows quadratically [ZK98, TA03]. Instead of such structures, spatial architectures (a) rely on programmer- or compiler-directed placement of operations, making use of abundant locality information from the input program description to minimize spatial distances between communicating operations, and then (b) implement communication of operands between producers and consumers through short, point-to-point, possibly programmable wires.

While broadcast structures like register-files and crossbars are capable of supporting non-local, random-access, any-to-any communication patterns, recent work by Greenfield and Moore indicates that maintaining this level of flexibility is unnecessary. By analysing the dynamic-data-dependence graphs of many benchmark applications, they observe that the communication patterns of many applications demonstrate Rentian scaling in both temporal and spatial communication between dependent operations [GM08a, GM08b].

By making use of short, point-to-point wiring for communication, spatial computation is able to take advantage of the high locality implied by Rent's rule: instead of having the worst-case, quadratic complexity growth of a multi-ported register-file, the communication complexity of spatial architectures would be governed by the complexity of the communication graph for the algorithm/program it is implementing.

The communication-centric nature of spatial architectures is also beneficial in addressing the general issue of poor wire scaling. Ron Ho et al. observed that: “increased delays for global communication will drive architectures towards modular designs with explicit global latency mechanisms” [HMMH01]. The highly modular nature of spatial architectures, together with their exposure of communication resources and their management to higher levels of abstraction (i.e. the programmer and/or compiler), means that they are inherently more scalable than traditional uniprocessor architectures.

Computational Density

In complex processor cores, only a small fraction of the die area is dedicated to execution resources that perform actual computation. Complex processors are designed to maximize the utilization of a small set of execution resources by overcoming false and control dependencies, and accelerating true dependences in an instruction stream. Consequently, the majority of core resources are utilized in structures for dynamically exposing concurrency from a sequential instruction stream: large instruction windows, register renaming logic, branch prediction, re-order buffers, multi-ported register files, etc.

Spatial architectures instead dedicate a much larger fraction of area to processing elements. Per unit area, this allows spatial architectures to achieve much higher computational densities than conventional superscalar processors [DeH96, DeH00]. Provided that applications can be mapped efficiently to spatial hardware such that the abundant computational elements can be effectively utilized, spatial architectures can achieve much greater performance per unit area. Given the fact that the proportion of usable on-chip resources is shrinking due to the Utilization Wall, the higher computational density offered by spatial architectures is an effective way of continuing to scale performance by making more efficient use of available transistors.

Energy Efficiency

Due to poor wire scaling, the energy cost of communication now far exceeds the energy cost of performing computation. Dally observes that transferring 32 bits of data across chip consumed the energy equivalent of 20 ALU operations in the 130nm CMOS process, which increased to about 57 ALU operations in the 45nm process, and is only expected to get worse [Dal02, MG08].

The communication-centric, modular, scalable, and decentralized nature of spatial architectures makes them well-suited to also addressing the energy efficiency challenges posed by poor wire scaling. Exploitation of spatial locality reduces the distances signals must travel, while reliance on decentralized point-to-point interconnect instead of multi-ported RAM and CAM (content-addressable memory) structures reduces the complexity of communication structures.

Programmable spatial architectures are able to reduce the energy cost of programmability in two more ways. First, instead of being broadcast from a central instruction store to execution units each cycle (such as the L1 instruction cache), ‘instructions’ are configured locally near each processing element, thereby reducing the cost of instruction stream distribution. Secondly, spatial architectures are able to amortize the cost of instruction fetch by fixing the functionality of processing elements for long durations – when executing loops, instructions describing the loop can be fetched and spatially configured once, then reused as many times as the loop iterates. DeHon demonstrates in [DeH13] that, due to these factors, programmable spatial architectures exhibit an asymptotic energy advantage over conventional temporal architectures.

A further energy efficiency advantage can be realised by removing programmability from spatial computation structures altogether. Instead of utilizing fine-grained programmable hardware like FPGAs, computation can be implemented as fixed-function custom hardware, eliminating the area, energy, and performance overheads associated with bit-level logic and interconnect programmability. For certain applications, this approach has been shown to provide as much as three orders of magnitude improvements in energy efficiency over conventional processors [HQW+10], and almost 40× better efficiency than a conventional FPGA, with its very fine-grained, bit-level programmable architecture [KR06]. While this improved efficiency (and often performance) comes at the cost of flexibility, given the ever-diminishing cost per transistor thanks to Moore's Law, incorporating fixed-function custom hardware is an increasingly attractive option for addressing the utilization wall problem by actively making use of dark silicon through hardware specialization.

Of course, a middle ground does exist between highly-programmable but relatively inefficient FPGAs and inflexible but efficient custom hardware: researchers are increasingly developing coarse-grained reconfigurable arrays (CGRAs), that are optimized for specific application domains by limiting the degree and granularity of programmability in their designs. This is done, for instance, by having n-bit ALUs and buses instead of bitwise programmable LUTs and wires. Examples of such architectures are discussed in Section 2.4.3.

However, there remain several issues with spatial architectures that must be addressed before they can be more pervasively utilized to mitigate the effects of dark silicon, particularly for the general-purpose computing domain.

2.4.2 Issues with Spatial Computation Despite its considerable advantages, spatial computation has not found ubiquitous uti- lization in mainstream architectures, due to a variety of issues. Many of these issues are often specific to particular types of spatial architecture, and could be addressed by switching to a different type. For instance, FPGAs provide a high degree of flexibility, but incur high device costs due to their specialized, niche-market nature, as well as exorbitant compilation times due to their fine-grained nature. This makes it difficult to incorporate them into existing software engineering practices that rely on rapid recompilation and testing. FPGAs also incur considerable cost, area, performance, and efficiency penalties over custom hardware, limiting the scope of their applicability to application areas where their advantages outweigh their drawbacks [Sti11]. Some of the efficiency, performance and cost issues can be mitigated by utilizing fixed- function custom hardware, or domain-specific CGRAs, as suggested by several academic research projects [KNM+08, MCC+06, VSG+10]. However, there are two fundamental issues for spatial computation that must be addressed before such architectures can be
easily and pervasively used.

Programmability

Implementing computation on spatial hardware is considerably more difficult than writing code in a high level language. This is a direct consequence of the spatial nature of computation. Whereas conventional processors employ microarchitecture-level dynamic placement/allocation of operations to execution units, and rely on broadcast structures for routing of operands, spatial architectures instead relegate the responsibility of operation placement and operand routing to the higher levels of abstraction. The programmer and/or the compiler (and in some cases even the system runtime) are now responsible for explicitly orchestrating placement, routing, and execution/communication scheduling of individual operations and operands. This is analogous to the way that the shift to multicore made the programmer responsible for exposing concurrency, except that spatial computation makes this far more complex, as many more low-level hardware details must now be managed explicitly. A related issue is that of hardware virtualization: for conventional processors, programmers remain unaware of the resource constraints of the underlying execution environment – for a given ISA, the hardware may implement a simple processor with fewer execution units, or a complex processor with more. The programmer need not be aware of this difference when writing code. On the other hand, the spatial nature of computation also exposes the hardware capacity constraints to higher levels of abstraction. The programmer must in most cases ensure that the developed spatial description satisfies all cost, resource or circuit-size constraints. Historically, programmers relied on low-level hardware description languages (HDLs) such as Verilog and VHDL to precisely specify the hardware for the functionality that they wished to implement. More recently, design portability and programmer productivity have been improved thanks to sophisticated high-level synthesis (HLS) tools, that allow the programmer to define hardware functionality using a familiar high-level language (HLL). Many such tools support a subset of existing HLLs like C, C++, or Java [CLN+11, CCA+11], while some augment these languages with specialized extensions to simplify concurrent hardware specification in a sequential language [Pan01, PT05]. At the same time, common coding constructs that enhance productivity, such as recursion, dynamic memory allocation and (until recently) object-oriented code are usually not permitted. Furthermore, the quality of output from such tools is highly sensitive to the coding style used [Sti11, SSV08], thus requiring familiarity with low-level digital logic and design optimization in order to obtain the best results. Recent HLS tools manage to provide much better support for high-level languages [CLN+11, CCA+11], but nevertheless, the spatial hardware programming task remains one of describing the hardware implementation in a higher-level language, instead of merely coding the algorithm to be implemented. Due to the difficulty and cost of effectively programming spatial architectures, their use has largely been relegated to numeric application domains where the abundant parallelism is easier to express, and the order-of-magnitude performance, density and efficiency advantages of spatial computation far outweigh the costs of their programming and implementation.

Amenability

Conventional processors dedicate considerable core resources to managing the instruction stream, and identifying and overcoming name and control dependences between individual instructions to increase parallelism. On the other hand, spatial architectures dedicate very few resources to such dynamic discovery of concurrency, opting instead for high computational density by dedicating far more area to execution units and interconnect. In order to make full use of the available execution resources, discovering and overcoming dependences then becomes the responsibility of higher levels of abstraction. Overcoming many statically (compile-time) known name dependences becomes trivial through use of point-to-point communication of intermediate values, since there is no centralized register file with a finite number of registers that must be reused for multiple values. Also, unlike the total order on instructions imposed by a sequential instruction stream, spatial architectures directly implement the data-flow graph of an application, which only specifies a partial order on instructions considering only memory and true dependences. However, overcoming dependences that require dynamic (runtime) information becomes more difficult, as the statically-defined structure of a spatial implementation cannot be easily modified at run-time. Code with complex, data-dependent branching requires aggressive control-flow speculation to expose more concurrency. Based on code profiling, the compiler may be made aware of branch bias – i.e. which side of a branch is more likely to execute in general – but it cannot easily exploit knowledge of a branch’s dynamic behavior to perform effective branch prediction, which tends to be significantly more accurate than static bias information [MTZ13]. Similarly, code with pointers and irregular memory accesses introduces further name dependences that cannot be comprehensively overcome with only compile-time information (e.g. through alias-analysis). Due to these issues, spatial architectures have thus far largely been utilized for application domains that have regular, predictable control-flow, and abundant data or task-level parallelism that can be easily discovered at compile-time. Conversely, for general-purpose code that contains complex, data-dependent control-flow, spatial architectures consistently exhibit poor performance.

Implications

For these reasons, spatial architecture utilization is restricted largely to numeric application domains such as multimedia, signal-processing, cryptography, and high-performance computing, where there is an abundance of data-parallelism, and often regular, predictable control-flow and memory access patterns. Due to an increasing demand for highly portable yet functional, multimedia-oriented devices, custom hardware components are commonly included in many smartphone and tablet SOCs, particularly to accelerate video, radio-modem, and image processing codecs. Such custom hardware presents an effective utilization of dark silicon, since these components are only activated when implementing very specific tasks, and remain dark for the remainder of the time. While FPGAs are unsuitable for the portable computing domain due to their high area and relatively higher energy cost, they are increasingly being utilized in high performance computing systems, again to accelerate data intensive tasks such as signal-processing for oil and gas exploration, financial analytics, and scientific computing [Tec12]. However, with growing expectations of improved performance scaling with future tech-
nology generations, as well as critical energy-efficiency concerns due to the utilization wall, it is becoming increasingly important to broaden the applicability, scope and flexibility of spatial architectures so they may be utilized to address these issues. Section 2.4.3 highlights several recent research projects that attempt to address the programmability and/or amenability issues with spatial computation. This brief survey shows that while considerable success has been achieved in addressing the programmability issue, addressing amenability (by improving sequential code performance) has proven more difficult, especially without compromising on energy efficiency.

2.4.3 A Brief Survey of Spatial Architecture Research

While there are many examples of research projects developing spatial architectures tar- geted at numeric application domains like multimedia, signal processing etc. [HW97, PPM09, PNBK02], this brief survey focuses on selected projects that attempt to address at least one, if not both of the key limitations of spatial architectures, namely programma- bility and amenability.

RICA: The Reconfigurable Instruction Cell Array [KNM+08]

RICA is a coarse-grained reconfigurable architecture designed with the goal of achieving high energy efficiency and performance on digital signal processing applications. RICA is programmable using high-level languages like C, and executes such sequential code one basic-block at a time. To conserve energy, basic-blocks are ‘depipelined’, meaning that intermediate operands are only latched at basic-block boundaries, reducing the number of registers required in the design, but resulting in each basic block executing with a variable latency. To address this, the execution of instructions in a block is scheduled statically, so that the total latency of each block is known at compile-time. This known latency is then used to enable the output latches from each basic block after the specified number of cycles. RICA does not attempt to overcome control-flow dependences in any significant way. No support is provided for control-flow speculation, though the compiler does implement some optimizations such as loop unrolling and loop fusion that reduce some of the control-flow overheads. Though not mentioned, RICA might be able to implement some speculative execution through the use of compile-time techniques such as if-conversion [AKPW83] and hyperblock formation [MLC+92] to combine multiple basic blocks into larger blocks and expose more ILP across control-flow boundaries – such approaches are already used for generating statically scheduled VLIW code [ACPP05]. While RICA is able to address the issue of programmability to some degree, it still suffers from poor amenability, and as such is limited to accelerating DSP code with simple control-flow. RICA provides 3× higher throughput than a low-power Texas Instruments TI C55x DSP processor, but with 2-6× lower power consumption. Compared to an 8-way VLIW processor (the TI 64X processor), RICA achieves similar performance on applications with simple control-flow, again with a 6× power advantage. However, for DSP applications with complex control-flow, RICA performs as much as 50% worse than the VLIW processor despite the numeric nature of the applications.

The MIT RAW Architecture [TLM+04]

The RAW architecture was developed to address the problem of developing high throughput architectures that are also highly scalable. Unlike RICA’s coarse-grained reconfigurable array, RAW is classified as a massively parallel processor array (MPPA), since each of its processing elements is not simply an ALU, but a full single-issue, in-order MIPS core with its own program counter. Each such core executes its own thread of code, and also has an associated programmable router. The ISA of the cores is extended with instructions for explicit communication with neighbouring cores. The RAW architecture supports the compilation of general-purpose applications through the use of the RAWCC compiler, which is responsible for partitioning code and data across the cores, as well as statically orchestrating communication between cores. Much like a VLIW architecture, the responsibility for exposing and exploiting ILP rests with the compiler. Compared to an equivalent Pentium III processor, a 16-tile RAW processor is able to provide as much as 6× performance speedup on RAWCC-compiled numeric applications. Unfortunately, performance on non-numeric sequential applications is as much as 50% worse. While energy and power results are not provided, the 16-tile RAW architecture requires 3× the die area of the Pentium III processor at 180nm. However, when utilizing a streaming programming model that enables the programmer to explicitly specify coarse-grained data-parallelism in numeric applications [GTK+02], RAW is able to provide as much as 10× speedups on streaming and data-parallel applications over the Pentium III.

DySER [GHN+12]

The DySER architecture is a spatial datapath integrated into a conventional processor pipeline. DySER aims to simultaneously address two different issues: functionality specialization, where a frequently executed region of sequential code is implemented as spatial hardware in order to mitigate the cost of instruction fetch and improve energy-efficiency, and data-level parallelism, where the spatial fabric is utilized to accelerate numeric code regions. Unlike many previous CGRAs, DySER relies on dynamically-scheduled, static-dataflow style execution of operations: instead of the execution schedule being determined at compile-time and encoded as a centralized finite-state machine, each processing element in the DySER fabric is able to execute as soon as its input operands are available. DySER relies on a compiler to identify and extract frequently executed code regions and execute them on the spatial fabric. The DySER fabric does not support backwards (loop) branches or memory access operations: code regions selected for acceleration must therefore be partitioned into a computation subregion and a memory subregion, with the former mapped to the DySER fabric. All backwards branches and memory access operations are executed in parallel on the main processor pipeline. From the perspective of the processor, a configured DySER fabric essentially looks like a long-latency, pipelined execution-unit. As the DySER fabric need not worry about managing control-flow or memory accesses, this approach greatly simplifies its design. However, being tightly coupled with a conventional processor considerably limits the advantages of spatial computation for the system as a whole. Without support for backwards branching, DySER is limited to accelerating code from the inner-most loops of applications. Furthermore, the efficiency
and performance advantages of the spatial fabric can be overshadowed by the energy cost of the conventional processor that cannot be deactivated while the fabric is active. A 2-way out-of-order superscalar processor extended with a DySER fabric is able to achieve a speedup of 39% when accelerating sequential code, over the same processor without DySER. However, only a 9% energy efficiency improvement is observed. On data-parallel applications, DySER achieves a 3.2× speedup, with a more respectable 60% energy saving over a conventional CPU.

Wavescalar [SSM+07]

The Wavescalar architecture was designed with two objectives in mind: (1) develop a highly scalable, decentralized processor architecture, (2) that matches or exceeds the performance of existing superscalar processors. The compiler for Wavescalar compiles HLL code into a dataflow intermediate representation: the Wavescalar ISA. Unlike conventional processor ISAs, the Wavescalar ISA does not have the notion of a program counter or a flow of control. Instead, execution of each operation is dynamically scheduled in dataflow order. Abandoning the notion of a flow of control between blocks of instructions allows Wavescalar to concurrently execute instructions from multiple control-independent regions of code, effectively executing along multiple flows of control, as described by Lam [LW92] (this is discussed in greater detail in Chapter 3). Wavescalar employs the dynamic dataflow execution model [AN90], meaning that multiple copies of each instruction in the Wavescalar ISA may be active at any time, and may even execute out of order, depending on the availability of each copy’s input operands. This is similar to an out-of-order superscalar processor, which implements a restricted version of the dynamic-dataflow execution model for instructions in its issue window [PHS85]. The original implementations of the dataflow execution model in the 1970s and 1980s did not support the notion of mutable state, and hence were unable to support compilation of imperative code to dataflow architectures, instead relying on functional and dataflow programming languages [WP94, Tra86]. To support mutable state, Wavescalar introduces a method of handling memory-ordering called wave-ordered memory, that associates sequencing information with each memory instruction in the Wavescalar ISA. Wave-ordered memory enables out-of-order issue of memory requests, but allows the memory system to correctly sequence the out-of-order requests and execute them in program order. Thus a total load-store sequencing of side-effects in program-order can be imposed. Wavescalar currently does not support control speculation, or dynamic memory disambiguation. Nevertheless, thanks to its ability to execute along multiple flows of control, it performs comparably to an out-of-order Alpha EV7 processor on average – outperforming the latter on scientific SpecFP benchmarks, while performing 10-30% worse than the Alpha on the more sequential, control-flow intensive SpecINT benchmarks. While not implemented, Swanson et al. estimate that with perfect control-flow prediction and memory disambiguation, average performance could be improved by 170% in the limit. As energy-efficiency was not a stated goal for the Wavescalar architecture, no energy results are presented. However, efficiency improvements from the Wavescalar architecture can be expected to be limited, since unlike statically scheduled architectures (PPA [PPM09], RICA [KNM+08], C-Cores [VSG+10]) or static-dataflow architectures (DySER [GHS11], Phoenix/CASH [BVCG04]), Wavescalar incurs considerable area and energy overheads in the form of tag-matching hardware to implement the dynamic-
dataflow execution model. Only a small fraction of each Wavescalar processing element (PE) is dedicated to actual computation, reducing its computational density advantage – the majority of PE resources are used to hold up to 64 in-flight instructions, along with tag-matching, instruction wake-up and dispatch logic.

TRIPS [SNL+04] and TFlex [KSG+07]

The TRIPS architecture had similar design goals to the Wavescalar project: develop a scalable spatial architecture that can overcome poor wire-scaling, while continuing to improve performance for multiple application types, including sequential code. TRIPS aimed to be polymorphic, i.e. capable of effectively accelerating applications that exhibited instruction, data, as well as thread-level parallelism. Unlike the purely dataflow nature of the Wavescalar ISA, TRIPS retained the notion of a flow-of-control, utilizing a hybrid approach that attempted to combine the familiarity of the Von-Neumann computation model with the spatial, fine-grained concurrency of the static dataflow execution model. A program for TRIPS was compiled to a Control-Data Flow Graph (CDFG) representation. Unlike the pure dataflow Wavescalar ISA, TRIPS retained the notion of a sequential order between basic-blocks in a CFG, but like RICA, only a dataflow order is expressed within each such acyclic block. As it is important to overcome control-flow dependences in order to achieve high performance on sequential code, TRIPS utilizes a combination of static and dynamic techniques to perform control speculation. The TRIPS compiler relies on aggressive loop-unrolling and flattening (i.e. transforming nested loops into a single-level loop) to reduce backwards branches, and then reduces forward branches by combining multiple basic-blocks from acyclic regions of code into hyperblocks [MLC+92, SGM+06]. The TRIPS architecture incorporates a 4×4 spatial grid of ALUs, onto which the acyclic dataflow graph of each hyperblock is statically placed and routed by the compiler. To accelerate control-flow between hyperblocks, TRIPS utilizes aggressive branch prediction hardware to fetch and execute up to 8 hyperblocks speculatively, by mapping each into its own separate context on the 4×4 grid. This approach has its own trade-offs: correct prediction of hyperblocks can potentially enable much greater performance, but incorrect prediction incurs a considerable energy and performance cost, as the misspeculated hyperblock and all its predicted successors must be squashed and a new set of blocks fetched. TRIPS incurs considerable hardware overheads due to managing hyperblocks in separate contexts, dynamically issuing instructions from across multiple hyperblocks, and effectively stitching hyperblocks together to communicate operands between hyperblock contexts [Gov10]. As a result of this complexity, despite the advantages of largely decentralized spatial execution, TRIPS exhibits only a 9% improvement in energy efficiency compared to an IBM Power4 superscalar processor, at roughly the same performance, when executing sequential code. More recent work shows that TRIPS performance on sequential SpecINT benchmarks is 57% worse on average cycle-counts than an Intel Core2 Duo processor (with only one core utilized) [KSG+07]. Kim et al. attribute this poor performance to the limitations of their academic research compiler, as code must be carefully tuned to produce high performance on TRIPS. A newer version of TRIPS, called TFlex, manages to improve average sequential performance by 42% over TRIPS, while dissipating the same power. This is achieved primarily by allowing the compiler to adapt the hardware to the requirements
of the application more accurately – TFlex does not constrain execution to a fixed 4×4 grid of PEs, instead allowing the compiler to select the appropriate grid configuration for each application.

The CMU Phoenix/CASH [BVCG04]

The CMU Phoenix project developed the Compiler for Application Specific Hardware (CASH), the goals of which were to maximize the programmability, scalability, performance and energy efficiency potential of custom-computing by compiling unrestricted high-level language code to asynchronous static-dataflow custom hardware. Instead of targeting a specific spatial architecture like TRIPS, Wavescalar or DySER, the CASH compiler transformed largely unrestricted5 ANSI C code into a custom hardware description in Verilog. The CASH compiler converted C code into the ‘Pegasus’ IR – a CDFG-derived dataflow graph which could be directly implemented in hardware. Like the TRIPS compiler, Pegasus relied upon aggressive loop unrolling and hyperblock formation to mitigate control-flow dependences and expose ILP. However, unlike TRIPS, the generated application-specific hardware (ASH) did not incorporate any branch prediction to overcome inter-block control-flow dependences at run-time. As a result, while instructions from across multiple hyperblocks (or even multiple instances of the same hyperblock) could execute in parallel, they would only do so when the control-flow predicate for each hyperblock had been computed. Being full-custom hardware, ASH avoided the overheads of dynamic instruction-stream management and dataflow stitching incurred by TRIPS. By employing static-dataflow instead of dynamic-dataflow, ASH avoided the costs of dynamic tag-matching and instruction triggering logic incurred by Wavescalar. A further efficiency improvement results from the fact that applications are implemented as fully asynchronous hardware, thus the dynamic power overhead of driving a clock-tree, which forms a significant proportion of dynamic power dissipation in synchronous circuits, is avoided. In combination, these factors allow the generated hardware to achieve three orders of magnitude greater energy efficiency than a conventional processor. One drawback of the CASH approach is that the size of the generated ASH circuit is proportional to the number of instructions in the code being implemented. In order to overcome this issue, as well as target a more flexible, reconfigurable hardware platform, Mishra et al. developed Tartan, a hybrid spatial architecture that loosely coupled a conventional processor with an asynchronous CGRA. The conventional processor implements all the code that could not be compiled for the CGRA, in particular handling library and system calls. Unlike DySER, the CGRA is largely independent of the conventional processor and able to access memory independently. Thus while the CGRA is active, the processor may be deactivated to conserve power. The Tartan CGRA provided a 100× energy-efficiency advantage over the same code implemented on a simulated out-of-order processor [MCC+06]. Tartan also alleviated the circuit size issue by supporting hardware virtualization in the CGRA [MG07]: during program execution, the ASH implementations of required program functions would be fetched and configured into the CGRA, and evicted once more space was needed for

5 Even recursive functions are supported; the only missing support was for library and system calls, and run-time exception handling (using the signal() and wait(), setjmp() and longjmp() functions). Code containing these types of operations is offloaded to a loosely-coupled conventional processor.

newer functions. This allowed a fixed-size CGRA to implement an arbitrarily sized ASH circuit, so long as the size of an individual function did not exceed the capacity of the CGRA. The performance characteristics of ASH follow a similar pattern to that observed in earlier projects: for embedded, numeric applications with simple control-flow, ASH achieves speedups ranging from 1.5-12×, but for sequential code, ASH can be as much as 30% slower than a simulated 4-way out-of-order superscalar processor [BAG05]. Budiu et al. identify two key reasons for this limitation: the lack of control-flow speculation between hyperblocks, and the fact that the dynamic-dataflow model approximated by conventional out-of-order processors is fundamentally more powerful than the static-dataflow model.

Conservation Cores [VSG+10]

The GreenDroid project developed the Conservation Cores approach, designed specifically to mitigate the effects of the utilization wall. Their objective was to identify frequently executing regions of code in a system, and implement these as patchable6 custom hardware cores. Unlike conventional systems-on-chip that incorporate custom hardware for accelerating numeric applications, conservation cores focus on energy-efficiency instead of performance. The conservation cores approach attempts to make effective use of dark silicon by specializing available hardware resources, and utilizing a fraction of them as needed. Energy-intensive code regions are selected for implementation as conservation cores. These are compiled to a CDFG-style intermediate representation, and then implemented as synchronous, statically-scheduled custom hardware (with some modifications to incorporate patching support). A number of conservation cores are loosely-coupled (like Tartan/ASH, and unlike DySER) with a simple in-order MIPS 24K processor core that executes the remaining code. Patchable conservation cores are able to provide 10× higher energy efficiency than the same functionality implemented on the in-order MIPS core, but with a 5-10% performance loss compared to the in-order processor. More recent revisions of this work apply ‘selective depipelining’ to reduce the number of datapath registers, and incorporate ‘cachelets’ to accelerate memory access to the cores [SVGH+11]. This has improved efficiency over conservation cores by 2× on average, for a total efficiency advantage over the MIPS core of 20×, and improved performance to about 1.2× the in-order MIPS baseline. Nevertheless, with statically scheduled execution and limited control-flow speculation, the sequential performance of conservation cores remains far below that of an out-of-order superscalar processor.

2.5 Summary

Computer Architects face several critical challenges to the continued scaling of perfor- mance with Moore’s Law:

1. Poor wire scaling, together with growing design cost and complexity issues are driving the development of highly modular, scalable, communication-centric archi-

6Conservation cores are slightly programmable, in that some of their behavior can be modified at runtime. This puts them conceptually between conventional fixed-function custom hardware, and highly programmable reconfigurable architectures.

                            RICA      Cons-Cores  CASH      TRIPS    DySER    Wavescalar  Superscalar
                            [KNM+08]  [VSG+10]    [BVCG04]  [Gov10]  [GHS11]  [SSM+07]
Execution Scheduling        static    static      SDF       SDF      SDF      DDF         DDF
Control-flow Acceleration   static    static      static    BP       BP       MFC         BP
Sequential Performance      Low       Low         Medium    Medium   High     High        High
Numeric Performance         High      High        High      High     High     High        Medium
Energy Efficiency           High      High        High      Medium   Low      Low         Low

Table 2.1: A summary of features and comparison of energy/performance characteristics of spa- tial architectures surveyed in Section 2.4.3. SDF = Static Dataflow; DDF = Dynamic Dataflow; BP = Branch Prediction; MFC = Multiple-flows of Control.

tectures that require explicit management of communication at higher levels of abstraction (e.g. by the programmer or compiler).

2. The end of Dennard scaling, and the resultant Utilization Wall, necessitate the development of highly energy efficient architectures with high computational density, to provide continued performance scaling with Moore’s Law.

3. The continuing importance of achieving high performance on sequential code, ne- cessitated by Amdahl’s Law for improving performance on applications with insuffi- cient explicit parallelism [EBSA+11]. Even for domains with abundant thread-level parallelism, per-thread sequential performance remains important [H10].

Spatial Architectures are well suited to addressing two of these three issues. They are modular and decentralized, mitigating the effects of poor wire-scaling, while also provid- ing high application specific adaptivity and computational density, leading to orders-of- magnitude improvements in energy efficiency that help mitigate the effects of the utiliza- tion wall [BVCG04, HQW+10]. In addition, they have also demonstrated high potential for accelerating applications with abundant data-level parallelism. However, the crucial problem of amenability still needs to be properly addressed: while embedded applications with simple, regular control-flow and abundant data-parallelism can be conveniently compiled to spatial architectures and provide high performance and efficiency [GHS11, Bud03, KNM+08, PPM09], sequential code with limited data paral- lelism, complex control-flow, and often irregular memory access patterns runs poorly when compiled to spatial architectures. Either performance [VSG+10, BVCG04], or energy- efficiency [SNL+04, GHS11, SSM+07] must be compromised when compiling such code to spatial architectures. Typically, researchers suggest utilizing complex and energy- inefficient out-of-order superscalar processors to accelerate such code [BAG05, H10]. Table 2.1 highlights the relative energy/performance trade-offs of the various archi- tectures surveyed in Section 2.4.3. Several spatial architectures manage to approach or exceed the high sequential performance of superscalar processors when running sequen- tial code, but lose much of their efficiency advantage due to significant microarchitectural overheads incurred by dynamic management of instructions, data-forwarding, or manage- ment of speculative execution contexts [SSM+07, Gov10, GHS11]. High energy-efficiency is achieved instead by architectures that generate highly application-specific hardware,
targeting fixed-function hardware or CGRAs with minimal dynamic instruction or data-stream management overheads [KNM+08, BVCG04, VSG+10], but these architectures exhibit poor sequential performance in comparison to a superscalar processor. It is also important to note that architectures that implement dataflow-style dynamic execution scheduling of instructions, as well as some mechanism for overcoming control-flow dependences (either via branch prediction or exploiting multiple-flows of control) [BVCG04, MCC+06, SNL+04, SSM+07, GHS11] achieve higher sequential performance than architectures that rely on static execution scheduling and/or have limited control-flow speculation support [VSG+10, TLM+04]. In order to improve sequential performance in spatial architectures it is therefore necessary to expose ILP by overcoming control-dependences in sequential code, as well as to exploit this ILP by utilizing dataflow-style dynamic execution scheduling of instructions. However, in order to do so without compromising the energy efficiency advantages of spatial hardware, I argue that much of this work ought to be done at the compiler level instead of the hardware microarchitecture level, in order to minimize runtime energy overheads. To this end, this dissertation develops a new dataflow compiler IR, designed to statically expose ILP from control-flow intensive code, and suitable for implementation as application-specific static-dataflow hardware [BVCG04]. Chapter 3 describes the key limitations to achieving high sequential performance in application-specific hardware, and then discusses how this new IR is able to expose ILP from control-flow intensive code.

CHAPTER 3

Statically Exposing ILP from Sequential Code

In Chapter 2, we observed the need to achieve very high energy efficiency, as well as high sequential performance, in order to mitigate the effects of dark silicon. Different spatial computation approaches provide either efficiency [KNM+08, BVCG04, VSG+10], or sequential performance [SSM+07, Gov10, GHS11], but not both. This chapter starts by discussing the nature of imperative, sequential code, and how concurrency may be exposed and exploited from it. We then discuss how spatial architectures (custom/recon- figurable hardware in particular) achieve high energy efficiency by relying primarily on static exposition of ILP by the compiler or programmer, at the cost of sequential perfor- mance [CCA+11, BVCG04, MCC+06, VSG+10]. This is contrasted with the primarily dynamic exposition of ILP in out-of-order superscalar processors, incurring a high energy cost for improving sequential performance. Finally, we discuss how the performance limitations of spatial architectures may be overcome, with the help of the Value State Dependence Graph [Law07], without incurring the high energy cost of dynamic ILP techniques.

3.1 The Nature of Imperative Code

A simple grammar to represent an assembly-like imperative language is presented in Figure 3.1. Each operation ‘o’ updates the computational state by reading values from its input locations, performing the appropriate operation, and assigning a value to its output location(s). As with modern processor assembly-languages, I assume strict execution semantics for each operation, i.e. all inputs must be available before an operation can execute. Storage locations serve as a means of communicating values between producer and consumer operations. The set of all program locations L constitutes the state of the computation, and may be considered composed of two sets: L = Reg ∪ Mem, with r ∈ Reg, where Reg is the set of temporary registers, and m ∈ Mem, where Mem is the set of memory locations1.

1 For simplicity, only a few addressing modes are shown, such as register (binary(rsrc1, rsrc2)), immediate (binaryI(rsrc1, imm ∈ Z)), direct (loadDirect(msrc)), and register-indirect (load(raddr), or store(raddr, rdata)). I also assume that I/O operations are memory-mapped, and volatility issues will be taken into account by the programmer/compiler that produces such code. Finally, I assume that no operation has any side-effects other than updates to the set of register and memory locations L.

Given r ∈ Reg, m ∈ Mem, and L = Reg ∪ Mem, where L is the set of ‘locations’ constituting program (and system) state:

Operations    o ::= rdest := binary(rsrc1, rsrc2)
                  | rdest := binaryI(rsrc1, imm ∈ Z)
                  | rdest := unary(rsrc)
                  | rdest := select(rpred, rsrc1, rsrc2)
                  | rdest := loadDirect(msrc)
                  | rdest := load(raddr)
                  | mdest := store(raddr, rdata)

Control-Flow  c ::= skip
                  | o1 ; cN
                  | if rP then cT else cF
                  | while rP do cL
                  | call(cfunction)
                  | return

Figure 3.1: BNF Grammar for low-level imperative code. I assume that I/O is memory- mapped and handled through the load/store instructions, and that volatility aspects are taken into account. Also note that for the store operation, the contents of raddr specify the memory location mdest.

Multiple operations can be composed into an application control-flow graph (CFG), by making use of the control-flow commands ‘c’. Sequential composition is used to construct basic blocks (o1; cN), while the ‘if’ and ‘while’ statements implement control-flow between basic-blocks. This typical representation of imperative code makes the data dependencies between operations implicit – i.e. values are passed between producer and consumer operations indirectly, through storage locations – whereas the control-flow ordering between operations is made explicit. It is expected that the order of execution of operations and control-flow commands (and hence the order of updates to the set of all locations L) will occur strictly in the control-flow order specified by the program (i.e. sequentially) in order to ensure correctness of program execution and predictability of state updates throughout execution. In conventional processors, this is implemented by utilizing a program-counter, which specifies the unique next operation that will execute in control-flow order. However, this strictly sequentializing nature of control-flow can be relaxed in various ways in order to improve performance by allowing some concurrent updates to state, without affecting the correctness of program execution.
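To make the mapping concrete, the short C fragment below (a hypothetical example, not drawn from any figure in this dissertation) is annotated with a sketch of how each statement might decompose into the operations and control-flow commands of Figure 3.1; the exact operation selection would of course depend on the compiler.

    /* A hypothetical C fragment annotated with a sketch of its decomposition   */
    /* into the grammar of Figure 3.1.  The mapping in the comments is          */
    /* illustrative only, not the output of any particular compiler.            */
    int sum_positive(const int *a, int n) {
        int sum = 0;              /* rsum := binaryI(..., 0)                     */
        int i = 0;                /* ri   := binaryI(..., 0)                     */
        while (i < n) {           /* rP := binary(ri, rn); while rP do cL        */
            int v = a[i];         /* raddr := binary(ra, ri); rv := load(raddr)  */
            if (v > 0) {          /* rQ := binaryI(rv, 0); if rQ then cT else cF */
                sum = sum + v;    /* rsum := binary(rsum, rv)                    */
            }
            i = i + 1;            /* ri := binaryI(ri, 1)                        */
        }
        return sum;               /* return                                      */
    }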

Observable vs Trace Semantics: For instance, a distinction can be made between the observable semantics of a program and its trace semantics. As defined by Alan Lawrence [Law07]: “the observable semantics of a program describe the actions of the program in terms of a function from its inputs to its outputs (including side-effects2)”.

2 A function is said to have a side-effect if, in addition to returning a value, it also has an observable interaction with its calling function, the remainder of the program, or the outside world through modifications to the computational state, given by the set of locations L in Figure 3.1. L includes not only programmer-visible or programmer-managed locations, but also system locations such as those affected by I/O operations, system calls, and exceptions.

Unlike observable semantics, the trace semantics of a program instead deal with the precise sequence of steps with which a program will compute its results. In imperative languages like C, a program or function may receive explicit arguments as well as the current computational state (analogous to the set of locations L from Figure 3.1), as input. The function may produce an explicit output value, and/or modify state before returning: between its invocation and return, the function may allocate memory for, and update, local data-structures and/or temporary variables on the call-stack, as well as access and update global heap memory. In addition, the function may execute system calls to perform I/O, dynamically allocate or free heap memory, raise exceptions, or request other system services. Any operations in the function that modify state are considered side-effects if these modifications or their effects remain live – i.e. can affect the observable behavior of the system – after the function exits. In addition to the function’s output return value, these side-effects are also part of the observable semantics. For instance, I/O operations or any changes to heap-allocated memory would be considered side-effects, as such changes often remain visible after the function returns. Conversely, temporary variables allocated on the call-stack are not typically preserved on function exit, and thus updates to these would not be considered side-effects. For the program representation and implementation to be correct, only the observable semantics need to be preserved exactly by any compiler transformation. Thus, so long as the function outputs, including all of its side-effects, remain indistinguishable from a strictly sequential, control-flow ordered execution, the execution order for individual operations may be relaxed. For non side-effecting operations, the execution order of operations need only (and must necessarily) be constrained by the data-flow dependencies between operations, as opposed to their control-flow ordering. This allows for the possibility of accelerating computation by concurrently executing operations that are data-independent of each other. With reference to a function ‘cfunction’ written in our imperative language from Figure 3.1, the observable semantics are concerned with the correct values assigned (and order of updates) to a subset S of state L, i.e. S ⊆ (Reg ∪ Mem), that comprises the output of cfunction, as well as its side-effects. The set S would typically exclude most temporary registers as well as memory locations in the program call-stack, save for those used to pass input and output arguments. It would of course include all locations in heap memory that are updated and affect observable program behavior (i.e. remain live) after cfunction returns, as well as memory-mapped locations utilized to implement I/O operations or system-calls. In order to accelerate the execution of sequential imperative code, the goal is therefore to maximize concurrent execution of operations (i.e. exploit instruction-level parallelism). To ensure correctness, only the true data dependencies need to be preserved for non side-effecting operations, while both data dependencies and control-flow ordering must be preserved for operations with side-effects3. However, several issues arise when attempting to

3 In actuality, only the appearance of correct control-flow ordering need be preserved for side-effecting operations – the execution of such operations may occur out of control-flow order, so long as their effects become visible to the system in the correct order. This type of decoupling of operation execution from state-update is extensively utilized in aggressive superscalar processors that often execute memory operations speculatively and out-of-order, yet only commit them to the program state in-order. Work done by Lokhmotov et al on ‘Sieve C++’ also allows the programmer to expose more parallelism by manually introducing this decoupling in code by delaying side-effects [LMR07].

relax the control-flow ordering, and exploit data dependencies between non side-effecting operations from an imperative program representation like the one from Figure 3.1:

1. Name dependencies (known at compile-time),

2. Overcoming control-flow between basic-blocks, and

3. Memory disambiguation (dependencies only resolved at runtime).

The manner in which different architectures address each of these issues when im- plementing (C-like) imperative languages affects how much ILP can be exploited, hence performance improved, for a given section of imperative code.
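As a hypothetical illustration (distinct from the code of Figure 3.2, discussed in the next section), the short C function below exhibits all three issues in only a few lines: the temporary t is reused across iterations (name dependences), the sign test introduces a data-dependent branch (control dependence), and the stores through dst and hist cannot be proven independent of each other or of src at compile-time (memory disambiguation). Only the writes to memory that remain live after the function returns form part of its observable semantics.

    /* A hypothetical fragment exhibiting all three issues at once.             */
    /* 'hist' is assumed to point to a 256-entry histogram.                     */
    void scale_and_count(int *dst, const int *src, int n, int *hist) {
        int t;
        for (int i = 0; i < n; i++) {
            t = src[i];            /* 't' is reused: WAW/WAR name dependences    */
            if (t < 0)             /* data-dependent branch: control dependence  */
                t = -t;
            dst[i] = t;            /* may alias src[] or hist[]: not resolvable  */
                                   /* at compile-time without more information   */
            hist[t & 0xff]++;      /* side-effect that stays live after return   */
        }
    }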

3.2 Exposing ILP from Imperative Code

This section discusses how both modern out-of-order superscalar processors, as well as custom hardware (and by extension, energy-efficient spatial architectures) attempt to overcome these issues, and demonstrates the reasons for the performance advantage of the former on irregular, sequential, control-intensive code. The sample control-flow code shown in Figure 3.2a is used to illustrate each of the three issues mentioned above.

3.2.1 False or Name Dependencies

Name dependencies, such as write-after-write (WAW) and write-after-read (WAR), occur due to the need to reuse a finite set of locations during program implementation for storing the results from an arbitrarily larger (potentially infinite) number of dynamically executed operations at run-time, and unnecessarily constrain operation execution ordering. In Figure 3.2a, the pointer variable y is written to by operations 01 and 03, while it is read by operations 02, 04 and 05. Although there is no dataflow between operations 01 and 03, there is a write-after-write (WAW) hazard. Similarly, there is both a write-after-read (WAR) and a WAW hazard between operations 02 and 03, but again no dataflow. The only true data dependencies through the pointer variable y in this basic block are: 01 → 02, 03 → 04, and 03 → 05. A similar set of name dependencies exists for the variable u. Overcoming name dependencies often involves increasing the number of available temporary locations. For instance, a compiler-based approach to reducing name-dependencies in imperative code is to convert it to static-single-assignment (SSA) representation, where each variable is written to only once in the code [CFR+91]. As shown in Figure 3.2b, SSA form makes the dataflow between operations more explicit in the code by avoiding reuse of the same variable for different purposes. Note however that there still exist name dependencies between multiple iterations of the same loop – each of the variables (ui, vi, xi, yi, zi) is reused each time the loop body is re-entered. To reduce run-time name dependencies due to loops, one solution is to unroll each loop a number of times, introducing new SSA variable names for each unrolled iteration [Fin10, CM08]. Loop unrolling can help reduce the effect of loop-carried name dependencies, but often at the cost of increased code-size.
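The following sketch (again hypothetical, and unrelated to the code in Figure 3.2) shows the effect of SSA-style renaming at the source level: in the first function the temporary t is reused, creating WAW and WAR hazards even though the two computation chains share no dataflow; in the second, each definition receives a fresh name, leaving only the true dependences so that the two chains may execute concurrently.

    /* Hypothetical illustration of removing name dependences by renaming.      */
    int with_reuse(int a, int b, int c, int d) {
        int t;
        t = a + b;
        int x = t * 2;
        t = c + d;        /* WAW with the first definition of 't', WAR with its  */
                          /* first use -- but no true dataflow between the chains */
        int y = t * 3;
        return x + y;
    }

    int ssa_like(int a, int b, int c, int d) {
        int t1 = a + b;   /* each definition gets a fresh name                   */
        int x  = t1 * 2;
        int t2 = c + d;   /* only the true dependences t1->x and t2->y remain    */
        int y  = t2 * 3;
        return x + y;
    }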

(a) A sample imperative CFG. (b) Static-Single Assignment Form.

(c) SSA Form with if-conversion as well as predicate-promotion.

Figure 3.2: A sample imperative code CFG (with various optimized versions) to illustrate the issues that must be overcome in order to expose true data dependencies and exploit concurrency between operations.

High-level synthesis tools rely extensively on such static approaches to resolve name dependencies. However, increased code-size due to loop unrolling often translates into increased area and resource utilization in the generated custom hardware. Custom hardware also usually employs static execution scheduling, where the execution schedule for operations is determined at compile-time, and implemented at run-time by a centralized finite-state machine [CCA+11, VSG+10]. In the presence of complex control-flow, unrolling loops when implementing statically-scheduled hardware also affects hardware complexity and performance, as discussed in Section 3.2.2. Instead of relying on the compiler, out-of-order superscalar processors employ ‘register-renaming’ logic to overcome name dependencies dynamically. The small set of ISA registers used in the input binary is dynamically mapped to a much larger set of renaming registers. One advantage of this approach is that processors can unroll loops dynamically instead of at compile-time – with register renaming, multiple instances of the same instruction, say from a tightly wound loop, may be issued and executed independently, without causing any WAW or WAR hazards, and without the increase in code size that is caused by static unrolling. Register-renaming, combined with dynamic, out-of-order execution scheduling, allows superscalar processors to approximate the dynamic-dataflow model of execution [PHS85].
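For example (a hypothetical sketch), statically unrolling a loop and giving each unrolled iteration its own temporaries achieves at compile-time an effect similar to what register renaming achieves dynamically:

    /* Hypothetical sketch: two-way unrolling with per-iteration temporaries,   */
    /* mimicking at compile-time what register renaming does dynamically.       */
    int sum(const int *a, int n) {          /* n assumed even, for brevity      */
        int s0 = 0, s1 = 0;
        for (int i = 0; i < n; i += 2) {
            int t0 = a[i];                  /* iteration i uses t0/s0            */
            int t1 = a[i + 1];              /* iteration i+1 uses t1/s1: no WAW  */
            s0 += t0;                       /* the two chains can proceed        */
            s1 += t1;                       /* concurrently                      */
        }
        return s0 + s1;
    }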

3.2.2 Overcoming Control Flow

In Figure 3.2b, operation 13 is data-dependent upon operations 08 or 10, which only exhibit a loop-carried data dependency on themselves (and each other). Operations 09 and 11 are side-effects, but also have no data dependencies on preceding instructions within the while-loop. However, the execution of operations 13, 08 and 10 is constrained not only by their respective dataflow dependencies, but by their control-flow dependence on operation 07. The evaluation of 07 must be preceded by several long latency operations: modulus, and two memory loads (operations 04 and 02). If we constrain all operations to execute in control-flow order, despite having very simple data-dependencies, 08, 10, and 13 cannot execute until the control-dependency from 07 is resolved. A similar situation arises when executing operations from across multiple iterations of the while loop (e.g. through loop unrolling): the control-flow condition for the next iteration of the while loop (operation 00) is data dependent on operation 12 only, which again only has a loop carried dependence on itself. However, since 12 is again control-dependent on 07, the start of the next iteration cannot be determined until all the long-latency operations that 07 requires are completed. As discussed earlier, in addition to the true data dependencies, correct program execution requires that control-dependencies must also be respected, but only for side-effects (and operations with name data dependencies through memory). Several compile-time and runtime techniques are available for overcoming control-flow dependencies for non-side-effecting operations. Modern superscalar processors utilize aggressive branch prediction at runtime to relax the constraints imposed by control flow on ILP: branch predictors with very high accuracy (≥ 95%) enable effective speculation across multiple branches, providing a much larger region of code from which multiple independent instructions may be discovered and executed out of order. To handle cases of branch mis-speculation, processors make use of an in-order commit buffer (or re-order buffer) to selectively commit executed
instructions in program order. When a misprediction is detected, executed instructions from the mispredicted paths can simply be discarded from this buffer, preserving correct program state. This decoupling of execution of operations from the commitment of their effects to program state allows superscalar processors to improve ILP by speculatively executing both side-effecting and non-side effecting operations. Static approaches to overcoming control-flow rely on enabling speculative and predicated execution from forward branches through if-conversion and/or hyperblock formation [MLC+92]. These types of approaches are typically utilized by simpler, energy efficient processors such as those from ARM, or VLIW architectures. Compilers for such architectures often apply if-conversion together with predicate-promotion in order to convert forward (if-else) control-flow into dataflow. If-conversion involves replacing forward branches with a computed boolean guard value, that is then used to predicate the execution of control-dependent operations. Then, control-dependent operations that do not cause side-effects may be predicate-promoted, i.e. allowed to execute speculatively, without becoming data dependent on the guard condition. Figure 3.2c shows the CFG from Figure 3.2b after if-conversion and predicate-promotion. The four basic blocks forming the body of the loop have now been combined into a single large block. The if control-flow command in 07 has been replaced by the computation of a guard g (07a). The control-dependent side-effect operations (09 and 11) have their execution predicated by this guard condition such that both may execute, but only one will have its effects committed to program state, depending on the value of g. While operations 08, 10 and 12 were also control-dependent on 07, as they do not have side effects (nor any name dependencies), their execution does not need to be predicated by g, i.e. they have been predicate-promoted, and thus may execute speculatively. Finally, a select operation is needed to choose the correct new value of v0 (07b), essentially converting control-flow in this instance to dataflow. Predicate promotion of 08, 10, and 12 allows their execution in dataflow order without having to wait for g. As a result the next loop condition (00) can also be computed much earlier. As if-conversion is only applicable to acyclic code, conventional (or VLIW) processors that support predicated instructions often augment if-conversion with either static loop unrolling or dynamic branch prediction to speculatively execute across loop iterations as well. Unfortunately, standard approaches to generating custom hardware are unable to take full advantage of if-conversion – the static execution scheduling typical for most conventional custom hardware implies that such hardware can only be conservatively scheduled for the multiple possible control-flow paths through the code, leaving it unable to adapt to runtime variability that may occur due to data-dependent control-flow, predicated side-effects and memory operations, variable-latency operations, or unpredictable events such as cache misses. Furthermore, while speculative execution of non-side-effecting operations is possible through if-conversion, predication of side-effects is indistinguishable in a static schedule from execution in control-flow order – in both cases, operations must wait for the computation of the guard/control-flow condition.
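The C sketch below (hypothetical, and not the code of Figure 3.2) illustrates the transformation just described: the pure computations on both sides of the branch are predicate-promoted and execute unconditionally, a select (the ?: operator) converts the control-flow decision into dataflow, and only the side-effecting update remains guarded by the predicate.

    /* Original control flow: the subtraction on each path is control-dependent */
    /* on the comparison, and the counter update is a side-effect.              */
    int before(int a, int b, int *counter) {
        int v;
        if (a > b) {
            v = a - b;
            (*counter)++;        /* side-effect: must not be speculated          */
        } else {
            v = b - a;
        }
        return v * 2;
    }

    /* After if-conversion with predicate promotion:                            */
    int after(int a, int b, int *counter) {
        int g  = (a > b);        /* guard computed as ordinary data              */
        int v1 = a - b;          /* predicate-promoted: executes speculatively   */
        int v2 = b - a;          /* predicate-promoted: executes speculatively   */
        int v  = g ? v1 : v2;    /* select: control-flow converted to dataflow   */
        if (g)
            (*counter)++;        /* side-effect stays predicated by the guard    */
        return v * 2;
    }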
Also, while if-conversion allows speculative execution of some forward branches (without side-effects) in custom hardware, currently no mechanisms exist for safely speculating on backwards branches (loops). Superscalar processors are able to both speculate across backwards branches, as well as speculatively execute side-effecting operations. This is due to the fact that such processors
decouple the execution of an operation from any updates it can make to the program state L. This is done through the use of some form of ‘commit-buffer’, where the results of all operations that have been issued and executed are stored temporarily, without becoming part of the program state. Superscalar processors only commit the results of operations from this buffer to program state in the correct control-flow order. This decoupling also provides a convenient mis-speculation roll-back and recovery mechanism – operations, side-effecting or otherwise, that have been mis-speculated can simply have their results discarded from the commit buffer. Custom hardware, with its specialized and minimalist nature, does not have any such decoupling of operation execution from state update via a commit-buffer. Speculation via if-conversion on forward branches means that the results of speculatively executed operations will be discarded at some later synchronization step, such as the select operation 07b from Figure 3.2c. However, without a commit-buffer-like mis-speculation roll-back and recovery mechanism, speculation is not permitted for side-effecting operations, nor across backwards branches. High-level synthesis tools employ static unrolling, flattening and pipelining of loops in order to decrease the number of backwards branches that would be dynamically executed (i.e. by converting them into forward branches) [CM08]. But this can significantly increase the complexity of the control and multiplexing logic that implements the static schedule for the hardware datapath, resulting in very long combinational paths that can overwhelm any gains in IPC [GDGN04, KSP07]. As observed by Gupta et al in [GDGN04]: “Although loop unrolling and pipelining have been proposed previously for high-level synthesis, we found that when the resource utilization is already high – because of either high instruction level parallelism in the design or as a result of loop unrolling – the control and interconnect (multiplexing) costs of further loop unrolling or loop pipelining outweigh the gains achieved in performance. This is because frequently the longest combinational path in the circuit (the critical path length) increases so much that the input to output circuit delay becomes worse”. Overcoming the performance limitations due to explicit control-flow is the key issue that needs to be addressed for custom hardware to become performance-competitive with conventional processors on sequential code [BAG05].

3.2.3 Pointer Arithmetic and Memory Disambiguation

Pointer arithmetic is an established feature of many imperative languages (especially C and C++). Together with the use of data-dependent control-flow, the implication of such pointer arithmetic is that the precise sequence of locations that a pointer points-to at runtime cannot be determined at compile-time, except in the most trivial cases (i.e. static points-to analysis is undecidable [Lan92]). Pointer arithmetic can therefore often obfuscate the data dependence between producers and consumers, leaving the potential for a true or name dependence that can only be resolved at runtime, when the values of pointer variables are known. This is the ‘unknown address problem’ [PMHS85]. This effect puts an additional constraint on the ordering of memory operations, even those accessing locations that are not a part of the set S. All accesses to memory locations that (a) have their ‘address-taken’, or are accessed after pointer arithmetic, and (b) whose points-to set membership cannot be determined with sufficient precision, must also be constrained by the program control-flow in a similar manner to side-effecting operations,

until the precise address locations pointed-to can be discovered at runtime. This constraint applies even for temporary storage locations on the call-stack that are not typically part of the set S, and is necessary to preserve data dependencies through such locations.

In Figure 3.2b, while SSA form helps distinguish between pointer locations y0 and y1, it may not be possible through static analysis to determine whether ∗y0 and ∗y1 refer to the same value. Thus even though there may not be a data dependence through the location pointed to by y0, as long as independence cannot be determined with certainty, operation 02 must precede 03 (02 → 03). Similarly, if it cannot be determined that yi from successive iterations of the loop will always point to distinct, independent locations, all accesses through pointer yi across iterations must occur in the control-flow order specified by the CFG. Thus, although the set of locations pointed-to by yi may be temporary stack variables and thus not a part of S, any operations that access memory through yi must also be treated like side-effects (i.e. constrained by control-flow as well as dataflow).

Superscalar processors rely on out-of-order execution and dynamic memory disambiguation [SLH90] logic to expose and exploit concurrency between memory operations. Memory operations are issued to a Load/Store Queue (LSQ) within the processor, where the processor can dynamically determine execution order after the target addresses for each such operation are known: execute operations in order if there is a name dependency between them, or allow out-of-order execution otherwise.

High-level synthesis tools for generating custom hardware instead typically rely on static alias-analysis to disambiguate memory addresses at compile-time and allow concurrent execution of memory access operations. The inability to take advantage of precise run-time information limits the degree of concurrency that can be exposed and exploited compared to the dynamic disambiguation mechanisms in superscalar processors. To address this, work by Budiu improved upon existing high-level synthesis tools by incorporating an LSQ into their generated custom dataflow hardware, enabling limited dynamic memory disambiguation and out-of-order memory access execution in addition to the static memory disambiguation already implemented by their compiler [Bud03]. That work however demonstrated that despite incorporating support for dynamic memory disambiguation, control-flow (Section 3.2.2) and loop-carried name dependencies (Section 3.2.1) remain the primary issues hindering the performance of spatial hardware on irregular sequential code [BAG05]. Given the existing work on incorporating dynamic memory disambiguation into spatial hardware, this thesis leaves the study of efficient dynamic memory disambiguation for future work, and instead focuses on overcoming control-flow and loop-carried name dependences in spatial hardware.
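A minimal sketch of the unknown-address problem (the function and pointer names below are illustrative, not those of Figure 3.2): unless static analysis can prove that p and q never refer to the same location, the store through p must remain ordered before the load through q, exactly as operation 02 must precede 03 above.

    /* If p and q may alias, the store and the load are linked by a possible
       true dependence that is only resolved at runtime, once both addresses
       are known. */
    int maybe_dependent(int *p, int *q)
    {
        *p = 42;          /* store through p                              */
        return *q;        /* load through q: must wait for the store      */
                          /* unless p != q can be proven at compile-time  */
    }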

3.3 The Superscalar Performance Advantage

When it comes to control-intensive code, custom hardware is often limited by its total reliance on programmer and compiler effort for exposing and exploiting concurrency:

• Static execution scheduling reduces the control-flow and dataflow acceleration achievable via if-conversion on forward branches.

• Custom hardware does not support dynamic branch prediction, and must instead rely on loop unrolling to expose ILP from loops.

• Custom hardware must often rely on static alias-analysis for memory disambiguation.

Conversely, modern out-of-order superscalar processors rely extensively on dynamic effort in order to resolve name dependencies, overcome control dependencies, and disambiguate memory addresses:

• Branch prediction, together with misprediction recovery through the in-order commit buffer (re-order buffer), enables aggressive control-flow speculation across both forwards and backwards branches.

• Dynamic, out-of-order execution scheduling, combined with register-renaming, allows superscalar processors to approximate the dynamic-dataflow model of execution [PHS85] (a small example follows this list).

• Memory operations can frequently be executed concurrently (from within a finite instruction window), thanks to dynamic memory disambiguation.
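To illustrate the second point, consider the following hand-written fragment (not from the benchmark suite). The scalar t carries no value between iterations, yet reusing the same name every iteration creates a loop-carried name (write-after-read/write-after-write) dependence at the register level; register renaming gives each iteration its own physical copy of t, allowing consecutive iterations to overlap, while sum remains the only true loop-carried dependence.

    int sum_scaled(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int t = a[i] * 3;   /* 't' is redefined every iteration: only a   */
            sum += t;           /* name dependence, removed by renaming each  */
        }                       /* iteration onto a fresh physical register;  */
        return sum;             /* 'sum' is the true loop-carried dependence  */
    }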

Dynamic execution scheduling helps in dealing with unpredictable behavior at runtime such as variable-latency operations, cache-misses, or branch mispredicts. Instructions are allowed to execute as soon as their input operands, as well as the appropriate execution resources, become available. For instance, in the event of a cache miss, only instructions that are dependent on the stalled instruction would be delayed, while independent instructions can continue to execute.

The cost of all of this dynamic effort in silicon is incurred in both area and power dissipation. Figure 3.3 shows the power dissipation figures for a simulated Intel Nehalem Core i7 processor core running various integer benchmarks. As can be seen, a large proportion of power is dissipated in the register-renaming logic (OOO Logic), and dynamic memory disambiguation at the LSQ (OOO Mem). Instruction fetch and the instruction cache (IF and Icache) dissipate the next largest share of power, followed by the bypass network (Bypass) that is used to accelerate true dependencies by forwarding results between pipeline stages. In fact, only a small fraction of the total power is dissipated by the integer execution units (Exec Int): about 0.3W, which forms only 3-8% of the total power dissipated per benchmark.

As mentioned in Section 3.2.3, this thesis will not focus on addressing the memory disambiguation problem here, as there already exists abundant literature on (a) improving static pointer analysis and auto-parallelization, (b) augmenting static analysis with profiling information [TWFO09], as well as (c) incorporating dynamic disambiguation logic into spatial hardware [SSM+07, Bud03]. Instead, I focus on the remaining two advantages of out-of-order superscalar processors identified above: support for aggressive control-flow speculation, as well as the ability to dynamically overcome loop-carried name dependences with register-renaming (thereby approximating restricted dynamic dataflow execution). The objective is to find ways of matching these advantages in spatial hardware, without incurring the high energy costs associated with the dynamic logic utilized by conventional processors.

The following case study illustrates the performance advantage of superscalar processors due to these two features on a region of code that performs poorly when implemented in conventional, statically scheduled custom hardware.

Figure 3.3: Power dissipation breakdown for a single core of the Intel Nehalem Core i7 architecture. Generated using the Sniper interval simulator and the McPAT power estimation tool (the following settings were used for McPAT: VDD = 1.2V; DVFS enabled; 45nm process).

3.3.1 Case Study 1: Outer-loop Pipelining

Consider, for instance, the internal int transpose function from the epic Mediabench benchmark given in Figure 3.4. Budiu et al identified this as a region of code that performs poorly when implemented as custom hardware [BAG05]. At runtime, the inner do-while loop rarely executes more than once and never more than twice each time, while the outer loop executes for a large number of iterations. The branch prediction logic in conventional processors adapts to this execution pattern and is effectively able to execute multiple copies of the outer loop, i.e. performing outer-loop pipelining, thereby hiding much of the latency of the % and * operations in the inner loop.

Conventional high-level synthesis tools would implement the control-data flow graph (CDFG) of this code, shown in Figure 3.5a, as statically-scheduled custom hardware. Without branch prediction, each basic block (grey boxes) will be executed one at a time, in sequence. The exit predicates for the active block must be computed before control can flow to the next block.

One may attempt to alleviate this strict control-flow ordering somewhat by statically unrolling the loops. Unrolling the innermost loop will not yield significant benefit as it rarely executes more than once. Unrolling the outer loop would replicate the blocks that comprise it, including the inner loop and the if-block. Figure 3.5b shows the CDFG from Figure 3.5a unrolled twice. As can be seen, the control-flow within the outer-loop body is replicated as-is, and must still be accelerated either dynamically through branch prediction, or statically through if-conversion and/or hyperblock formation.

/*======
  In-place (integer) matrix transpose algorithm.
  Handles non-square matrices, too!
  ======*/
void internal_int_transpose(int* mat, int rows, int cols, int modulus)
{
    int swap_pos, curr_pos, swap_val;
    for (curr_pos=1; curr_pos

Figure 3.4: The ‘internal int transpose’ function from the ‘epic’ Mediabench benchmark.

Due to the presence of the inner-loop cycle, the latter solution is not possible. Due to the lack of mis-speculation recovery mechanisms for backwards branches and memory operations, neither the inner loops nor the memory operations in the if-block can be executed speculatively. Thus, unrolling the outer loop provides no benefit in this case, since the predicate computations of each basic block in the new sequence would still be on the critical path.

(a) internal int transpose CDFG, from Figure 3.4. (b) internal int transpose CDFG, from Figure 3.4, with the outer loop statically unrolled twice.

Figure 3.5: The CDFGs for ‘internal int transpose’, from Figure 3.4. The blue operations belong to the inner do-while loop, the red operations belong to the subsequent if-block, while the yellow operations are the outer-loop iterator increment and condition-checking operations.

for (i = 0; i < 100; i++)
    if (A[i] > 0) foo();
bar();

Figure 3.6: Example C Code for illustration purposes

3.4 Limitations of Superscalar Performance

Despite these key advantages of out-of-order superscalar processors, there exists a fundamental limit to how much ILP can be dynamically extracted from a sequential instruction stream by such a processor. A limit study by Wall [Wal91] considered a hypothetical 64-issue out-of-order superscalar processor with a 2000-instruction issue window, and found that even assuming perfect memory disambiguation and register renaming, IPC was limited to between 4 and 8 for non-numeric sequential code, due to imperfect control-flow prediction.

A subsequent study by Lam and Wilson confirmed control-flow as the primary cause of the ILP Wall limiting sequential code performance [LW92]. They also noted that uniprocessor performance is limited not just due to imperfect branch prediction, but also because uniprocessors are inherently limited by having to maintain a single program counter, effectively exploiting ILP from only a single flow of control. By enabling the identification of multiple independent regions of code through control dependence analysis, and then allowing such regions to execute independently (i.e. exploiting multiple flows of control), they observed that ILP could again be increased by as much as an order of magnitude. This was corroborated by a more recent limit study by Mak and Mycroft [MM09].

3.4.1 Case Study 2: Multiple Flows of Control

Consider the sample code in Figure 3.6. A conventional processor would be capable of dynamically unrolling the for loop and executing instructions from multiple iterations of the loop. Aggressive branch prediction will also accelerate the transfer of control from the loop into the foo() function, whenever the if condition is predicted-taken. Once control-flow switches to foo(), the processor will issue instructions from within foo() until control returns back to the context of the loop. Assuming foo() contains non-trivial code, the processor will not be able to simultaneously issue independent instructions from subsequent iterations of the for-loop, as it is constrained to executing from a single flow-of-control (i.e. maintaining a single program-counter). Similarly, even though the contents of the for-loop and the bar() function are control-independent, instructions from bar() can only be issued once control exits the loop.

Lam and Wilson note that as long as the data-dependencies and side-effect ordering between different calls to foo() are respected, instructions from multiple copies of foo() could be issued in parallel. Similarly, instructions from bar() could also be issued concurrently with the execution of instructions from multiple copies of foo() and the dynamically unrolled loop iterations.

Exploiting multiple flows of control by extending a conventional out-of-order processor would be impractical, as a conventional processor cannot keep track of multiple flows of control and their interdependencies. In order to automatically exploit multiple flows of control while maintaining a single program-counter, an impractically large instruction

issue window would be needed that could contain all instructions from multiple copies of foo(). Furthermore, perfect branch and jump prediction (as well as possibly if-conversion and inlining of foo()) would be necessary to decide precisely how much dynamic unrolling of the for-loop is needed, so as to determine when to start fetching instructions from bar(). Without perfect prediction, any misprediction would lead to all non-committed operations being discarded from the commit-buffer, unless some form of control-dependence analysis can be incorporated in the processor to determine the control-independent instructions that need not be discarded [LW92].

Various approaches have been explored to overcome the ILP Wall by exploiting multiple flows of control. These can be classified according to the level of abstraction they focus on:

• Programming Model: Conventional multicore systems rely on programmer-specified concurrency – often a shared-memory threaded parallel programming model is utilized by the programmer to explicitly partition code into multiple threads (hence multiple flows of control). However, as noted in Section 2.2, due to the difficulties with this programming model, performance gains on non-numeric client-side applications remain low [BDMF10].

• Compiler Level: Auto-parallelization tools are primarily focused on extracting data-level parallelism. However, more recent work has studied parallelizing more irregular code [CJH+12], achieving speedups of 2.25× on a 6-core Intel Core i7-980X processor. The Wavescalar project is another example where the compiler IR exposes dataflow parallelism from multiple flows of control by abandoning the notion of a program counter (Section 2.4.3 provides more details).

• Language/system Runtime: Considerable research has been undertaken into the area of Thread-Level Speculation (TLS), or Speculative Multithreading (SpMT), where sequential code is speculatively partitioned into concurrent threads by the system runtime [YRHBL13]. Recently, such systems have also demonstrated modest speedups on regular sequential applications with easily discovered parallelism.

• Architecture Level: The Multiscalar architecture was proposed to expose and exploit multiple flows of control from a sequential instruction stream by partitioning the stream into tasks, then executing each task on a distinct processing element. Architectural features were utilized to manage data and control dependencies between tasks executing on different PEs. Multiscalar exhibited modest speedups over a conventional architecture on sequential code [SBV95].

These approaches have largely focused on improving performance, and have not considered energy-efficiency as an objective, or even a constraint.

3.5 Improving Sequential Performance for Spatial Hardware

Various research projects have attempted to overcome these sequential performance limitations by devising spatial architectures that perform aggressive control flow speculation,

dynamic execution scheduling, or both. TRIPS employs both dynamic and static techniques to expose ILP from control-flow intensive code [SNL+04], while DySER offloads all control-flow speculation duties to its host processor [GHS11]. Both projects utilize the static-dataflow model for dynamic execution scheduling. On the other hand, Wavescalar discards the notion of the program-counter, exploits multiple flows of control, and utilizes the dynamic-dataflow model for execution scheduling. Unfortunately, all three approaches also significantly compromise the orders-of-magnitude energy-efficiency advantage that custom/reconfigurable spatial hardware can provide.

In order to overcome the sequential performance limitations of conventional custom hardware, without compromising the energy efficiency potential of spatial hardware, I propose two key changes during high-level synthesis:

• Instead of using a statically scheduled instruction execution model for custom hardware, the dynamically scheduled static-dataflow execution model should be used, similar to the Phoenix/CASH approach mentioned in Section 2.4.3, and described in [BVCG04, Bud03]. This approach enables dynamic scheduling without compromising energy efficiency in spatial hardware [MCC+06].

• A new compiler IR is needed to replace the CDFG-based IRs that are traditionally used for hardware synthesis. This new IR should be based on the Value State Dependence Graph (VSDG) [Law07, JM03], as it has no explicit notion of a sequential flow-of-control between basic blocks. Instead the VSDG only represents the necessary value and state dependencies in the program. Section 3.6 defines the VSDG, and its advantages.

3.5.1 Why the Static Dataflow Execution Model?

Unlike conventional statically-scheduled custom hardware, the static-dataflow execution model for spatial computation proposed by Budiu et al [BVCG04] enables the execution of each individual operation to be scheduled dynamically, based on the status of its input and output edges. This allows the spatial hardware to be more tolerant of variable or unpredictable latencies at runtime.

Static-dataflow provides an additional advantage over statically-scheduled spatial hardware: since each operation is triggered based only on the state of its input and output edges, the control-path in hardware can be fully decentralised – there is no need for a centralized finite-state machine (FSM) to implement a compiler-defined schedule (Figure 3.7a). Instead, each processing element has its own simple, local FSM (Figure 3.7b). Due to this decentralization, control and multiplexing costs do not scale as the size of the machine being implemented grows, thus improving hardware scalability [Bud03]. This becomes particularly useful when applying loop optimizations like unrolling, flattening and software pipelining, as centralized control can often increase the longest combinational path in the circuit and limit performance gains [GDGN04, KSP07].

Unlike the dynamic dataflow model of the Wavescalar architecture (also approximated by out-of-order processors), each edge in a static-dataflow graph may hold only one value at any time, thereby avoiding the need for complex operand-tag-matching, instruction wake-up, and issue logic [SSM+07]. Nevertheless, the dynamic dataflow model is fundamentally more powerful than static dataflow, thanks to the former's ability to eliminate false dependencies in tightly wound loops by supporting multiple values per dataflow edge [BAG05].

(a) Statically-scheduled custom hardware with the schedule implemented by a centralized FSM. (b) Dataflow spatial hardware supporting dynamic scheduling. Each hardware operation has its own FSM.

Figure 3.7: Static execution scheduling vs dynamic execution scheduling in hardware.

Ideally, to match the performance of conventional superscalar processors, custom hardware would need to be able to approximate the dynamic-dataflow model, but must do so without overly compromising its energy-efficiency. Switching to static-dataflow execution for custom hardware alone goes some way towards improving sequential code performance beyond statically-scheduled hardware, as demonstrated by the Phoenix/CASH work [BVCG04]. However, performance on sequential code is still poor compared to an out-of-order processor, primarily due to the control-flow limitations imposed by their CDFG-based compiler IR [BAG05].
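As a rough software analogy for this local, decentralised control (a sketch of my own, not the hardware model developed later in this thesis), the firing rule of a static-dataflow node can be summarized as: fire only when every input edge holds a token and the output edge is free, so that no edge ever carries more than one value at a time.

    /* edge_t models a single static-dataflow edge, which can hold at most
       one token at any time. */
    typedef struct { int full; int value; } edge_t;

    /* Local firing rule for a two-input adder node: both inputs must be
       present and the output slot empty. No central FSM or global schedule
       is consulted. */
    int add_node_try_fire(edge_t *a, edge_t *b, edge_t *out)
    {
        if (!a->full || !b->full || out->full)
            return 0;                      /* not ready: wait              */
        out->value = a->value + b->value;  /* produce one output token     */
        out->full  = 1;
        a->full = b->full = 0;             /* consume the input tokens     */
        return 1;
    }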

3.5.2 Why a VSDG-based compiler IR?

In order to overcome the limitations of sequential control-flow on ILP in spatial architectures, I propose the utilization of a new compiler IR that is based on the Value State Dependence Graph, instead of the traditionally used Control Flow Graph. The main reason for emphasizing a change at the compiler IR level instead of relying on dynamic, microarchitectural control-flow acceleration like TRIPS, DySER and Multiscalar is to maximize energy-efficiency – as much of the effort in exposing ILP as possible ought to be done at compile time, in order to minimize the effort and energy expended at runtime.

The Value State Dependence Graph is a compiler IR, originally introduced by Johnson and Mycroft [JM03, Joh04], and then updated by Lawrence [Law07]. Unlike the Control-Flow Graph, which represents flow of control explicitly and dataflow implicitly, the VSDG represents the data and side-effect ordering dependencies between operations explicitly, without any explicit representation of a flow of control. Section 3.6 describes the VSDG in more detail, and discusses how its structure can expose more ILP from control-flow intensive code when implemented as static-dataflow spatial hardware.

3.6 Overcoming Control-flow with the VSDG

3.6.1 Defining the VSDG

A detailed definition of the VSDG has been developed by Johnson [Joh04], and revised by Lawrence [Law07]. This section briefly summarizes the definition developed by Lawrence, albeit slightly modified.

Definition 3.1. Value State Dependence Graph (VSDG): A VSDG is a labelled, directed, acyclic, hierarchical Petri-net⁴ G = (P, T, E), where:

• Transitions T represent operations. Operations may be of three types:

  – Value Transitions represent the basic, non-side-effecting unary and binary arithmetic and logic operations.

  – MUX Transitions are used to implement speculative execution by applying if-conversion – a MUX transition will select and output one of two input values based on a third, boolean predicate input.

  – State Transitions represent those operations that either access or modify system state and thus must be constrained to execute in control-flow order to preserve the observable semantics of the program. State transitions are subdivided into a further two types: Memory-access transitions (i.e. load and store), and Compound transitions, representing nested VSDG subgraphs.

• Places P represent the results of operations. Each place may be of either Value type, or State type. Value places hold data results produced by a Value operation, while State places hold tokens produced by State operations representing an access to, or update of, system state. Aside from the places representing intermediate values within the graph, each VSDG would have a set of input argument places, and a set of output result places:

  – Arguments Pin ⊂ P are places holding values (or state) being passed into a VSDG.

  – Results Pout ⊂ P are the places holding the output values (or state) generated by a VSDG.

• Edges E ⊆ (P × T) ∪ (T × P) represent dependencies on, and production of results by, operations.

Unlike the explicitly sequential control-flow ordering specified by imperative languages (and their Control Flow Graphs), the VSDG represents only the true data-dependencies between operations/transitions, as well as the memory-access and side-effect ordering dependencies that must be imposed to match the observable semantics of the code being represented. There is no explicit flow of control from one operation to the next, or between basic blocks. Instead of organizing code into basic blocks, forward branches are represented using MUX-nodes, that implement a selection between two sets of incoming value and/or state edges based on a third predicate value. The VSDG is a hierarchical, acyclic data-dependence graph: it represents each function call (and loop) in the program as a compound operation containing a nested subgraph.

Figure 3.8b shows the if condition from the CFG in Figure 3.2b, expressed as a VSDG. All of the operations in the VSDG retain their necessary dataflow or value edges from the CDFG (solid black arrows). State operations such as loads, stores, function calls and loop subgraphs, which may access or modify state upon execution, have an additional type of edge: an incoming and outgoing state edge (red dashed arrows) that pass a token between state-sensitive operations in order to enforce sequentialization of side effects in program order.

⁴ For now, the use of a Petri-net is strictly a matter of representation, and does not imply Petri-net style execution semantics. Lawrence develops graph-reduction based lazy execution semantics for this representation in [Law07], while this dissertation develops eager, static-dataflow execution semantics for a modified variant of the VSDG in Chapter 4.

(a) The code for the if-condition from Figure 3.2b. (b) The equivalent VSDG representation.

Figure 3.8: An example VSDG representation of a forward branch. Solid black lines indicate data-flow dependencies, red dotted lines indicate state/side-effect ordering dependencies, and the green dotted lines indicate predicate/guard computation.

As can be seen, the explicit data (and state ordering) dependencies represented by the VSDG allow for an implicit if-conversion of forward branches (described previously in Figure 3.2c). Double borders are used to represent compound operations – transitions containing their own nested VSDG subgraphs. From the perspective of its parent graph, each compound operation appears like any of the other Petri-net transitions: it consumes tokens from its input places (Pin), and produces output tokens (Pout) atomically – i.e. each compound transition appears to its parent graph to occur instantaneously⁵. The isolation implied by atomicity is guaranteed by the fact that all side-effects and state accesses are explicitly ordered by the state-edge throughout all levels of the VSDG graph hierarchy.

The VSDG is also a directed acyclic graph, as it contains no backwards edges. The support for hierarchy enables loops to be represented as acyclic, infinite nested graphs using tail-recursion [Law07]. Figure 3.9b shows the representation of the for-loop shown in Figure 3.9a. Similarly, general recursive functions may also be represented using the hierarchical nature of the VSDG. Figure 3.10b shows the VSDG representation for the recursive Fibonacci function given in Figure 3.10a.
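For readers who prefer code to diagrams, one possible in-memory encoding of Definition 3.1 is sketched below. This is purely illustrative – the type and field names are my own, not those of the toolchain described in Chapter 5.

    typedef enum { PLACE_VALUE, PLACE_STATE } place_kind;
    typedef enum { T_VALUE, T_MUX, T_MEMORY, T_COMPOUND } transition_kind;

    typedef struct place {
        place_kind kind;                  /* Value place or State place        */
    } place;

    struct vsdg;                          /* forward declaration               */

    typedef struct transition {
        transition_kind kind;             /* Value, MUX, Memory or Compound    */
        place **inputs;   int n_inputs;   /* edges P -> T: dependencies        */
        place **outputs;  int n_outputs;  /* edges T -> P: produced results    */
        struct vsdg *subgraph;            /* nested VSDG, Compound nodes only  */
    } transition;

    typedef struct vsdg {
        place **args;      int n_args;      /* P_in: argument places           */
        place **results;   int n_results;   /* P_out: result places            */
        transition **ops;  int n_ops;       /* all transitions in this graph   */
    } vsdg;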

Lazy vs. Eager Execution Semantics

Lawrence and Johnson both recommended lazy execution (pull) semantics for the VSDG representation [Law07, Joh04]. Under lazy execution, only the transitions that are required to produce a result would be allowed to fire. For instance, for the VSDG in Figure 3.8b, the external environment would pull on the output places of this subgraph (Pout = {z1, STATE OUT, x1}). This would trigger the computation of the x0 increment, the z1 = p + v0, and the MUX transitions. At the MUX transition, a pull request would trigger the computation of the predicate (u > 256), after which, only the appropriate set

⁵ Though in reality, any transition, compound or otherwise, may take a finite, possibly even variable amount of time to execute.

(a) The for-loop from Figure 3.6.

(b) The equivalent VSDG representation. Note that the loop only updates state, and does not produce a value output.

Figure 3.9: An example VSDG representation of a loop as a tail-recursive function. The outer double-border bounding box represents the definition of the forloop(i, A, STATE) function that is tail-recursively called within itself. Note the introduction of the STATE argument to the function definition.

of remaining predecessors would be triggered: if the predicate holds, only the v0 = v − x0 and the printf("branch true") transitions would be pulled and hence executed, while their false counterparts would not. Lazy evaluation constrains the execution of control-flow dependent side-effects and state-updates in this manner, ensuring that the observable semantics of the original program are preserved, even in the absence of an explicit flow of control in this program representation.

However, as the intent of this work is to improve performance, in particular by overcoming the constraints of control-flow by enabling speculative execution, we must modify the VSDG structure slightly to make it suitable for eager/dataflow evaluation semantics, while allowing for predication of State operations, and predicate promotion of Value operations. An additional reason for preferring eager evaluation is the relative simplicity of implementing dataflow graphs in hardware, compared to the complexity and overheads of implementing a graph reduction machine to support lazy evaluation [PJ87].

In order to support eager/dataflow execution semantics, while also permitting speculative execution of control-dependent operations, we must modify the VSDG slightly: in addition to the state input and output edges associated with each state operation, a predicate input edge must also be incorporated. Each basic block in the original program CFG will have an equivalent predicate expression implemented as additional boolean operations in the VSDG. Correct control over the execution of state operations is thus implemented by predicating their execution, as described previously in Section 3.2.2.

Figure 3.11 shows the VSDG from Figure 3.8b with the additional predicate edges incorporated. Note that with eager evaluation, all operations would be ready to execute as soon as their inputs are available. However, the state operations printf("branch true") and printf("branch false") have an additional predicate input, which ensures that only one of them will access/update state, depending on the value of the guard condition (u > 256).

(a) The code for a recursive Fibonacci function.

(b) The equivalent VSDG representation. Note that this function produces both a state and a value output.

Figure 3.10: The VSDG representation of the recursive Fibonacci function.

As in Figure 3.2c, the value operations 08 (vT = v − x0), and 10 (vT = v + x0), have been predicate-promoted, as they do not have any side-effects. To distinguish it from the VSDG as described by Lawrence and Johnson, this predicated, eager-evaluation version of the VSDG will from now on be referred to as a Value State Flow Graph, or VSFG.

Similarly, Figures 3.12 and 3.13 show the VSFG equivalents of the VSDGs shown in Figures 3.9b and 3.10b, respectively. Again, thanks to predication of state operations, note that the next iteration of the for-loop (i.e. the recursive call to forloop(i, A, STATE, pred)) will only be triggered if the next loop predicate (i < 100) is true. The same is true for the nested subgraph implementing the function foo(STATE, pred).
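To make the tail-recursive, predicated reading of Figure 3.12 concrete, the loop of Figure 3.6 can be written in source form roughly as follows. This is a hand-written sketch: the real transformation operates on the IR, the STATE edge is left implicit, and foo()/bar() are assumed to be defined elsewhere as in Figure 3.6.

    void foo(void);
    void bar(void);

    /* The for-loop of Figure 3.6 as a tail-recursive function. 'pred' plays
       the role of the loop predicate (i < 100), and the recursive call
       corresponds to the predicated 'next iteration' subgraph. */
    void forloop(int i, int *A, int pred)
    {
        if (!pred) return;                 /* predicate false: no state updates */
        if (A[i] > 0)
            foo();                         /* side-effecting call, guarded      */
        forloop(i + 1, A, (i + 1) < 100);  /* tail-recursive next iteration     */
    }

    void example(int *A)
    {
        forloop(0, A, 0 < 100);            /* initial call with its predicate   */
        bar();                             /* control-independent of the loop   */
    }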

Generating Predicate Expressions for Eager Evaluation

The process of converting imperative code into an equivalent VSFG will be described in greater detail in Chapter 5. Here, I briefly describe how predicate expressions are generated for each basic block from the CFG of the input code. Part of the process for converting to the VSFG involves extracting all loops in the code into their own functions, and then transforming these into tail-recursive functions. This transformation means that each of the constituent functions in the application code will have acyclic control-flow graphs. Figures 3.14b and 3.14a show the acyclic control-flow graphs for a loop and a non-loop function, respectively, from the AES benchmark that forms part of the CHStone suite [HTH+08]. For the function CFG in Figure 3.14b, the basic block labelled tailRecurseBlock contains the tail-recursive call of the function to itself.

Given an acyclic control-flow graph, predicate expressions for each of the basic blocks in the graph may be computed relative to the first, or entry block. Recall that all State

Figure 3.11: The equivalent VSFG representation to the VSDG from Figure 3.8b. Note that all State operations are now predicated, while Value operations are predicate-promoted.

Figure 3.12: The equivalent VSFG representation to the VSDG from Figure 3.9b.

Figure 3.13: The equivalent VSFG representation to the VSDG from Figure 3.10b.

operations are to be controlled by predicates, including nested subgraphs representing function calls. Thus the predicate of each entry block itself would be the input predicate from the calling function, written as inPred. Let us denote the predicate associated with a block b as p(b). Thus, p(b) would drive the predicate inputs of all state operations in b – if p(b) holds, then all state operations within b will execute non-speculatively, as soon as their remaining value and state-edge inputs are available. Similarly, a control-flow edge between two blocks b1 and b2 will also have an associated predicate, denoted by p(b1, b2), essentially representing the condition evaluation for a conditional branch at the end of block b1. If p(b1, b2) holds, this implies that control will pass from block b1 to block b2, assuming control-flow has reached block b1 (i.e. if p(b1) also holds). As such condition computations are typically value operations, and may execute speculatively in the VSFG, it is essential that the predicate for block b2 be generated using both the block predicate for b1, and the edge predicate for the edge from b1 to b2, as follows: p(b2) = p(b1) . p(b1, b2).

Using this notation, we can compute the predicate expressions for each of the basic blocks in an acyclic CFG. For the CFG from Figure 3.14a, the predicate expressions for each of the basic blocks can be computed as shown in Equations 3.1 and 3.2. Each basic block in the original CFG will therefore have an equivalent predicate expression implemented as additional boolean operations in the VSFG.

Since we have:

p(entry, b) = ¬p(entry, a)
p(a, d) = ¬p(a, c)
p(b, f) = ¬p(b, e)
p(c, g) = ¬p(c, f)
p(d, h) = ¬p(d, f)          (3.1)

(a) Acyclic CFG for a non-loop function.

(b) Acyclic CFG for a loop, extracted into its own, tail-recursive function.

Figure 3.14: Sample loop and non-loop CFGs from the AES benchmark in the CHStone Benchmark Suite [HTH+08]. These CFGs were generated using an LLVM pass, after preprocessing the benchmark code to extract loops into tail-recursive functions.

Therefore:

p(entry) = inPred
p(a) = p(entry) . p(entry, a)
p(b) = p(entry) . ¬p(entry, a)
p(c) = p(a) . p(a, c)
p(d) = p(a) . ¬p(a, c)
p(e) = p(b) . p(b, e)
p(f) = p(b) . ¬p(b, e) + p(c) . p(c, f) + p(d) . p(d, f)
p(g) = p(c) . ¬p(c, f)
p(h) = p(d) . ¬p(d, f)
p(return) = p(e) + p(f) + p(g) + p(h)          (3.2)
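The computation behind Equations 3.1 and 3.2 is mechanical: in a topological traversal of the acyclic CFG, each block's predicate is the OR, over its incoming edges, of (predecessor predicate AND edge predicate). The sketch below is my own illustration and evaluates the predicates as concrete booleans for one execution; the actual compiler pass (Chapter 5) instead emits the same expressions as boolean operations in the VSFG.

    #define NBLOCKS 10   /* blocks numbered 0..NBLOCKS-1 in topological order */

    /* edge[a][b] is non-zero if the CFG has an edge a -> b;
       edge_pred[a][b] is the evaluated branch condition p(a, b). */
    void block_predicates(const int edge[NBLOCKS][NBLOCKS],
                          const int edge_pred[NBLOCKS][NBLOCKS],
                          int in_pred, int p[NBLOCKS])
    {
        p[0] = in_pred;                      /* p(entry) = inPred              */
        for (int b = 1; b < NBLOCKS; b++) {
            p[b] = 0;
            for (int a = 0; a < b; a++)      /* predecessors precede b         */
                if (edge[a][b])
                    p[b] |= p[a] && edge_pred[a][b];   /* p(a) . p(a, b)       */
        }
    }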

Enabling Control-Dependence Analysis

As boolean expressions, predicates for each basic block may be minimized. Consider, for instance, the predicate p(return) for the return block in Equations 3.2. Expanding the constituent terms, we get:

p(return) = p(e) + p(f) + p(g) + p(h)
          = p(b) . p(b, e) + p(b) . ¬p(b, e) + p(c) . p(c, f) + p(c) . ¬p(c, f) + p(d) . p(d, f) + p(d) . ¬p(d, f)
          = p(b) + p(c) + p(d)          (3.3)
          = p(entry) . ¬p(entry, a) + p(a) . p(a, c) + p(a) . ¬p(a, c)
          = p(entry) . ¬p(entry, a) + p(entry) . p(entry, a)
          = p(entry) = inPred

This implies that the state operations in the return block from Figure 3.14a may execute non-speculatively as soon as the input predicate to the function subgraph, inPred, evaluates as ‘true’. Applying such logic minimization to the predicate expressions for each basic block enables us to incorporate control-dependence analysis into the VSFG. All predicate expressions within a function may be evaluated in dataflow order, and be available concurrently in the eagerly-evaluated VSFG. This allows even the predicated state-operations to non-speculatively execute earlier than in a CFG-based spatial implementation, where control must flow through the CFG, enabling the non-speculative execution of state operations one basic block at a time. The minimized boolean predicate

expressions for the loop CFG of Figure 3.14b are given by Equations 3.4.

p(entry) = inPred
p(i) = inPred . p(entry, i)
p(j) = inPred . ¬p(entry, i)
p(k) = inPred
p(l) = inPred . p(k, l)
p(m) = inPred . ¬p(k, l)
p(n) = inPred
p(o) = inPred . p(n, o)
p(q) = inPred . ¬p(n, o)
p(r) = inPred
p(s) = inPred . p(r, s)
p(t) = inPred . ¬p(r, s)
p(u) = inPred
p(loopExitBlock) = inPred . p(u, loopExitBlock)
p(tailRecurseBlock) = inPred . ¬p(u, loopExitBlock)
p(return) = inPred          (3.4)

From Equations 3.4, we see that state operations in blocks entry, k, n, r, u, and return may execute non-speculatively, as soon as the inPred predicate input is available. Execution of these state-operations would thus be constrained primarily by the sequential state-edge traversing each such operation in the VSFG. Furthermore, the next loop iteration (i.e. the tail-recursive loop call in the tailRecurseBlock) may be initiated as soon as the expression p(u, loopExitBlock) can be computed.

Such control-dependence analysis can also be applied when converting acyclic control-flow into dataflow within a hyperblock, via if-conversion [MLC+92], as has been applied by Budiu in his implementation of spatial computation [Bud03]. However, the key advantage of switching away from the CFG towards a VSDG-based intermediate representation is that the latter is fully acyclic, thus this form of control-dependence analysis is not restricted to within acyclic regions of an existing control-flow graph, but can be applied across the entire code, and within each level of the VSDG/VSFG graph hierarchy.

These predicate expressions, in conjunction with the MUX operators, serve to effectively convert all control-flow in the program into dataflow, while the state-edge does the same for side-effect ordering. Applying minimization on these expressions can further expose concurrency by enabling aggressive control-dependence analysis, as described by Lam [LW92].

3.6.2 Revisiting Case Studies 1 and 2

The Value State Flow Graph represents a good starting point for overcoming the performance limitations faced due to control flow in spatial hardware for three reasons:

• The VSFG represents data-dependencies explicitly, but control-flow only implicitly. There is no notion of a program counter, or a selection between executing basic

blocks. Instead, control is implemented using a form of if-conversion (selection between speculatively computed values based on predicates) and predicate-promotion (predication of state operations only). Since control-flow is the primary bottleneck to sequential code performance [LW92, Wal91, MM09], the VSFG would provide a better option for describing spatial dataflow hardware than the CFG, particularly when attempting to accelerate control-intensive sequential code.

• The VSDG/VSFG is acyclic, representing loops as an infinite sequence of forward branches. This can be beneficial in addressing the limitations of custom hardware with respect to speculation across backwards branches⁶.

• The VSDG/VSFG is a gated data-dependence representation containing operations that have convenient equivalent representations in hardware. For instance, the MUX-node utilized in the VSFG to represent value selection based on control-flow is equivalent to a multiplexer in hardware. This is in contrast to its counterpart, the φ-node that is utilized in static single assignment CFG IRs for the same purpose, but has no direct hardware analogue (operations 14 and 15 in Figures 3.2b and 3.2c are examples of the usage of φ-nodes in SSA form).

Figure 3.15 presents the VSFG equivalent to the CDFG from Figure 3.5a (and the code in Figure 3.4). In Figure 3.15a, the outer loop contains the inner loop in a nested subgraph represented by the block labeled ‘Inner do-while Loop’, the contents of which are shown in Figure 3.15b. Similarly, the next iteration of the outer loop itself is represented as a tail-recursive call to the nested subgraph marked ‘Outer For Loop’.

Outer-loop Pipelining without Dynamic-dataflow Overheads: One key advantage of this lack of explicit control flow, combined with the nesting of subgraphs, is the ability to perform loop unrolling and pipelining at multiple levels of a loop nest. Any of the nested subgraphs in a VSFG can be flattened into the body of the parent graph [Law07]. In the case of loops, flattening the subgraph representing the tail-recursive loop call is essentially equivalent to unrolling the loop. Figure 3.16 shows the ‘Outer For Loop’ subgraph in Figure 3.15 flattened/unrolled once.

Furthermore, as each loop is implemented within its own subgraph, this type of unrolling may be implemented within different subgraphs independently of others. Therefore, it is possible in the VSFG to exploit ILP within a loop nest by unrolling the inner loops independently of the outer loops. Thus, for the example code in Figure 3.4 that performs poorly when implemented as custom hardware, or even CDFG-based spatial computation [BAG05], by utilizing the VSFG as our dataflow IR, we are able to exploit greater ILP through outer-loop pipelining in the same manner as the superscalar processor, simply by flattening the nested subgraph representing the outer-loop tail-recursive function call any number of times.

The VSFG can therefore more aggressively overcome the effects of loop-carried name dependencies by enabling outer-loop parallelism in energy-efficient static-dataflow hardware, without incurring the exorbitant energy cost of complex register-renaming logic (as in superscalar processors) or instruction-operand tag matching, wake-up and select logic

⁶ Note however that in order to implement the VSFG representation with a finite amount of custom hardware, dataflow cycles will need to be carefully reintroduced to implement loops using finite resources. This is described in greater detail in Chapter 4.

(a) VSFG for ‘internal int transpose’ Outer Loop.

(b) VSFG for ‘internal int transpose’ Inner Loop.

Figure 3.15: VSFG for the outer for-loop of ‘internal int transpose’, showing the inner loop as a nested subgraph. The next iteration of both outer and inner loops is also represented as nested subgraphs, essentially implementing loops as tail-recursive functions.

Figure 3.16: Loop unrolling is implemented by flattening the loop's tail-recursive call subgraph a number of times. Here, the VSFG of the Outer For Loop is unrolled once, implementing two copies of the inner loop. Each loop in a loop nest may similarly be flattened/unrolled independently of the others to expose loop parallelism.

(used by both superscalar processors and Wavescalar), etc. that is needed to approximate the dynamic-dataflow execution model. Note that in the actual hardware implementation, cycles must be reintroduced to convert from the infinite, acyclic representation of loops into finite hardware. However, this can be done after the appropriate degree of unrolling/flattening has been achieved for each loop level, and is described in greater detail in Chapter 4.

Aggressive Static Control Speculation: All forward branches support speculative execution of operations through if-conversion and predicate promotion. Given that loops are represented as an infinite, hierarchical sequence of forward branches (at least until cycles are reintroduced for hardware implementation), the VSFG representation is able to support speculative execution on all branches.

Such a high degree of speculation can potentially incur a high energy cost, as only a few of the speculatively executed operations in a VSFG will produce results that are not discarded at the MUX operations. Thanks to the hierarchical nature of the VSFG, it is possible to control the degree of speculation by selectively controlling the execution of nested subgraphs. For instance, for the ‘Inner do-while Loop’ subgraph, we may choose whether the contents of this subgraph execute speculatively or not: the predicate input to it may be used to only allow its execution when the predicate is true, thereby providing no speculation. Alternatively, the subgraph may start executing irrespective of the predicate value, in which case the predicate value will be passed into the subgraph, where side-effect-free operations may execute speculatively even before the predicate value becomes available. All side-effecting operations will however always be predicated.

(a) The CDFG for Figure 3.6. (b) The VSFG for Figure 3.6.

Figure 3.17: The equivalent CDFG and VSFG for the code given in Figure 3.6.

Exploiting Multiple Flows of Control: Another advantage of having control flow converted into boolean predicate expressions is the ability to perform control dependence analysis to identify regions of code that are control-flow equivalent and may therefore execute in parallel – provided all state and dataflow dependencies are satisfied. Consider the bar() function in the code given in Figure 3.6 (the equivalent CDFG is given in Figure 3.17a). Despite aggressive branch prediction, a superscalar processor will not be able to start executing bar() until it has exited the loop.

Figure 3.18: The VSFG from Figure 3.17b with the loop unrolled 4 times.

Similarly, when the if branch is predicted-taken, the superscalar processor must switch from executing multiple dynamically unrolled copies of the for-loop and instead focus on executing the control-flow within foo(). This is because a conventional processor can only execute along a single flow of control [LW92].

The VSFG on the other hand can apply boolean expression optimization on the control-predicate expressions of each basic block to identify the control-flow equivalence between the contents of the for-loop and the bar() function. Thanks to this implicit control dependence analysis [LW92], independent instructions from within the bar() function may start executing in parallel (speculatively or otherwise) with the for-loop. Furthermore, so long as their dataflow and state-ordering dependencies are resolved, the contents of the foo() subgraph can also execute in parallel with its parent graph. If we combine this concurrency with loop unrolling as shown in Figure 3.18, it becomes possible to execute multiple copies of the loop and foo(), in parallel with the execution of bar(). This ability to execute multiple regions of code concurrently is equivalent to enabling execution along multiple flows of control, and has the potential to not only expose greater ILP than even a superscalar processor, but possibly also to break through the ILP Wall as well [LW92, MM09].

Potential performance limitations: In the simplest form of the VSFG or the VSDG, there is a single valid state-token in the graph at any time, i.e. there is a total order imposed on the execution of all state access and side-effect sensitive operations throughout all levels of the graph hierarchy. This implies that after a certain amount of unrolling, the sequentializing state-edge will often be on the dynamic critical path⁷ through the VSFG, limiting the achievable ILP. Further improvements to ILP will require partitioning this state-edge, so that multiple memory operations to disjoint locations may occur concurrently, and possibly out of order.

As mentioned earlier, conventional superscalar processors often have support for aggressive memory operation reordering at their load-store queues to enhance performance, without visibly violating the sequential order on side-effects. However, this approach requires complex hardware structures and incurs a significant energy cost. In order to avoid this, in the first instance, the compiler should rely on alias and memory dependence analysis to perform static memory disambiguation, enabling partitioning of the sequential state-edge chain. This partitioning would convert the total order on State operations into a

⁷ i.e. the longest dependency chain of operations in a pipelined implementation of the graph, as opposed to the longest combinational path between registers when implementing custom hardware.

partial order, permitting concurrent updates to disjoint memory locations to improve performance as much as possible without incurring an additional energy cost. Such static partitioning is employed by many existing high-level synthesis tools, like LegUp [CCA+11] and Kiwi [SG08]. Budiu et al have even augmented this static partitioning by incorporating dynamic memory disambiguation hardware in their generated hardware [Bud03, BVCG04]. In this thesis, static memory disambiguation through alias analysis has not been incorporated in the results, partly due to constraints of time, and partly because the effects of such optimizations for high-level synthesis have already been studied extensively.
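As an illustration of the kind of partitioning that alias analysis enables (a hand-written example, not taken from the benchmarks used in this thesis): because the two pointers below are declared restrict, a compiler can prove that the two store chains never touch the same location, so each chain could be threaded onto its own state edge and the stores allowed to proceed concurrently.

    /* With 'restrict', a and b are guaranteed not to alias, so the stores
       to a[] and the stores to b[] only need to be ordered within their
       own chain: the single total-order state edge can be split in two. */
    void scale_two(int * restrict a, int * restrict b, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * 2;    /* state chain 1 */
            b[i] = b[i] + 1;    /* state chain 2 */
        }
    }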

3.7 Related Work on Compiler IRs

As discussed in Section 2.4.3, the most common compiler IRs used for compiling to spatial architectures are closely related to the CDFG: RICA, DySER, TRIPS, CASH/Phoenix, and Conservation Cores all utilize a version of the CDFG, often with optimizations like loop unrolling, if-conversion and hyperblock formation applied to reduce the effects of control-flow on performance. RICA and DySER only exploit spatial computation for loop-free regions of code. However, recent academic research describes several innovative projects developing compiler IRs for spatial computation.

Program Dependence Graph (PDG)

The PDG represents a program as a graph incorporating two types of edges: one set to represent data dependences and another to represent control [FOW87]. Unlike the CDFG that expresses control flow between basic blocks, a PDG control edge expresses explicit fine-grained control dependence between individual operations. The PDG also allows grouping control-equivalent operations into a single region, but unlike the CFG, instructions within such regions may be selected from multiple basic blocks. This balanced treatment of both control and data dependences allows the PDG to simplify several important code transformations, particularly those that require the interaction of both control and data dependences [FOW87].

Variants of the PDG that only represent true data dependences (i.e. the PDG+SSA: a combination of the PDG with the static-single-assignment representation) have repeatedly proved useful for partitioning single-threaded code into multiple communicating threads: Li et al [LPC12] use an ‘SSA-PDG’ to identify strongly connected components (SCCs) in a program graph, considering both control and data dependence edges. Each such SCC may then be assigned for execution in a different thread on a multithreaded processor as a means of converting imperative code for execution as coarse-grained hybrid dataflow on a multicore architecture. A similar technique is used by [Ott08].

The PDG has also been utilized for the generation of higher quality custom hardware [GWK04]. A PDG+SSA IR is generated from the input CFG of an application. Like in the VSFG, the control-dependence edges in the PDG+SSA are implemented as dataflow predicate edges, and SSA φ-nodes are converted into MUX elements. Unlike the boolean minimization that may be applied to simplify control-dependence in the VSFG, the PDG+SSA instead uses control-flow dominance and post-dominance relationships (similar to what is described in Section 5.2.3). Gong et al demonstrate that due to such optimizations, their PDG+SSA based, statically-scheduled custom hardware

exhibits a 7% performance improvement over their equivalent CDFG-based hardware on various multimedia kernels from the Mediabench benchmark suite [GWK04]. Unlike the VSFG, the PDG does not exploit hierarchy to represent loops, and therefore can contain explicit control and dataflow cycles⁸. This, combined with the statically scheduled execution model employed by Gong et al, makes this approach unable to exploit outer-loop pipelining as demonstrated for the VSFG.

Hierarchical Control Dependence Graph (HCDG)

Like the VSFG, the HCDG IR was developed to address the implementation of applications that exhibit complex control-flow as well as dataflow [KW02]. Unlike the VSFG, the HCDG emphasizes the traditional, statically scheduled execution model, and is designed to efficiently implement scheduling, allocation, and binding during high-level synthesis.

Like the PDG+SSA, the HCDG also incorporates both control and true dataflow edges. But instead of representing control dependences as in the PDG, control is represented using boolean expressions just like in the VSFG⁹, and therefore is subject to boolean minimization as well. However, unlike the VSFG, the HCDG does not usually apply predicate promotion to non-side-effecting operations. Instead, this hierarchical predicate information for each operation in the IR is used to perform efficient static scheduling, allocation, and mapping during HLS, for instance, by establishing the mutual exclusivity of various operations due to diverging control-flow in the original program graph. Predicate promotion is only applied when speculative execution is needed.

Unlike Pegasus, but like the VSFG and the PDG+SSA [GWK04], the HCDG does not incorporate the notion of control-directed dataflow. Although the HCDG does not currently support the representation of loops, Kountouris and Wolinski recommend introducing the notion of a composite node representing the loop body for handling loops in the future [KW02]. This would be similar to how the VSFG and VSDG represent loops as nested subgraphs.

Given their commitment to statically scheduled execution, they recommend the generation of a hierarchical FSM¹⁰, with each composite loop node having its own FSM, communicating with a higher-level FSM that manages execution in the parent graph. Code with multiple levels of nested loops would therefore be managed by an FSM with corresponding levels in its hierarchy. This approach would allow for independent composite nodes to be executing concurrently, but it remains unclear if this approach would be suitable for exploiting inter-iteration parallelism in loops without unrolling, or outer-loop pipelining. Given this suggestion for hierarchical construction as a means of handling loops, the HCDG becomes very similar to the VSFG, except that due to its commitment to static execution scheduling, it does not need to incorporate state-edges to preserve the order of side-effecting operations.

Given an application with complex control flow, the HCDG is able to better exploit ‘conditional resource sharing’ to improve the resource requirements of the generated hardware over existing list-scheduling based approaches [KW02].

⁸ Thus, well-behavedness issues must be considered for static-dataflow implementations.
⁹ The ‘Hierarchical’ prefix in the name of this IR refers to this construction of a complex tree of boolean expressions based on the structure of the CFG, and not to the type of hierarchy that has been discussed thus far in reference to the VSDG and VSFG.
¹⁰ Here the word hierarchy is used in a similar fashion as for the VSDG/VSFG.

The present work on the HCDG IR may prove a useful starting point if the development of a statically-scheduled VSFG implementation is ever desired.

Control-Memory Dataflow Graph (CMDFG)

The CMDFG IR is utilized in the COMRADE compiler for ‘adaptive’ computing systems [GK07], which generates static-dataflow custom hardware from hot regions of imperative code. The CMDFG can be seen as an amalgam of the features of the PDG+SSA, the HCDG and the Pegasus IR.

Like the HCDG, the CMDFG has no control-directed dataflow within acyclic regions, and instead relies on boolean predicate expressions. Like the Pegasus and VSFG IRs, it targets the static-dataflow execution model, and thus must incorporate a sequentializing state-edge between side-effecting operations. Like the PDG+SSA, the CMDFG replaces SSA φ-nodes with MUX nodes, and it too allows cycles in the graph to represent loops¹¹, instead of the hierarchical approach favoured by the VSFG and the HCDG. Lacking this notion of hierarchy, the CMDFG thus cannot exploit multiple flows of control or outer-loop pipelining as proposed for the VSFG.

However, the key contribution made by this work was to improve the performance of speculative execution through the use of ‘cancel tokens’ [GK08]. Speculation is implemented in a static-dataflow graph via if-conversion: all speculative datapaths are allowed to execute concurrently, with only the results from the true path being selected at a MUX. However, if such a speculative datapath is enclosed within a loop body, the throughput of the loop will be constrained by the longest of the speculative paths, even if it is not the most frequently used path¹². In order to avoid this unnecessary latency along falsely speculated paths, MUX elements in the CMDFG are able to produce cancel tokens that traverse the static-dataflow graph backwards along all false paths, once the correct MUX predicate value is known. These cancel tokens are used to kill computations that may be in progress in the false paths, thereby reducing the unnecessary latency incurred, and improving loop throughput. Gadke and Koch are able to demonstrate a 3× improvement in performance for an example code kernel compared to the baseline, and a 1.8× improvement over lenient MUX execution¹³ [GK08]. This novel approach would also be useful for improving performance of such loops in VSFG-based implementations, as discussed later in Section 6.2.1. For now, the inclusion of cancel tokens into the VSFG is left for future work.

3.8 Summary

This chapter started by studying the nature of sequential, imperative code, and the issues that arise in exposing and exploiting fine-grained concurrency from it.

¹¹ Note that as the CMDFG targets static-dataflow execution, well-behavedness issues must be taken into account when implementing code with such loops. Section 4.2.1 discusses well-behavedness in dataflow computation graphs.
¹² In cases where there are no loop-carried dependences, techniques like pipeline balancing may be utilized [Gao91] to maximize throughput at the expense of increased latency along the shorter speculative paths. However, this may not help for loops that have loop-carried dependences. A lengthier discussion on this issue is presented in Section 6.2.1, and Figure 6.4.
¹³ Lenient execution is also used by the VSFG, and is described in greater detail in Sections 4.3.1 and 4.3.2.

I highlighted the fact that superscalar processors achieve high sequential performance by performing aggressive branch prediction to overcome control-flow, and approximate the dynamic-dataflow execution model to accelerate true dependencies and mitigate the impact of unpredictable latencies at runtime. I also noted that while such processors are constrained by the ILP Wall, it may be possible to exploit more ILP from code by exploiting multiple flows of control.

To match, or even exceed, superscalar performance in hardware, I proposed (a) switching to the static-dataflow execution model for spatial hardware, in order to retain high energy-efficiency while providing dynamic execution scheduling, and (b) switching from CFG-based compiler IRs to a VSDG-based IR – the Value State Flow Graph – that represents data and side-effect ordering dependencies explicitly, enables aggressive control speculation, control-dependence analysis, and exploitation of multiple flows of control. By enabling the independent unrolling of each loop in a loop-nest, the VSFG could even enable exploiting concurrency in nested loops via outer-loop pipelining, without needing to incur the area and energy overheads associated with the dynamic-dataflow execution model.

It is important to note that unlike a conventional processor, or even high-performance spatial architectures like TRIPS and DySER, the proposed approach relies solely on static, compile-time manipulation of the program IR to expose greater ILP from control-flow intensive code. Chapter 4 formally describes the operational semantics of the VSFG-S: a version of the VSFG IR that is suitable for implementation as static-dataflow custom hardware. Chapter 5 then describes the implementation of a toolchain that compiles high-level language code to the VSFG-S IR, and then implements it as dataflow custom hardware.

CHAPTER 4

Definition and Semantics of the VSFG

Chapter 3 described the Value State Dependence Graph (VSDG) and how its structure could facilitate aggressive static exposition of ILP from control-intensive sequential code, by (a) enabling aggressive speculation through if-conversion, (b) employing control-dependence analysis to accelerate predicated operations, and (c) exposing ILP from across multiple flows-of-control. I highlighted the need to develop an eager-evaluation, dataflow version of the VSDG, called the Value State Flow Graph (VSFG). The objective of developing the VSFG program representation is to facilitate the implementation of conventional imperative languages on spatial architectures with minimal programmer intervention, while matching the performance of conventional superscalar processors.

This chapter describes in greater detail the structure and semantics of a version of the VSFG that may be implemented directly as custom or reconfigurable hardware. This constrained version of the VSFG may be referred to as the VSFG-S, where the appended ‘S’ specifies the static-dataflow nature of the execution model being targeted, as opposed to the pure dataflow VSFG representation from Chapter 3¹.

4.1 The VSFG as Custom Hardware

In order to evaluate the performance and energy characteristics of spatial implementations using the VSFG-S, a prototype high-level synthesis toolchain is developed that compiles imperative code into the VSFG representation, and then implements it as static-dataflow custom hardware described using standard Verilog HDL. This approach was selected for the following reasons:

• Custom hardware is considered an important target for implementing computation in the Dark Silicon era [VSG+10]. Our compiler IR must be able to efficiently and effectively compile to custom hardware while exhibiting better performance than hardware generated using existing HLS tools, and retaining the inherent efficiency advantage of custom hardware.

¹ However, from Chapter 5 onwards, ‘VSFG’ and ‘VSFG-S’ are used interchangeably, unless explicitly stated otherwise.

• There is no need to design and optimize a target spatial architecture (e.g. a CGRA) for the VSFG, nor to build specialized compilation (i.e. place-and-route) tools to map the VSFG to such an architecture. Generating a custom hardware representation in a standard HDL from the VSFG IR allows us to utilize existing prototyping substrates and tools, like FPGAs.

• High-level synthesis to application-specific hardware allows us to evaluate the performance and efficiency potential of the IR itself, without the evaluation being complicated by the characteristics and design trade-offs inherent in any particular spatial architecture. In fact, the development of a spatial architecture is best treated as a separate (albeit related) research project, wherein a thorough design space exploration must be performed to develop a well-tuned architecture with the appropriate performance, energy and area characteristics for its target application domain(s). Such an approach is being pursued by the Loki project [Bat14].

In addition to the incorporation of predicate expressions to control the execution of side-effects and compound operations in the eagerly-evaluated VSFG (Section 3.6), the following issues must also be considered when trying to use the VSFG to directly represent a static-dataflow machine in custom hardware:

1. Developing dataflow semantics for VSFG Operations: Unlike the φ-nodes from SSA form, we must ensure that all of the constituent operations in a VSFG can be implemented easily and efficiently in hardware. This includes not only the value, memory-access, and MUX operations, but also determining how compound VSFG operations representing nested-subgraphs may be implemented in hardware. Section 4.3.1 describes push semantics for all basic operations, while Section 4.3.2 describes the representation of compound operations in custom hardware.

2. Implementing loop VSFGs in finite hardware: The VSFG, like the VSDG, thus far represents each loop as an infinite directed-acyclic graph. However, cycles must be reintroduced into the VSFG in order to implement it with finite custom hardware. Section 4.3.3 describes how dataflow back-edges and limited control-dependent dataflow are reintroduced into the VSFG in order to implement loops, without negating its advantages described in Chapter 3.

3. Maintaining well-behavedness of dataflow graphs: With the reintroduction of back-edges into the VSFG, finite hardware resources such as hardware operations, registers and wires will be ‘reused’ at runtime. In this case, maintaining the well-behavedness of the VSFG IR across all subgraphs is essential for correct execution of dataflow graphs². Section 4.2.1 discusses well-behavedness and how the VSFG must be constrained to be well-behaved.

Petri-nets are well suited to representing dataflow execution. Continuing from the definition of the structure of the VSFG as a Petri-net (Definition 3.1), the next few sections describe the structure and operational semantics of the VSFG-S, as well as discussing how each of the above issues is addressed in this representation.

² Especially in the case of static dataflow, where each output place only holds one value at a time.

4.2 Modeling Execution with Petri-Nets

Consider a generalized dataflow operation with n inputs and m outputs, as shown in Figure 4.1a. In the abstract or pure dataflow execution model, operation execution is only constrained by the availability of values on the input edges [AC86]. Once one input value is available on each of the n input edges, the dataflow operation may execute, consuming one value from each of its input edges and simultaneously producing one value on each of its m output edges. The pure/abstract dataflow execution model presumes unbounded FIFO buffers – meaning that each edge in a dataflow graph may hold an arbitrary number of values, which are consumed in the order they were produced [AC86]. A static-dataflow machine, however, allows only one value to be buffered on each edge at a time [AC86]. In this case, the dataflow operation from Figure 4.1a can only execute when one input value is available on each of its n input edges, and the buffers on each of the m output edges are empty.
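As a concrete illustration of this firing rule, here is a minimal Python sketch of a static-dataflow operation that fires only when every input edge carries a value and every output edge is empty. The class and names are purely illustrative and are not part of the VSFG-S toolchain.

class StaticDataflowOp:
    """A dataflow operation under the static-dataflow rule:
    each edge is a one-place buffer that holds at most one value."""

    def __init__(self, fn, n_inputs, m_outputs):
        self.fn = fn                          # combinational function of the inputs
        self.inputs = [None] * n_inputs       # one-value buffer per input edge
        self.outputs = [None] * m_outputs     # one-value buffer per output edge

    def can_fire(self):
        # Fire only if every input buffer is full and every output buffer is empty.
        return all(v is not None for v in self.inputs) and \
               all(v is None for v in self.outputs)

    def fire(self):
        if not self.can_fire():
            return False
        result = self.fn(*self.inputs)
        self.inputs = [None] * len(self.inputs)      # consume one value per input edge
        self.outputs = [result] * len(self.outputs)  # produce one value per output edge
        return True

# Example: a 2-input adder with a fanout of one.
add = StaticDataflowOp(lambda a, b: a + b, n_inputs=2, m_outputs=1)
add.inputs = [3, 4]
add.fire()
print(add.outputs)   # [7]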

(a) Generalized Dataflow operation with n inputs and m outputs. (b) Unbounded Petri-net representation with n input and m output places.

(c) Petri-net from Figure 4.1b firing.

Figure 4.1: Representations of a generalized Dataflow operation using Petri-nets.

Dataflow execution can be modelled effectively using Petri-nets [Buc93]. Similar to the dataflow operation from Figure 4.1a, a Petri-net transition is ready to fire when each of its input places has at least one token. Figure 4.1b shows the equivalent Petri-net representation of the dataflow operation from Figure 4.1a. A Petri-net used to model dataflow execution semantics is represented as a four-tuple G = (P, T, E, M0), where P, T, and E are the same as in Definition 3.1, while M0 is the initial marking³.

³ Wang [Wan07] defines a marking as follows: “A marking in a Petri-net is an assignment of tokens to the places of a Petri-net. Tokens reside in the places of a Petri-net. The number and positions of tokens may change during the execution of a Petri-net. The tokens are used to define the execution of a Petri-net.”

Given the initial marking shown in Figure 4.1c, the equivalent transition (operation) fires, removing tokens (values) from its input places and placing tokens into its output places. Petri-nets of this type are useful for modelling abstract or pure dataflow, as there is no restriction on the number of tokens that any given place may hold (i.e. Petri-net places are unbounded), and hence they are useful for modelling unbounded buffers on each dataflow arc. In order to model static-dataflow execution, we constrain our Petri-net to be a homogeneous marked-graph that is 1-bounded and safe [Buc93]:

• Homogeneous Petri-nets: For a Petri-net, homogeneity means that a firing transition removes only one token from each input place, and adds only one token to each output place.

• Marked-graph Petri-nets: A marked graph is a Petri-net in which every place has exactly one input transition and one output transition. Marked graphs are useful in modelling deterministic Petri-net execution. The use of homogeneous, marked graphs is well-established for representing dataflow computation [Buc93].

• 1-bounded Petri-nets: From Wang [Wan07]: “A place p is said to be k-bounded if the number of tokens in p is always less than or equal to k (k is a nonnegative integer number) for every marking M reachable from the initial marking M0”. For a Petri-net to model static-dataflow, each place in the Petri-net must be constrained to hold only a single value at a time, i.e. each place in the Petri-net must be 1-bounded.

• Safe Petri-nets: A Petri-net G = (P, T, E, M0) is safe if each place p ∈ P is 1-bounded.

Safeness can be enforced by adding a set of acknowledgement places to the existing set of places in the Petri-net:

\[
P' = P_A \cup P, \quad \text{where } P_A = \{\, p'_k \mid k \in \mathbb{N} \land p_k \in P \,\} \tag{4.1}
\]

The new set of places P′ thus contains an additional acknowledgement place p′_k ∈ P_A for each of the places p_k ∈ P, with the property that p′_k would have a token if and only if p_k does not have a token. Given that each transition t ∈ T in the Petri-net will have a set of input places I(t) ⊂ P, and a set of output places O(t) ⊂ P, this is implemented by:

• adding p′_k to the output set O(t) of transition t, if p_k is in the input set I(t) of transition t, or

• adding p′_k to the input set I(t) of transition t, if p_k is in the output set O(t) of transition t.

A safe Petri-net used to model static-dataflow execution will have an initial marking where all the acknowledgement places have a token, and all the value places are empty, as shown in Figure 4.2a. Triggering the execution of such a Petri-net involves placing tokens in all p_i ∈ Pin⁴, while simultaneously removing tokens from all the corresponding acknowledgement places p′_i.

(a) Safe Petri-net representation with n input and m output places, together with corresponding acknowledgment places, also showing an initial marking.

(b) Petri-net from Figure 4.2a firing after the input places have valid tokens and the input acknowledgement places have been cleared.

Figure 4.2: Representations of a generalized Dataflow operation using Petri-nets.

The transition corresponding to the dataflow operation from Figure 4.1a will then fire as shown in Figure 4.2b.
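The acknowledgement-place construction of Equation 4.1 can be sketched in a few lines of Python. This is an illustrative model only – the place and transition names are invented here – and not the representation used by the HLS toolchain.

# Minimal model of enforcing safeness (Equation 4.1): every place p gets an
# acknowledgement place p', holding a token exactly when p does not.

def make_safe(transitions):
    """transitions: dict name -> (input_places, output_places).
    Returns the same transitions extended with acknowledgement places."""
    safe = {}
    for name, (ins, outs) in transitions.items():
        new_ins = list(ins) + [p + "'" for p in outs]    # p' of each output becomes an input
        new_outs = list(outs) + [p + "'" for p in ins]   # p' of each input becomes an output
        safe[name] = (new_ins, new_outs)
    return safe

def fire(marking, ins, outs):
    """Fire a transition if every input place holds a token; tokens move atomically."""
    if all(marking.get(p, 0) >= 1 for p in ins):
        for p in ins:
            marking[p] -= 1
        for p in outs:
            marking[p] = marking.get(p, 0) + 1
        return True
    return False

# A single 2-input, 1-output operation, as in Figure 4.2a.
net = make_safe({"op": (["i1", "i2"], ["o1"])})
# Initial marking: value places empty, acknowledgement places full.
marking = {"i1": 0, "i2": 0, "o1": 0, "i1'": 1, "i2'": 1, "o1'": 1}
# Inject operands: place tokens in i1, i2 and clear their acknowledgement places.
marking.update({"i1": 1, "i2": 1, "i1'": 0, "i2'": 0})
ins, outs = net["op"]
print(fire(marking, ins, outs))   # True: the transition fires as in Figure 4.2b
print(marking)                    # o1 now holds a token, o1' is cleared, i1'/i2' hold tokens again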

4.2.1 Well-behavedness in Dataflow Graphs

Before discussing the operational semantics of the individual operations in the VSFG-S, it is important to consider the issue of well-behavedness of dataflow graphs. The original dataflow computational model was composed of various types of operations: primitive operations representing arithmetic/logic functionality, as shown in Figure 4.3a, and control operations that implement control-directed flow of data, as shown in Figure 4.3b and Figure 4.3c [AC86]. Primitive operations consume all input tokens and produce tokens on all outputs, just like the generalized dataflow operation considered in Figure 4.1a, but the switch and merge control operations produce or consume tokens selectively, based on a predicate input.

Implementation issues arise when composing dataflow graphs out of these operations, particularly when implementing control-dependent dataflow. For instance, Figure 4.4a describes an attempt at speculative execution by using the merge operation, while Figure 4.4b describes an attempt at predicated execution using the switch operation.

⁴ Recall that Pin ⊂ P is the set of input places of a VSFG graph or subgraph, as defined in Definition 3.1.

(a) A primitive dataflow operation. (b) The switch control operation. (c) The merge control operation.

Figure 4.3: Dataflow operations as defined by Arvind and Culler [AC86].

(a) Unsuccessful speculative execution with merge. (b) Unsuccessful predicated execution with switch.

Figure 4.4: Unsuccessful attempts at control-dependent dataflow, taken from [AC86, Tra86, AN90].

Since merge does not consume tokens from its false input edge, two problems arise when the schema in Figure 4.4a is reused (i.e. resides inside a loop): (1) if the predicate p holds for many iterations of the loop, the false edge could accumulate an unknown number of tokens, and (2) if the predicate changes value at each iteration, the output token selected would always be one generated during the previous iteration instead of the current one. For the schema in Figure 4.4b, outputs may emerge in the wrong order: for instance, if primitive operation g computes much faster than f, the value g(Y) may emerge at the output before f(X). In order to address these issues, Arvind et al. assert that all dataflow graphs be well-behaved.
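To make the first of these problems concrete, the following toy Python simulation (the functions and structure are invented purely for illustration) drives the merge-based speculation schema of Figure 4.4a for several loop iterations with the predicate always true, and counts the tokens stranded on the unselected input edge.

from collections import deque

# Toy model of the merge-based speculation schema from Figure 4.4a.
# Both arms f and g execute every iteration; merge consumes a token only
# from the arm selected by the predicate, so the other arm's edge grows.

f_edge, g_edge = deque(), deque()   # unbounded edges feeding the merge
outputs = []

def f(x): return x + 1              # "true" arm
def g(x): return x * 2              # "false" arm

for i in range(5):                  # five loop iterations
    f_edge.append(f(i))             # both arms fire speculatively
    g_edge.append(g(i))
    predicate = True                # predicate holds every iteration
    # merge consumes a token only from the selected edge
    outputs.append(f_edge.popleft() if predicate else g_edge.popleft())

print(outputs)        # [1, 2, 3, 4, 5]
print(len(g_edge))    # 5 tokens accumulated on the never-selected edge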

Definition 4.1. Paraphrasing from Gao et al [GGP92]: “A dataflow graph is said to be well-behaved if there exists an infinite, fair firing sequence of the operations of the graph that requires only finite memory on every arc”. According to Arvind and Culler, all of the following conditions must be met for a graph to be well-behaved [AN90]:

1. It has at least one input and one output edge;

2. Assuming that initially there are no tokens in the graph, given exactly one token on every input, ultimately exactly one token is produced on every output;

3. When all output tokens have been produced, there are no tokens left in the graph, i.e. the graph is self-cleaning.

Per Definition 4.1, the primitive dataflow operation from Figure 4.3a is well-behaved, but the switch (Figure 4.3b) and merge (Figure 4.3c) operations are not. However, well-behaved graphs can still be constructed out of non-well-behaved operations: Figures 4.5a and 4.5b show the well-behaved conditional and loop schemas, respectively, developed by Arvind et al. [AC86, Tra86, AN90], who also listed the following rules for the construction of well-behaved graphs using these schemas:

1. An acyclic interconnection of graphs is a well-behaved graph if each of its component graphs is a well-behaved graph;

2. The conditional schema is a well-behaved graph if the graphs for the true side and false side are well-behaved graphs;

3. The loop schema is a well-behaved graph if the graphs for the predicate and body are well-behaved graphs.

(a) Well-behaved ‘Control Schema’. (b) Well-behaved ‘Loop Schema’.

Figure 4.5: Constructing well-behaved dataflow graphs from non-well-behaved components. Taken from [AC86, Tra86, AN90].

Well-behavedness is therefore an important concern for the VSFG IR, and must be considered when dealing with control-directed dataflow operations such as loops and forward branches. For instance, I employ the well-behaved MUX operator instead of the merge in order to implement speculative execution (Section 4.3.1). Similarly, care must be taken when implementing loops as nested subgraphs (Section 4.3.3).

4.3 Operational Semantics for the VSFG-S

Given the context of a homogeneous, 1-bounded, safe, marked-graph Petri-net to represent the structure of the VSFG-S, we can define the semantics of the various transitions. Figure 4.6 lists the various types of transitions that form part of the VSFG-S, as well as the types (Γ) associated with each place, and the various values (V ) that each place may hold, constrained by its type.

Values   V ::= n ∈ N | z ∈ Z | b ∈ B | ∆ | τ | ⊥
Types    Γ ::= int | float | bool | token

Transitions   t ∈ T ::=
      O(t) = binary(i1, i2)
    | O(t) = unary(i1)
    | Odata(t), OSTATE(t) = load(iaddr, iSTATE, ipred)
    | OSTATE(t) = store(iaddr, idata, iSTATE, ipred)
    | O(t) = mux(ipred1, ival1, ipred2, ival2, ..., ipredN, ivalN)
    | Ofused(t) = signal(ival, ipred)
    | O(t) = wait(ifused, ipred)
    | O(t) = eta(ipred, ival)
    | O(t) = mu(i1, i2, ..., iN)
    | Ofused(t) = inGate(p1, v1, ..., pN, vN)
    | O(t) = outGate(p1, f′1, ..., pM, f′M)

Figure 4.6: List of Values, Types and Transitions supported in the VSFG-S

The following subsections describe all of the transitions mentioned in Figure 4.6 in two formats. First, an equivalent, safe Petri-net representation is provided, and second, a Plotkin-style operational semantics [Plo81] is presented. The latter formalizes the semantics of each operation, while the Petri-net representations are useful in converting the VSFG to custom hardware via the Bluespec SystemVerilog HDL, as described in Chapter 5. Section 4.3.1 describes the binary, unary, load, store, and mux operations. These basic operations have been studied extensively in prior literature – the Plotkin-style semantics presented in Equations 4.2 through 4.10 were originally developed by Budiu [Bud03], while their equivalent Petri-net representations were developed for this work. Sections 4.3.2 and 4.3.3 introduce new operations specifically tailored to deal with the hierarchical nature of the VSFG, namely the signal, wait, inGate, and outGate operations. Both their Petri-net representations, as well as their operational semantics (Equations 4.11 through 4.14, and 4.18 through 4.21), were developed specifically for this dissertation. The notation used to present the operational semantics of the VSFG-S is described as follows:

• The list of values includes ∆ to denote don’t-care values, and a value ⊥ to indicate an empty, or undefined, value.

• In addition to the basic types, Γ also includes a valueless token type. Places of token type can either be empty, signified by the value ⊥, or indicate the presence of a token with the symbol τ.

• Each place in a VSFG-S can hold a specific type of value. I borrow terminology from work on Coloured Petri-nets by Jensen [Jen91] to describe the type of value each place may hold: each place p ∈ P is described as having a colour, given by the function C : P → Γ. Hence the expression C(p) = bool indicates that place p holds values of type bool (i.e. place p is of colour bool).

• The state of a Petri-net σ is a function mapping all places in the net to values: σ : P → V. Thus we are able to denote the fact that place p holds value v with the expression σ(p) = v. The expression σ(p) = ⊥ indicates that place p is empty, or undefined (and thus its corresponding acknowledgement place p′ currently holds a token). Note that the STATE edge that constrains the ordering of side-effects in the VSFG-S should not be confused with the state σ of the Petri-net.

• A firing of any transition t is defined by a change in the state of the Petri-net from σ to σ′. We write σ′ = σ[p ↦ v] to indicate that state σ′ can be reached from σ by assigning a token with value v to place p (i.e. mapping p to v). By extension, the expression σ′ = σ[p1 ↦ v1][p2 ↦ v2] would indicate that two places must map to new values in order to reach state σ′ from σ.

• The expression σ(p) ≠ ⊥ can be abbreviated with def(p), signifying that place p has a token. Conversely, !def(p) is the same as σ(p) = ⊥, to signify that place p has no token. The expression can be extended to describe the status of multiple places as follows: def(p1, p2).

• The clearing of a place, given by the expression [p ↦ ⊥], can be abbreviated as erase(p). The expression can be extended to describe the clearing of multiple places as follows: erase(p1, p2).

• Each transition t ∈ T in the VSFG has a set of input places I(t), and a set of output places O(t). Each transition acts upon values from all of its input places, and updates all of its output places. O(t) may contain multiple output places if there are multiple consumers for an output value (i.e. fanout > 1), or if a transition has multiple distinct output types. An example of the latter case is the load(iaddr, iSTATE, ipred) transition, which returns both a data value from memory, as well as producing an output state token. Here, Odata(load) represents the set of fanout places of the memory value being retrieved, while OSTATE(load) represents the set of fanout places for the state-token produced by the load transition: O(load) = Odata(load) ∪ OSTATE(load).

• If addr ∈ V is used to indicate a value representing a memory address, we write update(addr, v) to indicate that value v ∈ V is being stored at memory address addr. Similarly, lookup(addr) is used to denote a value retrieved from the location specified by addr.

• The semantics for each transition op ∈ T are specified by (1) a precondition indicating when the transition may be triggered, as well as (2) the resultant change effected on the Petri-net state σ when the precondition holds and the transition fires. This is denoted as:

\[
\mathit{op} \quad \frac{\textit{precondition}}{\textit{change}}
\]

4.3.1 Semantics for Basic Operations

Value Operations

As introduced by Definition 3.1, Value operations represent the basic, non-side-effecting unary and binary ALU operations, and are analogous to the well-behaved primitive dataflow operation from Figure 4.3a. The list of such operations currently supported is largely similar to the binary and unary arithmetic, logic and conversion operations listed in the LLVM Language Reference Manual [LLV], since the HLS toolchain implemented for this dissertation uses the LLVM IR as the input language. Further details about this toolchain are provided in Chapter 5.

Figures 4.7a and 4.7c show the dataflow graph representation of such operations, while Figures 4.7b and 4.7d show their 1-bounded, safe Petri-net equivalents. As these operations are not side-effect sensitive, they do not require a state-edge input, nor a predicate input (i.e. they are predicate promoted). The operational semantics of these operations, as described by Budiu [Bud03], are given by Equations 4.2 and 4.3.

\[
o_1 = \mathrm{unary}(i_1) \quad
\frac{\sigma(i_1) \neq \bot,\ \sigma(o_1) = \bot}
     {\sigma' = \sigma[o_1 \mapsto \mathrm{unary}(\sigma(i_1))][i_1 \mapsto \bot]}
\tag{4.2}
\]

\[
o_1 = \mathrm{binary}(i_1, i_2) \quad
\frac{\sigma(i_1) \neq \bot,\ \sigma(i_2) \neq \bot,\ \sigma(o_1) = \bot}
     {\sigma' = \sigma[o_1 \mapsto \mathrm{binary}(\sigma(i_1), \sigma(i_2))][i_1 \mapsto \bot][i_2 \mapsto \bot]}
\tag{4.3}
\]

For simplicity, I assume that each operation whose semantics are described in this way has a fanout of 1, meaning only one output place, o1 ∈ O(t), needs to be updated. As an example, consider Equation 4.3. The precondition for this transition indicates that it will fire only when the input places i1 and i2 are defined (not empty), and the output place o1 is empty. Upon firing, o1 is updated with the computed result of the binary operation performed on the two input values ([o1 ↦ binary(σ(i1), σ(i2))]), while simultaneously the input places are acknowledged, or cleared ([i1 ↦ ⊥][i2 ↦ ⊥]). This clearing action could also have been written using the defined abbreviation: erase(i1, i2).
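As an illustration of how such a rule acts on the Petri-net state σ, here is a small Python sketch of the binary rule (Equation 4.3), with σ modelled as a dictionary from place names to values and ⊥ modelled as None. All names are illustrative.

BOTTOM = None   # models the undefined value ⊥

def defined(sigma, *places):
    """def(p1, p2, ...) from the notation above."""
    return all(sigma[p] is not BOTTOM for p in places)

def erase(sigma, *places):
    """erase(p1, p2, ...): clear the places (their acknowledgement places regain a token)."""
    for p in places:
        sigma[p] = BOTTOM

def fire_binary(sigma, op, i1, i2, o1):
    """Equation 4.3: fire only if both inputs are defined and the output is empty."""
    if defined(sigma, i1, i2) and sigma[o1] is BOTTOM:
        sigma[o1] = op(sigma[i1], sigma[i2])   # [o1 -> binary(σ(i1), σ(i2))]
        erase(sigma, i1, i2)                   # [i1 -> ⊥][i2 -> ⊥]
        return True
    return False

sigma = {"i1": 5, "i2": 7, "o1": BOTTOM}
fire_binary(sigma, lambda a, b: a + b, "i1", "i2", "o1")
print(sigma)   # {'i1': None, 'i2': None, 'o1': 12}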

Load and Store Operations

The dataflow graph components for the load and store operations are shown in Figures 4.8a and 4.8b, respectively. As mentioned in Section 3.6, as State operations, both load and store have a state and a predicate input (iS and ip, respectively), as well as a state output (oS), in addition to their usual address and data inputs and outputs (ia, id and od). Both operations are well-behaved, since even store has an output state-edge. The equivalent safe Petri-net transitions for these operations are similar to those for the binary and unary operations, and are shown in Figures 4.8c and 4.8d.

(a) A Unary Dataflow operation. (b) The equivalent safe Petri-net for a Unary Dataflow Operation.

(c) A Binary Dataflow operation. (d) The equivalent safe Petri-net for a Binary Dataflow Operation.

Figure 4.7: Basic Value Operations in the VSFG-S: Unary and Binary.

As described by Budiu [Bud03], and adapted for use with our Petri-net framework, the operational semantics for the load operation are given by Equations 4.4 and 4.5, while the semantics for the store operation are given by Equations 4.6 and 4.7. Again, for simplicity, I assume that the output set for each kind of output (Odata(t), OSTATE(t) ⊆ O(t)) for each of the transitions only contains one place (od and oS, respectively).

\[
o_d, o_S = \mathrm{load}(i_a, i_S, i_p) \quad
\frac{\sigma(i_p) = \mathit{True},\ \mathrm{def}(i_a, i_S),\ !\mathrm{def}(o_d, o_S)}
     {\sigma' = \sigma[o_d \mapsto \mathrm{lookup}(\sigma(i_a))][o_S \mapsto \tau] \circ \mathrm{erase}(i_a, i_S, i_p)}
\tag{4.4}
\]

\[
o_d, o_S = \mathrm{load}(i_a, i_S, i_p) \quad
\frac{\sigma(i_p) = \mathit{False},\ \mathrm{def}(i_a, i_S),\ !\mathrm{def}(o_d, o_S)}
     {\sigma' = \sigma[o_d \mapsto \Delta][o_S \mapsto \tau] \circ \mathrm{erase}(i_a, i_S, i_p)}
\tag{4.5}
\]

\[
o_S = \mathrm{store}(i_a, i_d, i_S, i_p) \quad
\frac{\sigma(i_p) = \mathit{True},\ \mathrm{def}(i_a, i_d, i_S),\ !\mathrm{def}(o_S)}
     {\sigma' = \sigma[o_S \mapsto \tau] \circ \mathrm{update}(\sigma(i_a), \sigma(i_d)),\ \mathrm{erase}(i_a, i_d, i_S, i_p)}
\tag{4.6}
\]

\[
o_S = \mathrm{store}(i_a, i_d, i_S, i_p) \quad
\frac{\sigma(i_p) = \mathit{False},\ \mathrm{def}(i_a, i_d, i_S),\ !\mathrm{def}(o_S)}
     {\sigma' = \sigma[o_S \mapsto \tau] \circ \mathrm{erase}(i_a, i_d, i_S, i_p)}
\tag{4.7}
\]

If the input predicate (the value in place ip) holds, the load operation accesses the memory location specified by the value in the ia input place, through the lookup(addr) function, and places the result atomically in its output place od, along with a state token τ in its state output oS (Equation 4.4), while simultaneously clearing its inputs (i.e. removing tokens from the input places, and placing tokens in their corresponding acknowledgement places).

(a) A Load Dataflow operation. (b) A Store Dataflow operation.

(c) The equivalent safe Petri-net for a Load Operation. (d) The equivalent safe Petri-net for a Store Operation.

Figure 4.8: Basic State operations in the VSFG-S: Load and Store.

Similarly, if the predicate holds for a store operation, the update(addr, data) function is utilized to place the value in idata into the memory location specified by the value in ia (Equation 4.6). Note that as the memory address space is not incorporated as part of the VSFG-S semantics, the update(addr, data) function does not update any places in the Petri-net state σ.

If the input predicate is false, then neither operation accesses memory, as described by the rules given in Equations 4.5 and 4.7. However, due to the need to enforce well-behavedness in the graph, both operations must always place output values into their output places: the load operation places a don’t-care value ∆ in od, while both load and store place a state token τ immediately in their oS output places, while simultaneously clearing all inputs. Well-behavedness was discussed in greater detail in Section 4.2.1.
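The following is a small Python sketch of the predicated load rules (Equations 4.4 and 4.5), with the Petri-net state as a dictionary and memory as a plain dict. DONT_CARE and the place names are illustrative only.

BOTTOM = None            # ⊥: empty place
TOKEN = "tau"            # τ: state token
DONT_CARE = "delta"      # ∆: don't-care value

memory = {0x10: 42}      # toy memory for lookup()

def fire_load(sigma, ia, iS, ip, od, oS):
    """Equations 4.4 / 4.5: a predicated load always produces od and oS
    (well-behavedness), but only touches memory when the predicate holds."""
    inputs_ready = all(sigma[p] is not BOTTOM for p in (ia, iS, ip))
    outputs_empty = sigma[od] is BOTTOM and sigma[oS] is BOTTOM
    if not (inputs_ready and outputs_empty):
        return False
    if sigma[ip]:                                # predicate True: access memory
        sigma[od] = memory[sigma[ia]]
    else:                                        # predicate False: no memory access
        sigma[od] = DONT_CARE
    sigma[oS] = TOKEN                            # state token emitted in both cases
    for p in (ia, iS, ip):                       # erase(ia, iS, ip)
        sigma[p] = BOTTOM
    return True

sigma = {"ia": 0x10, "iS": TOKEN, "ip": True, "od": BOTTOM, "oS": BOTTOM}
fire_load(sigma, "ia", "iS", "ip", "od", "oS")
print(sigma["od"], sigma["oS"])   # 42 tau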

The MUX Operation

The MUX operation defined as part of the VSDG (and the VSFG) in Definition 3.1 selects one of two input values based on a third, predicate input value. However, my implementation of the MUX is slightly different: I implement a decoded multiplexer that selects one from N input values, based on another N input predicate values. The dataflow graph representation of this decoded MUX operation is shown in Figure 4.9a. For such a decoded multiplexer, it must be guaranteed that at most one of the N input predicates will hold, while the rest must be false. Recall from Section 3.6 that during the conversion from a CFG to the VSFG, predicate expressions are generated for each basic-block in the CFG.

(a) The MUX operation as Dataflow. (b) The equivalent safe Petri-net for a strict implementation of the MUX Operation.

Figure 4.9: The MUX operation in the VSFG-S, and its strict Petri-net implementation.

Figure 4.10: Acyclic CFG for a non-loop function. Copy of Figure 3.14a.

MUX operations replace the φ-nodes in SSA form, which occur at ‘value-join points’ in the code, i.e. at basic blocks that have multiple predecessors. For instance, consider the basic block f from Figure 4.10: it has multiple potential predecessor basic blocks. Assume that it contains a φ-node selecting from one of three potential values (since there are three predecessors to f, namely b, c, and d):

vout = φ((vb, b), (vc, c), (vd, d))

Each tuple of the form (vx, x) specifies the value vx to be selected if control-flow reaches the φ-node from predecessor block x. Whereas the φ-node relies on the flow of control from one of multiple predecessors to the value-join basic block in order to select the appropriate output value, the MUX node instead relies on the computed predicate expressions generated for each of the predecessor blocks and control-flow edges (as described in Section 3.6):

vout = mux(vb, pb, vc, pc, vd, pd), where:

pb = p(b) · ¬p(b, e),   pc = p(c) · ¬p(c, g),   pd = p(d) · ¬p(d, h)

For instance, the MUX would select the value vb if the predicate p(b) for block b holds and the control-flow condition p(b, e) at the end of block b evaluates as false, in which case, after reaching b, control would flow from b to f. For our decoded MUX operation, the constraint that at most one predicate input be true is guaranteed by the fact that in the original CFG, control would flow to f from only one of its predecessors, i.e. the predicates pb, pc, and pd are mutually exclusive. Note that the predicate p(f) of the host block f for the MUX operation is not used, as MUX is a value operation, and can be predicate-promoted. Note also that this implies that in a VSFG supporting aggressive speculative execution, it is possible for all of the predicate inputs to a MUX to be false. This would happen, for instance, if control-flow never passes to f, instead going through e, g, or h. In this case, the MUX must still produce a don’t-care value in order to meet well-behavedness requirements (Section 4.2.1). Another example MUX operation, in the return block, would have simpler predicate expressions, since all the branches to return from its predecessors are unconditional:

vout = mux(ve, pe, vf, pf, vg, pg, vh, ph), where:

pe = p(e),   pf = p(f),   pg = p(g),   ph = p(h)

The MUX operation semantics may be defined in a similar fashion to the binary and unary operations, where the MUX operation waits for all inputs to become available before selecting the one whose predicate holds and returning that as its output, while simultaneously clearing all of its inputs, as shown in Figure 4.2b. Note, however, that this introduces unnecessary synchronization into the dataflow execution: a value with a true predicate would have to wait for all other values and their predicates to arrive at MUX before it can be forwarded across MUX to its dataflow consumers. The Petri-net representation of this strict implementation of the MUX operation is shown in Figure 4.9b.

Unlike the operations described thus far, MUX does not need all of its inputs to be available before it can generate an output. A non-strict, or lenient, implementation is possible, where as soon as a value with a corresponding true predicate is available, the MUX operation updates its output places. Again, to preserve well-behavedness, the acknowledgement of inputs would only occur once all inputs have arrived. The use of lenient MUX operations has been shown to improve performance in prior work on generating static-dataflow custom hardware [BAG05]. The operational semantics for the lenient MUX operation as described by Budiu [Bud03] are given by Equations 4.8, 4.9, and 4.10.
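A minimal Python sketch of this lenient behaviour follows; the sent flag mirrors the internal variable used in the rules below, but the class and place names are otherwise invented for illustration. The output is produced as soon as any value arrives with a true predicate, while input acknowledgement is deferred until every input has arrived.

BOTTOM = None
DONT_CARE = "delta"

class LenientMux:
    """Lenient decoded MUX over N (value, predicate) input pairs (Equations 4.8-4.10)."""

    def __init__(self, n):
        self.vals = [BOTTOM] * n     # value input places
        self.preds = [BOTTOM] * n    # predicate input places
        self.out = BOTTOM            # output place
        self.sent = False            # internal flag: output already produced?

    def step(self):
        # Eq 4.8: some input k has arrived with a true predicate -> forward it now.
        if not self.sent and self.out is BOTTOM:
            for v, p in zip(self.vals, self.preds):
                if v is not BOTTOM and p is True:
                    self.out, self.sent = v, True
                    return
        # Eq 4.9: all predicates known to be false -> forward a don't-care.
        if not self.sent and all(p is False for p in self.preds):
            self.out, self.sent = DONT_CARE, True
            return
        # Eq 4.10: once every input has arrived and a value was sent, acknowledge all inputs.
        if self.sent and all(v is not BOTTOM and p is not BOTTOM
                             for v, p in zip(self.vals, self.preds)):
            self.vals = [BOTTOM] * len(self.vals)
            self.preds = [BOTTOM] * len(self.preds)
            self.sent = False

mux = LenientMux(3)
mux.vals[1], mux.preds[1] = 99, True   # the selected input arrives first
mux.step()
print(mux.out)                          # 99 -- forwarded before the other inputs arrive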

\[
o = \mathrm{mux}(v_1, p_1, ..., v_N, p_N) \quad
\frac{!\mathrm{def}(o),\ !\mathrm{def}(\mathit{sent}),\ \exists k.\,(\mathrm{def}(v_k) \land \sigma(p_k) = \mathit{True})}
     {\sigma' = \sigma[o \mapsto \sigma(v_k)][\mathit{sent} \mapsto \mathit{True}]}
\tag{4.8}
\]

\[
o = \mathrm{mux}(v_1, p_1, ..., v_N, p_N) \quad
\frac{\forall k.\ \sigma(p_k) = \mathit{False}}
     {\sigma' = \sigma[o \mapsto \Delta][\mathit{sent} \mapsto \mathit{True}]}
\tag{4.9}
\]

\[
o = \mathrm{mux}(v_1, p_1, ..., v_N, p_N) \quad
\frac{\forall k.\ \mathrm{def}(v_k, p_k),\ \sigma(\mathit{sent}) = \mathit{True}}
     {\sigma' = \sigma[\mathit{sent} \mapsto \bot] \circ \forall k.\ \mathrm{erase}(v_k, p_k)}
\tag{4.10}
\]

In addition to the input and output places, an internal boolean value sent is used to track whether a value has been produced at the outputs. The rule in Equation 4.8 handles the case when one of the input predicate values (pk) holds – the sent variable is

set True, and the corresponding input value (vk) is forwarded to the output o. The rule in Equation 4.9 handles the case when all input predicates are false – in which case, a don’t-care value is forwarded to the output, and sent is again set to True. The rule in Equation 4.10 is used to acknowledge all inputs once they have all arrived, and a value has been sent (i.e. sent ↦ True) by either of the preceding rules.

4.3.2 Compound Operations: Nested Acyclic Subgraphs

The VSFG as defined in Definition 3.1 is a hierarchical graph, with compound operations representing the nested subgraphs of called functions and tail-recursive loops. In order to present the operational semantics for such compound operations, I again borrow from work done on Hierarchical Coloured Petri-nets [Jen91]:

• Substitution Transitions: Within a Hierarchical Petri-net, a compound operation can be represented using a substitution transition, which is a transition that, along with its surrounding edges, may be replaced by a more complex Petri-net. The replacing Petri-net is referred to as a subpage.

• Fused Places: Hierarchical Petri-nets introduce the notion of fusion of places. Specifying that a set of places in a Petri-net are fused means that they all represent a single place, even though they are drawn as individual places.

In our VSFG-S Petri-net, compound operations are therefore equivalent to substitution transitions, with each such transition representing the nested subgraph of the compound operation. Fused places are used to connect the set of input and output places (Pin and Pout, respectively) of the nested subgraph with the function call argument and result places in the parent graph.

As an example, let us assume that the fibonacci function from Figure 3.10a (VSFG shown in Figure 3.13) is the nested subgraph being called from a parent graph. Figure 4.11a shows the substitution transition, while Figure 4.11b shows the subpage being substituted⁵. The fused places between both figures are identified with the same label: the in, iSTATE, and ipred argument places from Figure 4.11a are each fused with their corresponding, similarly named input places (∈ Pin) in the subpage VSFG from Figure 4.11b. Similar fusion of places applies between the result places from Figure 4.11a and the output places (∈ Pout) from Figure 4.11b.

From the perspective of the parent graph, a substitution transition appears like any other transition, and thus may operate in the same fashion: atomically consuming all inputs and producing outputs as described in Figure 4.2b. However, for the nested subgraph being activated, this approach would imply two constraints on performance:

• Call Strictness: All input arguments must be available before computation within the subgraph starts.

• No Loop Parallelism: Given the static-dataflow execution semantics, new inputs may not be injected into the nested subgraph until all of the outputs have been produced from the previous set of inputs. Thus if the subgraph is nested within a loop/tail-recursive function, it can only execute operations from one iteration at a time, restricting concurrency.

⁵ For simplicity, none of the acknowledgement places have been shown.

(a) Substitution transition for the fib(in, iSTATE, ipred) nested VSFG

(b) Petri-net subpage showing the fib(in, iSTATE, ipred) VSFG.

Figure 4.11: The Substitution transition for the nested VSFG subgraph fib(in, iSTATE, ipred) in its parent graph, and its replacement Petri-net subpage.

Call strictness was identified as one of the issues limiting achievable performance for static-dataflow custom hardware by Budiu et al. [BAG05]. To address this, leniency may be introduced in a similar fashion to the MUX operation, without violating well-behavedness, by allowing lenient insertion of arguments into the subgraph, but delaying acknowledgement of the input arguments until all of the corresponding outputs have been retrieved from the subgraph. This approach is discussed in Section 4.3.3, where the nested subgraphs implement loops, and thus are not guaranteed to be well-behaved.

However, so long as the nested subgraph is known to be well-behaved, loop parallelism may be exploited by allowing repeated insertion of input arguments into the subgraph, irrespective of whether a result has been obtained from the previous set of arguments. This is possible since flattening a well-behaved nested subgraph into its well-behaved parent graph produces a well-behaved combined graph⁶.

As mentioned in Section 3.6.2, a means of controlling the degree of speculative execution of nested subgraphs is needed in an eager-evaluated VSFG. To implement such

⁶ Recall from Section 4.2.1: “An acyclic interconnection of graphs is a well-behaved graph if all of its component graphs is a well-behaved graph” [Tra86, AN90].

unrestricted communication with well-behaved nested subgraphs, I introduce two new dataflow operations: signal and wait. The use of signal and wait in a parent graph to communicate with a well-behaved nested subgraph is shown in Figure 4.12. The predicate inputs to the signal operations are used to enforce predicated execution of subgraphs. If speculative execution is desired for a subgraph, the predicate input may be set true, or removed altogether.

The signal operation takes a value input that is to be communicated to the subgraph, as well as a predicate input, but produces no output in the parent graph. Only if the predicate input holds is the value inserted into the subgraph. The wait operation only takes a predicate input from the parent graph, and produces a value output that has either been fetched from the subgraph (if the predicate holds), or is a don’t-care value (if the predicate is false). The necessary values inserted into the nested subgraph are the input state-edge token, as well as an input predicate to the subgraph (i.e. the inPred value for the subgraph, mentioned in Section 3.6), while the necessary value returned from the subgraph is the output state-edge token⁷.

The equivalent Petri-net representation for the dataflow operations in Figure 4.12 is shown in Figure 4.13. Each signal and wait operation is represented using two transitions: one that fires only when the input predicate holds (signalT, waitT), and one that fires otherwise (signalF, waitF)⁸. The signalT transition has a set of output places Ofused(t) that are fused with some of the input places Pin of the nested VSFG subgraph (i.e. Ofused(t) ⊆ Pin), while similarly, each waitT transition has an input place ifused that is fused with one of the output places of the nested VSFG subgraph (i.e. ifused ∈ Pout). Equations 4.11 and 4.12 give the operational semantics of the signal transitions, while Equations 4.13 and 4.14 give the semantics for the wait transitions.

Figure 4.12: The signal and wait operations for communicating with well-behaved nested subgraphs.

⁷ A possible optimization would be to remove the state-edge and predicate inputs in the case that there are no state operations in the nested subgraph.
⁸ Note that with the inclusion of the signal and wait transition pairs, the VSFG-S would no longer be a marked graph. However, this does not introduce non-determinism into the execution of the Petri-net, as the firing of the appropriate transition is determined by the value within the predicate place.

91 Figure 4.13: The equivalent signal and wait Petri-net transitions, showing the fused places used to communicate with a nested subgraph.

A signalT transition inserts values into its output place fused with the nested subgraph (ofused ∈ Pin of the subgraph) only if the predicate value σ(ip) is true (Equation 4.11); otherwise signalF fires instead, simply clearing its inputs and preventing the activation of any transitions in the nested subgraph (Equation 4.12). Similarly, the waitT transition only seeks to retrieve values from its input place fused with the nested subgraph (ifused ∈ Pout of the subgraph) if its input predicate holds (Equation 4.13). If the predicate is false, wait assumes that the corresponding signal operation would not have inserted tokens into the subgraph, and thus it expects no input from the subgraph. In this case, waitF fires, clearing its predicate input, and producing a don’t-care value at its output places in the parent graph (Equation 4.14).
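Before the formal rules (Equations 4.11–4.14 below), here is a minimal Python sketch of a signal/wait pair; the fused place is modelled as a shared one-slot buffer, and all names are illustrative.

BOTTOM = None
DONT_CARE = "delta"

class FusedPlace:
    """A one-slot buffer shared between the parent graph and the nested subgraph."""
    def __init__(self):
        self.value = BOTTOM

def signal(fused, value, pred):
    """Equations 4.11 / 4.12: insert the value into the subgraph only if pred holds;
    the value and predicate inputs are acknowledged in either case."""
    if pred:
        assert fused.value is BOTTOM      # static dataflow: the slot must be empty
        fused.value = value

def wait(fused, pred):
    """Equations 4.13 / 4.14: fetch a value from the subgraph if pred holds,
    otherwise produce a don't-care without touching the subgraph."""
    if pred:
        v, fused.value = fused.value, BOTTOM
        return v
    return DONT_CARE

arg = FusedPlace()         # fused with an input place of the nested subgraph
res = FusedPlace()         # fused with an output place of the nested subgraph

signal(arg, 10, True)      # parent graph passes an argument
res.value = arg.value * 2  # stand-in for the nested subgraph computing its result
print(wait(res, True))     # 20
print(wait(res, False))    # delta -- predicate false, nothing expected from the subgraph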

\[
o_{fused} = \mathrm{signal}(i_v, i_p) \quad
\frac{\sigma(i_p) = \mathit{True},\ \mathrm{def}(i_v),\ !\mathrm{def}(o_{fused})}
     {\sigma' = \sigma[o_{fused} \mapsto \sigma(i_v)] \circ \mathrm{erase}(i_v, i_p)}
\tag{4.11}
\]

\[
o_{fused} = \mathrm{signal}(i_v, i_p) \quad
\frac{\sigma(i_p) = \mathit{False},\ \mathrm{def}(i_v)}
     {\sigma' = \sigma \circ \mathrm{erase}(i_v, i_p)}
\tag{4.12}
\]

\[
o_v = \mathrm{wait}(i_{fused}, i_p) \quad
\frac{\sigma(i_p) = \mathit{True},\ \mathrm{def}(i_{fused}),\ !\mathrm{def}(o_v)}
     {\sigma' = \sigma[o_v \mapsto \sigma(i_{fused})] \circ \mathrm{erase}(i_{fused}, i_p)}
\tag{4.13}
\]

\[
o_v = \mathrm{wait}(i_{fused}, i_p) \quad
\frac{\sigma(i_p) = \mathit{False},\ !\mathrm{def}(o_v)}
     {\sigma' = \sigma[o_v \mapsto \Delta] \circ \mathrm{erase}(i_p)}
\tag{4.14}
\]

4.3.3 Compound Operations: Nested Loop Subgraphs

Loops in the VSFG are represented using tail-recursion as infinite, nested, acyclic subgraphs. However, for implementation as static-dataflow hardware using finite resources, cycles must be re-introduced into the VSFG, by incorporating back edges that allow loop-variant values to be reinserted into the loop-body for each iteration of a loop. The VSFG-S replaces the VSFG’s compound transitions representing the tail-recursive calls with back-edges.

There are two aspects to accomplishing this safely: (1) how to modify the structure of a loop subgraph in order to introduce cycles into the static dataflow, and (2) how to maintain well-behavedness in both the loop subgraph and the parent graph. During both of these stages, it is important to ensure that the inherent advantage of the VSFG in dealing with loops – i.e. facilitating outer-loop parallelism through independent unrolling of loops in a loop nest (Section 3.6.2) – is retained.

Reintroducing back-edges in the loop subgraph

Consider a loop VSFG, such as the one defined by the tail-recursive function from Figure 3.12: forloop(i, a, STATE, pred). Loops in the VSFG representation may be considered to have three distinct parts, as shown in Figure 4.14:

1. The loop-body: A well-behaved acyclic graph representing the loop body, and excluding both the tail-recursive subgraph and the exit MUX operation. The loop-body, like any other VSFG, has input places and output places. However, we distinguish between two different types of output places, as marked in Figure 4.14:

(a) The loop exit outputs (Pexit ⊂ Pout): these are the value, state and predicate outputs from the loop-body that would be used if the exit predicate holds during the current iteration.

(b) The loop iteration outputs (Piteration ⊂ Pout): these are the value, state and predicate outputs that serve as input arguments to the tail-recursive call – essentially the input values for the next iteration of the loop. Thus this set of outputs exactly matches the set of input places of the loop-body.

For each loop-body, Pout = Piteration ∪ Pexit, and Piteration ∩ Pexit = ∅. In addition, the predicates for the loop-exit and iteration outputs will be mutually exclusive, since a loop will always either exit or iterate, but not both.

2. The tail-recursive call that occurs if the exit predicate p generated by the loop-body is false.

3. The exit MUX operation that selects the appropriate set of loop-exit outputs to return to the loop’s parent graph.

Our objective is to replace the tail-recursive call with control-directed dataflow that allows the tail-call function arguments to be reinserted back into the loop-body. To this end, I utilize two new operations, eta and mu, that were defined by Budiu in his implementation of static-dataflow hardware [Bud03]:

• A Conditional Gate Operation, Eta: The eta operation receives a predicate input and a value input. If the predicate input holds, the value input is forwarded to the eta output, and both the value and predicate inputs are cleared/acknowledged. Otherwise, if the predicate input is false, the inputs are acknowledged, but no output is produced. Figure 4.15a shows the dataflow representation of the Gate operation, while Figure 4.15b shows the equivalent Petri-net representation. The operational semantics for eta, as described by Budiu [Bud03], are given in Equations 4.15 and 4.16. Note that eta is not a well-behaved operation, as it is not

Figure 4.14: The three components of a loop represented using a VSFG: (1) a well-behaved, acyclic loop body, (2) the tail recursive call, and (3) the exit MUX.

(a) The eta dataflow operation. (b) Petri-net representation of eta.

Figure 4.15: The eta operation, for implementing control-directed dataflow in the VSFG-S.

guaranteed to produce an output each time it consumes its inputs.

\[
o = \mathrm{eta}(i_v, i_p) \quad
\frac{\sigma(i_p) = \mathit{True},\ \mathrm{def}(i_v),\ !\mathrm{def}(o)}
     {\sigma' = \sigma[o \mapsto \sigma(i_v)] \circ \mathrm{erase}(i_v, i_p)}
\tag{4.15}
\]

\[
o = \mathrm{eta}(i_v, i_p) \quad
\frac{\sigma(i_p) = \mathit{False},\ \mathrm{def}(i_v)}
     {\sigma' = \sigma \circ \mathrm{erase}(i_v, i_p)}
\tag{4.16}
\]

• A Non-deterministic Merge Operation, Mu: The mu operation is essentially a non-deterministic merge – whereas the merge operation discussed in Section 4.2.1 used a predicate input to control from which of its inputs a token is selected, there is no predicate input for the mu operation: an input token is selected and placed at the outputs immediately upon arrival. Figure 4.16a shows the dataflow representation of mu, and Figure 4.16b shows the equivalent Petri-net. The Petri-net representation of mu essentially has an input place with multiple incident edges⁹. When the mu transition fires, it simply copies a value from its

⁹ As with the signal and wait operations, note that with the inclusion of either an eta or a mu operation, the VSFG-S will no longer be a Marked Graph. The eta Petri-net still exhibits deterministic execution based on the input predicate value, while the mu can exhibit non-deterministic behavior. Non-determinism in dataflow execution due to the mu operator is prevented through the use of the inGate and outGate operations described shortly.

(a) The mu dataflow operation.

(b) Petri-net representation of mu.

Figure 4.16: The mu operation, for implementing non-deterministic merge in the VSFG-S.

input place to each of its output places. The operational semantics for mu, adapted from Budiu [Bud03], are given in Equation 4.17. Note that mu is also not well-behaved, as it produces an output by consuming inputs from only one of its input edges.

\[
o = \mathrm{mu}(i_1, i_2, ..., i_N) \quad
\frac{\exists k.\ \mathrm{def}(i_k),\ !\mathrm{def}(o)}
     {\sigma' = \sigma[o \mapsto \sigma(i_k)] \circ \mathrm{erase}(i_k)}
\tag{4.17}
\]

The eta and mu operations are utilized to replace the tail-recursive call in the VSFG with loop-back edges in the VSFG-S, as shown in Figure 4.17. The well-behaved, acyclic loop-body remains unchanged from the VSFG. However, both the tail-recursive call and the exit MUX are replaced with eta operations – one eta for the “exit outputs” and another for the “iteration outputs” of the loop-body. The predicates driving these two eta operations are complements of each other; thus together the two eta operations implement functionality similar to the switch operation from Figure 4.3b (described in Section 4.2.1): the ‘exit-outputs’ eta will only produce an output when the loop-exit predicate holds, otherwise it will produce nothing; conversely, the ‘iteration-outputs’ eta will drive the loop-back edges only when the complement of the loop-exit predicate holds.

There will be a loop-back edge for each of the input places of the loop-body VSFG, including the inPred predicate input. If the loop-exit predicate is false, the ‘iteration-outputs’ eta will place the next-iteration input values (including inPred = True) on the loop-back edges. A mu operation is used to forward input values entering the loop from either the loop’s parent graph (marked ‘parent-graph inputs’ in Figure 4.17), or from the loop-back edges. Given the non-deterministic nature of mu, it is essential to impose a constraint within the parent graph that prevents repeated insertion of values into the loop subgraph until the loop has terminated. This will be discussed in the following subsection.

In order to support loop unrolling, the tail-recursive VSFG may have its recursive subgraph flattened into the loop body any number of times, before loop-back edges need to be introduced.

Figure 4.17: Reintroducing back-edges into the VSFG-S by replacing the MUX and tail-call subgraph with eta and mu operations.

Figure 4.18 shows the VSFG from Figure 4.14 with the tail-recursive loop call flattened twice into the main loop body. Once the desired degree of unrolling has been achieved, loop-back edges may be introduced in a similar fashion to Figure 4.17. Figure 4.19 shows the thrice-unrolled version of the loop from Figure 4.17. As before, the ‘exit-outputs’ of each copy of the loop-body will have an eta controlled by its respective exit predicate. The ‘iteration-outputs’ of the last copy of the loop body will similarly have an eta controlled by the complement of its exit predicate, with the output values of the eta connected to the input mu operation. As there are multiple ‘exit-output’ eta operations in the flattened/unrolled VSFG-S, another mu operation is needed at the outputs of the loop in order to non-deterministically select the correct loop-exit outputs from one of the exit etas.

The three ‘exit-output’ eta operations will be mutually exclusive, since their predicate inputs are mutually exclusive. The output etas, together with the output mu, essentially implement a decoded multiplexer. The key difference between this form of decoded multiplexer and the MUX described in Section 4.3.1 is that MUX synchronizes all of its inputs before clearing/acknowledging them, whereas here, each individual eta will acknowledge its own inputs.
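To illustrate how eta and mu together close the loop of Figure 4.17, the following toy Python simulation iterates a simple counter loop; the loop body, predicate, and names are invented purely for illustration, and the acknowledgement places of the real Petri-net are omitted.

BOTTOM = None

def eta(value, pred):
    """Equations 4.15 / 4.16: forward the value only if the predicate holds."""
    return value if pred else BOTTOM

def mu(parent_input, loopback_input):
    """Equation 4.17: non-deterministic merge -- forward whichever input is present.
    (The parent graph must guarantee only one is ever present at a time.)"""
    return parent_input if parent_input is not BOTTOM else loopback_input

# Toy loop: i starts at 0 and iterates while i < 5 (the loop body adds 1 each iteration).
parent_in = 0            # value injected once from the parent graph
loopback = BOTTOM        # loop-back edge, initially empty
exit_value = BOTTOM

while exit_value is BOTTOM:
    i = mu(parent_in, loopback)              # merge parent input with loop-back edge
    parent_in = BOTTOM                       # parent value is consumed exactly once
    next_i = i + 1                           # the loop body
    exit_pred = not (next_i < 5)             # loop-exit predicate computed by the body
    exit_value = eta(next_i, exit_pred)      # 'exit-outputs' eta
    loopback = eta(next_i, not exit_pred)    # 'iteration-outputs' eta (complement)

print(exit_value)   # 5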

Enforcing well-behavedness in the parent graph

As observed earlier, a mechanism is needed to prevent injection of additional values from a parent graph into this form of loop subgraph, due to the presence of a non-deterministic mu operation at the loop input.

As mentioned in Section 4.3.2, a simple means of achieving this would be to treat any compound transition representing a loop subgraph as an atomic transition within the parent graph, as described in Figure 4.2b. Recall from Section 4.3.2 that this would imply two constraints on performance: (1) call strictness, and (2) no re-injection of values until previous inputs are acknowledged.

Figure 4.18: Loop structure from Figure 4.14, with the tail-call flattened twice.

For well-behaved subgraphs, both of these constraints were relaxed through the use of our signal and wait operations. Due to the presence of the mu non-deterministic merge operation, VSFG-S loop subgraphs are not well-behaved. However, so long as re-insertion of values into a running loop subgraph is prevented, and the loop terminates, a loop subgraph is guaranteed to be self-cleaning, thanks to the eta operations at all of the outputs of the well-behaved acyclic loop-body VSFGs. Thus loop subgraphs need not be treated as strictly atomic operations – lenient insertion of operands is permissible, so long as subsequent operands are only passed into the subgraph after all previous outputs from it have been retrieved, thereby guaranteeing an empty loop subgraph. To achieve this, variants of signal and wait from Section 4.3.2 have been developed, called inGate and outGate.

Figure 4.20 shows the Petri-net transitions for inGate and outGate operations used within a parent graph to insert operands into a loop subgraph. The figure assumes a structure analogous to Figure 4.13, where the loop subgraph has one value input in addition to the mandatory state-edge and inPred predicate inputs, and produces one value and one state output. The inGate operation is analogous to signal, in that if its input predicate holds, it places a value token from the parent graph into a nested subgraph via a fused place. Unlike signal, however, inGate does not acknowledge its predicate or value inputs in the parent graph. Similarly, outGate is analogous to the wait operation, as it retrieves a value from a nested subgraph via a fused place if its input predicate holds. Unlike wait, outGate does not acknowledge its predicate input from the parent graph, though like wait, it does acknowledge values received from the nested-subgraph fused place.

Figure 4.19: The equivalent VSFG-S loop with back-edges, for the VSFG shown in Figure 4.18.

Furthermore, each outGate also sets a local variable (rcvd) each time it fires, irrespective of the input predicate value. Once all the outGate rcvd variables have been set, it means either that all output values have been received from the nested subgraph (if the parent predicate evaluated to true), or that no output values were expected from the subgraph (if the parent predicate evaluated to false). In either case, the acknowledge inputs transition may fire, finally acknowledging all of the input values and predicates passed from the parent graph to the nested subgraph. In this way, the combination of inGate and outGate operations allows for lenient activation of loop subgraphs, while maintaining well-behavedness by preventing re-insertion of value tokens until the loop subgraph has terminated.

f1, f2, ..., fN = inGate(p1, v1, ..., pN, vN):

    ∃n. σ(pn) = True ∧ def(vn) ∧ !def(fn)   ⟹   σ′ = σ[fn ↦ σ(vn)]                              (4.18)

    ∀n. def(pn, vn), ∀m. (σ(rcvdm) = True)   ⟹   σ′ = σ[∀m. rcvdm ↦ ⊥] ∘ ∀n. erase(vn, pn)       (4.19)

Figure 4.20: The use of inGate and outGate Petri-net transitions, along with an additional transition that acknowledges all input places in the parent graph, but only after all outputs have been retrieved from the subgraph by the outGate transitions.

o1, ..., oM = outGate(p1, f′1, ..., pM, f′M):

    ∃m. σ(pm) = True ∧ def(f′m) ∧ !def(om)   ⟹   σ′ = σ[om ↦ σ(f′m)][rcvdm ↦ True] ∘ erase(f′m)   (4.20)

    ∃m. σ(pm) = False ∧ !def(om)   ⟹   σ′ = σ[om ↦ ∆][rcvdm ↦ True]                               (4.21)

Equations 4.18 through 4.21 describe the operational semantics for the inGate and outGate operations, under the assumption that there are N input arguments (including state and predicate inputs) to, and M output values (including state) from, the nested subgraph. Equation 4.18 describes the operational semantics for a set of inGate operations when the predicate inputs pi evaluate to true. Unlike signal, none of the inputs are cleared upon successful insertion of a value into a fused input place in the nested subgraph (fi ∈ Pin of the nested subgraph). Equation 4.20 describes the semantics for the corresponding outGate operations if the predicate inputs pi evaluate to true. Subgraph output values are retrieved from the fused output places (f′i ∈ Pout of the nested subgraph) if the predicates pi hold. Note that the fused places f′i are acknowledged by outGate, but the predicate places pi are not, as they reside in the parent graph. Note also that each outGate operation sets its local variable rcvd, even if the input predicate is false (Equation 4.21). Equation 4.19 ultimately acknowledges all inputs once all have arrived, and all outGate operations have set their respective rcvd flags.

The combination of inGate and outGate with the eta operations within the loop behaves very similarly to the well-behaved loop schema discussed previously in Section 4.2.1 (Figure 4.5b). Thanks to the isolation provided by the inGate and outGate operations when communicating with loop subgraphs, the parent graph appears acyclic at its level in the hierarchy. This allows us to preserve the key advantage of the VSFG, i.e. the ability to perform loop unrolling within a level of hierarchy, independently of other levels. For instance, the well-behaved loop-body of the unrolled loop shown in Figure 4.17

may itself contain a nested loop. So long as this inner loop is isolated from its parent graph using inGate and outGate operations, unrolling the parent loop, as in Figure 4.19, may be carried out independently of the inner loop.
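As a purely illustrative software analogue of Equations 4.18 through 4.21 (the class and method names below are mine, and nothing here is part of the toolchain), the gating protocol can be sketched in Python as follows: operands are inserted leniently, but the parent's inputs are only acknowledged once every outGate has set its rcvd flag.

    DONT_CARE = object()   # stands in for the 'don't care' output (Delta)

    class LoopGate:
        """Guards one loop subgraph, as in Figure 4.20 (illustrative only)."""

        def __init__(self, num_outputs):
            self.rcvd = [False] * num_outputs   # one flag per outGate
            self.parent_inputs = {}             # parent tokens not yet acknowledged

        def in_gate(self, name, predicate, value, subgraph_inputs):
            # Eq. 4.18: place a token into the fused input place, but do NOT
            # acknowledge the predicate/value inputs in the parent graph.
            self.parent_inputs[name] = (predicate, value)
            if predicate:
                subgraph_inputs[name] = value

        def out_gate(self, index, predicate, subgraph_outputs, name):
            # Eqs. 4.20/4.21: rcvd is set whether or not the predicate holds.
            self.rcvd[index] = True
            return subgraph_outputs.pop(name) if predicate else DONT_CARE

        def try_acknowledge(self):
            # Eq. 4.19: only once every outGate has fired may the parent's
            # inputs be acknowledged, admitting the next set of operands.
            if all(self.rcvd):
                self.parent_inputs.clear()
                self.rcvd = [False] * len(self.rcvd)
                return True
            return False

    gate = LoopGate(num_outputs=1)
    subgraph_in, subgraph_out = {}, {}
    gate.in_gate("x", True, 10, subgraph_in)
    subgraph_out["result"] = subgraph_in["x"] * 2            # the loop body runs here
    print(gate.out_gate(0, True, subgraph_out, "result"))    # -> 20
    print(gate.try_acknowledge())                            # -> True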

4.4 Comparison with Existing Dataflow Models

4.4.1 Comparison with Pegasus

The VSFG-S IR, and the HLS toolchain for its evaluation described in Chapter 5, share many similarities with previous work undertaken by Budiu et al. on the Pegasus IR and the CASH compiler toolchain developed at Carnegie Mellon University [BVCG04, Bud03, MCC+06]. Both compile general-purpose imperative code to a dataflow intermediate representation capable of being implemented directly as static-dataflow spatial computation. As a result, both the Pegasus and the VSFG-S IR utilize many of the same dataflow operations: value operations, load, store, MUX, eta, and mu.

The work presented in this dissertation is intended to improve upon the Pegasus IR and compiler work specifically by overcoming complex control-flow in order to expose and exploit greater ILP in spatial hardware from sequential code. Whereas Budiu’s Pegasus IR is a variant of the Control Flow Graph, composed of acyclic hyperblocks, the VSFG-S is based upon the Value State Dependence Graph [Law07], which elides all explicit control flow, representing only true value and state dependences. This choice is expected to provide a significant performance advantage over the Pegasus IR, due to the ability of the VSFG-S to perform much broader control-dependence analysis, to independently unroll/pipeline loops in a loop-nest, and to exploit multiple flows of control, as described in Chapter 3. In addition, the following key advantages of the VSFG-S IR over the Pegasus IR have also been identified:

• No Call Strictness: Whereas Pegasus implements a strict Call operation that collects all arguments before initiating called function execution, the VSFG-S enables lenient communication of operands between parent graphs and their nested subgraphs through the use of the signal, wait, inGate, and outGate operations.
There are two interrelated reasons for call-strictness in the Pegasus IR. Firstly, instead of inlining/flattening all function calls as the VSFG-S currently does, Pegasus incorporates a call-stack mechanism to support the implementation of true function calls and general recursion. Secondly, Pegasus is based on the CDFG, and must therefore maintain the abstract notion of a single flow of control, even though its implemented dataflow graph may be executing operations out-of-order, from across multiple hyperblocks. Given the last-in-first-out nature of the call-stack, call-strictness ensures that even in the presence of out-of-order dataflow execution, values are pushed to and popped from the call-stack in the correct order. Further discussion of this is presented in [Bud03].
On the other hand, the key strength of the VSFG-S is that it abandons the single-flow-of-control constraint. Although the VSFG-S doesn’t currently support true function calls or general recursion, a means of supporting both features in the future, without requiring a single-flow-of-control constraint or a sequentializing call-stack, is discussed in Section 4.5.

• Reduced Control Overhead: Typically, the dataflow execution model incurs a higher resource overhead than the conventional Von-Neumann model due to the need to incorporate a large number of control-instructions (like switch, merge, eta, and mu). One such operation is required for each operand that is subject to control-directed dataflow, whereas conventional code simply uses a single branch instruction to explicitly alter control-flow, independent of the number of live variables.
Previous work compiling the Id Nouveau dataflow language found a 2–3× instruction count overhead due to conditional instructions, compared to equivalent code written in C [HCAA93]. The Wavescalar IR reveals a similar 2–4× overhead in instruction counts [PPM+06], while Budiu et al. report a 20–80% performance penalty due to these instructions in the Pegasus IR [BAG05]. Both Wavescalar and Pegasus compile to their respective dataflow representations from imperative code, and must introduce a control-instruction for each live value at the entry and exit of each basic block (or hyperblock) in the original code CFG.
The VSFG, on the other hand, elides most of the original control-flow. Consequently, the VSFG-S utilizes its own control-directed dataflow operations (i.e. signal and wait, inGate and outGate, eta and mu) only at the entry to and exit from subgraphs, instead of at each basic-block boundary. Within each subgraph, the VSFG-S only employs MUX operations at value join points to replace φ-nodes. As a result, VSFG-S based hardware should exhibit a much smaller instruction-count and performance overhead due to control-instructions. An indirect measurement of this is provided by the fact that VSFG-S based dataflow hardware exhibits only an average of 15% resource requirement overhead compared to statically-scheduled CFG-based custom hardware10 when synthesized to an FPGA (see results in Section 6.2.3).

Nevertheless, the work on Pegasus and the CASH compiler by Budiu et al. does retain certain advantages, such as the support for general recursion and true function calls, as mentioned above. Furthermore, the CASH toolchain and its generated hardware incorporate both static and dynamic memory disambiguation to exploit memory-level parallelism, whereas the VSFG toolchain currently does not. Finally, the HLS toolchain developed for this dissertation to evaluate the VSFG-S IR does not currently provide full support for all imperative language features, whereas the CASH compiler supports compilation of virtually all of ANSI C to custom hardware [BAG05, BVCG04]. The incorporation of support for these features was left out of the current implementation primarily due to time constraints. A more detailed summary of these limitations and the reasons for them is provided in Section 5.4.

4.4.2 Relation to Original Work on Dataflow Computing

The design of the VSFG intermediate representation relies heavily on concepts developed during the early work on dataflow computation in the 1970s and 1980s [AC86, Tra86, AN90]. The key distinction that can be made between the early work on dataflow and the VSFG IR is in the handling of control-directed dataflow:

10which, being statically scheduled, would not incur the same control-instruction overhead as the dataflow languages.

1. Speculative execution via MUX: Early work on dataflow computing utilized the conditional schema from Figure 4.5a to implement control-directed dataflow [Tra86]. The VSFG instead relies extensively on the well-behaved MUX operator to support speculative execution. Due to this, predicated operations like load and store must produce a don’t care output even when their input predicate is false (i.e. they must be well-behaved independent of the conditional schema).

2. Nested Subgraphs for representing function calls: Early dataflow literature establishes the notion of ‘user-defined functions’, which is analogous to the VSFG’s representation of nested subgraphs [AC86]. However, instead of utilizing the conditional schema, the VSFG-S incorporates the predicated signal and wait operations to control the execution of such user-defined functions.

3. Loops represented as atomic subgraphs: Instead of the loop schema described by Traub (Figure 4.5b) [Tra86], the VSFG-S extracts loops into their own nested subgraphs that execute atomically from the perspective of the parent graph, thanks to the use of the inGate and outGate operations.

(a) Use of signal and wait to predicate the execution of a nested subgraph. (b) Functionally equivalent conditional schema to Figure 4.21a. The ∆ operation produces don’t care as output.

Figure 4.21: A predicated subgraph implemented using signal and wait in the VSFG-S, and the functionally equivalent conditional schema described by Traub [Tra86].

However, these distinctions are merely abstractions that are useful for our purpose: compiling imperative programming languages to an intermediate representation that elides control-flow, and allows for aggressive speculation, control-dependence analysis and exploitation of multiple flows of control. For instance, the MUX operator is simply another primitive operation of the type shown in Figure 4.3a, defined to enable speculative execution. Furthermore, the predication of subgraphs using signal and wait (Figure 4.21a) is functionally equivalent to a conditional schema where the false path produces a don’t care output, while the true path implements a user-defined function equivalent to the nested subgraph being predicated (Figure 4.21b). This is shown in Figure 4.21. Similarly, the use of eta and mu to implement loops, while isolating the running loop from its parent graph using inGate and outGate operations (Figure 4.22a), is functionally equivalent to having a loop schema enclosed within the conditional schema from Figure 4.21b, as shown in Figure 4.22b.

(a) Use of inGate and outGate to atomically execute a nested loop. (b) Functionally equivalent combination of conditional and loop schemas to Figure 4.22a. The ∆ operation produces don’t care as output.

Figure 4.22: A nested loop subgraph implemented using inGate and outGate in the VSFG-S, and the functionally equivalent combination of conditional and loop schemas described by Traub [Tra86].

In the same way as Budiu constructs the CDFG-based Pegasus IR from variants of the established basic dataflow operations [AC86, Tra86, AN90] in order to represent imperative code [Bud03], the VSFG makes use of its own variants, for the same purpose, albeit utilizing a different organization, in order to overcome complex control-flow and expose ILP from control-intensive code.

4.5 Limitations of Static Dataflow Execution

The VSFG-S intermediate representation was developed with the static dataflow execution model in mind. This model was chosen for its potential for very high energy efficiency, as demonstrated in prior work [BVCG04, MCC+06], but it also allows us to evaluate the VSFG-S in a simpler manner, by implementing a high-level synthesis tool to generate static-dataflow custom hardware instead of first having to develop, implement and tune a coarse-grained reconfigurable architecture like Wavescalar [SSM+07].

Unfortunately, the static-dataflow execution model has one key limitation – it cannot dynamically invoke user-defined function calls [AC86]. While it can represent function calls as nested subgraphs, these subgraphs must be flattened/expanded into the parent graph at compile-time. Thus in the VSFG-S, each nested subgraph representing a function call would be implemented by instantiating a unique copy of the VSFG graph of that function at that location. This implies that in a pure static-dataflow execution model, there would be no support for standard imperative language features like general recursion, since the number of times a recursive function’s subgraph is invoked is potentially large, and typically determined at runtime11. The development of the dynamic-dataflow model was motivated in part by

11Note that we already revert our acyclic, tail-recursive representation of loops back to a finite graph by reintroducing back-edges for the same reasons. However, if the number of general-recursive calls can be determined at compile-time, and the design is not resource constrained, there is no reason that recursive calls couldn’t also be flattened into the original graph the desired number of times, in the same manner as loops are currently unrolled.

the need to overcome this key limitation of static-dataflow computing, thereby making the dataflow model more suitable for general-purpose computation [AC86]. Unlike static dataflow, the dynamic dataflow approach associates a tag with each value or token in the graph, signifying its dynamic instance. Dynamic dataflow is (a) not constrained by the limitation of having only one token per dataflow edge, and (b) allows operations to consume input tokens out-of-order, by performing tag-matching. The token tagging and matching mechanisms allow a single instance of a function subgraph to be utilized to execute multiple independent function calls. General recursion can therefore also be supported by dynamic dataflow architectures, given sufficient memory space for the storage of tagged tokens along each input arc to a recursive function. Note that this constraint is analogous to having sufficient call-stack space for a recursive program implemented on a conventional Von-Neumann architecture. Further details and discussion about both the static and dynamic dataflow execution models, and their implementation, may be found in earlier work by Arvind, Traub, and Gao et al. [AN90, AC86, GGP92, Tra86].

This key limitation of static-dataflow execution must be addressed if it is to be considered suitable for compiling unrestricted imperative high-level language code to spatial architectures. Previous work on compiling imperative code to static-dataflow hardware was undertaken by Budiu [Bud03], who incorporates a dynamic call-stack into his static dataflow model in order to implement function calls and support general recursive execution. The caller graph – i.e. the graph containing a function call in Budiu’s Pegasus representation – must (1) allocate a stack frame, and (2) push all live values in the graph into this frame, before passing the arguments to the function call. Once the function call returns its outputs, all live values must be popped from the stack frame and restored into the static-dataflow graph, after which the stack frame is freed, and execution proceeds as normal. Recursive functions adhere to the same procedure, allocating stack frames sequentially, and in control-flow order, in a similar fashion to the conventional imperative execution model.

Budiu’s hybridization of the static-dataflow and imperative models is made possible by the fact that his Pegasus intermediate representation is based on the Control Flow Graph, and thus must strictly adhere to a single, sequential flow of control, at least at the coarse-grained, hyperblock level: while execution within a basic or hyper-block may occur concurrently in dataflow order, flow of control between blocks, and thus between caller and callee functions, must occur sequentially. Thus the last-in-first-out nature of the call-stack may still be utilized to implement function calls. However, as execution in the VSFG is not constrained to a sequential or even a single flow of control, this stack-based approach cannot directly be ported to the VSFG, as its last-in-first-out nature does not guarantee retrieval of the correct operands in the absence of a single flow of control. Instead, I propose hybridizing the static-dataflow approach with the tagged-token dynamic dataflow approach, in order to add support for function calls and general recursion. This proposed approach is neither implemented, validated, nor evaluated in this dissertation.
These tasks are left for future work, while I focus on compiling non-recursive applications in this dissertation. However, a brief description is provided below, for completeness.

(a) Modifying subgraph communication to support true, non-inlined function calls. (b) Implementing coarse-grained dynamic dataflow: incorporating token-storage and tag-matching logic at the subgraph boundary.

Figure 4.23: Proposed hybridization of static and dynamic dataflow, to retain the simplicity and efficiency of the former, while using the latter to support true function calls.

The goal of combining tagged-token dynamic dataflow with static dataflow would be to retain the generality and dynamic function call features of the former, with the efficiency and simplicity of the latter. To this end, the VSFG-S may be extended in the future to incorporate tag-matching at a coarse-grained level: instead of incorporating tag-matching support for each operation in the VSFG, it may only be utilized at subgraph boundaries. For instance, instead of being flattened/inlined at each call site, a function subgraph may be implemented as shown in Figure 4.23. For each call to a subgraph, the parent graph would dynamically generate a unique tag value. This tag would be passed along with each input argument to the function. The parent graph would also pass an additional continuation or return address to the subgraph, as shown in Figure 4.23a. The signal operations in the parent graph would have fused places with corresponding wait operations at the entry to the function subgraph, as shown in Figure 4.23b. These wait operations would receive the incoming operands, along with their tags, from multiple potential call-sites, and insert them into the ‘token tag store and matching logic’, which may be implemented using content-addressable memory. Once a complete set of inputs to the subgraph has been received at this tag-matching logic (i.e. all inputs with matching tags and continuation have arrived), they may be issued to the well-behaved function subgraph, which would use signal operations at its exit edges to send output values back to corresponding wait operations in the correct parent graph, specified by the continuation address.

The implementation of general recursion using such tagged tokens may be performed as shown in Figure 4.24. The VSFG of a recursive function, as shown in Figure 4.24a, cannot be directly implemented as a VSFG-S suitable for static-dataflow execution, with dynamic-dataflow style tag-matching at its inputs. In order to implement a recursive function in our hybrid dataflow execution model, we must split its VSFG into two non-recursive, well-behaved regions: one that produces input arguments for the recursive self-call, and a second that consumes outputs from the returning recursive call, as shown in Figure 4.24b.

(a) A VSFG for a recursive function, containing a call to itself. (b) Implementing recursion: split the function into two halves, and add tag-matching logic in the middle, after the recursive call.

Figure 4.24: Proposed hybridization of static and dynamic dataflow, showing how to implement a general recursive function.

All values from the first half of the VSFG that are not arguments to the recursive call are live values that must be held temporarily, until the recursive call returns. These values are therefore inserted into another instance of tag-matching, token-store logic at the input to the second half of the original VSFG, where they await the return of outputs from the current instance of the recursive call. Once the corresponding recursive call returns, the set of live values, along with the output from the recursive call, is inserted into the second well-behaved half of the VSFG, after which the function returns its outputs to the original calling parent graph.

Instead of incurring the cost of content-addressable logic for token storage and tag-matching for each operation in the VSFG, as is the case with the Wavescalar architecture, this hybrid approach should amortize this cost over a much larger set of dataflow operations by restricting tag-matching and token storage to the boundaries of VSFG-S subgraphs. Additionally, unlike the stack-based approach proposed by Budiu for his Pegasus IR [Bud03], this proposal does not necessitate constraining execution to a single flow of control.

Additional work would be needed to fully flesh out the details of how this hybrid approach may be implemented in custom or spatial hardware. For instance, a compiler might need to assess the relative overheads of inlining all calls to a function over implementing a

single body of the function with the associated tag-matching logic. Furthermore, how do we determine the best graph-cut when splitting a recursive function into its constituent halves? Finally, how suitable is the proposed approach for functions that make multiple recursive self-calls? In order to fully support general imperative languages for compilation to spatial architectures via the VSFG intermediate representation, validation, implementation and testing of this hybridization proposal is a high priority for future work beyond this dissertation.
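As a rough, purely illustrative sketch of the coarse-grained tag matching proposed in Figure 4.23b – nothing below is implemented, validated, or evaluated in this dissertation, and the names TokenStore and receive are hypothetical – a software analogue of the token store and matching logic might behave as follows: a (tag, continuation) pair identifies one dynamic call, and the call is issued to the single shared function subgraph only once a complete set of matching arguments has arrived.

    class TokenStore:
        def __init__(self, num_args):
            self.num_args = num_args
            self.pending = {}        # (tag, continuation) -> {arg_index: value}

        def receive(self, tag, continuation, arg_index, value):
            """Called by a wait operation at the subgraph entry for each token."""
            key = (tag, continuation)
            args = self.pending.setdefault(key, {})
            args[arg_index] = value
            if len(args) == self.num_args:       # complete set: issue the call
                del self.pending[key]
                return key, [args[i] for i in range(self.num_args)]
            return None                          # still awaiting matching tokens

    store = TokenStore(num_args=2)
    print(store.receive(tag=7, continuation="callerA", arg_index=0, value=3))  # None
    print(store.receive(tag=9, continuation="callerB", arg_index=1, value=5))  # None
    print(store.receive(tag=7, continuation="callerA", arg_index=1, value=4))
    # -> ((7, 'callerA'), [3, 4]): issued to the function subgraph; its outputs
    #    are later returned to the wait operations of the caller identified by
    #    the continuation.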

4.6 Summary

In order to evaluate the potential improvements in ILP that the VSFG may provide for sequential, imperative applications implemented on spatial architectures, this chapter made the case for developing a high-level synthesis toolchain that compiles a VSFG-based IR directly to static-dataflow custom hardware. I described in detail the structure and semantics of the VSFG-S, discussing how loops and nested subgraphs may be represented for implementation as custom hardware. While a discussion of how to support true function calls and general recursion by incorporating some dynamic dataflow features is presented, implementation and validation of this approach is left for future work – currently the VSFG-S based HLS toolchain in-lines all function calls, and does not include support for recursion.

A brief comparison was also presented with the original dataflow computational model and operations, demonstrating that many of the new operations in the VSFG-S are essentially abstractions useful for eliding control-flow, and compiling imperative code to VSDG-like dataflow graphs. The presented implementations of predicated nested subgraphs and atomic loop subgraphs may easily be imitated using combinations of the original conditional and loop schemas presented in the early dataflow literature [AN90, AC86, Tra86]. Chapter 5 will now describe a high-level synthesis toolchain that compiles imperative code to the VSFG-S and then generates custom dataflow hardware for evaluation on an FPGA platform.

CHAPTER 5

A VSFG-Based High-Level Synthesis Toolchain

Chapter 4 presented the definition and operational semantics of the VSFG-S intermediate representation. The goal of the VSFG-S is to facilitate the implementation of sequential, imperative programs written in conventional high-level languages onto spatial computation substrates using the static-dataflow execution model. The nature of the VSFG enables compile-time exposition of ILP from control-intensive code (as discussed in Chapter 3), while the static-dataflow model enables dynamic execution scheduling and energy-efficient exploitation of this ILP [BVCG04, MCC+06]. As highlighted in Chapter 4, development of a high-level synthesis tool that generates custom hardware descriptions from the VSFG-S IR is an effective and time-conserving way of evaluating the potential of this IR, without having to target a specific type of programmable spatial architecture like the ones described in Chapter 2. This chapter describes the design of this toolchain, after which Chapter 6 presents the evaluation results for the VSFG-S based hardware generated using this toolchain.

5.1 The Toolchain

I utilize the LLVM compiler infrastructure to implement this VSFG-based high-level synthesis toolchain. The toolchain takes the LLVM IR as input, generating VSFG-based static-dataflow hardware described using the Bluespec Hardware Description Language. The choice of the LLVM compiler infrastructure was motivated by several factors:

• Ease-of-use: The LLVM compiler is highly modular and well documented, making it a very convenient tool for academic research. Front-ends for many popular high-level languages are available that convert the input language code into the LLVM IR. All code analysis, optimization and transformation passes are implemented only on this LLVM IR, and thus can be developed independently of the input language. Furthermore, compiler back-end code generation and optimization passes can also be developed independently of the input language. An LLVM back-end would take as input the generic LLVM IR representation of the input code, which may have undergone various back-end-agnostic analysis and optimization passes, and generate output assembly or binary code targeting different architectures. Back-ends for

Figure 5.1: The language transformation steps involved in VSFG-based high-level synthesis. The representations shaded grey are the ones generated by the toolchain. It takes LLVM IR as input, and produces Bluespec SystemVerilog as output, which may subsequently be synthesized via Verilog HDL to either standard-cell logic or an FPGA.

different computational models (such as custom hardware [CCA+11]) or even other languages (for source-to-source translation) can also be developed.

• Robust, established compiler: As an established compiler undergoing rapid development and enhancement, LLVM provides front-ends for multiple high-level languages. Assuming that a VSFG-based spatial compiler is able to translate all of the LLVM IR to the VSFG IR, it would in theory be able to directly support the compilation of all high-level languages that have LLVM front-ends to spatial computation architectures. Furthermore, LLVM provides as standard many well-tuned code analysis, transformation and optimization passes that may be applied to the input code in its LLVM representation, before generating the code for a targeted back-end.

• Static Single Assignment IR: The LLVM IR is a simple, strongly-typed, static single assignment (SSA) language. The SSA nature of LLVM simplifies dataflow analysis and optimization considerably, and the structure of a program represented using LLVM IR is essentially that of a Control-Data Flow Graph. The task of the toolchain is therefore primarily one of translating from this CDFG to the VSFG-S IR.

• Simple, RISC-like instruction set: The LLVM instruction set is composed of simple operations, and is analogous to that of a RISC processor [LLV]. This makes it very easy to convert each non-control-flow LLVM instruction into an equivalent dataflow operation in hardware1.

Figure 5.1 presents an overview of the language transformation steps in the VSFG-based high-level synthesis toolchain, which generates and/or operates on the language representations shown within the grey region. The toolchain is essentially a back-end for LLVM, taking LLVM IR as input, and producing Bluespec SystemVerilog HDL output describing the application as VSFG-S IR based static dataflow hardware. This Bluespec HDL description is then compiled using the Bluespec bsc compiler [Nik04, Nik08] to generate Verilog HDL, that may then be further synthesized for implementation using a standard-cell library, or on an FPGA. There are two key reasons for selecting Bluespec HDL as the output language from the toolchain, instead of directly generating Verilog HDL:

1Note that in addition to scalar integer operations, the LLVM instruction-set also supports floating-point and vector arithmetic-logic operations. However, the described toolchain currently only supports the integer subset of the LLVM instruction-set.

• Bluespec provides a higher level of abstraction than Verilog, allowing for faster design experimentation and prototyping of hardware implementations.

• Bluespec is well suited to expressing static-dataflow execution. Instead of representing hardware at the register-transfer or structural level, Bluespec utilizes sets of guarded atomic actions, or rules, to express the intended functionality for the hardware to be synthesized. The use of guarded atomic actions makes Bluespec highly suited to the implementation of dataflow-style dynamic execution scheduling. In particular, each transition in the Petri-net style execution semantics described in Chapter 4 for the VSFG-S can be easily mapped to an atomic rule in Bluespec HDL.

As shown in Figure 5.1, the LLVM IR representing the CDFG of the imperative input code is first pre-processed into an acyclic, VSFG-like format wherein all loops are normalized, extracted and transformed into tail-recursive functions, such that there are no explicit cycles remaining in the CFG. Then this pre-processed, acyclic LLVM IR is used to generate the equivalent Petri-net style VSFG-S representation as defined in Chapter 4. Next the VSFG-S IR is used to generate static-dataflow spatial hardware described using Bluespec SystemVerilog HDL, which is finally compiled using bsc to generate synthesizable Verilog. These steps are described in greater detail in the following sections.

5.2 Conversion from LLVM to VSFG-S

The steps involved in the conversion of LLVM IR into the equivalent Petri-net style VSFG-S are as follows:

1. Transform and extract all loops into tail-recursive functions, such that the CFGs of all functions in the LLVM IR are now acyclic,

2. Introduce state-edges between side-effecting operations (including function calls), to enforce control-flow order,

3. Generate predicate expressions for each basic block in each acyclic function CFG,

4. Convert φ-nodes into MUX nodes,

5. Discard all branch instructions, and construct a VSFG-S from the remaining LLVM instructions, and the newly introduced predicate, MUX and state-edge operations.

5.2.1 Convert Loops to Tail-Recursive Functions

In order for a CFG to be transformed into an equivalent Value State Dependence Graph, and thus by extension a Value State Flow Graph, all cycles within the CFG must be part of natural loops [Joh04]. A natural loop is one that has a single entry, or header, basic block, such that none of the basic blocks that make up the body of the loop have any control-flow predecessors outside the loop except for the header block, and all back-edges from within the loop enter the header block [ALSU06]. Essentially, all control flow entering a natural loop enters through the header block. Conversely, CFG cycles that do not have a unique header node, due to there being multiple control-flow entry points, are considered non-natural or irreducible code.

(a) A reducible CFG containing a natural loop. (b) An irreducible CFG, with a CFG cycle but no distinct loop header.

Figure 5.2: Examples of reducible and irreducible code.

Figure 5.2a gives an example of a reducible CFG, showing a natural loop composed of the E, F, and G basic blocks. Block E is the loop header block, as control-flow only enters the loop at E, and the back-edge from the loopback block G also terminates at E. Figure 5.2b gives an example of irreducible code, where the D, E, F, and G basic blocks are part of a CFG cycle, but do not constitute a natural loop, since control-flow can enter this cycle either from C (to E), or from B (to D). Both figures are taken from [Joh04].

Irreducibility and its impact on VSDG formation are discussed in greater detail by Johnson [Joh04], who observes that it is not possible to compile irreducible code directly to a VSDG. Thankfully, natural loops represent the overwhelming majority of cycles in the control-flow graphs generated from modern high-level languages [SW11, ALSU06]. Furthermore, irreducible code may also be transformed into reducible code through various CFG transformations [UM02, JC97]. In order to extract such natural loops into their own tail-recursive functions, several transformations must be applied. For nested loops, the order of extraction is from the inner-most loop to the outer-most.

1. Loop preheader insertion: For a natural loop that has a header block with multiple predecessor basic blocks, a new preheader basic block is created, such that the loop header will now only have one predecessor – the preheader block – while the header’s original predecessors are now predecessors of the preheader block. Consider the natural loop CFG shown in Figure 5.3a, in which block A, the loop header, has multiple predecessors. Figure 5.3b shows the same CFG after the insertion of the preheader block.

2. Merging all back-edges: Next, all the back-edges are merged into one. This is done by introducing a new loopback block, as shown in Figure 5.4a (block L), and

(a) Another reducible CFG showing a natural loop. (b) The same loop after preheader block insertion.

Figure 5.3: An example CFG showing another natural loop. Loop blocks are shown in green, while non-loop blocks are shown in blue. Blocks introduced by our transformations to the CFG are shown in purple.

re-directing the original back edges (from blocks C and D) to this block2. A new set of φ-nodes must be introduced in L for each loop-variant variable to select the appropriate value for the next iteration.

3. Merging all loop-exit edges: Then, all edges exiting the loop are also merged into a single exit edge. This is implemented by introducing two new basic blocks, as shown in Figure 5.4b: a return block R, and a new exit block X. The exit edges from both blocks C and D are redirected to the return block R, which has a single edge to block X. Block X then redirects control flow to the original exit blocks E and F. Note that a new select value must be introduced into the program in order to implement this correctly: a new φ-node is introduced into the R block that indicates which of the unique loop exits is being taken (i.e. from C or from D). This select value is then used by the exit block X to conditionally redirect the control flow to either E or F.

4. Loop extraction: At this point, all of the blocks constituting the transformed loop can be extracted into their own function definition. As shown in Figure 5.5a, every block between the header block A and the return block R inclusive is extracted into a new function, and replaced in the original CFG by a call to this new function. Before construction of the new function declaration, it is essential to identify all of the variables in the program that are live when control-flow enters the header block A – these variables must be added to the input argument list of the new function. Similarly, it is necessary to identify all of the variables that are updated within

2Given that we start from the inner-most loop in a loop nest, all nested inner loops would have already been extracted into their own tail-recursive functions, thus their back-edges are not considered here.

(a) CFG from Figure 5.3b with the multiple back-edges merged into one. (b) CFG from Figure 5.4a after merging loop-exit edges. Note the introduction of the return block R and the exit block X.

Figure 5.4: The loop CFG from Figure 5.3a after further transformations. Loop blocks are shown in green, while non-loop blocks are shown in blue. Blocks introduced by our transformations to the CFG are shown in purple.

the loop body, and are live when control-flow reaches the exit block X. For these variables, new stack memory is allocated in the preheader block P, and the pointers to these memory locations are added to the argument list of the new function. Within the loop body, at block R, just before returning control to the parent block, all of the updated values of these variables are stored to their respective stack locations. Corresponding load operations to retrieve the loop outputs from the stack are also inserted into block X. Thus communication between the parent function and the extracted loop function occurs through stack-allocated memory3. The select value from the previous step, which is computed in R, is set as the return value of this new function.

5. Conversion to Tail-recursion: This final step modifies the CFG of the extracted loop function by removing the back-edge exiting the loop-back block L, and adding a recursive self-call to L, followed by an unconditional jump to the return block R, as shown in Figure 5.5b. However, before this can be implemented, the function definition from the previous step must again be modified – all loop-variant values in the loop must also be added to the function’s input argument list, so that each time the tail-recursive call is made, the updated values of the loop variants may be passed into the ‘next iteration’.

3In the actual VSFG implementation, these stack allocation and access operations are identified and optimized away – instead, for each such stack pointer variable added to a function’s argument list I consider the VSFG function subgraph as having an additional output port.

(a) CFG from Figure 5.4b with the basic blocks constituting the loop extracted into a separate function.

(b) CFG from Figure 5.5a with the loop back-edge replaced by a recursive self-call.

Figure 5.5: The loop CFG from Figure 5.4b after further transformations. Loop blocks are shown in green, while non-loop blocks are shown in blue. Blocks introduced by our transformations to the CFG are shown in purple.

The LLVM compiler provided considerable infrastructure that could be adapted to implement the above functionality for extracting natural loops into tail-recursive functions. LLVM includes a -loop-simplify transformation pass that performs the first two transformations listed above. Another two useful passes are -loop-extract and -merge-return, which together implement transformations 3 and 4 above. Thus, just by utilizing existing LLVM passes, 4 of the required 5 transformations can easily be performed.

Implementing transformation 5, i.e. replacement of the loop back-edge with a tail-recursive call, required modification of the existing -loop-extract pass to (a) add the loop-variant values to the extracted functions’ input argument lists, and (b) add the tail-recursive call to the loop-back block and remove its back-edge. This modified -loop-extract pass has been called the -loop-2-tail-recurse pass.

After the -loop-simplify, -loop-2-tail-recurse, and -merge-return passes have been applied to the input LLVM IR, each function in the IR will have an acyclic CFG. At this point, the subsequent steps may be carried out in order to generate the predicate expressions and construct the state-edge graph for the VSFG.
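To summarise the net effect of transformations 1 through 5, the following Python analogue (purely illustrative – the actual passes operate on LLVM IR basic blocks, and all function names here are invented) shows a loop with two exits rewritten as a tail-recursive function that takes the loop-variant values as arguments and returns a select value naming the exit that was taken.

    def original(n):
        i, acc = 0, 0
        while True:
            acc += i
            i += 1
            if acc > 100:       # first loop exit
                return ("big", acc)
            if i >= n:          # second loop exit
                return ("done", acc)

    def loop_func(i, acc, n):
        """Extracted loop body: one call per iteration; the self-call is the back-edge."""
        acc += i
        i += 1
        if acc > 100:
            return "exit_big", acc      # select value plus live-out values
        if i >= n:
            return "exit_done", acc
        return loop_func(i, acc, n)     # tail-recursive call replaces the back-edge

    def transformed(n):
        select, acc = loop_func(0, 0, n)   # the preheader passes in the live-in values
        # the exit block X uses the select value to resume the correct successor
        return ("big", acc) if select == "exit_big" else ("done", acc)

    assert original(10) == transformed(10)
    assert original(1000) == transformed(1000)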

5.2.2 Implement State-edges between State Operations

Once the tail-recursion transformations are applied, the LLVM IR CFG for each function in the input code will be acyclic. State-edges are now introduced between all state operations in the IR. Each side-effecting instruction (load, store and function call) in the LLVM IR is augmented with new state-token input and output ports. Then, within each basic block, the state output port of the first side-effecting instruction is connected to the state input of the second side-effecting instruction, and so on, until the end of the block. If there are no state operations in a block, the block’s incoming state-edge is directly passed to its output.

Between basic blocks, the state-edge connectivity mirrors the structure of the acyclic CFG. The state-edge splits at the exit of a basic block when it has multiple possible successor blocks. Similarly, if a block has multiple predecessors, i.e. it represents a control-flow join point, then the state edges from each of the predecessor blocks must join, or synchronize, before entering the current block. This synchronization is performed by introducing a MUX operation (as described in Section 5.2.4) for the state-token edges at the beginning of each basic block that has multiple predecessors.

An example is shown in Figure 5.6, which considers the now-acyclic CFG of the tail-recursive loop implementation from Figure 5.5b. The red dashed line indicates the state-edge. As can be seen, the state-edge splits at the exit of blocks B, C, and D, which means that the state-output port of the last side-effecting operation in each of these blocks will have a fan-out greater than 1 (a fanout of 2 for each of these blocks). At control-flow join points, MUX operations are implemented to synchronize incoming state edges for both the tail-recursive call block and the return block R. Each MUX operation will wait for the state-token from the correct control-flow path to arrive before allowing the execution of subsequent side-effecting operations. In the VSFG-S, the correct control-flow path is determined using the MUX predicate inputs (not shown for clarity). Note that one of the state-operations in the Call funcLoop block will be the tail-recursive loop call, thereby ensuring that state operations are ordered correctly across loop iterations.
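The intra-block part of this construction can be illustrated with the following small Python sketch (hypothetical data structures and names, not those used by the toolchain): a state edge is threaded through the side-effecting operations of one basic block, and a block with no such operations simply forwards its incoming edge.

    SIDE_EFFECTING = {"load", "store", "call"}

    def thread_state_edge(block_instrs, incoming_state):
        """Return (augmented instructions, outgoing state edge) for one basic block."""
        state = incoming_state
        augmented = []
        for n, op in enumerate(block_instrs):
            if op in SIDE_EFFECTING:
                new_state = f"{state}.{n}"     # fresh state token name
                augmented.append((op, {"state_in": state, "state_out": new_state}))
                state = new_state
            else:
                augmented.append((op, {}))     # pure value operation: no state ports
        return augmented, state

    instrs, out_edge = thread_state_edge(["add", "load", "mul", "store"], "s0")
    print(out_edge)    # -> 's0.1.3': the store's state input is the load's state output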

Figure 5.6: The tail-recursive function CFG from Figure 5.5b showing state-edges and MUX operations inserted between basic blocks (red dashed lines).

5.2.3 Generate Block Predicate Expressions

Predicate expressions are also generated for each basic block and each control-flow edge in the acyclic CFGs, as described in Section 3.6. A new set of boolean predicate operations is incorporated into the program representation, one for each basic block. Then, each state operation in the CFG is augmented with a predicate input port, which is connected to the new predicate operation for its host basic block. These predicate operations are used not only to predicate the execution of state operations and function subgraphs, but also to provide the predicate inputs to the state and value MUX operations. Due to constraints of time, comprehensive boolean expression minimization for these predicate expressions (as described in Section 3.6) has not been incorporated into the toolchain at present. Instead, a simpler optimization is implemented, based on dominance and post-dominance relations between basic blocks. For nodes in a directed graph with an entry node N0 and an exit node N∞, dominance and post-dominance are defined as follows (definitions adapted from Johnson [Joh04]):

Definition 5.1. Dominance: In a directed graph with an entry node N0, a node p is said to dominate node q if and only if every path from N0 to q must traverse p. This relation is written as p dom q.

Definition 5.2. Post-dominance: In a directed graph with an exit node N∞, a node q is said to post-dominate node p if and only if every path from p to N∞ must traverse q. This relation is written as q pdom p.

Instead of applying boolean logic minimization to optimize predicate expressions and provide control-dependence analysis, as suggested in Chapter 4, I implement a simpler optimization which relies on identifying basic-blocks that are control equivalent. Two basic blocks are said to be control equivalent if every control-flow path from block N0 to N∞ that traverses one of the blocks is guaranteed to also traverse the other. This implies that after boolean minimization, the respective predicate expressions of both blocks would be identical.

(a) Acyclic CFG for a non-loop function.

(b) Acyclic CFG for a loop, extracted into its own, tail-recursive function.

Figure 5.7: Sample loop and non-loop CFGs from the AES benchmark in the CHStone Benchmark Suite [HTH+08]. These CFGs were generated using an LLVM pass, after preprocessing the benchmark code to extract loops into tail-recursive functions.

As an example, consider the CFGs from Figure 3.14, shown again in Figure 5.7. From Figure 5.7a, we see that the entry and return blocks are control equivalent, since any path through this graph must traverse both. A more complex example is provided by Figure 5.7b: blocks entry, k, n, r, u, and return are control equivalent. As a result, the predicate expression for the earliest block (entry) can directly be applied to all of its control-equivalent blocks, as a means of simplifying their predicate expressions. Thus we have:

p(return) = p(u) = p(r) = p(n) = p(k) = p(entry) = inPred    (5.1)

This is similar to the effect achieved by applying boolean minimization, as discussed for Equations 3.4. Control-equivalent blocks can be identified from the CFG through dominance analysis:

Definition 5.3. Control-equivalence of basic blocks: Two basic blocks p and q in a function CFG are said to be control-equivalent if and only if both p dom q, and q pdom p.

LLVM provides comprehensive analysis passes to evaluate dominance relations between basic blocks. These are used to provide a simple means of incorporating some control-dependence analysis into the VSFG. Currently, the VSFG toolchain only implements this control-equivalence based optimization of predicate expressions. Integration with boolean minimization tools like Espresso was not incorporated due to constraints of time, and is left for future work.
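For illustration only – the toolchain relies on LLVM’s dominator and post-dominator analysis passes rather than the naive fixed-point solver below, and every name here is invented – the control-equivalence test of Definition 5.3 can be sketched in Python as follows.

    def dominators(succ, entry):
        """Classic iterative dataflow: dom(n) = {n} joined with the intersection of dom over preds(n)."""
        nodes = set(succ)
        pred = {n: set() for n in nodes}
        for n, ss in succ.items():
            for s in ss:
                pred[s].add(n)
        dom = {n: set(nodes) for n in nodes}
        dom[entry] = {entry}
        changed = True
        while changed:
            changed = False
            for n in nodes - {entry}:
                new = {n} | (set.intersection(*(dom[p] for p in pred[n])) if pred[n] else set())
                if new != dom[n]:
                    dom[n], changed = new, True
        return dom

    def control_equivalent(succ, entry, exit_, p, q):
        dom = dominators(succ, entry)
        rsucc = {n: set() for n in succ}          # reversed CFG
        for n, ss in succ.items():
            for s in ss:
                rsucc[s].add(n)
        pdom = dominators(rsucc, exit_)           # post-dominance = dominance on the reversed CFG
        return p in dom[q] and q in pdom[p]       # p dom q  and  q pdom p

    # entry -> {b, c} -> d: entry and d are control equivalent, b is not.
    cfg = {"entry": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
    print(control_equivalent(cfg, "entry", "d", "entry", "d"))   # True
    print(control_equivalent(cfg, "entry", "d", "b", "d"))       # False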

5.2.4 Replace each φ-node with a MUX

Once boolean expressions have been generated for each basic block and conditional control-flow edge in each CFG, it becomes possible to convert all the φ-nodes within the LLVM IR into equivalent MUX nodes, as discussed in Section 4.3.1. In addition to replacing the φ-nodes, the newly introduced MUX operations for synchronizing state-edge joins (described in Section 5.2.2) are also provided with their appropriate predicate inputs.
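As a minimal illustration of the selection such a MUX performs (the mux function below is hypothetical, not toolchain code): the value chosen at a join point is the one whose basic-block predicate holds, and the block predicates arriving at a join are expected to be one-hot.

    def mux(cases):
        """cases: list of (block predicate, value); exactly one predicate should hold."""
        chosen = [value for predicate, value in cases if predicate]
        assert len(chosen) == 1, "block predicates at a join must be one-hot"
        return chosen[0]

    # %x = phi [ %a, %then ], [ %b, %else ]   becomes, in effect:
    x = mux([(True, "a_from_then"), (False, "b_from_else")])
    print(x)    # -> 'a_from_then'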

5.2.5 Construct the VSFG-S

At this point, we have an acyclic LLVM CFG, with loops implemented using tail-recursion. Additionally, state operations have been augmented with state and predicate inputs, as well as a state output. State-edges have been introduced to sequentialize state operations in control-flow order, while predicates have been generated (and optionally optimized) for each basic block. Finally, φ-nodes have been replaced with MUX operations. Having established all of the necessary value and state dependencies, it is now possible to discard the explicit control-flow from the LLVM IR, and generate the equivalent VSFG-S Petri-net:

• All branch instructions are discarded from the LLVM IR.
• All remaining simple LLVM instructions are translated into their equivalent set of VSFG-S Petri-net transitions described in Chapter 4. ALU and load/store instructions become value and state transitions respectively, while φ-nodes become MUX transitions.

• All LLVM Call instructions to non-loop functions are replaced by the appropriate set of signal and wait transitions, one for each input and output argument of that function, as described in Section 4.3.2. This of course includes the new state and predicate inputs and outputs required for state operations.
• All LLVM Call instructions to tail-recursive (formerly extracted loop) functions are similarly replaced by a set of corresponding inGate and outGate transitions, as described in Section 4.3.3. The VSFG-S subgraph of each function is replicated for each of its call-sites in a parent graph, since our static-dataflow execution model doesn’t currently support true function calls, as discussed in Section 4.5.
• Each value or state edge in the augmented LLVM CDFG is assigned a Petri-net place4; if an instruction has a fanout of n, then n Petri-net output places will be introduced for its equivalent transition(s). Also, places are introduced at the input and output ports of each VSFG-S subgraph, fused with their corresponding signal and wait, or inGate and outGate operations in their parent graph (Section 4.3.2).
• After the appropriate degree of ‘loop-unrolling’ (via tail-recursive subgraph flattening) has been performed for each of the extracted loop functions, dataflow loop-back edges can be reintroduced through the incorporation of eta and mu transitions. Both of these transformations are described in Section 4.3.3.

Once a completely defined VSFG-S Petri-net has been generated from the input LLVM IR, the next step is to translate it into an equivalent static-dataflow hardware description via the Bluespec SystemVerilog HDL.

5.3 Conversion from VSFG-S to Bluespec

As described in Chapter 4, each instruction/operation in the VSFG IR is represented using one or more Petri-net transitions in the VSFG-S IR. Similarly, each producer-consumer communication edge has an associated, unique Petri-net place. The Petri-net based representation and execution model of the VSFG-S makes it very easy to generate the equivalent Bluespec HDL.

Consider the short sample of LLVM IR input code shown in Figure 5.8a. After pre-processing this IR as described in Section 5.2, we end up with an acyclic VSFG-like LLVM representation that contains state and predicate edges, and can be represented as a dataflow graph as shown in Figure 5.8c. Once this is converted to the equivalent VSFG-S, as described in Section 5.2.5, we get the Petri-net representation shown in Figure 5.8d. Broadly speaking, each transition in the VSFG-S Petri-net can be described using a single atomic rule in Bluespec, while each 1-bounded place5 in the VSFG-S Petri-net maps to a corresponding 1-place FIFO element in the Bluespec HDL. The equivalent Bluespec HDL describing the Petri-net functionality of Figure 5.8d is given in Figure 5.8e. The first few lines of Bluespec code (shaded blue) represent the declaration of 1-place FIFOs corresponding to each of the 1-bounded places in Figure 5.8d. Next, there are three atomic rules (shaded grey), implementing each of the three transitions from Figure 5.8d.

4Actually, a pair of places: a value place and its corresponding acknowledgement place, as discussed in Section 4.2.
5Again, actually represented as a pair of places: a value place and its corresponding acknowledgement place, as discussed in Section 4.2.

(a) Sample LLVM IR. (b) Symbol key for the figure.

(c) Dataflow graph suggested by the pre-processed, VSFG-like LLVM IR. (d) The equivalent VSFG-S Petri-net.

(e) The corresponding Bluespec HDL

Figure 5.8: Sample LLVM IR code, and the subsequent three intermediate representations identified in the grey region of Figure 5.1.

A rule in Bluespec is able to fire when all of its implicit and explicit conditions are met. While the rules shown have no explicit conditions, implicit conditions are imposed by the status of the FIFOs being accessed from within each rule. As with the Petri-net transitions, each of the corresponding rules is able to fire when its input FIFOs are not empty, and its output FIFOs are not full. A firing rule then removes values from its input FIFOs and places a result value in each of its output FIFOs, thereby implementing the intended static-dataflow execution semantics. In Figure 5.8e, a .first operation on a FIFO accesses the contained value, while a .deq command removes a value from the FIFO. The former is analogous to reading a token from a value place, while the latter is equivalent to removing the token from the value place and placing one in the corresponding acknowledgement place.

Note, however, that a key issue arises when we maintain such a 1-to-1 correspondence between Petri-net places and Bluespec FIFOs: the number of storage elements required to store the output of an operation becomes proportional to the fanout of that output6. This can incur a significant area, power and energy overhead in custom hardware, particularly for the deeply pipelined style of circuits that are generated by this toolchain. One possible optimization is to instantiate a single output register for each output value in the VSFG-S. Then, each of the FIFOs in Bluespec can be declared to be of void type – this would mean that although they are still useful for performing flow-control between rules, they would not implement any storage for the output values. Instead of a .first operation performed on unique FIFOs, values would be read directly from the unified output register. Only the .deq operation would be retained to implement flow control. Unfortunately, the Bluespec compiler is still experimental and under development; implementing this optimization led to compilation failure when attempting to synthesize the resultant Bluespec into Verilog. Thus the toolchain currently incurs an area overhead proportional to the average fanout of values in the circuit, which must be accounted for when evaluating results, as discussed in Sections 6.2.3 and 6.2.4.

It is also important to note here that the Bluespec compiler implements its own internal scheduler to order the execution of rules. However, the Bluespec scheduler has no effect on either the complexity or the performance of the generated designs. This is because the generated Bluespec code is quite low-level, describing each instruction and its input/output FIFOs; thus there are no resource conflicts or contention between rules for the Bluespec scheduler to affect.
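The following Python sketch is a software analogue of this mapping – it is not the generated Bluespec, and all class and function names are invented for illustration. Each 1-bounded place is modelled as a one-slot FIFO, and each transition as a rule that may fire only when all of its input FIFOs are full and all of its output FIFOs are empty, which mirrors the implicit conditions imposed on the corresponding Bluespec rule.

    class Fifo1:
        """A 1-place FIFO standing in for a 1-bounded Petri-net place."""
        def __init__(self): self.slot = None
        def full(self):     return self.slot is not None
        def enq(self, v):   self.slot = v
        def first(self):    return self.slot
        def deq(self):      self.slot = None

    class Rule:
        """One transition: fires when inputs are available and outputs are free."""
        def __init__(self, ins, outs, fn):
            self.ins, self.outs, self.fn = ins, outs, fn
        def try_fire(self):
            if all(f.full() for f in self.ins) and not any(f.full() for f in self.outs):
                result = self.fn(*[f.first() for f in self.ins])
                for f in self.ins:  f.deq()
                for f in self.outs: f.enq(result)
                return True
            return False

    # Dataflow graph for r = (a + b) * c: five places feeding two rules.
    a, b, c, t, r = (Fifo1() for _ in range(5))
    rules = [Rule([a, b], [t], lambda x, y: x + y),
             Rule([t, c], [r], lambda x, y: x * y)]

    a.enq(2); b.enq(3); c.enq(4)
    while any(rule.try_fire() for rule in rules):   # dynamic, dataflow-order scheduling
        pass
    print(r.first())    # -> 20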

5.4 Current Limitations

As detailed in Chapter 2, in addition to addressing the amenability problem of spatial architectures by improving their performance on control-intensive sequential code, it is also important to address the programmability problem of spatial computation. Many recent projects address programmability by developing sophisticated, fully automated high-level synthesis tools that are capable of compiling unrestricted imperative language code to generate custom hardware [BVCG04, SSM+07, VSG+10].

6Since each place (and its corresponding acknowledgement place) is implemented as a 1-place FIFO in Bluespec, operations/transitions that have a fan-out of n > 1 will therefore implement n 1-place FIFOs, one for each fan-out value. This can be quite wasteful, as the registers within each such FIFO will be storing identical values.

In the design of the VSFG IR, this research also strives to provide a similar, complete coverage of imperative language features. The end-goal is analogous to the LLVM compiler IR in that unrestricted, unmodified high-level language code ought to be compiled to the VSFG, or the VSFG-S, which may then be processed through various back-end compilation passes for implementation on a wide variety of spatial architectures and substrates. As mentioned earlier, this problem is greatly simplified by the selection of the LLVM IR as our input representation, since numerous robust LLVM front-ends for various high-level languages already exist. Thus, ideally we would like to be able to compile unrestricted, unmodified LLVM IR to the VSFG IR.

However, largely due to constraints of time, the static-dataflow VSFG-S and the associated high-level synthesis toolchain have certain limitations on the types of code that can be compiled. These constraints and their associated reasons and solutions are summarised below:

• No irreducible code: As mentioned in Section 5.2.1, and detailed by Johnson [Joh04], irreducible code cannot be compiled to the VSDG, and thus nor to the VSFG by extension. Thankfully, irreducible code occurs very rarely [SW11, ALSU06], and when it does7, transformations may be applied to make it reducible [UM02, JC97]. No such transformations have been incorporated into the current version of the VSFG-S HLS toolchain, but they may easily be incorporated in the future.

• No support for recursion: While the pure-dataflow VSFG IR can represent general recursion, the VSFG-S IR does not. However, as discussed in Section 4.5, this support, as well as support for true function calls, may be added to the VSFG-S by hybridizing it with the dynamic dataflow execution model. This is left for future work – currently, our evaluation focuses on non-recursive code, and in-lines all called function subgraphs.

• No system integration: Currently the HLS toolchain does not incorporate support for system calls (including dynamic memory allocation, software exceptions, etc.) or external library calls. These are not limitations of the VSDG or VSFG IRs, but rather are common across most high-level synthesis toolchains, due to the often stand-alone, self-contained nature of the custom hardware that is typically generated. Other researchers frequently overcome this by implementing hybrid architectures composed of a conventional host processor combined with spatial-hardware components [MCC+06, MG07, VSG+10, CCA+11]. This hybrid approach enables full integration with modern system architecture, as all OS functionality, system and library calls are implemented on the host processor. Another approach would be to develop a fully programmable spatial substrate like Wavescalar [SSM+07], which would then implement all OS features and library code, though this requires recompilation of the OS and all libraries for the target spatial architecture. Both approaches may be explored in the future when developing programmable spatial architectures targeting the VSFG.

• No complex data-structures: Aside from dynamically sized data structures, the current toolchain also does not support complex, static data structures like structs, unions, or multi-dimensional arrays. This support was left out only due to time constraints and may easily be incorporated in the future.

7Irreducibility is sometimes introduced by very aggressive code optimization passes. For instance running the gcc compiler with -O3 flags will occasionally produce irreducible code.


None of the above limitations are inherent to the VSFG IR itself. With more development, it should easily be possible to compile unrestricted LLVM IR code to the VSFG IR. In fact, with the exception of OS integration, system calls, etc., all of the other issues can easily be addressed even for the custom-hardware-oriented VSFG-S IR. Nevertheless, for now, these limitations constrain the scope of evaluation possible with the current VSFG-S HLS toolchain, as benchmark applications must be carefully selected to exclude the above features. The evaluation methodology, including the selected benchmarks, is discussed next in Chapter 6.

5.5 Summary

This chapter briefly summarised the structure of the VSFG-S based high-level synthesis toolchain that is used to evaluate the performance and energy characteristics of VSFG-based spatial dataflow hardware. Also highlighted were various limitations of the prototype toolchain and means of overcoming them in the future. Chapter 6 will now present an evaluation methodology, results and discussion of the performance and energy characteristics of the VSFG-based dataflow hardware, compared to an existing high-quality statically-scheduled high-level synthesis tool, as well as a simple in-order and a complex out-of-order processor.

CHAPTER 6

Evaluation Methodology and Results

Chapter 4 presented a justification for generating static-dataflow custom hardware based on the VSFG in order to evaluate its potential for exposing and exploiting ILP from control-intensive code. Then, Chapter 5 described the design and implementation of a prototype high-level synthesis toolchain for compiling sequential, imperative code to the VSFG, and then to custom hardware. In this chapter, I present the methodology for, and results of, comparing the performance, power and energy characteristics of this generated custom hardware.

6.1 Evaluation Methodology

The objective of the following experimental evaluation is three-fold: (1) evaluate the potential of the VSFG for statically exposing ILP from control-intensive sequential code, compared to equivalent, conventional CDFG-based custom hardware, (2) understand the energy and area cost incurred for the observed improvements in ILP, and (3) estimate the performance characteristics of the VSFG compared to a conventional superscalar processor. Using the toolchain described in Chapter 5, results for three versions of the VSFG-S based hardware are presented: VSFG 0 has no loop unrolling/flattening, VSFG 1 has all loops unrolled once, and VSFG 3 has all loops unrolled thrice. In a two-level nested-loop, for the VSFG 1, this would mean that there would be two copies of the outer-loop body VSFG, each containing a nested subgraph representing the inner-loop, which itself would have two copies of its loop-body. Thus for the VSFG 1, a two-level nested loop would implement four copies of the inner loop body, and similarly, the VSFG 3 would implement sixteen copies of the inner loop. This explosive growth in area would be impractical for real-world implementations, which would not unroll all loops blindly, instead relying on profile-driven selection of which loops to unroll. However, the objective here is to understand the maximum achievable performance.
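To make the growth in loop-body copies concrete, the sketch below shows conventional source-level unrolling of a generic two-level nest (purely illustrative C, not taken from any benchmark; the VSFG performs the analogous duplication on loop subgraphs). Unrolling once at each level, as in VSFG 1, yields two outer-body copies, each holding two inner-body copies, i.e. 2 × 2 = 4 inner-loop bodies; unrolling thrice, as in VSFG 3, yields 4 × 4 = 16.

    /* Illustrative C only: conventional source-level unrolling of a generic
       two-level nest.  N and M are assumed even to keep the sketch short. */
    #include <stdio.h>

    static int work(int i, int j) { return i * 31 + j; }   /* stand-in loop body */

    int main(void) {
        enum { N = 8, M = 8 };
        long acc = 0;
        for (int i = 0; i < N; i += 2) {        /* outer body duplicated once...    */
            for (int j = 0; j < M; j += 2) {    /* ...and each copy holds a twice-  */
                acc += work(i, j);              /* unrolled inner loop, giving      */
                acc += work(i, j + 1);          /* 2 * 2 = 4 textual copies of the  */
            }                                   /* inner-loop body (as in VSFG 1).  */
            for (int j = 0; j < M; j += 2) {
                acc += work(i + 1, j);
                acc += work(i + 1, j + 1);
            }
        }
        printf("%ld\n", acc);
        return 0;
    }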

6.1.1 Comparison with an Existing HLS Tool

To provide a baseline for comparison, we use LegUp 2.0, an established, high-performance HLS tool [CCA+11], to generate CDFG-based, statically-scheduled custom hardware. There are several key reasons why LegUp was considered a suitable choice for this comparison:

• LegUp is a recent but mature project, aimed at developing a state-of-the-art, open-source high-level synthesis tool. As a result, it has a very active community of developers and researchers contributing to it, with prompt and readily available technical support and bug-fixing when needed.

• Like our VSFG-based toolchain, LegUp is based on the LLVM compiler infrastructure. This allows us to apply identical optimization and transform passes to the input code, and retain confidence that any observed differences in the characteristics of the generated hardware from both tools are not merely due to differences in the type or quality of compiler optimization passes applied.

• It has been demonstrated that LegUp produces hardware of comparable quality to commercial HLS tools like eXCite from Y Explorations Inc. [Inc10, CCA+11].

The input LLVM IR to both LegUp and the VSFG toolchain is compiled with -O2 flags, and with no link-time inlining or optimization. LegUp 2.0 does not support any loop unrolling, so results for only one version of LegUp-generated hardware are provided1. Both tools were run with operation-chaining disabled – meaning that the generated hardware is fully pipelined, with each instruction in the IR having its own output register instead of being merged as combinational logic with a predecessor or successor. This was done because enabling operation-chaining would have masked the ILP improvements by reducing the degree of pipelining in the generated hardware to match the achievable operating frequency.

6.1.2 Comparison with Pegasus/CASH

Given that this work is closely related to the Pegasus IR and the CASH compiler [Bud03], and relies on the same static-dataflow execution model, ideally the best way to demonstrate the advantage of the VSFG over CDFG-based IRs would have been to compare it with Pegasus/CASH for the selected benchmarks. Unfortunately, requests to obtain the CASH compiler were unsuccessful within a reasonable time-frame. Furthermore, even if the CASH compiler had been made available, there would be additional causes for concern regarding its suitability for our demonstration:

• The experimental compiler would have been almost a decade old, and largely unmaintained during this time. Debugging or obtaining support for it would have been difficult.

1A more recent version, LegUp 3.0, has incorporated limited support for pipelining and modulo-scheduling of carefully constrained inner loops. The VSFG-S IR allows for far more robust unrolling of loops from multiple levels of a loop nest. Furthermore, since the results reported for LegUp 3.0 do not improve on those of LegUp 2.0 for the benchmarks being considered [Leg13], the effort to update our testing infrastructure to use this latest version was deemed unnecessary.

• The CASH compiler uses the SUIF compiler [WFW+94] front-end and analysis passes, while developing its own optimization passes. On the other hand, both LegUp and VSFG are based on LLVM and make extensive use of the existing standard optimization passes in LLVM. This similarity brings confidence that any differences in the observed characteristics of the generated hardware are not due to any potential shortcomings of either of our experimental research compilers.

Due to the shared LLVM-IR based starting point for both LegUp and my VSFG-based HLS toolchain, it is easy to verify that observed differences in performance, area, energy, etc. could only be due to at most two factors: either the differing program representations (VSFG vs. CDFG), or the execution models (statically-scheduled vs. static-dataflow).

6.1.3 Comparison with Conventional Processors

A comparison of the generated custom hardware is also made against the estimated performance and energy cost for an in-order Altera Nios II/f soft-processor, as well as an Intel Core i7 Nehalem out-of-order superscalar processor. The Nios processor is evaluated through timing-simulation on an Altera Stratix IV family FPGA, while Core i7 performance is approximated using the Sniper Interval Simulator [CHE11]. All benchmarks were compiled for both processors using -O3 flags, with the gcc and nios2-elf-gcc compilers respectively2, in order to maximize performance.

Evaluating performance: Presenting a comparison between custom hardware generated from different HLS tools is easily justifiable, as all such synthesized hardware (from both LegUp and the VSFG-based HLS tools) is simulated on the same Altera Stratix IV FPGA. Unfortunately, providing a precise performance and efficiency comparison between synthesized custom hardware and conventional processors is complicated by their vastly different implementation technologies and operating frequencies, as well as differences in the way they may be integrated into a system. The execution time for any application is composed of three factors:

    Execution Time = n × CPI × 1/f

where n is the total number of instructions executed at runtime, CPI is the average cycles-per-instruction, and f is the clock frequency. The value of n varies between different instruction-set architectures and implementation philosophies (e.g. CISC vs RISC), while CPI and f are more implementation-dependent parameters. To simplify performance comparison across architectures with different ISAs, cycle-counts provide an effective means of abstracting away the effects of instruction-set design and implementation:

    Execution Time = Cycle Count × 1/f,   where Cycle Count = n × CPI

I utilize cycle-counts as a measure of performance because, assuming similar operating frequencies f, they provide a good analogy for the degree of ILP exploited across different architectures. Furthermore, with this approach, the issues in the VSFG IR affecting the exposition and exploitation of ILP may be considered separately from the implementation issues during high-level synthesis affecting achievable operating frequency f3.

2The use of -O3 flags was recommended for the Nios II/f in the Nios II Software Developer's Handbook [Alt11].

While the Nios II/f soft processor is also evaluated on the same Altera FPGA, its design has been carefully tuned to achieve a high operating frequency (fMAX), often giving it as much as a 3× frequency advantage over the unoptimized custom hardware. Similarly, a full-custom Core i7 processor operating at around 2.6 GHz possesses an approximately 20–40× clock frequency advantage over FPGA-implemented custom hardware, which typically runs at between 70–150 MHz for complex designs. Considering ILP as represented by cycle-counts separately from the issues of optimizing implementation frequency allows us to evaluate the potential of the VSFG IR without being too confounded by low-level implementation details, design trade-offs (such as optimising designs for high frequency), and HLS toolchain optimization issues at this early stage of work4. Finally, in order to eliminate the effects of the differing memory-access latencies that each processor implementation experiences, all memory operations are constrained to have a latency of one clock cycle: the Sniper simulator is configured such that the simulated Core i7 uses perfect L1 caches (100% hit rate) with a hit latency of 1 cycle, while the Nios is configured to access local block RAMs (BRAMs) on the FPGA, again with 1 cycle access latency. This is done to match the 1 cycle BRAM access latency of the generated custom hardware.

Evaluating power and energy: As with the evaluation of performance, it is difficult to provide a fair comparison of power and energy efficiency between the synthesized custom hardware implemented on an FPGA and a full-custom processor like the Intel Core i7. In the first instance, I present an energy cost comparison with the Nios II/f soft-processor only, as it at least shares the same implementation substrate. For further analysis, it is reasonable to assume that the in-order, six-stage pipeline of the Nios II/f should provide much higher energy-efficiency than the out-of-order Core i7, if both were implemented using the same underlying substrate. To estimate the energy cost of such an equivalent version of the Core i7 processor, I refer to the empirical relationship between power and performance for sequential processors presented by Grochowski et al. [GA06]:

    Power = Perf^α,   where α = 1.75

Using the available cycle-count results for the Core i7 and the Nios to estimate their relative performance difference (Perf) allows us to estimate the approximate power dissipation of our hypothetical Core i7 processor, relative to the Nios. Assuming the same fMAX for both the Nios and this hypothetical soft/FPGA version of the Core i7, an energy-efficiency comparison of the VSFG-based hardware with this out-of-order processor is also presented.

6.1.4 Selected benchmarks

As an early-stage prototype, the VSFG HLS tool-chain can only compile applications with some constraints, as mentioned in Chapter 5. Currently there is no support for C language structs or multi-dimensional arrays, nor for floating-point arithmetic operations.

3i.e. minimizing the longest combinational critical-path between registers in the hardware.
4Recall that HLS is being used as a test case to evaluate the VSFG – the objective for now is not to produce a production-quality, competitive HLS toolchain.

Additionally, like most HLS tools, there is no support for general recursion, dynamic memory-allocation, or external system/library calls. Due to these limitations our choice of benchmark applications is constrained. Nevertheless, in order to provide a meaningful evaluation of performance on control-intensive sequential code, I have selected six benchmarks from the CHStone benchmark suite, in addition to two home-grown kernels. The CHStone benchmark suite [HTH+08] was developed in order to facilitate research into high-level synthesis tools. It is composed of several common C-language benchmark applications that have been modified to be entirely self-contained, by removing all dynamic memory allocation, library or system calls, and general recursion. The following benchmark applications were selected from the CHStone suite:

• dfadd: Implements IEEE-standard double-precision floating-point addition using 64-bit integers. This benchmark contains only a single loop, but exhibits considerable data-dependent forward branching through if statements, as well as bitwise ALU operations.

• dfmul: Implements IEEE-standard double-precision floating-point multiplication using 64-bit integers. This benchmark exhibits similar control-flow structure to dfadd.

• dfdiv: Implements IEEE-standard double-precision floating-point division using 64-bit integers. In addition to considerable forward branching, dfdiv also contains several data-dependent loops nested within the main loop. Furthermore, it requires long-latency integer division operations to be selectively performed based on data-dependent control flow.

• dfsin: Implements a double-precision floating-point sine function using 64-bit integers. This benchmark exhibits multiple nested data-dependent loops, as well as function calls to code from the dfadd, dfmul, and dfdiv benchmarks, thus also exhibiting data-dependent forward branches and bitwise ALU operations.

• mips: This benchmark is a simplified (30-instruction) MIPS processor simulator implementing a sorting program. It is composed of a main outer-loop containing four inner-loops, one of which contains data-dependent forward branching in the form of a two-level nested switch statement.

• adpcm: Implements the CCITT G.722 Adaptive Differential Pulse Code Modulation algorithm for voice compression. This benchmark implements both ADPCM encoding and decoding, and contains significant data-dependent forward branches, multiple levels of nested loops, as well as function calls.

In addition to the above benchmarks from CHStone, the following two home-grown kernels were also used as benchmarks:

• epic: This is the internal_int_transpose function identified as being difficult to accelerate in spatial hardware by Budiu et al. [BAG05], and discussed in Section 3.3. It consists of a nested loop that exhibits significant outer-loop parallelism, but insufficient concurrency, as well as a long-latency modulus (%) operation within its inner-loop.

• bimpa: This is a spiking neural-network simulator for constructing and simulating a network of 1000 neurons, and was developed as part of the BIMPA project [BIM]. I modified this benchmark to flatten all structs and multi-dimensional arrays, as well as to statically allocate all memory. This benchmark exhibits complex control-flow in the form of multiple levels of loop-nesting, with dynamic data-dependent control-flow. This benchmark is also highly memory-intensive, as unlike the CHStone benchmarks listed above, neural-network simulation is dominated by communication of spike values between neurons through array updates. Most of the loops in the benchmark are easily vectorizable [NFMM13]; however, for the purposes of this work, no vectorization optimizations are applied – the benchmark is treated strictly as sequential, imperative code.

Table 6.1: Table showing Raw Cycle-counts for LegUp, various versions of the VSFG, as well as the Altera Nios II/f and Intel Nehalem Core i7 processors.

                       Full subgraph-speculation,        Full subgraph-speculation,        Predicated subgraphs,             Conventional
                       no predicate optimization         basic predicate optimization      basic predicate optimization      Processors
         LegUp         VSFG 0     VSFG 1     VSFG 3      VSFG 0     VSFG 1     VSFG 3      VSFG 0     VSFG 1     VSFG 3      Nios II/f   Core i7
epic     1078444       1062439    980783     972593      1062436    524122     320344      1062436    528218     265170      3399634     200174
adpcm    71349         56145      51580      51186       56145      51580      51186       57860      51580      51186       119794      42662
dfadd    2391          2073       1737       1737        1999       1590       1574        1623       1574       1574        16441       15994
dfdiv    3029          4485       3460       2929        4405       3380       2824        3235       2825       2639        36487       15120
dfmul    941           928        679        647         916        671        625         916        671        625         7074        14072
dfsin    105773        78349      78275      78275       72007      71896      71896       77906      73231      73231       1420558     104953
mips     13414         14489      13438      12953       14489      13438      12953       14489      13438      12953       31082       29998
bimpa    142386696     114361494  98179648   97430648    114361494  98179648   97430648    114361494  98179648   97430648    373347552   39664956

6.2 Results

6.2.1 Cycle Counts

Aside from varying the degree of loop unrolling between VSFG 0 and VSFG 3, the VSFG HLS toolchain was configured to generate three different versions of the VSFG:

• Full subgraph-speculation, no predicate optimization.
• Full subgraph-speculation, basic predicate optimization.
• Predicated subgraphs, basic predicate optimization.

Table 6.1 shows the raw cycle-count results for all 3 × 3 versions of the VSFG, as well as for LegUp.

Table 6.2: Table showing Cycle-counts for VSFG with full subgraph-speculation, and no predicate optimization

         LegUp        VSFG 0      VSFG 1     VSFG 3
epic     1078444      1062439     980783     972593
adpcm    71349        56145       51580      51186
dfadd    2391         2073        1737       1737
dfdiv    3029         4485        3460       2929
dfmul    941          928         679        647
dfsin    105773       78349       78275      78275
mips     13414        14489       13438      12953
bimpa    142386696    114361494   98179648   97430648

Full subgraph-speculation, no predicate optimization: For the very first evaluation of the generated VSFG-based hardware, no boolean expression minimization was performed on the predicate expressions generated for each basic-block in the original CFG. This meant that this implementation did not benefit from the control-dependence analysis described in Section 3.6. Nevertheless, in order to maximize performance, the VSFG implementations were initially configured to maximize speculative execution: all signal and wait operations were predicate-promoted so that all nested subgraphs (except loop-subgraphs) would execute without waiting for their predicates to be computed. Table 6.2 shows the cycle-count results of this version of the VSFG (i.e. a subset of Table 6.1), while Figure 6.1a shows all values normalized to the results for the hardware generated by the LegUp HLS tool. On average, the baseline VSFG 0 provides 6% lower cycle-counts than the equivalent LegUp-generated hardware. Furthermore, as the degree of unrolling in the VSFG is increased from VSFG 0 to VSFG 3, there is a general improvement in performance for most benchmarks, averaging 21%. However, the epic benchmark, despite a high degree of unrolling for both levels of the loop-nest, shows only a 10% improvement in cycle-counts relative to the baseline VSFG 0 and LegUp results. Additionally, the VSFG 0 results exhibit worse performance than LegUp for the dfdiv and mips benchmarks.

Figure 6.1: Performance Comparison (Cycle Count) Normalized to LegUp. Lower values are better. (a) Cycle counts with speculative subgraph execution and without predicate optimization. (b) Cycle counts with speculative subgraph execution and predicate optimization. (c) Cycle counts with predicated subgraph execution and predicate optimization.

    /* ======================================================================
       In-place (integer) matrix tranpose algorithm.
       Handles non-square matrices, too!
       ====================================================================== */
    void internal_int_transpose(int* mat, int rows, int cols,
                                int modulus)
    {
        int swap_pos, curr_pos, swap_val;

        for (curr_pos = 1; curr_pos

Figure 6.2: The 'internal_int_transpose' function from the 'epic' Mediabench benchmark

Full subgraph-speculation, basic predicate optimization: To address these shortcomings, the basic predicate optimization described in Chapter 5 was incorporated into the generated VSFG-S hardware. Table 6.3 shows the cycle-count results of this version of the VSFG (another subset of Table 6.1), while Figure 6.1b shows the normalized cycle-count results with this predicate optimization enabled. With this optimization, a dramatic improvement in the cycle count for the epic benchmark is observed as the degree of unrolling is increased. Figure 6.3 shows the CFG for the outer-loop of the epic benchmark kernel (code shown in Figure 6.2, and kernel described in Section 3.3). The iterate basic block contains the tail-recursive call of the outer-loop. Without predicate minimization, the predicate expression for the iterate block is given by the equation:

    p(iterate) = p(condition) . ¬p(condition, exit)

where:

    p(condition) = p(entry) . p(entry, condition) + p(swap)
    p(swap)      = p(entry) . ¬p(entry, condition)

The computation of p(entry, condition) is dependent on the result computed within the inner-loop. With predicate optimization however, the predicate expression for the condition and iterate blocks is simplified (since p.x + p.¬x = p) to:

    p(condition) = p(entry) . p(entry, condition) + p(entry) . ¬p(entry, condition)
                 = p(entry)

Substituting, we get:

    p(iterate) = p(entry) . ¬p(condition, exit)

Thus initiation of the next outer-loop iteration is only dependent on the loop exit condition being computed within the condition block. As the inner-loop body, including its long-latency modulus operation, is no longer on the critical path for the computation of the outer-loop's loop-back predicate, multiple copies of the outer-loop may be activated concurrently as the degree of unrolling is increased between the VSFG 0 and VSFG 3 implementations. This means that the execution of multiple copies of the inner-loop, including its long-latency modulus operator, may overlap, allowing for a dramatic reduction in the required cycle counts for epic. However, as can be seen from Figure 6.1b, all other benchmarks exhibit only a modest improvement in cycle-counts due to this optimization. The dfdiv and mips benchmarks still exhibit worse performance than LegUp hardware for VSFG 0.

Table 6.3: Table showing Cycle-counts for VSFG with full subgraph-speculation, and basic predicate optimization.

         LegUp        VSFG 0      VSFG 1     VSFG 3
epic     1078444      1062436     524122     320344
adpcm    71349        56145       51580      51186
dfadd    2391         1999        1590       1574
dfdiv    3029         4405        3380       2824
dfmul    941          916         671        625
dfsin    105773       72007       71896      71896
mips     13414        14489       13438      12953
bimpa    142386696    114361494   98179648   97430648

Predicated subgraphs, basic predicate optimization: Poor performance in the two exceptions (dfdiv and mips) was found to be due to sub-optimal pipeline balancing in the static dataflow hardware [Gao91]. An example is shown in Figure 6.4a: even if the long-latency path (64 cycles) is infrequently taken, the static dataflow nature of the graph means that the longer latency will still be incurred each time this loop body is triggered. In the case of dfdiv, there is a long-latency divide operation (64 cycles, unpipelined) on one of the infrequently taken paths within its main loop body. Using predicated execution (Figure 6.4b) instead of speculation to trigger only the necessary path in each loop iteration can overcome this problem, but at the cost of any potential performance gain due to speculation on the frequent path5. Thanks to the hierarchical nature of the VSFG, we have a third option – sub-graph predication. Unlike full predication, infrequent paths through the graph can be nested into their own subgraphs, and may then be selectively predicated, while the shorter, more frequently taken paths can still take advantage of speculation, as shown in Figure 6.4c.

5Predicated execution as shown here has been implemented using the conditional schema described in Section 4.2.1.

Figure 6.3: The LLVM CFG for the epic kernel outer-loop.

For the third version of the VSFG-based hardware, subgraph predication is implemented only on pre-existing subgraphs for all benchmarks, meaning that all non-loop subgraphs will now have their corresponding signal and wait operations in the parent graph predicated6. Additionally, long-latency operations like division and modulus are also predicated. As shown in Figure 6.1c, the performance of VSFG 0 for dfdiv improves significantly, but there is no change for mips, as it contained no pre-existing subgraphs in its main loop that could be selectively predicated. However, it must be noted that the performance improvements do not apply to all benchmarks – epic and dfadd exhibit slight improvements in VSFG 0, performance for dfsin is slightly worse for all VSFG configurations, and bimpa and dfmul exhibit no change. On average, the VSFG 0 provides a 12% improvement in cycle count over LegUp, up from 7% for hardware with speculative subgraph execution, but this can largely be attributed to the performance improvement observed for dfdiv. Table 6.4 shows the raw cycle-count results of this version of the VSFG, extracted from Table 6.1. Additional effort in identifying mutually-exclusive control-flow regions and nesting them into their own subgraphs is therefore warranted to further improve performance in the baseline VSFG 0 configuration.

6Recall that loop subgraphs are always predicated.

Figure 6.4: The Trade-off between Speculation and Predication. The VSFG enables a combination of both: Subgraph Predication. (a) Speculation. (b) Predication. (c) Subgraph Predication.
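To make the trade-off of Figure 6.4 concrete, the hypothetical C loop below follows the dfdiv-like pattern of a rarely taken, long-latency path (it is not code from the benchmark). Under full speculation the static-dataflow graph evaluates both arms every iteration, so the unpipelined divide latency is always paid; nesting only the rare arm in its own predicated subgraph removes it from the common case while the cheap arm can still be executed speculatively.

    /* Hypothetical loop body with a dfdiv-like shape: the unpipelined divide
       lies on an infrequently taken path.  Speculating both arms puts the
       divide latency on every iteration of the static-dataflow graph; nesting
       the rare arm in its own predicated subgraph confines that latency to the
       iterations that actually need it. */
    long kernel(const long *x, long n, long d) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            long v = x[i];
            if (v < 0)                 /* rare path: long-latency divide   */
                acc += v / d;
            else                       /* common path: cheap ALU operation */
                acc += v + 1;
        }
        return acc;
    }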

Table 6.4: Table showing cycle-counts for VSFG with predication of existing subgraphs and long-latency operations (/ and %), and basic predicate optimization.

         LegUp        VSFG 0      VSFG 1     VSFG 3
epic     1078444      1062436     528218     265170
adpcm    71349        57860       51580      51186
dfadd    2391         1623        1574       1574
dfdiv    3029         3235        2825       2639
dfmul    941          916         671        625
dfsin    105773       77906       73231      73231
mips     13414        14489       13438      12953
bimpa    142386696    114361494   98179648   97430648

Further methods exist for pipeline balancing by performing optimal dataflow software pipelining [Gao89]. This has already been incorporated into previous work on Spatial Computation [Bud03], and should be incorporated into future versions of the VSFG toolchain as well. Further performance improvements over VSFG 0 are achieved through aggressive multi-level loop unrolling. Figure 6.1c shows that VSFG 3 achieves 35% lower cycle counts than LegUp, by trading area for performance. Figure 6.5 compares LegUp and VSFG cycle counts to those of the Core i7 and Nios II/f processors. epic, adpcm, dfsin, and bimpa show the performance advantage of the superscalar Core i7 over the in-order Nios, as well as over the CDFG-based LegUp. With unrolling, VSFG 3 is able to approach or exceed Core i7 cycle counts for all benchmarks with the exception of bimpa. Furthermore, for all benchmarks except epic, performance gains stagnate quickly between VSFG 1 and VSFG 3. The reason for this is the lack of memory parallelism or reordering in the VSFG: all memory operations are constrained by the state-edge to occur strictly in sequential program order, whereas the Core i7 is able to dynamically disambiguate, re-order and issue multiple memory instructions each cycle. Yet despite this, the memory-intensive bimpa is the only benchmark where the Core i7 holds a significant advantage over VSFG 3.

Figure 6.5: Performance Comparison (Cycle Count) vs an out-of-order Intel Nehalem Core i7 processor, and an Altera Nios II/f in-order processor. (a) epic (×1K cycles), (b) adpcm (×1K cycles), (c) dfsin (×1K cycles), (d) bimpa (×1M cycles), (e) dfadd (×1K cycles), (f) dfdiv (×1K cycles), (g) dfmul (×1K cycles), (h) mips (×1K cycles).

Figure 6.6: Frequency (fMAX) and Delay comparison of LegUp and VSFG custom hardware. (a) fMAX for benchmarks; the Altera Nios II/f runs at 290 MHz. (b) Normalized delay for benchmarks. Note that some values for VSFG 3 are missing, as these circuits were too large to fit in our target FPGA. These have been removed from the Geomean calculation for VSFG 3.

As is common for existing high-level synthesis tools [SG08, CCA+11], incorporating static memory disambiguation based on alias-analysis is an essential first step towards exposing and exploiting more memory-level parallelism. Work by Budiu et al. goes further, and incorporates a small load-store queue to perform dynamic memory disambiguation in their generated application-specific static dataflow hardware [BVCG04]. Note, however, that despite LegUp's utilization of static memory disambiguation, the VSFG-based hardware performs better, as LegUp-based hardware performance is primarily limited by the explicit control-flow in its implementation. The dfadd, dfdiv, dfmul, and mips benchmarks perform poorly on the Core i7 primarily due to its high branch misprediction penalty – the i7 achieves only an 88% prediction accuracy on average for these benchmarks. Thanks to their many short, data-dependent branches, any speculation performance gains are quickly swamped by a 17 cycle penalty for each misprediction. As a result, both LegUp and VSFG perform better than the Core i7 for these benchmarks.
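As a rough, back-of-the-envelope illustration (the per-branch framing is an assumption made only for this estimate), an 88% prediction accuracy with a 17-cycle penalty implies

    (1 − 0.88) × 17 ≈ 2 wasted cycles per executed branch on average,

which can quickly rival the useful work between consecutive branches in kernels dominated by short, data-dependent branching.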

6.2.2 Frequency and Delay

Figure 6.6a shows the peak operating frequency (fMAX) estimated for the circuits generated by both LegUp and the VSFG (predicated subgraphs version) by the Altera Quartus II (v. 13) tool after synthesis and place-and-route. For comparison, the Nios II/f achieves an fMAX of 290 MHz. The baseline VSFG 0 configuration achieves 15 MHz higher frequencies on average than LegUp. This was expected, as the dynamically scheduled VSFG does not have a centralized FSM, as discussed in Section 3.5. However, this advantage diminishes with the increase in circuit size that accompanies loop unrolling in the VSFG 1 and VSFG 3 configurations.

Unfortunately, despite the high degree of pipelining in both the LegUp and the VSFG hardware, the achievable fMAX is inversely related to the size of the circuit being implemented. The primary reason for this is that all of the memory operations in spatial hardware are distributed across each generated circuit. Each such operation must access memory through a single, centralized memory controller. The combinational critical-path wire length thus increases with the total number of memory operations in the circuit. Thus as the degree of VSFG unrolling is increased, fMAX decreases accordingly. The frequency values for the spatial hardware are thus often well below the achievable fMAX for the carefully optimized Nios II/f design.

Figure 6.6b provides a comparison of the wall-clock execution time for all of the hardware configurations (Execution Time = Cycle Count × 1/fMAX). On average, VSFG 0 and VSFG 3 demonstrate a 19% and 32% performance advantage over LegUp respectively. However, on a per-benchmark basis, it is important to note that the reduction in operating frequency can often overwhelm the ILP/cycle-count advantage provided by loop-unrolling, as is the case for the mips benchmark. This is another reason why static memory disambiguation is an essential feature of modern HLS tools – it not only enables memory-level parallelism at runtime, but may also serve to partition physical memory, and hence the memory access network/arbitration tree, into smaller subsets. Instead of being connected to a single memory controller, memory operations may be divided amongst multiple controllers, one for each static partition, thereby simplifying the interconnect and improving fMAX.
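A hedged C sketch of the idea (the mapping of arrays to controllers is illustrative, not a feature of the current toolchain): when alias analysis can prove that two groups of accesses touch disjoint arrays, each group can be served by its own controller and BRAM partition instead of one shared arbitration tree.

    /* The two loops provably touch disjoint arrays (restrict makes the
       non-aliasing explicit), so their memory operations could be assigned
       to two separate controllers/BRAM partitions rather than one shared,
       centralized arbitration tree. */
    void scale_two_buffers(int *restrict a, int *restrict b, int n, int k) {
        for (int i = 0; i < n; i++)        /* accesses only a[] -> controller 0 */
            a[i] *= k;
        for (int i = 0; i < n; i++)        /* accesses only b[] -> controller 1 */
            b[i] += k;
    }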

Other projects have optimized the memory access network in custom hardware for high frequency operation by either partitioning and distributing memory [HRRJ07] in this manner, or by pipelining the memory access network, while simultaneously optimizing for the most frequent accesses [VBCG06], or even by incorporating cache-like structures closer to each memory access operation [PBD+08, SVGH+11]. Thus in addition to incorporating memory disambiguation for parallelism, it is also necessary to consider the architecture and implementation of the centralized memory resource in spatial hardware. For the purpose of this dissertation, this effort is left for future work.

The LegUp tool deals with reduced fMAX in a different manner. Instead of attempting to identify and pipeline the combinational critical path in a circuit, LegUp applies operation-chaining: reducing the cycle-count for hardware execution by removing registers between operations. This reduces the degree of pipelining in the circuit, but so long as fMAX is not adversely affected, this translates into a direct improvement in execution time. A rudimentary version of operation-chaining was also incorporated into the VSFG HLS toolchain, and demonstrated similar improvements in cycle counts for a few of the benchmarks, without adversely affecting fMAX. Unfortunately, my current implementation of operation-chaining is not yet robust enough to compile successfully for all benchmarks, thus those results are not provided here.

6.2.3 Resource Requirements

Recall from Section 5.3 that during the compilation from the VSFG IR to the Bluespec HDL, each VSFG transition is implemented as a set of one or more Bluespec rules, while each place (and its corresponding acknowledgement place) is implemented as a 1-place FIFO. Operations/transitions that have a fan-out of n > 1 will therefore implement n 1-place FIFOs, one for each fan-out value. In the actual hardware implementation, this can be quite wasteful, as the registers within each FIFO will store identical values. This complicates providing a fair comparison of the area, energy and power requirements between the circuits generated by LegUp and those from the VSFG HLS toolchain.

A possible means of overcoming this is to store each output value in a separate register, and then utilize void-type FIFOs to implement the Petri-net / static-dataflow style flow control. The void FIFOs do not incorporate registers to store any values, and only implement the state machine for performing the necessary flow control. Unfortunately, Bluespec currently remains an experimental compiler, and the VSFG HLS toolchain generates very large Bluespec modules for the selected benchmarks, often containing as many as 600-700 rules, thus pushing Bluespec to its limits. Implementing this optimization in my VSFG-to-Bluespec compilation pass caused the Bluespec compiler to fail during Verilog generation, despite the fact that no syntax or structural errors were reported. Of the eight selected benchmarks, only one was able to compile successfully to Verilog.

Thus, an alternative means of fairly evaluating area, energy and power for the VSFG-based hardware had to be devised: the area, energy and power metrics generated for the VSFG hardware would have to be scaled by the estimated overhead incurred in the implementation due to the additional fanout FIFOs. For FPGA implementations, the number of 'programmable logic blocks' or 'look-up tables' utilized is used as a measure of the area or resource requirements of a given circuit. The structure of a typical Altera Stratix IV FPGA 'Adaptive Look-up Table' is shown in Figure 6.7. The hardware generated by the VSFG HLS toolchain will utilize a greater number of these ALUTs than equivalent hardware generated by the LegUp HLS tool, not because the VSFG requires any additional combinational logic, but because each instruction/operation/transition will require n sets of output registers instead of just 1 (where n is the fanout of a given instruction). Since there is only a single 1-bit register associated with each bit of output from the combinational logic, the VSFG must make use of n − 1 additional ALUTs only for their registers, in order to implement the required fanout FIFOs. I attempt to account for this effect when presenting the area, power and energy results for the VSFG-based hardware, by computing a scaling-factor for each benchmark.

To generate this scaling factor, the LLVM-to-VSFG compilation pass was augmented to compute two new values for each benchmark. Assuming there are k instructions/operations in a benchmark:

• Instruction-bits is the total number of output values produced in the circuit by its constituent operations. It is the sum of the bit-width of the result of each operation in the circuit:

      Instruction Bits = Σ_{i=1}^{k} (bit-width)_i

• Instruction Fanout-bits is the total number of 1-bit registers needed to store the instruction-bits, taking into account the fanout of each instruction-bit:

      Instruction Fanout-bits = Σ_{i=1}^{k} (bit-width)_i × (fanout)_i

Figure 6.7: An Altera Stratix IV FPGA 'Adaptive Look-up Table' design.

The scaling factor may then be computed as follows:

      Scaling Factor = Instruction Fanout-bits / Instruction Bits

This Scaling Factor represents the degree to which the fanout of all instructions in a VSFG-based hardware circuit will increase its ALUT utilization on an FPGA. Table 6.5 shows the Instruction bits, the Instruction Fanout-bits, and the Scaling Factor for all the benchmarks, while Table 6.6 shows the resource requirements for 7 of our 8 benchmarks in ALUTs, before and after applying the scaling factor to the ALUT count of each VSFG implementation7. The VSFG area results are for the circuits implemented with predicated subgraphs, since these exhibited the best average cycle-count results.
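The following small C sketch (illustrative types and names, not the actual compiler pass) shows how the two counts and the resulting scaling factor can be accumulated over a benchmark's operations:

    /* Accumulate Instruction Bits and Instruction Fanout-bits over the k
       operations of a circuit and return their ratio, the Scaling Factor. */
    typedef struct { int bit_width; int fanout; } op_info_t;   /* illustrative */

    double scaling_factor(const op_info_t *ops, int k) {
        long bits = 0, fanout_bits = 0;
        for (int i = 0; i < k; i++) {
            bits        += ops[i].bit_width;
            fanout_bits += (long)ops[i].bit_width * ops[i].fanout;
        }
        /* e.g. epic (Table 6.5): 1069 / 590 = 1.81 */
        return (double)fanout_bits / (double)bits;
    }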

Table 6.5: Table showing the Instruction bits, the Instruction Fanout-bits, and the Scaling Factor for all benchmarks.

          Instruction-bits   Instruction Fanout-bits   Scaling Factor
epic      590                1069                      1.81
adpcm     40767              55400                     1.36
dfadd     15673              28040                     1.79
dfdiv     16337              26759                     1.64
dfmul     10320              16640                     1.61
dfsin     38759              67376                     1.74
mips      15134              26790                     1.77
bimpa     14488              20007                     1.38

Figure 6.8 shows the scaled results from Table 6.6, normalized to the LegUp circuit sizes. As can be seen, the VSFG 0 configuration has a similar resource requirement to the LegUp baseline, incurring a 15% penalty on average.

7Results for bimpa are not presented as it was too large to fit in the largest available FPGA, so its resource requirements could not be estimated. The VSFG 3 results for the adpcm and dfsin benchmarks are unavailable for the same reason.

142 Table 6.6: Table showing the resource requirements (in number of ALUTs) of the generated custom hardware from both the LegUp and VSFG HLS toolchains. Area is shown for the best performing version of the VSFG hardware (i.e. predicated subgraphs).

          Raw ALUT Counts                           Sc.        Scaled ALUT Counts
          LegUp    VSFG 0   VSFG 1   VSFG 3        Factor     VSFG 0   VSFG 1   VSFG 3
epic      764      1569     4932     16575         1.81       866      2722     9148
adpcm     17214    44268    81087    –             1.36       32575    59669    –
dfadd     6382     10098    19217    38048         1.79       5644     10741    21267
dfdiv     9843     14545    29849    68085         1.64       8880     18224    41567
dfmul     3088     5382     9556     18162         1.61       3338     5927     11264
dfsin     18805    36347    128487   –             1.74       20909    73914    –
mips      3533     7993     14299    27000         1.77       4515     8078     15253

Figure 6.8: Resource requirements comparison for VSFG-based hardware compared to LegUp hardware. VSFG results have been scaled to remove the effects of extra fanout FIFOs.

The key reason for this is that the VSFG must implement additional hardware operations in order to perform control-directed dataflow and hierarchical subgraph communication in a dynamically scheduled environment – MUX, signal, wait, eta, mu, inGate, and outGate operations must be implemented as part of the dynamically scheduled VSFG hardware. Additional resources are also required to implement the boolean predicate expressions, as well as the state-token being passed between state operations. The statically-scheduled LegUp hardware also implements control-directed dataflow operations, equivalent to our MUX, eta, and mu; however, these are implemented only as combinational logic, not requiring their own output registers, and thus also not always requiring their own ALUTs in the FPGA. Conversely, LegUp hardware must implement its static schedule as a centralized, sometimes large and complex FSM that consumes additional area, whereas the static-dataflow VSFG hardware implements very simple, highly decentralized FSMs.

As the degree of unrolling is increased for the VSFG, its resource requirements grow dramatically – the VSFG 3 configuration requires an average of almost 5× more resources than the baseline LegUp implementation with no unrolling. The epic benchmark requires almost 12× more resources than the LegUp baseline, as it effectively implements 16 copies of the inner-loop, containing the resource-intensive modulus operation. In addition to the reduction in operating frequency experienced by larger circuits, as discussed earlier, managing resource utilization is another reason why unrolling of loops must be implemented more judiciously, perhaps by relying on execution profiling of the code, in order to identify which loops would benefit from unrolling, and to what degree. As mentioned earlier, epic exhibits no inner-loop parallelism, thus only the outer-loop ought to be unrolled, which would have improved both resource requirements and fMAX, as well as reduced power dissipation and improved energy efficiency, as discussed below.

6.2.4 Power and Energy

Power Dissipation: Post-synthesis power estimation for each of the hardware circuits considered was performed using the PowerPlay Power Analysis tool built into the Altera Quartus II ver. 13 compiler. In order to achieve high accuracy, Value Change Dump (.vcd) files were generated for all nets within each circuit evaluated, from post-synthesis functional simulation using ModelSim. These '.vcd' files were provided as input to the PowerPlay Power Analysis tool. According to the Altera Quartus II Development Software Handbook, these results are expected to be accurate to within ±20% of a physical FPGA implementation of a circuit [Alt13].

All power results are given in milliwatts (mW). Power values are generated assuming a standard 250 MHz operating frequency for all circuits, as well as for the Nios II/f and the extrapolated Core i7. Given prior work on pipelining and/or partitioning the memory arbitration tree for high frequency operation [HRRJ07, VBCG06, PBD+08, SVGH+11], this is not an unreasonable assumption, and it is helpful in simplifying our power and energy comparison. Reported power results have two main components: static and dynamic power dissipation. Equations 6.1 and 6.2 present the general formulae for static and dynamic power, respectively. In the equations, VDD is the supply voltage, Ileakage is the leakage current, kdesign is a MOS fabrication technology dependent constant, f is the operating frequency, C is the average loading capacitance per transistor/resource, n is the total number of transistors/resources, and α is the average activity ratio for all transistors.

    Pstatic = n · VDD · Ileakage · kdesign      (6.1)

    Pdynamic = n · α · C · VDD² · f      (6.2)

For our FPGA-based evaluation, VDD, Ileakage, and kdesign can be assumed to be common across all tested circuits. Therefore, static power is proportional to n, i.e. the area or ALUT resource requirements of a circuit. Dynamic power is primarily due to logic transitions, signal transmission and switching activity, thus is primarily dependent upon C, α, f, and n. PowerPlay presents dynamic power as composed of two components: (1) combinational logic, and (2) clock power. Dynamic clock power accounts for power dissipated by the FPGA clock tree, as well as all switching activity at the datapath registers in a given circuit. Table 6.7 presents the raw static and dynamic power results reported for the circuits generated by LegUp8. The constituents of dynamic power are shown along with their total. As can be seen, in such deeply pipelined spatial hardware, clock-tree and register power is significantly greater than the power dissipated by the combinational LUT logic.

8Note that power results are not presented for bimpa, as this benchmark could not be synthesized successfully as it did not fit in the largest available Stratix IV FPGA.

144 Table 6.7: Table showing Raw Static and Dynamic Power Results (mW) for LegUp

          Raw Dynamic Power (mW)          Static Power (mW)                                   Total (mW)
          Clock     Comb.    Total        Raw       Used ALUTs   Total ALUTs   Actual
epic      33.2      1.2      34.4         813.59    764          182400        3.41          37.81
adpcm     1166.37   146.16   1312.53      880.77    17214        182400        83.12         1395.65
dfadd     271.23    2.74     273.97       830.29    6382         182400        29.05         303.02
dfdiv     454.7     3.26     457.96       844.3     9843         182400        45.56         503.52
dfmul     158.62    6.15     164.77       822.46    3088         182400        13.92         178.69
dfsin     586.3     13.75    600.05       873.95    18805        182400        90.10         690.15
mips      290.67    1.55     292.22       826.73    3533         182400        16.01         308.23

For comparison, the Nios II/f processor exhibits a total power dissipation of 738.17 mW, of which 725.53 mW is dynamic power, and 12.63 mW is the actual static power. PowerPlay reports the static power dissipated by the entire FPGA, thus the actual static power utilized by a circuit is determined by scaling this value by the proportion of the FPGA being utilized by the circuit. The actual static power is thus calculated as follows:

      Actual Static Power = Raw Static Power × (Used ALUTs / Total ALUTs)

The total power reported in the last column is the sum of the total dynamic and actual static power values. Tables 6.8, 6.9, and 6.10 present the power values for the VSFG 0, VSFG 1 and VSFG 3 (with predicated subgraphs) respectively9. As with LegUp, combinational dynamic power forms only a small fraction of the total dynamic power. In addition to scaling static power by the circuit size, we must also adjust it by using the scaling factors provided in Table 6.5, in order to account for the overhead in VSFG circuits for the extra fanout FIFOs. Since static power is proportional to the amount of resources utilized n, it can be scaled as follows:

      Scaled VSFG Static Power = (Raw VSFG Static Power / Scaling Factor) × (Used ALUTs / Total ALUTs)

For dynamic power, I assume that the amount of combinational logic is equivalent between the VSFG and LegUp-based hardware, and thus the combinational component of dynamic power need not be scaled. However, the clock component of dynamic power must be scaled to remove the effects of the extra fanout FIFOs. For simplicity, I assume that C, α, and f would remain unchanged in VSFG circuits that are implemented without the extra fanout FIFOs10. Thus dynamic clock power can also be scaled proportionally to the number of registers in the circuit. Scaled dynamic power is therefore given as:

      Scaled VSFG Dynamic Power = (Raw VSFG Clock Power / Scaling Factor) + Raw VSFG Comb. Power
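As a hedged check of this arithmetic, the C snippet below reproduces the adpcm row of Table 6.8, using the unrounded scaling factor 55400/40767 from Table 6.5 (all constants are taken from the tables; small remaining differences are rounding):

    #include <stdio.h>

    int main(void) {
        /* adpcm, VSFG 0 (Tables 6.5 and 6.8); all power values in mW. */
        double clk        = 2531.17;              /* raw dynamic clock power      */
        double comb       = 31.61;                /* raw dynamic combinational    */
        double raw_static = 968.15;               /* whole-FPGA static power      */
        double sf         = 55400.0 / 40767.0;    /* fanout scaling factor (1.36) */
        double used       = 44268.0, total = 182400.0;   /* ALUT counts           */

        double scaled_dyn    = clk / sf + comb;                    /* ~1894.21 mW */
        double scaled_static = raw_static / sf * (used / total);   /* ~172.91 mW  */
        printf("total = %.2f mW\n", scaled_dyn + scaled_static);   /* ~2067.12 mW */
        return 0;
    }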

9As with LegUp, results are not presented for bimpa. In addition, VSFG 3 results for adpcm and dfsin are also not presented, as these circuits were also too large to be successfully synthesized to the available Stratix IV FPGAs.
10Note that the average loading capacitance C would also be reduced with the reduced number of fanout FIFOs, as well as due to the total circuit size being reduced, leading to shorter transmission distances. However, I ignore this improvement for simplicity.

145 Table 6.8: Table showing Static and Dynamic Power Results for VSFG 0 (mW)

          Raw Dyn. Power (mW)             Scaled Dyn. Power (mW)     Static Power (mW)                                  Total (mW)
          Clock     Comb.    Sc. Factor   Scaled Clk.   Total        Raw       Used ALUTs   Total ALUTs   Scaled
epic      74.99     –        1.81         41.39         41.39        814.71    1569         182400        3.87        45.26
adpcm     2531.17   31.61    1.36         1862.60       1894.21      968.15    44268        182400        172.90      2067.12
dfadd     2057.31   4.18     1.79         1149.94       1154.12      879.38    10098        182400        27.21       1181.33
dfdiv     570.27    9.13     1.64         348.16        357.29       850.63    14545        182400        41.41       398.71
dfmul     764.07    23.8     1.61         473.87        497.67       838.95    5382         182400        15.35       513.02
dfsin     2679.5    54.78    1.74         1541.42       1596.20      951.45    36347        182400        109.07      1705.27
mips      1017.84   0.08     1.77         574.99        575.07       850.28    7993         182400        21.05       596.12

Table 6.9: Table showing Static and Dynamic Power Results for VSFG 1 (mW)

          Raw Dyn. Power (mW)             Scaled Dyn. Power (mW)     Static Power (mW)                                  Total (mW)
          Clock     Comb.    Sc. Factor   Scaled Clk.   Total        Raw       Used ALUTs   Total ALUTs   Scaled
epic      178.98    5.33     1.81         98.78         104.11       817.54    4932         182400        12.20       116.31
adpcm     6117.88   36.84    1.36         4501.94       4538.78      1176.55   81087        182400        384.89      4923.67
dfadd     4007.14   17.63    1.79         2239.80       2257.43      953.32    19217        182400        56.14       2313.57
dfdiv     1240.34   22.29    1.64         757.26        779.55       899.45    29849        182400        89.86       869.41
dfmul     1576.3    63.54    1.61         977.61        1041.15      869.5     9556         182400        28.25       1069.40
dfsin     4157.85   63.89    1.74         2391.86       2455.75      1110.81   128487       182400        450.13      2905.89
mips      1394.74   0.24     1.77         787.91        788.15       868.37    14299        182400        38.46       826.60

Table 6.10: Table showing Static and Dynamic Power Results for VSFG 3 (mW)

          Raw Dyn. Power (mW)             Scaled Dyn. Power (mW)     Static Power (mW)                                  Total (mW)
          Clock     Comb.    Sc. Factor   Scaled Clk.   Total        Raw       Used ALUTs   Total ALUTs   Scaled
epic      688.31    11.7     1.81         379.89        391.59       844.56    16575        182400        42.36       433.95
dfadd     4585.13   23.44    1.79         2562.87       2586.31      1000.93   38048        182400        116.70      2703.01
dfdiv     2500.87   47.01    1.64         1526.84       1573.85      1012.24   68085        182400        230.68      1804.53
dfmul     2745.6    121.15   1.61         1702.80       1823.95      918.18    18162        182400        56.70       1880.65
mips      2251.99   0.41     1.77         1272.18       1272.59      911.45    27000        182400        76.22       1348.80

Figure 6.9a shows a comparison of the total power values of LegUp from Table 6.7 with the scaled total power values for VSFG circuits from Tables 6.8, 6.9, and 6.10, normalized to LegUp. In order to understand the impact of aggressive speculative execution on power dissipation in VSFG-based spatial hardware, I instrumented both the VSFG HLS toolchain and the generated VSFG 0 hardware to estimate the amount of circuit activity overhead due to speculative execution in the baseline VSFG. This was accomplished in two steps:

1. First, in the LLVM-to-VSFG compiler pass, I associate a true activity count and a false activity count with each basic block predicate expression in the VSFG. This is done in a similar manner to the instrumentation performed in Section 6.2.3. The true activity count represents the total number of instruction-bits that might switch each time the basic-block predicate holds, while the false activity count represents the total number of instruction-bits that might toggle each time the predicate is false, i.e. the activity of all the predicate promoted operations within the original basic block that are speculatively executed.

2. Next, in the VSFG-to-Bluespec generation phase, two additional accumulators are associated with each predicate expression in the generated hardware, labelled valid activity bits and mis-speculated activity bits (a sketch of this bookkeeping follows this list). Each time a predicate expression holds in the circuit, it adds its true activity count value to its valid activity bits register. Conversely, each time a predicate expression is false, it adds its false activity count value to its mis-speculated activity bits register. At the end of benchmark execution, all valid and mis-speculated activity bits registers are accumulated into two global valid and mis-speculated activity registers.
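A hedged C model of this bookkeeping (the real accumulators live in the generated Bluespec hardware; the names and types here are illustrative only):

    /* Per-predicate accumulators: each evaluation adds either the statically
       computed true-activity count (useful work) or the false-activity count
       (work that was executed speculatively and turned out to be unnecessary). */
    typedef struct {
        long true_activity_bits;     /* bits switched when the predicate holds   */
        long false_activity_bits;    /* bits switched when it does not           */
        long valid_bits;             /* running total of useful activity         */
        long misspeculated_bits;     /* running total of mis-speculated activity */
    } predicate_counter_t;

    static void record_evaluation(predicate_counter_t *p, int predicate_holds) {
        if (predicate_holds)
            p->valid_bits         += p->true_activity_bits;
        else
            p->misspeculated_bits += p->false_activity_bits;
    }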

These values provide a measure of the degree of misspeculation overhead incurred by the VSFG 0 hardware. Table 6.11 shows the measured useful and misspeculated activity for 6 of the 8 benchmarks11. The last column shows what fraction of total activity was useful, i.e. not misspeculated. As can be seen, some benchmarks like adpcm and epic exhibit low speculation overhead, while for the remainder, a large fraction of the activity is due to mis-speculated execution. Figure 6.9b presents the misspeculated activity overheads graphically, normalized to the useful activity (green region) in each benchmark.

Table 6.11: Table showing Useful, Misspeculated, and Total activity in Instruction-bits for the VSFG 0

          Activity in instruction-bits
          Useful     Mis-speculated   Total      Useful / Total
epic      7478389    1617588          9095977    0.82
adpcm     3504294    100992           3605286    0.97
dfadd     122651     586432           709083     0.17
dfdiv     93578      236223           329801     0.28
dfmul     59796      137056           196852     0.30
dfsin     3116600    6493479          9610079    0.32

11Unfortunately, the Bluespec compiler consistently crashed when mips was compiled after adding the mentioned hardware instrumentation.

Figure 6.9: Estimated Power Dissipation Comparison vs LegUp. The overheads for the VSFG 0 are proportional to the increase in activity due to aggressive speculation. (a) Power dissipation comparison of VSFG, normalized to LegUp. (b) Activity overheads due to aggressive speculation in the VSFG 0.

Comparing Figures 6.9b and 6.9a, we see that the VSFG 0 mostly exhibits higher power dissipation than LegUp, echoing the pattern of speculation overhead in Figure 6.9b: dfadd, for instance, has both the highest speculation overhead and the highest normalized power dissipation for VSFG 0, while epic and adpcm have much lower values of both. This again emphasizes the need to carefully balance sub-graph predication against speculation. Existing work on profiling-driven hyperblock formation [MLC+92] could be adapted in the future to implement careful subgraph extraction and/or refactoring in the VSFG HLS toolchain. For now, however, the exploration of how far the energy overheads of speculation can be mitigated without compromising performance is left for future work.

As we increase the degree of unrolling in the VSFG 1 and VSFG 3 configurations, power dissipation increases significantly, despite the fact that circuit activity per unit area should decrease. The additional increase in power dissipation for VSFG 1 and VSFG 3 is driven primarily by non-computational overheads: dynamic power increases because clocking overhead grows in proportion to the number of registers in the circuit, while static power increases in proportion to the increased circuit size.

These results highlight the need to apply effective clock and power gating to the various subgraphs when implementing the VSFG in custom hardware. One extreme solution is the use of asynchronous logic, as done by Budiu et al. for their spatial computation implementations [BVCG04], allowing for a drastic reduction in dynamic power dissipation since there is no global clock tree driving all registers. Additionally, the degree of loop unrolling should be carefully balanced against the power overheads that may be incurred in addition to the area cost.
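To illustrate the two trends just described, consider the following first-order model (purely illustrative; every coefficient below is a hypothetical placeholder rather than a value measured in this evaluation), which separates a clocking-related dynamic term that scales with register count from a static term that scales with circuit area:

# First-order power model separating the two overheads discussed above.
# All coefficients are hypothetical placeholders, not measured values.

def estimate_power_mw(num_registers, area_luts, clock_mhz,
                      dyn_uw_per_reg_per_mhz=0.02,  # clock tree + register toggling
                      static_uw_per_lut=1.0):       # leakage per unit of area
    dynamic_mw = num_registers * clock_mhz * dyn_uw_per_reg_per_mhz / 1000.0
    static_mw = area_luts * static_uw_per_lut / 1000.0
    return dynamic_mw + static_mw

# Unrolling that doubles both register count and area roughly doubles power,
# even if the useful activity per register falls.
print(estimate_power_mw(20_000, 50_000, 250))    # baseline-sized circuit
print(estimate_power_mw(40_000, 100_000, 250))   # after aggressive unrolling

Under such a model, clock gating idle subgraphs shrinks the dynamic term, while power gating them attacks the static term.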

Energy Cost: VSFG-based hardware is able to achieve high sequential performance, approaching that of a simulated Intel Nehalem Core i7 processor, as described in Section 6.2.1. For this advantage, the unrolled VSFG 1 and VSFG 3 implementations exhibit anywhere from 2–10× higher power dissipation than the baseline LegUp hardware. In order to evaluate the energy cost of the VSFG-based hardware, I combine the cycle counts for the best performing VSFG configuration from Section 6.2.1 (predicated subgraphs) with the power measurements at 250 MHz presented in Tables 6.8, 6.9, and 6.10. I also combine the cycle counts for Nios with its power dissipation of 738.17 mW at 250 MHz to estimate its energy dissipation. Finally, I estimate the energy cost of a hypothetical version of the superscalar Core i7 processor that can be implemented on an FPGA, also running at 250 MHz. This is done by referring to the empirical relationship between power and performance for sequential processors presented by Grochowski et al. [GA06]:

Power = Perf^α, where α = 1.75

Table 6.12: Average speedup of the Core i7 over the Nios II/f

              Nios II/f    Core i7     Speedup
epic          3399634      200174      16.98
adpcm         119794       42662       2.81
dfadd         16441        15994       1.03
dfdiv         36487        15120       2.41
dfmul         7074         14072       0.50
dfsin         1420558      104953      13.54
mips          31082        29998       1.04
bimpa         373347552    39664956    9.41
GEOMEAN                                3.07
BEST GEOMEAN                           8.83

Table 6.12 shows the cycle counts of the Nios II/f and the Core i7 processor taken from Table 6.1, and presents their ratio as the speedup over the Nios provided by the Core i7 for each benchmark. Computing the geometric mean of the speedups for all benchmarks, we find that the Core i7 is on average 3.07× faster than the Nios processor at the same clock frequency¹². However, as noted earlier, the dfadd, dfdiv, dfmul, and mips benchmarks represent pathological cases for the Core i7 due to their data-dependent, unpredictable branching.

¹² Given that we are assuming this hypothetical Core i7 would be implemented on an FPGA, an order of magnitude reduction in fMAX, from 2.66 GHz to 250 MHz, is a reasonable ball-park estimate.

Thus, a more realistic estimate of the average speedup, excluding these four benchmarks and considering only the remainder, is 8.83× over the Nios. Nevertheless, applying the rule above to the more conservative speedup value gives:

PowerFactor = 3.07^1.75 = 7.12

Thus, the power for our hypothetical superscalar processor is computed as:

Power = 7.12 × 738.17 mW = 5255.77 mW

Combining the cycle count and power measurements, Figure 6.10 presents an energy cost comparison between LegUp, each VSFG version, and pure software running on both the Altera Nios and the hypothetical out-of-order processor. All values are normalized to the energy of the LegUp-based hardware. The mean energy cost for the 35% higher average performance of the VSFG 1 and VSFG 3 configurations is between 3–4× that of LegUp. However, it is also about 0.25–0.3× the energy cost of the in-order Nios II/f processor, and about 6× more energy efficient on average than the out-of-order processor.

Thus, despite our blind unrolling of all loops, implementing no optimizations towards minimizing energy, and no exploitation of memory-level parallelism, the VSFG-based hardware is able to approach levels of parallelism achievable in a superscalar processor, while incurring only a fraction of the energy cost of even an in-order conventional processor. I expect further improvements in both performance and energy efficiency to be available once we incorporate (a) alias analysis to parallelize memory accesses, and (b) a more effective balance between speculation and predication.
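The arithmetic above can be reproduced directly from the cycle counts in Table 6.12. The short Python script below (illustrative only, and not part of the HLS toolchain) recomputes the geometric-mean speedups and then applies the Grochowski power law, taking α = 1.75 and the measured 738.17 mW Nios II/f power as its assumed inputs.

from math import exp, log

# Cycle counts from Table 6.12: benchmark -> (Nios II/f, Core i7).
cycles = {
    "epic":  (3399634, 200174),   "adpcm": (119794, 42662),
    "dfadd": (16441, 15994),      "dfdiv": (36487, 15120),
    "dfmul": (7074, 14072),       "dfsin": (1420558, 104953),
    "mips":  (31082, 29998),      "bimpa": (373347552, 39664956),
}

def geomean(values):
    values = list(values)
    return exp(sum(map(log, values)) / len(values))

speedups = {b: nios / i7 for b, (nios, i7) in cycles.items()}
mean_all = geomean(speedups.values())                                  # ~3.07
mean_best = geomean(s for b, s in speedups.items()
                    if b not in {"dfadd", "dfdiv", "dfmul", "mips"})   # ~8.83

# Grochowski et al. [GA06]: Power scales as Perf^alpha, with alpha = 1.75.
alpha, nios_power_mw = 1.75, 738.17
power_factor = mean_all ** alpha                                       # ~7.12
i7_power_mw = power_factor * nios_power_mw                             # ~5256 mW

print(f"geomean={mean_all:.2f}  best={mean_best:.2f}  "
      f"factor={power_factor:.2f}  i7_power={i7_power_mw:.0f} mW")

Multiplying each processor's power by a benchmark's cycle count divided by the 250 MHz clock then yields the per-benchmark energy figures that Figure 6.10 normalizes to the LegUp hardware.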

Figure 6.10: Energy Consumption comparison: VSFG vs LegUp vs Altera Nios II/f vs a hypothetical out-of-order processor based on the Core i7.

6.3 Estimating ILP

In order to more clearly illustrate the potential advantage of the VSFG in exposing instruction-level parallelism, this section presents an estimate of the average instructions executed per cycle (IPC) achieved by each of the implementations. Because it is difficult to make a fair comparison of IPC across the differing instruction sets (LLVM IR for LegUp and the VSFG, x86 for the Core i7, and Altera's proprietary ISA for the Nios processor), I have instead chosen to approximate IPC by dividing the number of useful¹³ LLVM instructions expected to execute within each benchmark by the cycle-count results obtained for each type of processing element in Section 6.2.1.

¹³ i.e. excluding the mis-speculatively executed instructions

A similar form of instrumentation to that described in Section 6.2.4 (used to generate Figure 6.9b) was used to count the number of useful instructions executed by each benchmark.
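As a concrete illustration of this approximation (the counts below are made-up placeholders, not measured values from the benchmarks):

def estimated_ipc(useful_llvm_instructions, cycles):
    # Approximate IPC as useful (non-mis-speculated) LLVM instructions
    # divided by the measured cycle count for a given implementation.
    return useful_llvm_instructions / cycles

# Hypothetical example: 1,000,000 useful LLVM instructions completed in
# 400,000 cycles gives an estimated IPC of 2.5.
print(estimated_ipc(1_000_000, 400_000))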

Figure 6.11: Estimated IPC achieved by the LegUp and VSFG (predicated subgraphs) implementations, as well as the Nios II/f and Core i7 processors.

The estimated IPC results are presented in Figure 6.11. Unfortunately these results could only be generated for six of the eight benchmarks, due to the toolchain's aforementioned fragility. As can be seen, increased loop unrolling typically allows the VSFG-based implementations to achieve better ILP than LegUp-generated hardware, and far better than the in-order Nios. For the epic and adpcm benchmarks, the Core i7 still retains an advantage over the best VSFG version due to the single state-edge constraint on parallelism in the latter, while all versions of the VSFG exhibit better IPC than both processors for dfsin.

6.4 Summary

This chapter presented a performance, area, power and energy comparison of hardware generated from the VSFG-based high-level synthesis toolchain against hardware generated using LegUp. In addition, a performance and energy comparison was presented against two types of conventional processors: a simple, in-order Altera Nios II/f soft-processor, and a simulated hypothetical out-of-order soft-processor based on the Intel Nehalem Core i7.

The VSFG-based hardware demonstrated both better cycle-count performance and higher operating frequency than hardware generated by LegUp, a state-of-the-art academic high-level synthesis tool. Furthermore, with multi-level loop unrolling, the static-dataflow VSFG hardware is able to match, if not exceed, the cycle-count performance of the Intel Nehalem Core i7 processor on control-intensive sequential code, except in cases of memory-intensive code, where VSFG performance is limited by the strict sequentialization it currently imposes on all memory operations.

For this performance advantage, the best performing VSFG-based hardware incurs an average energy dissipation cost 3–4× greater than LegUp-generated hardware, while the baseline VSFG 0 version incurs an average 2× greater energy cost. However, I consider this a worst-case cost for the selected benchmarks, as no optimizations have yet been applied to the VSFG to improve energy efficiency, nor have any memory disambiguation mechanisms been incorporated to improve memory-level parallelism. For future work on utilizing the VSFG to compile to spatial architectures, the following optimizations are considered essential:

• Efficiency would be improved by carefully balancing between speculation and predicated execution in the VSFG. One could start by adapting prior work on profile-directed hyperblock formation [MLC+92], applying it to re-factor VSFG subgraphs and hierarchy, and then selectively applying predicate promotion to the most frequently executing subgraphs.

• Utilizing profiling to identify and selectively unroll loops would help improve both area and energy efficiency.

• Incorporating static memory disambiguation is essential for improving the performance scaling of the VSFG, particularly for memory-intensive benchmarks. Performance improved in this way would also translate into lower energy requirements, as no additional power dissipation overhead is incurred from static memory disambiguation.

• Incorporating dynamic memory disambiguation should also be explored, as done previously by Budiu et al. [Bud03, BVCG04]. However, the increased power dissipation due to dynamic memory disambiguation would need to be balanced against the performance (and thus indirectly energy) improvements that it may provide.

Nevertheless, despite the lack of the aforementioned optimizations, these results demonstrate the ILP-exposing potential of the VSFG intermediate representation. Through its aggressive support for speculative execution, control-dependence analysis, and exploitation of multiple flows of control, the VSFG IR has been demonstrated to be a strong candidate for addressing the amenability issue with spatial computation. ILP is exposed statically within the representation, from sequential, control-intensive code, meaning that there is less need to incur the energy cost of dynamic mechanisms for overcoming control flow, as in earlier work on spatial architectures such as TRIPS [SNL+04] and DySER [GHN+12, GHS11].

Chapter 7 concludes this dissertation and discusses potential directions for future work based on the VSFG IR. To mitigate the effects of dark silicon, of particular interest would be the development of an energy-efficient spatial architecture capable of not only accelerating numeric applications, but also matching, if not exceeding, superscalar performance on sequential applications, while retaining its orders-of-magnitude energy-efficiency advantage.

CHAPTER 7

Conclusions and Future Work

Computer architects are facing several critical issues that must be addressed in order to continue scaling performance with Moore's Law. In particular, the end of Dennardian scaling has led to the Utilization Wall: for a fixed power budget, with each technology generation, only an ever diminishing proportion of on-chip resources can be activated at any time. Also, due to the difficulty of parallelizing applications for shared-memory multicore processors, Amdahl's law further limits the speedups achievable in the manycore era. Together, the Utilization Wall and Amdahl's Law result in the problem of Dark Silicon [EBSA+11], which constrains performance scaling far below the promise of Moore's Law.

To mitigate the effects of the Utilization Wall, architects are increasingly looking towards spatial computation (application-specific custom hardware in particular), due to its potential for orders-of-magnitude improvements in energy efficiency over conventional processors. Unfortunately, while it achieves very high performance on numeric and multimedia-style applications, spatial computation exhibits poor performance on sequential, control-intensive code relative to conventional processors. This means that despite the high efficiency, Amdahl's law still restricts achievable speed-ups.

In this dissertation, my goal has been to improve the performance of sequential, control-intensive code when implemented as spatial computation. Furthermore, this was to be achieved without overly compromising the inherent advantages of spatial computation, namely computational density, scalability, architectural simplicity, and of course high energy efficiency. Noting that control flow was the primary constraint on achieving higher sequential performance [LW92], I devised a new compiler IR for spatial computation based on the Value State Dependence Graph [Law07, Joh04], and presented eager, dataflow execution semantics for it. This Value State Flow Graph elides most control flow from the original imperative program representation, instead emphasizing only the true dataflow and state-ordering dependencies that must be respected for correct program execution. The VSFG also enables aggressive control-dependence analysis, and is capable of exposing and exploiting multiple flows of control.

As a case study for evaluating this new representation, I developed a high-level synthesis tool that compiles imperative code to the VSFG, then implements it as static-dataflow custom hardware. Hardware generated by this toolchain was compared with an existing, established academic high-level synthesis tool called LegUp, as well as with a simple in-order Altera Nios II/f soft processor and a simulated Intel Nehalem Core i7 processor. The results show that the VSFG-based hardware achieves a performance improvement of as much as 35% over LegUp, at a 3× higher average energy cost. Performance (in cycle counts) is comparable to that of the Core i7, while incurring only a fraction of the energy cost (1/4× the energy cost of an in-order Altera Nios II/f soft-processor, and 1/8× the energy cost of an extrapolated Core i7-like soft-processor).

7.1 Future Work

7.1.1 Incremental Enhancements

These results for the VSFG-based hardware are particularly promising given that they are achieved without any significant optimizations to the VSFG IR and hardware that could further improve both efficiency and performance. Listed below are some of the enhancements and optimizations suggested in this dissertation to further improve performance, energy efficiency, and/or coverage of imperative language features:

• Hybridization of the VSFG-S IR (and associated toolchain) with the dynamic-dataflow execution model, to enable support for true function calls and general recursion.

• Incorporation of alias-analysis-based static memory disambiguation within the LLVM-to-VSFG compiler, in order to expose memory-level parallelism at compile time, thereby improving run-time performance as well as energy efficiency.

• Incorporation of dynamic memory disambiguation hardware in the memory infrastructure to expose further memory-level parallelism at run time. This would further improve performance, though its energy impact must be evaluated.

• Optimization of the memory-access network in hardware. Pipelining the memory-access network, as well as introducing local cachelets closer to memory operations, would also improve both performance and energy efficiency [HRRJ07, VBCG06, PBD+08, SVGH+11].

7.1.2 Mitigating the Effects of Dark Silicon

Spatial computation architectures such as Coarse-Grained Reconfigurable Arrays and Massively Parallel Processor Arrays exhibit significant advantages over conventional processors: they are far more scalable and energy-efficient, and have higher computational density and lower design and verification costs. With worsening wire scaling [HMMH01] and the dark silicon problem, spatial architectures represent a potentially revolutionary improvement in computer architecture.

By developing the VSFG-S compiler IR specifically with the goal of implementing conventional, imperative programming languages onto spatial architectures while achieving high sequential performance, I have demonstrated that two of the primary hurdles to the utilization of such spatial architectures – amenability and programmability – can now be resolved. I hope that the work presented in this dissertation will facilitate more pervasive, even ubiquitous utilization of such architectures, particularly to mitigate the effects of Dark Silicon. Several important future research directions are described below to further this end:

• Develop an efficient Coarse-Grained Reconfigurable Array to directly implement VSFG-S static dataflow graphs. Ideally, the objective of this research would be to develop a complete replacement for conventional processors – similar to the WaveScalar [SSM+07] or TFlex [KSG+07] approaches – instead of implementing yet another hybrid architecture like Conservation Cores [VSG+10], DySER [GHS11] or Tartan [MCC+06] that relies on an attached conventional processor to implement part of the functionality. The recently developed Triggered Instructions architecture adopts a guarded-atomic-actions style execution model that is very similar to our static-dataflow approach, and may prove a good place to start [PPA+13].

• In addition to developing a new CGRA, the extensive existing research into the development of Massively Parallel Processor Arrays could benefit from the development of compilation back-ends via the VSFG. This would potentially allow imperative code to be distributed onto such fine-grained message-passing many-cores while still achieving high sequential performance. Examples of such architectures include Loki [Bat14], Mamba [Cha12], and the MIT RAW [TLM+04]. Existing work on compiling dataflow languages to coarse-grained Von Neumann / Dataflow hybrid architectures would be a good place to seek inspiration on compiling to MPPAs [YAMJGE13, Ian88, GTK+02]. Notable work in this area utilizes coarse-grained dataflow models like Kahn Process Networks [LP02].

• Conventional high-level synthesis tools rely on static scheduling to identify at compile time when any given operation should execute. Given this schedule, subsequent allocation and binding phases in the tool are used to implement area and energy optimizations by reusing hardware resources like adders, multipliers, registers, etc. The VSFG-based HLS toolchain described in Chapter 5 does not presently consider the issues of allocation and binding, primarily because this work would have been outside the scope of this research. Nevertheless, consideration of how allocation and binding can be incorporated into a high-level synthesis tool that relies on dynamically scheduled execution merits further study, given the demonstrated performance advantages of such hardware.

• In addition to homogeneous spatial architectures like CGRAs and MPPAs, the structure and semantics of the VSFG are well suited for describing computation using heterogeneous components. Given that the VSFG is hierarchical, and that nested subgraphs are isolated from each other, each nested subgraph may be implemented on a completely different substrate from every other. A parent graph may execute on a CGRA while its nested subgraph is implemented on an embedded processor, so long as the dataflow-style communication semantics between the two subgraphs are maintained. Further pursuing this line of thinking might provide insightful new solutions to the design specification, implementation, and utilization of modern heterogeneous system-on-chip architectures.

• Currently the VSFG is tailored for the implementation of sequential languages. However, given the increasing and urgent shift towards parallel programming models and languages, an important topic for research would be to see if the VSFG could be extended to express and implement such concurrent languages. For instance, message passing could be implemented by utilizing the existing signal and wait operations, while non-determinism could be incorporated by utilizing the mu operation beyond loop entry points, and independently of the inGate operation. Furthermore, the key function of many shared-memory parallel programming models like OpenMP and OpenCL is to allow the programmer to specify concurrency of operations. This explicit concurrency would manifest itself as a further partitioning of the VSFG state-edge beyond that already implemented by static memory disambiguation at the compiler level. Further research is warranted to assess if all features of such parallel programming models can be completely supported by extending the VSFG IR without compromising any of its inherent advantages.

The utility and impact of the VSFG-S IR can be extended dramatically if it can serve as a compilation target for multiple concurrent programming models, while also providing target implementation back-ends for custom hardware, CGRAs, MPPAs or even heterogeneous architectures.

Bibliography

[AAB+12] C. Auth, C. Allen, A. Blattner, D. Bergstrom, M. Brazier, M. Bost, M. Buehler, V. Chikarmane, T. Ghani, T. Glassman, R. Grover, W. Han, D. Hanken, M. Hattendorf, P. Hentges, R. Heussner, J. Hicks, D. Ingerly, P. Jain, S. Jaloviar, R. James, D. Jones, J. Jopling, S. Joshi, C. Kenyon, H. Liu, R. McFadden, B. Mcintyre, J. Neirynck, C. Parker, L. Pipes, I. Post, S. Pradhan, M. Prince, S. Ramey, T. Reynolds, J. Roesler, J. Sandford, J. Seiple, P. Smith, C. Thomas, D. Towner, T. Troeger, C. Weber, P. Yashar, K. Zawadzki, and K. Mistry. A 22nm High Performance and Low-Power CMOS Technology Featuring Fully-Depleted Tri-gate Transistors, Self-Aligned Contacts and High Density MIM Capacitors. In VLSI Technology (VLSIT), 2012 Symposium on, pages 131–132, June 2012.

[AC86] Arvind and David E. Culler. Dataflow Architectures. In Joseph F. Traub, Barbara J. Grosz, Butler W. Lampson, and Nils J. Nilsson, editors, Annual Review of Computer Science Vol. 1, 1986, pages 225–253. Annual Reviews Inc., Palo Alto, CA, USA, 1986.

[ACPP05] G. Ascia, V. Catania, M. Palesi, and D. Patti. Hyperblock Formation: A Power/Energy Perspective for High Performance VLIW Architectures. In Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, pages 4090–4093 Vol. 4, 2005.

[AHKB00] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pages 248–259, New York, NY, USA, 2000. ACM.

[AKPW83] J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of Control Dependence to Data Dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL ’83, pages 177–189, New York, NY, USA, 1983. ACM.

[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.

[Alt11] Altera. The Nios II Software Developer’s Handbook. In Documentation: Nios II Processor Version 11.1. Altera Corporation, 2011.

[Alt13] Altera. Section III: Power Estimation and Analysis. In Quartus II Handbook Volume 3 Version 13.1.0. Altera Corporation, Nov 2013.

[Amd67] Gene M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.

[AN90] K. Arvind and Rishiyur S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Comput., 39:300–318, March 1990.

[BAG05] M. Budiu, P.V. Artigas, and S.C. Goldstein. Dataflow: A Complement to Superscalar. In Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International Symposium on, pages 177–186, March 2005.

[Bat14] Daniel Bates. Exploiting Tightly-Coupled Cores. Technical Report UCAM- CL-TR-846, University of Cambridge, Computer Laboratory, January 2014.

[BC11] Shekhar Borkar and Andrew A. Chien. The Future of Microprocessors. Commun. ACM, 54(5):67–77, May 2011.

[BDMF10] Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. Evolution of Thread-level Parallelism in Desktop Applications. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pages 302–313, New York, NY, USA, 2010. ACM.

[BIM] Biologically Inspired, Massively Parallel Architectures Project. http://www.cl.cam.ac.uk/research/comparch/research/bimpa.html. Accessed: 2014-02-10.

[BMMR05] C. Bazeghi, F.J. Mesa-Martinez, and J. Renau. µComplexity: Estimating Processor Design Effort. In Microarchitecture, 2005. MICRO-38. Proceedings. 38th Annual IEEE/ACM International Symposium on, pages 10 pp.–, 2005.

[BS00] J. Adam Butts and Gurindar S. Sohi. A Static Power Model for Architects. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 33, pages 191–201, New York, NY, USA, 2000. ACM.

[Buc93] Joseph T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. PhD thesis, EECS Department, University of California, Berkeley, 1993.

[Bud03] Mihai Budiu. Spatial Computation. PhD thesis, Carnegie Mellon University, Computer Science Department, December 2003. Technical report CMU-CS-03-217.

[BVCG04] Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. Spatial Computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-XI, pages 14–26, New York, NY, USA, 2004. ACM.

[CCA+11] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-Level Synthesis for FPGA-based Processor/Accelerator Systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, pages 33–36, New York, NY, USA, 2011. ACM.

[CFR+91] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, October 1991.

[Cha12] Gregory A. Chadwick. Mamba: A Scalable Communication Centric Multi-threaded Processor Architecture. In Proceedings of the 2012 IEEE 30th International Conference on Computer Design (ICCD 2012), ICCD ’12, pages 277–283, Washington, DC, USA, 2012. IEEE Computer Society.

[CHE11] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulations. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2011.

[CJH+12] Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 84–93, New York, NY, USA, 2012. ACM.

[CLN+11] J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. Trans. Comp.-Aided Des. Integ. Cir. Sys., 30(4):473–491, April 2011.

[CM08] Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital Circuit. Springer Publishing Company, Incorporated, 1st edition, 2008.

[Dal02] William J. Dally. Computer Architecture is All About Interconnect. HPCA Panel, February 2002.

[DD10] Peter J. Denning and Jack B. Dennis. The Resurgence of Parallelism. Commun. ACM, 53(6):30–32, June 2010.

[DeH96] A. DeHon. Reconfigurable Architectures for General-Purpose Computing. 1996.

[DeH00] A. DeHon. The Density Advantage of Configurable Computing. Computer, 33(4):41–49, 2000.

[DeH13] André M. DeHon. Location, Location, Location: The Role of Spatial Locality in Asymptotic Energy Minimization. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’13, pages 137–146, New York, NY, USA, 2013. ACM.

[DGnY+74] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc. Design of Ion-implanted MOSFETs with Very Small Physical Dimensions. IEEE J. Solid-State Circuits, page 256, 1974.

[EBSA+11] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 365–376, New York, NY, USA, 2011. ACM.

[Fin10] Michael Fingeroff. High-Level Synthesis Blue Book. Xlibris Corporation, 2010.

[FOW87] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987.

[GA06] Ed Grochowski and Murali Annavaram. Energy per Instruction Trends in Intel® Microprocessors. 2006.

[Gao89] Guang R. Gao. Algorithmic Aspects of Balancing Techniques for Pipelined Data Flow Code Generation. J. Parallel Distrib. Comput., 6(1):39–61, February 1989.

[Gao91] Guang R. Gao. Algorithmic Aspects of Pipeline Balancing. In A Code Mapping Scheme for Dataflow Software Pipelining, volume 125 of The Kluwer International Series in Engineering and Computer Science, pages 41–59. Springer US, 1991.

[GDGN04] Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1, DATE ’04, pages 10114–, Washington, DC, USA, 2004. IEEE Computer Society.

[GGP92] G.R. Gao, R. Govindarajan, and P. Panangaden. Well-behaved Dataflow Programs for DSP Computation. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 5, pages 561–564 vol.5, Mar 1992.

[GHN+12] Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing. IEEE Micro, 32(5):38–51, September 2012.

[GHS11] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. Dynamically Specialized Datapaths for Energy Efficient Computing. In Proceedings of 17th International Conference on High Performance Computer Architecture (HPCA), 2011.

[GK07] H. Gadke and A. Koch. Comrade - A Compiler for Adaptive Computing Systems using a Novel Fast Speculation Technique. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 503–504, Aug 2007.

[GK08] Hagen Gädke and Andreas Koch. Accelerating Speculative Execution in High-Level Synthesis with Cancel Tokens. In Roger Woods, Katherine Compton, Christos Bouganis, and Pedro C. Diniz, editors, Reconfigurable Computing: Architectures, Tools and Applications, volume 4943 of Lecture Notes in Computer Science, pages 185–195. Springer Berlin Heidelberg, 2008.

[GM08a] Daniel Greenfield and Simon Moore. Implications of Electronics Technology Trends to Algorithm Design. In Proceedings of the 2008 International Conference on Visions of Computer Science: BCS International Academic Conference, VoCS’08, pages 331–342, Swinton, UK, 2008. British Computer Society.

[GM08b] Daniel Greenfield and Simon W. Moore. Fractal Communication in Software Data Dependency Graphs. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’08, pages 116–118, New York, NY, USA, 2008. ACM.

[Gov10] Madhu Sarava Govindan. E3: Energy-Efficient EDGE Architectures. PhD thesis, University of Texas, Austin, 2010.

[Gre11] Peter Greenhalgh. Big.LITTLE Processing with ARM Cortex-A15 and Cortex-A7. http://www.arm.com/files/downloads/big.LITTLE_Final.pdf, ARM Ltd., 2011.

[GTK+02] Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, and Saman Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 291–303, New York, NY, USA, 2002. ACM.

[GWK04] Wenrui Gong, Gang Wang, and Ryan Kastner. A High Performance Application Representation for Reconfigurable Systems. In Toomas P. Plaks, editor, ERSA, pages 218–224. CSREA Press, 2004.

[Ham08] James Hamilton. Cost of Power in Large-Scale Data Centers. http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDat 2008. Accessed: 2014-02-01.

[HCAA93] James Hicks, Derek Chiou, Boon Seong Ang, and Arvind. Performance Studies of Id on the Monsoon Dataflow System. J. Parallel Distrib. Comput., 18(3):273–300, July 1993.

[HLK+99] Xuejue Huang, Wen-Chin Lee, Charles Kuo, D. Hisamoto, Leland Chang, J. Kedzierski, E. Anderson, H. Takeuchi, Yang-Kyu Choi, K. Asano, V. Subramanian, Tsu-Jae King, J. Bokor, and Chenming Hu. Sub 50-nm FinFET: PMOS. pages 67–70, 1999.

[HM08] Mark D. Hill and Michael R. Marty. Amdahl’s Law in the Multicore Era. Computer, 41(7):33–38, July 2008.

[HMMH01] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. The Future of Wires. In Proceedings of the IEEE, pages 490–504, 2001.

[HP06] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.

[HQW+10] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding Sources of Inefficiency in General-purpose Chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pages 37–47, New York, NY, USA, 2010. ACM.

[HRRJ07] Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Generation of Heterogeneous Distributed Architectures for Memory-Intensive Applications Through High-Level Synthesis. IEEE Trans. Very Large Scale Integr. Syst., 15(11):1191–1204, November 2007.

[HRU+07] Wen-mei Hwu, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Sam S. Stone, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel A. Mahesri, Stephanie C. Tsao, Nacho Navarro, Steve S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. Implicitly Parallel Programming Models for Thousand-core Microprocessors. In Proceedings of the 44th Annual Design Automation Conference, DAC ’07, pages 754–759, New York, NY, USA, 2007. ACM.

[HTH+08] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. CHStone: A Benchmark Program Suite for Practical C-based High-level Synthesis. In ISCAS’08, pages 1192–1195, 2008.

[HW97] J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines, FCCM ’97, pages 12–, Washington, DC, USA, 1997. IEEE Computer Society.

[H10] Urs Hölzle. Brawny Cores Still Beat Wimpy Cores, Most of the Time. IEEE Micro, 2010.

[Ian88] R. A. Iannucci. Toward a Dataflow/Von Neumann Hybrid Architecture. In Proceedings of the 15th Annual International Symposium on Computer Architecture, ISCA ’88, pages 131–140, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[Inc10] Y Explorations (YXI) Inc. eXCite C to RTL Behavioral Synthesis 4.1(a). 2010.

[JC97] Johan Janssen and Henk Corporaal. Making Graphs Reducible with Controlled Node Splitting. ACM Trans. Program. Lang. Syst., 19(6):1031–1052, November 1997.

[Jen91] Kurt Jensen. Coloured Petri Nets: A High Level Language for System Design and Analysis. In Grzegorz Rozenberg, editor, Advances in Petri Nets 1990, volume 483 of Lecture Notes in Computer Science, pages 342–416. Springer Berlin / Heidelberg, 1991.

[JM03] Neil Johnson and Alan Mycroft. Combined Code Motion and Register Allocation Using the Value State Dependence Graph. In Proceedings of the 12th International Conference on Compiler Construction, CC’03, pages 1–16, Berlin, Heidelberg, 2003. Springer-Verlag.

[Joh04] Neil E. Johnson. Code Size Optimization for Embedded Processors. Technical Report UCAM-CL-TR-607, University of Cambridge, Computer Laboratory, November 2004.

[JSMP13] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. Utility-based Acceleration of Multithreaded Applications on Asymmetric CMPs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 154–165, New York, NY, USA, 2013. ACM.

[KBPS09] Jonathan G. Koomey, Christian Belady, Michael Patterson, and Anthony Santos. Assessing Trends Over Time in Performance, Costs, and Energy Use for Servers, 2009.

[KFJ+03] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, and Dean M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 81–, Washington, DC, USA, 2003. IEEE Computer Society.

[KNM+08] Sami Khawam, Ioannis Nousias, Mark Milward, Ying Yi, Mark Muir, and Tughrul Arslan. The Reconfigurable Instruction Cell Array. IEEE Trans. Very Large Scale Integr. Syst., 16(1):75–85, January 2008.

[KR06] Ian Kuon and Jonathan Rose. Measuring the Gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA ’06, pages 21–30, New York, NY, USA, 2006. ACM.

[KSG+07] Changkyu Kim, Simha Sethumadhavan, M. S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, and Stephen W. Keckler. Composable Lightweight Processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 381–394, Washington, DC, USA, 2007. IEEE Computer Society.

[KSP07] Srikanth Kurra, Neeraj Kumar Singh, and Preeti Ranjan Panda. The Impact of Loop Unrolling on Controller Delay in High Level Synthesis. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’07, pages 391–396, San Jose, CA, USA, 2007. EDA Consortium.

[KTR+04] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. SIGARCH Comput. Archit. News, 32(2):64–, March 2004.

[KW02] Apostolos A. Kountouris and Christophe Wolinski. Efficient Scheduling of Conditional Behaviors for High-level Synthesis. ACM Trans. Des. Autom. Electron. Syst., 7(3):380–412, July 2002.

[Lan92] William Landi. Undecidability of Static Analysis. ACM Lett. Program. Lang. Syst., 1(4):323–337, December 1992.

[Law07] Alan C. Lawrence. Optimizing Compilation with the Value State Dependence Graph. Technical Report UCAM-CL-TR-705, University of Cambridge, Computer Laboratory, December 2007.

[Lee06] Edward A. Lee. The Problem with Threads. Computer, 39(5):33–42, May 2006.

[Leg13] LegUp 3.0 Quality of Results (CHStone). http://www.legup.org:9100/perf/dashboard/overview.html, 2013. Accessed: 2013-07-14.

[LLV] LLVM Language Reference Manual. http://llvm.org/docs/LangRef.html. Accessed: 2014-02-10.

[LMR07] Anton Lokhmotov, Alan Mycroft, and Andrew Richards. Delayed Side-effects Ease Multi-core Programming. In Proceedings of the 13th International Euro-Par Conference on Parallel Processing, Euro-Par’07, pages 641–650, Berlin, Heidelberg, 2007. Springer-Verlag.

[LNC13] Xiangyang Liang, Minh Nguyen, and Hao Che. Wimpy or Brawny Cores: A Throughput Perspective. Journal of Parallel and Distributed Computing, 2013.

[LP02] Edward A. Lee and Thomas M. Parks. Readings in Hardware/Software Co-Design, chapter Dataflow Process Networks, pages 59–85. Kluwer Academic Publishers, Norwell, MA, USA, 2002.

[LPC12] Feng Li, Antoniu Pop, and Albert Cohen. Automatic Extraction of Coarse-Grained Data-Flow Threads from Imperative Programs. IEEE Micro, 32(4):19–31, 2012.

[LPN+13] Joo Hwan Lee, Kaushik Patel, Nimit Nigania, Hyojong Kim, and Hyesoon Kim. OpenCL Performance Evaluation on Modern Multi Core CPUs. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’13, pages 1177–1185, Washington, DC, USA, 2013. IEEE Computer Society.

[LW92] Monica S. Lam and Robert P. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA ’92, pages 46–57, New York, NY, USA, 1992. ACM.

[MCC+06] M. Mishra, T.J. Callahan, T. Chelcea, G. Venkataramani, S.C. Goldstein, and M. Budiu. Tartan: Evaluating Spatial Computation for Whole Program Execution. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 163–174. ACM, 2006.

[MG07] M. Mishra and S.C. Goldstein. Virtualization on the Tartan Reconfigurable Architecture. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 323–330, 2007.

[MG08] Simon Moore and Daniel Greenfield. The Next Resource War: Computation vs. Communication. In Proceedings of the 2008 International Workshop on System Level Interconnect Prediction, SLIP ’08, pages 81–86, New York, NY, USA, 2008. ACM.

[MLC+92] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture, MICRO 25, pages 45–54, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.

[MM09] J. Mak and A. Mycroft. Limits of Parallelism Using Dynamic Dependence Graphs. In International Workshop on Dynamic Analysis, pages 42–48, 2009.

[MTZ13] Daniel S. McFarlin, Charles Tucker, and Craig Zilles. Discerning the Dominant Out-Of-Order Performance Advantage: Is It Speculation or Dynamism? In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 241–252, New York, NY, USA, 2013. ACM.

[NFMM13] Matthew Naylor, Paul J. Fox, A. Theodore Markettos, and Simon W. Moore. Managing the FPGA Memory Wall: Custom Computing or Vector Processing? In FPL, pages 1–6, 2013.

[Nik04] Rishiyur Nikhil. Bluespec System Verilog: Efficient, Correct RTL from High Level Specifications. In Formal Methods and Models for Co-Design, 2004. MEMOCODE ’04. Proceedings. Second ACM and IEEE International Conference on, pages 69–70, 2004.

[Nik08] Rishiyur S. Nikhil. Bluespec: A General-Purpose Approach to High-Level Synthesis Based on Parallel Atomic Transactions. In Philippe Coussy and Adam Morawiec, editors, High-Level Synthesis, pages 129–146. Springer Netherlands, 2008.

[NRE04] R. Namballa, N. Ranganathan, and A. Ejnioui. Control and Data Flow Graph Extraction for High-level Synthesis. In VLSI, 2004. Proceedings. IEEE Computer Society Annual Symposium on, pages 187–192, 2004.

[OH05] Kunle Olukotun and Lance Hammond. The Future of Microprocessors. Queue, 3(7):26–29, September 2005.

[Ott08] Guilherme De Lima Ottoni. Global Instruction Scheduling for Multi-Threaded Architectures. 2008.

[Pan01] Preeti Ranjan Panda. SystemC: A Modeling Platform Supporting Multiple Design Abstractions. In Proceedings of the 14th International Symposium on Systems Synthesis, ISSS ’01, pages 75–80, New York, NY, USA, 2001. ACM.

[PBD+08] Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. CHiMPS: A High-Level Compilation Flow for Hybrid CPU-FPGA Architectures. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA ’08, pages 261–261, New York, NY, USA, 2008. ACM.

[PHS85] Y. N. Patt, W. M. Hwu, and M. Shebanow. HPS, A New Microarchitecture: Rationale and Introduction. In Proceedings of the 18th Annual Workshop on Microprogramming, MICRO 18, pages 103–108, New York, NY, USA, 1985. ACM.

[PJ87] Simon L. Peyton Jones. The Implementation of Functional Programming Languages (Prentice-Hall International Series in Computer Science). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1987.

[PJS97] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA ’97, pages 206–218, New York, NY, USA, 1997. ACM.

[Plo81] Gordon D. Plotkin. A Structural Approach to Operational Semantics. Technical Report DAIMI FN-19, Aarhus University, 1981.

[PMHS85] Y. N. Patt, S. W. Melvin, W. M. Hwu, and M. C. Shebanow. Critical Issues Regarding HPS, a High Performance Microarchitecture. In Proceedings of the 18th Annual Workshop on Microprogramming, MICRO 18, pages 109–116, New York, NY, USA, 1985. ACM.

[PNBK02] Hooman Parizi, Afshin Niktash, Nader Bagherzadeh, and Fadi J. Kurdahi. MorphoSys: A Coarse Grain Reconfigurable Architecture for Multimedia Applications (Research Note). In Proceedings of the 8th International Euro-Par Conference on Parallel Processing, Euro-Par ’02, pages 844–848, London, UK, 2002. Springer-Verlag.

[Pol99] Fred J. Pollack. New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies (keynote address)(abstract only). In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 32, pages 2–, Washington, DC, USA, 1999. IEEE Computer Society.

[PPA+13] Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 142–153, New York, NY, USA, 2013. ACM.

[PPM+06] Andrew Petersen, Andrew Putnam, Martha Mercaldi, Andrew Schwerin, Susan Eggers, Steve Swanson, and Mark Oskin. Reducing Control Overhead in Dataflow Architectures. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT ’06, pages 182–191, New York, NY, USA, 2006. ACM.

[PPM09] Hyunchul Park, Yongjun Park, and Scott Mahlke. Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 370–380, New York, NY, USA, 2009. ACM.

[PT05] David Pellerin and Scott Thibault. Practical FPGA Programming in C. Prentice Hall Press, Upper Saddle River, NJ, USA, first edition, 2005.

[RM11] M. Rostami and K. Mohanram. Dual-Vth Independent-Gate FinFETs for Low Power Logic Circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30(3):337–349, March 2011.

[RRP+07] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 13–24, 2007.

[SBV95] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar Processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA ’95, pages 414–425, New York, NY, USA, 1995. ACM.

[SG08] S. Singh and D. Greaves. Kiwi: Synthesis of FPGA Circuits from Parallel Programs. In Field-Programmable Custom Computing Machines, 2008. FCCM ’08. 16th International Symposium on, pages 3–12, April 2008.

[SGM+06] Aaron Smith, Jon Gibson, Bertrand Maher, Nick Nethercote, Bill Yoder, Doug Burger, Kathryn S. McKinley, and Jim Burrill. Compiling for EDGE Architectures. In Proceedings of the International Symposium on Code Generation and Optimization, CGO ’06, pages 185–195, Washington, DC, USA, 2006. IEEE Computer Society.

[SL05] Herb Sutter and James Larus. Software and the Concurrency Revolution. Queue, 3(7):54–62, September 2005.

[SLH90] Michael D. Smith, Monica S. Lam, and Mark A. Horowitz. Boosting Beyond Static Scheduling in a Superscalar Processor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA ’90, pages 344–354, New York, NY, USA, 1990. ACM.

[SNL+04] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. TRIPS: A Polymorphous Architecture for Exploiting ILP, TLP, and DLP. ACM Trans. Archit. Code Optim., 1(1):62–93, March 2004.

[SSM+07] Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. The WaveScalar Architecture. ACM Trans. Comput. Syst., 25:4:1–4:54, May 2007.

[SSV08] Scott Sirowy, Greg Stitt, and Frank Vahid. C is For Circuits: Capturing FPGA Circuits as Sequential Code for Portability. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA ’08, pages 117–126, New York, NY, USA, 2008. ACM.

[Sti11] Greg Stitt. Are Field-Programmable Gate Arrays Ready for the Mainstream? IEEE Micro, 31(6):58–63, November 2011.

[Sut05] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobbs Journal, 30(3):202–210, 2005.

[SVGH+11] Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. Efficient Complex Operators for Irregular Codes. In HPCA 2011: High Performance Computing Architecture, 2011.

[SW11] James Stanier and Des Watson. A Study of Irreducibility in C Programs. Softw: Pract. Exper., page n/a, 2011.

[TA03] Jessica H. Tseng and Krste Asanović. Banked Multiported Register Files for High-frequency Superscalar Microprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA ’03, pages 62–71, New York, NY, USA, 2003. ACM.

[Tay12] Michael B. Taylor. Is Dark Silicon Useful?: Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse. In Proceedings of the 49th Annual Design Automation Conference, DAC ’12, pages 1131–1136, New York, NY, USA, 2012. ACM.

[TDTD90] Walter F. Tichy and Peter J. Denning. Highly Parallel Computation. Science, 250:1217–1222, 1990.

[Tec12] Maxeler Technologies. Publications. http://www.maxeler.com/publications/, 2012.

[TLM+04] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA ’04, pages 2–, Washington, DC, USA, 2004. IEEE Computer Society.

[TPB98] Scott Thompson, Paul Packan, and Mark Bohr. MOS Scaling: Transistor Challenges for the 21st Century. Intel Technology Journal, 1998.

[Tra86] K. R. Traub. A Compiler for the MIT Tagged-Token Dataflow Architecture. Technical report, Cambridge, MA, USA, 1986.

[TWFO09] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F.P. O’Boyle. Towards a Holistic Approach to Auto-parallelization: Integrating Profile-driven Parallelism Detection and Machine-learning Based Mapping. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, pages 177–187, New York, NY, USA, 2009. ACM.

[UM02] Sebastian Unger and Frank Mueller. Handling Irreducible Loops: Optimized Node Splitting Versus DJ-graphs. ACM Trans. Program. Lang. Syst., 24(4):299–333, July 2002.

[VBCG06] G. Venkataramani, T. Bjerregaard, T. Chelcea, and S. C. Goldstein. Hardware Compilation of Application-Specific Memory-Access Interconnect. Trans. Comp.-Aided Des. Integ. Cir. Sys., 25(5):756–771, November 2006.

[Ven11] Ganesh Venkatesh. Configurable Energy-efficient Co-processors to Scale the Utilization Wall. PhD thesis, University of California at San Diego, La Jolla, CA, USA, 2011. AAI3456370.

[VSG+10] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M.B. Taylor. Conservation Cores: Reducing the Energy of Mature Computations. ACM SIGARCH Computer Architecture News, 38(1):205–218, 2010.

[Wal91] David W. Wall. Limits of Instruction-Level Parallelism. SIGPLAN Not., 26:176–188, April 1991.

[Wan07] Jiacun Wang. Handbook of Dynamic System Modeling, chapter Petri Nets for Dynamic Event-Driven System Modeling. Chapman and Hall/CRC, 2007.

[WFW+94] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Not., 29(12):31–37, December 1994.

[WM95] Wm. A. Wulf and Sally A. McKee. Hitting the Memory Wall: Implications of the Obvious. SIGARCH Comput. Archit. News, 23(1):20–24, March 1995.

[WP94] Paul G. Whiting and Robert S. V. Pascoe. A History of Data-Flow Languages. IEEE Ann. Hist. Comput., 16(4):38–59, December 1994.

[YAMJGE13] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and Y. Etsion. Hybrid Dataflow/Von-Neumann Architectures. Parallel and Distributed Systems, IEEE Transactions on, PP(99):1–1, 2013.

[YRHBL13] Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. Optimizing Software Runtime Systems for Speculative Parallelization. ACM Trans. Archit. Code Optim., 9(4):39:1–39:27, January 2013.

[ZK98] V. Zyuban and P. Kogge. The Energy Complexity of Register Files. In Proceedings of the 1998 International Symposium on Low Power Electronics and Design, ISLPED ’98, pages 305–310, New York, NY, USA, 1998. ACM.
