CONDITIONAL TECHNIQUES FOR STREAM PROCESSING KERNELS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Ujval J. Kapasi
March 2004

© Copyright by Ujval J. Kapasi 2004
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

William J. Dally (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mark Horowitz

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christoforos Kozyrakis

Approved for the University Committee on Graduate Studies.

Abstract

The stream programming model casts applications as a set of sequential data streams that are operated on by data-parallel computation kernels. Previous work has shown that this model is a powerful representation for media processing applications, such as image and signal processing, because it captures the locality and concurrency inherent to an application. This careful handling of important application properties results in kernels that are compute-intensive—i.e., kernels that perform a large number of arithmetic operations per unit of inter-kernel communication bandwidth. Furthermore, the stream model can be implemented with efficient VLSI designs, such as the Imagine Programmable Stream Processor. The Imagine chip supports 48 ALUs on a single die, operating at over 200 MHz. This large number of ALUs provides high peak performance, but makes efficiently executing kernels with conditional code a challenge.

We will introduce two techniques for efficiently supporting kernel conditionals, such as if-statements, case-statements, and while-loops: conditional routing and conditional streams. Conditional routing is a code transformation that accepts increased inter-kernel communication in exchange for increased performance of kernel inner-loops containing conditional code. The second technique, conditional streams, is a mechanism that reduces the load-imbalance that arises between parallel processing clusters on a single stream processor chip. Load-imbalance results when different conditional paths are taken on different processing clusters, causing one or more of the clusters to wait idle for the others to complete a kernel. We will also present a case study of the impact of these techniques on a programmable polygon rendering pipeline that contains many unpredictable conditionals. We show that our techniques improve the performance of this application by 1.9×.

Acknowledgments

Dedicated to the late Chandulal K. Kapasi

I am truly fortunate to have had the opportunity to work with my adviser, Bill Dally, these past few years. He has, by example, taught me to set the bar as high as possible, and then a little higher yet. I am overwhelmed by gratitude for his patience and guidance throughout my graduate student career. I look forward to working with Bill outside of Stanford as we embark on a new venture to commercialize our Stanford research. I am also indebted to Mark Horowitz and Christos Kozyrakis for many educational interactions, and for significantly helping to improve the work in this dissertation.

Another critical support group during my Stanford tenure has been the Imagine project team members. A group experience as positive and successful as ours is rare. I would particularly like to thank the original Stanford members: Scott Rixner, Brucek Khailany, John Owens, and Peter Mattson. More recently Brian Towles, Jinyung Namkoong, Ben Serebrin, Jung Ho Ahn, and Abhishek Das have been important collaborators in ensuring the longevity of our research. Many of the ideas in this dissertation have been molded during white-board discussions with these people.

My officemates over the years have provided valuable relief and encouragement when I needed it. Brian Towles, Sarah Harris, and Steve Lacy have been wonderful officemates, and I will miss them after I leave Stanford. I would also like to thank the CVA group administrators, Pamela Elliot and Shelley Russell. They ensured that I could focus on my research by handling all of the administrative details. Similarly, the National Science Foundation Fellowship and the Stanford Graduate Fellowship enabled me to pursue my research without financial concerns.

Finally, I am eternally thankful to my family. Mom, Dad, Anokhi: you guys believed in me, and it is largely because of your constant support and encouragement that I am completing my Ph.D. To my new wife, Kristin: you have provided me with inspiration in all walks of life, and I leave Stanford excited to begin the rest of my life with you.

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Conditionals are Challenging
  1.2 Research Contributions
  1.3 Thesis Roadmap

2 Stream Processing
  2.1 Stream Programming Model
    2.1.1 Example Application
    2.1.2 Locality and Concurrency
    2.1.3 High-Level Stream Languages
  2.2 Programmable Stream Processors
    2.2.1 Instruction Set Architecture (ISA)
    2.2.2 Micro-Architecture
  2.3 Kernel Conditionals
    2.3.1 Motivating Example: The geom_rast Kernel
    2.3.2 Challenges

3 VLIW Scheduling with Conditional Routing
  3.1 Motivation
    3.1.1 Existing Techniques: If-statements
    3.1.2 Existing Techniques: While-loops
  3.2 Applying Conditional Routing to If-Statements
    3.2.1 Performance
    3.2.2 Comparison to Other Techniques
    3.2.3 Technique Selection
    3.2.4 SRF Allocation
    3.2.5 Conditional Routing Summary for If-Statements
  3.3 Applying Conditional Routing to While-Loops
    3.3.1 Expanded Conditional Routing
    3.3.2 Flattened Conditional Routing
    3.3.3 State-Unrolling
    3.3.4 Performance
    3.3.5 Ordering
    3.3.6 Conditional Routing Summary for While-Loops
  3.4 RENDER Performance
  3.5 Dynamic Techniques

4 SIMD Performance with Conditional Routing
  4.1 Motivation
  4.2 Performance
    4.2.1 If-Statements
    4.2.2 While-Loops
  4.3 RENDER Performance
    4.3.1 MIMD Comparison
  4.4 Summary

5 Load-balancing with Conditional Streams
  5.1 Motivation
  5.2 Existing Techniques for Improving Load-Balance
  5.3 Conditional Streams
    5.3.1 If-Statements
    5.3.2 While-Loops
  5.4 Implementation
    5.4.1 Hardware Overview
    5.4.2 Hardware Details
    5.4.3 Scalability
  5.5 Performance
    5.5.1 If-statements
    5.5.2 While-loops
    5.5.3 Impact of SRF Communication
  5.6 Other Uses of Conditional Streams
  5.7 RENDER Performance
  5.8 Summary

6 Conclusions
  6.1 Conditional Routing
  6.2 Conditional Streams
  6.3 A Compiler Framework

A The IMAGINE Stream Processor
  A.1 Conditional Streams Implementation on Imagine

Bibliography

List of Tables

3.1 Summary of if-statement conditional techniques.
3.2 Summary of while-loop conditional techniques.
3.3 Impact of conditional routing on VLIW schedule of geom_rast.

4.1 1 to 8 cluster speedup of existing techniques on geom_rast.
4.2 8-cluster performance of existing techniques on geom_rast.
4.3 Impact of conditional routing on eight-cluster performance of RENDER.

5.1 Instructions supported by the conditional stream unit.
5.2 Instruction sequences for executing conditional stream accesses.
5.3 Stream-buffer-specific state stored in the conditional stream unit.
5.4 Conditional stream unit instruction details.
5.5 Area impact of naïve GEN_CO_STATE implementation.
5.6 Conditional stream speedup for while-loops.
5.7 Impact of conditional streams on SIMD performance of RENDER.

List of Figures

2.1 Streams and a kernel from an MPEG-2 video encoder application.
2.2 MPEG-2 I-frame encoder mapped to streams and kernels.
2.3 A programmable stream processor.
2.4 Kernel execution unit (KEU) architecture.
2.5 KEU organizations for multi-cluster execution models.

3.1 Applying conditional routing to a kernel containing an if-statement.
3.2 Impact of conditional routing for if-statements on application-level code.
3.3 The synthetic_if benchmark.
3.4 Impact of conditional routing on execution time for synthetic_if.
3.5 Comparison of conditional routing and existing techniques for if-statements.
3.6 Results of conditional techniques on synthetic_if_else.
3.7 Results of conditional techniques on synthetic_case.
3.8 Impact of number of operations in BODY on if-statement techniques.
3.9 Technique selection function for the synthetic_if kernel.
3.10 Impact of SRF communication on CONDITIONAL ROUTING and PREDICATION.
3.11 Impact of SRF communication on if-statement techniques.
3.12 iterative, an example kernel with a while-loop.
3.13 Expanding a while-loop.
3.14 Application-level pseudo-code for expanded conditional routing.
3.15 Loop-flattening the compute_iterative kernel.
3.16 Virtual processor view of state-unrolling.
3.17 Impact of conditional routing on synthetic_while.

3.18 [...] used to study impact of correlation on while-loop performance.
3.19 Impact of the distribution of iterations in the input set on synthetic_while.
3.20 Impact of sorting the output of synthetic_while to retain original ordering.
3.21 Performance of alternate methods for sorting the output of synthetic_while.
3.22 Polygon rendering application stream and kernel diagram.
3.23 Basic block diagram for the geom_rast kernel.

4.1 Impact of SIMD execution model on if-statement techniques.
4.2 Comparison of if-statement techniques on an eight-cluster SIMD machine.
4.3 Impact of SIMD execution model on while-loop techniques.
4.4 Comparison of while-loop techniques on an eight-cluster SIMD machine.

5.1 Load-imbalance due to conditional routing on a SIMD architecture.
5.2 Eliminating load-imbalance using a conditional output stream.
5.3 Conditional stream implementation overview for a SIMD architecture.
5.4 Conditional stream global logic details.
5.5 Naïve GEN_CO_STATE implementation.
5.6 Improved GEN_CO_STATE implementation for large numbers of clusters.
5.7 Log-time hardware search implementation for GEN_CO_STATE.
5.8 Conditional stream speedup for if-statements.
5.9 Conditional stream results for synthetic_while.
5.10 Impact of SRF communication on conditional stream version of input.

A.1 The Imagine Processor architecture block diagram and floor-plan.
A.2 Use of a buffer for conditional streams with Imagine-style SRF.

List of Kernels

2.1 geom_rast
3.1 compute_1
3.2 compute_2
3.3 synthetic_if
3.4 synthetic_if optimized using speculation.
3.5 synthetic_if optimized using predication.
3.6 compute_iterative after applying loop-flattening.
3.7 compute_iterative after applying loop-flattening and state-unrolling.
3.8 synthetic_while
5.1 compute_iterative with loop-flattening, updated for conditional input streams.

Chapter 1

Introduction

Recent research has shown that programmable stream processors efficiently support large numbers of ALUs on a single chip to achieve a high sustained performance on media processing applications, such as 2-D and 3-D graphics, image processing, and signal processing [Kapasi et al., 2003; Khailany et al., 2001]. Furthermore, stream processors dissipate considerably less power per operation than traditional programmable processors. For example, the IMAGINE streaming media processor supports 48 ALUs on a single die in a 0.15µm process. At the most power-efficient operating voltage of 1.2V, the processor sustains between 1.4 and 2.2 GFLOPS at 1.9W on representative media processing applications.

It turns out that many media processing applications are very regular: they repetitively perform the same set of operations on a large set of similar data elements. This can be modeled as a sequence of computation kernels that operate on data streams, where each kernel applies the same function to all the records in a stream. As we will see in the next chapter, stream processors are optimized to exploit this regularity, and this is in part why they are so efficient from a hardware perspective.


1.1 Conditionals are Challenging

However, the flip-side is that it is a challenge to efficiently execute conditionals, such as if-then-else statements and while-loops, since they seem to break this regularity assumption (Section 2.3 will discuss this in more detail). The end result is that the presence of data-dependent conditionals in an application reduces the fraction of the peak processor performance achieved. To address this, we will introduce new solutions that increase the performance of applications with data-dependent conditionals on stream processors. For example, the conditional techniques we introduce are able to speed up an OpenGL polygon rendering pipeline 1.9× over an implementation that only uses other existing techniques, such as speculation and predication. Furthermore, on IMAGINE, the additional hardware dedicated to our new conditional techniques occupies only about 1% of the chip area.

We are concerned with media processing applications in this thesis, and in particular, we will focus on a 3-D OpenGL polygon rendering pipeline. However, as stream processing technology matures, we envision that the contributions in this thesis will apply to other application domains as well. In fact, the techniques discussed here may become even more important as we try to reap the benefits of efficient stream processing hardware on increasingly difficult (i.e., irregular) applications.

1.2 Research Contributions

The primary contributions of this thesis to the fields of architecture and stream processing are:

1. A kernel-level code transformation, conditional routing, that improves the performance of kernels that contain conditional constructs, such as if- and case-statements and while-loops. Conditional routing improves kernel schedules on VLIW clusters by replacing conditional branches with conditional communication. Removing the conditional branches also enables efficient execution on stream processors that are built using a single-instruction multiple-data (SIMD) array of processing clusters.

2. A kernel-level code transformation, state-unrolling, that improves the VLIW schedules of while-loops inside kernel main loops. State-unrolling is especially important when there is a long loop-carried dependency in the while-loop, rendering traditional loop-unrolling or software-pipelining ineffective.

3. A new mechanism, called conditional streams, which improves the load-balance between parallel processing clusters during kernel execution.

4. A micro-architecture for accelerating the performance of conditional stream operations.

5. An evaluation of the impact of the newly introduced conditional techniques on the performance of a 3-D polygon rendering application, which contains many conditional statements in its kernels.

1.3 Thesis Roadmap

In Chapter 2 we will motivate why stream processors are an interesting class of machines for the media processing domain, how they are built, and how they operate. In particular, we will show that stream processors can efficiently support a large number of ALUs by using a SIMD array of processing clusters, where each processing cluster is a statically scheduled VLIW engine. While stream processors are great at executing very regular codes, we will show that efficiently executing code with conditionals is a challenge on these stream processors. We will discuss three specific challenges introduced by the presence of conditionals: improving VLIW schedules, enabling efficient performance on the restrictive SIMD execution model, and reducing the load-imbalance between multiple processing clusters. Each of the next three chapters is devoted to addressing one of these challenges in turn.

Chapter 3 focuses on how one of our new techniques, conditional routing, addresses the first challenge. We will show that compared to existing techniques, conditional routing enables efficient VLIW schedules by converting conditional branches to conditional communication. Conditional routing can be applied to both if-statements and while-loops. We will also introduce another new technique, state-unrolling, and show that it should be used in conjunction with conditional routing in order to improve the performance of while-loops even further.

The second challenge is addressed in Chapter 4. In this chapter we show that conditional routing enables efficient performance on stream processors with an array of SIMD processing clusters. The SIMD execution model is attractive because it enables simplified hardware, but in return places restrictions on control-flow. We will show how the performance of kernels implemented using conditional routing, unlike other techniques, is not reduced by these control-flow restrictions.

We address the third and final challenge in Chapter 5. This chapter shows how conditional streams reduce the load-imbalance between multiple clusters in a stream processor. Since conditional streams require hardware acceleration for good performance, we will present a micro-architecture to support them in this chapter.

Finally, we will present our conclusions in Chapter 6, and indicate fruitful areas for further research based on these conclusions.

Chapter 2

Stream Processing

The complexity of modern media processing, including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve.

The demand for flexibility in media processing motivates the use of programmable processors. However, very large-scale integration constraints limit the performance of traditional programmable architectures. In modern VLSI technology, computation is relatively cheap—thousands of arithmetic logic units (ALUs) that operate at multi-gigahertz rates can fit on a modestly sized 1-cm² die. The problem is that delivering instructions and data to those ALUs is prohibitively expensive. For example, only 6.5 percent of the Itanium 2 die is devoted to the 12 integer and two floating-point ALUs and their register files [Naffziger et al., 2002]; communication, control, and storage overhead consume the remaining die area. In contrast, the more efficient communication and control structures of a special-purpose graphics chip, such as the Nvidia GeForce4 [Montrym and Moreton, 2002], enable the use of many hundreds of floating-point and integer ALUs to render 3D images.

In part, such special-purpose media processors are successful because media applications have abundant parallelism—enabling thousands of computations to occur in parallel—and require minimal global communication and storage—enabling data to pass directly from one ALU to the next. A stream architecture exploits this locality and concurrency by


partitioning the communication and storage structures to support many ALUs efficiently:

• operands for arithmetic operations reside in local register files (LRFs) near the ALUs, in much the same way that special-purpose architectures store and communicate data locally;

• streams of data capture coarse-grained locality and are stored in a stream register file (SRF), which can efficiently transfer data to and from the LRFs between major computations; and

• global data is stored off-chip only when necessary.

These three explicit levels of storage form a data bandwidth hierarchy, with the LRFs providing an order of magnitude more bandwidth than the SRF and the SRF providing an order of magnitude more bandwidth than off-chip storage. This bandwidth hierarchy is well matched to the characteristics of modern VLSI technology, as each level provides successively more storage and less bandwidth. By exploiting the locality inherent in media-processing applications, this hierarchy stores the data at the appropriate level, enabling hundreds of ALUs to operate at close to their peak rate.

Moreover, a stream architecture can support such a large number of ALUs in an area- and power-efficient manner. Modern high-performance microprocessors and digital signal processors continue to rely on global storage and communication structures to deliver data to the ALUs; these structures use more area and consume more power per ALU than a stream processor.

An example of a stream processor is the IMAGINE processor (Appendix A), which targets the media processing application domain [Khailany et al., 2001; Rixner et al., 1998]. While retaining complete programmability, IMAGINE delivers roughly four times the peak performance of TI's high-performance floating-point DSP, the TMS320C6713 [TI, 2001], and does so with roughly two-thirds the energy per operation, despite being implemented in a CMOS process technology that is one generation older. Likewise, when compared to TI's fixed-point DSP, the TI C64x [Agarwala et al., 2002], and normalized to the same process technology and operating voltage, IMAGINE sustains roughly 2.5× the peak performance at roughly the same power-efficiency [Khailany, 2003], despite having the extra overhead of floating-point arithmetic support. The next section describes the key characteristics of the stream programming model, and the following section describes stream processors in more detail.

2.1 Stream Programming Model

The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. In most cases, not only do streams and kernels expose desirable properties of media applications, but they are also a natural way of expressing the application. This leads to an intuitive programming model that maps directly to stream architectures with tens to hundreds of ALUs.

2.1.1 Example Application

Figure 2.1 illustrates input and output streams and a kernel taken from an MPEG-2 video encoder application. Figure 2.1a shows graphically how a kernel operates on streams, while Figure 2.1b shows this process in a simplified form of StreamC, a stream programming language. Input_Image is a stream that consists of image data from a camera. Elements of Input_Image are 16 × 16 pixel regions, or macroblocks, on which the convert kernel operates. The kernel applies the same computation to the macroblocks in Input_Image, decomposing each one into six 8 × 8 blocks—four luminance blocks and two 4:1 sub-sampled chrominance blocks—and appends them to the Luminance and Chrominance output streams, respectively.

Streams

As Figure 2.1a shows, streams contain a sequence of elements of the same type. Stream elements can be simple, such as a single number, or complex, such as the coordinates of a triangle in 3D space. Streams need not be the same length—for example, the Luminance stream has four times as many elements as the input stream. Further, Input_Image could contain all of the macroblocks in an entire video frame, only a row of macroblocks from the frame, or even a subset of a single row. In the stream code in Figure 2.1b, the value of NUM_MB controls the length of the input stream.

    // (a) The convert kernel (KernelC-style pseudo-code):

    while ( !Input_Image.end() ) {
        // input next macroblock
        in = Input_Image.pop();

        // generate Luminance and Chrominance blocks
        outY[0..3] = gen_L_blocks(in);
        outC[0..1] = gen_C_blocks(in);

        // output new blocks
        Luminance.push(outY[0..3]);
        Chrominance.push(outC[0..1]);
    }

    // Record datatypes used by the streams:

    struct RGB_pixel {
        int8 r, g, b;
    }

    struct MACROBLOCK {
        RGB_pixel pixels[16][16];
    }

    struct BLOCK {
        int8 intensity[8][8];
    }

    // (b) Application-level code invoking convert:

    stream < MACROBLOCK > Input_Image(NUM_MB);
    stream < BLOCK > Luminance(NUM_MB*4), Chrominance(NUM_MB*2);

    Input_Image = Video_Feed.get_macroblocks(currpos, NUM_MB);
    currpos += NUM_MB;
    convert(Input_Image, Luminance, Chrominance);

Figure 2.1: Streams and a kernel from an MPEG-2 video encoder application. (a) The convert kernel translates a stream of macroblocks containing RGB pixels into streams of blocks containing luminance and chrominance pixels. (b) Textual expression of the flow of streams through kernels. The syntax is similar to that of the StreamC language. The datatype of each stream is specified within the angle brackets (< >). Kernel invocation syntax is similar to that of a regular function call in "C". The get_macroblocks method is assumed to be defined earlier; it simply returns a consecutive number of macroblocks starting at the specified position in the stream.


Kernels

The convert kernel consists of a loop that processes each element from the input stream. The body of the loop first pops an element from its input stream, performs some computation on that element, and then pushes the results onto the two output streams. Kernels can have one or more input and output streams and perform complex calculations ranging from a few to thousands of operations per input element—one implementation of convert requires 6,464 operations per input macroblock to produce the six output blocks. The only external data that a kernel can access are its input and output streams. For example, convert cannot directly access the data from the video feed; instead, the data must first be organized into a stream.

Full application

A full application, such as the MPEG-2 encoder, is composed of multiple streams and kernels. This application inputs a sequence of video images and compresses it into a single bitstream consisting of three types of frames: intra-coded, predicted, and bidirectional. The encoder compresses I-frames using only information contained in the current frame, and it compresses P- and B-frames using information from the current frame as well as additional reference frames.

For example, Figure 2.2 shows one possible mapping of the portion of the MPEG-2 encoder application that encodes only I-frames into the stream processing model. Solid arrows represent data streams, and ovals represent computation kernels. The encoder receives a stream of macroblocks (Input_Image) from the video feed as input, and the first kernel (convert) processes this input. The discrete cosine transform (DCT) kernels then operate on the output streams produced by convert. Upon execution of all the computation kernels, the application either transmits the compressed bitstream over a network or saves it for later use. The application uses the two reference streams to compress future frames.

[Figure 2.2 diagram: Input_Image flows into convert, which produces the Luminance and Chrominance streams; each passes through a dctq kernel, then through rle (run-level encoding) and vlc (variable-length coding) to form the Compressed Video Bits stream. iqdct kernels regenerate the Luminance and Chrominance reference streams, and rate-control signals feed back to the dctq kernels.]

Figure 2.2: MPEG-2 I-frame encoder mapped to streams and kernels. The encoder receives a stream of macroblocks from a frame in the video feed as input, and the first kernel (convert) processes these. The discrete cosine transform (DCT) kernels then operate on the output streams produced by convert. Q=quantization; IQ=inverse quantization.

2.1.2 Locality and Concurrency

By making communication between computation kernels explicit, the stream-processing model exposes both the locality and concurrency inherent in media-processing applications. The model exposes locality by organizing communication into three distinct levels:

• Local. Temporary results that only need to be transferred between scalar operations within a kernel use local communication mechanisms. For example, temporary values in gen_L_blocks are only referenced within convert. This type of communication cannot be used to transfer data between two different kernels.

• Stream. Data are communicated between computation kernels explicitly as data streams. In the MPEG-2 encoder, for example, the Luminance and Chrominance streams use this type of communication.

• Global. This level of communication is only for truly global data. This is necessary for communicating data to and from I/O devices, as well as for data that must persist throughout the application. For example, the MPEG-2 I-frame encoder uses this level of communication for the original input data from a video feed and for the reference frames, which must persist throughout the processing of multiple video frames in off-chip dynamic RAM (DRAM).

By requiring programmers to explicitly use the appropriate type of communication for each data element, the stream model expresses the application's inherent locality. For example, the model only uses streams to transfer data between kernels and does not burden them with temporary values generated within kernels. Likewise, the model does not use global communication for temporary streams. The stream model also exposes concurrency in media-processing applications at multiple levels:

• Instruction-level parallelism (ILP). As in traditional processing models, the stream model can exploit parallelism between the scalar operations in a kernel function. For example, the operations in gen_L_blocks and gen_C_blocks can occur in parallel.

• Data parallelism. Because kernels apply the same computation to each element of an input stream, the stream model can exploit data parallelism by operating on several stream elements at the same time. For example, the model parallelizes the main loop in convert so that multiple computation elements can each decompose a different macroblock.

• Task parallelism. Multiple computation tasks, including kernel execution and stream data transfers, can execute concurrently as long as they obey dependencies in the stream graph. For example, in the MPEG-2 I-frame application, the two dctq kernels could run in parallel.
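To make these three levels concrete, the sketch below re-renders the convert main loop in plain C rather than StreamC (our own illustration; the per-pixel arithmetic is an invented stand-in for the real color conversion) and marks where each form of parallelism arises.

    #include <stddef.h>

    typedef struct { unsigned char r, g, b; } RGB_pixel;
    typedef struct { RGB_pixel pixels[16][16]; } MACROBLOCK;
    typedef struct { signed char intensity[8][8]; } BLOCK;

    /* Stand-in for the real luminance weighting (invented for illustration). */
    static signed char luma(RGB_pixel p)
    {
        return (signed char)((p.r + 2 * p.g + p.b) / 4 - 128);
    }

    void convert(const MACROBLOCK *in, size_t n, BLOCK *lum, BLOCK *chrom)
    {
        /* Data parallelism: iterations touch disjoint data, so each
         * processing cluster can run this body on a different macroblock. */
        for (size_t i = 0; i < n; i++) {
            /* ILP: the four luminance blocks are mutually independent, so a
             * VLIW cluster can interleave their operations across its ALUs. */
            for (int b = 0; b < 4; b++)
                for (int y = 0; y < 8; y++)
                    for (int x = 0; x < 8; x++)
                        lum[4 * i + b].intensity[y][x] =
                            luma(in[i].pixels[(b / 2) * 8 + y][(b % 2) * 8 + x]);

            /* Two 4:1 sub-sampled chrominance blocks (again, invented math). */
            for (int c = 0; c < 2; c++)
                for (int y = 0; y < 8; y++)
                    for (int x = 0; x < 8; x++) {
                        RGB_pixel p = in[i].pixels[2 * y][2 * x];
                        chrom[2 * i + c].intensity[y][x] =
                            (signed char)((c == 0 ? p.b - p.g : p.r - p.g) / 2);
                    }
        }
        /* Task parallelism lives above this function: once convert's two
         * output streams exist, the two dctq kernels can run concurrently. */
    }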

2.1.3 High-Level Stream Languages

Stream languages are used to textually code applications in the stream programming model. A language such as StreamC is used to code the flow of streams through kernels in an application, while a language such as KernelC is used to code the individual kernels. Both of these languages are covered in detail by [Mattson, 2001]. These languages are compiled using software tools designed specifically to both process stream languages and target stream architectures [Kapasi et al., 2002b; Mattson et al., 2000; Mattson, 2001].

Examples of how several different applications map to the stream programming model can be found in other references: Khailany et al. discuss the mapping of a stereo depth extractor [Khailany et al., 2001; Kanade et al., 1996]; Owens et al. discuss the mapping of a polygon rendering pipeline [Owens et al., 2000] and a REYES graphics pipeline [Owens et al., 2002]; Rajagopal et al. discuss the mapping of a software defined radio [Rajagopal et al., 2002]; Rajagopal also discusses the mapping of several other wireless receiver algorithms [Rajagopal, 2004]; the mapping of the FFT algorithm is discussed in [Kapasi et al., 2002a]; and, finally, the mapping of a network packet processing IPv4 stack is described in [Rai, 2003].

2.2 Programmable Stream Processors

A programmable stream processor [Kapasi et al., 2003; Rixner, 2001] is designed to directly execute the stream programming model we have just described. A stream processor can be implemented very efficiently by taking advantage of the locality and concurrency exposed by the stream model. Stream and kernel programs are compiled directly to such a stream processor. Figure 2.3 shows the architecture of a baseline programmable stream processor, which consists of an application processor, a stream register file, and stream clients. The SRF serves as a communication link by buffering data streams between clients, as long as the data does not exceed its storage capacity. The two stream clients in the baseline stream processor—a programmable kernel execution unit (KEU) and an off-chip DRAM interface—either consume data streams from the SRF or produce data streams to the SRF. The KEU executes kernels and provides local communication for operations within the kernels, while the off-chip DRAM interface provides access to global data storage. We will first present an ISA for this type of stream architecture, and then we will discuss the details of how we can build one efficiently in VLSI technology.

[Figure 2.3 diagram: an application processor issues stream-level instructions to the stream clients (the kernel execution unit, the DRAM interface with its off-chip DRAM, and other stream clients), all of which exchange data streams through the stream register file.]

Figure 2.3: A programmable stream processor. The SRF streams data between clients—in this case the KEU, which is responsible for executing kernels and providing local communication for operations within the kernels, and the off-chip DRAM interface, which provides access to storage for global data.

2.2.1 Instruction Set Architecture (ISA)

The ISA of a stream processor is really composed of two unique ISAs: that of the application processor and that of the kernel execution unit. We shall describe both in turn, and then describe how they work together to execute a stream program.

Application Processor ISA

The application processor executes application code like that in Figure 2.1b. The application processor's RISC-like instruction set is augmented with stream-level instructions to control the flow of data streams through the system. The application processor sequences these instructions and issues them to the stream clients, including the DRAM interface and the KEU. The DRAM interface supports two stream-level instructions—LOAD_STREAM and STORE_STREAM—that transfer an entire stream between off-chip DRAM and the SRF. Additional DRAM interface arguments can also specify non-contiguous access patterns, such as non-unit strides and indirect access. The programmable KEU supports two stream-level instructions as well. The LOAD_KERNEL instruction loads a compiled kernel function into local instruction storage within the KEU. This typically occurs only the first time a kernel is executed; on subsequent kernel invocations, the code is already available in the KEU. The RUN_KERNEL instruction causes the KEU to start executing instructions that are encoded in its own instruction-set architecture, which is distinct from the application processor ISA.

Kernel Execution Unit ISA

Instructions in the KEU ISA control the functional units and register storage within the KEU, similar to typical RISC instructions. However, unlike a RISC ISA,

• KEU instructions do not have access to arbitrary memory locations, as all external data must be read from or written to streams, and

• special communication instructions explicitly handle data-dependencies between the computations of different output elements.

The first constraint preserves locality, and the programmer ensures the constraint is met when they structure the application as a set of kernels and streams. The communication instructions make it easier to exploit concurrency on data-parallel processors, without the need to reorganize data through the memory system. These two constraints of the kernel ISA maximize stream architecture performance.

Stream ISA

The ISA of a stream processor, or stream ISA, encapsulates both the application processor ISA and the KEU ISA. Accordingly, software tools translate high-level stream languages such as StreamC and KernelC directly to the ISA of a stream processor. For example, the convert kernel code shown in Figure 2.1a is compiled to the KEU ISA offline. Another compiler then compiles the application-level code shown in Figure 2.1b to the application processor ISA. The application processor executes this code and moves the precompiled kernel code to the KEU instruction memory during execution.

Assuming that the portion of the video feed the processor is encoding currently resides in DRAM memory, the three lines of application code in Figure 2.1b would result in the following sequence of operations:

    (ii = Input_Image, lum = Luminance, chrom = Chrominance)

    load_stream  ii_SRF_loc, video_DRAM_loc, LEN
    add          video_DRAM_loc, video_DRAM_loc, LEN
    run_kernel   convert_KEU_loc, ii_SRF_loc, LEN, lum_SRF_loc, chrom_SRF_loc

The first instruction loads a portion of the raw video feed into the SRF. The second instruction is a regular scalar operation that only updates local state in the application processor. The final instruction causes the KEU to start executing the operations for the convert kernel. Similar instructions are required to complete the rest of the MPEG-2 I-frame application pipeline. Interestingly, a traditional compiler can only perform manipulations on simple scalar operations that tend to benefit local performance only. A stream compiler, on the other hand, can manipulate instructions that operate on entire streams, potentially leading to greater performance benefits.

To be a good compiler target for high-level stream languages, the stream processor's ISA must express the original program's locality and concurrency. To this end, three distinct address spaces in the stream ISA support the three types of communication that the stream model uses. The ISA maps local communication within kernels to the address space of the local registers within the KEU, stream communication to the SRF address space, and global communication to the off-chip DRAM address space.

To preserve the stream model's locality and concurrency, the processor's ISA also exploits the three types of parallelism. To exploit ILP, the processor provides multiple functional units within the KEU. Because a kernel applies the same function to all stream elements, the compiler can exploit data parallelism using loop unrolling and software pipelining on kernel code. The KEU can exploit additional data parallelism using parallel hardware. To exploit task parallelism, the processor supports multiple stream clients connected to the SRF.

[Figure 2.4 diagram: a micro-controller issues VLIW instructions to several processing clusters; each cluster has its own SRF bank, with FIFO stream buffers sitting between each SRF memory array and its cluster.]

Figure 2.4: The architecture of an example kernel execution unit. There are several clusters organized in a SIMD fashion. Each cluster consists of five symmetric ALUs—i.e., they can all perform the exact same arithmetic operations. Also, there is a dedicated SRF bank for each processing cluster.

2.2.2 Micro-Architecture

Since we are focusing on kernel performance in this dissertation, it is worthwhile discussing the architecture of the KEU and SRF in slightly more detail. In particular, we will discuss several architectural optimizations that result in little to no performance loss because they exploit properties exposed by the stream model. Figure 2.4 shows the architecture of an example KEU.

Local Register Files (LRFs)

Each processing cluster is a VLIW engine that is built to take advantage of the ILP within kernel functions. All the ALUs are the same, and each ALU supports every floating-point and fixed-point arithmetic operation in the kernel ISA. Notice that our first optimization is to split the register files into an SRF and cluster LRFs. This relieves the LRFs from supporting the large capacity necessary for hiding memory latency. This maps well to the stream model, which explicitly captures the locality of all the intermediate values in a kernel. Also, each ALU has a register file dedicated to each of its inputs. The outputs of the ALUs are connected to the inputs of the register files by a fully connected switch. This distributed register file structure improves the VLSI implementation of a processing cluster by providing small two-ported register files instead of a single large register file with ports for all the ALUs [Rixner et al., 2000]. Not shown are additional specialized functional units that provide interfaces to modules such as a scratch-pad register file and the inter-cluster communication switch.

Multiple SIMD clusters

As shown in Figure 2.4, we have split the ALUs on a stream processor into multiple clusters, where each cluster operates on a different data element. When necessary, clusters can communicate with each other via an inter-cluster switch (not shown in the figure). This might be required, for example, in order to obtain the data from neighboring clusters for an FIR filter. This cluster partitioning is done largely because a monolithic VLIW processing structure starts to become inefficient as we scale to higher numbers of ALUs [Khailany et al., 2003]. Fortunately, the cluster partitioning doesn't significantly impact performance, since data-parallelism is exposed by the stream model and is readily available in media processing applications. For example, Rixner previously reported that over a range of media processing kernels, an average speedup of 7.4× was achieved on eight parallel processing clusters compared to a single cluster [Rixner et al., 1998].

There are several options for how the clusters on a stream processor are controlled. The two extremes in the space are multiple-instruction multiple-data (MIMD) execution and single-instruction multiple-data (SIMD) execution. The KEU organizations for these two execution models are depicted graphically in Figure 2.5.

[Figure 2.5 diagram: (a) the MIMD organization, with a separate VLIW instruction controller for each SRF-bank/cluster pair; (b) the SIMD organization, with one shared VLIW instruction controller driving every cluster. In both organizations an inter-cluster switch connects the processing clusters.]

Figure 2.5: This figure shows two ways to organize the KEU in a stream architecture. The organization for the MIMD execution model is shown in (a), while the organization for the SIMD execution model is shown in (b). Other modules such as the application processor and memory system are not shown. In both organizations, each processing cluster operates on a different batch of stream elements in parallel. Notice that we are assuming that the SRF can be independently addressed by each cluster, even in the SIMD model.

In both models, every cluster executes a kernel on its own local batch of data. The differentiating feature is the independence of the VLIW instruction streams for each cluster. In the MIMD model every cluster can be executing a different instruction and a different kernel at any given time. In the SIMD model a single instruction controller issues exactly the same instructions to every processing cluster. With only a single controller, there is only one instruction stream and program counter (PC), and thus one control-flow. Therefore, the MIMD model is more flexible when each cluster wants to execute a different control-flow, a situation that arises due to kernel conditionals.

As shown in the figure, however, the MIMD model requires a separate VLIW instruction controller in return for this added flexibility. Furthermore, since large memory structures can only efficiently support a small set of independent ports, each kernel instruction memory can only serve a small set of clusters (most likely only one cluster). The impact of this extra hardware on the area of the KEU and SRF was evaluated using Khailany's analytical models [Khailany et al., 2003]. For an eight-cluster machine with a layout similar to the IMAGINE processor (Appendix A), the models indicate that the area of the KEU and SRF would be 1.6× larger for a MIMD organization than for a SIMD one. Not only does the MIMD option require more area, but it also increases the complexity of the processor design. For example, to support the MIMD model we would have to provide either multiple application processors (one per instruction controller), or burden a single application processor with both issuing operations to all the instruction controllers and synchronizing them.

Due to this area and complexity penalty, we would prefer to use a SIMD organization. Fortunately, this choice would not degrade the performance of any kernels that do not use conditionals. However, as we show in the next chapter, efficiently executing kernels with conditionals on SIMD organizations has been a challenge historically. Our new solutions will address this problem later in the thesis, thus allowing us to use the more area-efficient SIMD model without much performance loss across a large variety of kernels.

SRF Banking

As explained in [Rixner et al., 2000], the architecture of the SRF incorporates two important optimizations: banking and stream buffer FIFOs. Banking ensures that each processing cluster connects only to its own local SRF bank in order to read and write stream elements. This is done in order to reduce the number of access ports necessary for the SRF memory array, which in turn reduces the number of internal communication paths in the SRF. This partitioning matches the data-parallelism in the stream programming model well, since each cluster will operate on a different stream element. Therefore, during the execution of a kernel each cluster processes the set of elements that are in its own local SRF bank. Any sharing of stream elements between clusters, though, requires communication over the inter-cluster switch.
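As a concrete (if simplified) picture of banking, consider striping consecutive stream elements across the per-cluster banks. This is our own sketch of one plausible element-to-bank mapping, not necessarily the exact policy used by Imagine:

    #include <stdio.h>

    #define NUM_CLUSTERS 8

    /* One plausible striping: element i of a stream lives in the SRF bank
     * of cluster (i % NUM_CLUSTERS), at word offset (i / NUM_CLUSTERS)
     * within that bank.  Each cluster then sees a private, conflict-free
     * slice of the stream; reading a neighbor's element would have to go
     * over the inter-cluster switch instead. */
    static void locate(unsigned i, unsigned *bank, unsigned *offset)
    {
        *bank = i % NUM_CLUSTERS;
        *offset = i / NUM_CLUSTERS;
    }

    int main(void)
    {
        for (unsigned i = 0; i < 16; i++) {
            unsigned bank, offset;
            locate(i, &bank, &offset);
            printf("element %2u -> bank %u, offset %u\n", i, bank, offset);
        }
        return 0;
    }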

Stream Buffers

The second optimization, stream buffers, further reduces the number of ports into the SRF memory arrays. The stream buffers are FIFOs that aggregate multiple accesses into one wide SRF memory array access. For example, consider an input stream, which transfers data from the SRF to the clusters. First the SRF transfers a wide word, say 128 bits, from the SRF memory array into the stream buffer FIFO. Then the processing cluster can read out each 32-bit word one by one from the stream buffer without requiring any more SRF memory array accesses. In this way, many stream buffers, each of which corresponds to a different logical stream in the kernel program, can time-multiplex a single port into the memory array. Like SRF banking, this also reduces the number of necessary internal communication paths in the SRF. This optimization also matches the stream programming model well, since kernels can only read or write streams in order, thereby providing sufficient spatial locality for the stream buffers to exploit. (A small C model of such a buffer appears at the end of this subsection.)

In summary, the architectural optimizations made to the KEU and SRF enable more efficient hardware that occupies less silicon real-estate and dissipates less power. At the same time, these hardware optimizations cause little degradation in kernel performance, since they match the properties of the stream programming model. For example, data-parallelism exposed by the stream model is exploited by banking the SRF and organizing the ALUs into many smaller VLIW engines (clusters), all controlled in SIMD. The stream model also explicitly captures fine-grained (kernel) and coarse-grained (stream) locality; the former allows us to use local register files in the KEU and the latter lets us use a software-managed SRF instead of a cache. Furthermore, by structuring external data accesses in kernels into sequential streams, the SRF implementation can be optimized via stream buffers. Further details on the VLSI-efficiency of stream processor architectures can be found in [Khailany et al., 2001] and [Khailany, 2003].
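The following minimal C model grounds the stream-buffer mechanism described above. The 128-bit line width matches the example in the text; everything else, including the interface, is our own simplification:

    #include <stdint.h>
    #include <string.h>

    #define WORDS_PER_LINE 4   /* one 128-bit SRF line = four 32-bit words */

    typedef struct {
        const uint32_t *srf_bank;       /* backing SRF bank, word-addressed */
        uint32_t line[WORDS_PER_LINE];  /* the buffered wide word           */
        unsigned next_line;             /* next SRF line to fetch           */
        unsigned head;                  /* next word to hand to the cluster */
    } stream_buffer;

    void sb_init(stream_buffer *sb, const uint32_t *srf_bank)
    {
        sb->srf_bank = srf_bank;
        sb->next_line = 0;
        sb->head = WORDS_PER_LINE;      /* empty: forces a fill on first pop */
    }

    /* Each call returns one 32-bit stream element; only every fourth call
     * actually touches the SRF memory array, so four logical accesses
     * share one physical wide access (and one port). */
    uint32_t sb_pop(stream_buffer *sb)
    {
        if (sb->head == WORDS_PER_LINE) {   /* refill: one wide SRF access */
            memcpy(sb->line,
                   sb->srf_bank + (size_t)sb->next_line * WORDS_PER_LINE,
                   sizeof sb->line);
            sb->next_line++;
            sb->head = 0;
        }
        return sb->line[sb->head++];
    }

Because kernels may only read streams in order, this in-order pop interface loses no generality, which is exactly the guaranteed spatial locality the FIFO exploits.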

2.3 Kernel Conditionals

As demonstrated in our discussion above, stream processors incorporate the main ideas from other explicitly parallel architectures, such as SIMD, vector, and VLIW machines.

Unfortunately, even though all of these architectures have traditionally excelled on regular data-parallel applications, the presence of conditionals can significantly degrade their performance. Since stream processors combine ideas from all these architectures, the problem is potentially even more serious for stream processors.

2.3.1 Motivating Example: The geom_rast Kernel

In order to introduce a real example that requires conditionals, we will use a kernel from a polygon rendering application. The kernel, geom_rast, implements polygon mesh assembly, vertex geometry and shading, triangle rasterization, and fragment shading. As seen in Kernel 2.1, geom_rast is typical in that it consists of a main-loop that reads an input from a stream on every iteration. Unlike other more regular kernels, however, there are several data-dependent conditionals embedded in the main-loop of geom_rast. By conditional statements, we are referring to if-statements, case-statements, and while-loops. In other words, we are referring to high-level conditional constructs, and not arbitrary conditional jumps.

The input stream is a sequence of points on a mesh. Each mesh starts with two points, and then every additional point is combined with two previous points to form a triangle that is on the surface of our model geometry. Since some points may be the first or second point in a new mesh, not every loop iteration will generate a valid triangle. The first if-statement in the kernel main-loop checks this condition. Further processing is only required if a valid triangle has been generated; otherwise the loop is restarted and a new mesh point is read.

During the processing of a valid triangle, there are two more similar if-statements. One checks if the triangle is completely off the viewable screen, and one checks if the triangle is forward facing (and, hence, viewable). If a triangle meets these criteria, then a while-loop is executed that generates a fragment for every pixel on the screen that is covered by the current triangle. This operation is called rasterization. Note that the number of iterations of the while-loop executed for each triangle will vary with the size of the triangle. The particular rasterization algorithm we used, however, does not necessarily produce a valid fragment on each loop iteration, and therefore there is a final if-statement within the while-loop. All valid fragments for a triangle are written to the output stream.

Kernel 2.1: Pseudo-code for the geom_rast kernel from the RENDER application. This kernel contains five different conditional statements: four if-statements and one while-loop. Note that code outside the main-loop of the kernel is not shown. The trailing comments mark basic block boundaries and give the name of each basic block.

    loop over vtx_stream begin
        v ← pop(vtx_stream);                                // input: 8 words
        v ← vertex_shader(v);
        tri, mesh_state, valid_tri ← assemble_mesh(v, mesh_state);
                                                            // basic block: MESH
        if valid_tri then
            clipped ← check_clipping(tri);                  // basic block: CLIP
            if ¬clipped then
                tri ← viewport_transform(tri);
                interpolants ← generate_interp(tri);
                culled ← check_backfacing(tri);             // basic block: VIEWPORT
                if ¬culled then
                    rast_tri ← raster_prep(tri);
                    frags_left ← true;                      // basic block: RASTERPREP
                    while frags_left do
                        xyfrag, rast_tri, frags_left ← next_pixel(rast_tri);
                        valid_frag ← inside_tri(rast_tri, xyfrag);
                                                            // basic block: UPDATEXY
                        if valid_frag then
                            bary_coords ← baryprep(xyfrag);
                            new_frag ← apply_interps(xyfrag, interpolants, bary_coords);
                            out_frag ← fragment_shader(new_frag);
                            push(fragment_stream, out_frag); // output: 6 words
                                                            // basic block: GENFRAG
                        fi
                                                            // basic block: CHECK_FRAGS_LEFT
                    od
                fi
            fi
        fi
                                                            // basic block: CHECK_STREAM_EMPTY
    end

2.3.2 Challenges

Fundamentally, conditionals in the main-loop of a kernel do not break the data-parallel nature of the programming model, as demonstrated by the geom_rast kernel. So, if stream processors are built to exploit this data-parallelism very efficiently, why would conditionals reduce performance so much? The reason is that, even though elements can be operated on in parallel, the kernel will execute a different set of operations, and hence follow a different control-flow path, for each element. This creates three challenges, which we discuss below.

1. Conditionals result in poor VLIW schedules (i.e., low achieved ILP).

Data-dependent branches represent barriers to code motion when statically scheduling a kernel. For this reason, fewer operations can be combined together at compile time in order to obtain higher ILP. For example, in geom_rast the operations in the CLIP basic block cannot be combined with those in the MESH block, since it is not guaranteed that CLIP will execute every time MESH executes. Worse, VLIW schedules are usually packed efficiently using standard optimizations that convert data-parallelism to ILP, like software-pipelining and loop-unrolling; because of the branches, however, we will see that these optimizations don't work well. The result is low achieved ILP, which in turn results in performance that is a small fraction of the peak. This problem is especially critical since stream processing clusters are built to support a peak of 10–20 operations in flight concurrently (2–4 pipelined operations in each of five parallel ALUs, for example).

2. Execution on SIMD hardware will require all clusters to execute the union of the control-flow paths required by each cluster.

Consider the example of a while-loop within the main loop of the geom_rast kernel. Assume that the current triangle in one cluster requires 10 iterations of the while-loop, whereas the triangles in the other clusters only require one or two loop iterations. Unfortunately, all of the clusters will have to execute the loop 10 times in this case because of the SIMD execution model. This directly reduces the efficiency of the architecture, since several clusters are wasting valuable ALU issue opportunities while waiting for another cluster to finish processing its triangle. (A short numeric sketch following this list makes this cost concrete.)

3. Conditionals can result in load-imbalance between multiple processing clusters.

Consider what happens when the code that writes data to an output stream in a kernel is inside of an if-statement (as is the case with fragment_stream in geom_rast). Upon completion of this kernel, each cluster may have output a different number of elements to its local SRF bank. The next kernel in the application pipeline that processes this output stream will suffer from load-imbalance, because each cluster will have a different amount of input data to process. This causes a cluster with few elements to process to waste otherwise useful ALU issue slots while waiting for the cluster with the most work to finish.

Notice that even if we solved the SIMD problem (the second problem described above), the geom_rast kernel itself can still suffer from load-imbalance. For example, even though each cluster starts with the same number of inputs in its SRF bank, since each cluster will execute a different set of operations for each element, some clusters can finish earlier than others, causing the early clusters to idle. This reduces the efficiency of the architecture as well.
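To make challenges 2 and 3 concrete, the toy calculation below uses invented per-cluster while-loop trip counts for one main-loop iteration of a geom_rast-like kernel; under SIMD the whole machine pays for the slowest cluster:

    #include <stdio.h>

    #define CLUSTERS 4

    int main(void)
    {
        /* Invented while-loop trip counts: one triangle per cluster. */
        int iters[CLUSTERS] = { 10, 2, 1, 2 };

        int total = 0, max = 0;
        for (int c = 0; c < CLUSTERS; c++) {
            total += iters[c];
            if (iters[c] > max)
                max = iters[c];
        }

        /* SIMD: every cluster steps the loop until the last one finishes. */
        printf("loop iterations executed by the machine: %d\n", max);
        /* Useful work fills total of the CLUSTERS*max issue opportunities. */
        printf("cluster utilization: %.1f%%\n",
               100.0 * total / (CLUSTERS * max));
        return 0;
    }

With these numbers the machine executes 10 iterations for 15 iterations' worth of useful work, a utilization of 37.5%. The same max-versus-mean gap reappears as inter-kernel load-imbalance: if the if-statement guarding fragment_stream fires a different number of times per cluster, the next kernel inherits unevenly filled SRF banks and idles every cluster but the fullest one.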

Historically, highly parallel machines have not fared well on irregular applications containing many conditional statements. At first glance, the same holds true for stream processors, and in fact the problem may be even more pressing since stream processors aggregate so many parallel ALUs on a single chip. To guide our exploration of the problem and its solutions, we illustrated three sources of inefficiency that arise when we broaden our application scope and start executing irregular applications on data-parallel stream processors. We did make the observation, however, that conditionals do not break the data-parallel nature of kernels; they just cause each cluster to execute different sets of operations. So we must look into ways of making the computation more regular between the clusters. The remainder of the thesis will present and evaluate new solutions for improving the performance of conditionals on stream processors.

Chapter 3

Improving VLIW Schedules with Conditional Routing

The first major challenge we identified for kernels with conditionals is achieving a good VLIW schedule, by making sure that all ALUs are busy on each cycle. Let's assume that we have a processing cluster with five ALUs, where each ALU is identical and can perform all arithmetic operations. Furthermore, assume that the ALUs are pipelined such that a floating-point addition or multiply takes four cycles to execute, and that a new operation can be started on each cycle. Then there can be up to 20 operations executing in parallel on such a cluster. The ILP in a kernel must be exploited in order to ensure that there are 20 useful operations in flight at any given time in the final VLIW schedule. However, this is beyond the ILP inherent in most media processing kernels. Therefore, software-pipelining and/or loop-unrolling are generally used to convert available data-parallelism into available ILP. This allows most media processing kernels to achieve 80%–90% of the peak possible performance of such a VLIW processing cluster, provided there are no branches in the kernel.
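Stating the arithmetic explicitly for the hypothetical five-ALU, four-cycle-latency cluster just described, the number of operations that must be in flight to saturate the cluster is

\[
N_{\text{in-flight}} \;=\; N_{\text{ALU}} \times \text{pipeline depth} \;=\; 5 \times 4 \;=\; 20,
\]

so the scheduler must find roughly 20 mutually independent operations at every point in the schedule, which is exactly the independence that software-pipelining and loop-unrolling supply by drawing operations from multiple stream elements.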

3.1 Motivation

As soon as we introduce conditionals into the main loop of a kernel, basic software-pipelining and loop-unrolling break down. These techniques rely on the assumption that


each stream element (and hence each iteration of the main-loop of a kernel) will execute the exact same sequence of operations. This assumption is broken when a conditional is added to the main-loop, and thus we cannot apply these techniques in the normal manner any longer. The result is poor overall performance. To illustrate the extent of the problem, we will take the example of the geom rast kernel, introduced in Section 2.3.1. Consider compiling this kernel to a simple processing cluster with only one unpipelined ALU. Since this cluster can only support a single operation in flight at any time, the available ILP in the geom rast kernel will certainly ensure that almost all of the ALU issue slots are occupied. Now, consider the speedup of the kernel when going from this simple cluster to the one with five pipelined ALUs. For geom rast, without any further optimizations on the larger cluster the speedup is only 2.9×, compared to a peak possible speedup of 20×. Clearly, there is a dire need for improving the performance on the larger cluster, since the base kernel schedule utilizes less than 15% of the available ALU resources (2.9/20 ≈ 15%), whereas up to 90% is typical of kernels without conditionals.

3.1.1 Existing Techniques: If-statements

There are existing methods for improving this performance, but they have limitations. Let's consider if-statements now; we will discuss while-loops below. There are two types of techniques for if-statements. First, there are those that extend the data-parallel techniques, such as software-pipelining and loop-unrolling, to be compatible with if-statements. Second, there are speculative techniques that improve ILP by combining blocks based on compile-time guesses about which control-flow paths are the most common.

Data-parallel techniques

Let's first discuss the techniques that take advantage of data-parallelism and convert it to instruction-level parallelism. Loop-unrolling normally works by duplicating the body of the main-loop in a kernel. If there is data-parallelism, the operations for one iteration (or stream element) will not be data-dependent on the operations from the other iteration, and hence the available ILP will increase. However, all the operations in the body of the main-loop cannot be lumped together when conditionals are present, since each iteration may

execute a different subset of the operations. Only the basic blocks that deterministically execute every iteration can be “unrolled” into a larger basic block. This is obviously not very fruitful if most of the operations are within conditional statements. Incidentally, this is the case for the geom rast kernel of Kernel 2.1, since only the MESH basic block is deterministically executed every iteration.

Software pipelining also takes advantage of data-parallelism, by overlapping different portions of the same loop body with each other in order to increase the available ILP. Each such overlapping portion is called a stage of the pipeline. The concept can be extended to handle conditionals as well. When a conditional block is present, say an if-then-else statement, we have to generate two versions of the loop: one that overlaps the then-clause with the rest of the stages in the pipeline and one that overlaps the else-clause with the rest of the stages in the pipeline. This technique works well for small conditionals, and is the approach taken by several previous researchers [Ebcioglu and Nakatani, 1990; Lam, 1988; Warter et al., 1992; Stoodley and Lee, 1996]. However, the main problem with this technique is that the number of different versions of the loop that are generated can be large. This problem is referred to as code explosion. To see how the problem arises, consider what happens when the number of operations in one or both clauses of the if-statement gets large. In that case the operations from one of these clauses might span two or more software-pipeline stages. However, for each extra stage spanned, the number of stream elements that are concurrently executing the if-statement increases. Thus, the number of versions of the loop that have to be scheduled multiplies for each extra stage spanned, because we have to account for every possible simultaneous combination of path choices for the total number of stream elements in the if-statement at any given time. For similar reasons, code explosion also becomes a problem as the nesting of conditionals increases. Code explosion can be problematic for stream architectures, since they have a limited amount of instruction storage in the micro-controller, and for conventional architectures that rely on a fixed-size cache for providing sustained instruction bandwidth.
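As a rough count of this growth (assuming a single two-way if-statement whose clauses span s software-pipeline stages, with path choices independent across the elements in flight), the scheduler must emit one version of the loop for every simultaneous combination of choices:

\[
N_{\text{versions}} \;\approx\; 2^{\,s},
\]

so a clause spanning three stages already implies on the order of 2^3 = 8 loop versions, and nested conditionals raise the base of the exponent further.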

In order to get around the issues with loop-unrolling and software-pipelining when conditionals are present, we will instead turn to speculative techniques to increase ILP.

Speculative techniques: COMMON

The basic idea is to improve ALU utilization by speculatively executing operations. Speculation potentially increases the amount of ILP in a basic block by adding operations from future basic blocks into the current basic block, assuming at compile time that the branches delimiting the basic blocks will evaluate in a certain way. If some of the operations in the future basic block are not dependent on results from the current basic block, they can be scheduled in parallel with the existing operations and the ILP will increase.

We will classify the two types of techniques that use speculation as COMMON and PREDICATION. They differ in how they decide which basic blocks to combine, and in how they are implemented. COMMON (often referred to simply as speculation in the literature) attempts to find the most commonly taken control-flow path through the code. All basic blocks in this path are combined into a single block at compile time, potentially increasing the available ILP in that block. Of course, at execution time, we wouldn’t know until the end of the speculative block whether or not the current stream element actually wanted to take the common path we optimized for. If the current stream element wanted to take a different control-flow path, then certain results will have to be discarded, and perhaps the computation may have to be restarted at an earlier point. The compiler must insert extra code, commonly called fix-up code, for this task.

A good example of an existing compiler technology that employs speculative techniques similar to COMMON is trace scheduling. Fisher introduced trace scheduling [Fisher, 1981], and later Ellis implemented it in a compiler [Ellis, 1986]. This technique develops traces that span multiple branches and basic blocks, and optimizes each trace. Traces are chosen for paths that are most commonly executed, and fix-up code is inserted into the other paths to account for mispredictions. A more recent example of a compiler algorithm that uses speculation is superblock scheduling [Hwu et al., 1993]. Two things reduce the utility of speculation when elements start to stray from the common path. First, the speculative code was unnecessarily executed, and took up valuable ALU issue slots. Second, executing the fix-up code can be quite expensive, since the compiler generally focuses all effort on optimizing the common path and not the mispredict path.

Speculative techniques: PREDICATION

The PREDICATION technique also executes basic blocks speculatively; however, it uses a different selection criterion to choose which blocks to speculate. First, PREDICATION speculatively executes basic blocks that are not necessarily on the most frequently executed control-flow path. This choice is beneficial, for instance, if the block only contains a small number of operations, since the cost of possibly unnecessarily executing the small basic block is usually minimal. The second difference between COMMON and PREDICATION is that PREDICATION will speculatively execute multiple paths of control in the same block, and then only select results from a single control-flow path at the end of the block. In order to implement predication, special hardware ensures that only the results from the correct control-flow path are retained. This can be implemented, for instance, by providing a boolean parameter to each operation that indicates whether or not the results from that operation should be stored. The boolean conditions are calculated using the same tests as the if-statements would have used. However, instead of using them as inputs to a conditional branch instruction, they are used as predicates for the operations that would have been within the body of the if-statements. An example of a VLIW architecture that supports predication in order to improve achieved ILP is the IA-64 architecture [Dulong, 1998]. Also, previous research has attacked the problem of automatically converting the if-statement conditions into predicates, resulting in a compiler algorithm called if-conversion [Towle, 1976; Allen et al., 1983]. A more recent compiler algorithm that uses predication is Mahlke et al.'s hyperblock scheduling [Mahlke et al., 1992]. The formation of a hyperblock is similar to that of a normal trace, except it also incorporates predication in order to reduce the number of paths and exception cases. The researchers have found that judicious use of this technique yields good speedups on architectures that support predication when many small conditional clauses are present in the main-loop.
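To make the if-conversion transformation concrete, here is a minimal sketch in C, using the conditional operator to stand in for a hardware select; the function body( ) is a hypothetical placeholder for the guarded computation, not part of any real kernel.

    /* Before if-conversion:  if (cond) state = body(state);           */
    /* After if-conversion: branch-free code using a SELECT operation. */

    static float body(float s) {      /* stand-in for the guarded block */
        return 2.0f * s + 1.0f;
    }

    static float if_converted(int cond, float state) {
        float tmp = body(state);   /* executed for every element (speculative) */
        return cond ? tmp : state; /* SELECT retains the correct result        */
    }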

However, while PREDICATION avoids the performance penalty due to fix-up code, it still incurs the cost of executing code from extraneous control-flow paths. As a result, conditional clauses that have many operations can be problematic. Furthermore, as the number of control-flow paths in the kernel increases (due to nested conditionals, for example), the utility of predication starts to decrease.

So, the same problem remains for both COMMON and PREDICATION—at some point we have to execute unnecessary operations. This might be due to mispeculation or because we used predication to simultaneously follow multiple control-flow paths.

Combined techniques

In any case, even when the common control-flow path can be predicted very accurately, these speculative techniques only help marginally on their own. They can certainly improve the ILP in a kernel main-loop, but cannot guarantee to keep increasing the ILP until all ALU slots are busy. Instead, the most powerful existing techniques are those that combine the data-parallel techniques with speculative ones. For example, speculation can be combined with software-pipelining to generate a well-pipelined schedule for the most common control-flow path through the loop; when less frequent paths are required, control can jump out of the loop and handle the fix-up code. This tactic eliminates code explosion and packs the ALUs efficiently for the most common path, and is typified in the Multiple-II Modulo Scheduling algorithm [Warter-Perez and Partamian, 1995]. Another possible approach is to use predication to remove all the conditionals in a loop, after which the whole loop can be easily software-pipelined. This approach was first used by the Cydra 5 compiler [Rau et al., 1989; Dehnert and Bratt, 1989]. Yet another option is to combine speculation and loop-unrolling, by assuming that successive loop iterations will follow the same path. Instead of continuing to list all the possible combinations here, we will leave further discussion to Section 3.2, where the performance of these combined techniques will be compared to our new technique, conditional routing. We will see that even though these combined techniques use data-parallelism to improve the achieved ILP, they do not decrease the number of unnecessary operations executed for extraneous control-flow paths. In contrast, we will show that conditional routing is able to use data-parallelism without having to execute operations from unnecessary control-flow paths.

3.1.2 Existing Techniques: While-loops

We will shift our focus now to the other type of conditional structure: a while-loop. It turns out that for a while-loop, an effective way to improve the achieved ILP is to use

loop-unrolling to pack the operations in the deterministic blocks and then to use software-pipelining to optimize the inner while-loop. Unfortunately, this doesn't always help. Take the geom rast kernel of Kernel 2.1 as an example again. First of all, only a small fraction of the operations are in basic blocks that always execute every iteration, rendering loop-unrolling ineffective. There are three if-statements outside the while-loop, which means more advanced techniques will have to optimize those basic blocks. In any case, even if we can optimize all the operations outside the while-loop well, software-pipelining will not help the while-loop itself much, for two reasons. First, the resulting schedule is constrained by a long loop-carried dependency. Second, the inner while-loop itself contains an if-statement! Another possibility for handling this loop is to increase ILP by executing the while-loop for two stream elements at the same time. Since the two elements will not necessarily execute the same number of loop iterations, there will be times when we execute more operations than necessary. As before, this is not ideal, and we would like to come up with a technique that packs the ALUs, but only does so with useful operations. Let's look at geom rast once more in order to summarize the main point of this section, which is that the existing conditional techniques for increasing ILP are insufficient. The best speedup we can achieve for geom rast after applying existing techniques for both if-statements and while-loops is 1.8×.1 This brings us from 15% of peak ALU utilization to only 26%. Clearly we still have much room for improvement. The rest of the chapter will focus on a new technique, conditional routing, that can provide higher speedups.

3.2 Applying Conditional Routing to If-Statements

Conditional routing is a transformation that reorders the computation in a kernel containing an if-statement by splitting it into multiple kernels. The new kernels contain no conditional branches, which results in both good VLIW schedules and efficient execution on SIMD organizations. This section focuses on schedule improvements on a VLIW cluster for kernels

1These results were obtained by applying the existing techniques manually. Certainly, we might achieve a slight improvement by using a compiler to apply these techniques more optimally. However, the same improvement from using a compiler can also be achieved for the new techniques we will introduce.

with if-statements, and the next section, Section 3.3, will do the same for while-loops. We will see, however, that the new kernels generated by the conditional routing optimization buffer intermediate results in the SRF. Thus, we must consider the impact of this increased SRF communication on overall performance, and whether reordering the work in the kernels also reorders the data elements in the intermediate and final streams. While stream processors are normally built with a set of SIMD processing clusters, we are only considering the performance of one VLIW cluster throughout this chapter. We do this in order to isolate the ILP benefits offered by conditional routing. Later, in Chapter 4, we will add the extra dimension of multiple clusters controlled in SIMD. Essentially, we are optimizing the individual unit of a cluster now, and in later chapters we will optimize how several clusters operate together. Figure 3.1(a) shows the original version of an example kernel, shapes, that contains a conditional if-statement. This kernel processes a stream of shapes (lines, squares, and circles) and executes the pseudo-code shown for each stream element, completing the processing for one stream element before moving on to the next. In particular, for each stream element: the operations in Compute 1( ) will be executed; then, only if the stream element is a circle, the operations in Compute Circle( ) will be executed; and, finally, the operations in Compute 2( ) will be executed, regardless of the shape of the stream element. This series of steps is repeated for each stream element. The result of applying the conditional routing transformation to this example is shown in part (b) of Figure 3.1. The first thing to notice is that the code within the conditional statement has been separated out into its own kernel. Doing this results in three new kernels: compute 1, compute circle, and compute 2. Their names suggest which set of operations from the original kernel are in each new kernel. With this new set of kernels, the operations in compute 1 are applied to every element in the input stream before the remaining kernels are applied to any stream elements. Note that this is different from the order in which the original kernel executed its operations. This reordering of work is only valid if the dependencies from element to element in the original kernel satisfy certain constraints. In particular, there should be no loop-carried dependencies in the original kernel from Compute Circle( ) to Compute 1( ), or from Compute 2( ) to either of the other two functions.

[Figure 3.1 diagram, rendered as text:]

(a) Original kernel, shapes, reading in_data:

    LOOP OVER shapes {
        Compute_1();
        IF (is_circle) {
            Compute_Circle();
        }
        Compute_2();
    }

(b) With conditional routing, three kernels connected by streams:

    compute_1 (reads in_data):           LOOP OVER shapes { Compute_1(); }
    compute_circle (reads circles_only): LOOP OVER shapes { Compute_Circle(); }
    compute_2 (reads other_shapes
               and circles_only_2):      LOOP OVER shapes { Compute_2(); }

    compute_1 routes each element to either circles_only or other_shapes,
    and also emits is_circle_stream (values such as F, F, T, ...), which
    compute_2 uses to merge the two streams back in the original order.

Figure 3.1: Applying conditional routing to a kernel containing an if-statement. The original kernel is shown in (a), while (b) shows the result of applying conditional routing.

(a)

    /* Original version */
    shapes(in_data, out_data);

(b)

    /* With conditional routing */
    compute_1(in_data, other_shapes, circles_only, is_circle_stream);
    compute_circle(circles_only, circles_only_2);
    compute_2(other_shapes, circles_only_2, is_circle_stream, out_data);

Figure 3.2: Application-level code for (a) the original kernel and for (b) the new set of kernels that use conditional routing to implement the if-statement.

The output streams of compute 1 are then passed to the remaining kernels. In order to implement the semantics of the original conditional statement, the outputs of compute 1 are split into two streams of shapes: one contains only the circles and the other contains only the other shapes. Furthermore, there is a third output stream, is circle stream, that keeps track of which of the two new streams each element of the original stream went to. This third stream will be useful for combining the two split streams into one later on. At this point, the compute circle kernel operates on the elements of the stream with only the circles. Finally, the compute 2 kernel operates both on the stream containing the non-circles that was generated by the first kernel and on the circles output by compute circle. All stream elements are output by compute 2 into a single stream. is circle stream is used by compute 2 to recreate the original ordering of the stream elements. The case values dictate which input stream the compute 2 kernel should read from to get the next input element. With the use of this case stream, the compute 2 kernel can ensure that it inserts results into out data in the same order as the original kernel in part (a) of Figure 3.1 would have. The application-level pseudo-code in Figure 3.2 illustrates the sequence of kernel calls needed for conditional routing, and also shows the kernels' inputs and outputs.

Kernel 3.1 Pseudo-code for the compute 1 kernel showing the usage of the conditional routing output primitive ("pushif") for stream writes. Note that the stream names used inside the kernel coincide with the names of the stream arguments in the kernel call in Figure 3.2 only for clarity. These two sets of names can be completely different; in actual code the arguments and the variables in the kernel are matched via their order in the kernel header and their order in the kernel call, just as with a function call in the C language.

loop over in_data begin
    In_Shape ←pop in_data;
    Shape, is_circle ← Compute_1(In_Shape);

    /* The data structure Shape is only written to */
    /* circles_only if is_circle is TRUE;          */
    /* otherwise circles_only is unchanged         */
    circles_only(is_circle) ←pushif Shape;
    other_shapes(¬is_circle) ←pushif Shape;
    is_circle_stream ←push is_circle;
end

In order to implement the output routing for compute 1 and the input routing for compute 2, we employ two conditional routing primitives. Each allows conditional access to a stream based on a boolean case value. To illustrate how they are used, see Kernel 3.1 for the pseudo-code for compute 1. In this kernel, the "pushif" operation is a conditional routing primitive. Essentially, it appends the Shape record to the stream only if is circle is true. This is implemented very easily in hardware as one atomic operation, and hence does not require any conditional control-flow (i.e., branches) to implement. It is scheduled like any other stream access, and thus has minimal impact on performance.2 The idea extends easily to the compute 2 kernel (Kernel 3.2), which requires the second conditional routing primitive: "popif," which removes the head element from a stream and updates the target record variable only if the specified boolean condition is true. No conditional routing primitives were required by the compute circle kernel. We have replaced data-dependent control-flow with data-dependent communication.

2Performance impact is minimal unless, as we show in Section 3.2.3, the record size of Shape is so large that scheduling the SRF accesses, and not the arithmetic operations, limits the performance of the kernel.
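To clarify the semantics of the two primitives, here is a minimal software sketch in C. It is illustrative only: the stream type, the single-word elements, and the conditional-move style are our assumptions, whereas the real primitives operate on multi-word records and are implemented as single atomic SRF accesses in hardware.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical array-backed stream, for illustration only. */
    typedef struct {
        float *data;    /* backing storage                  */
        size_t head;    /* index of the next element to pop */
        size_t tail;    /* index of the next free slot      */
    } stream_t;

    /* pushif: append value to the stream only when cond is true.
     * Written without a branch: the write always happens, but the
     * tail pointer only advances when cond is true.               */
    static inline void pushif(stream_t *s, bool cond, float value) {
        s->data[s->tail] = value;
        s->tail += cond ? 1 : 0;
    }

    /* popif: remove the head element and update *target only when
     * cond is true; otherwise both stream and target are unchanged. */
    static inline void popif(stream_t *s, bool cond, float *target) {
        float head = s->data[s->head];
        *target = cond ? head : *target;   /* SELECT-style update */
        s->head += cond ? 1 : 0;
    }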

Kernel 3.2 Pseudo-code for the compute 2 kernel showing the usage of the conditional routing input primitive ("popif") for stream reads.

loop over is_circle_stream begin
    is_circle ←pop is_circle_stream;

    /* The data structure In_Shape is      */
    /* only updated if the case condition  */
    /* is TRUE; otherwise it is unchanged. */
    In_Shape ←popif circles_only_2(is_circle);
    In_Shape ←popif other_shapes(¬is_circle);

    Shape ← Compute_2(In_Shape);

    out_data ←push Shape;
end

Furthermore, since the new communication is implemented with conditional routing primitives, and not with branches, there is no longer any conditional control-flow in the kernel main-loops. Thus we can efficiently apply software-pipelining and loop-unrolling to increase the ILP of each kernel. The important point about conditional routing, however, is that it only executes as many operations as are strictly required. On the other hand, predication wastes valuable ALU slots because it executes multiple alternate paths and only keeps the result from one path. Likewise, speculation wastes ALU slots whenever a stream element strays from the common control-flow path. As we will see in the next section, this difference will determine the relative performance of the techniques.

3.2.1 Performance

The previous example demonstrated the conditional routing technique and qualitatively discussed its impact on performance. We will now quantitatively analyze the performance of conditional routing. However, before we present the data, we will briefly discuss the evaluation methodology and the software tools we used to obtain the data in this and future chapters.

Stream Processing Tools and Evaluation Methodology

In order to evaluate the performance of our proposed techniques on the stream processing architecture we have been discussing so far, we will use the compilation and simulation tools originally developed for the IMAGINE stream processor (Appendix A). All of the IMAGINE tools are configurable via a common machine description (MD) file. This MD file allows a developer to adjust the following properties (among others): the number of SIMD processing clusters, the number of ALUs within a cluster, the mix of instructions supported by each ALU, the size of the SRF, and the bandwidth between the SRF and processing clusters. Therefore, even though the specific cluster and KEU architectures we will evaluate in this thesis are slightly different from the IMAGINE architecture, the same tools will seamlessly support our experiments through the use of the MD file. Since we are focusing on kernel performance in this thesis, the results are obtained by simulators that report performance based only on the lengths of the basic block schedules output by the VLIW kernel scheduler. This is a reasonable approach because the performance of kernels on a stream processor is very close to the predicted static schedule, since there are very few variable-latency operations. In particular, unpredictable DRAM access times are avoided since kernels cannot access DRAM directly. There are two sources of inaccuracy, however, with this approach. First, SRF accesses can stall since a stream buffer may be full or empty, but these types of stalls do not occur often. For example, one study reports that across a set of media applications, kernel stalls accounted for less than 5% of application execution time [Ahn et al., 2003]. The second inaccuracy is that kernel loop performance doesn't necessarily translate to application performance. In general, memory and application processor bottlenecks may impact performance. We will largely ignore these in this dissertation, however, since our goal is to make kernel execution faster, regardless of how the kernel fits into the greater application. Finally, since we are using the schedule lengths generated by the VLIW kernel scheduler to evaluate the performance of conditionals in this thesis, we should mention the capabilities of the scheduler in this regard. First of all, our scheduler can only handle one conditional structure: a while-loop. All other conditional structures are built

out of this. For example, an if-statement is a while-loop that may execute zero or one times. This is slightly less efficient than supporting an if-statement directly, but does not affect the final schedule very much, especially for compute-intensive kernels. Furthermore, the current kernel scheduler does not automatically apply the optimizations we discuss in this thesis. The only exceptions are that the scheduler can automatically software-pipeline or unroll a loop, but only if there are no conditionals nested in the loop. One of the goals of this thesis, however, is to show that the techniques we introduce have enough utility that they warrant further work to incorporate their application into a compiler.
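As a small sketch of that reduction (a hypothetical C-level view, not the scheduler's actual intermediate form), an if-statement can be expressed as a while-loop that is forced to exit after its first trip:

    static void body(void) { /* placeholder for the guarded block */ }

    static void if_via_while(int cond) {
        /* if (cond) { body(); } using only a while-loop: */
        while (cond) {
            body();
            cond = 0;          /* guarantees at most one iteration */
        }
    }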

If-Statement Synthetic Benchmark

We have used the methodology and tools we just described in order to do a more quantitative analysis of the performance of conditional routing. Since we are relying on manual conversion of code in order to implement the various conditional techniques, we have run most of our experiments using a synthetic benchmark, synthetic if, shown in Figure 3.3. This benchmark, and the others throughout the thesis, attempt to extract the basic building blocks of conditional statements. This allows us to run various experiments without changing large and complicated kernels by hand. The benchmark is similar in structure to the example we studied above, and the pseudo-code is shown in Kernel 3.3. There is one if-statement in the kernel, which splits the main loop of the kernel into three basic blocks: INPUT (includes the operations in TEST( )), BODY, and OUTPUT. The code within each of these basic blocks is synthetically generated, and is guaranteed to satisfy certain properties. In particular, there are no loop-carried dependencies from a basic block to any basic block above it (i.e., from BODY to INPUT or from OUTPUT to either of the other two blocks). Also, the length of the VLIW schedule for each basic block is limited by the critical path formed by dependencies between the operations, and not by the number of ALUs in the VLIW cluster (which is five for our target architecture). The parameters of the benchmark that we will vary at first are the number of operations in the basic block BODY (Wbody), the fraction of stream elements for which BODY is executed (p), and the amount of live state that needs to be passed to and from each basic block (S). We will assume that branch decisions are independent from one stream element to the next.

[Figure 3.3 diagram, rendered as text: the synthetic if main loop consists of INPUT (# ops = Winput), followed with probability p by BODY (# ops = Wbody) and with probability (1-p) by a bypass, and finally OUTPUT (# ops = Woutput); S words of live state are passed between consecutive blocks.]

Figure 3.3: Basic block diagram of the synthetic if benchmark. When illustrating basic block diagrams for kernels in this dissertation, only the flow for the main loop will be shown. Furthermore, the backward branch for the main loop is implicit and will not be indicated on the diagram. Finally, boxes with two vertical lines on either side represent the main loops of kernels, and boxes with single lines represent basic blocks. The parameters of the benchmark that we will vary throughout this chapter are illustrated: p is the probability that the if-statement evaluates to TRUE; WXYZ is the number of arithmetic operations in the basic block XYZ; and S is the number of words of internal state that must be passed from one basic block to the next.

Kernel 3.3 Pseudo-code for the synthetic if benchmark. The simple kernel has a single conditional block. The variables whose first letter is capitalized are multi-word structures. Furthermore, all functions are in-lined into the kernel, and do not update their input parameters. The horizontal lines delineate basic block boundaries.

loop over in_stream begin
--------------------------------------------------------
    In_state ←pop in_stream;
    State ← INPUT(In_state);
    cond ← TEST(State);                          INPUT
--------------------------------------------------------
    if cond then
        State ← BODY(State);                     BODY
    fi
--------------------------------------------------------
    Out_state ← OUTPUT(State);
    out_stream ←push Out_state;                  OUTPUT
end
--------------------------------------------------------

Also, unless otherwise stated, the results assume infinitely long streams, and hence depend only on the schedule produced for the code in the main loop of the kernel. We will relax this assumption later in the chapter to study the effect of non-main-loop code and of software-pipelining setup and tear-down.

Conditional Routing Performance

Conditional routing is applied to the synthetic if kernel in exactly the same way as the example in the previous section, yielding three kernels, one for each basic block. Figure 3.4 shows the impact of conditional routing on this benchmark. The graph shows the execution times for Wbody = 32 operations, Winput = Woutput = 17 operations, and S = 2 words. The original kernel schedule is between 300% and 340% longer than the ideal schedule execution time over the range p = {0 . . . 1}. The number of cycles to execute IDEAL is generated by simply dividing the number of arithmetic operations by the number of ALUs. ORIGINAL is so much slower than IDEAL because the schedules for the three basic blocks are limited by the critical path of each block, since there is not enough ILP to pack the VLIW schedules for the five-ALU cluster. Applying conditional routing improves the schedule because the three new kernels can all be software-pipelined efficiently, since none of them contain any conditional control-flow. In fact, the effective execution time per stream element is within 16%–21% of the ideal. This relatively small overhead is due to the extra operations that are required to implement the routing of the stream elements between the kernels and to ensure that the order of the final output stream is correct.

[Figure 3.4 plot omitted: average execution time per stream element, normalized to IDEAL, versus the fraction true, p, for ORIGINAL, CONDITIONAL ROUTING, and IDEAL.]

Figure 3.4: Impact of conditional routing on execution time for synthetic if. The execution times are shown normalized to IDEAL. Wbody = 32 operations; Winput = Woutput = 17 operations; S = 2 words.
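For reference, the IDEAL baseline in Figure 3.4 can be reproduced from its definition above (arithmetic operations divided by the number of ALUs) and the stated benchmark parameters, assuming the five-ALU cluster:

\[
T_{\text{ideal}}(p) \;=\; \frac{W_{\text{input}} + p\,W_{\text{body}} + W_{\text{output}}}{N_{\text{ALU}}}
\;=\; \frac{17 + 32p + 17}{5} \;=\; 6.8 + 6.4\,p \ \text{cycles per element}.
\]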

3.2.2 Comparison to Other Techniques

While the previous section shows that conditional routing demonstrates good absolute performance over a range of values of p, we must still compare its performance to existing techniques. As we discussed in Section 3.1.1, the best existing techniques for handling if-statements combine speculation and predication with data-parallel techniques, such as loop-unrolling and software-pipelining. This section will demonstrate that conditional routing is the only technique that works well over the whole range of values of p. Furthermore, we will show that the magnitude of the advantage of conditional routing can be even more dramatic when we consider more complicated benchmarks (synthetic if else and synthetic case, Section 3.2.2).


Figure 3.5: Comparison of execution times for existing techniques and conditional routing for the synthetic if kernel. The execution times are shown normalized to IDEAL. Wbody = 32 operations; Winput = Woutput = 17 operations; S = 2 words.

Performance Comparison for Different Values of p

The particular techniques we will compare against below are: COMMON (TRUE), COMMON (FALSE), COMMON (SAME), and PREDICATION. Results for all techniques on synthetic if are shown in Figure 3.5. The superior results for conditional routing illustrate that it does not need to execute unnecessary operations, unlike speculation and predication. Speculation generates a more efficient schedule for the common control-flow path, at the expense of the less frequent paths. However, to generate enough available ILP to fully utilize five pipelined ALUs, speculation must be combined with loop-unrolling. While this makes the common case more efficient, it also tends to increase the performance penalty for mispeculation. This sensitivity to mispeculation is the main problem with these techniques, and makes them useful only when the control-flow can be predicted very accurately at compile time. We demonstrate this with three performance curves, each one optimized for a different common case. COMMON (TRUE) is optimized for the case when most elements

execute the if-statement (cond is TRUE). COMMON (FALSE) is optimized for the case when most elements do not enter the body of the if-statement (cond is FALSE). Finally, COMMON (SAME) is optimized for the case when the evaluation of the if-statement doesn't necessarily lean one way or the other on average, but successive stream elements tend to follow the same path (i.e., the value of cond is highly correlated from one stream element to the next). To illustrate how to generate schedules for this technique, we have implemented COMMON (TRUE) as a source code transformation and shown the pseudo-code in Kernel 3.4. The advantage of this code is that when the if-statement conditions evaluate to TRUE, performance is not degraded by a poor schedule for BODY. The operations for BODY are combined with operations in other basic blocks via speculation, and with operations in BODY from other iterations via loop-unrolling. Notice that operations were chosen for speculative execution without any regard to exceptions that could be raised. This is because we are assuming that our stream processor architecture does not support the trapping of exceptions in hardware; instead, exceptions must be checked manually and the handling code must be inserted by the programmer. For this benchmark we assume that exceptions are not important and do not need to be checked in software.3 So, the only restriction on executing operations speculatively is that the operation does not modify any persistent state; and if it does, the modification needs to be undone in the code that handles mispredictions.

3This is actually the case for most kernels we have implemented to date, so it is a reasonable assumption to make.

Kernel 3.4 Pseudo-code for the synthetic if benchmark with speculative code inserted that optimizes the resulting schedule for the case where the if-statement condition evaluates to TRUE. The code is also loop-unrolled so that two iterations are executed every trip through the loop.

loop over in_stream begin
    In_state ←pop in_stream;
    StateA1 ← INPUT(In_state);
    cond1 ← TEST(StateA1);

    In_state ←pop in_stream;
    StateA2 ← INPUT(In_state);
    cond2 ← TEST(StateA2);

    StateB1 ← BODY(StateA1);    // Speculatively execute BODY( )
    StateB2 ← BODY(StateA2);

    if ¬cond1 ∨ ¬cond2 then     // Fix-up code
        if ¬cond1 then StateB1 ← StateA1 fi
        if ¬cond2 then StateB2 ← StateA2 fi
    fi

    Out_state ← OUTPUT(StateB1);
    out_stream ←push Out_state;
    Out_state ← OUTPUT(StateB2);
    out_stream ←push Out_state;
end

The COMMON (TRUE) line in Figure 3.5 is only 18% slower than IDEAL when p = 1.0, which is the case it is optimized for. However, as p decreases, the mispeculation penalty increases. In particular, executing BODY when it is not necessary and executing the less efficient fix-up code causes the performance when p = 0.0 to be even worse than the plain unoptimized schedule (ORIGINAL). Even by p = 0.9 the performance of COMMON (TRUE) is worse than CONDITIONAL ROUTING. Thus, this particular version of the schedule is only useful if we know p is very close to 1. The COMMON (FALSE) curve shows similar properties to COMMON (TRUE). It performs well in the case it is optimized for (p ≈ 0.0), but pays a mispeculation penalty as p deviates from this value. COMMON (SAME) actually performs well both when p = 0.0 and when p = 1.0, because there is a high correlation between successive stream elements at those points. However, the mispeculation penalty even for COMMON (SAME) makes it perform worse than CONDITIONAL ROUTING for most values of p. Besides mispeculation penalties, another problem with this technique is that loop-unrolling can lead to an increase in the size of the scheduled kernel code. For synthetic if, the COMMON (TRUE), COMMON (FALSE), and COMMON (SAME) schedules were 205, 258, and 171 VLIW instructions long, whereas the three kernels for CONDITIONAL ROUTING totaled 70 instructions and PREDICATION was only 19 instructions long. If the instruction storage in the micro-controller is insufficient to hold the working set of kernels in an application, then kernels have to be swapped in and out dynamically. This can increase the demands on DRAM bandwidth and SRF capacity. Predication makes a different trade-off. Instead of only executing the common path and executing expensive fix-up code when another path is required, predication executes multiple paths and then selects the correct results. The important effect of predication is to remove conditional branches from the schedule. The result is that software-pipelining and loop-unrolling are very effective in increasing the achieved ILP. So, this technique is useful when the increased performance due to the improved schedule quality outweighs the overhead of executing the operations from multiple control-flow paths. This can happen, for instance, when the most commonly taken control-flow path contains many more operations than the other control-flow paths.

Kernel 3.5 Pseudo-code for the synthetic if kernel after optimizing it using predication. Instead of using a full set of predicated operations, the only predicated operation used by this version is a hardware SELECT operation.

loop over in_stream begin
    In_state ←pop in_stream;
    State ← INPUT(In_state);

    cond ← TEST(State);

    /* Execute BODY( ) whether or not we actually need to */
    State_tmp ← BODY(State);

    /* Choose the correct version of State based on cond */
    State ← select(cond, State_tmp, State);

    Out_state ← OUTPUT(State);
    out_stream ←push Out_state;
end

On our target architecture, instead of augmenting every supported operation type in the kernel ISA with the ability to be predicated, we will only provide one predicated operation, namely the SELECT operation. This operation essentially mimics the C "?:" operator, and is similar to the one supported in the Multiflow compiler and hardware [Lowney et al., 1993]. Kernel 3.5 illustrates how to implement the synthetic if kernel using predication and how the SELECT operation is used. The absolute execution time of the predication technique does not depend on how often each control-flow path is taken. This means that the execution time normalized to IDEAL will deteriorate with decreasing p, since IDEAL gets faster. However, notice that while the performance at p = 0.0 is much worse than CONDITIONAL ROUTING, it is not as bad as COMMON (TRUE). This is because predication handles choosing the right results with SELECT operations, and it does not need to execute inefficient fix-up code to handle the cond = FALSE case. The conclusion from this analysis is that the performance of conditional routing is the least affected by the value of p, since it does not execute any potentially unnecessary operations from speculative control-flow paths.


Figure 3.6: This graph shows the results for various conditional techniques on the synthetic if else benchmark. The execution times are normalized to IDEAL. Wif = Welse = 32 operations, Winput = Woutput = 17 operations, S = 2 words.

Performance Comparison for Generalized If- and Case-Statements

It turns out that the benefits of conditional routing translate from our simple if-then benchmark to more complicated conditional statements. In fact, the benefits of conditional routing are even more pronounced for these cases. If-statements with an else-clause, and perhaps many elseif-clauses, can appear in kernel inner loops. A similar construct, the case-statement, can also appear. Conditional routing handles these types of statements by allocating a separate intermediate stream and a new kernel to each clause of the if-statement, or each case of the case-statement. Figure 3.6 shows results for synthetic if else, a micro-benchmark very similar to synthetic if, except that it also contains an else-clause. As before, the value of p indicates how often the then-clause is taken. Also, Figure 3.7 shows results for synthetic case, which contains an eight-way case-statement. The value of p in this case indicates how often the first case is taken; the other seven are taken with equal probability.


Figure 3.7: This graph shows the results for various conditional techniques on the synthetic case benchmark. The execution times are normalized to IDEAL. The value of p in this case indicates how often the first case is taken; the other seven are taken an equal number of times. Wcase = 32 operations, Winput = Woutput = 17 operations, S = 2 words.

For both of these kernels, it is clear that conditional routing is superior for almost all values of p. This is despite the fact that, for synthetic case, there is still almost a 2× slowdown compared to IDEAL. This slowdown occurs because as the number of clauses increases, the amount of overhead for doing the conditional input and output routing increases. For example, since the architecture we used only supports a total of eight streams for both input and output, we had to insert extra kernels to do the input and output routing in two stages for synthetic case. Also notice that PREDICATION is never better than CONDITIONAL ROUTING for synthetic if else, and is never better than either CONDITIONAL ROUTING or COMMON for synthetic case. This is because the advantages of PREDICATION start to diminish as the number of possible control-flow paths increases, since this increases the number of unnecessary basic blocks that are executed for each stream element.

3.2.3 Technique Selection

So far we have only looked at a particular version of the synthetic if benchmark, and concluded that conditional routing offered advantages for this kernel. However, kernels come in all different sizes and shapes, and conditional routing may not be the best choice for every kernel. Thus we must understand how to choose which technique is best in each situation, a problem we will refer to as technique selection. In this section we will show that the important parameters that affect technique selection are basic block size and the amount of live state in the kernel. Furthermore, we will generalize our knowledge of how to optimize simple conditionals in order to optimize more complicated conditionals, such as nested if-then-else statements. As a side note, the results and discussion in this section are critical for implementing automatic translation of control-flow constructs in a compiler.

Block-Size

Figure 3.8 shows what happens when we vary the number of operations in the basic block BODY. The first thing to notice is that the execution times increase relative to IDEAL as Wbody increases for the following: COMMON (TRUE) when p = 0.1 and COMMON (FALSE) when p = 0.9. This is because the mispeculation penalty and fix-up code get increasingly expensive as Wbody increases. For a similar reason, the execution time of PREDICATION increases with Wbody in all three graphs (but at a faster rate for lower p). However, the curves for CONDITIONAL ROUTING remain fairly constant as Wbody increases for all values of p, again because CONDITIONAL ROUTING never speculatively executes BODY.

The above information is interesting because it highlights the magnitude of the advantage CONDITIONAL ROUTING provides at larger Wbody. However, we would like to look at the information in a slightly different format in order to learn how to pick the best-performing technique for a particular size of BODY. For the case of the synthetic if kernel, as we can see from Figure 3.8, either PREDICATION or CONDITIONAL ROUTING is generally the best of the four techniques shown.4 So, for the synthetic if kernel, given any number of operations in BODY, we can choose the optimal technique by selecting one of these two. Clearly, from what we have seen, this selection function should require p and Wbody as inputs. Figure 3.9 shows what this selection function looks like as a function of these two inputs. Essentially, PREDICATION performs better when p is high and Wbody is small. In fact, when Wbody is less than roughly eight operations, PREDICATION is always the better choice. Once the block size increases, however, CONDITIONAL ROUTING is the optimal technique unless p ≈ 1.

While the value of Wbody is known at the time of compilation, often the value of p is not known. The selection function in Figure 3.9 can still be used to guide the technique selection process even in this situation. Let us look at two different scenarios, with each one subjecting the selection decision to different optimization constraints. Assuming that the value of p is completely random, with a uniform probability over the range 0.0 to 1.0, here is how one would attack each scenario.

1. Real-time scenario: minimize the worst-case execution time. Here we want to choose the technique that minimizes the maximum possible execution time. For both techniques under consideration, the maximum occurs when p = 1.0. Furthermore, the maximum is always lower for PREDICATION, so this is the technique one would choose in a real-time scenario where the worst-case execution time is important.

4The slight ripple in the curve is due to noise that stems from the fact that the kernel scheduler does not always find the optimal schedule, but is generally anywhere within 0%–10% of the optimal.

[Figure 3.8 plots omitted: three panels, (a) p = 0.1, (b) p = 0.5, (c) p = 0.9, each showing average execution time per stream element (normalized to ideal) versus Wbody, the size of BODY in operations, for COMMON (FALSE), COMMON (TRUE), Predication, and Conditional Routing.]

Figure 3.8: These graphs show the impact of varying the number of operations in the basic block BODY on the execution time for the various if-statement techniques on the synthetic if kernel. Winput = Woutput = 17 operations, S = 2 words.

[Figure 3.9 plot omitted: a phase diagram with IF-block size in ALU operations (0–128) on the x-axis and p (0.0–1.0) on the y-axis; the Predication region lies at high p and small block sizes, with Conditional Routing optimal elsewhere.]

Figure 3.9: This graph illustrates the technique selection function for synthetic if, based on two input parameters: p and Wbody. The graph can be interpreted as a phase diagram, where each region corresponds to a technique. The extent of the region indicates the range of parameter values for which the corresponding technique is optimal. The slight ripple in the curve is not due to anything systemic, and the magnitude of the anomaly is within the noise. The noise stems from the fact that the kernel scheduler does not always find the optimal schedule, but is generally anywhere within 0%–10% of the optimal. Winput = Woutput = 17 operations; S = 2 words.


2. No external constraints. A reasonable thing to do in this situation would be to try to reduce the selection error over the range of p, where the selection error for a value of p is measured as the difference between the execution time of the technique selected and the execution time of the optimal technique for that value of p. Because the execution times of both PREDICATION and CONDITIONAL ROUTING are linear in p, it is easy to show that the correct approach is to make the selection assuming that p = 0.5. Thus, for Wbody < 16 operations PREDICATION is the better choice in this situation, while CONDITIONAL ROUTING makes more sense when Wbody > 16 operations.
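The discussion above amounts to a small decision procedure, sketched below in C for the "no external constraints" policy. The thresholds are illustrative values read off Figure 3.9 for this benchmark (S = 2 words), not constants a compiler should hard-code, and the convention of passing a negative p when the probability is unknown is our own.

    typedef enum { PREDICATION, CONDITIONAL_ROUTING } technique_t;

    /* Select a technique for a simple if-statement from its branch
     * probability p and body size w_body (in ALU operations).       */
    static technique_t select_technique(double p, int w_body) {
        if (p < 0.0)
            p = 0.5;                /* unknown p: assume 0.5 to
                                       minimize expected selection error */
        if (w_body < 8)
            return PREDICATION;     /* tiny bodies: predication always wins */
        if (p > 0.9)
            return PREDICATION;     /* branch is nearly always taken        */
        return CONDITIONAL_ROUTING; /* larger bodies, moderate p            */
    }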

We've only looked at the size of the BODY block so far, and not of INPUT and OUTPUT. It turns out that Winput and Woutput only affect the difference in performance between CONDITIONAL ROUTING and PREDICATION, and not the crossover point at which we would select one over the other. As Winput and Woutput decrease, we would expect the difference between PREDICATION and CONDITIONAL ROUTING to increase. This is because the ratio Wbody/(Winput + Woutput) determines how much the unnecessary operations impact the performance of PREDICATION. Therefore, as the ratio increases, so does the potential performance benefit that CONDITIONAL ROUTING offers at low values of p. However, theoretically, Winput and Woutput shouldn't affect the crossover point between PREDICATION and CONDITIONAL ROUTING at a given Wbody, except when the block sizes get smaller than a few operations.5 This is because the crossover point only depends on the number of unnecessary operations executed by PREDICATION, and hence only on Wbody.

Impact of SRF Communication

The other parameter that affects technique selection is the number of words of state that are live at the input and output of the BODY block. This is because the amount of live state determines the amount of extra SRF communication that is necessary with the CONDITIONAL ROUTING technique. It turns out that CONDITIONAL ROUTING is affected more adversely

5When the block sizes are too small, it is not very efficient to schedule them as separate kernels. This would cause PREDICATION to be more attractive.

[Figure 3.10 plot omitted: crossover curves between PREDICATION and CONDITIONAL ROUTING in the (IF-block size, p) plane for S = 2, 4, and 8 words.]

Figure 3.10: This graph shows how the crossover point between where CONDITIONAL ROUTING is better and where PREDICATION is better changes with the amount of live state in the kernel. (PREDICATION is better to the left of the line.) This happens because the performance of CONDITIONAL ROUTING deteriorates faster than the performance of PREDICATION as S increases. For reference, Figure 3.3 shows how the value of S affects the kernel structure. This graph uses the same values of kernel parameters as Figure 3.9. Wbody = 32 operations, Winput = Woutput = 17 operations.

by increasing S than either COMMON or PREDICATION. Figure 3.10 clearly shows the impact of this on technique selection, since PREDICATION becomes the better technique for a larger fraction of kernels compared to CONDITIONAL ROUTING as S increases. So, in order to complete our study of technique selection, we must delve deeper into the impact of S on performance. To this end, Figure 3.11 shows the results of increasing the amount of required SRF communication on the normalized execution time of the various techniques on the synthetic if kernel. Notice that we increased the number of operations in each basic block for this graph in order to vary S over a wide range of values. The normalized execution time of each method increases as the amount of live state in the benchmark increases (the ideal execution time is unaffected by the size of the live state in the kernel).

[Figure 3.11 plots: average execution time per stream element (normalized to ideal time, y-axis 0.0–2.5) versus S, live state in words (x-axis 0–48), for (a) p = 0.1, (b) p = 0.5, and (c) p = 0.9; curves for COMMON (FALSE), PREDICATION, and CONDITIONAL ROUTING.]

Figure 3.11: These graphs show the impact of increasing the amount of live state, S, in the synthetic if kernel on the execution time for the various if-statement techniques. Wbody = 128 operations, Winput = Woutput = 65 operations.

The execution time of COMMON (FALSE) increases slightly because of a property of the synthetic benchmark that causes the length of the dependency chain to grow slightly when S is increased. The normalized execution time of PREDICATION increases because of the extra SELECT operations that are added to the schedule. The normalized execution time of CONDITIONAL ROUTING, however, increases at a faster rate than that of the other methods with increasing S. This is because there is an additive effect from the increase in execution time of all three kernels.

• The schedule lengths for the input and body kernels increase because the scheduler we used inserts pass-through instructions to communicate the result of an ALU operation to the SRF output stream. These are necessary because the ALU operations that produce the different fields of the output record do not always get scheduled in the same order in which the data needs to be output to the stream (fields within a record are output to streams in a strict order that all kernels obey in order to exchange data with one another). Since there is no buffering specifically for the output streams, the record fields are stored in the ALU register files as they are produced. Eventually, each pass-through operation requires an ALU instruction slot in order to communicate the record field from its register file to the output stream buffer.

• The schedule length for the output kernel increases with increasing live state because it must add SELECT operations to implement the “popif” operator used in the compute 2 kernel (Kernel 3.2). These SELECT operations control whether the target variable is updated with the result of the “popif” operator (a sketch of this lowering follows this list).
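The following C sketch shows one plausible way a conditional stream pop can be lowered to an unconditional read plus a SELECT, one per word of live state. The function names and the software read pointer are illustrative assumptions for this sketch, not the Imagine hardware interface.

    /* select_op mirrors the cluster's SELECT operation. */
    static inline int select_op(int cond, int a, int b) {
        return cond ? a : b;
    }

    /* popif: update 'var' with the next stream word only when 'cond' holds;
     * otherwise keep the old value and leave the read position unchanged.
     * The speculative read assumes the buffer always holds a valid word. */
    int popif(int cond, int var, const int *stream, int *pos) {
        int next = stream[*pos];            /* unconditional (speculative) read  */
        *pos += cond ? 1 : 0;               /* advance only when the pop is taken */
        return select_op(cond, next, var);  /* one SELECT per live word           */
    }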

This cumulative effect of the increase in all three kernels causes the performance of CONDITIONAL ROUTING to deteriorate faster with larger live-state sizes than that of the other techniques. For this kernel, though, CONDITIONAL ROUTING is still the best technique for p = 0.5 up to a record size of 32 words. However, as block sizes get smaller, especially for the INPUT and OUTPUT basic blocks, the effect of the SRF communication will be even larger and CONDITIONAL ROUTING will be more adversely affected.

Complex Conditional Structures

Unlike the simple if-statements we have been considering up until now, more complex conditional statements need not be implemented entirely using only one technique. The observation we will make is that each clause of a complex if-statement or case-statement can be written as an independent simple if-statement of the type in the synthetic if kernel. We can then independently apply the best algorithm to each new simple if-statement. As we have concluded, there are three main factors in deciding what type of optimization to apply to a simple if-statement: p, Wbody, and S. Each clause of the more complex conditional statement may have a different characteristic triplet of values, and hence may require a different technique for best results.

We can even extend this idea to nested if-statements. Nested if-statements can be flattened so that each clause of every level of every if-statement is re-written as a simple top-level if-statement, whose entry condition is the conjunction of all the conditions of the if-statements it was originally embedded within, and whose execution frequency (p) is the product of the execution frequencies of all the if-statements it was embedded within. These new if-statements can then be optimized as discussed above (a sketch of this flattening appears below).

However, while reducing the more complex situations to a series of simple if-statements is a viable solution that can improve performance, it throws away some information about the structure of the control flow. More sophisticated algorithms might not consider each clause absolutely independently, and might capitalize on global information to optimize the method selection further. For example, if the same subset of live state is needed by two different clauses, it might make sense to group them together regardless of their individual values of p and Wbody. We leave it as an open compiler problem to develop the best set of heuristics for automatically compiling a set of complex if-statements with arbitrary nesting.
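As a hedged illustration of the flattening transformation on a hypothetical nested conditional (the conditions, bodies, and probabilities here are invented for the example):

    extern void A(void);
    extern void B(void);

    /* Original nested form:
     *     if (c1) { A(); if (c2) { B(); } }
     * Flattened form: each clause becomes a top-level if-statement whose
     * entry condition is the conjunction of its enclosing conditions
     * (valid assuming A() leaves c1 and c2 unchanged). */
    void flattened(int c1, int c2) {
        if (c1)       A();   /* executes with probability p1      */
        if (c1 && c2) B();   /* executes with probability p1 * p2 */
    }

Each flattened clause can now be assigned its own technique based on its own (p, Wbody, S) triplet.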

Technique Selection Summary

Speculative techniques (COMMON and PREDICATION) improve VLIW schedules by filling ALU slots with operations that are potentially unnecessary. Therefore, larger conditional block sizes affect these techniques detrimentally. CONDITIONAL ROUTING, on the other hand, uses data communication in order to organize data such that it can be processed in a more regular fashion. Hence, CONDITIONAL ROUTING is affected detrimentally by larger amounts of live state within a kernel, which corresponds to a larger amount of necessary SRF communication. A simple decision routine capturing these guidelines is sketched below.

We also discussed options for implementing more complicated conditional structures, and made the observation that different techniques may have to be used in conjunction for optimal results. We presented a simple technique, which flattens all nested conditionals and handles all resulting top-level conditional clauses independently. We have left the investigation of more sophisticated technique selection algorithms for future work.
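A minimal sketch of these guidelines as a decision routine, assuming the p = 0.5 crossover at Wbody ≈ 16 operations and the live-state break-even near 32 words observed for the synthetic kernel; the thresholds and names are illustrative for this sketch, not general constants:

    typedef enum { TECH_COMMON, TECH_PREDICATION, TECH_COND_ROUTING } technique;

    /* Choose a technique for one simple if-clause from its (p, Wbody, S) triplet. */
    technique select_technique(int p_is_predictable, double w_body, double s_live) {
        if (p_is_predictable) return TECH_COMMON;      /* accurate prediction: speculate the common path */
        if (w_body < 16.0)    return TECH_PREDICATION; /* small BODY: wasted speculative ops are cheap   */
        if (s_live > 32.0)    return TECH_PREDICATION; /* large live state: routing traffic dominates    */
        return TECH_COND_ROUTING;                      /* big, unpredictable blocks with small records   */
    }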

3.2.4 SRF Allocation

So far, our discussions have focused only on kernel-level issues. One application-level issue that is important is the increased demand on SRF capacity required by conditional routing, due to the intermediate streams it generates when splitting a kernel. The impact of this extra demand manifests itself in several ways. First, it may require a smaller overall batch size, which in turn can increase the overhead due to short-stream effects, as discussed earlier. Second, in more extreme cases, the stream scheduler may have to insert stream spills to off-chip DRAM and stream restores in order to make room for the intermediate streams. The impact of these spills and restores on performance depends on how memory-limited the overall application is. Third, these streams can place an extra burden on other processor resources, such as the stream descriptor register file, which can generate extra stream instructions to make sure the necessary information is cached at any given time. These extra instructions can burden the application processor with extra work as well. The performance impact of these effects will not be studied further in this thesis, and must be evaluated on a case-by-case basis.

There have been a couple of proposed ideas to reduce the capacity necessary for intermediate streams by alleviating fragmentation in the SRF. In the worst case, the intermediate stream for each path will require its own storage for up to the number of elements in the

input stream. Thus, in the worst case an if-then-else statement will require 2N words of intermediate storage if the input stream contains N words, which is a fragmentation overhead of 2×. One possible optimization would be to provide a method for storing more than one stream in a contiguous area of the SRF, if the maximum of the sum of the stream lengths is known. One way to implement this would be to store one stream starting at one end of the SRF buffer, and to store the other stream starting at the opposite end of the buffer, advancing its storage pointer in reverse order (a sketch follows). For more than two streams, one could interleave streams at the granularity of an SRF block, assuming the necessary hardware support was available for keeping track of the next block in any particular logical stream. For n streams that sum to a total of N words, this would require N + n(B/C − 1) words instead of nN words, where B is the block size of the SRF and C is the number of clusters. With a more flexible SRF access method, such as an in-lane indexable-SRF [Jayasena et al., 2004], we could forgo the extra hardware for keeping track of the next block and instead manage the necessary free list in software.
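A minimal C sketch of the two-streams-in-one-buffer idea, assuming the combined length never exceeds the region size; the type and function names are invented for illustration:

    typedef struct {
        int *buf;   /* contiguous SRF region of 'cap' words    */
        int  cap;   /* capacity in words                       */
        int  head;  /* next free slot for the TRUE-path stream */
        int  tail;  /* one past the FALSE-path stream's last slot; grows downward */
    } dual_stream;

    void ds_init(dual_stream *d, int *buf, int cap) {
        d->buf = buf; d->cap = cap; d->head = 0; d->tail = cap;
    }
    int  ds_full(const dual_stream *d)     { return d->head >= d->tail; }
    void push_true (dual_stream *d, int v) { d->buf[d->head++] = v; }   /* grows upward   */
    void push_false(dual_stream *d, int v) { d->buf[--d->tail] = v; }   /* grows downward */

The two streams can never collide as long as their combined length stays within cap, which is exactly the condition under which the text says this layout applies.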

3.2.5 Conditional Routing Summary for If-Statements

Existing methods for achieving close to the peak efficiency of a VLIW cluster for kernels with if-statements rely mainly on speculatively executing operations. These techniques work well when the number of operations that need to be speculatively executed is small, or when the predictions of the control-flow path are accurate. Conditional routing fills this gap by enabling a stream-processing VLIW cluster to reach close to the peak ILP even when these conditions do not hold. Conditional routing accomplishes this by displaying good levels of efficiency over a wide range of values of p and Wbody. However, we did see that the performance of conditional routing is sensitive to the number of words of live state that are transferred into and out of the conditional statements.

Furthermore, blindly applying conditional routing is not always the best choice. We studied how technique selection is affected by different parameters. Large basic-block sizes deteriorate the performance of speculative techniques. Large amounts of SRF communication for intermediate streams deteriorate conditional routing performance. Hence, kernels that stand to gain the most from CONDITIONAL ROUTING are those with conditional blocks that require a lot of processing, and at the same time have relatively small record sizes.

Table 3.1 Summary of parameter sensitivities for if-statement conditional techniques. The first three rows apply to simple if-statement clauses, while the last applies to more complex statements, such as if-then-else and case-statements.

              CONDITIONAL ROUTING       PREDICATION               COMMON
p             Good all round, even      Good when high            Good when predictable
              when unpredictable
Wbody         Better when large         Good when small           Not affected very much,
                                                                  if p is predicted correctly
S             Bad when large            Slightly worse when       Slightly worse when
                                        large                     large
Number of     Small degradation when    Large degradation when    Large degradation when
elsif-clauses high, due to extra        high, due to increased    high, due to potentially
(or cases)    routing                   number of unnecessary     greater unpredictability
                                        control-flow paths

3.3 Applying Conditional Routing to While-Loops

This section will demonstrate how conditional routing can be applied to while-loops, in addition to the if-statements we discussed above. There are two methods of applying conditional routing to while-loops: expanded conditional routing and flattened conditional routing. These techniques solve the main problem, which is increasing ILP even when the while-loop contains a long loop-carried dependency. In addition, conditional routing works well even when each while-loop executes for only a small number of iterations, or when the number of iterations is highly variable from one stream element to the next.

A simple example of a kernel with a while-loop is shown in Figure 3.12. This type of kernel might be necessary, for instance, if a numerical algorithm required iterating until a certain accuracy or precision was achieved, and the number of iterations required to do so was potentially different for each stream element. In this example, the while-loop executes at least once per stream element, but as we will show later, this is not a necessary condition. Also, just as we did for the original example kernel for if-statements, we will assume that there are no loop-carried dependencies from COMPUTE_ITERATIVE( ) to COMPUTE_1( ) or from COMPUTE_2( ) to either COMPUTE_1( ) or COMPUTE_ITERATIVE( ), although often there is a loop-carried dependency within the processing of a stream element from one iteration to the next of the inner while-loop.

The first problem with scheduling kernels like the one in Figure 3.12 is that often the inner while-loop schedule is limited by a long critical path that updates the loop-carried state in the while-loop, preventing effective software-pipelining or loop-unrolling. Furthermore, since a while-loop may execute many times per stream element, a higher fraction of the execution time can potentially be devoted to the while-loop basic block, making its poor schedule impact performance even more than it did for an if-statement. Second, even if the while-loop software-pipelines well, if it only executes a small number of iterations per stream element, the overhead of priming the software pipeline will impact overall performance.

[Figure 3.12 diagram: control flow of the iterative kernel. COMPUTE_1( ) feeds a loop that repeats done = COMPUTE_ITERATIVE( ) while done == FALSE; when done == TRUE, control passes to COMPUTE_2( ).]

Figure 3.12: An example of a kernel, iterative, that contains a while-loop.

3.3.1 Expanded Conditional Routing

As we did with if-statements, conditional routing can be used to reorder the work in a kernel containing a while-loop. The goal is to schedule the while-loop basic block as a separate kernel which can be scheduled efficiently. However, unlike the case with an if-statement, the schedule for a new kernel that is formed by simply transplanting the while-loop will still be limited by the loop-carried dependency in the while-loop. In order to overcome this problem, we can think of expanding the while-loop into a series of if-statements and then applying conditional routing to each of these if-statements. Figure 3.13 illustrates this concept. If we converted the control-flow diagram shown in the figure using conditional routing, each of the basic blocks (i.e., each box in the figure) would become a separate kernel call, and each kernel would finish processing all stream elements before the execution of the next kernel began. This is shown explicitly in Figure 3.14, which lists the application-level pseudo-code for implementing expanded conditional routing on this example. At the outset, the compute 1 kernel processes all the input stream elements, generating a stream of intermediate results. Then, the compute iterative kernel executes the first iteration of the

[Figure 3.13 diagram: the while-loop of the iterative kernel expanded into a series of if-statements. COMPUTE_1( ) is followed by a chain of COMPUTE_ITERATIVE( ) blocks, each entered only while done == FALSE; every done == TRUE exit merges into COMPUTE_2( ).]

Figure 3.13: Expanding the control-flow diagram for a kernel with a while-loop using a series of if-statements.

/* Original: iterative(in_data, out_data); */
compute_1(in_data, not_done);
done_data.clear();
while not_done.length() > 0 do
    compute_iterative(not_done, tmp_not_done, new_done_data);
    done_data.append(new_done_data);
    not_done = tmp_not_done;
od
compute_2(done_data, out_data);

Figure 3.14: Application-level pseudo-code for expanded conditional routing.

while-loop on this intermediate stream, generating two new streams: one with elements that needed to execute only one iteration of the while-loop (the new done data stream), and another with the remaining stream elements. The same compute iterative kernel is invoked with the latter stream as input, this time executing the second iteration of the while-loop. It generates two streams itself: the first is appended to done data, and the other is left as input for the next kernel invocation. This process is continued until the stream element(s) that require the maximum number of iterations have been processed, and all the computed outputs are in the done data stream.

There are two performance advantages of this new set of kernels over the original. First, since the compute iterative kernel only executes one iteration of the while-loop on each stream element, the loop-carried dependencies in the original while-loop will not prevent the kernel from being software-pipelined. And second, for cases where most stream elements require only a small number of iterations, the overhead to prime the software pipelines will not be incurred for each element.

On the other hand, there are potential drawbacks as well. For example, one detail we glossed over in the above explanation is that all the loop-carried state in the while-loop must be output for each stream element, after each iteration. As we saw in our discussion of if-statements, if this is a large amount of state, it could increase the SRF communication to a level where it significantly decreases the performance advantages of conditional routing. The extra overhead from this state could also reduce the strip size in the SRF, increasing short-stream effects in the overall application. Another drawback depends on the

distribution of the number of loop iterations required by the stream elements. Consider the case where only a small minority of stream elements requires a large number of iterations, while most stream elements require a much smaller number. Then short-stream effects arise from calling the compute iterative kernel many times on only a small number of elements. Finally, notice that the ordering of the final out data stream is not necessarily the same as the ordering of the output stream from the original kernel. This could be handled by performing a sort on the out data stream, but doing so will eat away at the performance benefits offered by the method.

In summary, the expanded conditional routing technique is an extreme point in terms of SRF capacity and bandwidth requirements: loop state is written to the SRF on every iteration for every stream element. Furthermore, expanded conditional routing places demands at the application level, which may degrade overall performance. For example, this technique can significantly increase the number of stream-level instructions. Also, a while-loop is now introduced into the application-level code, making scheduling of stream-level instructions more challenging. Note, however, that the results presented in this chapter were obtained using kernel performance only, and hence do not include these application-level effects. In any case, the next section will discuss a different algorithm, flattened conditional routing, that makes more moderate demands on the SRF and application processor.

3.3.2 Flattened Conditional Routing

To recap, the two main problems with while-loops are that loop-carried dependencies within the loop limit its schedule quality, and that small numbers of iterations per stream element increase the overhead due to software-pipelining the loop even if the loop-carried dependency is insignificant. Another way to address the problem, while avoiding the extreme SRF requirements of expanded conditional routing, is to use a combination of conditional routing and loop-flattening [von Hanxleden and Kennedy, 1992].

We begin by splitting the original kernel into three separate ones. As usual, the intermediate results produced by the compute 1 kernel will be stored to the SRF, and the intermediate results after executing all the while-loops will be stored to the SRF as well. However, unlike expanded conditional routing, we will not buffer intermediate loop state

between while-loop iterations in the SRF—instead we will process the entire while-loop for each stream element until completion before starting to process the next element. The structure of the main loop in the new compute iterative kernel will look like the original kernel, except that the code before and after the while-loop will only contain stream input and output operations (see Figure 3.15a). Applying loop-flattening to this kernel hoists the code in the while-loop, i.e., the code in COMPUTE_ITERATIVE( ), into the main loop. The result of loop-flattening is shown in Figure 3.15b. Essentially, the code in COMPUTE_ITERATIVE( ) is run every iteration of the main loop. However, the code that reads in new data and writes out the results is embedded in if-statements and only executes if the current stream element has completed the requisite number of while-loop iterations. Assuming done is initialized correctly in the preamble to the kernel loop (the preamble is not shown in the figure), and that the loop condition is modified to wait for the last stream element and the last while-loop iteration for that element, this new singly-nested kernel loop is functionally exactly the same as the original doubly-nested loop.

Of course, we eliminated one control-flow construct, namely the while-loop, and introduced two others, namely the if-statements. Fortunately, the only code in the if-statements is stream accesses, so we can easily replace the entire if-statements with conditional routing primitives. This results in a kernel whose main loop contains only one basic block, as illustrated in Kernel 3.6. Since the first iteration of a new stream element follows just as seamlessly as another iteration for the same stream element, there is no need to pay a penalty every time we start processing a new stream element. Thus, even if stream elements only execute a small number of iterations, the performance of flattened conditional routing will not significantly degrade. Finally, notice that, unlike expanded conditional routing, the output of the final kernel, out data, will be in the same order as in the original kernel.

Now, we can try to software-pipeline or loop-unroll the kernel normally. Unfortunately, in general these optimizations will not always work well on the new kernel, because any loop-carried dependencies between iterations that might limit the ILP are still problematic. Not only are the loop-carried dependencies from the original while-loop still present, but we have introduced a new one in the form of the done variable. For this reason, we will introduce state-unrolling to increase the achieved ILP in spite of any loop-carried dependencies.

[Figure 3.15 diagrams: control flow of the compute_iterative kernel. In (a), READ_INPUT( ) precedes a while-loop around done = COMPUTE_ITERATIVE( ), followed by WRITE_OUTPUT( ). In (b), after flattening, a single main loop runs CHECK_DONE( ); READ_INPUT( ) and WRITE_OUTPUT( ) are guarded by done == TRUE, and done = COMPUTE_ITERATIVE( ) executes every iteration.]

Figure 3.15: Control-flow diagram for the compute iterative kernel (a) before and (b) after applying loop-flattening.

Kernel 3.6 Pseudo-code for the compute iterative kernel after applying loop-flattening. In general, the data structure that is input from the SRF will not be the same as the data structure that is written out to the SRF. They are the same here for simplicity.

done ← true;
loop while ¬in_stream.empty() ∨ ¬done begin
    State ←popif– in_stream(done);
    State, done ← COMPUTE_ITERATIVE(State);
    out_stream(done) ←pushif– State;
end
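For readers who prefer an imperative rendering, here is a hedged C sketch of the same flattened main loop; the State type and the pop_input, compute_iterative, push_output, and in_empty primitives are illustrative stand-ins for the conditional-stream operations, not the Imagine API.

    typedef struct { int field; /* live loop-carried state; one word for brevity */ } State;

    extern int   in_empty(void);
    extern State pop_input(void);
    extern int   compute_iterative(State *s);  /* returns 1 when the element finishes */
    extern void  push_output(State s);

    void flattened_main_loop(void) {
        int done = 1;                       /* no element in flight yet */
        State s;
        while (!in_empty() || !done) {
            if (done)                       /* popif: fetch the next element only  */
                s = pop_input();            /* when the previous one has finished  */
            done = compute_iterative(&s);   /* run exactly one while-loop iteration */
            if (done)                       /* pushif: emit the element once its    */
                push_output(s);             /* final iteration completes            */
        }
    }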

3.3.3 State-Unrolling

While software-pipelining and loop-unrolling try to execute multiple consecutive iterations of the while-loop for the same stream element in parallel, applying the state-unrolling transformation to a flattened loop causes multiple loop iterations for different stream elements to be scheduled together. A good way to think about this is that there are two virtual processors, each with its own private loop state. When one of the virtual processors, say VP1, finishes the last iteration for its stream element, it sets its private done variable. Then, only the results for VP1 will be written to the SRF and new inputs read in. The situation is illustrated in Figure 3.16. In contrast to expanded conditional routing, which had to keep alive the intermediate loop state for every stream element, state-unrolling only needs to keep track of the live state for a small fixed number of elements. So, while expanded conditional routing needed to buffer the loop state in the SRF, state-unrolling can keep the live state at any point in time in the LRFs.

In order to implement this scenario on a single cluster, we duplicate the loop code and loop state for each virtual processor. Since we are assuming there are no dependencies from one basic block to a previous basic block in the original kernel, each new copy of the loop code will be independent and can be scheduled together with the others to increase the available instruction parallelism. The final version of the code for the compute iterative kernel after applying loop-flattening and state-unrolling is shown in Kernel 3.7. Notice that some extra logic is necessary for handling the slightly more complex control. In particular, the

[Figure 3.16 diagram: four snapshots of an input stream (… h g f e d c b a) feeding two virtual processors. VP0 works through iterations 1–3 for element ‘a’; meanwhile VP1 finishes element ‘b’ after one iteration, then runs two iterations for element ‘c’. In the final snapshot, VP0 starts element ‘d’ while VP1 starts element ‘e’; each VP pops a new element as soon as its current one completes.]

Figure 3.16: Two virtual processors executing the same while-loop, but on different stream elements. This scenario is implemented on a single cluster using loop-flattening and state-unrolling, where the latter duplicates the loop code and loop state for each virtual processor.

Kernel 3.7 Pseudo-code for the compute iterative kernel after applying loop-flattening and state-unrolling (for a total of two sets of loop state).

done_1 ← true; done_2 ← true;
loop while ¬in_stream.empty() ∨ ¬done_1 ∨ ¬done_2 begin
    /* Virtual Processor 1 */
    waiting_for_VP2 ← done_1 ∧ in_stream.empty();
    State_1 ←popif– in_stream(done_1 ∧ ¬waiting_for_VP2);
    State_1, done_1 ← COMPUTE_ITERATIVE(State_1);
    done_1 ← select(waiting_for_VP2, true, done_1);
    out_stream(done_1 ∧ ¬waiting_for_VP2) ←pushif– State_1;

    /* Virtual Processor 2 */
    waiting_for_VP1 ← done_2 ∧ in_stream.empty();
    State_2 ←popif– in_stream(done_2 ∧ ¬waiting_for_VP1);
    State_2, done_2 ← COMPUTE_ITERATIVE(State_2);
    done_2 ← select(waiting_for_VP1, true, done_2);
    out_stream(done_2 ∧ ¬waiting_for_VP1) ←pushif– State_2;
end

logic for the waiting_for_VPn variables is new. Notice that we could have implemented some of this new control logic using if-statements. For example, the subset of operations within COMPUTE_ITERATIVE( ) that calculates done_1 does not need to be executed if waiting_for_VP2 is true. However, since this conditional only arises rarely, namely for the last few iterations of the kernel, and since we cannot apply conditional routing anyway because of dependencies within the main loop, we chose to implement the logic using predication.

Notice that using flattened conditional routing with state-unrolling will produce a final output that is ordered differently than the output of the original kernel. We will discuss how to handle this in a later section.

Kernel 3.8 Pseudo-code for the synthetic while benchmark. The kernel contains a while-loop whose body will execute zero or more times for each stream element. Also, the kernel will generate an output every iteration of the while-loop to a second output stream.

loop over in_stream begin
    In_state ←pop– in_stream;
    State, done ← INPUT(In_state);        /* INPUT: 17 arithmetic operations */
    while ¬done do
        State, done ← BODY(State);
        out_stream2 ←push– State;         /* BODY: 68 arithmetic operations */
    od
    Out_state ← OUTPUT(State);
    out_stream ←push– Out_state;          /* OUTPUT: 17 arithmetic operations */
end

3.3.4 Performance

As we did with our optimizations for if-statements, we will use a synthetic benchmark to evaluate the performance of the techniques we have introduced. Kernel 3.8 lists the pseudo-code for the synthetic while kernel. It is a little more general in its structure as compared to the example kernel, iterative, that we have been studying so far. In particular, the while-loop in synthetic while may be executed zero times for a stream element, which will introduce extra control logic into the kernel. Additionally, there is a stream output generated every iteration of the while-loop as well as at the completion of the while-loop. Finally, we have made the kernel configurable so that we can change the length of the critical path in the basic block BODY without changing the number of operations in that basic block.

The speedups of the conditional routing techniques on the synthetic while benchmark are shown in Figure 3.17. The speedups are compared to the first existing technique we will consider: the SWP technique. To apply this technique, we first unrolled the main loop four times in order to efficiently schedule the operations in INPUT and OUTPUT. Then we software-pipelined each of the four inner while-loops. For all graphs, the IDEAL results

show what the speedup would be if there were no software-pipelining priming overheads, and if the operations could be perfectly scheduled (i.e., ignoring any loop-carried dependencies).

Impact of the Length of Loop-Carried Dependencies

The difference between the top and bottom sets of graphs is the length of the longest loop-carried dependency in the while-loop. The version of the loop that produced the top set of graphs has a relatively short loop-carried dependency chain, enabling software-pipelining to achieve near-optimal schedules. The other version of the kernel, however, has a relatively long loop-carried dependency that renders software-pipelining and loop-unrolling optimizations fairly ineffective. As a comparison, simply software-pipelining the loop with the short dependency chain produced a schedule that was 15 cycles long, whereas the loop with the longer loop-carried dependency was scheduled in 33 cycles despite having the exact same number of arithmetic operations. Thus, for each method, the increase in speedup going from a graph on top to the graph directly below it is due to the ability of that method to increase the achieved ILP in spite of a long dependency. Expanded conditional routing (EXCR) and flattened conditional routing with state-unrolling (FLCRSU) are particularly good at this. Without state-unrolling, however, simple flattened conditional routing (FLCR) does nothing special to circumvent the long dependency—in fact, it adds a small number of cycles to the dependency chain in order to calculate the loop condition, done. The benefit of using state-unrolling in addition to flattened conditional routing, therefore, can be clearly seen in the increased speedup of FLCRSU over FLCR in the bottom set of graphs.

Impact of the Number of While-Loop Iterations Executed

The difference between the left set of graphs and the right set of graphs is the number of iterations of the while-loop each stream element executes. For SWP, the primary overhead is the cost of priming the software pipeline for each stream element. This effect, however, becomes insignificant as the number of iterations increases, as is demonstrated by the reduction in speedup of IDEAL from roughly 3.5× for 2 iterations per stream element, to roughly

[Figure 3.17 plots: speedup over SWP (y-axis, 1.0–3.5) for EXCR, FLCR, FLCRSU, and IDEAL in four panels: (a) 2 iterations, short loop-carried dependency; (b) 20 iterations, short loop-carried dependency; (c) 2 iterations, long loop-carried dependency; (d) 20 iterations, long loop-carried dependency.]

Figure 3.17: These four graphs show the speedup of the conditional routing techniques on the synthetic while benchmark compared to the SWP technique, which simply software-pipelines the while-loop. The speedups are shown for the cases where there are an average of 2 (a,c) and 20 (b,d) iterations of the while-loop per stream element (none of the elements required zero iterations). The graphs also show results for two different while-loops. In the top two graphs (a,b), the software-pipelined schedule for the while-loop is resource limited (i.e., it has enough ILP to keep all the ALUs busy) because the loop has only relatively short loop-carried dependency chains. In the bottom two graphs (c,d), the software-pipelined schedule is limited by a long loop-carried dependency chain. EXCR = Expanded Conditional Routing; FLCR = Flattened Conditional Routing; FLCRSU = Flattened Conditional Routing with State Unrolling (4 virtual processors).

2.5× for 20 iterations per stream element. Notice that the speedups of all three conditional routing methods increase when going from a graph on the right to the corresponding graph on the left, indicating that they are all effective at reducing the impact of this overhead.

Impact of the Distribution of Loop-Iterations Over Stream Elements

Another important factor that affects the performance of these while-loop methods is the distribution of loop iterations over the stream elements. For example, Figure 3.18 shows four different cases, all with the same average number of iterations per stream element. The previous graphs, shown in Figure 3.17, used an input set with no variance, i.e., an input set whose distribution was similar to Figure 3.18(a). The graph in Figure 3.19, however, shows the impact of changing the distribution of iterations on the performance of the techniques we have been discussing (using the same version of the kernel as in Figure 3.17(d)). For reference, we used a batch size of 128 for the results shown in the graph. The speedups, as before, are compared to the SWP technique. Note that the performance of SWP, FLCR, and IDEAL will be the same for each dataset, since they are unaffected by variations in the number of iterations per stream element.

The first set of speedups is shown for the other existing while-loop technique we mentioned in Section 3.1.2. We will refer to the technique as simple state-unrolling (SSU), in order to highlight the similarity to our state-unrolling technique. SSU applies state-unrolling without flattening the loop first. This technique simply makes each basic block, even the while-loop basic block, operate on four stream elements at once, and achieves good ILP because of the parallelism between the four elements being processed concurrently. Extra control code is added to the inner while-loop to make sure the loop runs as many times as is necessary to complete the maximum number of iterations required by the four elements the loop is currently processing (a sketch follows). However, unlike FLCRSU, the loop must execute until all virtual processors have finished processing their elements. Thus, variance in the number of iterations per stream element can cause performance to degrade sharply. Comparing SSU to FLCRSU is interesting because it shows the overhead of loop-flattening when there is no variance in the input data set; when the variance increases, the comparison shows the advantage of using loop-flattening to avoid synchronizing at the end of every set of elements the virtual processors operate on.
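A hedged C sketch of SSU's group-synchronized loop, with four elements processed per group; the State type and stream primitives are the same illustrative stand-ins used earlier, the batch length is assumed to be a multiple of four for brevity, and on a real cluster the guards would be predicated rather than branched:

    typedef struct { int field; /* live loop-carried state */ } State;

    extern int   in_empty(void);
    extern State pop_input(void);
    extern int   compute_iterative(State *s);  /* returns 1 when the element finishes */
    extern void  push_output(State s);

    void ssu_main_loop(void) {
        State s[4];
        int done[4];
        /* The group's while-loop runs for the MAXIMUM iteration count
         * among its four elements. */
        while (!in_empty()) {
            for (int v = 0; v < 4; ++v) { s[v] = pop_input(); done[v] = 0; }
            while (!(done[0] && done[1] && done[2] && done[3]))
                for (int v = 0; v < 4; ++v)
                    if (!done[v])              /* idles once this element finishes */
                        done[v] = compute_iterative(&s[v]);
            for (int v = 0; v < 4; ++v)
                push_output(s[v]);             /* group synchronization point */
        }
    }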

[Figure 3.18 plots: probability density (y-axis) versus # loop iterations (x-axis) for four input sets: (a) Zero-Variance PDF, standard deviation = 0.0 iterations; (b) Uniform PDF, standard deviation = 5.5 iterations; (c) Spread PDF, standard deviation = 11.1 iterations; (d) Bimodal PDF, standard deviation = 38.1 iterations.]

Figure 3.18: These probability density functions correspond to the four sets of inputs used to study the impact of varying the number of iterations of the while-loop that each stream element executes. In each case, the average number of iterations per stream element is 20 iterations; the variability, however, is different for each case.

[Figure 3.19 plot: speedup over SWP (y-axis, 0.5–3.0) for SSU, EXCR, FLCR, FLCRSU, and IDEAL, with one bar per input distribution: Zero-Variance, Uniform, Spread, and Bimodal.]

Figure 3.19: This graph shows the performance of various implementations of synthetic while for four different distributions of iterations over the input stream elements (as shown in Figure 3.18). As before, the speedups are shown compared to SWP. Batch sizes of 128 stream elements were used as input to the various versions of the kernels. EXCR = Expanded Conditional Routing; FLCR = Flattened Conditional Routing; FLCRSU = Flattened Conditional Routing with State Unrolling (4 virtual processors); SSU = Simple State-Unrolling (i.e., applying state-unrolling without flattening the loop first).

The EXCR method shows a significant degradation in performance for the “Bimodal” case. This is because it is affected by an overhead that increases with the maximum number of iterations required by any stream element in the input dataset. Since the kernel has to be called once for each iteration of the while-loop, the overhead of priming the software pipeline for the main loop of the kernel can be large if only a small number of elements require a large number of iterations. The FLCRSU technique, on the other hand, suffers from fewer overheads with increasing variance in the distribution. The only overhead it incurs is due to the fact that all virtual processors may not finish processing their last elements at exactly the same time. Note that the overheads for EXCR and FLCRSU that are due to variation in the input set will increase with decreasing batch size, and vice versa.

Performance Summary for Applying Conditional Routing to While-Loops

EXCR on average comes within 87% of the not-to-be-exceeded speedup of IDEAL, and FLCRSU comes within 76%, for the four cases shown in Figure 3.17. When we consider varying distributions of loop iterations over different stream elements, we see that

FLCRSU is more impervious to the variance. The performance of SSU, on the other hand, is the most susceptible, and the performance of EXCR is also quite susceptible to variance. For this benchmark, FLCRSU seems to have the best all-round performance.

For the experiments in this section, four virtual processors were used to perform the state-unrolling. We empirically determined that four was the least we could use while still providing enough ILP to pack the ALU slots efficiently. In general, when selecting the proper number of unrollings to use, there is a balance between better packing of the VLIW schedule and increased susceptibility to variance in the dataset. If specific information on the run-time distribution of loop iterations is known at compile time, it may be profitable to limit the amount of state-unrolling in favor of more consistent performance across a variety of dataset distributions. Furthermore, the amount of state-unrolling can affect the final code size of the kernels, and for that reason we may want to limit the amount of state-unrolling as well. For synthetic while, the code sizes for the different techniques were: SWP, 184 VLIW instructions; SSU, 118 VLIW instructions; EXCR, 50 VLIW instructions; FLCR, 76 VLIW instructions; FLCRSU, 108 VLIW instructions. The methods with state-unrolling (SSU and FLCRSU) had a high number of instructions. SWP also had a high number of instructions because the INPUT and OUTPUT basic blocks, which come before and after the while-loop, needed to be unrolled. Only EXCR did not require any unrolling, and as a result it was compiled to less than half the number of VLIW instructions of any of the other methods.

3.3.5 Ordering

As we noted earlier in the chapter, both the EXCR and FLCRSU techniques reorder the final outputs of the kernel. Therefore, for applications that require that the elements in the stream be kept in a strict order between kernels, we must consider the added overhead of reordering elements in addition to the raw kernel performance. Thus, EXCR and FLCRSU are only useful when either the order in which the output stream is produced is not dictated by application requirements, or when the cost of restoring the order of the output stream elements is small enough that it does not negate the performance benefits of the optimizations. As an example, the graphics pipeline application that the geom rast

kernel is part of does not require that the output elements generated by geom rast be kept in any strict order. On the other hand, compression and decompression algorithms require strict ordering of elements at certain points in the application pipeline—hence, if a kernel used these techniques, we might have to reorder the output elements at some point after the kernel. This section will explore the different options for reordering the output elements such that the stream is in the same order as if we ran the kernel with no optimizations. We will account for sorting the elements in both out stream and out stream2 from the synthetic while kernel. We will first analyze an implementation that executes a full sort on the resulting output streams. We will show that in most cases this reduces the overall performance of the conditional routing techniques considerably. Therefore, in order to combat this performance degradation, we will consider some alternate methods for restoring order.

Preliminaries

Before we delve into the performance of the different sorting techniques, we need to handle a couple of preliminaries. Firstly, there are two output streams in the synthetic while kernel, each one producing a different number of elements. One of the streams, out stream, will contain a single output per input stream element. The second stream, out stream2, will contain an output per iteration of the while-loop, and hence multiple outputs per input-stream element. We shall call the former a per-element output stream, and the latter a per-iteration output stream. Secondly, in order to sort the elements, one field of each data record must be able to act as a key in the sort; if not, then an extra field has to be added by the kernel with the while-loop. One simple way to do that is to keep a counter that is incremented and inserted into each output record. Since we may not know a priori the total number of per-iteration outputs a stream element will generate, we may need to sort per-iteration output streams using two keys: the position of the original stream element within the input stream, and the position of this per-iteration output relative to all the outputs generated by that stream element (a sketch of such a record follows).
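A minimal sketch of a per-iteration output record carrying the two-part key the text describes; the field and function names are invented for illustration:

    typedef struct {
        int elem_idx;   /* position of the source element in the input stream      */
        int iter_idx;   /* position of this output among that element's outputs    */
        int payload;    /* the actual per-iteration result (one word, for brevity) */
    } iter_out;

    /* Comparator that restores the original order: by element, then iteration. */
    int out_cmp(const iter_out *a, const iter_out *b) {
        if (a->elem_idx != b->elem_idx)
            return a->elem_idx - b->elem_idx;
        return a->iter_idx - b->iter_idx;
    }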

Full sort technique

The simplest way to maintain ordering is to run the output stream through another kernel, or series of kernels, that performs a full sort on the output data. Two factors determine how much impact the cost of the sort will have on performance: 1) the number of while-loop iterations per stream element; and 2) whether we have to sort the per-element output stream and/or the per-iteration output stream. The effect of these two factors on the overall performance of synthetic while is shown in Figure 3.20. We used a batch size of 128 to generate the graph. In order to sort the output streams, we used a merge-sort algorithm. We should note that the cost of the sort varies as O(N log N), and hence will get relatively more expensive as the batch size increases. When each stream element executes an average of only two iterations of the while-loop, the cost of sorting out stream reduces the speedup of the conditional routing techniques to roughly 1.4×–1.5× (Figure 3.20(a)). However, with 20 iterations per stream element (Figure 3.20(b)), the cost of the sort is amortized over a larger number of while-loop iterations, and the speedups of EXCR and FLCRSU increase to close to the speedups achieved without the sort. If we have to sort out stream2 as well, EXCR and FLCRSU are slower than even SWP, regardless of the number of iterations per stream element (Figure 3.20(c,d)). FLCR, however, is unaffected by the ordering constraint, since it does not reorder the contents of either output stream.

Split-merge technique

It turns out that if we employ state-unrolling, we can avoid the prohibitive cost of a full sort by taking advantage of our knowledge of the specific pattern of reordering that takes place. A possible technique for restoring order without a full sort is to provide separate output streams for each virtual processor, and then merge these streams after the kernel ends. This technique, which we shall call split-merge, works well because each virtual processor produces an output stream that is already in order. Thus, recreating the original ordering requires only lg Pv merge passes through the streams, where Pv is the number of virtual processors that were used. This works for both the per-element

[Figure 3.20 plots: speedup over SWP (y-axis, 0.0–3.5) for EXCR, FLCR, FLCRSU, and IDEAL in four panels: (a) 2 iterations, per-element stream sorted only; (b) 20 iterations, per-element stream sorted only; (c) 2 iterations, per-element and per-iteration streams sorted; (d) 20 iterations, per-element and per-iteration streams sorted.]

Figure 3.20: These graphs show the impact of having to perform a brute-force sort on the output stream(s) of the synthetic while kernel in order to maintain ordering. We used a batch size of 128 elements, and a merge-sort algorithm to implement the sort. EXCR = Expanded Conditional Routing; FLCR = Flattened Conditional Routing; FLCRSU = Flattened Conditional Routing with State Unrolling (4 virtual processors).

and per-iteration output streams. The performance of the split-merge technique for state-unrolled kernels is shown in Figure 3.21 for FLCRSU. The graphs show that the split-merge technique improves performance over the full sort, but still leaves a lot of room for improvement compared to IDEAL. Additionally, it still causes FLCRSU to have a slowdown compared to the SWP technique for the 20-iteration case with per-iteration stream sorting. Remember, our kernel was state-unrolled four times: fewer unrollings would require less overhead for the split-merge technique, and more unrollings would require more (one merge pass is sketched below).
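A hedged sketch of a single split-merge pass: two per-VP output streams, each already ordered by the element key, are merged into one ordered stream. With Pv per-VP streams, lg Pv such pairwise passes restore the full order. Here each element is reduced to its integer sort key for brevity:

    /* Merge two key-sorted streams a[0..na) and b[0..nb) into out;
     * returns the number of elements written (na + nb). */
    int merge_pass(const int *a, int na, const int *b, int nb, int *out) {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++];   /* drain the remainder of a */
        while (j < nb) out[k++] = b[j++];   /* drain the remainder of b */
        return k;
    }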

Indexable-SRF (ISRF) technique

By providing more flexible addressing in the SRF, we can improve the performance of FLCRSU even further and, unlike split-merge, also improve the performance of EXCR. Instead of the strictly sequential stream access provided by our baseline SRF, an in-lane indexable-SRF [Jayasena et al., 2004] allows random access into streams, at the cost of potentially lower access bandwidth. In other words, with an indexable-SRF we can access streams using an arbitrary pointer into the stream, instead of being limited to the push or pop accessors. Implementing ordering for a per-element stream is simple with an indexable-SRF. Since we know exactly the position of every output within a per-element output stream, we can simply write each output element to its correct location once we have finished processing it. This works for both EXCR and FLCRSU.

The indexable-SRF can be used in the same way to store the elements of a per-iteration output stream directly to their sorted locations as well. However, this only works if the number of iterations each stream element requires is known at the start of the loop, which may not be the case for many interesting kernels. We can still deal with this harder case, but it will take a couple of extra passes and keeping track of some extra state. In particular, instead of storing the outputs directly to their correct locations, a stream is generated that records the number of per-iteration elements produced by each input stream element once it completes the while-loop. We will refer to this stream as count stream. We use the indexable capability so that count stream is in order at the end of the kernel, just as we would do with the per-element outputs. After completing the original kernel and generating count stream, a running sum is calculated by a second kernel and stored to sum stream. This new stream gives us the location in the final per-iteration output

[Figure 3.21 plots: speedup (y-axis, 0.0–3.5) for EXCR, FLCRSU, and IDEAL under the sorting variants NONE, FULL, SM, and ISRF, in four panels: (a) 2 iterations, per-element stream sorted only; (b) 20 iterations, per-element stream sorted only; (c) 2 iterations, per-element and per-iteration streams sorted; (d) 20 iterations, per-element and per-iteration streams sorted.]

Figure 3.21: These graphs show the results of having to maintain ordering for synthetic while, but using more efficient techniques than a brute-force sort. EXCR = Expanded Conditional Routing; FLCR = Flattened Conditional Routing; FLCRSU = Flattened Conditional Routing with State Unrolling (4 virtual processors); NONE = no sorting of any output streams; FULL = output stream(s) are sorted using a brute-force full sort; SM = uses the split-merge technique for sorting; ISRF = uses an indexable-SRF for sorting. Note that the ISRF results for (c,d) are only valid if we can determine the number of iterations required for each stream element before we start the while-loop.

stream of the first output of each input stream element. Sorting then requires a pass over all the per-iteration outputs, using the first key as a lookup into sum stream: a third kernel adds the value of the second key to the value read from sum stream in order to obtain the final position of the element in the output stream (a sketch of these two passes follows).

According to the graphs in Figure 3.21, if we have an indexable-SRF available on the architecture, then the speedups achieved are quite close to the speedups without sorts for both EXCR and FLCRSU across the board. Of course, the main disadvantage is the cost of the extra hardware. Jayasena et al. reported that this type of SRF indexing increased the area of the SRF by 11%, and the area of an entire eight-cluster processor by 1.5%. Their study, however, found that most of the additional area was due not to the extra control logic, but to the extra addressing logic required for accessing individual words instead of entire four-word lines from each SRF bank. So, if we wanted to reduce the area overhead of an indexable-SRF, we could instead read a whole line even when only a single word is requested. This would eliminate almost all the extra addressing logic, but would reduce the achieved random-access bandwidth of the SRF significantly. We used this lower-bandwidth version for our results.
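Returning to the count stream / sum stream scheme above, the following hedged C sketch shows the two extra passes; array-based streams and all names are illustrative stand-ins for the indexable-SRF accesses:

    /* Pass 1 (second kernel): exclusive running sum of count_stream, so
     * sum[e] is where element e's first per-iteration output belongs. */
    void running_sum(const int *count, int *sum, int n_elems) {
        int acc = 0;
        for (int e = 0; e < n_elems; ++e) { sum[e] = acc; acc += count[e]; }
    }

    /* Pass 2 (third kernel): scatter each output to sum[first_key] + second_key,
     * its final position in the ordered per-iteration stream. */
    void scatter(const int *elem_key, const int *iter_key, const int *payload,
                 int n_out, const int *sum, int *ordered) {
        for (int i = 0; i < n_out; ++i)
            ordered[sum[elem_key[i]] + iter_key[i]] = payload[i];
    }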

Sorting via the memory system

A final possibility is to sort streams through the memory system. The performance impact of doing this varies from application to application, depending on how burdened the memory system is to begin with, and on how well the application can tolerate the latency of sending the stream to memory and having to potentially load it again for the next kernel. We do not analyze the performance of this sorting technique here.

3.3.6 Conditional Routing Summary for While-Loops

SWP is an efficient way to implement while-loops within the kernel main loop only if both of these conditions hold: the number of iterations per stream element is not small, and the loop schedule is not limited by a long loop-carried dependency. Thus, if either or both of these conditions do not hold, the techniques we have introduced in this chapter are important for improving the ILP of the kernel schedules. In particular, we

introduced expanded conditional routing (EXCR), flattened conditional routing (FLCR), and flattened conditional routing with state-unrolling (FLCRSU). We found that the best performing options were EXCR and FLCRSU. However, both expanded conditional routing and state-unrolling produce an output stream that can be in a different order than that of the original version of the loop. We showed that if ordering is a necessary constraint imposed by the application, then the addition of an indexable-SRF can eliminate much of the overhead of restoring the proper order of the output stream. A summary of all the techniques studied in this chapter is provided in Table 3.2.

Table 3.2 Summary of while-loop conditional techniques. Each column is dedicated to a particular property of the kernel or application, and each row indicates how a technique handles each property.

         Overhead per             Long loop-carried       Variance in iterations     Ordering of
         stream element           dependency              per stream element         outputs

SWP      Overhead for priming     Decreases performance   Does not affect            No sorting necessary
         the software-pipeline    significantly           performance
         per stream element

SSU      No extra overhead        Does not affect         Poor performance for       No sorting necessary
                                  performance             “Bimodal” distributions

EXCR     No extra overhead        Does not affect         Poor performance for       Must restore ordering;
                                  performance             “Bimodal” distributions    can use indexable-SRF
                                                                                     or full sort

FLCR     No extra overhead        Decreases performance   Does not affect            No sorting necessary
                                  significantly           performance

FLCRSU   Small extra overhead     Does not affect         Fairly resistant to        Must restore ordering;
                                  performance             high variance              can use split-merge,
                                                                                     indexable-SRF, or
                                                                                     full sort

3.4 RENDER Performance

In order to study the effects of the new techniques in this and future chapters on a real kernel and application, instead of just on synthetic benchmarks, we will use the RENDER application. RENDER generates the first frame of the SPECviewperf 6.1.1 advanced visualizer benchmark with lighting and blending disabled and all textures point-sampled from a 512×512 texture map. The image was rendered into a 720×720 framebuffer window. In particular, we will focus on the first kernel in this application, geom rast, which we introduced earlier in Section 2.3.1. Figure 3.22 shows a stream and kernel diagram of the entire RENDER application, and illustrates how the geom rast kernel fits into the application. We will give performance results for the entire application, but will not discuss the details here; instead we will only go into depth for the geom rast kernel in order to keep the discussion focused. More details on the entire RENDER application can be found in other references [Owens et al., 2000; 2002; Owens, 2002]. Also, while the RENDER application does contain conditionals in kernels other than geom rast [Kapasi et al., 2000], we will not be experimenting with them in this thesis.

In order to zoom in on the geom rast kernel, a graphical view of its control flow is shown in Figure 3.23. This basic block diagram also indicates the values of some of the parameters of the kernel that we have identified as important in deciding which conditional technique is the most appropriate and efficient for each conditional statement. These include the average probability that each if-statement is taken, or the average number of iterations that the while-loop executes; the number of operations in each basic block; and the number of words of live state that are transferred into and out of the basic blocks. Note that we have not indicated those pieces of live state, such as constants, that are invariant across all iterations of the main loop. Also, the average branch percentages and loop-iteration numbers are only for the particular scene that we rendered; a different scene, or a different viewpoint of the same scene, would have different values.

In order to assess the impact of conditional routing on the VLIW schedule of geom rast, Table 3.3 shows the execution times for ORIGINAL, COMMON, PREDICATION, and CONDITIONAL ROUTING.

Figure 3.22: Polygon rendering application stream and kernel diagram. [The pipeline chains the geometry and rasterization kernel to the hash, merge-sort, compact, and zcompare kernels.] We will be applying the conditional techniques we have studied in this dissertation to the first kernel, which performs the geometry and rasterization work for the pipeline. We will refer to this kernel as geom rast.

The COMMON implementation optimizes the kernel for the branch statistics shown in Figure 3.23 using a combination of speculation and loop-unrolling, and improves the kernel execution time by 1.8×. We manually searched the space for the best basic blocks to speculatively execute and for the amount of unrolling to use. PREDICATION is a combination of predication and software-pipelining. It eliminates all the conditional branches from within the kernel schedule by speculatively executing all control-flow paths (for while-loops, this amounts to using loop-flattening). However, with so many nested conditionals, and especially with a while-loop nested so deeply, this technique is bound to be slow: even though it can efficiently software-pipeline the resulting schedule, many unnecessary operations execute on every loop iteration. As a result, for this kernel, PREDICATION actually does worse than the original, unoptimized kernel. The CONDITIONAL ROUTING result uses a combination of predication, conditional routing, and software-pipelining to gain the best speedup of 2.3×. For CONDITIONAL ROUTING, we applied conditional routing and predication to the various conditional blocks using the technique selection guidelines we outlined in Section 3.2.3. The while-loop was implemented with FLCRSU.

Figure 3.23: Basic block diagram for the geom rast kernel. The basic block names correspond to those used in Kernel 2.1. Indicated on the diagram are the number of arithmetic operations in each basic block, the amount of live state that is transferred along each control-flow path, and the probability that each control-flow path is taken:

    mesh:        p = 0.41, S = 21 words,  92 ops
    clip:        p = 1.00, S = 21 words,  69 ops
    viewport:    p = 0.47, S = 29 words,  74 ops
    rasterprep:            S = 20 words, 113 ops
    updatexy:    p = 0.77, S = 15 words,  84 ops, 5.9 iterations
    genfrag:               S =  0 words,  85 ops
    (plus the check_frags_left and check_stream_empty blocks)

Table 3.3 Performance of conditional techniques for the geom rast kernel and the entire RENDER application on a VLIW processing cluster.

                            geom rast                RENDER
  Method                 Cycles      Speedup      Cycles      Speedup
  ORIGINAL             14,707,868      1.0      19,781,539      1.0
  COMMON                8,204,487      1.8      13,278,158      1.5
  PREDICATION          19,654,392      0.75     24,728,063      0.8
  CONDITIONAL ROUTING   6,499,555      2.3      11,601,738      1.7

The differences in relative execution times are smaller, however, when we take into account all of the kernels in the application, since there are not as many conditionals that stand to benefit from conditional routing. For the entire application, the speedup due to CONDITIONAL ROUTING is 1.7×. Interestingly, the input batch size we could handle with CONDITIONAL ROUTING did not decrease, despite the additional intermediate streams that need to be allocated in the SRF. This is because the SRF allocation was almost entirely determined by the number of fragments, which are generated by geom rast and used by all the following kernels in RENDER. Thus the smaller intermediate streams did not impact the overall possible batch size. Also, the code size for the different versions of the kernels matches what we would expect. ORIGINAL (320 instructions) and COMMON (371 instructions) are large, while PREDICATION (157 instructions) and CONDITIONAL ROUTING (221 instructions) are smaller. This is because ORIGINAL has very poor ILP, which lengthens the schedule and the resulting code size. COMMON increases ILP via speculation and more loop-unrolling. This fills in some of the empty ALU slots, but at the expense of larger code size because of extra fix-up code and extra iterations of unrolling. PREDICATION uses software-pipelining only and hence has the smallest code size. CONDITIONAL ROUTING uses software-pipelining as well as selective loop-unrolling, and as a result has a code size in the middle of the range.

In summary, CONDITIONAL ROUTING achieves the best schedule on our five-ALU VLIW cluster out of all the techniques. It is instructive to circle back to the result we presented as motivation at the start of the chapter. We showed that existing techniques

were only able to achieve about 26% of peak utilization of the ALUs in our VLIW processing cluster. Conditional routing increased this utilization for geom rast to 33%. While this represents a 1.25× speedup, clearly there is still room for much improvement in order to reach the 80%–90% utilization that kernels without conditionals achieve. We will suggest possibilities for further improvements in Chapter 6.

3.5 Dynamic Techniques

It is worthwhile to contrast our treatment of static conditional techniques in this chapter with dynamic techniques. A myriad of dynamic techniques have been proposed, and many are commonplace in general-purpose processors today. The most important ones fall under the category of branch prediction. The idea behind branch prediction is essentially to predict which way a branch will resolve so that the instructions from the predicted control-flow path can be scheduled before the branch direction is known for sure. Good introductions to dynamic branch prediction are provided in [Smith, 1981] and [McFarling and Hennessy, 1986]. Some more recent schemes are described in [Yeh and Patt, 1991], [Eden and Mudge, 1998], and [Seznec et al., 2002]. These techniques are useful only when combined with dynamic instruction issue techniques. However, the combination of dynamic issue logic (as opposed to static VLIW) and dynamic branch predictors can add a significant amount of area and power to a design. The reason that general-purpose processors need to use these techniques is that they target applications that do not have the data-parallel nature that media processing applications have. Our goal in this chapter has been to avoid these options in an effort to keep the stream processing cluster small and cool.
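To make the prediction idea concrete, the following sketch models the classic two-bit saturating-counter predictor in the spirit of [Smith, 1981]. It is only an illustration; the table size and program-counter hash are assumptions of ours, not a model of any particular machine.

    #include <cstdint>
    #include <vector>

    // Minimal sketch of a two-bit saturating-counter branch predictor.
    // Counter values 0/1 predict not-taken, 2/3 predict taken.
    class TwoBitPredictor {
    public:
        explicit TwoBitPredictor(size_t entries = 1024)
            : table_(entries, 1) {}            // start weakly not-taken

        bool predict(uint64_t pc) const {
            return table_[index(pc)] >= 2;     // 2,3 => predict taken
        }

        void update(uint64_t pc, bool taken) {
            uint8_t& c = table_[index(pc)];
            if (taken)  { if (c < 3) ++c; }    // saturate at strongly taken
            else        { if (c > 0) --c; }    // saturate at strongly not-taken
        }

    private:
        size_t index(uint64_t pc) const { return (pc >> 2) % table_.size(); }
        std::vector<uint8_t> table_;
    };

A loop branch that is taken nine times out of ten quickly saturates toward "strongly taken" and is then mispredicted only on the loop exit; that hysteresis is exactly what the second bit buys over a one-bit scheme.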

Chapter 4

Improving SIMD Performance with Conditional Routing

The previous chapter explored the benefits of conditional routing for a single VLIW processing cluster. In this chapter we will extend our analysis to explore the performance of kernels with conditional statements running on stream processors with multiple clusters under SIMD control. Earlier, in Section 2.2.2, we motivated the use of the SIMD execution model via three main points: 1) SIMD hardware is more efficient; 2) the SIMD model is simpler; and 3) many applications are regular enough that the restrictions made by the SIMD model do not affect their performance. However, we alluded to the fact that kernels with conditionals suffer large efficiency losses on SIMD machines. In this chapter we will study how to remedy this. In particular, we will show that the COMMON technique performs quite poorly on SIMD machines, but that PREDICATION and CONDITIONAL ROUTING are both unaffected by the restricted SIMD model of execution. However, when combined with the advantages in VLIW efficiency from the last chapter, especially as compared to PREDICATION, CONDITIONAL ROUTING has a clear advantage over the other techniques on a SIMD machine.


Table 4.1 One- to eight-cluster speedups of existing conditional techniques for the geom rast kernel on a stream processor with SIMD clusters.

  Method         1 to 8 Cluster Speedup
  ORIGINAL                2.5
  COMMON                  3.3
  PREDICATION             4.2

4.1 Motivation

The problem with executing conditionals on SIMD machines is that every cluster must follow the union of the control-flow paths required by all the clusters. This inefficiency limits the possible speedup of an eight-cluster SIMD machine over a single-cluster machine on kernels with conditionals. We would expect this speedup for a data-parallel kernel such as geom rast to be close to 8×. However, the actual speedups, shown in Table 4.1, are much lower. ORIGINAL only improves by 2.5×. COMMON improves by only 3.3×, which is less than half of the maximum possible. Statistical differences between the processing required for the element in each cluster cause the performance of these two methods to be so poor. Essentially, the time to process each element is lengthened by the time spent executing unnecessary basic blocks that other clusters need to execute.

On the other hand, PREDICATION completely removes all the branches in the kernel, so that all clusters always follow the same control-flow path through the main loop. This is perfectly suited to SIMD execution. The performance of PREDICATION is still degraded, though, because each element will require a different number of iterations after loop-flattening, and hence some clusters will finish their entire batch before others. However, there are more elements over which to average out these statistical differences in PREDICATION compared to the other techniques: an entire batch instead of a single element. So, if we assume the required control-flow is independent from one element to the next, by the law of large numbers we would expect the impact on performance to be smaller for PREDICATION. While the independence assumption is not exactly accurate for the input dataset we used for the geom rast kernel, the speedup of PREDICATION is still higher, at 4.2×. As a side note, we will attack the problem of reducing the impact of load-imbalance across an entire batch of elements in the next chapter.
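The law-of-large-numbers argument can be checked with a small simulation. The sketch below compares the lockstep cost (every element costs the maximum iteration count across the clusters) against the flattened cost (only the per-cluster batch totals must balance). The cluster count, batch size, and geometric workload distribution are illustrative assumptions, not measurements from Imagine.

    #include <algorithm>
    #include <cstdio>
    #include <random>

    // Lockstep SIMD: total = sum over elements of the per-element max.
    // Flattened:     total = max over clusters of each cluster's sum.
    int main() {
        const int C = 8, N = 4096;            // clusters, elements per cluster
        std::mt19937 rng(42);
        std::geometric_distribution<int> iters(0.25);  // assumed workload

        long lockstep = 0;                    // sum of per-element maxima
        long totals[C] = {0};                 // per-cluster flattened totals
        for (int e = 0; e < N; ++e) {
            int mx = 0;
            for (int c = 0; c < C; ++c) {
                int it = 1 + iters(rng);      // iterations for this element
                mx = std::max(mx, it);
                totals[c] += it;
            }
            lockstep += mx;
        }
        long flattened = *std::max_element(totals, totals + C);
        long work = 0; for (long t : totals) work += t;
        std::printf("ideal %ld, flattened %ld, lockstep %ld cycles\n",
                    work / C, flattened, lockstep);
    }

With parameters like these, the flattened total should sit close to the ideal of total work divided by C, while the lockstep total is substantially larger, mirroring the gap between COMMON-style execution and PREDICATION.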

Table 4.2 Comparison of existing conditional techniques for the geom rast kernel on a stream processor with eight SIMD clusters.

  Method         Speedup over ORIGINAL
  COMMON                  2.4
  PREDICATION             1.3

While the SIMD speedup is better for PREDICATION, recall from the last chapter that the efficiency on each individual VLIW processing cluster for geom rast is quite low for PREDICATION. In fact, it was lower than that of ORIGINAL. Thus the absolute performance on an eight-cluster machine is actually still almost two times slower for PREDICATION than for COMMON, as seen in Table 4.2. In summary, we are in the unfortunate position that COMMON can produce reasonably well-packed VLIW schedules (within 25% of the best achieved by CONDITIONAL ROUTING for geom rast), but does not offer good efficiency on multiple SIMD clusters. On the other hand, PREDICATION offers better global SIMD speedup when adding more clusters, but this is not very useful because the base performance on each individual VLIW cluster is poor. This motivates us to find a better solution; we will show in the rest of the chapter that conditional routing is such a solution.

4.2 Performance

This section will analyze the performance of our if-statement and while-loop benchmarks (synthetic if and synthetic while) with a two-step approach. For each benchmark, we will first compare the speedup of each technique when going from a single-cluster machine to an eight-cluster SIMD machine. This will show that conditional routing is not adversely affected by the restricted (lock-step) execution model of a SIMD machine. Second, we will compare the overall performance of conditional routing on an eight-cluster SIMD machine. This comparison will roll up the impact of efficient ALU packing on each VLIW cluster and the impact of good performance under SIMD control into one number.

4.2.1 If-Statements

As we saw earlier with geom rast, the performance of both the ORIGINAL and COMMON techniques on synthetic if scales dismally from a single cluster to eight SIMD clusters (Figure 4.1). The scaling for the COMMON techniques is particularly bad because misspeculation causes every cluster to execute the fix-up code, not just the original offending cluster. The penalty is bad for synthetic if, but it gets even worse on synthetic case. In fact, the COMMON technique only has a speedup of around 4× for almost all values of p on synthetic case. Predication has been the time-honored method to counter the problem exhibited by ORIGINAL and COMMON. Its use in this regard dates back to the Solomon machine [Slotnick et al., 1962]. Another notable machine with hardware predication support is the Cydra 5 supercomputer [Rau et al., 1989], although they called it directed-dataflow execution. Also, almost all vector processors since the Cray-1 [Russell, 1978] support predication via masked operations for the same purpose. The superior performance of PREDICATION is indicated by an 8× speedup across the board for both benchmarks.
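As a reminder of what masked execution looks like, the sketch below if-converts a simple conditional: both sides of the branch are computed unconditionally, and a select keyed on the per-element condition keeps the correct result, so every SIMD lane follows one instruction sequence. The workload here is purely illustrative.

    #include <cstdio>

    // Sketch of if-conversion: evaluate both paths, then select.
    // The select plays the role of a masked write in a vector machine.
    int main() {
        const int N = 8;
        float in[N]  = {1, -2, 3, -4, 5, -6, 7, -8};
        float out[N];
        for (int i = 0; i < N; ++i) {
            bool p = in[i] > 0.0f;            // the "case" value per element
            float t = in[i] * 2.0f;           // then-path, always executed
            float f = -in[i] + 1.0f;          // else-path, always executed
            out[i] = p ? t : f;               // select on the condition
        }
        for (int i = 0; i < N; ++i) std::printf("%g ", out[i]);
        std::printf("\n");
    }

The cost is visible in the code: both paths execute for every element, which is precisely why predication wastes ALU slots when the blocks are large or the branch probability is small.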

CONDITIONAL ROUTING, however, does almost as well. Small performance degradations are caused by load-imbalance across the entire dataset, since each cluster may not have exactly the same number of elements that need to execute the body of the if-statement, due to statistical variations. However, these slight performance degradations become insignificant when we compare overall performance in Figure 4.2.

As expected, the performance of ORIGINAL and the variants of COMMON is quite poor on the eight-cluster SIMD machine. We can see why PREDICATION has been the technique of choice on SIMD processors. For example, compare the graphs in Figure 3.7 and Figure 4.2(b). PREDICATION is the worst technique for synthetic case on a single-cluster machine. However, the performance of the other techniques has decreased sufficiently (as indicated in Figure 4.1) that PREDICATION is now the superior existing technique for a large fraction of the values of p on the same kernel. However, CONDITIONAL ROUTING is able to combine good VLIW schedules with good SIMD scaling, and as a result executes as fast as or faster than PREDICATION for almost all values of p on both graphs. For these benchmarks, CONDITIONAL ROUTING is clearly a solution to the problem we introduced at the beginning of the chapter.

Figure 4.1: Impact of the SIMD execution model on if-statement techniques. [Two panels plot the one- to eight-cluster speedup of ORIGINAL, COMMON (TRUE/FALSE or CASE 1), PREDICATION, and CONDITIONAL ROUTING: (a) synthetic_if versus the fraction true, p; (b) synthetic_case versus the fraction of elements that take Case 1, p.] The elements within an input batch were independently and randomly generated (while satisfying the constraint on the value of p). Wbody = Wcase = 32 operations; Winput = Woutput = 17 operations; S = 2 words.

Figure 4.2: Comparison of if-statement techniques on an eight-cluster SIMD machine. [Two panels plot the average execution time per stream element, normalized to IDEAL: (a) synthetic_if versus the fraction true, p; (b) synthetic_case versus the fraction of elements that take Case 1, p.] The elements within a batch were independently and randomly chosen (while satisfying the constraint on the value of p). Wbody = Wcase = 32 operations; Winput = Woutput = 17 operations; S = 2 words.

Figure 4.3: Impact of the SIMD execution model on while-loop techniques. [A bar chart of the one- to eight-cluster speedup for SWP, EXCR, FL, and FLCRSU under the Zero-Variance, Uniform, Spread, and Bimodal distributions.] The version of the kernel used is the same as in Figure 3.17(d), while the four probability distributions used are the same as in Figure 3.18.

4.2.2 While-Loops

The SIMD execution model degrades the performance of while-loops because all clusters must execute a while-loop as long as any cluster has not yet exited the loop. With the SWP technique, which we discussed in Section 3.3, every stream element will require early-finishing clusters to idle while the cluster with the largest element finishes. As shown in Figure 4.3, this limits the speedup of SWP to only about 2× for the "Bimodal" distribution. In order to get around this problem, previous researchers have employed the loop-flattening (FL) technique. As we discussed earlier, loop-flattening allows many SIMD processors to execute while-loops on their local set of data without requiring a synchronization between all the processors at the beginning and end of every element. Loop-flattening achieves its goal by transforming a nested loop into a single loop that allows every processor (or cluster) to execute the same code, regardless of whether it is starting the first iteration for a new element or another iteration for an old stream

element. Statistical variations are then averaged over the whole batch, which reduces the amount of idling for clusters. The benefits of loop-flattening were first demonstrated in [Sherryl and Pappas, 1990], who applied it to the calculation of the Mandelbrot set on a MasPar MP-1 processor. Von Hanxleden and Kennedy generalized this technique, and also coined the term "loop-flattening" [von Hanxleden and Kennedy, 1992]. On our benchmark, the benefits of loop-flattening are clear, since it achieves a speedup of at least 7× on all distributions except "Bimodal," on which it achieves a speedup of 5×. Our new techniques (EXCR and FLCRSU) also achieve similar speedups, indicating that they are well suited to the SIMD execution model as well. However, a clear difference between the original loop-flattening technique and our new techniques arises when we look at the combined effect of VLIW efficiency and SIMD efficiency. In Figure 4.4, the EXCR and FLCRSU techniques, as expected, combine their superior single-cluster performance and superior SIMD scaling to achieve speedups roughly between 2×–5×. Simple loop-flattening (FL), however, never achieves a speedup of more than 2×. By comparing the results for simple loop-flattening (FL) and flattened conditional routing with state-unrolling (FLCRSU) in the two graphs in Figure 4.3 and Figure 4.4, we can see that loop-flattening provides FLCRSU with good SIMD performance, while our additions of conditional routing and state-unrolling provide it with efficient VLIW schedules.
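For readers who have not seen the transformation, the sketch below flattens a per-element while-loop into a single loop in which every trip executes the same code, whether it begins a new element or continues the current one. The iteration counts and the work inside the loop are stand-ins, not the geom rast kernel.

    #include <cstdio>

    // Loop-flattening sketch: the nested loop
    //   for each element e: while (!done(e)) step(e);
    // becomes one loop whose body either pops a new element or
    // performs another iteration on the current one.
    int main() {
        int iters[] = {3, 1, 5, 2};           // assumed per-element trip counts
        const int N = sizeof(iters) / sizeof(iters[0]);

        int e = -1, left = 0, trips = 0;
        bool done = true;
        while (e + 1 < N || !done) {          // the flattened loop condition
            if (done) {                       // pop the next element
                ++e;
                left = iters[e];
                done = false;
            }
            --left;                           // one "iteration" of real work
            ++trips;
            done = (left == 0);
        }
        std::printf("processed %d elements in %d flattened trips\n", N, trips);
    }

On a SIMD machine each cluster runs this same flattened loop on its own data, so a cluster that finishes one element simply starts the next instead of idling until every other cluster has also finished.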

4.3 RENDER Performance

If we circle back to the geom rast kernel, Table 4.3 combines the results for existing techniques which we showed at the start of the chapter, and also adds the new results for CONDITIONAL ROUTING. We can see from the table that CONDITIONAL ROUTING does as well as PREDICATION in terms of speedup from one to eight clusters. Also, from the last chapter (Section 3.4) we recall that CONDITIONAL ROUTING did better than COMMON on a single VLIW cluster. Therefore, conditional routing is our solution to combining the best of both worlds, providing a 1.6× speedup over the next best existing technique on geom rast on an eight-cluster SIMD machine. There is still much room for improvement, though, since CONDITIONAL ROUTING is still only achieving half the maximum speedup possible with eight clusters instead of one. As we will see in the next chapter, the one- to eight-cluster speedups of both geom rast and the entire RENDER application can be improved by distributing work across all the clusters so as to evenly balance the load.

Figure 4.4: Comparison of while-loop techniques on an eight-cluster SIMD machine. [A bar chart of the speedup over SWP for EXCR, FL, and FLCRSU under the Zero-Variance, Uniform, Spread, and Bimodal distributions.] The version of the kernel used is the same as in Figure 3.17(d), while the four probability distributions used are the same as in Figure 3.18.

4.3.1 MIMD Comparison

Earlier we motivated the use of a SIMD execution model because it is more hardware-efficient and because it is less complex. However, we noted that MIMD organizations did have a performance advantage, since SIMD control flow limited performance: either clusters were forced to follow the union of all required control-flow paths, or extra operations diluted ALU efficiency if we used predication instead. It turns out that, for geom rast, a MIMD solution would have a speedup of 36% over the best existing algorithm (COMMON in Table 4.3).[1]

[1] The kernel immediately following geom rast requires synchronization between all the clusters, so we assumed that all clusters would synchronize at the end of geom rast. Hence the MIMD version still suffered from load-imbalance, since some clusters finished executing geom rast on their local batch before other clusters. This is why its one- to eight-cluster speedup is only 4.5× instead of 8×.

Table 4.3 Performance of conditional techniques for the geom rast kernel and the entire RENDER application on a stream processor with eight clusters.

                                      1 to 8 Cluster Speedup    Normalized Comparison
  Method               Machine type    geom rast   RENDER        geom rast   RENDER
  ORIGINAL             SIMD               2.5        2.7            1.0        1.0
  COMMON               SIMD               3.3        3.4            2.4        1.9
  PREDICATION          SIMD               4.2        4.1            1.3        1.2
  CONDITIONAL ROUTING  SIMD               4.2        3.8            3.8        2.4
  COMMON               MIMD               4.5        4.0            3.2        2.2
  CONDITIONAL ROUTING  MIMD               4.2        3.8            3.8        2.4

However, we now have a new alternative: conditional routing. It is interesting to see that CONDITIONAL ROUTING actually performs better than the best existing MIMD solution. Since CONDITIONAL ROUTING has the same performance on both the SIMD and MIMD machines, there is no longer any benefit to adding the MIMD hardware, at least for this application.

4.4 Summary

The restricted control-flow of the SIMD execution model affects any conditional technique that does not remove all branch instructions from the loop body, such as speculation for if-statements and software-pipelining for while-loops. While predication does eliminate branches, we have previously shown that it can be inefficient when the block size is large or when p is small. Conditional routing was the only technique that performed well under SIMD control without suffering the overhead of executing extraneous control-flow paths. However, conditionals still cause a problem for every technique we have discussed on machines with multiple clusters. In particular, each cluster has a different amount of work to do in its SRF bank, and hence some clusters may have to idle while others are still

processing. The next chapter addresses this problem.

Chapter 5

Improving Load-balance with Conditional Streams

In the last chapter we saw that conditional routing is an important technique for achieving efficient performance on a stream architecture with SIMD clusters. However, we noted that after applying conditional routing, there was still load-imbalance between the clusters. This is because each cluster has a different amount of work to do to process the entire input data stream for each kernel. Conditional streams, which we will propose and evaluate in this chapter, are a mechanism for reducing the cost of this load-imbalance. The performance results will show that the conditional stream mechanism can improve the performance of the conditional routing technique by up to 1.5× in the worst-case situations for our set of micro-benchmarks.

5.1 Motivation

In order to motivate our efforts to improve load-imbalance, let's look back at the geom rast kernel again. In the last chapter, we saw that even though the PREDICATION and CONDITIONAL ROUTING techniques do not suffer any slowdowns due to the restricted control-flow of SIMD, the speedup of geom rast was only 4.2× when going from a single-cluster to an eight-cluster machine. This is in comparison to between 6×–8× for kernels without conditionals [Rixner et al., 1998; Kapasi et al., 2002a]. The difference comes from


load-imbalance in the dataset in each cluster's SRF bank. Even worse, since geom rast produces a data-dependent number of outputs in each cluster, it causes load-imbalance further down the application pipeline as well. As a result, the remainder of the kernels in the RENDER application only had a speedup of around 4× for these two methods, instead of the nearly 8× we would expect. Thus, the problem we are attacking in this chapter is to improve the efficiency of applications with conditionals by distributing the workload among all the clusters, a process referred to as load-balancing.

5.2 Existing Techniques for Improving Load-Balance

This type of load-balancing can sometimes be achieved on a case-by-case basis via manual programmer intervention. Furthermore, intelligent partitioning at compile time might also improve load-balance. However, we would like to find a more general solution that adapts dynamically, so that it can successfully distribute the load for any set of input elements, even if the exact distribution is not known until run-time. One such general technique is a global task-queue. Processing clusters must access a global data structure in order to obtain more work, and then must transfer the data from the global queue to their local memory. A notable technique that falls into this class of load-balancing is self-scheduling [Tang and Yew, 1986; Fang et al., 1990]. While this clearly solves the load-imbalance issue, it does raise some other problems. The most important issue is how to implement the global data structure using the banked SRF of a stream processor. We could implement the SRF as a global memory that is accessible by any cluster instead of banking it, but that would reduce its efficiency too much. Instead, we would like to keep the SRF banked in order to retain its high bandwidth. Our solution is to stripe the queue across the banks, in order to allow consecutive accesses to the queue to be served by different SRF banks. We will show how to manage such a load-balancing scheme in this chapter. Another important question stems from the fact that, for the best load-balancing, the clusters should access the global data queue frequently (i.e., the granularity of work assigned to each requesting cluster should be small). However, this results in a high overhead, because too much time will be spent in the software routines that arbitrate who gets

access to the queue and when. On a SIMD machine, this overhead can be particularly bad because, due to the lockstep execution model, many clusters may try to access the queue at the same time. Previous researchers have suggested ways to combat the high access overhead by reducing the frequency with which clusters must access the global task queue. The basic idea is to find a way to allow each cluster to grab more work each time it accesses the queue, but without significantly affecting the load-imbalance. A good example is guided self-scheduling [Polychronopoulos and Kuck, 1987], sketched below. However, in this chapter, we present a different option that utilizes a hardware mechanism to access the global task queue. The idea is that using a hardware mechanism instead of a software routine to access the queue will reduce the overhead of accessing the queue. If the overhead is small enough, then the clusters can access the queue at a granularity of a single stream element, which gives the best load-balancing results. The next section will first present the operation of the conditional streams mechanism, i.e., how the global task queue is striped across the clusters. Conditional streams are designed to be used in conjunction with conditional routing, and so the following example will show that combination. However, conditional streams can also be applied to other techniques, and we will analyze the performance of such combinations later. After we describe how they reduce load-imbalance, we will describe the hardware used to support conditional streams as well as their performance.
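The sketch below illustrates the guided self-scheduling rule: each requester is handed roughly ceil(remaining/C) elements, so chunks start large (few queue accesses) and shrink toward single elements near the end (little residual imbalance). The queue length and cluster count are illustrative assumptions.

    #include <cstdio>

    // Guided self-scheduling sketch: chunk size = ceil(remaining / C).
    // Early chunks are large, late chunks shrink toward one element.
    int main() {
        const int C = 8;
        int remaining = 1000;                 // elements left in the task queue
        int accesses = 0;
        while (remaining > 0) {
            int chunk = (remaining + C - 1) / C;   // ceil(remaining / C)
            remaining -= chunk;
            ++accesses;
            std::printf("access %2d: grabbed %3d, %3d left\n",
                        accesses, chunk, remaining);
        }
    }

The trade-off is visible in the output: only a few dozen queue accesses cover the whole queue, yet the final chunks are small enough that no cluster can end up holding much more work than the others.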

5.3 Conditional Streams

The main reason conditional routing causes load-imbalance between clusters is that streams are produced or consumed using conditional routing primitives. As we recall (Section 3.2), these conditional routing primitives are stream accessors that are modified to accept a case variable, whose value is determined locally in each cluster. In its basic form, this primitive allows a cluster to produce or consume a stream in its own local SRF bank at a different rate from other clusters. Load-imbalance arises for a kernel when, for example, its input stream was produced by a previous kernel that used conditional output primitives, and each cluster generated a different number of values in the stream in its local SRF bank.

This situation is depicted in Figure 5.1, which shows the first two kernels from the original shapes example we used to illustrate conditional routing in Section 3.2. Notice that for the example dataset shown in the figure, it will take the second cluster four loop iterations of compute circle to process its input set, while it will take fewer for the other clusters. All clusters have to wait for the second cluster, however. Ideally it would take only two loop iterations for all four clusters to process the eight circles. Thus, this dataset results in a 2× overhead for load-imbalance (ignoring kernel startup overhead). The remainder of this section presents how the conditional stream mechanism can eliminate the overhead of load-imbalance for kernels with if-statements (like the example we just discussed) and for kernels with while-loops.

5.3.1 If-Statements

Conditional streams address this problem by using inter-cluster communication to load-balance work between the clusters. They achieve this by modifying the conditional routing primitives to ensure that the stream lengths remain balanced among all the clusters at all times (both during stream production and consumption). Instead of just reading a piece of data from the local SRF bank, the conditional stream mechanism does a global analysis across all the clusters to determine the best assignment of stream data to requesting clusters. In order to simulate a single large global task queue, the elements are ordered by striping them across the SRF. This ordering of stream elements and the process of load-balancing using conditional streams is illustrated in Figure 5.2, for the same example as in the previous figure. On the first loop iteration, (a), only one cluster has a circle to output to the stream. The element, initially in the second cluster, is sent to the SRF bank of the first cluster over the inter-cluster switch. On the second iteration, (b), all four clusters have circles to output to the stream. Even though, in this case, each cluster could output each element straight into its own SRF bank without affecting the final load-imbalance, the conditional stream mechanism still sends the elements to SRF banks in different clusters. This is because the conditional stream mechanism imposes a specific ordering on clusters and stream elements:

• Clusters are ordered according to cluster number (that is, their cluster ID is also their ordinal number). Elements are read from or written to the clusters in this order.

Figure 5.1: Load-imbalance due to conditional routing on a SIMD architecture. The example is based on a portion of the original shapes kernel from Section 3.2. Part (a) highlights the portion of the original example which we will focus on in this section: the compute_1, compute_circle, and compute_2 kernels, connected by the in_data, circles_only, circles_only_2, other_shapes, and is_circle_stream streams. Part (b) shows the execution of the highlighted portion on a SIMD stream machine with four clusters. An example input dataset is given, as well as the results for that dataset.

Figure 5.2: This series of figures shows how using a conditional output stream can eliminate load-imbalance for the same example as in Figure 5.1. Parts (a)–(d) show all four iterations of compute 1, after which the data are balanced between the four SRF banks.

• Stream elements are hierarchically ordered, first according to their position within their local SRF bank, and then elements at the same position in different SRF banks are ordered according to cluster number. Thus, if a stream element is identified by its (SRF position, SRF bank #) tuple, then elements are ordered as follows: (0, 0), . . . , (0, C − 1), (1, 0), . . . , (1, C − 1), . . . . Elements are read from and written to the SRF in this order when using conditional streams (the sketch after this list makes the striping concrete).
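A minimal sketch of this striping: under the ordering above, the i-th element of a conditional stream lives at SRF position i / C in bank i mod C, so consecutive elements land in consecutive banks. The cluster count and element count below are illustrative.

    #include <cstdio>

    // Map the global element index i to its (SRF position, SRF bank) tuple
    // under the conditional-stream ordering: bank varies fastest.
    int main() {
        const int C = 4;                      // number of clusters / SRF banks
        for (int i = 0; i < 12; ++i) {
            int bank = i % C;                 // cluster whose bank holds element i
            int pos  = i / C;                 // position within that bank
            std::printf("element %2d -> (pos %d, bank %d)\n", i, pos, bank);
        }
    }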

Based on this ordering, when multiple clusters want to transfer a data record to/from a conditional stream (i.e., when the case variable is true in more than one cluster), the elements are transferred such that the record from the first cluster among the requesters is transferred to/from the first stream location that has not yet been filled/read. The next two loop iterations, (c) and (d), proceed in the same fashion as we have described. Eventually, due to the use of the conditional output stream, there are exactly two circle elements in the SRF bank of each cluster. Executing the next kernel, compute circle, will only require two iterations, or one-half as many iterations as required without the use of conditional streams. In general, the only remaining load-imbalance that can arise when executing compute circle would be due to the fact that the number of circles in the input dataset may not be an exact multiple of the number of clusters. Hence, some clusters might execute a maximum of one extra loop iteration of the kernel. The example so far has focused on the conditional output stream mechanism required for the compute 1 kernel (Kernel 3.1). The process, however, is quite similar for input conditional streams. For example, the compute 2 kernel (Kernel 3.2) uses the conditional input routing primitive ("pop-if") in order to restore the ordering in the original kernel. It turns out that when we apply the conditional input stream mechanism using the ordering we described earlier, it will cause the input elements from the circles only 2 and other shapes streams to be distributed to the clusters in such a way that the original ordering is restored. That is, the compute 2 kernel will produce the elements of the final output stream, out data, in the same order as would have been produced had we not used conditional streams. Furthermore, notice that the intermediate streams circles only 2 and other shapes are also produced and processed in order.

5.3.2 While-Loops

Conditional streams can be used in conjunction with any of the methods we introduced for while-loops as well, and in particular with expanded conditional routing (EXCR) and flattened conditional routing with state-unrolling (FLCRSU). As we discussed earlier, expanded conditional routing simply considers a while-loop as a series of if-statements. It runs a separate kernel for each iteration of the while-loop, and for each iteration creates a new stream of elements that need continued processing and appends the others to a stream of finished elements. Recall that even the original expanded conditional routing method does not maintain the ordering of the output stream. Thus we do not have to address ordering here. The benefit that conditional streams provide to expanded conditional routing is to load-balance each iteration of the while-loop. That is, for each iteration of the original while-loop, the stream of elements that require further processing will be spread out equally among the SRF banks of all the clusters. The only load-imbalance that can arise is because the number of elements remaining to be processed for any iteration of the while-loop may not be a multiple of the number of clusters. FLCRSU uses conditional routing in order to load-balance on the input side of the kernel. That is, even though each cluster may start out with the same number of elements in its SRF bank, there can still be load-imbalance because each cluster consumes its data elements at a different rate. These techniques use a conditional input stream mechanism to deliver the next element(s) in the SRF banks to the next cluster(s) that have finished processing their previous stream element. Conditional streams will reduce the total load-imbalance between the clusters, but some imbalance may remain, because all clusters may not finish their final elements at the same time. This could be expensive, especially if an element that required a large number of iterations was one of the last elements in the stream.

5.4 Implementation

So far in this chapter we have discussed how conditional streams work and how they are used by the various conditional routing techniques. Now, before we analyze the impact

on performance, we will focus on how to implement the conditional stream mechanism. Implementing conditional streams requires intercepting the cluster request signals (i.e., the signals that cause an SRF bank to push or pop an element) from each cluster, and processing them globally. These signals are set to the case value in each cluster. If the local case value is TRUE in x clusters (i.e., there are x requesters), we need to figure out which x of the C SRF banks to access in order to service that particular conditional stream access. After the conditional stream hardware figures out the mapping from requesting clusters to servicing SRF banks, data values must be transferred across a switch to get them from the correct cluster to the correct SRF bank, or vice versa. First we will present a solution to do this on a stream architecture. Then we will discuss the scalability of conditional stream hardware to configurations with large numbers of clusters.
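Functionally, the mapping is round-robin matching: the j-th requester, in cluster order, is paired with bank (start + j) mod C, and start then advances by the number of requesters. The sketch below models one such access; the request pattern and start value are illustrative assumptions.

    #include <cstdio>

    // One conditional-stream access: match requesters to banks in order,
    // then advance the start pointer for the next access.
    int main() {
        const int C = 4;
        bool req[C] = {true, false, true, true};   // illustrative case values
        int start = 2;                             // next bank to service

        int j = 0;
        for (int cl = 0; cl < C; ++cl) {
            if (!req[cl]) continue;
            int bank = (start + j) % C;
            std::printf("cluster %d <-> bank %d\n", cl, bank);
            ++j;
        }
        start = (start + j) % C;                   // state for the next access
        std::printf("next start = %d\n", start);
    }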

5.4.1 Hardware Overview

The basic organization of the top-level hardware required for inserting the conditional stream unit into a stream processor is shown in Figure 5.3. The diagram shows the case of C clusters and S stream buffers (i.e., a maximum of S logical streams). Essentially, the conditional stream unit intercepts signals that go between a cluster and its local SRF bank. This new unit, however, is monolithic and not segregated into banks, because it needs to combine information from all the clusters (the case values) and transfer data between clusters. As is shown in the bottom portion of the diagram, normal stream transfers occur without modification. To clarify, if cstream is set to FALSE, then the data and control signals pass through the multiplexers directly between the SRF banks and their associated clusters without intervention. Conditional stream logic only intervenes when one of the stream accesses is a conditional stream access, indicated via the cstream signal. Individual signals in the cstream array could be set, for instance, by a micro-controller instruction that is issued at the start of the kernel, indicating that a particular logical stream will be accessed as a conditional stream for the remainder of the execution of the kernel.

When cstream is set to TRUE, the case values are sent to the global conditional stream logic. This logic is shown in detail in Figure 5.4. Essentially there are four main blocks.

Figure 5.3: Conditional stream implementation overview for a SIMD architecture. The interface between the clusters, stream buffers, and conditional stream logic is shown; for each cluster, the request, end-of-stream, and data signals of its S stream buffers are multiplexed either straight through to the local SRF bank or through the conditional stream global logic, under control of the cstream select signals. EOS = end of stream. The conditional stream logic is shown in more detail in Figure 5.4.

• The central controller receives the case values from the clusters and responds to instructions from the micro-controller.

• The second main block is a set of state registers for each stream buffer, which can be read and written to by the central controller.

• The third block is a data switch, which can transfer data values from a processing cluster to any SRF bank. The switch is assumed to be a full crossbar, and can thus handle arbitrary communication permutations between the clusters and SRF banks.

• Finally, the fourth block is a set of multiplexers in front of the switch inputs that can be configured to let the switch handle data flowing from the SRF banks back to the clusters or vice versa. There are also multiplexers that select one of S stream buffers for each input port of the switch.

5.4.2 Hardware Details

We will now delve a little deeper into the details of the hardware implementation. The goal of this section is two-fold. The first is to serve as a reference example of how to integrate the conditional streams hardware functions into the kernel ISA of a stream processor. The second is to serve as a primer to the following section on the scalability of the hardware that implements the conditional streams instructions. We will first present an ISA for the conditional streams unit, and then go through the instruction sequences required for both a conditional input and a conditional output stream access. Readers less interested in the conditional streams hardware and more interested in its performance may wish to skip directly to Section 5.5. The instructions that are needed to successfully execute a conditional stream access are shown in Table 5.1. The CLEAR CS STATE and INIT CX STATE instructions are used to indicate which streams are accessed using conditional streams, and to initialize the state for those streams. The GEN CX STATE instructions read the clust req signals for the stream buffer being accessed (the stream buffer number is provided by the micro-controller along with the instruction). These instructions process the request signals to compute the proper SRF bank to cluster mapping, taking into account the current state of the stream buffer in sbuf state.

Figure 5.4: Conditional stream global logic details. [The diagram shows the conditional streams controller, which receives instructions and stream buffer numbers from the micro-controller along with the CLUST_REQ and EOS signals; the per-stream-buffer state registers (sbuf_state); the data switch; and the multiplexers that select, for each switch port, among the S stream buffers and between cluster-side and SRF-side data.]

Table 5.1 Instructions supported by the conditional streams unit.

  CLEAR CS STATE
      Resets all streams to default to a non-conditional stream. Generally this
      instruction is issued automatically by the micro-controller at the outset
      of every kernel.

  INIT CI STATE / INIT CO STATE
      Initializes state for a conditional stream. These instructions are run once
      at the start of a kernel for every input or output conditional stream needed
      by that kernel (after the CLEAR CS STATE operation).

  GEN CI STATE / GEN CO STATE
      Generates signals for the next SRF data transfer. These instructions accept
      the case values from each cluster and combine them in order to generate all
      necessary control signals to read or write the data records associated with
      the case values.

  GET CS EOS
      Returns end of stream signals after applying the correct SRF bank to cluster
      mapping for the current conditional stream access. This instruction must be
      called after GEN CI STATE and before the COND DATA IN operation(s) for each
      individual record access.

  COND DATA OUT / COND DATA IN
      Transfer a data word to or from the SRF. These instructions actually transfer
      a word of data between the cluster banks and the SRF banks. Must be executed
      once for each word in a data record.

Based on this mapping, the instructions generate the permutation for transferring the data, as well as the request signals for the SRF banks, and store the updated state back into the sbuf state register. The GET CS EOS instruction returns the proper stream-empty condition to each cluster by applying the SRF bank to cluster mapping computed by the GEN CI STATE operation. The COND DATA X operations use the results of the GEN CX STATE instructions in order to actually perform the data transfers. These instructions must be called once for each word in a data record. Table 5.2 shows the sequence of operations that must be executed in order to access a record using a conditional input and a conditional output stream.

Checking whether an input stream is empty when using conditional streams requires some changes.

Table 5.2 Instruction sequences for executing conditional stream accesses. Both the instructions required to execute once per kernel and those required for each conditional stream access are shown. The code assumes that there are two data words per input record and three data words per output record.

  Code to execute once at the start of a kernel
      INSTR             SBUF NUM
      CLEAR CS STATE       —
      INIT CI STATE        0
      INIT CO STATE        1

  Operation sequence to execute for each conditional input stream access
      INSTR             SBUF NUM
      GEN CI STATE         0
      GET CS EOS           0
      COND DATA IN         0
      COND DATA IN         0

  Operation sequence to execute for each conditional output stream access
      INSTR             SBUF NUM
      GEN CO STATE         1
      COND DATA OUT        1
      COND DATA OUT        1
      COND DATA OUT        1

We can no longer use the empty( ) accessor to get the stream state, since each cluster does not know which SRF bank it is going to be accessing. For instance, consider executing the example first introduced in Kernel 3.6, which is a loop-flattened version of our initial while-loop example. When executing this kernel on a SIMD architecture without using conditional streams, our target architecture would selectively stall clusters once the loop stopping condition was met locally. Once the loop stopping condition was satisfied in all clusters, execution would resume starting with the first instruction after the loop. We will display the loop condition here again for reference:

loop while ¬in_stream.empty() ∨ ¬done begin

Now, consider the execution of this kernel using a conditional input stream. The portion of the condition that checks whether in stream is empty will not necessarily be accurate. This is because the stream-empty condition is not a purely local operation anymore, and must be computed globally across the clusters. Furthermore, even though the conditional stream might not be empty, there still may not be enough elements to satisfy all the requesting clusters. In this case, some of the clusters will not get valid data. Thus, the case values must be known before checking for the end-of-stream condition. However, the case values may not necessarily be known at the start of the loop, and may in fact be calculated after the start of the loop and before the first conditional access. A solution is to check the stream-empty condition explicitly for every conditional stream record access using the GET CS EOS instruction, and then qualify code in the rest of the loop with this condition. The loop-flattened kernel we referred to above, with this new modification for conditional input streams, is shown for reference in Kernel 5.1. We updated the API to return the EOS signal as part of the conditional access, and in this example we assigned the result to stream done.

More details on the exact pieces of stream buffer state that we need to store and the exact logic that each instruction requires are provided for reference in Table 5.3 and Table 5.4.

Table 5.3 Stream-buffer-specific state stored in the conditional stream unit. There are S sets of stream buffer state, one per stream buffer.

  sbuf_state field   Size (bits)   Description
  dir                1             Indicates the direction of the stream buffer for
                                   the duration of this kernel. TRUE indicates an
                                   input stream (read from the SRF); FALSE indicates
                                   an output stream (write to the SRF).
  start              lg C          Which SRF bank should service the next read or
                                   write. The order of bank access is determined by
                                   the conditional stream ordering discussed in
                                   Section 5.3.1.
  srf_acc[C]         1 × C         Indicates which of the SRF banks need to advance
                                   their stream pointer and read or write a data word.
  perm[C]            lg C × C      Configuration for the data switch. We assume the
                                   switch is source routed; that is, perm[i] indicates
                                   which cluster's data should be sent to the output
                                   port for cluster i.

Table 5.4 Conditional stream unit instruction details. All the referenced signals and state are listed in Table 5.3 or shown in Figure 5.3 and Figure 5.4. Note that signals listed in small caps are external inputs and outputs of the conditional stream unit.

  CLEAR CS STATE
      cstream[0] = · · · = cstream[S − 1] = FALSE

  INIT CI STATE and INIT CO STATE
      cstream[SBUF NUM] = TRUE
      sbuf_state[SBUF NUM].dir = (op == INIT CI STATE)
      sbuf_state[SBUF NUM].start = 0

  Common logic for GEN CI STATE and GEN CO STATE
      let clust_req = CLUST REQ[ ][SBUF NUM]
      let numreq = clust_req[0] + · · · + clust_req[C − 1]
      let nxt_start = (sbuf_state[SBUF NUM].start + numreq) mod C
      let wrap = (nxt_start <= sbuf_state[SBUF NUM].start)
      with sbuf_state[SBUF NUM] do,
          srf_acc[i] := ((i >= start) && (i < nxt_start)) ||
                        (!((i >= nxt_start) && (i < start)) && wrap)
          start := nxt_start

  GEN CI STATE
      let clust_req = CLUST REQ[ ][SBUF NUM]
      let mynum(cl) = clust_req[0] + · · · + clust_req[cl − 1]
      with sbuf_state[SBUF NUM] do,
          perm[i] := (start + mynum(i)) mod C

  GEN CO STATE
      let clust_req = CLUST REQ[ ][SBUF NUM]
      let mynum(cl) = (cl + C − sbuf_state[SBUF NUM].start) mod C
      with sbuf_state[SBUF NUM] do,
          perm[i] := smallest j s.t. mynum(i) + 1 = clust_req[0] + · · · + clust_req[j]

  GET CS EOS
      eos[i] = EOS[perm[i]][SBUF NUM]

  COND DATA IN and COND DATA OUT
      with sbuf_state[SBUF NUM] do,
          dir_select = dir
          perm_config = {perm[0], . . . , perm[C − 1]}
          srf_req = {srf_acc[0], . . . , srf_acc[C − 1]}
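A software model of the GEN CI STATE logic above may help in reading the table. The sketch below is ours, not a description of the actual hardware: it folds the common logic and the input-permutation computation into one function, and it adds an explicit guard for the numreq = 0 case that the table leaves implicit.

    #include <cstdio>
    #include <vector>

    // Software model of GEN CI STATE for one stream buffer.
    struct SbufState {
        int start = 0;                        // next SRF bank to service
        std::vector<bool> srf_acc;            // banks that pop this access
        std::vector<int>  perm;               // bank feeding each requester
    };

    static void gen_ci_state(SbufState& s, const std::vector<bool>& req) {
        const int C = (int)req.size();
        s.srf_acc.assign(C, false);
        s.perm.assign(C, 0);
        int numreq = 0;
        for (int cl = 0; cl < C; ++cl) {
            if (req[cl]) {
                // numreq here equals mynum(cl): requesters preceding cl
                s.perm[cl] = (s.start + numreq) % C;
                ++numreq;
            }
        }
        int nxt = (s.start + numreq) % C;
        bool wrap = (nxt <= s.start) && numreq > 0;   // guard: no requesters
        for (int i = 0; i < C; ++i)
            s.srf_acc[i] = (i >= s.start && i < nxt) ||
                           (wrap && !(i >= nxt && i < s.start));
        s.start = nxt;
    }

    int main() {
        SbufState s;
        s.start = 2;
        std::vector<bool> req = {true, false, true, true};
        gen_ci_state(s, req);
        for (int cl = 0; cl < (int)req.size(); ++cl)
            if (req[cl]) std::printf("cluster %d reads bank %d\n", cl, s.perm[cl]);
    }

With start = 2 and requesters 0, 2, 3, the model assigns banks 2, 3, and 0, and marks exactly those banks in srf_acc, matching the wraparound behavior encoded in the table's wrap term.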

Kernel 5.1 compute iterative with loop-flattening, updated for conditional input streams. Compared to the original, Kernel 3.6, this version has a modified loop condition, since the stream-empty state must be returned by the conditional stream access function. Note that the semantics of the kernel shown are such that an extra iteration will be required at the end of the kernel for the last cluster(s), since they have to attempt to read an input record from the stream in order to figure out whether it is empty.

  done ← true;
  stream_done ← in_stream.empty();
  loop while ¬stream_done ∨ ¬done begin
      /* stream_done == in_stream.empty() for this cluster */
      State, stream_done ← pop-if in_stream(done);
      State, done ← COMPUTE_ITERATIVE(State);
      /* only output if the input stream wasn't empty for this cluster */
      out_stream(done ∧ ¬stream_done) ← push-if State;
  end

5.4.3 Scalability

The complexity, size, and delay of the hardware necessary to implement conditional streams grow as the number of clusters (C) increases. We will consider each potentially problematic piece of logic in turn, and show that the most expensive piece of logic to scale is the hardware that implements the GEN CO STATE instruction. The problem with the straightforward hardware implementation of this operation is that each cluster must generate a different 1-bit request signal for every other cluster. This amounts to transposing C² wires. In order to reduce the wiring area, therefore, we will employ binary encoding instead of one-hot encoding, and will distribute the priority encoding across stages located near the request sources instead of routing all the inputs to one global encoder. The details of these optimizations will be made clearer later in the section.

Inter-Cluster Switch Scaling

The two pieces of hardware we need to worry about when scaling are the inter-cluster communication paths and the conditional stream global logic. Let's consider both in turn.

The first important communication path is the inter-cluster data switch. It turns out that by using a 2-D grid layout of the processing clusters, we can minimize the area impact of this switch [Khailany et al., 2003]. This is achieved by implementing the switch in two stages: in the first stage, each cluster puts a data value onto a vertical bus dedicated to it; in the second stage, each horizontal bus (there is one per cluster) selects which of the vertical buses to read from. This scheme keeps the C² area impact of the expensive switch points at each column-row intersection small, and allows good scaling at least up to 256 clusters (with 5 ALUs per cluster). Furthermore, the same study [Khailany et al., 2003] indicates that the inter-cluster data switch can be fully pipelined, so that the bandwidth of the switch is not sacrificed. Therefore, this hardware should not be a limiting factor, at least up to 256 clusters. The other important communication paths we need to consider are those that gather control signals from each cluster and deliver them to the global logic, and those that distribute output control signals and switch permutations from the global logic back to the clusters. Using the grid layout given in the above study, these do not significantly increase area either.

GEN CI STATE Scaling

We identified the conditional stream global logic as the other important piece of hardware to consider when scaling the number of clusters. The most complex pieces of logic are those for calculating the switch permutation for the GEN CX STATE operations. Let's look at the one for GEN CI STATE first. We basically need to calculate a running sum of all the clust req signals, where each intermediate output is added to start to produce a result that is lg C bits wide. A simplistic implementation of this operation would require serial execution of C additions, and hence O(C) time. However, we can use a parallel prefix scan-+ operation [Cormen et al., 1990] in order to implement this operation in O(lg C) time.
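For reference, the sketch below models the O(lg C) running sum with a Hillis–Steele inclusive scan. In hardware each round of the outer loop corresponds to one layer of adders; the software loop here is only an illustration, and the request bits are made up.

    #include <cstdio>
    #include <vector>

    // Hillis-Steele inclusive prefix scan over the clust_req bits.
    // lg C rounds; round d adds the value d positions to the left.
    int main() {
        std::vector<int> x = {1, 0, 1, 1, 0, 1, 1, 0};   // clust_req bits
        const int C = (int)x.size();

        for (int d = 1; d < C; d <<= 1) {      // lg C rounds
            std::vector<int> y = x;
            for (int i = d; i < C; ++i) y[i] = x[i] + x[i - d];
            x.swap(y);
        }
        // x[i] is the inclusive prefix sum; mynum(cl) = x[cl-1] (0 for cl = 0)
        for (int i = 0; i < C; ++i) std::printf("%d ", x[i]);
        std::printf("\n");                     // prints: 1 1 2 3 3 4 5 5
    }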

GEN CO STATE Scaling

The implementation of the permutation generation for GEN CO STATE is much more complicated, however. This is because each cluster needs to do a search, not just a deterministic calculation, as we can see from the pseudo-code in Table 5.4 for calculating perm in the GEN CO STATE operation. A reasonable approach for implementing this logic requires first using a parallel prefix scan-+ operation, just as before. For each cluster, this will produce the destination cluster (dest) that its data must be sent to. Unfortunately, the inter-cluster switches on stream processors to date have been built to be input-routed.[1] Thus, at this point, each cluster must instead figure out which other cluster is trying to write to it (src), hence the need for a search. This search can be implemented with relatively little hardware using a bisection search. This would only require adding another lg C-bit inter-cluster switch. Each cluster writes its local value of dest to the switch during each communication operation. Since dest is the result of a running sum, it is ordered across the clusters. Each cluster maintains its own search pointer, and does a bisection search to locate the cluster it should read from (i.e., the first other cluster whose value of dest points to the cluster doing the searching). A total of lg C communication operations would be required, however, and these could not be pipelined either, since they are dependent on each other. Thus the latency and throughput of this iterative method are poor, and hence we would like to find a more direct approach in hardware to speed things up. To this end, a naïve hardware-only implementation of the circuitry required to calculate src in each cluster is shown in Figure 5.5. The area of this wiring scales roughly as O(C⁴), making this logic the most expensive to scale of all the conditional stream hardware modules. Using the area models from [Khailany et al., 2003] again, we can estimate the impact of this logic on the stream processor. This impact is shown in Table 5.5 for several different machines, each with a different number of clusters. Up until 64 clusters, the area is not a prohibitive factor; beyond that, however, this logic starts to occupy a large fraction of the total area and is impractical. However, we can actually improve the implementation in order to generate a solution that scales as O(C² lg² C) instead of O(C⁴).
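The bisection search itself is just a binary search over the nondecreasing dest values. The sketch below models it in software; for clarity it ignores the mod-C wraparound that the real running sum introduces, and the request pattern is an illustrative assumption.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // For each bank, bisect (lg C probes) over the sorted dest values to
    // find the requester targeting it; wraparound is ignored for clarity.
    int main() {
        const int C = 8;
        // dest[j] = bank targeted by the j-th requester, in cluster order
        std::vector<int> requester = {0, 2, 3, 5, 6};   // requesting clusters
        std::vector<int> dest      = {2, 3, 4, 5, 6};   // start = 2, no wrap

        for (int bank = 0; bank < C; ++bank) {
            auto it = std::lower_bound(dest.begin(), dest.end(), bank);
            if (it != dest.end() && *it == bank) {
                int src = requester[it - dest.begin()];
                std::printf("bank %d reads from cluster %d\n", bank, src);
            }
        }
    }

The hardware versions discussed next replace this sequence of dependent probes with combinational comparison logic, trading gates and wires for the lg C round-trips over the switch.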

¹This is partially so that non-permutation traffic, such as multi-cast, can also be routed efficiently.
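A software rendering of this bisection search might look as follows (a sketch under two simplifying assumptions: dest is nondecreasing across the clusters within one access, i.e., there is no mod-C wraparound, and clusters with nothing to send hold a sentinel value of C, larger than any cluster index):

    #include <algorithm>
    #include <vector>

    // Each cluster 'me' bisects the ordered dest values to find the
    // cluster writing to it. Each probe of lower_bound corresponds to one
    // (lg C)-bit inter-cluster communication; lg C serially dependent
    // probes are needed, which is what makes this iterative method slow.
    int find_src(const std::vector<int>& dest, int me) {
        auto it = std::lower_bound(dest.begin(), dest.end(), me);
        if (it != dest.end() && *it == me)
            return static_cast<int>(it - dest.begin());
        return -1;  // no cluster writes to 'me' this access
    }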

(Figure: the destination value for the data from each cluster is decoded to one-hot form and fed to a C-to-lg C priority encoder per destination cluster, producing the source for Cluster 0 through Cluster C-1.)

Figure 5.5: Naïve GEN CO STATE implementation.

Table 5.5: Area impact of the naïve GEN CO STATE implementation. We used the area models from [Khailany et al., 2003]. Each cluster consists of five arithmetic ALUs. The total area includes the processing clusters, micro-controller, and SRF, as well as the GEN CO STATE logic.

    # Clusters (C)    Fraction of Total Area
    8                 1%
    16                3%
    32                8%
    64                20%
    128               43%
    256               73%

However, we can actually improve the implementation to generate a solution that scales as O(C² lg² C) instead of O(C⁴). This improved implementation is shown in Figure 5.6. Since each cluster only generates one destination for its data, we can use binary encoding instead of one-hot encoding. Also, instead of routing all the priority encoder inputs to the bottom, we distribute the encoding with a stage located at the source of every encoder input. Essentially, these two optimizations trade extra gates for reduced wiring area. We evaluated the wiring area for this new method using the same models as in Table 5.5, and found that the GEN CO STATE logic now occupies only 5% of the area of the 256-cluster machine, as opposed to 73% with the straightforward method. Unfortunately, the latency of this hardware grows as O(C), as shown. However, the latency can be improved to O(lg C), at the expense of adding even more extra gates, by doing the vertical searches in a tree fashion, as shown in Figure 5.7.² Ultimately, these two implementation optimizations could result in a hardware implementation of GEN CO STATE that is practical beyond configurations with 64 clusters. However, further study is required to quantify the exact area of this improved solution; we do not provide that analysis here.

²Technically, this optimization will increase the rate at which area grows to O(C² lg³ C).
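For reference, a behavioral model (ours) of what the comparator-and-encoder array of Figures 5.5 through 5.7 ultimately computes is simply the inverse of the dest mapping; the hardware evaluates all C² comparisons at once, and the nested loop below is only a modeling convenience:

    #include <cstddef>
    #include <vector>

    // Behavioral model of the GEN CO STATE search: every destination d
    // compares its own index against all C dest values and encodes the
    // index of the matching cluster, i.e., src is the inverse of dest.
    std::vector<int> gen_co_state_src(const std::vector<int>& dest) {
        const std::size_t C = dest.size();
        std::vector<int> src(C, -1);         // -1: nobody writes here
        for (std::size_t d = 0; d < C; ++d)  // one encoder per destination
            for (std::size_t c = 0; c < C; ++c)
                if (dest[c] == static_cast<int>(d))
                    src[d] = static_cast<int>(c);
        return src;
    }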

(Figure: each cluster's binary-encoded destination is compared against 0 through C-1 by a distributed chain of comparators and 2:1 muxes, producing the source for Cluster 0 through Cluster C-1.)

Figure 5.6: Improved GEN CO STATE implementation for large numbers of clusters.

(Figure: pairs of destination values are compared against the searching cluster's index and merged through a tree of 2:1 muxes, with the pattern repeated recursively across clusters C-1 down to 0.)

Figure 5.7: Logarithmic-time hardware search implementation for GEN CO STATE. The idea shown in this figure is applied recursively in order to achieve the log-time result. The logic shown only produces the result for one cluster; it would have to be duplicated C times for a complete solution.

Even Further Scaling

The above area analysis was done for 5 ALUs per cluster, where each ALU supports 32-bit floating-point operations. If the number of ALUs per cluster or the number of bits of precision per ALU decreases, or floating-point support is dropped, then the area of each cluster will decrease. For example, one could envision a cluster with only 2 ALUs that supports only 16-bit fixed-point operations, targeted perhaps at mobile low-power applications. Since the area of the conditional stream hardware would remain unchanged, its relative area overhead would be more significant, and we might want to search for ways to reduce it. Furthermore, for machines with this type of lean processing cluster, or for machines with more than 64 clusters, we can expect that an inter-cluster switch with limited connectivity would be used (i.e., one that is not a full crossbar). Since conditional streams need to route arbitrary permutations, their performance could degrade with limited-connectivity switches, which may take multiple cycles to route an arbitrary permutation. In these situations, we will probably want a scheme that uses conditional streams to load-balance among some set of neighboring clusters only. Of course, this would not result in perfect load-balancing, but it might be a good trade-off between performance and hardware efficiency.

5.5 Performance

As we mentioned at the beginning of the chapter, we developed the conditional stream hardware in order to reduce the overhead of accessing the task queue in the SRF. We should verify two things, then. First, is the speed of the software solution slow enough to warrant the extra hardware? A purely software implementation of conditional streams required 26 and 51 extra operations for a conditional input and conditional output access, respectively, for a single-word record on an eight-cluster machine (larger machines would require more operations). Clearly, executing this many extra operations every time we access a record in the SRF will significantly affect the performance of conditional streams. In contrast, the hardware-accelerated approach only requires three and two operations, respectively, for the same conditional stream accesses.

The second thing we need to verify is whether the hardware solution actually reduces the overhead enough to allow us to reap the performance benefits of any improvements in load-balance. This is covered in the rest of this section. We will show that conditional streams incur a small fixed overhead for use, but provide the desired immunity against load-imbalance between clusters.

5.5.1 If-statements

We can apply conditional streams to synthetic if in the same way as we did for the shapes example earlier in the chapter. Here, the load-imbalance arises because of statistical variation: even though the average fraction of elements that execute the body of the if-statement is p across all the clusters, not every cluster will have exactly (BatchSize/C) × p such elements in its batch. These statistical variations cause each cluster to have a differing amount of work, and conditional streams can even out the distribution of this work among the clusters. The results for synthetic if are shown in Figure 5.8, for three experiments. The difference between the experiments is how dominant the basic block BODY was in the kernel.
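Before turning to those results, the magnitude of this purely statistical imbalance can be estimated with a small Monte Carlo sketch (ours, not part of the original experiments; the parameters mirror the eight-cluster, 64-elements-per-cluster setup of Figure 5.8):

    #include <algorithm>
    #include <cstdio>
    #include <random>

    // Monte Carlo estimate of SIMD load-imbalance for an if-statement.
    // Each of C clusters processes 'batch' elements; each element takes
    // the body with probability p. Without balancing, a batch costs
    // max_c(n_c) body executions because the clusters run in lockstep;
    // with perfect balancing it costs ceil(total / C).
    int main() {
        const int C = 8, batch = 64, trials = 10000;
        const double p = 0.2;
        std::mt19937 rng(42);
        std::binomial_distribution<int> body_count(batch, p);
        double imbalanced = 0, balanced = 0;
        for (int t = 0; t < trials; ++t) {
            int max_n = 0, total = 0;
            for (int c = 0; c < C; ++c) {
                int n = body_count(rng);      // bodies taken on cluster c
                max_n = std::max(max_n, n);
                total += n;
            }
            imbalanced += max_n;
            balanced += (total + C - 1) / C;  // perfectly redistributed
        }
        std::printf("body iterations per batch: %.1f imbalanced, %.1f balanced\n",
                    imbalanced / trials, balanced / trials);
        return 0;
    }

At p = 0.2 the maximum-loaded cluster executes roughly a third more body iterations than the balanced average, which is precisely the gap that conditional streams recover.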

The 1× case is analogous to Wbody = 32 operations. We also tried the 10× and 100× cases, which correspond to Wbody = 320 and Wbody = 3,200, respectively. The last case is analogous to a large kernel, the bulk of which is nested within the if-statement. The reason we tried these different cases is that load-imbalance only affects the body kernel when using conditional routing; thus, the larger this kernel is relative to the other kernels, the larger the cost of load-imbalance, as given by Amdahl's Law. The 100× version is representative of the asymptotic case where the body of the if-statement completely dominates the runtime of the kernel. The graph shows two sets of speedups for each case: one for IDEAL, which assumes that we can achieve perfect load-balancing between the processing clusters, and one for the CONDITIONAL STREAMS technique. Each curve is a speedup over the plain (non-load-balanced) implementation of each kernel using conditional routing. For the 1× case, using conditional streams actually degrades performance slightly compared to the base case. This is because the body kernel accounts for at most one-half of the total execution time of the kernel.

(Figure: speedup versus fraction true, p; curves for IDEAL and CONDITIONAL STREAMS at 1×, 10×, and 100×.)

Figure 5.8: Conditional stream speedup for if-statements on an eight-cluster SIMD machine. The various curves differ in how large the body of the if-statement was (in terms of operations). In the legend, N× refers to a test where Wbody was N times larger than in the base 1× case. The IDEAL curves assume perfect load-balancing between the processing clusters. The batch size used was 64 elements per cluster. For the 1× case: Wbody = 32 operations; Winput = Woutput = 17 operations; S = 2 words.

Furthermore, when p = 0.0 and p = 1.0, conditional streams always degrade performance, because there is no load-imbalance in the input dataset. The slowdown due to conditional streams, though, is always limited to less than 8%, over all values of p and over all the experiments. However, as we increase the size of the basic block BODY, conditional streams start to benefit overall performance. This is especially true when we execute kernels with if-statements that are expensive but rarely taken (10× and 100× with small p). At the point of biggest disparity, conditional streams offer roughly a 50% speedup. The IDEAL curves show that conditional routing and conditional streams in concert can reduce the difference between actual kernel performance and ideal performance to within 30% for most values of p, and to within 65% in the worst case for the experiments shown.

5.5.2 While-loops

The results for the speedup that conditional streams provide over our base conditional routing techniques are shown in Table 5.6. As expected, the biggest benefit of conditional streams comes when the amount of variance in the data set is high; in fact, conditional streams provide roughly a 40%-50% speedup for the dataset with the highest variance. We should also expect that when we have little or no variance in the data set we would see a slowdown due to the overhead of inserting conditional streams. While we do see a small overhead of 4% for FLCRSUCS on the "Zero-Variance" dataset, we actually see a speedup for the same dataset with EXCRCS. This anomaly occurred because the compiler happened to schedule the main loop of the body kernel for the conditional streams version with fewer software-pipelining stages. Thus, the reduced loop-priming overhead results in a slight speedup even for the case with no variance in the dataset. The second set of data, shown in Figure 5.9, compares the conditional stream versions of the kernels against the ideal performance of the machine. Conditional streams combined with conditional routing achieve more than 67% of ideal performance (i.e., within 1.5× of the ideal execution time) over the entire range of datasets shown with the FLCRSUCS technique. The EXCRCS technique still pays a hefty penalty for executing a dataset with a small number of elements that require a large number of iterations, as we saw in Section 3.3.4.

Table 5.6: Conditional stream speedup for while-loops on an eight-cluster SIMD machine. The version of the kernel used is the same as in Figure 3.17(d), while the four probability distributions used are the same as in Figure 3.18.

    Speedup compared to the same method without conditional streams:

    Method                                            Zero-Variance  Uniform  Spread  Bimodal
    Expanded Conditional Routing with
      Conditional Streams (EXCRCS)                        1.01         1.10    1.20    1.42
    Flattened Conditional Routing with State
      Unrolling and Conditional Streams (FLCRSUCS)        0.96         1.04    1.11    1.47

Therefore, while it comes within 15% of the ideal performance of the machine (1.14× of the ideal execution time) when there is no variance in the dataset, it performs at less than 50% of ideal on the "Bimodal" dataset.

5.5.3 Impact of SRF Communication

As we noted in Section 3.2.3, the amount of live state that is transferred into and out of the basic blocks can have an impact on performance. A large amount of state implies a proportionally larger amount of extra SRF communication with conditional routing. With the addition of conditional streams, however, this communication impacts performance even more: each conditional SRF access requires an inter-cluster switch traversal. The switch bandwidth then becomes the limiting factor for the performance of kernels that have to transfer a lot of state to/from the SRF. Figure 5.10 shows how the schedule length of the conditional streams version of the input kernel from the synthetic if benchmark (Section 3.2.1) degrades faster than the plain conditional routing version as the amount of live state increases. As the number of operations in the input kernel increases, however, the knee in the curve will move to the right, since there will be more cycles of arithmetic to overlap with the communication operations.

(Figure: average execution time per stream element, normalized to IDEAL, for EXCRCS and FLCRSUCS on the Zero-Variance, Uniform, Spread, and Bimodal datasets.)

Figure 5.9: Conditional stream results, normalized to the execution time of IDEAL, are shown for the synthetic while kernel. EXCRCS = Expanded Conditional Routing with Conditional Streams; FLCRSUCS = Flattened Conditional Routing with State Unrolling and Conditional Streams.

(Figure: inner-loop schedule length versus words of live state, S, for the conditional routing and conditional streams versions.)

Figure 5.10: Impact of SRF communication on the conditional stream version of input, a kernel from the synthetic if benchmark introduced in Section 3.2.1. The input kernel requires two conditional stream accesses per stream element: one for the TRUE stream and the other for the FALSE stream.

Also, recall that the kernel that implements the body of the if-statement does not use conditional streams, and thus will not suffer this drastic reduction in performance as the amount of live state increases. Hence, if the majority of the execution time is spent in the body kernel, then the impact of live state will be minimized. For clarity, we can codify our observations above in the following expression for the performance impact of conditional streams on synthetic if (i represents execution time in cycles, SU represents speedup, and SD represents slowdown):

$$\text{Speedup} = SU = \frac{i - i_{cs}}{i}$$

$$i = i_{in} + i_{body} + i_{out}$$

$$i_{cs} = \frac{i_{body}}{SU_{body}} + \left( SD_{in} \, i_{in} + SD_{out} \, i_{out} \right)$$

In this expression, $i_{cs}$ is the execution time of the version with conditional streams. The value of $i_{cs}$ depends on the speedup or slowdown of each of the three constituent kernels: input, body, and output. We expect the execution time of the body kernel to decrease because of improved load-balance, and to increase for the input and output kernels because of increased inter-cluster communication. So, if the body is much larger than the other two kernels, the improved load-balance will probably outweigh scheduling inefficiencies due to inter-cluster communication. However, if there is a lot of internal live state and $i_{body}$ is not very large, then the slowdowns of the other blocks may negate any improvements due to better load-balance.
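For instance, plugging in illustrative numbers (ours, not measured values): if $i_{in} = i_{out} = 100$ cycles, $i_{body} = 800$ cycles, conditional streams speed the body up by $SU_{body} = 1.5$, and they slow the input and output kernels down by $SD_{in} = SD_{out} = 1.2$, then $i = 1000$, $i_{cs} = 800/1.5 + 1.2 \times 200 \approx 773$, and $SU = (1000 - 773)/1000 \approx 0.23$, a 23% speedup. Shrinking $i_{body}$ to 200 cycles under the same assumptions gives $i_{cs} \approx 373$ against $i = 400$, only a 7% gain, which illustrates how quickly a small body kernel erodes the benefit.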

5.6 Other Uses of Conditional Streams

We have shown so far that the conditional stream hardware mechanism is useful for load-balancing work when using conditional routing techniques. An observation we can make at this point is that, in general, conditional streams provide the abstraction of a single stream that all the clusters are accessing: a global queue that, under the hood, happens to be distributed across the SRF banks. Hence, the utility of the hardware implementation is not limited to conditionals. It can also accelerate any program that needs the abstraction of a single stream accessed by all the clusters (as opposed to C separate

sets of data, each accessed independently by one cluster). Let's look at some examples. We can improve the process of handling reduction outputs: when all the clusters work together to produce one output per iteration, extra software normally has to ensure that the outputs are properly interleaved across the SRF banks. Conditional streams can be used instead, avoiding this extra control code. Another example comes from a depth extractor application. In this application, a kernel compares two images at several different horizontal shifts (disparities); the kernel is called repeatedly with a different horizontal shift between the two rows. Assuming the data for each row is interleaved across the SRF banks, control code has to be inserted to shift the two rows relative to each other, calculating the permutations that map the correct SRF bank to each cluster for the current disparity. Conditional streams handle this shift elegantly, requiring only a loop that drains a specified number of pixels from one of the streams before starting the main loop. Merging two sorted streams into one longer stream is another good example. In one implementation, all clusters work together to output C elements every loop iteration, of which C1 elements come from the first stream and C2 from the second; C1 and C2 are data-dependent. In order to read more values for the next iteration, therefore, a complicated function has to be solved to generate the SRF-bank-to-cluster mappings that refill data from both input streams. The conditional stream hardware handles this operation efficiently, and again avoids extra software to calculate the mappings every iteration. For example, the main-loop schedule length for the fully software implementation of a merge sort kernel was 25 cycles, whereas the same loop with hardware conditional stream primitives was only 14 cycles. This translates to a roughly 80% overhead for the software version.
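As an illustration of why the refill mapping is data-dependent, the following sketch (ours) models one iteration of such a parallel merge:

    #include <cstddef>
    #include <vector>

    // One loop iteration of a two-way merge on C clusters: together they
    // emit the C smallest remaining heads of the two sorted streams.
    // C1 elements come from 'a' and C2 = C - C1 from 'b', where C1 is
    // determined entirely by the data values; hence the SRF-bank-to-
    // cluster refill mapping changes every iteration.
    void merge_step(const std::vector<int>& a, std::size_t& ia,
                    const std::vector<int>& b, std::size_t& ib,
                    std::vector<int>& out, std::size_t C) {
        for (std::size_t k = 0; k < C; ++k) {
            bool a_avail = ia < a.size(), b_avail = ib < b.size();
            if (!a_avail && !b_avail) break;  // both streams drained
            bool take_a = a_avail && (!b_avail || a[ia] <= b[ib]);
            out.push_back(take_a ? a[ia++] : b[ib++]);
        }
    }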

5.7 RENDER Performance

Let's look now at the performance of an entire application (RENDER). Applying conditional streams to the geom rast kernel actually alleviates two different instances of load-imbalance in this application. Firstly, by using conditional streams with conditional routing, as discussed in this chapter, we can distribute the work evenly across the clusters for

geom rast. Secondly, by using a conditional stream to store the outputs of the geom rast kernel to the SRF banks, we can ensure that kernels that come after geom rast will have a load-balanced input set to work on. Let's look at the impact of each instance of load-balancing in turn. The results in the first column of Table 5.7 show the improvement of the geom rast kernel due to conditional streams, where each row combines a different conditional technique with conditional streams. It turns out conditional streams do not help ORIGINAL or COMMON. This is because all clusters always read another input at exactly the same time, due to SIMD; a conditional input stream does not need to redistribute any elements, because a cluster cannot go off on its own and start a new element without waiting for the rest. The performance of PREDICATION does not improve either, even though we used loop-flattening (loop-flattening allows a cluster to start processing a new element before the other clusters have finished). Unfortunately, early in the loop there is an update of some loop-carried state which is needed by the next stream element; therefore each cluster can only process elements from within its local SRF bank. CONDITIONAL ROUTING, however, jettisons the portion of code with the loop-carried dependency into a separate kernel before we start to optimize the while-loop in geom rast. Therefore we can load-balance the input to the while-loop, as well as the work for the different if-statements in geom rast. We achieve a 10% speedup on geom rast when we use conditional streams with CONDITIONAL ROUTING. The second source of load-imbalance we mentioned arises because the geom rast kernel produces a potentially different number of output elements for each input element. Since each cluster can produce a different number of outputs into its local SRF bank, this will cause all the kernels downstream to suffer load-imbalance, even if these later kernels do not contain any conditionals themselves. Therefore, by writing the outputs of geom rast to the SRF using a conditional stream, we can ensure that the load will be distributed evenly for the remainder of the kernels in the RENDER application. Every conditional technique took advantage of this, and there was a 1.1×-1.4× speedup in the overall application from using conditional streams. We should note that the improvement on RENDER is highest when using CONDITIONAL ROUTING, partly because CONDITIONAL ROUTING was the only method that increased the performance of the geom rast kernel itself.

Table 5.7: Impact of conditional streams on the performance of the geom rast kernel and the entire RENDER application on a stream processor with SIMD clusters.

                                           Conditional Streams     Normalized
                                                Speedup            Comparison
    Method                 Machine type   geom rast   RENDER   geom rast   RENDER
    ORIGINAL               SIMD              1.0        1.1       1.0        1.0
    COMMON                 SIMD              1.0        1.2       2.4        2.1
    PREDICATION            SIMD              1.0        1.1       1.3        1.2
    CONDITIONAL ROUTING    SIMD              1.1        1.4       4.2        3.1
    COMMON and
      CONDITIONAL ROUTING  MIMD               —          —        4.6        3.3

The MIMD results are shown in the last row of the table. Recall that MIMD was not able to provide any performance advantage in the last chapter, when we looked at the RENDER application without conditional streams. However, the situation is slightly different when we add conditional streams. The CONDITIONAL ROUTING implementation incurs a large overhead for inter-cluster communication when using conditional streams. For this reason, combining it with the COMMON technique can help on a MIMD machine, especially for implementing the if-statements that have a smaller operations-to-communication ratio. However, even so, MIMD only provides a 10% performance advantage on this application.³ An interesting future extension of this work would be to see how the actual performance per unit area compares between the SIMD and MIMD machines.

³A MIMD implementation of conditional streams must be able to handle requests to multiple conditional streams on the same cycle (from different clusters). On any real MIMD implementation, only one conditional access would be handled at a time, causing a stall on the clusters requesting access to other streams. However, our simulation infrastructure ignored this and assumed multiple conditional requests could be handled each cycle. This unfairly improves the MIMD result, but we estimate the impact of this simulation inaccuracy on final performance to be small.

5.8 Summary

A conditional stream is a mechanism that provides the abstraction of a single global queue, despite the fact that the SRF is actually banked along with the clusters. Conditional streams can be used in conjunction with the conditional routing technique in order to provide immunity to the load-imbalance in an input data stream, and to generate output data streams in a way that eliminates any further load-imbalance downstream in the application. When faced with the decision of whether to employ conditional streams, we must first make sure that any benefits gained by improved load-balance between the clusters are not offset by overheads, both from executing conditional stream control operations (such as GEN CX STATE) and from poorer schedules that result from limited inter-cluster switch bandwidth. For the if-statement and while-loop benchmarks, there was less than 8% overhead for using conditional streams when there was no load-imbalance in the input dataset. On the flip side, conditional streams achieved up to a 1.5× speedup for the worst-case situations in our experiments. We were able to keep the overhead so low because we used a hardware mechanism to implement our scheme. We presented the hardware implementation details, and noted that some of the logic can get expensive when the number of processing clusters in the machine gets large. Finally, besides being useful for implementing conditional routing, the conditional stream functionality is useful in a variety of other situations, such as merging two streams. Since the conditional stream mechanism is implemented in hardware, we can eliminate control logic that would otherwise dilute the kernel code in these situations.

Chapter 6

Conclusions

Stream processors offer a compelling combination of area- and power-efficiency along with programmability. They excel on a wide range of media processing applications, which tend to be quite regular. However, applications that contain conditionals are challenging to execute on stream processors (and on other explicitly parallel processors, such as SIMD, vector, and VLIW machines, as well). We make the observation, however, that conditionals within kernel main loops do not necessarily break the data-parallel paradigm; that is, we can still operate on many stream elements in parallel. However, conditionals do require different calculations for each element, so each processing cluster may not execute the exact same code for each element. Our goal in this dissertation was to develop techniques to efficiently execute kernels with conditionals, without sacrificing the efficiency of the hardware. To this end, we wanted to avoid adding dynamic scheduling hardware to improve ILP, and we wanted to avoid providing a separate micro-controller for each processing cluster (MIMD) to improve data-parallelism. Before we summarize our contributions, let's review how bad the problem was to start with. Let's look at the geom rast kernel one more time on two machines: 1) a single-cluster stream processor, with only one ALU in the cluster, which is not pipelined; and 2) an eight-cluster SIMD stream processor with five ALUs per cluster that are pipelined four cycles deep. We made the observation that the geom rast kernel can be compiled and executed on the first machine with almost perfect efficiency. However, without using any of the techniques we introduced, the kernel executes at most 18× as fast on the larger stream processor.


Compare this to the potential speedup:

$$\frac{\text{ops/cycle of large machine}}{\text{ops/cycle of small machine}} = \frac{\left(1\ \tfrac{\text{op/cycle}}{\text{ALU}}\right)\left(5\ \tfrac{\text{ALUs}}{\text{cluster}}\right)\left(8\ \text{clusters}\right)}{\tfrac{1}{4}\ \text{op/cycle}} = 160\times$$

We are still roughly a factor of nine below the peak possible speedup. Our new techniques, conditional routing and conditional streams, are able to increase the speedup to 31×, which is roughly a factor of five below the peak possible speedup. So while our techniques improved upon existing techniques by 1.9×, there is still a lot of room for improvement. The following discussion will review our specific contributions and identify potentially fruitful opportunities to increase efficiency even further via future research.

6.1 Conditional Routing

Conditional routing replaces conditional control-flow with communication. This type of data-steering creates streams of elements that all require similar processing, and hence each stream can be processed quite efficiently. Conditional routing is thus quite amenable to software-pipelining to improve VLIW schedules, and to execution on hardware-efficient SIMD machines. We did observe, however, that when the size of the data records starts to get large, the required SRF communication starts to slow the kernel down. Furthermore, we noted that certain optimizations reorder the output of a kernel with a while-loop. In this regard, we found that the addition of an indexable SRF was the best solution for restoring order, if necessary. Because of these drawbacks, we found that conditional routing was not always the best technique to use, but rather that it offered a large performance improvement when the situation was amenable. Therefore, in addition to presenting raw performance, we analyzed a series of parameter studies for if-statements, in order to understand when we should choose one technique over another. For example, PREDICATION cannot be beaten when the conditional is small and frequently executed. Handling complicated conditionals, such as if-statements with many else-if clauses or with many levels of nesting, could be done by flattening the whole conditional construct

(Section 3.2.2), and then choosing the best technique for each resulting top-level clause. However, this may not be the optimal algorithm, since we are ignoring some global information contained in the original structure of the if-statements. Therefore, an interesting area for further research would be to develop algorithms for finding the best combinations of techniques to apply to complex conditional constructs.

6.2 Conditional Streams

Conditional streams attacked the problem of balancing load across all the clusters of a stream processor. We proposed a hardware mechanism for accessing the SRF that evenly distributes elements between the various SRF banks and the clusters. While conditional streams offered good performance when record sizes were small, larger record sizes can reduce their efficiency. This is the main reason why adding conditional streams improved the performance of the geom rast kernel by only 10%. In order to address this, we recommend further work on methods of implementing conditional streams that do not require every record read from or written to the SRF to be transferred across the global inter-cluster switch. For example, we might consider incorporating some of the ideas from existing load-balancing techniques that employ distributed task-queues. A distributed task-queue is set up so that processing clusters can make as many references as possible to the local SRF bank; as imbalance starts to appear, only the communication necessary to remedy the imbalance is incurred. Keckler classifies distributed task-queue algorithms into two types: those that load-balance via offloading and those that load-balance via work stealing [Keckler, 1994]. Offloading schemes require two phases: a computation phase and a communication phase. Processors operate on data in their local queue during the computation phase, and the communication phase is triggered on a node based on how many tasks are currently in its local task queue. A good comparison of several of these techniques is provided in [Eager et al., 1986]. In work-stealing systems, only processors with empty local task queues participate in load-balancing, which they do by requesting work from other processors; an example of this type of scheme is lazy task creation [Mohr et al., 1991]. We suggest exploring how to combine some of these methods

with conditional streams, in order to reap the benefits of improved load-balancing even when record sizes are large.

6.3 A Compiler Framework

While conditional routing and conditional streams are tremendously useful in their current forms, we have shown that there is still much room for improvement. Some of the suggestions above can help in this regard. However, we feel that probably the most important piece of technology required to proceed will be a framework for implementing these conditional techniques in an automatic compiler tool. This would accelerate the process of converting a greater number of irregular applications into the stream model. Currently, the choice of conditional technique has to be made by the programmer, and switching techniques involves manual effort to change kernel and stream code. Even if the compiler could not analytically determine the best technique, it would at least expedite the process of testing various techniques to experimentally determine which one is best. A useful compiler framework could use a variety of inputs, from programmer hints to profiling information. Whether a particular stream must be ordered or not, for example, could be indicated by a keyword in the stream language. Many of the results presented in this dissertation will contribute directly to the effort of developing a compiler framework for handling kernel conditionals. Specific parameter studies, such as the effect of basic-block size and stream length, identify the relative importance that each characteristic should play in compiler heuristics. This dissertation also outlines the available options for different situations that might arise in real programs, for example how to sort the output streams of a while-loop kernel, or how to compile more complex conditional statements. One interesting piece of technology that must be developed for the compiler framework to be successful is the ability to split kernels. This is required to implement the conditional routing optimizations. The algorithm for splitting kernels must identify which internal live state goes to the different clauses of the conditional, and must route that state to and from the corresponding streams. Interestingly, a tool that could split kernels would be useful for more than just applying conditional routing. For instance, kernels must be split when a data structure is randomly accessed and that data structure resides in off-chip DRAM because

it cannot fit in the on-chip memories (SRF or scratch-pads). Furthermore, splitting a kernel might be the best solution for implementing a single large kernel that contains too much state to allocate within the local register files (LRFs).

Appendix A

The IMAGINE Stream Processor

Throughout this thesis, we have evaluated our conditional routing results on a prototypical stream processor. This is not an idealized model (realistic latencies were used for performance), but at the same time the particular combination of architectural parameters we used has never actually been implemented. This appendix presents some details on a stream processor that has been implemented in silicon (IMAGINE), in order to provide some concrete numbers. Furthermore, the appendix will explore how the conditional streams hardware was implemented on IMAGINE, and will present the set of changes and compromises that were made to implement conditional streams on a real machine. The architecture shown in Figure A.1 contains all the basic stream hardware of the canonical stream processor we introduced in Section 2.2, which consisted of a stream register file (SRF) and stream clients. The stream clients on IMAGINE include a kernel execution unit (KEU) and a DRAM interface, as well as a network interface. The KEU consists of eight arithmetic VLIW clusters, each of which contains six arithmetic ALUs (three adders, two multipliers, and one divide/square-root unit). In order to simplify the design and implementation of the IMAGINE architecture, the functionality of the application processor has been split into two portions. The on-chip stream controller only handles the sequencing of operations to the stream clients. An external processor executes the application-level code and hands off any operations destined for stream clients to the stream controller through the host interface.


(Figure: block diagram and floor-plan showing the host processor and host interface, stream controller, micro-controller, network interface to other Imagine nodes and I/O, the streaming memory system with SDRAM, the stream register file, and ALU clusters 0 through 7.)

Figure A.1: The Imagine Processor architecture block diagram and floor-plan.

This processor supports 48 ALUs on a single die to provide a high peak performance. At a frequency of 180 MHz, IMAGINE delivers 7.4 GFLOPS or 14.4 16-bit GOPS. On a range of media processing applications, the processor sustains roughly half of this peak performance [Kapasi et al., 2003]. Furthermore, when supplied with a lower voltage and clocked at a lower frequency (1.2 V, 96 MHz), the processor achieves a peak of 2.4 GFLOPS/W or 4.7 16-bit GOPS/W.

A.1 Conditional Streams Implementation on Imagine

Implementing conditional streams on the IMAGINE architecture presented some further challenges, since we wanted to meet two additional constraints:

1. In order to reduce complexity, the SRF control on the IMAGINE architecture is not as flexible as that of the target SIMD architecture we have been evaluating throughout the dissertation so far. In particular, the clusters on IMAGINE are not capable of accessing their individual SRF banks independently, and can only push or pop stream elements together. Thus, even though only one cluster may request an element, all eight SRF banks will have to pop an element from the stream in order to service that request.

2. We wanted to minimize the impact of conditional streams on the eventual area of the VLSI implementation of the architecture.

In order to satisfy the first constraint, we had to implement a buffer outside of the SRF. For a conditional input stream, if fewer than eight clusters initially requested an input element, then the remainder would be stored in the buffer (C = 8 on IMAGINE). On the next access, data would first be supplied from this buffer, and only if there were not enough elements in the buffer would more elements be read from the stream buffers. Again, any remaining elements would be kept in the buffer for future accesses. The use of a buffer to do this is shown in Figure A.2. In this implementation, there must be room for two words per record element of the stream datatype. The figure shows a sequence of three conditional stream accesses, where the stream buffers are only accessed for two of the three accesses. This is because there is sufficient data in the buffers to completely

satisfy the second conditional stream access. The end-of-stream logic was also modified to take into account any elements that might remain in the buffers at the end of the kernel. As for output streams, if fewer than eight clusters initially wanted to output data, then these data elements would be stored in the buffer with no stream buffer transfer. On the next access, again, elements would first be stored to the buffers, and only once more than eight elements have accumulated will a stream buffer transfer occur. Furthermore, IMAGINE did not have the ability to track stream lengths that were not a multiple of C. Thus, at the end of a conditional output stream, NULL data elements had to be used to fill any extra stream slots required to bring the stream length up to the next multiple of eight. Kernels downstream in the application have to specifically case out on these NULL elements in software to make sure they are not mistaken for real data.

The second constraint, minimizing the impact on area, was met by reusing many modules already present in the IMAGINE architecture for other purposes. Thus, the inter-cluster communication switch that kernels use to swap data between clusters served double duty as the data switch for conditional streams. Furthermore, the extra buffers required for conditional streams, as discussed just above, were implemented within the scratch-pad register files in each cluster. These existing modules, named the COMM unit and SCRATCHPAD unit, are controlled independently by each cluster on the IMAGINE architecture. Thus, in order to simplify interfacing with these modules, we factored the conditional streams global logic into eight identical parts, one for each cluster. This logic was implemented as two ALUs in each cluster, the JUKEBOX and VALID units. These two units control the COMM and SCRATCHPAD units in each cluster in order to do the necessary buffering and data transfers for conditional streams. Of course, in order to distribute the conditional stream logic efficiently, some of the computation has to be duplicated on each cluster. Nonetheless, these ALUs occupied less than 5% of the area of a cluster on the IMAGINE prototype chip, and less than 1% of the area of the entire processor. Further details and illustrations of the implementation of conditional streams on IMAGINE can be found in [Kapasi et al., 2000; Khailany, 2003].
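The input-side behavior just described can be summarized in a short behavioral sketch (ours; the real control lives in the JUKEBOX and VALID units described above, and the class and member names here are invented for illustration):

    #include <cstddef>
    #include <deque>
    #include <utility>
    #include <vector>

    // Behavioral model of Imagine's conditional input stream buffering.
    // All C SRF banks can only pop together, so a request by fewer than
    // C clusters is served from a small buffer first; the stream buffers
    // are popped (C words at once) only when the buffer runs short.
    struct CondInputStream {
        std::vector<int> stream;   // backing stream data in the SRF
        std::size_t pos = 0;       // next unread stream element
        std::deque<int> buffer;    // leftovers (never holds 2C or more)
        std::size_t C;             // number of clusters / SRF banks

        CondInputStream(std::vector<int> s, std::size_t clusters)
            : stream(std::move(s)), C(clusters) {}

        // Pop one element for each of 'requests' clusters (requests <= C).
        std::vector<int> access(std::size_t requests) {
            if (buffer.size() < requests && pos < stream.size())
                for (std::size_t i = 0; i < C && pos < stream.size(); ++i)
                    buffer.push_back(stream[pos++]);  // one full-width pop
            std::vector<int> served;
            for (std::size_t i = 0; i < requests && !buffer.empty(); ++i) {
                served.push_back(buffer.front());
                buffer.pop_front();
            }
            return served;
        }
    };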

(Figure: a sequence of three conditional input stream accesses, each showing the SRF banks, the buffer, the data switch, the per-cluster case values, and the resulting permutation.)

Figure A.2: Use of a buffer for conditional streams with Imagine-style stream buffers. The stream buffers are only read for the first and last conditional input stream operations.

Notice that conditional routing cannot be used on IMAGINE without the conditional stream mechanism. Conditional routing requires that all clusters be able to fill and drain their SRF banks independently. However, each cluster cannot independently access its own SRF bank on IMAGINE. Conditional streams solve this problem by buffering data and balancing the SRF accesses across the banks. Thus, the benefits of conditional routing we have discussed for SIMD machines in Chapter 4 would not be applicable to a stream processor with IMAGINE-style limited SRF control. This increases the importance of conditional streams on an IMAGINE-style architecture, because they are necessary not only to achieve good load-balancing, but also to reap the benefits that conditional routing offers.

Bibliography

[Agarwala et al., 2002] S. Agarwala, P. Koeppen, T. Anderson, A. Hill, M. Ales, R. Damodaran, L. Nardini, P. Wiley, S. Mullinnix, J. Leach, A. Lell, M. Gill, J. Golston, D. Hoyle, A. Rajagopal, A. Chachad, M. Agarwala, R. Castille, N. Common, J. Apostol, H. Mahmood, M. Krishnan, D. Bui, Q. An, P. Groves, L. Nguyen, N. Nagaraj, and R. Simar. A 600 MHz VLIW DSP. In 2002 International Solid-State Circuits Conference Digest of Technical Papers, pages 56-57, 2002.

[Ahn et al., 2003] Jung Ho Ahn, William J. Dally, Brucek Khailany, Ujval J. Kapasi, and Abhishek Das. Evaluating the Imagine stream architecture. Submitted to the 31st Annual International Symposium on Computer Architecture, November 2003.

[Allen et al., 1983] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings 10th Annual Symposium on Principles of Programming Languages, pages 177–189, January 1983.

[Cormen et al., 1990] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms, pages 665–667, 695–697. MIT Electrical Engineering and Computer Science Series. MIT Press, 1990.

[Dehnert and Bratt, 1989] James C. Dehnert, Peter Y.-T. Hsu, and Joseph P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-38, Boston, MA, April 1989. ACM Press.

[Dulong, 1998] Carole Dulong. The IA-64 architecture at work. IEEE Computer, 31(7):24–32, July 1998.


[Eager et al., 1986] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 12(5):662-675, May 1986.

[Ebcioglu and Nakatani, 1990] K. Ebcioglu and T. Nakatani. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In D. Gelernter, A. Nicolau, and D. Padua, editors, Research Monographs in Parallel and Distributed Computing, pages 213-229. MIT Press, 1990.

[Eden and Mudge, 1998] A. N. Eden and T. Mudge. The YAGS branch prediction scheme. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 69-77, Dallas, TX, November 1998. IEEE Computer Society Press.

[Ellis, 1986] John R. Ellis. Bulldog: A Compiler for VLIW Architectures. ACM Doctoral Dissertation Award, 1985. The MIT Press, 1986.

[Fang et al., 1990] Zhixi Fang, Peiyi Tang, Pen-Chung Yew, and Chuan-Qi Zhu. Dynamic processor self-scheduling for general parallel nested loops. IEEE Transactions on Computers, 39(7):919-929, July 1990.

[Fisher, 1981] Joseph A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478-489, July 1981.

[Hwu et al., 1993] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 7:229-248, January 1993.

[Jayasena et al., 2004] Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, and William J. Dally. Stream register files with indexed access. In To appear in Proceedings of the 10th International Symposium on High Performance Computer Architecture, February 2004.

[Kanade et al., 1996] Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano, and Masaya Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In Proceedings of the 15th Computer Vision and Pattern Recognition Conference, pages 196-202, June 1996.

[Kapasi et al., 2000] Ujval J. Kapasi, William J. Dally, Scott Rixner, Peter R. Mattson, John D. Owens, and Brucek Khailany. Efficient conditional operations for data-parallel architectures. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 159-170, Monterey, CA, December 2000. ACM Press.

[Kapasi et al., 2002a] Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany. The Imagine stream processor. In Proceedings 2002 IEEE International Conference on Computer Design, pages 282-288, Freiburg, Germany, September 2002.

[Kapasi et al., 2002b] Ujval J. Kapasi, Peter Mattson, William J. Dally, John D. Owens, and Brian Towles. Stream scheduling. Concurrent VLSI Architecture Tech Report 122, Computer Systems Laboratory, Stanford University, Stanford, CA, March 2002.

[Kapasi et al., 2003] Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany, Jung Ho Ahn, Peter Mattson, and John D. Owens. Programmable stream processors. IEEE Computer, pages 54–62, August 2003.

[Keckler, 1994] Stephen W. Keckler. The importance of locality in scheduling and load balancing for multiprocessors. Concurrent VLSI Architecture Memo 61, MIT, February 1994.

[Khailany et al., 2001] Brucek Khailany, William J. Dally, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, Andrew Chang, and Scott Rixner. Imagine: Media processing with streams. IEEE Micro, pages 35-46, March/April 2001.

[Khailany et al., 2003] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens, and Brian Towles. Exploring the VLSI scalability of stream processors. In Proceedings of the Ninth Symposium on High Performance Computer Architecture, pages 153-164, Anaheim, CA, February 2003.

[Khailany, 2003] Brucek Khailany. The VLSI Implementation and Evaluation of Area- and Energy-Efficient Streaming Media Processors. PhD thesis, Stanford University, Stanford, CA, June 2003.

[Lam, 1988] Monica S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988. ACM.

[Lowney et al., 1993] P. Geoffrey Lowney, Stefan M. Freudenberger, Thomas J. Karzes, W. D. Lichtenstein, Robert P. Nix, John S. O'Donnell, and John C. Ruttenberg. The Multiflow trace scheduling compiler. Journal of Supercomputing, 7(1-2):51-142, May 1993. Special Issue on Instruction-Level Parallelism.

[Mahlke et al., 1992] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 45-54. IEEE Computer Society Press, December 1992.

[Mattson et al., 2000] Peter Mattson, William J. Dally, Scott Rixner, Ujval J. Kapasi, and John D. Owens. Communication scheduling. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 82-92, Cambridge, MA, November 2000.

[Mattson, 2001] Peter Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, Stanford, CA, 2001.

[McFarling and Hennessy, 1986] Scott McFarling and John Hennessy. Reducing the cost of branches. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 396–403, Tokyo, Japan, June 1986. IEEE Computer Society Press.

[Mohr et al., 1991] Eric Mohr, David A. Kranz, and Robert H. Halstead Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.

[Montrym and Moreton, 2002] John Montrym and Henry Moreton. Nvidia GeForce4. In Hotchips 14, August 2002.

[Naffziger et al., 2002] Samuel D. Naffziger, Glenn Colon-Bonet, Timothy Fischer, Reid Riedlinger, Thomas J. Sullivan, and Tom Grutkowski. The implementation of the Itanium 2 microprocessor. IEEE Journal of Solid-State Circuits, 37(11):1448–1460, November 2002.

[Owens et al., 2000] John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, and Ben Mowery. Polygon rendering on a stream architecture. In Proceedings of the 2000 SIGGRAPH / Eurographics Workshop on Graphics Hardware, pages 23–32, August 2000.

[Owens et al., 2002] John D. Owens, Brucek Khailany, Brian Towles, and William J. Dally. Comparing Reyes and OpenGL on a stream architecture. In 2002 SIGGRAPH / Eurographics Workshop on Graphics Hardware, pages 47–56, September 2002.

[Owens, 2002] John D. Owens. Computer Graphics on a Stream Architecture. PhD thesis, Stanford University, November 2002.

[Polychronopoulos and Kuck, 1987] Constantine D. Polychronopoulos and David J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, 36(12):1425-1439, December 1987.

[Rai, 2003] Jathin Sanoor Rai. A feasibility study on the application of stream architectures for packet processing applications. Master's thesis, North Carolina State University, Raleigh, NC, 2003.

[Rajagopal et al., 2002] Sridhar Rajagopal, Scott Rixner, and Joseph R. Cavallaro. A programmable baseband processor design for software defined radios. In Proceedings of the 45th IEEE International Midwest Symposium on Circuits and Systems, volume 3, pages 413-416, Tulsa, OK, August 2002. Invited Paper.

[Rajagopal, 2004] Sridhar Rajagopal. Scalable Wireless Application-Specific Processors (SWAPs) For Emerging Wireless Systems. PhD thesis, Rice University, Houston, TX, May 2004. (Expected).

[Rau et al., 1989] B. Ramakrishna Rau, David W. L. Yen, Wei Yen, and Ross A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12–35, January 1989.

[Rixner et al., 1998] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter R. Mattson, John D. Owens, Brucek Khailany, and Abelardo Lopez-Lagunas. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 3-13. IEEE Computer Society Press, 1998.

[Rixner et al., 2000] Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, and John D. Owens. Register organization for media processing. In Proceedings of the Sixth International Symposium on High Performance Computer Architecture, pages 375-387, January 2000.

[Rixner, 2001] Scott Rixner. Stream Processor Architecture. Kluwer Academic Publishers, Boston, MA, 2001.

[Russell, 1978] Richard M. Russell. The Cray-1 Computer System. Communications of the ACM, 21(1):63–72, January 1978.

[Seznec et al., 2002] André Seznec, Stephen Felix, Venkata Krishnan, and Yiannakis Sazeides. Design tradeoffs for the Alpha EV8 conditional branch predictor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 295-306, Anchorage, AK, May 2002. IEEE Computer Society Press.

[Sherryl and Pappas, 1990] Sherryl Tomboulian and Matthew Pappas. Indirect addressing and load balancing for faster solution to the Mandelbrot set on SIMD architectures. In Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, pages 443-450, College Park, MD, October 1990.

[Slotnick et al., 1962] Daniel L. Slotnick, W. Carl Borck, and Robert C. McReynolds. The Solomon computer. In Proceedings of the Fall 1962 Eastern Joint AFIPS Computer Conference, volume 22, pages 97-107, Philadelphia, PA, December 1962.

[Smith, 1981] James E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual Symposium on Computer Architecture, pages 135–148, Minneapolis, MN, May 1981. IEEE Computer Society Press.

[Stoodley and Lee, 1996] Mark G. Stoodley and Corinna G. Lee. Software pipelining loops with conditional branches. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, pages 262-273, Paris, France, December 1996. IEEE Computer Society Press.

[Tang and Yew, 1986] P. Tang and P.-C. Yew. Processor self-scheduling for multiple-nested parallel loops. In Proceedings of the 1986 International Conference on Parallel Processing, pages 528-535, August 1986.

[TI, 2001] Texas Instruments. TMS320C6713, Floating-Point Digital Signal Processors, December 2001. Report SPRS186E, Revised July 2003.

[Towle, 1976] R. A. Towle. Control and Data Dependence for Program Transformations. PhD thesis, University of Illinois, Urbana-Champaign, 1976. Technical Report R-76-788, Dept. of Computer Science.

[von Hanxleden and Kennedy, 1992] Reinhard von Hanxleden and Ken Kennedy. Relaxing SIMD control flow constraints using loop transformations. In SIGPLAN Conference on Programming Language Design and Implementation, pages 188-199, 1992.

[Warter et al., 1992] Nancy J. Warter, Grant E. Haab, Krishna Subramanian, and John W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170-179, Portland, OR, December 1992. IEEE Computer Society Press.

[Warter-Perez and Partamnian, 1995] Nancy J. Warter-Perez and Noubar Partamnian. Modulo scheduling with multiple initiation intervals. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 111-118, Ann Arbor, MI, November 1995. IEEE Computer Society Press.

[Yeh and Patt, 1991] Tse-Yu Yeh and Yale N. Patt. Two-level adaptive training branch prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 51-61. ACM Press, November 1991.