
PIPELINED MULTITHREADING

TRANSFORMATIONS AND SUPPORT

MECHANISMS

RAM RANGAN

A DISSERTATION

PRESENTED TO THE FACULTY

OF PRINCETON UNIVERSITY

IN CANDIDACY FOR THE DEGREE

OF DOCTOR OF PHILOSOPHY

RECOMMENDED FOR ACCEPTANCE

BY THE DEPARTMENT OF

COMPUTER SCIENCE

JUNE 2007

© Copyright by Ram Rangan, 2007. All Rights Reserved.

Abstract

Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-core communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications, which are characterized by difficult-to-break recurrences. Even though cyclic multithreading techniques, such as DOACROSS, are more applicable, the cyclic inter-thread dependences created by these techniques cause them to have very low tolerance to rising inter-core latencies.

To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP, in particular, and PMT techniques, in general, are able to tolerate inter-core latencies, while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allows threads to use inter-thread queues to communicate in a pipelined fashion. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also effectively tolerate variable latency stalls in applications better than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the performance scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak PMT performance.

Next, the dissertation shows that even though PMT applications tolerate inter-core latencies well, the high frequency of inter-thread communication (once every 5 to 20 dynamic instructions) in these codes makes them very sensitive to the intra-thread overhead imposed by communication operations. In order to understand the issues surrounding inter-thread communication for PMT applications, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and are shown to perform 38% to 200% better than the state of the art.

Acknowledgments

I owe this work to my advisor, David August. I thank him for believing in me. I thank him for supporting me every step of the way, from recommending my admission to Princeton to feeding me for a whole three and a half years (with liberal grants from NSF Grant No. 0133712, NSF Grant No. 0305617, and Intel Corporation)¹, from showing me how to use gdb during my first summer to helping improve my writing skills over the years, from celebrating every small research success I had to consoling and encouraging me every single time I was not able to make a paper deadline or had a paper rejected, and so on. His tonic was definitely bitter many a time, but in hindsight, there appears to have been a method to the madness, and the many situations I sulked about originally have ultimately worked out to my benefit. His take-the-bull-by-the-horns approach to research and hard work will always be an inspiration for me.

The central idea of this dissertation, decoupled software pipelining (DSWP), was born out of discussions with Neil Vachharajani, my primary collaborator, and David August.

I would like to thank Margaret Martonosi, George Cai, Li Shiuan Peh, and Jaswinder Pal Singh for serving on my thesis committee. Their collective wisdom has made this dissertation more complete in terms of providing an in-depth understanding of various aspects of DSWP behavior. I thank my advisor and my readers, George Cai and Margaret Martonosi, for carefully reading through my dissertation and suggesting fixes. Special thanks to Margaret Martonosi for her impressive turnaround time that enabled the timely scheduling of my final defense. As a collaborator, George Cai's expert inputs helped improve the quality of the DSWP communication support work.

I would like to thank HP Labs and Intel for providing me with valuable internship opportunities. I had great fun during these internships and learned a lot from working closely with Shail Aditya and Scott Mahlke (at HP) and Shubu Mukherjee, Arijit Biswas, Paul Racunas, and Joel Emer (at Intel).

¹ Opinions, findings, conclusions, and recommendations expressed in this dissertation are not necessarily the views of the NSF or Intel Corporation.

I would probably not be in computer architecture if not for my undergraduate advisor, Ranjani Parthasarathi. Her classes and exams gave me a good understanding of the basics. By allowing her students complete freedom and by being available for discussions, she enabled the exploration of research-quality ideas for class projects. For all this and more, I am deeply indebted to her.

My heartfelt thanks to all past and current members of the Liberty group. I thank Manish Vachharajani, David Penry, and Neil Vachharajani for developing the Liberty Simulation Environment; it really made modeling straightforward and helped me reason about and truly understand the innards of various hardware blocks. The cache, in particular! Thanks to David Penry for developing a robust IA64 emulator and a cycle-accurate Itanium 2 core model. Many thanks to Spyridon Triantafyllis, Matthew Bridges, Easwaran Raman, and the entire VELOCITY compiler team for their hard work. Guilherme Ottoni and Neil Vachharajani implemented the automatic DSWP algorithm used in this dissertation in the VELOCITY compiler. I had great fun working with Jonathan Chang, George Reis, Adam Stoler, Jason Blome, and Bolei Guo on various projects. The random discussions, jokes, movies, literati, text twist, tennis, and squash games with my lab mates kept me sane through graduate school.

Outside of my lab, Lakshmi and Mani, Easwar, Logo, the Shivakumar and the Mahesh families, Julia Chen, Kevin Ko, Ge Wang, Margaret and Bob Slighton, Steve Wallitt, Neal Kaufmann, the entire PUTTC crowd, and several others cheered me up so much that I almost never missed home. Close relatives and friends from school and college, Alwyn, Anusha, Arun, Arundhati, Kasturi athai, Mani, Ravi, Robs, Satish, Siva, Subha, and others have always provided great encouragement and support. I thank Nitya for her unconditional love and friendship. A big thank you to my little sister, Ramya, for her love and affection over the years.

I thank my grandparents for their blessings and prayers. I salute their indefatigable spirit that made them take great interest in my progress towards thesis completion, despite their myriad age-related problems. I thank

all four grandparents for loving me so much and for all that they have taught me. Finally, words cannot express my love and gratitude for my parents, who have given their all for both their children. Their innumerable sacrifices, their selfless love, and their keen interest in my success are what have gotten me to this point. I am indeed fortunate to have such a wonderful family and I wish to dedicate my life's work (well, at least, five and a half years of it) to my parents and my grandparents for everything they have given me and done for me.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction

1.1 Thread Level Parallelism Paradigms
1.1.1 Independent Multithreading (IMT)
1.1.2 Cyclic Multithreading (CMT)
1.1.3 Pipelined Multithreading (PMT)
1.2 Contributions
1.3 Overview

2 Decoupled Software Pipelining

2.1 Limitations of single-threaded execution
2.2 Overview of DSWP
2.3 RDS Loops and Latency Tolerance
2.4 RDS Parallelization
2.5 Decoupled Software Pipelining
2.6 Automatic DSWP
2.7 Summary

3 Communication Support for PMT

3.1 High-Frequency Streaming
3.2 Design Space
3.2.1 Communication Operation Sequences
3.2.2 Dedicated Interconnects
3.2.3 Pipelined Interconnects
3.2.4 Synchronization
3.2.5 Queue Backing Store
3.3 The Synchronization Array
3.3.1 Operation
3.3.2 Handling Control Speculation
3.3.3 Performance Scalability

3.3.4 Integrating with the Itanium 2
3.4 The Snoop-Based Synchronization Technique
3.5 Summary

4 Evaluation Methodology

4.1 Benchmarks and Tools
4.2 Performance Measurement
4.3 Sampling Methodology
4.4 Summary

5 Performance Evaluation of DSWP

5.1 Performance of 2-thread DSWP
5.1.1 Balancing Threads Better
5.1.2 Latency tolerance through decoupling
5.2 Performance scalability of DSWP
5.2.1 Linear and non-linear thread pipelines

5.2.2 Performance under ideal communication behavior
5.3 Summary

6 Evaluation of Communication Support

6.1 Systems Studied
6.2 Experimental Setup
6.3 Results and Analysis
6.4 Sensitivity Study
6.5 Summary

7 Communication Support Optimizations

7.1 Amortizing Overhead Costs for Software Queues
7.1.1 Analysis
7.1.2 Code Generation
7.1.3 Evaluation
7.1.4 Discussion
7.2 Hardware Enhancements to Snoop-Based Synchronization
7.3 Summary

8 Conclusions and Future Directions

List of Tables

2.1 Multi-core techniques to improve single-threading performance

4.1 Loop Information
4.2 Baseline Simulator
4.3 Baseline Out-Of-Order Simulator

5.1 Dual-Core Out-Of-Order Issue Configuration

List of Figures

1.1 Example loops. The loop in (a) can be parallelized using IMT. The loop in (b) can be parallelized with CMT or PMT. Sample parallelizations are shown in Figure 1.2.
1.2 Execution schedules of the loops in Figure 1.1 using IMT, CMT, and PMT. Solid lines represent intra-iteration dependences. Dashed lines represent loop-carried (critical path) dependences. IR is the initiation rate, the number of iterations started per cycle.

2.1 RDS loop traversal illustrating misprioritized instruction execution
2.2 Splitting RDS Loops

3.1 Transit and COMM-OP delays.
3.2 A PMT example.
3.3 Effect of transit and COMM-OP delays on streaming codes.
3.4 Produce and consume code sequences for shared-memory based software queues.
3.5 Two queue layouts for memory backing stores.
3.6 Synchronization Array structure
3.7 Synchronization Array State Machine

3.8 Dual-core CMP configuration with Synchronization Array. The individual cores are Itanium 2 cores. Shaded pipeline stages indicate extensions to the Itanium 2 to support synchronization array instructions. The dark arrows indicate new datapaths added. The SAR/POT → SAA datapath carries speculative SA accesses. The DET → SAA datapath carries non-speculative SA accesses. The SAA → SAR signal carries synchronization over notifications from the SA to the cores and the SAA → DET datapath carries the queue value read by consume instructions from the SA.

4.1 Raw performance of single-threaded code on baseline models

5.1 Speedup of fully automatic 2-thread DSWP with 32-entry inter-thread queues.
5.2 Performance improvement from feedback-driven load rebalancing.
5.3 Traversal loop in mcf
5.4 Speedup from DSWP execution on in-order and out-of-order processors.

5.5 Distribution of iteration time. S=single-threaded, P=producer, C=consumer.
5.6 Synchronization array occupancy of mcf illustrates importance of decoupling

5.7 DSWP Q32: DSWP performance when moving from 2 to 4, 6, and 8 threads with 32-entry inter-thread queues and a bus interconnect. Note: a missing bar for a particular number of threads for a benchmark means the compiler was not able to partition the chosen loop for that benchmark into that many threads.
5.8 Normalized execution time breakdown of individual benchmarks when moving to more threads with 32-entry queues and bus interconnect
5.9 Thread dependence graphs for loop from wc.
5.10 Linear and non-linear thread pipeline execution with inter-core delay of 10 cycles and per-thread iteration computation time of 40 cycles.

5.11 DSWP Q∞: Performance of DSWP with 2, 4, 6, and 8 threads with infinitely long queues and a bus interconnect, relative to DSWP Q32 (top) and single-threaded in-order execution (bottom).
5.12 Normalized execution time breakdown of individual benchmarks when moving to more threads with infinitely long queues and bus interconnect.

5.13 DSWP Q∞+BW∞: Performance of DSWP with 2, 4, 6, and 8 threads with infinitely long queues and infinite communication bandwidth relative to single-threaded in-order execution.
5.14 Normalized execution time breakdown of individual benchmarks when moving to more threads with infinitely long queues and infinite communication bandwidth.
5.15 Relative performance of DSWP with 2, 4, 6, and 8 threads under different communication scenarios.

6.1 Effect of transit delay on streaming codes.
6.2 Overall performance comparison across all design points.
6.3 Normalized execution time breakdown for producer (above) and consumer (below) threads for each design point.
6.4 Ratio of the # of dynamic communication instructions to application instructions for producer and consumer threads.
6.5 Normalized execution time breakdown with transit delay = 4 cycles.
6.6 Effect of increased interconnect bandwidth (transit delay = 4 cycles, bus width = 128 bytes).

7.1 Algorithm to find maximal single-entry single-exit loop regions
7.2 Coalescing at the outer loop.
7.3 Coalescing at loop entry and exits.
7.4 Control inequivalent ACQUIRE and RELEASE.

7.5 Performance of software queues before and after overhead amortization, with and without write-forwarding support.
7.6 Example illustrating how the single-entry single-exit algorithm fractures acyclic regions in the presence of side-exits and side-entrances.
7.7 Algorithm to find maximal single-entry multiple-exit loop regions
7.8 Application of the single-entry multi-exit algorithm to the loop from Figure 7.6. There are three single-entry multi-exit regions: region ABC with node A as entry and edges B→D and C→E as exit edges, region D with node D as entry and the outgoing edge from D as exit edge, and finally region E with node E as entry and edges E→E and E→D as exit edges. The exit edges will need to be split during code generation in order to insert release operations.
7.9 Effect of streaming cache and queue size on producer (top) and consumer (bottom).

Chapter 1

Introduction

For years, steady clock speed improvements and aggressive instruction level parallelism (ILP) techniques have sustained application performance improvements. However, this trend is not sustainable. On the one hand, aggressive clocking leads to diminishing power [61] and performance [23] returns, as well as power delivery and heat dissipation problems. On the other hand, aggressive exploitation of ILP in single-context execution [12, 43, 56] uses disproportionately more hardware and consumes more power, since it involves excessive speculation and/or additional bookkeeping (for example, checkpointing of state) to implement large logical instruction windows. Further, design and verification costs for large uniprocessor designs also become prohibitively expensive.

These trends have led the industry to move away from traditional designs to multi-core architectures or chip multiprocessors (CMPs). IBM Power5, Intel Core Duo, AMD Dual-Core Opteron, Sun Niagara, and STI Cell are all examples of current generation CMPs. CMPs replace a single large monolithic core with multiple smaller processing cores on the same die area. Besides greatly simplifying design and verification tasks, such designs overcome the clock speed, power, thermal, and scalability problems plaguing aggressive uniprocessor designs while continuing to provide additional computing power using additional transistors provided by technology scaling. While additional

    for(i=1; i<=N; i++)           // C
        a[i] = a[i] + 1;          // X

(a) Loop with no loop-carried dependences (DOALL)

    while(ptr = ptr->next)        // L
        ptr->val = ptr->val + 1;  // X

(b) Pointer-chasing loop with a loop-carried dependence

Figure 1.1: Example loops. The loop in (a) can be parallelized using IMT. The loop in (b) can be parallelized with CMT or PMT. Sample parallelizations are shown in Figure 1.2

processors on the chip improve the throughput of many independent tasks, they, by themselves, do nothing to improve the performance of individual tasks. Worse still for task performance, processor manufacturers are considering using simpler cores in CMPs to improve power/performance. This trend implies that single task performance will not improve, and may actually degrade. Thus, performance improvement on a CMP requires programmers or compilers to parallelize individual tasks into multiple threads and expose thread-level parallelism (TLP).

1.1 Thread Level Parallelism Paradigms

The major challenge in parallelizing a single task into multiple threads is handling the dependences between threads. Parallelization techniques must insert synchronization between threads to communicate these dependences. Unfortunately, if not handled carefully, synchronization for inter-thread dependences can eliminate parallelism by serializing execution across threads. Depending on the kind of inter-thread dependences they create, TLP techniques can be categorized into three principal paradigms: independent multithreading (IMT), cyclic multithreading (CMT), and pipelined multithreading (PMT). Figures 1.1 and 1.2 summarize the differences between these three paradigms.

1.1.1 Independent Multithreading (IMT)

In IMT, no inter-thread dependences exist. Independent threads execute concurrently to provide thread-level parallelism (TLP). DOALL loop parallelization [34] is a prime example of IMT.


[Figure 1.2 appears here as a full-page diagram. It shows cycle-by-cycle execution schedules of the Figure 1.1 loops under IMT, CMT, and PMT, each on two cores, for (a) a communication latency of 1 cycle and (b) a communication latency of 2 cycles. At latency 1, all three paradigms sustain an initiation rate of IR = 1; at latency 2, IMT and PMT still sustain IR = 1 while CMT drops to IR = 0.5. The diagram also annotates the qualitative tradeoffs: cross-thread dependences vs. thread-local recurrences, and wide applicability vs. fast execution.]

Figure 1.2: Execution schedules of the loops in Figure 1.1 using IMT, CMT, and PMT. Solid lines represent intra-iteration dependences. Dashed lines represent loop-carried (crit- ical path) dependences. IR is the initiation rate, the number of iterations started per cycle.

For example, consider the loop in Figure 1.1(a). When parallelized into two threads, it would execute according to the schedule shown in the IMT column of Figure 1.2, initiating, on average, one iteration per cycle and achieving a speedup of 2 over the sequential code. Unfortunately, while the parallelism achieved by IMT is often significant, its applicability to general-purpose codes is low. For example, DOALL parallelization could not be applied to the code in Figure 1.1(b) because of the pointer-chasing dependence. In general, the highly interdependent nature of code blocks in general-purpose programs routinely creates such recurrences, resulting in few, if any, large, independent pieces of code for IMT.
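As a concrete (and purely illustrative) rendering of such an IMT parallelization, the DOALL loop of Figure 1.1(a) can simply be split by iteration ranges. The pthreads wrapper and the block partition below are assumptions made for this sketch, not part of the dissertation's compiler infrastructure; a block split differs from the odd/even interleaving drawn in Figure 1.2 but is equally valid because the iterations are independent.

    #include <pthread.h>

    #define N 1024
    static int a[N + 1];

    /* Each worker handles a disjoint range of iterations. Because the loop has
     * no loop-carried dependences, the threads never need to communicate. */
    typedef struct { int lo, hi; } range_t;

    static void *worker(void *arg) {
        range_t *r = (range_t *)arg;
        for (int i = r->lo; i <= r->hi; i++)   /* C */
            a[i] = a[i] + 1;                   /* X */
        return NULL;
    }

    void doall_two_threads(void) {
        pthread_t t1, t2;
        range_t r1 = { 1, N / 2 }, r2 = { N / 2 + 1, N };
        pthread_create(&t1, NULL, worker, &r1);
        pthread_create(&t2, NULL, worker, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
    }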

1.1.2 Cyclic Multithreading (CMT)

CMT techniques handle programs with recurrences. Threads from CMT are characterized by one or more inter-thread dependence cycles. To better understand CMT, consider applying DOACROSS parallelization [13] to the loop in Figure 1.1(b). The pointer-chasing recurrence is mapped across threads such that independent code blocks from different loop iterations (for example, LD2 and X3) execute concurrently to provide TLP, as shown in the CMT column of Figure 1.2. Notice that the bi-directional communication between the two threads creates a cyclic dependence between threads. By handling codes with recurrences, CMT enjoys significantly more applicability in general-purpose codes than IMT. However, by mapping recurrences, which are on the critical path of a program's execution, across threads, CMT causes program performance to be highly sensitive to inter-thread communication latencies. Therefore, such techniques suffer a serious setback in the face of high inter-thread communication latencies in current and future generation CMPs. In the DOACROSS example shown in Figure 1.2, the rate at which loop iterations can be initiated (and thus the execution time for the loop) is limited by the communication latency. As the communication latency increases from 1 to 2, the rate of iteration initiation drops from 1 to 0.5, thus highlighting the sensitivity of CMT program

performance to communication latency. In reality, as more and more cores are integrated on the same die, increased wire delays and contention in the shared memory subsystem may cause inter-core operand communication latencies to vary from a few tens of cycles to even a few hundred cycles, leading to poor CMT performance.
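For contrast, the following hedged sketch shows one way a two-thread DOACROSS version of the Figure 1.1(b) loop could look. The channel_t type and the blocking send_ptr/recv_ptr primitives are hypothetical stand-ins for whatever inter-core communication mechanism is available, and the round-robin iteration assignment is an assumption for illustration; the point is that the loop-carried pointer must bounce back and forth between the cores, forming the dependence cycle described above.

    typedef struct node { struct node *next; int val; } node_t;
    typedef struct channel channel_t;              /* hypothetical blocking FIFO   */
    extern node_t *recv_ptr(channel_t *c);         /* blocks until a value arrives */
    extern void    send_ptr(channel_t *c, node_t *p);

    /* Both threads run this body; iterations alternate between them. Before the
     * workers start, the list head is injected into thread 0's incoming channel. */
    void doacross_worker(channel_t *from_other, channel_t *to_other) {
        node_t *ptr;
        while ((ptr = recv_ptr(from_other)) != NULL) {
            send_ptr(to_other, ptr->next);   /* L: forward the loop-carried pointer */
            ptr->val = ptr->val + 1;         /* X: this iteration's computation     */
        }
        /* Propagate termination; assumes a buffered channel so this final token
         * is absorbed even after the partner thread has already exited. */
        send_ptr(to_other, NULL);
    }

Because each iteration's pointer-chasing step must wait for the pointer to cross the interconnect from the other core, the initiation interval grows with the communication latency, which is exactly the sensitivity discussed above.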

1.1.3 Pipelined Multithreading (PMT)

It is in this context that the relatively unheard-of DOPIPE parallelization technique [41]¹ or, more generally, pipelined multithreading, assumes significance. PMT, like CMT, handles codes with recurrences, but differs in that the resulting inter-thread dependences do not form a cycle, but rather a pipeline (more precisely, an arbitrary directed acyclic graph). By parallelizing codes with recurrences, PMT achieves the wide applicability of CMT, and by avoiding cyclic cross-thread dependences, PMT-parallelized codes can tolerate long inter-thread communication latencies. Application of PMT to the loop in Figure 1.1(b) produces threads whose execution schedule is shown in the PMT column in Figure 1.2. Notice that the cross-thread dependences are acyclic; they only flow from thread 1 to thread 2. Therefore, the initiation rate does not change as the communication latency increases; one loop iteration can be initiated each cycle regardless of the communication latency. Communication latency only affects the pipeline "fill time," which is a one-time cost amortized over the entire execution of the loop.

The compelling advantages of PMT make it a strong candidate for use in program parallelization for current and future multi-core architectures. This dissertation presents a non-speculative PMT loop transformation called decoupled software pipelining (DSWP). It shows that the acyclic dependence flow property of PMT can not only enable DSWPed

¹ DOPIPE was originally proposed as a multithreading technique alongside DOACROSS to handle scientific codes with recurrences. Since the loop body of threads produced by DOACROSS remains identical across all threads, it enables DOACROSSed codes to spawn enough threads to match the available processor count. DOPIPE, on the other hand, splits the original loop's body among multiple threads and as a result the number of threads is essentially fixed at compile-time. The run-time scalability requirements of the scientific computing community resulted in DOACROSS becoming the preferred parallelization strategy.

applications to tolerate inter-core latencies, but also to effectively insulate critical path code from stalls due to variable latency instructions or code blocks in the off-critical path. This work then studies the performance scalability of DSWP in general-purpose applications and the communication support requirements for DSWP.
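To make the preceding fill-time argument concrete, a rough first-order model (an illustration added here, not an analysis from the dissertation) of the time to execute N loop iterations on a two-thread partition is:

\[
T_{\mathrm{PMT}}(N) \;\approx\; T_{\mathrm{fill}} + N \cdot t_{\mathrm{iter}},
\qquad
T_{\mathrm{CMT}}(N) \;\approx\; N \cdot \max\!\left(t_{\mathrm{iter}},\, L_{\mathrm{comm}}\right),
\]

where $t_{\mathrm{iter}}$ is the per-thread work per iteration, $L_{\mathrm{comm}}$ is the inter-core communication latency, and $T_{\mathrm{fill}}$ is the pipeline fill time (which grows with $L_{\mathrm{comm}}$ but is paid only once). Under this model, raising $L_{\mathrm{comm}}$ leaves the PMT initiation rate untouched but directly lowers the CMT initiation rate, matching the drop from IR = 1 to IR = 0.5 in Figure 1.2(b).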

1.2 Contributions

• Decoupled Software Pipelining (DSWP) transformation: The performance of a program is limited by its critical path performance. Ideally, this critical path consists of only program dependences. However, a single program counter (PC), limited dynamic scope, and sequential commit requirements in traditional single-threaded processors lead to false resource dependences among instructions, prevent processors from understanding global instruction criticality, and cause misprioritized instruction fetch and execution. For example, a TLB miss for an output-producing store can clog a pipeline, stalling progress down a pointer-chase dependence chain. Though a static compiler has much wider scope, the presence of variable latency instructions (e.g., loads) or code blocks (e.g., if-then-else hammocks) prevents single-threaded static code transformations like software pipelining from correctly identifying and optimally scheduling the global critical path.

The decoupled software pipelining transformation presented in this work addresses the problem of prioritizing the execution of the global critical path by statically splitting a given program loop into critical path and off-critical path instructions and executing them as concurrent threads on a CMP, or in general, on any multi-context architecture. Inter-thread dependences always flow from the critical path thread to the off-critical path thread. The unidirectional dependence in the resulting PMT code facilitates the use of a decoupling queue to buffer inter-thread values. While multi-context execution enables prioritized fetch and execution of critical path instructions,

the decoupling buffer insulates either thread from variable latency stalls in the other thread and thus provides latency tolerance. This creates a non-contiguous logical instruction window that ensures that even though one of the contexts may be stalled, the other context(s)/thread(s) can continue to make forward progress by executing independent instructions from earlier loop iterations. In effect, they create a large effective instruction window, enabling DSWP to effectively tolerate variable latency stalls.

• Performance Scalability: The dissertation then presents a scalability analysis of DSWP, as a general-purpose multithreading strategy, by studying its performance when going up to 8 threads using an automatic DSWP framework [40]. The analysis identifies thread pipelines as being of two types: linear and non-linear. Linear pipelines are characterized by strict pairwise interactions amongst threads, i.e., each thread in the pipeline consumes from at most one upstream thread and produces to at most one downstream thread. Non-linear thread pipelines, on the other hand, are directed acyclic graphs. Even though, in principle, there are no cyclic inter-thread dependences in a PMT transformation such as DSWP, the use of finitely sized queues creates cyclic inter-thread dependences called synchronization cycles. Synchronization cycles require the producer of a queue item to block until the queue has a free slot. For reasonably sized queues (for example, 8 entries and above) and linear pipelines, synchronization cycles never lead to performance bottlenecks, since the slack provided by the queue sizing is sufficient to tolerate them. However, it will be shown that, for similarly sized queues, the synchronization cycles for non-linear pipelines have such high latencies that the slack provided by queue sizing is not sufficient to tolerate the long delays. While this was a major cause of performance degradation, for a few applications, increased coherence misses were determined to be the source of performance loss. The analysis concludes that a more communication-aware DSWP partitioning algorithm, possibly involving code duplication among threads, could avoid

communication bottlenecks by generating only linear pipelines.

• Inter-Thread Operand Communication Support: While pipelined inter-thread communication tolerates inter-core or transit delay naturally, the intra-core overhead from fetching and executing communication operations, COMM-OP delay, is a recurring cost, and application performance is highly sensitive to this overhead. In particular, threads in DSWPed programs communicate very frequently (once every 5-20 dynamic instructions) and are very sensitive to COMM-OP delay. In designing support mechanisms to sustain such high-frequency pipelined communication, also referred to as high-frequency streaming communication, it is important to optimize for performance, hardware cost, overhead, and hardware and OS design and modification effort.

Based on a methodical design space exploration of several communication mechanisms, this dissertation presents three novel high-performance solutions for high-frequency PMT communication, with varying levels of hardware and OS impact. They are:

1. Synchronization Array: This design uses a simple software interface (produce and consume instructions to write and read data to and from an inter-thread queue, respectively) and uses dedicated hardware for synchronization and operand transport to deliver high inter-thread communication performance. The synchronization array, while conceptually a collection of queues, separates storage for synchronization information from storage for queue data and has the former distributed across cores. This enables produce and consume instructions to test queue fullness or emptiness locally (i.e., within a core), thereby enabling fast local processing and avoiding contention for any shared structure. A detailed design of the synchronization array, including handling of predicated produce and consume instructions, dealing with control speculative accesses, and guaranteeing

memory consistency, is presented.

2. Snoop-based synchronization: This design maps produce and consume instructions to shared memory queues and includes extra logic in the cache controller to provide efficient inter-thread synchronization, delivering low-cost, low-overhead communication between threads. Hardware enhancements to the basic snoop-based synchronization design that use a special cache and longer queue sizes are also discussed and evaluated.

3. Synchronization coalescing: This is a compiler technique to amortize synchronization and queue pointer update overhead across multiple communication operation sequences for shared-memory based software queue implementations. It achieves significant performance improvement over existing shared memory software queue implementations. A sketch of the idea appears below.
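The following before/after fragments sketch what synchronization coalescing might look like for a shared-memory software queue. The queue layout, the acquire/release helpers, and the block size are assumptions invented for this sketch; the dissertation's actual analysis and code generation are described in Chapter 7.

    /* Hypothetical single-producer software queue and helpers, for illustration only. */
    enum { Q_SIZE = 32, BLOCK = 8 };
    typedef struct node { struct node *next; int val; } node_t;
    typedef struct { node_t *buf[Q_SIZE]; volatile int head, tail; } swqueue_t;
    extern void acquire_slots(swqueue_t *q, int n);   /* wait until n slots are free */
    extern void release_slots(swqueue_t *q, int n);   /* publish n enqueued elements */

    /* Before coalescing: every produce synchronizes and updates the tail pointer. */
    void produce_each(swqueue_t *q, node_t *head) {
        for (node_t *p = head; p != NULL; p = p->next) {
            acquire_slots(q, 1);
            q->buf[q->tail] = p;
            q->tail = (q->tail + 1) % Q_SIZE;
            release_slots(q, 1);
        }
    }

    /* After coalescing: one acquire reserves space for a block of produces and one
     * release publishes them, amortizing the overhead over up to BLOCK elements.
     * (A real implementation would handle the tail of the list more carefully.) */
    void produce_coalesced(swqueue_t *q, node_t *head) {
        node_t *p = head;
        while (p != NULL) {
            int i = 0;
            acquire_slots(q, BLOCK);
            for (; p != NULL && i < BLOCK; p = p->next, i++)
                q->buf[(q->tail + i) % Q_SIZE] = p;
            q->tail = (q->tail + i) % Q_SIZE;   /* single pointer update */
            release_slots(q, i);
        }
    }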

1.3 Overview

The remainder of this dissertation is organized as follows. Chapter 2 highlights the misprioritization of the program critical path in traditional single-threaded execution and presents the decoupled software pipelining (DSWP) transformation to effectively tolerate stalls due to variable latency instructions or code blocks. Chapter 3 presents a qualitative discussion on the importance of efficient inter-thread communication in PMT, identifies four essential components for any streaming communication support, and discusses the pros and cons of various design points along each axis in a detailed design space description. Subsequent to the design space description, this chapter introduces the high-performance synchronization array design and the snoop-based synchronization design as a low-cost alternative to the former. Based on the qualitative arguments presented in the design space description, it argues why the snoop-based synchronization technique should perform almost on par with the synchronization array design.

Information about the evaluation infrastructure, the benchmarks used, and the analysis and performance measurement methodology is provided in Chapter 4. Evaluation of the variable latency tolerance property of DSWP as well as the performance scalability study is presented in Chapter 5. While the experiments in Chapter 5 use the synchronization array communication architecture on account of its high performance design, Chapter 6 evaluates existing shared memory software queue implementations, alongside the synchronization array and the snoop-based synchronization mechanisms, to identify the key performance bottlenecks in each design point. Chapter 7 presents the synchronization coalescing and the stream cache optimizations which alleviate the bottlenecks identified in Chapter 6 and improve the performance of existing shared memory implementations and the snoop-based synchronization technique, respectively. Chapter 8 summarizes key results from this work, highlights the relevance of this work to current trends in the microprocessor industry, and provides pointers to future research directions.

Chapter 2

Decoupled Software Pipelining

This chapter first presents background material and highlights the key problems in dealing with variable latency stalls in modern processors. Next, it argues for a multithreading solution to the above problem, not because processors ship with multiple cores these days, but more so to overcome the fundamental limitations of traditional single-threaded processors. In fact, besides power and thermal issues and difficulty in scaling up the clock rate, diminishing performance returns from single-threaded execution is also a key reason for the current paradigm shift. The decoupled software pipelining (DSWP) PMT transformation presented in this chapter addresses these fundamental limitations. Finally, this chapter alludes to how this transformation can be generalized and used as a general-purpose multithreading technique.

2.1 Limitations of single-threaded execution

Performance of any region of code, be it a whole program, a procedure, a loop, or a basic block, is limited by the performance of its dynamic critical path. Ideally, the critical path would contain only program dependences, including register, memory, and control dependences, and all non-critical instructions would be scheduled and executed completely in the "shadow" of the critical path (i.e., overlapped with the critical path).

Unfortunately, conventional microprocessors seldom achieve this overlap in practice. Traditional compilers take into account resource constraints and schedule all instructions, critical and non-critical, according to some heuristic and present a unified instruction stream to the underlying machine. The compiler assumes a fixed latency for each instruction as it generates this schedule. The hardware fetches and commits instructions in the order determined by the compiler. At runtime, single-threaded execution requires all instructions, regardless of their criticality, to uniformly compete for and share available resources (for example, fetch resources, register file ports, functional units, reorder buffer slots, etc.). The sequential fetch and commit necessarily creates a tight coupling between instruction execution on the critical and off-critical paths. This tight coupling, combined with the fact that the compiler had to assume a fixed latency for all instructions while generating the instruction schedule, leads to performance bottlenecks in the presence of variable latency instructions. While static scheduling is very effective at hiding fixed latencies by placing independent instructions between a long latency instruction and its consumer, unfortunately, the compiler does not know how many instructions must be placed between a variable latency instruction and its consumer. Assuming best, average, or worst case latency for scheduling variable latency instructions can detrimentally affect overall run-time performance of the code. Out-of-order (OOO) execution mitigates this problem to an extent. Rather than stalling when a consumer, whose dependences are not satisfied, is encountered, the OOO processor will execute instructions from after the stalled consumer. However, since critical and non-critical instructions are fetched and executed with equal priority in traditional single-threaded execution, not only does it cause non-critical instructions to take up valuable machine resources causing resource hazard stalls for critical instructions, but it also prematurely exhausts the supply of independent instructions whose execution could have been overlapped with stalls on the critical path. As Section 2.3 will show, even an OOO processor will fall short of independent instructions to overlap with long stalls of variable latency instructions due to the sequential fetch and commit policy of traditional single-threaded execution.

The problem can be reduced to that of maintaining a steady supply of independent instructions to overlap with periods of long stalls due to variable latency instructions. This situation calls for a more flexible instruction scheduling that will not tightly couple execution of the critical and off-critical paths and, instead, will allow each of them to make forward progress independent of stalls in the other path. Memory loads, the predominant type of variable latency instruction, have worst-case latencies (i.e., cache-miss latencies) so large that it is often difficult to find sufficiently many independent instructions after a stalled consumer. Despite these limitations, static and dynamic instruction scheduling combined with larger and faster caches have been largely successful in improving instruction-level parallelism (ILP) for many programs. However, the problems described above become particularly pronounced in certain programs that have unpredictable memory access patterns and few independent instructions near long latency loads. Loops that traverse recursive data structures (e.g., linked lists, trees, graphs, etc.) exhibit exactly these characteristics. Data items are not typically accessed multiple times and subsequent data accesses are referenced through pointers in the current structure, resulting in poor spatial and temporal locality. Since most instructions in a given iteration of the loop are dependent on the pointer value that is loaded in the previous iteration, these codes prove to be difficult to execute efficiently. To address these concerns, software, hardware, and hybrid prefetching schemes have been proposed [33, 53, 54]. These schemes increase the likelihood of cache hits when traversing recursive data structures, thus reducing the latency of data accesses when the data is actually needed. Software techniques insert prefetching load instructions that attempt to predict future accesses using information statically available to the compiler [33]. Hardware techniques dynamically analyze memory access traces and send prefetch requests to the memory subsystem to bring data into the processor caches [53]. Dependence-graph-based prefetching constructs dependence graphs leading to loads with poor locality and pre-executes these graphs to do prefetching [4]. Despite these efforts, the performance of

recursive data structure loop code is still far from ideal. Since prefetching techniques are speculative, they may not always end up fetching the right addresses at the right time. Achieving good coverage (i.e., the ratio of the number of useful addresses prefetched to the total number of addresses accessed) without performing an excessive number of loads is extremely difficult. The prefetching of useless data (i.e., poor prefetch accuracy) not only does not help performance but may in fact hinder it by causing useful data to be evicted from the cache and by occupying valuable memory access bandwidth. Consequently, the accuracy of prefetchers is extremely important. Achieving high coverage and accuracy is very difficult in these speculative frameworks.

Besides failing to efficiently handle variable latency stalls, single-threaded execution also cannot efficiently handle variable trip count loops. This inefficiency arises due to a single PC architecture, which prevents the processor from fetching and overlapping execution of pre-loop and post-loop code with loop execution. While fixed trip count loops can be fully unrolled and statically scheduled alongside pre-loop and post-loop code, the same cannot be done for variable trip count loops. Dynamically, when such a loop is encountered on the off-critical path, no critical path instructions can be fetched until all iterations of this loop have been fetched. Alternately, when such a loop is encountered on the critical path, no off-critical path instructions can be fetched and this results in missed overlap opportunities. While this problem is quite similar to the variable latency problem described above, a solution to this problem not only requires decoupling the execution of critical and off-critical paths, but also requires the ability to fetch and commit instructions out of order. Stated simply, scope restrictions in single-threaded processors prevent them from finding useful independent instructions for parallel execution and prioritizing execution of critical path instructions. The single-threaded execution model forces critical and off-critical paths to compete for and share processor fetch and execute resources with the same priority.

Researchers have proposed ways to address the scope restriction problem. Slipstream processors achieve larger effective instruction windows by executing a compressed, albeit

speculative, version of a program. This is called the advance or A-stream. A redundant or R-stream, which is sped up by the control flow and operand value information from the A-stream, serves as a checker thread, validates A-stream computations, and triggers recovery as needed [46]. Zilles and Sohi use master/slave speculative parallelization to extend this to beyond two threads and show that the validation of the advance or master thread can be carried out in parallel by multiple redundant or slave threads [75]. Barnes et al. used two in-order processing cores coupled by a queue [5]. During long latency events, the "advance" pipeline does not stall on use. Instead it continues instruction fetch, executes independent instructions, and defers execution of instructions dependent on resolution of the long-latency event. Instructions flow through the queue from the advance pipeline to the second "backup" pipeline. The backup pipeline resolves the long latency dependences and commits instructions. The inter-core queue in effect provides a large effective instruction window, enabling in-order processing cores to execute around stall-on-use conditions. Barnes et al.'s multipass pipelining technique achieves the same effect by recirculating instructions within a single core [6]. The Kilo-Instruction Processors [12] and Checkpoint Processing and Recovery [2] techniques remove commit restrictions through checkpointing of architectural and microarchitectural state to achieve large effective instruction windows, and the TRIPS [56] project uses distributed processing across multiple small processing elements to achieve the same effect.

In the last decade or so, several thread-level speculation (TLS) techniques [19, 59, 62, 68, 72] have been proposed to enable single-threaded programs to run faster on multi-core processors. These techniques build larger logical instruction windows by speculatively spawning threads/tasks/traces on available processor cores, typically based on a control flow or value predictor's predictions. At any given point, the oldest thread in sequential program order is the non-speculative thread and all other threads are speculative. Threads commit in sequential program order. The techniques differ in the exact manner in which they identify these coarse-grained regions for spawning and parallel execution.

In the multiscalar [59] and superthreaded [68] architectures, statically partitioned program control flow graph (CFG) regions called tasks are speculatively spawned off to execute on available processing elements. The superthreaded execution model also includes a rudimentary form of thread pipelining. The communication and computation parts of each thread are separated out to allow maximal execution overlap among all threads. However, thread spawning and commit are done strictly in sequential program order. Trace processors [72] dynamically construct instruction traces and employ a trace predictor to determine which trace to fetch and execute. Other TLS techniques [19, 62] break data dependences by statically and/or dynamically speculating on data values and spawn new threads with predicted input values. The predictions are validated at run time.

Other techniques use speculative or subordinate execution threads to run ahead of the original program thread and trigger branch mispredictions and cache misses earlier than the original thread [9, 10]. The work performed by these speculative pre-execution threads mainly serves to warm up microarchitectural structures like caches and branch predictors, thereby improving the performance of the primary non-speculative thread. Mutlu et al.'s run-ahead execution technique avoids spawning extra threads and instead uses checkpointing of architectural state to achieve the same effect [38]. Techniques like register integration can be used to incorporate the execution results in the non-speculative thread's architectural state [55].

Despite the abundance of innovative microarchitectural techniques presented above and summarized in Table 2.1, a common problem across all these techniques is that they were evolved to operate under the constraints of the traditional sequential single-threaded execution model. Instruction fetch and commit are still serialized. In a bid to overcome the fundamental restrictions imposed by the execution model, architects have gone to great lengths to build copious amounts of buffering in the processing core to hold un-issued or un-committed instructions or to hold speculative architectural state updates of executed-but-uncommitted instructions.

Name | Category | Key Idea
SlipStream [46], Master/Slave Speculative Parallelization [75] | IW size increase | Compressed approximate instruction stream gets to performance critical loads/branches faster. One or more trailing threads or streams validate the execution of the compressed stream.
Flea-Flicker [5, 6] | IW size increase | Unresolved instructions are held in intra- or inter-core queues for deferred processing. Routine sequential commit upon resolution.
Kilo-Instruction Processors [12], Checkpoint Processing and Recovery [2] | IW size increase | Checkpoint of arch and µarch state saved on blocked ROB conditions to continue instruction retirement past the blocked instruction. Restored if blocked instruction excepts/interrupts.
TRIPS [56] | IW size increase | Static mapping of instructions to distributed grid of OOO processing elements.
TLS [19, 59, 62, 68, 72] | IW size increase | Input dependences of dynamically far apart regions speculated and regions spawned as speculative threads on available cores. Committed in program order upon validating speculation.
Run-ahead execution [38] | µarch warmup | Architectural state checkpointed on blocked instruction. Instruction fetch and execution continue past blocked instruction in "run-ahead" mode and serve to prefetch data and instructions. Upon resolution of blocked instruction, pipeline is flushed and normal mode execution is resumed.
Subordinate microthreading [9] | µarch warmup | Micro-coded threads spawned off at strategic points from main thread. They issue prefetches or update branch predictors.
Speculative pre-computation [10, 55] | µarch warmup | p-thread slices warm up caches. Results can be optionally register integrated with the main thread.

Table 2.1: Multi-core techniques to improve single-threading performance

The decoupled software pipelining transformation, presented in this chapter, overcomes these limitations. The next section presents an overview of the transformation and its salient characteristics.

2.2 Overview of DSWP

To tolerate variable latency and to overcome scope restrictions imposed by single PC architectures, a program transformation called decoupled software pipelining (DSWP) is presented in this chapter. DSWP avoids heavy hardware usage by attacking the fundamental problem of working within a single-threaded execution model, and moving to a multithreaded execution model. Only useful, about-to-execute instructions are even brought into the core. The rest are conveniently left in the instruction cache or their results committed and retired from the processor core. Unlike the single-threaded multi-core techniques presented in Table 2.1, DSWP is an entirely non-speculative technique. Each DSWP thread performs useful work towards program completion. In other words, DSWP does not require any speculative thread spawns nor does it have to throw away work done by any thread.

DSWP threads are typically long-running and do not require frequent thread spawns. The concurrent multithreading model of DSWP means that each participating thread commits its register and memory state concurrently and independently of other threads. Since the exact input-output relationship among the threads is known statically, only operand values corresponding to inter-thread dependences are communicated through queues. The concurrent model avoids buffering or communication of all other thread-local results, which are committed independently. These factors make the amount of inter-thread queue storage for DSWP insignificant, compared to the speculative storage in techniques like TLS and other single-threaded multi-core techniques.

The remainder of this chapter is organized as follows. Section 2.3 examines the pattern of recursive data structure traversals and illustrates why current static and dynamic scheduling techniques are not effective in identifying parallelism. Section 2.4 describes how DSWP parallelizes single-threaded pointer-chasing loops in the context of a simultaneous multithreading (SMT) core or a chip multiprocessor (CMP). While this chapter focuses on the DSWP transformation, Chapter 3 provides a detailed discussion on various communication support options for DSWP.

2.3 RDS Loops and Latency Tolerance

As was mentioned earlier, recursive data structure loops suffer from poor cache locality, requiring aggressive strategies to tolerate latency. Figure 2.1 illustrates a typical RDS loop. Each iteration of the loop processes a node in the data structure and contains code to fetch the next node to be processed. The diagram below the program code illustrates the data dependences that exist among the instructions in the program. As is evident from the figure, the critical path in this loop’s execution is the chain of load instructions resulting from the loop-carried dependence of r1. Ideally, all of the loop’s computation could be overlapped with the computation along this critical path.

    while(ptr = ptr->next) {
        ptr->val = ptr->val + 1;
    }

[Figure 2.1 diagram: each loop iteration expands into five instructions, r1 = M[r1]; r2 = r1 + 4; r3 = M[r2]; r4 = r3 + 1; M[r2] = r4, and the figure shows six such iterations numbered sequentially, so iteration 1 contains instructions 1-5, iteration 2 contains instructions 6-10, and so on through instruction 30. The chain of r1 = M[r1] loads (instructions 1, 6, 11, 16, 21, 26) forms the critical path; a dashed box groups the remaining 24 off-critical-path instructions, which nevertheless receive higher fetch priority under sequential fetch.]

Figure 2.1: RDS loop traversal illustrating misprioritized instruction execution

Unfortunately, modern single-threaded processors, both in-order and out-of-order, implement a sequential fetch and dispatch policy. Thus, rather than attempting to execute along the loop's critical path, the processor fetches sequential instructions, delaying the execution of the next critical path instruction. In the figure, this in-order fetch policy results in the 24 instructions inside the dashed box being fetched and dispatched to allow the fetch and dispatch of six instructions on the critical path. This misappropriation of resources to instructions off the critical path is not restricted to fetch. Dynamic schedulers typically give priority to the oldest instruction in the dynamic scheduling window. Thus, even during execution in an out-of-order processor, priority is given to relatively unimportant instructions rather than those on the critical path.

This misprioritization of instructions can result in lost opportunities for overlapping latency. Consider, for example, the choice between executing instruction 6 or instruction 2 following the completion of instruction 1. If instruction 2 is selected in favor of instruction 6, and instruction 6, when executed, would miss in the cache, then executing instruction 2 first will not only delay this miss by one cycle but will also consume an instruction (i.e., itself) whose execution could have been overlapped with the miss. Even if instruction 6 were to hit in the cache, executing down the 2-3-4-5 path of instructions would delay the execution of instruction 6, which ultimately will delay a cache miss on one of the critical path loads. Each instruction executed preferentially over a ready critical-path instruction is an instruction that delays the occurrence of a future critical-path cache miss and one that removes an opportunity for latency tolerance by decreasing the pool of independent instructions.

Note that the delays incurred by executing off-critical-path instructions need not be short. For example, as the length of a loop increases, the number of cycles required to fetch a subsequent copy of the loop-carried load may become significant. Alternatively, cache misses off the critical path could cause chains of dependent instructions to collect in the machine's issue window, thus forcing the system to stall until the load returns from

the cache. For example, if, in Figure 2.1, instructions 3 and 8 miss in the cache, a nine-instruction reorder buffer (ROB) would be necessary to start the next critical path load. A smaller ROB would result in the subsequent critical path load being delayed until the cache misses resolve.

Avoiding such delays is possible in aggressive out-of-order processors but would require the addition of significant resources. In the tiny example loop shown in Figure 2.1, promptly bringing a critical-path load into the instruction window would cost five instructions of issue/dispatch width and five entries in the ROB. While possibly not impractical for loops of size five, as loop lengths increase, the cost of tracking all instructions between consecutive critical-path instructions will become excessive.

The same effect can be achieved without requiring excessive resources in the pipeline. In a multi-threaded architecture, it is possible to partition the loop into multiple threads such that one thread is constantly fetching and executing instructions from the critical path, thus maximizing the potential for latency tolerance. The next section presents the Decoupled Software Pipelining (DSWP) transformation to create these threads from a sequential loop.

2.4 RDS Parallelization

As illustrated in the previous section, RDS loops typically consist of two relatively independent chains of instructions. The first, which makes up the critical path of execution, is the traversal code. The second is the chain of computations performed on each node returned by the traversal. While the data dependence of these sequences is often unidirectional (i.e., the computation chain is dependent on the traversal chain, but not vice-versa), variable latency instructions coupled with processor resource limitations create artificial dependencies between the two chains. While the resulting stalls are more pronounced in in-order processors, they can cause prolonged stalls even on aggressive out-of-order machines. To overcome this, it is necessary for the processor architecture to allow for a more decoupled execution of the two RDS loop pieces.

    while(ptr = ptr->next) {
        ptr->val = ptr->val + 1;
    }

(a) Recursive Data Structure Loop

    while(ptr = ptr->next) {
        produce(ptr);
    }

(b) Traversal Loop

    while(ptr = consume()) {
        ptr->val = ptr->val + 1;
    }

(c) Computation Loop

Figure 2.2: Splitting RDS Loops decoupled execution of the two RDS loop pieces. Ideally, variable latency instructions from one piece should not impact code from the other piece unless the instructions from the two pieces are in fact dependent. To achieve decoupled fetch and execution, this dissertation proposes decoupled software pipelining (DSWP) which statically splits the original RDS loop into two distinct threads and executes the threads on a thread-parallel processor such as an SMT [69] core or chip-multiprocessor (CMP) [20] system. By enforcing uni-directional dependences amongst threads, this technique facilitates the use of inter-thread queues to buffer values and provide pipelined communication. More importantly, the queue doubles up as a decoupling buffer and insulates either thread from variable latency stalls in the other thread. Thus, DSWPed execution will allow the traversal code to run unhindered by computation code. Further, since the traversal code defines the longest data dependence chain in the loop, executing instructions from this piece as quickly as possible is critical to obtaining maximum performance in RDS traversal loops. Consider the loop shown in Figure 2.1, reproduced in Figure 2.2a. The traversal slice consists of the critical path code, ptr=ptr->next and the computation slice consists of ptr->val=ptr->val+1. A natural way to thread this loop into a traversal and

In the figure, the produce function enqueues the pointer onto a queue and the consume function dequeues the pointer. If the queue is full, the produce function will block waiting for a slot in the queue. The consume function will block waiting for data if the queue is empty. In this way, the traversal and computation threads behave as a traditional decoupled producer-consumer pair. The above parallelization lets the traversal thread make forward progress even in the event of stalls in the computation thread, and thus the traversal thread will have an opportunity to buffer data for the computation thread’s consumption. The buffered data also allows the computation thread to be relatively independent of stalls in the traversal thread, since its dependence on the traversal thread is only through the consume function, and its dependences will be satisfied by one of the buffered values. Thus, this parallelization effectively decouples the behavior of the two code slices and allows useful code to be overlapped with long variable-latency instructions without resorting to speculation or extremely large instruction windows. In addition to tolerating latency in the computation slice, the proposed threading also allows the traversal thread to execute far ahead of the corresponding single-threaded program due to the reduced size of the traversal loop compared to the original loop. On machines with finite issue width, this reduced size translates into more rapid initiation of the RDS traversing loads, provided that the previous dynamic instances of the loads have completed. Thus, the reduced-size loop allows the program to take better advantage of traversal cache hits to initiate traversal cache misses early. From an ILP standpoint, this allows for an overlap between traversal and computation instructions from distant iterations. This threading technique is called decoupled software pipelining to highlight the parallel execution of the threads. The decoupled flow of data from the traversal thread to the computation thread “stages” through the inter-thread queue, which can be implemented with any of the designs presented in Chapter 3.

2.5 Decoupled Software Pipelining

Thus far, it was assumed that the hardware would be provided with properly parallelized code. This parallelization is a tedious process and should be performed in the compiler. This section presents a decoupled software pipelining algorithm suitable for inclusion in a compiler. Although RDS loop parallelization would typically be performed at the assembly code level, this process is illustrated here in C for clarity. Consider the sample code of Figure 2.2a. As was mentioned previously, this loop structure is typical of recursive data structure access loops. Line 1 fetches the next data item for the computation to work on, and line 2 performs the computation in the loop. In order to parallelize the loop, the pieces of the code that are responsible for the traversal of the recursive data structure must first be identified. Since a data structure is recursive if elements in the data structure point to other instances of the data structure, either directly or indirectly, RDS loops can be identified by searching for this pattern in the code. Specifically, load instructions that are data dependent on previous instances of the same instruction must be identified. These induction pointer loads (IPLs) form the kernel of the traversal slice [39]. IPLs can be identified using augmented techniques for identifying induction variables [16, 39]. In the example in Figure 2.2a, the assignment within the while condition expression on line 1 is the IPL. Once the IPL is identified, the backward slice of the IPL forms the base of the traversal thread. In Figure 2.2a, the backward slice of the IPL consists of the loop (i.e., the backward branch), the IPL, and the initialization of ptr. To complete the thread, a produce instruction is inserted to communicate the value loaded by the IPL and, if initialization code exists, an instruction to communicate this initial value. Figure 2.2b illustrates the traversal thread code that would arise from applying this technique to the loop shown in Figure 2.2a. The computation thread is essentially the inverse of the traversal thread. Both loops share identical structure. Those instructions that are not part of the backward slice of the IPL but within the loop being split are inserted into the computation thread. In place of the IPL, a consume instruction is inserted.

Just as in the traversal thread, it is necessary to include consume instructions to account for loop initialization. Figure 2.2c shows the computation thread code corresponding to the original loop shown in Figure 2.2a. For each data-dependent pair of instructions split across the two threads, a dependence identifier is assigned to the corresponding producer-consumer pair. In order for this algorithm to be effective, the compiler must be able to identify the traversal slice, including dependent memory operations. Existing memory dependence analysis techniques identify whether there are stores in the loop that will affect later traversals (i.e., loops that modify the RDS). DSWP must handle these loops appropriately or exclude them from optimization. In cases where memory analysis is too conservative, not enough instructions will be placed in the computation thread and the benefits of the technique will be negated. Further, the compiler must balance the work of the original loop between the traversal and computation threads to realize optimal performance. While memory dependence analysis is important for decoupled software pipelining, the existing analyses used to compute load-store dependences are sufficient. DSWP does not attempt to identify completely independent code to split into threads. The analysis required for a general parallelization technique is much more sophisticated. Indeed, that problem has been heavily studied and has proved extremely difficult to solve. Instead, DSWP builds two threads that operate in a pipelined fashion, where the traversal thread feeds information to the computation thread.

2.6 Automatic DSWP

While the technique outlined in the previous section is specific to RDS codes, Ottoni et al. [40] present an automatic technique to apply DSWP to generic program loops. The automatic technique makes a departure from the notion of identifying critical and off-critical paths of regions targeted for DSWP. Instead, it focuses on identifying more generic program recurrences and achieves acyclic dependence flow among threads by ensuring that no single recurrence crosses thread boundaries.

This approach enables the automatic DSWP algorithm to be a truly general-purpose multithreading technique. The algorithm is described below. The automatic DSWP technique (autoDSWP for short) takes a loop’s dependence graph, which contains all register, memory, and control dependences, as input. In order to create an acyclic thread dependence graph for pipelined parallelism, it first identifies strongly connected components (SCCs) in the input dependence graph. The graph formed by these SCCs, by definition, is a directed acyclic graph (DAG_SCC).

The algorithm then partitions the DAG_SCC into the requisite number of threads while making sure that no cyclic inter-thread dependences are created. In this manner, autoDSWP parallelizes program loops into pipelined threads. The current thread model for DSWP is as follows. Execution begins as a single thread, called the primary thread. It spawns all necessary auxiliary threads at the beginning of the program. When the primary thread reaches a DSWPed loop, auxiliary threads are set up with the necessary loop live-in values. Similarly, upon loop termination, loop live-outs from auxiliary threads have to be communicated back to the primary thread. While this does create a cycle in the thread dependence graph, any dynamic cost due to this once-per-loop-invocation event will be rendered negligible by long-running pipelined multithreaded loop bodies. However, this cycle can become a significant overhead for short-running loops, which are not good candidates for DSWP in the first place. Performance improvement is obtained from coarse-grained overlap among pipelined threads and from improved variable latency tolerance provided by decoupled execution. The amount of overlap is, of course, limited by the performance of the slowest thread. Since the granularity of scheduling is a single SCC, the maximum theoretical speedup attainable from autoDSWP is 1/(weight of the heaviest SCC), with SCC weights normalized to the execution time of the original loop.

Thus, there is an upper bound on the performance obtainable from merely scheduling and balancing the SCCs across available threads.
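As an illustration (the 40% figure below is hypothetical, not a measurement from this dissertation), with SCC weights expressed as fractions of the original loop's execution time:

\[
\text{speedup}_{\max} \;=\; \frac{1}{\text{weight of heaviest SCC}}, \qquad
\text{e.g., heaviest SCC weight } 0.4 \;\Rightarrow\; \text{speedup}_{\max} = \frac{1}{0.4} = 2.5.
\]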

Optimizations that may be needed to further break or parallelize the individual SCCs to expose more parallelism are interesting avenues for future research. This dissertation uses code produced by an autoDSWP implementation in the VELOCITY compiler framework [67] to demonstrate the variable latency tolerance of DSWP, to study the performance scalability of DSWP when moving beyond two threads, and to explore various communication support options for PMT.

2.7 Summary

This chapter has presented the decoupled software pipelining (DSWP) transformation, which, through concurrent execution of statically partitioned threads, exposes coarse-grained thread-level parallelism. Decoupled execution of these threads, enabled by PMT, will enable DSWP to tolerate variable latency stalls. The variable latency stall tolerance property of DSWP was first demonstrated on manually parallelized RDS codes in [51]. The presentation of DSWP in this dissertation includes an improved evaluation with automatically generated DSWP codes, more benchmarks, and a modern simulator with a validated core model and a detailed memory hierarchy. Before proceeding to describe the various factors affecting the design of communication support for PMT in the next chapter, it is important to understand the distinction between the storage required to hold non-speculative inter-thread queue operands in DSWP and the storage required for techniques such as TLS. As mentioned in Section 2.2, DSWP, on account of its multithreaded execution model, is able to commit instructions from each thread independently of other threads. This enables DSWP to free up thread-local storage independently of other threads, and only operands for true inter-thread dependences need to be buffered in inter-thread queues. This is in stark contrast to TLS techniques, wherein the need to preserve single-threaded execution semantics forces the system to buffer the results of each and every operation of speculative threads until they become non-speculative.

The multithreaded execution model of DSWP obviates the need for such copious storage. The next chapter discusses various factors affecting communication support design, including design options for queue storage and queue access operations, and introduces two novel communication designs - the synchronization array and the snoop-based synchronization mechanisms. The evaluation of the latency tolerance and scalability aspects of DSWP is performed with the best-performing synchronization array design and is presented in Chapter 5.

Chapter 3

Communication Support for PMT

Proper architectural support mechanisms are key to the success of any compiler technique. Chapters 1 and 2 motivated PMT, introduced the decoupled software pipelining (DSWP) loop transformation as an instance of PMT, showed that it has very desirable latency tolerance properties, and also discussed its applicability as a general-purpose multithreading technique. While PMT techniques show promise, current architectures lack sufficient architectural and operating system support to allow efficient streaming communication of data from one thread to another. It is important to understand various aspects of inter-thread communication in DSWP before undertaking a full-fledged performance evaluation of DSWP. Accordingly, this chapter will highlight key aspects of pipelined inter-thread communication and will qualitatively discuss cost-performance tradeoffs of various communication support mechanisms. Dedicated hardware structures like hardware FIFOs [58] and scalar operand networks (SONs) [65] result in sub-optimal use of hardware, since they are used exclusively for inter-thread operand transfers. Memory traffic, for instance, cannot be multiplexed on these dedicated interconnects. Besides resulting in sub-optimal use of hardware, such dedicated structures (interconnect and storage) consume extra power, demand chip redesign effort, and often necessitate complex operating system modifications to handle context switches and virtualization.

In spite of these numerous concerns, dedicated hardware solutions offer the best performance. But it is important for architects to understand the cost-performance tradeoffs of such dedicated solutions vis-à-vis more cost-efficient, but lower-performing, solutions. This chapter takes a top-down approach to describing the design tradeoffs and how the choice of various design components influences these tradeoffs. The chapter argues that even though dedicated hardware solutions may have the best performance, by letting application behavior guide the design of underlying support mechanisms, it is possible to obtain low-cost, low-complexity solutions that perform almost as well as high-performance solutions. It observes that even though pipelined multithreaded applications can tolerate inter-core latency or transit delay by pipelining communication through inter-thread queues, their high communication frequency makes them very sensitive to the recurring intra-thread overhead imposed by communication, referred to as COMM-OP delay. An empirical characterization of pipelined streaming applications, presented in Chapter 6, reveals that DSWPed threads communicate at the rate of once every 5 to 20 dynamic instructions. Such high-frequency communication demands that individual communication operations be as efficient as possible. Armed with this understanding of the application behavior, this chapter then provides a detailed design space characterization that describes the various tradeoffs in implementing inter-thread queues for high-frequency streaming communication. It identifies four essential ingredients for any streaming mechanism: communication operation sequences to specify architectural reads and writes to inter-thread queues, a synchronization mechanism to prevent read (write) operations on empty (full) queues, intermediate storage for queue data (queue backing store) before they are consumed, and an interconnect fabric connecting processors and various backing stores. Although the design choice for each of these axes is orthogonal, certain design possibilities fit together more naturally than others.

Finally, this chapter describes two of the three novel communication designs presented in this dissertation - the synchronization array design, a dedicated hardware solution, and the snoop-based synchronization design, a low-cost alternative to the former. While the synchronization array design optimizes on all of the design axes, the snoop-based synchronization design relies on hardware support only for synchronization and uses low-cost design options for the other components. The presentation of the third technique, synchronization coalescing, and of the stream cache optimization to snoop-based synchronization will be motivated by the experimental evaluation in Chapter 6 and hence will be deferred to Chapter 7. The current chapter is organized as follows. The importance of reduced COMM-OP delay is qualitatively demonstrated in Section 3.1. Section 3.2 presents a detailed characterization of the design space of streaming communication mechanisms. Sections 3.3 and 3.4 present detailed designs of the synchronization array and the snoop-based synchronization techniques, respectively.

3.1 High-Frequency Streaming

This section first provides more precise definitions for transit delay and communication operation (COMM-OP) delay. Then, it characterizes streaming communication in PMT codes and illustrates why transit delays are tolerated and why reducing COMM-OP delays can increase application performance. It extends Taylor et al.’s treatment [65] of communication latency components with a discussion of their respective impact on streaming communication. Transit delay refers to the amount of time necessary to communicate a data value from one processor core to another. This delay excludes the time necessary to produce a value or to initiate communication; rather, it measures the effects of signal propagation delay, bus contention, network routing latency, and the like. This delay will tend to increase with the physical separation between cores or as wire delay increases.

Figure 3.1: Transit and COMM-OP delays. (Figure: a 0-150 cycle timeline of Thread A and Thread B, showing produce operations P:1-P:3 and consume operations C:1-C:2, annotated with the synchronization delay, inter-core delay, intra-core delay, and EXE delay components.)

COMM-OP delay, on the other hand, is a measure of the overhead experienced by a single core due to communication. More formally, the COMM-OP delay for a particular thread is the difference between the execution time of the thread with communication operations and the execution time of the same thread when communication operations have zero latency and consume no resources. For shared memory communication, for example, COMM-OP delay is caused by the execution of additional instructions necessary for communicating values and synchronizing threads. Depending on the code containing the communication operations and the implementation details of the memory subsystem, these extra instructions can slow down a thread by occupying valuable processor resources such as fetch bandwidth, functional units, and memory ports, and by causing execution stalls due to memory fences and interconnect contention. Figure 3.1 illustrates how COMM-OP delay and transit delay affect the execution of a pair of threads communicating via a single shared buffer (e.g., a shared-memory variable).

Producer Thread A:

    L: while (ptr = ptr->next) {
    P:     produce(ptr);
       }

Consumer Thread B:

    C: while (ptr = consume()) {
    X:     ptr->val = ptr->val + 1;
       }

Figure 3.2: A PMT example. (The figure also shows the control-flow graph of each thread, with nodes L, P, C, and X and the inter-thread dependence from P to C.)

To send a value from thread A to thread B, thread A executes a code sequence which ensures that the shared buffer is empty, then fills the shared buffer with the value to be communicated. The time during which this happens is labeled the COMM-OP delay or intra-core delay for thread A. Thread B will observe the value after the transit delay has elapsed. Thread B executes a code sequence that ensures the shared buffer is full, reads the value from the buffer, and finally marks the buffer empty. Only after the consumption notification from thread B reaches thread A can another value be transferred using the same shared buffer (the notification is shown as a dashed edge from the consume operation of thread B to the produce operation of thread A in Figure 3.1). Recall that pipelined multithreaded codes execute as concurrent, long-running, communicating threads. Figure 3.2 shows a two-thread pipeline and its control-flow graph along with its inter-thread dependence. Figure 3.3 illustrates the execution schedule of produce and consume operations of the program from Figure 3.2 under three different scenarios with progressively better communication behavior. In all three cases, the “L” and “X” operations are each assumed to take 10 cycles. Produce and consume operations are denoted by P:n and C:n respectively, where ’n’ indicates the iteration number of the corresponding communication operation. To start with, the produce and consume operations will each be assumed to take 20 cycles to perform the necessary synchronization and initiate communication with the other thread.

Figure 3.3: Effect of transit and COMM-OP delays on streaming codes. (The figure shows three execution schedules, (a), (b), and (c), over a 0-150 cycle window, plotting the produce operations P:n of thread A and the consume operations C:n of thread B, with each edge labeled by the queue slot the corresponding operation accesses.)

The inter-thread dependence edges going from the producer to the consumer are shown as solid arrows, while edges going the other way are shown as dashed arrows. The number on each edge indicates which queue slot the corresponding communication operation is accessing. First, Figure 3.3a shows the execution schedule of the program using only a single shared buffer. Notice that the application is able to complete only two iterations in the 150-cycle snapshot shown. In this case, because of the nature of the communication support, i.e., a single shared buffer, the transit delay becomes part of the COMM-OP delay for the produce and consume operations. This clearly is a bad design, since it negates the main benefit of PMT - that of tolerating transit delay. However, if, instead of a single buffer, a queue of buffers is used for communication, the threads can execute more efficiently. This is illustrated in Figure 3.3b. With a queue of buffers, there are two prominent improvements: first, the COMM-OP delay of a thread can now be overlapped with the computation delay and the COMM-OP delay of the other thread, and useful work can be done during transit delays; second, the extra buffering provides enough slack that transit delays are no longer part of the COMM-OP delay.

Other than the initial time taken for the first value to arrive in thread B, transit delays do not affect the timing of the system. Compared to the non-pipelined situation with only a single buffer location, the throughput (measured as iterations per time unit) has increased by a factor of 2.5: in Figure 3.3b, 5 iterations are executed in 150 cycles. (For computing throughput, the producer’s performance shall be used, since the consumer is affected by the one-time fill delay.) The time taken to complete one loop iteration is given by the sum of the computation time and the COMM-OP delay. While pipelining can remove transit delays from the critical path by overlapping them with useful computation, the 20-cycle intrinsic COMM-OP delay continues to be a recurring overhead for every loop iteration and serves to prolong the loop iteration time. Unlike transit delays, COMM-OP delays cannot be wished away altogether, since every loop iteration involves communication from one thread to another. Therefore, the best one can hope for is to minimize the overhead of individual communication operations as much as possible to improve the performance of PMT codes, especially PMT codes with high-frequency streaming behavior. Figure 3.3c shows that 8 loop iterations can be completed in 150 cycles by reducing the COMM-OP delay from 20 to 10 cycles. Notice that this is accomplished with exactly 3 inter-thread buffer locations, as was the case with Figure 3.3b. The performance improvement is entirely due to reduced COMM-OP delay. Recognizing the distinction between COMM-OP delay and transit delay will serve as a guide when exploring the design space of communication mechanisms. Section 3.2 discusses the pros and cons of several streaming communication design points and how each of them deals with the two different latency aspects of inter-thread communication.
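Before moving to the design space, a quick back-of-the-envelope check of the producer-side schedules in Figure 3.3 (an approximation that ignores pipeline fill effects):

\[
T_{\text{iter}} \;\approx\; T_{\text{compute}} + T_{\text{COMM-OP}}
\]
\[
\text{Figure 3.3b: } 10 + 20 = 30 \text{ cycles} \;\Rightarrow\; 150/30 = 5 \text{ iterations}; \qquad
\text{Figure 3.3c: } 10 + 10 = 20 \text{ cycles} \;\Rightarrow\; 150/20 = 7.5 \approx 8 \text{ iterations}.
\]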

3.2 Design Space

In designing high-frequency streaming support for future CMPs, architects will have to make trade-offs between hardware (area) costs, design effort, and OS costs to come up with the best design to meet desired performance goals. Any streaming support mechanism has four essential ingredients: communication operation sequences to specify architectural reads and writes to inter-thread queues, a synchronization mechanism to prevent read (write) operations on empty (full) queues, intermediate storage for queue data (queue backing store) before they are consumed, and an interconnect fabric connecting processors and various backing stores. For exposition purposes, interconnects will be split into two sub-axes: dedicated interconnects and pipelined interconnects. Although the design choice for each of these axes is orthogonal, certain design possibilities fit together more naturally than others.

3.2.1 Communication Operation Sequences

To avoid oversubscribing a processor core’s fetch and execution resources, the communication operation sequences cannot be too long. Additionally, to avoid extending the loop critical path, the dependence height of the sequence must also remain relatively short. Finally, to enable decoupling between producer and consumer loops, the code sequences must allow queuing behavior; sequences that use only a single buffer location for communication should be avoided.

Software Queues Using Shared Memory

• Description: Producer/consumer communication and synchronization can be implemented on conventional shared memory multiprocessors using software queues. Code for such an implementation is shown in Figure 3.4. The queue is composed of a head index, a tail index, and a shared memory array of (condition variable, data item) pairs.

    void produce(int value) {
        // spin until tail empty
        while (q[tail].full);
        // q[tail].full == 0
        q[tail].data = value;
        q[tail].full = 1;
        tail = (tail + 1) % q_size;
    }

    int consume() {
        // spin until head full
        while (!q[head].full);
        // q[head].full == 1
        int value = q[head].data;
        q[head].full = 0;
        head = (head + 1) % q_size;
        return value;
    }

Figure 3.4: Produce and consume code sequences for shared-memory based software queues.

The tail (head) index is updated exclusively by the producer (consumer). To access the queue, the producer (consumer) spins until the condition variable for the tail (head) queue slot indicates the slot is empty (full). Once the queue slot becomes available, the producer (consumer) can write (read) the data to (from) the queue and signal the condition variable that the slot is now full (empty). Once the data item is written (read), the tail (head) index should be updated to point to the new tail (head) slot. Since only a single thread will produce data into each queue and only a single thread will consume data from each queue, the head and tail pointers can be stored locally on the consumer and producer cores respectively. Additionally, no mutexes are required to protect the queue (although the appropriate memory fence instructions are required to enforce the correct ordering of operations; see the sketch after this list). The use of fine-grained condition variables allows an efficient implementation of software queues [63].

• Advantages: The key advantage of this methodology is that it requires no modifications to existing instruction set architectures (ISAs) or microarchitectures. Conventional shared memory mechanisms (cache coherence and memory consistency) provide the complete foundation for this software implementation.

• Shortcomings: Its main drawback is that the code sequences to produce and consume a single datum are quite lengthy. The C code shown in Figure 3.4 will likely expand into many instructions. The COMM-OP delay overhead resulting from these additional instructions and dependences may offset any gains obtained by partitioning the original code among multiple threads. Further, the presence of uncounted loops in these code sequences makes static ILP techniques inapplicable, and the presence of memory fence operations limits dynamic ILP, leaving very little scope for performance improvement.
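As a concrete illustration of where these ordering requirements fall, the following is a minimal sketch of the Figure 3.4 queue written with C11 atomics; it is an assumed, illustrative rendering rather than the implementation evaluated in this dissertation, with the release/acquire operations playing the role of the memory fences mentioned above (the queue depth and data type are arbitrary).

    #include <stdatomic.h>
    #include <stdbool.h>

    #define Q_SIZE 32                /* illustrative queue depth */

    struct slot {
        int value;                   /* queue data item */
        atomic_bool full;            /* full/empty condition variable */
    };

    static struct slot q[Q_SIZE];
    static int tail = 0;             /* private to the producer thread */
    static int head = 0;             /* private to the consumer thread */

    void produce(int value) {
        /* spin until the tail slot is empty */
        while (atomic_load_explicit(&q[tail].full, memory_order_acquire))
            ;
        q[tail].value = value;
        /* release: the data write above becomes visible before full is set */
        atomic_store_explicit(&q[tail].full, true, memory_order_release);
        tail = (tail + 1) % Q_SIZE;
    }

    int consume(void) {
        /* spin until the head slot is full */
        while (!atomic_load_explicit(&q[head].full, memory_order_acquire))
            ;
        int value = q[head].value;
        /* release: the data read above completes before the slot is recycled */
        atomic_store_explicit(&q[head].full, false, memory_order_release);
        head = (head + 1) % Q_SIZE;
        return value;
    }

Even in this compact form, each communicated value costs a spin loop, a flag update, and an index update on each side; this recurring per-operation work is the COMM-OP overhead the shortcomings above refer to.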

Produce and Consume Instructions

• Description: Producer/consumer communication can also be implemented by augmenting an existing ISA with special produce and consume instructions [15, 45]. The produce instruction has a source operand identifying a particular hardware queue, and a source operand to produce. Similarly, the consume instruction has a source operand to identify the queue to consume from and a destination operand to hold the read value. The hardware is responsible for delivering values between cores and for blocking the pipeline when attempting to either write to full queues or read from empty queues. In a system with more than two cores, application registers could be used to configure the target core for each hardware queue. The specific hardware used to implement this is independent of the ISA as long as the queue semantics are guaranteed.

• Advantages: This methodology overcomes many of the shortcomings experienced by software queue implementations. The produce and consume instruction sequences are reduced from tens of instructions down to a single instruction, resulting in smaller COMM-OP delays. This reduces both the resource over-subscription and the dependence height in the application code, thereby improving COMM-OP delays.

• Shortcomings: The principal shortcoming of this methodology is the need to augment the ISA and the core microarchitecture. However, a concise expression of produce and consume semantics may be well worth the incremental core design and verification costs.

Register-Mapped Queues

• Description: The instruction and dependence height overhead of produce and consume operations can be further reduced using register-mapped queues [18, 64]. Rather than modifying the ISA by adding produce and consume operations, a certain portion of the register address space is reserved to refer to inter-core queues rather than traditional registers. The microarchitecture is free to implement the underlying operand network [65] using any mechanism.

• Advantages: The main benefit is that, since any instruction can deposit its result into a communication queue and any instruction can read an operand from the communication queues, loops will contain fewer instructions and have lower dependence height than the corresponding loops with produce and consume operations. This reduced instruction count and dependence height may prove critical in resource-bound loops.

• Shortcomings: On the flip side, this methodology shares its shortcomings with the explicit produce and consume instruction methodology described earlier. It additionally creates increased architectural register pressure, since the register address space needs to be split between architectural registers and register-mapped queues. For loops with a large number of live values, register pressure may prove to be a dominant factor in loop performance; consequently, the decreased performance due to additional spill and fill code may outweigh the advantages of eliminating produce and consume instructions.

3.2.2 Dedicated Interconnects

Good interconnect design is key to streaming performance. For operations consuming data or synchronization information over the interconnect, any time spent stalled due to interconnect contention adds directly to the COMM-OP delay for that operation. Similarly, for operations producing data or synchronization information, interconnect contention may cause operations to back up in the processor pipeline, adding to the COMM-OP delay of those operations.

Streaming data could either share on-chip interconnects with regular data accesses or could use dedicated interconnects. This is an important design consideration given the impact of routing resources on chip area. As Kumar et al. [29] identified, on-chip interconnects occupy large areas, which may force designers to reclaim some area from on-chip caches, often with adverse impact. Ideally, high-frequency streaming support should not require new routing resources, but rather should be efficiently multiplexed with other requests on existing interconnects. However, depending on the application being run, high contention for the shared interconnect may cause communication operations to stall more often (increasing COMM-OP delays) than on a dedicated interconnect.

3.2.3 Pipelined Interconnects

While the transit delay of the interconnect is not important, the rate at which it can accept new requests directly affects COMM-OP delay. Pipelined interconnects increase the rate at which new requests can be serviced by the interconnect. For a non-pipelined interconnect with an N-cycle latency, only one request can be carried by the interconnect every N cycles. However, an M-stage pipelined interconnect can initiate a new request every N/M cycles. This increased throughput reduces contention for the interconnect, thereby reducing COMM-OP delay (and also improving the performance of other operations sharing the interconnect). This disparity between pipelined and non-pipelined interconnects will become more pronounced as larger scale CMPs become more commonplace. Of course, this improved performance does not come for free. Pipelined interconnects are more complex to build than non-pipelined interconnects. Furthermore, memory systems using pipelined interconnects must use coherence protocols more sophisticated than simple snoop-based protocols to deal with multiple in-flight requests in the interconnect.
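To put illustrative numbers on the rate argument above (N and M are hypothetical values, not parameters from the evaluation):

\[
\text{request acceptance interval} = \frac{N}{M} \text{ cycles}, \qquad
\text{e.g., } N = 8,\; M = 4 \;\Rightarrow\; \text{one request every } 2 \text{ cycles instead of every } 8.
\]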

3.2.4 Synchronization

Concurrently executing threads require a synchronization mechanism to determine when it is permissible to read from or write to a queue entry. The time from when a produce (consume) operation begins execution to when it can actually write (read) data to (from) the queue entry is called synchronization delay. Since every communication operation has to synchronize before reading or writing data, this delay directly affects the COMM-OP delay. The key to reducing synchronization delay for a communication operation is to ensure that the necessary synchronization information is delivered to its processor core well ahead of the operation’s execution. Recall from Section 3.1 that pipelining communication (i.e., communicating using a queue of buffer locations rather than a single buffer location) increases the time between successive synchronizations on a single buffer location. This pipelining offers synchronization mechanisms the necessary slack to deliver synchronization information before it is read by a synchronization operation. The synchronization design options, discussed in this section, vary in the amounts of software and hardware logic, the backing store used for synchronization data, and the level of OS support.

Software Techniques

A detailed description of software synchronization was given in Section 3.2.1. The synchronization data consists of an array of full-empty (FE) condition variables that are set and reset by produce and consume operations respectively. Since both produce and consume operations modify the same memory locations, with traditional caching mechanisms, synchronization delays will be significantly increased due to frequent cache misses; the first access to a particular queue slot will incur a compulsory miss and subsequent operations will incur coherence misses. (The producer and consumer may benefit from spatial locality if multiple queue entries are located in a single cache line.) This latency will be directly observed by both the producer and consumer, since each must read the condition variable before being able to proceed. To avoid such penalties, the condition variable for a queue slot should be moved from the core that writes it to the core that reads it well ahead of the read operation.

Prefetching and other microarchitectural optimizations like write-forwarding (described below) may be able to mitigate these shortcomings. Other software queue implementations that track global queue occupancy rather than individual slot occupancy are possible. However, such mechanisms require a coarse-grain lock that guards access to the entire queue data structure. Consequently, produce and consume operations cannot occur simultaneously even if they are accessing separate portions of the queue. Such implementations will incur costly synchronization delays, since significant contention for the queue lock will exist and cache lines that store the queue occupancy must ping-pong between the producing and consuming cores. The ping-ponging can be minimized by using lazy pointer updates, valid bits, and sense reversal, as described by Mukherjee et al. [37].

Hardware Techniques

Just as in software techniques, the key to low synchronization delay is to ensure that synchronization data is maintained as close as possible to the processor core in which it is read. For example, maintaining full-empty (FE) bits close to the consumer core may reduce consume synchronization delay, but will increase produce synchronization delay by forcing produce operations to go all the way to the consumer core to read FE bits. Keeping the bits in a centralized location will affect both produce and consume synchronization delays. Instead, if the FE bits are replicated and one copy is maintained at each of the producer and consumer cores, low produce and consume synchronization delays can be achieved. In such a replicated setup, the two copies may be out of step with each other due to delays in the propagation of updates from one core to another. However, such delays will not affect correctness (since the out-of-date information is conservative), nor will they affect performance provided there are enough empty (full) queue slots to write (read) data to (from). Quite a few implementations of such mechanisms exist, and they are often influenced by the choice of queue backing store.

For example, StreamLine [7] uses distributed occupancy counters to track memory accesses to special stream pages. The synchronization array (to be presented in Section 3.3), for instance, maintains distributed head and tail pointers to a dedicated circular buffer. By using dedicated synchronization storage specialized for streaming communication, these techniques avoid problems that stem from using the generic memory subsystem. However, in these schemes, the additional hardware synchronization state has to be added to the OS context and needs to be saved and restored on context switches. Additionally, special ISA and microarchitectural extensions and/or OS support may be needed to identify queue read/write operations to the synchronization hardware.

3.2.5 Queue Backing Store

The time from when a consume operation requests data from the backing store to when it receives the data contributes directly to COMM-OP delay. Just as with synchronization, this delay can be minimized by ensuring that data is stored as close as possible to the processor core that will consume the data and by ensuring that the backing store is not over-subscribed. Conversely, adding new, dedicated backing stores to a CMP design increases both the amount of die area dedicated to streaming communication and the amount of OS support required for context switches and virtualization. This section will discuss various design choices for the queue backing store in the context of this trade-off. Note that while this section discusses design options for queue data storage, the issues and mechanisms described here apply equally to synchronization storage.

Shared Memory Store

The memory subsystem serves as a natural store for data being communicated between threads. Architectures and operating systems already provide mechanisms to share data through memory, and most memory systems are equipped with caches to buffer data close to a processor core, reducing access time.

Streaming communication, however, does not exhibit the same locality as traditional memory accesses. Consequently, designers must decide how streaming accesses should interact with the traditional memory hierarchy. Since streaming accesses exhibit poor temporal locality (producing and consuming threads stride across the queue, rather than access the same element multiple times), certain caches in the hierarchy may want to avoid caching streaming data, since caching streaming accesses may lead to the eviction of useful data. While there is poor temporal locality within a core, writes to queue locations are soon followed by reads to the same location by other cores. Consequently, caching lines for queue storage in private caches increases coherence traffic between the cores producing and consuming values. Worse still, the delay introduced by these coherence requests contributes directly to COMM-OP delay, since the request is demand-initiated by the consume operation. Unfortunately, avoiding caching in private caches requires that every produce or consume operation access a shared level of the memory hierarchy, creating contention for its ports. Since access times to centralized caches are already typically quite large, forcing all produce and consume operations to contend for few ports will likely lead to large COMM-OP delay in addition to affecting normal memory accesses. Streaming accesses do, however, exhibit spatial locality. Streaming produce and consume operations stride across the queue data. Consequently, caching queue storage in a core’s private cache can reduce COMM-OP delay. For a consumer thread, the first access to a line will incur a large access delay, but successive consumes will be cache hits. Furthermore, if there is sufficient decoupling between a producer thread and a consumer thread (i.e., they are writing to and reading from distinct cache lines), then coherence traffic occurs at the cache-line granularity rather than for each produce and consume operation. Unfortunately, in situations with little or no decoupling, false cache-line sharing occurs, since the producer and consumer threads will be accessing nearby queue entries that fall in a single cache line. This false sharing can create significant coherence traffic and significantly increase COMM-OP delay.

Figure 3.5: Two queue layouts for memory backing stores. (Top: queue layout unit 8 - eight queue slots, slots 0 through 7, packed into one 128-byte cache line, each slot holding an 8-byte queue data item and a lock byte padded to 8 bytes. Bottom: queue layout unit 1 - one queue slot per cache line.)

Different cache-line layouts can mitigate this problem at the expense of wasted space in the caches. Two possible layouts are shown in Figure 3.5. In the figure, synchronization and queue data are co-located to improve locality of accesses. The layout at the top of the figure places 8 queue entries on a single cache line and can suffer from false sharing. The queue layout on the bottom of the figure, conversely, pads the size of each queue entry so that there is only one entry per cache line. By construction, this layout will not experience any false sharing, but wastes large portions of the cache. The layout can be made as dense or sparse as desired. The number of queue entries per cache line is referred to as the queue layout unit (QLU).

Prefetching and Write-forwarding. Typically, consume operations (at least ones accessing the first queue entry on a cache line) miss in the local cache and have to fetch data from a remote cache. Such remote data fetches increase the consume COMM-OP delay compared to a local cache hit. In order to bring down this latency, two mechanisms have been proposed in the literature - remote prefetching and write-forwarding. In remote prefetching, the consumer thread issues prefetch instructions before it actually needs the data and tries to overlap the remote data fetch latency with other useful work. The consumer, however, must determine when to issue the prefetch, as overly eager prefetchers may prematurely steal cache lines from the producer’s cache. This may cause the producer to slow down appreciably. The second technique, write-forwarding [1, 32, 44, 48], addresses this problem by making the producer thread forward shared cache lines to the consumer’s cache after it is done producing its data. This way, the timing of the inter-core data transfer can be optimized so that neither thread suffers any unnecessary slowdown. Write-forwarding could be implemented either with special completers on store instructions or, in processors (e.g., MIPS) that support software-installable TLBs, by having the OS mark certain pages as “streaming” and modifying the memory subsystem to deal appropriately with accesses to such pages. Other mechanisms have been proposed to eagerly transfer lines from a producer’s cache to a consumer’s cache [47]; however, the short time spans between lock access and data access present in high-frequency streaming codes make them inappropriate for this domain.
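To make the layout trade-off concrete, here is a minimal sketch of the two layouts of Figure 3.5, assuming 128-byte cache lines and 8-byte data items as in the figure; the struct and constant names are illustrative, not taken from the evaluated implementation.

    #include <stdint.h>

    #define CACHE_LINE 128                 /* bytes, as in Figure 3.5 */
    #define Q_SLOTS    32                  /* illustrative queue depth */

    /* QLU 8: eight 16-byte slots share one 128-byte line (dense, may false-share). */
    struct slot_qlu8 {
        int64_t data;                      /* 8-byte queue data item   */
        char    full;                      /* lock/full byte ...       */
        char    pad[7];                    /* ... padded out to 8 bytes */
    };

    /* QLU 1: one slot per cache line (sparse, no false sharing, wastes cache). */
    struct slot_qlu1 {
        int64_t data;
        char    full;
        char    pad[CACHE_LINE - sizeof(int64_t) - 1];
    };

    /* Aligning the arrays to a line boundary keeps each QLU-1 slot on its own line. */
    static _Alignas(CACHE_LINE) struct slot_qlu8 q_dense[Q_SLOTS];
    static _Alignas(CACHE_LINE) struct slot_qlu1 q_sparse[Q_SLOTS];

With the dense layout, a write-forwarding scheme would ideally forward a line only after all eight of its slots are filled, which is the streaming-specific optimization discussed next.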

Next, a streaming-specific optimization to standard write-forwarding schemes is presented. Stream instructions exhibit strong locality when accessing a cache line, since accesses are made to consecutive stream locations. The optimization ensures that this spatial locality is not damaged by write-forwarding. Rather than forwarding the cache line after each queue entry is filled, the cache controller forwards a line after N queue entries on the line are filled. Typically, the parameter N is set equal to the QLU so that a line is forwarded only after all queue entries on the line are filled. The implementation cost within the cache controller is minimal, since it need only be parameterized with the value N and the size of each queue entry. Since accesses to successive queue entries will occur in order, accesses to certain regions of each line will initiate write-forwarding. This optimization will help reduce the average COMM-OP delay by ensuring the maximum number of cache hits possible to a line before forwarding it. For example, for a layout that has only one queue slot per cache line (QLU 1), N is equal to 1 and any write to the line triggers forwarding. However, for a denser layout scheme that packs 8 data items onto a cache line (QLU 8), N is 8 and forwarding is done on detecting a write to the last one-eighth segment of the cache line, and so on. If the queue slots are equally spaced out on a cache line, this triggering fraction, WFF, is equal to 1/QLU.

Dedicated Store

A backing store implemented with dedicated hardware is another possibility. Having a dedicated backing store is advantageous as it does not pollute the memory subsystem with short-lived streaming data. It also helps avoid all streaming-related coherence traffic. By ensuring that streaming traffic does not contend with shared memory requests, dedicated stores can prevent normal memory traffic from increasing COMM-OP delays (or streaming operations from decreasing the performance of normal memory operations). While dedicated stores can potentially improve performance, they do not come without a cost.

Dedicated stores consume valuable die area, possibly reducing available on-chip cache memory and negatively impacting the performance of non-streaming sections of applications. Additionally, the contents of the dedicated store become part of the OS context for a process. As such, OS and hardware support is necessary to context switch and virtualize these resources.

• Centralized Dedicated Store: A centralized dedicated store adds a single streaming-specific memory to the CMP. With this design, all cores can share all the storage added to the CMP. Unfortunately, this design suffers from scalability problems as more cores try to access the single structure. Additionally, since the structure is centrally located, for all but the smallest CMPs, the structure will be farther from the cores than the local caches. Consequently, the time required to access the store will likely be longer than a local cache hit. This translates to a comparatively larger COMM-OP delay.

• Distributed Dedicated Store: Alternatively, streaming-specific memories can be added to each core in the CMP, rather than adding a single central structure. This design scales better than the centralized dedicated store, since a single common structure does not need to be accessed by all cores. Additionally, each piece of the distributed store can be located close to the consuming processor, reducing the COMM-OP delay for consume operations (recall that a distant backing store does not increase the COMM-OP delay for produce operations). Unfortunately, this design prevents cores from sharing the added storage. Consequently, more die area may be consumed for the dedicated store.

Network Backed Queues

Intermediate nodes in an on-chip network often buffer data to implement pipelined interconnects. By preserving data ordering, they act as effective FIFOs [65]. They inherit all the advantages of dedicated stores.

Their distributed nature also makes them scalable. Unfortunately, the amount of decoupling available to the executing threads is directly proportional to the physical separation of their cores on the chip. The larger the separation, the greater the available decoupling. Relying solely on this storage can affect performance, as threads executing on nearby cores will not get sufficient decoupling to tolerate variable latency stalls in the individual threads. Additionally, when network buffers are the sole carriers of inter-thread architectural state, the OS overhead for context switches becomes more pronounced. While switching out a consumer thread, the OS has to check network buffers along the paths from every producer to this consumer before doing the switch. Alternatively, every time data arrives at a node for a thread that has been switched out, an interrupt could be triggered to make the OS append the incoming data to the swapped-out context state. Conventional memory networks do not have this problem: in network-backed queues, the network holds the sole copy of the inter-thread data, so any in-flight data must be explicitly saved and restored, whereas for conventional memory networks there is always a safe copy of the data somewhere in the memory subsystem, so that even if a thread is context-switched, it can perfectly replay its computation by re-requesting the data from memory.

3.3 The Synchronization Array

The synchronization array design draws upon the arguments made in the design space description and optimizes on all design axes to achieve high performance. In particular, the synchronization array design uses special produce and consume instructions to reduce instruction overhead, uses a dedicated backing store to avoid contention with memory accesses, uses distributed hardware synchronization, and uses a dedicated network to carry inter-thread queue operand traffic. This section expands upon the original discussion of the synchronization array presented in [51].

It first motivates the design of the synchronization array and describes its functionality and operation. Then, it discusses synchronization array scalability issues and finally shows how the synchronization array can be integrated with a commodity processor pipeline such as that of the Itanium 2.

3.3.1 Operation

The instruction set architecture (ISA) abstraction of the synchronization array is a set of blocking queues accessed via produce and consume instructions. The produce instruction takes an immediate dependence number and a register identifier as operands. The value in the register is enqueued in the virtual queue identified by the dependence number. The consume instruction dequeues data in a similar fashion. The compiler introduces produce and consume instructions at the source and destination of any dependence going from one thread partition to another. While ordinary produce and consume instructions suffice to communicate register and control dependences from one thread to another, communicating memory dependences requires special memory fence semantics on the produce and consume instructions to guarantee correct memory consistency. Variants of the produce and consume instructions - produce.rel and consume.acq instructions respectively - are used to communicate memory dependences. A produce.rel instruction has release semantics with respect to memory stores. It will execute only after all prior stores have committed (i.e., made their effects visible) in its thread. On the other hand, a consume.acq has acquire semantics with respect to memory loads. An uncommitted consume.acq instruction will prevent any subsequent memory load instructions in that thread from issuing. In order to correctly synchronize a memory dependence between a store in one thread and a load in another thread, the compiler inserts a produce.rel instruction (to some queue) after the store in the first thread and a consume.acq instruction (to the same queue) before the load in the second thread. This will guarantee correct memory semantics.
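The effect of the produce.rel/consume.acq pairing can be roughly modeled in software; the sketch below is only an analogy using C11 fences together with the software-queue produce/consume shown earlier in this chapter, not the hardware instructions themselves, and the buffer and function names are hypothetical.

    #include <stdatomic.h>

    /* From the software-queue sketch in Section 3.2.1. */
    void produce(int value);
    int  consume(void);

    int shared_buf[1024];                               /* memory shared by the two threads */

    /* Thread 1: contains the store side of the memory dependence. */
    void thread1_store(int idx, int v) {
        shared_buf[idx] = v;                            /* the store                          */
        atomic_thread_fence(memory_order_release);      /* "all prior stores visible" (.rel)  */
        produce(idx);                                   /* plays the role of produce.rel      */
    }

    /* Thread 2: contains the load side of the memory dependence. */
    int thread2_load(void) {
        int idx = consume();                            /* plays the role of consume.acq      */
        atomic_thread_fence(memory_order_acquire);      /* later loads wait for this (.acq)   */
        return shared_buf[idx];                         /* observes thread 1's store          */
    }

In the actual ISA-level design, this ordering is folded into the produce.rel and consume.acq instructions themselves, so no separate fence instructions are needed.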

Correct synchronization and communication between a produce-consume pair involves two main questions - which queue location to access and when to access this location. In the past, FIFO channels have been used to implement inter-core queues [58]. FIFO channels are organized as an array of latches, and data is shifted from one latch to the next in consecutive cycles. In such implementations, determining which queue location to access is straightforward. The producer (consumer) thread writes to (reads from) the latch at the producing (consuming) end of the FIFO channel. The second question of when to access this location is also resolved by looking at the fullness/emptiness of the end latch. While the implementation of the basic operations is straightforward with FIFOs, such implementations suffer from two limitations.

• FIFOs serialize all communication through a queue at the producer and consumer ends, preventing out-of-order issue processors from speculatively writing to and reading from queue locations. While speculative writes are not useful in practice¹, speculative out-of-order reads from a queue are key to good performance on out-of-order issue processors with large instruction windows.

• FIFO channels are not scalable. Since the produce and consume instructions are exposed at the ISA level, any thread on any core can execute these instructions. Thus, in order to support dedicated queues between any two threads, a FIFO implementation necessarily has to have FIFO channels running between every pair of cores or must have logic to efficiently multiplex and demultiplex data coming from and going to various cores.

The synchronization array microarchitecture is designed to address these concerns. Unlike FIFO channels, the synchronization array separates the determination of which queue location to access and when to do the access into two different entities.

¹They are useful only at times when a queue is empty. However, if the queue is empty most of the time, then it calls for a redistribution of work amongst communicating threads.

The task of determining which queue location to access is handled by special synchronization array rename logic. The compiler and rename logic together ensure that both sides of a communicating pair of produce and consume instructions are mapped to the exact same queue slot or dependence array slot. For each dependence number d, the renamer maintains a count of the number of outstanding mappings and a next-mapping field that indicates which queue slot the next produce or consume instruction to dependence number d will be mapped to. The next mapping for d is incremented (with wrap around) after a successful rename operation. Through careful code generation, the compiler ensures that dynamically there is a one-to-one correspondence between produce and consume instructions across communicating threads for each dependence number. This guarantee, combined with the fact that the produce and consume instructions flow through the rename stage in program order in their respective cores, ensures that both instructions of every produce-consume dependence pair are renamed to the same unique dependence array slot. The renamer also reclaims dependence array slots after a successful synchronization. This is very similar to the rename logic in modern out-of-order processors. However, free-list maintenance is simplified in the SA rename logic, since there is only one unique produce and consume instruction per allocated slot. Queue data storage and the full/empty bit for a queue slot are maintained in the synchronization array itself. Figure 3.6 shows the structure of the SA in more detail. The state machine in each dependence array entry, shown in Figure 3.7, ensures that the consume instruction reads the data value only after the corresponding produce instruction is done producing. When the produce instruction executes, it writes the data into the data field, writes the producer tag field with its own tag, and updates other architectural state as shown in Figure 3.7. When a consume is ready to execute, the SA checks to see if data is available for read (i.e., checks to see if the Full bit is set). If it is, it immediately returns the data and frees the entry. If not, the SA remembers the tag of the consume instruction in the consumer tag field.

Figure 3.6: Synchronization Array structure. (The SA is organized as an array of dependence arrays, indexed by static dependence number; each dependence array entry holds a Full bit, a Read bit, a W-bit data field, a producer tag, and a consumer tag.)

Figure 3.7: Synchronization Array State Machine. (An entry leaves the Empty state (Full=0, Read=0) when either the producer arrives first, buffering its data (Full=1, Read=0), or the consumer arrives first (Full=0, Read=1); when the other party arrives, the data is delivered, the producer/consumer rename tables are updated, and the entry eventually reverts to Empty.)

In such cases, consume instructions take multiple cycles and any dependent instructions will stall the corresponding pipeline. Overall, there is a def-use latency of at least one cycle between a consume instruction and any dependent instructions. When the corresponding produce instruction arrives, the SA delivers the data to the writeback stage of the processor along with the consumer tag. This simple protocol ensures correct synchronization between two communicating instructions. Since queues have space enough to buffer exactly one produce and one consume instruction per slot, the renamer stalls the pipeline whenever the outstanding mappings count for any dependence number d becomes equal to the number of slots per queue. The SA notifies the producer and consumer core with the tag of the produce and consume instructions, respectively, whenever an array slot reverts to the Empty state. This allows the renamer to reclaim dependence array slots and mark them as available for future allocations. In this way, the synchronization array and rename logic preserve the illusion of a queue. Once past the rename stage, the instructions can execute out-of-order with respect to each other. This removes the serialization bottleneck seen with FIFOs. Scalability concerns will be addressed in Section 3.3.3.

While the decentralized hardware avoids lookups into the SA, in the absence of global broadcasts of issue and completion activity from all other cores, the rename logic of a core can only deliver accurate information about slots allocated to and accessed by dependence numbers for which this core is the exclusive producer or consumer. As a result, for a given dependence number, there must be a unique core that produces data and a unique core that consumes it. The designated producer-consumer pair of cores for a particular dependence number is allowed to change only after all issued produce and consume instructions corresponding to that dependence number have retired. In order to support multiple dependences between the two threads, the SA is organized as an array of dependence arrays, as shown in Figure 3.6. This provides the compiler with a lot of scheduling flexibility with respect to produce and consume instructions in each thread.
For example, by allocating a unique dependence number to every true inter-thread dependence, produce and consume instructions in a program control block can be moved to any location within that block. The compiler can also constrain SA instruction movement by assigning instructions the same dependence number.
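
To make the mechanism concrete, the following C fragment models the per-queue rename state and dependence array entries described above as software data structures. It is only an illustrative sketch: the queue count, slot count, field widths, and function names (sa_rename, sa_reclaim, and so on) are assumptions rather than the actual hardware design, and the two cores' renamers are collapsed into a single shared model.

#include <stdbool.h>
#include <stdint.h>

#define NUM_QUEUES      32   /* assumed number of dependence arrays            */
#define SLOTS_PER_QUEUE 32   /* assumed number of slots per dependence array   */

/* One dependence array entry (cf. Figure 3.6). */
typedef struct {
    uint64_t data;
    uint32_t producer_tag;   /* tag of the produce that wrote the data         */
    uint32_t consumer_tag;   /* tag of a consume that arrived before it        */
    bool     full;           /* producer has deposited its value               */
    bool     read;           /* consumer arrived first and is waiting          */
} sa_entry_t;

/* Per-queue rename state: the next slot to hand out and how many handed-out
 * mappings have not yet been reclaimed.                                       */
typedef struct {
    unsigned next_mapping;
    unsigned outstanding;
} sa_rename_t;

sa_entry_t  sa[NUM_QUEUES][SLOTS_PER_QUEUE];
sa_rename_t renamer[NUM_QUEUES];

/* Rename the next produce or consume to dependence number d.  Because both
 * sides flow through rename in program order and the compiler guarantees a
 * dynamic one-to-one pairing, handing out slots round-robin maps each
 * produce-consume pair to the same slot.  Returns -1 when every slot of the
 * queue is allocated, which models the renamer stalling the pipeline.         */
int sa_rename(unsigned d)
{
    sa_rename_t *r = &renamer[d];
    if (r->outstanding == SLOTS_PER_QUEUE)
        return -1;                                  /* queue full: stall       */
    int slot = (int)r->next_mapping;
    r->next_mapping = (r->next_mapping + 1) % SLOTS_PER_QUEUE;
    r->outstanding++;
    return slot;
}

/* Reclaim a slot once both sides have synchronized and the entry has
 * reverted to the Empty state.                                                */
void sa_reclaim(unsigned d, int slot)
{
    sa[d][slot].full = false;
    sa[d][slot].read = false;
    renamer[d].outstanding--;
}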

3.3.2 Handling Control Speculation

All modern processors use control speculation as a standard mode of operation to achieve high instruction throughput in the presence of frequent branch instructions. Support for control speculation in a processor pipeline demands that any microarchitectural element whose state is essential for program correctness be able to recover from misspeculation events. Two basic issues arise when handling control speculation in a synchronization array implementation. First, can produce and consume instructions be renamed speculatively? Second, what does it mean for produce and consume instructions to execute speculatively?

Renaming

The rename logic described above is quite similar to register renaming in traditional out-of-order processors. Thus, similar recovery mechanisms can be used to restore the synchronization array renamer to a correct prior state upon misspeculation. In this dissertation, an Alpha 21264 [11, 25] style checkpointing scheme is used to save the renamer's state upon encountering a branch instruction and to restore the renamer to the corresponding checkpointed state on a branch misspeculation.
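
As a rough illustration of the checkpointing approach (not the Alpha 21264's actual mechanism), the sketch below snapshots the per-queue rename state from the earlier sketch whenever a branch is renamed and rolls back to that snapshot on a misprediction. The structure names and the fixed checkpoint count are assumptions.

#include <string.h>

#define NUM_QUEUES      32   /* assumed number of dependence arrays            */
#define MAX_CHECKPOINTS 64   /* assumed limit on in-flight branch checkpoints  */

typedef struct {             /* per-queue SAR state, as in the earlier sketch  */
    unsigned next_mapping;
    unsigned outstanding;
} sa_rename_t;

typedef struct {
    sa_rename_t q[NUM_QUEUES];
} sar_state_t;

static sar_state_t live_state;                    /* current SAR state         */
static sar_state_t checkpoint[MAX_CHECKPOINTS];   /* one snapshot per branch   */

/* Called when a branch passes through the rename stage. */
void sar_checkpoint(int branch_id)
{
    memcpy(&checkpoint[branch_id], &live_state, sizeof live_state);
}

/* Called when that branch turns out to be mispredicted: discard the mappings
 * handed out to squashed produce/consume instructions by restoring the
 * snapshot taken at the branch.                                               */
void sar_recover(int branch_id)
{
    memcpy(&live_state, &checkpoint[branch_id], sizeof live_state);
}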

Synchronization array access

Consume instructions, like load instructions, are at the head of dependence chains and greatly benefit from speculative synchronization array access instead of having to wait until they commit. Speculative read access can drastically lower the average def-to-use latency between a consume instruction and an instruction dependent on it.
The only thing to watch out for in handling speculative consumes is that the synchronization array should not reclaim a slot on a speculative access; instead, it should wait for the core to signal that the instruction has successfully committed and that the slot can be reclaimed. While conceptually consume instructions are in charge of both reading the synchronization array and initiating slot reclamation, the two actions have to be performed at different points in the lifetime of a consume instruction in order to support speculative consume instructions. Produce instructions, on the other hand, being at the bottom of dependence chains, do not really benefit from speculative write access. Further, if another thread consumes a value written by a control-speculative produce instruction, the speculation "leaks" from the producing core to the consuming core. Any misspeculation recovery would then have to be a multi-core recovery mechanism, which is expensive. Besides, as mentioned earlier in footnote 1, speculative produce operations are useful only when the queues are empty most of the time, a situation the compiler/partitioner should strive to avoid.

3.3.3 Performance Scalability

The above description simply places the synchronization array between two cores and takes advantage of the physical proximity to enable fast core-to-synchronization-array communication. As the number of on-chip cores increases, the access latencies may vary depending on the location of the synchronization array, perhaps even non-uniformly across cores. Increased latency will affect the transfer delay of four different types of signals from/to the synchronization array. Given a producer and a consumer core, the four types of signals are: a request signal from the producer core to the SA, a consume signal from the SA to the consumer core, a slot reclamation signal from the consumer core to the SA to enable it to reclaim queue items after successful synchronization, and a similar "synchronization over" signal from the SA to the producer core to inform it about slot reclamations.

Placing the synchronization array closer to a producing core would enable it to quickly notify the producing core of queue slot reclamations. Three different organizations are possible for the synchronization array when moving to larger multi-core processors: it can be located on the producer side, on the consumer side, or in some central location. Empirical results indicate that placing the synchronization array on the consumer side offers the best performance. The second best performance is obtained from the organization that locates the synchronization array midway between communicating producer and consumer cores. The worst organization is locating the synchronization array on the producer side. The reason for this bias towards the consumer side is that such an organization lowers consume-to-use latency, which improves the consumer thread's performance. This in turn initiates recovery of slots at a faster rate, thereby improving the producer thread's performance. Moving the SA closer to the producer core slows down the consumer, which ultimately slows down the overall execution. Experimental evaluation confirms this intuition.

3.3.4 Integrating with the Itanium 2 pipeline

This section explains how the synchronization array can be easily integrated with a processor pipeline. The Itanium 2 pipeline is chosen as an example for exposition purposes.

It first presents a brief description of the main pipeline of the Itanium 2 processor [24] before explaining how the synchronization array renaming and access logic is integrated into it.

The Itanium 2 is an implementation of the Intel IA-64 architecture that can fetch and issue up to six instructions per cycle. Its main pipeline has 8 stages, split into a front end and a back end. Figure 3.8 shows the processor's integer pipeline along with the logic and datapaths added for synchronization array integration.

The front end of the Itanium 2 pipeline consists of two stages, Instruction Pointer Generation (IPG) and Bundle Rotation (ROT), that prepare instructions for execution and feed them into the Instruction Buffer (IB).

[Figure 3.8 depicts two Itanium 2 cores sharing a synchronization array. Each core's main pipeline consists of IPG (IP generation and fetch), ROT (instruction rotation), EXP (template decode, expand, disperse), REN (physical register renaming), REG (register file access), EXE (ALU execution), DET (misspeculation detection and deferred exception handling), and WRB (writeback), with an Instruction Buffer (IB) between the front end and the back end; the added logic comprises SAR (sync array rename), POT (predicate-off tracking), and SAA (sync array access).]

Figure 3.8: Dual-core CMP configuration with Synchronization Array. The individual cores are Itanium 2 cores. Shaded pipeline stages indicate extensions to the Itanium 2 datapath to support synchronization array instructions. The dark arrows indicate new datapaths added. The SAR/POT → SAA datapath carries speculative SA accesses. The DET → SAA datapath carries non-speculative SA accesses. The SAA → SAR signal carries synchronization over notifications from the SA to the cores, and the SAA → DET datapath carries the queue value read by consume instructions from the SA.

The back end fetches instruction groups from the IB and organizes the individual instructions into issue groups that are guaranteed to consist of independent instructions (EXP). Structural hazard detection for functional units and register file ports is performed in EXP as well. Register names are then coordinated with the Register Stack Engine and renamed accordingly (REN). The register values are then fetched from the appropriate register file (REG). It is at this stage that the normal in-order pipeline stalls an issue group if any of its source operands is not ready. Instructions then continue through execution (EXE), exception detection and branch resolution (DET), and finally write their results back to the register file (WRB). Floating point and memory instructions have extra execution stages between DET and WRB.

In the SAR stage, produce and consume instructions are renamed to the correct SA slots, and they proceed to access the synchronization array in the SAA stage. Conceptually, the renaming has to happen at least a cycle before the synchronization array is accessed. While renaming for produce instructions can be delayed until past the DET stage, consume instructions must be renamed no later than the REG stage since they must access the synchronization array in the EXE stage to provide a 1-cycle consume-to-use latency.

Renaming cannot be done before the REN stage since the Itanium 2 pipeline explodes bundles into instructions only in the EXP stage. Thus, renaming can be done either in parallel with the REG stage or in parallel with the REN stage without any performance penalty. This dissertation models the synchronization array renaming logic in the REG stage, as shown in Figure 3.8. However, detailed timing measurements and critical path analysis of the processor pipeline will be needed to determine which stage such logic should go into.

A produce instruction is considered done after it writes to its SA slot. A consume instruction returning from the SA writes back its value to the register file in the WRB stage, even though the read value is available on the bypass logic after the EXE stage. The consuming core informs the SA whenever a consume instruction commits.

Upon being notified of consume commits, the SA informs the SAR logic of both the producing and consuming cores so that both cores can reclaim the corresponding SA slot.

Supporting Predication

Supporting predicated produce and consume instructions is important for several rea- sons. Predication [3, 36] allows compilers to eliminate hard-to-predict branches and com- pile two or more control flow paths into a straight line code sequence. Dynamically, predi- cation eliminates pipeline flushes caused by such branches. Statically, it provides compiler with increased scheduling flexibility. In the IA64 architecture, predication and rotating registers can be combined to achieve very tight software-pipelined [30, 52] loops. However, predication support for produce and consume instructions presents chal- lenges for the design and implementation of the rename logic and integrating it with the main processor pipeline without affecting processor cycle time. Recall that the compiler guarantees that dynamically, exactly one produce instruction writes to a slot and that exactly one consume instruction reads that value from that slot. Allowing predicated produce and consume instructions means that the compiler can generate multiple static produce or consume instructions for the same queue, but predicate them so that dynam- ically there is a one-to-one correspondence between produce and consume instructions. Execution for predicated-off instructions is short-circuited and they do not access the synchronization array. On the other hand, true produce and consume instructions, i.e. unpredicated or predicated-on instructions, access the synchronization array and update state as described above. In the current design, the SAR logic hands out consecutive map- pings from a modulo-N space, where N is the number of queue or dependence array en- tries, on both the producer and consumer sides. Suppose a sequence of six predicated produce instructions to queue 1 is presented to the SAR logic. Assume that only the sixth instruction is a true produce instruction and the five preceding produce instruc- tions are predicated off. Without special support for predication, the first true produce

instruction will be renamed to queue slot 5, whereas it should have been renamed to queue slot 0, since all prior produce instructions to queue 1 were predicated off.

Conceptually, the rename logic should also be made aware of predicated-off instructions so that it can ensure the appropriate dynamic one-to-one correspondence between true produce and consume instructions. However, predicate values are known only in the REG stage. Doing conditional renaming based on the predicate value in the REG stage will, in effect, serialize the SAR logic behind the baseline REG logic, because the SAR logic will then have to wait for scoreboarding and register file access before proceeding with synchronization array renaming. Such a design can greatly increase the critical path of the processor pipeline and affect the cycle time. Alternatively, if a latch separates the SAR logic from the REG stage, then the bypass logic will be affected, since back-to-back operand bypass will no longer be possible for source operands coming from the synchronization array, which will consequently degrade performance. What is needed is a mechanism that affects neither the cycle time nor the instruction throughput.

To guarantee correct queue semantics in the presence of predication without impacting performance or processor cycle time, extra logic, called the predicate-off tracker (POT), is added to the REG stage alongside the SAR logic. The POT logic maintains d counters, where d is the total number of queues or dependence arrays. Each counter keeps track of the number of predicated-off produce or consume instructions to the corresponding queue. Upon branch misspeculation, each predicate-off counter has to be restored along with the restoration of the SA renamer's state. Since the SAR logic hands out new mappings unconditionally to all produce and consume instructions, for a given produce or consume instruction, subtracting the predicate-off count from its mapping yields the correct physical queue slot number that ought to be accessed by the instruction. The POT logic tags each produce or consume instruction, upon entry into the REG stage, with the appropriate physical slot number by subtracting the current predicate-off counter value of the queue being accessed from the SAR mapping of that instruction.

For example, suppose a sequence of six predicated produce instructions is presented to the SAR logic, and suppose that all except the sixth instruction are predicated off. The sixth instruction is given a mapping of 5 by the SAR logic; the five produce instructions before it are all predicated off, which causes the predicate-off count to be raised to 5. When the sixth produce instruction passes through the REG stage, it gets tagged with 0 as the physical slot number (the physical slot number is the mapping minus the predicate-off count, modulo the queue size). This logic operates in parallel with the scoreboarding and register file access of the REG stage, so it does not prolong the cycle time of the REG stage. Note that the subtraction is also performed in the modulo-N space, where N is the number of entries in a queue. Since predicate values are known by the end of the REG stage, the POT logic updates the counters corresponding to any predicated-off produce or consume instructions at the end of the REG stage, as instructions make their way into the REG-EXE stage latch.

Also, note that the subtractor logic need only be ⌈log2 N⌉ bits wide and is therefore unlikely to affect the processor's critical path timing for reasonable queue sizes (32, 64, etc.).

To summarize, the POT logic tags communication instructions with their absolute physical slot number by performing a simple subtraction during the cycle, in parallel with regular REG stage operations, and it updates the POT counters themselves as instructions enter the REG-EXE latch. Note that the cycle time arguments presented here are based on intuition about the relative timing of the various pieces of logic. Low-level timing measurements have to be carried out before any of these can be incorporated into real designs.
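
The slot arithmetic can be summarized in a few lines of C. This is only a sketch of the calculation described above; the queue dimensions and function names are assumptions, and the worked example in main() replays the six-produce scenario from the text.

#include <assert.h>

#define NUM_QUEUES  32   /* assumed number of queues (dependence arrays)   */
#define QUEUE_SLOTS 32   /* assumed entries per queue, N                   */

/* POT state: per-queue count of predicated-off produce/consume
 * instructions, maintained modulo the queue size.                          */
static unsigned pot_count[NUM_QUEUES];

/* REG stage: convert the mapping the SAR handed out unconditionally into
 * the physical slot the instruction must access, by subtracting the
 * predicate-off count in the modulo-N space.                               */
unsigned pot_physical_slot(unsigned queue, unsigned sar_mapping)
{
    return (sar_mapping + QUEUE_SLOTS - pot_count[queue]) % QUEUE_SLOTS;
}

/* End of the REG stage, once predicates are known: count the instruction
 * if it turned out to be predicated off.                                    */
void pot_note_instruction(unsigned queue, int predicated_off)
{
    if (predicated_off)
        pot_count[queue] = (pot_count[queue] + 1) % QUEUE_SLOTS;
}

int main(void)
{
    /* Six produces to queue 1: the first five are predicated off, so the
     * sixth receives SAR mapping 5 but must access physical slot 0.        */
    for (int i = 0; i < 5; i++)
        pot_note_instruction(1, 1);
    assert(pot_physical_slot(1, 5) == 0);
    return 0;
}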

Implementing Acquire and Release Semantics

As mentioned before, the produce.rel and consume.acq instructions are used to guarantee correct semantics when synchronizing for memory dependences. Scoreboarding for both types of instructions is done in the REG stage of the pipeline. The REG stage

maintains two counters, one to track the number of outstanding store instructions and the other to track the number of outstanding consume.acq instructions. The store count is incremented when a true store instruction (i.e., one that is not predicated off) is issued. If a true store is canceled due to misspeculation, the store count is decremented. Similarly, if a true store commits in the memory subsystem, the REG stage is informed and the outstanding store count is decremented. A produce.rel instruction cannot issue from REG unless the outstanding store count is zero. The consume.acq count is likewise incremented when a true consume.acq instruction (i.e., one that is not predicated off) is issued. The count is decremented if a true consume.acq instruction is canceled or if a true consume.acq instruction commits. A load instruction cannot issue so long as the outstanding consume.acq count is not zero. With these two counters and the relevant datapaths to update them, produce-release and consume-acquire semantics can be added easily to the Itanium 2 microarchitecture.
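
These gating rules reduce to two counters and two predicates, as in the following sketch. The function names and the software modeling of what is really scoreboard logic are assumptions chosen for illustration.

#include <stdbool.h>

/* REG-stage counters for one core (modeled as plain globals here). */
static unsigned outstanding_stores;        /* true stores issued, not yet committed or cancelled        */
static unsigned outstanding_consume_acqs;  /* true consume.acq's issued, not yet committed or cancelled */

/* Bookkeeping events delivered to the REG stage. */
void on_store_issue(void)                  { outstanding_stores++; }
void on_store_commit_or_cancel(void)       { outstanding_stores--; }
void on_consume_acq_issue(void)            { outstanding_consume_acqs++; }
void on_consume_acq_commit_or_cancel(void) { outstanding_consume_acqs--; }

/* Issue gates: a produce.rel waits for every earlier store to drain, and a
 * load waits for every earlier consume.acq to drain.                         */
bool produce_rel_may_issue(void) { return outstanding_stores == 0; }
bool load_may_issue(void)        { return outstanding_consume_acqs == 0; }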

3.4 The Snoop-Based Synchronization technique

This is a low-cost design that attempts to combine the positive aspects of a dedicated hardware technique like the synchronization array with the low hardware and OS cost advantages of shared-memory based software-only techniques. In effect, it provides an efficient message passing implementation atop a shared-memory CMP. Others have proposed mechanisms to implement efficient message passing in shared-memory multiprocessors [7, 15, 21, 28, 48]. However, the message setup overhead in these designs makes them inappropriate for PMT applications with high-frequency streaming. The proposed design improves on Mukherjee et al.'s cachable queue design [37] in that snoop-based synchronization includes logic to update head and tail pointers (kept in both the producer and the consumer cores) based on the read/write messages snooped from the bus. This avoids unnecessary bus traffic for occupancy pointer updates. By piggybacking queue pointer updates on cache line write forwarding, snoop-based synchronization is able to achieve efficient data transfers as well as queue pointer updates.

In the snoop-based synchronization technique, produce and consume instructions are dynamically renamed to unique memory addresses. Special microarchitectural stream address logic assigns consecutive stream addresses (in a modulo space) for all accesses to a queue. Per-queue hardware occupancy counters maintained at the L2 controller provide synchronization. Although the proposed implementation assumes produce and consume instructions, the same behavior can be obtained with conventional store and load instructions as well. One possible implementation is for the compiler to make stores and loads to a particular queue always generate the same unique effective address. Prior to this, during program load time, the TLB must be initialized to mark certain virtual memory areas as "streaming pages". Dynamically, when a store or a load to a streaming page is encountered, it is handled differently from conventional stores and loads. Streaming stores and loads go through the stream address logic and special synchronization before accessing the correct memory location. This design has not been evaluated, but is mentioned here as another potential design option.

A producer (consumer) core updates its occupancy counters after successfully executing a produce (consume) instruction or after snooping occupancy updates from the bus. A produce (consume) instruction is allowed to access the L2 cache if and only if the occupancy counter corresponding to its stream does not indicate a full (empty) queue. Write-forward messages are used by the consumer core to update its occupancy counters. When the last queue item ("last" depends on the queue layout) from a given streaming line is read by a consume instruction, the consumer core sends out a message on the bus to inform the producer's occupancy tracker of how many consume instructions were serviced from that particular line. When a streaming cache line is evicted from an L2 cache, the cache once again puts out on the bus the number of queue items produced into (or consumed from) the line, for its counterpart to update its occupancy counters.

When the producer thread wraps around, it is stalled until all queue items from the corresponding line have been consumed by the consumer, to avoid damaging spatial locality in the consumer. Finally, since no write-forward messages will be sent when a stream terminates midway through a cache line, consume requests initiate an L3 access after a time-out to elicit a writeback from the producer core, to obtain the remaining queue items and avoid deadlock. Although the proposed implementation is a bus-based one, through simple modifications to the occupancy update protocol it can be adapted to the network-based interconnects of future CMPs.

In order to cope with increased inter-thread streaming traffic due to multiple threads or more complex communication patterns (for example, scatter-gather), the memory network arbiter can be modified to favor application memory requests over inter-thread operand traffic (a simple way to do this is to just look at the memory area being accessed). While application memory performance remains unaffected, pipelined inter-thread communication helps tolerate delays due to increased contention.

In the case where the producer thread races ahead of the consumer thread, has wrapped around, and is trying to write to the same cache line that the consumer is still trying to read from, the producer will steal the cache line from the consumer to do the write, which could in turn adversely impact the performance of an already slow consumer. The solution is to stall all produce requests to a cache line until all queue items have been consumed from that line in the remote core. For example, with a queue layout unit of 8, a 32-entry queue takes up four cache lines. To start with, all cache lines are empty and the producer can immediately produce 32 data items, 8 each to lines 0, 1, 2, and 3. However, in order to continue producing further, the producer has to wait until the occupancy counter for the queue reaches 24, which indicates that the consumer is done consuming all data items in line 0. The producer can then go ahead and write to line 0, and so on. The queue layout unit (QLU) should be chosen carefully by taking into account these considerations as well as the overall memory system performance desired, since very low QLUs mean queues take up more cache lines and hence more lines have to be transferred between caches.

Relying solely on the write-forward messages for occupancy counter updates risks deadlock if stream termination is not dealt with carefully. A time-out based forced-writeback mechanism is used to deal with such situations. No write-forwarding happens if a stream terminates midway through a cache line, short of the point in the line at which a write-forward message would have been triggered. Consequently, the consumer side will never update its occupancy counters. In such situations, in order to ensure correctness and forward progress, the waiting requests on the consumer side will time out and go to the L3. This elicits a writeback from the producer core (along with the occupancy), which the consumer core then uses to update its occupancy counters and satisfy the pending consume requests.
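
The wrap-around rule above can be written down precisely. The sketch below models the producer-side occupancy check for one stream; the queue size, QLU, and the monotonically increasing produced/consumed counts are illustrative assumptions (a real implementation would use small modulo counters updated from snooped messages).

#include <stdbool.h>

#define QUEUE_SIZE 32   /* assumed queue size, in items                      */
#define QLU         8   /* assumed queue layout unit: items per cache line   */

/* Producer-side view of one stream. */
typedef struct {
    unsigned long produced;   /* items written so far                        */
    unsigned long consumed;   /* items the consumer has reported consuming   */
} stream_t;

/* The producer may write item number s->produced only if the physical cache
 * line that item maps to has been completely drained by the consumer.  The
 * first QUEUE_SIZE items are always allowed; afterwards, reusing a line
 * requires its previous contents to have been consumed.  With QUEUE_SIZE=32
 * and QLU=8, items 0-31 can be produced immediately, but item 32 (line 0
 * again) must wait until at least 8 items have been consumed, i.e. until the
 * occupancy has dropped to 24.                                               */
bool produce_may_issue(const stream_t *s)
{
    unsigned long next = s->produced;
    long required_consumed = (long)((next / QLU + 1) * QLU) - (long)QUEUE_SIZE;
    return (long)s->consumed >= required_consumed;
}

/* The consumer may read as long as the queue is non-empty. */
bool consume_may_issue(const stream_t *s)
{
    return s->consumed < s->produced;
}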

3.5 Summary

This chapter identified four orthogonal aspects of any streaming communication design and discussed the pros and cons of several design points for each of these basic components in a detailed design space characterization. Based on this characterization, this chapter then presented two techniques - the synchronization array and the snoop-based synchronization technique. While the synchronization array is optimized for performance at a steep hard- ware and OS cost, the snoop-based synchronization design minimizes hardware and OS cost by using shared memory based queues. Latency tolerance and scalability of DSWP are first evaluated with the best-performing communication support, the synchronization array, in Chapter 5. Following that, a quanti- tative evaluation of various communication designs is presented in Chapter 6.

Chapter 4

Evaluation Methodology

Before taking a look at the performance evaluation in Chapters 5 and 6, it is important to understand the experimental environment. This chapter presents details about the applica- tions, compiler, simulator and performance analysis methodology used in this dissertation.

4.1 Benchmarks and Tools

All quantitative evaluation presented in this dissertation used code produced by the VELOCITY [67] compiler framework. A diverse set of applications drawn from several publicly available benchmark suites is used for evaluating the various techniques. The benchmarks studied include art, mcf, equake, ammp, and bzip2 from the SPEC CPU2000 benchmark suite, epicdec and adpcmdec from the MediaBench [31] suite, mst, treeadd, em3d, perimeter, and bh from the Olden suite, ks from the Pointer-Intensive benchmark suite, and the utility wc. A key loop in each of these applications is targeted for DSWP. A short description of each application and details about the loop chosen from each benchmark are provided in Table 4.1. All the Olden benchmarks except for em3d, which were originally recursive implementations, were rewritten to be iterative procedures, since the DSWP compiler can handle only regular loops at the time of this writing.

Benchmark    Function                   % Exec. Time   Benchmark Description
mst          BlueRule                   100%           Minimal spanning tree
treeadd      TreeAdd                    100%           Binary tree addition
perimeter    perimeter                  100%           Quad tree addition
bh           walksub                    100%           Barnes-Hut N-body simulation
em3d         traverse_nodes             100%           3D electromagnetic problem solver
wc           cnt                        100%           Word count utility
ks           FindMaxGpAndSwap           99%            Kernighan-Lin graph partitioning
adpcmdec     adpcm_decoder              98%            Adaptive differential PCM sound decoder
equake       smvp                       68%            Earthquake simulation
ammp         mm_fv_update_nonbon        57%            Molecular mechanics simulation
mcf          refresh_potential          30%            Combinatorial optimization
epicdec      read_and_huffman_decode    21%            Image decoder using wavelet transforms and Huffman tree based compression
art          match                      20%            Neural networks based image recognition
bzip2        getAndMoveToFrontDecode    17%            Burrows-Wheeler compression

Table 4.1: Loop Information

The different code versions for all benchmarks were generated with all the classical optimizations turned on. In all cases, instruction scheduling for control blocks was done both before and after register allocation.

The generated codes were then run on a cycle-accurate multi-core performance simulator constructed with the Liberty Simulation Environment [70, 71]. The multi-core simulator was derived from a validated core model, which was shown to be within 6% of the performance of native Itanium 2 hardware [42]. This framework does not model a hardware or a software thread scheduler. Therefore, an N-core configuration can run at most N threads.

Details of the baseline in-order model are given in Table 4.2. The out-of-order processor model is identical to the in-order model for the most part. The REG stage of the in-order model is augmented with a register update unit (RUU) [60] to allow out-of-order scheduling. The out-of-order model has an additional pipeline stage ahead of the REG stage for RUU dispatch. The baseline out-of-order model had 256 RUU entries, and twice the number of issue ports, cache ports, and cache lines compared to the baseline in-order issue model. The baseline out-of-order model's parameters are given in Table 4.3. Note that the baseline out-of-order simulator was deliberately configured to model a very aggressive, albeit optimistic and perhaps impractical, OOO design. As will be seen in Chapter 5, DSWP on less aggressive, more practical designs is able to outperform even such highly aggressive designs.

Core                     Functional Units - 6-issue, 6 ALU, 4 Memory, 2 FP, 3 Branch
                         Misprediction pipeline - 7 stages
                         L1I Cache - 1 cycle, 16 KB, 4-way, 64B lines
                         L1D Cache - 1 cycle, 16 KB, 4-way, 64B lines, Write-through
                         L2 Cache - 5,7,9 cycles, 256KB, 8-way, 128B lines, Write-back
                         Maximum Outstanding Loads - 16
Shared L3 Cache          > 12 cycles, 1.5 MB, 12-way, 128B lines, Write-back
Main Memory latency      141 cycles
Coherence                Snoop-based, write-invalidate protocol
L3 Bus                   16-byte, 1-cycle, 3-stage pipelined, split-transaction bus with round robin arbitration

Table 4.2: Baseline Simulator.

Core                             Functional Units - 12-issue, 12 ALU, 8 Memory, 4 FP, 6 Branch
                                 Misprediction pipeline - 8 stages
                                 RUU Entries - 256
                                 L1I Cache - 1 cycle, 32 KB, 4-way, 64B lines
                                 L1D Cache - 1 cycle, 32 KB, 4-way, 64B lines, Write-through
                                 L2 Cache - 5,7,9 cycles, 512KB, 8-way, 128B lines, Write-back
                                 Maximum Outstanding Loads - 128
L3, memory and coherence model   Same as in-order configuration (refer Table 4.2)

Table 4.3: Baseline Out-Of-Order Simulator.

Tables 4.2 and 4.3 only provide details of the core model and the memory subsystem. The actual number of cores used varied depending on the experiment performed; this will be clear from the context. The raw performance, in terms of effective instructions per cycle (IPC), of the baseline single-threaded code for the above benchmark loops on the in-order and out-of-order simulator models used in this dissertation is given in Figure 4.1. Note that the effective IPC calculation uses only useful application instruction counts and excludes no-ops and predicated-off instructions.
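
For clarity, the effective IPC used in Figure 4.1 can be written as the following small function; the argument names are illustrative rather than taken from the simulator.

/* Effective IPC: only useful application instructions count, so no-ops and
 * predicated-off instructions are subtracted out before dividing by cycles. */
double effective_ipc(unsigned long long total_insts,
                     unsigned long long nops,
                     unsigned long long predicated_off,
                     unsigned long long cycles)
{
    unsigned long long useful = total_insts - nops - predicated_off;
    return (double)useful / (double)cycles;
}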

4.2 Performance Measurement

As will be seen in the remainder of the dissertation, several different combinations of code types and microarchitectural features are evaluated in tandem and in isolation. Since ag- gregate statistics like instructions per cycle (IPC) or clocks per instruction (CPI) will not suffice to capture performance improvements across code changes, execution time has to be used to compare performance over different configurations (across code and microarchi- tecture changes). However, pure execution time does not yield any insight into the run-time

Figure 4.1: Raw performance of single-threaded code on baseline models. (The plot shows the effective IPC of each benchmark loop, and the geometric mean, on the in-order and out-of-order baselines.)

behavior of codes, and analysis becomes difficult.

To overcome the limitations of plain execution time measurements, this work uses a simple yet powerful bottleneck analysis methodology to determine instruction bottlenecks as well as microarchitectural bottlenecks. By determining the overall percentage contribution of individual instructions to the total execution time, it is possible to identify instruction bottlenecks. This knowledge can then be used to efficiently partition instructions among threads so as to ease the bottlenecks identified. Similarly, identifying microarchitectural bottlenecks, the parts of the processor pipeline where most of the critical path time is spent, is equally important, so that the architect can try to avoid any spurious stalls or devise microarchitectural optimizations to reduce stalls.

The analysis works as follows. In the simulator, every instruction flowing through the processor pipeline is annotated with the timestamp of the cycle in which it leaves each pipeline stage. The analysis then aims to attribute every cycle of execution to stalls experienced by the oldest in-flight dynamic program instruction, in terms of the number of stall cycles experienced by that instruction in the various processor pipeline stages. The analysis maintains a last accounted cycle (LAC) variable, which, as the name suggests, tracks the last execution cycle that has been accounted for.

When an instruction commits, the analysis first computes the number of bubbles introduced in each stage of the pipeline. The number of bubbles introduced in a particular stage Si is computed by subtracting the timestamp of the previous stage S(i-1) from the timestamp of Si. Since any stage is expected to take 1 cycle by default, 1 cycle is subtracted from the difference obtained above to get the number of bubbles introduced. Next, the different timestamps accumulated by the committing instruction are processed in pipeline order. For each stage Si, its timestamp is compared with the lastAccountedCycle variable, and if it is found to be greater than lastAccountedCycle, the difference between the timestamp and the lastAccountedCycle value gives new cycles that need to be accounted for. The analysis accounts for these new cycles by scanning the bubble array backwards, starting at the current stage.

If the number of bubbles introduced in Si is greater than the number of new cycles that need to be accounted for, the analysis subtracts the new cycles from bubbles[i]. Otherwise, the analysis decrements the cycles to be accounted for by bubbles[i], sets bubbles[i] to 0, and moves on to stage S(i-1). This is repeated until no cycles remain to be accounted for.

At that point, the analysis advances lastAccountedCycle to the timestamp of Si and is done processing the current instruction. The non-overlappable stalls determined by the above procedure are organized into a two-dimensional matrix with instructions as rows and pipeline stages as columns. A row represents the non-overlappable stall contribution of an individual instruction to the over- all execution time and can help identify instruction bottlenecks. A column represents the non-overlappable time spent in a given pipeline stage during the entire execution and is useful to identify microarchitectural bottlenecks. The underlying philosophy of this analy- sis approach is to provide the automatic partitioner (in the case of instruction bottlenecks) or the architect (in the case of microarchitectural bottlenecks) feedback about only those stalls that actually end up contributing to the overall execution time and avoid red-flagging instructions or pipeline parts whose stalls are completely overlapped by other, more critical stalls.
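
One plausible rendering of this accounting procedure in C is sketched below. The stage count, the cap on tracked instructions, and the handling of boundary cases (for example, cycles that are not bubbles in any stage) are simplifying assumptions, not the simulator's exact code.

#include <stdint.h>

#define NUM_STAGES 8      /* assumed pipeline depth                           */
#define MAX_INSTS  4096   /* assumed cap on tracked static instructions        */

/* Cycle at which a dynamic instruction left each pipeline stage. */
typedef struct {
    uint64_t leave[NUM_STAGES];
} inst_times_t;

static uint64_t last_accounted_cycle;            /* the LAC variable          */
static uint64_t stalls[MAX_INSTS][NUM_STAGES];   /* instruction x stage matrix */

/* Called when an instruction commits: charge every not-yet-accounted cycle
 * to the bubbles this instruction saw, scanning stages backwards so that a
 * stall is charged to the stage that introduced it.                          */
void account_instruction(const inst_times_t *t, int static_inst_id)
{
    uint64_t bubbles[NUM_STAGES] = { 0 };

    /* Each stage nominally takes one cycle; anything beyond that is a bubble
     * introduced in that stage.                                              */
    for (int i = 1; i < NUM_STAGES; i++)
        bubbles[i] = t->leave[i] - t->leave[i - 1] - 1;

    for (int i = 0; i < NUM_STAGES; i++) {
        if (t->leave[i] <= last_accounted_cycle)
            continue;                     /* already covered by older insts   */
        uint64_t new_cycles = t->leave[i] - last_accounted_cycle;

        /* Scan the bubble array backwards from the current stage, charging
         * the new cycles to bubbles until they are used up.                  */
        for (int j = i; j >= 0 && new_cycles > 0; j--) {
            uint64_t take = bubbles[j] < new_cycles ? bubbles[j] : new_cycles;
            stalls[static_inst_id][j] += take;   /* non-overlappable stall     */
            bubbles[j] -= take;
            new_cycles -= take;
        }
        last_accounted_cycle = t->leave[i];
    }
}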

4.3 Sampling Methodology

Application of DSWP to a program loop leaves the pre-loop and post-loop code almost untouched1. As a result, their performance is the same across all single-threaded and multi- threaded executions for a given hardware configuration (modulo minor differences due to cache state differences). Therefore, detailed simulation and performance measurement is done only for DSWPed loops across various threading versions. However, highly detailed modeling of core as well as memory architecture and large input set sizes of benchmarks, preclude the possibility of simulating all iterations of each and every invocation of a given loop in a reasonable time. Popular strategies to reduce simulation time include fast-forwarding (functional simulation) a few billion instructions into the program and then simulating in detail for several million instructions, running ap- plications to completion albeit with reduced input sets [26] and sampling only select pro- gram regions [57]. In recent years, sampling techniques based on statistics theory such as SMARTS [74] have demonstrated that it is possible to estimate whole program behavior to desired confidence levels by doing detailed simulation for only a small number of discrete chunks chosen from across the entire dynamic execution trace and doing only functional simulation for the rest of the execution. The TurboSMARTS [73] methodology drives down simulation time even further by checkpointing architectural and select microarchitectural (caches, branch predictors, etc.) state through functional simulation at select points of execution and initiating SMARTS-style sampling in parallel from all the checkpoints. This dissertation uses a similar methodology, albeit with some key variations. First, the above sampled simulation techniques work fine only across microarchitectural changes, since they rely on sampling a specific number of instructions across various simulator con- figurations. That, clearly, is unsuitable for evaluating both code and microarchitecture changes, when going from single-threaded execution to DSWPed execution. The strategy

1 The only modifications to pre-loop and post-loop code in the multithreaded code versions are a few additional instructions to communicate loop live-in information to the auxiliary threads and to communicate loop live-outs from the auxiliary threads back to the main thread.

used in this dissertation is as follows. Given the manner in which DSWP partitions a loop, the only thing that is constant across single-threaded and multi-threaded versions of a loop is the total work done per loop iteration and the number of loop iterations executed. Thus, instead of an instruction being used as the smallest logical unit of work during sampling, a loop iteration is used as the smallest unit. Several discretely chosen chunks from the loop iteration space are sampled for a particular invocation of the loop. As in SMARTS, statistical sampling, including warmup, is done at a specific periodicity for a given loop. Second, TurboSMARTS-style checkpoints of architectural and microarchitectural state are collected at the beginning of randomly chosen loop invocations, and loop-iteration-granular SMARTS sampling is initiated from these checkpoints.

While some loops have high trip counts (i.e., invocation counts) but execute only a few iterations per invocation, others have low trip counts but execute very many iterations per invocation. Given the varied nature of benchmark behavior, there is no clear strategy to determine the sizing of the iteration chunks for sampling, the probability for collecting checkpoints, and the number of such checkpoints. For this work, these numbers were arrived at on a trial-and-error basis. A general rule of thumb used to set the checkpoint probability and the statistical sampling interval was to make sure that a total of at least 10000 loop iterations are simulated across all checkpoints. For example, for high trip count loops, the checkpoint collection probability was decreased to obtain wider coverage across the entire program execution. However, if a high trip count loop has a low iteration count per invocation, then in order to sample 10000 loop iterations, a large number of checkpoints were collected. Benchmarks 188.ammp and ks are the only exceptions to this rule. Both these benchmarks had an outermost loop DSWPed. Unfortunately, the outermost loop does not execute sufficiently many times, resulting in smaller sample sets. The 10000 figure is not large enough to estimate the performance of an individual code-model configuration (for example, single-threaded execution on the baseline machine model) to a desired confidence level with a narrow enough confidence interval. However, perfor-

mance comparison metrics (e.g., the speedup of one technique relative to another) for a given application across code/architecture changes tend to demonstrate strong correlation during various phases of program execution and have been shown to require a much smaller number of samples to yield tight confidence intervals [35]. The performance comparisons in this dissertation are given at a 95% confidence level, and the accompanying error bars are shown in all speedup graphs. The error bars, computed as per Luo and John's formula [35], confirm that the sample set size chosen is sufficient to yield narrow enough confidence intervals.
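
A simplified sketch of the iteration-granular sampling decisions is shown below. All of the constants (checkpoint probability, sampling period, warmup and measurement chunk sizes) and the exact interleaving of warmup and measurement are assumptions chosen for illustration; the dissertation tunes these per benchmark so that at least about 10000 loop iterations are measured.

#include <stdbool.h>
#include <stdlib.h>

#define CHECKPOINT_PROB 0.01  /* assumed chance of checkpointing an invocation */
#define SAMPLE_PERIOD   100   /* assumed iterations between sample chunks      */
#define WARMUP_ITERS    5     /* assumed detailed-warmup iterations per chunk  */
#define MEASURE_ITERS   10    /* assumed measured iterations per chunk         */

/* Simulation mode for a loop iteration. */
enum sim_mode { FUNCTIONAL, DETAILED_WARMUP, DETAILED_MEASURE };

/* At the start of a loop invocation: decide whether to take a
 * TurboSMARTS-style checkpoint of architectural and select
 * microarchitectural state.                                                   */
bool take_checkpoint(void)
{
    return (double)rand() / RAND_MAX < CHECKPOINT_PROB;
}

/* Within a checkpointed invocation: decide how iteration i is simulated.
 * Each period of SAMPLE_PERIOD iterations contains one warmup-plus-
 * measurement chunk; everything else runs functionally.                       */
enum sim_mode iteration_mode(unsigned long i)
{
    unsigned long pos = i % SAMPLE_PERIOD;
    if (pos < WARMUP_ITERS)                 return DETAILED_WARMUP;
    if (pos < WARMUP_ITERS + MEASURE_ITERS) return DETAILED_MEASURE;
    return FUNCTIONAL;
}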

4.4 Summary

This chapter presented a detailed description of the experimental methodology used in this dissertation, including information on the benchmarks used for evaluation, sampling methodology, baseline simulation models, and performance analysis methodology. The next chapter presents results and analysis from a detailed evaluation of the latency tolerance and scalability characteristics of DSWP.

Chapter 5

Performance Evaluation of DSWP

The first part of this chapter focuses on understanding the run-time behavior of 2-thread DSWP and presents empirical data that highlights the variable latency tolerance property of DSWP. The second part presents a performance scalability study of the automatic DSWP implementation in the VELOCITY compiler and analyzes the performance and bottlenecks for DSWP when moving from 2 threads to 4, 6, and 8 threads. This chapter uses the high performance synchronization array communication support presented in Chapter 3 to evaluate the latency tolerance and scalability aspects of DSWP. Chapter 6 is devoted to evaluating the performance of different types of communication support for DSWPed codes.

5.1 Performance of 2-thread DSWP

The speedup of automatically generated 2-thread DSWP codes over single-threaded execution on the baseline in-order processor is shown in Figure 5.1. The geometric mean loop speedup achieved with the automatic partitioner is 1.20X. Performance improvement comes from two main sources - multi-threaded scheduling exposes more parallelism, and better latency tolerance is achieved through decoupled execution.

Figure 5.1: Speedup of fully automatic 2-thread DSWP with 32-entry inter-thread queues. (Loop speedup for each benchmark and the geometric mean.)

5.1.1 Balancing Threads Better

Since the speedup achievable from pipelined multithreading is directly correlated to the amount of overlap achieved among concurrently executing threads, the dynamic execution weights of the individual threads must be as well-balanced as possible to achieve the best overlap and hence the best speedup. In order to determine the balance among the automat- ically generated partitions, an experiment that simulates ideal communication behavior1 is set up to measure the relative performance of the individual partitions. With non-blocking communication operations, the performance of an individual thread is determined solely2 by the execution latency of its operations and is not influenced by the execution of other threads.

In theory, optimal partitioning of the DAGSCC can be shown to be NP-complete through a reduction from bin packing [40]. In this section, a straightforward feedback-driven iterative approach is used to re-balance instructions among automatically generated DSWP threads, both to obtain better load balance and to understand the efficacy of the compiler's partitioning heuristic. This approach is a heuristic; it does not methodically explore the entire space of possibilities, nor is it provably optimal.

1 Communication operations are made non-blocking by simulating queues with 10000 elements. This size is large enough to allow non-blocking produce and consume instructions.
2 Well, this is not exactly true. Cache interference behavior is changed in such ideal communication experiments, since memory accesses from temporally far apart iterations could now happen simultaneously, leading to different memory subsystem behavior than simulations with more realistic queue sizes.

It comprises the following steps:

• First, the iteration completion rates of individual threads, under ideal communication behavior, are used to determine the slowest thread. In 2-thread DSWP, a better bal- ance can be obtained by moving operations from the slowest thread (source thread) to the other thread (destination thread).

• Second, the execution latency (Ltarget) for ideal DSWP performance is computed as the average of the execution latencies of all the individual partitions. The goal, then, is reduced to determining the operations that need to be moved from the source thread to the destination thread to strike the desired balance, i.e., to move the execution latency of the individual threads as close as possible to the ideal execution latency, Ltarget.

• Third, using the bottleneck analysis presented in Section 4.2, key bottleneck instructions and their contributions to the overall execution time (after taking into account all possible intra-thread stall overlaps) are identified. However, only a subset S of these instructions is moved from the source thread to the destination thread, chosen such that the execution latency of the source thread minus the sum of the latency contributions of the instructions in S is as close to Ltarget as possible. This difference should also be smaller than the absolute difference between the execution latency of the source (or the destination) thread and Ltarget. (A sketch of this selection step appears after this list.)

• Fourth, the compiler is fed the new partition (the source thread minus the instructions in S, and the destination thread with the instructions in S added). If the compiler finds the new partition untenable (due to the creation of cyclic inter-thread dependences), then the procedure stops. Otherwise, the above steps are repeated with the newly generated partition, since the run-time effects of the new partition may be different from those of the old partition, and hence it becomes necessary to rerun the ideal communication experiments to determine the new bottleneck instructions.
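
The subset-selection step in the third bullet can be approximated greedily, as in the following sketch. The data layout and the greedy criterion are assumptions; the dissertation does not spell out the exact search used.

#include <math.h>
#include <stdlib.h>

/* A bottleneck instruction in the source (slowest) thread, with its
 * non-overlappable contribution to that thread's execution latency, as
 * reported by the Section 4.2 bottleneck analysis.                          */
typedef struct {
    int    inst_id;
    double contribution;   /* cycles it adds to the source thread            */
} candidate_t;

static int by_contribution_desc(const void *a, const void *b)
{
    double d = ((const candidate_t *)b)->contribution -
               ((const candidate_t *)a)->contribution;
    return (d > 0) - (d < 0);
}

/* Pick a set S of instructions to move from the source thread to the
 * destination thread so that the source thread's remaining latency lands
 * as close as possible to the target latency Ltarget.  Returns how many
 * instructions were selected; their ids are written into moved[].           */
int select_instructions_to_move(candidate_t *cand, int n,
                                double source_latency, double target_latency,
                                int *moved)
{
    qsort(cand, n, sizeof *cand, by_contribution_desc);

    double remaining = source_latency;
    int    count     = 0;
    for (int i = 0; i < n; i++) {
        double after = remaining - cand[i].contribution;
        /* Greedily keep an instruction only if moving it brings the source
         * thread's latency strictly closer to the target.                   */
        if (fabs(after - target_latency) < fabs(remaining - target_latency)) {
            moved[count++] = cand[i].inst_id;
            remaining = after;
        }
    }
    return count;
}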

Figure 5.2: Performance improvement from feedback-driven load rebalancing. (Loop speedup of the automatic and hand-balanced partitions for each benchmark.)


For each benchmark, the above steps were applied iteratively. All but mcf required exactly one iteration to attain their peak performance. While the compiler was not able to repartition threads for the benchmarks treeadd, perimeter, and bh, since the suggested partition was untenable, it was able to successfully repartition threads for all the other benchmarks and deliver peak performance. mcf's performance continued to improve through 5 successive repartitionings before stabilizing at a 1.61X speedup; its original speedup was 1.36X. Similarly, the performance of benchmarks wc, equake, and bzip2 improves from 1.06X to 1.13X, from 1.40X to 1.42X, and from 1.38X to 1.43X, respectively. Benchmark epicdec's DSWP performance improves from a slowdown to break-even. The baseline was single-threaded execution running on an in-order issue processor configuration. Figure 5.2 shows the performance improvement from manual repartitioning for all benchmarks. The results indicate that while rebalancing does help improve performance by a lot in a few cases, for a majority of the cases the compiler's heuristic suffices to strike the correct balance among partitions. These partitions will henceforth be called rebalanced partitions.

 1  while( node != root ) {
 2      while( node ) {
 3          if( node->orientation == UP )
 4              node->potential = node->basic_arc->cost +
 5                                node->pred->potential;
 6          else {
 7              node->potential = node->pred->potential -
 8                                node->basic_arc->cost;
 9              checksum++;
10          }
11          tmp = node;
12          node = node->child;
13      }
14
15      node = tmp;
16      while( node->pred ) {
17          tmp = node->sibling;
18          if( tmp ) {
19              node = tmp;
20              break;
21          } else
22              node = node->pred;
23      }
24  }

Figure 5.3: Traversal loop in mcf

As can be seen from the graph, 2-thread DSWP with rebalanced partitions provides speedups ranging from 6% to 61% over single-threaded execution. These results are ob- tained by achieving an out-of-order scheduling effect on an in-order processor. For example, consider the traversal loop for mcf shown in Figure 5.3. The calcula- tion to identify the next node to visit in the tree often involves traversing not just a child pointer (line 12) in the innermost computation loop, but also sibling and predecessor point- ers (lines 17 and 22 respectively) in a loop following the computation loop. Each iteration of the innermost computation loop performs two pairs of dependent loads (lines 4-5 and 7-8). Cache misses in these instructions followed by subsequent uses force the in-order processor to delay the execution of the sibling pointer traversal in the loop following the computation loop. Subsequent misses in the sibling traversal are taken after the load-use stalls, rather than being overlapped with the stalls. Static scheduling techniques fail to al- low the sibling traversal loads to be overlapped with the computation loads because the

loads exist in two independent loops. In the multithreaded implementation, the child and predecessor/sibling traversal loads are moved into the traversal thread. Stalls incurred in the computation loop do not affect the traversal thread, allowing the misses incurred in both loops to be overlapped. This is similar to what would occur in an out-of-order processor: independent instructions that occur logically after the stalled instruction can be executed while stalled instructions wait for their source operands.

Figure 5.4: Speedup from DSWP execution on in-order and out-of-order processors. (Loop speedup of DSWP on in-order cores (DSWP-INO), single-threaded execution on an out-of-order core, and DSWP on out-of-order cores (DSWP-OOO).)

Number of Cores                  2
Core                             Functional Units - 6-issue, 6 ALU, 4 Memory, 2 FP, 3 Branch
                                 RUU Entries - 128
                                 L1I Cache - 1 cycle, 16 KB, 4-way, 64B lines
                                 L1D Cache - 1 cycle, 16 KB, 4-way, 64B lines, Write-through
                                 L2 Cache - 5,7,9 cycles, 256KB, 8-way, 128B lines, Write-back
                                 Maximum Outstanding Loads - 64
L3, memory and coherence model   Same as in-order configuration (refer Table 4.2)

Table 5.1: Dual-Core Out-Of-Order Issue Configuration.

5.1.2 Latency tolerance through decoupling

Since DSWP achieves an out-of-order effect on in-order processors, it is only logical to compare this technique with out-of-order execution. Figure 5.4 compares the performance of single-threaded execution on an out-of-order core to the execution of rebalanced DSWP partitions on dual-core CMPs with in-order and out-of-order cores.

Figure 5.5: Distribution of iteration time. S=single-threaded, P=producer, C=consumer. (Stacked bars show normalized execution time broken into FE, DEP, and EXE stall components for each benchmark.)

The graph indicates that in some benchmarks (wc, equake, and mst) the out-of-order scheduling effect of DSWP on in-order cores is able to expose more coarse-grained thread-level parallelism (TLP) from far-apart loop iterations than traditional single-threaded execution on an aggressive out-of-order core, which can only expose local instruction-level parallelism (ILP). However, as the rightmost bar in the figure shows, for all benchmarks except treeadd, DSWPed execution on dual-core CMPs with out-of-order cores can take advantage of aggressive cores to expose local ILP in addition to coarse-grained TLP, yielding an additive performance improvement over both DSWPed execution on in-order cores and single-threaded execution on an aggressive uniprocessor out-of-order core. The aggressive uniprocessor out-of-order core (Table 4.3) had twice the resources (including functional units, load-store units, RUU entries, cache size, and cache ports) compared to a single core of the dual-core CMP configuration (Table 5.1) used to execute DSWPed codes.

To understand how DSWPed execution on out-of-order cores exposes more parallelism than single-threaded execution on a more aggressive out-of-order core, it is important to understand what happens to the program critical path in both cases. Figure 5.5 presents a bottleneck analysis of the critical path for the DSWP and single-threaded out-of-order executions for all benchmarks except the loops from art and ammp, which did not have clearly identifiable critical paths.

One or more critical path instructions are identified in each benchmark loop, and the time between two consecutive executions of critical path instructions is accounted for as front-end stall (FE, which includes non-overlapped cycles when the second critical path instruction has not yet entered the pipeline or is stuck in the front end of the pipeline), dependence stall (DEP, which accounts for non-overlapped cycles the second instruction spends waiting in the RUU for its dependences to be resolved), and execution stall (EXE, cycles spent by the second instruction in execution that are non-overlappable with the lifetime of a preceding critical path instruction).

As can be seen, FE stalls account for a major portion of the iteration time in single-threaded execution in most of the benchmarks. In comparison, the producer thread in the DSWPed version spent far less time in the front end of the machine despite the fact that its width and instruction window size are half those of the single-threaded core. This occurs for reasons described in Section 2.3. DSWPed execution dramatically reduces the FE stalls on the critical path, which directly contributes to the improved performance in almost all the benchmarks. In wc, the FE stalls reduce only marginally due to the very small size of the loop, as well as the fact that the byte load instruction that reads the input character stream almost always hits in the cache; with fewer misses, the latency does not vary much through the loop's entire execution. In treeadd, the FE stalls actually increase, leading to a slowdown over single-threaded out-of-order execution. The reason for treeadd's poor FE behavior in DSWP execution is that the partitioner ends up putting the majority of the original single-threaded loop's instructions in the producer thread. However, to supply control flow information to the consumer thread, the partitioner must also insert quite a few produce instructions. Dynamically, the producer thread still has to fetch about 90% of the instructions that the single-threaded program fetches. Since the producer thread has only half the width, this leads to increased FE stalls. Overall, this analysis demonstrates how prioritized execution of critical path instructions allows for more rapid completion of the loop's critical path by avoiding stalls due to limited resource availability.


Figure 5.6: Synchronization array occupancy of mcf illustrates importance of decoupling

For example, consider the stall breakdown of mcf. Decoupled execution of critical path and off-critical path instructions means that the FE stalls experienced by the pointer-chasing load in the producer thread and the consume instruction that receives the pointer value in the consumer thread are cut in half when compared to the execution of the pointer-chasing load in single-threaded execution. Misses in the off-critical path arising from the multiple loads required to compute node->potential (refer to Figure 5.3) delay initiation of pointer-chasing loads in single-threaded execution, whereas they do not affect pointer-chasing load initiation in DSWPed execution. In order to illustrate the effect of decoupling more clearly, the changes in the inter-thread queue occupancies over time for a randomly chosen window of execution of the refresh potential function in mcf are shown in Figure 5.6. A ramp-up indicates a producer thread run-ahead, while a ramp-down indicates a consumer thread catch-up, either due to a stalled producer or because the producer thread has completed all its loop iterations. Negative occupancy in the graph indicates that consumer instructions have issued prior to their corresponding produce instructions and are waiting for data. Data buffered in the synchronization array decouples the behavior of the critical path thread and the off-critical path thread, allowing misses in each thread to be overlapped with work in the other thread. Note from the figure that there are periods of time when the occupancy of the synchronization array dips by as many as 8 entries. This drop indicates that the consumer was able to execute more iterations than the producer. To avoid stalling in this situation, it

is necessary that at least 8 instances of the pointer-chasing load have completed ahead of the computation code. Since in a single-threaded processor only one copy of the pointer-chasing load appears in the instruction window per iteration of the loop, 8 entire copies of the loop would have to appear in the instruction window to avoid a stall. This loop in mcf contains 71 instructions, which means that avoiding this stall would require an out-of-order processor to be able to hold at least 568 instructions in its reorder buffer/RUU. To put this in perspective, the microarchitecture of the Intel Pentium 4 processor, one of the most aggressive out-of-order processors built to date, can support only 126 in-flight micro-ops [22]. To conclude, these experiments have demonstrated that DSWP can provide the same amount of latency tolerance as aggressive out-of-order execution on processor configurations that are very difficult to build in practice.
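Restating the window-size arithmetic above in a single expression (all three numbers are the figures quoted in this paragraph):

\[
8 \ \text{iterations} \times 71 \ \text{instructions per iteration} = 568 \ \text{in-flight instructions} \;\gg\; 126 \ \text{in-flight micro-ops}.
\]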

5.2 Performance scalability of DSWP

This section examines the performance scalability of DSWP when moving from 2 threads to 4, 6, and 8 threads. Automatically generated partitions from the compiler were used for these experiments (i.e., no feedback-driven repartitioning was performed). The compiler created the requested number of threads, N, only if it could heuristically determine that it was profitable to partition the loop's SCCs among N threads, taking into account computation and communication costs. These experiments used N-way multi-core simulator models with in-order Itanium 2-like cores to run the respective multithreaded code versions, i.e., applications with 2, 4, 6, and 8 threads were simulated on multi-core simulator configurations with the corresponding number of cores. All simulation parameters were maintained the same as in the baseline model described in Chapter 4. Notably, the bus latency was fixed at 1 cycle for all the experiments. The synchronization array access latency for produce and consume instructions was 1 cycle. A pipelined bus interconnect carries traffic from the cores to the synchronization array and back. Cache coherence was


Figure 5.7: DSWP Q32: DSWP performance when moving from 2 to 4, 6, and 8 threads with 32-entry inter-thread queues and a bus interconnect. Note: a missing bar for a particular number of threads for a benchmark means the compiler was not able to partition the chosen loop for that benchmark into that many threads.

provided by a baseline bus-based snoopy mechanism for all multi-core configurations. Figure 5.7 shows the speedup provided by automatically generated DSWP threads relative to single-threaded in-order execution, when moving from 2 threads to 4, 6, and 8

threads. This performance graph will be referred to as DSWP Q32. Note that the graph shows two geometric means - a plain geometric mean (denoted “GeoMean” in the graph) and a best geometric mean (denoted “Best-GeoMean” in the graph). When calculating the plain geometric mean for N-thread code versions, if a certain benchmark did not have an N-thread code version (for example, equake does not have 6- and 8-thread versions), then for that benchmark, the speedup of the version with the next highest number of threads is used. For example, when calculating the plain geometric mean across all 8-thread versions, the speedup of equake from the 4-thread version is used since it does not have a code version with more than 4 threads. On the other hand, the best geometric mean for an N-thread version represents the mean of the best speedups across all benchmarks for all code versions with a number of threads fewer than or equal to N. It represents the speedup that can be achieved with code generated by an intelligent compiler that will generate the best multithreaded version for each benchmark for a given number of cores, even if it means generating fewer threads than available cores. The analysis presented here primarily uses

the plain geometric mean to compare DSWP's performance in the presence and absence of bottlenecks. The best geometric mean is also provided to highlight the maximum speedup achievable through careful selection of multithreaded code versions.
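To make the two aggregation rules concrete, the sketch below computes both means the way the text describes them. It is only an illustration: the benchmark count, speedup values, and array layout are placeholders, not the measured results from Figure 5.7.

#include <math.h>
#include <stdio.h>

#define NUM_BENCH   3
#define NUM_CONFIGS 4              /* 2-, 4-, 6-, and 8-thread versions */

/* speedup[b][c] is the loop speedup of benchmark b with config c, or 0.0
 * if the compiler could not produce that many threads.  Placeholder data;
 * every benchmark is assumed to have at least a 2-thread version. */
static double speedup[NUM_BENCH][NUM_CONFIGS] = {
    {1.3, 1.5, 1.6, 1.6},
    {1.2, 1.4, 0.0, 0.0},          /* only 2- and 4-thread versions exist */
    {1.1, 1.2, 1.3, 1.3},
};

/* Plain geomean: if a benchmark lacks a c-thread version, fall back to
 * its version with the next highest thread count. */
static double plain_geomean(int c)
{
    double log_sum = 0.0;
    for (int b = 0; b < NUM_BENCH; b++) {
        int k = c;
        while (k > 0 && speedup[b][k] == 0.0)
            k--;
        log_sum += log(speedup[b][k]);
    }
    return exp(log_sum / NUM_BENCH);
}

/* Best geomean: best speedup over all versions with thread counts <= c. */
static double best_geomean(int c)
{
    double log_sum = 0.0;
    for (int b = 0; b < NUM_BENCH; b++) {
        double best = 0.0;
        for (int k = 0; k <= c; k++)
            if (speedup[b][k] > best)
                best = speedup[b][k];
        log_sum += log(best);
    }
    return exp(log_sum / NUM_BENCH);
}

int main(void)
{
    for (int c = 0; c < NUM_CONFIGS; c++)
        printf("%d threads: plain %.2fX, best %.2fX\n",
               2 * (c + 1), plain_geomean(c), best_geomean(c));
    return 0;
}

By construction, the best geometric mean can never decrease as the number of cores grows, which is why it is the appropriate upper-bound curve in Figure 5.7.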

To understand the performance of DSWP Q32, recall that the autoDSWP technique partitions the DAGSCC such that there are no backward dependences among partitions. Since the performance of pipelined multithreading is limited by the performance of the slowest running thread, the maximum performance can be achieved by placing the “heaviest” SCC in a partition of its own and by making sure that no other partition is heavier than the partition with the heaviest SCC. This can be done either by load-balancing the remaining partitions so as not to exceed the weight of the heaviest SCC or, if that is not possible, by placing each SCC in its own thread. The heaviest thread is called the bottleneck thread. Oftentimes, application loops contain a few large SCCs and many small, mostly single-instruction SCCs. Once the heaviest SCC has been placed in a thread of its own and no other partition is heavier (including ones with more than one SCC), the heaviest SCC thread becomes the bottleneck thread and it is no longer possible to obtain more performance improvement by partitioning the remaining SCCs among more threads. This trend is clearly seen in Figure 5.7, which shows that even for benchmarks that yield more than 2 threads, no performance improvement is seen beyond 6 threads. On the contrary, a performance slowdown is seen for some application loops when moving to more threads, which is somewhat counter-intuitive. The theoretical performance improvement expected when moving to more threads no longer holds. Since autoDSWP does virtually no code duplication, the above slowdown when moving to more threads cannot be due to differences in the amount of computation. While the total computation remains constant across the four different multithreaded partitionings, the amount of communication varies. Figure 5.8 provides the normalized execution time breakdown of each thread for the different multithreaded partitionings for all benchmarks. The figure shows how the execution time of each benchmark for each thread configuration is spent in


Figure 5.8: Normalized execution time breakdown of individual benchmarks when moving to more threads with 32-entry queues and bus interconnect

different stages of the processor pipeline. For this breakdown, the detailed stall breakdown provided by the analysis described in Section 4.2 is collapsed into the following six aggregate stall groups - PreEXP (comprises stalls in the instruction fetch stages of the Itanium 2 pipeline), EXP (stalls in the decode stage), REN (stalls in the register renaming stage), REG (stalls in the scoreboarding and register access stage and the synchronization array renamer), EXE (stalls in the execution stage, which accounts for all execution time including memory access, synchronization array access, etc.), and PostEXE (comprises stalls in the DET and WRB stages of the Itanium 2 pipeline). The x-axis of each graph in Figure 5.8 represents the various code partitionings - bench-1T, bench-2T, bench-4T, bench-6T and bench-8T - of each benchmark bench. Within a cluster, for example bench-6T, the normalized execution time breakdown of each thread of that configuration is shown. Absence of a cluster for a benchmark means that the particular benchmark was not partitionable into the corresponding number of threads. Within a cluster, the bars corresponding to “upstream” threads appear to the left and the bars corresponding to “downstream” threads appear to the right. Benchmark epicdec shows a high PreEXP component in all the code configurations. This is because the stall analysis is made to account for stalls only in the loop being optimized. As a result, any time spent executing instructions outside the loop (e.g., floating point library calls made by epicdec) is reported as PreEXP stall time. The breakdowns show that when moving to more threads, the EXE component increases dramatically compared to the 1- or 2-thread configurations. A fine-grained breakdown of the stalls on a per-instruction basis revealed that consume instructions were the main reason for the increased EXE component. For example, in the graph for wc, note that, while the first thread has a large REG component, due to frequent stalls by produce instructions in the synchronization array renamer on queue full conditions, the second and third threads have large EXE components, due to frequent stalls by consume instructions on queue empty conditions. Given the classical notion of pipelined execution, this is very counter-intuitive. To explain this performance anomaly when moving to more threads, it is important to understand the actual communication pattern among threads and their run-time behavior.

(a) 2 threads. (b) 4 threads. (c) 6 threads. (d) 8 threads.

Figure 5.9: Thread dependence graphs for loop from wc.

5.2.1 Linear and non-linear thread pipelines

Given a linear chain of producer-consumer threads (i.e., thread 2 consuming from thread 1, thread 3 consuming from thread 2, and so on), the communication rate in the chain will be determined by the slowest thread, and all threads will produce and consume at the exact same rate as the slowest thread. Such thread pipelines will be called linear pipelines. As mentioned before, the maximum performance attainable by such thread pipelines is $S / D_H$,

where $S$ is the single-threaded execution time, $D_i$ is the execution time of thread $i$ of the

pipeline, and $H$ is the slowest thread in the pipeline. This expression does not say anything about the communication requirements of such pipelines. In particular, if a linear pipeline had insufficient queue buffering, then the factor $D_H$ will increase to include the time the slowest thread spends waiting for data arrival, thereby adding inter-core communication delays to the overall thread execution time. But such a situation can be avoided if inter-thread queues are sized appropriately. In particular, if the time taken to communicate a data item or a synchronization token from one thread to another is $C$ cycles, it takes a total of $2 \times C$ cycles for a producer thread to communicate a value to a consumer thread and for the consumer to communicate its acknowledgment to the producer. This round-trip communication will be called a synchronization cycle. Since all threads in the pipeline

need only communicate at the same rate as the slowest thread $H$, i.e., once every $D_H$ cycles, all inter-thread queues need only be as big as the queues leading into and out of thread

$H$. If the synchronization cycle delay, $2 \times C$, is less than $D_H$, then $D_H$ is the limiting factor and only one entry is needed in all inter-thread queues (no buffering is needed if communication happens instantaneously, i.e., $C$ equals 0). On the other hand, if the round-trip time is greater than $D_H$, then the number of loop iterations the slowest thread can execute in that time is $2 \times C / D_H$. Consequently, there need to be at least that many queue slots to keep the slowest thread continuously busy. Thus, the minimum³ queue size necessary to prevent inter-core communication delays from being added to thread execution times is given by $\lceil 2 \times C / D_H \rceil$.⁴ As long as the inter-core communication latency is less than or equal to the computation time, queue sizes of 1 or 2 will suffice to provide peak throughput. An increase in computation time will only reduce the demand for more queue entries. This is a very desirable property of linear pipelines, as it helps place reasonable bounds on inter-thread queue sizes, enabling optimal communication support design.

³ This is the minimum queue size without accounting for variability in data production and consumption rates.
⁴ This can be easily augmented to account for different communication costs between different pairs of threads.
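As a quick sanity check of the linear-pipeline bound, the following sketch evaluates $\lceil 2 \times C / D_H \rceil$ for a few latency/computation combinations. The function name and the example numbers are illustrative only.

#include <stdio.h>

/* Minimum queue depth for a linear pipeline: c is the one-way
 * communication latency in cycles, d_h the per-iteration time of the
 * slowest thread.  Implements ceil(2*C / D_H) with integer arithmetic. */
static int min_queue_size_linear(int c, int d_h)
{
    return (2 * c + d_h - 1) / d_h;
}

int main(void)
{
    /* 10-cycle transit, 40-cycle slowest thread: 1 entry suffices. */
    printf("%d\n", min_queue_size_linear(10, 40));
    /* 100-cycle transit, 40-cycle slowest thread: 5 entries needed. */
    printf("%d\n", min_queue_size_linear(100, 40));
    /* Instantaneous communication: no buffering needed at all. */
    printf("%d\n", min_queue_size_linear(0, 40));
    return 0;
}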

However, in practice, general-purpose applications often do not yield linear pipelines. As SCCs are partitioned among more threads, more inter-thread dependences are created amongst threads, since previously local (inter-SCC but intra-thread) dependences may now need to be communicated between threads. The communication pattern among constituent threads is quite varied, and the partitioner ends up creating dependences between almost every pair of upstream and downstream threads. Such thread pipelines will be referred to as non-linear pipelines. Consider the example of wc shown in Figure 5.9. The figure shows the thread dependence graphs of different partitionings of the wc loop. It also shows the number of operations (compiler intermediate representation operations, not machine operations) in each thread, and the label on each edge indicates the number of queues running between a pair of threads. Except for the 2-thread case (Figure 5.9a), the dependence graphs for the other cases (the 4-, 6-, and 8-thread cases in Figures 5.9b, 5.9c, and 5.9d, respectively) do not turn out to be linear pipelines. Such non-linear thread pipelines, while still providing PMT parallelism, experience certain communication bottlenecks that lead to below-par performance. To understand the communication bottlenecks in such pipelines, consider the thread dependence graphs and the execution schedules of a 4-thread linear pipeline, ABCD, and a 4-thread non-linear pipeline, A'B'C'D', in Figures 5.10a and 5.10b respectively. The dashed arcs in the backward direction in the thread dependence graphs represent synchronization dependences going from consumer threads back to their producers. These arcs indicate to the producer when it is permissible to write to a particular queue location. When a consumer thread is slower than the corresponding producer thread, the latter has to block after filling up the queue, until the consumer thread frees up a queue slot, to produce the next data item. These arcs become relevant when inter-thread queue buffering is not sufficient to tolerate inter-core communication delays or the variability in data production and consumption rates. For illustration purposes, suppose one iteration of the original single-threaded loop

(a) A linear pipeline and its execution schedule. (b) A non-linear pipeline and its execution schedule.

Figure 5.10: Linear and non-linear thread pipeline execution with inter-core delay of 10 cycles and per-thread iteration computation time of 40 cycles.

takes 120 cycles and the individual threads each take 40 cycles, in both ABCD and A'B'C'D', for executing one iteration of the loop in question. Let the inter-thread communication latency be 10 cycles. Even though, in practice, the compiler is free to schedule the communication instructions anywhere in a thread, for this example, assume that all produce instructions are executed at the end of a thread's loop iteration and all consume instructions at the beginning. Finally, for simplicity, all produce instructions in a thread will be blocked if any one produce blocks. A similar all-or-none behavior will be assumed for consume instructions as well. In the execution schedules shown in the above figure, a solid inter-thread arrow means a data value communication from the producer thread to its consumer. A dashed inter-thread arrow in the reverse direction denotes an acknowledgment signal from a consumer to a producer, indicating a queue entry is free to be reused. Dotted straight lines in a thread's schedule indicate periods of no activity in the thread, because it is blocked on a produce or a consume operation. As the execution schedule in Figure 5.10a shows, the linear pipeline ABCD is able to finish a loop iteration once every 40 cycles. Notice that because the per-iteration time is 40 cycles and the synchronization cycle delay (round-trip time) is only 20 cycles, the acknowledgment for a queue entry arrives well before a producer thread is ready to produce the next data item. Therefore, a producer will never block on a queue full condition, and the performance of the linear pipeline attains the theoretical maximum speedup of $S$ (120 cycles) / $D_H$ (40 cycles), i.e., 3X. The important point to note here is that the linearity of the pipeline enables it to achieve this speedup with just 1 entry per inter-thread queue. Now, consider the execution schedule of the non-linear pipeline A'B'C'D' in Figure 5.10b. This schedule has been drawn assuming 1-entry inter-thread queues to compare and contrast its performance with the linear pipeline from above. Notice that A'B'C'D' is able to complete only 1 loop iteration every 120 cycles (the first iteration of thread D' completes in cycle 190, and the second iteration in cycle 310), resulting in no speedup at all over single-threaded execution. The reason for this abysmal performance is inadequate

queue sizing that leads to prolonged stalls, as can be seen from the long dotted lines in all threads in Figure 5.10b. So, why do single-entry queues, which were adequate to deliver peak throughput in the linear pipeline above, create performance bottlenecks here? To answer that, observe that in Figure 5.10b, thread A' sends values to both threads B' and D'. Consequently, in any given iteration, before producing a data item, thread A' has to ensure that it can produce into the queues leading into both thread B' and thread D' before initiating both data sends (per the assumptions stated above). In other words, whenever it is blocked on a queue full condition, thread A' has to wait for acknowledgments from both threads. This requirement leads to communication bottlenecks, thereby slowing down multithreaded performance. In the execution schedule, notice that even though thread A' is ready to produce data after its second iteration as early as cycle 80, it has, by that point in time, received an acknowledgment only from thread B'. It has to wait a further 80 cycles before it receives the acknowledgment from thread D', at which point it proceeds to produce the data to both threads B' and D'. And since thread A' is the head of the pipeline, the rest of the pipeline also stalls waiting for data from upstream threads. The fundamental problem here is that queue sizes for executing such non-linear pipelines cannot be determined solely from the inter-thread communication delay and the per-iteration computation time of the slowest thread. For non-linear pipelines, the synchronization cycle expands to include the computation time of all intermediate threads as well as the one-way communication delays between the intermediate threads. For example, for thread A' above, the synchronization cycle comprises the communication delay from thread A' to thread B', the computation time in thread B', the communication delay from thread B' to thread C', the computation time of thread C', the communication delay from thread C' to thread D', and finally the delay for the acknowledgment to go from thread D' to thread A'. More generically, the round-trip communication delay of the new, longer synchronization cycle can be expressed as

$2 \times C + \sum_{i=1}^{N_s} (D_i + C)$, where $N_s$ is the number of threads in the synchronization cycle $s$. By

a similar reasoning as before, the queues should be large enough to tolerate this round-trip delay, but only to the point of sustaining the maximum throughput. Thus, for non-linear pipelines, the minimum queue size needed to provide maximum PMT performance is

$$\left\lceil \frac{2 \times C + \max_{s \in S} \sum_{i=1}^{N_s} (D_i + C)}{D_H} \right\rceil$$

where $S$ is the set of all synchronization cycles in a given thread dependence graph. The second term in the numerator causes non-linear pipelines to require longer queues to deliver peak throughput. This term also makes non-linear pipelines unwieldy for communication support design, since it is impossible to place an upper bound on the size of the inter-thread queues. Synchronization cycles can be made arbitrarily long by the computation costs of intermediate threads, making it very difficult to design bottleneck-free communication support. This explains the anomaly in Figure 5.8, wherein threads experienced increased EXE stalls due to the creation of non-linear thread pipelines when moving to more threads.
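The generalized bound can be evaluated mechanically once the synchronization cycles and per-thread iteration times are known. The sketch below does this for the A'B'C'D' example of Figure 5.10b; the data layout and function name are illustrative, not part of the compiler.

#include <stdio.h>

#define MAX_THREADS 8

/* Minimum queue depth for a non-linear pipeline: c is the one-way
 * communication latency, d_h the per-iteration time of the slowest
 * thread, d[s][i] the iteration times of the intermediate threads on
 * synchronization cycle s, and n[s] how many such threads that cycle has. */
static int min_queue_size_nonlinear(int c, int d_h, int num_cycles,
                                    const int n[], int d[][MAX_THREADS])
{
    int worst = 0;
    for (int s = 0; s < num_cycles; s++) {
        int len = 2 * c;                     /* producer->consumer + ack */
        for (int i = 0; i < n[s]; i++)
            len += d[s][i] + c;              /* intermediate compute + hop */
        if (len > worst)
            worst = len;
    }
    return (worst + d_h - 1) / d_h;          /* ceiling division */
}

int main(void)
{
    /* The cycle A' -> B' -> C' -> D' -> A' from Figure 5.10b has two
     * threads whose computation lies on the cycle (B' and C'), each with
     * a 40-cycle iteration, and a 10-cycle hop latency; the bound is 3. */
    int n[1] = { 2 };
    int d[1][MAX_THREADS] = { { 40, 40 } };
    printf("%d\n", min_queue_size_nonlinear(10, 40, 1, n, d));
    return 0;
}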

5.2.2 Performance under ideal communication behavior

The 32-entry queue sizing was insufficient to tolerate the longer synchronization cycles created by non-linear thread pipelines. This phenomenon is aggravated when moving to 8 threads. To remedy the situation and evaluate the performance scalability potential of DSWP when moving to more threads, a second set of simulations (labeled DSWP Q∞) was run, once again on different multi-core configurations, with enough cores to match the number of application threads. The only difference was that the queue sizes were set to infinity.⁵ Figure 5.11 presents the speedup obtained from the different multithreaded configurations relative to both single-threaded in-order execution (at the bottom) as well as DSWP Q32 (at the top). Figure 5.12 shows the execution time breakdown of each thread in the different multithreaded code versions with infinite queue sizes, normalized to single-threaded in-order execution.

⁵ A size of 10000 was sufficient to ensure that no queue full/empty stalls occurred.


Figure 5.11: DSWP Q∞: Performance of DSWP with 2, 4, 6, and 8 threads with infinitely long queues and a bus interconnect, relative to DSWP Q32 (top) and single-threaded in-order execution (bottom).


Figure 5.12: Normalized execution time breakdown of individual benchmarks when moving to more threads with infinitely long queues and bus interconnect.


Figure 5.13: DSWP Q∞+BW ∞ : Performance of DSWP with 2, 4, 6, and 8 threads with infinitely long queues and infinite communication bandwidth relative to single-threaded in-order execution.

Several observations are in order. As expected, easing the queue size limitation does alleviate the communication bottleneck imposed by non-linear pipelines and improves DSWP performance for most benchmarks when moving to more threads. The geometric mean speedups of DSWP Q∞ with 2, 4, 6, and 8 threads are 1.25X, 1.36X, 1.41X, and 1.39X respectively, whereas the geometric mean speedups of DSWP Q32 with 2, 4, 6, and 8 threads from Figure 5.7 were 1.20X, 1.29X, 1.31X, and 1.29X respectively. Despite the overall improvement, there are several notable exceptions. Benchmarks wc, mcf, ammp, perimeter, and ks continue to see a performance degradation when moving to more threads. A closer look at the execution revealed that the arbitration policy of the bus interconnect carrying synchronization array traffic always favored earlier threads. This caused threads later in the pipeline to suffer arbitration stalls in the 6- and 8-thread scenarios. The removal of the bottleneck due to pipeline non-linearity with infinitely long queues resulted in producer threads earlier in the pipeline being greatly sped up, causing instructions from threads later in the pipeline to suffer interconnect contention stalls. In particular, when these stalls hit consume instructions, which were at the heads of dependence chains of downstream threads, performance slowdown was inevitable. This slowdown in downstream threads eventually slowed down the upstream producer threads as well.


Figure 5.14: Normalized execution time breakdown of individual benchmarks when moving to more threads with infinitely long queues and infinite communication bandwidth.

(a) Speedup with 32-entry queues with interconnect and port contention (baseline: single-threaded).

(b) Speedup with infinitely long queues with interconnect and port contention (baseline: DSWP Q32).

(c) Speedup with infinitely long queues and infinite communication bandwidth (baseline: DSWP Q∞).

Figure 5.15: Relative performance of DSWP with 2, 4, 6, and 8 threads under different communication scenarios.

In order to alleviate finite bandwidth limitations to the synchronization array, the bus interconnect was replaced with a crossbar interconnect and the synchronization array was allowed to have as many ports as required to cater to requests from all cores every cycle. This configuration with infinite-sized queues and infinite communication bandwidth is labeled DSWP Q∞+BW∞. This idealization made the 8-thread versions of benchmarks wc, mcf and ks perform no worse than the 6-thread versions. From Figure 5.13, which presents the overall speedup obtained by DSWP Q∞+BW∞ relative to single-threaded in-order execution, the geometric mean speedups of DSWP Q∞+BW∞ across the 2-, 4-, 6-, and 8-thread versions can be seen to be 1.25X, 1.37X, 1.42X, and 1.43X respectively. Figure 5.14 shows a thread-wise breakdown of the performance of multithreaded DSWP with infinitely long queues and zero port contention while accessing the synchronization array. The main difference between this and Figure 5.12 is in the graphs for wc. In the latter, wc can be seen to have a performance degradation when moving to more threads. The infinite inter-thread queue (i.e., synchronization array) bandwidth remedies this situation and causes wc's performance to fall in line with theoretical expectations. Similar improvements can also be seen for benchmarks mcf, adpcmdec, mst and ks. Figure 5.15 presents the relative speedup graphs for all three communication scenarios

- DSWP Q32, DSWP Q∞, and DSWP Q∞+BW∞. It highlights the incremental improvement obtained by easing just the queue size bottleneck relative to DSWP Q32 (Figure 5.15b) and by easing both the queue size and the interconnect contention bottlenecks (Figure 5.15c) relative to DSWP Q∞. Notice that ammp and perimeter continue to experience performance degradation even after elimination of interconnect contention stalls. Detailed analysis revealed that in perimeter, the best balance is reached with 2 threads, with thread 1 being the bottleneck thread. Since thread 1 contained only 1 SCC, further parallelization is made possible only by spreading thin the remaining SCCs among more threads. However, this leads to loss of locality in data accesses, leading to increased coherence activity and thereby increasing the number of coherence-induced misses in the L2 caches. A similar problem is observed in ammp as well when moving from 6 to 8 threads. The best balance is reached with 6 threads. Moving to 8 threads only results in the creation of more coherence traffic, leading to performance slowdown. Finally, note that the best geometric mean speedup, which represents the maximum speedup possible with N cores (with a number of threads fewer than or equal to N), for 2, 4, 6, and 8 threads improves from 1.20X, 1.30X, 1.32X, and 1.33X in the DSWP Q32 configuration to 1.25X, 1.37X, 1.43X, and 1.43X respectively in the DSWP Q∞+BW∞ scenario. This overall performance improvement serves to highlight that it is not enough for the compiler to be intelligent enough to pick the best “code version” (i.e., multithreaded configuration); it must also strive to eliminate non-linear pipelines during partitioning, in order to achieve the best performance possible.

5.3 Summary

Through detailed simulations and performance analysis, this chapter first demonstrated the latency tolerance property of DSWPed codes. The geometric mean speedup of DSWPed execution on a 2-core CMP with in-order issue is 1.29X compared to single-threaded execution on one core. More saliently, the geometric mean speedup of DSWPed execution on a 2-core CMP with out-of-order issue from a 128-entry instruction window is 1.11X over single-threaded execution on an out-of-order processor with double the RUU entries, cache ports, and cache lines. Critical path stall analysis clearly demonstrated that DSWP achieves this improvement by avoiding FE stalls through decoupled multithreaded execution. This chapter also analyzed and highlighted the various communication bottlenecks that must be overcome for DSWP to achieve theoretically predicted performance. Without any communication bottlenecks, DSWP delivers a geometric mean speedup of 1.25X to 1.46X

when going from 2 to 8 threads. An important lesson from the performance scalability analysis is the need to make the compiler aware of the communication costs of linear and non-linear thread pipelines so that it can generate code to avoid communication bottlenecks. Performance loss due to false sharing can be avoided by using clever data layout schemes. For example, false sharing due to read and write accesses to different fields of a structure can be avoided by padding the structure with enough dummy bytes so that the relevant fields are moved to different cache lines (a small sketch of this appears below). The next chapter presents an evaluation of a variety of mechanisms for pipelined inter-thread communication. It quantitatively demonstrates the effect of each sub-component of the design space presented in Chapter 3 on DSWP performance and identifies the main bottlenecks in the designs studied.
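To make the padding remark concrete, here is a minimal sketch of the idea. The field names, the 128-byte line size, and the manual pad are all illustrative assumptions, not code from the benchmarks or the compiler.

#include <stdio.h>

#define CACHE_LINE_SIZE 128   /* bytes; matches the simulated line size */

/* Without padding, the two fields may share a cache line, so writes by
 * one thread invalidate the line in the other core's cache (false sharing). */
struct unpadded {
    long written_by_thread_A;
    long read_by_thread_B;
};

/* Padding pushes each field onto its own cache line. */
struct padded {
    long written_by_thread_A;
    char pad[CACHE_LINE_SIZE - sizeof(long)];
    long read_by_thread_B;
};

int main(void)
{
    printf("unpadded: %zu bytes, padded: %zu bytes\n",
           sizeof(struct unpadded), sizeof(struct padded));
    return 0;
}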

Chapter 6

Evaluation of Communication Support

This chapter evaluates four important communication support mechanisms that represent design variants ranging from existing commercial processor designs to designs like the synchronization array that leverage heavy-weight dedicated streaming hardware to maximize the performance of streaming codes. All selected design points ensure backward compatibility with legacy software. Using these design points, it is empirically shown that DSWPed codes do, in fact, tolerate transit delay. The results and analysis from this chapter motivate the light-weight communication support optimizations described and analyzed in the next chapter. Note that an evaluation of various communication support options for PMT was first undertaken in [50]. The main differences between the evaluation presented in this dissertation and the original work are the use of the VELOCITY compiler framework in this dissertation and an improved write-forwarding implementation.

6.1 Systems Studied

The four design points explored were:

EXISTING. This design point is representative of existing commercial CMPs. It is a naïve implementation of shared memory software queues that relies only on the baseline coherence and consistency mechanisms. This design will serve as our baseline for measuring the hardware cost and operating system impact of other designs.

MEMOPTI. This variant will illustrate the efficacy of EXISTING with write-forwarding support, a low-impact memory subsystem optimization. This design requires little additional hardware (cache modifications for write-forwarding). Write-forwarding is limited to a core's L2 cache, to avoid polluting the L1 cache with short-lived inter-thread streaming data.

SYNCOPTI. This variant corresponds to the snoop-based synchronization design presented in Section 3.4 of Chapter 3 and will illustrate the benefits of using dedicated hardware to optimize communication operation sequences and synchronization, while still relying on the memory subsystem for queue backing stores and core-to-core interconnect. The design requires modifying core pipelines to execute produce and consume instructions, write-forwarding logic (with the locality enhancements described in Section 3.2.5) and synchronization counters in the processor caches, and OS support to context switch the synchronization counters. Here too, write-forwarding does not propagate all the way to L1 and is terminated at the destination core's L2. While this design does introduce more hardware than the previous designs, it remains fairly lightweight. This design reuses the L2-L3 memory bus, a critical component of CMP architectures, rather than introducing a new dedicated network like the synchronization array. Such an approach makes optimal use of the available on-chip transistors; the single on-chip network can be provisioned according to the total system bandwidth requirements without regard to how such traffic is generated (application memory requests or inter-thread operand requests). This generality makes the solution appealing since it has the potential to support various models of application parallelism. Additionally, use of memory as a backing store avoids introducing new dedicated stores, allows flexible queue sizing, and greatly reduces OS context-switch and virtualization costs.

HEAVYWT. This variant represents the performance achievable by hardware-heavy mechanisms such as the FIFOs provided by the scalar operand networks in Raw [65] or the synchronization array (SA). It combines the point along each design axis that should offer greatest performance without regard for hardware cost or OS impact. In addition to core modifications to execute produce and consume instructions, HEAVYWT introduces an additional dedicated, distributed on-chip queue backing store and a new interconnect network to connect processor cores to this backing store. The contents of the backing store and in-flight data buffered in the network must be part of a process's context. Consequently, this design variant also requires OS modifications to context switch this state. Since this state is concurrently updated by multiple threads belonging to the same process, the OS implications will be more far-reaching than for the other design variants.

6.2 Experimental Setup

The benchmarks used for this evaluation are the exact same workloads that were used to evaluate DSWP in Chapter 5. The multithreaded workloads used here correspond to the rebalanced partitions derived in Section 5.1.1. Two types of codes were automatically generated - one with produce and consume instructions (same as the best-performing code from Section 5.1.1) and the other with code sequences for shared-memory software queue writes and reads. The code sequences for the shared-memory software queue operations have been highly tuned to contain the minimal number of instructions possible. Despite that, the software overhead for a communication operation was 10 instructions (6, 1, and 3 instructions for synchronization, data transfer, and stream address update respectively) with a dependence height of 4. The overhead for the versions based on produce and consume instructions was just the one instruction for data transfer. On an in-order machine such as the Itanium 2, these overheads tend to contribute significantly towards the overall execution time, especially for really tight loops. Code for software queue implementations was generated such that the spin lock loop, lock release, and queue pointer update were part of existing control


Figure 6.1: Effect of transit delay on streaming codes.

blocks so as to enable more compact scheduling. All designs used 256 queues of depth 32 unless otherwise mentioned (not all queues were used by each application). For all designs using shared memory backing stores, the queue layout unit was 8. For all designs using write-forwarding, lines were forwarded only to the cores' private L2 caches, and forwarding was initiated only after all queue entries on the line had been written. For the lone configuration with a dedicated backing store, HEAVYWT, the backing store was located in the consumer core, but queue synchronization counters were maintained at both the producer and consumer cores. The dedicated store could service 4 concurrent operations per cycle and was connected to remote cores by a dedicated pipelined interconnect. Within a consuming core, the consume-to-use latency was 1 cycle. If not otherwise mentioned, the interconnect latency was 1 cycle.
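For readers who want a feel for where the multi-instruction overhead comes from, here is a minimal C sketch of a shared-memory software queue produce/consume pair. It is only an illustration under assumed names and layout; the dissertation's tuned sequences are generated at the IR level and differ in detail.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 32

/* One single-producer/single-consumer queue in shared memory.  The
 * per-slot flag doubles as the synchronization condition variable. */
struct swqueue {
    uint64_t data[QUEUE_DEPTH];
    _Atomic int full[QUEUE_DEPTH];
};

/* Producer side: synchronization (spin until the slot is free), data
 * transfer, and stream address (index) update. */
static void produce(struct swqueue *q, unsigned *idx, uint64_t value)
{
    unsigned i = *idx;
    while (atomic_load(&q->full[i]))      /* spin: slot still occupied */
        ;
    q->data[i] = value;                   /* data transfer */
    atomic_store(&q->full[i], 1);         /* publish to the consumer */
    *idx = (i + 1) % QUEUE_DEPTH;         /* queue pointer update */
}

/* Consumer side mirrors the producer. */
static uint64_t consume(struct swqueue *q, unsigned *idx)
{
    unsigned i = *idx;
    while (!atomic_load(&q->full[i]))     /* spin: slot still empty */
        ;
    uint64_t value = q->data[i];
    atomic_store(&q->full[i], 0);         /* acknowledge / free the slot */
    *idx = (i + 1) % QUEUE_DEPTH;
    return value;
}

int main(void)
{
    static struct swqueue q;              /* zero-initialized: all slots free */
    unsigned pidx = 0, cidx = 0;
    produce(&q, &pidx, 42);               /* normally done by another thread */
    printf("%llu\n", (unsigned long long)consume(&q, &cidx));
    return 0;
}

Per datum, the producer executes a spin check, a data store, a flag update, and an index update; the synchronization and pointer-update portions are where the recurring per-operation overhead of software queues comes from.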

6.3 Results and Analysis

First, to understand the decoupling present in the applications and to see the effect of transit delay on pipelined inter-thread communication, Figure 6.1 presents a performance comparison of two HEAVYWT variants. They differ only in the end-to-end latency of their dedicated pipelined interconnects. The left bar corresponds to a 1-cycle end-to-end latency (default HEAVYWT) and the right bar to a 10-cycle latency. All other parameters are held constant. As the figure shows, the performance difference between the two design points varies from little to nothing. This is in line with the theoretical expectation that PMT codes tolerate transit delays. That being said, there is a small performance dip for a few benchmarks, namely equake, bzip2, treeadd, and perimeter. These benchmark loops have one or more nested loops inside the outermost loop that is DSWPed. Upon DSWPing such loops, produce instructions in the outer loop can execute only after all iterations of its inner loop(s) are done. And, in order for the producer to be done with all iterations of its inner loop, either the queue size must be large enough compared to the number of loop iterations or, in the more likely scenario, the producer must wait for the consumer to drain the queue so that it can keep producing to the queue. In the event the outer loop of the producer fails to build sufficient decoupling (i.e., ensure that the queue occupancy is greater than or equal to one), any consume instruction to that queue will be stalled until the queue occupancy becomes greater than zero. Such stalls are exacerbated by the increased 10-cycle delay for transferring a value from the producer core to the consumer core. This leads to the slight performance dip for the above benchmarks. Next, Figure 6.2 gives a performance comparison of all four design points with respect to baseline single-threaded execution. The speedup of each technique with respect to single-threaded execution is given. The normalized execution times of the producer (above) and consumer (below) threads for each of the four different mechanisms are given in Figure 6.3. From left to right, for each benchmark, the bars correspond to single-threaded execution, EXISTING, MEMOPTI, SYNCOPTI, and HEAVYWT respectively. Note that even though the overall performance of the producer and consumer cores is the same, their breakdowns look different due to the different mix of communication and computation instructions and, additionally, in the case of HEAVYWT and SYNCOPTI, due to the differences in


Figure 6.2: Overall performance comparison across all design points.

the stall behavior of produce and consume instructions. As mentioned in Chapter 4, the component-wise breakdowns represent non-overlappable stalls that contributed directly to the dynamic schedule height in the respective sections of the machine. Since the study of the various communication mechanisms focuses primarily on the memory subsystem, the component-wise breakdown used in this and the next chapter aggregates stalls before and after the memory access stages of the pipeline into PreL2 and PostL2 components respectively, and zooms into the memory system performance by splitting it into four components - L2, BUS, L3 and MEM. L2, L3 and MEM represent the time spent in the L2, the L3 and the main memory respectively; BUS is the total time spent on the shared bus (including arbitration, snoops, requests, and data transfers). The figures show that HEAVYWT and SYNCOPTI perform better than MEMOPTI and EXISTING for all benchmarks. It is obvious from the design why HEAVYWT is the best overall (since it provides the lowest COMM-OP delay). SYNCOPTI closely trails HEAVYWT across all the benchmarks. This is expected since SYNCOPTI and HEAVYWT are identical in all respects except in their queue backing stores. However, there is still a considerable difference between the SYNCOPTI and HEAVYWT bars across all benchmarks, and in fact the difference is pretty significant in wc. A careful


Figure 6.3: Normalized execution time breakdown for producer (above) and consumer (below) threads for each design point.

(1 = SINGLE-THREADED, 2 = EXISTING, 3 = MEMOPTI, 4 = SYNCOPTI, 5 = HEAVYWT)

examination of the pipeline behavior of these benchmarks revealed the main reason for the slowdown. The average consume-to-use latency in SYNCOPTI is at least 6 cycles (since synchronization happens in the L2 following a 2-cycle stream address generation), whereas it is 1 cycle in HEAVYWT. The higher COMM-OP delay results in the consumer performing slower in SYNCOPTI than in HEAVYWT. This in turn delays the freeing up of queue slots, thereby ultimately slowing down the producer thread. For wc, the reason why SYNCOPTI is almost twice as slow as HEAVYWT is that the streaming loop is very tight. With three consume operations per loop iteration, the overhead turns out to be a significant factor. While HEAVYWT incurs no memory system overhead (by design), SYNCOPTI does equally well too, as can be seen by the L2 and BUS components. Since synchronization counters are efficiently maintained and updated in a distributed fashion, SYNCOPTI avoids unnecessary cache line ping-ponging between cores; only the producer side writes to a cache line, while the consumer reads the cache line. The only extra memory traffic stems from uni-directional queue line transfers and bulk acknowledge notifications for synchronization counter updates. However, since MEMOPTI and EXISTING have to explicitly modify condition variables and communicate them in both directions, their memory system performances are significantly poorer. Since SYNCOPTI, MEMOPTI and EXISTING all effect data transfers through the memory subsystem, one might expect their breakdowns to be somewhat similar. However, that is not the case because, in MEMOPTI and EXISTING, instructions recirculate through the OzQ¹ when they cannot issue because of port contention or to respect memory fence semantics. Further, when a produce operation tries to produce into a full queue, the spin lock instructions keep flowing through the pipeline till the produce happens. Whereas, in SYNCOPTI, a produce instruction takes up one OzQ slot and remains dormant till it goes past the synchronization phase in its state machine. Often, this causes the OzQ to fill up, leading to back-pressure in the pipeline, resulting in a

¹ An ordered queue of outstanding transactions, in the Itanium 2's L2 controller, whose entries also serve as miss status holding registers (MSHRs).


Figure 6.4: Ratio of the # of dynamic communication instructions to application instructions for producer and consumer threads.

larger PreL2 component. Finally, the greater intrinsic schedule height for software queues causes MEMOPTI and EXISTING to have larger PostL2 components than SYNCOPTI, since fewer instructions execute and write back in SYNCOPTI. Hence the differences in the breakdowns. MEMOPTI performs better than EXISTING overall due to timely transfer of cache lines from one core to another. This can be seen from the reduced BUS components for all benchmarks for MEMOPTI, since the pre-emptive forwarding in MEMOPTI minimizes the number of critical path cycles spent on bus transfers, unlike the on-demand bus transfers in EXISTING.² Overall, a major factor contributing to the improved performance of HEAVYWT and SYNCOPTI over MEMOPTI and EXISTING is the PostL2 component. MEMOPTI and EXISTING simply commit many more instructions due to the software overhead for synchronization and address generation, and this directly causes them to perform worse than HEAVYWT and SYNCOPTI. Figure 6.4, which has a plot of the ratios of the dynamic counts of communication and synchronization instructions to application instructions for

² This result is slightly at odds with the result reported in [50], wherein, for a lot of benchmarks, EXISTING performed better than MEMOPTI, due to differences in the implementation of write-forwarding.

both the producer and consumer threads for codes with produce-consume instructions, shows that on average, a communication is required once every 5 to 20 dynamic application instructions. Given this high communication frequency, the 10-instruction software queue sequence required per communication proves to be a significant overhead and detrimentally affects software queue performance. The performance of SYNCOPTI is in between that of the HEAVYWT mechanism and the EXISTING and MEMOPTI mechanisms. As seen in Figure 6.2, SYNCOPTI has 1.56X and 2.1X speedups over MEMOPTI and EXISTING respectively and a relatively modest 38% slowdown relative to HEAVYWT. This, however, is not good enough, since the speedup of HEAVYWT over single-threaded codes for the optimized loops is 25%. This means the communication overhead for the other mechanisms actually negates parallelization benefits and causes multithreaded execution to perform worse than single-threaded execution. These basic experiments demonstrate the importance of efficient communication support for high-frequency pipelined inter-thread communication.

6.4 Sensitivity Study

In order to evaluate the sensitivity of the four techniques to the increased wire delays of future CMPs, the basic performance experiments were repeated with a bus latency equal to 4 CPU cycles. For HEAVYWT, the end-to-end latency of the dedicated interconnect was increased to 4 cycles as well. The execution time breakdown is presented in Figure 6.5 and is normalized to single-threaded execution with the same 4-cycle bus latency. EXISTING, MEMOPTI and SYNCOPTI are all affected by the increased bus latency and have significantly larger BUS components. This is not a surprise, since these communication mechanisms use the bus to transfer inter-thread streaming data and any increase in bus latency will consequently have a negative impact on the overall performance. Given that it takes 8 (linesize(128) / buswidth(16)) bus cycles for a line to be transferred on the bus, with a bus latency


Figure 6.5: Normalized execution time breakdown with transit delay = 4 cycles.

(1 = SINGLE-THREADED, 2 = EXISTING, 3 = MEMOPTI, 4 = SYNCOPTI, 5 = HEAVYWT)


Figure 6.6: Effect of increased interconnect bandwidth (transit delay = 4 cycles, bus width = 128 bytes).

(1 = SINGLE-THREADED, 2 = EXISTING, 3 = MEMOPTI, 4 = SYNCOPTI, 5 = HEAVYWT)

of 4 CPU cycles, it takes 32 CPU cycles for line transfers. This not only causes increased time for line transfers but also causes requests to backlog, leading to large arbitration delays within cores. Further, benchmarks like mcf and equake are memory-intensive applications and tend to access the L3 cache frequently, making them sensitive to bus delays. To see if interconnect bandwidth is the problem, another set of experiments was run with a bus width of 128 bytes (equal to the cache line size) holding the latency at 4 CPU cycles (peak bandwidth of 32 bytes per cycle). This change significantly eases contention, leading to lower arbitration delays, as seen from the BUS components in Figure 6.6. This highlights the importance of interconnect bandwidth for high-frequency streaming. Although building a 128-byte-wide interconnect can be expensive, the same benefits can be had by using a pipelined interconnect with equal bandwidth.
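For reference, the bus-transfer arithmetic used above, written out with the parameters quoted in this section (128-byte lines, 16-byte bus, 4 CPU cycles per bus cycle):

\[
\frac{128 \ \text{bytes/line}}{16 \ \text{bytes/bus cycle}} = 8 \ \text{bus cycles}, \qquad 8 \ \text{bus cycles} \times 4 \ \text{CPU cycles per bus cycle} = 32 \ \text{CPU cycles per line transfer},
\]
\[
\text{while the widened bus moves } \frac{128 \ \text{bytes}}{4 \ \text{CPU cycles}} = 32 \ \text{bytes per CPU cycle at peak}.
\]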

6.5 Summary

To conclude, the key observations from the performance evaluation in this chapter are:

• The main source of performance loss for software queues is the huge instruction overhead required to implement synchronization and queue pointer update in software for every communication operation. This is chiefly responsible for the poor performance of EXISTING and MEMOPTI with respect to SYNCOPTI and HEAVYWT.

• Even though SYNCOPTI performs better than EXISTING and MEMOPTI, HEAVYWT performs better than SYNCOPTI because of its smaller consume-to-use delays.

The next chapter will present alternate designs that will seek to alleviate the bottlenecks identified in this chapter.

Chapter 7

Communication Support Optimizations

This chapter develops optimizations that seek to address the main reasons for performance loss identified in Chapter 6. Two types of optimizations are proposed and evaluated. The first is a software-only optimization that minimizes synchronization and queue pointer update overhead for shared memory software queue implementations. The second comprises simple hardware enhancements to snoop-based synchronization (i.e., SYNCOPTI) that enable it to almost equal the performance of the synchronization array (i.e., HEAVYWT) by reducing consume-to-use latency, albeit without the associated hardware and OS costs.

7.1 Amortizing Overhead Costs for Software Queues

As discussed in Section 3.2, software queues obviate the need for special hardware support for inter-thread communication. Further, no OS modification is necessary. The default memory consistency and cache coherence implementation of any machine provides the complete hardware support needed for software queue based communication. However, they come with a heavy performance cost. As shown in the previous chapter and summarized at the beginning of this chapter, software queue implementations suffer from the heavy instruction overhead required to implement synchronization and queue pointer update functionality.

Even highly tuned code sequences have significant recurring intra-thread overhead and tend to negate any benefits from PMT parallelization. While synchronization and queue pointer update are necessary evils which cannot be wished away entirely, the optimization presented in this section seeks to coalesce such overhead instructions into one per group of queue accesses instead of one per individual queue access, so as to amortize the synchronization and queue pointer update overhead across multiple queue accesses. The optimization comprises a compiler analysis to automatically identify queue operations for which synchronization and queue pointer update can be coalesced and a subsequent code generation pass that uses the analysis information to actually perform the coalescing. The analysis is implemented as a pass after autoDSWP in the VELOCITY compiler framework. The analysis was originally presented in [49] without any run-time evaluation results. In addition to presenting run-time results for the original algorithm, this section also presents a new algorithm to perform coalescing across larger code regions. AutoDSWP implementations can either use a unique queue to handle each inter-thread dependence or may choose to merge two or more dependences into a single queue. Each has its pros and cons. While the former leads to queue addressability concerns, the latter constrains the scheduler by requiring the order of queue accesses to be strictly the same in all communicating threads. The analysis described in this section assumes a queue-per-dependence implementation. Further, it also assumes finite-sized queues with an equal number of entries in each queue. Variants of the above approaches often use multiple queue buffers or coarser-grained signaling to minimize cache line ping-ponging or the amount of storage used for synchronization or both. Regardless, the main drawback of software queues is that the code sequences to produce and consume a single datum are quite lengthy. A naïve use of any of the above software queue implementations to handle communication in DSWP by reproducing the entire code sequence (comprising synchronization, data transfer and queue pointer update instructions) for every single communication operation to each queue will naturally lead

118 to performance inefficiencies. However, in code regions with two or more accesses to “parallel” queues, the overhead arising from synchronization and queue pointer update in- structions can be amortized across multiple accesses. Coalescing synchronization leads to an increase in critical section size, which may be unacceptable in most conventional scenar- ios. However, since DSWP pipelines communication and synchronization, increasing the critical section size will at worst manifest itself as a one-time increased pipeline fill cost. The next subsection expands upon this intuition and presents an analysis to automatically identify parallel queue accesses for which synchronizations and queue pointer updates can be coalesced.
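To make the per-access overhead concrete, the following sketch shows what a naïve shared-memory software queue pays on every single produce and consume: a spin-based synchronization step, the data transfer itself, and a queue pointer update. This is only an illustration under stated assumptions; the slot layout, the per-entry full flag (standing in for the per-entry condition variables described in Section 7.1.2), and names such as swq_t and QSIZE are inventions of this example, not the code generated by VELOCITY.

    /* Minimal sketch of a naive software queue slot protocol (illustrative only). */
    #include <stdatomic.h>
    #include <stdint.h>

    #define QSIZE 32                          /* entries per queue (power of two)  */

    typedef struct {
        uint64_t   data[QSIZE];               /* operand storage                   */
        atomic_int full[QSIZE];               /* 1 = slot holds unconsumed data    */
    } swq_t;

    /* Producer side: synchronization, data transfer, and pointer update are all
     * paid on every single produce. */
    static inline void swq_produce(swq_t *q, unsigned *tail, uint64_t v)
    {
        unsigned slot = *tail;
        while (atomic_load_explicit(&q->full[slot], memory_order_acquire))
            ;                                  /* ACQUIRE: spin until slot is free */
        q->data[slot] = v;                     /* data transfer                    */
        atomic_store_explicit(&q->full[slot], 1, memory_order_release);
                                               /* RELEASE: publish the datum       */
        *tail = (slot + 1) & (QSIZE - 1);      /* queue pointer update             */
    }

    /* Consumer side mirrors the producer. */
    static inline uint64_t swq_consume(swq_t *q, unsigned *head)
    {
        unsigned slot = *head;
        while (!atomic_load_explicit(&q->full[slot], memory_order_acquire))
            ;                                  /* ACQUIRE: spin until data arrives */
        uint64_t v = q->data[slot];            /* data transfer                    */
        atomic_store_explicit(&q->full[slot], 0, memory_order_release);
                                               /* RELEASE: hand slot back          */
        *head = (slot + 1) & (QSIZE - 1);      /* queue pointer update             */
        return v;
    }

Of the six or so operations per access in this sketch, only the data transfer does useful work; the coalescing optimization aims to pay the remaining instructions once per group of parallel queue accesses instead of once per access.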

7.1.1 Analysis

First, an intuitive algorithm is described to determine which queue accesses to coalesce synchronization for. This algorithm is progressively refined to take into account various correctness constraints. The notation ACQUIRE n and RELEASE n will denote acquire and release operations for a condition variable n. The term synchronization number will be used to abstractly refer to a condition variable. The analysis operates on a machine-independent intermediate representation (IR). While the exact code sequence for an ACQUIRE n or RELEASE n operation may vary slightly depending on whether it synchronizes a produce or a consume operation, for the discussion below, it suffices to know that ACQUIRE n is a spinlock loop which will prevent the enclosing thread from making forward progress until the spinlock succeeds. Since coalescing synchronization is inherently more complex than coalescing queue pointer updates, the analysis is driven from a synchronization standpoint. Section 7.1.2 will explain how queue pointer update coalescing can be piggybacked on synchronization coalescing during code generation.

A synchronization equivalence group (SEG) is defined as a group of queues for which synchronization can be coalesced. To start with, only innermost loops in 2-thread DSWP will be considered. For pipelined communication between two innermost loops, it can be intuitively seen that by acquiring and releasing synchronization at the beginning and end of the loop body respectively, all communication operations in the loop body can be correctly synchronized. However, in a two-deep loop nest that has been DSWP'ed such that there are communication operations in both the outer and inner loops, the above simple strategy will no longer work. All synchronization cannot be coalesced at the outer loop boundaries as that will leave communication operations in the inner loop without any synchronization. In Figure 7.2, notice that there is no ACQUIRE n or RELEASE n operation for queue accesses in the inner loop. This can cause produce operations to run over the allotted buffer space or consume operations to prematurely read stale data. Thus, upon coalescing, ACQUIRE and RELEASE operations for a queue access can move to a less restrictive control flow condition (like outside an "if" statement), but cannot be hoisted out of their original loops. This condition ensures that every dynamic queue access operation is guarded by at least one ACQUIRE n and one RELEASE n operation.

But this condition is not sufficient. For example, in Figure 7.3, synchronization is acquired and released at the beginning and end of each loop nest level. Even though this satisfies the condition stated above, it fails to provide correct synchronization because, assuming a queue to contain 32 entries, the producer thread's ACQUIRE 1 will spinlock trying to produce beyond 32 queue items. However, the consumer thread will spinlock in ACQUIRE 0 in its outer loop, since the producer's outer loop will never get a chance to execute its synchronization release, RELEASE 0. Since the consumer is spinlocking in ACQUIRE 0, it will never reach its inner loop, thus leading to a deadlock. Figure 7.4 highlights another potential pitfall if ACQUIRE and RELEASE operations for a synchronization number are not control-equivalent. In this example, if the loop were to proceed down the "if" path, the RELEASE operation for synchronization number 0 would never execute. This in turn would cause the consumer thread to spin loop in its ACQUIRE operation and prevent it from making any progress.

To summarize, the necessary and sufficient conditions for correct synchronization coalescing are:

1. For each synchronization number n, dynamically, there be a many-to-one or one-to-one mapping of synchronization operations (ACQUIREs and RELEASEs) to queue accesses, for each queue in its SEG(n).¹

2. There be no circular inter-thread dependence among overlapping critical sections in any thread.

3. For each synchronization number, dynamically, there be a one-to-one correspondence between ACQUIREs and RELEASEs.

To satisfy all the above conditions, a single-entry single-exit acyclic region with control equivalent entry and exit points, called a loop region, is defined. ACQUIRE and RELEASE operations for all queue accesses in this region can be coalesced at the region entry and exit points. The acyclic clause satisfies conditions 1 and 2, and the control equivalence clause satisfies condition 3.

The first step of the analysis is to form loop regions. Initial loop region entries are defined as points in a loop's static CFG where control flow is transferred into the loop, including from inner loops and fall-through from loopback branches. Similarly, initial loop region exits are points in a loop's static CFG where control flow is transferred out of the loop, including to inner loops. The source of a loop backedge is a special loop region exit and likewise a loop header is a special loop region entry. The algorithm starts by marking initial loop region entries and exits as defined above as entry and exit nodes respectively in the given CFG. Then, for every node in the CFG, it computes its latest post-dominator and earliest dominator, taking into account the new entry and exit nodes. All latest post-dominators are marked as exit nodes and all earliest dominators are marked as entry nodes. Source nodes of in-edges into earliest dominators are marked as exit nodes, and destination nodes of out-edges from latest post-dominators are marked as entry nodes. The algorithm iterates until no new entry or exit nodes are marked on the CFG. This procedure yields maximal single-entry single-exit loop regions. The algorithm is summarized in Figure 7.1.

¹Note that, to take into account coarse-grained signaling implementations, the condition can be modified slightly to ensure a many-to-one or one-to-one mapping with every kth queue access, where k is the signaling granularity.

1: Mark Initial Entries and Exits
2: repeat
3:     Find each node's latest post-dominator L and earliest dominator E
4:     Mark each L as an exit, mark each E as an entry
5:     Mark D as an entry, ∀ D, ∀ L, such that L→D is an edge and L is a latest post-dominator
6:     Mark S as an exit, ∀ S, ∀ E, such that S→E is an edge and E is an earliest dominator
7: until (entries and exits do not change for any node)

Figure 7.1: Algorithm to find maximal single-entry single-exit loop regions.

The set of queues accessed in each loop region forms an SEG. The analysis also remembers the directionality (i.e., produce or consume) of each SEG. If both produce and consume queue accesses occur in a particular loop region, the analysis groups them into two different SEGs.

Now, all queue operations in a given loop region can theoretically have their ACQUIRE and RELEASE operations coalesced at the region's entry and exit nodes respectively. However, due to the inter-thread nature of synchronization, it is important to make sure corresponding queue accesses in other threads also fall into the same loop region in those threads. The second step of the analysis compares SEGs across all communicating threads and ensures that for each queue, the SEG it is a member of is exactly the same across all threads accessing that queue. If this is not the case, the corresponding SEGs are split until the condition is true. Once this is done, each globally unique SEG is assigned a globally unique synchronization number.

For example, consider a three-way partitioning of a loop (with no inner loops) wherein thread 1 produces into queues 1, 2, 3, and 4, thread 2 consumes from queues 3 and 4 and produces to queues 5 and 6, and thread 3 consumes from queues 1, 2, 5, and 6. The first step of the analysis creates SEG [1,2,3,4] for thread 1, SEGs [3,4] and [5,6] for thread 2, and SEG [1,2,5,6] for thread 3.
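For concreteness, the sketch below shows one way the iterative marking of Figure 7.1 might be realized on a small adjacency-matrix CFG. It is only an illustration under stated assumptions: cfg_t, MAXN, and the driver are not taken from the VELOCITY implementation, step 1 (marking the initial entries and exits) is assumed to have been performed by the caller, and latest_postdom() and earliest_dom() are hypothetical helpers standing in for the compiler's dominator analyses, which must take the current entry and exit marks into account as the text requires.

    /* Illustrative fixed point for Figure 7.1, not the production analysis. */
    #include <stdbool.h>

    #define MAXN 64

    typedef struct {
        int  nnodes;
        bool edge[MAXN][MAXN];   /* edge[s][d] == true if s -> d            */
        bool is_entry[MAXN];     /* loop region entry marks                 */
        bool is_exit[MAXN];      /* loop region exit marks                  */
    } cfg_t;

    /* Assumed dominance queries; they must honor the current marks. */
    int latest_postdom(const cfg_t *g, int n);
    int earliest_dom(const cfg_t *g, int n);

    void find_loop_regions(cfg_t *g)
    {
        bool changed = true;
        while (changed) {
            changed = false;
            bool is_lpd[MAXN] = { false };   /* latest post-dominators this round */
            bool is_ed[MAXN]  = { false };   /* earliest dominators this round    */

            /* Lines 3-4: latest post-dominators become exits,
             * earliest dominators become entries. */
            for (int n = 0; n < g->nnodes; n++) {
                int l = latest_postdom(g, n);
                int e = earliest_dom(g, n);
                is_lpd[l] = true;
                is_ed[e]  = true;
                if (!g->is_exit[l])  { g->is_exit[l]  = true; changed = true; }
                if (!g->is_entry[e]) { g->is_entry[e] = true; changed = true; }
            }
            /* Lines 5-6: successors of latest post-dominators become entries,
             * predecessors of earliest dominators become exits. */
            for (int s = 0; s < g->nnodes; s++)
                for (int d = 0; d < g->nnodes; d++) {
                    if (!g->edge[s][d]) continue;
                    if (is_lpd[s] && !g->is_entry[d]) { g->is_entry[d] = true; changed = true; }
                    if (is_ed[d]  && !g->is_exit[s])  { g->is_exit[s]  = true; changed = true; }
                }
        }
    }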

Producer Thread                     Consumer Thread

Outer: ACQUIRE 0                    Outer: ACQUIRE 0
       produce [4] = r5                    consume r5 = [4]
Inner: produce [7] = r6             Inner: consume r6 = [7]
       r6 = r6 + 1                         r6 = r6 + 1
       br r6 < 100, Inner                  br r6 < 100, Inner
       r5 = r5 + 1                         r5 = r5 + 1
       produce [9] = r5                    consume r5 = [9]
       RELEASE 0                           RELEASE 0
       br r5 < 100, Outer                  br r5 < 100, Outer

Figure 7.2: Coalescing at the outer loop.

Producer Thread                     Consumer Thread

Outer: ACQUIRE 0                    Outer: ACQUIRE 0
       produce [4] = r5                    consume r5 = [4]
Inner: ACQUIRE 1                    Inner: ACQUIRE 1
       produce [7] = r6                    consume r6 = [7]
       r6 = r6 + 1                         r6 = r6 + 1
       RELEASE 1                           RELEASE 1
       br r6 < 100, Inner                  br r6 < 100, Inner
       r5 = r5 + 1                         r5 = r5 + 1
       produce [9] = r5                    consume r5 = [9]
       RELEASE 0                           RELEASE 0
       br r5 < 100, Outer                  br r5 < 100, Outer

Figure 7.3: Coalescing at loop entry and exits.

The second step of the analysis will split the SEGs such that thread 1 has [1,2] and [3,4], thread 2 has [3,4] and [5,6], and thread 3 has [1,2] and [5,6]. Now, there are three globally unique SEGs ([1,2], [3,4], and [5,6]), and they are assigned synchronization numbers 0, 1, and 2, respectively.

The first step of the analysis is a local analysis. The second step has to make a pass over all threads to determine the correct mapping from each synchronization number to its SEG and is therefore a global analysis. For each procedure, the analysis outputs the synchronization numbers used in the procedure, the direction of synchronization (i.e., produce or consume), and the ACQUIRE and RELEASE points for each synchronization number and its SEG.
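The splitting step itself admits a very simple realization: give every queue a signature recording which local SEG (if any) it belongs to in each thread, and group queues whose signatures match. The sketch below encodes the three-way partitioning example above; seg_of, NOT_ACCESSED, and the grouping loop are assumptions of this illustration (directionality is ignored for brevity), not the analysis as implemented in VELOCITY.

    /* Group queues into globally unique SEGs by per-thread signature. */
    #include <stdio.h>

    #define NTHREADS 3
    #define NQUEUES  6
    #define NOT_ACCESSED (-1)

    /* seg_of[t][q] = local SEG id of queue q in thread t, or NOT_ACCESSED.
     * Queues 1..6 from the example are numbered 0..5 here. */
    static const int seg_of[NTHREADS][NQUEUES] = {
        { 0, 0, 0, 0, NOT_ACCESSED, NOT_ACCESSED },   /* thread 1: SEG [1,2,3,4]   */
        { NOT_ACCESSED, NOT_ACCESSED, 0, 0, 1, 1 },   /* thread 2: [3,4] and [5,6] */
        { 0, 0, NOT_ACCESSED, NOT_ACCESSED, 0, 0 },   /* thread 3: SEG [1,2,5,6]   */
    };

    static int same_signature(int q1, int q2)
    {
        for (int t = 0; t < NTHREADS; t++)
            if (seg_of[t][q1] != seg_of[t][q2])
                return 0;
        return 1;
    }

    int main(void)
    {
        int sync_num[NQUEUES];
        int next = 0;
        for (int q = 0; q < NQUEUES; q++) sync_num[q] = -1;

        /* Queues with identical signatures share a globally unique SEG and
         * therefore receive the same synchronization number. */
        for (int q = 0; q < NQUEUES; q++) {
            if (sync_num[q] != -1) continue;
            sync_num[q] = next;
            for (int r = q + 1; r < NQUEUES; r++)
                if (same_signature(q, r))
                    sync_num[r] = next;
            next++;
        }
        for (int q = 0; q < NQUEUES; q++)
            printf("queue %d -> synchronization number %d\n", q + 1, sync_num[q]);
        return 0;
    }

Run on the example, this groups queues 1 and 2 under synchronization number 0, queues 3 and 4 under 1, and queues 5 and 6 under 2, matching the splitting described in the text.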

Producer Thread                  Consumer Thread

Loop: ACQUIRE 0                  Loop: ACQUIRE 0
      br r4 == r5, If                  br r4 == r5, If
      produce [4] = r5                 consume r5 = [4]
      RELEASE 0                        RELEASE 0
      br Loop                          br Loop
If:   br r5 > 0, Loop            If:   br r5 > 0, Loop

Figure 7.4: Control inequivalent ACQUIRE and RELEASE.

7.1.2 Code Generation

The code generation phase first creates memory locations for all synchronization numbers handed out. Then, for each procedure and for each synchronization number used in that procedure, it uses the analysis information to insert ACQUIREs and RELEASEs for that synchronization number at the specified points. It uses direction information for each synchronization number to determine the exact condition variable values (or occupancy counter operations) to use while generating the ACQUIREs and RELEASEs. Either concurrently or as a later pass, produce and consume instructions can be converted into store and load instructions to memory locations. A condition variable based synchronization scheme is used as the baseline software queue implementation in this dissertation. The data layout of these per-entry condition variables was similar to the queue data layouts (i.e., each condition variable also took up 8 bytes, and there were 64 entries, one corresponding to each queue entry). This layout enabled the code generator to coalesce queue pointer updates for all SEGs: pointer updates are performed only for the condition variable queue corresponding to each synchronization number, and that offset is reused as the offset for the operand queue accesses as well.
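As an illustration of the generated code's shape, the sketch below shows a producer-side loop, loosely modeled on the outer loop of Figure 7.2, after coalescing, using a single occupancy counter per synchronization number rather than the per-entry condition variables of the baseline. All names (occ0, q4, q9, ENTRIES) and the exact protocol are assumptions of this sketch, not the code emitted by the actual code generator.

    /* Producer side after coalescing: one ACQUIRE/RELEASE pair and one pointer
     * update per iteration cover every queue in SEG 0 (illustrative only). */
    #include <stdatomic.h>
    #include <stdint.h>

    #define ENTRIES 64

    extern uint64_t q4[ENTRIES], q9[ENTRIES];  /* backing store, one array per queue  */
    extern atomic_int occ0;                    /* occupancy counter for sync number 0 */

    void producer_loop(uint64_t r5_init)
    {
        unsigned idx = 0;
        for (uint64_t r5 = r5_init; r5 < 100; r5++) {
            /* ACQUIRE 0: spin while every slot is occupied */
            while (atomic_load_explicit(&occ0, memory_order_acquire) == ENTRIES)
                ;
            q4[idx] = r5;            /* produce [4] = r5, now an ordinary store */
            q9[idx] = r5 + 1;        /* produce [9], same coalesced slot index  */
            /* RELEASE 0: a single increment publishes both operands, and idx is
             * the one coalesced queue pointer update for the whole SEG.        */
            atomic_fetch_add_explicit(&occ0, 1, memory_order_release);
            idx = (idx + 1) % ENTRIES;
        }
    }

The corresponding consumer would spin while occ0 is zero, load from q4[idx] and q9[idx], and then decrement occ0 once, mirroring the single coalesced release per iteration.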

[Bar chart omitted: loop speedup for the MEMOPTI, AMORTSWQ, and AMORTSWQ+MEMOPTI configurations across bh, ks, wc, art, mcf, mst, ammp, bzip2, em3d, equake, epicdec, treeadd, adpcmdec, perimeter, and their geometric mean.]

Figure 7.5: Performance of software queues before and after overhead amortization, with and without write-forwarding support.

7.1.3 Evaluation

The performance of software queues with synchronization and queue pointer update coalescing relative to naïvely implemented software queues is shown in Figure 7.5. The baseline for this comparison is EXISTING. The geometric mean speedup of amortized software queues over EXISTING is 38%. With write-forwarding support, the difference between the two techniques narrows down to 26%, as can be seen from a comparison of the performance of AMORTSWQ+MEMOPTI and MEMOPTI.

7.1.4 Discussion

As can be seen from the performance results, synchronization and queue pointer update coalescing succeed in amortizing the instruction overhead of software queues. That being said, there is still a lot of room for performance improvement through overhead amortization. First, the current implementation always tries to coalesce synchronization and queue pointer updates at the least restrictive control flow condition level, so as to cover as many communication operations as possible.

[Figure omitted: (a) a basic-block CFG with nodes A, B, C, D and an inner-loop node E; (b) the single-entry single-exit regions found by the algorithm, in which each node ends up being an entry as well as an exit.]

Figure 7.6: Example illustrating how the single-entry single-exit algorithm fractures acyclic regions in the presence of side-exits and side-entrances.

However, if a communication operation executes only in a more restrictive condition and there are very few communication operations, then the above algorithm may actually cause the program to execute more instructions by acquiring and releasing synchronization for loop iterations when the communication operations themselves would not have executed. In such cases, the naïve software queue implementation would execute fewer instructions and perform better. Therefore, the technique presented above is, at best, a heuristic. More work is necessary to make the analysis aware of such issues and perform coalescing only when it is absolutely profitable.

Another drawback of the analysis technique presented above is that it does not guarantee maximal loop regions for overhead amortization. It only guarantees maximal single-entry single-exit loop regions. The presence of side-exits causes the analysis to quickly degenerate to a much simpler basic block level coalescing algorithm. For example, applying the above analysis to the loop whose basic block control flow graph is shown in Figure 7.6a will result in the loop regions shown in Figure 7.6b. The loop header, and hence an initial entry, is node A. The loop exits through the conditional branch in node D and hence D is an initial exit. The loop regions are indicated by solid bounding boxes. The dashed box around the inner loop E indicates that sources of any incoming arcs are initial exits and destination nodes of any outgoing arcs are initial entries. Thus, D starts off as an entry node alongside A, thanks to the E→D edge.

Likewise, node C starts off as an initial exit in addition to node D. This causes node A to become its latest post-dominator. Next, B, having an incoming edge from a latest post-dominator, is marked as an entry node. Further, since node D is its earliest dominator, the sources of its incoming edges B→D and E→D must be marked as exits. In this way, every node becomes an entry and an exit and will serve as acquire and release points for communication operations, if any, in the respective blocks. In this extreme example, the single-entry single-exit regions detected by the analysis degenerate to basic blocks. This drastically reduces the scope for synchronization coalescing. For example, communication operations in basic blocks A and B cannot be coalesced as each of them is its own loop region.

Ideally, synchronization should be acquired upon entry into a maximal acyclic region and released when going from one maximal acyclic region to another. A simple dataflow analysis to detect such maximal regions is shown in Figure 7.7. This procedure detects maximal single-entry multiple-exit regions. Note that the region exits are provided in the form of edges.

1:  Mark Initial Entries and Exits
2:  ∀ n, Entry[n] = {n} if n is an initial entry; ∪ otherwise, where ∪ is the universal set
3:  repeat
4:      ∀ n, Gen[n] = {n} if n is an entry; {} otherwise
5:      ∀ n, Kill[n] = {n} if n is an exit; {} otherwise
6:      ∀ n, Entry[n] = (Gen[n] − Kill[n]) ∧ S, where S = ∧ Entry[i] over all i such that i→n is an edge, and ∧ is the set intersection operator
7:      if Entry[n] is {} then
8:          Mark n as a new loop region entry
9:      end if
10: until (Entry[n] does not change for any node n)
11: Group all nodes n into a loop region with entry E such that Entry[n] equals E
12: Mark n→d as a loop region exit edge for a region with entry E, if n→d is an edge such that Entry[d] is not equal to E

Figure 7.7: Algorithm to find maximal single-entry multiple-exit loop regions.

[Figure omitted: the loop CFG from Figure 7.6 redrawn with its single-entry multi-exit regions marked; see the caption of Figure 7.8 below.]

Figure 7.8: Application of the single-entry multi-exit algorithm to the loop from Figure 7.6. There are three single-entry multi-exit regions: region ABC with node A as entry and edges B→D and C→E as exit edges, region D with node D as entry and the outgoing edge from D as exit edge, and finally region E with node E as entry and edges E→E and E→D as exit edges. The exit edges will need to be split during code generation in order to insert release operations.

In order to be able to insert synchronization release operations correctly along all region exits, it is necessary to split the region exit edges and add new "exit" control blocks along the split edges. While this approach guarantees maximal coalescing, the extra control blocks created by edge splitting can detrimentally affect instruction scheduling. Run-time performance evaluation for the single-entry multiple-exit scheme was not performed, since static schedule height comparisons showed the single-entry multiple-exit scheme to have longer schedule heights than codes generated with the single-entry single-exit scheme presented earlier in this section, owing to the new blocks introduced by edge splitting. Figure 7.8 shows the regions formed with the single-entry multi-exit algorithm for the control flow graph shown in Figure 7.6b.

With extra effort, it seems possible to obtain automatic software queue implementations that deliver high performance by amortizing overhead across multiple queues. Until first-class support becomes available for high-frequency streaming inter-core communication, software queues will be the sole means of communicating from one thread to another and will definitely benefit from research along the direction presented in this section. While the performance analysis in this dissertation uses many small application loops, which are responsible for the poor showing of software queues versus other communication mechanisms, for larger loops the overhead due to inter-thread communication may be masked by application instruction execution. Techniques such as the one just presented may thus pave the way for software queues to become a really attractive communication mechanism for such applications.

7.2 Hardware Enhancements to Snoop-Based Synchronization

This section describes simple enhancements to SYNCOPTI to bring its performance as close as possible to HEAVYWT. As seen in Section 6.3 of the previous chapter, SYNCOPTI trails the performance of HEAVYWT primarily due to the large consume-to-use latency, which slowed down the iteration initiation rate of the consumer thread, in turn causing the producer thread to eventually stall. These basic observations motivate two optimizations to SYNCOPTI.

First, in order to avoid frequent producer thread stalls due to queue-full conditions, the queue size is increased to 128 entries (up from 32 entries), and the QLU is increased to pack 16 8-byte queue items per cache line (Q128). While this may seem like a straightforward thing to do, it is important to realize that increasing the queue sizes is a luxury that a dedicated hardware technique like HEAVYWT cannot afford due to the sheer hardware costs. Even for SYNCOPTI, one cannot increase the queue size indefinitely. Queue sizes can grow only so long as the bit width of the occupancy counters (used for synchronization) can represent the queue size. However, since the queue size grows exponentially with respect to the bit width of the counters, SYNCOPTI can easily scale up to large queue sizes.

Second, in order to reduce the average consume-to-use latency, the use of a special fully associative 16KB stream cache (SC) was evaluated. Improving consumer performance indirectly improves producer performance by avoiding frequent queue-full stalls. While the stream cache does add additional storage to each processor core, this storage amounts to only 15% of the storage used for the dedicated queue backing store.

The proposed stream cache works as follows. When cache lines mapped to queues are forwarded from the producer's L2 cache to the consumer's L2 cache, after filling the consumer's L2 cache, the memory address is reverse mapped to a queue address (a two-tuple of queue number and queue slot) that is used to fill into the stream cache. By using a different address space, consume instructions are now able to access queue data without going through TLB lookup, memory address generation, etc. Stream cache entries are invalidated by consume instructions that hit. If the stream cache is full, then fill requests to the stream cache are ignored. In this modified SYNCOPTI design, consume instructions continue to go to the L2, even if they are serviced by the stream cache, to ensure that the synchronization counters are updated and the producer core is informed of these updates. If a consume instruction misses in the stream cache, then it is handled in the L2, just as it was in the original snoop-based synchronization design.

Note that this optimization requires stream address generation logic in the processor pipeline (akin to HEAVYWT) to rename consume instructions to the correct queue addresses to index into the stream cache. However, this is still better than HEAVYWT, since SYNCOPTI shares the L3 bus, while HEAVYWT requires extra interconnects connecting the cores to the synchronization array, which can be expensive [29].
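A functional sketch of the reverse mapping is shown below. It assumes the queue backing store is one contiguous region of 8-byte slots laid out queue after queue; QUEUE_BASE, QSIZE, and SLOT_BYTES are parameters of this illustration only, and the real hardware presumably derives the same two-tuple from its own queue layout.

    /* Illustrative reverse mapping from a memory address in the queue backing
     * store to the (queue number, queue slot) two-tuple that indexes the
     * stream cache. */
    #include <stdint.h>

    #define QUEUE_BASE 0x20000000ULL   /* assumed start of the queue region     */
    #define QSIZE      128             /* entries per queue (Q128 configuration) */
    #define SLOT_BYTES 8               /* one 8-byte operand per slot           */

    typedef struct { uint32_t queue; uint32_t slot; } queue_addr_t;

    static inline queue_addr_t reverse_map(uint64_t paddr)
    {
        uint64_t off = (paddr - QUEUE_BASE) / SLOT_BYTES;   /* slot index overall */
        queue_addr_t qa = { (uint32_t)(off / QSIZE),        /* which queue        */
                            (uint32_t)(off % QSIZE) };      /* which slot         */
        return qa;
    }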

Both these optimizations were evaluated in isolation and together. Figure 7.9 presents the breakdown for the producer (above) and consumer (below) cores. From right to left, SYNCOPTIQ128 improves the producer by reducing stalls (smaller preL2) and improves the consumer by providing improved cache locality (smaller L2) through a denser queue layout. Next, SYNCOPTISC lowers consume-to-use latency and improves the performance of both cores. SYNCOPTISC+Q128 combines the benefit of both by further reducing stalls in the producer and lowering the consumer's L2 component.

[Bar charts omitted: normalized execution time, broken down into PreL2, L2, BUS, L3, MEM, and PostL2 components, for configurations 1 through 5 on each benchmark (bh, ks, wc, art, mcf, mst, ammp, bzip2, em3d, equake, epicdec, treeadd, adpcmdec, perimeter) and their geometric mean; producer cores on top, consumer cores below.]

Figure 7.9: Effect of streaming cache and queue size on producer (top) and consumer (bottom).
(1 = HEAVYWT, 2 = SYNCOPTISC+Q64, 3 = SYNCOPTISC, 4 = SYNCOPTIQ64, 5 = SYNCOPTI)

It is able to achieve performance equaling HEAVYWT, at times even performing better, achieving a 2X speedup over the EXISTING and MEMOPTI mechanisms and bridging the gap with HEAVYWT to just 8%.

7.3 Summary

This chapter introduced and evaluated two communication support optimizations. The first is synchronization coalescing, a static analysis and code-generation technique. It amortizes the synchronization and queue pointer update overhead of conventional software queues by coalescing these operations across multiple parallel queue accesses. When applied with naïve heuristics, this technique has been shown to outperform existing software queue implementations by 38% on average. Sophisticated heuristics to carefully apply this technique can yield higher performance.

The second is the stream cache, a simple hardware optimization to reduce consume-to-use latency in snoop-based synchronization implementations (SYNCOPTI). The use of longer queues was also evaluated. Longer queues along with the stream cache optimization enabled SYNCOPTI to perform almost as fast as the heavyweight techniques represented by HEAVYWT. In practice, a 16KB stream cache was found to be sufficient to yield high performance. The main benefit of a cache design is that, unlike a dedicated primary storage design (such as the synchronization array), a cache need not be made large enough to handle the worst-case inter-thread traffic scenario and can instead be made small enough to meet cycle-time requirements. Using the memory hierarchy as backing store allows more flexible queue sizing than dedicated backing store techniques. Further, it obviates the need to save the contents of the stream cache during context switches, since the data in the memory hierarchy is handled automatically by the operating system.

While the experiments presented in this chapter and Chapter 6 did not require very long queues, long queues may be needed to support DSWP programs with non-linear pipelines (refer to Chapter 5). As mentioned in Chapter 3, accesses to long queues may stride through the entire cache repeatedly and can result in poor locality for non-stream cache accesses. This can degrade overall performance. One solution to this problem is to reserve one "way" of large multi-way caches exclusively for stream accesses. This will entail modifying the cache access logic to ensure that all stream accesses go to the reserved way. By confining stream accesses to one way, this strategy can use the remaining ways of the cache to preserve locality for non-stream accesses.
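A toy victim-selection policy illustrating the reserved-way idea is sketched below. NUM_WAYS, STREAM_WAY, and the lru_way() helper are assumptions of this sketch, not parameters of the evaluated design.

    /* Illustrative fill-way selection for a set-associative cache with one way
     * reserved for stream (queue) accesses. */
    #define NUM_WAYS   8
    #define STREAM_WAY 0

    /* Assumed helper: least-recently-used way of a set, restricted to ways in
     * the half-open range [lo, hi). */
    int lru_way(int set, int lo, int hi);

    int pick_victim_way(int set, int is_stream_access)
    {
        if (is_stream_access)
            return STREAM_WAY;                           /* all stream data in one way */
        return lru_way(set, STREAM_WAY + 1, NUM_WAYS);   /* other ways keep locality   */
    }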

Chapter 8

Conclusions and Future Directions

Multi-core processors, or chip multiprocessors (CMPs), represent the biggest paradigm shift in recent years in the microprocessor industry. While these multi-core processors are great for throughput computing, they do not automatically improve the performance of legacy codes, highly sequential codes, and single-threaded codes. Exposing thread-level parallelism (TLP) is critical to obtaining continued performance improvements for these applications on multi-core processors. This dissertation is a step in that direction.

Moving to TLP means one has to efficiently handle rising inter-core communication costs, lest they negate any benefits from parallelization. From this standpoint, this dissertation identifies pipelined multithreading (PMT) as an attractive multithreading strategy. In PMT, programs are partitioned into threads such that there are no cycles in the inter-thread dependence graph. Acyclic dependences among PMT threads allow pipelining of inter-thread communication and thus enable PMT programs to tolerate transit delays efficiently.

This dissertation introduced a non-speculative PMT loop transformation called Decoupled Software Pipelining (DSWP). In its simplest form, DSWP partitions a given single-threaded program loop into a critical path loop and an off-critical path loop, which execute as concurrent threads and communicate through a decoupling queue. The decoupling queue insulates the execution of the critical path thread from stalls in the off-critical path thread and vice-versa. This work demonstrates how decoupled execution in DSWP can be leveraged to effectively tolerate variable latency stalls in an application. With detailed cycle-accurate simulation and analysis, DSWP has been shown to tolerate variable latency stalls better than both in-order and out-of-order issue processors.

The dissertation then studied the performance scalability of DSWP as a general-purpose multithreading strategy. It presented a thorough evaluation and analysis of the performance of automatically generated DSWP codes across 2, 4, 6, and 8 threads. Initial results with 32-entry inter-thread queues turned out to be quite counter-intuitive, since most of the benchmarks showed a performance degradation when moving to more threads. Careful analysis revealed that the cause for this degradation was the non-linearity of thread pipelines when moving to a greater number of threads. While linear thread pipelines contain a simple linear chain of dependences from one thread to another, non-linear pipelines can have arbitrary acyclic inter-thread dependences. Small queue sizes lead to communication bottlenecks in non-linear pipelines. It was shown that, for non-linear pipelines, the minimum queue size needed for bottleneck-free communication depends not just on the transit delay, but also on the loop iteration times of the threads in the pipeline. This makes it difficult to pre-design hardware with sufficiently long queues. Under ideal communication (i.e., no bottlenecks due to insufficient queue sizing or due to bandwidth constraints), the performance scalability of DSWP is in line with theoretical expectations. Except for two benchmarks, wherein increased coherence misses caused performance to go down, the performance for all the other benchmarks uniformly increased until the bottleneck SCC was put in a thread of its own, and it remained steady at the same performance level upon further partitioning. This analysis concluded that the compiler should strive to avoid generating non-linear thread pipelines.

Work presented in this dissertation and elsewhere [8, 14, 17, 66] has established PMT as a viable technique for parallelizing general-purpose programs. However, its success hinges on efficient underlying communication support. While designers are striving [27] to include new, fast interconnects to add value to future CMPs, these enhancements offer little to latency-agnostic PMT programs, which have a streaming communication behavior. To meet the streaming requirements of PMT programs, this dissertation has presented three novel communication mechanisms with contrasting cost-performance tradeoffs: the synchronization array hardware, SYNCOPTI with stream cache, and synchronization coalescing. The first two techniques require ISA enhancements in the form of produce and consume instructions. The third technique is a software-only technique to automatically generate software queue implementations with amortized overhead.

The synchronization array performs the best, but it requires dedicated hardware for inter-thread queue storage and an inter-thread operand transfer network. It also requires substantial OS support for saving queue state on context switches and for queue virtualization. SYNCOPTI with stream cache lowers the hardware cost and OS impact of the synchronization array by using the memory subsystem for inter-thread data transfers. Only synchronization is done with hardware counters. Through the design of SYNCOPTI, the dissertation shows that application properties can be successfully leveraged to implement new low-overhead communication primitives with minimal redesign effort, hardware costs, and OS costs, thus presenting an attractive value proposition to designers of future multi-core processors. Finally, even though synchronization coalescing performs the worst of the three, its performance is still approximately 1.4X better than the performance of existing techniques. It incurs no hardware cost and has no OS impact.

The DSWP transformation, which started as a mechanism to tolerate variable latency stalls [51], has been shown to be a general-purpose multithreaded scheduling technique [40]. Going forward, a lot of exciting research possibilities exist in exploring the interaction of DSWP with other scheduling techniques (for example, CMT or IMT techniques), with and without dependence speculation. Speculative PMT transformations will entail the design of novel hardware, software, and/or hybrid support for misspeculation detection and recovery. Synergistic interactions with the single-threaded performance improvement strategies mentioned in Chapter 2 need to be studied and understood.

Incorporating better communication cost models into the automatic DSWP partitioner will enable it to generate highly optimized codes. Careful code duplication may avoid excessive communication without unduly increasing schedule height. Evolving high performance software queue implementations will play a crucial role in the widespread commercial use of DSWP and other PMT techniques. The synchronization coalescing technique will serve as a starting point for such research. Lightweight hardware mechanisms like SYNCOPTI will need to be adapted to provide efficient inter-core streaming communication in non-bus-based memory networks. Early calculations with simple power models reveal that DSWP can achieve substantial energy-delay-product savings through dynamic voltage and frequency scaling. The initial results are definitely exciting enough to warrant a more detailed exploration.

To conclude, the future is bright for research on improving general-purpose program performance on multi-core processors. While there is no doubt that this is a very hard problem, this dissertation has shown that a purely non-speculative technique like DSWP can record significant performance gains. Hopefully, future research can build on this work and take it to the next level by improving its applicability and scalability.

Bibliography

[1] H. Abdel-Shafi, J. Hall, S. V. Adve, and V. S. Adve. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, pages 204– 215, February 1997.

[2] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 2003.

[3] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pages 177–189, January 1983.

[4] M. M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of the 28th International Symposium on Computer Architecture, pages 52–61, 2001.

[5] R. D. Barnes, E. M. Nystrom, J. W. Sias, S. J. Patel, N. Navarro, and W. W. Hwu. Beating in-order stalls with ‘Flea-Flicker’ two-pass pipelining. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003.

[6] R. D. Barnes, S. Ryoo, and W. W. Hwu. “Flea-Flicker” multipass pipelining: An alternative to the high-powered out-of-order offense. In Proceedings of the 38th International Symposium on Microarchitecture, pages 319–330, December 2005.

[7] G. T. Byrd. Communication Mechanisms in Shared Memory Multiprocessors. PhD thesis, Department of Electrical Engineering, Stanford University, Stanford, CA, 1998.

[8] E. Caspi, A. DeHon, and J. Wawrzynek. A streaming multi-threaded model. In Proceedings of the Third Workshop on Media and Stream Processors, December 2001.

[9] R. S. Chappel, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous subordinate microthreading. In Proceedings of the 26th International Symposium on Computer Architecture, pages 186–195, May 1999.

[10] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th International Symposium on Computer Architecture, July 2001.

[11] Compaq Computer Corporation. Alpha 21264 Microprocessor Hardware Reference Manual, July 1999.

[12] A. Cristal, O. J. Santana, M. Valero, and J. F. Martínez. Toward kilo-instruction processors. ACM Transactions on Architecture and Code Optimization, 1(4):389–417, 2004.

[13] R. Cytron. DOACROSS: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing, pages 836–884, 1986.

[14] J. Dai, B. Huang, L. Li, and L. Harrison. Automatically partitioning packet processing applications for pipelined architectures. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 237–248, 2005.

[15] M. I. Frank and M. K. Vernon. A hybrid shared memory/message passing parallel machine. In Proceedings of the 1993 International Conference on Parallel Processing, pages 232–236. CRC Press, August 1993.

[16] M. P. Gerlek, E. Stoltz, and M. Wolfe. Beyond induction variables: Detecting and classifying sequences using a demand-driven SSA form. ACM Transactions on Programming Languages and Systems, 17(1):85–122, January 1995.

[17] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 291–303, 2002.

[18] T. Gross and D. O'Hallaron. iWarp: Anatomy of a Parallel Computing System. MIT Press, 1998.

[19] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, 2000.

[20] L. Hammond, B. A. Nayfeh, and K. Olukotun. A single-chip multiprocessor. IEEE Computer, September 1997.

[21] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38–50. ACM Press, 1994.

[22] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 01(2), 2001.

[23] M. S. Hrishikesh, N. P. Jouppi, K. I. Farkas, D. C. Burger, S. W. Keckler, and P. Shivakumar. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In Proceedings of the 29th International Symposium on Computer Architecture, pages 12–24, May 2002.

[24] Intel Corporation. Intel Itanium 2 Processor Reference Manual: For Software Development and Optimization. Santa Clara, CA, 2002.

[25] R. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24–36, March/April 1999.

[26] A. KleinOsowski, J. Flynn, N. Meares, and D. Lilja. Adapting the SPEC 2000 benchmark suite for simulation-based computer architecture research. In Proceedings of the International Conference on Computer Design, September 2000.

[27] M. Kobrinsky, B. Block, J.-F. Zheng, B. Barnett, E. Mohammed, M. Reshotko, F. Roberston, S. List, I. Young, and K. Cadien. On-chip optical interconnects. pages 129–142, May 2004.

[28] D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, and B.-H. Lim. Integrating message-passing and shared-memory: Early experience. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 54–63. ACM Press, May 1993.

[29] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408–419. IEEE Computer Society, June 2005.

[30] M. S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 318–328, June 1988.

[31] C. Lee, M. Potkonjak, and W. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 330–335, December 1997.

[32] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash multiprocessor. Computer, 25(3):63–79, 1992.

[33] C.-K. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, pages 222–233. ACM Press, 1996.

[34] S. F. Lundstrom and G. H. Barnes. A controllable MIMD architecture. In Proceedings of the International Conference on Parallel Processing, pages 19–27, 1980.

[35] Y. Luo and L. K. John. Efficiently evaluating speedup using sampled processor simulation. Computer Architecture Letters, September 2004.

[36] S. A. Mahlke. Exploiting Instruction Level Parallelism in the Presence of Conditional Branches. PhD thesis, University of Illinois, Urbana, IL, 1995.

[37] S. S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent network interfaces for fine-grain communication. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 247–258, 1996.

[38] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th International Symposium on High Performance Computer Architecture, February 2003.

[39] E. M. Nystrom, R. D. Ju, and W. W. Hwu. Characterization of repeating data access patterns in integer benchmarks. In Proceedings of the 28th International Symposium on Computer Architecture, September 2001.

[40] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.

[41] D. A. Padua. Multiprocessors: Discussion of some theoretical and practical problems. Technical Report UIUCDCS-R-79-990, Department of Computer Science, University of Illinois, Urbana, IL, November 1979.

[42] D. A. Penry, M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation, June 2005.

[43] M. Pericàs, R. González, A. Cristal, D. A. Jiménez, and M. Valero. A decoupled kilo-instruction processor. In Proceedings of the Twelfth International Symposium on High Performance Computer Architecture, February 2006.

[44] D. Poulsen. Memory Latency Reduction via Data Prefetching and Data Forwarding in Shared-Memory Multiprocessors. PhD thesis, University of Illinois, Urbana, IL, 1994.

[45] B. R. Preiss and V.C. Hamacher. A cache-based message passing scheme for a shared- bus multiprocessor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 358–364. IEEE Computer Society Press, 1988.

[46] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A study of slipstream processors. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 269–280. ACM Press, 2000.

[47] R. Rajwar, A. Kagi, and J. R. Goodman. Inferential queueing and speculative push for reducing critical communication latencies. In Proceedings of the 17th Annual International Conference on Supercomputing, pages 273–284. ACM Press, June 2003.

[48] U. Ramachandran, G. Shah, A. Sivasubramaniam, A. Singla, and I. Yanasak. Architectural mechanisms for explicit communication in shared memory multiprocessors. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, page 62. ACM Press, 1995.

[49] R. Rangan and D. I. August. Amortizing software queue overhead for pipelined inter-thread communication. In Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism (PMUP), pages 1–5, September 2006.

[50] R. Rangan, N. Vachharajani, A. Stoler, G. Ottoni, D. I. August, and G. Z. N. Cai. Support for high-frequency streaming in CMPs. In Proceedings of the 39th International Symposium on Microarchitecture, pages 259–269, December 2006.

[51] R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177–188, September 2004.

[52] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th International Symposium on Microarchitecture, pages 63– 74, December 1994.

[53] A. Roth, A. Moshovos, and G. S. Sohi. Dependence-based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 115–126, October 1998.

[54] A. Roth and G. S. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th International Symposium on Computer Architecture, May 1999.

[55] A. Roth and G. S. Sohi. Speculative data-driven multithreading. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, January 2001.

[56] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003.

[57] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.

[58] J. E. Smith. Decoupled access/execute computer architectures. In Proceedings of the 9th International Symposium on Computer Architecture, pages 112–119, April 1982.

[59] G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.

[60] G. S. Sohi and S. Vajapeyam. Instruction issue logic for high-performance interruptable pipelined processors. In Proceedings of the 14th Annual Symposium on Computer Architecture, pages 27–34, June 1987.

[61] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P. N. Strenski, and P. G. Emma. Optimizing pipelines for power and performance. In MICRO 35: Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, pages 333–344. IEEE Computer Society Press, 2002.

[62] J. G. Steffan, C. Colohan, A. Zhai, and T. C. Mowry. The stampede approach to thread-level speculation. ACM Transactions on Computer Systems, 23(3):253–300, 2005.

[63] M. Takesue. Software queue-based algorithms for pipelined synchronization on multiprocessors. In Proceedings of the 2003 International Conference on Parallel Processing Workshops, October 2003.

[64] M. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, March 2002.

[65] M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145–162, February 2005.

[66] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 12th International Conference on Compiler Construction, 2002.

[67] S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, pages 61–71, June 2006.

[68] J.-Y. Tsai, J. Huang, C. Amlo, D. J. Lilja, and P.-C. Yew. The superthreaded processor architecture. IEEE Transactions on Computers, 48(9):881–902, 1999.

[69] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.

[70] M. Vachharajani, N. Vachharajani, and D. I. August. The Liberty Structural Specification Language: A high-level modeling language for component reuse. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI), pages 195–206, June 2004.

[71] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, and D. I. August. Microarchitectural exploration with Liberty. In Proceedings of the 35th International Symposium on Microarchitecture, pages 271–282, November 2002.

[72] S. Vajapeyam and T. Mitra. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In Proceedings of the 24th annual International Symposium on Computer Architecture, pages 1–12. ACM Press, 1997.

[73] T. F. Wenisch, R. E. Wunderlich, B. Falsafi, and J. C. Hoe. TurboSMARTS: Accurate microarchitecture simulation sampling in minutes. Technical Report 2004-003, Computer Architecture Lab at Carnegie Mellon, November 2004.

[74] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA), pages 84– 97, June 2003.

[75] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th annual International Symposium on Microarchitecture, pages 85–96. IEEE Computer Society Press, 2002.
