Copyright by Ramadass Nagarajan 2007

The Dissertation Committee for Ramadass Nagarajan certifies that this is the approved version of the following dissertation:

Design and Evaluation of a Technology-Scalable Architecture for Instruction-Level Parallelism

Committee:
Douglas C. Burger, Supervisor
Michael D. Dahlin
Stephen W. Keckler
Kathryn S. McKinley
Gurindar S. Sohi

Design and Evaluation of a Technology-Scalable Architecture for Instruction-Level Parallelism

by Ramadass Nagarajan, B.Tech., M.S.

Dissertation Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

The University of Texas at Austin
May 2007

To my parents and sisters.

Acknowledgments

I thank the members of my committee, Doug Burger, Mike Dahlin, Steve Keckler, Kathryn McKinley, and Guri Sohi, for supervising this dissertation. I value their comments and suggestions, which have improved its quality. I especially thank Guri Sohi for taking time out of his busy schedule to come down to Austin and serve on my committee.

I am eternally grateful to Doug Burger, who has been my principal advisor and mentor through all my years in graduate school. Doug has been as perfect an advisor as one could ever hope for. I thank him for all the advice—technical, professional, and otherwise—for hand-holding when necessary and cutting me loose at other times, for being patient when I was tardy or sloppy, for propping me up on opportune occasions, and, of course, for supporting me financially, even when it meant being at loggerheads with the establishment. His infectious enthusiasm and sharp wit only made the experience all the more enjoyable.

I thank Steve Keckler for serving as my de facto co-advisor throughout my stay in school. Steve has complemented Doug in an advisory capacity in many ways, and I have benefited immeasurably from it. His insightful opinion, wise counsel, and deep intuition have been a guiding influence on my development as a researcher. Steve was the first faculty member I met after I was admitted to the doctoral program at UT, and he offered me the opportunity to explore architecture research during my first year. It was that experience and his advice that drew me to the CART group, which he co-leads with Doug, and for that I will be forever grateful.

I thank Kathryn McKinley for several years of research collaboration on the TRIPS project. I thank her for the numerous interactions and thoughtful advice on technical writing and presentation. I cannot thank Doug, Steve, and Kathryn enough for the vision, leadership, and drive that set the direction of the TRIPS project in general, and this dissertation in particular. I also thank Daniel Jimenez, Calvin Lin, and Chuck Moore for their research collaborations on the TRIPS project.

I thank the entire TRIPS team that helped transform an abstract research idea into working silicon. I would like to thank Robert McDonald for instilling rigor and attention to detail in every aspect of the TRIPS prototype implementation. I thank the entire hardware team—Raj Desikan, Saurabh Drolia, Madhu S. Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, and Premkishore Shivakumar—and the entire software team—Jim Burrill, Katie Coons, Mark Gebhart, Madhavi Krishnan, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Behnam Robatmili, Sadia Sharif, Aaron Smith, and Bill Yoder—for their contributions.
Without their collective contributions, the infrastructure necessary for this dissertation would not have been possible.

Karu Sankaralingam merits my special gratitude for his close collaboration. I started working with him on the PowerPC port of the SimpleScalar simulator in my first year, and the collaboration continued through the implementation of the TRIPS prototype processor. I have always admired his expansive scholarship, deep insight, and unique perspective on various issues, research and otherwise. I will forever cherish our initial joint work on the TRIPS architecture, which included non-stop brainstorming sessions, simulator hacking, and co-authorship of multiple TRIPS-related publications. Karu was also a constant source of entertainment around the cubicle area who brought comic relief to the occasional dreary moment.

I thank Changkyu Kim for adding the TRIPS port to the tsim-flex simulator, one of the simulators that I used to generate the results presented in this dissertation. I thank Nitya Ranganathan for help with the SimPoint simulations. I thank Heather Hanson, Divya Gulati, Madhu S. Sibi Govindan, and Aaron Smith for proofreading portions of this dissertation.

I thank all the past and current members of the CART group for numerous technical discussions and feedback on papers and practice talks. I have always appreciated, and often dreaded, the comments and questions that they threw my way, and the experience has made me a better researcher. The members of the CART group and the Speedway group have also provided a welcome distraction from the rigors of academic life through numerous social occasions.

I thank M.S. Hrishikesh for his patience and counsel during the numerous times I have turned to him for advice. He hosted me during my visit to Intel, Bangalore, and has offered valuable feedback on my post-school career choices. I also thank Kartik Agaram for tolerating me as his roommate during the first few months, Simha Sethumadhavan for enduring me as his cubicle mate for several years, Heather Hanson for her lessons on power and general advice, and Madhu S. Sibi Govindan and Suriya Subramanian for their valuable help at critical junctures.

I thank the staff of the Computer Sciences department, in particular Gloria Ramirez for her help as the graduate coordinator, and the entire gripe staff for the timely resolution of my complaints about the facilities. I also thank Gem Naivar for her infinite patience, for getting me paid on time, and for help with travel arrangements and other miscellaneous paperwork.

I thank all friends outside the immediate research group for their merry company, for numerous hiking, camping, and ski trips, and for cooking me several delicious meals. In particular, I would like to thank Murali Narasimhan, Karthik Subramanian, Siddarth Krishnan, Shobha Vasudevan, Kunal Punera, Vinod Viswanath, Silpa Sigireddy, Bharat Chandra, Ramsundar Ganapathy, Subbu Iyer, Vishv Jeet, Prem Melville, Joseph Modayil, Aniket Murarka, Amol Nayate, Arun Venkatramani, the Vibha volunteers, and the entire "Far West" gang.

Last, but not least, I remain forever indebted to my parents and sisters. I cannot thank them enough for supporting me unflinchingly, continuing to worry about my well-being, and tolerating my extended absences from home for weeks, months, and years at a stretch ever since high school.
I especially thank my mom for asking me, every time I spoke with her, when I was going to graduate, and my dad and sisters for never asking me the same question. Without their constant encouragement and understanding, I could not have come this far. To you, Mom, Dad, Hema, and Lalitha, I owe everything.

Ramadass Nagarajan
The University of Texas at Austin
May 2007

Design and Evaluation of a Technology-Scalable Architecture for Instruction-Level Parallelism

Publication No.

Ramadass Nagarajan, Ph.D.
The University of Texas at Austin, 2007

Supervisor: Douglas C. Burger

Future performance improvements must come from the exploitation of concurrency at all levels. Recent approaches that focus on thread-level and data-level concurrency are a natural fit for certain application domains, but it is unclear whether they can be adapted efficiently to eliminate serial bottlenecks. Conventional superscalar hardware that instead focuses on instruction-level parallelism (ILP) is limited by power inefficiency, on-chip wire latency, and design complexity. Ultimately, poor single-thread performance and Amdahl's law will inhibit overall performance growth even on parallel workloads. To address this problem, we undertook the challenge of designing a scalable, wide-issue, large-window processor that mitigates complexity, reduces power overheads, and exploits ILP to improve single-thread performance at future wire-delay-dominated technologies.

This dissertation describes the design and evaluation of the TRIPS architecture for exploiting ILP. The TRIPS architecture belongs to a new class of instruction set architectures called Explicit Data Graph Execution (EDGE) architectures, which use large dataflow graphs of computation and explicit producer-consumer communication to express concurrency to the hardware. We describe how these architectures match the characteristics of future sub-45 nm CMOS technologies to mitigate complexity and improve concurrency at reduced overheads. We describe the architectural and microarchitectural principles of the TRIPS architecture, which exploits ILP by issuing instructions widely, in dynamic dataflow fashion, from a large distributed window of instructions. We then describe our specific contributions to the development of the TRIPS prototype chip, which was implemented in a 130 nm ASIC technology and consists of more than 170 million transistors. In particular, we describe the implementation of the distributed control protocols that offer various services for executing a single program in the hardware.

Finally, we describe a detailed evaluation of the TRIPS architecture and identify the key determinants of its performance. In particular, we describe the development of the infrastructure required for a detailed analysis, including a validated performance model, a highly optimized suite of benchmarks, and critical path models that identify various architectural and microarchitectural bottlenecks at a fine level of granularity. On a set of highly optimized benchmark kernels, the manufactured TRIPS parts outperform a conventional superscalar processor by a factor of 3× on average. We find that the automatically compiled versions of the same kernels have yet to reap the benefits of the high-ILP TRIPS core, but they exceed the performance of the superscalar processor in many cases.
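For readers unfamiliar with the execution style the abstract describes, the following minimal sketch (in Python, and emphatically not the TRIPS ISA or toolchain) illustrates the producer-consumer, dataflow-firing idea behind EDGE architectures: each instruction encodes the consumers of its result, fires as soon as all of its operands arrive, and forwards its result directly to those consumers rather than through a shared register file. The instruction format, the tiny graph, and all names below are hypothetical illustrations.

    # Toy dataflow-graph interpreter: instructions fire when all operands
    # arrive and forward results directly to named consumer instructions.
    from collections import deque

    OPS = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
    }

    def run_block(instrs, inputs):
        """instrs: id -> (opcode, operand_count, [(consumer_id, slot), ...])
           inputs: [(target_id, slot, value), ...] injected from outside."""
        operands = {i: {} for i in instrs}   # operands received so far
        ready = deque()
        results = {}

        def deliver(target, slot, value):
            operands[target][slot] = value
            if len(operands[target]) == instrs[target][1]:
                ready.append(target)         # all operands present: can fire

        for tgt, slot, val in inputs:
            deliver(tgt, slot, val)

        while ready:                         # dynamic dataflow issue order
            i = ready.popleft()
            op, _, consumers = instrs[i]
            args = [operands[i][s] for s in sorted(operands[i])]
            results[i] = OPS[op](*args)
            for tgt, slot in consumers:      # explicit producer->consumer send
                deliver(tgt, slot, results[i])
        return results

    # Hypothetical block computing (x + y) * (x + 1): I2 and I3 feed I4 directly.
    instrs = {
        2: ("add", 2, [(4, 0)]),
        3: ("add", 2, [(4, 1)]),
        4: ("mul", 2, []),
    }
    inputs = [(2, 0, 5), (2, 1, 7), (3, 0, 5), (3, 1, 1)]
    print(run_block(instrs, inputs))         # {2: 12, 3: 6, 4: 72}

In this sketch the two additions may fire in either order once their operands arrive, which is the property the dissertation exploits at scale: a large window of such instructions, distributed across execution tiles, issues widely without the centralized structures that limit conventional superscalar designs.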