Copyright by Onur Mutlu 2006 the Dissertation Committee for Onur Mutlu Certifies That This Is the Approved Version of the Following Dissertation

Copyright by Onur Mutlu 2006

The Dissertation Committee for Onur Mutlu certifies that this is the approved version of the following dissertation:

Efficient Runahead Execution Processors

Committee:
Yale N. Patt, Supervisor
Craig M. Chase
Nur A. Touba
Derek Chiou
Michael C. Shebanow

Efficient Runahead Execution Processors

by Onur Mutlu, B.S.; B.S.E.; M.S.E.

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN
August 2006

Dedicated to my loving parents, Hikmet and Nevzat Mutlu, and my sister Miray Mutlu

Acknowledgments

Many people and organizations have contributed to this dissertation, intellectually, motivationally, or otherwise financially. This is my attempt to acknowledge their contributions.

First of all, I thank my advisor, Yale Patt, for providing me with the freedom and resources to do high-quality research, for being a caring teacher, and also for teaching me the fundamentals of computing in EECS 100 as well as valuable lessons in real-life areas beyond computing.

My life as a graduate student would have been very short and unproductive, had it not been for Hyesoon Kim. Her technical insights and creativity, analytical and questioning skills, high standards for research, and continuous encouragement made the contents of this dissertation much stronger and clearer. Her presence and support made even the Central Texas climate feel refreshing.

Very special thanks to David Armstrong, who provided me with the inspiration to write and publish, both technically and otherwise, at a time when it was difficult for me to do so. To me, he is an example of fairness, tolerance, and open-mindedness in an all-too-unfair world.

Many members of the HPS Research Group have contributed to this dissertation and my life in the past six years.
I especially thank:

• José Joao and Ángeles Juarans Font, for our memorable trips and delicious barbecues, for their hearty friendship, and for proofreading this dissertation.
• Chang Joo Lee, for always being a source of fun and optimism, and for bearing with my humor and complaints.
• Moinuddin Qureshi, for many interesting technical discussions and being my roommate during the last year.
• Santhosh Srinath, for being a very good and cheerful friend.
• Francis Tseng, for his helpfulness and hard work in maintaining our group's network and simulation infrastructure.
• Danny Lynch, Aater Suleman, Mary Brown, Sangwook Peter Kim, Kameswar Subramaniam, David Thompson, Robert Chappell, and Paul Racunas for their contributions to our group's simulation infrastructure and for their friendship.

Many people in the computer industry have helped shape my career and provided invaluable feedback on my research. I especially thank Jared Stark and Eric Sprangle for their mentorship and assistance in developing research ideas. They strongly influenced my first years as a researcher and I am honored to have them as co-authors in my first scholarly paper and patent. Stephan Meier, Chris Wilkerson, Konrad Lai, Mike Butler, Nhon Quach, Mike Fertig, and Chuck Moore provided valuable comments and suggestions on my research activities and directions.

I would like to thank Derek Chiou, Nur Touba, Michael Shebanow, and Craig Chase for serving on my dissertation committee. Special thanks to Derek, who has always been accessible, encouraging, and supportive.

Throughout graduate school, I have had generous financial support from several organizations. I would like to thank the University of Texas Graduate School for the University Continuing Fellowship, and Intel Corporation for the Intel Foundation PhD Fellowship. I also thank the University Co-op for awarding me the George H. Mitchell Award for Excellence in Graduate Research.
Many thanks to Intel Corporation and Advanced Micro Devices for providing me with summer internships, which were invaluable for my research career.

Finally, I cannot express with words my indebtedness to my parents, Hikmet and Nevzat Mutlu, and my little sister Miray Mutlu, for giving me their unconditional love and support at every step I have taken in my life. Even though they will likely not understand much of it, this dissertation would be meaningless without them.

Onur Mutlu
May 2006, Austin, TX

Efficient Runahead Execution Processors

Publication No.

Onur Mutlu, Ph.D.
The University of Texas at Austin, 2006
Supervisor: Yale N. Patt

High-performance processors tolerate latency using out-of-order execution. Unfortunately, today's processors are facing memory latencies on the order of hundreds of cycles. To tolerate such long latencies, out-of-order execution requires an instruction window that is unreasonably large in terms of design complexity, hardware cost, and power consumption. Therefore, current processors spend most of their execution time stalling and waiting for long-latency cache misses to return from main memory. The problem is getting worse because memory latencies are increasing in terms of processor cycles.

The runahead execution paradigm improves the memory latency tolerance of an out-of-order execution processor by performing potentially useful execution while a long-latency cache miss is in progress. Runahead execution unblocks the instruction window blocked by a long-latency cache miss, allowing the processor to execute far ahead in the program path. As a result, other long-latency cache misses are discovered and their data is prefetched into caches long before it is needed.

This dissertation presents the runahead execution paradigm and its implementation on an out-of-order execution processor that employs state-of-the-art hardware prefetching techniques.
It is shown that runahead execution on a 128-entry instruction window achieves the performance of a processor with three times the instruction window size for a current, 500-cycle memory latency. For a near-future 1000-cycle memory latency, it is shown that runahead execution on a 128-entry window achieves the performance of a conventional processor with eight times the instruction window size, without requiring a significant increase in hardware cost and complexity.

This dissertation also examines and provides solutions to two major limitations of runahead execution: its energy inefficiency and its inability to parallelize dependent cache misses. Simple and effective techniques are proposed to increase the efficiency of runahead execution by reducing the extra instructions executed without affecting the performance improvement. An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance.

Finally, this dissertation proposes a new technique, called address-value delta (AVD) prediction, that predicts the values of pointer load instructions encountered in runahead execution in order to enable the parallelization of dependent cache misses using runahead execution. It is shown that a simple 16-entry AVD predictor improves the performance of a baseline runahead execution processor by 14.3% on a set of pointer-intensive applications, while it also reduces the executed instructions by 15.5%. An analysis of the high-level programming constructs that result in AVD-predictable load instructions is provided. Based on this analysis, hardware and software optimizations are proposed to increase the benefits of AVD prediction.

Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1. Introduction
  1.1 The Problem: Tolerating Long Main Memory Latencies
  1.2 The Solution: Efficient Runahead Execution
  1.3 Thesis Statement
  1.4 Contributions
  1.5 Dissertation Organization

Chapter 2. The Runahead Execution Paradigm
  2.1 The Basic Idea
  2.2 Out-of-Order Execution and Memory Latency Tolerance
    2.2.1 Instruction and Scheduling Windows
    2.2.2 Main-Memory Latency Tolerance of Out-of-Order Execution
    2.2.3 Why Runahead Execution: Providing Useful Work to the Processor During an L2 Cache Miss
  2.3 Operation of Runahead Execution
  2.4 Advantages, Disadvantages and Limitations of the Runahead Execution Paradigm
    2.4.1 Advantages
    2.4.2 Disadvantages
    2.4.3 Limitations
  2.5 Implementation of Runahead Execution in a State-of-the-art High Performance Out-of-order Processor
    2.5.1 Overview of the Baseline Microarchitecture
    2.5.2 Requirements for Runahead Execution
    2.5.3 Entering Runahead Mode
      2.5.3.1 When to Enter Runahead Mode
      2.5.3.2 Processor Actions for Runahead Mode Entry
      2.5.3.3 Hardware Requirements for Runahead Mode Entry
    2.5.4 Instruction Execution in Runahead Mode
      2.5.4.1 INV Bits and Instructions
      2.5.4.2 Propagation of INV Values
      2.5.4.3 Hardware Requirements to Support INV Bits and Their Propagation
      2.5.4.4 Runahead Store Operations and Runahead Cache
      2.5.4.5 Runahead Load Operations
      2.5.4.6 Hardware Requirements to Support Runahead Store and Load Operations
      2.5.4.7 Prediction and Execution of Runahead Branches
      2.5.4.8 Instruction Pseudo-Retirement in Runahead Mode
      2.5.4.9 Exceptions and Input/Output Operations in Runahead Mode
    2.5.5 Exiting Runahead Mode
      2.5.5.1 When to Exit Runahead Mode
      2.5.5.2 Processor Actions for Runahead Mode Exit
      2.5.5.3 Hardware Requirements for Runahead Mode Exit
    2.5.6 Multiprocessor Issues
      2.5.6.1 Lock Operations
      2.5.6.2 Serializing Instructions

Chapter 3. Background and Related Work
  3.1 Related Research in Caching
  3.2 Related Research in Prefetching
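
The abstract describes runahead execution as continuing past a blocking long-latency miss so that later misses are discovered and prefetched. The following is a minimal toy sketch of that idea, not the dissertation's simulator. It assumes a core that executes one instruction per cycle, independent misses only, and a "reach" equal to the window size for a conventional core and to the memory latency for a runahead core. The function name `run` and all parameters are hypothetical.

```python
def run(miss_at, n_instr, window, latency, runahead):
    """Return total cycles for a toy 1-instruction-per-cycle core.

    miss_at:  instruction indices that miss in the cache (assumed independent)
    window:   instruction window size of the conventional core
    latency:  main memory latency in cycles
    runahead: if True, keep executing ahead while a miss is outstanding
    """
    misses = set(miss_at)
    covered = set()  # misses whose data was already requested early
    cycles = 0
    for i in range(n_instr):
        if i in misses and i not in covered:
            # Blocking miss. Assumption: a conventional core can only overlap
            # misses that fit in its window; a runahead core can reach as far
            # as it executes during the miss (one instruction per cycle).
            reach = latency if runahead else window
            for j in range(i + 1, min(n_instr, i + reach)):
                if j in misses:
                    covered.add(j)
            cycles += latency
        else:
            cycles += 1
    return cycles


if __name__ == "__main__":
    miss_at = [0, 200, 400, 600]
    print(run(miss_at, 800, window=128, latency=500, runahead=False))  # 2796
    print(run(miss_at, 800, window=128, latency=500, runahead=True))   # 1798
```

In this toy setting the 128-entry window never holds two misses at once, so every miss stalls the conventional core, while the runahead core overlaps the misses at indices 200 and 400 with the first one. The numbers only illustrate the mechanism and say nothing about the reported results.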

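
The abstract states that address-value delta (AVD) prediction predicts the values of pointer loads with a simple 16-entry predictor. A minimal sketch of such a predictor follows. It assumes the delta is defined as the load's effective address minus the loaded value, that the table is indexed by the load's program counter, and that a small confidence counter gates predictions. The class name, indexing scheme, and confidence threshold are hypothetical and may differ from the design evaluated in the dissertation.

```python
class AVDPredictor:
    """Toy address-value delta predictor (illustrative assumptions only)."""

    def __init__(self, entries=16, max_conf=2):
        self.entries = entries    # 16 entries, as mentioned in the abstract
        self.max_conf = max_conf  # assumed confidence threshold
        self.table = {}           # index -> [tag, delta, confidence]

    def _index(self, pc):
        # Assumed direct-mapped indexing by load PC.
        return pc % self.entries

    def train(self, pc, addr, value):
        """Update the entry with the delta observed for a completed load."""
        delta = addr - value
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc and entry[1] == delta:
            entry[2] = min(entry[2] + 1, self.max_conf)
        else:
            # New load or changed delta: (re)allocate with zero confidence.
            self.table[self._index(pc)] = [pc, delta, 0]

    def predict(self, pc, addr):
        """Return a predicted load value, or None if not confident."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc and entry[2] >= self.max_conf:
            return addr - entry[1]
        return None


if __name__ == "__main__":
    p = AVDPredictor()
    # A pointer load whose value is always 16 bytes past its own address.
    for addr in (0x1000, 0x1010, 0x1020):
        p.train(0x40, addr, addr + 0x10)
    print(hex(p.predict(0x40, 0x1030)))  # 0x1040
```

With a confident prediction, a runahead processor could use the predicted pointer value to compute the address of a dependent load instead of waiting for the miss to return, which is the parallelization of dependent cache misses that the abstract refers to.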
