IMPROVING PROCESSOR EFFICIENCY BY EXPLOITING COMMON-CASE BEHAVIORS OF MEMORY INSTRUCTIONS

A Thesis Presented to The Academic Faculty

by

Samantika Subramaniam

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science

Georgia Institute of Technology
May 2009

Approved by:

Professor Gabriel H. Loh, Advisor
School of Computer Science
Georgia Institute of Technology

Professor Nathan Clark
School of Computer Science
Georgia Institute of Technology

Professor Milos Prvulovic
School of Computer Science
Georgia Institute of Technology

Professor Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Professor Hyesoon Kim
School of Computer Science
Georgia Institute of Technology

Aamer Jaleel, PhD
Intel Corporation

Date Approved: 10 December 2008

To my family, for their unconditional love and support

ACKNOWLEDGEMENTS

I take this opportunity to express my deepest gratitude to Professor Gabriel H. Loh, my thesis adviser, for his unwavering support, immense patience and excellent guidance. Collaborating with him has been a tremendous learning experience and has been instrumental in shaping my future research career. I will especially remember him for his endless energy and superb guidance and thank him for continually bringing out the best in me. I thank Professor Milos Prvulovic for helping me throughout this journey, by being an excellent teacher and a helpful co-author as well as for providing me very useful feedback on my research ideas, presentations, thesis and general career decisions. I thank Professor Hsien-Hsin S. Lee for his excellent insights and tips throughout our research collaboration. He has always been a wealth of information to me and has helped me become a better researcher. I thank Professor Hyesoon Kim for her invaluable suggestions during my thesis proposal and dissertation writing and for always responding promptly to my requests. I thank Professor Nathan Clark for agreeing to serve on my thesis committee and for his feedback to help improve my work. I thank Aamer Jaleel from Intel Corporation for taking the time to be the external member on my thesis committee.

I would like to thank my friends, Kiran, Guru, Ioannis D, Rahul, Jon, Yuejien, Richard, Mrinmoy, Dong Hyuk, Dean, Gopi, Ioannis S, Carina and countless other friends and colleagues that I have met at Georgia Tech and who have supported me throughout. I thank Susie McClain, Deborah Mitchell, Becky Wilson and Alan Glass for their assistance in administrative matters and for providing me guidance throughout my time at Georgia Tech. I thank the various program committees and reviewers for their constructive feedback on my research papers. I thank Intel Corporation for research equipment, awarding me a fellowship to pursue my graduate studies and for hiring me as an intern with the MTL group in Santa Clara. This experience contributed greatly to my research career. I would also like to thank Georgia Tech and the College of Computing for giving me the opportunity and the infrastructure to pursue my graduate studies at this prestigious university.

I thank my parents, Mohana and Seetharam Subramaniam for being my pillars of strength throughout this journey. I thank my sister Menaka, my brother-in-law Vikram, my niece Kavya, my grandmother Visalakshi, my grandaunt Jayalakshmi, my mother-in-law Mangala and other family members for being tremendous sources of support and encouragement. I thank my late grandfather, Ramachandran, for his support and blessings all my life. Finally I would like to thank Sankar, my husband, for his endless love and support throughout my graduate career.

TABLE OF CONTENTS

DEDICATION ...... iii

ACKNOWLEDGEMENTS ...... iv

LIST OF TABLES ...... xi

LIST OF FIGURES ...... xii

SUMMARY ...... xvi

I INTRODUCTION ...... 1 1.1 Motivation ...... 1 1.2 Efficiency of Current Processors ...... 1 1.3 Memory Instruction Processing ...... 2 1.3.1 Conservativeness in Memory Instruction Processing ...... 2 1.3.2 Scalability Concerns ...... 4 1.3.3 Multi-threaded Microarchitectures ...... 5 1.4 Memory Instruction Behaviors ...... 6 1.4.1 Predictability in Memory Dependencies ...... 6 1.4.2 Predictability in Data Forwarding ...... 6 1.4.3 Predictability in Instruction Criticality ...... 7 1.4.4 Conservative Allocation and Deallocation Policies ...... 8 1.5 Scope and Overview of the Dissertation ...... 8 1.5.1 Scope ...... 8 1.5.2 Overview of Dissertation ...... 9

II AN OVERVIEW OF MEMORY INSTRUCTION BEHAVIOR ...... 11 2.1 Speculative Memory Disambiguation and Memory Dependencies . . . . . 11 2.2 Load and Store Queues and Data Forwarding Patterns ...... 13 2.3 Instruction Criticality and Criticality-Aware Load Execution ...... 15 2.4 Allocation and Deallocation Constraints Imposed on Load and Store Queues 16 2.5 Applying Characteristics of Memory Behavioral Patterns to Improve Pro- cessor Efficiency ...... 18

vi III STORE VECTORS: DESIGNING A SCALABLE, EFFICIENT, AND HIGH- PERFORMANCE MEMORY DEPENDENCE PREDICTOR ...... 19 3.1 Introduction ...... 19 3.2 Background ...... 20 3.2.1 Memory Dependence Predictors ...... 20 3.2.2 Store Sets ...... 22 3.2.3 Scheduling Structures ...... 26 3.3 Store Vector Dependence Prediction ...... 27 3.3.1 Step-by-Step Operation ...... 28 3.3.2 Example ...... 30 3.4 Evaluation ...... 32 3.4.1 Methodology ...... 32 3.4.2 Performance Results ...... 33 3.5 Optimizations ...... 40 3.5.1 Sensitivity to Hardware Budget ...... 40 3.5.2 Reduction of Store Vector Length ...... 41 3.5.3 Most Recently Conflicting store ...... 43 3.5.4 Store Window ...... 44 3.5.5 Tagged SVT ...... 45 3.5.6 Effect of Incorporating Control Flow Information ...... 46 3.6 Why Do Store Vectors Work? ...... 47 3.7 Implementation Complexity ...... 50 3.8 Conclusions ...... 52

IV PEEP: APPLYING PREDICTABILITY OF MEMORY DEPENDENCES IN SI- MULTANEOUS MULTI-THREADED PROCESSORS ...... 54 4.1 Introduction ...... 54 4.2 Background ...... 56 4.2.1 SMT Fetch Policies and Resource Stalls ...... 56 4.2.2 Memory Dependences and SMT ...... 58 4.3 Adapting SMT Fetch Policies to Exploit Memory Dependences ...... 60 4.3.1 Proactive Exclusion ...... 61

vii 4.3.2 Leniency and Early Parole ...... 63 4.4 Evaluation ...... 66 4.4.1 Methodology ...... 66 4.4.2 Proactive Exclusion Only ...... 68 4.4.3 Proactive Exclusion with Early Parole (PEEP) ...... 69 4.4.4 Fairness Results ...... 72 4.5 PEEP on an SMT Processor without Speculative Memory Disambiguation 74 4.6 Sensitivity Analysis ...... 76 4.6.1 Sensitivity to Multithreading ...... 76 4.6.2 Scalability of PEEP ...... 78 4.6.3 PEEP Combined with Other Fetch Policies ...... 78 4.6.4 Delay Sensitivity ...... 79 4.7 Evaluating PEEP in an Industrial Simulator ...... 81 4.8 Conclusions ...... 84

V FIRE-AND-FORGET: IMPLEMENTING SCALABLE STORE TO LOAD FOR- WARDING ...... 85 5.1 Introduction ...... 85 5.2 Background ...... 86 5.2.1 Conventional Load and Store Queues ...... 86 5.2.2 Load Re-Execution ...... 87 5.2.3 Other Related Work ...... 91 5.3 Fire-and-Forget Store-to-Load Forwarding ...... 94 5.3.1 Load Queue Index Prediction ...... 94 5.3.2 Complete Store Queue Elimination ...... 97 5.3.3 Speculative Memory Cloaking ...... 99 5.3.4 Corner Cases and Loose Ends ...... 100 5.4 Evaluation ...... 100 5.4.1 Methodology ...... 101 5.4.2 Performance Results ...... 103 5.4.3 Scalability of FnF: Performance analysis over a medium configuration104 5.5 Specific Case Studies ...... 104

viii 5.5.1 A Good Example ...... 104 5.5.2 A Neutral Example ...... 105 5.5.3 An Extreme Example ...... 106 5.5.4 A Misbehaving Example ...... 107 5.6 Power, Area and Latency considerations ...... 107 5.7 Speculative Stores to Improve FnF Efficiency ...... 108 5.8 FnF ROB: Load/Store Scheduling without a Store Queue or a Load Queue 114 5.9 NoSQ ...... 117 5.10 Conclusions ...... 119

VI LCP: LOAD CRITICALITY PREDICTION FOR EFFICIENT LOAD INSTRUC- TION PROCESSING ...... 120 6.1 Introduction ...... 120 6.2 Background ...... 122 6.2.1 Instruction (and Load) Criticality ...... 122 6.2.2 Predicting Instruction Criticality ...... 123 6.3 Our Load-Criticality Predictor ...... 125 6.3.1 Basic Operational Description ...... 125 6.3.2 Example ...... 126 6.3.3 Issue-Rate Filtering of the Predictor ...... 128 6.3.4 Intuition for Predictor Effectiveness ...... 131 6.4 Simulation Methodology ...... 132 6.5 Optimization Class 1: Faking the Performance of a Second Load Port . . 133 6.5.1 Problem Description ...... 133 6.5.2 Implementation Details ...... 134 6.5.3 Performance Evaluation ...... 137 6.5.4 Why Does FSLP Work? ...... 138 6.6 Optimization Class 2: Data Forwarding and Memory Disambiguation . . 140 6.6.1 Data Forwarding ...... 140 6.6.2 Memory Disambiguation and Memory Dependence Prediction . . 141 6.7 Optimization Class 3: Data Cache ...... 142 6.7.1 Insertion Policy for Non-Critical Loads ...... 142

ix 6.7.2 Cache Bypassing for Non-Critical Loads ...... 143 6.7.3 Prefetching Policy for Non-Critical Loads ...... 144 6.7.4 Performance Evaluation ...... 144 6.8 Combining Our Criticality-Based Load Optimizations ...... 146 6.9 Conclusions ...... 148

VII DEAD: DELAYED ENTRY AND AGGRESSIVE DEALLOCATION OF CRIT- ICAL RESOURCES ...... 150 7.1 Introduction ...... 150 7.2 Background ...... 151 7.2.1 Case Study: Stalls at Allocation ...... 152 7.2.2 Case Study: Unnecessary Delays for Deallocation ...... 154 7.2.3 Related Work ...... 156 7.2.4 Why Not Build Larger LQ/SQs? ...... 158 7.3 Out-of-Order Allocation through Deferred Entry ...... 158 7.3.1 Description of the Technique ...... 159 7.3.2 Step-by-Step Operation ...... 161 7.3.3 Bonus: Extending DE to the Reservation Stations ...... 162 7.4 Out-of-Order Resource Release through Aggressive Deallocation . . . . . 163 7.5 Evaluation Methodology ...... 165 7.6 Evaluating the Application of Deferred Entry (DE) and Aggressive Deal- location (AD) in Critical Microarchitectural Resources ...... 167 7.6.1 Impact of DEAD on LQ ...... 168 7.6.2 Impact of DE on SQ ...... 170 7.6.3 Simultaneous Application of DEAD to LQ and SQ ...... 172 7.6.4 Applying DEAD to LQ, SQ and RS ...... 174 7.7 Comparison with Scalable LQ/SQs ...... 176 7.8 Conclusions ...... 178

VIII SUMMARY AND CONCLUSION ...... 180

IX APPENDIX ...... 183

REFERENCES ...... 188

LIST OF TABLES

1 Simulated processor configurations...... 33 2 The performance of the memory dependence predictors simulated on the ‘medium’ configuration for the entire set of applications. An ‘x’ signifies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect predic- tion)...... 36 3 Continuation of the data from Table 2...... 37 4 Simulated four-way SMT processor configuration...... 67 5 Multi-programmed application mixes used for the four-threaded experiments...... 67 6 Two-threaded workloads...... 77 7 Average ready-to-issue delay in cycles for critical (CL) and non- critical loads (NCL)...... 139 8 Relative power contribution of different processor blocks. For the out-of-order and memory blocks, we include the individual power for some key units. The percentage contribution of these sub-units is already included in the macro-block’s total...... 167 9 The performance of the memory dependence predictors simulated on the ‘large’ configuration for the entire set of applications. An ‘x’ signifies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect prediction)...... 184 10 Continuation of the data from Table 9...... 185 11 The performance of the memory dependence predictors simulated on the ‘extra-large’ configuration for the entire set of applications. An ‘x’ signi- fies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect prediction)...... 186 12 Continuation of the data from Table 11...... 187

LIST OF FIGURES

1 Example showing the ambiguous dependence that can arise between memory instructions...... 12 2 Dataflow graph that depicts how delaying the execution of some instructions increases the critical path...... 15 3 Store sets data structures and their interactions...... 23 4 Store Sets implementation using a (a) CAM-based fully-associative organi- zation and (b) a RAM-based direct-mapped organization...... 24 5 (a) Two-matrix scheduler, (b) single-matrix unified scheduler...... 26 6 (a) Store-address tracking of dependencies, and (b) store position or age tracking of dependencies...... 27 7 Store vectors data structures and interaction with a conventional load queue. 28 8 Example Store Vector operation for the Œ prediction phase,  scheduling phase, and Ž update phase...... 30 9 Performance of the ‘medium’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence-sensitivity...... 34 10 Performance of the ‘large’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence-sensitivity...... 34 11 Performance of the ‘extra-large’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence-sensitivity...... 34 12 S-curves illustrating the range of performance behavior of the simulated mem- ory dependence predictors across all benchmarks for the extra-large configu- ration...... 39 13 Speedup over blind prediction for the memory dependence-sensitive bench- marks of proposed memory dependence predictors across different hardware budgets...... 41 14 The relative performance impact of reducing the length of the Store Vectors (dependence-sensitive applications only). Zero represents the performance with blind speculation, and 1.0 is the performance of the baseline Store Vector predictor...... 42 15 Different Store Tracking Strategies. The second bar indicates the MRCST strategy and the third bar depicts the performance of the store window method...... 43

xii 16 Performance of tagged SVT of different sizes compared with conventional untagged SVT. The first bar represents an 2KB untagged SVT. The second and third bars represent 0.625KB tagged SVT and 0.3125KB tagged SVT respectively while the fourth bar corresponds to a 0.1526KB tagged SVT storing only the MRCST...... 45 17 (a) Ideal load synchronization against predicted store dependencies, and (b) actual serialization with Store Sets...... 48 18 (a) Ideal load synchronization for two loads against their predicted store dependencies, (b) actual serialization of merged Store Sets, and (c) the Store Vector values which avoid unnecessary serialization...... 49 19 Shared reservation station entries in an SMT processor can be completely occupied by a stalled thread when using simple fetch policies...... 57 20 Impact of speculative load disambiguation on an SMT machine. The speedups (throughput improvements) are relative to SMT machines running round- robin and ICOUNT policies with no speculative memory disambiguation. LWT is the load-wait table memory dependence predictor. Perfect predic- tion uses an oracle predictor...... 60 21 Demonstration of Proactive Exclusion. (a) When fetching a load, a predicted memory dependence causes the thread T0 (b) to be excluded from the list of fetch candidates. (c) Only when the dependence has been resolved does the thread’s fetching resume...... 62 22 Timing diagram illustrating additional delays induced by (a) excluding a thread when its resolution delay is very short, and (b) avoiding its exclusion. 64

23 Performance and fairness overview of the SMT fetch policies. PEEPfinal represents our proposed fetch policy after the previously described delay pre- diction optimization ...... 66 24 Speedup over ICOUNT of the basic SMT fetch policies as well as the Proac- tive Exclusion policy based on Memory Dependences without any delay pre- diction...... 70 25 Combining Proactive Exclusion and Early Parole with different delay predic- tors...... 73 26 Fairness as measured by the harmonic mean of weighted IPC...... 75 27 Benefit of memory dependence prediction without speculative load disam- biguation...... 76 28 Performance of the SMT fetch policies for two-threaded workloads. . . . . 77 29 Performance of the SMT fetch policies on a large, aggressive processor con- figuration ...... 79 30 Performance of PEEP combined with other fetch-gating policies ...... 80 31 Sensitivity of the policy to different predictor sizes and delay thresholds . . 80

xiii 32 Performance of the SMT fetch policies with larger front-end pipelines . . . 82 33 Per-mix performance results as obtained in an IA32 simulator ...... 83 34 Aggregate performance results per-mix type ...... 83 35 Memory scheduler designs: (a) conventional fully-associative load and store queues, (b) non-associative load queue with optional SVW filtering, (c) Store Queue Index Prediction, and (d) Fire-and-Forget store forwarding. . . . . 87 36 Store Queue Index Prediction example illustrating (a) a store updating the SPCT, (b) a misforwarded load using the SPCT to update the FSP, (c) a later instance of the same store updating the SAT, (d), a later instance of the load using the FSP and SAT to compute a SQ index, and (e) the load using the predicted index to directly access the predicted SQ entry. . . . . 92 37 Fire-and-Forget example illustrating (a) a store tracking the LSN at dispatch, (b) the store updating the SPCT at commit, (c) a misforwarded load tracking its distance for future use by the store, (d) a later instance of the store computing the index for the predicted LQ entry, and (e) forwarding a value to the LQ entry...... 93 38 An example of Fire-and-Forget without a store queue showing (a) cracking and dispatch of a store, (b) store address tracking, and (c) store data tracking and speculative forwarding...... 98 39 Relative IPC performance over a fully-associative SQ...... 102 40 Relative performance of FnF and SQIP over the SVW mechanism for a medium sized machine ...... 105 41 Working Example for Speculative Stores. (a) shows the working when simple SVW is applied and (b) shows the example when SVW is augmented with SS. 109 42 Store Re-execution in SS...... 110 43 Impact of Speculative Stores on FnF. The third bar shows the performance of the SS technique...... 112 44 Performance of SS and FnF(base algorithm) relative to baseline sizes . . . 112 45 Implementing Fire-and-Forget with the ROB. Note that the SQ and the LQ have been eliminated...... 113 46 Implementing FnF without a LQ or a SQ. The third bar shows performance when FnF is implemented using only the ROB...... 116 47 Example showing the collection of consumer counts and update to the pre- diction table...... 127 48 Distribution of consumers (for different consumer count values) according to instruction type...... 132

xiv 49 (a) Conventional timestamp-based oldest-first select logic, (b) augmenting the select logic to support criticality-prioritized select, (c) issue port arrange- ment with modification to use the store address AGU for a second load. . 135 50 Deferring non-critical loads to achieve the performance of a second load data port...... 136 51 Relative speedup across the applications when all data is brought into the LRU position (INS LRU) and when only non-critical data is brought into the LRU position (INS SLRU)...... 142 52 Speedup when critical load prediction is employed to optimize the L1 data cache...... 145 53 Speedup when critical load prediction is applied to all optimizations together: (a) per-suite averages and (b) S-curves for all benchmarks...... 147 54 Speedup comparison of applying all proposed optimizations, but with differ- ent underlying criticality predictors...... 148 55 Potential of OOO Alloc: (a)load allocation stalls due to full LQ, (b)allocation continues (c)allocation only stalls due to full ROB ...... 152 56 Percentage distribution showing number of instructions that are ready to allocate when a (a) load is stalled due to full LQ (b) store is stalled due to full SQ ...... 153 57 Distribution of stall cycles at alloc for SPECcpu2000 and SPECcpu2006 ap- plications ...... 154 58 Inefficiency of in-order deallocation:(a)load executes speculatively on cycle 40, (b)last store address resolves on cycle 42, (c)load is forced to wait until it is oldest load to deallocate on cycle 65 ...... 155 59 Percentage distribution of stalls cycles when a load is forced to wait to deal- locate ...... 156 60 (a) baseline processor and (b) changes for deferred entry...... 159 61 Step-by-step example of Deferred Entry...... 161 62 Performance impact of DEAD on LQ ...... 168 63 Performance impact of DE on SQ ...... 170 64 Performance impact of DEAD when simultaneously optimizing the LQ and SQ ...... 173 65 Performance impact of DEAD when simultaneously optimizing the LQ, SQ and RS ...... 174 66 Per-benchmark performance for final configuration...... 176 67 Performance of the Hybrid Load and Store Queues, DEAD techniques for the Load and Store Queues, and Original (Conventional) Load and Store Queues 179

SUMMARY

Thesis statement: Processor efficiency can be improved by exploiting common-case and predictable behaviors of memory instructions.

The main contribution of this dissertation is to explore the impact of memory instruction processing on processor efficiency and to improve that processing. In particular, we propose and evaluate five techniques that tackle certain inefficiencies in the memory instruction processing units of conventional processors. Our proposals leverage commonly occurring behavioral patterns of memory instructions, which allow for relatively simple and scalable designs that improve processor efficiency.

One of the main aspects of this thesis is the study of memory dependence prediction. We explore the fundamental concepts in memory dependence prediction and propose, to the best of our knowledge, one of the most high-performing and power-efficient memory dependence predictors. Using the fundamental principles and benefits of memory dependence prediction, we then propose an application of it in simultaneous multi-threaded processors that reduces the resource contention in, and improves the performance of these processors.

In our third proposal, we focus on a useful corollary of the predictability in memory dependences, which is the predictability in memory data forwarding, to simplify, reduce and even eliminate some of the power-hungry and area-inefficient resources required by memory instructions in conventional processors. The designs proposed have low implementation complexity and high scalability.

We then focus our attention on another memory instruction behavioral pattern, that is, predictability in instruction criticality, and leverage this behavior to improve the performance of the processor as well as reduce its power consumption and area requirements. Finally, we explore out-of-order allocation and deallocation of certain microarchitectural resources used by memory instructions in the processor, which allows us to reduce the energy consumption and the area overhead of the processor.

Memory instructions constitute a significant portion of the instructions in traditional programs, and as such their processing is very important to the performance, power, area, design complexity and scalability of modern processors. The research presented in this dissertation improves the processing of these instructions so that the above-stated parameters can be optimized and improved, which can lead to the design of highly efficient processors.

CHAPTER I

INTRODUCTION

1.1 Motivation

For some years now, processor designers have known that an increase in the number of available transistors due to technology scaling does not automatically translate into an increase in performance. Similarly, increasing the size or number of critical resources in the processor does not always translate into an equivalent increase in performance. Unfortunately, the power consumption of processors increases exponentially with an increase in device count or structure size. These observations have led both designers and researchers to realize that it is not sufficient to design processors simply with a view to increasing performance, reducing energy consumption, increasing scalability and so on. Instead, a more balanced or "holistic" approach to designing processors, which delves into and improves the core algorithms that utilize the processor structures, is necessary.

Keeping these observations in mind, we believe that improving processor efficiency serves as an excellent goal in processor design. To improve efficiency, one needs to study the processor's critical structures and their associated control logic in order to decide how they can be improved. This process can be further streamlined by focusing on the frequently accessed and heavily contended structures as well as the common instructions that access these structures. Memory instructions constitute nearly 40% of all instructions and are generally very important to performance; hence, in our research we explore improving processor efficiency by focusing on memory instructions.

1.2 Efficiency of Current Processors

The efficiency of a system describes how well the system produces certain desired effects for a given set of inputs. For a processor, there are a number of desirable effects or metrics that can be used to describe its efficiency: performance, power, area, design complexity, scalability, and so on. Increasing processor efficiency may involve improving any one of these metrics or simultaneously optimizing more than one of them. For example, increasing the performance achieved or reducing the power consumed can help increase efficiency. On the other hand, reducing the power consumed while also reducing design complexity and minimizing performance degradation may be a desirable goal that improves the efficiency of a processor. Finally, some innovative approaches may focus on one or two metrics and obtain an improvement in another metric at negligible cost. For example, reducing power consumption and design complexity may reduce access latency, which could also improve the scalability of the processor. Thus, a number of approaches can be used to improve the efficiency of current processors. Any approach used, however, generally requires innovative microarchitectural proposals and designs that can optimize instruction processing and the associated control logic in the processor.

In this dissertation, we focus on optimizing the processing of memory instructions to improve one or more of the stated efficiency metrics, which in turn improves the overall efficiency of the processor.

1.3 Memory Instruction Processing

Memory instructions are of two kinds: load instructions, which retrieve data from memory into the processor, and store instructions, which write data from the processor's registers to memory. In this section we briefly describe the potential impact of memory instruction processing on processor efficiency. In particular, we explain certain inefficiencies in memory instruction processing that we address in this dissertation.

1.3.1 Conservativeness in Memory Instruction Processing

The first aspect of conservative memory instruction processing that we address in this dissertation is speculative memory execution.

Instruction level parallelism and speculative instruction execution are powerful tools to improve the performance of single-threaded microarchitectures. While branch prediction is a widely adopted and universally accepted technique to speculatively schedule branches as well as the instructions following an unresolved branch, the use of instruction-level speculation while processing memory instructions has only recently started appearing in mainstream processor microarchitectures such as the Intel Core 2 Duo [19]. Store instructions write data to memory; hence, allowing a store instruction to speculatively update data is not possible without special and non-trivial modifications to undo updates to the data cache and memory. Since store instructions are not usually on the critical path of execution, however, conservative store processing does not significantly degrade performance. A load instruction, on the other hand, retrieves data from the cache that is generally used by subsequent instructions in the program. Being able to retrieve a load's data early can impact the scheduling of several instructions following the load and can thus provide a significant performance improvement. Unfortunately, executing a load before it is known to have resolved all of its dependencies can be dangerous, since the cost of flushing and re-executing a wrongly-executed load and all of its subsequent instructions can be very high in both performance and power. Additionally, the wrongly-executed instructions could have wasted critical resources that might have been used by other non-speculative instructions. Past proposals that have explored aggressive instruction-level speculation for load instructions have either been limited in their performance benefit or have proposed sophisticated designs that are very complex and inefficient with respect to power and access latency [35, 17]. For these reasons, most current processors use a conservative approach and only schedule loads once they have resolved all of their dependencies. This can limit the performance of the processor and ultimately impact its efficiency.

The second aspect of conservative memory instruction processing that we address in this dissertation is instruction criticality, with specific emphasis on load criticality. Most load processing mechanisms treat all loads equally, irrespective of how the load impacts subsequent instructions or overall processor performance. This may seem like a simple and reasonable strategy to handle loads, but consider a scenario where a critical shared resource is simultaneously contended for by many load instructions. Age-based priority can resolve the contention, but the solution may be sub-optimal, since data fetched by the oldest load (or older loads) may not be needed by any subsequent instruction; i.e., the load's age does not necessarily imply its criticality. Furthermore, providing no criteria to classify loads means that any load optimization, like the speculative load execution described above or data forwarding, needs to be applied to all loads equally. Not only is this wasteful in terms of processor hardware but also power, since many load optimization implementations require fairly complex and power-hungry designs. This extra burden on the processor resources does not always translate into extra performance, since, as explained earlier, not all loads matter to processor performance.

Prioritizing loads based on their importance or criticality can streamline execution greatly and provide various performance and power benefits that can help improve processor efficiency. Unfortunately, previous proposals to determine load criticality have been fairly complex to implement and limited in their application viability. Hence, load prioritization has so far been an academic concept that has never really been implemented in hardware.

1.3.2 Scalability Concerns

Load and store instructions are processed in microarchitectural queues known as the load queue and store queue, respectively. These buffers have a considerable amount of control logic associated with them that ensures the correct execution of memory instructions. Unfortunately, this complex control circuitry increases the power consumed and area occupied by the queues. These power-hungry structures are also at risk of becoming thermal hot-spots since they are accessed very frequently in the processor. Finally, since these structures are on the critical path of the processor, their size impacts their access latencies, which can slow down the entire program execution. As a result, it becomes difficult to increase the sizes of these structures beyond a limit without running into power, temperature and latency problems. With trends in the processor industry indicating a shift from high-frequency designs to microarchitectures that aggressively target ILP and require large buffers, as is evidenced by Intel's "Nehalem", scalability of the load queue and the store queue is likely to become a vital problem [33]. Even for current microarchitectures, the load and store queues are structures that cause considerable power concerns. Unless this scalability challenge is resolved, it could become difficult to improve or even maintain the efficiency of the processor.

As explained above, the LQ/SQ scaling problem is primarily due to the large buffer sizes that are required to expose ILP and improve performance. One of the reasons for needing large queues, however, is the inefficient processing of memory instructions in these queues. A classic example of this inefficiency is that a load instruction enters the LQ as soon as it is allocated and leaves only when it is committed; but the load instruction uses the LQ resources and control circuitry only when it executes and fetches its data. So, for a large portion of the load's lifetime, it is sitting in the large and power-hungry LQ not even utilizing it. Stores face a similar situation. Techniques that can reduce the amount of time that loads and stores spend in the LQ and SQ, respectively, without impacting performance, are useful to explore in this context. These techniques can reduce the hardware and power required by the LQ/SQ and utilize this in some other part of the processor, thereby improving overall processor efficiency. There have been a few proposals that are geared toward building smaller load and store queues [75], but they have fairly high design complexity and hence the scalability of these queues still remains a challenge.

1.3.3 Multi-threaded Microarchitectures

The problems described above have been explained in the context of single-threaded machines. With multi-threaded machines, the problems can be compounded. These machines generally have larger resources and more complex circuitry than single-threaded out-of-order machines, and so their power budgets are already constrained. Additionally, since many of the resources need to be fairly shared and managed across all threads, the performance per thread can sometimes be degraded. In such situations, a conservative approach to scheduling memory instructions can be quite undesirable. There is another challenge associated with multi-threaded architectures. While there are considerable resources available to the processor overall, the manner in which they are partitioned between the threads is not always optimal. Fairness is often a dominant parameter in deciding the allocation of resources, which does not always translate into efficient resource utilization. This could limit the throughput provided by the processor, which in turn could adversely impact the efficiency of the processor.

1.4 Memory Instruction Behaviors

The previous section described some of the inefficiencies in the processing of memory instructions that could limit the potential of current and future microarchitectures. While these challenges do pose serious design issues, there is scope and potential for improvement. In this section, we will briefly walk through some specific behavioral properties of memory instructions and their interactions and provide interesting insights into how their processing can be improved. A more detailed study of the following properties is presented in the next chapter.

1.4.1 Predictability in Memory Dependencies

Memory disambiguation is a technique employed by high-performance, out-of-order processors that execute loads and stores out of program order. Speculative memory disambiguation allows the processor to disambiguate memory instructions even when there is the possibility of a dependence between the instructions. A memory dependence between a store and a following load occurs if the store instruction writes a data value to the same address that the load instruction reads from. Since address calculation is performed at execute, the address of an older store instruction may be unresolved at the time that a later load instruction is waiting to be scheduled. Conservative techniques employed in many processors do not use speculative memory disambiguation since the recovery for a misspeculated load involves a pipeline flush, which increases the power consumption and wastes useful processor resources, thereby hurting processor performance. Analysis of memory address patterns shows, however, that the instances of true dependencies follow a very predictable pattern. Thus memory dependence prediction can be employed to speculatively execute a load instruction and its dependent instructions while minimizing the need for recovery.

1.4.2 Predictability in Data Forwarding

Most of the complex and power-hungry control logic associated with the load and store queues is used to support the memory disambiguation and data forwarding functions necessary for memory instruction processing. Data forwarding is a mechanism by which load instructions retrieve values that have not yet been written to the cache. Store instructions that have executed and have a value ready need to forward this value to a younger load instruction reading from the same address, so that the load does not retrieve stale data from the memory. The alternative to data forwarding is to stall a load access until the matching older store does write to memory; this obviously increases the execution time, which is very undesirable, and thus data forwarding is employed in nearly all modern out-of-order processors. Determining when data need to be forwarded requires associative searches of the store queue. This leads to the increased power consumption of the load and store queue, which, as explained earlier, poses a scalability challenge. Analysis of data forwarding patterns shows, however, that the data forwarding pairs (store-load) are very predictable. This insight can be used to facilitate data forwarding prediction and completely eliminate the associative searches, simplifying the hardware considerably while not hurting the performance.

1.4.3 Predictability in Instruction Criticality

The earlier observations dealt mainly with how memory instructions interact with one another. Even when considered independently, however, memory instructions have certain characteristic behaviors that can be used to improve their processing. Certain load instructions have a more pronounced effect on the performance of the processor than others. Being able to accurately identify these critical loads can open up various avenues of optimizations that are applied based on the priority of the load.

The main concern with load criticality is defining the set of heuristics or criteria that are used in classifying critical loads. Past proposals have been varied and diverse in their approach to defining and classifying critical load instructions. In our research we choose a heuristic that is an accurate indicator of load criticality and is also very predictable. This allows us to design and implement a simple load criticality prediction algorithm. The second important concern with load criticality is selecting optimizations that can benefit from it. Since loads have the potential to significantly impact many instructions following them, optimizations that prioritize some loads at the cost of others need to be carefully explored and evaluated. Mispredictions in load criticality can have severe effects since they are hard to track and correct, unlike branch or memory dependence mispredictions. The next chapter describes in detail the properties of critical loads as well as the benefits of being able to accurately predict them. It also briefly outlines how we apply load criticality information to help improve processor efficiency.

1.4.4 Conservative Allocation and Deallocation Policies

Another interesting observation regarding memory instructions (this can apply to other instructions, too) is that the utility of an instruction, i.e., the number of cycles during which it is required to perform useful work in the processor (for example, forwarding a value or helping to maintain in-order semantics), may be far less than the actual time that the instruction spends in the critical structures of the out-of-order core of the processor. As a result, memory instructions often sit idle occupying critical processor resources, which requires larger buffers to be built to avoid stalls. This contributes to wastage of critical resources as well as power, since larger structures have longer wires and consume more power. Being able to determine the utility of an instruction can help streamline the utilization of processor resources and reduce the inefficiency in the design of the processor.

Specifically, the allocation and deallocation of LQ and SQ resources for memory instructions is done fairly conservatively in the pipeline in order to ensure correctness of the system. This approach leads to inefficient LQ and SQ designs, which also contribute to the scalability challenge explained earlier. Fortunately, there are techniques that can allow aggressive allocation and deallocation of resources of memory instructions, while still maintaining correct in-order semantics. Exploiting these techniques can help reduce the amount of time loads and stores spend in the critical queues, thereby allowing us to build smaller queues while maintaining processor performance.

1.5 Scope and Overview of the Dissertation

1.5.1 Scope

The preceding sections have helped to establish two facts. First, inefficiencies exist in current processors with respect to achieved performance and required power and area budgets. The manner in which memory instructions are processed contributes significantly to these inefficiencies. Second, there are certain properties of memory behavior that can intuitively be used to make the designs of the structures that support memory instruction processing more efficient. In this dissertation, we present designs that exploit these memory behavior patterns to improve the processing of memory instructions. Our techniques improve one or more of the stated efficiency metrics, which can improve the overall efficiency of the processor. We employ a two-pronged approach to achieve our goal.

1. We propose and evaluate optimizations that use predictable or common-case memory behavior to reduce the latency of memory instructions and improve overall resource utilization. These techniques are able to improve the performance of the processor.

2. We propose and evaluate optimizations that simplify, reduce and even eliminate specific hardware required for the processing of memory instructions. We get this hardware and consequent power reduction with negligible loss in performance.

Processor efficiency can also be improved by changing other design parameters such as the number of cores on a chip. A multi-core design can be used to significantly improve processor throughput as is evidenced by Intel’s Core2 Duo [19]. For the design methodology used in our research work, we focus on single-core designs and propose optimizations that improve processor efficiency in a single-core environment. The number of cores, however, is orthogonal to the proposed designs that exploit memory instruction behaviors. Hence, our designs can be extended to improve efficiency even in a multi-core scenario. In Chapter 8, we briefly discuss how our techniques can be applied to multi-core processors.

1.5.2 Overview of Dissertation

In this dissertation, we first give a detailed overview of the common-case behaviors that exist in memory instructions in Chapter 2. We focus on the following memory behavior patterns in this thesis: predictability of memory dependences, predictability of load instruction criticality, and the conservativeness in allocation and deallocation of critical processor resources. We start with designing an efficient, scalable, and high-performance memory dependence predictor. Chapter 3 provides a detailed description and evaluation of our proposed memory dependence prediction mechanism. Our second optimization applies accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor, which in turn improves the overall throughput. This optimization is described in Chapter 4. We then use predictable memory dependence patterns to help identify predictable data forwarding patterns and use these to eliminate critical, power-hungry hardware in the processor with no loss in performance. Chapter 5 describes and evaluates this optimization. In Chapter 6, we describe load criticality and propose the design of an efficient criticality predictor. We also explore several applications that can exploit this criticality information to improve the overall efficiency of the processor. In Chapter 7, we present our design for aggressive allocation and deallocation for the LQ and the SQ and show how our techniques can be used to build much smaller queues with no loss in performance. Finally, in Chapter 8 we present our conclusions and discuss some extensions to our thesis that can help to further improve processor efficiency.

CHAPTER II

AN OVERVIEW OF MEMORY INSTRUCTION BEHAVIOR

2.1 Speculative Memory Disambiguation and Memory Dependencies

Out-of-order load execution can significantly improve performance by exposing more instruction level parallelism (ILP). Unfortunately, out-of-order load scheduling and execution is a non-trivial problem. Load instructions may have data dependencies through memory, where an earlier store instruction writes to an address and the load instruction reads from the same memory location. The effective addresses of all load and store instructions may not be available because the corresponding address computations may not have executed. This leads to a dependence ambiguity: depending on the result of a store's address computation, a later load instruction may or may not actually be data dependent. If the load is actually independent, then it should be scheduled for execution to maximize performance. If the load is data-dependent, then it must wait for the store instruction.

Consider the code example shown in Figure 1. Here, the store instruction writes to the memory location specified by the value in register R1, and the load instruction reads the memory location specified by the value in R2. A dependence checking scheme based purely on register dependencies would consider these instructions to be independent. Since both of the instructions access memory, however, the processor cannot statically determine, prior to execution, whether the memory locations specified by the two instructions are different or are in fact the same. The apparent confusion or ambiguity arises because the locations that the memory instructions read/write depend on the values contained in registers R1 and R2. If the locations are different, the instructions are independent and can be successfully executed out of order. However, if the locations are the same (the address stored in R1 is equal to the address stored in R2), then the load instruction is dependent on the store to produce its value. Memory disambiguation is the technique used to resolve the dependencies between memory instructions in a processor.

Figure 1: Example showing the ambiguous dependence that can arise between memory instructions.
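
To make the ambiguity concrete, the following sketch (an illustrative assumption, not code from the thesis) models the store through R1 and the load through R2 with a tiny dictionary acting as memory; whether the load depends on the store is decided only by the run-time address values.

```python
# Hypothetical sketch of the Figure 1 scenario: a store through R1 followed by
# a load through R2. The register names and the dictionary "memory" model are
# assumptions made purely for illustration.
memory = {}

def execute(r1, r2, store_value):
    memory[r1] = store_value      # ST [R1] <- store_value
    loaded = memory.get(r2, 0)    # LD [R2] -> loaded
    return loaded

# Independent case: different addresses, so issuing the load before the store
# would still return the correct (unmodified) value at address 0x200.
print(execute(r1=0x100, r2=0x200, store_value=42))   # -> 0 (no aliasing)

# Dependent case: R1 == R2, so the load must observe the store's value; issuing
# the load early would return stale data and force recovery (a pipeline flush).
print(execute(r1=0x100, r2=0x100, store_value=42))   # -> 42 (true dependence)
```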

Like many other program properties, memory dependencies exhibit a form of temporal locality. Memory dependence prediction is a technique for speculatively disambiguating the relationships between loads and stores [50]. A load instruction that issues too early due to a dependence misspeculation, however, may retrieve a stale value from the data cache and propagate this incorrect value to its dependent instructions. Many cycles may pass between when the load instruction issues and when the processor finally detects the ordering violation. At this point, a large number of instructions from the load's forward slice [89] may have already executed. Tracking down and rescheduling all of these instructions is a difficult task, and so a memory dependence misspeculation is typically handled by flushing the pipeline. Overly aggressive load scheduling combined with the high cost of pipeline flushes can therefore result in a net performance decrease, a power increase, and overall reduced efficiency.

To avoid costly pipeline flushes and improve processor performance, load and store execution history can be incorporated in memory dependence prediction. Since memory dependences follow a predictable pattern, this approach improves the accuracy of the predictor. Most memory dependence predictors use history information to predict the dependence status of a load: whether a dependence currently exists or not, which determines if the load can issue. The Alpha 21264 employed such a dependence predictor called the Load Wait Table that tracks all loads that experience ordering violations [35]. When a store executes and exposes a load ordering violation, the load's PC indexes into the Load Wait Table and sets a bit. On subsequent instances of the load, the Load Wait Table will indicate that the load previously caused a memory ordering violation, and the scheduler will force the load to wait until all earlier store addresses have been resolved. The processor periodically clears the table to avoid permanently preventing loads from speculatively issuing and to allow the predictor to adapt to phasic application behavior in different parts of the program. The Intel Core 2 has a similar mechanism for tracking misspeculated loads that should be forced to wait for previous stores [19]. The memory dependence predictor used is comprised of a table of saturating counters, which provides more accuracy than the 1-bit entry table used in the Alpha 21264.
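
To make the mechanism concrete, here is a hedged behavioral sketch of a Load-Wait-Table-style predictor; the table size, PC hashing, and clearing interval are illustrative assumptions rather than the actual Alpha 21264 parameters.

```python
# Behavioral sketch of a Load Wait Table (LWT) style memory dependence
# predictor. Sizes and the clearing interval are illustrative assumptions.
class LoadWaitTable:
    def __init__(self, entries=1024, clear_interval=100_000):
        self.entries = entries
        self.clear_interval = clear_interval
        self.table = [0] * entries          # one bit per entry
        self.cycles = 0

    def _index(self, load_pc):
        return load_pc % self.entries       # simple PC hash

    def should_wait(self, load_pc):
        """Consulted at schedule time: True forces the load to wait until all
        earlier store addresses have resolved."""
        return self.table[self._index(load_pc)] == 1

    def record_violation(self, load_pc):
        """Called when a store exposes an ordering violation for this load."""
        self.table[self._index(load_pc)] = 1

    def tick(self):
        """Periodically clear the table so loads are not blocked forever and
        the predictor can adapt to program phase changes."""
        self.cycles += 1
        if self.cycles % self.clear_interval == 0:
            self.table = [0] * self.entries
```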

Memory dependence prediction is useful for several reasons. As described above, it can help issue load instructions early, which can potentially schedule instructions dependent on the load early as well. This can reduce the overall execution time, thus improving efficiency. In our research work, we design an advanced and novel memory dependence predictor that improves processor performance while providing a simple and efficient hardware implementation. Our proposed design is called Store Vectors and is described in Chapter 3.

Memory dependence prediction can also be used to manage the resources of the processor more efficiently. Consider a load that is known to have a memory dependence with a store that has not issued, and will not be ready to issue for 100 cycles. The load has to wait for this store to issue and get its data but the load does not need to hold up processor resources while waiting for this store. Since the load is not going to issue until the store does, the load’s dependent instructions cannot issue either and do not need any processor resources as well. These resources can instead be assigned to other independent instructions during the time that the load is waiting. This approach can not only improve the overall throughput of the processor but can also reduce the required hardware resources. In our research work, we apply memory dependence prediction to improve resource utilization in the fetch engine of an SMT processor which improves the processor performance and overall efficiency. Our proposed design is called PEEP and is described in Chapter 4.
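
As a rough illustration of this idea (a simplified sketch under assumed interfaces, not the PEEP design evaluated in Chapter 4), an SMT fetch stage could consult the memory dependence predictor and temporarily skip a thread whose next load is predicted to wait a long time on an unresolved store:

```python
# Hypothetical sketch: gate SMT fetch using memory dependence prediction.
# The predictor interface, delay estimate, and threshold are assumptions made
# for illustration only.
def pick_fetch_thread(threads, dep_predictor, delay_threshold=20):
    candidates = []
    for t in threads:
        pred = dep_predictor.lookup(t.next_load_pc)
        if pred is not None and pred.expected_store_delay > delay_threshold:
            continue  # exclude: this thread would only occupy shared resources
        candidates.append(t)
    if not candidates:              # never starve fetch entirely
        candidates = list(threads)
    # ICOUNT-style tie-break: prefer the thread with the fewest in-flight instructions
    return min(candidates, key=lambda t: t.inflight_count)
```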

2.2 Load and Store Queues and Data Forwarding Patterns

Memory operations occur very frequently in most applications (one out of every three or four instructions), and therefore a high-ILP processor must have large structures to buffer all of these loads and stores. The load and store queues buffer these instructions, and they also perform the additional tasks of memory disambiguation and store-to-load value forwarding, which is also known as data forwarding. Standard implementations of these queues require a large amount of fully-associative search logic (i.e. content addressable memories or CAMs) that end up limiting the queue sizes for a given clock frequency or power budget. Furthermore, the CAMs consume a large amount of power which affects the thermal design power (TDP) of the overall processor. More detail on the exact operations of the queues and their implementations will be provided later in this dissertation.

There is an ironic aspect in the difficulty of scaling the load and store queues in that, for a large fraction of memory instructions, the costly CAM logic will not actually result in hits/matches. That is, true instances of memory dependencies or actual occurrences of data forwarding are fairly infrequent. Even for very large instruction windows, many stores end up writing to the data cache before any loads from the same address are ready to receive the data, implying that the data-forwarding mechanism of the store queue is underutilized.

As a result, many load instructions search the store queue for an earlier write to the same address, but do not find a match. These loads end up retrieving their data values from the data cache, and do not effectively make use of the store queue’s CAMs. The processor includes all of this associative logic to make sure that loads and stores execute correctly, but it slows down the common case of no forwarding.

A useful corollary to the predictability in memory dependencies is that data forwarding patterns are also very stable. That is, a static load that receives a value from a static store usually receives a value from the same store during all of its dynamic instances. This stable relationship was also quantitatively demonstrated by Tyson et al [86]. This observation coupled with the infrequency in data forwarding occurrences means that a prediction mechanism, similar to memory dependence prediction, can be designed to facilitate data forwarding prediction. Much of the hardware from the memory dependence predictor can be reused along with a mechanism that verifies the prediction. Such an approach allows us to reduce the inherent complexity in the load and store queues while maintaining all their required functionality. In our research work, we design an efficient data forwarding prediction mechanism that removes all of the CAM-logic from the load queue and completely eliminates the store queue while maintaining all required functionality and incurring no loss in performance. Our proposed design is called Fire-and-Forget and is described in Chapter 5.

Figure 2: Dataflow graph that depicts how delaying the execution of some instructions increases the critical path.
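
Returning to the forwarding-prediction idea described above, the sketch below shows, in generic form, how such stability could be exploited: remember, per static load, which store last forwarded to it and predict the same pairing next time. The class and method names are assumptions for illustration; this is not the Fire-and-Forget design of Chapter 5.

```python
# Generic sketch of store-to-load forwarding-pair prediction: because a static
# load usually receives its value from the same static store, a small table
# keyed by load PC can predict the forwarding partner without an associative
# store queue search. All names and the verification hook are assumptions.
class ForwardingPairPredictor:
    def __init__(self):
        self.partner = {}            # load PC -> store PC last seen forwarding

    def train(self, load_pc, store_pc):
        """Record an observed forwarding (called once a forwarding is verified)."""
        self.partner[load_pc] = store_pc

    def predict(self, load_pc):
        """Return the predicted forwarding store PC, or None (expect the cache)."""
        return self.partner.get(load_pc)

# Usage: if predict() names a store, the load speculatively takes that store's
# value; a later, cheaper verification step (for example, an address check near
# commit) confirms the guess and calls train() on the observed pair.
```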

2.3 Instruction Criticality and Criticality-Aware Load Execution

The sensitivity of overall processor performance to the execution latencies of different in- structions may vary for many reasons. In a modern out-of-order processor, the hardware dynamically reschedules instructions and can execute multiple instructions per cycle. For some instructions, any increase in the instruction’s execution latency, even by a single cycle, will result in a corresponding increase in overall program execution time. Such instructions are said to be on the program’s critical path of execution (or the instruction is simply called critical). For other instructions, one may be able to increase the latency of execution (or defer the start of execution) by one or more cycles without impacting performance.

Criticality can be described with a program’s data-dependency graph; there exists one

or more paths through the graph with a maximum length and such paths are called critical

paths. Any instruction that appears in this path (or paths) is critical. Consider the dataflow

graph shown in Figure 2(a). If each instruction takes a single cycle to execute, then the

longest path through this graph (the critical path) has a length of four, starting from I1 and

ending at I4. Therefore, all four instructions I1 to I4 are critical instructions. For example,

Figure 2(b) shows that when I2’s execution is delayed by a cycle, then length of the longest

path increases. In contrast, instruction I0 is not critical because we can delay its execution (or increase its execution latency) and the critical path length remains unchanged as shown

in Figure 2(c). Note that at some point, if we delay I0 by enough cycles (in this case more than two cycles), it can form a new longest path and therefore become critical. Non-critical

instructions exhibit some amount of slack, which is the amount of delay tolerable before

becoming critical [22]. Critical loads are load instructions that lie on the critical path of

execution. Since load instructions constitute a large chunk of the program code and they

often have long latencies due to cache misses, they could have a significant impact on the

performance of the processor.
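To make the notions of critical path and slack concrete, the following small Python sketch (our own illustration, not part of any simulation infrastructure; the node names and unit latencies are assumptions chosen to match the example of Figure 2) computes the length of the longest path through a dataflow graph and the slack of each instruction.

    # Critical-path and slack computation over a small dataflow graph (illustrative only).
    from collections import defaultdict, deque

    def criticality(latency, edges):
        """latency: dict node -> cycles; edges: list of (producer, consumer) pairs."""
        succs, preds, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
        for u, v in edges:
            succs[u].append(v)
            preds[v].append(u)
            indeg[v] += 1
        # Topological order (Kahn's algorithm).
        order, work = [], deque(n for n in latency if indeg[n] == 0)
        while work:
            n = work.popleft()
            order.append(n)
            for s in succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    work.append(s)
        # Forward pass: earliest finish time of every instruction.
        eft = {}
        for n in order:
            eft[n] = latency[n] + max((eft[p] for p in preds[n]), default=0)
        critical_length = max(eft.values())
        # Backward pass: latest finish time that does not stretch the critical path.
        lft = {}
        for n in reversed(order):
            lft[n] = min((lft[s] - latency[s] for s in succs[n]), default=critical_length)
        slack = {n: lft[n] - eft[n] for n in order}
        return critical_length, slack

    # A graph in the spirit of Figure 2(a): I1->I2->I3->I4 is the critical path, I0 feeds I4.
    lat = {f"I{i}": 1 for i in range(5)}
    length, slack = criticality(lat, [("I1", "I2"), ("I2", "I3"), ("I3", "I4"), ("I0", "I4")])
    assert length == 4 and slack["I0"] == 2 and slack["I4"] == 0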

Being able to identify critical loads can be useful since the processing of these loads

can be prioritized. Additionally, load optimizations can be streamlined to focus on critical

loads since these loads are known to have the most significant impact on performance. Since

load instructions frequently access various important structures in the processor, optimizing

their processing can help to improve the efficiency of the processor. In our research work,

we design a load criticality predictor that can identify critical loads. We also explore and

evaluate some relevant applications where load criticality information can help improve the

performance as well as reduce the power consumed and area required by the processor. Our

proposed load criticality predictor or LCP and its associated applications are described in

Chapter 6.

2.4 Allocation and Deallocation Constraints Imposed on Load and Store Queues

As explained in the previous chapter, the number of cycles that an instruction spends in the out-of-order core may be much greater than the number of cycles it actually requires.

This disparity or inefficiency in the usage of resources for memory instructions is primarily due to rigorous ordering constraints imposed on these instructions during the allocation and deallocation phases. In a conventional processor, allocation is an “all or nothing” process. An instruction that successfully allocates all of its necessary resources (reorder

buffer (ROB) entry, reservation station (RS) entry, and load queue (LQ) or store queue

(SQ) entry if the instruction is a load or store, respectively) is permitted to proceed onward

into the out-of-order execution engine. If even a single resource cannot be obtained, then

allocation for this and all subsequent instructions stalls. This makes sense in many scenarios

because, for example, if an instruction fails to allocate a ROB entry (because the ROB is

full), then there is no point in trying to allocate any following instructions because it can

be guaranteed that they will not be able to allocate a ROB entry either (as the ROB is still full). There are some allocation scenarios, however, where such a strict constraint can be relaxed. Not all instructions need LQ and SQ entries; just because one instruction cannot complete allocation due to, say, an unavailable LQ entry does not mean that the next instruction (unless it is also a load) would be unable to acquire all of its resources.
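As a purely illustrative rendering of this policy (the buffer and instruction objects below are invented for the example and are not simulator code), in-order, all-or-nothing allocation can be sketched as:

    # In-order "all or nothing" allocation: the first instruction that cannot obtain
    # every resource it needs stalls itself and all younger instructions, even when
    # the missing resource (e.g., an LQ entry) is irrelevant to those younger ops.
    class Buffer:
        def __init__(self, size):
            self.size, self.entries = size, []
        def full(self):
            return len(self.entries) >= self.size
        def allocate(self, insn):
            self.entries.append(insn)

    def try_allocate(insn, rob, rs, lq, sq):
        needed = [rob, rs]
        if insn["kind"] == "load":
            needed.append(lq)
        elif insn["kind"] == "store":
            needed.append(sq)
        if any(buf.full() for buf in needed):
            return False              # cannot obtain everything, so allocate nothing
        for buf in needed:
            buf.allocate(insn)
        return True

    def allocate_stage(in_order_window, rob, rs, lq, sq):
        # Stops at the first stalled instruction: with a full LQ, even a younger
        # ALU op or store waits, which is the constraint relaxed in Chapter 7.
        while in_order_window and try_allocate(in_order_window[0], rob, rs, lq, sq):
            in_order_window.pop(0)

    # Example: a full LQ blocks the load at the head and, transitively, the store behind it.
    rob, rs, lq, sq = Buffer(8), Buffer(8), Buffer(0), Buffer(8)
    window = [{"kind": "load"}, {"kind": "store"}]
    allocate_stage(window, rob, rs, lq, sq)
    assert len(window) == 2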

At the other end of the out-of-order core, the commit stage has several responsibilities.

The first is that the commit stage must ensure that any externally visible architected (ISA)

state gets updated in an order consistent with an in-order, sequential execution of the same

code. This typically involves moving an instruction’s result from its temporary location

in the physical register file (PRF) or the ROB to its externally visible location in the

architecture register file (ARF). For stores, this involves writing their results into the cache

hierarchy at which point they are externally visible. The second responsibility of the commit

stage is the deallocation of resources. Except for the RS (which is deallocated soon after

an instruction issues), the commit stage will reclaim an instruction’s ROB entry, as well

as its LQ or SQ entry if the instruction is a load or store, respectively. Deallocation is

bound to commit, since, for example, one generally cannot deallocate a ROB entry before

the contents of that entry have been written back to the ARF, which in turn only occurs when the instruction commits. There are some deallocation scenarios, however, that do not

have to happen in strict commit order. If an instruction has executed and forwarded its

data to all dependent instructions and there is some way to guarantee that this instruction

will complete correctly, this instruction no longer needs the resources assigned to it and can

make them available for other instructions.
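In the same spirit, the contrast between commit-bound deallocation and the relaxed form can be sketched as follows; this is an illustration of the idea only, not the DEAD mechanism detailed in Chapter 7, and the predicate names are placeholders.

    # Commit-order versus relaxed (out-of-order) reclamation of queue entries.
    def deallocate_at_commit(queue, has_committed):
        # Conventional: only the oldest entry is freed, and only once it has committed.
        while queue and has_committed(queue[0]):
            queue.pop(0)

    def deallocate_relaxed(queue, is_provably_done):
        # Relaxed: any entry that has executed, forwarded its value to all dependents,
        # and is guaranteed to complete correctly may release its slot early.
        queue[:] = [entry for entry in queue if not is_provably_done(entry)]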

The scenarios described above, which present opportunities for aggressive allocation and

deallocation of LQ and SQ resources, occur often in modern processors and will be described

in more detail later on in this dissertation. In our research work, we propose relaxing the

constraints on in-order allocation and deallocation of memory instructions to reduce the

microarchitectural resources required by the processor while maintaining correctness as

well as processor performance. Our proposed design is called DEAD and is described in

Chapter 7.

2.5 Applying Characteristics of Memory Behavioral Patterns to Improve Processor Efficiency

The previous chapter gave an overview of the inefficiencies that exist in current processors

with regard to memory instruction processing while this chapter discussed some character-

istic and common-case behaviors of memory instructions. The rest of the dissertation will

describe, in detail, our proposed techniques that exploit these observations and optimize

various structures in the processor responsible for the processing of memory instructions.

Each of the designs will focus on optimizing one or more of the efficiency metrics such as

performance, power and energy consumption, area overhead, access latency and scalability.

The first three proposed designs, Store Vectors, PEEP and Fire-and-Forget, are primarily

based on the predictability of memory dependences. The fourth proposal, LCP, exploits

load criticality to improve processor performance and overall efficiency while our fifth pro-

posal, DEAD, explores out-of-order allocation and deallocation of LQ and SQ resources to

avoid the common-case inefficiencies of in-order allocation and deallocation of loads and

stores, thereby improving the efficiency. These five memory instruction processing opti-

mization techniques are focused toward a common goal of improving processor efficiency by

exploiting common-case and predictable behaviors of memory instructions.

CHAPTER III

STORE VECTORS: DESIGNING A SCALABLE, EFFICIENT, AND HIGH-PERFORMANCE MEMORY DEPENDENCE PREDICTOR

3.1 Introduction

In this chapter, we focus on the issue of speculative memory disambiguation to improve the performance of current and future out-of-order processors [79]. Predictability in memory dependences is used to facilitate the task of speculative memory disambiguation. Unfortunately, current designs of memory dependence predictors do not incorporate very sophisticated and accurate prediction mechanisms and are limited in the performance they provide. Advanced, high-performing memory dependence predictors, on the other hand, are not very scalable since they are power-hungry and very complex. In our work, we present the design of a high-performing, yet scalable and simple, memory dependence predictor that outperforms other state-of-the-art predictors. We also present a detailed evaluation which shows that the Store Vectors technique for memory dependence prediction provides a significant performance benefit at a modest area and power budget, which helps to improve processor efficiency.

As explained in Chapter 2, speculative memory disambiguation, facilitated by memory dependence prediction, is used to speculatively issue load instructions. The most accurate memory dependence predictors rely on identifying relationships between load instructions and one or more earlier stores that are likely data-flow predecessors. Before a load instruction can issue, it must wait until these predicted dependencies have resolved. With current processor trends indicating a steady increase in the size of the instruction window, the processor needs to predict the memory dependencies for many more loads. Additionally, the likelihood of the loads actually being dependent on prior unresolved stores also increases.

The identification of these memory dependencies and the communication of their resolutions typically require complex logic. The load-store queue (LSQ), however, is already a

very complex circuit with a full set of CAMs for detecting load-store ordering violations as

well as supporting store-to-load data forwarding; adding too much more circuitry to support memory dependence prediction may not be practical because of the negative consequences on the critical path latency of the LSQ and the overall processor clock frequency.

Conventional implementations of non-memory instruction schedulers are often CAM-based. The associative logic causes the scheduler to be a timing-critical path [55]. Dependency-vector/matrix scheduler organizations have been proposed for faster and more scalable im-

plementations [28, 11, 62]. We propose a new memory dependency prediction and scheduling algorithm based on this idea of dependency vectors [79]. The rest of the chapter is organized as follows: the next section discusses prior work on memory dependence prediction, and in particular it reviews the Store Sets algorithm in greater detail. Section 3.3 explains our

proposed Store Vector approach, providing a step-by-step description and example of the technique. Section 3.4 presents the performance results. Section 3.5 proposes and evaluates additional optimizations of the baseline Store Vectors predictor to use significantly less hardware. Section 3.6 provides more detail on why Store Vectors is able to outperform other proposed memory dependence predictors, especially Store Sets. Section 3.7 briefly describes

the implementation complexities associated with these memory dependence predictors. Section 3.8 summarizes the main results of the chapter and presents our conclusions.

3.2 Background

3.2.1 Memory Dependence Predictors

The concept of memory dependence locality was introduced by Moshovos [50]. In his thesis work, Moshovos characterized two forms of locality: memory dependence status locality, and memory dependence set locality. Memory dependence status locality makes the observation that when a load experiences a store dependency, subsequent instances of the same load will likely experience a dependency as well. Status locality makes no statement about which particular store(s) a load may or may not be dependent on. Memory dependence set locality makes the observation that if a load is dependent on a set of store instructions, then

future instances of that load will likely be dependent on the same set of stores. The basic dependence predictors do not attempt to exploit the dependence set locality property. The

most basic predictor is a naive or blind predictor that simply predicts all load instructions

to not have any store dependencies. A blind predictor will never cause a load to wait when store dependencies are not present, but it will always cause a costly misprediction if

a dependency exists.

Status-based prediction is a technique where the processor uses past behavior to predict

future behavior and, based on this speculation, selects some loads to wait for all prior store addresses to resolve while allowing other loads to speculatively issue. Dependence status

predictors are simple to implement and hence all of the memory dependence predictors

which have been implemented in industrial designs are status-based predictors. The Load

Wait Table predictor employed in the Alpha 21264 and the predictor used in the Intel Core 2, which were described in Chapter 2, are examples of status-based dependence predictors.
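A behavioral sketch of such a status-based predictor is shown below; the table size, the hash, and the clearing interval are assumptions of this illustration rather than the parameters of the Alpha 21264 or Core 2 designs.

    # Load-Wait-Table-style status predictor: one bit per (hashed) load PC.
    class LoadWaitTable:
        def __init__(self, entries=1024, clear_interval=1_000_000):
            self.bits = [False] * entries
            self.clear_interval = clear_interval
            self.retired = 0

        def _index(self, load_pc):
            return load_pc % len(self.bits)

        def predict_must_wait(self, load_pc):
            # True -> hold this load until all earlier store addresses have resolved.
            return self.bits[self._index(load_pc)]

        def train_on_violation(self, load_pc):
            # A speculatively issued load collided with an older store.
            self.bits[self._index(load_pc)] = True

        def on_retire(self):
            # Periodic clearing lets stale "must wait" predictions expire.
            self.retired += 1
            if self.retired % self.clear_interval == 0:
                self.bits = [False] * len(self.bits)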

These status-based prediction schemes do not take instruction timing into account. A

load that is predicted to have a dependency will wait for all previous store instructions

before issuing. In practice, the load will only be dependent on one or a few of the previous

stores. Even if all of these store dependencies have computed their addresses, the load

must conservatively wait until the non-dependent stores have resolved as well. A status-based predictor may correctly predict the status of a load and avoid a pipeline flush, but potential ILP may still be lost because loads are prevented from issuing in a timely fashion.

Memory dependence predictors based on observing the exact sets of stores that a load

collides with have been shown to provide better prediction accuracy and overall performance.

Moshovos proposed a dependence history tracking technique that identifies load-store pairs

known to have memory dependencies [51, 50]. When a load speculatively issues and violates

a true dependency with an older store, the processor records the program counters (PC)

of both the load and the store in a table. Subsequent executions of the load instruction

will check the table to see if it had conflicted with a store in the past. If so, the load will

also check the store queue (using the store PC recorded in the table) to see if that store is

present. If the store is present in the store queue, then the load must wait until the store

has issued. This load-store pair approach has the advantage that misspeculations can be avoided, but at the same time the dependent loads are not delayed longer than necessary.

Different dynamic invocations of a load may have memory dependencies with different static stores. Moshovos also proposed an extension to load-store pair identification that

associates a bounded number of stores with each load [50]. Chrysos and Emer generalized

this approach with their Store Sets algorithm that allows one or more load instructions to

be associated with one or more stores [17]. Previous work showed that a Store Sets memory

dependence predictor provides performance close to that of an oracle predictor.

3.2.2 Store Sets

The Store Sets algorithm groups a load’s conflicting stores into one logical group or a Store

Set [17]. Each Store Set has a unique identifier, called the Store Set identifier (SSID). Each

load and each store may belong to one and only one Store Set. Figure 3 shows the hardware

organization of the Store Sets data structures. The Store Sets Identification Table (SSIT) is a PC-indexed, tagless table that tracks the current Store Set assignment for each load

and each store instruction. The example in the figure shows Stores A and C and Load X

belonging to Store Set number 2, while Store B belongs to Store Set number 0. This means

that in the past, Load X has had memory ordering violations with Stores A and C. The

SSIT combined with proper SSID assignment track the active Store Sets in the program.

To prevent memory ordering violations, a load instruction must wait on any unresolved

store instructions that belong to the load’s Store Set. To determine if any such stores are

present in the store queue (STQ), each store updates the last fetched store table (LFST)

which indicates the store queue index of the most recent in-flight store. At dispatch, a load

instruction consults the LFST and if an active store is present, a dependency is established

between the two instructions. In Figure 3, the most recently fetched store (from the load’s

Store Set) resides in STQ entry 4 as indicated by the LFST and the dependency is illustrated

by the bold arrow. To make a load wait on all active stores in its Store Set, all stores within

the same set are also serialized. This is represented by the dependency arc between Store

A and Store C in Figure 3.

Figure 3: Store sets data structures and their interactions.

As described, each store can only belong to a single Store Set

determined by the value in the single SSIT entry corresponding to the store. This means that if there are two different loads that are dependent on the same store, then the store

will “ping-pong” back and forth between the two loads. To address this problem, Chrysos

and Emer proposed a modified SSID assignment rule called Store Sets merging that allows

more than one load to share the same Store Set.
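A minimal software model of the SSIT/LFST bookkeeping described above might look as follows; the table size, the PC hash, and the simplistic SSID-assignment rule used on a violation are our own simplifications of the full algorithm of Chrysos and Emer [17].

    # Simplified Store Sets model: the SSIT maps PCs to SSIDs; the LFST maps an SSID
    # to the STQ index of the most recently dispatched (still in-flight) store in that set.
    class StoreSetsPredictor:
        def __init__(self, ssit_entries=4096):
            self.ssit = [None] * ssit_entries   # PC-indexed, tagless
            self.lfst = {}                      # SSID -> STQ index of the last store
            self.next_ssid = 0

        def _idx(self, pc):
            return pc % len(self.ssit)

        def dispatch_store(self, store_pc, stq_index):
            ssid = self.ssit[self._idx(store_pc)]
            if ssid is None:
                return None                     # store belongs to no Store Set
            prev = self.lfst.get(ssid)          # older store it must be serialized after
            self.lfst[ssid] = stq_index         # this store is now the set's last store
            return prev

        def dispatch_load(self, load_pc):
            ssid = self.ssit[self._idx(load_pc)]
            return self.lfst.get(ssid) if ssid is not None else None

        def store_issued(self, store_pc, stq_index):
            ssid = self.ssit[self._idx(store_pc)]
            if ssid is not None and self.lfst.get(ssid) == stq_index:
                del self.lfst[ssid]             # no in-flight store left in this set

        def train_on_violation(self, load_pc, store_pc):
            # Put the load and the offending store in the same Store Set (a stand-in
            # for the real SSID-assignment and set-merging rules).
            ssid = self.ssit[self._idx(load_pc)]
            if ssid is None:
                ssid, self.next_ssid = self.next_ssid, self.next_ssid + 1
                self.ssit[self._idx(load_pc)] = ssid
            self.ssit[self._idx(store_pc)] = ssid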

The Store Sets algorithm introduces new dependencies from stores to stores, and from

stores to loads. These dependencies prevent costly memory ordering violations while allowing independent loads to aggressively issue out-of-order. However, the load and store queues must incorporate new hardware to track and enforce the Store Sets dependencies.

The load and store queues are already very complex structures, requiring a large amount of content addressable memory (CAM) logic for the detection of memory ordering violations and to support store-to-load data forwarding [13, 59]. Compared to a non-memory instruction scheduler, the load and store queue CAMs are much larger because they must deal

with 64-bit addresses rather than 7-8 bit physical register identifiers. Furthermore, the load

and store queues must also employ some form of age or order-tracking information because

a load must be able to distinguish between an older (in program order) store and a younger

store to the same address. In situations where there exist multiple older stores to the same

address, a load needs the age information to make sure that it receives its data from the

more recent store.

The original Store Sets work did not clearly describe the exact hardware implementation

of the algorithm. Here, we outline two possible approaches. If we implement Store Sets

using associative logic, this requires a complete set of CAMs much like the ones already

present in the load and store queue and the related broadcast buses to track the Store Sets dependencies, as illustrated by the shaded blocks in Figure 4(a).

Figure 4: Store Sets implementation using (a) a CAM-based fully-associative organization and (b) a RAM-based direct-mapped organization.

As described earlier, each store updates its LFST entry with its store queue index. For each subsequent load belonging to this Store Set, the load will read this store queue index from the LFST and store it in the corresponding load queue entry. When the store executes, it broadcasts its store queue index to all of the load queue entries, and any loads waiting on this store will make note of the resolved dependency. Furthermore, the store must also broadcast its store queue index to all of the store queue entries since all stores within the same Store Set will be serialized. The net effect of this approach to implementing Store Sets is the addition of an entire extra set of CAMs to both the load and store queues. If multiple loads and/or stores can issue each cycle, then these CAMs will also need to support additional ports. This

CAM-based implementation would lead to load and store queues that do not scale well to larger sizes, grossly increase power consumption, and may possibly limit processor clock speeds as well.

A second potential implementation of Store Sets uses a direct-mapped RAM-based organization, as shown in Figure 4(b). Stores update the LFST with their store queue indexes as usual. When a load (or store) in the same Store Set dispatches/allocates, it first consults the LFST to find the index of the last fetched store (4 in this example). The load then uses this index to directly access (direct mapped) the store queue entry to insert its own

load queue index (1 in this example). When the store finally issues, it uses this load queue index to directly access the corresponding load queue entry to notify the load that this dependency has now been resolved (shown by the dashed arrow). This approach does not require CAMs, which have high area/power/latency costs, but there are a few additional complications. Store Set merging can result in multiple loads all being dependent on the same store. Like the Explicit Data Forwarding (EDF) technique [63], each store queue entry must be able to support waking up multiple dependent loads, which requires storing more than one load queue index. If each store can have up to k dependent loads, then the store’s store queue entry must have the capacity to track k load queue indexes, and the load queue itself needs k write ports if the store is to notify all of these loads without

incurring any additional delay. If Store Set merging and predictor table aliasing effects

cause the store to have greater than k dependents, then either dependence prediction will

suffer because some loads cannot be added to the store’s output dependency list, or even

more complicated schemes (such as allocation of extra store queue entries) will be required.

Another complication with this RAM-based implementation of Store Sets is that when a

younger load inserts its load queue index into a store queue entry, the load may in fact

be a speculative instruction in the shadow of a mispredicted branch. When the processor uncovers the branch misprediction, some recovery mechanism is required to remove the load queue index from the parent store queue entry.
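The bookkeeping described for the RAM-based organization can be made concrete with the sketch below; the value of k, the class names, and the overflow policy are assumptions invented purely for illustration.

    # Direct-mapped (RAM-based) dependency tracking: each STQ entry records up to K
    # dependent load-queue indexes and wakes them directly when the store issues.
    K = 2  # assumed maximum number of dependent loads tracked per store

    class LoadEntry:
        def __init__(self):
            self.predicted_dep_resolved = False

    class StoreEntry:
        def __init__(self):
            self.waiters = []        # load-queue indexes predicted dependent on this store
            self.overflow = False    # set when more than K dependents could not be recorded

    def register_dependent(store_entry, lq_index):
        if len(store_entry.waiters) < K:
            store_entry.waiters.append(lq_index)
        else:
            # Prediction quality degrades, or extra SQ entries must be allocated,
            # exactly the complication discussed in the text.
            store_entry.overflow = True

    def on_store_issue(store_entry, load_queue):
        for lq_index in store_entry.waiters:
            load_queue[lq_index].predicted_dep_resolved = True
        store_entry.waiters.clear()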

The LFST also creates a new critical loop similar to the register renaming logic. It is possible that multiple stores need to update the LFST in the same cycle. If all of these stores belong to the same Store Set, then some sort of dependency checker is required to correctly set up the intra-group dependencies, and a prioritized write logic is needed to make sure that only the last store in the Store Set updates the corresponding LFST entry.

Similar to the wrong-path pointers in the RAM-based Store Sets implementation, the LFST entries may also contain store queue indexes that correspond to wrong-path instructions.

The LFST would need support for checkpointing or some other recovery/repair mechanism.


Figure 5: (a) Two-matrix scheduler, (b) single-matrix unified scheduler.

3.2.3 Scheduling Structures

Instruction schedulers typically use content addressable memory (CAM) organizations.

Each issuing instruction broadcasts a unique identifier to notify its dataflow children that the dependency has been resolved. Each instruction must monitor all of the broadcast buses, constantly comparing the identifiers of its inputs with the broadcast traffic. Palacharla analyzed the structure of CAM-based schedulers and found that the critical path delay increases quadratically with the issue width and the number of entries [55].

The dependency vector or dependency matrix scheduler organization is an alternative scheduler topology designed to be significantly more scalable. Goshima et al. proposed to replace the left and right CAM banks of a conventional scheduler with left and right dependency matrices as illustrated in Figure 5(a) [28]. For a W-entry scheduler, each matrix has W rows and W columns, one for each instruction. If instruction i is data-dependent on instruction j, then the matrix entry at row i and column j is set to one. So long as instruction i has any bit set in its row, at least one of its input dependencies has not been resolved. When instruction j issues, it clears all bits in column j, thus notifying any dependents in the window that the parent instruction has been scheduled. The critical path logic is significantly reduced as compared to a CAM-based scheme. The tag comparison for computing readiness has been replaced by a single wired-NOR and an AND gate to check that both left and right inputs are ready. The multi-bit tag broadcast has been replaced by a single-bit latch-clear signal. The matrix structures only contain one bit per entry, which makes the total area significantly smaller than a CAM-based scheduler that contains

registers for the dependency tags, comparators, large broadcast buses, and additional logic.

Figure 6: (a) Store-address tracking of dependencies, and (b) store position or age tracking of dependencies.

Figure 5(b) illustrates a single-matrix implementation, suggested by Brown et al. [11]. Each

matrix row can contain multiple non-zero entries to denote all of the dependencies, which

halves the matrix area and removes the AND gates for detecting both left-and-right input

readiness. Our store-vector dependence predictor’s hardware implementation is based on this compact, scalable single-matrix scheduler structure.
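The single-matrix organization can be modeled in software as shown below; bit rows stand in for latches and the all-zero row test plays the role of the wired-NOR. This is a sketch of the bookkeeping only, not of the circuit.

    # Software model of a W-entry single-matrix (dependency-vector) scheduler.
    class MatrixScheduler:
        def __init__(self, window_size):
            # matrix[i][j] == 1 means instruction i is still waiting on instruction j.
            self.matrix = [[0] * window_size for _ in range(window_size)]

        def add_dependency(self, consumer, producer):
            self.matrix[consumer][producer] = 1

        def is_ready(self, i):
            # Hardware implements this test as a wired-NOR across row i.
            return not any(self.matrix[i])

        def on_issue(self, j):
            # Issuing instruction j clears column j, waking any dependents.
            for row in self.matrix:
                row[j] = 0

    # Example: instruction 2 depends on instructions 0 and 1.
    sched = MatrixScheduler(4)
    sched.add_dependency(2, 0)
    sched.add_dependency(2, 1)
    sched.on_issue(0)
    assert not sched.is_ready(2)   # still waiting on instruction 1
    sched.on_issue(1)
    assert sched.is_ready(2)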

3.3 Store Vector Dependence Prediction

We propose a new algorithm for memory dependence prediction based on Store Vectors.

Store Vectors differ from the load-store pair and Store Set approaches in that Store

Vectors do not explicitly track the program counters (PC) of stores that collide with loads.

Instead, we implicitly track load-store dependencies based on the relative age of a store.

Consider the example in Figure 6. The five load and store instructions are listed in program

order, and Load W has had ordering violations with Store B in the past, and Load X has

had ordering violations with both Stores A and C. Figure 6(a) illustrates the PC-based

dependency information used by previous pair or set-based approaches. In contrast, an

age-based approach illustrated in Figure 6(b) stores the relative age or positions of the

stores. Load X remembers that it had previous conflicts with the most recent store and

the third most recent store. A load’s Store Vector records the relative positions or ages of

all stores that were involved in previous memory ordering violations. The Store Vector for

Load X is illustrated in Figure 6(b). In the general case, the length of the Store Vector will

be equal to the number of store queue entries although we discuss different ways to optimize

Store Vectors to only track a certain number of store dependencies in Section 3.5.


Figure 7: Store vectors data structures and interaction with a conventional load queue.

3.3.1 Step-by-Step Operation

The Store Vector algorithm has three main steps: lookup/prediction, scheduling, and update

due to ordering violations. These are described in turn below, followed by an example.

Lookup/Prediction — The primary data structure for recording load-store dependencies

is the Store Vector Table (SVT). For each load, the least significant bits of the load’s PC

provide an index into the SVT, as shown in Figure 7. The corresponding Store Vector is

then rotated and copied into the load scheduling matrix (LSM or simply the matrix). The

process for setting a load’s Store Vector is described later. The LSM consists of one row for

each load queue entry, and one column for each store queue entry (the matrix need not be

square).

An SVT entry records a load’s Store Vector in a format where the least significant bit

corresponds to the most recent store before the load. The rightmost column of the LSM

may not correspond to the most recently fetched store. A barrel shifter must rotate the

vector such that the least significant bit is aligned with the column of the most recent store.

Any bits in the vector that correspond to already resolved stores must be cleared to prevent

the load from waiting on an already resolved dependency (which could result in deadlock).

This is accomplished by taking a bitwise AND of the Store Vector with the bits from each

store queue entry that indicates if the store’s address is not ready. Finally, the vector is written into the matrix.
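In software, the lookup step amounts to a table read, a rotation, and a mask, as in the sketch below; the store queue size, the PC hash, and the list-based representation are simplifying assumptions of this illustration.

    # Sketch of the Store Vector lookup/prediction step.
    SQ_SIZE = 32

    def svt_lookup(svt, load_pc, stq_tail, store_addr_unresolved, lsm, lq_index):
        """Write a load's rotated and masked Store Vector into its LSM row.

        svt                   : list of bit-lists indexed by a hash of the load PC;
                                bit k set means "the (k+1)-th most recent store conflicted".
        stq_tail              : index of the next free store queue slot.
        store_addr_unresolved : per-STQ-slot flag, True while the store address (STA) has
                                not yet executed; False for resolved or invalid slots.
        """
        vector = svt[load_pc % len(svt)]
        row = [0] * SQ_SIZE
        for k, bit in enumerate(vector):           # k = 0 is the most recent store
            slot = (stq_tail - 1 - k) % SQ_SIZE     # the barrel rotation by the STQ tail
            # Mask off stores whose addresses are already known (or empty slots) so the
            # load cannot deadlock waiting on an already-resolved dependency.
            row[slot] = bit & int(store_addr_unresolved[slot])
        lsm[lq_index] = row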

Note that for both the Store Sets and Store Vector techniques, the prediction lookup latency can be overlapped with other front-end activities. Both approaches use PC-based

table lookups which could in theory be initiated as early as the fetch stage, although for

power reasons one would likely defer the lookup to the decode stage to avoid unnecessary predictions for non-load instructions.

Scheduling — After the prediction phase has written a load’s Store Vector into the LSM,

there may be some bits in the vector that are set. The positions of these bits indicate the

stores that this load is predicted to have a dependency with. While any of the bits are still

set, the load will not be considered as ready. The hardware implementation of the matrix

is identical to the single-matrix scheduler described earlier in Figure 5(b). A wired-NOR

determines if any unresolved predicted dependencies remain.

Each store queue entry is uniquely mapped to a column of the matrix. When the

store issues, it simply clears all of the bits in its corresponding column. Note that stores

in the same vector can issue in any order, whereas with Store Sets all of the stores are

serialized. Furthermore, a store can be a predicted input dependence for any number of

load instructions (more than one entry per column may be set), whereas with Store Sets

the LFST forces each store to belong to only a single Store Set.

Update — All vectors in the SVT are initially set to zeros. In this fashion, all

load instructions will initially execute as if under a naive/blind speculation policy. When

a load-store ordering violation occurs, the store’s relative position/age is determined. The

position of the most recent store with respect to the offending load is easily determined

because the processor already keeps this information for tracking the ordering of all load

and stores. The difference between the store queue indexes of the most recent store and

the store involved in the ordering violation provides the relative age of the offending store.

This bit is then set in the load’s SVT entry.

Eventually, all bits in the vector may get set, thus making the processor behave as if it

was incapable of any memory speculation at all. Similar to the 21264 load wait table and

Store Sets, we periodically reset the contents of the SVT to clear out predicted dependencies

that may no longer exist due to changes in program phases, dynamic data values, or other reasons.
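The update on an ordering violation reduces to a little modular arithmetic over store queue indexes; the sketch below, with an assumed store queue size and PC hash, sets the corresponding relative-age bit and also shows the periodic reset.

    # Sketch of the SVT update performed on a load/store ordering violation.
    SQ_SIZE = 32

    def on_ordering_violation(svt, load_pc, most_recent_stq_index, offending_stq_index):
        # Relative age 0 = the most recent store fetched before the load, 1 = the second
        # most recent, and so on; the modulo handles wraparound of the circular STQ.
        relative_age = (most_recent_stq_index - offending_stq_index) % SQ_SIZE
        svt[load_pc % len(svt)][relative_age] = 1

    def periodic_clear(svt):
        # Mirrors the periodic reset that drops stale predicted dependencies.
        for vector in svt:
            for k in range(len(vector)):
                vector[k] = 0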


Figure 8: Example Store Vector operation for the (1) prediction phase, (2) scheduling phase, and (3) update phase.

3.3.2 Example

This sub-section provides a detailed example of predicting a load’s dependencies, scheduling the load, and updating the load’s Store Vector when an ordering violation occurs. The different steps of the example are illustrated in Figure 8.

(1) Prediction: After decoding the instruction Load X, a hash of the load’s PC is used to select a vector from the Store Vector Table. The Store Vector bit that corresponds to the

most recent store fetched before this load is indicated with shading. The 1’s in the vector

indicate that Load X is predicted to be dependent on the most recent store (shaded), the third most recent store, and the sixth most recent store.

In this example, the most recent store is Store D, which has been allocated to store

queue (STQ) entry 2. The STQ-head points to the oldest STQ entry, and the STQ-tail

points to the next available STQ entry. If we right barrel shift (least significant bits wrap around to the most significant positions) by an amount equal to the STQ-tail, the Store

Vector bit for the most recent store will now be properly aligned with the most recent STQ

entry. In this case, the right barrel shift is by three; note that the position of the shaded

bit has moved to reflect the location of the most-recent store.

Corresponding to the non-zero bits in the load’s Store Vector, the most recent store is

Store D, the third most recent is Store B, and a sixth most recent does not exist. The

Store Vector predicts that Load X is dependent on the most recent store, but at the time

of dispatch, Store D had already issued and so Load X should not wait on Store D. Each

entry of the store queue has a bit that indicates whether the corresponding store address

is unknown. Since the STA (store address) and STD (store data) are two separate uops, this bit is cleared when the address of the store is known, irrespective of whether the data is ready or not. In the case of invalid entries, the bit is cleared to indicate that the address is known. By taking a bitwise AND, we clear the Store Vector bit that corresponded to the invalid store (the sixth most recent store) as well as a store that had already been resolved

(Store D). This final Store Vector is written into the Load Scheduling Matrix in the row

specified by the load’s Load Queue entry index.

(2) Scheduling: In this example, Load X only has one remaining dependency in its Store

Vector. At some point in the future, Store B will issue. At this point, Store B clears its column in the Load Scheduling Matrix. Each STQ entry can simply have a single hardwired connection to the clear signal for the corresponding column. Load X’s Store

Vector no longer contains any 1’s, and so the wired-NOR will raise its output to indicate that Load X’s predicted store dependencies have all been satisfied. Assuming that Load X’s address has already been computed, it will proceed to bid for a memory port and then issue.

(3) Update: If Load X issues and the load and store queues do not eventually detect any memory ordering violations, then no other actions are required. If, after Load X speculatively issues, Store A ends up writing to the same address as Load X, then there will be a memory ordering violation. In this case, Load X and all instructions afterward are flushed from the pipeline and re-fetched. Store A is the fourth most recent store with respect to

Load X, although it may not be the fourth most recent store in the store queue because other store instructions may have been fetched since Load X was dispatched. To prevent

future memory-ordering violations between Store A and Load X, the store updates Load X’s

Store Vector in the SVT, setting the bit for the fourth most recent store.

3.4 Evaluation

This section presents the performance evaluation of the various memory dependence pre-

diction algorithms.

3.4.1 Methodology

We use cycle-level simulation to evaluate the performance of the memory dependence pre-

dictors. In particular, we use the MASE simulator [38] from the SimpleScalar toolset [3] for

the Alpha instruction set architecture. We made several modifications related to memory

dependence prediction and scheduling, including support for separate load and store queues

(as opposed to a unified load-store queue), Store Sets and the Store Vector algorithms. In

addition to comparing the performance of Store Sets with Store Vectors, we also simulate

the load wait table (LWT) described in Section 3.2.1 since it has low implementation com-

plexity. For LWT, Store Sets and Store Vectors, we reset the predictor tables every one

million instructions. The minimum latency from fetch to execution is twenty cycles, and we

simulate a full in-order front-end pipeline as opposed to the default SimpleScalar behavior

of simply stalling fetch for some number of cycles.

We model three main processor configurations, whose parameters are listed in Table 1,

to evaluate the performance of the different memory dependence predictors. The parameters that differ between the configurations are noted in the specific configuration column

while all other parameters are common to all the configurations. The ‘Medium’ configu-

ration represents a moderately aggressive processor similar to current microarchitectures.

The ‘Large’ configuration has larger hardware structures and a wider machine width to

represent a more aggressive processor. Finally, we also model an ‘Extra-Large’ configuration, which represents a highly aggressive machine that exposes more opportunities for

load-store dependencies to exist. While the extra-large configuration is definitely more

aggressively positioned than current processors and would be difficult to implement using

conventional design methodologies, there have been some innovative research proposals that

have put forth implementable designs of similar, extra-large configurations [61, 18]. Due to the large buffer structures, the designs presented in these proposals could encounter several load-store dependencies and may benefit by using advanced memory dependence prediction algorithms. Hence we include this configuration design point in the evaluation.

Table 1: Simulated processor configurations.

    Configuration         Medium    Large    Extra-Large
    Processor Width       6         8        8
    Scheduler Size (RS)   64        96       128
    Load Queue Size       32        96       128
    Store Queue Size      32        96       128
    ROB Size              256       512      2048
    Memory Ports          2         4        4
    ALU                   4 Int, 4 FP
    Mult                  1 Int, 1 FP
    IL1 Cache             16KB, 4-way, 3-cycle
    DL1 Cache             16KB, 4-way, 3-cycle
    L2 Cache              512KB, 8-way, 10-cycle
    L3 Cache              4MB, 16-way, 30-cycle
    Memory                250 cycles
    ITLB                  64-entry, FA
    DTLB                  64-entry, FA
    L2 TLB                1024-entry, 8-way, 7-cycle

We simulated a variety of applications from the following suites: SPECcpu2000, Me-

diaBench [39], MiBench [29], Graphics applications including 3D games and ray-tracing,

Pointer-intensive benchmarks [4] as well as Bioinformatics benchmarks [1]. All SPEC ap-

plications use the reference inputs, where applications with multiple reference inputs are

listed with different numerical suffixes. To reduce simulation time, we used the SimPoint 2.0

toolset to choose representative samples of 100 million instructions [56]. All average IPC

speedups are computed using the geometric mean.
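For clarity, this aggregation amounts to the small helper below (the numbers in the usage comment are placeholders, not measured results).

    # Geometric-mean speedup of one configuration over a baseline.
    from math import prod

    def geomean_speedup(ipc_new, ipc_base):
        ratios = [n / b for n, b in zip(ipc_new, ipc_base)]
        return prod(ratios) ** (1.0 / len(ratios))

    # e.g. geomean_speedup([1.10, 2.00, 0.90], [1.00, 1.90, 0.90]) ~= 1.05 (hypothetical)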

3.4.2 Performance Results

Our baseline processor configuration uses naive/blind speculation. We also use a perfect

oracle dependence predictor as an “upper bound.” The oracle predictor is perfect in the sense that it correctly predicts which stores a load has true dependences with while ensuring

that the load is only made to wait until these specific stores have computed their addresses.

The oracle guarantees perfect dependence prediction, but this does not necessarily result in maximum performance because in rare cases misspeculated loads can still have performance

Figure 9: Performance of the ‘medium’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence sensitivity.

Figure 10: Performance of the ‘large’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence sensitivity.

Figure 11: Performance of the ‘extra-large’ sized configuration across all benchmark suites. Memdep refers to those applications which were found to exhibit memory dependence sensitivity.

benefits due to prefetching effects or early branch misprediction detection.

Figure 9, Figure 10 and Figure 11 depict the average per-suite performance numbers of

the simulated memory dependence predictors for the medium, large and extra-large con-

figurations, respectively. We call an application dependence-sensitive, labeled as Memdep

in the plots, if perfect dependence prediction results in at least a 1% IPC speedup over

blind speculation. Many applications are not sensitive to memory dependences and only

experience few (single digits) ordering violations over the course of millions of instructions.

We also present the detailed performance results of the various memory dependence predic-

tors for all benchmarks (both dependence-sensitive and non-dependence-sensitive) for the

medium configuration in Table 2 and Table 3. The per-application performance numbers

for the large and extra-large configurations are presented in the Appendix.

We now discuss the results in detail. For the medium-sized processor configuration,

Figure 9 shows that perfect/oracle dependence prediction achieves a 4.1% speedup over

blind speculation across all applications, and 8.5% on the dependence-sensitive programs.

All of the dependence predictors, including the simple Load Wait Table as well as the more

sophisticated Store Sets and Store Vector algorithms, achieve a performance level that is

relatively close to that of the perfect predictor. On the dependence-sensitive programs, the

LWT’s 7.7% performance improvement over blind speculation is quite close to Store Set’s

8.1%, Store Vector’s 8.4% and the perfect predictor’s 8.5% speedups. From these results,

we conclude that for moderately-sized microarchitectures, a simple dependence predictor

like the Load Wait Table (LWT) is sufficient. Adding a more sophisticated predictor does

not provide much additional performance, and this may not justify the added complex-

ity associated with implementing the predictor. Per-application speedup numbers for the

medium configuration further show that there is not much variation in the performance

of the memory dependence predictors. There are some integer applications, like gcc-2,

however, that suffer quite a significant performance loss over blind prediction when using

the LWT predictor due to its conservativeness. In this particular application we observed

many missed opportunities for speculative load scheduling.

For more aggressively provisioned processors, the results and conclusions are slightly

Table 2: The performance of the memory dependence predictors simulated on the ‘medium’ configuration for the entire set of applications. An ‘x’ signifies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect prediction).

    Benchmark Name     Base IPC   LWT     St. Sets  St. Vectors  Perfect
    adpcm-dec M          3.83     0.0%    0.0%      0.0%         0.0%
    adpcm-enc M          1.58     0.0%    0.0%      0.0%         0.0%
    ammp F               1.32     0.0%    0.0%      0.0%         0.0%
    anagram P x          2.06    11.5%   11.6%     11.7%        11.7%
    applu F              0.72     0.0%    0.0%      0.0%         0.0%
    apsi F x             2.14     1.8%    1.8%      1.9%         1.9%
    art-1 F              1.53     0.0%    0.0%      0.0%         0.0%
    art-2 F              1.49     0.0%    0.0%      0.0%         0.0%
    bc-1 P x             2.17     5.6%    6.0%      6.8%         6.0%
    bc-2 P x             1.11     2.9%    3.2%      3.6%         3.5%
    bc-3 P               1.15     0.2%    0.2%      0.2%         0.3%
    bzip2-1 I            1.56     0.3%    0.5%      0.5%         0.6%
    bzip2-2 I            1.59     0.0%    0.1%      0.0%         0.1%
    bzip2-3 I            1.70    -0.3%   -0.3%     -0.3%         0.4%
    clustalw B           3.33     0.3%    0.3%      0.3%         0.3%
    crafty I x           1.24     6.1%    6.3%      6.2%         6.3%
    crc32 E              2.62     0.0%    0.0%      0.0%         0.0%
    dijkstra E           2.15     0.0%    0.0%      0.0%         0.1%
    eon-1 I x            1.23    17.0%   19.9%     18.8%        19.4%
    eon-2 I x            0.91     3.4%    3.6%      3.4%         3.9%
    eon-3 I x            1.00     6.6%    7.1%      7.4%         8.1%
    epic M               1.86     0.0%    0.0%      0.0%         0.0%
    equake F             0.83     0.0%    0.0%      0.0%         0.0%
    facerec F x          2.06     6.7%    6.7%      6.7%         6.7%
    fft-fwd E            1.98     0.0%    0.0%      0.0%         0.0%
    fft-inv E            1.98     0.0%    0.0%      0.0%         0.0%
    fma3d F x            0.97     1.9%    1.9%      2.1%         2.0%
    ft P                 2.12     0.3%    0.3%      0.4%         0.4%
    g721decode M x       1.76     1.5%    1.5%      1.6%         1.6%
    g721encode M x       1.64     1.1%    1.1%      1.2%         1.2%
    galgel F             3.31     0.0%    0.0%      0.0%         0.0%
    gap I                1.27     0.7%    0.6%     -0.2%         0.7%
    gcc-1 I x            2.31     0.6%    0.7%      1.0%         1.0%
    gcc-2 I x            0.76    -6.3%    4.3%      4.1%         4.8%
    gcc-3 I x            0.82     2.6%    2.6%      2.7%         2.8%
    gcc-4 I x            0.96     2.3%    2.5%      2.5%         2.6%
    ghostscript-1 E x    1.40    14.7%   15.0%     15.1%        14.8%
    ghostscript-2 M x    1.40    15.3%   15.5%     16.0%        15.5%
    ghostscript-3 E x    1.67    10.1%   11.2%     12.5%        12.7%
    glquake-1 G x        1.04     4.0%    4.1%      4.2%         4.3%
    glquake-2 G x        1.17     3.3%    3.3%      3.5%         3.6%
    gzip-1 I             1.79     0.0%    0.0%      0.0%         0.1%
    gzip-2 I             1.37     5.0%    0.3%      0.3%         0.3%
    gzip-3 I x           1.49     1.1%    1.2%      1.2%         1.3%
    gzip-4 I             1.80     0.0%    0.0%      0.0%         0.0%
    gzip-5 I             1.34     0.0%    0.3%      0.3%         0.3%
    jpegdecode M         2.12     0.5%    0.5%      0.7%         0.7%
    jpegencode M x       1.72     3.7%    3.7%      3.8%         3.7%
    ks-1 P               1.22     0.0%    0.0%      0.0%         0.0%
    ks-2 P               0.78     0.0%    0.0%      0.0%         0.0%
    lucas F              1.00     0.0%    0.0%      0.0%         0.0%
    mcf I                0.58     0.0%    0.0%      0.0%         0.0%

Table 3: Continuation of the data from Table 2.

    Benchmark Name     Base IPC   LWT     St. Sets  St. Vectors  Perfect
    mesa-1 M             1.49     0.0%    0.0%      0.0%         0.0%
    mesa-2 F x           1.50    17.9%   18.0%     18.2%        18.4%
    mesa-3 M x           1.06     1.3%    1.3%      1.3%         1.3%
    mgrid F              1.26     0.1%    0.1%      0.1%         0.1%
    mpeg2decode M        2.26     0.2%    0.2%      0.3%         0.4%
    mpeg2encode M        3.12     0.1%    0.1%      0.1%         0.1%
    parser I x           1.26     6.9%    6.9%      6.9%         7.1%
    patricia E           1.09     0.1%    0.1%      0.3%         0.1%
    perlbmk-1 I x        0.95    10.0%   10.5%     11.6%        11.3%
    perlbmk-2 I x        1.22     1.7%    1.8%      2.4%         1.9%
    perlbmk-3 I x        2.37     2.5%    2.1%      2.6%         2.7%
    phylip B             1.42     0.0%    0.0%      0.0%         0.0%
    povray-1 G x         0.76    18.3%   18.6%     18.7%        19.1%
    povray-2 G x         0.98     8.8%    8.7%      9.1%         9.3%
    povray-3 G x         1.01    20.3%   20.8%     20.8%        21.3%
    povray-4 G x         1.03    11.8%   11.8%     11.5%        12.0%
    povray-5 G x         0.84    10.7%   10.9%     11.0%        11.2%
    rsynth E             1.62     0.1%    0.1%      0.1%         0.1%
    sha E x              2.88     8.8%    9.0%      8.8%         9.0%
    sixtrack F           1.30     0.2%    0.2%      0.2%         0.3%
    susan-1 E            2.27     0.2%    0.2%      0.2%         0.2%
    susan-2 E            1.64     0.0%    0.0%      0.0%         0.0%
    swim F               0.80     0.0%    0.0%      0.0%         0.0%
    tiff2bw E            1.51     0.0%    0.0%      0.0%         0.0%
    tiff2rgba E          1.63     0.2%    0.2%      0.2%         0.2%
    tiffdither E x       1.59     3.6%    3.7%      3.7%         3.9%
    tiffmedian E         2.46     0.1%    0.1%      0.1%         0.1%
    twolf I x            0.83     1.3%    1.2%      1.3%         1.3%
    unepic M             0.40     0.0%    0.0%      0.0%         0.0%
    vortex-1 I x         1.56    31.0%   30.4%     31.7%        32.4%
    vortex-2 I x         1.06    37.6%   37.6%     41.0%        39.9%
    vortex-3 I x         1.45    35.0%   35.2%     36.7%        37.2%
    vpr-1 I x            1.06     5.0%    5.0%      5.2%         5.2%
    vpr-2 I x            0.65     3.5%    3.5%      3.5%         3.6%
    wupwise F            2.29     0.4%    0.5%      0.5%         0.5%
    x11quake-1 G x       1.44     4.4%    4.4%      4.2%         4.7%
    x11quake-2 G         2.74    -0.1%   -0.1%     -0.2%         0.1%
    x11quake-3 G x       1.42     4.5%    4.5%      4.3%         4.9%
    xanim-1 G            3.27     0.0%    0.0%      0.0%         0.0%
    xanim-2 G x          2.86     0.9%    0.9%      1.0%         1.0%
    xdoom G x            2.06     4.7%    4.9%      5.3%         5.4%
    yacr2 P x            2.10     2.5%    2.5%      2.6%         2.5%

    Benchmark Group    Base IPC   LWT     St. Sets  St. Vectors  Perfect
    ALL                  1.44     3.7%    4.0%      4.2%         4.2%
    SpecINT I            1.22     5.3%    6.1%      6.3%         6.5%
    SpecFP F             1.37     0.7%    0.7%      0.7%         0.7%
    MediaBench M         1.69     2.8%    2.8%      2.8%         2.8%
    MiBench E            1.84     2.4%    2.5%      2.6%         2.6%
    Graphics G           1.41     6.7%    7.0%      7.0%         7.3%
    PtrDist P            1.49     2.8%    2.9%      3.1%         3.0%
    BioBench B           2.17     0.1%    0.1%      0.1%         0.1%
    Dep. Sensitive x     1.31     7.7%    8.1%      8.4%         8.5%

different. The larger instruction window exposes more load-store reordering opportunities,

and as a result, better dependence predictors are in fact required to manage the scheduling

of these instructions. Figure 10 shows that there is considerable variation between the per-

formance of the LWT predictor and the Perfect predictor for the large configuration. The

floating point applications which showed similar speedups for the medium configuration

regardless of the prediction algorithm used now show a slight improvement in performance

when the Store Vectors or the Perfect predictor is used. The dependence sensitive appli-

cations are especially impacted at this larger configuration. As an example we highlight

the performance of a memory dependence-sensitive application, bc-2. For the medium

configuration, the LWT predictor had a 3% speedup over the Blind predictor whereas the

Store Vectors predictor had a 3.6% speedup over the Blind predictor. In the large configuration, however, the speedup of the Store Vectors predictor is approximately 7% whereas the speedup of the LWT table is less than 0.5%, since the conservative LWT predictor is unable to accurately predict all the memory dependencies that are now exposed in this application. Due to some of the problems discussed in Section 3.6, the Store Sets algorithm only provides an additional speedup of around 1.7% over the LWT approach. Store

Vectors, however, achieves performance levels much closer to the oracle predictor than the other techniques, although there still remains some room for improvement.

For the extra-large configuration, the performance difference between the memory dependence predictors is even more dramatic, as shown in Figure 11. There is now a much larger performance gap between the simple LWT predictor and the oracle predictor (2.1% versus 9.2% across all benchmarks, and 6.7% versus 14.9% on the dependence-sensitive applications). Per-application performance numbers for this configuration also show the benefit of using advanced memory dependence predictors like Store Sets and Store Vectors.

Both crc32 and dijkstra post considerable performance losses when using the LWT predictor. Store Sets and Store Vectors in these cases match the performance of the oracle predictor. Additionally, among the integer applications, the gzip benchmarks also suffer performance degradations when the LWT predictor is used. The results, however, are quite encouraging when using the advanced memory dependence predictors like Store Sets and

Store Vectors. In fact, when considering all of the application results for the extra-large configuration, we see that where speculative memory dependence prediction does not help much (low performance difference between Perfect and Blind), Store Sets and Store Vectors are more robust and do not cause severe performance degradation as opposed to LWT, which posts losses of more than 10% in several cases. These results highlight the fact that when there is something to be gained through prediction, Store Sets and Store Vectors do a much better job of reaching oracle performance than LWT, and when there is not much performance to be gained they are less harmful.

Figure 12: S-curves illustrating the range of performance behavior of the simulated memory dependence predictors across all benchmarks for the extra-large configuration.

To illustrate this robustness property, we present performance S-curves for the extra-large configuration for the LWT, Store Sets, Store Vectors and Perfect predictors in Figure 12. The plots show the relative speedup over Blind prediction observed for the various benchmarks, sorted from the lowest speedup (or greatest slowdown) to the highest speedup.

From these curves we can see that the LWT predictor suffers significant performance slowdowns (greater than 1%) for 25 of the 94 benchmarks simulated, with a highest observed speedup of 37%. Store Sets does not see any major performance losses but stays flat for a considerable number of benchmarks, with a highest observed speedup of approximately 37%.

Among all of the simulated memory dependence predictors, Store Vectors follows the trend of the perfect predictor most closely.

There are a few applications where LWT and Store Sets perform slightly better than

the Store Vectors algorithm. In these cases, there were more memory ordering violations

overall in the blind speculation case. While the LWT is conservative in its general predictions, the Store Sets set-merging heuristic sometimes causes Store Sets to also make more conservative predictions, resulting in slightly better performance than Store Vectors for these applications. For a small number of other applications, both Store Sets and Store

Vectors are too aggressive and actually cause performance to be slightly worse than either the baseline case or the LWT. There are also a few cases where the LWT, Store Sets or

Store Vectors outperform perfect dependence prediction due to the prefetching and early branch misprediction detection effects mentioned earlier.

Overall, these results indicate that for current processors, using a simple memory dependence predictor like the LWT should be sufficient, as evidenced by the simple PC-based dependence predictor used in the Intel Core 2 microarchitecture [19]. While some market segments are moving to a larger number of smaller, simpler cores, other mainstream product lines continue to invest more resources to expose more instruction-level parallelism

(e.g., Intel’s “Nehalem” core [33]). Even in a heavily multi-cored world, some significant resources will still be required to attack the Amdahl’s Law sequential bottleneck [31]. For these larger instruction window processors (or cores), our results lead us to conclude that

more sophisticated memory dependence prediction algorithms will be needed.

3.5 Optimizations

In this section we study the performance impact of limiting the hardware budget for the

Store Vector predictor tables, and then explore several optimizations to further reduce the hardware requirements without severely impacting overall performance. Since the perfor-

mance trends of the large configuration mirror those of the extra-large configuration, all the

results in this section are presented only for the medium and the extra-large configura-

tion.

3.5.1 Sensitivity to Hardware Budget

Figure 13: Speedup over blind prediction for the memory dependence-sensitive benchmarks of proposed memory dependence predictors across different hardware budgets.

Figure 13 shows the relative performance for all memory dependence-sensitive applications across a range of hardware budgets. The results show that even for a small hardware budget,

sophisticated predictors like Store Vectors provide a significant performance benefit. In all

of the proposed predictors, increasing the hardware budget beyond 4 KB does not buy much

additional performance. This is intuitive as the predictors hold dependence information for

the loads. While there are many loads in a program, only a few of them collide with earlier stores, and hence only the information regarding these loads affects the performance of the application. If we decrease the hardware budget below 1 KB, however, we observe that performance drops off. This is primarily a result of aliasing in the Store Vector Table; for some applications with large working sets, this aliasing hurts performance. We now describe how we can reduce the hardware budget below 1 KB for the Store Vectors predictor without significant loss of performance.

3.5.2 Reduction of Store Vector Length

As described, the Store Vector length is tied to the number of entries in the store queue.

While we found that this configuration performed well, it is possible to reduce the Store

Vector length to decrease the hardware cost. For example, reducing the Store Vector length to only track the 16 most recent stores (as opposed to 32 as assumed in Section 3.3) would reduce the space requirements of the Store Vector Table (SVT) by one half. However, the sizes of the remaining hardware structures are still tied to the store queue size. The Load

Scheduling Matrix still requires one column per store, and therefore the barrel shifter must also produce a bit vector matched to the number of store queue entries.

Figure 14: The relative performance impact of reducing the length of the Store Vectors (dependence-sensitive applications only). Zero represents the performance with blind speculation, and 1.0 is the performance of the baseline Store Vector predictor.

For our processor configuration, the benefit of Store Vectors drops off relatively quickly with shorter Store Vector lengths. Figure 14 shows the relative performance for different

Store Vector sizes, where zero is the performance of blind speculation and 1.0 is the performance of the baseline Store Vector technique. From these results, we conclude that the less recent stores have a greater impact on load ordering violations. This intuitively makes sense as the compiler should choose to spill registers that will not be soon reused. If a data dependency exists between two nearby instructions, it is likely that the value will be kept in a register as opposed to being pushed out to memory. Even for subroutine calls, most functions have only a few arguments, which can be passed through the registers. Our evaluation was conducted on the Alpha ISA, which has a relatively large number of registers; the results and optimal memory dependence predictor configuration are likely to be different for an ISA like x86 where spills/fills and memory traffic in general are much more frequent.

42 Figure 15: Different Store Tracking Strategies. The second bar indicates the MRCST strategy and the third bar depicts the performance of the store window method.

3.5.3 Most Recently Conflicting Store

While Figure 14 indicates that the performance of Store Vectors drops when fewer stores are

tracked in a conventional Store Vector technique, if we model a different style of tracking

stores we may still be able to reduce the Store Vector length and not incur a great loss in performance. Our motivation in searching for different approaches to holding store information for each load comes from the observation that each Store Vector generally only has a few 1's. This indicates that only a few stores ever collide with the load, and hence it may not be necessary to allocate storage for all potential stores in the Store Vector. The trick lies in being able to keep only relevant stores in the vector. We tried two such techniques. In the first, only the most recent conflicting store (MRCST) is kept in the SVT. The MRCST is the store that caused a memory ordering violation with the load the last time this load was issued out of order. This could potentially reduce the hardware cost of the Store Vector technique tremendously since each entry in the SVT now

holds just one ⌈log2 n⌉-bit store age-index in place of an n-bit vector. The reasoning behind this strategy is that since store-load dependencies generally follow a recurring pattern, just storing information about the most recent dependence should suffice. Another recently proposed distance-based memory dependence predictor exploits a similar observation and makes a load depend only on its most recent conflicting store [67]. The MRCST technique needs the index of the store to be kept in the SVT so that the correct store cell can be set in

the matrix. The second bars in Figure 15 correspond to the MRCST technique in the medium and extra-large machines, respectively. The results are encouraging because even though the performance is lower than that of the conventional Store Vector technique, it still outperforms

Store Sets (which shows 8% and 9% speedup over Blind for sensitive applications in the medium and extra-large configurations, respectively) at less than a quarter of its hardware budget. MRCST provides a very scalable SVT since the hardware required for it is much less than for the conventional SVT. For the purpose of this experiment, to model MRCST we used an SVT with 512 entries, each holding a 5-bit store age-index, giving a total SVT budget of 0.3125 KB. Thus, for a negligible loss in performance, we get an 84% reduction in hardware relative to the original 2 KB SVT evaluated in Section 3.4. The implementation complexity of this technique is also much lower. We can replace the barrel shifters needed to align the vector in the matrix (as explained in Section 3.3.1) with a narrow-width adder and decoder.
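As an illustration of this simplification (a hypothetical sketch, not the actual circuit; the function name, the parameter widths, and the 32-entry store queue are our assumptions), the stored age-index can be turned into the one-hot matrix row that the barrel shifter would otherwise produce using only a narrow modular add and a decode:

    #include <cstdint>

    // Expand a 5-bit MRCST age-index (distance back from the youngest store at
    // prediction time) into a one-hot scheduling-matrix row for a 32-entry
    // store queue: a narrow adder computes the absolute store-queue slot and a
    // decoder sets the single corresponding bit.
    uint32_t mrcst_row(unsigned store_age, unsigned stq_tail, unsigned stq_size = 32) {
        unsigned slot = (stq_tail + stq_size - store_age) % stq_size; // narrow adder
        return 1u << slot;                                            // decoder
    }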

3.5.4 Store Window

Extending this idea, our next strategy stores just a window of k stores, where k is statically assigned as before but the start of the window can be chosen dynamically. If, for a particular load, the stores that generally cause problems are its 4th through 8th most recent stores, then storing only these in the SVT may be enough. Each store window was chosen to start at the store that corresponded to the load's

MRCST as described previously. Thus each vector now holds a store age-index indicating the MRCST and k bits after it that carry the dependence information. In this experiment we keep track of 4 stores, so each vector holds a 5-bit age-index and 4 bits of dependence information. In both plots of Figure 15, the 'Store window' bar corresponds to this technique simulated in the medium and extra-large machines, respectively. In this method too, for a negligible loss in performance we obtain nearly a 70% reduction in hardware.
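One possible packing of such an entry is sketched below (the struct layout, the direction in which the window extends from the MRCST, and all names are assumptions of ours, not the evaluated design):

    #include <cstdint>

    // Hypothetical packed store-window SVT entry: a 5-bit age-index locating
    // the MRCST plus k = 4 dependence bits for the stores in the window.
    struct StoreWindowEntry {
        uint8_t mrcst_age : 5;  // distance of the most recently conflicting store
        uint8_t window    : 4;  // dependence bits for the k stores in the window
    };

    // Expand the window into a full matrix row for a 32-entry store queue.
    uint32_t window_row(StoreWindowEntry e, unsigned stq_tail) {
        uint32_t row = 0;
        for (unsigned i = 0; i < 4; ++i) {
            if (e.window & (1u << i)) {
                unsigned slot = (stq_tail + 32 - e.mrcst_age + i) % 32;
                row |= 1u << slot;
            }
        }
        return row;
    }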

The store window technique, however, does not provide much performance improvement over the MRCST as indicated in Figure 15.

Figure 16: Performance of tagged SVTs of different sizes compared with the conventional untagged SVT. The first bar represents a 2 KB untagged SVT. The second and third bars represent a 0.625 KB tagged SVT and a 0.3125 KB tagged SVT, respectively, while the fourth bar corresponds to a 0.1563 KB tagged SVT storing only the MRCST.

3.5.5 Tagged SVT

After considering how we could best reduce the Store Vector length, we then moved on to determining whether some of the vectors could be eliminated altogether. On further examination it was found that not only were a lot of bits in some vectors zero, but many vectors only contained zeroes. This is not very difficult to comprehend as the SVT stores

information for each and every static load encountered in the program. Many loads just do

not have dependencies with earlier unresolved stores. Knowing that a vector is completely

zero is important in order to ensure that such loads are allowed to issue as soon as their addresses are ready, but storing these zero vectors is wasteful. With just a slight change in the structure of the SVT, these vectors can be eliminated while still keeping track of the respective loads. We augment the SVT with tags and only store non-zero vectors. A load that misses in the table is known to have a completely zero vector and can speculatively issue. This strategy of eliminating zero vectors greatly reduces the hardware

budget while incurring practically no loss in performance. The reason for this is quite

simple. All of the necessary information that enables Store Vectors to be an accurate

memory dependence predictor is contained only in the non-zero vectors and hence storing

only these vectors had minimal impact on the performance. For this simulation we assumed

a partial tag of 8 bits and a conventional Store Vector of 32 bits to give a total vector size

of 5 bytes. In Figure 16, the first bar shows the performance of Store Vectors when using an untagged SVT (2 KB), the second bar shows a tagged SVT with 128 entries (0.625 KB), and the third bar shows a tagged SVT with 64 entries (0.3125 KB). With the 128-entry tagged SVT we get a 68% reduction in hardware, and with the 64-entry tagged SVT we get an 84% reduction in hardware. In fact, the 128-entry tagged SVT performs as well as a 2 KB Store Sets predictor or a 2 KB load wait table predictor (results shown earlier in Figure 9 and Figure 11). These results are consistent with our observation that most vectors

are zero vectors and hence eliminating them does not significantly affect the performance.

Finally, we present the performance of the most optimized version of the Store Vectors algorithm, which uses a 256-entry tagged SVT that stores only the most recently conflicting store (fourth bar). This configuration provides performance comparable to the original Store Vectors algorithm with a hardware budget of only 0.1563 KB. Additionally, this configuration provides a 92% savings compared to the LWT and the Store Sets predictor, with a performance difference of over 5% and 3%, respectively. Even though this design stores the least amount of information compared to the other designs, it stores the most

pertinent information.
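The intended lookup behavior of the tagged SVT can be summarized by the following software sketch (an illustration only; the 64-entry size and field widths follow the configurations discussed above, but the structure and names are our assumptions, not the hardware design):

    #include <cstdint>
    #include <vector>

    // Tagged SVT entry: only non-zero Store Vectors are stored, so a miss
    // means the load has no predicted store dependencies.
    struct TaggedSvtEntry {
        bool     valid = false;
        uint8_t  tag   = 0;   // 8-bit partial tag taken from the load PC
        uint32_t vec   = 0;   // non-zero Store Vector (32 stores tracked)
    };

    // Direct-mapped lookup for a 64-entry tagged SVT.
    uint32_t svt_lookup(const std::vector<TaggedSvtEntry>& svt, uint64_t load_pc) {
        uint64_t word = load_pc >> 2;               // drop the instruction byte offset
        unsigned idx  = (unsigned)(word & 63);      // 6 index bits -> 64 entries
        uint8_t  tag  = (uint8_t)(word >> 6);       // next 8 PC bits form the partial tag
        const TaggedSvtEntry& e = svt[idx];
        if (e.valid && e.tag == tag) return e.vec;  // hit: wait on the indicated stores
        return 0;                                   // miss: zero vector, issue speculatively
    }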

3.5.6 Effect of Incorporating Control Flow Information

In this experiment we use branch history information along with the load PC to index the

SVT in the Store Vectors algorithm. In theory, Store Sets should have some more tolerance

to memory dependence predictions that vary depending on the control path leading up to a load because the stores are all explicitly tracked by their PCs. With Store Vectors, a load might have a conflict with the third previous store when the program traverses one path, and the load might conflict with the fourth oldest store on another path. In this scenario, the Store Vector algorithm will mark bits three and four as conflicts, which may in turn cause the load to be overly conservative. We observe negligible improvement in performance, however, by the addition of control flow information. There are a few reasons why this effect does not have a great impact on performance. First, the load is only unnecessarily delayed if the non-conflicting store address resolves after the real store dependency. Second, even if

delayed, the load only impacts performance if it is on the critical path. Third, aggressive compiler optimization may reduce the effects of control flow on dependence prediction. A load that is aggressively hoisted and duplicated onto both paths leading up to a control-flow join effectively has two different PCs, which allows the predictor to identify the different control-flow cases. If programs other than SPEC exhibit a high degree of memory dependences within very branchy code, the Store Vector Table can be modified in a gshare fashion to make use of branch or path history information [48]. A memory dependence predictor that incorporates branch history information was also separately implemented in [68].
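If such a modification were ever needed, the index computation could take a gshare-like form, for example (a hypothetical sketch; the function name and bit selection are ours):

    // Hypothetical gshare-style SVT index: XOR the global branch history into
    // the PC-based index so the same static load can map to different entries
    // along different control-flow paths.
    unsigned svt_index_gshare(unsigned long load_pc, unsigned branch_history,
                              unsigned num_entries) {
        unsigned pc_bits = (unsigned)(load_pc >> 2) & (num_entries - 1);
        unsigned bh_bits = branch_history & (num_entries - 1);
        return pc_bits ^ bh_bits;
    }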

3.6 Why Do Store Vectors Work?

Store Vectors utilizes the predictability in the relative distances between colliding loads and stores. Hence, even if the control flow of the program differs, the relative position of a store that a load depends on is predictable. Prior work by Yoaz et al. also made use of distance-based information for determining which loads could be speculatively disambiguated [87]. The authors proposed a collision history predictor in which each entry stores a minimum allowable distance that the load can be advanced, based on past observed distances between conflicting loads and stores. However, the mechanism of distance collection and load scheduling is quite different from the Store Vector Table and Load Scheduling Matrix used by the Store Vector algorithm. In this section we specifically explain the advantages of Store Vectors over Store Sets and the LWT. In particular, we discuss scenarios where the Store Sets algorithm results in overly conservative predictions or unnecessary delays. The first reason why Store Sets can delay a load from issuing actually stems from delaying the issuing of stores. The desired behavior of a load with a predicted set of store dependencies is that the load waits for all of the dependencies to be resolved. Figure 17(a) shows a load instruction with four predicted store input dependencies. Ideally, the stores should be able to execute independently since memory write-after-write false dependencies are already properly handled by the store queue. To prevent loads and stores from having multiple direct input dependencies (which would require more CAMs), the Store Sets algorithm serializes execution of stores within the same Store Set. This is illustrated in Figure 17(b), where the load's execution has now been delayed by three extra cycles. With the Store Vector approach, individual stores have no knowledge about dependency relationships with other stores; in fact, a store does not even explicitly know if any loads are dependent on it. Stores may issue in any order, and they just obliviously clear their respective columns in the LSM.

Figure 17: (a) Ideal load synchronization against predicted store dependencies, and (b) actual serialization with Store Sets.

The Store Sets merging update rules allow two or more different loads to be dependent on the same store. Consider the loads in Figure 18(a), where both loads have several predicted store dependencies, and both loads are predicted to be dependent on Store-χ. Without

Store Sets merging, Store-χ can only belong to the Store Set of Load-1, for example. In this situation, Load-2 will not wait for Store-χ, which results in ordering violations. With Store Sets merging, all of the stores associated with both loads will be merged into the same Store Set. Now Load-1 and Load-2 will serialize behind Store-χ. This prevents the ordering violation. Unfortunately, it can also introduce substantial additional store dependencies. Load-1 must wait for all stores in Load-2's Store Set, and vice versa. If all of Stores A through

F, and Store-χ are simultaneously present in the store queue, then Loads 1 and 2 will be considerably delayed as illustrated in Figure 18(b). With the Store Vector approach, each

load may have its own Store Vector that is capable of tracking dependencies independent of all other loads (modulo aliasing effects in the SVT). Figure 18(c) shows the corresponding Store Vectors for Loads 1 and 2, which do not result in the spurious serializations induced by Store Sets.

Figure 18: (a) Ideal load synchronization for two loads against their predicted store dependencies, (b) actual serialization of merged Store Sets, and (c) the Store Vector values which avoid unnecessary serialization.

The load wait table, though simple in design, does not match the performance of an oracle predictor due to its conservative nature. As described earlier, the LWT does not store identifying information about the colliding load-store pair but only marks suspected loads. This marking then prevents the load from issuing until all previous stores have resolved their addresses. Thus the LWT does not take into account the various dynamic instances of store-load pairs and makes predictions that tend to be too conservative. Much like the untagged Store Vector Table, the load wait table, too, has many entries that are zero and do not hold any valuable information.

3.7 Implementation Complexity

Having described the various implementations of the predictors as well as presented the

performance of these predictors in different processor configurations, we now compare the implementation complexities of these predictors. The basic units that handle memory in-

structions are the load queue and the store queue. These structures contain CAM logic to

provide memory dependence checks and store forwarding. We now discuss how the load and store queues must be modified to support the different prediction algorithms along with any additional hardware structures/overhead that is required.

The Load Wait Table is the simplest dependence predictor that we evaluated, and as such requires the fewest modifications to the existing memory scheduling structures. In a processor without speculative load-store reordering, loads must wait until all previous

(older) stores have resolved their addresses. A load's "readiness" is now effectively gated by this all-earlier-stores-resolved signal. To implement the LWT predictor, we simply compute the logical OR of this signal with the LWT's prediction (assuming a prediction of 1 indicates that the load is predicted to have no dependencies; an extra inverter takes care of the

case of the reversed prediction encoding). Apart from the predictor table itself, which can

be accessed early in the pipeline off the critical path, there are no other changes to the

memory scheduling structures. While very simple to implement, our earlier performance

results showed that the performance benefits of the LWT approach are limited when we consider larger processor configurations.
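In logic terms, the modification described above amounts to a single OR gate on the load's readiness condition, roughly as follows (a sketch with signal names of our own choosing):

    // Hypothetical readiness gating for the Load Wait Table: a load may issue
    // once its address is ready and either all earlier store addresses have
    // resolved or the LWT predicts that it has no memory dependence.
    bool load_may_issue(bool addr_ready, bool all_earlier_stores_resolved,
                        bool lwt_predicts_independent) {
        return addr_ready && (all_earlier_stores_resolved || lwt_predicts_independent);
    }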

Store Sets can be implemented using either a CAM or RAM-based organization as explained in Section 3.2.2. Assuming a CAM implementation, the addition of a second set of CAMs in both the load and store queues represents a very large area and power overhead that impacts the scalability of the already difficult to scale load and store queues. These

CAMs need as many ports as the maximum per-cycle store issue rate. Even for current load and store queue sizes, implementing these additional CAMs will not likely be able to meet clock frequency targets or would otherwise require very deep and complex pipelined accesses. These CAMs, however, are not as bad as traditional CAMs designed for load and store queues since they require smaller comparator widths and do not require age support.

When using a RAM-based implementation, the maximum number of loads per Store Set either makes the load queue difficult to scale (too many write ports, even though they are direct/RAM accessed) or requires some overflow handling. A third alternate implementation of Store Sets can use a matrix-based memory dependence scheduler similar to the one used in the Store Vectors algorithm. This matrix would track the dependencies between the instructions in a Store Set. The matrix, however, would have to be considerably larger: since stores need to wake up other stores belonging to the same Store Set, it requires one row per store in addition to one row per load and one column per store.

As previously discussed in Section 3.2, branch misprediction recovery introduces further complexities for cleaning up broken store-to-load pointers/indexes. The recurrence induced by intra-group Store Set dependencies also makes the implementation of the LFST difficult for a wide-issue processor without placing additional constraints on load/store dispatch/allocation rates. The speculative update of the LFST with wrong-path store instructions also requires additional complexity to repair the LFST's state after a branch mispredict detection. The same types of techniques employed for repairing the register alias table (RAT) could potentially be employed here. RAT/mapper checkpointing is a commonly used technique for rapid recovery of the renaming state [35], but this approach is more costly for the LFST. The RAT typically only contains a number of entries equal to the size of the architected register file, whereas the LFST contains 256 entries, thereby requiring more storage per checkpoint. If the processor employs an "undo-list" approach for pipeline recovery where the recovery logic walks the ROB and incrementally undoes the

RAT changes, then we can piggy-back on this logic to also repair the LFST entries at the same time.

In either case, the recovery needed to support Store Sets adds a level of implementation complexity that Store Vectors does not require.

The Store Vectors algorithm needs a scheduling matrix with a number of rows equal to the load queue size and a number of columns equal to the store queue size. This matrix only needs one wire per column, which is connected to the corresponding store queue entry for clearing the column when the store issues. Each row of the matrix needs a wired-OR

or wired-AND circuit (depending on whether the logic is implemented as active-low or active-high) that indicates when the load is ready to issue. The Store Vector Table (SVT)

is a simple SRAM with the same port requirements as the basic LWT predictor. The SVT

does not suffer from the dependency recurrence of Store Sets’ LFST. The last significant

component is a barrel shifter for rotating the vector to align it properly in the matrix. As the store queue size increases, the width of the shifter must also increase proportionately, which could make the shifter a critical timing path. Fortunately, this shifting operation can be very easily pipelined over several cycles. The SVT lookup and subsequent barrel shift

can be started as soon as the decode stage has identified the instruction as a load (this may also require that the decode stage maintains a speculative STQ tail pointer, but this is also trivial to implement). For these reasons, the shifter does not impose any timing problems, although the area cost may be moderate since we need to implement one shifter per load

(up to however many loads may be decoded per cycle). The shifter requirements may be further reduced by using a tagged SVT and then allowing only one load with predicted store dependencies to proceed to the shifter per cycle. This works well since the majority of loads are predicted to not have a dependency, which does not place much pressure on the shifter.
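A small software analogue of this scheduling logic is sketched below (an illustration under the 32-entry store queue and a 64-entry load queue assumed here; the class and method names are ours, and the loop over rows stands in for the single per-column wire of the hardware):

    #include <cstdint>
    #include <cstddef>

    // Software analogue of the Load Scheduling Matrix: one bit-vector row per
    // load-queue entry, one column per store-queue entry. A row is installed
    // after the SVT lookup and barrel shift; an issuing store clears its
    // column everywhere; a load is ready when its row is all zero (the
    // hardware wired-OR/AND).
    struct LoadSchedulingMatrix {
        static const size_t kLoads = 64;
        uint32_t row[kLoads] = {};   // 32 store-queue columns per row

        void install(size_t lq_idx, uint32_t shifted_vec) { row[lq_idx] = shifted_vec; }

        void store_issued(unsigned stq_idx) {           // one "wire" per column
            uint32_t mask = ~(1u << stq_idx);
            for (size_t i = 0; i < kLoads; ++i) row[i] &= mask;
        }

        bool load_ready(size_t lq_idx) const { return row[lq_idx] == 0; }
    };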

3.8 Conclusions

In this chapter, we study the predictability in memory dependences and exploit this common-

case behavior to design a low-cost, scalable and high-performing memory dependence pre-

dictor which helps to improve the overall efficiency of the processor. Out-of-order load exe-

cution is necessary to realize the potential of large-window superscalar processors. Blindly

allowing loads to execute, however, will result in ordering violations which may cancel out

the benefits of supporting a large number of in-flight instructions. Our research has provided

the design of a new memory dependence prediction and scheduling algorithm based on de- pendency vectors and scheduling matrices. An important result in this study is that by

optimizing our base store tracking strategies (MRCST and Store Window) and predictor

management (tagged design), we are able to obtain nearly 90% hardware savings as com-

pared to the original Store Vectors. Thus, depending on the given hardware requirements,

Store Vectors can be easily modified and still perform very well. Our results also show that,

while Store Vectors performs substantially better than previously proposed predictors, a gap remains between its performance and that of an ideal predictor, which suggests that

further innovation may help to extract even more performance.

CHAPTER IV

PEEP: APPLYING PREDICTABILITY OF MEMORY DEPENDENCES IN SIMULTANEOUS MULTI-THREADED PROCESSORS

4.1 Introduction

The previous chapter demonstrated that, in the common-case, the memory dependencies

of load instructions can be accurately predicted. In this chapter, we focus on efficient

resource-utilization with a view to maximizing the throughput of a simultaneous multi-

threaded (SMT) processor with the help of memory dependence prediction [80]. We propose

a technique that makes use of the predictability in memory dependences to anticipate stalls that are likely to occur due to these dependences, and then applies this information

to optimize the allocation of resources in the processor. Our resource allocation strategy

specifically optimizes the fetch engine of the processor which is one of the most critical units

in modern processors. We present results that show that when the fetch engine of a modern

SMT processor is augmented with our PEEP technique, resource allocation becomes more

efficient. This improves the processor throughput and performance which helps to improve

overall efficiency.

While modern superscalars have been able to exploit instruction level parallelism (ILP)

to improve single-threaded performance, there remain many resources in the processor that are under-utilized. One example of poor resource utilization is that a modern processor may

provide functional units to execute 4-6 operations per cycle, but the sustained execution of

most programs is only one to two instructions per cycle (IPC) at best. When an application is stalled on a cache miss, for example, an out-of-order processor may not be able to find

enough independent instructions to keep the functional units busy. During times of low

IPC rates, other threads may have useful work that could be done.

To motivate our research design, we first study the efficiency of resource allocation in current SMT processors. SMT machines are designed to intermingle threads so as to maximize

resource usage and improve overall throughput. As explained above, however, a thread that

encounters a stalling condition (e.g., a cache miss with all other independent instructions already executed) can potentially tie up many of the shared resources for the entire latency of

ready executed) can potentially tie up many of the shared resources for the entire latency of

the stall. This effectively reduces the number of critical resources available to the non-stalled

threads, and ends up reducing overall throughput. Past work has examined the impact of

branch mispredictions and cache misses on SMT performance [37, 20, 82, 41, 16]. In this

work, we study the interaction of memory dependence prediction with an SMT processor.

From these observations, we propose a new class of SMT fetch policy filters, “PEEP”, that

exploit predictable memory dependences to avoid allocating resources to threads that are

likely to stall on these dependences. We also use the predictability of dependence resolution

latencies to mitigate thread starvation effects after dependency resolution.

The rest of the chapter is organized as follows: We give a brief background of the inef-

ficiencies that exist in current fetch policies for an SMT processor in Section 4.2. We then

present the design of PEEP, which augments traditional fetch policies with information

about predictable memory dependences and their associated latencies in Section 4.3. We

present detailed evaluation results that show the performance benefit as well as fairness

impact of our proposed design in Section 4.4. We also use our technique in an alternative

design that does not speculatively execute loads before unresolved stores to demonstrate

that our proposed technique works because of effective fetch management rather than some

second-order interactions between speculative load execution and the SMT microarchitec-

ture. The design and performance evaluation of this approach is presented in Section 4.5.

We present some additional sensitivity and scalability studies of our technique in Section 4.6

as well as present an industrial evaluation of our technique in Section 4.7. Finally we briefly

summarize our work in Section 4.8.

4.2 Background

In this section, we first briefly review some commonly proposed SMT fetch policies and explain some of the program behaviors that can lead to performance degradations. We then discuss speculative execution of load instructions in the presence of earlier unresolved store addresses (speculative memory disambiguation) and its interaction with the SMT microarchitecture.

4.2.1 SMT Fetch Policies and Resource Stalls

The instruction fetch unit of an SMT processor has a huge influence on the overall throughput of the processor as well as on the fairness of resource sharing among the different threads.

On each cycle in which the processor front-end is not stalled, the fetch unit chooses one of the n threads to fetch instructions from. The decision process or algorithm for choosing which thread to fetch from is called the fetch policy. Common policies include Round-Robin and ICOUNT [83]. There are also some other policies, which are described below.

While Round-Robin provides each thread with equal time on the fetch unit, it can still lead to significant reductions in overall throughput. Figure 19 illustrates a simplified view of an SMT processor's reservation station entries shared between two threads. The processor fetches from thread T0 on even cycles and from T1 on odd cycles. Assume that thread T0 suffers a long-latency stall (such as a last-level cache miss) and does not have any instructions that are independent of the stalled instruction; thread T0 will not be able to execute any more instructions until the stall has been resolved. T1 executes all four of its instructions on alternate cycles. In the meantime, T0 continues to occupy execution resources (represented by the white reservation station entries in Figure 19), and, even worse, the round-robin fetch policy continues to stuff more of T0's instructions into the out-of-order core. Eventually, the entire set of reservation stations may be occupied by

T0’s instructions thus stalling the entire pipeline until T0’s original stall has resolved. In one of the early SMT studies, Tullsen et al. studied a variety of SMT fetch policies to exploit the fetch unit’s choice of threads [83]. Tullsen et al. considered altering thread fetch priorities based on the per-thread counts of unresolved branches (BRCOUNT), data cache

56 Cycle 0 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 10

RS 4 4 RS RS 4 4 RS RS 4 RS T1 T1 0 T1 4 T0 0 4 T0 4 0 T0 0 0 0 0 T0

Figure 19: Shared reservation station entries in an SMT processor can be completely occupied by a stalled thread when using simple fetch policies. misses (MISSCOUNT), and overall instruction counts (ICOUNT). The ICOUNT policy was shown to provide the best overall throughput out of the policies evaluated, and has become a de facto standard in academic SMT studies. The fetch unit tracks the number of instructions currently in the machine for each thread and fetches from the thread with the lowest number of instructions. As stated in the original study, ICOUNT provides three major benefits:

it prevents any single thread from filling up the reservation stations, provides additional

priority for threads that move instructions through the processor quickly, and generally

provides a more balanced ILP mix between the threads to better utilize the processor

resources. The ICOUNT policy works well when the processor does not have too many low-

ILP/frequently-stalling threads. Since ICOUNT does not have any explicit knowledge of

stalls, any completely stalled thread will eventually consume 1/n-th of the reservation stations. If a given application mix contains more than a handful of such threads, then a substantial number of processor resources will be tied up, thus reducing the overall throughput of the

SMT processor.
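A minimal sketch of the ICOUNT selection step is shown below (the per-thread in-flight instruction counts are assumed to be maintained by the allocation and commit logic; function and variable names are ours):

    #include <vector>
    #include <cstddef>

    // ICOUNT thread selection: each fetch cycle, pick the fetchable thread with
    // the fewest instructions currently in flight. Ties go to the lower thread
    // id in this simplified sketch.
    int icount_select(const std::vector<unsigned>& inflight_count,
                      const std::vector<bool>& fetchable) {
        int best = -1;
        for (size_t t = 0; t < inflight_count.size(); ++t) {
            if (!fetchable[t]) continue;
            if (best < 0 || inflight_count[t] < inflight_count[(size_t)best])
                best = (int)t;
        }
        return best;   // -1 if no thread can fetch this cycle
    }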

One approach to addressing this attribute of the ICOUNT policy is to anticipate the stalls. Luo et al. studied the use of branch confidence prediction [34] to control the amount of control speculation per thread [37]. The intuition is that after fetching past a few low-confidence branches, the probability of actually fetching any useful instructions rapidly tends toward zero. El-Moursy et al. also considered the prediction of data cache misses to indicate which threads will likely experience more stalls [20]. Neither branch confidence prediction nor load miss prediction is easy, which leaves a fetch policy based on either of these predictors vulnerable to mispredictions. The load-miss prediction policy augments the count of predicted load misses with a count of actual load misses to help offset the

difficulties in load-miss prediction. Unfortunately, when the processor discovers that a load has missed in the cache, the fetch unit may have already fetched many instructions

dependent on this load miss. A hybrid load-miss predictor was also proposed in [87]. This predictor uses a majority vote to choose between predictions made by a basic predictor

similar to El-Moursy’s, a gshare predictor and a gskewed predictor.

There have been several other similar cache-miss related SMT fetch policies such as

STALL and FLUSH [82], DC-PRED [41], and DCache-Warn [16]. Another recently proposed policy makes use of the available memory-level parallelism (MLP) in a thread [21]. The key idea is that a thread that overlaps multiple independent long-latency loads has a high level of

MLP, and therefore this thread should be allowed to allocate as many resources as needed in order to fully expose this available MLP. In the event of an isolated load miss, however, this thread will be stalled from fetching further instructions and the remaining processor resources will be available to the other threads.

4.2.2 Memory Dependences and SMT

In the preceding sub-section, we described a few previously proposed techniques for predicting impending stalls before they happen. The prior work has focused primarily on branch and data-cache related stalls. Instructions in modern superscalar processors often experience a different kind of stall due to memory dependences. A load may have its address computed and be ready to issue to the data cache unit, but if there exist earlier store instructions whose addresses have not been computed, the processor will not know whether it is safe to allow the load to access the cache.

Some processors, such as the Alpha 21264 [35] and the Intel Core 2 [19], allow load instructions to speculatively execute before all previous store addresses have been computed.

A memory dependence predictor decides whether a load can execute as soon as its address is known, or whether it should wait until certain or all previous store addresses have been computed [50, 17, 79]. Accurate predictions allow loads to execute earlier, but mispredictions result in costly pipeline flushes. Prior work on memory dependence prediction has demonstrated that the relationships between loads and stores are highly predictable,

and the predictability of these patterns has been exploited in other memory scheduling

research [67, 68, 78, 15].

To the best of our knowledge, no prior work has explored the interactions between spec-

ulative memory disambiguation and simultaneous multithreading. In this sub-section, we

briefly present the results of using memory dependence prediction in an SMT processor. The

full details of our simulation methodology and workloads can be found in Section 4.4, but

in short we model a four-threaded SMT processor with and without support for speculative

memory disambiguation using a variety of multi-programmed workloads. Figure 20 shows

the speedup of an eight-issue SMT processor using Round-Robin (RR) and ICOUNT fetch

policies over an SMT processor with no speculative memory disambiguation. The workloads include different mixes of benchmarks with low, medium or high ILP levels (L/M/H) and programs that are sensitive (S) and not-sensitive (N) to speculative load reordering. For the configurations supporting memory dependence speculation, we implemented a 4K-entry load-wait table (like that used in the Alpha 21264 [35]) and an oracle dependence predictor

(predicts exactly which store a load is waiting for and the load executes immediately after

the store executes) for comparison. We also ran our simulator in single-threaded mode to

measure the impact of memory dependence prediction on single threaded applications in our simulator. We found that for single threaded applications, perfect memory dependence prediction provides 7% performance improvement over no memory dependence prediction.

The results of our four-threaded experiments are presented in Figure 20. The realistic load-wait table (LWT) provides 2.0% and 1.0% improvements in overall throughput for RR and ICOUNT, respectively. With an unachievable oracle predictor, the overall throughput increases by 4.1% and 4.0% over the baseline RR and ICOUNT. These relatively low performance benefits may at first be somewhat surprising given the value of speculative memory disambiguation in traditional single-threaded processors. The addition of such capabilities to the Intel Core 2 microarchitecture is evidence of the value of aggressive load reordering, at least for single-threaded performance [19]. The SMT results are not unreasonable, however, since the intuition is that in a single-threaded processor, a stall due to an unresolved memory dependence can easily prevent a large number of instructions from making progress.

Figure 20: Impact of speculative load disambiguation on an SMT machine. The speedups (throughput improvements) are relative to SMT machines running round-robin and ICOUNT policies with no speculative memory disambiguation. LWT is the load-wait table memory dependence predictor. Perfect prediction uses an oracle predictor.

In an SMT processor, there may be enough thread-level parallelism to largely compensate

for a small number of load instructions being delayed for a few cycles. While speculative

memory disambiguation and memory dependence prediction may not greatly increase the

throughput of an SMT machine, it is important to ensure high single-threaded performance.

One naturally asks the follow-up question: apart from enhancing performance when only a single thread is present, what other benefits can memory dependence prediction provide in SMT processors?

4.3 Adapting SMT Fetch Policies to Exploit Memory Dependences

The results of the previous section demonstrated that speculative memory disambiguation

alone does not provide much performance benefit for SMT processors. In this section we

describe a new technique that takes advantage of the fact that the memory dependences are highly predictable and uses this predictability for making instruction fetch decisions.

4.3.1 Proactive Exclusion

The central observation used here is that if the timings of an instruction's stalls are predictable, then this information can be directly exploited by the SMT fetch unit to better manage the shared processor resources. We say that a fetch policy uses proactive exclusion if it can stop fetching from a thread before the thread's stalling condition has been exposed in the out-of-order execution core. El-Moursy and Albonesi's fetch-gating on predicted load misses is a form of proactive exclusion [20]. In this work, we instead propose to implement proactive exclusion using a much more predictable property of programs: memory dependences.

Our PEmdep (proactive exclusion based on memory dependences) is actually not a fetch policy in its own right, but rather it is an SMT fetch policy filter. That is, we can potentially apply PEmdep to any other standard SMT fetch policy. PEmdep starts with a memory dependence predictor, which could make use of any number of previously proposed algorithms [50, 17, 79]. When the fetch unit attempts to fetch instructions from thread i, it also consults the memory dependence predictor to find out if this thread will likely stall due to ambiguous memory dependences, as shown in Figure 21(a). If the answer from the predictor is "no dependence," then the underlying fetch policy continues operating as it otherwise would without PEmdep. However, if the predictor answers "dependence present," then the fetch unit finishes the current fetch and then removes thread i (T0 in Figure 21) from the list of threads from which it can fetch instructions (b). The fetch unit will then continue fetching, applying its original fetch policy to the current list of fetch candidates.

When the load that was predicted to have a dependence issues, the out-of-order core signals the front-end that this dependence has been resolved and then thread i is reinserted into the list of fetch candidates (c). At this point, a policy like ICOUNT can help thread i catch up with the other threads, thus providing a reasonable amount of fairness.
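A minimal sketch of the PEmdep filter is shown below (the class and method names are ours, and the per-thread exclusion flag is a simplification of the real bookkeeping; the underlying fetch policy is untouched and simply skips excluded threads):

    #include <vector>

    // Sketch of the PEmdep fetch-policy filter. A thread predicted to stall on
    // an ambiguous memory dependence is removed from the fetch-candidate list;
    // the out-of-order core re-inserts it when the offending load's dependence
    // resolves.
    struct PEmdepFilter {
        std::vector<bool> excluded;   // one flag per hardware thread

        explicit PEmdepFilter(unsigned nthreads) : excluded(nthreads, false) {}

        // Called when a decoded load from thread 'tid' has a dependence prediction.
        void on_decoded_load(unsigned tid, bool predicted_dependence) {
            if (predicted_dependence) excluded[tid] = true;
        }

        // Called by the core when the predicted-dependent load finally issues.
        void on_dependence_resolved(unsigned tid) { excluded[tid] = false; }

        // The base fetch policy (ICOUNT, round-robin, ...) consults this.
        bool may_fetch(unsigned tid) const { return !excluded[tid]; }
    };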

By keying fetch decisions on a highly predictable attribute, the fetch unit can easily avoid fetching the instructions that immediately follow loads with dependences. This helps to reduce the number of reservation station entries occupied by stalling operations. Following a load with a predicted memory dependence, there may be some other operations that are not data dependent on the stalled load. However, it is better for overall throughput to fetch from a different thread whose instructions are guaranteed to be independent of the stalled load rather than fetch instructions that only might be independent of the stalled load.

Figure 21: Demonstration of Proactive Exclusion. (a) When fetching a load, a predicted memory dependence causes the thread T0 (b) to be excluded from the list of fetch candidates. (c) Only when the dependence has been resolved does the thread's fetching resume.

In the above description, we specified PEmdep as checking the memory dependence predictor during fetch. Performing lookups at instruction fetch, however, would require making multiple parallel lookups from the predictor (although techniques used for making multiple branch predictions can help [66]), and the additional aliasing combined with the lack of decode information can lead to predicting that a memory dependence exists when the instruction itself might not even be a load (this is analogous to the phantom taken branch problem encountered when predicting multiple branches per cycle). The decode stage would need to provide an additional dependence-clear signal to notify the front-end of non-loads predicted to have dependences. For implementation purposes (and as modeled in our simulations), we do not actually consult the predictor until the instruction has been decoded and known to be a load. This could lead to a situation where a few additional instructions after the predicted-to-stall load also make it into the pipeline. There are potential negative and positive effects associated with these extra instructions. On the one hand, these in-

structions may tie up additional reservation stations and reduce overall throughput. On the

other hand, these extra instructions provide the thread with a little extra work ready-to-go

for when the load’s dependency has finally been resolved. Without these instructions, the thread may temporarily starve because the delay from dependency resolution to the fetch

unit and back to the execution core can cause that thread to not execute any instructions for several cycles. Note that the number of additional instructions fetched from this thread is usually limited since the fetch unit will choose to fetch from other threads as well.

4.3.2 Leniency and Early Parole

At the end of the discussion of Proactive Exclusion, we discussed the potential problem

of temporary thread starvation due to the delay between dependency resolution and when

instructions from that thread make it back into the execution core. Figure 22(a) illustrates

this scenario with a simplified view of the pipeline where a dependency has been predicted

but is also quickly resolved. While a load may need to stall for ambiguous memory depen-

dences, the time required to resolve this ambiguity may actually be fairly short. Proactively

excluding the thread T0 based on the predicted latency for instruction M hurts performance since the actual resolution latency is less than the resolution-to-fetch-to-issue delay, thereby

starving T0 for cycles i+5 through i+8 in Figure 22(a). An interesting and useful attribute of memory dependences is that not only are their

existences predictable, but their durations are predictable as well. We augment our PEmdep technique with a PC-indexed delay predictor, indicated as ∆P in Figure 22(b). This struc-

ture could be implemented as an additional field in the existing dependence predictor entries.

If the predicted delay is less than a fixed threshold, then we forgo the proactive exclusion of

this thread under the assumption that the stall will not last long enough to make exclusion

worthwhile. This is the concept of Leniency. In this example, the load M has resolved

its memory dependences by the time its dependent instruction N has made it to execute,

thereby avoiding any slowdown in this thread.

We measure the resolution delay as the number of cycles between when a load instruction

is ready (effective address computed) and when it actually issues. The delay predictor is

updated at load commit with the new resolution delay value based on the type of delay

predictor being used. Our conservative delay predictor tracks the observed delay and records

the maximum delay value ever observed. To prevent the table from completely saturating

with only long-latency predictions, we periodically reset the table. We also consider a more aggressive delay predictor that simply predicts whatever the last observed delay was.

Figure 22: Timing diagram illustrating additional delays induced by (a) excluding a thread when its resolution delay is very short, and (b) avoiding its exclusion.

Finally we also model an adaptive delay predictor that predicts the delay of a thread based on the last n observed delays and then uses the average of these delays as the next predicted

delay. The adaptive policy takes into account differing program phases and provides a more accurate prediction of load resolution delays. Knowing the predicted stalls can help the fetch policy to make educated decisions regarding the exclusion of a thread predicted to have a memory dependence. As explained earlier, excluding a thread predicted to have a quickly resolving dependence can hurt performance instead of improving it. The overall goal is for the PEmdep filter to be more lenient to threads with quickly resolving memory dependences.
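The three flavors of delay prediction can be summarized as different update and prediction rules on a per-PC entry, roughly as sketched below (the field widths, the history depth of four, and all names are our assumptions; the real table is indexed by load PC as described above):

    #include <cstdint>
    #include <algorithm>

    // Hypothetical delay-predictor entry, updated at load commit with the
    // observed dependence-resolution delay in cycles.
    struct DelayEntry {
        uint8_t conservative = 0;   // maximum delay seen (periodically reset)
        uint8_t aggressive   = 0;   // most recently observed delay
        uint8_t history[4]   = {};  // last n = 4 delays for the adaptive policy
        uint8_t head         = 0;

        void update(uint8_t observed_delay) {
            conservative  = std::max(conservative, observed_delay);
            aggressive    = observed_delay;
            history[head] = observed_delay;
            head          = (head + 1) % 4;
        }

        uint8_t predict_adaptive() const {   // average of the last n delays
            unsigned sum = 0;
            for (uint8_t d : history) sum += d;
            return (uint8_t)(sum / 4);
        }
    };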

In the case of a load with an actual long-latency resolution, PEmdep (even with Leniency) still suffers from the problem of temporary thread starvation while waiting for new instructions to make their way down the pipeline. Since the dependence resolution delay is predictable, we can even avoid this temporary period of starvation for a thread. To facilitate this, we grant the excluded thread an Early Parole (EP) and start fetching new instructions before the actual resolution of its dependence. With accurate delay predictions, this allows the excluded thread's instructions to arrive in the out-of-order execution core

“just-in-time” as the original memory dependence has been resolved. In this fashion, we can avoid clogging the reservation stations with stalled instructions while simultaneously ensuring the timely delivery of new instructions when needed. If the fetch unit receives notification of load dependence resolution before the delay has elapsed, then it immediately

returns the excluded thread to its fetch candidate list.
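Putting Leniency and Early Parole together, the per-thread decision reduces to a threshold check and a countdown, roughly as follows (a sketch assuming the five-cycle leniency and parole thresholds used later in our evaluation; all names are ours):

    // Sketch of the Leniency / Early Parole state for one thread. A short
    // predicted delay suppresses exclusion entirely (Leniency); otherwise the
    // thread is excluded and a countdown re-admits it theta_EP cycles before
    // the predicted resolution (Early Parole).
    struct ParoleState {
        bool excluded  = false;
        int  countdown = 0;

        void on_predicted_dependence(int predicted_delay,
                                     int leniency_threshold = 5, int theta_ep = 5) {
            if (predicted_delay <= leniency_threshold) return;  // Leniency: keep fetching
            excluded  = true;
            countdown = predicted_delay - theta_ep;             // parole ahead of resolution
        }

        void tick() {                       // called once per fetch cycle
            if (excluded && --countdown <= 0) excluded = false; // Early Parole
        }

        void on_actual_resolution() { excluded = false; }       // resolution arrived first
    };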

Early Parole is easy to implement on top of Leniency. Given

a thread’s dependence delay prediction, the fetch unit sets a counter for that thread, and

decrements the counter on each subsequent cycle. When the counter reaches a threshold

θEP , the fetch unit reinserts the thread into the list of fetch candidates. Note that this

may occur before the thread’s memory dependence has actually been resolved. When the

Leniency threshold is equal to θEP , then Leniency effectively becomes a special case of EP

where the predicted delay is less than θEP, and so the thread would conceptually get excluded and then immediately paroled at the same time. In our experiments we found that performance was best when both techniques are simultaneously employed, and so for the remainder of this discussion, the designation EP implies both Early Parole and Leniency. The next section presents and analyzes the impact of memory dependence-based Proactive Exclusion as well as Early Parole on SMT performance.

Figure 23: Performance and fairness overview of the SMT fetch policies. PEEPfinal represents our proposed fetch policy after the previously described delay prediction optimization.

4.4 Evaluation

Figure 23 shows a summary of the primary experimental speedup over ICOUNT for PEEP.

Compared to the load-miss prediction fetch policy using a simple, single-table load-miss predictor (LMP), a load-miss prediction fetch policy using a hybrid load-miss predictor

(HMLP), and the MLP-based fetch policy (MLP), PEEP provides both greater overall performance/throughput (17%) as well as greater fairness (19%), as measured by the harmonic mean of weighted IPC metric [36].

In this section, after explaining our experimental methodology, we will show in steps how the individual techniques of PE and EP can combine to provide this speedup.

4.4.1 Methodology

For all of our experiments, we used the M-Sim 2.0 simulator [69], which is based on SimpleScalar 3.0d [3]. The simulator models an SMT processor with up to 8 threads.

Table 4: Simulated four-way SMT processor configuration.

Processor Width: 8             Fetch-to-Issue: 5 cycles
Scheduler Size (RS): 128       IL1 Cache: 64KB, 2-way, 2-cycle
LSQ Size: 256                  DL1 Cache: 32KB, 4-way, 2-cycle
ROB Size: 512                  L2 Cache: 512KB, 8-way, 12-cycle
Integer ALU/Mult: 8/4          Main Memory: 200 cycles
FP ALU/Mult: 8/4               ITLB: 16-entry, FA
Store Ports: 2                 DTLB: 32-entry, FA
Load Ports: 4

Table 5: Multi-programmed application mixes used for the four-threaded experiments.

Class          Benchmarks                               Suite
(4 N)-1        art, bzip2, patricia, equake             F/I/E/F
(4 N)-2        ammp, applu, swim, wupwise               F/F/F/F
(4 N)-3        mcf, wupwise, epic, dijkstra             I/F/M/E
(4 N)-4        patricia, lucas, gap, fft                E/F/I/E
(4 S)-1        apsi, eonc, gcc2, perlbmk1               F/I/I/I
(4 S)-2        gcc, yacr, twolf, tiffdither             I/P/I/E
(4 S)-3        gcc2, perlbmk1, eonr, vortex1            I/I/I/I
(4 S)-4        twolf, gcc2, anagram, jpegencode         I/I/P/M
(2 S,2 N)-1    vpr1, yacr, wupwise, mcf                 I/P/F/I
(2 S,2 N)-2    apsi, anagram, patricia, dijkstra        F/P/E/E
(2 S,2 N)-3    bc-fact, bc-primes, swim, ammp           P/P/F/F
(2 S,2 N)-4    gs, parser, fft, galgel                  E/I/E/F
(3 N,1 S)-1    galgel, art, ammp, crafty                F/F/F/I
(3 N,1 S)-2    mcf, lucas, swim, vortex1                I/F/F/I
(3 N,1 S)-3    art, bzip2, lucas, eonk                  F/I/F/I
(3 N,1 S)-4    applu, galgel, mgrid, anagram            F/F/F/P
(3 S,1 N)-1    vortex3, perlbmk3, eonc, ammp            I/I/I/F
(3 S,1 N)-2    eonc, gcc, mesa2, art                    I/I/F/F
(3 S,1 N)-3    fma3d, vortex1, vpr1, ammp               F/I/I/F
(3 S,1 N)-4    crafty, tiffdither, vortex1, patricia    I/E/I/E
(4 L)-1        mcf, equake, art, lucas                  I/F/F/F
(4 L)-2        twolf, vpr2, swim, parser                I/I/F/I
(4 M)-1        applu, ammp, mgrid, galgel               F/F/F/F
(4 M)-2        gcc, bzip2, eonc, apsi                   I/I/I/F
(4 H)-1        facerec, crafty, perlbmk1, gap           F/I/I/I
(4 H)-2        wupwise, gzip3, vortex1, mesa1           F/I/I/M
(2 L,2 H)-1    mcf, equake, mesa2, vortex2              I/F/F/I
(2 L,2 H)-2    parser, swim, crafty, perlbmk3           I/F/I/I
(2 L,2 M)-1    parser, swim, gcc3, bzip2                I/F/I/I
(2 L,2 M)-2    art, lucas, galgel, gcc                  F/F/F/I
(2 M,2 H)-1    gzip3, wupwise, fma3d, apsi              I/F/F/F
(2 M,2 H)-2    vortex3, mesa2, mgrid, eonr              I/F/F/I

Table 4 lists the parameters for our simulated SMT processor. The simulator models sev-

eral microarchitectural features in greater detail than the original SimpleScalar 3.0 model such as explicit register renaming/tracking of physical registers and speculative instruction scheduling with latency prediction. In M-Sim’s SMT implementation, the reorder buffer

(ROB) and load-store queue (LSQ) are hard-partitioned such that with four threads running, each is constrained to only one fourth of the ROB or LSQ sizes given in the table. This is similar to the Pentium 4's Hyperthreading implementation of SMT [45]. We modified the simulator to include speculative memory disambiguation using a memory dependence predictor based on the load-wait table (LWT) employed in the Alpha 21264 [35]. We also

evaluated the final version of our design with an oracle predictor to determine the upper bound on performance and to see whether the load-wait table predictor is too conservative.

We ran four-thread mixes from a variety of benchmark suites. Our benchmarks belong

to the integer (I) and floating point (F) SPEC2000 suites, MediaBench (M), MiBench (E)

and the pointer-intensive applications (P) from Wisconsin [4]. The details of the four-threaded mixes are given in Table 5. For each benchmark, we classify the application as exhibiting either high (H), medium (M), or low (L) ILP, and as memory dependence-sensitive

(S) or not (N). The ILP-based workload mixes and the classification of memory dependence sensitivity are the same as those used in some other previous studies [71, 70, 79].

Per-workload performance numbers are reported as relative improvements over a baseline machine using the ICOUNT fetch policy with speculative memory disambiguation en-

abled. Aggregate performance numbers use the geometric mean, which has the property

that the geometric mean of speedups is always equal to the speedup of the geometric means.

We also quantify the impact of our proposal on overall fairness. To measure fairness, we

considered the harmonic mean of the weighted IPCs for all of the fetch policies. This

metric provides a measure of both performance and fairness and is typically used in multi-

threading research studies to quantify fairness [36]. We also report the standard deviation

of performance degradation experienced by each thread relative to its stand alone IPC (i.e.,

if the thread had the entire processor to itself). If the standard deviation is large, then that

means that some threads experienced larger slowdowns than others, thereby implying that

the situation is unfair. This is a stricter measure of fairness as it does not explicitly account

for overall system throughput.
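For reference, the harmonic mean of weighted IPCs for n threads is computed as

    Hmean = n / ( sum over i of IPC_i(alone) / IPC_i(SMT) ),

where IPC_i(SMT) is thread i's IPC in the multiprogrammed run and IPC_i(alone) is its IPC when it has the processor to itself; the notation here is ours.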

For all configurations, we use a load-wait table dependence predictor with 4096 one-bit

entries [35]. The table is reset every one million cycles to ensure that the predictor does not become too conservative. In configurations employing a delay predictor, we use a 4K-entry

table of 7-bit counters, totaling to 4KB of state overhead when counting both the delay and

dependence predictors.

4.4.2 Proactive Exclusion Only

We first present results for Proactive Exclusion (PE) by itself without the effects of Early

Parole. All speedup results in Figure 24 are relative to the performance of the standard

ICOUNT fetch policy. The load-miss predictor fetch policies have overall speedups of 3% and 6% for LMP and HMLP. Load misses are relatively difficult to predict, and so while

this approach provides some performance benefit, many mispredicted load misses lead to underutilization of the processor resources. The MLP-aware policy is able to expose a higher level of parallelism, which proves to be a better technique than using load-miss prediction,

yielding a speedup of 9% over ICOUNT. The general trends are consistent with previously

reported results. Applying our memory dependence-based proactive exclusion, the PEmdep fetch policy performs quite well when compared to ICOUNT providing a geometric mean

performance benefit of nearly 13% with five mixes showing over 30% performance improve-

ment. Only one mix posts a performance degradation: the (2L-2H)-2 mix suffers from a 4%

performance decrease, although this is avoided when we include the delay predictor. PEmdep provides more robust performance gains, while LMP sometimes exhibits some significant

losses.

4.4.3 Proactive Exclusion with Early Parole (PEEP)

Some of the benefit of reducing resource contention through PE is offset by the additional

delay required to get new instructions from a stalled thread back into the pipeline. The

use of Early Parole (and Leniency) significantly helps the problem of a slow restart after

dependence resolution. Figure 25 shows the results of adding EP on top of PEmdep. We evaluate delay predictors with conservative (largest delay observed), aggressive (most recent

delay observed) and adaptive behaviors. Leniency is invoked if the predicted delay is less

than or equal to five cycles (the length of the front-end pipeline), and Early Parole is granted

five cycles before the predicted delay has elapsed. The conservative delay predictor provides

an overall benefit of 16% over ICOUNT.

Our use of delay prediction only affects the fetch unit, and therefore inaccurate delay

predictions only affect resource contention. As a result, the conservative delay predictor

sometimes induces more stalling than is necessary. Using the aggressive predictor provides

a slight performance advantage with effectively no additional cost or complexity as compared

to the conservative approach. PEEP using an adaptive delay predictor provides an overall

performance gain of 17%. This slight performance improvement, however, is likely not worth the additional hardware cost of tracking multiple delays per load. It is also interesting to note that the one mix that showed a 4% performance degradation without the use of Early Parole now shows a 16% performance improvement with PEEP.

Figure 24: Speedup over ICOUNT of the basic SMT fetch policies as well as the Proactive Exclusion policy based on Memory Dependences, without any delay prediction.

The performance benefits vary among the different workload mixes. For example, when none of the applications is sensitive to memory dependences, as in (4N)-2, there are practically no long-latency load dependence resolutions for PEEP to take advantage of.

In this situation, while we do not gain much performance, PEEP gracefully degrades to the underlying ICOUNT policy. The mix (4L)-1 achieves very large speedups (70%). On further analysis, we observed that there is a huge amount of load port pressure when the

ICOUNT fetch policy is used. PEEP and MLP both reduce memory accesses from threads that are predicted to stall, thereby decreasing port contention. The port contention effect is further amplified by M-Sim’s accurate modeling of latency-misprediction-induced scheduler replays. Any load directly dependent on an earlier mis-scheduled load (as occurs with pointer-chasing behaviors), may replay one or more times, with each attempt consuming more load issue slots. When the number of load ports was increased from four to eight for

the ICOUNT policy, this performance difference was reduced.

Some application mixes warrant a more detailed analysis. The first of these peculiar mixes is (2S,2N)-1. All configurations involving PEEP manage to provide a performance benefit in excess of 30%. The memory-dependence-sensitive benchmarks in this mix, yacr and vpr, have very predictable dependences, and even the simple dependence predictor is able to exploit these. Additionally, one of the applications in this mix, mcf, exhibits considerable levels of MLP, which helps the MLP-aware algorithm provide good performance, too. The mix (3N,1S)-2 is a good example of the potential performance benefit of using memory dependence prediction to guide thread fetch. Vortex has a very high memory dependence sensitivity, whereas the other applications, such as mcf and lucas, are highly insensitive to memory dependences.

This means that the PEEP algorithm is able to accurately fetch from these insensitive threads while vortex is stalled and then switch fetching back when the stall resolves. The delay predictors improve this performance further because even the memory dependence resolution delays in vortex are predictable. We observed very few fetch stalls in the processor when the PEEP algorithm was employed. The MLP-aware algorithm, though, is unable to fully exploit this phenomenon.

To test the upper limit on the performance that PEEP could offer, we evaluated both the ICOUNT fetch policy and PEEP with an oracle memory dependence predictor.

We found that PEEP provides a 19% performance improvement over the ICOUNT fetch policy when an unrealistic oracle predictor is used, which is just 2-3% (depending on the delay predictor) more than when a load-wait table based predictor is used. This indicates that while the load-wait table is a slightly conservative predictor, it is probably sufficient in an SMT processor. These oracle predictor results are also important because they show that the main benefit of PEEP comes from being able to accurately predict the memory disambiguation stalls and their related latencies, rather than the fetch policy somehow indirectly reducing the number of ordering violations. Since the performance of the load-wait table based predictor and the oracle predictor is so close, this shows that the LWT predictor does not cause any significant artificial stalls by making loads wait unnecessarily

on independent stores. If this were the case, the performance using the oracle predictor

(which as described earlier only identifies true dependences) would have been higher.

An interesting observation is that while memory disambiguation delays are not as long as load miss delays on average, some of the resolution delays we observed were quite long.

The reason for this is that some store addresses have a dependence on earlier loads that miss in the cache, which in turn can force a later load to stall until the original miss returns and the store can compute its address.

One of the reasons why PEEP performs better than either the LMP or MLP approaches is that proactive exclusion does not kick in until we reach the load with the long predicted dependence resolution latency. Consider a load A that must access main memory, followed by a dependent store B, and then another load C. Store B cannot compute its address until A has returned from memory. In the meantime, if Load C has been predicted to not have a dependency on Store B, then it (and its dependents) can execute and make forward progress. In the case of LMP, the fetch policy would have stopped fetching from this thread immediately after fetching Load A, thereby missing out on the ILP available further in the instruction stream. If Load A is an isolated miss, then the same scenario would apply to the

MLP-based policy as well. In the case that Load C has been predicted to be dependent on

Store B, then the fetch policy would stop at Load C. The pipeline would, however, already contain some instructions past the original stalled Load A. This indirectly creates an effect somewhat similar to our Early Parole in that a small amount of additional work (only up to Load C) is already available in the out-of-order core to prevent starvation when Load

A returns from memory. As execution finishes up on these instructions, the Early Parole effect for Load C should allow more instructions to arrive in the execution core just in time to prevent starvation when Load C executes.

4.4.4 Fairness Results

When dealing with multi-programmed workloads, high performance and throughput are not the only important metrics. In many systems, maintaining fairness among threads is also of great importance. A greedy scheme that only benefits high-ILP applications or

Figure 25: Combining Proactive Exclusion and Early Parole with different delay predictors.

applications having no memory dependences may over-penalize and starve other threads.

This is one of the reasons why ICOUNT is considered to be a good policy for an SMT processor. ICOUNT provides relatively high throughput while ensuring that even slow-moving threads get their fair share of resources. Since PEEP runs on top of ICOUNT (or potentially any other fetch policy), PEEP effectively inherits ICOUNT’s fairness properties. Immediately after an excluded thread has been reinserted into the candidate list, it would likely have fewer instructions in the pipeline, and therefore the underlying ICOUNT policy will heavily favor fetching from this thread. Figure 26 displays the harmonic mean of weighted IPCs (HMWIPC) for the four-threaded workloads. PEEP delivers greater performance than the other policies and results in better fairness properties as measured by the HMWIPC (20% higher than ICOUNT). The standard deviation of speedup (SDS) metric is a stricter measure of fairness because, unlike HMWIPC, it does not factor in raw performance. Our results indicate that both ICOUNT and PEEP running on top of ICOUNT exhibit very low slowdown standard deviations (0.11 and 0.17 for ICOUNT and PEEP, respectively).
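For reference, and assuming the usual definitions (each thread’s weighted IPC is its IPC in the SMT mix normalized to its standalone IPC; the exact normalization is not restated in this section, so this is our reading of the metrics), the two fairness measures can be written as:

    \mathrm{WIPC}_i = \frac{\mathrm{IPC}_i^{\mathrm{SMT}}}{\mathrm{IPC}_i^{\mathrm{alone}}},\qquad
    \mathrm{HMWIPC} = \frac{n}{\sum_{i=1}^{n} 1/\mathrm{WIPC}_i},\qquad
    \mathrm{SDS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\mathrm{WIPC}_i - \overline{\mathrm{WIPC}}\bigr)^{2}}

A policy that starves one thread drives that thread’s weighted IPC toward zero, which collapses the harmonic mean even if aggregate throughput stays high; SDS additionally penalizes any spread in the per-thread slowdowns regardless of overall performance.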

4.5 PEEP on an SMT Processor without Speculative Memory Disambiguation

The PEEP results presented so far demonstrate strong performance gains for a relatively

simple modification to basic SMT fetch policies. However, the idea is based on the assump-

tion that an SMT processor has support for speculative memory disambiguation in the

first place. While memory disambiguation and load-store reordering are known to improve

performance and would likely be included in future processors, there have also been SMT

based processors in the past that have not supported it. For example, the Intel Pentium 4

with Hyperthreading supports a version of SMT, but does not execute loads out-of-order

ahead of earlier unresolved store addresses [19].

In this section, we evaluate an alternative design based on the main PEEP idea. We

allow the SMT processor to make dependency predictions and consequently predict load

dependence resolution delays, and we use PEEP to control the fetch policy. However,

controlling fetch is the only thing these predictions are used for. The out-of-order execution

core does not allow loads to issue in the presence of earlier unresolved stores, even if the

load has a predicted delay of zero. Delay still has the same definition as in Section 4.3;

that is, the time from when a load’s address has been computed until it actually issues.

Any predicted non-zero delay counts as a predicted memory dependence, but Leniency is still used to prevent exclusion when the delay is less than the given threshold (five cycles for our experiments). In an SMT processor without speculative memory disambiguation, load instructions will on average have to wait longer for all earlier store addresses to resolve.

Employing PEEP in this context does not attempt to reduce this latency; it cannot. Rather,

PEEP only targets the reduction of scheduler congestion effects when such stalls are present.

The intuition here is that a large fraction of the performance benefit comes from being able to predict when these stalls occur and more importantly how long they take to resolve. These two pieces of information are what guide the processor to make intelligent decisions about which threads to fetch from, which threads to throttle and for how long. The predictor cost is reduced slightly since we no longer need to keep the one bit entry corresponding to the load wait table. The remaining delay predictor counter still consumes 3.5KB of state. The

Figure 26: Fairness as measured by the harmonic mean of weighted IPC.

primary advantage is that we can potentially keep most of the benefits of PEEP without requiring the pipeline to implement speculative load-store reordering, especially in an SMT processor that does not already incorporate it. In the remainder of this section, we refer to this version of PEEP without speculative load reordering as PEEP*.

Figure 27 shows the performance of PEEP* compared to a baseline SMT processor using ICOUNT, also without speculative memory reordering. While not allowing loads to issue out-of-order with respect to earlier unresolved stores potentially reduces ILP, it helps to avoid pipeline flushes on mispredictions. The relative results are very close to the original PEEP design, indicating that a fetch policy that is cognizant of impending memory disambiguation stalls helps to improve the throughput of an SMT machine considerably.

From these results, we draw two conclusions. First, predicted memory dependences are very effective indicators of impending stalls that an SMT fetch policy should be aware of. Second, the benefits of our proposed technique are purely a result of better fetch management and they are not dependent on any explicit support for speculative memory disambiguation in the out-of-order core.

Figure 27: Benefit of memory dependence prediction without speculative load disambiguation.

4.6 Sensitivity Analysis

So far, we have explained and demonstrated the merits of PEEP/PEEP∗ on a specific set

of benchmarks and microarchitectural assumptions. In this section, we present additional

experimental results to show that PEEP can provide strong results across a range of mi-

croarchitectural assumptions. We also study the impact of PEEP in conjunction with other

fetch policies.

4.6.1 Sensitivity to Multithreading

Our baseline four-threaded SMT microarchitecture was chosen to provide a moderate level

of aggressiveness. We now study the impact of scaling down the processor configuration

and workloads. We consider a smaller processor operating only on two-threaded workloads.

Table 6 shows these workloads, which combine the same classes of applications as before.

The processor configuration is similar to that described earlier in Table 4, with per-thread

resources (e.g., ROB entries) reduced to handle only two threads. Figure 28 shows the

corresponding performance results for LMP, HLMP, MLP, and PEEP/PEEP∗.

By its nature, PEEP throttles the fetch of a thread predicted to have dependencies

Table 6: Two-threaded workloads.

Classification   Benchmarks        Suite      Classification   Benchmarks        Suite
(2N)-1           art,bzip2         F/I        (2L)-1           mcf,equake        I/F
(2N)-2           ammp,applu        F/F        (2L)-2           twolf,vpr2        I/I
(2N)-3           mcf,wupwise       I/F        (2M)-1           applu,ammp        F/F
(2N)-4           patricia,lucas    E/F        (2M)-2           gcc,bzip2         I/I
(2S)-1           gcc2,perlbmk1     I/I        (2H)-1           facerec,crafty    F/I
(2S)-2           gcc,twolf         I/I        (2H)-2           wupwise,gzip3     F/I
(2S)-3           apsi,eonc         F/I        (1L,1H)-1        mcf,mesa2         I/F
(2S)-4           gcc,yacr          I/P        (1L,1H)-2        parser,crafty     I/I
(1S,1N)-1        vpr1,wupwise      I/F        (1M,1H)-1        gzip3,fma3d       I/F
(1S,1N)-2        apsi,patricia     F/E        (1M,1H)-2        vortex3,mgrid     I/F
(1S,1N)-3        bc-fact,swim      P/F        (1L,1M)-1        parser,gcc3       I/I
(1S,1N)-4        parser,galgel     I/F        (1L,1M)-2        art,galgel        F/F

Figure 28: Performance of the SMT fetch policies for two-threaded workloads.

while allowing other independent threads to utilize all the processor resources. This means

that with more threads, PEEP has more opportunities to fetch from a non-stalled thread.

Conversely, with fewer threads, such as in the case of our two-thread workloads, if one (or, worse, both) of the threads suffers from a memory-dependence related stall, then there are far fewer options for finding useful work to cover the dependency resolution latency. As a result, we do not expect PEEP to provide the same level of performance gains in this scenario.

Figure 28 shows, however, that the results are not too discouraging either. The reason is that not all stalls occur simultaneously. If only one of the two threads has a predicted memory dependence, then PEEP allows the other thread to make use of practically all of the processor’s resources. Only when both threads have predicted memory dependences will the processor’s throughput be severely reduced, which does not happen too often.

4.6.2 Scalability of PEEP

In this study, we evaluated PEEP/PEEP∗ on a larger, more aggressive SMT microarchitec-

ture. Figure 29 shows the performance evaluation. We modeled a four-way SMT configuration containing a 2048-entry ROB, 512-entry LSQ, 300-entry RS and 10 memory ports.

Larger instruction windows allow the processor to consider more instructions per thread, which in turn increases the likelihood that some load instructions will be stalled waiting on ambiguous memory dependences. In this context, it becomes even more critical for the

SMT fetch policy to account for these dependences to avoid tying up processor resources with those instructions in the forward slices of the stalled loads. As seen from the figure, using PEEP and PEEP∗ for this large SMT processor configuration yields average throughput improvements of 20% for both approaches. Additionally, the overall trends are very similar to those reported in Figures 25 and 27. As machine sizes scale upward, however, the difference between PEEP and MLP decreases as larger windows expose more memory-level parallelism.

4.6.3 PEEP Combined with Other Fetch Policies

We now examine the performance of PEEP used in conjunction with the LMP and MLP policies. When combining policies, we model a logical “OR-ing” of policies. That is, the

Figure 29: Performance of the SMT fetch policies on a large, aggressive processor configuration.

aggregate policy removes a thread from fetch consideration if any of its constituent policies

chooses to exclude the thread. The delay predictor is still used for PEEP to determine

the number of cycles the thread should be gated for in the case of memory-dependence related stalls. The results are presented in Figure 30. As expected, most of the performance benefit comes from PEEP, and in most of the workloads the performance is nearly the same as that of PEEP with an adaptive delay predictor. Since load misses are hard to predict, the PEEP+LMP combination shows little advantage over PEEP alone. In some workloads, PEEP+MLP is able to predict a larger number of stalls, which makes this combined policy work better. Finally, we also simulated all three together: PEEP along with a load-miss predictor and an MLP-aware policy; this hybrid policy seemed to marginally outperform the PEEP adaptive policy.
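The “OR-ing” of gating policies amounts to excluding a thread whenever any constituent policy votes to exclude it, with PEEP’s delay predictor still supplying the gating duration. The minimal sketch below is ours (the function name and the None convention are illustrative, not the simulator’s interface):

    def combined_fetch_gate(peep_excludes, lmp_excludes, mlp_excludes,
                            peep_predicted_delay, frontend_depth=5):
        """Aggregate decision for one thread: exclude it if ANY constituent policy excludes it.

        Only PEEP predicts a duration (used for Early Parole); LMP/MLP gate until their own
        condition clears, which we model by returning None for the duration.
        """
        excluded = peep_excludes or lmp_excludes or mlp_excludes
        gate_cycles = None
        if peep_excludes:
            gate_cycles = max(0, peep_predicted_delay - frontend_depth)
        return excluded, gate_cycles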

4.6.4 Delay Sensitivity

The effectiveness of the PEEP fetch policy may be impacted by the accuracy of predicted delay, and the responsiveness of the PEEP mechanisms may also be sensitive to the pipeline depth of the processor frontend.

Figure 30: Performance of PEEP combined with other fetch-gating policies.

Figure 31: Sensitivity of the policy to different predictor sizes and delay thresholds.

We first study the impact of the delay predictor’s size. A smaller predictor may lead to more frequent conflicts in the predictor tables, thereby leading to inaccurate delay predictions. Figure 31(a) shows that increasing the size of the delay predictor beyond 4KB does not increase performance by any appreciable amount. Decreasing the predictor size below

4KB causes performance to drop off due to frequently inaccurate delay predictions.

Next, we study the impact of the delay threshold θEP. Figure 31(b) shows the IPC impact as we vary the threshold. Not surprisingly, the optimal setting for θEP is equal to the length of the frontend pipeline (5 cycles/stages). If the predicted delay is less than

the pipeline length, then it intuitively makes sense that it will take longer to exclude and

then reinstate the thread. Similarly, to minimize the impact of waiting for dependents to get dispatched into the out-of-order core after a load’s memory dependence has been resolved, an early parole of a number of cycles equal to the frontend pipeline depth allows

the dependent instructions to arrive “just in time.” This assumes that the fetch unit fetches from the reinstated thread immediately after it is returned to the candidate list, which is very likely because the underlying ICOUNT mechanism will favor this thread.

We also explored the impact that the front-end pipeline depth had on the various fetch policies. Since the fetch policies described and evaluated through this chapter (including

PEEP and PEEP*) are fetch-gating policies, it is important to see how these policies behave with shorter and longer pipelines. When we fetch-gate a thread and the processor has a long front-end pipeline, this latency would be exposed when the thread is refetched. For the

LMP and MLP policies, this latency could result in a significant performance loss if the issue queue is not filled with useful work from other threads. For PEEP/PEEP*, however, the Early Parole mechanism starts refetching several cycles before the instructions are actually needed, and the threshold is set based on the front-end pipeline depth, so longer pipelines should not impact performance too much. Figure 32 shows the results for all the policies evaluated. ICOUNT itself performs quite poorly when the pipeline depth is increased. As a result, the other fetch policies show roughly the same improvement over ICOUNT as observed for the base five-cycle fetch-to-issue latency; however, both PEEP and PEEP* are able to further improve over ICOUNT because of the delay predictor.

4.7 Evaluating PEEP in an Industrial Simulator

In this section, we study the performance of PEEP in an industrial simulator. We use an execution-driven, cycle-level IA-32 simulator that models an aggressive, out-of-order, two-threaded superscalar processor. The simulator model has been validated against hardware for correctness and accuracy and is currently used by a leading processor manufacturer to evaluate processor designs. The simulator models a two-threaded SMT processor with

Figure 32: Performance of the SMT fetch policies with larger front-end pipelines.

a 128-entry reorder buffer, 48-entry load queue, 32-entry store queue, and 54 reservation stations. The simulator incorporates an advanced memory dependence predictor that tracks per-load dependence history and uses saturating counters and confidence thresholds to make predictions regarding ambiguous memory dependences [19]. In this simulator, we incorporated PEEP as a fetch-policy filter that runs on top of the base fetch policy, which is round-robin based. We analyzed over

150 benchmarks for this study, taken from the Multimedia (Mult), Server (S), Office (O), Digital Home (DH), and Workstation (W) benchmark suites. We further divide these benchmarks into three categories based on their sensitivity to memory dependences: non-sensitive (NS), medium-sensitive (MS), or highly-sensitive (HS). To form the two-threaded workloads, we form the following types of benchmark pairs: NS-HS, NS-MS, MS-HS, and HS-HS. We present results for ten pairs from each type (40 total benchmark

pairs). Due to the sensitivity of the information, we are unable to present the individual

benchmark names in this dissertation.

Figure 33 presents the performance results for PEEP for 40 workload mixes relative

to a baseline that does not incorporate PEEP. This figure shows that on average PEEP

has an 8.4% performance improvement over the baseline when implemented in the industrial

simulator. Out of all the 40 simulated workloads, 5 show performance degradations. Only

one mix, (1NS,1HS)-10, experiences more than 5% performance degradation. As indicated

Figure 33: Per-mix performance results as obtained in the IA-32 simulator.

Figure 34: Aggregate performance results per mix type.

in Figure 34, the best-performing mixes are those that pair a sensitive application with a non-sensitive application. This observation is consistent with the results obtained with the M-Sim simulator.

4.8 Conclusions

In this chapter, we have exploited the common-case behavior of predictable memory de-

pendences in programs to improve the performance of SMT processors, thereby improving

their efficiency. We used the concept of Proactive Exclusion (PE) to remove threads from

an SMT processor’s fetch unit’s work list when threads are likely to experience long-latency

stalls. We also proposed an Early Parole (EP) mechanism to restart fetching of previously

excluded threads in an anticipatory fashion such that the instructions arrive at the out-of-

order execution core right as the previous stall resolves. In particular, we took advantage

of the predictability of load-store memory dependences as well as the predictability of their

resolution delays. While our design focused on memory dependence prediction, the PE and

EP techniques can potentially be applied to any stalls in SMT processors that are easily predictable. We analyzed the performance of PEEP with both a simple load-wait table based predictor and an oracle predictor to help determine efficient performance/cost trade-offs.

PEEP targets the fetch engine of an SMT processor, and augments it with predictable stall information so that the fetch unit is able to prioritize threads accordingly. This enables the threads to make maximum use of available resources, which considerably improves overall throughput and performance. Furthermore, we see that PEEP does not compromise on fairness in the fetch policy, as indicated by the harmonic mean speedups and the standard deviation of speedups for all the policies. PEEP accomplishes this with a simple, light-weight memory dependence predictor, which keeps the design complexity low. All these factors make PEEP an effective design for resource management in SMT processors, enabling it to improve overall processor efficiency.

CHAPTER V

FIRE-AND-FORGET: IMPLEMENTING SCALABLE STORE TO LOAD FORWARDING

5.1 Introduction

In this chapter, we focus on the issue of scalable data forwarding to improve the scalability and reduce the power consumption of the processor while minimizing the performance impact [78]. Conventional implementations of the load queue and the store queue, which support out-of-order memory scheduling and data forwarding, are CAM-based and scale poorly to the sizes required for current and future large-window, high-ILP processors. They are inefficient due to their increased power consumption as well as their impact on access latency, which could in turn limit the clock frequency of the processor. In our work, we present a design for these queues that exploits the common-case predictability of data-forwarding patterns. Our approach to designing the load and store queues allows a more scalable and power-efficient implementation and mitigates many of the problems that impact conventional implementations of these critical structures. We present results that

show that the Fire-and-Forget technique of store-to-load forwarding provides power and en-

ergy savings as well as a slight performance improvement over the conventional forwarding

approach which helps to improve overall processor efficiency.

Our main technique is to predict the relative location of a store’s data-dependent load

(if such a load exists at all) and speculatively forward the data. After forwarding to a

predicted load queue entry, the store never again worries about data forwarding and sim-

ply waits to commit and write its result back to memory. Hence, we call the technique

“Fire-and-Forget.” In addition to simplifying the memory scheduling logic, this approach

enables a store to forward a value before its corresponding store address has been computed

and/or before the receiving load entry’s effective address has been computed, thus enabling

memory cloaking. As a result, our Fire-and-Forget memory scheduling microarchitecture

provides better performance than previously proposed techniques for simplifying the load and store queues, and at the same time Fire-and-Forget requires substantially less hardware.

Specifically, our base design supports a load queue which requires no CAM-based search logic, and it completely eliminates the store queue. We also present an optimized

Fire-and-Forget design which does not even require a load queue.

The rest of the chapter is organized as follows: The next section provides a brief review

of current load and store queue implementations and describes the most recent proposals for

simplifying the hardware. Section 5.3 describes our Fire-and-Forget forwarding mechanism,

and Section 5.4 provides our performance evaluation. Section 5.5 presents some case studies

of applications that have interesting behavior. Section 5.6 briefly describes the energy and

power impact of using Fire-and-Forget over the traditional associative queues. Section 5.7

and Section 5.8 present two optimizations to the basic design of Fire-and-Forget which help

improve its performance as well as eliminate the load queue. Section 5.9 compares Fire-and-

Forget with another LQ/SQ-less data communication mechanism and Section 5.10 presents

a summary of our findings and discusses future work in this area.

5.2 Background

5.2.1 Conventional Load and Store Queues

A conventional processor partitions the memory scheduling logic into two separate queues: the load queue (LQ) and the store queue (SQ). Load instructions wait in the load queue

until the processor determines that the load’s dependencies have been resolved and that

the load is ready to issue. When a load issues, it retrieves a value from the data cache, and

it also performs an associative search of the store queue (to be described shortly) to check

if there are any earlier stores to the same address that have not yet written their results back to the cache. Store instructions wait in the store queue for their effective address and data operands. When both address and data become ready, the store broadcasts both of these to the load queue. Every valid load queue entry monitors the store broadcast bus, and captures the forwarded data value if its address matches the store’s and it comes

after the store in program order.

Figure 35: The progression of load/store queue designs discussed in this chapter: (a) a conventional fully-associative LQ and SQ, (b) a non-associative LQ backed by re-execution filtering (SSBF/SPCT), (c) Store Queue Index Prediction with its FSP, SAT, and DDP tables, and (d) Fire-and-Forget, which adds the LDP and LCP and eliminates the SQ entirely.

5.2.2 Load Re-Execution

An alternative way to guarantee correct load execution is to defer the checking until the load commits: the load re-executes by accessing the data cache a second time and comparing the newly read value against the value it originally consumed. If the two values match, then the load and its dependents used correct data and the processor’s state is correct. If the re-execution reveals an inconsistency in the load’s

value, however, then it is possible that its dependents have also consumed incorrect values.

In this situation, we flush the pipeline and refetch the instructions following the load. An interesting aspect of load re-execution is that it is a value-based approach [13]. Loads that execute in an order that would normally be considered as an ordering violation or incorrect with respect to the consistency model may still be treated as “correct” if re-execution reveals that the speculative value of the load is the same as the real value. This may occur due to

Silent Stores [40] or just dumb luck (e.g., many values in a processor are zero [88], and so reading a zero from the wrong store, the wrong address, or anywhere else may still result in the correct value!).

Procrastinating the load checking until commit removes considerable complexity from the LQ and SQ hardware. However, naïve load re-execution forces every load to effectively issue twice. This places a large burden on the limited data cache ports and tends to have a substantial negative effect on performance due to pipeline flushes and cache port contention.
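The value-based check itself is tiny; the following sketch is ours, with a plain dictionary standing in for the data cache and a callback standing in for the squash logic, and is only meant to illustrate the commit-time comparison.

    def check_load_at_commit(data_cache, load_addr, value_consumed, flush_and_refetch):
        """Re-execute a load at commit and compare against the value it originally consumed."""
        correct_value = data_cache.get(load_addr, 0)
        if correct_value == value_consumed:
            return True              # value-based check passes, even if the ordering was "wrong"
        flush_and_refetch()          # dependents may hold bad values: flush and refetch after the load
        return False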

Much work based on load re-execution has focused on reducing the complexity of the LQ and/or SQ while attempting to reduce the overhead of re-execution. We now present a brief review of techniques based on load re-execution. We describe several works which progressively remove more and more of the LQ and SQ hardware.

5.2.2.1 Non-Associative Load Queue

To address the power concern of the load queue’s associative search, Cain and Lipasti proposed the Non-associative load queue (NLQ) [13] which uses load re-execution. When a load issues, it performs an associative search on the store queue to find any earlier stores to the same address, with a hit resulting in data being forwarded from the matching store to the load. However, a store does not “execute” in that even when its address and data are ready, the store makes no attempt to communicate this information to other instructions. The store queue entry simply holds on to its operands until it reaches commit, and then writes its value to the data cache. As a result, the NLQ approach removes all of the associative

CAM logic from the LQ, as shown in Figure 35(b). The NLQ work also proposed a simple

heuristic to reduce the re-execution overhead. In particular, a load that issued when there were no unresolved store addresses will have received the correct value, and therefore need

that guarantee correct operation with respect to multi-processor consistency.

5.2.2.2 Store Vulnerability Window

To improve on the non-associative load queue design, Roth proposed the idea of a Store

Vulnerability Window (SVW) [59]. The SVW technique is a re-execution filter based on the

observation that when a load issues, the processor only contains a small number of stores

that could even potentially conflict with the load. For each load, this technique tracks this

small set of stores (i.e. its SVW), and a load only re-executes at commit if any store from

its SVW had executed after the load. Any other store not in the load’s SVW will not cause

re-execution because, by construction, the store cannot have a conflict with the load. For

the purposes of our work, the reader can simply consider SVW as a very effective means of

reducing load re-executions when using a NLQ (Roth showed an 85% reduction).

To implement the SVW scheme, Roth proposed to assign each store instruction a unique,

monotonically increasing identifier called the store sequence number (SSN). In practice, the

SSN will have a finite bit-width and some mechanism for handling SSN overflow/wrap-

around. For each dynamic load, its vulnerability window can be represented by the SSN of

the youngest older store that already committed when the load was dispatched (i.e. the most

recent store to which this load is not vulnerable). A store with a lower SSN (an older store)

has already committed its results to the cache prior to this load instruction entering the

out-of-order core of the processor, and therefore the load cannot be vulnerable to this store.

In addition to defining a SVW for each load, the overall SVW scheme also makes use of

an address-based store sequence Bloom filter (SSBF) that conservatively filters unnecessary

load re-executions. The SVW scheme also used a Store PC Table (SPCT) that performs

memory dependence prediction to reduce the frequency of load-store ordering violations. It

is important to note that the auxiliary structures (SSBF and SPCT) are all off of the critical

load and store execution paths and therefore do not impact the scalability of the LQ and SQ.
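The following sketch captures our reading of the SVW filter: each store carries an SSN, committed stores stamp an address-hashed table standing in for the SSBF, and a load re-executes only if a store it is vulnerable to may have written its address. The class and its methods are illustrative only; table sizing, set hashing, and SSN wrap-around handling are omitted.

    class SVWFilterSketch:
        """Illustrative Store Vulnerability Window filter; not Roth's exact design."""

        def __init__(self, ssbf_entries=1024):
            self.next_ssn = 0          # SSN handed to stores at dispatch
            self.committed_ssn = 0     # SSN of the youngest store that has committed
            self.ssbf = [0] * ssbf_entries

        def on_store_dispatch(self):
            self.next_ssn += 1
            return self.next_ssn       # the store carries this SSN until commit

        def on_store_commit(self, store_addr, store_ssn):
            self.committed_ssn = store_ssn
            self.ssbf[hash(store_addr) % len(self.ssbf)] = store_ssn

        def on_load_dispatch(self):
            # The load is not vulnerable to this store or to anything older.
            return self.committed_ssn

        def must_reexecute(self, load_addr, not_vulnerable_ssn):
            # Conservative: aliasing in the hashed table only causes extra re-executions,
            # never missed ones.
            return self.ssbf[hash(load_addr) % len(self.ssbf)] > not_vulnerable_ssn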

5.2.2.3 Store Queue Index Prediction

SVW provides a means of implementing the non-associative load queue (NLQ) while keeping the re-execution overhead well under control. However, the SVWNLQ (SVW on top of NLQ) organization still requires associative search logic in the store queue. In particular, when a load executes, it still searches the SQ to find earlier stores that have already computed their

addresses from which data may be read. Sha et al. proposed a novel approach called Store

Queue Index Prediction (SQIP) that works on top of SVWNLQ to remove the associative logic from the store queue [67]. Prior work on memory dependence prediction has already

shown that there are stable patterns between loads and the store instructions they receive

their values from [52, 17, 79]. That is, a static load that is dependent on an earlier store

is usually dependent on the same static store for each dynamic instance of that load, and

furthermore the relative position of the store will also be the same (i.e., the load is always

dependent on the ith previous store). As shown in Figure 35(c), SQIP replaces the SQ’s

costly associative logic with much simpler direct-mapped indexing.

Store queue index prediction (SQIP) starts with SVWNLQ with all loads directly reading their values from the data cache (no SQ search). SQIP uses two tables to track loads and

stores: the Forwarding Store Predictor (FSP) and the Store Alias Table (SAT). The FSP

is a set-associative table indexed by the PC of a load instruction, where each entry stores

a small set of partial PCs for store instructions that recently forwarded values to the load.

The FSP also contains valid bits, partial tags, and a training/confidence counter. The SAT

is a direct-mapped untagged table indexed by a store’s PC. A SAT entry contains the SSN

of the most recent store to hash to the entry. The Store PC Table (SPCT) tracks pairs of

loads and stores that have had ordering violations. Similar to the underlying SVW scheme,

SQIP’s extra hardware structures (FSP, SAT, etc.) are all accessed off of the critical path

of load-store execution.

An Example: For each part of the example in Figure 36, we only include the fields of

the SQ and LQ entries that are used in that part. Figure 36(a) shows that when a store

commits, it indexes the SPCT with the address it wrote to (Y) and records its own PC

(4000). Next, a later load that should have received its data from the store at PC=4000 re-executes at load commit and reveals that it did indeed receive the wrong value (b). The load now reads the SPCT to find out which store it should have received a value from,

and then records the identity of this store in its FSP entry. A later instance of the same

store (c) writes its SSN into the SAT during the register rename stage. The SAT is not a

large structure, but it requires additional ports and checkpointing to repair it after pipeline

flushes. When we next encounter the load (d), the load checks the FSP and finds that it

had a previous conflict with the store at PC=4000. The load then accesses the SAT to find

the SSN of the most recent instance of the forwarding store (SSN=27). From this SSN, the

load can directly compute the store queue index (SQI) by taking the SSN modulo the SQ

size (e). When the load issues, it can use the predicted SQ index to directly access a single

SQ entry (i.e. no associative search). In this case, the load’s address matches the address in

the predicted store queue entry, and so the load uses this value instead of the value in the

data cache. Note that underneath the SQIP scheme, SVW is still running to guarantee the

overall correctness of execution. Sha et al. also proposed a Delay Distance Predictor (DDP)

that reduces the frequency of pipeline flushes due to difficult-to-predict loads. Overall, they

demonstrate that SQIP performs within 0.6% of a realistic fully-associative SQ.
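Putting the example together, the load-side SQIP access reduces to two table reads and a modulo. The sketch below is ours; it ignores the FSP’s tags and confidence counters as well as the DDP, and models the tables as plain dictionaries.

    SQ_SIZE = 4  # the example of Figure 36 uses a four-entry store queue

    def sqip_load_access(load_pc, load_addr, fsp, sat, sq_addr, sq_data, data_cache):
        """Predict a store queue entry for a load and speculatively read from it.

        fsp -- dict: load PC -> PC of the store that previously forwarded to this load
        sat -- dict: store PC -> SSN of the most recent dynamic instance of that store
        sq_addr, sq_data -- lists modeling the SQ's RAM-indexed address and data fields (no CAM)
        """
        store_pc = fsp.get(load_pc)
        if store_pc is not None and store_pc in sat:
            sqi = sat[store_pc] % SQ_SIZE          # predicted store queue index
            if sq_addr[sqi] == load_addr:          # one direct-mapped check instead of a search
                return sq_data[sqi]
        return data_cache.get(load_addr, 0)        # otherwise read the cache; SVW still verifies

With the values from the example (SSN 27 and a four-entry SQ), the predicted index is 27 mod 4 = 3, which is where the forwarding instance of the store at PC 4000 sits.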

5.2.3 Other Related Work

Our Fire-and-Forget approach can be viewed as a logical progression in the line of load/store

processing optimizations as illustrated in Figure 35(d). However, there have been other

proposed techniques for optimizing the load and store scheduling structures. Address-

Indexed Memory Disambiguation (AIMD) uses a structure that acts like a small, speculative

L0 cache [76]. Stores speculatively write to this cache along with sequence numbers that

are similar in spirit to Roth’s SSNs, and ordering violations can be detected which forces

loads to re-execute. Note that with AIMD, the load execution occurs in the main out-

of-order core which requires already issued instructions to somehow be put back into the

scheduler. Also, the use of sequence numbers alone does not allow for the relaxation to the

Figure 36: Store Queue Index Prediction example illustrating (a) a store updating the SPCT, (b) a misforwarded load using the SPCT to update the FSP, (c) a later instance of the same store updating the SAT, (d) a later instance of the load using the FSP and SAT to compute a SQ index, and (e) the load using the predicted index to directly access the predicted SQ entry.

value-centric approach of commit-time load re-execution. Sethumadhavan et al. proposed a partitioned LSQ design [65]. Each segment of the LSQ still uses CAMs for address matching, but having fewer entries per segment relaxes timing constraints. They also use

Bloom filters to filter unnecessary CAM activity when it can be proven that a match will not occur. Previous work has also capitalized on the observation that few stores actually forward their values. Roth [58] and Baugh and Zilles [8] independently proposed store queue

organizations that partition the SQ into one piece that buffers stores for in-order retirement,

and a smaller piece that forwards values to loads (the forwarding store queue or FSQ). The

expensive CAM logic is limited to only the small FSQ, which keeps the hardware costs

under control. Some form of load re-execution is still required to guarantee correct load

execution. The difficulties with address-indexed LSQs are that address/bank/port conflicts

can occur, allocation of entries is more complicated as it typically occurs at execution, and

finding the right instructions to squash on a pipeline flush is not straightforward. Recently,

Sha et al. proposed a design in which store-load forwarding can be performed without a store queue [68]. We will discuss this proposal in some detail in Section 5.9 since the goals of this work and Fire-and-Forget are quite similar; however, the intuitions used and designs explored are quite different.

Figure 37: Fire-and-Forget example illustrating (a) a store tracking the LSN at dispatch, (b) the store updating the SPCT at commit, (c) a misforwarded load tracking its distance for future use by the store, (d) a later instance of the store computing the index for the predicted LQ entry, and (e) forwarding a value to the LQ entry.

5.3 Fire-and-Forget Store-to-Load Forwarding

We now describe our Fire-and-Forget (FnF) store forwarding design. Sha et al.’s Store

Queue Index Prediction (SQIP) exploited the fact that a load that receives a forwarded

value usually receives its value from the same store. In this work, we take a subtly different view of this observation: a store that forwards a value usually forwards it to the same load.

While this may seem like two sides of the same coin, the slight change in perspective leads us to a different memory scheduling scheme that ultimately removes not only the CAMs or decoders of the store queue, but eliminates the store queue completely!

5.3.1 Load Queue Index Prediction

Store queue index prediction takes a load-pull approach to store-to-load data communication. When the load has computed its address, it checks its predicted source SQ entry and then possibly pulls a value from this store. We propose a store-push approach that can be considered as the dual of SQIP. For each store, we make a prediction for the LQ index that the store should forward its value to.

Similar to SQIP, Fire-and-Forget starts with SVWNLQ and augments it with three auxiliary tables. The first is a modified SPCT that tracks a store’s PC as well as its position relative to the LQ entries. The second is the load distance predictor (LDP) that tracks the relative distance from a store to the load it should forward to. The last table is the load consumption predictor (LCP) that determines whether a load should make use of a forwarded value. In the following text, we first explain how FnF works with a stripped-down SQ, and then we explain how to remove the entire SQ.

We define a load sequence number (LSN) that is analogous to SVW’s SSN. The processor assigns a unique and monotonically increasing identifier to each dynamic load. When the processor dispatches a store instruction, the store records the LSN corresponding to the most recently dispatched load (MRDL), as shown in Figure 37(a).3 Although we do not place stores in

3MRDL here can also be thought of as the most recently allocated load. Dispatch here refers to the pipeline stage of allocation, not issue. This distinction is important: when the store is dispatched, its MRDL may not have issued yet, but the store does not need to wait for this load to issue before determining the load it needs to forward to.

the load queue, this MRDL number indicates where in the load sequence the store would

have hypothetically been inserted. When a store commits, it writes its MRDL and PC number into the SPCT, shown in Figure 37(b). The LSN in the SPCT effectively records

the position of the hypothetical insertion point of the store in the load queue. Later, when

a load re-executes and discovers that it received an incorrect value, it will use its effective

address to check the SPCT to see if there was a recent store to this same address, shown in

Figure 37(c). Assuming the load finds a valid SPCT entry, the processor subtracts the

recorded MRDL from the load’s current LSN to compute the distance in load queue entries

from the hypothetical store insertion point to the consuming load. The processor then

stores this distance in the LDP based on the store’s PC. The load also updates the

PC-indexed LCP so that in the future it will know that it should receive a value from a

forwarding store. Each entry of the LCP is a single bit, and we only update the LCP on a load re-execution-induced pipeline flush. The bit is set to zero if the load should have accessed the cache, and one if the load should have used a forwarded value. When the processor next encounters this same static store, it searches the LDP and finds a non-zero load distance, shown in Figure 37(d). The store computes a predicted LSN to forward to by taking the current MRDL and adding the distance from the LDP. Similar to SQIP, taking this predicted LSN modulo the load queue size provides the actual load queue index

(LQI) of the target load.4
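The arithmetic behind load queue index prediction is compact enough to write down directly; the sketch below uses our own notation (plain dictionaries for the SPCT and LDP) and shows the training step after a misforwarded load and the prediction step when a store dispatches.

    LQ_SIZE = 6  # the example of Figure 37 uses a six-entry LQ; a real design would use a power of two

    def train_ldp(ldp, spct, load_addr, load_lsn):
        """After a misforwarding is detected at re-execution, learn the store-to-load distance."""
        entry = spct.get(load_addr)            # SPCT: address -> (store PC, store's recorded MRDL)
        if entry is None:
            return
        store_pc, store_mrdl = entry
        ldp[store_pc] = load_lsn - store_mrdl  # distance in load-queue slots

    def predict_lqi(ldp, store_pc, current_mrdl):
        """At store dispatch, predict the LQ index this store should later forward to."""
        dist = ldp.get(store_pc, 0)
        if dist == 0:
            return None                        # no forwarding predicted for this store
        return (current_mrdl + dist) % LQ_SIZE

With the numbers of the example (a current MRDL of 49 and a learned distance of 2), the predicted LSN is 51 and the predicted index is 51 mod 6 = 3.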

As described above, FnF forwarding could be used with a non-associative SQ. A store would wait in its respective SQ entry until it is ready to issue, and then it simply forwards its data to the predicted load queue entry, shown in Figure 37(e). Note that the data movement from stores to loads is the sole responsibility of the store instructions. A load never searches, polls or otherwise accesses the store queue to try to find data (i.e. no CAMs, no decoders, no way at all for the load to access the SQ).

When a load dispatches, it first checks the LCP to predict whether it should use a forwarded value. If the value is already present, then the load can use it right away,

4This example uses a LQ size of 6 entries, but a practical implementation would have a power-of-two-sized LQ, thus making the modulo a trivial computation.

otherwise the load stalls until a value arrives. If the LCP indicates that the load will not receive a value from an earlier store, then the load can access the data cache as soon as it has computed its effective address. Note that for a load to consume a forwarded store value, it must go through two levels of “filtering”: a store must predict to forward to this load, and the load must independently predict to make use of the value. It is possible that some earlier store will attempt to forward a value to this LQ entry, but the load can always simply ignore it. If the sourcing store’s load index prediction is correct, the store will eventually send its value to the load. However, if the sourcing store had a load index misprediction, or no earlier store even predicted to forward a value to this load, the load could potentially deadlock in a situation where it is waiting for a value that will never arrive. To guarantee forward progress, a load will issue to the data cache when there are no longer any earlier unexecuted stores. The regular SVW mechanism is still in place to guarantee that the load’s

“final answer” is correct; however, SVW does not protect against speculatively forwarded values. Our implementation of FnF forces all loads that used forwarded values to re-execute, but this does not have a substantial impact on performance because only some of the loads use forwarded values. FnF could be modified to have stores forward their SSNs as well to enable the forwarded-to loads to better use SVW’s SSBF.
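On the load side, the two levels of filtering and the forward-progress fallback can be summarized as follows; this is again a sketch with invented names, not the simulator’s code.

    def load_value_policy(lcp, load_pc, lq_entry, no_older_unexecuted_stores):
        """Decide how a newly dispatched load obtains its value under Fire-and-Forget.

        lcp      -- dict: load PC -> True if the load is predicted to consume a forwarded value
        lq_entry -- dict with a 'forwarded_value' field (None until some store writes it)
        """
        if not lcp.get(load_pc, False):
            return "access_cache"            # predicted independent: read the D$ once the address is ready
        if lq_entry.get("forwarded_value") is not None:
            return "use_forwarded_value"     # a store already deposited a value in this slot
        if no_older_unexecuted_stores:
            return "access_cache"            # fallback that guarantees forward progress
        return "wait"                        # keep waiting for the predicted forwarding

Regardless of the path taken, SVW-based re-execution at commit remains the safety net for any misprediction.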

Similar to the baseline SVWNLQ and SQIP, the three auxiliary tables/predictors used by Fire-and-Forget (SPCT, LDP, LCP) are all accessed off of the critical path of load- store execution. As a result, none of these structures affect the scalability of the core memory scheduling hardware. The SPCT needs one write port for each store that can commit each cycle. A load re-execution that reveals a misforwarded load requires the load to read the SPCT. Even though we may commit multiple loads per cycle, we only process the first detected misforwarding and therefore the SPCT only needs a single read port.

Processing only a single misforwarding implies that the LDP only needs a single write port as well. The LDP needs one read port for each store dispatched per cycle, although one port can be combined with the write port because a write and read will not happen in the same cycle.5 The LCP requires as many read ports as loads dispatched per cycle, and as

5A write to the LDP only occurs after a misforwarding, which implies that the processor has been flushed and several cycles must elapse before any new stores have been fetched and are ready to read the LDP.

many write ports as loads committed per cycle. However, this slightly higher port count fortunately corresponds to the smallest of our three tables. Compared to SQIP, our FnF table management is simpler because all updates occur at commit and therefore never need any sort of recovery. SQIP updates its SAT in the rename stage which requires additional

SAT ports and storage for logging/checkpointing to support recovery after pipeline flushes.

5.3.2 Complete Store Queue Elimination

The previous discussion explained why FnF removes the need for any sort of SQ access

circuitry for load instructions. At this point, we question what benefit the remaining logic

in the store queue actually provides. Load re-execution guarantees that loads eventually

receive the correct value, and FnF provides a means of forwarding values from stores to

loads, and therefore the SQ only buffers the stores and their results in program order

for eventual in-order writeback to the cache. There is another structure in the processor

that already tracks all instructions in program order: the reorder buffer or ROB. Since

the ROB maintains a list of all instructions in program order, it necessarily includes all

stores in program order. We still need to buffer the results of the stores as well. In some

microarchitectures (e.g., Pentium-Pro/Pentium-M [72, 27]), each ROB entry contains a

physical register to store the result of that instruction. Normally, this physical register goes

unused for store instructions because the store’s value is kept in the corresponding store

queue entry. In FnF, we propose to make use of this otherwise neglected resource to record

the store’s value.6 After this change, all of the SQ’s functions have been outsourced to other

mechanisms, and so we can completely do away with the store queue!

To properly execute store instructions, some hardware resources are still required. Sim-

ilar to a store’s address-and-data (STA-STD) µop decomposition in modern x86 microar-

chitectures, we allocate two separate scheduler (RS) entries for each component of the store

instruction, as shown in Figure 38(a).7 In this example, we assume that an instruction’s destination physical register is integrated into its ROB entry like in the Pentium-Pro [72].

6For a microarchitecture that uses a separate physical register file such as the Intel Pentium 4 or the Alpha 21264, we could allocate a physical register to the store without updating any RAT entries.

7The Pentium-M’s µop fusion could be used to pack both STA and STD into the same RS entry, but we do not explore this optimization in this work.

Figure 38: An example of Fire-and-Forget without a store queue showing (a) cracking and dispatch of a store, (b) store address tracking, and (c) store data tracking and speculative forwarding.

When the STA µop executes, it computes its effective address and saves it in its ROB entry for its eventual cache update at commit (b). When the STD µop executes, it also writes its value to the ROB entry for eventual writeback (c). In addition, if the store has a predicted load to forward to, it uses its load queue index and performs a direct write (RAM) to the corresponding load queue entry. This speculative forwarding occurs in a completely blind fashion: the destination LQ entry may not have predicted to wait for a forwarded value. This is the one and only chance a store has to directly forward a value to a load, and afterward it makes no further effort to forward: hence the name Fire-and-Forget.
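The store-data side of the SQ-less design is equally simple; the sketch below uses our own names and data layout to show the single, blind action taken when an STD µop issues.

    def on_std_issue(rob_entry, load_queue, predicted_lqi, store_value, store_mrdl):
        """Write the store's value to its ROB entry and, at most once, to a predicted LQ slot."""
        rob_entry["value"] = store_value                 # buffered here until in-order writeback
        if predicted_lqi is None:
            return                                       # no forwarding predicted for this store
        # Blind, one-shot RAM write: the receiving load may ignore it, and the store never retries.
        load_queue[predicted_lqi]["forwarded_value"] = store_value
        load_queue[predicted_lqi]["forwarding_mrdl"] = store_mrdl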

In the FnF mechanism without a store queue, when a store’s data issues, the store forwards a value to the predicted LQ index and then forgets about it. This value needs to be stored somewhere. In designs that assume the LQ includes storage for forwarded values [59], FnF does not incur any additional overhead. However, based on feedback we received, at least in some Intel processors and other proposed designs [8] the LQ does not hold the value, since the forwarded value is still kept in the SQ, which is searched every cycle by the loads. Since FnF forwards based on when the store’s data is ready and there is no store queue, this value may not be present in the processor when the load is ready to issue unless it is kept in the LQ entry. However, for a 128-entry LQ, storing 8 bytes of data in each entry corresponds to an overhead of only 1KB, which is still far less than the area overhead of SQIP.

5.3.3 Speculative Memory Cloaking

An interesting attribute of FnF is that the act of speculatively forwarding based on a pre-

dicted LQ index can independently occur from any effective address computations. Neither

the sourcing store nor the receiving load need to know their respective addresses. So in

addition to eliminating the store queue, FnF provides a simple implementation of spec-

ulative memory cloaking [53]. Although previous work argued that speculative memory

cloaking was “not worth the effort” [43], that study primarily argued its position from the

perspective of needing to add a sophisticated load-store memory dependence predictor and make invasive changes to the register rename logic. That study also did not consider the non-unit store-to-load forwarding latencies that are commonplace in most modern LSQ implementations. However, we believe that the speculative memory cloaking facility provided by FnF is worth the effort because it effectively comes for free once one has implemented the basic FnF scheme.

FnF’s speculative memory cloaking enables some curious scenarios. A store can forward its value to a LQ entry that does not even contain a valid instruction because the previous load that occupied this LQ entry had already committed, but a newer load has not yet been dispatched. In this case, the store still performs its forwarding and deposits its value in the unoccupied LQ entry. Eventually, a load will be dispatched to this LQ entry; if this new load’s LCP says that the load should use a forwarded value, then the load can

“consume” the value that was left for it by the earlier store (which itself may have already committed). In this case, the load can execute immediately after dispatch, even before it has computed its effective address. Another variant of this scenario is when a store forwards to an LQ entry that contains a wrong path load. When the processor later detects the branch misprediction, it flushes the LQ entries containing wrong path instructions. However, there is no need to flush the forwarded value out of the LQ entry. When the processor eventually refills that LQ entry with a correct-path load, that load may use the left-over forwarded value. This implements a form of control independent speculative memory cloaking.

5.3.4 Corner Cases and Loose Ends

In this sub-section, we briefly clarify the behavior of Fire-and-Forget in several uncommon

cases for completeness. In the vast majority of scenarios, a store forwards its value to zero

or one LQ entry. In the infrequent case where a store should forward to more than one LQ

entry, FnF will simply suffer from a misforwarding or non-forwarding.

The speculative forwarding creates the potential for a scenario where a store forwards

past the logical end of the LQ (the LQ head) to a load that actually precedes the store in program order. To prevent sending values “back in time”, the store also forwards its

MRDL; if the load’s LSN is less than or equal to the store’s MRDL, then it should ignore

the value. Similarly, if the store forwards fewer bytes than the receiving load reads (i.e.

partial forwarding), then the load ignores the forwarded value.

It is possible to have more than one store forward a value to the same LQ entry. In

our implementation, each forwarding overwrites the previous contents (subject to the LSN wrap-around check described above), and a load that consumes a forwarded value simply uses the value that was present at the time the load issues.
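The consumer-side checks described above can be summarized by a similarly hedged C sketch; the field names are invented for illustration, and real sequence numbers would also need explicit wrap-around handling.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative sketch: may an issuing load consume the value a store
     * earlier deposited in its LQ entry?                                  */
    typedef struct {
        bool     has_value;    /* some store fired a value into this entry    */
        uint64_t value;
        uint16_t mrdl;         /* the MRDL (an LSN) sent along with the value */
        uint8_t  fwd_bytes;    /* number of bytes the store forwarded         */
    } lq_entry_t;

    bool load_may_consume(const lq_entry_t *e, uint16_t load_lsn,
                          uint8_t load_bytes, bool lcp_says_wait)
    {
        if (!lcp_says_wait || !e->has_value)
            return false;
        if (load_lsn <= e->mrdl)        /* value came from a logically later store */
            return false;
        if (e->fwd_bytes < load_bytes)  /* partial forwarding: ignore the value    */
            return false;
        return true;                    /* last forwarding wins; use e->value      */
    }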

A conventional ROB entry may not contain a field for recording a store’s address (since

this was kept in the SQ), and so we can either add this field as shown in the example of

Figure 38, or we can allocate two ROB entries for each store instruction: one for the STA

and one for the STD. In our simulations, we make store instructions allocate two ROB entries for the STA and STD µops instead of adding an address field to every ROB entry.

This does not increase the storage requirements of the ROB, but it can place more pressure on the ROB entries.

5.4 Evaluation

Similar to the work on Store Queue Index Prediction, our goal is to provide performance that is competitive with realistic fully-associative store queue designs while providing the implementation benefits of not needing a store queue at all.

5.4.1 Methodology

We use cycle-level simulation to evaluate the performance of the Fire-and-Forget forwarding technique. We use the MASE simulator [38] from the SimpleScalar toolset [3] for the

Alpha instruction set architecture. We modified MASE to include pre-commit load re-execution, and we also implemented the SVW filter, the SQIP-based non-associative store queue, and our FnF mechanism. We simulated an aggressive out-of-order microarchitecture configured similarly to the baseline machine used in the SQIP study. Specifically, the baseline configuration has a 512-entry reorder buffer, a 300-entry issue queue (RS), a 128-entry load queue, a 64-entry store queue, and a 20-stage pipeline. The processor is 8-wide throughout (fetch/dispatch/issue/commit), with a 4K-entry gshare branch predictor and 4K-entry/4-way BTB, 4 integer ALUs, 1 integer multiplier, 4 data cache ports (2 load, 2 store), 2 FP adders, 1 FP complex unit, 16KB/4-way/3-cycle L1 data and instruction caches, a 512KB/8-way/10-cycle unified L2 cache, 250-cycle main memory latency, and 64-entry/fully-associative instruction and data TLBs.

We simulated a variety of applications from SPEC2000 (all programs), MediaBench [39],

MIBench [29], Graphics applications including 3D games and ray-tracing, and pointer-intensive benchmarks [4]. All SPEC applications use the reference inputs. For applications with multiple reference inputs, we weight the separate runs such that each individual benchmark counts equally in the final comparisons.8 To reduce simulation time, we used

the SimPoint 2.0 toolset to choose representative samples of 100 million instructions [56].

Sizing SQ Structures: Our FnF processor and the SQIP approach both make use of SVW

as the underlying load re-execution filter. We make use of slightly larger SVW structures

than used in the SQIP study [67] because we use the reference input sets for SPEC which

results in larger memory footprints for several benchmarks. In particular, we use a 4K-entry

SSBF with 16-bit SSNs (8KB state total) with a similarly configured 4K-entry SPCT using

8-bit partial store PCs (4KB). The total non-SQ overhead for the baseline SVWNLQ is

8For example, eon has three input sets, and so each would receive a weight of 0.33, whereas mcf has only a single reference input and we assign it a weight of 1.0. The encoding and decoding phases of the MediaBench applications are regarded as separate programs.

Figure 39: Relative IPC performance over a fully-associative SQ.

12KB. We used the SQIP structure sizes suggested by Sha et al. [67]: a 256-entry SAT (½ KB), a 4K-entry FSP (10KB) and a 4K-entry DDP (12KB). The total non-SQ overhead of SQIP over SVW is 22½ KB, plus the additional state required for SAT logging. Fire-and-Forget has three tables in addition to the SVW structures. We add a field to the 4K-entry SPCT to record the MRDL for each store. The MRDL is an LSN, which we set to be the same size as the SSNs (16 bits), and therefore adds 4K entries × 16 bits = 8KB to the SPCT. The LDP stores load distances, which need not be larger than the number of LQ entries. The baseline configuration uses a 128-entry LQ and we allot 4K entries for the LDP, so 3½ KB of hardware is sufficient. The last structure is the LCP, which is a simple array of 1-bit entries. We use a 2K-entry LCP that only requires ¼ KB. The total non-SQ overhead of FnF over SVW is 11¾ KB, which is 48% less state than SQIP, not even accounting for the elimination of the store queue or the SAT logging overhead. For designs that do not include storage in the LQ for values, there is the additional overhead of 64-bit data associated with each LQ entry, which adds 1KB to the original LQ hardware budget.
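For reference, the FnF total above can be reconstructed from the individual structures; the 7-bit LDP entry width below is our inference from the 128-entry LQ, and the rest follows directly from the sizes quoted above.

    \begin{align*}
    \text{MRDL field in SPCT} &: 4096 \times 16\ \text{bits} = 8\ \text{KB} \\
    \text{LDP}                &: 4096 \times 7\ \text{bits}  = 3.5\ \text{KB} \\
    \text{LCP}                &: 2048 \times 1\ \text{bit}   = 0.25\ \text{KB} \\
    \text{Total FnF over SVW} &: 11.75\ \text{KB} \approx 0.52 \times 22.5\ \text{KB (SQIP)}
    \end{align*}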

5.4.2 Performance Results

For performance comparisons, we use a baseline processor equipped with a non-associative load queue, with pre-commit load re-execution to validate load values, and the SVW re-

execution filter as described in Section 5.3. This baseline is a processor that uses the

SVWNLQ scheme, which includes a fully-associative store queue. Figure 39 shows the rel-

ative IPC rates for both SQIP and Fire-and-Forget compared to SVWNLQ, grouped by benchmark suite/category. The per-group geometric means show the average IPC rates for

each group.

Our results verify the claims of Sha et al. that Store Queue Index Prediction provides comparable performance to the baseline without the need for a fully-associative store queue.

Although the performance impact varies between applications, the overall performance is

as good as (actually, 0.4% better than) the fully-associative SVWNLQ implementation.9 Figure 39 also shows that our Fire-and-Forget technique (second bar in the figure) provides a performance improvement of 2.9%. The per-benchmark variations correlate strongly to

SQIP’s variations, which makes sense because both approaches attempt to capitalize on the same phenomenon of the predictability of store-load forwarding relationships. Overall, our results show that SQIP meets or exceeds the performance of the fully-associative SQ on 31 out of our 55 applications, while the rate for FnF is 45 of 55.

For some applications, both SQIP and FnF perform better than the baseline SVWNLQ configuration. In these situations, this is primarily due to a reduction in the number of ordering violations and the subsequent pipeline flushes (e.g., due to SQIP’s delay distance predictor). In a few cases, FnF also performs better than SQIP. The performance benefit of

FnF primarily comes from higher predictor accuracy and a reduction in the store queue capacity pressure since the number of stores in SVWNLQ and SQIP is limited by the SQ size, but in FnF it is limited by the size of the ROB.10 Speculative cloaking can provide

9The original work by Sha et al. reported a 3.3% IPC reduction, but that was in comparison to an ideal oracle-scheduled LQ/SQ design. Our SVWNLQ is similar to their “associative-3” configuration, for which their results also show that SQIP provides about the same performance. We do not expect our results to match perfectly for a variety of reasons including differences in benchmark sets, compilers, sampling methodologies, etc. 10Or half the ROB if STA and STD are split between two entries as discussed in Section 5.3.

some performance benefit as well. There are a few applications where both SQIP and FnF

perform worse than the baseline due to forwarding mispredictions. In the next section, we

will examine the behavior of a few applications in greater detail to better understand how

and why Fire-and-Forget works.

5.4.3 Scalability of FnF: Performance analysis over a medium configuration

The results presented in the earlier sub-section were obtained using a rather large con-

figuration. For a fair comparison of FnF with SQIP, we used the configuration presented in the original paper that proposed SQIP [67]. We also simulated a less aggressive 128-ROB, 32-RS and 32/32-LQ/SQ configuration. In this sub-section, we show that FnF scales well even when a more ‘realistic’ processor configuration is used. Figure 40 presents the relative performance of SQIP and FnF over an associative store queue equipped with the SVW filter. As the machine is scaled down, the trend for both SQIP and FnF stays the same across the various benchmark suites. We see that the difference in performance between an associative store queue with SVW and the other mechanisms shrinks slightly, especially for the MiBench suite. This is probably because a smaller machine exposes fewer forwarding opportunities, and thus all the mechanisms perform similarly. Additionally, the difference in the access times for 32-entry associative and indexed store queues is smaller than that between 64-entry associative and indexed store queues. A specific example of this is the dijkstra benchmark, which showed a significantly lower performance improvement on the smaller machine than on the larger machine.

5.5 Specific Case Studies

5.5.1 A Good Example

Figure 40: Relative performance of FnF and SQIP over the SVW mechanism for a medium-sized machine.

The benchmark gs (ghostscript, which appears in both MediaBench and MIBench although compiled separately and run with different inputs) shows one of the biggest differences in performance between SQIP and FnF. While SQIP’s performance is reasonably close to the SVWNLQ baseline (<3% difference), FnF manages to achieve about a 10% speedup. While many benchmarks exhibit reasonably high forwarding rates [67], accurate prediction of these forwardings (whether using SQIP or FnF) typically does not improve performance

(it just keeps the performance close to that of the associative SQ). For gs, FnF manages to achieve a 98.4% forwarding accuracy (i.e. of all loads that consumed a forwarded value, almost all received the correct value), which avoids the vast majority of pipeline flushes. We also tracked the frequency with which a load that was predicted to consume a forwarded value, received that value before it had even computed its effective address. Almost all

(99.97%) of gs’s forwarded-value-consuming loads issued ahead of their respective address calculations. Applications like gs with store-load communication patterns that are amenable to speculative cloaking can derive a significant performance benefit. Across the entire range of applications, however, the opportunity for cloaking address calculations is limited and in general it plays a smaller role. Nevertheless, FnF provides this functionality “for free”, and so it provides a good cost-benefit tradeoff even if it only helps a few applications.

5.5.2 A Neutral Example

The program bc from the Pointer applications is an interesting case to contrast against gs. We ran bc with three different input sets, but only two exhibited interesting levels of forwarding, and so the following discussion is in the context of only those two runs. The

bc program achieves high forwarding accuracy, but most of the receiving loads consume the

value after their addresses have been computed. In this scenario, FnF does not provide any

direct benefit over the other techniques that wait until addresses have been computed. We

also observed that 6.9% of bc’s forwarded-value-consuming loads actually received a value

from the wrong store (either a store with a different address, or not the most recent store

to the correct address), but still ended up receiving the correct value. The combination of

frequent value locality [88] (especially for the value zero) and the value-based approach of

the load re-execution mechanisms makes this possible.

5.5.3 An Extreme Example

For the majority of applications, our FnF technique achieves performance that is within

5% of the SVWNLQ baseline. However, for a few benchmarks, the performance falls far outside this range. This is particularly true for the MIBench embedded benchmarks. While the MIBench programs may not be representative of typical high-performance workloads, we

still find them useful to include as they exercise the processor in different ways to expose

corner-case behaviors, and many of the programs are small enough to facilitate straight-

forward source-level or assembly-level analysis. For example, crc32 results in a very large

performance speedup for both SQIP and FnF. We analyzed the simulation statistics and the

crc32 source code and assembly and found that there is a repeating load-op-store sequence

to the same address of the form (*x)++. This is embedded in a very tight loop with a

highly predictable store-to-load distance. Due to the small size of the loop, the processor

will typically have multiple loop iterations in-flight at once, resulting in multiple instances

of stores to the same address. When the SVWNLQ configuration issues the dependent loads, the associative scan of the SQ will often find a matching store, but this match is usually for

a store in the wrong loop iteration (i.e. a load in iteration i should read from the previous

store in iteration i-1, but if that store address had not yet been computed, then the load

could have matched against the store from iteration i-2 or earlier). The large number of

wrong-iteration forwardings results in a substantial number of pipeline flushes, whereas

both SQIP and FnF correctly identify the relationships between specific loads and stores

and thereby avoid misforwardings.

5.5.4 A Misbehaving Example

The worst performing application is fft, also a small program from the MIBench suite, which experiences a 20% slowdown for both SQIP and FnF. The main loop of the program only contains about two dozen unique store instructions, but two of these stores happen to have PCs that end up aliasing in the SPCT and they execute in a strictly alternating

fashion. Furthermore, one store always forwards to the immediately following load, whereas

the other store is separated from its consuming load by a code hammock, such that the distance in either SQ or LQ entries between the two instructions varies depending on the dynamic control flow. We verified that this performance anomaly disappears when we increase the size of the SPCT. Most applications actually have more aliasing than fft, but in these cases there are enough unique forwarding behaviors that both the SQIP and FnF predictors can be negatively trained to not speculate so much. The problem with fft is the strict alternation between the two stores with different behaviors. Another solution for fft’s performance deviation is to implement a 2-way set-associative SPCT, which could provide more robustness against pathological memory communication patterns.

5.6 Power, Area and Latency considerations

While the earlier sections showed the performance improvement and the implementation ease of FnF, in this section we concentrate on the power and energy benefits of FnF. To estimate the various metrics we used CACTI 4.1 which was modified to simulate the various structures [81]. For all calculations we use a 65 nm technology and 0.9V supply voltage.

As a baseline we simulated a 64-entry fully associative SQ and a 128-entry non-associative

LQ augmented with the SVW filter. For the SQIP design, a 64-entry indexed store queue was simulated along with the other structures required for forwarding, as described in Section 5.2.2.3. The energy savings provided by an indexed store queue are about 42% compared to a traditional associative store queue (due to partial tags, the address CAM of a fully-associative store queue is only 12 bits wide rather than the full 40 bits of physical address). Our simulations indicate that the implementation of FnF provides an energy savings of about 50% compared to the baseline case. This result is not surprising, as there is no store queue at all in FnF; the power and energy expended are due only to the relatively small structures described in Section 5.3. This analysis of FnF includes the extra storage required in the load queue to hold the forwarded values. The power requirements of the FnF mechanism are 13% less than those of the SQIP mechanism. Note that since the store queue energy accounts for less than 2% of the entire processor energy [67], the energy impact of both the SQIP and the FnF designs was not explored further. Since the per-access load latency of a scheduling mechanism like FnF is difficult to compute (most of the structures are accessed off the critical path and their accesses can be overlapped), it is not presented here.

5.7 Speculative Stores to Improve FnF Efficiency

Techniques like FnF and SQIP work well because store-to-load forwarding is infrequent.

In fact, the Load Distance Predictor or LDP in the FnF technique only holds entries for those stores that forward values to loads; hence fewer forwarding occurrences can enable us to either improve the accuracy of prediction or reduce the size of the predictor. In this section, we explore an extension to the FnF technique that allows store instructions to write their values to the data cache earlier in the pipeline rather than at commit. This speculative write reduces the probability that a load would receive stale data from the cache when it issues, thus reducing the probability that it requires forwarding from an in-flight store. Using this approach we may be able to maintain forwarding information for fewer stores which can help improve the prediction accuracy and overall performance of the FnF technique or reduce the size of the predictor.

Our proposed technique involves determining exactly when the store is ready with its data and writing the data to cache at that time. This is in contrast to the conventional cache write mechanism, which happens only when the store instruction commits (actually a cache write happens only when the store reaches the head of the store buffer which may be several cycles after the store commits). Since the store write is executed several cycles early, our Speculative Stores (SS) technique has the potential to reduce the store-load forwardings

thereby improving the stability and performance of the predictor, reducing the hardware required for the predictor, and even reducing the number of load re-executions (the size of the store vulnerability window shrinks since the oldest committed store is now a younger store).

Figure 41: Working Example for Speculative Stores. (a) shows the operation when simple SVW is applied and (b) shows the same example when SVW is augmented with SS.

This cache write, however, is speculative, and conditions such as branch mispredictions need to be taken into account. To implement SS, we augment each cache line with a speculative

bit. This bit is set on a cache write that is performed speculatively. When a store actually commits, a check is performed to see whether a younger store speculatively wrote to the same address. If no younger store speculatively wrote to the same address then only the speculative bit needs to be cleared to indicate that this data is not speculative anymore.

The speculative bit is also used to ensure that a speculative cache line is not evicted to the L2 cache on a miss, where it could be observed by all the other processors in a multi-processor environment.

Our technique of SS is built on top of and makes use of the SVW technique. To understand the design, let us consider a working example for both SVW and SS. Figure 41(a) shows an example when just the SVW technique is applied and (b) shows an example when the SS technique is applied in conjunction with SVW. In snapshot 1, in the SVW scheme, the last store to retire when LD C is fetched is ST A with SSN 21. Thus, this becomes the SVW of the LD. Since the load executed in the presence of an older unknown store it must undergo the test for re-execution. When ST C with SSN 23 commits, in snapshot 2, it writes its SSN to the SSBF at address C. At the time of the re-execution test in snapshot

3, the test indicates that the load’s SVW is older than the last store to have written to the SSBF, and hence the load must re-execute. In the SS scheme, we write stores to the cache earlier; hence, at the time the load is fetched, the last store to have speculatively written to the cache is ST C with SSN 23. The load therefore assigns 23 as its SVW and reads the value from the cache. As indicated by the figure, this is in fact the correct value for the load, and hence this load escapes re-execution.

Figure 42: Store Re-execution in SS.

Some issues regarding WAW violations can come up when we consider out-of-order cache updates. To maintain correctness we constrained our system to commit stores in order with respect to each other, but out of order with respect to other instructions. Even with this restriction there are scenarios where complications could arise. Consider one such scenario as shown in Figure 42. In this scenario LD D executes in the presence of older unknown store and hence must check for re-execution. ST D with SSN 24 cannot write to cache due to in order commit of stores with respect to each other. As the execution progresses,

(snapshot 2) ST D with SSN 23 resolves its address and writes its data to cache. Later ST

D with SSN 24 also writes to cache. In the second snapshot, when the load re-executes, it would now get data from the wrong store (SSN 24). Hence, ST D with SSN 23 must first re-execute and write its value to the cache before the load re-executes. Now LD D re-executes and gets the correct data. However, after this, in snapshot 3, ST D with SSN 24 also has to re-execute to make sure the data in the cache is updated. Hence in the SS scheme, we now have one load re-execution and two store re-executions as opposed to just one load re-execution in the SVW scheme.

As seen above, we may have a scenario where a younger store speculatively writes to the cache before an earlier store actually commits. Here we need to re-execute the older store at commit to ensure the cache has the correct value. To implement store re-execution, we need to associate a counter with every address. This counter can be stored in the SSBF itself. On a speculative write to an address, the counter is incremented. When the store actually commits, the counter is checked to see if it is > 1. If so, it means that a younger store has speculatively written to this address, and the committing store has to re-execute to ensure the cache has the correct data. The counter is then decremented to indicate that a speculative store has been committed. If the counter is 1 after decrementing, then

the last store is marked for re-execution as well to make sure the most recent data will be in the cache (since the re-execution test will not catch this store). These store re-executions may compete with actual cache accesses and load re-executions for bandwidth, and might reduce some of our performance benefit.
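The commit-time bookkeeping just described can be sketched in C as follows; this is only a hedged model of the counter and speculative-bit handling, and all names (ssbf_lookup, mark_remaining_writer, and so on) are invented for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define SSBF_ENTRIES 4096

    /* Hedged sketch of Speculative Stores (SS) commit bookkeeping; the SSBF
     * counter field and all helper names are illustrative assumptions.      */
    typedef struct { uint8_t spec_writers; } ssbf_entry_t;
    static ssbf_entry_t ssbf[SSBF_ENTRIES];

    static ssbf_entry_t *ssbf_lookup(uint64_t addr) { return &ssbf[addr % SSBF_ENTRIES]; }

    static void reexecute_store_at_commit(void)      { /* redo the cache write (model) */ }
    static void mark_remaining_writer(uint64_t addr) { (void)addr; /* flag last writer */ }

    void on_speculative_write(uint64_t addr, bool *line_spec_bit)
    {
        *line_spec_bit = true;              /* speculative line: may not go to L2   */
        ssbf_lookup(addr)->spec_writers++;
    }

    void on_store_commit(uint64_t addr, bool *line_spec_bit)
    {
        ssbf_entry_t *e = ssbf_lookup(addr);
        if (e->spec_writers > 1)            /* a younger store already overwrote us */
            reexecute_store_at_commit();    /* so redo this write in program order  */
        e->spec_writers--;
        if (e->spec_writers == 1)           /* the one writer left will not trigger */
            mark_remaining_writer(addr);    /* its own check, so flag it explicitly */
        if (e->spec_writers == 0)
            *line_spec_bit = false;         /* data is no longer speculative        */
    }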

We now discuss the benefits of this FnF extension. The third bar in Figure 43 corresponds to the FnF technique augmented with Speculative Stores or SS. As shown in the figure, SS only slightly improves the performance of FnF (1% on average). There are some applications which do benefit from early store commit: eon, lucas, g721.decode and g721.encode all see over 5% performance improvement over FnF. This benefit comes primarily from reduced forwardings and improved prediction accuracy. On the other hand, some applications experience quite a few store re-executions, which cause their performance to drop slightly. mcf from the SPECint suite experiences a 5% performance degradation when SS is applied due to the increased data cache port pressure. Figure 44 shows the performance impact of reducing the predictor sizes for both the FnF technique and the SS technique (FnF augmented with SS). The data points on the x-axis indicate decreasing predictor sizes in terms of number of entries. Our baseline size (which is also the

first data point in this graph) has a 4K-entry LDP and a 2K-entry LCP. The slowdowns

denoted on the y-axis are relative to the specific technique (SS or FnF) at the baseline configuration. As we can see from the figure, SS performs well even at small predictor sizes, whereas the performance of the base FnF falls off when the LDP and LCP are scaled down below 2K entries. This shows that while SS may not help much at the baseline sizes, augmenting FnF with speculative stores can be considered if smaller hardware budgets are required. SS also reduced the load re-executions experienced by the processor by 2%.

Figure 43: Impact of Speculative Stores on FnF. The third bar shows the performance of the SS technique.

Figure 44: Performance of SS and FnF (base algorithm) relative to the baseline predictor sizes.

Figure 45: Implementing Fire-and-Forget with the ROB. Note that the SQ and the LQ have been eliminated.

In SS, however, there is an area cost of adding the speculative bits in the data cache as well as a power cost for the store re-executions that would eat into the area and power benefits provided by the base FnF design. Due to the small performance improvements and considerable area and power overheads, we believe that while being able to write stores speculatively to the cache is an interesting design which provides some benefits at small predictor sizes, it is probably not a worthwhile addition to Fire-and-Forget.

5.8 FnF ROB: Load/Store Scheduling without a Store Queue or a Load Queue

In this section, we present an alternate design for Fire-and-Forget that eliminates the need for both the SQ and the LQ. Our implementation of FnF, so far, predicted forwarding distances in terms of LQ entries. Each entry in the LDP had a distance value that was

equal to the distance from the corresponding store’s most recently dispatched load or MRDL

to its dependent load as described in Section 5.3. We assigned load sequence numbers

(LSNs) for each dynamic load that were used to compute the distance. In this proposed

optimization we will use the ROB, which already has an entry for each load and store, to

compute forwarding distances.

The algorithm for using FnF with the ROB is similar to the LQ-based one. We define an instruction

sequence number or SN which is analogous to the LSN or SSN, except that this sequence

number is a unique and monotonically increasing number assigned to every instruction.

Since the LQ and SQ are not involved in data forwarding in this FnF optimization, we do

not need to keep track of LSNs or SSNs. The distance between a store and its forwarding

load, in this optimization, includes all instructions that fall between these two instructions

(as opposed to the original FnF implementation which measured distances in terms of

loads). When a store is dispatched, its instruction sequence number is recorded in the

store’s ROB entry. When the store commits, this value is recorded in the SPCT (indexed

by store address) along with the store PC. When a load re-executes later and an ordering

violation/missed forwarding is detected, the load will access the SPCT using its effective address, retrieve the corresponding store’s sequence number, and subtract it from its own sequence number to compute the forwarding distance which is then stored in the Load

Distance Predictor or LDP. The next time this static store is encountered, it computes the

ROB entry that it needs to forward a value to by adding the predicted distance (from the

LDP) to its sequence number.
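Before walking through the example, a hedged C sketch of this bookkeeping may help; the table sizes, the hashing, and the structure names are assumptions made for illustration (the 512-entry ROB matches the baseline of Section 5.4.1).

    #include <stdint.h>

    #define ROB_SIZE   512
    #define LDP_SIZE  4096
    #define SPCT_SIZE 4096

    /* Illustrative sketch of FnF-ROB distance training and prediction.        */
    typedef struct { uint64_t sn; uint64_t pc; } spct_entry_t;
    static spct_entry_t spct[SPCT_SIZE];   /* committed store info, by address  */
    static uint16_t     ldp[LDP_SIZE];     /* predicted distance, by store PC   */

    static unsigned pc_idx(uint64_t pc)     { return (unsigned)(pc   % LDP_SIZE);  }
    static unsigned addr_idx(uint64_t addr) { return (unsigned)(addr % SPCT_SIZE); }

    /* Store commit: remember the store's SN (and PC), indexed by its address. */
    void spct_record(uint64_t store_addr, uint64_t store_pc, uint64_t store_sn)
    {
        spct[addr_idx(store_addr)].sn = store_sn;
        spct[addr_idx(store_addr)].pc = store_pc;
    }

    /* Load re-execution detected a missed forwarding: learn the distance. */
    void ldp_train(uint64_t load_sn, uint64_t load_addr)
    {
        spct_entry_t *e = &spct[addr_idx(load_addr)];
        ldp[pc_idx(e->pc)] = (uint16_t)(load_sn - e->sn);  /* counts every instruction */
    }

    /* Dispatch of a later instance of the store: which ROB entry gets the value? */
    unsigned predicted_rob_entry(uint64_t store_sn, uint64_t store_pc)
    {
        return (unsigned)((store_sn + ldp[pc_idx(store_pc)]) % ROB_SIZE);
    }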

We now explain this algorithm with the help of an example as shown in Figure 45.

Part (a) shows four instructions that have been allocated. The oldest instruction in the

ROB has a sequence number (SN) equal to 20. The next instruction is a store with PC

2000 (which is split into the STA and STD µops), followed by an ADD and then a load instruction with PC 2020. These instructions are allocated RS as well as ROB entries.

Unlike earlier examples in this chapter that showed the STA and STD µops occupying

separate entries in the ROB, in this figure we assume that both of these µops are fused

2000, no forwarding information is known at this point and hence this store will not forward any value when it executes. The ROBI field in each RS entry is the corresponding ROB index of the micro-op occupying that RS entry. In (b), the store reaches the head of the

ROB (as indicated by the arrow), commits, and stores its PC (2000) and its SN (20) in the

SPCT accessed by store address (A) so that, if required, future forwarding distance can be computed. In (c), the load (PC 2020) re-executes before commit and detects it received

a wrong value due to a missed forwarding. This load accesses the SPCT with its address

(which is the same as the store address) and retrieves the SN of the store (20) as well as its

PC. The processor then subtracts the store’s SN from the dependent load’s SN to compute a distance of two. As explained earlier, this distance now also includes the ADD instruction

that fell between the store and the load. This distance is stored in the Load Distance

Predictor (LDP), accessed by the store PC, and the Load Consumption Predictor (LCP) is updated to reflect that in the future load 2020 should wait for a forwarded value. In (d), the

learned distance is used to facilitate store-load forwarding. A later instance of store 2000

with SN 42 is allocated. This store accesses the LDP using its PC and finds that it needs

to forward its value to a load which is two instructions away from it. The processor adds

the store’s SN to the predicted forwarding distance and mods it with the ROB size (4) to

compute the ROB entry of the load that should receive the forwarded value. When the store

data micro-op executes from the RS, the store’s value (102) is copied into the ROB entry

corresponding to load 2020. Although not shown in the figure, the load can now broadcast

this value to its dependent instructions. As in the original FnF implementation, the load

address is not yet computed at the time of forwarding, which provides an opportunity

for speculative memory cloaking. Even more importantly, the entire forwarding process is

carried out without a store queue or a load queue.

Figure 46: Implementing FnF without an LQ or an SQ. The third bar shows performance when FnF is implemented using only the ROB.

While this approach allows us to completely do away with both the LQ and the SQ, measuring distances over the ROB means that more instructions lie between a forwarding store-load pair. This could introduce some inaccuracy into the prediction, which could affect performance. Additionally, branches that fall between a store and its forwarded load (not shown in the

example here) may also impact the stability of the predictor. To evaluate the performance of

the FnF-ROB technique we use the same baseline configuration as detailed in Section 5.4.1.

All of the additional structures required for the SVW technique as well as the SQIP and FnF

prediction algorithms use the same parameters as described in Section 5.4.1. The FnF-ROB

design does not use a LQ or a SQ, hence the number of loads and stores that can be allocated

is limited by the ROB size. We compare this technique to the original FnF predictor as well

as the SQIP predictor. The third bar in Figure 46 corresponds to the relative performance

of the FnF technique implemented using a ROB that completely eliminates both the LQ

and the SQ. On average, we see that the performance is nearly the same as the base FnF

design. This is very encouraging, since as explained earlier, there could be some prediction inaccuracy due to the larger number of instructions between a store and load as well as

impact of control flow. Most of the applications, however, still have high enough prediction

hit-rates that offset these problems. Some applications even perform better when using only the ROB to perform the functions of the LQ and the SQ. Since the ROB is larger than the LQ, more loads can be fetched into the instruction window, which can expose more ILP. Many of the MediaBench and MiBench applications see significant performance

improvements due to this phenomenon. Additionally the reduction in area and power due to the elimination of the LQ makes this FnF optimization a very attractive design.

5.9 NoSQ

Sha et al. have independently proposed another store-load communication design, NoSQ, that does not require a store queue [68]. While the goals of the two techniques are similar, namely replacing CAM-based associative logic with scalable designs that provide the same functionality, the intuitions and the proposed designs are quite different. Fire-and-Forget uses the predictability of forwarding distances, as well as the ability to reuse other structures such as the ROB, to implement store-load forwarding without a store queue or a load queue. NoSQ, on the other hand, uses the principles of speculative memory bypassing (SMB), a previously proposed technique that dynamically converts store-load forwarding into register communication. Data communication through memory takes place through a store-load pair: a producer instruction writes data to a register (PROD); this data is written to memory by the store instruction, read back by a load instruction, and then forwarded to the consuming instruction that uses the data (CONS). SMB extends the conventional register renaming process to convert this PROD-store-load-CONS chain into a PROD-CONS chain [52, 53].

The loads that are eliminated from the chain are thus called “bypassed loads”. NoSQ proposes using SMB to predict whether a load can be bypassed, as well as which store it bypasses. Once such bypassing loads are speculatively identified, they skip the out-of-order core completely and are handled in the back-end. Non-bypassing loads obtain their value from

the data cache. Furthermore, since data communication now happens directly from the producer to the consumer, stores do not need to be executed in the out-of-order core either.

NoSQ extends the commit pipeline to calculate the store and bypassed loads’ addresses, commit store instructions to cache as well as verify correct data communication by using

SVW-based filtered load re-execution. Thus while Fire-and-Forget still maintains store-load forwarding and uses the observation that forwarding is a rare occurrence, NoSQ removes all store-load forwarding in the out-of-order core by predicting which memory communications can be converted into register communications. Since stores are executed in the commit stage, NoSQ is able to eliminate the SQ by holding the required entries in the ROB. Loads do not receive any values from stores either in NoSQ; hence NoSQ does not really require even a LQ.

The basic difference between FnF and NoSQ is the guiding principle behind each data communication approach. As mentioned earlier, NoSQ makes some non-trivial modifications to the register renamer to link the PROD to the CONS. Specifically, the register alias table or RAT is extended in a Store-Sets fashion so that the output physical register of the PROD can be linked to the input physical register of the CONS. FnF, as described in this chapter, only uses the ROB to hold the store and load entries and, even then, does not modify the structure of the ROB. Also, because NoSQ completely eliminates value forwarding from a store to a load, its prediction mechanism to identify bypassing loads and stores needs to be very accurate. Their work shows that the predictor alone requires 10KB of hardware. In comparison, the predictor tables in FnF occupy less than 4KB of hardware.

NoSQ, however, completely eliminates out-of-order execution of stores and is able to reduce the number of loads that access the cache (most of the bypassed loads never access the cache). Hence, while both techniques achieve the objective of store-load communication without using a load queue or a store queue, if one does not want to incur the additional complexity of RAT modifications and memory bypassing required by NoSQ, FnF is likely to be the more feasible approach.

5.10 Conclusions

In this chapter, we have exploited the common-case property of predictability in data forwarding patterns to improve the scalability of the processor, thereby improving its efficiency. Our optimizations to the base Fire-and-Forget (FnF) design have different impacts on the performance and efficiency of the algorithm. Speculative Stores is attractive as a concept and has the potential to significantly improve performance; unfortunately, the effects of store re-executions and the added hardware for maintaining the counters reduce its actual performance benefit. FnF using the ROB, on the other hand, performs surprisingly well. Not only does this design eliminate both the SQ and the LQ, it also improves the performance of some applications by removing the LQ and SQ size constraints and allowing more ILP to be exposed. This optimization simplifies the hardware and is easily scalable to larger microarchitectures as well (since only the ROB needs to be scaled up).

Fire-and-Forget provides several simplifications to the processor’s load-store scheduling hardware. We have already pointed out the elimination of the LQ and SQ, but there are other implementation benefits of FnF. All of our predictor/tracking structures are accessed off of the main load and store critical paths and therefore are not likely to impact processor timing/clock frequency. Furthermore, FnF only updates these tables in program order at commit, hence no recovery or checkpointing mechanism for the predictor tables is required.

The area and energy savings provided by the FnF technique of store-load forwarding (due to the elimination of the store queue) in conjunction with the performance improvement and scalability provided by the predictor help to improve the overall processor efficiency.

CHAPTER VI

LCP: LOAD CRITICALITY PREDICTION FOR EFFICIENT LOAD INSTRUCTION PROCESSING

6.1 Introduction

In this chapter we exploit the common-case behavior of predictability in load instruction criticality to improve the performance while reducing the power consumption of the processor, thereby improving its efficiency [77]. Conventional load optimization techniques are applied to all loads in a uniform manner. Optimizing every load, however, may not always provide performance benefits. Consider an addition instruction with two operands, both of which come from loads. The left operand always hits in the cache, while the right almost always misses. If we can accurately predict the value of the left operand, for example using a value prediction optimization technique, it will have no impact on performance because the dependent addition will still have to wait many more cycles for the right operand to return from memory. To improve upon this situation, researchers have proposed techniques to dynamically predict which loads are most likely to be critical, and then only attempt to perform optimizations for these loads [14, 23, 84]. In our work, we design a simple yet effective load criticality predictor and then propose and evaluate several optimizations that use this criticality information. Our approach to load criticality prediction uses a heuristic that is easy to implement in hardware, making the predictor cost fairly low. We present results showing that our optimizations that use LCP, or load criticality prediction, significantly improve the performance of the processor while decreasing the power and area costs of several hardware structures.

While prior research has shown that the criticality of instructions, and loads in particular, can be accurately predicted, the hardware applications of criticality prediction so far have been limited to specialized or niche microarchitectures. The first primary application area was for dynamic value prediction [14, 23, 84], which after years of intense research interest does not appear to have any traction with respect to adoption in a commercial processor.1 The other primary application area is for clustered microarchitectures with cross-cluster bypass penalties [60, 23, 84] or asymmetric power or performance characteristics [64, 22, 44, 7], neither of which will be used in any known future main-stream commercial processors.2 Other proposed applications include the Non-Critical Load Buffer [24] and the

Penalty Buffer [6], both of which augment the DL1 cache with an auxiliary buffer that stores data accessed by non-critical loads. Although the proposed buffers are relatively small, adding any new structures to the DL1 processing path is very complex as memory accesses must get routed to multiple structures, results require extra multiplexing, and it also represents one more structure that must be snooped for maintaining cache coherence.

In our research, we use load criticality information in six simple criticality-based applications that can be directly applied to modern, conventional x86-based microprocessor organizations. Our applications are divided into three categories based on where in the load’s execution pipeline they are applied. The first class involves the data cache port of a processor, where we use criticality predictions to make a single-load-port data cache perform almost as well as an ideal dual-load-port version. The second class deals with memory disambiguation and store-to-load forwarding. The third class includes applications that modify the placement of data in the Level 1 data cache (DL1) to increase the hit-rates of critical loads. While the load-port optimization for the DL1 has the greatest performance benefit, once the overhead for the criticality predictor has been paid for, implementing any

(or all) of the remaining optimizations are effectively “free.” The additional techniques only involve minimal microarchitectural changes beyond those already used for the load-port optimization. By simultaneously combining several of our criticality-based optimizations, not only is the performance benefit higher (nearly 16% over a conventional processor), but the cost of the hardware predictor also gets amortized across all of these multiple optimizations.

1Even one of the original researchers in value prediction concedes that the technology will not likely see commercial adoption [42], although the work has catalyzed an intense research thrust in value-aware optimizations and microarchitectures. 2The closest exception is the Alpha 21264 which has two integer execution clusters with a +1 cycle bypass penalty between clusters; however, there has been no real further development of this microarchitecture ever since the Alpha line was terminated.

The rest of the chapter is organized as follows. Section 6.2 reviews the concept of load

criticality and provides a brief overview of the most relevant prior research. Section 6.3

describes the design and operation of our proposed load criticality predictor and Section 6.4

describes our simulation infrastructure. We then propose and evaluate several applications

in Sections 6.5-6.7, and we consider the combination of these techniques in Section 6.8.

We also compare our load criticality predictor with another previously proposed criticality

predictor in this section. Section 6.9 summarizes how our LCP technique improves processor efficiency and presents some conclusions.

6.2 Background

In this section, we first briefly review the criticality of instructions and explain the conditions

that make some loads more critical than others. We then briefly describe some of the most

relevant prior studies and highlight some important observations.

6.2.1 Instruction (and Load) Criticality

An instruction is critical if increasing its latency directly causes the overall execution time of the program to increase as well.

Critical loads are simply load instructions that are critical. The reason we focus on loads

is that they often have long latencies due to cache misses in the L1 or L2, and therefore

are far more likely to lie on the critical path of execution. The criticality of loads can

manifest in many ways. First, there is the simple case where the load is on the critical

path in the traditional dataflow sense, as described earlier. Second, delaying loads that

have a mispredicted branch as a child (or a grand-child, etc.) directly delays the detection

and subsequent recovery from the branch misprediction. Third, when a load is the oldest

instruction in the processor and suffers a long latency cache miss, the reorder buffer may

become full and new instructions (regardless of whether they are data dependent on the load

or not) will not be able to allocate resources and execute. On the other hand, not all loads

are critical. A load that hits in the L1 cache may still have a latency of several cycles, but

there may exist other paths through the data-flow graph that render this load non-critical;

that is, the processor may be able to extract enough instruction-level parallelism to hide or

overlap the latency of the load.

The observation that some instructions are critical while others are not has been known

for quite some time [74, 6]. More recently, work has gone into carefully quantifying the

properties of the criticality [73, 85] and slack of instructions [22], as well as the dynamic

predictability of criticality [23, 84]. In the next sub-section, we focus on ways to predict

instruction criticality.

6.2.2 Predicting Instruction Criticality

6.2.2.1 Dataflow-Based Prediction

Fields et al. proposed a token-passing algorithm that develops dependence chains and learns the critical path of the program [23]. The predictor builds dependence graphs based on last- arriving chains of instructions assuming that the critical path is likely to reside along the last-arriving edges. The profiling technique described by Fields et al. determines the critical path by starting at the last instruction and traversing the graph backwards until it reaches the first instruction. The corresponding hardware implementation uses an array that tracks

certain tokens that are distributed randomly at different instruction points. A set of pre-

decided rules builds a chain of last-arriving edges from the root instructions (where the tokens are planted) along which tokens are propagated. All tokens that are propagated

are checked for “liveness” after a certain number of instructions. If a token is alive when checked, the instruction at which the token was inserted is on the path of a last arriving

edge and is deemed to be critical. While very accurate due to its explicit tracking through

the program’s dataflow graph, this approach is difficult to implement in hardware due to

the management of the token-based array, token free list, and the simultaneous tracking of

multiple tokens.

6.2.2.2 Criticality Prediction from Implicit Criteria

Several research proposals for criticality predictors make use of information other than

explicit dataflow to infer criticality in a more heuristic-based manner. Fisk and Bahar pro-

posed two indicators of load criticality [24]. First, loads that cause the processor utilization

to fall below a certain threshold are considered to be critical. The second heuristic considers

the number of instructions that are added to the load’s “dependency chain” over the course of a miss. Determining whether an instruction belongs to the forward slice of the load may

require traversing multiple levels of the dependency graph which is a rather complex task

to perform in hardware. To measure the number of dependents added during the course

of a miss, the predictor needs to first track the number of dependencies at the time when

a cache miss is detected, and again when the miss returns, to check for additional depen- dencies. Their proposed implementation adds counters to the load queue and MSHRs, plus some extra state to track which counter should be updated. Particularly troublesome for

implementation is the need to communicate or broadcast consumer information from the allocation stage to the MSHR entries behind the DL1 cache. Their predictor also tracks whether any branches are dependent on the load (which could require tracking a large num- ber of dependencies in each load’s forward slice), in which case the load is automatically classified as critical. One interesting aspect of these predictors is that a load miss acts as the trigger to start tracking the criticality of an instruction, thereby reducing the number of instructions that need to be tracked.

Tune et al. also proposed several criticality predictors based on simple heuristics [84].

The heuristics consider the age of the instructions, the number of dependents, and the number of dependents made ready. While the heuristics are easy to describe, they are

not all easily implementable in hardware. For example, the “QOldDep” criteria marks

each instruction in the scheduling queue (RS) that is dependent on the oldest instruction

that is in the queue as critical. This requires first knowing the oldest instruction (which

may change every cycle), and then somehow broadcasting the information to all other

instructions so that they can be marked as critical if they are dependent. Another heuristic,

“QCons” checks each issuing instruction and marks the one whose result is used by the most

instructions in the queue that cycle as critical. Implementing this in hardware would be

expensive as it requires some way to immediately detect the number of physical register tag

matches and then compare which instruction caused the most.

Srinivasan et al. proposed a more specific heuristic-based predictor that considered the

instruction type of the dependents [73]. Their scheme proposes three conditions to classify a

load as critical: 1) any of its dependents is a mispredicted branch, 2) any of its dependents is

a load that misses in the cache, 3) the number of independent instructions issued (within a

certain number of cycles) following this load is below a pre-determined criticality threshold.

The predictor table used in this proposal is comprised of dependency vectors, one per

ROB-entry which tracks the number and type of instructions waiting for a specific load.

This dependence vector increases with both the ROB size and the LQ size and can be a considerable area overhead.

6.3 Our Load-Criticality Predictor

Many of the previously proposed criticality predictors, while effective, are fairly complex to

implement directly in hardware without severely impacting one or more major functional

unit blocks (e.g., the out-of-order scheduling logic). In this section, we describe our simple

yet effective criticality predictor used in this study.

6.3.1 Basic Operational Description

As a base principle for our load criticality prediction algorithm, we use the observation made by Tune et al. [84] and Fisk and Bahar [24] that a load instruction with many dependents is likely to be on the critical path. A simple statistical explanation for this is that an

instruction with many consumers has that many more chances for one of them to be on the critical path, thereby making the original parent more likely to be critical as well. For our algorithm, the prediction hardware includes two main components: a Consumer Collection

Logic (CCL) and a Critical Load Prediction Table (CLPT).

6.3.1.1 Consumer Collection Logic

The CCL is responsible for collecting the number of direct consumers (i.e., the load’s children, but not its grand-children or other deeper descendants) that each load has while the load is in the instruction window. The CCL inspects every allocated instruction: if the instruction is a load, the CCL sets a bit associated with the logical register that this load writes to, indicating that the instruction which wrote to this register is a load. This bit is concatenated to the physical register mapping stored in the register alias table (RAT).

Additionally, the CCL resets a counter in the ROB entry corresponding to the load. If

the instruction is not a load, then the CCL reads the input operand mappings from the

RAT and determines if the parent is a load. Note that intra-group dependencies are also automatically handled by the existing intra-group dependency check and bypass logic, and we simply piggy-back on the regular RAT reads and writes so we do not introduce the need for any additional RAT ports. If the source is a load, we increment the counter associated with load’s ROB entry. For implementation purposes, the counters in the ROB may be stored in a separate table so as to not impact the critical timing paths through the ROB.

6.3.1.2 Critical Load Prediction Table

When a load instruction commits, the consumer count that is associated with its ROB entry is written to the CLPT, which is a simple, untagged, PC-indexed table of k-bit counters.

For each load that we allocate, we use the load’s PC to access the CLPT to determine the number of consumers it had the last time we saw the same load. Based on the consumer count associated with its entry, a load is predicted to be critical or not.

For a single application of load criticality, each entry of the table would only need to have a bit to store whether the load is critical or not, which in turn would be determined by whether the number of consumers exceeded some pre-specified threshold. In this work, we examine multiple applications of load criticality where each technique may have different criticality criteria (i.e., different thresholds). By storing the consumer count directly in the

CLPT, we can reuse the same prediction structure across all of our optimizations, even if each optimization requires a different criticality threshold.
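To make the division of work between the CCL and the CLPT concrete, the sketch below models the allocate-time and commit-time actions in software. It is a minimal illustration rather than the actual hardware: the structure layouts, the number of logical registers, and all function names are assumptions made for the example (the 1,024 CLPT entries and 96 ROB entries match the parameters used later in this chapter), and the CLPT index is simply the low-order PC bits.

    #include <stdint.h>
    #include <stdbool.h>

    #define CLPT_ENTRIES     1024      /* untagged, PC-indexed table                         */
    #define ROB_ENTRIES      96
    #define NUM_LOGICAL_REGS 16        /* illustrative                                       */

    typedef struct { uint8_t consumers; } clpt_entry_t;   /* k-bit consumer counter          */

    typedef struct {
        bool    is_load;
        uint8_t consumer_count;        /* CCL counter, reset when a load allocates           */
    } rob_entry_t;

    typedef struct {
        uint16_t rob_id;               /* physical mapping (the ROB ID serves as the preg)   */
        bool     writer_is_load;       /* the extra "load bit" added to each RAT entry       */
    } rat_entry_t;

    static clpt_entry_t clpt[CLPT_ENTRIES];
    static rob_entry_t  rob[ROB_ENTRIES];
    static rat_entry_t  rat[NUM_LOGICAL_REGS];

    /* Allocating a load: reset its ROB counter, mark the RAT, and read the CLPT
     * to obtain the consumer count observed the last time this PC was seen.      */
    uint8_t ccl_alloc_load(uint64_t pc, uint16_t rob_id, int dest_reg)
    {
        rob[rob_id].is_load          = true;
        rob[rob_id].consumer_count   = 0;
        rat[dest_reg].rob_id         = rob_id;
        rat[dest_reg].writer_is_load = true;
        return clpt[pc % CLPT_ENTRIES].consumers;   /* compared against a threshold */
    }

    /* Allocating any other instruction: for each source whose RAT entry says the
     * producer is a load, increment that load's consumer counter in the ROB.      */
    void ccl_alloc_other(const int *srcs, int nsrcs, int dest_reg, uint16_t rob_id)
    {
        for (int i = 0; i < nsrcs; i++)
            if (rat[srcs[i]].writer_is_load)
                rob[rat[srcs[i]].rob_id].consumer_count++;
        rob[rob_id].is_load          = false;
        rat[dest_reg].rob_id         = rob_id;
        rat[dest_reg].writer_is_load = false;
    }

    /* Committing a load: copy the observed consumer count back into the CLPT.     */
    void clpt_commit_load(uint64_t pc, uint16_t rob_id)
    {
        clpt[pc % CLPT_ENTRIES].consumers = rob[rob_id].consumer_count;
    }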

6.3.2 Example

The following example demonstrates how the CCL collects consumer information, and how the CLPT gets updated. Consider the four instructions listed in Figure 47. Even though the

CCL is responsible for controlling all of the data collection, we do not explicitly show it in the figures for simplicity. For each load, its ROB entry has a counter that tracks the number of immediate consumers. Every RAT entry has a load bit that indicates if the instruction

[Figure 47 steps through this example with snapshots of the ROB, RAT, and CLPT after each operation. The four instructions are:
I0: Load R1, 8      // PC = A
I1: Load R2, 5      // PC = B
I2: Add R3 = R1 + R2      // PC = C
I3: Add R4 = R1 + R3      // PC = D]

Figure 47: Example showing the collection of consumer counts and update to the prediction table.

that writes to the corresponding logical register is a load. Each snapshot corresponds to a

specific operation on a single instruction. The example starts with the CLPT with some

counter values that would have been collected the last time these loads were executed. All

updates in a given step are highlighted with bold text.

1. Alloc I0 In this step, the first load, I0, is allocated. Since it is the first instruction, it is assigned ROB index 0. This ROB ID also serves as the renamed physical register (preg).

The load bit is set in the RAT indicating that the instruction that wrote to register R1 is a load. The consumer counter in the ROB entry is set to 0. In parallel with this, the load

PC A is used to look up the CLPT. The entry in the CLPT has a value of 3 indicating that the last time this load was seen, it had three consumers. Depending on the criticality threshold, this instruction may be marked as critical.

2. Alloc I1 In this step, the second load, I1, is allocated. It is assigned ROB ID 1, and the consumer counter in the ROB entry is set to 0. In parallel with this, the load PC B is used to look up the CLPT, which returns a value of 4 indicating that the last time this load was seen, it had four consumers.

3. Alloc I2 In this step, the first add is allocated. First, the RAT entries that correspond to the source operands of the add are checked to see if either source is a load (note that

this step would have occurred for I0 and I1 as well, but in our example neither of these two loads had register input operands). For this add, the RAT entry for the left operand

indicates that its parent is a load. Using the ROB ID stored in the RAT, we now increment

the corresponding child counter in ROB entry 0 (shown shaded in the ROB). Similarly, the right parent is also a load, and so we also increment the respective counter in ROB entry 1

(also shaded).

4. Alloc I3 In this step, the second add is allocated. Checking the RAT, we see that the left source is a load, and so the counter in ROB 0 is incremented. The right operand is present in the ROB, but since the load-bit in the RAT indicated that the right parent is not a load, no further actions are taken. This add instruction is inserted into the ROB at entry 3.

5. Commit I0 In this step, load I0 commits. We read its counter value from the ROB and copy it to the corresponding CLPT entry to reflect the new consumer count, which is 2. This new consumer count value will be used to predict the load criticality the next time

Load A (or any other load that aliases to the same predictor entry) is encountered.

6.3.3 Issue-Rate Filtering of the Predictor

In the preceding sub-sections, we described and demonstrated the basic operation of our

load criticality predictor. While the number of in-flight consumers of a load is a strong

indicator of its importance, the load’s criticality also depends on the dynamic processor

conditions while the load is in the instruction window. Consider a load with five direct

consumers in the window. Such a load will be considered critical if the criticality threshold

is four. If there are enough independent instructions in the window that can execute in

parallel with this load however, then the latency of the load may not significantly impact

the performance of the processor. We restrict our algorithm to only collect consumer counts

when the processor is in a period of low processor utilization or a “critical period.” The

idea is that when the issue rate of the out-of-order processor is low, then the probability

of finding one or more long-latency, critical loads is higher. Similarly, when the issue rate

is high, most loads will unlikely be critical; even those that are critical will likely have

relatively low tautness (the number of cycles that can potentially be saved before another

path becomes critical [85]).

Restricting the predictor’s training and usage based on issue rate has two advantages.

First, we do not need to monitor loads during periods of sufficient performance. This can

reduce the storage requirements of the CLPT because collecting data for fewer loads reduces

the aliasing in the CLPT. Consider a program with two alternating function calls f(A) and

f(B). Due to data-dependent control flow within the function, f(A) may have six loads that

always hit in the cache while f(B) only has two loads that usually miss. Without filtering,

these eight loads would need at least eight CLPT entries to store all of their criticality

information. Since f(A)’s execution maintains a high instruction throughput due to the

cache hits, criticality-based optimizations of the loads in f(A) are unlikely to provide much

benefit, and so storing the corresponding criticality predictions in the CLPT wastes space.

If the CLPT updates only occur during periods of low performance, then updates from f(A)

will be entirely filtered out, and only the two updates from f(B) will be stored in the CLPT.

The result is that the CLPT only needs two entries to store this information.

A second advantage of filtering CCL/CLPT activity is that it may not be necessary to explicitly account for branch or path history when making criticality predictions.

The two loads in f(B) may have the same PCs as two of the loads in f(A). If the predictors are always updated, then it may be necessary to include, for example, branch history to distinguish between the different instances of the loads. At least in this example, our filtering mechanism would simply drop the updates from the loads in f(A), eliminating the path ambiguity

from the predictor.

To determine the critical period when load instructions should be tracked for criticality,

we make use of the issue rate or the functional unit utilization of the processor. Every cycle

the number of instructions issued is monitored and every s cycles a global or average issue

rate for the processor is noted (an average can be easily computed if s is chosen to be a

power of two). If the processor utilization falls below a pre-specified target issue rate, this

implies that a significant portion of the processor resources are sitting idle waiting for one or more stalls to resolve. When the global issue rate falls below this target issue rate, we enable the tracking of load consumers and subsequent updates to the CLPT.
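The issue-rate filter itself requires only a little bookkeeping. The sketch below assumes a sampling window of s = 64 cycles; the target issue rate of four µops per cycle matches the value given at the end of this section, and the variable and function names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define WINDOW_SHIFT      6                 /* s = 2^6 = 64 cycles (assumed)        */
    #define WINDOW            (1u << WINDOW_SHIFT)
    #define TARGET_ISSUE_RATE 4                 /* train/use the predictor below this   */

    static uint32_t issued_in_window;           /* running sum of issued uops           */
    static uint32_t cycles_in_window;
    static bool     critical_period;            /* gates CCL counting and CLPT updates  */

    /* Called once per cycle with the number of uops issued that cycle.                 */
    void issue_rate_monitor(uint32_t uops_issued)
    {
        issued_in_window += uops_issued;
        if (++cycles_in_window == WINDOW) {
            /* average = sum / s; because s is a power of two this is just a shift */
            uint32_t avg_issue_rate = issued_in_window >> WINDOW_SHIFT;
            critical_period  = (avg_issue_rate < TARGET_ISSUE_RATE);
            issued_in_window = 0;
            cycles_in_window = 0;
        }
    }

    bool predictor_training_enabled(void) { return critical_period; }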

We also include a 2-bit confidence counter for each entry in the CLPT. This counter is incremented if the difference between the previous and current number of consumers is less than ∆ = 2, and decremented otherwise. In other words, the counter is incremented if the number of children is reasonably stable, and decremented if the number varies too much. If this confidence counter is low, then we assume the load is critical regardless of the number of consumers stored in the CLPT entry. To maintain implementation simplicity, we do not repair the CLPT (or update consumer counts) on pipeline flushes, as the design overhead to repair consumer counts due to wrong-path consumers is non-trivial. This could (slightly) impact the accuracy of the criticality prediction, but our confidence counters can protect against these cases.
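The confidence mechanism is an ordinary 2-bit saturating counter per CLPT entry. The sketch below shows the commit-time update with ∆ = 2 as above; treating a confidence value below two as "low" is an assumption, as are the field and function names.

    #include <stdint.h>
    #include <stdbool.h>

    #define DELTA 2

    typedef struct {
        uint8_t consumers;    /* 4-bit observed consumer count          */
        uint8_t confidence;   /* 2-bit saturating counter, 0 through 3  */
    } clpt_entry_t;

    /* At load commit: compare the new consumer count to the stored one,
     * adjust the confidence, then record the new count.                 */
    void clpt_update(clpt_entry_t *e, uint8_t new_count)
    {
        uint8_t diff = (new_count > e->consumers) ? (uint8_t)(new_count - e->consumers)
                                                  : (uint8_t)(e->consumers - new_count);
        if (diff < DELTA) {                      /* stable: more confident   */
            if (e->confidence < 3) e->confidence++;
        } else {                                 /* unstable: less confident */
            if (e->confidence > 0) e->confidence--;
        }
        e->consumers = new_count;
    }

    /* A low-confidence entry is treated as critical regardless of its count. */
    bool predict_critical(const clpt_entry_t *e, uint8_t threshold)
    {
        return (e->confidence < 2) || (e->consumers >= threshold);
    }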

We use the following parameters for our load criticality predictor. Our largest criticality threshold used is twelve, and therefore the child counters associated with each ROB entry only use four bits. The exact threshold values were empirically chosen; for different microarchitectures or cache hierarchies, different thresholds may be used. The CLPT has

1024 entries, each of which consists of a four-bit field that stores the observed number of children plus the two-bit confidence counter. We also include another one-bit

field that will be described in Section 6.5. The total amount of state required for the CLPT is 1024 entries × 7 bits per entry = 896 bytes. Even when including the extra bit per RAT entry and the per-ROB entry 4-bit consumer count, the total state is still under 1KB. Our simulated processor has a peak execution width of six µops per cycle, and we only enable the predictor when the actual issue rate is less than four.

While previous studies have proposed predictors that are similar in spirit, the specifics of how to build hardware to collect the required information have been omitted; many of the proposed heuristics (e.g., the QOldDep policy that marks children of the oldest instruction in the RS as critical [84]) are easy to add to a simulator, but non-trivial to build a practical circuit for. We have provided a detailed explanation of the microarchitecture-level implementation of our load criticality predictor. The modifications are few (one extra bit per

RAT entry without requiring any extra ports, the ROB counters which can be implemented as a separate table to avoid impacting ROB area and timing, the simple PC-indexed CLPT which can be easily accessed in the processor front-end off any critical timing paths, and a few bits in the RS or LDQ entries to record the actual criticality predictions), which leads us to conclude that the predictor is practical to implement.

6.3.4 Intuition for Predictor Effectiveness

Our predictor works on the assumption that when a load has a larger number of consumers, the probability that at least one of its consumers lies on a critical path increases.

For every committed load, we tracked the number of consumers as well as the instruction types of each consumer. Figure 48 shows the breakdown/percentage of consumer instruction type for loads with varying numbers of consumers. We separate the instruction types into six categories: low latency (instructions like add, subtract, multiply and so on), cache hit (loads that hit in the DL1 cache), correct branch (correctly predicted branches), cache miss (loads that miss in the DL1 cache), mispredicted branch, and long latency (instructions like divide and other complex instructions). The first three categories can be regarded as instructions that will not experience very long latencies and are therefore unlikely to significantly impact processor performance. The last three categories are instructions with long latencies and are very likely to be critical.

At first glance, it does not appear that many consumers are likely to be critical (cache miss, branch misprediction or long latency). Even in the cases where a load has twelve consumers, only about 20% fall into one of the likely-to-be-critical categories. This twelve-child load, however, only needs one of its children to be critical for itself to be critical. The probability that each individual consumer is not critical is 80%, and so the probability of all twelve of the load’s consumers not being critical is only 0.8¹² ≈ 6.9%. In practice, this probability may be even lower since we only consider the load’s immediate consumers, whereas some of the low-latency operations may in turn have critical consumers.
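Stated as a formula, if each individual consumer is critical with probability p and the consumers are treated as independent (an idealization), then for a load with n consumers:

    \[
      \Pr[\text{no consumer is critical}] \;=\; (1-p)^{n},
      \qquad\text{e.g.}\quad (1-0.2)^{12} = 0.8^{12} \approx 0.069 \approx 6.9\%.
    \]

Even a modest per-consumer criticality rate therefore makes a high-fan-out load very likely to be critical.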

Figure 48: Distribution of consumers (for different consumer count values) according to instruction type.

The observed consumer instruction-type characteristics combined with this statistical

argument provide a simple and intuitive explanation for why our predictor is able to perform

very well. Srinivasan et al. proposed a predictor that explicitly used an instruction’s type to

determine criticality. Our results suggest that a load’s consumer count alone can implicitly

lead to the same final conclusion about whether a load is likely to be critical or not.

6.4 Simulation Methodology

Our simulation infrastructure uses a cycle-level model built on top of a pre-release ver-

sion of SimpleScalar for the x86 ISA [3]. This simulator models many low-level details of a modern x86 microarchitecture including a detailed fetch engine, decomposition of x86

macro instructions to micro-op (µop) flows, micro-op fusion, execution port binding [72], and detailed cache architectures including all inter-level buses, MSHRs, etc. Our baseline configuration is loosely based on the Intel Core 2 [19], employing a 96-entry reorder buffer,

32-entry issue queue (RS), 32-entry load queue, 20-entry store queue, and a 14-stage minimum branch misprediction latency. The processor can fetch up to 16 bytes of instructions

per cycle (properly handling x86 instructions that span cachelines as two separate fetches) using a 5KB hybrid branch predictor [49], is 4-wide throughout its in-order pipelines (de-

code/dispatch/commit), and 6-wide at issue, with three integer ALUs, an integer multiplier,

one load port, one store port, one FP adder and one FP complex unit (some of these units

are bound to the same issue ports, which is why the number of functional units exceeds the issue width). We employ a load-wait table based memory dependence predictor [35]. The processor has 32KB, 3-cycle, 8-way L1 caches, a 4MB, 16-way, 9-cycle L2 cache, and DL1 and L2 hardware “IP-based” prefetchers [19]. Main memory has a simulated latency of 250 cycles.

We make use of programs from the following suites: SPECcpu2000 and SPECcpu2006,

MediaBench [39, 25], BioBench [1] and BioPerf [5]. All simulations warm the caches and

branch predictors for 500 million instructions and then perform cycle-level simulation for

100 million instructions. While 100 million instructions may seem rather short, we verified

our results against runs of 500 million instructions and the results were nearly identical.

We use the Sim-Point 3.2 toolset to choose representative samples [30]. Some applications

do not yet run on SimpleScalar/x86 due to unsupported system calls and libraries.

6.5 Optimization Class 1: Faking the Performance of a Second Load Port

6.5.1 Problem Description

For modern out-of-order processors, a data cache with only a single read port may limit

performance. Providing additional read ports is not simple because the area of SRAM

arrays (such as the DL1 cache) increases quadratically with the port count. Besides the

additional die area, the larger area increases the circuit path lengths which results in both

higher latencies and higher power consumption. In an industrial simulator that we had

access to, a second load data port increased the total active power of the processor by more

than 13%. Furthermore, to feed two load ports, the main execution cluster may need an

additional address generation unit (AGU), thereby forcing an increase in the issue width

of the out-of-order execution core. As such, all out-of-order Intel x86 processors since the

Pentium-Pro have only supported a single load port [72, 32, 27]. We would like to have the

performance benefits of a second load port, but we do not want to pay the area and power

costs of physically adding another port to the cache array and increasing the out-of-order

issue width to accommodate another AGU.

6.5.2 Implementation Details

Our first optimization uses load criticality to prioritize accesses to the single DL1 read port

so that critical loads are given a higher priority. We use the consumer counter stored in the

CLPT to determine how many immediate consumers are predicted to be waiting on a load.

We also track whether a load was ready to issue but delayed due to contention for the data

cache read port (i.e., there was another ready, older load that the select logic chose to issue

instead). We augment each entry of the CLPT with the extra “ready-but-delayed” bit.

Our non-critical load deferring technique works as follows. If the CLPT value for the

load is below the criticality threshold (five for our simulations), and the “ready-but-delayed”

bit is 0, the load is determined to be not critical.
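The resulting per-load decision is a single comparison plus one bit; a minimal sketch, using the threshold of five from our simulations and illustrative structure and function names:

    #include <stdint.h>
    #include <stdbool.h>

    #define FSLP_CRIT_THRESHOLD 5

    typedef struct {
        uint8_t consumers;          /* predicted consumer count from the CLPT            */
        bool    ready_but_delayed;  /* set if this load previously lost port arbitration */
    } clpt_entry_t;

    /* A load may be deferred (treated as non-critical) only if its predicted
     * consumer count is below the threshold AND it was not previously delayed
     * by contention for the DL1 read port.                                      */
    bool load_is_deferrable(const clpt_entry_t *e)
    {
        return (e->consumers < FSLP_CRIT_THRESHOLD) && !e->ready_but_delayed;
    }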

We then modify the select logic to give critical instructions higher priority. The select

logic typically employs an oldest-first policy for choosing a ready instruction to issue. One

way of implementing this is to have each instruction keep a timestamp, where the oldest

instruction has the smallest timestamp [12], as shown in Figure 49(a). Instruction A is the

oldest ready instruction (timestamp = 000₂) and receives the grant to issue.³ To modify the select logic to account for criticality, we simply extend the timestamp by a single bit: critical instructions will have this bit set to zero, while all other instructions will have this bit set

to one. Figure 49(b) shows the same example instructions with the extended timestamps.

Notice, for example, that while load D comes later in program order than A (ignoring the extended bit of the timestamp, D has a timestamp of 011₂ while A has a timestamp of 000₂), the select logic effectively believes that D is the “older” instruction because its extended timestamp has a smaller value (0011₂ for D, and 1000₂ for A). Since we do not actually allow multiple loads to issue to the cache per cycle, this means that we do not require any additional tag broadcast buses for the wakeup/scheduling logic, nor do we need any

³Modifications to handle timestamp wrap-around are addressed in the original work by Buyuktosunoglu et al., and will not be further discussed here.
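The change to the select logic amounts to prepending the (inverted) criticality bit to the age timestamp so that an unmodified smallest-value-first comparison favors critical instructions. The sketch below is a behavioral illustration under the same wrap-around caveat as the footnote above; the three-bit timestamp width comes from the example, and the function names are invented for the sketch.

    #include <stdint.h>
    #include <stdbool.h>

    #define TS_BITS 3   /* width of the age timestamp in the example of Figure 49 */

    /* Extended timestamp: the extra MSB is 0 for critical instructions and 1
     * otherwise, so "smallest value wins" now means critical-and-oldest first.  */
    static inline uint32_t extended_timestamp(uint32_t age_ts, bool critical)
    {
        return ((critical ? 0u : 1u) << TS_BITS) | age_ts;
    }

    /* Oldest-first (now criticality-aware) selection among ready instructions.  */
    int select_one(const uint32_t *ext_ts, const bool *ready, int n)
    {
        int grant = -1;
        for (int i = 0; i < n; i++)
            if (ready[i] && (grant < 0 || ext_ts[i] < ext_ts[grant]))
                grant = i;
        return grant;   /* index of the granted instruction, or -1 if none ready */
    }

With these definitions the example above falls out directly: critical load D gets the extended timestamp 0011₂ (3), which beats non-critical A's 1000₂ (8), so D receives the grant even though A is older.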

[Figure 49 (figure content not reproducible in this extraction): the oldest-first select logic with (a) plain age timestamps and (b) criticality-extended timestamps, together with the issue-port organization (IALU0, IALU1, IALU2, AGUs, LDQ, STQ, bypass network, DL1).]

Figure 50: Deferring non-critical loads to achieve the performance of a second load data port.

6.5.3 Performance Evaluation

We present the performance of this application of load criticality in two scenarios. First

we incorporate non-critical load deferring in a processor configuration with a single DL1

read port and a single load address port and compare this to a processor with a single load

address port and two true DL1 read ports. Figure 50 shows the relative speedup when critical load prediction is used to prioritize load port usage. We present averages across all of the benchmark suites as well as per-benchmark results for the SPEC2000 and

SPEC2006 suites. FSLP (Faking a Second Load Port) corresponds to deferring non-critical loads and PSLP (Pure Second Load Port) shows the performance if a complete additional

DL1 read port is added. As we can see, for almost all of the simulated applications, FSLP is able to match the performance of the PSLP. For a few applications like crafty, eon and bzip2-

06, the performance of FSLP is slightly lower than PSLP; in these applications we observed that many loads were predicted as critical and hence could not be deferred. For mgrid, FSLP slightly outperforms PSLP. In this application, we observed extreme contention for the load ports in the PSLP configuration. In this case, it is better to issue a single critical load rather than simply the two oldest loads, which may be non-critical. For our simulations, the

PSLP results are slightly idealistic because we have not accounted for the latency increase that adding a second read port would cause; in practice, our FSLP’s performance would be closer to, if not better than, a real dual-read-port DL1 cache.

In the second part of this experiment, we simulated the baseline with two load address ports. One might wonder about the potential benefit of a second load AGU since the baseline configuration only has one load data port. Indeed, when we augment the baseline with an additional load AGU, the performance is almost equivalent to the baseline. We provide this comparison point to show that hijacking the store AGU to compute more load addresses is not, in and of itself, of great benefit unless the data ports are better managed.

When the processor has either a second true DL1 read port or prioritizes accesses to a single read port as FSLP does, then the situation changes. In the case of the second DL1 read port, we simply need the additional AGU to keep the data ports busy. Providing two load AGUs and two DL1 read ports improves performance over the baseline by 14.8%.

In the FSLP technique, an extra AGU can expose additional ready loads, both critical

and non-critical, for the LDQ to send to the data cache (i.e., there are more opportunities to reschedule loads based on criticality). As explained earlier, rather than adding another

functional unit for the second load AGU, we use the same functional unit that is used to

compute store addresses. We refer to this strategy as FSLA (Faking Second Load AGU).

The third bar in Figure 50 shows the performance of FSLP augmented with the FSLA

technique (FSLP+FSLA), and the fourth bar shows the performance benefit of having two

complete load data ports and two complete load AGUs (PSLP+PSLA). The PSLP+PSLA

technique here represents an idealized, truly dual-ported scenario. For some benchmarks like

hmmer, this difference is close to 4%, where the dedicated second load AGU prevents loads

from competing with stores for the AGU. Additionally, when there are more ready loads

available, the PSLP technique is able to issue two loads per cycle. Even though some of

these issued loads may be non-critical, they may expose some other independent work that

can keep the processor busy. On average, however, the performance of FSLP+FSLA is

within 1.7% of PSLP+PSLA, which is not much when one considers the low cost of our

FSLP+FSLA compared to the overhead of implementing a true second load AGU and DL1

read port.

6.5.4 Why Does FSLP Work?

The previous sub-section demonstrated the performance benefit that load criticality pre-

diction provides, when used to prioritize access to the data cache port. To understand why

FSLP performs so well, we conducted an experiment that measures the average latency in

cycles from when loads are ready to issue (address computation and memory disambigua-

tion are complete) to when they actually issue. A delay in load issue in this experiment

corresponds to the number of cycles that a load was forced to wait for the DL1 port to

become available. We measure the average latency for both critical and non-critical loads

for three configurations: a conventional DL1 read port, a DL1 read port augmented with

FSLP and two DL1 read ports (PSLP). The aim of this experiment is to see whether the

FSLP technique’s performance benefit comes from being successful in reducing the average

issue delay of critical load instructions. As far as non-critical loads are concerned, they do not generally impact the performance of the processor and hence can be delayed.

Table 7: Average ready-to-issue delay in cycles for critical (CL) and non-critical loads (NCL).

Benchmark    One Load Port       FSLP               PSLP
Suite        CL      NCL         CL      NCL        CL      NCL
BIO          14.8    16.4        3.4     14.4       3.3     6.6
SPECfp       12.1    16.2        4.4     11.5       4.4     8.5
SPECint      15.1    16.3        6.6     13.0       6.3     6.4
MEDIA        23.6    23.9        5.8     14.0       5.6     8.7

Table 7 shows, for all of the application suites, the issue delay in cycles for critical and non-critical loads for the three load port configurations. The first column of the table indicates the name of the suite, and the second and third column give the average latencies for a single DL1 port for critical and non-critical loads, respectively. The issue delay values might seem slightly high since the average value is biased by certain applications which experience very long delays as well as some effects of contention in the MSHRs. The fourth and fifth columns represent a single DL1 port augmented with FSLP while the last two columns correspond to the PSLP configuration. There are two important results that can be obtained from this experiment. First, in all four simulated benchmark suites, the issue delay of critical loads in the FSLP configuration is nearly equal to that of the PSLP configuration. When a load is critical, having a true second load port allows the critical load to issue much sooner. With FSLP, critical loads experience a very similar wait time.

Second, since our FSLP approach still only has one real load port, the average ready-to- issue delay for non-critical loads is significantly higher than in the PSLP case. This has little impact on performance, however, since by definition these loads are not (likely to be) critical. These results explain why our FSLP approach performs nearly as well as the PSLP configuration.

6.6 Optimization Class 2: Data Forwarding and Memory Disambiguation

6.6.1 Data Forwarding

Every load normally performs an associative search of the store queue (STQ) to determine if

it should receive a forwarded value from an older store to the same address. This CAM-based

search comprises a significant portion of the store queue’s power consumption [67]. To avoid performing these associative searches, one alternative is to force loads to wait until all earlier

stores have committed and finished writing back to cache/memory. At this point, a load can

safely issue to the data cache and be guaranteed to receive the most recent, architecturally correct value. Unfortunately, forcing all loads to wait for all earlier stores to commit and writeback can unnecessarily delay load execution. In our experiments, constraining loads

this way caused an average performance degradation of 5% (but a 100% reduction in STQ

searches). We observed, however, that while all loads associatively searched the STQ, on

average only 13% of loads matched in the STQ and received forwarded data.

Using our load criticality predictor, we only allow the critical loads to search the STQ. Only these loads can receive forwarded values from the STQ, while all other non-critical loads must wait until all older stores have written back to the data cache. The idea is that the power cost of repeatedly searching the STQ far outweighs the small potential performance benefit that can be achieved from earlier execution of loads that are not likely to be critical in the first place. Note that this optimization needs to be conservative since

the prevention of even a few critical loads from receiving forwarded data through the STQ

can significantly hurt performance, so we set a low criticality threshold of two. We also

monitor the global issue rate when deciding if a load should be prevented from searching the store queue. If the issue rate of the processor is very low (less than three µops per cycle on average), we allow all loads, irrespective of criticality, to search the store queue.
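Combining the two conditions, the decision of whether an issuing load performs the associative STQ search reduces to the check sketched below, using the threshold of two and the three-µop issue-rate floor given above (function and parameter names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define STQ_CRIT_THRESHOLD   2   /* conservative: search with two or more predicted consumers */
    #define STQ_ISSUE_RATE_FLOOR 3   /* below this average issue rate, every load searches         */

    /* Loads that skip the CAM search instead wait until all older stores have
     * written back to the data cache before issuing.                            */
    bool load_searches_stq(uint8_t predicted_consumers, uint32_t avg_issue_rate)
    {
        if (avg_issue_rate < STQ_ISSUE_RATE_FLOOR)
            return true;                                  /* processor is starved: allow everyone */
        return predicted_consumers >= STQ_CRIT_THRESHOLD; /* otherwise only critical loads search */
    }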

Applying this critical-load prediction optimization eliminates over 93% of the store queue searches with less than 1% performance loss on average. Only one benchmark, art from the SPECfp2000 suite, saw a 7% slowdown due to missed forwarding opportunities. This further reinforces the fact that most of the STQ searches result in no matches, which has

140 been exploited by several other studies not directly related to load criticality [78, 68]. We

extended our analysis to estimate the energy savings that this reduction in STQ searches would provide. We measured the activity numbers for the STQ in our simulator and used

CACTI 4.1 to estimate the read, write and tag broadcast energy for our simulated STQ configuration [81]. Our results showed that eliminating 93% of the store queue searches reduces the energy consumption of the STQ by 16%. Note that while this energy reduction does not account for the additional energy consumed by our predictor, we do not propose this STQ optimization to be implemented in isolation, but rather in conjunction with the other criticality-based techniques. Once one has already paid the cost of the predictor to provide the performance of a second load port, then this STQ optimization can effectively be had for no additional cost.

6.6.2 Memory Disambiguation and Memory Dependence Prediction

Our next application in this class involves speculative disambiguation of memory instructions. Before a load can issue, it needs to check if there are any older stores in the instruction window that have not resolved their addresses. If the load finds any older store with an unknown address, then the load cannot safely issue since it may have a true data dependency with this store. Most experimental data has shown that the majority of loads do not have true dependencies with any stores currently in the processor [50]. To prevent all loads from incurring this stall, some processors use memory dependence predictors that, based on past load behavior, predict whether a load will collide with any earlier unknown stores [19, 35]. If the prediction is incorrect, then all instructions after the load get flushed from the pipeline and need to be refetched.

Past research has shown that accurate memory dependence prediction has the potential

to increase performance [17], but these memory dependence predictors need to be fairly

large to achieve maximum load coverage and accuracy. In this optimization, we only allow

critical loads to access the memory dependence predictor and speculatively execute. All

non-critical loads must wait until all previous store addresses have been resolved. The non-

critical loads can usually afford to wait and resolve all dependence ambiguities since they

141 Figure 51: Relative speedup across the applications when all data is brought into the LRU position (INS LRU) and when only non-critical data is brought into the LRU position (INS SLRU).

will not have a significant impact on performance. There is no need to risk a misprediction and subsequent pipeline flush on a non-critical load where correctly speculating is unlikely

to provide any significant performance benefits anyway. There is also less destructive interference in the memory dependence predictor since only critical loads update the predictor table. This allows a much smaller memory dependence predictor to be implemented while

maintaining the same performance levels as the baseline predictor. In particular, using a memory dependence predictor modeled on a load-wait table predictor (like that used in the

Alpha 21264 [35]), we were able to reduce the number of entries in the predictor table from

1024 to a meager 64 with no loss in performance.
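The same gating pattern applies to the load-wait table: only predicted-critical loads consult and train it, while all other loads simply wait for older store addresses to resolve. The sketch below uses the 64-entry table from the result above; the table organization (a single "wait" bit indexed by hashed PC) follows the load-wait style of predictor, and the function names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define LWT_ENTRIES 64                      /* reduced from 1024 in the baseline */

    static bool load_wait_table[LWT_ENTRIES];   /* one "wait" bit per PC hash        */

    /* May this load issue past older stores whose addresses are still unknown?    */
    bool may_speculate_past_stores(uint64_t pc, bool predicted_critical)
    {
        if (!predicted_critical)
            return false;                       /* non-critical loads always wait    */
        return !load_wait_table[pc % LWT_ENTRIES];
    }

    /* Only critical loads train the table, so non-critical loads cannot pollute it. */
    void train_on_ordering_violation(uint64_t pc, bool predicted_critical)
    {
        if (predicted_critical)
            load_wait_table[pc % LWT_ENTRIES] = true;
    }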

6.7 Optimization Class 3: Data Cache

6.7.1 Insertion Policy for Non-Critical Loads

The first application in this class deals with the insertion policy used on a cache fill. Previous research has shown that inserting cache lines in the least-recently-used (LRU) position instead of the conventional most-recently-used (MRU) position can sometimes improve performance [57]. If the line is accessed a second time (cache hit), then it is promoted to the

MRU position in the recency stack. The LRU insertion policy has the benefit of preventing lines with low reuse from residing in the cache for long periods of time. The drawback to this policy is that while it helps some applications that have very low reuse, it also hurts other applications with short or medium reuse distances. Adaptive selection of the insertion

policy has been proposed to deal with this issue [57].

Rather than use an adaptive control scheme, we simply use our load-criticality predictions to guide the insertion policy. Data brought in by non-critical loads get inserted in the

LRU position in the cache, where they are more vulnerable to eviction. All data read by critical loads follow the baseline MRU insertion policy, where several cache accesses must occur before the line is in danger of eviction. This organization uses a criticality threshold of eight.
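On a DL1 fill, the only change is the choice of recency-stack position, driven by the predicted consumer count of the load that triggered the fill (threshold of eight as above; the enum and function names are illustrative):

    #include <stdint.h>

    #define INSERT_CRIT_THRESHOLD 8

    typedef enum { INSERT_MRU, INSERT_LRU } insert_pos_t;

    /* Lines fetched by non-critical loads start at the LRU position and must be
     * re-referenced (which promotes them to MRU) to survive; lines fetched by
     * critical loads keep the conventional MRU insertion.                        */
    insert_pos_t dl1_insert_position(uint8_t predicted_consumers)
    {
        return (predicted_consumers >= INSERT_CRIT_THRESHOLD) ? INSERT_MRU : INSERT_LRU;
    }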

Figure 51 shows the S-curves for all simulated benchmarks. Each point on a curve shows the relative speedup for one benchmark compared to a baseline processor that always inserts cache lines at the MRU position. Each curve is sorted from least to greatest speedup. The

figure shows that inserting all data in the LRU position (INS LRU = Insert all in LRU)

causes performance slowdown in 30 of the 125 evaluated benchmarks, with a maximum speedup of 38%. Out of the 30 applications whose performance went down, five observe slowdowns of greater than 5%, and the worst degradation is for lbm (-6.8%). When load criticality prediction is used to control the insertion policy, the performance degradations are reduced as shown by the smaller tail at the left of the INS SLRU (Insert Selectively at

LRU) curve. The overall average speedup is 2.3%.

6.7.2 Cache Bypassing for Non-Critical Loads

Our second cache application deals with the placement of the data in the cache hierarchy.

Some loads do not only have low reuse, but they sometimes have no reuse or extremely

large reuse distances. In the first scenario, the data required by the applications are used

only once and then sit idle in the cache until they are evicted. In the second case, the data

is sure to get evicted from its set before it can be reused. In either case, these cache lines

occupy space in the DL1 that could be better used by other data.

We use our load criticality information to simply prevent cache lines accessed by non-

critical loads from being installed in the L1 data cache (i.e., they bypass the L1). This has similarities with McFarling’s Dynamic Exclusion, where cache lines deemed to be likely

to cause conflict misses are not allowed to be placed in a direct-mapped L1 cache [47].

Since this data is only used by non-critical loads anyway, we can afford the extra cycles

required to access the L2 cache. Preventing non-critical loads from bringing data into the L1

reduces the cache pressure and can improve the hit rates for critical loads. As discussed in

Section 6.1, while our non-critical load bypassing has similarities to other earlier proposals that prevent the data for non-critical loads from entering the DL1, our approach does not require building any additional caching structures such as the Non-Critical Load Buffer [24] or the Penalty Buffer [6]. We use a criticality threshold of four for this application.

6.7.3 Prefetching Policy for Non-Critical Loads

The third application deals with the DL1 hardware prefetcher. In our simulator, turning

off data prefetching caused an average performance degradation of 2.9%. A prefetching

algorithm is often difficult to tune and can end up polluting the cache by bringing in data

that is never used. Prefetching can also cause additional contention for the off-chip bus, thereby causing further performance degradations. Unnecessary prefetches are even worse than the problems described for cache insertion and placement since, in those cases, the data that are brought in were used at least once. For this application we simply prevent non- critical loads from causing any prefetch requests. By restricting the prefetching algorithm to bring in data only for critical loads, cache pollution and bus utilization are both decreased.

The criticality threshold for this technique is five.

6.7.4 Performance Evaluation

We now present results for the three applications described above. Figure 52 shows the performance speedup with respect to the baseline for the three optimizations. The first bar shows the performance due to inserting non-critical loads in the LRU position. While the overall performance improvement is 2.3%, some applications see large speedups. In particular, mcf sees a 95% improvement; mcf has low data reuse and experiences a 38% speedup even when the INS LRU technique is applied. By limiting this strategy to only non-critical loads, the speedup is increased. Among the non-SPEC benchmark suites,

the media applications have nearly 5% performance improvement on average. Although not shown on this graph, jpeg-encode observes a 25% speedup. For this benchmark, the

Figure 52: Speedup when critical load prediction is employed to optimize the L1 data cache.

INS SLRU technique reduces the miss rate of critical loads in the L1 cache by 10%. When

compared to the INS LRU technique which inserts all loads in the LRU position, this strategy degrades the performance of fewer applications.

The second application in this class measures the performance benefit obtained by com- pletely preventing non-critical data from entering the L1 data cache (INS SL1 = Insert

Selectively in L1). Figure 52 shows that this optimization achieves an average performance improvement of 3% for the SPECfp suite and 1% overall. While the average improvement is not very high, certain benchmarks do benefit.

Finally, the third bar in Figure 52 (INS SP = Insert Prefetched Data Selectively) shows the impact of criticality-based filtering of prefetches. The results indicate that the floating point benchmarks seem to benefit the most from this technique. The average performance improvement for the SPECfp suite is nearly 4%, with wupwise showing a speedup of nearly 20%.

For wupwise, only 0.1% of prefetches in the baseline configuration turned out to be useful, so when load criticality prediction is used to filter out even some of the useless prefetches,

performance improves considerably. Even though each of these techniques only improves average performance by a few percent, we emphasize that these benefits are effectively free once the basic criticality predictor has been implemented.

6.8 Combining Our Criticality-Based Load Optimizations

In this section, we study the impact of simultaneously combining our optimizations. We present three sets of optimization configurations in Figure 53: faking a second load data port with hijacking the store AGU (FSLP+FSLA), combining all of the optimizations (ALL), and combining all except for hijacking the store AGU (ALL-FSLA). The ALL-FSLA con-

figuration is provided to quantify the impact of all of the criticality-based optimizations without hijacking the store AGU. Note that for the ALL configurations, the selective insertion policy takes precedence over the DL1 bypass optimization since it provided higher performance (insertion position is irrelevant if the DL1 is bypassed).
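Because every optimization reads the same stored consumer count and compares it against its own threshold, the combined configuration boils down to a handful of comparisons. The sketch below gathers the thresholds used in this chapter in one place; the secondary conditions discussed earlier (the ready-but-delayed bit for port deferral and the issue-rate override for STQ searching) are omitted for brevity, and the structure and field names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    /* One CLPT consumer count, several consumers of the prediction. */
    typedef struct {
        bool defer_for_port;   /* Section 6.5:   may yield the single DL1 read port     */
        bool search_stq;       /* Section 6.6.1: perform the associative STQ search     */
        bool insert_at_mru;    /* Section 6.7.1: DL1 fill position                      */
        bool fill_dl1;         /* Section 6.7.2: false means bypass the DL1             */
        bool may_prefetch;     /* Section 6.7.3: allowed to trigger hardware prefetches */
    } load_policy_t;

    load_policy_t criticality_policies(uint8_t consumers)
    {
        load_policy_t p;
        p.defer_for_port = (consumers < 5);
        p.search_stq     = (consumers >= 2);
        p.insert_at_mru  = (consumers >= 8);
        p.fill_dl1       = (consumers >= 4);
        p.may_prefetch   = (consumers >= 5);
        return p;
    }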

Figure 53(a) shows the average results for the individual benchmark suites. Most of the suites see considerable speedups even with the ALL-FSLA combination. The FSLP+FSLA

technique provided a speedup of 12.3% on average, while the ALL combination achieves

a speedup of 15.7%. Figure 53(b) shows the S-curves. When combining all optimizations

(ALL), we see that the majority of benchmarks experience 10% or more improvement, a few

benchmarks are performance neutral, and only a very small number exhibit some perfor-

mance degradation. Recall that some of our optimizations target the reduction of power and

the size of related hardware structures. For example, art (SPECfp2000) experienced a 7.0%

performance degradation due to the fact that filtering non-critical loads from searching the

STQ causes this benchmark to miss too many forwarding opportunities. When combining

all of the optimizations, however, this degradation reduces to 3.5%. One can reduce this

performance degradation even further by adjusting the thresholds to be more conservative

about when to block loads from searching the STQ.

The FSLP+FSLA technique provides a substantial benefit for the relatively small in-

vestment of a less-than 1KB predictor and a few simple microarchitecture modifications.

Once the predictor is in place for the purpose of achieving the performance of the second

Figure 53: Speedup when critical load prediction is applied to all optimizations together: (a) per-suite averages and (b) S-curves for all benchmarks.

load data port, then the other optimizations which improve performance, reduce power

and save area, are effectively free. Since we employ a single criticality predictor shared among all of the optimizations, it would not be difficult to make the different thresholds

adjustable. Similar to how prefetch control parameters can be adjusted for different market segments [19], one could set different criticality thresholds to optimize for performance in the server markets, and for power in the mobile space.

While our criticality predictor is certainly not the first one to be proposed, we believe that it is simpler and more practical to implement than several of the earlier predictors.

We simulated our criticality-based load optimizations (the simultaneous combination of all techniques on our academic simulator) using Tune et al.’s QCons heuristic that in each cycle

finds the completed instruction with the largest number of children [84]. Note that in Tune et al.’s study, they used a 64K-entry predictor, which we model here, but we keep our own consumer-based predictor sized at only 1K entries. We also evaluated Fisk and Bahar’s predictor that tracks the number of dependent instructions added to a load’s dependency list while it is servicing a miss [24], but found that the Tune et al. predictor was consistently better; we therefore do not report the results for Fisk and Bahar’s predictor.

In Figure 54, the first bar corresponds to the predictor that uses the QCons heuristic and the second bar corresponds to our predictor. The figure shows that the QCons heuristic is in fact a good indicator of load criticality as evidenced by the 12.3% average speedup

Figure 54: Speedup comparison of applying all proposed optimizations, but with different underlying criticality predictors.

that it provides. Despite the fact that Tune et al.’s predictor uses 64× more entries than our predictor, we are still able to achieve higher performance. In particular, enabling our predictor only during periods of low IPC helps to greatly reduce aliasing effects in the predictor, thereby allowing a small 1K-entry predictor to suffice.

6.9 Conclusions

In this chapter, we have demonstrated that predictability in load criticality can be exploited to improve the efficiency of modern processors. While load criticality has been extensively studied for optimizing value prediction and clustered microarchitectures, we have shown that the idea is also useful for conventional, mainstream processor microarchitectures. We have been able to use and reuse a single, simple criticality predictor to optimize multiple facets of load execution. Our predictor, along with our optimizations that use the predictor, have all been designed with an eye toward minimizing microarchitectural impact.

As such, we believe that these techniques are indeed practical to implement in near-term

future microarchitectures.

Our load criticality predictor is simple and fairly non-invasive, and its cost is amortized across all of the optimizations that use it. The return on investment of this predictor is high considering the 15.7% speedup, the 16% power reduction in the store queue, and the area reduction in the memory dependence predictor. These performance, power and area benefits

provided by our load criticality predictor and its associated applications help to improve

the overall efficiency of the processor.

CHAPTER VII

DEAD: DELAYED ENTRY AND AGGRESSIVE DEALLOCATION OF CRITICAL RESOURCES

7.1 Introduction

In this chapter, we address the commonly-occurring inefficiencies of the conventional allocation and deallocation policies used in the load queue and the store queue so as to reduce their sizes and corresponding power consumption. Despite the power problem facing the processor industry, the industry remains very interested in designing processors that achieve high performance. As a result, there is a push toward more efficient microarchitectures that reduce power as much as possible (or at least do not increase it) while still providing high performance. Multi-core processors are one such example: they can provide higher performance (if sufficient thread-level parallelism exists) than a traditional uniprocessor bound by the same power envelope. Conventional techniques to improve or even maintain single-threaded performance, however, still focus on implementing large structures that allow the out-of-order core to search more instructions from which to extract instruction-level parallelism. In our work, we take a different approach to the power-performance tradeoff problem. We attempt to relax the stringent conditions imposed on the allocation and deallocation of critical microarchitectural resources in the processor. By doing so, we are able to reduce contention for these resources, thus allowing us to significantly reduce their sizes without losing any performance. Our designs for Deferred Entry and Aggressive

Deallocation of the load queue (LQ), store queue (SQ) and the reservation stations (RS) allow us to build high-performing, relatively low-power queues that improve the efficiency of the processor.

In conventional processors, resource allocation and deallocation in the out-of-order core is generally done in program order. Forcing these stages to occur in program order may

introduce some inefficiencies. Some past work has explored delaying the allocation of phys-

ical registers beyond the regular allocation/dispatch pipeline stage [2, 18], and other work

has considered mechanisms to allow instructions to commit out of program order [9, 18, 26].

In our research, we analyze the bottlenecks and opportunities imposed by in-order alloca-

tion and deallocation of microarchitecture resources. In particular, we find that the LQ,

SQ and RS are inefficiently utilized. Based on our initial findings, we propose two novel

techniques. The first relaxes the in-order constraints on the allocation of LQ, SQ and RS

entries, effectively enabling a form of out-of-order allocation. For the second technique, we make the observation that resource deallocation and instruction commit can, at least in

part, be separated into two distinct tasks, which allows us to relax the in-order constraints on the deallocation of LQ entries. Our proposed techniques only require small changes to

the microarchitecture, but they enable us to reduce the number of entries (and hence area) in the LQ, SQ and RS by 56%, 50% and 44%, respectively, with a negligible decrease in performance. This also has a direct impact on power consumption as the smaller structures consume less dynamic and leakage power.

In the next section, we will provide some background for our proposal by examining how much lost opportunity exists due to the in-order constraints on allocation and deallocation.

We also describe related work to help establish the context for our contributions. Sec- tion 7.3 describes our scheme for supporting out-of-order allocation of the LQ, SQ and the

RS, and Section 7.4 explains how to perform out-of-order deallocation of the LQ. Section 7.5 details our evaluation methodology. Section 7.6 evaluates the performance and power im- pact of our proposed techniques in the LQ, SQ and the RS. In Section 7.7, we present a quantitative comparison of our proposal against a technique for building scalable LQ/SQs, and Section 7.8 summarizes how our proposal improves the efficiency of the processor and presents our conclusions.

7.2 Background

In this section, we present some empirical case studies highlighting the inefficiencies that can arise from forcing the allocation and deallocation stages to operate in a strict in-order

fashion. The full simulation infrastructure used in these studies is described in Section 7.5; the key parameters are a 96-entry ROB, 32-entry RS, 32-entry LQ, and 20-entry SQ. We end the section with a discussion of related work.

Figure 55: Potential of OOO Alloc: (a) load allocation stalls due to a full LQ, (b) allocation continues, (c) allocation only stalls due to a full ROB.

7.2.1 Case Study: Stalls at Allocation

In the previous section we made the observation that an instruction that fails to allocate a ROB entry guarantees that all later instructions would not be able to allocate a ROB entry either, but the same guarantee does not hold for failed LQ and SQ allocations. We

first consider the case of a load that fails to allocate a LQ entry, as shown in Figure 55(a).

Note that the ROB and RS still have some entries available. If we could somehow ignore

the fact that the load is missing its LQ entry and proceed with allocation of the following

instructions, then the following three instructions would in fact be able to successfully

allocate their resources (b). After this point, the divide instruction I4 (as well as any instructions after that) stalls because no more ROB entries are available (c).

For some applications, this type of stalling can be common. For example, Figure 56(a)

shows a histogram that measures for the benchmark art from SPECfp2000, for each load

that stalled at alloc, how many subsequent instructions could have otherwise been allocated.

For this application, allocation stalls due to a full LQ account for 68% of all stall cycles.

This histogram shows that 27% of the time that a load is stalled at alloc, there is one

additional instruction that could potentially be allocated because all of its other required

resources are available. Another 24% of the time that a load is stalled at alloc, there are

three instructions after the load that have all of their resources needed to allocate, but they are stalled due to the earlier load. All of these instructions behind the alloc-stalled loads represent lost opportunities for exposing more instruction-level parallelism.

Figure 56: Percentage distribution showing the number of instructions that are ready to allocate when (a) a load is stalled due to a full LQ and (b) a store is stalled due to a full SQ.

We collected the same type of data for SQ allocation-stalls. Figure 56(b) shows the

number of unnecessarily stalled instructions that follow stalled stores in the benchmark

astar-biglakes from SPECint2006. For this benchmark, 32.5% of all stall cycles were due

to a full SQ. The histogram shows that 25% of the time that a store is stalled at alloc due

to a full SQ, there are three instructions following the store that are ready and have all of

the necessary resources to allocate. Even more telling is the fact that for this particular

benchmark, there are a considerable number of missed opportunities where five or six non-

store instructions that can be potentially allocated have been stalled due to a full SQ. In

Section 7.3, we will describe an out-of-order allocation technique that exploits this behavior

to avoid these unnecessary stalls.

While the previous histograms demonstrated the impact of stalls for specific programs,

not all applications are dominated by LQ and SQ allocation stalls. Figure 57 shows the

breakdown of stall cycles across all simulated benchmarks for a processor loosely based on

the Intel Core 2 microarchitecture (full simulation details can be found in Section 7.5). This

plot shows that across all simulated SPECcpu2000 and SPECcpu2006 applications, at alloc,

stalls due to a full LQ account for 6.93% of all cycles and stalls due to a full SQ account

for 9.62% of all cycles.

Figure 57: Distribution of stall cycles at alloc for SPECcpu2000 and SPECcpu2006 applications

7.2.2 Case Study: Unnecessary Delays for Deallocation

In this sub-section, we explain the inefficiency of in-order deallocation policies and describe

the related stalls in the LQ that can occur. For an in-order deallocation policy, only when a

load is the oldest instruction in the processor is it allowed to commit and deallocate its LQ and

ROB entries. As mentioned earlier, a LQ entry is responsible for holding the load address

until the load can issue and access the DL1 cache (and search the SQ for a matching earlier store), and keeping the load address around so that earlier (in program order) stores

that execute after the load (at a later cycle) can search the LQ to detect load-store order violations. If the load has already executed and it can be proven that this load cannot be guilty of a load-store ordering violation, then the load’s LQ entry no longer serves a useful purpose. The time from when these two conditions first become true and when the load finally commits represents wasted use of the corresponding LQ entry. Note that all remaining functions related to commit can be handled by the load’s ROB entry. The ROB entry includes the load’s physical register, and therefore the final update to the ARF can

(and does) happen directly from the ROB. Any faults associated with this load can be tracked by the ROB as well.

Figure 58: Inefficiency of in-order deallocation: (a) load executes speculatively on cycle 40, (b) last store address resolves on cycle 42, (c) load is forced to wait until it is the oldest load to deallocate on cycle 65

To demonstrate this, consider the example in Figure 58. At cycle 40, the load (PC=4F5A) in LQ entry 2 has computed its effective address (A) and issued a request to the data cache to retrieve its data (a). The sequence id of this load is 9, indicating that the load is the

youngest (most recently allocated) instruction. Note that this load execution is speculative since it is possible that the older store (PC=3233) in SQ entry 1 with a sequence id of 7, which has not yet resolved its address, could potentially have a true memory dependence with this load. In cycle 42, Figure 58(b) shows that the oldest load and oldest store have committed and deallocated their LQ and SQ entries, respectively. During this cycle, the remaining store (SQ entry 1) has computed its address B, which means that it does not have a store-to-load conflict with the last load (PC=4F5A). At this point in time, the load

(PC=4F5A) is sure to have executed correctly, since no older stores to address A exist.

Figure 58(c) shows that in cycle 65, this load becomes the oldest instruction in the reorder buffer, and it can finally deallocate its LQ entry (c). Note that in this example, starting from the end of cycle 42 all of the way until cycle 65, this load’s LQ entry served no real functional purpose, but the conventional in-order deallocation approach does not allow this resource to be reused/reallocated during this entire interval.

As an example of this phenomenon, Figure 59 shows a histogram that measures, for the mgrid benchmark from SPECfp2000 and for each load that has completed and is sure not to misspeculate, how many cycles it is stalled before it deallocates its LQ entry. This histogram shows the percentage of loads that are forced to wait after they are guaranteed to have executed correctly for 1 through 9 cycles. One cycle is the minimum possible since we assume that a load cannot complete execution and then commit in the same cycle. The histogram shows that, for this particular benchmark, 15% of all committed load instructions have waited for 8 cycles before they could commit and deallocate their respective LQ entries.

Figure 59: Percentage distribution of stall cycles when a load is forced to wait to deallocate

These loads continue to occupy LQ entries, which can cause the LQ to become full, which in turn causes the allocation stage to stall as well. In Section 7.4, we will describe an out-of-order deallocation technique that can avoid these unnecessary stalls.

7.2.3 Related Work

Previous proposals for out-of-order allocation have mainly focused on the physical regis-

ter file. Gonzales et al. describe tracking virtual free physical registers as a method of

“virtually” allocating the physical register before it actually becomes available [2]. The

virtual-physical register file technique is able to reduce the pressure on the register file since only a mapping to the actual physical register that will be allocated is required. The interesting observation that motivates this design is that a physical register is required only to hold a value produced by an instruction and is no longer required when all consuming instructions have read that value. Ephemeral registers is a similar technique that implements late register allocation by assigning a virtual tag at the time of actual allocation and making the actual physical register available only when the instruction issues [18]. Continual Flow Pipelines is another proposal that makes critical resources like the register file and the scheduler available to independent instructions that follow in the shadow of a cache miss [75]. In this technique, a load that misses drains out of the pipeline along with its

dependent instructions so that other independent instructions can use the resources, and

the drained slice is processed once the miss returns. Most of these techniques focus on the

physical register file and reservation stations to reactively deal with stalls once load misses

have been uncovered, which requires large effective load and/or store queues [26]. Our

proposed out-of-order allocation scheme primarily targets the inefficient use of the LQ and

SQ and could potentially be applied to some of these schemes. More recently, unordered

LSQs have been proposed which defer allocation of LQ/SQ entries until the load/store ac-

tually issues [?]. This form of deferred allocation is very aggressive and can help reduce the

pressure on the load-store queue and allow for smaller-sized queues. This design, however,

assumes a unified LSQ as opposed to separate queues; hence every memory operation needs

to access the entire queue. As a result, there is not much power benefit to using the smaller

queues. There is, however, an area benefit since the size of the unified queue can go as low

as 32 entries with no loss in performance.

Out-of-order commit and out-of-order deallocation have also been explored, but most

proposals deal with checkpointing the ROB as a way to rectify any faults and preserve in-order execution semantics. Cherry is a technique that proposes the early release of registers

and load/store queue entries by dividing the ROB into speculative and non-speculative

regions [46]. The non-speculative region of the ROB holds instructions that are certain not to be misspeculated, and their resources can thus be released early. Additionally, the processor maintains a

checkpoint to handle precise exceptions. Most of these proposals are geared toward building

large instruction windows and improving performance, whereas our main goal is scaling down structures and reducing power while maintaining performance. As such, any checkpointing mechanism is likely to increase the area overhead as well as the power consumption. We do not require any virtual register pool, nor do we need to maintain any checkpoints. Bell et al. describe a methodology under which certain instructions can be proved to commit correctly even if committed out-of-order [9]. They define a set of necessary commit conditions (for example, no occurrence of store-load or load-load replay traps), and if an instruction satisfies these conditions, it can commit out-of-order correctly. To our knowledge, no prior work has explicitly dealt with the out-of-order allocation and deallocation of LQ and SQ

resources.

7.2.4 Why Not Build Larger LQ/SQs?

In our work, we propose new techniques to reduce the size and power of several critical

processor structures while incurring very little performance impact. There are many

related works for building scalable memory structures, such as CAM-RAM hybrid queues [8],

Store-Queue Index Prediction [67], NoSQ [68], Fire and Forget [78], and several others.

These scalable techniques effectively attempt to make normal-sized structures perform as

well as much larger (but impractical) conventional structures.

At first thought, a technique that makes a regular-sized LQ or SQ perform like a larger

queue should be directly applicable to make a small, low-power queue perform like a regular

one. While this may be largely true for performance, applying these techniques to build

smaller LQ and SQs does not work from the perspective of power consumption. In particu-

lar, many of the techniques for scalable queues involve multiple predictors and commit-time

load re-execution (and pipeline flushes) to ensure correctness. Compared to very large con-

ventional queues, the power costs of the predictors, commit-time re-execution and pipeline

flushes are relatively small and justifiable. When attempting to retro-fit these scalable

queue techniques to implement smaller, lower-power structures, the power savings of the

smaller queues is quickly annihilated by the overheads of the predictors and re-execution.

While at first blush they may seem to be the same problem, building large scalable queues

and building small efficient ones actually require very different approaches. We will revisit

this issue in Section 7.7.

7.3 Out-of-Order Allocation through Deferred Entry

The previous section identified potential inefficiencies in the usage of both the load and

store queues. In particular, we demonstrated how an allocation stall for an instruction due

to the lack of a few resources can induce stalls in the allocation of other possibly unrelated

instructions. In this section, we describe a design that can relax these in-order allocation

constraints, specifically for loads and stores, thereby improving overall allocation efficiency.

Figure 60: (a) baseline processor and (b) changes for deferred entry.

7.3.1 Description of the Technique

In this sub-section, we first describe our Deferred Entry technique in the context of load instructions and the load queue, and then briefly cover how this works for the store queue as well.

7.3.1.1 Load Queue

As described in Section 7.2, a load that cannot acquire a free LQ entry causes the entire allocation stage to stall in a conventional processor organization. Even though the LQ may be full, the ROB and RS may still contain available entries which subsequent non-loads should otherwise be able to use. When ROB/RS/LQ entries are available, the processor’s alloc stage behaves the same as in a conventional processor. When the LQ becomes full and the alloc stage has a load instruction to deal with, our Deferred Entry (DE) technique allocates ROB and RS entries for the load, but defers entering the load into the LQ. Instead, the load instruction is shunted into an auxiliary structure called the Load Deferral

Buffer (Load-DB). Figure 60(a) shows the allocation stage for a conventional pipeline, and

Figure 60(b) shows the changes needed (shaded blocks) for implementing Deferred Entry.

At some later point in time when the LQ eventually has a free entry, loads in the Load-DB can complete allocation by finally entering into the LQ. Our DE technique must prevent the load micro-op in the RS from executing because it does not yet have a valid LQ entry to write the computed address to. Note, however, that we can pre-determine the actual

LQ index because at the time that the RS entry gets allocated (but the LQ allocation is deferred), we know that the next LQ entry to become available will be whichever entry

currently holds the oldest load (e.g., LQ head). When the load completes allocation from

the Load-DB, it sends a one-bit “go-ahead” signal (Figure 60(b) shows this as a dashed

line) to its RS entry at which point the load micro-op can issue if its operands are ready.

After a load has been enqueued in the Load-DB, allocation can proceed for any instruction that follows the load. If the next instruction is not a load, then the alloc stage will simply acquire whatever resources are necessary for that instruction. If the alloc stage reaches another load, then that load can also be placed in the Load-DB. Note that if at any point the RS or the ROB becomes full, then allocation stalls for everyone. By forcing the loads to pre-allocate their ROB entries, the ROB continues to hold all instructions in the original program order, which therefore will not require any changes to its commit or misspeculation recovery mechanisms. Similarly, since we assume a unified ROB/PRF microarchitecture, allocation of a ROB entry implies allocation of a physical register, and so our DE mechanism does not require any changes to the register renaming logic.
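To make the allocation flow concrete, the following C sketch shows one way the alloc-stage decision for a load could be expressed. It is an illustration rather than our simulator's code: the counter-based structures, the alloc_load function and the 4-entry Load-DB size are assumptions chosen for brevity.

    /* Sketch of the alloc-stage decision for a load under Deferred Entry.
     * Structures are reduced to free-entry counters and a FIFO count;
     * the names and the 4-entry Load-DB size are assumptions.            */
    #include <stdbool.h>
    #include <stdio.h>

    #define LOAD_DB_SIZE 4

    struct alloc_state {
        int rob_free, rs_free, lq_free;
        int load_db_count;              /* deferred loads awaiting an LQ slot */
    };

    /* Returns true if the load allocated (fully or deferred); false means
     * the entire alloc stage stalls this cycle.                            */
    static bool alloc_load(struct alloc_state *s)
    {
        bool lq_available = (s->lq_free > 0);
        bool can_defer    = (s->load_db_count < LOAD_DB_SIZE);

        if (s->rob_free == 0 || s->rs_free == 0)
            return false;               /* ROB/RS full: stall everyone        */
        if (!lq_available && !can_defer)
            return false;               /* no LQ entry and Load-DB is full    */

        s->rob_free--;                  /* ROB entry (and physical register)  */
        s->rs_free--;                   /* RS entry, still allocated in order */
        if (lq_available)
            s->lq_free--;               /* conventional, fully in-order case  */
        else
            s->load_db_count++;         /* defer the LQ entry; younger        */
                                        /* instructions keep allocating       */
        return true;
    }

    int main(void)
    {
        struct alloc_state s = { .rob_free = 4, .rs_free = 4, .lq_free = 0,
                                 .load_db_count = 0 };
        printf("load allocated: %d (deferred loads: %d)\n",
               alloc_load(&s), s.load_db_count);
        return 0;
    }

The key point captured by the sketch is that a load only blocks the alloc stage when the ROB or RS is full, or when both the LQ and the Load-DB are full.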

7.3.1.2 Store Queue

The application of our Deferred Entry technique to the store queue is nearly identical to the load queue. When allocation attempts to acquire resources for a store instruction and

finds that the SQ is full, the alloc stage first finds a free ROB and RS entry for the store, and then places the store in a Store Deferral Buffer (Store-DB). When the oldest store instruction deallocates its SQ entry, alloc can then move the deferred store from the Store-

DB to the newly vacated SQ entry. Similar to deferring the entry of loads in the LQ, the corresponding store address (STA) and store data (STD) micro-ops in the RS cannot be allowed to issue until a SQ entry has been obtained. This seems like it might be a more complicated scenario, since the store allocating from the Store-DB must now send go-ahead signals to both STA and STD micro-ops which could potentially reside in different RS entries. Fortunately, modern x86 processors implement micro-op fusion where both STA and STD micro-ops get combined into a single RS entry, thereby avoiding the problem of routing twice as many go-ahead signals. Apart from this, the DE technique for the SQ is completely analogous to the LQ case. We now demonstrate the operation of DE with a

detailed example.

Figure 61: Step-by-step example of Deferred Entry.

7.3.2 Step-by-Step Operation

Since the basic methodology for Deferred Entry is the same for the LQ and SQ, our example only involves the LQ. In Figure 61, we show the allocation sequence for four instructions with DE. For the purpose of this example, we assume a 3-entry LQ, a 6-entry ROB, a

4-entry RS and a 2-entry Load-DB.

(a) Initial State: There are three loads that are already occupying the LQ entries; hence

the LQ is full. The ROB still has some entries available and the RS currently has no entries

allocated since the three loads have already issued.

(b) Alloc I0: The first load has to be allocated. The ROB and RS have free entries, but since the LQ is full, this load would not be able to complete its allocation under a conventional

allocation scheme (and would therefore stall the entire alloc stage). With DE, however, we

have the additional Load-DB where we can place the load. The load allocates into the ROB

and the RS in the usual manner.

(c) Alloc I1 and Alloc I2: The next two add instructions allocate. Since these adds have no knowledge that the load allocation was deferred, they will not stall. Note that we have

effectively allocated these instructions out of program order.

(d) Execute I2: The second add, which is independent of the deferred load, goes ahead and

executes. Note that the earlier load has not even completed allocation at this point in time.

The first add has a data dependency with the deferred load; it simply waits in the RS until

the load eventually allocates its LQ entry, executes, and broadcasts its tag.

(e) Completing Alloc for I0: In this operation, the oldest load in the machine has deallocated its LQ entry, so the deferred load acquires this entry and transitions from the Load-DB into the LQ proper. It also notifies the corresponding RS entry that it can go ahead and

issue now.

Although not shown as part of this example, instruction I3 will need to enter the Load-DB at alloc and wait there until another LQ entry becomes available. The operations for a

store are identical, and the Store-DB operates exactly like the Load-DB.
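The sketch below illustrates step (e) under the same simplifying assumptions; the load_db FIFO, the rs_go_ahead array and the function names are illustrative stand-ins for the real structures and signals, not a description of actual hardware.

    /* Sketch of step (e): when the oldest load deallocates its LQ entry,
     * the oldest deferred load leaves the Load-DB, takes over that entry,
     * and sends the one-bit go-ahead to its RS entry.                     */
    #include <stdbool.h>
    #include <stdio.h>

    #define LOAD_DB_SIZE 2
    #define RS_SIZE      4

    static int  load_db[LOAD_DB_SIZE];       /* RS indices of deferred loads */
    static int  db_head, db_count;
    static bool rs_go_ahead[RS_SIZE];        /* load uop may now issue       */

    /* Called when the LQ head entry is deallocated at commit. */
    static void lq_entry_freed(int freed_lq_index)
    {
        if (db_count == 0)
            return;                          /* nothing was deferred         */

        int rs_index = load_db[db_head];     /* oldest deferred load         */
        db_head = (db_head + 1) % LOAD_DB_SIZE;
        db_count--;

        /* The deferred load now owns the freed entry (the old LQ head),
         * so its address has somewhere to be written when it executes.  */
        printf("deferred load (RS %d) takes LQ entry %d\n",
               rs_index, freed_lq_index);
        rs_go_ahead[rs_index] = true;        /* go-ahead: issue when ready   */
    }

    int main(void)
    {
        load_db[0] = 2; db_head = 0; db_count = 1;   /* one deferred load    */
        lq_entry_freed(0);                           /* LQ head frees up     */
        return 0;
    }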

7.3.3 Bonus: Extending DE to the Reservation Stations

In this section so far, we have explored how the technique of DE can be applied to the LQ

and the SQ. This exploration was done with the assumption that allocation for loads/stores

would be deferred only in the LQ/SQ respectively. Every instruction would still need to be

allocated all other required resources such as the ROB and the RS. A deferred load/store,

however, will not execute until it is completely allocated and has been assigned an entry

in the main LQ/SQ. Thus the deferred load/store does not really need an RS entry until

it has been allocated a LQ/SQ entry. This RS entry can potentially be used by another

independent instruction in order to increase throughput. Hence if we apply DE to the LQ

and the SQ, we can also defer allocating an RS entry to the load/store in the Load-/Store-

DB. Note that with this approach, we also avoid the need to implement separate “go-ahead”

signals.

To avoid any kind of deadlock issues, we need to reserve one RS entry for a potentially

deferred load, and one RS entry for a potentially deferred store (total two reserved entries).

These reserved entries are not used by other instructions when the LQ or SQ are not empty.

When attempting to defer the entry of a load or store, if the corresponding reserved entry

is not available, then allocation must stall. This reservation ensures that when a load/store

needs to move from the Load-/Store-DB to the main LQ/SQ, it can find an entry in the

RS as well. Since loads and stores are maintained in program order in the Load-DB and

Store-DB, respectively, this reservation is all that is required for complete allocation. The

RS entry can be populated from the load/store micro-op present in the DB.
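A minimal sketch of this reservation policy is shown below. It collapses the per-type reservation into a simple free-entry check, so it is only an approximation of the scheme described above; the function and counter names are illustrative assumptions.

    /* Sketch of RS allocation when DE is extended to the reservation
     * stations: two RS entries are kept in reserve, one for a deferred
     * load and one for a deferred store, so that an instruction draining
     * from the Load-/Store-DB can always complete its allocation.        */
    #include <stdbool.h>
    #include <stdio.h>

    #define RS_RESERVED 2          /* one for a deferred load, one for a store */

    /* Ordinary (non-deferred) instructions must leave the reserve alone.  */
    static bool rs_alloc_normal(int *rs_free)
    {
        if (*rs_free <= RS_RESERVED)
            return false;          /* only the reserved entries remain        */
        (*rs_free)--;
        return true;
    }

    /* A load/store moving out of its deferral buffer may take a reserved
     * entry, so it only fails when the RS is completely full.              */
    static bool rs_alloc_from_deferral_buffer(int *rs_free)
    {
        if (*rs_free == 0)
            return false;
        (*rs_free)--;
        return true;
    }

    int main(void)
    {
        int rs_free = 2;                            /* only the reserve left  */
        printf("normal alloc:   %d\n", rs_alloc_normal(&rs_free));
        printf("deferred alloc: %d\n", rs_alloc_from_deferral_buffer(&rs_free));
        return 0;
    }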

7.4 Out-of-Order Resource Release through Aggressive Deallocation

The previous section described our implementation of Deferred Entry to relax the constraint on in-order allocation. In this section, we describe our design for aggressive resource deallocation for load instructions. As discussed in Section 7.2, committing an instruction involves making changes to the processor's architected state. This may involve writing to

the architected register file, writing to memory, and dealing with interrupts and exceptions.

These operations must occur in program order to maintain the in-order semantics specified

by the instruction set architecture. Note that the above description says nothing about the deallocation of microarchitecture-level resources.

We propose Aggressive Deallocation (AD) to capitalize on this opportunity with a partial

out-of-order release of LQ resources. This basically requires detecting three conditions.

The first is that the load has completed execution, which is trivial to detect. The second

is that the load cannot cause a load-store ordering violation. This is also trivial, because modern processors already effectively collect this information. In older processors without speculative memory disambiguation [50], as well as in processors that do support it but predict the load to have a dependence with an earlier store, the load must wait until all previous stores have computed their addresses. We will consider the case where loads may speculatively issue in the presence of earlier unresolved store addresses, as this subsumes the no-load-speculation case. When the load issues, it will search the store queue. If all earlier store addresses are known, then the load will either match on one of these and receive a forwarded value, or it will

not match on any and simply receive its value from the cache. If an earlier store address

is not known, then the load will issue anyway (assuming it has not been predicted to have

a dependence). At some later point in time when the earlier store address resolves, it will search the LQ to make sure there were no load-store ordering violations. If the load read

from the same address, then a violation would have occurred and the pipeline gets flushed,

otherwise the load had correctly executed. Therefore, if a load has completed, all earlier

store addresses are known (hardware already exists), and the load did not get squashed,

then we are guaranteed that the load can no longer cause a load-store ordering violation.

Our last criterion is that the load is the oldest load in the LQ, which does not necessarily imply that the load is the oldest instruction in the processor (ROB).

Once these three criteria have been fulfilled, our AD technique will relinquish the load’s

LQ entry, which can in turn be reallocated for newer loads stalled in the allocation stage.

Our technique holds a load in the ROB even after it has been aggressively deallocated from

the LQ, therefore precise exceptions can be maintained. While all loads can conceptually

be deallocated as soon as they meet the conditions stated above, aggressive deallocation is

useful only when the LQ is full or close to being full. Hence, we adapt our AD technique

to deallocate a LQ entry only when the LQ has less than three available entries.
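The following sketch summarizes the AD eligibility test under these assumptions. The per-entry flags are assumed to be provided by hardware that already exists for load-store ordering checks, and the structure and function names are illustrative; the threshold of three free entries follows the text above.

    /* Sketch of the Aggressive Deallocation (AD) check applied to the
     * oldest load in the LQ each cycle.                                   */
    #include <stdbool.h>

    #define AD_FREE_ENTRY_THRESHOLD 3

    struct lq_entry {
        bool valid;
        bool completed;                   /* condition 1: load has executed   */
        bool all_older_stores_resolved;   /* condition 2: no possible         */
                                          /* load-store ordering violation    */
        bool was_squashed;
    };

    static bool ad_can_release(const struct lq_entry *oldest_load, int lq_free)
    {
        if (lq_free >= AD_FREE_ENTRY_THRESHOLD)
            return false;                 /* queue not under pressure: skip AD */
        if (!oldest_load->valid)
            return false;                 /* condition 3 is implicit because   */
                                          /* we only test the oldest LQ entry  */
        return oldest_load->completed &&
               oldest_load->all_older_stores_resolved &&
               !oldest_load->was_squashed;
    }

    int main(void)
    {
        struct lq_entry l = { true, true, true, false };
        return ad_can_release(&l, 1) ? 0 : 1;
    }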

In a conventional LQ, memory consistency can be enforced by snooping the LQ on an

external reference from another core. Since AD removes loads from the LQ, any attempts

at snooping will miss the aggressively deallocated loads, possibly leading to violations in

the memory consistency model. To enforce memory consistency, we effectively only need to track loads from when they deallocate from the LQ until they commit for real. For

this, we employ a small, counting Bloom Filter that is indexed by the load address and is

incremented when a load aggressively deallocates and decremented when the load actually

commits. When an external reference is observed, the referenced address hashes into the

Bloom Filter where a non-zero counter value indicates a potential consistency violation. In

such situations, we conservatively flush the pipeline and re-execute the load. While this may seem like a very heavy-handed approach, this will only result in a spurious pipeline

flush in very specific (and infrequent) situations. In particular, a load must have undergone

aggressive deallocation, but not have yet committed, and the length of this period is typically not more than a dozen cycles. During this period, an external reference to a different address

that still hashes into the same Bloom Filter entry must occur (an external reference to the

same address would have caused a pipeline flush anyway, and so we do not introduce any

additional flushes in these situations). We only invoke AD when the LQ is almost full, which minimizes our exposure to spurious pipeline flushes due to false hits in the Bloom

Filter. As a result, we simply make use of a very small 64-entry Bloom Filter, whose power overhead is included in our power analysis later in our results. The width of the counter in each Bloom Filter entry is small; for a 32-entry LQ, the counter only needs to be 5 bits

wide to handle the worst possible (and highly unlikely) case where all 32 LQ entries are

occupied, all are aggressively deallocated, and all entries hash into the same Bloom Filter

entry.
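A minimal sketch of this counting Bloom Filter is shown below. The 64-entry size follows the text, while the line-granularity hash and the function names are illustrative assumptions; counters are stored in bytes here, although 5 bits suffice as argued above.

    /* Sketch of the counting Bloom Filter that keeps external snoops
     * correct between aggressive deallocation and commit.                */
    #include <stdbool.h>
    #include <stdint.h>

    #define BF_ENTRIES 64

    static uint8_t bf[BF_ENTRIES];

    static unsigned bf_hash(uint64_t addr)
    {
        return (unsigned)((addr >> 6) % BF_ENTRIES);   /* cache-line granularity */
    }

    /* Load is aggressively deallocated from the LQ but not yet committed. */
    static void ad_deallocate(uint64_t load_addr) { bf[bf_hash(load_addr)]++; }

    /* Load finally commits; its address no longer needs to be tracked.    */
    static void load_commits(uint64_t load_addr)  { bf[bf_hash(load_addr)]--; }

    /* External reference from another core: a non-zero counter means a
     * potential consistency violation, so the pipeline is flushed.        */
    static bool snoop_requires_flush(uint64_t ext_addr)
    {
        return bf[bf_hash(ext_addr)] != 0;
    }

    int main(void)
    {
        ad_deallocate(0x1000);                       /* load leaves LQ early   */
        int flush = snoop_requires_flush(0x41000);   /* different address that */
                                                     /* aliases to the same    */
                                                     /* filter entry: spurious */
                                                     /* (but safe) flush       */
        load_commits(0x1000);
        return flush ? 0 : 1;
    }

The main() example triggers exactly the spurious-flush case discussed above: a different address hashes into the same filter entry while an aggressively deallocated but uncommitted load is outstanding.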

Note that we do not apply AD to the SQ. The criteria for aggressive deallocation of

stores are a bit more complex. In particular, a store must have completed and its value must have been read by all later dependent loads. The problem is that until this store has been completely overwritten by another later store to the same address, the processor cannot

guarantee that it will not fetch another load that happens to read from this same address.

Even if another store does completely overwrite this address, we must guarantee that that

store will indeed complete (e.g., it cannot be in the shadow of a mispredicted branch).

We believe that the hardware required to test all of these conditions is far too complex to implement in current processors, and so we do not pursue AD for the SQ any further in this work.

7.5 Evaluation Methodology

Our evaluation infrastructure uses a cycle-level simulator built on top of a pre-release version of SimpleScalar that supports execution of programs compiled for the x86 ISA [3]. This simulator models many low-level details of the microarchitecture, including the cracking of complex x86 instructions to sequences of micro-operations, a fetch engine that can speculate past multiple branches and even recover from mispredictions under the shadow of earlier but not yet resolved branch mispredictions, and atomic x86 commit that stalls commit until all µops from the same instruction have completed and are free of faults. Our baseline processor is based on the Intel Core 2 [19], employing a 96-entry reorder buffer, 32-entry issue queue (RS), 32-entry load queue, 20-entry store queue, and a 14-stage minimum branch

misprediction latency. Since these parameters are taken from a real processor, we believe that our baseline model represents a well-balanced and fair starting point. The processor can fetch up to 16 bytes per cycle, decode and rename up to four x86 instructions per cycle with one complex and three simple decoders, issue up to six µops per cycle, and commit up to four µops per cycle.

In our simulator (and as generally implemented in hardware), we employ a senior store queue (SSQ) that holds stores that have committed, but not yet finished writing to cache.

This SSQ is a virtual structure that is directly integrated into the conventional SQ. When a store commits, it writes its result to memory, but its SQ entry does not get deallocated until the write actually completes (which could be many cycles in the case of a cache miss). On commit, the store is logically moved from the regular SQ to the SSQ, although no physical movement of the SQ entry is required. An entry in the SSQ cannot be reallocated to a new instruction until the previous store has completed its writeback. As a result, both senior stores and regular stores compete for the SQ resources.
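The sketch below shows one way this logical (rather than physical) movement can be captured with head pointers over a single circular buffer; the pointer names and the counter-based occupancy check are assumptions for illustration, not our simulator's data structures.

    /* Sketch of a senior store queue (SSQ) as a region of the circular SQ
     * delimited by pointers, so commit moves no data.                     */
    #include <stdbool.h>

    #define SQ_SIZE 20

    struct sq {
        int senior_head;   /* oldest committed store still writing back   */
        int head;          /* oldest store that has not yet committed     */
        int tail;          /* next entry to allocate                      */
        int count;         /* allocated entries, senior stores included   */
    };

    /* Commit: the store logically joins the SSQ; its entry is unchanged.  */
    static void sq_commit_oldest(struct sq *q)
    {
        q->head = (q->head + 1) % SQ_SIZE;
    }

    /* Writeback completes: only now is the entry free for reallocation.   */
    static void sq_writeback_done(struct sq *q)
    {
        q->senior_head = (q->senior_head + 1) % SQ_SIZE;
        q->count--;
    }

    /* Allocation sees one shared pool: senior and regular stores compete
     * for the same SQ_SIZE entries.                                       */
    static bool sq_can_allocate(const struct sq *q)
    {
        return q->count < SQ_SIZE;
    }

    int main(void)
    {
        struct sq q = { 0, 0, 0, 0 };
        q.tail = 1; q.count = 1;        /* one store in flight             */
        sq_commit_oldest(&q);           /* becomes a senior store          */
        sq_writeback_done(&q);          /* entry finally freed             */
        return sq_can_allocate(&q) ? 0 : 1;
    }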

We used integer and floating point benchmarks from the SPECcpu2000 and SPECcpu2006 benchmark suites, although we were not able to include a few applications due to incomplete system call implementations in the current version of SimpleScalar/x86. We use the Sim-Point 3.2 toolset to choose representative samples [30]. We warm the caches and branch predictor for 500 million x86 macro instructions, and then perform detailed simulation for another 100 million instructions. All of our performance numbers are reported as speedups with respect to the baseline configuration. We use the geometric mean to report average numbers. The speedup numbers are reported in terms of the relative micro-ops committed per cycle (µPC).
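For clarity, the short sketch below shows how such averages are formed: per-benchmark speedup is the ratio of committed µPC to the baseline µPC, and suite averages use the geometric mean. The µPC values in main() are made-up placeholders, not measured numbers.

    /* Geometric mean of relative uPC speedups (link with -lm). */
    #include <math.h>
    #include <stdio.h>

    static double geomean_speedup(const double *upc, const double *base_upc,
                                  int n)
    {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(upc[i] / base_upc[i]);    /* relative uPC        */
        return exp(log_sum / n);
    }

    int main(void)
    {
        double base[] = { 1.20, 0.80, 2.00 };        /* placeholder values  */
        double opt[]  = { 1.21, 0.79, 2.00 };
        printf("average speedup: %.4f\n", geomean_speedup(opt, base, 3));
        return 0;
    }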

For our power analysis, we incorporated a tool similar to Wattch in our simulator [10].

The simulator combines unit-level activity factors/access counts with per-block power models to estimate overall power consumption. Our power model is based on a 65 nm technology.

Similar to Wattch, our tool collects activity numbers for all accessed structures in the various stages of the pipeline. Recent data suggests that leakage power accounts for 35% of total power in modern microprocessors, hence we use this estimate in our power model [54].

Table 8: Relative power contribution of different processor blocks. For the out-of-order and memory blocks, we include the individual power for some key units. The percentage contribution of these sub-units is already included in the macro-block's total.

Processor Block       Power          Processor Block      Power
Clock                 25%            Memory Execution     27% (Total)
Front-End             8%             LQ                   4.4%
Decode                10%            SQ                   3.6%
Out-of-Order          24% (Total)    DL1 Cache            11%
RAT                   5.3%           L2 Cache             5%
RS                    12%            Execution Units      6%
ROB                   5%

Table 8 shows the per-block contributions to overall power.1
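The sketch below illustrates the style of bookkeeping described above: dynamic power is accumulated from per-block activity counts, and leakage is folded in as a fixed 35% share of total power. The block list, per-access energies and time base are placeholders; only the 35% leakage share is taken from the text.

    /* Illustrative Wattch-style power bookkeeping (placeholder numbers). */
    #include <stdio.h>

    #define NUM_BLOCKS 3
    #define LEAKAGE_FRACTION 0.35

    static const double energy_per_access[NUM_BLOCKS] =   /* nJ, placeholders */
        { 0.10 /* LQ */, 0.08 /* SQ */, 0.15 /* RS */ };

    /* Dynamic power = sum(accesses_i * energy_i) / time; if leakage is 35%
     * of total power, then total = dynamic / (1 - LEAKAGE_FRACTION).      */
    static double total_power(const unsigned long long accesses[NUM_BLOCKS],
                              double seconds)
    {
        double dynamic = 0.0;
        for (int i = 0; i < NUM_BLOCKS; i++)
            dynamic += accesses[i] * energy_per_access[i] * 1e-9;
        dynamic /= seconds;
        return dynamic / (1.0 - LEAKAGE_FRACTION);
    }

    int main(void)
    {
        unsigned long long acc[NUM_BLOCKS] = { 900000, 400000, 2000000 };
        printf("total power: %.3f W\n", total_power(acc, 1e-3));
        return 0;
    }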

7.6 Evaluating the Application of Deferred Entry (DE) and Aggressive Deallocation (AD) in Critical Microarchitectural Resources

In this section, we evaluate the performance of our proposed techniques for deferred entry

and aggressive deallocation when applied to the LQ, SQ and the RS. We evaluate three

designs: applying DEAD individually to the LQ and the SQ, simultaneously applying DEAD

to the LQ and the SQ, and simultaneously applying DEAD to the LQ, SQ and the RS.

Our primary motivation, as described earlier, is to reduce power consumption as much

as possible while minimizing the performance impact. We achieve this by applying DEAD

to the LQ, SQ and the RS which allows us to scale down their sizes with negligible per-

formance impact. At various points, we measure the performance of the processor when

augmented with our deferred entry and aggressive deallocation techniques and compare it to

the performance of the original implementation without DEAD. We use non-uniform scal-

ing factors while decreasing the size of the structures. We start with slightly larger scaling

factors. At every simulated size, if there is a performance loss, we reduce the scaling factor

until it reaches 1. After this we simulate all sizes until we observe the maximum allowable

performance degradation. We set this threshold based on the performance of the optimized

1We had our per-block relative power consumption results validated by personal contacts at Intel (they would not release absolute numbers to us). Since we are reporting relative power reductions in this work, this model provides sufficient accuracy.

Figure 62: Performance impact of DEAD on LQ

scheme. When the optimized scheme reaches a configuration where the performance degradation is 1% with respect to the baseline, we stop scaling down the size further. The main idea in this study is that the DEAD technique should enable smaller-sized structures to remain competitive with the baseline processor. Additionally, when the

performance starts dropping off after a certain size, the DEAD technique should degrade more gracefully than the original unoptimized implementation. The DE component has

the potential to hide stalls due to a limited size, since it allows the allocation of a certain

number of instructions following an unallocated instruction. The AD component facilitates

the DE technique for the LQ by removing loads as soon as they are completely ready to be

deallocated.

7.6.1 Impact of DEAD on LQ

In this sub-section, we provide a detailed explanation of the performance impact of the

DEAD technique when implemented for the load queue. Since our goal is maximum power

reduction at a negligible performance cost, to evaluate the DEAD technique, we start with the baseline LQ size of 32 entries and gradually reduce the number of LQ entries. For the load queue, the performance benefit of allowing deferred allocation of loads depends on the number of independent instructions following the load that can be allocated. Since a deferred load cannot execute and retrieve data, instructions that depend on this load do not benefit from out-of-order allocation.

Figure 62(a) shows three performance curves (averaged over all benchmarks) corre-

sponding to three configurations: using DE by itself on the LQ, using both DE and AD

(DEAD) on the LQ, and the baseline that uses neither (Original). Starting with the base case of a 32-entry LQ, we reduce the LQ size and observe the overall performance impact.

The x-axis represents different LQ sizes in decreasing order (note that the increments are non-linear). The y-axis shows the slowdown in terms of relative µPC for the decreasing LQ sizes compared to the baseline (32-LQ,20-SQ) configuration. The lowest curve on this plot

(Original) shows that in a conventional allocation policy, initially scaling down the LQ does not hurt performance much. This result is also evident in Figure 57 which showed that at the baseline configuration, on average, the RS was more of a bottleneck. Scaling down beyond 16-entries, however, causes the performance to drop off rapidly. Beyond this size, the LQ becomes a critical bottleneck for the processor and the in-order allocation constraint drastically reduces the processor utilization. The DE curve shows that, on average, using deferred entry of loads provides higher performance than the conventional allocation policy

even at smaller sizes. In fact, for an 11-entry LQ, allowing deferred allocation of loads loses just 1% performance with respect to the base case. Based on area estimates from

CACTI [81], this provides a 30% reduction in LQ area. The performance numbers shown

on this plot are for configurations employing a 4-entry Load-DB. We chose this size after

performing a sensitivity study which showed that having more than four partially allocated

loads does not provide much additional benefit. The Load-DB represents an additional

power overhead not present in the baseline processor, but a four-entry FIFO buffer is not

much compared to removing almost two-thirds of the LQ entries. Finally, the highest curve

on this plot shows the performance when DE and AD are applied in conjunction to scale

down the size of the LQ. The performance of this configuration, even at an 11-entry LQ, is

almost equal to the baseline 32-entry LQ. A 9-entry LQ still manages to remain competitive

with the baseline (0.5% performance loss), and an 8-entry LQ observes a 1% degradation.

The 8-entry LQ provides over 50% reduction in area. The performance only starts really

dropping off after the size is scaled beyond eight entries, which is only one quarter of the

original capacity!

Figure 63: Performance impact of DE on SQ

After presenting the average performance for a spectrum of LQ sizes, we present per-suite averages for two specific configurations in Figure 62. We chose performance reduction cut-offs (with respect to the baseline configuration) of 0.5% and 1.0%, which correspond to DEAD with a 9-entry LQ and an 8-entry LQ, respectively. The 9-entry LQ augmented with the DEAD technique performs nearly as well as the baseline 32-entry conventional load queue for all the benchmark suites. The performance of a LQ augmented with the

DEAD technique also outperforms a conventional load queue of the same size across all the applications. Additionally, although not shown in the figure, the standard deviation in the conventional schemes is much greater than in DEAD. In particular, for the benchmark galgel from the SPECfp2000 suite, LQ allocation stall cycles account for 49% of all stall cycles in the conventional approach, due to which the performance falls by 20% with an

8-entry LQ. When DEAD is employed, however, the same configuration experiences only

1% stall cycles due to the LQ size constraint. In the conventional technique, scaling LQ sizes down to eight entries causes eight applications to suffer more than 5% performance loss.

7.6.2 Impact of DE on SQ

In this sub-section, we provide a detailed explanation of the performance impact of the DE technique when implemented for the store queue. As explained earlier, AD is not implemented for the SQ. Since our goal is maximum power reduction at a negligible performance

cost, we start with the baseline SQ size of 20 entries and gradually reduce the number of SQ

entries. Unlike a load, a store will likely have fewer dependents since the only dependent a

store could have is a load to the same address that needs the value of this store.

Figure 63(a) shows two performance curves which correspond to two potential allocation

schemes for the SQ. The lower curve shown by the dotted line is the conventional in-

order allocation approach (Original). The higher curve shows the performance when DE

is applied to the SQ. The x-axis in this figure represents the different simulated SQ sizes

in decreasing order and the y-axis represents the slowdown in relative µPC with respect to

the baseline configuration. As can be seen in the figure, the performance of a 12-entry SQ

using conventional allocation techniques as well as using DE is comparable to the baseline

20-entry SQ. Stores are fewer in number than loads and are generally not on the critical path; hence until the specific SQ size that impacts the allocation of other instructions is reached, the performance difference is marginal. After this point, however, the performance for the conventional allocation scheme drops off rapidly. Finally, at an 8-entry SQ size, the performance for the conventional allocation scheme drops by 5% with respect to the baseline.

For the SQ configuration that employs deferred entry, even a small SQ of only eight entries loses just 1.2% in performance with respect to the baseline. The numbers shown on this plot represent a configuration with a 10-entry Store-DB. As explained earlier, even senior stores compete for the main SQ. Hence for small SQs, we need to maintain a 10-entry store

FIFO to get maximum performance. This structure is a simple FIFO buffer that requires no CAM logic or indexing and consumes much less power than the original SQ.

In Figure 63(b), we present per-suite averages for a 10-entry SQ and an 8-entry SQ since these two points represent approximately 0.5% and 1% performance degradation, respectively. In this plot, we see that across all applications, a 10-entry SQ augmented with DE for stores outperforms a conventional SQ of the same size and is within 0.5% of the baseline 20-entry SQ performance. Using CACTI estimates, this provides nearly 20% area reduction for the SQ. For an 8-entry SQ without DE, the floating point applications see considerable degradation. One particular example is equake. In this application, we observed that more than 70% of the stalls were due to a full SQ.

7.6.3 Simultaneous Application of DEAD to LQ and SQ

In the previous sub-sections, we analyzed the performance of the DEAD technique when

applied separately to different structures. In this sub-section, we apply DEAD to both the

LQ and SQ and optimize them together. Earlier, we were able to simply sweep the LQ or

SQ size. Now that we are considering two variables at once, we employ a hill-climbing-like approach to find the best LQ/SQ sizes. We start at the baseline configuration of 32 LQ and

20 SQ entries, and then reduce the LQ by one entry, and then reduce the SQ by one entry.

We choose whichever configuration results in the smallest average performance decrease,

and then repeat the process starting from there.
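A sketch of this hill-climbing loop is shown below. Here simulate_slowdown() stands in for a full simulation run of one (LQ,SQ) configuration and uses a toy cost model, so it is purely illustrative; the 1% stopping threshold follows the earlier discussion.

    /* Sketch of the hill-climbing search over (LQ,SQ) sizes. */
    #include <stdio.h>

    /* Placeholder: average slowdown (%) versus the 32/20 baseline.        */
    static double simulate_slowdown(int lq, int sq)
    {
        return 0.05 * (32 - lq) + 0.08 * (20 - sq);   /* toy model only     */
    }

    int main(void)
    {
        int lq = 32, sq = 20;
        const double stop_at = 1.0;                   /* 1% degradation cap */

        while (lq > 1 && sq > 1) {
            double shrink_lq = simulate_slowdown(lq - 1, sq);
            double shrink_sq = simulate_slowdown(lq, sq - 1);
            double best = shrink_lq < shrink_sq ? shrink_lq : shrink_sq;
            if (best > stop_at)
                break;                                /* next step too costly */
            if (shrink_lq <= shrink_sq)
                lq--;                                 /* shrink whichever     */
            else                                      /* hurts least          */
                sq--;
            printf("(%d,%d) -> %.2f%% slowdown\n", lq, sq, best);
        }
        return 0;
    }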

7.6.3.1 Performance Results

Figure 64(a) shows the different (LQ,SQ) points that we obtained using our simulation

approach on the x-axis. The y-axis shows the performance with respect to the baseline

for the simulated configurations. In addition to comparison with the baseline, at every

simulated point, we also compare with the unoptimized LQ/SQ configuration. That is, what

would be the performance of a processor with the given LQ and SQ sizes and conventional

in-order allocation/deallocation of the LQ and SQ entries. The dotted-curve corresponds to

the conventional approach while the solid line represents the performance of an optimized

LQ augmented with DEAD and an optimized SQ augmented with DE. As can be seen in

the plot, for both curves, keeping the SQ size intact while reducing the LQ size provides the

maximum performance until both are scaled down to 20 entries. Beyond this point, the LQ

and SQ sizes impact each other which is why both the parameters are varied. Moreover,

while the performance corresponding to the original technique drops off nearly linearly,

the DEAD technique still remains competitive with respect to the baseline. In fact, even

the configuration corresponding to a 13-entry LQ and a 10-entry SQ (13,10) only sees a

0.5% performance degradation. Finally, at a 9-entry LQ and 10-entry SQ configuration the

performance of DEAD drops by 1% with respect to the baseline.

Similar to the earlier plots, we present per-suite averages at two configurations. We choose the two configurations which correspond to a 0.5% performance drop and a 1% performance drop to further analyze. Figure 64(b) shows average performance for the SPEC2000 and SPEC2006 benchmark suites for the (13,10) and the (9,10) configurations. The DEAD technique for the (13,10) configuration performs almost as well as the baseline for nearly all the SPECcpu2006 applications.

Figure 64: Performance impact of DEAD when simultaneously optimizing the LQ and SQ

7.6.3.2 Power Results

We now describe the potential power savings that can be achieved by simultaneously using the DEAD technique for the LQ and the DE technique for the SQ. Our power numbers indicate that the LQ and SQ together account for 8% of total processor power. When DEAD is applied to the LQ and SQ together, we can reduce the sizes of the structures with minimal performance degradation. We measure the power savings at the 13-entry LQ and

10-entry SQ configuration since this configuration remained competitive with the baseline processor, losing just 0.5% performance on average. By reducing the LQ by 19 entries and the SQ by 10 entries, we can reduce the power of these structures by 50%. The load and store DBs that hold the deferred loads and stores are small buffers that require no broadcast and tag matching logic (we do account for the power overhead for these structures). The net result is a 4.9% reduction in power and a 4.8% improvement in the ED2 metric. In the next sub-section, we will explain how we can further reduce the overall processor power.

Figure 65: Performance impact of DEAD when simultaneously optimizing the LQ, SQ and RS

7.6.4 Applying DEAD to LQ, SQ and RS

In this sub-section, we evaluate the impact of simultaneously applying DEAD to the LQ, SQ and the RS. Specifically, we will use the DE technique to scale down the size of the RS.

Reservation station entries are already deallocated aggressively (as soon as they execute).

We use the same “least performance degradation approach” as described when applying

DEAD for the LQ and SQ, except we now have three parameters to optimize. At every

simulated point, we consider reducing the size of each of the LQ, SQ, and RS in turn, and select the choice that results in the smallest performance reduction, and then repeat this process. Through this scaling approach, we attempt to reach a balanced LQ-SQ-RS configuration that maximizes power reduction while minimizing the performance impact.

7.6.4.1 Performance Results

In Figure 65(a), the x-axis shows the different (LQ,SQ,RS) points we obtained using our simulation approach and the y-axis shows the relative performance with respect to the baseline (32,20,32) configuration. The curves show a few interesting results. First, using DEAD on the LQ and DE on the SQ and RS for loads and stores at the baseline processor configuration actually improves performance, albeit marginally. This indicates that even at the baseline (32,20,32) configuration, the RS is frequently a bottleneck, and by not allocating RS entries for deferred loads and stores, we can avoid unnecessary contention in the RS, which can improve performance. Even at (16,20,24), the DEAD

configuration matches the performance of the original baseline. After this point we can still continue to scale down the LQ, SQ and RS by applying DEAD while maintaining

performance close to the baseline until we reach the (14,12,18) point, where we observe a

0.5% average performance degradation. Using DEAD, we can continue reducing the sizes

of these three critical structures right down to a 13-entry LQ, 10-entry SQ and 15-entry RS

with just 1% loss in performance. The conventional approach on the other hand performs

poorly after the (16,20,24) point and experiences 4% performance degradation at the final

simulated configuration.

The per-suite averages shown in Figure 65(b) indicate similar trends. As before, we

picked the configurations that correspond to a 0.5% performance loss (14,10,18) and a 1%

performance loss (13,10,15). Across all applications, scaling using DEAD outperforms a con-

ventional allocation and deallocation scheme. In particular none of the applications observe

slowdowns greater than 1.8%. The baseline implementation on the other hand observes

considerable slowdown for a large number of applications at the 14-entry LQ, 10-entry SQ,

18-entry RS configuration. Several of the applications from the SPEC2000 suite observe

slowdowns greater than 5%. Figure 66 shows the performance of each individual benchmark

for our final configuration (14,10,18) for which DEAD experiences 0.5% performance loss on

average. In the baseline case, the galgel benchmark shows a huge performance loss, which

was earlier explained to be due to massive LQ allocation stalls. With DEAD, these stalls are

largely avoided and the performance impact on this benchmark is greatly attenuated. Over

all of the benchmarks, DEAD maintains performance quite close to the original baseline

configuration (32-entry LQ, 20-entry SQ, 32-entry RS).

7.6.4.2 Power Results

After studying the performance impact of deferring the allocation of loads and stores in the three critical structures, the LQ, SQ and the RS, we measure the power benefits of

being able to reduce the sizes of these structures. We measure the power savings at the

14 LQ, 10 SQ, 15 RS configuration since this configuration experienced only 0.5% slowdown

with respect to the baseline 32 LQ, 20 SQ, 32 RS configuration. As mentioned earlier, the

Figure 66: Per-benchmark performance for final configuration.

LQ and SQ together account for 8% of the processor power in the baseline. In addition to this, the RS accounts for 12% of the total processor power. Reducing the number of RS entries by half reduces the power of the RS by more than 54%. In addition to the

savings provided by the scaled LQ and SQ sizes, applying DEAD at this configuration

reduces overall processor power by 11%. Considering that this configuration on average

loses just 0.5% performance as compared to the baseline, the 11% power savings is quite

significant. The equivalent improvement in the ED2 metric is 10%. Note that our results

are actually somewhat conservative. When these critical structures have their sizes reduced

this much, their internal access latencies and circuit delays will now be much faster. For

a fixed clock speed, this means that their circuit implementations can be relaxed (e.g., use

CMOS instead of domino logic, or use slower/longer transistors to reduce leakage) while continuing to meet timing targets. This means that our 11% reported power savings really

could be even greater if circuit-level re-optimization is employed.

7.7 Comparison with Scalable LQ/SQs

In Section 7.2, we qualitatively discussed why using scalable LQ/SQ designs to build small

LQ/SQs does not result in a net power win. In this section, we provide additional experimental results to demonstrate this claim. We compare our DEAD approach to a generalization of a hybrid CAM+RAM approach proposed by Baugh and Zilles [8]. Their technique divided the SQ into two components: one capable of performing store-to-load forwarding,

and another simpler one that lacks such support. The highly-associative (CAM-based) forwarding SQ only needs to be large enough to support those stores that are (predicted to be) involved in store-to-load forwarding. For the remaining stores, it is relatively easy to build a large, scalable, non-associative (RAM-based) queue. The combination provides an effective solution for implementing a large effective store queue. A predictor determines whether a store gets allocated into the forwarding or non-forwarding store queue based on whether it was involved in store-to-load forwarding in the past. Since these predictions can be wrong, a store which should forward data may get allocated into the RAM-based queue, resulting in a later load not being able to get the correct value. The hybrid SQ makes use of Filtered Load Re-execution (FLR) to verify a load’s data prior to commit [59], flushing the pipeline on an ordering violation.

We generalize Baugh and Zilles’ CAM+RAM SQ technique to the LQ as well. Any load that is predicted to receive data from an earlier store is allocated into a CAM-based forwarding LQ, and all other loads allocate into a non-forwarding LQ. All loads in the forwarding LQ can search the forwarding SQ for data. All stores in the forwarding SQ can search the forwarding LQ for ordering violations. Any ordering violations that involve loads or stores in the non-forwarding queues will only be detected by the commit-time FLR.

With high prediction rates regarding whether a load or store will be involved in forwarding, which has been shown to be relatively easily predicted [78, 67], this hybrid CAM+RAM approach can result in scalable LQ and SQ structures.
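The sketch below shows the kind of PC-indexed forwarding predictor such a design assumes; the table size, counter width and training policy here are illustrative choices, not taken from the published designs.

    /* Sketch of a PC-indexed forwarding predictor that steers loads/stores
     * to the CAM-based (forwarding) or RAM-based (non-forwarding) queue.  */
    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 512

    static uint8_t forwarded_before[PRED_ENTRIES];   /* 2-bit counters      */

    static unsigned pred_index(uint64_t pc)
    {
        return (unsigned)((pc >> 2) % PRED_ENTRIES);
    }

    /* Allocation-time decision: CAM-based (true) or RAM-based (false).     */
    static bool allocate_into_forwarding_queue(uint64_t pc)
    {
        return forwarded_before[pred_index(pc)] >= 2;
    }

    /* Training: strengthen when forwarding actually occurred, weaken when
     * a predicted-forwarding instruction never forwarded.                  */
    static void train(uint64_t pc, bool did_forward)
    {
        uint8_t *ctr = &forwarded_before[pred_index(pc)];
        if (did_forward && *ctr < 3)
            (*ctr)++;
        else if (!did_forward && *ctr > 0)
            (*ctr)--;
    }

    int main(void)
    {
        train(0x400100, true);
        train(0x400100, true);
        return allocate_into_forwarding_queue(0x400100) ? 0 : 1;
    }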

Figure 67 presents the performance of the Hybrid (Load and Store) Queues design in comparison to the baseline as well as the DEAD technique. For fair comparisons, the DEAD results in this section do not include reducing the RS size. The data points on the x-axis are derived by applying the same hill-climbing approach as used in the earlier studies, except that we scale the LQ and SQ size (the CAM components) according to the performance of the Hybrid Queues rather than DEAD. The RAM components are fixed at 30 entries for the LQ and 20 for the SQ, which were simply experimentally found to provide the best performance results.

The DEAD approach consistently outperforms the hybrid queue technique. The reason

the Hybrid Queues do not perform better is that we are starting with a configuration

modeled on the Intel Core 2 processor, which is already balanced in terms of the sizes of

the ROB, RS, LQ and SQ. In many of the scalable LQ/SQ works, the ROB and RS have

been up-sized to make the LQ and SQ the primary bottlenecks in the processor. With a

well-balanced machine where the LQ and SQ are much smaller than in the other scalability

studies, there is not much performance to be gained. Moreover, when the sizes of the LQ

and SQ are reduced, the overheads of commit-time re-execution and the occasional pipeline

flush actually make the performance decrease.

We also analyzed the power consumed in a microarchitecture using Hybrid LQ/SQs.

For our power analysis, we also include the overheads of filtered load re-execution with a

512-entry Store Sets Bloom Filter (SSBF) and 1024-entry Store PC Table (SPCT), and

512-entry predictors for determining whether a load or store should be allocated into the

RAM-based or CAM-based queues. At the smallest configuration simulated (13,10), the

Hybrid Queues design reduces the power consumption in the LQ and SQ together by 30% and overall processor power by 5%, assuming that the forwarding predictor, commit-time re-

execution and pipeline flushes cost no power at all. When including the power consumption

of the predictors, re-executions and pipeline flushes for the Hybrid Queues, the power savings

is reduced to only 2.2%. Note that for the same sized queues, the DEAD approach also

achieves a 5% power reduction, but since DEAD does not make use of predictions, it does

not have to spend any additional power on auxiliary predictors and re-executions. These

results illustrate our previous assertion that at smaller queue sizes, the power overheads of

the auxiliary structures and re-execution for the scalable LQ/SQ techniques are no longer

negligible, and that different techniques such as DEAD are needed to build low-power LQ

and SQs.

7.8 Conclusions

The power consumption of microprocessors is a very serious challenge facing chip designers

today. In this chapter, we eliminated the commonly-occurring inefficiencies that exist in

allocation and deallocation policies for memory instructions. We present novel, yet simple

Figure 67: Performance of the Hybrid Load and Store Queues, DEAD techniques for the Load and Store Queues, and Original (Conventional) Load and Store Queues

techniques to reduce the power consumption of some of the most critical resources in a modern out-of-order microprocessor. Using Deferred Entry and Aggressive Deallocation,

we reduce the sizes of these critical structures by more than half, thereby reducing the area

required for them, too. We are able to reduce the power consumed by the LQ, SQ and RS

significantly, which provides a conservative estimate of an 11% total power savings.

The deferral buffers used in DE are fairly small, non-associative structures which are not

very difficult to include in a processor. In comparison to all past proposals that try to

reduce the power consumption of the critical structures in a processor, DEAD provides a very simple design. The power and area savings achieved in the LQ, SQ and the RS without losing overall performance allow DEAD to improve the overall efficiency of the processor.

CHAPTER VIII

SUMMARY AND CONCLUSION

Processor efficiency can be improved by exploiting common-case and predictable behaviors of memory instructions. In this dissertation, we have demonstrated that certain efficiency metrics, such as performance, power consumption, area overhead, scalability and implementation complexity, can be optimized and improved by using the following memory behavioral patterns: predictability in memory dependences, predictability in data forwarding, predictability in instruction criticality and conservativeness in resource allocation and deallocation policies.

The main contributions of this dissertation are studying the impact of memory instruction processing on processor efficiency and proposing novel designs and optimizations to improve conventional memory instruction processing. We proposed and evaluated five designs in this regard.

Our Store Vectors memory dependence predictor improves processor performance while incurring minimal design complexity by making use of predictable memory dependence patterns in loads and stores.

Our PEEP instruction fetch methodology improves resource utilization, thus improving processor performance by applying accurate memory dependence prediction in the fetch engine of an SMT processor.

Our Fire-and-Forget data forwarding design uses common-case data forwarding patterns to reduce the power consumed by the critical memory processing structures, the load and store queues, while providing a slight improvement in performance, thereby improving processor efficiency. Further additions to our base implementation also provide scalable processor designs that completely do away with the load and store queues while actually improving the performance of the processor.

Our Load Criticality Prediction technique provides an interesting classification of load

instructions based on their importance to processor performance and enables us to use the predictability in load instruction criticality in applications that improve the performance while reducing power consumption and area cost.

Finally, our DEAD designs significantly reduce the power consumption in and increase the performance potential of the load and store queues by reducing their sizes while relaxing stringent allocation and deallocation constraints in these queues.

Some potential future research avenues could explore the conjunction of these techniques to further improve processor efficiency. The application of the Store Vectors predictor in the PEEP fetch methodology, for example, could provide better prediction accuracy and increase processor throughput. The Load Criticality Predictor can potentially be used in the

Fire-and-Forget design to only maintain prediction history for critical loads thereby reducing the hardware required for the data forwarding predictor. Fire-and-Forget can be used in conjunction with DEAD to reduce the power of the load and store queues by removing the associative control logic in them. Furthermore, due to the low implementation complexity and design scalability of all our techniques, they can be used across a range of processor sizes as well as in conjunction with other design proposals that eliminate inefficiencies in the processor.

This dissertation focuses primarily on single-threaded, single-core, high-performance out-of-order processors (PEEP is an exception, as it improves the efficiency of SMT processors).

In this context, it is important to discuss alternative microarchitectural paradigms such as multi-core processors and embedded processors. The design of multi-core processors is in fact orthogonal to our proposed techniques. All our designs can be applied in the multi-core domain with minimal, if at all, design variation. Both Store Vectors and Fire-and-Forget can be used to improve the performance of each core in the processor while maintaining the constrained per-core power budgets. PEEP can also be useful in a multi-core design since some future multi-core designs are likely to also be multi-threaded [33]. The LCP and

DEAD designs can be extended to improve the performance of each core while reducing the individual power and area overhead thereby making the entire processor more efficient.

Embedded processors have slightly different design goals from high-performance processors.

Specifically, optimizations that focus more on reducing the power consumption may be more useful than, for example, the Store Vectors or the PEEP design. Modified versions of the DEAD technique might be useful in these types of processors. All these research avenues offer exciting scope for future work and might be very useful in improving processor efficiency. They are, however, out of the scope of this dissertation.

In this dissertation we have proposed that processor efficiency can be improved by exploiting common-case and predictable behaviors of memory instructions, and have offered five novel design proposals (four of which have been published in peer-reviewed conferences and journals) which support our thesis.

CHAPTER IX

APPENDIX

The following tables provide additional results on the performance of the Store Vectors memory dependence predictor for the large and extra-large configurations in comparison to other proposed predictors. Tables 9, 10, 11 and 12 provide a quick performance comparison of all the memory dependence predictors over the entire set of applications. The base IPC refers to blind prediction, while Perfect depicts the performance achieved with an oracle predictor. The results indicate that Store Vectors nearly reaches the performance of a perfect predictor for most applications.
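For readers interpreting these tables, the small program below shows the assumed meaning of each entry: the percentage columns are read here as IPC improvements of each predictor over the blind-prediction baseline, and a benchmark is flagged dependency sensitive when perfect prediction changes performance by more than 1%. The function names and this exact reading of the columns are assumptions made for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed metric: percentage IPC change of a predictor relative to
     * the blind-prediction baseline IPC. */
    static double speedup_pct(double base_ipc, double predictor_ipc)
    {
        return 100.0 * (predictor_ipc - base_ipc) / base_ipc;
    }

    /* Assumed classification: dependency sensitive when a perfect
     * predictor improves IPC by more than 1% over blind prediction. */
    static bool dependency_sensitive(double base_ipc, double perfect_ipc)
    {
        return speedup_pct(base_ipc, perfect_ipc) > 1.0;
    }

    int main(void)
    {
        /* Example: ammp on the 'large' configuration has base IPC 1.60;
         * a perfect predictor reaching about 1.662 IPC corresponds to
         * the table's +3.9%. */
        double base = 1.60, perfect = 1.662;
        printf("speedup = %.1f%%, sensitive = %d\n",
               speedup_pct(base, perfect),
               dependency_sensitive(base, perfect));
        return 0;
    }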

Benchmark Name Base IPC LWT St. Sets St. Vectors Perfect
adpcm-dec M 3.83 0.0% 0.0% 0.0% 0.0%
adpcm-enc M 1.58 0.0% 0.0% 0.0% 0.0%
ammp F x 1.60 2.7% 2.9% 3.9% 3.9%
anagram P x 2.05 8.8% 11.4% 12.6% 12.7%
applu F x 0.90 4.9% 4.3% 4.9% 4.4%
apsi F x 2.23 2.4% 3.3% 5.1% 5.4%
art-1 F x 1.68 -0.1% 0.1% 3.3% 3.3%
art-2 F x 1.66 -0.1% 0.1% 3.6% 3.6%
bc-1 P x 2.17 0.7% 5.1% 8.4% 8.2%
bc-2 P x 1.11 0.0% 1.9% 5.8% 7.0%
bc-3 P x 1.15 4.7% 4.7% 5.4% 5.5%
bzip2-1 I x 1.57 0.2% 1.0% 1.8% 2.1%
bzip2-2 I x 1.62 0.0% 0.4% 2.1% 2.3%
bzip2-3 I x 1.72 0.0% 0.9% 2.5% 3.2%
clustalw B 3.33 0.1% 0.2% 0.3% 0.3%
crafty I x 1.25 3.0% 6.0% 7.4% 7.5%
crc32 E 2.62 0.0% 0.0% 0.0% 0.0%
dijkstra E 2.15 0.0% 0.0% 0.3% 0.4%
eon-1 I x 1.26 -3.2% 10.4% 18.7% 19.2%
eon-2 I x 0.92 0.4% 2.7% 5.8% 6.0%
eon-3 I x 1.03 0.0% 4.4% 10.7% 10.3%
epic M 1.86 0.0% 0.0% 0.0% 0.0%
equake F x 1.00 5.0% 5.0% 7.3% 7.3%
facerec F x 2.19 6.8% 7.1% 7.9% 8.0%
fft-fwd E 2.24 0.0% 0.0% 0.4% 0.4%
fft-inv E 2.24 0.0% 0.0% 0.3% 0.3%
fma3d F x 1.14 3.1% 4.2% 9.5% 11.3%
ft P 2.15 0.1% 0.1% 0.3% 0.4%
g721decode M x 1.76 0.0% 1.4% 1.7% 1.8%
g721encode M x 1.64 0.5% 1.1% 1.8% 1.7%
galgel F 3.34 0.0% 0.0% 0.0% 0.0%
gap I x 1.28 0.0% 0.3% 3.4% 5.4%
gcc-1 I x 2.33 0.0% 0.2% 17.6% 17.7%
gcc-2 I x 0.76 -7.0% 3.7% 5.0% 5.8%
gcc-3 I x 0.82 0.0% 2.0% 3.9% 4.2%
gcc-4 I x 0.96 1.0% 2.2% 4.2% 4.4%
ghostscript-1 E x 1.39 8.3% 15.1% 16.4% 17.9%
ghostscript-2 M x 1.40 7.6% 14.7% 15.2% 17.4%
ghostscript-3 E x 1.63 12.7% 15.3% 15.2% 16.2%
glquake-1 G x 1.04 1.4% 3.7% 5.1% 5.4%
glquake-2 G x 1.17 1.8% 4.0% 5.2% 5.3%
gzip-1 I 1.79 0.0% 0.0% 0.6% 0.5%
gzip-2 I 1.37 -3.2% 0.2% 0.3% 0.2%
gzip-3 I x 1.49 1.0% 1.1% 1.5% 1.7%
gzip-4 I 1.81 0.0% 0.0% 0.5% 0.5%
gzip-5 I 1.34 -0.1% 0.4% 0.6% 0.5%
jpegdecode M x 2.21 -4.9% 1.2% 2.3% 2.3%
jpegencode M x 2.11 4.0% 4.7% 6.4% 6.2%
ks-1 P 1.23 0.0% 0.0% 0.0% 0.0%
ks-2 P 0.78 0.0% 0.0% 0.0% 0.0%
lucas F x 1.08 9.2% 5.7% 9.4% 9.5%
mcf I 0.75 0.0% 0.6% 0.9% 0.9%

Table 9: The performance of the memory dependence predictors simulated on the ‘large’ configuration for the entire set of applications. An ‘x’ signifies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect prediction).

Benchmark Name Base IPC LWT St. Sets St. Vectors Perfect
mesa-1 M 1.49 -0.9% -0.1% 0.4% 0.7%
mesa-2 F x 1.58 17.1% 31.7% 39.1% 40.3%
mesa-3 M x 1.06 -0.1% 0.8% 1.5% 1.6%
mgrid F 1.41 -0.3% 0.0% -0.5% 0.0%
mpeg2decode M x 2.65 5.2% 5.6% 6.4% 6.4%
mpeg2encode M 3.13 0.0% 0.0% -0.6% 0.8%
parser I x 1.26 2.2% 7.4% 8.9% 9.2%
patricia E 1.09 0.0% 0.3% 0.0% 0.3%
perlbmk-1 I x 0.95 5.9% 9.9% 13.9% 14.6%
perlbmk-2 I x 1.22 -6.6% -0.4% 3.9% 3.1%
perlbmk-3 I x 2.36 1.5% 2.4% 7.3% 7.6%
phylip B x 1.42 0.0% 0.0% 1.1% 1.2%
povray-1 G x 0.77 13.1% 19.0% 22.0% 22.5%
povray-2 G x 1.01 3.3% 7.3% 10.6% 11.2%
povray-3 G x 1.06 12.2% 18.8% 22.2% 23.4%
povray-4 G x 1.04 6.3% 11.5% 14.3% 14.9%
povray-5 G x 0.88 3.8% 8.0% 10.3% 10.5%
rsynth E x 2.19 -0.2% 0.9% 1.7% 1.7%
sha E x 2.85 9.2% 9.7% 9.7% 10.2%
sixtrack F 1.30 0.0% 0.0% 0.4% 0.0%
susan-1 E x 2.34 0.3% 1.0% 5.1% 5.1%
susan-2 E 1.85 -0.1% 0.0% 0.0% 0.0%
swim F 0.86 0.2% 0.0% 2.5% 0.0%
tiff2bw E 1.52 -0.1% 0.0% 0.8% 0.8%
tiff2rgba E x 1.53 -0.2% 0.1% 30.6% 30.3%
tiffdither E x 1.59 1.6% 3.8% 4.3% 7.3%
tiffmedian E 2.47 -0.6% 0.0% 0.5% 0.4%
twolf I x 0.86 1.4% 1.9% 3.2% 3.3%
unepic M 0.40 0.0% 0.0% 0.0% 0.0%
vortex-1 I x 1.53 30.2% 30.3% 36.9% 36.9%
vortex-2 I x 1.08 31.0% 33.4% 45.0% 46.3%
vortex-3 I x 1.42 34.8% 36.6% 47.9% 55.2%
vpr-1 I x 1.07 0.0% 5.0% 6.3% 6.1%
vpr-2 I x 0.66 -0.8% 3.3% 4.0% 4.3%
wupwise F x 2.45 3.0% 3.1% 3.9% 3.9%
x11quake-1 G x 1.58 2.6% 5.3% 6.2% 6.4%
x11quake-2 G 2.89 -0.4% 0.3% -2.5% -2.7%
x11quake-3 G x 1.57 2.6% 5.4% 6.4% 6.6%
xanim-1 G 3.27 -0.82% 0.0% 0.1% 0.2%
xanim-2 G x 2.86 0.9% 1.3% 1.7% 1.6%
xdoom G x 2.07 4.4% 5.3% 6.4% 10.7%
yacr2 P x 2.14 0.1% 1.6% 1.1% 1.4%

Benchmark Group Base IPC LWT St. Sets St. Vectors Perfect
ALL 1.51 3.2% 4.9% 6.7% 6.8%
SpecINT I 1.21 2.2% 5.3% 8.6% 8.9%
SpecFP F 1.51 3.1% 3.1% 4.7% 4.6%
MediaBench M 1.68 2.7% 4.3% 5.0% 5.1%
MiBench E 1.94 2.8% 3.6% 4.2% 4.5%
Graphics G 1.47 4.6% 7.2% 8.5% 8.9%
PtrDist P 1.49 1.9% 3.1% 4.1% 4.3%
BioBench B 2.17 0.1% 0.1% 0.7% 0.8%
Dep. Sensitive x 1.32 4.5% 8.0% 10.7% 11.1%

Table 10: Continuation of the data from Table 9.

Benchmark Name Base IPC LWT St. Sets St. Vectors Perfect
adpcm-dec M 3.83 -0.02% 0.0% 0.0% 0.0%
adpcm-enc M 1.58 -3.3% 0.0% 0.0% 0.0%
ammp F x 1.60 0.8% 1.34% 2.3% 4.4%
anagram P x 2.05 10.6% 12.1% 12.7% 19.5%
applu F x 0.92 -0.3% 0.0% 10.11% 12.5%
apsi F x 2.38 2.7% 1.1% 1.1% 1.5%
art-1 F x 1.70 1.9% 0.0% 3.9% 3.8%
art-2 F x 1.66 1.9% 0.0% 4.2% 4.1%
bc-1 P x 2.18 2.2% 6.6% 8.7% 9.9%
bc-2 P x 1.11 -2.2% 3.0% 5.7% 8.0%
bc-3 P 1.27 -1.9% 0.3% 0.7% 0.7%
bzip2-1 I x 1.56 -2.3% 1.4% 2.2% 3.5%
bzip2-2 I x 1.57 -3.0% 0.7% 5.36% 2.5%
bzip2-3 I x 1.71 -3.1% 2.2% 3.1% 4.7%
clustalw B 3.33 0.0% 0.2% 0.4% 0.3%
crafty I x 1.24 5.2% 7.0% 8.1% 7.5%
crc32 E 2.62 -11.0% 0.0% 0.0% 0.2%
dijkstra E 2.15 -7.82% 0.0% 0.5% 0.4%
eon-1 I x 1.26 13.9% 16.4% 19.8% 26.8%
eon-2 I x 0.93 1.4% 4.19% 5.8% 29.0%
eon-3 I x 1.02 5.2% 8.1% 10.7% 10.5%
epic M 1.86 -1.3% 0.0% 0.0% 0.0%
equake F x 0.99 -0.4% 0.0% 19.4% 21.0%
facerec F x 2.13 9.0% 9.3% 16.1% 16.0%
fft-fwd E 2.24 -2.5% 0.0% 0.4% 0.4%
fft-inv E 2.24 -2.5% 0.0% 0.7% 0.4%
fma3d F x 1.09 5.3% 6.1% 25.0% 25.1%
ft P x 1.88 -1.9% 0.3% 0.1% 14.9%
g721decode M x 1.76 -0.3% 1.6% 1.8% 1.9%
g721encode M x 1.64 1.0% 1.1% 1.9% 1.9%
galgel F 3.34 0.0% 0.0% 0.0% 0.0%
gap I x 1.27 -9.1% 0.5% 3.4% 4.2%
gcc-1 I x 1.7 37.3% 35.8% 61.5% 61.7%
gcc-2 I x 0.76 -9.2% 4.3% 5.7% 5.7%
gcc-3 I x 0.82 -0.2% 2.7% 4.3% 4.1%
gcc-4 I x 0.96 -0.2% 2.5% 4.4% 4.1%
ghostscript-1 E x 1.40 17.5% 15.8% 15.6% 16.9%
ghostscript-2 M x 1.66 18.0% 15.6% 14.8% 16.6%
ghostscript-3 E x 1.40 12.2% 12.7% 12.7% 12.8%
glquake-1 G x 1.04 2.6% 4.8% 5.6% 6.0%
glquake-2 G x 1.17 2.8% 4.7% 5.3% 5.1%
gzip-1 I 1.79 -3.0% 0.0% 0.8% 0.6%
gzip-2 I 1.37 -11.1% 0.4% 0.4% 0.2%
gzip-3 I x 1.49 -4.8% 1.1% 1.5% 1.6%
gzip-4 I 1.80 -3.7% 0.0% 0.6% 0.6%
gzip-5 I 1.34 -7.5% 0.3% 0.5% 0.8%
jpegdecode M x 2.21 -4.9% 1.2% 2.3% 2.3%
jpegencode M x 1.72 -0.6% 3.7% 4.5% 4.4%
ks-1 P 1.23 -9.9% 0.0% 0.0% 0.0%
ks-2 P 0.78 -10.5% 0.0% 0.0% 0.0%
lucas F x 1.05 -0.2% 0.0% 22.7% 22.4%
mcf I x 0.70 -3.8% 0.0% 12.5% 11.8%

Table 11: The performance of the memory dependence predictors simulated on the ‘extra-large’ configuration for the entire set of applications. An ‘x’ signifies that the benchmark is dependency sensitive (> 1% performance change between blind prediction and perfect prediction).

Benchmark Name Base IPC LWT St. Sets St. Vectors Perfect
mesa-1 M 1.49 -0.9% -0.1% 0.4% 0.7%
mesa-2 F x 1.60 37.3% 37.5% 40.5% 40.3%
mesa-3 M x 1.06 -0.3% 1.07% 1.98% 45.8%
mgrid F 1.41 -0.3% 0.0% -0.5% 0.0%
mpeg2decode M 2.85 -0.1% -0.4% -0.1% 0.1%
mpeg2encode M 3.17 -0.2% 0.0% -0.6% 0.8%
parser I x 1.26 2.2% 7.4% 9.1% 8.9%
patricia E 1.09 -0.8% 0.5% 0.0% 0.3%
perlbmk-1 I x 0.95 5.9% 9.9% 13.9% 14.6%
perlbmk-2 I x 1.22 -6.6% -0.4% 4.3% 3.1%
perlbmk-3 I x 2.36 -0.6% 2.2% 7.6% 9.1%
phylip B x 1.42 -1.0% 0.0% 1.5% 1.5%
povray-1 G x 0.77 16.6% 20.7% 22.0% 22.5%
povray-2 G x 1.02 7.3% 8.6% 10.2% 10.2%
povray-3 G x 1.06 19.2% 22.8% 24.2% 24.62%
povray-4 G x 1.04 11.2% 13.6% 14.4% 14.9%
povray-5 G x 0.88 6.8% 9.2% 10.2% 10.1%
rsynth E x 2.59 -0.2% 0.9% 1.7% 1.7%
sha E x 2.85 12.3% 11.9% 11.8% 13.6%
sixtrack F 1.38 0.0% 0.0% 0.4% 0.0%
susan-1 E x 2.39 -0.5% 1.1% 4.7% 4.7%
susan-2 E 1.85 -0.1% 0.0% 0.0% 0.0%
swim F 0.86 0.2% 0.0% 2.5% 0.0%
tiff2bw E 1.52 -0.1% 0.0% 0.8% 0.8%
tiff2rgba E x 1.53 -0.2% 0.1% 30.6% 30.3%
tiffdither E x 1.59 1.6% 3.8% 4.3% 7.3%
tiffmedian E 2.47 -0.6% 0.0% 0.5% 0.4%
twolf I x 0.86 -0.7% 1.4% 3.1% 1.3%
unepic M 0.40 0.0% 0.0% 1.27% 0.0%
vortex-1 I x 1.53 30.7% 30.3% 42.9% 46.3%
vortex-2 I x 1.08 31.0% 33.4% 45.0% 46.3%
vortex-3 I x 1.42 34.8% 36.6% 47.9% 55.2%
vpr-1 I x 1.08 1.9% 5.4% 6.3% 6.1%
vpr-2 I x 0.66 -0.8% 3.0% 16.9% 16.3%
wupwise F x 2.45 0.1% 0.4% 11.3% 12.1%
x11quake-1 G x 1.58 2.6% 5.3% 6.2% 6.4%
x11quake-2 G 2.89 -0.4% 0.3% -2.5% -2.7%
x11quake-3 G x 1.57 2.6% 5.4% 6.4% 6.6%
xanim-1 G 3.27 -0.82% 0.0% 0.1% 0.2%
xanim-2 G x 2.86 0.9% 1.3% 1.7% 1.6%
xdoom G x 2.07 4.4% 5.3% 6.4% 10.7%
yacr2 P x 2.14 0.1% 1.7% 1.1% 1.7%

Benchmark Group Base IPC LWT St. Sets St. Vectors Perfect
ALL 1.51 2.2% 4.7% 7.7% 8.8%
SpecINT I 1.22 2.8% 7.2% 11.4% 12.7%
SpecFP F 1.51 1.5% 1.3% 8.1% 8.4%
MediaBench M 1.72 1.7% 3.9% 4.4% 7.2%
MiBench E 1.92 0.9% 3.0% 5.3% 5.7%
Graphics G 1.48 5.5% 7.5% 8.1% 8.6%
PtrDist P 1.49 -1.9% 2.9% 5.31% 6.6%
BioBench B 2.17 -0.5% 0.2% 0.9% 0.9%
Dep. Sensitive x 1.35 6.7% 9.2% 11.8% 13.0%

Table 12: Continuation of the data from Table 11.

REFERENCES

[1] Albayraktaroglu, K., Jaleel, A., Wu, X., Franklin, M., Jacob, B., Tseng, C.-W., and Yeung, D., “BioBench: A Benchmark Suite of Bioinformatics Applications,” in Proceedings of the International Symposium on Performance Analysis of Systems and Software, (Austin, TX, USA), pp. 2–9, March 2005.

[2] González, A., González, J., and Valero, M., “Virtual-Physical Registers,” in Proceedings of the 4th International Symposium on High Performance Computer Architecture, (Las Vegas, NV, USA), pp. 195–205, January 1998.

[3] Austin, T., Larson, E., and Ernst, D., “SimpleScalar: An Infrastructure for Computer System Modeling,” IEEE Micro Magazine, pp. 59–67, February 2002.

[4] Austin, T. M., Breach, S. E., and Sohi, G. S., “Efficient Detection of All Pointer and Array Access Errors,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, (Orlando, FL, USA), pp. 290–301, June 1994.

[5] Bader, D. A., Li, Y., Li, T., and Sachdeva, V., “BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture of Bioinformatics Applications,” in Proceedings of the IEEE International Symposium on Workload Characterization, (Austin, TX, USA), pp. 163–173, October 2005.

[6] Bahar, R. I., Albera, G., and Manne, S., “Power and Performance Tradeoffs Using Various Caching Strategies,” in Proceedings of the International Symposium on Low Power Electronics and Design, (Monterey, CA, USA), pp. 64–69, August 1998.

[7] Baniasadi, A. and Moshovos, A., “Asymmetric-Frequency Clustering: A Power-Aware Back-end for High-Performance Processors,” in Proceedings of the International Symposium on Low Power Electronics and Design, (Monterey, CA, USA), pp. 255–258, August 2002.

[8] Baugh, L. and Zilles, C., “Decomposing the Load-Store Queue by Function for Power Reduction and Scalability,” in Proceedings of the IBM P=AC^2 Conference, (Yorktown Heights, NY, USA), October 2004.

[9] Bell, G. B. and Lipasti, M. H., “Deconstructing Commit,” in Proceedings of the International Symposium on Performance Analysis of Systems and Software, pp. 68–77, 2004.

[10] Brooks, D., Tiwari, V., and Martonosi, M., “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,” in Proceedings of the 27th International Symposium on Computer Architecture, (Vancouver, Canada), pp. 83–94, June 2000.

[11] Brown, M. D., Stark, J., and Patt, Y. N., “Select-Free Instruction Scheduling Logic,” in Proceedings of the 34th International Symposium on Microarchitecture, (Austin, TX, USA), pp. 204–213, December 2001.

[12] Buyuktosunoglu, A., El-Moursy, A., and Albonesi, D., “An Oldest-First Selection Logic Implementation for Non-Compacting Issue Queues,” in Proceedings of the 15th International ASIC Conference, (Rochester, NY, USA), pp. 31–35, September 2002.

[13] Cain, H. W. and Lipasti, M. H., “Memory Ordering: A Value-Based Approach,” in Proceedings of the 31st International Symposium on Computer Architecture, (München, Germany), pp. 90–101, June 2004.

[14] Calder, B., Reinmann, G., and Tullsen, D., “Selective Value Prediction,” in Proceedings of the 26th International Symposium on Computer Architecture, (Atlanta, GA, USA), pp. 64–74, June 1999.

[15] Castro, F., Pinuel, L., Chaver, D., Prieto, M., Huang, M., and Tirado, F., “DMDC: Delayed Memory Dependence Checking through Age-Based Filtering,” in Proceedings of the 39th International Symposium on Microarchitecture, (Orlando, FL), pp. 297–306, December 2006.

[16] Cazorla, F. J., Ramirez, A., Valero, M., and Fernández, E., “DCache Warn: an I-Fetch Policy to Increase SMT Efficiency,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium, (Santa Fe, NM), pp. 74–83, April 2004.

[17] Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction Using Store Sets,” in Proceedings of the 25th International Symposium on Computer Architecture, (Barcelona, Spain), pp. 142–153, June 1998.

[18] Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F., “Toward Kilo-Instruction Processors,” ACM Transactions on Architecture and Code Optimization, vol. 1, pp. 389–417, December 2004.

[19] Doweck, J., “Inside Intel Core Microarchitecture and Smart Memory Access,” white paper, Intel Corporation, 2006. http://download.intel.com/technology/architecture/sma.pdf.

[20] El-Moursy, A. and Albonesi, D., “Front-End Policies for Improved Issue Efficiency in SMT Processors,” in Proceedings of the 9th International Symposium on High Performance Computer Architecture, (Anaheim, CA, USA), pp. 185–196, February 2003.

[21] Eyerman, S. and Eeckhout, L., “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” in Proceedings of the 13th International Symposium on High Performance Computer Architecture, (Phoenix, AZ, USA), pp. 27–36, February 2007.

[22] Fields, B., Bodík, R., and Hill, M. D., “Slack: Maximizing Performance Under Technological Constraints,” in Proceedings of the 29th International Symposium on Computer Architecture, (Anchorage, AK, USA), pp. 47–58, May 2002.

[23] Fields, B., Rubin, S., and Bodík, R., “Focusing Processor Policies via Critical-Path Prediction,” in Proceedings of the 28th International Symposium on Computer Architecture, (Göteborg, Sweden), pp. 74–85, June 2001.

[24] Fisk, B. R. and Bahar, R. I., “The Non-Critical Buffer: Using Load Latency Tolerance to Improve Data Cache Efficiency,” in Proceedings of the International Conference on Computer Design, (Austin, TX, USA), pp. 538–545, October 1999.

[25] Fritts, J. E., Steiling, F. W., and Tucek, J. A., “MediaBench II Video: Expediting the Next Generation of Video Systems Research,” Embedded Processors for Multimedia and Communications II, Proceedings of the SPIE, vol. 5683, pp. 79–93, March 2005.

[26] Gandhi, A., Akkary, H., Rajwar, R., Srinivasan, S. T., and Lai, K., “Scalable Load and Store Processing in Latency Tolerant Processors,” in Proceedings of the 32nd International Symposium on Computer Architecture, (Madison, WI, USA), pp. 446–457, June 2005.

[27] Gochman, S., Ronen, R., Anati, I., Berkovitz, A., Kurts, T., Naveh, A., Saeed, A., Sperber, Z., and Valentine, R. C., “The Intel Pentium M Processor: Microarchitecture and Performance,” Intel Technology Journal, vol. 7, May 2003.

[28] Goshima, M., Nishino, K., Nakashima, Y., Mori, S.-i., Kitamura, T., and Tomita, S., “A High-Speed Dynamic Instruction Scheduling Scheme for Superscalar Processors,” in Proceedings of the 34th International Symposium on Microarchitecture, (Austin, TX, USA), pp. 225–236, December 2001.

[29] Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B., “MiBench: A Free, Commercially Representative Embedded Benchmark Suite,” in Proceedings of the 4th Workshop on Workload Characterization, (Austin, TX, USA), pp. 83–94, December 2001.

[30] Hamerly, G., Perelman, E., Lau, J., and Calder, B., “SimPoint 3.0: Faster and More Flexible Program Analysis,” in Proceedings of the Workshop on Modeling, Benchmarking and Simulation, (Madison, WI, USA), June 2005.

[31] Hill, M. D. and Marty, M. R., “Amdahl’s Law in the Multicore Era,” IEEE Computer, 2008.

[32] Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyler, A., and Roussel, P., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Q1 2001.

[33] Intel, “First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem),” white paper, Intel Corporation, 2008. http://www.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf.

[34] Jacobson, E., Rotenberg, E., and Smith, J. E., “Assigning Confidence to Conditional Branch Predictions,” in Proceedings of the 29th International Symposium on Microarchitecture, (Paris, France), pp. 142–152, December 1996.

[35] Kessler, R. E., “The Alpha 21264 Microprocessor,” IEEE Micro Magazine, vol. 19, pp. 24–36, March–April 1999.

[36] Luo, K., Gummaraju, J., and Franklin, M., “Balancing Throughput and Fairness in SMT Processors,” in Proceedings of the 15th International Symposium on Performance Analysis of Systems and Software, (Tucson, AZ, USA), pp. 9–16, November 2001.

[37] Luo, K., Franklin, M., Mukherjee, S. S., and Seznec, A., “Boosting SMT Performance by Speculation Control,” in Proceedings of the 15th International Parallel and Distributed Processing Symposium, pp. 9–16, 2001.

[38] Larson, E., Chatterjee, S., and Austin, T., “MASE: A Novel Infrastructure for Detailed Microarchitectural Modeling,” in Proceedings of the 2001 International Symposium on Performance Analysis of Systems and Software, (Tucson, AZ, USA), pp. 1–9, November 2001.

[39] Lee, C., Potkonjak, M., and Mangione-Smith, W. H., “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems,” in Proceedings of the 30th International Symposium on Microarchitecture, (Research Triangle Park, NC, USA), pp. 330–335, December 1997.

[40] Lepak, K. M. and Lipasti, M. H., “Silent Stores for Free,” in Proceedings of the 33rd International Symposium on Microarchitecture, (Monterey, CA, USA), pp. 22–31, December 2000.

[41] Limousin, C., Sebot, J., Vartanian, A., and Drach-Temam, N., “Improving 3D Geometry Transformations on a Simultaneous Multithreaded SIMD Processor,” in Proceedings of the International Conference on Supercomputing, (Sorrento, Italy), pp. 236–245, June 2001.

[42] Lipasti, M. H., “Value Prediction: Are(n’t) We Done Yet?,” in Proceedings of the 2nd Value Prediction Workshop, (Boston, MA, USA), October 2004.

[43] Loh, G. H., Sami, R., and Friendly, D. H., “Memory Bypassing: Not Worth the Effort,” in Proceedings of the 1st Workshop on Duplicating, Deconstructing, and Debunking, (Anchorage, AK, USA), pp. 71–80, May 2002.

[44] Marculescu, D., “Application Adaptive Energy Efficient Clustered Architectures,” in Proceedings of the International Symposium on Low Power Electronics and Design, (Newport Beach, CA, USA), pp. 338–343, August 2004.

[45] Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A., Miller, J. A., and Upton, M., “Hyper-Threading Technology and Microarchitecture,” Intel Technology Journal, vol. 6, February 2002.

[46] Martinez, J. F., Renau, J., Huang, M. C., Prvulovic, M., and Torrellas, J., “Cherry: Checkpointed Early Resource Recycling in Out-of-Order Microprocessors,” in Proceedings of the 35th International Symposium on Microarchitecture, (Istanbul, Turkey), pp. 27–36, November 2002.

[47] McFarling, S., “Cache Replacement with Dynamic Exclusion,” in Proceedings of the 18th International Symposium on Computer Architecture, (Toronto, Canada), pp. 191–200, May 1991.

[48] McFarling, S., “Combining Branch Predictors,” TN 36, Compaq Computer Corpo- ration Western Research Laboratory, June 1993.

[49] McFarling, S., “Branch Predictor with Serially Connected Predictor Stages for Improving Branch Prediction Accuracy.” USPTO Patent 6,374,349, April 2002.

[50] Moshovos, A., Memory Dependence Prediction. PhD thesis, University of Wisconsin, 1998.

[51] Moshovos, A., Breach, S. E., Vijaykumar, T. N., and Sohi, G. S., “Dynamic Speculation and Synchronization of Data Dependences,” in Proceedings of the 24th International Symposium on Computer Architecture, (Boulder, CO, USA), pp. 181–193, June 1997.

[52] Moshovos, A. and Sohi, G. S., “Streamlining Inter-operation Memory Communication via Data Dependence Prediction,” in Proceedings of the 30th International Symposium on Microarchitecture, (Research Triangle Park, NC, USA), pp. 235–245, December 1997.

[53] Moshovos, A. and Sohi, G. S., “Speculative Memory Cloaking and Bypassing,” International Journal of Parallel Programming, October 1999.

[54] Naffziger, S., “High-Performance Processors in a Power-Limited World,” Symposium on VLSI Circuits Digest of Technical Papers, Q3 2006.

[55] Palacharla, S., Complexity-Effective Superscalar Processors. PhD thesis, University of Wisconsin, 1998.

[56] Perelman, E., Hamerly, G., and Calder, B., “Picking Statistically Valid and Early Simulation Points,” in Proceedings of the 2003 International Conference on Parallel Architectures and Compilation Techniques, (New Orleans, LA, USA), pp. 244–255, September 2003.

[57] Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., Jr., and Emer, J., “Adaptive Insertion Policies for High-Performance Caching,” in Proceedings of the 34th International Symposium on Computer Architecture, (San Diego, CA, USA), June 2007.

[58] Roth, A., “A High Bandwidth Low Latency Load/Store Unit for Single- and Multi-Threaded Processors,” MS-CIS 04-09, University of Pennsylvania, June 2004.

[59] Roth, A., “Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization,” in Proceedings of the 32nd International Symposium on Computer Architecture, (Madison, WI, USA), pp. 458–468, June 2005.

[60] Salverda, P. and Zilles, C. B., “A Criticality Analysis of Clustering in Superscalar Processors,” in Proceedings of the 38th International Symposium on Microarchitecture, (Barcelona, Spain), pp. 55–66, November 2005.

[61] Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S. W., and Moore, C. R., “Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture,” in Proceedings of the 30th International Symposium on Computer Architecture, (San Diego, CA, USA), pp. 422–433, May 2003.

[62] Sassone, P. G., Rupley, J., Brekelbaum, E., Loh, G. H., and Black, B., “Matrix Scheduler Reloaded,” in Proceedings of the 34th International Symposium on Computer Architecture, (San Diego, CA, USA), pp. 335–346, June 2007.

[63] Sato, T. and Arita, I., “Revisiting Direct Tag Search Algorithm on Superscalar Processors,” in Proceedings of the Workshop on Complexity-Effective Design, (Göteborg, Sweden), June 2001.

[64] Seng, J. S., Tune, E., and Tullsen, D. M., “Reducing Power with Dynamic Critical Path Information,” in Proceedings of the 34th International Symposium on Microarchitecture, (Austin, TX, USA), pp. 114–123, December 2001.

[65] Sethumadhavan, S., Desikan, R., Burger, D., Moore, C. R., and Keckler, S. W., “Scalable Hardware Memory Disambiguation for High ILP Processors,” in Proceedings of the 36th International Symposium on Microarchitecture, (San Diego, CA, USA), pp. 118–127, May 2003.

[66] Seznec, A., Felix, S., Krishnan, V., and Sazeides, Y., “Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor,” in Proceedings of the 29th International Symposium on Computer Architecture, (Anchorage, AK, USA), May 2002.

[67] Sha, T., Martin, M. M. K., and Roth, A., “Scalable Store-Load Forwarding via Store Queue Index Prediction,” in Proceedings of the 38th International Symposium on Microarchitecture, (Barcelona, Spain), pp. 159–170, November 2005.

[68] Sha, T., Martin, M. M. K., and Roth, A., “NoSQ: Store-Load Forwarding without a Store Queue,” in Proceedings of the 39th International Symposium on Microarchitecture, (Orlando, FL), pp. 285–296, December 2006.

[69] Sharkey, J., “M-Sim: A Flexible, Multithreaded Architectural Simulation Environment,” CS-TR 05-DP01, State University of New York at Binghamton, Department of Computer Science, 2005.

[70] Sharkey, J., Balkan, D., and Ponomarev, D., “Adaptive Reorder Buffers for SMT Processors,” in Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pp. 244–253, 2006.

[71] Sharkey, J. and Ponomarev, D., “Efficient Instruction Schedulers for SMT Processors,” in Proceedings of the 12th International Symposium on High Performance Computer Architecture, (Austin, TX, USA), pp. 293–303, February 2006.

[72] Shen, J. P. and Lipasti, M. H., Modern Processor Design: Fundamentals of Superscalar Processors. McGraw Hill, 2005.

[73] Srinivasan, S. T., Ju, R. D.-c., Lebeck, A. R., and Wilkerson, C., “Locality vs. Criticality,” in Proceedings of the 28th International Symposium on Computer Architecture, (Göteborg, Sweden), pp. 47–58, June 2001.

[74] Srinivasan, S. T. and Lebeck, A. R., “Load Latency Tolerance in Dynamically Scheduled Processors,” in Proceedings of the 31st International Symposium on Microarchitecture, (Dallas, TX, USA), pp. 148–159, November 1998.

[75] Srinivasan, S. T., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M., “Continual Flow Pipelines,” in Proceedings of the 11th Symposium on Architectural Support for Programming Languages and Operating Systems, (Boston, MA, USA), pp. 107–119, October 2004.

[76] Stone, S. S., Woley, K. M., and Frank, M. I., “Address-Indexed Memory Disambiguation and Store-to-Load Forwarding,” in Proceedings of the 38th International Symposium on Microarchitecture, (Barcelona, Spain), pp. 171–182, November 2005.

[77] Subramaniam, S., Bracy, A., Wang, H., and Loh, G. H., “Criticality-Based Optimizations for Efficient Load Processing,” in Proceedings of the 15th International Symposium on High Performance Computer Architecture, (Raleigh, NC, USA), pp. 244–253, February 2009.

[78] Subramaniam, S. and Loh, G. H., “Fire-and-Forget: Load/Store Scheduling with No Store Queue At All,” in Proceedings of the 39th International Symposium on Microarchitecture, (Orlando, FL), pp. 273–284, December 2006.

[79] Subramaniam, S. and Loh, G. H., “Store Vectors for Scalable Memory Dependence Prediction and Scheduling,” in Proceedings of the 12th International Symposium on High Performance Computer Architecture, (Austin, TX, USA), pp. 64–75, February 2006.

[80] Subramaniam, S., Prvulovic, M., and Loh, G. H., “PEEP: Exploiting Predictability of Memory Dependences in SMT Processors,” in Proceedings of the 14th International Symposium on High Performance Computer Architecture, (Salt Lake City, UT, USA), pp. 64–75, February 2008.

[81] Tarjan, D., Thoziyoor, S., and Jouppi, N. P., “CACTI 4.0,” Tech. Rep. HPL-2006-86, HP Laboratories Palo Alto, June 2006.

[82] Tullsen, D. M. and Brown, J., “Handling Long-Latency Loads in a Simultaneous Multithreaded Processor,” in Proceedings of the 34th International Symposium on Microarchitecture, (Austin, TX, USA), pp. 318–327, December 2001.

[83] Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., and Stamm, R. L., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” in Proceedings of the 23rd International Symposium on Computer Architecture, (Philadelphia, PA, USA), pp. 191–202, May 1996.

[84] Tune, E., Liang, D., Tullsen, D. M., and Calder, B., “Dynamic Prediction of Critical Path Instructions,” in Proceedings of the 7th International Symposium on High Performance Computer Architecture, (Monterrey, Mexico), pp. 185–196, January 2001.

[85] Tune, E. S., Tullsen, D. M., and Calder, B., “Quantifying Instruction Criticality,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 104–113, September 2002.

[86] Tyson, G. S. and Austin, T. M., “Improving the Accuracy and Performance of Memory Communication Through Renaming,” in Proceedings of the 30th International Symposium on Microarchitecture, (Research Triangle Park, NC, USA), pp. 218–227, December 1997.

[87] Yoaz, A., Erez, M., Ronen, R., and Jourdan, S., “Speculation Techniques for Improving Load Related Instruction Scheduling,” in Proceedings of the 26th International Symposium on Computer Architecture, (Atlanta, GA, USA), pp. 42–53, June 1999.

[88] Zhang, Y., Yang, J., and Gupta, R., “Frequent Value Locality and Value-Centric Data Cache Design,” in Proceedings of the 9th Symposium on Architectural Support for Programming Languages and Operating Systems, (Cambridge, MA, USA), pp. 150–159, November 2000.

[89] Zilles, C. B. and Sohi, G. S., “Understanding the Backward Slices of Performance Degrading Instructions,” in Proceedings of the 27th International Symposium on Computer Architecture, (Vancouver, Canada), pp. 172–181, June 2000.

VITA

Samantika Subramaniam is a PhD candidate in the School of Computer Science, College of Computing, at the Georgia Institute of Technology and a member of the Superscalar Technology INnovation Group (STING). Her research focuses on the impact of memory instruction processing on the efficiency of current and future processors. Professor Gabriel H. Loh is her thesis adviser. Samantika is the recipient of the Intel Foundation Fellowship for 2008-2009 and has been awarded the Outstanding Graduate Research Assistant Award for 2008 by the College of Computing at the Georgia Institute of Technology. She was a Graduate Research Intern at Intel Corporation from June 2007 to December 2007.

Conference Publications:

• Samantika Subramaniam, Anne Bracy, Hong Wang, Gabriel H. Loh, Criticality-Based Optimizations for Efficient Load Processing, to appear in the Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), February 2009, NC, USA.

• Samantika Subramaniam, Milos Prvulovic, Gabriel H. Loh, PEEP: Exploiting Predictability of Memory Dependences in SMT Processors, in the Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), February 2008, UT, USA.

• Samantika Subramaniam, Gabriel H. Loh, Fire-and-Forget: Load/Store Scheduling with No Store Queue at All, in the Proceedings of the International Symposium on Microarchitecture (MICRO), December 2006, FL, USA.

• Samantika Subramaniam, Gabriel H. Loh, Store Vectors for Scalable Memory Dependence Prediction and Scheduling, in the Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), February 2006, TX, USA.

Journals:

• Samantika Subramaniam and Gabriel H. Loh, Design and Optimization of the Store Vectors Memory Dependence Predictor, to appear in the ACM Transactions on Architecture and Code Optimization (Accepted with revisions).

Workshops:

• Rahul V. Garde, Samantika Subramaniam, Gabriel H. Loh, Deconstructing the Inefficacy of Global Cache Replacement Policies, in the 7th Workshop on Duplicating, Deconstructing, and Debunking (WDDD) 2008 (held in conjunction with ISCA 2008), Beijing, China.
