IMPROVING MEMORY HIERARCHY PERFORMANCE WITH DRAM CACHE, RUNAHEAD CACHE MISSES, AND INTELLIGENT ROW-BUFFER PREFETCHES

By

XI TAO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Xi Tao

To my parents

ACKNOWLEDGMENTS

It has been such a long journey since I first arrived in Gainesville. I never dreamt of studying at a place so distant from my hometown, yet I have spent five and a half wonderful years here.

Obtaining a Ph.D. degree is never easy. You constantly feel stressed, at a loss, and sometimes wonder how to continue. Throughout those years, I have been truly grateful for all the help and guidance from my advisor, Dr. Jih-Kwon Peir, who has always been patient and kind. His brilliant suggestions helped me overcome many obstacles, and he spent numerous hours reviewing and revising my papers. Without his help, I cannot imagine sitting here writing this dissertation now.

I also want to thank my Ph.D. committee members: Dr. Shigang Chen, Dr. Prabhat Mishra, Dr. Beverly Sanders, and Dr. Tan Wong. Thank you for your advice and support during my study at the University of Florida. I would also like to thank my labmate Qi Zeng, who provided great suggestions and advice on our collaborative work.

Lastly, I want to give my greatest thanks to my friends here in Gainesville. You have truly made my life colorful. I also want to thank my parents, who have always been there encouraging me and believing in me. I could not have achieved all this without your support!


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 7

LIST OF FIGURES ...... 8

ABSTRACT ...... 10

CHAPTER

1 INTRODUCTION ...... 12

1.1 DRAM Caches ...... 17
1.2 Runahead Cache Misses ...... 18
1.3 Hashing Fundamentals and Bloom Filter ...... 19
1.4 Intelligent Row Buffer ...... 21

2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION ...... 23

2.1 Evaluation Methodology ...... 23
2.2 Workload Selection ...... 25

3 CACHE LOOKASIDE TABLE ...... 26

3.1 Background and Related Work ...... 26
3.2 CLT Overview ...... 29
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT ...... 29
3.2.2 CLT Coverage ...... 31
3.2.3 Comparison of DRAM Cache Methods ...... 32
3.3 CLT Design ...... 37
3.4 Performance Evaluation ...... 41
3.4.1 Difference between Related Proposals ...... 41
3.4.2 Performance Results ...... 43
3.4.3 Sensitivity Study and Future Projection ...... 47
3.4.4 Summary ...... 49

4 RUNAHEAD CACHE MISSES USING BLOOM FILTER ...... 50

4.1 Background and Related Work ...... 50
4.2 Memory Hierarchy and Timing Analysis ...... 51
4.3 Performance Results ...... 57
4.3.1 IPC Comparison ...... 58
4.3.2 Sensitivity Study ...... 60
4.4 Summary ...... 62


5 GUIDED MULTIPLE HASHING ...... 64

5.1 Background ...... 64
5.2 Hashing ...... 66
5.3 Proposed Algorithm ...... 67
5.3.1 The Setup Algorithm ...... 67
5.3.2 The Lookup Algorithm ...... 70
5.3.3 The Update Algorithm ...... 71
5.4 Performance Results ...... 72
5.5 Summary ...... 82

6 INTELLIGENT ROW BUFFER PREFETCHES ...... 83

6.1 Background and Motivation ...... 83
6.2 Hot Row Buffer Design and Results ...... 86
6.3 Performance Evaluation ...... 93
6.4 Conclusion ...... 95

7 SUMMARY ...... 97

LIST OF REFERENCES ...... 100

BIOGRAPHICAL SKETCH ...... 106


LIST OF TABLES

Table        page
2-1 Architecture parameters of processor and memories ...... 24

2-2 MPKI and footprint of the selected benchmarks ...... 25

3-1 Comparison of different DRAM cache designs ...... 33

3-2 Difference between three designs ...... 42

3-3 Comparison of L4 MPKR, L4 occupancy and predictor accuracy ...... 46

4-1 False-positive rates of 12 benchmarks ...... 59

4-2 Future Conventional DRAM parameters ...... 62

5-1 Notation and Definition ...... 68

5-2 Routing table updates for enhanced 4-ghash ...... 80

6-1 Hit ratio for hybrid scheme of 10 workloads using 64 entries ...... 89

6-2 Prefetch usage for 10 workloads using a simple stream prefetcher ...... 95

6-3 Sensitivity study on prefetch granularity ...... 95


LIST OF FIGURES

Figure       page
1-1 The structure of a memory hierarchy ...... 13

1-2 Memory hierarchy organization with 4-level caches ...... 14

1-3 DRAM internal organization ...... 15

3-1 Memory hierarchy with stacked DRAM cache ...... 30

3-2 Reuse distance curves normalized to the percentage of the maximum distance ...... 32

3-3 Coefficient of variation (CV) of hashing 64K cache-set using different indices ...... 35

3-4 DRAM cache MPKI using sector indexing ...... 36

3-5 CLT design schematics ...... 38

3-6 CLT operations in handling memory requests ...... 39

3-7 CLT speedup with respect to Alloy, TagTables_64, and TagTables_16 ...... 45

3-8 Memory access latency (CPU cycles)...... 45

3-9 IPC change for different CLT coverage...... 48

3-10 Execution cycle change for different sector size in CLT design ...... 49

4-1 Memory latency with / without BFL3 ...... 52

4-2 Cache indexing and hashing for BF ...... 55

4-3 False-positive rates for 6 hashing mechanisms ...... 56

4-4 False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2 ...... 56

4-5 IPC comparisons with/without BF ...... 59

4-6 Average IPC for m:n ratios and hashing functions ...... 61

4-7 Average IPC for different L4 sizes ...... 61

4-8 Average IPC over different DRAM latency ...... 62

5-1 Distribution of keys in buckets of four hashing algorithms...... 66

5-2 A simple d-ghash table with 5 keys, 8 buckets and 2 hash functions...... 68


5-3 Bucket loads for the five hashing schemes...... 73

5-4 Number of bucket accesses per lookup for d-ghash...... 74

5-5 Average number of keys per lookup based on memory usage ratio...... 75

5-6 The average number of non-empty buckets for looking up a key...... 77

5-7 Sensitivity of the number of bucket accesses per lookup...... 78

5-8 Changes in the number of bucket accesses per lookup and rehash percentage...... 79

5-9 Number of bucket accesses per lookup for experiments with five routing tables...... 80

5-10 Experiment with the update trace using enhanced 4-ghash...... 80

6-1 Hot Row pattern of 10 workloads ...... 85

6-2 Hot Row Identification and Update ...... 88

6-3 Results of proposed hybrid scheme ...... 89

6-4 Block column difference within a row for 10 workloads...... 91

6-5 IPC/Row buffer hit Ratio speedup of 10 workloads ...... 93


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

IMPROVING MEMORY HIERARCHY PERFORMANCE WITH DRAM CACHE, RUNAHEAD CACHE MISSES, AND INTELLIGENT ROW-BUFFER PREFETCHES

By

Xi Tao

December 2016

Chair: Jih-Kwon Peir
Major: Computer Engineering

Large off-die stacked DRAM caches have been proposed to provide higher effective bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache with a conventional block size (64 bytes) requires a large tag array that is impractical to fit on-die. We investigate a novel design called the Cache Lookaside Table (CLT) to reduce the average access latency and to lessen off-die tag array accesses. The proposed CLT exploits memory reference locality and provides a fast alternative tag path that captures most of the DRAM cache requests.

To hide long memory latency and to alleviate the memory bandwidth requirement, a fourth-level cache (L4) is introduced in modern high-performance computing systems. However, increasing the number of cache levels worsens the cache miss penalty, since memory requests go through the levels of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can run ahead to access lower levels of cache and memory to shorten the miss penalty.

Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental study to find a way to balance the hashing buckets while maintaining a low false-positive rate for the Bloom filter. To broaden the applications, our study is based on the routing and packet-forwarding function at the core of the IP network-layer protocols. We propose a guided multi-hashing approach that achieves near-perfect load balance among hash buckets while limiting the number of buckets to be probed for each key (address) lookup, where each bucket holds one or a few routing entries.

A key challenge in effectively improving system performance lies in maximizing both row-buffer hits and bank-level parallelism while simultaneously providing fairness among different requests. We observed that accesses to each DRAM bank are not equally distributed among different rows for most of the workloads we study. We propose a simple scheme to capture the hot-row pattern and prefetch data from these hot rows. Results demonstrate the effectiveness of the proposed scheme.


CHAPTER 1 INTRODUCTION

Memory hierarchy plays a critical role in designing high-performance processors. It has become increasingly difficult to advance processor performance further due to the memory wall problem. Despite aggressive out-of-order, speculative execution, processors stall waiting for data from memory. Lately, manufacturers have been putting a growing number of cores on a chip to satisfy the increased demand of larger workloads such as data mining and analytics. As the number of cores grows, the pressure on the memory subsystem in terms of capacity and bandwidth increases as well.

Memory hierarchy design takes advantage of memory reference locality and of trade-offs in the capacity and access speed of memory technologies to hide memory latency and to alleviate the memory bandwidth requirement. Between the CPU and main memory, there are multiple levels of cache memory. Close to the CPU are small, fast caches with higher bandwidth that temporarily store the most frequently used data. With increasing levels, the cache capacity becomes larger, but the access speed becomes slower and the bandwidth lower. Based on reference locality, the most recently referenced data can be accessed in the highest level of cache. The large capacity of the lower levels captures recently referenced data that cannot fit into the highest level. Figure 1-1 depicts this memory hierarchy organization.

Conventional cache access goes through a tag path to determine a cache hit or a miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency such that matching the address tag in the tag array determines the location of the block in the data array. These two paths may overlap to permit the data array access to start before the hit position is determined, shortening the cache access time.


Figure 1-1. The structure of a memory hierarchy: as the distance from the processor increases, so does the size and access time, but with decreasing bandwidth.

Modern high-performance multi-core systems generally adopt a 3-level on-die cache architecture, referred to as the L1, L2, and L3 caches, which are placed on the processor die using SRAM technology. The L1 and L2 caches are usually private, meaning each core has its own L1 and L2 caches. The L3 cache is normally shared by all the cores and serves as a connection point to the main memory, which is located off the processor chip with long access latency. Intel Haswell [1], the 4th-generation Core, adopts a 4th-level cache built on embedded DRAM technology to hide main memory latency and to deliver substantial performance improvements for media, graphics, and other high-performance computing applications. A general organization of a multicore system with 4 levels of caches is illustrated in Figure 1-2.


Figure 1-2. Memory hierarchy organization with 4-level caches.

To quantitatively measure the memory performance of the cache hierarchy, we use the average memory access time (AMAT), which describes the average time it takes for the entire hierarchy to return data. For a three-level cache architecture with L1, L2, and L3 caches, the AMAT is calculated as follows:

AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
     = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × (HitTime_L3 + MissRate_L3 × MissPenalty_L3))

where HitTime is the access time of the cache at a particular level, MissRate is the percentage of misses at that cache level, and MissPenalty is the time to fetch the block from the next level of the memory hierarchy. Fetching data from the next level may also encounter cache hits or misses, so the miss penalty of the current level is equivalent to the AMAT starting from the next level. To achieve high performance, we want to design a cache hierarchy with a fast hit time, a small miss ratio, and a small miss penalty.
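As an illustration, the following minimal C++ sketch computes the AMAT recursively from per-level hit times and miss rates; the numeric values used in the example are hypothetical and are not the simulated configuration of Table 2-1.

```cpp
#include <cstdio>
#include <vector>

struct Level { double hit_time; double miss_rate; };  // cycles, fraction of accesses that miss

// AMAT of level i = hit_time_i + miss_rate_i * (AMAT of level i+1);
// the final "level" is main memory, modeled here as a flat latency.
double amat(const std::vector<Level>& levels, double memory_latency, std::size_t i = 0) {
    if (i == levels.size()) return memory_latency;
    return levels[i].hit_time + levels[i].miss_rate * amat(levels, memory_latency, i + 1);
}

int main() {
    // Hypothetical three-level hierarchy: L1, L2, L3 (hit time, miss rate).
    std::vector<Level> levels = {{4, 0.10}, {11, 0.40}, {24, 0.30}};
    std::printf("AMAT = %.2f cycles\n", amat(levels, /*memory_latency=*/200.0));
    return 0;
}
```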


Beyond the cache hierarchy, main memory is where the application instructions and data are stored. During program execution, the requested instructions and data are moved from disk to main memory on demand, initiating an I/O activity called a page fault. Main memory is built using dynamic random access memory (DRAM) technology. When the requested instructions and data are not located in the caches, they are accessed from main memory with substantially longer latency. Multiple levels of caches hold the working set of the recently referenced instructions and data in the hope of limiting the need to access them from main memory.

Figure 1-3. DRAM internal organization.

DRAM-based main memory is a multi-level hierarchy of structures. At the highest level, each processor die is connected to one or more DRAM channels. Each channel has a dedicated command, address and data bus. One or more memory modules can be connected to each DRAM channel. Each memory module contains a number of DRAM chips. As the data output width of each DRAM chip is low (typically 8 bits for commodity DRAM), multiple chips are grouped together to form a rank. In other words, a rank is a collection of DRAM chips that together feed the standard 64-bit data bus.

Internally, each chip consists of multiple banks. Each bank consists of many rows of DRAM cells and a row buffer that caches the last accessed row from the bank. Each DRAM cell in the row is identified by the corresponding column address. Reading or writing data to DRAM requires that the entire row first be read into the row buffer. Reads and writes then operate directly on the row buffer. After the operation, the row is closed and the data in the row buffer is written back into the DRAM array. Figure 1-3 shows this topology.

When the memory controller receives an access to a 64-byte cache line, it first decodes the address into the channel, rank, bank, row, and column number. As the data of each 64-byte cache line is split across different chips within the rank, the memory controller maintains a mapping scheme to determine which parts of the cache line are mapped to which chips. Upon receiving the command, each chip accesses the corresponding column of data from its row buffer and transfers it on the data bus. Once the data is transferred, the memory controller assembles the required cache line and sends it back to the processor.
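As a concrete illustration, the sketch below slices a physical address into channel, rank, bank, row, and column fields. The bit widths and the field ordering are assumptions for this example only; real memory controllers use a variety of interleaving and XOR-based mapping schemes.

```cpp
#include <cstdint>
#include <cstdio>

struct DramAddress { uint32_t channel, rank, bank, row, column; };

// Minimal sketch: fields are sliced from low-order to high-order bits after
// dropping the 6-bit offset of a 64-byte cache line. The widths below are
// illustrative assumptions, not the configuration in Table 2-1.
DramAddress decode(uint64_t paddr) {
    uint64_t a = paddr >> 6;                  // drop 64-byte line offset
    DramAddress d{};
    d.channel = a & 0x1;        a >>= 1;      // assume 2 channels
    d.column  = a & 0x1F;       a >>= 5;      // 32 line-sized columns per 2KB row
    d.bank    = a & 0x7;        a >>= 3;      // assume 8 banks per rank
    d.rank    = 0;                            // single rank per channel in this sketch
    d.row     = static_cast<uint32_t>(a);     // remaining bits select the row
    return d;
}

int main() {
    DramAddress d = decode(0x12345678ULL);
    std::printf("ch=%u rank=%u bank=%u row=%u col=%u\n",
                d.channel, d.rank, d.bank, d.row, d.column);
}
```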

All banks on a channel share a common set of command and data buses. Operations on multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from another bank's row buffer) so long as the commands are properly scheduled and all other DRAM timing constraints are obeyed. A memory controller can improve memory system throughput by scheduling requests so that they can be processed in parallel among banks. Meanwhile, DRAM can operate under different page policies. Leaving a row buffer open after every access is called the open-page policy; closing the row buffer after every access is called the close-page policy. Accessing data already loaded in the row buffer, also called a row-buffer hit, incurs a shorter latency than when the corresponding row must first be "opened" from the DRAM array. Therefore, the open-page policy enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A row-buffer conflict happens when a request targets a different row than the currently opened one, which incurs substantial delay. Close-page policies, on the other hand, can serve row-buffer conflict requests faster.
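The latency difference can be captured with a simple per-bank row-buffer model, sketched below under an open-page policy; the cycle counts are placeholders rather than the DDR timing parameters used in our simulations.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal open-page row-buffer model: each bank remembers its open row, and a
// request is classified as a row-buffer hit, a conflict (precharge + activate),
// or a cold open.
class OpenPageBanks {
public:
    int access(int bank, uint32_t row) {                     // returns latency in cycles
        auto it = open_row_.find(bank);
        if (it != open_row_.end() && it->second == row)
            return kColumnAccess;                             // row-buffer hit
        int latency = (it == open_row_.end())
                          ? kActivate + kColumnAccess                 // bank was idle
                          : kPrecharge + kActivate + kColumnAccess;   // row-buffer conflict
        open_row_[bank] = row;                                // row stays open (open-page)
        return latency;
    }
private:
    static constexpr int kColumnAccess = 9;   // assumed tCAS-like delay
    static constexpr int kActivate     = 9;   // assumed tRCD-like delay
    static constexpr int kPrecharge    = 9;   // assumed tRP-like delay
    std::unordered_map<int, uint32_t> open_row_;
};

int main() {
    OpenPageBanks banks;
    int cold     = banks.access(0, 100);   // activate + column access
    int hit      = banks.access(0, 100);   // row-buffer hit
    int conflict = banks.access(0, 200);   // precharge + activate + column access
    return (hit < cold && cold < conflict) ? 0 : 1;
}
```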

Our proposed research focuses on various techniques, from caches to memory, for improving the performance of the memory hierarchy on modern multicore systems. The outline of the research topics is given in the following subsections. The performance evaluation methodology and workload selection are given in Chapter 2, followed by detailed descriptions of each research topic in Chapters 3, 4, 5, and 6. Finally, a summary of the proposed research is given in Chapter 7.

1.1 DRAM Caches

Cache capacity is limited by the number of transistors on the processor die. With newer packaging technologies such as silicon interposers (2.5D) [2] or 3D integrated circuit stacking [3], the processor and DRAM can be placed in close proximity, giving processors high-bandwidth, low-latency access to dense memory. Unfortunately, stacked DRAM capacity is currently still insufficient to be used as the system main memory [4] [5]. There have been two approaches to integrating the stacked DRAM: either as the last-level cache [6] [7] [4] [8] [9] [10], or as a part of the main memory [11] [12] [13] [14]. Using stacked DRAM as a part of memory requires extra address mapping and data block swapping between the fast and slow DRAMs [15] [16] [17] [18] [19]. This approach utilizes the entire DRAM capacity, which is essential when the capacities of the stacked and off-chip DRAM are close. However, with tens of GBs of off-chip DRAM in today's personal systems, it is more viable to use the order-of-magnitude smaller stacked DRAM as the last-level cache (L4) to provide fast memory access and to alleviate the off-chip memory bandwidth requirement.

Our first research topic is to investigate fundamental issues and to assess the performance advantage of a large stacked DRAM cache included in the memory hierarchy as the last-level cache. Large off-die stacked DRAM caches have been proposed to provide higher effective bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache with a conventional block size (e.g., 64 bytes) requires a large tag array that is impractical to fit on-die. Placing the large directory in off-die memory prolongs the latency, since a tag access is necessary before the data can be accessed. This additional trip also generates extra off-die traffic.

We investigate a novel design called Cache Lookaside Table (CLT) to reduce the average access latency and to lessen off-die tag array accesses. The basic approach is to cache a small amount of recently referenced tags on-die. An off-die tag access is avoided when a requested block’s tag hits a cached tag. To save on-die space, cached tags are recorded in a large sector for sharing tags with multiple blocks. However, due to the loss of one-to-one physical mapping of the cached tags and the data array, a way pointer is added for each block to indicate its way location. The proposed CLT exploits memory reference locality and provides a fast alternative tag path to capture most of the DRAM cache requests.

Experimental results show that, in comparison with other proposed DRAM caching mechanisms, a small on-die CLT achieves average performance improvements in the range of 4-15%.

1.2 Runahead Cache Misses

To hide long memory latency and to alleviate the memory bandwidth requirement, a fourth-level cache (L4) is introduced in modern high-performance computing systems, as illustrated in Figure 1-2. However, increasing the number of cache levels worsens the cache miss penalty, since memory requests go through the levels of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can run ahead to access lower levels of cache and memory to shorten the miss penalty. One inherent difficulty in using a BF to predict cache misses is that cache contents are dynamically updated through insertions and deletions. We propose a new BF hashing scheme that extends the cache index of the target set to access the BF array. Since the BF index is a superset of the cache index, all blocks hashed to the same BF location are allocated in the same cache set, which simplifies updates to the BF array. When a block is evicted from the cache, the corresponding BF bit is reset only when no other block hashed to this location exists in the cache set.
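A minimal sketch of this counting-free BF maintenance rule is shown below: the BF index extends the cache set index with extra address bits, so the set can be searched on eviction to decide whether the bit may be cleared. The class name, sizes, and the shadow-tag bookkeeping are illustrative assumptions, not the exact design evaluated in Chapter 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Every block mapping to a BF bit lives in one known cache set (power-of-two
// sizes assumed), so that set can be searched on eviction.
class L3MissFilter {
public:
    L3MissFilter(std::size_t sets, int extra_bits)
        : sets_(sets), extra_bits_(extra_bits),
          bf_(sets << extra_bits, false), shadow_(sets) {}

    std::size_t bf_index(uint64_t block_addr) const {       // superset of the set index
        return static_cast<std::size_t>(block_addr) & ((sets_ << extra_bits_) - 1);
    }
    bool predict_miss(uint64_t block_addr) const {           // a 0 bit guarantees a miss
        return !bf_[bf_index(block_addr)];
    }
    void on_fill(uint64_t block_addr) {
        bf_[bf_index(block_addr)] = true;
        shadow_[block_addr & (sets_ - 1)].push_back(block_addr);
    }
    void on_evict(uint64_t block_addr) {
        auto& set = shadow_[block_addr & (sets_ - 1)];
        for (auto it = set.begin(); it != set.end(); ++it)
            if (*it == block_addr) { set.erase(it); break; }
        for (uint64_t t : set)                                // another block still maps here?
            if (bf_index(t) == bf_index(block_addr)) return;  // then keep the bit set
        bf_[bf_index(block_addr)] = false;                    // otherwise reset it
    }
private:
    std::size_t sets_;                           // number of cache sets (power of two)
    int extra_bits_;
    std::vector<bool> bf_;
    std::vector<std::vector<uint64_t>> shadow_;  // block addresses per cache set
};

int main() {
    L3MissFilter bf(/*sets=*/1024, /*extra_bits=*/2);         // BF has 4x the sets
    bf.on_fill(0x1234);
    bool maybe_hit = !bf.predict_miss(0x1234);
    bf.on_evict(0x1234);
    return (maybe_hit && bf.predict_miss(0x1234)) ? 0 : 1;
}
```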

Performance evaluation using a set of SPEC2006 benchmarks shows that using a BF for the third-level (L3) cache in a 4-level cache hierarchy to filter and run ahead of L3 misses improves the IPC by 3-21%, with an average improvement of 9.5%.

1.3 Hashing Fundamentals and Bloom Filter

Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental study to find a way to balance the hashing buckets while maintaining a low false-positive rate for the Bloom filter. To broaden the applications, our study is based on the routing and packet-forwarding function at the core of the IP network-layer protocols. The throughput of a router is constrained by the speed at which the routing table lookup can be performed. Hash-based lookup has been a research focus in this area due to its O(1) average lookup time.

It is well known that hash collision is an inherent problem when a single random hash function is used, causing an uneven distribution of keys among the hash buckets in a nondeterministic fashion. The multiple-hashing technique, on the other hand, uses d independent hash functions to place a key into one of d possible buckets. The criterion for selecting the target bucket for placement is flexible and can be controlled to accomplish a specific objective. One well-known objective of using multiple hash functions is load balancing, i.e., to balance the keys in the buckets [20] [21] [22] [23] [24] [25].
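For illustration, the sketch below places each key into the least-loaded of d candidate buckets (the classic "power of d choices" policy); the hash family, the bucket count, and the mixing constant are placeholder assumptions, and the guided scheme proposed in Chapter 5 differs in how the target bucket is chosen.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// d-choice placement: hash each key with d independent functions and insert it
// into the least-loaded candidate bucket (the load-balancing objective).
class MultiHashTable {
public:
    MultiHashTable(std::size_t buckets, int d) : loads_(buckets, 0), d_(d) {}

    std::size_t insert(uint64_t key) {
        std::size_t best = hash(key, 0);
        for (int i = 1; i < d_; ++i) {
            std::size_t b = hash(key, i);
            if (loads_[b] < loads_[best]) best = b;   // pick the lighter bucket
        }
        ++loads_[best];
        return best;
    }
    int max_load() const { return *std::max_element(loads_.begin(), loads_.end()); }

private:
    std::size_t hash(uint64_t key, int i) const {
        // Placeholder hash family: mix the key with a per-function constant.
        return std::hash<uint64_t>{}(key ^ (0x9E3779B97F4A7C15ULL * (i + 1))) % loads_.size();
    }
    std::vector<int> loads_;   // number of keys per bucket
    int d_;
};

int main() {
    MultiHashTable table(/*buckets=*/1024, /*d=*/4);
    for (uint64_t k = 0; k < 1000; ++k) table.insert(k * 2654435761ULL);
    std::printf("max bucket load = %d\n", table.max_load());
}
```

Note that on lookup all d candidate buckets may need to be probed, which is exactly the cost the guided approach in Chapter 5 aims to reduce.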

Another known objective of multiple hashing sets an opposite criterion: reducing the fill factor of the hash buckets [22] [26]. The fill factor is measured by the ratio of nonempty buckets. Instead of placing a key in the bucket with the smaller number of keys for load balancing, this approach places the key in a bucket that already holds a non-zero number of keys. The objective of this placement is to maximize the number of empty buckets. One potential application is to apply the low fill-factor hashing method to a Bloom filter [22] [26]. With more zeros remaining in the Bloom filter, the critical false-positive rate can be reduced. To create more zeros when establishing the Bloom filter, however, multiple sets of hash functions are needed for different keys, since all the hashed k bits for each key must be set during the setup of the Bloom filter. Therefore, the multiple hashing concept is actually applied to choosing a set of hash functions out of multiple groups to maximize the number of zeros in the Bloom filter after recording k '1's for every key.

With a series of prior multi-hashing developments, including d-random, 2-left, and d-left, we discover that a new guided multi-hashing approach holds the promise of further pushing the envelope of this line of research and delivering significant performance improvement beyond what today's best technology can achieve. Our guided multi-hashing approach achieves near-perfect load balance among hash buckets while limiting the number of buckets to be probed for each key (address) lookup, where each bucket holds one or a few routing entries. Unlike the localized optimization of the prior approaches, we utilize the full information of the multi-hash mapping from keys to hash buckets for global key-to-bucket assignment. We have the dual objectives of lowering the bucket size while increasing the number of empty buckets, which helps to reduce the number of buckets brought from off-chip memory to the network processor for each lookup. We introduce mechanisms to ensure that most lookups require only one bucket to be fetched.

Our simulation results show that, with the same number of hash functions, the guided multiple-hashing schemes are more balanced than d-left and other schemes, while the average number of buckets accessed per lookup is reduced by 20–50%.

1.4 Intelligent Row Buffer

Accessing off-chip memory is a major performance bottleneck in multicore systems. As all of the cores must share the limited off-chip memory bandwidth, a large number of outstanding requests greatly increases contention on the memory data and command buses. Because a bank can only process one command at a time, a large number of requests also increases bank contention, where requests must wait for busy banks to finish servicing other requests.

A key challenge in effectively improving system performance lies in maximizing both row-buffer hits and bank-level parallelism while simultaneously providing fairness among different requests. We observed that accesses to each DRAM bank are not equally distributed among different rows for most of the workloads we study in Chapter 2. Due to spatial locality, some rows tend to be accessed more frequently than other rows over a certain period of time. We call these rows "hot rows". However, if requests to different hot rows in the same bank interleave with each other, there is only a slight chance that those requests result in row-buffer hits. DRAM banks will frequently close opened rows and issue commands to open other rows, causing large queuing delays (time spent waiting for the memory controller to start servicing a request) and DRAM device access delays (due to decreased row-buffer hit rates and bus contention).


We propose a simple scheme to capture the hot-row pattern and prefetch data from these hot rows. The prefetched data will hit in the row buffer, saving access time later. Results show that our design consistently performs better than simple LRU and LFU hot-row schemes.
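As a rough sketch, hot rows can be identified by tracking per-bank row access counts and flagging rows whose counters exceed a threshold; the table organization, threshold, and aging shown below are placeholders, and the hybrid scheme evaluated in Chapter 6 is more elaborate.

```cpp
#include <cstdint>
#include <unordered_map>

// Per-bank hot-row tracker: count recent accesses per row and flag rows whose
// counters cross a threshold as candidates for row-buffer prefetching.
class HotRowTracker {
public:
    explicit HotRowTracker(int threshold = 4) : threshold_(threshold) {}

    // Record an access; returns true if the row is now considered hot.
    bool access(int bank, uint32_t row) {
        return ++counts_[bank][row] >= threshold_;
    }
    // Periodic aging so stale rows eventually stop looking hot.
    void decay() {
        for (auto& bank_entry : counts_)
            for (auto& row_entry : bank_entry.second)
                row_entry.second >>= 1;
    }
private:
    int threshold_;
    std::unordered_map<int, std::unordered_map<uint32_t, int>> counts_;
};

int main() {
    HotRowTracker tracker(/*threshold=*/3);
    bool hot = false;
    for (int i = 0; i < 3; ++i) hot = tracker.access(/*bank=*/0, /*row=*/42);
    tracker.decay();
    return hot ? 0 : 1;   // row 42 becomes hot after three accesses
}
```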


CHAPTER 2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION

2.1 Evaluation Methodology

To evaluate the performance advantages of the proposed works in memory hierarchy design, we adopt two cycle-accurate simulation methodologies. The first method is to build and run applications on MARSSx86 [27], an x86-based whole-system simulation environment. MARSSx86 is built on QEMU, a full-system emulation environment in which selected multi-threaded and multi-programmed workloads are compiled and run. The executed instructions and memory requests drive a cycle-accurate multi-core model, which is extended from PTLsim [28]. Memory requests are simulated through multiple levels of the cache hierarchy. In case of a last-level cache miss, the request is issued to the memory, which is modeled using DRAMsim2 [29], a cycle-accurate DDR-based DRAM model.

We develop a memory interface controller, called MICsim, to handle requests from the processors to the off-die DRAM cache and memory. We also develop a callback function between MICsim and the multicore processor model in MARSSx86. When a memory request misses the last-level on-die cache, the request is inserted into a memory request queue. MICsim processes requests from the top of the memory request queue one at a time in every cycle. We model partial hits to the stacked and the conventional DRAMs. Outstanding requests are saved in a pending queue for detecting and holding subsequent requests to the pending blocks. This first evaluation methodology is used in studying the runahead cache miss proposal, since detailed L1, L2, and L3 caches are simulated to understand the effectiveness of bypassing certain cache levels.


Table 2-1. Architecture parameters of processor and memories.
Processor, DRAM cache, and DRAM memory parameters:
Processor:         3.2 GHz, 8 cores, out-of-order
L1 caches:         I/D split, 32KB, MESI, 64B line, 4-way, 2 read ports, 1 write port, latency 4 cycles
L2 cache:          Private, 256KB, 64B line, 8-way, 2 read ports, 1 write port, latency 11 cycles, snooping-bus MESI protocol, 2 cycles per request, split transaction
L3 cache:          Shared, 8MB, 64B line, 16-way, 2 read ports, 2 write ports, latency 24 cycles
L4 DRAM cache:     128MB-256MB, 1.6 GHz, 16-byte bus, Channels/Ranks/Banks: 4/1/16, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33
Conventional DRAM: 16GB, 800 MHz, 8-byte bus, Channels/Ranks/Banks: 2/1/8, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33, 2KB row buffer, close-page

Although MARSSx86 precisely simulates multicore processors, the simulation time is unbearably long when simulating a large stacked DRAM cache and memory; billions of instructions are required to produce meaningful results. The virtual machine infrastructure in MARSSx86 also limits the physical address space, as pointed out in [30]. In the second method, we adopt the Epoch model [31] for estimating the execution time of different applications. It uses traces generated by the Pin-based Sniper simulator [32], in which the representative regions of applications are simulated and the requests sent to L3 are collected based on the Intel Gainestown configuration with private L1 and L2 caches and a shared L3 that interfaces with main memory. Per-core memory traces generated from Sniper are annotated with Epoch marks to ensure correct dependence tracking for issuing a cadence of memory requests. Each core exploits memory-level parallelism by issuing memory requests up to the Epoch mark. Each memory request is simulated through the cycle-accurate memory hierarchy model, which is the same as in the first method. The processor waits until all requests come back from the memory controller before moving to the next Epoch. In this method, we model a precise on-die shared L3 cache with correct timing and bandwidth considerations. This Epoch simulation model is used in the evaluation of the Cache Lookaside Table proposal, which provides an alternative tag path for a large stacked DRAM cache, as well as in studying intelligent hot-row prefetching for DRAM row buffers. Table 2-1 summarizes the architecture parameters used in our simulation.

2.2 Workload Selection

Table 2-2. MPKI and footprint of the selected benchmarks.
Benchmark      Footprint (MB)   L3 MPKI   L4 MPKI
mcf            9310             74.0      19.2
gcc            477              39.5      1.4
lbm            3222             30.2      15.1
soplex         1693             28.1      21.9
milc           4084             27.5      22.1
libquantum     256              24.1      13.2
omnetpp        259              19.1      0.2
sphinx         78               12.1      0.2
bt             240              11.2      0.4
bwaves         3794             9.2       5.0
leslie3d       599              8.3       4.0
gems           1663             7.7       5.2
zeusmp         3488             6.3       4.5

To conduct our evaluation, we examined all workloads from SPEC CPU2006 and selected 12 applications with high L3 MPKI and large footprints. All workloads are run in multithreaded mode, where each application is replicated 8 times, with the 8 threads running on 8 cores. Therefore, the total memory footprint is roughly 8 times as large as the footprint reported in [33]. Table 2-2 gives basic information for these workloads. In general, we use the first 5 billion instructions to warm up caches, tables, and other data structures in the different methods. We simulate the next billion instructions to collect performance statistics.


CHAPTER 3 CACHE LOOKASIDE TABLE

In this chapter, we present our CLT work on providing an alternative tag path for a large off-chip stacked DRAM cache. We begin by reviewing the current state-of-the-art designs for solving the tag problem and the motivation for using a small on-die space to capture the majority of DRAM cache tags. We then present the detailed design of how the CLT adopts the decoupled sector cache idea to exploit spatial locality as well as save tag space. Finally, we present results to support our work.

3.1 Background and Related Work

Future high-end servers with tens or hundreds of cores demand high memory bandwidth. Recent advances in 3D stacking technology provide a viable solution to the memory wall problem [13] [34] [11] [35]. They offer a promising avenue for low-latency, high-bandwidth interconnect between processor and DRAM dies through silicon vias [3]. However, due to physical space availability, the capacity of this nearby memory is limited and not suitable to serve as the system main memory [4] [5] [36]. One viable approach is to use this nearby DRAM as the last-level cache for fast access and reduced bandwidth to main memory. Intel Haswell [1], a fourth-generation Core, is an example; it has a 128MB L4 cache built on embedded DRAM technology.

Designs of large off-die DRAM caches have gained much interest recently [5] [37] [38] [39] [40] [6] [14] [7] [41]. Researchers have noted the large space requirement as well as the access time and power overheads of implementing a tag array for a large DRAM cache [4] [9]. Considering a common cache block size of 64 bytes, if each tag is 6 bytes, the tag array consumes 24MB and 96MB respectively for 256MB and 1GB DRAM caches. Such a large tag array is impractical to fit on the processor die. If the tag array is instead part of the off-die DRAM cache, it requires an extra trip to access the tags.

There are two general approaches to handling the large tag array of a DRAM cache placed off-die. The first approach is to record large block tags in order to fit the tag array on-die. Zhao et al. [10] explore DRAM caches for CMP platforms. They recognize the tag space overhead and suggest storing all tags of a set in a contiguous DRAM block for fast access. They show that using on-die partial tags and a sector directory achieves the best space-performance tradeoff. However, partial tags are expensive and encounter false positives. CHOP [6] advocates large block sizes to alleviate the tag space overhead. It uses a separate filter cache to detect and cache only hot blocks to reduce the fragmentation problem.

The Footprint cache [39] has large blocks and uses the sector cache idea [42] [43] [44] [45] to reduce the tag space and the memory bandwidth requirement. In addition, it predicts the footprint within a 2KB sector and prefetches those 64-byte blocks based on the footprint history [46]. Data prefetching is orthogonal to the proposed caching methods and is beyond the scope of this work. Nevertheless, it is noteworthy that the Footprint cache loses cache space for blocks that are not part of the footprint. The Unison cache [40] extends the Footprint idea to handle even bigger stacked DRAM caches. It moves the sector tags off-chip and uses way prediction to fetch the tag and the predicted data block in parallel.

The second approach is to keep the conventional block size and use other techniques to alleviate the impact of the extra off-die tag array access. Loh and Hill [4] [47] propose allocating all tags and data blocks of a cache set in a single contiguous DRAM location to improve row-buffer hits. To reduce the miss latency, they use a miss-map directory to identify DRAM cache misses and issue accesses directly to the next lower level of memory. To save space, they record the miss map at the granularity of a large memory segment. However, when a segment is evicted from the miss-map directory, all blocks in the segment must be evicted from the DRAM cache. Sim et al. [5] suggest speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, this requires handling complicated mispredictions.

To reduce the extra off-die tag access, a small amount of recently referenced tags can be cached on-die. An off-die tag array access is avoided when the tag is found on-die. This simple approach faces two problems. First, a tag is cached on-die only when a request misses the on-die cached tags; it does not take advantage of spatial locality in data accesses. Second, caching tags of individual blocks does not save tag space if we want to maintain the same capacity. Even worse, without a one-to-one mapping between the cached tags and the DRAM data array, a location pointer into the data array is needed for each cached block tag. Meza et al. [48] propose to cache the block tags of an entire DRAM cache set as a unit in an on-die directory called Timber. Caching the tags of the entire set as a unit avoids way pointers, and it does not require invalidating blocks in a set when the set of tags is evicted from Timber. However, caching all tags in a set does not save any tag space. Moreover, it follows neither the spatial nor the temporal locality of applications. The ATCache [38] applies the same idea as Timber with additional prefetching for sets of tags. Other works [49] [50] have proposed caching tags to improve cache access latency in the generic setting of a multilevel SRAM cache hierarchy.

The Alloy cache trades the high hit ratio of a set-associative cache for the fast access time of a direct-mapped cache [9]. Besides the lower hit ratio, the Alloy cache relies on a cache-miss predictor [51] to speculatively issue parallel accesses to both cache and memory when a cache miss is predicted, in order to avoid sequential accesses to the off-die cache and memory. TagTables [37] applies the page-walk technique to identify large blocks (pages) located in the cache. It allocates the entire page into the same cache set in order to combine adjacent blocks into a chunk, saving space for the location pointers in the page-walk tables.

Our proposed CLT maintains the conventional block size for the off-die DRAM cache. It differs from other proposed approaches in how it avoids off-die tag array accesses. In contrast to TagTables, the Alloy cache, the Footprint cache, and the Unison cache, the role of the CLT is to provide a fast, alternative on-die tag path that covers a majority of cache requests. With off-die full block tags as the backup, the CLT can be decoupled from the off-die cache without the inherent complexity of the decoupled sector cache [45]. Unlike Timber and the ATCache, the CLT caches on-die tags for bigger sectors, allowing it to capture spatial locality and to save tag space by sharing the sector tags. In addition, unlike the miss-map approach, the CLT can identify both cache hits and cache misses without block invalidation when a sector is replaced from the CLT. Furthermore, unlike partial tags or hit/miss speculation, the proposed CLT maintains precise hit/miss information for all recorded blocks and is non-speculative, bypassing the off-die tag array for a majority of memory requests.

3.2 CLT Overview

3.2.1 Stacked Off-die DRAM Cache with On-Die CLT

Figure 3-1 depicts the block diagram of a CMP system with a nearby off-die stacked DRAM cache and an on-die Cache Lookaside Table (CLT). All requests to the DRAM cache are first filtered through the CLT, which records recently referenced memory sectors. When the sector of a requested block is found in the CLT (a CLT hit, which is different from a DRAM cache hit), the off-die tag directory access is removed from the critical cache access path. Either the stacked data array or the next-level DRAM is accessed, depending on the block hit/miss information. The proposed sector-based CLT records large sector tags to save space and to exploit spatial locality for better coverage. If a small CLT can cover a high percentage of the total requests, we can reduce the average memory access latency as well as the off-die bandwidth requirement without putting the entire cache tag array on-die.

Figure 3-1. Memory hierarchy with stacked DRAM cache.

It is well-known that conventional cache access goes through a tag path to determine a cache hit or a miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency such that matching of the address tag in the tag array determines the location of the block in the data array. The original sector cache [42] records large sector tags to save tag space, but maintains the physical equivalency with the block-based data array where all blocks in a sector are allocated in fixed locations in the data array. It wastes cache space for unused blocks.

The decoupled sector cache [45] allocates requested blocks of a sector in any location of the target set in the data array for better cache performance. However, it faces a few inherent difficulties. First, without physical equivalency, it requires a location pointer into the data array for each block in a sector to locate the block. Second, due to the topological mismatch between the sector-based tag array and the block-based data array, the location pointers are also used to invalidate the remaining valid blocks when a sector is replaced from the sector tag array. To minimize such invalidations, the number of sector tags recorded in the sector tag array needs to be larger than the number of sectors matching the cache size. Third, the decoupled sector cache also requires a backward pointer from each block in the data array to its parent sector in the sector tag array for updating the validity information when the block is replaced from the data array. This double-pointer requirement, along with the enlarged number of sector tags, defeats the purpose of saving tag space and further complicates the sector cache design.

The CLT captures only a portion of recently referenced sectors on-die and relies on off-die full block tags to handle the rest. It avoids the two critical issues that the decoupled sector cache encounters. First, the backward pointer from the blocks in the data array to the parent sector in the CLT can be eliminated. This is because, with only a portion of the sectors recorded in the CLT, the index bits of the large cache sets can be a superset of the index bits of the CLT. Although the missed block and the replaced block in a cache set can be from different sectors, they must be located in the same CLT set. A search of the CLT set can identify the sector to which the evicted block belongs in order to update the validity information.

Second, when a sector is replaced from the CLT, the valid blocks in the sector can remain in the cache as long as all blocks of a sector are allocated to the same cache set. This allocation can be accomplished by using the low-order bits of the sector address as the cache index bits. When the sector is referenced later, a search of the block tags in the target cache set can recover all the valid blocks in the sector. This search is possible because the block tags are maintained in the cache tag array. Without the block tags, the decoupled sector cache must invalidate the remaining blocks when a sector is evicted. The detailed CLT design is given in Section 3.3.
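To make the index relationship concrete, the sketch below derives the CLT set and the cache set from the same address, under illustrative parameters (16 blocks per sector, 4K CLT sets, 64K cache sets); because there are fewer CLT sets than cache sets, the cache index bits contain the CLT index bits.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative parameters for the bit arithmetic only.
constexpr int kBlockOffsetBits = 6;    // 64-byte blocks
constexpr int kSectorBits      = 4;    // 16 blocks per sector
constexpr int kCltSetBits      = 12;   // 4K CLT sets
constexpr int kCacheSetBits    = 16;   // 64K cache sets

int main() {
    uint64_t paddr       = 0x3FA9D2C40ULL;
    uint64_t block_addr  = paddr >> kBlockOffsetBits;    // drop block offset
    uint64_t sector_addr = block_addr >> kSectorBits;    // drop block ID within sector

    // Both indices come from the low-order bits of the sector address, so every
    // block of a sector maps to one cache set, and the 16 cache index bits
    // contain the 12 CLT index bits as their low-order part.
    uint64_t cache_set = sector_addr & ((1ULL << kCacheSetBits) - 1);
    uint64_t clt_set   = sector_addr & ((1ULL << kCltSetBits) - 1);

    std::printf("cache set = %llu, CLT set = %llu, cache set mod 4096 = %llu\n",
                (unsigned long long)cache_set, (unsigned long long)clt_set,
                (unsigned long long)(cache_set & ((1ULL << kCltSetBits) - 1)));
}
```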

3.2.2 CLT Coverage

We first validate the potential CLT coverage using 12 SPEC CPU2006 workloads. In Figure 3-2, we plot the accumulated reuse distance curves of the 12 workloads with a large (2KB) block size. We want to show that a small portion of recently used large blocks (sectors) can indeed cover the majority of cache accesses. The horizontal axis (logarithmic scale) represents the percentage of the reuse distance with respect to the full stack distance that covers the entire block reference stream, and the vertical axis is the accumulated percentage of the total blocks that can be covered. We can observe that by recording 10% of the most recently referenced blocks, over 90% of the requests can be covered for all workloads except gcc and milc, whose coverages are 82% and 88% respectively. These results support the CLT approach of recording a small portion of recently referenced sectors on-die to provide a fast path for a majority of the DRAM cache requests.

Figure 3-2. Reuse distance curves normalized to the percentage of the maximum distance.
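Coverage curves of this kind can be produced with a standard LRU stack-distance computation over the sector address stream; the sketch below is a simple O(N·D) version for illustration, not the tool used to generate Figure 3-2.

```cpp
#include <cstdint>
#include <list>
#include <vector>

// Compute LRU stack (reuse) distances over a stream of 64-byte block addresses,
// grouped into 2KB sectors (32 blocks). distance = number of distinct sectors
// touched since the previous access to the same sector; first touches are -1.
std::vector<long> reuse_distances(const std::vector<uint64_t>& block_addrs) {
    std::list<uint64_t> lru;                  // most recently used sector in front
    std::vector<long> out;
    out.reserve(block_addrs.size());
    for (uint64_t addr : block_addrs) {
        uint64_t sector = addr >> 5;          // 2KB / 64B = 32 blocks per sector
        long depth = 0;
        auto it = lru.begin();
        for (; it != lru.end(); ++it, ++depth)
            if (*it == sector) break;
        if (it == lru.end()) {
            out.push_back(-1);                // cold (first) access
        } else {
            out.push_back(depth);
            lru.erase(it);
        }
        lru.push_front(sector);
    }
    return out;
}

int main() {
    std::vector<uint64_t> trace = {0, 1, 40, 0, 40, 100};
    std::vector<long> d = reuse_distances(trace);
    return d.size() == trace.size() ? 0 : 1;
}
```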

3.2.3 Comparison of DRAM Cache Methods

There are several key aspects in comparing DRAM cache designs, including the Alloy cache, the ATcache, TagTables, and the proposed CLT, as summarized in Table 3-1. For comparison purposes, we also include an impractical method that keeps all cache block tags on-die. We assume 64-way set-associativity for all caches. Note that the Footprint cache follows the original sector cache design plus prefetching of the blocks in the predicted footprint of each sector. Since prefetching techniques (e.g., a streaming prefetcher) benefit all of the proposed methods and are orthogonal to the cache design, we omit the prefetch aspect in our comparison.

The goal of all proposals is to avoid off-die tag accesses. Therefore, the storage requirement, including on-die SRAM, stacked DRAM, and regular memory, must be considered. First, different methods require different on-die SRAM. Alloy uses a small on-die table to predict DRAM cache misses for issuing parallel accesses to memory. The ATcache caches the tags of recently referenced cache sets. TagTables caches the page-walk tables in the on-die L3 on demand. The CLT records tags of recently referenced sectors along with valid, modify, and way-pointer bits for each block to cover tag accesses for a majority of requests. For a fair comparison, we keep the on-die SRAM size fixed by adjusting the L3 size for all methods. For example, we deduct the CLT size from the L3 cache size in our evaluation.

Table 3-1. Comparison of different DRAM cache designs.
Design             On-die SRAM (constant)     Stacked DRAM (constant)   Tag-data mapping     Entropy on cache set   Cache placement
Block tag on-die   block tag                  data                      1-1 map              block indices          64-way LRU
Alloy              predict table              tag + data                1-1 map              block indices          direct-map
TagTables          cached page-walk tables    data                      1-many, decoupled    sector indices         chunk placement
CLT                CLT (tag + pointer)        tag + data                1-many, decoupled    sector indices         64-way LRU

Next, for the stacked DRAM requirement, Alloy, the ATcache, and the CLT must maintain the tags along with the data blocks in the stacked DRAM, while TagTables does not keep separate block tags. Therefore, we reduce the data array size proportionally for Alloy, the ATcache, and the CLT in our evaluation. Third, it is important to note that TagTables creates page-walk tables in main memory. Since we do not evaluate I/O activity, we do not impose any penalty associated with this extra memory requirement.

Physical mapping of the on-die tags and their data blocks is another key aspect. Alloy has a simple direct-mapped topology without separate on-die tags. The ATcache does not alter the mapping between the on-die tags and their data. TagTables and the CLT share a sector tag among multiple data blocks to save tag space. The CLT records only a portion of the sector tags. As a result, a location pointer for each block in a sector is needed to locate the block in the data array. TagTables also requires the location pointers. It further limits them to four pointers per sector (page) by combining adjacent blocks into physical chunks using more restricted block placement and replacement in the cache data array.

With respect to the fetch bandwidth requirement, all methods fetch 64-byte blocks from the off-die stacked DRAM. However, Alloy needs to fetch the 64-byte data block along with its tag and miscellaneous bits.

Last but not least, the effort to avoid off-die tag accesses could affect cache performance. The first impact is on the entropy of indexing the cache sets. It is well known that using the low-order block address bits to hash to the cache sets provides good entropy in distributing blocks across the entire set of cache sets. However, due to the restricted mapping in the CLT, as well as the need to combine adjacent blocks into chunks in TagTables, these approaches use the low-order bits of the sector address to determine the cache sets. Depending on the sector size, the cache indices are taken from higher-order address bits in comparison with block indices. Using higher-order bits for indexing the cache may adversely impact the entropy of hashing blocks to the cache sets and create more conflicts.


These methods also differ in their cache placement and replacement policies. The ATcache and the CLT maintain 64-way set-associativity in each set with a pseudo-LRU replacement policy decoupled from the topology of the sector tags in the CLT. Alloy uses a direct-mapped design, which may suffer lower hit ratios. TagTables relies on special placement and replacement mechanisms for creating big chunks, since each sector can only record up to four chunks.

Figure 3-3. Coefficient of variation (CV) of hashing 64K cache sets using different indices.

To understand the impact on the entropy of hashing memory requests across cache sets, we show the coefficient of variation (CV) for five different sector-based cache indices in Figure 3-3. The CV is the ratio of the standard deviation to the mean of the number of requests hashed to each cache set. In this simulation, we assume there are 64K sets, hence 16 index bits, and the block size is 64 bytes. In the figure, sector_n indicates that the least-significant index bit position starts log2(n) bits to the left of the least-significant bit of the block address. For a 64-byte block size, the least-significant 6 bits are the block offset, so the index bits start from the 7th bit for sector_1, the 10th bit for sector_8, and so on. In other words, n consecutive blocks are allocated to the same set. We can observe that the CV increases significantly as n increases for all workloads except milc. The significance of the variation among the workloads is sorted from left to right. With large n, this uneven distribution of memory requests across cache sets increases the chance of set conflicts and degrades cache performance. Milc exhibits special indexing behavior: by allocating consecutive blocks to the same set, its CV is actually reduced. However, since milc has the smallest CV overall, the impact should be minimal.
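For reference, the CV of Figure 3-3 can be computed as in the sketch below; the sector size and set count are parameters, and the address stream would come from the workload traces.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Coefficient of variation (stddev / mean) of per-set request counts when the
// cache index is taken above the sector bits, so that sector_size consecutive
// blocks map to the same set (sector_size must be a power of two here).
double set_index_cv(const std::vector<uint64_t>& block_addrs,
                    std::size_t num_sets, unsigned sector_size) {
    std::vector<double> counts(num_sets, 0.0);
    unsigned shift = 0;
    while ((1u << shift) < sector_size) ++shift;          // log2(sector_size)
    for (uint64_t b : block_addrs)
        counts[(b >> shift) % num_sets] += 1.0;           // sector-based set index

    double mean = 0.0;
    for (double c : counts) mean += c;
    mean /= static_cast<double>(num_sets);
    double var = 0.0;
    for (double c : counts) var += (c - mean) * (c - mean);
    var /= static_cast<double>(num_sets);
    return std::sqrt(var) / mean;                         // CV = stddev / mean
}

int main() {
    std::vector<uint64_t> trace;
    for (uint64_t i = 0; i < (1u << 20); ++i) trace.push_back(i * 7);  // synthetic stream
    std::printf("sector_1 CV = %.4f, sector_16 CV = %.4f\n",
                set_index_cv(trace, 1u << 16, 1), set_index_cv(trace, 1u << 16, 16));
}
```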

Figure 3-4. DRAM cache MPKI using sector indexing.

Figure 3-4 shows the impact on the MPKI of the sector-based indexing schemes. We simulate a 256MB cache with 64-way set-associativity and 64-byte blocks. The results indeed demonstrate that with a large sector size, allocating all blocks to the same cache set degrades cache performance significantly, especially when the sector size is 64. However, with a moderate sector size (e.g., 16 blocks), the impact is rather manageable. Among the workloads, mcf, sphinx3, omnetpp, gcc, and leslie3d see the worst impact, which is consistent with the CV results in Figure 3-3. These results suggest that a moderate sector size enables a decoupled CLT without compromising much cache performance.

3.3 CLT Design

An example of a 3-way set-associative CLT with 64-byte blocks and 16 blocks per sector is depicted in Figure 3-5. According to the size and set-associativity of the CLT, a few low-order bits in the sector address are used to determine the CLT set. The remaining high-order tag bits are used to match the sector tags recorded in the set. In this example, the address part labeled sect represents the block ID within a sector and is used to look up the recorded block information. Each sector has a valid bit and 16 groups of valid (v), modify (m), and location pointer (way) bits for the 16 blocks. Given that the cache set index is a part of the address, only the cache way is needed in the location pointer. For example, in a 64-way DRAM cache, 6 bits are required to record the way.
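The per-sector CLT entry can be sketched as the following structure; the field widths follow the description above (16 blocks per sector, 64-way cache), and the packing shown is illustrative rather than an exact hardware layout.

```cpp
#include <cstdint>
#include <cstdio>

// One CLT entry: a sector tag shared by 16 blocks, plus per-block state.
// With a 64-way DRAM cache the way pointer needs 6 bits; adding the valid and
// modify bits gives one byte of state per block, i.e. 16 bytes per sector plus
// the sector tag -- consistent with the roughly 20 bytes per sector assumed in
// Section 3.4.1.
struct CltBlockInfo {
    uint8_t valid  : 1;   // block present in the DRAM cache (hit/miss bit)
    uint8_t modify : 1;   // block is dirty
    uint8_t way    : 6;   // way location within the 64-way cache set
};

struct CltEntry {
    uint32_t     sector_tag;     // high-order sector address bits
    bool         sector_valid;   // entry currently holds a valid sector
    CltBlockInfo block[16];      // one record per 64-byte block in the sector
};

static_assert(sizeof(CltBlockInfo) == 1, "one byte of state per block");

int main() {
    std::printf("CLT entry size (this packing): %zu bytes\n", sizeof(CltEntry));
}
```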

Based on whether the sector is valid in the CLT and whether the requested block is located in the DRAM cache, the cache access works as follows.

First, when a sector tag match is found and the target sector is valid (a CLT hit), the 4-bit block ID (sect) selects the corresponding hit/miss (same as a valid bit) and the way pointer for the block. In case of a cache hit, the way pointer is used to access the data block from off-die stacked DRAM data array. The critical off-die tag path is bypassed.

Second, on a CLT hit with the hit/miss bit indicating that the block is not located in the DRAM cache, the request is issued directly to the conventional DRAM (main memory). The off-die tag path is bypassed as well. When the missed block is returned from the conventional DRAM, the block data and tag are stored into the DRAM cache at the LRU position given by the on-die replacement logic. A writeback is necessary in case the evicted block is dirty. (For simplicity we omit the dirty bits in the drawing.) Meanwhile, the CLT is updated by turning off the hit/miss bit for the evicted block and recording a hit and the way location for the new block. Note that the evicted block may not be in the same sector as the missing block. However, they must be located in the same CLT set since the CLT index bits are a subset of the DRAM cache index bits. By matching the cache tag and the remaining cache index bits with the proper sector tag bits, the LRU block's entry in the CLT can be identified.

Figure 3-5. CLT design schematics.

Third, when a request misses the CLT, an off-die tag access is necessary to bring in all the tags in the target cache set for determining the hit/miss status and the way location of the requested block. Depending on whether the requested block is located in the cache, the remaining cache and memory accesses are the same as when the target sector is valid in the CLT. In order to update the CLT with the new sector, the cache tag comparison logic is extended to allow matching of the tags in the target cache set against all other block tags in the sector. For those blocks present in the set, the hit/miss status bits are set and their way pointers are recorded. For the other blocks that are missing, the corresponding hit/miss indicator is recorded as a miss. The new sector tag and its associated hit/miss and location information replace the LRU sector in the CLT. Note that there is no cache invalidation for the evicted sector.
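The three cases above can be summarized in a functional sketch. The code below models the tag arrays as simple maps and ignores timing, replacement details, and writebacks; the names and sizes are illustrative placeholders, not the evaluated design.

```cpp
#include <cstdint>
#include <unordered_map>

// Functional sketch of CLT request handling (cases 1-3 above).
constexpr int kBlocksPerSector = 16;
constexpr int kWays = 64;

struct BlockInfo { bool valid = false; uint8_t way = 0; };
struct Sector    { BlockInfo blk[kBlocksPerSector]; };

struct System {
    std::unordered_map<uint64_t, Sector> clt;           // sector addr -> CLT entry
    std::unordered_map<uint64_t, uint8_t> cache_tags;   // block addr  -> way (off-die tags)

    // Returns true if served from the DRAM cache, false if from main memory.
    bool access(uint64_t block_addr) {
        uint64_t sector = block_addr / kBlocksPerSector;
        unsigned id     = static_cast<unsigned>(block_addr % kBlocksPerSector);

        auto it = clt.find(sector);
        if (it != clt.end()) {                            // CLT hit
            BlockInfo& b = it->second.blk[id];
            if (b.valid) return true;                     // case 1: off-die tags bypassed
            b.valid = true;                               // case 2: miss -> fill from memory
            b.way = static_cast<uint8_t>(block_addr % kWays);   // stand-in for replacement logic
            cache_tags[block_addr] = b.way;
            return false;
        }
        // Case 3: CLT miss -- consult the off-die tags for every block of the
        // sector and install the rebuilt entry (no invalidation on CLT eviction).
        Sector s;
        for (unsigned i = 0; i < kBlocksPerSector; ++i) {
            auto t = cache_tags.find(sector * kBlocksPerSector + i);
            if (t != cache_tags.end()) { s.blk[i].valid = true; s.blk[i].way = t->second; }
        }
        bool hit = s.blk[id].valid;
        if (!hit) {                                       // also fill the requested block
            s.blk[id].valid = true;
            s.blk[id].way = static_cast<uint8_t>(block_addr % kWays);
            cache_tags[block_addr] = s.blk[id].way;
        }
        clt[sector] = s;
        return hit;
    }
};

int main() {
    System sys;
    bool first  = sys.access(1000);   // CLT miss, cache miss -> served from memory
    bool second = sys.access(1000);   // CLT hit, cache hit -> off-die tags bypassed
    return (!first && second) ? 0 : 1;
}
```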

Figure 3-6. CLT operations in handling memory requests. A, B, and C are three sectors, each with four blocks; a few blocks are located in the cache initially, and the circled blocks are moved in due to cache misses. All three sectors are recorded in the same 2-way set-associative CLT set. 'Condition' indicates hit/miss to the CLT and to the cache for each request.

In Figure 3-6, we illustrate the CLT operations in handling a sequence of DRAM cache requests, A0, A2, A3, B1, B2, B3, C1, and C2, where A, B, and C represent three different sectors and each sector has four 64-byte blocks as indicated by the subscript. The least-significant 6 address bits are the block offset, and the next 2 bits define the block ID within a sector. The target sets of both the CLT and the cache are determined by the low-order bits of the sector address, as illustrated in the figure, so that all blocks of a sector are allocated to the same cache set. Since the number of cache blocks is several times larger than the number of sectors in the CLT, the cache index bits are a superset of the sector index bits. In this example, we assume sectors A, B, and C are hashed to the same CLT set, but allocated to different cache sets. Several blocks in A, B, and C are located in the 8-way DRAM cache initially. Note that the blocks marked by a circle are moved into the cache after a miss occurs. For simplicity, we assume the CLT is 2-way set-associative. We also assume sector B is already recorded in the CLT with its sector tag, two valid blocks B1 and B3 with location pointers '001' and '101', and two invalid blocks B0 and B2.

When A0 is issued, it misses the CLT. All tags in the target DRAM cache set where A0 is located are fetched and compared with all the block tags in A. A match is found for the requested A0 and also for A2, while A1 and A3 are missing. The request to the data array to fetch A0 is then issued. The CLT is updated by recording the sector tag for A, setting the valid and way bits for A0 and A2, and marking A1 and A3 invalid. When A2 is processed, it hits the CLT with a valid block indicator, and the location pointer is '100'. Therefore, the data block can be fetched from the correct data array location directly. Next, A3 is also a CLT hit, but the block is invalid. A request is issued to the conventional DRAM to bring in the missing block. According to the on-die pseudo-LRU cache replacement logic, A3 is placed in way 6 to replace the LRU block. The DRAM cache tag array and data array are updated accordingly. Meanwhile, the valid bit and location pointer are updated for A3, and the valid bit of the replaced block is turned off if it was valid.

When B1 comes, it hits the CLT as a valid block, hence B1 can be fetched from the data array directly with the location pointer ‘001’. The MRU/LRU position in the CLT is updated.

Next, B2 is a CLT hit but a cache miss, which is handled the same way as A3. B2 is moved to the DRAM cache set in way 3 afterwards. B3 hits both the CLT and the cache and is treated the same as B1. Next, C1 misses the CLT but hits the cache. It is handled the same as A0, where an off-die fetch to bring in all tags in the target DRAM set is necessary. In addition, to record the new sector C in the CLT, sector A must be evicted to make room for sector C. Blocks A0, A2, and A3 remain valid in the cache. The update of C in the CLT is the same as the update of A when A entered the CLT. Finally, C2 hits both the CLT and the cache and can be handled accordingly.

3.4 Performance Evaluation

3.4.1 Difference between Related Proposals

Table 3-2 summarizes the on-die SRAM space and latency for Alloy, ATcache, TagTables, and CLT, as well as the L3 cache size and the DRAM cache data array size. The MAP-I cache-miss predictor is used in Alloy with a one-cycle access latency and 768 bytes of SRAM space. TagTables takes up L3 space for caching the page-walk tables. By partitioning the TagTables metadata based on address, the metadata is allocated on the same L3 bank that triggers the tag access, so the interconnect latency can be avoided. We use Cacti 6.5 [52] to estimate a latency of 8 cycles for accessing the tag table, which is the same as that used in the TagTables paper [37].

For CLT, the sector tag plus 16 groups of valid, modify, and way pointers account for 20 bytes per sector. With 4K CLT sets of 20 ways each, the total number of sectors is 80K. Therefore, the CLT space is 80K × 20 bytes = 1.6 MB. In addition, we use a 6-level binary tree (63 bits) to implement a pseudo-LRU policy for the 64-way cache. The space requirement is 64K sets × 63 bits ≈ 504 KB. Therefore, the total on-die SRAM for CLT is close to 2MB. We use the same policy to allocate CLT partitions on the same L3 cache bank that triggers the CLT access to avoid interconnect latency. With a smaller 20 bytes of data, the estimated CLT latency is 6 cycles. ATcache requires the on-die Timber, pseudo-LRU logic, and a tag prefetcher.

Since each cache set consists of 64 4-byte tags, Timber has 12 ways and 512 sets for a total of 1.54MB of tag space. In addition, each entry needs one bit for the prefetching logic, which costs 12 × 512 / 8 = 768 B.

CLT only records recently referenced sectors and must fetch all cache tags from the stacked DRAM when a CLT miss occurs. For a 256MB cache with 64-byte blocks and 64-way set-associativity, each set has 64 blocks. Each block has a 30-bit address tag, a valid bit, and a modify bit, for a total of 4 bytes. Therefore, it requires fetching 4 tag blocks (256 bytes in total) on a CLT miss. Note that a 30-bit address tag can accommodate a 52-bit physical address. To overlap tag accesses, the four tag blocks are allocated in different banks of the stacked DRAM. ATcache also requires fetching 4 tag blocks on a miss. It also includes a tag prefetcher as described in [38].
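The space arithmetic above is easy to recompute; the short Python sketch below (variable names ours) reproduces the CLT SRAM budget and the per-miss tag-fetch size from the stated parameters.

```python
# Sketch: recompute the CLT on-die SRAM budget and per-miss tag fetch size.

# CLT entries: 4K sets x 20 ways, 20 bytes per sector entry
clt_entry_bytes = 4 * 1024 * 20 * 20          # 1,638,400 B = 1600 KB
# Pseudo-LRU for the 64-way DRAM cache: 63 bits per set, 64K sets
plru_bytes = 64 * 1024 * 63 // 8              # 516,096 B  ~ 504 KB
total_sram = clt_entry_bytes + plru_bytes     # close to 2 MB

# On a CLT miss: 64 tags per set, 4 bytes each (30-bit tag + valid + modify bits)
tag_fetch_bytes = 64 * 4                      # 256 B = 4 tag blocks of 64 B

print(f"CLT entries: {clt_entry_bytes // 1024} KB, pseudo-LRU: {plru_bytes // 1024} KB, "
      f"total: {total_sram / 2**20:.2f} MB")
print(f"Tag fetch per CLT miss: {tag_fetch_bytes} B ({tag_fetch_bytes // 64} tag blocks)")
```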

Table 3-2. Differences between the four designs (on-die SRAM size, SRAM latency, L3 cache, and stacked DRAM cache data array).
Alloy: 768 bytes SRAM; 1 cycle; 8MB, 16-way L3; 240MB, direct-mapped DRAM cache.
ATcache: 2MB SRAM (1.54MB tags + 768 bytes prefetch + 504KB pseudo-LRU); 6 cycles; 6MB, 12-way L3; 240MB, 64-way, pseudo-LRU DRAM cache.
TagTables: 2MB metadata in L3; 8 cycles (metadata in L3); 8MB, 16-way L3; 256MB, 64-way, chunk placement DRAM cache.
CLT: 2MB SRAM (1600KB tags + 504KB pseudo-LRU); 6 cycles; 6MB, 12-way L3; 240MB, 64-way, pseudo-LRU DRAM cache.

For Alloy, we follow the operations described in [9]. The MAP-I cache-miss predictor is implemented to predict cache misses for parallel accesses to the DRAM cache and memory. Each core has a 256-entry table of 3-bit hit/miss counters. The address of the instruction causing the L3 miss is hashed using folded-xor [53] to index the table of recorded counters. In Alloy, the extra tag fetched along with the data block from the stacked DRAM is charged one burst cycle.

For TagTables, the page-walk tables are dynamically created in main memory during the simulation. We do not charge any penalty for creating the page-walk tables. The tables are cached in L3 on demand. Extra latency occurs when the needed entry of a table is not located in L3; a fetch to main memory is issued to bring back the block with the needed information. We follow the same procedure as in [37] for managing the shared L3 cache that caches the page-walk tables. The page entries recorded at the leaf level are saved at the intermediate level whenever possible to shorten the page walk. We implement the same algorithm for allocating and combining blocks into chunks with the special cache placement and replacement mechanism.

TagTables allocates the 64 blocks of a page into the same set, which hurts the entropy of hashing blocks across the cache sets. In addition, the limit of four chunks per page may create holes (empty frames) in a cache set and underutilize the DRAM cache. Therefore, we also evaluate a TagTables scheme with 16 blocks per page. We reduce the page offset from 6 bits to 4 bits and shift the remaining higher-order bits to the right. As a result, it may encounter a level-4 table with a 2-bit index in the 48-bit physical address format. We keep 4 chunks per entry at the leaf level. The rest of the design and operations stay the same.

3.4.2 Performance Results

In this section, we first compare the speedup of five DRAM cache designs: Alloy, ATcache, TagTables_64, TagTables_16, and CLT. We show the average memory access times for the tag and the data, which contribute to the overall execution time. Multiple factors that impact the memory access time, such as the number of DRAM cache misses, the on-die tag hit/miss ratios for ATcache, TagTables, and CLT, and Alloy's miss-predictor accuracy, are also discussed.

In Figure 3-7, we plot the speedups of CLT with respect to Alloy, ATcache, TagTables_64, and TagTables_16. CLT demonstrates a significant performance advantage over the other four methods. On average, CLT improves 4.3%, 12.8%, 12.9%, and 14.9%, respectively, over Alloy, ATcache, TagTables_64, and TagTables_16. The improvement of CLT over Alloy is rather moderate. CLT performs worse than Alloy for omnetpp due to its sector-indexing. In comparison with ATcache, CLT gains 12-24% speedup for all workloads except mcf and omnetpp. This is because CLT can capture most DRAM cache accesses, and the sector-indexing does not hurt DRAM cache performance much, as shown in Figure 3-4. Both TagTables schemes perform especially poorly for mcf, lbm, and milc. For TagTables_64, the CLT improvements are 44.8%, 70.6%, and 36.4% for these three workloads, while TagTables_64 shows a slight edge over CLT for omnetpp, leslie3d, and zeusmp.

The diverse performance impacts on individual workloads are caused by multiple factors. An overall speedup analysis is further complicated by the exploitation of MLP (memory-level parallelism) in the Epoch model. During the timing simulation, a cadence of memory requests is issued in each Epoch. The latency is dominated by the DRAM cache misses in each Epoch; therefore, the DRAM cache hit latency plays a small role. On the other hand, the hit latency becomes the decisive factor in case there is no cache miss in an Epoch. This mix of performance factors exists even with a precise processor model. In the following, we analyze the important parameters without detailing the MLP factor.

The most decisive performance factor is the average memory access time of the L3 misses. In Figure 3-8, we plot the average access latencies separated by the tag and the data segments where the total access time is dominated by the data latency. In general, these average latencies are consistent with the speedups shown in Figure 3-7. CLT has the shortest average latency followed by Alloy, ATcache and both TagTables. As expected, Alloy has the shortest tag latency since it only pays one-cycle predictor delay. However, in case of a false-positive miss prediction, the tag latency includes a sequential DRAM cache access for fetching the tag.

ATcache has the longest tag latency for two reasons: (1) recording the block tags does not save space, which lowers the Timber hit ratio; and (2) sequential prefetching of set tags generates high traffic since each set of tags occupies 4 blocks. TagTables_16 has longer tag latency than TagTables_64 when accessing the tags through the page-walk tables. When the page size is reduced to 16 blocks, more active pages are requested, causing more L3 misses.

Figure 3-7. CLT speedup with respect to Alloy, ATcache, TagTables_64, and TagTables_16.

Figure 3-8. Memory access latency (CPU cycles), separated into data and tag latency, for CLT, Alloy, ATcache, TagTables_64, and TagTables_16 across the twelve benchmarks.

There are multiple factors contributing to the data latency. In Table 3-3, we analyze three performance parameters. The first and most important parameter is the DRAM cache performance. Based on the trace-driven Epoch model, we measure DRAM cache performance using misses per thousand requests (MPKR), where each request is an L2 miss. Similar to MPKI, MPKR is closely associated with the execution speedup estimation; a higher MPKR causes a longer average access time. In general, ATcache has the lowest MPKR due to its 64-way set-associativity. CLT is close to ATcache for all workloads except mcf and omnetpp. Its moderate sector size (i.e., 16 blocks per sector) does not degrade cache performance much. Alloy suffers a higher MPKR, hence longer data latency, due to its direct-mapped design. TagTables shows a much higher MPKR in comparison with CLT for mcf, gcc, lbm, and milc, hence lower speedups as shown in Figure 3-7. Although omnetpp and sphinx also have a large MPKR gap between CLT and TagTables, their much smaller MPKRs lessen the impact.

The high MPKR of TagTables is due to two reasons: the negative impact of the sector-indexing scheme, and the restricted chunk-based placement and replacement. As observed in the second parameter, the DRAM cache occupancy, the restriction of 4 chunks per page creates empty space in the cache sets. For example, the average occupancy for gcc is only 75% for TagTables_64. In other words, 25% of the cache space is wasted, causing more misses. By reducing the page size from 64 to 16 blocks, we can alleviate both the negative sector-indexing effect and the empty space in the cache data array. However, TagTables_16 encounters more L3 misses for accessing the page-walk tables. Note that Alloy, ATcache, and CLT have 100% cache occupancy.

Table 3-3. Comparison of L4 MPKR, L4 occupancy, and predictor accuracy.
Occupancy / DRAM cache MPKR / Predictor accuracy; per-scheme sub-columns: ATcache TT-64 TT-16 TT-64 TT-16 ATcache TT-64 TT-16 Alloy CLT Alloy CLT
mcf 198 71 228 113 109 91 95 83 64 88 73 8
gcc 484 410 520 441 413 75 84 81 63 95 84 8
lbm 307 275 504 382 278 66 81 99 69 98 93 9
soplex 693 655 668 668 664 98 99 98 61 97 91 9
milc 747 658 722 714 665 99 87 89 55 90 79 7
libquantum 515 456 470 448 465 100 90 99 62 98 94 9
omnetpp 161 71 181 123 100 73 82 79 67 83 69 7
sphinx3 85 7 95 7 7 66 99 96 70 97 92 9
bwaves 429 411 422 411 412 100 100 99 58 98 93 9
leslie3d 438 394 459 402 408 99 99 95 59 98 93 9
gems 447 427 484 442 429 90 92 99 65 98 93 9
zeusmp 564 545 565 563 559 99 100 99 66 97 88 9


The third performance factor is the predictor accuracy, which covers the accuracy of the Alloy miss predictor, the CLT hit ratio, the TagTables tag hit ratio in L3, and the cached-tag hit ratio for ATcache. In general, all schemes show high hit ratios except for ATcache. TagTables_16 has a lower tag hit ratio, which causes a higher average tag access latency.

In summary, mcf, lbm, and milc have the highest MPKRs for TagTables. Together with the wasted cache space and the L3 misses for tags, CLT outperforms TagTables by a large margin. For milc, the large MPKR gap between CLT and Alloy helps CLT outperform Alloy. Alloy outperforms CLT by 11.8% for omnetpp, which has a small MPKR, so the DRAM cache misses play an insignificant role; the difference in memory access latency is due to the high CLT miss rate and longer tag latency. ATcache suffers the most due to its low Timber hit ratios.

3.4.3 Sensitivity Study and Future Projection

In this section, we show the results of two sensitivity studies: CLT coverage and sector size. For CLT coverage, we change the total SRAM space from 2MB to 1MB (low coverage) and 3MB (high coverage) and adjust the L3 size to 7MB and 5MB accordingly. With 1MB of SRAM, the CLT size is reduced to (2K × 13) × 20 bytes = 520 KB with 2K sets, 13 ways, and 20 bytes per entry. The pseudo-LRU still costs 504 KB. On the other hand, with 3MB of SRAM, the CLT increases to (4K × 30) × 20 bytes = 2.4 MB.

In Figure 3-9, we show the change in the total execution cycles for the new coverages with respect to the original CLT, which utilizes 2MB of SRAM space. On average, the low coverage has about a 9% increase in total cycles, while the high coverage shows about a 1% increase. The low coverage option provokes more CLT misses and degrades the performance; the bigger L3 helps little in this case. Mcf, omnetpp, and sphinx3 have 20-26% increases in execution cycles with low CLT coverage, due mainly to the significant increase in CLT misses.


On the other hand, the high coverage relinquishes more L3 space for building a bigger CLT. Since the CLT hit ratio is very high for most workloads, further improvement is limited. The increase in L3 misses hurts the overall performance for most workloads. Omnetpp shows about a 3% cycle reduction due to the improvement of its low CLT hit ratio (79%).

We also study the cycle change for smaller (8 blocks) and bigger (32 blocks) sector sizes. As shown in Figure 3-10, the small and big sector sizes increase the total cycles by about 7% and 3%, respectively. In this study, we keep 2MB of SRAM space for the CLT. We need to adjust the number of sectors recorded in the CLT to utilize the available space, since the number of way pointers changes from 16 to 8 and 32 for the respective sector sizes. For 8-block sectors, although the smaller sector alleviates the sector-indexing effect, the CLT coverage is also reduced since each sector can only record 8 blocks. It also lowers the advantage of exploiting spatial locality, since each CLT miss can only record 8 adjacent blocks in the CLT. The impact of the bigger sector is the opposite. Although the CLT coverage is better with higher spatial locality exploitation, allocating 32 blocks into the same set hurts the cache performance. Among the workloads, mcf and gcc degrade the most with the small sector size, while mcf and omnetpp hurt the most with the large sector size.

Figure 3-9. IPC change for different CLT coverage (percentage of cycle change for lower and higher coverage).


Figure 3-10. Execution cycle change for different sector sizes (8 and 32 blocks) in the CLT design.

3.4.4 Summary

We present a new caching technique to cache a portion of the large tag array for an off-die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die; caching a portion of the tags can therefore reduce the need to go off-die twice for each DRAM cache access. In order to reduce the space requirement for the cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses a location (way) pointer to locate each block in the off-die cache data array. The large sector can also exploit spatial locality for better coverage. In comparison with the Alloy cache, ATcache, and the TagTables approaches, the average improvements are in the range of 4-15%.


CHAPTER 4 RUNAHEAD CACHE MISSES USING BLOOM FILTER

In this chapter, we present our work on using a Bloom Filter to filter out L3 cache misses and issue the requests off-die early. We first introduce related work on Bloom Filter applications as well as cache miss identification. We then present the timing analysis of using the Bloom Filter, followed by the proposed indexing scheme that solves the problem of dynamic updates of the L3 cache contents when using a Bloom Filter. Finally, we present results that demonstrate our idea.

4.1 Background and Related work

Membership queries using a Bloom Filter have been explored in many architecture, database, and network applications [54] [55] [56] [57] [58] [59] [14] [60]. In [60], a cache-miss BF based on partial or partitioned block addresses is proposed to filter cache misses early in the processor pipeline. The early cache-miss filtering helps schedule load-dependent instructions to avoid execution pipeline bubbles. To reduce cache coherence traffic, RegionScout [59] was used to dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes, hence the coherence request can be avoided. A vector Bloom Filter [14] was introduced to satisfy quick searches of large MSHRs in the critical execution path without the need for an expensive CAM implementation. A counting Bloom Filter called a summary cache to handle dynamic membership updates is presented in [56]. In this approach, each proxy keeps a summary of its Internet cache directory and uses the summary to check for a potential hit to avoid sending a useless query to other proxies. In [57], a counting Bloom Filter is used as a conflict detector in virtualized Transactional Memory to detect conflicts among all the active transactions. Multiprocessor deterministic replay was introduced in [58], in which a replayer creates an equivalent execution despite the inherent sources of nondeterminism that exist in modern multicore computer systems. They use a write and a read Bloom filter to track the current episode's write and read sets. A good early survey of network applications using Bloom Filters and the mathematical basis behind them is reported in [54]. A Bloom-filter-based semijoin algorithm for distributed database systems is presented in [55]. This algorithm reduces the communication cost of processing a distributed natural join as much as possible with a filter approach.

A closely related work by Loh and Hill [4] suggested that the block residency for the DRAM cache can be recorded in an on-die structure called MissMap. As described in Section 3.1, off-die trips to access the DRAM cache can be avoided if the block recorded in the MissMap indicates a miss. To save space, the MissMap records the block residency information for a large consecutive segment. However, when the segment is evicted from the MissMap directory, all blocks in the segment must be invalidated from the DRAM cache in order to maintain precise residency information. To avoid such invalidations, Sim et al. [19] suggested speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, significant complexity must be dealt with for handling mispredictions. Xun et al. [61] observed the need for counters to filter cache misses. To avoid the counters, they proposed delaying the update of the BF array for evicted blocks. Instead, they trigger a periodic recalibration of the BF array to reconcile its contents with the correct cache content. This delayed-recalibration method increases the chance of false positives and incurs time and power overheads for the recalibration.

4.2 Memory Hierarchy and Timing analysis

Considering a BFk for cache level k, we can adjust the Average Memory Access Time (AMAT) as follows:

AMAT = (1 − BFrate_Lk) × ( HitTime_L1 + Σ_{i=1}^{k−1} MissRate_Li × HitTime_L(i+1) )
     + BFrate_Lk × ( BFtime_Lk + HitTime_L(k+1) + Σ_{j=k+1}^{n} MissRate_Lj × HitTime_L(j+1) )

where BFrate_Lk is the ratio of cache misses filtered by the BFk at level k. When a cache miss is identified, the extra delays of the hit times through levels 1 to k are avoided; only the delay (BFtime) of accessing the BFk is added before accessing cache level k+1 and, on further misses, up to the DRAM memory. This formula also shows that using BFs at multiple levels overlaps the benefits of bypassing higher levels of caches. For example, if both BFi and BFi+1 are implemented and used when the memory address becomes available, BFi+1 can only save the hit time at cache level i+1. In the base memory hierarchy design shown in Figure 1-2, we focus on a new BFL3, since the sector-based L4 tags are on the processor die with a small latency. Furthermore, the large L4 size would require a large BFL4 to filter L4 misses while achieving a small false-positive rate.
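As a quick sanity check of this expression, the minimal Python sketch below evaluates the AMAT with and without an L3 BF for a 4-level hierarchy. The latencies, miss rates, and BF rate used here are illustrative placeholders of our own, not the measured values from the evaluation.

```python
# Minimal sketch of the AMAT expression above (illustrative numbers only).

def amat_no_bf(hit, miss):
    # HitTime_L1 + sum_i MissRate_Li * HitTime_L(i+1), down to DRAM.
    return hit[0] + sum(miss[i] * hit[i + 1] for i in range(len(miss)))

def amat_with_bf(hit, miss, k, bf_rate, bf_time):
    # Split between requests not filtered by the BF at level k and filtered ones.
    not_filtered = hit[0] + sum(miss[i] * hit[i + 1] for i in range(k - 1))
    filtered = bf_time + hit[k] + sum(miss[j] * hit[j + 1] for j in range(k, len(miss)))
    return (1 - bf_rate) * not_filtered + bf_rate * filtered

# Hypothetical hierarchy: L1, L2, L3, L4 (stacked DRAM), then main DRAM.
hit = [4, 12, 36, 60, 150]            # hit times in cycles (placeholders)
miss = [0.10, 0.04, 0.02, 0.01]       # per-level miss contributions (placeholders)
bf_rate = 0.95 * 0.02                 # ~95% of the 2% L3 misses filtered (placeholder)

print(f"AMAT without BF: {amat_no_bf(hit, miss):.2f} cycles")
print(f"AMAT with BFL3:  {amat_with_bf(hit, miss, k=3, bf_rate=bf_rate, bf_time=5):.2f} cycles")
```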

Timing Analysis:

Figure 4-1. Memory latency with and without BFL3. (Without the BF, a request traverses L1, L2, and L3 before the L4 tag lookup and the L4 data or regular DRAM access; with the BF, a filtered L3 miss proceeds to the L4 tag lookup as soon as the address is generated.)

Figure 4-1 illustrates the memory latency of runahead L3 misses. After the memory address is available, the BFL3 is checked. Once a miss is filtered, the on-die L4 tag array is looked up, followed by either an eDRAM or a regular DRAM access, depending on an L4 hit or miss, to fetch the requested data block. Regardless of the filtering result, a memory request always goes through the regular L1, L2, and L3 path. This is necessary for handling cache hits at these levels as well as for identifying any false-positive L3 misses from the BFL3. Even when the request is identified as an L3 miss, the request is still issued to the normal L1, L2, and L3 path; this avoids major changes to the microarchitecture of the cache hierarchy. If the filtered request goes through the normal cache levels and arrives at the memory controller, the early runahead miss will block this request at the controller while waiting for the block to come back from L4 or memory. On the other hand, if the request has not yet arrived at the controller when the block for the runahead miss comes back, the block is inserted into L3 and treated as a prefetch to the L3 cache. Eventually, the request through the normal path will be an L3 hit, shortening the latency.

Formally, a Bloom filter (BF) for representing a set of n elements (cache blocks) from a large universe (memory blocks) consists of an array of m bits, initially all set to 0. The filter uses k independent hash functions h1, ..., hk with range {1, ..., m}, where these hash functions map each element x (block address) in memory to a random number uniformly over the range. When a block enters the cache, the bits hi(x) are set to 1 for 1 ≤ i ≤ k. To check if a block y is in the cache, we check whether all hi(y) are set to 1. If not, then clearly y is not in the cache, hence a cache miss. In practice, however, a BF for cache misses faces two major difficulties. First, it is hard to implement multiple independent, randomized hash functions in hardware. Second, cache contents change dynamically with insertions and deletions; the BF array must be updated accordingly to reflect the content changes in order to maintain the correct BF function.

In Figure 4-2, we illustrate a solution for simplified BF hashing functions and for handling dynamic cache updates. Let us first describe the conventional cache indexing scheme.


In a cache access, the target set is determined by decoding the index, which is located in the low-order bits (a0) of the block address as shown in Figure 4-2. Instead of constructing uniformly distributed hashing functions, we can simply expand the cache indexing scheme to include a few more adjacent tag bits (a1) for indexing the BF array. As in a conventional cache access, a simple address decoder for the BF index determines the hashed BF location.

Based on the study in [54], the false-positive probability is minimized when k = ln2 × (m/n), giving a false-positive rate ≈ (0.6185)^(m/n). Hence, increasing m/n can reduce the false-positive rate. For a cache with 2^p-way set associativity, the total number of cache blocks is n = 2^p × 2^a0 = 2^(a0+p). Furthermore, the BF array size m must be bigger than n to reduce the false-positive rate. Assuming m/n = 2^q, where q is a small positive integer, we have m = 2^(a0+p+q). Therefore, the BF index is a1||a0, where a1 has p + q bits and must be a positive number.
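The formula above can be tabulated directly; the snippet below (ours) prints the theoretically optimal k and the corresponding false-positive rate for several m/n ratios. Note that the proposed design uses at most three simple indices rather than the optimal k, so the measured rates reported later differ from these ideal values.

```python
import math

# Ideal Bloom filter: optimal k = ln2 * (m/n), false-positive rate ~ 0.6185^(m/n).
for ratio in (2, 4, 8, 16):
    k_opt = math.log(2) * ratio
    fp_rate = 0.6185 ** ratio
    print(f"m/n = {ratio:2d}: optimal k ~ {k_opt:4.1f}, false-positive rate ~ {fp_rate:.3%}")
```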

There is a unique advantage to including the cache index (a0) as a part of the BF index. Due to collisions, multiple blocks can be hashed to the same location in the BF array. Since the cache index is a part of the BF index, all blocks hashed to the same BF array location must be located in the same cache set. By comparing the a1 bits of all cache tags in the set with the a1 bits of the replaced block, the BF array location is reset only if the replaced a1 is not found in any other block of the set.

Note that due to spatial locality in memory references, using the low-order block address bits may reduce collisions in the BF array and hence lower the false-positive rate. Moreover, we apply a simple cache index randomization technique by exclusive-ORing the a1 bits with the adjacent high-order a2 bits to further reduce collisions. Consider a BFL3 design for an 8MB L3 cache with 64-byte blocks and 16-way set associativity. The target set is determined by the low-order 13 bits (a0) of the block address, which hash to the 8K sets. The total number of blocks is n = 2^17. Assume that the BF array size m is 8 times larger than n. As a result, the additional index (a1) is 7 bits, and the BF index has 20 bits. For randomizing a1, the 7 higher-order bits (a2) are used. The total number of required address bits is 33, including the 6 offset bits. With a limited physical address, we can have several hashing combinations for the BFL3 using a0, a1, and a2. Note that in this work we set up 8GB for our simulated memory; with a bigger memory and more physical address bits, more hashing options could be explored.

(a) k=1: three BF indices: a1||a0, a2||a0, and (a1 XOR a2)||a0.

(b) k=2: three BF index groups: (a1||a0 and a2||a0), (a1||a0 and (a1 XOR a2)||a0), and (a2||a0 and (a1 XOR a2)||a0).

(c) k=3: one BF index group: (a1||a0, a2||a0, and (a1 XOR a2)||a0).

Figure 4-2. Cache indexing and hashing for the BF. (The block address is divided into tag, cache index (a0), and offset fields; BF index 1 = a1||a0 and BF index 2 = (a1 XOR a2)||a0, where a1 and a2 are adjacent tag bits.)
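To make the indexing scheme concrete, the following Python sketch models a BFL3 with the two indices a1||a0 and (a1 XOR a2)||a0 under the 8MB, 16-way L3 geometry and m:n = 8:1 ratio described above. The class and method names are ours; the cache itself is not modeled, only the BF array and its updates on block insertion and eviction.

```python
class L3MissFilter:
    """Sketch of the proposed BFL3: one m-bit array indexed by a1||a0 and (a1 XOR a2)||a0.

    block_addr below is the block address, i.e., the physical address with the
    6 offset bits already stripped. Geometry follows the example above:
    8MB, 16-way L3 -> 13 index bits (a0); m/n = 8:1 -> 7 extra bits for a1 and a2.
    """
    A0_BITS = 13
    EXTRA_BITS = 7

    def __init__(self):
        self.m = 1 << (self.A0_BITS + self.EXTRA_BITS)
        self.bits = bytearray(self.m)              # one flag per BF location

    def _indices(self, block_addr):
        mask_a0 = (1 << self.A0_BITS) - 1
        mask_x = (1 << self.EXTRA_BITS) - 1
        a0 = block_addr & mask_a0
        a1 = (block_addr >> self.A0_BITS) & mask_x
        a2 = (block_addr >> (self.A0_BITS + self.EXTRA_BITS)) & mask_x
        return ((a1 << self.A0_BITS) | a0,
                ((a1 ^ a2) << self.A0_BITS) | a0)

    def on_fill(self, block_addr):
        """A block is inserted into L3: set both hashed locations."""
        for idx in self._indices(block_addr):
            self.bits[idx] = 1

    def on_evict(self, victim_addr, remaining_blocks_in_set):
        """Reset a location only if no block still in the set maps to it.

        Because a0 is part of every BF index, only blocks of the same cache
        set can share a BF location, so the set's own tags are all we check.
        """
        live = set()
        for blk in remaining_blocks_in_set:
            live.update(self._indices(blk))
        for idx in self._indices(victim_addr):
            if idx not in live:
                self.bits[idx] = 0

    def is_definite_miss(self, block_addr):
        """A zero at either location guarantees the block is not in L3."""
        return any(self.bits[idx] == 0 for idx in self._indices(block_addr))
```

In a runahead check, a True return from is_definite_miss() allows the request to be sent toward the L4 tag lookup immediately, while a False return is treated as a possible hit and the request simply follows the normal L1, L2, and L3 path.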

Figure 4-3 shows the false-positive rates for the different hashing schemes. Note that we only show the results of 6 hashing schemes, since the false-positive rate using the single hashing index (a2||a0) is very high. We simulate 3 m:n ratios, 4:1, 8:1, and 16:1, using memory traces generated from SPEC2006 benchmarks. We first ran the workloads on a whole-system simulation infrastructure based on a cycle-accurate 8-core model along with a memory hierarchy model [29] to collect memory reference traces from misses to the L2 caches. Five billion instructions from 8 cores are collected for each workload. The simulation environment and parameters will be given in Section 3.3.


The false-positive rate is calculated as the number of requests that hit in the filter but actually miss the cache, divided by the total number of misses. Each false-positive point in the figure is the geometric mean of 12 SPEC2006 benchmarks. As can be observed, when k=1, randomization of a1 helps very little in improving the false-positive rate. Two hashing functions with indices a1||a0 and (a1 XOR a2)||a0, as illustrated in Figure 4-3, show the lowest false-positive rates of about 2.3%, 4.8%, and 16.5%, respectively, for m:n ratios of 16:1, 8:1, and 4:1. Three hashing functions cannot further improve the false-positive rate because of insufficient address bits, where the third hashing index is highly correlated with the first two.

Figure 4-3. False-positive rates for the 6 hashing mechanisms (a1; a1^a2; (a1, a2); (a2, a1^a2); (a1, a1^a2); (a1, a2, a1^a2)) with m/n = 4, 8, and 16.

Figure 4-4. False-positive rates of the individual benchmarks with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2.

In Figure 4-4, we show the false-positive rates for the individual SPEC2006 benchmarks. Based on the results in Figure 4-3, we pick two hashing schemes: a1||a0 for k=1, and a1||a0 with (a1 XOR a2)||a0 for k=2. The results show that the m:n ratio plays an important role, as bigger BF arrays reduce the false-positive rate significantly for all benchmarks. The false-positive rates are very high for the small BF array with a ratio of m:n = 2:1. The benefit of multiple hashing functions becomes more evident when m/n is 4 or greater. The false-positive rate behavior is very consistent across all benchmarks. For k=1, the average false-positive rates are 8.7% and 4.3% using BF arrays that have 8 and 16 times more bits than the total number of cache blocks. When k=2, the false-positive rates are reduced to 4.8% and 2.3%, respectively, using BF arrays with 8 and 16 times more entries than the number of cache blocks. These results are used to guide our IPC timing simulation.

4.3 Performance Results

The IPC improvement using a BF for runahead L3 misses is presented in this section. We also compare the improvement with a perfect BF without any false-positive misses. In addition, the sensitivity studies of BF design parameters, the size of the L4 caches, and the latency and bandwidth of the regular DRAM are also presented.

For an 8MB L3 cache with 64-byte blocks, the space overhead of the new BFL3 is 64KB, 128KB, and 256KB, respectively, for m:n = 4:1, 8:1, and 16:1. We use Cacti [21] to estimate the BF latency and get 2, 3, and 3 cycles for the three BF arrays. In addition, we add two more cycles for the wiring delay. For the delayed recalibrations, since we need to read out the last 14 bits (a1 and a2) and perform hierarchical OR operations, we measured using Cacti 6.5 [21] that it takes 3 cycles to recalibrate one set. Four sets can be recalibrated in parallel, and a total of 6K cycles are charged for each recalibration.
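The space and recalibration-overhead figures follow directly from the cache geometry; the short sketch below (ours) recomputes them.

```python
# Reproduce the BFL3 sizing and recalibration figures quoted above.
n_blocks = 8 * 2**20 // 64                 # 8MB L3, 64-byte blocks -> 2^17 blocks
for ratio in (4, 8, 16):
    print(f"m:n = {ratio:2d}:1 -> BF array = {n_blocks * ratio // 8 // 1024} KB")

l3_sets = 8 * 1024                          # 8K sets
recal_cycles = l3_sets // 4 * 3             # 4 sets in parallel, 3 cycles per set
print(f"delayed recalibration: about {recal_cycles} cycles per pass")
```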


4.3.1 IPC Comparison

Figure 4-5 displays the IPCs of the twelve benchmarks with six caching mechanisms: a regular 4-level cache without a BF; a BFL3 to filter and run ahead of L3 misses; three delayed-recalibration designs, d1-BFL3, d2-BFL3, and d3-BFL3, with recalibration periods of 0.5M, 1M, and 2M memory references; and a perfect BFL3 that does not incur false positives. Note that we use the two hashing functions a1||a0 and (a1 XOR a2)||a0 and m:n = 8:1 for the BFL3. The results show that the average IPC improvement is about 10.5% using the BFL3. This improvement is only 1.3% less than using a perfect BFL3, which averages 11.8%. In comparison with the three delayed-BF designs, the improvements are 4.3%, 4.8%, and 3.5%, respectively. A shorter recalibration period has fewer false positives but pays more overhead in recalibrations. In general, for the design using a BFL3, all benchmarks show good IPC improvement. Mcf and sphinx benefit the most, with close to 20% improvement. Other workloads show at least 6% improvement compared against the design without a BF, except bwaves, which has about 4% improvement. The BFL3 design also shows 1.3-8.9% improvement against the delayed-BF designs.

The major impact on the IPC comes from the average memory access latency. In Table 4-1, we list the average memory latency of L3 misses with and without runahead using a BF. The L3 miss latency is measured from the generation of the memory address until the return of the requested data. Note that the measurement does not include the hit latencies of the L1, L2, and L3 caches, since these latencies are basically the same with or without the BF.

We can observe a significant saving in the L3 miss latency when using the BFL3 to run ahead of the misses for all benchmarks. On average, the L3 miss latency is reduced from 154 cycles to 120 cycles, which closely matches the savings of the L1, L2, and L3 hit times minus a 5-cycle penalty for accessing the BF array. In general, the latency results are consistent with the IPC results. The BFs with delayed recalibration (not shown) also have longer latency due to more false positives; in addition, they are charged the recalibration overheads.

Figure 4-5. IPC comparisons with/without BF (without BF, BF, d1-BF, d2-BF, d3-BF, and Perfect-BF).

Table 4-1. L3 miss latency (cycles) and false-positive rates of the 12 benchmarks.
              Latency (cycles)         False-positive rate (%)
              BFL3      w/o BF         BFL3   d1-BFL3   d2-BFL3   d3-BFL3
mcf           105.9     138.2          7      9         12        14
soplex        106.2     141.1          5      10        12        13
lbm           153.3     186            5      8         9         11
leslie3d      158.8     190.4          6      7         9         11
gems          216.9     248.8          4      6         8         9
libquantum    137.6     162            9      12        13        17
milc          142.3     173.9          4      6         9         10
bwaves        128.9     161.5          3      6         8         9
sphinx        69.8      106.4          4      7         11        12
bt            134.6     168.3          4      5         7         9
omnetpp       77.3      112.2          6      10        13        16
gcc           80.6      116.3          3      6         8         9

In Table 4-1, we also show the false-positive rates of the twelve benchmarks measured in the timing simulation. The results range from 3% to 9%, which is consistent with the rates based on simulating long memory traces (Figure 4-4). The small false-positive rate ensures a small impact on the IPC improvement. As shown in Figure 4-5, the IPC improvement of a perfect BF only surpasses the average IPC improvement of a realistic BF by raising it from 10.5% to 11.8%.

We also provide a rough estimate of the power consumption. We measured using Cacti [52] that each BF access takes around 0.013 nJ, which is close to a single L1 cache's dynamic access energy of 0.014 nJ. Since the only power difference is the Bloom Filter access energy, and the number of Bloom Filter accesses is the same as the L1 cache's total number of accesses, the extra power consumption is basically the same as the L1 cache's total dynamic power consumption. On the other hand, using a Bloom Filter speeds up execution by 10.5%, which can be translated into a 10.5% static energy saving. To save even more energy, the Bloom Filter could be accessed only after an L1 cache miss.

4.3.2 Sensitivity Study

The sensitivity study of the IPC impact with respect to the m:n ratio and the number of hashing functions k is shown in Figure 4-6, in which each IPC point is the geometric mean of the 12 benchmarks. Again, the results show that bigger BF arrays reduce the false-positive rate and improve the IPC. The improvement rate is much larger for k=3 than for k=2 and k=1. This is due to the fact that without sufficient entries in the BF array, more hashing functions actually increase the chance of collisions. On the other hand, with bigger BF arrays, more hashing functions can spread each block more randomly and reduce the chance of collision. When the m:n ratio is 2:1, there is insufficient room in the BF array even for k=2, resulting in a slightly lower IPC than that of k=1; k=3 is obviously much worse. However, the IPC of k=3 nearly catches up with the IPC of k=2 when m:n=8:1. When m:n=16:1, the IPCs for the different hashing functions are very close. Given a sufficient BF array size, the false-positive rates are small regardless of the number of hashing functions. Nevertheless, for the large m:n=16:1, we would expect k=3 to have a better IPC than k=2. However, due to the limited address bits, the third BF index is highly correlated with the first two BF indices, resulting in limited improvement in the false-positive rate.

In Figure 4-7, we show the IPC results for four L4 sizes ranging from 64MB to 512MB. In these simulations, we maintain m:n=8:1 and k=2. Regardless of the L4 size, the BFL3 always improves the IPC significantly. As expected, however, a bigger L4 reduces the L4 MPKI and improves the IPC more when using the BFL3. For the four L4 sizes, the IPC improvements are 9.0%, 10.5%, 11.5%, and 12.0%, respectively.

Figure 4-6. Average IPC for different m:n ratios and numbers of hashing functions (k = 1, 2, 3; m/n = 2, 4, 8, 16).

Figure 4-7. Average IPC for different L4 sizes (64MB, 128MB, 256MB, and 512MB), with and without the BF.


Figure 4-8. Average IPC over different DRAM latencies (original, fast, and slow) with 4 and 2 channels, with and without the BF.

Table 4-2. Future conventional DRAM parameters.
Faster DRAM latency:  tCAS-tRCD-tRP: 6-6-6;    tRAS-tRC: 33-30
Slower DRAM latency:  tCAS-tRCD-tRP: 11-15-15; tRAS-tRC: 38-50

Next, the impact of the DRAM latency is simulated. In comparison with the original DRAM latency in Table 2-1, we simulate a fast and a slow DRAM latency as shown in Table 4-2. We also test two DRAM bandwidth configurations, one with 2 channels and the other with 4 channels. The L3 and L4 sizes are 8MB and 128MB, and the BFL3 remains at m:n=8:1, k=2. The results are shown in Figure 4-8. It is interesting to see that higher DRAM bandwidth and faster DRAM latency help the IPC with runahead L3 misses more than without runahead misses. For the fast latency with 4 DRAM channels, the average IPC improvement reaches 12%. On the other hand, for the slow latency with 2 DRAM channels, the average IPC improvement is about 7%.

4.4 Summary

A new Bloom Filter is introduced to filter L3 cache misses, bypassing the L1, L2, and L3 caches to shorten the L3 miss penalty in a 4-level cache hierarchy system. The proposed Bloom Filter applies a simple indexing scheme by decoding the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, a part of the index bits is XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Results show that the proposed simple hashing scheme lowers the average false-positive rate below 5% for filtering L3 misses and improves the average IPC by 10.5% by running ahead of these misses.

Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using the Bloom Filter for identifying L3 cache misses. Due to dynamic updates of the cache content, a counting Bloom Filter is normally necessary to update the BF array to reflect the changes of the cache content. A unique advantage of the proposed BF index is that it contains the cache index, i.e., the BF index bits are a superset of the cache index bits. As a result, the blocks that are hashed to the same BF array location are allocated in the same cache set. By searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly. This restricted hashing scheme demonstrates a low false-positive rate and simplifies the BF array updates without using expensive counters.


CHAPTER 5 GUIDED MULTIPLE HASHING

In this chapter, we present our guided multiple hashing work. We begin by introducing the problems of single hashing and multiple hashing. We then use a simple example to illustrate the proposed idea. A detailed algorithm that aims to maximize the number of empty buckets while balancing the keys in the non-empty buckets is given. Finally, we present results that show our improvement over other traditional hashing methods.

5.1 Background

Hash-based lookup has been an important research direction for routing and packet forwarding, which are among the core functions of the IP network-layer protocols. While there are alternative approaches for routing table lookup, such as trie-based solutions, we only focus on hash-based solutions, which have the advantages of simplicity and O(1) average lookup time, whereas trie-based lookup tends to make many more memory accesses.

Single hashing suffers from the collision problem, where multiple keys are hashed to the same bucket and cause an uneven distribution of keys among the buckets. This results in variable delays when looking up keys located in different buckets. For hash-based network routing tables [62] [63] [64] [65], it is critical to perform fast lookups of the next-hop routing information. In today's backbone routers, routing tables are often too big to fit into the on-chip memory of a network processor. As a result, off-chip routing table access becomes the bottleneck for meeting the increasing throughput requirement of the high-speed Internet [66] [67]. Unbalanced hash buckets further worsen the off-chip access. Today's memory technology is more efficient at fetching a contiguous block (such as a cache block) at once than fetching individual data elements separately from off-chip memory. A heavily loaded hash bucket may require two or more memory accesses to fetch all its keys. However, in order to accommodate the most-loaded bucket for a constant lookup delay, fetching a large memory block that can hold the highest number of keys in a bucket increases the critical memory bandwidth requirement, wastes memory space, and lowers the network throughput [63] [65] [68] [69].

Methods have been proposed to handle the hash collision problem by balancing the bucket load, i.e., reducing the maximum number of keys in a bucket among all buckets. One approach is to use multiple hashing, such as d-random [70], which hashes each key to d buckets using d independent hash functions and stores the key in the least-loaded bucket. The 2-left scheme [62] [68] is a special case of d-random where the buckets are partitioned into left and right regions. When inserting a key, a random hash function is applied in each region and the key is allocated to the least-loaded bucket (to the left in case of a tie). The multiple-hashing approach balances the buckets and reduces the fetched bucket size for each key lookup. However, without knowledge of which bucket a key is located in, d-random (d-left) requires probing all d buckets. As the bottleneck lies in the off-chip memory access, accessing multiple buckets slows down the hash table access and degrades the network performance [65] [67].

To remedy probing d buckets, the extended Bloom Filter [64] uses counters and extra pointers to link keys in multiple hashed buckets to avoid lookups of multiple buckets. However, it requires key replication and must handle complex key updates. The recently proposed Deterministic Hashing [65] applies multiple hash functions to an on-chip intermediate index table where the hashed bucket addresses are saved. By properly setting up the bucket addresses in the index table, the hashed buckets can be balanced. This approach incurs space overhead and delays due to the indirect access through the index table. In [69], an improved approach uses an intermediate table to record the hash function IDs instead of the bucket addresses to alleviate the space overhead. In addition, it uses a single hash function for the index table to ease the update complexity. However, with a limited index table and hashing functions, the achievable balance is also limited. In another effort to avoid collisions, the perfect hash function sets a more rigid goal of achieving a one-to-one mapping between keys and buckets. It accomplishes the goal using complex hash functions encoded on-chip with significant space and additional delays [71] [20]. It also requires changes to the encoded hash function upon a hash table update.

5.2 Hashing

We first describe the challenges of a hashing-based information table using a single hash function. We also bring up the motivation for and applications of using a multiple-hashing approach for organizing and accessing a hash table.

Figure 5-1. Distribution of keys in buckets for the four hashing algorithms.

To demonstrate the power of multiple hashing in accomplishing different objectives for the hash table, we compare the simulation results of four hashing schemes: single hashing (single-hash), 2-hash with load balancing (2-left), 4-hash with load balancing (4-left), and 2-hash with maximum zero buckets (2-max-0). We simulate 200,000 randomly generated keys to be hashed to 100,000 buckets. The distribution of keys in buckets is plotted in Figure 5-1. We can observe substantial differences in the key distribution among the four hashing schemes. The maximum number of keys in a bucket reaches ten for single-hash and 2-max-0. Meanwhile, 2-max-0 produces 2.5 times as many empty buckets as single-hash does. 2-left and 4-left are more balanced, with four and three as the maximum numbers of keys in a bucket, respectively. It is easy to see that increasing the number of hash functions from two to four helps improve the balance.
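The contrast above is easy to reproduce qualitatively; the sketch below compares single hashing with 2-left on the same 200,000-key, 100,000-bucket configuration using generic random hashes (it does not reproduce the specific hash functions or the 2-max-0 heuristic used in our experiments).

```python
import random
from collections import Counter

def single_hash(num_keys, num_buckets):
    loads = [0] * num_buckets
    for _ in range(num_keys):
        loads[random.randrange(num_buckets)] += 1
    return loads

def d_left(num_keys, num_buckets, d=2):
    # Buckets split into d regions; each key goes to the least-loaded of its
    # d candidates, with the leftmost region winning ties.
    region = num_buckets // d
    loads = [0] * num_buckets
    for _ in range(num_keys):
        candidates = [i * region + random.randrange(region) for i in range(d)]
        loads[min(candidates, key=lambda b: loads[b])] += 1
    return loads

for name, loads in (("single-hash", single_hash(200_000, 100_000)),
                    ("2-left", d_left(200_000, 100_000))):
    hist = Counter(loads)
    print(f"{name}: max load = {max(loads)}, empty buckets = {hist[0]}")
```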

5.3 Proposed Algorithm

In this section, we describe the detailed algorithms of the guided multiple-hashing scheme, which consists of a setup algorithm, a lookup algorithm, and an update algorithm. Assume we have m buckets B_1, ..., B_m and d independent hash functions H_1, ..., H_d. Each key x is hashed and placed into all d buckets B_Hi(x), 1 ≤ i ≤ d. The set of keys in bucket B_i is denoted by B[i], and the number of keys in bucket B_i is v(B[i]), 1 ≤ i ≤ m. The bucket load Ω_a is defined as the maximum number of keys in any bucket. We define the memory usage ratio as θ = (Ω_a × m)/n to indicate the memory requirement of the hash table. Other terminologies are self-explanatory and are listed in Table 5-1. For better illustration of d-ghash, we use a simple hash table with 5 keys and 8 buckets. All keys are hashed to the buckets using two hash functions, where buckets B_0, ..., B_7 have 1, 0, 1, 2, 3, 0, 1, 2 keys, as indicated by the arrows in Figure 5-2.

5.3.1 The Setup Algorithm

Since the objective is to minimize the bucket load while approaching a single bucket access per lookup, the setup algorithm needs to satisfy two criteria: (1) achieving near-perfect balance, and (2) maximizing the number of c-empty buckets. Recall that a c-empty bucket serves as a multiple-hashing target of one or more keys, but the key(s) are placed into other alternative buckets, which makes access to the c-empty bucket unnecessary.

Table 5-1. Notation and definitions.
Symbol     Meaning
n          Total number of keys
m          Total number of buckets
B[i]       Set of keys in the i-th bucket
v(B[i])    Number of keys in the i-th bucket
s          Indices of the buckets in B sorted in ascending order of v(B[i])
H_i        i-th hash function
Ω_p        Optimal bucket load, ⌈n/m⌉
Ω_a        Achievable bucket load
n_u        Total number of keys in under-loaded buckets (bucket load less than Ω_a)
b_u        Number of under-loaded buckets
θ          Memory usage ratio

Figure 5-2. A simple d-ghash table with 5 keys, 8 buckets, and 2 hash functions. (The shaded bucket is a c-empty bucket. The final key assignment is as illustrated.)

Given n keys and m buckets, a perfect-balance hashing scheme achieves the optimal bucket load Ω_p = ⌈n/m⌉. However, perfect balance may not be achieved under our or other multi-hashing schemes, because even with multiple hashing, some buckets may still probabilistically be under-loaded, i.e., zero or fewer than Ω_p keys are hashed to the bucket, and this translates to some other buckets being squeezed with more keys. Increasing the number of hash functions reduces the under-loaded buckets and helps approach perfect balance. Our simulation shows that with 4 hash functions, the achievable balance is the same as or very close to Ω_p.

The first step in the setup algorithm is to estimate Ω_a, the achievable balance. The idea is to count the number of under-loaded buckets and the number of keys inside them. If the remaining buckets cannot hold the rest of the keys with Ω_a keys in each bucket, we increase Ω_a by one. We then use Ω_a as the benchmark bucket load for key assignment. We sort all buckets in B, resulting in a sorted index array s such that v(B[s(i)]) ≤ v(B[s(i+1)]), 1 ≤ i ≤ m − 1. In the simple example of Figure 5-2, Ω_a = Ω_p = ⌈n/m⌉ = 1.

The next step is key assignment, which consists of two procedures: creating c-empty buckets, and balanced key assignment. For creating c-empty buckets, the procedure removes duplicate keys starting from the most-loaded buckets to maximize their service as companion buckets and reduce the bucket accesses. A key can be safely removed from a bucket if it exists in other bucket(s). The procedure goes through all buckets whose initial load is greater than Ω_a and tries to remove keys from them. In the illustrated example in Figure 5-2, all 3 keys in B_4 are successfully removed and B_4 becomes empty. Next, we check B_3 and B_7, each of which has 2 keys. Note that both K_2 and K_4 in B_3 can be removed if B_3 is emptied first. As a result, K_2 and K_3 cannot be removed from B_7, which exceeds Ω_a. All buckets with a bucket load exceeding Ω_a become targets for reallocation, as described next.

After emptying the buckets, the key assignment procedure assigns each key to a bucket, starting from the least-loaded bucket. Once a key is assigned, its duplicates are removed from the remaining buckets. During the assignment, buckets with more than Ω_a keys are skipped in order to maintain the achievable balance. A re-assignment of the buckets with load greater than Ω_a is necessary after all the buckets are assigned. During the re-assignment, the keys in the overflow buckets are relocated to other buckets when possible. In our experiment, we use Cuckoo Hashing [72] to relocate keys from an overflow bucket to an alternative bucket using the multiple hashing functions. If all alternative buckets are full, an attempt is made to make room in the alternative buckets. For simplicity, however, such attempts stop after r tries, where r can be any heuristic number. A larger r brings better balance at the expense of longer setup time. In the illustrated example, K_2 in B_7 is relocated to B_3 to reduce the bucket load of B_7; hence, the optimal load is achieved.

In case perfect balance is not achievable, Ω_a is incremented by one and the key assignment procedure repeats. It is important to note that the priority of the key assignment is to achieve perfect balance. Therefore, keys that were previously removed from an emptied bucket can be reassigned back in order to accomplish the perfect balance, such that the number of keys is less than or equal to Ω_a in all buckets. It is also important to know that in order to reduce the bucket load, we can decrease the ratio n/m, i.e., increase the number of buckets for a fixed number of keys. However, increasing the number of buckets inflates the memory space requirement, as the memory usage ratio is θ = (Ω_a × m)/n for a constant bucket size that allows efficient fetching of a bucket from off-chip memory.
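The following Python sketch captures the spirit of the setup procedure in a much-simplified form: it caps every bucket at the target load and prefers already-occupied candidate buckets so that the other candidates can remain (c-)empty. The full c-empty creation pass, the sorted assignment order, and the Cuckoo-style re-assignment of overflow buckets are omitted, and all names are ours.

```python
import math

def ghash_setup_sketch(keys, m, hash_funcs):
    """Simplified d-ghash setup: a single greedy pass capped at the optimal load."""
    omega = math.ceil(len(keys) / m)               # target (optimal) bucket load
    buckets = [[] for _ in range(m)]
    overflow = []                                   # keys this greedy pass could not place

    for x in keys:
        cands = [h(x) % m for h in hash_funcs]
        with_room = [b for b in cands if len(buckets[b]) < omega]
        nonempty = [b for b in with_room if buckets[b]]
        if nonempty:
            # Prefer a non-empty candidate (keeps the others c-empty), least loaded first.
            buckets[min(nonempty, key=lambda b: len(buckets[b]))].append(x)
        elif with_room:
            buckets[with_room[0]].append(x)
        else:
            overflow.append(x)                      # the real setup would relocate via Cuckoo hashing

    empty_array = [1 if b else 0 for b in buckets]  # '0' marks an empty bucket
    return buckets, empty_array, overflow
```

The keys left in overflow correspond to the cases that the full algorithm resolves by relocation or by incrementing Ω_a.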

5.3.2 The Lookup Algorithm

In order to speed up the lookup of keys, we introduce a data structure called the empty array, which is a bit array of size m indicating whether each bucket is empty or not. If a bit in the empty array is '0', the corresponding bucket is empty; otherwise, it is not empty. Upon looking up a key x, the bits at indices H_1(x), ..., H_d(x) in the empty array are checked. If only one of the hashed buckets is non-empty, we simply fetch that bucket and thus complete the lookup. If there are two or more non-empty buckets, we access them one by one until we find the key. In the worst case, all d bits are ones and d buckets are examined before we find the key. As discussed above, creating c-empty buckets helps reduce the bucket accesses per lookup, thus alleviating the lookup cost.

To further enhance our algorithm, we introduce another data structure, the target array, to record the hash function ID when a key is hashed to two or more non-empty buckets. To distinguish it from the algorithm described above, we call it the enhanced d-ghash algorithm; the algorithm using only the empty array is called the base d-ghash algorithm. The recorded ID indicates the bucket where the key is most likely located. The empty array has m bits, while the size of the target array varies depending on the number of keys. Suppose m = 200K and we use a 200K-entry target array; then the empty array takes 25KB, and the target array takes 25KB for enhanced 2-ghash and 50KB for enhanced 4-ghash. These two small arrays can be placed on chip for fast access. Multiple keys may collide in the target array. When a collision occurs, the priority of recording the target hashing function is given to the key that hashes to more non-empty buckets. Given a fixed number of keys, we can adjust the number of buckets (m) and hash functions (d) to achieve a specific goal for the bucket size and the number of buckets to be fetched when looking up a key. More discussion is given in Section 5.4.
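A lookup then consults only the on-chip arrays before fetching buckets from off-chip memory. The sketch below follows the procedure described above, with the enhanced variant's target array as an optional argument; t_hash is the single hash function assumed to index the target array (a name of ours).

```python
def ghash_lookup_sketch(x, buckets, empty_array, hash_funcs, target_array=None, t_hash=None):
    """Return (bucket_index or None, number_of_buckets_fetched)."""
    m = len(buckets)
    cands = [h(x) % m for h in hash_funcs]
    probe_order = [b for b in cands if empty_array[b]]     # skip (c-)empty buckets

    # Enhanced d-ghash: when several candidates are non-empty, probe first the
    # bucket whose hash-function ID is recorded in the on-chip target array.
    if target_array is not None and len(probe_order) > 1:
        hinted = cands[target_array[t_hash(x) % len(target_array)]]
        if hinted in probe_order:
            probe_order.remove(hinted)
            probe_order.insert(0, hinted)

    for fetched, b in enumerate(probe_order, start=1):
        if x in buckets[b]:
            return b, fetched
    return None, len(probe_order)
```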

5.3.3 The Update Algorithm

There are three common types of hash table updates: insertion, deletion, and modification. It is straightforward to delete or to modify a key in the hash table. For deletion, the key is probed first by fetching the bucket from off-chip memory. The key is then removed from the bucket before the bucket is written back to memory. If the key is the last one in the bucket, the corresponding bit in the empty array is set to zero. For modification of the associated record of a key, the key and its associated record are fetched, and the new record replaces the old one before the bucket is written back to memory. These two types of updates do not involve modifying the target array.

Key insertion is slightly more complicated. All hashed buckets are probed, and the key is inserted into the least-loaded non-empty bucket whose number of keys is less than Ω_a. If all non-empty buckets are full, the key is inserted into an empty bucket to which it also hashes, and the empty array is updated accordingly. In case all hashed buckets are full, Cuckoo Hashing is applied to make room for the new key, i.e., "rehashing" a key in one of the hashed buckets to another alternative bucket. During key relocations, both the empty and the target arrays are updated accordingly. There are two options in case a key cannot be inserted without breaking the property v(B[i]) ≤ Ω_a, i.e., when all its hashed/rehashed bucket loads are greater than or equal to Ω_a: first, set Ω_a = Ω_a + 1 and insert the key normally; second, initiate an off-line process to re-setup the table. Normally, the probability that a key cannot be inserted is small, and we should use the second option to prevent the bucket size from growing fast. However, if this situation happens very frequently, it implies that most of the buckets are "full", i.e., the average number of keys in the buckets is approaching Ω_a. In this case we should use the first option. By increasing the maximum load by one, all buckets gain one extra space to store another key.
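A sketch of the insertion path just described, leaving out the Cuckoo relocation and the two fallback options (bumping Ω_a or re-running the setup); the names follow the sketches above:

```python
def ghash_insert_sketch(x, buckets, empty_array, hash_funcs, omega):
    """Insert x into the least-loaded non-empty candidate with room, else an empty candidate."""
    m = len(buckets)
    cands = [h(x) % m for h in hash_funcs]
    nonempty = [b for b in cands if empty_array[b] and len(buckets[b]) < omega]
    if nonempty:
        target = min(nonempty, key=lambda b: len(buckets[b]))
    else:
        empties = [b for b in cands if not empty_array[b]]
        if not empties:
            return False            # full algorithm: Cuckoo relocation, then option 1 or 2
        target = empties[0]
        empty_array[target] = 1     # the bucket is no longer empty
    buckets[target].append(x)
    return True
```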

5.4 Performance Results

The performance evaluation is based on simulations of seven hashing schemes: single-hash, 2-left, 4-left, base 2-ghash, enhanced 2-ghash, base 4-ghash, and enhanced 4-ghash. Note that we do not include d-random in the evaluation, because it is outperformed by d-left both in terms of the bucket load and the number of bucket accesses per lookup. We simulate 200,000 randomly generated keys to be hashed into 100,000 to 500,000 buckets. To test the new multiple hashing idea, we adopt the random hash function in [25], which uses a few shift, or, and addition operations on the key to produce multiple hashing results. For relocation, we try to relocate keys in no more than ten buckets to other alternative buckets in the Setup Algorithm and no more than two in the Update Algorithm. We first compare the bucket load and the average number of bucket accesses per lookup by varying n/m. Then we normalize the number of keys per lookup based on the memory usage ratios to understand the memory overhead of the different hashing schemes. In addition, we demonstrate the effectiveness of creating c-empty buckets to reduce the bucket accesses. We also give a sensitivity study on the number of bucket accesses per lookup with respect to the size of the target array. Lastly, we evaluate the robustness of the d-ghash scheme by using two simple probabilistic models.
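For completeness, the sketches above take a list of hash functions; the snippet below shows one generic way to derive such a family from different seeds using only shifts, XORs, and additions. It is an illustration of ours, not the exact hash function of [25].

```python
def make_hash(seed):
    """Illustrative shift/xor/add mixer (not the exact function from [25])."""
    def h(key):
        v = (key + seed) & 0xFFFFFFFF
        v ^= (v << 13) & 0xFFFFFFFF
        v ^= v >> 7
        v = (v + (v << 3)) & 0xFFFFFFFF
        v ^= v >> 17
        return v
    return h

# Four hash functions from four arbitrary seeds.
hash_funcs = [make_hash(s) for s in (0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F)]
```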

Figure 5-3. Bucket loads for the five hashing schemes. The enhanced d-ghash and base d-ghash schemes have the same bucket load.

Figure 5-3 displays the bucket loads of the hashing schemes. Note that enhanced d-ghash and base d-ghash have the same bucket load; the only difference between the two is that enhanced d-ghash uses a target array to reduce the number of bucket accesses per lookup. The results show that d-ghash has the lowest bucket load and hence achieves the best balance among the buckets.

This is followed by d-left. More hash functions improve the balance for both d-ghash and d-left.

With 275,000 buckets, 4-ghash accomplishes perfect balance with a bucket load of a single key; no other simulated scheme achieves such balance even with up to 500,000 buckets. 2-ghash performs slightly better than 4-left, as the former needs 150,000 buckets to reduce the bucket load to two keys while the latter requires 175,000 buckets. This result demonstrates the power of d-ghash in balancing the keys compared with d-left. The single-hash scheme is the worst: its bucket load is six even with 500,000 buckets. Note that the bucket load is an integer, but we slightly offset the integer values in the figure so that the curves of the different schemes can be distinguished easily.

Figure 5-4. Number of bucket accesses per lookup for d-ghash.

In Figure 5-4, we evaluate the lookup efficiency of the seven hashing schemes. Single-hash only accesses one bucket per lookup. The d-left scheme looks up a key starting from the left-most bucket; if the key is not found, the next bucket to the right is accessed until the key is located. Since a key is always placed in the left-most bucket to break a tie, the number of bucket accesses per lookup is quite low: 1.68 ∼ 2.36 for 4-left and 1.27 ∼ 1.44 for 2-left. The base 4-ghash and base 2-ghash reduce the number of bucket accesses per lookup to 1.25 ∼ 2.18 and 1.11 ∼ 1.44 respectively, a 5–34% and 0–14% reduction. With a target array of 1.5n entries, the enhanced 4-ghash and enhanced 2-ghash further reduce the number of bucket accesses per lookup to as low as 1.03 ∼ 1.23 and 1.01 ∼ 1.11 respectively, a 38–51% and 21–24% reduction.

It is interesting to see that the number of bucket accesses per lookup for d-ghash does not decrease monotonically as the number of buckets increases. We can observe a sudden jump at m = 125,000 and at m = 275,000 for 4-ghash. This is because the optimal bucket load drops from three to two at m = 125,000 and from two to one at m = 275,000. When the average number of keys per bucket is very close to the optimal bucket load, it is hard to create c-empty buckets, so there are sudden decreases in the number of c-empty buckets at those two points. As a result, 4-ghash experiences more bucket accesses per key lookup. The same reasoning applies to 2-ghash.

Figure 5-5. Average number of keys per lookup based on memory usage ratio.


In order to reduce the bucket load for a fixed number of keys, we can increase the number of buckets. However, increasing the number of buckets inflates the memory space requirement. In Figure 5-5, we plot the average number of keys per lookup against the memory usage ratio, where the average number of keys is the product of the bucket load and the average number of buckets accessed per lookup. The results clearly show the advantage of the d-ghash scheme. Enhanced 4-ghash achieves a single key per bucket with 275,000 buckets, only 37% more than the number of keys. With slightly more than one key per lookup, enhanced 4-ghash requires the least amount of memory to achieve close to one key access per lookup.
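Written out, the metric plotted in Figure 5-5 can be expressed as follows, where Ω푎 is the bucket load and B̄ is the average number of buckets fetched per lookup; treating the memory usage ratio as the total allocated bucket slots relative to the number of keys is an assumption made here only to make the normalization explicit.

\[
\text{keys per lookup} \;=\; \Omega_a \times \bar{B}, \qquad
\text{memory usage ratio} \;\approx\; \frac{m \cdot \Omega_a}{n}.
\]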

Besides the perfect balance, d-ghash creates c-empty buckets to maximize the number of keys hashing to empty buckets. Figure 5-6 shows the effectiveness of the c-empty buckets in reducing bucket accesses. In this figure, the y-axis indicates the average number of non-empty buckets that each key is hashed into. In comparison with d-left, d-ghash reduces the non-empty buckets more significantly, resulting in a smaller number of bucket accesses. For d-left, the number of non-empty buckets decreases as the number of buckets increases, because d-left assigns each key to the least-loaded bucket. D-ghash, on the other hand, creates c-empty buckets by moving keys away from those buckets with more keys hashed into them. As a result, there are fewer non-empty buckets to examine when looking up each key. It is interesting to observe that the ability to create c-empty buckets depends heavily on the optimal bucket load and the ratio of keys to buckets. For example, when the number of buckets is 250,000, the optimal bucket load is 2 for 200,000 keys, which leaves plenty of room to create many c-empty buckets. However, when the number of buckets increases to 275,000, the optimal bucket load drops to 1, which leaves little room for c-empty buckets. Hence, the average number of non-empty buckets that each key hashes into increases.


Moreover, we show a sensitivity study of bucket accesses per lookup with respect to the size of the target array. We vary the size of the target array from n to 2n entries using enhanced 4-ghash; the result is shown in Figure 5-7. As expected, a larger target array reduces collisions, resulting in a smaller number of bucket accesses. We pick 1.5n entries as the target array size in the earlier simulations, which offers the best tradeoff between space overhead and bucket accesses per lookup.

Figure 5-6. The average number of non-empty buckets for looking up a key. This parameter is the same for the enhanced and base d-ghash schemes.

Next, we evaluate the robustness of our scheme. We first set up a table using 200,001 keys, 200,000 buckets, and 300,000 target array entries. The achievable bucket load Ω푎 is 2 in this setting. We simulate two update models: (1) Balanced Update: 33% insertion, 33% deletion, and 33% modification; and (2) Heavy Insertion: 40% insertion, 30% deletion, and 30% modification. We run 600K updates and record the percentage of update operations that trigger a rehash as well as the number of bucket accesses per lookup. The results are presented in Figure 5-8. The top two lines show the number of bucket accesses per lookup under the Heavy Insertion and Balanced Update models respectively; both lines increase. The number of bucket accesses per lookup grows continuously to 1.37 for Heavy Insertion, a 25% increase over the original number, while for Balanced Update the number first rises to 1.25 and then drops to 1.21, a 10% increase in the end. The bottom two lines are the rehash percentages over all update operations. These two lines show clearly that heavier insertion causes more rehashes. For Balanced Update, the rehash percentage stays almost constant at 0.5%; there is a slight increase in the Heavy Insertion rehash percentage. Since the rehash percentages for both models are below 2% and a rehash operation involves keys in no more than two buckets, we believe d-ghash can handle these rehashes without incurring much delay.

Figure 5-7. Sensitivity of the number of bucket accesses per lookup for enhanced 4-ghash with respect to the target array size.

Finally, we apply our algorithm to a real routing table application. We use five routing tables downloaded from Internet backbone routers: as286 (KPN Internet Backbone), as513 (CERN, European Organization for Nuclear Research), as1103 (SURFnet, the Netherlands), as4608 (Asia Pacific Network Information Center, Pty. Ltd.), and as4777 (Asia Pacific Network Information Center) [66], with 276K, 291K, 279K, 283K, and 281K prefixes respectively after removing redundant prefixes.

Figure 5-8. Changes in the number of bucket accesses per lookup and the rehash percentage for two update models using enhanced 4-ghash. The bucket accesses per lookup lines correspond to the left y-axis; the rehash percentage lines correspond to the right y-axis.

To handle the longest prefix matching problem, hash-based lookup adopts controlled prefix expansion [26] along with other techniques [73], [74], [75]; it is observed that there are small numbers of prefixes for most lengths, and they can be dealt with separately, for example using TCAM, while the remaining prefixes are expanded to a limited number of fixed lengths. Lookups are then performed against those lengths. In this experiment, we expand the majority of prefixes (with lengths in the range of [26], [76]) to two lengths: 22 bits and 24 bits. Assuming the small number of prefixes outside this range are handled by TCAM, we perform lookups against lengths 22 and 24. Because there are more prefixes of 24 bits after expansion, we present the results for 24-bit prefix lookup.
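As a concrete illustration of controlled prefix expansion, the Python sketch below expands a shorter IPv4 prefix to a fixed target length by enumerating the longer prefixes it covers. This is a textbook rendering of the idea in [26], not the exact expansion code used in the experiments.

def expand_prefix(prefix_bits, prefix_len, target_len):
    # prefix_bits holds the prefix left-aligned in a 32-bit integer.
    # A prefix of length prefix_len expands into 2**(target_len - prefix_len)
    # prefixes of length target_len.
    assert 0 < prefix_len <= target_len <= 32
    gap = target_len - prefix_len
    base = prefix_bits & ~((1 << (32 - prefix_len)) - 1)   # clear host bits
    return [base | (i << (32 - target_len)) for i in range(1 << gap)]

# Example: expand_prefix(0xC0A81000, 20, 24) expands 192.168.16.0/20 into the
# sixteen /24 prefixes 192.168.16.0/24 through 192.168.31.0/24.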

Figure 5-9. Number of bucket accesses per lookup for experiments with five routing tables.

Figure 5-10. Experiment with the update trace using enhanced 4-ghash. The bucket accesses per lookup line corresponds to the left y-axis; the rehash percentage line corresponds to the right y-axis.

Table 5-2. Routing table updates for enhanced 4-ghash.

Number of buckets    110K       120K    130K    140K    150K
Rehash percentage    re-setup   0.23%   0.13%   0.08%   0.05%


There are 159,444, 159,813, 159,395, 159,173, and 159,376 prefixes of 24 bits from the five routing tables, respectively. We use these prefixes to set up five tables separately and vary the number of buckets from 100K to 250K with a target array of 150K entries. We measure the number of bucket accesses per lookup for the d-ghash and d-left schemes; the results are averaged over the five tables. As shown in Figure 5-9, both base 4-ghash and base 2-ghash perform better than the respective d-left schemes. The maximum reduction of base 4-ghash over 4-left is about 36% at m = 210,000, and of base 2-ghash over 2-left 12% at m = 250,000. The average number of bucket accesses per lookup for enhanced 4-ghash is almost one bucket fewer than for 4-left, a reduction of up to 50%; for enhanced 2-ghash, there is an average 20% reduction over 2-left. We also notice a jump for 4-ghash at m = 220,000 and another for 2-ghash at m = 130,000, due to the change in Ω푎 mentioned before.

In the second experiment, we set up our hash tables with the routing table as286 downloaded on January 1st, 2010 from [66] and use the collected update trace for the whole month of January 2010 to simulate the update process. To keep the experiment simple, we again use only the prefixes of length 24. There are 159,444 24-bit prefixes in the table. The update trace contains 1,460,540 insertions and 1,458,675 deletions for those 24-bit prefixes. We vary the number of buckets from 110K to 150K. For all these settings, the achievable bucket load Ω푎 is 2 for enhanced 4-ghash. We also use a fixed 150K-entry target array.

As shown in Table 5-2, if we use 110K buckets, we need a re-setup of the whole table. With 120K buckets, no re-setup is needed, but 0.23% of all update operations require a rehash, which is about 0.5% of the 1.4 million insertions. As the number of buckets increases, fewer rehashes are needed; with 150K buckets, only about 0.05% of the updates trigger a rehash. We also show the change in lookup efficiency in Figure 5-10 with m = 150K. The update trace has nearly the same number of insertions and deletions, similar to the Balanced Update model used earlier in this section. We can see that the rehash percentage grows gradually to 0.05%. The number of bucket accesses per lookup fluctuates through the update process, ending with a 7% increase.

5.5 Summary

A new guided multiple-hashing method, d-ghash, is introduced in this chapter. Unlike previous approaches, which select the least-loaded bucket to place each key progressively, d-ghash achieves global balance by allocating keys to buckets only after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach this goal. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID of the bucket where a key is located, to guide the lookup and avoid extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and reduces the number of bucket accesses significantly.


CHAPTER 6 INTELLIGENT ROW BUFFER PREFETCHES

6.1 Background and Motivation

As we discussed in the introduction, accesses to DRAM are distributed across channels, ranks, and eventually banks. Inside each bank, the DRAM arrays are organized into rows. Before a memory location can be read, the entire row containing that location must be opened and read into the row buffer. Leaving a row buffer open after every access (the open-page policy) enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A request to the opened row is called a row-buffer hit. A row-buffer miss happens when the next request targets a different row, which can cause a long delay.
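The latency gap between the two cases can be sketched with a toy open-page timing model; the timing values below (in DRAM cycles) are placeholders chosen only for illustration, not the parameters used in the simulations of this dissertation.

T_CAS, T_RCD, T_RP = 14, 14, 14      # illustrative column, activate, precharge times

class Bank:
    def __init__(self):
        self.open_row = None                  # row currently in the row buffer

    def access(self, row):
        if self.open_row == row:              # row-buffer hit: column access only
            return T_CAS
        if self.open_row is None:             # row buffer closed: activate + read
            self.open_row = row
            return T_RCD + T_CAS
        self.open_row = row                   # row-buffer conflict: precharge the
        return T_RP + T_RCD + T_CAS           # old row, activate the new one, read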

Row buffer locality has long been observed and utilized in previous proposals, generally following two approaches. The first is to change the DRAM mapping and scheduling in order to improve row-buffer hits. [77] proposes a permutation-based page interleaving scheme to reduce row-buffer conflicts. [78] introduces a Minimalist Open-page memory scheduling policy that captures open-page gains with a relatively small number of page accesses per page activation. They observe that while the commonly used open-page address mapping schemes map each memory page to a sequential region of real memory, which allows linear access sequences to hit in the row buffer, this mapping can cause interference between applications sharing the same DRAM devices [79] and cannot exploit bank-level parallelism. They argue that the number of row-buffer hits is generally small, and that with an adjusted DRAM mapping scheme they can retain row-buffer hits while reducing row-buffer conflicts. However, they need a complex data prefetch engine and a complex scheduler to schedule normal and prefetch requests.

[80] proposes a three-stage memory controller that first groups requests based on row-buffer locality, then focuses on inter-application request scheduling, and lastly issues simple FIFO DRAM commands. They mainly target CPU-GPU systems, where the applications on the two sides can interfere heavily and exhibit very different row-buffer access behavior.

The second approach is to change the DRAM page closure management. [81] first proposes tracking history at a per-DRAM-page granularity and uses a two-level predictor to decide whether to close a row buffer. [82] extends this proposal with a one-level, low-cost access-based predictor (ABP) that closes a row buffer after a specified number of accesses or when a page conflict occurs; they argue that the number of accesses to a given DRAM page is a better indicator for page closure than timer-based policies. [83], [84] propose application-aware page policies that assign different policies to different applications based on memory intensity and locality.

Of all the related works, [85] is the most closely related to ours. [85] proposed the row-based page policy (RBPP), which tracks row addresses and uses them as an indicator to decide whether or not to close the row buffer when the active memory request finishes. They use a few registers to record the most-accessed rows in a bank. For each recorded row address, a counter that is dynamically updated based on the access pattern determines whether or not the row buffer should be closed. They use an LRU scheme to replace old entries. We will show in the results section that using only LRU replacement is not accurate compared with replacement based on the row access count. More specifically, rows that are accessed only once or twice are very common in many workloads [39]; such rows tend to frequently replace entries in the most-accessed row registers (MARR), causing a poor hit ratio. Compared to their design, we use a general approach that adds a learning table to filter out those requests. The comparison results are presented later in this chapter.


When a hot row is identified, [85] proposed modifying the DRAM page policy, which cannot reduce latency when accesses to two hot rows interleave with each other. We argue that by simply caching the hot rows, without modifying the DRAM mapping, we can still harness the latency gain of row-buffer hits while avoiding the complexity of modifying the DRAM.

Figure 6-1. Hot row pattern of 10 workloads. Each panel plots the requested row number against HybridSim cycles for one workload (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp).

Figure 6-1 shows a slice of the row-buffer accesses observed when simulating the No-Cache Design mentioned in Chapter 3. The y-axis is the row number of the request address and the x-axis is the simulation cycle; the slice shown spans 200K cycles. It is interesting to observe that 8 out of 10 workloads exhibit a strong pattern in which some rows are accessed much more than others. However, because accesses to different rows are interleaved, we do not observe a high row-buffer hit rate. One way to improve the row-buffer hit rate would be to schedule all requests to the same row together; another is to prefetch blocks inside the hot rows for later use.

6.2 Hot Row Buffer Design and Results

We propose a simple but effective design that utilizes the hot-row pattern observed in the previous section. Two data structures are needed: a Learning Table (LT) and a Hot-Row Buffer (HRB). The LT captures new hot rows based on recently referenced rows, and the HRB buffers the hot rows. Each entry in the LT records the address tag of a row along with a shift register. The size of the LT is n and the width of the shift register is m. The HRB has k rows, and each HRB entry has a 2KB data buffer along with a reference counter that counts the number of references to the row.

When a reference hits a row in the HRB, the respective counter is incremented. When a requested row is not in the HRB, the row enters the LT if it is not already there; if the LT is full, the request is dropped. The m bits of the shift register are initialized to all '0's when the row first enters the LT, and a hit to a row in the LT shifts a '1' into the corresponding shift register.

When the shift-out bit is '1', a hot row is identified. The newly identified row is fetched into the HRB, replacing the row with the smallest reference count. The counter of the new row is initialized to the middle of the counter's range, while the other counters are decremented by one. The row is then dropped from the LT, creating an empty slot. If a request misses the HRB and hits the LT but the shift-out bit is '0', no action is taken. The process continues until a maximum of h hot rows are identified; after the last hot row is inserted into the HRB, the entire LT is wiped out and the process starts over.
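A behavioral sketch of this identification logic is given below in Python. The parameter values, the counter width, and the dictionary-based bookkeeping are assumptions for illustration; the data payload of each HRB entry, the LT wipe after h hot rows, and the LRU/HRB selection described next are omitted.

class HotRowIdentifier:
    def __init__(self, lt_size=16, shift_width=4, hrb_rows=64, ctr_bits=8):
        self.lt_size = lt_size
        self.shift_width = shift_width
        self.hrb_rows = hrb_rows
        self.ctr_max = (1 << ctr_bits) - 1
        self.lt = {}                    # row tag -> shift register (list of bits)
        self.hrb = {}                   # row tag -> reference counter

    def access(self, row):
        if row in self.hrb:                            # HRB hit: bump the counter
            self.hrb[row] = min(self.hrb[row] + 1, self.ctr_max)
            return "hrb_hit"
        if row not in self.lt:
            if len(self.lt) >= self.lt_size:           # LT full: drop the request
                return "dropped"
            self.lt[row] = [0] * self.shift_width      # shift register starts at 0s
            return "lt_insert"
        reg = self.lt[row]
        shifted_out = reg.pop()                        # observe the shift-out bit
        reg.insert(0, 1)                               # shift a '1' in on an LT hit
        if shifted_out == 0:
            return "lt_hit"
        # Shift-out bit is '1': a hot row has been identified.
        del self.lt[row]
        if len(self.hrb) >= self.hrb_rows:             # evict the least-used row
            victim = min(self.hrb, key=self.hrb.get)
            del self.hrb[victim]
        for r in self.hrb:                             # age the other counters
            self.hrb[r] = max(self.hrb[r] - 1, 0)
        self.hrb[row] = self.ctr_max // 2              # start at a middle value
        return "hot_row"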

As for the replacement policy of the HRB, one may think that LRU replacement would be a good choice. We therefore add a competing directory that implements the LRU replacement policy. Periodically, we compare the number of hits under each scheme and select the one with more hits to be applied in the next time period. At the end of a period, the directory recording the losing scheme is updated to the contents of the other, and the hit/miss counters of the two schemes are reset to measure the next period. A dynamic method with a saturating counter is used to switch between the two schemes. Figure 6-2 shows the flow diagram when a new row address arrives; note that the LT restart and request-dropping steps are omitted from the diagram.
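The period-based selection between the two replacement schemes could look like the sketch below; the period length, counter width, and the exact synchronization of the losing directory are assumptions for illustration.

class ReplacementDuel:
    def __init__(self, period=10000, ctr_bits=3):
        self.period = period
        self.ctr_max = (1 << ctr_bits) - 1
        self.ctr = self.ctr_max // 2          # saturating counter, starts midway
        self.hits = {"hrb": 0, "lru": 0}
        self.accesses = 0

    def record(self, hrb_hit, lru_hit):
        self.hits["hrb"] += int(hrb_hit)
        self.hits["lru"] += int(lru_hit)
        self.accesses += 1
        if self.accesses % self.period == 0:
            # Nudge the counter toward the scheme with more hits this period;
            # the losing directory would be copied from the winner here.
            if self.hits["hrb"] > self.hits["lru"]:
                self.ctr = min(self.ctr + 1, self.ctr_max)
            elif self.hits["lru"] > self.hits["hrb"]:
                self.ctr = max(self.ctr - 1, 0)
            self.hits = {"hrb": 0, "lru": 0}

    def use_hrb(self):
        # Apply the HRB (reference-count) replacement when the counter leans
        # toward it; otherwise apply LRU.
        return self.ctr > self.ctr_max // 2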

Figure 6-2. Hot row identification and update. A new row address is first checked against the HRB and then against the LT; when a hot row is identified, both the LRU and HRB directories are updated.

Figure 6-3 shows the detailed hit ratio of different HRB configurations for the 10 workloads when capturing reused blocks. We can observe that the hybrid scheme provides a reasonable hit ratio across all workloads, which matches the hot-row pattern observed in Figure 6-1. The x-axis is the number of rows stored in the directory for each scheme, and the y-axis is the hit ratio of the HRB/LRU directory over all row accesses when that many rows are stored. With more entries recorded, the hit ratio increases noticeably for all workloads. Milc captures more than 90% of row accesses when more than 16 entries are used. Gems and zeusmp perform poorly, with only a slight increase in hit ratio as more entries are added. In general, the LRU scheme performs better than the HRB scheme when enough entries are available; however, when space is limited and only a few entries can be recorded, the HRB scheme performs better.

Table 6-1 summarizes the hit ratio of the 10 workloads when using 64 entries. Seven out of 10 workloads have a hit ratio over 50%, and four workloads have a hit ratio larger than 90%. The geometric mean hit ratio of the 10 workloads reaches 57.7%. We believe this can be well utilized to achieve a reasonable performance gain.

Table 6-1. Hit ratio for the hybrid scheme of 10 workloads using 64 entries.

Workload    Hit ratio (%)
bt          91.2
bwaves      91.3
gems        26.3
lbm         76.7
leslie3d    49.4
mg          96.8
milc        94.9
soplex      53.1
swim        76.1
zeusmp      13.3
Geomean     57.7

Figure 6-3. Results of the proposed hybrid scheme. For each of the 10 workloads (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp), each panel plots the hit percentage of the LRU, Hybrid, and HRB directories against the number of hot rows recorded.

Previous proposals have focused only on different ways of modifying row-buffer management in order to reduce row-buffer conflicts. We argue that we can simply cache these hot rows and avoid the complexity of changing the DRAM organization. Of course, caching an entire row not only wastes space when only a few blocks in the row are accessed, but also puts a heavy burden on bandwidth. We can instead cache part of the row: if the cached blocks in the row are accessed, we can start prefetching the remaining blocks in the row. This can be easily implemented with a simple stream prefetcher.

Figure 6-4. Block column difference within a row for 10 workloads (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp). Each panel plots the count of requests against the block distance from the first access to the row.

Figure 6-4 shows the column difference inside a row buffer. Upon the first request to a certain row, we record its address as the base address; when later requests access the same row, we record the column difference between the new request and the base. We observe a high regularity in which the following blocks of a row are subsequently accessed. Take lbm as an example: most requests in a row have increasing differences relative to the first access to the row, and as the difference grows, the number of requests decreases. This motivates the use of a simple stream prefetcher in our performance evaluation; a sketch is given below. A strong pattern similar to lbm can be observed for 7 out of 10 workloads. For milc, there is a gap between odd and even differences. For swim and zeusmp, the differences are more scattered, and soplex and zeusmp also show many requests with negative differences. For such patterns, a more advanced prefetching algorithm could be applied.
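The in-row stream prefetcher itself reduces to simple address arithmetic; the sketch below assumes 64B blocks and a 2KB row, and leaves the issuing of the actual prefetch requests abstract.

BLOCK_SIZE = 64        # bytes per cache block (assumed)
ROW_SIZE = 2048        # bytes mapped to one 2KB row-buffer entry (assumed)

def in_row_prefetch(block_addr, degree=4):
    # Return the addresses of the next `degree` blocks in the same row,
    # stopping at the row boundary.
    row_base = block_addr - (block_addr % ROW_SIZE)
    row_end = row_base + ROW_SIZE
    prefetches = []
    addr = block_addr + BLOCK_SIZE
    while len(prefetches) < degree and addr < row_end:
        prefetches.append(addr)
        addr += BLOCK_SIZE
    return prefetches

# Example: in_row_prefetch(0x1000) returns [0x1040, 0x1080, 0x10C0, 0x1100],
# the next four 64B blocks within the same 2KB row.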

6.3 Performance Evaluation

In this section, we present the IPC improvement of hot-row prefetching over the base design without it, along with the row-buffer hit ratio.

Figure 6-5. IPC speedup, row-buffer hit improvement, and cache hit rate improvement (in percent) for the 10 workloads and their geometric mean.

The total cost of the learning table is calculated as follows. Each learning table costs 16 × (0.5 + 3) = 56B per bank (3B records the row address and 0.5B the shift register). The HRB/LRU scheme records 1–64 rows, each with a 1B reference counter, which costs at most 2 × 64 × (1 + 3) = 512B per bank. The total overhead is therefore 568B per bank for recording 64 rows, or about 8.9KB for all 16 banks, which can easily fit into the last-level cache. Once the hot rows are identified, we implement a simple stream prefetcher that prefetches the next 4 blocks inside the hot row.
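For reference, the storage arithmetic above works out as follows (sizes in bytes):

\begin{align*}
\text{LT per bank} &= 16 \times (0.5 + 3) = 56\,\text{B},\\
\text{HRB/LRU directories per bank} &= 2 \times 64 \times (1 + 3) = 512\,\text{B},\\
\text{total per bank} &= 56 + 512 = 568\,\text{B},\\
\text{total for 16 banks} &= 16 \times 568 = 9088\,\text{B} \approx 8.9\,\text{KB}.
\end{align*}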

Figure 6-5 shows the IPC and row-buffer hit improvements of adding a hot-row prefetcher for the 10 workloads. The overall IPC improvement is 9.1%, ranging from 3.1% to 16.4% over individual workloads, and 6 out of 10 workloads see an IPC improvement over 10%. Bt, mg, and milc have the most significant improvements, while zeusmp has the least.

We also observe an average of 6.3% more row-buffer hits, and in general the IPC improvement is consistent with the increase in row-buffer hits. Prefetching the next 4 blocks upon a hot-row access can create 4 additional row-buffer hits, which improves the row-buffer hit rate. If any of these prefetched blocks are later requested, their access time is reduced, lowering the average memory access time. Among the 10 workloads, milc and mg gain more than 10% in row-buffer hits, while gems, soplex, and zeusmp gain less than 5%.

We also show the last-level cache hit rate improvement. The prefetched blocks may become cache hits when they arrive at the last-level cache before being requested. We observe a slight increase in the L3 cache hit rate for all workloads, with an average increase of 3.9%. Note that even cache misses can be served faster because the blocks are prefetched earlier.

Table 6-2 shows the usage of the prefetched blocks. The average usage is about 52%, which means that on average 2 out of the 4 prefetched blocks are requested later. Seven out of 10 workloads have a usage above 50%. Among them, bt, milc, and mg have the highest prefetch usage, which is consistent with their IPC results, while zeusmp, soplex, and gems have the lowest. Useless prefetched blocks consume bandwidth and pollute the cache, which reduces the improvement that prefetching brings.


Table 6-2. Prefetch usage for 10 workloads using a simple stream prefetcher.

Workload    Prefetch hit percentage (%)
bt          80
bwaves      71
gems        32
lbm         62
leslie3d    51
mg          72
milc        83
soplex      33
swim        54
zeusmp      23
Geomean     52

Table 6-3. Sensitivity study on prefetch granularity.

Percentage (%)   IPC speedup   Row-buffer hit improvement   Cache hit rate improvement
prefetch_2       5.4           3.7                          2.1
prefetch_4       9.1           6.3                          3.8
prefetch_6       3.1           8.3                          0.4
prefetch_8       -2.8          10.1                         -2.2

Table 6-3 shows a sensitivity study on the stream prefetch granularity. Compared to prefetching only two blocks on every row access, prefetching 4 blocks improves the row-buffer hit ratio and the IPC by reducing the last-level cache miss rate. However, the improvement shrinks and even turns negative when 6 or more blocks are prefetched: prefetching more blocks on every row access puts a heavy burden on the bandwidth and hurts LLC performance.

6.4 Conclusion

From the results, we conclude that row-buffer accesses exhibit a strong locality pattern. However, because those accesses are interleaved with one another, we tend to see more row-buffer conflicts than row-buffer hits. Based on the hot-row pattern, we can easily identify frequently used rows. We evaluate the proposed scheme for capturing hot rows with traces collected from Marssx86 and DRAMSim2. The learning table filters out requests to rows that are accessed only a few times, and the competition between the LRU and HRB directories yields a hybrid scheme that provides the best hit ratio. Results show that an average of 57.7% of row accesses can be captured using 568B per bank. We also show that the simple LRU replacement used by the previous RBPP scheme is not effective when only a limited number of hot rows can be recorded.

We further implement a simple stream prefetcher to harness the hot-row pattern captured by our learning table design. The results demonstrate that with a simple prefetch-in-row stream prefetcher, we achieve an IPC speedup of 9.1% and a row-buffer hit rate improvement of 6.3%.


CHAPTER 7 SUMMARY

This dissertation proposes four works targeting improved memory hierarchy performance. The proposed ideas can be readily applied to real-world systems, and the evaluations show that they improve system performance significantly. With the increasing demand for high-performance memory systems, these techniques are valuable.

In the first work, we present a new caching technique that caches a portion of the large tag array of an off-die stacked DRAM cache. Due to its size, the tag array is impractical to fit on-die, so caching a portion of the tags reduces the need to go off-die twice for each DRAM cache access. In order to reduce the space requirement for cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses location (way) pointers to locate the blocks in the off-die cache data array. The large sector also exploits spatial locality for better coverage. In comparison with the Alloy cache, the ATCache, and the TagTables approaches, the average improvement of CLT is in the range of 4–15%.

In the second work, a new Bloom Filter is introduced to filter L3 cache misses so that L1, L2, and L3 caches can be bypassed, shortening the L3 miss penalty in a 4-level cache hierarchy. The proposed Bloom Filter applies a simple indexing scheme that decodes the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, partial index bits are XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Performance evaluation using SPEC2006 benchmarks on an 8-core system with 4-level caches shows that the proposed simple hashing scheme can lower the average false-positive rate below 5% for filtering L3 misses and improve the average IPC by 10.5% over a system without L3 filtering and runahead. Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using the Bloom Filter to identify L3 cache misses: due to dynamic updates of the cache content, a counting Bloom Filter would normally be necessary to keep the BF array in sync with the cache. A unique advantage of the proposed BF index is that it includes the cache index as a superset. As a result, the blocks that are hashed to the same BF array location are allocated in the same cache set; by searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly without using expensive counters.

The third work proposes a new guided multiple-hashing method, d-ghash. Unlike previous approaches, which select the least-loaded bucket to place each key progressively, d-ghash achieves global balance by allocating keys to buckets only after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach this goal. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID of the bucket where a key is located, to guide the lookup and avoid extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and reduces the number of bucket accesses significantly.

The fourth work digs into the details of DRAM row-buffer accesses. By collecting the memory accesses of most of the SPEC CPU workloads, we find that requests to the DRAM rows in each bank are not evenly distributed: some rows in a bank receive more requests than others. We call these more frequently accessed rows "hot rows." Based on the observed hot-row pattern, we propose a simple design that uses a learning table to capture these hot rows. Once a hot row is identified, we sequentially prefetch blocks in that row upon a row access. We evaluate this idea using a simple stream prefetcher, and the results show a 9.1% average IPC improvement over a design without a prefetcher.

The proposed ideas have been verified by the results presented in each individual chapter.


LIST OF REFERENCES

[1] P. Hammarlund, "The Fourth-Generation Intel Core Processor," in MICRO, 2014.

[2] Y. Deng and W. P. Maly, "2.5-dimensional VLSI system integration.," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2005.

[3] K. Banerjee, S. Souri, P. Kapur and K. C. Saraswat, "3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration," in Proceedings of the IEEE, 2001.

[4] G. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die- stacked DRAM caches," in MICRO, 2011.

[5] J. Sim, G. Loh, H. Kim , M. Connor and M. Thottehodi, "A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-balancing Dispatch," in MICRO, 2012.

[6] X. Jiang and e. al., "CHOP: Adaptive filter-based DRAM caching for CMP server platforms," in HPCA, 2010.

[7] G. Loh, "Extending the Effectiveness of 3D-stacked DRAM Cache with An Adaptive Multi-queue Policy," in MICRO, 2009.

[8] G. Loh and M. Hill, " Supporting very large DRAM caches with compound access scheduling and MissMaps," in MICRO, 2012.

[9] M. K. Qureshi and G. Loh, "Fundamental Latency Trade-offs in Architecting DRAM Caches," in MICRO, 2012.

[10] L. Zhao, R. Iyer, R. Illikkal and D. Newell, "Exploring DRAM cache architectures for CMP server platforms," in ICCD, 2007.

[11] D. Woo, N. Seong, D. Lewis and H. Lee, "An Optimized 3D-stacked Memory Architecture by Exploiting Excessive High-density TSV Bandwidth," in HPCA, 2010.

[12] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt and K. Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor," in ASPLOS, 2006.

[13] C. Liu, I. Ganusov and M. Burtscher, "Bridging the Processor-memory Performance Gap with 3D IC Technology," in IEEE Design & Test of Computers, 2005.

[14] G. Loh, "3D-stacked Memory Architectures for Multi-core Processors," in ISCA, 2008.

[15] C. Chou, A. Jaleel and M. K. Qureshi, "A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache," in MICRO, 2014.


[16] X. Dong, Y. Xie, N. Muralimanohar and N. P. Jouppi, "Simple but effective heterogeneous main memory with on-chip memory controller support," in SC, 2010.

[17] G. Loh and et al., "Challenges in Heterogeneous Die-Stacked and Off- Chip Memory Systems," in SHAW, 2012.

[18] J. Pawlowski, "Hybrid Memory Cube: Breakthrough DRAM Performance with a Fundamentally Re-Architected DRAM Subsystem," in Hot Chips, 2011.

[19] J. Sim, A. Alameldeen, Z. chishti, C. Wilkerson and H. Kim, "Transparent Hardware Management of Stacked DRAM as Part of Memory," in MICRO, 2014.

[20] F. Botelho, R. Pagh and N. Ziviani, "Simple and space-efficient minimal perfect hash functions," in WADS, 2007.

[21] B. Vocking, "How Asymmetry Helps Load Balancing," in IEEE Symp. on FOCS, 1999.

[22] A. Kirsch and M. Mitzenmacher, "On the Performance of Multiple Choice Hash Tables with Moves on Deletes and Inserts," in Communication, Control, and Computing, 2008.

[23] F. Hao, M. Kodialam and T. V. Lakshman, "Building high accuracy bloom filters using partitioned hashing," in SIGMETRICS, 2007.

[24] B. Bloom, "Space / Time Trade-offs in Hash Coding with Allowable Errors," in Comm. ACM, 1970.

[25] T. Wang. http://burtleburtle.net/bob/hash/integer.html.

[26] V. Srinivasan and G. Varghese, "Fast Address Lookups Using Controlled Prefix Expansion," in ACM Transactions on Computer Systems, 1999.

[27] A. Patel, F. Afram, S. Chen and K. Ghose, "MARSSx86: A Full System Simulator for x86 CPUs," in DAC, 2011.

[28] M. T. Yourst, PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator, ISPASS, 2007.

[29] J. Stevens, P. Tschirhart, M. Chang, I. Bhati, P. Enns, J. Greensky, Z. Chishti, S. Lu and B. Jacob, "An Integrated Simulation Infrastructure for the Entire Memory Hierarchy: Cache, DRAM, Nonvolatile Memory, and Disk," in ITJ, 2013.

[30] Qemu http://wiki.qemu.org/Main_Page.

[31] Y. Chou, Y. Fahs and S. Abraham, Microarchitecture optimizations for exploiting memory-level parallelism, 2004: ISCA.

[32] T. Carson, H. W. and E. L., Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation, HPCA, 2011.


[33] J. Henning., "SPEC CPU2006 memory footprint," in ACM SIGARCH Computer Architecture News, 2007.

[34] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang and Y. Solihin, "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," in ISCA, 2009.

[35] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni and D. Newell, "Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy," in HPCA, 2009.

[36] S. Lai, "Current Status of the Phase Change Memory and Its Future," in IEDM, 2003.

[37] S. Franey and M. Lipasti, Tag Tables, HPCA, 2015.

[38] C. Huang and V. Nagarajan, "ATCache: Reducing DRAM cache Latency via a Small SRAM Tag Cache," in PACT, 2014.

[39] D. Jevdjic, S. Volos and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache," in ISCA, 2013.

[40] D. Jevdjic, G. Loh, C. Kaynak and B. Falsafi, "Unison Cache : A Scalable and Effective Die-Stacked DRAM Cache," in MICRO, 2014.

[41] N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki, "Reactive NUCA: Near- optimal block placement and replication in distributed caches," in ISCA, 2009.

[42] J. Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache," in IBM Syst.J., 1968.

[43] S. Przybylski, "The Performance Impact of Block Sizes and Fetch Strategies," in ISCA, 1990.

[44] J. B. Rothman and A. J. Smith, "The Pool of Subsectors Cache Design," in ICS, 1999.

[45] A. Seznec, "Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio," in ISCA, 1994.

[46] S. Somogyi, T. Wenish, A. Ailamaki, B. Falsafi and A. Moshovos, "Spatial memory streaming," in ISCA, 2006.

[47] G. Loh and M. Hill, "Addendum for “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches”," 2011.

[48] J. Meza, J. Chang, H. Yoon, O. Mutlu and P. Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," in CAL, 2012.

[49] W. Chou, Y. Nain, H. Wei and C. Ma, "Caching tag for a large scale cache computer memory system," in US Patent 5813031, 1998.


[50] T. Wicki, M. Kasinathan and R. Hetherington, "Cache tag caching," in US Patent 6212602, 2001.

[51] M. Qureshi, "Memory access prediction," in US Patent 12700043, 2011.

[52] "Cacti 6.5," http://www.hpl.hp.com/research/cacti.

[53] A. Seznec and P. Michaud, A case for (partially) tagged geometric history length branch prediction, Journal of Instruction Level Parallelism, 2006.

[54] A. Broder and M. Mitzenmacher, "Network applications of Bloom filters: A survey," in Internet Math, 2004.

[55] J. K. Mullin, "Optimal semijoins for distributed database systems," in IEEE Transactions on Software Engineering, 1990.

[56] L. Fan, P. Cao, J. Almeida and A. Broder, "Summary cache: a scalable widearea Web cache sharing protocol," in IEEE Transactions on Networking, 2000.

[57] R. Rajwar, M. Herlihy and K. Lai, "Virtualizing Transactional Memory," in ISCA, 2005.

[58] A. Roth, "Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization," in ISCA, 2005.

[59] A. Moshovos, "RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence," in ISCA, 2005.

[60] J. Peir, S. Lai, S. Lu, J. Stark and K. Lai, "Bloom filtering cache misses for accurate data speculation and prefetching," in ICS, 2002.

[61] X. Li, D. Franklin, R. Bianchini and F. T. Chong, "ReDHiP: Recalibrating Deep Hierarchy Prediction for Energy Efficiency," in IPDPS, 2014.

[62] A. Broder and M. Mitzenmacher, "Using Multiple Hash Functions to Improve IP Lookups," in INFOCOM, 2001.

[63] S. Demetriades, S. C. M. Hanna and R. Melhem, "An Efficient Hardware-Based Multi- hash Scheme for High Speed IP Lookup," in HOTI, 2008.

[64] H. Song, S. Dharmapurikar, J. Turner and J. Lockwood, "Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network," in SIGCOMM, 2005.

[65] Z. Huang, D. Lin, J.-K. Peir and S. M. I. Alam, "Fast Routing Table Lookup Based on Deterministic Multi-hashing," in ICNP, 2010.

[66] "Routing Information Service," http://www.ripe.net/ris.

[67] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli and S. Bunse, "Towards 100G packet processing: Challenges and technologies," in Bell Labs Technical Journal, 2009.


[68] S. Lumetta and M. Mitzenmacher, "Using the Power of Two Choices to Improve Bloom Filter," in Internet Mathematics, 2007.

[69] Z. Huang, J.-K. Peir and S. Chen, "Approximately-Perfect Hashing: Improving Network Throughput through Efficient Off-chip Routing," in INFOCOM, 2011.

[70] Y. Azar, A. Broder, A. Karlin and E. Upfal, "Balanced Allocations," in Theory of Computing, 1994.

[71] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static sets," in ACM Comm., 1977.

[72] F. F. Rodler and R. Pagh, "Cuckoo Hashing," in ESA, 2001.

[73] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest Prefix Matching Using Bloom Filters," in SIGCOMM, 2003.

[74] B. Chazelle, R. Kilian and A. Tal, "The Bloomier filter: an efficient data structure for static support lookup tables," in ACM SIAM, 2004.

[75] J. Hasan, S. Cadambi, V. Jakkula and S. Chakradhar, "Chisel: A Storage efficient, Collision-free Hash-based Network Processing Architecture," in ISCA, 2006.

[76] M. L. Fredman and J. Komlos, "On the Size of Separating Systems and Families of Perfect Hash Functions," in SIAM. J. on Algebraic and Discrete Methods, 1984.

[77] Z. Zhang et al., "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality," in MICRO, 2000.

[78] D. Kaseridis, J. Stuecheli and L. John, "Minimalist Openpage: A DRAM Page-mode Scheduling Policy for the manycore Era," in MICRO, 2011.

[79] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," in USENIX, 2007.

[80] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and M. O., "Staged memory scheduling: achieving high performance and scalability in heterogeneous systems," in ISCA, 2012.

[81] Y. Xu, A. Agarwal and B. Davis, "Prediction in dynamic sdram controller policies," in SAMOS, 2009.

[82] M. Awasthi, D. W. Nellans, R. Balasubramonian and A. Davis, "Prediction Based DRAM Row-Buffer Management in the Many-Core Era," in PACT, 2011.

[83] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee and M. Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP system," in HPCA, 2012.


[84] M. Xie, D. Tong, Y. Feng, K. Huang and X. Cheng, "Page Policy Control with Memory Partitioning for DRAM Performance and Power Efficiency," in ISLPED, 2013.

[85] X. Shen, F. Song, H. Meng et al., "RBPP: A Row Based DRAM Page Policy for the Many-core Era," in ICPADS, 2014.

[86] A. Jaleel, "Memory Characterization of Workloads Using Instrumentation-Driven Simulation," in VSSAD, 2007.

[87] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," in CAL, 2011.

[88] A. J. Smith, "Line (block) size choice for memories," in IEEE transactions on Computers, 1987.

[89] "NAS Parallel Benchmarks," http://www.nas.nasa.gov/publications/npb.html.

[90] A. Brodnik and J. I. Munro, "Membership in Constant Time and Almost-Minimum Space," in SIAM Journal on Computing, 1999.

[91] W. Starke and et al, "The cache and memory subsystems of the IBM POWER8 processor," in IBM J. Res & Dev. Vol.59(1) , 2015.

[92] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 2011.

[93] D. Kaseridis, J. Stuecheli and L. K. John, Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era, MICRO, 2011.

[94] T. Moscibroda and O. Mutlu, Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems, USENIX, 2007.

[95] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, Staged memory scheduling: achieving high performance and scalability in heterogeneous systems, ISCA, 2012.

[96] M. Hill, "A case for direct-mapped caches," in IEEE computer, 1988.


BIOGRAPHICAL SKETCH

Xi Tao received his Ph.D. in computer engineering from the University of Florida in the fall of 2016. He received his B.S. degree in Electronic Engineering and Information Science from the University of Science and Technology of China in 2007. His research interests include computer architecture, caches, and Bloom Filter applications.
