Bundling Spatially Correlated Prefetches for Improved Main Memory Row Buffer Locality
Total Page:16
File Type:pdf, Size:1020Kb
Bundling Spatially Correlated Prefetches for Improved Main Memory Row Buffer Locality by Patrick Judd A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto © Copyright by Patrick Judd 2014 Bundling Spatially Correlated Prefetches for Improved Main Memory Row Buffer Locality Patrick Judd Masters of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2014 Abstract Row buffer locality is a consequence of programs ’ inherent spatial locality that the memory system can easily exploit for significant performance gains and power savings. However, as the number of cores on a chip increases, request streams become interleaved more frequently and row buffer locality is lost. Prefetching can help mitigate this effect, but more spatial locality remains to be recovered. In this thesis we propose Prefetch Bundling, a scheme which tags spatially correlated prefetches with information to allow the memory controller to prevent prefetches from becoming interleaved. We evaluate this scheme with a simple scheduling policy and show that it improves the row hit rate by 11%. Unfortunately, the simplicity of the scheduling policy does not translate these row hits to improved performance or power savings. However, future work may build on this framework and develop more balanced policies that leverage the row locality of Prefetch Bundling. ii Acknowledgments There are many people to thank as I reflect back on my Master’s degree. First and foremost I must thank my supervisor, Andreas Moshovos, for his support, confidence in my abilities, and most of all for his ability to keep me grounded during stressful and trying times. I would also like to thank the members of the AENAO research group for their help, support and genuine camaraderie. Special thanks go to Jason Zebchuk for his help and guidance in learning and wrestling with Flexus. I would also like to thank Myrto Papadopoulou for her technical and moral support, and for tolerating my uncanny ability to kill cluster machines. I would also like to thank Andreas Moshovos, the University of Toronto and the Natural Sciences and Engineering Research Council of Canada for their generous financial support. iii Table of Contents 1 Introduction.......................................................................................................................................1 1.1 Contributions.............................................................................................................................4 1.2 Thesis Structure ........................................................................................................................5 2 Background .......................................................................................................................................6 2.1 Spatial Memory Streaming.......................................................................................................6 2.1.1 Overview .......................................................................................................................6 2.1.2 Hardware Design ..........................................................................................................7 2.2 Dynamic Random Access Memory .........................................................................................9 2.2.1 Memory Geometry .....................................................................................................10 2.2.2 DRAM Commands.....................................................................................................12 2.2.3 Row Buffer Management Policies.............................................................................12 2.2.4 Memory Scheduling ...................................................................................................13 2.2.5 Address Mapping........................................................................................................14 3 Motivation.......................................................................................................................................15 3.1 Trace Based Simulation..........................................................................................................15 3.2 Timing Simulation ..................................................................................................................17 4 Design..............................................................................................................................................19 4.1 Single Packet Bundling vs. Tagged Bundling ......................................................................20 4.1.1 Single Packet Bundle .................................................................................................20 4.1.2 Tagged Bundle............................................................................................................21 4.2 Prefetch Bundling Design ......................................................................................................24 4.3 Prefetch Bundling Implementation ........................................................................................25 4.3.1 Prefetch Controller .....................................................................................................25 4.3.2 Caches and Interconnect ............................................................................................26 iv 4.3.3 Bundle Queue .............................................................................................................27 4.4 Bundling Parameters...............................................................................................................29 4.5 Scheduling Policy ...................................................................................................................29 4.6 Tagging Optimization.............................................................................................................30 4.7 Deadlock Avoidance...............................................................................................................31 5 Methodology ...................................................................................................................................33 5.1 System Overview ....................................................................................................................33 5.2 Workloads ...............................................................................................................................36 5.3 Trace Based Simulation..........................................................................................................37 5.4 Timing Simulation ..................................................................................................................38 5.5 Area and Power Models .........................................................................................................38 6 Evaluation .......................................................................................................................................39 6.1 Workload Characterization ....................................................................................................39 6.1.1 Bundle Size Distribution ............................................................................................39 6.1.2 Prefetch Fill Levels ....................................................................................................41 6.1.3 Memory request breakdown ......................................................................................43 6.1.4 Memory Traffic ..........................................................................................................44 6.2 Prefetch Bundling ...................................................................................................................45 6.2.1 Row Hit Rate ..............................................................................................................45 6.2.2 Memory Latency.........................................................................................................48 6.2.3 Memory Bandwidth....................................................................................................51 6.2.4 Memory Power Consumption....................................................................................52 6.2.5 System Performance...................................................................................................54 6.2.6 SMS Performance.......................................................................................................56 7 Related Work ..................................................................................................................................57 7.1 Micro-pages.............................................................................................................................57 v 7.2 Adaptive Stream Detection ....................................................................................................58 7.3 Scheduled Region Prefetching ...............................................................................................58 7.4 Prefetch-Aware DRAM Controllers ......................................................................................59 7.5 Footprint Cache.......................................................................................................................59 7.6 Complexity