The Push Architecture: a Prefetching Framework for Linked Data Structures
THE PUSH ARCHITECTURE: A PREFETCHING FRAMEWORK FOR LINKED DATA STRUCTURES

by Chia-Lin Yang

Department of Computer Science
Duke University

Approved:
Dr. Alvin R. Lebeck, Supervisor
Dr. Jeffrey S. Chase
Dr. Gershon Kedem
Dr. Nikos P. Pitsianis
Dr. Xiaobai Sun

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University.

Copyright by Chia-Lin Yang. All rights reserved.

Abstract

The widening performance gap between processors and memory makes techniques that alleviate this disparity essential for building high-performance computer systems. Caches are recognized as a cost-effective method to improve memory-system performance. However, a cache's effectiveness can be limited if programs have poor locality. Thus, techniques that hide memory latency are essential to bridging the CPU-memory gap. Prefetching is a commonly used technique to overlap memory accesses with computation. Prefetching for array-based numeric applications with regular access patterns has been well studied in the past decade. However, prefetching for pointer-intensive applications remains a challenging problem. Prefetching linked data structures (LDS) is difficult because their address sequences do not present the same arithmetic regularity as array-based applications, and because the data dependence of pointer dereferences can serialize the address-generation process. The push architecture proposed in this thesis is a cooperative hardware/software
prefetching framework designed specifically for linked data structures. The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse an LDS and uses a prefetch engine to execute them ahead of the CPU. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data-movement model. It attaches a prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor. Thus a series of pointer dereferences becomes a pipelined process rather than a serial process. Simulation results show that the push architecture can eliminate a substantial portion of memory stall time on a suite of pointer-intensive applications, reducing overall execution time.

Acknowledgements

As I look back on the past six years, my heart is filled with thankfulness. I thank God for the people who have helped me in this journey. I would like to express my deepest appreciation to my advisor, Professor Alvin R. Lebeck. He guided me through every step of my Ph.D. study with patience and encouragement. He has been a great mentor for me in these years. I hope I can have the same influence on my future students as he has had on me. I would like to thank Professor Jeffrey S. Chase, Professor Gershon Kedem, Dr. Nikos P. Pitsianis, and Professor Xiaobai Sun for taking the time to serve on my Ph.D. committee and give me valuable comments on this thesis. I want to express my special thanks to Professor Xiaobai Sun for her friendship and encouragement throughout my years at Duke. Special thanks to Dr. Barton Sano and Dr. Norman P. Jouppi for a great internship experience at Western Research Lab. I would also like to thank the Intel Foundation for supporting me in finishing this thesis. I would like to thank my friends Srikanth Srinivasan, Mithuna Thottethodi, Wei Jin, Chong Xu, Rung-Huang Tsai, Yujuan Bao, and Elizabeth Cherry. Their friendship has brought me many joys in these years. I also want to express thanks to my church family for their steady support. Without their prayers I would not have gone this far. Finally, I would like to thank my dear family: my parents, my siblings, my husband Chun-Fa, and my son Nathan. Their love and support gave me the strength to carry on in the most difficult times.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Introduction
Background
  Hiding Memory Latency: Prefetching
  Prefetching Array vs. Linked Data Structures
The Push Architecture
  LDS Traversal Kernel
  The Push Model
  Design Issues of the Push Architecture
Implementation of the Push Architecture
  PFE Design
    Specialized PFE Design
    Programmable PFE Design
  Interaction Among Prefetch Engines
  Synchronization between the Processor and PFEs
  Reducing the Effect of Redundant Prefetches
  Modifications to the Cache/Memory Controller
Evaluation
  Methodology
  Microbenchmark Results
  Macrobenchmark Results
    Benchmark Characterization
    Performance Comparison between the Push and Pull Models
    Prefetch Coverage for the Push Model
    Effect of the PFE Data Cache and Throttle Mechanism
  Performance Impact of Different PFE Architectures
    Number of PFEs
    Speed of PFEs
  Impact of Address Translation
  Performance Prediction of the Push Architecture for Future Processors
Related Work
  Prefetching for Regular Applications
  Mechanisms for Improving Memory Performance for Irregular Applications
  Decoupled Architecture
  Processor-In-Memory Architecture
Conclusion
A. LDS Traversal Kernels of Olden Benchmark and Rayshade
Bibliography
Biography

List of Tables

PFE State
Event Types
Memory System Configuration
Benchmark Characteristics

List of Figures

Illustration of the Prefetching Process
Illustration of the Pointer-Chasing Problem
Linked Data Structure (LDS) Traversal
Example of Binary Tree Traversal in Depth-First Order
Hash Table Lookup
Bitonic Sorting
Pull vs. Push Data Movement for Linked Data Structures
Prefetching Performance: Push vs. Pull (R: round-trip memory latency, D: DRAM access time, C: computation time, S: memory stall time)
The Push Architecture
Variations of the Push Architecture
Linked-List Traversal Example
Traversal Kernel Table
Traversal Kernel Table
Block Diagram of the Specialized PFE
Source of the Hash Table Lookup
Block Diagram of the Programmable PFE
Traversal Kernel Example
Tree Traversal Example
State Diagram of the Interaction Scheme
Prefetch Buffer Diagram
Synchronization between the CPU and PFE Execution
Tree Traversal Example
Complete PFE State Diagram
The PFE Block Diagram with a Data Cache
The Block Diagram of the Memory Hierarchy
Microbenchmark Cumulative Distance Distribution between Recurrent Loads
Bandwidth Evaluation
Performance Comparison between the Push and Pull Models
Prefetch Coverage for the Push Model
Effect of the PFE Data Cache and Throttle Mechanism
Redundant Prefetches from the L2/Memory Levels (the y-axis shows the percentage of prefetches that are redundant, obtained by dividing the number of redundant prefetches by the total number of prefetches)
Effect of the PFE Data Cache (the y-axis shows the percentage of redundant prefetches that are PFE data cache hits)
Programmable vs. Specialized PFE
Effect of Wider-Issue PFEs
Variations of the Push Architecture
Effect of the PFE Clock Rate
Impact of Address Translation
Performance Trend for Future Processors (only the CPU clock rates increase)
Performance Trend for Future Processors (the memory system scales with the CPU clock rate except for the DRAM access time)

Chapter 1

Introduction

Microprocessor performance has improved rapidly each year over the past decade. The current generation, such as the Alpha and the Intel Pentium, is able to achieve clock rates above 1 GHz. However, memory access time improves at a much slower rate. Current DRAM (Dynamic RAM)
implementations usually take a few hundred nanoseconds to retrieve data. As the performance gap between processors and memory continues to grow, techniques that reduce the effect of this disparity are essential to building a high-performance computer system. The use of caches between the CPU and main memory is recognized as an effective method to bridge this gap. Caches are composed of SRAM (Static RAM), which has lower access latency but is more expensive than DRAM. The design of caches is based on one important program property: locality of reference. Programs tend to reuse data that have been recently referenced. Caches keep recently used data, thereby satisfying successive accesses to this data. Current memory systems usually adopt
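The pointer-chasing problem that motivates this thesis can be made concrete with a minimal sketch, not taken from the thesis itself: a linked-list traversal kernel in C. The node layout (`node_t`) and helper names are illustrative assumptions. The point is that the load producing the next address depends on the previous load completing, so on a chain of cache misses each node pays the full memory latency in series.

```c
#include <stddef.h>
#include <stdlib.h>

/* A minimal linked-data-structure (LDS) node: one payload word
 * plus the pointer that creates the recurrent load dependence. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Traversal kernel: the load `p->next` generates the address for
 * the following iteration, so address generation is serialized --
 * the "pointer-chasing" problem that defeats stride prefetchers. */
long sum_list(const node_t *p) {
    long sum = 0;
    while (p != NULL) {
        sum += p->value;   /* work on the current node    */
        p = p->next;       /* recurrent (pointer) load    */
    }
    return sum;
}

/* Build a short list holding 1..n, for demonstration only. */
node_t *make_list(int n) {
    node_t *head = NULL;
    for (int i = n; i >= 1; i--) {
        node_t *q = malloc(sizeof *q);
        q->value = i;
        q->next  = head;
        head = q;
    }
    return head;
}
```

A pull-based prefetcher running this kernel ahead of the CPU still pays a full round trip per node to fetch `p->next` before it can issue the next request; the push model instead lets an engine at the memory level dereference `next` locally and send each node upward, turning the chain of dependent misses into a pipeline.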