Push vs. Pull: Data Movement for Linked Data Structures

Chia-Lin Yang and Alvin R. Lebeck

Department of Computer Science, Duke University

Durham, NC 27708 USA

fyangc,[email protected]

Abstract

As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointer-intensive applications still remains a challenging problem. Prefetching linked data structures (LDS) is difficult because the address sequence of LDS traversal does not present the same arithmetic regularity as array-based applications, and the data dependence of pointer dereferences can serialize the address generation process.

In this paper, we propose a cooperative hardware/software mechanism to reduce memory access latencies for linked data structures. Instead of relying on past address history to predict future accesses, we identify the load instructions that traverse the LDS and execute them ahead of the actual computation. To overcome the serial nature of LDS address generation, we attach a prefetch controller to each level of the memory hierarchy and push, rather than pull, data to the CPU. Our simulations, using four pointer-intensive applications, show that the push model can achieve between 4% and 30% larger reductions in execution time compared to the pull model.

This work was supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637, CDA-95-12356, and EIA-99-72879, Career Award MIP-97-02547, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Copyright ACM 2000. Appears in International Conference on Supercomputing (ICS) 2000.

1. INTRODUCTION

Since 1980, microprocessor performance has improved 60% per year, while memory access time has improved only 10% per year. As the performance gap between processors and memory continues to grow, techniques to reduce the effect of this disparity are essential to build a high performance computer system. The use of caches between the CPU and main memory is recognized as an effective method to bridge this gap [21]. If programs exhibit good temporal and spatial locality, the majority of memory requests can be satisfied by caches without having to access main memory. However, a cache's effectiveness can be limited for programs with poor locality.

Prefetching is a common technique that attempts to hide memory latency by issuing memory requests earlier than demand fetching a cache block when a miss occurs. Prefetching works very well for programs with regular access patterns. Unfortunately, many applications create sophisticated data structures using pointers and do not exhibit sufficient regularity for conventional prefetch techniques to exploit. Furthermore, prefetching pointer-intensive data structures can be limited by the serial nature of pointer dereferences, known as the pointer chasing problem, since the address of the next node is not known until the contents of the current node are accessible.

For these applications, performance may improve if lower levels of the memory hierarchy could dereference pointers and pro-actively push data up to the processor, rather than requiring the processor to fetch each node before initiating the next request. An oracle could accomplish this by knowing the layout and traversal order of the pointer-based data structure, and thus overlap the transfer of consecutively accessed nodes up the memory hierarchy. The challenge is to develop a realizable memory hierarchy that can obtain these benefits.

This paper presents a novel memory architecture to perform pointer dereferences in the lower levels of the memory hierarchy and push data up to the CPU. In this architecture, if the cache block containing the data corresponding to pointer p (*p) is found at a particular level of the memory hierarchy, the request for the next element (*(p+o)) is issued from that level. This allows the access for the next node to overlap with the transfer of the previous element. In this way, a series of pointer dereferences becomes a pipelined process rather than a serial process.

In this paper, we present results using specialized hardware to support the above push model for data movement in list-based programs. Our simulations using four pointer-intensive benchmarks show that conventional prefetching using the pull model can achieve speedups of 0% to 223%, while the new push model achieves speedups from 6% to 227%. More specifically, the push model reduces execution time by 4% to 30% more than the pull model.

The rest of this paper is organized as follows. Section 2 provides background information and motivation for the push model. Section 3 describes the basic push model and details of a proposed architecture. Section 4 presents our simulation results. Related work is discussed in Section 5 and we conclude in Section 6.

2. BACKGROUND AND MOTIVATION

Pointer-based data structures, often called linked data structures (LDS), present two challenges to any prefetching technique: address irregularity and data dependencies. To improve the memory system performance of many pointer-based applications, both of these issues must be addressed.

    list = head;
    while (list != NULL) {
        p = list->patient;      /* 1: traversal load */
        x = p->data;            /* 2: data load      */
        list = list->forward;   /* 3: recurrent load */
    }

    (a) source code

    1: lw $15, o1($24)
    2: lw $4,  o2($15)
    3: lw $24, o3($24)

    (b) LDS traversal kernel

    [Dependence graph among kernel load instances: each recurrent load instance (3a, 3b, ...) feeds the next recurrent load and the next iteration's traversal load (e.g., 3a feeds 3b and 1b), and each traversal load feeds that iteration's data load (e.g., 1b feeds 2b).]

    (c) data dependencies

Figure 1: Linked LDS Traversal

2.1 Address Irregularity

Prefetching LDS is unlike prefetching for array-based numeric applications, since the address sequence from LDS accesses does not have arithmetic regularity. Consider traversing a linear array of integers versus traversing a linked list. For linear arrays, the address of the array element accessed at iteration i is A_addr + 4*i, where A_addr is the base address of array A and i is the iteration number. Using this formula, the address of element A[i] at any iteration i is easily computed. Furthermore, the above formula produces a compact representation of array accesses, thus making it easier to predict future addresses. In contrast, LDS elements are allocated dynamically from the heap, and adjacent elements are not necessarily contiguous in memory. Therefore, traditional array-based address prediction techniques are not applicable to LDS accesses.
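To make the contrast concrete, the following small C sketch (an illustration, not taken from the paper) traverses the same values stored both as an array and as a linked list. The array loop can compute the address of any future element with simple arithmetic on the base address, while the list loop cannot form the address of node i+1 until node i has actually been loaded.

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    int main(void) {
        enum { N = 8 };
        int a[N];
        struct node *head = NULL;

        /* Build an array and a linked list holding the same values. */
        for (int i = N - 1; i >= 0; i--) {
            a[i] = i;
            struct node *n = malloc(sizeof *n);
            n->value = i;
            n->next = head;
            head = n;
        }

        /* Array: the address of element i is base + 4*i (for 4-byte ints),
           so the address of any future element is known ahead of time.  */
        for (int i = 0; i < N; i++)
            printf("a[%d] at %p = %d\n", i, (void *)&a[i], a[i]);

        /* Linked list: the address of the next node is a field of the
           current node, so it is unknown until the current node has been
           loaded (the pointer-chasing problem).                          */
        for (struct node *p = head; p != NULL; p = p->next)
            printf("node at %p = %d\n", (void *)p, p->value);

        return 0;
    }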

Instead of predicting future LDS addresses based on address regularity, Roth et al. [19] observed that the load instructions (program counters) that access LDS elements are themselves a compact representation of LDS accesses, called a traversal kernel. Consider the linked-list traversal example shown in Figure 1; the structure list is the LDS being traversed. Instructions 1 and 3 access list itself, while instruction 2 accesses another portion of the LDS through pointer p. These three instructions make up the list traversal kernel. Assembly code for this kernel is shown in Figure 1(b).

Consecutive iterations of the traversal kernel create data dependencies. Instruction 3 at iteration a produces an address consumed by itself and by instruction 1 in the next iteration b. Instruction 1 produces an address consumed by instruction 2 in the same iteration. This relation can be represented as a dependency tree, shown in Figure 1(c). Loads in LDS traversal kernels can be classified into three types: recurrent, traversal, and data. Instruction 3 (list = list->forward) generates the address of the next element of the list structure and is classified as a recurrent load. Instruction 1 (p = list->patient) produces addresses consumed by other pointer loads and is called a traversal load. All the loads in the leaves of the dependency tree, such as instruction 2 (p->time++) in this example, are data loads.

Roth et al. proposed dynamically constructing this kernel and repeatedly executing it independent of the CPU execution. This method overcomes the irregular address sequence problem. However, its effectiveness is limited because of the serial nature of LDS address generation.

2.2 Data Dependence

The second challenge for improving performance for LDS traversal is to overcome the serialization caused by data dependencies. In LDS traversal, a series of pointer dereferences is required to generate future element addresses. If the address of node i is stored in the pointer p, then the address of node i+1 is *(p+x), where x is the offset of the next field. The address *(p+x) is not known until the value of p is known. This is commonly called the pointer-chasing problem. In this scenario, the serial nature of the memory accesses can make it difficult to hide memory latencies longer than one iteration of computation.

The primary contribution of this paper is the development of a new model for data movement in linked data structures that reduces the impact of data dependencies. By abandoning the conventional pull approach of initiating all memory requests (demand fetch or prefetch) from the processor or upper level of the memory hierarchy, we can eliminate some of the delay in traversing linked data structures. The following section elaborates on the new model and a proposed implementation.

3. THE PUSH MODEL

Most existing prefetch techniques issue requests from the CPU level to fetch data from the lower levels of the memory hierarchy. Since pointer dereferences are required to generate addresses for successive prefetch requests (recurrent loads), these techniques serialize the prefetch process. The prefetch schedule is shown in Figure 2(a), assuming each recurrent load is an L2 cache miss. The memory latency is divided into six parts: r1, sending a request from L1 to L2; a1, L2 access; r2, sending a request from L2 to main memory; a2, memory access; x2, transferring a cache block back to L2; and x1, transferring a cache block back to L1. Using this timing, node i+1 arrives at the CPU r1+a1+r2+a2+x2+x1 cycles (the full memory access latency) after node i.

[Timing diagrams contrasting (a) traditional prefetch data movement, in which each node access pays the full r1+a1+r2+a2+x2+x1 latency, with (b) push model data movement, in which the main-memory access for the next node overlaps the transfer of the previous node; (c) shows the memory system model.]

Figure 2: Data Movement for Linked Data Structures

In the push model, pointer dereferences are performed at the lower levels of the memory hierarchy by executing traversal kernels in micro-controllers associated with each memory hierarchy level. This eliminates the request from the upper levels to the lower level, and allows us to overlap the data transfer of node i with the RAM access of node i+1. For example, assume that the request for node i accesses main memory at time t. The result is known to the memory controller at time t+a2, where a2 is the memory access latency. The controller then issues the request for node i+1 to main memory at time t+a2, thus overlapping this access with the transfer of node i back to the CPU. This process is shown in Figure 2(b). Note that in this simple model we ignore the possible overhead for address translation; Section 4 examines this issue. In this way, we are able to request subsequent nodes in the LDS much earlier than in a traditional system. In the push architecture, node i+1 arrives at the CPU a2 cycles after node i. From the CPU standpoint, the latency between node accesses is reduced from r1+a1+r2+a2+x2+x1 to a2. Assuming a 100-cycle overall memory latency and a 60-cycle memory access time, the per-node latency is potentially reduced by 40%. The actual speedup for real applications depends on (i) the contribution of LDS accesses to total memory latency, (ii) the percentage of LDS misses that are L2 misses, and (iii) the amount of computation in the program that can be overlapped with the memory latency.
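The pipelining effect behind these numbers can be reproduced with a small back-of-the-envelope calculation. The C sketch below is illustrative only: it splits the 100-cycle latency among the six components in an arbitrary but plausible way (a2 = 60 as above) and prints when node k becomes available to the CPU under each model. Pull pays the full round trip for every node, while push pays it once and then only a2 per additional node.

    #include <stdio.h>

    int main(void) {
        /* Example split of the six latency components from Figure 2
           (values are assumptions, chosen to sum to 100 with a2 = 60). */
        int r1 = 5, a1 = 15, r2 = 10, a2 = 60, x2 = 5, x1 = 5;
        int full = r1 + a1 + r2 + a2 + x2 + x1;   /* full L2-miss latency */

        for (int k = 1; k <= 4; k++) {
            int pull = k * full;              /* each node waits for the previous one      */
            int push = full + (k - 1) * a2;   /* later accesses overlap earlier transfers  */
            printf("node %d: pull %3d cycles, push %3d cycles\n", k, pull, push);
        }
        return 0;
    }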

In the push model, a controller, called a prefetch engine, is attached to each level of the memory hierarchy, as shown in Figure 3. The prefetch engines (PFEs) execute traversal kernels independent of CPU execution. Cache blocks accessed by the prefetch engines at the L2 or main memory levels are pushed up to the CPU and stored either in the L1 cache or a prefetch buffer. The prefetch buffer is a small, fully-associative cache, which can be accessed in parallel with the L1 cache. The remainder of this section describes the design details of this novel push architecture.

[Block diagram: a prefetch engine sits beside the L1 cache and prefetch buffer, beside the L2 cache, and beside main memory; prefetch requests flow down the hierarchy while prefetched blocks flow up the L2 and memory buses toward the CPU.]

Figure 3: The Push Architecture

3.1 A Push Model Implementation

The function of the prefetch engine is to execute traversal kernels. In this paper we focus on a prefetch engine designed to traverse linked lists. Support for more sophisticated data structures (e.g., trees) is an important area that we are currently investigating, but it remains future work.

Consider the linked-list traversal example in Figure 1. The prefetch engine is designed to execute iterations of the kernel loop until the returned value of the recurrent load (instruction 3: list = list->forward) is NULL. Instances of the recurrent load (3a, 3b, 3c, etc.) form a dependence chain which is the critical path in the kernel loop. To allow the prefetch engine to move ahead along this critical path as fast as possible, the prefetch engine must support out-of-order execution. General CPUs implement complicated register renaming and wakeup logic [14; 22] to achieve out-of-order execution. However, the complexity and cost of these techniques is too high for our prefetch engine design.

[Diagram: thread 1 (recurrent-based loads) issues instances 3a, 3b, 3c, ... and 1a, 1b, 1c, ... in order; thread 2 (traversal-based loads) issues instances 2a, 2b, 2c, ... as their producing traversal loads complete.]

Figure 4: Kernel Execution

To avoid complicated logic, the prefetch engine executes a single traversal kernel in two separate threads, as shown in Figure 4. The first thread executes recurrent-based loads, which are loads dependent on recurrent loads, e.g., instruction 1 (p = list->patient) and instruction 3 (list = list->forward). This thread executes in order and starts a new iteration once a recurrent load completes. The other thread executes traversal-based loads, which are loads that depend on traversal loads, such as instruction 2 (p->time++). Because traversal loads from different iterations (e.g., 1a and 1b) are allowed to complete out of order, since they can hit in different levels of the memory hierarchy, traversal-based loads should be able to execute out of order.

The internal structure of the prefetch engine is shown in Figure 5. There are four main hardware structures in the prefetch engine: (1) the Recurrent Buffer (RB), (2) the Producer-Consumer Table (PCT), (3) the Base Address Buffer (BAB), and (4) the Prefetch Request Queue (PRQ). The remainder of this section describes its basic operation.

The Recurrent Buffer (RB) contains the type, PC, and offset of recurrent-based loads. The type field indicates if the corresponding load is a recurrent load. Traversal-based loads are stored in the Producer-Consumer Table (PCT). This structure is similar to the Correlation Table in [19]; it contains the load's offset, PC, and the producer's PC. The PCT is indexed using the producer PC.

[Block diagram: the Recurrent Buffer feeds an instruction queue whose entries are added to the Base register (loaded from the Base Address Buffer) to form prefetch requests; traversal-based loads are looked up in the producer-consumer table (PCT) by producer PC; requests, carrying their type, PC, and address, pass through a TLB into the prefetch request queue (PRQ), alongside a request buffer holding requests from the upper level.]

Figure 5: The Prefetch Engine
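As a rough software analogue of the structures in Figure 5 (field names and offset values here are assumptions, not the paper's exact layout), the sketch below declares the four structures and loads them with the kernel of Figure 1: the recurrent-based loads 1 and 3 go into the RB, and the traversal-based load 2 is entered in the PCT under its producer's PC.

    #include <stdint.h>
    #include <stdio.h>

    enum load_type { NON_RECURRENT, RECURRENT };

    struct rb_entry  { enum load_type type; uint32_t pc; int32_t offset; };  /* Recurrent Buffer        */
    struct pct_entry { uint32_t producer_pc; uint32_t pc; int32_t offset; }; /* Producer-Consumer Table */
    struct prq_entry { enum load_type type; uint32_t pc; uint64_t addr; };   /* Prefetch Request Queue  */

    struct prefetch_engine {
        struct rb_entry  rb[8];    int rb_n;
        struct pct_entry pct[8];   int pct_n;
        uint64_t         bab[4];   int bab_n;   /* Base Address Buffer (LDS root addresses) */
        struct prq_entry prq[16];  int prq_n;
    };

    int main(void) {
        struct prefetch_engine pfe = {0};

        /* Kernel from Figure 1; o1, o2, o3 are illustrative field offsets. */
        pfe.rb[pfe.rb_n++]   = (struct rb_entry){ NON_RECURRENT, /*pc*/ 1, /*o1*/ 0 };   /* p = list->patient    */
        pfe.rb[pfe.rb_n++]   = (struct rb_entry){ RECURRENT,     /*pc*/ 3, /*o3*/ 8 };   /* list = list->forward */
        pfe.pct[pfe.pct_n++] = (struct pct_entry){ /*producer*/ 1, /*pc*/ 2, /*o2*/ 4 }; /* load through p       */

        printf("RB entries: %d, PCT entries: %d\n", pfe.rb_n, pfe.pct_n);
        return 0;
    }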

We initialize the prefetch engine by downloading a traversal kernel to the RB and PCT through a memory mapped interface. Prior to executing an LDS access (e.g., a list traversal), the program starts the PFE by storing the root address of the LDS in the Base Address Buffer (BAB). We accomplish this by assuming a "flavored" load that annotates a load with additional information, in this case that the data being loaded is the root address of an LDS. When the PFE detects that the BAB is not empty and the instruction queue is empty, it loads the first element of the BAB into the Base register and fetches the recurrent-based instructions from the RB into the instruction queue. At each cycle, the first instruction of the instruction queue is processed, and a new prefetch request is formed by adding the instruction's offset to the value in the Base register. The prefetch request is then stored in the prefetch request queue (PRQ) along with its type and PC.

When a traversal load completes, its PC is used to search the PCT. For each matching entry, the effective address of the traversal-based load is generated by adding its offset to the result of the probing traversal load. In this way, the prefetch engine allows the recurrent-based and traversal-based threads to execute independently.

We assume that each level of the memory hierarchy has a small buffer that stores requests from the CPU, the PFE, and cache coherence transactions (e.g., snoop requests) if appropriate. By maintaining information on the type of access, the cache controller can provide the appropriate data required for the PFE to continue execution. For example, if the access is a recurrent prefetch, the resulting data (the address of the next node) is extracted and stored in the BAB. If the access is a traversal load prefetch, the PC is used to probe the PCT and the resulting data is added to the offset from the PCT entry. Before serving any request, the controller checks if a request to the same cache block already exists in the request buffer. If such a request exists, the controller simply drops the new request.
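The controller's type-based handling described above can be summarized in a few lines of C. This is a conceptual model only, with invented function and type names: it shows how a completed recurrent prefetch feeds the BAB while a completed traversal prefetch probes the PCT to generate traversal-based requests.

    #include <stdint.h>
    #include <stdio.h>

    enum req_type { RECURRENT_PREFETCH, TRAVERSAL_PREFETCH, DATA_PREFETCH };

    struct pct_entry { uint32_t producer_pc; uint32_t pc; int32_t offset; };

    /* Stubs standing in for hardware actions. */
    static void bab_push(uint64_t next_node_addr) {
        printf("BAB <- next node 0x%llx\n", (unsigned long long)next_node_addr);
    }
    static void issue_prefetch(enum req_type t, uint32_t pc, uint64_t addr) {
        printf("issue type %d, pc %u, addr 0x%llx\n", (int)t, (unsigned)pc, (unsigned long long)addr);
    }

    /* Invoked when a prefetch issued by the PFE returns with its data. */
    static void on_prefetch_complete(enum req_type type, uint32_t pc, uint64_t loaded_value,
                                     const struct pct_entry *pct, int pct_n) {
        if (type == RECURRENT_PREFETCH) {
            bab_push(loaded_value);          /* loaded value is the next node's address      */
        } else if (type == TRAVERSAL_PREFETCH) {
            for (int i = 0; i < pct_n; i++)  /* each PCT match yields a dependent address    */
                if (pct[i].producer_pc == pc)
                    issue_prefetch(DATA_PREFETCH, pct[i].pc, loaded_value + pct[i].offset);
        }
    }

    int main(void) {
        struct pct_entry pct[] = { { 1, 2, 4 } };  /* load 2 consumes load 1's result + offset o2 */
        on_prefetch_complete(TRAVERSAL_PREFETCH, 1, 0x1000, pct, 1);
        on_prefetch_complete(RECURRENT_PREFETCH, 3, 0x2000, pct, 1);
        return 0;
    }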

Our data movement model also places new demands on the memory hierarchy to correctly handle cache blocks that were not requested by the processor. In particular, when all memory requests initiate from the CPU level, a returning cache block is guaranteed to find a corresponding MSHR entry that contains information about the request and provides buffer space for the data if it must compete for cache ports. In our model, it is no longer guaranteed that an arriving cache block will find a corresponding MSHR. One could exist if the processor executed a load for the corresponding cache block. However, in our implementation, if there is no matching MSHR the returned cache block can be stored in a free MSHR or in the request buffer. If neither one is available, this cache block is dropped. Dropping these cache blocks does not affect program correctness, since they are the result of a non-binding prefetch operation, not a demand fetch from the processor. Note that the push model requires the transferred cache blocks to also transfer their addresses up the memory hierarchy. This is similar to a coherence snoop request traversing up the memory hierarchy.

3.2 Interaction Among Prefetch Engines

Having discussed the design details of individual prefetch engines, we now address the interaction between prefetch engines at the various levels. Recall that the root address of an LDS is conveyed to the prefetch engine through a special instruction, which we call ldroot. If the memory access of a ldroot instruction is a hit in the L1 cache, the loaded value is stored in the BAB of the prefetch engine at the first level. This triggers the prefetching activity in the L1 prefetch engine.

If a ldroot instruction or the prefetch for a recurrent load causes an L1 cache miss, the cache miss is passed down to the L2 like other cache misses, but with the type bits set to RP (recurrent prefetch). If this miss is satisfied by the L2, the cache controller will detect the RP type and store the result in the BAB. The prefetch engine at the L2 level will find that the BAB is not empty and start prefetching. If the L1 cache miss is the result of a recurrent load prefetch, the L1 PFE stops execution and the L2 PFE begins execution. Data provided by the L2 PFE is placed in the prefetch buffer to avoid polluting the L1 data cache.

The interaction between the L2 PFE and the memory level PFE is similar to the L1/L2 PFE interaction. The L2 PFE suspends execution when it incurs a miss on a recurrent load, and the memory PFE begins execution when it observes a non-empty BAB. Since the data provided by the memory PFE passes the L2 cache, we deposit the data in both the L2 cache and the prefetch buffer. The L2 cache is generally large enough to contain the additional data blocks and not suffer from severe pollution effects.
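From the program's point of view, starting a push-model traversal only requires marking the load of the list head as the root of an LDS. The fragment below is a hypothetical illustration of that software interface: LDROOT is not a real primitive, just a stand-in for whatever annotated ("flavored") load the compiler or programmer would emit.

    #include <stdio.h>

    /* Hypothetical annotation for the flavored root load; a real system
       would lower this to a special load instruction recognized by the PFE. */
    #define LDROOT(ptr) (ptr)   /* placeholder: behaves like an ordinary load here */

    struct patient { int time; int data; };
    struct list_node {
        struct patient   *patient;
        struct list_node *forward;
    };

    int sum_times(struct list_node *head) {
        /* The ldroot-style load tells the L1 prefetch engine where the LDS
           starts; the engine then runs the downloaded kernel on its own.   */
        struct list_node *list = LDROOT(head);
        int total = 0;
        while (list != NULL) {
            total += list->patient->time;   /* traversal load + data load */
            list = list->forward;           /* recurrent load             */
        }
        return total;
    }

    int main(void) {
        struct patient p1 = { 5, 0 }, p2 = { 7, 0 };
        struct list_node n2 = { &p2, NULL }, n1 = { &p1, &n2 };
        printf("total time = %d\n", sum_times(&n1));
        return 0;
    }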

4. EVALUATION

This section presents simulation results that demonstrate the effectiveness of the push model for a set of four list-based benchmarks. First, we describe our simulation framework. This is followed by an analysis of a microbenchmark for both the push model and the conventional pull model for prefetching. We finish by evaluating the two models for four macrobenchmarks and examining the impact of address translation overhead.

4.1 Methodology

To evaluate the push model we modified the SimpleScalar simulator [3] to also simulate our prefetch engine implementation. SimpleScalar models a dynamically scheduled processor using a Register Update Unit (RUU) and a Load/Store Queue (LSQ) [22]. The processor pipeline stages are: (1) Fetch: fetch instructions from the program instruction stream; (2) Dispatch: decode instructions, allocate RUU and LSQ entries; (3) Issue/Execute: execute ready instructions if the required functional units are available; (4) Writeback: supply results to dependent instructions; and (5) Commit: commit results to the register file in program order, freeing RUU and LSQ entries.

Our base processor is a 4-way superscalar, out-of-order processor. The memory system consists of a 32KB, 32-byte cache line, 2-way set-associative first level data cache and a 512KB, 64-byte cache line, 4-way set-associative second level cache. Both are lock-up free caches that can have up to eight outstanding misses. The L1 cache has 2 read/write ports and can be accessed in a single cycle. We also simulate a 32-entry fully-associative prefetch buffer with four read/write ports that can be accessed in parallel with the L1 data cache. The second level cache is pipelined with an 18 cycle round-trip latency. Latency to main memory after an L2 miss is 66 cycles. The CPU has 64 RUU entries and a 16 entry load-store queue, a 2-level branch predictor with a total of 8192 entries, and a perfect instruction cache.

We simulate the pull model by using only a single prefetch engine attached to the L1 cache. This engine executes the same traversal kernels used by the push architecture. However, prefetch requests that miss in the L1 cache must wait for the requested cache block to return from the lower levels before the fetch for the next node can be initiated.

4.2 Microbenchmark Results

We begin by examining the performance of both prefetching models (push and pull) using a microbenchmark. Our microbenchmark simply traverses a list of 32,768 32-byte nodes 10 times (see Figure 6). Therefore, all LDS loads are recurrent loads. We also include a small computation loop between each node access. This allows us to evaluate the effects of various ratios between computation and memory access. To eliminate branch effects, we use a perfect branch predictor for these microbenchmark simulations.

    for (i = 0; i < 10; i++) {
        node = head;
        while (node) {
            for (j = 0; j < iters; j++)
                sum = sum + j;
            node = node->next;
        }
    }

Figure 6: Microbenchmark Source Code

Figure 7 shows the speedup for each prefetch model over the base system, versus compute iterations on the x-axis. The thick solid line indicates the speedup of a perfect memory system where all references hit in the L1 cache. In general, the results for the push model (thin solid line) match our intuition: large speedups (over 100%) are achieved for low computation to memory access ratios, and the benefits decrease as the amount of computation increases relative to memory accesses. However, the increase in computation is not the only cause of decreased benefits: cache pollution is also a factor.

[Line plot of speedup (%) versus computation per node access (iterations) for the pull model, the push model, and a perfect memory system.]

Figure 7: Push vs. Pull: Microbenchmark Results

[Two panels (push model and pull model) plotting miss ratio, prefetch merge, and prefetch hit percentages versus computation per node access (iterations).]

Figure 8: Microbenchmark Prefetch Performance

The dashed line in Figure 7 represents the pull model. We see the same shape as the push model, but at a different scale. It takes much longer than the push model for the pull model to reach its peak benefit. This is a direct consequence of the serialization caused by pointer dereferences; the next node address is not available until the current node is fetched from the lower levels of the memory hierarchy.

We consider three distinct phases in Figure 7. Initially, the push model achieves nearly 100% speedup. While this is not close to the maximum achievable, it is significant. These speedups are a result of cache misses merging with the prefetch requests (see Figure 8). In contrast, the pull model does not exhibit any speedup for low computation per node. Although Figure 8 shows that 100% of the prefetch requests merge with cache misses, the prefetch is issued too late to produce any benefit.

As the computation per node access increases, the prefetch engine is able to move further ahead in the list traversal, eventually eliminating many cache misses. This is shown in Figure 8 by the increase in prefetch buffer hits and decrease in miss ratio. Although the miss ratio and prefetch merges plateau, performance continues to increase due to the prefetch merges hiding more and more of the overall memory latency. Eventually, there is enough computation that prefetch requests no longer merge; instead they result in prefetch hits. This is shown by the spike in prefetch hits (the dashed line) in Figure 8, and it occurs at the same iteration count as the maximum speedup shown in Figure 7.

Further increases in computation cause the prefetch engine to run too far ahead, thus causing pollution. As new cache blocks are brought closer to the processor, they replace older blocks. For sufficiently long delays between node accesses, the prefetched blocks begin replacing other prefetched blocks not yet accessed by the CPU. This conclusion is supported by the decrease in prefetch buffer hits shown in Figure 8 and an increase in the L2 miss ratio shown in Figure 9.

Although the L1 miss ratio returns to its base value, we continue to see significant speedups. These performance gains are due to prefetch benefits in the L2 cache. Recall that we place prefetched cache blocks into both the L2 and the prefetch buffer. Even though the prefetch engine is so far ahead that it pollutes the prefetch buffer, the L2 still provides benefits. Figure 9 shows that the global L2 miss ratio remains below that of the base system. However, the miss ratio increases, and hence the speedup decreases, as the computation per node increases because of L2 cache pollution.

[Plot of the L2 global miss rate versus computation per node access (iterations) for the base, pull, and push configurations.]

Figure 9: Push vs. Pull: Microbenchmark L2 Cache Global Miss Ratio

The results obtained from this microbenchmark clearly show some of the potential benefits of the push model. However, there are some opportunities for further improvement, such as developing a technique to "throttle" the prefetch engine to match the rate of data consumption by the processor, thus avoiding pollution. We leave investigation of this as future work.

4.3 Macrobenchmark Results

While the above microbenchmark results provide some insight into the performance of the push and pull models, it is difficult to extend those results to real applications. Variations in the type of computation and its execution on dynamically scheduled processors can make it very difficult to precisely predict memory latency contributions to overall execution time.

To address this issue, we use four list-based macrobenchmarks: em3d, health, and mst from the Olden pointer-intensive benchmark suite [18], and Rayshade [9]. Em3d simulates electro-magnetic fields. It constructs a bipartite graph and processes lists of interacting nodes. Health implements a hospital check-in/check-out system. The majority of cache misses come from traversing several long linked lists. Mst finds the minimum spanning tree of a graph. The main data structure is a hash table; lists are used to implement the buckets in the hash table. Rayshade is a real-world graphics application that implements a raytracing algorithm. Rayshade divides space into grids, and objects in each grid are stored in a linked list. If a ray intersects with a grid, the entire object list in that grid is traversed.

[Bar chart of the reduction in execution time under the pull and push models for Health, Mst, Em3d, and Rayshade.]

Figure 10: Push vs. Pull: Macrobenchmark Results

Figure 10 shows the speedup for both push and pull prefetching using our benchmarks on systems that incur a one cycle overhead for address translation with a perfect TLB. We examine the effects of TLB misses later in this section. From Figure 10 we see that the push model outperforms the pull model for all four benchmarks. Rayshade shows the largest benefits, with speedups of 227% for the push model and 223% for the pull model. Health achieves a 41% improvement in execution time for the push model while achieving only 11% for the pull model. Similarly, for the push model, em3d and mst achieve 6% and 16%, respectively, while neither shows any benefit using the pull model.

We note that the results for our pull model are not directly comparable to previous pull-based prefetching techniques, since we do not implement the same mechanisms or use the exact same memory system parameters. Recall that we execute a simple traversal kernel on the L1 prefetch engine to issue prefetch requests. This kernel executes independent of the CPU, and does not synchronize in any way. Most previous techniques are "throttled" by the CPU execution, since they prefetch a specific number of iterations ahead. Furthermore, our kernel may prefetch incorrect data. For example, this can occur in mst since the linked list is searched for a matching key and the real execution may not traverse the entire list. Similarly, in em3d each node in the graph contains a variable size array of pointers to/from other nodes in the graph. Our kernel assumes the maximum array size, and hence may try to interpret non-pointer data as pointers and issue prefetch requests for this incorrect data.

To gain more insight into the behavior of the push architecture, we analyze prefetch coverage, pollution, and prefetch request distribution at different levels of the memory hierarchy. Prefetch coverage measures the fraction of would-be read misses serviced by the prefetch mechanism. To determine the impact of cache pollution, we count the number of prefetched blocks replaced from the prefetch buffer before they are ever referenced by the CPU. Finally, the request distribution indicates where in the memory hierarchy prefetch requests are initiated.

Prefetch Coverage

We examine prefetch coverage by categorizing successful prefetch requests according to where in the memory hierarchy the would-be cache miss "meets" the prefetched data. The three categories are partially hidden, hidden L1, and hidden L2. Partially hidden misses are those would-be cache misses that merge with a prefetch request somewhere in the memory hierarchy. Hidden L1 misses are those L1 misses now satisfied by the prefetch buffer. Similarly, hidden L2 misses are those references that were L2 misses without prefetching, but are now satisfied by prefetched data in the L2 cache. Note that, since we deposit data in both the L2 cache and the prefetch buffer, blocks replaced from the prefetch buffer can still be resident in the L2 cache.
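The bookkeeping behind these metrics can be phrased compactly. The sketch below is only one way to express it, with an invented enum and function: each demand reference that would have missed is classified by where it met prefetched data, and a separate counter tracks prefetch-buffer blocks evicted before any CPU use, which is the pollution measure defined above.

    #include <stdio.h>

    enum coverage {
        NOT_COVERED,       /* would-be miss saw no benefit from prefetching          */
        PARTIALLY_HIDDEN,  /* merged with an in-flight prefetch request              */
        HIDDEN_L1,         /* satisfied by the prefetch buffer next to L1            */
        HIDDEN_L2          /* former L2 miss now satisfied by prefetched data in L2  */
    };

    static enum coverage classify(int merged_with_prefetch, int hit_prefetch_buffer,
                                  int was_l2_miss, int hit_l2_prefetched_block) {
        if (hit_prefetch_buffer)                    return HIDDEN_L1;
        if (was_l2_miss && hit_l2_prefetched_block) return HIDDEN_L2;
        if (merged_with_prefetch)                   return PARTIALLY_HIDDEN;
        return NOT_COVERED;
    }

    int main(void) {
        long counts[4] = {0};
        long polluted = 0;               /* prefetched blocks evicted before first use */

        counts[classify(1, 0, 0, 0)]++;  /* merged with an outstanding prefetch        */
        counts[classify(0, 1, 0, 0)]++;  /* found in the prefetch buffer               */
        counts[classify(0, 0, 1, 1)]++;  /* former L2 miss hit prefetched data in L2   */
        polluted++;                      /* one block replaced before being used       */

        printf("partial %ld, L1 %ld, L2 %ld, polluted %ld\n",
               counts[PARTIALLY_HIDDEN], counts[HIDDEN_L1], counts[HIDDEN_L2], polluted);
        return 0;
    }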

[Stacked bars per benchmark showing the fraction of would-be misses that are partially hidden, hidden in L1, and hidden in L2 under the push model.]

Figure 11: Prefetch Coverage

Figure 11 shows the prefetch coverage for our benchmarks. Our results show that for health and mst, the majority of load misses are only partially hidden, 47% and 35%, respectively. These benchmarks do not have enough computation to overlap with the prefetch latency. Rayshade is the other extreme, with no partially hidden misses, but 60% of L2 misses are eliminated. Em3d shows very little benefit, with only a small number of L2 misses eliminated.

Prefetch Buffer Pollution

To eliminate all L1 misses, a cache block must arrive in the prefetch buffer before the CPU requires the data. Furthermore, the cache block must remain in the prefetch buffer until the CPU accesses the data. If the prefetch engine runs too far ahead, it may replace data before it is accessed by the CPU.

Figure 12 shows the percentage of blocks placed into the prefetch buffer that are replaced without the CPU accessing the data, for the push model. From this data we see that mst, em3d, and rayshade all replace nearly 100% of the blocks placed in the prefetch buffer. Health suffers very little pollution.

[Bar chart of the fraction of prefetched blocks replaced from the prefetch buffer before use, for Health, Mst, Em3d, and Rayshade.]

Figure 12: Prefetch Buffer Pollution

These replacements can be caused by the prefetch engine running too far ahead and replacing blocks before they are used, or by prefetching the wrong cache blocks. Recall that for mst and em3d our traversal kernel is approximate, and we could be fetching the wrong cache blocks. This appears to be the case, since neither of these benchmarks showed many completely hidden misses (see Figure 11). In contrast, rayshade shows a large number of hidden L2 misses. Therefore, we believe the replacements it incurs are caused by the prefetch engine running too far ahead, but not so far ahead as to pollute the L2 cache.

Prefetch Request Distribution

The other important attribute of the push model is the prefetch request distribution at different levels of the memory hierarchy. The light bar in Figure 13 shows the percentage of prefetch operations issued at each level of the memory hierarchy, and the dark bar shows the percentage of redundant prefetch operations. A redundant prefetch requests a cache block that already resides in a higher memory hierarchy level than the level that issued the prefetch.

[Per-benchmark bar charts (Health, Mst, Em3d, Rayshade) of total and redundant prefetch operations issued at the L1, L2, and main memory levels.]

Figure 13: Prefetch Request Distribution

From Figure 13 we see that health and rayshade issue most prefetches at the main memory level. Em3d issues half from the L2 cache and half from main memory. Finally, mst issues over 30% of its prefetches from the L1 cache, 20% from the L2 and 60% from main memory. Health and rayshade traverse the entire list each time it is accessed. Their lists are large enough that capacity misses likely dominate. In contrast, mst performs a hash lookup and may stop before accessing each node, hence conflict misses can be a large component of all misses. Furthermore, the hash lookup is data dependent and blocks may not be replaced between hash lookups. In this case, it is possible for a prefetch traversal to find the corresponding data in any level of the cache hierarchy, as shown in Figure 13.

Mst also exhibits a significant number of redundant prefetch operations. Most of the L1 prefetches are redundant, while very few of the memory level prefetches are redundant. For health and em3d, 15% and 7%, respectively, of all prefetches are redundant and issued from the memory level. Rayshade incurs minimal redundant prefetch requests. Redundant prefetches can prevent the prefetch engine from running ahead of the CPU, and they compete with the CPU for cache, memory, and bus cycles. We are currently investigating techniques to reduce the number of redundant prefetch operations.

Impact of Address Translation

Our proposed implementation places a TLB with each prefetch engine. However, the results presented thus far assume a perfect TLB. To model overheads for TLB misses, we assume hardware managed TLBs with a 30 cycle miss penalty [19]. When a TLB miss occurs at any level, all the TLBs are updated by the hardware miss handler.

For most of our benchmarks, the TLB miss ratios are extremely low, and hence TLB misses do not contribute significantly to execution time. Health is the only benchmark with a noticeable TLB miss ratio (17% for a 32-entry fully-associative TLB). Therefore, we present results for only this benchmark.

[Bar chart of the reduction in execution time for health under the pull and push models, for TLB sizes of 32, 64, 128, and 256 entries and a perfect TLB.]

Figure 14: Impact of Address Translation for Health

Figure 14 shows the speedup achieved for various TLB sizes, including a perfect TLB that never misses. From this data we see that smaller TLB sizes reduce the effectiveness of both push and pull prefetching. However, the push model still achieves over 30% speedup even for a 32-entry TLB, while the pull model achieves less than 5%. Increasing the TLB size increases the benefits for both push and pull prefetching.

5. RELATED WORK

The early data prefetching research, both in hardware and software, focused on regular applications. On the hardware side, Jouppi proposed the use of stream buffers to prefetch sequential streams of cache lines based on cache misses [7]. The original stream buffer design can only detect unit stride. Palacharla et al. [15] extend this mechanism to detect non-unit strides off-chip. Baer et al. [2] proposed the use of a reference prediction table (RPT) to detect the stride associated with load/store instructions. The RPT is accessed ahead of the regular program counter by a look-ahead program counter. When the look-ahead program counter finds a matching stride entry in the table, it issues a prefetch. The Tango system [16] extends Baer's scheme to issue prefetches more effectively on one processor. On the software side, Mowry et al. [13] proposed a compiler algorithm to insert prefetch instructions in scientific computing code. They improve upon the previous algorithms [5; 8; 17] by performing locality analysis to avoid unnecessary prefetches.

Correlation-based prefetching is another class of hardware prefetching mechanisms. It uses the current state of the address stream to predict future references. Joseph et al. [6] build a Markov model to approximate the miss reference string using transition frequencies. This model is realized in hardware as a prediction table where each entry contains an index address and four prioritized predictions. Alexander et al. [1] also implement a table-based prediction scheme similar to [6]. However, it is implemented at the main memory level to prefetch data from the DRAM array to SRAM prefetch buffers on the DRAM chip. Correlation-based prefetching is able to capture complex access patterns, but it has three shortcomings: (i) it cannot predict future references as far ahead of the program execution as the push architecture; (ii) the prefetch prediction accuracy decreases if applications have dynamically changing data structures or LDS traversal orders; and (iii) to achieve good prediction accuracy, it requires a prediction table proportional to the working set size.
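As a rough illustration of the correlation-based idea (a generic sketch, not the specific design of [6] or [1]), the table below records, for each miss address, the address that most recently followed it in the miss stream and uses that single stored successor as the prediction. A real Markov predictor keeps several prioritized successors per entry, but the growth of the table with the miss working set, one of the shortcomings noted above, is already visible here.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024   /* grows with the miss working set in a real design */

    struct corr_entry { uint64_t miss_addr; uint64_t next_addr; int valid; };
    static struct corr_entry table[TABLE_SIZE];

    static unsigned idx(uint64_t addr) { return (addr >> 6) % TABLE_SIZE; }  /* 64-byte blocks */

    /* Record the observed successor of the previous miss and return a
       prediction (0 if none) for what should follow the current miss.  */
    static uint64_t observe_miss(uint64_t prev_miss, uint64_t curr_miss) {
        if (prev_miss) {
            struct corr_entry *p = &table[idx(prev_miss)];
            p->miss_addr = prev_miss;
            p->next_addr = curr_miss;
            p->valid = 1;
        }
        struct corr_entry *e = &table[idx(curr_miss)];
        return (e->valid && e->miss_addr == curr_miss) ? e->next_addr : 0;
    }

    int main(void) {
        uint64_t misses[] = { 0x1000, 0x5040, 0x9080, 0x1000, 0 };
        uint64_t prev = 0;
        for (int i = 0; misses[i]; i++) {
            uint64_t pred = observe_miss(prev, misses[i]);
            printf("miss 0x%llx -> predict 0x%llx\n",
                   (unsigned long long)misses[i], (unsigned long long)pred);
            prev = misses[i];
        }
        return 0;
    }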

Another class of data prefetching mechanisms focuses specifically on pointer-intensive applications. Data structure information is used to construct solutions. The Spaid scheme [10] proposed by Lipasti et al. is a compiler-based pointer prefetching mechanism. It inserts prefetch instructions for the objects pointed to by pointer arguments at call sites. Luk et al. [11] proposed three compiler-based prefetching algorithms. Greedy prefetching schedules prefetch instructions for all recurrent loads a single instance ahead. History-pointer prefetching modifies source code to create pointers that provide direct access to non-adjacent nodes based on previous traversals; this approach is only effective for applications that have static structures and traversal order, and it incurs both space and execution overhead. Data-linearization prefetching maps heap-allocated nodes that are likely to be accessed close together in time into contiguous memory locations, so the future access addresses can be correctly predicted if the creation order is the same as the traversal order.

Chilimbi et al. [4] present another class of software approaches to pointer-intensive applications. They proposed to perform cache conscious data placement considering several cache parameters such as the cache block size and associativity. The reorganization process happens at run time. This approach greatly improves data spatial and temporal locality. The shortcoming of this approach is that it can incur high overhead for dynamically changing structures, and it cannot hide latency from capacity misses.

Zhang et al. [23] proposed a hardware prefetching scheme for irregular applications in shared-memory multiprocessors. This mechanism uses object information to guide prefetching. Programs are annotated, either by programmers or the compiler, to bind together groups of data, e.g., fields in a record or two records linked by a pointer. Groups of data are then prefetched under hardware control. The constraint of this scheme is that accessed fields in a record need to be contiguous in memory. Mehrotra et al. [12] extend stride detection schemes to capture both linear and recurrent access patterns.

Recently, Roth et al. [19] proposed a dependence based prefetching mechanism (DBP) that dynamically captures LDS traversal kernels and issues prefetches. In their scheme, they only prefetch a single instance ahead of a given load, and wavefront tree prefetching is performed regardless of the tree traversal order. The proposed push architecture allows the prefetching engine to run far ahead of the processor.

In their subsequent work [20], Roth et al. proposed a jump-pointer prefetching framework similar to the history-pointer prefetching in [11]. They provide three implementations: software-only, hardware-only, and a cooperative software/hardware technique. The performance of these three implementations varies across the benchmarks they tested. For a few of the benchmarks which have static structure, traversal order, and multiple passes over the LDS, jump-pointer prefetching can achieve high speedup over the DBP mechanism. However, for applications that have dynamically changing behavior, jump-pointer prefetching is not effective and can sometimes even degrade performance. The proposed push architecture can give more stable speedup over the DBP mechanism and adjust the prefetching strategy according to program attributes.

6. CONCLUSION

This paper presented a novel memory architecture designed specifically for linked data structures. This new architecture employs a new data movement model for linked data structures by pushing cache blocks up the memory hierarchy. The push model performs pointer dereferences at various levels in the memory hierarchy. This decouples the pointer dereference (obtaining the next node address) from the transfer of the current node up to the processor, and allows implementations to pipeline these two operations.

Using a set of four macrobenchmarks, our simulations show that the push model can improve execution time by 6% to 227%, and outperforms a conventional pull-based prefetch model that achieves 0% to 223% speedups. Our results using a microbenchmark show that for programs with little computation between node accesses, the push model significantly outperforms a conventional pull-based prefetch model. For programs with larger amounts of computation, the pull model is better than the push model.

We are currently investigating several extensions to this work. First is a technique to "throttle" the push prefetch engine, thus preventing it from running too far ahead of the actual computation. We are also investigating techniques to eliminate redundant prefetch operations, and the use of a truly programmable prefetch engine that may allow us to eliminate incorrect prefetch requests (e.g., in em3d). Finally, we are investigating support for more sophisticated pointer-based data structures, such as trees and graphs.

As the impact of memory system performance on overall program performance increases, new techniques to help tolerate memory latency become ever more critical. This is particularly true for programs that utilize pointer-based data structures with little computation between node accesses, since pointer dereferences can serialize memory accesses. For these programs, the push model described in this paper can dramatically improve performance.

7. REFERENCES

[1] T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory system. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture, February 1996.

[2] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.

[3] D. C. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors: the SimpleScalar tool set. Technical Report 1308, Computer Sciences Department, University of Wisconsin-Madison, July 1996.

[4] T. Chilimbi, J. Larus, and M. Hill. Improving pointer-based codes through cache-conscious data placement. Technical Report CSL-TR-98-1365, University of Wisconsin, Madison, March 1998.

[5] E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchy. In Proceedings of the International Conference on Supercomputing, 1990.

[6] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252-263, June 1997.

[7] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.

[8] A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-53, 1991.

[9] C. Kolb. The Rayshade user's guide. http://graphics.stanford.EDU/~cek/rayshade.

[10] M. H. Lipasti, W. Schmidt, S. R. Kunkel, and R. Roediger. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture, June 1995.

[11] C.-K. Luk and T. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 222-233, October 1996.

[12] S. Mehrotra and L. Harrison. Examination of a memory access classification scheme for pointer-intensive and numeric programs. In Proceedings of the 10th International Conference on Supercomputing, pages 133-139, May 1996.

[13] T. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62-73, October 1992.

[14] S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218, June 1997.

[15] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 24-33, April 1994.

[16] S. Pinter and A. Yoaz. A hardware-based data prefetching technique for superscalar processors. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 214-225, December 1996.

[17] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, 1989.

[18] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, 17(2), March 1995.

[19] A. Roth, A. Moshovos, and G. Sohi. Dependence based prefetching for linked data structures. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 115-126, October 1998.

[20] A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 111-121, May 1999.

[21] A. Smith. Cache memories. Computing Surveys, 14(4):473-530, September 1982.

[22] G. Sohi. Instruction issue logic for high-performance, interruptable, multiple functional unit, pipelined computers. IEEE Transactions on Computers, 39(3):349-359, March 1990.

[23] Z. Zhang and J. Torrellas. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 188-200, June 1995.