Push vs. Pull: Data Movement for Linked Data Structures

Chia-Lin Yang and Alvin R. Lebeck

Department of Computer Science, Duke University

Durham, NC 27708 USA

fyangc,[email protected]

Abstract

As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointer-intensive applications still remains a challenging problem. Prefetching linked data structures (LDS) is difficult because the address sequence of LDS traversal does not present the same arithmetic regularity as array-based applications, and the data dependence of pointer dereferences can serialize the address generation process.

In this paper, we propose a cooperative hardware/software mechanism to reduce memory access latencies for linked data structures. Instead of relying on past address history to predict future accesses, we identify the load instructions that traverse the LDS and execute them ahead of the actual computation. To overcome the serial nature of LDS address generation, we attach a prefetch controller to each level of the memory hierarchy and push, rather than pull, data to the CPU. Our simulations, using four pointer-intensive applications, show that the push model can achieve between 4% and 30% larger reductions in execution time compared to the pull model.

This work was supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637, CDA-95-12356, and EIA-99-72879, Career Award MIP-97-02547, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Copyright ACM 2000. Appears in International Conference on Supercomputing (ICS) 2000.

1. INTRODUCTION

Since 1980, microprocessor performance has improved 60% per year, while memory access time has improved only 10% per year. As the performance gap between processors and memory continues to grow, techniques to reduce the effect of this disparity are essential to build a high performance computer system. The use of caches between the CPU and main memory is recognized as an effective method to bridge this gap [21]. If programs exhibit good temporal and spatial locality, the majority of memory requests can be satisfied by caches without having to access main memory. However, a cache's effectiveness can be limited for programs with poor locality.

Prefetching is a common technique that attempts to hide memory latency by issuing memory requests earlier than demand fetching a cache block when a miss occurs. Prefetching works very well for programs with regular access patterns. Unfortunately, many applications create sophisticated data structures using pointers and do not exhibit sufficient regularity for conventional prefetch techniques to exploit. Furthermore, prefetching pointer-intensive data structures can be limited by the serial nature of pointer dereferences, known as the pointer chasing problem, since the address of the next node is not known until the contents of the current node are accessible.

For these applications, performance may improve if lower levels of the memory hierarchy could dereference pointers and pro-actively push data up to the processor, rather than requiring the processor to fetch each node before initiating the next request. An oracle could accomplish this by knowing the layout and traversal order of the pointer-based data structure, and thus overlap the transfer of consecutively accessed nodes up the memory hierarchy. The challenge is to develop a realizable memory hierarchy that can obtain these benefits.

This paper presents a novel memory architecture to perform pointer dereferences in the lower levels of the memory hierarchy and push data up to the CPU. In this architecture, if the cache block containing the data corresponding to pointer p (*p) is found at a particular level of the memory hierarchy, the request for the next element (*(p+o)) is issued from that level. This allows the access for the next node to overlap with the transfer of the previous element. In this way, a series of pointer dereferences becomes a pipelined process rather than a serial process.

In this paper, we present results using specialized hardware to support the above push model for data movement in list-based programs. Our simulations using four pointer-intensive benchmarks show that conventional prefetching using the pull model can achieve speedups of 0% to 223%, while the new push model achieves speedups from 6% to 227%. More specifically, the push model reduces execution time by 4% to 30% more than the pull model.

The rest of this paper is organized as follows. Section 2 provides background information and motivation for the push model. Section 3 describes the basic push model and details of a proposed architecture. Section 4 presents our simulation results. Related work is discussed in Section 5 and we conclude in Section 6.

2. BACKGROUND AND MOTIVATION

Pointer-based data structures, often called linked data structures (LDS), present two challenges to any prefetching technique: address irregularity and data dependencies. To improve the memory system performance of many pointer-based applications, both of these issues must be addressed.

    list = head;
    while (list != NULL) {
        p = list->patient;      /* 1: traversal load */
        x = p->data;            /* 2: data load      */
        list = list->forward;   /* 3: recurrent load */
    }

    (a) source code

    1: lw $15, o1($24)
    2: lw $4,  o2($15)
    3: lw $24, o3($24)

    (b) LDS traversal kernel

    [Dependence graph among kernel load instances: each recurrent load instance (3a, 3b, ...) feeds the next recurrent load and the next iteration's traversal load (e.g., 3a feeds 3b and 1b), and each traversal load feeds that iteration's data load (e.g., 1b feeds 2b).]

    (c) data dependencies

Figure 1: Linked LDS Traversal

2.1 Address Irregularity

Prefetching LDS is unlike prefetching for array-based numeric applications, since the address sequence from LDS accesses does not have arithmetic regularity. Consider traversing a linear array of integers versus traversing a linked list. For linear arrays, the address of the array element accessed at iteration i is A_addr + 4*i, where A_addr is the base address of array A and i is the iteration number. Using this formula, the address of element A[i] at any iteration i is easily computed. Furthermore, the above formula produces a compact representation of array accesses, thus making it easier to predict future addresses. In contrast, LDS elements are allocated dynamically from the heap, and adjacent elements are not necessarily contiguous in memory. Therefore, traditional array-based address prediction techniques are not applicable to LDS accesses.
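To make the contrast concrete, the following small C sketch (an illustration, not taken from the paper) traverses the same values stored both as an array and as a linked list. The array loop can compute the address of any future element with simple arithmetic on the base address, while the list loop cannot form the address of node i+1 until node i has actually been loaded.

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    int main(void) {
        enum { N = 8 };
        int a[N];
        struct node *head = NULL;

        /* Build an array and a linked list holding the same values. */
        for (int i = N - 1; i >= 0; i--) {
            a[i] = i;
            struct node *n = malloc(sizeof *n);
            n->value = i;
            n->next = head;
            head = n;
        }

        /* Array: the address of element i is base + 4*i (for 4-byte ints),
           so the address of any future element is known ahead of time.  */
        for (int i = 0; i < N; i++)
            printf("a[%d] at %p = %d\n", i, (void *)&a[i], a[i]);

        /* Linked list: the address of the next node is a field of the
           current node, so it is unknown until the current node has been
           loaded (the pointer-chasing problem).                          */
        for (struct node *p = head; p != NULL; p = p->next)
            printf("node at %p = %d\n", (void *)p, p->value);

        return 0;
    }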

Instead of predicting future LDS addresses based on address regularity, Roth et al. [19] observed that the load instructions (program counters) that access LDS elements are themselves a compact representation of LDS accesses, called a traversal kernel. Consider the linked-list traversal example shown in Figure 1; the structure list is the LDS being traversed. Instructions 1 and 3 access list itself, while instruction 2 accesses another portion of the LDS through pointer p. These three instructions make up the list traversal kernel. Assembly code for this kernel is shown in Figure 1(b).

Consecutive iterations of the traversal kernel create data dependencies. Instruction 3 at iteration a produces an address consumed by itself and by instruction 1 in the next iteration b. Instruction 1 produces an address consumed by instruction 2 in the same iteration. This relation can be represented as a dependency tree, shown in Figure 1(c). Loads in LDS traversal kernels can be classified into three types: recurrent, traversal, and data. Instruction 3 (list = list->forward) generates the address of the next element of the list structure and is classified as a recurrent load. Instruction 1 (p = list->patient) produces addresses consumed by other pointer loads and is called a traversal load. All the loads in the leaves of the dependency tree, such as instruction 2 (p->time++) in this example, are data loads.

Roth et al. proposed dynamically constructing this kernel and repeatedly executing it independent of the CPU execution. This method overcomes the irregular address sequence problem. However, its effectiveness is limited because of the serial nature of LDS address generation.

2.2 Data Dependence

The second challenge for improving performance for LDS traversal is to overcome the serialization caused by data dependencies. In LDS traversal, a series of pointer dereferences is required to generate future element addresses. If the address of node i is stored in the pointer p, then the address of node i+1 is *(p+x), where x is the offset of the next field. The address *(p+x) is not known until the value of p is known. This is commonly called the pointer-chasing problem. In this scenario, the serial nature of the memory accesses can make it difficult to hide memory latencies longer than one iteration of computation.

The primary contribution of this paper is the development of a new model for data movement in linked data structures that reduces the impact of data dependencies. By abandoning the conventional pull approach of initiating all memory requests (demand fetch or prefetch) from the processor or upper level of the memory hierarchy, we can eliminate some of the delay in traversing linked data structures. The following section elaborates on the new model and a proposed implementation.

3. THE PUSH MODEL

Most existing prefetch techniques issue requests from the CPU level to fetch data from the lower levels of the memory hierarchy. Since pointer dereferences are required to generate addresses for successive prefetch requests (recurrent loads), these techniques serialize the prefetch process. The prefetch schedule is shown in Figure 2(a), assuming each recurrent load is an L2 cache miss. The memory latency is divided into six parts: r1, sending a request from L1 to L2; a1, L2 access; r2, sending a request from L2 to main memory; a2, memory access; x2, transferring a cache block back to L2; and x1, transferring a cache block back to L1. Using this timing, node i+1 arrives at the CPU r1+a1+r2+a2+x2+x1 cycles (the full memory access latency) after node i.

[Timing diagrams contrasting (a) traditional prefetch data movement, in which each node access pays the full r1+a1+r2+a2+x2+x1 latency, with (b) push model data movement, in which the main-memory access for the next node overlaps the transfer of the previous node; (c) shows the memory system model.]

Figure 2: Data Movement for Linked Data Structures

In the push model, pointer dereferences are performed at the lower levels of the memory hierarchy by executing traversal kernels in micro-controllers associated with each memory hierarchy level. This eliminates the request from the upper levels to the lower level, and allows us to overlap the data transfer of node i with the RAM access of node i+1. For example, assume that the request for node i accesses main memory at time t. The result is known to the memory controller at time t+a2, where a2 is the memory access latency. The controller then issues the request for node i+1 to main memory at time t+a2, thus overlapping this access with the transfer of node i back to the CPU. This process is shown in Figure 2(b). Note that in this simple model we ignore the possible overhead for address translation; Section 4 examines this issue. In this way, we are able to request subsequent nodes in the LDS much earlier than in a traditional system. In the push architecture, node i+1 arrives at the CPU a2 cycles after node i. From the CPU standpoint, the latency between node accesses is reduced from r1+a1+r2+a2+x2+x1 to a2. Assuming a 100-cycle overall memory latency and a 60-cycle memory access time, the per-node latency is potentially reduced by 40%. The actual speedup for real applications depends on (i) the contribution of LDS accesses to total memory latency, (ii) the percentage of LDS misses that are L2 misses, and (iii) the amount of computation in the program that can be overlapped with the memory latency.
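The pipelining effect behind these numbers can be reproduced with a small back-of-the-envelope calculation. The C sketch below is illustrative only: it splits the 100-cycle latency among the six components in an arbitrary but plausible way (a2 = 60 as above) and prints when node k becomes available to the CPU under each model. Pull pays the full round trip for every node, while push pays it once and then only a2 per additional node.

    #include <stdio.h>

    int main(void) {
        /* Example split of the six latency components from Figure 2
           (values are assumptions, chosen to sum to 100 with a2 = 60). */
        int r1 = 5, a1 = 15, r2 = 10, a2 = 60, x2 = 5, x1 = 5;
        int full = r1 + a1 + r2 + a2 + x2 + x1;   /* full L2-miss latency */

        for (int k = 1; k <= 4; k++) {
            int pull = k * full;              /* each node waits for the previous one      */
            int push = full + (k - 1) * a2;   /* later accesses overlap earlier transfers  */
            printf("node %d: pull %3d cycles, push %3d cycles\n", k, pull, push);
        }
        return 0;
    }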

In the push model, a controller, called a prefetch engine, is attached to each level of the memory hierarchy, as shown in Figure 3. The prefetch engines (PFEs) execute traversal kernels independent of CPU execution. Cache blocks accessed by the prefetch engines at the L2 or main memory levels are pushed up to the CPU and stored either in the L1 cache or a prefetch buffer. The prefetch buffer is a small, fully-associative cache, which can be accessed in parallel with the L1 cache. The remainder of this section describes the design details of this novel push architecture.

[Block diagram: a prefetch engine sits beside the L1 cache and prefetch buffer, beside the L2 cache, and beside main memory; prefetch requests flow down the hierarchy while prefetched blocks flow up the L2 and memory buses toward the CPU.]

Figure 3: The Push Architecture

3.1 A Push Model Implementation

The function of the prefetch engine is to execute traversal kernels. In this paper we focus on a prefetch engine designed to traverse linked lists. Support for more sophisticated data structures (e.g., trees) is an important area that we are currently investigating, but it remains future work.

Consider the linked-list traversal example in Figure 1. The prefetch engine is designed to execute iterations of the kernel loop until the returned value of the recurrent load (instruction 3: list = list->forward) is NULL. Instances of the recurrent load (3a, 3b, 3c, etc.) form a dependence chain which is the critical path in the kernel loop. To allow the prefetch engine to move ahead along this critical path as fast as possible, the prefetch engine must support out-of-order execution. General CPUs implement complicated register renaming and wakeup logic [14; 22] to achieve out-of-order execution. However, the complexity and cost of these techniques is too high for our prefetch engine design.

[Diagram: thread 1 (recurrent-based loads) issues instances 3a, 3b, 3c, ... and 1a, 1b, 1c, ... in order; thread 2 (traversal-based loads) issues instances 2a, 2b, 2c, ... as their producing traversal loads complete.]

Figure 4: Kernel Execution

To avoid complicated logic, the prefetch engine executes a single traversal kernel in two separate threads, as shown in Figure 4. The first thread executes recurrent-based loads, which are loads dependent on recurrent loads, e.g., instruction 1 (p = list->patient) and instruction 3 (list = list->forward). This thread executes in order and starts a new iteration once a recurrent load completes. The other thread executes traversal-based loads, which are loads that depend on traversal loads, such as instruction 2 (p->time++). Because traversal loads from different iterations (e.g., 1a and 1b) are allowed to complete out of order, since they can hit in different levels of the memory hierarchy, traversal-based loads should be able to execute out of order.

The internal structure of the prefetch engine is shown in Figure 5. There are four main hardware structures in the prefetch engine: (1) the Recurrent Buffer (RB), (2) the Producer-Consumer Table (PCT), (3) the Base Address Buffer (BAB), and (4) the Prefetch Request Queue (PRQ). The remainder of this section describes its basic operation.

The Recurrent Buffer (RB) contains the type, PC, and offset of recurrent-based loads. The type field indicates if the corresponding load is a recurrent load. Traversal-based loads are stored in the Producer-Consumer Table (PCT). This structure is similar to the Correlation Table in [19]; it contains the load's offset, PC, and the producer's PC. The PCT is indexed using the producer PC.

[Block diagram: the Recurrent Buffer feeds an instruction queue whose entries are added to the Base register (loaded from the Base Address Buffer) to form prefetch requests; traversal-based loads are looked up in the producer-consumer table (PCT) by producer PC; requests, carrying their type, PC, and address, pass through a TLB into the prefetch request queue (PRQ), alongside a request buffer holding requests from the upper level.]

Figure 5: The Prefetch Engine
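As a rough software analogue of the structures in Figure 5 (field names and offset values here are assumptions, not the paper's exact layout), the sketch below declares the four structures and loads them with the kernel of Figure 1: the recurrent-based loads 1 and 3 go into the RB, and the traversal-based load 2 is entered in the PCT under its producer's PC.

    #include <stdint.h>
    #include <stdio.h>

    enum load_type { NON_RECURRENT, RECURRENT };

    struct rb_entry  { enum load_type type; uint32_t pc; int32_t offset; };  /* Recurrent Buffer        */
    struct pct_entry { uint32_t producer_pc; uint32_t pc; int32_t offset; }; /* Producer-Consumer Table */
    struct prq_entry { enum load_type type; uint32_t pc; uint64_t addr; };   /* Prefetch Request Queue  */

    struct prefetch_engine {
        struct rb_entry  rb[8];    int rb_n;
        struct pct_entry pct[8];   int pct_n;
        uint64_t         bab[4];   int bab_n;   /* Base Address Buffer (LDS root addresses) */
        struct prq_entry prq[16];  int prq_n;
    };

    int main(void) {
        struct prefetch_engine pfe = {0};

        /* Kernel from Figure 1; o1, o2, o3 are illustrative field offsets. */
        pfe.rb[pfe.rb_n++]   = (struct rb_entry){ NON_RECURRENT, /*pc*/ 1, /*o1*/ 0 };   /* p = list->patient    */
        pfe.rb[pfe.rb_n++]   = (struct rb_entry){ RECURRENT,     /*pc*/ 3, /*o3*/ 8 };   /* list = list->forward */
        pfe.pct[pfe.pct_n++] = (struct pct_entry){ /*producer*/ 1, /*pc*/ 2, /*o2*/ 4 }; /* load through p       */

        printf("RB entries: %d, PCT entries: %d\n", pfe.rb_n, pfe.pct_n);
        return 0;
    }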

We initialize the prefetch engine by downloading a traversal kernel to the RB and PCT through a memory mapped interface. Prior to executing an LDS access (e.g., a list traversal), the program starts the PFE by storing the root address of the LDS in the Base Address Buffer (BAB). We accomplish this by assuming a "flavored" load that annotates a load with additional information, in this case that the data being loaded is the root address of an LDS. When the PFE detects that the BAB is not empty and the instruction queue is empty, it loads the first element of the BAB into the Base register and fetches the recurrent-based instructions from the RB into the instruction queue. At each cycle, the first instruction of the instruction queue is processed, and a new prefetch request is formed by adding the instruction's offset to the value in the Base register. The prefetch request is then stored in the prefetch request queue (PRQ) along with its type and PC.

When a traversal load completes, its PC is used to search the PCT. For each matching entry, the effective address of the traversal-based load is generated by adding its offset to the result of the probing traversal load. In this way, the prefetch engine allows the recurrent-based and traversal-based threads to execute independently.

We assume that each level of the memory hierarchy has a small buffer that stores requests from the CPU, the PFE, and cache coherence transactions (e.g., snoop requests) if appropriate. By maintaining information on the type of access, the cache controller can provide the appropriate data required for the PFE to continue execution. For example, if the access is a recurrent prefetch, the resulting data (the address of the next node) is extracted and stored in the BAB. If the access is a traversal load prefetch, the PC is used to probe the PCT and the resulting data is added to the offset from the PCT entry. Before serving any request, the controller checks if a request to the same cache block already exists in the request buffer. If such a request exists, the controller simply drops the new request.
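The controller's type-based handling described above can be summarized in a few lines of C. This is a conceptual model only, with invented function and type names: it shows how a completed recurrent prefetch feeds the BAB while a completed traversal prefetch probes the PCT to generate traversal-based requests.

    #include <stdint.h>
    #include <stdio.h>

    enum req_type { RECURRENT_PREFETCH, TRAVERSAL_PREFETCH, DATA_PREFETCH };

    struct pct_entry { uint32_t producer_pc; uint32_t pc; int32_t offset; };

    /* Stubs standing in for hardware actions. */
    static void bab_push(uint64_t next_node_addr) {
        printf("BAB <- next node 0x%llx\n", (unsigned long long)next_node_addr);
    }
    static void issue_prefetch(enum req_type t, uint32_t pc, uint64_t addr) {
        printf("issue type %d, pc %u, addr 0x%llx\n", (int)t, (unsigned)pc, (unsigned long long)addr);
    }

    /* Invoked when a prefetch issued by the PFE returns with its data. */
    static void on_prefetch_complete(enum req_type type, uint32_t pc, uint64_t loaded_value,
                                     const struct pct_entry *pct, int pct_n) {
        if (type == RECURRENT_PREFETCH) {
            bab_push(loaded_value);          /* loaded value is the next node's address      */
        } else if (type == TRAVERSAL_PREFETCH) {
            for (int i = 0; i < pct_n; i++)  /* each PCT match yields a dependent address    */
                if (pct[i].producer_pc == pc)
                    issue_prefetch(DATA_PREFETCH, pct[i].pc, loaded_value + pct[i].offset);
        }
    }

    int main(void) {
        struct pct_entry pct[] = { { 1, 2, 4 } };  /* load 2 consumes load 1's result + offset o2 */
        on_prefetch_complete(TRAVERSAL_PREFETCH, 1, 0x1000, pct, 1);
        on_prefetch_complete(RECURRENT_PREFETCH, 3, 0x2000, pct, 1);
        return 0;
    }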

Our data movement model also places new demands on the memory hierarchy to correctly handle cache blocks that were not requested by the processor. In particular, when all memory requests initiate from the CPU level, a returning cache block is guaranteed to find a corresponding MSHR entry that contains information about the request and provides buffer space for the data if it must compete for cache ports. In our model, it is no longer guaranteed that an arriving cache block will find a corresponding MSHR. One could exist if the processor executed a load for the corresponding cache block. However, in our implementation, if there is no matching MSHR the returned cache block can be stored in a free MSHR or in the request buffer. If neither one is available, this cache block is dropped. Dropping these cache blocks does not affect program correctness, since they are the result of a non-binding prefetch operation, not a demand fetch from the processor. Note that the push model requires the transferred cache blocks to also transfer their addresses up the memory hierarchy. This is similar to a coherence snoop request traversing up the memory hierarchy.

3.2 Interaction Among Prefetch Engines

Having discussed the design details of individual prefetch engines, we now address the interaction between prefetch engines at the various levels. Recall that the root address of an LDS is conveyed to the prefetch engine through a special instruction, which we call ldroot. If the memory access of a ldroot instruction is a hit in the L1 cache, the loaded value is stored in the BAB of the prefetch engine at the first level. This triggers the prefetching activity in the L1 prefetch engine.

If a ldroot instruction or the prefetch for a recurrent load causes an L1 cache miss, the cache miss is passed down to the L2 like other cache misses, but with the type bits set to RP (recurrent prefetch). If this miss is satisfied by the L2, the cache controller will detect the RP type and store the result in the BAB. The prefetch engine at the L2 level will find that the BAB is not empty and start prefetching. If the L1 cache miss is the result of a recurrent load prefetch, the L1 PFE stops execution and the L2 PFE begins execution. Data provided by the L2 PFE is placed in the prefetch buffer to avoid polluting the L1 data cache.

The interaction between the L2 PFE and the memory level PFE is similar to the L1/L2 PFE interaction. The L2 PFE suspends execution when it incurs a miss on a recurrent load, and the memory PFE begins execution when it observes a non-empty BAB. Since the data provided by the memory PFE passes the L2 cache, we deposit the data in both the L2 cache and the prefetch buffer. The L2 cache is generally large enough to contain the additional data blocks and not suffer from severe pollution effects.
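From the program's point of view, starting a push-model traversal only requires marking the load of the list head as the root of an LDS. The fragment below is a hypothetical illustration of that software interface: LDROOT is not a real primitive, just a stand-in for whatever annotated ("flavored") load the compiler or programmer would emit.

    #include <stdio.h>

    /* Hypothetical annotation for the flavored root load; a real system
       would lower this to a special load instruction recognized by the PFE. */
    #define LDROOT(ptr) (ptr)   /* placeholder: behaves like an ordinary load here */

    struct patient { int time; int data; };
    struct list_node {
        struct patient   *patient;
        struct list_node *forward;
    };

    int sum_times(struct list_node *head) {
        /* The ldroot-style load tells the L1 prefetch engine where the LDS
           starts; the engine then runs the downloaded kernel on its own.   */
        struct list_node *list = LDROOT(head);
        int total = 0;
        while (list != NULL) {
            total += list->patient->time;   /* traversal load + data load */
            list = list->forward;           /* recurrent load             */
        }
        return total;
    }

    int main(void) {
        struct patient p1 = { 5, 0 }, p2 = { 7, 0 };
        struct list_node n2 = { &p2, NULL }, n1 = { &p1, &n2 };
        printf("total time = %d\n", sum_times(&n1));
        return 0;
    }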

4. EVALUATION

This section presents simulation results that demonstrate the effectiveness of the push model for a set of four list-based benchmarks. First, we describe our simulation framework. This is followed by an analysis of a microbenchmark for both the push model and the conventional pull model for prefetching. We finish by evaluating the two models for four macrobenchmarks and examining the impact of address translation overhead.

4.1 Methodology

To evaluate the push model we modified the SimpleScalar simulator [3] to also simulate our prefetch engine implementation. SimpleScalar models a dynamically scheduled processor using a Register Update Unit (RUU) and a Load/Store Queue (LSQ) [22]. The processor pipeline stages are: (1) Fetch: fetch instructions from the program instruction stream; (2) Dispatch: decode instructions, allocate RUU and LSQ entries; (3) Issue/Execute: execute ready instructions if the required functional units are available; (4) Writeback: supply results to dependent instructions; and (5) Commit: commit results to the register file in program order, freeing RUU and LSQ entries.

Our base processor is a 4-way superscalar, out-of-order processor. The memory system consists of a 32KB, 32-byte cache line, 2-way set-associative first level data cache and a 512KB, 64-byte cache line, 4-way set-associative second level cache. Both are lock-up free caches that can have up to eight outstanding misses. The L1 cache has 2 read/write ports and can be accessed in a single cycle. We also simulate a 32-entry fully-associative prefetch buffer with four read/write ports that can be accessed in parallel with the L1 data cache. The second level cache is pipelined with an 18 cycle round-trip latency. Latency to main memory after an L2 miss is 66 cycles. The CPU has 64 RUU entries and a 16 entry load-store queue, a 2-level branch predictor with a total of 8192 entries, and a perfect instruction cache.

We simulate the pull model by using only a single prefetch engine attached to the L1 cache. This engine executes the same traversal kernels used by the push architecture. However, prefetch requests that miss in the L1 cache must wait for the requested cache block to return from the lower levels before the fetch for the next node can be initiated.

4.2 Microbenchmark Results

We begin by examining the performance of both prefetching models (push and pull) using a microbenchmark. Our microbenchmark simply traverses a list of 32,768 32-byte nodes 10 times (see Figure 6). Therefore, all LDS loads are recurrent loads. We also include a small computation loop between each node access. This allows us to evaluate the effects of various ratios between computation and memory access. To eliminate branch effects, we use a perfect branch predictor for these microbenchmark simulations.

    for (i = 0; i < 10; i++) {
        node = head;
        while (node) {
            for (j = 0; j < iters; j++)
                sum = sum + j;
            node = node->next;
        }
    }

Figure 6: Microbenchmark Source Code

Figure 7 shows the speedup for each prefetch model over the base system, versus compute iterations on the x-axis. The thick solid line indicates the speedup of a perfect memory system where all references hit in the L1 cache. In general, the results for the push model (thin solid line) match our intuition: large speedups (over 100%) are achieved for low computation to memory access ratios, and the benefits decrease as the amount of computation increases relative to memory accesses. However, the increase in computation is not the only cause of decreased benefits: cache pollution is also a factor.

[Line plot of speedup (%) versus computation per node access (iterations) for the pull model, the push model, and a perfect memory system.]

Figure 7: Push vs. Pull: Microbenchmark Results

[Two panels (push model and pull model) plotting miss ratio, prefetch merge, and prefetch hit percentages versus computation per node access (iterations).]

Figure 8: Microbenchmark Prefetch Performance

The dashed line in Figure 7 represents the pull model. We see the same shape as the push model, but at a different scale. It takes much longer than the push model for the pull model to reach its peak benefit. This is a direct consequence of the serialization caused by pointer dereferences; the next node address is not available until the current node is fetched from the lower levels of the memory hierarchy.

We consider three distinct phases in Figure 7. Initially, the push model achieves nearly 100% speedup. While this is not close to the maximum achievable, it is significant. These speedups are a result of cache misses merging with the prefetch requests (see Figure 8). In contrast, the pull model does not exhibit any speedup for low computation per node. Although Figure 8 shows that 100% of the prefetch requests merge with cache misses, the prefetch is issued too late to produce any benefit.

As the computation per node access increases, the prefetch engine is able to move further ahead in the list traversal, eventually eliminating many cache misses. This is shown in Figure 8 by the increase in prefetch buffer hits and decrease in miss ratio. Although the miss ratio and prefetch merges plateau, performance continues to increase due to the prefetch merges hiding more and more of the overall memory latency. Eventually, there is enough computation that prefetch requests no longer merge; instead they result in prefetch hits. This is shown by the spike in prefetch hits (the dashed line) in Figure 8, and it occurs at the same iteration count as the maximum speedup shown in Figure 7.

Further increases in computation cause the prefetch engine to run too far ahead, thus causing pollution. As new cache blocks are brought closer to the processor, they replace older blocks. For sufficiently long delays between node accesses, the prefetched blocks begin replacing other prefetched blocks not yet accessed by the CPU. This conclusion is supported by the decrease in prefetch buffer hits shown in Figure 8 and an increase in the L2 miss ratio shown in Figure 9.

Although the L1 miss ratio returns to its base value, we continue to see significant speedups. These performance gains are due to prefetch benefits in the L2 cache. Recall that we place prefetched cache blocks into both the L2 and the prefetch buffer. Even though the prefetch engine is so far ahead that it pollutes the prefetch buffer, the L2 still provides benefits. Figure 9 shows that the global L2 miss ratio remains below that of the base system. However, the miss ratio increases, and hence the speedup decreases, as the computation per node increases because of L2 cache pollution.

[Plot of the L2 global miss rate versus computation per node access (iterations) for the base, pull, and push configurations.]

Figure 9: Push vs. Pull: Microbenchmark L2 Cache Global Miss Ratio

The results obtained from this microbenchmark clearly show some of the potential benefits of the push model. However, there are some opportunities for further improvement, such as developing a technique to "throttle" the prefetch engine to match the rate of data consumption by the processor, thus avoiding pollution. We leave investigation of this as future work.

4.3 Macrobenchmark Results

While the above microbenchmark results provide some insight into the performance of the push and pull models, it is difficult to extend those results to real applications. Variations in the type of computation and its execution on dynamically scheduled processors can make it very difficult to precisely predict memory latency contributions to overall execution time.

To address this issue, we use four list-based macrobenchmarks: em3d, health, and mst from the Olden pointer-intensive benchmark suite [18], and Rayshade [9]. Em3d simulates electro-magnetic fields. It constructs a bipartite graph and processes lists of interacting nodes. Health implements a hospital check-in/check-out system. The majority of cache misses come from traversing several long linked lists. Mst finds the minimum spanning tree of a graph. The main data structure is a hash table; lists are used to implement the buckets in the hash table. Rayshade is a real-world graphics application that implements a raytracing algorithm. Rayshade divides space into grids, and objects in each grid are stored in a linked list. If a ray intersects with a grid, the entire object list in that grid is traversed.

[Bar chart of the reduction in execution time under the pull and push models for Health, Mst, Em3d, and Rayshade.]

Figure 10: Push vs. Pull: Macrobenchmark Results

Figure 10 shows the speedup for both push and pull prefetching using our benchmarks on systems that incur a one cycle overhead for address translation with a perfect TLB. We examine the effects of TLB misses later in this section. From Figure 10 we see that the push model outperforms the pull model for all four benchmarks. Rayshade shows the largest benefits, with speedups of 227% for the push model and 223% for the pull model. Health achieves a 41% improvement in execution time for the push model while achieving only 11% for the pull model. Similarly, for the push model, em3d and mst achieve 6% and 16%, respectively, while neither shows any benefit using the pull model.

We note that the results for our pull model are not directly comparable to previous pull-based prefetching techniques, since we do not implement the same mechanisms or use the exact same memory system parameters. Recall that we execute a simple traversal kernel on the L1 prefetch engine to issue prefetch requests. This kernel executes independent of the CPU, and does not synchronize in any way. Most previous techniques are "throttled" by the CPU execution, since they prefetch a specific number of iterations ahead. Furthermore, our kernel may prefetch incorrect data. For example, this can occur in mst since the linked list is searched for a matching key and the real execution may not traverse the entire list. Similarly, in em3d each node in the graph contains a variable size array of pointers to/from other nodes in the graph. Our kernel assumes the maximum array size, and hence may try to interpret non-pointer data as pointers and issue prefetch requests for this incorrect data.

To gain more insight into the behavior of the push architecture, we analyze prefetch coverage, pollution, and prefetch request distribution at different levels of the memory hierarchy. Prefetch coverage measures the fraction of would-be read misses serviced by the prefetch mechanism. To determine the impact of cache pollution, we count the number of prefetched blocks replaced from the prefetch buffer before they are ever referenced by the CPU. Finally, the request distribution indicates where in the memory hierarchy prefetch requests are initiated.

Prefetch Coverage

We examine prefetch coverage by categorizing successful prefetch requests according to where in the memory hierarchy the would-be cache miss "meets" the prefetched data. The three categories are partially hidden, hidden L1, and hidden L2. Partially hidden misses are those would-be cache misses that merge with a prefetch request somewhere in the memory hierarchy. Hidden L1 misses are those L1 misses now satisfied by the prefetch buffer. Similarly, hidden L2 misses are those references that were L2 misses without prefetching, but are now satisfied by prefetched data in the L2 cache. Note that, since we deposit data in both the L2 cache and the prefetch buffer, blocks replaced from the prefetch buffer can still be resident in the L2 cache.
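The bookkeeping behind these metrics can be phrased compactly. The sketch below is only one way to express it, with an invented enum and function: each demand reference that would have missed is classified by where it met prefetched data, and a separate counter tracks prefetch-buffer blocks evicted before any CPU use, which is the pollution measure defined above.

    #include <stdio.h>

    enum coverage {
        NOT_COVERED,       /* would-be miss saw no benefit from prefetching          */
        PARTIALLY_HIDDEN,  /* merged with an in-flight prefetch request              */
        HIDDEN_L1,         /* satisfied by the prefetch buffer next to L1            */
        HIDDEN_L2          /* former L2 miss now satisfied by prefetched data in L2  */
    };

    static enum coverage classify(int merged_with_prefetch, int hit_prefetch_buffer,
                                  int was_l2_miss, int hit_l2_prefetched_block) {
        if (hit_prefetch_buffer)                    return HIDDEN_L1;
        if (was_l2_miss && hit_l2_prefetched_block) return HIDDEN_L2;
        if (merged_with_prefetch)                   return PARTIALLY_HIDDEN;
        return NOT_COVERED;
    }

    int main(void) {
        long counts[4] = {0};
        long polluted = 0;               /* prefetched blocks evicted before first use */

        counts[classify(1, 0, 0, 0)]++;  /* merged with an outstanding prefetch        */
        counts[classify(0, 1, 0, 0)]++;  /* found in the prefetch buffer               */
        counts[classify(0, 0, 1, 1)]++;  /* former L2 miss hit prefetched data in L2   */
        polluted++;                      /* one block replaced before being used       */

        printf("partial %ld, L1 %ld, L2 %ld, polluted %ld\n",
               counts[PARTIALLY_HIDDEN], counts[HIDDEN_L1], counts[HIDDEN_L2], polluted);
        return 0;
    }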

[Stacked bars per benchmark showing the fraction of would-be misses that are partially hidden, hidden in L1, and hidden in L2 under the push model.]

Figure 11: Prefetch Coverage

Figure 11 shows the prefetch coverage for our benchmarks. Our results show that for health and mst, the majority of load misses are only partially hidden, 47% and 35%, respectively. These benchmarks do not have enough computation to overlap with the prefetch latency. Rayshade is the other extreme, with no partially hidden misses, but 60% of L2 misses are eliminated. Em3d shows very little benefit, with only a small number of L2 misses eliminated.

Prefetch Buffer Pollution

To eliminate all L1 misses, a cache block must arrive in the prefetch buffer before the CPU requires the data. Furthermore, the cache block must remain in the prefetch buffer until the CPU accesses the data. If the prefetch engine runs too far ahead, it may replace data before it is accessed by the CPU.

Figure 12 shows the percentage of blocks placed into the prefetch buffer that are replaced without the CPU accessing the data, for the push model. From this data we see that mst, em3d, and rayshade all replace nearly 100% of the blocks placed in the prefetch buffer. Health suffers very little pollution.

[Bar chart of the fraction of prefetched blocks replaced from the prefetch buffer before use, for Health, Mst, Em3d, and Rayshade.]

Figure 12: Prefetch Buffer Pollution

These replacements can be caused by the prefetch engine running too far ahead and replacing blocks before they are used, or by prefetching the wrong cache blocks. Recall that for mst and em3d our traversal kernel is approximate, and we could be fetching the wrong cache blocks. This appears to be the case, since neither of these benchmarks showed many completely hidden misses (see Figure 11). In contrast, rayshade shows a large number of hidden L2 misses. Therefore, we believe the replacements it incurs are caused by the prefetch engine running too far ahead, but not so far ahead as to pollute the L2 cache.

Prefetch Request Distribution

The other important attribute of the push model is the prefetch request distribution at different levels of the memory hierarchy. The light bar in Figure 13 shows the percentage of prefetch operations issued at each level of the memory hierarchy, and the dark bar shows the percentage of redundant prefetch operations. A redundant prefetch requests a cache block that already resides in a higher memory hierarchy level than the level that issued the prefetch.

[Per-benchmark bar charts (Health, Mst, Em3d, Rayshade) of total and redundant prefetch operations issued at the L1, L2, and main memory levels.]

Figure 13: Prefetch Request Distribution

From Figure 13 we see that health and rayshade issue most prefetches at the main memory level. Em3d issues half from the L2 cache and half from main memory. Finally, mst issues over 30% of its prefetches from the L1 cache, 20% from the L2 and 60% from main memory. Health and rayshade traverse the entire list each time it is accessed. Their lists are large enough that capacity misses likely dominate. In contrast, mst performs a hash lookup and may stop before accessing each node, hence conflict misses can be a large component of all misses. Furthermore, the hash lookup is data dependent and blocks may not be replaced between hash lookups. In this case, it is possible for a prefetch traversal to find the corresponding data in any level of the cache hierarchy, as shown in Figure 13.

Mst also exhibits a significant number of redundant prefetch operations. Most of the L1 prefetches are redundant, while very few of the memory level prefetches are redundant. For health and em3d, 15% and 7%, respectively, of all prefetches are redundant and issued from the memory level. Rayshade incurs minimal redundant prefetch requests. Redundant prefetches can prevent the prefetch engine from running ahead of the CPU, and they compete with the CPU for cache, memory, and bus cycles. We are currently investigating techniques to reduce the number of redundant prefetch operations.

Impact of Address Translation

Our proposed implementation places a TLB with each prefetch engine. However, the results presented thus far assume a perfect TLB. To model overheads for TLB misses, we assume hardware managed TLBs with a 30 cycle miss penalty [19]. When a TLB miss occurs at any level, all the TLBs are updated by the hardware miss handler.

For most of our benchmarks, the TLB miss ratios are extremely low, and hence TLB misses do not contribute significantly to execution time. Health is the only benchmark with a noticeable TLB miss ratio (17% for a 32-entry fully-associative TLB). Therefore, we present results for only this benchmark.

[Bar chart of the reduction in execution time for health under the pull and push models, for TLB sizes of 32, 64, 128, and 256 entries and a perfect TLB.]

Figure 14: Impact of Address Translation for Health

Figure 14 shows the speedup achieved for various TLB sizes, including a perfect TLB that never misses. From this data we see that smaller TLB sizes reduce the effectiveness of both push and pull prefetching. However, the push model still achieves over 30% speedup even for a 32-entry TLB, while the pull model achieves less than 5%. Increasing the TLB size increases the benefits for both push and pull prefetching.

5. RELATED WORK

The early data prefetching research, both in hardware and software, focused on regular applications. On the hardware side, Jouppi proposed the use of stream buffers to prefetch sequential streams of cache lines based on cache misses [7]. The original stream buffer design can only detect unit stride. Palacharla et al. [15] extend this mechanism to detect non-unit strides off-chip. Baer et al. [2] proposed the use of a reference prediction table (RPT) to detect the stride associated with load/store instructions. The RPT is accessed ahead of the regular program counter by a look-ahead program counter. When the look-ahead program counter finds a matching stride entry in the table, it issues a prefetch. The Tango system [16] extends Baer's scheme to issue prefetches more effectively on one processor. On the software side, Mowry et al. [13] proposed a compiler algorithm to insert prefetch instructions in scientific computing code. They improve upon the previous algorithms [5; 8; 17] by performing locality analysis to avoid unnecessary prefetches.

Correlation-based prefetching is another class of hardware prefetching mechanisms. It uses the current state of the address stream to predict future references. Joseph et al. [6] build a Markov model to approximate the miss reference string using transition frequencies. This model is realized in hardware as a prediction table where each entry contains an index address and four prioritized predictions. Alexander et al. [1] also implement a table-based prediction scheme similar to [6]. However, it is implemented at the main memory level to prefetch data from the DRAM array to SRAM prefetch buffers on the DRAM chip. Correlation-based prefetching is able to capture complex access patterns, but it has three shortcomings: (i) it cannot predict future references as far ahead of the program execution as the push architecture; (ii) the prefetch prediction accuracy decreases if applications have dynamically changing data structures or LDS traversal orders; and (iii) to achieve good prediction accuracy, it requires a prediction table proportional to the working set size.
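As a rough illustration of the correlation-based idea (a generic sketch, not the specific design of [6] or [1]), the table below records, for each miss address, the address that most recently followed it in the miss stream and uses that single stored successor as the prediction. A real Markov predictor keeps several prioritized successors per entry, but the growth of the table with the miss working set, one of the shortcomings noted above, is already visible here.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024   /* grows with the miss working set in a real design */

    struct corr_entry { uint64_t miss_addr; uint64_t next_addr; int valid; };
    static struct corr_entry table[TABLE_SIZE];

    static unsigned idx(uint64_t addr) { return (addr >> 6) % TABLE_SIZE; }  /* 64-byte blocks */

    /* Record the observed successor of the previous miss and return a
       prediction (0 if none) for what should follow the current miss.  */
    static uint64_t observe_miss(uint64_t prev_miss, uint64_t curr_miss) {
        if (prev_miss) {
            struct corr_entry *p = &table[idx(prev_miss)];
            p->miss_addr = prev_miss;
            p->next_addr = curr_miss;
            p->valid = 1;
        }
        struct corr_entry *e = &table[idx(curr_miss)];
        return (e->valid && e->miss_addr == curr_miss) ? e->next_addr : 0;
    }

    int main(void) {
        uint64_t misses[] = { 0x1000, 0x5040, 0x9080, 0x1000, 0 };
        uint64_t prev = 0;
        for (int i = 0; misses[i]; i++) {
            uint64_t pred = observe_miss(prev, misses[i]);
            printf("miss 0x%llx -> predict 0x%llx\n",
                   (unsigned long long)misses[i], (unsigned long long)pred);
            prev = misses[i];
        }
        return 0;
    }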

Another class of data prefetching mechanisms focuses specifically on pointer-intensive applications. Data structure information is used to construct solutions. The Spaid scheme [10] proposed by Lipasti et al. is a compiler-based pointer prefetching mechanism. It inserts prefetch instructions for the objects pointed to by pointer arguments at call sites. Luk et al. [11] proposed three compiler-based prefetching algorithms. Greedy prefetching schedules prefetch instructions for all recurrent loads a single instance ahead. History-pointer prefetching modifies source code to create pointers that provide direct access to non-adjacent nodes based on previous traversals; this approach is only effective for applications that have static structures and traversal order, and it incurs both space and execution overhead. Data-linearization prefetching maps heap-allocated nodes that are likely to be accessed close together in time into contiguous memory locations, so the future access addresses can be correctly predicted if the creation order is the same as the traversal order.

Chilimbi et al. [4] present another class of software approaches to pointer-intensive applications. They proposed to perform cache conscious data placement considering several cache parameters such as the cache block size and associativity. The reorganization process happens at run time. This approach greatly improves data spatial and temporal locality. The shortcoming of this approach is that it can incur high overhead for dynamically changing structures, and it cannot hide latency from capacity misses.

Zhang et al. [23] proposed a hardware prefetching scheme for irregular applications in shared-memory multiprocessors. This mechanism uses object information to guide prefetching. Programs are annotated, either by programmers or the compiler, to bind together groups of data, e.g., fields in a record or two records linked by a pointer. Groups of data are then prefetched under hardware control. The constraint of this scheme is that accessed fields in a record need to be contiguous in memory. Mehrotra et al. [12] extend stride detection schemes to capture both linear and recurrent access patterns.

Recently, Roth et al. [19] proposed a dependence based prefetching mechanism (DBP) that dynamically captures LDS traversal kernels and issues prefetches. In their scheme, they only prefetch a single instance ahead of a given load, and wavefront tree prefetching is performed regardless of the tree traversal order. The proposed push architecture allows the prefetching engine to run far ahead of the processor.

In their subsequent work [20], Roth et al. proposed a jump-pointer prefetching framework similar to the history-pointer prefetching in [11]. They provide three implementations: software-only, hardware-only, and a cooperative software/hardware technique. The performance of these three implementations varies across the benchmarks they tested. For a few of the benchmarks which have static structure, traversal order, and multiple passes over the LDS, jump-pointer prefetching can achieve high speedup over the DBP mechanism. However, for applications that have dynamically changing behavior, jump-pointer prefetching is not effective and can sometimes even degrade performance. The proposed push architecture can give more stable speedup over the DBP mechanism and adjust the prefetching strategy according to program attributes.

6. CONCLUSION

This paper presented a novel memory architecture designed specifically for linked data structures. This new architecture employs a new data movement model for linked data structures by pushing cache blocks up the memory hierarchy. The push model performs pointer dereferences at various levels in the memory hierarchy. This decouples the pointer dereference (obtaining the next node address) from the transfer of the current node up to the processor, and allows implementations to pipeline these two operations.

Using a set of four macrobenchmarks, our simulations show that the push model can improve execution time by 6% to 227%, and outperforms a conventional pull-based prefetch model that achieves 0% to 223% speedups. Our results using a microbenchmark show that for programs with little computation between node accesses, the push model significantly outperforms a conventional pull-based prefetch model. For programs with larger amounts of computation, the pull model is better than the push model.

We are currently investigating several extensions to this work. First is a technique to "throttle" the push prefetch engine, thus preventing it from running too far ahead of the actual computation. We are also investigating techniques to eliminate redundant prefetch operations, and the use of a truly programmable prefetch engine that may allow us to eliminate incorrect prefetch requests (e.g., in em3d). Finally, we are investigating support for more sophisticated pointer-based data structures, such as trees and graphs.

As the impact of memory system performance on overall program performance increases, new techniques to help tolerate memory latency become ever more critical. This is particularly true for programs that utilize pointer-based data structures with little computation between node accesses, since pointer dereferences can serialize memory accesses. For these programs, the push model described in this paper can dramatically improve performance.

7. REFERENCES

[1] T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory system. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture, February 1996.

[2] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.

[3] D. C. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors: the SimpleScalar tool set. Technical Report 1308, Computer Sciences Department, University of Wisconsin-Madison, July 1996.

[4] T. Chilimbi, J. Larus, and M. Hill. Improving pointer-based codes through cache-conscious data placement. Technical Report CSL-TR-98-1365, University of Wisconsin, Madison, March 1998.

[5] E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchy. In Proceedings of the International Conference on Supercomputing, 1990.

[6] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252-263, June 1997.

[7] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.

[8] A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-53, 1991.

[9] C. Kolb. The Rayshade user's guide. http://graphics.stanford.EDU/~cek/rayshade.

[10] M. H. Lipasti, W. Schmidt, S. R. Kunkel, and R. Roediger. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture, June 1995.

[11] C.-K. Luk and T. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 222-233, October 1996.

[12] S. Mehrotra and L. Harrison. Examination of a memory access classification scheme for pointer-intensive and numeric programs. In Proceedings of the 10th International Conference on Supercomputing, pages 133-139, May 1996.

[13] T. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62-73, October 1992.

[14] S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218, June 1997.

[15] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 24-33, April 1994.

[16] S. Pinter and A. Yoaz. A hardware-based data prefetching technique for superscalar processors. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 214-225, December 1996.

[17] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, 1989.

[18] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, 17(2), March 1995.

[19] A. Roth, A. Moshovos, and G. Sohi. Dependence based prefetching for linked data structures. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 115-126, October 1998.

[20] A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 111-121, May 1999.

[21] A. Smith. Cache memories. Computing Surveys, 14(4):473-530, September 1982.

[22] G. Sohi. Instruction issue logic for high-performance, interruptable, multiple functional unit, pipelined computers. IEEE Transactions on Computers, 39(3):349-359, March 1990.

[23] Z. Zhang and J. Torrellas. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 188-200, June 1995.