Push vs. Pull: Data Movement for Linked Data Structures
Chia-Lin Yang and Alvin R. Lebeck
Department of Computer Science, Duke University
Durham, NC 27708 USA
yangc, [email protected]
Abstract

As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointer-intensive applications still remains a challenging problem. Prefetching linked data structures (LDS) is difficult because the address sequence of an LDS traversal does not present the same arithmetic regularity as array-based applications, and because the data dependence of pointer dereferences can serialize the address generation process. In this paper, we propose a cooperative hardware/software mechanism to reduce memory access latencies for linked data structures. Instead of relying on past address history to predict future accesses, we identify the load instructions that traverse the LDS and execute them ahead of the actual computation. To overcome the serial nature of LDS address generation, we attach a prefetch controller to each level of the memory hierarchy and push, rather than pull, data to the CPU. Our simulations, using four pointer-intensive applications, show that the push model can achieve between 4% and 30% larger reductions in execution time than the pull model.

1. INTRODUCTION

Since 1980, microprocessor performance has improved 60% per year, while memory access time has improved only 10% per year. As the performance gap between processors and memory continues to grow, techniques to reduce the effect of this disparity are essential to build a high performance computer system. The use of caches between the CPU and main memory is recognized as an effective method to bridge this gap [21]. If programs exhibit good temporal and spatial locality, the majority of memory requests can be satisfied by caches without having to access main memory. However, a cache's effectiveness can be limited for programs with poor locality.

Prefetching is a common technique that attempts to hide memory latency by issuing memory requests earlier than demand fetching (which fetches a cache block only when a miss occurs). Prefetching works very well for programs with regular access patterns. Unfortunately, many applications create sophisticated data structures using pointers and do not exhibit sufficient regularity for conventional prefetch techniques to exploit. Furthermore, prefetching pointer-intensive data structures can be limited by the serial nature of pointer dereferences, known as the pointer-chasing problem, since the address of the next node is not known until the contents of the current node are accessible.

For these applications, performance may improve if the lower levels of the memory hierarchy could dereference pointers and proactively push data up to the processor, rather than requiring the processor to fetch each node before initiating the next request. An oracle could accomplish this by knowing the layout and traversal order of the pointer-based data structure, and thus overlap the transfer of consecutively accessed nodes up the memory hierarchy. The challenge is to develop a realizable memory hierarchy that can obtain these benefits.

This paper presents a novel memory architecture to perform pointer dereferences in the lower levels of the memory hierarchy and push data up to the CPU. In this architecture, if the cache block containing the data corresponding to pointer p (*p) is found at a particular level of the memory hierarchy, the request for the next element (*(p+offset)) is issued from that level. This allows the access for the next node to overlap with the transfer of the previous element. In this way, a series of pointer dereferences becomes a pipelined process rather than a serial process.

In this paper, we present results using specialized hardware to support the above push model for data movement in list-based programs. Our simulations using four pointer-intensive benchmarks show that conventional prefetching using the pull model can achieve speedups of 0% to 223%, while the new push model achieves speedups from 6% to 227%. More specifically, the push model reduces execution time by 4% to 30% more than the pull model.

(This work was supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637, CDA-95-12356, and EIA-99-72879, Career Award MIP-97-02547, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.)

Copyright ACM 2000. Appears in the International Conference on Supercomputing (ICS) 2000.
The rest of this paper is organized as follows. Section 2 provides background information and motivation for the push model. Section 3 describes the basic push model and details of a proposed architecture. Section 4 presents our simulation results. Related work is discussed in Section 5 and we conclude in Section 6.

2. BACKGROUND AND MOTIVATION

Pointer-based data structures, often called linked data structures (LDS), present two challenges to any prefetching technique: address irregularity and data dependencies. To improve the memory system performance of many pointer-based applications, both of these issues must be addressed.

    list = head;
    while (list != NULL) {
        p = list->patient;      /* 1: traversal load  */
        x = p->data;            /* 2: data load       */
        list = list->forward;   /* 3: recurrent load  */
    }

    (a) Source code

    1: lw $15, o1($24)
    2: lw $4,  o2($15)
    3: lw $24, o3($24)

    (b) LDS traversal kernel

    [(c) Data dependencies: each instance of recurrent load 3 (3a, 3b, ...) feeds both the next instance (3b) and the next iteration's traversal load (1b), which in turn feeds that iteration's data load (2b).]

Figure 1: Linked Data Structure (LDS) Traversal

2.1 Address Irregularity

Prefetching an LDS is unlike prefetching for array-based numeric applications, since the address sequence of LDS accesses does not have arithmetic regularity. Consider traversing a linear array of integers versus traversing a linked list. For linear arrays, the address of the array element accessed at iteration i is A_addr + 4i, where A_addr is the base address of array A and i is the iteration number. Using this formula, the address of element A[i] at any iteration i is easily computed. Furthermore, the above formula produces a compact representation of array accesses, thus making it easier to predict future addresses. In contrast, LDS elements are allocated dynamically from the heap, and adjacent elements are not necessarily contiguous in memory. Therefore, traditional array-based address prediction techniques are not applicable to LDS accesses.
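To make the contrast concrete, the following C sketch is ours, not the paper's: __builtin_prefetch is the GCC/Clang prefetch intrinsic, and the lookahead distance of 16 elements is arbitrary. A future array address is pure arithmetic, while the LDS analogue needs dependent loads that may themselves miss.

    #include <stddef.h>

    struct node { int data; struct node *next; };

    void array_sum(const int *A, size_t n) {
        for (size_t i = 0; i + 16 < n; i++) {
            __builtin_prefetch(&A[i + 16]);  /* future address: base + arithmetic */
            /* ... compute with A[i] ... */
        }
    }

    void list_walk(const struct node *p) {
        for (; p != NULL; p = p->next) {
            /* Prefetching two nodes ahead needs two dependent loads, each of
               which may itself miss: the pointer-chasing problem. */
            if (p->next != NULL)
                __builtin_prefetch(p->next->next);
            /* ... compute with p->data ... */
        }
    }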
Instead of predicting future LDS addresses based on address regularity, Roth et al. [19] observed that the load instructions (program counters) that access LDS elements are themselves a compact representation of LDS accesses, called a traversal kernel. Consider the linked-list traversal example shown in Figure 1: the structure list is the LDS being traversed. Instructions 1 and 3 access list itself, while instruction 2 accesses another portion of the LDS through pointer p. These three instructions make up the list traversal kernel. Assembly code for this kernel is shown in Figure 1(b). Consecutive iterations of the traversal kernel create data dependencies. Instruction 3 at iteration (a) produces an address consumed by itself and by instruction 1 in the next iteration (b). Instruction 1 produces an address consumed by instruction 2 in the same iteration. This relation can be represented as a dependency tree, shown in Figure 1(c). Loads in LDS traversal kernels can be classified into three types: recurrent, traversal, and data. Instruction 3 (list=list->forward) generates the address of the next element of the list structure, and is classified as a recurrent load. Instruction 1 (p=list->patient) produces addresses consumed by other pointer loads and is called a traversal load. All the loads at the leaves of the dependency tree, such as instruction 2 (p->time++) in this example, are data loads.

Roth et al. proposed dynamically constructing this kernel and repeatedly executing it independent of the CPU execution. This method overcomes the irregular address sequence problem. However, its effectiveness is limited because of the serial nature of LDS address generation.

2.2 Data Dependence

The second challenge for improving the performance of LDS traversal is to overcome the serialization caused by data dependencies. In an LDS traversal, a series of pointer dereferences is required to generate future element addresses. If the address of node i is stored in the pointer p, then the address of node i+1 is *(p+x), where x is the offset of the next field. The address *(p+x) is not known until the value of p is known. This is commonly called the pointer-chasing problem. In this scenario, the serial nature of the memory accesses can make it difficult to hide memory latencies longer than one iteration of computation.

The primary contribution of this paper is the development of a new model for data movement in linked data structures that reduces the impact of data dependencies. By abandoning the conventional pull approach, in which all memory requests (demand fetch or prefetch) are initiated by the processor or an upper level of the memory hierarchy, we can eliminate some of the delay in traversing linked data structures. The following section elaborates on the new model and a proposed implementation.
3. THE PUSH MODEL

Most existing prefetch techniques issue requests from the CPU level to fetch data from the lower levels of the memory hierarchy. Since pointer dereferences are required to generate addresses for successive prefetch requests (recurrent loads), these techniques serialize the prefetch process. The prefetch schedule is shown in Figure 2(a), assuming each recurrent load is an L2 cache miss. The memory latency is divided into six parts: r1, sending a request from L1 to L2; a1, L2 access; r2, sending a request from L2 to main memory; a2, memory access; x2, transferring a cache block back to L2; and x1, transferring a cache block back to L1. Using this timing, node i+1 arrives at the CPU r1+a1+r2+a2+x2+x1 cycles (the full memory access latency) after node i.

[Figure 2: Data Movement for Linked Data Structures. (a) Traditional prefetch data movement: each node incurs the full r1+a1+r2+a2+x2+x1 latency before the next request can be issued. (b) Push model data movement: the memory access for the next node overlaps the transfer of the previous one. (c) Memory system model.]

In the push model, pointer dereferences are performed at the lower levels of the memory hierarchy by executing traversal kernels in micro-controllers associated with each memory hierarchy level. This eliminates the request from the upper levels to the lower level, and allows us to overlap the data transfer of node i with the RAM access of node i+1. For example, assume that the request for node i accesses main memory at time t. The result is known to the memory controller at time t+a2, where a2 is the memory access latency. The controller then issues the request for node i+1 to main memory at time t+a2, thus overlapping this access with the transfer of node i back to the CPU. This process is shown in Figure 2(b). (Note: in this simple model, we ignore the possible overhead for address translation. Section 4 examines this issue.) In this way, we are able to request subsequent nodes in the LDS much earlier than a traditional system. In the push architecture, node i+1 arrives at the CPU a2 cycles after node i. From the CPU standpoint, the latency between node accesses is reduced from r1+a1+r2+a2+x2+x1 to a2. Assuming a 100-cycle overall memory latency and a 60-cycle memory access time, the total latency is potentially reduced by 40%. The actual speedup for real applications depends on (i) the contribution of LDS accesses to total memory latency, (ii) the percentage of LDS misses that are L2 misses, and (iii) the amount of computation in the program that can be overlapped with the memory latency.
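Restating the timing argument compactly, using the same notation (the example numbers are the paper's):

    T_pull = r1 + a1 + r2 + a2 + x2 + x1    (per-node latency, pull model)
    T_push = a2                             (steady-state per-node latency, push model)

    With T_pull = 100 cycles and a2 = 60 cycles:
    (T_pull - T_push) / T_pull = (100 - 60) / 100 = 40% potential reduction.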
[Figure 3: The Push Architecture. A prefetch engine and a prefetch buffer sit beside the L1; prefetch engines at the L2 and main-memory levels receive prefetch requests from above and push prefetched blocks up the hierarchy.]

In the push model, a controller, called a prefetch engine, is attached to each level of the memory hierarchy, as shown in Figure 3. The prefetch engines (PFEs) execute traversal kernels independent of CPU execution. Cache blocks accessed by the prefetch engines at the L2 or main memory levels are pushed up to the CPU and stored either in the L1 cache or in a prefetch buffer. The prefetch buffer is a small, fully-associative cache that can be accessed in parallel with the L1 cache. The remainder of this section describes the design details of this novel push architecture.

3.1 A Push Model Implementation

The function of the prefetch engine is to execute traversal kernels. In this paper we focus on a prefetch engine designed to traverse linked lists. Support for more sophisticated data structures (e.g., trees) is an important area that we are currently investigating, but it remains future work.

Consider the linked-list traversal example in Figure 1. The prefetch engine is designed to execute iterations of the kernel loop until the returned value of the recurrent load (instruction 3: list=list->forward) is NULL. Instances of the recurrent load (3a, 3b, 3c, etc.) form a dependence chain, which is the critical path in the kernel loop. To allow the prefetch engine to move ahead along this critical path as fast as possible, the prefetch engine must support out-of-order execution. General CPUs implement complicated register renaming and wakeup logic [14; 22] to achieve out-of-order execution. However, the complexity and cost of these techniques is too high for our prefetch engine design.

To avoid complicated logic, the prefetch engine executes a single traversal kernel in two separate threads, as shown in Figure 4. The first thread executes recurrent-based loads, which are loads dependent on recurrent loads, e.g., inst. 1 (p=list->patient) and inst. 3 (list=list->forward). This thread executes in order and starts a new iteration once a recurrent load completes. The other thread executes traversal-based loads, which are loads that depend on traversal loads, such as inst. 2 (p->time++). Because traversal loads from different iterations (e.g., 1a and 1b) are allowed to complete out of order, since they can hit in different levels of the memory hierarchy, traversal-based loads should be able to execute out of order.

[Figure 4: Kernel Execution. Thread 1 executes the recurrent-based loads in order (3a, 3b, 3c, ..., each instance feeding 1a, 1b, 1c, ...); thread 2 executes the traversal-based loads (2a, 2b, 2c, ...) out of order.]

The internal structure of the prefetch engine is shown in Figure 5. There are four main hardware structures in the prefetch engine: (1) the Recurrent Buffer (RB), (2) the Producer-Consumer Table (PCT), (3) the Base Address Buffer (BAB), and (4) the Prefetch Request Queue (PRQ). The remainder of this section describes its basic operation.
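As a concrete reference point for the descriptions that follow, the four structures could be modeled in software as below. The field roles come from the text and Figure 5; the names, widths, and encodings are our assumptions.

    #include <stdint.h>

    enum load_type { NR,   /* non-recurrent (traversal-based) */
                     RP }; /* recurrent prefetch */

    struct rb_entry {            /* Recurrent Buffer: one recurrent-based load  */
        enum load_type type;     /* marks whether this is the recurrent load    */
        uint32_t       pc;       /* the load instruction's PC                   */
        int32_t        offset;   /* added to the Base register to form requests */
    };

    struct pct_entry {           /* Producer-Consumer Table: traversal-based    */
        uint32_t producer_pc;    /* index: PC of the load producing the base    */
        uint32_t pc;             /* this consumer load's PC                     */
        int32_t  offset;         /* added to the producer's loaded value        */
    };

    struct prq_entry {           /* Prefetch Request Queue entry: (type, pc, addr) */
        enum load_type type;
        uint32_t       pc;
        uint32_t       addr;
    };

    /* The Base Address Buffer (BAB) is a FIFO of base addresses: the LDS root
       stored by the program at start-up, plus the results of completed
       recurrent prefetches (described below and in Section 3.2). */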
[Figure 5: The Prefetch Engine. The RB feeds an instruction queue; each instruction's offset is added to the Base register to form a request. Completed traversal loads probe the PCT. Requests (type, pc, addr) pass through a TLB into the PRQ, alongside requests from the upper level held in a request buffer.]

The Recurrent Buffer (RB) contains the type, PC, and offset of the recurrent-based loads. The type field indicates whether the corresponding load is a recurrent load. Traversal-based loads are stored in the Producer-Consumer Table (PCT). This structure is similar to the Correlation Table in [19]; it contains the load's offset, its PC, and the producer's PC. The PCT is indexed using the producer PC.

We initialize the prefetch engine by downloading a traversal kernel into the RB and PCT through a memory-mapped interface. Prior to executing an LDS access (e.g., a list traversal), the program starts the PFE by storing the root address of the LDS in the Base Address Buffer (BAB). We accomplish this by assuming a "flavored" load that annotates a load with additional information, in this case that the data being loaded is the root address of an LDS. When the PFE detects that the BAB is not empty and the instruction queue is empty, it loads the first element of the BAB into the Base register and fetches the recurrent-based instructions from the RB into the instruction queue. At each cycle, the first instruction of the instruction queue is processed, and a new prefetch request is formed by adding the instruction's offset to the value in the Base register. The prefetch request is then stored in the prefetch request queue (PRQ) along with its type and PC.

When a traversal load completes, its PC is used to search the PCT. For each matching entry, the effective address of the traversal-based load is generated by adding its offset to the result of the probing traversal load. In this way, the prefetch engine allows the recurrent-based and traversal-based threads to execute independently.

We assume that each level of the memory hierarchy has a small buffer that stores requests from the CPU, from the PFE, and, if appropriate, cache coherence transactions (e.g., snoop requests). By maintaining information on the type of access, the cache controller can provide the appropriate data required for the PFE to continue execution. For example, if the access is a recurrent prefetch, the resulting data (the address of the next node) is extracted and stored in the BAB. If the access is a traversal load prefetch, the PC is used to probe the PCT and the resulting data is added to the offset from the PCT entry. Before serving any request, the controller checks if a request to the same cache block already exists in the request buffer. If such a request exists, the controller simply drops the new request.

Our data movement model also places new demands on the memory hierarchy to correctly handle cache blocks that were not requested by the processor. In particular, when all memory requests initiate from the CPU level, a returning cache block is guaranteed to find a corresponding MSHR entry that contains information about the request and provides buffer space for the data if it must compete for cache ports. In our model, it is no longer guaranteed that an arriving cache block will find a corresponding MSHR. One could exist if the processor executed a load for the corresponding cache block. However, in our implementation, if there is no matching MSHR, the returned cache block can be stored in a free MSHR or in the request buffer. If neither one is available, this cache block is dropped. Dropping these cache blocks does not affect program correctness, since they are the result of a non-binding prefetch operation, not a demand fetch from the processor. Note that the push model requires the transferred cache blocks to also carry their addresses up the memory hierarchy. This is similar to a coherence snoop request traversing up the memory hierarchy.

3.2 Interaction Among Prefetch Engines

Having discussed the design details of individual prefetch engines, we now address the interaction between the prefetch engines at the various levels. Recall that the root address of an LDS is conveyed to the prefetch engine through a special instruction, which we call ldroot. If the memory access of a ldroot instruction is a hit in the L1 cache, the loaded value is stored in the BAB of the prefetch engine at the first level. This triggers the prefetching activity in the L1 prefetch engine.

If a ldroot instruction or the prefetch for a recurrent load causes an L1 cache miss, the cache miss is passed down to the L2 like other cache misses, but with the type bits set to RP (recurrent prefetch). If this miss is satisfied by the L2, the cache controller will detect the RP type and store the result in the BAB. The prefetch engine at the L2 level will find that the BAB is not empty and start prefetching. If the L1 cache miss is the result of a recurrent load prefetch, the L1 PFE stops execution and the L2 PFE begins execution. Data provided by the L2 PFE is placed in the prefetch buffer to avoid polluting the L1 data cache.

The interaction between the L2 PFE and the memory-level PFE is similar to the L1/L2 PFE interaction. The L2 PFE suspends execution when it incurs a miss on a recurrent load, and the memory PFE begins execution when it observes a non-empty BAB. Since the data provided by the memory PFE passes the L2 cache, we deposit the data in both the L2 cache and the prefetch buffer. The L2 cache is generally large enough to contain the additional data blocks and not suffer from severe pollution effects.
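Tying Section 3 together, here is a minimal software model of one engine's operation. This is our sketch, with invented helper names (bab_pop, prq_push) and with timing, TLB, and request-buffer details omitted; thread 1 is the in-order recurrent chain, and thread 2 fires whenever any traversal load completes, which is what lets traversal-based loads finish out of order.

    #include <stdbool.h>
    #include <stdint.h>

    enum load_type { NR, RP };
    struct rb_entry  { enum load_type type; uint32_t pc; int32_t offset; };
    struct pct_entry { uint32_t producer_pc, pc; int32_t offset; };

    /* Engine state and queue helpers (hypothetical): */
    extern struct rb_entry  rb[];   extern unsigned rb_len;
    extern struct pct_entry pct[];  extern unsigned pct_len;
    extern bool bab_pop(uint32_t *base);                         /* BAB FIFO */
    extern void prq_push(enum load_type t, uint32_t pc, uint32_t addr);

    /* Thread 1 (in order): when the BAB yields a base address (the root, or
       the value returned by a completed recurrent prefetch), issue one
       request per RB entry by adding its offset to the Base register. */
    void recurrent_thread_step(void) {
        uint32_t base;
        if (!bab_pop(&base))
            return;                     /* stall until the next base arrives */
        for (unsigned i = 0; i < rb_len; i++)
            prq_push(rb[i].type, rb[i].pc, base + rb[i].offset);
    }

    /* Thread 2 (out of order): when any traversal load completes, probe the
       PCT with its PC and issue every consumer, regardless of iteration. */
    void on_traversal_load_complete(uint32_t pc, uint32_t value) {
        for (unsigned i = 0; i < pct_len; i++)
            if (pct[i].producer_pc == pc)
                prq_push(NR, pct[i].pc, value + pct[i].offset);
    }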
4. EVALUATION

This section presents simulation results that demonstrate the effectiveness of the push model for a set of four list-based benchmarks. First, we describe our simulation framework. This is followed by an analysis of a microbenchmark for both the push model and the conventional pull model for prefetching. We finish by evaluating the two models on four macrobenchmarks and examining the impact of address translation overhead.

4.1 Methodology

To evaluate the push model we modified the SimpleScalar simulator [3] to also simulate our prefetch engine implementation. SimpleScalar models a dynamically scheduled processor using a Register Update Unit (RUU) and a Load/Store Queue (LSQ) [22]. The processor pipeline stages are: (1) Fetch: fetch instructions from the program instruction stream; (2) Dispatch: decode instructions and allocate RUU and LSQ entries; (3) Issue/Execute: execute ready instructions if the required functional units are available; (4) Writeback: supply results to dependent instructions; and (5) Commit: commit results to the register file in program order, freeing RUU and LSQ entries.

Our base processor is a 4-way superscalar, out-of-order processor. The memory system consists of a 32KB, 2-way set-associative first-level data cache with 32-byte cache lines and a 512KB, 4-way set-associative second-level cache with 64-byte cache lines. Both are lock-up free caches that can have up to eight outstanding misses. The L1 cache has 2 read/write ports and can be accessed in a single cycle. We also simulate a 32-entry, fully-associative prefetch buffer with four read/write ports that can be accessed in parallel with the L1 data cache. The second-level cache is pipelined with an 18-cycle round-trip latency. Latency to main memory after an L2 miss is 66 cycles. The CPU has 64 RUU entries, a 16-entry load-store queue, a 2-level branch predictor with a total of 8192 entries, and a perfect instruction cache.

We simulate the pull model by using only a single prefetch engine attached to the L1 cache. This engine executes the same traversal kernels used by the push architecture. However, prefetch requests that miss in the L1 cache must wait for the requested cache block to return from the lower levels before the fetch for the next node can be initiated.

4.2 Microbenchmark Results

We begin by examining the performance of both prefetching models (push and pull) using a microbenchmark. Our microbenchmark simply traverses a linked list of 32,768 32-byte nodes 10 times (see Figure 6). Therefore, all LDS loads are recurrent loads. We also include a small computation loop between each node access. This allows us to evaluate the effects of various ratios between computation and memory access. To eliminate branch effects, we use a perfect branch predictor for these microbenchmark simulations.

    for (i = 0; i < 10; i++) {
        node = head;
        while (node) {
            for (j = 0; j < iters; j++)
                sum = sum + j;
            node = node->next;
        }
    }

Figure 6: Microbenchmark Source Code
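The paper gives only the node size; a plausible C layout is sketched below (the field name and padding are our assumptions). Two properties follow from the stated parameters: a 32-byte node exactly fills one 32-byte L1 line, so consecutive nodes never share a line, and 32,768 such nodes occupy 1MB, twice the 512KB L2, so repeated traversals also miss in the L2 without prefetching.

    /* Hypothetical layout for the microbenchmark's list node: one next
       pointer plus padding to 32 bytes. */
    struct mb_node {
        struct mb_node *next;
        char pad[32 - sizeof(struct mb_node *)];
    };
    _Static_assert(sizeof(struct mb_node) == 32, "one node per 32-byte L1 line");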
Figure 7 shows the speedup for each prefetch model over the base system, with compute iterations on the x-axis. The thick solid line indicates the speedup of a perfect memory system where all references hit in the L1 cache. In general, the results for the push model (thin solid line) match our intuition: large speedups (over 100%) are achieved for low computation-to-memory-access ratios, and the benefits decrease as the amount of computation increases relative to memory accesses. However, the increase in computation is not the only cause of decreased benefits: cache pollution is also a factor.

[Figure 7: Push vs. Pull: Microbenchmark Results. Speedup (%) over the base system versus computation per node access (iterations), for the pull model, the push model, and a perfect memory system.]

[Figure 8: Microbenchmark Prefetch Performance. Miss ratio, prefetch merges, and prefetch hits versus computation per node access (iterations), for the push model and the pull model.]
The dashed line in Figure 7 represents the pull model. We see the same shape as the push model, but at a different scale. It takes much longer for the pull model to reach its peak benefit than the push model. This is a direct consequence of the serialization caused by pointer dereferences; the next node address is not available until the current node is fetched from the lower levels of the memory hierarchy.

We consider three distinct phases in Figure 7. Initially, the push model achieves nearly 100% speedup. While this is not close to the maximum achievable, it is significant. These speedups are a result of cache misses merging with the prefetch requests (see Figure 8). In contrast, the pull model does not exhibit any speedup for low computation per node. Although Figure 8 shows that 100% of the prefetch requests merge with cache misses, the prefetches are issued too late to produce any benefit.

As the computation per node access increases, the prefetch engine is able to move further ahead in the list traversal, eventually eliminating many cache misses. This is shown in Figure 8 by the increase in prefetch buffer hits and the decrease in miss ratio. Although the miss ratio and prefetch merges plateau, performance continues to increase because the prefetch merges hide more and more of the overall memory latency. Eventually, there is enough computation that prefetch requests no longer merge; instead they result in prefetch hits. This is shown by the spike in prefetch hits (the dashed line in Figure 8), and it occurs at the same iteration count as the maximum speedup shown in Figure 7.

Further increases in computation cause the prefetch engine to run too far ahead, thus causing pollution. As new cache blocks are brought closer to the processor, they replace older blocks. For sufficiently long delays between node accesses, the prefetched blocks begin replacing other prefetched blocks not yet accessed by the CPU. This conclusion is supported by the decrease in prefetch buffer hits shown in Figure 8 and an increase in the L2 miss ratio shown in Figure 9.

Although the L1 miss ratio returns to its base value, we continue to see significant speedups. These performance gains are due to prefetch benefits in the L2 cache. Recall that we place prefetched cache blocks into both the L2 and the prefetch buffer. Even though the prefetch engine is so far ahead that it pollutes the prefetch buffer, the L2 still provides benefits. Figure 9 shows that the global L2 miss ratio remains below that of the base system. However, the miss ratio increases, and hence the speedup decreases, as the computation per node increases, because of L2 cache pollution.

[Figure 9: Push vs. Pull: Microbenchmark L2 Cache Global Miss Ratio, for the base system, the pull model, and the push model.]

The results obtained from this microbenchmark clearly show some of the potential benefits of the push model. However, there are some opportunities for further improvement, such as developing a technique to "throttle" the prefetch engine to match the rate of data consumption by the processor, thus avoiding pollution. We leave investigation of this as future work.
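The paper deliberately leaves throttling to future work. Purely to make the idea concrete, one minimal scheme (entirely our sketch, not the authors' design) is a credit counter sized to the prefetch buffer: issuing a recurrent prefetch spends a credit and a CPU hit in the prefetch buffer refunds one, so the engine can run at most CREDIT_MAX nodes ahead of consumption.

    #include <stdbool.h>

    #define CREDIT_MAX 32   /* e.g., one credit per prefetch-buffer entry */

    static int credits = CREDIT_MAX;

    bool throttle_allows_issue(void)  { return credits > 0; }
    void on_recurrent_prefetch(void)  { if (credits > 0) credits--; }
    void on_prefetch_buffer_hit(void) { if (credits < CREDIT_MAX) credits++; }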
4.3 Macrobenchmark Results

While the above microbenchmark results provide some insight into the performance of the push and pull models, it is difficult to extend those results to real applications. Variations in the type of computation and its execution on dynamically scheduled processors can make it very difficult to precisely predict memory latency contributions to overall execution time.

To address this issue, we use four list-based macrobenchmarks: em3d, health, and mst from the Olden pointer-intensive benchmark suite [18], and Rayshade [9]. Em3d simulates electro-magnetic fields. It constructs a bipartite graph and processes lists of interacting nodes. Health implements a hospital check-in/check-out system; the majority of its cache misses come from traversing several long linked lists. Mst finds the minimum spanning tree of a graph. The main data structure is a hash table, with lists used to implement the buckets. Rayshade is a real-world graphics application that implements a ray-tracing algorithm. Rayshade divides space into grids, and the objects in each grid are stored in a linked list. If a ray intersects a grid, the entire object list in that grid is traversed.

Figure 10 shows the speedup for both push and pull prefetching using our benchmarks on systems that incur a one-cycle overhead for address translation with a perfect TLB. (We examine the effects of TLB misses later in this section.) From Figure 10 we see that the push model outperforms the pull model for all four benchmarks. Rayshade shows the largest benefits, with speedups of 227% for the push model and 223% for the pull model. Health achieves a 41% improvement in execution time for the push model while achieving only 11% for the pull model. Similarly, for the push model, em3d and mst achieve 6% and 16%, respectively, while neither shows any benefit using the pull model.

[Figure 10: Push vs. Pull: Macrobenchmark Results. Reduction in execution time for the pull vs. push models: health 11% vs. 41%, mst 0% vs. 16%, em3d 0% vs. 6%, rayshade 223% vs. 227%.]

We note that the results for our pull model are not directly comparable to previous pull-based prefetching techniques, since we do not implement the same mechanisms or use the exact same memory system parameters. Recall that we execute a simple traversal kernel on the L1 prefetch engine to issue prefetch requests. This kernel executes independent of the CPU and does not synchronize in any way. Most previous techniques are "throttled" by the CPU execution, since they prefetch a specific number of iterations ahead. Furthermore, our kernel may prefetch incorrect data. For example, this can occur in mst, since the linked list is searched for a matching key and the real execution may not traverse the entire list. Similarly, in em3d each node in the graph contains a variable-size array of pointers to/from other nodes in the graph. Our kernel assumes the maximum array size, and hence may try to interpret non-pointer data as pointers and issue prefetch requests for this incorrect data.

To gain more insight into the behavior of the push architecture, we analyze prefetch coverage, pollution, and the prefetch request distribution at different levels of the memory hierarchy. Prefetch coverage measures the fraction of would-be read misses serviced by the prefetch mechanism. To determine the impact of cache pollution, we count the number of prefetched blocks replaced from the prefetch buffer before they are ever referenced by the CPU. Finally, the request distribution indicates where in the memory hierarchy prefetch requests are initiated.

Prefetch Coverage

We examine prefetch coverage by categorizing successful prefetch requests according to where in the memory hierarchy the would-be cache miss "meets" the prefetched data. The three categories are partially hidden, hidden L1, and hidden L2. Partially hidden misses are those would-be cache misses that merge with a prefetch request somewhere in the memory hierarchy. Hidden L1 misses are those L1 misses now satisfied by the prefetch buffer. Similarly, hidden L2 misses are those references that were L2 misses without prefetching but are now satisfied by prefetched data in the L2 cache. Note that, since we deposit data in both the L2 cache and the prefetch buffer, blocks replaced from the prefetch buffer can still be resident in the L2 cache.

[Figure 11: Prefetch Coverage. Fraction of would-be misses that are partially hidden, hidden in L1, or hidden in L2, for each benchmark.]

Figure 11 shows the prefetch coverage for our benchmarks. Our results show that for health and mst, the majority of load misses are only partially hidden, 47% and 35%, respectively. These benchmarks do not have enough computation to overlap with the prefetch latency. Rayshade is the other extreme, with no partially hidden misses, but with 60% of its L2 misses eliminated. Em3d shows very little benefit, with only a small number of L2 misses eliminated.
Prefetch Buffer Pollution

To eliminate all L1 misses, a cache block must arrive in the prefetch buffer before the CPU requires the data. Furthermore, the cache block must remain in the prefetch buffer until the CPU accesses the data. If the prefetch engine runs too far ahead, it may replace data before it is accessed by the CPU.

Figure 12 shows, for the push model, the percentage of blocks placed into the prefetch buffer that are replaced without the CPU ever accessing the data. From this data we see that mst, em3d, and rayshade all replace nearly 100% of the blocks placed in the prefetch buffer. Health suffers very little pollution.

[Figure 12: Prefetch Buffer Pollution. Percentage of prefetched blocks replaced from the prefetch buffer before being referenced, per benchmark.]

These replacements can be caused by the prefetch engine running too far ahead and replacing blocks before they are used, or by prefetching the wrong cache blocks. Recall that for mst and em3d our traversal kernel is approximate, and we could be fetching the wrong cache blocks. This appears to be the case, since neither of these benchmarks showed many completely hidden misses (see Figure 11). In contrast, rayshade shows a large number of hidden L2 misses. Therefore, we believe the replacements it incurs are caused by the prefetch engine running too far ahead, but not so far ahead as to pollute the L2 cache.

Prefetch Request Distribution

The other important attribute of the push model is the prefetch request distribution at the different levels of the memory hierarchy. The light bars in Figure 13 show the percentage of prefetch operations issued at each level of the memory hierarchy, and the dark bars show the percentage of redundant prefetch operations. A redundant prefetch requests a cache block that already resides at a higher memory hierarchy level than the level that issued the prefetch.

[Figure 13: Prefetch Request Distribution. For each benchmark (health, mst, em3d, rayshade), the percentage of total and redundant prefetches issued at the L1, L2, and memory levels.]

From Figure 13 we see that health and rayshade issue most prefetches at the main memory level. Em3d issues half from the L2 cache and half from main memory. Finally, mst issues over 30% of its prefetches from the L1 cache, 20% from the L2, and 60% from main memory. Health and rayshade traverse the entire list each time it is accessed. Their lists are large enough that capacity misses likely dominate. In contrast, mst performs a hash lookup and may stop before accessing each node, hence conflict misses can be a large component of all misses. Furthermore, the hash lookup is data dependent and blocks may not be replaced between hash lookups. In this case, it is possible for a prefetch traversal to find the corresponding data in any level of the cache hierarchy, as shown in Figure 13.
Mst also exhibits a significant number of redundant prefetch operations. Most of the L1 prefetches are redundant, while very few of the memory-level prefetches are redundant. For health and em3d, 15% and 7%, respectively, of all prefetches are redundant and issued from the memory level. Rayshade incurs minimal redundant prefetch requests. Redundant prefetches can prevent the prefetch engine from running ahead of the CPU, and they compete with the CPU for cache, memory, and bus cycles. We are currently investigating techniques to reduce the number of redundant prefetch operations.

Impact of Address Translation

Our proposed implementation places a TLB with each prefetch engine. However, the results presented thus far assume a perfect TLB. To model the overheads of TLB misses, we assume hardware-managed TLBs with a 30-cycle miss penalty [19]. When a TLB miss occurs at any level, all the TLBs are updated by the hardware miss handler.

For most of our benchmarks, the TLB miss ratios are extremely low, and hence TLB misses do not contribute significantly to execution time. Health is the only benchmark with a noticeable TLB miss ratio (17% for a 32-entry, fully-associative TLB). Therefore, we present results for only this benchmark.

[Figure 14: Impact of Address Translation for Health. Reduction in execution time for the pull and push models with 32-, 64-, 128-, and 256-entry TLBs and with a perfect TLB.]

Figure 14 shows the speedup achieved for various TLB sizes, including a perfect TLB that never misses. From this data we see that smaller TLB sizes reduce the effectiveness of both push and pull prefetching. However, the push model still achieves over 30% speedup even with a 32-entry TLB, while the pull model achieves less than 5%. Increasing the TLB size increases the benefits of both push and pull prefetching.

5. RELATED WORK

The early data prefetching research, both in hardware and software, focused on regular applications. On the hardware side, Jouppi proposed using stream buffers to prefetch sequential streams of cache lines based on cache misses [7]. The original stream buffer design can only detect unit strides. Palacharla et al. [15] extend this mechanism to detect non-unit strides off-chip. Baer et al. [2] proposed using a reference prediction table (RPT) to detect the stride associated with load/store instructions. The RPT is accessed ahead of the regular program counter by a look-ahead program counter. When the look-ahead program counter finds a matching stride entry in the table, it issues a prefetch. The Tango system [16] extends Baer's scheme to issue prefetches more effectively on one processor. On the software side, Mowry et al. [13] proposed a compiler algorithm to insert prefetch instructions in scientific computing code. They improve upon the previous algorithms [5; 8; 17] by performing locality analysis to avoid unnecessary prefetches.

Correlation-based prefetching is another class of hardware prefetching mechanisms. It uses the current state of the address stream to predict future references. Joseph et al. [6] build a Markov model to approximate the miss reference string using transition frequencies. This model is realized in hardware as a prediction table where each entry contains an index address and four prioritized predictions. Alexander et al. [1] also implement a table-based prediction scheme similar to [6]. However, it is implemented at the main memory level to prefetch data from the DRAM array into SRAM prefetch buffers on the DRAM chip. Correlation-based prefetching is able to capture complex access patterns, but it has three shortcomings: (i) it cannot predict future references as far ahead of the program execution as the push architecture; (ii) its prediction accuracy decreases if applications have dynamically changing data structures or LDS traversal orders; and (iii) to achieve good prediction accuracy, it requires a prediction table proportional to the working set size.

Another class of data prefetching mechanisms focuses specifically on pointer-intensive applications, using data structure information to construct solutions. The Spaid scheme [10], proposed by Lipasti et al., is a compiler-based pointer prefetching mechanism. It inserts prefetch instructions for the objects pointed to by pointer arguments at call sites. Luk et al. [11] proposed three compiler-based prefetching algorithms. Greedy prefetching schedules prefetch instructions for all recurrent loads a single instance ahead. History-pointer prefetching modifies source code to create pointers that provide direct access to non-adjacent nodes based on previous traversals; this approach is only effective for applications that have static structures and traversal orders, and it incurs both space and execution overhead. Data-linearization prefetching maps heap-allocated nodes that are likely to be accessed close together in time into contiguous memory locations, so that future access addresses can be correctly predicted if the creation order is the same as the traversal order.

Chilimbi et al. [4] present another class of software approaches for pointer-intensive applications. They proposed performing cache-conscious data placement that considers several cache parameters, such as the cache block size and associativity. The reorganization process happens at run time. This approach greatly improves spatial and temporal data locality. Its shortcoming is that it can incur high overhead for dynamically changing structures, and it cannot hide the latency of capacity misses.

Zhang et al. [23] proposed a hardware prefetching scheme for irregular applications in shared-memory multiprocessors. This mechanism uses object information to guide prefetching. Programs are annotated, either by programmers or by the compiler, to bind together groups of data (e.g., fields in a record, or two records linked by a pointer). Groups of data are then prefetched under hardware control. The constraint of this scheme is that the accessed fields in a record need to be contiguous in memory. Mehrotra et al. [12] extend stride detection schemes to capture both linear and recurrent access patterns.

Recently, Roth et al. [19] proposed a dependence-based prefetching mechanism (DBP) that dynamically captures LDS traversal kernels and issues prefetches. In their scheme, they only prefetch a single instance ahead of a given load, and wavefront tree prefetching is performed regardless of the tree traversal order. The proposed push architecture allows the prefetching engine to run far ahead of the processor.

In their subsequent work [20], Roth et al. proposed a jump-pointer prefetching framework similar to the history-pointer prefetching in [11]. They provide three implementations: software-only, hardware-only, and a cooperative software/hardware technique. The performance of these three implementations varies across the benchmarks they tested. For the few benchmarks that have a static structure and traversal order and make multiple passes over the LDS, jump-pointer prefetching can achieve high speedups over the DBP mechanism. However, for applications that have dynamically changing behavior, jump-pointer prefetching is not effective and can sometimes even degrade performance. The proposed push architecture can give more stable speedups over the DBP mechanism and adjust the prefetching strategy according to program attributes.

6. CONCLUSION

This paper presented a novel memory architecture designed specifically for linked data structures. This architecture employs a new data movement model for linked data structures, pushing cache blocks up the memory hierarchy. The push model performs pointer dereferences at various levels in the memory hierarchy. This decouples the pointer dereference (obtaining the next node address) from the transfer of the current node up to the processor, and allows implementations to pipeline these two operations.

Using a set of four macrobenchmarks, our simulations show that the push model can improve execution time by 6% to 227%, outperforming a conventional pull-based prefetch model that achieves 0% to 223% speedups. Our results using a microbenchmark show that for programs with little computation between node accesses, the push model significantly outperforms a conventional pull-based prefetch model. For programs with larger amounts of computation, the pull model is better than the push model.

We are currently investigating several extensions to this work. First is a technique to "throttle" the push prefetch engine, preventing it from running too far ahead of the actual computation. We are also investigating techniques to eliminate redundant prefetch operations, and the use of a truly programmable prefetch engine that may allow us to eliminate incorrect prefetch requests (e.g., in em3d). Finally, we are investigating support for more sophisticated pointer-based data structures, such as trees and graphs.

As the impact of memory system performance on overall program performance increases, new techniques to help tolerate memory latency become ever more critical. This is particularly true for programs that utilize pointer-based data structures with little computation between node accesses, since pointer dereferences can serialize memory accesses. For these programs, the push model described in this paper can dramatically improve performance.

7. REFERENCES

[1] T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory systems. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture, February 1996.

[2] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.

[3] D. C. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors: the SimpleScalar tool set. Technical Report 1308, Computer Sciences Department, University of Wisconsin-Madison, July 1996.

[4] T. Chilimbi, J. Larus, and M. Hill. Improving pointer-based codes through cache-conscious data placement. Technical Report CSL-TR-98-1365, University of Wisconsin, Madison, March 1998.

[5] E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings of the International Conference on Supercomputing, 1990.

[6] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252-263, June 1997.

[7] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.

[8] A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-53, 1991.

[9] C. Kolb. The Rayshade user's guide. http://graphics.stanford.EDU/~cek/rayshade.

[10] M. H. Lipasti, W. Schmidt, S. R. Kunkel, and R. Roediger. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture, June 1995.

[11] C.-K. Luk and T. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 222-233, October 1996.

[12] S. Mehrotra and L. Harrison. Examination of a memory access classification scheme for pointer-intensive and numeric programs. In Proceedings of the 10th International Conference on Supercomputing, pages 133-139, May 1996.
[13] T. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62-73, October 1992.

[14] S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218, June 1997.

[15] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 24-33, April 1994.

[16] S. Pinter and A. Yoaz. A hardware-based data prefetching technique for superscalar processors. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 214-225, December 1996.

[17] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, 1989.

[18] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, 17(2), March 1995.

[19] A. Roth, A. Moshovos, and G. Sohi. Dependence based prefetching for linked data structures. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 115-126, October 1998.

[20] A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 111-121, May 1999.

[21] A. Smith. Cache memories. Computing Surveys, 14(4):473-530, September 1982.

[22] G. Sohi. Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers. IEEE Transactions on Computers, 39(3):349-359, March 1990.

[23] Z. Zhang and J. Torrellas. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 188-200, June 1995.