
Selected for the IEEE Transactions on Computers special issue on caches (February); not the final version.

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Vijay S. Pai, Parthasarathy Ranganathan, Hazim Abdel-Shafi, and Sarita Adve

Department of Electrical and Computer Engineering
Rice University
Houston, Texas
{vijaypai|parthas|shafi|sarita}@rice.edu

Abstract: Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching.

Our results show that while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention.

Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.(1)

  (1) This paper combines results from two previous conference papers, using a common set of system parameters, a more aggressive MESI (versus MSI) coherence protocol, a more aggressive compiler (the better of SPARC SC and gcc for each application, rather than gcc), and full simulation of private memory references.

I. Introduction

Shared-memory multiprocessors built from commodity microprocessors are being increasingly used to provide high performance for a variety of scientific and commercial applications. Current commodity microprocessors improve performance by using aggressive techniques to exploit high levels of instruction-level parallelism (ILP). These techniques include multiple instruction issue, out-of-order (dynamic) scheduling, non-blocking loads, and speculative execution. We refer to these techniques collectively as ILP techniques, and to processors that exploit these techniques as ILP processors.

Most previous studies of shared-memory multiprocessors, however, have assumed a simple processor with single issue, in-order scheduling, blocking loads, and no speculation. A few multiprocessor architecture studies model state-of-the-art ILP processors, but do not analyze the impact of ILP techniques.

To fully exploit recent advances in uniprocessor technology for shared-memory multiprocessors, a detailed analysis of how ILP techniques affect the performance of such systems, and of how they interact with previous optimizations for such systems, is required. This paper evaluates the impact of exploiting ILP on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching.

For our evaluations, we study five applications using detailed simulation, described in Section II.

Section III analyzes the impact of ILP techniques on the performance of shared-memory multiprocessors without the use of software prefetching. All our applications see performance improvements from the use of current ILP techniques, but the improvements vary widely. In particular, ILP techniques successfully and consistently reduce the CPU component of execution time, but their impact on the memory stall time is lower and more application-dependent. Consequently, despite the inherent latency tolerance features integrated within ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. These deficiencies are caused by insufficient opportunities in the application to overlap multiple load misses, and by increased contention for system resources from more frequent memory accesses.

Software-controlled non-binding prefetching has been shown to be an effective technique for hiding memory latency in simple-processor-based shared-memory systems. Section IV analyzes the interaction between software prefetching and ILP techniques in shared-memory multiprocessors. We find that, compared to previous-generation systems, increased late prefetches and increased contention for resources cause software prefetching to be less effective in reducing memory stall time in ILP-based systems. Thus, even after adding software prefetching, most of our applications remain largely memory bound on the ILP-based system.

Overall, our results suggest that, compared to previous-generation shared-memory systems, ILP-based systems have a greater need for additional techniques to tolerate or reduce memory latency. Specific techniques motivated by our results include clustering of load misses in the applications, to increase opportunities for load misses to overlap with each other, and techniques such as producer-initiated communication that reduce latency to make prefetching more effective (Section V).

  This work is supported in part by an IBM Partnership award, Intel Corporation, the National Science Foundation under Grants CCR-…, CCR-…, CDA-…, and CDA-…, and the Texas Advanced Technology Program under Grant No. …. Sarita Adve is also supported by an Alfred P. Sloan Research Fellowship, Vijay S. Pai by a Fannie and John Hertz Foundation Fellowship, and Parthasarathy Ranganathan by a Lodieska Stockbridge Vaughan Fellowship.


II. Methodology

A. Simulated Architectures

To determine the impact of ILP techniques on multiprocessor performance, we compare two systems, ILP and Simple, equivalent in every respect except the processor used. The ILP system uses state-of-the-art ILP processors, while the Simple system uses simple processors (Section II-A.1). We compare the ILP and Simple systems not to suggest any architectural tradeoffs, but rather to understand how aggressive ILP techniques impact multiprocessor performance. Therefore, the two systems have identical clock rates and include identical aggressive memory and network configurations suitable for the ILP system (Section II-A.2). Figure 1 summarizes all the system parameters.

Fig. 1. System parameters ("…" marks values not recoverable from this extraction).

  Processor parameters
    Clock rate                    … MHz
    Fetch/decode/retire rate      … per cycle
    Instruction window
      (reorder buffer) size       …
    Memory queue size             …
    Outstanding branches          …
    Functional units              … ALUs, … FPUs, … address
                                  generation units, all 1-cycle latency

  Memory hierarchy and network parameters
    L1 cache                      … KB, direct-mapped, … ports,
                                  … MSHRs, …-byte line
    L2 cache                      … KB, …-way associative, 1 port,
                                  … MSHRs, …-byte line, pipelined
    Memory                        …-way interleaved, … ns access time,
                                  … bytes/cycle
    Bus                           … MHz, … bits, split transaction
    Network                       2D mesh, … MHz, … bits per hop,
                                  flit delay of … network cycles
    Nodes in multiprocessor       …

  Resulting contentionless latencies in processor cycles
    L1 hit                        1 cycle
    L2 hit                        … cycles
    Local memory                  … cycles
    Remote memory                 … cycles
    Cache-to-cache transfer       … cycles

A.1. Processor Models

The ILP system uses state-of-the-art processors that include multiple issue, out-of-order (dynamic) scheduling, non-blocking loads, and speculative execution. The Simple system uses previous-generation simple processors, with single issue, in-order (static) scheduling, and blocking loads, and represents commonly studied shared-memory systems. Since we did not have access to a compiler that schedules instructions for our in-order simple processor, we assume single-cycle functional unit latencies, as also assumed by most previous simple-processor-based shared-memory studies. Both processor models include support for software-controlled non-binding prefetching to the L1 cache.

A.2. Memory Hierarchy and Multiprocessor Configuration

We simulate a hardware cache-coherent, non-uniform memory access (CC-NUMA) shared-memory multiprocessor using an invalidation-based, four-state MESI directory coherence protocol. We model release consistency, because previous studies have shown that it achieves the best performance.

The processing nodes are connected using a two-dimensional mesh network. Each node includes a processor, two levels of caches, a portion of the global shared memory and directory, and a network interface. A split-transaction bus connects the network interface, directory controller, and the rest of the system node. Both caches use a write-allocate, write-back policy. The cache sizes are chosen commensurate with the input sizes of our applications, following the methodology described by Woo et al. The primary working sets for our applications fit in the L1 cache, while the secondary working sets do not fit in the L2 cache. Both caches are non-blocking and use miss status holding registers (MSHRs) to store information on outstanding misses and to coalesce multiple requests to the same cache line. All multiprocessor results reported in this paper use a configuration with … nodes.
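To make the MSHR mechanism concrete, the following is a minimal sketch of the allocate-or-coalesce decision a non-blocking cache makes on a miss. The table size, field names, and the commented-out memory-system hook are illustrative assumptions, not the configuration of Figure 1 or RSIM's actual implementation.

    #include <stdbool.h>

    #define NUM_MSHRS  8     /* illustrative; Figure 1's value was lost */
    #define LINE_SHIFT 6     /* assumes a 64-byte cache line */

    typedef struct {
        bool          valid;     /* entry tracks an outstanding miss */
        unsigned long line_addr; /* cache-line address of the miss   */
        int           pending;   /* requests coalesced onto the miss */
    } Mshr;

    static Mshr mshrs[NUM_MSHRS];

    /* Returns true if the miss is absorbed without stalling: either it
       coalesces onto an outstanding miss to the same line, or a free
       MSHR is allocated and a new memory request would be issued. */
    bool handle_miss(unsigned long addr)
    {
        unsigned long line = addr >> LINE_SHIFT;
        int free_slot = -1;

        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].line_addr == line) {
                mshrs[i].pending++;  /* coalesce: no new memory request */
                return true;
            }
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;            /* all MSHRs busy: access must stall */

        mshrs[free_slot] = (Mshr){ .valid = true, .line_addr = line,
                                   .pending = 1 };
        /* issue_memory_request(line);  hypothetical memory-system hook */
        return true;
    }

When all MSHRs are occupied, further misses stall; this limit bounds the load miss overlap discussed in Section III.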

B. Simulation Environment

We use RSIM, the Rice Simulator for ILP Multiprocessors, to model the systems studied. RSIM is an execution-driven simulator that models the processor pipelines, memory system, and interconnection network in detail, including contention at all resources. It takes SPARC application executables as input. To speed up our simulations, we assume that all instructions hit in the instruction cache. This assumption is reasonable, since all our applications have very small instruction footprints.

C. Performance Metrics

In addition to comparing execution times, we also report the individual components of execution time (CPU, data memory stall, and synchronization stall times) to characterize the performance bottlenecks in our systems. With ILP processors, it is unclear how to assign stall time to specific instructions, since each instruction's execution may be overlapped with both preceding and following instructions. We use the following convention, similar to previous work, to account for stall cycles. At every cycle, we calculate the ratio of the instructions retired from the instruction window in that cycle to the maximum retire rate of the processor, and attribute this fraction of the cycle to the busy time. The remaining fraction of the cycle is attributed as stall time to the first instruction that could not be retired that cycle. We group the busy time and functional unit (non-memory) stall time together as CPU time. Henceforth, we use the term memory stall time to denote the data memory stall component of execution time.

In the first part of the study, the key metric used to evaluate the impact of ILP is the ratio of the execution time with the Simple system relative to that achieved by the ILP system, which we call the ILP speedup. For detailed analysis, we analogously define an ILP speedup for each component of execution time.
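To make the attribution convention concrete, here is a minimal sketch, assuming the simulator exposes the number of instructions retired in each cycle; the function and the retire width are illustrative, not RSIM's actual interface.

    #include <stdio.h>

    #define MAX_RETIRE_RATE 4   /* illustrative maximum retire width */

    /* Attribute each cycle to busy vs. stall time as described above:
       the fraction (retired / maximum retire rate) of the cycle is
       busy, and the remainder is charged as stall time (in the full
       convention, to the first instruction that could not retire). */
    void attribute_time(const int *retired_per_cycle, long num_cycles,
                        double *busy_time, double *stall_time)
    {
        *busy_time = *stall_time = 0.0;
        for (long c = 0; c < num_cycles; c++) {
            double busy_frac =
                (double)retired_per_cycle[c] / MAX_RETIRE_RATE;
            *busy_time  += busy_frac;
            *stall_time += 1.0 - busy_frac;
        }
    }

    int main(void)
    {
        int retired[] = { 4, 2, 0, 1, 4 };  /* toy five-cycle trace */
        double busy, stall;
        attribute_time(retired, 5, &busy, &stall);
        printf("busy = %.2f cycles, stall = %.2f cycles\n", busy, stall);
        return 0;
    }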


D. Applications

Figure 2 lists the applications and the input sets used in this study. Radix, LU, and FFT are from the SPLASH-2 suite, and Water and Mp3d are from the SPLASH suite. These five applications and their input sizes were chosen to ensure reasonable simulation times. (Since RSIM models aggressive ILP processors in detail, it is about … times slower than simple-processor-based shared-memory simulators.)

Fig. 2. Applications and input sizes ("…" marks values not recoverable from this extraction).

  Application     Input Size
  LU, LUopt       …x… matrix, block …
  FFT, FFTopt     … points
  Mp3d            … particles
  Water           … molecules
  Radix           radix …, …K keys, max …K

LUopt and FFTopt are versions of LU and FFT that include ILP-specific optimizations that can potentially be implemented in a compiler.(2) Specifically, we use function inlining and loop interchange to move load misses closer to each other, so that they can be overlapped in the ILP processor; the sketch below illustrates the idea. The impact of these optimizations is discussed in Sections III and V. Both versions of LU are also modified slightly to use flags instead of barriers for better load balance.

  (2) To the best of our knowledge, the key compiler optimization identified in this paper, clustering of load misses, is not implemented in any current superscalar compiler.
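The following fragment illustrates the kind of load miss clustering that these transformations aim for; it is a hypothetical example, not code from the benchmarks themselves.

    #define N      1024
    #define STRIDE 16    /* elements; spaces the loads so that each one
                            below touches a different cache line */
    double a[N * STRIDE];

    /* Before: each likely-missing load is followed by dependent work,
       so at most one miss occupies the instruction window at a time. */
    double reduce_serial(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            double x = a[i * STRIDE];        /* likely cache miss */
            sum += x * x + 2.0 * x + 1.0;    /* work after each miss */
        }
        return sum;
    }

    /* After: unrolling clusters four independent likely-missing loads,
       so an out-of-order processor can keep several misses outstanding
       simultaneously. LUopt and FFTopt reach the same effect through
       function inlining and loop interchange. */
    double reduce_clustered(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i += 4) {
            double x0 = a[(i + 0) * STRIDE];
            double x1 = a[(i + 1) * STRIDE];
            double x2 = a[(i + 2) * STRIDE];
            double x3 = a[(i + 3) * STRIDE];
            sum += x0 * x0 + 2.0 * x0 + 1.0;
            sum += x1 * x1 + 2.0 * x1 + 1.0;
            sum += x2 * x2 + 2.0 * x2 + 1.0;
            sum += x3 * x3 + 2.0 * x3 + 1.0;
        }
        return sum;
    }

On a blocking-load (Simple) processor the two versions behave almost identically, since every miss stalls the pipeline regardless of its neighbors; the benefit appears only when non-blocking loads and a sufficiently large instruction window allow the clustered misses to overlap.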

Since a SPARC compiler for our ILP system does not exist, we compiled our applications with the commercial Sun SC or the gcc compiler (based on better simulated ILP system performance), with full optimization turned on. The compiler's deficiencies in addressing the specific instruction grouping rules of our ILP system are partly hidden by the out-of-order scheduling in the ILP processor.

III. Impact of ILP Techniques on Performance

This section analyzes the impact of ILP techniques on multiprocessor performance by comparing the Simple and ILP systems, without software prefetching.

A. Overall Results

Figures 3 and 4 illustrate our key overall results. For each application, Figure 3 shows the total execution time and its three components for the Simple and ILP systems, normalized to the total time on the Simple system. Additionally, at the bottom, the figure also shows the ILP speedup for each application. Figure 4 shows the parallel efficiency(3) of the ILP and Simple systems, expressed as a percentage. These figures show three key trends:

• ILP techniques improve the execution time of all our applications. However, the ILP speedup shows a wide variation, from 1.29X in Mp3d to 3.54X in LUopt. The average ILP speedup for the original applications (i.e., not including LUopt and FFTopt) is only about 2.05.

• The memory stall component is generally a larger part of the overall execution time in the ILP system than in the Simple system.

• Parallel efficiency for the ILP system is less than that for the Simple system for all applications.

We next investigate the reasons for the above trends.

  (3) The parallel efficiency for an application on a system with N processors is defined as (Execution time on uniprocessor / Execution time on multiprocessor) x (1/N).

B. Factors Contributing to ILP Speedup

Figure 3 indicates that the most important components of execution time are CPU time and memory stalls. Thus, ILP speedup will be shaped primarily by CPU ILP speedup and memory ILP speedup. Figure 5 summarizes these speedups, along with the total ILP speedup. The figure shows that the low and variable ILP speedup for our applications can be attributed largely to insufficient and variable memory ILP speedup; the CPU ILP speedup is similar and significant among all applications (ranging from … to …). More detailed data shows that for most of our applications, memory stall time is dominated by stalls due to loads that miss in the L1 cache. We therefore focus on the impact of ILP on L1 load misses below.

The load miss ILP speedup is the ratio of the stall time due to load misses in the Simple and ILP systems, and is determined by three factors, described below. The first factor increases the speedup, the second decreases it, while the third may either increase or decrease it.

• Load miss overlap. Since the Simple system has blocking loads, the entire load miss latency is exposed as stall time. In ILP, load misses can be overlapped with other useful work, reducing stall time and increasing the ILP load miss speedup. The number of instructions behind which a load miss can overlap is, however, limited by the instruction window size; further, load misses have longer latencies than other instructions in the instruction window. Therefore, load miss latency can normally be completely hidden only behind other load misses. Thus, for significant load miss ILP speedup, applications should have multiple load misses clustered together within the instruction window, to enable these load misses to overlap with each other.

• Contention. Compared to the Simple system, the ILP system can see longer latencies from increased contention due to the higher frequency of misses, thereby negatively affecting load miss ILP speedup.

• Change in the number of misses. The ILP system may see fewer or more misses than the Simple system because of speculation or reordering of memory accesses, thereby positively or negatively affecting load miss ILP speedup.

All of our applications except LU see a similar number of cache misses in both the Simple and ILP case. LU sees …X fewer misses in ILP because of a reordering of accesses that otherwise conflict. When the number of misses does not change, the ILP system sees a load miss ILP speedup greater than 1 if the load miss overlap exploited by ILP outweighs any additional latency from contention. We illustrate the effects of load miss overlap and contention using the two applications that best characterize them: LUopt and Radix.


Fig. 3. Impact of ILP on multiprocessor performance. (Bar chart of normalized execution time, divided into CPU, memory, and synchronization components, for Simple and ILP on each application; ILP speedups: FFT 2.41X, FFTopt 2.56X, LU 2.94X, LUopt 3.54X, Mp3d 1.29X, Radix 1.33X, Water 2.30X.)

Fig. 4. Impact of ILP on parallel efficiency. (Bar chart of percent parallel efficiency for Simple and ILP on each application; per-application values not recoverable from this extraction.)

Fig. 5. ILP speedup for total execution time, CPU time, and memory stall time in the multiprocessor system.

Figure 6(a) provides the average load miss latencies for LUopt and Radix in the Simple and ILP systems, normalized to the Simple system latency. The latency shown is the total miss latency, measured from address generation to data arrival, including the overlapped part in ILP and the exposed part that contributes to stall time. The difference in the bar lengths of Simple and ILP indicates the additional latency added due to contention in ILP. Both of these applications see a significant latency increase from resource contention in ILP. However, LUopt can overlap all its additional latency, as well as a large portion of the base Simple latency, thus leading to a high memory ILP speedup. On the other hand, Radix cannot overlap its additional latency; thus, it sees a load miss slowdown in the ILP configuration.

We use the data in Figures 6(b) and 6(c) to further investigate the causes for the load miss overlap and contention-related latencies in these applications.

Causes for load miss overlap: Figure 6(b) shows the ILP system's L1 MSHR occupancy due to load misses for LUopt and Radix. Each curve shows the fraction of total time for which at least N MSHRs are occupied by load misses, for each possible N on the X axis. This figure shows that LUopt achieves significant overlap of load misses, with up to eight load miss requests outstanding simultaneously at various times. In contrast, Radix almost never has more than one outstanding load miss at any time. This difference arises because load misses are clustered together in the instruction window in LUopt, but are typically separated by too many instructions in Radix.

Causes for contention: Figure 6(c) extends the data of Figure 6(b) by displaying the total MSHR occupancy for both load and store misses. The figure indicates that Radix has a large amount of store miss overlap. This overlap does not contribute to an increase in memory ILP speedup, since store latencies are already hidden in both the Simple and ILP systems due to release consistency. The store miss overlap, however, increases contention in the memory hierarchy, resulting in the ILP memory slowdown in Radix. In LUopt, the contention-related latency comes primarily from load misses, but its effect is mitigated, since overlapped load misses contribute to reducing memory stall time.

C. Memory Stall Component and Parallel Efficiency

Using the above analysis, we can see why the ILP system generally sees a larger relative memory stall time component (Figure 3) and a generally poorer parallel efficiency (Figure 4) than the Simple system.

Since memory ILP speedup is generally less than CPU ILP speedup, the memory component becomes a greater fraction of total execution time in the ILP system than in the Simple system. To understand the reduced parallel efficiency, Figure 7 provides the ILP speedups for the uniprocessor configuration for reference. The uniprocessor also generally sees lower memory ILP speedups than CPU ILP speedups. However, the impact of the lower memory ILP speedup is higher in the multiprocessor, because the longer latencies of remote misses and increased contention result in a larger relative memory component in the execution time, relative to the uniprocessor. Additionally, the dichotomy between local and remote miss latencies in a multiprocessor often tends to decrease memory ILP speedup relative to the uniprocessor, because load misses must be overlapped not only with other load misses, but with load misses with similar latencies. Thus, overall, the multiprocessor system is less able to exploit ILP features than the corresponding uniprocessor system for most applications.(4) Consequently, the ILP multiprocessor generally sees lower parallel efficiency than the Simple multiprocessor.

  (4) FFT and FFTopt see better memory ILP speedups in the multiprocessor than in the uniprocessor, because they overlap multiple load misses with similar multiprocessor (remote) latencies. The section of the code that exhibits overlap has a greater impact in the multiprocessor because of the longer remote latencies incurred in this section.


Fig. 6. Load miss overlap and contention in the ILP system. (a) Effect of ILP on average L1 miss latency: normalized to Simple = 100.0, the total ILP latencies, divided into overlapped and stall portions, rise to 216.1 and 209.2 for LUopt and Radix. (b) L1 MSHR occupancy due to loads. (c) L1 MSHR occupancy due to loads and stores. (Panels (b) and (c) plot utilization against the number of L1 MSHRs, 0 through 8.)

Fig. 7. ILP speedup for total execution time, CPU time, and memory stall time in the uniprocessor system.

IV. Interaction of ILP Techniques with Software Prefetching

The previous section shows that the ILP system sees a greater bottleneck from memory latency than the Simple system. Software-controlled non-binding prefetching has been shown to effectively hide memory latency in shared-memory multiprocessors with simple processors. This section evaluates how software prefetching interacts with ILP techniques in shared-memory multiprocessors. We followed the software prefetch algorithm developed by Mowry et al. to insert prefetches in our applications by hand, with one exception: the algorithm assumes that locality is not maintained across synchronization, and so does not schedule prefetches across synchronization accesses; we removed this restriction when beneficial. For a consistent comparison, the experiments reported use prefetches scheduled identically for both Simple and ILP; the prefetches are scheduled at least … dynamic instructions before their corresponding demand accesses. The impact of this scheduling decision is discussed below, including the impact of varying this prefetch distance.
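As an illustration of what hand-inserted, Mowry-style prefetching looks like at the source level, the fragment below uses GCC's __builtin_prefetch intrinsic with a compile-time prefetch distance. This is a hypothetical sketch: the loop, names, and distance are invented, and the paper's experiments inserted prefetches by hand following Mowry's algorithm rather than in this exact form.

    /* Software-controlled non-binding prefetching with a fixed
       prefetch distance. PF_DIST is chosen so each prefetch is issued
       enough iterations (and hence dynamic instructions) ahead of its
       demand access: too small a distance yields late prefetches, too
       large a distance yields early ones. */
    #define N       65536
    #define PF_DIST 16        /* illustrative prefetch distance */

    double a[N], b[N];

    void scaled_copy(double scale)
    {
        for (int i = 0; i < N; i++) {
            if (i + PF_DIST < N) {
                /* Non-binding hints: they move the line toward the
                   cache without affecting correctness. Arguments are
                   GCC's (address, read(0)/write(1), locality 0-3). */
                __builtin_prefetch(&a[i + PF_DIST], 0, 3);
                __builtin_prefetch(&b[i + PF_DIST], 1, 3);
            }
            b[i] = scale * a[i];
        }
    }

Mowry's full algorithm additionally strip-mines loops and issues one prefetch per cache line rather than per element; those refinements are omitted here for brevity.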

A. Overall Results

Figure 8 graphically presents the key results from our experiments. (FFT and FFTopt have similar performance, so only FFTopt appears in the figure.) The figure shows the execution time and its components for each application on Simple and ILP, both without and with software prefetching; +PF indicates the addition of software prefetching. Execution times are normalized to the time for the application on Simple without prefetching. Figure 9 summarizes some key data.

Software prefetching achieves significant reductions in execution time on ILP (… to …%) for three cases: LU, Mp3d, and Water. These reductions are similar to or greater than those in Simple for these applications. However, software prefetching is less effective at reducing memory stalls on ILP than on Simple (an average reduction of …% in ILP, ranging from …% to …%, versus an average of …% and a range of …% to …% in Simple). The net effect is that even after prefetching is applied to ILP, the average memory stall time is …% on ILP (with a range of …% to …%), versus an average of …% and a range of …% to …% for Simple. For most applications, the ILP system remains largely memory-bound even with software prefetching.

B. Factors Contributing to the Effectiveness of Software Prefetching

We next identify three factors that make software prefetching less successful in reducing memory stall time in ILP than in Simple, two factors that allow ILP additional benefits in memory stall reduction not available in Simple, and one factor that can either help or hurt ILP. We focus on issues that are specific to ILP systems; previous work has discussed non-ILP-specific issues. Figure 10 summarizes the effects that were exhibited by the applications we studied. Of the negative effects, the first two are the most important for our applications.

Increased late prefetches: The last column of Figure 9 shows that the number of prefetches that are too late to completely hide the miss latency increases in all our applications when moving from Simple to ILP. One reason for this increase is that multiple issue and out-of-order scheduling speed up computation in ILP, decreasing the computation time with which each prefetch is overlapped. Simple also stalls on any load misses that are not prefetched or that incur a late prefetch, thereby allowing other outstanding prefetched data to arrive at the cache; ILP does not provide similar leeway.

Increased resource contention: As shown in Section III, ILP processors stress system resources more than Simple. Prefetches further increase demand for resources, resulting in more contention and greater memory latencies. The resources most stressed in our configuration were cache ports, MSHRs, ALUs, and address generation units.

Negative interaction with clustered misses: Optimizations to cluster load misses for the ILP system, as in LUopt, can potentially reduce the effectiveness of software prefetching. For example, the addition of prefetching reduces the execution time of LU by …% on the ILP system; in contrast, LUopt improves by only …%. On the Simple system, both LU and LUopt improve by about …% with prefetching. (LUopt with prefetching is nevertheless slightly better than LU with prefetching on ILP, by …%.) The clustering optimization used in LUopt reduces the computation between successive misses, contributing to a high number of late prefetches and increased contention with prefetching.


Fig. 8. Interaction between software prefetching and ILP. (Bar charts of normalized execution time, divided into CPU, memory, and synchronization components, for Simple, Simple+PF, ILP, and ILP+PF on LU, LUopt, FFTopt, Mp3d, Radix, and Water.)

Fig. 9. Detailed data on the effectiveness of software prefetching. (Table giving, for each application and for both Simple and ILP: the percentage reduction in execution time, the percentage reduction in memory stall time, the remaining memory stall time, and the percentage of prefetches that are late; numeric entries not recoverable from this extraction.) For the average, of LU and LUopt only LUopt is considered, since it provides better performance than LU with prefetching and ILP.

Fig. 10. Factors affecting the performance of prefetching for ILP. (Table marking which of the factors discussed in this section, namely late prefetches, resource contention, clustered load misses, overlapped accesses, early prefetches, and speculative prefetches, apply to each of LU, LUopt, FFTopt, Mp3d, Water, and Radix; late prefetches apply to all six applications, and the remaining per-application marks are not recoverable from this extraction.)

Overlapped accesses: In ILP, accesses that are difficult to prefetch may be overlapped, because of non-blocking loads and out-of-order scheduling. Prefetched lines in LU and LUopt often suffer from L1 cache conflicts, resulting in these lines being replaced to the L2 cache before being used by the demand accesses. This L2 cache latency results in stall time in Simple, but can be overlapped by the processor in ILP. Since prefetching in ILP only needs to target those accesses that are not already overlapped by ILP, it can appear more effective in ILP than in Simple.

Fewer early prefetches: Early prefetches are those where the prefetched lines are either invalidated or replaced before their corresponding demand accesses. Early prefetches can hinder demand accesses by invalidating or replacing needed data from the same or other caches, without providing any benefits in latency reduction. In many of our applications, the number of early prefetches drops in ILP, improving the effectiveness of prefetching for these applications. This reduction occurs because the ILP system allows less time between a prefetch and its subsequent demand access, decreasing the likelihood of an intervening invalidation or replacement.
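To make the late and early categories used in this section concrete, a simulator might classify each prefetch from its timeline roughly as follows. This is an illustrative sketch, not RSIM's actual accounting.

    /* A prefetch is LATE if its data arrives after the demand access
       issues (only part of the miss latency is hidden), and EARLY if
       the line is invalidated or replaced before the demand access
       uses it (no latency is hidden, and useful data may be displaced). */
    typedef struct {
        long issue_cycle;     /* prefetch issued                    */
        long arrival_cycle;   /* line filled into the cache         */
        long demand_cycle;    /* first demand access to the line    */
        long eviction_cycle;  /* invalidated/replaced; -1 if never  */
    } PrefetchRecord;

    typedef enum { PF_USEFUL, PF_LATE, PF_EARLY } PrefetchClass;

    PrefetchClass classify(const PrefetchRecord *p)
    {
        if (p->eviction_cycle >= 0 && p->eviction_cycle < p->demand_cycle)
            return PF_EARLY;  /* lost before it could help           */
        if (p->arrival_cycle > p->demand_cycle)
            return PF_LATE;   /* hides only part of the latency      */
        return PF_USEFUL;     /* arrived in time and stayed resident */
    }

Under this accounting, ILP's faster computation shifts prefetches from useful toward late, since demand accesses follow their prefetches sooner; the same shorter prefetch-to-demand window simultaneously makes early prefetches rarer, matching the trends described above.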

Speculative prefetches: In ILP, prefetch instructions can be speculatively issued past a mispredicted branch. Speculative prefetches can potentially hurt performance by bringing unnecessary lines into the cache, or by bringing needed lines into the cache too early. Speculative prefetches can also help performance, by initiating a prefetch for a needed line early enough to hide its latency. In our applications, most prefetches issued past mispredicted branches were to lines also accessed on the correct path.

C. Impact of Software Prefetching on Execution Time

Despite its reduced effectiveness in addressing memory stall time, software prefetching achieves significant execution time reductions with ILP in three cases (LU, Mp3d, and Water), for two main reasons. First, memory stall time contributes a larger portion of total execution time in ILP. Thus, even a reduction of a small fraction of memory stall time can imply a reduction in overall execution time similar to or greater than that seen in Simple. Second, ILP systems see less instruction overhead from prefetching compared to Simple systems, because ILP techniques allow the overlap of these instructions with other computation.


D. Alleviating Late Prefetches and Contention

Our results show that late prefetches and resource contention are the two key limitations to the effectiveness of prefetching on ILP. We tried several straightforward modifications to the prefetching algorithm and the system to address these limitations. Specifically, we doubled and quadrupled the prefetch distance (i.e., the distance between a prefetch and the corresponding demand access) and increased the number of MSHRs. However, these modifications traded off benefits among late prefetches, early prefetches, and contention, without improving the combination of these factors enough to improve overall performance. We also tried varying the prefetch distance for each access according to the expected latency of that access (versus a common distance for all accesses), and prefetching only to the L2 cache. These modifications achieved their purpose, but did not provide a significant performance benefit for our applications.

V. Discussion

Our results show that shared-memory systems are limited in their effectiveness in exploiting ILP processors, due to limited benefits of ILP techniques for the memory system. The analysis of Section III implies that the key reasons for the limited benefits are the lack of opportunity for overlapping load misses and/or increased contention in the system. Compiler optimizations akin to the loop interchanges used to generate LUopt and FFTopt may be able to expose more potential for load miss overlap in an application. The simple loop interchange used in LUopt provides a …% reduction in execution time compared to LU on an ILP multiprocessor. Hardware enhancements can also increase load miss overlap, e.g., through a larger instruction window. Targeting contention requires increased hardware resources or other latency reduction techniques.

The results of Section IV show that while software prefetching improves memory system performance with ILP processors, it does not change the memory-bound nature of these systems for most of the applications, because the latencies are too long to hide with prefetching and/or because of increased contention. Our results motivate prefetching algorithms that are sensitive to increases in resource usage. They also motivate latency-reducing (rather than latency-tolerating) techniques, such as producer-initiated communication, which can improve the effectiveness of prefetching.

VI. Conclusions

This paper evaluates the impact of ILP techniques supported by state-of-the-art processors on the performance of shared-memory multiprocessors. All our applications see performance improvements from current ILP techniques. However, while ILP techniques effectively address the CPU component of execution time, they are less successful in improving data memory stall time. These applications do not see the full benefit of the latency-tolerating features of ILP processors, because of insufficient opportunities to overlap multiple load misses and increased contention for system resources from more frequent memory accesses. Thus, ILP-based multiprocessors see a larger bottleneck from memory system performance, and generally poorer parallel efficiencies, than previous-generation multiprocessors.

Software-controlled non-binding prefetching is a latency-hiding technique widely recommended for previous-generation shared-memory multiprocessors. We find that while software prefetching results in substantial reductions in execution time for some cases on the ILP system, increased late prefetches and increased contention for resources cause software prefetching to be less effective in reducing memory stall time in ILP-based systems. Even after the addition of software prefetching, most of our applications remain largely memory bound.

Thus, despite the latency-tolerating techniques integrated within ILP processors, multiprocessors built from ILP processors have a greater need for additional techniques to hide or reduce memory latency than previous-generation multiprocessors. One ILP-specific technique discussed in this paper is the software clustering of load misses. Additionally, latency-reducing techniques, such as producer-initiated communication, that can improve the effectiveness of prefetching appear promising.

References

[1]  H. Abdel-Shafi et al. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. In Proc. 3rd Intl. Symp. on High-Performance Computer Architecture, 1997.
[2]  E. H. Gornish. Adaptive and Integrated Data Cache Prefetching for Shared-Memory Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[3]  D. Kroft. Lockup-Free Instruction Fetch/Prefetch Cache Organization. In Proc. 8th Intl. Symp. on Computer Architecture, 1981.
[4]  J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. 24th Intl. Symp. on Computer Architecture, 1997.
[5]  C.-K. Luk and T. C. Mowry. Compiler-Based Prefetching for Recursive Data Structures. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[6]  T. Mowry. Tolerating Latency through Software-controlled Data Prefetching. PhD thesis, Stanford University, 1994.
[7]  B. A. Nayfeh et al. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. In Proc. 23rd Intl. Symp. on Computer Architecture, 1996.
[8]  K. Olukotun et al. The Case for a Single-Chip Multiprocessor. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[9]  V. S. Pai et al. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[10] V. S. Pai et al. RSIM Reference Manual, Version …. ECE TR …, Rice University, 1997.
[11] V. S. Pai et al. The Impact of Instruction Level Parallelism on Multiprocessor Performance and Simulation Methodology. In Proc. 3rd Intl. Symp. on High-Performance Computer Architecture, 1997.
[12] P. Ranganathan et al. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. In Proc. 24th Intl. Symp. on Computer Architecture, 1997.
[13] J. P. Singh et al. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, March 1992.
[14] S. C. Woo et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. 22nd Intl. Symp. on Computer Architecture, 1995.