Software Cache Coherence for Large Scale Multiprocessors

Leonidas I. Kontothanassis and Michael L. Scott

Department of Computer Science
University of Rochester
Rochester, NY

{kthanasi,scott}@cs.rochester.edu

July

This work was supported in part by an NSF Institutional Infrastructure grant (no. CDA) and an ONR research grant, in conjunction with the DARPA Research in Information Science and Technology, High Performance Computing, Software Science and Technology program.

Abstract

Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms have not been fast enough to provide a serious alternative.

We present a new software coherence protocol that narrows the performance gap between hardware and software coherence. This protocol runs on NCC-NUMA (non-cache-coherent, non-uniform memory access) machines, in which a global physical address space allows processors to fill cache lines from remote memory. We compare the performance of the protocol to that of existing software and hardware alternatives. We also evaluate the tradeoffs among various write policies: write-through, write-back, and write-through with a write-collect buffer. Finally, we observe that certain simple program changes can greatly improve performance. For the programs in our test suite, the performance advantage of hardware cache coherence is small enough to suggest that software coherence may be more cost-effective.

Keywords: cache coherence, scalability, cost-effectiveness, lazy release consistency, NCC-NUMA machines

1 Introduction

Large-scale multiprocessors can provide the computational power needed for some of the larger problems of science and engineering today. Shared memory provides an appealing programming model for such machines. To perform well, however, shared memory requires the use of caches, which in turn require a coherence mechanism to ensure that copies of data are sufficiently up to date. Coherence is easy to achieve on small, bus-based machines, where every processor can see the memory traffic of the others. Coherence is substantially harder to achieve on large-scale multiprocessors; it increases both the cost of the machine and the time and intellectual effort required to bring it to market. Given the speed of advances in microprocessor technology, long development times generally lead to machines with out-of-date processors. There is thus a strong motivation to find coherence mechanisms that will produce acceptable performance with little or no special hardware. (We are speaking here of behavior-driven coherence: mechanisms that move and replicate data at run time in response to observed patterns of program behavior, as opposed to compiler-based techniques.)

There are at least three reasons to hope that a software coherence mechanism might be competitive with hardware coherence. First, trap-handling overhead is not very large in comparison to remote communication latencies, and will become even smaller as processor improvements continue to outstrip network improvements. Second, software may be able to embody protocols that are too complicated to implement reliably in hardware at acceptable cost. Third, programmers and developers are becoming aware of the importance of locality of reference, and are attempting to write programs that communicate as little as possible, thereby reducing the impact of coherence operations. In this paper we present a software coherence mechanism that exploits these opportunities to deliver performance approaching that of the best hardware alternatives: within a modest margin in the worst case in our experiments, and usually much closer.

As in most software coherence systems, we use address translation hardware to control access to shared pages. To minimize the impact of the false sharing that comes with such large coherence blocks, we employ a relaxed consistency protocol that combines aspects of both eager release consistency and lazy release consistency. We target our work, however, at NCC-NUMA machines, rather than message-based multicomputers or networks of workstations. Machines in the NCC-NUMA class include the Cray T3D, the BBN TC2000, and the Princeton Shrimp. None of these has hardware cache coherence, but each provides a globally accessible physical address space, with hardware support for cache fills and uncached references that access remote locations. In comparison to multicomputers, NCC-NUMA machines are only slightly harder to build, but they provide two important advantages for implementing software coherence: they permit very fast access to remote directory information, and they allow data to be moved in cache-line-size chunks.

We also build on the work of Petersen and Li, who developed an efficient software implementation of release consistency for small-scale multiprocessors. The key observation of their work was that NCC-NUMA machines allow the coherence block and the data transfer block to be of different sizes. Rather than copy an entire page in response to an access fault, a software coherence mechanism for an NCC-NUMA machine can create a mapping to remote memory, allowing the hardware to fetch individual cache lines as needed, on demand.

Our principal contribution is to extend the work of Petersen and Li to large machines. We distribute and reorganize the directory data structures, inspect those structures only with regard to pages for which the current processor has a mapping, postpone coherence operations for as long as possible, and introduce a new dimension to the protocol state space that allows us to reduce the cost of coherence maintenance on well-behaved pages.

We compare our mechanism to a variety of existing alternatives, including sequentially-consistent hardware, release-consistent hardware, sequentially-consistent software, and the software coherence scheme of Petersen and Li. We find substantial improvements with respect to the other software schemes, enough in most cases to bring software cache coherence within sight of the hardware alternatives.

We also report on the impact of several architectural alternatives on the effectiveness of software coherence. These alternatives include the choice of write policy (write-through, write-back, or write-through with a write-collect buffer) and the availability of a remote reference facility, which allows a processor to choose to access data directly in a remote location by disabling caching. Finally, to obtain the full benefit of software coherence, we observe that minor program changes can be crucial. In particular, we identify the need to employ reader-writer locks, avoid certain interactions between program synchronization and the coherence protocol, and align data structures with page boundaries whenever possible.

The rest of the paper is organized as follows. Section 2 describes our software coherence protocol and provides intuition for our algorithmic and architectural choices. Section 3 describes our experimental methodology and workload. We present performance results in Section 4 and compare our work to other approaches in Section 5. We summarize our findings and conclude in Section 6.

2 The Software Coherence Protocol

In this section we present a scalable algorithm for software cache coherence. The algorithm was inspired by Karin Petersen's thesis work with Kai Li. Petersen's algorithm was designed for small-scale multiprocessors with a single physical address space and non-coherent caches, and has been shown to work well for several applications on such machines.

Like most behavior-driven software coherence schemes, Petersen's relies on address translation hardware, and therefore uses pages as its unit of coherence. Unlike most software schemes, however, it does not migrate or replicate whole pages. Instead, it maps pages where they lie in main memory and relies on the hardware cache-fill mechanism to bring lines into the local cache on demand. To minimize the frequency of coherence operations, the algorithm adopts release consistency for its memory semantics, and performs coherence operations only at synchronization points. (Under release consistency, memory references are classified as acquires, releases, or ordinary references. A release indicates that the processor is completing an operation on which other processors may depend; all of the processor's previous writes must be made visible to any processor that performs a subsequent acquire. An acquire indicates that the processor is beginning an operation that may depend on someone else; all other processors' writes must now be made locally visible.) Between synchronization points, processes may continue to use stale data in their caches. To keep track of inconsistent copies, the algorithm keeps a count, in uncached main memory, of the number of readers and writers for each page, together with an uncached weak list that identifies all pages for which there are multiple writers, or a writer and one or more readers.

Pages that may become inconsistent under Petersen's scheme are inserted in the weak list by the processor that detects the potential for inconsistency. For example, if a processor attempts to read a variable in a currently-unmapped page, the page fault handler creates a read-only mapping, increments the reader count, and adds the page to the weak list if it has any current writers. On an acquire operation, a processor scans the uncached weak list and purges all lines of all weak pages from its cache. The processor also removes all mappings it may have for such a page. If all mappings for a page in the weak list have been removed, the page is removed from the weak list as well.

Unfortunately, while a centralized weak list works well on small machines, it poses serious obstacles to scalability: the size of the list, and consequently the amount of work that a processor needs to perform at a synchronization point, increases with the size of the machine. Moreover, the frequency of references to each element of the list also increases with the size of the machine, implying the potential for serious memory contention. Our goal has been to achieve scalability by designing an algorithm whose overhead is a function of the degree of sharing, and not of the size of the machine. Since previous studies have shown that the degree of sharing for coherence blocks remains relatively constant when the size of the machine increases, an algorithm with the above property should scale nicely to larger numbers of processors.

Figure 1: Scalable software cache coherence state diagram. (The diagram shows the UNCACHED, SHARED, DIRTY, and WEAK states, each in SAFE and UNSAFE variants; transitions are labeled with read, write, and acquire accesses and with conditions on Count, Notices, and Checks relative to their limits.)

Our solution assumes a distributed, non-replicated directory that maintains cacheability and sharing information, similar to the coherent map data structure of PLATINUM. Pages can be in one of the following four states:

Uncached: No processor has a mapping to this page. This is the initial state for all pages.

Shared: One or more processors have read-only mappings to this page.

Dirty: A single processor has both read and write mappings to the page.

Weak: Two or more processors have mappings to the page, and at least one has both read and write mappings to it.

To facilitate transitions from weak back to the other states, the coherent map includes auxiliary counts of the number of readers and writers of each page.
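To make the bookkeeping concrete, the following C sketch shows one plausible layout for a coherent map entry. The type and field names are our own illustration (the paper gives no concrete declaration); the safe flag and the notices and checks counters anticipate the refinement introduced later in this section.

#include <stdint.h>

/* Illustrative layout of one coherent map entry; names and widths are
 * assumptions, not taken from the authors' implementation. */
typedef enum {
    PAGE_UNCACHED,   /* no processor has a mapping (initial state) */
    PAGE_SHARED,     /* one or more read-only mappings             */
    PAGE_DIRTY,      /* exactly one processor has read+write       */
    PAGE_WEAK        /* two or more mappings, at least one r/w     */
} page_state_t;

typedef struct {
    page_state_t state;      /* one of the four base states               */
    int          safe;       /* safe/unsafe flag (refinement added below) */
    int          readers;    /* auxiliary count of read mappings          */
    int          writers;    /* auxiliary count of write mappings         */
    int          notices;    /* write notices posted while safe           */
    int          checks;     /* wasted acquire-time checks while unsafe   */
    uint64_t     mappers;    /* bitmask of processors holding mappings    */
    int          lock;       /* entry lock, held while the state changes  */
} coherent_map_entry_t;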

Each processor holds the portion of the coherent map that describes the pages whose physical memory is local to that processor (the pages for which the processor is the home node). In addition, each processor holds a local weak list that indicates which of the pages to which it has mappings are weak. When a processor takes a page fault, it locks the coherent map entry representing the page on which the fault was taken. It then changes the coherent map entry to reflect the new state of the page. If necessary (i.e., if the page has made the transition from shared or dirty to weak), the processor updates the weak lists of all processors that have mappings for that page. It then unlocks the entry in the coherent map. The process of updating a processor's weak list is referred to as posting a write notice.
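The sketch below, using the illustrative declarations above, traces the page-fault path just described: lock the home node's entry, update the counts and state, post write notices to the current mappers if the page has just become weak, and unlock. The helper routines are hypothetical stand-ins for the OS machinery; error handling and the safe/unsafe refinement are omitted.

/* Hypothetical OS hooks; only their intent matters for the sketch. */
extern void acquire_entry_lock(coherent_map_entry_t *e);
extern void release_entry_lock(coherent_map_entry_t *e);
extern void add_to_local_weak_list(int proc, unsigned long page); /* post a write notice */
extern void map_page_locally(unsigned long page, int writable);

#define MAX_PROCS 64

void handle_page_fault(coherent_map_entry_t *e, unsigned long page,
                       int self, int is_write)
{
    acquire_entry_lock(e);                  /* lock the coherent map entry */

    if (is_write) e->writers++; else e->readers++;
    e->mappers |= 1ull << self;

    int was_weak = (e->state == PAGE_WEAK);

    /* Recompute the page state from the counts. */
    if (e->writers > 0 && e->readers + e->writers > 1)
        e->state = PAGE_WEAK;
    else if (e->writers > 0)
        e->state = PAGE_DIRTY;
    else
        e->state = PAGE_SHARED;

    /* A transition from shared or dirty to weak requires posting a write
     * notice on every processor that currently has a mapping. */
    if (e->state == PAGE_WEAK && !was_weak) {
        for (int p = 0; p < MAX_PROCS; p++)
            if (e->mappers & (1ull << p))
                add_to_local_weak_list(p, page);
    }

    map_page_locally(page, is_write);
    release_entry_lock(e);                  /* unlock the entry */
}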

Distribution of the coherent map and weak list eliminates both the problem of centralization (i.e., memory contention) and the need for processors to do unnecessary work at acquire points (scanning weak list entries in which they have no interest). However, it makes the transition to the weak state considerably more expensive, since a potentially large number of remote memory operations might have to be performed serially in order to notify all sharing processors. Ideally, we would like to maintain the low acquire overhead of per-processor weak lists while requiring only a constant amount of work per shared page on a transition to the weak state.

In order to approach this goal, we take advantage of the fact that page behavior tends to be relatively constant over the execution of a program, or at least a large portion of it. Pages that are weak at one acquire point are likely to be weak at another. We therefore introduce an additional pair of states, called safe and unsafe. These new states, which are orthogonal to the others (for a total of eight distinct states), reflect the past behavior of the page. A page that has made the transition to weak several times, and is about to be marked weak again, is also marked as unsafe. Future transitions to the weak state will no longer require the sending of write notices. Instead, the processor that causes the transition to the weak state changes only the entry in the coherent map, and then continues. The acquire part of the protocol now requires that the acquiring processor check the coherent map entry for all its unsafe pages, and invalidate the ones that are also marked as weak. A processor knows which of its pages are unsafe because it maintains a local list of them; this list is never modified remotely. A page changes from unsafe back to safe if it has been checked at several acquire operations and found not to be weak.
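The acquire-side processing implied by this description is sketched below, using the declarations from the earlier sketches: pages on the local weak list (safe pages for which write notices were posted) are always invalidated, while unsafe pages that the processor maps are checked directly in the coherent map. The check limit, the list representation, and the helper routines are illustrative assumptions, and entry locking is omitted for brevity.

#define CHECK_LIMIT 4        /* illustrative threshold for unsafe -> safe */

extern coherent_map_entry_t *coherent_map_entry_for(unsigned long page);
extern void purge_page_lines(unsigned long page);    /* purge lines from cache */
extern void unmap_page_locally(unsigned long page);

void protocol_acquire(const unsigned long *weak_list,   int n_weak,
                      const unsigned long *unsafe_list, int n_unsafe)
{
    /* 1. Invalidate every page named on our local weak list. */
    for (int i = 0; i < n_weak; i++) {
        purge_page_lines(weak_list[i]);
        unmap_page_locally(weak_list[i]);
    }

    /* 2. For unsafe pages we map, no notices were sent; consult the
     *    coherent map entry and invalidate only if it is really weak. */
    for (int i = 0; i < n_unsafe; i++) {
        coherent_map_entry_t *e = coherent_map_entry_for(unsafe_list[i]);
        if (e->state == PAGE_WEAK) {
            purge_page_lines(unsafe_list[i]);
            unmap_page_locally(unsafe_list[i]);
            e->checks = 0;
        } else if (++e->checks > CHECK_LIMIT) {
            e->safe = 1;     /* page has behaved well; mark it safe again */
            e->checks = 0;
        }
    }
}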

The correctness of the protocol depends on the following observation: unsafe pages are known as such by the processors that create new mappings to them; they will therefore be checked for the possibility of being weak on the next acquire operation. Safe pages have write notices posted on their behalf when they make the transition to the weak state. When a page first makes the transition to unsafe, some processors (those that already have mappings to it) will receive write notices; others (those that create subsequent mappings) will know that it is unsafe from the state information saved in the coherent map entry.

Comparing our protocol to the use of a central weak list, we see that rather than iterate over all weak pages at each acquire point, a processor iterates over only those pages to which it currently has a mapping and that, on the basis of past behavior, have a high probability of really being weak. The comparatively minor downside is that for pages that become weak without a past history of doing so, a processor must pay the cost of posting appropriate write notices.

The state diagram for a page in our protocol appears in Figure 1. The state of a page is represented in the coherent map; it is a property of the system as a whole, not (as in most protocols) the viewpoint of a single processor. The transitions represent read, write, and acquire accesses on the part of any processor. Count is the number of processors having mappings to the page; notices is the number of notices that have been sent on behalf of a safe page; and checks is the number of times that a processor has checked the coherent map regarding an unsafe page and found it not to be weak. Such an access to the coherent map is wasted work, since the processor was not required to invalidate its mapping to the page. To guard against this waste, our policy switches a page back to safe after a small number of unnecessary checks of the coherent map.

We apply one additional optimization. When a processor takes a page fault on a write to a shared (non-weak) page, we could choose to make the transition to weak (and post write notices if the page was safe) immediately, or we could choose to wait until the processor's next release operation; the semantics of release consistency do not require us to make writes visible before then. (Under the same principle, a write page fault on an unmapped page will take the page to the shared state; the writes will be made visible only on the subsequent release operation.) The advantage of delayed transitions is that any processor that executes an acquire operation before the writing processor's next release will not have to invalidate the page. This serves to reduce the overall number of invalidations. On the other hand, delayed transitions have the potential to lengthen the critical path of the computation by introducing contention, especially for programs with barriers, in which many processors may want to post notices for the same page at roughly the same time, and will therefore serialize on the lock of the coherent map entry. Delayed write notices were introduced in the Munin distributed shared memory system, which runs on networks of workstations and communicates solely via messages. Though the relative values of constants are quite different, experiments indicate (see Section 4.1) that delayed transitions are generally beneficial in our environment as well.
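A corresponding sketch of release-time processing under the delayed policy follows, under the same illustrative declarations as the earlier sketches. Pages written since the last release (found, for example, via page-table dirty bits) are moved to the dirty or weak state only now, write notices are posted only for safe pages, and the writes are then made visible. All helper routines are hypothetical.

extern unsigned long *pages_written_since_last_release(int self, int *count);
extern void post_write_notices(coherent_map_entry_t *e, unsigned long page);
extern void write_back_dirty_lines(unsigned long page);

void protocol_release(int self)
{
    int n;
    unsigned long *written = pages_written_since_last_release(self, &n);

    for (int i = 0; i < n; i++) {
        coherent_map_entry_t *e = coherent_map_entry_for(written[i]);
        acquire_entry_lock(e);

        if (e->readers + e->writers > 1) {
            e->state = PAGE_WEAK;
            if (e->safe)                       /* unsafe pages need no notices */
                post_write_notices(e, written[i]);
        } else {
            e->state = PAGE_DIRTY;
        }

        release_entry_lock(e);

        /* Make this processor's writes visible before the release completes
         * (a write-back here; with write-through caches one would instead
         * wait for outstanding acknowledgments). */
        write_back_dirty_lines(written[i]);
    }
}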

One final question that has to be addressed is the mechanism whereby written data makes its way back into main memory. Petersen, in her work, found a write-through cache to be the best option, but write-through can generate a potentially unacceptable amount of memory traffic in large-scale systems. Assuming a write-back cache either requires that no two processors write to the same cache line of a weak page (an unreasonable assumption), or requires a mechanism to keep track of which individual words are dirty. We ran our experiments (see Section 4.2) under three different assumptions: write-through caches; write-back caches with per-word hardware dirty bits in the cache; and write-through caches with a write-collect buffer that hangs onto recently written lines and coalesces any writes that are directed to the same line. Depending on the write policy, the coherence protocol at a release operation must force a write-back of all dirty lines, purge the write-collect buffer, or wait for acknowledgments of write-throughs.

3 Experimental Methodology

We use execution-driven simulation to simulate a mesh-connected multiprocessor with up to 64 nodes. Our simulator consists of two parts: a front end, Mint, that simulates the execution of the processors, and a back end that simulates the memory system. The front end calls the back end on every data reference (instruction fetches are assumed to always be cache hits). The back end decides which processors block waiting for memory and which continue execution. Since the decision is made online, the back end affects the timing of the front end, so that the interleaving of instructions across processors depends on the behavior of the memory system, and control flow within a processor can change as a result of the timing of memory references. This is more accurate than trace-driven simulation, in which control flow is predetermined (recorded in the trace).

The front end is the same in all our experiments; it implements the MIPS II instruction set. Interchangeable modules in the back end allow us to explore the design space of software and hardware coherence. Our hardware-coherent modules are quite detailed, with finite-size caches, full protocol emulation, distance-dependent network delays, and memory access costs (including memory contention). Our simulator is capable of capturing contention within the network, but only at a substantial cost in execution time; the results reported here model network contention at the sending and receiving nodes of a message, but not at the nodes in between. Our software-coherent modules add a detailed simulation of TLB behavior, since the TLB is the protection mechanism used for coherence and can be crucial to performance. To avoid the complexities of instruction-level simulation of interrupt handlers, we assume a constant overhead for page faults. Table 1 summarizes the default parameters used in both our hardware and software coherence simulations, which are in agreement with published values and with several hardware manuals.

Table 1: Default values for system parameters. The parameters include the TLB size (entries); TLB fill time (cycles); interrupt cost (cycles); coherent map modification time (cycles); memory response time (cycles per cache line); page size (Kbytes); total cache per processor (Kbytes); cache line size (bytes); network path width (bits, bidirectional); link latency (cycles); wire latency (cycles); directory lookup cost (cycles); and cache purge time (cycles per line).

Some of the transactions required by our coherence protocols involve a collection of the operations shown in Table 1, and therefore incur the aggregate cost of their constituents. For example, a read page fault on an unmapped page consists of the following: (a) a TLB fault and TLB fill; (b) a processor interrupt caused by the absence of read rights; (c) a coherent map entry lock acquisition; and (d) a coherent map entry modification followed by the lock release. Lock acquisition itself requires traversing the network and accessing the memory module where the lock is located. The total cost of the example transaction is therefore well above that of any of its constituent operations.

3.1 Workload

We report results for six parallel programs. Three are best described as computational kernels: Gauss, sor, and fft. Three are complete applications: mp3d, water, and appbt. The kernels are local creations. Gauss performs Gaussian elimination without pivoting on a dense matrix. Sor computes the steady-state temperature of a metal sheet, using a banded parallelization of red-black successive over-relaxation on a rectangular grid. Fft computes a one-dimensional FFT on an array of complex numbers, using a standard parallel algorithm.

Mp3d and water are part of the SPLASH suite. Mp3d is a wind-tunnel airflow simulation; we simulated a reduced number of particles and time steps in our studies. Water is a molecular dynamics simulation, computing inter- and intra-molecule forces for a set of water molecules; here, too, we used a reduced number of molecules and time steps. Finally, appbt is from the NAS parallel benchmarks suite. It computes an approximation to the Navier-Stokes equations; it was translated to shared memory from the original message-based form by Doug Burger and Sanjay Mehta at the University of Wisconsin. Due to simulation constraints, our input data sizes for all programs are smaller than what would be run on a real machine, a fact that may cause us to see unnaturally high degrees of sharing. Since we still observe reasonable scalability for all the applications, we believe that the data set sizes do not compromise our results.

4 Results

Our principal goal is to determine whether one can approach the performance of hardware cache coherence without the special hardware. To that end, we begin in Section 4.1 by evaluating the tradeoffs between different software protocols. Then, in Sections 4.2 and 4.3, we consider the impact of different write policies and of simple program changes that improve the performance of software cache coherence. These changes include segregation of synchronization variables, data alignment and padding, use of reader-writer locks to avoid coherence overhead, and use of uncached remote references for fine-grain data sharing. Finally, in Section 4.4, we compare the best of the software results to the corresponding results on sequentially-consistent and release-consistent hardware.

4.1 Software coherence protocol alternatives

This section compares the software protocol alternatives discussed in Section 2. The architecture on which the comparison is made assumes a write-back cache, which is flushed at the time of a release. Coherence messages (if needed) can be overlapped with the flush operations, once the writes have entered the network. The five protocols we compare are:

rel.distr.del: The delayed version of our distributed protocol with safe and unsafe pages. Write notices are posted at the time of a release, and invalidations are done at the time of an acquire. At release time, the protocol scans the TLB/page-table dirty bits to determine which pages have been written. Pages can therefore be mapped read-write on the first miss, eliminating the need for a second trap if a read to an unmapped page is followed by a write. This protocol has slightly higher bookkeeping overhead than rel.distr.nodel (below), but reduces trap costs and possible coherence overhead by delaying transitions to the dirty or weak state (and the posting of associated write notices) for as long as possible. It provides the unit of comparison (normalized running time of 1) in our graphs.

rel.distr.nodel: Same as rel.distr.del, except that write notices are posted as soon as an inconsistency occurs. Invalidations are done at the time of an acquire, as before. While this protocol has slightly less bookkeeping overhead (there is no need to remember pages for an upcoming release operation), it may cause higher coherence overhead and higher trap costs. The TLB/page-table dirty bits are not sufficient here, since we want to take action the moment an inconsistency occurs; we must use the write-protect bits to generate page faults.

rel.centr.del: Same as rel.distr.del, except that write notices are propagated by inserting weak pages in a global list which is traversed on acquires. List entries are distributed among the nodes of the machine, although the list itself is conceptually centralized.

rel.centr.nodel: Same as rel.distr.nodel, except that write notices are propagated by inserting weak pages in a global list which is traversed on acquires. This is the protocol proposed by Petersen and Li. (The previous protocol, rel.centr.del, is also similar to that of Petersen and Li, with the addition of the delayed write notices.)

seq: A sequentially consistent software protocol that allows only a single writer for every coherence block at any given point in time. Interprocessor interrupts are used to enforce coherence when an access fault occurs. Interprocessor interrupts present several problems for our simulation environment; fortunately, this is the only protocol that needs them, and the level of detail at which they are simulated is significantly lower than that of other system aspects. Results for this protocol may underestimate the cost of coherence management, especially in cases of high network traffic, but since it is the worst protocol in most cases, the inaccuracy has no effect on our conclusions.

Figure 2: Comparative performance of the different software protocols on 64 processors (normalized execution time for gauss, sor, water, mp3d, appbt, and fft).

Figure 3: Overhead analysis of the different software protocols on 64 processors (IPC interrupts, lock wait time, coherence overhead, and cache miss time).

Figure 2 presents the running time of the different software protocols on our set of partially modified applications. We have used the best version of the applications that does not require protocol modifications (i.e., no identification of reader-writer locks or use of remote reference; see Section 4.3). The distributed protocols outperform the centralized implementations, often by a significant margin. The distributed protocols also show the largest improvement (almost threefold) on water and mp3d, the two applications in which software coherence lags the most behind hardware coherence (see Section 4.4). This is predictable behavior: applications in which the impact of coherence is important are expected to show the greatest variance with different coherence algorithms. However, it is important to notice the difference in scale between Figure 2 and the figures in Section 4.4: while the distributed protocols improve performance over the centralized ones by a factor of three for water and mp3d, they are only modestly worse than their hardware competitors. In programs where coherence is less important, the decentralized protocols still provide reasonable performance improvements over the centralized ones.

The one application in which the sequential protocol outperforms the relaxed alternatives is Gaussian elimination. While the actual difference in performance may be smaller than shown in the graph (due in part to the reduced detail in the implementation of the sequential protocol), there is one source of overhead that the relaxed protocols have to pay that the sequential version does not. Since the releaser of a lock does not know who the subsequent acquirer of the lock will be, it has to flush changes to shared data at the time of a release in the relaxed protocols, so that those changes will be visible. Gauss uses locks as flags to indicate that a particular pivot row is available to processors to eliminate their rows. In Section 4.3 we note that use of the flags results in many unnecessary flushes, and we present a refinement to the relaxed consistency protocols that avoids them.

Sor and water have very regular sharing patterns: sor among neighbors, and water within a well-defined subset of the processors partaking in the computation. The distributed protocol makes a processor pay a coherence penalty only for the pages it cares about, while the centralized one forces processors to examine all weak pages, which is all the shared pages in the case of water, resulting in very high overheads. It is interesting to notice that in water the centralized relaxed consistency protocols are badly beaten by the sequentially consistent software protocol. This agrees to some extent with the results reported by Petersen and Li, but the advantage of the sequentially consistent protocol was less pronounced in their work. We believe there are two reasons for our difference in results. First, we have restructured the code to greatly reduce false sharing, thus removing one of the advantages that relaxed consistency has over sequential consistency. (The sequentially consistent software protocol still outperforms the centralized relaxed consistency software protocols on the unmodified application, but to a lesser extent.) Second, we have simulated a larger number of processors, aggravating the contention caused by the centralized weak list used in the centralized relaxed consistency protocols.

Appbt and fft have limited sharing. Fft exhibits limited pairwise sharing among different processors for every phase (the distance between paired elements decreases with each phase). We were unable to establish the access pattern of appbt from the source code; it uses linear arrays to represent higher-dimensional data structures, and the computation of offsets often uses several levels of indirection.

Mp3d has very widespread sharing. We modified the program slightly, prior to the current studies, to ensure that colliding molecules belong with high probability to either the same processor or neighboring processors. As a result, the molecule data structures exhibit limited pairwise sharing. The main problem is the space cell data structures. Space cells form a three-dimensional array. Unfortunately, molecule movement is fastest in the outermost dimension, resulting in long-stride access to the space cell array. That, coupled with the large coherence block, results in having all the pages of the space cell data structure shared across all processors. Since the processors modify the data structure for every particle they process, the end behavior is a long weak list and serialization under the centralized protocols. The distributed protocols improve the coherence management of the molecule data structures, but can do little to improve on the cell data structure, since sharing is widespread.

While runtime is the most important metric for application performance, it does not capture the full impact of a coherence algorithm. Figure 3 shows the breakdown of overhead into its major components for the five software protocols on our six applications. These components are: IPC interrupt handling overhead (sequentially consistent protocol only); time spent waiting for application locks; coherence protocol overhead (including waiting for system locks, and flushing and purging cache lines); and time spent waiting for cache misses. Coherence protocol overhead has an impact on the time spent waiting for application locks; the two are not easily separable. The relative heights of the bars do not agree in Figures 2 and 3 because the former pertains to the critical path of the computation, while the latter provides totals over all processors for the duration of execution. Aggregate costs for the overhead components can be higher, but critical path length can be shorter, if some of the overhead work is done in parallel. The coherence part of the overhead is significantly reduced by the distributed delayed protocol for all applications. For mp3d the main benefit comes from the reduction of lock waiting time: the program is tightly synchronized, so a reduction in coherence overhead implies less time holding synchronization variables, and therefore a reduction in synchronization waiting time.

Figure 4: Comparative performance of different cache architectures on 64 processors (normalized execution time with write-back, write-through, and write-through-plus-collect caches).

Figure 5: Delayed cache misses under the different cache types, for each application. (Our write-through simulation for fft required too much memory, so we had to modify it slightly; the number of delayed misses we report for it is not directly comparable with that of the other two policies, although it is larger than either of them.)

We have also run simulations in order to determine the performance benefits conferred by the introduction of the safe and unsafe states. What we have discovered is that for our modified applications the performance impact of these two states is small: they help performance slightly in some cases, and hurt it slightly in others (due to unnecessary checks on the unsafe state). The reason for this behavior is the limited degree of page sharing exhibited by our modified applications. We have also run simulations on unmodified applications, and have found that the existence of these two states can improve performance considerably. Unfortunately, the performance of software coherence, even with the introduction of our optimization, is not competitive with hardware for the unmodified applications. We view our optimization as a safeguard that can help yield reasonable performance for bad sharing patterns; for well-behaved programs that scale nicely under software coherence, its impact is significantly reduced.

4.2 Write policies

In this section we consider the choice of write policy for the cache. Specifically, we compare the performance obtained with a write-through cache, a write-back cache, and a write-through cache with a buffer for merging writes. We assume that a single policy is used for all cached data, both private and shared. We have modified our simulator to allow us to vary policies independently for private and shared data, and expect to have results shortly that will simulate the above options for shared data only, while using a write-back policy for private data.

Write-back caches impose the minimum load on the memory and network, since they write blocks back only on eviction or when explicitly flushed. In a software coherent system, however, write-back caches have two undesirable qualities. The first of these is that they delay the execution of synchronization operations, since dirty lines must be flushed at the time of a release; write-through caches have the potential to overlap memory accesses with useful computation.

The second problem is more serious, because it affects program correctness in addition to performance. Because a software coherent system allows multiple writers for the same page, it is possible for different portions of a cache line to be written by different processors. When those lines are flushed back to memory, we must make sure that changes are correctly merged, so that no data modifications are lost. The obvious way to do this is to have the hardware maintain per-word dirty bits, and then to write back only those words in the cache that have actually been modified. (We assume there is no sub-word sharing: words modified by more than one processor imply that the program is not correctly synchronized.)
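In concrete terms, the merge can be expressed as the per-word copy sketched below (the word and line sizes are illustrative): only words whose dirty bits are set overwrite the memory copy, so writes by other processors to other words of the same line are preserved.

/* Merge a flushed cache line into memory using per-word dirty bits. */
#define WORDS_PER_LINE 8

void merge_line_into_memory(unsigned int *memory_line,
                            const unsigned int *cache_line,
                            unsigned int dirty_bits)
{
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if (dirty_bits & (1u << w))       /* only words this processor wrote */
            memory_line[w] = cache_line[w];
}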

Write-through caches can potentially benefit relaxed consistency protocols by reducing the amount of time spent at release points. They also eliminate the need for per-word dirty bits. Unfortunately, they may cause a large amount of traffic, delaying the service of cache misses and in general degrading performance. In fact, if the memory subsystem is not able to keep up with all the traffic, write-through caches are unlikely to actually speed up releases, because at a release point we have to make sure that all writes have been globally performed before allowing the processor to continue. (A write completes when it is acknowledged by the memory system.) With a large amount of write traffic, we may simply have replaced waiting for the write-back with waiting for missing acknowledgments.

Write-through caches with a write-collect buffer employ a small, fully associative buffer between the cache and the interconnection network. The buffer merges writes to the same cache line, and allocates a new entry for a write to a non-resident cache line. When it runs out of entries, the buffer randomly chooses a line for eviction and writes it back to memory. The write-collect buffer is an attempt to combine the desirable features of both the write-through and the write-back cache: it reduces memory and network traffic when compared to a plain write-through cache, and has a shorter latency at release points when compared to a write-back cache. Per-word dirty bits are required at the buffer to allow successful merging of cache lines into memory.
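The toy model below illustrates the write-collect behavior just described: a small, fully associative buffer that merges writes to the same line, tracks per-word dirty bits, and randomly evicts a line when full. The buffer size, line size, and interface are assumptions made for the illustration, not the simulated hardware's parameters.

#include <stdlib.h>
#include <string.h>

#define WC_ENTRIES 16            /* assumed buffer size                     */
#define LINE_WORDS 8             /* assumed 32-byte line of 4-byte words    */

typedef struct {
    int           valid;
    unsigned long line_addr;     /* line-aligned address                    */
    unsigned int  word_dirty;    /* per-word dirty bits                     */
    unsigned int  words[LINE_WORDS];
} wc_entry_t;

static wc_entry_t wc[WC_ENTRIES];

extern void write_back_line(unsigned long line_addr,
                            const unsigned int *words, unsigned int dirty);

void wc_write(unsigned long addr, unsigned int value)
{
    unsigned long line = addr / (LINE_WORDS * 4) * (LINE_WORDS * 4);
    int word = (int)((addr % (LINE_WORDS * 4)) / 4);
    int free_slot = -1;

    for (int i = 0; i < WC_ENTRIES; i++) {
        if (wc[i].valid && wc[i].line_addr == line) {   /* merge into entry */
            wc[i].words[word] = value;
            wc[i].word_dirty |= 1u << word;
            return;
        }
        if (!wc[i].valid) free_slot = i;
    }

    if (free_slot < 0) {                                /* random eviction  */
        free_slot = rand() % WC_ENTRIES;
        write_back_line(wc[free_slot].line_addr,
                        wc[free_slot].words, wc[free_slot].word_dirty);
    }

    memset(&wc[free_slot], 0, sizeof wc[free_slot]);
    wc[free_slot].valid = 1;
    wc[free_slot].line_addr = line;
    wc[free_slot].words[word] = value;
    wc[free_slot].word_dirty = 1u << word;
}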

Figure 4 presents the relative performance of the different cache architectures when using the best relaxed protocol on our best version of the applications. For all programs with the exception of mp3d, the write-back cache outperforms the others; the main reason is the reduced amount of memory traffic. Figure 5 presents the number of delayed cache misses under the different cache policies. A miss is defined as delayed when it is forced to wait in a queue at the memory while contending accesses are serviced. The difference between the cache types is most pronounced in programs that have little sharing or a lot of private data; water, appbt, and fft fall in this category. For water, which has a very large number of private writes, the write-through cache ends up degrading performance severely.

For programs whose data is mostly actively shared, the write-through policies fare better. The best example is mp3d, in which the write-collect cache outperforms the write-back cache by a noticeable margin. The reason for this is that frequent synchronization in mp3d requires frequent write-backs, so the program generates approximately the same amount of traffic as it would with a write-through cache. Furthermore, a flush operation on a page costs a fixed number of cycles (one cycle per line), regardless of the number of lines actually present in the cache. So if only a small portion of a page is touched, the write-back policy still pays a high penalty at releases.

Our results are in partial agreement with those reported by Chen and Veidenbaum. We both find that write-through caches suffer significant performance degradation due to increased network and memory traffic. However, while their results favor a write-collect buffer in most cases, we find that write-back caches are preferable under our software scheme. We believe the difference stems from the fact that we overlap cache flush costs with other coherence management (in their case, cache flushes constitute the coherence management cost), and from the fact that we use a different set of applications.

4.3 Program modifications to support software cache coherence

The performance observed under software coherence is very sensitive to the locality properties of the application. In this section we describe the modifications we had to make to our applications in order to get them to run efficiently on a software coherent system. We then present performance comparisons for the modified and unmodified applications.

We have used four different techniques to improve the performance of our applications. Two are simple program modifications, and require no additions to the coherence protocol; two take advantage of program semantics to give hints to the coherence protocol on how to reduce coherence management costs. Our four techniques are:

- Separation of synchronization variables from other writable program data.

- Data structure alignment and padding, at page or sub-page boundaries.

- Identification of reader-writer locks, and avoidance of coherence overhead at the release point.

- Identification of fine-grained shared data structures, and use of remote reference for their access, to avoid coherence management.

All our changes produced dramatic improvements in the runtime of one or more applications, with some showing very large improvements.

Separation of busy-wait synchronization variables from the data they protect is also used on hardware-coherent systems, to avoid invalidating the data protected by locks due to unsuccessful test-and-set operations on the locks themselves. Under software coherence, however, this optimization becomes significantly more important to performance. The problem caused by the colocation is aggravated by an adverse interaction between the application locks and the locks protecting coherent map entries at the OS level. A processor that attempts to access an application lock for the first time will take a page fault and will attempt to map the page containing the lock; this requires the acquisition of the OS lock protecting the coherent map entry for that page. The processor that attempts to release the application lock must also acquire the lock for the coherent map entry representing the page that contains the lock (and the data it protects), in order to update the page state to reflect the fact that the page has been modified. In cases of contention, the lock protecting the coherent map entry is unavailable: it is owned by the processors attempting to map the page for access.

We have observed this lock-interaction effect in Gaussian elimination, in the access to the lock protecting the index to the next available row. It is also present in the implementation of barriers under the Argonne P4 macros (used by the SPLASH applications), since they employ a shared counter protected by a lock. We have changed the barrier implementation to avoid the problem in

all our applications, and have separated synchronization variables and data in Gauss to eliminate the adverse interaction. Gauss enjoys the greatest improvement due to this change, though noticeable improvements occur in water, appbt, and mp3d as well.

Figure 6: Runtime of Gauss on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix, sync-fix plus reader-writer locks, and remote reference.

Figure 7: Runtime of water on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, and remote reference.

Figure 8: Runtime of mp3d on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, sorting, and remote reference.

Figure 9: Runtime of appbt on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, and remote reference.

Data structure alignment and padding is a well-known means of reducing false sharing. Since coherence blocks in software coherent systems are large (a full page in our case), it is unreasonable to require padding of data structures to that size. However, we can often pad data structures to sub-page boundaries, so that a collection of them will fit exactly in a page. This approach, coupled with a careful distribution of work (ensuring that processor data is contiguous in memory), can greatly improve the locality properties of the application. Water and appbt already had good contiguity, so padding was sufficient to achieve good performance. Mp3d, on the other hand, starts by assigning molecules to random coordinates in the three-dimensional space. As a result, interacting particles are seldom contiguous in memory, and generate large amounts of sharing. We fixed this problem by sorting the particles according to their slow-moving x coordinate and assigning each processor a contiguous set of particles. Interacting particles are now likely to belong to the same page and processor, reducing the amount of sharing.
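As an illustration of the sub-page padding technique, the sketch below pads a molecule-like record so that a whole number of records fits exactly in a page. The field layout and the 256-byte sub-page size are invented for the example; only the padding idea comes from the text.

#define SUBPAGE_BYTES 256   /* assumed sub-page size; 16 records per 4 KB page */

typedef struct {
    double pos[3];
    double vel[3];
    double force[3];
    int    cell_index;
    /* Pad the record to a sub-page boundary so that records owned by
     * different processors never share a coherence block. */
    char   pad[SUBPAGE_BYTES - (9 * sizeof(double) + sizeof(int))];
} padded_molecule_t;

/* Compile-time check that the padding works out exactly as intended. */
typedef char molecule_size_check[
    sizeof(padded_molecule_t) == SUBPAGE_BYTES ? 1 : -1];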

We were motivated to give special treatment to reader-writer locks after studying the Gaussian elimination program. Gauss uses locks to test for the readiness of pivot rows. In the process of eliminating a given row, a processor acquires (and immediately releases) the locks on the previous rows one by one. With regular exclusive locks, the processor is forced on each release to notify other processors of its most recent single-element change to its own row, even though no other processor will attempt to use that element until the entire row is finished. Our change is to observe that the critical section protected by the pivot row lock does not modify any data (it is, in fact, empty), so no coherence operations are needed at the time of the release. We communicate this information to the coherence protocol by identifying the critical section as being protected by a reader's lock. (An alternative fix for Gauss would be to associate with each pivot row a simple flag variable on which the processors for later rows could spin. Reads of the flag would be acquire operations without corresponding releases. This fix was not available to us because our programming model provides no means of identifying acquire and release operations except through a predefined set of synchronization operations.)

In general, changing to the use of readers' locks means changing application semantics, since concurrent entry to a readers' critical section is allowed. Alternatively, one can think of the change as a program annotation that retains exclusive entry to the critical section, but permits the coherence protocol to skip the usual coherence operations at the time of the release. In Gauss the difference does not matter, because the critical section is empty. A "skip coherence operations on release" annotation could be applied even to critical sections that modify data, if the programmer or compiler is sure that the data will not be used by any other processor until after some subsequent release. This style of annotation is reminiscent of entry consistency, but with a critical difference: entry consistency requires the programmer to identify the data protected by particular locks, in effect identifying all situations in which the protocol must not skip coherence operations. Errors of omission affect the correctness of the program. In our case, correctness is affected only by an error of commission, i.e., marking a critical section as protected by a reader's lock when this is not the case.
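The Gauss restructuring can be pictured roughly as follows. The synchronization interface shown is hypothetical (our invention for the illustration); the point is that the empty critical section around each pivot-row lock is marked as protected by a readers' lock, so its release performs no flushes or write notices, while the exclusive release that publishes a finished row still does.

/* Hypothetical lock interface: reader_lock/reader_unlock mark a critical
 * section as protected by a readers' lock, so the protocol may skip
 * coherence operations at the release. */
extern void lock(int row);            /* exclusive lock                          */
extern void unlock(int row);          /* exclusive release: flush, post notices  */
extern void reader_lock(int row);     /* still an acquire: weak pages invalidated */
extern void reader_unlock(int row);   /* no flushes, no write notices            */
extern void eliminate_with_pivot(int row, int pivot);

/* Row locks are initialized held; unlock(row) announces that row is ready. */
void process_row(int row)
{
    for (int pivot = 0; pivot < row; pivot++) {
        /* Wait for the pivot row.  The critical section is empty, so a
         * readers' lock lets the release skip all coherence work. */
        reader_lock(pivot);
        reader_unlock(pivot);

        eliminate_with_pivot(row, pivot);   /* writes only this row */
    }
    unlock(row);    /* exclusive release: make this row's values visible */
}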

Even with the changes just described, there are program data structures that are shared at a very fine grain (both spatial and temporal), and that can therefore cause performance degradation. It can be beneficial to disallow caching for such data structures, and to access the memory module in which they reside directly. We term this kind of access remote reference, although the memory module may sometimes be local to the processor making the reference. We have identified the data structures in our programs that could benefit from remote reference and have annotated them appropriately by hand; our annotations range from one line of code in water to about ten lines in mp3d. Mp3d sees the largest benefit: it improves almost twofold when told to use remote reference on the space cell data structure. Appbt also improves noticeably when told to use remote reference on a certain array of condition variables. Water and Gauss improve only minimally; they have a bit of fine-grain shared data, but they don't use it very much.
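A sketch of what such an annotation might look like for the mp3d space-cell array appears below. The allocation and annotation calls are hypothetical stand-ins, and the structure contents and dimensions are invented for the illustration; only the idea of disabling caching for a fine-grain shared structure comes from the text.

/* Hypothetical annotation interface: mark_uncached() maps a range of shared
 * memory with caching disabled, so accesses go directly to the home memory
 * module and never involve the coherence protocol. */
extern void *shared_malloc(unsigned long bytes);
extern void  mark_uncached(void *start, unsigned long bytes);

#define CELLS_X 16
#define CELLS_Y 16
#define CELLS_Z 16                        /* illustrative dimensions */

typedef struct { int first_molecule; int count; } space_cell_t;

space_cell_t *allocate_space_cells(void)
{
    unsigned long bytes = (unsigned long)sizeof(space_cell_t)
                          * CELLS_X * CELLS_Y * CELLS_Z;
    space_cell_t *cells = shared_malloc(bytes);

    /* The space-cell array is written at a fine grain by all processors;
     * caching it would keep its pages permanently weak, so it is accessed
     * with uncached remote references instead. */
    mark_uncached(cells, bytes);
    return cells;
}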

The performance improvements for our four modified applications can be seen in Figures 6 through 9. Gauss improves markedly when the lock interference problem is fixed, and also benefits from the identification of reader-writer locks; remote reference helps only a little. Water gains most of its performance improvement from padding the molecule data structures to sub-page boundaries and relocating synchronization variables. Mp3d benefits from relocating synchronization variables and padding the molecule data structure to sub-page boundaries; it benefits even more from improving the locality of particle interactions via sorting, and remote reference shaves off a bit more. Finally, appbt sees dramatic improvements after relocating one of its data structures to achieve good page alignment, and benefits nicely from the use of remote reference as well.

Our program changes were simple: identifying and fixing the problems was a mechanical process that consumed at most a few hours. The one exception was mp3d which, apart from the mechanical changes, required an understanding of program semantics for the sorting of particles. Even in that case, identifying the problem took less than a day; fixing it was even simpler (a call to a sorting routine). We believe that such modest forms of tuning represent a reasonable demand on the programmer. We are also hopeful that smarter compilers will be able to make many of the changes automatically. The results for mp3d could most likely be further improved with more major restructuring of access to the space cell data structure, but this would require effort out of keeping with the current study.

4.4 Hardware vs. software coherence

Figures 10 and 11 compare the performance of our best software protocol to that of a relaxed-consistency, DASH-like hardware protocol on 16 and 64 processors, respectively. The unit line in the graphs represents the running time of each application under a sequentially consistent hardware coherence protocol. In all cases the performance of the software protocol is within a modest margin of the better of the hardware protocols; in most cases it is much closer, and for fft the software protocol is fastest. For all programs, the best software protocol is the one described in Section 2, with a distributed coherence map and weak list, safe/unsafe states, delayed transitions to the weak state, and (except for mp3d) write-back caches augmented with per-word dirty bits. (The mp3d result uses a write-through cache with the write-collect buffer, since this is the configuration that performs best for software coherence on this program.) The applications include all the program modifications described in Section 4.3, though remote reference is used only in the context of software coherence; it does not make sense in the hardware-coherent case. Experiments not shown confirm that the program changes improve performance under both hardware and software coherence, though they help more in the software case. They also help the sequentially-consistent hardware more than the release-consistent hardware; we believe this accounts for the relatively modest observed advantage of the latter over the former.

Figure 10: Comparative software and hardware system performance on 16 processors (normalized execution time of the best software and best hardware protocols, relative to sequentially consistent hardware).

Figure 11: Comparative software and hardware system performance on 64 processors (normalized execution time of the best software and best hardware protocols, relative to sequentially consistent hardware).

5 Related Work

Our work is most closely related to that of Petersen and Li: we both use the notion of weak pages, and purge caches on acquire operations. The difference is scalability: we distribute the coherent map and weak list, distinguish between safe and unsafe pages, check the weak list only for unsafe pages mapped by the current processor, and multicast write notices for safe pages that turn out to be weak. We have also examined architectural alternatives and program-structuring issues that were not addressed by Petersen and Li. Our work resembles Munin and lazy release consistency in its use of delayed write notices, but we take advantage of the globally accessible physical address space for cache fills and for access to the coherent map and the local weak lists.

Our use of remote reference to reduce the overhead of coherence management can also be found in work on NUMA memory management. However, relaxed consistency greatly reduces the opportunities for profitable remote data reference. In fact, early experiments we have conducted with on-line NUMA policies and relaxed consistency have failed badly in their attempt to determine when to use remote reference.

On the hardware side, our work bears resemblance to the Stanford Dash project, in the use of a relaxed consistency model, and to the Georgia Tech Beehive project, in the use of relaxed consistency and per-word dirty bits for successful merging of inconsistent cache lines. Both these systems use their extra hardware to allow coherence messages to propagate in the background of computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations.

Coherence for distributed memory with per-processor caches can also be maintained entirely by a compiler. Under this approach, the compiler inserts the appropriate cache flush and invalidation instructions in the code to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs, often dictate conservative decisions that result in higher miss rates and reduced performance.

6 Conclusions

We have shown that supporting a shared memory programming model while maintaining high performance does not necessarily require expensive hardware. Similar results can be achieved by maintaining coherence in software, using the operating system and address translation hardware. We have introduced a new scalable protocol for software cache coherence and have shown that it outperforms existing approaches (both relaxed and sequentially consistent). We have also studied the tradeoffs between different cache write policies, showing that in most cases a write-back cache is preferable, but that a write-collect buffer can help make a write-through cache acceptable. Both write-back with per-word dirty bits and write-collect require special hardware, but neither approaches the complexity of full-scale hardware coherence. Finally, we have shown how some simple program modifications can significantly improve performance on a software coherent system.

We are currently studying the sensitivity of software coherence schemes to architectural parameters (e.g., network latency, and page and cache line sizes). We are also pursuing protocol optimizations that should improve performance for important classes of programs. For example, we are considering policies in which flushes of modified lines and purges of invalidated pages are allowed to take place in the background: during synchronization waits or idle time, or on a communication coprocessor. We are developing on-line policies that use past page behavior to identify situations in which remote access is likely to outperform remote cache fills, and we are considering several issues in the use of remote reference, such as whether to adopt it globally for a given page or to let each processor make its own decision and deal with the coherence issues that then arise. Finally, we believe strongly that software coherence can benefit greatly from compiler support. We are actively pursuing the design of annotations that a compiler can use to provide performance-enhancing hints for OS-based coherence.

Acknowledgements

Our thanks to Ricardo Bianchini and Jack Veenstra for the long nights of discussions, idea exchanges, and suggestions that helped make this paper possible.

References

A. Agarwal et al. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. In M. Dubois and S. S. Thakkar, editors, Scalable Shared Memory Multiprocessors. Kluwer Academic Publishers.

S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, NJ.

T. E. Anderson, H. M. Levy, B. N. Bershad, and E. D. Lazowska. The Interaction of Architecture and Operating System Design. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

J. Archibald and J. Baer. Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Transactions on Computer Systems, November.

D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS Parallel Benchmarks. Report RNR, NASA Ames Research Center, January.

B. N. Bershad and M. J. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors. Technical Report CMU-CS, Carnegie Mellon University, September.

W. J. Bolosky, R. P. Fitzgerald, and M. L. Scott. Simple But Effective Techniques for NUMA Memory Management. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, December.

W. J. Bolosky, M. L. Scott, R. P. Fitzgerald, R. J. Fowler, and A. L. Cox. NUMA Policies and Their Relation to Memory Architecture. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

W. J. Bolosky and M. L. Scott. False Sharing and its Effect on Shared Memory Performance. In Proceedings of the Fourth USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, September. Also available as MSR-TR, Microsoft Research Laboratory, September.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, CA, October.

D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

Y. Chen and A. Veidenbaum. An Effective Write Policy for Software Coherence Schemes. In Proceedings, Supercomputing, Minneapolis, MN, November.

H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, June.

A. L. Cox and R. J. Fowler. The Implementation of a Coherent Memory Abstraction on a NUMA Multiprocessor: Experiences with PLATINUM. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, December.

E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic Software Cache Coherence Through Vectorization. In ACM International Conference on Supercomputing, Washington, DC, July.

S. J. Eggers and R. H. Katz. Evaluation of the Performance of Four Snooping Cache Coherency Protocols. In Proceedings of the Sixteenth International Symposium on Computer Architecture, May.

S. J. Eggers and R. H. Katz. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, April.

M. D. Hill and J. R. Larus. Cache Considerations for Multiprocessor Programmers. Communications of the ACM, August.

P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. ACM SIGARCH Computer Architecture News, May.

P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. ParaNet: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter Technical Conference, San Francisco, CA, January.

Kendall Square Research. KSR Principles of Operation. Waltham, MA.

R. P. LaRowe Jr. and C. S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, November.

R. P. LaRowe Jr., C. S. Ellis, and L. S. Kaplan. The Robustness of NUMA Memory Management. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, CA, October.

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the Seventeenth International Symposium on Computer Architecture, Seattle, WA, May.

D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash Multiprocessor. Computer, March.

K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April.

K. Petersen. Operating System Support for Modern Memory Hierarchies. Ph.D. dissertation, Technical Report CS-TR, Department of Computer Science, Princeton University, October.

G. Shah and U. Ramachandran. Towards Exploiting the Architectural Features of Beehive. Technical Report GIT-CC, College of Computing, Georgia Institute of Technology, November.

J. P. Singh, W. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared Memory. ACM SIGARCH Computer Architecture News, March.

J. E. Veenstra. Mint Tutorial and User Manual. Technical Report, Computer Science Department, University of Rochester, July.

J. E. Veenstra and R. J. Fowler. Mint: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Durham, NC, January-February.