SELECTED FOR PROC. OF THE IEEE SPECIAL ISSUE ON DISTRIBUTED SHARED-MEMORY. NOT THE FINAL VERSION.

Recent Advances in Memory Consistency Models for Hardware Shared-Memory Systems

Sarita V. Adve, Vijay S. Pai, and Parthasarathy Ranganathan

Department of Electrical and Computer Engineering
Rice University
Houston, Texas
rsim@ece.rice.edu

Abstract

The memory consistency model of a shared-memory system determines the order in which memory operations will appear to execute to the programmer. The memory consistency model for a system typically involves a tradeoff between performance and programmability. This paper provides an overview of recent advances in hardware optimizations, compiler optimizations, and programming environments relevant to memory consistency models of hardware distributed shared-memory systems.

We discuss recent hardware and compiler optimizations that exploit the observation that it is sufficient to only appear as if the ordering rules of the consistency model are obeyed. These optimizations substantially improve the performance of the strictest consistency model, making it more attractive for its programmability. Recent concurrent programming languages and environments, on the other hand, support more relaxed consistency models. We discuss several such environments, including POSIX threads, Java, and OpenMP.

I. Introduction

The memory consistency model of a shared-memory system specifies the order in which memory operations will appear to execute to the programmer. The memory consistency model affects the process of writing parallel programs and forms an integral part of the entire system design, including the architecture, the compiler, and the programming language.

Figure 1 shows a program fragment to illustrate how the memory consistency model affects the programmer. The figure shows processor P1 producing some data (Data1 and Data2) that needs to be consumed by processor P2. The variable Flag is used for synchronization: processor P1 writes the value 1 into Flag to indicate that the data is produced, and processor P2 repeatedly reads Flag until it returns the value 1, indicating the data is ready to be consumed. The program is written with the expectation that processor P2's reads of the data variables will return the new values written by processor P1. However, in many current systems, processor P2's reads may return the old values of the data variables, giving an unexpected result. The memory consistency model of a shared-memory system is a formal specification that determines what values the programmer can expect a read to return.

Uniprocessors provide a simple memory consistency model to the programmer that ensures that memory operations will appear to execute one at a time, and in the order specified by the program (program order). Thus, a read in a uniprocessor returns the value of the last write to the same location, where "last" is uniquely defined by the program order. Uniprocessor hardware and compilers, however, do not necessarily execute memory operations one at a time in program order. They may reorder and overlap memory operations; as long as data and control dependences are maintained, the system appears to obey the expected memory consistency model.

In a multiprocessor, the notion of a "last write" is not as well defined, and a more formal specification of the memory consistency model is required. The most intuitive model for multiprocessors is a natural extension of the uniprocessor model. This model, sequential consistency, requires that memory operations appear to execute one at a time in some sequential order, and that the operations of each processor occur in program order. A read in a sequentially consistent multiprocessor must return the value of the last write to the same location in the above sequential order.

Unlike in uniprocessors, simply maintaining per-processor data and control dependences is insufficient to maintain sequential consistency in a multiprocessor. For example, in Figure 1, there are no data or control dependences among the memory operations of processor P1. Thus, a uniprocessor could allow P1's write to Flag to be executed before its writes to the data variables. In this case, it is possible that P2 executes all its reads before the data writes of P1, returning the old values for the data variables and violating sequential consistency. This simple example shows that sequential consistency imposes more restrictions than simply preserving data and control dependences at each processor.

Sequential consistency can restrict several common hardware and compiler optimizations used in uniprocessors. For this reason, relaxed consistency models have been proposed that explicitly permit relaxations of some program orderings. These models allow more optimizations than sequential consistency, but present a more complex model to the programmer. Thus, choosing the memory consistency model of a multiprocessor system has typically involved a tradeoff between performance and programmability.

(This work is supported in part by IBM, Intel, the National Science Foundation under CCR and CDA grants, and the Texas Advanced Technology Program. Sarita V. Adve is also supported by an Alfred P. Sloan Research Fellowship, Vijay S. Pai by a Fannie and John Hertz Foundation Fellowship, and Parthasarathy Ranganathan by a Lodieska Stockbridge Vaughan Fellowship.)

Initially Data1 = Data2 = Flag = 0

P1                     P2
Data1 = ...;           while (Flag != 1) {;}
Data2 = ...;           register1 = Data1;
Flag = 1;              register2 = Data2;

Fig. 1. Motivation for a memory consistency model.

Model                    Program Order Relaxation
Sequential consistency   None
Processor consistency    Write -> Read
Weak ordering            Data -> Data
Release consistency      Data -> Data, Data -> Acquire,
                         Release -> Data, Release -> Acquire

Fig. 2. Relaxations of program order allowed by different memory consistency models. Only program order between memory operations to different locations is considered.

A tutorial paper by Adve and Gharachorloo gives a detailed description of the optimizations restricted by straightforward implementations of sequential consistency, as well as relaxations permitted by several relaxed models.

This paper provides an overview of recent advances in hardware optimizations, compiler optimizations, and programming languages and environments relevant to memory consistency models of hardware shared-memory systems. The advances in hardware and compiler optimizations seek to narrow the performance gap between memory consistency models, making stricter models more attractive for their programmability. On the other hand, recent concurrent programming languages and environments (e.g., POSIX threads, Java, and OpenMP) support more relaxed consistency models. We conclude with a discussion of the impact of the above advances.

II. Background

A. Hardware-centric Models

We briefly describe key aspects of the memory consistency models most commonly supported in commercial systems: sequential consistency (supported by the HP PA-8000 and MIPS R10000 processors), processor consistency (similar to Sun's total store ordering and the model supported by Intel processors), and weak ordering and release consistency (similar to Sun's relaxed memory ordering and the models supported by Digital Alpha and IBM PowerPC processors). More detailed descriptions of these models and their features can be found in the literature.

The relaxed models mentioned above have been referred to as hardware-centric because they were primarily motivated by hardware optimizations. In this paper, optimizations regarding reordering a pair of memory operations implicitly refer to operations on different locations.

Figure 2 summarizes the relaxations in program order allowed by the various memory consistency models discussed in this paper. Sequential consistency requires program order to be maintained among all operations. Processor consistency allows reordering between a write followed by a read in program order. Weak ordering requires distinguishing between data and synchronization operations: data operations can be reordered with respect to each other; however, program ordering of all operations with respect to synchronization operations must be maintained. Release consistency further categorizes synchronization operations into acquires and releases. Release consistency does not enforce ordering between two data operations, between a data operation and a subsequent acquire, or between a release and a subsequent data operation. In this paper, we use the term release consistency to refer to the RCpc model, which also does not enforce ordering between a release and a subsequent acquire. The distinctions between the various categories of memory operations can be achieved in several ways. Current processors support a conservative method through fence (or memory barrier) instructions; typically, all program orders with respect to a fence or memory barrier instruction are enforced.

The examples in Figure 3 further illustrate the differences among the models from the programmer's viewpoint. Part (a) repeats the example in Figure 1. With sequential consistency, since all memory operations must appear to occur in program order, it follows that P2's reads of Data1 and Data2 do not occur until P1's writes to the data locations complete. Therefore, P2's reads of the data locations will return the newly written values. Processor consistency ensures P1's writes will occur in program order and P2's reads will occur in program order; therefore, P2's reads must again return the newly written values. Weak ordering and release consistency, however, do not impose any such restrictions. Therefore, with these models, P1's writes to Data1 and Data2 may occur after P1's write to Flag, and after P2's reads of Data1 and Data2. Consequently, P2's reads of the data locations may return the old values, giving unexpected results. With a weakly ordered or release consistent system, P2's reads can be made to return the new values if the programmer explicitly identifies all accesses to Flag as synchronization operations.

Figure 3(b) illustrates a simplified version of Dekker's algorithm for implementing critical sections. Sequential consistency prohibits the reads of both Flag1 and Flag2 from returning 0 in the same execution. Thus, with sequential consistency, at most one processor will enter the critical section. With processor consistency, however, the reads may execute before the writes of their respective processors (e.g., if the write is sent to a write buffer and the read is allowed to bypass the buffer). Therefore, both the reads of Flag1 and Flag2 may return 0, and both processors may enter the critical section, an undesirable result. With a weakly ordered or release consistent implementation also, both reads may return 0. To prohibit both processors from returning the value 0, the programmer must identify all operations to Flag1 and Flag2 as synchronization operations with weak ordering and release consistency, and use read-modify-writes with processor consistency.

Initially Data1 = Data2 = Flag = 0         Initially Flag1 = Flag2 = 0

P1                 P2                      P1                   P2
Data1 = ...;       while (Flag != 1) {;}   Flag1 = 1;           Flag2 = 1;
Data2 = ...;       register1 = Data1;      register1 = Flag2;   register2 = Flag1;
Flag = 1;          register2 = Data2;      if (register1 == 0)  if (register2 == 0)
                                             critical section     critical section

Allowed results (a):                       Allowed results (b):
SC, PC: register1 and register2            SC: both register1 and register2
  return the new values.                     cannot be 0.
WO, RC: register1 and register2            PC, WO, RC: register1 = 0 or 1,
  may return either the old or               register2 = 0 or 1.
  the new values.

Fig. 3. Illustration of memory consistency models (SC = sequential consistency, PC = processor consistency, WO = weak ordering, RC = release consistency).

The choice of the consistency model may limit the use of hardware and compiler optimizations that improve performance by overlapping, reordering, or eliminating memory operations (e.g., write buffers, lockup-free caches, out-of-order execution, register allocation, common subexpression elimination, compile-time code motion, and loop transformations). In particular, reordering any pair of memory operations can potentially lead to a violation of sequential consistency. Weak ordering and release consistency allow the most optimizations, since they allow reordering of all data operations between consecutive synchronization operations. However, these models are harder to reason with, since they require programmers to deal with nonintuitive hardware and compiler optimizations. Thus, the choice of the memory consistency model of a system has typically involved a tradeoff between programmability and performance.

B. Programmer-centric Framework

The programmer-centric framework for specifying memory consistency models seeks to alleviate the tradeoff between programmability and performance seen by hardware-centric models. The framework is based on the hypothesis that programmers should not be required to reason about non-sequentially-consistent systems. The framework guarantees sequential consistency if the programmer provides some information about the program or obeys certain constraints. The information and constraints relate only to the behavior of the program on sequentially consistent systems. The knowledge of this behavior is used by the system to determine optimizations that will not violate sequential consistency for that program.

The data-race-free and properly-labeled models illustrate the programmer-centric framework. These models seek to provide the optimizations of the weak ordering and release consistency hardware-centric models, but without requiring the programmer to reason about those optimizations. The models provide sequential consistency to data-race-free programs, which are defined as follows.

Call two memory operations conflicting if they are from different processors, they access the same location, and at least one of them is a write. Assume the system provides a mechanism for distinguishing memory operations as data or synchronization.

A program is data-race-free if, in every sequentially consistent execution of the program, any two conflicting memory operations are either separated by synchronization operations or are both distinguished as synchronization operations. Alternatively, if two conflicting operations occur one after another without any intervening memory operations, then the programmer should distinguish them as synchronization operations (also called race operations); other operations may be distinguished as either data or synchronization.

For example, in Figure 3(a), the operations accessing Data1 and Data2 are always separated by the operations on Flag in any sequentially consistent execution. Therefore, the operations on Data1 and Data2 may be distinguished as data or synchronization. However, the write of Flag and the final read of Flag are not separated by any other operations, and must be distinguished as synchronization.

The optimizations of weak ordering and release consistency can be applied to data-race-free programs without violating sequential consistency. The mechanism for distinguishing operations as data or synchronization may vary based on the programming language support (see Section V).

III. Hardware Advances

Straightforward hardware implementations of consistency models directly enforce their ordering constraints by prohibiting a memory operation O from entering the memory system until the completion of all previous operations for which O must appear to wait. For example, a straightforward implementation of sequential consistency will not issue a memory operation until the preceding memory operation of its processor is complete. However, certain features in recent processors, described below, permit more optimized implementations of consistency models.

Out-of-order scheduling: A processor with out-of-order scheduling simultaneously examines several consecutive instructions within an instruction window. The processor issues an instruction from the instruction window to its functional units or the memory hierarchy once the instruction's data dependences are resolved, even if previous instructions are still blocked. To maintain precise exceptions, however, instructions are retired from the instruction window, and all changes to architectural state (e.g., registers and memory) are made, in program order.

Nonblocking loads: Many current processors do not block on a load, but can continue executing independent instructions, including other loads, while one or more loads (including cache misses) are outstanding. To maintain precise interrupts, a load does not leave the instruction window until it returns a value. Therefore, the effectiveness of this technique is limited by the instruction window size, which determines the maximum number of instructions that will be overlapped with the load.

Speculation: Current processors support speculative execution in several forms, the most common of which is branch prediction. Instructions after the speculation point (e.g., a branch) continue to be decoded, issued, and executed as ordinary instructions. However, these instructions are not allowed to commit their values into the architectural state of the processor until all prior speculations have been resolved. If any speculation is determined to be incorrect (e.g., a mispredicted branch), the execution rolls back to the state at the point of the mispredicted speculation, and all later speculated instructions are squashed.

Prefetching: Many processors provide support for nonbinding prefetch requests for cache lines that are likely to be accessed in the future. Such a request is expected to bring the accessed line into the processor's cache before the processor issues the actual demand access, thereby overlapping the memory latency with other work. Prefetch requests can be initiated either by software, through explicit prefetch instructions, or by hardware, using runtime prediction.

We next describe optimizations that use the above features to improve the performance of consistency models and are supported in current commercial systems. These optimizations exploit the observation that it is sufficient to only appear as if the ordering rules of the consistency model are obeyed. Each of these techniques assumes a hardware cache-coherent system.

A. Hardware Prefetching from the Instruction Window

In straightforward implementations, a processor's instruction window may contain several decoded memory instructions that are not issued to the memory system due to consistency constraints. For example, a decoded load is not issued in a sequentially consistent system until the previous memory operation of the processor completes. The processor can issue nonbinding prefetches for such instructions without violating the consistency model, thereby hiding some memory latency.

The latency that can be hidden using this technique is limited by the instruction window size. Additionally, prefetches that are issued too early may sometimes degrade performance by fetching data before it is ready. Such a degradation may result from extra network traffic, and from extra memory latency seen at processors that lose their cache lines prematurely to early prefetches.

B. Speculative Load Execution

The hardware prefetching technique described in the previous section does not allow a load to consume its value until the completion of preceding memory operations ordered by the consistency model, even if the load's value is already in the cache. Speculative load execution extends the benefits of prefetching by speculatively consuming the value of loads brought into the cache, regardless of consistency constraints. If the accessed location does not change its value until the load could have been nonspeculatively issued, then the speculation is successful. However, if the location does change its value, then the processor rolls back its execution to the incorrect load. The rollback can be achieved using the same mechanisms as used for flushing incorrectly executed instructions after a branch misspeculation or an exception.

To detect whether a value speculatively consumed by a load changes before the load is allowed to issue nonspeculatively, the processor exploits the coherence mechanisms of hardware cache-coherent multiprocessors. In these systems, a change of a cache line by another processor will trigger an external coherence action (e.g., invalidate or update) to other caches holding the data, as long as the data remains in the cache. Thus, the processor monitors coherence requests to, and replacements from, its caches; either action to a cache line with speculatively consumed data triggers a rollback.

As with nonblocking loads in general, the latency tolerance potential of speculative load execution is limited by the size of the instruction window. In addition, consuming speculative values too early can result in increased rollbacks, potentially degrading performance.

Speculative load execution and hardware store prefetching from the instruction window are supported by the HP PA-8000, Intel Pentium Pro, and MIPS R10000 processors.

C. Cross-Window Prefetching

Prefetches may also be issued for instructions that are not currently in the instruction window but are expected to be executed in the future. Either the compiler may insert explicit software prefetch instructions in the program, or the hardware may issue such requests at runtime. Although this technique is applicable even in uniprocessor systems, the use of cross-window prefetching to hide latency may also impact the relative performance of consistency models. This technique alleviates the limitations imposed by a small instruction window size on the hardware prefetching technique described in Section III-A, but requires the compiler or hardware to predict future accesses. Most current processors support a software prefetch instruction, and there has been significant compiler work on effective insertion of such prefetches. Several hardware techniques for predicting future data accesses have also been proposed, but are not yet commonly implemented.

D. Simulation Studies

There have been several simulation studies that analyze the performance differences between different memory consistency models. Earlier studies analyzed straightforward implementations of consistency models, and the optimization of hardware prefetching from the instruction window, with single-issue, in-order processors. More recent studies have examined the optimized implementations discussed in this section for multiprocessors built from state-of-the-art processors. These processors aggressively exploit instruction-level parallelism (ILP) with features such as multiple instruction issue, out-of-order scheduling, nonblocking reads, and speculative execution. This section summarizes results found in the latter studies.

The quantitative data reported here, however, are based on a new set of simulations, with a uniform set of system parameters for all experiments, a more recent and aggressive compiler, and a more aggressive cache coherence protocol (MESI versus MSI).

Simulation framework: We report results for five scientific applications from the Stanford SPLASH and SPLASH-2 suites on a simulated hardware cache-coherent shared-memory multiprocessor system. The simulations are performed with RSIM, the Rice Simulator for ILP Multiprocessors, a detailed execution-driven simulator. RSIM models an out-of-order superscalar processor pipeline, a two-level cache hierarchy, a split-transaction bus on each processor node, and an aggressive memory and multiprocessor interconnection network subsystem, including contention at all resources. The modeled systems implement an invalidation-based, four-state MESI directory cache coherence protocol. Figure 4 summarizes the values of the parameters for the base systems simulated. The cache sizes are chosen following the working-set characterizations of Woo et al. We also performed experiments that double and quadruple all the miss latency components in the system; our results hold qualitatively even in this configuration.

[Table omitted: ILP processor parameters (speed, fetch/retire rate, instruction window and memory queue sizes, functional units) and cache, memory, bus, and network parameters.]
Fig. 4. Base system parameters.

The applications, their input sizes, and the number of processors in the system for each application are summarized in Figure 5. A larger processor count is simulated for the applications that scale well (FFT and Water), while a smaller system is simulated for LU, MP3D, and Radix.

[Table omitted: input sizes and processor counts for FFT, LU, MP3D, Radix, and Water.]
Fig. 5. Applications, input sizes, and system sizes.

Hardware prefetching from the instruction window and speculative load execution: Figure 6 shows, for each application, the execution time for the straightforward implementations of sequential consistency (SC), processor consistency (PC), and release consistency (RC), normalized to the time for sequential consistency. We do not consider weak ordering separately, as our applications see very little synchronization overhead; therefore, we expect few differences between weak ordering and release consistency. Figure 7 shows, for each application, the execution times for the best implementation of the three consistency models, normalized to the time for the straightforward implementation of sequential consistency in Figure 6. Comparing the corresponding bars of Figures 6 and 7 shows the impact of the hardware optimizations.

The optimizations of hardware prefetching from the instruction window and speculative load execution provide significant benefits for the more constrained consistency models (sequential consistency and processor consistency), but do not significantly impact release consistency. The optimizations greatly narrow the performance gap between the various models for all applications. Nevertheless, processor consistency and release consistency continue to show sizeable performance benefits compared to sequential consistency for two of our five applications: in all of the configurations studied, release consistency shows a substantial reduction in execution time for Radix and MP3D. The differences in performance between the various models seen in Figure 7 stem primarily from limited hardware resources (mainly the instruction window), which limit the extent to which the optimizations can be exploited, and from the negative effects of early store prefetches, which lead to additional exposed latencies in the stricter models.

Increasing the instruction window size: The instruction window size largely determines the effectiveness of the optimizations of hardware prefetching from the instruction window and speculative load execution. Increasing the size of the instruction window and memory queue generally narrows the remaining performance gap between memory consistency models. For example, doubling the instruction window and memory queue sizes reduces the difference in execution time between the best SC and RC versions for MP3D and Radix, compared to the base configuration. However, in a few cases, larger increases in the instruction window and memory queue sizes lead to performance degradations in sequential consistency and processor consistency, widening the performance gap with release consistency. This degradation occurs because fundamental limitations of hardware prefetching and speculative loads, caused by the negative effects of early prefetches and of rollbacks with speculative loads, were exposed or exacerbated with larger instruction window sizes.

Cross-window software prefetching: Software prefetches were inserted in the applications by hand, following a state-of-the-art prefetching algorithm.

SELECTED FOR PROC OF THE IEEE SPECIAL ISSUE ON DISTRIBUTED SHAREDMEMORY NOT THE FINAL VERSION

100 100 100 100 100 100 100 | 82 79 79 79 77 80 | 67 68 58 57 60 | 54 44 43

40 |

20 | Normalized execution time

0 || SC PC RC SC PC RC SC PC RC SC PC RC SC PC RC SC PC RC

FFT LU MP3D Radix Water Mean

Fig Performance of straightforward implementations of memory consistency mo dels

100 |

80 | 76 73 68 66 63 63 60 58 60 60

| 57 56 60 53 47 45 43 43 43

40 |

20 | Normalized execution time

0 || SC PC RC SC PC RC SC PC RC SC PC RC SC PC RC SC PC RC

FFT LU MP3D Radix Water Mean

Fig Impact of hardware prefetching and sp eculative execution on consistency mo dels normalized to straightforward SC

rithm Software prefetching improves the p erformance of writes and thus generally have a greater inuence on

of all the consistency mo dels with and without the opti consistency mo dels where write latencies are not already

mizations of hardware prefetching from the instruction win completely overlapped SC and PC RC is less sensitive

dow and sp eculative load execution However MPD to these choices This pap er uses the b est practical con

and Radix continue to see signicant improvements with gurations for all mo dels ie writeback caches and MESI

…processor consistency and release consistency compared to sequential consistency; more than … reduction in execution time with release consistency. Limitations of software prefetching on multiprocessors with aggressive processors (due to late or early prefetches or resource contention) and memory latencies that are not amenable to prefetching (due to limitations of the software prefetching algorithm and contention at processor resources) are responsible for the remaining gap between the consistency models.

E. Other Optimizations

We have focused on optimizations implemented in current commercial systems at the processor level. Several other processor-centric optimizations have also been proposed and evaluated in the literature for hardware shared-memory systems. These include techniques to overlap both data latency (e.g., multithreading) and synchronization latency (e.g., multithreading and fuzzy and selective acquires). The simulation studies described above found that data latency is far more dominant than synchronization latency for the applications and systems studied; therefore, future techniques that target data latency appear more likely to benefit performance for this class of systems and applications.

System design decisions below the processor level also influence the performance of memory consistency models. For example, the cache write policy and cache coherence protocol can impact the relative performance of consistency implementations; these policies affect the overheads the implementations incur.

Finally, even if a processor supports a strict consistency model, the underlying system design decisions may result in a more relaxed model. For example, even if the processor supports sequential consistency, the memory controller may acknowledge a data store to the processor before it is actually globally visible. The net result is a system that is more aggressive than the consistency model supported by the processor, but somewhat more conservative than one in which the processor supports a relaxed consistency model. The Convex SPP system is an example of a commercial system that includes a weak ordering mode even though its processors (HP PA-RISC) support sequential consistency. The Stanford FLASH system also uses a similar approach.

F. Summary

In summary, recent hardware optimizations result in a significant narrowing of the performance gap between consistency models, virtually eliminating the gap for three of the five applications studied on our base system. Nevertheless, processor consistency and release consistency show significant performance benefits over sequential consistency for two of our applications on all the system configurations we studied. Thus, for the class of applications and systems studied here, the choice of the consistency model for future systems will depend on the importance of the remaining hardware performance gap to system vendors, as well as the impact of relaxed consistency models on compiler optimizations and programming language support for relaxed consistency models.
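To make the ordering at issue concrete, the recurring flag interaction (one processor writes two data locations and then sets a flag; another spins on the flag and reads the data) can be sketched with C11 release/acquire atomics. This is our own illustration of the ordering a release-consistent system must preserve; the systems evaluated in this paper enforce it in hardware rather than through C11, and the function names are ours.

```c
/*
 * Sketch: the flag interaction under release/acquire ordering, written
 * with C11 atomics and POSIX threads. Illustration only; the paper's
 * systems provide these orderings at the hardware level.
 */
#include <pthread.h>
#include <stdatomic.h>

static int data1, data2;        /* ordinary shared data */
static atomic_int flag;         /* synchronization flag */

static void *producer(void *arg) {
    (void)arg;
    data1 = 1;                  /* these two writes may be reordered */
    data2 = 2;                  /* with respect to each other ...    */
    /* ... but not past the releasing write of the flag */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* Spin until the flag is set, then read the data. */
static int consume(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                       /* acquire: later reads see producer's writes */
    return data1 + data2;
}

/* Runs one producer/consumer round; always returns 3. */
int run_flag_example(void) {
    pthread_t t;
    data1 = data2 = 0;
    atomic_store(&flag, 0);
    pthread_create(&t, NULL, producer, NULL);
    int sum = consume();
    pthread_join(&t, NULL);
    return sum;
}
```

The two data writes may be freely reordered with each other, which is exactly the flexibility release consistency exploits, yet the acquire on the flag still guarantees the consumer observes both new values.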


The rest of the paper discusses the latter two issues.

[Fig. …: Execution graph corresponding to Figure …(a). P1 performs Write, Data1; Write, Data2; Write, Flag and P2 performs Read, Flag; Read, Data1; Read, Data2, with po edges down each processor's column and co edges from each write to the corresponding read.]

IV. Compiler Advances

Sequential consistency and processor consistency restrict the direct application of several uniprocessor compile-time optimizations that may involve reordering or eliminating memory operations, including register allocation. In their pioneering work, Shasha and Snir developed a compile-time analysis to identify memory operations that can be reordered without violating sequential consistency.

To understand the analysis by Shasha and Snir, consider an execution of a parallel program represented as a graph with the following properties. The vertices of the graph are the memory operations in the execution. There is an edge from operation A to operation B in the graph if either (i) A precedes B in program order, or (ii) A precedes B in the execution and A and B conflict. (Recall that two operations conflict if they are from different processors, they access the same location, and at least one is a write.) Call the two categories of edges program order (po) and conflict order (co) edges, respectively, and call the graph the execution graph. Figure … illustrates an execution graph for an execution of the program in Figure … in which the data reads by processor P2 return the new values written by processor P1.

Shasha and Snir observed that if the execution graph is acyclic, then the execution appears sequentially consistent, since all memory operations can be totally ordered in an order consistent with program order. For example, the execution graph in Figure … is acyclic, and the depicted execution is sequentially consistent.

To ensure the execution graph is acyclic, it is sufficient to ensure that if there is a program order edge from A to B on a potential cycle in the graph, then A and B are not reordered. Operations on all other program order edges can be reordered without violating sequential consistency. For the program in Figure …, potential cycles in the execution graph are:

(1) Write Data1 -(po)-> Write Flag -(co)-> Read Flag -(po)-> Read Data1 -(co)-> Write Data1
(2) Write Data2 -(po)-> Write Flag -(co)-> Read Flag -(po)-> Read Data2 -(co)-> Write Data2
(3) Write Data1 -(po)-> Write Data2 -(po)-> Write Flag -(co)-> Read Flag -(po)-> Read Data1 -(co)-> Write Data1
(4) Write Data2 -(po)-> Write Flag -(co)-> Read Flag -(po)-> Read Data1 -(po)-> Read Data2 -(co)-> Write Data2

With the analysis so far, all program order edges shown in Figure … could be on a potential cycle, prohibiting meaningful optimizations. Shasha and Snir further formalized a minimal set of cycles, called critical cycles, and showed that it is sufficient to consider only program order edges on critical cycles. Operations that are not ordered by such critical program order edges can be reordered without violating sequential consistency. Applying their formalization (not discussed here further due to lack of space), only the first two cycles mentioned above for Figure … are critical. Therefore, program ordering needs to be maintained only with respect to accesses to Flag; the accesses to the data locations can be reordered with respect to each other.

At compile time, it is difficult to predict or analyze all executions and their execution graphs. Therefore, Shasha and Snir apply their analysis to a graph where the vertices are the static memory operations in the program, and a bidirectional conflict order edge is introduced for every pair of static memory operations that could conflict in any execution.

More recently, Krishnamurthy and Yelick showed that Shasha and Snir's algorithm for detecting critical cycles has exponential complexity in the number of processors. They developed an algorithm for SPMD programs with polynomial complexity. They also evaluated the effect of certain optimizations using their algorithm on array-based Split-C programs run on a CM-5 message-passing multiprocessor. They found reductions in execution time of … to …. They further improved their algorithm to reduce the impact of bidirectional conflict order edges, using some information from the programmer. This improvement reduced the number of critical cycles for which program order needs to be enforced, and gave additional reductions in execution time of … to … for Split-C programs on a CM-5 multiprocessor.

It is not yet known how effective the above analyses and optimizations are for general programs on native shared-memory machines, or how they compare to compile-time optimizations enabled by a relaxed consistency model such as release consistency.

V. Programming Languages

Until recently, most memory consistency models were developed primarily by computer architects for the hardware interface. Many common high-level programming languages did not include standard support for explicitly parallel shared-memory programs and did not explicitly consider the issue of the memory consistency model. Programmers typically relied on parallelizing compilers or used nonstandard, vendor-specific extensions to generate shared-memory parallelism and synchronization. Recent languages and programming environments, however, deal with shared-memory parallelism and the issue of memory consistency models more explicitly. Below, we discuss commonly used languages and environments that are currently supported by multiple system vendors, and conclude with a discussion on the implications for the compiler.
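Returning briefly to the analysis of Section IV, the potential-cycle test is mechanical enough to sketch. The encoding below is our own (node names and representation are not from the paper): it builds the static graph for the flag example, with transitive po edges per processor and bidirectional co edges between conflicting operations, and asks whether an edge can lie on a cycle. As noted above, without the critical-cycle refinement every po edge of the example qualifies.

```c
/*
 * Sketch: can an edge of the execution graph (po edges plus bidirectional
 * co edges) lie on a potential cycle? Our own encoding of the flag example
 * from the Shasha-Snir analysis; names are not from the paper.
 */
#include <string.h>

enum { WD1, WD2, WF, RF, RD1, RD2, N };  /* static ops of P1 and P2 */

static int adj[N][N];

static void add_po(int a, int b) { adj[a][b] = 1; }
static void add_co(int a, int b) { adj[a][b] = adj[b][a] = 1; } /* bidirectional */

/* Depth-first search: is there a path from s to t? */
static int reach(int s, int t, int *seen) {
    if (s == t) return 1;
    seen[s] = 1;
    for (int k = 0; k < N; k++)
        if (adj[s][k] && !seen[k] && reach(k, t, seen)) return 1;
    return 0;
}

/* An edge (a,b) can be on a cycle iff some path leads back from b to a. */
int edge_on_cycle(int a, int b) {
    int seen[N];
    memset(seen, 0, sizeof seen);
    return adj[a][b] && reach(b, a, seen);
}

void build_flag_example(void) {
    memset(adj, 0, sizeof adj);
    /* P1: Write Data1; Write Data2; Write Flag (program order, transitive) */
    add_po(WD1, WD2); add_po(WD2, WF); add_po(WD1, WF);
    /* P2: Read Flag; Read Data1; Read Data2 */
    add_po(RF, RD1); add_po(RD1, RD2); add_po(RF, RD2);
    /* Conflict order: conflicting pairs, direction unknown at compile time */
    add_co(WD1, RD1); add_co(WD2, RD2); add_co(WF, RF);
}
```

Running the check over the example confirms the text: every po edge (including Write Data1 to Write Data2 and Read Data1 to Read Data2) lies on some potential cycle, which is why the refinement to critical cycles is needed before any reordering is permitted.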


A. POSIX Threads

POSIX is an IEEE standard that includes a threads interface for the C language. It specifies several synchronization functions (e.g., functions to implement mutual exclusion locks and condition variables). For memory consistency, POSIX requires applications to use the provided synchronization functions to separate conflicting accesses to the same memory location. Such applications can rely on sequentially consistent results. Applications that synchronize in other ways (as in the examples in Figure …) or that include races among user data structures (e.g., asynchronous algorithms) may get unexpected results.

The POSIX threads model is like the data-race-free/properly-labeled model described in Section II-B in that it requires synchronization to be explicit to ensure reliable results. However, it does not provide the full flexibility of the data-race-free/properly-labeled model, since it restricts synchronization to only the provided synchronization functions. The provided synchronization functions may be overkill for certain cases (e.g., Figure …(a)), potentially resulting in a loss of performance. The data-race-free/properly-labeled model conceptually permits any memory operation (including races in asynchronous programs) to be distinguished as synchronization.

B. The volatile Declaration

The ANSI C, C++, and Java languages support the volatile declaration to suppress certain optimizations. A key motivation for this declaration was to inhibit optimizations in codes such as device drivers and interrupt handlers. The declaration is often also used for enforcing data consistency in shared-memory programs.

The Java language specification provides the most precise specification for the semantics of the volatile declaration. Every access to a volatile variable must result in a memory access; i.e., values of such variables cannot be cached in registers, and accesses to such variables cannot be eliminated through optimizations such as common subexpression elimination. Furthermore, program-ordered accesses to volatile variables cannot be reordered with respect to each other. There is no restriction on the ordering between an access to a volatile variable and an access to a non-volatile variable. For example, for the program in Figure …(a), the variables Data1, Data2, and Flag must all be declared volatile to ensure that the reads of Data1 and Data2 will return the new values written in the program. This, however, unnecessarily precludes optimizations on accesses to Data1 and Data2.

The above example illustrates that the current specification is not restrictive enough to use volatile as the sole mechanism for enforcing consistency while enabling common optimizations. An additional constraint of preserving program order between non-volatile and volatile accesses would have provided a model similar to weak ordering and release consistency. Java provides another mechanism to overcome this deficiency, as discussed next.

C. The Java Programming Language

This section discusses the high-level Java programming language, as opposed to the Java virtual machine. The formal memory consistency model for Java is defined as a set of rules on an abstract representation of the system. The abstraction consists of a main memory containing the master copy of all variables, and a working memory per thread. A thread's instructions result in use and/or assign operations on the values in its working memory. The use and assign operations, and the lock and unlock operations described below, occur in program order. The underlying implementation transfers data between the working and main memories, subject to several ordering rules. The current Java model is complex and hard to interpret; only the key ordering rules are informally described below.

A Java program must use the volatile or synchronized keywords to enforce memory ordering. Volatile is common to C, C++, and Java and is discussed in Section V-B. The synchronized keyword results in a lock access on the associated object before executing the corresponding statement or method, and an unlock access afterward. The key ordering rules are: (i) a lock access must appear to flush each variable V from its thread's working memory before the thread's next use of V, unless the thread makes an assignment to V between the lock and the next use of V; (ii) before an unlock access, all values previously assigned by that thread must be copied to main memory. Thus, a thread's load following a lock will either get a value assigned by that thread after the lock, or a value that is at least as recent as in the master copy at the time of the lock. After an unlock, all the values from the thread's preceding assignments will be in the master copy.

Referring to Figure …(a), the synchronized keyword need be applied only to the accesses to Flag to ensure that the reads of the data locations will return the correct values. In contrast to the use of the volatile declaration discussed in Section V-B, the use of synchronized allows the writes and reads of Data1 and Data2 to be reordered with respect to each other without affecting the result of the execution. To prevent excessive spinning on locks, Java also provides wait and notify primitives for producer-consumer synchronization. Nevertheless, these primitives also require lock accesses for effective use, which appears to be overkill for interactions such as in Figure …(a).

The ordering rules for Java are similar to those for release consistency. (There are two subtle differences between Java and release consistency. First, in a Java thread, the presence of a lock between a write and a read to the same location requires that the write appear to be serialized at main memory before the read, unless there is another write to the same location between the lock and the read. Release consistency does not require such a serialization. Second, unlike release consistency, Java does not allow program-ordered reads to the same location to be reordered. As mentioned earlier, this paper does not discuss ordering constraints between memory operations to the same location.) Java is also accompanied by a programming style recommendation similar to the data-race-free/properly-labeled model. It states that if a variable is ever to be assigned by one thread and used or assigned by another, then all accesses to that variable should be enclosed in synchronized methods or synchronized statements.


D. OpenMP

OpenMP is a recently proposed application programming interface for portable shared-memory programs. It provides relatively high-level directives to specify parallel tasks or loops, and various synchronization constructs (e.g., critical sections, atomic updates, and barriers). For memory consistency, it supports a FLUSH directive that must be used to ensure all preceding writes of a processor are seen by the memory system and subsequent reads return new values. A FLUSH is also implicitly associated with many of the synchronization constructs. The description of FLUSH in the current specification is fairly informal (e.g., it does not explicitly state the impact of a FLUSH on preceding reads or subsequent writes), but it appears to provide a model similar to weak ordering and release consistency, and has semantics similar to memory barrier or fence instructions supported by most current processors. Thus, to obtain the expected result for the example in Figure …(a), the application needs to simply include FLUSHes with the accesses to Flag. Additionally, OpenMP allows a FLUSH to be explicitly applied to only a specified list of variables.

E. Message Passing Interface (MPI)

The issue of the memory consistency model is generally not relevant to traditional message-passing applications, because synchronization is implicit with every data transfer message. The new Message Passing Interface (MPI-2) standard, however, also includes support for one-sided communication, which has similarities with shared memory and must consider the consistency model. We briefly discuss relevant aspects of MPI below.

The key primitives for one-sided communication are get and put. These allow a processor to load or store a data buffer from a remote processor's memory without any corresponding send or receive by the remote processor's program. Special synchronization mechanisms are provided to ensure that the loads return, and the stores deposit, appropriate values at the appropriate time. Ordering rules analogous to release consistency are enforced; e.g., the user cannot rely on the completion of gets and puts until appropriate synchronization has occurred.

F. High Performance Fortran (HPF)

The High Performance Fortran (HPF) language provides various constructs to expose data parallelism to the system, but provides sequential semantics to the programmer. Each HPF data parallel construct (including Fortran array statements, the FORALL statement, and HPF intrinsics) has equivalent sequential semantics. Therefore, HPF programmers need not be concerned with the traditional notion of a shared-memory consistency model. One distinct aspect of the sequential model provided by HPF is that the data parallel array statements and the FORALL statement have copy-in/copy-out semantics, where all locations on the right-hand side of an assignment are read before any locations on the left-hand side are assigned. This provides straightforward deterministic sequential semantics, even for a statement that may appear to have dependences among its various computations.

HPF also provides EXTRINSIC procedures to invoke other parallelism paradigms not directly supported in HPF (e.g., explicitly parallel code for branch-and-bound parallelism). Such a procedure may have an independent memory consistency model; the language requires all processors to appear to enter and exit such a procedure together.

G. Implications for the Compiler

The consistency model supported by the programming language influences the optimizations a compiler can perform. From the compiler's viewpoint, most recent programming environments support models similar to data-race-free/properly-labeled, allowing significant flexibility. For example, with POSIX, Java, and OpenMP, the compiler can reorder any pair of operations to different locations between two synchronization/volatile/FLUSH operations, as well as allocate memory in registers in this interval.

In addition, the compiler has the responsibility to ensure that the parallel program it generates runs correctly for the memory consistency model supported by the hardware. This is fairly straightforward for the languages and environments discussed in this section and for current commercial systems. For example, most current processors and systems support memory barriers or fences. For such systems, the special synchronization constructs in the above environments must include fences at all points in the construct where a memory operation may be involved in a race. Typically, these fences would be inserted by library writers for the synchronization constructs, and the compiler simply must ensure that it maintains these fences and their orderings with respect to other memory operations in the final program. Similarly, the FLUSH directive of OpenMP must be replaced by a hardware fence. For HPF, the parallelizing compiler itself inserts synchronization, and must insert hardware fences at these points. A more comprehensive description of how to port data-race-free/properly-labeled programs to common hardware memory consistency models appears in ….

VI. Concluding Remarks

This paper provides an overview of recent advances in memory consistency models, covering hardware, compilers, and programming environments. Recent hardware optimizations have significantly narrowed the hardware performance gap between various consistency models, virtually eliminating the gap for three of the five applications studied on our base system. Nevertheless, processor consistency and release consistency show significant benefits over sequential consistency for two of our applications on all the system configurations we studied.


Compiler optimizations relevant to consistency models have not been as extensively studied as hardware optimizations. Recent work has shown that it is possible to implement reordering optimizations in the compiler without violating sequential consistency. However, these optimizations have been evaluated for restricted programs and systems, and have not been compared with the benefits of relaxed models. The importance of relaxed models to compilers remains one of the key open questions in this area.

Recent programming languages and environments for explicitly parallel shared-memory programs (e.g., POSIX threads, Java, and OpenMP) support relaxed consistency models. For many programs, some of these languages (e.g., POSIX threads and Java) require that, for expected results, system-provided synchronization constructs be used to prevent data races. While such an approach encourages good programming practice, the provided constructs may be overkill or inappropriate for some cases, leading to a performance loss compared to more flexible approaches. OpenMP provides more flexibility, allowing synchronization through arbitrary reads and writes; data consistency is achieved by using the FLUSH directive at the appropriate synchronization points in the program.

High-level languages with relaxed models eliminate the programmability advantage of supporting sequential consistency in the compiler or hardware, from the perspective of programmers of such languages. However, hardware and compiler designers must consider the programmability/performance tradeoff for programmers of other languages (either high-level or assembly level) before deciding the consistency model. This tradeoff will depend on results of future studies on compiler optimizations and the importance of the remaining hardware performance gap to system vendors. Finally, the memory consistency model must be chosen to last beyond current implementations, since it is an architectural specification visible to the programmer.

VII. Acknowledgments

We are grateful to Bill Pugh for correcting the description of the Java memory model in an earlier version of this paper. We thank Vikram Adve and Kourosh Gharachorloo for valuable comments on earlier versions of this paper. We also thank Vikram Adve and Rohit Chandra for clarifications on HPF and OpenMP, respectively.

References

[1] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. on Computers.
[2] S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," IEEE Computer.
[3] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," in Proc. Intl. Symp. on Computer Architecture.
[4] M. Dubois, C. Scheurich, and F. A. Briggs, "Memory Access Buffering in Multiprocessors," in Proc. Intl. Symp. on Computer Architecture.
[5] S. V. Adve and M. D. Hill, "Weak Ordering - A New Definition," in Proc. Intl. Symp. on Computer Architecture.
[6] D. Shasha and M. Snir, "Efficient and Correct Execution of Parallel Programs that Share Memory," ACM Trans. on Programming Languages and Systems.
[7] J. E. Smith and A. R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," in Proc. Intl. Symp. on Computer Architecture.
[8] K. Gharachorloo, A. Gupta, and J. Hennessy, "Two Techniques to Enhance the Performance of Memory Consistency Models," in Proc. Intl. Conf. on Parallel Processing.
[9] T. Mowry, Tolerating Latency through Software-Controlled Data Prefetching, PhD thesis, Stanford University.
[10] T.-F. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers.
[11] K. Gharachorloo, A. Gupta, and J. Hennessy, "Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors," in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems.
[12] K. Gharachorloo, A. Gupta, and J. Hennessy, "Hiding Memory Latency Using Dynamic Scheduling in Shared-Memory Multiprocessors," in Proc. Intl. Symp. on Computer Architecture.
[13] R. N. Zucker and J.-L. Baer, "A Performance Study of Memory Consistency Models," in Proc. Intl. Symp. on Computer Architecture.
[14] V. S. Pai, P. Ranganathan, S. V. Adve, and T. Harton, "An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors," in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems.
[15] P. Ranganathan, V. S. Pai, H. Abdel-Shafi, and S. V. Adve, "The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems," in Proc. Intl. Symp. on Computer Architecture.
[16] P. Ranganathan, V. S. Pai, and S. V. Adve, "Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap between Memory Consistency Models," in Proc. Symp. on Parallel Algorithms and Architectures.
[17] J. P. Singh, W.-D. Weber, and A. Gupta, "SPLASH: Stanford Parallel Applications for Shared-Memory," Computer Architecture News.
[18] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proc. Intl. Symp. on Computer Architecture.
[19] V. S. Pai, P. Ranganathan, and S. V. Adve, "RSIM: A Simulator for Shared-Memory Multiprocessor and Uniprocessor Systems that Exploit ILP," in Proc. Workshop on Computer Architecture Education.
[20] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," in Proc. Intl. Symp. on Computer Architecture.
[21] A. Krishnamurthy and K. Yelick, "Optimizing Parallel SPMD Programs," in Proc. Workshop on Languages and Compilers for Parallel Computing.
[22] A. Krishnamurthy and K. Yelick, "Optimizing Parallel Programs with Explicit Synchronization," in Proc. SIGPLAN Conf. on Programming Language Design and Implementation.
[23] Information Technology, Portable Operating System Interface (POSIX), Part 1: System Application Program Interface (API) [C Language], IEEE; ISO/IEC and ANSI/IEEE standard.
[24] J. Gosling, B. Joy, and G. Steele, The Java Language Specification, The Java Series, Addison-Wesley.
[25] OpenMP Fortran Application Program Interface.
[26] MPI-2: Extensions to the Message-Passing Interface.
[27] C. Koelbel et al., The High Performance Fortran Handbook, The MIT Press.
[28] S. V. Adve, Designing Memory Consistency Models for Shared-Memory Multiprocessors, PhD thesis, Computer Sciences Department, University of Wisconsin-Madison.
[29] K. Gharachorloo, Memory Consistency Models for Shared-Memory Multiprocessors, PhD thesis, Stanford University.