Translating Between Itanium and Sparc Memory Consistency Models

Lisa Higham(1) LillAnne Jackson(1)(2) [email protected] [email protected]

(1) Department of Computer Science, The University of Calgary, Calgary, Canada   (2) Department of Computer Science, The University of Victoria, Victoria, Canada

Categories and Subject Descriptors: B.3.2 [HARDWARE]: Memory Structures—Design styles: Shared memory; C.1.2 [PROCESSOR ARCHITECTURES]: Multiple Data Stream Architectures (Multiprocessors)—Interconnection architectures

General Terms: Theory, Verification.

Keywords: Sparc, Itanium, Memory consistency models, multiprocessors, program transformations.

ABSTRACT

Our general goal is to port programs from one multiprocessor architecture to another, while ensuring that each program's semantics remains unchanged. This paper addresses a subset of the problem by determining the relationships between the memory consistency models of three Sparc architectures (TSO, PSO and RMO) and that of the Itanium architecture. First we consider Itanium programs that are constrained to have only one load-type of instruction in {load, load acquire} and one store-type of instruction in {store, store release}. We prove that in three out of four cases, the set of computations of any such program is exactly the set of computations of the "same" program (using only load and store) on one Sparc architecture. In the remaining case the set is nested between two natural sets of Sparc computations.

Real Itanium programs, however, use a mixture of load, load acquire, store, store release and memory fence instructions, and real Sparc programs use a variety of barrier instructions as well as load and store instructions. We next show that any mixture of the load-types or the store-types (in the case of Itanium) or any barrier instructions (in the case of Sparc) completely destroys the clean and simple similarities between the sets of computations of these systems. Thus (even without considering the additional complications due to register and control dependencies) transforming these more general programs in either direction requires constraining the transformed program substantially more than the original program in order to ensure that no erroneous computations can arise.

1. INTRODUCTION

Our general goal is to translate multiprocessor programs designed for one architecture to programs for another architecture, while ensuring that each program's semantics remains unchanged. This is challenging because the memory access and ordering instructions of various architectures differ significantly. Consequently, the set of possible outcomes of a "fixed" multiprogram can be very different when run on different multiprocessors. Each multiprocessor architecture has a set of rules, called its memory consistency model, that specifies the ordering constraints and the allowable returned values of instructions that access the shared memory. These rules, and the vocabulary used in their specification, vary considerably between architectures, making comparison difficult. Additionally, each architecture defines a set of instructions that further constrain the allowable orderings. For example, Sparc architectures use memory barriers, which order certain instructions before and after the barrier, while Itanium architectures provide instructions with acquire (respectively, release) semantics, which only ensure that instructions after (respectively, before) them remain so ordered.

This paper compares the memory consistency models of the Sparc architecture by Sun [15] and the Itanium architecture by INTEL [11]. First we restrict the memory access and ordering instructions to a subset that is common to each class of machine and derive (and prove) the precise relationship between the sets of computations that can arise from running the same multiprogram on each. Next, system-specific memory ordering instructions are included and we derive the relationship between these more complicated sets of computations on each machine. This two-step approach has proven a useful technique for understanding how different architectures behave.
Sections 3 through 5 restrict Sparc multiprograms to have only load and store memory access instructions that are constrained by one of the RMO, PSO or TSO memory consistency models. Itanium multiprograms are restricted to use only one type of memory access instruction in {load, load acquire} and one type in {store, store release}. We prove that the sets of possible computations on each system for any multiprogram so restricted are closely related and in many cases identical. For example, if a multiprogram uses just load and store memory access instructions, then the set of computations that the multiprogram can produce on a Sparc RMO system is equal to the set of computations it can produce on an Itanium system. The same multiprogram, when run on a Sparc TSO system, can produce exactly the same set of computations as it would produce on an Itanium system after replacing each load by a load acquire and each store by a store release.

These simple relationships hold because different strengths of load and/or store instructions are not combined in the same program. In Section 6, we add the memory barrier instructions of Sparc and allow Itanium multiprograms to contain more combinations of load, load acquire, store and store release instructions. Perhaps surprisingly, any mixture of the load-types or the store-types (in the case of Itanium) or any barrier instructions (in the case of Sparc) completely destroys the clean and simple similarities between the sets of computations of these systems. In fact, we present impossibilities when dealing with the Sparc MEMBAR #StoreLoad instruction and with any additional Itanium memory instructions.

The scope of this paper does not include atomic read-modify-write instructions (i.e., atomic memory transactions on Sparc and semaphores on Itanium). Furthermore, register and control dependencies such as branching are not considered. Capturing these dependency orderings is a significant task that we have addressed elsewhere [9] using different techniques. Since the techniques are orthogonal to those of this paper, they could be combined to give a complete memory model.

Previous work [8, 10] has compared the Sparc models to standard memory consistency models defined in the literature [13, 14, 1, 6, 2, 3]. Gopalakrishnan and colleagues [4, 16] have worked on a formal specification of the memory consistency of Itanium for verification purposes. Previous work [12] has compared memory consistency models on distributed shared memory systems. Gharachorloo [5] (chapter 4) defines porting relationships to and from Sparc and other shared memory architectures, but does not consider Itanium.

2. NOTATION AND MODELS

2.1 Sets, sequences, and orders

Let (B, →po) be a partial order. The notation a <_po b denotes that (a,b) ∈ (B, →po). The partial order (B, →po′) extends (B, →po) if (a,b) ∈ (B, →po) implies (a,b) ∈ (B, →po′). Given a total order (B, →TO) on a finite set B, there is a unique sequence, denoted Sequence(TO), of the elements of B that satisfies: a precedes b in Sequence(TO) if and only if (a,b) ∈ (B, →TO). Since the total orders we consider are on finite sets, we will represent total orders interchangeably as a total order relation TO ⊂ B × B and as the corresponding sequence Sequence(TO).

The subset of a set B consisting of all the elements that satisfy predicate Q is denoted by B|Q. Similarly, for a sequence S, S|Q denotes the subsequence of S consisting of those elements that satisfy Q. When the meaning is clear, we abuse this notation slightly when Q is not a predicate but is naturally associated with one. Int_S(a,b) denotes the subsequence of S strictly between two of its elements a and b, where a <_S b.

2.2 Multiprocesses, computations, memory consistency

As each process in a multiprocess system executes, it issues a sequence of instruction invocations on shared memory objects. We begin with a simplified setting in which we assume the shared memory consists of only shared variables, and each instruction invocation is either of the form store_p(x,v), meaning that process p writes a value v to the shared variable x, or of the form load_p(x), meaning that process p reads a value from shared variable x. For this paper it suffices to model each individual process p as a sequence of these store and load instruction invocations; we call such a sequence an individual program. A multiprogram is defined to be a finite set of these individual programs.

An instruction is an instruction invocation completed with a response. Here the response of a store instruction invocation is an acknowledgment and is ignored. The response of a load invocation is the value returned by the invocation. A (multiprocess) computation of a multiprogram P is created from P by changing each load instruction invocation, load_p(x), to ν ← load_p(x), where ν is either the initial value of x or a value stored to x by some store(x,·) in the multiprogram. A "·" in place of a variable or value is used to denote the set of all the instruction invocations or instructions that match the given pattern. For example, store_p(x,·) represents the set of all store instructions by program p to the shared variable x, or, depending on context, a member of that set.

Let I(C) be all the instructions of a computation C. The sequence of instruction invocations of each individual program induces a partial order (I(C), →prog), called program order, on I(C), defined naturally by i →prog j if both i and j are instructions of the same program, say p, and the invocation of i precedes the invocation of j in p's individual program.

Notice that the definition of a computation permits the value returned by each load(x) instruction invocation to be arbitrarily chosen from the set of values stored to x by the multiprogram. In a real machine, the values that might actually be returned are substantially further constrained by the machine architecture, which determines the way in which the processes communicate and the way shared memory is implemented. A memory consistency model captures these constraints by specifying a set of additional requirements that computations must satisfy. We use C(P, MC) to denote the set of all computations of multiprogram P that satisfy the memory consistency model MC. In the next two subsections, we define memory consistency models that capture the computations that can arise from multiprograms run on systems based on Sparc and Itanium architectures.

We simplify the description of a memory consistency model by assuming that each store instruction invocation has a distinct value. Although it is technically straightforward to remove this assumption, without it the description of the memory model is messy and its properties are consequently obscured.
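To illustrate these definitions, consider the multiprogram P consisting of the two individual programs

    p:  store_p(x,1); load_p(y)
    q:  store_q(y,2); load_q(x)

with both shared variables initially 0. A computation of P completes each load with a returned value: for instance, 0 ← load_p(y) together with 1 ← load_q(x) is one computation of P, and 2 ← load_p(y) together with 0 ← load_q(x) is another. The definition alone even permits the computation in which both loads return 0; which of these computations a particular machine can actually produce is determined by its memory consistency model, as captured by C(P, MC).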
2.3 Sparc-like consistency models

Until Section 6, P denotes a multiprogram consisting only of load and store instruction invocations, C is a computation of P, and I(C) is the set of all the instructions in C. In keeping with the subset notation introduced in Subsection 2.1, …

2.4 Itanium-like consistency models

The Itanium-like memory consistency models capture the INTEL Itanium Processor Family Memory Ordering [11] when applied to individual programs that use one type of load instruction in {load, load acquire} and one type of store instruction in {store, store release}. Specifically, a computation of P is Itanium if and only if it can arise from executing P on an Itanium system where P uses only load and store instructions; a computation of P is Acquire Itanium if and only if it can arise from executing P on an Itanium system where P uses only load acquire and store instructions; a computation of P is Release Itanium if and only if it can arise from executing P on an Itanium system where P uses only load and store release instructions; and a computation of P is Acquire Release Itanium if and only if it can arise from executing P on an Itanium system where P uses only load acquire and store release instructions.

3. SPARC-LIKE COMPUTATIONS SATISFY ITANIUM-LIKE CONSISTENCY

This section establishes that each of the Sparc-like memory consistency models satisfies the memory consistency conditions of some Itanium-like model. The proofs are all constructive. Given a computation C that satisfies a Sparc-like memory consistency model, the Sparc-verifying memory order M for C is transformed into an order T(M) as follows, where New(M) is initially M:

    For every p ∈ P do
        Let l1, l2, ..., lk be the forward load instructions by p in New(M), listed in program order.
        Let s_j be the store instruction causally related to l_j, for each j.
        For j from 1 to k do
            If there is a load instruction in Int_New(M)(l_j, s_j) that precedes l_j in program order, let i be the latest such instruction.
            If i exists, New(M) ← (New(M) with l_j moved to immediately follow i).
    T(M) ← New(M).

Observe that T(M) is the same regardless of the order in which the individual programs are selected.

OBSERVATION 3.2. The order of all backward load and store instructions is the same in memory order M and in T(M).

CLAIM 3.3. Each load instruction has the same type in memory order T(M) as it has in M.

Proof: By Observation 3.2, backward load instructions in M are also backward load instructions in T(M). A forward load might move forward from its position in M to a new position in T(M), but this new position is still earlier in T(M) than its causally related store instruction. Thus, each forward load instruction in M remains a forward load in T(M).

CLAIM 3.4. If a memory order M is Sparc-valid for a computation, then all forward load instructions are in program order in T(M)|x for each x.

Proof: For a pair of forward load instructions l1 and l2 in T(M)|x that satisfy l1 →prog l2, let s1 be the store instruction causally related to l1 and s2 the store instruction causally related to l2. Assume l2 <_T(M) l1. By the construction of T(M), l1 ∉ Int_T(M)(l2, s2), so l2 <_T(M) s2 <_T(M) l1 <_T(M) s1. Thus, by Observation 3.2, s2 <_M s1, and by Claim 3.3, l2 <_M s2. This contradicts the Sparc-validity of M, since l2 should be causally related to s1 rather than to s2.

CLAIM 3.5. If a memory order M is Sparc-valid for a computation, then no forward interval in T(M)|p|x can contain a backward load.

Proof: Assume a forward interval Int_T(M)(l,s)|p|x contains a backward load lb. Thus, l <_T(M) lb <_T(M) s. By Observation 3.2, lb <_M s …

LEMMA 3.9. If a memory order M for a computation C extends (I(C), →p.l.) then T(M) extends (I(C), →p.l.).

Proof: Consider any load l and any other instruction i where l →prog i. Because M extends (I(C), →p.l.), all load instructions are in program order in M. Thus, no instructions are moved in the creation of T(M). So M = T(M), implying that T(M) extends (I(C), →p.l.).

LEMMA 3.10. If a memory order M is Sparc-valid for a computation C and extends (I(C), →f.s.) then T(M) extends (I(C), →f.s.).

Proof: Let i be either a store or a load and let s be a store in M where i →prog s. Since M extends (I(C), →f.s.), i <_M s. If i is a backward load or a store then, by Observation 3.2, i <_T(M) s. Otherwise, let i be a forward load in M and s_i its causally related store. Since M is valid, s_i →prog i, so s_i →prog s; hence s_i <_M s and, by Observation 3.2, s_i <_T(M) s. Since i remains before s_i in T(M), i <_T(M) s.
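For concreteness, the construction of T(M) above can be sketched in Python. This is a minimal sketch and is not taken from [7]: it assumes each instruction is a record carrying its process, its kind, its position in its individual program, a flag marking forward loads, and, for a forward load, a reference to its causally related store; it also reads "latest" as latest in the current order New(M).

    # Minimal sketch of the construction of T(M); the instruction representation
    # below is an assumption made for illustration, and instructions are assumed
    # pairwise distinct (consistent with the distinct-value simplification).
    def T(M):
        new = list(M)                                   # New(M) starts as a copy of M
        for p in {ins['proc'] for ins in M}:
            fwd = [ins for ins in new
                   if ins['proc'] == p and ins['kind'] == 'load' and ins.get('forward')]
            fwd.sort(key=lambda ins: ins['pos'])        # l1, ..., lk in program order
            for l in fwd:
                s = l['causal_store']
                lo, hi = new.index(l), new.index(s)
                interval = new[lo + 1:hi]               # Int_New(M)(l, s)
                blockers = [ins for ins in interval
                            if ins['proc'] == p and ins['kind'] == 'load'
                            and ins['pos'] < l['pos']]  # loads preceding l in program order
                if blockers:
                    i = blockers[-1]                    # the latest such load in New(M)
                    new.remove(l)
                    new.insert(new.index(i) + 1, l)     # move l to immediately follow i
        return new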

Figure 2: Comparison of sets of computations under Sparc-like and Itanium-like Memory Consistency Models. (The diagram is not reproduced; its nodes include C(P, RMO), C(P, Preceding Load RMO), C(P, Following Store RMO), C(P, Itanium), C(P, Acquire Itanium) and C(P, Release Itanium).)

Let LLLS(P) (respectively, LSSS(P)) denote the multiprogram that results from replacing each load (respectively, store) instruction in P with the sequence of instructions load, MEMBAR#LL, MEMBAR#LS (respectively, MEMBAR#LS, MEMBAR#SS, store). It follows from the definitions in Subsection 2.3 that C(P, Preceding Load RMO) = C(LLLS(P), RMO) = C(P, PSO) and, similarly, C(P, Following Store RMO) = C(LSSS(P), RMO) and C(P, Preceding Load Following Store RMO) = C(LLLS(P), Following Store RMO) = C(LSSS(LLLS(P)), RMO) = C(LSSS(P), Preceding Load RMO) = C(P, TSO).

These relations clarify how to port between the three Sparc architectures. Any program designed for a weak model will be correct on a stronger model without changes. Furthermore, any explicit barriers can be removed provided they are implied by the model to which the program is being ported. For example, any RMO program that uses only load, store, MEMBAR#LL and MEMBAR#LS will remain correct when these memory barriers are removed and it is run on a Sparc PSO machine. To port a program designed for a strong Sparc model to a weaker one, the implicit barrier instructions of the stronger model that are not implied by the weaker one must be inserted. For example, a TSO program can be ported to PSO by inserting MEMBAR#LS and MEMBAR#SS before every store.
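As a concrete illustration of such a porting step, the following is a minimal Python sketch (ours, not from the paper) under the assumption that instructions are encoded as tuples such as ('store', x, v) and ('load', x):

    # Port a TSO individual program to PSO by inserting MEMBAR#LS and MEMBAR#SS
    # before every store; the tuple encoding is an assumption for illustration.
    def port_tso_to_pso(individual_program):
        ported = []
        for ins in individual_program:
            if ins[0] == 'store':
                ported.append(('MEMBAR', '#LS'))
                ported.append(('MEMBAR', '#SS'))
            ported.append(ins)
        return ported

    # Example: port_tso_to_pso([('store', 'x', 1), ('load', 'y')]) yields
    # [('MEMBAR', '#LS'), ('MEMBAR', '#SS'), ('store', 'x', 1), ('load', 'y')]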
As well as the weak load and store instructions, Itanium systems support two strong instructions, called load acquire and store release, defined as follows:

• load acquire is equivalent to a load instruction that also ensures the partial order {(o1,o2) | ∃ i1, i2, p where o1 ∼c i1 = load acquire, o2 ∼c i2, and i1, i2 ∈ I(C)|p and i1 →prog i2}.

• store release is equivalent to a store instruction that also ensures the partial order {(o1,o2) | ∃ i1, i2, p where o1 ∼c i1, o2 ∼c i2 = store release, and i1, i2 ∈ I(C)|p and i1 →prog i2 and (o1,o2) ∈ {(LV(i1),LV(i2)), (R(i1),LV(i2)), (RVq(i1),RVq(i2)) for any q ∈ P}}; plus all store release instructions are contiguous.

As well, Itanium systems support a memory fence instruction, which ensures the partial order {(o1,o2) | ∃ i1, i2, i_fen, p where o1 ∼c i1, o2 ∼c i2 and i_fen = memory fence and i1, i2, i_fen ∈ I(C)|p and i1 →prog i_fen →prog i2}. It is clear that the Itanium memory fence instruction and the Sparc MEMBAR#LLLSSSSL instruction are equivalent. Thus, systems that contain only load, store and the appropriate one of these instructions are equivalent. In the remainder of this work we consider only the effect of the load acquire and store release Itanium instructions. Define the following sets of multiprograms that use these additional instructions.

P^It_{acq,rel} is the set of all multiprograms consisting of sequences of load, load acquire, store and store release instructions.

P^It_{acq} is the set of all multiprograms consisting of sequences of load, store and load acquire instructions.

P^It_{rel} is the set of all multiprograms consisting of sequences of load, store and store release instructions.

Let P be a multiprogram with only load and store instructions. Let ACQ(P) (respectively, REL(P)) denote the multiprogram that results from replacing each load (respectively, each store) instruction in P with a load acquire (respectively, store release) instruction. It follows directly from the definitions in Subsection 2.4 that C(P, Acquire Itanium) = C(ACQ(P), Itanium) and, similarly, C(P, Release Itanium) = C(REL(P), Itanium) and C(P, Acquire Release Itanium) = C(ACQ(P), Release Itanium) = C(REL(ACQ(P)), Itanium) = C(REL(P), Acquire Itanium).

Now consider a multiprogram P ∈ P^Sp_{LS,SS}. Clearly, C(LSSS(P), RMO) ⊆ C(P, RMO), since in LSSS(P) each store-type instruction is preceded by a MEMBAR#LS and a MEMBAR#SS, while in P the effect of only some of these MEMBAR instructions is present. From the results in Section 5 (and the summary in Figure 2), C(P, Release Itanium) ⊆ C(LSSS(P), RMO). Thus, P can be transformed to the Itanium architecture: it requires simply removing all the MEMBAR instructions and replacing each store by a store release.

More generally, to transform a Sparc program to a program for the Itanium architecture, start at the node that represents the Sparc architecture for which the program was designed, travel up the Sparc side until each explicit MEMBAR used by the program is included in the implicit MEMBAR types of the node, and remove all barrier instructions. Then travel backward along arrows to the Itanium side. If the resulting node has Acquire (respectively, Release) semantics, replace each load (respectively, store) instruction with the corresponding load acquire (respectively, store release) instruction.

Now consider a multiprogram P ∈ P^It_{acq}. Clearly, C(ACQ(P), Itanium) ⊆ C(P, Itanium), since in ACQ(P) all load-type instructions are constrained to satisfy load-acquire semantics, while in P only some are so constrained. Also, according to the results in Section 5 (and the summary in Figure 2), C(P, Preceding Load RMO) ⊆ C(ACQ(P), Itanium). Thus, P can be transformed trivially to the Sparc PSO architecture; no changes need to be made to the code to accommodate the different memory consistency model. As another example, using similar reasoning, any multiprogram P ∈ P^It_{acq,rel} can be trivially transformed to the Sparc TSO architecture.

In general, Figure 2 summarizes simple ways to transform any Itanium multiprogram to one of the three Sparc architectures. On the Itanium side, the node C(P, Acquire Release Itanium) covers any P ∈ P^It_{acq,rel}, C(P, Acquire Itanium) covers any P ∈ P^It_{acq}, C(P, Release Itanium) covers any P ∈ P^It_{rel}, and C(P, Itanium) covers P only if it uses only load and store instructions. Start at the weakest (lowest in Figure 2) node that covers the given multiprogram, replace each strong instruction with the corresponding weak one, and travel backward along arrows to the Sparc side. The node arrived at has memory consistency constraints for which the program will be correct without any additional MEMBAR instructions.

6.2 Efficient Transformations?

The transforming techniques of the previous subsection can impose many more additional ordering constraints on the computations of the transformed multiprogram than it had originally. This in turn reduces the possible efficiency of the system. We would like more efficient transformations that are correct while curtailing these extra, unnecessary constraints. Even a single MEMBAR#SS in a Sparc multiprogram, for example, resulted in strengthening every store instruction to a store release in order to transform for an Itanium system.
So instead we seek program transformations that replace individual strong instructions with memory barriers, and vice versa, instead of altering the entire code. We call such program transformations instruction transformations to emphasize that they transform each strong instruction or MEMBAR instruction separately, and leave every weak instruction unchanged. Surprisingly, such transformations do not typically exist.

Transforming from Itanium to Sparc

The notation (P, MC) denotes the system consisting of all programs in P executing under the memory constraint MC.

THEOREM 6.1. There is an instruction transformation from (P^It_{acq,rel}, Itanium) to (P^Sp_{LL,LS,SS}, RMO).

The proof of Theorem 6.1 is found in [7]. For any P ∈ P^It_{acq,rel}, the transformation τ(P) of that proof is defined by defining the transformation of each instruction as follows:

    define τ(load_p(x)) =
        v ← load_p(x)
        return v

    define τ(store_p(x,v)) =
        store_p(x,v)

    define τ(load_acq_p(x)) =
        v ← load_p(x)
        MEMBAR#LL
        MEMBAR#LS
        return v

    define τ(store_rel_p(x,v)) =
        MEMBAR#LS
        MEMBAR#SS
        store_p(x,v)

This transformation ensures that all orderings in P ∈ P^It_{acq,rel} are retained in τ(P) ∈ P^Sp_{LL,LS,SS}, but it causes ordering relations between more instructions in τ(P) than are in P. However, the removal of any one of the MEMBAR instructions can cause a necessary partial order to be lost.
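For concreteness, τ can be applied mechanically, one instruction at a time; the following minimal Python sketch mirrors the definition above (the tuple encoding of instructions is an assumption made for illustration):

    # Per-instruction rewrite realizing tau; instruction tuples are an assumption.
    def tau(ins):
        kind = ins[0]
        if kind == 'load':                        # ('load', x)
            return [ins]
        if kind == 'store':                       # ('store', x, v)
            return [ins]
        if kind == 'load_acq':                    # ('load_acq', x)
            return [('load', ins[1]), ('MEMBAR', '#LL'), ('MEMBAR', '#LS')]
        if kind == 'store_rel':                   # ('store_rel', x, v)
            return [('MEMBAR', '#LS'), ('MEMBAR', '#SS'), ('store', ins[1], ins[2])]
        raise ValueError(f"unexpected instruction: {ins!r}")

    def transform_program(individual_program):
        return [out for ins in individual_program for out in tau(ins)]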
Transforming from Sparc to Itanium

Due to space constraints all proofs in this section have been omitted, but they can be found in [7].

LEMMA 6.2. There is no instruction transformation from (P^Sp_{LL}, RMO) to (P^It_{acq}, Itanium).

LEMMA 6.3. There is no instruction transformation from (P^Sp_{SL}, RMO) to (P^It_{acq,rel}, Itanium).

Can a program transformation that is not required to be an instruction transformation overcome the negative results of Lemmas 6.2 and 6.3? There are two negative results and one positive result!

LEMMA 6.4. There is no individual program transformation from (P^Sp_{SL}, RMO) to (P^It_{acq,rel}, Itanium).

THEOREM 6.5. There is no individual program transformation from (P^Sp_{all}, RMO) to (P^It_{acq,rel}, Itanium).

THEOREM 6.6. There is an individual program transformation from (P^Sp_{LS,SS}, RMO) to (P^It_{rel}, Itanium).

Theorem 6.5 implies that C(P^Sp_{all}, RMO) is incomparable to C(P^It_{acq,rel}, Itanium). Theorem 6.6 indicates that C(P^Sp_{LS,SS}, RMO) can be implemented (but not exactly implemented) on C(P^It_{acq,rel}, Itanium), and Theorem 6.1 indicates that C(P^It_{acq,rel}, Itanium) can be implemented (but not exactly implemented) on C(P^Sp_{LL,LS,SS}, RMO).

Acknowledgements

We gratefully thank the anonymous referees for extensive comments. Some of their suggestions have been incorporated in this extended abstract to improve its content and presentation; those that could not be entertained due to space constraints will be addressed in the full version of the paper.

7. REFERENCES

[1] S. V. Adve and M. D. Hill. Weak ordering - a new definition. In Proc. 17th Int'l Symp. on Computer Architecture, pages 2-14, May 1990.
[2] M. Ahamad, R. Bazzi, R. John, P. Kohli, and G. Neiger. The power of processor consistency. In Proc. 5th Int'l Symp. on Parallel Algorithms and Architectures, pages 251-260, June 1993. Technical Report GIT-CC-92/34, College of Computing, Georgia Institute of Technology.
[3] M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto. Causal memory: Definitions, implementations, and programming. Distributed Computing, 9:37-49, 1995.
[4] P. Chatterjee and G. Gopalakrishnan. Towards a formal model of shared memory consistency for Intel Itanium(TM). In Proc. 2001 IEEE International Conference on Computer Design (ICCD), pages 515-518, Sept 2001.
[5] K. Gharachorloo. Memory Consistency Models for Shared-Memory Multiprocessors. PhD dissertation, Department of Electrical Engineering, Stanford University, 1995.
[6] J. Goodman. Cache consistency and sequential consistency. Technical Report 61, IEEE Scalable Coherent Interface Working Group, March 1989.
[7] L. Higham and L. Jackson. Translating between Itanium and Sparc memory consistency models. Technical report, Department of Computer Science, The University of Calgary, May 2006.
[8] L. Higham, L. Jackson, and J. Kawash. Specifying memory consistency of write buffer multiprocessors. ACM Trans. on Computer Systems. To appear.
[9] L. Higham, L. Jackson, and J. Kawash. Capturing register and control dependence in memory consistency models with applications to the Itanium architecture, May 2006. Submitted to: DISC 2006.
[10] L. Higham, J. Kawash, and N. Verwaal. Defining and comparing memory consistency models. In Proc. 10th Int'l Conf. on Parallel and Distributed Computing Systems, pages 349-356, October 1997.
[11] Intel Corporation. A formal specification of the Intel Itanium processor family memory ordering. http://www.intel.com/, Oct 2002.
[12] V. Iosevich and A. Schuster. A comparison of sequential consistency with home-based lazy release consistency for software distributed shared memory. In International Conference on Supercomputing, pages 306-315, September 2004.
[13] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. on Computers, C-28(9):690-691, September 1979.
[14] R. J. Lipton and J. S. Sandberg. PRAM: A scalable shared memory. Technical Report 180-88, Department of Computer Science, Princeton University, September 1988.
[15] D. L. Weaver and T. Germond, editors. The SPARC Architecture Manual, Version 9. Prentice-Hall, 1994.
[16] Y. Yang, G. Gopalakrishnan, G. Lindstrom, and K. Slind. Analyzing the Intel Itanium memory ordering rules using logic programming and SAT. Technical Report UUCS-03-010, University of Utah, 2003.