IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 12, December 1995

Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures

Xiaodong Zhang, Senior Member, IEEE, and Yong Yan

Abstract—Performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory management systems. Cache Coherent Nonuniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models of the two memory systems for comparative evaluation are presented. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the workload well; however, the overhead of frequent data movement may match the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.

Index Terms—Cache coherence, CC-NUMA, COMA, performance modeling and measurements, slotted rings, shared memory, the KSR-1.

Manuscript received Nov. 16, 1993; accepted Mar. 22, 1995. The authors are with the High Performance Computing and Software Laboratory, University of Texas at San Antonio, San Antonio, TX 78249; e-mail: [email protected], [email protected]. IEEECS Log Number D95063.

I. INTRODUCTION

Large scale shared-memory architectures provide a single address space supported by physically distributed memory. The strength of such systems comes from combining the scalability of network-based architectures with the convenience of the shared-memory programming model. Computing performance on such systems is affected by two important architecture and system design factors: the interconnection network and the memory system structure. The choice of interconnection networks to link processor nodes to cache/memory modules can make nonuniform memory access (NUMA) times vary drastically, depending upon the particular access patterns involved. A hierarchical ring structure is an interesting base on which to build large scale shared-memory multiprocessors. In large NUMA architectures, the memory system organization also affects communication latency. With respect to the kinds of memory organizations utilized, NUMA memory systems can be classified into the following three types in terms of data migration and coherence:

Non-CC-NUMA stands for non-cache-coherent NUMA. This type of architecture either supports no local caches at all (e.g., the BBN GP1000 [1]), or provides a local cache that disallows caching of shared data in order to avoid the cache coherence problem (e.g., the BBN TC2000 [1]).

CC-NUMA stands for cache-coherent NUMA, where each node consists of a processor with an associated cache, and a designated portion of the global shared memory. Cache coherence for a large scale shared-memory system is usually maintained by a directory-based protocol. Examples of such systems are the Stanford DASH [8] and the University of Toronto's Hector [13].

COMA stands for cache-only memory architecture. Like CC-NUMA, each processor node has a processor, a cache, and a designated portion of the global shared memory. The difference, however, is that the memory associated with each node is augmented to act as a large cache. Consistency among cache blocks in the system is maintained using a cache coherence protocol. A COMA system allows transparent migration and replication of shared data items to the nodes where they are referenced. Example systems are the Kendall Square Research KSR-1 [7] and the Swedish Institute of Computer Science's Data Diffusion Machine (DDM) [5].

Comparative performance evaluation between CC-NUMA and COMA models has been conducted by using simulations in [12], where dynamic network contention is not a major consideration. In addition, only 16 processors were simulated on relatively small problem sizes. However, both CC-NUMA and COMA systems are targeted at large scale architectures on large problem sizes. Another experimental measurement has recently been conducted to compare the performance of the DASH (CC-NUMA on cluster networks) and the KSR-1 (COMA on hierarchical rings) [11]. We believe that further work needs to be done in order to more precisely and completely provide insight into the overhead effects inherent in the two memory systems. First, a comparative evaluation needs to carefully take into consideration the network contention, which varies between the two memory system designs and among different interconnection network architectures.
Second, a comparative evaluation between the two memory systems should be done based on a particular network structure, because different network structures can make the two memory systems perform and behave differently. COMA and CC-NUMA mainly differ in the requirements they place on the network and in data locality, migration, and replication. Previous studies combine the two factors in the evaluation, which limits the possibility of isolating the effects of either of them. Our study using a particular network is concerned with the isolated effects of the memory systems. Third, a comparative evaluation should provide information to identify and distinguish the effects on computing performance caused by the memory systems and by the interconnection network structure. Finally, besides simulation and models, a comparative evaluation should also present experimental results on real CC-NUMA and COMA architectures to provide practical observations of program executions. This paper differs from the studies cited in [11] and [12] with respect to the four points mentioned above. In addition, this modeling and evaluation work complements our experimental and application programming work on various network-based shared-memory multiprocessors in the High Performance Computing and Software Laboratory (see, e.g., [14], [15], and [17]).

A hierarchical ring structure is the particular interconnection network in this comparative performance evaluation study of both the CC-NUMA and COMA systems. The organization of the rest of the paper is as follows. In Section II, we begin with detailed descriptions of the modeled ring network and the cache coherence protocols of CC-NUMA and COMA. The performance parameters of the analysis are also presented in Section II. The analytical models of the two memory systems on the hierarchical ring network are presented in Sections III and IV. Section V gives the comparative performance evaluation between the two memory systems in terms of data migration, load distributions, network contention, and communication bandwidth. Section VI reports our experiments on cache coherence and data access patterns under the two memory systems on the KSR-1 to verify and support the analytical and simulation results presented in Section V. Finally, we give a summary and conclusions in Section VII.

II. THE DEFINITIONS FOR THE HIERARCHICAL RING BASED CC-NUMA AND COMA

In order to clarify and simplify the presentation without losing the generality of a hierarchical ring, we consider a two-level ring-based shared-memory architecture for both the CC-NUMA and COMA systems.

A. The Architecture for the CC-NUMA and the COMA

The CC-NUMA and the COMA systems to be discussed in the following sections share the same ring interconnection network architecture shown in Fig. 1. This architecture has the following hardware and software structures, functions, and parameters, which are similar to the ones in [7] and [13]:

1) The architecture consists of a global ring and M local rings. Each local ring is connected to the global ring through an inter-ring port with a pair of buffers. The buffers are used to temporarily store and forward packets passing the port. The global ring has M equally sized slots connecting to the M local rings. Each local ring has N equally sized slots, each of which connects to a station module.

2) Each station module consists of one main memory module, one subcache, and one processor. In the CC-NUMA system, the main memory module is a home addressed memory which occupies a unique contiguous portion of a flat, global (physical) address space. In the COMA system, the main memory module is a big-capacity cache, which results in dynamic mapping between the context address and the system logic address through segment translation tables.

3) Both the global ring and the local rings rotate constantly. The rotation period between each slot is defined as t_r. A processor node in a local ring ready to transmit a message waits until an empty slot is available. The response to a request, such as a read/write, will be rotated back to the requesting processor.

4) Each inter-ring port is able to determine the path of the message packets passed through it. In the COMA system, each port keeps a directory to record the mapping relationships among the cache memory modules in the corresponding local ring. In CC-NUMA, each port can determine the destination of the message packets according to the home address carried by the message packet.

Fig. 1. The architecture of a two-level ring-based CC-NUMA/COMA system.

B. Cache Coherence Protocols of CC-NUMA and COMA

The cache coherence protocols in our models are based on the available ones which have been proposed or implemented on hierarchical ring architectures (see, e.g., [2], [4], and [7]). In order to compare the differences between the two memory systems, similar hierarchical data directories and cache coherence protocols are designed in each system. In both systems, sequential consistency is preserved.

B.1. Cache coherence protocol of CC-NUMA

1) Hierarchical directory: Each processor maintains a local directory in its local cache, which records the data allocation information in the local cache. Each local ring maintains a global directory built into the inter-ring port, which records the data allocation information in the local ring.

2) Ownerships of shared data:
• Share: there is more than one copy of the shared data existing in other memory modules.
• Exclusive: the current copy is the only one in the system.

3) Read/write protocol:
• Reading shared data—the processor will get the data from its local memory if it is available there; otherwise it will get it from one of the memory modules in the local ring, or in a remote ring, through searching. The newly loaded copy will have the "Share" ownership.
• Writing shared data—the processor will either write the data locally if it is available or will do a remote-write in the destination memory module. The associated invalidation operations are defined as follows. The invalidation of shared data is conducted while a write request travels to the home node and returns from the home node. Each time a write request passes by a global directory which has copies of the requested data, it will produce an invalidation packet to invalidate the copies in the local ring.

B.2. Cache coherence protocol of COMA

1) Hierarchical directory: It has the same structure as the one defined for CC-NUMA.

2) Ownerships of cache segments:
• Copy: there is more than one copy of a physical address segment in the system.
• Nonexclusive Copy: it has the same features as "Copy" except that the location is the owner of the physical address segment.
• Exclusive: there is only one valid copy of the physical address segment in the system.
• Invalidation: the copy in the cache segment is not valid.

3) Read/write protocol:
• Read shared data—if the copy exists in the local cache, the processor performs the read immediately. Otherwise, it sends a probe packet into the local ring or remote rings to search for a copy of its required physical address segment. The processor will receive a copy with the ownership of Copy. The ownership of the cache copy in the destination module will be changed into "Nonexclusive Copy" if its original ownership is Exclusive.
• Write shared data—the processor will first search for the owner of the shared data through the entire ring hierarchy. As soon as the owner is found in a cache module, the processor loads the data back to its local cache and invalidates all the existing copies in the system. After the invalidation, the processor performs the write in the local cache. The updated data copy becomes "Exclusive."

C. Performance Parameters and Assumptions

In order to fairly compare the performance differences between CC-NUMA and COMA, it is necessary to define a common evaluation base. The models presented in the next sections are based on the following common performance parameters:

1) λ: request miss rate of each local cache.
2) λ_h: the fraction of λ directed to a hot-spot address segment.
3) λ_l: the fraction of λ(1 − λ_h) directed to memory modules on the local ring.
4) λ_r: read-miss fraction of λ.
5) λ_w: write-miss fraction of λ.
6) t_r: rotation period of each ring.
7) N: the number of stations connected to each local ring (or the number of slots in a local ring).
8) M: the number of local rings connected to the global ring.
9) N_c: the number of cache segments in each local cache memory (this parameter is only used for the COMA system).

Furthermore, we assume:

1) The request miss rate of each local cache follows a Poisson process.
2) The local request rate and the nonlocal request rate are uniformly distributed.
3) In the request sequence of each station, the read misses and write misses to a data location are uniformly distributed.
4) One message packet can be completely carried by one slot, which conveys only this message packet, so successive slots behave independently.
5) When a station receives a message packet from a slot, it will produce a reply into the same slot without any delay.
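For illustration, the parameters and assumptions above can be collected in a small Python structure. The class and field names below are our own notation for this sketch, not part of the original model definition.

from dataclasses import dataclass

@dataclass
class RingModelParams:
    lam: float     # λ: request miss rate of each local cache (Poisson)
    lam_h: float   # λ_h: fraction of λ directed to the hot-spot segment
    lam_l: float   # λ_l: fraction of λ(1 - λ_h) directed to the local ring
    lam_r: float   # λ_r: read-miss fraction of λ
    lam_w: float   # λ_w: write-miss fraction of λ
    t_r: float     # rotation period of each ring
    N: int         # stations (slots) per local ring
    M: int         # local rings connected to the global ring
    N_c: int = 0   # cache segments per local cache (COMA only)

    def local_nonhot_miss_rate(self) -> float:
        # rate of misses a station directs to its own local ring (nonhot part)
        return self.lam * (1.0 - self.lam_h) * self.lam_l

# example: two rings of 32 stations and a unit rotation period, as in Section V.A.1
p = RingModelParams(lam=5e-4, lam_h=0.1, lam_l=0.5, lam_r=0.7, lam_w=0.3,
                    t_r=1.0, N=32, M=2)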
In this paper, the major latency analyses for both memory systems on the ring architecture are based on evaluating two important performance factors. First, the ring network contention is modeled by studying hot spot effects. Second, analytical models of read/write miss latencies are constructed using the network contention models associated with the cache coherence protocols. The M/G/1 model is the major mathematical tool used to derive the analytical latency formulas. In the following two sections, we present the objectives, assumptions, and major results of each model to evaluate the CC-NUMA and COMA systems on a hierarchical ring architecture. For the detailed derivation process of the mathematical models, the interested reader may refer to Appendices A and B.

III. AN ANALYTICAL MODEL FOR THE HIERARCHICAL RING BASED CC-NUMA

A. Network Contention

In a hierarchical CC-NUMA system, network contention can be well characterized by a hot spot environment where a large number of processors try to access a globally shared variable across the network. In this case, a hierarchical ring is divided into three regions in terms of network activities: the hot local ring, which is the local ring where the hot spot is located; the cool local rings, which are the rest of the local rings without the presence of the hot spot; and the global ring. A comprehensive access delay model for the entire hierarchical ring in the presence of the hot spot is presented based on contention in each of the three parts of the rings. In Appendix A, the following CC-NUMA latency factors are obtained:

• d̄_c: the mean waiting time for a message to find an empty slot in a cool local ring.
• q̄_cool-lport: the mean queuing time of a message in the interface port from the global ring to a cool local ring.
• d̄_h: the mean waiting time for a message to find an empty slot in the hot local ring.
• q̄_hot-lport: the mean queuing time of a message in the interface port from the global ring to the hot local ring.
• q̄_hot-gport: the mean queuing time of a message in the interface port from the hot local ring to the global ring.
• q̄_cool-gport: the mean queuing time of a message in the interface port from a cool local ring to the global ring.

B. Latency of a Remote-Write to the Hot Spot

A remote-write to the hot spot will be satisfied in one of the following two situations:

1) The write request is from the hot local ring, with probability 1/M. This request only needs to travel the hot local ring for one circle. The traveling time, denoted by T_w-numa1, consists of the time for the source processor to find an empty slot on the hot local ring and the time for the request to travel the hot local ring for one circle:

T_w-numa1 = d̄_h + N t_r.   (3.1)

2) The write request is from a cool local ring, with probability (M − 1)/M. This write will access the hot memory remotely. The remote-write time, denoted by T_w-numa2, consists of four parts: the time from the source cool ring to the global ring, the time from the global ring to the hot ring, the time for searching the destination processor in the hot ring, and the time for the data packets to go back to the source processor:

T_w-numa2 = d̄_c + q̄_cool-gport + q̄_hot-gport + q̄_hot-lport + q̄_cool-lport + t_r(M + 2N).   (3.2)

Therefore, the average latency of a remote-write to the hot spot is

T̄_w-numa = T_w-numa1/M + (M − 1)T_w-numa2/M.   (3.3)

C. Latency of a Remote-Read to the Hot Spot

A read-miss process is more complex than that of a write-miss because multiple copies of the data may exist in the system. In general, the process of a remote-read can be described by the state transition graph shown in Fig. 2. In Fig. 2, 0 represents the initial state, L represents the state where the requesting processor receives the data from the local ring, G represents the state where the requesting processor receives data from a remote ring, and T_l and T_g represent the latency in states L and G, respectively.

Fig. 2. Transition graph of a remote-read to the hot spot.

Because we have assumed that read misses and write misses are uniformly distributed in the request sequences, whether a data item has multiple copies distributed in other memory modules is determined by the relative ratio of the read miss rate to the write miss rate. The transition probability P can be determined as follows:

1) When λ_r ≤ λ_w, each read miss must be preceded by a write miss, so the probability for a hot read to get the data item from a copy of the home data is zero. In this case, the probability P of reading the hot memory module in the local ring equals the probability of the local ring being the hot ring, which is 1/M.

2) When λ_r > λ_w, each write miss is followed by λ_r/λ_w read-misses, where only the first read-miss visits the home data and the other read-misses visit a copy of the home data. So the probability for a read miss to visit the home data is λ_w/λ_r. Moreover, the probability for the hot data item to be located on a different ring from the request is (M − 1)/M. Therefore, the probability P for a request to get the data item from a copy of the home data or from the home data on the local ring is 1 − λ_w(M − 1)/(M λ_r).

Concluding the above analyses, the transition probability P can be represented as

P = 1/M, if λ_r ≤ λ_w;
P = 1 − λ_w(M − 1)/(M λ_r), otherwise.   (3.4)

By (3.4), the latency of a read-miss to the hot spot is

T̄_r-numa = P T_l + (1 − P) T_g,   (3.5)

where T_l and T_g are computed under the following two conditions:

1) λ_w ≥ λ_r: A read always visits the hot data item in the home node because no copies of the hot data item exist in this situation. Hence, by (3.1) and (3.2) we have T_l = T_w-numa1 and T_g = T_w-numa2.

2) λ_w < λ_r: In this case, the remote-read miss latency T_g is T_w-numa2. The local-read miss latency T_l is obtained from (d̄_c + N t_r), the searching time of a read-miss in a nonhot local ring, and (d̄_h + N t_r), the searching time of a read-miss on the hot local ring.
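As an illustration of (3.1)-(3.5), the following Python sketch computes the average hot-spot write-miss and read-miss latencies for CC-NUMA. The waiting and queuing terms (d̄_h, d̄_c, and the q̄ values) are taken as inputs because they are derived separately in Appendix A; the function names are ours, not the paper's.

def write_latency_numa(M, N, t_r, d_h, d_c,
                       q_cool_gport, q_hot_gport, q_hot_lport, q_cool_lport):
    """Average remote-write latency to the hot spot, eqs. (3.1)-(3.3)."""
    T_w1 = d_h + N * t_r                                      # (3.1) request from the hot ring
    T_w2 = (d_c + q_cool_gport + q_hot_gport +
            q_hot_lport + q_cool_lport + t_r * (M + 2 * N))   # (3.2) request from a cool ring
    return T_w1 / M + (M - 1) * T_w2 / M                      # (3.3)

def read_latency_numa(M, lam_r, lam_w, T_l, T_g):
    """Average remote-read latency to the hot spot, eqs. (3.4)-(3.5)."""
    if lam_r <= lam_w:
        P = 1.0 / M                                  # no copies exist, so only the hot ring helps
    else:
        P = 1.0 - lam_w * (M - 1) / (M * lam_r)      # copies reduce visits to the remote home
    return P * T_l + (1 - P) * T_g                   # (3.5)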

IV. AN ANALYTICAL MODEL FOR THE HIERARCHICAL RING BASED COMA

Based on the cache coherence protocols defined in Section II, there are only two types of packets running in the ring: probe packets, which carry access requests to search for their destinations, and data packets, which carry the data segments back to the source processors. In a steady state, each ring can be considered to have the same number of probe packets and data packets. Moreover, a COMA system may cause a physical address segment to be moved to different local caches at different times. Therefore, a physical address segment can be assumed to have the same probability of residing on every local cache at the same time, under the condition that each processor requests a physical address segment with the same probability during a unit period of time. Based on this unique COMA data migration feature, we can assume that each local ring has the same contention pattern in a steady state, which is independent of the contention differences among physical address segments. This is the major difference between the home addressed CC-NUMA (a data item can be fixed in a memory module, and access to it is conducted by remote-read and remote-write) and the changeably addressed COMA. However, the data migration feature of a COMA system makes the read/write miss process more complicated than that in a CC-NUMA system. In a COMA system, a read/write miss will dynamically chase the data, because the requested data does not have a home address and will be dynamically moved to a local cache by a write access. In the following, the latency of a read/write miss is derived mainly by modeling this dynamic chasing process.

A. Modeling the Network Contention in COMA

For all local rings, each interface port from the global ring to a local ring contributes an equal amount of traffic to the local ring, which is not affected by the hot spot effects because of the dynamic data migration feature. So the fraction of accesses to the hot spot in each processor should be considered uniformly distributed among the memory modules in the M local rings. Based on the above analysis, the packet arrival rate, denoted as λ_p, in each interface port from the global ring to a local ring can be expressed as

λ_p = 2N(M − 1)λ_h λ/M + 2N(1 − λ_h)(1 − λ_l)λ,   (4.7)

where the hot request rate to a local ring is N(M − 1)λ_h λ/M, the nonhot request rate to a local ring is N(1 − λ_h)(1 − λ_l)λ, the data packet rate responding to the hot requests of a local ring is N(M − 1)λ_h λ/M, and the data packet rate responding to the nonhot requests of a local ring is N(1 − λ_h)(1 − λ_l)λ. Then, using the same method described in Appendix A, we can obtain the following three important performance results:

1) U_l-coma, the utilization of a local ring in COMA, is

U_l-coma = Nλt_r(M + 2(M − 1)λ_h + 2M(1 − λ_h)(1 − λ_l))/M.   (4.8)

2) d̄_wait-coma, the waiting time for a message to find an empty slot in a local ring, is

d̄_wait-coma = U_l-coma t_r/(1 − U_l-coma).   (4.9)

3) q̄_l-coma, the queuing time in the interface port for a message to enter a local ring from the global ring, is

q̄_l-coma = t_r / (1 − Nλt_r(M + 3(M − 1)λ_h + 3M(1 − λ_h)(1 − λ_l))/M).   (4.10)

For the global ring, each interface port from a local ring to the global ring can be modeled as an M/G/1 queue with a packet arrival rate of 2Nλ((M − 1)λ_h/M + (1 − λ_h)(1 − λ_l)). Then, we can obtain the following two important performance results in the same way:

1) U_g-coma, the utilization of the global ring, is

U_g-coma = 2Nλt_r((M − 1)λ_h + M(1 − λ_h)(1 − λ_l)).   (4.11)

2) q̄_g-coma, the queuing time in the interface port for a message to enter the global ring from a local ring, obtained in the same way (4.12).

B. Latency of a Remote-Write to a Hot Memory

In a COMA system, the searching process of a write-miss can be expressed as the state transition graph shown in Fig. 3, based on the dynamic migration feature of data. State 0 represents the initial state of a write miss at its source processor. States LS and GS represent two possibilities for a write-miss to find the owner of its required address segment, where LS is the process of searching for and getting the hot segment in the local ring, with probability 1/M, and GS is the process of getting the hot segment in a remote memory, with probability (M − 1)/M. States INV-1 and INV-2 represent the corresponding invalidation states of LS and GS, respectively. We use t_ls, t_inv-1, t_gs, and t_inv-2 to represent the time spent in each corresponding state. Based on Fig. 3, the latency of a remote-write to a hot spot, denoted as T̄_w-coma, can be expressed as

T̄_w-coma = (t_ls + t_inv-1)/M + (M − 1)(t_gs + t_inv-2)/M,   (4.13)

where the detailed derivation process of t_ls, t_inv-1, t_gs, and t_inv-2 is listed in Appendix B.

Fig. 3. Transition graph of a COMA write-miss.

C. Latency of a Remote-Read to a Hot Memory

In a COMA system, the remote-read process can be described by the same state transition graph as shown in Fig. 2. The transition probability P also has the following expression:

P = 1/M, if λ_r ≤ λ_w;
P = 1 − λ_w(M − 1)/(M λ_r), otherwise.   (4.14)

The hot read-miss latency is

T̄_r-coma = P T_l + (1 − P) T_g,   (4.15)

where the computation of the local search latency T_l and the global search latency T_g is more complicated than in a CC-NUMA system, because a read miss in a COMA system involves a process of dynamically chasing the data. In the following, we calculate T_l and T_g under two conditions:

1) λ_w ≥ λ_r: Each read miss to a data item must be preceded by a write miss to the data, which means that no copies of the data item exist when a read miss to the data occurs. In this situation, the data search procedure of a read miss is the same as that of a write miss, except that a read miss does not involve an invalidation process. Hence, we have

T_l = t_ls + N t_r/2,  T_g = t_gs + (M + 2N)t_r/2.   (4.16)

2) λ_w < λ_r: Each write miss is followed by λ_r/λ_w read misses, so the average number of copies of a data item in the system is (1 − λ_w/λ_r)/2 when a read miss to the data occurs, which reduces the global search latency T_g to

T_g = max( (2N + M)t_r + d̄_wait-coma + q̄_l-coma + q̄_g-coma, (t_gs + (M + 2N)t_r/2)/(1 + (1 − λ_w/λ_r)/2) ),   (4.17)

where (2N + M)t_r + d̄_wait-coma + q̄_l-coma + q̄_g-coma is the least time needed to get data remotely, and (t_gs + (M + 2N)t_r/2)/(1 + (1 − λ_w/λ_r)/2) is the global search latency reduced by the (1 − λ_w/λ_r)/2 data copies. Furthermore, the average number of copies of a data item in one of the M local rings is (1 − λ_w/λ_r)/(2M), which reduces the local search latency T_l to

T_l = max( N t_r + d̄_wait-coma, (t_ls + N t_r/2)/(1 + (1 − λ_w/λ_r)/(2M)) ),   (4.18)

where N t_r + d̄_wait-coma is the least time for traveling a local ring for one circle, and (t_ls + N t_r/2)/(1 + (1 − λ_w/λ_r)/(2M)) is the search time reduced by the multiple copies.
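The COMA remote-read latency of (4.14)-(4.18) can be sketched in the same style as the CC-NUMA helper above. The search times t_ls and t_gs and the queuing terms are inputs derived in Appendix B and Section IV.A; the helper below is our own illustration, not the original derivation.

def read_latency_coma(M, N, t_r, lam_r, lam_w,
                      t_ls, t_gs, d_wait, q_l, q_g):
    """Hot read-miss latency in COMA, eqs. (4.14)-(4.18)."""
    if lam_r <= lam_w:
        # no copies exist: search as a write miss would, without invalidation (4.16)
        T_l = t_ls + N * t_r / 2
        T_g = t_gs + (M + 2 * N) * t_r / 2
        P = 1.0 / M
    else:
        copies = (1 - lam_w / lam_r) / 2                        # average number of copies
        T_g = max((2 * N + M) * t_r + d_wait + q_l + q_g,       # least remote access time
                  (t_gs + (M + 2 * N) * t_r / 2) / (1 + copies))            # (4.17)
        T_l = max(N * t_r + d_wait,
                  (t_ls + N * t_r / 2) / (1 + copies / M))                  # (4.18)
        P = 1.0 - lam_w * (M - 1) / (M * lam_r)
    return P * T_l + (1 - P) * T_g                              # (4.15)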

V. COMPARATIVE PERFORMANCE EVALUATION BETWEEN CC-NUMA AND COMA BASED ON THE ANALYTICAL MODELS

In this section, we provide analytical results for various architecture effects such as access-miss latency and bandwidth. The analysis of the bandwidth is based on the analysis of the upper bound of the request rate per processor. The architectural factors to be considered are the size of a ring and the rotation speed of the ring. The system factors to be considered are the data locality, the hot spot effects, and the ratio between read-misses and write-misses.

A. The Analysis on Access-Miss Latency

A.1. The hot spot effects

Here we choose 32 as the number of slots in each local ring and in the global ring. The rotation period is one unit time. The effects of the hot spot on miss latency under this condition in the CC-NUMA and COMA systems are shown in Fig. 4. Fig. 4 indicates that the hot spot has little effect on remote-read latency in both systems and on remote-write latency in CC-NUMA, but it affects remote-write latency in COMA to a certain degree because of more frequent data migration. The results show that the structure of the hierarchical ring network can balance network traffic well when a hot spot occurs.

Fig. 4. Effects of changing the hot spot rate on access-miss latencies.

A.2. The locality effects

The locality is defined as the ratio between the number of accesses to a local ring and the total number of nonhot memory accesses in the system. The performance parameters are selected as in Section V.A.1 except that the total miss rate is 0.0005. Fig. 5 presents the effects of the locality on the miss latencies in both memory systems. It shows that increasing the locality significantly reduces the miss latencies in both systems. In particular, the read-miss latencies in both systems present almost the same curves, but the write-miss latency in a COMA system is less than that in a CC-NUMA system.

Fig. 5. Effects of changing the program locality on miss latencies.

A.3. The effects caused by different miss rates and request rates

The network contention is mainly determined by the miss rate in each processor. Assuming a uniform distribution of miss rates in each processor, the effects of the miss rate on access-miss latencies in both systems are shown in Fig. 6. The results show that an increase in the miss rate causes higher read/write miss latency in both systems, but the write-miss latency in the COMA is slightly smaller than the write-miss latency in the CC-NUMA because of a more balanced load in the COMA.

Fig. 6. Effects of changing the total miss rate on access-miss latencies.

A.4. The effects of read/write miss distributions

The effects of read/write miss distributions on the miss latency are measured by changing the ratio between read-misses and write-misses. Fig. 7 shows that, by increasing the read-miss rate, the read-miss latencies in both systems reduce significantly, but the write-miss latencies in both systems increase because there are more cache copies to be invalidated. On the other hand, the COMA handles read/write misses slightly more effectively than the CC-NUMA.

Fig. 7. Effects of changing the read miss distribution on miss latencies.

A.5. System effects by changing the ring size

The size of a ring is an important architecture factor. Assuming the miss rate in each processor is uniformly distributed, and the rotation period of the ring is one unit time, Fig. 8 shows that, by increasing the size of a ring, the read-miss latencies in both systems have the same increasing curves, but the write-miss latency in COMA increases slightly more slowly than that in the CC-NUMA.

Fig. 8. Effects of changing the ring size on miss latencies.

A.6. System effects by changing the rotation period

The rotation period of the rings is another important architecture factor affecting performance. Fig. 9 plots the latencies of both systems as the rotation speed is slowed down step by step. The results indicate that the COMA performs slightly better than the CC-NUMA in terms of changing the rotation period.

Fig. 9. Effects of changing the rotation period on miss latencies.
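The curves of Figs. 4-9 are produced by sweeping one model parameter while holding the others fixed. The self-contained Python loop below illustrates the idea for the ring size of Fig. 8, using (3.1)-(3.3); the waiting and queuing terms are fixed placeholder constants here (in the full model they also grow with N through the Appendix A formulas), so only the qualitative trend is meaningful.

# illustrative sweep over the ring size N; d_h, d_c, and q are placeholder constants
M, t_r = 2, 1.0
d_h, d_c, q = 5.0, 2.0, 3.0
for N in (8, 16, 32, 64):
    T_w1 = d_h + N * t_r                          # (3.1)
    T_w2 = d_c + 4 * q + t_r * (M + 2 * N)        # (3.2) with all four q terms set to q
    T_w = T_w1 / M + (M - 1) * T_w2 / M           # (3.3)
    print(f"N = {N:2d}: average remote-write latency {T_w:.1f}")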

B. Bandwidth Analysis

In CC-NUMA and COMA systems, the access-miss request rate on each processor is bounded by the network contention. It is important to compare the different effects on this request bound of changing the performance parameters in the two memory systems. The detailed mathematical models for the bandwidth analysis are given in Appendix C.

B.1. Effects of locality

Fig. 10 presents the effects of locality on the upper bound of the miss rate. It shows that the upper bound of the miss rate in the COMA increases slightly faster than that in the CC-NUMA. Fig. 11 shows that the effects of the hot spot on the upper bound of the miss rate in both CC-NUMA and COMA are almost identically significant.

Fig. 10. Effects of changing the program locality with a specific hot spot rate on the request bound.

Fig. 11. Effects of changing the hot spot rate on the request bound.

B.2. Effects of architectural factors

Fig. 12 presents the effects of changing the size of the rings with a certain hot spot rate on the upper bound of the access-miss rate. The curves show that both systems have nearly the same performance. Fig. 13 shows that, by decreasing the rotation speed, the upper bound of the miss rates in both the CC-NUMA and COMA systems is almost identically affected.

Fig. 12. Effects of changing the ring size with a specific hot spot rate on the request bound.

Fig. 13. Effects of changing the rotation period on the request bound.
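Appendix C is not reproduced in this excerpt, so the following Python fragment is only a plausible sketch of how such a request bound can be computed: it treats the local-ring utilization expression (4.8) as a stability condition U < 1 and solves for the largest admissible request rate λ. The function name and the choice of (4.8) as the binding constraint are our assumptions, not the paper's derivation.

def max_request_rate_coma(M, N, t_r, lam_h, lam_l):
    """Largest lambda keeping the COMA local-ring utilization (4.8) below 1."""
    factor = (M + 2 * (M - 1) * lam_h + 2 * M * (1 - lam_h) * (1 - lam_l)) / M
    return 1.0 / (N * t_r * factor)

# example: 32-station rings, unit rotation period, moderate hot spot and locality rates
print(max_request_rate_coma(M=2, N=32, t_r=1.0, lam_h=0.2, lam_l=0.5))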

VI. EXPERIMENT-BASED VALIDATION ON THE KSR-1

A. An Overview of the Experiment-Based Validation

To validate analytical models, execution-driven simulation and real-machine-based experiments are two alternatives. Although execution-driven simulation can model a variety of architecture features by flexibly changing various architectural parameters, it may be difficult or even impossible to verify whether the simulator has correctly modeled a real multiprocessor system. As pointed out in [10], the validation of a simulation can only show whether it produces results similar to another simulator. Therefore, we decided to conduct experiments on a real hierarchical ring system to validate our analytical results.


KSR-1 [7], introduced by Kendall Square Research, is a hierarchical ring based COMA multiprocessor system which provides a direct testbed for validating the analytical results of the COMA system. To validate the analytical results on a hierarchical ring based CC-NUMA system, we simulate its memory operations on the KSR-1. A key part of simulating a CC-NUMA system in a COMA system is to generate the CC-NUMA memory access patterns on the KSR-1. While in a CC-NUMA system data is home addressed, in a COMA system a data item is dynamically duplicated and moved upon read/write requests. In order to fix a data item, we use an array on the KSR-1 to simulate a home addressed variable in a CC-NUMA system. The array is called the extension vector of the variable. The memory access pattern of a CC-NUMA read/write operation sequence is simulated by directing a read/write operation to each independent element of the variable's extension vector based on the following rule.

Let s be a variable and s[m] be the extension vector of s, where m is the length of the vector for multiple accesses to s. Let a_1(s), a_2(s), ..., a_t(s) be a read/write sequence on s in a CC-NUMA system, where t is the length of the sequence. This sequence is simulated in a COMA system by the sequence a'_1, a'_2, ..., a'_t, which is constructed as follows:

1) a'_1 = a_1(s[1]).
2) For any i > 1, if a'_{i-1} is a write operation on s[j], then a'_i = a_i(s[j+1]); if a'_{i-1} is a read operation on s[j], then a'_i = a_i(s[j]).

The above rule guarantees that a series of consecutive read operations access the same variable, and that a write operation does not move the location of the data item. Hence, a CC-NUMA access pattern can be rigorously simulated with the support of the extension vector. The cache coherence protocol proposed for the CC-NUMA system in Section II indicates that a remote write always has the same executing trace, which is independent of the number of data copies in the system. This is because the system uses multiple parallel invalidation packets to invalidate the copies in the local rings. If two processors in a CC-NUMA system produce two similar operation sequences on a shared variable, the similar case can be simulated by making each processor produce its own operation sequence on two different variables in the same memory module.
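The extension-vector rule can be stated compactly in code. The following Python sketch is our own illustration of the mapping, not the actual KSR-1 test program: it returns, for each operation in a CC-NUMA trace, the extension-vector element that the simulated COMA run would access.

def extension_vector_indices(ops, m):
    """Return the extension-vector element index used by each operation.

    ops: sequence of 'read' / 'write' tokens for one CC-NUMA variable s;
    m: length of the extension vector s[1..m].
    Rule: the first operation uses s[1]; an operation that follows a write
    moves to the next element, and one that follows a read stays in place.
    """
    j = 1                      # 1-based index, as in the paper's notation
    indices = []
    prev_was_write = False
    for op in ops:
        if prev_was_write:
            j += 1             # advance only after a write (rule 2)
            if j > m:
                raise IndexError("extension vector too short for this trace")
        indices.append(j)
        prev_was_write = (op == 'write')
    return indices

# hypothetical trace: w r r w r  ->  elements 1 2 2 2 3
print(extension_vector_indices(['write', 'read', 'read', 'write', 'read'], m=8))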
In the rest of this section, we report two sets of experiments on the KSR-1 to validate our analytical results and to give more comparative results between the CC-NUMA and the COMA which complement the results from the analytical models. The first set of experiments, designed for performance validation, are called uniform experiments: the memory access patterns of the analytical models are simulated, and the effects of cache miss rate, read/write rate, cache coherence protocol, hot spot, and locality are measured for comparison with the analytical results. Based on the analytical models, this set of experiments was uniformly constructed such that all 64 processors on two rings were employed. In execution, each processor generated read/write request misses to the memory modules on its local ring and on the remote ring, respectively. The changes of memory access patterns were adjusted by four parameters: request rate, read (or write) rate, hot spot rate, and locality rate. In the second set of experiments, additional memory access patterns were generated to measure the effects of hot spots and the two cache coherence protocols.

The objective of running these experiments is to validate the analytical results presented in earlier sections. Since the analytical models and experiments were performed on two different bases, the absolute latency measures are different. However, we can still present performance model validations and comparisons based on performance tendencies and implications from both analytical and experimental results.

B. Cache Coherence Effects

KSR-1 maintains consistency of the data in each cache using a write-invalidate cache coherence strategy. Whenever a data item is requested by a processor for an update, every other copy of that data item, in every other cache where that subpage is located, is marked "invalid" so that the proper value of the data item can be maintained. The distributed cache directories maintain a list of each subpage in the local cache, along with an indication of the "state" of that subpage. To validate the analytical results on cache coherence effects, we performed the following three experiments.

The first experiment was designed to validate Assumption 5 in Section II.C. In our experiment, one processor writes an array of 500,000 elements into its local cache. All other processors then read that array into their local caches, leaving the exclusive ownership of this array with the processor which originally wrote it and leaving copies of the array located in every other cache. The original processor updates the array again, requiring that all other copies be invalidated. The tests were conducted in two different ways:

1) the number of processors was scaled from 1 to 62 continuously, 0-31 on one ring, then moving to the second ring for 32-62; and
2) the number of processors was scaled in pairs, one processor on each of the two rings being a member of the pair.

With each increment of the number of processors, one processor was added to each ring. Consequently, the number of processors scaled by 2, e.g., 2, 4, 6, 8, ..., 62. The results in Fig. 14 reflect these two different ways of scaling the problem, and conclusions will be drawn from these two different configurations. Fig. 14 shows that the maintenance of cache coherence bears no additional cost on the KSR-1 over and above the latency of the ring rotation itself, which does not increase because of the number of processors, but increases due to the distance the data invalidation must travel from one ring to the other ring. This is consistent with Assumption 5 in Section II.C, where the ring rotation is clocked so that any and all actions that could possibly take place during a stop at each cell can be accomplished.

Fig. 14. Cache coherence timing changes as the number of processors is increased on the KSR-1.

In the second and the third experiments, the uniform experiments were conducted under different sets of parameters. To validate the effect of request miss rate on network contention, in the second experiment we used the same set of performance parameters used in the analytical model, where the hot spot rate was set to 0, the read rate was 0.7, and the locality rate was uniformly distributed. The write-miss latencies and the read-miss latencies were measured while the request rate was changed from 0.00015 to 0.00045 through a delay function. The measurements are given in Table I, which shows varying tendencies of miss latency similar to the analytical results in Fig. 6.

TABLE I. MEASUREMENT RESULTS: EFFECTS OF CHANGING THE TOTAL MISS RATE ON ACCESS-MISS LATENCIES (IN µs)

To validate the effects of different read/write miss distributions on miss latency, in the third experiment we set the following conditions: the request rate was fixed at 0.0005, the locality rate was uniformly distributed, the hot spot rate was set to 0, and the read rate was changed from 0.1 to 0.8. Table II lists the measurement results, showing that the read-miss latency is very close to the write-miss latency when the read rate is less than 0.5 and then decreases significantly as the read rate increases beyond 0.5. This is due to the effect of multiple copies of data items in both the COMA and CC-NUMA systems. These experimental results are identical to the analytical results given in Fig. 7 in terms of the system effects.

TABLE II. MEASUREMENT RESULTS: EFFECTS OF CHANGING READ/WRITE MISS DISTRIBUTION ON MISS LATENCIES (IN µs)

C. Locality Effects

The effect of locality rate was measured through the uniform experiments under the following condition: the request rate was fixed at 0.0005, the hot spot rate was set to 0, and the read rate was set to 0.7. The locality rate was changed from 0.1 (where 90% of requests sent by a processor were directed to a remote processor on the remote ring) to 0.9 (where 90% of requests sent from a processor were directed to a remote processor on the local ring). The measurement results reported in Table III show that the decrease rate of miss latencies is similar to the analytical result given in Fig. 5. For example, both analytical and experimental results show that the miss latencies reduce by about 25% when the locality rate increases from 0.1 to 0.5.

TABLE III. MEASUREMENT RESULTS: EFFECTS OF CHANGING LOCALITY RATE ON MISS LATENCIES (IN µs)

D. Hot Spot Performance on the KSR-1

In practice, a hot spot may occur under different memory access patterns, resulting in different performance degradation. Hence our measurements were conducted not only by the uniform experiments for validating the analytical results presented in Fig. 4, but also by three additional experiments for studying the hot spot effects under practical memory access patterns.

D.1. Effects on memory access delay

The hot spot on the KSR-1 is allocated either in a fixed location, called a fixed hot spot, for CC-NUMA, or in movable locations, called a movable hot spot, for COMA. The fixed hot spot remains physically on one processor as other processors try to read it with a single variable or a block of data. The movable hot spot will be migrated around the ring on demand of any processor which does a read with a single variable or a block of data. This data migration is a feature of the KSR-1 intended to enhance data locality.

To validate our analytical results, we first evaluated the hot spot effects through the uniform experiments under the following conditions: each processor in the two rings generated request misses at the fixed rate of 0.0005, where the locality rate was uniformly distributed, the read rate was fixed at 0.7, and the hot spot rate was changed from 0.1 to 0.6. The read/write miss latencies were averaged over 10,000 test cases and are listed in Table IV. The measurement results show that a write-miss in the COMA system is more sensitive to a hot spot than in the CC-NUMA, due to the overhead of more frequent data movement in the COMA system. This conclusion is consistent with the analytical results reported in Fig. 4 in terms of hot spot effects.

TABLE IV. MEASUREMENT RESULTS: EFFECTS OF CHANGING HOT SPOT RATE ON MISS LATENCIES (IN µs)

In practice, a hot spot is usually generated by only part of the processors in the system. Experiments reported in [16] simulate this type of memory access pattern, using 57 out of 64 processors in a KSR-1 system to generate the hot spot on another remote cache module, leaving 6 remote cool cache modules. The miss latencies of remote reads and remote writes of one word, one block, two blocks, and three blocks were respectively measured under an environment without any hot spots, an environment with the hot spot generated by cache references in a word unit, and an environment with the hot spot generated by cache references in a block unit. In a comparison between the fixed and movable hot spot experiments, the results in [16] indicate that a movable hot spot slightly increases the access delay to cool variables in the cool cache modules due to heavier traffic caused by more data movement.

Another experiment to verify our modeling work is to see if a hot spot can affect remote readings among processors that are not involved in the process of generating the hot spot. Again, two rings were used for the experiment. The difference between this experiment and the previous one is that all of the processors contributing towards generating the hot spot are on the same ring (the hot ring), while the processors on the other ring (the cool ring) are not involved in generating the hot spot. The hot spot is fixed to one processor within the hot ring. We varied the experiment by increasing the number of processors in the hot ring that generate the hot spot. We chose to use 50% and 97% of the available processors on the hot ring to generate the hot spot for these variations. Thus, for 50% usage of the processors on the hot ring, there were also 16 processors that were not involved in generating the hot spot. At the same time, there were also 16 processors to be used on the cool ring. These two sets of 16 processors were used to do remote readings of their counterpart processors, respectively. These remote readings were also unidirectional between the two rings during any run of the experiment. Thus, the 16 processors on the cool ring read the 16 processors on the hot ring during the hot spot activity in one run of the experiment. Then we reversed the remote reading and had the 16 processors on the hot ring read remotely the 16 processors on the cool ring during the hot spot activity in another run. The hot spot was generated by either reading a single variable or by reading a block of data. Thus, there were four different timings from the hot spot activity that were compared to the remote readings when there was no hot spot. For 97% usage of the processors on the hot ring, there was one available processor on the hot ring and one available processor on the cool ring. As shown in Tables V and VI for the different variations, the hot spot has very little effect on all the remote reads in this experiment. This additional experiment on the KSR-1 further strengthens our analysis that a hierarchical ring based architecture, such as the KSR machine, handles hot spot activity efficiently, as presented by the analytical models in Fig. 4.

TABLE V. READING MEASUREMENTS (IN µs) WHEN THE HOT SPOT IS GENERATED BY 50% OF PROCESSORS ON THE HOT RING ON THE KSR-1

TABLE VI. READING MEASUREMENTS (IN µs) WHEN THE HOT SPOT IS GENERATED BY 97% OF PROCESSORS ON THE HOT RING ON THE KSR-1

D.2. Effects on normal parallel computations in cool nodes

In this experiment we measured the effects that a hot spot may have on a matrix multiplication application. Again, there were two forms of hot spots, fixed for CC-NUMA and moveable for COMA.

We used 64 processors for our experiments. We increased the number of processors doing the matrix multiplication from 1, 2, 4, 8, to 16 processors. The matrices to be multiplied are A × B and the result is put in matrix C. The size of these matrices is 224 × 224. The computation resided on the same ring. The matrices A and C are distributed so that each processor has to access its neighbor to do its share of the computation. Each of the contributing processors has a local copy of the matrix B. When there is a hot spot present, the number of processors that contribute towards generating the hot spot is 48 of the 64 processors (75%). The ring where the hot spot resides is the hot ring (this is true only for the fixed hot spot implementation). The other ring is the cool ring. As in our other experiments, the hot spot was generated by either reading a single variable or reading a block of data.

Table VII presents the timing results of the fixed hot spot effects on the matrix multiplication. The first row gives timings of the matrix multiplication (MM) without a hot spot present. The second and third rows show the timings during the following setup: the processors doing the matrix multiplication all reside on the same ring (the cool ring) while the hot spot resides on the hot ring. The fourth and fifth rows show the opposite setup of the previous two rows; that is, all processors involved in the matrix multiplication reside on the hot ring (where the hot spot is located). The sixth and seventh rows show the same setup as the previous two rows with one difference: one of the processors that contribute towards the matrix multiplication is the processor where the hot spot resides. This is shown in the table with a one in parentheses (1) to signify it. As we predicted, there was virtually no difference in the timings during the presence of a fixed hot spot, with any of the implementations.

TABLE VII. READING MEASUREMENTS (IN SECS) OF MATRIX MULTIPLICATION OF SIZE 224 × 224 DURING THE PRESENCE OF A FIXED HOT SPOT

TABLE VIII. READING MEASUREMENTS (IN SECS) OF MATRIX MULTIPLICATION OF SIZE 224 × 224 DURING THE PRESENCE OF A MOVEABLE HOT SPOT

In Table VIII, we show the timings during the presence of a moveable hot spot. Row one shows the timings without the presence of the hot spot, while rows two and three show the timings with the moveable hot spot. Again, there was virtually no difference in the timings in the presence of the hot spot. This group of experiments further supports our analytical performance evaluation in Section V.A.1.
Each of the contributing processors has ia local copy of This group of experiments further support our analytical per- the matrix B. When there is a hot spot presenf, the number of formance evaluation in Section V.A. 1. processors that contribute towards generating the hot spot is 48 of the 64 processors (75%). The ring where hhe hot spot re- VII. CONCLUSION sides is the hot ring (this is true only for the fixed hot spot im- plementation). The other ring is the cool ring.; As in our other In this paper, our analytical models provide performance experiments, the hot spot was generated by either reading a differences between the CC-NUMA and the COMA on a hier- single variable or reading a block of data. i archical ring architecture. The model considers the intercon- Table VII presents the timing results of fited hot spot ef- nection network and the memory systems, which are two im- fects on the matrix multiplication. The first row gives timings portant factors affecting the shared-memory performance. We of the matrix multiplication (MM) without a hot spot present. also conducted experiments on the KSR-I, a hierarchical ring The second and third rows show the timings during the follow- COMA system to verify some of the analytical results. We ing setup. The processors doing the matrix multiplication all summarize performance evaluation results as follows: reside on the same ring (cool ring) while the! hot spot resides 1) In a hierarchical ring based architecture, a slotted ring on the hot ring. The fourth and fifth rows show the opposite set orders and delays remote data access requests. This up of the previous two rows. That is, all procebsors involved in structure naturally reduces network contention for pro- the matrix multiplication reside on the hot ring (where the hot grams with hot spots. Analysis indicates that in the pre- spot is located). The sixth and seventh row$ show the same sense of hot spots overall ring traffic will be moderately setup as in the previous two rows with one d’fference. One of increased but it will be distributed evenly in the ring net- Ill the processors that contribute towards the ~atrix multiplica- work. Analytical results have been verified by the ex- tion will be the processor where the hot spa! resides. This is periments on the KSR-1. When the hot spot memory ac- shown in the table with a one in parenthesis (11)to signify this. cess rate is increased, the write-miss latency in a COMA As we predicted, there was virtually no difference in the tim- system will become slightly bigger than that in the CC- ings during the presence of a fixed hot spot, with any of the NUMA. implementations. I 2) In the presence of a hot spot, COMA generates higher TABLE VII write-miss latency due to more frequent data migrations READINGMEASUREMENTS(INSECS)OFMATRIX~ULTIPLICATION and a larger number of invalidations. OFSIZE~~~X~~~DURINGTHEPRESENCEOFAFOXED HOTSPOT 3) Our analysis indicates that COMA would have slightly lower write-access latency by changing the degree of lo- calities of programs, but it would have the same read- miss latency in most cases as that in CC-NUMA. 4) Our analyses and experiments show that for applications with dominant read-misses at either high or low rates, COMA and CC-NUMA have nearly identical perform- ance. In contrast, the simulation results in [12] indicate the two systems have the nearly identical performance only for the applications with low miss rates. 
A main rea- son for the different performance results is related to the different evaluation testbeds. The constant latency as- sumption on the flat network architecture simulator causes longer delay for COMA, and is likely to make the network contention independent of memory access pat- terns of applications. In addition, a flat network architec- TABLE VIII : READINGMEASUREMENTS(INSECS)OFMATRIXMULTIPLICATION ture is more hot spot sensitive than a hierarchical net- OFSIZE~~~X~~~DURINGTHEPRESENCEOFAM~EABLE HOTSPOT work. In comparison, the KSR ring architecture allows the system to exploit hierarchical by moving referenced data to a local cache and satisfying data references from nearby copies of a data item when- ever possible. 5) We show that CC-NUMA handles coherence misses only slightly more efficiently than COMA in the ringe archi- tecture, while the simulation results in [12] indicate the difference is significant. Again, this is related to the dif- In Table VIII, we show the timings during the presence of a ferent network architectures used for the evaluations. moveable hot spot. Row one shows the timings without the presence of the hot spot while rows two and three shows the We conclude that both CC-NUMA and COMA systems timings with the moveable hot spot. Again,; there was virtually behave similarly on a hierarchical ring architecture. Two 1328 IEEE TRANSACTlONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO. 12, DECEMBER 1995 main reasons for this are that overhead of data migration in uses a queue to buffer the message packets. In order to calcu- COMA matches the saving from improving locality in CC- late the average waiting time for a request in the queue, we can NUMA; and that a slotted ring architecture balances the net- model the port as a M/G/l queue with packet arrive rate of work contention. Our study indicates that a hierarchical ring a coo~_IpOrtThe average queue length in the buffer may be calcu- network is a reasonable candidate for both COMA and CC- lated by Little’s law, / NUMA systems. We believe our performance results could &=a - (7.23) be further generalized to the two memory systems on a hier- cool_lport9mol_lport) archical network architecture. Although a CC-NUMA system Where ‘Icoo1Jpm-tf an average waiting time in the queue,may be may not automatically balance the load at the system level, it calculated by provides an option for a user to explicitly handle data local- ity for a possible performance improvement. Therefore, the %&1pon = (d,++e(il,+4). (7.24) decision of using or making CC-NUMA or COMA systems Combining (7.23) and (7.24), W becomes on a hierarchical ring should be determined by the pro- gramming and manufacturing cost of the systems. Finally, based on our study, we believe a fair and valuable compara- tive performance evaluation of COMA and CC-NUMA sys- tems should be conducted on each particular network archi- In the hot local ring, the interface port from the global ring tecture. to the hot local ring contributes traffic to the hot local ring at the rate of ;Ehor_lporrwhich is calculated as: APPENDIX A abtwbfl = (ii4 - i )Naah + 2~( I - ah)(l - alla. (7.26) MODELING NETWORK CONTENTION By (7.20), (7.22), (7.23), (7.24), and (7.26), &, the mean USING THE M/G/l QUEUE THEORY time to find an empty slot on the hot local ring, and qhotlport, There are N stations and one interface port in a local ring the mean queueing time in the interface port between the which are assumed to be independent of each other. 
Each in- global ring and the hot local ring before a message enters the dependent station contributes an equal amount of traffic to the hot ring, are derived as: ring at the same miss request rate of a which follows a Poisson process. The interface port also inputs traffic to the local ring - ~a,2(1+(~-1)a,+2jl-a,)(l-a,)) at the rate of &coo~~~port,which is expressed as the sum of the dh = l-~~,(1+(~-l)a,+2(l-a,)(l-al))' (7'27) probe packet arrival rate and the data packet arrival rate: and a cool~lport= Nash + 2~~1 - iz,xl - ada. (7.19) A general network utilization is defined by:

The interface port from the global ring to a cool local ring uses a queue to buffer the message packets. In order to calculate the average waiting time for a request in the queue, we can model the port as an M/G/1 queue with packet arrival rate \lambda_{cool\_lport}. The average queue length in the buffer may be calculated by Little's law,

L_{cool\_lport} = \lambda_{cool\_lport} \, q_{cool\_lport},    (7.23)

where q_{cool\_lport}, the average waiting time in the queue, may be calculated by

q_{cool\_lport} = (d_c + t_r)(L_{cool\_lport} + 1).    (7.24)

Combining (7.23) and (7.24), q_{cool\_lport} becomes

q_{cool\_lport} = (d_c + t_r) / (1 - \lambda_{cool\_lport}(d_c + t_r)).    (7.25)

In the hot local ring, the interface port from the global ring to the hot local ring contributes traffic to the hot local ring at the rate \lambda_{hot\_lport}, which is calculated as

\lambda_{hot\_lport} = (M - 1) N \lambda a_h + 2 N \lambda (1 - a_h)(1 - a_l).    (7.26)

By (7.20), (7.22), (7.23), (7.24), and (7.26), d_h, the mean time to find an empty slot on the hot local ring, and q_{hot\_lport}, the mean queueing time in the interface port between the global ring and the hot local ring before a message enters the hot ring, are derived as

d_h = N \lambda t_r^2 (1 + (M - 1) a_h + 2(1 - a_h)(1 - a_l)) / (1 - N \lambda t_r (1 + (M - 1) a_h + 2(1 - a_h)(1 - a_l))),    (7.27)

and

q_{hot\_lport} = (d_h + t_r) / (1 - \lambda_{hot\_lport}(d_h + t_r)).    (7.28)

In the global ring, there are M - 1 cool interface ports connected to the M - 1 cool local rings, and one hot interface port connected to the hot local ring. Each cool interface port can be modeled as an M/G/1 queue with packet arrival rate \lambda_{cool\_gport}:

\lambda_{cool\_gport} = N \lambda a_h + 2 N \lambda (1 - a_h)(1 - a_l).    (7.29)

The hot interface port can be modeled as an M/G/1 queue with packet arrival rate \lambda_{hot\_gport}:

\lambda_{hot\_gport} = (M - 1) N \lambda a_h + 2 N \lambda (1 - a_h)(1 - a_l).    (7.30)

Then, using the same method as above, we obtain the mean queueing time q_{cool\_gport} in the cool ports and the mean queueing time q_{hot\_gport} in the hot port in (7.31) and (7.32), respectively.
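The interface-port queueing step can be exercised in the same way. The sketch below combines Little's law (7.23) with the waiting-time relation (7.24) in the closed form of (7.25); the reading of (7.24) as q = (d + t_r)(L + 1), the reuse of the numbers from the previous sketch, and all other inputs are our assumptions for illustration, not the paper's code. By the same method, the function also applies to the hot local port (7.28) and, with the corresponding arrival rates, to the global-ring ports (7.31)-(7.32).

# Sketch of the interface-port queueing step in (7.23)-(7.25), under the
# reading that a packet waits (d + t_r) behind each queued packet plus its
# own slot acquisition. Parameter values are illustrative only.

def port_queueing_time(lam_port, d_slot, t_r):
    """Mean waiting time at an interface port with arrival rate lam_port,
    mean empty-slot delay d_slot on the destination ring, and slot time t_r."""
    service = d_slot + t_r
    if lam_port * service >= 1.0:
        raise ValueError("port saturated: lam_port * (d + t_r) must stay below 1")
    return service / (1.0 - lam_port * service)       # closed form as in (7.25)

if __name__ == "__main__":
    # Reuse the cool-ring numbers from the previous sketch (assumed values).
    lam_cool_lport, d_c, t_r = 0.0656, 0.1704, 1.0
    q_cool_lport = port_queueing_time(lam_cool_lport, d_c, t_r)
    print(f"q_cool_lport = {q_cool_lport:.4f}")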

APPENDIX B
MODELING READ/WRITE MISS LATENCIES IN THE COMA SYSTEM

A. Modeling the Search Time t_ls in State LS in Fig. 3

On a local ring, the probability for a slot to be non-empty is U_l. Since only two types of packets travel on the rings, probe packets and data packets, the probability for a slot to hold a probe packet is U_l / 2. In addition, the probability for a probe packet to be a hot write miss is determined by the hot-access ratio a_h and the fraction of write requests. Therefore, the probability for a slot to hold a hot write probe packet, denoted P_whl, is given in (7.33).

In state LS, let i_1 be the initial distance (number of slots) from the hot segment to the write miss request, which is a random value in [0, N]. We assume that the probability for the hot segment to be located in local cache j (j = 1, 2, ..., N) is 1/N, so the average of i_1 is N/2. The probability for the hot write request to catch up with the hot segment at distance i_1 is (1 - P_whl)^{i_1}, the probability that none of the other write requests writes the hot segment first. If the write miss has not found the hot segment at distance i_1, the hot segment must have been carried onto another local cache at distance i_2 from the hot write probe block. Variable i_2 is a random value in [0, N/2] with an average of N/4. The probability for the write request to catch up with the hot segment at the second local cache is therefore (1 - (1 - P_whl)^{i_1})(1 - P_whl)^{i_2}. The write request repeats this process until it obtains the hot-spot segment. Thus, the search process of a write request can be expressed as the state transition graph shown in Fig. 15.

Fig. 15. Searching process of a hot write on a local ring.

In Fig. 15, k = log N, p_i = (1 - P_whl)^{N/2^i}, q_i = 1 - p_i, and the time T_i can be approximately evaluated as i N t_r / 2^i, because the new owner of the hot spot can be any local cache except the old ones and the source node at each state i (i = 1, 2, ..., k + 1). The search time t_ls is then the expected time to traverse this chain, summed over the states i = 1, 2, ..., k + 1 (7.34).

B. Modeling the Invalidation Time t_inv-1 in State INV-1 in Fig. 3

When a hot write request changes from state LS to state INV-1, the invalidation process in INV-1 is determined by the ownership change of the owner of the hot data:

CASE 1. The owner's status of the data copy is changed from "Exclusive" to "Invalidation": the invalidation block carries the hot data directly to the source processor. Hence the invalidation time t_inv-1^1 is N t_r / 2.

CASE 2. The owner's status of the data copy is changed from "Nonexclusive" to "Invalidation": the invalidation packet must travel one circle of the global ring to invalidate the other copies and then return to the source processor. The invalidation time t_inv-1^2 consists of the following components:
1) the time traveling from a local ring to the global ring,
2) the time traveling one circle of the global ring,
3) the time traveling from the global ring to the local ring, and
4) the time traveling to the source processor in the local ring,
so t_inv-1^2 is (2N + M) t_r plus the associated queueing delays at the global-ring and local-ring interface ports.

Combining the two cases, the invalidation time t_inv-1 in state INV-1 is the sum of t_inv-1^1 and t_inv-1^2 weighted by the probabilities of the two ownership changes.
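The chain of Fig. 15 can also be evaluated numerically. The sketch below computes the expected local-ring search time implied by the description above, under our reading that the time T_i is paid in every visited state and that the request is sure to succeed by state k + 1; the paper's closed form is (7.34), and the value of P_whl used here is an assumed illustration.

# Sketch of the expected search time implied by Fig. 15, under assumptions
# stated in the text above. Not the paper's closed form (7.34).
import math

def expected_search_time(P_whl, N, t_r):
    """Expected time over states 1..k+1, with catch probability p_i and
    per-state time T_i as defined for Fig. 15."""
    k = int(math.log2(N))
    expected = 0.0
    reach_prob = 1.0                          # probability of reaching state i
    for i in range(1, k + 2):                 # states 1 .. k+1
        T_i = i * N * t_r / 2 ** i            # approximate time spent in state i
        expected += reach_prob * T_i
        p_i = (1.0 - P_whl) ** (N / 2 ** i)   # catch the hot segment at state i
        reach_prob *= (1.0 - p_i)             # q_i: miss and move to the next state
    return expected

if __name__ == "__main__":
    print(f"t_ls approx {expected_search_time(P_whl=0.05, N=8, t_r=1.0):.3f}")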

C. Modeling the Global Search Time t_gs in State GS in Fig. 3

Similar to (7.33), the probability for a slot on the global ring to hold a hot write probe block, denoted P_whg, can be derived in the same way. The time t_gs consists of the following three components:

1) t_tra, the time traveling from a local ring to the global ring, which is the time to reach the interface port on the local ring (on average N/2 slots) plus the queueing time at that port.

2) t_sea, the time searching the directory along the global ring. Initially, we can take M/2 (half the number of slots on the global ring) as the average distance from the source processor to the hot global directory, which connects to the hot local ring. When the request, denoted r1, reaches that directory, the directory may have become cool, which means another write request entered the hot ring before this one. The request r1 must then continue searching along the global ring, wait for another global directory to become hot, and repeat the above procedure until it catches up with the hot global directory. Each time a write request runs into the hot global directory, the hot directory becomes cool, and there is no hot directory until that write request carries the hot segment into its source local ring, which makes the global directory on that local ring hot. The delay, denoted d, between generating two hot directories is therefore the sum of the time for the write request to enter the hot local ring, the time to search for the hot segment, the time to carry the hot segment to the global ring, and the time to travel to the interface port connected to the source local ring; this delay is expressed in (7.35). The probability for the request r1 to catch up with the hot global directory on any one attempt is (1 - P_whg)^{M/2}, so the average searching time t_sea is given in (7.37).

3) t_en, the time for the request to enter the hot local ring and search the hot segment.

Combining the above three components yields the global search time t_gs in (7.38).

D. Modeling the Invalidation Time t_inv-2 in State INV-2 in Fig. 3

In state INV-2, the invalidation process has the same two alternatives as in INV-1. Using similar analysis techniques, the invalidation time in INV-2 is expressed in (7.39).
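The searching time t_sea in Section C above has a simple geometric-retry structure: one initial pass of about M/2 global-ring slots, plus one regeneration delay d for every failed attempt. The sketch below evaluates that structure numerically; the closed form coded here is our reading of (7.37), and M, t_r, d, and P_whg are assumed values used only for illustration.

# Sketch of the geometric-retry structure behind t_sea (our reading of (7.37)).
# All parameter values are assumptions, not data from the paper.

def average_directory_search_time(P_whg, M, t_r, d):
    p_catch = (1.0 - P_whg) ** (M / 2.0)       # catch the hot directory in one pass
    expected_retries = (1.0 - p_catch) / p_catch
    return M * t_r / 2.0 + d * expected_retries

if __name__ == "__main__":
    # Assumed values: 4 local rings (M = 4), slot time 1, regeneration delay d = 20.
    t_sea = average_directory_search_time(P_whg=0.02, M=4, t_r=1.0, d=20.0)
    print(f"t_sea approx {t_sea:.3f}")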

APPENDIX C
MATHEMATICAL MODELS FOR BANDWIDTH ANALYSIS

Because the queueing time must be larger than zero, by (7.25), (7.28), (7.31), and (7.32) we can derive the upper bound of the request rate in a CC-NUMA system, denoted \lambda_{cc-numa}, given in (7.40). By (4.10) and (4.12), the upper bound of the request rate in a COMA system, denoted \lambda_{coma}, is given in (7.41) as the minimum of two terms. Formulas (7.40) and (7.41) are the basic models for the bandwidth analysis presented in Section V.B.
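The bandwidth bounds in (7.40) and (7.41) follow from requiring every modeled queue to stay below saturation. The sketch below illustrates that idea numerically by bisecting for the largest per-processor request rate that keeps the cool-local-ring equations, as reconstructed in Appendix A, stable; it checks only the cool local ring, whereas the paper's closed forms also cover the hot local ring and the global ring, and all parameter values are assumptions for the example.

# Numerical sketch of the stability-based bound behind (7.40)-(7.41).
# Only the cool-local-ring equations are checked; parameters are assumed.

def stable(lam, N, t_r, a_h, a_l):
    lam_port = N * lam * a_h + 2 * N * lam * (1 - a_h) * (1 - a_l)  # (7.19)
    U_l = t_r * (lam_port + N * lam)                                # (7.21)
    if U_l >= 1.0:
        return False
    d_c = t_r * U_l / (1.0 - U_l)                                   # (7.22)
    return lam_port * (d_c + t_r) < 1.0                             # port stays stable

def max_request_rate(N, t_r, a_h, a_l, lo=0.0, hi=1.0, iters=60):
    """Bisect for the largest per-processor request rate that stays stable."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if stable(mid, N, t_r, a_h, a_l) else (lo, mid)
    return lo

if __name__ == "__main__":
    # Assumed parameters, as in the earlier sketches.
    print(f"upper bound on lambda approx {max_request_rate(8, 1.0, 0.1, 0.6):.5f}")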

ACKNOWLEDGMENTS

We are grateful to the anonymous referees for providing important suggestions and comments to improve the technical quality, clarity, and readability of the paper. We appreciate our colleague Neal Wagner for carefully reading the paper and contributing many helpful comments, and Robert Castañeda for contributing some hot spot experiments on the KSR-1. Finally, many thanks go to Fredrik Dahlgren at Lund University, Sweden, for the discussions and his many useful technical comments on this work.

This work is supported in part by the National Science Foundation under research grants CCR-9102854 and CCR-9400719, by the U.S. Air Force Office of Scientific Research under grant AFOSR-95-1-0215, by a grant from Cray Research, and by a fellowship from the Southwestern Bell Foundation. Part of the experiments were conducted on the KSR-1 machines at Cornell University and at the University of Washington.

REFERENCES

[1] BBN Advanced Computers Inc., Inside the GP1000 and the TC2000, 1989.
[2] L.A. Barroso and M. Dubois, "The performance of cache-coherent ring-based multiprocessors," Proc. 20th Int'l Symp. Computer Architecture, pp. 268-277, May 1993.
[3] L.N. Bhuyan, D. Ghosal, and Q. Yang, "Approximate analysis of single and multiple ring networks," IEEE Trans. Computers, vol. 38, no. 7, pp. 1,027-1,040, 1989.
[4] K. Farkas, Z. Vranesic, and M. Stumm, "Cache consistency in hierarchical ring-based multiprocessors," Proc. Supercomputing '92, pp. 348-357, Nov. 1992.
[5] E. Hagersten, A. Landin, and S. Haridi, "DDM - A cache-only memory architecture," Computer, vol. 25, no. 9, pp. 44-54, Sept. 1992.
[6] Kendall Square Research, KSR1 Technology Background, 1992.
[7] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, "The DASH prototype: Logic overhead and performance," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 1, pp. 41-61, 1993.
[8] W.M. Loucks, V.C. Hamacher, B. Preiss, and L. Wang, "Short-packet transfer performance in local area ring networks," IEEE Trans. Computers, vol. 34, no. 11, pp. 1,004-1,014, 1985.
[9] S.K. Reinhardt, M.D. Hill, and J.R. Larus, "The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers," Proc. 1993 ACM SIGMETRICS Conf., pp. 48-60, May 1993.
[10] J.P. Singh, T. Joe, A. Gupta, and J. Hennessy, "An empirical comparison of the KSR and DASH multiprocessors," Proc. Supercomputing '93, pp. 214-225, Nov. 1993.
[11] P. Stenström, T. Joe, and A. Gupta, "Comparative performance evaluation of cache-coherent NUMA and COMA architectures," Proc. 19th Int'l Symp. Computer Architecture, pp. 80-91, 1992.
[12] Z.G. Vranesic, M. Stumm, D.M. Lewis, and R. White, "Hector: A hierarchically structured shared-memory multiprocessor," Computer, vol. 24, no. 1, pp. 72-79, 1991.
[13] X. Zhang, R. Castañeda, and W.E. Chan, "Spin-lock synchronization on the Butterfly and KSR-1," IEEE Parallel and Distributed Technology, vol. 2, no. 1, pp. 51-63, Spring 1994.
[14] X. Zhang, K. He, and G. Butchee, "Execution behavior analysis and performance improvement in shared-memory architectures," Proc. Fifth IEEE Symp. Parallel and Distributed Processing, pp. 23-26, Dec. 1993.
[15] X. Zhang, Y. Yan, and R. Castañeda, "Comparative performance evaluation of hot spot contention between MIN-based and ring-based shared-memory architectures," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 8, pp. 872-886, Aug. 1995.
[16] X. Zhang, Y. Yan, and K. He, "Latency metric: An experimental method for measuring and evaluating program and architecture scalability," J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 392-410, 1994.

Xiaodong Zhang received the BS degree in electrical engineering from Beijing Polytechnic University, China, in 1982, and the MS and PhD degrees in computer science from the University of Colorado at Boulder in 1985 and 1989, respectively.
He is an associate professor of computer science and director of the High Performance Computing and Software Laboratory at the University of Texas at San Antonio. He has held research and visiting faculty positions at Rice University and Texas A&M University. His research interests are parallel and distributed computing, parallel architecture and system performance evaluation, and scientific computing.
Dr. Zhang has served on the program committees of several conferences and is the program chair of the Fourth International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'96). He currently serves on the editorial board of Parallel Computing and is an ACM National Lecturer.

Yong Yan received the BS and MS degrees in computer science from Huazhong University of Science and Technology, Wuhan, China, in 1984 and 1987, respectively. He is currently a PhD student in computer science at the University of Texas at San Antonio. He has been a faculty member at Huazhong University of Science and Technology since 1987, and was a visiting scholar at the High Performance Computing and Software Laboratory at the University of Texas at San Antonio from 1993 to 1995. Since 1987, he has published extensively in the areas of parallel and distributed computing, performance evaluation, operating systems, and algorithm analysis.