Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures
1316 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO. 12, DECEMBER 1995

Xiaodong Zhang, Senior Member, IEEE, and Yong Yan

Manuscript received Nov. 16, 1993; accepted Mar. 22, 1995. The authors are with the High Performance Computing and Software Laboratory, University of Texas at San Antonio, San Antonio, TX 78249; e-mail: [email protected], [email protected]. To order reprints of this article, e-mail: [email protected], and reference IEEECS Log Number D95063. 1045-9219/95$04.00 © 1995 IEEE

Abstract: Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory/cache management systems. Cache Coherence Nonuniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models for the two memory systems for comparative evaluation are presented. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the work load well. However, the overhead of frequent data movement may match the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.

Index Terms: Cache coherence, CC-NUMA, COMA, performance modeling and measurements, slotted rings, shared-memory, the KSR-1.

I. INTRODUCTION

LARGE scale shared-memory architectures provide a shared address space supported by physically distributed memory. The strength of such systems comes from combining the scalability of network-based architectures with the convenience of the shared-memory programming model. Computing performance on such systems is affected by two important architecture and system design factors: the interconnection network and the memory system structure. The choice of interconnection networks to link processor nodes to cache/memory modules can make nonuniform memory access (NUMA) times vary drastically, depending upon the particular access patterns involved. A hierarchical ring structure is an interesting base on which to build large scale shared-memory multiprocessors. In large NUMA architectures, the memory system organization also affects communication latency. With respect to the kinds of memory organizations utilized, NUMA memory systems can be classified into the following three types in terms of data migration and coherence:

• Non-CC-NUMA stands for non-cache-coherent NUMA. This type of architecture either supports no local caches at all (e.g., the BBN GP1000 [1]), or provides a local cache that disallows caching of shared data in order to avoid the cache coherence problem (e.g., the BBN TC2000 [1]).

• CC-NUMA stands for cache-coherent NUMA, where each processor node consists of a processor with an associated cache, and a designated portion of the global shared memory. Cache coherence for a large scale shared-memory system is usually maintained by a directory-based protocol. Examples of such a system are the Stanford DASH [8], and the University of Toronto's Hector [13].

• COMA stands for cache-only memory architecture. Like CC-NUMA, each processor node has a processor, a cache, and a designated portion of the global shared memory. The difference, however, is that the memory associated with each node is augmented to act as a large cache. Consistency among cache blocks in the system is maintained using a cache coherence protocol. A COMA system allows transparent migration and replication of data items to the nodes where they are referenced. Example systems are the Kendall Square Research's KSR-1 [7], and the Swedish Institute of Computer Science's Data Diffusion Machine (DDM) [5].

A comparative performance evaluation between CC-NUMA and COMA models has been conducted by using simulations in [12], where dynamic network contention is not a major consideration. In addition, only 16 processors were simulated on relatively small problem sizes. However, both CC-NUMA and COMA systems are targeted at large scale architectures on large problem sizes. Another experimental measurement has been recently conducted to compare performance of the DASH (CC-NUMA on cluster networks) and the KSR-1 (COMA on hierarchical rings) [11]. We believe that further work needs to be done in order to more precisely and completely provide insight into the overhead effects inherent in the two memory systems. First, a comparative evaluation needs to carefully take into consideration the network contention, which varies between the two memory system designs and among different interconnection network architectures. Second, a comparative evaluation between the two memory systems should be done based on a particular network structure, because different network structures can make the two memory systems perform and behave differently. COMA and CC-NUMA mainly differ in the requirements of the network and the difference in data locality, migration, and replication. Previous studies combine the two factors in the evaluation, which limits the possibility to isolate the effects of either of them. Our study using a particular network is concerned with the isolated effects of the memory systems. Third, a comparative evaluation should provide information to identify and distinguish the effects on computing performance caused by the memory systems and by the interconnection network structure. Finally, besides simulation and models, a comparative evaluation should also present experimental results on real CC-NUMA and COMA architectures to provide practical observations of program executions. This paper differs from the studies cited in [11] and [12] with respect to the four points mentioned above. In addition, this modeling and evaluation work complements our experimental and application programming work on various network-based shared memory multiprocessors in the High Performance Computing and Software Laboratory. (See e.g., [14], [15], and [17]).

A hierarchical ring structure is the particular interconnection network in this comparative performance evaluation study of both the CC-NUMA and COMA systems. The organization of the rest of the paper is as follows. In Section II, we begin with detailed descriptions of the modeled ring network and cache coherence protocols of the CC-NUMA and COMA. The performance parameters of the analysis are also presented in Section II. The analytical models of the two memory systems on the hierarchical ring network are presented in Sections III and IV. Section V gives the comparative performance evaluation between the two memory systems in terms of data migration, load distributions, network contention and communication bandwidth. Section VI reports our experiments of cache coherence and data access patterns under the two memory systems on the KSR-1 to verify and support the analytical and simulation results presented in Section V. Finally, we give a summary and conclusions in Section VII.

II. THE DEFINITIONS FOR THE HIERARCHICAL RING BASED CC-NUMA AND COMA

1) The architecture consists of a global ring and M local rings. Each local ring is connected to the global ring through an inter-ring port with a pair of buffers. The buffers are used to temporarily store and forward packets passing the port. The global ring has M equally sized slots connecting to the M local rings. Each local ring has N equally sized slots, each of which connects to a station module.

2) Each station module consists of one main memory module, one subcache and one processor. In the CC-NUMA system, the main memory module is a home addressed memory which occupies a unique contiguous portion of a flat, global (physical) address space. In the COMA system, the main memory module is a big-capacity cache which results in dynamic mapping between the context address and the system logic address through segment translation tables.

3) Both the global ring and local rings are rotated constantly. The rotation period between each slot is defined as t_r. A processor node in a local ring ready to transmit a message waits until an empty slot is available. The response and a request, such as a read/write, will be rotated back to the requesting processor.

4) Each inter-ring port is able to determine the path of the message packets passed through it. In the COMA system, each port keeps a directory to record the mapping relationships among the cache memory modules in the corresponding local ring. In CC-NUMA, each port can determine the destination of the message packets according to the home address carried by the message packet.

In order to clarify and simplify