NEW DIRECTIONS

Scalable Shared-Memory Multiprocessor Architectures

New coherence schemes scale beyond single-bus-based, shared-memory architectures. This report describes three research efforts: one multiple-bus-based and two directory-based schemes.

Introduction

Shreekant Thakkar, Sequent Computer Systems
Michel Dubois, University of Southern California
Anthony T. Laundrie and Gurindar S. Sohi, University of Wisconsin-Madison

COMPUTER, June 1990

There are two forms of shared-memory multiprocessor architectures: bus-based systems such as Sequent’s Symmetry, Encore’s Multimax, SGI’s Iris, and Stardent’s Titan; and switching network-based systems such as BBN’s Butterfly, IBM’s RP3, and New York University’s Ultra. Because of the efficiency and ease of the shared-memory programming model, these machines are more popular for parallel programming than distributed multiprocessors such as NCube or Intel’s iPSC. They also excel in multiprogramming throughput-oriented environments. Although the number of processors on a single-bus-based shared-memory multiprocessor is limited by the bus bandwidth, large caches with efficient coherence and bus protocols allow scaling to a moderate number of processors (for example, 30 on Sequent’s Symmetry).

Bus-based shared-memory systems use the bus as a broadcast medium to maintain coherency; all the processors “snoop” on the bus to maintain coherent information in the caches. The protocols require the data in other caches to be invalidated or updated on a write by a processor if multiple copies of the modified data exist. The bus provides free broadcast capability, but this feature also limits its bandwidth.

New coherence schemes that scale beyond single-bus-based, shared-memory architectures are being proposed now that the cost of high-performance interconnections is dropping. Current research efforts include directory-based schemes and multiple-bus-based schemes.

Directory-based schemes. Directory-based schemes can be classified as centralized or distributed. Both categories support local caches to improve processor performance and reduce traffic in the interconnection. The following “coherence properties” form the basis for most of these schemes:

- Sharing readers. Identical copies of a block of data may be present in several caches. These caches are called readers.
- Exclusive owners. Only one cache at a time may have permission to write to a block of data. This cache is called the owner.
- Reader invalidates. Before a cache can gain permission to write to a block (that is, become the owner), all readers must be notified to invalidate their copies.
- Accounting. For each block address, the identity of all readers is somehow stored in a memory-resident directory.

Presence flags. One cache-coherence scheme, proposed by Censier and Feautrier¹ in 1978, uses presence flags. In each memory module, every data block is appended with a single state bit followed by one presence flag per cache. Each presence flag is set whenever the corresponding cache reads the block. As a result, invalidation messages need only be sent to those caches whose presence bits are set.

In the presence-flag solution, unfortunately, the size of each memory tag grows linearly with the number of caches, making the scheme unscalable. The tag will be at least N bits, where N is the number of caches in the system. There may also be other bits stored in the directory to indicate line state.

Variations of this scheme use a broadcast mechanism to reduce the number of bits required in the directories. However, this introduces extra traffic in the interconnection and may degrade system performance.

In the central-directory scheme with presence flags, a cache miss is serviced by checking the directory to see if the block is dirty in another cache. When necessary, consistency is maintained by copying the dirty block back to the memory before supplying the data. The reply is thus serialized through the main memory. To ensure correct operation, the directory controller must lock the memory line until the write-back signal is received from the cache with the dirty block. Write misses generate additional invalidate messages for all caches that have clean copies of the data. Invalidate-acknowledgments must be received before a reply can be sent to the requesting cache. Note that the relevant line is locked while this is being done. Requests that arrive while a line is locked must be either buffered at the directory or bounced back to the source to be reissued at a later time. This may cause a loss in performance.

The performance of presence-flag schemes is limited by conflicts in accessing the main memory directory. The main memory and the tags can be distributed to improve the main memory’s performance. However, the serialization of responses through the main memory and the locking of lines by the directory controller affect the performance of these cache-coherence schemes.

B pointers. Another alternative, being pursued by Agarwal et al.² and by Weber and Gupta,³ requires each block to have a smaller array of B pointers instead of the large array of presence bits. Some studies of applications suggest that, because of the parallel programming model used, a low value for B (perhaps 1 or 2) might be sufficient. Since each shared data structure is protected by a synchronization variable (lock), only the lock, not the shared data structure, is heavily contested in medium- to large-grain parallel applications. Thus, the shared data only moves from one cache to another during computation, and only synchronization can cause invalidation in multiple caches. If B is small and the data is heavily shared, processors can thrash on the heavily shared data blocks. If B is large, the memory requirements are worse than for the presence-flag method.

Linked lists. Note that the presence-flag solution uses a low-level data structure (Boolean flags) to store the readers of a block, while the B-pointers method saves them in a higher level structure (a fixed array). Perhaps more-flexible data structures, such as linked lists, can be applied to the problem of cache coherence. Distributing the directory updates among multiple processors, rather than a central directory, could reduce memory bottlenecks in large multiprocessor systems.

A few groups have proposed cache-coherence protocols based on a linked list of caches. Adding a cache to (or removing it from) the shared list proceeds in a manner similar to software linked-list modification. Groups using this approach include the IEEE Scalable Coherent Interface (SCI) standard project, a group at the University of Wisconsin-Madison, and Digital Equipment Corporation in its work with Stanford University’s Knowledge Systems Laboratory. The SCI work, which is the most defined of the three, is covered in the report beginning on p. 74. Some features of the Stanford Distributed Directory (SDD) Protocol are outlined on pp. 78-80.

Bus-based schemes. Bus-based systems provide uniform memory access to all processors. This memory organization allows a simpler programming model, making it easier to develop new parallel applications or to move existing applications from a uniprocessor to a parallel system. Since the bus transfers all data within the system, it is the key to performance, and a potential bottleneck, in all bus-based systems.

Several architectural variations of bus-based systems have been proposed. Below, we describe two types: multiple-bus and hierarchical architectures. (See Figure 1.) One of these, the Aquarius multiple-bus multiprocessor architecture, is described in more detail in the report beginning on p. 80.

Figure 1. Extension of single-bus architectures to multiple-bus architectures: (a) one dimension; (b) two dimensions; (c) hierarchy. Legend: P = Processor; C = Cache; M = Memory; I/O = Input/output; C/I = Cluster interface; N/I = Network interface.

Multiple-bus systems. [...] Multicube. However, it differs in the cache-coherence mechanism and in the distribution of memory modules. The system uses a combination of a snoopy cache-coherence protocol and a directory-based coherence protocol to provide coherency in the system. The memory is distributed per node, unlike the Wisconsin Multicube.

Hierarchical systems. In hierarchical systems, clusters of processors are connected by a bus or an interconnection network. (See Figure 1c.) In the three examples below, the intercluster connection is a bus, and the processors within a cluster are also connected via the bus. This is similar to a single-bus system.

Wilson uses a simulation and an analytical model to examine the performance of a hierarchically connected multiple-bus design. The design explores a uniform memory architecture with global memory at the highest level. It uses hierarchical caches to reduce bus use at various levels and to expand cache-coherency techniques beyond those of a single-bus system. The performance study showed that degradation resulting from cache coher- [...]

Acknowledgment

We would like to thank all the authors of the special reports that follow for their assistance and for their review of this introduction.

References

1. L.M. Censier and P. Feautrier, “A New Solution to Coherence Problems in Multicache Systems,” IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1,112-1,118.
2. A. Agarwal et al., “An Evaluation of Directory Schemes for Cache Coherence,” Proc. 15th Int’l Symp. Computer Architecture, Computer Society Press, Los Alamitos, Calif., Order No. 861 (microfiche only), 1988, pp. 280-289.
3. W.D. Weber and A. Gupta, “Analysis of Cache Invalidation Patterns in Microprocessors,” Proc. ASPLOS III, Computer Society Press, Los Alamitos, Calif., Order No. 1936, 1989, pp. 243-256.
4. S.J. Eggers and R.H. Katz, “A Characterization of Sharing in Parallel Programs and its Application to Coherence Protocol [...]
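
Illustrative sketch. The central-directory scheme with presence flags described in this introduction can be made concrete with a small model. The following Python sketch is only an illustration under stated assumptions: the class names, method names, and message labels (write_back, invalidate, data_reply) are hypothetical and do not come from Censier and Feautrier or from any of the systems named here. It also assumes the requester does not already hold the block, and it leaves out line locking and invalidate-acknowledgments.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Directory tag for one memory block (hypothetical layout)."""
    present: list        # one Boolean presence flag per cache
    dirty: bool = False  # the single state bit: block is modified in one cache


class PresenceFlagDirectory:
    """Toy central directory; returns the messages it would send."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.entries = {}  # block address -> DirectoryEntry

    def tag_bits(self):
        # One state bit plus N presence flags: grows linearly with N.
        return 1 + self.num_caches

    def _entry(self, block):
        if block not in self.entries:
            self.entries[block] = DirectoryEntry([False] * self.num_caches)
        return self.entries[block]

    def read_miss(self, cache_id, block):
        entry = self._entry(block)
        messages = []
        if entry.dirty:
            # Copy the dirty block back to memory before supplying the data.
            owner = entry.present.index(True)
            messages.append(("write_back", owner))
            entry.dirty = False
        entry.present[cache_id] = True  # the requester becomes a reader
        messages.append(("data_reply", cache_id))
        return messages

    def write_miss(self, cache_id, block):
        entry = self._entry(block)
        messages = []
        if entry.dirty:
            owner = entry.present.index(True)
            messages.append(("write_back", owner))
        # Invalidate only the caches whose presence flags are set.
        for other, flag in enumerate(entry.present):
            if flag and other != cache_id:
                messages.append(("invalidate", other))
                entry.present[other] = False
        entry.present[cache_id] = True  # the requester becomes the owner
        entry.dirty = True
        messages.append(("data_reply", cache_id))
        return messages
```

In this model, two read misses followed by a write miss from a third cache produce invalidate messages for exactly the two readers, and a later read miss by another cache forces a write-back from the owner before the data reply. A real directory controller would also lock the line while these messages are outstanding, as the text describes.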