Sparcle: An Evolutionary Processor Design for Large-scale Multiprocessors

Working jointly at MIT, LSI Logic, and Sun Microsystems, we created the Sparcle processing chip by evolving an existing RISC design toward a processor suited for large-scale multiprocessors. This chip supports three multiprocessor mechanisms: fast context switching; fast, user-level message handling; and fine-grain synchronization. The Sparcle effort demonstrates that RISC microprocessors coupled with a communications and memory management unit do not require major architectural changes to support multiprocessing efficiently.

Anant Agarwal, John Kubiatowicz, David Kranz, Beng-Hong Lim, and Donald Yeung
Massachusetts Institute of Technology

Godfrey D'Souza
LSI Logic

Mike Parkin
Sun Microsystems

The Sparcle chip clocks at no more than 40 MHz, has no more than 200,000 transistors, does not use the latest technologies, and dissipates a paltry 2 watts. It has no on-chip cache, no fancy pads, and only 207 pins. It does not even support multiple-instruction issue. Then why do we think this chip is interesting?

Sparcle is a processor chip designed to support large-scale multiprocessing. We designed its mechanisms and interfaces to provide fast message handling, latency tolerance, and fine-grain synchronization. Specifically, Sparcle implements:

• Mechanisms to tolerate memory and communication latencies, as well as synchronization latencies. Long latencies are inevitable in large-scale multiprocessors, but current processor designs are ill-suited to handle such latencies.
• Mechanisms to support fine-grain synchronization. Modern microprocessors pay scant attention to this aspect of multiprocessing, usually providing just a test-and-set instruction, and in some cases, not even that.
• Mechanisms to initiate communication actions to remote processors across the communications network, and to respond rapidly to asynchronous events such as synchronization faults and message arrivals. Current microprocessors do not support a clean communications interface between the processor and the communications network. Furthermore, traps and other asynchronous event handlers are inefficient on many current microprocessors, often requiring tens of cycles to reach the appropriate trap service routine.

The impetus for the Sparcle chip project was our belief that we could implement a processor that provides interfaces for the above mechanisms by making small modifications to an existing microprocessor. Indeed, we derived Sparcle from Sparc1 (scalable programmable architecture from Sun Microsystems), and we integrated it into Alewife,2 a large-scale multiprocessor system being developed at MIT.

Sparcle tolerates long communication and synchronization latencies by rapidly switching to other threads of computation. The current implementation of Sparcle can switch to another thread of computation in 14 cycles. Slightly more aggressive modifications could reduce this number to four cycles. Sparcle switches to another thread when a cache miss that requires service over the communications network occurs, or when a synchronization fault occurs. Such a processor requires a pipelined memory and communications system.


In our system, a separate communications and memory management chip (CMMU) interfaces to Sparcle to provide the desired pipelined system interface. Our system also provides a prefetch instruction. For a description of the modifications to a modern RISC microprocessor needed to achieve fast context switching, see our discussion under architecture and implementation of Sparcle later in the article.

Sparcle supports fine-grain data-level synchronization through the use of full/empty bits, as in the HEP.4 With full/empty bits, a lock and access of the data word protected by the lock can be probed in one operation. If the synchronization attempt fails, the synchronization trap invokes a fault handler. In our system, the external communications chip detects synchronization faults and alerts Sparcle by raising a trap line. The system then handles the fault in software trap code.

Finally, Sparcle supports a highly streamlined network interface with the ability to launch and receive interconnection network messages. While this design implements the communications interface with the interconnection network in a separate chip, the CMMU, future implementations can integrate this functionality into the processor chip. Sparcle supports rapid response to asynchronous events by streamlining Sparc's trap interface and by supporting rapid dispatch to the appropriate trap handler. To achieve this, Sparcle provides two special trap lines for the most common types of events: cache misses to remote nodes and synchronization faults. Sparcle uses a third trap line for all other types of events. Also, this chip has an increased number of instructions in each trap dispatch entry so that vital trap code can be put in line at the dispatch points.

Sparcle's design was unusual in that it did not involve developing a completely new architecture. Rather, we implemented Sparcle with the help of LSI Logic and Sun Microsystems by slightly modifying the existing Sparc architecture. At MIT, we received working Sparcle chips from LSI Logic on March 11, 1992. These chips have already undergone complete functional testing. We are currently continuing to implement the Alewife multiprocessor so that we can thoroughly evaluate our ideas and subject the Sparcle chips to full-speed testing. Figure 1 shows an Alewife node with the Sparcle chip.

Figure 1. An Alewife node.

Mechanisms for multiprocessors
By supporting the widely used shared-memory and message-passing programming models, Sparcle eases the programmer's job and enhances parallel program performance. We have implemented programming constructs in parallel versions of Lisp and C that use these features. Sparcle's features fall into three areas, the first two of which support the shared-memory model:

• Fine-grain computation. Efficient support of fine-grain expression of parallelism and synchronization can enhance performance by increasing parallelism and reducing communication overhead. This enhancement relieves the programmer of undue effort in partitioning data and control flow into coarser chunks to increase performance.
• Memory latency tolerance. Context switching and data prefetching can reduce communication overhead introduced by network delays. For shared-memory programs, the switch must be very fast and occur automatically when a remote cache miss occurs.
• Efficient message interface. The ability to send and receive messages is needed to support message-passing programs. Such interfacing can also improve the performance of shared-memory programs in some common situations.

Before we can examine the implementation of these features in Sparcle, we need to consider each of these areas in turn and discuss why they are useful for large-scale multiprocessing.

Fine-grain computation. As multiprocessors become larger, the grain size of parallel computations decreases to satisfy higher parallelism requirements. Computational grain size refers to the amount of computation between synchronization operations. Given a fixed problem size, the overhead of parallel and synchronization operations limits the ability to use a larger number of processors to speed up a program. Systems supporting fine-grain parallelism and synchronization attempt to minimize this overhead so that parallel programs can achieve better performance.

The challenge of supporting fine-grain computation is in implementing efficient parallelism and synchronization constructs without incurring extensive hardware cost, and without reducing coarse-grain performance. By taking an evolutionary approach in designing Sparcle, we have attempted to satisfy these requirements.



We can express fine-grain parallelism and synchronization at the data level (data-level parallelism) or at the thread level (control-level parallelism).

Data-level parallelism. Data-level parallelism and synchronization allow the program to synchronize at the level of the smallest possible unit, a memory word. At the programming language level, we provide parallel do-loops to express data-level parallelism, and J-structure and L-structure arrays to express fine-grain data-level synchronization.

Inspired by the I-structures of Arvind, Nikhil, and Pingali,5 the J-structure is a data structure for producer-consumer style synchronization. It is like an array, but each element has an additional state: full or empty. The initial state of a J-structure element is empty. A reader of an element waits until the element's state is full before returning the value. A writer of an element writes a value, sets the state to full, and signals waiting readers to proceed. A write to a full element signals an error. For efficient memory allocation and cache performance, J-structure elements can be reset to an empty state. Figure 2 illustrates how J-structures can be used for data-level synchronization.

Figure 2. J-structures.

In the example of Figure 2, producer P is sequentially filling in the elements of a J-structure. Consumer C1 reads an element that is already filled and immediately gets its value. Consumer C2 reads an empty element and thus has to wait for P to write the element. Since we are synchronizing at the level of individual elements, both C1 and C2 can access the elements of the J-structure without waiting for P to completely fill all the elements of the J-structure.

L-structures are similar to J-structures but support three operations: a locking read, a nonlocking read, and a synchronizing write. A locking read waits until the element is full before emptying it (that is, locking it) and returning the value. A nonlocking read operation also waits until the element is full, but returns the value without emptying the element. A synchronizing write stores a value to an empty element and sets it to full, releasing any waiters. An L-structure thus allows mutually exclusive access to each of its elements and allows multiple nonlocking readers.

Sparcle supports J- and L-structures, as well as other types of fine-grain data-level synchronization, with per-word full/empty bits in memory.3 Sparcle provides new load/store instructions that interact with the full/empty bits. The design also includes an extra synchronous trap line to deliver the full/empty trap. This extra line allows Sparcle to immediately identify the trap.
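To make these semantics concrete, here is a minimal C sketch of one J-structure element, assuming a software-emulated full/empty bit and busy-waiting. The names and the spin loop are ours; on Sparcle the state bit lives with the memory word itself, and a failed access traps (so the thread can be queued) rather than spins.

#include <stdbool.h>

/* Software model of one J-structure element: a value plus its
 * full/empty state. On Sparcle the bit is a per-word memory bit
 * tested directly by the new load/store instructions. */
typedef struct {
    volatile bool full;     /* full/empty bit; initially empty */
    volatile int  value;
} jelem_t;

/* Read: wait until the element is full, then return the value. */
int jstruct_read(jelem_t *e) {
    while (!e->full)
        ;                   /* consumer C2 waits here for producer P */
    return e->value;
}

/* Write: store the value, set the state to full, and thereby
 * release any waiting readers. Writing a full element is an error. */
void jstruct_write(jelem_t *e, int v) {
    e->value = v;
    e->full  = true;
}

/* Reset: return the element to empty so the storage can be reused. */
void jstruct_reset(jelem_t *e) {
    e->full = false;
}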
Control-level parallelism. Control-level parallelism may be expressed by wrapping future around an expression or statement X. The future keyword declares that X and the continuation of the future expression may be evaluated concurrently. Fine-grain support allows the amount of computation needed for evaluating X to be small without severely affecting performance.

If the compiler or runtime system chooses to create a new task to evaluate X, it also creates an object known as a placeholder that is returned as the value of the future expression. The placeholder is created in an undetermined state. Evaluation of X yields its value and determines the placeholder. Any task that attempts to use the value of X before X has been completely evaluated will encounter the undetermined placeholder and will suspend operation until the placeholder is determined.

This functionality is implemented using (by software convention) the low bit of a data value as a placeholder tag; that is, a pointer to a placeholder has the low bit set and all other values have the low bit clear. New add, subtract, and compare instructions in Sparcle trap if the low bit of any operand is set. Likewise, dereferencing a pointer with the low bit set will cause an address alignment trap to a similar routine. If the trap handler can determine the value of the placeholder, it places this value in the target register, and normal execution resumes. Otherwise, the trapping task waits until the value of the placeholder becomes available.

With this support, a compiler can generate code without knowing which data values may be computed concurrently. Consequently, Sparcle incurs no runtime overhead to ensure the detection of placeholders.
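The low-bit convention is easy to state in code. The following C sketch only illustrates the software convention described above; the type and function names are invented, and the explicit test stands in for the hardware trap, which costs nothing in the common untagged case.

#include <stdint.h>
#include <stdbool.h>

/* A placeholder object: a determined flag plus the eventual value. */
typedef struct {
    volatile bool     determined;
    volatile intptr_t value;
} placeholder_t;

/* Software convention: a pointer to a placeholder has its low bit
 * set; all ordinary values have the low bit clear. */
#define IS_PLACEHOLDER(x)  (((uintptr_t)(x)) & 1u)
#define AS_PLACEHOLDER(x)  ((placeholder_t *)((uintptr_t)(x) & ~(uintptr_t)1))

/* Touch an operand: if it is a placeholder, wait until the task
 * evaluating X determines it, then use the computed value. On
 * Sparcle this test is free; the tagged instructions trap only
 * when a low bit is actually set. */
static intptr_t touch(intptr_t x) {
    while (IS_PLACEHOLDER(x)) {
        placeholder_t *p = AS_PLACEHOLDER(x);
        while (!p->determined)
            ;                /* suspend; shown here as a spin */
        x = p->value;        /* the value may itself be tagged */
    }
    return x;
}

/* The effect of a tagged add: operands are checked on demand. */
intptr_t tagged_add(intptr_t a, intptr_t b) {
    return touch(a) + touch(b);
}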


Memory latency tolerance. Since memory in large-scale multiprocessors is distributed, cache misses to remote locations will incur long latencies and potentially reduce processor use. Figure 3 illustrates this problem by depicting processor and network activity when a single thread executes on the processor. When the thread suffers a long-latency cache miss, the processor waits for the miss to be satisfied before it can proceed. While waiting, both the processor and the network suffer idle time, thereby reducing their effective usage. Using latency tolerance mechanisms alleviates this problem and helps improve processor and network usage.

Figure 3. Processor and network activity when a single thread executes on the processor and no latency tolerance mechanisms are employed.

The general class of latency tolerance solutions all implement mechanisms that allow multiple outstanding memory transactions and can be viewed as a way of pipelining the processor and the network. The key difference between this pipeline into the network and the processor's execution pipeline is that the latency associated with the communication pipeline cannot be predicted easily at compile time. A compiler then has difficulty scheduling operations for maximal resource use. Systems must implement dynamic pipelines into the network in which the hardware ensures that multiple, previously issued memory operations have completed before issuing operations that depend on their completion. Context switching is one mechanism for dynamic pipelining. Other methods include prefetching and weak ordering.6-8

Sparcle implements fast context switching as its primary mechanism for dynamic latency tolerance. (Sparcle and its CMMU provide nonbinding prefetch instructions as well.) As illustrated in Figure 4, the basic idea is to overlap the latency of a memory request from a given thread of computation with the execution of a different thread. In the figure, when thread 1 suffers a cache miss, the processor switches to thread 2, thereby overlapping the cache miss latency of thread 1 with useful computation from thread 2.

Figure 4. Processor and network activity when multiple threads execute on the processor and fast context switching is used for latency tolerance.

In Alewife, when a thread issues a remote transaction or suffers an unsuccessful synchronization attempt, the Alewife CMMU traps the processor. If the trap resulted from a cache miss to a remote node, the trap handler forces a context switch to a different thread. Otherwise, if the trap resulted from a synchronization fault, the trap handling routine can switch to a different thread of computation. For synchronization faults, the trap handler might also choose to retry the request immediately (spin).
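In pseudocode form, this dispatch policy reads as below. This is our paraphrase of the trap handlers, not Alewife code; the cause names, stub bodies, and spin count are invented for illustration.

/* Hypothetical trap causes delivered by the CMMU's trap lines. */
typedef enum { TRAP_REMOTE_MISS, TRAP_SYNC_FAULT } trap_cause_t;

static void switch_to_next_ready_context(void) { /* 14-cycle path */ }
static int  retry_synchronizing_access(void)    { return 1; /* stub */ }

/* Policy sketch: a remote cache miss always switches contexts;
 * a synchronization fault may spin briefly before giving up. */
void cmmu_trap(trap_cause_t cause) {
    if (cause == TRAP_REMOTE_MISS) {
        switch_to_next_ready_context();  /* overlap miss with other work */
        return;
    }
    for (int i = 0; i < 16; i++)         /* invented spin threshold */
        if (retry_synchronizing_access())
            return;                      /* synchronization succeeded */
    switch_to_next_ready_context();      /* still failing: switch threads */
}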
Processors that switch rapidly between multiple threads of computation are called multithreaded architectures. The prototypical multithreaded machine is the HEP. In the HEP, the processor switches every cycle between eight processor-resident threads. Cycle-by-cycle interleaving of threads is termed fine multithreading. Although fine multithreading offers the potential for high processor usage, it results in relatively poor single-thread performance and low processor use when there is not enough parallelism to fill all the hardware contexts.

In contrast, Sparcle employs block multithreading, or coarse multithreading. That is, context switches occur only when a thread executes a memory request that must be serviced by a remote node in the multiprocessor, or on a failed synchronization request. Thus, a given thread continues to execute as long as its memory requests hit in the cache or can be serviced by a local memory module, and as long as synchronization attempts are successful. Block multithreading thus allows a single thread to benefit from the maximum performance of the processor. For multithreading to be useful in tolerating latency, however, the time required to switch to another thread must be shorter than the time to service a remote request. This requires multiple register sets or some other hardware-supported mechanism.

Efficient message interface. An efficient message interface that allows the processor to access the interconnection network directly makes some parallel operations significantly more efficient than if they were implemented solely with shared-memory operations. Examples include remote thread creation and barrier synchronization. With a fast message in Alewife, we can create a thread on a remote processor in 7 μs. Restricting ourselves to shared-memory operations, remote thread creation takes 24 μs. Kranz and associates9 have studied the importance of an efficient message interface in a shared-memory setting.

In Sparcle, we accomplish a fast message send operation by using the cache bus and CMMU interface to store data in registers directly into the network, and to load data from the network directly into registers. Two new load/store instructions handle the loading and storing. Sparcle also supports direct memory access for larger messages.



Figure 5. Structure of the Alewife machine.

Alewife machine interfaces
The Sparcle chip is part of a complete multiprocessing system. It serves as the CPU for the Alewife machine,2 a distributed shared-memory multiprocessor with up to 512 nodes and hardware-supported cache coherence. Figure 5 depicts the Alewife machine as a set of processing nodes connected in a mesh topology. Each Alewife node consists of a processor, a 64-Kbyte cache, a 4-Mbyte portion of globally shared distributed memory, a CMMU, a floating-point coprocessor, and a network switch. An additional 4 Mbytes of local memory holds the coherence directory, code, and local data. The network switch chip is an Elko-series mesh routing chip (EMRC) from Caltech that has 8-bit channels. The network operates asynchronously with a switching delay of 30 ns per hop and 60 Mbytes/s through bidirectional channels.

The single-chip CMMU performs a number of tasks, including cache management, DRAM refresh and control, message queuing, remote memory access, and direct memory access. It also supports the LimitLESS cache-coherence protocol,10 which maintains a few pointers per memory block in hardware (up to five in Alewife) and emulates additional pointers in software when needed. Through this protocol, all the caches in the system maintain a coherent view of global memory.
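A small sketch of the LimitLESS idea, under stated assumptions: the entry layout and all names below are ours, but the control flow, five hardware pointers with software emulation beyond that, follows the protocol description above.

#define HW_PTRS   5     /* directory pointers kept in hardware (Alewife) */
#define MAX_NODES 512   /* Alewife scales to 512 nodes */

/* Illustrative directory entry: a few hardware pointers, plus a
 * software-managed sharer set consulted only after overflow. */
typedef struct {
    int  hw[HW_PTRS];
    int  hw_count;
    char sw[MAX_NODES];
    int  overflowed;
} dir_entry_t;

/* Record that 'node' now caches this memory block. The common case
 * runs entirely in hardware; overflow traps to software, which can
 * represent any number of sharers in ordinary memory. */
void directory_add_sharer(dir_entry_t *d, int node) {
    if (!d->overflowed && d->hw_count < HW_PTRS) {
        d->hw[d->hw_count++] = node;          /* hardware fast path */
        return;
    }
    if (!d->overflowed) {                     /* first overflow: trap */
        for (int i = 0; i < d->hw_count; i++)
            d->sw[d->hw[i]] = 1;              /* fold hardware pointers in */
        d->overflowed = 1;
    }
    d->sw[node] = 1;                          /* software-emulated pointer */
}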


Sparcle implements a powerful and flexible interface to the CMMU. As depicted in Figure 6, this interface couples the processor pipeline with the CMMU. The interface can be divided into two general classes of signals: flexible data access mechanisms and flexible instruction extension mechanisms.

Figure 6. Interface between the processor pipeline and memory controller.

Together, the Access Type, Address Bus, Data Bus, and Hold Access lines form the nucleus of the data access mechanisms and comprise a standard external cache interface. To permit the construction of other types of data accesses for synchronization, we have supplemented this basic interface with three classes of signals:

• A Modifier that is part of the operation code for load/store instructions and that is not interpreted by the core processor pipeline. The modifier provides several "flavors" of load/store instructions.
• Two External Conditions that return information about the last access. They can affect the flow of control through special branch instructions.
• Several vectored memory exception signals (denoted Trap Access in the figure). These synchronous trap lines can abort active load/store operations and can invoke function-specific trap handlers.

These mechanisms permit us to extend the load/store architecture of a simple RISC pipeline with a powerful set of operations.

An instruction extension mechanism permits us to augment the basic instruction set with external functional units. Instructions that are added in this way can be pipelined in the same manner as standard instructions. To make this work, Sparcle reserves a special range of opcodes for external instructions. Also, the memory controller fetches new instructions from the cache bus at the same time that the processor does. Consequently, when the processor decodes an instruction in this range, it asserts the Launch External Inst signal, telling the CMMU to begin execution of the last fetched instruction. Note that the coprocessor interfaces of several microprocessors already provide this functionality.

We contend that we can design such a powerful interface between the processor pipeline and the communications and memory management hardware without significantly modifying the core RISC pipeline of contemporary processors. With this interface in mind, we first discuss several efficient multiprocessor mechanisms that are provided by the Sparcle processor. Later we touch upon the support which the memory controller must provide for these mechanisms.

Sparcle architecture and implementation
Sparcle is best described as a conventional RISC microprocessor with a few additional features to support multiprocessing. These features include support for latency tolerance, support for fine-grain synchronization, and support for fast message handling. Before we describe how we implemented them in the Sparc processor, we need to discuss these features. Then we can indicate how they can also be implemented in other RISC microprocessors.

Mechanisms for latency tolerance. Figure 7 illustrates fast context switching on a generic processor. This diagram shows four separate register sets with associated program counters and status registers. Each register set represents a context. A register called the context pointer, or CP, points to the active context. Consequently, a hardware context switch requires only that the context pointer be altered to point to another context. (Depending on details of the implementation, some number of cycles may be needed to flush the pipeline before executing a new context.) This figure also shows four threads actively loaded in the processor. These four threads are part of a much larger set of runnable and suspended threads that the runtime system maintains.

Figure 7. Block multithreading and virtual threads.
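Figure 7's generic mechanism can be restated as a data structure. The C below is a software caricature under our own naming; on a real processor the frames are register files and the "switch" is a single write of the context pointer.

#include <stdint.h>

#define NCONTEXTS 4     /* four loaded contexts, as in Sparcle */

/* One hardware context: a register frame plus its own PC and PSR. */
typedef struct {
    uint32_t regs[32];
    uint32_t pc, npc, psr;
    int      active;    /* cleared while the thread waits on a remote miss */
} context_t;

static context_t ctx[NCONTEXTS];
static int cp;          /* the context pointer (CWP serves this role) */

/* A hardware context switch only redirects the context pointer to
 * the next active frame; no register contents move anywhere. If no
 * other context is active, the pointer stays put. */
void context_switch(void) {
    for (int i = 1; i <= NCONTEXTS; i++) {
        int next = (cp + i) % NCONTEXTS;
        if (ctx[next].active) {
            cp = next;
            return;
        }
    }
}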



Implementation of fast context switching in Sparc. In a similar fashion, Sparcle uses multiple register sets to implement fast context switching. The particular Sparc design that we modified has eight overlapping register windows. Rather than using the register windows as a register stack, we used them in pairs to represent four independent, nonoverlapping contexts. We use one as a context for trap and message handlers, as described by Dally et al.11 and Seitz et al.,12 and the other three for user threads. The Sparc current window pointer (CWP) serves as the context pointer. Further, the window invalid mask (WIM) indicates which contexts are disabled and which are active. This particular use of register windows does not involve any modifications, just a change in software conventions.

Unfortunately, the Sparc processor does not have four sets of program counters and status registers. Since adding such facilities would impact the pipeline significantly, we implemented rapid context switching via a special trap with an extremely short trap handler. Thus, when the processor attempts to access a remote memory location that is not in the local cache, the CMMU causes a synchronous memory fault to Sparcle, while simultaneously sending a request for data to the remote node. The trap handler then saves the old program counter and status register, switches to a new context, restores the new context's program counter and status register, and returns from the trap to begin execution in the new context. With the goal of shortening this trap handler as much as possible, we made the following modifications to the Sparc architecture:

• So that the processor traps immediately to the context-switch code without having to decode the trap type, we added an extra synchronous trap line (with a corresponding trap vector).
• We added a new instruction called NEXTF. It is much like the Sparc SAVE instruction except that the window pointer is advanced to the next active context as indicated by the window invalid mask register. If no additional contexts are active, it leaves the window pointer unchanged.
• We increased the number of instructions for each entry in the Sparc trap vector from 4 to 16. This allows the context switch and other small trap handlers to execute in the trap vector directly.
• We made the value of the current window pointer available on external pins. Among other things, this permits the emulation of multiple hardware contexts in the Sparc floating-point unit by modifying floating-point instructions in a context-dependent fashion as they are loaded into the FPU and by maintaining four different sets of condition bits. Consequently, the context-switch trap handler does not have to worry about the FPU.

Figure 8 shows the context-switch trap handler with these changes. When the trap occurs, Sparcle switches one window backward (as does a normal Sparc). This switch places the window pointer between active contexts, where the Alewife runtime system reserves a few registers for the context state. As with normal Sparc trapping behavior, the hardware writes the PC and nPC to registers R17 and R18. This trap code places the processor status register (PSR) in register R16.

RDPSR R16        ; Save PSR in reserved register.
NEXTF R0, R0, R0 ; Move to next active context.
WRPSR R16        ; Restore PSR from other context.
JMPL R17, R0     ; Restore PC.
RETT R18, R0     ; Restore nPC and return from trap.

Figure 8. Context-switch trap code for Sparcle.

As depicted in Figure 9, the net effect is that a Sparcle context switch takes 14 cycles. The figure illustrates the total penalty for a context switch on a data instruction. Note that, while the diagram shows 15 cycles, one of them is the fetch of the first instruction from the next context.

Cycle  Operation
0      Fetch of data instruction (load or store)
1      Decode of data instruction (load or store)
2      Execute instruction (compute address)
3      Data cycle (which will fail)
4      Pipeline freeze, indicate exception to processor
5      Pipeline flush (save PC)
6      Pipeline flush (save nPC, decrease CWP)
7      Fetch: RDPSR PSRREG (save PSR in reserved register)
8      Fetch: NEXTF (advance CWP to next active context using WIM)
9      Fetch: WRPSR PSRREG (restore PSR for new context)
10     Fetch: JMPL R17 (load PC, return from trap)
11     Fetch: RETT R18 (reexecute trapping instruction)
12     Dead cycle from JMPL
13     First fetch of new instruction
14     Dead cycle from RETT (folded into switch time)

Figure 9. Anatomy of a context switch in Sparcle.

By maintaining a separate PC and processor status register for each context, a more aggressive implementation could switch contexts much faster. However, even with 14 cycles of overhead and four processor-resident contexts, multithreading can significantly improve system performance.13,14
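For a rough feel of what 14 cycles buys, consider a back-of-the-envelope utilization model (ours, not from the article): if a thread runs R cycles between remote misses and a switch costs C cycles, then with enough ready contexts to cover the miss latency the processor is busy R out of every R + C cycles.

#include <stdio.h>

/* Toy model: with enough contexts to hide the full miss latency,
 * each remote miss costs only the switch, so u = R / (R + C). */
static double utilization(double run_length, double stall) {
    return run_length / (run_length + stall);
}

int main(void) {
    /* Invented numbers: a remote miss every 100 cycles. */
    printf("multithreaded: %.0f%%\n",
           100.0 * utilization(100.0, 14.0));   /* about 88% busy */
    printf("single thread: %.0f%%\n",
           100.0 * utilization(100.0, 120.0));  /* about 45% if the
                                                   miss takes 120 cycles */
    return 0;
}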


Support for fine-grain synchronization. As discussed earlier, fine-grain data-level synchronization is expressed with J- and L-structures and implemented using new instructions that interact with full/empty bits in memory. Sparcle implements the new load, store, and swap instructions using the Sparc alternate address space instructions. We have modified these instructions in two ways:

1. The load, store, and swap alternate space instructions in Sparcle are unprivileged for ASI values in the range 0x80 to 0xFF. They remain privileged for ASI values less than 0x80. The CMMU uses the ASI value as an extended opcode; that is, ASI 0x84 corresponds to the load and trap if empty operation. This allows user code to interact directly with full/empty bits.
2. We have used several new opcodes to produce specific ASIs on the Sparcle output pins while allowing the register + offset addressing mode. The normal load/store ASI instructions only allow register + register addressing.

A new dedicated synchronous trap line carries full/empty trap signals. J- and L-structure operations are implemented with the following special load/store instructions:

LDN   Read location
LDEN  Read location and set to empty
LDT   Read location if full, else trap
LDET  Read location and set to empty if full, else trap
STN   Write location
STFN  Write location and set to full
STT   Write location if empty, else trap
STFT  Write location and set to full if empty, else trap

In addition to possible trapping behavior, each of these instructions sets a coprocessor condition code to the state of the full/empty bit at the time the instruction starts execution. Either trapping or an explicit test of this condition code will detect a synchronization failure. When a trap occurs, the trap handling software decides what action to take.

Implementation of J-structures. To demonstrate how the special load/store instructions can be used, we will describe how we implement J-structures and present the cycle counts for various synchronizing operations. Sparcle implements J-structure allocation by allocating a block of memory with the full/empty bit for each word set to empty. Resetting a J-structure element involves setting the full/empty bit for that element to empty. Implementing a J-structure read operation is also straightforward: it is a memory read that traps if the full/empty bit is empty. Sparcle implements it with a single instruction:

LDT (R1), R2 ; R1 points to J-structure location

If the full/empty bit is empty, the reading thread may need to suspend execution and queue itself on a wait queue associated with the empty element. To minimize memory usage, we use a single memory location to represent both the value of the J-structure element and the wait queue. This implies that we need to associate two bits of state with each J-structure element: whether the element is full or empty, and whether the wait queue is locked or not.

Other architectures implement these two state bits directly in hardware by having multiple state bits per memory location.15,16 Instead of providing an additional hardware bit, we take advantage of Sparc's atomic register-memory swap operation. Since the writer of a J-structure element knows that the element is empty before it does the write operation, it can use the atomic swap to synchronize access to the wait queue. With this approach, a single full/empty bit is sufficient for each J-structure element. A writer needs to check explicitly for waiters before undertaking the write operation.

Using atomic swap and full/empty bits, the machine code in Figure 10 implements a J-structure write. In this figure, R1 contains the address of the J-structure location to be written to, and R2 contains the value to be written. Also, -1 is the end-of-queue marker, and 0 in an empty location means that the queue is locked. Compared with the hardware approach, this implementation costs an extra move, swap, compare, and branch to check for waiters. However, we believe that the reduction in hardware complexity is worth the extra instructions.

MOVE $0, R3     ; Set up swap register.
SWAPT R3, (R1)  ; Swap zero with J-structure location, trap if full.
CMP $-1, R3     ; Check if queue is empty.
BEQ,a %done     ; Branch if no waiters to wake up.
STFT R2, (R1)   ; Write value and set to full (delay slot).
...             ; Code to wake up waiters (elided).
%done:

Figure 10. Machine code implementing a J-structure write.

Figure 11 gives a scenario of accesses to a J-structure location under this implementation and illustrates the possible states of a J-structure slot. Here, R1 contains a pointer to the J-structure slot.

Figure 11. Reading and writing a J-structure slot.
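The same encoding can be shown from the reader's side. This C sketch follows the single-location convention above (the word holds the value when full and the wait queue when empty, with -1 meaning no waiters); the names and the spin stand-in for thread suspension are ours.

#include <stdint.h>

#define EMPTY_QUEUE ((intptr_t)-1)   /* end-of-queue marker: no waiters */

/* Software view of one J-structure slot: the full/empty bit plus a
 * single word holding either the value (full) or the queue (empty). */
typedef struct {
    volatile int      full;
    volatile intptr_t word;
} jslot_t;

/* LDT-style read: take the value on the fast path, otherwise fall
 * into the slow path. Sparcle gets the same effect from a single
 * instruction that either traps or sets a testable condition code. */
intptr_t jslot_read(jslot_t *s) {
    if (s->full)
        return s->word;       /* fast path: 1 instruction, 2 cycles */
    while (!s->full)          /* slow path: stands in for enqueueing */
        ;                     /* the thread on s->word and suspending */
    return s->word;
}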



Table 1 summarizes the instruction and cycle counts of J-structure and L-structure operations for the case where no waiting is needed on read operations and no waiters are present on write operations. In Sparcle, as in the LSI Logic Sparc, normal read operations take two cycles and normal write operations take three cycles, assuming cache hits. A locking read is considered a write and thus takes three cycles.

Table 1. Summary of fast-path costs of J-structure and L-structure operations, compared with normal array operations.

Element       Action   Instructions   Cycles
Array         Read     1              2
              Write    1              3
J-structure   Read     1              2
              Write    5              10
              Reset    1              3
L-structure   Read     2              5
              Write    5              10
              Peek     1              2

Support for futures and placeholders. To support futures and placeholders, Sparcle provides automatic and efficient detection and handling of placeholders via traps. Two Sparcle modifications are involved.

First, to detect placeholders, Sparcle adds two new instructions called NTADD and NTSUB. These instructions cause tag overflow traps whenever the low bit of either of their operands is set. (NTADD and NTSUB are modifications of the Sparc tagged instructions TADDCCTV and TSUBCCTV that trap whenever the low two bits of either of their operands are set.) As discussed earlier, only pointers to placeholders have the low bit set. With tag overflow traps, NTADD and NTSUB automatically detect placeholders in add, subtract, and compare operations. The address alignment trap in Sparcle detects placeholders in pointer dereferencing operations.

Second, to efficiently handle traps caused by placeholders, the trap vector number that is generated by tag overflow and address alignment traps depends on the register containing the placeholder. This feature saves the trap handler from having to waste cycles decoding the trapping instruction to find out which register contains the offending placeholder. Johnson17 and Ungar et al.18 have proposed similar mechanisms.

Fast message handling. Most distributed shared-memory machines are built on top of an underlying message-passing substrate. Traditional shared-memory machines provide a layer of hardware that implements some coherence protocol between the processor and the interconnection network. It is natural, then, to provide the processor with direct access to the network in addition to the shared-memory interface, because many operations benefit greatly from direct network access. Sparcle supports sending and receiving messages via a memory-mapped interface to the interconnection network.

Send. Sparcle sends messages through a two-phase process: first describe, then launch. Sparcle composes a message by writing directly to the interconnection network queue using a special store instruction called STIO (for store I/O). The queues are memory mapped as an array of network registers in the CMMU, called the output descriptor array. In terms of performance, write operations into this array incur the same cost as write hits into the cache.

The first word of the message must be a header indicating a message opcode and the destination node. Sparcle reserves a range of opcodes for privileged use by the operating system. The rest of the message can contain immediate values from registers, or address and length pairs which invoke DMA on blocks from memory.

After the message is composed, a coprocessor instruction launches the message. Figure 12 illustrates the sending of a single message with one explicit data word and one block of data from memory, in addition to the required header. On entry to this code sequence, register R2 contains the header, R3 contains the data word, R4 the address of the data block, and R5 the length of the data block. If Sparcle is in the user mode and the header is privileged, an exception will occur. The CMMU maintains the atomicity of messages, as described in the next section.

STIO R2, $ipiout0 ; Store header.
STIO R3, $ipiout1 ; Store data word.
STIO R4, $ipiout2 ; Store address of data.
STIO R5, $ipiout3 ; Store length of data.
IPILAUNCH 2, 1    ; Launch message. Descriptor is 2 doublewords
                  ; long and contains 1 doubleword of explicit
                  ; data (from R2 and R3).

Figure 12. Machine code implementing a message send.
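Viewed from C, the describe-then-launch sequence of Figure 12 is a handful of stores to a memory-mapped array. The base address and helper below are invented; the real output descriptor array sits in the CMMU, and the launch is a coprocessor instruction with no C equivalent.

#include <stdint.h>

/* Hypothetical mapping of the CMMU output descriptor array. */
#define IPIOUT ((volatile uint32_t *)0x90000000u)

/* Compose a message: header, one explicit data word, and one
 * address/length pair that will be sent by DMA. Each store costs
 * the same as a cache write hit. */
static inline void describe_message(uint32_t header, uint32_t data,
                                    uint32_t addr, uint32_t len) {
    IPIOUT[0] = header;   /* message opcode + destination node */
    IPIOUT[1] = data;     /* explicit data word */
    IPIOUT[2] = addr;     /* address of data block (DMA) */
    IPIOUT[3] = len;      /* length of data block (DMA) */
    /* IPILAUNCH 2, 1 would follow here: nothing enters the
     * network until that coprocessor instruction commits. */
}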
Receive. A message arrival causes a trap. The trap handler can either load words directly from the incoming message into registers using a special load instruction called LDIO (for load I/O) or initiate a DMA sequence to store the message into memory. If the latter option is chosen, the processor can direct the CMMU to generate an interrupt after the storeback is complete.
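A receive-side handler, sketched in the same hedged style: the queue address, header layout, and DMA helper are all invented, but the two options, LDIO-style loads into registers versus a DMA storeback with a completion interrupt, are the ones described above.

#include <stdint.h>

/* Hypothetical window onto the head of the input message queue. */
#define IPIIN ((volatile uint32_t *)0x90001000u)

static void cmmu_dma_storeback(uint32_t *dst, uint32_t words) {
    (void)dst; (void)words;   /* stub for the CMMU's DMA engine */
}

/* Message-arrival trap handler: consume short messages directly
 * from the network queue; spill long ones to memory by DMA and let
 * the CMMU raise an interrupt when the storeback completes. */
void message_trap(uint32_t *buffer) {
    uint32_t header = IPIIN[0];        /* LDIO-style load */
    uint32_t length = header & 0xffu;  /* invented header layout */
    if (length <= 4) {
        for (uint32_t i = 0; i < length; i++)
            buffer[i] = IPIIN[1 + i];  /* register-speed unloading */
    } else {
        cmmu_dma_storeback(buffer, length);
    }
}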


Support for message handling. The following features of Sparcle support messaging:

• Special user-level load/store instructions allow fast composition of outgoing messages and fast examination of incoming messages. An ASI value is reserved for the transferring of data to and from the message registers. This ASI is produced by two new Sparcle instructions, STIO and LDIO. Although these instructions support a memory-mapped interface to the network registers, addresses for the message queues fit completely into the address offset field. Consequently, the compiler can generate instructions that perform direct register-to-register moves between the processor and the network queues.
• Register windows permit fast processing of message interrupts. One of the four hardware contexts is reserved for message processing. Consequently, the message interrupt handler needs only to alter the current window pointer so that this special context is active. No registers need to be saved and restored.
• Coprocessor instructions for message launch and disposal permit pipelining of network operations. Further, opcode bits in the launch and disposal instructions contain information about the format of messages that are about to be sent or received into memory. Thus, message format is completely under control of the compiler. Finally, the coprocessor interface permits a precise identification of the commit point for launch instructions, ensuring that message launches are atomic.
• Fast interrupt operations allow rapid entry into message handler code on the arrival of a message. In our current implementation, because interrupts always force the processor into the supervisor mode, user-level receipt of messages requires a few extra cycles for the processor to transfer control to user code. In a more aggressive implementation, the processor would support a user-level return from trap.

The CMMU interface
From this discussion we can clearly see that the Sparcle processor is part of a complete system. Consequently, several of the mechanisms that were included in Sparcle are incomplete without the support of the CMMU. Here we briefly discuss the Alewife CMMU and how it interfaces to Sparcle. Although the Alewife CMMU provides a number of features, we focus on the cache controller and message interface.

Earlier, under Alewife machine interfaces, we discussed two categories of signals in the interface between processor and CMMU: flexible data access mechanisms and flexible instruction extension mechanisms. Figure 13 makes this interface more concrete by showing Sparcle equivalent names for all of the signals. Each signal in this figure corresponds directly to signals in Figure 6.

Figure 13. Sparcle signal names (Access type, ASI, Addr, Data, MEXC, MHOLD, CCC, CINS, IRL).

A few of the data access mechanisms require further discussion. The modifier is implemented with the Sparc ASI field. Again, Sparcle contains a number of new load/store instructions that differ only by the values that they place on the ASI pins during data cycles. These new load/store instructions are important to the implementation of full/empty bit synchronization and fast messages. The trap access signals are new versions of the Sparc memory exception signal MEXC, which have distinct trap vectors. These invoke context-switch and synchronization traps. The external condition bits are implemented through the Sparc coprocessor condition codes (CCC); consequently, "branch on condition-code" instructions in Sparc can be used to examine them.

Finally, the external instruction interface is implemented directly through the Sparc coprocessor interface. Sparcle asserts one of the CINS signals to indicate that a coprocessor instruction has been decoded by the processor and should be executed by the coprocessor. Two CINS signals are required because pipeline interlocks can occasionally cause the instruction fetch unit to get ahead of the rest of the pipeline.

Latency tolerance. We already discussed rapid context switching for latency tolerance from the standpoint of the Sparcle processor. In addition to those Sparcle mechanisms, the cache controller must be able to handle multiple outstanding requests. This involves the ability to handle split-phase memory transactions (separating the request for data from the response) and to place returning data into the cache while the processor is performing some other task. Consequently, when the processor requests a data item that is not in the local cache, the cache controller asserts the appropriate trap line to initiate execution of the context-switch trap handler. At the same time, it sends a request message to the particular node that contains the requested data. Note that the mechanisms required to handle context switching differ little from those required for software prefetching. (However, see Kubiatowicz, Chaiken, and Agarwal19 for some interesting forward-progress issues.)



Full/empty-bit synchronization. Full/empty-bit synchronization, as implemented in Alewife, requires support from the cache controller. Since full/empty-bit synchronization employs one synchronization bit for each data word, extra storage must be reserved for these bits in the cache system. While these bits logically belong with the cache data, the Alewife CMMU implements them with the cache tags. This has a number of advantages. It eliminates the need for an odd number of bits in the physical memory used for cache data. It also makes access to the tags file much faster than access to the cache data, both because the tags file is smaller and because no chip crossings are required. This permits synchronization operations to occur in parallel with processing of the cache tags.

Of the Sparcle mechanisms, those important to full/empty synchronization are the external condition code, the access modifier (ASI), and one of the extra trap lines. All of the new synchronizing load/store instructions mentioned earlier are distinguished by the value of the ASI field that they generate (and by whether they are read or write operations). For each data access, the Alewife CMMU takes the proffered ASI value along with the address and type of access. The CMMU uses the address to index into the tags file, retrieving both the tag and the appropriate full/empty bit. Simultaneously, it decodes the ASI value to produce two different actions, one which will be taken if the full/empty bit is full, and one if the full/empty bit is empty. When the tag lookup is completed, the CMMU completes both the tags-match and full/empty-bit operations simultaneously, flagging either a context switch (on cache miss), a synchronization fault, or successful completion of the access. In all cases, the CMMU places the full/empty bit that was first retrieved from the tags file in one of the external condition codes for future examination by the processor.

The support that Alewife provides for full/empty-bit synchronization is external to the processor pipeline; that is, it occurs at the first-level cache. Consequently, full/empty bits never enter the processor core. Further, individual load/store instructions have varied semantics with respect to the full/empty bit: some cause test-and-set-like operations; others invoke traps. This places some data processing logic within the first-level cache. For modern processors that have one level of on-chip caching, a closer integration between the processor pipeline and full/empty-bit synchronization might be desirable. This could include widening of internal processor registers and use of special full/empty-bit synchronization instructions that are sandwiched between Alpha-style20 load-locked/store-conditional synchronization instructions.
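The decode step pairs each synchronizing ASI with two outcomes, which can be tabulated. The encoding below is our own rendering; in the CMMU the rule selection happens in logic, in parallel with the tag match.

typedef enum { ACT_COMPLETE, ACT_TRAP } action_t;

/* Each load/store flavor names one action for a full location and
 * one for an empty location, plus the new full/empty state on
 * success (-1 = leave unchanged). Semantics from the instruction
 * list given earlier; the encoding is invented. */
typedef struct {
    action_t if_full, if_empty;
    int      new_fe;
} asi_rule_t;

static const asi_rule_t LDT  = { ACT_COMPLETE, ACT_TRAP,     -1 };
static const asi_rule_t LDET = { ACT_COMPLETE, ACT_TRAP,      0 };
static const asi_rule_t STT  = { ACT_TRAP,     ACT_COMPLETE, -1 };
static const asi_rule_t STFT = { ACT_TRAP,     ACT_COMPLETE,  1 };

/* One access: the full/empty bit read from the tags file selects
 * the action, and is always exported as an external condition code
 * for the processor to branch on. */
action_t cmmu_decode(const asi_rule_t *r, int fe_bit,
                     int *cond_code, int *fe_next) {
    action_t act = fe_bit ? r->if_full : r->if_empty;
    *cond_code = fe_bit;
    *fe_next = (act == ACT_COMPLETE && r->new_fe >= 0) ? r->new_fe : fe_bit;
    return act;
}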
Fast message handling. Fast messaging in Alewife relies on a number of features in the CMMU. All of the network queuing and DMA mechanisms are a part of this chip. Sparcle interfaces with these mechanisms through both the external instruction interface and through special loads and stores. As we discussed, Sparcle reserves one special load/store instruction (and corresponding ASI) for rapid descriptions of outgoing messages and rapid examination of incoming messages. The cache controller recognizes accesses with this ASI and causes data transfer to and from the message queues instead of the cache. Message data thus transfers between the processor and network at the same speed as cached accesses.

Alewife uses the external instruction interface to implement the message launch mechanism. Consequently, message launches can be pipelined. Figure 14 gives a simple pipeline example. Here, the two-cycle latency for stores and the lack of an instruction cache limit the message throughput. More aggressive processor implementations would not suffer from this limitation. In this figure, the Sparcle pipeline stages are instruction fetch, decode, execute, memory, and writeback. Network messages are committed in the writeback stage. Stages Q1 and Q2 are network queuing cycles. The message data begins to appear in the network after stage Q2. Note that the use of DMA on message output adds additional cycles (not shown in the figure) to the network pipeline.

Figure 14. Pipelining for transmission of a message with a single data word (STIO Header, N0; STIO Data, N1; IPILAUNCH 1, 1).

The close coupling between the message launch mechanism and the processor pipeline allows us to identify a precise launch completion point (corresponding to the writeback stage of the launch instruction). As a result, message launches are atomic. Before the launch instruction commits, no data is placed into the network. After the launch commits, Alewife sends a complete output packet to the network. These atomic semantics allow multiple levels of user and interrupt code to share a single network output port without requiring that the user disable interrupts before beginning to describe a message.

THE SPARCLE CHIP INCORPORATES MECHANISMS required for massively parallel systems in a Sparc RISC core. Coupled with a CMMU, Sparcle allows a fast, 14-cycle context switch, an 8-cycle user-level message send, and fine-grain full/empty-bit synchronization.


Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on February 17, 2009 at 09:28 from IEEE Xplore. Restrictions apply. - Sparcle architecture outlined, instruction-level simulator written, April 1989 --+ MuI-T compiler operational July 1989 --+ Sparcle design using Sparc begun

MIT, LSI, Sun collaboration set up Nov 1989 ---+ to implement Sparcle

Sparcle architecture defined, and March 1990 ---+ modifications to Sparc specified

Sparcle implemented, first program March 1991 -+ compliled and run on Sparcle netlists Figure 15. Sparcle's test system.

Before we received working Sparcle chips from LSI in the July 1991 ---+ Parallel C compiler operational spring of 1992, we used an operating single-node test sys- Aug. 1991 --+ Sparcle testbed implemented tem. Also operational for several months was a compiler and a runtime system for our parallel versions of C and Lisp. The Layout and fabrication of test system shown in Figure 15 comprises 256 Kbytes of static Sept. 1991 --+ Sparcle begun RAM memory, an I/O interface to the VMEbus for download- ing programs and monitoring execution, and control logic to Functional Sparcle back exercise the fuWempty bit and context switching functional- March 1992 ---+ from fabrication ity. We had debugged the test system using Sparcs in place of Sparcles; it operated at a maximum clock frequency of about 25 MHz. (Sparc and Sparcle have only a few differing pins, and Sparcle even provides an input signal Mode pin Figure 16. Sparcle's implementation schedule. that allows switching between Sparc and Sparcle modes.) We have been running several parallel programs, includ- ing Sparcle's runtime system, to exercise all of Sparcle's func- fications to Sparc required to implement Sparcle. Then, Sun tionality, at the maximum speed of the test bed. Scope rnade high-level changes to Sparc functional blocks, and LSI measurements of critical signal timings on the chip's pins made lower gate-level changes. We tested these changes suggest we will be able to run the chips in an Alewife node against Sparcle binaries produced at MIT. Then LSI synthe- board at roughly the same speed as the original, unmodified sized netlists and MIT tested them against several hundred Sparcs. thousands of test vectors. The test vectors included both Sparc Implementation of the Sparcle development relied on vectors provided by LSI and Sparcle vectors obtained from modifying an existing design through a unique collaboration the MIT Sparcle simulator. The test setup included a netlist with industry. Although we had our moments of trepidation. module for the floating-point coprocessor and a behavioral given the number of participants and the multiple failure model for the rest of the memory and communication sys- modes (both technical and political), we believe this model tems. Finally, LSI undertook layout and fabrication, during of experimentation has been very successful. This implemen- which time we also implemented a test system for Sparcle. tation strategy not only allowed us, at a university, to experi- While the Sparcle chip project demonstrates that a con- ment with architectural ideas in a real, contemporary processor temporary RISC microprocessor can readily incorporate fea- design, it also significantly reduced the design effort from the tures considered by many to be critical for massively parallel concept stage to working chip. multiprocessing, the end systems benefit of these mecha- Figure 16 depicts the resulting project schedule for Sparcle. nisms can only be evaluated in the context of a complete We defined Sparcle's early architecture in April 1989. At MIT multiprocessor system. We are in the final stages of imple- we also wrote a Sparcle compiler for a version of Lisp and menting the Sparcle-based Alewife multiprocessor system. implemented a cycle-by-cycle simulator. Later, we also de- Figure 1 shows an Alewife node board with the Sparcle and veloped a compiler for a parallel version of C. By March FPU. 
Figure 17 shows a 16-node Alewife system package 1990, we had developed a detailed specification of the modi- developed by the Advanced Production Technology group



Figure 17. The 16-node Alewife package.

Acknowledgments
The Sparcle project is funded in part by DARPA contract N00014-87-K-0825 and in part by National Science Foundation grant MIP-9012773. LSI Logic and Sun Microsystems helped implement Sparcle, and LSI Logic supported the fabrication of Sparcle. We acknowledge the contributions of Dan Nussbaum, who was partly responsible for the processor simulator and runtime system, and was the source of several ideas. Halstead's work on multithreaded processors influenced our design. Our research also benefited significantly from discussions with Bert Halstead, Tom Knight, Greg Papadopoulos, Juan Loaiza, Bill Dally, Steve Ward, Rishiyur Nikhil, Arvind, and John Hennessy.

References
1. Sparc Architecture Manual, Sun Microsystems, Mountain View, Calif., 1988.
2. A. Agarwal et al., "The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor," Proc. Workshop on Scalable Shared Memory Multiprocessors, Kluwer Academic Publishers, Boston, 1991. An extended version of this paper has been submitted for publication and appears as MIT/LCS Memo TM-454, 1991.
3. A. Agarwal et al., "APRIL: A Processor Architecture for Multiprocessing," Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., June 1990, pp. 104-114.
4. B.J. Smith, "Architecture and Applications of the HEP Multiprocessor Computer System," Soc. of Photo-optical Instrumentation Engineers, Bellingham, Wash., Vol. 298, 1981, pp. 241-248.
5. Arvind, R.S. Nikhil, and K.K. Pingali, "I-Structures: Data Structures for Parallel Computing," ACM Trans. Programming Languages and Systems, Vol. 11, No. 4, Oct. 1989, pp. 598-632.
6. M. Dubois, C. Scheurich, and F.A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors," Computer, Vol. 21, No. 2, Feb. 1988, pp. 9-21.
7. S.V. Adve and M.D. Hill, "Weak Ordering - A New Definition," Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, June 1990, pp. 2-14.
8. D. Lenoski et al., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, June 1990, pp. 148-159.
9. D. Kranz et al., "Integrating Message-Passing and Shared-Memory: Early Experience," to be published in Proc. Conf. Principles and Practice of Parallel Programming, ACM, May 1993; also appears as MIT/LCS Memo TM-478, 1993.
10. D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS Directories: A Scalable Cache Coherence Scheme," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), ACM, Apr. 1991, pp. 224-234.
11. W.J. Dally et al., "The J-Machine: A Fine-Grain Concurrent Computer," Proc. Int'l Federation for Information Processing (IFIP) 11th World Congress, Elsevier Science Publishers, New York, 1989, pp. 1147-1153.
12. C.L. Seitz et al., "The Design of the Caltech Mosaic C Multicomputer," Proc. 1993 Symp. Research on Integrated Systems, G. Borriello and C. Ebeling, eds., MIT Press, Cambridge, Mass., 1993, pp. 1-22.
13. W.-D. Weber and A. Gupta, "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results," Proc. 16th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, June 1989, pp. 273-280.
14. K. Kurihara, D. Chaiken, and A. Agarwal, "Latency Tolerance Through Multithreading in Large-Scale Multiprocessors," Proc. Int'l Symp. Shared Memory Multiprocessing, IPS Press, Japan, Apr. 1991, pp. 91-101.
15. G. Alverson, R. Alverson, and D. Callahan, "Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor," Workshop on Multithreaded Computers, Proc. Supercomputing, ACM, Nov. 1991.
16. G.M. Papadopoulos and D.E. Culler, "Monsoon: An Explicit Token-Store Architecture," Proc. 17th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, June 1990, pp. 82-91.
17. D. Johnson, "Trap Architectures for Lisp Systems," Proc. 1990 ACM Conf. Lisp and Functional Programming, 1990, pp. 79-96.
18. D. Ungar et al., "Architecture of SOAR: Smalltalk on a RISC," Proc. 1984 Int'l Symp. Computer Architecture, 1984, pp. 188-197.
19. J. Kubiatowicz, D. Chaiken, and A. Agarwal, "Closing the Window of Vulnerability in Multiphase Memory Transactions," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS V), ACM, Oct. 1992, pp. 274-284.
20. Alpha Architecture Reference Manual, Digital Press, Maynard, Mass., 1992.


Anant Agarwal serves at the Laboratory for Computer Science at MIT, where he is an associate professor of electrical engineering and computer science. His current research interests include the design of scalable multiprocessor systems, VLSI processors, compilation and runtime technologies for parallel processing, and performance evaluation. At Stanford University, he participated in the MIPS and MIPS-X projects. He initiated the Alewife project at MIT, which is aimed at the design and implementation of a large-scale cache-coherent multiprocessor. Agarwal received a BTech in electrical engineering from the Indian Institute of Technology, Madras, India, and an MS and a PhD in electrical engineering from Stanford. He is a member of the ACM and the IEEE Computer Society.

John Kubiatowicz is a doctoral candidate in the Department of Electrical Engineering and Computer Science at MIT. His current research interests include parallel computer architecture, high-performance microprocessor design, and high-energy particle physics. Kubiatowicz received BS degrees in electrical engineering and physics and an MS in electrical engineering from MIT.

David Kranz has been a research associate in the MIT Laboratory for Computer Science since 1987. His research interests are in programming language design and implementation for parallel computing. Kranz received a BA from Swarthmore. While earning a PhD at Yale, he worked on high-performance compilers for Scheme and applicative languages.

Beng-Hong Lim is currently a doctoral candidate in MIT's Department of Electrical Engineering and Computer Science. His research interests include parallel computing. He received a BS and an MS in electrical engineering and computer science from MIT.

Donald Yeung is a PhD candidate at MIT. His research interests are in the area of multiprocessor design, including efficient hardware and software mechanisms for synchronization and latency tolerance. Yeung received a BS from Stanford in computer systems engineering and recently completed an MS in electrical engineering and computer science at MIT.

Godfrey D'Souza has been with the Sparc Systems Division and the CoreWare Group at LSI Logic, where he is a senior design engineer involved with aspects of microprocessor and system-level design. D'Souza received a BS in electronics and communications engineering from the University of Baroda, India, and an MS in electrical engineering from the University of Washington, Seattle. He is a member of the IEEE Computer Society.

Mike Parkin is a staff engineer at Sun Microsystems, where he is currently part of a research team that is building a Sparc-based scalable multiprocessor system. Parkin received a BSEE from Iowa State University and an MSEE from Stanford.

Direct questions concerning this article to David Kranz, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, or via e-mail at [email protected].


