APRIL: A Processor Architecture for Multiprocessing

Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.

1 Introduction

The requirements placed on a processor in a large-scale multiprocessing environment are different from those in a uniprocessing setting. A processor in a parallel machine must be able to tolerate high memory latencies and handle process synchronization efficiently [2]. This need increases as more processors are added to the system.

Parallel applications impose processing and communication bandwidth demands on the parallel machine. An efficient and cost-effective machine design achieves a balance between the processing power and the communication bandwidth provided. An imbalance is created when an underutilized processor cannot fully exploit the available network bandwidth. When the network has bandwidth to spare, low processor utilization can result from high network latency. An efficient processor design for multiprocessors provides a means for hiding latency. When sufficient parallelism exists, a processor that rapidly switches to an alternate thread of computation during a remote memory request can achieve high utilization.

Processor utilization also diminishes due to synchronization latency. Spin lock accesses have a low overhead of memory requests, but busy-waiting on a synchronization event wastes processor cycles. Synchronization mechanisms that avoid busy-waiting through process blocking incur a high overhead.

Full/empty bit synchronization [22] in a rapid context switching processor allows efficient fine-grain synchronization. This scheme associates synchronization information with objects at the granularity of a data word, allowing a low-overhead expression of maximum concurrency. Because the processor can rapidly switch to other threads, wasteful iterations in spin-wait loops are interleaved with useful work from other threads. This reduces the negative effects of synchronization on processor utilization.

This paper describes the architecture of APRIL, a processor designed for large-scale multiprocessing. APRIL builds on previous research on processors for parallel architectures such as HEP [22], MASA [8], P-RISC [19], [14], [15], and [18]. Most of these processors support fine-grain interleaving of instruction streams from multiple threads, but suffer from poor single-thread performance. In the HEP, for example, instructions from a single thread can only be executed once every 8 cycles. Single-thread performance is important for efficiently running sections of applications with low parallelism.

APRIL does not support cycle-by-cycle interleaving of threads. To optimize single-thread performance, APRIL executes instructions from a given thread until it performs a remote memory request or fails in a synchronization attempt. We show that such coarse-grain multithreading allows a simple processor design with context switch overheads of 4 to 10 cycles, without significantly hurting overall system performance (although the pipeline design is complicated by the need to handle pipeline dependencies). In APRIL, thread scheduling is done in software, and unlimited virtual dynamic threads are supported. APRIL supports full/empty bit synchronization, and provides tag support for futures [9]. In this paper the terms process, thread, context, and task are used equivalently.

By taking a systems-level design approach that considers not only the processor, but also the compiler and run-time system, we were able to migrate several non-critical operations into the software system, greatly simplifying processor design. APRIL's simplicity allows an implementation based on minor modifications to an existing RISC processor design. We describe such an implementation based on Sun Microsystems' SPARC processor [23]. A compiler for APRIL, a run-time system, and an APRIL simulator are operational. We present simulation results for several parallel applications on APRIL's efficiency in handling fine-grain threads and assess the scalability of multiprocessors based on a coarse-grain multithreaded processor using an analytical model. Our SPARC-based processor supports four hardware contexts and can switch contexts in about 10 cycles, which yields roughly 80% processor utilization in a system with an average base network latency of 55 cycles.

The rest of this paper is organized as follows. Section 2 is an overview of our multiprocessor system architecture and the programming model. The architecture of APRIL is discussed in Section 3, and its instruction set is described in Section 4. A SPARC-based implementation of APRIL is detailed in Section 5. Section 6 discusses the implementation and performance of the APRIL run-time system. Performance measurements of APRIL based on simulations are presented in Section 7. We evaluate the scalability of multithreaded processors in Section 8.

2 The ALEWIFE System

APRIL is the processing element of ALEWIFE, a large-scale multiprocessor being designed at MIT. ALEWIFE is a cache-coherent machine with distributed, globally-shared memory. Cache coherence is maintained using a directory-based protocol [5] over a low-dimension direct network [20]. The directory is distributed with the processing nodes.

2.1 Hardware

As shown in Figure 1, each ALEWIFE node consists of a processing element, floating-point unit, cache, main memory, cache/directory controller and a network routing switch. Multiple nodes are connected via a direct, packet-switched network.

Figure 1: ALEWIFE node. (Block labels: network router, processor, cache, cache controller, FPU, main memory.)

The controller synthesizes a global shared memory space via messages to other nodes, and satisfies requests from other nodes directed to its local memory. It maintains strong cache coherence [7] for memory accesses. On exception conditions, such as cache misses and failed synchronization attempts, the controller can choose to trap the processor or to make the processor wait. A multithreaded processor reduces the ill effects of the long-latency acknowledgment messages resulting from a strong cache coherence protocol. To allow experimentation with other programming models, the controller provides special mechanisms for bypassing the coherence protocol and facilities for preemptive interprocessor interrupts and block transfers.

The ALEWIFE system uses a low-dimension direct network. Such networks scale easily and maintain high nearest-neighbor bandwidth. However, the longer expected latencies of low-dimension direct networks compared to indirect multistage networks increase the need for processors that can tolerate long latencies. Furthermore, the lower bandwidth of direct networks compared to indirect networks with the same channel width introduces interesting design tradeoffs.

In the ALEWIFE system, a context switch occurs whenever the network must be used to satisfy a request, or on a failed synchronization attempt. Since caches reduce the network request rate, we can employ coarse-grain multithreading (context switch every 50 to 100 cycles) instead of fine-grain multithreading (context switch every cycle). This simplifies processor design considerably because context switches can be more expensive (4 to 10 cycles), and functionality such as scheduling can be migrated into run-time software. Single-thread performance is optimized, and techniques used in RISC processors for enhancing pipeline performance can be applied [10]. Custom design of a processing element is not required in the ALEWIFE system; indeed, we are using a modified version of a commercial RISC processor for our first-round implementation.

2.2 Programming Model

Our experimental programming language for ALEWIFE is Mul-T [16], an extended version of Scheme. Mul-T's basic mechanism for generating concurrent tasks is the future construct. The expression (future X), where X is an arbitrary expression, creates a task to evaluate [...]

[...] level parallelism and primitives for placement of data and tasks. As an example, the programmer can use future-on, which works just like a normal future but allows the specification of the node on which to schedule the future. Extending Mul-T in this way allows us to experiment with techniques for enhancing locality and to research language-level issues for programming parallel machines.

3 Processor Architecture

APRIL is a pipelined RISC processor extended with special mechanisms for multiprocessing.
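
To make the coarse-grain multithreading trade-off from Section 2.1 concrete, the following Python sketch computes processor utilization under a deliberately simplified first-order model. It is not the analytical model of Section 8. The function name `utilization` and its parameters are hypothetical, and the sketch assumes that every thread runs for a fixed number of cycles between remote requests, that every request takes a fixed latency, and that cache interference and network contention can be ignored.

```python
def utilization(threads, run_length, switch_cost, latency):
    """First-order utilization estimate for a coarse-grain multithreaded processor.

    Assumptions (illustrative only, not the paper's Section 8 model):
      - each thread runs exactly `run_length` cycles between remote requests,
      - each remote request takes exactly `latency` cycles,
      - each context switch costs `switch_cost` cycles,
      - cache interference and network contention are ignored.
    """
    if threads < 1:
        raise ValueError("at least one resident thread is required")
    if run_length <= 0 or switch_cost < 0 or latency < 0:
        raise ValueError("cycle counts must be non-negative and run_length positive")

    # Unsaturated region: the processor idles while every thread waits on the
    # network, so useful work grows linearly with the number of threads.
    linear = threads * run_length / (run_length + latency)

    # Saturated region: latency is fully hidden and only the context switch
    # overhead is lost.
    saturated = run_length / (run_length + switch_cost)

    return min(linear, saturated)


if __name__ == "__main__":
    # Parameter values taken from the ranges quoted in the text:
    # 50-cycle run length, 10-cycle switch, 55-cycle base network latency.
    for p in (1, 2, 3, 4):
        print(p, round(utilization(p, 50, 10, 55), 3))
```

With a 50-cycle run length, a 10-cycle switch, and a 55-cycle latency, this sketch gives about 0.48 for a single thread and saturates at about 0.83 once two or more threads are resident. Because it ignores cache and network effects, it is optimistic about how few threads are needed; it is meant only to show why a switch cost of 4 to 10 cycles is tolerable when switches occur every 50 to 100 cycles.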
