THE FLASH MULTIPROCESSOR: DESIGNING A FLEXIBLE AND SCALABLE SYSTEM

Jeffrey Scott Kuskin

Technical Report No. CSL-TR-97-744

November 1997

This research has been supported by DARPA contract DABT63-94-C-0054.

The FLASH Multiprocessor: Designing a Flexible and Scalable System

Jeffrey Scott Kuskin

CSL-TR-97-744

November 1997

Computer Systems Laboratory
Departments of Electrical Engineering and Computer Science
Stanford University
William Gates Computer Science Building, A-408
Stanford, CA 94305-9040
[email protected]

Abstract

The choice of a communication paradigm, or protocol, is central to the design of a large-scale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all protocol processing. The architecture of the MAGIC chip allows FLASH to support multiple communication paradigms — in particular, cache-coherent shared memory and high-performance message passing — while minimizing both hardware and software overhead. Each node in FLASH contains a compute processor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and MAGIC, the custom node controller. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The result is a system that is flexible and scalable, yet competitive in performance with a traditional multiprocessor that implements a single communication paradigm completely in hardware.

The focus of this dissertation is the architecture, design, and performance of FLASH. Much of the motivation behind the FLASH system and the MAGIC node controller design stems from an examination of the characteristics of protocol code and of the architecture of the DASH system, the predecessor to FLASH. This examination led to two major design goals: development of a node controller architecture that can attain high protocol processing performance while still maintaining flexibility, and reduction of the logic and memory overheads associated with cache coherence. The MAGIC design achieves these goals by implementing on a single chip a programmable protocol engine with an instruction set optimized for the characteristics of protocol code, along with dedicated support logic to alleviate the most serious protocol processing performance bottlenecks — data movement, message dispatch, and lack of close coupling to the node board components. The design of the FLASH node complements the MAGIC design, matching the close coupling and high bandwidth support in MAGIC to provide a balanced node architecture.

Next, the dissertation investigates the performance of cache coherence on FLASH. Performance results are presented from microbenchmarks run on the RTL of the MAGIC chip and from complete applications run on FlashLite, the FLASH system-level simulator. The microbenchmarks demonstrate that the architectural extensions added to the MAGIC design — particularly the instruction set optimizations to the programmable protocol processor — yield significantly lower latencies and protocol processor occupancies for servicing the most common types of memory operations.

The application results are used to evaluate the performance costs of flexibility by comparing the performance of FLASH to that of a hardwired machine on representative parallel applications and multiprogramming workloads. These results show that poor application memory reference or load balancing characteristics cause the performance of the FLASH system to degrade more rapidly than the performance of the hardwired system; that is, FLASH's performance is less robust. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, the increased remote access latencies or the increased occupancy of MAGIC lead to lower performance for the flexible design. Overall, however, the performance of FLASH can be competitive with the performance of the hardwired machine. Specifically, for a range of optimized parallel applications, the performance differences between the hardwired machine and FLASH are small, typically less than 10% at 32 processors and less than 15% at 64 processors. For these programs, either the processor cache miss rates are small or the latency of the programmable protocol processing can be hidden behind the memory access time.

Keywords and Phrases: distributed shared memory, cache coherence, message passing, scalable multiprocessors, communication protocols, parallel processing.


Copyright © 1997 by Jeffrey Scott Kuskin
All Rights Reserved

Acknowledgments

The FLASH design was a huge effort, especially in a university setting. A number of people were instrumental in carrying the design from concept to actual hardware. First, I would like to thank John Hennessy, my primary advisor. John's inspiration and confidence have been unceasing during my tenure at Stanford and his thorough review of this dissertation has greatly improved its quality. I also would like to thank Mark Horowitz, both for acting as my secondary advisor and for serving as the day-to-day manager of the FLASH hardware development effort. In addition, I thank John Cioffi for graciously agreeing to chair my Ph.D. orals committee and for serving on my dissertation reading committee.

I owe a huge debt of gratitude to the members of the FLASH hardware team, the group responsible for the design and implementation of the MAGIC chip and the FLASH system. Their dedication, talent, and sheer perseverance were extraordinary:

• Joel Baxter — Created much of the protocol code development environment and support tools.
• Jules Bergmann — Designed and implemented the protocol processor, the most complex unit on the MAGIC chip.
• John Heinlein — Developed the message passing protocol code.
• Mark Heinrich — Key developer of FlashLite and several cache coherence protocols. Designed and implemented the inbox.
• Hema Kapadia — Architect of the IO unit.
• Dave Nakahira — Designed and implemented the memory controller, the JTAG unit, and the diagnostic port. Single-handedly designed the FLASH node board.
• Dave Ofelt — Designed and implemented the PI and much of the HLL environment. Managed the diagnostic regression suite. Co-old-timer on the FLASH team.
• Jeff Solomon — MAGIC place and route, hardmac development, and other physical design tasks.
• Ron Ho and Evelina Yeung — Designed the data buffer memory array.

Many other students aided in the design and verification as well, including Jisu Bhattacharya, Luke Chang, Jeff Gibson, Ziyad Hakura, Richard Ho, and Rich Simoni. The MAGIC design also benefited from the contributions of the FLASH operating system group, under the leadership of Mendel Rosenblum. In addition, I would like to acknowledge the assistance of Mike Galles, Jim Laudon, and Dan Lenoski at Silicon Graphics. Mike Galles, in particular, supplied the LLP logic and cheerfully answered many questions from me about its operation and use.


A special thanks to my fellow research group members and (former) trailerites — Steve Woo, Chris Holt, and Andrew Erlichson — and to the members of the FLASH softball team.

I would like to thank my parents for their love, support, and encouragement throughout my Ph.D. career.

Most important, I would like to thank my wife, Diane. Being married to a seemingly perpetual student surely was not easy, and Diane's unfailing love, encouragement, and sense of humor have been constant reassurances on the long road to a degree.


Table of Contents

Chapter 1 Introduction ...... 1
  1.1 The FLASH Project ...... 2
  1.2 Research Contributions ...... 4
  1.3 Organization of the Dissertation ...... 4

Chapter 2 Background ...... 7
  2.1 Distributed Multiprocessing ...... 7
    2.1.1 Communication Paradigms ...... 9
    2.1.2 Message Passing versus Shared Memory ...... 9
    2.1.3 Distributed Cache Coherence ...... 13
  2.2 The DASH System ...... 15

Chapter 3 FLASH System Architecture ...... 17
  3.1 Motivation and Goals ...... 17
    3.1.1 FLASH System Features ...... 18
    3.1.2 Programmable Node Controller ...... 19
    3.1.3 Logic Overhead Reduction ...... 20
    3.1.4 Memory Overhead Reduction ...... 22
  3.2 FLASH Node Architecture ...... 23
  3.3 Interconnection Network ...... 24
  3.4 System Packaging ...... 27
  3.5 Node Controller Implementation Options ...... 28
    3.5.1 Trial 1: Off-the-Shelf Processor ...... 29
    3.5.2 Trial 2: Off-the-Shelf Processor with External Hardware Support ...... 35
    3.5.3 Trial 3: Custom Processor Core with Integrated Hardware Support ...... 37
  3.6 Summary ...... 40

Chapter 4 MAGIC Architecture and Implementation ...... 43
  4.1 MAGIC Architecture Overview ...... 43
  4.2 Detailed Description of MAGIC Functional Units ...... 49
  4.3 Message Headers ...... 50
    4.3.1 Message Header Format ...... 50
    4.3.2 Address Usage ...... 52
  4.4 Data Transfer Logic ...... 53
    4.4.1 Data Buffers ...... 54
    4.4.2 Data Buffer Allocator (DBA) ...... 55
    4.4.3 Memory Controller (MC) ...... 59
  4.5 Interface Units ...... 60
    4.5.1 Network Interface (NI) ...... 61
    4.5.2 Processor Interface (PI) ...... 65
    4.5.3 I/O Interface (IO) ...... 68
  4.6 Control Macropipeline ...... 71
    4.6.1 Inbox ...... 71
    4.6.2 Protocol Processor (PP) ...... 75
    4.6.3 Outbox ...... 78
  4.7 Additional Issues ...... 79
    4.7.1 Accelerating Unaligned Block Transfers: Double-Buffer Operations ...... 79
    4.7.2 Performance Monitoring ...... 83
    4.7.3 Error Handling ...... 86
  4.8 MAGIC Implementation ...... 87
    4.8.1 Implementation Technology ...... 87
    4.8.2 Implementation Flow ...... 88
    4.8.3 Implementation Challenges ...... 89
  4.9 Summary ...... 90

Chapter 5 MAGIC Operation: Protocols ...... 93
  5.1 Basic Cache Coherence Protocol Operations ...... 93
    5.1.1 Directory Usage ...... 94
    5.1.2 Servicing a Cache Read Miss: The Get Operation ...... 94
    5.1.3 Servicing a Cache Write Miss: The GetX Operation ...... 96
    5.1.4 Protocol Complications ...... 98
  5.2 Bitvector/Coarsevector Protocol ...... 99
    5.2.1 Bitvector/Coarsevector Directory Format ...... 99
    5.2.2 Bitvector Scalability Limitations ...... 101
    5.2.3 Coarsevector Mode ...... 102
  5.3 Dynamic Pointer Allocation Protocol ...... 103
    5.3.1 Dynamic Pointer Allocation Directory Format ...... 103
    5.3.2 Dynamic Pointer Allocation Complications ...... 106
  5.4 Example Protocol Processing Code Sequences ...... 109
    5.4.1 Protocol Code Development Environment ...... 109
    5.4.2 Bitvector/Coarsevector NILocalGet Code ...... 110
  5.5 Interaction Between MAGIC and the Operating System ...... 113
    5.5.1 OS-to-MAGIC Procedure Calls (PPCs) ...... 113
    5.5.2 MAGIC-to-OS Procedure Calls (OSPCs) ...... 115
  5.6 Summary ...... 116

Chapter 6 System Performance ...... 119
  6.1 Microbenchmarks ...... 119
    6.1.1 Simulation Methodology ...... 120
    6.1.2 Latency and Occupancy Results ...... 123
    6.1.3 Analysis of Latency and Occupancy Data ...... 126
  6.2 Application Simulation Methodology ...... 127
    6.2.1 Parallel Scientific Application Set and Simulation Environment ...... 128
    6.2.2 Multiprogrammed Workload Set and Simulation Environment ...... 130
    6.2.3 Three Simulated Machines ...... 133
  6.3 Parallel Scientific Application Performance ...... 134
    6.3.1 Barnes ...... 136
    6.3.2 FFT ...... 140
    6.3.3 FFT Performance with the Dynamic Pointer Allocation Protocol ...... 143
    6.3.4 LU ...... 143
    6.3.5 Ocean ...... 147
    6.3.6 Radix ...... 150
  6.4 Multiprogrammed Workload Performance ...... 153
  6.5 Summary: Performance Robustness ...... 155

Chapter 7 Related Work ...... 159
  7.1 Clustered Systems ...... 159
    7.1.1 IBM SP2 ...... 160
    7.1.2 Princeton SHRIMP ...... 161
    7.1.3 Network of Workstations (NOW) ...... 162
  7.2 Private Memory Systems ...... 163
    7.2.1 Paragon ...... 163
    7.2.2 MIT J-Machine ...... 164
  7.3 Non-Cache-Coherent Shared Memory Systems ...... 165
    7.3.1 Cray T3D and T3E ...... 165
    7.3.2 Tera ...... 167
  7.4 Cache-Coherent Shared Memory Systems ...... 168
    7.4.1 SGI Origin 2000 ...... 168
    7.4.2 Convex Exemplar and Data General NUMALiiNE ...... 168
    7.4.3 Wisconsin Typhoon ...... 169
    7.4.4 MIT Alewife ...... 171
    7.4.5 Sequent NUMA-Q ...... 171
  7.5 Software-Based Shared Memory Systems ...... 172
  7.6 Summary ...... 173

Chapter 8 Conclusions and Future Work ...... 175
  8.1 Node Controller Architecture Alternatives ...... 176
  8.2 Large System Design Issues ...... 178

References ...... 179

List of Tables

Table 4.1 Supported machine sizes ...... 52
Table 4.2 Data buffer allocator commands ...... 57
Table 4.3 Network message format, one doubleword of data ...... 62
Table 4.4 Network message format, one cache line of data ...... 62
Table 4.5 Total message size and data bandwidth ...... 63
Table 4.6 Example conflict cases ...... 67
Table 4.7 Jump table index field components ...... 73
Table 4.8 Jump table comparison field components ...... 73
Table 4.9 Effect of mask and match ...... 74
Table 4.10 Misc bus functionality ...... 77
Table 4.11 Jump table entry contents ...... 85
Table 5.1 Example PPC services ...... 115
Table 5.2 Example OSPC services ...... 116
Table 6.1 Microbenchmark protocol operations ...... 122
Table 6.2 Microbenchmark latency reductions for actual PP ...... 124
Table 6.3 Microbenchmark occupancy reductions for actual PP ...... 126
Table 6.4 FLASH and Origin 2000 read miss latencies ...... 127
Table 6.5 Scientific applications and problem sizes ...... 128
Table 6.6 Multiprogrammed workload set ...... 131
Table 6.7 Contentionless read miss latencies, system cycles ...... 135
Table 6.8 Barnes execution characteristics on FLASH ...... 138
Table 6.9 Barnes 32-processor CRML calculation ...... 139
Table 6.10 FFT execution characteristics on FLASH ...... 142
Table 6.11 FFT 32-processor CRML calculation ...... 142
Table 6.12 LU execution characteristics on FLASH ...... 146
Table 6.13 LU 32-processor CRML calculation ...... 146
Table 6.14 Ocean execution characteristics on FLASH ...... 149
Table 6.15 Ocean 32-processor CRML calculation ...... 149
Table 6.16 Radix execution characteristics on FLASH ...... 152
Table 6.17 Radix 32-processor CRML calculation ...... 152
Table 6.18 Engineering workload parameters ...... 153
Table 6.19 Engineering execution characteristics on FLASH ...... 154

List of Figures

Figure 1.1 Multiprocessor design space ...... 2
Figure 2.1 Bus-based multiprocessor system ...... 8
Figure 2.2 Distributed multiprocessor system ...... 8
Figure 2.3 Cache coherence problem ...... 12
Figure 2.4 Example directory implementation ...... 14
Figure 3.1 FLASH node architecture ...... 24
Figure 3.2 Spider router chip external interfaces ...... 25
Figure 3.3 Eight-processor interconnection network topology ...... 26
Figure 3.4 Message fragmentation into micropackets ...... 26
Figure 3.5 Processing node with off-the-shelf R3000 protocol engine ...... 32
Figure 3.6 Basic incoming message dispatch and processing loop ...... 32
Figure 3.7 Dispatch assembly code ...... 33
Figure 3.8 Bitfields in FLASH message header ...... 35
Figure 3.9 Integrated node controller architecture ...... 39
Figure 4.1 Message flow in MAGIC ...... 45
Figure 4.2 Data transfer logic ...... 45
Figure 4.3 Control macropipeline ...... 47
Figure 4.4 MAGIC message header format ...... 51
Figure 4.5 Address component example ...... 53
Figure 4.6 Memory system overview ...... 59
Figure 4.7 Example single-buffer, subblock-ordered memory read operation ...... 60
Figure 4.8 Network header micropacket format ...... 62
Figure 4.9 NI incoming logic overview ...... 63
Figure 4.10 NI outgoing logic overview ...... 64
Figure 4.11 PI overview ...... 65
Figure 4.12 IO overview ...... 69
Figure 4.13 Example double-buffer operation ...... 81
Figure 4.14 Memory layout of example unaligned block transfer ...... 81
Figure 4.15 Unaligned block transfer double-buffer operations and message sends ...... 82
Figure 5.1 Messages exchanged in servicing a simple Get operation ...... 95
Figure 5.2 Messages exchanged in servicing a three-hop Get operation ...... 96
Figure 5.3 Messages exchanged in servicing a simple GetX operation ...... 97
Figure 5.4 Messages exchanged in servicing a GetX operation to a Shared line ...... 98
Figure 5.5 Bitvector/coarsevector memory and directory organization ...... 100
Figure 5.6 Dynamic pointer allocation memory and directory organization ...... 104
Figure 5.7 Example of head link entry with attached sharing list ...... 106
Figure 5.8 Tool flow for protocol processor code development environment ...... 110
Figure 5.9 Bitvector NILocalGet handler C code ...... 111
Figure 5.10 Bitvector NILocalGet handler PP assembly code ...... 112
Figure 5.11 PPC address encoding ...... 114
Figure 6.1 Microbenchmark simulation environment ...... 121
Figure 6.2 Microbenchmark read miss latencies ...... 124
Figure 6.3 Microbenchmark read miss occupancies ...... 125
Figure 6.4 Scientific application simulation environment ...... 129
Figure 6.5 SimOS-based multiprogrammed workload simulation environment ...... 132
Figure 6.6 Barnes self-relative (top) and FLASH-relative (bottom) speedups ...... 137
Figure 6.7 FFT self-relative (top) and FLASH-relative (bottom) speedups ...... 141
Figure 6.8 FFT speedups with dynamic pointer allocation protocol ...... 144
Figure 6.9 LU self-relative (top) and FLASH-relative (bottom) speedups ...... 145
Figure 6.10 Ocean self-relative (top) and FLASH-relative (bottom) speedups ...... 148
Figure 6.11 Radix self-relative (top) and FLASH-relative (bottom) speedups ...... 151
Figure 6.12 Engineering FLASH-relative speedups (zero-latency disk model) ...... 154
Figure 6.13 FLASH-relative performance versus PP speed ...... 156

Chapter 1 Introduction

Within the last several years, multiprocessors have gained wide acceptance in the computer market. What were once esoteric machines reserved mostly for scientific computation are now employed routinely in general-purpose commercial, technical, and computer-aided design settings. Even manufacturers of low-end business and personal computers offer multiprocessor systems, heralding the movement of these systems into the highest volume market segment.

Despite their rising popularity, the vast majority of multiprocessors in use are small-scale systems containing just a few — perhaps four or eight — off-the-shelf microprocessors. Such systems are relatively easy to build, make cost-effective use of commercial microprocessor and system packaging technologies, and deliver significantly increased performance for many applications.

Although small-scale multiprocessors work well, their aggregate computational power is necessarily limited by the small number of processors. Many problems require much greater computational power than a small-scale system can provide, leading to an interest in constructing larger systems containing hundreds or even thousands of processors.

Like their small-scale counterparts, most large-scale multiprocessors employ off-the-shelf microprocessors to leverage the tremendous computational capability and cost/performance available. Beyond the use of commercial microprocessors, however, large-scale multiprocessor design differs substantially from the design of small-scale machines. Large-scale systems bring a variety of architectural, design, and implementation challenges, many of which are related to the way the processors communicate with each other and with memory. For example, a typical small-scale system achieves communication by connecting all processors and memory modules to a common bus, whereas large-scale systems use a much greater variety of communication mechanisms.

As Figure 1.1 illustrates, the overall multiprocessor design space can be organized as a hierarchy of communication mechanisms. Small-scale machines belong to the centralized memory portion of the hierarchy, often called uniform memory access (UMA) designs. As noted above, these systems connect all processors and memory modules to a common bus, resulting in a single, centralized memory that is equally accessible to all CPUs. For large-scale machines, scalability limitations preclude the use of a common bus. Instead, both the processors and memory modules are distributed throughout the system.


[Figure 1.1 Multiprocessor design space: multiprocessors divide into centralized memory (UMA) and distributed memory (NUMA) designs; distributed-memory machines divide into message passing and shared memory; shared memory divides into cache-coherent and non-cache-coherent systems.]

Because memory is distributed, a particular processor can access some memory locations more quickly than others, leading to a non-uniform memory access (NUMA) design.

Within the distributed memory framework, a number of additional communication options help to manage the potential performance loss caused by the system's NUMA characteristics. In a message passing machine, processors can communicate directly only with local memory. Communication with other processors or more distant memories must be requested explicitly in the application. Shared memory machines, in contrast, make all communication implicit. When a processor requests a memory location, the system fetches the needed value regardless of how long the access will take or which memory contains the data. Implicit communication also requires shared memory systems to cope with the cache coherence issue — whether processors are permitted to cache data after it has been retrieved from the memory system. Although caching has obvious performance advantages, implementing cache coherence is complex; hence the development of both cache-coherent and non-cache-coherent systems.

1.1 The FLASH Project

Initially, design and development of large-scale, distributed multiprocessors proceeded mostly in the research community. A number of systems were constructed [2, 33, 36, 65] and many of the tradeoffs inherent in the design options described above were explored, both through study of actual systems and by simulation. The success of these research efforts led to commercial acceptance and development of distributed multiprocessors [7, 16, 59, 64, 68, 94, 107]. Although some of the commercial systems support large processor counts, the focus of the commercial multiprocessor vendors is primarily on relatively modest system sizes.


Considerable interest remained in developing systems with hundreds or thousands of processors. Yet little is known about how to design systems of this size — tradeoffs appropriate for a few tens of processors might be undesirable at larger processor counts. The FLASH effort addresses this problem by providing a means to quantitatively evaluate design options for next-generation, large-scale distributed multiprocessors. Rather than concentrating on the smaller distributed systems that have been investigated thoroughly in the research community and are being developed actively by commercial vendors, FLASH concentrates instead on extending distributed multiprocessor design to explore the challenges of future architectures.

Many of these challenges concern scalability — that is, ensuring that the overall performance of a system continues to increase proportionally as more processors are added. Achieving a scalable system is difficult even at fairly small processor counts but is necessary to warrant the additional cost of a multiprocessor. With hundreds or thousands of processors, achieving a scalable system becomes even more formidable. The FLASH system is designed to allow investigation of scalability and several other research topics:

• What communication mechanisms should be provided? Message passing and shared memory each have proponents, and the facilities to implement these mechanisms are similar. Can a single machine provide effective support for both shared memory and message passing?
• How should communication protocols be designed to function efficiently on large-scale machines? Implementation of cache-coherent shared memory is particularly demanding as system size grows.
• What performance statistics and data collection facilities are desirable to evaluate application, operating system, and communication protocol performance?
• What hardware support — such as for synchronization — is needed to allow applications, as well as the operating system itself, to achieve good scalability and good overall performance at large processor counts?

The FLASH project addresses these issues by designing and implementing a multiprocessor system based on a programmable protocol engine, thereby providing support for many different communication protocols and allowing easy and flexible performance data collection. By providing this framework, the FLASH system — the acronym stands for FLexible Architecture for SHared memory — permits quantitative research on an actual machine into next-generation multiprocessor design issues. Without machines such as FLASH, these issues could be investigated only on a simulator (if at all), and often with significant compromises to allow the simulation to complete in a reasonable amount of time.


Design and development of a real machine capable of efficiently supporting a variety of communication protocols is central to the FLASH goals outlined above, and is the subject of this dissertation.

1.2 Research Contributions

The primary contributions of this dissertation are:

• Investigation of the architecture and implementation issues involved in the design of a large-scale multiprocessor making use of a flexible, programmable protocol engine. The goal of this investigation is a system that is flexible enough to support a variety of communication protocols yet offers performance competitive with a machine designed specifically for one communication protocol.
• The architecture and design of the MAGIC chip, a 750K-gate ASIC containing a programmable protocol engine, memory controller, network interface, processor interface, PCI controller, and significant additional support logic to ensure good system performance across a range of communication protocols.
• Analysis of the performance of the FLASH system on actual parallel scientific applications and multiprogrammed workloads. This analysis, based on a detailed simulation of the FLASH system, demonstrates that FLASH is scalable — it can achieve good speedups across a range of processor counts on a variety of applications — and that FLASH can achieve performance comparable to a system optimized for a single communication protocol.

1.3 Organization of the Dissertation

The beginning of this chapter presented a brief overview of the multiprocessor design space. Chapter 2 explores multiprocessor design in more detail, including a discussion of the cache coherence problem that is central to distributed shared memory systems. In addition, it describes the DASH system. DASH, also developed at Stanford, is the predecessor to FLASH and serves as the starting point for much of the FLASH research.

Chapter 3 describes FLASH in more detail, focusing specifically on the motivation behind and evolution of the MAGIC chip, FLASH's flexible protocol engine. The chapter also presents an overview of the FLASH node board, system packaging, and interconnection network.


Chapter 4 describes the internal architecture of MAGIC in greater detail, beginning with an overview of the chip's architectural features and continuing with a detailed discussion of the functional units. The chapter concludes with a discussion of how MAGIC was implemented.

Chapter 5 focuses on how MAGIC is used. The chapter describes two cache coherence protocols developed for use on FLASH and provides a demonstration of how the architectural goal of supporting multiple communication protocols via MAGIC's flexible protocol engine is achieved in practice.

Along with supporting multiple communication protocols, an equally important goal of the FLASH design is to achieve performance competitive with a machine optimized for a single communication protocol. Chapter 6 investigates FLASH system performance using microbenchmarks as well as complete parallel scientific applications and a collection of multiprogrammed workloads.

Chapter 7 contrasts a number of system designs, from both commercial vendors and academia, to FLASH. These systems span a range of architectural styles and design goals.

Concluding the dissertation, Chapter 8 summarizes the major results of this research and discusses some potential areas for future work.


Chapter 2 Background

This chapter first presents an introduction to distributed multiprocessors and some of the associated design challenges, including how to achieve a scalable system and the mechanisms for communicating among the processing nodes. Next, the chapter describes the DASH system, the predecessor to FLASH. The description outlines the basic system architecture of DASH and identifies a number of limitations resulting from its design and implementation. These limitations serve as motivation for some aspects of the FLASH system design, to be presented in Chapter 3.

2.1 Distributed Multiprocessing

Traditionally, small-scale multiprocessors have been implemented using a "snoopy" bus architecture, illustrated in Figure 2.1, in which all processors and memory modules are connected to a common, shared bus, thereby allowing a memory request issued by one processor to be observed by all other processors and memories. Each processor compares the address of the memory request to the contents of its cache, and the system maintains cache coherence by combining the results of these comparisons to determine whether the latest copy of the requested data resides in memory or in one of the processor caches. Using a shared bus simplifies the implementation of cache coherence both by allowing the processors and memories to communicate and by providing the broadcast mechanism necessary to implement snooping. Unfortunately, the use of a bus also makes this style of multiprocessor non-scalable. As more and more processors are added to the system, the difficulties of proportionally increasing bus bandwidth while maintaining low snoop response times severely limit any increase in overall system performance.

To overcome the scalability limitations of bus-based multiprocessors, scalable multiprocessors replace the shared bus with a point-to-point interconnection network, resulting in a distributed system containing a collection of processing nodes connected by a point-to-point communication network, as shown in Figure 2.2. Each processing node contains a compute processor with its cache, a portion of the machine's distributed memory, and a node controller (NC). The node controller is responsible for managing all communication both among the components on a single processing node and between processing nodes via the interconnection network.
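To make the contrast concrete, the sketch below models, under illustrative assumptions only (the block size, the modulo home mapping, and the snoop callback type are invented for this example and are not FLASH or DASH parameters), where a memory request is resolved in each organization: a bus-based machine broadcasts the request and combines snoop responses, while a distributed machine forwards the request to the home node that owns the address.

    /* Illustrative sketch only: where a memory request is resolved.
     * BLOCK_BYTES, the modulo home mapping, and snoop_fn are assumptions
     * made for this example, not parameters of FLASH or DASH. */
    #include <stdint.h>

    #define BLOCK_BYTES 128u                 /* assumed coherence unit */

    /* Bus-based (UMA): every cache snoops every request. */
    typedef int (*snoop_fn)(uint64_t addr);  /* nonzero => dirty copy held */

    int any_dirty_copy(uint64_t addr, snoop_fn snoop[], int ncaches)
    {
        for (int i = 0; i < ncaches; i++)
            if (snoop[i](addr))
                return 1;                    /* a cache owns the latest copy */
        return 0;                            /* memory holds the latest copy */
    }

    /* Distributed (NUMA): the request is seen only by the local node
     * controller, which forwards it to the home node of the address. */
    int home_node(uint64_t addr, int nnodes)
    {
        return (int)((addr / BLOCK_BYTES) % (uint64_t)nnodes);
    }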


[Figure 2.1 Bus-based multiprocessor system: processors 0 through N-1, each consisting of a CPU and cache, share a snoopy bus with the memory and I/O modules.]

[Figure 2.2 Distributed multiprocessor system: nodes 0 through N-1, each containing a CPU, cache, memory, and node controller (NC), connected by an interconnection network.]

Unlike in a bus-based system, memory requests issued by a compute processor in a distributed system are visible only to the local node controller rather than to all processors and memories. To achieve a usable multiprocessor system, then, the processing nodes must communicate; without communication, the system would appear as just a collection of isolated processing nodes with no ability to utilize memory, computation capability, or other resources from the other processing nodes.


2.1.1 Communication Paradigms

The convention the processing nodes use to communicate is called a communication paradigm. The long latency required to communicate among nodes in a distributed multiprocessor makes the choice of communication paradigm critical to overall system performance. A paradigm must provide mechanisms to manage and tolerate this latency as well as the means to achieve high global data bandwidth. In addition, the paradigm needs to export a useful interface to the programmer so that parallel applications can be developed to fully exploit the capabilities of the system.

Two major communication paradigms are in use today: message passing and shared memory. In a message passing system, each compute processor has direct access only to its local memory, that is, only to the memory stored on its local processing node. Communication between nodes must be requested explicitly in a message passing system, typically by augmenting application code with calls to a library of communication routines that permit user-specified data to be exchanged among the processing nodes' memories.

In a shared memory system, all of the physically distributed memories on the processing nodes are made to appear as a single, unified address space. Each compute processor can access any memory location in the system via normal load and store instructions, and the node controllers communicate to transfer the necessary data when a processor requests data not stored in its node's local memory. In this way, communication among processing nodes in a shared memory system is implicit and occurs whenever a compute processor references a remote address, thereby preserving the illusion that all of system memory is equally accessible to all processors.
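The code below sketches the difference as it appears to the programmer. It is illustrative only: msg_send() and msg_recv() stand in for whatever communication library a message passing system supplies (they are not FLASH interfaces), while the shared memory version simply reads an array that may reside in a remote node's memory.

    /* Illustrative sketch: explicit versus implicit communication.
     * msg_send()/msg_recv() are hypothetical stand-ins for a message
     * passing library; they are not FLASH or MAGIC interfaces. */
    #include <stddef.h>

    extern void msg_send(int dest_node, const void *buf, size_t len);
    extern void msg_recv(int src_node, void *buf, size_t len);

    /* Message passing: the transfer is requested explicitly and doubles
     * as synchronization between producer and consumer. */
    void producer_mp(int consumer, const double *row, size_t n)
    {
        msg_send(consumer, row, n * sizeof(double));
    }

    void consumer_mp(int producer, double *row, size_t n)
    {
        msg_recv(producer, row, n * sizeof(double));  /* blocks until the data arrives */
    }

    /* Shared memory: ordinary loads suffice; the node controllers fetch
     * remote data automatically when a reference misses locally. */
    double sum_shared(const double *row, size_t n)    /* row may point to remote memory */
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += row[i];
        return s;
    }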

2.1.2 Message Passing versus Shared Memory

Message passing systems as well as distributed shared memory systems have been built, both in academia and commercially [3, 47, 53, 65, 80, 94]. A natural question is whether one communication paradigm is superior to the other. This question is difficult to answer definitively since each paradigm has advantages and drawbacks, depending on the size of the machine and the nature of the applications it is running.

2.1.2.1 Message Passing

Typically, message passing machines are regarded as more easily scalable. Systems containing hundreds of processors are constructed routinely [47, 53], and a message-passing system with several thousand processors became operational recently at Sandia National Laboratory [54]. The explicit nature of inter-node communication tends to ease construction of large message-passing systems. In addition, providing the programmer with explicit control over inter-node communication is sometimes regarded as an advantage in constructing large-scale programs to run on these systems.


Explicit communication also has the advantage of combining data transfer with synchronization — when a message is received from another node, the data it contains is by definition ready for consumption by the receiving processor and is not subject to further modification by the sender. A separate mechanism for determining when the data may be used safely is not needed. In addition, explicit requests for data transfer decouple the actual data transfer process from the computation needed to formulate the data to be sent. As a result, data transfer and computation often can be overlapped: the node controller accepts responsibility for data transfer, thereby improving performance by allowing the compute processor to proceed with further computation while the data transfer occurs.

The major drawback of the message passing paradigm is its inefficiency with unstructured algorithms and fine-grained communication. By making communication explicit, the cost of invoking inter-node communication is higher. Best performance typically is achieved only if the amount of data to be transferred is large, the need to perform the transfer is known well before the recipient requires the data, and the data to be sent can be identified early. Usually, none of these three requirements is met in an unstructured or fine-grained code. For instance, a tree-based algorithm such as the Barnes-Hut N-body code [48] communicates individual tree node data structures among the processors. These tree nodes are typically only a hundred bytes or so in size, and thus incur significant data transfer overhead. Worse, the exact tree nodes that need to be transferred can be identified only just before their contents are needed for computation, greatly reducing overall application performance by effectively serializing communication and computation rather than allowing the two to be overlapped.

2.1.2.2 Shared Memory

Shared memory, in contrast, supports fine-grained, unstructured communication very efficiently. Because communication is invoked implicitly when a processor requests remote data, data transfer is handled directly by the system hardware, minimizing overhead. More important, the shared memory paradigm integrates communication directly into the system programming model, eliminating the need to explicitly request and manage communication and easing the programming effort both in terms of algorithm development and in terms of implementation by the programmer and compiler.

Furthermore, the implicit communication of the shared memory paradigm naturally admits caching into the system implementation. Modern processors depend on local, closely-connected caches to improve performance.


From a more global perspective, caches are crucial for maximizing inter-node communication bandwidth and for minimizing the latency to access remote data by exploiting the spatial and temporal locality present in both local and remote memory references. Caches also permit data prefetching [75], another critical mechanism for reducing the effective remote memory access latency.

2.1.2.3 Cache-Coherent Shared Memory

Because of the critical role caches play in maximizing inter-node communication bandwidth, minimizing remote access latency, and improving overall system performance, most shared memory systems extend the shared memory paradigm to provide what is known as cache-coherent shared memory. In a cache-coherent system, not only are all the physically distributed memories made to appear as a single, unified address space, but each processor's view of this memory is kept consistent. In other words, these systems recognize that when a processor requests a memory location, the corresponding data will be stored in that processor's cache, thereby allowing subsequent accesses to retrieve the data from the processor's cache rather than from a remote memory.

Caching complicates the shared memory paradigm in two ways. First, if a processor updates (writes to) a memory location in its cache, then the most up-to-date version of that memory location is in the processor's cache and not in memory. Subsequent requests for the memory location must be satisfied from the cache and not from memory. Second, several processors can cache the same memory location simultaneously. Thus when a processor updates a memory location, all other copies of that location must be removed from the processor caches, or else a processor could continue to read the old value of the memory location from its cache rather than the updated value. Figure 2.3 illustrates the cache coherence problem, showing in the top half of the figure two processors (P0 and P2) reading and caching the same memory location X. When P0 updates X, as shown in the bottom half of the figure, the new value is stored in P0's cache and, potentially, propagated to memory as well. The old value of X remains in P2's cache, however, and subsequent reads of X by P2 will retrieve the old value from its cache rather than the updated value. Cache-coherent systems must solve this problem and ensure, in this example, that P2 is notified when P0 updates location X.

The need to determine where the most up-to-date copy of a cache line is stored is common to both bus-based and distributed systems. Indeed, tracking the location of cache lines efficiently and quickly as the system size grows is a significant challenge and is the root of the most prominent criticism of distributed cache-coherent systems: that they are not easily scaled, particularly not to the processor counts achieved with message passing systems.


[Figure 2.3 Cache coherence problem: in the top half, P0 and P2 both read location X and cache the value 5; in the bottom half, P0 writes X = 7, and the stale value 5 remains in P2's cache.]

However, if these scalability concerns can be addressed, then an important advantage of distributed cache-coherent shared memory can be realized: it preserves the same system programming model provided on small-scale, bus-based multiprocessors. As a result, the same familiar development and execution environment is available on large-scale systems, even though the underlying system architecture is very different. Programmers can develop code that will run unchanged on both bus-based and distributed systems, and users need not be concerned with managing different binaries for the two systems. Of course, optimal performance on large-scale distributed systems often demands algorithmic modifications to adapt to the non-uniform local and remote memory access latencies. However, these changes are optimizations rather than requirements.

2.1.2.4 Integrated Message Passing and Cache-Coherent Shared Memory

Message passing and cache-coherent shared memory each have advantages and disadvantages both at the system level and for the programmer. Most systems choose just one of these paradigms, forcing the programmer to accommodate the particular paradigm the system provides. A more attractive option is to provide integrated support for both message passing and cache-coherent shared memory in a single system, allowing the programmer to reap the benefits of each.


The challenge at a system level is integrating the two paradigms so that both message passing operations and cache coherence can be used simultaneously without requiring the programmer to designate separate memory regions for message passing and cache coherence. This integration is key if a combination of the two paradigms is to be used in a single application, thereby allowing the bulk data transport and communication-computation overlap advantages of message passing to be combined with the fine-grained communication efficiency of cache coherence.

Particularly as system size grows, the tradeoffs between cache-coherent shared memory and message passing are unclear. Providing both paradigms allows exploration into how to combine the two in a single application and whether overall application performance increases as a result. One of the primary goals of FLASH, as will be described in Section 3.1, is to provide efficient, integrated support for both cache-coherent shared memory and message passing while not compromising the individual performance of either paradigm.

2.1.3 Distributed Cache Coherence

Use of a point-to-point interconnection network poses particular challenges for implementing cache-coherent shared memory. Although connecting the processing nodes in this manner yields a scalable system, it also eliminates the natural broadcast capabilities of a shared bus that the snoopy bus-based system requires to implement cache coherence. Instead, scalable systems must use a distributed cache coherence protocol that relies on explicit node-to-node messages rather than system-wide broadcast to maintain cache coherence. Fundamentally, these systems must track sharing information; that is, which memory locations each processor is caching. Sharing information is kept in an auxiliary data structure called a directory, illustrated in Figure 2.4.

Like the system's memory, the directory information is distributed among the processing nodes to avoid the performance bottleneck of a single, monolithic directory. Each node maintains a directory corresponding to the locations in that node's local memory; Figure 2.4 shows an example of one node's directory contents. The directory consists of a collection of directory entries, one for each memory block in the node's local memory. Because the processor caches interface to the system at a cache line granularity — that is, each processor cache miss or writeback transfers a single cache line of data between memory and the cache — the size of the memory block tracked by each directory entry is usually one cache line. In its simplest form, a directory entry contains two fields: a state indication and a presence bit vector. The state indication specifies whether the memory line associated with the directory entry is held shared (i.e., read-only) in one or more caches or whether it is held exclusive (i.e., with read/write permission) in a single processor's cache.


[Figure 2.4 Example directory implementation: each cache line in a node's local memory has a directory entry consisting of a state field and a presence bit vector; the example entry is in state SH (shared) with presence bits set for processors 1, 2, 4, and 6.]

In both cases the presence bit vector indicates which processors are caching the memory line; if the memory line is held exclusive, only one presence bit may be set. The directory entry in Figure 2.4 shows a case in which the corresponding memory line is held shared, indicated symbolically by the "SH" in the state field, in the caches of processors 1, 2, 4, and 6, as indicated by the presence bit vector.

When a memory request arrives at a processing node, the node controller retrieves the corresponding directory entry to determine what additional actions are required to service the request. In the case of Figure 2.4, for example, if processor 3 requested exclusive access to the memory line, the memory line first must be removed, or invalidated, from all processor caches currently holding it. In a bus-based system, these invalidations would be performed automatically when processor 3's exclusive request was issued on the snoopy bus. In a distributed system, however, the node controller must consult the presence bit vector to determine that explicit invalidation messages need to be sent to processors 1, 2, 4, and 6. This case is just one example of the operation of a distributed cache coherence protocol. In practice, these protocols are complex, especially because so many race conditions can occur as a result of the lack of a shared bus to serialize all processors' memory requests. More details of distributed cache coherence protocol implementation can be found in [46, 65].
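A minimal sketch of this simple state-plus-presence-bits directory entry is shown below. The field widths, names, and the send_invalidate() routine are illustrative assumptions for this example; they are not the actual FLASH directory formats or handler interfaces, which are described in Chapter 5.

    /* Illustrative sketch of the simple directory entry described above.
     * Field widths, names, and send_invalidate() are assumptions made for
     * this example, not the FLASH directory formats of Chapter 5. */
    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       presence;   /* bit p set => processor p caches the line */
    };

    extern void send_invalidate(int proc, uint64_t line_addr);  /* hypothetical */

    /* On an exclusive (write) request, invalidate every current sharer
     * other than the requester, then record the new owner. */
    void grant_exclusive(struct dir_entry *e, uint64_t line_addr, int requester)
    {
        for (int p = 0; p < 64; p++)
            if (p != requester && ((e->presence >> p) & 1u))
                send_invalidate(p, line_addr);

        e->state    = DIR_EXCLUSIVE;
        e->presence = 1ull << requester;   /* only the new owner remains */
    }

In the Figure 2.4 example, a call on behalf of processor 3 would send invalidations to processors 1, 2, 4, and 6 before granting exclusive access.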


2.2 The DASH System

Section 2.1 motivated the need for a distributed multiprocessor system architecture to overcome the scalability limitations of the traditional bus-based design. Like their bus-based counterparts, distributed multiprocessors must provide a cache-coherent shared memory programming model, yet the lack of broadcast and snoop capability in a distributed system significantly increases the complexity of the cache coherence implementation.

The goal of the Stanford DASH system [65, 66, 67], the predecessor to FLASH, was to demonstrate that, despite these complexities, a cache-coherent distributed shared memory (CC-NUMA) system could be constructed. Although the foundations of CC-NUMA multiprocessor design — such as the use of a directory to track caching information and the basic operation of a distributed cache-coherence protocol — were developed before the DASH design started [18, 114], no actual CC-NUMA systems had been built. Of particular concern was whether a CC-NUMA system could be implemented at a reasonable total cost, and whether good overall system performance and cost-performance could be achieved as the system size scaled. The primary goal of DASH was to construct a prototype system to demonstrate that a cost-competitive, scalable CC-NUMA system could be built. As a consequence of this goal, the DASH system employed a simple, straightforward directory structure and implemented the entire cache-coherence protocol and all directory processing completely in hardware. A 16-processor DASH system became operational in early 1992.

A number of researchers used the DASH system to investigate application and operating system performance on a CC-NUMA system [72, 104]. The existence of DASH made studies such as these, using an actual machine, operating system, and realistic problem sizes, possible. At the same time, these studies exposed a limitation of the DASH design: inflexibility. In other words, the DASH system implemented one particular directory structure and a single cache-coherence protocol, all in a set of hardwired state machines and datapaths within the node controller that could not be changed. The comment that DASH is inflexible is not a criticism of its design; rather, the decisions to use a simple directory structure and cache-coherence protocol and to implement all protocol processing in a hardwired fashion were key factors in achieving a design that was buildable and could demonstrate the feasibility of CC-NUMA multiprocessors. A more complex design was not needed and in fact would have been undesirable. However, to a much greater extent than on a bus-based multiprocessor, performance on a CC-NUMA system is a function not only of the underlying application algorithms and data structures, but also of the interactions between the algorithms and data structures and the system's cache coherence implementation.


The inability to alter DASH's directory organization and cache coherence protocol made study of the interaction of the system, operating system, and applications very difficult.

The DASH prototype successfully demonstrated the feasibility and scalability of a CC-NUMA system. While the success of DASH encouraged development of commercial CC-NUMA systems, it also meant that further research efforts with a proof-of-concept goal similar to DASH's were unnecessary. At the same time, both the success of DASH and the inflexibility of its design motivated the exploration of a new, flexible multiprocessor architecture that took the feasibility of CC-NUMA as a given and instead focused on supporting the exploration of other aspects of large-scale multiprocessor design. The FLASH machine is designed to achieve this goal.

The major limitation of the DASH design, as noted above, was the use of a single, hardwired cache coherence protocol. DASH also supported only a limited set of predefined synchronization primitives, all completely hardwired, some of which were used extensively and some not used at all [65]. In addition, DASH provided only limited performance monitoring facilities. These facilities were implemented mostly in a field-programmable gate array (FPGA), making them hard to modify without knowledge of hardware design and FPGA tools. The FPGA could monitor only a predefined set of inputs, thus limiting the set of possible performance measurements, and all performance data was stored in a dedicated memory, limiting the amount and duration of data that could be collected. Furthermore, even though DASH achieved good performance and good cost-performance, the logic and memory overhead of adding cache coherence to the system was significant — about 20% [65] — and added measurably to the total system cost. Finally, DASH supported only the cache-coherent shared memory communication paradigm; it provided no support for message passing.

FLASH seeks to build on the success of DASH by overcoming the limitations identified in DASH, by extending the range of system sizes that perform well using cache-coherent shared memory, and by allowing much more flexible and comprehensive research into large-scale, distributed multiprocessor design. The next chapter describes the specifics of the FLASH design goals in more detail.


Chapter 3 FLASH System Architecture

This chapter discusses the rationale behind the FLASH system architecture and the internal architecture of the MAGIC component. It first describes a number of high-level goals the FLASH system is designed to address. Next it presents the FLASH node architecture, followed by a description of a series of performance studies that helped shape the MAGIC architecture.

3.1 Motivation and Goals

In addition to the limitations of DASH described in Section 2.2, the FLASH architecture was guided by the need to explore and quantitatively evaluate the design issues involved in constructing large-scale multiprocessors. Four major research issues were identified: communication paradigms, cache coherence implementation, scalability, and reliability.

First, what communication paradigms, or programming models, should a machine support? Cache-coherent shared memory and message passing are the two dominant communication paradigms in use, and research focused on the relative performance and ease of implementation of the two paradigms, as well as on their scalability and whether integrated support for both paradigms in a single system is desirable.

Second, how should cache coherence be implemented? Debate centered on the relative merits of a cache-only memory architecture (COMA) versus a more traditional CC-NUMA approach [55, 97], but also included investigation into variations on the basic write-invalidate protocol used in DASH, variations on directory data structure formats [1, 19, 22, 95, 108, 112], and the desirable set of synchronization primitives needed for both small- and large-scale systems.

Third, scalability research focused on how the performance of applications, the operating system, and the communication protocols scales as the machine size grows. The goal of FLASH in this area is to design a system that can scale, potentially, to a total processor count on the order of 1000. Little is known about the performance and scalability characteristics of large-scale machines running large parallel applications. In addition, even on existing smaller-scale machines, performance data applicable to large-scale machines is difficult to collect because these machines are not designed to scale to large numbers of CPUs, often contain almost no performance monitoring facilities (DASH was a notable exception), and provide few if any tools to analyze whatever performance data can be collected.


Fourth, research focused on improving the reliability of large-scale systems. Distributed multiprocessors, in particular those providing a shared memory communication paradigm, suffer from a lack of reliability. Failure of a single processor or of the operating system itself usually necessitates a complete system reboot, even though most of the system is still operational. The most common defense against this problem is periodic checkpointing within the application followed by a restart from the checkpoint after the system has crashed and been rebooted. A more attractive solution is to eliminate the need to reboot after a single component failure, and the hardware and protocol features needed to support this goal were at the center of this research question.

3.1.1 FLASH System Features

With these four issues and the DASH limitations in mind, the next step was to determine what machine features were needed to support large-scale multiprocessor research. Clearly the system had to support multiple communication paradigms and the protocols to implement them. Such support enables the evaluation of the tradeoffs of cache-coherent shared memory versus message passing as well as protocol variation studies. Hence, support for both cache-coherent shared memory and message passing was required, and ideally the system would provide this support in an integrated fashion so that the two communication paradigms could be used simultaneously, potentially by a single application.

In addition to supporting multiple protocols, the system also needed to support a range of performance monitoring features. Because a multiprocessor operating environment is complex, the ability to gather accurate data about system performance is critical for understanding application behavior and for tuning applications and protocols to maximize overall performance. The system performance monitoring facilities must be comprehensive enough to allow results to be quantified to varying, and potentially very fine, levels of detail, and they must allow almost arbitrary performance data to be collected so as to satisfy the often incompatible needs of architecture and protocol research (typically very low-level data), operating system research (mid-level data), and application research (low-, mid-, and high-level data). A performance monitoring implementation that permitted only a single, predefined set of performance data to be collected would have been inadequate. Flexibility to collect a variety of performance data was needed in the same way that flexibility was needed to support multiple protocols.


Most important, the system needed to support multiple protocols and performance monitoring while maintaining good performance across all feature variations. The exact meaning of the term “good” in the previous sentence is crucial. For FLASH, good perfor- mance is defined as performance that is competitive with a machine that retains the basic FLASH system architecture but implements a single protocol completely in hardware. If the DASH design were updated to the same technology as FLASH, for example, it would satisfy these criteria and its performance could be used as a reference for a FLASH system configured to use the same protocol and directory organization. Achieving competitive performance is perhaps the biggest design challenge, yet it is critical if the system perfor- mance results are to be believable and applicable to other architectures. Designing a sys- tem that can meet the design goals outlined above yet has poor overall performance is relatively easy. However, such a machine is of questionable research value since its perfor- mance would be far lower than that of a comparable hardwired machine and any perfor- mance data or tradeoffs observed would be, in general, inapplicable to hardwired implementations of the same architecture.

3.1.2 Programmable Node Controller

Supporting multiple protocols is difficult with a fixed, hardwired node controller. At least some parts of the system must be flexible (that is, programmable), so that the same underlying hardware can be used to implement multiple protocols. A programmable protocol engine, which allows the protocol to be tailored to the researcher’s needs simply by changing the code the programmable protocol engine executes, provides this needed flexibility. The engine’s ability to support multiple protocols also allows support for both message passing and cache-coherent shared memory, again by writing the appropriate code for the protocol engine. As noted earlier, one goal of FLASH is to provide integrated support for these communication paradigms; the programmable engine allows this support to be provided without implementing separate, dedicated message passing and cache coherence hardware. Equally important is the ability to support a range of performance monitoring facilities; these can be provided by inserting the appropriate performance data collection code into the normal protocol code. This approach allows the performance data collection to be specialized to the protocol in use and permits the type and level of detail of this data to be changed.

Even ignoring the goal of supporting multiple protocols, the use of a programmable engine offers two other advantages: it makes the system easier to debug and it may reduce design time. Debugging is eased because protocol bugs can be fixed even after the machine has been built, and because software workarounds for hardware bugs are possible


even at the lowest levels of system operation. Also, the same mechanisms used to collect performance data can be used to detect errors in system operation (“assert” statements in protocol code, for example) and to collect, filter, and analyze debugging information that later can be exported to the designers. Design time may be reduced because much of the complexity of implementing a protocol can be moved into software, thereby potentially reducing the overall amount of hardware to be designed, implemented, and tested, and reducing the risk that undetected protocol bugs will necessitate chip respins. Finally, a programmable node controller offers advantages for improving the reliabil- ity of large-scale systems. Recovery from component failure is very difficult in a shared memory system because application and operating system state tends to be dispersed throughout the machine. Complex algorithms are needed to perform coordinated recovery among the operational processing nodes remaining after a failure. The availability of a programmable node controller allows some of this recovery code to be executed on the node controller itself, which is often a significant advantage since the memory system state and other very low-level information is directly accessible to the node controller. Fur- ther discussion of this topic is beyond the scope of this dissertation; Rosenblum [87] cov- ers the subject in detail. Because a programmable protocol engine can satisfy the FLASH design goals noted above while simultaneously potentially simplifying system implementation and improving reliability, the first FLASH design decision was to use such an engine as the core of the FLASH node controller.
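
As a concrete illustration of the preceding two points, the sketch below shows how event counting and an “assert”-style check might be folded directly into a handler; every name in it (the handler, the counters, PROTO_ASSERT, and the helper routines) is hypothetical rather than actual FLASH protocol code, and it is only a minimal sketch of the idea.

#include <stdint.h>

/* Minimal stand-in types; a real protocol would define these elsewhere. */
typedef struct { int state; uint64_t presence; } DirEntry;
typedef struct { int type; uint64_t addr; int src; } Message;
enum { DIR_INVALID = 0, DIR_SHARED, DIR_DIRTY };

extern uint64_t read_cycle_counter(void);            /* assumed timer access */
extern void     record_protocol_error(const char *); /* assumed error logger */

#define PROTO_ASSERT(cond, msg) \
    do { if (!(cond)) record_protocol_error(msg); } while (0)

static uint64_t remote_read_count;   /* event counter kept in ordinary memory */
static uint64_t remote_read_cycles;  /* accumulated handler occupancy         */

void handle_remote_read(DirEntry *dir, Message *msg)
{
    uint64_t start = read_cycle_counter();

    PROTO_ASSERT(dir->state != DIR_INVALID, "remote read of invalid line");

    /* ... normal protocol actions: update the directory, send the reply ... */
    (void)msg;

    remote_read_count++;
    remote_read_cycles += read_cycle_counter() - start;
}

Because the instrumentation is just protocol code, it can be added, removed, or specialized per protocol without any hardware change, which is exactly the flexibility argued for above.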

3.1.3 Logic Overhead Reduction

A second goal of FLASH is to overcome, or at least reduce, the logic and memory overhead — estimated at 20% for DASH — associated with implementing cache coherence. In DASH this overhead had two main sources. First was the additional logic required to add support for distributed cache coherence to an existing bus-based multiprocessor system. Necessary changes in DASH included the addition of the directory controller and reply controller, which together implemented all protocol and directory processing, as well as relatively minor modifications to the existing CPU interface logic, memory controller, and bus arbitration logic to support operations, such as forced bus retries, required in a distributed multiprocessor but not in the original bus-based system.

FLASH reduces logic overhead in two ways. First, in DASH, the memory controller, system bus control and arbitration logic, and I/O controller were implemented using MSI technology on a collection of several printed circuit boards. By taking advantage of progress in technology and the capability and capacity of ASICs,


FLASH implements all this functionality on a single chip, namely the MAGIC node con- troller. Although this reduction of several printed circuit boards to a single node controller chip has obvious cost, performance, reliability, and packaging density advantages, it does not, in itself, reduce the logic overhead of cache coherence, since any contemporary bus- based multiprocessor system, or even a uniprocessor system, would use a similar approach. However, FLASH goes one step further and integrates all protocol processing on the node controller as well. This integration of the memory controller, I/O controller, bus controller, and other logic necessary for any bus-based multiprocessor or uniprocessor on the same chip as the additional protocol processing logic needed to add support for distributed cache coherence is the key to reducing the logic overhead. Whereas in DASH the addition of two extra printed circuit boards was needed to add distributed cache coherence support, FLASH has no visible additional hardware. Instead, the additional functionality is integrated into the same node controller chip that would be required even if the underlying system were just a bus-based multiprocessor or even a uniprocessor. Thus, FLASH reduces logic overhead in two ways: by taking advantage of integrated circuit technology to implement on a single node controller chip logic that required several printed circuit boards on DASH and by adding protocol processing functionality to this chip as well. Even with the single-chip approach, the logic overhead of cache coherence in FLASH is not zero. Adding protocol processing functionality to the node controller complicates its design and adds to the overall size of the chip, increasing both the raw gate count and die area. The important distinction is that the overhead in FLASH is a marginal overhead, added to a component that is already needed in the system. As a result, although the initial node controller chip design effort is larger, the ultimate difference in chip cost is likely to be minimal and, in any event, will be less than the cost of adding additional, separate chips to the processing node. Further reducing overhead, the same protocol engine that was added to the node controller to implement distributed cache coherence also can be used to manage communication among the components within the local processing node. In this way, the addition of the protocol engine eliminates the need for some of the control and sequencing logic that would have been required in the bus-based or uniprocessor version of the node controller. The combination of memory controller, I/O controller, and protocol engine also allows a very close coupling of the compute processor, local memory, local I/O subsystem, and interconnection network, thereby improving overall system performance in addition to minimizing logic overhead. Because of the functionality integrated on the node controller, and because the protocol engine is an addition to a chip that is needed anyway, quantita-


tively estimating the logic overhead of implementing cache coherence in FLASH is diffi- cult. In contrast, DASH’s directory controller and reply controller were physically separate printed circuit boards, and their gate count (or IC count) could be used directly to compute the logic overhead.

3.1.4 Memory Overhead Reduction

The second source of overhead in DASH was memory overhead resulting from the use of dedicated directory storage. DASH’s bitvector directory structure requires a 20-bit directory entry for each 16-byte (128-bit) cache line in a node’s memory, resulting in a directory memory overhead of (20 / 128), or approximately 15%. Additional memory overheads include the remote access cache (RAC), with a total size of 176KB, and the dedicated trace buffer used to store performance data, with a total size of 8MB. Unlike the directory memory, the size of the latter two structures is constant; it does not vary with the size of the node’s local memory.

FLASH seeks to overcome these memory overheads by eliminating the distinction between protocol storage and memory used for the operating system and applications. Each FLASH node provides a single main memory that is used both to store directory information and other protocol state and code for use by the protocol engine, as well as to store the traditional operating system and application data for use by the compute processor. Of course the use of a single memory does not actually eliminate the directory memory overhead present in DASH. If FLASH were to use the same 20-bit directory entry format as DASH did, it would require an equivalent amount of directory storage space. The difference is that this space would be allocated from a resource already needed on the processing node; no additional, dedicated memory is required. In addition, the fraction of main memory allocated for protocol processor use can vary dynamically. This capability is crucial given FLASH’s support for multiple protocols; without it, an amount of memory large enough to store the largest protocol data structure of any anticipated protocol would have to be provided by additional, dedicated memory or reserved permanently in the node’s main memory. Just as important, the ability to dynamically change the fraction of main memory reserved for protocol engine use allows performance data, as well as protocol state, to be stored in main memory, thereby overcoming the limitations of DASH’s fixed-size trace buffer. Also, the protocol processor and compute processor each have access to both protocol storage as well as operating system and application data structures, facilitating the sharing of information and easing the interaction of the operating system and protocol engine. Overall, dynamically using a fraction of main memory for protocol storage and


performance data avoids arbitrary limits on the size of these structures — all of main memory could be used if desired — and increases overall system flexibility. In addition to eliminating dedicated protocol storage, a further means of reducing memory overhead is to use more scalable directory data structure formats for implement- ing cache coherence. Chapter 5 discusses this topic in detail, but the basic problem with the simple bitvector directory scheme used in DASH — essentially the same format illus- trated in Figure 2.4 — is that the aggregate size of the directory grows in proportion to the square of the system memory size and processor count. For the 64-CPU maximum DASH processor count, the bitvector overhead is tolerable. FLASH aims to support far larger pro- cessor counts, however, and for larger systems the overhead of a bitvector directory format is prohibitive. As a result, a second aspect of reducing memory overhead is allowing a more scalable and memory-efficient directory format to be employed.
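
To make the scaling argument explicit (the notation is mine, not the dissertation’s): let P be the number of presence bits (one per node), s the additional state bits per entry, L the cache line size in bytes, and M the memory per node. Then:

\[
\text{per-node directory overhead} \;=\; \frac{P+s}{8L}
\qquad\text{(DASH: } \tfrac{16+4}{8\cdot 16} = \tfrac{20}{128} \approx 15.6\%\text{)},
\]
\[
\text{aggregate directory bits} \;=\; P \cdot \frac{M}{L}\,(P+s) \;\sim\; \frac{M}{L}\,P^{2}.
\]

With per-node memory held fixed, total directory storage therefore grows quadratically in the node count, which is the growth rate that motivates the more scalable directory formats discussed in Chapter 5.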

3.2 FLASH Node Architecture

With the goals outlined above in mind, the next step was to develop the FLASH pro- cessing node architecture. FLASH is a scalable system and its node architecture, depicted in Figure 3.1, is very similar to the generic distributed multiprocessor node shown in Figure 2.2. Each FLASH node contains a MIPS R10000 compute processor [113] with its second-level off-chip cache, a portion of the machine’s memory, a connection to the local PCI-based I/O subsystem [81], a connection to the interconnection network, and the node controller, called MAGIC (Memory And General Interconnect Controller). The MAGIC node controller is responsible for handling the low-level requests generated by the local R10000 and local I/O subsystem, as well as the low-level requests that arrive from other processing nodes via the interconnection network. These low-level requests will be called messages in the remainder of this document. The interconnection network ties together a collection of FLASH nodes to form a complete FLASH system, as illustrated in the left- hand portion of the figure. Although similar to the generic distributed multiprocessor node, the FLASH node dif- fers in two ways, as described earlier. First, the MAGIC node controller uses a flexible, programmable engine rather than a set of hardwired state machines to implement the sys- tem communication paradigm. Second, the FLASH node contains no dedicated protocol storage, such as for a directory. Instead, all protocol state information is kept in a portion of the node’s local memory.


Figure 3.1 FLASH node architecture (MIPS R10000 with 2nd-level cache, DRAM, and a PCI I/O bus attached to the MAGIC node controller, which also connects to the network)

3.3 Interconnection Network

To maximize system performance, the interconnection network must provide high bandwidth — comparable to the bandwidth available between the local memory and com- pute processor — and low inter-node message latency. At the same time, the network must be able to connect physically separate processing nodes that span both multiple printed circuit boards and multiple cabinets. Providing a high-bandwidth, low-latency link across physical distances of several meters is a significant logic and circuit design challenge in itself. While FLASH needs a reliable, scalable, high-performance interconnect, it is not the primary focus of the FLASH system design. For this reason, FLASH uses a commer- cial interconnection network, specifically the same CrayLink interconnection network used in the Silicon Graphics Origin 2000 system [64]. Use of a commercially-developed network provides tremendous leverage of industrial engineering expertise, eliminates the need to design and implement the network locally, and offers a reliable, tested means of connecting the processing nodes. The CrayLink network is based on the Spider routing chip developed at Silicon Graph- ics [34]. Each Spider router implements a 6-port crossbar with internal buffering and flow control mechanisms. Figure 3.2 shows the external ports of a single Spider router. Each port on the router is bidirectional and provides 800 MB/s bandwidth in each direction,

yielding an aggregate bandwidth of 9.6 GB/s through the router (six ports with two 800 MB/s directions per port). The internal crossbar within the router allows all six ports to be active simultaneously. Because network congestion can require a router to limit the amount of data it sends to a neighbor, the router provides internal buffering to handle the necessary flow control.

Figure 3.2 Spider router chip external interfaces (six bidirectional ports, Port 0 through Port 5, around the internal crossbar)

A complete interconnection network consists of a collection of routers connected to each other and to the MAGIC chips on the processing nodes. Figure 3.3 depicts a simple network used to connect eight processing nodes. The boxes labelled “R” in the figure represent individual routers; the boxes labelled “N” are FLASH processing nodes. Two processing nodes are connected to each router, and the routers themselves are then connected to provide the necessary connectivity among the nodes.

FLASH uses the same hypercube topology as the Origin 2000. Figure 3.3 shows how eight processors, and thus four routers, would be connected to form a two-dimensional hypercube. This processor count is so small that only four of the six ports on each router are needed. Larger systems maintain the hypercube topology, all the way to a 32-CPU system that uses all six ports on each of 16 routers to form a four-dimensional hypercube. Systems larger than 32 CPUs require more than six router ports and cannot be implemented directly. Instead, the network topology becomes a fat hypercube, as described in [34].

Messages are sent from a source node to a destination node, or more precisely from the MAGIC chip on the source node to the MAGIC chip on the destination node.

Figure 3.3 Eight-processor interconnection network topology (eight nodes “N” attached in pairs to four routers “R”, with the routers connected as a two-dimensional hypercube)

A message can be of arbitrary total length. However, for purposes of transferring the message through the interconnection network, the message is fragmented into 128-bit micropackets (also called flits) as shown in Figure 3.4. All messages begin with a single header micropacket, which is followed by zero or more data micropackets.

Figure 3.4 Message fragmentation into micropackets (the original message is a sequence of 64-bit words, Header 0, Header 1, and Data 0 through Data N-1; these are packed two per 128-bit micropacket, with the header words forming Micropacket 0 and the data words filling Micropackets 1 through N/2)

The header micropacket contains routing information specifying the path the message should take to get from source node to destination node as well as additional fields for use in interpreting the message contents at the destination node’s MAGIC chip. The data micropackets contain any data associated with the message. In low-congestion situations in which micropackets do not need to be buffered, a micropacket propagates through a single router in 50ns. Even with several routers between source and destination nodes, the overall transit time is low, an important factor in maintaining good system performance, particularly for cache-coherence protocols.

The router provides several features to increase network performance and reliability. Even though each router port is a single physical link, the router time-multiplexes four virtual channels [28] on the link, giving the appearance of four independent networks. At least two independent networks are necessary to provide deadlock-free routing for a typical cache-coherence protocol [66]; virtual channels provide these independent networks without incurring the hardware overhead of physically duplicating network links. To reduce transit time the router implements wormhole routing [29], meaning that micropackets are forwarded from one router to the next without waiting for the entire message to arrive. This technique effectively pipelines message transfer through the network and reduces end-to-end message delivery latency.

Link-level CRC codes aid reliable message delivery. These codes are computed as a micropacket exits a router (or MAGIC chip) and checked as the micropacket is received at the opposite end of the link. Micropackets corrupted while being transmitted across a link are detected by an incorrect CRC at the receiver, which requests that the micropacket be retransmitted until it arrives intact. This link-level error checking and retransmission mechanism uses a go-back-N sliding window protocol to maximize bandwidth while still providing error-free micropacket delivery at the link level.
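
A minimal sketch of this fragmentation rule, assuming a message is represented as a 128-bit header plus an array of 64-bit data words; the types and function below are illustrative only and do not correspond to the actual MAGIC network interface.

#include <stdint.h>
#include <string.h>

#define FLIT_BITS       128
#define WORD_BITS       64
#define WORDS_PER_FLIT  (FLIT_BITS / WORD_BITS)   /* two 64-bit words per micropacket */

typedef struct {
    uint64_t word[WORDS_PER_FLIT];
} Micropacket;

/* One header micropacket plus enough data micropackets for n 64-bit data words. */
static inline unsigned micropackets_needed(unsigned n_data_words)
{
    return 1 + (n_data_words + WORDS_PER_FLIT - 1) / WORDS_PER_FLIT;
}

/* Fragment a message (128-bit header plus data words) into micropackets. */
unsigned fragment_message(const uint64_t header[2],
                          const uint64_t *data, unsigned n_data_words,
                          Micropacket *out)
{
    unsigned nflits = micropackets_needed(n_data_words);

    memset(out, 0, nflits * sizeof *out);          /* zero-pad a final partial flit */
    out[0].word[0] = header[0];                    /* header micropacket            */
    out[0].word[1] = header[1];
    for (unsigned i = 0; i < n_data_words; i++)    /* data micropackets             */
        out[1 + i / WORDS_PER_FLIT].word[i % WORDS_PER_FLIT] = data[i];
    return nflits;
}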

3.4 System Packaging

Like the interconnection network, the system packaging — the cabinets, node board, and other physical enclosures — is an important component of the system implementation but was not a focus of the FLASH design effort. Designing a compact cabinet for the node board while providing sufficient cooling, power, and space for disk drives and other I/O devices is a significant mechanical and electrical engineering task. To minimize the amount of work the FLASH team needed to do in this area, the FLASH system packaging leverages heavily from existing Silicon Graphics products. The system enclosure, containing the node board, power supply, fans, and space for PCI devices such as disk drives and graphics cards, is identical to the one designed for the SGI Origin 200 deskside system. The node board fits into the center of the cabinet and


contains all components shown in Figure 3.1 with the exception of the R10000 and its second-level cache; these are provided on a separate daughtercard containing the R10000 processor, 1 MB of second-level cache, and a heatsink. As depicted in Figure 3.1, the MAGIC chip implements a direct connection to a PCI bus and can support up to four PCI devices on this bus. The node board itself contains an SGI-developed I/O chip (IOC3) that implements 100 Mbps Ethernet, fast SCSI, a serial port, and connections for a keyboard and other slow I/O devices. This chip consumes two PCI IDs. The Origin 200 cabinet provides an additional two full-length PCI slots, allowing standard PCI cards to be added to the system. These cards could include graphics controllers, additional SCSI controllers, or high-end networking cards, such as an FDDI card. Also in the cabinet is space for eight 3.5-inch disk drives, all connected to the IOC3’s SCSI bus.

The node board design also is based on the one used for the Origin 200 motherboard. However, the FLASH version of the node board incorporates several changes — most notably the replacement of the Origin 200 node controller chip with the MAGIC chip — so a new node board layout was required. The FLASH node design includes a local memory system implemented using synchronous DRAM (SDRAM) DIMMs. Up to six DIMMs can be accommodated on a node board, allowing a total memory size of 192 MB with 16-Mbit SDRAM parts and 768 MB using 64-Mbit parts. The SDRAMs connect to a 128-bit bus operating at 50 MHz. This bus is then multiplexed on the board into a 64-bit 100 MHz bus that connects directly to MAGIC. Other major components include the MAGIC chip itself, the off-chip MAGIC data cache, and the “XC” chip, which converts the single-ended network signals from MAGIC into differential signals for transmission off the node board.

The node board does not contain a Spider router chip. Instead, dedicated routing cabinets provide the interconnection network. These so-called metarouters, developed by SGI, each contain eight Spider router chips connected in a three-dimensional hypercube. Node boards and metarouters are connected to form the complete interconnection network. Systems with 16 or fewer processors require just one metarouter; as the system size grows, additional metarouters are required.
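
As an arithmetic check on the capacities and bus rates just quoted (the sixteen-devices-per-DIMM organization is an assumption consistent with the stated totals, not a detail given in the text):

\[
6\ \text{DIMMs} \times 16\ \text{devices} \times 16\ \text{Mbit} = 1536\ \text{Mbit} = 192\ \text{MB},
\qquad
6 \times 16 \times 64\ \text{Mbit} = 6144\ \text{Mbit} = 768\ \text{MB},
\]
\[
128\ \text{bits} \times 50\ \text{MHz} \;=\; 64\ \text{bits} \times 100\ \text{MHz} \;=\; 800\ \text{MB/s},
\]

so the multiplexed 64-bit, 100 MHz memory bus into MAGIC carries the same bandwidth as the 128-bit, 50 MHz SDRAM bus it is derived from.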

3.5 Node Controller Implementation Options

The previous sections have presented the FLASH node in its final form, including the single-chip MAGIC node controller. A large focus of the FLASH system design effort, however, was the development of the node controller architecture. This section traces the evolution of the MAGIC architecture, from the initial architectural and performance constraints to the rationale behind the decision to implement a single-chip, integrated node controller.

For the reasons outlined in Section 3.1, the FLASH system uses a programmable node controller. The challenge was to ensure that the use of a flexible, programmable node controller did not unacceptably reduce overall FLASH system performance. In particular, as described in Section 3.1, the goal was to achieve overall system performance competitive with that of a fully-hardwired machine having the same underlying system architecture. Evaluating this goal involved determining whether a programmable protocol engine could achieve competitive performance at all. If so, what features must this programmable engine include? And, aside from the programmable engine itself, what additional hardware support was needed? To determine if competitive performance was achievable, a study was undertaken that compared the performance of several programmable engine-based design alternatives to the estimated performance of a fully-hardwired design. The fully-hardwired design was assumed to be a system that used the FLASH system architecture but implemented the DASH protocol completely in hardware using the same technology constraints as FLASH.

Although Figure 3.1 depicts the node controller as a single entity, this view is an architectural perspective rather than an implementation necessity. In DASH, the node controller was implemented with many MSI and LSI parts on two large printed circuit boards. FLASH seeks to greatly reduce the raw node controller parts count, and the MAGIC node controller ultimately was implemented as a single chip. For the initial implementation studies, however, the node controller was not constrained to be a single chip and several implementation options were considered.

3.5.1 Trial 1: Off-the-Shelf Processor

The most straightforward approach to implementing a programmable node controller is to use an existing off-the-shelf processor. The primary advantage of this approach is its leverage of commercial design, engineering, and implementation expertise. A commercial CPU would undoubtedly run faster than any processor developed specifically for FLASH, while also costing less and helping to minimize the FLASH system design time. A commercial processor also implements a well-understood programming environment supported by a rich set of code development tools. In fact, one option is to use the same processor as the FLASH node’s compute processor, an approach similar to the one adopted for the Intel Paragon system.

Despite these advantages, using an off-the-shelf processor presents several problems. First, such a processor would not interface efficiently to the other components of the processing node illustrated in Figure 2.2. The programmable engine, whether implemented as an off-the-shelf processor or not, provides only one part of the node controller functionality. Also needed is a memory controller, an I/O controller, a means to interface to the node’s compute processor, and an interface to the interconnection network. Typical off-the-shelf processors provide none of these capabilities and, even if they did, a truly “glueless” interface would be unlikely.

The second problem is that an off-the-shelf processor is not adept at executing protocol code. Off-the-shelf processors, particularly ones of the same generation as the node’s R10000 compute processor, are designed to execute application programs. But the characteristics of protocol code are very different from those of typical application code. Protocol code sequences tend to be very short, often only a few or few tens of instructions. Worse, these sequences always begin with a dispatch operation — the incoming request must be inspected, the address of the corresponding code sequence computed, and control transferred to this address. In addition, protocol state data structures typically consist of packed arrays of bitfields. The 20-bit DASH directory entry, for example, consists of one state bit, 16 presence bits, one zero vector bit, and two parity bits. In contrast, typical application code tends to consist of long, loop- and load/store-oriented code sequences that operate on individual data items that are at least byte-sized and byte-aligned. Hence, protocol code is more similar to the lowest-level interrupt vectoring and processing code in the operating system than to typical application code, and the protocol engine must be adept at handling interrupt-style code if it is to achieve good protocol processing performance.

Unfortunately, modern processors fare poorly at handling short, interrupt-style code sequences. These processors employ dynamic scheduling, out-of-order execution, deep pipelines, extensive branch prediction with large penalties for mispredictions, and other techniques aimed at improving the throughput of typical application code. While these techniques result in significant performance improvements on application code, they also render the design unsuitable for use as a protocol engine. On the R10000, for example, an uncached load — as would be used to retrieve data from the interconnection network or poll the status of an I/O device — requires over 15 cycles to complete. Given that the basic nature of protocol code is to interact with the interconnection network and the other components on the node board, such long-latency communication primitives are unacceptable. Even a more interrupt-driven interface would not help: an R10000 can require as many as 20 cycles to recognize an interrupt and begin executing the interrupt vectoring code.

Somewhat ironically, then, the best protocol processor is likely to be a previous-generation processor, such as the MIPS R4000 [57] embedded processor or R3000 [57]. The attractiveness of a previous-generation design lies not in its raw processing power — which is inferior to contemporary offerings — but in its efficient, direct interfaces to external components and its simpler internal pipeline structure that leads to faster interrupt response and lower uncached operation latencies and branch misprediction penalties. In addition, previous-generation processors are considerably cheaper than the state-of-the-art CPUs. A MIPS R10000, for instance, costs approximately $2000 as of this writing, whereas an R3000 costs just $40.1 Therefore, the best candidate for an off-the-shelf protocol processor is a previous-generation processor and, as a result, the R3000 was selected for evaluation. However, because the FLASH system uses 64-bit addresses, the R3000 is assumed to be an updated version of the original architecture that retains the same pipeline, instruction set, and other architectural features but widens the internal registers, datapaths, and external system bus to 64 bits from the 32-bit size of these elements in the conventional R3000 implementation.

3.5.1.1 Node Controller Organization

To determine whether this off-the-shelf R3000 processor could yield good overall protocol processing performance despite the problems outlined above, key parts of the DASH protocol were coded in MIPS R3000 assembly language and the resulting code sequences examined. Generating the proper code sequences requires some assumptions about the underlying node controller hardware and its connections to the rest of the processing node; Figure 3.5 illustrates the basic node controller architecture.

As the figure shows, the node controller consists of an R3000 processor and its system interface to external instruction and data caches as well as to the additional processing node components. Surrounding the R3000 are additional interface logic elements to provide the interfaces to the local compute processor, network, memory system, and I/O subsystem. The interface logic elements are drawn as ovals in the figure to emphasize that their detailed implementation is unspecified. Rather, the initial assumption is that the interface logic elements are just simple registers and state machines that provide a reasonable logical interface to the node components without providing any significant processing capability of their own. The memory controller interface element, for example, might provide a register to specify an address to read or write and an additional set of registers to hold the data to be read or written, along with the necessary state machine to perform the DRAM control sequencing. Updating and monitoring the contents of these registers is the responsibility of the R3000 node controller, which uses normal load and store instructions to read and write the memory interface registers. That is, the memory interface registers are memory-mapped into the R3000’s address space, as are the other interface logic elements.

1. The R10000 price refers to the NEC µPD30700RS part. The R3000 price refers to the NEC µPD3010GC, which is a 64-bit processor with an R3000-style five-stage pipeline and a 64-bit external system bus. Both prices were quoted by Marshall for quantity 100 purchases.

Figure 3.5 Processing node with off-the-shelf R3000 protocol engine (the R3000 and its instruction and data caches sit on a system bus, with interface logic elements connecting to the R10000 compute processor, DRAM, network, and I/O)
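
Under the assumptions of this trial, handler code would touch the interface elements roughly as sketched below; the register names, offsets, and base address are invented for illustration and do not correspond to a documented memory map.

#include <stdint.h>

/* Hypothetical physical addresses for the memory interface element's registers. */
#define MEM_IF_BASE     0x1F000000UL
#define MEM_IF_ADDR     (*(volatile uint64_t *)(MEM_IF_BASE + 0x00)) /* DRAM address      */
#define MEM_IF_CMD      (*(volatile uint64_t *)(MEM_IF_BASE + 0x08)) /* start read/write  */
#define MEM_IF_STATUS   (*(volatile uint64_t *)(MEM_IF_BASE + 0x10)) /* busy/done flag    */
#define MEM_IF_DATA(i)  (*(volatile uint64_t *)(MEM_IF_BASE + 0x20 + 8 * (i)))

#define MEM_CMD_READ    1
#define MEM_STATUS_DONE 1

/* Read one 64-bit word of a cache line through the memory interface registers. */
uint64_t mem_if_read_word(uint64_t line_addr, unsigned word)
{
    MEM_IF_ADDR = line_addr;           /* ordinary stores drive the state machine */
    MEM_IF_CMD  = MEM_CMD_READ;
    while ((MEM_IF_STATUS & MEM_STATUS_DONE) == 0)
        ;                              /* poll until the DRAM access completes */
    return MEM_IF_DATA(word);          /* ordinary load returns the data word  */
}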

3.5.1.2 Message Dispatch Loop

Internal operation of the R3000 protocol engine consists, at least conceptually, of a single while loop whose structure is given in Figure 3.6. Processing of an incoming message begins with the switch statement, which performs a dispatch based on the incoming message type.

while (TRUE) {                       /* Loop forever */
    switch (msg.type) {              /* Dispatch on incoming message type */
    case READ:
        /* Code to process READ operation */
    case WRITE:
        /* Code to process WRITE operation */
    case INVALIDATE_REQUEST:
        /* Code to process INVALIDATE_REQUEST operation */
    case ...
    }
}

Figure 3.6 Basic incoming message dispatch and processing loop

Once the dispatch completes, control transfers to the code sequence appropriate to the message type. This code sequence, or “handler,” is typically very short, often only ten or a few tens of instructions. The handler is responsible for updating any necessary protocol state (for example, the directory), for initiating any necessary memory operations, for transferring data, and for composing and sending any outgoing messages generated in response to the incoming message. Interacting with the memory and sending outgoing messages requires load and store operations to the proper logic interface elements, which are memory-mapped into the R3000 address space as described above.

Evaluation of the DASH protocol code sequences in this environment revealed three performance bottlenecks: dispatch overhead, data transfer, and bitfield operations.

3.5.1.3 Dispatch Overhead

Although transferring the message on-chip and storing the message header in a processor register may add secondary overheads, the primary dispatch performance bottleneck occurs in processing the switch statement at the start of the message dispatch loop. To conserve storage space and network bandwidth, the message header, containing information such as the message type, source processor, address, and other control information, will consist of several bitfields packed into a single register. One of these bitfields is the message type. Hence the first step in the implementation of the switch statement is to extract the message type from the other fields of the message header. Next, the message type must be used to access a software jump table, whose base address is assumed to be stored in one of the R3000’s registers and whose contents are a contiguous array of 64-bit program counter values indexed by incoming message type. At the assembly level, then, the switch statement is implemented as the code fragment of Figure 3.7.

switch:
    andi  rX,rH,0xff    /* Mask off message type from header in rH */
    sll   rX,rX,3       /* Scale by size of jump table entry (64 bits) */
    addu  rX,rX,rB      /* Add to base address of jump table in rB */
    ld    rX,0(rX)      /* Load handler’s starting PC value */
    nop                 /* Load delay slot */
    jr    rX            /* Jump to start of handler */
    nop                 /* Jump delay slot */
handler:
    ...                 /* First instruction of handler */

Figure 3.7 Dispatch assembly code

This code assumes the message header is loaded automatically into register rH and that the message type occupies the lower eight bits of the header, so as to facilitate extracting the message type using a single andi (logical and) instruction.

Although execution of the dispatch code requires just seven cycles, the code represents a significant processing overhead when measured against the 10 or 20 instructions in a typical handler. Also, the most frequently executed handlers, such as a cache miss to a local address, tend to be the simplest and thus shortest, exacerbating the effect of the seven-cycle dispatch. Furthermore, the five actual instructions in the dispatch fragment are serially dependent, meaning that changing to a more aggressive, multiple-issue protocol processor architecture would not reduce the dispatch overhead; in fact, the relative overhead of dispatch increases in this case since the dispatch code requires the same number of cycles yet the handler code probably requires fewer.

Some optimizations can reduce the length of the dispatch. A register-plus-register addressing mode for the ld instruction allows the addu instruction to be eliminated, for instance, though the gain would be minor. Filling the load and jump delay slots also would help, and these slots are best filled with the first two instructions from the most common handler. Of course, at this point in the code the eventual handler has not even been identified yet, so choosing the proper two instructions would be difficult.

3.5.1.4 Data Movement Overhead

The second overhead of using the off-the-shelf R3000 is data movement. Most incoming messages require some kind of data movement to occur. The most common case is a cache miss received from the local compute processor, which requires that a cache line of data be read from the node’s local memory and transferred to the compute processor. Because of the assumed simplicity of the logic interface elements, all data transfer must be provided explicitly by the handler code. A memory read, for example, begins with the handler storing the address into the proper register in the memory interface element and initiating the memory read. Once the memory read completes, the handler must read each data word from a register in the memory interface element and then store the data word to a register in the compute processor interface element. The FLASH system, as noted in Section 3.2, uses 128-byte cache lines. As a result, transfer of a cache line of data requires 16 64-bit loads from the memory interface element and 16 corresponding stores to the compute processor interface element, for a total of 32 handler instructions devoted just to data movement. Clearly this overhead is unacceptable, and can be reduced significantly with a modest amount of DMA-like support logic. However, the goal of the off-the-shelf protocol processor was to keep the external logic as simple as possible; the resulting need to implement all data transfer in handler code identifies a serious performance bottleneck inherent in this approach.
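
Continuing the same invented register layout as the earlier sketch, the explicit copy described above reduces to the loop below; unrolled, it is exactly the 16 loads and 16 stores counted in the text.

#include <stdint.h>

/* Same hypothetical register layout as the previous sketch. */
#define MEM_IF_BASE     0x1F000000UL
#define MEM_IF_DATA(i)  (*(volatile uint64_t *)(MEM_IF_BASE + 0x20 + 8 * (i)))
#define CPU_IF_BASE     0x1F100000UL                /* hypothetical CPU interface element */
#define CPU_IF_DATA(i)  (*(volatile uint64_t *)(CPU_IF_BASE + 8 * (i)))

#define LINE_BYTES      128
#define LINE_WORDS      (LINE_BYTES / 8)            /* sixteen 64-bit words */

/* Move one 128-byte cache line from the memory interface element
 * to the compute processor interface element, one word at a time. */
void copy_line_to_cpu(void)
{
    for (unsigned i = 0; i < LINE_WORDS; i++)
        CPU_IF_DATA(i) = MEM_IF_DATA(i);            /* one load plus one store per word */
}

This per-word copying is precisely the burden that the external DMA engine of Trial 2, and ultimately MAGIC's dedicated data transfer logic, removes from the handler.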

3.5.1.5 Bitfield Manipulation Overhead

Bitfield manipulation is the final overhead of using the R3000. The MIPS instruction set does not provide any native support for bitfield manipulations.2 Instead, these operations must be synthesized from shift and logical instructions. Typical application data structures contain few bitfields, so the lack of instruction set support has only minimal impact on the performance of application code. Protocol state data structures, however, almost always consist of several bitfields packed into a single 64-bit word. The format of the FLASH message header, to be presented in Section 4.3, for example, is shown in Figure 3.8.

    Field   Bits    Width (bits)
    PBUF    63:60   4
    WBUF    59:56   4
    DBL     55      1
    ENB     54:52   3
    MISC    51:40   12
    DEST    39:28   12
    SRC     27:16   12
    Zero    15:14   2
    LEN     13:12   2
    TYPE    11:0    12

Figure 3.8 Bitfields in FLASH message header

Putting aside the meaning of the fields in the header, the key point is that the header is a collection of 10 individual bitfields, all packed into a single 64-bit word. Typical directory entries have a similar format. The code to perform a register-to-register bitfield extract operation consists of two instructions; the code to perform a register-to-register bitfield insert consists of five instructions. Because bitfield manipulation is such a common operation, the multicycle bitfield operations are a significant overhead.
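
For example, extracting a field of the header in Figure 3.8 is a shift and a mask, and rewriting one is roughly the five-operation sequence cited above. The C sketch below uses the field positions from Figure 3.8 but is otherwise illustrative; the SRC/DEST swap anticipates the reply-generation example mentioned later in Section 3.5.3.

#include <stdint.h>

/* Field positions from Figure 3.8 (bit 0 is the least significant bit). */
#define HDR_SRC_SHIFT    16
#define HDR_SRC_BITS     12
#define HDR_DEST_SHIFT   28
#define HDR_DEST_BITS    12

#define FIELD_MASK(bits) ((1ULL << (bits)) - 1)

/* Extract: shift right, then mask (two RISC instructions without bitfield support). */
static inline uint64_t hdr_get(uint64_t hdr, unsigned shift, unsigned bits)
{
    return (hdr >> shift) & FIELD_MASK(bits);
}

/* Insert: mask the new value, shift it into place, clear the old field, then OR
 * (roughly the five-instruction sequence mentioned in the text). */
static inline uint64_t hdr_set(uint64_t hdr, unsigned shift, unsigned bits, uint64_t val)
{
    uint64_t mask = FIELD_MASK(bits) << shift;
    return (hdr & ~mask) | ((val << shift) & mask);
}

/* Example: swap SRC and DEST when turning a request header into a reply header. */
static inline uint64_t hdr_swap_src_dest(uint64_t hdr)
{
    uint64_t src  = hdr_get(hdr, HDR_SRC_SHIFT,  HDR_SRC_BITS);
    uint64_t dest = hdr_get(hdr, HDR_DEST_SHIFT, HDR_DEST_BITS);
    hdr = hdr_set(hdr, HDR_SRC_SHIFT,  HDR_SRC_BITS,  dest);
    hdr = hdr_set(hdr, HDR_DEST_SHIFT, HDR_DEST_BITS, src);
    return hdr;
}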

3.5.2 Trial 2: Off-the-Shelf Processor with External Hardware Support

The evaluation presented in the previous section revealed three overheads in the use of an off-the-shelf protocol processor: dispatch, data transfer, and bitfield manipulation.

2. Other instruction sets, most notably the PowerPC and Motorola 680x0, do provide bitfield instructions. The 68K-based products tend to be too slow overall to compete with the R3000, even given the R3000’s lack of bitfield support. A PowerPC processor is a reasonable alternative to the R3000, but its bitfield instructions are still insufficient for FLASH needs — it has no branch-on-bit instructions, for example — and, in any event, a more cohesive design results if the node controller uses the same instruction set as the MIPS R10000 compute processor.

The result was a system that had very poor protocol processing performance even though it effectively leveraged commercial CPU implementation expertise to reduce FLASH design time. As Section 3.1 notes, the goal of FLASH is to achieve overall system performance competitive with a hardwired machine; the use of an off-the-shelf R3000 as the protocol processor, at least with the simple external logic elements outlined above, would not have achieved this goal. Yet of the three overheads, only bitfield manipulation is inherent to the R3000 instruction set; both dispatch and data transfer are amenable to external hardware acceleration. For this reason, the next design evaluation focused on the same R3000 protocol processor as used in Trial 1, but with more sophisticated external interface logic. In particular, an external dispatch table was added, as was external support for data transfer.

The dispatch logic was a RAM-based table with some support logic. Incoming messages flowed through the dispatch logic, which extracted the message type and used it to index into the RAM, yielding the handler’s starting program counter value. The program counter was stored in an external register, allowing the R3000 to read from this register and jump to the resulting address, thereby reducing the dispatch fragment to a single load and jump, each with one delay slot, resulting in a four-cycle dispatch rather than seven, almost a 50% reduction. A further reduction could be achieved by placing the load from the program counter register at the end of every handler, allowing the load delay slot to be filled with the final handler instruction and reducing the effective dispatch overhead from four cycles to three.

The data transfer facility was provided in the form of a simple DMA engine, again implemented in logic external to the R3000, that could transfer data among the local compute processor, memory, network, and I/O subsystem. The handler code was responsible for configuring and initiating the data transfer — specifying the source interface and destination interface, for instance — but once underway, the data transfer logic managed the actual movement of data words.

Overall, this approach is similar to the one proposed in Trial 1 (Section 3.5.1) in that it combines an off-the-shelf protocol processor with external logic. In contrast to Trial 1, however, the external logic is now much more capable and more specifically targeted at alleviating the dispatch and data transfer overheads identified in Trial 1. Unfortunately, although this approach certainly has better performance, it still suffers from the lack of bitfield support and, perhaps more important, from the need to implement external hardware. The need for external hardware is troublesome in that it is fairly complicated yet needs to run quickly. Even implementing the simple external logic elements proposed in Trial 1, much less the significantly more complex external logic proposed for Trial 2, would require a non-trivial amount of hardware.

Not only would this hardware have added to the design time and effort that the use of an off-the-shelf processor was meant to reduce, but it also would have increased the logic overhead of implementing the system’s communication paradigm, contrary to the specific FLASH goal of minimizing this overhead. Perhaps some of this external logic could be moved on-chip by using a commercial R3000-based microcontroller that implemented, for instance, a memory controller and DMA engine on-chip. However, no commercial microcontroller provides the exact interfaces needed for use in the FLASH node and so some external logic still would be required. At a more architectural level, external logic has the additional drawback of decreasing the coupling between the node controller and the other node components. In fact, the additional overhead of communicating with the node board components via the external interface elements was not even considered in the analysis above, yet close coupling between the node controller and node board components is critical to minimizing overall protocol processing overhead, regardless of the node controller choice.
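
A hedged sketch, under the assumptions of the earlier register examples, of what Trial 2 handler-side code might have looked like: the dispatch register, DMA registers, addresses, and interface identifiers below are invented stand-ins for the external support logic described in the text, not a documented interface.

#include <stdint.h>

/* Invented memory-mapped registers for the Trial 2 external support logic. */
#define DISPATCH_PC_REG  (*(volatile uint64_t *)0x1F200000UL) /* handler PC from RAM table */
#define DMA_SRC_REG      (*(volatile uint64_t *)0x1F200010UL) /* source interface select   */
#define DMA_DST_REG      (*(volatile uint64_t *)0x1F200018UL) /* destination interface     */
#define DMA_LEN_REG      (*(volatile uint64_t *)0x1F200020UL) /* transfer length in words  */
#define DMA_GO_REG       (*(volatile uint64_t *)0x1F200028UL) /* writing 1 starts transfer */

enum { IF_MEMORY = 0, IF_CPU = 1, IF_NETWORK = 2, IF_IO = 3 };

/* Instead of copying a cache line word by word, the handler programs the
 * external DMA engine and lets it move the data while the handler finishes. */
static inline void dma_line_to_cpu(void)
{
    DMA_SRC_REG = IF_MEMORY;
    DMA_DST_REG = IF_CPU;
    DMA_LEN_REG = 16;            /* one 128-byte line = sixteen 64-bit words */
    DMA_GO_REG  = 1;
}

/* End-of-handler dispatch: read the handler PC precomputed by the external
 * dispatch table and transfer control to it. */
typedef void (*handler_fn)(void);

static inline void dispatch_next(void)
{
    ((handler_fn)(uintptr_t)DISPATCH_PC_REG)();
}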

3.5.3 Trial 3: Custom Processor Core with Integrated Hardware Support

Section 3.5.1 and Section 3.5.2 demonstrated that neither the off-the-shelf R3000 with simple external interface elements nor the same R3000 with more complex external interface logic yields acceptable protocol performance. Both approaches require external hardware and neither addresses the need for close coupling of the node controller and other node board components. In addition, the R3000’s lack of bitfield support makes it ill-suited for manipulating protocol state data structures. The key to addressing these performance bottlenecks is to eliminate all external logic by providing the required functionality on-chip and to tailor the protocol processor’s instruction set to the characteristics of protocol code. Such an integrated approach retains the essential flexibility of a programmable protocol engine yet also achieves the close coupling to the other node board components and the efficient protocol code sequences necessary to achieve competitive overall performance.

Specifically, the integrated node controller must incorporate several key architectural features. First was the need to separate control processing (that is, protocol processing) from data transfer and to provide dedicated logic for each task. This separation offloads the data transfer burden from the handler code and, just as important, allows data transfer and protocol processing to occur concurrently. Second was the need for a hardware dispatch mechanism similar to the one described in Section 3.5.2. Third, the protocol processor’s instruction set, while based on the R3000 instruction set, must be extended to match the characteristics of protocol code. In addition to bitfield instructions, the protocol processor also needed instructions to allow efficient communication with

the other functional units within the integrated node controller. This close coupling of the protocol processor to the internal node controller functional units must be extended off-chip as well by implementing direct, glueless interfaces to the node board components so that no external interface logic is required. Finally, though not discussed above, a second-order protocol processing overhead resulted from the need to perform operations in handler code that are inherently inefficient to perform in software. Examples include bit swizzling, data format conversions (such as the swapping of the source and destination node bitfields in the message header of Figure 3.8), range checks, flow control, etc. To reduce this overhead, additional hardware must be provided to perform these functions, again with the goal of offloading the protocol processor.

Equally important as specifying the necessary features was identifying those features present in an off-the-shelf processor that are unneeded for protocol processing. Elimination of these features simplifies the design of the protocol engine and reduces die area. For example, the compute processor issues only physical addresses on its system bus and all code executed on the protocol processor is privileged, so the node controller has no need for any hardware-based address translation or protection facilities nor any need to support virtual memory. The on-chip interface logic can be closely coupled to the protocol processor, and the protocol processor code can be designed to execute in a non-blocking fashion, thereby eliminating the need for general-purpose interrupt and exception support. Because protocol processor code is so stylized, static scheduling works well and eliminates the need for any dynamic instruction issue or execution support. In addition, floating-point operations are essentially absent in protocol code, so only integer execution units are required. Floating-point data types are useful in certain situations — for example, for a floating-point fetch&op primitive [36] — but cache coherence and message passing protocols do not use floating point, making software emulation of floating point the appropriate choice for the rare cases in which it is used.

A final challenge in the design of the integrated node controller was balancing the functionality and performance advantages of dedicated hardware versus the desire to support multiple protocols and maintain overall design flexibility. Dedicated hardware support, such as the dispatch mechanism, definitely is required to overcome fundamental performance bottlenecks in the use of a programmable protocol engine. But at the same time this hardware must not impact flexibility; that is, any additional hardware support must improve protocol processing performance in general, and not just on a particular set of protocols. The additional hardware also must not compromise the node controller’s support for protocols different from the ones that motivated the addition of the hardware in the first place.

Dispatch, data transfer, and bitfield overheads are common to almost every protocol, so the corresponding support hardware added to the node controller satisfies the above criteria. However, the dispatch logic in particular must be implemented carefully if it is to be useful for a variety of protocols rather than for just a few. For example, while it is reasonable (and essentially necessary from a hardware design perspective) to fix the position and size of the message type field within the FLASH message header, it is not reasonable to fix how the bits of the message type field will be used. The dispatch logic implementation might be simplified, for instance, if the message type encodings for the most common cache coherence messages were predefined. This approach is unacceptable, however, because it is protocol-specific, in this case to typical cache coherence protocols. Message-passing and synchronization protocols do not use the same message types as cache coherence protocols and the predefined message types are likely to be nonsensical when one of these protocols is in use. As a result, the dispatch logic could not be used efficiently and may even degrade performance for non-cache coherence protocols. Careful design of the dispatch logic, as well as the other dedicated hardware support logic, is required to ensure that it improves performance yet is not protocol-specific and maintains the overall node controller flexibility.

A detailed design was performed of an integrated node controller with the internal architecture outlined above. The major components of this design and their connectivity are illustrated in Figure 3.9.

Figure 3.9 Integrated node controller architecture (a protocol processor with on-chip instruction cache and off-chip data cache, hardware dispatch logic, and data transfer logic, joined through interface logic to the R10000, DRAM, I/O, and network)

The basic structure of this design evolved into the actual MAGIC architecture, to be presented in Chapter 4. A detailed performance study using FlashLite, the FLASH system-level simulator, compared the performance of the actual FLASH and MAGIC architecture to an “ideal machine” [44] that retained the same FLASH and MAGIC architecture but replaced the programmable protocol processor with a hypothetical version that processed all messages in zero time. The results of this study confirmed that the FLASH system performance was competitive with the ideal machine’s over a range of scientific applications. In particular, for several applications drawn from the SPLASH-2 application suite [110], FLASH performance was within 12% of the ideal machine and even closer to the performance of a realistic hardwired implementation, an encouraging result.

3.6 Summary

The FLASH design addresses three goals: overcoming some of the limitations of the DASH design, reducing the hardware overhead associated with cache coherence, and pro- viding a high-performance platform for exploring the design of large-scale distributed multiprocessors. The key component in achieving these goals is the use of a flexible, pro- grammable node controller, the MAGIC chip. Most multiprocessor systems provide only a single communication paradigm and implement support for it completely in hardware in the node controller. MAGIC’s programmable protocol engine allows it to support multiple communication paradigms, multiple protocols, and a variety of performance monitoring facilities. In addition, all protocol state, protocol code, and performance data is stored in a portion of the node’s main memory rather than in additional, dedicated memory. FLASH retains the basic characteristics and processing node structure of a distributed multiprocessor. Neither the interconnection network nor the system packaging was the focus of the FLASH design effort, however, and both were leveraged from commercial products developed at Silicon Graphics. The interconnection network, called CrayLink, is the same as used in the SGI Origin 2000. Its Spider router chips provide 800 MB/s band- width into and out of each FLASH node and connect all nodes using a hypercube topol- ogy. Low latency through each router reduces the end-to-end message delivery delay, and link-level error detection increases network reliability. The system packaging, including the enclosures, power supplies, cooling, and basic node board design, is the same as is used in the SGI Origin 200 workstation. Flexible protocol processing with no fixed limits on the size of protocol and perfor- mance data structures are crucial elements in overcoming the limitations of DASH’s hard- wired cache coherence implementation and, more generally, in supporting distributed

multiprocessor research. So, too, is the implementation of all protocol processing functionality on the same node controller chip as would be needed in a bus-based multiprocessor or uniprocessor. Integration of the protocol processor on the same chip as the memory controller and I/O controller minimizes the logic overhead of adding protocol processing support to the system and amortizes its cost.

Despite the advantages of a programmable protocol engine, FLASH’s performance must be comparable to a machine with a similar architecture but using a fully-hardwired node controller. Without competitive performance, research results gathered on FLASH lose their applicability to the design of similar, though hardwired, machines.

Achieving competitive system performance requires excellent protocol processing performance. An off-the-shelf protocol processor offers numerous advantages, mainly in its leverage of commercial processor design expertise and cost structure, but suffers from several overheads that greatly reduce its effectiveness as a protocol processor. The most significant of these overheads are dispatch, data transfer, and bitfield operations. Adding additional external hardware to alleviate some of these overheads can improve protocol processing performance, but not to a sufficient degree. External hardware also poses design challenges of its own, and leads to a lack of close coupling between the protocol processor and the other node components.

The only node controller design that can achieve the requisite performance level is a single chip that integrates not only the protocol processor itself but all support logic as well. Such an integrated solution allows the protocol processor and surrounding support logic to be optimized to the characteristics of protocol code and permits a very close coupling among all processing node components. The architecture of the MAGIC chip, presented in the next chapter, reflects this goal of an integrated node controller.

Chapter 4
MAGIC Architecture and Implementation

Chapter 3 described the evolution of the MAGIC design from the initial investigations of using an off-the-shelf processor to the eventual decision to design an integrated node controller to maximize protocol processing performance and minimize logic overhead. Whereas Section 3.5.3 presented the internal architecture of the integrated node controller in general terms, this chapter presents the architecture of the MAGIC chip in detail, starting with a high-level overview of the key architectural features and followed by a discussion of each of the major functional units. The goal is not to document the chip exhaustively; the MAGIC Specification [63] serves this purpose. Rather, the goal is to explain in some depth how the major components on the chip operate, how they cooperate to perform efficient protocol processing, and how the actual MAGIC design realizes the architectural goals outlined in Chapter 3.

4.1 MAGIC Architecture Overview1

The MAGIC architecture is designed to offer both flexibility and high performance. As described earlier, MAGIC includes a programmable protocol processor for flexibility. MAGIC’s central location within the node ensures that it sees all processor, network, and I/O transactions, allowing it to control all node resources and support a variety of protocols. Also as noted, to avoid limiting the node design to any specific protocol and to accommodate protocols with varying memory requirements, the FLASH node contains no dedicated protocol storage; instead, both the protocol code and protocol data reside in a reserved portion of the node’s main memory. However, to provide high-speed access to frequently-used protocol code and data, MAGIC uses an on-chip instruction and off-chip data cache.

FLASH nodes communicate by sending intra- and inter-node commands, referred to as messages. To implement a protocol on FLASH, one must define what kinds of messages will be exchanged (the message types), and write the corresponding code sequences for the protocol processor (the handlers). Each handler performs the necessary actions based on the protocol state and the information in the message it receives. Handler actions include updating protocol state, communicating with the local compute processor, and communicating with other nodes via the network.

Multiple protocols can be integrated efficiently in FLASH by ensuring that messages in different protocols are assigned different message types. The handlers for the various protocols then can be dispatched as efficiently as if only a single protocol were resident on the machine. Moreover, although the handlers are dynamically interleaved, each handler invocation runs without interruption on MAGIC’s embedded processor, easing the concurrent sharing of state and other critical resources. MAGIC also provides protocol-independent deadlock avoidance support, allowing multiple protocols to coexist without deadlocking the machine or having other negative interactions.

Fundamentally, protocol handlers must perform two tasks: data movement and state manipulation. The MAGIC architecture exploits the relative independence of these tasks by separating control and data processing. Hardwired data movement logic achieves low latency and high bandwidth by supporting highly-pipelined data transfers without extra copying within the chip. The protocol processor manages state manipulation, employing a hardware dispatch table to help service requests quickly and a coarse-level pipeline to reduce protocol processor occupancy. This separation and specialization of data transfer and control logic ensures that MAGIC does not become a latency or bandwidth bottleneck.

As messages enter the MAGIC chip they are split into message headers and message data. Message headers flow through the control macropipeline while message data flows through the data transfer logic, depicted in Figure 4.1. Data and control information are recombined as outgoing message headers merge with the associated outgoing message data to form complete outgoing messages.

1. The overview of MAGIC appeared previously in [62].
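To make the handler model concrete, the C sketch below shows the general shape of a handler for a hypothetical read request arriving at its home node. Everything here (the types, the helpers dir_lookup and send_msg, and the MSG_* message names) is an illustrative assumption; the real handlers are written as scheduled protocol processor code rather than C.

    /* Illustrative sketch of a FLASH protocol handler; not actual PP code.
     * All names and structures here are invented for illustration. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {            /* simplified message header */
        uint16_t type;          /* message type (protocol id + operation) */
        uint16_t src, dest;     /* source and destination node numbers */
        uint64_t addr;          /* address the message applies to */
        uint8_t  pbuf;          /* data buffer holding any associated data */
    } msg_header_t;

    typedef struct {            /* simplified per-line directory state */
        bool     dirty;
        uint16_t owner;
    } dir_entry_t;

    enum { MSG_GET = 1, MSG_PUT = 2, MSG_FWD_GET = 3 };

    static dir_entry_t directory[1024];          /* toy directory for one node */

    static dir_entry_t *dir_lookup(uint64_t addr)
    {
        return &directory[(addr / 128) % 1024];  /* 128-byte cache lines */
    }

    /* Stand-in for handing an outgoing header to the outbox. */
    static void send_msg(uint16_t type, uint16_t dest, uint64_t addr, uint8_t pbuf)
    {
        printf("send type %u to node %u, addr 0x%llx, buffer %u\n",
               (unsigned)type, (unsigned)dest,
               (unsigned long long)addr, (unsigned)pbuf);
    }

    /* Handler for a read request at the home node: if the line is clean, reply
     * with the data already fetched (speculatively) into buffer 'pbuf';
     * otherwise forward the request to the current owner. */
    static void handle_get(const msg_header_t *m)
    {
        dir_entry_t *e = dir_lookup(m->addr);
        if (!e->dirty)
            send_msg(MSG_PUT, m->src, m->addr, m->pbuf);        /* data reply */
        else
            send_msg(MSG_FWD_GET, e->owner, m->addr, m->pbuf);  /* intervention */
    }

    int main(void)
    {
        msg_header_t m = { MSG_GET, 3, 0, 0x2FC8, 5 };
        handle_get(&m);
        return 0;
    }

The point of the sketch is the division of labor described above: the handler touches only protocol state and message headers, while the data itself stays in a data buffer managed by the hardwired data transfer logic.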

4.1.1 Data Transfer Logic

Both message passing and cache coherence protocols require data connections among the network, local memory, and local processor. Because the structure of these connections is protocol-independent, the data transfer logic can be implemented completely in hardware without causing a loss of overall protocol processing flexibility. The hardwired implementation minimizes data access latency, maximizes data transfer bandwidth, and frees the protocol processor from having to perform data transfers itself.

Figure 4.1 Message flow in MAGIC (incoming messages from the processor, network, and I/O are split into message headers, which flow through the control macropipeline, and message data, which flows through the data transfer logic and memory; headers and data are recombined into complete outgoing messages)

Figure 4.2 shows the data transfer logic in detail. When messages arrive from the network, processor, or I/O subsystem, the network interface (NI), processor interface (PI), or I/O interface (IO) splits the message into message header and message data, as noted above. If the message contains data, the data is copied into a data buffer, a temporary storage element contained on the MAGIC chip that is used to stage data as it is forwarded from source to destination. Sixteen data buffers are provided on-chip, each large enough to store one cache line.

Staging data through data buffers allows the data transfer logic to achieve low latency and high bandwidth through data pipelining and elimination of multiple data copies. Data pipelining is achieved by tagging each data buffer word with a valid bit. The functional unit reading data from the data buffer monitors the valid bits to pipeline the data from source to destination — the destination does not have to wait for the entire buffer to be written before starting to read data from the buffer. Multiple data copies are eliminated by placing data into a data buffer as it is received from the network, processor, or I/O subsystem, and keeping the data in the same buffer until it is delivered to its destination. The control macropipeline rarely manipulates data directly; instead, it uses the number of the data buffer associated with a message to instruct the data transfer logic to deliver the data to the proper destination.

Figure 4.2 Data transfer logic (the memory system, PI, NI, and IO exchange data through the shared on-chip data buffers)

4.1.2 Control Macropipeline

The control macropipeline must satisfy two potentially conflicting goals: it must provide flexible support for a variety of protocols, yet it must process protocol operations quickly enough to ensure that control processing time does not dominate the data transfer time. A programmable controller provides the flexibility. Additional hardware support ensures that the controller can process protocol operations efficiently. First, a hardware-based message dispatch mechanism eliminates the need for the controller to perform message dispatch in software. Second, this dispatch mechanism allows a speculative memory operation to be initiated even before the controller begins processing the message, thereby reducing the data access time. Third, in addition to standard RISC instructions, the controller’s instruction set includes bitfield manipulation and other special instructions to provide efficient support for common protocol operations. Fourth, the mechanics of outgoing message sends are handled by a separate hardware unit.

Figure 4.3 shows the structure of the control macropipeline. Message headers are passed from the PI, NI and IO to the inbox, which contains the hardware dispatch and speculative memory logic. The inbox passes the message header to the flexible controller — the protocol processor (PP) — where the actual protocol processing occurs. To improve performance, PP code and data are cached in the MAGIC instruction cache and MAGIC data cache, respectively. Finally, the outbox handles outgoing message sends on behalf of the PP, taking outgoing message headers and forwarding them to the PI, NI, and IO for delivery to the processor, network, and I/O subsystem.

As soon as the inbox completes message preprocessing and passes the message header to the PP, it can begin processing a new message. Similarly, once the PP composes an outgoing message and passes it to the outbox, it can accept a new message from the inbox. Thus, the inbox, PP, and outbox operate independently, increasing message processing throughput by allowing up to three messages to be processed concurrently — hence the name “macropipeline.” The following sections describe the operation of the inbox, PP, and outbox in greater detail.

Figure 4.3 Control macropipeline (message headers flow from the PI, NI, IO, and the software queue head through the inbox to the protocol processor, which is backed by the MAGIC instruction and data caches, and then through the outbox back to the PI, NI, and IO)

4.1.3 Inbox Operation

The inbox processes messages in several steps. First, the scheduler selects the incoming queue from which the next message will be read. Second, the inbox uses portions of the selected message’s header to index into a small memory called the jump table to determine the starting PP program counter (PC) appropriate for the message. The jump table also determines whether the inbox should initiate a speculative memory operation. Finally, the inbox passes the selected message header to the PP for processing.

The scheduler selects a message from one of several queues. The PI and IO each provide a single queue of requests issued by the processor and I/O subsystem, respectively. The NI provides one queue for each network virtual lane. The last queue is the software queue. Unlike the other queues, the software queue is managed entirely by the PP. The inbox contains only the software queue’s head entry; the remainder of the queue is maintained in main memory, though in many cases it also will be present in the MAGIC data cache.

The scheduler plays a crucial role in the deadlock avoidance strategy. Each incoming queue has an associated array of status bits which the PP can use to indicate whether messages on the queue require outgoing queue space for servicing, or disable scheduling from the queue completely. By initializing these status bits appropriately, the PP can ensure that all handlers can run to completion, that request messages are selected only when sufficient outgoing queue space exists, and that reply messages are selected regardless of the state of the outgoing queues.

By providing programmable, hardware-based message dispatch, the jump table frees the PP from having to perform an inefficient software dispatch table lookup before processing a message. A jump table entry contains a pattern field, a speculative memory operation indicator, and a starting PC value. The inbox compares the pattern field of each jump table entry to the message type, several bits of the address contained in the message, and other information to determine which entry matches the message. The matching entry’s speculative memory indicator specifies whether to initiate a speculative read or write for the address contained in the message. The starting PC value specifies the PC at which the PP will begin processing the message. When the PP requests the next message from the inbox, the inbox inserts the starting PC value associated with the new message directly into the PP’s PC register. The PP can program the jump table entries to change the particular set of handlers in use.

4.1.4 Protocol Processor Operation

Each protocol requires different message types, state information, state formats, and handlers. To accommodate this variation in processing needs, the PP implements a subset of the MIPS instruction set [57] with extensions to accelerate common protocol operations. These extensions include bitfield extraction and insertion, branch on bit set/clear, and “field” instructions which specify a mask under which the operation is performed. Additional instructions implement the interface between the PP and the other MAGIC functional units, such as issuing message sends to the outbox, requesting new messages from the inbox, and programming the jump table. Thirty-two 64-bit general-purpose registers provide scratch space for use by PP code during protocol processing.

The PP itself is a 64-bit, statically-scheduled, dual-issue superscalar processor core. It fetches an instruction pair from the MAGIC instruction cache each cycle and executes both instructions unconditionally. The PP does not support interrupts, restartable exceptions, or hardware address translation. Although these features can simplify some aspects of protocol processing, they were eliminated to reduce design and implementation complexity. Hence, because the PP lacks many of the resource conflict detection features present in most contemporary microprocessors, the burden of avoiding resource conflicts and generating well-scheduled code falls on the PP programmer or compiler.
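As a rough illustration of what the bitfield extensions buy, the C helpers below spell out extraction, insertion, and a branch-on-bit test in portable code. Each helper expands to several ordinary RISC instructions (shifts, masks, compares); on the PP the corresponding work is roughly one instruction. The helpers are illustrative sketches and do not reproduce the exact PP instruction semantics.

    /* Portable C equivalents of the kinds of operations the PP's bitfield
     * extensions accelerate. Illustrative only; assumes 0 < len < 64. */
    #include <stdint.h>

    /* Extract 'len' bits starting at bit 'pos' (e.g. pull the TYPE field out
     * of a message header doubleword). */
    static inline uint64_t bf_extract(uint64_t word, unsigned pos, unsigned len)
    {
        return (word >> pos) & ((1ULL << len) - 1);
    }

    /* Insert 'value' into 'len' bits starting at bit 'pos' (e.g. update an
     * owner field in a directory entry in place). */
    static inline uint64_t bf_insert(uint64_t word, unsigned pos, unsigned len,
                                     uint64_t value)
    {
        uint64_t mask = ((1ULL << len) - 1) << pos;
        return (word & ~mask) | ((value << pos) & mask);
    }

    /* Branch on bit set/clear: test a single state bit, such as a dirty flag. */
    static inline int bit_is_set(uint64_t word, unsigned pos)
    {
        return (int)((word >> pos) & 1);
    }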

4.1.5 Outbox Operation

The outbox performs outgoing message sends on behalf of the PP. It provides a high-level interface to the PI, NI, and IO, relieving the PP from many implementation-specific details such as outgoing queue entry formats and handshaking conventions.

To initiate an outgoing message send, the PP composes the outgoing message header in its general-purpose registers. Next, the PP issues a “message send” instruction. The outbox detects the message send, makes a copy of the message header, inspects the destination of the outgoing message, and passes the message to the PI, NI or IO, as appropriate. Copying the message header requires only one cycle, permitting the PP to proceed almost immediately with additional message processing while the outbox formats the outgoing message and delivers it to the proper interface unit.

4.1.6 Interface Units

In addition to actual message processing, MAGIC must interface to the network, processor, and I/O subsystem. The three interface units — the PI, NI, and IO — implement these interfaces, isolating the rest of the chip from the interface details.

4.2 Detailed Description of MAGIC Functional Units

The MAGIC design was a major undertaking, particularly in a university environment. Although the entire MAGIC design is implemented on a single ASIC, the design is complex and, overall, has about six times more gates than the combined DASH directory controller, reply controller, and the core portions of the memory and IO boards.2

Whereas Section 4.1 presented an overview of the MAGIC design, the following sections describe each MAGIC functional unit in detail and provide a basis for understanding the components of the design, how they work individually, and how their functionality combines to form a cohesive system. The order of presentation roughly follows the flow of a message through the MAGIC chip.

2. MAGIC contains approximately 750K gates. DASH’s DC and RC boards, plus the core logic of the IO and memory boards, total approximately 125K gates. Both counts are expressed in the same 2-input gate equivalents used to estimate the gate counts for DASH [65]. However, MAGIC’s gate count includes its on-chip memories. Excluding on-chip memory, the MAGIC to DASH gate count ratio is about 2.5:1.

4.3 Message Headers

Common to all parts of the MAGIC chip is the goal of processing messages efficiently. As Section 4.1 notes, messages comprise two sections: message header and (optional) message data. Some parts of the MAGIC chip — in particular, the data transfer logic — process message data. Other parts of the chip — namely, the control macropipeline — process message headers. The functional unit descriptions that follow are more easily understood with the exact format of the messages the MAGIC chip processes in mind. This section introduces the format of the message headers; the data portion of a message is just a series of data words, and its format is self-explanatory.

Incoming messages originate from three sources: another node, in which case the message is delivered over the network; the processor, in which case the message is delivered over the SysAD bus; and the I/O subsystem, in which case the message is delivered over the PCI bus. One of the primary functions of the network, processor, and I/O interface units is to convert external commands into a common internal message format consisting of a MAGIC message header optionally followed by any data associated with the message. As noted earlier in the high-level architecture discussion, the message header is stored in the input queues, from which it will be read by the inbox and then passed to the protocol processor. Any data associated with the message flows into a data buffer. The message header must contain sufficient information for the inbox and protocol processor to identify and perform the requested operation and, in general, to send an outgoing message in response.

4.3.1 Message Header Format

A MAGIC message header consists of two doublewords, depicted in Figure 4.4. The first doubleword of the message header contains the following fields:

• The message type field (TYPE). This field specifies the operation to be performed. Although shown as a single field in the figure, the message type field is actually subdivided into two parts. The high four bits of the field (bits 11:8) are the protocol identifier bits; these specify the protocol to which the message belongs. The lower eight bits of the field (bits 7:0) specify the particular operation to be performed.
• The message length field (LEN). This field specifies the amount of data that accompanies the message. A message may have no data, one doubleword (8 bytes), or a cache line (128 bytes) of data associated with it.
• The unused field at bit positions 15:14 is undefined and is zero-filled.
• The source field (SRC). This field specifies the number of the node from which the message originated.
• The destination field (DEST). This field specifies the number of the node to which the message should be delivered.
• The miscellaneous field (MISC). The use of this field intentionally is undefined. Its intended use is as a general mechanism to permit protocol-specific information to accompany a message. Such information should be in addition to the information already contained in the other message header fields, of course. The Flashpoint [73] performance data collection environment, for instance, uses this field.
• The byte enable field (ENB). This field specifies the exact data length for doubleword transfers using encodings identical to the ones the R10000 uses on the SysAD bus to specify the size of uncached loads and stores. The byte enable field is used only when the message length field specifies a data length of one doubleword; for message lengths other than one doubleword, the byte enable field is still present in the message header, but is undefined.
• The double-buffer flag field (DBL). This field specifies whether the message data is being manipulated as part of a “double buffer” operation. The details of this process are described in Section 4.7.1.
• The wrap buffer field (WBUF). This field specifies the number of the data buffer to be used as the wrap buffer in double buffer operations. See Section 4.7.1 for details.
• The primary buffer field (PBUF). This field specifies the number of the data buffer associated with the message.

MAGIC message header, doubleword 0:

    Field   PBUF    WBUF    DBL   ENB     MISC    DEST    SRC     Zero    LEN     TYPE
    Bits    63:60   59:56   55    54:52   51:40   39:28   27:16   15:14   13:12   11:0
    Width   4       4       1     3       12      12      12      2       2       12

MAGIC message header, doubleword 1:

    Field   ADDR
    Bits    63:0
    Width   64

Figure 4.4 MAGIC message header format

The second doubleword of the MAGIC message header contains a single field: the address field (ADDR). This field specifies the address to which the message applies. Since the R10000 supports only 40 physical address bits, the upper 24 bits of the address field (bits 63:40) usually will be unused, as the address field is almost always used to store physical addresses. However, the entire 64-bit field is defined because certain protocols use the address field to exchange 64-bit virtual addresses.
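For readers who prefer code to bit diagrams, the following sketch extracts the doubleword 0 fields using the bit positions of Figure 4.4. It is a plain C rendering for illustration only; the accessor names are not taken from the MAGIC Specification.

    /* C rendering of MAGIC message header doubleword 0, using the bit
     * positions shown in Figure 4.4. Accessor names are illustrative. */
    #include <stdint.h>

    #define HDR_FIELD(dw, pos, len)  (((dw) >> (pos)) & ((1ULL << (len)) - 1))

    static inline unsigned hdr_type(uint64_t dw0) { return HDR_FIELD(dw0,  0, 12); }
    static inline unsigned hdr_len(uint64_t dw0)  { return HDR_FIELD(dw0, 12,  2); }
    static inline unsigned hdr_src(uint64_t dw0)  { return HDR_FIELD(dw0, 16, 12); }
    static inline unsigned hdr_dest(uint64_t dw0) { return HDR_FIELD(dw0, 28, 12); }
    static inline unsigned hdr_misc(uint64_t dw0) { return HDR_FIELD(dw0, 40, 12); }
    static inline unsigned hdr_enb(uint64_t dw0)  { return HDR_FIELD(dw0, 52,  3); }
    static inline unsigned hdr_dbl(uint64_t dw0)  { return HDR_FIELD(dw0, 55,  1); }
    static inline unsigned hdr_wbuf(uint64_t dw0) { return HDR_FIELD(dw0, 56,  4); }
    static inline unsigned hdr_pbuf(uint64_t dw0) { return HDR_FIELD(dw0, 60,  4); }

    /* Within TYPE, bits 11:8 identify the protocol and bits 7:0 the operation. */
    static inline unsigned hdr_protocol(uint64_t dw0)  { return HDR_FIELD(dw0, 8, 4); }
    static inline unsigned hdr_operation(uint64_t dw0) { return HDR_FIELD(dw0, 0, 8); }

    /* Doubleword 1 is the 64-bit ADDR field and needs no extraction. */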

4.3.2 Address Usage

To support a wider range of per-node memory sizes, the MAGIC chip can be configured for several total machine sizes using internal MAGIC configuration registers. From a user’s perspective, the total machine size mainly affects the maximum amount of memory that can be supported on each node. At the level of the operating system and within MAGIC, the effects of total machine size are more complicated, as the total machine size affects how the 40-bit physical addresses issued by the R10000 processors are used.

The 40-bit physical address is divided into three fields: node number, address space bits, and node offset. The node number field determines the home node for a particular address — that is, the node that physically contains the memory location corresponding to the given address. Address space bits are intended to allow the address to be used at a finer granularity. In particular, the address space bits are part of the search key used to index into the jump table in the inbox (see Section 4.6.1), and thereby support dispatch based not only on whether the address is local or remote, but also on how the address will be used. Finally, the node offset specifies where in the home node’s local memory the address is stored.

The size of the node number field and the size of the address space bits field are independent, and are determined by configuration registers within MAGIC. Table 4.1 lists the three supported machine sizes and the corresponding size of the node number field.

Table 4.1 Supported machine sizes

    Maximum Machine Size    Size of Node Number Field
    256 nodes               8 bits
    1024 nodes              10 bits
    4096 nodes              12 bits

The number of address space bits can be configured from 0 to 3, independent of the maximum machine size. Given a setting for the machine size and the number of address space bits, the number of node offset bits is simply:

    (node offset bits) = 40 - (node number bits) - (address space bits)

Once the sizes of the three fields have been determined, the location of the three fields within the full 40-bit address is as follows:
• The most significant N bits are the node number bits (N = 8, 10, or 12).
• The next most significant M bits are the address space bits (M = 0, 1, 2, or 3).
• The remaining (40 - N - M) bits are the node offset bits.
This partitioning scheme allows a wide variation in the amount of per-node memory and allows a tradeoff between memory size and availability of address space bits. For example, the size and location of the address fields for a 1024-node machine size with two address space bits are illustrated in Figure 4.5.

    Field   UNUSED   NODE    SPC     OFFSET
    Bits    63:40    39:30   29:28   27:0
    Width   24       10      2       28

Figure 4.5 Address component example (1024-node machine size, two address space bits)
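The decomposition can be written out directly in code. The helper below is a sketch using assumed names; the Figure 4.5 configuration corresponds to node_bits = 10 and space_bits = 2.

    /* Decompose a 40-bit FLASH physical address into node number, address
     * space bits, and node offset, given the configured field sizes. The
     * names and struct are illustrative, not MAGIC's internal representation. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        unsigned node;        /* home node number */
        unsigned space;       /* address space bits */
        uint64_t offset;      /* offset within the home node's local memory */
    } paddr_fields_t;

    static paddr_fields_t decompose(uint64_t paddr, unsigned node_bits,
                                    unsigned space_bits)
    {
        unsigned offset_bits = 40 - node_bits - space_bits;
        paddr_fields_t f;
        f.offset = paddr & ((1ULL << offset_bits) - 1);
        f.space  = (unsigned)((paddr >> offset_bits) &
                              ((1ULL << space_bits) - 1));
        f.node   = (unsigned)((paddr >> (offset_bits + space_bits)) &
                              ((1ULL << node_bits) - 1));
        return f;
    }

    int main(void)
    {
        /* 1024-node configuration from Figure 4.5: 10 node bits, 2 space bits. */
        paddr_fields_t f = decompose(0x12ABCDEF0ULL, 10, 2);
        printf("node %u, space %u, offset 0x%llx\n",
               f.node, f.space, (unsigned long long)f.offset);
        return 0;
    }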

4.4 Data Transfer Logic

Although the primary function of the MAGIC chip is protocol processing, the need to transfer data efficiently also is central to good performance. The data transfer logic — composed of the data buffers, data buffer allocator, and memory controller — is responsible for storing data while the associated message header is processed by the control macropipeline and for delivering the data as efficiently as possible to the proper destination interface unit once processing is complete. The primary challenge in the design of the data transfer logic is providing both high data transfer bandwidth and low latency. High bandwidth is needed to match the capabilities of the MAGIC external interfaces. The network, memory system, and compute processor can simultaneously deliver data at several hundred megabytes per second; without corresponding bandwidth within MAGIC, the data transfer logic would pose a serious performance bottleneck. Low latency is also important, especially for cache-coherent shared memory in which a processor is often stalled on a cache miss until the data is returned.

Complicating the implementation of the data transfer logic is the need to maintain flexibility. A flexible data transfer mechanism cannot make assumptions about where incoming data will be needed and implement dedicated connections to provide the necessary connectivity. Instead, a general mechanism is needed to accept data from and deliver data to multiple sources simultaneously under the direction of the protocol processor. This separation of policy and mechanism — the protocol processor determines where data will be sent and the data transfer logic handles the actual transfer of the data — is central to achieving flexibility, high bandwidth, and low latency. The sections below further explain the operation of the data transfer logic components.

4.4.1 Data Buffers

Data buffers are used to store any data associated with a message while the message is processed by the inbox, protocol processor, and outbox. Each data buffer is one cache line in size, 128 bytes in the FLASH system. When a message enters the MAGIC chip, the interface unit allocates a data buffer and assigns it to the incoming message, storing the data buffer number in the primary buffer number field of the message header. A data buffer is allocated to every message, regardless of whether the message actually has any associated data when it enters the chip. If the incoming message does contain data, the interface copies the data into the newly-allocated data buffer. Data buffer allocation and other buffer management functionality is discussed in Section 4.4.2, which describes the data buffer allocator unit.

In addition to their data storage function, the data buffers also provide support for data pipelining. That is, they permit the functional unit reading a data buffer (the “sink” unit) to begin the read operation even before the functional unit supplying the data (the “source” unit) has completed its write operation. Data pipelining is achieved by tagging each of the 16 doublewords within a buffer with a valid bit. All valid bits are cleared when the buffer is allocated. Thereafter, when the source unit writes a doubleword of data into the buffer, the data buffer logic sets the associated valid bit. The sink unit can then read from the data buffer at any time; however, any read operation that targets a doubleword whose valid bit is clear must be retried.

Each data buffer also includes a single error bit, used to indicate that one or more of the data doublewords in the buffer has been corrupted. The error bit is cleared along with the valid bits when the buffer is allocated. Thereafter, whenever a source unit detects that the data it is writing into the buffer has been corrupted, it sets the buffer’s error bit. Sink units can use the state of the error bit to ensure that data corruption information is retained as the data flows through the MAGIC chip and network.

To further improve data pipelining performance, each data buffer is tagged with two status bits, called the full bit and the partially full bit. Like the valid bits, the two data buffer status bits are cleared when the buffer is allocated. The partially full bit is used as a hint that the buffer contains at least one doubleword of valid data and indicates when a sink unit should begin reading from the buffer. The full bit is not a hint. When set, it indicates that all the doublewords in the buffer contain valid data.

Sixteen data buffers are provided on-chip, each large enough to store one cache line (16 doublewords) of data. The data buffers are implemented as a large multiported RAM array, with a total size of (16 buffers) x (128 bytes/buffer) = 2048 bytes (16384 bits). Multiporting is needed to allow many MAGIC functional units to access the data buffers concurrently. To avoid becoming a performance bottleneck, the data bandwidth achievable through the data buffers must at least match that supported by the rest of the MAGIC chip. To allow simultaneous remote memory reads (data transfer from memory to network) and remote data replies (data transfer from network to processor), a minimum of four data buffer ports are required. The actual implementation includes six ports to provide access for the IO interface unit and protocol processor in addition to the four ports noted in the previous sentence. The design of the data buffer memory array was the work of Ron Ho and Evelina Yeung.
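A software model of the per-buffer state described in this section might look like the sketch below; the struct layout and helper functions are illustrative assumptions, not the organization of the actual multiported RAM array.

    /* Software model of one MAGIC data buffer and its pipelining state, as
     * described in Section 4.4.1. Layout and helpers are illustrative only. */
    #include <stdint.h>
    #include <stdbool.h>

    #define DWORDS_PER_BUF 16         /* 128-byte cache line = 16 doublewords */
    #define NUM_DATA_BUFS  16

    typedef struct {
        uint64_t data[DWORDS_PER_BUF];
        uint16_t valid;               /* one valid bit per doubleword */
        bool     error;               /* set if any doubleword was corrupted */
        bool     full;                /* all doublewords valid (not a hint) */
        bool     partially_full;      /* hint: at least one doubleword valid */
    } data_buffer_t;

    /* Source unit writes one doubleword and updates the status bits. */
    static void buf_write_dword(data_buffer_t *b, unsigned i, uint64_t v)
    {
        b->data[i] = v;
        b->valid |= (uint16_t)(1u << i);
        b->partially_full = true;
        if (b->valid == 0xFFFF)
            b->full = true;
    }

    /* Sink unit read: returns false if the doubleword is not yet valid, in
     * which case the read is retried rather than waiting for the whole line. */
    static bool buf_read_dword(const data_buffer_t *b, unsigned i, uint64_t *v)
    {
        if (!(b->valid & (1u << i)))
            return false;
        *v = b->data[i];
        return true;
    }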

4.4.2 Data Buffer Allocator (DBA)

Because the data buffers are a shared resource, a mechanism is required to allocate new data buffers, to track which buffers are currently in use, and to deallocate buffers when they are no longer needed. The data buffer allocator (DBA) performs all these tasks, managing the allocation and use of the 16 data buffers among the various MAGIC functional units. Like the data buffers themselves, the data buffer allocator is multiported. Each port consists of a set of control lines through which a functional unit can issue commands.

To track the use of the data buffers, the data buffer allocator maintains a pending counter for each of the 16 data buffers. The pending counter is simply a small counter that allows the DBA to determine whether a buffer is presently in use by one or more MAGIC functional units. Initially, a data buffer’s pending counter is equal to zero, indicating that the buffer is currently unused and hence is available for allocation. When a functional unit requests that a buffer be allocated for its use, the DBA first selects a buffer whose pending counter is currently zero and then completes the allocation request by incrementing the selected buffer’s pending counter to one. Thereafter, any MAGIC functional unit that uses the buffer must increment the buffer’s pending counter (by issuing a DBA command) before it begins using the buffer, and then decrement the buffer’s pending counter when it has finished using the buffer. This series of increment/decrement operations will eventually return the pending counter value to one, regardless of interleaving. To indicate to the DBA that the buffer is no longer in use, however, the pending counter must be decremented one final time to a value of zero. This final decrement is referred to as a buffer deallocation, and is typically performed under the direction of a handler running on the PP. The deallocation can be performed at any time; the combination of the increment/decrement pairs plus the final deallocation decrement will cause the pending counter to return to zero after the last functional unit finishes using the buffer, regardless of the interleaving of the increments, decrements, and the deallocation. Only when the pending counter reaches zero will the DBA once again consider the buffer to be available for allocation.

Although the discussion above appears to imply that all the data buffers are treated as a single, common pool for purposes of allocation, a more sophisticated allocation strategy is required to avoid a potential deadlock condition. This condition is discussed fully in the next section, but for now the key point is that certain functional units require that a data buffer be reserved for their exclusive use. A reserved buffer can be allocated only to the functional unit for which it is reserved; other functional units will not be allocated the reserved buffer even if no other buffers are available. Hence, for allocation purposes each of the 16 data buffers is designated either as a general buffer available for allocation to all functional units, or as a reserved buffer available for allocation only to the functional unit for which it is reserved. The assignment of buffers for allocation is controlled by a set of internal DBA registers.

The DBA maintains sixteen pending counters, one for each data buffer. Functional units control the pending counters by issuing commands to the DBA, listed in Table 4.2. No explicit deallocate command exists. Though conceptually distinct from a decrement command, the functionality of a deallocate command is identical, and thus an explicit deallocate command is not required.

Table 4.2 Data buffer allocator commands

    Command     Function
    Allocate    Allocate a buffer
    Increment   Increment a buffer’s pending counter value
    Decrement   Decrement a buffer’s pending counter value
    Query       Return the current value of a buffer’s pending counter
    Set         Initialize a buffer’s pending counter
    Clear       Clear a buffer’s valid, partially full, and full bits

4.4.2.1 Deadlock Discussion

Because the NI, PI, and IO interface units must allocate a data buffer to each incoming message, the interface units will stall if no data buffers are available. If the MAGIC chip can enter a state in which all the interface units are stalled waiting for data buffers to become available, but no data buffers will be deallocated, then the MAGIC chip will deadlock. Whether such a deadlock scenario can occur is a function both of the particular protocol(s) in use and of the MAGIC chip’s use of the incoming queues. This section will describe a deadlock scenario that can occur in a typical cache coherence protocol, though the circumstances that lead to deadlock under this protocol are probably present in most protocols that would be run on the machine.

In a typical cache coherence protocol, the four network incoming queues and four network outgoing queues are divided statically into two types: request queues and reply queues. This division is required to prevent a more basic type of deadlock situation in which request messages and reply messages compete for the same network resources, causing a circular dependency and hence deadlock; this section describes a different type of deadlock situation. Section 4.6.1.1 discusses how the inbox selects messages from the incoming queues, but the key points for this deadlock discussion are:
• The inbox may select a message from an incoming request queue only if each of the outgoing reply queues has free slots.
• The inbox may select a message from an incoming reply queue at any time since reply messages require no outgoing queue space.
• Whenever a message arrives from the network, the NI allocates a data buffer for the message before placing the message in the appropriate incoming queue.
Deadlock can occur under these restrictions if:

• The DBA does not implement reserved buffers, and thus all data buffers are treated as a common pool, available for allocation to any functional unit.
• The PP processes several requests that collectively require enough outgoing reply messages to be sent that all the outgoing reply queues become filled.
• While the PP is processing these requests, more requests arrive. In pulling these requests off the network, the NI allocates all available data buffers.
• The MAGIC chips at the destinations of these outgoing reply messages are in the same situation. Thus, the local outgoing reply queues will not empty because neither the destination nodes nor the local node can allocate data buffers to accept the incoming reply messages.

Chapter 4: MAGIC Architecture and Implementation 57 At this point the inbox cannot select a message from an incoming request queue because of insufficient free space in the outgoing reply queues. To make forward progress, the local and destination inboxes must begin processing reply messages. Processing these replies will relieve the network backup and cause space to open in the outgoing reply queues, permitting the inboxes to once again process requests. Unfortunately, the lack of available data buffers prevents the NIs from accepting any replies from the network, and thus all the incoming reply queues are empty, preventing the inboxes from processing replies. A circular dependency now exists in which data buffers are needed to accept incoming replies, but data buffers will not be made available unless incoming requests can be processed, and incoming requests cannot be processed until incoming replies are accepted. The mistake in this situation was allowing the NI to allocate all available data buffers to incoming requests. To eliminate this deadlock possibility, the DBA does not treat the data buffers as a single allocation pool. Instead, the buffers are partitioned statically into the following five subgroups: ¥ One buffer for each of the four network incoming queues. These buffers are not treated as a common pool of four buffers; each incoming queue has its own reserved buffer. ¥ Twelve buffers for general allocation. These buffers are treated as a common pool and are available for allocation to all functional units. Dividing the buffers into these subgroups ensures that the deadlock scenario described above cannot occur. Although the PP still could fill the outgoing reply queues, the data buffers reserved for the incoming reply queues ensure that additional requests cannot con- sume all the remaining data buffers. These reserved incoming reply queue buffers also ensure that the NI has at least one data buffer available to accept incoming replies, thus guaranteeing that the PP can free space in the outgoing reply queues by processing incom- ing replies. At worst the reserved reply buffers are the only free data buffers, in which case the NI pulls a single reply off the network using the buffer reserved for that incoming reply queue, passes the message to the inbox, and then waits for the PP to process the reply and free the reserved buffer. Because the single data buffer serializes the NI, inbox, and PP, overall performance degrades drastically but MAGIC continues to make forward progress. One might expect that an analogous situation could cause deadlock with the PI or the IO interfaces, but for reasons that go beyond the scope of this discussion, it turns out that not being able to accept new messages from these interfaces does not lead to a deadlock situation. Hence, only the NI incoming queues require reserved buffers. In other protocols, of course, different sets of functional units may require reserved buffers.

4.4.3 Memory Controller (MC)

The memory controller manages the node’s local memory system, processes memory read and write transactions from the inbox and PP, and services PP cache fills and writebacks. The node’s local memory consists of two memory banks, each containing several memory modules and some external data multiplexor components. To reduce MAGIC pin count requirements, the memory data lines are routed to and from the memory modules via the data muxes. Memory address and control signals bypass the data muxes and are routed directly from the MC to the memory modules. Figure 4.6 illustrates the connections among the MAGIC chip, the data muxes, and the memory modules. The design of the memory controller logic was the work of David Nakahira.

Figure 4.6 Memory system overview (the MC drives a 14-bit address bus and 7 control signals directly to memory banks 0 and 1; each bank exchanges 72-bit data with the data mux at 50 MHz, and the data mux exchanges 72-bit data with the MC at 100 MHz)

The data muxes serve two purposes: to reduce the MAGIC pin count as noted above, and to perform a frequency conversion between the 100 MHz clock rate used on the port from the MAGIC chip to the data muxes and the 50 MHz clock rate used on the ports from the data muxes to the memory modules. By controlling the two memory banks such that each bank supplies 64 bits of data on alternating 50 MHz cycles, the MC can achieve the full 800 MB/s bandwidth possible on the data mux-MAGIC port.

In addition to managing the bank timing, the MC also provides error detection and correction. Each doubleword in the memory has an associated 8-bit ECC. On memory reads, the MC checks the ECC of each data doubleword returning from memory. Single-bit errors are corrected within the MC and are not visible to other MAGIC units. Double-bit and other uncorrectable errors cause the MC to set the error bit for the data buffer (see Section 4.4.1). On memory writes, the MC inspects each outgoing data doubleword and generates the proper ECC, which is written to the memory modules along with the data.

The memory controller supports single-buffer, cache line-sized reads and writes as well as doubleword reads and writes. The controller also supports double-buffer reads, which are used to accelerate unaligned message passing operations, as discussed in Section 4.7.1. For single-buffer, cache line-sized operations, the address must be doubleword-aligned and the MC transfers data in subblock order, as defined in the R10000 specification. Thus on memory reads, the MC writes data doublewords into the data buffer in subblock order, and on memory writes the MC expects the data to be stored in the data buffer in subblock order. The PP caches also are configured to accept fill data and supply writeback data in subblock order. Figure 4.7 illustrates the use of subblock ordering for an example cache line-sized memory read.

Example: single-buffer memory read from byte address 0x2FC8

    Data buffer doubleword offset:  0       1       2       3       4       5       6       7
    Memory byte address:            0x2FC8  0x2FC0  0x2FD8  0x2FD0  0x2FE8  0x2FE0  0x2FF8  0x2FF0

    Data buffer doubleword offset:  8       9       10      11      12      13      14      15
    Memory byte address:            0x2F88  0x2F80  0x2F98  0x2F90  0x2FA8  0x2FA0  0x2FB8  0x2FB0

Figure 4.7 Example single-buffer, subblock-ordered memory read operation
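The address sequence in Figure 4.7 follows a critical-doubleword-first pattern in which the i-th doubleword written to the buffer comes from line offset (critical index XOR i). The short program below reproduces the figure's sequence; it is an illustrative sketch, and the R10000 specification remains the authoritative definition of subblock ordering.

    /* Reproduce the subblock-ordered address sequence of Figure 4.7: the i-th
     * doubleword written into the data buffer comes from line offset
     * (critical_index XOR i), so the critical doubleword transfers first. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES   128
    #define DWORD_BYTES  8
    #define DWORDS       (LINE_BYTES / DWORD_BYTES)   /* 16 */

    int main(void)
    {
        uint64_t req_addr  = 0x2FC8;                                /* request */
        uint64_t line_base = req_addr & ~(uint64_t)(LINE_BYTES - 1);/* 0x2F80 */
        unsigned critical  = (unsigned)((req_addr / DWORD_BYTES) % DWORDS); /* 9 */

        for (unsigned i = 0; i < DWORDS; i++) {
            uint64_t addr = line_base + (uint64_t)(critical ^ i) * DWORD_BYTES;
            printf("buffer doubleword %2u <- memory address 0x%llX\n",
                   i, (unsigned long long)addr);
        }
        return 0;
    }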

4.5 Interface Units

The data buffers, data buffer allocator, and memory controller form the core of the MAGIC data transfer logic. The remaining portions of the chip are interface units and the control macropipeline. The interface units, as described in Section 4.1, include the network interface, processor interface, and I/O interface. These units not only provide the physical interfaces to the network, compute processor, and I/O subsystem, respectively, but also handle the splitting and recombination of message headers and message data. In addition, the interface units perform data format conversions between the external world and the internal MAGIC message format, and perform certain preprocessing tasks in preparation for passing message headers to the inbox and protocol processor.

4.5.1 Network Interface (NI)

The NI performs two major operations: forwarding outgoing messages from the NI output queues to the network, and forwarding incoming messages from the network to the NI incoming queues. In addition to isolating the rest of the MAGIC design from the details of the MAGIC-router interface, the NI also performs message routing calculations and message error detection, and precomputes some of the values that are used later to index into the jump table in the inbox.

The Spider router component that implements FLASH’s interconnection fabric supports four virtual lanes across each of the two physical links (incoming and outgoing) that connect one router to another router, or a router to a MAGIC chip. Accordingly, the NI also supports four incoming and four outgoing virtual lanes. It provides four output queues and four input queues, one for each virtual lane and, like the router, allows all virtual lanes to operate independently.

Logically, the NI consists of two subunits: the NI incoming logic, which implements the router to NI input queues path, and the NI outgoing logic, which implements the NI output queues to router path.

4.5.1.1 Network Message Format

Within the MAGIC chip, messages consist of a MAGIC message header, as shown in Figure 4.4, and any associated message data. When messages are transferred over the network, their format must be altered to accommodate additional routing and other network control information. As described in Section 3.3, the network breaks messages into 128-bit packets called micropackets. Thus, to translate a message from the internal MAGIC format to the network format, the NI outgoing logic must add additional routing and control information and break the entire message into a series of micropackets. The NI incoming logic at the destination node reverses this process, taking a series of micropackets and restoring the internal MAGIC message header and data format.

All messages on the network begin with one network header micropacket. The format of this micropacket is shown in Figure 4.8 and is dictated by the design of the Spider chip, whose internal logic assumes that all messages start with a header micropacket in the format shown. The network header is a mixture of the MAGIC message header fields from Figure 4.4 and three additional fields that are interpreted by the router. These additional fields are the message age (AGE), direction (DIR), and congestion control (CC), which, along with the destination field, control message routing and priority through the network. The asterisks in the figure indicate that these fields are the only portions of the network header that the router interprets. The router ignores other fields within the network header, which pass unchanged from source to destination.

The destination field is nine bits in the network header but twelve bits in the MAGIC header. The Spider router supports only a 9-bit destination field, so the upper three bits of the 12-bit MAGIC destination are discarded as the MAGIC header is translated into a network header. As a result, the maximum FLASH system size with the current Spider router is 512 nodes even though the MAGIC chip itself can support up to 4096 nodes.

Network message header, doubleword 0:

    Field   ENB     MISC    SRC     LEN     TYPE    DEST*   DIR*    AGE*   CC*
    Bits    63:61   60:49   48:37   36:35   34:23   22:14   13:10   9:2    1:0
    Width   3       12      12      2       12      9       4       8      2

Network message header, doubleword 1:

    Field   ADDR
    Bits    63:0
    Width   64

Figure 4.8 Network header micropacket format (fields marked with an asterisk are the only fields the router interprets)

For a message that contains no data, the entire message consists solely of the network header. For messages with data, the data is placed in additional micropackets that are transferred after the network header. In addition to messages without data, MAGIC supports two data lengths: one doubleword of data or one cache line of data. The network formats of these messages are shown in Table 4.3 and Table 4.4.

Table 4.3 Network message format, one doubleword of data

    Micropacket   Contents
    0             Network header
    1             Data doubleword in bits [63:0]; bits [127:64] undefined

Table 4.4 Network message format, one cache line of data

    Micropacket   Contents
    0             Network header
    1-8           Data doublewords 0-15; two doublewords per micropacket

Table 4.5 summarizes the total message size (in micropackets) and the associated peak network data bandwidth achievable with the specified network message formats, assuming a peak raw network transfer rate of 800 MB/s. Data bandwidth refers to the number of bytes of actual data that can be transferred per unit time, and thus excludes the network header micropacket. It is calculated by dividing the actual amount of data in a message by the total message size in bytes and then multiplying by 800 MB/s.

Table 4.5 Total message size and data bandwidth

    Message Length    Total Message Size   Data Bandwidth
    No data           1 micropacket        N/A
    One doubleword    2 micropackets       200 MB/s
    One cache line    9 micropackets       711 MB/s
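As a quick check of the bandwidth entries, the arithmetic can be written out directly; the numbers below follow from the 16-byte micropacket size and the 800 MB/s raw rate assumed in the text.

    /* Worked check of the Table 4.5 data bandwidth figures: data bytes divided
     * by total message bytes (16 bytes per micropacket), times 800 MB/s. */
    #include <stdio.h>

    int main(void)
    {
        const double raw_mb_s = 800.0, upkt_bytes = 16.0;

        double dw = (8.0   / (2.0 * upkt_bytes)) * raw_mb_s;   /* one doubleword */
        double cl = (128.0 / (9.0 * upkt_bytes)) * raw_mb_s;   /* one cache line */

        printf("one doubleword: %.0f MB/s\n", dw);   /* 200 MB/s  */
        printf("one cache line: %.0f MB/s\n", cl);   /* ~711 MB/s */
        return 0;
    }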

4.5.1.2 NI Incoming Logic

Figure 4.9 depicts an overview of the NI incoming logic path. Micropackets arriving from the network are received into the incoming LLP module, which manages the link-level aspects of the network protocol. Specifically, the incoming LLP receives micropackets from the router, checks the micropacket CRC to see if the micropacket was corrupted in transit, and requests retransmission if an error occurred. The incoming LLP also manages flow control between the MAGIC chip and the router, indicating whether additional micropackets can be accepted on each virtual lane.

Figure 4.9 NI incoming logic overview (micropackets from the off-chip router enter the on-chip prequeues and header/data logic, which fill the data buffers and the NI input queues)

From the incoming LLP, micropackets enter the prequeues module where they are queued in one of the four prequeues, depending on the virtual lane on which the micropacket arrived. The NI incoming header/data logic then reads the micropackets from the prequeues, allocates a data buffer for each incoming message, converts the network header into a MAGIC message header, and stores the header in the proper NI input queue. It also precomputes certain message characteristics that are later used to index into the jump table in the inbox; see Section 4.6.1.2 for details. The message data micropackets flow from the prequeues into the header/data logic and then to the proper data buffer. In this way, the header/data logic splits the message into header information (stored in the NI input queues) and data (stored in the data buffers).

4.5.1.3 NI Outgoing Logic

Figure 4.10 shows an overview of the NI outgoing logic. The outbox stores outgoing message headers into one of four output queues, according to the virtual lane on which the PP sent the outgoing message. Like the NI incoming logic, the NI outgoing logic consists mainly of a header/data logic block. This logic reads outgoing message headers from the output queues, converts each header into the network header format shown in Figure 4.8, computes the additional routing and control fields needed by the router, and forwards the network header micropacket to the outgoing LLP module. Then, if the message contains data, the data is read from the appropriate data buffer, split into micropackets, and forwarded to the outgoing LLP. The outgoing LLP logic takes the outgoing micropackets, adds CRC information where required, and transmits the micropackets to the router.

Figure 4.10 NI outgoing logic overview (headers from the outbox and data from the data buffers merge in the header/data logic and pass through the outgoing LLP to the off-chip router)

4.5.2 Processor Interface (PI)

The processor interface handles communication with the R10000 compute processor via the SysAD bus. The PI handles incoming messages originating from the R10000 and outgoing messages arriving from the outbox. However, because the SysAD bus itself and many of the internal PI resources are shared, the PI is best viewed as three submodules: one to process incoming messages from the R10000, one to process outgoing messages arriving from the outbox, and a third submodule to handle shared resource arbitration and allocation between the first two submodules. Figure 4.11 depicts the internal organization of the PI. The design of the PI logic was the work of David Ofelt.

To a much greater extent than the NI, the PI bases its operation on the contents of several outgoing message header fields. For this reason, the PP is not free to define the message type encodings used in messages sent to the R10000; instead, these encodings must match those used on the SysAD bus, regardless of the protocol(s) in use in the system.

Figure 4.11 PI overview (three submodules attached to the SysAD bus: the incoming logic, with its prequeue and input queue feeding the inbox; the arbitration logic, with the SysAD arbiter, outstanding transaction table, and conflict resolution table; and the outgoing logic, with its staging register and the PI reply register read by the PP)

4.5.2.1 Arbitration, Conflict Detection, and Conflict Resolution

One function of the PI arbitration logic is to manage SysAD bus arbitration and flow control. The incoming logic, outgoing logic, and R10000 all can request use of the SysAD bus. The SysAD bus arbiter grants use of the bus to these modules on a transaction-by-transaction basis. Because the SysAD bus specification requires a dead cycle between bus ownership transfers, a strict round-robin arbitration policy could impact overall SysAD utilization by generating an unnecessarily large number of ownership transfers. For this reason, the SysAD arbiter attempts to optimize SysAD utilization by granting the bus to the same requester so long as that requester has another transaction ready and so long as all deadlock possibilities are avoided. The SysAD arbiter also must manage SysAD flow control. When the incoming logic cannot accept new read transactions or write transactions, it notifies the SysAD arbiter, which controls the proper SysAD signals as necessary.

Another function of the arbitration logic is to track all outstanding transactions3 and manage conflicts that occur between an outstanding transaction and an outgoing message (that is, a message from the PP to the R10000). A conflict occurs whenever the address contained in an outgoing message matches the address of an outstanding R10000 transaction. Conflict detection requires that the address contained in an outgoing message be compared to the addresses of all outstanding transactions to determine whether a conflict has occurred. Conflict resolution requires that the command contained in the outgoing message and the command contained in the conflicting outstanding transaction be inspected to determine the proper conflict resolution procedure. In most cases, the presence of a conflict can be ignored. In other cases, however, proper resolution of the conflict requires the outgoing message to be altered or even aborted.

The outstanding transaction table (OTT) and conflict resolution table (CRT) are used for conflict detection and conflict resolution, respectively. The OTT is a small, fully-associative array, each entry of which contains the address, command, and SysAD bus ID of an outstanding R10000 transaction. Each OTT entry also contains a timeout counter. The CRT is also a memory array, though not an associative one, indexed by a combination of the contents of an OTT entry and a command from the PI outgoing logic.

When the R10000 issues a new transaction on the SysAD bus, the incoming logic inserts the associated address, command, and SysAD bus ID into the OTT. As outgoing messages are received by the PI outgoing logic, the address of the outgoing message is used to search the OTT. If an OTT entry with the same address is found, the command field of the matching OTT entry and the command contained in the outgoing message are concatenated and used to index into the CRT. The output from the CRT controls how the outgoing logic will process the outgoing message. The CRT output specifies that the outgoing message be passed unaltered to the R10000, that the outgoing message be passed to the R10000 but with the command in the outgoing message replaced by a negative acknowledgment, or that the outgoing message be aborted (that is, not delivered to the R10000 at all) and a negative acknowledgment returned to the PP instead. The CRT output also specifies whether the outgoing message is the completion of the matching outstanding transaction. If so, the CRT specifies that the matching OTT entry be deallocated, indicating that the outstanding transaction is now complete.

Table 4.6 lists two example conflict cases and their resolutions. In general, many of the conflict cases can occur only as a result of cache coherence operations, though some conflicts are likely to be present in any protocol.

3. A transaction is outstanding if it has been accepted by the PI incoming logic but no completion response has been issued to the R10000 from the PI outgoing logic.

Table 4.6 Example conflict cases

    Conflict: A read reply is received for a line that has been invalidated while the read transaction was outstanding.
    Resolution: Send a negative acknowledgment to the R10000 instead of the read reply data.

    Conflict: An exclusive intervention request is received for a line that has an upgrade request outstanding.
    Resolution: Abort the outgoing message and return a negative acknowledgment to the PP.
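A rough software model of the OTT lookup and CRT-driven resolution is sketched below. Table sizes, the CRT index construction, and all names are assumptions chosen for illustration rather than the PI's actual structures.

    /* Sketch of PI conflict detection (OTT) and resolution (CRT). Sizes, the
     * CRT index construction, and all names are illustrative assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    #define OTT_ENTRIES 4        /* assumed number of outstanding transactions */
    #define CMD_BITS    4        /* assumed width of a command encoding */

    typedef struct {
        bool     valid;
        uint64_t addr;           /* address of the outstanding R10000 transaction */
        uint8_t  cmd;            /* its SysAD command */
        uint8_t  sysad_id;       /* its SysAD bus ID */
        uint16_t timeout;        /* timeout counter (unused in this sketch) */
    } ott_entry_t;

    typedef enum {               /* possible CRT resolutions */
        PASS_UNALTERED,          /* deliver the outgoing message as-is */
        PASS_AS_NAK,             /* deliver it, but as a negative acknowledgment */
        ABORT_NAK_TO_PP          /* do not deliver it; NAK the PP instead */
    } crt_action_t;

    typedef struct {
        crt_action_t action;
        bool completes_transaction;   /* if true, deallocate the OTT entry */
    } crt_entry_t;

    /* Associative OTT search on the outgoing message's address. */
    static ott_entry_t *ott_lookup(ott_entry_t ott[], uint64_t addr)
    {
        for (int i = 0; i < OTT_ENTRIES; i++)
            if (ott[i].valid && ott[i].addr == addr)
                return &ott[i];
        return 0;
    }

    /* Resolve a conflict: the OTT entry's command and the outgoing command are
     * concatenated to index the (non-associative) CRT. */
    static crt_entry_t resolve(const crt_entry_t crt[], const ott_entry_t *hit,
                               uint8_t outgoing_cmd)
    {
        unsigned index = ((unsigned)hit->cmd << CMD_BITS) | outgoing_cmd;
        return crt[index];
    }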

4.5.2.2 Incoming Message Processing

Incoming messages, in the form of SysAD bus transactions, originate from the R10000 and flow into the incoming control logic. To improve performance, the incoming logic attempts to preallocate a data buffer so that new incoming messages can be accepted and processed immediately, without first having to allocate a data buffer at the time the message arrives. If no data buffers are available, or if the incoming queue is full, the message is stored temporarily in the prequeue and the SysAD arbiter is notified so that it can manipulate the SysAD arbitration process and prevent the R10000 from issuing new transactions. Assuming that a data buffer has been preallocated, the incoming logic reads the SysAD transaction, converts it into a MAGIC message header, and stores the header in the PI input queue. Any data associated with the message is stored in the data buffer that has been allocated. In addition to allocating a data buffer and transferring the message header and message data, the incoming logic also updates the OTT as noted above.

Chapter 4: MAGIC Architecture and Implementation 67 4.5.2.3 Outgoing Message Processing Unlike incoming messages, outgoing messages are not stored in a queue. Instead, the PI implements a single outgoing message staging register into which the outbox writes outgoing message headers. Once an outgoing message has been written into the staging register, the PI will not accept a new outgoing message until it completes processing the current one. The outgoing message header is passed through the OTT and CRT as described in Section 4.5.2.1, converted into the proper format for a SysAD transaction, and then issued on the SysAD bus (unless the CRT indicates that the outgoing message should be aborted). If the message type indicates that the message has associated data, the outgoing logic transfers the data from the data buffer to the SysAD bus. For some outgoing message types, the R10000 will return a response to the PI. This response consists of a state indication and, for certain message types, data retrieved from the R10000’s cache. The state portion of the response is stored in the PI reply register, from which it can be read by the PP. Any data that accompanies the response is stored in the data buffer that was specified in the outgoing message header.

4.5.3 I/O Interface (IO) The IO interface unit implements the FLASH node’s local PCI-based I/O subsystem [81], forming the interface between the MAGIC chip and all devices attached to the local PCI bus. The design of the IO logic was the work of Hema Kapadia. Providing efficient I/O service is challenging in distributed multiprocessors for several reasons. First, I/O devices are characterized by a large variation in data bandwidths, with most devices providing significantly lower bandwidth than the other MAGIC interfaces. Data latency also is typically larger on I/O devices. Consequently, the design of the I/O interface unit must not only accommodate these varying bandwidth and latency character- istics, but do so in a manner that does not adversely impact the performance of the other MAGIC interfaces. Second, the PCI bus itself is not well-suited for use in a multiprocessor. With a bus protocol targeted and optimized for personal computers, the PCI bus does not mesh effi- ciently with a multiprocessor memory system. Assumptions about the ability of the host memory system to provide transaction completion information and assumptions about the ability to complete a transaction once it is issued on the bus work well in uniprocessors or even bus-based multiprocessors, but are unsuited for the variable latency and general lack of atomicity of a distributed multiprocessor memory system. Overall, then, the IO inter- face unit must not only provide the necessary logical interface and translation mechanisms

to allow the MAGIC chip to communicate with the attached PCI devices, but also must layer additional functionality on the interface logic to adapt the PCI bus characteristics to a distributed multiprocessor environment. All of these requirements are managed in the IO interface unit. The block diagram is shown in Figure 4.12. Like the processor interface, the IO interface is best viewed as three submodules: a PCI bus arbiter and bus interface unit, a unit to handle outgoing messages from the protocol processor to a PCI device, and a unit to handle incoming messages from a PCI device.

Figure 4.12 IO overview (block diagram: PCI arbitration and interface logic attached to the PCI bus, plus outgoing and incoming logic blocks with the IO staging and IO status registers, receiving from the data buffers and the outbox and delivering to the PP, the inbox, and the data buffers)

Outgoing messages are translated into programmed I/O (PIO) transactions on the PCI bus. Incoming messages — those originating from a PCI device — are called DMA trans- actions and are translated into MAGIC messages for processing by the protocol processor. Section 4.5.3.2 below describes PIO and DMA operations in more detail.

4.5.3.1 Coherence One of the major challenges in integrating an I/O subsystem into a multiprocessor is maintaining coherence between data held in the memory system (that is, in the memories themselves and in the R10000 caches) and data held in the I/O subsystem. The MAGIC IO interface solves this problem by implementing a one-line I/O cache called the collection buffer. The collection buffer actually solves two problems. First, as noted, it provides a

means to maintain coherence. Like the compute processor cache, the collection buffer maintains coherence state information and can respond to intervention requests and supply data as needed. Thus the system, typically with a few augmentations to the directory format and cache coherence protocol, can treat each node's I/O system almost as if it were another processor.

Second, the collection buffer, as its name implies, provides buffering for I/O transactions that are smaller than a cache line. Without such buffering, each of these smaller transactions — often just a byte or word in size — would need to be issued into the memory system, causing a substantial performance loss for the I/O operations themselves and interfering with other, non-I/O messages.
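The collection buffer can be pictured as a one-line, write-back I/O cache with its own coherence state. The C sketch below is a simplified behavioral model under assumed field and function names (it is not the actual IO interface logic); it shows how sub-line DMA writes can be merged in the buffer and how the buffer participates in coherence by responding to interventions.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define LINE_BYTES 128

    typedef enum { CB_INVALID, CB_SHARED, CB_EXCLUSIVE } cb_state_t;

    typedef struct {
        cb_state_t state;
        uint64_t   line_addr;             /* cache-line-aligned address */
        uint8_t    data[LINE_BYTES];
        bool       dirty;
    } collection_buffer_t;

    /* Hypothetical hooks into the rest of the memory system. */
    extern void fetch_line_exclusive(uint64_t line_addr, uint8_t *data);
    extern void writeback_line(uint64_t line_addr, const uint8_t *data);

    /* Merge a small DMA write (a byte up to a few words) into the buffer. */
    void cb_dma_write(collection_buffer_t *cb, uint64_t addr,
                      const uint8_t *src, unsigned len)
    {
        uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);

        if (cb->state == CB_INVALID || cb->line_addr != line) {
            if (cb->state != CB_INVALID && cb->dirty)
                writeback_line(cb->line_addr, cb->data);   /* flush previous line      */
            fetch_line_exclusive(line, cb->data);          /* obtain ownership of line */
            cb->state     = CB_EXCLUSIVE;
            cb->line_addr = line;
            cb->dirty     = false;
        }
        memcpy(&cb->data[addr - line], src, len);          /* merge the sub-line write */
        cb->dirty = true;
    }

    /* Respond to a coherence intervention for the buffered line. */
    void cb_intervention(collection_buffer_t *cb, uint64_t line_addr)
    {
        if (cb->state != CB_INVALID && cb->line_addr == line_addr) {
            if (cb->dirty)
                writeback_line(cb->line_addr, cb->data);   /* supply/flush the data */
            cb->state = CB_INVALID;
        }
    }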

4.5.3.2 PIO and DMA Operations The IO interface supports two transactions: PIOs and DMAs. PIO reads and writes are initiated by the protocol processor, typically in response to an uncached operation issued by a compute processor. The PIO is received from the outbox into the IO staging register, converted into the proper format for the PCI bus by the outgoing logic block, and then issued to the appropriate PCI device. If the PIO requires data transfer, the data is read or written from the data buffer contained in the outgoing message header. DMA read and write transactions are similar to PIOs except that they are initiated by a PCI device rather than by the protocol processor. The DMA request is received from the PCI bus into the incoming logic block, where it is converted into a MAGIC message header and passed to the inbox for processing. A data buffer is also allocated and used to store any data associated with the DMA. Complications arise in the servicing of both PIOs and DMAs in cases in which the transaction cannot be satisfied immediately. For example, a DMA read of remote memory can require several hundred cycles to retrieve the requested data. In normal PCI operation, the DMA occupies the bus for the duration of the transfer, preventing other devices and the protocol processor from using the bus for other transactions. Although this convention works well in a uniprocessor or bus-based multiprocessor, it is ill-suited to a distributed multiprocessor. As a result, the IO interface logic provides several mechanisms to abnor- mally terminate a transaction so that other transactions can be issued on the PCI bus. This capability is necessary to avoid deadlock in the I/O system.

4.6 Control Macropipeline

The interface units and data transfer logic provide an efficient mechanism to receive messages into MAGIC, split the messages into headers and data, store the message data in a data buffer, and deliver the headers to the control macropipeline. How these headers are processed is the realm of the control macropipeline. The inbox, protocol processor, and outbox — the components of the control macropipeline — combine to determine in what order the message headers will be serviced, what protocol actions are needed for each header, what outgoing messages need to be sent in response, and to where the data transfer logic should deliver any data associated with the message. Close coupling among the inbox, protocol processor, and outbox is central to performing these tasks efficiently.

4.6.1 Inbox The inbox is responsible for selecting the next message to be processed by the PP and for performing certain message preprocessing activities, including message type dispatch and speculative memory operation initiation. By performing these repetitive preprocessing tasks, the inbox increases the efficiency of the MAGIC macropipeline by freeing the PP from performing these tasks and by allowing the preprocessing to occur while the PP and outbox are working on previous messages. The design of the inbox logic was the work of Mark Heinrich. Message preprocessing consists of several steps. First the inbox scheduler must select the incoming queue from which it will read the next message header to be processed. After selecting a queue, the inbox reads the two MAGIC message header doublewords from the queue and stores them temporarily. At the same time the message header is read from the incoming queue, portions of the message header are used to index into the jump table, a memory array that allows for efficient, hardware-based message dispatch and speculative memory operation initiation. One output of the jump table indicates whether the inbox should initiate a speculative memory operation for the address contained in the message header. If this output is asserted, the inbox takes the address and buffer number from the message header along with a read/write indication from the jump table and issues the appropriate memory transaction to the memory controller. Another jump table output con- tains the starting PC value at which the PP should begin processing the message. The only exception to this procedure occurs when the scheduler selects the internal software queue registers as the source of the next message header to be processed. These registers are maintained entirely by the PP and support temporary suspension of message processing while the PP performs another task. Section 4.6.2.2 discusses the software

Chapter 4: MAGIC Architecture and Implementation 71 queue in more detail. The software queue registers contain a MAGIC message header and a starting PC value for the PP. When the software queue registers are selected, the inbox does not use the jump table. Instead, the contents of the software queue registers directly specify the PP’s starting PC value, and no speculative memory operation can be initiated. Once a message header and its associated PC value have been generated, the PP may retrieve the new header from the inbox. The PP issues a switch instruction to retrieve the first doubleword of the message header and the PC value. The header is placed in the PP’s general-purpose register that is specified as the destination of the switch instruction, and the PC value is injected directly into the PP’s current PC register, just as if the PP had per- formed a MIPS “jump register” instruction. The PP then must issue a ldctxt instruction to retrieve the second message header doubleword, which also is placed in one of the PP’s general-purpose registers. Upon completion of the switch/ldctxt instruction sequence, the PP has retrieved the entire MAGIC message header of the next message to be processed, has jumped to the proper starting PC, and can begin actual message processing.

4.6.1.1 Inbox Scheduler Operation

The scheduler is responsible for selecting the next message to be processed from the six incoming queues — the PI, IO, and each of the four network lanes feed into separate incoming queues — and the software queue registers. While conceptually straightforward, the actual scheduling process is complicated by the need to avoid inducing deadlock by scheduling a queue when its resource requirements cannot be met. In the absence of deadlock issues, scheduling is a round-robin process, meaning that each incoming queue has equal priority.

Deadlock can occur if a queue is scheduled when insufficient outgoing network queue space exists. In a traditional request–reply protocol, for instance, a request must not be scheduled unless sufficient space exists in the outgoing network reply queue. For this reason, the scheduler considers not only round-robin priority when choosing the next incoming queue, but also the free space available in each of the four outgoing network queues. Several inbox configuration registers specify the outgoing queue space requirements for each of the incoming queues; this information is combined with the actual amount of free space in each outgoing network queue to determine if deadlock considerations require an otherwise valid incoming queue to be temporarily disregarded from scheduling. After deadlock considerations have been taken into account, the round-robin priority scheme governs which incoming queue is selected.
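A minimal C sketch of the scheduling decision, using hypothetical names for the configuration registers and queue-occupancy counts. It captures the two ingredients described above, a round-robin pointer over the incoming queues and a per-queue check that sufficient outgoing network queue space exists, without modeling the real register formats.

    #include <stdbool.h>

    #define NUM_IN_QUEUES  7   /* PI, IO, 4 network lanes, software queue */
    #define NUM_OUT_LANES  4   /* outgoing network queues */

    /* Hypothetical state maintained by the inbox. */
    extern bool in_queue_nonempty[NUM_IN_QUEUES];
    extern int  out_free_space[NUM_OUT_LANES];
    /* Configuration: outgoing space each incoming queue must be able to consume. */
    extern int  space_required[NUM_IN_QUEUES][NUM_OUT_LANES];

    static int rr_next = 0;    /* round-robin pointer */

    static bool deadlock_safe(int q)
    {
        for (int lane = 0; lane < NUM_OUT_LANES; lane++)
            if (out_free_space[lane] < space_required[q][lane])
                return false;  /* scheduling q now could fill an outgoing queue */
        return true;
    }

    /* Returns the index of the next incoming queue to service, or -1 if none. */
    int inbox_schedule(void)
    {
        for (int i = 0; i < NUM_IN_QUEUES; i++) {
            int q = (rr_next + i) % NUM_IN_QUEUES;
            if (in_queue_nonempty[q] && deadlock_safe(q)) {
                rr_next = (q + 1) % NUM_IN_QUEUES;  /* equal priority next time */
                return q;
            }
        }
        return -1;  /* nothing schedulable this cycle */
    }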

72 Chapter 4: MAGIC Architecture and Implementation 4.6.1.2 Jump Table Operation Once an incoming queue has been selected in the scheduling process, the two MAGIC message header doublewords are read from the front of the selected queue. Parts of the header are extracted and used to form a 17-bit search key for the jump table. To determine whether a jump table entry matches the jump table search key, the search key is partitioned into two fields: a 6-bit index field and an 11-bit comparison field. The components of the index and comparison fields are noted in Table 4.7 and Table 4.8.

Table 4.7 Jump table index field components

Component Name         Width   Description
Address local/remote   1       Set if the node number field of the address contained in the message header is equal to the local node number
Queue source           1       Set if message originated from NI; clear if message originated from PI or IO
Message type low       4       Bits [3:0] of the message type field

Table 4.8 Jump table comparison field components

Component Name      Bits   Description
Address range       1      Set if the offset portion of the address is outside the memory physically present on the node
Reserved            1      Set if the data buffer associated with the message is the queue's reserved data buffer
Cell                1      Set if the source field of the message header indicates that the message originated from within the local node's fault-containment cell
Address space       3      The three address space bits
Queue source        1      Set if message originated from IO; clear if message originated from PI or NI
Message type high   4      Bits [7:4] of the message type field

The jump table is a 256-entry, four-way associative RAM array, and the index field is used as an address to fetch four jump table entries. Each jump table entry contains, in addition to the starting PC and speculative memory operation indicator, an 11-bit mask field and an 11-bit match field. The jump table entry matches the jump table key if and only if the following condition holds for each of the 11 bits in the comparison field, mask, and match values: entryMatches = (comparison & mask) == match

The structure of this test leads to distinct functionality for each of the four combinations of the mask and match bits, as shown in Table 4.9.

Table 4.9 Effect of mask and match

Mask   Match   Effect
0      0       Corresponding bit of comparison field will always match
0      1       Corresponding bit of comparison field will never match
1      0       Corresponding bit of comparison field will match only if it is equal to 0
1      1       Corresponding bit of comparison field will match only if it is equal to 1

All four jump table entries are searched using this test, and at most one entry is permit- ted to match the search key; it is an error if the jump table is initialized such that more than one entry can match a given search key. Once a matching entry is found, the speculative memory operation indication and starting PC values are extracted and used to govern fur- ther inbox operation as described earlier. The jump table also contains support for sampling-based performance monitoring. Discussion of this topic is deferred to Section 4.7.2, which describes the sampling support in the jump table as well as other MAGIC performance monitoring facilities.
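The lookup can be summarized in C as follows. The bit positions chosen for the key fields, the entry layout, and the function names are assumptions for illustration; the 6-bit index, the 11-bit comparison field, the four candidate entries per index, and the (comparison & mask) == match test follow the description above and Table 4.9.

    #include <stdint.h>
    #include <stdbool.h>

    #define JT_SETS 64     /* 6-bit index */
    #define JT_WAYS 4      /* four-way associative */

    typedef struct {
        uint16_t mask;        /* 11 bits used */
        uint16_t match;       /* 11 bits used */
        bool     spec_op;     /* initiate a speculative memory operation? */
        bool     spec_write;  /* read or write */
        uint32_t start_pc;    /* handler starting PC for the PP */
    } jt_entry_t;

    static jt_entry_t jump_table[JT_SETS][JT_WAYS];

    /* Illustrative key formation from header fields (bit packing is assumed). */
    static unsigned jt_index(bool addr_local, bool from_ni, unsigned msg_type)
    {
        return (addr_local << 5) | (from_ni << 4) | (msg_type & 0xF);
    }

    static unsigned jt_comparison(bool addr_out_of_range, bool reserved_buf,
                                  bool same_cell, unsigned addr_space,
                                  bool from_io, unsigned msg_type)
    {
        return (addr_out_of_range << 10) | (reserved_buf << 9) | (same_cell << 8) |
               ((addr_space & 0x7) << 5) | (from_io << 4) | ((msg_type >> 4) & 0xF);
    }

    /* Search the selected set; at most one entry is allowed to match. */
    const jt_entry_t *jt_lookup(unsigned index, unsigned comparison)
    {
        for (int way = 0; way < JT_WAYS; way++) {
            const jt_entry_t *e = &jump_table[index][way];
            if ((comparison & e->mask) == e->match)
                return e;      /* supplies start_pc and speculative-op control */
        }
        return NULL;           /* jump table miss: dispatch the miss PC instead */
    }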

4.6.1.3 Special PC Values In normal operation, all message headers — with the exception of those read from the software queue — are passed through the jump table, and the matching jump table entry is used to control the remainder of the inbox’s operation for the message. However, several events can alter this flow. First, the jump table may not contain a matching entry. Because the jump table is of limited size, only those entries corresponding to the most common message types can be stored. Receipt of a less common message type sometimes will not locate a matching entry in the jump table, a situation referred to as a jump table miss. Sec- ond, for messages arriving from the network via the NI, the incoming message header may have been corrupted in transit. The NI flags erroneous message headers by setting a header error bit. Similarly, the NI flags messages that arrive with invalid protocol ID bits in the message type field (see Figure 4.4) by setting a bad protocol ID bit. A separate PC value is dispatched when any of these situations occurs, allowing the PP to process the exceptional condition appropriately.

74 Chapter 4: MAGIC Architecture and Implementation 4.6.2 Protocol Processor (PP) The protocol processor (PP) is the heart of the control macropipeline and of the MAGIC chip as a whole. Indeed, the main purpose of all other logic in the chip is to sup- port the PP’s interface to external components and to improve PP protocol processing per- formance. The PP provides a programmable mechanism for specifying the actions that should be taken by the memory system in response to incoming messages. Toward this end, the PP is a fully-programmable, dual-issue superscalar microprocessor that accesses instructions and data in the standard physical address space of the FLASH node where it resides. However, due to its intended purpose, the PP departs from existing microproces- sor architectures in a number of areas. Following a brief instruction set summary, this section focuses on the functionality that efficiently couples the PP to the surrounding MAGIC hardware units and improves overall protocol processing performance. The design of the PP logic was the work of Jules Bergmann.

4.6.2.1 PP Instruction Set Architecture Summary The PP instruction set architecture is a 64-bit load/store architecture loosely based on the MIPS instruction set, with changes incorporated to support the PP’s primary function of handling messages efficiently. Like the MIPS ISA, all instructions are 32 bits in length, there are 32 general-purpose registers (GPRs), and byte addresses are used. However, the PP provides no address translation mechanism; all addresses generated by the PP are physical addresses. In addition, the PP does not support interrupts or restartable excep- tions. Finally, though the PP is dual-issue, it is entirely statically scheduled and various scheduling restrictions must be observed by the programmer and/or compiler system. The PP instruction set includes a typical complement of ALU, ALU immediate, shift, branch/jump, and load/store instructions. In addition, it includes support for setting and testing single bits, as well as manipulating bit fields and bit vectors, all of which occur fre- quently in the data structures used by handler code. Several special instructions support the interface of the PP to the other MAGIC hardware units; these instructions are described in the following subsections. These additions to the instruction set are an impor- tant part of providing close coupling between the PP and the rest of the MAGIC chip.

4.6.2.2 Interface to the Inbox The primary function of the inbox is to prepare messages for processing by the PP. When the PP has finished processing a message, it fetches the next message from the

Chapter 4: MAGIC Architecture and Implementation 75 inbox by issuing two special instructions in sequence: a switch instruction followed by a ldctxt (load context) instruction. The switch instruction transfers execution to the appropri- ate handler for the message by loading the PP program counter with a value supplied by the inbox. The information needed by the handler code to process the message is con- tained in a header supplied by the inbox that consists of two doublewords. In addition to loading the program counter with a new value, the switch instruction also loads the first doubleword of the header into one of the PP’s GPRs. The ldctxt instruction loads the sec- ond doubleword of the header into a GPR. If the PP executes a switch instruction when no further messages are available to process, the inbox stalls the PP until a message arrives. The inbox allows the PP to continue execution once the inbox is prepared to complete the switch successfully. In certain cases, the PP may want to suspend its processing of a message temporarily and resume the processing at a later time. The inbox supports this capability through the software queue registers as described in Section 4.6.1.1. This support allows the PP to cre- ate and manage a separate inbox queue in memory that is scheduled along with the other hardware queues in the inbox. When the PP wishes to suspend processing of a message, it saves any necessary state information in the software queue registers and memory and indicates to the inbox that the software queue is valid.
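Because switch and ldctxt are PP instructions rather than library routines, any C rendering is necessarily schematic. The sketch below models the PP's top-level dispatch loop, treating the two instructions as hypothetical intrinsics that return the values the inbox supplies; representing the starting PC as a function pointer is likewise an artifact of the model.

    #include <stdint.h>

    typedef struct {
        uint64_t header0;   /* first doubleword of the MAGIC message header */
        uint64_t header1;   /* second doubleword */
    } magic_header_t;

    /* Hypothetical intrinsics standing in for the PP's switch and ldctxt
     * instructions.  pp_switch stalls until the inbox has a message ready, then
     * returns the first header doubleword and the handler starting PC; pp_ldctxt
     * returns the second header doubleword. */
    typedef void (*handler_pc_t)(magic_header_t);
    extern handler_pc_t pp_switch(uint64_t *header0_out);
    extern uint64_t     pp_ldctxt(void);

    /* Top-level PP dispatch loop: fetch the next preprocessed message from the
     * inbox, jump to its handler, repeat.  In the real PP the "jump" is the PC
     * injection performed by the switch instruction itself. */
    void pp_main_loop(void)
    {
        for (;;) {
            magic_header_t h;
            handler_pc_t handler = pp_switch(&h.header0); /* doubleword 0 + start PC */
            h.header1 = pp_ldctxt();                      /* doubleword 1 */
            handler(h);                                   /* run the protocol handler */
        }
    }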

4.6.2.3 Interface to the Outbox The purpose of the outbox is to transfer outgoing messages generated by the PP to their appropriate destination interface unit (i.e., the PI, NI, or IO). The PP transfers a mes- sage to the outbox using the send instruction. The send instruction encoding specifies two source GPRs that contain the two message header doublewords; these doublewords are transferred to the outbox. The encoding also specifies five message send type bits that notify the outbox of the message characteristics, such as whether the message contains data in addition to the header.
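A handler's outgoing send can be modeled the same way. The send type bit and helper shown below are illustrative; in real handler code the two header doublewords sit in GPRs and the message send type bits are encoded directly in the send instruction.

    #include <stdint.h>

    /* Illustrative message send type bits (5 bits in the real send encoding). */
    #define SEND_HAS_DATA   0x01   /* message carries a data buffer */
    /* Remaining variation bits are implementation specific and omitted here. */

    /* Hypothetical intrinsic standing in for the PP send instruction: the two
     * header doublewords come from two GPRs and the type bits from the opcode. */
    extern void pp_send(uint64_t header0, uint64_t header1, unsigned send_type);

    /* Example: forward a reply whose header has already been composed in
     * header0/header1 and whose data (if any) is already staged in the data
     * buffer named by the header. */
    void send_reply(uint64_t header0, uint64_t header1, int has_data)
    {
        pp_send(header0, header1, has_data ? SEND_HAS_DATA : 0);
        /* The outbox picks the destination interface (PI, NI, or IO) from the
         * header, reformats it, and handles flow control transparently. */
    }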

4.6.2.4 Interface to the Memory Controller

Along with the inbox, the PP also can initiate memory operations. Two sources of PP-initiated memory operations are the PP instruction and data caches, which require memory operations to service a cache miss. A third source of memory operations is PP handlers. In general, the speculative memory initiation capability in the inbox is designed to alleviate the need for the PP to perform its own memory operations, both to reduce PP occupancy and to improve the latency to service a message. In certain circumstances, however, a PP handler will need to initiate memory operations itself. Two instructions are provided for this purpose: lblock and sblock. The lblock instruction initiates a memory read from a specified address into a specified data buffer; sblock does the reverse, initiating a memory write from a specified data buffer to a specified address.
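In C form, with the two instructions again modeled as hypothetical intrinsics that take a physical address and a data buffer number:

    #include <stdint.h>

    /* Hypothetical intrinsics for the PP's block memory operations. */
    extern void pp_lblock(uint64_t phys_addr, unsigned buffer_num); /* memory -> buffer */
    extern void pp_sblock(uint64_t phys_addr, unsigned buffer_num); /* buffer -> memory */

    /* Example: a handler that did not get a speculative read from the inbox
     * fetches the line itself, then later writes a (possibly modified) line back. */
    void fetch_line_into_buffer(uint64_t line_addr, unsigned buf)
    {
        pp_lblock(line_addr, buf);    /* initiate read of the line into data buffer buf */
    }

    void write_buffer_to_memory(uint64_t line_addr, unsigned buf)
    {
        pp_sblock(line_addr, buf);    /* initiate write of data buffer buf to memory */
    }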

4.6.2.5 Interface to MAGIC State: The Misc Bus To maximize flexibility, testability, and performance monitoring capabilities, the PP must be able to access as much state within MAGIC as possible. With a few exceptions, state access is accomplished through the ls (load state) and ss (store state) instructions. Each piece of accessible state in MAGIC is assigned a state address; this address is speci- fied by the ls and ss instructions much as a memory address is specified by a regular load/store instruction. These instructions transfer state addresses and data over the misc bus, an internal bus that runs to each functional unit within MAGIC for this purpose. By issuing ls and ss instructions, the PP can program the internal registers in various func- tional units, probe internal functional unit state for debugging purposes, and perform tasks such as issuing DBA commands and modifying the software queue registers. Table 4.10 summarizes some of the information that is accessible and operations that can be per- formed via the misc bus.

Table 4.10 Misc bus functionality

Unit           Example Misc Bus Operations
NI             Modify internal registers; read input and output queues
PI             Modify internal registers; read input queue; read PI reply register
Inbox          Modify jump table, software queue registers
Outbox         Modify internal registers
DBA            Issue DBA commands; modify internal registers
Data buffers   Read/write data buffers; modify data buffer status bits
MC             Modify internal registers
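As an example of misc bus use, the sketch below shows how PP code might read a hardware counter and update a software queue register through ls and ss, modeled as hypothetical intrinsics over an assumed state-address map.

    #include <stdint.h>

    /* Hypothetical state addresses; the real map assigns one address per piece
     * of accessible MAGIC state. */
    #define STATE_NI_LANE0_MSGS_RCVD  0x0100   /* an NI receive counter */
    #define STATE_SWQ_HEADER0         0x0200   /* software queue header, dword 0 */

    /* Intrinsics standing in for the ls (load state) and ss (store state)
     * instructions, which move state addresses and data over the misc bus. */
    extern uint64_t pp_ls(unsigned state_addr);
    extern void     pp_ss(unsigned state_addr, uint64_t value);

    uint64_t read_ni_lane0_count(void)
    {
        return pp_ls(STATE_NI_LANE0_MSGS_RCVD);    /* read a hardware counter */
    }

    void stash_suspended_header(uint64_t header0)
    {
        pp_ss(STATE_SWQ_HEADER0, header0);         /* update a software queue register */
    }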

4.6.2.6 MAGIC Instruction and Data Cache Operation All PP instruction fetches go through the MAGIC instruction cache (MIC); all PP loads and stores go through the MAGIC data cache (MDC). Since the PP has no notion of virtual memory, both caches are physically indexed, physically tagged, and backed directly by the node’s local memory. The MIC is 16 KB in size and is implemented on the MAGIC die. It supports single-cycle access. The MDC is off-chip and has a total size of 1 MB. Because of the additional delay of an off-chip operation, MDC accesses require two cycles. As a result, load instructions incur two delay slots. Both caches are direct-

Chapter 4: MAGIC Architecture and Implementation 77 mapped and use 128-byte lines to maximize compatibility with the MAGIC hardware mechanisms in place to service the 128-byte transfers required for the R10000. The PP stalls whenever an instruction fetch misses in the MIC or when a load or store misses in the MDC. A hardware sequencer implements cache fills by controlling a direct connection between the PP caches and the memory controller, allowing cache data to be transferred between the caches and the memory system without the need for intermediate storage in a data buffer. To minimize MDC cache miss service latency, the sequencer uses a “fill before spill” policy to service MDC misses and the PP supports critical-word restart, meaning that it resumes operation immediately after the needed doubleword of data has returned from memory rather than waiting for the entire cache line to be delivered.
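For concreteness, a direct-mapped, physically indexed cache of these dimensions decomposes a physical address as sketched below; the helper names and the use of byte arithmetic are illustrative only.

    #include <stdint.h>

    #define LINE_BYTES   128
    #define MIC_BYTES    (16 * 1024)       /* on-chip instruction cache */
    #define MDC_BYTES    (1024 * 1024)     /* off-chip data cache */

    #define MIC_LINES    (MIC_BYTES / LINE_BYTES)   /* 128 lines  */
    #define MDC_LINES    (MDC_BYTES / LINE_BYTES)   /* 8192 lines */

    /* Direct-mapped lookup: offset within line, line index, and tag. */
    static inline unsigned line_offset(uint64_t pa) { return pa & (LINE_BYTES - 1); }
    static inline unsigned mic_index(uint64_t pa)   { return (pa / LINE_BYTES) % MIC_LINES; }
    static inline unsigned mdc_index(uint64_t pa)   { return (pa / LINE_BYTES) % MDC_LINES; }
    static inline uint64_t mic_tag(uint64_t pa)     { return pa / (LINE_BYTES * MIC_LINES); }
    static inline uint64_t mdc_tag(uint64_t pa)     { return pa / (LINE_BYTES * MDC_LINES); }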

4.6.3 Outbox

The outbox performs outgoing message sends on behalf of the PP. It provides a high-level interface to the PI, NI, and IO units and allows the PP to send a message before the data associated with the message is ready, thus increasing the efficiency of the MAGIC macropipeline.

After constructing the outgoing MAGIC message header in its general-purpose registers, the PP executes a send instruction. The send instruction specifies two registers that contain the two doublewords of the outgoing message header. The contents of these registers pass to the outbox when the send is executed. In addition to these registers, several bits of the send opcode itself are also passed to the outbox. These bits are called the message send type bits and are used to specify the exact variation of send that the PP wishes to perform. Regardless of the specific send variation, the outbox determines the proper destination interface unit for the outgoing message, reformats the outgoing message header accordingly, and then passes the reformatted header to the destination interface unit for further processing.

In addition to handling the mechanics of outgoing sends, the outbox also provides flow control. If the destination interface unit is busy at the time the PP issues the send instruction, the outbox will accept the send, allowing the PP to continue processing, and then wait until the destination interface unit can accept the new outgoing message header. The outbox can process only one send at a time. If the outbox accepts a send but cannot immediately forward it to the destination interface unit, it will notify the PP that it is busy. Whenever the outbox is busy, the PP will stall if it attempts to issue another send. Once the outbox completes the original send, the outbox will accept the new send from the PP, allowing the PP to resume execution. This flow control between the PP and outbox is transparent to the PP programmer. The PP performs all instruction decoding and notifies the outbox when a send instruction has been issued. This decoupling isolates the outbox from the internal details of the PP implementation, particularly the effect of PP stalls caused by cache misses and the details of register forwarding.

4.7 Additional Issues

The primary goal of the MAGIC architecture is to provide efficient support for multi- ple protocols. The descriptions of the MAGIC functional units above illustrate how this architectural goal is implemented in a way that balances performance and flexibility. Three additional issues, however, warrant discussion: double-buffer operations, performance monitoring, and error handling. First, MAGIC is designed to support both cache-coherent shared memory and message passing. As described above, the data transfer logic and control macropipeline support both communication paradigms using the same underlying hardware. The performance of message passing — in particular, the performance of bulk data transfer — can be increased significantly by the addition of some hardware features customized to this task. Section 4.7.1 explains the support for these so-called double-buffer operations and notes how they are used to improve block transfer performance. Second, MAGIC includes efficient, flexible performance monitoring facilities. Perfor- mance monitoring support spans several MAGIC functional units and is presented in a unified form in Section 4.7.2. Third, one goal of FLASH is to support large-scale shared memory systems in a realiz- able fashion. Although the operating system and other software needs to provide much of this support, a hardware foundation that provides comprehensive error detection and fault containment also is needed. Section 4.7.3 addresses this topic.

4.7.1 Accelerating Unaligned Block Transfers: Double-Buffer Operations In normal operation, a functional unit handles data transfers by reading or writing only a single data buffer, beginning at doubleword zero in the buffer and continuing, for a cache line transfer, to doubleword 15. To make unaligned block transfers more efficient, how- ever, the memory controller supports an additional type of data transfer operation — a double-buffer operation — that makes use of two data buffers. When a double-buffer operation is requested from the MC, the memory operation is performed using the two data buffers specified in the primary buffer number and wrap buffer number fields of the message header (see Figure 4.4). The MC supports only double-buffer read operations. That is, a double-buffer operation reads data from the node’s memory and transfers the

Chapter 4: MAGIC Architecture and Implementation 79 data into the two data buffers. The MC does not support a double-buffer write operation in which the data to be written to memory is retrieved from the two data buffers. For single-buffer operations, bits 6:3 of the address supplied to the MC specify the starting doubleword offset that will be supplied to the memory system. The memory then returns the data beginning at this doubleword offset and the data is written into the single data buffer beginning at offset zero within the buffer and continuing sequentially through offset 15. A double-buffer operation uses the address field differently. First, double-buffer opera- tions make use of a word-aligned address rather than a doubleword-aligned address as is required for single-buffer operations. Second, in a double-buffer operation, bits 6:2 of the address supplied to the MC specify the starting word offset within the primary data buffer. These bits are masked to zero when supplying the address to the memory system. Thus, the memory system is always supplied with a cache line-aligned address. When the mem- ory returns the data, the data is written into the primary buffer beginning at the word offset contained in bits 6:2 of the byte address and continuing sequentially through word offset 31 within the buffer. When word offset 31 of the data buffer is reached, subsequent data writes store the data into the wrap buffer, beginning at word offset zero and continuing sequentially thereafter. Because this description of double-buffer operations may be confusing, Figure 4.13 shows an example double-buffer read from byte address 0x2FC8, which is doubleword- aligned. The contents of doublewords marked “X” within the data buffer are unaltered. Double-buffer operations are not required for normal cache coherence processing. Rather, double-buffer operations are designed to provide efficient support for unaligned block transfers. In the absence of double-buffer operations, unaligned block transfers require the PP at the source or destination to align the incoming data manually using a series of read-modify-write (RMW) operations. With double-buffer support, the MC at the source effectively prealigns the message data before sending it to the destination, thereby dispensing with most of the read-modify-write operations. Consider a block transfer from physical byte addresses 0x2FC8 through 0x31E7 at the source to byte addresses 0x0010 through 0x022F at the destination. The block is 0x220

(544 decimal) bytes long and spans five cache lines. Figure 4.14 depicts the block data layout in the source and destination memories (each address range in the figure represents all or part of a 128-byte cache line). To execute this block transfer, the source first allocates a series of data buffers and then performs the operations shown in Figure 4.15.

Figure 4.13 Example double-buffer operation: a double-buffer memory read from byte address 0x2FC8. The memory supplies the cache-line-aligned block starting at 0x2F80; doublewords at byte addresses 0x2F80–0x2FB0 land in the primary data buffer at doubleword offsets 9–15, and doublewords at 0x2FB8–0x2FF8 wrap into the wrap data buffer at doubleword offsets 0–8.
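The fill pattern of a double-buffer read can be expressed in C as shown below. The code models the MC's behavior at word granularity using assumed types and a hypothetical memory_read_line() hook; the real datapath of course moves the data in wider pieces.

    #include <stdint.h>

    #define WORDS_PER_LINE 32     /* 128-byte line, 4-byte words */

    typedef struct { uint32_t w[WORDS_PER_LINE]; } data_buffer_t;

    /* Hypothetical hook: read the cache-line-aligned block containing addr. */
    extern void memory_read_line(uint64_t line_addr, uint32_t line[WORDS_PER_LINE]);

    /* Double-buffer read: fill the primary buffer starting at the word offset
     * given by address bits 6:2, wrapping the remainder into the wrap buffer
     * beginning at word offset zero. */
    void double_buffer_read(uint64_t byte_addr,
                            data_buffer_t *primary, data_buffer_t *wrap)
    {
        uint32_t line[WORDS_PER_LINE];
        unsigned start     = (byte_addr >> 2) & 0x1F;         /* bits 6:2 */
        uint64_t line_addr = byte_addr & ~(uint64_t)0x7F;     /* bits 6:2 masked to zero */

        memory_read_line(line_addr, line);                    /* line-aligned memory read */

        for (unsigned i = 0; i < WORDS_PER_LINE; i++) {
            unsigned dst = start + i;
            if (dst < WORDS_PER_LINE)
                primary->w[dst] = line[i];                    /* fill up to word offset 31 */
            else
                wrap->w[dst - WORDS_PER_LINE] = line[i];      /* wrap from word offset 0   */
        }
    }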

Figure 4.14 Memory layout of example unaligned block transfer. Source memory: 0x2FC8–0x2FFF, 0x3000–0x307F, 0x3080–0x30FF, 0x3100–0x317F, and 0x3180–0x31E7. Destination memory: 0x0010–0x007F, 0x0080–0x00FF, 0x0100–0x017F, 0x0180–0x01FF, and 0x0200–0x022F. Each range is all or part of a 128-byte cache line.

Figure 4.15 shows the source’s double-buffer operations along with the primary buffer number and the addresses contained in the primary and wrap data buffers after each dou- ble-buffer operation completes. The figure also lists the message sends and the action the destination PP performs when it receives each message. This example assumes that the source PP has allocated five data buffers (B0 through B4) for use in the message transfer. Although read-modify-write operations still are required at the destination for the first and last messages (buffers B0 and B4), the double-buffer operations eliminate the need for

1) Source: read from 0x2FC8; primary buffer is B1. Afterward B1 holds 0x2F80–0x2FB7 and B0 holds 0x2FB8–0x2FFF. No message is sent; the destination receives nothing.
2) Source: read from 0x3048; primary buffer is B0. Afterward B0 holds 0x2FB8–0x2FFF and 0x3000–0x3037; B1 holds 0x3038–0x307F and 0x2F80–0x2FB7. Send buffer B0. Destination: receive buffer B0; write to 0x0000 (a RMW for addresses 0x0000–0x000F is required).
3) Source: read from 0x30C8; primary buffer is B1. Afterward B1 holds 0x3038–0x307F and 0x3080–0x30B7; B2 holds 0x30B8–0x30FF. Send buffer B1. Destination: receive buffer B1; write to 0x0080 (no RMW is required).
4) Source: read from 0x3148; primary buffer is B2. Afterward B2 holds 0x30B8–0x30FF and 0x3100–0x3137; B3 holds 0x3138–0x317F. Send buffer B2. Destination: receive buffer B2; write to 0x0100 (no RMW is required).
5) Source: read from 0x31C8; primary buffer is B3. Afterward B3 holds 0x3138–0x317F and 0x3180–0x31B7; B4 holds 0x31B8–0x31FF. Send buffer B3, then send buffer B4. Destination: receive buffer B3; write to 0x0180 (no RMW is required); receive buffer B4; write to 0x0200 (a RMW for addresses 0x0230–0x027F is required).

Figure 4.15 Unaligned block transfer double-buffer operations and message sends

any other read-modify-writes. For long block transfers, the additional latency of the first and last RMW operations becomes less important and the ability of the double-buffer operations to prealign the data at the source side yields a significant performance advantage. Of course, achieving the overall goal of data prealignment requires the source to know the destination target addresses before the block transfer begins. Whether the extra complexity and latency needed to communicate the destination alignment information to the source yields better performance depends on the total block length, PP processing overhead, and other factors. Further discussion of these tradeoffs is found in [41], which describes in detail the implementation of block transfer in FLASH.
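The source side of the transfer in Figure 4.15 can be sketched as the loop below, using a hypothetical helper that issues a double-buffer read into two numbered data buffers and another that ships a buffer to a destination line. The sketch assumes, as in the figure, that the source and destination spans cover the same number of cache lines and that the source and destination addresses differ by a whole number of words.

    #include <stdint.h>

    #define LINE_BYTES 128

    /* Hypothetical hooks: a double-buffer read whose MC address encodes both the
     * memory line (upper bits) and the starting buffer word offset (bits 6:2),
     * and a send that ships one data buffer to a destination-aligned line. */
    extern void mc_double_buffer_read(uint64_t mc_addr,
                                      unsigned primary_buf, unsigned wrap_buf);
    extern void send_buffer_to_dest(unsigned buf, uint64_t dest_line_addr);

    /* Source side of an unaligned block transfer; bufs[0..] play the roles of
     * B0, B1, ... in Figure 4.15. */
    void block_transfer_source(uint64_t src, uint64_t dst, uint64_t len,
                               const unsigned *bufs)
    {
        uint64_t first_line = src & ~(uint64_t)(LINE_BYTES - 1);
        uint64_t last_line  = (src + len - 1) & ~(uint64_t)(LINE_BYTES - 1);
        unsigned nlines     = (unsigned)((last_line - first_line) / LINE_BYTES) + 1;

        /* Buffer offset at which each new source line boundary must land so that
         * every buffer ends up holding one destination-aligned line. */
        uint64_t split    = (dst - src) & (uint64_t)(LINE_BYTES - 1);
        uint64_t dst_line = dst & ~(uint64_t)(LINE_BYTES - 1);

        for (unsigned k = 0; k < nlines; k++) {
            unsigned wrap    = bufs[k];
            unsigned primary = (k == 0) ? bufs[1] : bufs[k - 1];

            mc_double_buffer_read((first_line + (uint64_t)k * LINE_BYTES) | split,
                                  primary, wrap);

            if (k > 0)   /* primary (the previous wrap) now holds one full line */
                send_buffer_to_dest(primary,
                                    dst_line + (uint64_t)(k - 1) * LINE_BYTES);
        }
        /* The final wrap buffer holds the tail of the block (B4 in the figure). */
        send_buffer_to_dest(bufs[nlines - 1],
                            dst_line + (uint64_t)(nlines - 1) * LINE_BYTES);
    }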

4.7.2 Performance Monitoring MAGIC supports performance data collection in two ways: hardware counters and sample-based handler invocation in the inbox. By using a combination of the two mecha- nisms, code executing on the protocol processor can perform performance analysis at varying levels of detail, processing overhead, and performance impact.

4.7.2.1 Hardware Counters Almost every functional unit includes several hardware counters that can be read and written via the misc bus. These counters track activity of interest in the particular func- tional unit. The NI, for example, contains counters to track how many messages were sent and received on each of the four lanes. The DBA tracks how many buffers were allocated, how many reserved buffers were allocated, how many allocation attempts failed, and so on. Even the PP itself has hardware counters to count cache misses and similar informa- tion. The idea behind these counters is to provide a means to collect data that otherwise would be impossible to collect, or able to be collected only at a high performance over- head, using just the protocol processor itself. Although the functional unit hardware counters can be read and written from the PP via the misc bus, the counters operate autonomously, collecting performance data as a nor- mal part of each functional unit’s operation. As such, the hardware counters have no impact on the performance or operation of the functional units in which they reside or on the protocol processor. To use the data in these counters, the PP must read the appropriate counter values using misc bus operations (load state instructions). Of course, reading the counters requires PP processing time and will have an impact on its performance, but the PP need not read the counters unless performance data collection is desired.

Chapter 4: MAGIC Architecture and Implementation 83 4.7.2.2 Sample-Based Handler Invocation Just as the hardware counters allow collection of performance data that would be diffi- cult or impossible to collect using the PP, certain other performance data can be collected only by the PP. This data tends to be protocol-specific, with the events of interest identifi- able only in the course of executing a protocol handler. An example in a cache coherence protocol is the directory state of the cache line being requested. Only the protocol code has knowledge of the directory format, state encodings, and other information necessary to allow statistics to be collected on the number of times the line was clean, shared, dirty remote, etc. The basic problem with this type of data is that the protocol processor itself must col- lect it. Unfortunately, using the PP to perform performance data collection necessarily incurs protocol processing overhead since the PP must execute performance data collec- tion code in addition to the normal protocol handler. To maximize the flexibility and utility of PP performance data collection, then, both the frequency and the level of detail — and hence the overhead — of performance data collection on the PP must be controlled. As a result, MAGIC requires both a mechanism to decide when to collect performance data as well as a means to actually collect the data. The MAGIC design adopts a sampling-based approach to decide when to collect per- formance data. In this approach, MAGIC collects performance data only once out of every N (user-specifiable) invocations of each handler. The premise of this technique is that over a long execution time, the behavior of a particular handler can be characterized accurately by collecting performance data during only a fraction of its invocations. This assumption is essentially the same as that underlying PC-based periodic sampling for application profiling, except that the sampling is applied on a per-handler basis rather than uniformly across all handlers. By varying N, the number of handler invocations between sampling, a tradeoff can be made between sampling frequency and the corresponding per- formance data collection overhead. One implementation option for the sampling-based approach is to have the PP perform both the sampling decision and the data collection. Every handler that might want to col- lect performance data would be augmented with the necessary tests to decide whether to collect the data on the current invocation and, if the tests succeeded, the necessary data collection code would be executed along with the regular handler code. The advantages of this approach are that it requires no additional hardware and it allows for arbitrary sam- pling policies. The major drawback is that having the PP perform the sampling test on every handler is a significant overhead in itself. Many handlers are short, making the addi-

tional overhead of the sampling test unacceptable. For this reason, the MAGIC design offloads the sampling decision logic into the jump table in the inbox, thereby relieving the PP of this task. Actual performance data collection is still performed in the PP since, as noted above, this information tends to be protocol-specific.

4.7.2.3 Jump Table Support for Sampling Each jump table entry, as noted in Section 4.6.1.2, nominally contains a speculative memory operation indication, a mask and match field, and a starting PC value. To support sampling-based handler invocation, each entry also includes an additional starting PC value as well as a counter. Table 4.11 lists the complete contents of a jump table entry.

Table 4.11 Jump table entry contents

Field              Width   Description
Speculative flag   1       If set, a speculative memory operation will be initiated
Speculative R/W    1       Indicates whether the speculative memory operation is a read or a write
Mask               11      Mask field
Match              11      Match field
Sampling counter   16      Counter for sample-based handler invocation
Non-sampled PC     27      Location of start of normal handler
Sampled PC         27      Location of start of performance data collection handler

Whenever the inbox finds a matching jump table entry, it inspects the current value of the counter in the matching entry. If the value of the counter is non-zero, the non-sampled PC is supplied to the PP, the counter value is decremented by one, and the updated counter value is stored in the jump table. If, however, the counter value is zero, then the inbox will supply the sampled PC value to the PP. Because each jump table entry typically maps to a unique handler, and because the PP code itself controls how the counter values in the jump table are initialized, the net effect is to provide the per-handler, sampling-based perfor- mance data collection mechanism outlined above. The only change required to the proto- col code is to write two versions of each handler that will participate in the performance data collection process: one version for the normal, non-sampling case and another ver- sion to perform performance data collection in the sampling case. On the PP, the handler at the non-sampled PC (the non-sampled handler) simply con- tains the normal protocol code for the corresponding incoming message. The handler at the sampled PC (the sampled handler), however, must provide the code for both perfor- mance data collection and the normal protocol processing code. To avoid duplicating code, the sampled handler normally contains just performance data collection code and then, as

its last action, jumps to the non-sampled handler. In addition to collecting performance data, the sampled handler also must reset the corresponding jump table entry counter to the desired initial value. In this way, not only is sampling performed on a per-handler basis but, by initializing the jump table entry counters appropriately, each handler may have a different sampling frequency. If no performance data collection is desired, the sampled PC and non-sampled PC fields in the jump table entry are both initialized to point to the non-sampled handler.
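The per-entry sampling decision in the inbox, and the shape of a sampled handler on the PP, amount to the following C sketch; the entry layout, function names, and the particular counter value are assumptions.

    #include <stdint.h>

    typedef struct {
        uint16_t sampling_counter;   /* reset by the sampled handler */
        uint32_t non_sampled_pc;     /* normal protocol handler */
        uint32_t sampled_pc;         /* performance data collection handler */
        /* mask, match, and speculative-operation fields omitted for brevity */
    } jt_entry_t;

    /* Inbox side: choose which starting PC to hand to the PP. */
    uint32_t dispatch_pc(jt_entry_t *e)
    {
        if (e->sampling_counter != 0) {
            e->sampling_counter--;          /* decrement and take the fast path */
            return e->non_sampled_pc;
        }
        return e->sampled_pc;               /* counter hit zero: sample this invocation */
    }

    /* PP side: shape of a sampled handler.  collect_stats() and the jump-table
     * update over the misc bus are illustrative placeholders. */
    extern void collect_stats(void);
    extern void reset_sampling_counter(unsigned jt_entry_addr, uint16_t interval);
    extern void normal_handler(void);

    void sampled_handler(void)
    {
        collect_stats();                          /* protocol-specific statistics */
        reset_sampling_counter(/*jt_entry_addr=*/0x40, /*interval=*/1000);
        normal_handler();                         /* fall through to normal protocol code */
    }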

4.7.3 Error Handling To provide robust operation, FLASH includes data integrity and fault containment fea- tures. Data integrity relies on error correcting code (ECC) and cyclic redundancy code (CRC) checksum techniques to verify that data remains valid as it is passed between major components. This hardware support is included to protect the integrity of communication between each of the primary system components: the processor, caches, main memory, and network. Once MAGIC has detected an error in any of these components, it propa- gates the error condition along with the corresponding data as the data is transferred between memory, the network, and MAGIC’s internal data buffers. A processor therefore will discover an error when it eventually attempts to access the data across its bus. While much of the responsibility for fault containment rests with the operating system, MAGIC must provide hooks that allow certain permission checks to be performed effi- ciently. The basic unit of fault containment in FLASH is called a cell, which is a relatively small cluster of processing nodes (typically 32 or fewer). The basic philosophy is that any fault occurring strictly within a cell should be contained within that cell. To contain faults, the fault containment mechanisms must consider all cases in which a cell is affected by a remote cell, including messages both received from and sent to the remote cell. To handle messages received from remote cells, the jump table includes support in hardware that allows different message handler code to be executed on the PP depending on whether the received message originated within the cell or from a remote cell (namely, the “cell” bit in the jump table comparison field; see Section 4.6.1.2). To handle messages sent to remote cells, MAGIC’s processor and I/O interfaces include support to allow the PP to easily implement timeouts on all outstanding requests. Although MAGIC contains several hardware features to aid in error handling and fault containment, the major challenge is not detecting faults, but recovering from them without crashing the entire machine. Indeed, the susceptibility of an entire cache-coherent shared memory system to crash when a minor error occurs traditionally has been a significant impediment to the construction of large-scale, distributed cache-coherent systems. The

hardware features in FLASH provide a foundation on which a comprehensive fault detection, containment, and recovery mechanism can be built. A detailed discussion of the use of these features is beyond the scope of this dissertation. Instead, consult Rosenblum [87], which describes fault containment and recovery in distributed shared memory machines, with a focus on how this support is provided in FLASH.

4.8 MAGIC Implementation

From the outset, the goal of the FLASH design was to construct a prototype system. With the exception of the MAGIC chip itself, the other components on the FLASH node as well as the interconnection network were leveraged from others. Consequently, the pri- mary focus of the prototype implementation effort has been the design and implementa- tion of the MAGIC chip. This section discusses the implementation of MAGIC as well as some of the challenges encountered.

4.8.1 Implementation Technology A major goal of the FLASH project, as described in Section 3.1.1, is to achieve perfor- mance competitive with a hardwired system. This goal demands a high-performance, sin- gle-chip implementation of the MAGIC design; overall system performance would suffer greatly otherwise. In addition, the complexity of the MAGIC design, as illustrated in the descriptions of the functional units in the sections above, adds to the implementation diffi- culty. As a result, deciding how to implement MAGIC was a challenge in itself. Initially the MAGIC chip was to be implemented by the Supercomputer Systems Divi- sion of the Intel Corporation using a combination full-custom and standard cell methodol- ogy. Most of the control logic would have been synthesized and mapped into Intel’s standard cell library. Datapaths, critical control structures, and certain memories, such as the jump table and data buffers, would have been implemented using a combination of standard cell and full-custom design techniques with hand layout. Intel engineers would have been responsible for almost all of the physical design effort, relieving that burden from the small design team at Stanford. Unfortunately, for reasons that were largely managerial in nature rather than technical, collaboration between Intel and Stanford ended just as the MAGIC physical design effort at Intel was starting. An alternate means to fabricate the MAGIC chip was needed, and an ASIC implementation was desirable since it still would permit a single-chip, high-perfor- mance implementation.

LSI Logic's LCB500K standard cell process was selected [70]. This is a 3.3V process with 0.5µ drawn feature size (0.4µ Leff) and three metal layers. At nominal operating conditions, a two-input NAND gate with a fanout of four has a propagation delay of approximately 170ps. Given these characteristics, the same clock rate as the Intel-fabricated design targeted — 100 MHz — seemed achievable in the ASIC process as well.

4.8.2 Implementation Flow Design work began with the development of a Verilog RTL model of MAGIC. Each functional unit was partitioned into datapath and control logic. The control logic sections were synthesized using Synopsys and mapped to the LSI standard cell library. In contrast, datapaths were largely hand-instantiated. That is, cells from the LSI library were instanti- ated explicitly in the datapath RTL and very little or no synthesis was performed in map- ping the datapath RTL into LSI primitives. Initial timing analysis from Synopsys was used at this stage to improve the timing of critical paths in both the datapaths and control logic, usually by recoding the RTL or, in some cases, by hand-instantiating gates in the control logic. Overall, approximately 10% of the control logic consists of hand-instantiated gates. With the exception of the data buffers, all memories were generated using LSI’s mem- ory compiler. None of the existing LSI memories provided the needed multiporting for implementing the data buffer memory array. As a result, the data buffer array was designed at Stanford as a full-custom cell and later integrated into the chip layout at LSI. Once all synthesis was complete, scan chains were inserted and the entire gate-level netlist of the MAGIC chip was transferred into the LSI CMDE system, LSI’s proprietary ASIC physical design environment. Several major tasks were performed in CMDE. First was development of a full-chip floorplan that guided placement and aspect ratios for the various functional units. Next was hand-layout of several timing- or area-critical datapaths and control logic blocks. LSI’s datapath place and layout (DPPL) tool and routers were used for this purpose. The remaining cells in the design were placed using the normal stan- dard cell placement tools and then merged with the DPPL-generated blocks. At this point, the layout was ready for routing. The regular suite of LSI routers suc- ceeded in routing all but approximately 200 of the more than 85000 nets in the design. Manual routing of the remaining open nets completed the process. The fully-routed design then was sent to LSI for fabrication and also used to extract back-annotated delay informa- tion for post-routing timing analysis. LSI merged the memories into the layout, performed layout-versus-schematic checks, and completed other backend processing tasks.

The resulting MAGIC die is 16mm x 16mm and contains 250K actual gates in addition to 200Kbits of on-chip memory, or about 750K two-input gate equivalents in total. Surrounding the core is the pad ring, containing 700 pads of which 451 are signals and 249 are power and ground. An HTTBGA cavity-down package, manufactured by IBM, is used to package the die.

4.8.3 Implementation Challenges The major challenges in completing the MAGIC logical and physical design were tim- ing and design verification. Achieving the cycle time target is challenging in most high- speed designs and MAGIC was no exception. Although the 100 MHz goal is likely achiev- able, the initial design operates under worst-case temperature, voltage, and process condi- tions at 77 MHz. Further efforts to reduce the cycle time were possible, but would have caused additional delays in sending the chip to LSI for fabrication and hence were post- poned. Most of the functional units did meet the 10ns cycle time goal. The major exception was the protocol processor. Three critical paths exist in the PP and these govern the overall chip cycle time. First is the MAGIC instruction cache access path. Physically long connec- tions to the four cache RAMs coupled with a raw RAM access time of 7ns are the problem here. Second is the MAGIC data cache access path, which suffers from delays in driving the cache address off-chip and then waiting for the data to propagate back on-chip. Third is the global PP stall signal used to pause PP execution when a cache miss or other excep- tional condition occurs. The stall signal is generated by and must be distributed to multiple locations in the PP logic. Design verification, like timing, was also a challenge, particularly given the complex- ity of the MAGIC design both in terms of raw gate count and functionality. In fact, verifi- cation activities occupied more time than the development of the RTL and were complicated by the need to verify not only the MAGIC hardware but also the protocol code running on the PP. Directed diagnostics, random stress tests, and actual applications using real protocol code all were used in this effort. Error detection relied on self-checking diagnostics and “snoopers” in the Verilog code that detected incorrect operation of the RTL. This approach is similar to the one used for the DEC Alpha 21164 microprocessor, as described in [58].

4.9 Summary

The MAGIC architecture balances flexibility and performance. The key architectural feature in support of this goal is the separation of data transfer from control (protocol) pro- cessing. By separating these two tasks and providing dedicated, specialized hardware for each, the overall performance of the chip is maximized without losing the flexibility to support multiple protocols and a variety of performance monitoring capabilities. Separation of data and control information begins at the interface units — the PI, NI, and IO — which handle the connection of the MAGIC chip to the R10000 compute pro- cessor, interconnection network, and local PCI bus. Incoming messages are split into mes- sage data and message headers. The message data is passed to the data transfer logic and the message headers, containing control information, are passed to the control macropipe- line. For outgoing messages, this process is reversed, with the interface units taking outgo- ing headers and combining them with any associated outgoing data to form complete outgoing messages suitable for transfer off-chip. The control macropipeline, composed of the inbox, protocol processor, and outbox, forms a closely-coupled mechanism that uses a combination of software and hardware to provide efficient protocol processing. The inbox selects incoming message headers and passes them through the jump table, which determines the starting PC value at which the protocol processor begins processing the message header and optionally initiates a specu- lative memory operation for the address contained in the message header. From the inbox, the message headers enter the protocol processor, the heart of the MAGIC design. The PP is a dual-issue, statically-scheduled RISC core, with instruction set extensions to acceler- ate common protocol operations and allow the PP to communicate with the other func- tional units. Outgoing message headers are passed to the outbox, the final stage of the control macropipeline, which manages header format conversions and flow control between the PP and the destination interface units. Data transfer is largely a protocol-independent task, and therefore is implemented in a completely hardwired fashion in the data transfer logic. Hardwired data transfer logic allows for high-bandwidth, low-latency data movement and permits extensive data pipe- lining. The data buffers are the central component of the data transfer logic. Each data buffer is one cache line in size and is used to stage data as it flows from source interface to destination interface. Per-word valid bits and an error bit provide additional support for data pipelining and error detection and propagation to ensure that corrupted data is not used. Because the data buffers are used by several MAGIC functional units, the data buffer allocator is needed to manage data buffer allocation and usage tracking. Completing the

data transfer logic is the memory controller, which controls the node’s local memory system, providing memory access on behalf of the protocol running on the protocol processor as well as servicing MAGIC instruction and data cache misses.

LSI Logic’s LCB500K 0.5µ standard cell ASIC process was used to implement the MAGIC design. The chip is 16mm on a side, contains approximately 250K actual gates and 200Kbits of on-chip SRAM, and has 451 signal pads and 700 total pads.

Chapter 5 MAGIC Operation: Protocols

The goal of MAGIC is to perform efficient protocol processing. This chapter discusses three protocols developed for MAGIC, demonstrating how the architectural features and functional units described in Chapter 4 are actually used. First, this chapter presents the data structures and an outline of the protocol processor code implementation of two cache coherence protocols: the bitvector/coarsevector protocol and the dynamic pointer allocation protocol.

Second, this chapter describes the interaction between MAGIC and the operating system. Use of the node controller to perform relatively complex low-level operating system services usually is impossible because of the hardwired implementation of the node controller in typical multiprocessor systems. In MAGIC, however, the presence of the protocol processor offers tremendous opportunity for performing services on MAGIC that are traditionally performed in the operating system on the compute processor. Section 5.5 describes the basic services provided and the mechanism for invoking them.

The cache coherence protocols and MAGIC-OS interaction mechanisms described in this chapter are just some of the possible protocols that can be provided on the system. These protocols also are the ones used for the system performance analysis presented in Chapter 6, and details of the protocol implementations are useful in interpreting the performance results. Other protocols certainly will be developed for the machine; for example, [45] discusses the development of a cache-only memory architecture (COMA) protocol [55] and an implementation of the scalable coherent interface (SCI) protocol [10, 39, 91] for FLASH.

5.1 Basic Cache Coherence Protocol Operations

A directory-based cache coherence protocol defines both a directory format and a set of messages that are exchanged among the processing nodes. This section presents some basic protocol transactions and illustrates the messages that are exchanged as a memory transaction proceeds. Although the basic operations are common to most cache coherence protocols, the messages exchanged tend to be protocol-specific. The message types discussed next are those exchanged by the bitvector/coarsevector and dynamic pointer allocation protocols, and are a component common to both protocols. The two protocols differ in

the directory entry formats they maintain. Section 5.2 and Section 5.3 discuss the bitvector/coarsevector and dynamic pointer allocation directory entry formats in detail.

5.1.1 Directory Usage

To achieve tradeoffs among scalability, implementation complexity, and directory memory overhead, different protocols employ varying directory entry formats. Fundamentally, however, regardless of its particular format, a directory entry tracks two pieces of information: the state of the associated cache line and a list of sharers. Abstractly, a cache line can be in one of three basic states:

• Clean — the line is not held in any cache in the system. The only copy of the line in the system is the one stored in the node’s local memory.

• Shared — the line is held in a shared (read-only) state in one or more caches in the system. The sharer list indicates the set of caches containing the line. The copy of the line in the node’s local memory is identical to the copies of the line held in the sharers’ caches.

• Dirty — the line is held in an exclusive (read-write) state in exactly one cache in the system. The sharer list, which in the dirty case contains just one entry, indicates the cache containing the line. The most up-to-date copy of the line is held only in the dirty processor’s cache; the copy in the node’s local memory is stale.

When an incoming message arrives, the current state of the line along with the sharer list are used to determine how to update the directory entry and what outgoing messages need to be sent. The following subsections describe this process in the context of servicing cache read misses and cache write misses.
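As a point of reference before the message flows below, the following is a minimal C sketch of the two pieces of information a directory entry tracks. The type names and the fixed-size sharer array are illustrative only; the actual FLASH directory formats appear in Sections 5.2 and 5.3.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 64

typedef enum { LINE_CLEAN, LINE_SHARED, LINE_DIRTY } LineState;

typedef struct {
    LineState state;                 /* Clean, Shared, or Dirty */
    bool      sharer[MAX_NODES];     /* which caches hold the line */
} DirEntry;

/* Record that 'node' has obtained a read-only copy of the line. */
static void add_sharer(DirEntry *d, int node)
{
    d->state        = LINE_SHARED;
    d->sharer[node] = true;
}

int main(void)
{
    DirEntry d = { LINE_CLEAN, { false } };

    add_sharer(&d, 3);               /* node 3 reads the line */
    printf("state=%d, node 3 caching=%d\n", d.state, d.sharer[3]);
    return 0;
}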

5.1.2 Servicing a Cache Read Miss: The Get Operation

When a compute processor executes a load instruction and does not find the load address in its cache, it issues a read miss on the SysAD bus. The MAGIC processor interface receives the read miss and translates it into what is called a Get request, meaning a request to retrieve a cache line in a shared (read-only) state. In its simplest form, the Get request targets a line in the Clean or Shared directory state and is satisfied by the sequence of messages illustrated in Figure 5.1:¹

1. The local MAGIC chip on the requesting node (labelled “R” in the figure) forwards the Get to the corresponding home node, that is, to the node containing the physical memory for the address.

1. An even simpler case occurs when the home node is the same as the requesting node. In this situation the Get is serviced from the requesting node’s local memory and no messages are exchanged.

Figure 5.1 Messages exchanged in servicing a simple Get operation

2. The home node, labelled “H” in the figure, receives the Get and consults the associated directory entry, which indicates that the state of the line is Clean or Shared. The directory state is updated to Shared and the requesting node is added to the sharing list. A Put message containing the data is sent to the requesting node.

3. The Put is received at the requesting node and echoed on the SysAD bus, completing the transaction. The data is placed into the R10000 cache in a shared (read-only) state.

A more complicated case occurs when the Get request arrives at the home node and the directory indicates that the cache line is in the Dirty state. This case is depicted in Figure 5.2 and requires several messages to be exchanged:

1. As in the previous case, the Get is forwarded from the requesting node to the home node.

2. The home node finds the line in the Dirty state and forwards the Get to the Dirty node, that is, to the node holding the line exclusively in its cache.

3. The dirty node, labelled “D” in the figure, receives the Get and responds by retrieving the current copy of the cache line from the local processor’s cache and changing the state of the line in the processor cache from exclusive (read-write) to shared (read-only). In addition, the dirty node sends a Put message with the data to the requesting node and a Sharing Writeback (SWB) message, also containing the data, to the home node.

4. The requesting node receives the Put and echoes it on the SysAD bus, completing the transaction. The data is placed into the R10000 cache in a shared (read-only) state.

5. The home node receives the Sharing Writeback, writes the data to its local memory, and updates the directory state to Shared with both the original dirty node and the requesting node on the sharing list.

Figure 5.2 Messages exchanged in servicing a three-hop Get operation

This type of Get operation is called a three-hop case in reference to the number of network traversals required before the Put arrives at the requesting node and satisfies the original Get.
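A simplified C sketch of the home-node decision between the two Get flows above follows. The directory representation, the send_msg() helper, and the node numbers are illustrative stand-ins, not the FLASH protocol sources (the real handler for the simple case appears in Section 5.4).

#include <stdio.h>

typedef enum { LINE_CLEAN, LINE_SHARED, LINE_DIRTY } LineState;

typedef struct {
    LineState          state;
    int                owner;      /* dirty node, valid only when LINE_DIRTY */
    unsigned long long sharers;    /* one bit per node, as in bitvector mode */
} DirEntry;

static void send_msg(const char *type, int dest)
{
    printf("send %s to node %d\n", type, dest);   /* stand-in for the NI */
}

/* Home-node handling of a Get from 'requester'. */
static void home_get(DirEntry *d, int requester)
{
    if (d->state == LINE_DIRTY) {
        /* Three-hop case: forward the Get; the dirty node sends the Put to
           the requester and a Sharing Writeback back to the home, which
           later updates the directory (not shown). */
        send_msg("Get (forwarded)", d->owner);
    } else {
        /* Clean or Shared: reply directly and record the new sharer. */
        d->state    = LINE_SHARED;
        d->sharers |= 1ULL << requester;
        send_msg("Put", requester);
    }
}

int main(void)
{
    DirEntry d = { LINE_CLEAN, -1, 0 };
    home_get(&d, 5);                       /* simple two-hop case  */
    d.state = LINE_DIRTY; d.owner = 9;
    home_get(&d, 5);                       /* three-hop case       */
    return 0;
}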

5.1.3 Servicing a Cache Write Miss: The GetX Operation

Just as a cache read miss results from an R10000 executing a load instruction and not finding the address in its cache, a cache write miss results when a store instruction is executed rather than a load. The MAGIC processor interface translates the cache write miss from the SysAD bus into a GetX request, meaning a request to retrieve a cache line in an exclusive (read-write) state. Again like the Get request flow described in Figure 5.1, the simplest form of GetX request is one in which the GetX targets a line in the Clean state. The messages exchanged in this operation are similar as well, and are depicted in Figure 5.3:

1. The local MAGIC chip on the requesting node forwards the GetX to the corresponding home node.

2. The home node receives the GetX and consults the associated directory entry, which indicates that the state of the line is Clean. The directory state is updated to Dirty and the requesting node is added to the sharing list. A PutX message containing the data is sent to the requesting node.

3. The PutX is received at the requesting node and echoed on the SysAD bus, completing the transaction. The data is placed into the R10000 cache in an exclusive (read-write) state.

Figure 5.3 Messages exchanged in servicing a simple GetX operation

A more complicated situation occurs when the target of the GetX is in the Shared state, indicating that the line is held read-only in several caches in the system. Before the GetX can be completed, the cache line must be invalidated, or removed, from all caches on the sharing list. Figure 5.4 illustrates this transaction:

1. The local MAGIC chip on the requesting node forwards the GetX to the corresponding home node.

2. The home node receives the GetX and consults the associated directory entry, which indicates that the state of the line is Shared. The home sends invalidation requests to all nodes on the sharing list (labelled “S” in the figure), updates the state of the line to Dirty, and revises the sharing list to contain only the requesting node. A PutX message with the data is sent to the requesting node.²

3. The requesting node receives the PutX, echoes it to the SysAD bus, and places the data in the processor cache in an exclusive state. The data may be used immediately even though all the invalidate acknowledgments have not been received.

4. The sharers receive the invalidate requests. Each node invalidates the line from its local processor’s cache and sends an invalidate acknowledgment to the home node.

5. When all invalidate acknowledgments have been received at the home, an AcksDone message is sent to the requester, completing the transaction.

2. The case described makes use of a relaxed consistency model because the PutX is sent to the requester before all invalidation acknowledgments have been received at the home. A delayed consistency model also is supported in which the PutX is deferred until all invalidation acknowledgments have been received.

Figure 5.4 Messages exchanged in servicing a GetX operation to a Shared line

5.1.4 Protocol Complications

A cache coherence protocol must handle numerous other operations in addition to the ones described above. Other directory states can occur when a Get or GetX request reaches the home node, and entirely new types of operations also must be supported. These include I/O operations, uncached operations, and support for synchronization primitives, such as the “load linked” and “store conditional” instructions provided in the MIPS instruction set. Although these operations are important components of a protocol and, particularly in the case of synchronization support, can have a measurable impact on system performance, further discussion is beyond the scope of the protocol overview presented in this chapter.

One issue that warrants further discussion, however, is protocol races. Even within the Get and GetX operations, complications often occur that were ignored in the previous discussion. Among the most common complications are protocol races in which one request is still in progress when another request for the same cache line arrives. In the GetX to a shared line operation depicted in Figure 5.4, for example, another request for the same cache line may arrive at the home node after the PutX is sent to the requesting node but before all invalidate acknowledgments have been received. The second request could even be from one of the original sharers that has invalidated the line from its cache and is once again requesting it. Handling such protocol races, often called transient conditions, greatly complicates the implementation of any cache coherence protocol and provides another reason to opt for a flexible protocol processor, since the flexibility allows protocol bugs to be corrected even after the node controller has been fabricated.

5.2 Bitvector/Coarsevector Protocol

The characteristics of a protocol — its scalability, processing overhead, and implementation complexity — are a function of both the directory format it uses and the message types it requires. This section and Section 5.3 delve into some of the details of the bitvector/coarsevector and dynamic pointer allocation protocols: the specific directory formats and examples of the actual MAGIC protocol processor code sequences used to process the basic cache coherence operations presented earlier.

On a spectrum of implementation complexity and processing overhead, the bitvector/coarsevector protocol lies at the simple, low overhead end. It employs a simple directory format, exchanges relatively few message types, and requires comparatively small amounts of processing in response to incoming messages. As a result, the protocol is straightforward to implement and has low protocol processing overhead. The ease of implementation also makes the bitvector protocol an attractive choice for use in a hardwired node controller. DASH, for instance, uses the bitvector protocol, although a simpler version than the one presented here, and the SGI Origin 2000 system uses a variation of the bitvector/coarsevector protocol.

The price of bitvector’s implementation ease is its lack of scalability. As will be discussed below, the memory overhead of the bitvector protocol grows with the square of the number of processors in the system. Consequently, the protocol is suitable only for small-scale machines of perhaps a few tens of processors. By sacrificing some of its simplicity, however, the bitvector protocol can be extended for use in larger systems. In this mode, the protocol is called coarsevector, in reference to its imprecise or “coarse” tracking of sharers. Thus, bitvector/coarsevector is really a two-stage protocol: the first stage, in which the bitvector mode is used, is suitable only for small processor counts; as the system size grows the second stage, coarsevector mode, must be substituted.

5.2.1 Bitvector/Coarsevector Directory Format

A directory entry must track two types of information: the state of the associated memory line and the sharers. As discussed in Section 2.1.3, the basic idea of the bitvector protocol is to maintain an array of presence bits for each cache line, with one presence bit for each processor in the system. State information is kept in a separate bitfield. In practice, the actual directory entry format is a little more complicated; Figure 5.5 depicts the format of the bitvector directory entry used in the implementation of this protocol for FLASH. The figure shows a single node’s local memory (viewed as an array of 128-byte cache lines), the corresponding array of 8-byte directory entries, and, on the far right, the format of an individual directory entry.

Figure 5.5 Bitvector/coarsevector memory and directory organization

The various bits in the directory entry are used as follows:

• The Pending bit (P) is set if the cache line is in a transient state. Requests for a cache line whose directory entry’s pending bit is set typically are negatively acknowledged until the transient condition is resolved and the pending bit is cleared.

• The Dirty bit (D) is set if the line is held exclusively in a single processor’s cache; if clear, the line is shared among zero or more processors.

• The Upgrade bit (U) is used in maintaining the proper memory consistency model for the cache line. The bitvector protocol supports both processor consistency and release consistency. Consistency models are discussed fully in Gharachorloo [35].

• The I/O bit (IO) is set if the line is held in the one-line cache in a node’s I/O subsystem (see Section 4.5.3) rather than in a node’s compute processor’s cache.

• The MoreInvals bit (M) is set if an exclusive request (GetX) has been made for the line and additional invalidation requests still need to be sent.

• The RealPtrs field contains a count of the number of sharers (that is, a count of how many bits are set in the Vector field).

• The Vector field is the actual presence bit vector. It tracks which nodes are sharing the cache line or which node holds the line exclusive.

In general terms, the first five fields above comprise the directory state and the vector field is the presence bit vector. The RealPtrs field is just an optimization; a count of the number of bits set in the vector field is required for certain protocol operations but

computing the count directly from the contents of the vector field is inefficient to perform on the protocol processor.

Use of the bitvector directory entry is straightforward. When the PP handler begins processing an incoming message, it first retrieves the associated directory entry. Because each cache line in the local memory has an associated directory entry, the address of the proper directory entry can be generated by scaling the incoming address by the size of the directory entry (eight bytes) and then adding the result to the base address of the directory entry array. Once the directory entry has been loaded, the handler code inspects various state bits to determine how to update the directory entry and what outgoing messages need to be sent.
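A brief C sketch of this address calculation follows. With 128-byte cache lines and 8-byte directory entries, dividing by the line size and scaling by the entry size collapses to a single shift right by four for line-aligned addresses, which is exactly what the NILocalGet handler in Section 5.4 does with addr >> 4. The variable names and example values are illustrative.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE       128u   /* bytes per cache line      */
#define DIR_ENTRY_SIZE    8u   /* bytes per directory entry */

static uint64_t dir_entry_addr(uint64_t dir_base, uint64_t line_addr)
{
    /* (line_addr / 128) * 8  ==  line_addr >> 4  for line-aligned addresses */
    return dir_base + (line_addr / LINE_SIZE) * DIR_ENTRY_SIZE;
}

int main(void)
{
    uint64_t base = 0x40000000, addr = 0x1000;   /* example values only */
    printf("entry at 0x%llx (shift form 0x%llx)\n",
           (unsigned long long)dir_entry_addr(base, addr),
           (unsigned long long)(base + (addr >> 4)));
    return 0;
}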

5.2.2 Bitvector Scalability Limitations

The primary advantage of the bitvector protocol is its simplicity. All state information and sharing information is contained in a single 64-bit directory entry. State checks require only a few bit tests, and modifying the vector field to add or delete sharers can be performed quickly using simple shift and logical instructions. As a result, the bitvector protocol implementation is characterized by short protocol handlers, helping to minimize protocol processing overhead. For small-scale systems, then, bitvector/coarsevector is an excellent protocol choice. Unfortunately, the advantages of the bitvector/coarsevector protocol are attainable, in general, only for small-scale systems. In particular, the protocol suffers from a severe scalability limitation — the memory required by the directory — as the system size grows.

Because a unique vector bit is needed for each processor, the size of the vector field must grow linearly as the number of processors is increased. In addition, the protocol requires one directory entry for each memory line in the system. Consequently, as more processors are added, the number of directory entries also grows linearly. Overall, therefore, the aggregate directory storage space for all processors in the system — the total number of directory entries multiplied by the size of each entry — grows proportionally to the square of the processor count because both the number of directory entries and the size of each entry grow linearly as more CPUs are added.

Thus, bitvector mode is not easily extended to larger processor counts. Extending the vector field as more processors are added is infeasible beyond a few tens of processors because the size of the directory rapidly would approach the size of the memory it is being used to track, clearly an unacceptable situation. Even if the directory overhead increase were tolerable, the huge size of the vector field would negate much of bitvector mode’s simplicity and low protocol processing overhead. Transition into coarsevector mode,

presented in the next section, allows the size of each directory entry to remain fixed as the number of processors increases, but comes with a loss of precise sharing information and the associated protocol processing overhead increase and performance loss.
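As a rough worked example of this quadratic growth (the symbols M, L, P, and s are introduced here only for illustration): with M bytes of memory per node, L-byte cache lines, P processors, and s fixed state bits per entry, the aggregate directory storage is

\[
\underbrace{P \cdot \frac{M}{L}}_{\text{directory entries}} \;\times\; \underbrace{(P + s)}_{\text{bits per entry}} \;\approx\; \frac{M}{L}\,P^{2} \quad \text{bits, for } P \gg s .
\]

With L = 128 bytes, the presence vector alone reaches 1024 bits (as large as the cache line it describes) at P = 1024, at which point a bit-per-processor directory would occupy roughly as much memory as the data it tracks.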

5.2.3 Coarsevector Mode

When the total system processor count exceeds the length of the vector field, the basic principle of allocating one bit in the vector field for each processor is no longer feasible. Somehow the use of the vector field must be extended to allow it to track sharing information for larger processor counts. The most common approach to solving this problem is to redefine each bit in the vector field to represent several processors rather than a single processor. This approach is said to make the bits in the vector field “coarse” (in that each represents multiple sharers) and gives rise to the “coarsevector” terminology — “bitvector” refers to the use of the protocol in its most efficient mode of one processor per vector bit and “coarsevector” refers to the mode in which multiple processors are mapped to a single vector bit. If each bit in the vector field represents C processors (C is the coarseness factor), then bit i of the vector field represents processors (C • i) through (C • (i + 1) - 1) and, in many cases, messages sent to one of these processors must be sent to all of them.

For the specific 48-bit size of the FLASH vector field, the protocol uses one bit per processor for processor counts up to 48. For processor counts over 48, each bit in the vector field represents a power of two number of processors, with the particular coarseness factor determined by the maximum system processor count. A system with 256 processors, for example, would need a coarseness factor of eight. (Ideally, each bit in the vector field would represent 256 / 48 = 5.3 processors, but to achieve efficient protocol code the coarseness factor needs to be a power of two, so 5.3 must be rounded up to 8, the next higher power of two.)

Although the coarsevector modification allows the basic bitvector protocol to be employed for large processor counts, the mapping of multiple processors to one vector bit means that the protocol can no longer retain precise sharing information. In other words, in the coarsevector mode, a single vector bit maps to many processors and there is no means to determine the exact subset of these processors actually caching the associated memory line. As a result, the protocol handlers must be conservative and assume that all processors mapped to a particular vector bit are caching the memory line. This assumption is necessary for correctness but leads to significantly longer protocol processing handlers for a given protocol operation and requires many more internode messages to be sent. In the GetX case described in Section 5.1.3, for instance, invalidation requests must be sent to all processors represented by the vector bits rather than to the specific processors

actually caching the memory line. In a 256-node system, up to eight times more invalidation requests might be generated than are actually needed, leading to higher protocol processing overhead at both the home node and sharing nodes. The results to be presented in Chapter 6 demonstrate that the additional overheads of coarsevector can result in significant performance loss even at as few as 64 CPUs.

Hence, for small-scale systems — say, fewer than 64 CPUs — the bitvector scheme has an acceptable directory memory overhead and good performance. For systems of more than roughly 100 processors, however, the coarsevector mode must be used and overall system performance likely will degrade. As a result, the coarsevector mode is unlikely to achieve best performance on larger systems. Instead, larger systems benefit from a protocol with an inherently more scalable means for tracking sharing information, such as the dynamic pointer allocation protocol described next.
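A short C sketch of the coarseness-factor arithmetic just described follows: up to 48 processors the factor is one (pure bitvector mode), and beyond 48 the ideal ratio is rounded up to the next power of two, so 256 processors give a factor of eight. The function names are illustrative.

#include <stdio.h>

#define VECTOR_BITS 48

static int coarseness(int nprocs)
{
    if (nprocs <= VECTOR_BITS)
        return 1;                      /* bitvector mode: one bit per processor */
    int c = 1;
    while (c * VECTOR_BITS < nprocs)   /* round up to a power of two */
        c *= 2;
    return c;
}

/* Which vector bit represents a given processor. */
static int vector_bit(int proc, int nprocs)
{
    return proc / coarseness(nprocs);
}

int main(void)
{
    printf("P=32:  C=%d\n", coarseness(32));    /* 1 (bitvector) */
    printf("P=64:  C=%d\n", coarseness(64));    /* 2             */
    printf("P=256: C=%d, proc 100 -> bit %d\n",
           coarseness(256), vector_bit(100, 256));  /* C=8, bit 12 */
    return 0;
}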

5.3 Dynamic Pointer Allocation Protocol

As indicated above, larger systems require a protocol whose directory memory overhead is comparable to coarsevector yet which can continue to maintain precise sharing information as the processor count increases. The dynamic pointer allocation protocol, proposed by Simoni [95], is designed to meet these goals.

5.3.1 Dynamic Pointer Allocation Directory Format

Like the bitvector/coarsevector protocol, the dynamic pointer allocation protocol maintains one directory entry for each cache line in a node’s local memory. Unlike the bitvector protocol, however, the dynamic pointer allocation directory entry does not contain a presence bit vector or other sharing information. Instead, all sharing information is maintained in a linked list, and only a pointer to the start of the list is kept in the directory entry.

Figure 5.6 depicts the two dynamic pointer allocation data structures: the head link array and the pointer-link store. The head link array is an array of directory entries, called head link entries, containing state information and a pointer to a linked list of sharers. The linked list of sharers is maintained in the pointer-link store, which is a pool of pointer-link entries shared by all head link entries. Pointer-link entries available for use (that is, free) are themselves maintained in a linked list, and the free list pointer points to the start of this list.

Figure 5.6 Dynamic pointer allocation memory and directory organization

The fields of the 64-bit head link entry are used as follows:

• The Pending bit (P), Dirty bit (D), Upgrade bit (U), and I/O bit (IO) are used in the same way as for the bitvector/coarsevector protocol; see Section 5.2.1 for details.

• The Local bit (Lcl) is set if the local compute processor is caching the memory line. Local caching is a common situation. This bit avoids the need to explicitly add the local processor to the linked list of sharers.

• The Reclaim bit (R) is set when the supply of free pointer-link entries is exhausted and pointer-link entries are being reclaimed. Section 5.3.2.2 discusses this problem in greater detail.

• The Pointer field (Ptr) contains the number of the first sharing node or of the dirty node if the Dirty bit is set. Extra space remained in the head link entry after all state information was assigned, and storing the first sharer in the head link entry improves performance by avoiding the need to allocate a pointer-link entry for the first sharer.

• The HeadPtr bit (H) is set if the Pointer field is valid.

• The HeadLink field points to the first pointer-link entry of the sharing list.

• The List bit (Lst) indicates whether the head link entry has a list of sharers. If this bit is set, then the HeadLink field is valid and points to the first pointer-link entry of the sharing list.

• The RealPtrs field stores a count of the number of sharers on the sharing list.

• The StalePtrs field is used to optimize the searching of the sharing list when sharers are added or removed; further discussion of this optimization is found in Heinrich [45].

The 32-bit pointer-link entries contain just three fields, used as follows:

• The End bit (E) is set if the pointer-link entry is the last element of the linked list of sharers; essentially it indicates whether the Link field is valid.

• The Pointer field (Ptr) contains the number of the sharing node.

• The Link field points to the next pointer-link entry in the linked list of sharers.

Initially, all pointer-link entries are free and are linked into a single free list. All head link entries indicate that the associated memory line is in the Clean state and that no sharers are present. Thereafter, two basic operations need to be performed: adding a new sharer and transitioning a line from the Shared state to the Dirty state.

When a new sharer needs to be added, an available pointer-link entry is removed from the free list, its Pointer field is initialized to the number of the sharing node, and the associated head link entry’s HeadLink field is set to point to the new pointer-link entry. The previous contents of the head link’s HeadLink field are copied into the pointer-link entry’s Link field to complete the process. New sharers are added to the front of any existing sharer list to avoid having to traverse the entire list before adding the new pointer-link entry. In addition, the last pointer-link entry in the sharing list has its End bit set and its Link field points back to the head link entry at the start of the list. Thus, the sharing list is actually a circular list in that the last pointer-link entry points to the head link entry. This characteristic is used in handling pointer-link store exhaustion, as discussed in Section 5.3.2.2. Also, as mentioned above in the description of the head link entry fields, one optimization is made in this process: the first sharer is stored in the head link entry itself, using the Pointer field, rather than in a pointer-link entry. In all cases, the RealPtrs field in the head link entry tracks the current sharer count, so it is incremented each time a new sharer is added.

To illustrate the layout of the sharing list, Figure 5.7 depicts a head link entry with an attached sharing list of four pointer-link entries. As shown, the initial sharer is stored in the head link entry itself (node 12 in the example), and the remaining four sharers (nodes 9, 17, 3, and 14) are stored in four pointer-link entries. The head link entry points to the first pointer-link entry, and the last pointer-link entry (node 14) has its End bit set and points back to the head link entry.

Transition of a line from the Shared to Dirty state occurs in the course of processing GetX messages and requires that all sharers be sent invalidation request messages.

Figure 5.7 Example of head link entry with attached sharing list

This process is accomplished by traversing the chain of pointer-link entries attached to the head link entry and sending an invalidation request to each node in the sharing list. At the same time as the invalidation requests are sent, the associated pointer-link entry is returned to the free list. Once all invalidation requests have been sent, the state transitions to Dirty and the Pending bit is set to indicate that the line is in a transient state waiting for invalidation acknowledgments. Upon receipt of each invalidation acknowledgment, the RealPtrs field is decremented. When RealPtrs reaches zero, all invalidation acknowledgments have been received and the Pending bit in the head link entry is cleared.
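A simplified C sketch of the head-link and pointer-link structures and the add-sharer operation described above follows (new sharers go at the front of the list and the first sharer lives in the head link itself). The field names and sizes are illustrative; the circular back-pointer to the head link, the End bit, and the reclamation path are omitted or simplified relative to the real FLASH format.

#include <stdbool.h>
#include <stdio.h>

#define NLINKS 16               /* size of the shared pointer-link store */

typedef struct {
    int node;                   /* sharing node number */
    int next;                   /* index of next pointer-link entry, or -1 */
} PtrLink;

typedef struct {
    bool head_valid;            /* "H" bit: Pointer field holds a sharer */
    int  head_node;             /* first sharer, kept in the head link */
    int  list;                  /* "HeadLink": first pointer-link entry, or -1 */
    int  real_ptrs;             /* sharer count (RealPtrs) */
} HeadLink;

static PtrLink store[NLINKS];
static int free_list = 0;       /* index of the first free pointer-link entry */

static void init_store(void)
{
    for (int i = 0; i < NLINKS; i++)
        store[i].next = i + 1;
    store[NLINKS - 1].next = -1;
}

static void add_sharer(HeadLink *h, int node)
{
    h->real_ptrs++;
    if (!h->head_valid) {                 /* first sharer: use the head link */
        h->head_valid = true;
        h->head_node  = node;
        return;
    }
    int e = free_list;                    /* take an entry off the free list */
    if (e < 0)                            /* store exhausted: reclamation would run here */
        return;
    free_list     = store[e].next;
    store[e].node = node;
    store[e].next = h->list;              /* link in at the front of the list */
    h->list       = e;
}

int main(void)
{
    init_store();
    HeadLink h = { false, -1, -1, 0 };
    add_sharer(&h, 12);                   /* stored in the head link itself */
    add_sharer(&h, 9);
    add_sharer(&h, 17);
    printf("sharers=%d head=%d first-link=%d\n",
           h.real_ptrs, h.head_node, store[h.list].node);  /* 3 12 17 */
    return 0;
}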

5.3.2 Dynamic Pointer Allocation Complications

Use of a linked list rather than a fixed-size presence bit vector allows sharing information to be kept precisely even for very large processor counts. An arbitrary number of sharers can be accommodated by allocating new pointer-link entries and extending the linked list of sharers attached to a head link entry. Having a common pool of pointer-link entries avoids the need to statically dedicate large numbers of pointer-link entries to each head link entry, and thereby limits the growth in directory memory overhead as more processors are added.

Thus, the dynamic pointer allocation protocol overcomes the two most serious drawbacks of the bitvector/coarsevector protocol: lack of precise sharing information in coarsevector mode and unacceptable directory memory size growth in bitvector mode. Unfortunately, dynamic pointer allocation has two drawbacks of its own: the need for replacement hints and the need to cope with exhaustion of free entries in the pointer-link store. Although these drawbacks do not impact the overall scalability advantages versus bitvector/coarsevector, they do add implementation complexity and processing overhead.

5.3.2.1 Replacement Hints

When an R10000 compute processor requests a cache line in an exclusive (read-write) state, the R10000 can evict the line from its cache only after first issuing a writeback to the local MAGIC chip. For a line requested in the shared (read-only) state, however, the R10000 is free to evict the line from its cache without notifying either the local MAGIC chip or the home node. If a processor continually reads several different cache lines whose addresses are such that they all map to the same line in the R10000 data cache, then a cache conflict occurs. The result is that each of the conflicting cache lines is read into the cache in a shared state, evicted when one of the other cache lines is read, and then read into the cache again.

Within the memory system, each of the home nodes of the conflicting addresses receives a stream of repeated Get requests, all for the same address. These requests are serviced as normal: the state of the line is checked, found to be Shared, and the requesting processor is added to the sharer list. Because of the continual stream of Get requests, the same processor is added to the sharer list many times. In the bitvector/coarsevector protocol, this situation does not lead to difficulty because adding a processor to the sharing list just means setting the corresponding bit in the presence bit vector. If this bit is set already, setting it again is unnecessary, of course, but otherwise has no adverse effects.

In the dynamic pointer allocation protocol, however, adding a new sharer requires the home node to allocate a new pointer-link entry and attach it to the head link’s linked list of sharers. The problem with the continual sequence of Get requests is that it causes the same processor to be added to the sharing list many times, so the length of the sharing list can grow without bound even though no additional processors are actually sharing the associated cache line. Worse, each time the requesting processor is re-added to the sharing list, a new pointer-link entry must be removed from the free list, thereby consuming the supply of free pointer-link entries until no more remain. Although having the same processor on the sharing list multiple times does not compromise the correctness of the protocol, lengthy sharing lists take longer to traverse and lead to unnecessary duplicate invalidate requests being issued, substantially increasing protocol processing overhead and degrading system performance.

To resolve this problem, before a processor is added to the sharing list, the list could be checked to see if the processor already is on the list. If so, the processor is not added again. The drawback of this simple solution is that it requires the sharing list to be traversed in its entirety before a new processor is added. The performance impact of the repeated sharing

list traversal, especially for large machines, is prohibitive and, as a result, this solution is undesirable in practice.

A better solution is to require the compute processors to inform the home node when a shared line is evicted from the cache. Such notifications are called replacement hints and are issued just as writebacks would be, except that replacement hints carry no data since the processor did not modify the cache line. Replacement hints are delivered to the home node, again just as a writeback would be. At the home node, the sharing list is traversed and the processor that issued the replacement hint is removed. Replacement hints eliminate the need to check if a processor already is on the sharing list before adding it and they do so with significantly lower protocol processing overhead than the first proposed solution of checking the sharing list explicitly before a new processor is to be added.

One drawback of replacement hints is that they increase the overall protocol processing overhead because their use requires additional messages to be sent and then processed at the home nodes. However, the performance impact of this drawback is much smaller than the consequences of not using replacement hints [95], and their presence is assumed when using the dynamic pointer allocation protocol on FLASH.³

5.3.2.2 Pointer-Link Store Exhaustion

Even with replacement hints, the aggregate size of the sharing lists for all head link entries on a node can lead to exhaustion of free pointer-link entries. The total number of pointer-link entries — that is, the size of the pointer-link store array — is deliberately insufficient to simultaneously handle a large number of sharers for a large number of memory lines on the local node. Directory memory overhead would be unacceptably high if the pointer-link store were sized to accommodate near worst-case sharing patterns without exhaustion. Instead, the pointer-link store is sized according to the expected sharing characteristics and a mechanism is provided to cope with the case in which a new pointer-link entry is needed but none is available.

Lack of a free pointer-link entry is handled by selecting a pointer-link entry at random and then, starting at the selected entry, following the sharing list until the head link is reached. Once the head link is located, the entire sharing list is invalidated just as if a GetX

3. Replacement hints also could be used in conjunction with the bitvector protocol, though not in the coarsevector mode. However, several researchers, such as Weber [38, 108], have found that the advantage of reducing invalidate request and acknowledge traffic by using replacement hints is outweighed by the additional protocol processing overhead required to service the replacement hints. As a result, systems employing the bitvector protocol generally do not use replacement hints; in FLASH, the R10000 processors are configured to disable replacement hint generation when the bitvector/coarsevector protocol is in use.

request had been received. As with the GetX operation described in Section 5.3, upon issuing each invalidation request, the associated pointer-link entry is returned to the free list, thereby making it available to the operation that initially encountered the empty free list. The process of selecting a pointer-link entry at random, following the sharing list back to the corresponding head link, and then sending invalidate requests to the attached sharing list is called pointer-link reclamation. Although reclamation is conceptually straightforward, it incurs a substantial protocol processing overhead, both to generate the invalidate requests and then to process the acknowledgments. In addition, invalidating cache lines selected essentially at random may lead to a performance loss, since the cache lines being invalidated may be in active use by the sharers’ compute processors. Overall, a tradeoff must be made in selecting the size of the pointer-link store: if the size is too large, the directory memory overhead will increase and many pointer-link entries may be unused; if the size is too small, pointer-link reclamation will be performed too often and a potentially large performance degradation is likely to result.

5.4 Example Protocol Processing Code Sequences

To make the correspondence among the directory entry formats, protocol operations and the resulting protocol processor code more concrete, this section presents the actual protocol processor code used to service a Get request to a local address in the Clean state for the bitvector protocol. The code implements the home node processing for the simple Get operation flow described in the first part of Section 5.1.2 and illustrated in Figure 5.1. In the nomenclature of the protocol code, the name of the handler for this operation is NILocalGet because the message arrives via the NI, targets a local address, and specifies a Get operation. The sections that follow first describe the environment in which protocol code is developed and then present both the C code and protocol processor assembly code for the bitvector NILocalGet handler.

5.4.1 Protocol Code Development Environment

A suite of protocol development tools has been created to allow protocol source code to be written in C and then compiled into PP assembly language. Figure 5.8 depicts the compilation flow from a collection of protocol C source files to a complete protocol code and data image ready for execution on the protocol processor.

Compilation begins by invoking dpgcc, a port of the gcc compiler [101] to the protocol processor instruction set, on each C source file. The resulting assembly language (.s) files then are processed by a superscalar scheduler called twine [98].

Figure 5.8 Tool flow for protocol processor code development environment

Twine is configured with the basic dual-issue capability of the protocol processor as well as various PP scheduling restrictions, such as which instructions can be issued only on one side of the pipeline and the delay slots associated with loads and branches. From twine, a fully-scheduled assembly language file is produced, the .t file. Next, each .t file is assembled into an object (.o) file. Finally, all .o files are linked into a single protocol image comprising a protocol.code file, containing the protocol code segment, and a protocol.data file, containing the protocol data segment. The protocol.code and protocol.data files are suitable for execution on MAGIC’s protocol processor.

5.4.2 Bitvector/Coarsevector NILocalGet Code

Figure 5.9 lists the bitvector NILocalGet C code. This is a portion of a C source file that is supplied to dpgcc for compilation. The code fragment below is meant to provide a concrete example of protocol code; to simplify presentation, the code does not contain all the necessary data structure and macro definitions, and so would require additional header files and support code before it really could be compiled.

Several aspects of the NILocalGet C code are of interest. First, at the C level, protocol handlers are written as if each handler were an individual C function.

void NILocalGet() {
    DirEntry *dirEntryAddr, dirEntry;

    /* Compute directory entry address */
    dirEntryAddr = (DirEntry *) ((addr >> 4) + dirBaseAddr);

    /* Load directory entry */
    dirEntry = *dirEntryAddr;

    /* Compose outgoing reply message header, a PUT message */
    header.len = LEN_CACHELINE;
    header.msgType = MSG_PUT;

    /* Start directory state checks */
    if (!dirEntry.Pending) {
        if (!dirEntry.Dirty) {
            /* State is Clean/Shared -- send PUT reply to requester */
            NI_SEND(REPLY, F_DATA, F_FREE, F_SWAP, F_NOWAIT);

            /* Add new sharer to vector field */
            dirEntry.Vector |= (1LL << header.src);

            /* Store updated directory entry */
            *dirEntryAddr = dirEntry;
        } else {
            /* State is Dirty -- more code here */
        }
    } else {
        /* State is Pending -- more code here */
    }
}

Figure 5.9 Bitvector NILocalGet handler C code

Normal C data structures are used to define the layout of the directory entry and several C preprocessor macros are employed to generate specific PP instructions where needed. The NI_SEND macro, for instance, causes the PP send instruction to be generated for the purpose of sending the outgoing message header from the PP to the outbox. Second, several variables are initialized automatically: header and addr are initialized with the incoming MAGIC message header during the PP dispatch operation, and dirBaseAddr is a global register that is initialized at system boot with the base address of the directory entry array. Since dispatch and initialization of the message header are managed transparently by the inbox and PP, the result is a very short handler. The only tasks remaining for the handler itself are computing the address of and loading the directory entry, checking the directory state, and composing and sending any necessary outgoing message headers.

Chapter 5: MAGIC Operation: Protocols 111 NILocalGet: addi $5,$0,1 # $5 = constant 1 srl $2,$18,4 # Shift incoming address #-- add $24,$19,$2 # Add to directory array base addr in $19 srl $8,$17,16 # $8 = source field of incoming header #-- addi $11,$0,8205 # $11 = (LEN_CACHELINE | MSG_PUT) ld $23,0($24) # Load directory entry into $23 #-- insfi $17,$11,0,13 # Create outgoing PUT message header sllv $15,$5,$8 # $15 = (1LL << header.src) #-- nop # Second load delay slot nop #-- bbs32 $23,31,$L_Pend # Branch if state is Pending andfi $27,$23,0,47 # Extract Vector field of dir. entry #-- bbs32 $23,30,$L_Dirty # Branch if state is Dirty or $2,$27,$15 # Add new sharer; $2 is new Vector field #-- nop # Branch delay slot nop #-- insfi $23,$2,0,47 # Update Vector field of dir. entry send $17,$18,15 # Send outgoing message (PUT) to outbox #-- switch $17 # Ask inbox for first half of next header sd $23,0($24) # Store updated directory entry #-- ldctxt $18 # Ask inbox for second half of next header nop #-- L_Dirty: # Code to handle Dirty state here L_Pend: # Code to handle Pending state here

Figure 5.10 Bitvector NILocalGet handler PP assembly code

Figure 5.10 shows the same NILocalGet C code after compilation into PP assembly language (dpgcc) and superscalar scheduling (twine). Because the PP is a dual-issue processor, two instructions are executed every cycle; dashed lines are used to separate packets of two instructions in the figure. In addition, the PP is statically scheduled, so explicit nop instructions must be inserted where necessary. The resulting assembly code is short — just 9 cycles to the send of the Put reply message and 11 cycles total — and thus demonstrates

the need for efficient dispatch and bitfield operations as noted in Section 3.5. Bitfield instructions (bbs32 and andfi, for instance) are used extensively and even a 5-cycle dispatch overhead would increase the handler cycle count by almost 50%. Finally, a close inspection of the C and assembler code demonstrates that the combination of dpgcc and twine leads to efficient, well-scheduled protocol code. Optimizations at the assembly level, such as coding portions of the handlers directly in PP assembly language rather than in C, can yield small improvements in handler cycle counts. But, overall, the benefit of manually coding in assembly language is minimal and is greatly outweighed by the well-known advantages of coding in a high-level language.

5.5 Interaction Between MAGIC and the Operating System

MAGIC is designed primarily to perform protocol processing and thereby implement the system communication paradigm. The MAGIC architecture, particularly the protocol processor instruction set, is designed with this task in mind. In a conventional multiprocessor design, implementing the system communication paradigm usually is all the hardwired node controller can do. Additional functionality is provided only rarely, and in any case must be specified fully at design time rather than after the node controller has been manufactured. With its programmable protocol processor, however, MAGIC is capable of providing more than the communication paradigm; by executing other types of code, MAGIC can provide additional, higher-level services that are traditionally performed by the operating system itself. In this role, MAGIC acts almost as a coprocessor to the R10000, allowing tasks that are more efficiently allocated to the node controller to be performed on MAGIC, as well as allowing MAGIC to invoke OS services to be executed on the R10000.

Key to the implementation of these services is a means to communicate both data and control information between MAGIC and the local R10000. The overall structure of this communication is similar to a remote procedure call: the OS collects the necessary procedure arguments and passes them to MAGIC, along with an indication of what service is to be performed. MAGIC, in turn, executes the requested service and returns a result. MAGIC uses a similar mechanism to invoke OS services on the R10000. Both mechanisms are described below and illustrate some of the additional ways in which a flexible node controller can be used.

5.5.1 OS-to-MAGIC Procedure Calls (PPCs)

Services performed on MAGIC on behalf of the OS are known as protocol procedure calls, or PPCs. The OS invokes a PPC by performing an uncached load to a particular range of addresses that are allocated for PPC use.

Figure 5.11 PPC address encoding (fields of the uncached load address, from high to low bits: Node 39:32, Space 31:30, PPC Range 29:18, Request ID 17:7, Sequence 6:0)

The particular load address indicates that a PPC is being requested, the exact service to be performed, and the location of any arguments. Figure 5.11 illustrates the encoding of PPC addresses. The node field specifies the node to which the PPC should be delivered; although remote PPCs are possible, the assumption is that an R10000 will make PPC requests to local addresses only. The address space field identifies the address as belonging to the I/O space. PPC requests are made as if the MAGIC chip were an I/O device for compatibility with normal OS handling of I/O devices. The PPC range field identifies the address as a PPC rather than a normal I/O PIO operation. The request ID and sequence fields specify the exact PPC operation to be executed as well as the request sequence number.

MAGIC receives the PPC address as an uncached load to the I/O space. The handler invoked on MAGIC must check the PPC range field to determine if the address specifies a PPC or an I/O operation. For a PPC, the handler inspects the request ID to determine how to proceed. Additional arguments to accompany the PPC are supported and are stored in memory starting at the same address as the original PPC load address.

With the PPC request ID and any arguments, the MAGIC handler can process the PPC. After it does so, it responds to the R10000’s uncached load with a result code indicating the PPC completion status. In most cases, PPCs are short operations and the MAGIC handler executes them immediately and returns the result. For longer operations, such as those that might require retrieving data from a remote node, the result returned to the R10000 indicates that the request has been accepted and is being serviced. An interrupt is delivered to the R10000 when the operation completes.

Because the PPC mechanism is general, depending only on what code is being run on MAGIC, the set of PPC services is not fixed. Both PPCs originating from the OS itself as well as those requested by applications, with suitable protection provided by the OS, can be accommodated, and the collection of supported PPC services can be adapted to OS and application needs. Table 5.1 lists some of the initial PPC services provided.

Table 5.1 Example PPC services

  Service Class   Description
  Messaging       Short and long internode message sends and block copies
  Config          System configuration control and support
  Fault           Fault reporting and recovery
  Perfmon         Performance monitoring data and configuration; page migration and replication support
  I/O             DMA address mapping and I/O device control
  Memory          Sharing memory and data structures between OS and MAGIC
  Firewall        Fault containment cell control
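A small C sketch of how the OS side might compose the uncached load address for a PPC follows, using the bit positions shown with Figure 5.11. The specific field values and the overall calling sequence are hypothetical illustrations, not the real IRIX/FLASH interface.

#include <stdint.h>
#include <stdio.h>

static uint64_t ppc_address(unsigned node, unsigned space, unsigned range,
                            unsigned request_id, unsigned sequence)
{
    return ((uint64_t)(node       & 0xFF)  << 32) |   /* Node       39:32 */
           ((uint64_t)(space      & 0x3)   << 30) |   /* Space      31:30 */
           ((uint64_t)(range      & 0xFFF) << 18) |   /* PPC Range  29:18 */
           ((uint64_t)(request_id & 0x7FF) << 7)  |   /* Request ID 17:7  */
            (uint64_t)(sequence   & 0x7F);            /* Sequence    6:0  */
}

int main(void)
{
    /* Hypothetical encodings: local node 0, I/O space code 2, a PPC-range
       code of 0x800, request ID 5, sequence number 1. */
    uint64_t addr = ppc_address(0, 2, 0x800, 5, 1);
    printf("PPC load address: 0x%010llx\n", (unsigned long long)addr);
    /* The OS would then issue an uncached load from this address; the value
       returned by MAGIC is the PPC result code. */
    return 0;
}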

5.5.2 MAGIC-to-OS Procedure Calls (OSPCs)

Procedure calls can proceed from MAGIC to the R10000 as well. These requests are called OS procedure calls, or OSPCs. MAGIC invokes an OSPC by issuing an interrupt to the R10000, and, to improve service latency, a special interrupt flavor is reserved for OSPC transactions.

Because the R10000 does not support “pushes” into its cache (that is, MAGIC cannot force a cache line into the R10000 cache unless the R10000 has requested the line explicitly), a different mechanism is needed to communicate the requested OSPC service and any associated data from MAGIC to the OS. The OSPC argument page is used for this purpose. This is a page in memory to which the OS does not write and does not map into any application address space. In fact, the only operation the OS performs on the OSPC argument page is to read the first cache line. In this way, MAGIC can store the requested OSPC along with any arguments in the first cache line of the OSPC argument page. The OS begins processing the OSPC interrupt by reading the first line of the OSPC argument page, thereby fetching both the requested OSPC and any arguments into the R10000 cache with the same efficiency as a normal cache miss. After reading the OSPC arguments, the OS flushes the line from the R10000 cache, preparing it for reuse in subsequent OSPCs. Once the OSPC request and arguments have been read, the OS executes the OSPC and returns a result to MAGIC using a PPC reserved for this purpose.

Table 5.2 lists some example OSPC services. As with PPCs, the set of OSPCs is not predefined. In general, most services invoked by a PPC could be performed by the OS itself on the R10000, and some services that MAGIC invokes via an OSPC could be performed on MAGIC instead. As a result, both the set of PPC/OSPC services as well as the allocation of these services to MAGIC and the OS can be tailored to the demands of the OS, application, and MAGIC protocol processing.

Table 5.2 Example OSPC services

  Service Class   Description
  Error           Report hardware errors and resource overflows
  Message         Delivery of internode kernel-to-kernel messages
  Interrupt       Delivery of I/O and interprocessor interrupts
  TLB             Request virtual-to-physical mapping for protocol use
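The following C sketch illustrates the read-flush-reply sequence on the OS side of an OSPC, as just described. The request layout, the cache-flush primitive, the service numbers, and the reply helper are hypothetical stand-ins, not the real IRIX or FLASH interfaces.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    uint32_t service;                       /* requested OSPC (hypothetical IDs) */
    uint32_t args[4];                       /* small inline arguments */
} OspcRequest;                              /* fits within the first cache line */

/* Hypothetical stand-ins for platform primitives. */
static void flush_cache_line(const volatile void *p) { (void)p; /* CACHE op */ }
static void ppc_return_result(uint32_t svc, uint32_t res)
{
    printf("reply via PPC: service %u -> result %u\n", svc, res);
}

/* The page MAGIC writes; the OS only ever reads its first cache line. */
static volatile uint8_t ospc_arg_page[4096];

static void ospc_interrupt_handler(void)
{
    OspcRequest req;

    /* Reading the first line pulls the request and its arguments into the
       R10000 cache with the cost of a single ordinary cache miss. */
    memcpy(&req, (const void *)ospc_arg_page, sizeof req);

    /* Flush the line so the next OSPC is fetched fresh from memory. */
    flush_cache_line(ospc_arg_page);

    /* Execute the requested service and return the result via a PPC. */
    uint32_t result = (req.service == 2) ? req.args[0] + 1 : 0;
    ppc_return_result(req.service, result);
}

int main(void)
{
    /* Simulate MAGIC depositing a request into the argument page. */
    OspcRequest fake = { 2, { 41, 0, 0, 0 } };
    memcpy((void *)ospc_arg_page, &fake, sizeof fake);
    ospc_interrupt_handler();
    return 0;
}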

5.6 Summary

Although development of the MAGIC architecture was guided by the goal of efficient protocol processing, the MAGIC hardware is only half of the solution. Implementing the system’s communication paradigm requires both the MAGIC chip itself as well as the protocol code to run on the protocol processor. Whereas the previous chapter focused on the MAGIC architecture and implementation, this chapter has concentrated on two cache coherence protocols: the bitvector/coarsevector protocol and the dynamic pointer allocation protocol. Both protocols share a common set of internode message flows to service cache read misses (Get operations) and write misses (GetX operations). The protocols differ in their directory formats, the mechanisms used to maintain sharing information, and, as a result, in the protocol processing overhead they incur.

The bitvector/coarsevector protocol is optimized for small systems. In its base bitvector usage, the directory format tracks sharers explicitly using a bitvector with one processor per bit. The result is simple, short protocol code sequences and correspondingly low protocol processing overhead. Bitvector is suitable only for small systems, however, because its directory memory overhead grows as the square of the processor count; starting with large systems of perhaps 100 processors, the directory memory overhead becomes prohibitive. The coarsevector modification to the bitvector protocol overcomes the directory memory overhead problem by using one bit in the sharing vector to represent multiple processors, thereby limiting the growth of the sharing bitvector. This modification allows the coarsevector protocol to be used with larger systems but results in a lack of precise sharing information, more complex protocol handlers, and a resulting performance degradation.

By using a linked list of sharers rather than a bitvector, dynamic pointer allocation is able to keep precise sharing information even for large systems. The price is a more complicated directory format, longer protocol handlers, and hence higher protocol processing overhead. Replacement hints also are needed to keep the sharing lists from growing too large because the same processor is added to the list multiple times. Overall, dynamic

Overall, dynamic pointer allocation is best suited to larger systems where the additional overhead of a more complex sharing list and the need to process replacement hints are outweighed by the performance advantages of maintaining precise sharing information.

The presence of the programmable protocol processor in MAGIC offers several advantages. First, it permits the use of a protocol tailored to the system size or even to the anticipated application mix and memory reference characteristics. Small-scale machines are an important component of the multiprocessor market, for example, and a simple, fast protocol such as bitvector is desirable to maximize performance on such systems. MAGIC’s flexibility is advantageous in this regard since it allows use of the bitvector protocol for small-scale systems and substitution of a more scalable protocol — such as the dynamic pointer allocation protocol — as the system size grows.

MAGIC’s programmability allows other services to be provided in addition to implementing the system communication paradigm. MAGIC acts much like a coprocessor in this role, providing some services that usually would be implemented by the OS on the R10000. The PPC and OSPC mechanisms were developed to allow the OS to invoke services on MAGIC and vice versa.
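The imprecision cost of the coarsevector modification summarized above can be pictured with a short sketch. The mapping below is only an illustration: the vector width, the send_invalidate helper, and the use of a coarseness factor of two (the factor Section 6.2.1 notes was used for the 64-processor runs) are assumptions for the example, not the actual FLASH handler code.

    /* Illustrative sketch of the coarsevector idea: with a coarseness factor C,
     * bit i of the sharing vector stands for the C nodes i*C .. i*C+C-1.  Only
     * the bit mapping follows the protocol description; everything else here
     * is invented for the example.                                            */
    #include <stdint.h>

    #define COARSENESS 2                  /* e.g., the factor used at 64 CPUs */

    typedef uint64_t sharing_vec_t;

    static inline void record_sharer(sharing_vec_t *vec, unsigned node)
    {
        *vec |= 1ULL << (node / COARSENESS);   /* the precise identity is lost */
    }

    extern void send_invalidate(unsigned node);    /* hypothetical helper */

    /* On a write miss, every node covered by a set bit must be invalidated,
     * even if only one of them actually holds the line; the extra messages
     * and longer handlers are the performance cost of coarsevector.          */
    static void invalidate_sharers(sharing_vec_t vec, unsigned requester)
    {
        for (unsigned bit = 0; bit < 64; bit++) {
            if (vec & (1ULL << bit)) {
                for (unsigned n = bit * COARSENESS; n < (bit + 1) * COARSENESS; n++) {
                    if (n != requester)
                        send_invalidate(n);
                }
            }
        }
    }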

Chapter 6
System Performance

What is the performance impact of MAGIC’s flexible approach to protocol processing? Chapter 3 noted that one goal of FLASH is to use a flexible node controller yet achieve performance competitive with a machine employing the same system architecture but a fully hardwired node controller. This chapter examines whether FLASH achieves this goal.

The performance analysis focuses on two areas. First, the features of the protocol processor instruction set and architecture are examined at a microarchitectural level to quantify their effect on the latency and occupancy of common protocol operations. Second, the performance of FLASH is evaluated at an application level on both parallel scientific codes optimized for a distributed shared memory system and on multiprogrammed workloads, including a full simulation of the operating system. For the scientific codes and the multiprogrammed workloads, performance is measured both in isolation — that is, whether FLASH can achieve good parallel speedups across a range of processor counts — and in comparison to a hardwired machine.

In all cases, the performance analysis deals exclusively with applications that assume a cache-coherent shared memory communication paradigm. FLASH’s support for message passing also is important, of course, but the performance of message passing on FLASH is already discussed in detail in [40, 41, 43]. In addition, because an actual FLASH prototype system is still in development at the time of this writing, the performance analysis below is based on a detailed simulation of the FLASH system rather than on data collected from actual FLASH hardware. The FLASH system-level simulator is extremely faithful to the actual hardware design, however, and thus the results reported in this chapter should match closely those that could be gathered on a real FLASH machine.

6.1 Microbenchmarks

To evaluate performance at a low level, the initial performance study measures the number of cycles required for MAGIC to process common protocol operations. This evaluation concentrates on the performance of MAGIC at a microarchitectural level rather than at a system level, and focuses in particular on two figures of merit: the latency to service a memory request and the protocol processor occupancy incurred in servicing the request.

Latency refers to the total number of cycles to service the request, starting when the request is issued on the R10000 SysAD bus and ending when the first word of data is delivered to the R10000. Latency thus measures the total time the R10000 waits for data to be returned from the memory system.

Occupancy refers to the total number of cycles for which the protocol processor is busy servicing a request. Only protocol processor time is included in the occupancy metric; cycles spent traversing the network and in other functional units on the MAGIC chip are excluded. Hence, occupancy provides a measure of the amount of protocol processing overhead a message incurs. The protocol processor also is a point of serialization in the MAGIC architecture, in that the PP can process only one message at a time and additional messages must wait until the PP completes processing the current one. Hence, occupancy also indicates how much serialization, and thus additional queuing delay, the request might cause in the servicing of later messages.

The latency and occupancy data are used to compare MAGIC’s protocol operation latencies to those of an actual hardwired node controller. The latency and occupancy information also is gathered for two additional, hypothetical MAGIC designs that use less capable protocol processor architectures, eliminating either the dual-issue capability or the bitfield instruction set extensions. A comparison of the performance of the real MAGIC design to that of the designs with less capable protocol processors quantifies the benefit of the protocol processor’s dual-issue and bitfield support.

6.1.1 Simulation Methodology

To gather exact cycle counts for servicing a protocol operation, the Verilog RTL model of the MAGIC chip was used to collect all microbenchmark performance data. Although this is technically a simulation-based approach, the same Verilog RTL used to gather microbenchmark performance data also was used to synthesize the MAGIC netlist for chip fabrication. As a result, the cycle counts measured are by definition the same as in the real hardware, allowing for no inaccuracies resulting from the use of a simulation rather than actual hardware.

6.1.1.1 Simulation Environment

The simulation environment was identical to the one used to perform directed verification on the MAGIC design. In normal operation, this environment would be used to execute a diagnostic test on the MAGIC RTL as part of the overall chip verification process. For the microbenchmarks, however, the diagnostic test was instead used to perform the microbenchmark analysis rather than chip verification.

[Figure 6.1 Microbenchmark simulation environment: the HLL shell (microbenchmark code, HLL support code, and pin interface code) drives the pins of the MAGIC Verilog RTL, which executes the user PP code.]

Figure 6.1 depicts the high-level structure of the microbenchmark simulation environment.

The environment comprises two parts: the MAGIC Verilog RTL model and a diagnostic shell called the HLL. The RTL models the entire MAGIC chip. As with the real chip, the pins of the MAGIC design form the external interface into the RTL model. The C++-based HLL shell drives the RTL model, controlling and monitoring the MAGIC pins. To use the environment, two pieces of code must be written: one to execute on the protocol processor in the MAGIC chip, and another to execute within the HLL shell. The code running within the HLL determines what transactions are issued to the MAGIC chip and monitors the MAGIC outputs to detect errors. The code running on the protocol processor operates in conjunction with the HLL code to implement the desired diagnostic tests. For the microbenchmark analysis, the protocol processor code implements an actual protocol.

6.1.1.2 Microbenchmark Protocol Operations

For the microbenchmarks, the protocol processor code was the actual bitvector protocol routines. The HLL code issued several protocol operations to the MAGIC chip, monitoring the chip outputs and responding as necessary to complete the transactions. Table 6.1 lists the specific protocol operations used in the microbenchmark analysis. These operations cover the five possible read miss (Get) cases in most cache coherence protocols and represent the bulk of protocol processing activity performed for most applications.

Table 6.1 Microbenchmark protocol operations

    Number  Read Case                          Description
    1       Local address, data clean          Read of a local address, data is clean in the node’s local memory
    2       Local address, data dirty remote   Read of a local address, data is held exclusive in a remote node’s compute processor cache
    3       Remote address, data clean         Read of a remote address, data is clean in the home node’s local memory
    4       Remote address, data dirty home    Read of a remote address, data is held exclusive in the home node’s compute processor cache
    5       Remote address, data dirty remote  Read of a remote address, data is held exclusive in a third node’s compute processor cache (three-hop Get)

6.1.1.3 Protocol Processor Variations

As described earlier, one focus of the microbenchmarks is to evaluate the effectiveness of the protocol processor architecture and instruction set. To do so, the five read miss cases were executed three times each, varying the protocol handler code running on the PP to effectively enable or disable the PP’s dual-issue and bitfield capabilities.1 In the first execution, the real protocol processor was used. That is, the handler code made use of both the PP’s dual-issue capability as well as the bitfield instruction set extensions, and the performance results from this execution reflect the actual performance of the MAGIC PP.

In the second execution, the handlers were rewritten to continue use of the dual-issue capability but to avoid using any bitfield instructions, in effect modeling a PP that lacks bitfield support in its instruction set. For this case, all bitfield instructions present in the actual handler code were replaced with an equivalent set of shift and logical instructions drawn from the standard MIPS instruction set. The results of this execution measure the benefit of the addition of bitfield support to the PP instruction set.

Finally, for the third execution, the PP handlers were recoded as if the PP were a single-issue processor with no bitfield instructions. Not only were all bitfield instructions in the original handlers replaced in the same manner as for the second execution, but the instruction stream was padded with nop instructions such that at most one non-nop instruction was executed each cycle. This single-issue, no bitfield experiment gauges the performance of using an off-the-shelf processor core for protocol processing, such as the R3000-based design considered in Chapter 3.

1. “Bitfield capabilities” include both the PP instructions to insert and extract bitfields from a register as well as the branch on bit set/clear instructions.
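The rewriting applied for the no-bitfield executions can be illustrated with a small example. The directory field layout below is invented purely for illustration; only the transformation itself, in which each bitfield or branch-on-bit instruction becomes shifts, masks, and ordinary compare-and-branch sequences, reflects what was done to the handlers.

    /* Sketch of the rewriting used for the no-bitfield executions: each PP
     * bitfield extract/insert is replaced by an equivalent sequence of shift
     * and logical operations from the base instruction set.  The 'field in
     * bits 8..15' layout is purely illustrative, not the actual bitvector
     * directory format.                                                      */
    #include <stdint.h>

    #define FIELD_SHIFT 8u
    #define FIELD_WIDTH 8u
    #define FIELD_MASK  (((1u << FIELD_WIDTH) - 1u) << FIELD_SHIFT)

    /* With bitfield support this is a single extract-field instruction;
     * without it, the equivalent is a mask followed by a shift.             */
    static inline uint32_t extract_field(uint32_t dir_word)
    {
        return (dir_word & FIELD_MASK) >> FIELD_SHIFT;
    }

    /* Likewise, a single insert-field instruction becomes clear-then-merge. */
    static inline uint32_t insert_field(uint32_t dir_word, uint32_t value)
    {
        return (dir_word & ~FIELD_MASK) | ((value << FIELD_SHIFT) & FIELD_MASK);
    }

    /* A branch-on-bit-set instruction becomes a shift, a mask, and a
     * conventional compare-and-branch on the result.                        */
    static inline int dirty_bit_set(uint32_t dir_word, unsigned dirty_bit)
    {
        return (dir_word >> dirty_bit) & 1u;
    }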

For all three executions, all protocol handlers exercised by the five read miss operations were coded and optimized by hand directly in PP assembly language; neither the dpgcc compiler nor the twine scheduler was used. Although, as Section 5.4 notes, protocol code normally is coded in C and compiled into PP assembly language, two factors led to the decision to perform manual coding and optimization for the microbenchmarks. First, manual coding eliminates the effect of compiler and scheduler quality. In practice, these tools produce excellent code, but the manually-coded handlers still manage to achieve about a 5–10% instruction count and critical path cycle count reduction versus the compiler-generated code.

Second, and more important, the dpgcc and twine tools are not easily configured to produce code for the second and third executions, that is, those avoiding use of either bitfield instructions or dual-issue capability. As a result, using dpgcc and twine to generate no-bitfield or single-issue code unfairly penalizes the performance of these executions since the quality of the resulting code is significantly lower than the code generated for the actual PP. Hence, manual coding of all handlers for all executions is desirable not only to produce the best overall performance for each execution, but to ensure that the handlers for all executions are equally well coded and optimized.

6.1.2 Latency and Occupancy Results

Figure 6.2 shows the latency results for the five read miss cases for each of the three PP variations. Each read miss case, listed on the X axis, has three associated bars, one for each PP variation as shown in the legend at the right. The latency for each case is expressed in system cycles, which is the number of 100 MHz MAGIC clocks required to complete the memory request. Because the clock rate for the R10000 processors is twice the MAGIC clock rate, the latency in processor cycles is simply twice the latency in system cycles.

To focus specifically on the MAGIC architecture, the latency data excludes all network transit and R10000 cache intervention time. That is, for cases in which a message must transit the interconnection network from one node to another node or in which data must be retrieved from an R10000 processor cache, the time to perform these activities is excluded from the latency data reported in the figure. The rationale for excluding these delays is that network transit time is a function of the router implementation and network topology, and cache intervention time is a function of the internal details of the R10000, and thus neither of these delays is related to the MAGIC architecture.

Table 6.2 lists the latency differences for each read miss case; the numbers in the read case column refer to those in Table 6.1. Overall, the actual PP provides an 8–20% latency reduction versus the dual-issue, no bitfield PP and a 12–20% latency reduction versus the single-issue, no bitfield PP.

[Figure 6.2 Microbenchmark read miss latencies: total request latency, in system cycles, for the five read miss cases (local clean; local dirty remote; remote clean; remote dirty home; remote dirty remote) under the actual PP, the dual-issue/no-bitfield variation, and the single-issue/no-bitfield variation.]

The latency advantage of the real PP is particularly crucial for the local read case (case 1), in which the lack of bitfield instructions increases the number of instructions required to test the directory state, leading to an additional five-cycle latency before the data is returned.

Table 6.2 Microbenchmark latency reductions for actual PP

    Read Case   Latency reduction versus        Latency reduction versus
                dual-issue, no bitfield PP      single-issue, no bitfield PP
    1           20.8%                           20.8%
    2           12.6%                           17.4%
    3            8.5%                           12.2%
    4           11.1%                           13.7%
    5            9.3%                           13.3%

Although the advantages of the actual PP are evident in the latency data, the occupancy data demonstrate a more dramatic improvement, as shown in Figure 6.3. Like the latency data, the occupancy data exclude cache intervention delays; network transit time is not a factor in occupancy measurements since only the time the PP spends processing the message header is measured.

For all cases except the local read to a clean line (case 1), PP occupancy is incurred more than once during the course of the operation. A remote read to a clean line (case 3), for instance, starts by invoking the PP at the requesting node to receive the initial read and forward it to the home node; then the PP at the home node is invoked to retrieve the data from memory and send it to the requester; finally, the requester’s PP is invoked once again to receive the data from the home and return it to the local R10000. In these cases, the occupancy reported is the sum of the occupancies incurred by all protocol processors involved in servicing the request.
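Written out (the notation here is introduced only for clarity), the reported occupancy for these multi-invocation cases is simply the sum over the participating protocol processors; for the remote read to a clean line, for example:

    \begin{align*}
      \mathrm{Occ}_{\mathrm{total}} &= \sum_{i \in \text{PP invocations}} \mathrm{Occ}_i, \\
      \mathrm{Occ}_{\mathrm{case\,3}} &= \mathrm{Occ}_{\mathrm{requester,\ send}}
        + \mathrm{Occ}_{\mathrm{home}}
        + \mathrm{Occ}_{\mathrm{requester,\ receive}}.
    \end{align*}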

[Figure 6.3 Microbenchmark read miss occupancies: total request occupancy, in system cycles, for the same five read miss cases and three PP variations as in Figure 6.2.]

The occupancy results show that the real PP achieves a 20–37% occupancy reduction versus the dual-issue, no bitfield variation and a 28–45% reduction versus the single-issue, no bitfield variation; Table 6.3 compares the PP variations for all five read cases. The magnitude of the occupancy reductions demonstrates the utility of the bitfield and dual-issue features, particularly since occupancy is a good indication of overall protocol processing overhead.

Table 6.3 Microbenchmark occupancy reductions for actual PP

    Read Case   Occupancy reduction versus      Occupancy reduction versus
                dual-issue, no bitfield PP      single-issue, no bitfield PP
    1           37.5%                           41.2%
    2           33.3%                           45.7%
    3           26.1%                           43.3%
    4           20.0%                           28.6%
    5           28.8%                           45.3%

6.1.3 Analysis of Latency and Occupancy Data

One difficulty with both the latency and occupancy measurements is that they do not translate directly into system performance. A system having a slower node controller with double the latencies and occupancies of MAGIC will not necessarily have worse overall system performance. By the same token, a node controller with lower latencies and occupancies will not always yield superior overall system performance. Nevertheless, the microbenchmark analysis presented above is useful in two ways: to quantify the benefit of the protocol processor dual-issue and bitfield features and as a gauge of how favorably MAGIC’s latency and occupancy figures compare to those of a system using a fully hardwired node controller.

6.1.3.1 Benefit of Dual-Issue and Bitfield Support

Adding dual-issue and bitfield capability to the protocol processor increased the design effort and added measurably to the die area occupied by the PP. The dual-issue feature increased design complexity because it required two execution pipelines rather than one. Even with static scheduling, coordinating and sequencing the two pipelines added logic complexity and verification time. The extra execution pipeline also increased die area, as did the need to add additional bitfield-specific functional units and the need to support bypassing within each execution pipeline as well as between the two pipelines. In addition, neither feature is found in a typical off-the-shelf processor core.

Consequently, the dual-issue and bitfield support must result in significantly lower latencies and occupancies if the additional design effort, die area, and inability to use an off-the-shelf core are to be justified. As Table 6.2 and Table 6.3 demonstrate, these features do result in significantly lower latencies and, especially, occupancies. The latency advantages are relatively modest (about 10%) except on the local read to a clean line. Local, clean reads often are the most frequent operation for many applications, however, and the 20% latency advantage in this case is thus extremely desirable.

The occupancy advantages of the real PP, even when compared to the dual-issue, no bitfield variation, are clear for all five read miss cases. Unfortunately, the connection between occupancy and performance is not intuitive. However, as Holt [50] and others [24, 71, 74, 78] have concluded, node controller occupancy has a substantial impact on system performance. Taken as a whole, the latency and occupancy advantages of the real PP justify the additional effort required to add dual-issue and bitfield support.

6.1.3.2 Comparison to a Hardwired Node Controller

The latency data also can be analyzed by comparing FLASH latencies to those of a real hardwired machine. Table 6.4 presents the latency data for FLASH and for the SGI Origin 2000 [51], a hardwired system. Again, network transit and R10000 cache intervention time have been excluded from these latency values. The MAGIC chip and the Origin 2000 node controller have the same target clock rate and similar ASIC implementation technologies, so the two designs faced similar implementation constraints, increasing the validity of the latency comparison.

Table 6.4 FLASH and Origin 2000 read miss latencies

    Read Case   FLASH   Origin 2000   Origin Speedup
    1             19        20            −5.0%
    2             76        45            68.9%
    3             65        41            58.5%
    4             88        48            83.3%
    5             98        59            66.1%

As the table shows, the FLASH latencies are approximately 60–80% higher than the Origin latencies, with the exception of the local read of a clean line (case 1), for which the latencies are essentially equal. However, as demonstrated by the application performance results section that follows, raw latency differences do not necessarily lead to the same difference in overall application performance since the latency differences impact performance only when a processor is stalled on the memory system. Section 6.3 investigates this issue in greater detail.

6.2 Application Simulation Methodology

The microbenchmark data presented in Section 6.1 characterize MAGIC at a low level. Ultimately, however, the specific latency and occupancy of individual read miss cases are only one component, albeit the major one, of overall system performance.

To accurately characterize the MAGIC and FLASH architecture, the performance also must be examined at a system level, using actual applications to measure the impact of MAGIC’s flexible approach to protocol processing. This and the following two sections delve into system performance in detail. This section explains the simulation methodology used for the system performance evaluation. Next, Section 6.3 presents the performance results of several scientific applications that have been written specifically for use on a distributed shared memory machine. Last, Section 6.4 examines a different aspect of FLASH performance by considering multiprogrammed workloads, including a full simulation of the operating system, which is a significant factor in the performance of these applications.

6.2.1 Parallel Scientific Application Set and Simulation Environment

All of the parallel scientific applications were drawn from the SPLASH-2 benchmark suite [110, 111] and are representative of a range of scientific computational methods. Each application is coded in C and augmented with the Argonne National Laboratory (ANL) parallel macro package [14] to provide global memory allocation, synchronization, and other parallel shared memory programming primitives. Table 6.5 lists the five applications selected from the SPLASH-2 suite, along with a brief description of what class of scientific computation each application represents, as well as the problem size used. Section 6.3 describes each application in detail.

Table 6.5 Scientific applications and problem sizes

    Application   Representative Of                   Problem Size
    Barnes        Hierarchical N-body codes           8192 particles; θ = 1.0
    FFT           Transform methods; high-radix       256K complex points; radix √N
    LU            Blocked dense linear algebra        768-by-768 matrix; 16-by-16 blocks
    Ocean         Regular-grid iterative codes        258-by-258 grids
    Radix         High-performance parallel sorting   1M/2M integer keys; radix 256

Problem size selection often is problematic for scientific applications since the codes are suitable for a variety of problem sizes, and certain undesirable machine performance traits can be unfairly masked by using overly large problem sizes [89, 96]. For this study, the problem sizes were selected to yield at least 0.60 parallel efficiency2 at 64 processors on a machine with a hardwired node controller [49, 50]. This criterion results in realistic problem sizes for both large and small processor counts and ensures that the problem sizes are tailored to a hardwired machine, thereby helping to eliminate the possibility that performance problems in the FLASH system will be masked by problem sizes that are too large.

2. For a given processor count P and an application speedup of S at that processor count, parallel efficiency is defined as (S / P). Hence, a parallel efficiency of 1.0 indicates linear speedup at the given processor count.
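As a concrete instance of the definition in the footnote, consider the 64-processor Barnes run on FLASH reported later in Table 6.8; the 0.60 criterion itself was applied to the hardwired machine, so the number here is only illustrative:

    \[
      \text{parallel efficiency} \;=\; \frac{S}{P} \;=\; \frac{48.3}{64} \;\approx\; 0.75 .
    \]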

[Figure 6.4 Scientific application simulation environment: application code compiled to the MIPS ISA drives Mipsy, the CPU/cache model for the node CPUs and their caches; Mipsy connects through a SysAD bus model to FlashLite, the node/memory/interconnect model.]

To properly simulate the SPLASH-2 applications, a simulator must support both a large range of processor counts as well as a detailed model of the memory system. Because these applications make almost no system calls and, even on a real machine, spend very little time in the operating system [26, 27, 37], a detailed model of the OS is not necessary to model SPLASH application performance accurately. The simulation environment developed to support these requirements is depicted in Figure 6.4.

The environment comprises three basic components: the Mipsy CPU simulator, the FlashLite memory system simulator, and the application code. Together, these components permit a detailed simulation of the FLASH system driven by execution of an actual application. The Mipsy CPU simulator is responsible for simulating all R10000 compute processors and their first-level and second-level caches. For an N-processor simulation, Mipsy instantiates N processor/cache models, each executing in parallel. A multithreaded simulation core underlies the Mipsy processor and cache models, allowing the concurrent operation of the N processors to be modeled accurately, including all contention and queuing effects.

Mipsy models only the processors and their local caches. To model a complete multiprocessor, a memory system simulator is needed, and this role is performed by FlashLite. As shown in the figure, FlashLite models everything in a FLASH system except for the compute processors and caches. Hence, FlashLite models the MAGIC chip, the local memory on each node board, and the interconnection network that connects the nodes.

Like Mipsy, FlashLite supports N-node simulations using a multithreaded core to allow modeling of the concurrent operation of N node boards and the routers in the interconnection network. FlashLite is designed specifically to model FLASH, and contains extremely detailed models of the Spider routers as well as the internal structure and logic in the MAGIC chip. The protocol processor is modeled in great detail, and the same protocol code compiled for use on an actual FLASH machine is used with FlashLite. As a result, both the system hardware characteristics as well as the protocol code are modeled faithfully and the simulated performance of the FLASH system should match closely the actual performance of a real FLASH machine.

Mipsy and FlashLite together model an entire FLASH system. All that is needed is an application to drive the simulation system. As part of its processor model, Mipsy contains a MIPS instruction set emulator. Thus, as the figure shows, an application can be compiled and linked exactly as if it were going to be run on a real FLASH machine. Instead, however, the application executable image is supplied as input to Mipsy. Mipsy emulates the application instructions and tracks the state of the processor caches. When a second-level cache miss is detected, Mipsy issues the proper request to FlashLite — the interface between Mipsy and FlashLite is a model of the R10000 SysAD bus. FlashLite in turn models the memory system operations necessary to satisfy the memory request. After the operation completes, FlashLite signals the completion and returns the associated data to Mipsy, satisfying the cache miss. The multithreaded core underlying both Mipsy and FlashLite tracks the number of cycles required for instruction emulation and memory system activity, allowing accurate performance data to be collected.

For the scientific application results presented in Section 6.3, Mipsy was configured to model a 1 MB R10000 secondary cache and to assume that the R10000 could issue as many as three instructions per system cycle. That is, Mipsy effectively models a single-issue processor running at 300 MHz, or three times the MAGIC (system) clock rate. In addition, all the applications were simulated using the bitvector/coarsevector protocol described in Section 5.2. For processor counts under 48, the bitvector variation was employed; for the 64-CPU runs, coarsevector was used with a coarseness factor of two.
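The Mipsy-to-FlashLite handshake just described can be pictured with a short sketch. This is a hypothetical rendering of the coupling, not the actual simulator interfaces; the type names, function names, and line size are invented for the example.

    /* Hypothetical sketch of the Mipsy/FlashLite coupling: on a second-level
     * cache miss, the CPU model hands a SysAD-level request to the memory
     * system model and waits for a completion callback.  None of these names
     * are the real simulator APIs.                                           */
    #include <stdint.h>

    #define LINE_BYTES 128                      /* assumed line size           */

    typedef struct sysad_request {
        uint64_t paddr;                         /* physical address of the miss */
        int      cpu;                           /* issuing processor            */
        void   (*complete)(struct sysad_request *req, const void *line);
    } sysad_request_t;

    /* Stand-in for FlashLite: a real memory-system model would simulate the
     * MAGIC units, protocol handlers, and network, charging the appropriate
     * number of system cycles before replying.                               */
    static void flashlite_issue(sysad_request_t *req)
    {
        static uint8_t dummy_line[LINE_BYTES];  /* placeholder fill data       */
        req->complete(req, dummy_line);
    }

    /* Stand-in for the Mipsy side: install the line and resume the CPU.      */
    static void on_l2_fill(sysad_request_t *req, const void *line)
    {
        (void)req; (void)line;
    }

    /* Called by the CPU/cache model when the second-level cache misses.      */
    void mipsy_l2_miss(int cpu, uint64_t paddr)
    {
        sysad_request_t req = { .paddr = paddr, .cpu = cpu, .complete = on_l2_fill };
        flashlite_issue(&req);                  /* CPU thread stalls until fill */
    }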

6.2.2 Multiprogrammed Workload Set and Simulation Environment

Scientific applications represent one realm of usage for a multiprocessor. Also important is performance on multiprogrammed workloads in which the primary emphasis is on using the multiprocessor’s ability to efficiently share resources such as processors, memory, and I/O devices. Whereas the parallel applications focus on applying a large number of processors concurrently to speed the solution of a single job, the focus in the multiprogrammed workloads is on throughput.

That is, many individual jobs are executed and the goal is to complete them all as quickly as possible.

The multiprogrammed workload set comprises two workloads, summarized in Table 6.6. The table lists the name of each workload, what type of computation it represents, and an overview of the input data characteristics. Section 6.4 describes the workloads in detail and presents performance results. In general, the workloads differ from the scientific applications in that they are multiprogrammed rather than large-scale parallel, and in that they are not tuned for a distributed shared memory system. The simulation also includes the operating system.

Table 6.6 Multiprogrammed workload set

    Workload      Representative Of          Problem Size
    PMake         OS-intensive compilation   8-way make of gnuchess; total of 9426 lines of C
    Engineering   Large working sets         Combination of FlashLite and VCS

Because the multiprogrammed workloads consist of multiple jobs executing concurrently, their execution requires the ability to spawn multiple independent processes across the machine, schedule the processes, and manage their interaction with each other and with any I/O devices they access. These requirements differ substantially from the needs of the scientific applications, for which, in general, only the ability to execute the same code on multiple CPUs in parallel is required. In particular, the scientific applications performed no I/O and assumed that they were the only application executing on the system; no process scheduling was used. A consequence of the more demanding requirements of the multiprogrammed workloads is that the operating system no longer can be ignored. The applications make many system calls and rely on the OS to perform numerous services on their behalf. To generate accurate performance data for these applications, then, both the application code as well as the OS must be simulated.

Simulation of the operating system is a challenging problem. Although most of the OS code can be run in user mode, the portions of the OS that use privileged instructions and assume direct access to the underlying system hardware must be executed in supervisor mode. As a result, simulation of the OS requires not only a mechanism for emulating the functionality of the privileged instructions but also a complete model of the system hardware. The SimOS framework [86, 88] achieves both these goals and allows simulation of a complete machine, including actual application code, the actual, unmodified OS code, and realistic models of system hardware devices such as disk drives, ethernet controllers, and the like.

[Figure 6.5 SimOS-based multiprogrammed workload simulation environment: the application code and the IRIX 5.3 operating system execute on SimOS + Mipsy, which issues memory system requests to FlashLite.]

Figure 6.5 shows how the SimOS framework is integrated with the Mipsy and FlashLite components to provide a comprehensive simulation environment for the multiprogrammed workloads.

The SimOS environment is very similar to the scientific application simulation environment described in Section 6.2.1. The major change is the addition of the SimOS component and the actual operating system image. Mipsy still models the compute processor and its caches, but rather than emulating just the application code, Mipsy now emulates both the operating system code and application code. Privileged instructions and direct access to hardware devices are trapped in Mipsy and passed to SimOS for emulation. Second-level cache misses are issued to FlashLite just as in the scientific application environment, and the same multithreaded core underlies all components and allows accurate tracking of cycle counts, latencies, contention, and other performance information.

From FlashLite’s perspective, the SimOS environment is no different from the scientific application environment, except that execution of the operating system causes a greater variety of processor references to be issued. With SimOS, for example, FlashLite will receive not only normal cache read and write misses, but uncached operations as well. Some uncached operations are actually PIO operations targeted at the I/O subsystem. As a result, use of SimOS requires a complete protocol, including handlers for I/O operations.

As with the scientific application results, for the multiprogrammed workloads Mipsy was configured to model a 1 MB R10000 secondary cache and to assume that the R10000 could issue as many as three instructions per system cycle. For the multiprogrammed workloads, however, the dynamic pointer allocation protocol was used rather than bitvector/coarsevector.

6.2.3 Three Simulated Machines

For both the scientific codes and the multiprogrammed workloads, the goal is to evaluate FLASH system performance in isolation and compared to a machine that retains the basic FLASH system architecture but replaces the PP with a fully-hardwired protocol engine. To do so, the FlashLite simulator was configured to model three different machines: FLASH, the ideal machine, and the hardwired machine.

In the FLASH configuration, FlashLite models the actual FLASH system as closely as possible, including both the internal functional units in MAGIC and the routers that form the interconnection network. Actual protocol code is used as well, so that the same protocol handlers that would be executed on the real FLASH system also are used in FlashLite’s model of FLASH.

The ideal machine retains the same system architecture as the FLASH machine — that is, the same internal MAGIC architecture and interconnection network — but replaces the real protocol processor with an idealized version that executes all protocol handlers in zero time. In other words, when a MAGIC chip on the ideal machine receives a message, the message is processed by the source interface unit and inbox in the same way as in the real FLASH system. Once the message header reaches the PP, however, the associated handler executes in zero cycles. This means that the directory state is read, tested, and updated, and any required outgoing messages are composed and sent, all with no delay. Both the MAGIC instruction and data caches are assumed to be perfect (100% hit rate), and the only delays the ideal PP incurs are those to perform interventions in the local R10000 compute processor cache. In addition, the latencies of the various interface units, as well as the inbox and outbox, are reduced for the ideal machine. The rationale behind the ideal machine is not to model a realizable system; obviously, a zero-delay PP could not be built. Rather, the goal is to set an upper bound on the performance achievable by the basic FLASH and MAGIC architecture while intentionally ignoring the effects of protocol processing overhead.

Like the ideal machine, the hardwired machine retains the same FLASH system and MAGIC architecture. But, instead of replacing the PP with a zero-delay model, the hardwired machine’s PP has occupancies based on those of the SGI Origin 2000 node controller (the “Hub” chip), an actual, commercially available hardwired system. That is, when the hardwired machine’s PP services a message, the regular protocol code is executed, but the handler execution cycle count is assumed to be the same as the SGI Hub chip requires to service the same message. In most cases, the Hub chip requires two cycles of occupancy for each message header.

More complex messages, in particular those that require invalidate requests to be sent, incur extra occupancy beyond the two-cycle base.

The goal of the hardwired machine is not to model the performance of the Origin 2000. The Origin 2000 has a different system architecture and different node controller architecture, and the performance results of the hardwired machine should not be expected, and are not intended, to represent the performance of the Origin 2000. Instead, the goal is to model a machine that retains the FLASH and MAGIC architecture but uses a hardwired protocol processor. This combination provides a reference for how much performance loss the actual FLASH system incurs in using a flexible, programmable protocol processor. By using the occupancies of the Hub chip, additional simulation realism and accuracy are achieved because the hardwired machine’s occupancies are based on a real hardwired node controller design that faced design constraints comparable to the MAGIC chip — the same target clock rate and similar ASIC implementation technology, for example.
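One way to picture how such a hardwired configuration might charge occupancy in the simulator is sketched below. The two-cycle base comes from the description above; the per-invalidation increment, and the assumption that the extra cost scales with the number of invalidations, are placeholders for illustration, not published Origin 2000 figures or actual FlashLite code.

    /* Illustrative sketch of occupancy charging in the hardwired-machine
     * configuration: the protocol handler still runs (so directory state and
     * outgoing messages are produced normally), but the cycles charged follow
     * a Hub-like cost model.  EXTRA_CYCLES_PER_INVAL and the linear scaling
     * are assumptions made only for this example.                            */

    #define HUB_BASE_OCCUPANCY      2   /* cycles per message header          */
    #define EXTRA_CYCLES_PER_INVAL  1   /* assumed placeholder value          */

    struct handler_result {
        int invalidations_sent;         /* produced by the real protocol code */
    };

    static int hardwired_occupancy(const struct handler_result *r)
    {
        int occ = HUB_BASE_OCCUPANCY;
        if (r->invalidations_sent > 0)
            occ += r->invalidations_sent * EXTRA_CYCLES_PER_INVAL;
        return occ;
    }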

6.3 Parallel Scientific Application Performance

Three performance metrics are of interest for the scientific applications: self-relative speedup, FLASH-relative speedup, and the relative contribution of protocol processor latency and occupancy. Self-relative speedup refers to a speedup calculation that compares the execution time of a given machine (FLASH, hardwired, or ideal) at a processor count P to the execution time of the same machine for a uniprocessor run. Thus, self-relative speedup is a measure of whether the machine is scalable; that is, whether it can deliver good speedups across a range of processor counts.

FLASH-relative speedup is similar to self-relative speedup, except that FLASH-relative speedup compares the execution time of a given machine at a processor count P to the FLASH execution time at the same processor count. Thus, FLASH-relative speedup measures, for a particular processor count, how much faster the hardwired or ideal machine is compared to FLASH.

The relative contribution of latency and occupancy refers to how much performance loss can be attributed to FLASH’s programmable protocol processor versus the hardwired machine’s hardwired PP. Not only is the absolute execution time difference between FLASH and the hardwired machine of interest, so too are the relative contributions of the additional PP latency and occupancy. To differentiate latency effects from occupancy effects, a measure called the contentionless read miss latency (CRML) is used. CRML indicates how many system cycles each of the three machines requires to service the five read miss cases noted in Table 6.1.

The CRML statistics for the three machines for a system size of 32 CPUs — the size used for the CRML analysis in the sections that follow — are shown in Table 6.7.

Table 6.7 Contentionless read miss latencies, system cycles

                                           32 Processors (a)
    Read Case                             FLASH   Hardwired   Ideal
    Local, data clean (case 1)              21        21        18
    Local, data dirty remote (case 2)      117        96        85
    Remote, data clean at home (case 3)     94        89        82
    Remote, data dirty at home (case 4)    134        97        86
    Remote, data dirty remote (case 5)     159       117       114

    (a) Latencies include, where appropriate, R10000 cache intervention time and an average network transit delay of 31 cycles (average of 3.1 routers traversed in a 32-CPU, 4-D hypercube).

Although the CRML is listed for all three machines, the most meaningful comparison is between the FLASH and hardwired machines; the ideal machine is meant to set an upper bound on performance rather than to be representative of a real system. By multiplying the actual read miss distribution observed in an application run by the CRML cycle counts, an effective CRML can be calculated for the FLASH and hardwired machines. If the ratio of the two effective CRML values then is multiplied by the fraction of time the hardwired machine is stalled in the memory system, the result indicates how much the latency difference would increase this stall time for FLASH. Any additional execution time difference between FLASH and the hardwired machine is due to occupancy effects. In this way, the CRML, read miss distribution, and memory system stall time can be used to separate the performance difference between the FLASH and hardwired machines into latency and occupancy components.
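Written out explicitly (the notation below is introduced here; the calculation itself is described only in prose above), the decomposition is roughly:

    \begin{align*}
      \overline{L}_m &= \sum_{i=1}^{5} f_i \,\mathrm{CRML}_{m,i},
        \qquad m \in \{\mathrm{FLASH},\ \mathrm{hardwired}\}, \\
      \Delta T_{\mathrm{latency}} &\approx T_{\mathrm{hw}}\, s_{\mathrm{hw}}
        \left( \frac{\overline{L}_{\mathrm{FLASH}}}{\overline{L}_{\mathrm{hw}}} - 1 \right), \\
      \Delta T_{\mathrm{occupancy}} &\approx
        \left( T_{\mathrm{FLASH}} - T_{\mathrm{hw}} \right) - \Delta T_{\mathrm{latency}},
    \end{align*}

where f_i is the observed fraction of read misses falling into case i, T_FLASH and T_hw are the execution times of the two machines, and s_hw is the fraction of time the hardwired machine spends stalled in the memory system.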

The five sections that follow cover each scientific application. Each section begins with a brief description of the application and then reports its self-relative speedup, FLASH-relative speedup, and several other performance metrics for the FLASH machine. The CRML analysis presented above is performed as well. The application descriptions are intentionally brief; a full discussion of each application can be found in Woo [109, 110].

6.3.1 Barnes

Barnes is an N-body simulator for modeling the interaction among a system of particles in three-dimensional space over a series of time steps. The application uses the Barnes-Hut hierarchical N-body algorithm to compute the forces among the particles [48]. The three-dimensional particle system is represented using an octree data structure. Each time step involves traversing the octree and computing the forces among the particles, a process that leads to unstructured communication, typically of small data items such as the octree leaves used to store information about the particles.

The top of Figure 6.6 shows the self-relative speedups for all three machines. The number of processors, shown on the X axis, varies from 1 to 64 and the self-relative speedup at each processor count is shown on the Y axis. Another way to view the speedup data is by computing the FLASH-relative speedups, which are shown at the bottom of Figure 6.6. As is the case for all the scientific applications, both the self-relative and FLASH-relative speedup graphs demonstrate that FLASH achieves competitive speedups in isolation and competitive performance when compared to the hardwired machine. Only at 64 processors does FLASH’s performance begin to lag the hardwired machine’s; at 64 CPUs, the hardwired machine is approximately 10% faster than FLASH on Barnes.

Table 6.8 lists a number of important execution characteristics of Barnes on FLASH. The table is divided into four sections that focus on the performance of the application as a whole and on its impact on the protocol processor. The meaning of the fields in the leftmost column is as follows:

• Speedup — Self-relative speedup versus the uniprocessor execution time.
• Total execution time — Expressed in millions of system cycles.
• Processor utilization — Fraction of time the compute processors are busy, averaged across all processors.
• Synchronization stall time — Fraction of time the compute processors are stalled waiting for synchronization operations to complete.
• Cache reads, cache writes — Fraction of compute processor cache accesses that were reads and writes.
• Cache miss rate — Miss rate of the compute processor’s secondary cache, referenced to all processor loads and stores, not just to those that miss in the primary cache.
• Reads, local clean through Reads, remote, dirty remote — Division of all cache read misses into the five cases described in Table 6.1.
• Average read miss latency, Average write miss latency — Number of system cycles a read or write took to complete, averaged across all processors.

[Figure 6.6 Barnes self-relative (top) and FLASH-relative (bottom) speedups: self-relative speedup versus processor count (1 to 64) for the ideal, hardwired, and FLASH machines against linear speedup, and FLASH-relative speedup for the same machines at each processor count.]

Table 6.8 Barnes execution characteristics on FLASH

                                                  Processor Count
    Execution Characteristic         1        2        4        8       16       32        64
    Speedup                        1.0      2.0      4.0      7.6     15.5     29.3      48.3
    Total exec. time (millions) 368.28   185.20    92.74    48.37    23.77    12.59      7.63
    Processor util. (%)           99.4     98.6     98.5     94.5     96.3     90.9      74.6
    Sync. stall (%)                0.0      0.1      0.4      2.1      2.3      6.1      19.5
    Cache reads (%)               67.4     67.4     67.5     68.1     68.1     68.8      71.7
    Cache writes (%)              32.6     32.6     32.5     31.9     31.9     31.2      28.3
    Cache miss rate (%)           0.01     0.02     0.01     0.02     0.01     0.01      0.02
    Rds, lcl, clean (%)          100.0     72.9     47.7     15.9      7.8      5.6       4.7
    Rds, lcl, d. rem (%)           0.0      0.1      0.2      0.2      1.0      0.7       0.8
    Rds, rem, clean (%)            0.0     25.5     41.7     77.0     62.6     78.8      83.1
    Rds, rem, d. home (%)          0.0      1.5     10.1      6.5     25.7     12.9      10.4
    Rds, rem, d. rem (%)           0.0      0.0      0.2      0.4      3.4      2.4       1.9
    Avg. read miss lat.             26       37       54       72       98      101       155
    Avg. write miss lat.            29       30       36       55       76       91       181
    Avg./Max. PP util. (%)     0.6/0.6  0.7/0.8  0.7/0.8  1.2/2.4  1.0/1.4  1.5/4.4  2.8/10.6
    Dynamic dual-issue eff.       1.38     1.42     1.40     1.42     1.38     1.38      1.30
    Special instr. fraction       0.49     0.51     0.50     0.53     0.50     0.50      0.43
    MIC miss rate (%)             0.00     0.00     0.00     0.00     0.00     0.00      0.10
    MDC miss rate (%)             0.10     0.00     0.00     0.00     0.00     0.00      0.00

• Average/Maximum PP utilization — Average and maximum fraction of cycles for which a PP was busy processing a message header.
• Dynamic dual-issue efficiency — Average number of non-nop instructions the PP executes per busy cycle. Perfect use of the PP’s dual-issue capability would yield a dynamic dual-issue efficiency of 2.
• Special instruction fraction — Dynamic fraction of all non-nop PP instructions that are bitfield, branch on bit, or functional unit interface instructions.
• MIC miss rate, MDC miss rate — Miss rate of the MAGIC instruction and data caches, respectively.

The R10000 cache miss rate explains why FLASH yields competitive performance despite the latency disadvantages noted in Table 6.7. Because the miss rate is very low, the program spends very little time in the memory system. As a result, the larger latencies of the FLASH machine do not lead to much difference in overall execution time; essentially, the latency difference is masked by the minimal amount of time spent in the memory system. This trend pervades the scientific application performance results.

At 64 CPUs, FLASH’s performance does fall off compared to the hardwired machine’s. Two factors contribute to FLASH’s relatively worse performance. First, synchronization time is no longer negligible (as it was for 32 and fewer CPUs) and synchronization across the 64 processors is handled faster by the hardwired machine. Second, the coarsevector protocol must be used at 64 CPUs. The additional protocol processing and message overhead of coarsevector versus bitvector affects FLASH’s performance more than the performance of the hardwired machine. Section 6.3.3 explores this point further.

A CRML-based analysis, as discussed at the start of Section 6.3, can be employed to separate the performance degradation of FLASH versus the hardwired machine into latency and occupancy components. This computation makes most sense to perform at 32 CPUs since the performance of the two machines is almost identical at smaller CPU counts, and at 64 CPUs other effects begin to dominate.3 Table 6.9 summarizes the 32-processor CRML results.

Table 6.9 Barnes 32-processor CRML calculation

    Characteristic                        FLASH    Hardwired
    Memory stall time (%)                       2.09
    Execution time (sys. cycles ×10^6)    12.59        12.37
    Effective CRML (sys. cycles)          97.05        87.60
    Hardwired speedup (%)                  1.76          N/A
    Latency component (%)                 100.0          N/A
    Occupancy component (%)                0.00          N/A

The CRML calculation indicates that all of the execution time difference between FLASH and the hardwired machine can be attributed to latency effects. This result is explained by the unstructured communication patterns and lack of prefetching in Barnes, which expose the additional latency of the system, and by the overall good processor cache hit rate, which means that hot-spotting is not likely to occur, thus minimizing the impact of occupancy.

3. At 64 processors, the CRML results are either similar or show more of an impact from occupancy effects, a consequence of the increase in synchronization stall time at 64 CPUs. Contention at the home nodes of the synchronization variables exacerbates the occupancy effects as well.

6.3.2 FFT

The FFT kernel performs a one-dimensional complex fast Fourier transform, using the six-step radix-√n algorithm described in Bailey [9]. Two √n-by-√n matrices are used, one containing the data to be transformed and another containing the roots of unity. Communication among the processing nodes in FFT occurs during each of three matrix transpose steps. Unlike in Barnes, communication in FFT is very structured and requires all-to-all communication rather than the more random pattern observed in Barnes. In addition, the communication activity is staggered to minimize contention in the interconnection network and the matrix transpose operations are blocked to reduce communication and maximize processor cache hit rate. In all, FFT is extremely well optimized for a distributed shared memory environment.

Figure 6.7 shows the self-relative and FLASH-relative performance for FFT. Both prefetched and non-prefetched versions of the algorithm were simulated; the prefetched version achieves approximately 10% better overall performance and would be used in practice, so only prefetched FFT results are shown in the figure. Table 6.10 lists the FFT execution characteristics on FLASH. Continuing the theme observed with Barnes, the FFT results demonstrate that FLASH achieves good speedup for all processor counts and overall performance competitive with the hardwired machine at all but 64 processors. As with Barnes, the low processor cache miss rates mean that little time is spent in the memory system, helping to minimize the execution time difference. At 64 CPUs, both the synchronization overhead and the impact of the transition to the coarsevector protocol affect FLASH’s performance.

Table 6.11 lists the 32-processor CRML results. To emphasize the effect of prefetching on the CRML results, the table shows the CRML calculation for both prefetched and non-prefetched executions. The prefetched CRML result indicates that most of FLASH’s performance loss is due to occupancy effects whereas the non-prefetched result indicates that most of the performance loss is due to latency effects. This confirms the ability of the prefetch operations to hide memory system latency, at least in FFT, and highlights that occupancy becomes the critical factor once latency is hidden. With prefetches in use, not only is the overall performance of both machines better, but the performance difference between FLASH and the hardwired machine becomes a result of occupancy rather than latency.

[Figure 6.7 FFT self-relative (top) and FLASH-relative (bottom) speedups: self-relative speedup versus processor count for the ideal, hardwired, and FLASH machines against linear speedup, and FLASH-relative speedup at each processor count.]

Table 6.10 FFT execution characteristics on FLASH

                                                  Processor Count
    Execution Characteristic         1        2         4         8        16        32        64
    Speedup                        1.0      2.0       4.1       8.3      16.8      32.7      50.4
    Total exec. time (millions)  70.34    34.72     17.12      8.51      4.20      2.15      1.39
    Processor util. (%)           92.6     94.3      95.6      96.3      97.8      97.0      88.7
    Sync. stall (%)                0.0      0.0       0.4       1.4       1.4       2.3       8.7
    Cache reads (%)               59.2     59.0      59.1      59.4      59.3      59.6      62.0
    Cache writes (%)              40.8     40.9      40.8      40.6      40.6      40.4      37.9
    Cache miss rate (%)           0.29     0.29      0.25      0.20      0.16      0.15      0.15
    Rds, lcl, clean (%)          100.0     74.9      57.6      37.6      13.7       2.6       2.2
    Rds, lcl, d. rem (%)           0.0      0.0       0.0       0.1       0.1       0.3       0.6
    Rds, rem, clean (%)            0.0     24.3      35.2      30.2      30.5      33.6      34.0
    Rds, rem, d. home (%)          0.0      0.8       7.2      32.2      55.7      63.4      63.5
    Rds, rem, d. rem (%)           0.0      0.0       0.0       0.0       0.0       0.1       0.1
    Avg. read miss lat.             26       26        27        28        35        88       203
    Avg. write miss lat.            33       40        44        51        53        49        97
    Avg./Max. PP util. (%)     9.1/9.1 11.8/11.8 12.7/12.9 13.0/13.4 13.1/13.3 13.3/13.8 22.1/23.7
    Dynamic dual-issue eff.       1.40     1.40      1.40      1.42      1.42      1.42      1.30
    Special instr. fraction       0.48     0.49      0.49      0.49      0.49      0.49      0.41
    MIC miss rate (%)             0.00     0.00      0.00      0.00      0.00      0.00      0.00
    MDC miss rate (%)             0.00     0.00      0.00      0.00      0.00      0.00      0.00

Table 6.11 FFT 32-processor CRML calculation

                                              Prefetched              Non-Prefetched
    Characteristic                        FLASH    Hardwired        FLASH    Hardwired
    Memory stall time (%)                      0.52                     13.14
    Execution time (sys. cycles ×10^6)     2.15        2.00          2.38        2.16
    Effective CRML (sys. cycles)         117.61       92.37        117.62       92.35
    Hardwired speedup (%)                  7.55         N/A         10.26         N/A
    Latency component (%)                   8.7         N/A         100.0         N/A
    Occupancy component (%)                91.3         N/A           0.0         N/A

6.3.3 FFT Performance with the Dynamic Pointer Allocation Protocol

One difference between the Barnes and FFT results is the magnitude of the performance gap between FLASH and the hardwired machine at 64 processors. In the discussion above, this difference was attributed to synchronization overhead and the switch to a coarsevector protocol rather than bitvector. Both machines suffer this increased protocol processing load at 64 processors with the coarsevector protocol but, overall, the FLASH machine handles the additional protocol processing overhead less effectively, giving rise to the increased performance gap.

However, the same flexibility that causes the performance gap can be used to reclaim some of the performance loss, a practical demonstration of one advantage of flexibility. Specifically, if the dynamic pointer allocation protocol is substituted for the coarsevector protocol, FLASH’s performance at 64 CPUs, shown in Figure 6.8, is significantly more competitive. At 64 CPUs, the hardwired machine is approximately 15% faster than FLASH versus almost 30% faster with coarsevector; performance at lower CPU counts — 32, for example — is worse, supporting the relative comparison of the bitvector and dynamic pointer allocation protocols from Chapter 5. The superior characteristics of dynamic pointer allocation at larger processor counts, as discussed in Section 5.3, lead to the improved performance.

6.3.4 LU

Like FFT, LU is a kernel. It factors an N-by-N dense matrix into the product of a lower triangular matrix and an upper triangular matrix. The matrix factoring is blocked to make more effective use of the processor caches and to reduce the overall amount of data communicated among the processing nodes. Figure 6.9 shows the speedups for LU and Table 6.12 shows the execution characteristics on FLASH. As with FFT, both prefetched and non-prefetched versions of LU were simulated, but only the prefetched results are included here.

The results for LU are almost identical for all three machines. The low processor cache miss rates, along with the majority of memory references to clean lines — a case for which the protocol is optimized — both help FLASH to maintain competitive performance. In addition, with processor utilizations of 97–99% across all processor counts, very little time is spent stalled on the memory system. This observation is supported by the CRML data, presented in Table 6.13, which show that overall memory system stall time is low and that the small performance difference between FLASH and the hardwired machine is due to latency effects rather than occupancy.

[Figure 6.8 FFT speedups with dynamic pointer allocation protocol: self-relative speedup versus processor count for the ideal, hardwired, and FLASH machines against linear speedup, and FLASH-relative speedup at each processor count.]

[Figure 6.9 LU self-relative (top) and FLASH-relative (bottom) speedups: self-relative speedup versus processor count for the ideal, hardwired, and FLASH machines against linear speedup, and FLASH-relative speedup at each processor count.]

Table 6.12 LU execution characteristics on FLASH

                                                  Processor Count
    Execution Characteristic         1        2        4        8       16       32       64
    Speedup                        1.0      2.0      3.9      7.7     15.1     28.2     50.9
    Total exec. time (millions) 802.41   407.36   205.35   103.66    53.04    28.42    15.76
    Processor util. (%)           97.7     97.8     98.5     99.3     99.6     99.3     98.1
    Sync. stall (%)                0.0      0.0      0.0      0.0      0.0      0.1      0.7
    Cache reads (%)               67.8     68.1     68.5     69.0     69.5     70.7     72.2
    Cache writes (%)              32.1     31.8     31.5     31.0     30.4     29.3     27.8
    Cache miss rate (%)           0.06     0.06     0.04     0.02     0.01     0.01     0.02
    Rds, lcl, clean (%)          100.0     94.5     88.5     63.1     26.4     16.4     11.8
    Rds, lcl, d. rem (%)           0.0      0.0      0.0      0.0      0.0      0.0      0.0
    Rds, rem, clean (%)            0.0      2.8      3.6     21.2     50.8     67.8     76.2
    Rds, rem, d. home (%)          0.0      2.7      7.9     15.6     22.7     15.7     11.8
    Rds, rem, d. rem (%)           0.0      0.0      0.0      0.0      0.1      0.1      0.2
    Avg. read miss lat.             26       27       27       31       45       63      123
    Avg. write miss lat.            35       35       35       36       38       40       54
    Avg./Max. PP util. (%)     2.8/2.8  2.9/3.0  2.3/2.6  1.2/1.3  0.8/0.8  0.9/1.2  1.5/2.8
    Dynamic dual-issue eff.       1.32     1.34     1.34     1.38     1.40     1.42     1.38
    Special instr. fraction       0.51     0.51     0.51     0.51     0.52     0.54     0.47
    MIC miss rate (%)             0.00     0.00     0.00     0.00     0.00     0.00     0.00
    MDC miss rate (%)             0.10     0.00     0.00     0.00     0.00     0.00     0.00

Table 6.13 LU 32-processor CRML calculation

Characteristic                          FLASH    Hardwired
Memory stall time (%)                    0.37
Execution time (sys. cycles x 10^6)     28.42        28.34
Effective CRML (sys. cycles)            88.42        79.19
Hardwired speedup (%)                    0.29          N/A
Latency component (%)                   100.0          N/A
Occupancy component (%)                  0.00          N/A

6.3.5 Ocean

Ocean is a complete application that models large-scale ocean movements based on eddy and boundary currents. It partitions the ocean into a series of grids and then applies a red-black Gauss-Seidel multigrid solver [15]. The grid locations are represented using a 4-D array, which improves data locality and processor cache hit rate. Like the other scientific applications, the Ocean code is well tuned for a DSM system. Both prefetched and non-prefetched versions of the code were simulated; the prefetched results are presented here.

Figure 6.10 presents the speedup data for Ocean, and Table 6.14 shows the corresponding execution characteristics on FLASH. Particularly on FLASH, Ocean’s performance degrades rapidly at processor counts above 32. All three machines achieve superlinear self-relative speedup (as a result of the increase in the aggregate system processor cache size as more processors are added) through 32 CPUs, but the speedups fall off rapidly thereafter.

The execution characteristics for FLASH help explain why FLASH’s performance worsens. First, the processor cache miss rate is higher, leading to more memory requests. At 64 CPUs, the average protocol processor occupancy is 20% with a maximum of 33%, an indication of the high protocol processing load and some degree of hot-spotting. The large synchronization stall time also indicates hot-spotting. As the 64-CPU FFT results demonstrated, this extra protocol processing load is handled more effectively by the hardwired machine, and FLASH’s performance therefore becomes worse in comparison.

The same effect is observed in the CRML calculation, shown in Table 6.15. Again, even at 32 CPUs, both memory stall and synchronization stall time are fairly high and protocol processing overhead is significant. FLASH’s flexibility hurts performance under these conditions and the execution difference between FLASH and the hardwired machine is a result of latency effects. The latency-dominated CRML results indicate that the high percentage of synchronization stall time is not leading to significant hot-spotting, at least at 32 CPUs. This is not unexpected since Ocean uses tree-based barriers, which help to avoid contention on a single barrier lock at the expense of slightly longer barrier join and release latencies.

Even with the tree-based barriers, synchronization stall time is large at 32 CPUs and, especially, at 64 CPUs. One possible means to reduce the synchronization stall time is to implement a synchronization protocol on MAGIC, thereby eliminating much of the overhead of the R10000-based load-linked/store-conditional primitives. This topic is beyond the scope of this analysis, but is investigated in Heinlein [43].
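For reference, the sketch below shows a sense-reversing combining-tree barrier of the general kind Ocean uses; it is written with C11 atomics, and the fanout, node layout, and single release flag are illustrative assumptions rather than the SPLASH-2 code. It shows why contention is spread out (each arrival decrements a small tree node instead of one central counter) and why join and release latencies grow slightly, since the last arrival must walk up the tree before the release flag flips.

```c
/* Illustrative sense-reversing combining-tree barrier (assumed parameters). */
#include <stdatomic.h>
#include <stdbool.h>

#define P      64                       /* processors (assumption) */
#define FANOUT  4                       /* tree arity (assumption) */

typedef struct {
    atomic_int count;                   /* arrivals still outstanding here   */
    int        total;                   /* children of this node + its owner */
    int        parent;                  /* parent node index, -1 at the root */
} tree_node_t;

static tree_node_t node[P];             /* node p is owned by processor p    */
static atomic_bool release_sense;       /* flipped by the root every episode */
static _Thread_local bool my_sense = true;

void tree_barrier_init(void)
{
    for (int p = 0; p < P; p++) {
        node[p].parent = (p == 0) ? -1 : (p - 1) / FANOUT;
        node[p].total  = 1;             /* the owning processor itself */
    }
    for (int p = 1; p < P; p++)
        node[(p - 1) / FANOUT].total++; /* plus one arrival per child  */
    for (int p = 0; p < P; p++)
        atomic_init(&node[p].count, node[p].total);
}

/* Arrivals are combined at small tree nodes instead of one central counter;
 * the price is that the last arrival must walk up the tree before release. */
void tree_barrier(int p)
{
    int n = p;
    while (atomic_fetch_sub(&node[n].count, 1) == 1) {  /* last arrival at node n      */
        atomic_store(&node[n].count, node[n].total);    /* re-arm for the next episode */
        if (node[n].parent < 0) {                       /* root: release everyone      */
            atomic_store(&release_sense, my_sense);
            my_sense = !my_sense;
            return;
        }
        n = node[n].parent;                             /* report the subtree upward   */
    }
    while (atomic_load(&release_sense) != my_sense)     /* spin on a read-shared flag  */
        ;
    my_sense = !my_sense;
}
```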


Figure 6.10 Ocean self-relative (top) and FLASH-relative (bottom) speedups

Table 6.14 Ocean execution characteristics on FLASH

                                           Processor Count
Execution Characteristic             1          2          4          8         16         32         64
Speedup                            1.0        2.3        4.7        9.4       18.6       33.4       40.1
Total exec. time (millions)     201.77      86.56      42.65      21.36      10.84       6.04       5.03
Processor util. (%)               72.4       85.9       87.1       85.2       82.4       72.2       51.8
Sync. stall (%)                    0.0        2.5        3.7        9.1       12.6       21.1       41.7
Cache reads (%)                   80.0       80.4       80.6       81.4       81.9       83.0       87.2
Cache writes (%)                  20.0       19.6       19.4       18.6       18.1       17.0       12.8
Cache miss rate (%)               0.99       0.54       0.46       0.29       0.17       0.12       0.16
Rds, lcl, clean (%)              100.0       98.8       89.4       76.5       31.3        8.2        4.3
Rds, lcl, d. rem (%)               0.0        0.0        0.1        0.4        1.1        2.3        2.6
Rds, rem, clean (%)                0.0        1.0        4.6        9.2       21.1       31.2       27.1
Rds, rem, d. home (%)              0.0        0.2        5.9       13.9       46.5       58.1       66.9
Rds, rem, d. rem (%)               0.0        0.0        0.0        0.1        0.1        0.3        0.3
Avg. read miss lat.                 29         27         30         36         72        159        219
Avg. write miss lat.                33         32         33         32         42         43         59
Avg./Max. PP util. (%)       20.7/20.7  16.6/16.6  17.1/17.2  12.8/13.2  12.1/15.7  10.8/14.1  19.9/33.2
Dynamic dual-issue eff.           1.46       1.38       1.36       1.38       1.40       1.40       1.28
Special instr. fraction           0.48       0.49       0.48       0.47       0.46       0.47       0.39
MIC miss rate (%)                 0.00       0.00       0.00       0.00       0.00       0.00       0.00
MDC miss rate (%)                 0.00       0.00       0.00       0.00       0.00       0.00       0.00

Table 6.15 Ocean 32-processor CRML calculation

Characteristic                          FLASH    Hardwired
Memory stall time (%)                    6.46
Execution time (sys. cycles x 10^6)      6.04         5.62
Effective CRML (sys. cycles)           112.05        88.53
Hardwired speedup (%)                    7.34          N/A
Latency component (%)                   100.0          N/A
Occupancy component (%)                  0.00          N/A

6.3.6 Radix

Radix is the final SPLASH-2 application. It performs an integer radix sort on a contiguous data array using the method described in Blelloch [11], and hence is more properly categorized as a parallel general-purpose, rather than a parallel scientific, application. During each sort iteration, each processor sorts a portion of the array into a local histogram. At the end of the iteration, the local histograms are combined into a global histogram in preparation for the next sorting iteration. The local-to-global histogram conversion is performed using a series of writes — based on the local and global histogram information, each processor writes its local data into the proper place in the global data array. This process leads to communication patterns that are unstructured and bursty, which in turn leads to reduced performance at higher processor counts.

Figure 6.11 shows the speedups for Radix and Table 6.16 shows the execution characteristics on FLASH. Both prefetched and non-prefetched versions of Radix were simulated, but only the prefetched results are included here. As the data demonstrate, speedup for all three machines is poor above 32 CPUs. The algorithm itself and the bursty, unstructured writes both limit performance at larger processor counts, especially since most writes target data that is held dirty in a remote processor’s cache. The effect of the communication pattern is visible in the CRML results as well, which are shown in Table 6.17. Bursty, unstructured writes result in a FLASH–hardwired machine performance gap due almost entirely to occupancy effects caused by contention in the memory system.

Since sorting is such a common operation, Radix is an interesting and useful application to evaluate. Ultimately, however, the radix sort algorithm is not well suited for a multiprocessor, regardless of the system’s memory latencies. Even the ideal machine barely achieves the desired 0.6 parallel efficiency at 64 processors and, though not shown in the speedup graphs, simulation of a 128-processor run indicates that Radix achieves minimal performance improvement above 64 processors. In this regard Radix is unlike the other scientific applications, which continue to achieve good speedups at 128 CPUs.
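To make the local/global histogram structure concrete, here is a hedged, sequentialized sketch of one pass of a generic histogram-based radix sort. The key count, radix, processor count, and array names are assumptions, not the SPLASH-2 code, and the per-processor loops written here as ordinary loops run on separate processors in the real kernel, which is what turns phase 3 into the bursty, unstructured remote writes described above.

```c
/* One pass of a histogram-based radix sort (illustrative, sequentialized sketch). */
#include <string.h>

#define NPROC   4            /* processors (assumption)   */
#define NKEYS   (1 << 16)    /* keys to sort (assumption) */
#define RADIX   256          /* digits per pass (assumption) */

static unsigned src[NKEYS], dst[NKEYS];

static void radix_pass(int shift)
{
    static unsigned local_hist[NPROC][RADIX];   /* one histogram per processor        */
    static unsigned global_off[NPROC][RADIX];   /* where each proc writes each digit  */
    unsigned chunk = NKEYS / NPROC;

    /* Phase 1: each processor histograms the digits of its chunk of keys. */
    memset(local_hist, 0, sizeof local_hist);
    for (int p = 0; p < NPROC; p++)
        for (unsigned i = p * chunk; i < (p + 1) * chunk; i++)
            local_hist[p][(src[i] >> shift) % RADIX]++;

    /* Phase 2: combine local histograms into global write offsets.  Keys with
     * digit d from processor p land after all smaller digits and after the
     * digit-d keys of lower-numbered processors (keeping the sort stable). */
    unsigned base = 0;
    for (unsigned d = 0; d < RADIX; d++)
        for (int p = 0; p < NPROC; p++) {
            global_off[p][d] = base;
            base += local_hist[p][d];
        }

    /* Phase 3: the permutation writes into the global array, which in the
     * parallel kernel become the bursty, unstructured remote writes. */
    for (int p = 0; p < NPROC; p++)
        for (unsigned i = p * chunk; i < (p + 1) * chunk; i++)
            dst[global_off[p][(src[i] >> shift) % RADIX]++] = src[i];
}
```

A complete sort repeats the pass for each digit position (shift = 0, 8, 16, 24 for 32-bit keys and a radix of 256), exchanging the roles of src and dst between passes.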


Figure 6.11 Radix self-relative (top) and FLASH-relative (bottom) speedups

Table 6.16 Radix execution characteristics on FLASH

                                           Processor Count
Execution Characteristic             1          2          4          8         16         32         64
Speedup                            1.0        2.0        3.9        7.4       14.2       25.2       35.8
Total exec. time (millions)      49.27      25.19      12.75       6.65       3.48       1.96       1.38
Processor util. (%)               93.0       92.9       95.9       96.9       96.5       93.4       82.0
Sync. stall (%)                    0.0        0.1        0.7        2.1        2.7        5.7       16.3
Cache reads (%)                   81.5       81.5       81.5       81.7       81.7       82.1       83.8
Cache writes (%)                  18.5       18.4       18.4       18.2       18.2       17.9       16.1
Cache miss rate (%)               0.29       0.28       0.22       0.17       0.17       0.18       0.18
Rds, lcl, clean (%)              100.0       98.8       90.6       53.3       19.6        8.2        3.3
Rds, lcl, d. rem (%)               0.0        0.9        8.2       43.4       74.3       81.1       77.5
Rds, rem, clean (%)                0.0        0.2        1.0        2.3        3.5        5.4        9.7
Rds, rem, d. home (%)              0.0        0.0        0.2        0.8        1.8        3.6        6.5
Rds, rem, d. rem (%)               0.0        0.0        0.0        0.2        0.7        1.8        3.5
Avg. read miss lat.                 26         26         28         37         57        103        166
Avg. write miss lat.                30         54         69         88        105        116        141
Avg./Max. PP util. (%)       10.5/10.5  11.8/11.8  12.1/12.4  14.3/15.1  17.5/18.0  18.9/20.2  22.4/27.2
Dynamic dual-issue eff.           1.44       1.44       1.42       1.42       1.42       1.42       1.36
Special instr. fraction           0.48       0.49       0.49       0.49       0.48       0.48       0.43
MIC miss rate (%)                 0.00       0.00       0.00       0.00       0.00       0.00       0.00
MDC miss rate (%)                 0.20       0.20       0.20       0.20       0.20       0.20       0.20

Table 6.17 Radix 32-processor CRML calculation

Characteristic                          FLASH    Hardwired
Memory stall time (%)                    0.85
Execution time (sys. cycles x 10^6)      1.96         1.78
Effective CRML (sys. cycles)           109.29        92.06
Hardwired speedup (%)                    9.77          N/A
Latency component (%)                    10.3          N/A
Occupancy component (%)                  89.7          N/A

6.4 Multiprogrammed Workload Performance

Whereas the scientific application performance analysis focused on evaluating the three machines across a wide range of processor counts and used speedup as the primary performance metric, the multiprogrammed workloads focus instead on throughput. The primary performance metric still is execution time, but execution time now refers to the total time to complete a set of jobs, each executing on one processor. Self-relative speedup, therefore, is not a sensible metric for the multiprogrammed workloads. Instead, only FLASH-relative speedup is considered. As noted in Section 6.2.2, two multiprogrammed workloads were studied, PMake and Engineering, and the simulation included the operating system as well as the application code.

PMake performs a parallel make of a portion of the gnuchess sources. An 8-way make is used for all processor counts. Engineering simulates the execution of multiple instantiations of FlashLite and the Chronologics VCS Verilog simulator. Unlike PMake, the Engineering workload varies with the processor count. Table 6.18 lists the exact combination of FlashLite and VCS jobs used at each processor count.

Table 6.18 Engineering workload parameters

Processors    FlashLite Jobs    VCS Jobs
    1                1              1
    2                2              2
    4                4              4
    8                6              6

The performance of the two multiprogrammed workloads reveals a very different realm of system operation, one in which memory contention and hot-spotting lead to significant performance loss, especially when FLASH is compared to the hardwired machine. Although both the PMake and Engineering workloads were simulated, the results for both workloads are very similar, so only the Engineering results are presented here. Figure 6.12 shows the FLASH-relative speedups for the Engineering workload and Table 6.19 lists the execution characteristics on FLASH. All Engineering runs use a zero-latency disk model; a realistic disk model masks some important performance characteristics because all three machines spin in the I/O idle loop (waiting for disk operations to complete) at the same speed.

The speedup figure shows that the hardwired machine is 60% faster than FLASH at 8 CPUs. This is a significantly different result from those observed with the scientific


Figure 6.12 Engineering FLASH-relative speedups (zero-latency disk model)

Table 6.19 Engineering execution characteristics on FLASH

                                        Processor Count
Execution Characteristic              1          2          4          8
Total exec. time (millions)       70.25      83.46     117.37     137.96
Processor util. (%)                81.2       75.1       61.5       65.9
Cache reads (%)                    60.5       61.7       65.9       80.3
Cache writes (%)                   39.5       38.3       34.1       19.7
Cache miss rate (%)                0.45       0.46       0.42       0.24
Rds, lcl, clean (%)               100.0       67.0       32.0       20.6
Rds, lcl, d. rem (%)                0.0        0.5        0.3        0.7
Rds, rem, clean (%)                 0.0       32.0       66.5       75.6
Rds, rem, d. home (%)               0.0        0.5        0.5        0.7
Rds, rem, d. rem (%)                0.0        0.0        0.7        2.5
Avg. read miss lat.                  31         49        103        138
Avg. write miss lat.                 34         54         90        117
Avg./Max. PP util. (%)        16.7/16.7  20.7/35.5  26.4/75.1  23.9/75.3
Dynamic dual-issue eff.            1.10       1.20       1.30       1.34
Special instr. fraction            0.51       0.52       0.45       0.41
MIC miss rate (%)                  0.00       0.10       0.10       0.10
MDC miss rate (%)                  0.60       1.40       0.70       0.60

applications in which the difference between FLASH and the hardwired machine was less than 10% for 16 or fewer processors.

A closer analysis of the memory references reveals that the marked execution time difference is caused by tremendous hot-spotting on a single IRIX kernel lock called memory_lock. This lock protects almost the entire virtual memory subsystem, and must be acquired whenever a process is created or destroyed, or when a process allocates memory or otherwise causes a change to its virtual memory mappings. As a result, this lock is highly contended, leading to substantial hot-spotting and protocol processor occupancy at memory_lock’s home node. In fact, PP occupancy is over 75% for the 8-CPU run according to Table 6.19, an extremely high value. Because of the lock contention, as well as the OS’s use of the load-linked/store-conditional primitives to perform the lock/unlock operations, over 50% of the requests for the lock are negatively acknowledged, further increasing the protocol processing load. And, whenever the lock is released via a store conditional, a slew of invalidate requests must be sent.

Overall, these factors combine to yield a mean memory_lock acquire time on FLASH of about 2800 system cycles versus just over 700 cycles for the hardwired machine. As was the case for the coarsevector 64-CPU FFT performance, the additional protocol processing load caused by memory_lock is managed more effectively by the hardwired machine. On FLASH, the additional processing load causes a substantial decrease in system performance, an indication of how severe hot-spotting greatly exacerbates the effect of FLASH’s additional protocol processing overhead. Further investigation showed that the protocol processor would have to run at four times its normal clock rate for FLASH to yield the same performance as the hardwired machine. As noted above, the PMake workload exhibits the same performance trends and the same hot-spotting on memory_lock.
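The sketch below illustrates the lock behavior just described. A C11 compare-and-swap stands in for the R10000 load-linked/store-conditional sequence, and the code is an illustration rather than the IRIX memory_lock implementation; the names and structure are assumptions. It shows why a heavily contended lock hammers its home node: every spinner caches the lock line, most acquire attempts fail under contention (showing up on FLASH as negatively acknowledged requests), and each release must invalidate every spinner's copy.

```c
/* Illustrative spin lock; a C11 CAS stands in for the R10000 LL/SC pair. */
#include <stdatomic.h>

typedef struct { atomic_int held; } spinlock_t;

void lock(spinlock_t *l)
{
    for (;;) {
        /* Spin on a cached copy of the line; every waiter becomes a sharer,
         * so the directory's sharer set for the lock grows with contention. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        /* LL/SC-style attempt (CAS here); under heavy contention most attempts
         * fail, which on FLASH shows up as NAKed requests at the home node. */
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(&l->held, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
    }
}

void unlock(spinlock_t *l)
{
    /* The release write takes the line exclusive, so the protocol must
     * invalidate every spinner's copy: the burst of invalidate requests
     * noted in the text. */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```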

6.5 Summary: Performance Robustness

The performance of FLASH on the Engineering workload highlights an important point about the performance cost of flexibility: although FLASH can achieve performance competitive with the hardwired machine, its performance is not as robust. That is, for applications that are well-tuned for a distributed shared memory system — as all the scientific applications were — FLASH yields competitive performance across a wide range of processor counts, demonstrating the scalability and effectiveness of the FLASH system and MAGIC architecture. These applications exhibit good load balancing, minimal memory hot-spotting, and often specify explicit data placements to further improve performance. With such well-behaved applications, the additional latency and occupancy of MAGIC do not translate into a significant overall application performance difference.

When the application characteristics are not so well-behaved, however, the performance of FLASH — that is, of a system using a flexible node controller — falls off much more quickly than the performance of the hardwired machine. Figure 6.13 illustrates this point by plotting the relative execution times of the three machines for the 8-CPU PMake and 32-CPU FFT applications. The Y axis shows the FLASH-relative execution time while the X axis shows the relative speed of the FLASH protocol processor. Varying the PP speed effectively increases or decreases the amount of protocol processing overhead, thereby modeling how much overhead the flexible node controller adds. A 4x PP, for instance, adds so little protocol processing overhead that its performance is almost identical to the hardwired machine’s.

[Figure 6.13 plots: FLASH-relative execution time versus FLASH PP speed for PMake (8 CPUs, left) and prefetched FFT (32 CPUs, right), with curves for the FLASH, Hardwired, and Ideal machines.]

Figure 6.13 FLASH-relative performance versus PP speed

The curves in Figure 6.13 contain both good and bad news for FLASH. The bad news is that FLASH’s location at the knee of the curve means that when applications (or the OS) depart from the well-behaved memory reference and load-balancing characteristics exhibited by the scientific applications, FLASH’s performance will degrade more suddenly than the hardwired machine’s. Clearly, both machines require well-tuned applications to maximize system performance, and the performance of both FLASH and the hardwired machine will be reduced by poor application behavior. However, FLASH’s performance will decrease more quickly.

The good news is that the execution time curves have a pronounced knee, and the 1x protocol processor yields performance at the knee of the curve. This indicates that all the effort spent on optimizing the MAGIC architecture for protocol processing and all the additional hardware and architectural features that were added as a result were of significant benefit. In fact, these features are what allow FLASH to achieve competitive performance. In effect, these features moved FLASH to the knee of the performance curve, thereby yielding a system and node controller architecture that can achieve competitive performance yet still employ a flexible approach to protocol processing. This result also indicates that a protocol engine based on an off-the-shelf processor core, without the architectural specialization and hardware support present in MAGIC, would have unacceptably poor performance.

FLASH cannot as a rule achieve protocol processing overheads as low as those of a machine using a hardwired node controller, and FLASH’s higher overheads lead to a lack of robustness in overall system performance. However, FLASH’s approach to flexible protocol processing, embodied in the MAGIC architecture, does allow it to achieve competitive performance. Identifying the protocol processing bottlenecks and addressing them with hardware and architectural additions to the protocol processor were crucial in reaching this goal.

Chapter 7 Related Work

A number of multiprocessor systems share with FLASH the goal of delivering scalable performance across a range of applications and processor counts. This chapter compares and contrasts some of these machines to FLASH. With the rising popularity of multiprocessors and the tremendous amount of research activity in the field, a comprehensive discussion of related system designs is beyond the scope of this chapter. Instead, the focus is on distributed systems that share a basic system architecture consisting of a collection of node boards connected via an interconnection network. Both academic and commercial designs are discussed, helping to illustrate the different design tradeoffs appropriate for research and commercial systems.

To stress the importance of the communication paradigm in the design of a multiprocessor, the related systems are categorized as clustered systems, private memory systems, non-coherent shared memory systems, and cache-coherent shared memory systems. A more precise description of each category is found at the start of the corresponding section below.

7.1 Clustered Systems

Clustered systems are those in which the “node boards” are more properly described as workstations. That is, the basic architecture of a clustered system is to use a high-performance network to connect a collection of workstations into a cohesive multiprocessor. Of course, a group of workstations connected by Ethernet is technically a multiprocessor by this definition; the key aspects of a clustered system are the use of a networking technology with significantly higher performance than the ubiquitous Ethernet and the presence of additional software designed to facilitate inter-node communication and the creation and coordination of parallel application processes.

The appeal of clustered systems is their cost effectiveness. By leveraging commercial uniprocessor or small-scale bus-based multiprocessor workstations as processing nodes, a clustered system can realize all the economies of scale associated with the production of the workstations. Often, a similar approach is taken for the interconnection network as well by utilizing a commercially-available local area network technology rather than a network designed specifically for the clustered system. In both cases, not only is the cost of the system minimized, but so too is the design time; off-the-shelf node and interconnection network components mean that the design task consists mainly of system integration and software development.

Along with the cost and design time advantages, however, comes a corresponding performance disadvantage. Because neither the nodes nor, often, the interconnect is designed specifically for distributed multiprocessing, some compromises in overall system performance are inevitable. In addition, the coupling between the processors and the network typically is achieved by attaching a network interface adapter to the system I/O bus, increasing latency and further reducing performance.

7.1.1 IBM SP2

An SP2 system [3, 99] consists of a collection of rack-mounted IBM RS/6000 workstations connected via a custom-developed multistage interconnection network based on the Vulcan switch [102]. A network interface card plugs into each workstation’s MicroChannel I/O bus to provide a connection to the network. The Vulcan switch chips support a raw network bandwidth of 40 MB/s, significantly higher than 10 Mbps or even 100 Mbps Ethernet, and comparable to the network bandwidth available in systems such as DASH that were developed at approximately the same time. Hence, although the SP2 nodes are workstations, the system differs from many clustered systems (such as the NOW system discussed below) in that it uses a custom network designed specifically as a multiprocessor interconnect.

The network adapter card contains a number of components to facilitate DMA transfers between the network and a node’s local memory. Among these components are a dedicated memory and switch management ASIC (MSMU) that provides the physical, link, and network layer interface to the network and an Intel i860 processor that performs protocol processing activities such as initiating and controlling DMA transfers. The SP2 supports only the message passing communication paradigm. Block transfers are managed and implemented cooperatively by the compute processors, DMA engines, and i860 protocol processors at both the source and destination nodes.

With a processor in its network adapter, the SP2, like FLASH, provides a programmable means for controlling inter-node communication; in fact, many of the changes between the SP1 network adapter and the SP2 version were to address performance bottlenecks similar to those identified during the development of the MAGIC architecture. Ultimately, however, the location of the SP2 network adapter on the MicroChannel I/O bus limits its performance and leads to increased latency to communicate between the local memory and the network. This is a problem common to most clustered systems since the design of the workstation-based processing node cannot be altered. As a result, the performance of message passing may be acceptable for large block transfers in which the additional latency of initializing and managing the transfer can be amortized across many data bytes, but the performance of finer-grained communication will be significantly lower than on a system, such as FLASH, that more closely couples the network adapter to the compute processor and local memory.

7.1.2 Princeton SHRIMP

Like the SP2, SHRIMP [12, 30] nodes are built from commodity uniprocessors; SHRIMP uses Pentium-based personal computers as the processing nodes. A custom-designed network adapter card connected to the node’s system bus provides a connection to the interconnection network. Also like the SP2, the initial SHRIMP system uses a custom-designed network, specifically, the same interconnection network as the Intel Paragon multiprocessor. Since this network is optimized for multiprocessor use, its performance is considerably higher than off-the-shelf LAN solutions.

SHRIMP implements a message passing communication paradigm that strives for low send initiation cost. Low-cost sends are provided by hardware on the network adapter1 which snoops on all processor memory writes and determines if the write should be propagated to another node. If so, the write data is copied from the bus and forwarded to the proper destination node. The user, via a system call, establishes the mappings between local memory and destination nodes, and this information is stored in the network adapter. Actual sends, therefore, require no special activity in the user’s code — the network adapter will snoop on the system bus and transfer data to remote nodes according to the previously-established mappings. A variety of data transport policies allows the user to optimize the message transfers, such as by specifying that data be accumulated in the network adapter before transmission, thereby improving network utilization.

Although the SHRIMP message passing system obtains significant latency reductions to send data, it requires that all writes be visible on the system bus (write-through caching must be used) and supports only limited mappings between local and remote memory. FLASH’s message passing implementation uses a similar technique, namely a decoupling of message transfer from the initialization of the local and remote memory mappings, but in a more general framework. By using a programmable engine, FLASH can support a SHRIMP-style message passing functionality while also supporting cache-coherent shared memory and additional forms of bulk data transfer.

1. The SHRIMP team has developed two network adapters, SHRIMP-I and SHRIMP-II. The discussion above applies to both designs.
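The sketch below is a minimal illustration of the SHRIMP-style automatic-update idea described above. The table layout, names, and the send_update() helper are assumptions made for illustration; in the real adapter the mapping table is loaded by a system call and this lookup happens in hardware while snooping the memory bus.

```c
/* Illustrative outgoing-mapping table and snoop path (assumed names and layout). */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT   12
#define MAX_MAPPINGS 256

typedef struct {
    uintptr_t local_page;     /* page whose writes should be propagated */
    uint16_t  dest_node;      /* node that receives the update          */
    uintptr_t remote_page;    /* corresponding page on that node        */
    int       valid;
} outgoing_map_t;

static outgoing_map_t map_table[MAX_MAPPINGS];   /* established ahead of time */

/* Stand-in for the adapter's network transfer of one updated word. */
static void send_update(uint16_t node, uintptr_t remote_addr, uint64_t data)
{
    (void)node; (void)remote_addr; (void)data;   /* network injection omitted */
}

/* Conceptually invoked for every write the adapter snoops on the bus: if the
 * page is mapped, the written data is forwarded, so an ordinary user-level
 * store doubles as a message send with no explicit send call. */
void snoop_write(uintptr_t addr, uint64_t data)
{
    uintptr_t page   = addr >> PAGE_SHIFT;
    uintptr_t offset = addr & (((uintptr_t)1 << PAGE_SHIFT) - 1);

    for (size_t i = 0; i < MAX_MAPPINGS; i++)
        if (map_table[i].valid && map_table[i].local_page == page) {
            send_update(map_table[i].dest_node,
                        (map_table[i].remote_page << PAGE_SHIFT) | offset, data);
            return;
        }
}
```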

7.1.3 Network of Workstations (NOW)

As proposed by a group at the University of California at Berkeley, a NOW system comprises a collection of off-the-shelf workstations connected by an off-the-shelf LAN networking technology such as ATM [6, 25]. By using only commodity components, the cost of a NOW system is expected to be less than custom-designed systems with comparable peak aggregate performance. In addition to its cost advantages, the NOW proposal also is based on the increasing performance of commodity networking components. Just as the rising performance of off-the-shelf microprocessors has led to their dominance in virtually all areas of computing, the NOW proponents argue that a similar transition will occur in networks.

Currently, however, the performance of truly commodity networking technology is too low to serve as effective multiprocessor interconnect. The Berkeley NOW prototype system uses a 160 MB/s Myrinet network developed by Myricom [13] to connect a collection of Sun Ultra 170 workstations. Certainly the Sun workstations are commodity components. But, even though Myricom offers its networking technology commercially, Myrinet is not at present a commodity networking solution. Consequently, the NOW prototype is best viewed as a hybrid machine that combines commodity processing nodes with a specialized network.

The NOW prototype supports the message passing communication paradigm and provides both MPI-based [106] bulk data transfer facilities and support for active messages [105]. The active messaging facility achieves user-to-user latency of approximately 10 µs [25], about four times larger than FLASH can achieve for similar operations [42]. Thus, even with the high-speed network, NOW latencies are fairly high and best performance on the NOW system is achieved for applications that perform minimal communication. For example, the system achieves good speedups on LU and on a block tridiagonal (BT) solver, but neither of these applications is communication intensive; LU hardly communicates at all (see Table 6.12), and the BT solver is designed specifically to maximize communication granularity and minimize the number of messages sent.

Overall, the NOW solution has undeniable economic advantages. From a performance perspective, however, the latency and bandwidth of current commodity networks and their lack of close coupling to the compute nodes are insufficient to support fine-grained communication and thus limit the types of applications that will run well on a NOW system.

Many of the parallel applications that performed well on FLASH, for example, are unlikely to achieve comparable performance on a NOW machine.

7.2 Private Memory Systems

Like clustered systems, private memory systems provide a message passing communication paradigm. Although these systems are similar to clustered systems, they are characterized by processing nodes and interconnect technology designed specifically for use in a multiprocessor. The result is closer coupling between nodes and the network and the promise of correspondingly higher performance.

7.2.1 Intel Paragon

The Paragon [53] is perhaps the most commercially successful private memory system. Each Paragon node contains two Intel i860 processors2 along with a network interface chip and local memory, all connected to a common shared bus. One processor acts as the computation processor and the other acts as the communication controller. Messages arriving at the node cause a handler to be dispatched on the communication processor, which processes the message header and arranges for any necessary DMA of message data into the local memory. Message sends by the compute processor also invoke a handler on the communication processor, which composes the message header, forwards it to the network interface chip, and arranges for any message data to be transferred.

With separate communication and computation processors, the Paragon node naturally allows overlap of communication and computation, a key advantage of message passing based communication. Several message passing protocols are provided, including the widely-used MPI protocol [106]. No form of shared memory is implemented — all communication among the processing nodes is requested explicitly.

The Paragon design is similar to FLASH in that it uses a programmable engine for protocol processing. However, the use of a shared bus means that the coupling between the various node components is not particularly close, and the i860 processor is not specialized to protocol processing. As a result, end-to-end message delivery latencies are close to 10 µs. Some of this latency can be ascribed to inefficiencies in the message passing library software; even so, the Paragon’s performance is likely to be poor for codes requiring fine-grained communication, and a Paragon system is unlikely to be capable of providing any reasonable support for shared memory, two goals the FLASH system is designed to address. These were not among the goals the Paragon system was designed to address, however. Instead, the design is optimized for codes characterized by coarse-grained communication of relatively large amounts of data. For this type of communication, the Paragon design will work well. The large amount of data accompanying a typical message allows the protocol processing overhead to be amortized effectively, helping to alleviate the performance loss caused by the use of a general-purpose CPU for protocol processing and the lack of close coupling among the compute processor, communication processor, and network.

2. Later designs use Intel Pentium Pro processors.

7.2.2 MIT J-Machine

Unlike the Paragon, which provides separate compute and communication processors and routers, the J-Machine [80, 100] integrates computation, communication, and routing functionality into a single chip, called the Message-Driven Processor (MDP). The MDP contains a processor core along with memory-management, routing, network interface, and DRAM controller modules to yield almost a complete multiprocessor node on a single chip. All that is necessary to complete the node, in addition to the MDP itself, is local DRAM.

By integrating functions that in most multiprocessor designs are provided by several chips, the MDP provides extremely tight coupling between the compute processor and the network. In essence, the MDP takes the need to provide close coupling among the node components to an extreme — not only are the compute and communication processors one and the same, but the network interface is integrated almost directly into the MDP processor core’s datapath. Messages can be sent directly into the network from the MDP memory or register file, and messages received from the network cause a thread to be created and dispatched at the destination node, the message data to be transferred into the local node’s memory, and a pointer to the message data to be stored in a processor register. Such close coupling recognizes that the essence of multiprocessing is communication — the more efficiently multiprocessor nodes can communicate data and synchronization information, the better overall system performance will be. J-Machine nodes communicate using message passing primitives, although the MDP memory management hardware supports global shared memory as well. Because of the MDP integration, message latencies are reduced considerably versus comparable systems such as the Thinking Machines CM-5.

The MDP design of using the same processor to perform both communication (protocol) processing as well as computation is interesting, and leads naturally to the tight coupling among the node board components described above. From a performance perspective, however, the benefits are less clear. First, protocol processing represents a significant processing load in itself; requiring the same processor to perform both protocol processing and computation necessarily slows computational performance and makes overlapping communication and computation more difficult. Second, best computational performance is achieved only by using commercial microprocessors. The MDP architecture precludes the use of off-the-shelf processors since such processors do not permit the network and protocol processing features to be integrated on the same chip.

Nevertheless, the J-Machine and MDP goals of close coupling between the processor and network are extremely desirable and demonstrate the utility and message latency reduction that can be achieved. One recently announced commercial microprocessor — the Sun UltraSPARC III — already integrates a memory controller. Future microprocessors surely will integrate network functionality as well, perhaps permitting a J-Machine style system to be constructed from commercial microprocessor technology. A follow-on to the J-Machine, the M-Machine [32], has similar architectural goals, though with a considerably more sophisticated processor core.

7.3 Non-Cache-Coherent Shared Memory Systems

Shared memory systems, as the name implies, implement a shared memory communication paradigm. These systems typically are designed as multiprocessors from the start, with processing nodes and interconnection networks designed specifically for operation as a multiprocessor. Inter-processor communication latency usually is shorter than in the private memory and clustered systems discussed above, an important component of supporting shared memory efficiently. Some shared memory systems also support cache coherence. This section covers machines that do not provide cache coherence; Section 7.4 presents cache-coherent systems.

7.3.1 Cray T3D and T3E

Both the T3D and T3E systems [7, 94] are shared memory machines targeted at the large-scale scientific application market. Almost all Cray systems have been available in multiprocessor configurations, but the T3D and its successor the T3E mark the first use of a large-scale distributed NUMA architecture based on commodity microprocessors. The T3D and T3E both use DEC Alpha processors, the 21064 in the T3D and the 21164 in the T3E. To perform well in the vector supercomputer arena, these systems must provide support for a large amount of physical memory and, equally important, large bandwidth to memory.

On the T3D, the limitations of the 21064 made achieving these goals a challenge. First, the 21064 supports only 34 bits of physical address space. Second, it supports only one outstanding cache miss. Third, the T3D node contains no second-level cache; the only cache is the 8 KB first-level cache on the 21064 die. These limitations led Cray to provide a “shell” of support hardware around the 21064 to extend its addressing and bandwidth capabilities. The shell comprises several components, including the DTB Annex, which maps 34-bit physical addresses from the processor into a 64-bit global address, effectively extending the physical address space. The shell also provides a set of global fetch-and-increment registers, a prefetch FIFO for supporting prefetching and pipelining of remote memory fetches, a DMA-based block transfer engine, and a dedicated barrier network. The combination of the shell features yields a rich set of mechanisms for supporting a very large physical memory as well as extensive prefetching and block transfer support to maximize remote memory fetch bandwidth. A custom-designed 3-D torus network connects the dual-processor nodes.

Unfortunately, actual use of the T3D showed that the shell provided too much functionality. The remote memory prefetch and fast network both were used extensively. The design of the DTB Annex, however, proved too inflexible to be exploited efficiently by the compiler, and the dedicated barrier network, while used, did not yield significant application performance improvement and incurred a high implementation cost. The high startup cost for the block transfer engine limited its use as well.

From these observations, the T3E design emerged. The T3E uses a faster processor, the 21164, and continues to surround it with a shell of support hardware. The T3E shell is simpler yet more flexible. It replaced the DTB Annex and block transfer engine with the more flexible E-registers and centrifuge to support remote memory access pipelining and scatter/gather. The barrier network is eliminated entirely. Like the T3D, the T3E contains no board-level cache; the only caches are the 8 KB first-level and 96 KB second-level caches on the 21164.

Neither the T3D nor T3E provides cache-coherent shared memory; an underlying design assumption is that the scientific applications meant to run well on the machines can do so without cache coherence or by managing coherence at a user level within the application. While FLASH shares the goal of providing a shared memory environment, FLASH focuses on hardware-based cache-coherent shared memory. The ease of programming in a cache-coherent system, especially the ability to incrementally tune applications developed for a bus-based multiprocessor, argues strongly for supporting cache coherence, especially if effective support for general-purpose computation is desired. FLASH also seeks to provide the functionality of the T3D and T3E support shell using the programmable engine in MAGIC rather than dedicated, fixed-function components.

7.3.2 Tera

The Tera system [4, 5] contains processing nodes, I/O nodes, and memory nodes connected via a high-performance multistage interconnection network. Unlike all other systems discussed in this chapter, the Tera machine does not use off-the-shelf microprocessors. Instead, it uses a custom-developed multithreaded processor. Multithreading allows a single processor to execute many different threads of control and helps to hide memory latency by switching from a thread waiting for data to return from memory to a thread that is ready for execution. In this way, memory access latency is essentially an extension of the CPU pipeline and if a sufficient number of threads is available, memory latency can be hidden entirely with no loss of aggregate performance.

Unfortunately, the Tera system needs many threads to hide memory latency effectively. With an estimated memory latency of 70 processor cycles and no processor caches, each CPU must be assigned at least 70 threads to allow memory latency to be hidden. A feature called explicit dependence lookahead helps reduce the number of threads required by allowing independent instructions (including loads) from the same thread to be executed before previous instructions from the same thread complete. Even with an average of four independent instructions between dependent ones, however, about 20 threads per CPU are needed. In a 64-processor system, then, 1280 threads are needed. While this degree of parallelism is present in some applications, depending on so much parallelism is risky, especially since system performance will degrade rapidly as the degree of parallelism decreases.

Comparing FLASH to Tera is difficult beyond the common support for shared memory. Rather than relying on multithreading and a large amount of parallelism, FLASH relies on the processor caches and memory latency hiding techniques such as prefetching, non-blocking loads, and multiple outstanding cache misses to maintain good processor utilization in the face of high remote memory access latencies. Tera’s decision not to use an off-the-shelf processor is also questionable. Successful operation of a Tera system was announced late in 1997 [103], yet its processor performance is at best comparable to contemporary microprocessors. Given the microprocessor performance growth rate of about 50% per year, the current Tera processor’s performance soon will lag microprocessor performance by a substantial margin.
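The thread-count estimate above is simple arithmetic; the short check below restates it, with the 70-cycle latency and lookahead of four taken as the estimates quoted in the text rather than measured values.

```c
/* Back-of-the-envelope restatement of the thread-count estimate above. */
#include <stdio.h>

int main(void)
{
    const int mem_latency = 70;  /* cycles to memory, no caches (estimate)          */
    const int lookahead   = 4;   /* independent instructions between dependent ones */
    const int cpus        = 64;

    /* ceil(70/4) = 18 threads per CPU; the text rounds this to about 20. */
    int threads_per_cpu = (mem_latency + lookahead - 1) / lookahead;
    printf("at least %d threads per CPU (~20 in the text), so roughly %d system-wide\n",
           threads_per_cpu, cpus * 20);
    return 0;
}
```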

7.4 Cache-Coherent Shared Memory Systems

This section analyzes systems that provide a cache-coherent shared memory communication paradigm. Although all these systems are cache-coherent, they employ varying system architectures and coherence protocols. In all cases, however, the systems are designed as multiprocessors from the start, and use node designs and interconnects that are designed specifically for use in a distributed multiprocessor.

7.4.1 SGI Origin 2000

The Origin system [64] is closest to FLASH in architecture; both systems use the same R10000 compute processor and the same interconnection network. Each also uses a single node controller chip to provide connectivity to local memory, the interconnection network, and to a local I/O subsystem. In the Origin, the node controller is completely hardwired and implements a variation of the bitvector/coarsevector protocol described in Section 5.2.

Because FLASH and the Origin 2000 are architecturally similar, they serve as a useful comparison for evaluating the tradeoffs of using a programmable protocol engine; Chapter 6 explored this issue in detail. Clearly the FLASH approach yields significantly higher memory access latencies, although the time to fetch a local, clean line is the same. On the other hand, the Origin node controller cannot change to a different protocol, such as dynamic pointer allocation, that may be better suited for larger systems. For a commercial system, the hardwired design probably is more appropriate; Chapter 8 looks at some possibilities for combining the lower latencies of a hardwired design with the flexibility benefits of a programmable engine.

7.4.2 Convex Exemplar and Data General NUMALiiNE

Both of these systems use the Scalable Coherent Interface (SCI) protocol [39, 91] to connect multiprocessor nodes (clusters) into a cache-coherent multiprocessor. Coherence within a cluster is maintained using a non-SCI protocol; SCI is used only for inter-cluster coherence. The Exemplar node, called a hypernode, contains up to 16 PA-8000 processors and eight memory banks, all connected via a crossbar switch [8, 16]. The NUMALiiNE node is smaller, containing up to four Pentium Pro processors and memory, connected via a shared bus [21].

As noted above, the unifying feature of these systems is their use of SCI to maintain system-wide cache coherence. The SCI protocol, like the dynamic pointer allocation protocol presented in Chapter 5, maintains sharing information using a linked list data structure. In the SCI protocol, however, the sharing list is physically distributed among the processing nodes — each line in every CPU’s cache contains not only the usual tag and state information, but a linked list sharing entry as well. When a cache receives data in response to a processor read or write request, the linked list entry from the associated cache line is joined into a distributed linked list of sharers maintained by all nodes caching the memory location. As with dynamic pointer allocation, each memory location has a home node where the directory entry is located. For SCI, however, the directory entry contains just a few bits of state and a pointer to the first cache on the sharing list.

The SCI distributed sharer lists offer three major advantages. First, as noted for the dynamic pointer allocation protocol, use of a linked list allows precise sharing information to be maintained even for very large processor counts. Second, by physically storing the sharing information along with the cache line being shared, SCI achieves very low directory memory overhead. Third, by distributing the sharing list, SCI distributes protocol processing as well. The overall protocol processing activity required to service a cache miss is not handled almost entirely by the home node; instead, each node on the sharing list bears some of the protocol processing overhead, helping to reduce hot-spotting.

Unfortunately, the distributed sharer lists also greatly complicate the protocol. With a centralized sharing list, as used by dynamic pointer allocation for example, traversing and updating the sharing list is an atomic operation — no other requests to the same cache line will be serviced until processing of the current request completes. With SCI, this atomicity is impossible since inter-node messages must be sent to cause the sharing list to be manipulated. As a result, SCI incurs substantially greater overall protocol processing overhead — a result of a much more complicated protocol — even though it does help spread the processing load across the system.

An implementation of SCI has been developed for FLASH. Preliminary performance results indicate that the complexity of the SCI protocol causes a significant performance loss at small processor counts (typically fewer than 64). At larger processor counts, SCI is more competitive with dynamic pointer allocation, though overall system performance with SCI still is lower, confirming that SCI’s utility is limited to larger system sizes [45]. For very large processor counts, the tradeoffs between SCI and dynamic pointer allocation are not clear and the superior protocol may be as much a function of the application characteristics as the system size.
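The contrast between the two directory organizations can be summarized in a few data structures. The sketch below is illustrative only, with field names and widths that are assumptions rather than the actual SCI or FLASH formats, but it shows where the sharing information lives in each scheme: entirely at the home node for dynamic pointer allocation, versus a small head pointer at the home plus forward and backward pointers distributed across the sharers' cache tags for SCI.

```c
/* Illustrative directory structures: centralized list vs. SCI distributed list. */
#include <stdint.h>

#define NO_NODE 0xFFFF

/* Dynamic pointer allocation: the home node keeps the whole sharer list in a
 * centralized pool, so list walks are local to the home and atomic per request. */
typedef struct sharer {
    uint16_t       node;      /* node caching the line           */
    struct sharer *next;      /* next entry in the home's pool   */
} sharer_t;

typedef struct {
    unsigned  dirty : 1;
    sharer_t *head;           /* head of the centralized sharing list */
} dpa_dir_entry_t;

/* SCI: the home keeps only a few state bits and a pointer to the first sharer;
 * the rest of the (doubly linked) list lives in the sharers' own cache tags,
 * so adding or removing a sharer requires inter-node messages to splice it. */
typedef struct {
    unsigned state : 3;       /* a few bits of SCI memory state              */
    uint16_t head_node;       /* first cache on the sharing list, or NO_NODE */
} sci_dir_entry_t;

typedef struct {
    uint16_t forward;         /* next sharer (toward the tail), or NO_NODE   */
    uint16_t backward;        /* previous sharer (toward the home)           */
    unsigned cache_state : 7; /* SCI cache-line state (width is an assumption) */
} sci_cache_tag_ext_t;        /* stored alongside the normal tag and state   */
```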

7.4.3 Wisconsin Typhoon

Typhoon refers more to a class of multiprocessor architectures than to a specific implementation. The Wisconsin researchers have proposed a separation of the interface to a shared memory system and the hardware mechanisms used to implement the interface [82, 93]. Tempest is the name of the interface specification and Typhoon refers to a proposed hardware implementation platform.

Several Typhoon designs have been proposed. The original design [82] proposed the use of a previous-generation SPARC processor core as a protocol processor with various support hardware to provide protection, memory translation, and a network interface. The Typhoon-0 proposal [84] separates the node controller functionality into three separate chips: an off-the-shelf protocol processor, an access control device, and a network interface chip. All three components are connected to the memory bus of a SPARC workstation. The Typhoon-1 proposal [85] is similar to Typhoon-0 but combines the access control device and network interface on one chip. Typhoon and Typhoon-1 are just paper designs and were never built. A Typhoon-0 prototype was constructed using a dual-processor SPARC workstation (one CPU served as the compute processor and the other as the protocol processor), a Myricom network interface chip, and an FPGA-based access control device.

The difficulty with all three Typhoon approaches is their use of off-the-shelf protocol processors. As the results in Chapter 6 demonstrate, the key to using a programmable protocol engine yet achieving performance competitive with a hardwired design is to specialize the protocol engine architecture to the characteristics of protocol code, provide additional hardware support to alleviate performance bottlenecks such as dispatch, and to closely couple the protocol engine to the other node board components. Although the three Typhoon designs achieve these goals to a varying extent, none of the proposals achieves the same comprehensive, integrated solution as the MAGIC chip. Consequently, the performance of the Typhoon systems is likely to be significantly worse than a FLASH system, and is unlikely to be competitive with a hardwired system on any but the least communication-intensive applications. The latency to service a remote miss, for example, is over 300 cycles on the Typhoon-1 design and over 600 cycles on Typhoon-0 [85] (converted into 100 MHz cycles and excluding network transit time). These latencies are approximately 5 and 10 times longer, respectively, than on FLASH (see Figure 6.2). Typhoon protocol processor occupancies also are large, an additional indication that overall system performance is likely to suffer.

The Tempest and Typhoon proposals are a useful way to view the functionality of a multiprocessor node controller, especially one based on a programmable protocol processor. From a performance perspective, however, the proposed use of an off-the-shelf protocol processor and the lack of close coupling among the node components are liable to yield inferior overall system performance.

7.4.4 MIT Alewife

Like FLASH, Alewife [2] aims to support both cache-coherent shared memory and message passing. Each Alewife node contains a modified SPARC compute processor, local memory, and a node controller called the Communication and Memory Management (CMMU) chip. The CMMU basically is a hardwired protocol engine, and implements both the LimitLESS cache coherence protocol [1] and the message passing functionality. An interesting feature of the CMMU design, however, is that it implements only part of the cache coherence protocol. When the CMMU encounters a protocol case it is unable to service — typically the corner cases of the protocol — it traps to the node’s compute processor. The compute processor services the protocol operation and then returns control to the CMMU. This feature eases the CMMU logic design because the CMMU needs to implement only the common-case protocol operations.

One difficulty with this approach is that the compute processor must be able to take a trap at any point in its operation, even when it is stalled on a cache miss. This functionality was not present in off-the-shelf SPARC processors, and hence the Alewife team had to modify the SPARC design to provide the necessary trap servicing features and to reduce trap service overhead to only a few cycles to reduce protocol processing overhead. The modified SPARC design is called Sparcle. In addition to the trap logic, Sparcle also supports multiple contexts, which help to tolerate remote memory fetch latency by switching to another context while a remote miss is being serviced.

The Alewife architecture demonstrates one way in which hardwired and programmable protocol processing can be combined. Unfortunately, the Alewife design required modifying the compute processor, an option not available for the FLASH design and, in any event, an option that is increasingly difficult to undertake with contemporary high-speed, full-custom VLSI microprocessor designs. Good performance on Alewife requires that very few protocol operations trap to the local Sparcle processor, since each trap not only incurs additional overhead to service the protocol operation but also interrupts normal computation. Performing protocol processing on the compute processor also greatly increases the occupancy of the protocol operation.

7.4.5 Sequent NUMA-Q

The NUMA-Q system [68, 69] is a good example of how the basic FLASH approach can be applied in a commercial system. Each NUMA-Q node — called a Quad — contains four Intel Pentium Pro processors, an Intel memory controller, local memory, and an I/O controller. All these components are connected to a shared, snoopy bus that maintains cache coherence within the Quad. Also attached to the snoopy bus is a node controller that provides system-wide cache coherence using the SCI protocol described in Section 7.4.2.

The node controller architecture is of particular interest because it is based on a programmable protocol engine similar to the MAGIC design. To optimize the node controller, called the SCI Cache Link Interface Controller (SCLIC) chip, for a commercial setting, its design was specialized to the SCI protocol. In contrast, the MAGIC architecture is designed for generality and maximum flexibility so as to support many protocols efficiently. For a commercial product, however, such flexibility is unnecessary and a faster protocol engine can be designed if only a single protocol needs to be supported.

The SCLIC chip contains a programmable engine based on a simple, three-stage pipeline with a VLIW-style instruction set optimized for the SCI protocol and including bit-field extensions. Unlike MAGIC, the SCLIC chip does not provide an instruction cache; all protocol code is stored on-chip in a dedicated memory large enough to store all protocol code. A data cache is present, although it is used only as a cache for SCI directory entries rather than as a general data cache. In addition, the SCLIC programmable engine supports multiple contexts, and a context is allocated and left pending until the associated memory operation, such as a remote memory read, completes.

MAGIC and the SCLIC chip share a number of major architectural features, most significantly the use of a programmable protocol engine. The SCLIC design illustrates one way in which MAGIC’s general protocol processing support can be tailored to a specific protocol, thereby yielding a faster and perhaps smaller design that has performance characteristics suitable for use in a commercial system.

Like the NUMA-Q, the S3.mp system [79], developed at Sun Microsystems, also used a programmable protocol engine. Although the S3.mp was a research machine, it also adopted the approach of specializing the protocol processor to a particular protocol. On the S3.mp, the programmability was exploited by developing a COMA protocol for the system in addition to the original protocol for which the node controller was designed [90].

7.5 Software-Based Shared Memory Systems

Another possibility for providing coherent shared memory is to forgo direct hardware support and instead provide the coherence mechanisms via a combination of operating system and user-level software. Potential advantages of software-based shared memory include the ability to use off-the-shelf workstations and networks, with the accompanying cost and time-to-market benefits, and the ability to more easily tailor the coherence code to the application or system characteristics. A number of software-based systems have been proposed, including Treadmarks [60], Munin [17], CRL [56], MGS [115], Shasta [92], and SoftFLASH [31]. Generally, these systems operate on a page level and rely on the operating system to detect when coherence operations are required by marking pages read-only or invalid in the TLB. On TLB faults, the proper protocol code is executed to maintain coherence.

The most significant drawback of software-based coherence is the resulting application performance. Although these systems can yield acceptable performance for applications with infrequent, coarse-grained communication and minimal false sharing, the performance on a software-based coherent system for most applications is considerably worse than on any machine using hardware-based coherence [31]. Performance is degraded by the additional processing overhead of the software coherence handlers, which tend to be much more complicated than the handlers (or logic) used in a hardware-based machine. In addition, maintaining coherence at the granularity of a page rather than a cache line is too coarse-grained for many applications; it leads to false sharing and increased network traffic, and often requires significant amounts of additional memory to track modifications to read/write pages.

On balance, then, software-based coherent shared memory may be appropriate to allow off-the-shelf machines, perhaps even a non-homogeneous collection of workstations, to be networked into a shared memory multiprocessor. For all but the least communication intensive applications, however, the performance is likely to be far worse than on any machine with hardware-based coherence [23, 52].

7.6 Summary

The number of related systems presented in this chapter illustrates both the growing acceptance of multiprocessors in the computer market and the diversity of distributed multiprocessor system architectures. Although several message passing systems are discussed above, pure message passing systems generally have fallen out of favor in the last several years. These systems have achieved impressive scalability, but the difficulties of programming for the message passing paradigm as well as the inability to run existing applications without modification have limited message passing acceptance, particularly in the commercial market. Instead, shared memory systems, particularly cache coherent systems, have become the dominant type of commercial system.

Even within the shared memory design space, however, numerous architectures are possible. Issues such as the design of the individual processing nodes and the choice of protocol clearly are not settled, as demonstrated by the variety and contrast of the approaches described above. This lack of agreement on the system architecture and, particularly, the protocol argues well for a system such as FLASH. The flexibility of the MAGIC chip allows investigation into protocol design tradeoffs and whether both message passing and cache coherence can be used together to improve overall system performance. As system sizes grow, tradeoffs such as these will become ever more important to the efficient use of the huge aggregate computational power of large systems.

Chapter 8 Conclusions and Future Work

As multiprocessors scale to ever larger processor counts, the design of the memory system — the node controller, local memory, and interconnect — becomes increasingly important. Achieving good overall system performance requires a balanced system design that recognizes the need to combine fast individual compute processors with a memory system that allows the machine's raw aggregate processor performance to be exploited effectively. For small-scale systems, attaching all processors and memory modules to a shared, snoopy bus yields good performance and allows cache-coherent shared memory to be provided. As the system size grows, however, and a distributed architecture becomes necessary to overcome the scalability limitations of the shared bus, the memory system design task becomes more difficult. A distributed memory architecture allows very large systems to be built, but greatly complicates the means by which the processing nodes communicate. Thus the major tasks in designing a large-scale distributed multiprocessor are deciding what communication paradigms the system should provide and how the paradigms should be implemented. The FLASH design focuses on providing the means to evaluate these tradeoffs quantitatively.

To permit detailed investigation of communication paradigm and protocol tradeoffs, the FLASH system allows many different protocols to be implemented and tested on an actual machine. The key to achieving this goal is the MAGIC chip, FLASH's node controller, which is based on a flexible, programmable protocol engine, a feature critical to supporting multiple protocols. In addition to supporting multiple protocols, FLASH also is designed to be scalable and to reduce the logic and memory overhead of providing cache-coherent shared memory. At the same time, so that the results from investigations performed on FLASH are applicable to hardwired designs, overall FLASH system performance must be competitive with a system employing a hardwired node controller.

Achieving these goals required a detailed investigation of how to design a programmable protocol engine. To minimize system cost and design time, one reasonable approach is to use an off-the-shelf processor. However, analysis of typical protocol operations showed that this solution yields inferior protocol processing performance. Ultimately, the analysis demonstrated the need for an integrated node controller based on a programmable protocol processor with an instruction set and architecture optimized for the characteristics of protocol code. Along with the protocol processor, the node controller needed to include additional hardware support to reduce data movement and message dispatch overhead. Integrating all this functionality on a single chip closely couples the node controller to the other processing node components, further reducing protocol processing overhead.

Chapter 4 described the resulting MAGIC design and Chapter 6 investigated the performance of the MAGIC chip and the FLASH system on both large-scale parallel applications and multiprogrammed workloads. The performance results indicate that FLASH is scalable — it achieves good speedups across a range of processor counts — and can achieve overall system performance competitive with a machine retaining the same system architecture but using a fully-hardwired node controller. However, FLASH's performance is more sensitive to undesirable application characteristics such as poor load balancing and memory hot-spotting. When an application exhibits these behaviors, the performance of the FLASH system degrades more rapidly than the hardwired machine's, an indication that FLASH's performance, while competitive, is less robust.

Given the FLASH and MAGIC design and the system performance analysis presented in this dissertation, two related areas of future work seem viable: adapting the programmable node controller design to yield more robust system performance and using the FLASH system to perform large-scale system architecture and protocol investigations.

8.1 Node Controller Architecture Alternatives

The MAGIC architecture supports a variety of protocols. As such, the architecture is intentionally very general, optimized only for protocol processing as a whole rather than for a particular protocol or set of protocols. Although the resulting design does yield efficient support for a range of protocols, its lack of specialization leads to higher memory operation latencies and protocol processor occupancies than a design tailored for a particular protocol. By compromising some generality, several alternate node controller architectures could help reduce the latency and occupancy overheads — and thereby improve overall performance and performance robustness — while retaining many of the advantages of a programmable protocol processor core.

One option is to use multiple protocol processors. This option was considered for the MAGIC design but discarded because of implementation constraints. With the rapid improvement in ASIC processes and usable gate count, however, providing two protocol processors on the same die is feasible. The Sequent NUMA-Q designers also considered incorporating two protocol processors [69], and the Sun S3.mp node controller implements two protocol engines, albeit with a simpler design than the MAGIC PP.

The key issue in employing multiple protocol processors is how to assign work. With two PPs, for instance, a natural choice is to inspect incoming messages and dispatch those pertaining to remote addresses to one PP and those pertaining to local addresses to the other. This division is desirable because, for many cache coherence protocols, only requests to local addresses require access to the directory, meaning that only one PP will be reading and updating the directory. For large parallel jobs with good load balancing, this partition is likely to work well. If the system were to run a collection of uniprocessor jobs, however, and memory were allocated on the same node as the job — in general, a desirable policy — the PPs assigned to local addresses are likely to be overloaded while the remote-address PPs remain almost idle.

Perhaps, then, a more dynamic means of allocating incoming messages is needed. But with dynamic allocation of messages to PPs, the PPs will need to communicate and, possibly, synchronize. In essence, the node controller is itself a small multiprocessor, and the PPs will need to coordinate the processing and updating of shared resources such as the directory and the local processor cache. Such a situation is likely to introduce many additional protocol race conditions and may limit the latency and occupancy reductions afforded by the multiple protocol processors.

A second option is to try to combine the advantages of a hardwired node controller and a programmable protocol engine while continuing to use the programmable engine for all protocol processing. This approach assumes that a particular protocol (or set of protocols) has been selected and that, as a result, the node controller design can be specialized to the selected protocol. The MAGIC design deliberately avoided protocol-specific optimizations even though protocol-specific hardware or PP instructions could have reduced handler lengths for certain protocols. For example, in a COMA protocol, the length of some protocol handlers can be reduced by adding PP instructions to perform the cache tag matches required to determine whether a line is stored in a node's local memory. In addition to adding protocol-specific PP instructions, generality in the node controller support hardware — MAGIC's jump table in the inbox, for instance — can be reduced, simplifying the design and possibly reducing latency.

Another, perhaps higher-performance, node controller design option is to merge a programmable engine into a hardwired node controller architecture. In this approach, a hardwired mechanism would be used to handle the common protocol operations, thereby gaining the lower latencies and occupancies for the vast majority of protocol operations. For protocol corner cases or other non-critical or complex operations, the message instead would be dispatched to a programmable processor core and serviced in a manner similar to that used in MAGIC. The main challenges with this type of hybrid design are what operations to support in the hardwired section, how to decide when to offload an operation to the programmable protocol processor, and how to synchronize and coordinate the hardwired and programmable sections. Ideally, such a design would incorporate a flexible mechanism to allocate an operation to the programmable or hardwired sections; such flexibility would allow the programmable engine to service all protocol operations if detailed performance data collection were desired, or to service a selected set of operations in cases where the programmable engine is needed to work around bugs discovered in the hardwired section.

Future work in the design of programmable node controllers could consider issues such as those raised above; indeed, some of the underlying protocol and architecture issues have been explored already in the research community [20, 61, 76, 77, 83]. Clearly, to properly evaluate a potential approach, a much more detailed architectural specification and system simulation are needed to estimate the resulting performance, implementation cost, and implementation complexity.
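A minimal sketch of the flexible allocation mechanism described above for the hybrid design: a per-message-type table selects either the hardwired path or the programmable protocol processor (PP), and the table can be rewritten to force selected operations, or all of them, onto the PP for detailed performance monitoring or to work around hardware bugs. The message types and handler hooks are hypothetical placeholders, not MAGIC's actual interface.

/* Illustrative only: per-message-type routing between a hardwired fast
 * path and a programmable protocol processor.  Names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum msg_type { MSG_GET, MSG_GETX, MSG_PUT, MSG_INVAL_ACK, MSG_UNCACHED,
                NUM_MSG_TYPES };

typedef struct { enum msg_type type; uint64_t addr; } Message;

/* Per-type routing: false = hardwired path, true = offload to the PP. */
static bool route_to_pp[NUM_MSG_TYPES] = {
    [MSG_UNCACHED] = true,           /* corner case handled in software */
};

/* Stand-ins for the two execution paths. */
static void hardwired_handler(const Message *m)
{
    printf("hardwired path: type %d addr 0x%llx\n",
           (int)m->type, (unsigned long long)m->addr);
}

static void pp_handler(const Message *m)
{
    printf("PP path:        type %d addr 0x%llx\n",
           (int)m->type, (unsigned long long)m->addr);
}

static void dispatch(const Message *m)
{
    if (route_to_pp[m->type])
        pp_handler(m);
    else
        hardwired_handler(m);
}

/* Force every operation onto the PP, e.g. for detailed performance data
 * collection or to work around a bug in the hardwired logic. */
static void route_everything_to_pp(void)
{
    for (int t = 0; t < NUM_MSG_TYPES; t++)
        route_to_pp[t] = true;
}

int main(void)
{
    Message get = { MSG_GET, 0x1000 }, unc = { MSG_UNCACHED, 0x2000 };
    dispatch(&get);                  /* hardwired */
    dispatch(&unc);                  /* programmable */
    route_everything_to_pp();
    dispatch(&get);                  /* now programmable as well */
    return 0;
}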

8.2 Large System Design Issues

Another area of future work is developing the software infrastructure to permit detailed, quantitative protocol and architectural investigations on FLASH. As with any programmable system, much of FLASH's value stems from the software. The MAGIC chip provides a foundation on which various communication paradigms and the protocols that implement them can be studied and compared. Doing so requires writing the protocol code, of course, and several protocols have been developed already for FLASH.

Also needed is a comprehensive performance analysis environment that uses the performance monitoring features provided in MAGIC, in the R10000, and in the Spider routers to collect user-specified performance information as an application executes. This performance data can be used to investigate issues such as protocol scalability, operating system scalability and resource management, application performance, and how best to gather, analyze, and report performance data to the application developer and user. MAGIC's flexibility also might be used to aid in implementing a parallel debugging environment or other parallel application development tools.

The FLASH hardware — in particular the capabilities of the MAGIC chip — enables the investigation of many research topics related to the design of next-generation multiprocessors and the operating systems and applications that run on them. Future work in this area is exactly the type of research FLASH is designed to support.
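As a purely illustrative example of the kind of analysis such an environment might perform, the sketch below aggregates hypothetical per-node event counters and flags the hottest node. The event names, node count, and read_node_counter() hook are placeholders standing in for whatever monitoring interfaces MAGIC, the R10000, and the Spider routers actually expose.

/* Illustrative only: summarizing user-selected event counts per node. */
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 64

enum event { EV_LOCAL_MISS, EV_REMOTE_MISS, EV_PP_BUSY, NUM_EVENTS };
static const char *event_name[NUM_EVENTS] =
    { "local misses", "remote misses", "PP busy cycles" };

/* Placeholder for reading a hardware counter on a given node. */
static uint64_t read_node_counter(int node, int ev)
{
    return (uint64_t)(node + 1) * (uint64_t)(ev + 1);   /* dummy data */
}

int main(void)
{
    for (int ev = 0; ev < NUM_EVENTS; ev++) {
        uint64_t total = 0, max = 0;
        int hot_node = 0;
        for (int n = 0; n < NUM_NODES; n++) {
            uint64_t c = read_node_counter(n, ev);
            total += c;
            if (c > max) { max = c; hot_node = n; }
        }
        /* A large max-to-average ratio would suggest hot-spotting. */
        printf("%-16s total=%llu hottest node=%d (%llu)\n",
               event_name[ev], (unsigned long long)total,
               hot_node, (unsigned long long)max);
    }
    return 0;
}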

References

[1] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224-234, April 1991.

[2] A. Agarwal, R. Bianchini, D. Chaiken, et al. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 2-13, June 1995.

[3] T. Agerwala, J. Martin, J. H. Mirza, et al. The SP2 System Architecture. IBM Systems Journal, 34(2):152-184, 1995.

[4] G. Alverson, R. Alverson, D. Callahan, et al. Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor. In Proceedings of the ACM International Conference on Supercomputing, pages 188-197, July 1992.

[5] R. Alverson, D. Callahan, D. Cummings, et al. The Tera Computer System. In Proceedings of the 1990 International Conference on Supercomputing, pages 1-6, June 1990.

[6] T. E. Anderson, D. E. Culler, D. A. Patterson, et al. A Case for Networks of Workstations: NOW. IEEE Micro, 15(1):54-64, February 1995.

[7] R. H. Arpaci, D. E. Culler, A. Krishnamurthy, et al. Empirical Evaluation of the Cray T3D: A Compiler Perspective. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 320-331, June 1995.

[8] G. Astfalk and K. Shaw. Four-State Cache-Coherence in the Convex Exemplar System. Convex Computer Corporation, October 1995.

[9] D. H. Bailey. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4(1):23-35, March 1990.

[10] L. A. Barroso and M. Dubois. The Performance of Cache-Coherent Ring-Based Multiprocessors. In Proceedings of the 20th International Symposium on Computer Architecture, pages 268-277, May 1993.

[11] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, et al. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 3-16, July 1991.

[12] M. Blumrich, K. Li, R. Alpert, et al. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In Proceedings of the 21st International Symposium on Computer Architecture, pages 142-153, April 1994.

[13] N. J. Boden, D. Cohen, R. E. Felderman, et al. Myrinet: A Gigabit Per Second Local Area Network. IEEE Micro, 15(1):29-36, February 1995.

[14] J. Boyle, E. Lusk, et al. Portable Programs for Parallel Processors. Holt, Rinehart, and Winston, New York, 1987.

[15] A. Brandt. Multi-Level Adaptive Solutions to Boundary-Value Problems. Mathematics of Computation, 31(138):333-390, April 1977.

[16] T. Brewer and G. Astfalk. The Evolution of the HP/Convex Exemplar. In Proceedings of the 42nd COMPCON, pages 81-86, February 1997.

[17] J. B. Carter et al. Design of the Munin Distributed Shared Memory System. Journal of Parallel and Distributed Computing, 29(2):219-227, September 1995.

[18] L. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Transactions on Computers, C-27(12):1112-1118, December 1978.

[19] D. Chaiken and A. Agarwal. Software-Extended Coherent Shared Memory: Performance and Cost. In Proceedings of the 21st International Symposium on Computer Architecture, pages 314-324, April 1994.

[20] S. Chandra, J. R. Larus, and A. Rogers. Where is Time Spent in Message-Passing and Shared-Memory Programs? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 61-73, October 1994.

[21] R. Clark. SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers with Pentium Pro SHV Nodes. In Proceedings of Hot Interconnects IV, August 1996.

[22] A. L. Cox and R. J. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. In Proceedings of the 20th International Symposium on Computer Architecture, pages 98-108, May 1993.

[23] A. L. Cox, S. Dwarkadas, P. Keleher, et al. Software Versus Hardware Shared-Memory Implementation: A Case Study. In Proceedings of the 21st International Symposium on Computer Architecture, pages 106-117, April 1994.

[24] D. E. Culler, R. M. Karp, D. A. Patterson, et al. LogP: A Practical Model of Parallel Computation. Communications of the ACM, 39(11):78-85, November 1996.

[25] D. E. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, et al. Parallel Computing on the Berkeley NOW. In Proceedings of the 9th Joint Symposium on Parallel Processing, 1997.

[26] Z. Cvetanovic and D. Bhandarkar. Characterization of Alpha AXP Performance Using TP and SPEC Workloads. In Proceedings of the 21st International Symposium on Computer Architecture, pages 60-70, April 1994.

[27] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. Architectural Requirements of Parallel Scientific Applications with Explicit Communication. In Proceedings of the 20th International Symposium on Computer Architecture, pages 2-13, May 1993.

[28] W. J. Dally and C. L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, C-36(5):547-553, May 1987.

[29] W. J. Dally. Wire-efficient VLSI Multiprocessor Communication Networks. In Proceedings of the 1987 Stanford Conference on Advanced Research in VLSI, pages 391-415.

[30] C. Dubnicki, L. Iftode, E. W. Felten, and K. Li. Software Support for Virtual Memory-Mapped Communication. In Proceedings of the 10th International Parallel Processing Symposium, pages 372-381, April 1996.

[31] A. Erlichson, N. Nuckolls, G. Chesson, and J. Hennessy. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 210-220, October 1996.

[32] M. Fillo, S. W. Keckler, W. J. Dally, et al. The M-Machine Multicomputer. International Journal of Parallel Programming, 25(3):183-212, June 1997.

[33] D. Gajski, D. Kuck, D. Lawrie, and A. Sameh. Cedar — A Large Scale Multiprocessor. In Proceedings of the 1983 International Conference on Parallel Processing, pages 524-529, August 1983.

[34] M. Galles. Spider: A High-Speed Network Interconnect. IEEE Micro, 17(1):34-39, January-February 1997.

[35] K. Gharachorloo. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation, Stanford University, Stanford, CA, December 1995.

[36] A. Gottlieb, R. Grishman, C. P. Kruskal, et al. The NYU Ultracomputer — Designing an MIMD Shared Memory Parallel Computer. IEEE Transactions on Computers, C-32(2):175-189, February 1983.

[37] A. M. Grizzaffi Maynard, C. M. Donnelly, and B. R. Olszewski. Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 145-156, October 1994.

[38] A. Gupta and W.-D. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7):794-810, July 1992.

[39] D. B. Gustavson and Q. Li. Local-Area Multiprocessor: The Scalable Coherent Interface. SCIzzL Technical Report, Santa Clara University, 1995.

[40] J. Heinlein, K. Gharachorloo, S. A. Dresser, and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38-50, October 1994.

[41] J. Heinlein, R. P. Bosch, and K. Gharachorloo. Coherent Block Data Transfer in the FLASH Multiprocessor. In Proceedings of the 11th International Parallel Processing Symposium, pages 18-27, April 1997.

[42] J. Heinlein. Personal communication, October 1997.

[43] J. Heinlein. Optimized Multiprocessor Communication and Synchronization Using a Programmable Protocol Engine. Ph.D. Dissertation, Stanford University, Stanford, CA, to appear.

[44] M. Heinrich, J. Kuskin, D. Ofelt, et al. The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 274-285, October 1994.

[45] M. Heinrich. The Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols. Ph.D. Dissertation, Stanford University, Stanford, CA, to appear.

[46] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann, San Francisco, CA, 1996.

[47] W. D. Hillis and L. W. Tucker. The CM-5 Connection Machine: A Scalable Supercomputer. Communications of the ACM, 36(11):30-40, November 1993.

[48] C. Holt and J. P. Singh. Hierarchical N-Body Methods on Shared Address Space Multiprocessors. In Proceedings of the Seventh SIAM International Conference on Parallel Processing for Scientific Computing, pages 313-318, February 1995.

[49] C. Holt, J. P. Singh, and J. L. Hennessy. Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 134-145, May 1996.

[50] C. Holt, M. Heinrich, and J. P. Singh. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.

[51] C. Hristea. Measuring CC-MP Memory Hierarchy Performance Using Microbenchmarks. Presentation to Scalable Systems Group at Silicon Graphics, January 1997.

[52] L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 122-133, May 1996.

[53] Intel Corporation, Supercomputer Systems Division. Paragon Technical Summary, 1993.

[54] Intel Corporation. Intel’s ASCI Red Project Hits 200 GFLOPS Milestone. Press release at , November 1996.

[55] T. Joe and J. L. Hennessy. Evaluating the Memory Overhead Required for COMA Architectures. In Proceedings of the 21st International Symposium on Computer Architecture, pages 82-93, April 1994.

[56] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. Operating Systems Review, 29(5):213-228, December 1995.

[57] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[58] M. Kantrowitz and L. M. Noack. Functional Verification of a Multiple-Issue, Pipelined, Superscalar Alpha Processor — the Alpha 21164 CPU Chip. Digital Technical Journal, 7(1):136-144, 1995.

[59] V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 298-307, June 1995.

[60] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the Winter 1994 USENIX Conference, pages 115-32, January 1994.

[61] A. Krishnamurthy, K. E. Schauser, C. J. Scheiman, et al. Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 37-48, October 1996.

[62] J. Kuskin, D. Ofelt, M. Heinrich, et al. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302-313, April 1994.

[63] J. Kuskin. MAGIC Architecture and Design Specification. Technical report, Computer Systems Laboratory, Stanford University, to appear.

[64] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture, pages 241-251, June 1997.

[65] D. Lenoski. The Design and Analysis of DASH: A Scalable Directory-Based Multiprocessor. Ph.D. Dissertation, Stanford University, Stanford, CA, December 1991.

[66] D. Lenoski, J. Laudon, K. Gharachorloo, et al. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.

[67] D. Lenoski, J. Laudon, T. Joe, et al. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993.

[68] T. D. Lovett, R. M. Clapp, and R. J. Safranek. NUMA-Q: An SCI-Based Enterprise Server. Sequent Computer Corporation, 1996.

[69] T. D. Lovett and R. M. Clapp. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 308-317, May 1996.

[70] LSI Logic Corporation. LCB500K Standard Cell Databook, 1996.

[71] R. P. Martin, A. M. Vahdat, D. E. Culler, and T. E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the 24th International Symposium on Computer Architecture, pages 85-97, June 1997.

[72] M. R. Martonosi. Analyzing and Tuning Memory Performance in Sequential and Parallel Programs. Ph.D. Dissertation, Stanford University, Stanford, CA, January 1994.

[73] M. Martonosi, D. Ofelt, and M. Heinrich. Integrating Performance Monitoring and Communication in Parallel Computers. In Proceedings of SIGMETRICS 1996, May 1996.

[74] M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 219-228, June 1997.

[75] T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. Ph.D. Dissertation, Stanford University, Stanford, CA, June 1994.

[76] S. S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent Network Interfaces for Fine-Grain Communication. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 247-258, May 1996.

[77] S. S. Mukherjee and M. D. Hill. A Survey of User-Level Network Interfaces for System Area Networks. Technical Report 1340, Computer Sciences Department, University of Wisconsin-Madison, February 1997.

[78] S. S. Mukherjee and M. D. Hill. A Case for Making Network Interfaces Less Peripheral. In Proceedings of Hot Interconnects V, August 1997.

[79] A. Nowatzyk, G. Aybay, and M. Browne. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 24th International Conference on Parallel Processing, pages 1-10, August 1995.

[80] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-Machine Multicomputer: An Architectural Evaluation. In Proceedings of the 20th International Symposium on Computer Architecture, pages 224-235, May 1993.

[81] PCI Special Interest Group. PCI Local Bus Specification, Revision 2.1. June, 1995.

[82] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st International Symposium on Computer Architecture, pages 325-337, April 1994.

[83] S. K. Reinhardt, M. D. Hill, J. R. Larus, et al. The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers. In Performance Evaluation Review, 21(1):48-60, June 1993.

[84] S. K. Reinhardt, R. A. Pfile, and D. A. Wood. T-Zero: Hardware Support for Distributed Shared Memory on a Cluster of Workstations. In Proceedings of the Fifth Workshop on Scalable Shared Memory Multiprocessors, June 1995.

[85] S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 34-43, May 1996.

[86] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete Computer Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology, 3(4):34-43, Winter 1995.

[87] M. Rosenblum, J. Chapin, D. Teodosiu, et al. Implementing Efficient Fault Containment for Multiprocessors. Communications of the ACM, 39(3):52-61, September 1996.

[88] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. In ACM Transactions on Modeling and Computer Simulation, pages 78-103, January 1997.

[89] E. Rothberg, J. P. Singh, and A. Gupta. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. In Proceedings of the 20th International Symposium on Computer Architecture, pages 14-25, May 1993.

[90] A. Saulsbury and A. Nowatzyk. Simple COMA on S3.mp. Presentation at The Fifth Workshop on Scalable Shared Memory Multiprocessors, Santa Margherita Ligure, Italy, June 1995.

[91] Scalable Coherent Interface, ANSI/IEEE Standard 1596-1992, August 1993.

[92] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A Low-Overhead Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 174-185, October 1996.

[93] I. Schoinas, B. Falsafi, A. A. Lebeck, et al. Fine-Grain Access Control for Distributed Shared Memory. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 297-307, October 1994.

[94] S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-36, October 1996.

[95] R. Simoni. Cache Coherence Directories for Scalable Multiprocessors. Ph.D. Dissertation, Stanford University, Stanford, CA, October 1992.

[96] J. P. Singh, J. L. Hennessy, and A. Gupta. Scaling Parallel Programs for Multiprocessors: Methodology and Examples. IEEE Computer, 26(7):42-50, July 1993.

[97] J. P. Singh, T. Joe, A. Gupta, and J. L. Hennessy. An Empirical Comparison of the Kendall Square Research KSR-1 and Stanford DASH Multiprocessors. In Proceedings of Supercomputing '93, pages 214-225, November 1993.

[98] M. D. Smith. Support for Speculative Execution in High-Performance Processors. Ph.D. Dissertation, Stanford University, Stanford, CA, November 1992.

[99] M. Snir, P. Hochschild, D. D. Frye, and K. J. Gildea. The Communication Software and Parallel Environment of the IBM SP2. IBM Systems Journal, 34(2):205-221, 1995.

[100] E. Spertus, S. C. Goldstein, K. E. Schauser, et al. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and CM-5. In Proceedings of the 20th International Symposium on Computer Architecture, pages 302-313, May 1993.

[101] R. Stallman. Using and Porting GNU CC. Free Software Foundation, Cambridge MA, January 1992.

[102] C. B. Stunkel, D. G. Shea, B. Aball, et al. The SP2 High-Performance Switch. IBM Systems Journal, 34(2):185-204, 1995.

[103] Tera Computer Company. Tera Computer Company Announces Booting Of Operating System. Press release, August 1997.

[104] J. Torrellas. Multiprocessor Cache Memory Performance: Characterization and Optimization. Ph.D. Dissertation, Stanford University, Stanford, CA, August 1992.

[105] T. von Eicken, D. E. Culler, S. C. Goldstein, et al. Active Messages: A Mechanism for Integrated Communication and Computation. In Proceedings of the 19th International Symposium on Computer Architecture, pages 256-266, May 1992.

[106] D. W. Walker and J. J. Dongarra. MPI: A Standard Message Passing Interface. Supercomputer, 12(1):56-68, January 1996.

[107] W.-D. Weber, S. Gold, P. Helland, et al. The Mercury Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers. In Proceedings of the 24th International Symposium on Computer Architecture, pages 98-107, June 1997.

[108] W.-D. Weber. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors. Ph.D. Dissertation, Stanford University, Stanford, CA, January 1993.

[109] S. C. Woo, J. P. Singh, and J. L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 219-229, October 1994.

[110] S. C. Woo, M. Ohara, E. Torrie, et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24-36, June 1995.

[111] S. C. Woo. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Ph.D. Dissertation, Stanford University, Stanford, CA, June 1996.

[112] D. A. Wood, S. Chandra, B. Falsafi, et al. Mechanisms for Cooperative Shared Memory. In Proceedings of the 20th International Symposium on Computer Architecture, pages 156-167, May 1993.

[113] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28-40, April 1996.

[114] W. C. Yen, D. W. Yen, and K.-S. Fu. Data Coherence Problem in a Multicache System. IEEE Transactions on Computers, C-34(1):56-65, January 1985.

[115] D. Yeung, J. Kubiatowicz, and A. Agarwal. MGS: A Multigrain Shared Memory System. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 44-55, May 1996.
