Design and Implementation of a Multi-purpose Cluster System Network Interface Unit

by Boon Seong Ang

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 1999

© 1999 Massachusetts Institute of Technology. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, February 23, 1999
Certified by: Arvind, Johnson Professor of Computer Science, Thesis Supervisor
Certified by: Larry Rudolph, Principal Research Scientist, Thesis Supervisor
Accepted by: A. C. Smith, Chairman, Departmental Committee on Graduate Students

Abstract

Today, the interface between a high speed network and a high performance computation node is the least mature hardware technology in scalable general purpose cluster computing. Currently, the one-interface-fits-all philosophy prevails. This approach performs poorly in some cases because of the complexity of modern memory hierarchy and the wide range of communication sizes and patterns. Today's message passing NIU's are also unable to utilize the best data transfer and coordination mechanisms due to poor integration into the computation node's memory hierarchy. These shortcomings unnecessarily constrain the performance of cluster systems.

Our thesis is that a cluster system NIU should support multiple communication interfaces layered on a virtual message queue substrate in order to streamline data movement both within each node as well as between nodes.
The NIU should be tightly integrated into the computation node's memory hierarchy via the cache-coherent snoopy system bus so as to gain access to a rich set of data movement operations. We further propose to achieve the goal of a large set of high performance communication functions with a hybrid NIU micro-architecture that combines custom hardware building blocks with an off-the-shelf embedded processor.

These ideas are tested through the design and implementation of the StarT-Voyager NES, an NIU used to connect a cluster of commercial PowerPC based SMP's. Our prototype demonstrates that it is feasible to implement a multi-interface NIU at reasonable hardware cost. This is achieved by reusing a set of basic hardware building blocks and adopting a layered architecture that separates protected network sharing from software visible communication interfaces. Through different mechanisms, our 35 MHz NIU (140 MHz processor core) can deliver very low latency for very short messages (under 2 µs), very high bandwidth for multi-kilobyte block transfers (167 MBytes/s bi-directional bandwidth), and very low processor overhead for multi-cast communication (each additional destination after the first incurs 10 processor clocks).

We introduce the novel idea of supporting a large number of virtual message queues through a combination of hardware Resident message queues and firmware emulated Non-resident message queues. By using the Resident queues as firmware controlled caches, our implementation delivers hardware speed on the average while providing graceful degradation in a low cost implementation.

Finally, we also demonstrate that an off-the-shelf embedded processor complements custom hardware in the NIU, with the former providing flexibility and the latter performance.
We identify the interface between the embedded processor and custom hardware as a critical design component and propose a command and completion queue interface to improve the performance and reduce the complexity of embedded firmware.

Thesis Supervisor: Arvind
Title: Johnson Professor of Computer Science

Thesis Supervisor: Larry Rudolph
Title: Principal Research Scientist

Acknowledgments

This dissertation would not have been possible without the encouragement, support, patience and cooperation of many people. Although no words can adequately express my gratitude, an acknowledgement is the least I can do.

First and foremost, I want to thank my wife, Wee Lee, and our families for standing by me all these years. They gave me the latitude to seek my calling, were patient as the years passed but I was no closer to enlightenment, and provided me a sanctuary to retreat to whenever my marathon-like graduate school career wore me thin. To you all, my eternal gratitude.

I am greatly indebted to my advisors, Arvind and Larry, for their faith in my abilities and for standing by me throughout my long graduate student career. They gave me the opportunity to co-lead a large systems project, an experience which greatly enriched my systems building skills. To Larry, I want to express my gratitude for all the fatherly/brotherly advice and the cheering sessions in the last leg of my graduate school apprenticeship. I would also like to thank the other members of my thesis committee, Frans and Anant, for helping to refine this work.

I want to thank Derek Chiou for our partnership through graduate school, working together on Monsoon, StarT, StarT-NG and StarT-Voyager. I greatly enjoy bringing vague ideas to you, and jointly developing them into well thought out solutions. This work on StarT-Voyager NES is as much yours as it is mine. Thank you too for the encouragement and counselling you gave me all these years.
The graduate students and staff in Computation Structures Group gave me a home away from home. Derek Chiou, Alex Caro, Andy Boughton, James Hoe, R. Paul Johnson, Andy Shaw, Shail Aditya Gupta, Xiaowei Shen, Mike Ehrlich, Dan Rosenband and Jan Maessen, thank you for the company in this long pilgrimage through graduate school. It was a pleasure working with all of you bright, hardworking, devoted and, at one time, idealistic people. I am also indebted to many of you, especially Derek and Alex, for coming to my rescue whenever I painted myself into a corner.

In my final years at MIT, I had the pleasure of working with the StarT-Voyager team: Derek, Mike, Dan, Andy Boughton, Jack Constanza, Brad Bartley, and Wing-chung Ho. It was a tremendous educational experience laboring alongside the other talented team members, especially Derek, Mike and Dan. I also want to thank our friends at IBM: Marc, Pratap, Eknath, Beng Hong, and Alan, for assisting us in the StarT-Voyager project.

To all of you I mentioned above, and many others that I missed, thank you for the lessons and memories of these years.

Contents

1 Introduction
1.1 Motivation
1.2 Proposed Cluster Communication Architecture
1.3 A Flexible Network Interface Unit (NIU) Design
1.4 Related Work
1.5 Contributions
1.6 Road Map
2 Design Requirements
2.1 Message Passing Communication
2.1.1 Channel vs Queues-and-network Model
2.1.2 Message Content Specification
2.1.3 Message Buffer Management
2.1.4 Message Reception Method
2.1.5 Communication Service Guarantee
2.1.6 Synchronization Semantics
2.1.7 Summary
2.2 Shared Memory Communication
2.2.1 Caching and Coherence Granularity
2.2.2 Memory Model
2.2.3 Invalidate vs Update
2.2.4 CC-NUMA and S-COMA
2.2.5 Atomic Access and Operation-aware Protocol
2.2.6 Performance Enhancement Hints
2.2.7 Summary
2.3 System Requirements
2.3.1 Multi-tasking Model
2.3.2 Network Sharing Model
2.3.3 Fault Isolation
2.4 SMP Host System Restrictions
2.5 NIU Functionalities
2.5.1 Interface to Host
2.5.2 Interface to Network
2.5.3 Data Path and Buffering
2.5.4 Data Transport Reliability and Ordering
2.5.5 Unilateral Remote Action
2.5.6 Cache Protocol Engine
2.5.7 Home Protocol Engine
3 A Layered Network Interface Macro-architecture
3.1 Physical Network Layer
3.1.1 Two Independent Networks
3.1.2 Reliable Delivery Option
3.1.3 Ordered Delivery Option and Ordering-set Concept
3.1.4 Bounded Outstanding Packet Count
3.1.5 Programmable Send-rate Limiter
3.2 Virtual Queues Layer
3.2.1 Virtual Queue Names and Translation
3.2.2 Dynamic Destination Buffer Allocation
3.2.3 Reactive Flow-control
3.2.4 Decoupled Process Scheduling and Message Queue Activity
3.2.5 Transparent Process