Design and Implementation of a Multi-Purpose Cluster System
Design and Implementation of a Multipurpose Cluster System Network Interface Unit

by

Boon Seong Ang

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February

© Massachusetts Institute of Technology. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, February

Certified by: Arvind, Johnson Professor of Computer Science, Thesis Supervisor

Certified by: Larry Rudolph, Principal Research Scientist, Thesis Supervisor

Accepted by: A. C. Smith, Chairman, Departmental Committee on Graduate Students

Abstract

Today, the interface between a high speed network and a high performance computation node is the least mature hardware technology in scalable general purpose cluster computing. Currently, the one-interface-fits-all philosophy prevails. This approach performs poorly in some cases because of the complexity of modern memory hierarchy and the wide range of communication sizes and patterns. Today's message passing NIUs are also unable to utilize the best data transfer and coordination mechanisms due to poor integration into the computation node's memory hierarchy. These shortcomings unnecessarily constrain the performance of cluster systems.

Our thesis is that a cluster system NIU should support multiple communication interfaces layered on a virtual message queue substrate in order to streamline data movement both within each node as well as between nodes. The NIU should be tightly integrated into the computation node's memory hierarchy via the cache-coherent snoopy system bus so as to gain access to a rich set of data movement operations. We further propose to achieve the goal of a large set of high performance communication functions with a hybrid NIU microarchitecture that combines custom hardware building blocks with an off-the-shelf embedded processor. These ideas are tested through the design and implementation of the StarT-Voyager NES, an NIU used to connect a cluster of commercial PowerPC-based SMPs.

Our prototype demonstrates that it is feasible to implement a multi-interface NIU at reasonable hardware cost. This is achieved by reusing a set of basic hardware building blocks and adopting a layered architecture that separates protected network sharing from software-visible communication interfaces. Through different mechanisms, our … MHz NIU (… MHz processor core) can deliver very low latency for very short messages (under … μs), very high bandwidth for multi-kilobyte block transfers (… MBytes/s bidirectional bandwidth), and very low processor overhead for multicast communication (each additional destination after the first incurs … processor clocks).

We introduce the novel idea of supporting a large number of virtual message queues through a combination of hardware Resident message queues and firmware-emulated Non-resident message queues. By using the Resident queues as firmware-controlled caches, our implementation delivers hardware speed on the average while providing graceful degradation in a low-cost implementation.

Finally, we also demonstrate that an off-the-shelf embedded processor complements custom hardware in the NIU, with the former providing flexibility and the latter performance. We identify the interface between the embedded processor and custom hardware as a critical design component and propose a command and completion queue interface to improve the performance and reduce the complexity of embedded firmware.

Thesis Supervisor: Arvind
Title: Johnson Professor of Computer Science

Thesis Supervisor: Larry Rudolph
Title: Principal Research Scientist

Acknowledgments

This dissertation would not have been possible without the encouragement, support, patience, and cooperation of many people. Although no words can adequately express my gratitude, an acknowledgement is the least I can do.

First and foremost, I want to thank my wife, Wee Lee, and our
families for standing by me all these years. They gave me the latitude to seek my calling, were patient as the years passed but I was no closer to enlightenment, and provided me a sanctuary to retreat to whenever my marathon-like graduate school career wore me thin. To you all, my eternal gratitude.

I am greatly indebted to my advisors, Arvind and Larry, for their faith in my abilities and for standing by me throughout my long graduate student career. They gave me the opportunity to co-lead a large systems project, an experience which greatly enriched my systems building skills. To Larry, I want to express my gratitude for all the fatherly/brotherly advice and the cheering sessions in the last leg of my graduate school apprenticeship. I would also like to thank the other members of my thesis committee, Frans and Anant, for helping to refine this work.

I want to thank Derek Chiou for our partnership through graduate school, working together on Monsoon, StarT, StarT-NG, and StarT-Voyager. I greatly enjoy bringing vague ideas to you and jointly developing them into well thought out solutions. This work on StarT-Voyager NES is as much yours as it is mine. Thank you too for the encouragement and counselling you gave me all these years.

The graduate students and staff in the Computation Structures Group gave me a home away from home. Derek Chiou, Alex Caro, Andy Boughton, James Hoe, R. Paul Johnson, Andy Shaw, Shail Aditya Gupta, Xiaowei Shen, Mike Ehrlich, Dan Rosenband, and Jan Maessen, thank you for the company in this long pilgrimage through graduate school. It was a pleasure working with all of you bright hardworking