IMPROVING THE SCALABILITY OF HIGH PERFORMANCE COMPUTER SYSTEMS

Inaugural dissertation

for the attainment of the academic degree

of a Doctor of Natural Sciences

of the Universität Mannheim

presented by

Heiner Hannes Litz

from Heidelberg (Diplom-Informatiker der Technischen Informatik)

Mannheim, 2010

Dean: Professor Dr. W. Effelsberg, Universität Mannheim
Referee: Professor Dr. U. Brüning, Universität Heidelberg
Co-referee: Professor Dr. R. Männer, Universität Heidelberg

Date of the oral examination: 17 March 2011

For Regine

Abstract

Improving the performance of future computing systems will depend on the ability to increase the scalability of current technology. New paths need to be explored, as operating principles that have been applied up to now are becoming irrelevant for upcoming computer architectures. Scaling the number of cores, processors and nodes within a system appears to be the only feasible way to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves inter-node communication within compute clusters. By improving latency by an order of magnitude over existing solutions, the cost of communication is considerably reduced. This makes it possible to exploit fine-grained parallelism within applications, thereby extending scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel-level software on off-the-shelf AMD processors. We present a proof-of-concept implementation and real-world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes.

The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed that accelerates design and implementation substantially. Due to its flexibility and modularity, a large application space is covered, ranging from Systems-on-Chip to high performance many-core processors. The Network-on-Chip compiler generates complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework makes it possible to build large designs while minimizing development and verification effort. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by introducing a protocol-agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability.

The third part of the dissertation addresses the processor-memory interface within computer architectures. The increasing compute power of many-core processors leads to an equally growing demand for memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue, we propose a memory extension technique that attaches large amounts of DRAM to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Applications can therefore utilize the memory extension with regular shared memory programming techniques. By supporting daisy-chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor-memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design.

Zusammenfassung

Improving the scalability of future computer systems is an important prerequisite for increasing their performance, as the principles applied so far to gain speed will no longer be applicable in the future. The main reason for this is the excessive power consumption caused by raising the clock frequency. One solution to this problem is to increase the number of components within a microchip, within a compute node and within compute networks in order to achieve a higher overall performance. To reach this goal we propose a new approach called "Tightly Coupled Cluster". Our approach makes it possible to build computer networks that enable communication with extremely low latency. This allows the fine-grained parallelism present in applications to be exploited, which in turn improves scalability. Our proposed mechanism effectively integrates the network interface into the processor, which makes it possible to bypass additional integrated circuits and renders protocol conversion unnecessary. We present an implementation of our technology that builds on AMD processors. Tests demonstrate a software-to-software communication latency of 240 ns, which is far superior to other approaches.

The second part of the dissertation presents a framework for realizing on-chip networks. So-called "Networks-on-Chip" are a good way to interconnect the many components within a microchip. As the number of functional units on a microchip keeps growing, the communication interface becomes an important, performance-determining component. The framework, which defines the structure for designing networks of different topologies and architectures, provides a software tool that automatically generates different designs at the push of a button. Synthesizable code is generated from an abstract description. The evaluation of the framework demonstrates the good scalability and performance of our approach. The third part of the thesis proposes a mechanism for improving the processor-memory interface. The integration density of future processors places ever higher demands on memory capacity and bandwidth. To satisfy these demands we propose a memory extension based on high-speed transceivers. Our memory extension integrates transparently into the system architecture and is cache coherent. The ability to support memory hierarchies, together with our proposed "Asymmetric Probing" mechanism, increases the scalability of our architecture. In addition, we present a "DMA offloading" mechanism. Our approach has been realized on a Field Programmable Gate Array. Software benchmarks demonstrate the quality of our approach.

Acknowledgements

This dissertation is the result of many years of effort, training and support by other people. In particular, I want to thank the following people (in alphabetical order):

Parag Beeraka, Ulrich Brüning, Holger Fröning, Alexander Giese, Benjamin Kalisch, Berenike Litz, Mondrian Nüssle and Maximilian Thürmer.

Furthermore, I want to thank the companies AMD and SUN Microsystems for supporting my work.

Table of Contents

1 Introduction

1.1 Limitations of Scalability
1.2 Scalable Computer Systems
1.2.1 Microarchitecture
1.2.2 Node Architecture
1.2.3 Network Architecture
1.2.4 Programming Model
1.3 Dissertation Methodology

2 Scalable Multinode Clusters

2.1 Introduction
2.1.1 Motivation
2.1.2 Related Work
2.1.3 Background
2.1.4 Approach
2.2 Design Space Exploration
2.2.1 Programming Model
2.2.2 Routing
2.2.3 Topology
2.2.4 Fault Tolerance
2.2.5 Physical Implementation
2.2.6 Limitations
2.3 Implementation
2.3.1 HTX-2-HTX Cable Adapter
2.3.2 Address Mapping Scheme
2.3.3 Initialization
2.3.4 TCCluster Initialization Algorithm
2.3.5 Non-Coherent Configuration
2.4 Evaluation
2.4.1 Benchmark Methodology
2.4.2 TCCluster Bandwidth
2.4.3 TCCluster Latency
2.4.4 TCCluster Message Rate
2.5 Conclusion

3 Scalable Networks-on-Chip

3.1 Introduction
3.2 NoC Design Space Exploration
3.2.1 NoC Topology
3.2.2 Communication Protocols
3.2.3 Reliability and Fault Tolerance
3.2.4
3.2.5 Virtual Channels
3.2.6 NoC Scalability
3.3 HTAX: A Novel Framework for NoCs
3.3.1 Layered Component Based Design Methodology
3.3.2 Design Components
3.4 Evaluation
3.4.1 Single Stage Switch Performance
3.4.2 Single Stage Switch Resource Consumption
3.5 Application Example: EXTOLL
3.5.1 Introduction to EXTOLL
3.5.2 EXTOLL NoC Integration
3.5.3 EXTOLL Implementation
3.6 Conclusion

4 Scalable Memory Extension for Symmetric Multiprocessors

4.1 Introduction
4.1.1 Related Work
4.2 Design Space Exploration
4.2.1 Existing Memory Architectures
4.2.2 Memory Subsystem Analysis
4.3 Implementation
4.3.1 HyperTransport Domain
4.3.2 Memory Controller
4.3.3 Prototype Implementation
4.3.4 RTL Design
4.3.6 BIOS Firmware
4.4 Evaluation
4.4.1 Evaluation Methodology
4.4.2 Memory Extension Benchmark
4.4.3 DMA Offloading Engine
4.5 Conclusion

5 Conclusion

5.1 Summary
5.2 Looking Back
5.3 Looking Forward

List of Figures

List of Tables

References

CHAPTER 1 INTRODUCTION

The past decades have shown a tremendous improvement in computer system performance. As a consequence, a modern mobile phone offers processing power equal to that of the largest mainframe computers of twenty years ago. The key enabler of this development is the constantly increasing number of transistors that can be placed inexpensively on integrated circuits. In fact, transistor density has doubled every two years for half a century. This trend, also described by Moore's Law [103], has enabled exponential improvements in many areas of integrated circuits, including advances in processor speed, memory capacity and the pixel density of digital cameras. However, although Moore's Law will remain intact for at least another 10 years, it will become increasingly difficult to transform the abundance of transistors into increased performance. Recent developments indicate that sustaining the pace of performance improvements enjoyed until today will hardly be possible.

So far, four important operating principles have been applied to transform additional transistors into greater performance: (1) increasing the operating frequency of the processor, (2) increasing the number of instructions that can be executed per clock cycle, (3) increasing the number of processing units per processor and (4) increasing the number of processors within a system. While these principles proved successful in the past, it has recently become evident that options (1) and (2) cannot be applied to future processor generations. Raising the processor frequency above current levels is unfeasible due to power consumption. The so-called power wall no longer permits increasing both the number of transistors and their frequency without exceeding the amount of heat that can be dissipated. Increasing the number of instructions per clock cycle, on the other hand, is limited by the instruction-level parallelism wall, i.e. the maximum number of instructions that a compiler can extract from a sequential application for parallel execution. Research has shown that most existing applications provide a maximum parallelism of three to four instructions per clock [77]. The remaining two operating principles are scaling the number of cores within a processor and scaling the number of processors within a system. Unfortunately, current architectures are not capable of scaling well in these regards. This is the main issue that needs to be addressed in order to improve performance, which is why this thesis discusses all aspects that lead to better system scalability.


1.1 LIMITATIONS OF SCALABILITY

The technique of increasing the clock frequency to boost performance has become impractical. Chips have reached a critical power consumption that does not permit a further increase of clock speeds. To understand the factors that lead to the power wall, it is necessary to analyze its main contributor, the dynamic power consumption of transistors,

P_d = C × V² × f,

where C represents the load capacitance, V the supply voltage and f the processor frequency. In the past, every transistor generation allowed the operating voltage to be scaled down, which helped to compensate for the increased power consumption of higher clock rates. By reducing the voltage by approximately 30% in every transistor generation, a 50% decrease of P_d could be achieved. Together with additional power saving techniques, this allowed the clock frequency to be increased even though the transistor count doubled every generation. At the beginning of the current millennium, however, it became evident that voltage scaling would slow down significantly, as a minimum amount of electrical charge (the number of electrons stored in a transistor) is needed to guarantee the stable operation of a circuit. As the transistor count, and thereby the power consumption, still doubles every two years following Moore's Law, clock frequency is even likely to decline in the future.
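The often-cited figure that a 30% voltage reduction roughly halves dynamic power follows directly from the quadratic voltage term; a short check, as a sketch with capacitance and frequency held constant for simplicity:

```latex
% Effect of scaling the supply voltage down by 30% (V' = 0.7 V), with the
% load capacitance C and the frequency f held constant for simplicity.
\begin{align*}
  P_d  &= C \cdot V^2 \cdot f \\
  P_d' &= C \cdot (0.7\,V)^2 \cdot f = 0.49 \cdot C \cdot V^2 \cdot f \approx 0.5\,P_d
\end{align*}
```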

A feasible solution to this problem is to increase the number of cores within a processor while decreasing the size and individual performance of each core. The motivation behind this approach is that many small cores provide better total performance than a small number of large cores. An explanation for this behavior is given by Pollack's rule [23], which states that the performance increase is roughly proportional to the square root of complexity. As an example, adding superscalar out-of-order execution capability to a core may increase its size by 100% while increasing its performance by only 40%. Although Pollack's rule may be oversimplified, as microarchitecture improvements certainly exhibit different ratios of performance increase per silicon area, it has proven applicable to most architectures in the past [23]. The paradigm shift towards many-core architectures that utilize a large number of small cores is further supported by the Berkeley View on "the Landscape of Parallel Computing" [9], which expresses the view of some of the most distinguished computer architects on future computing systems. In this report, the group predicts that upcoming computer systems will feature a heterogeneous design consisting of thousands of cores and accelerators.
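The 100%-area/40%-performance example quoted for Pollack's rule above can be checked with a back-of-the-envelope calculation; the sketch below treats die area as the complexity measure and ignores all other effects:

```latex
% Pollack's rule: core performance grows roughly with the square root of the
% invested area. Compare doubling one core's area with adding a second core.
\begin{align*}
  \text{Perf} &\propto \sqrt{\text{Area}} \\
  \text{one core, doubled area:}\quad & \sqrt{2} \approx 1.41
     \;\Rightarrow\; \text{about } 40\,\% \text{ faster} \\
  \text{two small cores (parallel work):}\quad & 2 \cdot \sqrt{1} = 2
     \;\Rightarrow\; \text{up to } 2\times \text{ the throughput}
\end{align*}
```

Spending the same silicon area on two small cores can thus, in the ideal case, double throughput rather than gain 40%, provided the workload is sufficiently parallel, which is exactly the assumption that Amdahl's law questions.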

Critics of the many-core approach refer to Amdahl’s Law [7] which explains the inability of many applications to perform well on said architectures. Amdahl’s Law defines a limit for the parallel speedup that can be achieved with many-core architectures,

Parallel Speedup = 1 / ((1 - P) + P / N),

where P represents the fraction of the application that can be executed in parallel and N equals the number of available processor cores. Figure 1-1 plots the parallel speedup of applications with different parallel fractions against the number of cores. It shows that many applications cannot benefit from multi-core, and even less from many-core, architectures. Although Amdahl's law has held so far, in the sense that architectures to date have never contained more than a handful of cores, Gustafson questions its relevance for many-core architectures in "Reevaluating Amdahl's Law" [62]. He argues that those architectures will not be used to accelerate present applications but instead to increase the problem size, which automatically leads to a higher degree of parallelism. Hill and Marty emphasize the importance of finding more parallelism in applications and of accelerating the sequential parts of an application to address "Amdahl's Law in the Multicore Era" [67].
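To illustrate how quickly the serial fraction dominates, here is a small worked example using the formula above; the parallel fractions are illustrative values, not measurements of any particular application:

```latex
% Amdahl's law evaluated for two parallel fractions at N = 1024 cores.
\begin{align*}
  \text{Speedup}(P{=}0.95,\,N{=}1024) &= \frac{1}{(1-0.95) + 0.95/1024} \approx 19.6 \\
  \text{Speedup}(P{=}0.99,\,N{=}1024) &= \frac{1}{(1-0.99) + 0.99/1024} \approx 91.2 \\
  \lim_{N\to\infty} \text{Speedup}(P)  &= \frac{1}{1-P}
     \qquad (20 \text{ for } P{=}0.95,\; 100 \text{ for } P{=}0.99)
\end{align*}
```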

Considering the increasing number of cores and functional units on microchips, the interconnect structure between the components is becoming increasingly important. As the sequential part of applications is largely determined by thread synchronization overhead, it is mandatory to focus on optimizing the communication hardware. Recent architectures have applied networks-on-chip (NoCs) to address this problem. However, designing efficient and fast NoCs is a complex and challenging task, and the design space is huge as every application has different requirements. To address this important challenge, a NoC design framework has been developed that will be presented in this work.


Figure 1-1: Amdahl's Law limits the parallel speedup¹

1. Source: Wikipedia, Wikimedia Commons, licensed under the Creative Commons license by Daniel, URL: http://creativecommons.org/licenses/by-sa/2.0/de/legalcode

The limitations that Amdahl's law imposes on multi-core architectures are also valid for multi-node systems. Multi-node systems represent a well-known technique for increasing processing power by aggregating several nodes into compute clusters. While providing tremendous processing capability, they depend even more on the ability to extract parallelism from applications. The network protocols used for inter-node communication within the cluster add a significant amount of overhead, which increases the cost of communication. In order to utilize future Exascale supercomputers efficiently, it will be mandatory to exploit more fine-grained parallelism, which requires a considerable improvement of the network interconnect. In particular, the optimization of communication latency will be of paramount importance, as it enables tightly coupled systems in which threads can synchronize their results much more efficiently. To address this issue we have developed the Tightly Coupled Cluster (TCCluster) mechanism, which can improve latency by an order of magnitude compared to traditional approaches. The key contribution of our approach is to virtually move the interconnect into the processor, which renders any intermediate bridging and protocol conversion unnecessary. In addition, the approach is very cost and power efficient as it reuses resources already available within the processors. As a result, TCCluster can increase the scalability of a multi-node system considerably.

Even the fastest processor is unable to deliver adequate performance if it cannot be supplied with sufficient data. Therefore, a key challenge of future high performance computer systems will be to scale the performance of the processor-memory interface (PMI). The PMI has been studied extensively [27][82] to address the divergence of processor and memory latency, also referred to as the memory wall, which emerged in the 1980s [133]. The solution to this problem included the introduction of a memory hierarchy [57] with different levels of caches that can absorb the high latency of the slow DRAM based main memory. The technique required increasingly large caches; however, it worked well until recently. Now, computer system architects are starting to face a new challenge: this time, memory capacity and bandwidth appear to be the limiting factors for increasing processor performance.
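The effect of such a cache hierarchy is commonly summarized by the textbook average memory access time (AMAT) model; the sketch below uses illustrative latencies that are assumptions, not figures from this work:

```latex
% Average memory access time (AMAT) for a single cache level in front of DRAM.
% The latencies and the miss rate below are illustrative assumptions.
\[
  \mathrm{AMAT} = t_{\mathrm{hit}} + m \cdot t_{\mathrm{miss}}
\]
% Example: t_hit = 2 ns, miss rate m = 0.05, miss penalty t_miss = 60 ns:
\[
  \mathrm{AMAT} = 2\,\mathrm{ns} + 0.05 \cdot 60\,\mathrm{ns} = 5\,\mathrm{ns}
\]
```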

Memory bandwidth is determined by the number of processor pins dedicated to the PMI and by their data rate. Current architectures already need to allocate a large amount of resources to the memory interface; for example, the AMD Magny Cours processor implements four 64-bit memory controllers that consume more than 50% of the package pins. It is expected that the pin count at the socket level will increase only moderately [75] and that raising the data rate of such a wide memory interface will be problematic due to the power consumption of the PMI. Already today, the memory controllers of the AMD Magny Cours processor consume 15 Watts, in addition to the power consumption of the individual DIMMs.
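As a rough illustration of the relation between pin count, data rate and bandwidth, consider one 64-bit DDR3 channel; the DDR3-1333 data rate is assumed here purely for the sake of the example:

```latex
% Peak bandwidth of one 64-bit DDR channel at an assumed 1333 MT/s and the
% resulting bandwidth per data pin (address, command and strobe pins ignored).
\begin{align*}
  B_{\mathrm{channel}} &= 64\,\mathrm{bit} \times 1333\,\mathrm{MT/s}
      \approx 10.7\,\mathrm{GB/s} \\
  B_{\mathrm{socket}}  &\approx 4 \times 10.7\,\mathrm{GB/s} \approx 42.7\,\mathrm{GB/s} \\
  B_{\mathrm{per\ data\ pin}} &\approx \frac{10.7\,\mathrm{GB/s}}{64}
      \approx 0.17\,\mathrm{GB/s} \approx 1.3\,\mathrm{Gbit/s}
\end{align*}
```

A serial transceiver lane signalling at several Gbit/s over a single differential pair thus offers a considerably higher bandwidth per pin, which motivates the interface proposed in chapter four.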


An important contribution of this dissertation, therefore, is the presentation of a technique that addresses the memory bandwidth and capacity problem. Our proposal utilizes a novel low pin-count, high frequency, transceiver based interface to attach the memory to the processor. The approach increases the bandwidth per pin significantly and, furthermore, supports memory hierarchies that allow the number of supported DIMMs to be scaled, effectively addressing the memory capacity problem.


1.2 SCALABLE COMPUTER SYSTEMS

Design for scalability is the new paradigm for future computer systems. As predicted in [85], upcoming Exascale systems will need to execute one billion threads concurrently. Such a degree of parallelism places extensive demands on the scalability of the processor microarchitecture, the inter-node system architecture and the interconnection network. Furthermore, a software environment is required that can extract fine-grained parallelism from applications and map it to the many resources provided by the hardware architecture. Figure 1-2 shows the different software and hardware layers that compose a scalable computer system. Although this work focuses on the three lowest layers, optimal scalability and system performance can only be achieved by looking at the complete system. Therefore, an overview of the tasks involved in the different layers and the challenges that need to be addressed therein will be presented.

Figure 1-2: Layers within Scalable Computer Systems (bottom to top: Microarchitecture, System Architecture, Network Architecture, Programming Model)

1.2.1 Microarchitecture

This layer defines the hardware architecture on the chip level. Every application specific integrated circuit (ASIC) represents a complex system that consists of a large number of components. As an example, the microarchitecture of an AMD Opteron processor core is shown in Figure 1-3. The core pipeline operates by fetching instructions from the cache and decoding them into micro operations simultaneously in three decode units. The micro operations are then packed and forwarded to the instruction control unit, which dispatches them to the execution units. Instructions are ready to execute as soon as their operands, which are retrieved by the load/store unit from the data cache, are available. The described core communicates with the main memory controller and other cores via an on-chip network implemented through a crossbar (X-BAR). In addition to the structure on the component level, the microarchitecture also defines the internal capabilities of the modules. In the scope of a processing engine this may include operating principles like branch prediction [120], superscalar processing [121] or simultaneous multithreading [88]. The microarchitecture defines the most important properties of a design, including the clock frequency, the power consumption, the degree of parallelism and the size of specific building blocks. On the microarchitecture layer, the main future challenge will be to efficiently integrate the ever increasing number of on-chip components. Novel architectures need to emphasize flexibility to be able to scale the number of components in every generation. NoCs represent a powerful mechanism to address this issue. Therefore, this dissertation presents a novel framework for the generation of flexible and scalable on-chip networks for general purpose architectures.

Figure 1-3: AMD Opteron Processor Microarchitecture¹


1.2.2 Node Architecture

This layer describes the system architecture of a compute node, which traditionally consists of a processor, memory and I/O devices. Symmetric multiprocessor (SMP) systems represent an advancement as they aggregate multiple processors into a single system. The most important components defined by the node architecture are the host processor interface and the memory subsystem. While the former enables multiple processors to share memory via a cache coherent interface, the latter specifies the data exchange between the processor and main memory. As memory access latency has improved at a much slower rate than processor performance, computer system architects have introduced a memory hierarchy. It consists of very fast components with limited size (registers) as well as slow components with high capacity (DRAM). The memory hierarchy accelerates performance because of the spatial and temporal locality of data within applications. While the concept of a memory hierarchy has reduced the mean access latency of memory, it has not solved the memory bandwidth and capacity problem. Upcoming many-core systems require a significant increase of memory capacity, which cannot be realized using current architectures. Extending the scalability of the memory subsystem - a key challenge in future computer systems - is the focus of chapter four of this dissertation. Therein, we introduce a cache coherent yet scalable technique to increase the memory capacity of compute nodes by a significant factor.

Figure 1-4: Node Architecture (four CPUs with attached DRAM, a southbridge with SATA and NIC, and a GPU)

1. Source: Wikipedia, Wikimedia Commons, licensed under the Creative Commons license by Appaloosa, URL: http://creativecommons.org/licenses/by-sa/2.0/de/legalcode


1.2.3 Network Architecture

This layer defines the architecture of the interconnection network that forms multiple compute nodes into a cluster. Today's largest computer systems consist of thousands of nodes interconnected by networks such as Ethernet or InfiniBand, while future Exascale systems will contain more than 200,000 nodes [85]. Therefore, a scalable and high performance network interconnect is the key component of such systems. The network architecture comprises many aspects including the topology, the routing algorithm, fault tolerance and the protocol. Key properties of a network are communication latency and bandwidth, as they define its performance to a large degree. The network protocols utilized by current supercomputers, including Cray's XT6 or Intel and AMD based clusters, require a protocol conversion in the network adapter, which causes a significant overhead. To address this issue we propose the Tightly Coupled Cluster (TCCluster) architecture, which virtually moves the network controller into the host processor. Our approach avoids protocol bridging and thereby improves latency and bandwidth by an order of magnitude, which increases the scalability of clusters significantly.

Figure 1-5: Network Architecture for Interconnecting Compute Nodes (each node comprises a CPU, DRAM and a NIC attached to a common network)


1.2.4 Programming Model

The programming model defines a set of techniques to express algorithms in a form that can be executed on computers. It comprises programming languages, compilers and libraries, but most importantly the theoretical concept behind them. The goal of a programming model has always been to optimize productivity while providing adequate performance. The traditional, sequential Von Neumann programming model has not been able to cope with the emergence of multi-core and multi-node architectures. As a result, a large number of novel parallel programming models have been proposed, which will be briefly outlined in the following. Although this work does not contribute to the field of programming models, their comprehension is required as they define many properties of the underlying hardware. As shown in Figure 1-6, programming models can be roughly categorized into distributed and shared memory models. Furthermore, there exist a number of novel approaches that try to combine the best of both worlds.

In distributed memory systems, threads communicate with each other over a network using message passing. The approach utilizes explicit messages for data transfer and synchronization, which is very efficient as the programmer has total control over the communication. While the message passing model provides good performance, it is the least productive due to the high burden it places on the programmer. Nevertheless, all of the large distributed clusters employ message passing because of its good scalability.

Shared memory based systems, on the other hand, aggregate the memory of multiple processors and present it to the application as a global, unified address space. Such systems are convenient to program as all data transfers are handled implicitly by the hardware. To enable this functionality, a cache coherency mechanism is required, which keeps the data consistent between all threads and memories. Unfortunately, these coherency mechanisms introduce considerable overhead, which penalizes performance in larger systems. As a result, the shared memory model performs well in systems that employ only a small number of cores; however, it does not provide adequate performance in larger systems due to its limited scalability.

12 1.2 Scalable Computer Systems

Transactional memory represents another programming paradigm that enables a shared memory address space. However, in contrast to shared memory systems, which keep the system consistent at all times whether required or not, the transactional memory model is optimistic. It assumes that most of the time threads will not interfere with each other because they access shared variables consecutively rather than simultaneously. To ensure correct results in cases where this assumption does not hold, a roll-back mechanism is provided. Threads therefore combine multiple memory accesses into transactions, which are only committed in the absence of conflicts. A conflict may exist, for example, when two threads write to a single shared variable simultaneously. In this case the later transaction fails and needs to be restarted and re-executed. As long as the number of conflicts is small this is a very efficient model, especially if specific hardware support is provided.
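As an illustration of the transaction and roll-back pattern described above, the following minimal sketch uses GCC's transactional memory language extension (compiled with -fgnu-tm); the account data is a hypothetical example and not related to this work:

```c
/* Sketch of the optimistic transactional memory pattern described above,
 * written against GCC's language-level TM extension (compile with -fgnu-tm).
 * The account array and amounts are hypothetical example data. */
#include <stdio.h>

static long balance[2] = { 100, 100 };

/* Both memory accesses form one transaction: either both take effect or,
 * on a conflict with a concurrent transaction, the runtime rolls the
 * transaction back and re-executes it transparently. */
void transfer(int from, int to, long amount)
{
    __transaction_atomic {
        balance[from] -= amount;
        balance[to]   += amount;
    }
}

int main(void)
{
    transfer(0, 1, 25);
    printf("%ld %ld\n", balance[0], balance[1]); /* prints: 75 125 */
    return 0;
}
```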

An approach that blurs the line between distributed and shared memory is the Partitioned Global Address Space (PGAS) model [134]. It presents a single, global, virtual address space to the application, which may be physically distributed over a large number of nodes. To permit high scalability, no cache coherency is maintained; however, specific hardware that enables efficient synchronization between threads may be supported. PGAS languages like UPC leverage the fact that not all memory needs to be kept consistent between threads at all times, but only at specific synchronization points defined by the application programmer.

Some high performance systems employ heterogeneous architectures that utilize different types of processors to accelerate specific applications. Such systems include clusters like Roadrunner [12], which combines x86 CPUs and IBM Cell processors, as well as systems that employ General Purpose Graphics Processing Units (GPGPUs). Programming such hybrid systems is very challenging, and a large number of software approaches have been proposed [90][95] to handle this additional complexity. While current GPGPU systems only achieve an efficiency of around 50%, hybrid accelerator based architectures may be the only solution to constrain power consumption in future Exascale systems.


It remains to be seen which programming models will prevail in the future. The transactional memory and PGAS programming models have not yet found broad acceptance. The popular shared memory programming paradigm will only persist if new cache coherency schemes can be developed that provide better scalability, which is doubtful. Message passing schemes provide the necessary scalability; however, they are inflexible and currently do not support hybrid systems efficiently. For the near future, hybrid programming models that use message passing for node-to-node communication and shared memory techniques such as OpenMP for node-internal data transfers represent the best solution.

Figure 1-6: Programming Models (message passing with per-CPU DRAM, shared memory with a single DRAM, and PGAS with distributed but globally addressable DRAM)


1.3 DISSERTATION METHODOLOGY

This work presents three novel approaches to enhance the scalability of future high performance computing systems. The dissertation is therefore partitioned into three main chapters that describe the individual techniques. All main chapters follow the same outline, starting with an introduction and a detailed design space exploration that defines the scope of the chapter. This is followed in each case by an implementation and an evaluation section as well as a concluding summary. The consistent outline facilitates orientation and comprehension.

Throughout this work we emphasize the importance of a thorough evaluation of the developed techniques. Each hardware mechanism presented herein has been implemented in synthesizable register transfer level (RTL) code and evaluated in a real system with the use of Field Programmable Gate Arrays (FPGAs). All developed code has been verified exhaustively and simulated in a cycle-accurate fashion. Although this methodology requires significantly more effort than an abstract simulation based approach, it provides considerable advantages. While a prototype implementation delivers results that correctly represent the performance of a real system, simulations always need to use abstracted models, which are prone to errors if not correctly designed. Black and Shen showed [11] that microprocessor simulators contain numerous bugs which distort results by 3% to 5%, challenging the validity of simulation based approaches. A cycle-accurate RTL model, in contrast, provides exact results. FPGA based prototypes provide another advantage by significantly increasing the verification and evaluation space. Due to the hardware acceleration, the test set can be several orders of magnitude larger, and by employing the FPGA in a host system, third-order effects like the influence of the operating system can be incorporated.


CHAPTER 2 SCALABLE MULTINODE CLUSTERS

2.1 INTRODUCTION

For many applications, the performance delivered by a single compute node is not sufficient. Applications in the fields of weather forecasting, fluid dynamics and military simulation demand much greater processing power, which can only be delivered by supercomputers like the ones in the Top 500 list [126]. More than 80% of these systems are so-called clusters that consist of off-the-shelf compute nodes integrated by a network interconnect, in most cases Ethernet or InfiniBand. While this approach scales well and enables large low-cost systems, its efficiency is limited due to the high overhead introduced by the network protocol. As a result, the application performance often amounts to only a fraction of the theoretical peak performance of the cluster.

To increase the efficiency of such systems, network communication bandwidth needs to be improved but, even more importantly, latency needs to be reduced. Current compute clusters consist of loosely coupled machines that communicate using large messages to evade this problem. While this method works for so-called embarrassingly parallel applications that communicate infrequently, it significantly limits the performance of workloads that exhibit fine-grained parallelism. To address this issue we propose an entirely new architecture called Tightly Coupled Cluster (TCCluster) [97]. The key idea of our solution is to exploit the processor's host interface as a network interconnect. As the host interface resides within the processor, our approach enables tightly coupled systems that can communicate with very high bandwidth and very low latency.

The technique neither requires modifications to the processor nor any additional hardware. Instead, it uses commodity off-the-shelf AMD processors and reconfigures the HyperTransport host interface as a cluster interconnect. Our approach is purely software based and is implemented entirely in BIOS firmware and a Linux kernel driver. In this work, we analyze the programming model and the software stack as well as the logical and physical implementation of such a design. Further questions addressed in this work include the routing, addressing and topology schemes that can be used. We present a detailed description of the tasks that need to be solved and provide a proof-of-concept implementation. The evaluation of our technique is conducted with the help of a two-node TCCluster prototype. For this purpose, BIOS firmware, a custom Linux kernel and a small message library have been developed. We present microbenchmarks that show a sustained bandwidth of up to 3420 MB/s for messages as small as 64 Byte and a communication latency of 240 ns between two nodes, outperforming other high performance networks by an order of magnitude.

2.1.1 Motivation

In High Performance Computing (HPC) there still exist Grand Challenges that demand more powerful, less expensive and less power-consuming machines than are currently available. Since Roadrunner [12] broke the petascale barrier in 2008, the HPC industry and scientists from all over the world have been competing to build the first Exascale computer. Analyzing the last ten TOP500 lists [126], a trend for HPC platforms can be observed: they are moving from big symmetric multiprocessor (SMP) machines towards large clusters consisting of thousands of interconnected nodes. The main reasons for this trend are that clusters scale better and that they can use commodity off-the-shelf components, which provide a much better price/performance ratio than specifically developed supercomputers.

Almost all clusters are x86 based and utilize processor hardware from either Intel or AMD. A cluster consists of several nodes, each comprising one or more processors, memory and input/output (I/O) devices. To form a cluster out of such nodes, a network interconnect is required. The traditional technology is Ethernet, which is increasingly being replaced by faster and more efficient interconnects like InfiniBand. While network interconnect technology has improved significantly over the years in terms of latency and bandwidth, it still represents the bottleneck within a cluster for many applications. For example, one of the fastest currently available implementations, the ConnectX InfiniBand adapter from Mellanox [123], achieves a bandwidth of 1.4 GB/s and an end-to-end latency of about 1.4 us. This is only a fraction of what modern processors are able to deliver at their chip interface. In an SMP system, multiple processors are interconnected by their native host interface, HyperTransport (HT) in the case of AMD or QuickPath Interconnect (QPI) in the case of Intel. Such interfaces are one order of magnitude faster, delivering bandwidths of 10 GB/s and access latencies as low as 50 ns [73].

In an SMP machine, threads can communicate very efficiently via shared memory. This requires a cache coherency mechanism [8] like MESI, which guarantees data consistency in the system at all times. While such a coherency model facilitates the programmability of shared memory systems, it dramatically limits their scalability. The coherency mechanism requires the exchange of cache state information between the processors, which limits the performance gain. The shared memory approach performs well for small-scale systems of up to 8 or 16 nodes; however, as the coherency overhead grows with the number of nodes, it cannot be applied to large clusters.

To overcome the disadvantages of the current approaches we propose a fundamentally new cluster architecture, which we refer to as Tightly Coupled Cluster (TCCluster). TCCluster uses the processor's host interface as an interconnection network, leveraging its high bandwidth and low latency characteristics while providing scalability to thousands of nodes without cache coherency. The high scalability is achieved by diverting the naturally coherent host interface between the processors from its intended use and operating it in a non-coherent fashion.

Typical architectures limit the use of non-coherent interfaces to accessing I/O devices like southbridges or coprocessors. Our technique, however, uses this interface as a cluster interconnect to circumvent the scalability limitation of coherent shared memory systems. In contrast to other architectures, it makes it possible to interconnect thousands of machines with bandwidth and latency characteristics that normally only an SMP can provide.

For the implementation of our ideas we chose the AMD Opteron processor over an Intel based system. The main advantage of AMD is that its processors use HyperTransport (HT) as their host interface, which is an open protocol. Furthermore, AMD provides comprehensive publicly available processor documentation in the form of the BIOS and Kernel Developer's Guide (BKDG) [2]. AMD also supports the coreboot project [102], which has the goal of developing a generic open source BIOS firmware for x86 platforms.


2.1.2 Related Work

Research on high performance interconnects that enable scalable parallel computer systems has a long history in computer science. Most approaches apply the message passing programming model, which uses explicit messages for inter-process communication. One of the very first milestones in this area was the Transputer [130], developed in the 1980s. It was specifically designed for parallel systems and combined the processor with four serial links on a single chip. The built-in network allowed for very efficient communication between multiple Transputers, which could be interconnected to form a so-called computing farm. In many aspects our approach is similar to the Transputer approach; however, we apply it to modern x86 based systems, which differ dramatically from the old Transputer design. Since then, processor manufacturers like Intel and AMD have abandoned the idea of incorporating network interconnects directly into their processors, although this would offer a dramatic increase in performance.

In [64], Joerg and Henry analyze the benefits of a tightly coupled processor-network interface that is realized through additional processor registers. Although this approach performs well, it requires a modification of the processor hardware, which is well known to be a very difficult endeavour.

More recent approaches like InfiniPath [41], the SeaStar interconnect of Cray's XT3 [25] and our own VELO technique proposed in [98] focus on optimizing the interface between the processor and the network interface. In all three approaches, AMD's processor host interface HyperTransport is used to reduce latency and to increase bandwidth. Although HyperTransport allows for a direct connection between the network adapter and the processor without intermediate bridging, these three approaches still exhibit a significantly higher latency than TCCluster due to the latency introduced by the network hardware and the protocol conversion. The network interface that can currently (2010) be regarded as the state of the art, as it offers very good performance, is the ConnectX InfiniBand adapter manufactured by Mellanox. It provides bandwidths of up to 40 Gbit/s per link and a latency as low as 1.4 us [123]. This technology will be used as the baseline for the evaluation of TCCluster.


In addition to message passing based systems, many research activities focus on scalable coherent shared memory systems. Such systems offer very good performance at small scale as they employ a much faster interconnect in the form of the host interface. Their biggest disadvantage, however, is their limited scalability. Various approaches address this issue, like the virtual shared memory architecture by Li [92] or the scalable coherent interconnect by Gustavson [62]. More recent approaches, like Horus [87] and 3-Leaf [1], target modern x86 based systems. They enable large Cache Coherent Non-Uniform Memory Architectures (ccNUMA) by extending AMD's HyperTransport protocol. By applying a directory based coherency mechanism, they can moderately increase the scalability to 32 nodes.

All of these approaches focus either on increasing the performance of the network interconnect or on increasing the scalability of shared memory systems. None has been able to combine the best of both worlds and to develop a truly scalable, tightly coupled architecture. To address these shortcomings, we present the TCCluster mechanism, which offers the bandwidth and latency of a shared memory based system combined with much better scalability in terms of system size.

2.1.3 Background

Although the principles of TCCluster can be applied to other processor families, for instance from Intel or ARM, we target AMD processors exclusively. AMD processors offer a significant advantage as they employ HyperTransport as their host interface, an open protocol specification maintained by the HyperTransport Consortium. Therefore, before presenting our approach, an analysis of the AMD Opteron architecture and the HyperTransport protocol is provided. The current AMD Opteron processors, Family 10h [33][2], implement several components on a single chip. In addition to the processing elements, called cores, a processor node comprises memory controllers, several levels of cache memory, up to four outgoing HyperTransport links and a crossbar that interconnects the different components.


Figure 2-1: AMD Opteron Chip Architecture (four cores, shared L3 cache, memory controller, I/O bridge, X-BAR and four HT links)

Figure 2-1 shows the chip architecture of the AMD Opteron Shanghai processor, which implements four cores with individual L1 and L2 caches and a large shared L3 cache utilized for inter-core communication. Furthermore, the processor chip contains an I/O bridge that converts between coherent and non-coherent HyperTransport packets, a DDR2 memory controller (MCT) and four HyperTransport links. The links can be used to communicate with off-chip devices such as other processors or I/O devices.

HyperTransport is a packet based protocol that offers low latency (approximately 50 ns per hop) and a unidirectional bandwidth of 12.8 GByte/s per link. It has low overhead and defines fault tolerance mechanisms on the link level. HyperTransport packets consist of a command header and payload data. The command header contains information about the type of the packet, the amount of appended data and the destination of the packet. HyperTransport defines a credit based flow control mechanism to avoid buffer overflows and to guarantee reliable transmission. Packets have a maximum size of 64 Bytes and travel in different virtual channels to avoid deadlocks. Packets are distributed into virtual channels based on their message class: write requests and broadcasts travel in the posted channel, while read request packets travel in the non-posted channel. The split-phase read response is transmitted in a separate packet, utilizing the response channel. HyperTransport links can be operated either in a coherent (cHT) or a non-coherent (ncHT) fashion. In a symmetric multiprocessor node the individual processors always communicate over coherent HyperTransport. Only I/O devices such as accelerators and network interface cards are connected using non-coherent HyperTransport. Coherent links add a fourth traffic class, and hence a fourth virtual channel, to transmit so-called probes. Cache coherency is required for multi-socket systems, which are available in two, four and eight socket configurations. The system shown in Figure 2-2 combines the physical memory modules connected to each processor into a single shared memory address space. Such a system requires a mechanism like the modified, exclusive, shared, invalid (MESI) protocol to ensure cache consistency. Cache consistency guarantees that data stored at a specific address which is shared by multiple cores has the same value in every cache at any point in time. Every time a data value is modified in a cache or loaded from main memory, the other cores that participate in the coherent domain have to be informed and probed for a response. The transaction can only be completed once all nodes have responded to the probes. While this technique allows multiple nodes to share a common memory address space, its scalability is limited. Increasing the number of nodes increases the number of probe messages proportionally. This negatively affects bandwidth and latency, as the transaction cannot complete before the last response has arrived. Furthermore, as the Opteron processor only contains four links, fully connected systems are only possible for two and four processor configurations. Larger systems have to utilize multi-hop topologies, which increase the latency even further.
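To make the credit-based flow control concrete, the following C sketch models a sender's view of the four virtual channels; the structure, the field names and the omitted serialization step are simplifications for illustration and do not reflect the actual HyperTransport implementation:

```c
/* Simplified sketch of credit-based flow control over virtual channels, as
 * used conceptually by HyperTransport. Names, channel count and buffer
 * accounting are illustrative assumptions only. */
#include <stdbool.h>
#include <stdint.h>

enum vc { VC_POSTED, VC_NONPOSTED, VC_RESPONSE, VC_PROBE, VC_COUNT };

struct link_tx {
    /* Credits advertised by the receiver: one credit == one free buffer slot
     * in the corresponding virtual channel on the far side of the link. */
    uint32_t credits[VC_COUNT];
};

/* A packet may only be sent if the receiver has buffer space; this is what
 * prevents buffer overflow without dropping or retransmitting packets. */
static bool try_send(struct link_tx *tx, enum vc channel)
{
    if (tx->credits[channel] == 0)
        return false;          /* stall: wait for a credit-return message  */
    tx->credits[channel]--;    /* consume one buffer slot at the receiver  */
    /* ... serialize command header and payload onto the link here ...     */
    return true;
}

/* Called when the receiver signals that it has freed n buffer slots. */
static void credit_return(struct link_tx *tx, enum vc channel, uint32_t n)
{
    tx->credits[channel] += n;
}
```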

Figure 2-2: AMD Based Multiprocessor System (four CPUs with local DRAM connected via cHT; southbridges, the HTX slot and I/O devices such as a NIC and a GPU attached via ncHT and PCI)


Non-coherent HyperTransport is used to integrate I/O devices into the system. This includes southbridges as well as directly connected accelerators and network devices. In [99] we propose "A HyperTransport NIC for Ultra-low Latency Message Transfer", which exploits the direct, low latency connection to the processor. The non-coherent HyperTransport protocol defines a reduced set of commands that are used to transport data via Programmed I/O (PIO) or Direct Memory Access (DMA). A typical system comprising multiple CPUs, two southbridges and a HyperTransport eXtension (HTX) slot is shown in Figure 2-2. All four CPUs are directly connected via cHT links. In addition, the system features two southbridge chips that are connected to the CPUs via ncHT links. These chips allow PCI Express, USB and SATA I/O devices, for example Graphics Processing Units (GPUs), to be attached to the system. The HTX slot is a PCI-like connector that can be used to attach ncHT devices with very low latency directly to the CPUs without any intermediate bridging.

2.1.4 Approach

This section introduces the basic operation of TCCluster. The design is based on the following observations:

• Coherent shared memory systems do not scale efficiently beyond a restricted number of end-nodes
• Better exploitation of parallelism in very large multi-node systems can only be achieved with fine grain communication schemes
• To support fine grain communication, the processor and the network interface need to be tightly coupled with very low latency
• Better coupling can only be achieved by removing the manifold protocol bridging that takes place in current computer architectures

A self-evident solution is to integrate the network interface on the processor die. However, due to the extra cost of this solution, no processor manufacturer currently supports this approach. In this work we introduce a solution that can be implemented on current processor architectures without any changes to the hardware. The TCCluster approach takes the native processor host interface, which already exists for other purposes, and repurposes it as a network interface. This technique allows the protocol and transceiver hardware within the processor chips to be reused, making it the most cost and power efficient solution. Omitting I/O bridges like PCIe and network protocol chips virtually moves the network controller into the processor and is expected to reduce the communication latency by an order of magnitude.

The scalability of the system is achieved by abandoning cache coherency. TCCluster introduces a non-coherent, globally shared address space in which packets are sent using remote store operations. Packets are routed through the network based on their address, which is embedded in the packet header. The complete mechanism is a purely firmware and software based approach. To discuss the requirements of TCCluster and to determine the issues that have to be solved for a successful implementation, a design space exploration is provided in the following sections.


2.2 DESIGN SPACE EXPLORATION

Unlike previous approaches, our solution does not require any intermediate NIC hardware and therefore offers superior performance at reduced cost. Connecting multiple processors directly via HyperTransport, however, raises a number of challenges. A design space exploration has to be performed that analyzes the programming model, the routing mechanism, the topology and the fault tolerance requirements of TCCluster. In addition, it will be discussed how a physical implementation can be realized. A further section examines the limitations of TCCluster and compares the technique to other network interconnects.

The design space exploration makes it possible to define the principal properties of TCCluster, which is required to develop a prototype implementation. It will be shown that a proof of concept can be realized consisting of two discrete Opteron nodes interconnected by a TCCluster link. Custom BIOS firmware, a modified Linux kernel and specific driver software have been developed to realize such a system.

2.2.1 Programming Model

To develop a suitable programming model for TCCluster, the capabilities of a non-coherent HyperTransport link have to be analyzed. HT defines three main types of message transactions: posted writes, non-posted reads and responses. While posted writes are completed from the sender's perspective as soon as they are transmitted, non-posted requests require some form of bookkeeping. In HyperTransport, reads are realized as split-phase transactions in which the read request is routed identically to a write request, but the response is not. Each read request creates an entry in the response matching table located in the northbridge and receives a tag. The matching response carries the same tag and can thereby be routed without having to carry an address. However, the number of these tags is limited to 32 and, furthermore, they are always mapped to a specific NodeID. TCCluster therefore does not support routing responses, restricting communication to writes, and consequently utilizes a remote store programming (RSP) model.


Figure 2-3: Remote Store Programming Model (processes on two nodes expose remotely writable windows within a global address space spanning the network)

Remote Store Programming

In the remote store programming model, applications can directly access remote nodes with store instructions, in addition to the regular load and store instructions that target local main memory. To enable efficient communication, remote stores need to be supported directly in hardware, which is accomplished by TCCluster. As remote load operations are not supported, the communication scheme is always two-sided. Figure 2-3 shows four processes on two compute nodes that use a partitioned global address space for communication with remote stores. Parallel programming paradigms can further be distinguished by analyzing three important characteristics [70]: the process model, the communication mechanism and the synchronization mechanism.

The process model. In RSP, each process has its own local, private memory. A process can make certain parts of its memory, a so-called window, accessible to remote processes. A remote process only has write permission to these windows; the local process can both read and write the remotely writable window. No cache coherency is applied to keep the memory consistent between different processes.


The communication mechanism. In RSP, processes communicate by directly writing into remote physical memory using the standard store instruction. A kernel level security mechanism guarantees that only specific memory windows are remotely writable. A process receives data by reading from remotely writable windows that reside in its local physical memory. Remote stores can be reordered in the processor pipeline.

The synchronization mechanism. In RSP, synchronization is implemented through software barriers. The required hardware function is a flush instruction which enforces that all local and remote stores have been completed. The network needs to support in-order delivery of barrier synchronization messages.
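On x86 processors the flush step maps naturally onto the store fence. The following C fragment is a minimal sketch of the two RSP primitives just described; it assumes the remotely writable window of the peer node has already been mapped into the local address space (for instance by a driver), and the function names are illustrative.

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>                 /* _mm_sfence() */

    /* Store a buffer into the peer's window word by word; the plain stores
     * are turned into posted HyperTransport writes by the hardware. */
    static void rsp_put(volatile uint64_t *remote_window,
                        const uint64_t *src, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            remote_window[i] = src[i];
    }

    /* Flush step of a software barrier: ensure all previously issued local
     * and remote stores are globally visible before signalling the peer. */
    static void rsp_flush(void)
    {
        _mm_sfence();
    }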

Given this description, the RSP model offers the following attributes:

• By ensuring that no process reads remote memory and by avoiding coherency overhead the RSP model is very lean and efficient
• Programmers are encouraged to exploit spatial locality by being forced to distribute data before processing it in local memory
• Communication and synchronization are explicit, giving programmers good control over the system
• Remote reads or gets are not supported. This avoids long read latencies, which is important for efficient execution as delayed read responses stall the CPU pipeline
• Reordering of writes optimizes performance
• RSP can be used as a basis for other programming paradigms like message passing

Message Passing on Top of RSP
RSP primitives can be used as a foundation for other programming paradigms like message passing. To support a message passing interface (MPI) protocol like MVAPICH [106], an underlying application programming interface (API) is required that enables sending and receiving of messages. In TCCluster, communication between processes is carried out through unidirectional connections using send and receive buffers. The receiver allocates its receive buffer first and communicates the physical address of this buffer to the sender. The sender then allocates a send buffer which points to this particular physical address. Sending is performed by writing to that specific address, resulting in a HyperTransport packet that is delivered to the receive buffer on the remote node. The send and receive buffers are managed as ring buffers. To prevent a buffer overflow, the read and write pointers of the receive buffer are periodically communicated to the send node using a lazy pointer scheme. Receiving of messages is implemented by polling the corresponding address on the receive node. As soon as a change in main memory is observed, the API can extract the data from the buffer, copy it into local main memory and increment the read pointer. There exists no hardware support for storing messages at the correct location within the receive buffer, hence, the sender is responsible for writing the packets to consecutive addresses. Therefore, it is impossible to share receive buffer queues between multiple endpoints, which may limit scalability for very large systems. However, using 4 KB receive buffers, 65 K endpoints can be supported with a memory footprint of a mere 256 MB. Remote stores can also be utilized to implement one-sided, rendezvous-like communication. This approach imposes an initialization penalty due to the registration overhead of the connection, however, it offers zero-copy messaging as the data is directly delivered to the correct memory location. The receiver can work with the data in place without another copy transaction. As a result, two-sided send-receive communication is applied for small messages, while larger messages are transmitted using the one-sided approach. This enables high performance message passing implementations.
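The following C sketch illustrates the ring-buffer scheme described above. The slot layout, the valid-flag convention and the way the mapped window pointers are obtained are assumptions of this sketch, not the literal prototype code; the fence between payload and flag is needed because remote stores may be reordered.

    #include <stdint.h>
    #include <xmmintrin.h>      /* _mm_sfence() */

    #define SLOT_SIZE 64        /* one fixed-size message slot (bytes)        */
    #define NSLOTS    64        /* 64 * 64 B = 4 KByte ring, as in the text   */

    struct tcc_ring {
        volatile uint8_t  *send_win;  /* peer's receive buffer, mapped as MMIO     */
        volatile uint8_t  *recv_win;  /* local receive buffer (uncachable DRAM)    */
        uint32_t           wr, rd;    /* local write / read slot counters          */
        volatile uint32_t *peer_rd;   /* peer's read counter, mirrored here lazily */
    };

    /* Sender: write the payload to consecutive addresses of the remote buffer,
     * then set the valid flag; the fence keeps the flag from overtaking the
     * payload. */
    static int tcc_send(struct tcc_ring *r, const uint8_t msg[SLOT_SIZE - 1])
    {
        if (r->wr - *r->peer_rd >= NSLOTS)        /* ring full, lazy pointer stale */
            return -1;
        volatile uint8_t *slot = r->send_win + (r->wr % NSLOTS) * SLOT_SIZE;
        for (int i = 0; i < SLOT_SIZE - 1; i++)
            slot[i] = msg[i];
        _mm_sfence();
        slot[SLOT_SIZE - 1] = 1;
        r->wr++;
        return 0;
    }

    /* Receiver: poll the local buffer; a set flag indicates a new message. */
    static int tcc_recv(struct tcc_ring *r, uint8_t msg[SLOT_SIZE - 1])
    {
        volatile uint8_t *slot = r->recv_win + (r->rd % NSLOTS) * SLOT_SIZE;
        if (slot[SLOT_SIZE - 1] == 0)
            return -1;                            /* nothing has arrived yet       */
        for (int i = 0; i < SLOT_SIZE - 1; i++)
            msg[i] = slot[i];
        slot[SLOT_SIZE - 1] = 0;                  /* clear flag for the next lap   */
        r->rd++;
        /* A complete implementation would lazily push r->rd back to the sender. */
        return 0;
    }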

2.2.2 Routing

The routing algorithm defines the path that is followed by each message sent through a network. Routing algorithms are tightly coupled to the underlying network topology and define various properties of a network including packet latencies, congestion behavior and deadlocks. There exists an almost endless number of routing algorithms; good summaries are given by Duato and Yalamanchili in [45] and by Dally and Towles in [39]. To find a routing algorithm for TCCluster, only a subset of the possible approaches needs to be analyzed as the capabilities of the hardware represent a limiting factor. Our approach uses off-the-shelf processors, hence, it is necessary to choose a mechanism that is compatible with this hardware. In the following, an analysis of the Opteron northbridge and its routing capabilities will be provided.


Routing Capability of the Opteron Processor
The Opteron northbridge utilizes different table based interval routing schemes (IRS) to deliver packets within the HyperTransport fabric. In IRS, multiple endpoints are aggregated into an interval and mapped to a specific outgoing link. This enables scalable systems as the interval routing table only requires an entry per interval, in contrast to routing tables which contain an entry for each endpoint in the network. In the Opteron architecture, depending on the packet type, one out of three tables is queried to look up the outgoing link for a specific packet. All memory accesses initiated by processor cores are routed in the coherent HyperTransport fabric and directly carry the target node identifier (tgtNodeID) in their packet header. In such cases the nodeID-to-link table returns the required information.

Memory writes from I/O space or processor writes to memory mapped I/O are sent as non-coherent HyperTransport packets. Non-coherent devices, which may either be the source or the target of such packets, do not carry a nodeID and thereby utilize another routing scheme. In principle, an Opteron system supports an arbitrary number of non-coherent devices which are accessed through the system's memory address space that offers a total size of 256 terabytes. Non-coherent HyperTransport packets contain the target memory address in their packet header. In every routing stage a table lookup is performed which compares the address against eight different memory address ranges that can be defined individually. Each address range points to a tgtNodeID which in turn can be used to determine the outgoing link.
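For illustration, this lookup can be modeled as a scan over the eight configurable ranges. The structure fields and the function below are illustrative; the actual comparison is performed in the northbridge hardware against its base/limit register pairs.

    #include <stdint.h>

    /* One of up to eight individually configurable address ranges. */
    struct addr_range {
        uint64_t base, limit;   /* inclusive physical address range          */
        uint8_t  tgt_node;      /* NodeID that owns this range               */
        uint8_t  link;          /* outgoing HyperTransport link towards it   */
        uint8_t  valid;
    };

    /* Return the outgoing link for a non-coherent packet, or -1 if no range
     * matches the packet's target address. */
    static int route_noncoherent(const struct addr_range tbl[8], uint64_t addr)
    {
        for (int i = 0; i < 8; i++)
            if (tbl[i].valid && addr >= tbl[i].base && addr <= tbl[i].limit)
                return tbl[i].link;
        return -1;
    }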

The third type of packets are non-coherent read responses which utilize a source tag based routing mechanism. In HyperTransport, read requests are implemented using split phase transactions. The read request travels in the non-posted channel and carries a destination memory address that enables it to be routed identically to posted requests. In addition, each read request generates an entry in the source tag matching table in each routing stage. The corresponding read response carries the same tag information, and by performing a table lookup the outgoing link can be determined which, in turn, delivers the response to the originating node.


The Opteron northbridge also supports translation of coherent packets into non-coherent packets and vice versa using its I/O bridge. If a packet is received from an I/O link and targets main memory, it is forwarded to the I/O bridge which converts the non-coherent packet into a coherent packet. A coherent packet is either forwarded to the on-chip memory controller or to an outgoing HyperTransport link in the case of the physical memory address residing on another node. Coherent packets that target an I/O link are also forwarded to the I/O bridge, which converts them into non-coherent packets and forwards them to the corresponding I/O link. Non-coherent packets originating at an I/O link that target another I/O link are simply forwarded without bridging. As TCCluster only supports non-coherent posted write transactions, only the corresponding routing mechanism is relevant. Consequently, TCCluster uses table based interval routing supporting up to eight table entries.

Interval Routing Scheme
Interval routing was introduced by Santoro and Khatib [117] and generalized by Leeuwen and Tan [91] to be applicable to arbitrary networks. Its main advantage over traditional table based routing is the compactness of its routing tables. In the interval routing scheme (IRS) all endpoints that can be reached over a link are grouped together in an interval. The routing tables, therefore, require k entries where k is the number of intervals mapped to an outgoing link.

In the following we will provide some definitions concerning interval routing. The underlying network topology for interval routing can be described as a graph

G = ( V , E ),

V being the set of nodes in the graph and E the set of bidirectional edges between the nodes. (x, y) ∈ E and (y, x) ∈ E are defined as the unidirectional arcs between the nodes x and y. The interval labelling scheme (ILS) defines a one-to-one labelling for each v ∈ V. For each (x, y) ∈ E there exists an arc-labeling which defines the nodes that can be reached over this arc. As all intervals are consecutive, non-contiguous sets of destinations can only be expressed by mapping multiple intervals to an arc. Figure 2-4 shows a node labeling scheme and an arc labeling scheme for a simple network.



Figure 2-4:Interval Labeling Scheme for a Simple Network

It is interesting to analyze the topologies that can be effectively supported with interval routing. Although Leeuwen and Tan have shown that there exists a valid IRS for every network topology [91], there is no general way to determine a compact and optimal routing scheme for arbitrary topologies. The compactness of an IRS is defined as the maximum number of intervals that are mapped to an arc. A routing scheme with k intervals per arc is denoted as k-IRS. An IRS is optimal if all routes in the network use a shortest path. Another important property is the linearity of an IRS. In a linear IRS (LIRS) all arcs may only define linear routing intervals that do not contain any cycles. The interval [5, 1], for example, would not be linear as it contains a wrap-around. IRS can, furthermore, be classified into strict and non-strict schemes. An IRS is strict (SIRS) if no interval assigned to the edges of a node contains the label of that node. For this reason, the ILS shown in Figure 2-4 is not strict, as the interval of node 3 includes itself. Depending on linearity and strictness there exist four different IRS variants: k-IRS, k-LIRS, k-SIRS and k-SLIRS. Furthermore, optimal routing schemes are denoted with an appended "*", which leads to k-IRS*, k-LIRS*, k-SIRS* and k-SLIRS*. In the following, topologies are analyzed that can be mapped to TCCluster and which support a 1-SLIRS* IRS. Therefore, a summary of related work will be given that discusses the application of IRS to different popular topologies. A more detailed survey on interval routing can be found in [55].

Meshes
The k-ary n-dimensional Mesh is a direct network that has n dimensions and k_0 × k_1 × … × k_{n-1} nodes, with k_i nodes along each dimension i, where k_i ≥ 2 for 0 ≤ i ≤ n-1. Each node is identified by n coordinates (x_{n-1}, x_{n-2}, …, x_1, x_0) where 0 ≤ x_i ≤ k_i − 1 for 0 ≤ i ≤ n-1. Two nodes x and y are neighbors if and only if y_i = x_i for all i, 0 ≤ i ≤ n-1, except one, j, where y_j = x_j ± 1. Thus, nodes have from n to 2n neighbors, depending on their location in the Mesh and, hence, the Mesh is not regular [44]. Meshes are a common topology used in many commercial network implementations, for example in Intel's Paragon [50] machine.

Meshes support interval routing schemes; in particular, Bakker, Leeuwen and Tan prove in [10] that n-dimensional Meshes belong to the class of 1-SLIRS*. Furthermore, it is interesting to analyze the deadlock-freeness of IRS for Meshes. Zerrouk proved in [135] that every network supporting a 1-SLIRS* also supports a deadlock-free 1-SLIRS*, however, only if the network supports multiple virtual channels. TCCluster does not provide support for changing the virtual channel within a route, but fortunately, in [136] it is shown that for n-dimensional Meshes there exist 1-SLIRS* that are deadlock-free using dimension order routing with a single virtual channel. This is an important result as it implies that there exists a scalable topology for TCCluster that provides minimal routing paths as well as deadlock-freeness using the interval routing technique.

Figure 2-5:a) 4-ary 2-dimensional Mesh, b) 3-ary 3-dimensional Mesh

Hypercubes
A Hypercube is an n-dimensional Mesh with the additional properties of being regular and symmetric. It is also referred to as the binary n-cube. An n-dimensional Hypercube contains 2^n nodes and has 2^(n-1) · n edges. The scalability of Hypercubes is limited as each dimension adds an additional link to each node. Nevertheless, there exists a range of successful implementations, most notably the Connection Machine [68] by Thinking Machines which interconnected 65,536 processors using a Hypercube topology. The Hypercube offers favorable properties regarding interval routing as it belongs to the class of 1-SLIRS* [51]. Furthermore, there exist deadlock-free routing schemes for the Hypercube topology that do not require multiple virtual channels [136]. The p-cube routing algorithm [56] and the e-cube algorithm both provide minimal, deadlock-free routing in Hypercubes. In both mechanisms packets are first routed in the lowest dimension, then step-by-step in the next higher dimensions and only in the last step in the highest dimension. These properties turn the Hypercube into another excellent topology candidate for TCCluster.

Figure 2-6:4-Dimensional Hypercube

Cubes
A k-ary n-cube or Torus is a Mesh with additional wrap-around links. Through the addition of these channels the cube gains regularity and symmetry. Every node in a cube has n neighbors if k = 2 and 2n neighbors if k > 2. A k-ary cube with a single dimension is referred to as a ring. A two-dimensional k-ary cube is also known as a grid. A k-ary n-cube contains k^n nodes. The additional wrap-around links double the bisection bandwidth in comparison to Meshes and halve the maximum diameter of the network.

Tori are used in many commercially available networks including the Cray XT supercomputer family. Cray's proprietary SeaStar [25] interconnect provides six links per node to enable highly scalable 3-dimensional cubes. Another prominent example of the 3D-Torus topology is the IBM BlueGene [4]. Tori are particularly interesting as, from a deployment perspective, their complexity is comparable to that of Meshes while they provide a significantly better bisection bandwidth and a lower diameter.


In contrast to the previous topologies, the IRS properties of Tori depend on k. For k < 5 the k-ary n-cube belongs to 1-SLIRS*, while for k ≥ 5 it belongs to 2-SLIRS* [51]. While the overhead of two routing intervals per arc is in principle tolerable, for TCCluster it represents a limitation. As TCCluster only supports eight address intervals, for k ≥ 5 the maximum number of links that can be supported per node is four, hence TCCluster would be restricted to a 2D-grid. By restricting k to 4, 4-ary 4-dimensional cubes with a maximum of 256 endpoints can be realized. For this size the cube represents a feasible alternative for TCCluster thanks to its low diameter and high bisection bandwidth. Another problem of large k-ary n-cubes is that dimension order routing is no longer deadlock-free, as the wrap-around links introduce cycles even for packets that travel in a single dimension. For n-cubes with k > 4, deadlock-freeness can only be achieved through the introduction of virtual channels as proposed by Dally [38] or by applying a turn model based routing algorithm as introduced by Glass [56]. In the turn model, certain turns within the network are prohibited, which breaks the cycles. The approach requires neither additional links nor virtual channels, however, it sacrifices the shortest path for some routes, which leads to a non-minimal routing scheme. In the virtual channel approach, packets are assigned a new virtual channel as soon as they pass a wrap-around link. As packets in different virtual channels can overtake each other, the cycle is broken. However, this approach is not feasible for TCCluster as HyperTransport packets cannot change their virtual channel while traversing the network.


Figure 2-7:a) 4-ary 2-dimensional Cube, b) 3-ary 3-Dimensional Cube

Other Direct Topologies
Many other topologies have been proposed. However, none of them appears to be a good fit for TCCluster. One popular topology is the tree. It defines a root node with n descendant nodes which in turn can have further descendants. A node with no descendants is called a leaf node. A tree with a tree-width of 4k belongs to k-SLIRS as shown by Bodlaender et al. in [20]. Trees are deadlock-free as they contain no cycles, and for balanced trees the distance from every leaf node to the root is identical. The drawback of trees is that the root node and adjacent nodes become a bandwidth bottleneck as all traffic from the left to the right side needs to pass the root node. This disadvantage can be eliminated by Fat-Trees which utilize multiple or higher bandwidth links to connect the root node. However, as Opteron processors only provide a fixed number of links with equal bandwidth capacities, Fat-Trees cannot be supported. Hence, tree topologies are not the preferred choice for TCCluster.

A topology that supports a very high bisection bandwidth and a minimal network diameter is the fully connected graph. However, in this topology the routing tables get very large as one entry for each node in the network is required. It is obvious that the topology is not scalable due to the number of links required for large networks. Another topology that seems promising is the star Mesh. It is based on a 2D-Mesh topology extended with additional diagonal links to reduce the network diameter and to increase bisection bandwidth. However, as the star Mesh is ∉ 1-SLIRS it is an unsuitable topology for interval routing [55].

Figure 2-8:a) Tree, b) Fully-connected Graph, c) Star Mesh

Interval Labeling Schemes
One challenge of interval routing is to label the nodes correctly, so that all messages reach their destination in finite time without any cycles in their route. For n-dimensional grid networks, which are a good fit for TCCluster, Leeuwen et al. provide an algorithm that generates a valid, compact and optimal ILS [91]. This ILS, however, ignores deadlock behavior. Therefore, we propose a modified interval labeling algorithm that is based on dimension order routing and which eliminates deadlocks completely. Let G be an n-dimensional Mesh with the dimensions D_0 … D_n and the elements E_0 … E_n within these dimensions. Label the nodes consecutively by traversing the dimensions in reverse order, starting with D_n. Each time dimension D_i has been traversed completely, proceed in the next dimension D_{i-1} and continue labeling the nodes consecutively. Figure 2-9 shows valid interval labeling schemes produced with this algorithm that are deadlock-free for dimension order routing. From these labeling schemes the interval routing tables for each node can be deduced. The routing tables for the network shown in Figure 2-9 a) are presented in Table 2-1. For each node, the table lists the four links, denoted by X+, X-, Y+, Y-, and the intervals that are mapped to these links.
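The labeling rule can be expressed in a few lines of C. The sketch below assumes a small two-dimensional example; it numbers the nodes with the highest dimension varying fastest and reproduces the 3x3 labeling of Figure 2-9 a).

    #include <stdio.h>

    /* Sketch of the labeling rule described above: nodes are numbered
     * consecutively by traversing the dimensions in reverse order, i.e. the
     * highest dimension varies fastest. The extents below are illustrative. */
    #define NDIM 2
    static const int k[NDIM] = {3, 3};      /* extent of each dimension         */

    static int node_label(const int coord[NDIM])
    {
        int label = 0;
        for (int d = 0; d < NDIM; d++)      /* ((x_0 * k_1 + x_1) * k_2 + ...)  */
            label = label * k[d] + coord[d];
        return label;
    }

    int main(void)
    {
        /* Prints the 3x3 labeling of Figure 2-9 a): columns 0-2, 3-5, 6-8. */
        for (int y = 0; y < k[1]; y++) {
            for (int x = 0; x < k[0]; x++) {
                int c[NDIM] = {x, y};
                printf("%2d ", node_label(c));
            }
            printf("\n");
        }
        return 0;
    }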



Figure 2-9:a) 2D-Mesh labeling scheme, b) 3D-Mesh labeling scheme

Table 2-1: Interval Routing Table for 3x3 Mesh

Channel   Node   Interval   Node   Interval   Node   Interval
X+         0     3-8         3     6-8         6     -
X-         0     -           3     0-2         6     0-5
Y+         0     -           3     -           6     -
Y-         0     1-2         3     4-5         6     7-8
X+         1     3-8         4     6-8         7     -
X-         1     -           4     0-2         7     0-5
Y+         1     0           4     3           7     6
Y-         1     2           4     5           7     8
X+         2     3-8         5     6-8         8     -
X-         2     -           5     0-2         8     0-5
Y+         2     0-1         5     3-4         8     6-7
Y-         2     -           5     -           8     -


2.2.3 Topology

An adequate network topology needs to be defined for TCCluster. Therefore, five topology characteristics that are required for a comparison will be defined. These attributes determine important properties of the network including the communication bandwidth and latency.

• Node degree: The number of links per node
• Diameter: The maximum distance between any two nodes in the system
• Bisection Bandwidth: The combined bandwidth of all links which are cut when dividing the network into two halves
• Symmetry: A network is symmetric if it looks the same from every node
• Regularity: A network is regular if each node has the same degree

As shown in the previous paragraphs, n-dimensional Meshes, Hypercubes and k-ary n-cubes represent feasible topologies for TCCluster. Using the characteristics defined above, a comparison of the different topologies can be made. The results are shown in Table 2-2. Due to its high bisection bandwidth and low diameter, the k-ary n-Cube appears to be the preferred candidate. However, as discussed in the last paragraphs, only systems with k < 5 are feasible due to the deadlock liability of larger cubes. Therefore, for larger systems the Mesh and the Hypercube topologies are applied.

Table 2-2: Topology Characteristics

Topology       Node degree   Diameter     Bisection Bandwidth   Symmetric   Regular
k-ary n-Mesh   2n            n(k-1)       k^(n-1)               no          no
k-ary n-Cube   2n            n⌊k/2⌋       2k^(n-1)              yes         yes
Hypercube      n             n            2^(n-1)               yes         yes

Based on these findings, the Opteron architecture will now be analyzed in more detail in order to develop a concrete implementation. The current high performance server architecture from AMD, named Magny Cours, supports 12 compute cores on a single processor socket [33]. Magny Cours is a multichip module that deploys two six-core Opteron processors in a single package. The six-core processors share the same architecture as the four-core Opteron presented in Figure 2-1 and are interconnected on the package by two HyperTransport links. Magny Cours contains four memory channels that provide an aggregated streaming bandwidth of up to 100 GByte/s. Most important for TCCluster is the fact that Magny Cours features four 16 bit HyperTransport links that support link-unganging. Link-unganging allows a 16 bit link to be split into two 8 bit links, which effectively increases the link count to eight. Figure 2-10 shows the architecture of the processor. It shows the two internal nodes P0 and P1 which are interconnected through one 8 bit and one 16 bit link. One of the shown external HyperTransport links is unganged into two 8 bit links. All HyperTransport links can operate at 3.2 GHz double data rate. Furthermore, each node implements a two channel DDR3 memory controller.


Figure 2-10:AMD Magny Cours processor

The Magny Cours processor as shown in Figure 2-10 can be used as a basic building block for TCCluster. Each link can be used to connect to further processors, thereby forming a large network. Although the processor provides eight 8-bit HyperTransport links, a maximum of seven links is available for TCCluster, as at least one link is required to connect to a southbridge, which is needed to access I/O devices attached over PCIe, USB or SATA. Even in systems that do not require I/O, the link to the southbridge is required to fetch BIOS firmware code to initialize the system. Utilizing a node degree of seven, 3D-Meshes and 3D-Tori as well as 7-dimensional Hypercubes can be realized.

To reduce cost and to save valuable HyperTransport links, we introduce the notion of Supernodes. A Supernode consists of two to eight processors which are interconnected by coherent HyperTransport links and form a shared memory system. All nodes within a Supernode can share a single southbridge. This approach enables a more efficient utilization of HyperTransport links for TCCluster and hence provides better performance. A Supernode is completely transparent to the TCCluster system and appears like a single node. This concept also enables a hybrid programming model, as threads within the Supernode can communicate via shared memory utilizing the coherent HT fabric while communication between Supernodes is carried out through the non-coherent TCCluster fabric.

A Supernode consisting of two Magny Cours processors is shown in Figure 2-11. Such a two node configuration opens up several possible interconnection schemes. In particular, the number of links as well as the link bandwidth needs to be distributed between internal coherent links and TCCluster links. Furthermore, it is possible to aggregate multiple 8-bit links to increase bandwidth or to maximize the degree of the network. It is important to note that internal links carry both coherent traffic between the processors in the Supernode and remote store traffic which is injected via TCCluster links. Depending on the communication scheme of the application that runs on TCCluster, external traffic can thereby reduce the internal bandwidth of a node, leading to congestion. It is, therefore, recommended to assign a higher link bandwidth to the internal links than to the TCCluster links. This enables computation progress within a node even if the TCCluster links inject traffic into the Supernode at their maximum rate. It also enables exploitation of spatial locality as the communication bandwidth between threads within a Supernode is generally higher.



Figure 2-11:Two-processor Supernode

The Supernode shown in Figure 2-11 implements two Magny Cours processors and offers 24 cores in total. Internally, the processors are interconnected through two 16-bit HyperTransport links; six 8-bit links are assigned as TCCluster links. To reduce the hop count for internal communication, the 16-bit links have been unganged to enable a direct interconnect between the four processors P0 … P3 which are distributed over the two Magny Cours dies. The topology offers an internal bisection bandwidth of 51.2 GByte/s and a maximum distance of one. In this topology one eight bit link on the left node is unused. It can be utilized to connect to an I/O device or be reconfigured as a TCCluster link, enabling an additional seventh link. The two-processor Supernode allows building 3D-Meshes, 3D-Tori and 7-dimensional Hypercubes.

A different two-processor Supernode is shown in Figure 2-12. It offers a TCCluster node degree of 10, which enables 5-dimensional Meshes and Tori as well as 10-dimensional Hypercubes. This topology enables highly scalable TCClusters with a very moderate diameter. However, for shared memory applications that utilize the coherent HyperTransport fabric intensively, the internal links can represent a bottleneck.



Figure 2-12:Two-Processor Supernode with Maximum Node Degree

Figure 2-13 shows a four processor, 48 core Supernode. The topology optimizes internal connectivity by assigning a large portion of the HyperTransport links to the internal coherent fabric. The internal links are unganged into x8 links and are routed crosswise to offer a minimal distance between all processors. Thereby, it provides a bisection bandwidth of 104.4 GByte/s, a maximum distance of 2 and an average distance of 1.5. It offers seven external TCCluster links for 3D-Meshes, 3D-Tori and 7-dimensional Hypercubes.



Figure 2-13:Four Processor Supernode

2.2.4 Fault Tolerance

The reliability of large computer systems is of paramount importance. Nevertheless, each integrated circuit and each transmission line is prone to physical errors. These errors can be classified into two different groups: hard errors and soft errors. A hard error is a permanent failure that can only be corrected with redundant hardware replacing the defective part. Soft errors, also referred to as single event upsets (SEU), are much more frequent in integrated circuits. They are caused by external hazards such as cosmic rays or alpha particles flipping individual bits, which leads to corrupted data. All integrated logic circuits and transmission lines are, in principle, affected, however, semiconductor memories like static and dynamic RAMs are most vulnerable. As technology scales into the deep submicron regime, the error rate and the probability of soft errors increase. Due to the lower operating voltages required to limit power dissipation and due to the reduced size of the gate area, the number of electrons representing a bit constantly decreases. This allows the plentiful low energy particles that surround us to generate enough parasitic charge to cause a soft error. Shivakumar et al. model the probability of soft errors within a chip, projecting an increase of nine orders of magnitude between the 600 nm technology and future technologies [119]. Furthermore, the error rate of compute clusters scales with the size of the system. Integrated circuits and network transmission lines may have a very low bit error rate (BER), however, for large systems the BER adds up to unacceptable levels. As applications are partitioned over all nodes of a cluster, a single bit error is sufficient to compromise the consistency of the complete system. In current chip technologies it is, therefore, absolutely necessary to employ error correcting code mechanisms to guarantee an adequate degree of reliability.

The theory of error correcting codes, as originated by R. W. Hamming in the 1950s [63], is comprehensive. Therefore, the discussion focuses solely on the reliability mechanisms that can be supported in TCCluster and their consequences for scalability. As TCCluster utilizes HyperTransport as the network protocol, it is required to analyze the fault tolerance mechanisms it provides. Within HyperTransport 3, individual packets are protected using a 32-bit cyclic redundancy check (CRC). The CRC [112] is a hash function which is applied to the bits of a packet. It allows accidental changes (errors) to the data within a packet to be detected. Therefore, the CRC is calculated by the sender and attached to the end of a packet. The receiver performs the same calculation on the packet and compares the result with the appended CRC. This allows errors that occur during the transmission of the packet to be detected. To compute the packet CRC, a polynomial division is performed by XOR-ing the data packet with a generator polynomial. The polynomial is a sequence of ones and zeros and its form defines the properties and error detection capability of the CRC. HyperTransport utilizes the standard IEEE 802.3 32-bit polynomial to protect the data portion of the packet as well as the command header. As shown by Koopman in [86], this polynomial achieves a Hamming distance of five for packets of a maximum length of 2975 bits, while HyperTransport packets have a maximum size of 608 bits. For smaller HyperTransport packets (172-268 bits) the Hamming distance is six and for a minimum sized packet like a read request the Hamming distance is seven. This implies that, for all HyperTransport packet sizes, up to four individual bit errors can be detected. Furthermore, all burst errors with a maximum length of 31 can be reliably detected.
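For illustration, the following C routine performs the underlying polynomial division as a generic bitwise CRC-32 using the IEEE 802.3 polynomial. The seed, bit ordering and the exact set of bits covered by the HyperTransport per-packet CRC are defined by the HyperTransport specification and are not reproduced here.

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32, IEEE 802.3 polynomial 0x04C11DB7 (reflected form
     * 0xEDB88320), as used e.g. for Ethernet frames. */
    static uint32_t crc32_ieee(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++)      /* divide bit by bit */
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)(-(int32_t)(crc & 1)));
        }
        return ~crc;                               /* final inversion   */
    }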

In the case of an error, HyperTransport provides a retry mechanism to recover and to return the system to a consistent state. Therefore, the sender stores a copy of each packet in the so-called retransmission buffer. The receiver acknowledges each successfully received packet and, in the case of an error, requests a retransmission of the corrupted packet. As long as occurring bit errors can be reliably detected by the CRC, the approach is perfectly reliable. The number of bit errors that occur within a single HyperTransport packet depends on the BER of the channel, which is affected by many factors: the length (signal attenuation) of the channel, the impedance mismatch of the channel and crosstalk from other signals. All of these factors that impact the signal integrity are caused by the environment, which consists of the traces on the PCB, the connectors and the cabling which implement the TCCluster link. The BER is, furthermore, determined by the signal frequency of the TCCluster link. At frequencies beyond 2 GHz, extensive pre-emphasis and receiver equalization techniques are required to enable operation. In the evaluation chapter, TCCluster links running at different frequencies will be evaluated regarding their BER in a prototype system. It will be shown that the links can be reliably operated at data rates of up to 5.2 Gbit/s.

2.2.5 Physical Implementation

The proposed approach enables building tightly coupled clusters from commodity off-the-shelf hardware. All main components including processor, memory and I/O can be used unmodified, however, it is necessary to develop a mainboard that fans out multiple HyperTransport links to a backplane or cable connectors. In contrast to traditional systems in which nodes are connected by means of network cards, cables and switches, TCCluster utilizes the processor internal links directly. A possible TCCluster implementation, therefore, consists of several multiprocessor mainboards that act as Supernodes and which are interconnected via TCCluster links through a backplane or optical and electrical cables.

There exist two physical constraints that complicate the implementation of such a design. First, AMD Opteron processors that communicate via HyperTransport require a mesochronous link clock that is derived from the same oscillator. Second, the physical trace length of the links between two processors is limited to 24 inches by the specification.

A solution for the first constraint is straightforward. As the input clock of all processors has to be mesochronous (same frequency) but not synchronous (same frequency and phase) it is sufficient to employ a single clock for the complete system which is then fanned out to the different processors via clock distribution ICs. The clocks can be distributed either through the backplane or via cables. Jitter cleaners can be applied to compensate for the jitter that is introduced by the backplane traces and cabling.

To address the second constraint, it is required to match the topology of TCCluster with the location of the Supernodes within the system. For an n × n Mesh, the distribution that minimizes the trace length between the Supernodes is obtained by employing n Supernodes in the horizontal axis as well as n Supernodes in the vertical axis. A blade type rack server, in which the nodes are arranged vertically next to each other and then stacked in multiple rows, provides a very balanced distribution of nodes in the x and y axis. Furthermore, the trace length limitation is specified for copper traces on FR4 PCB material. By using well shielded copper cables, signal integrity can be improved and the length can be increased to about one meter. Optical cables finally enable cable lengths of up to 100 meters with minimal signal degradation. Optical cables with built-in electrical connectors thereby enable arbitrary topologies and very large networks.

2.2.6 Limitations

The TCCluster interconnect technology provides high scalability and greater performance than traditional network protocols due to its closely coupled nature. Applications that communicate using many small messages can greatly benefit from the low latency TCCluster provides. On the other hand, there exist latency insensitive applications that only require a very high network bandwidth. Although TCCluster is capable of providing a very high bandwidth, communication is more expensive in terms of compute power when compared to other approaches. As in TCCluster messages can only be sent using PIO, each data packet requires involvement of the processor. Many Ethernet or InfiniBand network interfaces additionally support DMA, which can reduce the processor overhead. In this case, the processor only provides a descriptor to the network adapter. The data itself is then automatically fetched by the host channel adapter. As TCCluster directly uses processor internal hardware, it is impossible to add DMA support without a modification of the CPU. This limitation holds true for all other CPU offloading techniques which move computation efforts from the CPU to the network device. This disadvantage of TCCluster is put into perspective by the increasing core count and computing resources of upcoming processors. Current CPUs already deploy 12 and more processor cores, which has led to the reverse trend of CPU onloading. In fact, Regnier et al. propose "TCP onloading for data center servers" [114] to efficiently utilize CPU resources for networking.

TCCluster systems are, furthermore, limited in terms of scalability. As memory addresses are used to route packets to their destinations in the network, the number of address bits effectively limits the number of nodes that can be addressed in the system. As a shared global address space is deployed, each memory address is globally unique and can, therefore, exist only on a single node. Current Opteron processors support 48 address bits, resulting in a maximum physical memory space of 256 terabytes. Utilizing compute nodes with 64 GByte of local DRAM restricts the total number of nodes that can be supported to 4096. Future processor revisions will incrementally increase the size of the physical address space towards the full 64 bits, which will gradually increase the maximum global address space. Nevertheless, for mid-range clusters 4096 compute nodes represent an appropriate size.
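As a quick check of these figures: with 48 physical address bits and 64 GByte (2^36 Byte) of DRAM per node, the addressable node count is

    \[
      \frac{2^{48}\,\text{Byte}}{64\,\text{GByte per node}}
      \;=\; \frac{2^{48}}{2^{36}} \;=\; 2^{12} \;=\; 4096\ \text{nodes.}
    \]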


2.3 IMPLEMENTATION

The discussion in the previous paragraphs motivates a real world implementation of TCCluster. A proof-of-concept implementation has been developed which interconnects two AMD Opteron based compute nodes via a TCCluster link. The prototype, which is shown in Figure 2-14, consists of two Tyan S2912E motherboards interconnected with the HTX-2-HTX cable adapter that has been developed in the course of this work. The prototype permits booting a Linux OS and executing application software that can utilize TCCluster as a regular network interface. The prototype implementation resulted in a number of contributions, which include the development of an HTX-2-HTX cable adapter, the design of a global address space mapping scheme, the concept of an initialization protocol and the algorithm for setting up an operational TCCluster link.


Figure 2-14:TCCluster Two-Node Prototype


2.3.1 HTX-2-HTX Cable Adapter

In regular AMD based multiprocessor systems, the CPUs are interconnected via coherent HyperTransport links implemented as copper traces on the mainboard. As those links are inaccessible for TCCluster, a solution is required which routes one or multiple HyperTransport links to a connector that can be utilized for an inter-node connection. HyperTransport defines the HyperTransport Extension (HTX) socket for direct connections between Opteron CPUs and I/O devices. The HTX socket is a PCI-like connector and there exist various plug-in cards like the InfiniBand host channel adapter InfiniPath [42] as well as computation offloading accelerators. The TCCluster prototype exploits the HTX socket as a high speed cable connector. Therefore, the HTX-2-HTX cable adapter, shown in Figure 2-15, has been developed, which consists of two PCBs interconnected by a high-speed differential Samtec cable. The cable supports data transfers of up to 20 Gbit/s over 2 meters with minimal signal degradation. The PCBs act as interposers to fan out the signals from the cable to the HTX socket. All signal traces on the PCB have been routed skew matched and impedance controlled to improve signal integrity, thereby enabling the highest operating frequencies.

Figure 2-15:HTX-2-HTX Cable Adapter


HTX-2-HTX Signal Integrity Analysis
The HyperTransport specification defines operation for a maximum distance of 10 inches on FR4 material. As TCCluster realizes a node-to-node interconnect, it is required to support significantly longer distances using copper cables. Electrical signals in copper experience a degradation proportional to the length of the track or wire and to the amount of reflections within the transmission line. The longer a copper trace is, the more high frequency components are lost, leading to a longer rise time which finally results in a distortion of the received signal. Similarly, impedance discontinuities caused by vias or connectors induce reflections of high frequency components which are sent back to the transmitter without reaching the receiver. TCCluster links increase the length of the transmission line and introduce impedance discontinuities due to the additional connectors. It therefore needs to be analyzed which frequencies can be supported reliably on the TCCluster system, which is done by performing eye measurements at the receiver side of the cable. Signal measurements directly at the Opteron's receiver pins, which would provide more accurate results, could not be conducted due to mechanical obstructions. Figure 2-16 shows the monitored eye diagram for a link running at 2.4 Gbit/s. The eyes are still visible, however, noise can be observed even close to the center of the eye, which leads to an absolute eye opening of only 68 ps in the horizontal and 71 mV in the vertical dimension. Such small eyes can lead to incorrect sampling of specific receive patterns (intersymbol interference), increasing the bit error rate of the system.

Figure 2-16: HT1200 Link Default Parameters

To increase the signal integrity for long copper traces, the Opteron transceiver provides digital compensation circuits. Depending on the data pattern that is transmitted, the sender can attenuate or boost the amplitude of the signal. By de-emphasizing long static bit sequences, the voltage amplitude for such symbols is decreased, leading to an improved rise time for the following, high frequency signals. The Opteron transmitter allows different de-emphasis settings of 3, 6, 8 and 11 dB. Figure 2-17 shows the monitored signal with a maximum de-emphasis setting of 11 dB. It clearly shows two different peak-to-peak amplitudes, one of 1100 mV and another of 487 mV. While the de-emphasis leads to a wider eye opening in the horizontal dimension due to the improved slew rate, it negatively affects the vertical eye opening.

Figure 2-17: HT1200 with Maximum De-Emphasis

The available de-emphasis modes have been analyzed for different operating frequencies of the TCCluster system. It turned out that 6 dB de-emphasis represents the optimal setting for our system. By enabling the decision feedback equalizer in the receive block of the Opteron transceiver, stable operation could be achieved at the maximum link frequency of 2.6 GHz, representing a bit rate of 5.2 Gbit/s. Figure 2-18 shows a monitored HT2600 signal measured at the receiver.

Figure 2-18: HT2600 with 6 dB De-Emphasis


2.3.2 Address Mapping Scheme

A regular Opteron-based shared memory system consists of multiple processors with individual physical memory modules attached to each processor. The complete physical memory is aggregated to form a single shared physical address space and each processor has an identical memory map. In a TCCluster, parts of the local physical memory are made accessible for remote nodes. This memory area can then be used to exchange messages between the nodes. In TCCluster, local memory is treated as DRAM and remote memory is treated as MMIO space. This requires a different and unique address map for each node as depicted in Figure 2-19.

Node0 address map          Node1 address map
0000-0FFF  DRAM            0000-0FFF  MMIO
1000-1FFF  MMIO            1000-1FFF  DRAM
2000-2FFF  MMIO            2000-2FFF  MMIO
3000-3FFF  MMIO            3000-3FFF  MMIO
4000-4FFF  MMIO            4000-4FFF  MMIO
5000-5FFF  MMIO            5000-5FFF  MMIO
6000-6FFF  MMIO            6000-6FFF  MMIO
7000-7FFF  MMIO            7000-7FFF  MMIO

Figure 2-19: Example Address Map for two different nodes

Write accesses from Node0 to the address range 0000-0FFF result in a local physical memory access. The same address range accessed by Node1, however, results in a network packet that is transmitted over a non-coherent TCCluster link towards Node0. The mapping of local DRAM space into a remote node's MMIO space represents a security issue, as each node can overwrite physical memory of other nodes in the network. To address this issue, it is required to implement a remote memory management and security policy at the kernel level. By configuring only specific segments, so-called windows, as remotely writable, remote nodes cannot overwrite sensitive data in remote memories.


There exist three different memory types in TCCluster.

• Local memory: This is the regular type of memory, used to execute applications on the local node. It must not be accessible by remote nodes.
• Remote TCCluster memory: A window of remote memory which is mapped as MMIO and which allows packets to be sent to other nodes using remote stores.
• Local TCCluster memory: A window of local memory which is made accessible to remote nodes for receiving remote stores.

The TCCluster memory management scheme must support two different tasks. One is to manage remote TCCluster address windows to guarantee that applications on the local node can only write to specific locations on the remote nodes. The second task is to manage local TCCluster memory. As local TCCluster memory resides in DRAM just as regular local memory it is, a priori, managed by the operating system that performs allocation, paging and enforcement of a segmentation mechanism. To use local memory as a receive buffer for remote stores the TCCluster driver needs to cooperate with the operating system. To perform these tasks two different schemes have been developed.

In the first scheme, the BIOS firmware reserves a portion of local main memory as remotely accessible address space during boot time. This memory space is kept hidden from the operating system and managed separately by the TCCluster kernel driver. The reserved memory area is static and known by the kernel driver. Applications can allocate receive buffer space by requesting a certain amount of memory. If a contiguous range of physical memory of that size is available in the reserved memory region, the driver maps the physical address range to a virtual address range and returns it to the application. The driver does not support paging, hence, it is impossible to map more virtual address space for remotely accessible receive buffer space than the physical memory reserved by the BIOS during startup. This does not represent a limitation, as receive buffer space has to be pinned anyway; otherwise the sender would not be able to determine whether the receive buffer is currently residing in main memory or swapped to disk. To enable sending of messages, remote TCCluster memory needs to be allocated. Therefore, a send buffer that points to a specific remote node is requested from the driver. While this approach is lean and easy to implement, it is also inflexible as the amount of memory which is reserved by the BIOS for TCCluster needs to be known at boot time.
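For illustration, an application using the first scheme could obtain a receive buffer roughly as follows. The device node name, the mmap offset convention and the polling convention are hypothetical and merely stand in for the actual TCCluster driver interface.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t size = 4096;                  /* one 4 KByte receive buffer */
        int fd = open("/dev/tccluster", O_RDWR);   /* hypothetical device node   */
        if (fd < 0) { perror("open"); return 1; }

        /* The driver is assumed to pick a free, pinned chunk of the
         * BIOS-reserved region and map it; its physical address must then be
         * communicated to the sending node. */
        void *rxbuf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (rxbuf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        volatile uint8_t *p = rxbuf;
        while (p[0] == 0)                          /* poll the uncachable buffer */
            ;                                      /* for the first remote store */
        printf("first byte received: 0x%02x\n", p[0]);

        munmap(rxbuf, size);
        close(fd);
        return 0;
    }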

In the second approach, both local and remote TCCluster memory are completely managed by the operating system. To enable communication, the sending and receiving processes need to negotiate a unidirectional connection. Therefore, the receiver allocates pinned local TCCluster memory dynamically by using a kernel driver. The driver returns a virtual address to the application and communicates the physical address to the kernel driver on the send node. On the sender side, the physical remote TCCluster memory address is mapped into a virtual address. While this approach is very flexible and enables allocation of TCCluster memory at runtime, it requires a separate communication mechanism to initialize a TCCluster connection. Possible solutions are a separate Ethernet connection or a TCCluster connection with static and well defined addresses. Such a connection can be established by using the first scheme.

From the performance perspective, both schemes behave identically. Therefore, for the prototype presented in the evaluation paragraph it was decided to utilize the first approach. An implementation of the second scheme, using a standard Ethernet connection to negotiate TCCluster connections between endpoints, will be implemented in the future. As the initialization process is non-critical for the performance, it can be implemented using a Gigabit Ethernet connection which is available on all computer systems. After the initialization is complete, the applications can communicate over TCCluster, exploiting the bandwidth and latency advantages.

2.3.3 Initialization

Before computer systems are able to execute software and to run an operating system, they have to be initialized through BIOS firmware. Therefore, code that resides in non-volatile flash memory is executed by the processor to set up all devices in the system. In an AMD environment the code is retrieved via the southbridge, which is connected to one specific processor, the boot strap processor (BSP), via a non-coherent HyperTransport link. A multiprocessor system, therefore, consists of a single BSP and a number of Application Processors (APs). The BSP first initializes itself and then performs the coherent link enumeration to initialize the HyperTransport fabric, which spans the BSP and all APs. Before the BSP is able to configure its own and the APs' routing tables, it has to determine the topology of the system. This can be done either in a static way, by passing the topology description to the firmware at compile time, or dynamically at runtime. Therefore, the processor performs a depth-first search for all APs. The BSP traverses all APs and assigns a new nodeID to every AP in the system. Using the nodeID information, the BSP configures all routing table entries within the system. In the end, the topology consists of a structure of the nodes and the HyperTransport links interconnecting them. As the southbridge occupies a HyperTransport link, TCCluster utilizes the concept of Supernodes, which enables sharing of a southbridge by multiple processors as shown in Figure 2-20.


Figure 2-20:Four Interconnected Supernodes.


2.3.4 TCCluster Initialization Algorithm

In this paragraph step-by-step instructions will be given to enable operational TCCluster links. Many actions need to be carried out by the BIOS, hence a proprietary firmware has been developed. As a foundation for our code we use coreboot [102], also referred to as LinuxBIOS, which is an open source BIOS firmware project that already supports many AMD and Intel platforms. It was required to rewrite most parts of the coherent and non-coherent link enumeration code to implement the following sequence which configures a HyperTransport link as a TCCluster link.

• Cold Reset: Both platforms exit cold reset simultaneously and perform their HyperTransport low level link initialization. The TCCluster link is configured as coherent.
• Coherent Enumeration: Both platforms perform the usual boot sequence including coherent link enumeration. At this point the TCCluster links are still configured as coherent, which would cause the regular firmware to perform a search for all coherent links, thereby building the system topology. The modified TCCluster firmware avoids this by ignoring such links and only performs coherent link enumeration for the nodes within a Supernode.
• Force Non-Coherent: Each TCCluster link in the system is forced into non-coherent mode. Furthermore, the link speed is increased from 400 Mbit/s to 5.2 Gbit/s.
• Warm Reset: Both platforms or nodes issue a warm reset, which results in another low level link initialization. The modified settings now become effective and the TCCluster link gets configured as non-coherent.
• Northbridge Init: Both platforms configure their northbridge including nodeID, DRAM address range, MMIO address range registers and routing registers as described in the previous paragraphs. For the first prototype, reconfiguration is performed via the coherent link between Node0 and Node1; in the second setup each machine configures itself individually.
• CPU MSR Init: The Memory Type Range Registers (MTRR) on both nodes are reconfigured to map a large uncachable address space to the TCCluster MMIO link (see the sketch after this list). This causes the processor's system request queue to generate non-coherent posted HT packets which are required for TCCluster.
• Memory Init: The machines initialize their memory controllers and report the size and type of memory which is attached to the processors.
• Exit CAR: Until now the firmware is executed in cache-as-RAM (CAR) mode in which the code is fetched from the firmware ROM and the L3 cache is treated as main memory. At this point, the system is comparatively slow as the performance is limited by the read bandwidth of the ROM. To exit CAR, the firmware is copied into main memory and execution continues from there.
• Non-Coherent Enumeration: The processors interconnected by a TCCluster link appear as non-coherent devices, which causes regular firmware to perform I/O device enumeration for this link. This needs to be disabled for each TCCluster link.
• Post Initialization: The firmware performs TCCluster independent tasks and loads the operating system.
• Loading Operating System: After the firmware configuration is completed, the operating system, in our case Linux, can be loaded. The OS also switches the system from 32 bit protected mode into 64 bit long mode.
• Enabling Remote Access: The device driver maps the remote address range as memory mapped I/O and provides access to the API.
• Data Transmission: The API requests page-wise memory mapping of remote addresses into user space. User software can now access remote memory.
• Data Reception: On the remote node the same physical memory address, which is DRAM based in this case, has to be mapped into user space, which enables data to be received from remote nodes.
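The MTRR reconfiguration of the CPU MSR Init step can be sketched as follows. The MSR numbers and bit layout follow the x86 architecture manuals; the chosen MTRR index, the window base and size, and the wrmsr helper are assumptions of this sketch, and real firmware additionally disables the caches around MTRR updates.

    #include <stdint.h>

    #define MTRR_PHYS_BASE(n)  (0x200u + 2u * (n))   /* IA32_MTRR_PHYSBASEn */
    #define MTRR_PHYS_MASK(n)  (0x201u + 2u * (n))   /* IA32_MTRR_PHYSMASKn */
    #define MTRR_TYPE_UC       0x00u                 /* uncachable           */
    #define MTRR_MASK_VALID    (1ull << 11)

    static inline void wrmsr64(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                                    "d"((uint32_t)(val >> 32)));
    }

    /* Mark a power-of-two aligned window as UC via one variable MTRR pair;
     * phys_bits is the processor's physical address width (MAXPHYADDR). */
    static void map_tccluster_window_uc(uint64_t base, uint64_t size,
                                        unsigned mtrr, unsigned phys_bits)
    {
        uint64_t addr_mask = ((1ull << phys_bits) - 1) & ~(size - 1);
        wrmsr64(MTRR_PHYS_BASE(mtrr), (base & ~0xFFFull) | MTRR_TYPE_UC);
        wrmsr64(MTRR_PHYS_MASK(mtrr), addr_mask | MTRR_MASK_VALID);
    }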

2.3.5 Non-Coherent Configuration

HyperTransport links between two processors are generally configured as coherent. As soon as the Opteron processor emerges from its reset state, it enters the low level initialization and begins to configure its HyperTransport links. To do so, it drives specific data patterns on the wires, trying to detect another device that may reside on the other side of the link. If both endpoints conform to that sequence, the link connection is established. Then, both endpoints identify themselves as a coherent or non-coherent device to determine the type of the link. In the case of two Opterons, the link type will be coherent and the boot strap processor (BSP) can now use this link to configure itself and all other devices in the fabric.

TCCluster requires non-coherent links between the processors. While this mode is neither intended nor referenced in any of the processor or HyperTransport specifications, it is still possible to enforce such a behavior. The processors implement a specific register for debug purposes that enables non-coherent operation. This possibility is exploited by our approach. After the initialization phase, the HyperTransport links are configured as coherent, which enables the BSP to access and set the debug register. The modifications become effective at the next warm reset, which causes a re-initialization of the link. At this time, the processors identify themselves as non-coherent devices. This technique allows a non-coherent link between processors to be enforced.


2.4 EVALUATION

To evaluate the TCCluster interconnect we provide software microbenchmarks that show the latency and bandwidth performance of our technique. We developed a Linux driver that maps remote TCCluster memory addresses into user space and a rudimentary message library that can be utilized to send and receive messages. As the operating system we run Linux with a custom 2.6.34 kernel. We needed to compile our own kernel to comply with a limitation of TCCluster caused by interrupts. Within the HyperTransport fabric, interrupts are broadcast to inform coherent and non-coherent devices within the system about specific events. Broadcasting of interrupts over TCCluster must be avoided, as interrupts have to be handled within the local system and must not be sent over the network. Therefore, all system management calls (SMC) need to be disabled, which can only be achieved with a custom kernel.

The user space message library provides the following functionality. It can open local and remote memory addresses by calling the TCCluster device driver. The send function can then be used to transfer data to remote memory addresses. The receive function is called on the remote side to receive data from local memory. TCCluster transactions cannot generate cache invalidation requests on the receiver side. Therefore, the receiver needs to map the local memory which is accessible by the remote nodes as uncachable. This guarantees that all reads to remote-accessible memory bypass the cache and directly hit main memory. Although this approach generates additional processor-memory bus overhead when polling the memory, it represents the only way to guarantee that incoming data is seen by the processor. The message library will offer support for synchronization primitives using the Sfence machine instruction. Sfence enforces a strict ordering between store transactions, which can be utilized by the library to implement a synchronization mechanism. A sketch of this send/receive scheme is given below. The message library provides the basis for higher level middleware, for example an MPI message passing implementation.
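The following sketch illustrates the described send and receive path, assuming a write-combining mapped remote window on the sender and an uncachable, remote-accessible buffer on the receiver. The function names and the flag-based completion signal are illustrative assumptions, not the actual library API.

/* Illustrative sketch of the send/receive path, not the library API. */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence() */

void tcc_send(volatile uint64_t *remote, const uint64_t *src, size_t qwords,
              volatile uint64_t *remote_flag)
{
    for (size_t i = 0; i < qwords; i++)
        remote[i] = src[i];      /* 64 bit stores to the TCCluster MMIO range */
    _mm_sfence();                /* order the payload stores before the flag  */
    *remote_flag = 1;            /* signal message completion                 */
}

void tcc_recv(volatile const uint64_t *local, uint64_t *dst, size_t qwords,
              volatile const uint64_t *local_flag)
{
    while (*local_flag == 0)     /* poll uncached, remote-accessible DRAM */
        ;
    for (size_t i = 0; i < qwords; i++)
        dst[i] = local[i];
}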


2.4.1 Benchmark Methodology

In the following, we present the latency and bandwidth performance of a two-node system interconnected by a single bidirectional TCCluster link as shown in Figure 2-21. Both nodes deploy two AMD Opteron Istanbul six-core processors running at 2.6 GHz with 6 MByte of L3 cache each. The machines were equipped with 16 GByte of main memory per node, running at DDR800 and using both memory controllers of the processor in dual rank configuration. The HyperTransport links used for the node-to-node interconnection were operated in three different modes. We analyzed the performance using a 16-bit gen1 link at HT800, an 8-bit gen3 link at HT1600 and an 8-bit gen3 link at HT2600. All three configurations showed stable operation over a period of 24 hours with zero link errors after configuring the pre-emphasis and DFE accordingly. The analysis of a high frequency 16-bit gen3 link was not possible due to a physical limitation of the mainboard: HyperTransport Gen3 modes require an additional signal which is not supported on the Tyan platform.

Figure 2-21: Evaluation Setup (BOX 0 and BOX 1, each containing CPU0 and CPU1 coupled by the coherent links cHT1/cHT2 and attached to SB and HTX; the non-coherent links ncHT3 form the TCCluster connection between the boxes, and a shared pwr/reset line couples the platforms)

We developed two user level software benchmarks that utilize the TCCluster API to access the hardware. As the operating system we used Linux with a 2.6.25 kernel. The application code has been developed in C and compiled using GCC 4.4. We utilized OpenMP for semi-automatic parallelization, enabling multithreaded execution of the


program. Multithreading is required to ensure that enough store instructions can be generated by the execution pipelines to fully saturate the bandwidth of the TCCluster link. The same approach is applied by the popular STREAM benchmark, which measures main memory bandwidth and only achieves maximum performance when executing multiple threads concurrently. To conduct the tests we varied the number of concurrent threads and processes and tried different permutations thereof. Furthermore, we utilized the numacontrol tool, which allows software threads to be bound to specific cores within a multi-node, multicore system. As the TCCluster link connects specific processors, the performance results were expected to depend on the processor which executes the benchmark. To increase the bandwidth of the TCCluster link we enable write-combining for the memory range that points to the TCCluster link. Write-combining packs multiple consecutive 64 bit store instructions and sends them out as a single packet. Such maximum sized HyperTransport packets reduce the command overhead and thereby increase link utilization. A sketch of the resulting multithreaded store loop is shown below.
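The sketch below outlines such a multithreaded store loop; the partitioning of the write-combining mapped window among OpenMP threads is an illustrative assumption and does not reproduce the exact benchmark code.

/* Illustrative sketch: several threads stream stores into the WC window. */
#include <omp.h>
#include <stddef.h>
#include <stdint.h>

void stream_stores(volatile uint64_t *wc_window, size_t qwords, uint64_t pattern)
{
    #pragma omp parallel
    {
        int t    = omp_get_thread_num();
        int nthr = omp_get_num_threads();
        size_t chunk = qwords / nthr;
        volatile uint64_t *dst = wc_window + (size_t)t * chunk;

        for (size_t i = 0; i < chunk; i++)
            dst[i] = pattern;    /* consecutive stores are write-combined
                                    into maximum sized HT packets        */
    }
}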

2.4.2 TCCluster Bandwidth

Figure 2-22 shows the bandwidth of the TCCluster link running at different link speeds. The HyperTransport frequencies we analyzed are HT800-16 bit, HT1600-8 bit and HT2600-8 bit. As all three modes use DDR, the theoretical peak performance is 3.2 GByte/s, 3.2 GByte/s and 5.2 GByte/s respectively. As HyperTransport packets have a minimum overhead of 96 bits (64 bit command plus 32 bit CRC), the raw bandwidth that can be realized represents 84% of the theoretical peak performance. While the HT800 mode performs worse than the faster modes for small messages, it provides better results than the HT1600 mode for large packets. This implies that the 16 bit mode is implemented more efficiently in the AMD Opteron processor. The maximum bandwidth that can be achieved is 2690 MByte/s for the HT800 case, 2100 MByte/s for the HT1600 case and 3420 MByte/s for the HT2600 case. It is worth noting that the maximum bandwidth is already achieved with 1 KByte messages while other interconnects including InfiniBand only achieve their maximum performance with much larger (4 KByte) messages. The message size which delivers 50% of the peak performance, also referred to as n/2


performance, is 64 bytes, which represents an outstanding result. Both facts prove the applicability of TCCluster for fine grain communication.

As a baseline, the InfiniBand ConnectX network adapter from Mellanox can be referenced [123]. It provides an MPI bandwidth of 2500 MB/s for 1 MB messages, 1500 MB/s for 1 KByte messages and 200 MB/s for cache line sized messages. Although our evaluation does not include the overhead of the MPI middleware, it can be seen that TCCluster provides a significant performance edge over InfiniBand, especially for small messages.

Figure 2-22: TCCluster Bandwidth (bandwidth in MByte/s over the message size in bytes, from 16 byte to 1 MByte, for the HT800, HT1600 and HT2600 configurations)

2.4.3 TCCluster Latency

In the next benchmark we measured the communication latency that can be achieved with TCCluster. We used a standard ping-pong kernel in which the receiving node polls a specific memory location and sends back a response as soon as the first message arrives. The full round-trip latency is divided by two to calculate the one-way latency. As shown in Figure 2-23, TCCluster provides a very low half-round-trip latency of 240 ns for 64 byte packets between two nodes. It can be seen that smaller packets show a slightly larger


latency. This behavior can be explained by the microarchitecture of the write combining buffer. Only cache line sized packets are sent out immediately after writing the last quadword into the buffer and, hence, deliver the best performance. For all smaller packet sizes the buffer is not entirely full after writing the message. As a result, the buffer is not flushed immediately, which leads to a slightly worse latency. However, even for 1 KByte messages the latency is still below 1 us which represents, to the best of our knowledge, the best performance ever measured on an x86 architecture. Other high performance networks like InfiniBand currently achieve end-to-end latencies of around 1 us for minimum sized packets, which corresponds to a 4X performance advantage for TCCluster. The structure of the ping-pong kernel is sketched below.
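The following sketch outlines such a ping-pong kernel, assuming the sender's stores land in the peer's uncachable receive word while the peer runs the mirrored loop; the iteration count and the rdtsc-based timing are illustrative assumptions.

/* Illustrative ping-pong sketch; the peer executes the mirrored loop. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ITER 100000ULL

double pingpong_one_way_ns(volatile uint64_t *remote, volatile uint64_t *local,
                           double tsc_ghz)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 1; i <= ITER; i++) {
        *remote = i;              /* ping: store the sequence number remotely */
        while (*local != i)       /* pong: wait until the peer echoes it      */
            ;
    }
    uint64_t cycles = __rdtsc() - start;
    /* the full round-trip time divided by two gives the one-way latency */
    return (double)cycles / tsc_ghz / (double)ITER / 2.0;
}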

Figure 2-23: TCCluster Latency (latency in ns over the message size in bytes, from 1 byte to 1 KByte, for HT2600 with MTRR, HT2600 with PAT, HT1600 and HT800; the minimum of 240 ns is reached at 64 bytes)

Furthermore, we performed measurements to determine the influence of the processor on which the benchmark is executed. Using the Linux tool numacontrol, the latency benchmark was bound to either node 0 or node 1 within each system. As both boxes contain two processor nodes and the ping-pong kernel consists of a send and a receive process, there exist four different placements of the processes. Figure 2-24 shows the benchmark results using an HT2600 link, where for instance


(n0, n1) denotes that the sender thread runs on node 0 and the receiver thread runs on node 1. As the TCCluster link is connected to node 1 in both boxes, the (n1, n1) configuration shows the best result. Adding another hop by moving a process to node 0 increases the latency by approximately 50 ns. Consequently, the (n0, n0) configuration introduces a latency penalty of 100 ns. This benchmark is of particular interest as it provides a performance estimate for larger systems. As TCCluster utilizes a direct topology, the median communication latency increases with the number of hops and, correspondingly, with the number of nodes in the system. A very low switching latency per hop is, therefore, of paramount importance. Following these results, in a hypothetical 1024 node, 32×32 Mesh system, the end-to-end latency between remote nodes would still not exceed 1 us.

Figure 2-24: NUMA Impact on Communication Latency (latency in ns over the message size in bytes for the HT2600 link with the (n1,n1), (n0,n1), (n1,n0) and (n0,n0) process placements)

2.4.4 TCCluster Message Rate

The next benchmark analyzes the message rate performance of the TCCluster interconnect. The message rate is determined by counting the maximum number of minimum sized messages that can be transmitted per second. The message rate is related to the bandwidth that can be achieved in a system but focuses on small messages.


In the measurements we report true hardware message rates; multiple messages are not combined in software into a single transfer to synthetically increase the message rate. Within a computer system, there exist various factors that determine the maximum message rate. As shown in Figure 2-25, running the message rate benchmark with a single thread achieves a message rate of 48 million messages/s. While this result outperforms comparable technologies like InfiniBand, which achieve a rate of 20 million messages/s, it is significantly lower than the theoretical capability of the link. It appears that a single core is not able to saturate the TCCluster link when sending minimum sized messages. Therefore, we performed measurements with 2, 4, 8, 16 and 32 threads. A significant speedup can be observed, up to the maximum message rate of 430 million messages/s achieved with 32 threads. This represents a 20x performance advantage over InfiniBand. A sketch of the per-thread measurement loop follows below.
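The per-thread measurement loop can be sketched as follows; the 64 byte message layout and the per-thread target slots are illustrative assumptions, not the actual benchmark code.

/* Illustrative sketch of the message rate measurement. */
#include <omp.h>
#include <stddef.h>
#include <stdint.h>

double message_rate(volatile uint64_t *remote, size_t msgs_per_thread)
{
    int nthreads = 1;
    double t0 = omp_get_wtime();

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();

        /* each thread targets its own 64 byte slot in the remote window */
        volatile uint64_t *slot = remote + (size_t)omp_get_thread_num() * 8;
        for (size_t i = 0; i < msgs_per_thread; i++)
            for (int q = 0; q < 8; q++)
                slot[q] = i;      /* eight 64 bit stores = one 64 byte message */
    }

    double elapsed = omp_get_wtime() - t0;
    return (double)nthreads * (double)msgs_per_thread / elapsed;
}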

Figure 2-25: TCCluster Message Rate (message rate over the number of threads per node: 1, 2, 4, 8, 16 and 32)


2.5 CONCLUSION

We have presented a novel technique named TCCluster for interconnecting large cluster systems by utilizing the processor host interface as a direct network interconnect. By virtually placing the network interface into the processor we achieve a much higher bandwidth as well as a lower latency than traditional interconnects including Ethernet and InfiniBand. Our approach does not require any hardware modifications as it is realized entirely in firmware and kernel layer software. In particular, the HyperTransport host interface implemented in any AMD Opteron processor is exploited and reconfigured as a network interface, thereby virtually moving the interconnect into the processor. Our solution achieves an outstanding software-to-software network latency of 240 ns and a bandwidth of more than 3400 MByte/s for 1 KByte messages. The contributions of this chapter are, in particular: a novel interconnection technique which enables AMD Opteron processors to communicate with ultra low latency and very high bandwidth; a new remote store programming model that uses a partitioned global address space to exchange data between processes on different nodes; an in-depth summary of interval based routing mechanisms for a specific set of topologies and the introduction of an interval labeling scheme that allows deadlock free routing for the proposed TCCluster technology; a proof of concept implementation of the complete technique including the development of a new BIOS firmware together with a complete software stack; and finally, a thorough evaluation of the mechanism based on a two-node prototype.

As outlined before, we applied a prototyping methodology to verify TCCluster and to prove the merit of our technique. Although this approach has been extremely time-consuming, it turned out to be very beneficial. While simulations conducted at the beginning of the project showed the operation of our technique, the prototype revealed several issues that could not have been discovered by simulation. One of these issues, which frequently led to a crash of the prototype system, was induced by the power management policy of the operating system. In particular, each time the operating system triggered the halt instruction to put the system into a temporary sleep mode, the system crashed. Tracing the root of the problem was extremely challenging and was finally achieved with a logic state analyzer connected to the pins of


the HTX-to-HTX cable adapter. This technique made it possible to monitor the discrete packets on the TCCluster link, which led to the comprehension and the solution of the problem. Understanding such complex interactions between hardware, power management, operating system, and software is impossible with simulation-based techniques alone. This fact confirms the value of our design and evaluation methodology.

To the best of our knowledge, TCCluster currently provides the best interconnect performance for x86 based systems, outperforming comparable approaches by an order of magnitude. The main reason for this performance advantage is that TCCluster virtually moves the interconnect into the processor, rendering all protocol conversions unnecessary. While TCCluster utilizes an undocumented feature of the AMD processor architecture, the question arises why processor manufacturers do not officially support such interconnect capabilities, particularly when taking into account that the previously mentioned Transputer already provided such features. There exist both economical and technical reasons for excluding the interconnection network from the processor. On the one hand, not all applications utilize an interconnection network, and it may therefore be inefficient to spend silicon on such features. On the other hand, the x86 instruction set does not provide networking primitives to handle the data transfer between nodes in a network. Hence, a major contribution of this work is the development of a technique that reduces message sending, including packetization and way-finding, to a simple store instruction. The TCCluster mechanism addresses both economical and technical concerns by providing a cost efficient implementation that only reuses available hardware. It remains to be seen whether manufacturers will explore this opportunity by adopting the technique in future processor generations.

While the TCCluster technique represents an effective solution for addressing the scalability problem in high performance compute clusters, it does not improve the performance of the processors within the cluster. Therefore, in the next chapter we will present a technique to improve the scalability of on-chip components within integrated circuits.


CHAPTER 3 SCALABLE NETWORKS-ON-CHIP

3.1 INTRODUCTION

Recent advancements in silicon technology have made it possible to integrate billions of transistors on a single chip. Processor manufacturers have leveraged this abundance of transistors to integrate more and more functionality into the processor, including memory controllers, multiple layers of caches and I/O interfaces. In addition, the single thread performance of processors has plateaued, which encouraged designers to work on multi- and many-core architectures. Recent publications [9] predict a two-fold increase in the number of cores per year, and Intel has already introduced prototype implementations of 80 core processors [127]. One of the most important aspects of such architectures is the communication infrastructure between the components, as it determines the overall performance of the chip to a large degree. The design of a resource and power efficient architecture which, in addition, provides good performance is one of the greatest challenges within many-core designs.

As a solution, Dally and Towles have proposed the use of networks-on-chip (NoCs). In contrast to previous architectures in which components were simply connected through wires, they advocate to “route packets not wires” [40]. NoCs provide a number of significant advantages over traditional architectures. One is that they address the increasing significance of wire delays [69][21], which in current chip designs have started to dominate the gate delays. As operating frequencies reach into the gigahertz range, timing closure gets increasingly difficult as signals require several pipeline stages to travel across a complete chip. NoCs represent an elegant solution for these problems as they decouple the different on-chip components with regard to area and time. Components are no longer required to operate in a fixed cycle relationship as the NoC separates the computation and communication tasks from each other.

Another advantage of NoCs is the possibility of reusing wires within a chip much more efficiently than in traditional architectures that utilized shared medium networks (buses) to interconnect the components on a chip. In bus based architectures, the bandwidth is shared between all endpoints, thereby limiting scalability. NoCs, in contrast, provide point-to-point links as well as switch stages which enable sharing of resources while providing good scalability.


On the architecture level, NoCs increase the modularity and reusability of designs. By defining a standard interface between the components and the NoC, the flexibility of designs is significantly improved. Components can be interchanged easily enabling new architectures to be extended efficiently. A feasible approach to handle the complexity of future computing systems is the application of a component-based design methodology that supports component reuse in a plug-and-play fashion as described by Benini [16]. The adoption of well defined components interconnected by a high performance NoC represents the only feasible solution for reducing the integration and verification effort of large Systems-on-Chip (SoCs).

Novel developments, including reconfigurable NoCs [132], address the issue of defects and process variations. This issue is becoming increasingly important in chip designs as the constant scaling of technology size into the deep sub-micron regime increases the impact of process variations. The non-uniform width of an individual transistor can affect the functionality and maximum operating frequency of a design significantly. Manufacturers, therefore, divide their chips into good and bad lots and operate them at different frequencies. Reconfigurable NoCs, on the other hand, allow the operating frequency to be adjusted at a much finer grained level. Each individual component can be operated at its own maximum frequency while a globally asynchronous, locally synchronous (GALS) NoC [29][105] acts as the synchronization interface. This allows frequency adjustment at runtime which enables sophisticated power management techniques. Completely erroneous logic can be easily replaced by redundant components through a simple reconfiguration of the NoC.

The prevailing NoC architectures can be subdivided into two categories: heterogeneous and homogenous architectures. Heterogeneous NoCs as shown in Figure 3-1 are prevalent in almost every modern SoC, ranging from smartphones to large embedded systems. In this case, a number of components like processor cores, radios, graphics accelerators and memory controllers, possibly all from different vendors, have to be integrated on a single chip. The development and verification effort of such complex platforms is only feasible by introducing a standardized interconnect structure which enables the communication between the different IP blocks.


Figure 3-1: Heterogeneous System-on-Chip (block diagram of a commercial microcontroller SoC: an ARM processor core with on-chip SRAM and flash memory, a system controller, and numerous peripherals attached via AHB and APB buses). Source: Wikipedia, Wikimedia Commons, en:User:Cburnett, licensed under the Creative Commons license, URL: http://creativecommons.org/licenses/by-sa/2.0/de/legalcode


Currently, there exist two prevailing IP interface specifications: the Advanced Microcontroller Bus Architecture (AMBA) [6] maintained by Advanced RISC Machines (ARM), and the Wishbone interface [131] supported by the OpenCores project. The main disadvantage of both specifications is that they are primarily bus based. They define different interfaces for master and slave devices and both approaches provide complex, high overhead interfaces with very low performance. Furthermore, AMBA is constantly being modified, now in its fourth generation, defining several subinterfaces which are each optimized for different purposes. The following enumeration is only a subset of the existing AMBA specifications: AHB, AHB-Lite, ATB, APB, multi-layer AHB, AXI and AXI4 add up to quite a number of mostly incompatible protocols. AMBA AXI has been enhanced to support point-to-point NoCs; however, the adoption has been ineffectual as obsolete bus oriented concepts have been reused. This results in inefficient intermediate layers with even higher overhead. Hence, there exists the need for a scalable, flexible and highly efficient infrastructure optimized for NoCs. A possible solution is represented by the HTAX framework that has been developed in the course of this work.

Homogenous designs consist of a large number of identical components interconnected by a NoC. In principle, every multicore processor, including IBM’s Cell [31] or AMD’s Opteron, that implements a number of identical cores and a switched interconnect can be regarded as a homogenous NoC architecture. In general, however, the Multiprocessor System-on-Chip (MPSoC) category embraces tile based designs as shown in Figure 3-2. Examples like the TILE64 [3] from Tilera, Intel’s 80 core Polaris [127] or the TRIPS [116] microprocessor consist of three main building blocks: compute elements, memory controllers and the NoC. The advantage of homogenous architectures is their uniform structure which greatly facilitates the development and the layout process. Homogenous designs focus to a much larger extent on performance than on reusability and flexibility. Therefore, they are implemented in the form of custom manufactured ASICs with proprietary NoC interconnects. Designing an MPSoC NoC from scratch is a complex and time consuming task. There exists a demand for electronic design automation (EDA) tools that assist designers in specifying and implementing their NoCs. The HTAX approach proposed in this work not only provides the framework for such architectures but also the tools for automatic generation of NoCs, suitable for many different


applications. This technique optimizes the design and verification flow of new designs and reduces risk by reusing silicon-proven IP.

To ensure the future scalability of both heterogeneous and homogenous architectures, the NoC component is of paramount importance. For this purpose, all layers of the NoC including the protocol, the network and the physical link layer have to be carefully designed. A NoC architecture requires scalability in terms of size and performance while being flexible enough to adapt to upcoming challenges. Accelerating the design process by employing NoC specific EDA tools and a component based methodology is a key factor. Addressing these requirements in conjunction is the goal of the HTAX framework.

Figure 3-2: Homogenous TRIPS Multiprocessor System-on-Chip (tiled layout of execution, register, data, instruction, memory and network tiles surrounded by DMA, SDC and EBC controllers). Source: Wikipedia, Wikimedia Commons, by MovGP0, licensed under the Creative Commons license, URL: http://creativecommons.org/licenses/by-sa/2.0/de/legalcode


3.1.1 Related Work

The research field of NoCs is broad and has stimulated numerous efforts resulting in a large number of publications. Fundamental contributions on NoC architecture and design methodology have been made by Kumar, Holsmark and Jantsch [83][76][111][65]. The authors have studied NoCs extensively, in particular homogenous Mesh architectures and their design methodology, including topologies, routing algorithms and applied policies.

Important work has also been contributed by Benini in the form of the definitive book “Networks-on-Chips” [15], his work on NoCs for SoCs and the Xpipes project. Xpipes [37][17] is an advanced NoC architecture that targets high performance and reliable communication for on-chip multi-processors. It consists of a library of soft macros (switches, network interfaces and links) that can be composed and tuned at design time, such that domain specific heterogeneous architectures can be instantiated and synthesized. It features the XpipesCompiler, which takes as input a design description file that defines the NoC’s topology, clock speed and pipeline stages for each link and generates an application specific network described in SystemC.

A design flow for application specific NoCs to accelerate SoC design and verification has been introduced by Goosens in the form of the Aetherial NoC [58][59]. The Aetherial NoC has been commercialized by NXP and provides a tool flow to generate RTL NoC descriptions from an abstract XML description. It targets SoC platforms and features a compiler as well as verification and simulation capabilities for the synthesized NoCs.

Coppola has introduced the network-on-chip modeling and simulation framework OCCN [35] which has been adopted by STMicroelectronics in the form of STNoC [110]. It implements a layered approach to methodically decouple the different components of the NoC, including the network interface, the router and the physical layer. Pre-configured IP blocks which implement a specific functionality can be assembled to form complete NoCs. Another contribution by Coppola is the Spidergon topology [36] for NoCs which targets low cost SoCs. The topology uses a ring bus with additional links as shortcuts to provide good performance with a minimal number of links.


Beigne has provided many important contributions in the field of Globally Asynchronous Locally Synchronous (GALS) NoCs [14]. GALS systems consist of communication endpoints that define their own synchronous clock domain and an asynchronous network to interconnect the local regions. The advantage of such a design is the simplified clock routing as clock domains do not cover the whole chip. Furthermore, the GALS approach allows each local clock domain to run at a different clock speed, which can be used for coarse grained power management and to improve the yield rate of a chip by restricting the operating frequency of specific chip areas.

Although NoCs differ significantly from local area networks, many properties are similar and have been reinvented for on-chip networks. This particularly applies to the research on topologies and routing mechanisms, whose issues tend to be similar to those in local area networks. The most valuable research contributions, on the other hand, have been made in the area of SoCs in the form of the previously mentioned Xpipes, Aetherial and STNoC projects. They provide a good solution for designing complex SoCs like mobile devices; however, they are not practicable for general purpose applications. Due to their SoC heritage they show high overhead and resource consumption as well as limited applicability due to the restriction to proprietary protocols. A more lightweight approach that is easily extendable and provides the necessary flexibility to support the full range of applications is provided by the HTAX framework that will be introduced in this work.


3.2 NOC DESIGN SPACE EXPLORATION

Before the HTAX framework can be introduced it is necessary to perform a design space exploration of NoCs. Some basic principles of networks and on-chip networks will be introduced which are necessary to define the purpose of the framework.

3.2.1 NoC Topology

Networks can generally be classified into different groups according to their topology. The four common types are shown in Figure 3-3.

Shared Medium Networks consist of a single communication infrastructure called a bus which is shared by all communication endpoints. This simple interconnect structure requires an arbiter which guarantees that only one master can access the bus at a time. Although shared medium networks are a good fit for small networks, they are not feasible for NoCs with a large number of endpoints due to their limited scalability. Simultaneous communication between multiple endpoints is not supported which limits the bidirectional bandwidth of the NoC, independent of the number of endpoints in the system. An advantage of bus based topologies is the capability of supporting broadcasts without additional hardware as each transaction on the bus is visible to every endpoint.

Indirect networks typically have a single central switch with a number of ports equal to the number of connected endpoints in the system. To allow scaling to a large number of nodes, high radix switches are composed of multiple switch stages in the form of a Fat Tree or Clos topology [72]. Non-blocking indirect networks provide a very high bisection bandwidth and a low deterministic latency as the distance between all endpoints in the network is identical. In the area of NoCs, indirect networks are preferably used in small to medium sized SoCs as for large chips the distance between the far edges of the chip and the switch is too large.

In direct networks, each endnode implements an individual switch allowing it to directly connect to adjacent nodes. Depending on the number of links, 2D and 3D Tori as well as higher dimensional topologies can be realized. Most NoCs implement a 2D Mesh


topology which scales extremely well as it can be easily extended by adding additional rows and columns of components. Furthermore, such direct topologies provide the lowest latency for nearest neighbor communication and fault tolerance against broken links as each component can be reached via multiple routing paths.

Hybrid Networks combine attributes from shared medium, direct and indirect networks. Arbitrary topologies can be realized with hybrid networks using the different techniques provided by bus and switch based architectures. One example is heterogeneous networks that employ a hierarchy to cope with the different bandwidth requirements of the endpoints. In this case, low speed endpoints are connected to the system via an additional hierarchy level to avoid degrading the communication performance of the higher speed endpoints.

Figure 3-3: a) Shared Medium Network, b) Hybrid Network, c) Indirect Network, d) Direct 2D Mesh Network (the shared medium and hybrid networks include an arbiter; the indirect network is built around a central switch)


In the following, the focus will be on switch based topologies which provide higher scalability than shared medium networks and are, therefore, commonly used in current NoCs. As a result, the HTAX framework only supports direct, indirect and hierarchical topologies. For on-chip networks, the effort which is assigned to the interconnection structure continuously increases. Not only do the chips scale in terms of cores and modules, there is also a trend towards more and more heterogeneous chips including memory controllers, caches and accelerators. This results in a new set of topologies which are highly heterogeneous and combine both indirect and distributed characteristics. To support the different topologies the HTAX framework defines a component based design methodology. Each component implements a unique functionality, such as switching capability or buffer management. As a result, high radix indirect switches can be realized by combining multiple switch components in a multistage topology. Direct topologies are implemented by many switch components that are distributed over the complete chip.

3.2.2 Communication Protocols

The protocol defines all aspects that are required to enable communication between two or more endpoints. The properties which may be defined by the protocol include the supported topologies, the routing mechanism, the size and format of transactions, as well as the reliability and fault tolerance requirements. Similar to the different existing topologies, protocols can be classified into groups.

Traditional shared medium networks utilize transaction based protocols like the Advanced Microcontroller Bus Architecture (AMBA) or the Open Core Protocol (OCP). In general, a master initiates a transaction by requesting a grant from the arbiter to gain ownership of the bus. The slave target is then determined by driving the corresponding address onto the bus. As soon as the slave accepts the transaction, data can be exchanged between the master and the slave. The family of AMBA protocols is supported by ARM and represents the de facto standard for bus based SoC architectures. AMBA AHB represents the first specification released by ARM, which supports master/slave shared medium networks with a central arbiter. The combination of a bus based


architecture with a relatively high arbitration overhead, however, restricts the performance and scalability of the protocol. To address this issue, multi-layer AHB has been introduced which effectively duplicates the bus structure to enable concurrent communication between multiple endpoints. Although multi-layer AHB can increase the performance, its scalability remains limited due to its high resource consumption. Systems that implement a single master and, hence, require no arbiter can utilize the AHB-Lite protocol.

For switch based topologies, packet based end-to-end communication protocols have proven successful. These protocols support concurrent communication between multiple endpoints and transport data in so-called packets. This technique not only enables multiple transaction streams but also multiple concurrent packets between two specific endpoints. Furthermore, read accesses can be implemented as split-phase transactions, decoupling the read request from the read response. This allows network resources to be freed between transactions, reducing overall congestion and bus contention. Examples are AMBA 4.0 AXI and the Open Core Protocol (OCP). They support multiple outstanding transactions, out-of-order delivery and different transaction categories. Due to their pervasiveness in the SoC area, most NoC frameworks support at least one of the two protocols. The HTAX framework introduces a layered architecture which decouples the transaction layer protocol from the underlying NoC and its hardware. By applying this methodology, AXI, OCP as well as most proprietary protocols can be realized on top of an HTAX NoC. This approach provides an additional benefit, as only the features that require direct hardware interaction need to be mapped to the HTAX while higher layer functions are implemented by the endpoints, transparently to the underlying NoC.

3.2.3 Reliability and Fault Tolerance

Integrated circuits are inherently prone to errors. By scaling the operating frequency and, in particular, the voltage of ICs, the probability of soft errors (single-event upsets) increases constantly. Soft errors are caused by alpha particles within the chip material or by cosmic rays from space [13]. Components which have the highest failure rates are SRAM based memory blocks [28] and high speed interconnection structures. Depending on the target technology a specific error rate needs to be accepted and handled by the


system architecture. One possible solution is to protect data in SRAM memories through error correcting codes. A feasible technique is the use of Hamming error correcting codes (ECC). Most commonly used approaches use an additional parity bit to offer single error correction and double error detection (SECDED) protection. If the probability of multi-bit errors is extremely low, this approach can transform an IC with a known bit error rate into a reliable system. The cost of increasing SRAM reliability in terms of resources is the additional check bits that have to be stored for each data word. Each word requires approximately log2(n) parity bits, where n is the length of the data word. The HTAX framework supports SECDED for all instantiated SRAM structures transparently to the surrounding logic.
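As a small illustration of this overhead, the following sketch computes the number of SECDED check bits for a given word length from the Hamming condition 2^p >= n + p + 1 plus one additional overall parity bit; the function name and the printed examples are illustrative only.

/* Illustrative SECDED check bit calculation. */
#include <stdio.h>

static int secded_check_bits(int n)
{
    int p = 0;
    while ((1 << p) < n + p + 1)
        p++;
    return p + 1;   /* +1 overall parity bit for double error detection */
}

int main(void)
{
    printf("32 bit word: %d check bits\n", secded_check_bits(32)); /* (39,32) code */
    printf("64 bit word: %d check bits\n", secded_check_bits(64)); /* (72,64) code */
    return 0;
}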

On the nanoscale level, NoC interconnect structures are prone to errors due to crosstalk between global wires. CAD software that performs the layout process of the wires minimizes the failure rate by applying an extremely conservative approach which limits the operating frequency of the design. However, if errors can be corrected on a higher layer, a low failure rate on the transistor level is tolerable. This enables much higher performance and reduces the costs of manufacturing, verification and test. This motivates the analysis of error detection and correction mechanisms.

Different approaches have been proposed to increase NoC reliability. On the one hand, flooding schemes have been analyzed [47] which duplicate messages by sending them out over different routes simultaneously to increase the probability of a successful transaction. This simple solution requires few resources; however, it entails a high bandwidth overhead which negatively affects performance. Vellanki et al. [128] have analyzed two different error control schemes for NoC architectures: single error correction (SEC) as well as single error detection and retransmission. The first approach applies an ECC technique while the latter stores the packets in a buffer and uses an acknowledgement scheme to enable retransmission of erroneous packets. It is shown that while SEC exhibits a marginally higher power consumption, it is less complex and provides better performance, especially for systems with higher failure rates. The HTAX framework supports SEC protection throughout the complete switch fabric transparently to the endpoints. The HTAX fault tolerance scheme, therefore, provides protection for SRAM blocks as well as for the interconnect structure.


3.2.4 Network Congestion

As networks are operated near their saturation point, network congestion appears, which reduces the throughput of the network and subsequently limits performance. The reasons for congestion are manifold. Application specific traffic patterns and fluctuating packet injection rates cause an unbalanced traffic distribution throughout the network. Very active regions called hotspots block the packet flow and induce congested areas. Similar to traffic jams, congested areas can spread and affect larger parts of the network over time. This results in an even greater negative impact on network performance.

The main mechanism by which congestion reduces the throughput of the complete network is the head of line (HOL) blocking effect. HOL blocking appears if a packet that is heading towards a congested region blocks other packets that would otherwise be unaffected. Due to this effect a congestion tree emerges that may spread over the whole network and that reduces the throughput by up to 58% of its peak value [79]. To address the appearance of congestion, two approaches have been pursued. One costly solution in terms of resources and power is to oversize the network to be able to cope with hotspot traffic. More effective approaches are source throttling techniques [125] which dynamically reduce the injection rates at specific endpoints to avoid congestion. The main problem of these techniques is that global communication between the endpoints is required to effectively determine when to throttle the injection rate at a specific endpoint. Although throttling techniques reduce network congestion effectively, performance gains can only be achieved in a well coordinated system.

More recent approaches, as proposed by Duato et al. [46], focus on avoiding HOL blocking instead of the congestion itself. Solutions like the RECN mechanism [53] or Virtual Output Queuing (VOQ) [79] utilize additional buffers to reorder the packet flow. This enables non-congested packets to bypass congested packets which limits the HOL blocking effect. Although buffer space is costly, these techniques allow the network to be operated closer to its saturation point, resulting in a considerable increase in performance. An interesting observation is that both the RECN and the VOQ approach share many aspects of the virtual channel technique. This fact is exploited by the HTAX framework to provide a congestion management scheme.


3.2.5 Virtual Channels

A general switch consists of a set of input ports, output ports and an interconnection matrix. The matrix consists of physical wires which connect the input ports to the output ports via multiplexers. A physical wire supports the transmission of one packet at a time. However, for various techniques, including congestion management and Quality of Service, it is desirable to send multiple traffic classes over the same link. Instead of duplicating the physical wires for this purpose, virtual channels have been proposed. Virtual channels are implemented by allotting multiple sets of queues in the input and output ports. The inport and outport buffers are usually implemented as FIFOs and are used to enqueue packets which cannot be immediately routed by the switch. In general, each inport implements a single queue. Virtual channels, in contrast, deploy multiple buffers in each port. This allows different packet streams to be transmitted “concurrently”. In effect, only one packet can be transferred over a physical wire at a time; however, the input appears to support multiple virtual channels. The advantage of this technique is that only the memory structures have to be replicated as the number of ports and wires in the switch matrix remains constant.

The application area of virtual channels is manifold. Many protocols use virtual channels to distinguish different traffic classes, for example, read and write transactions. Depending on the application protocol this may be necessary to avoid deadlocks and to enable specific synchronization mechanisms. Another technique is to assign different priorities to the virtual channels which can be used to implement Quality of Service.

As the different virtual channels are managed independently, packets in unblocked virtual channels can bypass packets in blocked virtual channels reducing HOL blocking. This technique can be extended with additional buffers to support virtual output queuing which also allows to bypass packets that target different outports. Both virtual channels and VOQ can be combined and implemented by using the same buffering technique.

Another useful application of virtual channels is deadlock avoidance in wormhole-switched direct networks. In wormhole-switched networks packets are split into smaller parts (flits) which can be transmitted individually. Wormhole switching provides lower latency and lower buffer requirements than the store and forward technique which always


receives complete packets before retransmitting them. However, as in-flight packets are distributed over multiple nodes, cyclic dependencies between these nodes can appear in Meshes and Tori. Dally and Seitz propose a solution for this problem by switching the virtual channel of the flits at specific locations in the network [38]. The manifold application possibilities of virtual channels are supported by the HTAX framework.

3.2.6 NoC Scalability

To determine the scalability of a NoC architecture, four different aspects need to be analyzed. NoCs need to be able to scale in terms of bandwidth, latency, operating frequency and resource consumption. The bandwidth issue can be addressed by applying switches and point-to-point links instead of shared medium buses. Using this technique, each additional endpoint increases the aggregate bandwidth linearly, resulting in a constant bandwidth per port ratio for arbitrary network sizes.

The transmission latency depends on the NoC topology and the utilization of the network. When the saturation point of a network is reached, packet latency increases significantly as congestion appears. Solutions to this problem are to modify the traffic patterns and the message injection rate or to oversize the network capacity. Furthermore, a congestion management technique can be applied. While the impact of network utilization is highly dynamic, the effect of the network topology is static and well defined. In direct networks multiple switches are deployed in a Mesh structure, resulting in a moderate latency increase for each hop. The latency for nearest neighbor communication remains constant for arbitrarily large networks while the far-end communication latency increases with the square root of the size of the network. For indirect networks, in theory, the latency remains constant for all network sizes; however, due to timing issues multistage switches need to be deployed which lead to a moderate increase in latency.

Indirect networks that implement a single central switch exhibit limited scalability on the physical level. High radix switches require arbitration logic at each outport whose complexity increases quadratically with the number of ports. This fact limits the degree of the switch to roughly 8 to 16 ports if high operating frequencies need to be supported. A solution to this problem is the support of multistage or hierarchical


switches. Both approaches partition the switch into smaller subswitches to reduce complexity and to ease timing closure. While this approach requires intermediate buffers between the switch stages, which increases the latency, the additional pipeline stages also allow the switch to be distributed over the chip, reducing the length of global wires. Another way to increase the operating frequency of the arbitration logic is to insert additional intermediate pipeline stages. This increases the latency by a single clock cycle while improving the timing considerably. The HTAX framework supports multistage and hierarchical switch topologies as well as pipelined arbitration logic. None of the three solutions affects bandwidth; all provide 100% throughput.

Direct networks, on the other hand, show good scalability from a technology perspective as the switches only need to implement a small number of ports. In order to increase the number of endpoints, replication of the switches is sufficient.

The occupied silicon area of NoCs is to a large part determined by the SRAM based buffers that are instantiated in the design. To address this issue the HTAX framework supports buffered as well as buffer-less NoCs. In many designs the endpoints already implement buffers, which obviates the need for redundant memory cells. For performance reasons and to reduce head of line blocking, the framework supports input buffered switches with virtual output queuing.


3.3 HTAX: A NOVEL FRAMEWORK FOR NOCS

The HTAX framework we propose in [98] defines a general network-on-chip architecture for SoCs and MPSoCs. Its goal is to significantly reduce chip design and verification overhead for a wide range of applications. To satisfy the different requirements of the diverse hardware implementations the Layered Component Based Design Methodology is introduced.

3.3.1 Layered Component Based Design Methodology

Our proposed design methodology, shown in Figure 3-4, decomposes the various tasks along two dimensions. In the vertical dimension, the HTAX framework is divided into four layers: the physical, network, transport and application layer. These layers provide specific functionality and are directly built on each other. While the network layer provides a collection of mandatory features which are common to all HTAX NoC implementations, the transport layer is partitioned into different components within the horizontal dimension. This layer implements optional features provided by the framework that can be instantiated depending on the application protocol.

Figure 3-4: Layered Framework Architecture (Application Layer: TCP/IP, HyperTransport, PCIe; Transport Layer: Fault Tolerance, Congestion Manager, QoS, Off-chip Interface; Network Layer: protocol agnostic switching, routing and flow control; Physical Layer: standard cells, RAMs, PLLs)


Physical Layer

All components of the framework are implemented in the form of portable register transfer level (RTL) code that can be synthesized in a technology independent manner. We target a standard cell design flow and, in principle, support all FPGA and ASIC technologies. Specific building blocks like RAMs and PLLs which are required for the clocking infrastructure are technology specific. These blocks are encapsulated in such a way that they can be easily exchanged depending on the target technology.

Network Layer

The network layer provides the core functionality including switching and routing capability as well as arbitration, flow control and virtual channels. This common layer is shared by every HTAX instance. The complete functionality is hidden in the HTAX microarchitecture and transparent to the application, which connects to the NoC through well defined interfaces. The interface shown in Figure 3-5 has a low complexity and supports advanced functionality through additional components in the transport layer. The interface distinguishes a transmit (TX) and a receive (RX) part. It deploys control signals which implement a low level request/grant mechanism to control access to the outports. To hide the arbitration latency, the optional release_gnt signal can be asserted to release a grant ahead of time. Data packets that are transmitted through the NoC are framed using the start of transaction (sot) and end of transaction (eot) signals.

Figure 3-5: HTAX TX and RX Interfaces (TX side: tx_outport_req[PORTS-1:0], tx_vc_req[VC-1:0], tx_vc_gnt[VC-1:0], tx_data[W-1:0], tx_sot[VC-1:0], tx_eot, tx_release_gnt; RX side: rx_vc_req[VC-1:0], rx_vc_gnt[VC-1:0], rx_data[WIDTH-1:0], rx_sot[VC-1:0], rx_eot)


Transport Layer

The Transport Layer implements higher level functionality based on the services provided by the network layer. All blocks within the transport layer are optional as the network layer itself already provides the necessary functionality for switching packets within an HTAX NoC. The Transport Layer offers a high degree of flexibility to the HTAX NoC designer by implementing a component based methodology. From a number of different components the user can choose a subset to extend the functionality as required. The supported components include buffering structures, congestion management, fault tolerance capabilities and the extension interface as described in the Design Components section.

Application Layer

An important feature of the HTAX framework is its independence from higher level communication protocols. As the network and transport layers are protocol agnostic, the HTAX framework can be deployed in a large number of applications. More complex communication protocols including TCP/IP, HyperTransport or any proprietary protocol can be supported via the Application Layer. To offer the best flexibility, the framework supports all important features that are required by common protocols. By providing QoS, virtual channels and reliable communication, even heavyweight protocols can be supported.

3.3.2 Design Components

The HTAX framework provides a modular concept to build application specific NoCs with a high functional flexibility. Therefore, a component based design methodology is applied which provides pre-configured blocks that each implement a different functionality. All components support the network layer interface so they can be plugged together without modifications. From a design perspective the components can be regarded as black boxes that can be combined without knowledge of their internal functionality. This approach facilitates the development of new design components as long as they adhere to the framework's specification. To give an example of an HTAX implementation, Figure 3-6


shows a highly heterogeneous central switch consisting of different design components. The switch utilizes a multistage topology. All arrows represent the interface shown in Figure 3-5. The functional units (FUs) are interconnected by the switch via their transmit (TX) and receive (RX) interfaces. The switch incorporates both non-buffered input ports and input queued ports to enable QoS policies and to reduce the impact of congestion. Specific modules are protected with error correcting codes (ECC) to improve fault tolerance. A real world implementation is likely to be configured more homogeneously; however, the purpose of Figure 3-6 is to show the wide range of possibilities and the flexible concept of the architecture.

Figure 3-6: Heterogeneous HTAX Switch (a two-stage switch interconnecting FU TX and FU RX interfaces through VC-aware routers, virtual channel buffers, VOQ buffers and routers, a QoS router and ECC protected buffers)

Buffering Structure

Buffering structures are required in networks for different reasons. Their main task is to decouple computation from communication. Most applications do not generate constant, steady traffic but tend to show bursty communication patterns in between the computation phases. To avoid mutual blocking of both tasks it is desirable to overlap computation and communication. This can be realized through buffers which are able to store a number of packets that cannot be immediately injected into the network. As long as the


buffers can accept additional packets, the processor can continue its work instead of being stalled by the backward flow control of the communication fabric. However, in many applications the communication endpoints implement buffering structures themselves. As SRAM based buffers are cost intensive it can be beneficial to avoid this redundancy. The component based technique of the HTAX framework supports a flexible buffer strategy to optimize resource consumption, latency behavior and throughput of the NoC.

Furthermore, buffers increase the scalability of switches as they enable non blocking multistage interconnection networks (MINs) [72]. MINs are implemented by cascading multiple smaller switches with intermediate buffering structures. Figure 3-6 shows a two level MIN. In principle, the HTAX framework supports arbitrary numbers of switch stages. While MINs enable high radix switches that can operate at high clock frequencies, physical scalability remains limited as for non blocking MINs the silicon footprint scales quadratically. To address this problem hierarchical switches have been proposed which scale linearly. Hierarchical switches implement tree-like structures of switches in which the leaves represent the communication endpoints. They show non-deterministic latency and bandwidth characteristics as the distance between nodes varies with the number of levels that have to be traversed in the tree to reach the destination.

The buffering structure within an HTAX switch is transparent to the endpoints. Packets are transmitted over the same interface using the low level request/grant protocol. As long as the buffers can accommodate packets, requests are immediately granted. The inport then handles requesting of the actual outport and transmission of the packet.

Congestion Management The HTAX framework applies a virtual output queueing (VOQ) technique on the switch level to limit the negative effect of congestion. VOQ reduces head-of-line (HOL) blocking by utilizing multiple buffer queues in each inport. Each inport instantiates a queue for each outport, resulting in a total amount of Q = NUM(inports) · NUM(outports) queues. Each incoming packet is stored in the buffer which maps to the corresponding output. Packets in different output queues can overtake each other, which results in a significantly reduced HOL blocking effect as long as each buffer has free slots to accommodate further packets. In multistage switch topologies or MINs there still exists the possibility of HOL blocking. This is due to the fact that packets targeting the same outport on the first switch level may target different outputs in the second switch level. If the first packet is blocked it will block the second packet which could, in principle, proceed. Complete elimination of HOL blocking can only be achieved by applying VOQ on the network level, which requires a queue in each inport for every endpoint in the network. It is obvious that this approach does not scale due to its enormous resource requirements. However, it has been shown [79] that switch level VOQ achieves a significant reduction of HOL blocking at acceptable cost. In contrast to other congestion management techniques, VOQ requires no propagation of congestion information throughout the network. Therefore, it can be implemented independently in each switch stage.
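The principle can be illustrated with a small Python sketch of a single inport with virtual output queues; the queue depth and interface names are illustrative only and do not correspond to the generated RTL.

    from collections import deque

    class VOQInport:
        """Sketch of one switch inport with virtual output queueing (VOQ).

        One FIFO per outport; a packet blocked at one outport does not
        prevent packets destined for other outports from being forwarded.
        """
        def __init__(self, num_outports, depth=4):
            self.queues = [deque() for _ in range(num_outports)]
            self.depth = depth

        def accept(self, packet, outport):
            # Grant the request only if the VOQ for this outport has a free slot.
            if len(self.queues[outport]) < self.depth:
                self.queues[outport].append(packet)
                return True
            return False

        def drain(self, outport_ready):
            # Forward at most one packet per ready outport; a blocked outport
            # leaves its queue untouched, so it cannot stall the other queues.
            sent = []
            for port, q in enumerate(self.queues):
                if q and outport_ready(port):
                    sent.append((port, q.popleft()))
            return sent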

Multicast Capability HTAX switches offer broadcast and multicast capability. Multicasts can be used to send a single packet to multiple recipients concurrently. The switch hardware automatically duplicates the packets, which considerably improves network performance: instead of sending p consecutive packets to n different endpoints, a single multicast packet can be used. A requirement for multicasts is that all recipients are able to receive the packet. To prevent a busy endpoint from blocking the multicast, buffering structures are utilized which accommodate duplicated packets until they can be delivered to the receiver. An inport implements multiple buffer queues, one for each existing outport. The structure is similar to the one used for VOQ, which allows combining VOQ congestion management and multicast capability. A multicast is triggered by the sender by requesting multiple outports concurrently. If all requested buffer queues for the corresponding outputs can accommodate the packet, the request is granted. The packet is received from the sender by the inport and inserted concurrently into multiple buffers. After the packet has been duplicated the queues start requesting the corresponding outports individually. A blocked outport, therefore, cannot stall the other packets within the multicast. Multicast reception is not synchronous, which implies that the targets can receive the multicast packets at different times.


Quality of Service Support QoS enables applications to transmit information in different traffic classes to prioritize specific types of traffic. QoS is useful to guarantee latency bounds for high priority packets combined with a best effort mechanism which maximizes the throughput of the network. To implement QoS the HTAX framework utilizes virtual channels. Each virtual channel is mapped to a specific traffic class, which enables transmitting packets in different channels between two endpoints. Packets in different channels can bypass each other, which allows the high priority packets to reach their destination as fast as possible. The framework supports two different policies to implement QoS.

In the privileged approach the bandwidth is unequally divided between the data streams. According to their priority, a weighting is defined between the different virtual channels. An exemplary architecture may define three virtual channels with a 3:2:1 relation. In this case packets traveling in the first virtual channel are assigned three times as much bandwidth as packets that travel in the third virtual channel. Due to the higher bandwidth a corresponding decrease in latency can be achieved for class one packets. To maximize throughput, bandwidth is assigned dynamically: if there are no outstanding packets in virtual channels one and two, the complete bandwidth can be utilized by channel three.
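The following Python sketch illustrates such a weighted, work-conserving scheduling policy for the 3:2:1 example; it is a behavioural model of the policy, not the HTAX arbiter implementation, and the packet names are made up for the example.

    from collections import deque
    from itertools import cycle

    def weighted_schedule(queues, weights):
        """Hand out transmission slots in proportion to 'weights' (e.g. [3, 2, 1]),
        but let an empty channel yield its slot so the full bandwidth is used
        whenever any channel has packets."""
        # One scheduling frame, e.g. [0, 0, 0, 1, 1, 2] for weights 3:2:1.
        frame = [vc for vc, w in enumerate(weights) for _ in range(w)]
        for slot_vc in cycle(frame):
            candidates = [slot_vc] + [vc for vc in range(len(queues)) if vc != slot_vc]
            for vc in candidates:            # fall back to any backlogged channel
                if queues[vc]:
                    yield vc, queues[vc].popleft()
                    break
            else:
                return                       # all channels empty: stop

    # Example: channel 0 receives roughly three times the slots of channel 2 under load.
    qs = [deque(f"p{vc}.{i}" for i in range(6)) for vc in range(3)]
    for vc, pkt in weighted_schedule(qs, [3, 2, 1]):
        print(vc, pkt)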

In the exclusive privileged approach, only two different channels are supported while one virtual channel is strictly favored over the other. As long as there exist packets to send within the privileged channel they are always preferred. As this can cause starvation of the low priority channel the application has to enforce idle times on the high priority channel.

The QoS implementation utilizes multiple buffers, one for each virtual channel. As long as there exists buffer space, packets are immediately accepted and stored into the queue according to the requested virtual channel. The inport contains a prioritized arbiter which, depending on the policy, chooses packets from the queues and requests the corresponding outports. The best effort approach of the switch is subordinated to the priority scheme which can cause HOL blocking of low priority packets if the high priority packets are stalled.


Burst Capability The HTAX switch supports a burst mode to reduce arbitration overhead and to increase throughput for small packets. When sending minimum-sized packets over the HTAX, bubbles are introduced through the delay of the arbitration mechanism. To address this issue the burst mode allows a single grant to be used to transmit multiple packets consecutively. Each packet is framed by start of transaction and end of transaction signals so the packets can be differentiated from each other. The grant, however, is only released by assertion of the release_gnt signal. The endpoints need to define a maximum burst size (MBS). If a grant is given by a receiver it must be capable of receiving as many packets as defined by the MBS. The endpoints may support burst capability only for small messages to reduce buffer space. The definition of the MBS is part of the application protocol.

Excellerate Interface The HTAX framework delivers a scalable technique for on-chip networks. However, scalability is ultimately limited by the physical size of the chip. As the manufacturing yield is inversely proportional to the chip size, small chips are much more cost efficient. For this reason multichip modules (MCM) as well as multichip PCBs have been proposed. The proximity connection approach [48], for example, connects large numbers of chips over a high speed interface, using capacitive coupling, to form a 2D grid of ICs. To enable multichip approaches the functionality of the NoC has to be extended to support off-chip communication. Due to the different characteristics of off-chip transmission lines, additional features need to be defined for such an interface. A physical layer is required which defines drivers and receivers to enable data transmission over the PCB or the MCM substrate. As the distance between the chips is relatively small, communication latency and bit error rate remain low. This allows defining a protocol with much less overhead compared to wide area network protocols like Ethernet and InfiniBand, which have to tolerate much larger distances between the endpoints.

The HTAX enables off-chip data transmission via the Excellerate interface. Excellerate is a lightweight protocol that provides transparent communication over chip boundaries. Excellerate hides the additional PHY and protocol overhead completely and behaves exactly like the HTAX on-chip interface apart from a marginal increase in latency. This feature avoids any modification of the components regardless of the applied physical interface and, thereby, maintains the modularity and flexibility of the HTAX framework. Figure 3-7 shows two chips which both implement an HTAX based NoC interconnected by the Excellerate interface. Components that reside within one chip can access components on the other chip as if they were interconnected by a single NoC. This architecture provides especially interesting possibilities for FPGA-to-FPGA or ASIC-to-FPGA tandems. In this case, the reconfigurability of the FPGA can be exploited to exchange components conveniently even at runtime by applying the partial reconfiguration capability of state of the art FPGAs [43]. Besides tandem configurations, many-chip topologies can be implemented by utilizing multiple Excellerate interfaces.

Figure 3-7:Excellerate Interface for Interconnecting Multiple Chips

For seamless integration into an application, Excellerate implements the standard HTAX interfaces as defined in Figure 3-5. To enable sending HTAX packets to remote chips, HTAX packets are encapsulated into the EXTOLL off-chip protocol. EXTOLL is a network protocol that enables routing of packets between different chips and, therefore, represents a suitable solution to implement the off-chip interface. The Excellerate unit, shown in Figure 3-7, reuses many components of the EXTOLL protocol to provide virtual channel aware buffers, a credit based flow control over the Excellerate interface and fault tolerance capability. For this purpose, the HTAX packet is framed with two command characters, the EXTOLL start of frame and end of frame characters. The framed packets are then forwarded to the EXTOLL network port which adds protection by appending an ECC encoding to the command cells as well as a 32 bit packet CRC to the HTAX data frame. The packet CRC adds less overhead than the ECC mechanism, although it can only detect errors instead of correcting them. For data packets this drawback is tolerable as they can simply be retransmitted in the case of an error. Command frames, however, need to be correctable as corrupted command characters can put the system into an unpredictable state. To enable resending of erroneous data packets the EXTOLL link port implements a retransmission buffer which stores all packets until they are acknowledged by the receiver. If multiple internal or external Excellerate ports are required, an optional EXTOLL X-BAR can be deployed which switches between the different ports.

The goal of the Excellerate interface is to act as a transparent channel; however, the additional functionality increases the request-grant latency preceding any communication within the HTAX NoC. The problem of a higher arbitration latency is that it causes bubbles between two packets in the data stream. Back to back sending is then rendered impossible, which effectively decreases the interface bandwidth. To prevent these negative effects a credit based flow control mechanism is applied which virtually hands out grants to the transmit side in advance. Additional buffer space is provided on the receiver side to accommodate packets for which credits have been released. Credits are controlled on a per virtual channel basis by the link port. It maintains a remote credit counter for every virtual channel which is permanently updated by credit information packets sent by the remote side. Based on the credit information, it controls the HTAX TX interface to guarantee that only packets with available buffer space on the remote side are accepted. Although supporting multiple virtual channels consumes resources in terms of buffer space and logic, head of line blocking between channels is removed, which is mandatory for efficient off-chip communication.
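The transmit side behaviour of this credit mechanism can be sketched in a few lines of Python; the initial credit count is an arbitrary example value and the method names do not correspond to the RTL implementation.

    class CreditTx:
        """Sketch of per-virtual-channel credit flow control on the transmit side.

        One credit corresponds to one free buffer slot on the remote receiver.
        Credits are consumed when a packet is sent and replenished by credit
        information packets arriving from the remote side.
        """
        def __init__(self, num_vcs, initial_credits=8):
            self.credits = [initial_credits] * num_vcs

        def can_grant(self, vc):
            # A request is granted immediately ("virtual grant") as long as the
            # remote buffer is known to have space, hiding the off-chip latency.
            return self.credits[vc] > 0

        def send(self, vc):
            assert self.credits[vc] > 0
            self.credits[vc] -= 1

        def credit_return(self, vc, n=1):
            # Called when a credit information packet arrives from the receiver.
            self.credits[vc] += n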


Figure 3-8:Excellerate Functional Unit (HTAX Interface, HTAX-to-EXTOLL Framer, EXTOLL Network Port, optional EXTOLL X-BAR, EXTOLL Link Port)

The next module is the arbiter which multiplexes the HTAX data stream and the credit information onto a shared link before providing it to the physical layer interface. The PHY can be subdivided into a transmit path that drives the data stream onto the physical wires and a receiver which samples the data and forwards it to the Excellerate module. To reduce the number of pins required at the chip level, data is transmitted at both the rising and falling clock edge using the double data rate (DDR) technique. The transmit clock is derived from a common external clock which allows operating the two endpoints synchronously. The synchronization point between the link clock domain and the chip's internal clock domain resides in the PHY and is implemented by a FIFO structure which can compensate for the phase offset between the read and write clock. As the main application for Excellerate is to enable chip-to-chip communication over PCBs, the requirement of synchronous operation between the ICs is not a serious limitation as both can be sourced by the same input clock.


Reliability and Fault Tolerance Each queue that is instantiated through one of the HTAX building blocks can be configured to support ECC protection. The cost is a number of additional bits that have to be stored in the memory structure, while fault tolerance is improved significantly. Each data word retrieved from a memory structure is directly checked for errors and single bit errors are automatically corrected.

Another fault tolerance mechanism is the automatic retransmission scheme. It requires bidirectional communication between the endnodes in the network and adds modules to the transmit and receive paths. The transmit module is responsible for calculating a CRC, which is attached to the end of a packet, and for storing the packet in its retransmission buffer. The receive module checks the CRC and, if the check is successful, generates an ACK packet; otherwise it rejects the packet and issues a NACK. On receiving the ACK the transmitter removes the packet from the retransmission buffer; in case of a NACK the packet is retransmitted.
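A behavioural Python sketch of this scheme is shown below; zlib.crc32 merely stands in for the packet CRC actually used by the hardware, and the framing is simplified.

    import zlib
    from collections import OrderedDict

    class RetransmitLink:
        """Sketch of the CRC-protected retransmission scheme described above:
        the transmitter appends a CRC and keeps every packet in a retransmission
        buffer until it is acknowledged; the receiver checks the CRC and answers
        with an ACK or a NACK."""
        def __init__(self):
            self.retransmit_buffer = OrderedDict()
            self.seq = 0

        def send(self, payload: bytes):
            frame = payload + zlib.crc32(payload).to_bytes(4, "big")
            self.retransmit_buffer[self.seq] = frame
            self.seq += 1
            return self.seq - 1, frame

        @staticmethod
        def receive(frame: bytes):
            payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
            return ("ACK", payload) if zlib.crc32(payload) == crc else ("NACK", None)

        def on_ack(self, seq):
            self.retransmit_buffer.pop(seq, None)   # packet delivered, free the slot

        def on_nack(self, seq):
            return self.retransmit_buffer[seq]      # resend the stored frame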


3.3.3 HTAX Design Flow

In this section the design flow for the generation of HTAX NoCs is presented. An electronic design automation (EDA) tool suite has been developed that supports different target technologies. The presented technique simplifies NoC design significantly and enables a rapid prototyping approach as new NoCs can be generated in a time efficient way. Furthermore, correct functionality of the NoC can be guaranteed by means of the HTAX verification environment. To enable automatic generation of an application specific NoC, the user provides a description which is then transformed into a gate level netlist by the tool chain. The HTAX design flow, as shown in Figure 3-9, comprises four main components: the HTAX Configuration Wizard, the HTAX NoC Compiler, the HTAX Verification Environment and the Backend Synthesis.

Figure 3-9:HTAX Design Flow (NoC Description → Configuration Data → RTL → Design Verification and Netlist; correct designs become the Finalized Design, otherwise the flow is iterated)


HTAX Configuration Wizard The HTAX Configuration Wizard provides a convenient solution to generate HTAX NoC descriptions. By utilizing the graphical user interface shown in Figure 3-10, all necessary properties of the NoC can be defined. Currently, the number of endpoints, the number of virtual channels, the number of switch stages, the buffering technique and the width of the data bus including ECC protection can be selected. Multistage NoC topologies are supported as well as QoS requirements.

The HTAX Configuration Wizard produces a configuration file that uniquely describes the designated NoC. This configuration file is then handed over to the HTAX NoC Compiler to generate the actual design files.

Figure 3-10:HTAX Configuration Wizard


HTAX NoC Compiler The HTAX NoC Compiler transforms the HTAX description as defined in the configuration file into synthesizable RTL code. It is implemented in the form of a Perl based scripting engine that utilizes two different techniques to generate the designated output. The first technique exploits the component based design methodology to instantiate pre-configured IP blocks in the design as requested by the user. Examples are the arbiter IP block and the buffering structures that are used in a specific NoC. To reduce the number of pre-configured blocks that have to be supported, all IP blocks are parameterizable. The NoC Compiler is aware of the IP block attributes and configures them accordingly.

The remaining parts of the design are generated by the tool directly. This includes the logic to interconnect the instantiated IP blocks as well as the multiplexers and the switching matrix which is the central part of the NoC. The RTL code that is generated by the compiler is optimized for area and speed which allows the backend process to generate highly efficient NoC implementations. The quality of the generated code and the performance of the designs are comparable to hand written HDL code as shown in the evaluation section. In addition to the actual design, the tool generates a toplevel file which contains the module instantiations as well as the NoC specific verification environment. This verification environment provides the tools to test the design under different conditions and to verify its correct behavior.
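To illustrate the generation principle (the actual engine is written in Perl), the following Python sketch shows how a configuration record could be expanded into parameterized module instantiations; the module and parameter names used here are invented for the example and do not correspond to the real HTAX IP blocks.

    def emit_htax_top(cfg):
        """Toy illustration of the NoC compiler idea: turn an abstract
        configuration into synthesizable RTL text."""
        ports, vcs, width = cfg["ports"], cfg["vcs"], cfg["data_width"]
        rtl = [f"module htax_top #(parameter PORTS={ports}, VC={vcs}, DW={width}) ();"]
        for p in range(ports):
            # One pre-configured inport block per endpoint, parameterized by the tool.
            rtl.append(f"  htax_inport #(.VC({vcs}), .DW({width})) inport_{p} ();")
        rtl.append(f"  htax_arbiter #(.PORTS({ports}), .VC({vcs})) arbiter ();")
        rtl.append("endmodule")
        return "\n".join(rtl)

    print(emit_htax_top({"ports": 4, "vcs": 2, "data_width": 64}))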

Verification Environment To prove the correct functionality of a specific HTAX NoC implementation, a verification environment was developed. The major challenge of the verification environment is to completely support the modular feature set of the HTAX framework. It is not sufficient to verify correct behavior of a number of designs and then deduce correctness for all HTAX NoC instances, as designs tend to increase in complexity and susceptibility to errors with their size. Therefore, the verification environment is completely configurable to support any HTAX switch that can be generated with the framework. The NoC generator tool has been customized to automatically generate a verification environment that fits the corresponding design.


The verification environment is based on the Open Verification Methodology (OVM) from Cadence which defines a general architecture for large verification designs. It is an open source product and runs under the Apache License 2.0. An OVM verification environment consists of several different Open Verification Components (OVC) including module, interface and system OVCs as shown in Figure 3-11. The main tasks of the verification environment are to provide mechanics for interaction with the Design Under Test (DUT) via the interface and to ensure the connectivity of all components and subcomponents. Each component is defined in the most general way using object oriented programming techniques to create OVCs with a high degree of reusability. Interface OVCs for example abstract the DUT specifics to provide a general interface for higher level OVCs. Other DUTs can then be imported into the verification environment by solely modifying the interface OVC.

Figure 3-11:Open Verification Methodology Architecture (module, interface and system OVCs arranged around the DUT)

Interface OVCs define an environment which includes so-called agents. An agent in turn can contain sequencers, monitors and drivers. Sequencers generate sequences or stimuli that can be driven into DUTs by a specific driver. The driver also provides feedback to the sequencer to be able to react to the behavior of the DUT. Finally, monitors are connected to drivers to capture the so-called sequence items that are driven into the DUT. Figure 3-12 presents the layout of the HTAX transmit component (TX-OVC). It contains the HTAX master agent which defines a single driver, monitor and sequencer. In the HTAX verification architecture each TX-OVC is responsible for a single HTAX port. In the case of an eight port HTAX there exist eight instantiated TX-OVCs as well as eight additional receive components (RX-OVC). The modular architecture of the OVM allows the number of OVCs to be scaled easily and interconnected in the toplevel environment.

Figure 3-12:HTAX Transmit OVC (tx_OVC containing the htax_tx_agent with driver, monitor and sequencer)

The sequencer within the TX-OVC generates HTAX packets in the form of sequence items which are driven into the DUT by the driver. The sequencer is a highly configurable block which can generate any legal HTAX packet description either in a user constrained form or completely at random. The monitor which connects to the driver captures the HTAX packet and stores it into a data structure, referred to as the scoreboard. The RX- OVCs which are connected to every receive port of the HTAX implement a monitor which captures received HTAX packets and compares them with the entries in the scoreboard. This approach guarantees that every packet is checked for data integrity as well as every packet is received at the correct outport using the proper virtual channel.

Verification of the correct virtual channel management behavior is a complex task for the proposed OVC. The driver of the TX-OVC requires the ability to request multiple virtual channels simultaneously and to react to the virtual channel specific grant signals of the DUT accordingly. Therefore, the sequencer needs to generate multiple packets at once that target a single outport using different virtual channels. For this purpose, a novel data structure named sequence item pool has been developed which is shown in Figure 3-13.


The sequence item pool contains HTAX packets that target a single outport using multiple virtual channels. It contains a separate queue for each virtual channel that can hold multiple HTAX sequence items. The sequence items can be enqueued at arbitrary times and the driver immediately requests all virtual channels that contain at least one entry. Only if all items of the pool are driven into the DUT, the item pool is deallocated and a new item pool targeting a different outport can be installed. The driver itself is a multithreaded entity in which one thread is responsible for requesting the outport and the virtual channels for which packets are available. Running in parallel, the packetizer thread interprets the grant signals provided by the DUT. It matches the existing requests and grants before retrieving the according packets from the pool to drive them into the DUT.

Figure 3-13:Item Pool Datastructure (htax_item_pool with one queue per virtual channel)
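The following Python sketch models the behaviour of such an item pool; the method names are illustrative and do not correspond to the actual OVM class.

    from collections import deque

    class HtaxItemPool:
        """Sketch of the sequence item pool: all queued items target one outport,
        with one queue per virtual channel (cf. Figure 3-13)."""
        def __init__(self, dest_port, num_vcs=4):
            self.dest_port = dest_port
            self.queues = {vc: deque() for vc in range(num_vcs)}

        def enqueue(self, vc, item):
            self.queues[vc].append(item)

        def requested_vcs(self):
            # The driver raises a request for every VC that holds at least one item.
            return [vc for vc, q in self.queues.items() if q]

        def pop(self, vc):
            # Called by the packetizer thread once the DUT grants this VC.
            return self.queues[vc].popleft()

        def empty(self):
            # Only an empty pool may be deallocated and replaced by a new one.
            return all(not q for q in self.queues.values())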

The scoreboard of the verification environment has the task of comparing packets transmitted through the TX-OVC with packets received by the RX-OVC. For this purpose, it manages different queues, one for each outport. Packets which are injected by the transmit side are inserted into the corresponding queue, depending on their target outport. Ordering is maintained within virtual channels for a specific outport and checked for correctness. As packets of different virtual channels are allowed to be reordered in time, the outport queues contain subqueues for each virtual channel. This functionality is implemented through a datastructure that reuses the concept of the item pool as shown in Figure 3-14.


Figure 3-14:Multi Queue Scoreboard (one htax_item_pool per outport, each containing per-virtual-channel queues)
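A minimal Python model of this multi queue scoreboard, assuming one expectation queue per (outport, virtual channel) pair, could look as follows; it is a behavioural sketch rather than the actual OVM implementation.

    from collections import deque

    class HtaxScoreboard:
        """Ordering is only enforced within a virtual channel, mirroring the
        reordering rules described above."""
        def __init__(self, num_outports, num_vcs):
            self.expected = {(p, vc): deque()
                             for p in range(num_outports) for vc in range(num_vcs)}

        def on_tx(self, outport, vc, packet):
            # The monitor on the transmit side records what the DUT must deliver.
            self.expected[(outport, vc)].append(packet)

        def on_rx(self, outport, vc, packet):
            # The monitor on the receive side checks integrity, outport and ordering.
            q = self.expected[(outport, vc)]
            assert q, f"unexpected packet at outport {outport}, vc {vc}"
            assert q.popleft() == packet, "data mismatch or reordering within a VC"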

A complete HTAX testbench is created by combining TX-OVCs, RX-OVCs and the scoreboard unit. As the number of required OVCs as well as their parameters depend on the design under test, the instantiation and customization of the OVCs is handled by the HTAX development environment tool. An example environment for a 2×2 HTAX NoC is shown in Figure 3-15.

The main purpose of the presented verification environment is to verify the correct functionality of HTAX implementations. The environment exhibits a high degree of flexibility to support any type of HTAX NoC that can be generated with the framework. In addition, the verification environment provides the necessary tools to extend its area of application. It can be utilized to verify the correct behavior of modules that are attached to the HTAX NoC. By reusing the monitors connected to the HTAX interfaces it is possible to observe the interface compliance of these modules. Furthermore, the provided driver OVCs can be utilized to generate test stimulus for any module that supports the HTAX interface. The HTAX verification environment has been successfully applied to a number of NoC designs including high radix implementations with a size of up to 64× 64 ports.


Figure 3-15:Complete HTAX Verification Environment (TX-OVCs and RX-OVCs connected to the HTAX DUT and the scoreboard)


3.3.4 Microarchitecture

The output of the HTAX design flow is highly optimized RTL code. In combination with a synthesis tool like the Cadence RTL Compiler a gate level netlist can be produced which comprises standard logic cells and registers. Figure 3-16 shows the pipeline structure of an HTAX switch implementation. The transmit unit is depicted on the left while the receive unit is shown on the right. The HTAX is a two pipeline stage design in which the first stage is required to compute the grant signal and the second stage to route the data through the switch fabric. The switch architecture is pipelined to overlap the request-grant latency. This allows bubble-free back-to-back data transfer of packets with maximum bandwidth. To ease timing closure a third pipeline stage can be inserted into the critical path which resides in the arbitration logic of the outport. In this case the operating frequency of the switch can be increased by approximately 30% while increasing the propagation latency to three cycles. In addition to the pipeline structure, the diagram shows the control and data path and how they are routed through the module.

Figure 3-16:HTAX Pipeline Structure


A detailed view of the arbiter logic instantiated in the toplevel module is shown in Figure 3-17. The arbiter module, which is the central component of the switch, is a combinatorial logic block. It is implemented as a discrete module to enable a modular design in which the arbiter implementation can be interchanged. This can be utilized to support different arbitration algorithms or architecture dependent timing optimizations. In the example, a switch with three inports and three outports is presented. The request signal bus, which can be seen on the left, comprises 12 bits as the switch supports four virtual channels for each port. The request bus is masked with the virtual channels that are acknowledged by the receiver and then forwarded to the actual arbiter logic. The arbiter logic is implemented through a hierarchical sub block which can be seen at the bottom of the image. Incoming requests are combined with information about previous requests to calculate the grant result. Individual requests can be granted directly. In the case of multiple requests, a fair round robin arbitration scheme is used that is implemented by a binary tree search (BTS) mechanism in conjunction with a unit-weighted representation of the priority index [118]. The arbiter has been carefully designed as it is the timing critical component which defines the performance of the NoC. The result of the arbiter is then forwarded to a multiplexer structure, shown in Figure 3-18, which maps the incoming data busses onto a single outport data bus. Furthermore, the start and end of transaction signals which frame the packets are controlled by the multiplexer. The multiplexer structure instantiates output registers for each data and control signal.
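The arbitration result of such a fair round robin scheme can be modelled with a few lines of Python; the sequential search below only illustrates the selection behaviour, whereas the actual HTAX arbiter computes the same result combinatorially with the binary tree search.

    def round_robin_grant(requests, last_grant):
        """'requests' holds one bit per inport; the search starts one position
        after the previously granted inport, so every requester is served
        eventually."""
        n = len(requests)
        for offset in range(1, n + 1):
            candidate = (last_grant + offset) % n
            if requests[candidate]:
                return candidate        # grant exactly one requesting inport
        return None                     # no outstanding requests

    # Example: with inports 0 and 2 requesting and inport 0 granted last,
    # inport 2 wins the next arbitration round.
    print(round_robin_grant([1, 0, 1], last_grant=0))   # -> 2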


Figure 3-17:HTAX switch arbiter with three ports and four virtual channels


Figure 3-18:HTAX switch multiplexer structure


3.4 EVALUATION

The main goal of the HTAX framework is to facilitate the design and implementation of network-on-chip architectures. By applying the HTAX methodology, the design process can be significantly accelerated enabling faster time to market as well as less risk. In addition, by applying rapid prototyping methodologies, the quality and performance of a design can be improved. The possibility of generating synthesizable RTL code from an abstract description provides helpful information about the scalability, the performance and the resource consumption of the design already in the early stages of a design. This knowledge can be used to adjust the design parameters early, instead of in one of the later implementation stages. Also, by sweeping the NoC’s parameters many design alternatives can be tested which enables a much broader design space exploration.

In many cases, automatically generated code is less performant and efficient than hand written code. To address this issue and to prove the feasibility of our approach in terms of performance, different switch implementations have been generated and evaluated. For our tests we have varied the number of endpoints, the datasize of the packets and the number of stages within the NoC. The designs have been synthesized using two different chip technologies, one FPGA based and one ASIC based. To conduct the measurements, two different back end processes within the HTAX design flow were applied. As the target FPGA technology the Xilinx Virtex5 LX330T device was chosen, which is one of the most complex and sophisticated devices available. It provides 330,000 look up tables to realize arbitrary logic functions, 11 kbits of RAM, 24 high speed serial transceivers, 960 IO pins and 192 digital signal processing (DSP) slices. To transform the RTL code into a LUT level netlist the Xilinx ISE 10 synthesis tool has been used. As RTL modules in general do not instantiate input registers, static timing analysis for the input signals would be inaccurate. To address this issue, the designs have been embedded into wrapper files that implement a register for each input signal.

To confirm the obtained results, a second silicon technology has been evaluated, specifically the 65 nm TSMC standard cell ASIC library. This state-of-the-art technology enables high performance designs with up to billions of transistors; some prominent examples are the recent GPUs from Nvidia and ATI. The backend design flow for the ASIC technology has been conducted with the Cadence RTL Compiler suite. To provide the most significant results, we opted for a physical synthesis which performs interleaved placement and routing of the design to improve the estimation of wire delays. This approach is computationally intensive, however, it provides much better results than approaches using wire load models. In fact, we have compared the physical synthesis results with post placed and routed designs, and could not observe deviations of more than 3% regarding area and speed.

3.4.1 Single Stage Switch Performance

The first test evaluates the maximum achievable operating frequency (Fmax) for different switch implementations targeting the Xilinx FPGA technology. The results are shown in Figure 3-19. The implementation shows an excellent result of 411 MHz for small switches with 4 ports and still a maximum frequency of 152 MHz for 64 port switches. Considering that most FPGA designs run at 100-200 MHz, this HTAX architecture appears to be a good fit for this technology.

Figure 3-19:Maximum Operating Frequency of FPGA based HTAX Switches


The results of the ASIC switch implementations are shown in Figure 3-20. Four port switches can be clocked at 1 GHz, eight port implementations support up to 800 MHz, while 32 port implementations still operate at 567 MHz. In contrast to the FPGA results, which show a graceful degradation of performance, the ASIC performance decreases linearly with the number of ports. As a result four port switches operate more than twice as fast using the ASIC technology while 64 port switch implementations only show a performance increase of 60%. This suggests that the ASIC timing is influenced to a higher degree by the size of the module and the resulting wire delays. The switch latency can be determined by multiplying the cycle time with the pipeline depth of the architecture. This leads to the following results: 3 ns for the 4 port switch, 3.7 ns for the 8 port switch, 4.5 ns for the 16 port switch, 5.7 ns for the 32 port switch and 9.4 ns for the 64 port switch. Correspondingly, the throughput of the architecture can be calculated. Based on a 128 bit wide datapath the port bandwidths of the switches are 15.9 GByte/s, 12.9 GByte/s, 10.7 GByte/s, 8.4 GByte/s and 5.1 GByte/s respectively.
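As a sanity check of these figures, latency and per port bandwidth follow directly from the clock frequency, the pipeline depth and the datapath width. The small Python sketch below recomputes them under the assumption of a three cycle pipeline and a 128 bit datapath; the synthesized designs deviate slightly from these idealized numbers.

    def switch_metrics(freq_mhz, pipeline_depth=3, datapath_bits=128):
        """Latency is the cycle time multiplied by the pipeline depth; per port
        bandwidth is the datapath width transferred once per cycle."""
        cycle_ns = 1000.0 / freq_mhz
        latency_ns = pipeline_depth * cycle_ns
        bandwidth_gbyte_s = (datapath_bits / 8) * freq_mhz / 1000.0
        return latency_ns, bandwidth_gbyte_s

    # 4-port switch at roughly 1 GHz: ~3 ns latency and ~16 GByte/s per port.
    print(switch_metrics(1000))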

Figure 3-20:Maximum Operating Frequency of ASIC based HTAX Switches


3.4.2 Single Stage Switch Resource Consumption

The size in terms of logic resources of different HTAX switch implementations is shown in Figure 3-21 and Figure 3-22. For the FPGA technology, the metric is the number of six input lookup tables (LUT) as reported by the synthesis tool. Such a lookup table implements any logic function of six inputs and contains a register to store the result. The diagrams show results for 4, 8, 16, 32 and 64 port switches with supported data widths of 32, 64, 128 and 256 bit on a logarithmic scale. As shown in the diagrams, the size of the switch implementations not only depends on the number of ports but also on their datawidth. However, while the datawidth increases the resource consumption by a linear factor of ~1.6, the number of ports affects the amount of resources quadratically: doubling the number of ports leads to a four-fold increase in complexity. The reason for this quadratic increase in complexity is that the logic within a port arbiter, as well as the number of arbiters, scales with the number of ports. The datawidth, on the other hand, increases the complexity of the multiplexer structures and the amount of registers only linearly.

Figure 3-21:Single Stage Switch FPGA Resource Consumption


Figure 3-22 shows the number of gates which are required for different switch implementations using the TSMC 65nm library. The reported numbers are higher for the ASIC than for the FPGA implementation as gates only realize simple NAND or NOR functions while lookup tables can implement arbitrary functions based on up to six inputs. However, the relative impact of the datasize and the number of ports on the complexity of the switch is similar. While the datasize increases complexity by a linear factor of ~1.45, scaling the number of ports has the same quadratic effect as in the FPGA technology case. In summary, while the gate count of small switches is insignificant, 64 port implementations contribute significantly to the total size of a design. For large designs, the resource consumption in the area of 1 million gates represents a smaller issue than the limited operating frequency of such designs. To provide a solution for those architectures, the HTAX introduces multi-stage topologies.

Figure 3-22:Single Stage Switch ASIC Resource Consumption


3.4.3 Multistage Switch Performance

The previous paragraphs show that high radix switch implementations exhibit a limited capability of supporting high frequency designs. Especially for designs that feature 32 or more endpoints it is desirable to develop a solution that supports higher frequencies. One possible approach is to deploy multistage switches. In this case, multiple smaller switches are cascaded in a topology to form a larger switch. The switch stages can be interconnected arbitrarily, which enables topologies including Benes [22], Banyan [22] and Butterfly [81]. In our case, we have applied a two-stage MIN composed of different switch components. The architecture utilizes 2n subswitches of size n×n to compose a single n²×n² switch. Figure 3-23 shows examples for a 9 port 3×3 and a 16 port 4×4 implementation. In traditional circuit switched networks, this represents a blocking topology; however, as HTAX is a packet switched architecture, switches are non blocking by definition. The downside of this approach is that it requires packet buffers in front of each intermediate switch stage.
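In summary, and matching the examples in Figure 3-23, a two-stage MIN built from n×n subswitches (n subswitches per stage, as described above) obeys

    P = n^2,   S = 2n        e.g.  n = 3: S = 6, P = 9;   n = 4: S = 8, P = 16

where P denotes the number of switch ports and S the number of instantiated subswitches.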

Figure 3-23:a) 9×9 X-Bar, b) 16×16 X-Bar


Figure 3-24 shows the operating frequencies that can be achieved with the multistage approach utilizing the 65nm ASIC technology. In contrast to the single stage switch architecture, the multistage approach provides a much better scalability in terms of operating frequency. Especially the large switches with 32 and 64 ports show a significant improvement of 778 vs. 520 MHz and 713 vs. 320 MHz, respectively. For the 64 port switch this represents a speedup of 125%. Due to its improved scalability, the multistage approach also enables larger switches, as a 256 port 16×16 switch is still capable of running at 660 MHz. The multistage approach, hence, appears to be a feasible solution for high radix switch architectures that need to support high clock frequencies. The pipeline depth of multistage switches increases from three to six clock cycles; however, the increased clock frequency compensates for this disadvantage by reducing the latency of each stage. Therefore, the packet switching latency is 6.2 ns for the 9 and 16 port switches, 7.3 ns for the 25 port switch, 7.7 ns for the 36 port switch, 7.8 ns for the 49 port switch, 8.4 ns for the 64 port switch and 9.1 ns for the 256 port switch. The per port bandwidth of the switch ranges between 15.2 GByte/s for the 9 port switch and 10.5 GByte/s for the 256 port switch. For all MIN switches, the datapath width was configured to 128 bits.

Figure 3-24: Multi-stage HTAX Operating Frequency


The multistage switch approach has proved advantageous for high radix switch implementations as it scales considerably better in terms of clock frequency. It needs to be analyzed how this approach performs regarding resource consumption to determine whether the performance gain is achieved at the expense of a larger design. Figure 3-25 shows the gate count plotted against the switch radix. A linear increase in resource consumption can be observed, in which doubling the number of ports leads to a two-fold increase in gates. Comparing these results with the singlestage switch solution shows that singlestage switches require significantly less resources for smaller switches of up to 8 ports. For switches of 16 ports and greater, the quadratic increase in resources leads to larger designs for the singlestage switch approach.

Figure 3-25:Multi-stage HTAX Resources

Until now, only the gate count of the switching and multiplexer logic has been considered for the area evaluation. However, as described, the multistage switch approach requires intermediate buffers between the switch stages. These buffers add to the resource count and need to be included in the analysis. The size of the buffers depends on a number of properties, the most important being the maximum transfer unit (MTU) size of the packets, as the buffers need to be able to hold at least one complete packet. Additional buffer space needs to be allocated for congestion management techniques like virtual output queuing. The size of the buffers, hence, depends on the employed communication protocol. For our analysis we assume a hypothetical protocol that uses an MTU of 512 bytes. To provide limited decoupling capability we allow two complete packets to be buffered in the intermediate buffers. The protocol does not utilize virtual output queuing. As a result, an n-port multistage switch requires n buffers of 1 KByte in size. For the evaluation, the 65 nm TSMC library has been used; all results are reported in square micrometers. As can be seen in Figure 3-26, the buffers contribute a significant share of the combined area. While for this protocol the logic cells and the buffers contribute similar amounts to the total area, for protocols which define larger MTU sizes or use multiple buffers per port the buffer area quickly starts to dominate. As a rule of thumb, doubling the number or size of the buffers leads to a two-fold increase in silicon area. In conclusion, multistage switches provide superior performance at a reduced cost in terms of gate count. However, it is important to note that the buffer space can considerably increase the area, especially in MINs.
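The resulting buffer capacity can be recomputed with the simple relation sketched below in Python; the 512 byte MTU and the depth of two packets per buffer are the assumptions of the hypothetical protocol stated above, not fixed properties of HTAX.

    def min_buffer_bytes(ports, mtu_bytes=512, packets_per_buffer=2):
        """One intermediate buffer per port of the two-stage MIN, each holding
        two MTU-sized packets."""
        per_buffer = mtu_bytes * packets_per_buffer        # 1 KByte per buffer
        return ports * per_buffer

    # A 64-port (8x8) MIN therefore needs 64 buffers of 1 KByte = 64 KByte in total.
    print(min_buffer_bytes(64))   # -> 65536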

Figure 3-26:Multi-stage HTAX Gate and Buffer Area (cell area vs. buffer area)


3.5 APPLICATION EXAMPLE: EXTOLL

The HTAX framework has been applied to a number of hardware designs. We will present one of these examples, the EXTOLL network interface adapter architecture which utilizes the HTAX NoC for interconnecting its on-chip modules. EXTOLL is a multi million gate chip implementation that currently exists in the form of an FPGA prototype and which is being manufactured as an ASIC device in 65 nm technology. EXTOLL is a network adapter that enables high performance computing systems by interconnecting large clusters of x86 processors. The architecture is designed to support very low latency communication and high bandwidth. It supports communication offloading primitives and device virtualization capability in hardware. Furthermore, it provides a significant performance advantage over competing network technologies including 10G Ethernet and InfiniBand.

3.5.1 Introduction to EXTOLL

EXTOLL is a modular, high performance architecture for network interconnects. It is based on multiple building blocks that can be combined to form a powerful communication device. The different building blocks implement all the required functionality to communicate over a network including packetization, routing, switching and to provide main memory access. The central component of the EXTOLL architecture is the HTAX NoC which interconnects the different building blocks. Due to its flexibility the architecture is very modular which allows for instance to increase the number of message transport engines to enable simultaneous message streams with higher bandwidth. An example network device that is based on the EXTOLL architecture is shown in Figure 3-27.

The EXTOLL architecture comprises four domains: the host interface domain, the NoC domain, the network engine domain and the network switch domain. The host interface domain provides access to the host processor and main memory. The EXTOLL architecture supports multiple host interface implementations that support different protocols. In particular, the architecture provides support for PCI-Express, HyperTransport 1 and HyperTransport 3. In principle any other host protocol can be deployed as long as the internal interface is compliant with the HTAX specification. The highest performance regarding latency and bandwidth is provided by the HyperTransport 3 core which we propose in [78]. Furthermore, the HyperTransport 3 core supports cache coherency which enables the attached device, e.g. EXTOLL, to participate in the cache coherent domain of the host processors.

The central component of EXTOLL which resides between the host interface and the network engines is the HTAX NoC. In this case, the HTAX is configured as a nine port switch with a 64 bit wide data path. For EXTOLL, the HTAX supports six virtual channels to decouple write, read and response data operations as well as different management layer streams from each other.

Figure 3-27:EXTOLL Toplevel Architecture

The network engine layer consists of two main components, the Very Low Overhead (VELO) engine and the Remote Memory Access (RMA) engine. VELO is the transaction engine for small messages that provides the lowest possible communication latency. In [96] we show that VELO achieves a half round trip latency of below 1 us for small messages. This low latency capability provides significantly better performance for fine grain communication schemes than other technologies, including Ethernet and InfiniBand. The key idea of VELO is to reduce the number of operations required to set up a message transfer to an absolute minimum. This is achieved by tightly integrating hardware and software components, which allows triggering VELO messages atomically with a single store instruction to the network device. The hardware assisted device virtualization technique enables secure access for multiple software threads by encoding additional state information in the virtual address that is used for the store instruction. By restricting the address range that is made available to the user process, security policies can be enforced. Furthermore, VELO offers an efficient scheme for receiving messages and storing them to local memory. Message arrival is detected by polling a status queue while the payload is transferred into DRAM using VELO's integrated DMA engine.

Sending VELO messages involves copying the payload from main memory or its caches to the EXTOLL network device using PIO on the sender side. For large message payloads, this technique is inefficient as it generates a significant CPU overhead. The RMA engine provides a solution for this problem: it provides direct access to main memory in remote machines. For this purpose, put and get functions are supported that are initialized by a descriptor. This enables true one-sided communication, meaning that either the sending or the receiving process takes care of the complete transaction without requiring interaction from the other. The complete message transfer is handled by the RMA units using direct memory access (DMA), thereby unburdening the CPU. Nuessle et al. show in [109] that the EXTOLL RMA engine provides excellent latency and bandwidth for medium and large sized messages.

The EXTOLL architecture avoids any kind of trapping into the operating system to prevent unpredictably high latencies for message transfers. Therefore, all network engines support user level access and hardware assisted virtualization. The architecture supports security mechanisms to allow direct access to the hardware by unprivileged user processes. Although powerful, this mechanism is problematic regarding memory management, as user processes reside in virtual memory while the DMA engines of EXTOLL can only access physical memory. To allow EXTOLL to move data between physical memory locations based on descriptors that hold virtual addresses, an address translation unit (ATU) is required. The ATU module contains a translation lookaside buffer (TLB) for virtual to physical address translations which is managed by driver software. To avoid deadlock scenarios in which an address translation request blocks or is blocked by a data transaction to or from the RMA engine, the applied HTAX NoC supports extra virtual channels for stream decoupling.

Finally, EXTOLL implements the network layer which performs packetization and data injection into the actual physical wire. Fault tolerance mechanisms including ECC, CRC and a retransmission scheme are supported as well. As EXTOLL supports direct network topologies like the mesh and the torus, the network switch is included on the host channel adapter. This approach is very cost and power efficient as no external switches are required. Due to its multiple links, it can provide a higher bandwidth and a lower latency for nearest neighbor communication than traditional, central switch based networks.

3.5.2 EXTOLL NoC Integration

The NoC implemented in EXTOLL is a single stage HTAX switch providing eight bidirectional ports. To increase the host interface bandwidth, the HTAX implements three ports that directly connect to the HyperTransport core. This approach enables up to three packet streams in parallel for different virtual channels. The remaining ports are connected to the VELO and RMA engines, as well as to the register file and to the Excellerate interface. The RMA engine implements multiple ports to increase bandwidth and to support two simultaneous message streams between the engine and main memory. This architecture allows to serve PUT, GET and address translation requests in parallel.

Transactions within the HTAX layer are processed using the low overhead HyperTransport On-Chip (HTOC) protocol. The HTOC protocol extends HyperTransport for on-chip communication capability by increasing the word size from 32 to 64 and to 128 bit respectively. This enhancement is required to support higher on-chip bandwidth communication at moderate operating frequencies. Furthermore, HTOC increases the number of addressable endpoints to 256, adds additional virtual channels and new features which facilitate decoding of packets. HTOC packets contain a command frame, shown in Figure 3-28 which may be followed by up to 64 bytes of data.


Byte   Fields (bits 7..0)
0      SeqID[3:2], Cmd[5:0]
1      PassWD, SeqID[1:0], UnitID[4:0]
2      Mask/Count[1:0], Compat, SrcTag[4:0]
3      Addr[7:2], Mask/Count[3:2]
4      Addr[15:8]
5      Addr[23:16]
6      Addr[31:24]
7      Addr[39:32]
8      extAddr[47:40]
9      extAddr[55:48]
10     extAddr[63:56]
11     ext_srcTag[7:5], rsvd, stomp, command_type[1:0], D_att
12     ext_count[9:4], BAR, ext_srcTag[9:8]
13     htaxSourceInport[7:0]
14     htaxReqOutport[7:0]
15     extVC[7:0]

Figure 3-28:HTOC Command Frame Specification.

The HTOC command frame is 128 bits wide and supports the following fields. Bytes zero to seven are directly adopted from the HyperTransport specification [73], which simplifies bridging between HT and HTOC packets. Bytes eight to fifteen contain the protocol extensions introduced by HTOC. The extended fields, including the extended address, extended srcTag and extended count, provide additional bits to increase the address size as well as the number of concurrent packets and the size of the appended data frame. Additional bits specify the type of the command (command_type), e.g. write request or read response, and define whether a data frame is attached to the command (d_att). The htaxSourceInport and htaxReqOutport fields specify the origin and target of a request, which is used for on-chip routing and response to request mapping. These fields are particularly important as they introduce the possibility of routing HTOC packets between different components on a chip. HyperTransport, on the other hand, is an off-chip protocol which only supports communication between discrete chips. Finally the extVC field specifies additional virtual channels for on-chip communication. A more detailed description of the HTOC protocol can be found in the HTOC specification [74].
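To make the frame layout more concrete, the following Python sketch packs a few of the fields of Figure 3-28 into a 16 byte buffer. The chosen command value and the restriction to byte 0 plus the whole-byte fields in bytes 13 to 15 are illustrative simplifications; the authoritative bit-level layout is defined by the HTOC specification [74].

    def pack_htoc_command(cmd, seqid, source_inport, req_outport, ext_vc):
        frame = bytearray(16)
        frame[0] = ((seqid >> 2) & 0x3) << 6 | (cmd & 0x3F)   # SeqID[3:2] | Cmd[5:0]
        frame[13] = source_inport & 0xFF                      # htaxSourceInport[7:0]
        frame[14] = req_outport & 0xFF                        # htaxReqOutport[7:0]
        frame[15] = ext_vc & 0xFF                             # extVC[7:0]
        return bytes(frame)

    # Example: a hypothetical command value routed from inport 2 to outport 5 on extVC 1.
    print(pack_htoc_command(cmd=0x2F, seqid=0, source_inport=2, req_outport=5, ext_vc=1).hex())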


3.5.3 EXTOLL Implementation

The HTAX NoC employed in EXTOLL has been implemented using two different technologies, in particular the 90nm Xilinx Virtex4 FPGA and the 65nm TSMC ASIC technology. For the FPGA, the complete design has been synthesized, placed, routed and tested on the HTX rapid prototyping board we propose in [52]. The HTAX NoC used in this design instantiates eight ports, five virtual channels and implements a 64 bit data path. The complete EXTOLL design contains the components as described above and runs at 156 MHz. The HTAX NoC provides a per port bandwidth of 1.248 GByte/s and a bisection bandwidth of 9.984 GByte/s. The direct connected topology connects every on-chip component within one hop which results in a packet latency of two cycles in low latency mode, respectively 12.8 ns. Table 3-1 lists the occupied resources of the FPGA implementation. The first column lists the different resource types provided by the Xilinx Virtex4 FX100T device, namely four-input look up tables, flip flops, multiplexers and RAM blocks. The second column shows the number of resources which are available on the FPGA. The third column shows the amount of resources that are consumed by the complete EXTOLL design, while the fourth column presents the resources consumed by the HTAX.

Table 3-1: EXTOLL HTAX Resource Consumption

Resource         Available    EXTOLL Total    HTAX
Look Up Tables   84,382       66,002          2,380
Flip Flops       84,382       32,387          702
Multiplexer      NA           5,412           208
Block Rams       376          144             0


3.6 CONCLUSION

In this chapter we have presented a novel technique to facilitate the design and implementation of generic on-chip networks. As microchips integrate more and more functionality and become increasingly complex, a feasible solution is required to interconnect the components within such large designs. To address this issue, the HTAX framework provides chip architects with a convenient solution for generating on-chip networks from an abstract description. The introduced methodology assists in building modular and flexible designs that offer the highest performance. We presented a NoC generator tool that produces synthesizable RTL code and a full system verification environment, reducing the design and verification time significantly. The chapter provides the following contributions. First, a detailed network-on-chip design space exploration which summarizes the state of the art and presents an elaboration of the challenges that need to be addressed. Second, a layered, component based design methodology which addresses these challenges by introducing a protocol agnostic network-on-chip architecture. The modular approach enables the HTAX framework to be applied to a large number of applications and communication protocols. In addition, we provide a detailed insight into the NoC architecture that covers all levels of abstraction, from the application layer to the microarchitecture. The HTAX design methodology is introduced and the NoC generator tool is presented. The chapter is concluded with a comprehensive evaluation of the approach and an applied use case.

The HTAX framework has been applied to multiple chip designs. In addition to the previously mentioned EXTOLL design, the framework was utilized in a highly parallel NAND flash controller. The design implements 24 high speed flash controllers which are interconnected via an HTAX on-chip network. Further components include a PCIe host interface and two serial links that connect to additional flash devices. The whole design has been implemented on a Lattice FPGA that resides next to 192 GByte of flash memory on a dense PCB. Other designs that deploy the HTAX framework include a data acquisition design that is used within the Compressed Baryonic Matter experiment at the GSI Helmholtzzentrum für Schwerionenforschung GmbH. Finally, the HTAX NoC is applied in an FPGA design for high frequency trading acceleration.


The broad application range of the HTAX framework demonstrates its flexibility. However, during the course of the HTAX development, the specification underwent different modifications to accommodate the requirements of two specific applications. Both modifications led to a generalization of the design, extending its applicability. While we believe that the current specification is flexible and powerful, an in-depth design space exploration in the initial project stage would have allowed an improved specification to be designed from the beginning. Both specification revisions required a renewed verification of the HTAX design as well as modifications to the existing functional units that utilized the NoC. An adaptation of the specification is, therefore, undesirable and should be avoided if possible.

Another issue that appeared during the development stages was induced by the design decision of optimizing for minimum latency. This proved problematic for achieving timing closure in high frequency ASIC designs, especially as the monolithic architecture of the arbiter complicated the integration of further pipeline stages. This led to the modular architecture that is deployed in the current design. By instantiating the arbiter as a separate module, it can be easily improved or replaced by other implementations.

While the HTAX framework has proved successful, there still exist opportunities for further improvements. An important task is to explore further topologies, for example the ring. While rings were dismissed due to their limited scalability and high latency, they have recently experienced a renaissance within Intel's network-on-chip designs. Rings provide the advantage of a very simple architecture that is capable of operating at very high frequencies. This fact may compensate for the disadvantages; the ring topology should therefore be explored in the future. Another task worth exploring is the integration of a NoC simulation engine into the framework. This feature would allow different NoC architectures to be simulated without traversing the complete design flow, further decreasing the development efforts.

Scaling the number of on-chip components leads to dense systems that deliver lots of processing power on a single chip. As a result, these architectures show a significantly increased demand for memory bandwidth and capacity. To address this issue, the next chapter will propose a scalable memory extension technique.

CHAPTER 4 SCALABLE MEMORY EXTENSION FOR SYMMETRIC MULTIPROCESSORS

4.1 INTRODUCTION

Over the last years, the growing disparity between processor and memory speed has led to the so-called memory wall. This issue has been addressed and partially solved by applying a memory hierarchy using large and highly efficient caches. However, a new memory wall appears on the horizon [26]. Latency is not the main issue this time, but rather the limited bandwidth and size of main memory. While the number of transistors on a processor, and likely the number of cores, will double every 18 months [9] following Moore's Law [103], memory density only grows by a factor of two every three years [71]. Even more dramatic is the growing disparity between the advancements in processor and package technology, which defines the maximum pin count that can be supported per chip [75]. Current packages employ up to 2000 pins and are not expected to scale significantly in the future, thereby limiting I/O bandwidth. This will lead to an abundance of cores whose compute power is limited by memory bandwidth and capacity.

The situation is further aggravated by the fact that the memory footprint of applications is increasing. Virtualized servers that run multiple operating system instances show a constantly increasing demand for more memory. In addition, there still exist many single threaded, "memory-bound" applications like electronic design automation tools, finite element analysis simulations and database accelerators like Memcached that can benefit more from increased memory than from additional compute cores [54]. Current processor architectures from Intel and AMD [34] implement the memory controller directly on the chip, supporting only a limited number of DRAM modules. Hence, one possibility to increase the memory capacity is to form cache coherent non uniform memory (ccNUMA) systems composed of multiple processors. This technique enables aggregated shared memory systems with increased capacity. However, it is a very power and cost inefficient solution to deploy a large number of processors which are not used for computation but only for increasing memory capacity.

To address this issue, we have developed a new technique that supports high memory footprint applications in a much more efficient way. Our approach allows building Symmetric Multiprocessor (SMP) systems that combine a limited number of processors
with terabytes of memory. Our proposed memory extension device supports cache coherency and can be transparently integrated into a system. Similarly, as in tightly coupled ccNUMA systems, accessing the memory extension is only slightly slower than accessing local memory.

Our custom memory extension controller has been further optimized to increase the performance of the memory subsystem. We introduce an offloading technique that allows certain tasks to be transferred to the memory controller, reducing CPU overhead. For this purpose, we propose a cache coherent DMA controller to offload memory-to-memory transactions to the memory extension device. Memcpy is a frequent task that places a high burden on the CPU, as each data block has to be moved between memory locations using programmed I/O (PIO). By introducing a DMA offloading engine that can process such transactions independently, system performance is significantly increased.
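The offloading principle can be sketched as follows: instead of copying each block with PIO, the CPU posts a descriptor to the coherent DMA engine and continues with other work. The descriptor fields, the queue slot and the doorbell shown here are purely hypothetical and do not reflect the actual register interface of our controller.

    #include <stdint.h>

    /* Hypothetical descriptor for a memcpy offloaded to the coherent DMA engine. */
    struct ccdma_desc {
        uint64_t src_phys;   /* physical source address          */
        uint64_t dst_phys;   /* physical destination address     */
        uint32_t length;     /* transfer length in bytes         */
        uint32_t flags;      /* e.g. interrupt-on-completion     */
    };

    /* The CPU only writes the descriptor into a device queue slot (MMIO) and
     * rings a doorbell; the DMA engine performs the memory-to-memory copy. */
    static inline void ccdma_post(volatile struct ccdma_desc *queue_slot,
                                  volatile uint32_t *doorbell,
                                  const struct ccdma_desc *d)
    {
        *queue_slot = *d;
        *doorbell   = 1;
    }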

4.1.1 Related Work

Many approaches aggregate multiple machines to form a large distributed shared memory system, leveraging the fact that in most cases a network access is faster than the process of swapping data between memory and hard disk. Virtual shared memory techniques as proposed by Li [92] and Raina [113] implement full software coherency on top of such a distributed shared memory system. Recently, the startup ScaleMP has picked up the idea to develop a product for server aggregation called the versatile SMP architecture [115]. Based on InfiniBand interconnected Intel Xeon nodes, ScaleMP forms virtual SMPs with up to 128 cores and 4 TB of shared memory. All these approaches utilize a software based cache coherency protocol to support the shared memory programming paradigm. Programmers often prefer shared memory architectures due to their convenience; however, the performance of cache coherency protocols is reduced significantly when communicating over a high latency interconnect. Our approach solves this issue by employing a significantly faster interconnect which reduces the latency by orders of magnitude and, thereby, enables large and efficient shared memory systems.

Hardware assisted approaches like the scalable coherent interface (SCI) as proposed by Gustavson [62] shift some of the cache coherency functionality to hardware to speed up the operation. The SGI Origin [89] is a powerful implementation that provided cache coherent systems with a very high number of cores. A more recent approach based on SCI is provided by Numascale [108]. Another cache coherent approach is the Horus system [87] which leverages HyperTransport (HT) to build SMPs of up to 32 nodes. Although such SMPs provide a large distributed shared memory address space, they come at the cost of additional CPUs and, thereby, do not improve the compute to memory ratio. The approach, furthermore, introduces significant costs in the form of the Horus chips themselves.

Some approaches avoid the cache coherency penalty by increasing the memory capacity using block I/O devices. In this case, the DRAM extension is treated as storage rather than as system memory. Texas Memory Systems sells DRAM based solid state disks [124] with a maximum capacity of 512 GB which are attached via FibreChannel. Panda et al. propose swapping to remote memory over InfiniBand [93] to reduce the penalty of swapping to disk. While both approaches can speed up high memory footprint applications that need to swap to the hard disk, they still suffer the latency of an interconnect and do not provide a real system memory extension.

Lim et al. propose the design of a disaggregated memory blade [94] that is attached via PCI-Express. The memory extension is transparent to applications by supporting cache coherency, however, only at a coarse grained, page sized granularity.

More radical approaches try to increase memory capacity and bandwidth by introducing 3D stacking of memory chips and the processor [80][19]. By stacking multiple memory chips on top of the processor using through-silicon vias, large memory capacities can be accessed with high bandwidth and low latency. The proximity connection approach [48] employs short range capacitive coupling to vertically interconnect multiple chips in a single package. By interleaving processing and memory elements, large devices with a high memory capacity can be realized within a single multichip package. While these approaches look promising and are capable of significantly improving the processor-memory interconnect, they require a major modification of the current processor technology and, therefore, only provide a mid-term solution.


4.2 DESIGN SPACE EXPLORATION

4.2.1 Existing Memory Architectures

Different architectures have been proposed to increase the computational capability and addressable amount of main memory within a computer. While the traditional Von Neumann computer architecture defines a single processor with attached memory, recent techniques utilize multiple processors and memory controllers. Modern computer systems can be separated into different classes depending on their memory architecture. The resulting properties thereof have a significant impact on the scalability, bandwidth and access latency of the memory subsystem and, therefore, overall performance of the system.

Uniform Memory Architecture The Uniform Memory Architecture (UMA) describes shared memory multiprocessor systems in which the processor-memory interface is identical for all processors. As a consequence, the bandwidth and access latency to main memory is equal for all applications, independent of the physical processor on which they are executed. Figure 4-1 shows a common UMA system that deploys four processors, an interconnection network and memories. An example for this architecture is the Intel Xeon which uses a so-called front side bus (FSB) to interface the processors to a single memory controller. The front side bus not only transports memory addresses and data, but also snoop messages issued by the cache coherency mechanism to guarantee the consistency of the processors' caches. In the Intel Xeon architecture the original FSB is only capable of supporting one memory transaction at a time. As this limits the scalability in terms of memory bandwidth and number of supported processors, Intel introduced a double pumped FSB that supports two simultaneous transactions. In principle, UMA systems can support an arbitrary number of processors and memories; however, besides the scalability limitations introduced by the cache coherency protocol, the shared front side bus and the single memory controller represent a severe bottleneck. On the upside, the UMA architecture does not require an intelligent placement of processes and data in memory, as every processor accesses every memory location in an identical way.


Figure 4-1: Uniform Memory Architecture (four CPUs sharing an interconnection network (bus) to the DRAM modules)

Non-Uniform Memory Architecture Non-Uniform Memory Architectures (NUMA) pursue the goal of increasing system scalability by removing the bottleneck of the shared front side bus and the memory controller. Therefore, the bus is replaced by an interconnection network and every processor deploys its own memory controller, in most cases integrated on the same chip. The advantage of this technique is that the bandwidth scales with the number of processors and, furthermore, the latency is reduced by the memory controller integration. Most architectures, like the AMD Opteron, deploy a point-to-point interconnect like HyperTransport to replace the front side bus. This interface is used to access remote memories and to propagate information required by the cache coherency protocol. In the AMD architecture, a packet based protocol is utilized that supports multiple simultaneous transactions, further increasing the efficiency and performance. Figure 4-2 shows a four processor NUMA system. As can be seen, the distance between local and remote memories to the processors is unequal. This results in different memory access latencies, depending on the location of the process and its data. It is, therefore, important that frequently accessed data is located close to the process, preferably in the local main memory. Operating systems, therefore, support explicit placement of applications and data to specific processors and physical memory devices. This fact complicates the development of applications and the resource management, as the principle of temporal and spatial locality is expanded from the processor caches to the memory subsystem. However, the extended scalability of this architecture compensates for this disadvantage. While NUMA architectures represent an improvement over UMA architectures, their scalability is still limited due to the cache coherency protocol. To overcome this limitation and to increase the scalability, distributed shared memory systems have been proposed.

Figure 4-2: Non-Uniform Memory Architecture (each CPU with its own locally attached DRAM, interconnected point-to-point)
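The explicit placement support mentioned above is exposed on Linux, for example, through the numactl tool and the libnuma API. The following minimal sketch binds the calling thread to node 0 while deliberately allocating its buffer on node 1, so that every access pays the remote hop described above (node numbers and the buffer size are illustrative):

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        numa_run_on_node(0);                       /* execute on node 0            */
        size_t sz = 256UL << 20;                   /* 256 MB working set           */
        char *remote = numa_alloc_onnode(sz, 1);   /* memory physically on node 1  */
        memset(remote, 0, sz);                     /* every access crosses one hop */
        numa_free(remote, sz);
        return 0;
    }

Built with -lnuma; the same placement can also be enforced externally via numactl with the --cpunodebind and --membind options.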

Distributed Shared Memory A further memory architecture is represented by the Distributed Shared Memory (DSM) paradigm in which the local memory of multiple nodes is aggregated into a single globally shared address space as shown in Figure 4-3. Applications can thus access remote memory using regular read and write operations in addition to their local memory. Most existing implementations utilize loosely coupled network protocols such as Ethernet and InfiniBand to interconnect the systems in a non-coherent fashion. This implies that, although applications access global memory in the same way as local memory, they cannot rely on cache coherency, as cache lines may be shared between remote nodes without their mutual knowledge. There exist two common solutions to this problem: either applying programming models that do not require consistency or deploying a software based cache coherency mechanism. In the partitioned global address space programming model as used by the UPC [30] and Fortress [5] programming languages, the programmer needs to be aware of the memory architecture and has to use specific mechanisms like strict operations to synchronize data between nodes. The consistency is, therefore, explicitly enforced by the application or software library. Due to their low overhead and scalability, PGAS style languages show great promise for future parallel systems, although they are still in their early stages of development and have not been widely accepted yet. Software cache coherency mechanisms as provided by ScaleMP [115] and Symmetric Computing [32] apply modifications to the memory management within the Operating System to realize a globally coherent memory address space. To improve the performance of the system, these techniques introduce a directory based cache coherency protocol and utilize a portion of main memory as a network cache, managed on a per page basis. Shared memory applications can be executed on such a system without modifications; however, the cache coherency overhead limits the scalability of such systems. Applications that continuously update globally shared variables experience a severe performance penalty due to the high latency of the network interconnect. These shortcomings are addressed by hardware supported scalable shared memory systems as provided by Numascale [108] and 3Leaf [1]. Both systems offload the coherency protocol to hardware, allowing the Operating System to be bypassed by accessing remote memory directly from user space. These techniques promise higher performance by decreasing latency while simultaneously increasing the scalability of such systems. Nevertheless, both software and hardware based globally shared memory systems will always suffer from the cache coherency overhead and hence do not represent a feasible alternative for future massively scalable systems.

Figure 4-3: Distributed Shared Memory Architecture (nodes consisting of CPU, DRAM and NIC, interconnected over a network)


4.2.2 Memory Subsystem Analysis

System performance is significantly affected by the speed of the memory subsystem. In the following, we analyze three important aspects of computer systems related to main memory: swapping, the overhead of the cache coherency protocol and the significance of the memcpy operation.

Demand Paging and Swapping Modern operating systems support a virtual memory architecture to provide each process with an individual memory space using paging. This technique increases system security and offers an almost infinite amount of virtual memory to each process. To limit the amount of required main memory, only the addresses that are currently in use are mapped into physical memory. As data in the remaining address space resides on low performance mass storage devices like hard drives, the memory footprint of applications should not exceed the capacity of the main memory. In the case of a page fault (the page does not reside in main memory), the OS is required to load the page from hard disk by replacing another page. This process, called swapping, is time-consuming and, therefore, degrades system performance substantially. Figure 4-4 shows the results of a benchmark we performed on a standard x86 system with 4 GB of main memory. The benchmark accesses a working set of increasing size. As long as the working set fits into main memory, the memory bandwidth is almost constant at just below 1800 MByte/s. However, once the memory footprint of the program exceeds the size of main memory (minus a system reserved part), the resulting memory bandwidth decreases considerably. Although this is a known fact, the benchmark shows two important issues. First, it can be observed that instead of showing a graceful degradation, the performance drops abruptly after crossing the critical size. Second, the decrease in performance is absolutely significant, as the remaining bandwidth is reduced by three orders of magnitude. It is obvious that swapping needs to be avoided whenever possible; however, as systems run multiple virtualized operating systems at once and the memory footprint of applications increases, physical memory becomes a scarce resource. Our proposed memory extension device minimizes the need for swapping and thereby increases performance.
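A benchmark of this kind can be sketched in a few lines: allocate a buffer of increasing size, touch it once, then time a second full pass over it. Once the buffer no longer fits into physical memory, the measured bandwidth collapses because the pass is dominated by page faults and swapping. The step size and the memset based access pattern below are illustrative simplifications, not the exact code used for Figure 4-4:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        for (size_t mb = 256; mb <= 4096; mb += 256) {
            size_t bytes = mb << 20;
            char *buf = malloc(bytes);
            if (!buf) break;
            memset(buf, 1, bytes);          /* first touch: populate the pages */
            double t0 = seconds();
            memset(buf, 2, bytes);          /* timed pass over the working set */
            printf("%zu MB: %.0f MB/s\n", mb, bytes / (seconds() - t0) / 1e6);
            free(buf);
        }
        return 0;
    }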


Figure 4-4: Impact of Page Faults on Bandwidth Performance (memory bandwidth in MB/s over the memory footprint in MB, up to 4096 MB)

Cache Coherency Overhead The symmetric multiprocessor technique allows multiple processors to share a common memory address space. As processors implement caches that hold copies of data from main memory, a cache coherency mechanism is required to guarantee data consistency between the processors. AMD based processors implement a broadcast protocol which involves probing all caches prior to a data modification in main memory. Probing is implemented as a two way handshake which adds latency to the completion of each memory access. The added delay increases with the number of processors in the system due to multi hop topologies. Figure 4-5 shows the negative effect of multi processor configurations we have observed by executing the Stream benchmark. Four different system configurations were evaluated:

(1) A system based on a single AMD Opteron CPU with 2.2 GHz, 4 MB L3 cache and 4 GB of RAM.
(2) A system with two CPUs of the same type.
(3) A dual CPU system in which the Stream benchmark was executed on CPU0 and the memory was allocated physically on CPU1.
(4) A system consisting of a single CPU and a cache coherent FPGA based coprocessor running at 200 MHz.
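For reference, the four Stream kernels measure sustainable memory bandwidth with the following standard loop bodies (shown here as a minimal sketch over large double arrays; the array size must exceed the caches by a wide margin):

    #define N (20 * 1000 * 1000)
    static double a[N], b[N], c[N];

    void stream_kernels(double s) {
        for (long i = 0; i < N; i++) c[i] = a[i];            /* Copy  */
        for (long i = 0; i < N; i++) b[i] = s * c[i];        /* Scale */
        for (long i = 0; i < N; i++) c[i] = a[i] + b[i];     /* Add   */
        for (long i = 0; i < N; i++) a[i] = b[i] + s * c[i]; /* Triad */
    }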


Figure 4-5: Cache Coherency Protocol Penalty on the Stream Benchmark (bandwidth in MB/s for the Copy, Scale, Add and Triad kernels in the 1 CPU, 2 CPU, NUMA and CPU+FPGA configurations)

It can clearly be seen that even a single additional Opteron CPU that does not participate in the benchmark effectively reduces Stream performance which is due to the cache coherency overhead. By accessing memory on the remote CPU, the NUMA effect can be observed which further penalizes performance. In the last test configuration a slow cache coherent FPGA based HyperTransport device was employed in the system which is completely passive and only responds to cache coherency communication. The performance drop that can be observed is dramatic. Tests with similar results have been conducted by IBM [129].

To avoid the cache coherency penalty entirely, we propose the asymmetric probing technique [24]. With this technique, devices that implement a memory controller but no caches can issue probe requests but do not receive probe requests from other memory controllers. As a result, no probes are ever received by our memory extension device, which addresses the problem exposed by the Stream benchmark. Using this technique, even multiple extension devices can be supported without any negative effect on overall system performance.


Impact of Memory-to-Memory Transactions The memcpy library call is frequently used by applications to copy data from one memory location to another. Benchmarks have been conducted that copy data of different granularity using the C-library memcpy call. The results are shown in Figure 4-6. The diagram shows the performance of the memcpy call for different CPU locations within the system. The CPUs used comprise three levels of caches, providing capacities of 64 KB, 512 KB and 2 MB. The impact of the cache sizes can be clearly seen in the graph, resulting in drops of performance when the next level has to be used. Although the performance is adequate, it is important to note that the memcpy task utilizes the processor at 100%. Offloading this task to our memory extension device, therefore, frees up processor resources for computation.
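A measurement of this kind can be reproduced with a short program that pins itself to a chosen core and times memcpy over increasing block sizes; running it once per socket yields the per-CPU curves of Figure 4-6. The sizes, core numbers and single-shot timing below are illustrative simplifications:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(int argc, char **argv) {
        int cpu = (argc > 1) ? atoi(argv[1]) : 0;    /* e.g. 0, 4, 8 or 12       */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);     /* pin the process to 'cpu' */

        for (size_t sz = 1024; sz <= (512UL << 20); sz <<= 1) {
            char *src = malloc(sz), *dst = malloc(sz);
            if (!src || !dst) { free(src); free(dst); break; }
            memset(src, 1, sz);
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            memcpy(dst, src, sz);                    /* the copy fully occupies the CPU */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("CPU %d, %zu bytes: %.0f MB/s\n", cpu, sz, sz / s / 1e6);
            free(src);
            free(dst);
        }
        return 0;
    }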

Figure 4-6: Memcpy Performance in Different System Configurations (memcpy bandwidth in MB/s over transfer sizes from 1 KB to 512 MB, measured from different CPU sockets; the memory controller is behind CPU 0, CPUs 4 and 8 are one hop away, CPU 12 is two hops away)


4.3 IMPLEMENTATION

Our memory extension architecture was guided by the following observations:

• There exists the need for low power, cost efficient system memory extensions.
• The processor pin limitation prohibits further scaling of parallel memory controllers within CPUs.
• Shared memory extensions need to be closely coupled to the system to avoid performance degradation.
• Cache coherency needs to be supported to comply with the standard shared memory programming model.

Given these observations, we propose a novel approach to extend the memory capacity of commodity-off-the-shelf systems. Our technique leverages the serial, low pin count HyperTransport processor interface to access additional memory controllers, thereby enabling tightly coupled cache coherent systems with terabytes of main memory. The technique considerably increases the memory capacity as well as the memory bandwidth of a system while using a very low pin count. The analysis will show that our architecture provides a significantly increased capacity per pin ratio compared to current systems.

Figure 4-7 illustrates our proposed memory extension device within an AMD Opteron based system. The central element remains the host processor which implements an on-chip memory controller (MCT) to access a small amount of local DRAM. In addition, each processor implements up to four HyperTransport links which provide either cache coherent communication with other processors or non-coherent communication with I/O devices. Each system contains at least a single non-coherent device, the southbridge, which bridges to I/O like PCI and SATA. In the described system, multiple memory extension devices which implement additional DRAM controllers are connected to the CPU via cache coherent HyperTransport links. Our flexible architecture can support multiple HyperTransport links, which allows several memory extension devices to be daisy-chained to further increase memory capacity.


Figure 4-7: System Architecture with the Proposed Memory Extension (the CPU with its local DDR3 DRAM connects to the southbridge (SB) via a non-coherent HT link and to three memory extension devices, each with their own DDR3 DRAM, via cache coherent HT links)

As in ccNUMA systems, the memory access latency depends on the number of hops between the origin and target device. For ASIC implementations, each HyperTransport hop adds around 50 ns of latency, which needs to be considered by the memory management policy. Nevertheless, other approaches that extend the main memory through loosely coupled interfaces such as PCI-Express, Ethernet or InfiniBand suffer latencies in the range of microseconds.

An important aspect of our design was its ability to scale in terms of memory capacity and bandwidth. Therefore, our system architecture supports multiple memory extension devices while the memory extension devices in turn support multiple DRAM controllers. The central component of our internal architecture is the network-on-chip (NoC) that supports a variable number of HyperTransport links and memory controllers. Further components include the cache coherence protocol handler and the DMA engine. An overview of the internal memory extension architecture, which will be presented in detail in the following paragraphs, is shown in Figure 4-8.


Figure 4-8: Internal Memory Extension Device Architecture (two HT link interfaces and multiple DDR DRAM controllers with a probe controller, all connected through the NoC, together with a cache coherent DMA engine for memcpy offloading)

4.3.1 HyperTransport Domain

The memory extension device is directly connected to the host processor via the cache coherent HyperTransport protocol. HyperTransport represents an excellent solution as it provides very low latency, a very high bandwidth and cache coherency. To leverage the HyperTransport technology, our memory extension technique utilizes the HyperTransport 3 core which we proposed in [78]. The core is fully compliant to the HT 3.1 link specification and supports 16 bit links with up to 5.2 Gbit/s per lane. This enables a maximum theoretical bidirectional bandwidth of 20.4 GByte/s which exceeds the performance of a DDR3 memory interface. The HyperTransport core supports different traffic classes which are decoupled from each other by the use of virtual channels. By separating read, write, response and cache coherent transactions from each other, head of line blocking can be minimized and deadlocks can be avoided. This approach, furthermore, supports split-phase read transactions and, thereby, multiple outstanding transactions, which is particularly beneficial for a high performance memory controller. Besides providing virtual channel multiplexing and demultiplexing, the HyperTransport core takes care of data packetization, serialization, flow control, buffer management and the complete initialization sequence. To increase the reliability of the system, the core supports cyclic redundancy check (CRC) protection and a retry mechanism which allows recovering from link errors. Each individual packet is appended with a 32 bit CRC by the transmitter. In addition, the packet is stored into a retransmission buffer. Packets reside in the retransmission buffer until they are positively acknowledged. In the case of a link error, which can be detected by the receiver by recalculating the CRC, a negative acknowledge is sent to the transmitter which then forces a re-initialization of the link and the packet retransmission. As the core uses a 128 bit wide internal datapath, the CRC calculation is a particularly challenging task. To compute the CRC, the complete 128 bit datapath needs to be combined using XOR operations. Furthermore, as the computation is cyclic, the computed CRC from the previous clock cycle needs to be XOR'ed with the data stream. This prevents pipelining of the algorithm, as the result of one stage is required in the next clock cycle. To address this problem we have recomputed the mathematical expressions for calculating the CRC by parallelizing them into multiple equations [78]. Furthermore, we extracted the part of the computation that does not exhibit cyclic dependencies, which we could then process using multiple pipeline stages. The combination of both techniques enabled timing closure of the CRC modules even at high operating frequencies. Due to the wide datapath of 128 bit, protocol decoding within the core is a complex task. As we show in [78], the decoding complexity increases exponentially with the internal data width of the core. However, to support the high data rates of HyperTransport, wide data busses are a necessity. To support the highest data rate of HyperTransport (HT3200) with a 64 bit datapath, an internal operating frequency of 1600 MHz would be required. As such high frequencies are infeasible in standard cell ASIC technologies, it was decided to implement a 128 bit datapath. The main challenge of this architecture is that the core needs to be capable of decoding multiple packets during a single clock cycle. As the smallest packet in HyperTransport is 64 bit, it is possible to receive segments of three different packets during a single clock cycle which need to be decoded, CRC-checked and processed in parallel. This increases the parallelism of the design as well as the depth of the pipeline. As a result, the core implements a 3-way receive pipeline and only shares the most resource intensive components like the buffers which are used to decouple the threads. The transmit pipeline is also capable of processing multiple packet streams in parallel before multiplexing them onto the link.
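As a functional reference for the parallelized hardware equations, the CRC can be expressed bit-serially; the hardware unrolls the same recurrence over a full 128 bit word per clock cycle. The sketch below assumes the widely used 32 bit polynomial 0x04C11DB7; the exact polynomial, seed value and bit ordering are defined by the HyperTransport specification and are not reproduced here:

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-serial CRC-32 reference model (polynomial 0x04C11DB7, MSB first). */
    uint32_t crc32_serial(const uint8_t *data, size_t len, uint32_t crc)
    {
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint32_t)data[i] << 24;
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
        }
        return crc;
    }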


The queue interface of the core is connected to the on-chip network to enable packet distribution to the memory subsystem. The NoC has been created with the HTAX framework described in Chapter 3. This approach enables flexible and congestion free communication between all components of the design. In addition, it provides the capability of scaling the HyperTransport link interfaces within the chip. By utilizing two interfaces as shown in Figure 4-8, daisy chained configurations that include multiple memory extensions can be realized. By further increasing the number of links, other direct topologies including trees and meshes can be supported. The architecture is truly scalable, as in a daisy chain or tree topology there exists no limit for the memory capacity of the system. The only limitation is caused by the restricted size of the physical address space.

4.3.2 Memory Controller

The memory controller is connected to the on-chip network and processes all incoming load and store transactions. It furthermore implements the cache coherency protocol and, hence, is able to issue probe packets targeting the host processors within the system. After processing the cache coherency protocol, the read and write transactions are forwarded to the DRAM controller. The DRAM controller, which in our case controls DDR2 SODIMM memory banks, then actually writes or reads the data to or from the DIMM. In the following, we will discuss the functionality of the memory controller, which implements the sub-blocks shown in Figure 4-9.

Figure 4-9: Memory Controller Sub-blocks (the command handler receives packets from the HTAX and interacts with the lock controller and its tag and address CAMs; probe dispatch and response dispatch units implement the coherency protocol in front of the DRAM controller and the DRAM)


The Command Handler block receives the HyperTransport packets and decodes the command header of each packet. Depending on the type and destination, the packet is then forwarded to the corresponding unit within the memory controller to be processed. The command handler block, furthermore, communicates with the lock controller to guarantee consistency of all operations. In addition, it also enforces a throttling mechanism necessary to compensate the latency of the DDR2 SODIMM. In general, the memory controller processes the commands much faster than the DRAM is capable of delivering the data. A possible future implementation, therefore, may include a larger cache which can buffer read and write requests.

The Lock Controller block keeps track of all the transactions in flight to the memory controller block at any point in time using content addressable memories (CAM). Since HyperTransport implements a split transaction protocol each transaction is identified by a SrcTag and the originator of the transaction. The Lock Controller block uses CAM modules to keep track of outstanding transactions based on various stages of the transaction. The first CAM is used to keep track of all the transactions in flight based on the SrcTag and the source of the transaction. The second CAM is used to keep track of the state of the cache lines based on the address of the transaction. Upon receipt of a new command by the command handler module, a new entry is added to both CAM modules. The entries in the CAMs are updated based on the progress of the transaction and finally overwritten when the transaction has been completed.
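The role of the two CAMs can be illustrated with a small software model: one table is looked up by the (source, SrcTag) pair of an in-flight transaction, the other by the cache line address. The structure below is a simplified sketch, not the RTL implementation, and the entry count and state encoding are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_INFLIGHT 32

    struct tag_entry  { bool valid; uint8_t src_node; uint8_t src_tag; };  /* CAM 1 */
    struct line_entry { bool valid; uint64_t address; uint8_t state;   };  /* CAM 2 */

    static struct tag_entry  tag_cam[MAX_INFLIGHT];
    static struct line_entry line_cam[MAX_INFLIGHT];

    /* On receipt of a new command, allocate an entry in both CAMs. */
    int lock_ctrl_allocate(uint8_t src_node, uint8_t src_tag, uint64_t address)
    {
        for (int i = 0; i < MAX_INFLIGHT; i++)
            if (!tag_cam[i].valid) {
                tag_cam[i]  = (struct tag_entry){ true, src_node, src_tag };
                line_cam[i] = (struct line_entry){ true, address, 0 /* pending */ };
                return i;
            }
        return -1;  /* all entries in use: the command handler has to throttle */
    }

    /* Once all responses have arrived, the entry is released and may be reused. */
    void lock_ctrl_complete(uint8_t src_node, uint8_t src_tag)
    {
        for (int i = 0; i < MAX_INFLIGHT; i++)
            if (tag_cam[i].valid && tag_cam[i].src_node == src_node &&
                tag_cam[i].src_tag == src_tag) {
                tag_cam[i].valid  = false;
                line_cam[i].valid = false;
                return;
            }
    }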

The Probe Dispatcher block is responsible for issuing cache snoop requests to all processors in the system. Snoop requests are required to implement the cache coherency protocol defined by the Opteron processors. For example, if one processor wants to load a cache line, all other processors need to be consulted whether they are sharing that line. This is considered a vital process in a cache coherency system because the data that is stored in the memory has to be up to date. Upon receipt of a snoop request the processors accumulate the responses from their respective CPU caches (there exist L1 and L2 caches for each core within the processor) and send one final response back to the memory controller module which either contains a current snapshot of that specific cache line or a response indicating that the cache line does not exist in their caches.


The Response Controller block handles the HyperTransport response packets which are received by the MCT in response to snoop requests. In the case of a cache line update the data is extracted from the packet and forwarded to memory. In all cases, upon reception of the response packets, the Lock Controller block updates the CAM modules with the correct information.

The final stage in the memory controller module consists of the Memory Module block which interfaces the data from all the cache coherent transactions to the DDR2 SODIMM. This module is also responsible for synchronization of all the data from the application interface clock domain to the DDR2 SODIMM clock domain.

4.3.3 Prototype Implementation

Our proposed memory extension architecture was implemented in the form of an FPGA based prototype. We have developed register transfer level (RTL) code for each module and synthesized it using the Xilinx FPGA design flow. As a hardware platform, we designed a PCB that features state of the art FPGA technology to accommodate our memory extension device. Although such a development is very time consuming, it enables realistic measurements in an actual system, which provides a much better performance evaluation than simulation-only approaches. To deploy our design, it is plugged into a HyperTransport Extension (HTX) slot of a commodity-off-the-shelf server. This provides the possibility to directly connect our device to the Opteron processor and to participate in the cache coherent fabric. As memory extension devices are not supported by standard firmware and operating systems, we have developed our own BIOS and kernel level software to support the Linux operating system.

4.3.4 RTL Design

Each module, including the HyperTransport core, NoC, MCT, DMA controller and caching engine, has been implemented in the hardware description language Verilog. Every module is written in synthesizable RTL code to maintain technology independence. This not only allows us to test the design on an FPGA but also facilitates porting the
design to an ASIC technology. While this approach is feasible for most components, there exist some technology dependent modules which need to be developed for the individual platforms. In fact, one of the most challenging tasks has been the implementation of the FPGA specific physical layer interface (PHY) block within the HyperTransport link transceivers. We have proposed a solution for multi-mode PHYs in [100] which was successfully applied in this design.

Due to the complexity of our implementation, a comprehensive verification environment has been developed, allowing the design to be tested in a simulator before loading it onto the FPGA. Simulations greatly facilitate debugging and the bringup of the system, although they only cover a certain set of errors due to their abstract nature. We developed a Cadence Open Verification Methodology based verification suite that is capable of issuing arbitrary HyperTransport packets while checking the behavior of the core. Any action that is non-compliant to the specification is detected and reported. As our testsuite generates constrained random test patterns, a verification coverage analysis has been conducted that guarantees that all important aspects and use-cases are effectively covered. Finally, the design has been transformed into a gate level netlist using the Xilinx ISE FPGA design flow. Our design requires more than 200,000 lookup tables (LUTs) and is capable of operating at a clock frequency of up to 225 MHz, whereas the external HyperTransport interface supports serial 8 bit links at 4 Gbit/s per lane, resulting in a total bandwidth of up to 4 GByte/s.


4.3.5 HTX Board

To evaluate our architecture, the HTX3 board was deployed, which we propose in [101]. The card, as shown in Figure 4-10, employs three Xilinx Virtex5 FPGAs which provide a combined capacity of up to 500k LUTs. The board, furthermore, contains an SODIMM socket for DDR2 memory, USB and various expansion connectors. The host processor is interfaced through a HyperTransport 3 Extension interface which supports link frequencies of up to 4 Gbit/s. To realize the HyperTransport interface, it is required to utilize the FPGA's serial transceivers (GTX), which are the only I/O components that operate at such high frequencies. However, as the GTX components do not natively support the HyperTransport standard, it was required to develop a custom Physical Layer Interface (PHY) block. In [100] we have proposed a HyperTransport 3 PHY for FPGAs which implements the electrical and low level protocol layers of HyperTransport. Due to the complexity of the specification, this represents a challenging task. In particular, the PHY needs to operate over a very large spectrum of line rates, comprising low speed operation at 200 MHz as well as high speed modes of several Gigahertz. To further complicate the implementation, HyperTransport defines different data sampling mechanisms. While the low speed modes utilize a source synchronous clock to sample the incoming data lanes, the high speed modes utilize a clock data recovery mechanism to extract a sampling clock from each individual lane. As HyperTransport systems are initialized in the low speed mode and then change to higher frequencies at runtime, a dynamic, partial reconfiguration of the GTX blocks is required. This functionality is realized through a dynamic reconfiguration state machine within the PHY. Further components include an oversampling mechanism to sample the low frequency signals, a bitslip logic to align to byte boundaries within the stream, and a channel bonding block which aligns the different lanes.


Figure 4-10: FPGA based Prototype

4.3.6 BIOS Firmware

To utilize our memory extension device in a compute system, the BIOS firmware has to be modified. Although this is a complex task, it represents the only way to enable transparent access to our memory extension device by software. The benefit is a seamless integration without any modifications to the operating system, middleware or application software. The challenge is that our solution represents a new device class that is not supported by standard BIOS firmware. Our device supports a subset of the features
provided by a processor, such as the memory controller; however, it is not able to execute code, nor does it implement caches. The particular tasks that need to be solved are described in the following.

• While performing cache coherent device enumeration, the device needs to be marked as a special memory extension device.
• The DRAM initialization needs to be performed and a specific memory range needs to be mapped to the device.
• The routing and memory address range tables on all coherent devices (in a multiprocessor system) need to be modified such that the physical memory addresses used by all processors point to the extension device correctly.
• The resource allocation tables which represent the interface between the firmware and the operating system need to be modified. These tables describe the architecture of the memory subsystem, in particular the size, the location and the caching attributes of specific memory ranges.
• The device can be configured for asymmetric probing by disabling cache coherent probes targeting the memory extension device to increase the performance.

These fundamental modifications of the BIOS require access to the firmware code. The Coreboot project, previously known as LinuxBIOS, provides open source firmware code for a large number of platforms, including the Tyan S2912E motherboard which has been employed for our prototype system. We have successfully added support for our memory extension device to the coreboot firmware, which enables any software application to access the memory extension as regular memory without any modifications. Full cache coherency is supported to enable multi-threaded applications to share the memory extension.
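To give a flavor of the address mapping step, the sketch below computes the kind of DRAM base/limit register pair that routes a physical address window to the extension node. The register offsets and bit positions are only illustrative of the AMD node address map documented in the Family 10h BKDG; the actual coreboot changes additionally cover device enumeration, DRAM initialization, the routing tables and the resource maps listed above.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int entry = 1, dst_node = 1;         /* hypothetical: extension behind map entry 1 */
        uint64_t base  = 0x100000000ULL;     /* window starts at 4 GB ...                  */
        uint64_t limit = 0x2FFFFFFFFULL;     /* ... and ends at 12 GB (8 GB of extension)  */

        /* Bits [31:16] of each register hold address bits [39:24]; the base register
         * carries the read/write enable bits, the limit register the destination node. */
        uint32_t base_reg  = (uint32_t)(((base  >> 24) << 16) | 0x3);
        uint32_t limit_reg = (uint32_t)(((limit >> 24) << 16) | (uint32_t)(dst_node & 0x7));

        printf("F1x%02X (DRAM Base %d)  = 0x%08X\n", 0x40 + 8 * entry, entry, base_reg);
        printf("F1x%02X (DRAM Limit %d) = 0x%08X\n", 0x44 + 8 * entry, entry, limit_reg);
        return 0;
    }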


4.4 EVALUATION

4.4.1 Evaluation Methodology

To evaluate our architecture we have opted for a real world prototype implementation instead of a simulation based approach. Although this approach requires greater effort to obtain first evaluation results, our experience shows that simulations do not provide sufficiently meaningful results. Most simulators do not model the real system accurately enough to deliver comparable results, especially for highly complex systems. To evaluate our proposed approach, it is required to embrace the whole system architecture including multiprocessor and multicore configurations, the memory subsystem, BIOS firmware, Operating System, middleware and software applications, preferably in a cycle accurate fashion. Such requirements are only met by a full-fledged prototype implementation. Furthermore, it was not obvious that our architecture could be realized in available commodity-off-the-shelf server systems without major hardware and software modifications. However, our successful bringup of the FPGA based prototype in a real system, including software integration, proves the feasibility of our design. In the following, we present different microbenchmarks to evaluate our architecture. We, furthermore, evaluate our asymmetric probing and DMA offloading techniques and analyze the architecture in terms of pins, performance and power.

4.4.2 Memory Extension Benchmark

Stream benchmark performance results of the presented memory extension architecture were obtained. The measurements were carried out on a Tyan S2912e based system supporting two 2.8 GHz Opteron Shanghai processors with a single DDR2 667 RAM module (KVR667D2D8P5K2/4G) connected to each node. The system runs a Linux 2.6.31 kernel with NUMA support enabled. System memory available to Linux was limited to 1 GB in order to allow direct DRAM access via the device file system (Linux allows the main and extension memory to be mapped as a memory device into the device tree). In order to compare accesses of a CPU to local DRAM, remote DRAM or DRAM provided by the prototype design, the BIOS firmware was modified as described above.
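Direct access of this kind amounts to mapping the physical window of the extension memory into user space through /dev/mem; the base address and size below are placeholders for the range configured by our modified BIOS, and root privileges are required:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define EXT_PHYS_BASE 0x100000000ULL   /* hypothetical physical base of the extension */
    #define EXT_SIZE      (256UL << 20)    /* map the first 256 MB of it                  */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        void *ext = mmap(NULL, EXT_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, EXT_PHYS_BASE);
        if (ext == MAP_FAILED) { perror("mmap"); return 1; }

        memset(ext, 0xA5, EXT_SIZE);       /* plain loads and stores hit the extension */
        printf("first byte: 0x%02x\n", ((unsigned char *)ext)[0]);

        munmap(ext, EXT_SIZE);
        close(fd);
        return 0;
    }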


Via numactl, the benchmarking process was bound to the first node (node0) in the system. The HTX link is connected to the second node (node1) of the system, resulting in a two hop distance to the prototype design. When binding the benchmarking process to the second node, the latency, therefore, decreases by approximately 40 ns. This shows the effect of the NUMA architecture on the Stream benchmark. The results of the benchmarking process are presented in Table 4-1 below.

Table 4-1: Stream Benchmark Results

Configuration            Bandwidth    Latency
DRAM node0               4935 MB/s    80 ns
DRAM node0 with FPGA     4935 MB/s    80 ns
DRAM node1               3708 MB/s    140 ns
DRAM node1 with FPGA     3708 MB/s    140 ns
FPGA Memory Extension    384 MB/s     866 ns

The first observation of the results is that the Stream performance is not degraded due to the slower FPGA based cache coherent device in the system. The architecture successfully avoids the cache coherency overhead by applying the asymmetric probing technique. This attribute is vital for scalable coherent memory extensions, as it allows the number of cache coherent memory extension devices in the system to be scaled without affecting performance negatively. The second observation can be made regarding the bandwidth and latency performance numbers of the FPGA prototype. They are perceptibly lower than those of the local DRAM, which can be explained by the significantly lower clock frequency of the FPGA prototype. However, regarding the fact that the MCT of the prototype is clocked at approximately a tenth of the speed of the Opteron controllers, the results are very promising. To the best of our knowledge, the full round trip latency of a read access of 866 ns already provides a significant improvement over all other available memory extension approaches. Texas Memory Systems states a memory access latency of 15 microseconds [124], while swapping to remote memory over InfiniBand consumes several microseconds [93] for minimal-sized data values. The latency advantage of our solution is crucial to support large cache coherent shared memory effectively. A possible ASIC implementation and higher HyperTransport link frequencies would allow our architecture to provide a DRAM performance that is comparable to the Opteron's.


4.4.3 DMA Offloading Engine

We have conducted similar measurements of the DMA offloading engine. The benchmark utilized the same Tyan based server system together with the FPGA based memcpy accelerator. In contrast to the previous measurements, which were initiated by the processor, we now present DMA measurements conducted by the FPGA with no processor interaction. In this case, our architecture provides an even lower latency of 688 ns for the full round-trip read access. Furthermore, the bandwidth is improved to up to 1600 MB/s for write accesses. The results for different transfer sizes are shown in Figure 4-11. As these are DMA transactions, no CPU load is involved. Our approach noticeably relieves the CPU while maintaining high memory-to-memory copy performance.

Figure 4-11: DMA Bandwidth Performance for Different Transfer Sizes (DMA read and write bandwidth in MB/s for transfer sizes from 1 byte to 32 MB)


4.4.4 Architecture Analysis

We have analyzed our architecture in terms of resources, memory capacity and power consumption. As already discussed, the package pin limitation of future processors will limit the scalability of memory bandwidth. To support two DIMM channels, the Opteron requires over 160 signal pins, while our memory extension device only requires a 44 pin HyperTransport link (power and ground pins are not included for both approaches). Table 4-2 provides a comparison of different system configurations with 8 GByte memory extension devices. It shows that attaching memory using our proposed low pin count serial interface significantly improves the capacity per pin ratio. Needless to say, memory access latency increases, especially for the daisy-chained configurations.

Table 4-2: Memory Capacity per pin Ratio

Configuration                         Capacity    Pin count    Ratio
Single Opteron                        8 GByte     160 pins     50 MB/pin
1 Memory Extension                    8 GByte     44 pins      181 MB/pin
2 Memory Extensions                   16 GByte    88 pins      181 MB/pin
3 Memory Extensions, daisy chained    24 GByte    44 pins      545 MB/pin

Our prototype design has furthermore been analyzed in terms of power consumption. While an AMD Opteron node typically consumes between 65 and 100 Watt, our solution requires a moderate 11.1 Watts. This value has been reported by the Xilinx Power Estimator, whose results match real world measurements within 5%. While it is evident that a full fledged processor has a higher power consumption than a memory extension controller, a processor represents a very power inefficient device for extending memory capacity. For example, a system that employs three memory extension devices offers the same memory capacity as a four-way (65 Watt) AMD Opteron system while reducing power consumption from 260 to 110 Watts. Power constrained, memory bound applications, therefore, can greatly benefit from our proposed memory extension technique.


4.5 CONCLUSION

In this chapter, we have proposed a novel cache coherent memory extension device to build large shared memory AMD Opteron based SMP systems. Due to package pin limitations and DRAM scalability issues, we predict that the memory capacity and bandwidth will not be able to scale at the same speed as processor technology. This will result in a severe bottleneck for future computing systems. Our architecture addresses this issue by introducing a scalable approach that can extend the memory capacity of a system by terabytes of memory using the low pin count and cache coherent HyperTransport host CPU interface. Furthermore, we introduce a technique to minimize the overhead of the cache coherency protocol and a DMA offloading technique that enables execution of memory-to-memory copy transactions with zero CPU overhead. We have implemented our complete architecture in real hardware using a specifically developed FPGA based prototype. The memory extension device has been successfully employed in a commodity-off-the-shelf server system including full software and operating system integration. For the evaluation of our architecture we present promising micro benchmark results that show sub-microsecond memory access latencies, outperforming comparable approaches.

The main goal of our proposed memory extension device is to improve the compute to memory ratio. Due to the low cost of off-the-shelf processors, it has been proposed to deploy additional CPUs that are solely used as additional memory controllers. While this may be reasonable for existing systems due to the lack of commercially available memory extension devices, it represents an inefficient solution. The amount of silicon that is required to manufacture a memory extension device as proposed in this work is significantly smaller than that of a processor. A high-volume ASIC production of the memory extension should, hence, provide significant cost advantages as well as reduced power consumption over a processor based solution. Furthermore, in contrast to the SMP approach, our solution is scalable. We can support a large number of memory extension devices without increasing the cache coherency overhead, due to the asymmetric probing technique. Finally, our solution can provide higher performance by offloading tasks to the memory extension device.


While we have presented a novel, unique solution to the memory capacity problem, there still exist a number of challenges which we have not addressed in this work. Most significantly, additional efforts need to be invested in benchmarking more meaningful applications. As we are relying on relatively slow FPGA based hardware, results will not reflect the performance of an ASIC implementation; however, applications like in-memory databases that are performance limited by the page swapping overhead should experience a visible improvement. To transform the technique into a working product, an ASIC implementation is required.

Another observation is that there still exist a number of techniques which can further increase the performance of our architecture, including optimizations of the DMA engine like pre-fetching. Furthermore, the memory extension approach enables the memory capacity to be increased through compression and data de-duplication mechanisms [60]. By compacting data and by avoiding redundant data copies within the memory, the capacity can be virtually increased. In addition, our approach could easily be modified to support other types of memory like flash or phase change memory. This feature would enable designs that support large amounts of non-volatile memory combined with fast accessible DRAM. For all such techniques, our design represents an excellent base architecture.


CHAPTER 5 CONCLUSION

5.1 SUMMARY

Extending the scalability of future computer architectures by increasing the number of functional units appears to be the only method that allows us to maintain the increase in compute performance we have enjoyed in the past. There exist no other approaches besides the proposed massively parallel architectures that are capable of transforming the increasing number of transistors provided through Moore's Law into higher performance. Future Exascale computer systems need to execute on the order of a billion threads, which challenges the scalability of current systems at all layers. It is evident that the number of functional units needs to increase significantly on the chip as well as on the compute node and on the network level. However, a naïve, manifold duplication of existing components will not lead to the desired result due to the inability of current architectures to scale. The main reason is the limited performance of the communication interfaces between the components, which prevents them from delivering their full potential. In most architectures the communication overhead is too large, rendering the potential performance advantage of parallelizing the problem irrelevant. To address this problem, we have proposed three techniques in this dissertation that improve the communication infrastructure and, therefore, the scalability of current architectures. The proposed mechanisms approach the problem at three different layers of the computer architecture.

The first technique addressed the shortcomings of the communication subsystem between the nodes of a cluster system. Traditional systems are interconnected using loosely coupled interconnects like Ethernet and InfiniBand, which allow a large number of nodes to be connected but only provide moderate communication performance. In particular, the high latency makes fine grain communication very costly, which limits the speedup that applications can obtain through parallelization and renders it impossible to exploit the available compute power of large clusters. To address this issue, we introduced the TCCluster approach, which virtually integrates the network interface into the processor by reusing available interconnect structures. This enables a tightly coupled cluster that can exploit much finer grained parallelism, leading to much better overall scalability. The approach, which was implemented in an AMD Opteron based system, shows an order of magnitude improvement in latency and bandwidth for specific applications over


comparable approaches. As our technique does not require additional hardware for the network interconnect, it also represents a very dense and power efficient solution. The further contributions of the TCCluster chapter were manifold, including a comprehensive summary of interval based routing techniques that can be applied in this context as well as a novel class of network topologies developed specifically for this technique. Furthermore, a detailed analysis of the AMD Opteron architecture was provided, which exposes its bottlenecks and presents solutions to cope with them. The TCCluster approach introduces a disruptive class of highly scalable, tightly coupled systems with the potential of outperforming established cluster based systems considerably.
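The programming style this enables is essentially remote store programming: once the firmware has mapped a window of a remote node's DRAM into the local physical address space, sending a message reduces to ordinary store instructions. The following user space sketch (in C) illustrates the idea; the device path, mapping layout and completion flag are hypothetical assumptions and do not reflect the actual TCCluster driver interface.

/* Hypothetical remote-store sketch: node 1's exported memory window is
 * memory mapped and a message is delivered with plain stores.          */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSG_BYTES 64

int send_remote(const char *msg)
{
    int fd = open("/dev/tccluster_node1", O_RDWR);   /* hypothetical device */
    if (fd < 0)
        return -1;

    /* Map the remote memory window exposed by the firmware. */
    volatile uint8_t *win = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy((void *)win, msg, MSG_BYTES);  /* payload: plain stores to remote DRAM */
    __sync_synchronize();                 /* order the payload before the flag    */
    win[MSG_BYTES] = 1;                   /* "message ready" flag polled remotely */

    munmap((void *)win, 4096);
    close(fd);
    return 0;
}

The receiver simply polls the flag in its own local memory, which is what makes software-to-software latencies of a few hundred nanoseconds possible.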

In the third chapter we introduced a novel framework for on-chip networks that represents a scalable alternative for interconnecting the functional units within a microchip. While the transistor count and, thereby, the number of functional units within integrated silicon devices constantly increases, the architecture of the on-chip communication infrastructure becomes increasingly relevant. Apart from providing high throughput and low latency, the network needs to be scalable to keep up with the ever increasing number of interconnected components in future architectures. We presented different techniques to achieve this goal, including different topologies and multistage distributed architectures. Another increasingly important requirement in chip design is the minimization of development and verification efforts to reduce the time to market. More complex designs, in general, lead to a greater design space and, thereby, increase the development and verification effort of a new architecture. Our framework introduced a novel component based design methodology which enables flexible designs that can be generated automatically. Our engine generates efficient, synthesizable hardware structures that can be directly integrated into chip designs, which significantly accelerates the design process. It furthermore enables a rapid prototyping design methodology, as different design aspects can be modified simply by regenerating the design with different parameters. In conclusion, the framework provides a very effective way to build efficient, scalable networks-on-chip for future, highly integrated circuits.
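To give an impression of this parameter driven flow, the sketch below (in C) drives a hypothetical generator front end from a small configuration structure; changing one field and regenerating replaces manual RTL edits. The structure fields, function name and output file name are illustrative assumptions and do not correspond to the actual HTAX tool interface.

#include <stdio.h>

/* Hypothetical parameter set for one generated switch instance. */
struct noc_config {
    int ports;             /* number of attached functional units        */
    int virtual_channels;  /* virtual channels per port                  */
    int data_width;        /* flit width in bits                         */
    int stages;            /* 1 = single stage crossbar, >1 = multistage */
};

/* Stand-in for the generator back end: it only reports the chosen
 * parameters here; a real tool would emit synthesizable RTL for them. */
static void generate_noc_rtl(const struct noc_config *cfg, const char *out)
{
    printf("generating %s: %d ports, %d VCs, %d-bit flits, %d stage(s)\n",
           out, cfg->ports, cfg->virtual_channels, cfg->data_width, cfg->stages);
}

int main(void)
{
    struct noc_config cfg = { .ports = 8, .virtual_channels = 4,
                              .data_width = 64, .stages = 2 };
    generate_noc_rtl(&cfg, "htax_switch.v");
    return 0;
}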


The third technique addresses the memory bandwidth and capacity problem that processor architectures are currently facing. As the number of components on a microprocessor increases, more memory bandwidth and capacity are required. However, since the number of I/O pins on next generation chip packages is predicted not to scale at the same rate as the silicon transistors, it will become increasingly difficult to satisfy the memory demands of future devices. The processor-memory interface may evolve into a critical bottleneck limiting compute performance. We addressed this challenge by introducing a scalable, cache coherent memory extension controller connected to the processor via a high-speed, low pin count interface, thereby bypassing the pin limitation of future microprocessors. By utilizing SERDES technology, additional silicon area is invested to increase the per-pin bandwidth. This approach is beneficial, as the number of transistors increases much faster than the number of package pins. By deploying our technique the number of memory interfaces can be increased, creating a new class of NUMA memory topologies. As our architecture provides the capability of relaying memory transactions to further memory extension devices, different topologies consisting of a large number of memory extensions can be realized. The approach makes it possible to add memory extension devices without increasing the number of processor I/O pins, representing a scalable solution even for future processor architectures. This in turn results in a significantly increased memory capacity as well as a much better capacity-to-power ratio. We presented additional contributions in the form of a DMA offloading technique that allows large memory copy transfers to be handled by the memory controller itself, freeing up a significant amount of compute resources. The approach, which has been implemented in a real system using FPGAs and AMD Opteron processors, demonstrates a feasible solution for the memory capacity and bandwidth problem of current and future microprocessors.
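Because the extension memory is fully cache coherent and appears to the operating system as an additional NUMA node, applications can place data on it with standard shared memory and NUMA interfaces; no special programming model is required. The minimal sketch below (in C, using libnuma) illustrates this; the node id 1 is an assumption for a system with a single extension device, and nothing in the code is specific to our driver.

/* Place a 1 GiB working set on the (assumed) extension NUMA node and use it
 * with ordinary loads and stores; coherency is handled by the hardware.
 * Build with: gcc -o ext ext.c -lnuma                                       */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available\n");
        return 1;
    }

    size_t len = 1UL << 30;                 /* 1 GiB                         */
    char *buf  = numa_alloc_onnode(len, 1); /* node 1 = assumed extension    */
    if (!buf)
        return 1;

    memset(buf, 0, len);                    /* plain stores, no special API  */

    numa_free(buf, len);
    return 0;
}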


5.2 LOOKING BACK

In this section I look back at my doctoral work and discuss the impact of my dissertation on the field of computer architecture. The most cited papers are VELO [96], which received the best paper award at the 37th International Conference on Parallel Processing (ICPP 2008), and the HTX-Board publications [52]; these, accordingly, had the biggest impact on the research community. The novel communication mechanism we proposed in the VELO paper has influenced research in the area of distributed shared memory and has been applied successfully in such projects. The success of the HTX-Board paper originates from the fact that the proposed rapid prototyping platform has been adopted by several research groups working on communication and computation acceleration. Another notable contribution is the paper “A HyperTransport 3 Physical Layer Interface for FPGAs” [100], which received the best paper award at the 5th International Workshop on Applied Reconfigurable Computing (ARC 2009).

Apart from impacting the research community, the HTX-Board as well as the HyperTransport 3 verification platform [101] have been applied in joint projects with leading companies including AMD, SUN Microsystems and IBM. Within the scope of these projects my work had a significant impact on the products and research activities carried out by these three companies. AMD utilized the HyperTransport 3 verification platform [101] in combination with our proposed HyperTransport 3 core [78] for internal debugging, qualification and verification of Opteron processors. Our FPGA platform provided a flexible environment for testing AMD processors and third party hardware in a real world system. In particular, the HyperTransport interfaces of the AMD Magny Cours 12-core processor released in 2010 were tested and evaluated using our verification platform. We thereby actively contributed to the success of one of the most complex x86 processors ever made.

SUN Microsystems has applied our technology in the Scalable Interface project, which aims at creating large, high performance computing clusters based on InfiniBand. In the course of our collaboration, a 65 nm ASIC was developed that bridges cache coherent HyperTransport to InfiniBand. We actively contributed the code of the


HyperTransport core and developed a sophisticated verification environment for this effort. The joint research led to significant improvements in the cache coherent domain of the design, which resulted in a performance increase for the entire architecture. IBM has utilized our technology in research projects focused on cache coherent memory controllers.


5.3 LOOKING FORWARD

The techniques presented in this dissertation influence the research field of computer architecture by introducing a new set of scalable high performance computer systems. In this final section I envision an architecture that combines the three proposed techniques into a single integrated system offering the scalability required by future Exascale computing systems. I will further point out which challenges need to be addressed and which tasks need to be solved to enable such systems. All three principles introduced in this work focus on improving the communication infrastructure between components at different layers of a computer system. Furthermore, two of the proposed techniques leverage the existing HyperTransport communication interface for novel purposes. In TCCluster, the host interface is exploited as a cluster interconnect, while in the memory extension part of this work the same interface is utilized to access a larger memory capacity at higher bandwidth. By combining these techniques we introduce the concept of interface consolidation.

The principal idea behind this concept is that the main components of a system should be logically separated from each other and communicate via a single, unified interface. Similar to the HTAX methodology we propose for on-chip architectures, this principle introduces a component based design methodology at the system level. By deploying a unified interface for chip-to-chip, node-to-node and processor-to-memory transactions, off-the-shelf components can be combined in different configurations to optimize the design for a specific purpose. For example, memory bound applications can utilize systems that deploy a large number of memory controllers, while embarrassingly parallel applications can leverage designs that implement many interconnected processor cores. Applications that rely on high single thread performance, on the other hand, can utilize symmetric multiprocessor systems consisting of a limited number of processors connected via a cache coherent interconnect like HyperTransport. This flexible approach optimizes both cost and power efficiency for a large set of applications. Current x86 architectures lack this flexibility as they follow an all-in-one approach. To minimize development costs, they are required to cover a large application space, which leads to suboptimal efficiency. The solution to this problem is to apply a common unified


interface between all components, which allows them to be interchanged arbitrarily and increases the flexibility of the system. Instead of developing optimized architectures for particular applications, the component based approach allows efficiency to be increased simply by choosing the right combination of basic components.

The flexibility of the architecture can be further increased by applying the same protocol not only for chip-to-chip but also for on-chip communication. The EXTOLL application we describe as the use case for HTAX utilizes the HTOC protocol, a one-to-one mapping of the external HyperTransport protocol, for communication between on-chip components. Deploying this concept, components can be arranged flexibly at the system level as well as at the chip level.

The design modifications of current architectures required to realize such a flexible architecture are moderate. As shown in this dissertation, the HyperTransport interface provides the necessary flexibility to be utilized as a cache coherent SMP interface, as a memory extension interface and as a cluster interconnect. The current AMD Opteron architecture we analyzed for our work shows limitations regarding the supported topologies and lacks many features that are common in interconnection networks. More flexible routing mechanisms, congestion management techniques and an increased number of addressable endpoints represent necessary features for increasing the scalability. Nevertheless, these techniques are well understood and could be incorporated into current architectures with moderate effort. While such extensions improve scalability and will be required for Exascale systems, the component based approach presented herein shows that current architectures can already be utilized to build large systems following the interface consolidation concept. Figure 5-1 shows such a system. It combines the ideas of this work and represents a practical implementation that can be deployed using existing components. The center of the block diagram shows the compute blades, each of which deploys two AMD Opteron processor modules. The compute blades are interconnected via the TCCluster interconnect and are attached to memory extension blades on the right. The left side shows the EXTOLL interconnection network, which is linked to the blades using the non-coherent HyperTransport interface. The advantages of this architecture over traditional systems are the increased interconnect performance


provided by the TCCluster interconnect, the increased memory capacity offered by memory extension devices and the adaptability to specific workloads and applications.

Figure 5-1: Component Based System Design (block diagram: compute blades of AMD Opteron CPUs interconnected by ccHT and TCCluster links, DRAM/DDR3 memory extension (EXT) devices on one side, and the EXTOLL fabric attached via non-coherent HyperTransport (ncHT) on the other)

The restricted physical address space of the AMD Opteron architecture limits the scalability of a TCCluster system to medium sized systems of up to 1K processor nodes. Using 12-core Magny Cours Opteron chips, such a system provides 12K cores. To enable massively parallel systems of a much larger scope, the proposed solution deploys the scalable EXTOLL interconnect, which supports systems of up to 64K compute nodes. Utilizing this technique, more than half a million cores can be interconnected in a high performance compute cluster. As the EXTOLL interconnect uses non-coherent HyperTransport to interface with the processors, it blends nicely into the interface consolidation approach. Employing our component based approach enables a whole new set of system architectures of different sizes, with variable processor-memory configurations and different topologies, simply by adjusting the layout of the TCCluster and EXTOLL links. We believe that this flexibility makes our architecture an excellent candidate for future Exascale systems.
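For reference, the quoted core counts follow directly from assuming one 12-core processor per node:

\[ 2^{10} \times 12 = 12{,}288 \approx 12\text{K cores}, \qquad 2^{16} \times 12 = 786{,}432 > 500{,}000 \text{ cores.} \]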


List of Figures

Figure 1-1: Amdahl's Law limits the parallel speedup ...... 5
Figure 1-2: Layers within Scalable Computer Systems ...... 8
Figure 1-3: AMD Opteron Processor Microarchitecture ...... 9
Figure 1-4: Node Architecture ...... 10
Figure 1-5: Network Architecture for Interconnecting Compute Nodes ...... 11
Figure 1-6: Programming Models ...... 14
Figure 2-1: AMD Opteron Chip Architecture ...... 23
Figure 2-2: AMD Based Multiprocessor System ...... 24
Figure 2-3: Remote Store Programming Model ...... 28
Figure 2-4: Interval Labeling Scheme for a Simple Network ...... 33
Figure 2-5: a) 4-ary 2-dimensional Mesh, b) 3-ary 3-dimensional Mesh ...... 34
Figure 2-6: 4-Dimensional Hypercube ...... 35
Figure 2-7: a) 4-ary 2-dimensional Cube, b) 3-ary 3-Dimensional Cube ...... 37
Figure 2-8: a) Tree, b) Fully-connected Graph, c) Star Mesh ...... 38
Figure 2-9: a) 2D-Mesh labeling scheme, b) 3D-Mesh labeling scheme ...... 39
Figure 2-10: AMD Magny Cours processor ...... 41
Figure 2-11: Two-processor Supernode ...... 43
Figure 2-12: Two-Processor Supernode with Maximum Node Degree ...... 44
Figure 2-13: Four Processor Supernode ...... 45
Figure 2-14: TCCluster Two-Node Prototype ...... 50
Figure 2-15: HTX-2-HTX Cable Adapter ...... 51
Figure 2-16: HT1200 Link Default Parameters ...... 52
Figure 2-17: HT1200 with Maximum De-Emphasis ...... 53
Figure 2-18: HT2600 with 6db De-Emphasis ...... 53
Figure 2-19: Example Address Map for two different nodes ...... 54
Figure 2-20: Four Interconnected Supernodes ...... 57
Figure 2-21: Evaluation Setup ...... 62
Figure 2-22: TCCluster Bandwidth ...... 64
Figure 2-23: TCCluster Latency ...... 65
Figure 2-24: NUMA Impact on Communication Latency ...... 66
Figure 2-25: TCCluster Message Rate ...... 67
Figure 3-1: Heterogeneous System-on-Chip ...... 74
Figure 3-2: Homogenous TRIPS Multiprocessor System-on-Chip ...... 76
Figure 3-3: a) Shared Medium Network, b) Hybrid Network, c) Indirect Network, d) Direct 2D Mesh Network ...... 80
Figure 3-4: Layered Framework Architecture ...... 88
Figure 3-5: HTAX TX and RX Interfaces ...... 89
Figure 3-6: Heterogeneous HTAX Switch ...... 91
Figure 3-7: Excellerate Interface for Interconnecting Multiple Chips ...... 96
Figure 3-8: Excellerate Functional Unit ...... 98
Figure 3-9: HTAX Design Flow ...... 100
Figure 3-10: HTAX Configuration Wizard ...... 101
Figure 3-11: Open Verification Methodology Architecture ...... 103
Figure 3-12: HTAX Transmit OVC ...... 104
Figure 3-13: Item Pool Datastructure ...... 105
Figure 3-14: Multi Queue Scoreboard ...... 106
Figure 3-15: Complete HTAX Verification Environment ...... 107
Figure 3-16: HTAX Pipeline Structure ...... 108
Figure 3-17: HTAX switch arbiter with three ports and four virtual channels ...... 110
Figure 3-18: HTAX switch multiplexer structure ...... 111
Figure 3-19: Maximum Operating Frequency of FPGA based HTAX Switches ...... 113
Figure 3-20: Maximum Operating Frequency of ASIC based HTAX Switches ...... 114
Figure 3-21: Single Stage Switch FPGA Resource Consumption ...... 115
Figure 3-22: Single Stage Switch ASIC Resource Consumption ...... 116
Figure 3-23: a) X-Bar, b) X-Bar ...... 117
Figure 3-24: Multi-stage HTAX Operating Frequency ...... 118
Figure 3-25: Multi-stage HTAX Resources ...... 119
Figure 3-26: Multi-stage HTAX Gate and Buffer Area ...... 120
Figure 3-27: EXTOLL Toplevel Architecture ...... 122
Figure 3-28: HTOC Command Frame Specification ...... 125
Figure 4-1: Uniform Memory Architecture ...... 134
Figure 4-2: Non-Uniform Memory Architecture ...... 135
Figure 4-3: Distributed Shared Memory Architecture ...... 136
Figure 4-4: Impact of Page Faults to Bandwidth Performance ...... 138
Figure 4-5: Cache Coherency Protocol Penalty to STREAM ...... 139
Figure 4-6: Memcpy Performance in Different System Configurations ...... 140
Figure 4-7: System Architecture with the Proposed Memory Extension ...... 142
Figure 4-8: Internal Memory Extension Device Architecture ...... 143
Figure 4-9: Memory Controller Sub blocks ...... 145
Figure 4-10: FPGA based Prototype ...... 150
Figure 4-11: DMA Bandwidth Performance of different Transfers ...... 154
Figure 5-1: Component Based System Design ...... 167

List of Tables

Table 2-1: Interval Routing Table for 3x3 Mesh ...... 39
Table 2-2: Topology Characteristics ...... 40
Table 3-1: EXTOLL HTAX Resource Consumption ...... 126
Table 4-1: Stream Benchmark Results ...... 153
Table 4-2: Memory Capacity per pin Ratio ...... 155

REFERENCES

[1] 3 Leaf Systems: Enabling the Dynamic Data Center, whitepaper, 2009.
[2] Advanced Micro Devices: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors, rev. 3.00.
[3] A. Agarwal, "The Tile processor: A 64-core multicore for embedded processing," Proceedings of HPEC Workshop, 2007.
[4] S. Alam, R. Barrett, M. Bast, and M. Fahey, "Early evaluation of IBM BlueGene/P," Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC08), 2008.
[5] E. Allen, D. Chase, J. Hallett, V. Luchangco, J. Maessen, S. Ryu, G. Steele Jr, S. Tobin-Hochstadt, J. Dias, C. Eastlund, and others, "The Fortress language specification," Sun Microsystems, vol. 139, 2005, p. 140.
[6] AMBA Specification, rev. 2.0, ARM.
[7] G. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," Proceedings of the AFIPS Spring Joint Computer Conference (Reston, Va.), AFIPS Press, Arlington, VA, 1967.
[8] J. Archibald and J. L. Baer, "Cache coherence protocols: evaluation using a multiprocessor simulation model," ACM Transactions on Computer Systems (TOCS), 1986.
[9] K. Asanovic, R. Bodik, B. Catanzaro, and J. Gebis, "The landscape of parallel computing research: A view from Berkeley," UC Berkeley Tech Report, 2006.
[10] E. Bakker, J.V. Leeuwen, and R. Tan, "Linear interval routing," The Computer Journal, 1991.
[11] B. Black and J. Shen, "Calibration of microprocessor performance models," Computer, 2002.
[12] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho, "Entering the petaflop era: the architecture and performance of Roadrunner," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Austin, Texas, November 15-21, 2008), IEEE Press, Piscataway, NJ, pp. 1-11.
[13] R. Baumann, "Soft errors in advanced computer systems," IEEE Design & Test of Computers, vol. 22, 2005, pp. 258-266.
[14] E. Beigné, F. Clermidy, J. Durupt, H. Lhermet, S. Miermont, Y. Thonnart, T. Xuan, A. Valentian, D. Varreau, and P. Vivet, "An asynchronous power aware and adaptive NoC based circuit," IEEE Journal of Solid-State Circuits, vol. 44, 2009, pp. 1167-1177.
[15] L. Benini and G. D. Micheli, "Networks on chips: technology and tools," 2006.
[16] L. Benini and G.D. Micheli, "Networks on chip: a new paradigm for systems on chip design," Design, Automation and Test in Europe Conference and Exhibition, 2002, pp. 418-419.
[17] D. Bertozzi and L. Benini, "Xpipes: A network-on-chip architecture for gigascale systems-on-chip," IEEE Circuits and Systems Magazine, vol. 4, 2004, pp. 18-31.
[18] T. Bjerregaard, S. Mahadevan, R. Olsen, and J. Spars, "An OCP compliant network adapter for GALS-based SoC design using the MANGO network-on-chip."
[19] B. Black, M. Annavaram, N. Brekelbaum, and J. DeVale, "Die Stacking (3D) Microarchitecture," 39th Annual IEEE/ACM International Symposium on Microarchitecture.
[20] H. Bodlaender, R.B. Tan, D.M. Thilikos, and J. Leeuwen, "On Interval Routing Schemes and Treewidth," Information and Computation, vol. 139, 1997, pp. 92-109.
[21] M.T. Bohr, "Interconnect Scaling - The Real Limiter to High Performance ULSI," Solid State Technology, 1996.
[22] R. Boppana and C. Raghavendra, "Designing efficient Benes and Banyan based input-buffered ATM switches," 1999 IEEE International Conference on Communications (ICC'99), IEEE, pp. 1826-1830.
[23] S. Borkar, "Thousand core chips: a technology perspective," Annual ACM IEEE Design Automation Conference, 2007, p. 746.
[24] J. Borowski and U. Bruening, "Asymmetric Probing Prototype," AMD Accelerated Computing Symposium, 2008.
[25] R. Brightwell, K. Pedretti, and K. D. Underwood, "Initial Performance Evaluation of the Cray SeaStar Interconnect," Proceedings of the 13th Symposium on High Performance Interconnects (HOTI'05), 2005.
[26] D. Burger, J. Goodman, and A. Kaegi, "Memory bandwidth limitations of future microprocessors," ACM SIGARCH Computer Architecture News, vol. 24, 1996, p. 89.
[27] D. Burger, "Hardware techniques to improve the performance of the processor/memory interface," Dissertation, 1998.
[28] E. Cannon, D. Reinhardt, M. Gordon, and P. Makowenskyj, "SRAM SER in 90, 130 and 180 nm bulk and SOI technologies," IEEE International Reliability Physics Symposium, 2004, pp. 300-304.
[29] D. Chapiro, "Globally-asynchronous locally-synchronous systems," PhD thesis, 1984.
[30] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, Introduction to UPC and language specification, Citeseer, 1999.
[31] T. Chen, R. Raghavan, J. Dale, and E. Iwata, "Cell Broadband Engine Architecture and its first implementation - A performance view," IBM Journal of Research and Development, 2007.
[32] S. Computing, "Trio Departmental Supercomputer DSMP System Software," Life Sciences, 2008.
[33] P. Conway, N. Kalyanasundharam, G. Donley, and K. Lepak, "Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor," IEEE Micro, 2010, pp. 16-29.
[34] P. Conway and B. Hughes, "The AMD Opteron northbridge architecture," IEEE Micro, vol. 27, 2007, pp. 10-21.
[35] M. Coppola, S. Curaba, and M.D. Grammatikakis, "OCCN: a network-on-chip modeling and simulation framework," 2004.
[36] M. Coppola, R. Locatelli, G. Maruccia, and L. Pieralisi, "Spidergon: a novel on-chip communication network," System-on-Chip, 2004.
[37] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini, "Xpipes: a latency insensitive parameterized network-on-chip architecture for multi-processor SoCs," Computer Design, 2003.
[38] W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. 100, 1987, pp. 547-553.
[39] W. Dally and B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, San Francisco, 2004.
[40] W. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," Design Automation Conference, 2001, pp. 684-689.
[41] L. Dickman, G. Lindahl, D. Olson, J. Rubin, and J. Broughton, "Pathscale InfiniPath: a first look," Proceedings of the 13th Symposium on High Performance Interconnects, 2005.
[42] D. Doerfler, "An Analysis of the Pathscale Inc. InfiniBand Host Channel Adapter, InfiniPath," Sandia report SAND2005-5199 (Aug. 2005), 2005.
[43] S. Donthi and R. Haggard, A survey of dynamically reconfigurable FPGA devices, IEEE, 2003.
[44] Y. Dourisboure and C. Gavoille, "Interval Routing for Generalized Hypercube-Like Graphs," 2007.
[45] J. Duato, S. Yalamanchili, and L. Ni, "Interconnection networks: An engineering approach," 2003.
[46] J. Duato, I. Johnson, J. Flich, F. Naven, P. Garcia, and T. Nachiondo, "A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks," 2005.
[47] T. Dumitras, S. Kerner, and R. Marculescu, "Towards on-chip fault-tolerant communication," Proceedings of the 2003 Asia and South Pacific Design Automation Conference, ACM, 2003, p. 232.
[48] H. Eberle, P.J. Garcia, J. Flich, J. Duato, R. Drost, N. Gura, D. Hopkins, and W. Olesinski, "High-radix crossbar switches enabled by proximity communication," Conference on High Performance Networking and Computing, 2008.
[49] A. Ehliar, J. Eilert, and D. Liu, "A comparison of three FPGA optimized NoC architectures," Swedish System-on-Chip Conference, 2007.
[50] R. Esser and R. Knecht, "Intel Paragon XP/S - Architecture and Software Environment," Anwendungen, Architekturen, Trends, Seminar, 1993.
[51] P. Fraigniaud and C. Gavoille, "Interval routing schemes," Algorithmica, 1998.
[52] H. Fröning, M. Nüssle, D. Slogsnat, H. Litz, and U. Brüning, "The HTX-Board: A Rapid Prototyping Station," 3rd Annual FPGAWorld Conference, Citeseer, 2006.
[53] P. Garcia, F. Quiles, J. Flich, J. Duato, I. Johnson, and F. Naven, "Efficient, scalable congestion management for interconnection networks," IEEE Micro, vol. 26, 2006, pp. 52-66.
[54] H. Garcia-Molina and L. Rogers, "Performance through memory," ACM SIGMETRICS Performance Evaluation Review, vol. 15, 1987, p. 131.
[55] C. Gavoille, "A survey on interval routing," Theoretical Computer Science, 2000.
[56] C.J. Glass and L.M. Ni, "The turn model for adaptive routing," Journal of the ACM (JACM), vol. 41, 1994, p. 874.
[57] J. Goodman, "Using cache memory to reduce processor-memory traffic," 25 years of the International Symposium on Computer Architecture (ISCA), 1998.
[58] K. Goossens, J. Dielissen, O. Gangwal, S. Pestana, A. Radulescu, and E. Rijpkema, "A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification," Design, Automation and Test in Europe, pp. 1182-1187.
[59] K. Goossens, J. Dielissen, and A. Radulescu, "Æthereal Network on Chip: Concepts, Architectures, and Implementations," IEEE Design and Test of Computers, vol. 22, 2005, pp. 414-421.
[60] D. Gupta, S. Lee, M. Vrable, S. Savage, A. Snoeren, G. Varghese, G. Voelker, and A. Vahdat, "Difference Engine: Harnessing Memory Redundancy in Virtual Machines," Communications of the ACM, vol. 53, 2010, pp. 85-93.
[61] J.L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, 1988, p. 532.
[62] D. Gustavson, "The Scalable Coherent Interface and related standards projects," IEEE Micro, 1992.
[63] R. Hamming, "Error detecting and error correcting codes," Bell System Technical Journal, vol. 29, 1950, pp. 147-160.
[64] D. S. Henry and C. F. Joerg, "A Tightly-Coupled Processor-Network Interface," Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), 1992.
[65] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, "Network on chip: An architecture for billion transistor era," Proceedings of the IEEE NorChip Conference, Citeseer, 2000, pp. 166-173.
[66] M. Herlihy and J. Moss, "Transactional memory: Architectural support for lock-free data structures," ASPLOS-X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 1993.
[67] M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, 2008, pp. 33-38.
[68] W. Hillis, "The connection machine," 1989.
[69] R. Ho, K. Mai, and M. Horowitz, "The future of wires," Proceedings of the IEEE, vol. 89, 2001, pp. 490-504.
[70] H. Hoffmann, D. Wentzlaff, and A. Agarwal, "Remote Store Programming," Lecture Notes on High Performance Embedded Architectures and Compilers, 2010.
[71] HP, "Memory technology evolution: an overview of system memory technologies," http://tinyurl.com/ctfjs2.
[72] F. Hwang and A. Jajszczyk, "On nonblocking multiconnection networks," IEEE Transactions on Communications, 1986.
[73] HyperTransport Consortium: HyperTransport I/O Link Specification, revision 3.10, 2008.
[74] Heiner Litz, "HyperTransport On-Chip (HTOC) Specification," ver. 14, 2010.
[75] SIA, "International Technology Roadmap for Semiconductors, Edition 2007," 2007.
[76] A. Jantsch and H. Tenhunen, "Networks on chip," 2003.
[77] N. Jouppi and D. Wall, "Available instruction-level parallelism for superscalar and superpipelined machines," ACM SIGARCH Computer Architecture News, 1989.
[78] B. Kalisch, A. Giese, H. Litz, U. Brüning, and G. Ausgabe, "HyperTransport 3 Core: A Next Generation Host Interface with Extremely High Bandwidth," Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA'09), 2009.
[79] M. Karol, M. Hluchyj, and S. Morgan, "Input versus output queueing on a space-division packet switch," IEEE Transactions on Communications, 1987.
[80] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, and R. Dreslinski, "PicoServer: Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor," ACM SIGOPS Operating Systems Review.
[81] J. Kim, W.J. Dally, and D. Abts, "Flattened butterfly: a cost-efficient topology for high-radix networks," Proceedings of the 34th Annual International Symposium on Computer Architecture, ACM, 2007, pp. 126-137.
[82] D. Kroft, "Lockup-free instruction fetch/prefetch cache organization," ISCA '98: 25 years of the International Symposia on Computer Architecture, 1998.
[83] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oeberg, K. Tiensyrja, and A. Hemani, "A network on chip architecture and design methodology," IEEE Symposium on VLSI (ISVLSI'02), Apr. 2002, pp. 105-112.
[84] A. Kumar, I. Ovadia, J. Huisken, and H. Corporaal, "Reconfigurable Multi-Processor Network-on-Chip on FPGA," Proceedings of the Annual Conference of the Advanced School for Computing and Imaging, 2006.
[85] P. Kogge, S. Lead, D. Campbell, J. Hiller, M. Richards, and A. Snavely, "Exascale Computing Study: Technology Challenges in Achieving Exascale Systems," Government PROcurement.
[86] P. Koopman, "32-Bit Cyclic Redundancy Codes for Internet Applications," International Conference on Dependable Systems and Networks (DSN'02), 2002.
[87] R. Kota, R. Oehler, N. Inc, and T. Austin, "Horus: large-scale symmetric multiprocessing for Opteron systems," IEEE Micro, vol. 25, 2005, pp. 30-40.
[88] D. Koufaty and D. Marr, "Hyperthreading Technology in the Netburst Microarchitecture," IEEE Micro, vol. 23, 2003, pp. 56-65.
[89] J. Laudon and D. Lenoski, "The SGI Origin," ACM SIGARCH Computer Architecture News, vol. 25, 1997, pp. 241-251.
[90] S. Lee, S. Min, and R. Eigenmann, "OpenMP to GPGPU: a compiler framework for automatic translation and optimization," Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
[91] J.V. Leeuwen and R. Tan, "Interval routing," The Computer Journal, 1987.
[92] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," ACM Transactions on Computer Systems (TOCS), vol. 7, 1989.
[93] S. Liang, R. Noronha, and D. Panda, "Swapping to remote memory over InfiniBand: An approach using a high performance network block device," IEEE Cluster Computing, Citeseer, 2005.
[94] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S.K. Reinhardt, and T.F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," ACM SIGARCH Computer Architecture News, vol. 37, 2009, p. 267.
[95] M. Linderman, J. Collins, and H. Wang, "Merge: a programming model for heterogeneous multi-core systems," Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008.
[96] H. Litz, H. Froening, M. Nüssle, and U. Brüning, "VELO: A Novel Communication Engine for Ultra-low Latency Message Transfers," 37th International Conference on Parallel Processing (ICPP'08), 2008, pp. 238-245.
[97] H. Litz, M. Thuermer, and U. Bruening, "TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect," IEEE International Conference on Cluster Computing (CLUSTER), 2010.
[98] H. Litz, H. Fröning, and U. Brüning, "HTAX: A Novel Framework for Flexible and High Performance Networks-on-Chip," Fourth Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip (INA-OCMC) in conjunction with HiPEAC, 2010.
[99] H. Litz, H. Fröening, M. Nuessle, and U. Brüning, "A HyperTransport Network Interface Controller for Ultra-low Latency Message Transfers," HyperTransport Consortium Whitepaper, 2007.
[100] H. Litz, H. Fröning, and U. Brüning, "A HyperTransport 3 Physical Layer Interface for FPGAs," 5th International Workshop on Applied Reconfigurable Computing (ARC 2009), March 16-18, 2009, Karlsruhe, Germany.
[101] H. Litz, H. Fröning, M. Thürmer, and U. Brüning, "An FPGA based Verification Platform for HyperTransport 3.x," 19th International Conference on Field Programmable Logic and Applications (FPL 2009), August 31 - September 2, 2009, Prague, Czech Republic.
[102] R. Minnich, J. Hendricks, and D. Webster, "The Linux BIOS," Atlanta Linux Showcase, 2000, p. 21.
[103] G. Moore and others, "Cramming more components onto integrated circuits," Proceedings of the IEEE, vol. 86, 1998, pp. 82-85.
[104] H. Moussa, O. Muller, A. Baghdadi, and M. Jézéquel, "Butterfly and Benes-based on-chip communication networks for multiprocessor turbo decoding," Proceedings of the Conference on Design, Automation and Test in Europe (DATE), EDA Consortium, 2007.
[105] J. Muttersbach, T. Villiger, and W. Fichtner, "Practical design of globally-asynchronous locally-synchronous systems," Asynchronous Circuits and Systems, 2000.
[106] MVAPICH: MPI over InfiniBand and iWARP. http://mvapich.cse.ohio-state.edu/.
[107] L. Ni and P. McKinley, "A survey of wormhole routing techniques in direct networks," Computer, 1993.
[108] Numascale Whitepaper, "SMP Redux: You Can Have It All".
[109] M. Nüssle, M. Scherer, and U. Brüning, "A resource optimized remote-memory-access architecture for low-latency communication," 2009 International Conference on Parallel Processing (ICPP 2009), IEEE, 2009, pp. 220-227.
[110] G. Palermo, G. Mariani, C. Silvano, R. Locatelli, and M. Coppola, "Mapping and topology customization approaches for application-specific STNoC designs," IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2007, pp. 61-68.
[111] M. Palesi, R. Holsmark, and S. Kumar, "A methodology for design of application specific deadlock-free routing algorithms for NoC systems," Proceedings of the 4th …, 2006.
[112] W. Peterson and D. Brown, "Cyclic codes for error detection," Proceedings of the IRE, 1961.
[113] S. Raina, "Virtual Shared Memory: A Survey of Techniques and Systems," cs.bris.ac.uk.
[114] G. Regnier, R. Iyer, D. Newell, L. Cline, and A. Foong, "TCP Onloading for Data Center Servers," Computer, 2004.
[115] ScaleMP Whitepaper, "The Versatile SMP (vSMP) Architecture and Solutions Based on vSMP Foundation," ReVision, pp. 1-12.
[116] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, and D, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," International Symposium on Computer Architecture, 2003.
[117] N. Santoro and R. Khatib, "Routing without routing tables," Techn. Report SCSTR-6, School of Computer Science, 1982.
[118] C. Savin, T. McSmythurs, and J. Czilli, "Binary tree search architecture for efficient implementation of round robin arbiters," ICASSP, 2004.
[119] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," International Conference on Dependable Systems and Networks, 2002.
[120] J. Smith, "A study of branch prediction strategies," 25 years of the International Symposium on Computer Architecture, 1998.
[121] J. Smith and G. Sohi, "The microarchitecture of superscalar processors," Proceedings of the IEEE, vol. 83, 2002, pp. 1609-1624.
[122] H. Sullivan and T.R. Bashkow, "A large scale, homogeneous, fully distributed parallel machine, I," International Symposium on Computer Architecture, vol. 5, 1977, p. 105.
[123] S. Sur, M. J. Koop, L. Chai, and D. K. Panda, "Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms," Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007.
[124] Texas Memory Systems, "Increase Application Performance with Solid State Disks," Memory.
[125] M. Thottethodi, A. Lebeck, and S. Mukherjee, "Self-tuned congestion control for multiprocessor networks," HPCA, IEEE Computer Society, 2001, p. 0107.
[126] J. Dongarra and M. Meuer, "TOP500 Supercomputer List," June 2009. URL: http://www.top500.org
[127] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, and others, "An 80-tile sub-100-W teraflops processor in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, 2008, pp. 29-41.
[128] P. Vellanki, N. Banerjee, and K. Chatha, "Quality-of-service and error control techniques for mesh-based network-on-chip architectures," INTEGRATION, the VLSI Journal, vol. 38, 2005, pp. 353-382.
[129] D. Watts, "System x3755 Technical Introduction," IBM Redpaper, pp. 2006-2007.
[130] C. Whitby-Stevens, "The Transputer," Proceedings of the 12th ISCA, 1985.
[131] Wishbone System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores Specification, Revision B.3, 2002.
[132] P. Wolkotte, G. Smit, G. Rauwerda, and L. Smit, "An energy-efficient reconfigurable circuit-switched network-on-chip," 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[133] W. Wulf and S. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, 1995.
[134] K. Yelick, D. Bonachea, W. Chen, and P. Colella, "Productivity and performance using partitioned global address space languages," PASCO '07: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, 2007.
[135] B. Zerrouk, S. Tricot, B. Rottembourg, and L. Patnaik, "Proper linear interval routing schemes," Technical Report N 94-29, IBP-MASI, October 1994.
[136] B. Zerrouk, J.M. Blinn, and A. Greiner, "Encapsulating Networks and Routing," Proceedings of the 8th International Symposium on Parallel Processing, 1994, p. 546.
