Shared Memory Programming on NUMA–based Clusters using a General and Open Hybrid Hardware / Software Approach

Martin Schulz

Institut für Informatik Lehrstuhl für Rechnertechnik und Rechnerorganisation


Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of

Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. R. Bayer, Ph.D.
Examiners: 1. Univ.-Prof. Dr. A. Bode, 2. Univ.-Prof. Dr. H. Hellwagner, Universität Klagenfurt / Austria

The dissertation was submitted to the Technische Universität München on April 24, 2001 and accepted by the Fakultät für Informatik on June 28, 2001.

Abstract

The widespread use of shared memory programming for High Performance Computing (HPC) is currently hindered by two main factors: the limited availability of architectures with hardware support for shared memory and the abundance of existing programming models. In order to solve these issues, a comprehensive shared memory framework needs to be created which enables the use of shared memory on top of more scalable architectures and which provides a user–friendly solution to deal with the various different programming models.

Driven by the first issue, a large number of so–called SoftWare Distributed Shared Memory (SW–DSM) systems have been developed. These systems rely solely on software components to create a transparent global virtual memory abstraction on highly scalable, loosely coupled architectures without any direct hardware support for shared memory. However, they are often affected by inherent performance problems and, in addition, do not solve the second issue, the existence of (too) many shared memory programming models. On the contrary, the large amount of work done in the DSM area has led to a significant number of independent systems, each with its own API, thereby further worsening the situation.

The work presented within this thesis therefore takes the idea of SW–DSM systems a step further by proposing a general and open shared memory framework called HAMSTER (Hybrid-dsm based Adaptive and Modular Shared memory archiTEctuRe). Instead of being fixed to a single shared memory programming model or API, this framework provides a comprehensive set of shared memory services enabling the implementation of almost any shared memory programming model on top of a single core. These services are designed in a way that minimizes the complexity for target programming models, making the implementation of a large number of different models feasible. This can include both existing and new application or application–domain specific programming models, easing both the porting of existing applications and the parallelization of new ones.

In addition, the HAMSTER framework avoids typical performance problems of SW–DSM systems by relying on so–called NUMA (Non–Uniform Memory Access) architectures, which combine scalability and cost effectiveness with limited support for shared memory in the form of non–cache coherent hardware DSM. Their capabilities are directly exploited by a new type of hybrid hardware/software DSM system, the core of the HAMSTER framework. This Hybrid–DSM approach closes the semantic gap between the global physical memory provided by the underlying hardware and the global virtual memory required for shared memory programming, enabling applications to directly benefit from the hardware support.

On top of this Hybrid–DSM system, the HAMSTER framework defines and implements several independent and orthogonal management modules. This includes separate modules for memory, consistency, synchronization, and task management as well as for the control of the cluster and the global process abstraction. Each of these modules offers typical services required by implementations of shared memory programming models. Combined they form the HAMSTER interface, which can then be used to implement shared memory programming models without much effort. This capability is proven through the implementation of a number of selected shared memory programming models on top of the HAMSTER framework. These models range from transparently distributed thread models all the way to explicit put/get libraries and also include various APIs from existing SW–DSM systems with different relaxed consistency models. Together they cover the whole spectrum of shared memory programming models and underline the broad applicability of this approach.

The presented concepts are evaluated using a large number of different benchmarks and kernels exhibiting the performance details of the individual components. In addition, HAMSTER is used as the basis for the implementation or port of two real–world applications from the area of nuclear medical imaging, more precisely the reconstruction of PET images and their spectral analysis. These experiments cover both the porting of an already existing shared memory application using a given DSM API and the parallelization of an application from scratch using a new, customized API. In both cases, the system provides an efficient platform resulting in a very scalable execution. These experiments therefore prove both the wide applicability and the efficiency of the overall HAMSTER framework.

Acknowledgments

I would like to take this opportunity to express my deepest gratitude to the following people, for I realize that this work would not have been possible without their help, guidance, and support.

First, I am indebted to my two advisors, Prof. Dr. A. Bode, chair of LRR–TUM, and Prof. Dr. H. Hellwagner, now at the University of Klagenfurt. Prof. Hellwagner initially took me in as his student and after his move to Klagenfurt, Prof. Bode took over without hesitation. Together they provided me with an excellent research environment and gave me the greatest possible freedom to pursue my work and interests. In addition, despite their busy schedules, they both always had an open ear for me.

I would also like to direct my special thanks to Dr. Wolfgang Karl who, in his function as leader of the architecture group at LRR, significantly contributed to the excellent research environment. In addition, his support and the many long, informal discussions and helpful comments regarding my work were invaluable, especially in times when I seemed stuck.

I would also like to thank my many colleagues at LRR–TUM; I enjoyed very much working with them and the many discussions we shared over coffee. Especially I would like to mention my office mates over the years: Phillip Drum, Michael Eberl, Detlef Fliegl (who also contributed significantly to the Windows NT driver for the SCI-VM and provided personal system administration support), Günther Rackl, and Christian Weiß. Additionally, I would like to thank Jie Tao, whose pleasant disposition and enthusiasm made it a pleasure to work with her.

Our secretaries, especially Mrs. Eberhardt, Mrs. Wöllgens, and Mrs. Brunnhuber, deserve my special thanks for their help in maneuvering through the many administrative obstacles ranging from contract issues to business trip applications. Without them, I would still be sitting here in a pile of paperwork. I would also like to thank Klaus Tilk, our system administrator, for keeping our systems alive and healthy.

I do not want to forget to mention Prof. Dr. A. Chien and the members of his Concurrent Systems Architecture Group (CSAG) from the University of Illinois at Urbana–Champaign (now at the University of California at San Diego). During my stay in this group and during my Master's work there, they introduced me to a scientific style of working from which I still very much profit.

The European Union/Commission deserves credit for the funding it provided for the majority of my work. This was done mainly within the ESPRIT projects SISCI (EP 23174) and NEPHEW (EP 29907). In addition, the many partners Europe–wide involved in these projects provided an additional source of inspiration and support. Special thanks go to Dolphin ICS, most specifically to Kåre Løchson, Hugo Kohmann, Torsten Amundsen, and Roy Nordstrøm, who were involved not only in these ESPRIT projects, but also technically supported our work within SMiLE. The work within NEPHEW also brought me in contact with the Clinic for Nuclear Medicine at the “Klinikum Rechts der Isar”. From this cooperation I received valuable input for the evaluation part of this work. This would not have been possible without help from Frank Munz, Sibylle Ziegler, and Martin Völk, and I am very thankful for the time and energy they invested.

On a more personal note, I am indebted to my friends and my whole family who were always at my side providing constant encouragement and support. I would especially like to thank my good friend, Dr. Johannes Zimmer, for all our long talks and for his prolific help with LaTeX. I would also like to express my deepest gratitude to my parents. From early on, they sparked and encouraged my interest in learning and provided unconditional support in all of my endeavors, both financially and (most important) spiritually. Last, but certainly not least, I would like to thank the love of my life, my wife Laura, who not only took the time to review my thesis and to give valuable comments regarding language and style, but who also supported me throughout the entire time and with her whole heart.

Martin Schulz April 2001

Contents

1 Motivation 1
  1.1 Current Deficiencies of the Shared Memory Paradigm 2
  1.2 The Approach 3
    1.2.1 Benefiting from the Rise of NUMA Architectures 3
    1.2.2 Exploiting Hybrid–DSM Techniques 4
    1.2.3 Towards an Open Architecture for Shared Memory 4
  1.3 Contributions 5
  1.4 Thesis Organization 7

2 Background 9
  2.1 Shared Memory Programming 9
    2.1.1 Definition of Shared Memory Programming Models 9
    2.1.2 Available Models and APIs 10
    2.1.3 Shortcomings of Shared Memory Programming 12
  2.2 Architectural Support for Shared Memory 12
    2.2.1 Full Hardware Support within UMA Systems 13
    2.2.2 Improving Scalability with CC–NUMA 14
    2.2.3 Transitions Towards Pure NUMA Systems 16
    2.2.4 Shared Memory for Clusters 16
  2.3 NUMA for Clusters: The Scalable Coherent Interface 18
    2.3.1 History and Principles of SCI 18
    2.3.2 Shared Memory Support in SCI 18
    2.3.3 Commercial Success of SCI 21
    2.3.4 Other SCI–based Research Activities 23
    2.3.5 HAMSTER as Part of the SMiLE Project 27
  2.4 Summary 28

3 HAMSTER: Hybrid–dsm based Adaptive and Modular Shared memory archiTEctuRe 31
  3.1 Related Work 31
  3.2 HAMSTER Design Goals 32
    3.2.1 Main Characteristics 32
    3.2.2 Target Architectures 34
  3.3 The HAMSTER Layers 36
  3.4 Underlying Principles 37
    3.4.1 The Principle of Orthogonality 37
    3.4.2 Resource and Performance Monitoring 38
  3.5 Summary 39

4 SCI Virtual Memory: Creating a Hybrid–DSM System 41
  4.1 State of the Art 41
    4.1.1 Idea and Concept of DSM Systems 41
    4.1.2 Memory Consistency Models for DSM Systems 42
    4.1.3 Alternative Approaches 46
    4.1.4 Exploiting Existing Hardware Support 48
  4.2 The SCI-VM Design 50
    4.2.1 Building Block 1: SCI–based Hardware–DSM 50
    4.2.2 Building Block 2: Software–DSM Systems 50
    4.2.3 Combining Both Building Blocks to the SCI-VM 50
  4.3 Static vs. Dynamic Memory Management 51
    4.3.1 Static Memory Mappings 52
    4.3.2 Dynamic Global Memory Management 54
  4.4 Underlying Task and Execution Model 56
  4.5 Integrating the SCI-VM with Existing SCI–based Platforms 57
    4.5.1 SCI Driver Integration 57
    4.5.2 Operating System Integration/Extension 57
  4.6 Memory Coherency of the SCI-VM 59
    4.6.1 Memory Optimization 60
    4.6.2 Impact on Memory Coherency 61
    4.6.3 Towards Relaxed Consistency Models 62
  4.7 SCI-VM Performance 62
    4.7.1 Basic Memory Performance 63
    4.7.2 Effect of Memory Optimizations 65
  4.8 Remaining Problems and Challenges 67
    4.8.1 Transparency Gaps in the Underlying Hardware 67
    4.8.2 Full Operating System Integration 68
  4.9 Summary 68

5 HAMSTER Management Modules 71
  5.1 Memory Management 71
    5.1.1 State of the Art 71
    5.1.2 Functionalities of the Module 72
    5.1.3 Influencing the Memory Layout 73
    5.1.4 Controlling the Coherency of Global Memory 77
    5.1.5 Dealing with Static Application Data 79
  5.2 Consistency Management 83
    5.2.1 State of the Art 84
    5.2.2 Enabling the Control of the SCI-VM Memory System 84
    5.2.3 Minimizing the Global Impact 85
    5.2.4 Introducing Scope 86
    5.2.5 Functionalities of the Module 87
    5.2.6 Low–level Performance 89
    5.2.7 Realizing Consistency Models in HAMSTER 89
  5.3 Synchronization Management 90
    5.3.1 State of the Art 90
    5.3.2 Functionalities of the Module 91
    5.3.3 Implementation Aspects and Performance of Locks 93
    5.3.4 Implementation Aspects and Performance of Barriers 96
    5.3.5 Global Counters and Interrupts 99
  5.4 Task Management 99
    5.4.1 State of the Art 100
    5.4.2 Functionalities 100
  5.5 Towards a Global View: The Cluster Control Module 102
    5.5.1 State of the Art 102
    5.5.2 Functionalities 103
    5.5.3 Asynchronous RPC-like Communication Service 103
    5.5.4 Control Over the Global Process Abstraction 104
    5.5.5 Startup Control 104
    5.5.6 Clean Termination 106
  5.6 Summary 106

6 Implementing Programming Models on Top of HAMSTER 109
  6.1 The HAMSTER Interface 109
    6.1.1 HAMSTER Services 109
    6.1.2 Multiplatform Timing Support 109
    6.1.3 Using the HAMSTER Interface 110
  6.2 A First Model: Single Program – Multiple Data (SPMD) 110
  6.3 Support for Various Consistency Models 111
    6.3.1 State of the Art 111
    6.3.2 Combining Consistency and Synchronization 112
    6.3.3 Implementing Release Consistency 112
    6.3.4 Implementing Scope Consistency 114
    6.3.5 Experimental Evaluation 116
    6.3.6 Applicability 120
  6.4 Transparent Multithreading on top of HAMSTER 121
    6.4.1 State of the Art 121
    6.4.2 Global Thread Identification 122
    6.4.3 Forwarding Requests to Potentially Remote Threads 122
    6.4.4 Memory 123
    6.4.5 Extended Synchronization 125
    6.4.6 Available APIs and Limitations 126
    6.4.7 Performance Evaluation 127
  6.5 Support for Explicit Shared Memory Models 130
    6.5.1 The T3D/E Put/Get Programming Interface 130
    6.5.2 Implementing Put/Get in HAMSTER 131
    6.5.3 Performance Aspects 132
  6.6 Implementation Complexity Analysis 135
  6.7 Visions for Further HAMSTER–based Models 136
    6.7.1 Application or Domain Specific Programming Models 136
    6.7.2 HAMSTER as Runtime Backend 137
  6.8 Summary 137

7 Application Performance Evaluation 139
  7.1 The Principle of Positron Emission Tomography 139
  7.2 Reconstruction of PET Images 142
    7.2.1 Iterative PET Reconstruction 142
    7.2.2 Parallel Implementation 143
    7.2.3 Performance Discussion 144
  7.3 Spectral Analysis of PET Images 146
    7.3.1 Application Description 147
    7.3.2 Parallel Implementation 147
    7.3.3 Performance Discussion 148
  7.4 Summary 149

8 Conclusions and Outlook 151
  8.1 Applicability 152
  8.2 Future Directions 153
    8.2.1 Towards a Cluster–enabled Operating System 153
    8.2.2 A Tool Environment for On–line Monitoring 153
  8.3 Final Remarks 154

A The HAMSTER Execution Environment 155
  A.1 System Requirements 155
  A.2 HAMSTER Directory Structure 155
  A.3 Installing the HAMSTER Environment 156
  A.4 Linking Against HAMSTER 157
  A.5 Running Applications 157

B SPMD: A Sample Programming Model on Top of HAMSTER 159
  B.1 API Description 159
  B.2 A Simple Example Code for the SPMD Programming Model 163

Abbreviations 169

Bibliography 175

Index 199


List of Figures

2.1 Memory organization of UMA architectures. 13
2.2 Memory organization in CC–NUMA architectures. 15
2.3 Memory organization in clusters or NoWs with NORMA characteristics. 17
2.4 SCI HW–DSM using a global address space (from [74]). 19
2.5 The various mapping levels in PC based SCI systems. 19
2.6 Switched SCI ringlets (left) vs. two dimensional torus (right). 22
2.7 Overview of the SMiLE software infrastructure. 27

3.1 SCI topologies used for the SMiLE cluster: ringlet (left) or switched (right). 35
3.2 The HAMSTER framework for SCI–based clusters of PCs. 37

4.1 Classification of memory consistency models; arrows point from stronger to more relaxed models [86, 169]. 43
4.2 Hardware complexity vs. Performance (after [201]). 48
4.3 Principal design of the SCI Virtual Memory. 51
4.4 Static memory allocation procedure. 53
4.5 Activities triggered by a page fault or swapping event. 55
4.6 Potential cache inconsistency when caching remote memory. 61
4.7 Write performance on SCI-VM memory. 63
4.8 Read performance on SCI-VM memory. 65

5.1 Examples of memory distribution types (based on a system with 4 nodes). 75
5.2 Pseudo code for the SOR code (left) and the Gaussian elimination (right). 76
5.3 of dense matrix codes using locality annotations. 77
5.4 Segments within the file format and linking process. 80
5.5 Linking process after patching the application's object files. 81
5.6 Implementing the implicit access scheme to static application data. 82
5.7 Implementing the explicit access scheme to static application data. 83
5.8 Comparing the two locking algorithms under contention. 95
5.9 Lock performance under contention with secure algorithm in different SMP scenarios. 96
5.10 Barrier performance on up to 6 nodes. 98
5.11 Performance of atomic counters under varying numbers of nodes. 100
5.12 HAMSTER configuration file for complete SMiLE cluster consisting of 6 dual SMP nodes in SCI ringlet configuration — left: exploiting SMPs with multiple threads; right: exploiting SMPs with multiple teams per node. 105

6.1 Time distribution during Acquire/Write/Release cycle (SW–DSM left, NUMA–DSM right). 118
6.2 Performance under Read, Write, and Read/Write traffic. 119
6.3 Distribution of work in the Read/Write scenario. 120
6.4 Adding global information to the thread identifier of both Linux and Windows NT/2000 (TM). 123
6.5 Speedup of the SOR code with up to 8 threads (4 nodes/8 CPUs). 128
6.6 Work distribution of the SOR code with up to 8 threads (4 nodes/8 CPUs) — 1024x1024 matrix (left) and 2048x2048 matrix (right). 128
6.7 Speedup of the LU code with up to 8 threads (4 nodes/8 CPUs). 129
6.8 Work distribution of the LU code with up to 8 threads (4 nodes/8 CPUs) — 1024x1024 matrix (left) and 2048x2048 matrix (right). 129
6.9 Put/Get bandwidth. 133
6.10 Broadcast performance with varying numbers of target nodes and data transfer sizes (small transfer sizes left, large transfer sizes right). 134
6.11 Performance of collective addition across various numbers of nodes (left) and using various numbers of reduction variables (right). 134

7.1 Positron / Electron annihilation (from [166]). 140
7.2 Schematic structure of a PET scanner (from [166]). 141
7.3 Typical sinogram as delivered by the PET scanner before reconstruction. 141
7.4 Reconstructed and post–processed images of a human body, tracer F18 labeled glucose analog (FDG) (coronal slices, Nuklearmedizin TUM). 142
7.5 Comparing the image quality of FBP (left) and an iterative reconstruction (right), both with tracer F18 labeled glucose analog (FDG). 143
7.6 Reconstructed images — data set whole body (before proper alignment of neighboring scans and without attenuation correction; the borders between the 6 individual scans are still visible). 145
7.7 Speedup of the PET image reconstruction on up to 12 CPUs. 145
7.8 Aggregated execution times for the PET image reconstruction (left: 1 thread/node, right: 2 threads/node). 146
7.9 Sample slice of a human head, representing the impulse response function at 60 min after injection, tracer: dopamine receptor ligand C–11 diprenorphine. 147
7.10 Speedup of the Spectral Analysis on up to 12 CPUs. 149
7.11 Aggregated execution times for the Spectral Analysis (left: 1 thread/node, right: 2 threads/node). 150

B.1 Update pattern for each point in the matrix. 164
B.2 Data distribution and sharing pattern for the SOR code — from left to right: initial dense matrix, boundary, splitting the matrix, local boundary, area with implicit communication. 164


List of Tables

3.1 Configuration of the SMiLE cluster nodes. 35

4.1 Performance data for low–level experiments: array sum algorithm. 66
4.2 Performance data for low–level experiments: matrix multiplication algorithm. 66

5.1 API of the memory management module. 73
5.2 Statistical information collected by the memory management module. 74
5.3 Memory distribution types available for locality annotations. 74
5.4 Impact of locality annotations on dense matrix codes (on 4 nodes). 76
5.5 Memory coherency types. 77
5.6 Parameters to control the handling of the coherency type request. 79
5.7 API of the consistency management module. 88
5.8 System defined scopes. 88
5.9 Statistical information collected by the consistency module. 88
5.10 Cost of consistency enforcing mechanisms (measured on one dual processor node of the SMiLE cluster). 89
5.11 API of the synchronization management module. 92
5.12 Statistical information collected by the synchronization management module. 93
5.13 Cost of locking operations. 94
5.14 Execution time of interrupts in HAMSTER. 99
5.15 API of the task management module. 101
5.16 Statistical information collected by the task management module. 102
5.17 API of the cluster control module. 103

6.1 Performance of Acquire and Release operations. 117
6.2 Properties of two applications running with both Release and Scope Consistency (on 4 nodes). 121
6.3 Main routines provided by the Shmem programming model (names simplified). 131
6.4 Implementation complexity of programming models using HAMSTER. 134

7.1 Parameters for the data set used in the experiments. 144


Chapter 1

Motivation

Even with today's capabilities of modern microprocessors, a whole range of applications exists which are too computationally intensive to be solved by a single processor within a reasonable amount of time. Only by spreading the work across multiple processors can the computational power necessary for these kinds of applications be aggregated. This has led to the development of parallel architectures, but has also raised the question of their programmability: how can applications organize and manage multiple activities on different processors, and how can information be shared among them?

For this purpose, a variety of different parallel programming paradigms has been proposed, each providing a different approach on how to specify the intended parallelism. The two most common among them are the message passing and the shared memory paradigm; the former is based on explicit messages being sent between (in principle) independent threads or tasks on different processors, while the latter enables a single application running across all nodes to share data implicitly through a common global address space and uses additional mechanisms for explicit synchronization.

Each of these two paradigms has its strengths and weaknesses. Which paradigm is better suited for a specific parallelization depends on the target application and/or on the user's preference and previous experience. In general, however, the shared memory paradigm is seen as easier and more intuitive since it is based on a single global address space. This is very close to the sequential model, which is familiar to virtually any programmer, and hence easy to learn and understand. It also provides a simple and straightforward parallelization without major code changes, in most cases. In addition, the use of a global address space also eliminates the need for an explicit data distribution across nodes, which normally requires major changes in global data structures. This has to be seen in contrast to message passing programming models, which do require an explicit data distribution across the utilized processing nodes and hence often lead to major code changes during the parallelization of an application. The strength of message passing, on the other hand, lies in the direct mapping of the programming paradigm, with its concept of processing nodes and messages sent between them, onto the underlying hardware. In most cases, this ensures a more predictable performance and allows for easier performance optimizations, as the programmer has to take the distributed nature of the underlying system into account from the beginning.

A free choice of a certain paradigm is in reality often restricted by external circumstances, mostly the missing support for one of them on a particular platform. This problem can actually be found on most of the current parallel architectures or systems and hence severely limits their flexibility since it introduces additional complexity for the users. They need to adjust their codes to the subset of available programming models rather than being able to choose the models most appropriate for their target application.

Researchers in the field of parallel programming have therefore focused for a long time on eliminating this problem by providing comprehensive software infrastructures for parallel architectures, which include efficient implementations of programming models from both major paradigms on top of a single system.
This thesis is a contribution to these efforts and aims at providing shared memory programming for cluster architectures with minimal hardware support in the form of a NUMA interconnection fabric, traditionally the domain of message passing models. More specifically, this work concentrates on clusters connected with the Scalable Coherent Interface (SCI) [75], an IEEE–standardized [92] state-of-the-art System Area Network (SAN) with NUMA characteristics, and provides a working prototype for these architectures. It goes, however, one step beyond the traditional work done in this field in that not only a single, new shared memory programming model is implemented on top of the target architecture, but rather a flexible environment is created that enables the implementation of almost any shared memory programming model on top of a single shared memory core. The result is a general and open shared memory programming environment for NUMA–based clusters called HAMSTER (Hybrid dsm–based Adaptive and Modular Shared memory archiTEctuRe). This environment provides a maximum of flexibility to the users since it does not require them to adjust their codes to a specific shared memory programming model or API, but instead enables them to choose the ones appropriate for their applications. This significantly increases the attractiveness of shared memory programming on NUMA–based clusters and hence represents an important step towards the envisioned complete software infrastructure for these architectures.

1.1 Current Deficiencies of the Shared Memory Paradigm

Despite the advantages of the shared memory paradigm with regard to easier programmability, the message passing paradigm currently dominates the area of High Performance Computing (HPC). For one, this can be attributed to the restricted availability of shared memory programming models on large scale systems; while message passing models, mostly represented through libraries like MPI [152] or PVM [60], have been successfully implemented on almost all common parallel architectures, programming models based on the shared memory paradigm are mostly restricted to tightly coupled machines like SMPs, which provide only a limited scalability. Only these machines feature the hardware mechanisms which enable the creation of a consistent global virtual address space. Such support is generally not available on more scalable, loosely coupled architectures.

Therefore, shared memory programming models can only be implemented there using software emulation mechanisms, as is done in SoftWare Distributed Shared Memory (SW–DSM) systems. They create the illusion of a global address space and enable the users to utilize this illusion directly in their applications. These systems, however, face several inherent performance problems which are unavoidable in pure software solutions. Examples include false sharing due to large sharing granularities or complex differential update protocols responsible for packing and sending data between nodes [143]. Even though vast amounts of research worldwide have been devoted to these performance problems, no sufficient solution has been found, thereby confining SW–DSM systems mainly to the realm of research without broad acceptance.

Besides these architectural deficiencies, the use of shared memory programming is further hindered by the fact that a large variety of different shared memory programming models and APIs have been developed and are present on different systems. These range from pure thread models, as can be found on SMPs, to explicit one–sided communication libraries and also include all the different APIs from the individual SW–DSM systems. This greatly reduces the portability of shared memory codes since they often need to be retargeted to different APIs when being moved from one machine to another. While this porting is certainly less complex than porting codes between paradigms, it still requires a significant amount of work, increases cross–platform code maintenance, and hinders code reuse. In addition, it steeply increases the learning curve for users since they have to deal with the subtle details in both syntax and semantics of a large variety of models. All of this lowers the acceptance of shared memory programming and hinders its wider use among the HPC user community.

In summary, these deficiencies render the shared memory paradigm unsuitable for many problems and application areas, especially those generally classified as HPC. Only in certain areas, mainly those dealing with small numbers of processors and dynamic task structures, has the shared memory paradigm gained significant acceptance; there it is widely available in the form of thread libraries. In most other areas, however, message passing is dominant. This is mainly due to the standardization towards only a few portable libraries and to their straightforward implementation, even on loosely coupled architectures.

1.2 The Approach

The work presented here aims at overcoming these shortcomings connected with shared memory programming and at providing users with a choice of the appropriate programming paradigm for their application needs, even on loosely coupled architectures. This is done by providing a general and open framework for shared memory programming. This framework is based on the following three principles:

1.2.1 Benefiting from the Rise of NUMA Architectures

As briefly mentioned above, an implementation of shared memory programming systems on loosely coupled architectures without any hardware support suffers from some inherent performance problems. This work therefore focuses on architectures that offer some support for shared memory programming models, but maintain the scalability of loosely coupled architectures, as is found in NUMA (Non Uniform Memory Access) systems. More specifically, this work focuses on clusters of PCs connected using the Scalable Coherent Interface (SCI) [75, 92], which enables direct remote memory access to any physical memory location in hardware. This is accomplished using a global physical address space created by the network, which can be used as the basis for shared memory programming.

In general, NUMA architectures, whether they are cluster–based or implemented as more integrated Massively Parallel Processor (MPP) systems, require only limited hardware resources for their remote memory access capabilities and hence can be implemented in a straightforward manner. In addition, they are not based on a centralized component or global maintenance protocol and therefore show very favorable scaling properties. On the software side, the remote memory access capabilities can be exploited for both shared memory programming and high performance messaging [51], making these kinds of platforms very attractive for a broad range of users and application domains. Due to these properties, all kinds of NUMA architectures are becoming increasingly popular and could rise to become the dominating architecture in the near future, thereby making work in this direction very promising.

1.2.2 Exploiting Hybrid–DSM Techniques

In order to exploit such NUMA architectures in a loosely coupled scenario, as given in cluster architectures, a new type of DSM system must be developed. On one side, it has to directly exploit the benefits provided by the NUMA hardware, while on the other side, a strong software component needs to ensure a global view of the distributed system across the individual nodes and distinct operating system instances. This results in a hybrid DSM scheme, which merges the hardware DSM support found in NUMA systems, in the form of remote memory access capabilities, with the global memory management of traditional software DSM systems. In such a system, any communication is efficiently performed directly in hardware without any protocol overhead; it thereby forms a bridge between SW–DSM systems on cluster architectures on one side and complex hardware supported shared memory systems on the other.

For the implementation of a hybrid DSM system, one major challenge needs to be solved: the gap between the global physical memory provided by the NUMA system and the global virtual memory abstraction required for shared memory programming needs to be closed. This is achieved in tight cooperation with the underlying operating system through a global memory management component. This software module extends the concept of virtual memory to all participating nodes within the system. The result is a global virtual memory abstraction backed transparently by distributed physical resources, which can then directly be used for any kind of shared memory programming.
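To make this idea more concrete, the following sketch shows, in strongly simplified form, how one contiguous virtual region can be backed page by page by physical memory with different home nodes. It is an illustration only, not the SCI-VM implementation described in Chapter 4: the local node rank, the round-robin page distribution, and the map_remote_via_numa() helper are assumptions made purely for this example, and the "remote" mapping is stubbed with ordinary anonymous memory so that the sketch actually runs.

    /* Conceptual sketch only (not the SCI-VM code): one virtual region is
     * backed page by page.  Pages whose home is the local node come from
     * ordinary RAM; pages homed on other nodes would be backed by mappings
     * into the NUMA/SCI window exposing remote physical memory. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    enum { NODES = 4, PAGES = 8 };

    static int this_node = 0;   /* rank of the local node (assumed) */

    /* Stand-in for mapping a remote physical page through the interconnect;
     * here it simply returns anonymous local memory so the sketch runs. */
    static void *map_remote_via_numa(void *addr, size_t len, int home_node)
    {
        (void)home_node;
        return mmap(addr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }

    int main(void)
    {
        size_t psz = (size_t)sysconf(_SC_PAGESIZE);

        /* Reserve one contiguous virtual range; a real Hybrid-DSM system
         * reserves the same range on every node so that pointers into the
         * global region stay valid cluster-wide. */
        char *gvm = mmap(NULL, PAGES * psz, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (gvm == MAP_FAILED) { perror("mmap"); return 1; }

        for (int p = 0; p < PAGES; p++) {
            int home = p % NODES;          /* simple round-robin distribution */
            void *at = gvm + p * psz;
            void *res = (home == this_node)
                ? mmap(at, psz, PROT_READ | PROT_WRITE,      /* local RAM */
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0)
                : map_remote_via_numa(at, psz, home);        /* remote page */
            if (res == MAP_FAILED) { perror("page map"); return 1; }
            printf("page %d at %p -> home node %d\n", p, at, home);
        }

        strcpy(gvm, "transparently shared"); /* plain loads/stores from here on */
        puts(gvm);
        munmap(gvm, PAGES * psz);
        return 0;
    }

Once such a mapping is in place, applications access the global region through ordinary memory instructions; the physical distribution of the backing pages becomes invisible at the programming model level, which is exactly the property the hybrid DSM core has to provide.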

1.2.3 Towards an Open Architecture for Shared Memory

Such a hybrid DSM system alone, however, only has the ability to solve one of the problems mentioned above, namely the performance problems of DSM systems in loosely coupled environments. The second problem, the availability of (too) large a number of different APIs and programming models, would remain, if not worsen further, because the new hybrid DSM would introduce yet another API. On the other side, it is also illusory to work towards a unified programming model since it would be impossible to convince people to retarget applications and systems towards a new API based only on its claims to be “better” or “unified”.

Therefore, another approach has to be found to help users deal with the abundance of programming models in a manner that allows them to avoid extra porting or maintenance efforts for their applications. The solution for this problem proposed by the work presented here and implemented in HAMSTER (Hybrid–dsm based Adaptive and Modular Shared memory archiTEctuRe) consists of a comprehensive framework which enables the availability of multiple models on top of a single system. This system is designed in a way that additional programming models can easily be added without requiring much implementation complexity, thereby making it feasible to create as many programming models as required.

With this approach, the user is no longer bound to the one programming model provided by the target system, but has the option to choose a model fitting the particular application. This can either be an existing programming model or API adopted from another platform to ease the porting of existing applications, or a new programming model specifically designed to support a special class or domain of applications. In both cases, the user is relieved of tedious porting and implementation efforts, thereby significantly easing the use of shared memory environments.

1.3 Contributions

This work provides the following main contributions:

• Modular framework for arbitrary shared memory programming models
The HAMSTER system, as presented in this work, provides the ability to implement almost any kind of shared memory programming model on top of a single core. With that, it enables users to work with different models on a single system at the same time without having to worry about obtaining, installing, and maintaining a large number of models or porting a given application to the particular model available on the target machine, which is often a cumbersome task. This system therefore allows users to choose the programming model most suited to the application needs for an efficient implementation rather than to the hardware and software constraints on the target platform.
This capability has been tested within this work based on several selected programming models. These models range from full thread libraries to explicit shared memory programming models and also include the emulation of various APIs from traditional SW–DSM systems. Each of these programming models was implemented with only limited complexity, proving both the feasibility and the flexibility of the pursued approach.

• Orthogonal support for shared memory services
This ability to support almost any programming model on top of a single framework is created by providing a large variety of shared memory services. These services include facilities for memory, consistency, synchronization, and task management and are organized in several distinct and independent modules. The modules themselves are organized in a way that ensures a full orthogonality between them, i.e. each service can be used without unwanted side effects. This guarantees the required flexibility for the easy and safe implementation of shared memory programming models on top of these services.
Each of these management modules has been implemented in a way ensuring a maximum of flexibility and performance. Special care in this respect has been given to synchronization and consistency enforcing mechanisms. Both have been implemented in a very lean manner taking direct advantage of present hardware mechanisms, thereby avoiding unnecessary overheads and providing a rich functionality. This enables the user of HAMSTER to create the necessary synchronization constructs and consistency models for the intended target programming models in an efficient and straightforward manner without high implementation complexity.

• Hybrid–DSM core for NUMA–based clusters
In order to guarantee an efficient performance for all programming models implemented on top of HAMSTER, the framework includes a new type of DSM system, a hybrid hardware / software DSM system. It directly exploits the hardware capabilities of the underlying NUMA architecture without losing the two main advantages of loosely coupled architectures, scalability and cost effectiveness. It provides the necessary support to transform the NUMA support for a global physical memory into a global virtual memory abstraction suitable for the implementation of shared memory programming models.
A prototype based on these concepts has been implemented on top of clusters of PCs interconnected by the Scalable Coherent Interface (SCI) [75, 92]. This prototype shows both the feasibility and the efficiency of the approach presented in this work. The overall concepts, however, are not restricted to this specific NUMA architecture alone, but can in principle be applied to any NUMA–like architecture and hence represent a general approach for shared memory programming on these kinds of architectures.

• Detailed performance evaluation
This work includes a detailed performance assessment of the overall system using several applications and computational kernels from various areas as well as an evaluation of the individual components used to create the shared memory services and management modules. This allows a detailed insight into the system characteristics and provides useful hints for future users of the system.

In summary, this work provides a comprehensive framework which aims at overcoming the current problems of shared memory programming for loosely coupled architectures with NUMA characteristics. With its ability to support arbitrary shared memory programming models, it guarantees a high degree of flexibility and openness. Additionally, these properties are achieved without sacrificing performance since the overall system is based on an efficient Hybrid–DSM core, capable of directly exploiting the advantages of NUMA architectures.

1.4 Thesis Organization

The thesis is organized as follows: Chapter 2 introduces the fundamental concepts of this work from both the programming model and the architectural point of view and briefly discusses the context from which it originates. The work itself, the HAMSTER framework, is introduced in Chapter 3 together with a discussion of its fundamental design guidelines and principles.

The following chapters discuss the various components of the HAMSTER system in detail: Chapter 4 presents the hybrid hardware / software distributed shared memory system, followed by the discussion of the various shared memory services provided by several independent modules in Chapter 5. These services are then used to form shared memory programming models, which is shown in Chapter 6. In Chapter 7, the overall system is evaluated using two larger applications from the area of nuclear medical imaging. Based on these codes, both HAMSTER's capabilities to support new codes on top of special programming models and the porting of existing codes using standard programming models have been evaluated, proving the efficiency and wide applicability of the HAMSTER framework. In conclusion, Chapter 8 offers a brief outlook on future work and closes with some final remarks.

The work is completed with a short manual containing information on how to run applications using HAMSTER in Appendix A and with a small example of a HAMSTER–based programming model in Appendix B.

Chapter 2

Background

Before going into the details of the shared memory framework for NUMA–based clusters proposed in this thesis, it is necessary to look at a few basic concepts. This chapter first discusses the principles of shared memory from both the programmer's and the architectural point of view and presents the problems associated with different approaches. It then introduces a NUMA–based cluster interconnection fabric — the Scalable Coherent Interface (SCI) — as a suitable tradeoff between hardware support for shared memory programming and architectural scalability and provides a detailed overview of its history, current status, and its use in both the commercial world and the research community. The chapter ends with a brief description of the context of this work.

2.1 Shared Memory Programming

Shared memory programming generally refers to parallel programming models that exhibit a global virtual address space which can then be used to share data implicitly among multiple tasks potentially located on different processors. Synchronization between the individual tasks within a parallel application is accomplished by using additional explicit mechanisms specific to the programming model. Additionally, the task model, i.e. the control over the concurrent activities within a parallel application, varies from model to model and can range from a static task assignment to processors to a fully dynamic scheme with task creations and destructions during the runtime of the application.

Compared to other paradigms, this basic approach is often seen as more intuitive and easier to learn since it eliminates the need for explicit communication and data distribution. Rather, data is shared in the global address space and can be accessed from there by any node at any time. This simplicity makes the shared memory paradigm especially suitable for programmers with little experience in parallel programming or for those lacking a formal science background.

2.1.1 Definition of Shared Memory Programming Models

The exact connotation of the term “Shared Memory” is influenced by the direction from which it is seen: when looking at it from an architectural view, it normally denotes the ability of the architecture to access global data, while from a programming model point of view, the sharing of data is emphasized. In the latter context, often the term “Shared Variable Programming” [33] is used as a synonym.

In the context of this work, the term “shared memory programming model” refers to parallel programming models which exhibit a single global virtual address space to all tasks within the parallel program. This address space is then used for any communication between the tasks and does not require the use of any explicit communication or copy routines. Besides this basic definition of memory behavior, no further restrictions on the type of programming model used are set forth. This refers especially to the synchronization mechanisms and the task structure offered by the programming model. Their definitions are intentionally left open to include a wide range of models and hence to increase the scope of this work.
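To make this definition concrete, the following minimal POSIX threads fragment (an illustration only, not taken from any of the systems discussed in this thesis) shows the two ingredients just named: the counter is shared implicitly through the common address space, while the synchronization is explicit through a mutex.

    /* Two threads share 'counter' implicitly via the common address space;
     * only the synchronization (the mutex) is explicit. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                     /* shared implicitly */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);           /* explicit synchronization */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);      /* 200000, without any explicit communication */
        return 0;
    }

The models surveyed in the next section all provide these two ingredients in some form; they differ mainly in how tasks are created and in how synchronization and memory consistency are expressed.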

2.1.2 Available Models and APIs

Based on this general definition, many different types of shared memory programming models and systems exist. The following list provides an overview of the major subtypes:

• Thread libraries
The most typical representatives of shared memory programming models are thread libraries. Such libraries are present in any modern operating system and are mostly used to exploit the capabilities of SMPs. Examples of thread libraries are POSIX threads [219] as implemented e.g. in Linux, SolarisTM threads [148], a slight variation for SUN SolarisTM platforms, Win32 threads [154] available in Windows NTTM and Windows 2000TM, and JavaTM threads [104] as part of the standard JavaTM Development Kit (JDK).
In addition to these operating–system–inherent implementations of thread packages, several standalone thread libraries exist, mostly from the research community. Examples are the FSU threads [163] and the ACE thread package [191].

• Distributed Shared Memory systems
Distributed Shared Memory (DSM) systems aim at emulating a shared memory abstraction on loosely coupled architectures in order to enable shared memory programming despite missing hardware support. They are mostly implemented in the form of standard libraries and exploit the advanced user–level memory management features present in current operating systems.
In recent years, a large number of such systems, each with its own API, have been developed by the research community, but only one, the TreadMarks system [5], has made it to the product stage. Examples for research systems are IVY [136], the predecessor of all DSM systems, Munin [29], Brazos [209], Shasta [189], and Cashmere [213]. A more comprehensive list and a detailed description of DSM systems can be found in Chapter 4.1.

• Macro packages for parallel programming
Special macro packages can also be used for the construction of parallel shared memory applications, although they do not represent full programming models. They merely provide an easy and high–level way to specify the intended parallelism in the target applications and provide the appropriate preprocessor rules to transparently transform these into intermediate applications based on a second, often platform specific, shared memory programming model at a much lower level of abstraction. The most prominent example of this kind of approach are the ANL or parmacs m4 macros, which are used within the SPLASH [206, 233] benchmark suite and exist for a variety of different shared memory platforms.

• Parallel Programming Languages with global address space
In addition to the library–based programming models discussed so far, parallel programming languages based on the shared memory paradigm also exist. The most prominent example among them is High–Performance Fortran (HPF) [236], a data parallel language which allows the specification of parallelism based on shared arrays. A compiler then transfers the program into a message passing code which can then be executed on the target machine. The programmers, however, do not need to deal with the translated code and therefore maintain the shared memory view on their application at all times. Due to these properties, HPF has gained acceptance among the Fortran and scientific computing community and is available on most MPPs.
Besides HPF, further parallel shared memory languages, again mostly research–based, exist. An example is ZPL [139], a data–parallel array programming language developed and maintained at the University of Washington.

• Program Annotation Packages
Between library based models and full shared memory programming languages, a third principal approach exists which is based on code annotations. A quite renowned approach in this area is OpenMP [172], a newly developed industry standard for shared memory programming on architectures with uniform memory access characteristics. In contrast to HPF, OpenMP focuses mostly on the parallelization of loops. OpenMP implementations use a special compiler to evaluate the annotations in the application's source code and to transform the code into an explicitly parallel code which can then be executed. In contrast to full parallel languages, annotated codes can still be compiled with standard compilers which ignore the annotations and produce normal sequential codes (a short sketch of this dual use follows after this list). Hence a common code basis of sequential and parallel versions can easily be maintained.

• Explicit shared memory
Lying on the border of the definition of shared memory programming models are explicit shared memory programming models. In such systems, accesses to remote data are no longer fully transparent, but are normally executed through special remote memory access routines provided by the programming model which copy the data to the intended remote location. Accesses to local data, however, are still carried out directly without the need to call specific routines. An example for such an explicit model is the so–called “shmem” library provided by SGI and Cray for their MPP machines [98].
One step further than the Cray shmem library are approaches like Linda [230] or JavaTM Spaces [158]. They no longer provide even a transparent access to any part of the data, but rather require users to explicitly call routines for data storage and retrieval at any time. However, a global address space of some kind is also present in these approaches, making them at least related to shared memory.
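To illustrate the annotation–based approach described in the list above (the example is generic and not tied to any particular compiler or system discussed in this thesis), the following fragment marks a loop as parallel. An OpenMP–aware compiler distributes the iterations across threads, while any other compiler simply ignores the pragma and produces a correct sequential program — which is precisely how a common code basis for sequential and parallel versions can be maintained.

    /* The pragma is evaluated only by OpenMP compilers; all others ignore it. */
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = i;

        #pragma omp parallel for    /* ignored by non-OpenMP compilers */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }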

2.1.3 Shortcomings of Shared Memory Programming

Despite the advantages of shared memory programming discussed above, this paradigm has by far not reached the same level of acceptance as the message passing paradigm, especially in the area of High Performance Computing (HPC). This can be attributed to mainly two shortcomings inherently connected with shared memory programming:

Abundance of Different APIs without a Dominant Standard

One shortcoming is already directly visible from the sheer size of the list of examples of shared memory programming models presented above. This abundance of different models severely restricts both code portability and reuse since codes and code parts frequently have to be ported from one shared memory programming model to another. In addition, each programming model may have subtle differences in both syntax and semantics with regard to its API, thereby causing ports to be cumbersome and error prone. This results in a steep learning curve for programmers dealing with more than one system or programming model.

Limited Scalability of the Underlying Architectures

The second shortcoming comes from an architectural point of view. Shared memory programming normally requires support to be present in the underlying hardware architecture. Traditionally, such support is only available in tightly coupled systems, such as Symmetric MultiProcessors (SMPs). These architectures, due to their need for a tight coupling between the individual processors, generally exhibit rather unfavorable scaling properties and hence provide only a limited amount of aggregated computational power, limiting their use.

2.2 Architectural Support for Shared Memory

In order to further illustrate this shortcoming of limited scalability and to point towards approaches to overcome it, it is necessary to briefly look at the various types of memory organization in existing architectures and their support for shared memory programming. This will help in understanding the implications introduced by hardware shared memory support with respect to scalability and performance and will allow judging the suitability of existing architectures for shared memory programming.

Figure 2.1 Memory organization of UMA architectures (several CPUs, each potentially with its own cache, attached to a single memory via a common system bus).

To structure this discussion, a simple and straightforward classification of architectures based upon their type of global memory access is used. Based on this scheme, three major types can be identified:

• UMA – Uniform Memory Access architectures
In this class of architectures, any memory location can be reached from any processor with the same access characteristics in terms of cost (i.e. uniform).

• NUMA – Non Uniform Memory Access architectures
Like UMA systems, these architectures allow any processor to access any memory location directly, but the access characteristics vary with respect to the access latency.

• NORMA – NO Remote Memory Access architectures
In this class of architectures, no direct hardware support is available to access memory locations other than those local to a processor.

2.2.1 Full Hardware Support within UMA Systems

The first class of architectures discussed in more detail are the UMA (Uniform Memory Access) architectures. Their general structure is shown in Figure 2.1. A number of processors, potentially with additional caches in the memory path, are connected to a single global memory through a common system bus. Such busses are normally only able to cover short distances due to their electrical properties, which leads to systems with all processors installed within one box. Such systems are therefore often referred to as tightly coupled systems.

This organization is currently widely used in Symmetric MultiProcessor (SMP) machines. In these machines, several processors of the same type and speed are coupled to a single system. In addition, none of these processors has a special role; instead all nodes are configured in a fully symmetric way¹. Typical configurations range from 2 or 4 processors in commodity PC–based systems [38] to up to 16 or 32 processors in high–end servers, like the SGI Power ChallengeTM [58] and the SUN Enterprise systems [251].

As any processor is connected to the global memory through the system bus, it can directly access any memory location, resulting in full hardware support for shared memory. In the presence of caches, SMP machines also provide an appropriate cache coherency protocol [175], which automatically guarantees a coherent state of all involved caches. This maintains their transparency known from single processor systems also in such a multiprocessor scenario. On the software side, these systems are normally equipped with the appropriate operating system support for multiple processors, especially with respect to a global virtual memory management. The shared memory can therefore be easily exploited by the users without the need for any additional software complexity.

The major drawback of these architectures is their limited scalability. Any access to the memory has to be performed across a single bus, leading to a major bottleneck with rising numbers of processors. This can be seen with respect to both performance, due to bus contention, and space, due to the limited distances such buses are able to cover. In recent years, a significant amount of effort has been invested in the area of high–end servers to reduce this bottleneck, leading e.g. to the PowerPathTM bus system [58] used in the SGI systems. However, the principal problems remain. Due to this, the number of processors with which an SMP system is still able to provide sufficient performance will always be limited.

The only way to overcome this problem is to leave the bus–based design and deploy new, more scalable interconnection technologies which do not suffer from the contention problem. An example for such a novel approach can be found in the SUN Starfire architecture [34], which couples up to 16 boards with 4 Sparc processors each using point–to-point links through a highly efficient crossbar backplane. Such a design, however, requires very complex hardware in order to keep the total memory latency both uniform and at an acceptable level.

2.2.2 Improving Scalability with CC–NUMA

This scalability problem of SMPs with UMA organization has led to the development of CC–NUMA machines. In this class of architectures, whose typical layout is depicted in Figure 2.2, again any processor has direct access to any memory location in the whole system. This memory, however, is now distributed across the individual processor nodes and no longer available only at a single location. This introduces a distinction between local and remote memory, i.e. memory on the same node as, or on a different node than, the accessing processor. This distinction with respect to memory location consequently also leads to a distinction in memory access types, which can likewise be local or remote and therefore incur different (non–uniform) memory access times.

1 The only exception is normally visible at boot time, where one processor takes the role of a master and controls the further initialization of the remaining processors.

Figure 2.2 Memory organization in CC–NUMA architectures.

This concept of CC–NUMA architectures enables the construction of systems with a larger number of processors since it allows a more flexible packaging with individual nodes, each with its own memory, and replaces the single common bus with a more flexible inter–node interconnection fabric. To further raise the number of processors, CC–NUMA machines often use a hybrid scheme in which SMP nodes with a small number of processors sharing a local memory are connected by a CC–NUMA interconnection fabric.

As in the UMA case, a cache coherency protocol (hence the name CC–NUMA) is deployed to maintain the transparency and consistency of the caches in the system. Its implementation, however, is normally much more complex than in a UMA scenario because it has to take the increased latencies for remote accesses and the potentially higher number of involved CPUs and caches into account. This has led to new approaches in the area of cache coherency protocols [175, 86, 202].

The concept of CC–NUMA machines (sometimes referred to as S2MP: Scalable Symmetric MultiProcessing) has gained quite a bit of popularity in the last few years and has led to both research prototypes, like the FLASH architecture [128], and a number of commercial developments which are used for both server and scientific computing applications. Commercial examples include the SGI O2000/O3000TM series [131], which couples dual–processor SMP nodes to a CC–NUMA architecture, the ExemplarTM [35] machines by Hewlett Packard (formerly by Convex) with their 8–way SMP nodes connected by four independent cross–node links, and the NUMA–LiineTM [37] by Data General, which couples 4–way SMP nodes based on the PC architecture using a CC–NUMA interconnection fabric. These systems allow processor numbers in the range of 32 to 128 processors and hence significantly exceed the typical number of processors in current SMP systems.

Together with appropriate software mechanisms, present in proprietary operating systems or extensions thereof, which are available on all of these machines, the concept of CC–NUMA architectures, like their UMA counterparts, enables users to directly exploit shared memory programming in a fully coherent global address space. The distributed nature of the underlying memory resources, however, requires the user to take this distribution into account and to optimize with respect to data locality, which leads to additional complexity in using such systems for parallel processing.

2.2.3 Transitions Towards Pure NUMA Systems

A problem of these CC–NUMA architectures is their often quite complex cache coherency protocol. While extremely useful for the user, as it maintains the cache transparency known from single processor systems, it again limits the scalability of the overall system. This is caused, on the one hand, by the complexity of the protocol, which can lead to performance problems, and, on the other hand, by the requirement to couple the individual nodes tightly from a logical point of view in order to allow the maintenance of a global state.

This observation has inspired non–CC–NUMA architectures without a global cache coherency protocol, which will simply be referred to as NUMA architectures in the following. Such systems are capable of supporting a larger number of processors since the only hardware requirement imposed by a NUMA system is the ability to perform loads and stores to remote memory locations. The implementation of this capability requires only very little hardware complexity, enabling a simple and cost effective system design. This reduced complexity on the hardware side, however, leads to a higher complexity demand in software, as applications have to compensate for the missing hardware coherency. This can, however, even lead to higher performance compared to CC–NUMA machines, as the coherency control can be adapted to the application needs instead of using a general cache coherency protocol.

Examples for this kind of architecture are the Toronto NUMAchine [70], a research prototype based on a hierarchical interconnection topology, and the T3D/T3ETM line [235, 120, 124] by Cray Inc. (now part of Tera Computer Company). Especially the latter has had significant commercial success. It allows the coupling of up to 2048 processors in a three dimensional torus topology. Due to the programming complexity discussed above, these machines are mostly programmed using message passing libraries like PVM [60] or MPI [152], which have been optimized for this NUMA architecture. A direct exploitation of a pure shared memory programming model is only available on this system in the form of an explicit shared memory programming model using put/get semantics, since such a model is capable of hiding the missing hardware coherence. This is possible because all communication is performed with explicit calls, and these points in the execution can then be used to ensure the correct memory coherency behavior.
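To make this explicit style more concrete, the following minimal sketch illustrates how a put/get model hides the missing hardware coherence. The interface (numa_put, numa_get, numa_barrier) is purely hypothetical and only mimics the style of vendor libraries such as Cray's shmem; it is not the actual API of any of the systems mentioned above.

/* Hypothetical put/get interface in the style of explicit shared memory
 * libraries for non-cache-coherent NUMA machines; the names below are
 * placeholders, not a real vendor API. */
#include <stddef.h>

#define CHUNK 1024

void numa_put(void *remote_dst, const void *local_src,
              size_t nbytes, int target_node);  /* explicit remote store */
void numa_get(void *local_dst, const void *remote_src,
              size_t nbytes, int source_node);  /* explicit remote load  */
void numa_barrier(void);                        /* synchronizes all nodes and
                                                   orders outstanding puts  */

/* Symmetric buffer: allocated at the same address on every node. */
double partial[CHUNK];

void exchange(int right_neighbor)
{
    /* Communication happens only at these explicit calls, so the library
     * can enforce memory coherence exactly here; no global hardware
     * cache coherence protocol is needed. */
    numa_put(partial, partial, sizeof(partial), right_neighbor);
    numa_barrier();  /* after this point, all puts are globally visible */
    /* It is now safe to read the data deposited by the left neighbor. */
}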

2.2.4 Shared Memory for Clusters

On the other end of the architectural spectrum with respect to shared memory support are the so–called NORMA (NO Remote Memory Access) architectures, which offer no direct hardware support for shared memory programming. Systems in this class of architectures are often built from independent nodes of either single processor or small SMP type, as depicted in Figure 2.3. The nodes are then interconnected using a general purpose network like Ethernet. In contrast to the architectures discussed so far, this network is normally not attached to the main system bus, but rather through an I/O bus and a separate Network Interface Card (NIC).

Figure 2.3 Memory organization in clusters or NoWs with NORMA characteristics.

Due to the ability to use standard single–processor or small SMP nodes as the basic building blocks, these machines are often referred to as Networks of Workstations (NoWs), Clusters of PCs (CoPs), or Piles of PCs (PoPs), depending on the type of nodes used.

Due to the missing support for shared memory, any communication in this kind of system has to be performed in the form of explicit messages. These messages can then be transferred through the NIC to the network and require an explicit receive on the target node. Consequently, these architectures are by default programmed using pure message passing paradigms, mostly in the form of standard libraries, since this directly matches the underlying hardware facilities.

Despite the lack of hardware support for shared memory programming, the advantages of this paradigm have motivated researchers to provide a global memory abstraction purely in software for clusters and NoW architectures. This has led to so–called Distributed Shared Memory (DSM) systems, which either use the paging mechanisms of modern virtual memory systems or instrumented applications to track user accesses to global data. The information gathered is then forwarded to other nodes using explicit communication mechanisms. Therefore, such approaches allow shared memory programming in NORMA environments despite the missing hardware support and are in principle widely applicable.

Until recently, these approaches have been limited to a certain extent by the network performance available in these systems. However, with the rise of high–speed System Area Networks (SANs), like the Scalable Coherent Interface (SCI) [75, 92], Myrinet [20], ServerNet [82], or GigaNet [237], this situation has greatly improved. These technologies allow direct user–level access to the network interface without the usual protocol stack and therefore enable low latency and high bandwidth communication.
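The page–based variant of this tracking can be illustrated with a small sketch: shared pages are access–protected, and the resulting page faults trigger the necessary communication before the faulting access is transparently restarted. The code below is only a schematic example for UNIX–like systems; fetch_page_from_owner stands in for whatever explicit messaging layer a concrete SW–DSM system uses and is not a real API.

/* Schematic page-fault based access tracking, as used by many SW-DSM
 * systems; fetch_page_from_owner() is a placeholder for the DSM
 * system's explicit communication layer. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

extern void fetch_page_from_owner(void *page_addr, size_t page_size);

static size_t page_size;
static char  *shared_base;
static size_t shared_len;

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)info->si_addr & ~(page_size - 1));

    if (page < shared_base || page >= shared_base + shared_len)
        _exit(1);  /* a real segmentation fault, not a DSM access */

    /* Bring the page up to date via explicit communication, then allow
     * the interrupted load/store to be restarted transparently. */
    fetch_page_from_owner(page, page_size);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

void dsm_init(void *base, size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));

    page_size   = (size_t)sysconf(_SC_PAGESIZE);
    shared_base = (char *)base;
    shared_len  = len;

    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* No access initially: the first touch of each shared page traps
     * into the handler above. */
    mprotect(base, len, PROT_NONE);
}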

2.3 NUMA for Clusters: The Scalable Coherent Interface

Among these systems, the Scalable Coherent Interface (SCI) [75] has a special role. It not only represents a state-of-the-art System Area Network (SAN), but also allows remote memory accesses. It therefore forms a bridge between cluster architectures and NUMA systems by maintaining the scalability and cost effectiveness of NORMA systems while providing NUMA–like remote memory access capabilities. SCI is therefore well suited for providing both message passing and shared memory programming on clusters. The latter is explored in detail in this work.

2.3.1 History and Principles of SCI

The work on the Scalable Coherent Interface originated in the late 1980s from the Futurebus+ project [93], which aimed at the specification of a successor to the Futurebus [90]. During this work, it became evident that the traditional bus–based approach for large–scale multiprocessor architectures would no longer be capable of keeping up with the rising speed of coming generations of microprocessors and would need to be replaced by a more scalable concept. On the other hand, the services provided by bus–based systems were to be preserved in order to keep the familiar environment and to enable an easy interface to existing and future bus systems. This led to the specification of SCI, which was finished in 1991 and became the formal IEEE/ANSI standard 1596 in 1992 [92].

It specifies the hardware interconnect and protocols to connect up to 65536 SCI nodes, including processors, workstations, PCs, bus bridges, and switches, in a high–speed network. SCI nodes are interconnected via point–to–point links in ring–like arrangements or are attached to switches. The logical layer of the SCI specification defines packet–switched communication protocols using a state-of-the-art split transaction scheme. It decouples request and response for a remote operation, each being transferred in a separate packet. This enables every SCI node to overlap several transactions and allows the latencies of remote memory accesses to be hidden. Optionally, the SCI standard defines a directory–based cache coherence protocol which enables the construction of CC–NUMA architectures.

The Scalable Coherent Interface specification, however, does not represent the end of the road. Its technological principles have been used in several successor projects. The most visible result is the Serial–Plus project (IEEE P2100), which was formerly known as Serial Express or IEEE 1394.2. Its intent is to provide a scalable extension for IEEE 1394 [94] installations, also known as FirewireTM, an interconnection technology used in various consumer products allowing fast and easy data transfer between digital imaging equipment like digital cameras.

2.3.2 Shared Memory Support in SCI

The central feature of SCI, which distinguishes it from any other mainstream cluster–oriented interconnection fabric, stems directly from the original starting point of SCI, the direct replacement of buses while maintaining bus–like services.

Figure 2.4 SCI HW–DSM using a global address space (from [74]).

Figure 2.5 The various mapping levels in PC–based SCI systems.

This feature is the ability to support remote memory accesses, both reads and writes, directly in hardware by providing a Hardware Distributed Shared Memory (HW–DSM). For this purpose, SCI establishes a global physical address space which allows any physical memory location throughout the system to be addressed. Any address within this SCI physical address space has a length of 64 bits, with the upper 16 bits specifying the node on which the addressed physical storage is located and the lower 48 bits specifying the local physical address within the memory of that particular node.

Nodes can access this global physical address space, and hence any physical memory location within the whole system, by mapping parts or segments of this memory space into their own address space. Once such a mapping has been established, users can issue standard read and write operations to those mapped segments; these are then translated by the SCI hardware into accesses within the global address space and forwarded to the intended remote memory. A simplified picture illustrating this concept is shown in Figure 2.4. It shows four PC nodes and the SCI physical memory together with a few example mappings.

While the SCI standard defines this basic principle and also specifies the protocol used to transfer the data across the network, it purposely omits any specification regarding the actual remote memory mapping process needed to access the SCI physical address space. This process is very system and architecture dependent and needs to be adjusted to each platform. On PC–based cluster architectures, which form the focus of this work, the concrete implementation of SCI is more or less dictated by the I/O subsystem available in current PCs. Therefore, SCI adapters intended for PC architectures generally come in the form of PCI adapter cards and implement a PCI–SCI bridge.

The concrete mapping process in such a PCI–based SCI system is shown in Figure 2.5. Any physical memory location within the system is potentially available within the whole cluster since it can be addressed through the global SCI physical memory. Parts of this address space can then be mapped into the PCI or I/O space controlled by the PCI–SCI bridge of a particular node (in the figure on the right side) using a mapping table on the SCI adapter, the so–called Address Translation Table (ATT). Each entry in this table is responsible for a given part of the whole I/O space and allows (almost) arbitrary parts of the SCI physical memory to be made visible within this part of the I/O space. The granularity of these mappings is dictated by the PCI–SCI adapter and ranges, depending on the adapter type and configuration, between 4 Kbyte and 512 Kbyte. Due to constraints with regard to the number of available ATT entries and the total amount of mappable memory, however, this granularity is typically set to values around 64 Kbyte for newer generation adapter cards and is hence significantly larger than the page granularity.

In order to give user processes access to these mapped memory resources, a second level of mappings into the virtual address space of the target processes needs to be performed. This mapping is purely local and does not affect the SCI implementation. In the PC architecture, the I/O address space, including the part used by the SCI adapter for remote memory mappings, is an extension of the physical address space used for the local memory.
Hence, the appropriate parts containing the remote memory are mapped into a virtual address space in the same manner as local memory, by using the processor's Memory Management Unit (MMU).

Once both of these mapping levels have been established, processes can access the mapped virtual memory regions with standard read and write operations. On nodes on which the physical memory is locally available (left node in the figure), these accesses are executed in a conventional manner, while on nodes on which the physical memory has been mapped via SCI (right node in the figure), the memory access results in an access to the I/O space, from where the SCI adapter forwards it to the remote node according to the appropriate ATT entry. The process issuing the memory access, however, does not see the difference between local and remote accesses except for their latency, thereby resulting in the intended transparent NUMA–like shared memory behavior of SCI.

This SCI shared memory support has two major shortcomings. For one, the global address space provided by SCI is solely based on physical addresses, since only locations in the physical memories of the nodes can be addressed. This limitation is due to the fact that the PC architecture has no I/O bus which directly uses virtual addresses (in contrast to e.g. the Sparc architecture with its SBus system [91]). As a consequence, no physical page that is not pinned into memory can be accessed safely, as it could otherwise be evicted by the operating system at any time. Consequently, the SCI shared memory support cannot directly be used to implement a global virtual address space as it is required by shared memory programming models. The user rather has to rely on the sharing of separate pinned memory segments, which do not provide the desired transparency since they have to be explicitly allocated and accessed.

A second shortcoming in PC–based SCI systems is the missing cache coherency between the individual nodes. Even though the SCI standard [92] defines a versatile cache coherency protocol, it cannot be used together with PCI–SCI bridges since the PCI bus [253] does not allow the snooping of memory transactions. As a result, current PCI–SCI implementations disable any caching of remote memory locations in order to avoid cache inconsistencies across nodes. This, however, has a severely negative impact on the read performance.
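From the programmer's perspective, the two mapping levels of Figure 2.5 reduce to connecting to a remote physical memory segment and mapping it into the local virtual address space, after which ordinary loads and stores suffice. The sketch below illustrates this flow with hypothetical wrapper functions; the actual SISCI calls mentioned in Section 2.3.4 provide equivalent functionality but with different names and signatures.

/* Illustration of the two mapping levels from Figure 2.5. The functions
 * sci_connect_segment() and sci_map_segment() are hypothetical wrappers,
 * not the real SISCI API. */
#include <stddef.h>
#include <stdint.h>

typedef struct sci_segment sci_segment_t;

/* Level 1: program an ATT entry on the local PCI-SCI adapter so that a
 * window of the local I/O (PCI) address space refers to pinned physical
 * memory on node 'remote_node'. */
sci_segment_t *sci_connect_segment(int remote_node, int segment_id);

/* Level 2: use the MMU to map that I/O window into the calling process's
 * virtual address space, just like local memory. */
volatile void *sci_map_segment(sci_segment_t *seg, size_t offset,
                               size_t length);

void example(void)
{
    sci_segment_t *seg = sci_connect_segment(/* remote_node */ 1,
                                             /* segment_id  */ 42);
    volatile uint32_t *shared =
        (volatile uint32_t *)sci_map_segment(seg, 0, 64 * 1024);

    /* From here on, plain loads and stores are turned by the SCI hardware
     * into remote memory transactions; no software protocol is involved
     * on the access path. Remote reads are not cached on PCI-based
     * systems and are therefore comparatively expensive. */
    shared[0] = 4711;        /* remote write */
    uint32_t v = shared[1];  /* remote read  */
    (void)v;
}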

2.3.3 Commercial Success of SCI

SCI is currently used in a large variety of scenarios and products. Virtually all of them are based on the technology of a single SCI manufacturer, Dolphin Interconnect Solutions (ICS) [239], despite some efforts by the company Interconnect Systems Solutions (ISS) [241] to provide an alternative source for SCI–based technology. After some initial success, however, these efforts seem to have been abandoned.

The core of any SCI–based system is the so–called Link Controller (LC) chip, which currently exists in its fifth generation. Its speed has been steadily increased over the years and has reached 800 MB/s per link in the latest generation, the LC 3 [9]. It therefore provides a level of performance already very close to 1 GB/s, which was the original goal during the SCI standardization process. Based on this component, Dolphin offers several SCI bridges to various other system busses, including the SUN SBus [91], the PCI bus [253], and PMC [95]. In addition, Dolphin also provides several types of switches which allow the creation of larger systems.

Besides these products from Dolphin, several other companies either use or benefit from Dolphin's SCI technology and offer further SCI–based products. The small Norwegian company SciLab [250] offers a sophisticated SCI tracing and trace analysis tool called SCIview [208], Siemens has developed an optical SCI link called Paroli (now marketed by Dolphin ICS) capable of bridging larger distances than the standard copper based links, and Lockheed Martin has designed a large scale SCI switch.

Additionally, several OEM vendors use the SCI chips and adapters provided by Dolphin to implement their own systems and deliver these as turn–key solutions. SUN Microsystems uses SBus or PCI adapters for SUN EnterpriseTM clusters [157] to enable a high–performance fault tolerance solution, and Siemens integrates SCI bridges into their R600TM mainframes [244] for clustering and enhanced I/O, aiming at the parallel database market.

While these systems use SCI without the optional cache coherency protocol included in the standard, other vendors have used this option and have created SCI–based CC–NUMA machines. Data General successfully distributes its NUMA–LiineTM [37], which is based on commodity multiprocessor PC boards with special access for SCI to the memory bus to enable the maintenance of cache coherency, a concept also used by IBM (formerly Sequent) for their NUMA–QTM systems [89]. Also, the (now obsolete) ExemplarTM system, introduced by HP/Convex, uses SCI links based on Dolphin hardware to connect its SMP hypernodes to a full CC–NUMA system [35].

Figure 2.6 Switched SCI ringlets (left) vs. two dimensional torus (right).

The most common use of SCI, however, which will also be solely considered for the remainder of this work, is its use in commodity PC clusters based on the PCI–SCI bridges or adapters [138] directly developed and marketed by Dolphin. Along with the continued development of the SCI link controllers, these boards have also gone through a number of generations. During the timespan of the project presented in this work, the evolution has gone from the D308, still with the LC 1 and for 32–bit/33 MHz PCI busses, via the D310, with the improved LC 2 chip, to the D320, the first 64–bit PCI card. Just recently, but not in time to be used for this work, the line of adapters was extended by the D330 series, carrying the new LC 3 and capable of using 64–bit/66 MHz PCI busses.

These PCI adapters allow very high–performance end–to–end communication with latencies under 2 μs and a bandwidth of up to 85 MB/s with standard 32–bit/33 MHz PCI busses and up to 250 MB/s (DMA communication) with new 64–bit/66 MHz PCI systems.

By default, all of these adapters can be connected in unidirectional ringlets, the basic topology of all SCI–based systems. Since this approach, however, can only be applied up to a certain limited number of nodes before the network saturates (typically at 8–10 machines, depending on the load generated by the nodes), additional switches are required to allow the connection of multiple independent SCI ringlets. For this purpose, two different approaches are available in today's systems, resulting in two fundamentally different topologies (shown in Figure 2.6). These are the use of traditional switches connecting ringlets of several nodes and the creation of multidimensional tori using small interdimensional switches within each node.

The first option can be realized by using external switching components from Dolphin. The current generation of switches, in the form of the D515 switch [15], is based on the LC 2 and contains 4 ports per switch plus 2 extension ports. Using the latter, the switch can either be expanded to a stacked switch with up to 16 ports or configured as a non–expandable 6–port switch. Each port can be connected to a ringlet with an arbitrary number of nodes, but again the bandwidth limitations of ringlets apply. Internally, the D515 is based on multiple busses delivering the required cross–section bandwidth. The next generation of switches, in the form of the already announced D535 [142], will change this design to a full crossbar, which is expected to result in greater overall performance and better scalability. This switch, which will be based on the LC 3, can also be expanded to much larger configurations than the D515, enabling configurations with more than 128 ports.

The second option is the creation of multidimensional tori by using small SCI ringlets in all dimensions [23]. In such a configuration, each node/adapter is connected to each dimension and uses a small switch integrated into the SCI adapter to provide cross–dimensional packet transmissions. Technically, this solution is realized with separate daughter boards which are plugged into the base PCI–SCI bridge and provide additional SCI link controllers. Currently, up to three–dimensional tori with up to 10–12 nodes in each dimension can be created, resulting in potentially very large systems.

This latter option has been developed together with Scali AS [249], a Norwegian company specialized in "Affordable Supercomputing", i.e. the construction of turn–key solutions for high–performance computing based solely on commodity components. Together with the appropriate software infrastructure, the so–called Scali Software Platform (SSP) [190], which includes both high–performance communication support in the form of the optimized MPI implementation ScaMPI [85, 84] and comprehensive system management and monitoring support, these systems have found wide use in both academia and industry. The concept has also been adopted by Siemens and is now marketed as the Siemens HPC–LineTM [69].

2.3.4 Other SCI–based Research Activities

Along with these commercial developments, a significant amount of research focusing on SCI has been done worldwide. This has led to a very active research community with participants from many different areas of science and has manifested itself in a comprehensive book about the current state of SCI [75], the establishment of a conference series since 1998 [181, 106, 81], and an international working group supported by the ESPRIT programme of the European Union [245] as well as several different ESPRIT projects [240].

Early work has concentrated mostly on the IEEE standardization efforts of the SCI protocol, which led to the ANSI/IEEE Standard 1596–1992 [92], and on SCI's low–level performance, as done e.g. at the University of Oslo within the SALMON project [170], at SINTEF [80], and at the University of Wisconsin [203]. With the rise of other high performance system area networks (SANs), additional studies, e.g. at the Medical University of Lübeck [177], the ETH Zürich [127], and the University of California at Santa Barbara [87], have aimed at comparing and contrasting their performance with SCI.

These performance oriented efforts have been complemented by intensive efforts to form a suitable software environment for the efficient use of SCI–based clusters. As a first step in this direction, several approaches have targeted the design of suitable low–level APIs, which led to several competing solutions: an IEEE standardization effort headed by the physics community entitled PHYSAPI [140, 141], a widely portable interface definition [64] developed within the SISCI ESPRIT project [207, 50] and adopted by Dolphin as their primary API for all platforms, and the driver interface used by Scali systems for efficient MPI support [185]. All three provide approximately the same functionality, i.e. the creation and mapping of SCI shared memory segments, and their competing existence has led to a significant amount of confusion in the SCI community. By now the situation has significantly improved, as the SISCI API has established itself as the main interface for SCI due to its wide availability and direct support by Dolphin. The PHYSAPI approach, on the other hand, seems to have been abandoned, and for Scali–based systems a SISCI–compatible emulation is now available, allowing the execution of SISCI–based codes without changes.

Low–level APIs alone, however, only provide a basic abstraction of SCI's hardware capabilities and are not directly suited as the basis for future application developments. This has led to a thorough investigation of the various issues involved in creating high–performance communication libraries [186, 51, 61, 88, 84]. These studies illustrated the use of direct user–level access to the network adapter to eliminate the protocol stack overhead and the efficient use of the special hardware DSM mechanisms for improved communication performance. The result is a broad range of SCI communication libraries. Examples are ports of the Active Message specification [225] done at the Technische Universität München [49] and the University of California at Santa Barbara [87].
In addition, various fast socket implementations at different levels of the operating system have been undertaken: underneath the TCP/IP stack as a general network driver at the University of Paderborn [218], replacing the TCP/IP stack but still in kernel mode ensuring full protection at the University of Copenhagen [72], and at user level, exploiting the full performance benefits of SCI, at the Technische Universität München [76].

In addition, higher–level message passing layers, which are widely used, have been targeted in order to provide broad support for the transition of existing applications to SCI–based clusters. As part of these efforts, SCI–optimized versions of PVM [60] have been implemented and evaluated at the University of Paderborn [55] and the Technische Universität München [52], and MPI [152] has been adapted to SCI by the RWTH Aachen [234], the Technische Universität Chemnitz [68], and by Scali AS [249] within their previously mentioned ScaMPI system [85]. Also, a port of the CORBA architecture has been undertaken (by INRIA [178]), enabling the convenient use of SCI in coarse–grain object–oriented environments.

Several projects also target the use of shared memory programming for SCI–based cluster systems. Examples for this are the Madeleine / PM2 environment [10, 8] developed at the ENS Lyon, the NOA system from INRIA [151], a DSM system implemented at Lund University [114], the HPPC–SEA DVSM library [40] from the Politecnico di Torino, and the SVMlib [174] from the RWTH Aachen. These approaches, however, only exploit SCI as a fast interconnection system by implementing traditional Distributed Shared Memory (DSM) systems (see also Chapter 4.1). This omits SCI's potential to directly support shared memory in hardware and hence fails to fully exploit the advantages of this interconnection technology.

Only very few projects, in contrast, follow the path of directly exploiting the hardware DSM, since it is connected with complex implementation and system integration issues. The SciOS system [122, 123], jointly implemented by the University of Copenhagen and INRIA, depends mainly on traditional SW–DSM mechanisms but enables the use of direct SCI mappings in scenarios where this is appropriate. The concrete type of implementation used for a particular page, however, is fully hidden from the user through a common interface which is designed after the System V shared memory system. The Yasmin system [25, 180] from the University of Paderborn, on the other hand, provides a simple global memory abstraction directly relying on the shared memory facilities of SCI, but on the basis of coarse grain segments rather than at page level. Additionally, it does not include any shared memory optimizations such as the ability to cache remote memory. These deficiencies can cause severe performance problems, as shown later in this work.

The SCI Virtual Memory (SCI-VM) system [198, 193, 195], which forms the core of the implementation discussed in this work and will be explained later in detail, solves these deficiencies and therefore represents the first approach which directly exploits the hardware DSM features of SCI in an efficient and usable way. In addition, the SCI-VM is not a monolithic system but is integrated in the context of a larger, highly adaptable framework for shared memory programming, the HAMSTER system [145]. This system will be introduced in much more detail throughout the remainder of this thesis.
Most of the projects mentioned above, however, investigate only one side of the overall picture, i.e. either message passing or shared memory programming models. Only few of them attempt a full integration of both paradigms into a single environment: the SMiLE project [109, 74] at the Technische Universität München, which forms the greater context of this work and is explained in more detail in Section 2.3.5, the SISCI project [207, 50], an ESPRIT project combining the efforts of several partners from both the middleware and the application area, and the SciOS/SciFS system developed in a joint project between the University of Copenhagen and INRIA [123, 72], providing both System V shared memory and socket communication.

With the software environments for SCI–based clusters maturing, SCI has started to move more into the industrial field and is now used in a variety of different real–world application scenarios. Examples, which have been carried out in cooperation with research institutions and are therefore publicly known, are the deployment of SCI–based clusters in clinics for computationally intensive algorithms in nuclear imaging procedures [113, 200] (conducted at the Technische Universität München, partly within the ESPRIT project NEPHEW [242]), the use for historic film restoration (implemented within NEPHEW as well as within another ESPRIT project called DIAMANT [246]), the port of the commercial state-of-the-art fluid dynamics code TfC/CfX [252] (AEA Technologies [247] in cooperation with the SISCI project [207, 50]), the use of SCI–based systems for a Synthetic Aperture Radar (SAR) processing application [24] (realized by Scali [249]), a shared memory–based implementation of the GROMOS 96 molecular dynamics code (from the RWTH Aachen [45]), and the exploitation of SCI–based high–performance communication for the optimization of electric fields for high voltage transformers [222] (a joint project between the Technische Universität München and ABB Research in Heidelberg). This list is, however, by no means complete and is only a brief compilation of a few interesting examples of known projects.

Besides these industrially relevant codes, large scale SCI–based applications can also be found in basic research in particle physics. In this area, large scale experiments are designed which deliver a huge amount of raw data. This data then needs to be received, processed, filtered, and stored, requiring high–performance network solutions. Currently, many different interconnection technologies are being investigated, including the Scalable Coherent Interface. Work in this direction is conducted at the University of Heidelberg, targeting a custom solution for an SCI–based data processing farm [227], as well as at the European Laboratory for Nuclear Research (CERN) and the Rutherford Appleton Laboratory (RAL), utilizing mainly commodity components [14, 13].

Next to these software related projects, a few academic projects also deal with developments on the hardware side. These efforts can be roughly split into two main streams: the development of SCI adapter cards and the design and implementation of hardware tracer and monitoring equipment. Work in the former category aims at implementing adapter cards with special features not available in mainstream SCI adapters and at providing access to internals for further research. The first goal is pursued by the Technische Universität Chemnitz for the investigation of user–level DMA in a VIA [36] manner [221], while the latter drives the development of the SMiLE adapter card at the Technische Universität München [2, 1].

The SMiLE card is also used as the basis for the development of a hardware monitor [107, 79, 78]. This monitor facilitates the observation of the network traffic with little intrusion overhead and the generation of memory access histograms across the complete SCI physical address space. The information acquired using this monitor can then be used to optimize applications [108]. In addition to this performance aspect of monitoring, the SMiLE hardware monitor has also been designed to support deterministic debugging [79]. This is enabled through a feedback line from the monitor to the SMiLE adapter which allows packets to be held in the buffer RAM upon recognition of certain memory access patterns or events. This can be used to ensure the repetition of memory access patterns in consecutive debug runs and hence a clean debugging process.

A second hardware monitoring approach is being undertaken at Trinity College Dublin [147]. In contrast to the SMiLE monitor mentioned above, this approach enables the gathering of complete traces in large on–board memories, which can then be used for a post–mortem, off–line analysis. For this purpose, the traces are transferred into a relational database system enabling complex queries across a full trace.

While all research discussed so far deals with real systems, several groups have also based their work on comprehensive simulations of SCI systems. This approach allows more flexibility with regard to the concrete target system and SCI implementation and hence enables the investigation of new or enhanced features. The University of Wisconsin has implemented a simulation environment with the goal of investigating SCI–based systems with thousands of processors, the so–called Kiloprocessor systems [115]. The Polytechnical University of Valencia has implemented a VHDL–level simulation allowing the evaluation of new hardware features [204], and at the University of Oslo an OPNET–based [159] simulation environment [182] has been developed which aims at an accurate modeling of SCI–based systems for performance prediction.

Figure 2.7 Overview of the SMiLE software infrastructure.

In addition, the latter system has also been used to study the impact of improved flow control as well as new fault tolerant routing schemes for SCI networks [183].

2.3.5 HAMSTER as Part of the SMiLE Project

As already mentioned above, the HAMSTER project presented in this work is carried out in the larger context of a clustering project at LRR-TUM2 with the name Shared Memory in a LAN-like Environment (SMiLE), a name that alludes to the hardware DSM capabilities of SCI. This base mechanism is used within the project to implement highly efficient programming abstractions adhering to many different programming paradigms.

The SMiLE project broadly investigates clusters based on this interconnection technology using both hardware and software approaches. This has led to the design and implementation of a custom PCI–SCI adapter [2] and a hardware monitor for the on–line observation and evaluation of memory access patterns across the HW–DSM [111, 217]. In the area of software, the SMiLE efforts have led to the development of a comprehensive software infrastructure (shown in Figure 2.7) featuring programming models from both paradigms, message passing and shared memory. In addition, other less widely used paradigms have also been investigated, including the graphical representation of dataflow and communication by PeakWare [112], a product of the French company Sycomore [243], and a multithreaded scheduling environment (MuSE) [133, 134] based on the principles of dataflow and task stealing.

The main cores of the SMiLE software efforts, however, lie in the area of the two predominant parallel programming paradigms: message passing and shared memory.

2 Lehrstuhl für Rechnertechnik und Rechnerorganisation at the Technische Universität München

The first one can be implemented efficiently and in a straightforward way using the given HW–DSM capabilities. It is represented in SMiLE by various low–level, user–level message passing layers [51], including an implementation of the Active Message concept [225], a fast socket library omitting the expensive TCP/IP stack [229], and a general messaging layer, called the Common Messaging Layer or CML [51]. On the high level side, it is completed by a port of the Parallel Virtual Machine (PVM) [60], delivering the performance benefits of the CML to PVM–based applications [52].

The efforts in the area of shared memory, namely the establishment of a general, programming model independent shared memory concept [145], are presented in this work and will be described in full detail in the following chapters. The concept consists of a distributed shared memory core, the SCI Virtual Memory or SCI-VM [193, 198] (see Chapter 4), which establishes a global virtual memory abstraction across cluster nodes, and a set of management modules. These can then be used for the implementation of a large number of different programming models, allowing the easy adaptation of the overall system to new requirements and existing applications, which guarantees a high degree of flexibility.

In summary, the SMiLE efforts allow the efficient exploitation of clusters interconnected by SCI using a large variety of different programming models and paradigms. This allows both the easy porting of existing codes and the efficient and flexible development of new applications and hence represents an important step towards a full exploitation of SCI–based clusters.

2.4 Summary

The discussion of shared memory programming in this chapter has shown the large diversity of programming models and the problems induced by it. The portability of shared memory codes is severely limited, as ports from one programming model to another are frequently necessary, and the learning curve for programmers is increased as they have to be familiar with several different models. Also from an architectural point of view, the shared memory paradigm is associated with problems: shared memory programming models are supported in hardware only on tightly coupled systems with limited scalability, and software emulations on more scalable, loosely coupled architectures, like clusters, have proven to suffer from performance problems.

Nevertheless, the advantages of the shared memory paradigm, in terms of programmability and easier parallelization, remain and justify the further examination of its adoption for loosely coupled architectures. In order to avoid the performance problems mentioned above, however, some hardware support should be used for the creation of a global virtual memory. An example for systems with such properties are NUMA systems, which allow every node to access remote memory on any other node directly and without the need for any software involvement. However, this occurs without the enforcement of cache coherency. On the other hand, NUMA systems maintain many of the advantages of traditional loosely coupled architectures, such as good scaling properties and a potentially good price/performance ratio. NUMA architectures therefore represent a good tradeoff between the need to support shared memory in hardware and the need to provide sufficient scalability at acceptable cost, making them worth studying in more detail.

The Scalable Coherent Interface (SCI) [75], an IEEE standardized interconnection fabric [92], allows the creation of NUMA systems based on clusters of commodity PCs. Using this technology, the work presented here establishes a comprehensive shared memory programming framework called HAMSTER (Hybrid–dsm based Adaptive and Modular Shared memory archiTEctuRe) which creates a transparent and programming model independent global resource abstraction. It therefore allows the efficient use of shared memory applications on NUMA–based architectures and hence forms a bridge between CC–NUMA architectures and traditional clusters.

Chapter 3

HAMSTER: Hybrid–dsm based Adaptive and Modular Shared memory archiTEctuRe

As can be seen from the discussion in the previous chapter, shared memory programming on top of loosely coupled architectures still faces some major obstacles. The two main problems are the abundance of available programming models, without a dominating standard in sight, and the efficiency problems inherently associated with traditional software approaches. The HAMSTER system, the core of this work, approaches both of these problems in a novel way. HAMSTER, which stands for Hybrid–dsm based Adaptive and Modular Shared memory archiTEctuRe, allows users to implement a large variety of different programming models on top of a single efficient DSM core. This core is designed and implemented for NUMA architectures with hardware DSM capabilities, but without the need to provide either global memory/cache coherence or a specialized global operating system. This ensures that the system remains applicable also to loosely coupled NUMA architectures, like clusters, and therefore profits from their scalability.

On top of this global memory abstraction, the HAMSTER framework then offers a large number of shared memory services available to the user to implement any kind of shared memory programming model. This can include both existing models, to allow the direct porting of existing codes, and new, application field specific models, to ease the implementation of new applications. This system, with its flexibility with regard to programming models, therefore allows true shared memory programming on scalable NUMA architectures to directly benefit from their hardware support for remote memory accesses.

3.1 Related Work

A system like HAMSTER with the clear intent of supporting any kind of parallel programming model of a certain class on top of a single core has, to the best of our knowledge, not been previously attempted. This may stem from the fact that such an approach seems beneficial only in the area of shared memory programming models, as its motivation is taken from the observation that presently no clear standard for shared memory programming exists which is suited for all kinds of architectures and application domains1. In other areas, the situation is different, with one or a few standards being clearly accepted, like PVM [60] and MPI [152] in the area of message passing.

However, even though HAMSTER presents a novel approach as far as the whole system is concerned, its individual components are connected to a significant amount of related work from many different areas. This reaches from the previously mentioned Software Distributed Shared Memory (SW–DSM) systems to the large number of existing shared memory programming models, from the architectural support in NUMA–like systems to operating system issues connected with forming a global resource abstraction, and from issues related to distributed multithreading to global performance monitoring and load balancing. These issues are discussed in the various "State of the Art" sections in this work, included in the chapters on each subcomponent.

1 It should be mentioned that there are some standardization approaches also for shared memory, most prominently OpenMP [172] and HPF [236]. However, their focus is either on specific application domains or on special architectural subclasses (mostly UMA machines).

3.2 HAMSTER Design Goals

Before going into the discussion of the system's details, it is necessary to discuss the main goals governing the design of the HAMSTER system. This will further illustrate the purpose and intended use of the system and will set the stage for the extensive coverage of the system's implementation in the following chapters.

3.2.1 Main Characteristics

The main idea behind the HAMSTER system is to allow users to implement as many different programming models as possible or necessary, with a minimum of complexity, on top of a single efficient core. In order to achieve this goal, the design of HAMSTER needs to exhibit a few key characteristics discussed in the following sections.

Flexibility

One of the prime high–level design goals for such a system is flexibility. Since its intention is to form the framework for large numbers of different, not necessarily a priori known, programming models, it needs to be able to adapt to varying prerequisites and requirements. This is important not only with respect to the services provided by HAMSTER to build programming models, but also with respect to the surrounding environment, which can greatly change depending on the target applications.

Multi–operating system support

As part of this flexibility, it is also advantageous for such a system to support multiple operating systems. This increases the available number of target platforms and therefore further eases the portability of applications. In addition, it also allows the porting of various programming models between the supported operating systems at no extra cost, thereby further widening the total application platform available on top of the system.

No bias towards any particular model

In order to fulfill the goal of enabling the implementation of almost any shared memory programming model on top of the target system, it needs to be designed without any restriction for, or bias towards, a particular programming model. Such a bias would either limit the number of possible target programming models or increase the complexity of their implementations. One of the most critical points in this direction is the task or execution model, as these models can vary greatly between both programming models and operating systems.

The fulfillment of this goal leads to a non–monolithic system without strong ties between the individual functionalities exposed by the system. It has rather the character of a large toolbox of services, available at the discretion of the user and not interfering with each other. This guarantees a maximum of freedom in the implementation of new programming models and does not favor any particular one.

Easy retargetability

In addition to the already discussed flexibility to implement a large number of shared memory programming models, it is also necessary to guarantee a low complexity for such implementations. It has to be easy to retarget the system to a new programming model or API. Only this will allow users to have several different programming models on top of HAMSTER concurrently and hence to fully exploit the strengths of the overall system. This will make it worthwhile to use application field specific APIs or programming environments even for a small number of applications.

This goal can be achieved by providing a large number of services for different scenarios with many options and parameters. By specializing these services, a large part of the intended target programming model can be covered, leaving only very limited functionality to be implemented from scratch for each new model. The total effort for a new programming model will therefore be low, making the creation of many concurrent programming models feasible.

Efficient use of available hardware resources

So far the guidelines have only covered the aspects of multi–programming model support and easy programmability. It is, however, equally important to provide each programming model implemented in such a system with maximum efficiency by exploiting the available hardware resources as directly as possible. In the context of this work, which focuses on the use of clusters of PCs with a NUMA–like interconnection technology, this translates to the efficient and direct exploitation of such a NUMA system for the creation of a global virtual memory abstraction. Any memory access should be executed directly in hardware, either through the local memory system or by the NUMA network, without any further software intervention. This eliminates the typical software overhead associated with network protocols, allowing for maximum performance.

Besides the efficient exploitation of the underlying hardware in order to achieve an efficient global virtual memory abstraction, any other mechanism needed for the implementation of shared memory programming models should be implemented as close to the underlying architecture as possible. Of special importance here are the synchronization and consistency enforcing mechanisms, as these are frequently used during the execution of shared memory programs. Therefore, their efficiency has a large potential impact on the overall system performance.

3.2.2 Target Architectures

As previously discussed in Chapter 2, this work targets the rising class of NUMA architectures since they provide a good trade–off between hardware support for easy programming and efficient communication on the one hand and good scalability properties on the other. More specifically, this work concentrates on SCI–connected clusters of PCs because this specific choice adds the ability to construct such a system purely from easily available commodity–off–the–shelf (COTS) components at an optimal price/performance ratio.

The SMiLE Cluster

The work within the HAMSTER project has been done on the SCI–based cluster established within the SMiLE project. This cluster also serves as the experimental platform for most of the experiments described in the following (unless otherwise noted) and will therefore be introduced here in detail.

At the moment, the cluster consists of six identical nodes, even though certain experiments may only use a subset. All nodes are dual XeonTM processor systems (450 MHz) interconnected using the commercial PCI–SCI adapters available from Dolphin ICS. The exact configuration of the individual nodes is shown in Table 3.1. Each node is installed with Windows NTTM 4.0 (SP5) and SuSE Linux (kernel version 2.2.5).

The topology of the SCI network used within the SMiLE cluster setup can be varied between a single ringlet and a fully switched version. Both options are depicted in Figure 3.1. While the first one is realized simply with a single PCI–SCI adapter per node and no extra equipment, the second option requires an SCI switch. For this purpose, the six port switch D515 from Dolphin ICS is used.

Both options have their distinct advantages and disadvantages: while the ringlet version is simple to build and helps to cut the cost of deploying SCI, its bandwidth is limited as all nodes share the same ringlet link. In addition, this setup is not fault tolerant, since a single link failure breaks the ring and therefore disconnects all nodes. Both of these drawbacks are resolved in the switched system design: each node works on its own link, allowing it to exploit the full bandwidth capabilities of SCI and providing simple fault tolerance on the connection level.

Component                        Description
Processor                        Intel XeonTM 450 MHz
Number of CPUs per node          2
L1 data cache size (line size)   16 Kbyte (32 bytes)
L2 data cache size (line size)   512 Kbyte (32 bytes)
Main memory                      512 Mbyte (ECC)
Disk subsystem                   SCSI–II & IDE
Disk capacity                    2 GB & 10–30 GB
Control network                  Intel EtherExpress Pro (Fast Ethernet)
System area network              PCI–SCI adapter D320 (2)

Table 3.1 Configuration of the SMiLE cluster nodes.

(2) The D320 is the latest of the cards used in the current system and is used by default in the following experiments. Some older experiments, however, have been done using other adapters; this is specifically marked in the discussion of the respective experiments.

                                  

Figure 3.1 SCI topologies used for the SMiLE cluster: ringlet (left) or switched (right).

On the downside, however, the switch solution has proven to be less reliable due to hardware problems and adds extra latency to any transmitted packet as it passes through the switch [66]. For this reason, and because the ringlet setup has been available longer, all following experiments have been conducted using the ringlet topology.

Portability to Other Architectures

Even though this work has been done within the SMiLE project and has therefore been implemented for and on top of SCI–based clusters, the concepts presented here are not restricted to this specific technology. The work presents a general concept of how to use the capabilities and core mechanisms of general (not necessarily cache–coherent) NUMA architectures to efficiently support true shared memory programming on such architectures. With these properties, the proposed system represents a general approach for a NUMA–based global shared memory programming environment and will therefore (hopefully) also find application beyond SCI.

3.3 The HAMSTER Layers

Based on the guidelines set forth above, the HAMSTER framework has been designed. Figure 3.2 gives an overview of the complete system. It is built on top of a cluster of commodity–off–the–shelf PC hardware running both Linux and Windows NTTM. The cluster is interconnected using SCI [75, 92] which provides high bandwidth and low latency communication in combination with remote memory access capabilities exhibited in the form of HardWare DSM (HW–DSM). This hardware capability, which creates a NUMA architecture, in combination with an additional software component responsible for the memory setup and control, can be used to create an efficient hybrid hardware/software global memory abstraction with NUMA characteristics, the SCI Virtual Memory or SCI-VM [195, 193]. This technique is very similar to the VI approach [36] used in the message passing world, as all communication can be handled in hardware without any OS or protocol overhead.

With this DSM system as the basis, HAMSTER provides several independent modules for the core services needed by shared memory programming models. Most significant are the modules for memory, synchronization, and consistency management. The first one provides mechanisms for controlled memory allocation and includes both the option to transparently distribute the memory and to control the memory placement at allocation time. This flexibility can be used to migrate SMP applications to a HAMSTER–based system using the transparent memory allocation followed by an incremental memory layout optimization using distribution hints. The modules for consistency and synchronization management [196] provide all services required to implement a coherent and race–free application. This includes mechanisms to control the relaxed memory consistency of the SCI-VM and typical shared memory synchronization constructs like locks and barriers. In order to allow the greatest possible flexibility, these two modules are kept orthogonal to each other, i.e. synchronization does not include a consistency guarantee and vice versa. The decision on how to combine these two modules is left to the programming model implementation and can be tailored to its specific needs.

In addition to these pure shared memory service modules, a separate module is responsible for the task management, i.e. the distributed creation of activities and their control. This also includes maintaining a global activity counter as well as the controlled termination of an application. The latter is especially important for the implementation of dynamic task models as they are present in multithreading environments.

Due to the inherent connection between the concepts of virtual memory and processes, the creation of programming models with a global virtual memory also requires some form of global process abstraction. In the HAMSTER system, this is realized with the help of a cluster control component that provides a global execution environment. Its tasks include the cluster configuration (i.e. the identification of participating nodes) and the creation and management of a global node name space. In addition, the cluster control component also provides a simple message passing mechanism for control messages, an instrument useful for almost any implementation of a programming model.


Figure 3.2 The HAMSTER framework for SCI–based clusters of PCs.

The services exported by all of these modules are combined in one single interface, depicted in Figure 3.2 as the “HAMSTER interface”, providing all mechanisms needed for the implementation of shared memory models in a structured and flexible way. This allows the implementation of such models with a minimal amount of effort and code. The final applications, however, have no direct contact with the HAMSTER environment. They run fully transparently on top of the respective programming model and its API, using the benefits of the HAMSTER infrastructure without having to use yet another shared memory API. The result is an open and extensible framework for shared memory programming on loosely coupled NUMA architectures. It combines a maximum of flexibility for applications running on top of it with a direct and low–overhead access to the interconnection network.

3.4 Underlying Principles

Before describing the individual components of the HAMSTER system in the next chapters, a few common principles visible throughout the whole system are introduced here.

3.4.1 The Principle of Orthogonality

One of the key design principles of the HAMSTER system is the orthogonality between the individual management modules introduced above. This means that no module is conceptually based on any other module or restricts the usage and/or the services provided by any other module. This guarantees a maximal amount of flexibility for any higher layer within the HAMSTER framework, as it prevents inter–module constraints and side–effects. Higher layers can then use any HAMSTER service independently at the finest possible granularity to form the intended target programming model.

This orthogonality is achieved by clearly separating the services needed to implement shared memory programming models into the respective management modules. This task is not always straightforward as many typical constructs for shared memory programming are anchored in more than one module. An example for this, which will be discussed in greater detail in Chapter 6.3, is the combination of synchronization constructs and memory consistency enforcing constructs, as typically used in relaxed consistency models. In cases like this, the required functionality is broken up and independent services are implemented in each module. These can then be combined in an easy and straightforward way, depending on the requirements of the individual programming model. On the other hand, this approach also allows higher layers to use any of the services alone without this combination, resulting in the intended amount of flexibility.

In addition, this approach also allows an easier and safer implementation of the modules themselves, as they can be treated as individual entities. Module extensions and ports to new platforms can therefore be made faster as well as safer, thereby further increasing the flexibility of the system itself.

This principle of orthogonality is clearly visible throughout the complete design and implementation of the individual management modules. Chapter 5, which describes the modules in detail, will discuss this further and illustrate the influence of this overall design principle.
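To make this orthogonality more concrete, the following sketch shows how a programming model layer might combine the synchronization and consistency services, or use them in isolation. It is only an illustration: the function names (ham_lock, ham_acquire, and so on) are hypothetical and do not necessarily correspond to the actual HAMSTER interface.

    /* Minimal sketch of combining the orthogonal HAMSTER services; all
     * function names are hypothetical illustrations, not the documented
     * HAMSTER interface.                                                  */
    extern void ham_lock(int lock_id);       /* synchronization module     */
    extern void ham_unlock(int lock_id);
    extern void ham_acquire(void);           /* consistency module         */
    extern void ham_release(void);

    /* A lock with acquire/release consistency semantics, as a relaxed
     * consistency model would define it: the combination is done by the
     * programming model, not by HAMSTER itself.                           */
    void model_lock(int id)   { ham_lock(id);  ham_acquire(); }
    void model_unlock(int id) { ham_release(); ham_unlock(id); }

    /* A different model may use the very same lock service without any
     * consistency call, e.g. to protect a purely local data structure.    */
    void model_local_lock(int id)   { ham_lock(id); }
    void model_local_unlock(int id) { ham_unlock(id); }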

3.4.2 Resource and Performance Monitoring

In addition to providing the capabilities needed to implement various shared memory programming models, it is also necessary to enable resource and performance monitoring on top of the framework. For this purpose, the HAMSTER system includes an implicit monitoring system. Each component collects statistics regarding various parameters during the runtime of an application and allows higher layers (potentially all the way up to the user, if implemented by the programming model) to query them through a monitoring interface. This can then be used to assess the key parameters describing the behavior of the executed application and can help to improve its performance through either implicit or explicit optimization. The individual parameters and the respective interface will be discussed along with the individual components in the following chapters.

Without such an implicit monitoring approach, the user would be required to annotate each application separately in order to get performance feedback. In addition, only a restricted amount of information could be collected using this approach, as access to internal mechanisms and/or parameters is not possible at the application level. The implicit approach hence provides a more reliable and more accurate performance data collection mechanism without the burden of any extra programming complexity for the programmer.
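As an illustration of how such a monitoring interface might be used by a programming model, consider the following sketch. The query routine and the counter identifiers are assumptions made purely for illustration; the actual parameters and interface are introduced with the individual components in the following chapters.

    /* Hypothetical sketch of exposing HAMSTER's implicit statistics to the
     * user; ham_stat_get and the counter identifiers are illustrative
     * assumptions, not the documented interface.                           */
    #include <stdio.h>

    extern long ham_stat_get(int counter_id);   /* assumed query routine    */

    enum { HAM_STAT_REMOTE_READS, HAM_STAT_REMOTE_WRITES, HAM_STAT_LOCK_WAIT_US };

    void print_memory_statistics(void)
    {
        printf("remote reads : %ld\n", ham_stat_get(HAM_STAT_REMOTE_READS));
        printf("remote writes: %ld\n", ham_stat_get(HAM_STAT_REMOTE_WRITES));
        printf("lock wait(us): %ld\n", ham_stat_get(HAM_STAT_LOCK_WAIT_US));
    }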

In addition to an application–centered performance evaluation, which is discussed above, this HAMSTER–inherent statistics collection also allows the utilization of system–centered performance monitoring and steering using an on–line monitoring framework for shared memory systems, as it is currently envisioned within the SMiLE project [111, 110]. This allows the collection and evaluation of performance data independent of any programming model or application through an independent system. Such a system can then be used, in combination with performance data drawn from other sources, to visualize the system’s behavior using on–line visualization tools and/or to automatically tune the performance using an adaptive runtime system cooperating with the respective HAMSTER modules.

It should, however, be mentioned that any performance monitoring can cause some overhead that is simply wasted in scenarios in which no performance data is needed. To reduce this impact, the implementation of the statistics collection mechanisms within HAMSTER is done in a very lean and careful way, keeping the incurred overhead as low as possible. In scenarios, however, where even this is not enough, the user is given the option to turn any performance monitoring completely off using a compile–time switch during the compilation of HAMSTER. This could be useful for production environments without any automatic performance evaluation in which the absolute maximum in performance is needed; in general, however, the collection of performance statistics is not harmful and is therefore enabled by default.

3.5 Summary

In order to tackle the two major problems currently associated with shared memory programming, namely the abundance of different shared memory programming models and APIs and the performance problems in loosely coupled systems, a comprehensive framework for shared memory programming is required. This work presents such a framework, the HAMSTER system, which enables the construction and use of almost any programming model on top of a single core. It targets commodity clusters built of PC components interconnected with SCI, a NUMA–enabling SAN, and has been designed according to the following design principles: flexibility with regard to programming model and application needs; support of multiple operating systems; no bias towards a particular programming model; easy retargetability to new programming models, enabling the creation and maintenance of as many programming models as are useful for a certain application scenario; and the direct use of the NUMA features of the underlying architecture in order to eliminate many of the typical DSM performance problems.

The result is a strongly modularized and therefore open and flexible framework which offers the necessary services to enable the construction of potentially any shared memory programming model. These services, which will be discussed in great detail in the next chapters, are grouped into orthogonal modules, each responsible for one main service class: memory, synchronization, consistency, and task management. These modules are complemented by a separate cluster control module responsible for the cluster management and the global process abstraction.

All of these services are combined into a unified interface, called the HAMSTER interface, which can then be used for the creation of arbitrary programming models without major complexity, as will be demonstrated later. Applications can then be implemented using these HAMSTER–based programming models while being completely unaware of the underlying shared memory framework. This ensures a full portability of existing shared memory codes to HAMSTER–based platforms without code modifications, which creates the intended flexibility. This prevents the problems arising from the existence of a large variety of shared memory programming models without having to impose a new unified programming model, normally a long and tedious process.

Chapter 4

SCI Virtual Memory: Creating a Hybrid–DSM System

The main prerequisite for shared memory programming is a fully transparent, global virtual address space. This chapter describes a software layer that provides such a memory abstraction on top of SCI–connected clusters of PCs, the SCI Virtual Memory or SCI-VM. In contrast to existing systems, it enables the user to directly benefit from the NUMA hardware present in SCI–based systems. This is achieved by a software component closing the gap between the global physical memory provided by SCI in the form of HardWare Distributed Shared Memory (HW–DSM) and the required global virtual memory. The result is a high performance hybrid hardware/software solution in which any data transfer can be executed directly and transparently in hardware and only management and control functionality remains in software.

4.1 State of the Art

The idea of providing a global virtual memory abstraction for clusters and thereby enabling the use of shared memory programming models for NORMA architectures has inspired researchers for almost two decades. Due to the missing hardware support in this kind of architecture, however, a comprehensive software system is required to compensate for this deficiency. This has led to the development of the so–called SoftWare Distributed Shared Memory (SW–DSM) systems which enable a software emulation of the intended global memory abstraction. The following section provides an overview of their basic principles and the wide variety of existing DSM systems.

4.1.1 Idea and Concept of DSM Systems

The basic idea behind any DSM system is that a software component is used to track any memory access to globally shared data of an application. This enables the software to keep a copy of the global memory state and to determine when information needs to be communicated between nodes in order to maintain the global virtual memory abstraction. This translates the implicit communication present in the shared memory paradigm in the form of global memory accesses to explicit communication, which can hence be conducted using some form of message passing. This approach is therefore suitable for all kinds of architectures including clusters and NoWs and allows the implementation of shared memory programming models on these architectures.

The key question in this approach is how to keep track of the memory accesses issued by an application without having to modify the application itself, since this would destroy the intended transparency of the shared memory abstraction. The mechanism deployed for this purpose in most current systems is the ability of modern processors and operating systems to support page–based protection mechanisms. Based on these, global data areas can be protected, which leads to the respective page faults on any access to the data. A DSM system can use page fault handlers to get hold of these events and issue the necessary communication; on read faults data has to be brought to the faulting node, while write faults lead to the marking of the page as dirty and/or the forwarding of the written information to other nodes. Due to their basic property of relying on the protection of individual pages, systems based on this approach are often referred to as page–based DSM systems.

The first system adhering to these principles is Ivy [136] (Integrated shared Virtual memory at Yale) which was implemented by Kai Li and published in his PhD thesis in 1986 [135]. It was the starting point for many DSM systems which took this basic principle and added new features or aimed at optimizing their performance. The result is a large number of different DSM systems [169, 96], all with the same intent to bring shared memory programming to NORMA architectures. The remainder of this section will present these systems along with their properties and thereby provide an overview of the current state of DSM research.
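The page protection mechanism underlying such systems can be sketched in a few lines of POSIX C. The following example is a minimal illustration and not taken from any of the cited systems: a shared region is mapped without access rights, and a SIGSEGV handler serves as the hook in which a real DSM system would perform its communication; fetch_page_from_owner is a hypothetical placeholder for that communication.

    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHARED_SIZE (1 << 20)
    static char *shared_base;             /* start of the DSM-managed region */

    /* Hypothetical placeholder: a real DSM system would request the page
     * contents from its current owner via message passing here.            */
    static void fetch_page_from_owner(void *page) { (void)page; }

    static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        long  psz = sysconf(_SC_PAGESIZE);
        char *pg  = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(psz - 1));

        (void)sig; (void)ctx;
        if (pg < shared_base || pg >= shared_base + SHARED_SIZE)
            abort();                      /* not ours: a genuine segfault    */

        fetch_page_from_owner(pg);        /* bring data to the faulting node */
        mprotect(pg, psz, PROT_READ | PROT_WRITE);   /* re-enable access     */
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_flags     = SA_SIGINFO;
        sa.sa_sigaction = dsm_fault_handler;
        sigaction(SIGSEGV, &sa, NULL);

        shared_base = mmap(NULL, SHARED_SIZE, PROT_NONE,   /* no access yet  */
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        shared_base[0] = 42;              /* faults, is handled, then succeeds */
        return 0;
    }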

4.1.2 Memory Consistency Models for DSM Systems

The main problem connected with the SW–DSM approach is the high and fine–grained communication demand created by the forwarding of global memory accesses. Therefore, a large amount of research has been invested in finding ways to reduce the communication in DSM systems. One of the most promising ways, which is by now used in any state–of–the–art DSM system, is a relaxation of the memory model [30]. It allows the communication of update information to be deferred. This optimization has, however, an impact on the application executed on top of such a system, since information written by one thread is not immediately visible to other threads. This needs to be taken into account during any application development in order to guarantee a predictable and correct execution.

The role of memory consistency models

In order to give application programmers a formal way to reason about the correctness of their codes, the behavior of memory systems can be defined using the concept of memory consistency models [86]. They precisely characterize the behavior of the respective memory subsystem by clearly defining the order in which memory operations perform in the specific memory system. As a result, users are provided with a set of constraints containing this information. The users can then build their application based on these constraints, resulting in a safe and predictable execution of the application.

The concept of memory consistency models and the idea of their relaxation is not only relevant for DSM systems, but also applies to any multiprocessor shared memory environment. This has led to a significant amount of work in this area targeting both hardware and software approaches [3, 137, 161].

Figure 4.1 Classification of memory consistency models; arrows point from stronger to more relaxed models [86, 169]: from the strong Sequential Consistency (SC) model to the relaxed Processor Consistency (PC), Weak Consistency (WC), and Release Consistency (RC) models.

General classification

A general and intuitive classification of memory consistency models can be found in [86]. It distinguishes four main types of models, as depicted in Figure 4.1: Sequential Consistency (SC), Processor Consistency (PC), Weak Consistency (WC), and Release Consistency (RC). These models and their relation to each other are discussed in the following.

• Sequential Consistency ([129])
Sequential Consistency (SC) was first described in [129] and is defined as follows: “A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program”. While this model provides the programmer with a very intuitive and easy–to–follow memory system behavior, it can have severe consequences in terms of performance. The strict ordering requires memory systems to propagate updates early and prohibits many optimizations in both the hardware and the compiler. Due to this, several approaches have been proposed to relax the constraints of Sequential Consistency with the goal of improving the overall performance while limiting the impact on the programmer.

• Processor Consistency ([67])
This issue is addressed by relaxed consistency models which weaken the strict ordering constraints imposed by SC. A first example for such a model is Processor

Consistency (PC). Under this model, Reads are allowed to bypass previous Write operations from other processors, while all Write operations are executed in program order. However, both operations execute in order with respect to all other operations from the same processor. This results in a scheme in which operations performed within a node behave identically to SC, but global operations are relaxed.

• Weak Consistency Models ([47, 205])
Another way to relax Sequential Consistency is taken by Weak Consistency (WC) models. They divide memory accesses into standard and synchronizing accesses. While the latter ones are guaranteed to be sequentially consistent across the whole system, weak consistency models allow standard accesses to be reordered in relation to each other. The exact reordering rules differ from one weak model to another. Examples are the DSB model [47], which is named after its authors Dubois, Scheurich, and Briggs, and TSO (Total Store Order), which is used within the Sparc architecture [205]. A complete description of these models can be found in [86]. These models, however, require programmers to explicitly take the consistency model into account in their applications by introducing the synchronizing accesses.

• Release Consistency ([63])
The Release Consistency (RC) model builds on the idea of Weak Consistency by further dividing the synchronizing memory accesses into Acquires and Releases. Any standard shared memory access can be performed in arbitrary order, with the exception that all Acquire accesses must have been completed before any subsequent standard memory access, and all standard accesses must have been completed before a Release access. This results in a scheme in which an Acquire informally allows the access to shared data from a particular node and ensures that all data is up-to-date, while the Release relinquishes this access right again and ensures that all memory updates have been properly propagated. By separating the synchronization accesses in this way, this model efficiently allows optimizations like buffering and pipelining and thereby forms the basis for a significant performance increase (see the sketch below).
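The practical consequence of such relaxed models can be illustrated with a small example. The sketch below uses hypothetical acquire and release operations and assumes that the second thread synchronizes after the first; under Release Consistency, the data written before the release is then guaranteed to be visible after the acquire, while no such guarantee exists for unsynchronized accesses.

    /* Illustration of the ordering guarantees of Release Consistency;
     * acquire() and release() are hypothetical synchronizing operations.   */
    extern void acquire(void);
    extern void release(void);
    extern void use(int value);

    int data  = 0;     /* ordinary shared memory                            */
    int ready = 0;

    void producer(void)
    {
        data  = 42;    /* standard accesses: may be buffered and reordered  */
        ready = 1;
        release();     /* all previous writes must be completed/propagated  */
    }

    void consumer(void)   /* assumed to synchronize after the producer      */
    {
        acquire();     /* completes before the following standard accesses  */
        if (ready)     /* hence data == 42 is guaranteed to be visible here */
            use(data);
    }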

Release Consistency in DSM systems

Due to this large potential impact on the performance of the overall system, Release Consistency models have been widely adopted in DSM systems. Depending on the actual implementation of the consistency protocol within a specific DSM system, a few subtypes can be distinguished which differ with regard to when communication is performed and where data is located.

• Eager Release Consistency
With Eager Release Consistency (ERC), updates are propagated to remote nodes as early as possible, i.e. at the time of a Release operation. An example for a system using ERC is Quarks [215], a system aiming at providing a lean implementation

of DSM avoiding complex high–overhead protocols. ERC is well suited for this approach since, due to the early transmissions, only very little state needs to be preserved, thereby reducing the work necessary for bookkeeping.

• Lazy Release Consistency
In contrast, Lazy Release Consistency (LRC) [116] performs any communication of write updates as late as possible, when the data is actually needed, i.e. during an Acquire operation. While this allows unnecessary transactions to be avoided in cases where released data is not used or overwritten by subsequent Acquire operations, it leads to an increased amount of work to be done for bookkeeping. This stems from the fact that released data has to be stored on the local node for future Acquire operations until all nodes have requested the data or it is overwritten. Despite this increased complexity, LRC has proven its capabilities in a variety of systems, including TreadMarks [5], Shasta [189], and Munin [29].

• Home–based Lazy Release Consistency
The Home–based Lazy Release Consistency (HLRC) model [179] is a slight variation of LRC and only differs in the fact that each shared data page is statically associated with a home location, instead of being unbound as in the standard LRC case. This again allows some further optimizations with regard to both application performance and overhead reduction through reduced bookkeeping [256].

Except for some minor differences [116], all of these implementation options provide the application programmer with the same relaxed consistency model and hence with the same consistency constraints, but their different implementations have an impact on the application performance. However, none of these protocol implementations of RC is generally faster than the others. The concrete performance strongly depends on the individual applications and their memory access patterns [256, 119].

New models from the area of DSM systems

The consistency models discussed above with their various protocol implementations dominate the area of DSM systems. Only a few other projects investigate new, more relaxed consistency models with new consistency constraints exposed to the application. The most prominent examples are Scope and Entry Consistency, which are briefly discussed below.

• Scope Consistency
Scope Consistency [97] is a generalization of Release Consistency. It is based on the observation that in an RC implementation Acquire operations guarantee the visibility of data written before the last Release operation performed in the system. Scope Consistency relaxes this constraint by grouping Acquire and Release operations and by restricting the constraint above to only operations within the same group. This

implicitly restricts the visibility to regions of memory protected by consistency operations within the same group, the so–called consistency scopes. As a result of this, a scope consistency compliant DSM implementation can decrease the amount of data that needs to be transferred during consistency operations, reducing the DSM overhead and providing higher overall performance. This type of consistency model is e.g. used in JiaJia [83] and Brazos [209], both state–of–the–art SW–DSM systems. JiaJia is implemented for UNIX–based clusters and deploys a home–based version of Scope Consistency, while Brazos has been developed for Windows NTTM based architectures and supports active multithreading.

• Entry Consistency
In contrast to the implicit relation between consistency operations and the data to be kept consistent used in Scope Consistency, Entry Consistency [21], as it is implemented in Midway [22], is built upon an explicit scheme. The user associates each lock and its related consistency operations with regions of memory; only these regions of memory are then kept consistent by operations of this lock. While this scheme allows a more precise definition of the data which needs to be transferred or updated during the individual operations and helps to avoid unnecessary communication, it is burdened by the fact that programmers have to explicitly specify the associated memory regions for each lock, often a complex and cumbersome task (see the sketch below).
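The programming style implied by this explicit association can be sketched as follows. The routines ec_bind, ec_acquire, and ec_release are hypothetical illustrations of such an interface and are not the actual Midway API.

    /* Sketch of the explicit lock/region association of Entry Consistency;
     * all function names are hypothetical and only illustrate the style.   */
    extern void ec_bind(int lock_id, void *addr, unsigned long size);
    extern void ec_acquire(int lock_id);
    extern void ec_release(int lock_id);

    struct particle { double pos[3]; double vel[3]; };
    struct particle particles[1024];

    void setup(void)
    {
        /* only the bound region is kept consistent by this lock's operations */
        ec_bind(/*lock_id=*/1, particles, sizeof(particles));
    }

    void update(int i)
    {
        ec_acquire(1);               /* brings the bound region up to date    */
        particles[i].pos[0] += particles[i].vel[0];
        ec_release(1);               /* propagates only the bound region      */
    }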

Support for multiple coherency models and protocols

As already mentioned above, no single consistency model or consistency protocol implementation has proven unrestricted superiority over the others. It strongly depends on the individual application and its memory access pattern as well as on the properties of the underlying architecture. This has inspired work towards DSM systems capable of supporting multiple consistency models within one system [6, 30]. These are then adaptable to the actual constraints and behavior of the target application and can therefore deliver optimized performance. Examples for systems with these capabilities are the Coherent Virtual Machine (CVM) [117], a framework for multiple consistency models, ADSM [160], a Release Consistent system enabling the user to choose among different protocols on a per page basis, and Munin [29, 17], which is based on Lazy Release Consistency but allows users to influence many of the protocol parameters.

4.1.3 Alternative Approaches

Beyond the optimization approaches for page–based DSM systems through various memory consistency models, DSM research has also focused on further developments. One direction for this is the effort to decrease the distribution and sharing granularity. Page–based systems are in this respect by their nature always fixed to full pages. In the case of the x86 architecture [99] this amounts to 4096 bytes. This large and fixed granularity often has an unfavorable impact on the overall performance of applications. It increases the chance of false sharing, i.e. the colocation of unrelated data within a single sharing unit (in this case within a page), and hence can lead to unnecessary communication if multiple processors access these independent parts of data.

A reduction of the distribution and sharing granularity, however, can only be achieved by abandoning the page–based implementation and by replacing it with a new mechanism capable of transparently tracking the memory accesses of applications. One approach for this is the annotation of load and store operations to global data directly in applications. This information again provides the complete state of the global memory to the DSM system and enables it to create a global virtual memory abstraction. In order to still maintain the transparency of this abstraction for the user, the required program annotation can not be part of the source code itself, but rather has to be introduced into the final application without user interaction. This can be done with the help of binary instrumentation tools which insert the necessary hooks for the memory control into the final application binaries without any further user intervention. Examples for systems built according to this type of memory subsystem are the Shasta system [189] and CRL [102]. Using this approach, they are enabled to keep track of every memory access independently (at least in principle) and can therefore adapt the sharing granularity to the application needs without any restrictions.

A similar approach in the same direction has been undertaken by the Midway system [254], although used here in combination with page–based mechanisms and in cooperation with a special compiler responsible for adding code enabling a fine–grained write detection. This is done by setting dirty flags, which are then evaluated before the actual data transfer. The result is a hybrid solution which reduces the sharing granularity without having to rely on a complex full software scheme like the systems above.

Another approach to reduce the sharing and distribution granularity is the use of an object–oriented approach in combination with a global object space. There, objects are used as the basic unit of sharing and hence the granularity is automatically adapted to the application’s data structures. In addition, objects also provide a natural way to track data accesses since objects need to be dereferenced and located in the global object space. Examples for such systems are Millipede [100], an object–oriented DSM approach for Windows NTTM–based clusters, and Orca [11], a parallel programming language based on shared objects.

Besides these performance–oriented research directions, some projects also aim at overcoming functional deficiencies of traditional DSM approaches. One of them is that DSM systems generally only support homogeneous environments, requiring each node in the system to be of the same architecture and often also of the same speed.
The Stardust [28] and the Mairmaid [255] systems try to overcome this by explicitly targeting heterogeneous cluster and NoW architectures. This leads to several problems, most severely the potentially different data formats on different architectures. The two systems use different approaches to solve this problem: while Stardust relies on explicit user annotations, the Mairmaid system uses implicit type specifications exhibited through the loader and linker facilities of the underlying operating system. Both approaches provide the DSM system with type definitions for any shared data and thereby enable a transparent conversion.

Figure 4.2 Hardware complexity vs. performance (after [201]): DSM performance is expected to rise with the degree of hardware support, from pure SW–DSM over Ethernet, over VIA–like networks, to NUMA networks such as Shrimp, Memory Channel, and SCI, and further to S–COMA, the T3E, and full CC–NUMA systems.

In addition to the support for heterogeneous architectures, the Stardust system also contains mechanisms for application checkpointing [27]. This allows the persistent storage of the complete state of the application including the global memory footprint, which can be reactivated after a system or node failure. The Stardust system is therefore part of the class of recoverable DSM systems. A more complete overview of this class of systems can be found in [162].

4.1.4 Exploiting Existing Hardware Support

The DSM systems discussed so far rely in their implementation solely on communication based on standard protocols like UDP/IP or TCP/IP. While this decision offers a clean and portable approach for general loosely coupled systems, it fails to sufficiently exploit the available capabilities of novel architectures which often feature additional hardware mechanisms. These mechanisms could be exploited to increase the efficiency of DSM systems.

Figure 4.2, which is created after [201], illustrates this approach. With rising hardware complexity in the underlying target architecture, DSM systems implemented on top of it are expected to gain performance (if the hardware capabilities are properly exploited). The spectrum of possible hardware support can range from standard cluster technologies, i.e. with no special hardware support at all, all the way to full CC–NUMA systems which include the DSM support in both their hardware and their global operating system, as is the case in the SGI O2000/O3000TM series [131]. This spectrum also includes all kinds of system area networks (SAN) with user–level communication and NUMA–based architectures, which are the focus of this work.

Several projects have tried to exploit such additional hardware support to implement DSM systems. The most prominent examples among them are SHRIMP [19, 18] and

CASHMERE [212, 125]. In SHRIMP, a custom–designed network adapter with an automatic global update functionality is used, while Cashmere utilizes DEC’s Memory Channel [65], a high–performance SAN with remote write capabilities. In both cases, the remote memory access capabilities can be used to implement update operations in a very direct and efficient way [213, 16], which has proven to be advantageous for the overall system performance.

Both systems, however, miss one important feature: remote reads. This leads in both cases to an implementation that exhibits the same base concepts as in traditional DSM systems: information has to be brought to nodes using explicit communication at page granularity. Only updates can be performed in a more direct way using the hardware support, although still in a manner suited for the page–based information retrieval. Such an approach therefore leads to relaxed consistency models in a more or less traditional, protocol–oriented fashion, even though they have to be especially adapted to the implicit remote updates.

NUMA architectures, as they are e.g. present in the form of SCI–based clusters (see Chapter 2.3), provide the ability for both remote write and read operations and hence have the capabilities to overcome the need for explicit communication. Their use, however, is also connected with some implementation complexity, as the remainder of this chapter will show. Therefore, many DSM systems have been implemented for these architectures based on the traditional page–based mechanisms, but exploit the NUMA capabilities for a fast messaging subsystem or for an easier global memory management. Examples for this kind of system based on SCI are Madeleine [10], NOA [151], a DSM system developed by the Lund University in Sweden [114], the HPPC–SEA DVSM library [40], and the SVMlib [174].

Only a few projects aim at a direct utilization of SCI’s NUMA capabilities for a DSM system. The SciOS system, developed in cooperation between the University of Copenhagen and INRIA [122, 123], provides a global System-V memory abstraction which is mainly based on a traditional page–based SW–DSM approach, but also enables the use of direct NUMA in some special cases. In contrast, the Yasmin system [25, 180] implements a global memory abstraction based only on SCI’s NUMA capabilities. However, it merely resembles a thin mapping library for SCI remote memory and is restricted by the large granularity of the SCI mappings (see Chapter 2.3.2). In addition, the system does not contain any mechanisms enabling the use of caches, resulting either in the caches being turned off or in their control being left to the application, leading to increased application complexity.

The system discussed in this work, the SCI Virtual Memory or SCI-VM [195, 198], takes the idea of hardware support one step further by directly relying on the NUMA–like shared memory support in SCI and avoiding a software scheme based on page replication or differential update protocols. This helps avoid typical performance problems of traditional DSM systems like false sharing and extensive protocol overhead. In addition, the missing cache coherency in the system is compensated for based on the concept of relaxed consistency models which have been adapted for the use on NUMA–based systems. Therefore, the SCI-VM represents a novel approach in the DSM area, closing the gap between systems with modest hardware support in the implementation of the DSM protocols and CC–NUMA systems with full hardware and operating system support, as can be seen e.g. in the SGI O2000/O3000TM [131] or the HP/Convex ExemplarTM [35].

4.2 The SCI-VM Design

The SCI Virtual Memory (SCI-VM) is based on two fundamental building blocks: the SCI support for hardware DSM in NUMA fashion and the concepts for a global memory management borrowed from traditional software DSM systems. By combining these two concepts, the system is capable of merging the global physical memory provided by SCI with the global virtual memory abstraction required for shared memory programming. The result is a new hybrid hardware/software DSM system taking full advantage of the SCI hardware facilities.

4.2.1 Building Block 1: SCI–based Hardware–DSM

One of the main design goals of the SCI-VM is the direct utilization of the HW–DSM provided by SCI rather than deploying a traditional SW–DSM system with all its problems like false sharing and complex differential page update protocols [143]. Only this exploitation of HW–DSM enables the SCI-VM to benefit from the special features and the full performance of the interconnection technology. Additionally, the implementation of synchronization primitives should directly utilize atomic transactions provided by SCI to ensure the greatest possible efficiency.

4.2.2 Building Block 2: Software–DSM Systems

Unfortunately, SCI alone can not provide a global virtual memory abstraction as required by shared memory programming models. Both its hardware and software components target only the utilization of large, contiguous, and permanently pinned memory segments. In order to overcome these limitations and to reach a fully transparent implementation of a global virtual address space, the SCI remote memory capabilities have to be augmented by concepts and mechanisms well known from traditional software DSM systems, like data distribution at page granularity, on–demand access to remote pages, and a relaxed consistency model [169].

4.2.3 Combining Both Building Blocks into the SCI-VM

Together, the building blocks described above allow the formation of a transparent virtual address space. The memory resources are distributed at the granularity of pages and these distributed pages are then combined into a single global virtual address space. In contrast to purely software based systems though, no page has to be migrated or replicated. All remote pages are simply mapped using SCI’s HW–DSM mechanisms (described in Chapter 2.3.2) and then accessed directly. Due to the large number of necessary mappings (in the worst case one for each remote page), these mappings are handled on demand with a concept similar to the paging mechanisms of modern operating systems. The SCI-VM

Figure 4.3 Principal design of the SCI Virtual Memory: team processes with their threads on nodes A and B form the abstraction of a global distributed process; their virtual address spaces are backed partly by local physical memory and partly by remote memory reached through the PCI address space and the SCI physical address space.

therefore represents a cluster–aware extension of an operating system’s virtual memory management.

This concept is further illustrated in Figure 4.3 for a two–node system. In order to build a global virtual memory abstraction, a global process abstraction also has to be built, with team processes on each node as placeholders for the global process. These processes are running on top of the global address space which is created by mapping the appropriate pages either from the local physical memory in the traditional way or from remote memory using SCI’s HW–DSM.

The mapping of the individual pages is done in a two–step process. First, the page has to be located in the SCI physical address space, from where it can be mapped into the PCI address space using the address translation tables (ATT) of the SCI adapter cards. From there, the page can be mapped with the help of the processor’s page tables into the virtual address space. A problem is the different mapping granularity of these two steps; while the latter mapping can be done at page granularity, the SCI mapping granularity depends on the configuration of the PCI–SCI adapter and can vary from 4 Kbyte (1 page) segments to 512 Kbyte (128 pages) segments. As this setting is done during the initial setup of the adapter cards and can not be influenced during the runtime of an SCI–based application, the SCI-VM layer has to overcome this difference. For this purpose, it manages the mappings of several pages within one single SCI segment. The mappings of the SCI segments themselves are managed with a dynamic, on-demand scheme very similar to paging mechanisms in operating systems.
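The handling of this granularity mismatch can be sketched as follows. All types and helper routines in the sketch (att_map_segment, mmu_map_page) are hypothetical placeholders for the driver and operating system services involved; the ATT segment size of 512 Kbyte is just one possible configuration.

    /* Sketch of the two-step mapping and the granularity mismatch it has to
     * bridge; all helpers are hypothetical placeholders.                   */
    #include <stdint.h>

    #define ATT_SEG_SIZE ((uint64_t)512 * 1024)   /* configured segment size */

    extern void *att_map_segment(int node_id, uint64_t sci_phys);  /* step 1 */
    extern void  mmu_map_page(void *virt, void *pci_addr);         /* step 2 */

    /* Map one remote page into the local team's virtual address space.     */
    void scivm_map_remote_page(void *virt, int node_id, uint64_t remote_phys)
    {
        /* Step 1: map the enclosing SCI segment through an ATT entry;
         * several pages share this segment, so the mapping is reused.      */
        uint64_t seg_base = remote_phys & ~(ATT_SEG_SIZE - 1);
        void    *pci_base = att_map_segment(node_id, seg_base);  /* on demand */

        /* Step 2: map the single page within that segment at page
         * granularity through the processor's page tables.                 */
        uint64_t offset = remote_phys - seg_base;
        mmu_map_page(virt, (char *)pci_base + offset);
    }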

4.3 Static vs. Dynamic Memory Management

The implementation of the basic design above can either be done by using static mappings created at memory allocation time or by deploying a fully dynamic memory management scheme. Both exhibit the same functionality to the application, but have different properties with respect to implementation complexity and resource management. Therefore, even though this choice is transparent to the rest of the HAMSTER framework, both options are introduced and discussed below.

4.3.1 Static Memory Mappings

The first option to implement the SCI-VM is to create the necessary mappings in both the network’s HW–DSM mapping tables and in the processor’s page table during the allocation of a piece of global memory and then keep them untouched until the application is terminated. This results in a static scheme in which any memory location within the global address space is accessible at all times. This also includes the necessity to pin down any physical memory backing the global virtual address space and thereby prevent it from being swapped out to secondary storage. Otherwise, the static mappings via SCI would point to invalid memory locations, resulting in unprotected accesses to physical pages not belonging to the SCI-VM controlled process.

Allocation procedure

The core of such an implementation can be found in the actual allocation routine, since any physical memory is allocated and all mappings are created there. This is done in a five–step process, which is depicted in Figure 4.4. First, the necessary piece of virtual address space is reserved. This is done equally by all nodes using a globally consistent counter that guarantees that all nodes are provided with the same piece of virtual memory in their respective teams. In the next step, the amount of memory (in terms of numbers of pages) needed by the local node to contribute to the global memory abstraction is computed and then allocated in step three.

In step four, a globally shared table is used to store an entry for each page contained within the new virtual address space. Each node then enters the physical addresses along with its own network ID for all pages it contributes to the newly allocated virtual address range. During this step, the decision is made on how the physical memory is to be distributed among the participating nodes. By default, the SCI-VM uses a round–robin scheme for the page placement, but this can be overridden by the caller of the allocation routine and replaced by a new distribution policy. More details on this are provided in Chapter 5.1.3. The step is finished by a global barrier across all nodes to ensure that the writing of the page table is complete and all information from all nodes has been entered.

After the completion of the barrier, all nodes have access to this page table with a complete description of any page within the newly allocated global virtual address space. They can now start to map all pages into their location within the local team’s virtual address space. During this, any page located on the local node is simply mapped using the MMU, while any remote page is first mapped via the network mapping mechanisms and then by the MMU. At the end, each team has the complete new piece of virtual address space mapped and accessible for the rest of the process runtime. No further management calls have to be

Figure 4.4 Static memory allocation procedure, performed symmetrically on all nodes: virtual memory allocation of the next free virtual memory region, computation of the amount of local memory, physical memory allocation, writing of the local pages (SCI ID and physical address) into the global page table, a global barrier, and finally the creation of the mappings for all pages.

made; the new memory can be accessed from the allocation time on by simply using standard read and write operations, and any access to remote memory is transparently relayed to the respective node by SCI.
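The five steps described above can be summarized in the following sketch. The helper routines and the layout of the global page table are hypothetical illustrations of the procedure, not the actual SCI-VM implementation.

    /* Sketch of the five-step static allocation; all helpers are
     * hypothetical placeholders for the SCI-VM and OS services.            */
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    struct global_page { int sci_id; uint64_t phys_addr; };

    extern void    *reserve_global_virtual_range(size_t bytes); /* step 1    */
    extern uint64_t alloc_pinned_local_page(void);              /* step 3    */
    extern void     global_barrier(void);                       /* end of 4  */
    extern void     map_local_page(void *virt, uint64_t phys);  /* step 5    */
    extern void     map_remote_page(void *virt, int sci_id, uint64_t phys);

    void *scivm_alloc(size_t bytes, struct global_page *table,
                      int my_id, int num_nodes)
    {
        size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
        char  *base  = reserve_global_virtual_range(pages * PAGE_SIZE); /* 1 */

        for (size_t p = 0; p < pages; p++)              /* steps 2, 3, and 4 */
            if ((int)(p % num_nodes) == my_id) {        /* default round-robin */
                table[p].sci_id    = my_id;
                table[p].phys_addr = alloc_pinned_local_page();
            }

        global_barrier();           /* all nodes have filled in their pages  */

        for (size_t p = 0; p < pages; p++)                           /* 5 */
            if (table[p].sci_id == my_id)
                map_local_page(base + p * PAGE_SIZE, table[p].phys_addr);
            else
                map_remote_page(base + p * PAGE_SIZE,
                                table[p].sci_id, table[p].phys_addr);
        return base;
    }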

Implementation requirements

One of the main advantages of this static SCI-VM scheme is that the implementation complexity is very low and also the required operating system services for such an implementation are very limited. Basically only two key functions need to be provided: the allocation of pinned physical memory and the mapping of arbitrary physical pages into the virtual address space of a process at page granularity. These two services suffice for the implementation of the static scheme; the allocation is required to allocate physical pages on the local node that are prepared to be accessed in a safe way via SCI, and the mapping routine is used to merge local and remote pages into a single virtual address space for the local team.

Besides these operating system services, direct access to the hardware mappings within the SCI adapter is also needed to allow access to remote memory resources at the granularity of single ATT entries. This should be part of the respective network device driver to ensure a clean cooperation of the SCI-VM with other applications using the network device at the same time.

Tradeoffs

This static implementation scheme closely matches the memory allocation scheme of many shared memory applications. A large piece of memory is allocated during the initialization of the application, maintained and accessed during the whole runtime, and not freed before the end of the application. In addition, many applications, especially those from the area of scientific or numerical computing, aim at not using any operating system driven paging, since this normally has a large negative impact on performance, and therefore do not require support for memory paging. Together with its low implementation complexity, this scheme provides a good tradeoff point and was therefore chosen for the first implementation of the SCI-VM.

4.3.2 Dynamic Global Memory Management

The static scheme presented above, however, lacks a few characteristics that are desirable in order to provide a more efficient resource management and more flexibility in page placement. This can only be achieved by tightly integrating the SCI Virtual Memory into the virtual memory management of all operating system instances in the whole cluster. This leads the way to a global virtual memory management across node boundaries with a global resource management. Such a system solves all of the shortcomings described above in connection with the static scheme.

Global virtual memory management

Even though this dynamic memory management scheme provides the same functionality to the user, its internal structure is radically different. In contrast to the static scheme, the allocation procedure is very lean and contains only a small part of the overall implementation. It is only responsible for reserving a new piece of virtual memory and protecting it using the virtual memory management facilities. It does not perform any allocation or mapping of memory resources at this time.

The actual functionality of the SCI-VM under this scheme is executed at runtime. Accesses to the global virtual address space create page faults which will be handled by the system. Figure 4.5 shows the main activities carried out by the SCI-VM in case of such a fault event: first it is tested whether the faulting address actually belongs to the virtual memory region under SCI-VM control. If not, the fault is handed back to the operating system and from there to the default handler. If the address is part of the SCI-VM, the home node of the faulting page is queried, either using a global page information structure or based on a static page distribution that was specified during the allocation process. If the local node is the home node, the page can be treated as any conventional page of the local memory and hence the control is handed back to the operating system for further page fault handling. In case a page from a remote node has been accessed and no mapping to that page has been established yet, the request is sent to that particular node. This node is then responsible for ensuring that the requested page is available in memory using the same basic mechanism. Once this has been accomplished, the page can be mapped on

Figure 4.5 Activities triggered by a page fault or swapping event: page faults on addresses outside the SCI-VM range are handled by the OS; for SCI-VM addresses the home node is determined and the fault is either handed back to the OS (local pages) or a request is sent to the home node and the remote mapping is performed; swap–out events cause all remote mappings to the evicted page to be invalidated.

the node on which the fault occurred, allowing full access from that point in time on.

As all allocation and mapping functionality is performed by the operating system, all pages will be allocated from the pool of pageable pages. This leads to the problem that pages might be swapped out to disk while SCI memory mappings still exist from other nodes. This would leave them stale and risk erroneous write accesses to memory not participating in the execution of the HAMSTER system. In order to solve this, a dynamic scheme also has to create a suitable hook into the virtual memory management layer that allows it to be notified of page–out events of SCI-VM pages. This can then be used to invalidate all SCI mappings to the page that is about to be evicted from the local memory, allowing a safe execution of the application even without this page being present.
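The decision flow of Figure 4.5 can be summarized in the following sketch; the enumeration and the helper routines are hypothetical placeholders for the respective kernel and SCI-VM services.

    /* Sketch of the page-fault decision flow of Figure 4.5; helpers are
     * hypothetical placeholders.                                           */
    #include <stdint.h>

    enum fault_result { HANDLED_BY_OS, HANDLED_BY_SCIVM };

    extern int  scivm_owns(uint64_t addr);                  /* range check   */
    extern int  home_node_of(uint64_t addr);                /* page info     */
    extern int  local_node_id(void);
    extern void request_page_from_home(int node, uint64_t addr);
    extern void map_remote(uint64_t addr, int node);        /* ATT + MMU     */

    enum fault_result scivm_page_fault(uint64_t faulting_addr)
    {
        if (!scivm_owns(faulting_addr))
            return HANDLED_BY_OS;          /* not ours: default handling     */

        int home = home_node_of(faulting_addr);
        if (home == local_node_id())
            return HANDLED_BY_OS;          /* treat like a conventional page */

        request_page_from_home(home, faulting_addr); /* home ensures presence */
        map_remote(faulting_addr, home);             /* then map it locally   */
        return HANDLED_BY_SCIVM;
    }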

Implementation requirements

In contrast to the static approach, the dynamic memory management requires a much tighter integration into the operating system and therefore also a more versatile interface to core services of the virtual memory management. Besides the actual mapping mechanisms that are also required by the static scheme, the dynamic scheme needs to be able to insert hooks for callbacks on events like page faults and page swap–out. While the first one is given in most operating systems and is normally also available in user mode (in UNIX systems in the form of signals), the latter event is generally not available. Therefore, an additional kernel extension needs to export this functionality in order for the dynamic scheme to be implemented.

In addition to these kernel–oriented requirements, a dynamic scheme also relies on a much more complex communication infrastructure. Unlike the static scheme, which basically only requires the existence of a user–level barrier at allocation time, the dynamic option requires communication at arbitrary times during the program execution at both user and kernel level. This further increases the complexity of the implementation.

Tradeoffs

The discussion above has shown that the main drawback of the dynamic scheme is the rather high implementation complexity, which comes from the necessary deep integration with the operating system and the required additional kernel component. On the positive side, the dynamic scheme enables a full integration into the operating system’s virtual memory management and therefore allows the utilization of swappable memory as well as an on–demand page allocation and node assignment. These advantages lead to a more efficient resource utilization without the need to split the physical memory into pinned SCI segments and standard physical memory. In addition, it eases the implementation of dynamic and adaptive page placement schemes since the decision of where to place a certain page can be performed at runtime and can take observations about the application’s runtime behavior into account.

4.4 Underlying Task and Execution Model

Independent of the implementation, the design of the SCI-VM directly imposes a specific task and execution model on the final system. Each node has to execute one team process with one or multiple threads. These processes are then combined into a global process abstraction visible to the user of the SCI-VM. In order to achieve such an abstraction of a single process, every team is also required to load and execute the same binary. Only this provides a consistent address space layout across all nodes, which is one of the prerequisites for forming a transparent global process abstraction.

Based on this task model, an SCI-VM based application is started by executing the binary on each node in the cluster. During the initialization of the application, the individual processes are transformed into teams and merged into the intended global process abstraction. Initially, each team executes one thread of activity that also continues the execution of the binary on each node after the initialization. This initial thread per node can then be used to spawn further threads and to carry out the actual computation across the complete system.

Even though the SCI-VM is based on such a rigid and static task model, it can serve as the basis for more complex and dynamic models which will be described in Chapter 5 and Chapter 6. It therefore does not restrict the flexibility of the HAMSTER system with respect to the intended ability to implement shared memory programming models with potentially quite different task models.

4.5 Integrating the SCI-VM with Existing SCI–based Platforms

The approach presented above represents a direct extension of the virtual memory concept from node–local memory resources to global ones. Its implementation therefore requires a tight integration into the existing software infrastructure. Specifically, the SCI driver infrastructure responsible for the SCI address translation mechanism and the virtual memory management of the underlying operating system need to be considered in this endeavor.

4.5.1 SCI Driver Integration

As described in Section 2.3, users by default get access to SCI through a special low–level user API called the SISCI API [64]. Within this API, shared memory management is done on the basis of individual coarse grained segments, each with its own address space. This is not sufficient for a concept like the SCI-VM, which requires fine–grained mappings at page granularity and the inclusion of mappings into a given virtual address space. In addition, the memory management within the SISCI API is always done with global operations in order to guarantee a global visibility and availability of all segments. This, however, is very costly and induces a high overhead, which is also not desirable in the SCI-VM concept. For these reasons, an implementation of the SCI-VM requires more direct access to the hardware on the SCI adapter cards responsible for remote memory mappings. In order to allow for this, a special extension has been integrated into the existing driver infrastructure provided by Dolphin, more specifically into the Interconnect Resource Manager (IRM), the lower–level driver with direct access to the hardware. This extension, called Direct ATT extension, allows programs to directly allocate, configure, and release ATT entries as well as to directly control the remote memory mappings. As these operations are potentially unsafe, since they allow access to arbitrary memory locations throughout the cluster, they are implemented as a kernel–level API and exported to user level only for testing and prototyping purposes. The extension is also implemented in a way that other programs running on top of the SCI driver infrastructure are not influenced by an SCI-VM application. This ensures full interoperability of the HAMSTER environment with the existing infrastructure and also enables HAMSTER itself to use the existing user–level parts of the SCI driver stack for its own purposes, like configuration segments and interrupt management.
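The exact form of this kernel–level interface is not reproduced here; the following sketch merely illustrates, with invented names and signatures, the kind of operations such a Direct ATT style interface has to offer.

    /* Hypothetical sketch of a kernel-level "Direct ATT" style interface.
     * Names, types, and signatures are invented for illustration and do not
     * reflect the actual extension of Dolphin's IRM driver. */
    typedef unsigned int  att_handle_t;
    typedef unsigned long sci_phys_addr_t;   /* physical address on a remote node */

    /* reserve a single ATT entry on the local PCI-SCI adapter */
    int att_alloc(att_handle_t *handle);

    /* point the ATT entry at a physical page on a remote node; the returned
     * PCI address can then be mapped into a process at page granularity */
    int att_map(att_handle_t handle,
                unsigned int remote_node,
                sci_phys_addr_t remote_phys,
                unsigned long *local_pci_addr);

    /* tear the mapping down and release the entry again */
    int att_unmap(att_handle_t handle);
    int att_free(att_handle_t handle);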

4.5.2 Operating System Integration/Extension

The integration of the SCI-VM into the virtual memory management of the underlying operating system imposes further difficulties. The mechanisms offered by the public APIs of current operating systems do not provide the page–level granularity and low–level access required for the SCI-VM, even though the necessary routines have to be present as internal mechanisms within the operating system's core. It is therefore necessary to use these internal mechanisms or to add additional functionality to the operating system kernels to guarantee a full integration into the virtual memory management. As this is very dependent on the concrete underlying operating system, the low–level part of the SCI-VM implementation also varies greatly between the different supported platforms.

Windows NT/2000™

The first version of the SCI-VM was implemented on top of Microsoft's Windows NT™ 4.0 operating system [146]. As is the case with most commercially available operating systems, the source code is not freely available, which poses a significant challenge for the implementation of systems like the SCI-VM. It prohibits direct access to low–level mechanisms of the kernel beyond the exported and documented API for kernel–level drivers, which is not designed to allow fine–grain access to memory mappings or to enable callbacks on important system events like page faults or disk swap requests. Therefore, the dynamic version of the SCI-VM can not be realized on top of this platform, and the implementation of the static version can also not be done in clean cooperation with the operating system due to the missing mapping facilities. The only way to overcome these shortcomings is to bypass the operating system and to perform the necessary mappings at page granularity directly at processor level, building on the mechanisms of the CPU's memory management unit [99]. This approach, however, has some severe consequences for the OS. As it is completely bypassed and therefore unaware of the manipulations, its stability and robustness is impaired by conflicts between the SCI-VM and the virtual memory management of the underlying operating system. An important design guideline used throughout the complete SCI-VM implementation is therefore to hide the modifications done by the SCI-VM from the operating system and to use distinct resources wherever possible. Based on several experiments using the current version of the SCI Virtual Memory concepts, Windows NT™ has proven to exhibit an acceptable stability enabling extensive experiments. The critical point during the execution of a program using direct page mappings is its termination, especially with several DLLs (Dynamic Link Libraries, Windows NT's counterpart to shared libraries) involved. While freeing a process's memory resources, Windows NT™ under certain circumstances also attempts to free the resources controlled by the SCI-VM that it is not aware of, thereby leading to unpredictable system behavior.

Linux – Static memory mappings

Even with access to the source code, as is possible on the second supported platform, Linux, the implementation of a dynamic version with full support for pageable global memory and dynamic remote page requests is very complex. Therefore, a static version has been implemented, however, in contrast to the first scenario, in cooperation with the operating system rather than working around it. Any page mapping required by the SCI-VM is done using low–level routines of the Linux kernel and therefore does not cause any conflict. Access to these routines is enabled through a special kernel driver module developed for the static version of the SCI-VM, which exports this functionality into user mode. As a result, the static version of the SCI-VM on top of Linux, even though not as fully integrated into the memory management as a dynamic SCI-VM solution would be, represents a stable implementation of the concepts discussed above.
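To give an impression of how such a kernel driver module can export page–granular mappings to user mode, the following sketch shows a minimal character device written against a Linux 2.4–era kernel API; it is an illustration under these assumptions, not the actual SCI-VM module.

    /* Minimal sketch of a character-device kernel module that maps a given
     * physical (e.g. PCI/SCI) address range into a user process at page
     * granularity, in the spirit of the SCI-VM helper driver. Written against
     * a Linux 2.4-era kernel API (remap_page_range); the real SCI-VM module
     * may differ in its details. */
    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/mm.h>

    #define SCIVM_MAJOR 242   /* arbitrary example major number */

    static int scivm_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        /* the caller encodes the physical address to map in the mmap offset */
        unsigned long phys = vma->vm_pgoff << PAGE_SHIFT;

        /* mark the region as I/O memory so it is never touched by swapping */
        vma->vm_flags |= VM_IO | VM_RESERVED;

        if (remap_page_range(vma->vm_start, phys, size, vma->vm_page_prot))
            return -EAGAIN;
        return 0;
    }

    static struct file_operations scivm_fops = {
        .mmap = scivm_mmap,
    };

    static int __init scivm_init(void)
    {
        return register_chrdev(SCIVM_MAJOR, "scivm", &scivm_fops);
    }

    static void __exit scivm_exit(void)
    {
        unregister_chrdev(SCIVM_MAJOR, "scivm");
    }

    module_init(scivm_init);
    module_exit(scivm_exit);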

Linux – Dynamic global memory management

Thanks to the open source nature of Linux, a fully dynamic version of the SCI-VM concepts can also be implemented. This endeavor is further simplified by the fact that the virtual memory management of the Linux kernel allows callbacks to be registered for various events, including the swapping of pages [184]. Together with the available mechanism to register callbacks for page faults, all necessary prerequisites for the implementation are available. In order to ease the implementation, parts can be implemented at user level. Looking back at Figure 4.5, which provides an overview of the internal procedures that need to be implemented, the dynamic scheme can be split into two parts: the handling of page faults, leading either to the creation or a swap–in of a page, and the invalidation of mappings in case of a swap–out event by the operating system. While the latter requires a kernel–level implementation, since the swap–out callback is only available inside the kernel, the former part can easily be implemented at user level using the application–level fault facilities. Currently, the dynamic scheme for Linux is still under development and will be available soon. Therefore, all following experiments are based on the static schemes for both Windows NT™ and Linux. However, as explained earlier, the memory characteristics of both schemes are the same from the viewpoint of applications running on top. This guarantees that the results of the following experiments give a clear indication of the overall system performance independent of the concrete implementation of the SCI-VM.
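The user–level half of the dynamic scheme can be sketched roughly as follows; this is a simplified illustration using POSIX signals, and the helper routines are placeholders for the actual SCI-VM internals.

    /* Simplified illustration of user-level page fault handling for the
     * dynamic scheme: a SIGSEGV on an unmapped global address triggers
     * either the creation of a local page or the mapping of the remote page.
     * The extern helpers are placeholders, not the real SCI-VM code. */
    #include <signal.h>
    #include <stddef.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    extern int  addr_in_global_region(void *addr);        /* placeholder */
    extern void map_page_locally_or_remotely(void *page); /* placeholder */

    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        void *addr = info->si_addr;
        void *page = (void *)((unsigned long)addr & ~(PAGE_SIZE - 1));

        if (!addr_in_global_region(addr))
            _exit(1);                       /* a real segmentation fault */

        /* decide where the page lives, set up the SCI mapping if it is
         * remote, and return; the faulting instruction is then restarted */
        map_page_locally_or_remotely(page);
    }

    void install_fault_handler(void)
    {
        struct sigaction sa;
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }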

4.6 Memory Coherency of the SCI-VM

The concept described above directly relies on the hardware DSM provided by SCI, and therefore any application running on top of it is heavily influenced by its performance. Hence, it is a fundamental necessity to optimize the SCI memory performance. Similar to the traditional DSM approaches discussed above, this can also be done within the SCI-VM by applying relaxed consistency models. These allow the reduction of the number of network transactions, or more accurately in this context remote memory accesses, and therefore have the potential to improve the performance of applications running on top of the global memory abstraction. The mechanisms available to reach this goal are, however, fundamentally different. As discussed above, the traditional DSM systems control the complete virtual memory and all communication to achieve the global memory abstraction. Therefore any memory optimization is also implemented in software, giving the DSM system full control over the communication carried out within the system. In a hybrid DSM approach as seen within the SCI-VM, however, communication is done implicitly by the hardware without software influence. While this increases the efficiency of the communication by reducing overhead, it also limits the possible optimizations to the hardware mechanisms available in the memory pipeline of the underlying architectures. These mechanisms are introduced below, together with their system–wide impact on the memory coherence of the SCI-VM and how to deal with it to guarantee a correct program execution.

4.6.1 Memory Optimization

Both the SCI adapter and the underlying PC architectures offer possibilities for such optimizations. They all deal with transparent buffering or prefetching of remote data and therefore have the potential to reduce the network traffic significantly.

Additional buffering in the SCI adapters

The PCI–SCI adapters by Dolphin ICS [138] offer a large range of parameters to control the memory traffic for both read and write operations. By default, these settings are chosen in a very conservative way to minimize network traffic and to optimize for message passing communication. The adapter provides, however, several features that can be used to improve the performance of a distributed virtual memory system. For write operations, the Dolphin adapters introduce the concept of streams, which allow write accesses to consecutive addresses to be merged into longer network transactions. The current version of the adapters provides eight concurrent stream registers, allowing up to eight series of accesses to be collected independently. Most important for read operations are speculative buffering and aggressive prefetching. Both techniques allow implicit prefetching of consecutive address ranges in hardware by the SCI adapter card [44]. Enabling speculative buffering initiates the transfer of whole 64 byte packets and speculatively holds this data in the stream buffers on the Dolphin SCI card to be used by future read requests. Aggressive prefetching utilizes a similar technique but fetches the next consecutive 64 byte block and loads the data into another stream buffer in the same manner. Obviously, both techniques increase the traffic on the SCI network and could result in the transport of potentially unused data. Nevertheless, they can improve the overall performance of the system, as can be seen in the experiments presented at the end of this chapter.

Caching of remote memory

These settings alone, however, are not enough to guarantee a high performance of the overall system. Especially the long latency of read operations has a drastic impact on the overall performance of shared memory applications. One way to reduce these latencies is to buffer or cache the result of remote reads on the local node, avoiding many remote memory accesses and their latency penalties. This can be achieved by enabling the standard processor caches for remote memory regions, an option normally disabled due to the missing cache coherency in current PCI–SCI bridges (see also Chapter 2.3.3). In the x86 architecture based on Intel CPUs, the cache settings are controlled using the memory type range registers (MTRRs) of the processor [99]. With these registers, up to eight individual physical memory regions (including the I/O space of the PCI bus used by SCI adapters for remote memory mappings) can be defined and associated with different cache types.

[Figure 4.6 Potential cache inconsistency when caching remote memory: after a write to local memory at time t2, the copy of the data cached on a remote node becomes stale, since no hardware coherence protocol exists between the caches and the SCI mappings.]

For this purpose, the Pentium II™ processor offers five different cache types, ranging from uncacheable with strong ordering to write–back caching with speculative loads and weak ordering. To provide the best performance, a PC's main memory is normally set to use the latter option. This setting for SCI memory, however, causes problems on the PC bus system because the PCI bus bridge does not allow write–back operations from caches to be delayed. Exactly this can happen, though, when an SCI transaction fails and has to be retried by the SCI hardware. The result is a deadlock on the PCI bus and a halt of the whole machine [101]. To prevent this, only write–through caching can be used for the SCI address range.
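Under Linux, the cache type of the PCI window used for remote SCI mappings can, for example, be configured through the /proc/mtrr interface; the following snippet is only an illustration, and the base address and size of the SCI aperture are made-up example values that depend on where the adapter is actually mapped.

    /* Illustration: set the PCI address window used for remote SCI mappings
     * to write-through caching via the Linux /proc/mtrr interface. The base
     * address and size are made-up example values. */
    #include <stdio.h>

    int set_sci_window_write_through(void)
    {
        FILE *f = fopen("/proc/mtrr", "w");
        if (!f)
            return -1;
        /* write-through allows read caching of remote memory while avoiding
         * the delayed write-back transactions that hang the PCI bus */
        fprintf(f, "base=0xf0000000 size=0x10000000 type=write-through\n");
        return fclose(f);
    }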

4.6.2 Impact on Memory Coherency

All of these optimizations of the memory system influence the memory coherency of the overall system. This stems from the fact that remote data is stored in various buffers throughout the whole system without being kept consistent. This impact is especially visible with the caching of remote memory in the absence of a hardware cache coherence protocol. As shown in Figure 4.6, updates to local memory are not propagated to remote caches, leaving them with stale values. To avoid any negative impact on the execution and the correctness of the application in such scenarios, it is necessary to deploy additional software mechanisms controlling the memory consistency. These mechanisms need to manage the utilized buffers and caches in a way that allows control over their contents. This means that it has to be possible to flush and/or invalidate these buffers at any time and thereby to either propagate local data through the network to its destination or to delete stale data on the local node. All of these mechanisms are provided within HAMSTER by the so–called consistency management module, which is discussed in more detail in Chapter 5.2. In summary, this leads to a straightforward extension of the hybrid character of the SCI-VM to the memory coherency control. While the actual memory model is still fully implemented in hardware, the control is done in software. This hybrid scheme differs significantly from the consistency enforcement used in SW–DSM systems; there the actual communication, and hence the propagation of memory state, is triggered by the DSM software at well–defined times. In the hybrid scheme discussed here, on the other hand, the communication is performed implicitly by the system without user control. The consistency enforcing mechanisms merely resemble memory barriers guaranteeing that communication has been completed at code positions requiring a consistent view on global data.
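The intended usage can be sketched as follows; the flush and invalidate function names are hypothetical placeholders for the consistency management module described in Chapter 5.2, and the example only illustrates the idea of bracketing accesses to shared data with consistency enforcing calls.

    /* Sketch of software-controlled consistency enforcement on top of
     * implicit hardware communication. The function names are hypothetical
     * placeholders, not the actual HAMSTER consistency module API. */
    extern void cons_flush(void);       /* push buffered local writes out */
    extern void cons_invalidate(void);  /* discard potentially stale cached data */

    extern volatile int *shared_flag;   /* lives in SCI-VM global memory */
    extern volatile int *shared_data;

    void producer(void)
    {
        *shared_data = 42;
        cons_flush();        /* ensure the write has left all local buffers */
        *shared_flag = 1;
        cons_flush();
    }

    void consumer(void)
    {
        while (*shared_flag == 0)
            cons_invalidate();   /* re-read the flag from remote memory */
        cons_invalidate();       /* drop stale cache lines before reading data */
        int value = *shared_data;
        (void)value;
    }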

4.6.3 Towards Relaxed Consistency Models

By utilizing the memory optimizations discussed above and by combining them with the appropriate usage of the consistency enforcing mechanisms, the SCI-VM is capable of implementing a large variety of different consistency models. However, not all consistency models discussed in the current literature (see Chapter 4.1.2) can be implemented, due to restrictions inherently embedded in the underlying hardware architecture. The most severe restriction is implied by SCI's property of not guaranteeing in–order packet arrival. Therefore, write and read operations can overtake each other, destroying the consistency constraints of models which directly connect synchronization aspects with individual read and write operations. Examples for such consistency models are Sequential and Processor Consistency [86]. They can therefore not be implemented on top of the SCI-VM. Less restrictive approaches, like e.g. Weak Consistency [86], which introduce separate transfer and consistency operations, can, however, be implemented within the SCI-VM, as the distinct consistency operations provide the necessary hooks to integrate the consistency enforcing mechanisms. One further restriction is connected to consistency models which relate consistency to individual memory regions, like e.g. Entry Consistency. Such a scheme can not directly be transferred to the SCI-VM because the necessary invalidation or flush of the buffers can only be performed globally and not on selective address regions. It is, however, possible to provide a consistency model with the same guarantees to the application by replacing any selective flush or invalidation by a global one. While this does not exploit the full potential of optimizations possible with such a model, it still effectively provides a suitable execution environment for the respective applications. Apart from these hardware imposed restrictions, the SCI-VM system is in principle capable of implementing any kind of consistency model. In addition, the concrete consistency model is not fixed by the SCI-VM itself, but can rather be determined independently for every target programming model implemented on top of HAMSTER. This satisfies the main design goals of HAMSTER and contributes to the intent of enabling the implementation of as many programming models as possible.

4.7 SCI-VM Performance

As the SCI-VM forms the common core for any programming model implemented within the HAMSTER framework, its performance is crucial for any application utilizing the framework. This chapter therefore provides a first insight into its performance, focusing on the low–level aspects and the impact of the memory optimizations discussed above.

[Figure 4.7 Write performance on SCI-VM memory: bandwidth in MB/s over the transfer size in bytes (4 bytes to 64 Kbytes) for the curves Max, W-WC-OC, W-WC-C, W-WT-C, W-UC-C, W-WC-R, W-WT-R, and W-UC-R.]

Further experiments highlighting the performance details of this core component will be presented in the following chapters.

4.7.1 Basic Memory Performance

The first set of experiments assesses the raw bandwidth available on remote memory allocated through the SCI-VM. This is done using various memory regions with different cache policies and with two different memory access patterns: consecutive and random accesses. The result for write operations is shown in Figure 4.7. In this figure, lines denoted as WT have been measured using write–through caching (necessary to enable read caching), while WC denotes write combining and UC stands for uncached (the latter two do not include any read caching). The letter at the end indicates the memory access pattern, with C for contiguous and R for random accesses. In addition, the figure contains two further curves: Max and W-WC-OC. The former stands for the peak bandwidth available on the cluster and was measured using the official SCI benchmark program provided by Dolphin. The memory model used for this experiment is also write combining, the default model of the IRM and the SISCI API. For the W-WC-OC curve the setting equals the W-WC-C curve, but the transfer loop includes an optimization in the form of write cache flushes, which has proven beneficial to an optimized transfer rate and is also used within the Dolphin benchmark. It should be noted, though, that this optimization is only effective in combination with write combining. Other cache or memory types do not profit from it; it is sometimes even counterproductive. In addition, it is also clear that shared memory applications can not be expected to perform such an optimization, as any access to global memory is transparent with regard to the physical location of the data. The latter curve hence represents only a theoretical experiment to contrast the measured performance with the theoretical peak.

This comparison shows equal behavior for transfer sizes of up to 4096 bytes (a full page) and reaches bandwidth values close to 85 MB/s. For larger transfer sizes, the performance measured with HAMSTER drops significantly, even though both experiments are conducted on the same memory type and use the same optimizations. The difference comes from the fact that Intel CPUs [99] provide two different write combining types, one set through the page tables and one set via the MTRRs. The latter, which is used by default within the IRM and the SISCI API and hence also for the Dolphin benchmark, does not impose any restrictions at page boundaries, allowing for a smooth bandwidth curve. Within the HAMSTER system, however, this option can not be used, as it would prohibit any caching of remote memory. Therefore, it is necessary to use page table based write combining, which leads to the observed anomalies at page boundaries. A similar behavior is also visible without the use of the write cache flush optimization (W-WC-C), but at an overall lower performance. This anomaly is limited to write combining and not present in the write–through and uncached cases. In the former, a peak bandwidth of about 30 MB/s can be reached due to a limited support for write combining present in this cache mode, while the latter achieves only a sustained throughput of about 10 MB/s. Higher transfer rates can be achieved, however, for some of the smaller data transfer sizes, which lead to favorable PCI burst transactions. All of the measurements discussed so far relate only to consecutive addresses being accessed during the transfer. This is, however, quite untypical for shared memory applications, as those do not use the memory for simple data transfers, but rather for random accesses to their data structures. In order to provide a more realistic scenario taking this behavior into account, the curves marked with R at the end show the bandwidth values for accesses to random remote locations. In such scenarios, no PCI burst or write gathering within the SCI adapter can be used, leading to N individual write operations. The bandwidth therefore stays roughly constant over all data transfer sizes at about 1 MB/s. Using the same scenarios as above, the read performance has also been measured. For these experiments, the SCI read optimizations for buffering and prefetching have additionally been enabled for the WC and WT cases. The results are shown in Figure 4.8 and exhibit no large differences between the individual curves. In contrast to the write scenario, the consecutive and the random accesses behave almost identically, since read operations are always blocking and therefore offer no possibility for pipelining. Of special interest is the curve for cached accesses. Their performance is significantly lower for very small accesses, since each access triggers the fetching of a full cache line. With rising consecutive transfer sizes, this can be exploited, leading to the best performance for transfer sizes equal to or larger than the cache line size of the first level cache (32 bytes). In the random case, cache lines can not be reused effectively, leading to the lower performance. With rising numbers of transactions, cached data can be reused, leading to increased performance.

[Figure 4.8 Read performance on SCI-VM memory: bandwidth in MB/s (0 to 1.2) over the transfer size in bytes (4 bytes to 64 Kbytes) for the curves R-WT-C, R-WT-R, R-WC-C, R-WC-R, R-UC-C, and R-UC-R.]

4.7.2 Effect of Memory Optimizations

The bandwidth experiments discussed above, however, only indicate the raw performance of the underlying network infrastructure and do not allow a direct prediction of application performance. Hence, further experiments using kernels and applications are required. As a first step in this direction, the memory optimizations discussed above have been evaluated using two sets of experiments with small numerical kernels: an array sum and a matrix multiplication. The experiments were conducted on an older cluster in a reduced configuration consisting of two PCs with 233 MHz Pentium II™ processors running Windows NT™ 4.0, connected via Dolphin's PCI–SCI cards (D308, PSB Rev.B). In all experiments, only one of the two nodes is used for computing; the second node can be seen as a memory server supplying remote memory pages.

Evaluating prefetching capabilities: array sum

The first experimental kernel computes the sum of a 256 Kbyte array using a standard loop. All data is accessed sequentially and no data is reused. This experiment shows the worst case behavior of a transparently distributed program, since 50% of all data is remote. In addition, all remote data is accessed and therefore needs to be transferred. The program was tested in several versions with different optimizations: both SCI optimization techniques, speculative buffering and aggressive prefetching, have been applied individually and combined, each with caching both disabled and enabled. The results can be seen in Table 4.1. As expected, the unoptimized version of the program shows a dramatic overhead, as the number of remote reads compared to local reads and computation is extremely high. The code therefore runs more than a factor of 60 slower. Significant performance gains can be achieved by enabling the prefetching capabilities of the SCI adapter cards. Especially in this example, prefetching techniques are extremely effective, as the program traverses all memory in a linear fashion.

                              without caching            with caching
                              total [ms]  overhead       total [ms]  overhead
    Local memory / baseline        -          -             2.6        0.00%
    No SCI optimizations         164.2    6152.52%         23.1      801.05%
    Speculative buffering         25.5     869.44%         13.8      437.62%
    Aggressive prefetching       279.8   10555.07%         20.9      716.17%
    Both optimizations            16.2     518.40%          4.8       88.31%

Table 4.1 Performance data for low–level experiments: array sum algorithm.

                              without caching            with caching
                              total [ms]  overhead       total [ms]  overhead
    Local memory / baseline        -          -            62.2        0.00%
    No SCI optimizations       15875.2   25414.67%         67.9        9.43%
    Speculative buffering      12426.6   19872.20%         65.9        6.21%
    Aggressive prefetching     28261.2   45321.63%         68.3       10.09%
    Both optimizations         21776.2   34898.94%         65.3        5.35%

Table 4.2 Performance data for low–level experiments: matrix multiplication algorithm.

The data also clearly shows that aggressive prefetching alone does not help at all; it is even counterproductive. This is most probably due to the fact that data being prefetched can not be utilized directly by the program, since it needs other data first. The prefetching therefore becomes a pure waste of bandwidth. This scenario changes with buffering activated. Now all necessary data that is used before the data from the prefetched block is needed is loaded together with the first load in a 64 byte memory range. Due to this, the prefetched data can actually be utilized and the program performance is increased compared to applying only speculative buffering. Enabling caching of remote memory also boosts the performance significantly. This effect is achieved by implicit prefetching of whole cache lines. It is, however, less effective than speculative buffering and aggressive prefetching using the SCI hardware, because the PCI–SCI adapter triggers prefetching of consecutive blocks, whereas the prefetching through caching is limited to single cache lines. Combining the two techniques, however, provides the best performance, as the two prefetching levels efficiently complement each other. The result is an overhead factor of less than 100%, which leaves room for speedups when the computation is distributed as well.
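For reference, the structure of the array sum kernel can be sketched as follows; the global allocation call is a placeholder and not the actual SCI-VM interface.

    /* Simplified form of the array sum kernel: one linear pass over a
     * 256 Kbyte array residing in SCI-VM global memory, half of which is
     * physically located on the remote node. global_alloc is a placeholder. */
    #define ARRAY_BYTES (256 * 1024)
    #define N (ARRAY_BYTES / sizeof(int))

    extern void *global_alloc(unsigned long size);   /* placeholder */

    long array_sum(void)
    {
        int *a = (int *)global_alloc(ARRAY_BYTES);
        long sum = 0;
        unsigned long i;

        /* purely sequential accesses, no reuse: worst case for remote reads,
         * best case for the adapter's buffering and prefetching mechanisms */
        for (i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }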

Temporal and spatial locality: matrix multiplication

The second experiment implements a standard matrix multiplication. Both source matrices and the result matrix are distributed. The total working set of this program is also about 256 Kbytes and therefore still fits into the L2 cache of the Pentium II™.

This program also exhibits a rather large number of remote reads compared to the amount of local computation. The same experiments as with the code above were conducted, and the results are shown in Table 4.2. It can clearly be seen that using remote memory via SCI causes a severe performance degradation when it is used transparently without caching. Unlike in the sum experiment, not even the SCI optimizations are able to improve the performance to a point where a parallel execution on several nodes would deliver a speedup. Comparing the SCI optimizations with each other, a similar picture as with the sum code can be seen. Speculative buffering alone improves the performance drastically, while aggressive prefetching decreases performance due to increased network traffic. Unlike in the sum experiment, however, both optimizations combined perform worse than even the unoptimized case. This is due to the fact that a large part of the memory accesses in the matrix multiplication code do not exhibit a single, large linear stride through the memory. Aggressive prefetching therefore does not help significantly, because the stream buffers on the SCI adapter cards that hold the prefetched data are reused for other remote reads before the prefetched data can be used. Only after enabling caching for remote shared memory segments can an acceptable performance be achieved. With any of the SCI optimization levels, the overhead is lower than or equal to 10%, i.e. utilizing the global SCI-VM only costs 10% of the total program execution time, thereby leaving room for excellent speedups in systems that utilize the computational power of several computing nodes. This can be attributed to the fact that the matrix multiplication algorithm exhibits a large degree of both temporal and spatial locality and therefore benefits significantly from caching. As in the case discussed above, speculative buffering again improves the performance and decreases the overhead factor to around 6%. Aggressive prefetching alone, as in the non–cached case, again reduces the performance, but combined with speculative buffering the optimal performance with an overhead of less than 5.5% is reached. This would allow for very good speedups in clusters of several computing nodes and shows that this concept is feasible and has the potential to exhibit good and scalable performance.

4.8 Remaining Problems and Challenges

The SCI-VM used within HAMSTER is currently implemented as a prototype. While this system provides the complete functionality needed for an experimental evaluation of the overall system, a few issues remain open. These are briefly discussed below.

4.8.1 Transparency Gaps in the Underlying Hardware

As the SCI-VM directly relies on the underlying SCI hardware, it is capable of directly exploiting its capabilities, but is also bound by its technical limitations. The main one stems from the fact that the SCI Virtual Memory is designed under the assumption of a fully transparent hardware DSM system. This transparency, however, is not fully present in the current SCI implementations [26]; both the PCI subsystem of current PC architectures and the PCI–SCI bridge can be the source of transaction failures. It has to be mentioned, though, that this transparency gap is in both cases, for the PCI bus and the SCI bridge, not a principal but merely an implementation problem. Both are based on standards which guarantee the transparency expected by the SCI-VM, but the current hardware does not fulfill these standards. The most common problems include deadlock situations on the PCI bus resulting in packet loss and buffering problems in the SCI link chip, again resulting in unrecoverable packet loss. During the last years, these problems have constantly decreased due to newer PCI chipsets (currently the Intel BX/GX chipsets seem to be the least affected) and newer SCI generations (especially the new link chip generation with the LC-3 is expected to solve many of these problems). However, until this development is completed and has led to hardware fully adhering to the respective standards, the SCI-VM, and with it the HAMSTER system, has to live with these problems. As a result, the stability of the overall system is affected, since shared memory programming models are not capable of tolerating such transmission errors due to the implicit nature of the communication (in contrast to explicit communication mechanisms, as given e.g. in message passing systems, where the successful completion of individual transmissions can be controlled and the transmission can be repeated in case of an error). The HAMSTER system is therefore at the moment only an experimental system exploring the shared memory capabilities of hardware DSM architectures, without the option for a production quality system in the immediate future. The hope is, however, that this will change with either further advances in hardware development guaranteeing the transparency promised by the respective standards or with ports to other NUMA architectures.

4.8.2 Full Operating System Integration

As described above, the current version of the SCI-VM is based on the static memory management scheme. While this provides the intended experimental platform to evaluate the overall system, it has some limitations with respect to resource management. One of the currently open issues is therefore to complete the integration of the SCI-VM into the virtual memory management of the OS and thereby eliminate the problems mentioned above.

4.9 Summary

Distributed Shared Memory systems are one of the main prerequisites for the implementation of shared memory programming models on architectures without direct hardware support. They allow the emulation of a global virtual memory based on software mechanisms. Since their introduction with the IVY system [136] in the mid 80's, a large variety of systems has been developed exploring the different aspects associated with them. This includes the exploration of various relaxed consistency models for the reduction of inter–node communication requirements as well as the investigation of additional hardware support for their optimization.

The latter, however, has mostly been restricted to interconnection fabrics with remote write or update functionality, but without remote read capabilities. By using a NUMA–based interconnection fabric, namely the Scalable Coherent Interface (SCI), this work can go a step further and deploy the HW–DSM provided by SCI directly for the creation of a global virtual memory. This HW–DSM alone, however, is not enough, because it relies only on physical addresses and not on a virtual address space as required for a shared memory programming model. Therefore an additional software component is required. The result is a new type of DSM system called the SCI Virtual Memory or SCI-VM, merging SW– and HW–DSM in a hybrid fashion: any communication is carried out directly in hardware without any protocol overhead, while the memory and system management remains in software. As this concept directly relies on the underlying HW–DSM, it is important to optimize its performance. For this purpose, various buffering and caching mechanisms available on both the local system and the SCI network adapter can be enabled. Experiments have shown that this has the potential to significantly increase the performance of applications running on top of the SCI-VM. These optimizations, however, have an impact on the memory coherency presented to the user, since data is stored in the various buffers and updates are delayed or invisible due to stale cache contents. Users are therefore required to take this into account within their applications and use memory consistency enforcing mechanisms to guarantee a correct program execution. This leads the way to relaxed consistency models as they are known from traditional SW–DSM solutions, however based here on a radically different hardware/software implementation.

Chapter 5

HAMSTER Management Modules

The HAMSTER system is designed to accommodate as many shared memory programming models as possible. In order to accomplish this goal, the system provides a number of services beyond the pure shared memory abstraction implemented by the SCI Virtual Memory or SCI-VM. These services are grouped into a number of modules, including modules for memory and consistency management, synchronization, and task control. In addition, a separate cluster control module provides the base services needed for cluster management and for the creation and maintenance of the global process abstraction. This chapter introduces these modules and their functionality, and also discusses their relation to each other.

5.1 Memory Management

The first module discussed here is responsible for the management of the globally shared memory. It directly relies on the SCI Virtual Memory described in Chapter 4 and exposes its capability to form a global virtual memory to higher layers, allowing them to allocate and control this global memory. This control ranges from the selection of memory coherency types to locality annotations at allocation time. In addition, this module provides some control over static data contained within the application's binary. This data is by default not shared, since it is allocated during the initialization of the application by the operating system without any chance for the SCI-VM to gain control of it. The memory management module therefore provides mechanisms to identify this memory area and to make it available within the cluster in either an implicit or an explicit way, as dictated by the intended target programming model.

5.1.1 State of the Art

Memory management for NUMA architectures, both with and without cache coherency, has been investigated in a variety of projects and products. Their main goal has been the improvement of data locality, since good data locality is the key requirement for good application performance on NUMA machines. Therefore, most CC–NUMA machines, which use their own operating system or an extension thereof to provide a global system and memory management, contain some sort of automatic and transparent data locality optimization. Examples for this can be found in the IRIX operating system [39] for the O2000/O3000™ series by SGI [131], which enables transparent thread and data migration, and in the FLASH architecture [128], which uses a sampling of the cache performance counters in the processors to optimize the data layout at runtime [223].

Also for pure NUMA machines, a significant amount of work in this direction has been done. The DUnX system (Duke University nX2) provides a general framework for the evaluation of different NUMA placement and migration policies [105]. It has shown that no single policy is optimal for all applications and that the best scheme is highly dependent on each application's memory access pattern and locality behavior. A similar conclusion is also drawn by Bolosky et al. [53] using a trace driven simulation of several existing NUMA management schemes. This observation has inspired work on application driven locality optimizations, which enable programmers to influence data placement decisions. Using this approach, semantic knowledge from the application can be used directly for the optimization process, but it also results in more coding complexity for the programmer. To keep this impact at a minimum, novel programming systems have been developed which aim at providing higher–level programming constructs for the easy specification of data locality. An example for this is the Tornado [59] NUMA operating system. This work, which is implemented for the Toronto NUMAchine [70], uses an object–oriented abstraction for this purpose. Programmers can form groups of objects, so–called cluster objects, and base the locality specification on these application and data set specific entities. Data locality optimizations are also present in SW–DSM systems, however mainly in an explicit form. This means that the programmer is required to either explicitly place data parts on certain nodes or to manually initiate data migrations. Only few systems, mainly those which combine DSM principles with a higher–level programming model, enable an implicit data distribution. The COOL system [32, 31], an extension of C++ for parallel programming based on a global memory abstraction, allows programmers to specify affinities of data and tasks to processing nodes in the form of optional hints. With their help, the locality of COOL applications can be optimized, resulting in more efficient program execution. Another example for such a system is the Global Array toolkit [168], an explicit shared memory programming environment. It is intended for applications working on dense matrices and bases its locality and data distribution policies on the programmer's specifications on parts of matrices. Any matrix under the control of Global Array is available from any node, but in contrast to true shared memory programming models, users have to explicitly use library calls for data accesses. This allows a much easier software implementation and also renders this approach feasible in Wide Area Network (WAN) scenarios [167].

5.1.2 Functionalities of the Module

The functionality of this module is closely related to the SCI Virtual Memory layer discussed in Chapter 4. It allows direct control of the memory allocation process and of the virtual memory regions that are part of the global memory abstraction. The routines provided by the module for this purpose are listed in Table 5.1.

    API call                   Description
    scivm_alloc                Allocate a chunk of global virtual memory
    scivm_getGlobalMem         Reserve a piece of configuration memory
    memMod_shareStaticImpl     Implicit distribution of static application data
    memMod_shareStaticExpl     Explicit distribution of static application data
    memMod_getStaticAddr       Query addresses for explicit distribution
    memMod_getStatistics       Query the current status of the collected statistics
    memMod_resetStatistics     Query the statistics and reset them

(The prefix scivm is used instead of memMod due to historic reasons.)

Table 5.1 API of the memory management module.

The main routine, scivm_alloc, deals with the allocation of global memory. It can be used in two different modes: the first one allocates global memory in a transparent fashion, i.e. in a round–robin manner at the finest possible granularity (on the currently supported systems consisting of commodity PC hardware this normally is page granularity, i.e. 4 Kbyte), while the second one gives the caller some control over how the newly allocated memory will be distributed among the nodes participating in the global process abstraction of the SCI-VM. This can help to improve performance in cases where the memory access pattern of the application is known, as discussed later in this chapter. In addition, the latter mode also allows higher layers to request certain memory coherency guarantees in the form of coherency types, which will also be discussed in more detail below. Besides the allocation of new dynamic memory in a heap–like fashion, the memory management module also allows some control over the existing static data, which is part of the application's executable and provided at load time by the operating system's loader mechanism. Here both an implicit scheme, i.e. hiding the existence of node local versions of this data by extending the global memory abstraction, and an explicit scheme, i.e. providing explicit access to the individual static data segments on all nodes, are offered to higher layers. The appropriate distribution and coherence type, if any is needed at all, can then be chosen by the intended target programming model. In addition, the memory management module gathers statistical information about the status of the global memory and therefore contributes to the HAMSTER monitoring interface mentioned in Chapter 3.4.2. Table 5.2 lists the information recorded during the application execution. This information can then be queried by higher layers within the HAMSTER infrastructure, either by a specific programming model or by an appropriate tool environment [110, 111]. The corresponding routines are also listed in Table 5.1.

5.1.3 Influencing the Memory Layout

    Variable            Purpose
    numAlloc            Number of times scivm_alloc was called on the local node
    staticSharedImpl    Static application data shared implicitly
    staticSharedExpl    Static application data shared explicitly
    virtMem             Total amount of memory under the control of the SCI-VM
    allocMem            Amount of global memory requested by the local node
    physMem             Amount of local physical memory used by the SCI-VM
    allocStatic         Size of static memory under SCI-VM control (always zero if the
                        static application data is shared neither implicitly nor explicitly)

Table 5.2 Statistical information collected by the memory management module.

    Pattern        Parameters   Description
    TRANSPARENT    —            Distribute memory at the finest possible granularity
                                (also used if no locality annotation is provided)
    FULLDIST       node         Use physical memory on node
    BLOCKDIST      node         Distribute the memory in blocks, put block 0 on node
    CYCLEDIST      node, size   Distribute chunks of size in a round robin fashion,
                                put chunk 0 on node

Table 5.3 Memory distribution types available for locality annotations.

As mentioned above, the memory management module gives higher layers the option to influence the distribution of the physical memory backing the newly allocated global virtual memory during the allocation process. This is done on the basis of locality annotations that are passed to the system along with memory allocation requests. The newly allocated physical memory used to back the requested global virtual memory will then be allocated as specified in the annotation. The user currently has the choice of four different basic distribution types. These are listed together with their respective parameters in Table 5.3 and visualized in Figure 5.1. With these four types, most of the commonly needed distribution options can be described. In addition, it is planned to allow the user to hand–tune the distribution by providing the system with a concrete distribution list containing the information about where each page of the newly created memory region should be located. Such an option would, however, be very cumbersome to use and is intended more for the implementation of higher–level programming models deploying parallelizing compilers or other means of code generation. It is worth noting that these extra locality parameters are referred to here as annotations, as they merely guide the allocation of resources among the cluster but do not influence the actual functionality of the system. No matter which annotation is passed to the memory management module, the system always returns a newly allocated piece of virtual memory accessible from any node within the system; merely the access times to the memory can vary, as different memory layouts lead to different locality properties. This mechanism can be used for an easy and safe incremental performance optimization: the application can first be ported to a HAMSTER–based system using just the transparent allocation mechanisms.


[Figure 5.1 Examples of memory distribution types (based on a system with 4 nodes): FULL (node=2), BLOCK (node=1), CYCLE (node=1, size=2 pages), and CYCLE (node=0, size=1 page), which equals TRANSPARENT; pages are shown color–coded by the node they are placed on.]

Once this is completed and the correctness of the code has been verified, the performance can then be tuned by introducing locality annotations at the memory allocation points in the code. This no longer changes the functional behavior of the code, but instead allows for a fine tuning of the memory layout adhering to the application's memory access pattern. Especially codes based on large regular data structures, like dense matrices, and with regular and static task distributions can profit significantly from locality annotations. This is illustrated in the following using two codes with these characteristics, more specifically a code performing an iterative Successive OverRelaxation (SOR) and a code implementing a Gaussian elimination with pivoting. The pseudo code for both kernels is shown in Figure 5.2. It can be seen that, in both cases, each node is responsible for a subpart of the total matrix. This property can be used to optimize the data layout by distributing the data in a way that all matrix rows are located on the node responsible for the respective row. In the SOR code, this leads to a regular block distribution and for the Gaussian elimination to a block–cyclic distribution with block sizes equal to the size of a single row. Table 5.4 shows the performance of both codes with two different matrix sizes each, using both a transparent and the optimized data layout discussed above. The data shows the significant improvement that is gained by optimizing the codes through locality optimizations. Without optimizations, the performance is very poor and does not lead to any speedup, while the optimized versions show the expected speedup. The speedup of the SOR code is thereby slightly higher than the one of the Gaussian elimination, since the latter requires a more fine grain parallelization and induces more global communication caused by the global pivoting.
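As an illustration of how such annotations might be used for these two codes, the following sketch allocates the matrix once with a block and once with a row–cyclic distribution; the function signature and the constant encoding are assumptions made only for this example, while the distribution type names are those of Table 5.3.

    /* Hypothetical usage sketch of locality annotations. The signature of
     * the allocation call and the enum encoding are assumptions; only the
     * distribution type names are taken from Table 5.3. */
    #define MATRIX_DIM 1024

    enum { TRANSPARENT, FULLDIST, BLOCKDIST, CYCLEDIST };  /* assumed encoding */

    extern void *scivm_alloc_dist(unsigned long size, int dist_type,
                                  int node, unsigned long chunk_size); /* assumed */

    void allocate_matrices(void)
    {
        unsigned long row   = MATRIX_DIM * sizeof(double);
        unsigned long total = MATRIX_DIM * row;

        /* SOR: each node works on a contiguous block of rows, so a plain
         * block distribution places every row on the node that computes it */
        double *sor_matrix = scivm_alloc_dist(total, BLOCKDIST, 0, 0);

        /* Gaussian elimination: rows are assigned round-robin, so a cyclic
         * distribution with a chunk size of one row matches the work split */
        double *gauss_matrix = scivm_alloc_dist(total, CYCLEDIST, 0, row);

        (void)sor_matrix; (void)gauss_matrix;
    }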

    SOR:                                      Gaussian elimination:

    Split matrix in #nodes blocks             for all rows i
    for all iterations                            if (i modulo #nodes) = rank
        for all rows i in block rank                  compute pivot column
            for all columns j in row i            end if
                compute matrix element (i,j)      global barrier
            end for                               for all rows j greater than i
        end for                                       if (j modulo #nodes) = rank
        global barrier                                    compute all elements in row j
    end for                                           end if
                                                  end for
                                                  global barrier
                                              end for

Figure 5.2 Pseudo code for the SOR code (left) and the Gaussian elimination (right).

                          SOR code                   Gauss code
    Matrix size           512x512     1024x1024      512x512     1024x1024
    Seq. execution        1.36 s      5.58 s         4.83 s      40.23 s
    Par. execution        78.16 s     1033.4 s       44.95 s     341.19 s
    Speedup               0.0174      0.0054         0.1075      0.1179
    Opt. execution        0.43 s      1.57 s         2.11 s      13.33 s
    Speedup               3.16        3.55           2.289       3.018
    Improvement factor    181.8       658.2          21.3        25.6

Table 5.4 Impact of locality annotations on dense matrix codes (on 4 nodes).

The good performance of the optimized codes is also visible in the speedup numbers of these codes on up to 6 nodes, as shown in Figure 5.3. Both exhibit a significant speedup across all configurations used, with the better overall performance for the SOR code due to the reasons already discussed above. In addition, both codes perform better on larger matrix sizes, as can be expected due to the decreased impact of the parallelization overhead. It should be noted, however, that such a steep increase in performance after applying appropriate locality annotations is not common among all applications. Only applications with regular memory access patterns and data structures like dense matrices are suited for this kind of optimization. In addition, the codes presented here are mainly based on small arithmetic operations on only global data, without any computation performed on local data, thereby further increasing the potential impact of locality annotations. On the other hand, applications with heavily irregular and unpredictable memory access distributions and/or local computations are either no candidates for such static distributions or do not require them in order to provide sufficient performance.

[Figure 5.3 Speedup of dense matrix codes using locality annotations: speedup over the number of nodes (1 to 6) for SOR-512, SOR-1024, Gauss-512, and Gauss-1024, compared to the ideal speedup.]

    Type        Description
    NONE        No buffers or caches activated
    BUFFERED    Activate write buffers or read prefetch buffers
    CACHED      Activate caches
    BOTH        Combine the two types above (default)

Table 5.5 Memory coherency types.

Other prerequisites for the use of annotations are that the memory access pattern has to be known a priori, i.e. before the execution of the application, and that the programming model used by the application has to export this capability to the application programmer. In the case where one of these prerequisites is not fulfilled, the locality annotations of the memory management module can obviously not be used. In order to still allow for some locality optimizations, a dynamically adaptable runtime system is necessary which is capable of detecting areas with bad locality and removing them by appropriate data and/or thread migration in cooperation with the underlying SCI Virtual Memory. Such a system is currently under development within the SMiLE project based on a hardware monitoring approach [110].

5.1.4 Controlling the Coherency of Global Memory

As mentioned in Chapter 4, the SCI Virtual Memory enables non–coherent caching of remote memory on a non–CC–NUMA architecture. This, however, does not mean that remote memory should always be cached; for some regions of global memory, e.g. configuration address space or memory regions with mainly write accesses, it is beneficial to avoid caching and hence to decrease the amount of necessary consistency enforcing mechanisms at programming model or application level.

For this purpose, the memory management module allows the specification of so–called memory coherency types, i.e. the coherency guarantees that are expected from a certain piece of memory. Table 5.5 shows the main attributes available for this purpose. Each of these can be used for both read and write accesses. The strongest coherence type disallows any caching, i.e. the utilization of any processor cache, or buffering, i.e. the deployment of write and read prefetch buffers as well as potentially available streaming mechanisms. This coherence type can then be incrementally weakened by enabling either buffering or caching, or both. Any of the above choices is implemented within the memory management module in close cooperation with the SCI-VM by enabling or disabling the appropriate resources in the system's hardware. This can range from mechanisms in the network adapter to caching policies within the CPU. It therefore depends strongly on the capabilities of the underlying system and can lead to situations in which the requested coherence type can not be provided by the system and only a more coherent memory type can be offered to higher layers. One typical example for such a scenario is caching for write instructions, which would lead to write–back caching. In the PC architecture, this causes timing problems on the PCI bus [101] and leads to a deadlock of the complete system, effectively crashing the machine. In addition, certain guarantees may not be implementable without side effects on the memory coherency of other memory areas which are either already allocated or will be allocated at a later time. An example for this is the buffering of both reads and writes within SCI adapters. While the latest adapters allow the usage of such buffering mechanisms on a per–ATT basis and can therefore be tailored to influence only specific memory regions, older adapters only allow these options to be set globally, influencing the whole global memory within the system. To give higher layers control over these two scenarios, an additional argument can be passed to the memory management module specifying the behavior in these situations. Table 5.6 lists all options for this argument. In cases where EXACT is used, only those memory allocation requests will be granted and executed which can be handled in a way exactly adhering to the requested specification and without any global side effects. This can be weakened by ALLOW_STRONGER or ALLOW_GLOBAL in either one of the two directions, or in both at the same time using ALLOW_BOTH. In case such a weakening of the memory coherency is performed by the system, the caller of the allocation routine is informed about this on return of the successful memory allocation. Combined, these two additional arguments controlling memory coherency give any higher layer using them full control over the allocated memory without requiring any external knowledge about the underlying system. The memory management module is aware of the constraints of the target system and matches them with the requested attributes for newly allocated memory. This mechanism provides a maximum of comfort with the least restrictions on the available coherency types.

Type             Description
EXACT            Only allow exact matches
ALLOW_STRONGER   Also include stronger memory coherency types, but no global side effects
ALLOW_GLOBAL     Also allow global side effects while setting the coherency type, but no stronger types
ALLOW_BOTH       Both of the above (default case)

Table 5.6 Parameters to control the handling of a coherency type request.
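To illustrate how the coherency type and the handling argument are intended to be used together, the following sketch shows a possible allocation request. All function and constant names except those of Table 5.6 are purely illustrative and do not necessarily correspond to the actual HAMSTER interface.

    /* Hypothetical usage sketch; memMod_alloc and the MEM_* constants are
     * illustrative and stand for the allocation routine of the memory
     * management module (assuming the HAMSTER module headers are included). */
    void example_allocation(void)
    {
        unsigned requested = MEM_READ_CACHED | MEM_READ_BUFFERED;
        unsigned granted;

        void *region = memMod_alloc(64 * 1024,        /* size in bytes                 */
                                    requested,        /* requested coherency type      */
                                    ALLOW_STRONGER,   /* mismatch handling (Table 5.6) */
                                    &granted);        /* coherency actually provided   */

        if (region == NULL) {
            /* the request could not be satisfied under the given constraints */
        } else if (granted != requested) {
            /* a stronger (more coherent) type was substituted; the caller is
             * informed about this on return, as described above */
        }
    }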

5.1.5 Dealing with Static Application Data

The mechanisms discussed above apply only to memory allocated through and under the control of the SCI Virtual Memory. The memory area reserved for the static application data, i.e. the data implicitly contained in the binary, is not affected by this and is therefore only available locally on each node. It is solely controlled by the local operating system instance and allocated by the application loader and/or the dynamic linker. A global view of this data is therefore not possible with the mechanisms presented so far.

While this is sufficient for many shared memory programming models, especially for the programming models exposed by most software DSM systems, some programming models require access to this static application data. The best examples are the standard thread programming models present in almost all current operating systems. In these models, all data, static and dynamic, is shared between the individual threads. When implementing such a model on top of HAMSTER, the static memory needs to be made available to all other nodes.

The HAMSTER memory management module provides the necessary base functionality to fulfill this requirement and thereby opens the HAMSTER system also to those programming models. It should be noted, though, that this functionality is not applied by default in order to maintain the maximum amount of flexibility. Any programming model implemented within HAMSTER has the choice of whether static data should be accessible to other nodes and, if so, how it should be made available. For this purpose, two basic schemes are offered: an implicit and an explicit distribution. Both are explained in more detail below.

Identifying static application data

Before the static data can be distributed, it is first necessary to identify the appropriate memory regions. First, the actual data part of the application's memory footprint needs to be found; this data part then needs to be further subdivided into static data allocated for the application, i.e. static data from the actual application modules, and data allocated for the runtime system. Only the former part should be affected by any routine within the memory management module dealing with static application data, as only it is part of the actual application. The runtime data needs to be kept local and separate on each node at all times, since the application execution is performed on separate operating system instances and therefore also on separate runtime instances, which must not interfere with each other to ensure a correct program behavior.

[Figure: two object files, each consisting of a table of segments and segments A, B, and C, are merged by the linker into a single resulting file with one combined segment per identifier.]

Figure 5.4 Segments within the file format and linking process.

The implementation of this identification process is based on the executable file formats used in Linux and Windows NT™. In both operating systems, any executable or object file is formed by a series of segments, each with a specific content, e.g. code or data, and a unique identifier or name. In addition, each file is preceded by a header containing a list of all segments within the respective file. This list provides all information needed to identify and locate each segment within the file. An executable is created during the link process from various binary object modules, as depicted in Figure 5.4. During this linking process, the segments of all modules with identical identifiers are merged into a single segment. At application load time, this complete binary is copied into the virtual memory address space of a newly created process. Using the header information, the data segments can be identified and exactly located.

In order to further distinguish between static data belonging to the application and data belonging to the runtime system, a closer look needs to be taken at the origin of the data. While the static application data has its source in the data segments of the application object modules compiled by the user, the runtime data is added by the linker from separate runtime modules, thereby creating a complete application. To keep these two sources apart, a small patch utility has been developed which is capable of marking the data segments of object files by slightly changing their names. This is accomplished by simply modifying the header containing the list of segments of the respective object file.

By applying this patch tool to all application object modules and marking each data segment therein in the same way, a new executable is created with two different sets of data segments, one marked and one unmarked, as shown in Figure 5.5. While the latter contains all data of the runtime system, which should not be touched by the memory management module, the former contains solely the static application data available for further treatment by HAMSTER.

[Figure: the application's object files are processed by the patch utility, which renames their data segments; linking the patched application object files with the unmodified runtime object files yields the final application binary with a common code segment and two separate data segments, one for the runtime data and one for the application data.]

Figure 5.5 Linking process after patching the application’s object files.
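The marking step can be illustrated by the following sketch of a minimal patch utility for ELF object files under Linux. It is not the tool used within this work, but it shows the principle of renaming a data segment by rewriting the name stored in the file's segment list; a same–length rename is used here so that the string table layout remains untouched, whereas the actual tool may use a different marking scheme.

    /* Hypothetical sketch: rename the ".data" section of an ELF32 object file
     * in place to ".dat2" (same length). Error handling is kept minimal. */
    #include <elf.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s object.o\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDWR);
        struct stat st;
        fstat(fd, &st);
        void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        Elf32_Ehdr *ehdr = (Elf32_Ehdr *)base;
        if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0 ||
            ehdr->e_ident[EI_CLASS] != ELFCLASS32) {
            fprintf(stderr, "not a 32-bit ELF file\n");
            return 1;
        }

        /* locate the section header table and the section name string table */
        Elf32_Shdr *shdr = (Elf32_Shdr *)((char *)base + ehdr->e_shoff);
        char *shstrtab = (char *)base + shdr[ehdr->e_shstrndx].sh_offset;

        for (int i = 0; i < ehdr->e_shnum; i++) {
            char *name = shstrtab + shdr[i].sh_name;
            if (strcmp(name, ".data") == 0) {
                memcpy(name, ".dat2", 5);   /* same-length rename keeps offsets valid */
                printf("marked data section in %s\n", argv[1]);
            }
        }

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }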

Providing implicit access to static data

Once the static application data has been identified in the way described above, the memory management module is capable of distributing it using either an implicit or an explicit scheme. In the implicit case, the static data is distributed across nodes and then made available at its original location. This effectively puts the memory area containing the static data under the control of the SCI Virtual Memory and thereby shares it among all nodes. This completes the transparency of the full virtual address space and allows the implementation of programming models relying on such a fully transparent memory, like the thread models briefly mentioned above5.

The actual distribution is done in a three–step process, as depicted in Figure 5.6. First, a global piece of memory with the same size as the memory area containing the static application data is allocated using the standard SCI-VM mechanisms. In the second step, the static application data is copied into the global memory area, making it available to every node. The choice of the node from which the data is copied does not have any influence, since the data is assumed to be identical on all nodes at application start. In addition, it is expected that the routine providing this implicit distribution is called before any changes to the static application data occur. In the third step, the newly created global memory region is mapped into the original location of the static application data. During this process, the initial virtual memory mappings are replaced by mappings pointing to the memory region created by and under the control of the SCI-VM.

5More details on the thread programming models can be found in Section 6.4.

[Figure: on nodes A and B, the runtime data and the application data are initially backed by local memory next to the SCI-VM heap area; step 1 allocates a region in the SCI-VM controlled global virtual memory abstraction, step 2 copies the application data into it, and step 3 maps this global region back to the original location of the application data on every node.]

Figure 5.6 Implementing the implicit access scheme to static application data.

Once these mappings, some of which may be remote, are established, the application accesses the globally shared copy of the static application data instead of the initial local version during its further execution.

This approach, however, comes with some inherent implementation challenges. During the mapping, the memory management module needs to replace the virtual memory mappings of the virtual memory area containing the static application data. As this mapping was initially created by the operating system when loading the application, conflicts with the operating system can occur. This is especially true for operating systems like Windows NT™ which, due to the missing access to the source code, do not allow a clean unmap of the virtual memory regions before applying the new mappings. Such an implementation has therefore proven, despite its realization as a proof of concept, too unreliable for extensive experiments. With an open source operating system like Linux, on the other hand, a clean unmapping of virtual memory previously under the control of the operating system can be accomplished, resulting in a stable system.
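The three–step process described above can be summarized by the following sketch. The SCI-VM entry points and the linker symbols used to delimit the static application data are hypothetical and only serve to illustrate the principle.

    /* Sketch of the implicit distribution; scivm_alloc_global, scivm_map_at,
     * scivm_barrier and the app_data_* symbols are illustrative assumptions. */
    #include <string.h>

    extern char app_data_start[];     /* begin of the marked data segment */
    extern char app_data_end[];       /* end of the marked data segment   */

    extern void *scivm_alloc_global(size_t size);
    extern void  scivm_map_at(void *addr, void *global, size_t size);
    extern void  scivm_barrier(void);

    void distribute_static_data_implicit(int node_id)
    {
        size_t size = (size_t)(app_data_end - app_data_start);

        /* step 1: allocate a global memory region of the same size */
        void *global = scivm_alloc_global(size);

        /* step 2: one (arbitrary) node copies the still identical initial
         * contents of the static data into the global region */
        if (node_id == 0)
            memcpy(global, app_data_start, size);
        scivm_barrier();              /* wait until the copy is completed */

        /* step 3: replace the local mappings of the static data area by
         * mappings into the global region, so that all further accesses
         * on every node hit the shared copy */
        scivm_map_at(app_data_start, global, size);
    }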

Providing explicit access to static data

Besides this implicit distribution, some programming models also require more explicit distribution and access semantics for the static data. This means that the separation of the individual static segments is preserved while still allowing any node to access any other node's static data. An example of a programming model relying on this capability is the put/get model discussed in Section 6.5.1. In this model, the virtual address spaces are kept separate and the user is provided with primitives to copy data to (put) and from (get) any memory location in the other nodes' virtual memory6.

The implementation of the explicit scheme is done in a two–step process, as depicted in Figure 5.7. First, all physical page frames backing the static application data on all nodes need to be pinned down in order to prevent them from being swapped out.

6More details about this programming model are listed in Section 6.5.1.

[Figure: on nodes A through D, the application data is backed by pinned physical local memory; each node exports its static application data, which is mapped into an SCI-VM controlled region of the global virtual memory abstraction so that the application data of A, B, C, and D becomes visible to all nodes.]

Figure 5.7 Implementing the explicit access scheme to static application data.

This pinning is necessary to guarantee safe and secure access from other nodes via the interconnection network. Once this is done, the memory area can be mapped via SCI into a newly created address space segment within the virtual address space controlled by the SCI-VM. This mapping is done in an all–to–all fashion, making the static memory area of every node available to every other node. The target mapping addresses are then made available to the upper layers, allowing access to all static memory areas of the whole system.

In contrast to the implicit scheme, this approach imposes fewer problems with the operating system, as mappings initially created by the operating system are not overwritten during the mapping process. However, direct access to the operating system's memory management is still necessary to perform the pinning of the physical page frames backing the application's static data on each node. Therefore, the explicit scheme is also only available under Linux, where the available source code enables the implementation of this operation.

5.2 Consistency Management

As shown in Chapter 4.6, it is necessary to weaken the memory coherency in order to achieve acceptable performance. The SCI Virtual Memory therefore offers such capabilities, which can be controlled through the memory management module as seen above. This control at allocation time alone, however, is not sufficient to allow both an efficient and a correct application behavior on weakly coherent memory, as the memory state during the execution of an application would otherwise be undefined, most likely leading to an incorrect execution.

To compensate for this, additional mechanisms have to be provided that enable such control by the application or the programming model. Using these mechanisms, a consistent memory state can be enforced whenever and wherever necessary. This enables applications to work on consistent memory whenever required and to tolerate inconsistent, but potentially faster memory whenever consistency is not required for a correct execution. The consistency module within HAMSTER provides such mechanisms, suited for all available coherency types. These are at the disposal of any higher layer to be used for the required application and/or programming model consistency management.

5.2.1 State of the Art

Software controlled cache coherence in NUMA architectures has already been investigated in a few projects. Within Platinum [41], a coherent memory abstraction for a NUMA architecture is created using the virtual memory management to detect read or write accesses causing potential cache inconsistencies, in coordination with global page state information. A similar approach is taken in [176] and [126]. Both, however, do not exist for real systems and have been evaluated only using simulation.

A different approach to deal with potential cache inconsistencies has been taken within the Shared Regions project [188], which defines a high–level abstraction to group and commonly manipulate parts of memory. Based on this abstraction, it provides mechanisms to invalidate or flush remote memory regions and uses these mechanisms in coordination with synchronization primitives to guarantee a safe and reliable memory abstraction [187]. This approach avoids additional maintenance overhead and hence represents a lean and easy–to–implement solution. Its basic principle is therefore also used within this work to enforce a global cache coherency, however in a more general and flexible framework independent of any higher–level data or programming abstraction.

5.2.2 Enabling the Control of the SCI-VM Memory System

As already discussed in Chapter 4.6, it is necessary to utilize relaxed consistency models in order to cope with the missing hardware cache coherence mechanisms. These models have to be capable of controlling the various buffers and caches in the system, the sources of potential inconsistencies. The mechanisms necessary for this can roughly be divided into two categories: mechanisms to ensure that data has been propagated to other nodes and is not still being held in a local buffer, and mechanisms to invalidate potentially stale data, forcing the application to fetch new data from remote locations whenever appropriate.

The first group results in a full flush of all write buffers on the local node. Once called, the flush operation blocks until all transactions currently stored in these buffers have been completed and the data contained within them has reached its final destination. This includes both the processor write buffers and those on the network interface. In the case of the SCI adapters used for the current version of the HAMSTER system, this affects the stream buffers implemented on the PCI–SCI bridges, as they hold data before transmission in order to aggregate write transactions to consecutive addresses.

The second group, the read invalidation, simply invalidates all copies of remote data present on the local node, guaranteeing that the next read access to a particular memory location fetches its data from the physical storage of that data and does not shortcut to potentially outdated local data held in caches or other buffers. As in the case of the write flush discussed above, resources on both the processor and the network interface are affected; the complete cache hierarchy within a node and any buffer used for read buffering or prefetching within the SCI interconnection fabric has to be invalidated.

It is important to note that all of these mechanisms have a purely local effect, since they only deal with flushing or invalidating local buffers and caches. None of the routines requires a remote node to be involved, and they can therefore be used fully asynchronously within the whole system. In addition, their implementation is quite straightforward and does not include any explicit communication or other form of synchronization. On the downside, however, this property means that these mechanisms alone do not suffice for the global memory synchronization required for a safe and reliable program execution. For this purpose, they need to be combined with global synchronization mechanisms, resulting in relaxed consistency models, which will be discussed later in this work.
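As an illustration, a minimal user–level sketch of these two mechanisms for an x86–based node could look as follows. The calls into the SCI adapter and the driver are hypothetical, and the actual implementation is partly located in kernel mode.

    /* Minimal sketch, assuming a current x86 processor and a hypothetical
     * driver interface to the PCI-SCI bridge; all adapter calls are
     * illustrative assumptions, not the actual HAMSTER internals. */
    extern void sci_flush_stream_buffers(void);    /* hypothetical adapter call */
    extern void sci_invalidate_read_buffers(void); /* hypothetical adapter call */
    extern void wbinvd_via_driver(void);           /* hypothetical: WBINVD is a
                                                      privileged instruction    */

    void write_flush(void)
    {
        __asm__ __volatile__("sfence" ::: "memory"); /* drain CPU write buffers  */
        sci_flush_stream_buffers();                  /* drain SCI stream buffers */
    }

    void read_invalidate(void)
    {
        sci_invalidate_read_buffers();  /* invalidate adapter read/prefetch buffers */
        wbinvd_via_driver();            /* write back and invalidate all CPU caches */
    }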

5.2.3 Minimizing the Global Impact

These consistency enforcing mechanisms should be used with great care, because they affect the whole memory system on the local node and hence can have a drastic impact on the performance of any application. Especially the invalidation of the local caches can affect the application, as not only the memory locations holding stale data are invalidated, but the whole memory range containing remote data. This also leads to the invalidation of still valid data. The problem is aggravated on architectures that do not allow partial cache invalidations, like the Intel x86™ architecture [99]; there the complete cache, including local data, needs to be invalidated on each invalidation operation.

Due to these performance implications, it is important to avoid unnecessary cache invalidations as far as possible. In this context, unnecessary invalidations can be classified as those which do not lead to new data being fetched from remote locations and therefore invalidate only up–to–date data. In applications which are solely based on the consistency enforcing mechanisms discussed in this section7, these can be avoided by observing remote write flushes. Only these indicate that a remote node purposely intended to make its local writes available and expects its data to be visible on all other nodes. Therefore, any cache invalidation without a prior remote write flush since the last invalidation on the local node will not reveal any new data. It is therefore unnecessary and can be omitted.

To implement this scheme, it is necessary to provide a global time stamp identifying the temporal relation between write flushes and read invalidations. Using the special mechanisms of SCI in the form of atomic fetch–and–increment operations, such a time stamp mechanism can easily be implemented based on a global, system–wide counter, in the following denoted as Global Activity Counter (GAC). Each read invalidation or write flush is tagged with such a time stamp by reading this counter followed by an atomic increment. Before a cache invalidation, these time stamps are checked to detect unnecessary invalidations, which are then omitted.

7As opposed to applications or programming models which also utilize several memory coherency types with additional implicit guarantees.

More precisely, an optimized write flush consists of the following steps:

• Perform the actual write flush

• Get a new time stamp from the GAC

• Save the time stamp in the global, system–wide variable Last Global Flush (LGF)

A read invalidation uses this information in the following manner to avoid unnecessary invalidations:

• Check whether there has been a flush since the last invalidation by comparing the time stamps in LGF and in the node–local variable Last Local Invalidation (LLI)

• If LGF has an earlier time stamp than LLI, the invalidation is unnecessary and can be omitted

• Get a new time stamp from the GAC and store it in LLI

• Perform the actual read invalidation

An effect of this optimization is that the purely local operations are turned into global ones, since they require the use and maintenance of global state. The impact of this, however, is quite low, as is visible from the performance numbers discussed later in this section, because only very few global operations need to be performed. In addition, these can be implemented using the underlying NUMA architecture and its remote memory access capabilities without the need to interrupt the execution on any other node. This ensures that the implementation of the optimization scheme leads neither to drastically more complex code nor to reduced performance.
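The optimization can be summarized by the following sketch, in which write_flush(), read_invalidate() and atomic_fetch_inc() stand for the local mechanisms of Section 5.2.2 and SCI's remote fetch&increment; all names and signatures are illustrative and merely mirror the steps listed above.

    /* Sketch of the time-stamp optimization; LGF resides in globally
     * accessible SCI memory, LLI is a node-local variable. */
    extern void write_flush(void);
    extern void read_invalidate(void);
    extern unsigned atomic_fetch_inc(volatile unsigned *counter);

    static volatile unsigned *GAC;   /* Global Activity Counter (remote SCI memory) */
    static volatile unsigned *LGF;   /* time stamp of the Last Global Flush          */
    static unsigned LLI;             /* time stamp of the Last Local Invalidation    */

    void optimized_write_flush(void)
    {
        write_flush();                        /* perform the actual write flush   */
        *LGF = atomic_fetch_inc(GAC);         /* get a new time stamp and publish */
                                              /* it system-wide                   */
    }

    void optimized_read_invalidate(void)
    {
        if (*LGF < LLI)                       /* no flush since the last          */
            return;                           /* invalidation: nothing new to see */
        LLI = atomic_fetch_inc(GAC);          /* new time stamp for this node     */
        read_invalidate();                    /* perform the actual invalidation  */
    }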

5.2.4 Introducing Scope

The optimization described above creates an implicit relation between flushes and invalidations: only invalidations following prior flushes need to be performed. So far, this relates to any flush within the whole system. This observation can be used to further optimize the routines of this module. By binding sets of flush and invalidate operations into groups, the preceding write flush responsible for triggering an invalidation can be restricted to only those flushes within the same group as the invalidation in question.

These groups, which are also known as Consistency Scopes [97], implicitly restrict consistency enforcing mechanisms to only those parts of the global data which are protected by the consistency operations within the respective group. Read invalidations only guarantee the visibility of new data from remote nodes which has been flushed to the system within the same scope. This is sufficient for many application scenarios, since read invalidations are often used to guarantee the visibility of new data in specific regions. In this case, the invalidation is carried out within the same scope as the flush associated with updates to this region of interest, avoiding unnecessary cache invalidations in cases in which no data of interest has been written by any remote node.

Using this extended scheme, the scoped version of the write flush is performed as follows:

• Perform the actual write flush

• Get a new time stamp from the GAC

• Save the time stamp in the global variable LGF, which is distinct for each scope (LGF–scoped)

A scoped read invalidation then uses this information in the following manner:

• Check whether there has been a flush since the last invalidation by comparing the time stamps in LGF–scoped and the node–local variable Last Local Invalidation (LLI)

• If LGF–scoped has an earlier time stamp than LLI, the invalidation is unnecessary and can be omitted

• Get a new time stamp from the GAC and store it in LLI

• Perform the actual read invalidation

Scopes have to be allocated by the user of this module and can then be supplied to the consistency enforcing routines as a parameter. However, it must also be possible to disable the scope optimization in order to guarantee full flexibility. Compatibility with the simpler optimization scheme discussed above can be achieved by just using a single global scope, which is provided by the system itself. This leads to the exact same system behavior, because the decision on whether to perform a read invalidation will again be based on the globally last write flush.

5.2.5 Functionalities of the Module

The complete functionality of the module is shown in Table 5.7. The main routines exported are the already mentioned read invalidation and write flush operations. In addition, a separate routine combines these two into a single memory synchronization routine. All of these routines are based on the idea of scopes introduced above, which are passed to the routines as an argument. The last routine in this module enables the allocation of further scopes, which can then be used with any of the other operations.

In order to ease the use of the scope concept in programming models without consistency scopes, the consistency management module introduces two additional constants (see also Table 5.8). They allow the specification that no scope should be used, i.e. that no consistency enforcing operation is ever omitted, or that a single global scope should be applied, which takes all consistency operations in the overall system into account.

API call              Description
consMod_syncRead      Invalidate all read buffers including caches (i.e. Acquire operation)
consMod_syncWrite     Flush all write buffers (i.e. Release operation)
consMod_sync          Both routines above merged together
consMod_allocScope    Allocate a new consistency scope

Table 5.7 API of the consistency management module.

Type                    Functionality
CONSMOD_NOSCOPE         Always perform flush or invalidation
CONSMOD_GLOBAL_SCOPE    Perform flush or invalidation only if necessary with respect to the last operation in the overall system
Allocated scopes        Perform flush or invalidation only if necessary with respect to the last operation on the same scope

Table 5.8 System defined scopes.
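To illustrate the intended use of these routines, the following sketch shows a simple release/acquire style protocol built from the calls of Table 5.7 together with a lock from the synchronization module of Section 5.3; the shared data, the lock handle and the scope variable are illustrative, and the exact signatures of the module calls are assumptions.

    /* Hypothetical producer/consumer usage of the consistency module
     * (assuming the HAMSTER module headers are included). */
    extern double *shared_result;   /* globally shared data (illustrative)    */
    extern void   *result_lock;     /* global lock, see Section 5.3           */
    extern int     result_scope;    /* allocated via consMod_allocScope()     */

    void publish_result(double value)
    {
        *shared_result = value;              /* write the shared data            */
        consMod_syncWrite(result_scope);     /* release: flush all write buffers */
        syncMod_unlock(result_lock);
    }

    double read_result(void)
    {
        syncMod_lock(result_lock);
        consMod_syncRead(result_scope);      /* acquire: invalidate read buffers
                                                and caches for this scope        */
        double value = *shared_result;
        syncMod_unlock(result_lock);
        return value;
    }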

Like the memory management module, the consistency module also collects on–line data to contribute to the overall HAMSTER monitoring interface. It delivers detailed information on both the number of consistency operations called by the higher layers and their associated delay times. The concrete information recorded is listed in Table 5.9. It purposely includes only data about the individual consistency mechanisms and no combined data reflecting the performance of a particular consistency model, since the implementation of such a concrete model is the task of the programming model and should not be restricted by this module.

Variable    Purpose
numAcq      Number of calls to consMod_syncRead
timeAcq     Sum of time spent in consMod_syncRead
maxAcq      Maximal time spent in consMod_syncRead
numRel      Number of calls to consMod_syncWrite
timeRel     Sum of time spent in consMod_syncWrite
maxRel      Maximal time spent in consMod_syncWrite
numFull     Number of calls to consMod_sync
numScope    Number of scopes allocated

Table 5.9 Statistical information collected by the consistency module.

Operation                        Execution time   Scoped time   Overhead
Full enforcement of coherency8   481.97 µs        490.47 µs     8.35 µs
Invalidate read buffers8         470.85 µs        475.36 µs     4.48 µs
Flush write buffers              8.60 µs          12.19 µs      3.57 µs

Table 5.10 Cost of consistency enforcing mechanisms (measured on one dual–processor node of the SMiLE cluster).

5.2.6 Low–level Performance

The cost associated with the consistency enforcing routines is shown in Table 5.10. The left column lists the times for the standard, unoptimized versions, the middle column the times for the routines together with the global operations necessary to manage scopes, and the right column the resulting scope management overhead.

The numbers show that the execution time of the routines is quite low due to their low–level and hardware oriented implementation. The cost for the write flush is significantly lower than the cost for the read invalidation, because only resources with a limited capacity, i.e. the processor write buffers and the SCI stream buffers, are affected. For the read invalidation operation, on the other hand, all caches on the local node (in the case of an SMP node the caches of all processors) need to be flushed, affecting significantly more resources. It should be noted that the latter routine can in addition incur a slowdown of the code executed after the read invalidation, as caches and other read buffers have to be reloaded. This is not included in the cost shown in the table; its impact depends on the application and its memory access pattern and can only be evaluated on a case–by–case basis.

As can also be seen in the table, the overhead associated with the scoped versions of the routines is very low and just reflects the time needed to compute and store the time stamps. This has to be contrasted with the potentially large performance improvement gained by avoiding cache invalidations. The concrete benefit, however, can again only be evaluated based on concrete applications.

5.2.7 Realizing Consistency Models in HAMSTER

The mechanisms introduced above can be used for the realization of a concrete memory consistency model, either for a specific programming model or for an individual application. However, as memory consistency between several nodes or threads is by nature an issue of global state, routines with only a local effect, like those offered by the HAMSTER consistency module, are clearly not sufficient for the implementation of a full memory consistency model. They need to be combined with appropriate synchronization constructs enforcing a temporal relation between individual invocations of the consistency enforcing mechanisms. Together they can then form a complete consistency model.

8The numbers reflect the time needed for an invalidation operation that is not classified as unnecessary and therefore actually performed.

This approach is also taken by existing DSM systems (see the related work on DSM systems in Chapter 4.1). Those, however, mostly merge consistency and synchronization mechanisms statically, essentially hard–wiring a specific consistency model. The HAMSTER framework, on the other hand, with its intention to provide the ability to implement various different programming models, each with a potentially different consistency model, does not implicitly perform such a merge between consistency and synchronization management. This is left to the implementation of the programming model on top of the management modules. A discussion of this, together with a description of how specific consistency models are implemented, is included in Chapter 6.3 by means of specific programming models.

5.3 Synchronization Management

Any programming model based on the shared memory paradigm requires explicit synchronization primitives, since the access to the global shared memory, and therefore any communication, is by default not synchronized. This is in contrast to pure message passing models, in which any message represents an implicit synchronization point; further synchronization is therefore not required in these models.

The synchronization module discussed in this section contains the primitives necessary to enable this required synchronization. It provides both high–performance implementations of typical primitives present in many programming models and low–level mechanisms that enable programming models with specific requirements to implement their own synchronization constructs. This two–sided approach provides the performance and ease–of–use as well as the flexibility expected from a system like HAMSTER.

5.3.1 State of the Art

The issues involved in the implementation of this module have to be seen from two different angles: on the one side the requirements imposed on the functionality of the synchronization module by potential programming models on top of HAMSTER, and on the other side the implementation aspects of the individual primitives.

With respect to the former, the large variety of different programming models also leads to a large number of different synchronization primitives. Thread APIs normally include some sort of mutual exclusion in the form of locks and a wait and notification mechanism. In the case of POSIX threads [219], these constructs are part of the thread implementation and specification, while the Win32 API separates them from the actual thread calls and rather defines a system–wide synchronization framework allowing these mechanisms to act beyond the scope of a single process [155]. Some of the common thread APIs include further synchronization constructs, like the multiple reader / single writer locks present in Solaris™ threads [148] or the concept of monitors [216] available in Java™ [104]. In addition, many shared memory programming models offer barriers as an easy–to–use tool to facilitate a synchronized application execution. Last, but not least, some programming models also rely on flag–based synchronization, i.e. on synchronization information being propagated simply by updating designated flag variables.

Regarding the implementation, any current processor architecture includes atomic read–modify–write primitives. These allow the implementation of efficient locks and other synchronization constructs on hardware shared memory multiprocessors in a very scalable manner [232, 150]. In distributed scenarios, more complex protocols have to be developed to compensate for the missing atomic update operations. In most cases, the only atomic operations available are loads and stores, which causes the implementation of synchronization mechanisms, especially locks, to be quite complex. First solutions for this scenario have been proposed by Dijkstra [43] and Lamport [130]. Further work has tried to optimize synchronization by deploying token–based schemes and/or complex decentralized schemes based on various communication topologies; [103] provides a comparison of a few of these approaches. In addition, quite a bit of work in this direction has recently been done within the DSM–Threads project [165, 226]. A different approach to provide efficient synchronization for distributed architectures, proposed in [153], is the extension of atomic hardware primitives. Using simulation, this work describes the opportunities for an efficient implementation of synchronization in the presence of such global atomic operations and proves their usability.

Most of these primitives are already present in the SCI standard [92] as special packets, and newer adapter generations also implement some of them in an easy–to–use fashion. Nevertheless, few projects have focused on utilizing this potential by implementing synchronization concepts for SCI–based systems. Within the prototype system SALMON [170, 171] at the University of Oslo, a survey of possible implementations of synchronization in the context of explicit messages and locks for mutual exclusion has been done.
Both are based on simple read and write operations, not exploiting the potential of the atomic operations offered by SCI, because those were not yet available in the SCI adapters at that time. In addition, no barriers have been included in the survey, omitting this important synchronization mechanism of many programming models. Within the HPPC–SEA DVSM system [40], which is intended to provide an easy–to–use interface for SCI–based architectures, locks have been implemented directly on top of SCI's remote memory mechanisms using a centralized, token–based scheme. Here as well, none of the atomic operations available in SCI have been used, and the concept of barriers is missing.

5.3.2 Functionalities of the Module

The synchronization module encapsulates all synchronization mechanisms available to the programming models implemented within the HAMSTER framework. This includes both efficient implementations of typical shared memory synchronization constructs like locks and barriers and additional low–level mechanisms which can be used to implement further synchronization constructs for certain programming models. The implementation of this module itself is based on the local operating system interfaces available for synchronization management and on low–level access for the direct exploitation of SCI's features.

API call                     Description
syncMod_allocLock            Allocate a global lock
syncMod_lock                 Acquire a global lock
syncMod_unlock               Release a global lock
syncMod_allocBarrier         Allocate a barrier
syncMod_barrier              Perform a barrier over one thread per node
syncMod_activityBarrier      Perform a barrier over all global threads
syncMod_fixedBarrier         Perform a barrier over a number of threads specified as a parameter
syncMod_allocCounter         Allocate a global counter
syncMod_allocCounterBlock    Allocate a block/array of counters
syncMod_incCounter           Increment a global counter and return the old value
syncMod_getCounter           Query the value of a global counter
syncMod_setCounter           Set the value of a global counter
syncMod_intSetCallback       Register an interrupt callback
syncMod_intTrigger           Trigger a remote interrupt

Table 5.11 API of the synchronization management module.

It also profits from some configuration mechanisms within the SCI-VM and uses them for its own configuration and the storage of global state.

The core routines provided by this module are listed in Table 5.11. They can be grouped into four sets: locks, barriers, global counters, and interrupts. The first three groups each consist of the appropriate allocation routines on the one hand and the routines providing the actual functionality on the other. The last group, the interrupt routines, does not require an explicit allocation routine because all interrupts are established automatically during the module initialization. Therefore, the interrupt interface consists of only two routines: one to register a callback function and one to trigger a (potentially remote) interrupt. In the following sections, the individual functionalities of these four groups are introduced together with implementation details and the base performance of the individual synchronization constructs.

In addition to the actual synchronization constructs, the synchronization management module also includes a monitoring component designed for the on–line collection of statistical data. The information delivered by this component is shown in Table 5.12. It includes data about both locks and barriers and, similar to the consistency module, records both the number of calls to these routines and the time spent within them. The latter information is of special importance, since these times mostly correspond to waiting times while either acquiring a lock or performing a barrier. A large amount of time spent in these operations often indicates a serious performance bottleneck, which should be examined further.

Variable       Purpose
numLock        Number of global lock operations
numUnlock      Number of global unlock operations
numLockWait    Number of times a thread had to wait during a lock
timeLock       Sum of time spent during lock operations
maxLock        Maximal time spent during lock operations
numBarrier     Number of barriers executed
numBarFirst    Number of times a local thread has reached a barrier first
numBarLast     Number of times a local thread has reached a barrier last
timeBar        Sum of time spent during barrier operations
maxBar         Maximal time spent during barrier operations
numCounter     Number of global counters allocated
numCountInc    Number of increments of global counters
numIntOut      Number of interrupts generated on the local node
numIntIn       Number of interrupts triggered on the local node
timeIntIn      Time spent in interrupt handlers on the local node

Table 5.12 Statistical information collected by the synchronization management module.
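To give an impression of how the routines of Table 5.11 are typically combined by a programming model layer, the following sketch shows a simple SPMD–style phase structure using a barrier, the global scope of the consistency module, and a global counter for dynamic work distribution. The handle types, work() and process_chunk() are illustrative, and the exact signatures of the module calls are assumptions.

    /* Hypothetical SPMD usage sketch of the synchronization module, executed
     * by one thread per node (assuming the HAMSTER module headers are included). */
    void spmd_computation(int num_phases, int num_chunks)
    {
        void *bar  = syncMod_allocBarrier();     /* barrier over one thread per node  */
        void *iter = syncMod_allocCounter();     /* global counter for chunk handout  */

        for (int phase = 0; phase < num_phases; phase++) {
            work(phase);                                   /* local computation       */
            consMod_syncWrite(CONSMOD_GLOBAL_SCOPE);       /* publish local results   */
            syncMod_barrier(bar);                          /* wait for all nodes      */
            consMod_syncRead(CONSMOD_GLOBAL_SCOPE);        /* see the others' results */
        }

        /* a global counter can, e.g., implement dynamic distribution of work chunks */
        int chunk;
        while ((chunk = syncMod_incCounter(iter)) < num_chunks)
            process_chunk(chunk);
    }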

5.3.3 Implementation Aspects and Performance of Locks

One of the most important synchronization constructs for shared memory programming are locks. They can be used to protect certain code or data regions by allowing only one thread at a time to acquire a specific lock. All other threads that try to acquire the same lock at the same time are blocked until the thread holding the lock has released it again. This provides the basis for the implementation of mutual exclusion, as it is often used to protect data structures from being manipulated by more than one thread at a time. Such a locking mechanism is provided by the HAMSTER environment through a simple interface. It consists of three routines: one to allocate a lock, one to acquire a lock, and one to release it again.

For an efficient implementation of such a locking mechanism on top of SCI, the atomic transactions provided by SCI can be applied. These transactions perform read–modify–write operations on remote memory in an atomic manner without the possibility of races. In the Dolphin adapter used in this work, one of these transaction types is made available to the programmer through a special kind of remote memory mapping: a read operation through this kind of mapping triggers an atomic fetch–and–increment, i.e. it increments the read memory location and at the same time returns its old value.

Using this atomic operation, a simple ticket scheme can be applied to implement locks (similar to [150]). Two counters are used for this purpose, one called the ticket counter and one called the access counter. When a thread attempts to acquire the lock, it increments the ticket counter using the atomic operation and gets the old value returned, which is saved as the thread's ticket to enter the lock. After this, the thread continuously polls the access counter until its value equals the ticket value. If this is the case, the lock is considered taken by the requesting thread.

             Simple Lock   Secure Lock   Overhead
Lock time    4.28 µs       5.43 µs       1.15 µs

Table 5.13 Cost of locking operations.

During the unlock operation, the thread releasing the lock increments the access counter and thereby grants the next thread access to the lock. This locking algorithm is not only efficient due to its simple implementation, but also provides fair access to the lock for any thread, since it implements true FIFO scheduling.

Unfortunately, this algorithm comes with a problem. Due to the current implementation of the PCI bridges and the SCI adapter cards, the execution of such an atomic increment operation can fail. This can lead to the loss and subsequent repetition of a failing transaction, resulting in an increment by two or more. In this case, which happens only rarely and can normally only be observed under stress conditions on the SCI network, the lock will get into a state in which the algorithm considers it taken even though no thread has acquired it. As a consequence, the application will deadlock.

In order to avoid this problem, a more complex scheme has to be employed which is secure and cannot be harmed by atomic update errors. For this purpose, an additional reservation buffer with one slot for each potential lock acquiring thread is introduced next to the two counters. This buffer is used during the execution of the lock operation to mark and order requests for the lock. The actual locking scheme is based on the insecure method described above and also relies on tickets being requested by the individual threads. In addition, it relies on the fact that the atomic increment operation always returns a value strictly higher than the originally stored one, as in case of an error the transaction is carried out twice and only the result of the second execution is returned.

When arriving at a lock, the requesting thread first acquires a ticket using an atomic operation (just as in the insecure case) and writes the acquired ticket into the reservation buffer. Then the ticket is compared to the access counter, and if the two values are equal, the thread has successfully acquired the lock. If the values are not equal, two things could have happened: either the value is incorrect due to an error, or the value is correct and the thread really has to wait because the lock is currently taken by another thread. To distinguish these two cases, the minimal ticket value in the complete reservation buffer is computed and considered the new access counter. The waiting thread now continuously polls this minimal value. If no other thread is waiting, this minimal value will equal the own ticket and the thread can safely acquire the lock. In the other case, in which the lock is taken by another thread, the minimal value will show the ticket of that thread and the acquiring thread will not be able to continue. A thread releasing a lock will, besides incrementing the access counter, also erase its ticket from the reservation buffer and thereby automatically cause a new minimal ticket value in this buffer. This unblocks the next waiting thread currently polling on this minimal value and allows it to continue its execution.
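The simple, not yet fault–tolerant ticket scheme can be summarized by the following sketch, in which atomic_fetch_inc() stands for a read through the special SCI mapping; the secure variant additionally maintains the reservation buffer and the minimum search described above. All names are illustrative.

    /* Sketch of the basic ticket lock built on SCI's fetch&increment mapping. */
    extern unsigned atomic_fetch_inc(volatile unsigned *counter);

    typedef struct {
        volatile unsigned *ticket;   /* mapped so that a read triggers fetch&inc */
        volatile unsigned *access;   /* access counter in plain remote memory    */
    } ticket_lock_t;

    void ticket_lock(ticket_lock_t *l)
    {
        unsigned my_ticket = atomic_fetch_inc(l->ticket);  /* draw a ticket    */
        while (*l->access != my_ticket)                    /* wait for my turn */
            ;                                              /* (busy polling)   */
    }

    void ticket_unlock(ticket_lock_t *l)
    {
        (*l->access)++;                            /* admit the next waiting thread */
    }

Because tickets are handed out in strictly increasing order, threads acquire the lock in the order of their requests, which is the FIFO fairness property mentioned above.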

[Chart: execution time in µs over one to six nodes for the secure and the standard locking algorithm.]

Figure 5.8 Comparing the two locking algorithms under contention.

The cost of the two locking algorithms is shown in Table 5.13. The numbers represent the time needed for a full lock and unlock cycle together with a simple arithmetic operation. In both cases the actual locking time is very low and therefore causes only a small overhead for the application. Due to the more complex algorithm of the secure lock, the cost of this locking method is slightly higher. The difference, however, is only a little more than 1 µs. This small impact comes from the fact that in this experiment all locking operations are carried out by one thread without any lock contention, in order to evaluate the pure locking overhead. In this case, only very few atomic update errors occur and there is never any waiting time for acquiring the lock. Hence, in almost all cases, the lock can be acquired directly without having to poll on the reservation slots, resulting in this small overhead.

The situation changes when locks are used under contention, as shown in Figure 5.8. In this experiment, a lock–increment–unlock cycle was executed concurrently on up to six nodes with a large number of iterations, and the numbers shown are the total execution times divided by the number of iterations. The results show that the standard method performs significantly better, as the secure method in this contention scenario frequently has to fall back to the slow algorithm of finding the minimal ticket over all reservation slots. Nevertheless, without the guarantee of a working atomic increment, this overhead has to be accepted in order to ensure a correct program execution. For this reason, only the secure version is used within the HAMSTER framework in the remainder of this work.

The experiments presented so far only discuss locks in scenarios with one process or thread per node. The results shown in Figure 5.9 complete this picture with experiments using several processes or threads per node. As the base performance, the graph shows the data for one thread per node (marked "Single team/thread"). The data points denoted as "Multi Team" present the performance of the same experiment, i.e. a lock–arithmetic operation–unlock cycle, for two processes on each node (the nodes are dual–processor SMPs). It can clearly be seen that this introduces a significant overhead, which can mostly be attributed to the mutual influence between the two processes on the same node due to their concurrent global polling for the lock.

[Chart: execution time in µs over one to six nodes for the configurations "Single team/thread", "Multi Team", "Multi Team -1", "Multi Thread", and "Multi Thread -1".]

Figure 5.9 Lock performance under contention with secure algorithm in different SMP scenarios.

This observation is further supported by the data shown as "Multi Team -1", which presents the results obtained with two processes on N-1 nodes and only one process on the remaining node. These data points are roughly the same as in the "Multi Team" case with one node less. This indicates that the one extra process, which runs alone on its node, does not influence the overall performance significantly.

Based on the outcome of these experiments, it is obvious that such concurrent polling by two processes on the same node should be avoided in order to eliminate this drastic overhead. In the case of multiple threads, this can be done without much additional coding complexity by using local operating system locks in addition to the global locks. On locking, each thread first acquires the local lock and then the global one. This ensures that only one thread per node actively polls for the same global lock, while any other thread on the same node competing for the lock is blocked by the operating system while trying to acquire the local lock.

The two lines marked as "Multi Thread" in the graph of Figure 5.9 show the performance results for this combination of local and global locks. The overhead is reduced significantly, to a point where the difference between the performance of a lock with a single or with multiple threads per node becomes negligible. Consequently, the performance with one node being loaded with one thread less also behaves almost identically.
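The combination of local and global locks can be sketched as follows, here using a POSIX mutex as the node–local lock; the global lock handle type is illustrative and the signatures of the syncMod calls are assumptions.

    /* Sketch of the two-level locking scheme: a node-local mutex ensures that
     * at most one thread per node actively polls for the global lock. */
    #include <pthread.h>

    static pthread_mutex_t node_local_lock = PTHREAD_MUTEX_INITIALIZER;

    void hierarchical_lock(void *global_lock)
    {
        pthread_mutex_lock(&node_local_lock);  /* at most one active poller per node */
        syncMod_lock(global_lock);             /* global lock across all nodes       */
    }

    void hierarchical_unlock(void *global_lock)
    {
        syncMod_unlock(global_lock);
        pthread_mutex_unlock(&node_local_lock);
    }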

5.3.4 Implementation Aspects and Performance of Barriers

Besides locks, a second very typical shared memory synchronization construct exists: barriers. These can be used to defer the execution of a set of threads until all of them have reached the same point of execution. All threads call the barrier function, which does not return until all threads of the respective set have called it with the same parameters. After that, the execution of all threads participating in the barrier is resumed.

Barriers are often used to organize regular computations in an SPMD style. In this kind of programming model, all threads of an application execute the same computation on distinct sets of data. Barriers are used to separate the different phases or iterations of a computation in order to keep the execution at the same level. The main parameter of any barrier function is the number of threads which need to be synchronized. To provide the greatest possible flexibility, the HAMSTER synchronization module provides three different barrier routines with different ways to specify the number of threads:

• Standard barrier: Execute a barrier over all nodes with one thread per node. This is often used to coordinate management tasks which need to be executed on all nodes, like the distributed allocation of shared memory.

• Activity barrier: The HAMSTER environment maintains a counter of all running activities or threads. This barrier type uses this counter to synchronize all currently running activities.

• Fixed barrier: Execute a barrier over the number of threads specified as an extra argument. This routine takes care of all remaining cases in which the number of threads cannot be determined implicitly.

All of these routines are based on the same core, consisting of a single routine with the number of threads to wait for as its argument. Individual front ends are used to compute this number and call the core routine to execute the actual barrier.

In contrast to the locking scheme described above, it is not helpful to rely on the SCI atomic transactions for this routine. Due to the natural contention at a barrier, atomic transactions are likely to fail, resulting in an insecure execution. In order to avoid this, an alternative has been developed that is based purely on standard remote memory operations. Furthermore, only local polling is deployed in order to reduce the stress on the PCI–SCI bridge and to optimize the efficiency of the barrier mechanism. For this purpose, each node allocates its own small segment of local memory, from now on referred to as local synchronization memory, and then maps these memory segments of all nodes into its own virtual address space. For each barrier that is allocated, the synchronization module reserves an integer slot for each node in each node's synchronization memory. On entrance to the barrier, each thread marks its arrival in the corresponding slot on every node. This is accomplished by using remote writes through SCI. Once completed, each thread only scans its local memory until all threads have signaled their arrival in the corresponding slots of the local synchronization memory. Once all threads have arrived, the barrier is reset and all threads resume their operation.
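The barrier algorithm for one thread per node can be sketched as follows; the memory layout and helper names are illustrative, and a monotonically increasing round counter is used here in place of the explicit reset mentioned above.

    /* Sketch of the distributed barrier: each node announces its arrival in its
     * slot on every node via remote SCI writes and then polls only its own
     * local synchronization memory. write_flush() stands for the mechanism of
     * Section 5.2.2; all names are illustrative assumptions. */
    #define NODES 6

    extern volatile int *remote_sync_mem[NODES];  /* slot array of each node        */
    extern volatile int  local_sync_mem[NODES];   /* this node's local slot array   */
    extern void write_flush(void);

    void barrier(int my_node)
    {
        static int round = 0;
        round++;                                     /* new barrier episode          */

        for (int n = 0; n < NODES; n++)              /* announce arrival on every    */
            remote_sync_mem[n][my_node] = round;     /* node via remote writes       */
        write_flush();                               /* push the writes out          */

        for (int n = 0; n < NODES; n++)              /* poll only on local memory    */
            while (local_sync_mem[n] < round)
                ;
    }

The round counter avoids the race that an explicit reset of the slots could otherwise cause between two consecutive barrier episodes.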

Figure 5.10 shows the performance of this barrier algorithm in the same scenarios already used for the evaluation of the locks. The data points denoted as "Single team/thread" show the behavior when executed on up to six nodes with one thread per node. The execution of a barrier takes under 11 µs and increases only slightly when executed on more nodes. The algorithm is therefore very scalable with regard to performance, however at the cost of memory and mapping resource requirements on each node that grow linearly with the number of nodes.

[Chart: barrier execution time in µs over one to six nodes for the configurations "Single team/thread", "Multi Team", "Multi Team -1", "Multi Thread", and "Multi Thread -1".]

Figure 5.10 Barrier performance on up to 6 nodes.

The algorithm in the version described above is only suited for one thread of execution per process. It needs to be augmented in order to work in SMP environments. To ensure the greatest possible efficiency, this is done in a two–step process: a local barrier synchronizing all threads on one node and a global barrier for the inter–node synchronization. The first thread on each node to reach the barrier executes the global barrier, while any further thread simply blocks until the main barrier is completed. Before blocking, each additional thread informs the main barrier thread of its arrival at the barrier by incrementing a local barrier counter. After the completion of the barrier, the main barrier thread releases all registered threads on the same node, allowing them to continue their execution.

The two lines in Figure 5.10 marked as "Multi Thread" show the time needed for the execution of this barrier scheme using either two threads on each node or two threads on N-1 nodes and one thread on the remaining node. The data indicates that the overhead required in this barrier version for the registration and unblocking of the additional threads is initially higher than in the "Multi Team" cases, but that with rising numbers of nodes the increase in time spent for the barrier is again very modest, resulting in a more scalable behavior. In addition, it is worth noting that in case one node runs only one thread, the barrier shows a similar behavior as in the case of one thread (and therefore also one node) less. The barrier cost for such a single thread on an additional node is therefore essentially negligible.

Figure 5.10 also contrasts the performance of barriers with multiple threads per node to barriers with multiple processes per node, each running one thread. This data is shown in the graph marked as "Multi Team". With a small number of nodes, a barrier in this scenario performs better than its multithreaded counterpart, while in configurations with a larger number of nodes the situation is reversed. This is likely due to the fact that the multithreaded version basically performs a barrier with one thread per node and only has to pay an (almost) constant overhead for registering and releasing the additional threads, while in the multi–team case each node performs two barriers with active polling, leading to a higher overall system load and therefore to a performance degradation.

Operation               Time [µs]
Interrupt trigger       15.17
½ Interrupt ping–pong   94.95

Table 5.14 Execution time of interrupts in HAMSTER.


5.3.5 Global Counters and Interrupts

The two synchronization constructs described above, locks and barriers, are sufficient for the implementation of any arbitrary synchronization scheme. However, it can be beneficial to make further base mechanisms available in order to both ease and optimize the implementation of additional synchronization constructs. For this purpose, the synchronization module exports two further low–level mechanisms which are based directly on the capabilities of SCI: global counters using the SCI atomic transactions and (potentially remote) interrupts. Their implementation is therefore very lean and does not introduce additional complexity. While the global counters are simply an abstraction of the atomic fetch–and–increment operations which have already been used to implement the locks, the interrupts are directly based on the interrupt implementation of the SISCI API [64] provided by Dolphin (see Section 2.3). This API, however, has been extended for the various software layers within the SMiLE project to support a more efficient interrupt management — the original version, as currently distributed by Dolphin, does not support a user–level signaling mechanism; instead, active waiting on incoming interrupts is required. With the help of the mentioned SMiLE extension, interrupts can be delivered in a fully asynchronous way. This version is also used for the synchronization module described here.

Table 5.14 and Figure 5.11 show the performance characteristics of these two mechanisms. In both cases the performance is very good, which reflects their low–level implementation close to the hardware and system software.

close to hardware and system software. Interrupts can be triggered on the local node with an overhead of about 15 µs, and a remote node is notified in under 100 µs. With regard to the performance data for the atomic increments, the value for one node

reflects the raw read latency of SCI, which is around 4 µs. With increasing numbers of nodes and therefore more concurrent processes performing an atomic increment operation, this latency is slightly increased due to contention on the network. The overall impact, however, is quite modest.
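As one illustration of how such a global counter can be used by higher layers, the sketch below distributes loop iterations across nodes with a shared chunk counter. The counter interface (syncMod_createCounter, syncMod_fetchAndInc) is an assumption and does not reproduce the synchronization module's actual routine names.

/* Hypothetical global counter interface, assumed to wrap SCI's atomic
   fetch & inc transactions. */
typedef struct global_counter global_counter_t;
extern global_counter_t *syncMod_createCounter(int initial_value);
extern int syncMod_fetchAndInc(global_counter_t *c);

#define N 100000

/* Self-scheduled work distribution: every node repeatedly grabs the next
   chunk of iterations by atomically incrementing the shared counter. */
void process_chunks(global_counter_t *next_chunk, double *data)
{
    const int chunk_size = 256;
    int chunk;

    while ((chunk = syncMod_fetchAndInc(next_chunk)) * chunk_size < N) {
        int first = chunk * chunk_size;
        int last  = first + chunk_size < N ? first + chunk_size : N;
        for (int i = first; i < last; i++)
            data[i] = data[i] * 2.0;     /* placeholder for real work */
    }
}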

5.4 Task Management

Besides the services that allow control over the actual shared memory, additional mechanisms to control the task or execution model are required. These are merged into the task module discussed in this section.

[Figure: time per atomic increment in µs plotted against the number of nodes/CPUs (1–6)]

Figure 5.11 Performance of atomic counters under varying numbers of nodes.

Its functionality, however, is kept very lean in order to fulfill the goal of being unbiased towards any specific programming model. This design choice is especially visible in the fact that the task module purposely does not provide mechanisms to create and/or control threads. Due to the subtle differences with regard to syntax and semantics in the thread APIs of the different target operating systems, the implementation of such a mechanism would either result in an emulation of one thread API on the other operating system (which would in addition either be very complex or semantically incomplete) or in a subset of common thread functionalities. In both cases it would nearly be impossible to implement a cluster–wide extension of one of the native thread APIs. For this reason, the implementation of thread services is left to the actual programming model implementation (see also Chapter 6.4). The task module, however, provides support services needed to guarantee a clean cooperation with the global process abstraction and its associated teams as described in Chapter 4. This can be used to implement any task structure that goes beyond the native one of one thread per node exported by the SCI-VM.

5.4.1 State of the Art

Several projects provide global thread management. Typical examples for this class of systems are Nexus [56], Chant [73], PM2 [7], and Mosix [12]. They provide the ability to create and manage distributed activities or threads and are mostly combined with transparent thread migration mechanisms. Those systems, however, do not include a global memory abstraction and are therefore not suited for implementing applications using the shared memory paradigm. They are rather used as runtime systems for higher–level programming environments.

5.4.2 Functionalities

The functionality of the task module, which is listed in detail in Table 5.15, can roughly be split into two groups: routines for the registration of threads and routines available for thread placement within an SMP. Both are intended to support multithreaded programming models that go beyond the simple model on which the SCI-VM is based.

API call                                  Description
taskMod_incNumberOfActivities             Register a new local thread
taskMod_decNumberOfActivities             Deregister a local thread
taskMod_queryNumberOfActivities           Query the total number of threads
taskMod_queryNumberOfLocalActivities      Query the number of local threads
taskMod_assignToCPU                       Bind thread to given local CPU
taskMod_assignToNextCPU                   Bind thread to next local CPU

Table 5.15 API of the task management module.

The first group of routines allows user threads to be registered and deregistered with the HAMSTER framework. This is used to keep track of the total number of activities or threads on the local node as well as globally. At the beginning of the program execution, all initial user threads started by the SCI-VM, i.e. one on each node, are already registered. Any further thread created by the programming model layer needs to be registered by incrementing the activity counter. In the same manner, this counter needs to be decremented when a thread terminates. In addition, the task management module provides two routines to query the current number of activities on the local node and in the total system. This information can then, for example, be used for load balancing or thread placement decisions.

The second group of routines makes it possible to optimize the behavior of multithreaded programming models when executed on SMP nodes by pinning the execution of threads to processors. This has the potential to improve the overall system performance as unnecessary thrashing in the assignment of threads to processors by the scheduler is eliminated. The ability to pin threads to processors, however, is not available on all platforms. To still ensure the portability of programming models, these routines are available on all platforms, with or without the underlying functionality. As these routines do not directly influence the execution, but are only used for optimization purposes, this approach provides a portable execution across platforms.

As with all the other management modules, the task management module also implicitly collects some statistical data and thereby contributes to the HAMSTER monitoring system. However, as the functionality of the module is purposely kept small, the amount of data which can be collected is also limited. The concrete parameters are listed in Table 5.16. In contrast to the data in the other modules, the data for the last two parameters shown in the table is not accumulated during the runtime of the program, but rather represents the current status of the system: the number of activities or threads, both global and local, which are currently registered in the system (as it can also be queried using taskMod_queryNumberOfLocalActivities and taskMod_queryNumberOfActivities) and the current system load. These values are retrieved and updated at the time the statistics from the task module are queried by the user or a higher layer and returned together with the rest of the statistical data from this module.

Variable       Purpose
numAssigned    Number of times taskMod_assignToNextCPU called on local node
numAct         Number of global current activities
numLocAct      Number of local current activities
locLoad        Current load on the local node

Table 5.16 Statistical information collected by the task management module.
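To illustrate how a programming model layer might use these routines, the following sketch wraps the creation of a worker thread so that every new thread is registered with the task module and pinned to a local CPU. The taskMod_* names are those of Table 5.15; their exact signatures are not given in the text and are assumed here to take no arguments.

#include <pthread.h>

/* Routines from Table 5.15 (signatures assumed). */
extern void taskMod_incNumberOfActivities(void);
extern void taskMod_decNumberOfActivities(void);
extern void taskMod_assignToNextCPU(void);

/* Worker wrapper: register the activity, pin it, deregister on exit. */
static void *worker(void *arg)
{
    taskMod_incNumberOfActivities();   /* register the new activity    */
    taskMod_assignToNextCPU();         /* pin it to the next local CPU */

    /* ... programming-model specific work using arg ... */
    (void)arg;

    taskMod_decNumberOfActivities();   /* deregister before terminating */
    return NULL;
}

int spawn_worker(pthread_t *tid, void *arg)
{
    return pthread_create(tid, NULL, worker, arg);
}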

5.5 Towards a Global View: The Cluster Control Module

In contrast to the actual management modules discussed so far, the last module, the cluster control module, takes a special role. It is not designed and implemented in a way orthogonal to the other modules, but can rather be seen as a common support layer. It is responsible for controlling the configuration of the target system and for providing any other HAMSTER component, from the SCI Virtual Memory to the programming model, with the necessary information about it. It also stands in a very close relationship to the task module as both deal with the organization of activities within the system — the task management module at the thread level and the cluster control module at the process level.

5.5.1 State of the Art

Due to this close relationship between the cluster control and the task management module, any work related to the task module mentioned in Chapter 5.4.1 can also be seen as related to the cluster control module. In addition, the cluster control module, in its function to establish a global machine abstraction, is also comparable to the wide area of so–called Single System Image (SSI) systems. These systems aim at hiding the distributed nature of the underlying hardware from the users by presenting them with a single frontend. The extent of this abstraction as well as the concrete intention behind it, however, varies greatly between individual SSI systems. A few commercial systems have been developed to allow easy system management and batch processing and therefore feature special software installation and remote control tools together with a job management system. Examples for this are the Scali Software Platform (SSP) [190] and the “Raisin” package by the company Alinka [238]. These systems, however, do not include a global resource abstraction for applications since the behavior of applications on the target system is not influenced. The latter issue is targeted by distributed operating systems, which aim at extending the services offered by a single operating system instance transparently to a distributed system. Approaches in this direction can be found both in industry, e.g. in Solaris MCTM [121], and academia, e.g. in the Tornado [59] and the Sprite operating system [173]. They combine a global thread management with distributed operating system services and therefore complete the single system abstraction at the operating system level.

API call                   Description
clusctrl_getNodeNum        Query the rank of the local node
clusctrl_getNodeCount      Query the number of nodes in the complete system
clusctrl_getNodeID         Query the SCI node ID of a node
clusctrl_getLocalScale     Query the maximal concurrency on the local node
clusctrl_getNodeScale      Query the maximal concurrency on a particular node
clusctrl_sendMessage       Send an asynchronous RPC–like command

Table 5.17 API of the cluster control module.

5.5.2 Functionalities

The core of the functionality contained within the cluster control module is designed to deliver extensive information about the system configuration to any other layer. The routines for this purpose, which are listed in Table 5.17, allow the querying of the number of nodes involved in the current system and the rank of the local node within the system as well as the retrieval of the network ID (in this case the SCI ID), which can differ from the rank. In addition, special routines are provided to query the maximal concurrency on a particular node. This number, in the following called the scale of a node, indicates the maximal number of concurrent activities suited for the particular node and can be used to indicate SMP configurations on certain nodes. It should be noted, however, that the scale of a node does not directly imply the number of running activities, but merely represents a hint for multithreaded programming models on how many threads can be started on a particular node without overloading it.
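A minimal sketch of how a programming model could use these query routines, for example to size its local thread pool or to determine the total capacity of the system, is given below. The routine names are those of Table 5.17; the return types and the parameter of clusctrl_getNodeScale are assumptions.

/* Routines from Table 5.17 (return types and parameters assumed). */
extern int clusctrl_getNodeNum(void);          /* rank of the local node     */
extern int clusctrl_getNodeCount(void);        /* number of nodes            */
extern int clusctrl_getLocalScale(void);       /* max. concurrency locally   */
extern int clusctrl_getNodeScale(int node);    /* max. concurrency on a node */

/* How many worker threads this node should start, following the scale hint. */
int suggested_local_threads(void)
{
    int scale = clusctrl_getLocalScale();
    return scale > 0 ? scale : 1;
}

/* Total number of concurrent activities the whole cluster is suited for. */
int total_capacity(void)
{
    int nodes = clusctrl_getNodeCount();
    int total = 0;
    for (int node = 0; node < nodes; node++)
        total += clusctrl_getNodeScale(node);
    return total;
}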

5.5.3 Asynchronous RPC-like Communication Service

Besides these configuration routines, the functionality of this module also includes a simple messaging mechanism allowing the remote execution of routines. This facility is used both by the HAMSTER modules themselves and by implementations of programming models. Even though this mechanism seems contradictory to the shared memory paradigm of any target programming model, it is often required, or at least useful, for their implementations as it enables the exchange of small messages for configuration and/or status update purposes. The messaging engine in the cluster control module is implemented as an asynchronous, non–blocking Remote Procedure Call (RPC). Using the routine clusctrl_sendMessage, a sender can specify the address of a routine on the receiver node that will be executed on arrival of the message. The system is built for small amounts of payload data which will be delivered to the target routine as a parameter. This style of communication is simplified by the fact that every node is required by the SCI-VM to load and execute the same binary.

Therefore, any potential target routine is always available on any node at the same location and can hence be specified by its address. This eliminates the need for an elaborate naming scheme or a centralized lookup service. In its current state, the messaging mechanism is implemented on top of standard sockets using TCP/IP over Ethernet. This choice was made because this mechanism is normally used only during the configuration phase of either the management modules or any programming model. The performance of this mechanism was therefore not a major issue, resulting in the choice of a less complex and more robust implementation. The actual send and receive mechanism, however, is encapsulated into a single submodule, leaving the option of an easy port open.
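How such a message could be used, for example to broadcast a small status update to all other nodes, is sketched below. The exact signature of clusctrl_sendMessage is not given in the text; the variant used here (target rank, handler address, small payload) is an assumption that merely matches the description of an asynchronous, non-blocking RPC with a small payload. It is also assumed that the messaging layer copies the payload before the call returns.

#include <string.h>

/* Assumed signature of the messaging routine from Table 5.17. */
typedef void (*rpc_handler_t)(void *payload, int size);
extern void clusctrl_sendMessage(int node, rpc_handler_t handler,
                                 void *payload, int size);

/* Handler executed on the receiving node when the message arrives. */
static void update_status(void *payload, int size)
{
    int new_phase;
    if (size == (int)sizeof(new_phase)) {
        memcpy(&new_phase, payload, sizeof(new_phase));
        /* ... update some node-local configuration or status variable ... */
    }
}

/* Since all nodes run the same static binary, the handler's address is valid
   on every node and can be passed directly. */
void broadcast_phase(int node_count, int my_rank, int phase)
{
    for (int node = 0; node < node_count; node++)
        if (node != my_rank)
            clusctrl_sendMessage(node, update_status, &phase, sizeof(phase));
}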

5.5.4 Control Over the Global Process Abstraction

The exported routines, however, do not cover the complete functionality of the module. Unlike the other modules discussed so far, it also includes implicit functionalities which are executed transparently to the other layers. One of the core points of this is the management of the global process abstraction. Such an abstraction is required to support the global virtual memory abstraction implemented within the SCI-VM introduced in the previous chapter. The cluster control module is responsible for establishing the teams used by the SCI-VM as well as for their coordination, control, and communication. This forms a global environment which, together with the global memory abstraction, allows threads to be hosted transparently on any node without significantly influencing or restricting their execution. This global process abstraction is, however, currently not complete. Especially the area of I/O is at the moment not covered, with the result that any I/O operation issued by a thread running within HAMSTER will only be executed on the node local to the calling thread without any global coordination. In order to overcome this, a transparent forwarding mechanism for I/O operations in coordination with a global resource management would be required, virtually extending the complete operating system to the cluster environment. This would have been far beyond the scope of this work, but might prove worthwhile to pursue in the future.

5.5.5 Startup Control

As already described in Chapter 4.4, an SCI-VM based application is initialized by executing the same binary, compiled on top of a HAMSTER–based programming model, on all participating nodes. This, however, does not include the correct identification of these nodes nor the initial establishment of connections between them. These tasks are taken care of by the cluster control module implicitly at program initialization time, and this information is then provided to any other component of HAMSTER — including the SCI-VM — through the interface described above. The task of node identification and system setup is based on a configuration file which has to be provided by the user. It is read and evaluated by this module during program initialization. In order to keep the complexity for the user as low as possible, the syntax is kept extremely simple and easy to use. Two examples for such a file are shown in Figure 5.12.

#name    SCI ID   Scale          #name    SCI ID   Scale
smile1    4        2             smile1    4        1
smile2    8        2             smile1    4        1
smile3   12        2             smile2    8        1
smile4   16        2             smile2    8        1
smile5   20        2             smile3   12        1
smile6   24        2             smile3   12        1
                                 smile4   16        1
                                 smile4   16        1
                                 smile5   20        1
                                 smile5   20        1
                                 smile6   24        1
                                 smile6   24        1

Figure 5.12 HAMSTER configuration file for the complete SMiLE cluster consisting of 6 dual SMP nodes in SCI ringlet configuration — left: exploiting SMPs with multiple threads; right: exploiting SMPs with multiple teams per node.

It is a plain ASCII text file and contains a line for each node participating in the global process abstraction (plus arbitrary numbers of comment and empty lines). For each node three parameters have to be provided: the DNS node name, the network ID (in this case the SCI ID), and the scale of the particular node.

It is also possible to list nodes several times in the configuration file. In this case, several teams (equal to the number of listings in the configuration file) are started on the respective node. Each of these teams then behaves like an independent node participating in the HAMSTER system. An example for this can also be seen in Figure 5.12, which shows both ways to specify the SMiLE cluster used for all experiments: on the left side using the scale specifier and on the right side using multiple node entries to specify SMP node configurations. This second option is introduced to support SMP nodes also for programming models that do not directly exploit multithreading within a team. This extends the applicability of the HAMSTER system and allows SMP clusters to be used both for true multithreading programming models as well as for programming models designed for only one thread per team.

In both cases, the order in which the nodes are listed in the configuration file determines their rank in the system. The top listed node will be assigned rank zero, the second one rank one, and so forth. In the case that multiple entries are used for a node, the team started first on that particular node will receive the lowest rank assigned to that node, the second team the second lowest, and so forth, until the last team, which receives the highest rank listed in the configuration file for that particular node. The first node in the configuration file also receives a special role as master node during the initialization phase and should be started last. During later stages this special role is dropped.

The cluster control module also ensures that all nodes are running before any further

HAMSTER mechanism is invoked. For this purpose, the execution of the application is handed to the cluster control module at the beginning, which blocks the execution until all nodes listed in the configuration file have started executing and joined the global process abstraction. This is necessary to guarantee that the complete global process abstraction is in place and provides the associated functionality before any further code is executed that relies on this abstraction and might cause fatal errors in the case of its absence.

5.5.6 Clean Termination

Not only the startup process requires special treatment by the cluster control module; the termination of the global process abstraction also needs to be considered. In stand–alone operating systems, a process is normally considered terminated as soon as its last thread terminates. This semantics has to be extended to the complete cluster, requiring the teams on all nodes participating in the global process to terminate if (and only if) the last thread anywhere in the system terminates. Such an accurate termination policy is of special importance for the implementation of programming models based on dynamic task management, as is given in any thread API.

To implement this scheme, it is not only necessary to have an overview of the total number of threads currently running within the system, which can be obtained using the mechanisms of the task management module discussed above, but also to keep teams currently without user threads alive, since their resources are still needed to contribute to the overall global process and the node still has to be available to potentially host new threads. This is achieved through additional communication threads created by the cluster control module, guaranteeing that the number of threads per team registered with the operating system is always at least one. This ensures that no team will be evicted prematurely by the operating system on any node.

In the case that the activity counter is decreased to zero, no other activity is running within the whole system and the complete global process has to be terminated. This is done, again under the control of the cluster control module, by terminating the communication threads after a global node barrier to ensure a synchronous termination across the cluster. This deletes the last thread in each team, which results in the termination of the individual team processes and thereby of the complete global process abstraction.

5.6 Summary

In order to fulfill the goal of enabling as many shared memory programming models as possible or necessary, the HAMSTER framework provides a large number of shared memory services. These are grouped into independent management modules, each providing a specific set of services. These modules are implemented in a fully orthogonal manner without any cross–dependencies, ensuring the openness and flexibility of the overall system.

The memory management module contains mechanisms for a flexible NUMA memory management. It not only facilitates the allocation and management procedures for a global shared memory, but it also provides an interface giving the programmer control over data placement and coherency parameters of newly allocated memory based on optional annotations. In addition, it contains mechanisms for the distribution of static application data, completing the transparency required by some shared memory programming models.

The coherency management module provides all mechanisms necessary to enforce the memory consistency needed for a safe and reliable application execution. Together with the synchronization module, which exports optimized implementations of low–level synchronization constructs such as locks and barriers, it can be used to create relaxed consistency models. This will be illustrated in more detail in the next chapter.

Besides these core shared memory services, services regarding the task and global process management are also necessary to complete the requirements for an implementation of shared memory programming models. This functionality is encapsulated in the task management and the cluster control module. The latter is also responsible for any kind of global cluster and process management and therefore functions as a central management component in the overall framework.

Combined, the services exported by the different management modules provide a comprehensive infrastructure for the implementation of shared memory programming models. The issues related to this endeavor will be explained in detail throughout the next chapter.

Chapter 6

Implementing Programming Models on Top of HAMSTER

The modules described in detail in the last chapter can be used to build almost any shared memory programming model. The interface available for this purpose is discussed in this chapter, followed by a description of the most important implementation aspects encountered in different programming models. The chapter is then rounded off by a complete overview of all existing programming models on top of HAMSTER and a discussion of their implementation complexity.

6.1 The HAMSTER Interface

The HAMSTER interface forms the connection point between the collection of services provided by HAMSTER and the programming models using them for their implementation. This interface, however, is no rigid monolithic block, but should merely be seen as a toolset collected from several independent parts. From a high–level point of view, this toolset can be split into two main groups: the core HAMSTER services provided by the modules introduced in the last chapter and a small set of auxiliary services for additional support.

6.1.1 HAMSTER Services

The main part of the HAMSTER interface is formed by the actual core HAMSTER services exported through the various management modules discussed in Chapter 5. This collection of routines allows programmers to implement and control most aspects of shared memory programming models.

6.1.2 Multiplatform Timing Support

Besides these core services for the actual implementation of shared memory programming models, the HAMSTER interface provides additional services and functionalities available to any higher layer. One example of such an add-on service is a timing module providing platform independent support for timing. This includes both the emulation of routines typically used on one platform — like times() or gettimeofday() — on any other supported platform and the implementation of its own timing routines based on high precision timing mechanisms available in the underlying architectures.

This timing module further helps to increase the portability of both programming models and applications between different platforms. It eliminates the need to adapt to the various different timing mechanisms on the individual platforms, which traditionally is a very tedious task. In addition, it forms the basis for easy and fair cross platform performance studies.
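As an illustration, a portable wall-clock routine of the kind such a module provides could look as follows; this is a sketch only and does not reproduce the actual HAMSTER timing API, whose routine names and high-precision counter usage are not shown here.

#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif

/* Wall-clock time in microseconds, with the same semantics on both
   supported platforms. */
static uint64_t wall_time_usec(void)
{
#ifdef _WIN32
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (uint64_t)(now.QuadPart * 1000000ULL / freq.QuadPart);
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000ULL + (uint64_t)tv.tv_usec;
#endif
}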

6.1.3 Using the HAMSTER Interface

All services provided by HAMSTER are available as standard C bindings and can be used by any programming system capable of using such a binding, even though during the course of this project only standard C or C++ environments (gcc for Linux and Microsoft’s Visual C++TM for Windows NTTM) have been used. The only prerequisite is that the complete final binary is static in the sense that the layout of its virtual address space after it has been loaded by the OS is constant. Only this allows the creation of identical copies of the process image on several nodes, which is necessary for the SCI-VM and the memory management module to work properly and to guarantee a fully transparent global memory abstraction.

In order to utilize the services available through the HAMSTER interface, the programming model layer has to include the respective header files for the individual modules. These are themselves based on a small set of global header files with system–wide type and constant definitions. Together they form the complete basis for the implementation of any programming model, which then in turn exports its own API to the final user application.

6.2 A First Model: Single Program – Multiple Data (SPMD)

The first programming model implemented within the HAMSTER framework is a straightforward Single Program – Multiple Data or SPMD model. Its concrete API and usage are explained in Appendix B. In this model a fixed number of threads, equal to the number of nodes in the system, is executed. All threads work in principle on the basis of the same code (Single Program), but each thread has its own part of the data to work on (Multiple Data). As an exception to this rule, special code segments (normally not part of the computational core and mostly dealing with initialization, post processing, or I/O) may be performed only by a single master thread. The programming model, however, does not enforce the SPMD programming style; nothing prevents the user from diverting from it and implementing more irregular code constructs, but this is discouraged and normally requires a significantly higher coding complexity. Therefore, all codes based on the SPMD model discussed in this work stick to this particular programming style.

During the execution, data regions can be allocated and shared between the nodes. These regions are normally used to store the application’s working set, while support and management variables are often stored in node local resources like a thread’s stack. The main synchronization construct in this model is the barrier, since barriers separate phases during the program execution and often help to separate the work into distinct iterations and/or different sections of the data. Locks are only very rarely used, because they contradict the pure SPMD style due to their nature of mutual exclusion, which allows only one thread to perform work at a time.

The SPMD model was chosen as the first test programming model for the HAMSTER system as it has a straightforward and simple task model that does not introduce any further difficulties for testing and tuning while still allowing a complete evaluation of the memory management, synchronization, and consistency modules. For the same purpose, it also exports most of the features of these modules directly to the user. This, however, is done in a way that allows users to easily ignore advanced features by relying on default parameters. As a result, the SPMD model combines both simple and advanced HAMSTER mechanisms and hence serves as an ideal testing ground for the HAMSTER system, especially for the lower management and memory abstraction layers.
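The typical structure of a code written in this style is sketched below. All spmd_* routine names are hypothetical stand-ins for the API described in Appendix B; only the overall structure (master-only initialization, barrier-separated phases, block-wise data decomposition) is meant to be illustrative.

/* Hypothetical SPMD API (Appendix B describes the real one). */
extern int    spmd_rank(void);                 /* rank of this thread/node    */
extern int    spmd_size(void);                 /* total number of threads     */
extern void  *spmd_allocShared(long size);     /* collective shared allocation,
                                                  assumed to return the same
                                                  global address everywhere   */
extern void   spmd_barrier(void);              /* global barrier              */

#define N 1024

void spmd_main(void)
{
    int rank = spmd_rank();
    int size = spmd_size();

    /* Master thread performs initialization, all others wait at the barrier. */
    double *a = spmd_allocShared(N * sizeof(double));
    if (rank == 0)
        for (int i = 0; i < N; i++)
            a[i] = 1.0;
    spmd_barrier();

    /* Each thread works on its own contiguous slice of the shared data. */
    int chunk = (N + size - 1) / size;
    int first = rank * chunk;
    int last  = first + chunk < N ? first + chunk : N;
    for (int i = first; i < last; i++)
        a[i] *= 2.0;

    spmd_barrier();    /* separate program phases */
}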

6.3 Support for Various Consistency Models

In order to fulfill the goal of supporting almost any arbitrary shared memory programming model, HAMSTER also needs the ability to support the various consistency models present in other systems. This is especially important when supporting HAMSTER–based APIs of existing DSM systems since this area of research has produced a large variety of such models. This section discusses how the mechanisms provided by the HAMSTER modules can be used to create these consistency models, including an in–depth discussion of the two popular consistency models Release Consistency (RC) and Scope Consistency (ScC)¹.

6.3.1 State of the Art

Relaxed consistency models have been present for a long time in SW–DSM systems, as already discussed in Chapter 4.1.2. Two prominent examples among them are Release Consistency [116], implemented in systems like TreadMarks [5], CVM [117], Shasta [189], and Munin [29], and Scope Consistency, implemented in Brazos [209] and JiaJia [83]. The main motivation for their deployment is their ability to reduce the amount of communication necessary for the maintenance of the global memory abstraction since the number and size of updates can be reduced.

The actual performance impact of the different models, however, is strongly dependent on the target application and its memory access pattern as well as on the behavior of the concrete consistency protocol implementation. For this reason, a few projects aim at providing the ability to support either several given or even user–defined consistency protocols. In the latter case the respective systems provide a set of basic DSM functionalities and/or a set of parameters allowing the definition of new protocols. Examples for systems with such multiprotocol support are the CVM framework [118], the ADSM system [160], and Munin [29, 17].

¹The abbreviation ScC is used instead of SC to distinguish Scope Consistency from Sequential Consistency.

While the flexibility of these systems is quite comparable with the intentions of the HAMSTER framework, the inherent restrictions of the different approaches are radically different. The HAMSTER system draws its restrictions from the underlying hardware architecture and the memory coherency types realized on it; software implementations of multiprotocol systems impose restrictions through the selection of exported functionalities and/or parameters as well as through the underlying base SW-DSM systems.

6.3.2 Combining Consistency and Synchronization

In most DSM systems with relaxed consistency (see Sections 5.2.1 and 4.1), the consistency management is inherently connected with synchronization constructs. This is especially true for constructs guaranteeing mutual exclusion, like locks. Here, most programming models combine an Acquire operation of some kind with a Lock call and a Release with the corresponding Unlock. This combination guarantees a consistent view of any memory accessed only from within critical sections. In addition, barriers are often combined with a full synchronization of the underlying memory system, i.e. Acquire and Release together, resulting in a consistent memory between program phases separated by barriers.

However, by hardcoding this combination of synchronization and consistency control, the resulting system is bound to one particular relaxed consistency model and therefore only exhibits a restricted amount of flexibility. While this is tolerable for DSM systems intended for application programmers, it does not suffice for a system like HAMSTER, which aims at the implementation of arbitrary shared memory programming models. Hence, in the HAMSTER system, the synchronization module does not include any consistency enforcing mechanisms and vice versa, guaranteeing the full orthogonality between these two modules.

Instead, the connection between synchronization and consistency management is left to the implementation of the programming model, the layer above the individual management modules. This has the intended advantage that each programming model can select its own consistency model and combine the appropriate synchronization mechanisms. In addition, it opens the possibility for the implementation of application specific consistency models without much additional cost. In summary, this approach provides a maximum of flexibility and induces no restrictions on the programming models implemented on top of this system.

6.3.3 Implementing Release Consistency

As a first example of how to connect the mechanisms from the synchronization module with the consistency management routines, the following section describes how to implement Release Consistency.

Consistency conditions

When using Release Consistency, all memory operations are divided into synchronizing and non–synchronizing operations. The first group is further split into so–called Acquire and Release constructs. The former are used to gain permission to access shared data, while the latter are used to grant access to shared data to other processes. Based on this division, Release Consistency is defined using the following three consistency conditions [116]:

1. Before a read or write access is allowed to perform with respect to any other processor, all previous Acquire accesses must be performed.

2. Before a Release is allowed to perform with respect to any other process, all previous read and write accesses must be performed.

3. Acquire and Release accesses are sequentially consistent with respect to one another.

Informally, the two operations Acquire and Release represent routines controlling the visibility of data on remote nodes, defining a window of safe access to shared data. They are therefore enhanced with synchronization semantics which are used to control the task structure, since only this combination enables the control of both data and tasks as required from an application point of view. With regard to locks, typically an Acquire is combined with a lock operation, while a Release is combined with the corresponding unlock operation. This ensures that the data accessed during the critical section always represents the current global state (due to the Acquire) and that any modification is made available after the critical section is left (due to the Release).

Implementation

Based on this informal description, it is clearly visible that the Acquire operation corresponds to the read invalidation described above, as in both cases the data on the local node needs to be updated, i.e. old data needs to be invalidated. On the other hand, the Release operation can be implemented with a write flush since in both operations the locally modified data is pushed across the network and therefore made available. More precisely, the implementation is as follows:

– Acquire
  - Perform lock operation
  - Perform read invalidation

– Release
  - Perform write flush
  - Perform unlock operation

This implementation approach satisfies all three conditions stated above and therefore provides a full Release Consistency implementation.
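In code, this mapping could look as follows. The names of the lock and consistency-enforcing routines (syncMod_*, conMod_*) are assumptions; the combination itself is the one described above.

/* Assumed routine names from the synchronization and consistency modules. */
extern void syncMod_lock(int lock_id);
extern void syncMod_unlock(int lock_id);
extern void conMod_readInvalidate(void);   /* invalidate locally cached data */
extern void conMod_writeFlush(void);       /* push local modifications out   */

/* Release Consistency: Acquire = lock + read invalidation. */
void rc_acquire(int lock_id)
{
    syncMod_lock(lock_id);        /* gain permission to access shared data      */
    conMod_readInvalidate();      /* discard potentially stale local copies     */
}

/* Release Consistency: Release = write flush + unlock. */
void rc_release(int lock_id)
{
    conMod_writeFlush();          /* make local modifications globally visible  */
    syncMod_unlock(lock_id);      /* grant access to other processes            */
}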

Differences to software implementations

Software implementations distinguish further between Eager and Lazy Release Consistency (ERC and LRC respectively) [116]. This difference influences when memory updates are sent to remote nodes — ERC implementations send the information to any other node at the time of the Release operation, while in LRC implementations the processes receive and consume the updates at the time of the Acquire call.

This difference in implementation also has a subtle influence on the memory consistency model realized by it. While in ERC all information released by previous Release operations is available in the whole system before the next Acquire on any node, LRC implementations only guarantee the availability of the information on the node performing the Acquire.

The HAMSTER–based realization of RC for NUMA architectures presented above cannot be cleanly classified as either ERC or LRC, but rather represents an intermediate solution. Like in ERC, Release operations push their memory updates to any other node and therefore make them available. However, their contents may not be visible immediately since stale values might still be stored in the processor caches. Even though these are not invalidated before the next Acquire operation, sporadic cache evictions caused by cache line replacements can lead to an earlier deletion of stale values and hence to an earlier visibility of the new information to the processor.

6.3.4 Implementing Scope Consistency

The Release Consistency model described above creates an implicit relation between a Release and the following Acquire, since any data written before the Release must be made visible after the following Acquire. This relation can be loosened by introducing an explicit relationship between groups of Acquire and Release operations and restricting the visibility of data after an Acquire to the data written before a Release within the same group. Hence, it implicitly decouples the memory updates caused by the Acquire and Release operations in the different groups and thereby also the memory consistency constraints in the memory regions which are updated between the respective operations. This creates so–called Consistency Scopes, as introduced in [97], without having to specify an explicit association of data and consistency operations.

Consistency conditions

As this concept of Scope Consistency is just a straightforward extension or generalization of the Release Consistency already discussed above, its consistency conditions are also very closely related. The key difference is the introduction of consistency scopes which can be opened and closed by applications. Any read or write operation is then performed with respect to any open scope, and write operations are assumed to be completed with respect to a scope on closing this scope. Using these concepts, the consistency conditions of Scope Consistency can be defined as in [97] (with a session being the interval during which a consistency scope is open at a given process):

1. Before a new session of a consistency scope is allowed to open at a process P, any write previously performed with respect to that consistency scope must be performed with respect to P.

2. A memory access is allowed to perform with respect to a process P only after all consistency scope sessions previously entered by P (in program order) have been successfully opened.

Informally, this definition leads to the memory behavior that memory accesses made during a scope are only guaranteed to be visible after the scope has been closed, and also only within the same scope opened by another process. Any other access is not guaranteed to be visible. The first rule is responsible for ensuring that updates are properly propagated after a close, and the second rule guarantees that these updates have been completed in time.

The important consistency enforcing mechanisms in this concept are the opening and closing of scopes, as these represent the points during a program’s execution at which the visibility of data changes. Compared to Release Consistency, opening a scope is the same as an Acquire (only restricted to the scope) and closing a scope is the same as a Release (again restricted to the scope).

Like Release Consistency, this concept also has to be used in conjunction with synchronization mechanisms in order to allow applications a useful assumption about the underlying memory behavior. In order to exploit the scope concept to a maximum, each lock is normally associated with its own scope, as it is assumed that each lock is responsible for the management of a distinct part of the overall data set. During the lock operation the respective scope is opened, i.e. the scoped Acquire is performed, and during the unlock operation the scope is closed again, i.e. the scoped Release is performed.

Using this inherent connection, the usability of Scope Consistency is drastically increased, since an explicit scope management would significantly increase the coding complexity of applications. Experiments [97] have actually shown that most codes written for Release Consistency can directly be used with Scope Consistency without any code modification, but at improved performance.

Implementation

A HAMSTER–based implementation of Scope Consistency can be achieved in a straightforward manner using the scoped versions of the consistency enforcing mechanisms introduced in Chapter 5.2.4. These have been derived from the general mechanisms which are used to implement Release Consistency, in the same manner as Scope Consistency has been derived from Release Consistency. Both exploit the implicit dependency between Acquire and Release operations and optimize by restricting this dependency to explicit groups, the so–called scopes.

An implementation of Scope Consistency can therefore be achieved based on the RC implementation discussed above by replacing the unconditional write flushes and read invalidations with their scoped counterparts. In addition, a new scope is implicitly allocated and associated with every lock requested by an application. This results in the intended consistency model, fully compliant with the constraints and their informal interpretation mentioned above.
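A sketch of the corresponding scoped lock and unlock wrappers is given below; as before, the scoped routine names are assumptions, while the structure, with one scope implicitly allocated per lock, follows the description above.

/* Assumed routine names for the scoped consistency mechanisms. */
extern void syncMod_lock(int lock_id);
extern void syncMod_unlock(int lock_id);
extern int  conMod_createScope(void);
extern void conMod_scopedReadInvalidate(int scope);
extern void conMod_scopedWriteFlush(int scope);

#define MAX_LOCKS 256
static int lock_scope[MAX_LOCKS];      /* one scope implicitly allocated per lock */

/* Called once when a lock is requested by the application. */
void scc_register_lock(int lock_id)
{
    lock_scope[lock_id] = conMod_createScope();
}

void scc_lock(int lock_id)
{
    syncMod_lock(lock_id);
    conMod_scopedReadInvalidate(lock_scope[lock_id]);   /* open the scope  */
}

void scc_unlock(int lock_id)
{
    conMod_scopedWriteFlush(lock_scope[lock_id]);        /* close the scope */
    syncMod_unlock(lock_id);
}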

Differences to software implementations

Due to the close relation between the implementations of the two consistency models, the differences to the software implementation of RC also apply to Scope Consistency. In addition, another difference is introduced by the fact that any read invalidation or write flush always has a global impact on the overall system, whereas pure software implementations can control to which data the consistency operation is applied. This leads to earlier transmissions of data than necessary, however without changing the guarantees of the consistency model. Therefore, this slight difference does not have an impact on applications using this model.

A second difference between a software and a NUMA–based implementation can also be seen in how the transition from Release Consistency to Scope Consistency affects the performance of applications. While software implementations use the scope information to reduce the memory traffic and to defer the communication, NUMA–based systems use scopes to reduce the number of necessary invalidations, resulting in fewer cache misses during the execution. This difference, however, is transparent for the user.

6.3.5 Experimental Evaluation

An evaluation of these consistency model implementations is rather difficult. For one, no new models are introduced; rather, given concepts, whose efficiency has been extensively proven before [116, 62, 256, 97], are discussed in a new environment. In addition, any evaluation based on application performance gives more of an indication of the performance of the underlying DSM system than of the consistency model. Such a comparison, besides being unable to provide the evaluation intended, is also very likely to be unfair since the software DSM systems available for this work use TCP/IP–based communication with Fast Ethernet, while the communication in the NUMA–based SCI system is performed directly in hardware without protocol overhead.

In order to still give a first impression with regard to performance, a few low–level benchmarks are used to help contrast a few key characteristics of the NUMA–based system using relaxed consistency with their software DSM counterparts. These benchmarks are synthetic and mimic a few extreme memory access patterns.

In addition, two larger benchmarks from the SPLASH–2 suite [233] are used to identify the difference between Scope and Release Consistency using the HAMSTER approach. These experiments show how and to which degree using Scope Consistency affects the execution of these applications and allow a first assessment of the performance improvement.

Together, these two sets of experiments give a first insight into the characteristics of NUMA–based relaxed consistency models and how they behave in contrast to traditional approaches using software protocols.

Operation                            SW–DSM        NUMA–DSM
Acquire                              2.35 µs       471.20 µs
Release                              2.45 µs        25.05 µs
Acquire (short write / 4 bytes)      1885 µs        3175 µs
Release (short write / 4 bytes)        25 µs          25 µs
Acquire (long write / 64 Kbytes)   112535 µs       44030 µs
Release (long write / 64 Kbytes)       65 µs          30 µs

Table 6.1 Performance of Acquire and Release operations.

Low–level mechanisms and synthetic benchmarks

This set of experiments concentrates on the low–level performance of the consistency mechanisms for NUMA architectures contrasted with the performance of the equivalent mechanisms found in software DSM systems (in this case the TreadMarks system [5]). As a first approach, the performance of the actual Acquire and Release operations is measured. The results for both the software DSM and the NUMA–based version can be seen in the upper rows of Table 6.1. They seem to suggest that the software DSM counterparts are significantly faster than the NUMA–DSM mechanisms. An explanation for these numbers can be found in the different implementations: software DSM systems maintain an implicit access control to the globally shared data in order to be able to detect when data needs to be propagated to or fetched from a remote memory location. Such a system can therefore easily detect that this benchmark does not perform any global data access (as only the execution time for the routine itself is measured) and hence the operations do not perform any communication. The NUMA-DSM system, on the other hand, does not have such information, requiring it to always flush or invalidate the respective buffers, thereby resulting in the times shown.

This picture changes when communication is introduced. The rest of Table 6.1 shows the time needed to perform lock / write / unlock cycles on 6 nodes. In this scenario, the software DSM system also shows significant execution times since it has to perform the data transport during the Acquire operations. With larger amounts of data, this also leads to a larger overhead, resulting in significantly longer execution times than for the NUMA–DSM counterparts, which only need to flush and invalidate the respective buffers.

This behavior is further detailed in Figure 6.1, which shows the distribution of execution times for the two operations as well as the actual loop times for various amounts of data written during the loop. Again these numbers demonstrate the radically different implementations in these two systems: the distribution for the software DSM system is rather independent of the write granularity since the amount of work required for the two major phases, the actual loop body and the Acquire, depends linearly on the amount of data written. The execution time for the loop body is mainly caused by the page faults used by the DSM system to keep track of the memory state, while the time for the Acquire is caused by the actual communication. In the NUMA–DSM case, on the other hand, the global data

[Figure: relative time spent in the loop body, Release, and Acquire (SW-Loop/SW-Rel/SW-Acq and NUMA-Loop/NUMA-Rel/NUMA-Acq) for 4 bytes to 64 Kbytes of data transfered]

Figure 6.1 Time distribution during Acquire/Write/Release cycle (SW–DSM left, NUMA–DSM right).

access during the loop body is executed fully in hardware, keeping the amount of time required for it extremely low. Due to this, the vast majority of the time is spent during the Acquire. With rising amounts of data to be transferred, the interconnection network and the local I/O busses become a bottleneck, leading to an increased loop execution time, which can be seen in the right graph. It should be noted that in both cases the execution time spent for the Release operation is insignificant, though for different reasons. The SW–DSM system simply performs some bookkeeping, a purely local operation without much computational overhead, while the NUMA–DSM system performs a write flush which only stalls until all data has been transmitted, again a rather short time due to the low latencies provided by SCI.

The performance experiments above, however, have been skewed by the fact that all processors actually perform a full lock operation. This generates lock contention, so that a majority of the time spent for the Acquire operation actually comes from waiting for the lock. To avoid this influence, the following experiments use a barrier based approach, in which all processors (again on all 6 nodes) access the global memory during an iteration and the memory consistency is enforced through both a Release before and an Acquire after the barrier. These two operations, however, can only be measured together and combined with the barrier wait time since the TreadMarks system does not offer these operations separately. However, as the barrier wait time is expected to be low (all processors execute the same amount of code) and the execution time of the Release operation is very short (as seen above), the time measured still gives a good impression of the Acquire time.

The graphs in Figure 6.2 show the results of these experiments based on the full execution time of one iteration, i.e. loop body and barrier time, for a range of different data sizes. Three different memory access patterns have been tested: read–only, write–only, and read–write in a consumer–producer like manner. When comparing the read–write curve with the other two, the data shows that the performance

[Figure: execution time in µs versus data transfered in bytes for the read (R), write (W), and read–write (RW) patterns on the SW–DSM and NUMA–DSM systems]

Figure 6.2 Performance under Read, Write, and Read/Write traffic.

of a software DSM system is bound by the behavior caused by writes, since those actually cause communication in such a system. In the read–only scenario, on the other hand, virtually all reads can be performed from replicated copies in local memory and the performance is only slightly affected by the bookkeeping overhead of the DSM system. In the NUMA–based DSM system, the scenario is radically different: here the performance is bound by the read performance, because reads are the operations which trigger the actual communication in these scenarios, whereas writes can be pipelined by the local network adapter and therefore do not lead to a drastic overhead.

This different communication behavior also leads to a different distribution between the time spent during the loop body and the time spent during the barrier. This is shown in Figure 6.3 for the most significant case, the read–write scenario. The software DSM system spends almost an equal amount of time in each of these two parts, and the time rises roughly linearly with the amount of data transferred. This comes from the fact that these systems require the bookkeeping during the loop and the actual communication during the Acquire. NUMA–DSM systems, on the other hand, spend most of their time during the loop body, which can mostly be attributed to the latencies caused by remote reads, while the time needed for the barrier stays almost constant, because the same operations, i.e. the flush and the invalidations, are always performed.

Finally, it should be noted that all of these experiments just show the raw performance of the system under the synthetic conditions of the memory transfer benchmarks. While this allows a clean insight into the different characteristics, these benchmarks do not allow an assessment of the performance of the different systems. Many of the beneficial properties present in most application codes, like spatial and temporal data reuse or a significant amount of local operations without global data access, are not present in these benchmarks.

[Figure: execution time in µs split into loop body and barrier time (SW-Loop/SW-Bar and NUMA-Loop/NUMA-Bar) for 4 to 8192 bytes of data transfered]

Figure 6.3 Distribution of work in the Read/Write scenario.

Release vs. Scope Consistency on NUMA systems

In a second set of experiments, the differences between Release and Scope Consistency for NUMA–based architectures are investigated. For this purpose, two applications from the SPLASH–2 suite [233], namely the RADIX sorting kernel and the WATER N-Squared molecular dynamics code, are used in combination with a HAMSTER port of the ANL macros. These two codes both use a rather large number of locks and therefore provide a suitable basis for an analysis of these two consistency models.

Table 6.2 summarizes the results of this set of experiments. It includes data for both codes gathered during two runs each: one using Release and one using Scope Consistency. In both cases the numbers of lock, barrier, and consistency operations are presented for two areas: the total code (including any operation needed by the NUMA–DSM framework for its own setup and initialization) and the computational core.

In both codes, the use of Scope Consistency enables a significant reduction of the number of Acquire operations during the execution, mainly during the core phase. Up to 68% of all Acquires could be avoided and, with them, the related read invalidation operations. In both codes, this also leads to an increase in the overall performance.

6.3.6 Applicability

It should be noted that these results show the general direction of the characteristics that can be expected when using NUMA–based consistency models. The concrete performance data depends strongly on the properties of the underlying hardware. These can vary greatly due to different network speeds and latencies as well as due to different optimizations like streaming, buffering, prefetching, or memory coherency guarantees. The general trend shown by the experiments above, however, will be the same across all platforms since it depends on the direct exploitation of the NUMA principle. These results should therefore in general be applicable to any NUMA–based DSM system.

                        RADIX                        WATER (N-Squared)
                        Release C.    Scope C.       Release C.    Scope C.
Data set size           262144 keys                  1331 molecules
Locks (total)           158           158            663           663
Locks (core)            12            12             89            89
Barriers (total)        330           330            1095          1094
Barriers (core)         24            24             24            24
Acquire (total)         621           615            2295          2183
Acquire (core)          24            18             113           36
Release (total)         621           621            2295          2295
Release (core)          24            24             113           113
Ops saved (total)       6 / 0.97 %                   112 / 4.88 %
Ops saved (core)        6 / 25 %                     77 / 68.14 %
Execution time          4165 ms       3990 ms        8276 ms       8154 ms
Improvement             4.4 %                        1.5 %

Table 6.2 Properties of two applications running with both Release and Scope Consistency (on 4 nodes).

6.4 Transparent Multithreading on top of HAMSTER

The most commonly used shared memory programming models are thread libraries. They exist as part of the native API of most modern operating systems and are used both to hide I/O latency and to efficiently exploit symmetric multiprocessor (SMP) systems. Especially the use for the second purpose makes these programming models an interesting target for this work. By implementing them within the HAMSTER framework in a transparent fashion, multithreading can be extended from SMP systems to clusters, exploiting their increased scalability. Currently two different thread APIs have been implemented on top of HAMSTER: the POSIX thread API for Linux [219] and the Win32 thread API for Windows NT/2000TM [154]. Both will be explained in more detail below.

6.4.1 State of the Art

Any of the common general purpose operating systems comes with a native thread API. This includes the two operating systems supported by HAMSTER, namely Windows NT/2000TM with its Win32 thread API [154] and Linux, which since kernel version 2.0 includes POSIX compliant threads [219]. These APIs, however, differ in their concrete semantics, making it difficult to port codes across platforms. To compensate for this problem, several projects have been initiated to form a portable thread layer across various platforms. Examples for this kind of system are ACE [191], developed at Washington University, St. Louis (MO), and Thread.h++ from Precision Software [248]. They export a new API with identical semantics across all supported platforms and therefore allow a quick and easy porting of applications between platforms. However, the semantics of this new API differs from any of the platform specific APIs in order to accommodate the subtle differences, again making the initial port of existing applications a complex task.

In addition to these thread APIs, which present potential target programming models for the HAMSTER framework, projects targeting distributed implementations of thread APIs on top of global memory abstractions are also related to the work discussed in this chapter². Two examples of this, using opposite approaches, are the DSM–Threads [164] and the RThreads [46] projects. Both aim at providing a POSIX threads–like environment for clusters, but while the DSM–Threads are layered on top of the global virtual memory abstraction created by a conventional SW–DSM system (see Chapter 4.1) implemented as a pure user–level library, the RThreads approach utilizes a special pre–processor to identify and implement remote memory accesses. In addition, the RThreads require some simple user interaction in the form of annotations in the intermediate code to complete the global memory abstraction. While this increases the complexity for the programmer, it also allows for easy optimizations directly adjusted to the runtime behavior of the target application.

6.4.2 Global Thread Identification

One of the main issues when providing a global thread abstraction is the creation of a global name or ID space for the threads and their associated resources. Any identifier has to be globally unique and must allow the determination of both the node on which the thread or the resource resides and a reference to the node local resources managed by the operating system. In both HAMSTER–based distributed thread systems currently in existence this issue is resolved in the same way. In both systems, Linux and Windows NT/2000TM, thread identifiers are 32 bit values handed out by the operating system on their creation. However, not all of the 32 bits are ever used, which allows the HAMSTER system to reuse the upper 8 bits as storage for the identifier of the node on which the thread resides (see Figure 6.4). This approach has the significant advantage that the data structure representing a thread in the respective original operating system API can be kept, hence avoiding any impact on the actual API.

In order to maintain transparency, the new HAMSTER–based versions of the APIs only use the extended, globalized versions of the thread identifiers for any interaction with applications running on top of them. Any conversion into native operating system identifiers is done implicitly without any need for user intervention.
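To make this encoding concrete, the following small sketch illustrates how such identifiers could be composed and decomposed; the macro and function names are purely illustrative and not part of the actual HAMSTER interface.

    #include <stdint.h>

    /* Illustrative helpers for the ID scheme described above: the upper
     * 8 bits of the 32-bit thread identifier store the node ID, the lower
     * 24 bits keep the node-local identifier handed out by the OS. */
    #define NODE_SHIFT 24u
    #define LOCAL_MASK 0x00FFFFFFu

    static inline uint32_t make_global_tid(uint8_t node_id, uint32_t local_tid)
    {
        /* assumes the OS never uses the upper 8 bits of its identifiers */
        return ((uint32_t)node_id << NODE_SHIFT) | (local_tid & LOCAL_MASK);
    }

    static inline uint8_t tid_node(uint32_t global_tid)
    {
        return (uint8_t)(global_tid >> NODE_SHIFT);
    }

    static inline uint32_t tid_local(uint32_t global_tid)
    {
        return global_tid & LOCAL_MASK;
    }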

6.4.3 Forwarding Requests to Potentially Remote Threads

The HAMSTER framework with its core, the SCI Virtual Memory, provides all necessary means and mechanisms to implement a cluster–wide distributed thread library.

² Distributed threads without a global memory abstraction are more related to the task management module and are therefore briefly discussed in Chapter 5.4.1.

[Figure 6.4 layout: the unused upper bits of the node-local thread identifier are replaced by a node ID, yielding a global thread identifier consisting of the node ID and the node-local thread identifier.]

Figure 6.4 Adding global information to the thread identifier of both Linux and Windows NT/2000TM.

Most of the required functionality, like a global virtual address space, cross–cluster synchronization operations, and cluster management mechanisms, is already present and simply has to be specialized in order to fulfill the semantic requirements of a distributed thread library. The main component that still needs to be implemented is a forwarding mechanism that allows the transparent management of potentially remote threads. Any routine that has the capability of modifying the state of another thread is intercepted, the location of the specified thread is determined based on the global identifier introduced above, and the request is handed on to the node on which the thread is currently executing. Routines that only affect the calling thread, in contrast, are directly executed on the same node, resulting in virtually no overhead.

The only exception to these guidelines are the routines responsible for spawning a new thread within the current process (CreateThread for the Win32 thread API and pthread_create for the POSIX counterpart). Here the target node on which the new thread will be created has to be determined by the routine itself. Before returning to the user, the respective thread creation routine converts the local thread identifier returned by the operating system call into a global thread identifier. This new identifier is then returned to be used for any future reference to the newly created thread. The location for a new thread is currently decided using a standard round-robin scheme based on the global activity counter provided by the task module. For the future, a more flexible and load-adaptive thread placement scheme is envisioned. This will help the performance of applications with inherently large scale parallelism as well as in clusters with heterogeneous node performance or in non–dedicated environments.
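The following sketch outlines how such a globalized thread creation routine could combine the placement decision, request forwarding, and identifier globalization; all ham_* and os_* names are hypothetical placeholders for HAMSTER-internal services and the native operating system call, not the actual API.

    #include <stdint.h>

    /* Hypothetical sketch of a globalized thread-creation wrapper. */
    extern uint8_t  ham_next_node(void);       /* placement decision (round-robin)   */
    extern uint8_t  ham_local_node(void);      /* identifier of the local node        */
    extern uint32_t ham_forward_create(uint8_t node, void (*entry)(void *), void *arg);
    extern uint32_t os_create_thread(void (*entry)(void *), void *arg);

    uint32_t hamster_create_thread(void (*entry)(void *), void *arg)
    {
        /* pick a target node, e.g. round-robin over the global activity counter */
        uint8_t node = ham_next_node();

        uint32_t local_tid;
        if (node == ham_local_node()) {
            /* local case: call the native OS thread creation directly */
            local_tid = os_create_thread(entry, arg);
        } else {
            /* remote case: forward the request to the chosen node and wait
             * for the node-local identifier it returns */
            local_tid = ham_forward_create(node, entry, arg);
        }

        /* hand a globalized identifier back to the application
         * (upper 8 bits: node ID, lower 24 bits: node-local identifier) */
        return ((uint32_t)node << 24) | (local_tid & 0x00FFFFFFu);
    }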

6.4.4 Memory Consistency Model

Even though thread libraries are by default intended and normally implemented for CC–NUMA or UMA systems, they normally do not require a fully coherent memory subsystem. Often the respective standards specifically include descriptions of the minimal requirements, which lead to models similar to Weak Consistency (see Chapter 4.1.2). Examples for this are the POSIX threads [219] or the JavaTM threads [104]. They all specify a set of routines from their API which, upon completing their respective tasks, need to provide a consistent memory state on the node they were called from. These routines, also called consistency points due to their effect on the local memories, are mostly routines which provide some kind of global synchronization, like mutex or thread creation routines. To implement this consistency model, the consistency point routines are enhanced by both read invalidations and write flushes, resulting in the intended memory consistency at the respective points of execution.

It should, however, be pointed out that multithreaded applications written for SMPs or similar tightly coupled architectures might not be implemented according to the specified memory models. Often they rely on the stronger hardware enforced consistency for correct execution. When porting such non–compliant applications to a thread library with a weaker, but still standard compliant, memory model, problems will occur in the form of races, stale memory regions, and missing update propagations. This can only be fixed by introducing additional synchronization points into the application, transforming it into a standard compliant code.

One example of such a problem is present in some of the SPLASH–2 benchmark codes [233]. These codes were originally intended for a CC–NUMA architecture, namely the FLASH architecture developed at Stanford University [128], but are now used in various versions as a general shared memory benchmark suite. Within these codes most accesses to shared data are cleanly protected by mutexes or other mechanisms with implicit synchronization points. However, at some rare locations, a few of the applications poll on a global variable for certain system wide events. This polling is done without any synchronization construct to avoid the high overhead of such operations. On a cache coherent architecture with sequential consistency this approach is feasible and probably the most efficient one available. The polling variable will be contained in the first level cache during the poll loop, hence even eliminating any effect on system wide bus resources. On a write to this synchronization variable by any processor, the cache coherency controller invalidates the copy in the cache that is being polled on, triggering a reload at the next access and hence delivering the new value without the need for any further synchronization. This means the actual write itself is used as the synchronization point.

In a relaxed consistency model, however, this write does not trigger a remote invalidation of the synchronization variable and therefore does not carry the implicit synchronization as in the case of sequential consistency. The polling processor does not see the new value, causing it to continue polling indefinitely. This can only be changed by introducing additional synchronization constructs, e.g. a mutex around any access to the synchronization variable.
This, however, has to be done with great care, as very fine grain synchronization is expensive and may, depending on the application and its behavior, cause high overheads.

In order to avoid these overheads, several options exist within the HAMSTER framework. These, however, are normally no longer compliant with the respective thread API and lead in most cases to new extensions of the programming model. They therefore sacrifice full transparency for optimal performance and hence should only be applied as a last step of incremental and optional tuning.

One option is to directly add the respective flush and invalidate routines from the consistency module at the locations in the code at which either new values need to be propagated or stale values could affect the execution. This avoids the actual synchronization operation and its global impact, while still guaranteeing correct program behavior. The result can be seen as a new, more efficient application–specific consistency model, adapted to the special synchronization constructs of the application. While this option does not involve any major code changes and can be realized by just adding a few calls to the code, a second option, the allocation of special synchronization variables in a memory region with higher coherency guarantees, is most likely to involve more changes. Here the variables have to be specifically allocated and made available to the application in an ordered fashion. On the other hand, this option is likely to improve the overall performance, as no additional consistency enforcing mechanism needs to be applied when accessing the synchronization variable. This reduces the overall impact of consistency mechanisms as fewer system wide flushes or invalidations are performed.
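To make the problem and the possible remedies concrete, the following sketch contrasts the non-compliant polling idiom with a standard compliant fix and with the explicit consistency calls described as the first option; ham_mem_flush() and ham_mem_invalidate() are placeholder names for the consistency module services, not the actual HAMSTER routines.

    #include <pthread.h>

    /* Placeholder declarations for the consistency-module services. */
    extern void ham_mem_flush(void);        /* propagate local writes        */
    extern void ham_mem_invalidate(void);   /* discard stale local copies    */

    volatile int done = 0;                  /* shared flag in global memory  */
    pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;

    /* (a) Non-compliant idiom: works under sequential consistency only,
     *     may spin forever under a relaxed consistency model. */
    void wait_busy(void) { while (!done) { /* spin */ } }

    /* (b) Standard-compliant fix: every access goes through a mutex,
     *     whose lock/unlock operations act as consistency points. */
    void wait_mutex(void)
    {
        for (;;) {
            pthread_mutex_lock(&done_lock);
            int d = done;
            pthread_mutex_unlock(&done_lock);
            if (d) break;
        }
    }

    /* (c) Application-specific alternative: keep the polling loop but add
     *     explicit invalidations (reader) and flushes (writer). */
    void wait_explicit(void)
    {
        while (!done)
            ham_mem_invalidate();   /* refresh the local view of the flag   */
    }

    void signal_explicit(void)
    {
        done = 1;
        ham_mem_flush();            /* make the new value visible remotely  */
    }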

6.4.5 Extended Synchronization

Besides standard locks used for the implementation of mutexes and critical regions, an additional, very typical part of the synchronization repertoire of most thread APIs is notification mechanisms. These allow a thread to block and wait for an event to be triggered by another thread. Despite this common functionality, the actual semantics varies greatly between the various thread APIs, making it difficult for HAMSTER to provide common services to support this in a platform independent way. The implementation of these notification mechanisms is therefore currently not part of the HAMSTER base functionality (or, more specifically, the synchronization module). They are rather realized by the respective implementations of the transparent thread programming model and are based on the existing notification services of the OS. This maintains their original OS specific semantics.

Notification services are provided in the POSIX standard [219] in the form of condition variables. Any thread can use such variables to block awaiting the condition to be set by any remote thread. The setting of the condition is done using a signal like mechanism, as is available in any modern UNIX operating system. In the Win32 API [156], a similar functionality is available in the form of so–called events. Here, too, threads can block until an event is signaled by another thread, and any thread can set or reset events. The main difference between the two models lies in the fact that Win32 events are persistent, i.e. an event can maintain a set status even if no thread is currently waiting, whereas POSIX condition variables are not persistent, leading to a signal being lost when no thread is waiting on it at the time it is sent. In addition, the Win32 event mechanism is richer, as any system resource, including processes, threads, and files, is abstracted as an event and can be waited on for some kind of event associated with this resource (e.g. the termination of a process). Furthermore, the Win32 API allows waiting for multiple resources at the same time in an atomic fashion.

In order to deal with these severe semantic differences, the distributed thread APIs currently existing on top of HAMSTER implement notification using the forwarding mechanism discussed above. Each notification variable, independent of whether it is a POSIX condition variable or a Win32 event, is realized on one of the nodes in a home–based fashion using the original OS API. Any operation on this notification variable is then sent to this home node of the variable and executed there. Even though this implementation scheme causes additional communication overhead, it guarantees semantics identical to the original native API and hence a correct program execution.
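A minimal sketch of this home-based scheme is given below; the ham_* and os_event_* names are hypothetical placeholders for the forwarding mechanism and the native notification primitives, and the structure layout is purely illustrative.

    #include <stdint.h>

    /* Hypothetical placeholders for the forwarding mechanism and the
     * native OS notification calls. */
    extern uint8_t ham_local_node(void);
    extern void    ham_forward_call(uint8_t node, void (*fn)(uint32_t), uint32_t arg);
    extern void    os_event_set(uint32_t local_id);
    extern void    os_event_wait(uint32_t local_id);

    /* A notification variable lives on exactly one home node and is
     * identified there by a node-local handle. */
    typedef struct {
        uint8_t  home;       /* node owning the native event/condition  */
        uint32_t local_id;   /* handle valid only on the home node      */
    } global_event_t;

    static void do_set(uint32_t id)  { os_event_set(id); }
    static void do_wait(uint32_t id) { os_event_wait(id); }

    void global_event_set(global_event_t *ev)
    {
        if (ev->home == ham_local_node())
            do_set(ev->local_id);                              /* local fast path */
        else
            ham_forward_call(ev->home, do_set, ev->local_id);  /* ship to home    */
    }

    void global_event_wait(global_event_t *ev)
    {
        if (ev->home == ham_local_node())
            do_wait(ev->local_id);
        else
            ham_forward_call(ev->home, do_wait, ev->local_id);
    }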

6.4.6 Available APIs and Limitations

As mentioned above, the HAMSTER framework currently includes distributed thread APIs for both Windows NT/2000TM Win32 [154] and Linux/POSIX threads [197]. This selection was made as these two models are predominantly used for existing applications on these target systems. Other APIs, however, like the portable thread layers mentioned above, could be implemented in a very similar fashion and are expected to exhibit similar characteristics in both performance and functionality. Work on the distributed thread APIs was initially done within the SISCI project [207, 50] already mentioned in Chapter 2.3. Its goal was to create a complete software infrastructure for SCI–based cluster systems consisting of both message passing and shared memory APIs, the latter part being covered by these thread APIs on top of the HAMSTER system [192, 194].

The transparency of these two thread APIs is, however, not complete, thereby leaving some differences between the semantics of the native and the HAMSTER based APIs. One source for these differences comes from the fact that, in addition to purely thread–based services, the thread APIs in both operating systems also offer several routines for the efficient cooperation between several multithreaded processes. These routines are not included in this work as the SCI-VM solely deals with providing a global memory abstraction of a single process. This also restricts any programming model on top of it to be single process oriented. This restriction, however, is of rather low importance for the execution of typical multithreaded, computationally intensive applications and therefore does not impose a major restriction on the overall approach.

A second difference can be found in how the distributed threads treat I/O. As the global process abstraction created by the SCI-VM and the cluster control module does not include full I/O transparency, this limitation is also present in the distributed thread APIs. Any I/O resource allocated by a thread is therefore only visible in the team hosting that particular thread and cannot safely be shared with other threads. This restricts the I/O patterns used by multithreaded applications on top of HAMSTER mainly to models with one thread being solely responsible for I/O (typically the initial thread), while all others function purely as computational threads without any I/O. This model, however, is very common among multithreaded applications in the area of computationally intensive codes targeted by this work.

6.4.7 Performance Evaluation

In order to validate the approach of distributed multithreading as described above and to highlight some special performance details resulting from the integration of inter–node shared memory with local multithreading, a few experiments have been conducted using the distributed Win32 thread API. All of them were executed on the same platform as the other experiments so far, although with the older PCI-SCI adapter card D310 from Dolphin ICS. Two different numerical kernels were used: a Successive OverRelaxation (SOR) and an LU decomposition. Both codes are based on SMP oriented counterparts from various shared memory benchmark suites and were only slightly modified and adapted to the Win32 thread API. In addition, as the data locality exhibited by the code has a major influence on the performance, the data placement hints present in the original codes were taken into account. Due to the flexibility of the utilized HAMSTER framework, this optimization did not require major code changes, but merely some minor annotation at the memory allocation points in the code. The relaxed consistency model was applied only through the implicit mechanisms described above without requiring any code modifications.

The first code that has been examined is the Successive OverRelaxation (SOR), a numerical kernel that is used to iteratively solve partial differential equations. This is a quite simple test code capable of showing the basic performance of the overall system. The speedup of the application is shown in Figure 6.5 for the two different matrix sizes 1024x1024 and 2048x2048. The results are split into two different lines, one in which each node was used as a single node with one thread each, and one in which the SMP capabilities are used by spawning two threads per node. Both curves show good scaling behavior, with the better scaling properties for the larger data set. It is interesting that the speedup is higher for the single processor testcase than for the SMP experiments. Especially worth noting is that the execution of two threads on two separate nodes on top of the SCI-VM exceeds the performance of two threads running on top of the native thread API on a single node. This behavior comes from the fact that in the latter case both processors access the same processor bus and therefore cause more contention than in the first case, where each processor has its own dedicated processor bus.

This observation can also be seen in more detail in the work decomposition of the executed code shown in Figure 6.6. This graph breaks the complete execution time of the code down into the significant subparts: the actual work (assumed to be the same in all cases), the shared memory access overhead caused by the HW–DSM and local processor bus contention, the overhead caused by having to enforce consistency in the relaxed consistency model (measured by the difference to a second, incoherent run), and the barrier wait time. In the case of two threads per node (left columns) the access overhead is significantly larger due to bus contention. This also causes the consistency overhead to be higher since this operation also triggers memory updates. In the single processor case, the access overhead is much smaller and the time needed to enforce consistency is almost negligible.

The second code that was examined is the LU decomposition, one of the fundamental algorithms used in computational numerics.

[Plot: speedup (y-axis, up to 8) versus number of threads (x-axis, 1-8) for the N=1024 and N=2048 data sets.]

Figure 6.5 Speedup of the SOR code with up to 8 threads (4 nodes/8 CPUs).

[Stacked bar charts: execution time [ms] for varying numbers of threads, split into Work, Access, Cons., and Barrier portions.]

Figure 6.6 Work distribution of the SOR code with up to 8 threads (4 nodes/8 CPUs) — 1024x1024 matrix (left) and 2048x2048 matrix (right).

It is used to factorize a dense matrix into two triangular matrices. This method can be applied to solve arbitrary systems of linear equations. The code, which is much more complex than the SOR code, was also executed on up to four nodes with two processors each and yielded results similar to those of the SOR code; they are shown in Figures 6.7 and 6.8. In contrast to the SOR code, however, it was impossible to assess the consistency overhead, as an execution without consistency enforcing mechanisms distorted the execution behavior too much to achieve reasonable results. Due to the larger complexity of the algorithm and the longer execution time, the results are less distinct than in the SOR case. However, also here the execution with two threads per SMP node induces higher access cost and therefore higher overhead. The main performance problem of the LU code, however, lies in the application's scaling properties, as can be seen for the small data set on more than 6 processors. From this point on, the barrier time increases significantly, resulting in a lower parallel efficiency, as can be seen in the

[Plot: speedup (y-axis, up to 8) versus number of threads (x-axis, 1-8) for the N=1024 and N=2048 data sets.]

Figure 6.7 Speedup of the LU code with up to 8 threads (4 nodes/8 CPUs).

[Stacked bar charts: execution time [ms] for varying numbers of threads, split into Work, Access, and Barrier portions.]

Figure 6.8 Work distribution of the LU code with up to 8 threads (4 nodes/8 CPUs) — 1024x1024 matrix (left) and 2048x2048 matrix (right).

speedup curve. It can be expected that the same can be observed for the larger data set with more than 8 nodes. This, however, is a limitation of the application rather than one of the execution environment.

In summary, the experiments show that the distributed thread APIs implemented within the HAMSTER framework allow the efficient execution of multithreaded applications on cluster architectures. They extend their scalability beyond the standard dual SMP configurations found in the PC world by taking advantage of both the intra-node SMP capabilities and the cost effective scaling of commodity clusters without having to sacrifice the easy–to–use programming model.

6.5 Support for Explicit Shared Memory Models

All programming models discussed so far in this work adhere to the strict definition of a shared memory programming model set forth in Chapter 2.1.1. They are all based on a single, identical view of the virtual memory transparently provided by the underlying system. In the case of an implementation within the HAMSTER framework, this is taken care of by the SCI Virtual Memory or SCI-VM (see Chapter 4). In addition to those models, however, a few other programming models exist that, while utilizing principles from the shared memory world, do not exhibit the characteristics of transparent access to the global address space. Rather, users have to explicitly specify accesses to remote data within the global shared memory. These models, summarized under the name "explicit shared memory programming", can also be implemented within the HAMSTER framework. However, they require a different implementation strategy than the systems discussed so far.

6.5.1 The Cray T3D/E Put/Get Programming Interface

One example of such an explicit programming model is the shmem library [98] introduced by Cray and mainly intended for their T3D/ETM MPPs [235]. This API is by now also available on most SGI and Cray MPPs and has therefore found its user community. It implements a put/get programming model which allows programmers to access the virtual address range of any node using explicit read (Get) or write (Put) operations. In addition, the API provides a comprehensive set of synchronization and collective operations completing the overall programming model.

The programming interface

As already mentioned above, the shmem programming model does not rely on a single global virtual address space, but rather implements independent address spaces in each process on each node. These can then be accessed using the special one–sided communication routines Put and Get (see also Table 6.3), which allow the direct transfer of data to and from a remote memory location without interrupting the execution on the target node. Using these routines, both static application data and dynamic heap data can be accessed and modified.

In addition to these transfer routines, the shmem programming model also contains several routines for the efficient support of collective operations, which are also listed in Table 6.3. This includes the functionality to broadcast data to all nodes, to collect and concatenate data from all nodes, and to perform a data aggregation using a given operation across a given set of nodes. These operations cover all typical commutative and associative routines like sum, product, computation of maximum and minimum as well as the logical operations and, or, and xor. Both the collective operations and the transfer routines are available for a broad range of different types and different transfer lengths. This broadens the API and provides additional safety for the programmer since it enables explicit type checks.

Name         Description
Put          write data into remote memory
Get          read data from remote memory
Broadcast    broadcast data to all nodes
Collect      combine data from all nodes
OP_to_all    perform operation OP across all nodes
             (OP = sum, prod, and, or, xor, max, min)
Barrier      perform a barrier over a set of threads
Set_lock     acquire a global lock
Clear_lock   release a global lock
Num_pes      query the number of participating nodes
My_pe        query the rank of the local node

Table 6.3 Main routines provided by the Shmem programming model (names simplified)

The API is completed with the typical lock and barrier routines known from other shared memory programming models as well as with routines to query the number of nodes participating and the rank of the local node.
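As an illustration of the resulting programming style, the following fragment sketches a simple global summation written against the simplified routine names of Table 6.3; the argument order of Put and Get assumed here (destination, source, length, target node) is illustrative only, as the actual Cray shmem routines carry type-specific names and additional parameters.

    #include <stddef.h>

    /* Illustrative prototypes using the simplified names of Table 6.3. */
    extern int  My_pe(void);
    extern int  Num_pes(void);
    extern void Barrier(void);
    extern void Put(void *remote_dst, const void *local_src, size_t nbytes, int pe);
    extern void Get(void *local_dst, const void *remote_src, size_t nbytes, int pe);

    #define N 1024

    static double local_part[N];   /* static data, remotely accessible          */
    static double partial_sum;     /* one partial result per processing element */
    static double total;           /* final result, distributed to all nodes    */

    void global_sum_example(void)
    {
        int me    = My_pe();       /* rank of the local node                */
        int nodes = Num_pes();     /* number of participating nodes         */

        /* every node sums up its own slice of the data ...                 */
        partial_sum = 0.0;
        for (int i = 0; i < N; i++)
            partial_sum += local_part[i];

        Barrier();                 /* make sure all partial sums are ready  */

        if (me == 0) {
            /* node 0 explicitly pulls the partial results of all others    */
            total = partial_sum;
            for (int pe = 1; pe < nodes; pe++) {
                double remote;
                Get(&remote, &partial_sum, sizeof(remote), pe);
                total += remote;
            }
            /* ... and pushes the final result back to every node           */
            for (int pe = 1; pe < nodes; pe++)
                Put(&total, &total, sizeof(total), pe);
        }

        Barrier();                 /* afterwards every node can read 'total' */
    }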

Architectural support on the T3D/ETM

The T3D/ETM systems [235], for which this library was mainly intended, are very well suited for such a programming approach since their architecture matches this model very closely and can hence directly exploit its advantages. Their NUMA architecture allows direct access to the address space of any remote node and can thereby directly perform any Put or Get operation without influencing the remote node. In addition, the T3D/ETM systems contain special hardware mechanisms for an efficient implementation of global operations and therefore allow a high–performance implementation of synchronization and collective operations. This beneficial match between architecture and programming model has made the put/get model an important programming model for these architectures. Next to message passing libraries it has found its place in the overall software infrastructure for NUMA–like MPPs.

6.5.2 Implementing Put/Get in HAMSTER

This success motivates the investigation of how to implement this explicit shared memory programming model within the HAMSTER framework. It opens SCI–based clusters to a new class of applications and allows them to be used either as low–cost alternatives to large–scale MPPs or as efficient front–end systems for MPPs, which are available for application development and debugging.

Most of the functionality of the shmem interface, like the synchronization constructs or the collective operations, can easily be implemented using the standard mechanisms already introduced. In contrast to the other programming models, however, this implementation also requires the ability to access memory locations in the virtual address space of other nodes. This can easily be achieved for newly allocated data since it can be allocated in a globally visible address space and thereby be made available to any node. Dealing with the static application data, however, is more complex, as it is by default controlled by the local operating system instances rather than by the SCI-VM system. For this purpose, the memory management module offers the explicit distribution of static data. After applying this type of distribution, the static data of every node is made globally visible through appropriate additional mappings. Every node can now directly access this space on any other node and hence perform the required Put and Get operations.

In contrast to the models discussed so far, which rely on a fully transparent data distribution, put/get models contain implicit descriptions for an optimal distribution. This can be drawn directly from the semantics of the Put and Get operations, which imply a direction of transfer. Put operations write to remote locations while Get operations read remote data. In both cases, the referenced data has been allocated and is under the control of the respective remote node. Hence, any data, whether part of the static application data or newly allocated, should be physically placed on the particular node containing or allocating the data. This leads to a consistent mapping between the logical location of data as seen from the programmer's point of view and the physical data layout, and thereby to a predictable application performance. In addition, this programming model, as in general any explicit shared memory programming model, does not rely on transparent caching since any data transfer is handled explicitly. This allows the use of uncached memory, resulting in a higher write performance due to improved write combining (see also Chapter 4.7).

Both this data placement and the memory type selection can easily be achieved with the data distribution and coherency annotations provided by the HAMSTER memory management module. Together with the already discussed explicit distribution of static application data, the HAMSTER system therefore provides a suitable implementation platform even for explicit shared memory programming models and thereby proves the wide applicability of this approach.
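Under such mappings, a Put essentially degenerates to a copy into the remote window followed by an explicit store barrier. The sketch below illustrates this idea; the per-node mapping table and the ham_store_barrier() routine are hypothetical placeholders and do not correspond to actual HAMSTER or SCI-VM identifiers.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical bookkeeping set up by the memory management module:
     * for every node, the base address under which that node's segment is
     * mapped into the local address space, plus the base address of the
     * segment in the remote node's own address space. */
    extern void     *remote_window[];        /* local mapping of node i's segment */
    extern uintptr_t segment_base[];         /* segment base address on node i    */
    extern void      ham_store_barrier(void); /* placeholder: flush write buffers  */

    /* Put: write nbytes from a local buffer to address 'remote_addr'
     * in the virtual address space of node 'pe'. */
    void put(int pe, void *remote_addr, const void *src, size_t nbytes)
    {
        /* translate the remote virtual address into the local mapping */
        uintptr_t offset = (uintptr_t)remote_addr - segment_base[pe];
        memcpy((char *)remote_window[pe] + offset, src, nbytes);

        /* ensure the data has left the write (combining) buffers and has
         * actually arrived at the target before returning */
        ham_store_barrier();
    }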

6.5.3 Performance Aspects

Using the implementation guidelines described above, a subset of the Cray shmem library has been implemented on top of HAMSTER. It provides the main routines for Put and Get as well as most of the synchronization routines and collective operations and therefore satisfies the requirements of many applications. For this section, the put/get model in its current state has been evaluated using a few small benchmarks showing the performance of the put/get and some collective operations.

Figure 6.9 shows the raw bandwidth that has been achieved using the Put and Get routines. Both exhibit a performance close to the raw performance measured on top of the SCI-VM (see Chapter 4.7). The performance of the Get routine is directly limited by the SCI read performance and does not exceed 1 MB/s. The Put bandwidth, however, profits from the high SCI throughput.

[Plot: Put and Get bandwidth [MB/s] versus amount of data transferred (4 bytes to 64 KByte).]

Figure 6.9 Put/Get bandwidth.

Due to the explicit nature of the communication, the Put routine is also capable of exploiting the write combining optimization using explicit write cache flushes mentioned in Chapter 4.7 and thereby exceeds the default SCI-VM write performance. Compared to the maximal peak bandwidth (around 85 MB/s), however, the Put operation incurs an overhead due to necessary address translations and a full store barrier after the communication ensuring the data has fully arrived on the receiver. It could be expected that with larger data transfer sizes this overhead would be amortized. However, with transfers longer than a page (4096 bytes) the same write combining anomaly as in the raw experiments can be observed, limiting the performance on longer transfers to close to 50 MB/s.

The second set of experiments examines the performance of the broadcast mechanism. It is implemented as a series of consecutive transfers to the target nodes using several Put operations. This basic implementation scheme is clearly visible in the performance, as shown in Figure 6.10 for both large and small transfer sizes. The time needed for a broadcast rises both with increasing data transfer sizes and with the rising number of target nodes. The only exception can be seen for a transfer of 32 bytes, as the SCI protocol [92] does not contain 32 byte packets and the transfer therefore needs to be split into two 16 byte packets.

Besides broadcasts, the shmem library also offers further, more sophisticated global operations, most importantly collective operations across sets of processors. These have been evaluated using a collective reduction across various numbers of nodes and with different numbers of concurrent global reductions ranging from 1 to 64. The results for this experiment are shown in Figure 6.11. As expected, the execution time for the global reduction is nearly linear in both the number of nodes and the number of reduction operations and ranges

from nearly zero execution time for a local operation on one node to around 1350 µs for 64 reductions with data from 6 nodes.

[Plots: broadcast execution time [µs] versus amount of data transferred, with curves for 1 to 6 target nodes; small transfer sizes left, large transfer sizes right.]

Figure 6.10 Broadcast performance with varying numbers of target nodes and data transfer sizes (small transfer sizes left, large transfer sizes right).

[Plots: execution time [µs] of the collective addition versus number of reduction variables (left) and versus number of nodes (right), for 1 to 64 concurrent reductions on 1 to 6 nodes.]

Figure 6.11 Performance of collective addition across various number of nodes (left) and using various numbers of reduction variables (right).

Programming Model           #Lines   #API calls   Lines/call   Platform
SPMD model                  502      23           21.8         NT & Linux
SMP/SPMD model              581      25           23.2         NT & Linux
ANL macros                  146      20           7.3          NT & Linux
TreadMarks API              326      13           25.1         NT & Linux
HLRC API³                   137      25           5.5          NT & Linux
JiaJia API (subset)³        43       7            6.1          Linux
Dist. POSIX threads         725      51           14.2         Linux
WIN32 threads               988      42           23.5         NT
Cray put/get (shmem) API    505      29           17.4         Linux

Table 6.4 Implementation complexity of programming models using HAMSTER.

6.6 Implementation Complexity Analysis

As the HAMSTER system was designed with the intention to support a large set of different programming models on top of a single architecture, the system was built in a way that allows new models to be implemented with low effort. For this purpose the individual management modules offer a wide range of services, each associated with a large parameter space. Often the functionality required by the target programming model can therefore be achieved by specializing a similar service of HAMSTER. Normally only a few specialized API calls have to be implemented separately, as has been demonstrated in the implementation discussions in this chapter.

An indication of the complexity of the programming models implemented using the HAMSTER framework is given in Table 6.4. This list also contains several models which have been implemented within HAMSTER, but have not been discussed above. This includes an extended SPMD model capable of running with multiple threads on SMP nodes (SMP/SPMD), a direct implementation of the ANL macros used in the SPLASH–2 benchmark suite [233], and the emulation of APIs from existing software DSM systems like JiaJia [83] and HLRC [179].

The information presented in the table shows the lines of code necessary to implement each programming model in relation to the complexity of the programming model expressed through the number of available API calls. While such a measure is by nature highly inaccurate, it nevertheless gives a first impression of the amount of work required for such an implementation. The table lists most of the programming models currently available on top of the HAMSTER framework. These programming models differ greatly in terms of execution model, memory management, transparency, and consistency model.

Despite the clear functional differences, Table 6.4 shows that none of the programming models required extensive effort to be implemented on top of HAMSTER; on average not more than 25 lines of code had to be written per API call for most of the programming models. The most complex example of such a special routine is the forwarding mechanism present in both thread APIs available on top of HAMSTER (see also Chapter 6.4.3). This leads to the rather large number of lines necessary to implement these two models, which, however, is compensated by the large number of routines necessary in these models, again resulting in a quite low number of lines per API call.

Keeping the necessary effort to create a HAMSTER–based shared memory model at such a low level opens the possibility to easily add a large variety of models to the HAMSTER framework. Only the specifics of the individual model have to be implemented or deduced from the general mechanisms of HAMSTER through specialization; the actual core with all the complexity of typical DSM systems stays the same for any programming model. This approach therefore has the potential to unify the large number of existing shared memory models on top of a single system with a single core.

³ Based on the SPMD programming model.

6.7 Visions for Further HAMSTER–based Models

So far the discussion of programming models has concentrated on implementing existing models on top of the HAMSTER framework. It is, however, also possible to utilize HAMSTER for the implementation of new programming models or as a runtime system for higher–level programming abstractions or languages. This will be discussed in more detail within this section.

6.7.1 Application or Domain Specific Programming Models

The complexity analysis in Chapter 6.6 shows that the HAMSTER approach allows the implementation of new programming models with very little complexity. This not only enables the creation of a large set of existing programming models, but also provides the opportunity to establish new domain or even application specific models at only marginal extra cost. This opens the possibility of reducing the implementation complexity of the application itself by providing special primitives within the programming model, of implementing low–level optimizations at system level, and of adjusting the environment to special requirements from the ground up. This has the potential to lead to higher efficiency with respect to both performance and coding complexity at application level.

The easiest way to reach such an optimized programming model is to start from a given programming model in principle suited for the intended target application or domain and extend it. In the simplest case, these extensions can export some of the underlying HAMSTER facilities to the application. Especially promising with respect to potential impact on performance are services from the area of memory allocation and distribution policies as well as low–level and direct consistency control. The first group gives applications control over the memory distribution in the system, allowing for an adaptation to their implicit data locality properties and memory access patterns, while the second group enables the creation of application specific consistency models with weaker consistency guarantees than the general ones discussed above. This can help to reduce unnecessary calls to consistency enforcing invalidations or flushes, eliminating their global impact on the overall application. One example of such a custom programming model has already been discussed above in the context of the flag synchronization in multithreaded programs. There, additional functionality added to the existing API, either in the form of explicit consistency mechanisms or through designated, uncached flag variables, has enabled a domain specific optimization.

This idea can be taken a step further by not only providing a few extra shared memory services for performance reasons, but also by including task and memory constructs from an application or from typical representatives of application domains. This can go as far as directly providing a complete task management framework combined with implicit data management. This would lead to an evolution from a programming model towards a full runtime system with complete control over the application, but implemented on top of HAMSTER and using its shared memory as its prime information sharing platform without necessarily exporting it directly to the user.

As of now, no such domain specific model has been developed on top of the HAMSTER environment, as the work so far has concentrated on providing existing programming models for better code portability, thereby forming a cornerstone in the overall software infrastructure for SCI–based clusters. In addition, the evaluation of the HAMSTER system so far required to a large degree the use of special kernels and benchmarks from given suites in order to expose performance details and to stay comparable with other work. Only few actual real–world applications have been used within the HAMSTER project so far.
Chapter 7 will list two of them and discuss their performance. Both are from the field of medical imaging, a very promising area for the utilization of high–performance computing. These two applications, as well as a few more from the same field that have not been investigated within HAMSTER yet, exhibit similar properties with respect to I/O requirements, task structure, and data management and therefore have a good potential to profit from such a domain specific programming model or system in the future.

6.7.2 HAMSTER as Runtime Backend

Another area of potential target programming models are shared memory programming languages with implicit memory and task management. Those systems normally rely on their own compiler, which could be adapted either to produce code for a programming model available on top of HAMSTER or to generate HAMSTER code directly. This would result in using the HAMSTER system as a runtime system for the particular programming system.

Currently two such shared memory languages have found a significant user community, OpenMP [172] and High Performance Fortran (HPF) [236], even though many more exist, especially as research systems. Both OpenMP and HPF are based on a global process and memory abstraction and provide explicit pragmas which the user can use to specify sharing and concurrency patterns. However, the fundamental concept of parallelism varies greatly between the two: while HPF is geared towards data parallelism and emphasizes data distribution primitives along with a simple "owner computes" rule, OpenMP supports task parallelism based on the master/slave principle. Both approaches, though, can be implemented on top of HAMSTER due to the flexibility of the task and memory management modules, opening the possibility for easy ports of both OpenMP and HPF systems to NUMA enabled architectures, including SCI–based clusters.

6.8 Summary

This chapter has shown that the HAMSTER system is capable of supporting a large variety of different shared memory programming models with different goals, target application areas, and characteristics. The basis for any implementation is formed by the HAMSTER interface, which combines all services exported by the HAMSTER management modules with additional support mechanisms, such as portable timing.

The available programming models range from distributed thread libraries allowing true multithreading on cluster platforms all the way to explicit shared memory programming models in the form of a put/get library. They also include APIs from many known SW–DSM systems, providing direct code portability from any of these to HAMSTER–based platforms. This chapter has also shown that, due to the wide variety and flexibility of the services offered by the HAMSTER interface, each programming model can be implemented without much effort or complexity. This enables the quick and efficient creation of many different programming models on top of a single platform and thereby fulfills one of the main goals behind the overall system, namely the elimination of the major problems associated with the large number of existing shared memory programming models.

In addition, all programming models built on top of HAMSTER, despite their functional differences, directly rely on the same efficient DSM core, the SCI-VM. Any kind of shared memory access is directly handled by this core without the need for any further intervention by the programming model layer. This guarantees that all programming models exhibit equal behavior with regard to memory accesses and therefore automatically provides, in addition to easy code portability to HAMSTER, also performance portability between HAMSTER models. This relieves the programmer of programming models from the burden of extensive tuning of the management code, as is typically necessary with SW–DSM system or protocol implementations.

Chapter 7

Application Performance Evaluation

The evaluation of a complex system like HAMSTER based on benchmark suites alone, however, is not enough. It should also include experiments with a few large scale, real–world codes with a concrete application background. For this purpose, the following sections present two applications from the field of nuclear medical imaging, more precisely from Positron Emission Tomography (PET). In this field several computationally challenging tasks exist which demand parallel processing in order to be solved in a suitable time. Two of these, the iterative reconstruction of images and their spectral analysis, are introduced and evaluated below.

Both applications have been developed and are provided by the "Nuklearmedizinische Klinik und Poliklinik"¹ at the "Klinikum Rechts der Isar", the university hospital of the "Technische Universität München". Within the context of a long–term cooperation with this institution, these applications have been used in various projects [210, 149, 113], including the one presented here.

7.1 The Principle of Positron Emission Tomography

Positron Emission Tomography is a relatively young technique in nuclear medical imaging and is used in addition to traditional techniques like Computer Tomography (CT) or Magnetic Resonance Imaging (MRI) [77]. In contrast to those methods, though, PET imaging enables images of the active processes inside a patient's body by using tracer techniques, while the conventional methods mainly deliver information about the anatomy. PET imaging therefore provides valuable functional information about a patient's metabolism, perfusion, or receptor density in tissue. The information acquired can be used to detect a large variety of diseases, including dementia, cancer, and problems in the myocardial flow.

The basic principle of PET relies on positron emitting substances, the so–called tracers, which are injected into a patient's body and then monitored from the outside using a PET scanner. These tracers are typically derived from substances which play an important role in the human metabolism. The original substances are marked or radiolabeled with positron emitting isotopes. Nevertheless, the tracer substances have the same chemical properties as their unlabeled counterparts and therefore fully participate in the patient's metabolism.

¹ Roughly translated as "clinic for nuclear medicine".

[Diagram: a positron emitted by the tracer collides with a nearby electron; the annihilation produces two gamma quanta of 511 keV each, radiating in opposite directions (180 degrees apart).]

Figure 7.1 Positron / Electron annihilation (from [166]).

Areas of high activity regarding a certain substance will therefore lead to a concentration of the corresponding tracer, which can then be detected. Due to this, PET enables the visualization of the human metabolism and allows the detection of deficiencies or malfunctions, which can indicate serious diseases.

The detection of the tracer within the body exploits its property of emitting positrons. These positrons, which have a rather high interaction probability, collide with free electrons nearby and the two particles annihilate each other, leading to the creation of two gamma quanta (see also Figure 7.1). These quanta have the property that they radiate away from the point of the positron / electron collision in exactly opposite directions, i.e. at an angle of 180 degrees. They can then be detected by a surrounding PET scanner.

The scheme of such a PET scanner is shown in Figure 7.2. Its main component is a detector ring, which is capable of recognizing these gamma rays. The scanner hardware then filters the incoming events for coincidences in time, since those have a high likelihood of belonging to the same positron / electron annihilation and therefore indicate the presence of the original tracer substance at this point in the patient's body. The exact location, however, cannot be known since only the coincidence line, the line between the two detectors which have recognized two coincident events, can be observed.

The information of all coincidence lines acquired during one scan is stored in a number of two dimensional image planes (orthogonal to the patient's body axis). These image planes, however, do not represent geometric coordinates, but are based on the angle of the coincidence lines and their distance from the center of the scanner. The results are so–called sinogram planes; an example of such a plane is shown in Figure 7.3. The original data represents integral values of the activity distribution and needs to be converted into volume information representing the underlying real distribution. This is done in the so–called image reconstruction step. Furthermore, time sequences monitoring the change in activity distribution can be analyzed, e.g. by Spectral Analysis, resulting in functional parameters characterizing the metabolic status of the tissue.

Figure 7.2 Schematic structure of a PET scanner (from [166]).

Figure 7.3 Typical sinogram as delivered by the PET scanner before reconstruction.

Figure 7.4 Reconstructed and post–processed images of a human body, tracer F18 labeled glucose analog (FDG) (coronal slices, Nuklearmedizin TUM).

7.2 Reconstruction of PET Images

As described above, the first step in the post–processing of medical PET data is its reconstruction from projection space (sinograms) to image space (voxels). The starting point for this process is the set of sinogram data planes acquired during the PET scan. These are transformed during the reconstruction process into a three dimensional voxel set. The result can then be visualized by creating various two dimensional cuts through the volume, as shown in Figure 7.4 for a whole body data set, or by using sophisticated volume rendering algorithms [57, 220, 71]. In both cases the resulting images enable a diagnosis based on the PET image taken from the patient in treatment, e.g. defining the extent of a tumor or the number of metastases.

7.2.1 Iterative PET Reconstruction

The traditional way to accomplish the reconstruction of PET images is the so–called Filtered Back Projection (FBP). This method simply projects the acquired sinogram data into the two dimensional image space in a straightforward and direct manner. While this approach can be implemented without much complexity and has therefore been used in most commercial systems around the world, it has the distinct disadvantage of a low image quality with many artefacts, especially in areas with very high tracer concentrations. This can be seen on the left side of Figure 7.5.

In recent years, research has therefore focused on new algorithms using an iterative approach [132, 54], which produce images of much higher quality. In these methods, the reconstruction starts with a first approximation of the final image; often the result of an FBP is used for this purpose. From this image, the corresponding scanner output (in projection space) is computed using a forward projection. The resulting sinogram data is compared with the data acquired from the actual PET scan and the image approximation is adapted accordingly. This process is repeated until the projected sinogram data has become sufficiently close to the real data, at which time the image approximation is accepted as the final reconstructed image.
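The overall structure of such an iterative scheme can be sketched as follows; all types and routines are generic placeholders, since the concrete forward projection and update rules depend on the algorithm selected from the reconstruction library.

    /* Generic skeleton of an iterative reconstruction for one image plane. */
    typedef struct image    image_t;     /* voxel data of one plane   */
    typedef struct sinogram sinogram_t;  /* projection-space data     */

    extern void        filtered_back_projection(const sinogram_t *measured, image_t *img);
    extern sinogram_t *alloc_sinogram_like(const sinogram_t *measured);
    extern void        free_sinogram(sinogram_t *s);
    extern void        forward_project(const image_t *img, sinogram_t *proj);
    extern double      distance(const sinogram_t *measured, const sinogram_t *proj);
    extern void        update_image(image_t *img, const sinogram_t *measured,
                                    const sinogram_t *proj);

    void reconstruct_plane(const sinogram_t *measured, image_t *img,
                           double tolerance, int max_iterations)
    {
        /* initial approximation, e.g. the result of a filtered back projection */
        filtered_back_projection(measured, img);

        sinogram_t *proj = alloc_sinogram_like(measured);
        for (int it = 0; it < max_iterations; it++) {
            forward_project(img, proj);                 /* image -> projection space */
            if (distance(measured, proj) < tolerance)   /* close enough to the scan? */
                break;
            update_image(img, measured, proj);          /* adapt the approximation   */
        }
        free_sinogram(proj);
    }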

Figure 7.5 Comparing the image quality of FBP (left) and an iterative reconstruction (right), both with tracer F18 labeled glucose analog (FDG).

These iterative approaches provide a significantly better image quality, as can be seen on the right side of Figure 7.5. Compared to the outcome of the FBP, this image contains fewer artefacts and less noise and provides new anatomical details not visible before. This improved quality comes from the fact that during the forward projection process additional information about the scanner, like scanner geometry and background data, can be taken into account. In addition, this method allows the discretization of the acquired data to be taken into account [166].

The drawback of these approaches, however, is their high computational demand stemming from the iterative concept. This renders them unsuitable for current stand–alone workstations; only parallel processing is able to provide the performance required for these iterative image reconstruction algorithms to succeed.

7.2.2 Parallel Implementation

The parallelization of this application used in this work is based on the reconstruction application developed at the Klinikum Rechts der Isar [166, 211]. It uses a special reconstruction library called ASPIRE [54], which has been designed and implemented at the University of Michigan and supports several reconstruction algorithms with different forward projection and image approximation methods.

This library performs the reconstruction of the PET image independently for each image plane, i.e. each sinogram plane is transformed into an image plane in the final image and this process is executed for each input image plane. This property can be used for a straightforward parallelization of this application in an SPMD style. First a master thread reads the complete input data set into a piece of global memory. Each thread is then assigned an equal number of planes to reconstruct. For each of these planes, the corresponding thread reads the input data from the global memory, performs the reconstruction

Description   Whole body
Size          130 Mbyte
Planes        282
Scan Res.     256x192
Image Res.    128x128
Scans         6

Table 7.1 Parameters for the data set used in the experiments.

using the ASPIRE library, and then stores the resulting image plane again in global memory. After all threads have completed their tasks, the final image is read from the global memory by the master thread and stored to disk.

The implementation of this concept is done using the SPMD programming model implemented on top of HAMSTER, discussed in Chapter 6.2 and Appendix B. In addition, one slight optimization has been performed regarding the distribution of the input data set in the global memory. It is distributed in a way that the data for an image plane is mostly colocated with the thread to which the plane is assigned. As a result, during the I/O phase the raw data is directly transferred to the node responsible for its further processing using the more efficient remote writes, instead of having to rely on read operations during the reconstruction phase. This optimization is performed using only the locality annotations of the memory management module by applying a block cyclic data distribution with block sizes equal to the size of an input plane; no further code changes were necessary.
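The structure of this parallelization can be sketched as follows; the spmd_* calls, the global allocation routine, and the reconstruction entry point are placeholders for the HAMSTER SPMD services and the ASPIRE library call, and the cyclic plane-to-thread assignment shown is only illustrative of the actual distribution.

    #include <stddef.h>

    /* Placeholder declarations for the SPMD services and the library call. */
    extern int   spmd_my_rank(void);                 /* global thread rank      */
    extern int   spmd_num_threads(void);             /* total number of threads */
    extern void  spmd_barrier(void);                 /* global barrier          */
    extern void *global_alloc(size_t bytes);         /* shared, distributed mem */
    extern void  read_sinograms(void *dst);          /* sequential file input   */
    extern void  write_images(const void *src);      /* sequential file output  */
    extern void  reconstruct_plane_aspire(const void *sino, void *image, int plane);

    void reconstruct_all(int nplanes, size_t sino_plane_bytes, size_t img_plane_bytes)
    {
        int rank = spmd_my_rank();
        int nthr = spmd_num_threads();

        /* global memory for input and output data                        */
        char *sinos  = global_alloc((size_t)nplanes * sino_plane_bytes);
        char *images = global_alloc((size_t)nplanes * img_plane_bytes);

        if (rank == 0)
            read_sinograms(sinos);      /* master reads the complete data set */
        spmd_barrier();

        /* every thread reconstructs its share of the planes independently */
        for (int p = rank; p < nplanes; p += nthr)
            reconstruct_plane_aspire(sinos  + (size_t)p * sino_plane_bytes,
                                     images + (size_t)p * img_plane_bytes, p);

        spmd_barrier();
        if (rank == 0)
            write_images(images);       /* master stores the final volume     */
    }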

7.2.3 Performance Discussion

For the evaluation of this application on top of HAMSTER, an example data set containing the scanned information of a whole human body is used. It was acquired during several consecutive scans of the different body sections, which are then merged into one full image. Its complete size and resolution are shown in Table 7.1, and some resulting image slices from a reconstructed volume are shown in Figure 7.6.

The code has been executed on up to 6 nodes with up to 2 threads each. Figure 7.7 shows the results of this experiment in terms of speedup compared to a sequential execution without HAMSTER. It shows that this code exhibits good scaling properties with a speedup of more than 5 on 6 nodes with 1 thread each and of more than 8 on 6 nodes with 2 threads each. In addition, the curves are fairly linear and show only a slight degradation with higher numbers of nodes. Due to this, it can be expected that this application will also scale beyond the 12 CPUs available in the SMiLE system. It can also be observed that the performance of configurations with one thread per node is slightly better than of those with two threads (assuming an equal total number of threads). This is due to the fact that the reconstruction algorithm imposes high demands on the system bus, leading to contention when two compute threads are present on one node.

In any case, a significant part of the total overhead is caused by the file I/O required to

Figure 7.6 Reconstructed images — data set whole body (before proper alignment of neighboring scans and without attenuation correction; the borders between the 6 individual scans are still visible).


Figure 7.7 Speedup of the PET image reconstruction on up to 12 CPUs.

deal with the rather large input and output data sets. This phase is currently implemented in a sequential fashion and hence limits the possible overall speedup [4]. This deficiency, however, is induced neither by the shared memory programming paradigm nor by the HAMSTER system. Therefore, Figure 7.7 also includes a speedup curve without the impact of this phase, which allows the separate assessment of the performance within the actual computational phases. It shows that the actual computation exhibits a speedup of close to 10 and therefore excellent scaling capabilities. For a further in–depth discussion of these results, Figure 7.8 shows the time spent during the execution of the application split into the four major phases: reconstruction, weights


Figure 7.8 Aggregated execution times for the PET image reconstruction (left: 1 thread/node, right: 2 threads/node).

(a necessary preprocessing step), final barrier to ensure global completion, and file I/O. All times shown are aggregated across all threads and hence represent the complete amount of work performed during the execution. This data shows that the actual reconstruction phase is fully scalable across all numbers of nodes, as can be expected due to the independent computation of the individual planes. In the case of just one thread per node (left side of the figure), the reconstruction times are even slightly improved due to the increased memory bandwidth and cache size available in the overall system. In a two thread scenario, on the other hand, the reconstruction time is burdened with a constant overhead caused by the memory contention within the SMP nodes. In addition, the computational load is well balanced across all nodes, resulting in low barrier times. The actual overheads associated with the code are introduced by the weights and the I/O phase. In its current version, both are purely sequential and hence the total aggregated work increases linearly with the number of threads. While this impact cannot be avoided with regard to the weights phase, as this is an integral part of the overall application, the I/O phase could be further optimized, e.g. by using parallel I/O.

7.3 Spectral Analysis of PET Images

Once this first step of image reconstruction is completed, further analysis steps can be taken in order to enable a more detailed measurement of physical parameters. One of them is the spectral analysis of a series of images over time [42, 224]. Again, this is a computationally quite challenging task requiring the use of parallel processing.

Figure 7.9 Sample slice of a human head, representing the impulse response function at 60 min after injection, tracer: dopamine receptor ligand C–11 diprenorphine.

7.3.1 Application Description

Spectral Analysis is a new and innovative analysis procedure for PET images first introduced by [42]. It can be used to compute a transitional function, the so–called Impulse Response Function (IRF), which allows the evaluation of a tracer's concentration within the patient's body over time (e.g. for up to 60 minutes, depending on the tracer). It takes a series of PET scans of the same body part at different times as input, together with a separately measured curve describing the concentration of the tracer in the patient's bloodstream, and uses this information to compute the IRF. Of special importance for a further medical diagnosis based on the acquired information are the values of the IRF at 1 minute and at 60 minutes as well as an approximation of the integral of the curve with respect to time. These three values are extracted from the IRF at the end of the computation for each image point and stored as the final result. One of the strengths of this approach is that the IRF is computed for each pixel independently and hence enables a fine granularity and a high resolution for the resulting data. Consequently, the computation is performed independently on each series of pixels at a particular location in the 3D volume over time, and the three resulting IRF values are again stored as new 3D volumes. Like above, the resulting volumes can be viewed using volume rendering techniques. An example for this can be seen in Figure 7.9.

7.3.2 Parallel Implementation

Like the PET reconstruction, this code can also be implemented using a data parallel approach. However, the finer data granularity required by the algorithm enables the splitting of individual planes, rather than distributing whole planes as in the PET reconstruction code. Like above, a master thread is responsible for reading the input data for each plane and storing it in global memory. This data consists of the individually reconstructed image planes from all scans acquired during the study. Each node then works on its own part of each image plane and stores its contribution in a piece of global memory. This leads to a finer granularity of the parallelization and hence to a potentially more balanced parallel execution with respect to system load. The code used for the following experiments is based on an algorithm developed at the MRC Cyclotron Unit of the Hammersmith Hospital, London and has already been implemented at the Klinikum Rechts der Isar as a shared memory application [224] using the SW–DSM system TreadMarks [5]. The port to HAMSTER is therefore easily accomplished by using the TreadMarks API implemented on top of HAMSTER. The code just has to be relinked with this implementation of the TreadMarks API in order to obtain a final executable [199]. Furthermore, in order to maintain the transparency of the underlying TreadMarks API, no optimizations like locality annotations or custom consistency models have been applied. The experiments conducted using this code are therefore a good indication of the ability to easily move an existing application code to the HAMSTER platform by providing the corresponding programming model.
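To illustrate the overall structure of such a TreadMarks–based code (this is not the actual Hammersmith/Klinikum implementation), the following sketch uses the usual TreadMarks calls that the HAMSTER–based re–implementation of this API provides; compute_irf(), the data layout, and the interleaved pixel assignment are placeholders.

/* Sketch only: SPMD-style spectral analysis on top of the TreadMarks API.
 * compute_irf(), the data layout, and the pixel assignment are illustrative.
 */
#include <stdio.h>
#include "Tmk.h"                  /* TreadMarks API header */

#define SLICES   47
#define PIXELS   (128*128)
#define SAMPLES  25

extern void compute_irf(float *tac, int n, float *irf1, float *irf60, float *integ);

float *input;      /* shared: SLICES x PIXELS x SAMPLES time-activity curves */
float *result;     /* shared: 3 x SLICES x PIXELS output volumes             */

int main(int argc, char **argv)
{
    int s, p;

    Tmk_startup(argc, argv);
    if (Tmk_proc_id == 0) {
        input  = (float *) Tmk_malloc(SLICES * PIXELS * SAMPLES * sizeof(float));
        result = (float *) Tmk_malloc(3 * SLICES * PIXELS * sizeof(float));
        Tmk_distribute((char *) &input,  sizeof(input));
        Tmk_distribute((char *) &result, sizeof(result));
        /* ... master reads all reconstructed planes into 'input' ... */
    }
    Tmk_barrier(0);

    for (s = 0; s < SLICES; s++)                               /* every plane */
        for (p = Tmk_proc_id; p < PIXELS; p += Tmk_nprocs)     /* my pixels   */
            compute_irf(&input[(s * PIXELS + p) * SAMPLES], SAMPLES,
                        &result[0 * SLICES * PIXELS + s * PIXELS + p],
                        &result[1 * SLICES * PIXELS + s * PIXELS + p],
                        &result[2 * SLICES * PIXELS + s * PIXELS + p]);

    Tmk_barrier(0);
    if (Tmk_proc_id == 0) {
        /* ... master writes the three result volumes to disk ... */
    }
    Tmk_exit(0);
    return 0;
}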

7.3.3 Performance Discussion

For all measurements a single real–world data set has been used. Its data was acquired by a sequence of PET scans of a human head, taken over time in 25 samples. Each sample consists of 47 individual slices of 128x128 pixels. The total raw size of the input data is roughly 38.5 Mbyte. After the computation, three 3D images are written back to disk, consisting again of 47 individual slice images at a resolution of 128x128 pixels. The total output volume per written image is roughly 1.5 Mbyte. The code has been evaluated using the same scenarios as with the PET reconstruction described above. Figure 7.10 shows the resulting speedups over the sequential execution without HAMSTER. As above, the graph includes data for both the full application and the computational phases only, without the impact of the sequential file I/O. The data shows good scaling properties with a speedup of more than 8.5 on 12 CPUs for the complete application. In contrast to the situation above, the performance with one thread per node is slightly lower than with two threads per node (again assuming the same total number of threads). This stems from the fact that the Spectral Analysis has only limited memory bandwidth requirements and hence does not saturate the system bus, even with two threads per node. Due to its fine grain access granularity, it rather benefits from the low read latencies of intra–node shared memory, which leads to the observed performance benefits. This is also visible from Figure 7.11, which shows the aggregated execution times across all participating threads split into the main program phases: the actual work performed during the spectral analysis, which is assumed to be equal across all configurations, the overhead induced by using NUMA-DSM, computed as the difference between the measured execution time and the actual work, the file I/O phase, and the waiting time during barriers. The graph shows that the overhead times, which are caused by both implicit NUMA communication and consistency enforcing mechanisms and hence introduced through the HAMSTER framework, are generally slightly lower in the cases with two threads per node.

Figure 7.10 Speedup of the Spectral Analysis on up to 12 CPUs.

In addition, the figure shows that the overhead values vary across the different numbers of nodes, which can be attributed to the different memory layouts created by the transparent memory management of the SCI-VM. This can inherently influence the overall data locality and thereby also the overall performance. The two remaining phases, the barrier and the I/O, behave as expected. The aggregated barrier time is rather low across all configurations due to only small load imbalances; with rising numbers of nodes, however, this impact increases slightly. The file I/O phase is executed during a sequential piece of code and its execution time therefore increases linearly with the number of utilized threads.

Figure 7.11 Aggregated execution times for the Spectral Analysis (left: 1 thread/node, right: 2 threads/node).

7.4 Summary

This chapter has shown the applicability of the HAMSTER framework in a real–world scenario. Two large scale applications from the area of nuclear medical imaging, the reconstruction of PET images and their spectral analysis, have successfully been ported to HAMSTER and have exhibited good performance and scalability properties. The performance is mainly hindered by external circumstances, most prominently I/O: both applications work on large amounts of data, resulting in a significant amount of time spent during this particular phase. The parallelization processes of these applications have been fundamentally different and show the two main usage scenarios of HAMSTER. The PET image reconstruction software has been parallelized from scratch, enabling the utilization of a new and adapted API, in this case the SPMD programming model. This has also allowed the use of special optimization mechanisms provided by HAMSTER which are generally not exposed by many existing programming models.


The spectral analysis application, on the other hand, has already existed as a shared memory application using the TreadMarks [5] system. Its port to HAMSTER was therefore facilitated by the existence of this particular programming model on top of HAMSTER and was accomplished by simply recompiling and linking with the HAMSTER–based API. This shows that the HAMSTER system is capable of supporting both new and existing APIs on top of a single framework and hence eases both the parallelization of new applications and the porting of existing shared memory codes. This was one of the major design goals behind the HAMSTER framework. These promising results encourage the work on further large scale applications. Potential targets will be further applications from the area of nuclear medical imaging, like image realignment procedures or statistical image evaluation algorithms, as well as applications from the area of computer graphics, more specifically volume rendering [228]. The latter application domain poses many problems suitable for a shared memory parallelization, as large data sets are randomly accessed by the different threads. They can also be combined with the already discussed medical imaging applications as an additional post–processing step enabling a high–quality visual presentation of the processed data.

Chapter 8

Conclusions and Outlook

The widespread use of shared memory programming for High Performance Computing (HPC) is currently hindered by two main factors: the limited scalability of architectures with hardware support for shared memory and the abundance of existing programming models. The former issue has sparked research in the area of DSM systems. These deploy complex software mechanisms to create a global memory abstraction in more scalable NORMA environments. However, they are often inherently connected with performance problems and, in addition, do not solve the second issue. In contrast, the large amount of work done in the DSM area has led to the development of a significant number of independent systems, each with its own API, thereby further worsening the situation. In order to fully solve these problems, a comprehensive framework for shared memory programming on top of loosely coupled systems is required. In addition, architectures with a limited form of hardware support for shared memory, which does not destroy the scalability or cost advantages, should be the target for such a framework. This support can then be used to avoid the typical performance problems of pure software implementations. A framework fulfilling these criteria, called HAMSTER (Hybrid-dsm based Adaptive and Modular Shared memory archiTEctuRe), has been designed, implemented, and evaluated within this work. It benefits from the advantages of loosely coupled non–CC–NUMA architectures, i.e. scalability and cost effectiveness, and exploits their hardware support to enable the low–complexity implementation and efficient use of almost any shared memory programming model. The complete system is currently targeted towards SCI–based clusters, but its principal concepts are applicable to any other NUMA architecture. It thereby provides a general approach capable of filling this gap in the overall software infrastructure for loosely–coupled NUMA–based systems. The core component of the HAMSTER framework is an efficient Hybrid–DSM system which is capable of closing the semantic gap between the global physical memory provided by the underlying architecture and the global virtual memory required for shared memory programming. It relies directly on the given hardware support for any communication and combines this with a software component responsible for creating the appropriate memory mappings as part of a cluster–wide memory management scheme. The finer granularity of the underlying hardware memory accesses thereby helps to avoid typical problems seen in SW–DSM systems like false sharing or complex software implementations of differential update protocols. It therefore allows the full exploitation of the hardware capabilities present in the underlying architecture.

Based on this Hybrid–DSM system, the HAMSTER framework defines and implements several independent and orthogonal management modules. This includes separate modules for memory, consistency, synchronization, and task management as well as for the control of the cluster and the global process abstraction. Each of these modules offers typical services required by implementations of shared memory programming models. Combined they form the HAMSTER interface which can then be used to implement the intended shared memory programming models without much effort. This capability has been proven through the implementation of a number of selected shared memory programming models on top of the HAMSTER framework. These models range from transparently distributed thread models all the way to explicit put/get libraries and also include various APIs from existing SW–DSM systems with different relaxed consistency models. It therefore covers the complete spectrum of shared memory programming models and underlines the broad applicability of this approach. The presented concepts have been evaluated using a large number of different benchmarks and kernels exhibiting the performance details of the individual components. In addition, HAMSTER was used as the basis for the implementation or port of two real–world applications from the area of nuclear medical imaging, more precisely the reconstruction of PET images and their spectral analysis. These experiments cover both the porting of an already existing shared memory application using a given DSM API and the parallelization of an application from scratch using a new, customized API. In both cases, the system provided an efficient platform resulting in a very scalable execution. With the completion of the prototypical implementation and extensive evaluation presented in this thesis, the work on the HAMSTER system will not be abandoned. It has sparked many new ideas and offers the potential for further research. Two of the potential future directions are briefly presented within this chapter, along with a discussion of the overall system's applicability to current hardware implementations. The chapter concludes with a few final remarks on the future potentials of HAMSTER.

8.1 Applicability

The HAMSTER framework in its current state is a fully working prototype system. All mechanisms are implemented and have been demonstrated in the various experiments throughout this work. However, the description of the HAMSTER system has also shown that the current system still has to cope with some shortcomings, especially with regard to the SCI-VM, as described in Chapter 4.8. Because of these open issues, the HAMSTER system is at the moment not yet in a state to serve as the basis of production–type environments. For this, both the operating system integration and the transparency of the underlying hardware need to be improved. Only then can a stable and reliable environment required for real–world application scenarios be achieved. Despite these problems, however, the HAMSTER framework represents a general and open multimodel shared memory programming environment for NUMA systems. Its concepts are applicable far beyond its current implementation on top of SCI and can in principle be ported to any NUMA–type architecture without major changes. It therefore has the potential to generally enable shared memory programming on the rising class of NUMA–based systems.

8.2 Future Directions

Besides the removal of the implementation challenges mentioned earlier and the transformation of the HAMSTER framework into a safe and reliable production environment both on SCI and on other NUMA architectures, the system offers several opportunities for further basic research. The two most promising ones among them are briefly discussed below.

8.2.1 Towards a Cluster–enabled Operating System

Shared memory programming, as defined for this work, is associated with a single global address space and therefore also requires a global process abstraction. Within HAMSTER, this abstraction is created by the cluster control module, which merges the individual teams on each node into a global entity. This forms a suitable environment for the implementation of shared memory programming models, which is the goal of this work. The global process abstraction, however, is currently not complete. It covers only the creation of a single global address space enabling the transparent placement and location independent execution of threads within this global abstraction. Other operating system services, which are normally also transparently available throughout a process, are not covered and need to be explicitly considered when porting codes to HAMSTER–based environments. The most severe among them is the missing transparency of I/O. Due to the distinct operating system instances, this is still handled node–local. In order to overcome this deficiency, a transparent, global I/O framework needs to be developed and coupled with HAMSTER. Such a framework needs to be capable of transparently intercepting I/O operations that are intended for other nodes and forwarding their execution to the appropriate target. This greatly reduces the semantic gap between the individual operating system instances on the different nodes and in fact represents a first step towards a cluster–aware, global operating system. This turns clusters, normally loosely coupled with regard to both hardware and software, into single machines with a single operating system instance, thereby providing users with an easy–to–comprehend single system image. Nevertheless, the beneficial properties with regard to scaling and cost efficiency of the underlying architectures are preserved, making such systems a potentially good trade–off between scalable hardware and easy–to–use software.
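As a purely illustrative sketch of such an interception layer (none of the function names below exist in HAMSTER; they merely mark where the lookup and forwarding of a real implementation would hook in), a forwarded write operation could look as follows:

/* Illustrative only: forwarding a write() whose file descriptor belongs to
 * another node. my_node(), file_home_node(), and forward_write() are
 * hypothetical placeholders for the lookup and RPC layer such a global
 * I/O framework would have to provide.
 */
#include <unistd.h>
#include <sys/types.h>

extern int     my_node(void);
extern int     file_home_node(int fd);
extern ssize_t forward_write(int node, int fd, const void *buf, size_t len);

ssize_t global_write(int fd, const void *buf, size_t len)
{
    int home = file_home_node(fd);

    if (home == my_node())
        return write(fd, buf, len);           /* local case: plain system call  */

    return forward_write(home, fd, buf, len); /* remote case: ship to the owner */
}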

8.2.2 A Tool Environment for On–line Monitoring

The various experiments throughout this work have shown that some applications require incremental performance tuning in order to show a significant speedup. Specifically the locality of memory accesses plays an important role in the overall performance. Therefore, the memory management, as it has been described in Chapter 5.1, includes mechanisms to specify the memory distribution at allocation time using locality annotations. While this approach has proven to be useful in optimizing applications, it is burdened with two inherent disadvantages: for one, the user has to add the annotations to the codes by hand (often destroying the full transparency of the underlying programming model) and secondly the annotations are static and do not allow a dynamic adaptation to irregular memory access patterns. To avoid these problems, it would be desirable to manage data locality dynamically through an appropriate adaptive runtime system. Work in this direction has already been started within the SMiLE project with the goal of developing a comprehensive on–line monitoring infrastructure for shared memory programming environments. On the hardware side, a special monitor capable of observing the memory traffic on the SCI network is under development [78]. The data acquired through this device will then be fed into a monitoring software stack that is designed to fit the HAMSTER structure [111] and is built within OMIS [144], a general–purpose infrastructure for interoperable on-line monitoring and tool support. From there, data is made available to sophisticated higher–level tools, like a visualizer or a page and thread migration mechanism [217, 110].

8.3 Final Remarks

This work has provided a comprehensive overview on how to use shared memory programming in NUMA–based architectures, emphasizing the efficient exploitation of the hardware capabilities present in these architectures. With the introduction of a novel Hybrid–DSM system and with its proposal for a general, open, and flexible shared memory framework capable of supporting almost any arbitrary shared memory programming model, it has opened a new field of research and has sparked new ideas. The list above briefly discusses some of the major ones among them. Follow–up work in several directions has already begun and will be continued in the future within the context of the SMiLE project. However, the success beyond pure research and the applicability of the overall system is currently severely limited by the missing hardware transparency in the underlying architecture, which cannot be compensated in software. This restricts the system to a prototypical study and also prevents any production or commercial use, limiting the impact of this work. Nevertheless, the results achieved during this work are very promising and show both the principal feasibility of the concepts presented and the good prospects for their future use. In order to fully exploit them, however, the system needs to be moved to a more stable and reliable hardware platform in order to evolve into a mature state. This can mean either the use of future developments in the area of PC–based SCI clusters leading to error free hardware implementations or the port of the system presented here to other NUMA machines not suffering from these problems. Under these circumstances, the HAMSTER framework has the chance to fully exploit its potential and to significantly promote shared memory programming, even for loosely coupled architectures.

Appendix A

The HAMSTER Execution Environment

The following section provides a brief overview on how to install and use the HAMSTER framework. The description is based on a binary distribution package containing the binary version of the HAMSTER framework, the required tools, and the SPMD programming model in source. This package is available from the SMiLE software repository1.

A.1 System Requirements

The current distribution is only available for Linux clusters. The individual cluster nodes have to be standard PCs (single or dual processors) and have to be equipped with Intel Xeon–II(TM) CPUs or compatible. The latter is required in order to provide the necessary capabilities for a cache configuration at page granularity. All nodes have to be connected with both Ethernet and SCI. For the latter, currently only the D320 adapter card (Dolphin firmware) is tested with the current distribution. The SCI system has to be configured as either a single ring or via a switch. Torus topologies are not supported. On the software side, the HAMSTER distribution requires Linux with a 2.2 kernel (only 2.2.5 is tested, but others might work). It has been developed and tested with a SuSE 6.4 installation, but again others might work. In any case, all nodes must be installed with exactly the same configuration with respect to installed libraries. In addition, it is beneficial to provide a cluster–wide shared file system as a basis for the execution of HAMSTER codes. Before proceeding with the installation of HAMSTER, the SCI network, the low–level drivers, and the SISCI API have to be installed, configured, and tested as described by Dolphin. In addition, the binutils package needs to be installed in source and compiled.

A.2 HAMSTER Directory Structure

In order to install HAMSTER, the package needs to be extracted using tar. This creates a new directory containing the following subdirectories:

1 http://smile.in.tum.de/software/

- bin: This directory contains the necessary tools required to run HAMSTER. This includes the patch tools needed for the identification of the static application data.

- include: This directory contains all include files needed for both the creation of programming models and for the compilation of applications making direct use of HAMSTER. This includes any application using the extended timing facilities of HAMSTER.

- kernel: The HAMSTER system is based on a few kernel extensions. These are contained in this directory together with the required driver for the new memory management (scivm.o).

- lib: In this directory the actual HAMSTER binaries are stored, which need to be linked to the final applications.

- make: This directory contains a set of makefiles and configuration files easing the compilation process of both programming models and applications.

- model: Here all programming models developed for HAMSTER should be stored. In the original directory structure, this directory contains a subdirectory for the SPMD model and its example files.

- sisci: The HAMSTER system requires some extensions to the Dolphin drivers. Therefore, this directory contains a new set of drivers and a new API with these extensions integrated.

A.3 Installing the HAMSTER Environment

Due to the tight integration with the underlying operating system, the HAMSTER framework requires an extension to the Linux kernel. An appropriate patch, together with instructions on how to apply it, is included in the kernel subdirectory. As a next step, the SCI driver structure needs to be replaced in order to include the necessary SCI driver extensions required for the direct ATT management and interrupt support. For this purpose, the installed SCI drivers and the SISCI API need to be replaced with the appropriate files in the sisci subdirectory. Besides the new Dolphin drivers, a special memory management driver allowing the system to implement direct page maps also needs to be installed. The appropriate binary is included in the kernel directory along with a script which creates the necessary device files in /dev/scivm.

The installation of the HAMSTER system is completed by adjusting the path information in make.base contained in the make directory. More information on this is included in this file itself.

A.4 Linking Against HAMSTER

Any application running on top of a HAMSTER–based programming model needs to link against the libraries in the lib directory. In addition, both the SISCI API library and the binutils library need to be included in the list of libraries. Any link should be done statically in order to ensure a consistent data layout across all nodes. An example of how a programming model is compiled and how a sample application is compiled and linked is included in the distribution, based on the SPMD programming model. It can be found in the model/ directory. There, the appropriate makefiles can be found which can be used directly to compile both the SPMD programming model and the sample application. For this purpose, only make needs to be started without any further arguments. These makefiles can also be used as the basis for further developments.

A.5 Running Applications

Before starting a HAMSTER–based application, first the cluster configuration needs to be specified. For this purpose, the user has to create a .hamster file in their home directory. This ASCII file should contain a line for each node to be used for the application. This text line has to contain the DNS name of the node, the SCI node ID, and the scale of the node (number of threads suitable for this node), each separated by a space. A sample configuration file can be found in the HAMSTER main directory. After this step, the application is ready to run. For this, it has to be started separately on each node contained within the configuration file. On nodes listed twice within the configuration file, two copies of the application need to be started. The first node mentioned in the configuration has the role of a master during the initialization and therefore should be started last. As soon as all copies of the application have been started, the master initiates the initial communication between the individual application instances and merges them into a single global process abstraction.
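As an illustration, a .hamster file for a three node configuration following the format described above could look as follows; the host names and SCI node IDs are made up and have to be replaced with the values of the actual cluster:

node1.mycluster.org 4 2
node2.mycluster.org 8 2
node3.mycluster.org 12 1

The last column corresponds to the scale of each node, i.e. the number of threads to be started there.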

Appendix B

SPMD: A Sample Programming Model on Top of HAMSTER

One of the programming models supported by the HAMSTER system is the SPMD (Single Program Multiple Data) model introduced in Chapter 6.2. As this is one of the leanest and simplest models, it is used here as an example of a HAMSTER programming model. It is also included in the binary distribution of the HAMSTER system available from the SMiLE software repository1. The following chapter will introduce the specification of this programming model and provide a short guide on how to use it.

B.1 API Description

In the following, the specification of the SPMD API available to the programmer is briefly described.

Initialization and error control

The programming model uses an automatic initialization that is triggered during the first call to any routine. At this point, any configuration of the individual HAMSTER modules used by the SPMD programming model is executed implicitly. Therefore, the user is not required to explicitly ensure proper initialization. In order to keep this programming model simple and suited for small experiments with the HAMSTER system, it also contains only a very simple error management. None of the routines report errors; any error rather leads to an abortion of the program. As any potential error that could be triggered by this programming model is either a fatal error caused by problems within the HAMSTER system, which would leave the application in a useless state, or a programming/parameter error, this simplification in error management does not significantly restrict the usability of the programming model.

1 http://smile.in.tum.de/software/

Global process abstraction control

spmd_getNodeNum
Call: unsigned int spmd_getNodeNum()

This routine returns the rank of the currently running thread within the current global process abstraction. This number can then be used to control the work or task distribution or restrict certain tasks, like I/O or application initialization, to only specific threads.

spmd_getNodeCount
Call: unsigned int spmd_getNodeCount()

This routine returns the total number of currently running threads within the current global process abstraction. Together with the rank of the local thread, this number can then be used to control the work or task distribution.
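A minimal sketch of how these two routines are typically combined for a static work distribution; N and process_plane() are placeholders and not part of the SPMD API:

/* Sketch: statically dividing N independent items among all threads. */
unsigned int me    = spmd_getNodeNum();
unsigned int count = spmd_getNodeCount();
int i, first, last, chunk;

chunk = (N + (int) count - 1) / (int) count;   /* items per thread, rounded up */
first = (int) me * chunk;
last  = (first + chunk < N) ? first + chunk : N;

for (i = first; i < last; i++)
    process_plane(i);                          /* placeholder for the real work */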

Barrier synchronization

spmd_allocBarrier
Call: int spmd_allocBarrier()

This routine can be used to allocate a barrier. It has to be called by all threads in the system concurrently. The identifier for the newly allocated barrier is returned as an integer and can be used in subsequent spmd_barrier calls.

spmd_barrier
Call: void spmd_barrier(int barrierid)

This routine triggers a full barrier of all participating threads within the current global process abstraction. This means that every thread has to call this routine with the same argument before any thread is allowed to return from this routine and therefore to continue execution. This routine has to be passed an argument containing a valid barrier identifier previously allocated using spmd_allocBarrier. It has to be noted that a barrier includes both an acquire and a release operation (see Chapter 6.3.3), i.e. the enforcement of full local consistency. As a barrier call in this programming model involves all threads, it therefore also guarantees a fully consistent memory across the complete global process abstraction.

Lock synchronization

spmd_allocLock
Call: int spmd_allocLock()

This routine can be used to allocate a lock. It has to be called by all threads in the system concurrently. The identifier for the newly allocated lock is returned as an integer and can be used in subsequent spmd_lock and spmd_unlock calls.

spmd_lock
Call: void spmd_lock(int lockid)

This routine executes a lock operation. In case the lock is already taken by another thread, the call does not return until the other thread releases the lock and the current thread becomes the new owner. The routine has to be passed an argument containing a valid lock identifier previously allocated using spmd_allocLock. It has to be noted that a lock includes an Acquire operation (see Chapter 6.3.3), which, when used together with the corresponding Release operation at unlock time, creates a Release Consistency (RC) [116] model providing consistent data access under the mutual exclusion of the used lock (see also Chapter 6.3.3).

spmd_unlock
Call: void spmd_unlock(int lockid)

This routine unlocks a lock held by the local thread, making it available again to other, potentially already waiting threads. The routine has to be passed an argument containing a valid lock identifier previously allocated using spmd_allocLock. It has to be noted that an unlock includes a Release operation (see Chapter 5.2), which, when used together with the corresponding Acquire operation at lock time, creates a Release Consistency (RC) [116] model providing consistent data access under the mutual exclusion of the used lock (see also Chapter 6.3.3).
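A minimal usage sketch of the lock routines; the shared counter is only an example and is assumed to reside in global memory obtained with one of the allocation routines described below:

/* Sketch: protecting a shared counter in global memory with an SPMD lock. */
int *counter = (int *) spmd_alloc(sizeof(int));   /* collective allocation              */
int lockid   = spmd_allocLock();                  /* collective lock creation           */

spmd_lock(lockid);        /* Acquire: obtain a consistent view of the data              */
(*counter)++;             /* update under mutual exclusion                              */
spmd_unlock(lockid);      /* Release: make the update visible to the other threads      */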

Additional synchronization

spmd_sync
Call: void spmd_sync()

This routine enforces the consistency of the complete local memory by implicitly calling both an Acquire and a Release as discussed in Chapter 5.2. Note that this routine only affects the local memory and does not have a global effect. Note further that the SPMD programming model discussed here already includes implicit consistency management in the barrier and lock routines. Therefore, the use of this routine should normally not be necessary.

Memory management

spmd_alloc
Call: void* spmd_alloc(int size)

The spmd_alloc routine can be used to allocate new global virtual memory for the application. The allocation process is conducted using the default settings for memory coherency and distribution. This results in a piece of memory with a maximum of memory optimizations enabled, distributed across all nodes at the finest possible granularity. It has to be called by all threads in the system concurrently and returns the same virtual address of the newly allocated memory to all nodes.

spmd_allocOpt
Call: void* spmd_allocOpt(int size, localitySpec_t *locSpec)

Like the routine above, this routine also allows the allocation of global memory. In contrast to above, however, the user has the option to specify parameters for guiding the allocation process. These parameters make it possible to influence the memory distribution as described in Chapter 5.1. They can be set through the localitySpec_t structure, which is passed to the routine as the second parameter. More information about the individual fields within the structure and on how to use them can be found in the SPMD header files.

spmd_allocCoh
Call: void* spmd_allocCoh(int size, coherencySpec_t *cohSpec)

This routine also allows the allocation of global memory, however, with the option to specify a memory coherency type for the newly allocated memory (using the coherencySpec_t structure). More information about the individual fields within the structure and on how to use them can be found in the SPMD header files.

spmd_allocOptCoh
Call: void* spmd_allocOptCoh(int size, localitySpec_t *locSpec, coherencySpec_t *cohSpec)

This routine combines the memory allocation functionality of the two previous routines and allows both locality and coherency specifications.

spmd_getGlobalMem
Call: unsigned int* spmd_getGlobalMem()

The last routine of the SPMD programming model allows the allocation of global memory from within the SCI-VM configuration space. This memory is always set to the highest possible memory coherence type and can serve for simple configuration and communication purposes. Strictly speaking, the existence of this routine is not required, as spmd_allocOpt also allows the allocation of memory with the same coherence type. Using this routine, however, no new memory needs to be allocated, as resources already in use within the SCI-VM are reused.
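The following lines sketch how the allocation routines described above are typically called; passing NULL for the locality specification, as in the sample code of Section B.2, selects the default distribution, and the sizes are arbitrary examples.

/* Sketch: obtaining global memory with the routines above. */
float *a = (float *) spmd_alloc(1024 * sizeof(float));            /* default settings        */
float *b = (float *) spmd_allocOpt(1024 * sizeof(float), NULL);   /* NULL: default locality  */
unsigned int *cfg = spmd_getGlobalMem();     /* memory from the SCI-VM configuration space   */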

B.2 A Simple Example Code for the SPMD Programming Model

To help understand the programming model, this section provides a small example parallel program using the SPMD programming model, a simple Successive OverRelaxation (SOR). This code has already been used in Chapter 5.1.3 and its performance issues have been discussed there. This source code is also included in the binary distribution of the HAMSTER software available in the SMiLE software repository at LRR–TUM.

A simple SOR code

A Successive OverRelaxation (SOR) is a common method to solve Partial Differential Equations (PDE) described by a dense matrix with boundary values. The SOR algorithm mainly consists of a loop executed for a given number of iterations. Within this loop, each point of the dense matrix is updated once using a certain PDE– and problem–specific update pattern or stencil, defining how each point's new value has to be computed from the values of its neighboring points. For the code used here, a very simple stencil is used that just updates each point with the average value of its four direct neighbors. A graphical representation of this stencil can be seen in Figure B.1 along with its mathematical definition. The parallelization of this SOR approach is rather straightforward and can be done using simple domain decomposition. This process is visualized in Figure B.2 from left to right, based on the assumption of a parallelization for two nodes. The leftmost graphic shows the dense matrix on which the SOR operates. Then the necessary boundary is added around the actual data matrix. This matrix can then be split into equal parts according to the number of nodes, in this example two. The newly created subparts are then again surrounded by their own private boundary, allowing each subpart to be computed separately during one iteration by the same SOR algorithm. Between iterations, a barrier has to be introduced in order to keep all participating threads within the same iteration. The rightmost graphic in the figure shows the overlap of data between the two independent partitions. This is exactly the data which is used by both threads, leading to communication between the partitions.

U(i,j) = ( U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1) ) / 4.0

Figure B.1 Update pattern for each point in the matrix.

Figure B.2 Data distribution and sharing pattern for the SOR code — from left to right: initial dense matrix, boundary, splitting the matrix, local boundary, area with implicit communication.

It should be noted that this kind of parallelization does not result in a parallel program with exactly the same behavior as the sequential version. In the parallel case, the order in which matrix points are updated does not match the order of the sequential code. This also leads to different matrix values, as they are based on the values of the neighbors, of which some might have been updated under one scheme, but not under the other. Due to the numerical properties of the SOR algorithm, however, this does not influence the final result once convergence has been reached and is therefore safe to use.

Source code

The following section lists the source code implementing the SOR kernel described above using the SPMD programming model.

//////////////////////////////////////////////////////////////////////////////
//
// EXAMPLE: small SOR example for SPMD programming model
//
// September 2000, Version 1.0, (c) Martin Schulz, LRR-TUM
//
// Usage: Master: example <matrix size> <iterations>
//        Slave:  example <matrix size> <iterations>

//
// Linux version
//
//////////////////////////////////////////////////////////////////////////////

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#include "spmd.h"     /* SPMD programming model header (name assumed) */

//////////////////////////////////////////////////////////////////////////////
// timing routines

double time_temp;
#define STARTTIME time_temp=getTimer()
#define ENDTIME   (getTimer()-time_temp)

double getTimer()
{
  struct timeval tv;
  struct timezone tz;
  gettimeofday(&tv, &tz);
  return ((double) tv.tv_usec)+(((double) tv.tv_sec)*1000000.0);
}

//////////////////////////////////////////////////////////////////////////////
// Matrix access macro
#define u(x,y) matrix[(x)*(matrix_n+2)+(y)]

//////////////////////////////////////////////////////////////////////////////
// Main routine
int main(int argc, char **argv)
{
  float *matrix;
  double timer;
  localitySpec_t locOpt;
  int matrix_n, iterations, alloc_size;
  int number, count, i, j, k, start, end;
  int global_barrier;

  // Print header
  printf("\n\nSPMD example, v1.0\n\n");
  printf("(c) Martin Schulz, September 2000\n\n");
  if (argc!=3) {
    printf("Usage: example <matrix size> <iterations>\n");
    exit(1);
  }
  matrix_n=atoi(argv[1]);
  iterations=atoi(argv[2]);

  alloc_size=(matrix_n+2)*(matrix_n+2)*sizeof(float);
  printf("Allocation size in bytes: %i\n",alloc_size);
  printf("Iterations: %i\n",iterations);
  printf("Matrix size: %i\n\n",matrix_n);

  // From here on, the SCI-VM is ready to use, but no global memory
  // has yet been allocated
  count =spmd_getNodeCount();
  number=spmd_getNodeNum();

  // compute my work
  start = (matrix_n / count) * number + 1;
  end   = (matrix_n / count) * (number+1) + 1;
  if (number==0) start=1;

  if (number==count-1) end=matrix_n + 1;
  printf("\nID %d, Working on rows %d - %d\n\n", number, start, end);

  // allocate barrier
  global_barrier=spmd_allocBarrier();

  // allocate global segment
  // locOpt.mode=SCIVM_BLOCKDIST;
  // locOpt.node=0;
  // matrix=(float*) spmd_allocOpt(alloc_size,&locOpt);
  matrix=(float*) spmd_allocOpt(alloc_size,NULL);
  spmd_barrier(global_barrier);

  // start initialization: clear the matrix and set the boundary values
  if (number==0) {
    for (j=0; j<matrix_n+2; j++)
      for (i=0; i<matrix_n+2; i++)
        u(i,j)=0.0;
    for (j=0; j<matrix_n+2; j++) {
      u(0,j)=1.0;
      u(matrix_n+1,j)=1.0;
      u(j,0)=1.0;
      u(j,matrix_n+1)=1.0;
    }
  }

  // start timing
  spmd_barrier(global_barrier);
  STARTTIME;

  // Do SOR: update the local rows, synchronize after every iteration
  for (k=1; k<=iterations; k++) {
    for (i=start; i<end; i++)
      for (j=1; j<=matrix_n; j++)
        u(i,j)=(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))/4.0;
    spmd_barrier(global_barrier);
  }

  // end timing
  timer=ENDTIME;
  printf("Time elapsed %f ms\n\n",timer/1000.0);

  // print result diagonal
  if (number==0) {
    for (i=0; i<matrix_n+2; i++)
      printf("%f ",u(i,i));
    printf("\n");
  }

  // final barrier
  spmd_barrier(global_barrier);

  // That's it
  spmd_stop();
}

Running the code

After the compilation of the example code and after linking it against the HAMSTER modules, the code is ready for execution. Appendix A describes the necessary procedure for this. The code takes two arguments: first the size of the matrix (an integer number stating the number of elements in each direction) and second the number of iterations, i.e. the number of SOR steps, to be executed. Note that only with an iteration number greater than or equal to half of the matrix size will the code produce a fully computed dense result matrix. For initial tests, however, smaller iteration numbers should be used.
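As an illustration, a run on a 1024x1024 matrix with 512 iterations would be started as follows on every node listed in the .hamster file (and twice on nodes with a scale of 2), master node last; the concrete numbers are arbitrary examples:

./example 1024 512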

Sequential counterpart

For comparison, the binary distribution also includes a sequential counterpart of the code discussed above. It can be used to understand the parallelization process using the SPMD programming model and also for simple benchmarking and speedup evaluation.

Abbreviations

A

AM       Active Messages
ANL      Argonne National Laboratory
ANSI     American National Standards Institute
APC      Asynchronous Procedure Call
API      Application Programming Interface
ASCII    American Standard Code for Information Interchange
ATT      Address Translation Table

C

CC–NUMA  Cache–Coherent Non–Uniform Memory Access
CERN     Centre Européen pour la Recherche Nucléaire
CoP      Cluster of PCs
CORBA    Common Object Request Broker Architecture
COTS     Commodity Off–The–Shelf
CPU      Central Processing Unit
CRL      C Region Library
CT       Computer Tomography
CVM      Coherent Virtual Machine

D

DEC      Digital Equipment Corporation
DLL      Dynamic Link Library
DMA      Direct Memory Access
DNS      Domain Name Service
DSM      Distributed Shared Memory
DUnX     Duke University nX2

E

EC       Entry Consistency
ECC      Error Correction Code
ERC      Eager Release Consistency

F

FBP      Filtered Back Projection
FIFO     First–In First–Out
FLASH    Flexible Architecture for SHared memory

G

GAC      Global Activity Counter

H

HAMSTER  Hybrid dsm–based Adaptive and Modular Shared memory archiTEctuRe
HP       Hewlett Packard
HPC      High Performance Computing
HPF      High Performance Fortran
HW–DSM   HardWare Distributed Shared Memory

I

IBM      International Business Machines
IEEE     Institute of Electrical and Electronics Engineers
ICS      InterConnect Solutions
ISS      Interconnect Systems Solutions
IRF      Impulse Response Function
IRM      Interconnect Resource Manager
IVY      Integrated shared Virtual memory at Yale

J

JDK      Java(TM) Development Kit

L

LAN      Local Area Network
LC       Link Controller
LGF      Last Global Flush
LLI      Last Local Invalidation
LRC      Lazy Release Consistency

M

MPI      Message Passing Interface
MPP      Massively Parallel Processor
MRI      Magnetic Resonance Imaging
MTRR     Memory Type Range Register
MuSE     Multithreaded Scheduling Environment

N

NIC      Network Interface Card
NORMA    NO Remote Memory Access
NoW      Network of Workstations
NUMA     Non–Uniform Memory Access

O

OS       Operating System

P

PC       Processor Consistency or Personal Computer
PCI      Peripheral Component Interconnect
PDE      Partial Differential Equation
PET      Positron Emission Tomography
PoP      Pile of PCs
POSIX    Portable Operating System Interface for uniX
PSB      PCI–SCI Bridge chip
PVM      Parallel Virtual Machine
PVP      Parallel Vector Processor

R

RAL      Rutherford Appleton Laboratories
RAM      Random Access Memory
RC       Release Consistency
RPC      Remote Procedure Call

S

S2MP     Scalable Symmetric MultiProcessing
SAN      System Area Network
SAR      Synthetic Aperture Radar
SC       Sequential Consistency
SCI      Scalable Coherent Interface
SCI-VM   SCI Virtual Memory
ScC      Scope Consistency
SINTEF   Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology
SISCI    Standard software Infrastructure for SCI–based systems
SMiLE    Shared Memory in a LAN–like Environment
SMP      Symmetric MultiProcessor
SP       Service Pack
SPLASH   Stanford Parallel Applications for Shared Memory
SPMD     Single Program Multiple Data
SSI      Single System Image
SSP      Scali Software Platform
SOR      Successive Over Relaxation
SW–DSM   SoftWare Distributed Shared Memory

T

TSO      Total Store Order

U

UMA      Uniform Memory Access

V

VPM      Virtual Parallel Machine

W

WAN      Wide Area Network
WC       Weak Consistency

Bibliography

[1] G. Acher, W. Karl, and M. Leberecht. PCI-SCI-Protocol Translations: Applying Microprogrammable Concepts to FPGA. In R.W. Hartenstein and A. Keevallik, editors, 8th International Workshop on Field Programmable Logic and Applications, FPL'98, volume 1482 of Lecture Notes in Computer Science, pages 99–108, Tallinn, Estonia, August 1998. Springer-Verlag.

[2] G. Acher, W. Karl, and M. Leberecht. The TUM PCI/SCI Adapter, chapter 4, pages 89–101. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[3] S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Rice University ECE Technical Report 9512 and Western Research Laboratory Research Report 95/7, Department of Electrical and Computer Engineering, Rice University and Western Research Laboratory, DEC, September 1995.

[4] Gene M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 483–485, April 1967.

[5] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2):18–28, February 1995.

[6] C. Amza, A. Cox, S. Dwarkadas, and W. Zwaenepoel. Software DSM Protocols that Adapt between Single Writer and Multiple Writer. In Proceedings of the International Conference on High Performance Computer Architecture (HPCA), 1997.

[7] G. Antoniu, L. Bouge, and R. Namyst. An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System. In J. Rolim et al., editor, Parallel and Distributed Processing, Proceedings of IPDPS workshops including RTSPP, volume 1586 of LNCS, pages 496–510. Springer Verlag, Berlin, April 1999.

[8] G. Appollonia, J. Méhaut, R. Namyst, and Y. Denneulin. SCI and distributed multithreading: the PM2 approach. In H. Hellwagner and A. Reinefeld, editors,

Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[9] Dolphin Interconnect Solutions AS. Link Controller 3 (TM) Specification. Olaf Helsets vei 6, P.O. Box 70, Bogerud, N-0621 Oslo, Norway, September 2000. Preliminary version 0.84, Available under NDA from Dolphin ICS.

[10] O. Aumage, L. Bougé, A. Denis, J. Méhaut, G. Mercier, R. Namyst, and L. Prylli. Madeleine II: a Portable and Efficient Communication Library for High–Performance Cluster Computing. In Proceedings of the IEEE Conference on Cluster Computing, Cluster 2000, pages 78–87. IEEE Computer Society, December 2000.

[11] H. Bal, M. Kaashoek, and A. Tanenbaum. Orca: A Language for Parallel Programming of Distributed Systems. IEEE Transactions on Software Engineering, 18(3):190–205, March 1992.

[12] A. Barak, O. La'adan, and A. Shiloh. Scalable Cluster Computing with MOSIX for LINUX. In the Proceedings of Linux Expo, Raleigh, NC, pages 95–100, May 1999.

[13] A. Belias, A. Bogaerts, D. Botterill, J. Dawson, E. Denes, F. Giacomini, R. Hauser, C. Hortnagel, R. Hughes-Jones, S. Kolya, D. Mercer, R. Middleton, J. Schlereth, P. Werner, and D. Wickens. SCI Prototyping for the Second Level Trigger System of the ATLAS Experiment, chapter 23, pages 397–414. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[14] A. Belias, A. Bogaerts, D. Botterill, F. Giacomini, R. Hauser, R. Middleton, P. Werner, and F. Wickens. ATLAS Level–2 Trigger SCI Demonstrator Evaluation Report. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 101–109. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[15] A. Belias, A. Bogaerts, D. Botterill, F. Giacomini, R. Middleton, F. Wickens, and P. Werner. Evaluation of a 16–port SCI Switch. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 101–109. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[16] A. Bilas, L. Iftode, and J. Singh. Shared Virtual Memory across SMP Nodes Using Automatic Update: Protocols and Performance. In Proceedings of the 6th workshop on Scalable Shared–Memory Multiprocessors, October 1996. Also as Princeton Technical Report TR-517-96.

[17] J. Bennett, J. Carter, and W. Zwaenepoel. Munin: Distributed Shared Memory Based on Type–Specific Memory Coherence. In Proceedings of Principles and Practice of Parallel Programming (PPoPP), 1990.

[18] A. Blumrich, R. Alpert, Y. Chen, D. Clark, S. Damianakis, C. Dubnicki, E. Felten, L. Iftode, K. Li, M. Martonosi, and R. Shillner. Design Choices in the SHRIMP System: An Empirical Study. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1998.

[19] M. Blumrich, R. Alpert, Y. Chen, D. Clark, S. Damianakis, C. Dubnicki, E. Felten, L. Iftode, K. Li, M. Martonosi, and R. Shillner. Design Choices in the SHRIMP System: An Empirical Study. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), May 1998.

[20] N. Boden, D. Cohen, R. Felderman, J. Seizovic, A. Kulawik, C. Seitz, and Wen-King Su. Myrinet: A Gigabit–per–Second Local Area Network. IEEE Micro, 15(1):29–36, February 1995.

[21] B. Bershad and M. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors. Technical Report CMU-CS-91-170, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, September 1991.

[22] B. Bershad, M. Zekauskas, and W. Sawdon. The Midway Distributed Shared Memory System. In the Proceedings of COMPCON, 1993.

[23] H. Bugge. Affordable Scalability using Multicubes. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe '98, a conference stream of EMMSEC '98, pages 25–28. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[24] H. Bugge and P. Husoy. Efficient SAR processing on the Scali System. In the proceedings of the 2nd International Workshop on Embedded HPC Systems and Applications (held in conjunction with IPPS), April 1997.

[25] R. Butenuth and E. Rehling. Armenius: Software für Linux-basierte SCI-Cluster. In W. Rehm and T. Ungerer, editors, Tagungsband zum 2. Workshop Cluster Computing, number CSR-99-02 in Chemnitzer Informatik–Berichte, pages 15–24, March 1999.

[26] R. Butenuth and H. Heiss. Shared Memory Programming on PC-based SCI Clusters. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe '98, a conference stream of EMMSEC '98, pages 143–148, September 1998. ISBN: 1-901864-02-02.

[27] G. Cabillic, G. Muller, and I. Puaut. The Performance of Consistent Checkpointing in Distributed Shared Memory Systems. In the Proceedings of the 14th Symposium on Reliable Distributed Systems, 1995.

[28] G. Cabillic and I. Puaut. Stardust: an environment for parallel programming on networks of heterogeneous workstations.

[29] J. Carter, J. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the 13th Symposium on Operating System Principles (SOSP), pages 152–164, October 1991.

[30] J. Carter, J. Bennett, and W. Zwaenepoel. Techniques for Reducing Consistency–Related Communication in Distributed Shared Memory Systems. ACM Transactions on Computer Systems, 1995.

[31] R. Chandra, A. Gupta, and J. Hennessy. COOL: An object–based language for parallel programming. Computer, 27(8):13–26, August 1994.

[32] R. Chandra, A. Gupta, and J. Hennessy. COOL, chapter 6, pages 215–255. In Wilson and Lu [231], 1996.

[33] K. Chandy and S. Taylor. An Introduction to Parallel Programming. Jones and Bartlett Publishers, Boston and London, 1992.

[34] A. Charlesworth. STARFIRE: Extending the SMP Envelope. IEEE Micro, 18(1):39–49, February 1998.

[35] Convex Computer Corporation, Richardson, Texas, USA. Convex Exemplar Architecture, 2nd edition, November 1994.

[36] Compaq Computer Corp., Intel Corporation, and Microsoft Corporation. Virtual Interface Architecture Specification, Version 1.0, December 1997. Available with NDA via www.viarch.org.

[37] Data General Corporation. Data General's NUMALiiNE Technology: The Foundation for the AV 20000 Server. White paper, 1998. http://www.dg.com/about/html/numaliine_technology_av20000_f.html.

[38] Intel Corporation. MultiProcessor Specification, Version 1.4, May 1997. Available from Intel's developer website.

[39] D. Cortesi and J. Fier. Origin2000 and Onyx2 Performance Tuning and Optimization Guide. Technical Report 007-3430-002, Silicon Graphics, Inc., 1998. Available at http://techpubs.sgi.com/.

[40] B. Costinescu and A. Lioy. The HPPC–SEA DVSM library. Technical report, Politehnica University of Bucharest and Politecnico di Torino, November 1998. This report was written for the HPPC–SEA project.

[41] A. Cox and R. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th Symposium on Operating Systems Principles (SOSP), pages 32–44, December 1989.

[42] V.J. Cunningham and T. Jones. Spectral analysis of dynamic PET studies. Journal of Cerebral Blood Flow and Metabolism, 13(1):15–23, January 1993.

[43] E. W. Dijkstra. Solution of a Problem in Concurrent Programming Control. Communications of the ACM, 8(9):569, September 1965.

[44] Dolphin Interconnect Solutions, AS. PCI–SCI Bridge Functional Specification, November 1996.

[45] M. Dormanns. Shared Memory Parallelization of the GROMOS96 Molecular Dynamics Code, chapter 22, pages 383–396. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[46] B. Dreier, M. Zahn, and T. Ungerer. Parallele und verteilte Programmierung mittels Pthreads und Rthreads. In W. Rehm, editor, Tagungsband zum 1. Workshop Cluster Computing, number CSR-97-05 in Chemnitzer Informatik–Berichte, pages 63–85, November 1997.

[47] M. Dubois, C. Scheurich, and F. Briggs. Memory Access Buffering in Multiprocessors. In Proceedings of the 13th International Symposium on Computer Architecture (ISCA), pages 434–442, 1986.

[48] M. Dubois and S. Thakkar, editors. Scalable Shared–Memory Multiprocessors. Kluwer Academic Publishers, Boston, MA, 1992.

[49] M. Eberl. Realisierung einer DSM–Implementation von Active Messages am Beispiel eines SCI-gekoppelten Sun Workstation Clusters. Diplomarbeit, Technische Universität München, 1996.

[50] M. Eberl, H. Hellwagner, B. Herland, and M. Schulz. SISCI — Implementing a Standard Software Infrastructure on an SCI Cluster. In W. Rehm, editor, Tagungsband zum 1. Workshop Cluster Computing, number CSR-97-05 in Chemnitzer Informatik–Berichte, pages 49–61, November 1997.

[51] M. Eberl, H. Hellwagner, W. Karl, M. Leberecht, and J. Weidendorfer. Fast Communication Libraries on an SCI Cluster. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 165–175. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[52] M. Eberl, W. Karl, C. Trinitis, and A. Blaszczyk. on PC Clusters — An Alternative to for Industrial Applications. In Proceedings of the 6th European PVM/MPI Users’ Group Meeting, Barcelona, Spain, volume 1697 of LNCS, pages 493–498. Springer Verlag, Berlin, September 1999.

[53] W. Bolosky et al. NUMA Policies and Their Relation to Memory Architecture. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 1991.

[54] J. Fessler. Aspire 3.0 user’s guide: A sparse iterative reconstruction library. Technical Report TR–95–293, Communications & Signal Processing Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan 48109-2122, November 2000. Revised version.

[55] M. Fischer and A. Reinefeld. A PVM Implementation for Heterogeneous SCI Clusters. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 159–164. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[56] I. Foster, C. Kesselman, and S. Tuecke. The Nexus Approach to Integrating Multithreading and Communication. Journal of Parallel and Distributed Computing, 40:35–48, 1996.

[57] H. Fuchs, M. Levoy, and S. Pizer. Interactive Visualization of 3D Medical Data. Computer, 22(3):45–51, August 1989.

[58] M. Galles and E. Williams. Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor. Technical report, January 1994.

[59] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proceedings of Operating System Design and Implementation (OSDI), 1999.

[60] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine — A User’s Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation. MIT Press, 1994.

[61] A. George, W. Phillips, R. Todd, and W. Rosen. Multithreading and Lightweight Communication Protocol Enhancements for SCI–based SCALE Systems. In Proceedings of the 7th International SCI Workshop, March 1997.

[62] K. Gharachorloo. Memory Consistency Models for Shared–Memory Multiprocessors. PhD thesis, Stanford University, December 1995. Also published as Stanford University Technical Report CSL-TR-95-685 and WRL research report 95/9.

[63] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, and J. Hennessy. Memory Consistency and Event Ordering in Scalable Shared–Memory Multiprocessors. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), June 1990.

[64] F. Giacomini, T. Amundsen, A. Bogaerts, R. Hauser, B. Johnson, H. Kohmann, R. Nordstrøm, and P. Werner. Low–level SCI software functional specification. Dolphin ICS and CERN, version 2.1.1 edition, March 1999. Also deliverable D.1.1.1, ESPRIT project 23174 / SISCI, Available from http://www.dolphinics.no/.

[65] R. Gillett. Memory Channel: An Optimized Cluster Interconnect. IEEE Micro, 16(2):12–18, February 1996.

[66] V. Gonzales, E. Sanchis, and G. Torralba. Multinode Performance Evaluation: experience using the Dolphin 4–port switch. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 75–81. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[67] J. Goodman, M. Vernon, and P. Woest. Efficient Synchronization Primitives for Large–Scale Cache–Coherent Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 64–73, 1989.

[68] L. Grabowsky, T. Radke, and W. Rehm. Cluster-MPI – An optimized MPI subset for configurable cluster systems: SCI connected SMPs as a test case. In W. Rehm, editor, Tagungsband zum 1. Workshop Cluster Computing, number CSR-97-05 in Chemnitzer Informatik–Berichte, pages 125–131, November 1997.

[69] R. Grass. Siemens hpcLine — Intel (TM)-basierte hoch–skalierbare Server für technisch–wissenschaftliche Anwendungen. Siemens AG, Information and Communication Products, High Performance Computing. Available from http://www.hpcline.de/.

[70] R. Grindley, T. Abdelrahman, S. Brown, S. Caranci, D. DeVries, B. Gamsa, A. Grbic, M. Gusat, R. Ho, O. Krieger, G. Lemieux, K. Loveless, N. Manjikian, P. McHardy, S. Srbljic, M. Stumm, Z. Vranesic, and Z. Zilic. The NUMAchine Multiprocessor. In Proceedings of the International Conference on Parallel Processing (ICPP), August 2000.

[71] A. Gueziec and R. Hummel. The wrapper algorithm: Surface extraction and simplification. In Proceedings of the IEEE Workshop on Biomedical Image Analysis, pages 204–213, 1994.

[72] J. Hansen, P. Koch, and E. Jul. A Stream Protocol Implementation for an SCI–based Cluster of Workstations. In Proceedings of the Workshop on Cluster–based Computing (WCBC) (held in conjunction with ICS), pages 16–20, June 1999.

[73] M. Haines, D. Cronk, and P. Mehrotra. On the Design of Chant: A Talking Threads Package. In Proceedings of Supercomputing 1994, pages 350–359, November 1994.

[74] H. Hellwagner, W. Karl, and M. Leberecht. Enabling a PC Cluster for High Performance Computation. SPEEDUP-Journal, 11(1), 1997.

[75] H. Hellwagner and A. Reinefeld, editors. SCI: Scalable Coherent Interface. Architecture and Software for High-Performance Compute Clusters, volume 1734 of LNCS State-of-the-Art Survey. Springer Verlag, October 1999. ISBN 3-540-66696-6.

[76] H. Hellwagner and J. Weidendorfer. SCI Sockets Library, chapter 11, pages 209–229. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[77] H.-J. Hermann. Nuklearmedizin. Urban und Schwarzenberg, 1998.

[78] R. Hockauf, J. Jeitner, W. Karl, R. Lindhof, M. Schulz, V. Gonzales, E. Sanchis, and G. Torralba. Design and Implementation Aspects for the SMiLE Hardware Monitor. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 47–55. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[79] R. Hockauf, W. Karl, M. Leberecht, M. Oberhuber, and M. Wagner. Exploiting Spatial and Temporal Locality of Accesses: A New Hardware-Based Monitoring Approach for DSM Systems. In David Pritchard and Jeff Reeve, editors, Euro-Par’98 Parallel Processing, 4th International Euro-Par Conference, Southampton, UK, September 1-4, 1998 Proceedings, volume 1470 of Lecture Notes in Computer Science, pages 206–215, Berlin, September 1998. Springer Verlag.

[80] G. Horn. Scalability of SCI Ringlets, chapter 7, pages 151–166. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[81] G. Horn and W. Karl, editors. Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[82] R. Horst. TNet: A Reliable System Area Network. IEEE Micro, 15(1):37–45, February 1995.

[83] W. Hu, W. Shi, and Z. Tang. JiaJia: An SVM System based on a New Cache Coherence Protocol. In Proceedings of High Performance Computing and Networking (HPCN-Europe), volume 1593 of LNCS, pages 463–472, April 1999.

[84] L. Huse, K. Omang, and H. Bugge. ScaFun – A Fundament for Process Communication. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 13–20. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[85] L. Huse, K. Omang, H. Bugge, H. Ry, A. Haugsdal, and E. Rustad. ScaMPI — Design and Implementation, chapter 14, pages 249–261. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[86] K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., 1993.

[87] M. Ibel, K. Schauser, C. Scheiman, and M. Weis. Implementing Active Messages and Split-C for SCI Clusters and Some Architectural Implications. In Sixth International Workshop on SCI-based Low-cost/High-performance Computing, September 1996.

[88] M. Ibel, K. Schauser, C. Scheiman, and M. Weis. High-Performance Cluster Computing Using SCI. In Hot Interconnects V, August 1997.

[89] IBM. The IBM NUMA–Q enterprise server architecture, Solving issues of latency and scalability in multiprocessor systems. Technical report, March 2001. http://www.sequent.com/whitepapers/numa_arch.html.

[90] IEEE Computer Society. IEEE Std 896–1987: IEEE Standard backplane bus specification for multiprocessor architectures: Futurebus. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, July 1988.

[91] IEEE Computer Society. IEEE Std 1496–1993: IEEE Standard for a Chip and Module Interconnect Bus: SBus. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, September 1993.

[92] IEEE Computer Society. IEEE Std 1596–1992: IEEE Standard for Scalable Coherent Interface. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, August 1993.

[93] IEEE Computer Society. IEEE Std 896–1994: Information technology — microprocessor systems — Futurebus+ — Logical protocol specification. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, April 1994.

[94] IEEE Computer Society. IEEE Std 1394–1995: IEEE Standard for a high performance serial bus. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, August 1996.

[95] IEEE Computer Society. IEEE Unapproved Draft 1386 D2.2: Draft Standard for a Common Mezzanine Card Family: CMC. The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, April 2000.

[96] L. Iftode and J. Singh. Shared Virtual Memory: Progress and Challenges. Proceedings of the IEEE, 87(3), March 1999.

[97] L. Iftode, J. Singh, and K. Li. Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Theory of Computing Systems, 31:451–473, 1998.

[98] Cray Inc. SHMEM, in CRAY T3E C and C++ Optimization Guide, chapter 3. SG–2178 3.0.1, Available at http://www.cray.com/.

[99] Intel Corporation. Intel Architecture Software Developer’s Manual for the Pentium II, volume 1–3. published on Intel’s developer website, 1998.

[100] A. Itzkovitz, A. Schuster, and L. Shalev. Millipede: a User-Level NT-Based Distributed Shared Memory System with Thread Migration and Dynamic Run-Time Optimization of Memory References. In Proceedings of the 1st USENIX Windows NT Workshop, page 148, August 1997.

[101] B. Johnsen. Write back caching of SCI memory. personal communication with Dolphin ICS, March 1998.

[102] K. Johnson, M. Kaashoek, and D. Wallach. CRL: High performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), 1995.

[103] T. Johnson. A Performance Comparison of Fast Distributed Synchronization Algorithms. Technical Report TR98002, Department of CIS, University of Florida, Gainesville, FL 32611-2024, 1998.

[104] B. Joy, G. Steele, J. Gosling, and G. Bracha. Threads and Locks, in The Java Language Specification, Second Edition (The Java Series), chapter 17. Addison-Wesley, 2nd edition, June 2000. ISBN: 0201310082, Online version available at http://java.sun.com/.

[105] R. LaRowe Jr. and C. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319–363, November 1991.

[106] W. Karl and G. Horn, editors. Conference Proceedings of SCI–Europe 1999. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[107] W. Karl, M. Leberecht, and M. Oberhuber. SCI Monitoring Hardware and Software: Supporting Performance Evaluation and Debugging, chapter 24, pages 417–432. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[108] W. Karl, M. Leberecht, and M. Schulz. Optimizing Data Locality for SCI–based PC–Clusters with the SMiLE Monitoring Approach. In Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 169–176, October 1999.

[109] W. Karl, M. Leberecht, and M. Schulz. Supporting Shared Memory and Message Passing on Clusters of PCs with a SMiLE. In A. Sivasubramaniam and M. Lauria, editors, Proceedings of Workshop on Communication and Architectural Support for Network based Parallel Computing (CANPC) (held in conjunction with HPCA), volume 1602 of Lecture Notes in Computer Science (LNCS), pages 196–210, Berlin, 1999. Springer Verlag.

[110] W. Karl, M. Schulz, and J. Tao. Using the SMiLE Monitoring Infrastructure to Detect and Lower the Inefficiency of Parallel Applications. In Proceedings of High Performance Computing and Networking (HPCN-Europe), volume 1823 of Lecture Notes in Computer Science, pages 270–279. Springer Verlag, Berlin, May 2000.

[111] W. Karl, M. Schulz, and J. Trinitis. Multilayer Online-Monitoring for Hybrid DSM systems on top of PC clusters with a SMiLE. In Proceedings of the 11th Int. Conference on Modelling Techniques and Tools for Computer Performance Evaluation, volume 1786 of LNCS, pages 294–308. Springer Verlag, Berlin, March 2000.

[112] W. Karl, M. Schulz, M. Völk, and S. Ziegler. NEPHEW: Applying a Toolset for the Efficient Deployment of a Medical Image Application on SCI–based clusters. In A. Bode, T. Ludwig, W. Karl, and R. Wismüller, editors, Euro-Par 2000 — Parallel Processing, volume 1900 of Lecture Notes in Computer Science (LNCS), pages 851–860. Springer Verlag, Berlin, September 2000.

[113] W. Karl, M. Schulz, M. Völk, and S. Ziegler. Meeting the Computational Demands of Nuclear Medical Imaging using Commodity Clusters. In Proceedings of the International Conference on Computational Science (ICCS), May 2001. To appear.

[114] S. Karlsson and M. Brorsson. An Infrastructure for Portable and Efficient Software DSM. In L. Iftode and P. Keleher, editors, Proceedings of the First International Workshop on Software Distributed Shared Memory (WSDSM), June 1999. Available from http://www.cs.umd.edu/˜keleher/wsdsm99/.

[115] S. Kaxiras. Kiloprocessor Extensions to SCI. In Proceedings of the International Parallel Processing Symposium (IPPS), April 1996.

[116] P. Keleher. Lazy Release Consistency for Distributed Shared Memory. PhD thesis, Rice University, January 1995.

[117] P. Keleher. CVM: The Coherent Virtual Machine. University of Maryland, cvm version 1.0 edition, November 1996. http://www.cs.umd.edu/projects/cvm/.

[118] P. Keleher. Symmetry and Performance in Consistency Protocols. In Proceedings of the International Conference on Supercomputing (ICS), pages 43–50, June 1999.

[119] P. Keleher, A. Cox, S. Dwarkadas, and W. Zwaenepoel. An Evaluation of Software–Based Release Consistent Protocols. Journal of Parallel and Distributed Computing, 29, October 1995.

[120] R. Kessler and J. Schwarzmeier. CRAY-T3D: A New Dimension for Cray Research. In Digest of Papers: CompCon Spring ’93, pages 176–182, San Francisco, CA, February 1993. IEEE, IEEE Computer Society Press, Los Alamitos, CA.

[121] Y. Khalidi, J. Bernabeu, V. Matena, K. Shirriff, and M. Thadani. Solaris MC: A Multi–Computer OS. Technical Report SMLI TR-95-48, Sun Microsystems Laboratories, 2550, Garcia Avenue, Mountain View, CA 94043, November 1995.

[122] P. Koch, E. Cecchet, and X. de Pina. Global Management of Coherent Shared Memory on an SCI Cluster. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 51–57. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[123] P. Koch, J. Hansen, E. Cecchet, and X. Rousset de Pina. SciOS: An SCI-based Software Distributed Shared Memory. In Proceedings of the First International Workshop on Software Distributed Shared Memory (WSDSM), June 1999. Available at http://www.cs.umd.edu/˜keleher/wsdsm99/.

[124] R. Koeninger, M. Furtney, and M. Walker. A Shared Memory MPP from Cray Research. Digital Technical Journal, 6(2), 1994. http://www.digital.com/info/DTJE01/DTJE01SC.TXT.

[125] L. Kontothanassis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Maira, S. Dwarkadas, and M. Scott. VM–Based Shared Memory on Low–Latency, Remote–Memory–Access Networks. Technical Report No. 643, Department of Computer Science, University of Rochester and DEC Cambridge Research Lab, November 1996.

[126] L. Kontothanassis and M. Scott. High Performance Software Coherence for Current and Future Architectures. Journal of Parallel and Distributed Computing, 1995.

[127] C. Kurmann and T. Stricker. A Comparison of Three Gigabit Technologies: SCI, Myrinet and SGI/Cray T3D, chapter 2, pages 39–68. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[128] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture (ISCA), pages 302–313, April 1994.

[129] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, 28(9):241–248, 1979.

[130] L. Lamport. A Fast Mutual Exclusion Algorithm. ACM Transactions on Computer Systems, 5(1):1–11, February 1987.

[131] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In In Proceedings of the 24th International Symposium on Computer Architecture (ISCA), pages 241–251, June 1997.

[132] R. Leahy and C. Byrne. Recent developments in iterative image reconstruction for PET and SPECT. IEEE Transactions on Nuclear Sciences, 19:257–260, 2000.

[133] M. Leberecht. An Efficient Runtime System Combining Dataflow, Multithreading, and Distributed Shared Memory. PhD thesis, Technische Universität München, April 1999. Published as Volume 14 of the Research Series, LRR–TUM (Arndt Bode, Editor), Shaker–Verlag, Aachen.

[134] M. Leberecht. The MuSE Runtime System for SCI Clusters: A Flexible Combination of On–Stack Execution and Work Stealing, chapter 20, pages 349–364. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[135] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September 1986. Available as TR492.

[136] K. Li. IVY: A shared virtual memory system for parallel computing. In Proceedings of the International Conference on Parallel Processing (ICPP), volume 2, pages 94–101, 1989.

[137] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory. ACM Transactions on Computer Systems, 7(4):321–359, November 1989.

[138] M. Liaaen and H. Kohmann. Dolphin SCI Adapter Cards, chapter 3, pages 71–87. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[139] C. Lin and L. Snyder. ZPL: An Array Sublanguage. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 96–114, 1993.

[140] V. Lindenstruth, A. Bogaerts, H. Bugge, M. Davis, D. Gustavson, J. Heidbrink, H. Hellwagner, B. Herland, D. James, S. Klein, Q. Li, M. Mahalingam, J. Merkey, W. Mueller, T. Nygaard, H. Richter, D. Roehrich, T. Sheikh, P. Werner, J. Whitfield, and R. Wipfel. SCI Physical Layer API. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 137–141. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[141] V. Lindenstruth and D. Gustavson. SCI Physical Layer API, chapter 10, pages 191–208. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[142] P. Gustad, K. Loechsen, and O. Toerudbakken. High performance switching and congestion avoidance in SCI. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 119–126. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[143] H. Lu, S. Dwarkadas, A. Cox, and W. Zwaenepoel. Message Passing Versus Distributed Shared Memory on Networks of Workstations. In Proceedings of Supercomputing ’95, December 1995.

[144] T. Ludwig, R. Wismüller, V. Sunderam, and A. Bode. OMIS — On-line Monitoring Interface Specification (Version 2.0), volume 9 of LRR-TUM Research Report Series. Shaker Verlag, Aachen, Germany, 1997. ISBN 3-8265-3035-7.

[145] M. Schulz. Efficient deployment of shared memory models on clusters of PCs using the SMiLEing HAMSTER approach. In A. Goscinski, H. Ip, W. Jia, and W. Zhou, editors, Proceedings of the 4th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), pages 2–14. World Scientific Publishing, December 2000.

[146] M. Schulz and H. Hellwagner. Extending NT Virtual Memory by SCI–based Hardware DSM. In Proceedings of the USENIX Windows NT Symposium, page 169, August 1998.

[147] M. Manzke and B. Coghlan. Non-Intrusive Deep Tracing of SCI Interconnect Traffic. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 53–58. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[148] R. Marejka. Multi-threaded Programming. Technical report, Solaris 2 Migration Support Centre, Sun Microsystems Inc., March 1996. Revision B.2, Available from http://www.sun.com/.

[149] M. May. Vergleich von PVM und CORBA bei der verteilten Berechnung medizinischer Bilddaten. Diplomarbeit, Technische Universität München, 2000.

[150] J. Mellor-Crummey and M. Scott. Algorithms for Scalable Synchronization on Shared–Memory Multiprocessors. ACM Transactions on Computer Systems, February 1991.

[151] D. Mentre and T. Priol. NOA: A Shared Virtual Memory over a SCI cluster. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 43–50. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[152] Message Passing Interface Forum (MPIF). MPI: A Message-Passing Interface Standard. Technical Report, University of Tennessee, Knoxville, June 1995. http://www.mpi-forum.org/.

[153] M. Michael and M. Scott. Scalability of Atomic Primitives on Distributed Shared Memory Multiprocessors. In Proceedings of the International Conference on High Performance Computer Architecture (HPCA), 1995.

[154] Microsoft Corporation. Microsoft Platform Software Development Kit, chapter About Processes and Threads. Microsoft, 1997. Available with Microsoft’s SDK.

[155] Microsoft Corporation. Microsoft Platform Software Development Kit, chapter Synchronization. Microsoft, 1997. Available with Microsoft’s SDK.

[156] Microsoft Corporation. Microsoft Platform Software Development Kit. Microsoft, 1997. Available with Microsoft’s SDK.

[157] Sun Microsystems. The Sun Enterprise Cluster Architecture, Technical White Paper. Technical report, 1997. Available at http://www.sun.com/software/cluster/wp-arch/wp.pdf.

[158] SUN Microsystems. JavaSpaces (TM) Service Specification, Version 1.1, October 2000. Available from http://java.sun.com/.

[159] MIL 3, Inc. OPNET Modeler manuals. 3400 International Drive NW, Washington DC, 20008, USA, 1989–1997.

[160] L. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Patterns in Software DSMs. In Proceedings of the 4th International Symposium on High Performance Computer Architecture (HPCA), February 1998.

[161] D. Mosberger. Memory Consistency Models. Technical Report TR 93/11, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1993.

[162] C. Morin and I. Puaut. A Survey of Recoverable Distributed Shared Memory Systems. Technical Report Publication Interne 975, IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cédex, France, December 1995.

[163] F. Müller. A Library Implementation of POSIX Threads under UNIX. In Proceedings of USENIX, pages 29–42, January 1993.

[164] F. Müller. Distributed Shared Memory Threads: DSM–Threads, Description of Work in Progress. In Proceedings of the Workshop on Run–Time Systems for Parallel Programming (held in conjunction with IPPS), pages 31–40, April 1997.

[165] F. Müller. Decentralized Synchronization for Multi–threaded DSMs. In L. Iftode and P. Keleher, editors, Proceedings of the Second International Workshop on Software Distributed Shared Memory (WSDSM), May 2000. Available at http://www.cs.rutgers.edu/˜wsdsm00/.

[166] F. Munz. Parallele Rekonstruktion von Volumendaten. Diplomarbeit, Technische Universität München, 1995.

[167] J. Nieplocha and R. Harrison. Shared Memory NUMA Programming on I–Way. In Proceedings of the 5th International Symposium on High Performance Distributed Computing (HPDC), 1995.

[168] J. Nieplocha, R. Harrison, and R. Littlefield. Global Arrays: A Non–Uniform–Memory–Access Programming Model For High–Performance Computers. The Journal of Supercomputing, 10:169–189, 1996.

[169] B. Nitzberg and V. Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, pages 52–59, August 1991.

[170] K. Omang. Performance Results from SALMON, A Multiprocessor Environment based on Workstations connected by SCI. Technical Report Research Report 208, University of Oslo, Department of Informatics, November 1995. ISBN: 82–7368–120–3.

[171] K. Omang. Synchronization Support in I/O Adapter Based SCI Clusters. In Proceedings of Workshop on Communication and Architectural Support for Network based Parallel Computing (CANPC), volume 1199 of Lecture Notes in Computer Science (LNCS). Springer, Berlin, February 1997.

[172] OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface, Version 1.0, Document Number 004–2229–01 edition, October 1998. Available from http://www.openmp.org/.

[173] J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson, and B. Welch. The Sprite Network Operating System. IEEE Computer, 21(2):23–36, February 1988.

[174] S. Paas, M. Dormanns, T. Bemmerl, K. Scholtyssik, and S. Lankes. Computing on a Cluster of PCs: Project Overview and Early Experiences. In W. Rehm, editor, Tagungsband zum 1. Workshop Cluster Computing, number CSR-97-05 in Chemnitzer Informatik–Berichte, pages 217–229, November 1997.

[175] D. Patterson and J. Hennessy. Computer Architecture — A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 2nd edition, 1996.

[176] K. Petersen and K. Li. Multiprocessor Cache Coherence Based on Virtual Memory Support. Journal of Parallel and Distributed Computing, Special Issue Scalable Shared Memory Multiprocessors, 29, October 1995.

[177] S. Petri, G. Lustig, C. Grewe, R. Hagenau, W. Obelöer, and M. Boosten. Performance Comparison of Different High–Speed Networks with a Uniform Efficient Programming Interface. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 83–90. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[178] T. Priol, C. René, and G. Alléon. Programming SCI Clusters Using Parallel CORBA, chapter 19, pages 333–348. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[179] M. Rangarajan, S. Divakaran, T. Nguyen, and L. Iftode. Multi–threaded Home–based LRC Distributed Shared Memory. In Proceedings of the 8th Workshop on Scalable Shared Memory Multiprocessors (held in conjunction with ISCA), May 1999.

[180] E. Rehling. Multithreading for SCI Clusters: Yasmin and the Sthreads Library. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 27–33. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[181] A. Reinefeld and H. Hellwagner, editors. Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[182] G. Ronneberg and O. Lynse. An OPNET–based Simulation Model of SCI–Nodes. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 131–137. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[183] G. Ronneberg, O. Lynse, and G. Horn. Evaluation and Suggested Improvements for the SCI Flow Control. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 101–112. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[184] A. Rubini. Linux Device Drivers. O’Reilly & Associates, Inc., 1st edition, February 1998. ISBN: 1-56592-292-1.

[185] S. Ryan. The design and implementation of a portable driver for shared memory cluster adapters. Technical Report Research report 255, University of Oslo, Department of Informatics, December 1997. ISBN 82–7368–177-7.

[186] S. J. Ryan and H. Bryhni. Eliminating the Protocol Stack for Socket Based Communication in Shared Memory Interconnects. In Proc. Int’l. Workshop on Personal Computer based Networks of Workstations (held in conjunction with IPPS’98), Orlando, Florida, USA, April 1998.

[187] H. Sandhu. Integrating Applications with Cache and Memory Management on a Shared Memory Multiprocessor. In Proceedings of CASCON, 1992.

[188] H. Sandhu, B. Gamsa, and S. Zhou. The Shared Regions Approach to Software Cache Coherence on Multiprocessors. In Proceedings of the 1993 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), May 1993.

[189] D. Scales, K. Gharachorloo, and C. Thekkath. Shasta: A Low Overhead, Software–Only Approach for Supporting Fine–Grain Shared Memory. Technical Report WRL Research Report 96/2, Digital Western Research Laboratory, 250 University Avenue, Palo Alto, California 94301, USA, November 1996.

[190] Scali Computer AS. Scali Software Platform — SSP. Olaf Helsets vei 6, Postboks 70, Bogerud, N-0621 Oslo, Norway, June 2000. Available at http://www.scali.com/.

[191] D. Schmidt. An OO Encapsulation of Lightweight OS Concurrency Mechanisms in the ACE Toolkit. Technical Report WUCS-95-31, Department of Computer Science, Washington University, St. Louis, MO, 1995.

[192] M. Schulz. Report on Pthreads design and test suite on SCI/PC/NT cluster. Technical report, Technische Universität München, May 1998. Deliverable D 4.1.1, ESPRIT Project 23174 — SISCI.

[193] M. Schulz. SCI-VM: A flexible base for transparent shared memory programming models on clusters of PCs. In J. Rolim, F. Müller, and et. al., editors, Parallel and Distributed Computing / Proceedings of HIPS ’99, volume 1586 of Lecture Notes in Computer Science, pages 19–33, Berlin, 1999. Springer Verlag.

[194] M. Schulz. SISCI Pthreads implementation report. Technical report, Technische Universität München, November 1999. Deliverable D 4.1.4, ESPRIT Project 23174 — SISCI.

[195] M. Schulz. True shared memory programming on SCI-based clusters, chapter 17, pages 291–311. Volume 1734 of Hellwagner and Reinefeld [75], October 1999. ISBN 3-540-66696-6.

[196] M. Schulz. Efficient Coherency and Synchronization Management in SCI based DSM systems. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 31–36. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[197] M. Schulz. Multithreaded Programming of PC clusters. In Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT) 2000, Philadelphia, PA, USA, pages 271–278. IEEE, October 2000.

[198] M. Schulz and W. Karl. Hybrid-DSM: An Efficient Alternative to Pure Software DSM Systems on NUMA Architectures. In L. Iftode and P. Keleher, editors, Proceedings of the Second International Workshop on Software Distributed Shared Memory (WSDSM), May 2000. Available at http://www.cs.rutgers.edu/˜wsdsm00/.

[199] M. Schulz, M. Völk, W. Karl, F. Munz, and S. Ziegler. Running a spectral analysis code on top of SCI shared memory using the TreadMarks API. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 35–43. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[200] M. Schulz, M. Völk, W. Karl, and S. Ziegler. Effiziente iterative PET-Bild Rekonstruktion auf einem Cluster von PCs. Journal of Radiation Oncology, Biology, Physics – Abstraktband des gemeinsamen Jahreskongresses der DEGRO, ÖGRO, DGMP, 176(1), October 2000.

[201] M. Scott. Is s–dsm dead? Keynote talk given at 2nd workshop for Software Distributed Shared Memory (WSDSM), Santa Fe, NM, USA, May 2000. Slides available at http://www.cs.rochester.edu/u/scott/interweave/WSDSM.pdf.

[202] S. Scott. A Cache Coherence Mechanism for Scalable Shared–Memory Multiprocessors, chapter 18. In Suzuki [214], 1992.

[203] S. Scott, J. Goodman, and K. Vernon. Performance of the SCI Ring. In Proceedings of the 19th International Symposium on Computer Architecture (ISCA), May 1992.

[204] D. Siguenza, F. Mora, and A. Sebastiá. SCI Network Simulations with VHDL. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 125–129. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[205] P. Sindhu, J. Frailong, and M. Cekleov. Formal Specification of Memory Models. In [48].

[206] J. Singh, W. Weber, and A. Gupta. Splash: Stanford parallel applications for shared–memory. Computer Architecture News, 22(1):5–44, 1992.

[207] SISCI Consortium. Standard Software Infrastructures for SCI-based Parallel Systems (SISCI). http://www.parallab.uib.no/projects/sisci/, August 1997. Supported by the EU under EP 23174.

[208] B. Skaali, B. Nossum, I. Birkeli, and D. Wormald. SCIview — SCI test, Verification, and Monitoring Instrumentation. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 47–52. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[209] E. Speight and J. Bennett. Brazos: A Third Generation DSM System. In Proceedings of the 1st USENIX Windows NT Workshop, pages 95–106, August 1997.

[210] A. Stamatakis. Evaluation of Interoperable Tool Deployment for the Late Development Phases of Distributed Object–Oriented Programs. Diplomarbeit, Technische Universität München, February 2001.

[211] T. Stephan. Erweiterung eines parallelen Rekonstruktionsprogramms von PET–Volumendaten um Komponenten zur Stapelverarbeitung und Lastverwaltung. Diplomarbeit, Technische Universität München, May 1997.

[212] R. Stets, D. Chen, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, G. Magklis, S. Parthasarathy, U. Rencuzogullari, and M. Scott. The Implementation of Cashmere. Technical report, Department of Computer Science, University of Rochester, Rochester, NY 14627–0226.

[213] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. CASHMERE-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP), October 1997.

[214] N. Suzuki, editor. Shared Memory Multiprocessing. MIT Press, 1992.

[215] M. Swanson, L. Stoller, and J. Carter. Making Distributed Shared Memory Simple, Yet Efficient. In Proceedings of the Workshop on High–Level Programming Models and Supportive Environments (HIPS) (held in conjunction with IPPS). IEEE, April 1998.

[216] A. Tanenbaum. Moderne Betriebssysteme. Carl Hanser Verlag & Prentice Hall, 1994.

[217] J. Tao, W. Karl, and M. Schulz. Understanding the Behavior of Shared Memory Applications Using the SMiLE Monitoring Framework. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 57–62. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[218] H. Taskin, R. Butenuth, and H. Heiss. SCI for TCP/IP with Linux. In H. Hellwagner and A. Reinefeld, editors, Proceedings of SCI-Europe ’98, a conference stream of EMMSEC ’98, pages 155–157. Cheshire Henbury, September 1998. ISBN: 1-901864-02-02.

[219] Technical Committee on Operating Systems and Application Environments of the IEEE. Portable Operating Systems Interface (POSIX) — Part 1: System Application Interface (API), chapter including 1003.1c: Amendment 2: Threads Extension [C Language]. IEEE, 1995 edition, 1996. ANSI/IEEE Std. 1003.1.

[220] U. Tiede, K. Hoehne, M. Bomans, A. Pommert, M. Riemer, and G. Wiebecke. Investigation of medical 3d–rendering algorithm. IEEE Computer Graphics & Applications, pages 41–53, March 1990.

[221] M. Trams and W. Rehm. A new generic and reconfigurable PCI–SCI bridge. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 113–120. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[222] C. Trinitis, M. Eberl, and W. Karl. Numerical Calculation of Electromagnetic Problems on an SCI Based PC-Cluster. In 2000 International Conference on Parallel Computing in Electrical Engineering (PAREL EC 2000), pages 166–170, Washington DC, USA, 2000. IEEE Computer Society.

[223] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating System Support for Improving Data Locality on CC–NUMA Compute Servers. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.

[224] M. Völk. Parallelisierung eines Spektralanalysealgorithmus mittels TreadMarks. Diplomarbeit, Technische Universität München, February 1999.

[225] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Proc. of the 19th Int’l Symposium on Computer Architecture (ISCA), May 1992.

[226] C. Wagner and F. Müller. Token–Based Read/Write–Locks for Distributed Mutual Exclusion. In A. Bode, T. Ludwig, W. Karl, and R. Wismüller, editors, Euro–Par 2000, Parallel Processing, volume 1900 of Lecture Notes in Computer Science (LNCS), pages 1185–1195. Springer Verlag, Berlin, September 2000.

[227] A. Walsch, V. Lindenstruth, and M. Schulz. A 1 MHz Transaction Processor Farm For High Energy Physics. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe 2000, The 3rd international conference on SCI–based technology and research, pages 89–99. SINTEF Electronics and Cybernetics, August 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.

[228] A. Watt and M. Watt. Advanced Animation and Rendering Techniques. Addison Wesley, 1992.

[229] J. Weidendorfer. Entwurf und Implementierung einer Socket-Bibliothek für ein SCI-Netzwerk. Diplomarbeit, Technische Universität München, 1997. Available at http://wwwbode.in.tum.de/˜weidendo/.

[230] G. Wilson. Linda–Like Systems and Their Implementation. Technical Report 91–13, Edinburgh Parallel Computing Centre, June 1991.

[231] G. Wilson and P. Lu, editors. Parallel Programming using C++. Scientific and Engineering Computation Series. Massachusetts Institute of Technology, 1996.

[232] P. Woest and J. Goodman. An Analysis of Shared–Memory Synchronization Mechanisms, chapter 17. In Suzuki [214], 1992.

[233] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH–2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), pages 24–36, June 1995.

[234] J. Worringen and T. Bemmerl. MPICH for SCI–Connected Clusters. In G. Horn and W. Karl, editors, Proceedings of SCI-Europe ’99, The 2nd international conference on SCI–based technology and research, pages 3–11. SINTEF Electronics and Cybernetics, September 1999. ISBN: 82-14-00014-9, Also available at http://wwwbode.in.tum.de/events/.

[235] WWW:. CRAY T3E Series . http://www.cray.com/products/systems/crayt3e/, November 1998.

[236] WWW:. HPCC - HPF (High Performance Fortran) . http://hpcc.soton.ac.uk/RandD/hpf/hpf.html, December 1999.

[237] WWW:. Welcome to Giganet . http://www.giganet.com/, May 1999.

[238] WWW:. ALINKA High Performance Linux Clustering . http://www.alinka.com/araisin.php3, December 2000.

[239] WWW:. Dolphin Interconnect — Home Page . http://www.dolphinics.no/, December 2000.

[240] WWW:. Dolphin Interconnect — Research Project Partners . http://208.179.47.35/resproject.html, December 2000.

[241] WWW:. Interconnect Systems Solutions, Home . http://www.iss-us.com/, July 2000.

[242] WWW:. NEPHEW Homepage . http://www.arttic.com/projects/NEPHEW/, January 2000.

[243] WWW:. Peakware — Matra Systems & Information . http://www.matra-msi.com/ang/savoir_infor_peakware_d.htm, January 2000.

[244] WWW:. RM 600 . http://www.fujitsu-siemens.com/servers/rm/rm_us/rm600e.htm, July 2000.

[245] WWW:. SMiLE: SCI-WG . http://wwwbode.in.tum.de/Par/arch/smile/sciwg/, January 2000.

[246] WWW:. DIAMANT: (EC IST 1999 12078) — Digital Film Manipulation System . http://diamant.joanneum.ac.at/, February 2001.

[247] WWW:. Innovative scientific, technological, engineering solutions (Home page of AEA Technology) . http://www.aeat.com/, February 2001.

[248] WWW:. Precision Software GmbH — Threads.h++ . http://www.precisions.de/de/tp/rogue/cpp/threads/, January 2001.

[249] WWW:. Scali Home Page: Scalable Linux Systems — Affordable Supercomput- ing — Cluster Technology . http://www.scali.com/, February 2001.

[250] WWW:. SCILAB Technology AS . http://www.scilabtech.com/, January 2001.

[251] WWW:. Sun Servers (Hardware, SUN Enterprise (TM) servers) . http://www.sun.com/servers, February 2001.

[252] WWW:. Welcome to CFX — The Fluid Approach to CFD Business Solutions . http://www.software.aeat.com/cfx/default.asp, February 2001.

[253] WWW:. Welcome to PCI SIG . http://www.pcisig.com/, February 2001.

[254] M. Zekauskas, W. Sawdon, and B. Bershad. Software Write Detection for a Distributed Shared Memory. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI), 1994.

[255] S. Zhou, M. Stumm, K. Li, and D. Wortmann. Heterogeneous distributed shared memory. IEEE Transactions on Parallel and Distributed Systems, 3(5):540–554, September 1992.

[256] Y. Zhou, L. Iftode, J. Singh, K. Li, B. Toonen, I. Schoinas, M. Hill, and D. Wood. Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. In Proceedings of Principles and Practice of Parallel Programming (PPoPP), 1997.
