Extending the C++ Asynchronous Programming Model with the HPX Runtime System for Distributed Memory Computing


Erweiterung des asynchronen C++ Programmiermodels mithilfe des HPX Laufzeitsystems für verteiltes Rechnen

Submitted to the Faculty of Engineering of Friedrich-Alexander-Universität Erlangen-Nürnberg for the degree of Doktor-Ingenieur by Thomas Heller from Neuendettelsau. Accepted as a doctoral thesis by the Faculty of Engineering of Friedrich-Alexander-Universität Erlangen-Nürnberg. Date of the oral examination: 28.02.2019. Chair of the doctoral committee: Prof. Dr.-Ing. Reinhard Lerch. Reviewers: Prof. Dr.-Ing. Dietmar Fey, Prof. Dr. Thomas Fahringer.

To Steffi, Felix and Hanna

Acknowledgement

This thesis was written at the Chair for Computer Science 3 (Computer Architecture) of the Friedrich-Alexander-University Erlangen-Nuremberg. I would like to thank everyone who was involved in creating this work in one way or another. Special thanks go to Prof. Dr.-Ing. Dietmar Fey, who took on the role of supervisor of this doctoral thesis; I thank him for his support, his trust, and all the helpful discussions that led to the success of this thesis. Additionally, I would like to thank Prof. Dr. Thomas Fahringer for his support and for accepting the role of reviewer.

In addition, I would like to thank all students who contributed to the project, whether through bachelor or master theses or as part of the team in other ways, and my colleagues for all the fruitful discussions that helped to further develop the ideas presented in this thesis. I would like to thank Dr. Alice Koeniges for providing me with access to the NERSC supercomputers and the RRZE for providing access to the Meggie cluster.

Furthermore, I would like to thank the STE||AR-Group, especially Dr. Hartmut Kaiser. Without the help and support of the group, this thesis would not have been possible; the group helped to develop a stable product, which is the foundation of this very thesis. Hartmut is and was an excellent mentor who drove my research in various ways and helped develop my academic career. Last but not least, I would like to thank my family, especially my wife and children, for their all-embracing support, without which this thesis could not have been accomplished in the first place.

Abstract

This thesis presents a fully Asynchronous Many Task (AMT) runtime system extending the C++ programming language. Its focus is the definition of a distributed, asynchronous C++ programming model based on the C++ standard, together with performance-portable Application Programming Interfaces (APIs) for shared and distributed memory computing as well as accelerators.

With the rise of multi- and many-core architectures, the C++ language was amended with support for concurrency and parallelism. This work derives its methodology for massive parallelism from this industry standard and extends it with fine-grained user-level threads as well as distributed computing, allowing large-scale supercomputers to employ the same syntax and semantics for remote and local operations. By leveraging asynchronous, task-based message passing built on a one-sided Remote Procedure Call (RPC) mechanism, the overarching principle of "work follows data" manifests: work is executed where the data resides.
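As a concrete illustration of this one-sided RPC mechanism, the following minimal sketch (my illustration, not code from the thesis; it assumes a reasonably recent HPX where the catch-all header hpx/hpx.hpp and the HPX_PLAIN_ACTION macro are available) registers an ordinary function as an action and invokes it on another locality, receiving the result as a future:

    // Hypothetical example: run a function where the (possibly remote) data lives.
    #include <hpx/hpx_main.hpp>   // lets plain main() run inside the HPX runtime
    #include <hpx/hpx.hpp>

    #include <cstdint>
    #include <iostream>
    #include <vector>

    std::uint64_t square(std::uint64_t x) { return x * x; }

    // Turn the function into a remotely invocable action (one-sided RPC).
    HPX_PLAIN_ACTION(square, square_action)

    int main()
    {
        std::vector<hpx::id_type> localities = hpx::find_all_localities();

        // Ship the work to the last known locality; the caller immediately
        // receives a future instead of waiting for the remote result.
        square_action act;
        hpx::future<std::uint64_t> f = hpx::async(act, localities.back(), 21);

        std::cout << f.get() << "\n";   // prints 441
        return 0;
    }

The call site is identical for local and remote execution; only the locality id passed to hpx::async changes, which is exactly the "same syntax and semantics for remote and local operations" the abstract refers to.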
By leveraging the asynchronous, task-based nature of the future as a handle to an asynchronously computed result, the term Futurization is coined: a technique based on Continuation Passing Style (CPS) programming that allows dealing with millions of concurrently running asynchronous tasks. By attaching continuations, dynamic dependency graphs form naturally from the regular control flow of the code, and the runtime system parallelizes the program by executing multiple continuations in parallel. In other words, future-based synchronization expresses fine-grained constraints. Furthermore, Futurization blends in naturally with other well-known techniques, such as data parallelism; those other paradigms can be built on top of it.

The technique mentioned above provides the necessary foundation to address the needs of modern scientific applications targeting High Performance Computing (HPC) platforms. However, it is also essential to address the challenge of handling ever more complicated architectures, such as differing memory access latencies and accelerators. This thesis approaches that challenge by providing the means to define computational and memory targets, reusing already defined or upcoming C++ concepts, and by linking them together to intensify the principle of work follows data.

The feasibility of this approach is demonstrated with a set of low-level micro-benchmarks showing that the provided abstractions come with minimal overhead. A 2D stencil example, which attests to the programmability of Futurization as well as its performance benefits, serves as the second benchmark. Lastly, the experimental section concludes with results from futurizing the astrophysics application OctoTiger, a 3D octree Adaptive Mesh Refinement (AMR) based binary star simulation, running at extreme scales.
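To make the Futurization technique described above concrete, here is a minimal sketch (my illustration, not code from the thesis; it assumes a recent HPX that provides hpx::async, future::then and hpx::dataflow through the catch-all header hpx/hpx.hpp). Continuations are attached to futures, so the dependency graph emerges from the ordinary control flow and the runtime is free to run independent branches in parallel:

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>

    #include <iostream>
    #include <utility>

    int main()
    {
        // Two independent asynchronous tasks, each represented by a future.
        hpx::future<int> a = hpx::async([] { return 21; });
        hpx::future<int> b = hpx::async([] { return 20; });

        // Chaining: c depends on a and runs as soon as a becomes ready.
        hpx::future<int> c = a.then([](hpx::future<int> f) { return f.get() + 1; });

        // Joining: d depends on both c and b; hpx::dataflow schedules the
        // lambda only once all of its future arguments are ready.
        hpx::future<int> d = hpx::dataflow(
            [](hpx::future<int> x, hpx::future<int> y) { return x.get() + y.get(); },
            std::move(c), std::move(b));

        std::cout << d.get() << "\n";   // prints 42
        return 0;
    }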
Kurzübersicht

This thesis presents a fully Asynchronous Many Task (AMT) runtime system. The focus lies on defining the required concepts on the basis of the C++ programming language. In addition, portable APIs for computing on distributed systems and accelerator hardware are introduced.

With the advent of multi- and many-core architectures, the C++ programming language was extended with support for concurrency and parallelism. This work derives the methodology for massive parallelism from this industry standard and complements it with fine-grained user-level threads as well as distributed computing. This makes it possible to use large supercomputers with the same syntax and semantics for remote and local operations. Through asynchronous, task-based message exchange via one-sided Remote Procedure Calls (RPC), the all-encompassing principle of "work follows data" emerges, i.e. the work is executed where the data resides.

The term Futurization is coined as the basis of Continuation Passing Style (CPS) programming. It builds on the future as a handle for expressing asynchronous, task-based results. This technique makes it possible to handle millions of concurrent, asynchronous tasks. By attaching continuations, dynamic dependency graphs are created that are easy to determine as a by-product of the regular control flow; as a result, several of these continuations can be processed in parallel by the runtime system. This future-based synchronization makes it possible to specify fine-grained conditions for correct execution. Moreover, Futurization enables the implementation of other programming paradigms such as data parallelism.

This technique provides the necessary foundation to meet the demands of modern scientific applications for High Performance Computing (HPC) platforms. However, the challenge of programming ever more complicated architectures efficiently keeps growing; examples are differing memory access latencies and hardware accelerators. This thesis pursues the goal of solving this task by providing the necessary means through the reuse of already defined or upcoming concepts from the C++ standard.

The results of this approach are presented through the evaluation of several benchmarks. First, measurements with various micro-benchmarks show that the overhead of the provided abstractions is minimal. Both programmability and performance are then demonstrated with a 2D stencil application. The thesis concludes with the application OctoTiger, a 3D octree-based Adaptive Mesh Refinement (AMR) astrophysics simulation, which is ported to one of today's largest supercomputers by means of Futurization.

Contents

1 Introduction
2 Related Work
3 Parallelism and Concurrency in the C++ Programming Language
  3.1 Low-Level Abstractions
    3.1.1 Memory Model
    3.1.2 Concurrency Support
    3.1.3 Task-Parallelism Support
  3.2 Higher Level Parallelism
    3.2.1 Concepts of Parallelism
    3.2.2 Parallel Algorithms
    3.2.3 Fork-Join Based Parallelism
  3.3 Evolution
    3.3.1 Executors
    3.3.2 Support for heterogeneous architectures and Distributed Computing
    3.3.3 Futurization
    3.3.4 Coroutines and Parallelism
4 The HPX Parallel Runtime System
  4.1 Local Thread Management
  4.2 Active Global Address Space
    4.2.1 Processes in Active Global Address Space (AGAS) – Localities
    4.2.2 C++ Objects in AGAS – Components
    4.2.3 Global Reference Counting
    4.2.4 Resolving Globally unique Identifier (GID)s to …
Recommended publications
  • Integration of CUDA Processing Within the C++ Library for Parallelism and Concurrency (HPX)
    Integration of CUDA Processing within the C++ Library for Parallelism and Concurrency (HPX)
    Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser

    Abstract—Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster, while fully using all available CPU resources, and to integrate the GPU work into the overall programming model.

    For the integration of CUDA code we extended HPX, a general purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device as well as the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which makes it possible to seamlessly overlap any GPU operation with work on the main cores. Any user-defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations of the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities.
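    The core idea here, exposing a GPU transfer or kernel launch as a future that composes with CPU work, can be sketched without HPX's own CUDA integration (whose exact API is not reproduced here). The snippet below is my own assumption-laden illustration using only the plain CUDA runtime API and std::future: completion of a stream is signalled through a promise, so the rest of the program can treat the GPU work like any other asynchronous task.

        // Sketch of the underlying idea only: NOT the HPX CUDA API from the paper.
        #include <cuda_runtime.h>

        #include <future>
        #include <iostream>
        #include <vector>

        // Host callback enqueued behind all prior work on the stream.
        void notify(void* user_data)
        {
            static_cast<std::promise<void>*>(user_data)->set_value();
        }

        int main()
        {
            std::vector<float> host(1024, 1.0f);
            float* device = nullptr;
            cudaMalloc(&device, host.size() * sizeof(float));

            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // Enqueue an asynchronous host-to-device transfer (a kernel launch on
            // the same stream would follow here) ...
            cudaMemcpyAsync(device, host.data(), host.size() * sizeof(float),
                            cudaMemcpyHostToDevice, stream);

            // ... and expose completion of the stream as a std::future.
            std::promise<void> done;
            std::future<void> ready = done.get_future();
            cudaLaunchHostFunc(stream, notify, &done);

            // The CPU can do unrelated work here; the future is the sync point.
            ready.wait();
            std::cout << "GPU work finished\n";

            cudaFree(device);
            cudaStreamDestroy(stream);
            return 0;
        }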
  • MATRIX: Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    MATRIX: Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos
    Illinois Institute of Technology, Chicago, IL, USA; {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

    Abstract — Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by maximizing the bandwidth and minimizing the latency of the use of interconnection networks and storage, but suffer from the lack of scalable solutions to expose the intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing parallelism and locality of exascale systems. MTC applications are typically structured as directed acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

    [...] Stanford University. Finally, HPX is a general purpose C++ runtime system for parallel and distributed applications of any scale developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource managing systems aimed towards data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have [...]
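    The description of MTC applications as directed acyclic graphs of short tasks with explicit input/output dependencies can be mirrored directly in code. Below is a tiny, framework-agnostic sketch of my own (standard C++ only, no particular MTC runtime) with two producer tasks feeding one consumer:

        #include <future>
        #include <iostream>

        int main()
        {
            // Two independent "short tasks" (the leaves of the DAG).
            std::future<int> a = std::async(std::launch::async, [] { return 3; });
            std::future<int> b = std::async(std::launch::async, [] { return 4; });

            // A third task whose inputs are the outputs of a and b; the DAG edges
            // are spelled out by which futures the task consumes.
            std::future<int> c = std::async(std::launch::async,
                [fa = std::move(a), fb = std::move(b)]() mutable {
                    return fa.get() + fb.get();   // waits inside the task, not in main
                });

            std::cout << c.get() << "\n";   // prints 7
            return 0;
        }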
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University, LSU Digital Commons: LSU Doctoral Dissertations, Graduate School, 10-16-2020

    Adaptive Data Migration in Load-Imbalanced HPC Applications
    Parsa Amini, Louisiana State University and Agricultural and Mechanical College

    Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations (Part of the Computer Sciences Commons)

    Recommended Citation: Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370

    This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contact [email protected].

    ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS. A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science, by Parsa Amini, B.S., Shahed University, 2013; M.S., New Mexico State University, 2015. December 2020.

    Acknowledgments: This effort has been possible thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me to be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
  • MVAPICH2 2.2 User Guide
    MVAPICH2 2.2 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: October 19, 2017

    Contents
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.2 Features
    4 Installation Instructions
      4.1 Building from a tarball
      4.2 Obtaining and Building the Source from SVN repository
      4.3 Selecting a Process Manager
        4.3.1 Customizing Commands Used by mpirun_rsh
        4.3.2 Using SLURM
        4.3.3 Using SLURM with support for PMI Extensions
      4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
      4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
      4.6 Configuring a build for Shared-Memory-CH3
      4.7 Configuring a build for OFA-IB-Nemesis
      4.8 Configuring a build for Intel TrueScale (PSM-CH3)
      4.9 Configuring a build for Intel Omni-Path (PSM2-CH3)
      4.10 Configuring a build for TCP/IP-Nemesis
      4.11 Configuring a build for TCP/IP-CH3
      4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
      4.13 Configuring a build for Shared-Memory-Nemesis
    5 Basic Usage Instructions
      5.1 Compile Applications
      5.2 Run Applications
        5.2.1 Run using mpirun_rsh …
  • HPX – a Task Based Programming Model in a Global Address Space
    HPX – A Task Based Programming Model in a Global Address Space
    Hartmut Kaiser (1), Thomas Heller (2), Bryce Adelstein-Lelbach (1), Adrian Serio (1), Dietmar Fey (2)
    (1) Center for Computation and Technology, Louisiana State University, Louisiana, U.S.A.; (2) Computer Science 3, Computer Architectures, Friedrich-Alexander-University, Erlangen, Germany

    ABSTRACT: The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully explore the rewards of today's and tomorrow's systems. We present HPX – a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint-based parallelism, and support run- [...]

    1. INTRODUCTION: Today's programming models, languages, and related technologies that have sustained High Performance Computing (HPC) application software development for the past decade are facing major problems when it comes to programmability and performance portability of future systems. The significant increase in complexity of new platforms due to energy constraints, increasing parallelism and major changes to processor and memory architecture, requires advanced programming techniques that are portable across multiple future generations of machines [1]. A fundamentally new approach is required to address these challenges.
  • Scalable and High Performance MPI Design for Very Large InfiniBand Clusters
    SCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS
    Dissertation presented in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Graduate School of The Ohio State University, by Sayantan Sur, B. Tech. The Ohio State University, 2007.
    Dissertation Committee: Prof. D. K. Panda (Adviser), Prof. P. Sadayappan, Prof. S. Parthasarathy. Graduate Program in Computer Science and Engineering. Copyright by Sayantan Sur, 2007.

    ABSTRACT: In the past decade, rapid advances have taken place in the field of computer and network design, enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There is a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance for these scientific applications, it is very critical for the design of the MPI libraries to be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open standards and is gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens of thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computation and communication, improved performance of collective operations, and finally designing application-level benchmarks to make efficient use of modern networking technology.
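    The "increased overlap of computation and communication" mentioned above is the classic non-blocking MPI pattern: post MPI_Isend/MPI_Irecv, compute on data that does not depend on the transfer, and wait only afterwards. A generic sketch of my own (not code from the dissertation) of a ring exchange:

        #include <mpi.h>

        #include <vector>

        int main(int argc, char** argv)
        {
            MPI_Init(&argc, &argv);
            int rank = 0, size = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int n = 1 << 20;
            std::vector<double> send(n, rank), recv(n, 0.0);
            int right = (rank + 1) % size;
            int left  = (rank - 1 + size) % size;

            // Post the non-blocking transfers first ...
            MPI_Request reqs[2];
            MPI_Irecv(recv.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

            // ... overlap them with computation that does not touch recv ...
            double local = 0.0;
            for (double x : send) local += x * x;

            // ... and only wait once the incoming data is actually needed.
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            MPI_Finalize();
            return 0;
        }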
  • Beowulf Clusters — an Overview
    Beowulf clusters — an overview. Åsmund Ødegård, WinterSchool 2001, April 4, 2001.

    Contents: Introduction; What is a Beowulf; The history of Beowulf; Who can build a Beowulf; How to design a Beowulf; Beowulfs in more detail; Rules of thumb; What Beowulfs are Good For; Experiments; 3D nonlinear acoustic fields; Incompressible Navier–Stokes; 3D nonlinear water wave.

    Introduction. Why clusters? "Work harder": more CPU power, more memory, more everything. "Work smarter": better algorithms. "Get help": let more boxes work together to solve the problem, i.e. parallel processing. (Categories by Greg Pfister.)

    Beowulfs in the Parallel Computing picture (diagram showing): Parallel Computing, MetaComputing, Clusters, Tightly Coupled, Vector, WS farms, Pile of PCs, NOW, NT/Win2k Clusters, Beowulf, CC-NUMA.

    What is a Beowulf? Mass-market commodity off the shelf (COTS) hardware; a low-cost local area network (LAN); an Open Source, UNIX-like operating system (OS); executing parallel applications programmed with a message passing model (MPI); anything from small systems to large, fast systems (the fastest ranks as no. 84 on today's Top500); the best price/performance system available for many applications. Philosophy: the cheapest system available which solves your problem in reasonable time.

    The history of Beowulf. 1993: perfect conditions for the first Beowulf: a major CPU performance advance (80286 → 80386), DRAM of reasonable cost and density (8 MB), disk drives of several 100 MBs available for PCs, Ethernet (10 Mbps) controllers and hubs cheap enough, Linux improving rapidly and in a usable state, and PVM widely accepted as a cross-platform message passing model. Clustering was done with commercial UNIX, but the cost was high.
  • MVAPICH2 2.3 User Guide
    MVAPICH2 2.3 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2018 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: February 19, 2018

    Contents
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.3 Features
    4 Installation Instructions
      4.1 Building from a tarball
      4.2 Obtaining and Building the Source from SVN repository
      4.3 Selecting a Process Manager
        4.3.1 Customizing Commands Used by mpirun_rsh
        4.3.2 Using SLURM
        4.3.3 Using SLURM with support for PMI Extensions
      4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
      4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
      4.6 Configuring a build to support running jobs across multiple InfiniBand subnets
      4.7 Configuring a build for Shared-Memory-CH3
      4.8 Configuring a build for OFA-IB-Nemesis
      4.9 Configuring a build for Intel TrueScale (PSM-CH3)
      4.10 Configuring a build for Intel Omni-Path (PSM2-CH3)
      4.11 Configuring a build for TCP/IP-Nemesis
      4.12 Configuring a build for TCP/IP-CH3
      4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
      4.14 Configuring a build for Shared-Memory-Nemesis
      4.15 Configuration and Installation with Singularity …
  • Composable Concurrency Models
    Composable Concurrency Models
    Dan Stelljes, Division of Science and Mathematics, University of Minnesota, Morris, Morris, Minnesota, USA 56267

    ABSTRACT: The need to manage concurrent operations in applications has led to the development of a variety of concurrency models. Modern programming languages generally provide several concurrency models to serve different requirements, and programmers benefit from being able to use them in tandem. We discuss challenges surrounding concurrent programming and examine situations in which conflicts between models can occur. Additionally, we describe attempts to identify features of common concurrency models and develop lower-level abstractions capable of supporting a variety of models.

    1. INTRODUCTION: Most interactive computer programs depend on concurrency, the ability to perform different tasks at the same time. A web browser, for instance, might at any point be rendering documents in multiple tabs, transferring files, and [...] Given that an application is likely to make use of more than one concurrency model, programmers would prefer that different types of models could safely interact. However, different models do not necessarily work well together, nor are they designed to. Recent work has attempted to identify common "building blocks" that could be used to compose a variety of models, possibly eliminating subtle problems when different models interact and allowing models to be represented at lower levels without resorting to rough approximations [9, 11, 12].

    2. BACKGROUND: In a concurrent program, the history of operations may not be the same for every execution. An entirely sequential program could be proved to be correct by showing that its history (that is, the sequence in which its operations are performed) always yields a correct result.
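    The point that "the history of operations may not be the same for every execution" is easy to demonstrate. In this small sketch of my own (not from the paper), the lock makes the result independent of how the two threads interleave; removing it turns the increments into a data race and the printed value varies between runs:

        #include <iostream>
        #include <mutex>
        #include <thread>

        int main()
        {
            long counter = 0;
            std::mutex m;

            auto work = [&] {
                for (int i = 0; i < 100000; ++i) {
                    std::lock_guard<std::mutex> lock(m);   // remove this line and the
                    ++counter;                             // interleaved increments race
                }
            };

            std::thread t1(work), t2(work);
            t1.join();
            t2.join();

            std::cout << counter << "\n";   // always 200000 with the lock held
            return 0;
        }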
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville. TRACE: Tennessee Research and Creative Exchange, Doctoral Dissertations, Graduate School, 12-2019

    Improving MPI Threading Support for Current Hardware Architectures
    Thananon Patinyasakdikul, University of Tennessee, [email protected]

    Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss

    Recommended Citation: Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures." PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631

    This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

    To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor. We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.)

    Improving MPI Threading Support for Current Hardware Architectures: A Dissertation Presented for the Doctor of Philosophy Degree, The University of Tennessee, Knoxville. Thananon Patinyasakdikul, December 2019. Copyright © by Thananon Patinyasakdikul, 2019. All Rights Reserved.

    To my parents Thanawij and Issaree Patinyasakdikul, my little brother Thanarat Patinyasakdikul for their love, trust and support.
  • Reactive Async: Expressive Deterministic Concurrency
    Reactive Async: Expressive Deterministic Concurrency
    Philipp Haller, Simon Geries (KTH Royal Institute of Technology, Sweden); Michael Eichberg, Guido Salvaneschi (TU Darmstadt, Germany)

    Abstract: Concurrent programming is infamous for its difficulty. An important source of difficulty is non-determinism, stemming from unpredictable interleavings of concurrent activities. Futures and promises are widely-used abstractions that help designing deterministic concurrent programs, although this property cannot be guaranteed statically in mainstream programming languages. Deterministic-by-construction concurrent programming models avoid this issue, but they typically restrict expressiveness in important ways. This paper introduces a concurrent programming model, Reactive Async, which decouples concurrent computations using [...]

    [...]tion can generate non-deterministic results because of their unpredictable scheduling. Traditionally, developers address these problems by protecting state from concurrent access via synchronisation. Yet, concurrent programming remains an art: insufficient synchronisation leads to unsound programs but synchronising too much does not exploit hardware capabilities effectively for parallel execution. Over the years researchers have proposed concurrency models that attempt to overcome these issues. For example the actor model [13] encapsulates state into actors which communicate via asynchronous messages, a solution that avoids shared state and hence (low-level) race conditions [...]
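    The abstract's claim that futures and promises help in writing deterministic concurrent programs, without that property being statically guaranteed, can be seen in a few lines of standard C++ (my illustration; the paper itself works in Scala):

        #include <future>
        #include <iostream>
        #include <thread>

        int main()
        {
            std::promise<int> p;
            std::future<int> f = p.get_future();

            // Producer and consumer run concurrently, yet the value the consumer
            // observes does not depend on how the threads are scheduled.
            std::thread producer([&p] { p.set_value(42); });
            std::thread consumer([&f] { std::cout << f.get() << "\n"; });

            producer.join();
            consumer.join();

            // The discipline is dynamic, not static: a second p.set_value(0) here
            // would throw std::future_error (promise_already_satisfied).
            return 0;
        }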
  • A Reduction Semantics for Direct-Style Asynchronous Observables
    Journal of Logical and Algebraic Methods in Programming 105 (2019) 75–111. Contents lists available at ScienceDirect. www.elsevier.com/locate/jlamp

    A reduction semantics for direct-style asynchronous observables
    Philipp Haller (a), Heather Miller (b)
    (a) KTH Royal Institute of Technology, Sweden; (b) Carnegie Mellon University, USA

    Article history: Received 31 January 2016; Received in revised form 6 March 2019; Accepted 6 March 2019; Available online 18 March 2019.

    Abstract: Asynchronous programming has gained in importance, not only due to hardware developments like multi-core processors, but also due to pervasive asynchronicity in client-side Web programming and large-scale Web applications. However, asynchronous programming is challenging. For example, control-flow management and error handling are much more complex in an asynchronous than a synchronous context. Programming with asynchronous event streams is especially difficult: expressing asynchronous stream producers and consumers requires explicit state machines in continuation-passing style when using widely-used languages like Java. In order to address this challenge, recent language designs like Google's Dart introduce asynchronous generators which allow expressing complex asynchronous programs in a familiar blocking style while using efficient non-blocking concurrency control under the hood. However, several issues remain unresolved, including the integration of analogous constructs into statically-typed languages, and the formalization and proof of important correctness properties. This paper presents a design for asynchronous stream generators for Scala, thereby extending previous facilities for asynchronous programming in Scala from tasks/futures to asynchronous streams.
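    The difficulty the abstract describes (consuming a stream in continuation-passing style forces the consumer to keep its state explicitly between callbacks, instead of using ordinary control flow) can be sketched in a few lines. This is my own illustration, in C++ rather than the paper's Scala, and deliberately synchronous so that only the callback shape is visible:

        #include <functional>
        #include <iostream>

        // A toy push-based stream: the producer invokes on_next for every element
        // and on_done at the end. The consumer cannot use loops or early returns;
        // its state lives outside the callbacks.
        struct IntStream {
            std::function<void(int)> on_next;
            std::function<void()>    on_done;
        };

        void produce(const IntStream& s)
        {
            for (int i = 1; i <= 3; ++i) s.on_next(i);
            s.on_done();
        }

        int main()
        {
            int sum = 0;   // explicit consumer state, threaded through the callbacks
            IntStream s;
            s.on_next = [&sum](int v) { sum += v; };
            s.on_done = [&sum]() { std::cout << "sum = " << sum << "\n"; };
            produce(s);
            return 0;
        }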