Resilience in High-Level Parallel Programming Languages


Resilience in High-Level Parallel Programming Languages

Sara S. Hamouda

A thesis submitted for the degree of Doctor of Philosophy at The Australian National University, June 2019.

© Sara S. Hamouda 2019

Except where otherwise indicated, this thesis is my own original work.
Sara S. Hamouda, 13 October 2018

To my parents, Hoda and Salem.

Acknowledgements

I grant all my gratitude to Almighty Allah, for blessing me with this wonderful learning journey and giving me the strength to complete it. Crossing the finish line would not have been possible without the great support and assistance of a group of wonderful people.

First of all, my supervisors, Josh Milthorpe, Peter Strazdins and Steve Blackburn, offered their ultimate guidance and support to remove every obstacle in the way and enable me to achieve my goals. I wish to express my deepest appreciation to my primary supervisor, Josh Milthorpe, who dedicated many hours to discussing my work, refining my ideas, connecting me to collaborators, and challenging me with difficult questions to help me learn more. Thank you, Josh, for all the favors you showed me during these years. Without your encouragement and generous support, this thesis would not have been possible.

I am very grateful to the X10 team: David Grove, Olivier Tardieu, Vijay Saraswat, Benjamin Herta, Louis Mandel, and Arun Iyengar, for giving me the opportunity to collaborate with them in the Resilient X10 project, and for the invaluable experience I gained from this collaboration. I am also very grateful to the computer systems group in the Research School of Computer Science at the ANU for the valuable feedback and recommendations given during the PhD monitoring sessions. I would like to thank J. Eliot B. Moss for valuable discussions about transactional memory systems during his visit to the ANU, and Peter Strazdins for valuable discussions that led to the design of the resilience taxonomy described in this thesis.

I am grateful to the Australian National University for providing me with the HDR merit and stipend scholarship. This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.

Finally, I would like to thank my family and friends for the continuous encouragement and the happy moments they favored me with. My nephews, Nour Eldeen, Omar, and Eyad, are now too young to read these words, but I hope one day they will open my thesis and read this line to know how much I love them and how much chatting with them on a regular basis was my main source of happiness during my remote study years. To my dear family, Hoda, Salem, Mai, Ahmed and Esraa, thanks for supporting me every step of the way.

Abstract

The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting task parallelism and resilience a necessity in HPC programming models. Given the complexity of managing multi-threaded distributed execution in the presence of failures, there is a critical need for task-parallel abstractions that simplify writing efficient, modular, and understandable fault-tolerant applications.

MPI User-Level Failure Mitigation (MPI-ULFM) is an emerging fault-tolerant specification of MPI. It supports failure detection by returning special error codes and provides new interfaces for failure mitigation.
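As a minimal illustration of this style of failure reporting, the C++ sketch below (assuming a ULFM-enabled MPI that provides the MPIX_* extensions) checks the error class returned by a collective and rebuilds a communicator from the survivors. It is an editorial sketch, not code from the thesis.

    // Minimal sketch of ULFM-style failure handling (assumes a ULFM-enabled MPI
    // that declares the MPIX_* extensions in <mpi-ext.h>).
    #include <mpi.h>
    #include <mpi-ext.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;
        // Report failures through return codes instead of aborting the job.
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(comm);              // any operation may observe a failure
        int eclass = MPI_SUCCESS;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPIX_Comm_revoke(comm);              // propagate the failure to all ranks
            MPI_Comm survivors;
            MPIX_Comm_shrink(comm, &survivors);  // rebuild a communicator without dead ranks
            // ... application-specific recovery using 'survivors'
            std::printf("recovered after a process failure\n");
        }
        MPI_Finalize();
        return 0;
    }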
Unfortunately, the unstructured form of failure reporting provided by MPI-ULFM hinders the composability and the clarity of fault-tolerant programs. The low-level programming model of MPI and the simplistic failure reporting mechanism adopted by MPI-ULFM make MPI-ULFM more suitable as a low-level communication layer for resilient high-level languages, rather than as a direct programming model for application development.

The asynchronous partitioned global address space (APGAS) model is a high-level programming model designed to improve the productivity of developing large-scale applications. It represents a computation as a global control flow of nested parallel tasks that use global data partitioned among processes. Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model, async-finish, and by providing structured failure reporting through exceptions. Unfortunately, the current implementation of the resilient async-finish model results in a high performance overhead that can restrict the scalability of applications. Moreover, the lack of data resilience support limits the productivity of the model, as it shifts the challenges of handling data availability and atomicity under failure to the programmer.

In this thesis, we demonstrate that resilient APGAS languages can achieve scalable performance under failure by exploiting fault tolerance features in emerging communication libraries such as MPI-ULFM. We propose multi-resolution resilience, in which high-level resilient constructs are composed from efficient lower-level resilient constructs, as an approach for bridging the gap between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance. To address the limited resilience efficiency of the async-finish model, we propose 'optimistic finish', a message-optimal resilient termination detection protocol for the finish construct. To improve programmer productivity, we augment the APGAS model with resilient data stores that can simplify preserving critical application data in the presence of failure. In addition, we propose the 'transactional finish' construct as a productive mechanism for handling atomic updates on resilient data. Finally, we demonstrate the multi-resolution resilience approach by designing high-level resilient application frameworks based on the async-finish model.

We implemented the above enhancements in the X10 language, an embodiment of the APGAS model, and performed an empirical evaluation of the performance of resilient X10 using micro-benchmarks and a suite of transactional and non-transactional resilient applications. Concepts of the APGAS model are realized in multiple programming languages, which can benefit from the conceptual and technical contributions of this thesis. The presented empirical evaluation results will aid future comparisons with other resilient programming models.
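The structured failure reporting described above can be pictured with a small, language-neutral analogue: tasks are spawned asynchronously, and a single enclosing "finish" point waits for all of them and collects failures as exceptions instead of scattering error codes through the code. The C++ sketch below uses standard futures to convey the idea; it is an illustrative analogue, not X10 syntax and not the thesis implementation.

    #include <future>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Simulated per-place work: one "place" fails to mimic a process failure.
    void work(int place) {
        if (place == 2) throw std::runtime_error("place 2 failed");
        // ... real computation for this place
    }

    int main() {
        std::vector<std::future<void>> tasks;                 // the "async" part
        for (int p = 0; p < 4; ++p)
            tasks.push_back(std::async(std::launch::async, work, p));

        std::vector<std::string> failures;                    // the "finish" part: one
        for (auto& t : tasks) {                               // synchronization point that
            try { t.get(); }                                  // gathers every failure
            catch (const std::exception& e) { failures.push_back(e.what()); }
        }
        for (const auto& f : failures)
            std::cout << "recovering from: " << f << "\n";    // recovery logic lives here
        return 0;
    }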
Contents

Acknowledgements
Abstract

1 Introduction
  1.1 Background and Motivation
  1.2 Multi-Resolution Resilience
  1.3 Problem Statement
  1.4 Research Questions
  1.5 Thesis Statement
  1.6 Contributions
  1.7 Thesis Outline

2 Background and Related Work
  2.1 Resilience Definition
  2.2 Taxonomy of Resilient Programming Models
    2.2.1 Adaptability
      2.2.1.1 Resource Allocation
      2.2.1.2 Resource Mapping
    2.2.2 Fault Tolerance
      2.2.2.1 Fault Type
      2.2.2.2 Fault Level
      2.2.2.3 Recovery Level
      2.2.2.4 Fault Detection
      2.2.2.5 Fault Tolerance Technique
    2.2.3 Performance
      2.2.3.1 Performance Recovery
  2.3 Resilience Support in Distributed Programming Models
    2.3.1 Message-Passing Model
      2.3.1.1 MPICH-V
      2.3.1.2 FMI
      2.3.1.3 rMPI
      2.3.1.4 RedMPI
      2.3.1.5 FT-MPI
      2.3.1.6 MPI-ULFM
      2.3.1.7 FA-MPI
    2.3.2 The Partitioned Global Address Space Model
      2.3.2.1 GASNet
      2.3.2.2 UPC
      2.3.2.3 Fortran Coarrays
      2.3.2.4 GASPI
      2.3.2.5 OpenSHMEM
      2.3.2.6 Global Arrays
      2.3.2.7 Global View Resilience
    2.3.3 The Asynchronous Partitioned Global Address Space Model
      2.3.3.1 Chapel
      2.3.3.2 X10
    2.3.4 The Actor Model
      2.3.4.1 Charm++
      2.3.4.2 Erlang
      2.3.4.3 Akka
      2.3.4.4 Orleans
    2.3.5 The Dataflow Programming Model
      2.3.5.1 OCR
      2.3.5.2 Legion
      2.3.5.3 NABBIT and PaRSEC
      2.3.5.4 Spark
    2.3.6 Review Conclusions
  2.4 The X10 Programming Model
    2.4.1 Task Parallelism
    2.4.2 The Happens-Before Constraint
    2.4.3 Global Data
    2.4.4 Resilient X10
    2.4.5 Elastic X10
      2.4.5.1 The PlaceManager API
      2.4.5.2 Place Virtualization
  2.5 Summary

3 Improving Resilient
Recommended publications
  • Integration of CUDA Processing Within the C++ Library for Parallelism and Concurrency (HPX)
    Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX). Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser.

    Abstract: Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster, while fully using all the available CPU resources and integrating the GPU work into the overall programming model.

    For the integration of CUDA code we extended HPX, a general purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows seamlessly overlapping any GPU operation with work on the main cores. Any user-defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations for the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities.
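    The overlap described in this abstract can be sketched with the plain CUDA runtime API: transfers and device-side work are enqueued asynchronously on a stream while the host keeps computing. The sketch below uses only standard CUDA runtime calls plus std::async; the actual HPX integration wraps such operations in hpx::future objects and is not shown here.

        // Generic sketch of overlapping GPU operations with CPU work (CUDA runtime API only;
        // pinned host memory would be required for a truly asynchronous host-device copy).
        #include <cuda_runtime.h>
        #include <cstdio>
        #include <future>
        #include <vector>

        int main() {
            const size_t n = 1 << 20;
            std::vector<unsigned char> host(n, 1);
            void* dev = nullptr;
            cudaMalloc(&dev, n);

            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // Enqueue the transfer and a stand-in for a user-defined kernel on the stream ...
            cudaMemcpyAsync(dev, host.data(), n, cudaMemcpyHostToDevice, stream);
            cudaMemsetAsync(dev, 0, n, stream);

            // ... and keep the main cores busy while the device works.
            auto cpu_work = std::async(std::launch::async, [] {
                double acc = 0.0;
                for (int i = 0; i < 1000000; ++i) acc += 0.5 * i;
                return acc;
            });

            cudaStreamSynchronize(stream);   // the GPU side of the "execution graph" completes
            std::printf("cpu result: %f\n", cpu_work.get());

            cudaFree(dev);
            cudaStreamDestroy(stream);
            return 0;
        }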
  • A Case for High Performance Computing with Virtual Machines
    A Case for High Performance Computing with Virtual Machines. Wei Huang, Jiuxing Liu, Bulent Abali, Dhabaleswar K. Panda. Computer Science and Engineering, The Ohio State University, Columbus, OH 43210; IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532.

    ABSTRACT: Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images. In this paper we present a case for HPC with virtual machines.

    From the introduction: ... in the 1960s [9], but are experiencing a resurgence in both industry and research communities. A VM environment provides virtualized hardware interfaces to VMs through a Virtual Machine Monitor (VMM) (also called hypervisor). VM technologies allow running different guest VMs in a physical box, with each guest VM possibly running a different guest operating system. They can also provide secure and portable environments to meet the demanding requirements of computing resources in modern computing systems. Recently, network interconnects such as InfiniBand [16], Myrinet [24] and Quadrics [31] are emerging, which provide very low latency (less than 5 µs) and very high bandwidth (multiple Gbps).
  • Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    MATRIX: Bench - Benchmarking the state-of-the-art Task Execution Frameworks of Many-Task Computing. Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos. Illinois Institute of Technology, Chicago, IL, USA. {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

    Abstract: Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by maximizing the bandwidth and minimizing the latency of the use of interconnection networks and storage, but suffer from the lack of scalable solutions to expose the intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing parallelism and locality of exascale systems. MTC applications are typically structured as direct acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

    From the related-work discussion: ... Stanford University. Finally, HPX is a general purpose C++ runtime system for parallel and distributed applications of any scale developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource managing systems aimed towards data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University, LSU Digital Commons, LSU Doctoral Dissertations, Graduate School, 10-16-2020. Adaptive Data Migration in Load-Imbalanced HPC Applications. Parsa Amini, Louisiana State University and Agricultural and Mechanical College. Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations. Part of the Computer Sciences Commons.

    Recommended Citation: Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370

    This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contact [email protected].

    ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS. A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science, by Parsa Amini, B.S., Shahed University, 2013; M.S., New Mexico State University, 2015. December 2020.

    Acknowledgments: This effort has been possible, thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me to be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
  • HPX – a Task Based Programming Model in a Global Address Space
    HPX – A Task Based Programming Model in a Global Address Space. Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, Dietmar Fey. Center for Computation and Technology, Louisiana State University, Louisiana, U.S.A.; Computer Science 3, Computer Architectures, Friedrich-Alexander-University, Erlangen, Germany.

    ABSTRACT: The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully explore the rewards of today's and tomorrow's systems. We present HPX – a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint based parallelism, and support run-

    1. INTRODUCTION: Today's programming models, languages, and related technologies that have sustained High Performance Computing (HPC) application software development for the past decade are facing major problems when it comes to programmability and performance portability of future systems. The significant increase in complexity of new platforms due to energy constraints, increasing parallelism and major changes to processor and memory architecture, requires advanced programming techniques that are portable across multiple future generations of machines [1]. A fundamentally new approach is required to address these challenges.
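    A flavour of the futurized, constraint-based parallelism mentioned in this abstract is sketched below. Header names and exact APIs vary between HPX releases, so the includes and the use of hpx::async and hpx::dataflow here are assumptions to be checked against the HPX documentation, not a definitive example.

        // Sketch of HPX-style futurized parallelism: work is expressed as tasks returning
        // futures, and dependencies are expressed as continuations over those futures.
        #include <hpx/hpx_main.hpp>      // assumed: runs main() inside the HPX runtime
        #include <hpx/include/lcos.hpp>  // assumed: hpx::async, hpx::future, hpx::dataflow
        #include <iostream>
        #include <utility>

        int square(int x) { return x * x; }

        int main() {
            hpx::future<int> a = hpx::async(square, 6);   // fine-grained asynchronous tasks
            hpx::future<int> b = hpx::async(square, 7);

            // The sum task runs only once both inputs are ready; the runtime is free to
            // schedule other work in the meantime.
            hpx::future<int> sum = hpx::dataflow(
                [](hpx::future<int> x, hpx::future<int> y) { return x.get() + y.get(); },
                std::move(a), std::move(b));

            std::cout << sum.get() << std::endl;          // prints 85
            return 0;
        }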
  • Beowulf Clusters — an Overview
    Beowulf clusters — an overview. Åsmund Ødegård, WinterSchool 2001, April 4, 2001.

    Contents: Introduction; What is a Beowulf; The history of Beowulf; Who can build a Beowulf; How to design a Beowulf; Beowulfs in more detail; Rules of thumb; What Beowulfs are Good For; Experiments; 3D nonlinear acoustic fields; Incompressible Navier–Stokes; 3D nonlinear water wave.

    Introduction: why clusters?
    - "Work harder": more CPU power, more memory, more everything
    - "Work smarter": better algorithms
    - "Get help": let more boxes work together to solve the problem (parallel processing) (by Greg Pfister)
    - Beowulfs in the parallel computing picture: parallel computing, metacomputing, clusters, tightly coupled, vector, WS farms, pile of PCs, NOW, NT/Win2k clusters, Beowulf, CC-NUMA (diagram)

    What is a Beowulf?
    - Mass-market commodity off the shelf (COTS)
    - Low cost local area network (LAN)
    - Open source UNIX-like operating system (OS)
    - Executes parallel applications programmed with a message passing model (MPI); see the sketch after this list
    - Anything from small systems to large, fast systems; the fastest ranks as no. 84 on today's Top500
    - The best price/performance system available for many applications
    - Philosophy: the cheapest system available which solves your problem in reasonable time

    The history of Beowulf
    - 1993: perfect conditions for the first Beowulf: a major CPU performance advance (80286 -> 80386), DRAM of reasonable cost and density (8 MB), disk drives of several 100 MB available for PCs, Ethernet (10 Mbps) controllers and hubs cheap enough, Linux improving rapidly and in a usable state, and PVM widely accepted as a cross-platform message passing model
    - Clustering was done with commercial UNIX, but the cost was high.
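    The message-passing bullet above is the programming-model core of a Beowulf. A minimal MPI program of the kind such a cluster runs is sketched below (standard MPI only; compile with mpicxx and launch with mpirun across the nodes).

        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank = 0, size = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {
                // Rank 0 sends a token to every other node in the cluster.
                for (int dest = 1; dest < size; ++dest)
                    MPI_Send(&dest, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
                std::printf("rank 0 distributed work to %d peers\n", size - 1);
            } else {
                int token = 0;
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                std::printf("rank %d received token %d\n", rank, token);
            }
            MPI_Finalize();
            return 0;
        }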
  • Exploring Massive Parallel Computation with GPU
    Exploring Massive Parallel Computation with GPU. Ian Bond, Massey University, Auckland, New Zealand. 2011 Sagan Exoplanet Workshop, Pasadena, July 25-29, 2011.

    Assumptions/Purpose: You are all involved in microlensing modelling and you have (or are working on) your own code. This lecture shows how to get started on getting code to run on a GPU; then it's over to you.

    Outline: 1. Need for parallelism; 2. Graphical Processor Units; 3. Gravitational Microlensing Modelling.

    Parallel Computing: Parallel computing is the use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer (Wilkinson 2002). How does one achieve parallelism?

    Grand Challenge Problems: A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Examples: modelling large DNA structures, global weather forecasting, the N-body problem (N very large), brain simulation. Has microlensing
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville. TRACE: Tennessee Research and Creative Exchange. Doctoral Dissertations, Graduate School, 12-2019. Improving MPI Threading Support for Current Hardware Architectures. Thananon Patinyasakdikul, University of Tennessee, [email protected]. Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss

    Recommended Citation: Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures." PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631

    This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

    To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor. We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.)

    Improving MPI Threading Support for Current Hardware Architectures. A Dissertation Presented for the Doctor of Philosophy Degree, The University of Tennessee, Knoxville. Thananon Patinyasakdikul, December 2019. © by Thananon Patinyasakdikul, 2019. All Rights Reserved. To my parents Thanawij and Issaree Patinyasakdikul, and my little brother Thanarat Patinyasakdikul, for their love, trust and support.
  • dCUDA: Hardware Supported Overlap of Computation and Communication
    dCUDA: Hardware Supported Overlap of Computation and Communication. Tobias Gysi, Jeremia Bär, Torsten Hoefler. Department of Computer Science, ETH Zurich.

    Abstract: Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware.

    From the introduction: ... utilization of the costly compute and network hardware. To mitigate this problem, application developers can implement manual overlap of computation and communication [23], [27]. In particular, there exist various approaches [13], [22] to overlap the communication with the computation on an inner domain that has no inter-node data dependencies. However, these code transformations significantly increase code complexity, which results in reduced real-world applicability. High-performance system design often involves trading off sequential performance against parallel throughput. The architectural difference between host and device processors perfectly showcases the two extremes of this design space.
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software. Paul Messina, ECP Director; Stephen Lee, ECP Deputy Director. ASCAC Meeting, Arlington, VA, Crystal City Marriott, April 19, 2017. www.ExascaleProject.org

    ECP scope and goals: develop applications to tackle a broad spectrum of mission critical problems of unprecedented complexity; support national security applications; partner with vendors to develop computer architectures that support exascale applications; develop a software stack that is both exascale-capable and usable on industrial & academic scale systems, in collaboration with vendors; train a next-generation workforce of computational scientists, engineers, and computer scientists; contribute to the economic competitiveness of the nation.

    ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale: Application Development (science and mission applications), Software Technology (scalable and productive software stack), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).

    [Software stack diagram: applications; co-design; correctness, visualization, and data analysis; programming models, development environment, and runtimes; math libraries and frameworks; tools; system software covering resource management, threading, scheduling, monitoring, and control; workflows; resilience; data management, I/O and file system, memory and burst buffer; node OS and runtimes; hardware interface.]

    ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce
  • hpxMP, an OpenMP Runtime Implemented Using
    hpxMP, An Implementation of OpenMP Using HPX. By Tianyi Zhang. Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science in the School of Engineering at Louisiana State University, 2019. Baton Rouge, LA, USA.

    Acknowledgments: I would like to express my appreciation to my advisor Dr. Hartmut Kaiser, who has been a patient, inspiring and knowledgeable mentor for me. I joined the STE||AR group 2 years ago, with limited knowledge of C++ and HPC. I would like to thank Dr. Kaiser for bringing me into the world of high-performance computing. During the past 2 years, he helped me improve my skills in C++, encouraged my research project in hpxMP, and allowed me to explore the world. When I have challenges in my research, he always points things to the right spot and I can always learn a lot by solving those issues. His advice on my research and career is always valuable.

    I would also like to convey my appreciation to my committee members for serving on my committee. Thank you for making my defense an enjoyable moment, and I appreciate your comments and suggestions. I would like to thank my colleagues, who are passionate about what they are doing and always willing to help other people. I can always be inspired by their intellectual conversations and get hands-on help face to face, which is especially important to my study. A special thanks to my family for their support. They always encourage me and stand at my side when I need to make choices or face difficulties.
  • On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework
    On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework. Montella, R., Giunta, G., Laccetti, G., Lapegna, M., Palmieri, C., Ferraro, C., Pelliccia, V., Hong, C-H., Spence, I., & Nikolopoulos, D. (2017). International Journal of Parallel Programming, 45(5), 1142-1163. https://doi.org/10.1007/s10766-016-0462-1

    Published in: International Journal of Parallel Programming. Document Version: Peer reviewed version. Queen's University Belfast - Research Portal: link to publication record in Queen's University Belfast Research Portal.

    Publisher rights: © 2016 Springer Verlag. The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-016-0462-1

    General rights: Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and/or other copyright owners, and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

    Take down policy: The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact [email protected].

    Noname manuscript No. (will be inserted by the editor). On the virtualization of CUDA based GPU remoting on ARM and X86 machines in the GVirtuS framework. Raffaele Montella, Giulio Giunta, Giuliano Laccetti, Marco Lapegna, Carlo Palmieri, Carmine Ferraro, Valentina Pelliccia, Cheol-Ho Hong, Ivor Spence, Dimitrios S.