Hybrid MPI & Openmp Parallel Programming

Total Page:16

File Type:pdf, Size:1020Kb

Hybrid MPI & Openmp Parallel Programming Hybrid MPI & OpenMP Parallel Programming MPI + OpenMP and other models on clusters of SMP nodes Rolf Rabenseifner1) Georg Hager2) Gabriele Jost3) [email protected] [email protected] [email protected] 1) High Performance Computing Center (HLRS), University of Stuttgart, Germany 2) Regional Computing Center (RRZE), University of Erlangen, Germany 3) Texas Advanced Computing Center, The University of Texas at Austin, USA Tutorial M02 at SC10, November 15, 2010, New Orleans, LA, USA Hybrid Parallel Programming Slide 1 Höchstleistungsrechenzentrum Stuttgart Outline slide number • Introduction / Motivation 2 • Programming models on clusters of SMP nodes 6 8:30 – 10:00 • Case Studies / pure MPI vs hybrid MPI+OpenMP 13 • Practical “How-To” on hybrid programming 48 • Mismatch Problems 97 • Opportunities: Application categories that can 126 benefit from hybrid parallelization • Thread-safety quality of MPI libraries 136 10:30 – 12:00 • Tools for debugging and profiling MPI+OpenMP 143 • Other options on clusters of SMP nodes 150 • Summary 164 • Appendix 172 • Content (detailed) 188 Hybrid Parallel Programming Slide 2 / 169 Rabenseifner, Hager, Jost Motivation • Efficient programming of clusters of SMP nodes SMP nodes SMP nodes: cores shared • Dual/multi core CPUs memory • Multi CPU shared memory Node Interconnect • Multi CPU ccNUMA • Any mixture with shared memory programming model • Hardware range • mini-cluster with dual-core CPUs Core • … CPU(socket) • large constellations with large SMP nodes SMP board … with several sockets (CPUs) per SMP node … with several cores per socket ccNUMA node Hierarchical system layout Cluster of ccNUMA/SMP nodes • Hybrid MPI/OpenMP programming seems natural • MPI between the nodes • OpenMP inside of each SMP node Hybrid Parallel Programming Slide 3 / 169 Rabenseifner, Hager, Jost Motivation SMP node SMP node • Which programming model Socket 1 Socket 1 is fastest? MPMI PIM PI MPMI PIM PI Qpuroadc-ecosrse Qpuroadc-ecosrse • MPI everywhere? 4 xC mPUulti- 4 xC mPUulti- MPI MPI MPI MPI MtPhIr eparodceedss MtPhIr eparodceedss 8 x multi- 8 x multi- Socket 2 Socket 2 • Fully hybrid threaded threaded MPI & OpenMP? MPMI PIM PI MPMI PIM PI Qpuroadc-ecosrse Qpuroadc-ecosrse 4 xC mPUulti- 4 xC mPUulti- MPI MPI MPI MPI • Something between? threaded threaded (Mixed model) • Often hybrid programming Node Interconnect slower than pure MPI – Examples, Reasons, … ? Hybrid Parallel Programming Slide 4 / 169 Rabenseifner, Hager, Jost Goals of this tutorial • Sensitize to problems on clusters of SMP nodes see sections Case studies •Less Mismatch problems frustration • Technical aspects of hybrid programming & see sections Programming models on clusters •More Examples on hybrid programming success • Opportunities with hybrid programming with your parallel see section Opportunities: Application categories program on that can benefit from hybrid paralleliz. clusters of • Issues and their Solutions SMP nodes with sections Thread-safety quality of MPI libraries Tools for debugging and profiling for MPI+OpenMP Hybrid Parallel Programming Slide 5 / 169 Rabenseifner, Hager, Jost Outline • Introduction / Motivation • Programming models on clusters of SMP nodes • Case Studies / pure MPI vs hybrid MPI+OpenMP • Practical “How-To” on hybrid programming • Mismatch Problems • Opportunities: Application categories that can benefit from hybrid parallelization • Thread-safety quality of MPI libraries • Tools for debugging and profiling MPI+OpenMP • Other options on clusters of SMP nodes • Summary Hybrid Parallel Programming Slide 6 / 169 Rabenseifner, Hager, Jost Major Programming models on hybrid systems • Pure MPI (one MPI process on each core) • Hybrid MPI+OpenMP OpenMP inside of the – shared memory OpenMP SMP nodes – distributed memory MPI MPI between the nodes via node interconnect Node Interconnect • Other: Virtual shared memory systems, PGAS, HPF, … • Often hybrid programming (MPI+OpenMP) slower than pure MPI – why? MPI local data in each process OpenMP (shared data) Master thread, other threads Sequential data some_serial_code program on #pragma omp parallel for each core for (j=…;…; j++) block_to_be_parallelized Explicit Message Passing again_some_serial_code ••• sleeping ••• by calling MPI_Send & MPI_Recv Hybrid Parallel Programming Slide 7 / 169 Rabenseifner, Hager, Jost Parallel Programming Models on Hybrid Platforms pure MPI hybrid MPI+OpenMP OpenMP only one MPI process MPI: inter-node communication distributed virtual on each core OpenMP: inside of each SMP node shared memory No overlap of Comm. + Comp. Overlapping Comm. + Comp. MPI only outside of parallel regions MPI communication by one or a few threads of the numerical application code while other threads are computing Masteronly MPI only outside of parallel regions MPI local data in each process OpenMP (shared data) Master thread, other threads Sequential data some_serial_code program on #pragma omp parallel for each core for (j=…;…; j++) block_to_be_parallelized Explicit message transfers again_some_serial_code ••• sleeping ••• by calling MPI_Send & MPI_Recv Hybrid Parallel Programming Slide 8 / 169 Rabenseifner, Hager, Jost Pure MPI pure MPI one MPI process on each core Advantages – No modifications on existing MPI codes – MPI library need not to support multiple threads Major problems – Does MPI library uses internally different protocols? • Shared memory inside of the SMP nodes Discussed • Network communication between the nodes in detail later on – Does application topology fit on hardware topology? in the section Mismatch – Unnecessary MPI-communication inside of SMP nodes! Problems Hybrid Parallel Programming Slide 9 / 169 Rabenseifner, Hager, Jost Hybrid Masteronly Masteronly Advantages MPI only outside of parallel regions – No message passing inside of the SMP nodes – No topology problem for (iteration ….) Major Problems { – All other threads are sleeping #pragma omp parallel numerical code while master thread communicates! /*end omp parallel */ – Which inter-node bandwidth? /* on master thread only */ – MPI-lib must support at least MPI_Send (original data MPI_THREAD_FUNNELED to halo areas in other SMP nodes) Section MPI_Recv (halo data Thread-safety from the neighbors) } /*end for loop quality of MPI libraries Hybrid Parallel Programming Slide 10 / 169 Rabenseifner, Hager, Jost Overlapping Communication and Computation OMvPIe cormlampunpicaitniong b yc oonem or ma feuw nthriecadast wihoilen o tahenr tdhre cados amre pcoumptuatintgion if (my_thread_rank < …) { MPI_Send/Recv…. i.e., communicate all halo data } else { Execute those parts of the application that do not need halo data (on non-communicating threads) } Execute those parts of the application that need halo data (on all threads) Hybrid Parallel Programming Slide 11 / 169 Rabenseifner, Hager, Jost Pure OpenMP (on the cluster) OpenMP only distributed virtual shared memory • Distributed shared virtual memory system needed • Must support clusters of SMP nodes Experience: • e.g., Intel® Cluster OpenMP Mismatch – Shared memory parallel inside of SMP nodes section – Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier) i.e., the OpenMP memory and parallelization model is prepared for clusters! Hybrid Parallel Programming Slide 12 / 169 Rabenseifner, Hager, Jost Outline • Introduction / Motivation • Programming models on clusters of SMP nodes • Case Studies / pure MPI vs hybrid MPI+OpenMP – The Multi-Zone NAS Parallel Benchmarks – For each application we discuss: • Benchmark implementations based on different strategies and programming paradigms • Performance results and analysis on different hardware architectures – Compilation and Execution Summary Gabriele Jost (University of Texas,TACC/Naval Postgraduate School, Monterey CA) • Practical “How-To” on hybrid programming • Mismatch Problems • Opportunities: Application categories that can benefit from hybrid paralleli. • Thread-safety quality of MPI libraries • Tools for debugging and profiling MPI+OpenMP • Other options on clusters of SMP nodes • Summary Hybrid Parallel Programming Slide 13 / 169 Rabenseifner, Hager, Jost The Multi-Zone NAS Parallel Benchmarks Nested MPI/OpenMP MLP OpenMP Time step sequential sequential sequential MPI MLP inter-zones OpenMP Processes Processes exchange data copy+ Call MPI OpenMP boundaries sync. intra-zones OpenMP OpenMP OpenMP • Multi-zone versions of the NAS Parallel Benchmarks LU,SP, and BT • Two hybrid sample implementations • Load balance heuristics part of sample codes • www.nas.nasa.gov/Resources/Software/software.html Hybrid Parallel Programming Slide 14 / 169 Rabenseifner, Hager, Jost Using MPI/OpenMP call omp_set_numthreads (weight) subroutine zsolve(u, rsd,…) do step = 1, itmax ... call exch_qbc(u, qbc, nx,…) !$OMP PARALLEL DEFAUL(SHARED) !$OMP& PRIVATE(m,i,j,k...) do k = 2, nz-1 call mpi_send/recv !$OMP DO do j = 2, ny-1 do i = 2, nx-1 do zone = 1, num_zones do m = 1, 5 if (iam .eq. pzone_id(zone)) then u(m,i,j,k)= call zsolve(u,rsd,…) dt*rsd(m,i,j,k-1) end if end do end do end do end do end do !$OMP END DO nowait ... end do ... !$OMP END PARALLEL Hybrid Parallel Programming Slide 15 / 169 Rabenseifner, Hager, Jost Pipelined Thread Execution in SSOR subroutine ssor !$OMP PARALLEL DEFAULT(SHARED) !$OMP& PRIVATE(m,i,j,k...) subbroutine sync1 call sync1 () …neigh = iam -1 do k = 2, nz-1 do while (isync(neigh) .eq. 0) !$OMP DO !$OMP FLUSH(isync) do j = 2, ny-1 end do isync(neigh) = 0 do i = 2, nx-1 !$OMP FLUSH(isync) do m = 1, 5 rsd(m,i,j,k)= … dt*rsd(m,i,j,k-1)
Recommended publications
  • A Case for High Performance Computing with Virtual Machines
    A Case for High Performance Computing with Virtual Machines Wei Huangy Jiuxing Liuz Bulent Abaliz Dhabaleswar K. Panday y Computer Science and Engineering z IBM T. J. Watson Research Center The Ohio State University 19 Skyline Drive Columbus, OH 43210 Hawthorne, NY 10532 fhuanwei, [email protected] fjl, [email protected] ABSTRACT in the 1960s [9], but are experiencing a resurgence in both Virtual machine (VM) technologies are experiencing a resur- industry and research communities. A VM environment pro- gence in both industry and research communities. VMs of- vides virtualized hardware interfaces to VMs through a Vir- fer many desirable features such as security, ease of man- tual Machine Monitor (VMM) (also called hypervisor). VM agement, OS customization, performance isolation, check- technologies allow running different guest VMs in a phys- pointing, and migration, which can be very beneficial to ical box, with each guest VM possibly running a different the performance and the manageability of high performance guest operating system. They can also provide secure and computing (HPC) applications. However, very few HPC ap- portable environments to meet the demanding requirements plications are currently running in a virtualized environment of computing resources in modern computing systems. due to the performance overhead of virtualization. Further, Recently, network interconnects such as InfiniBand [16], using VMs for HPC also introduces additional challenges Myrinet [24] and Quadrics [31] are emerging, which provide such as management and distribution of OS images. very low latency (less than 5 µs) and very high bandwidth In this paper we present a case for HPC with virtual ma- (multiple Gbps).
    [Show full text]
  • Bench - Benchmarking the State-Of- The-Art Task Execution Frameworks of Many- Task Computing
    MATRIX: Bench - Benchmarking the state-of- the-art Task Execution Frameworks of Many- Task Computing Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos Illinois Institute of Technology, Chicago, IL, USA {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu Stanford University. Finally HPX is a general purpose C++ Abstract — Technology trends indicate that exascale systems will runtime system for parallel and distributed applications of any have billion-way parallelism, and each node will have about three scale developed by Louisiana State University and Staple is a orders of magnitude more intra-node parallelism than today’s framework for developing parallel programs from Texas A&M. peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by MATRIX is a many-task computing job scheduling system maximizing the bandwidth and minimizing the latency of the use [3]. There are many resource managing systems aimed towards of interconnection networks and storage, but suffer from the lack data-intensive applications. Furthermore, distributed task of scalable solutions to expose the intra-node parallelism. Many- scheduling in many-task computing is a problem that has been task computing (MTC) is a distributed fine-grained paradigm that considered by many research teams. In particular, Charm++ [4], aims to address the challenges of managing parallelism and Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] locality of exascale systems. MTC applications are typically structured as direct acyclic graphs of loosely coupled short tasks and MATRIX [11] offer solutions to this problem and have with explicit input/output data dependencies.
    [Show full text]
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 10-16-2020 Adaptive Data Migration in Load-Imbalanced HPC Applications Parsa Amini Louisiana State University and Agricultural and Mechanical College Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations Part of the Computer Sciences Commons Recommended Citation Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370 This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please [email protected]. ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science by Parsa Amini B.S., Shahed University, 2013 M.S., New Mexico State University, 2015 December 2020 Acknowledgments This effort has been possible, thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
    [Show full text]
  • Beowulf Clusters — an Overview
    WinterSchool 2001 Å. Ødegård Beowulf clusters — an overview Åsmund Ødegård April 4, 2001 Beowulf Clusters 1 WinterSchool 2001 Å. Ødegård Contents Introduction 3 What is a Beowulf 5 The history of Beowulf 6 Who can build a Beowulf 10 How to design a Beowulf 11 Beowulfs in more detail 12 Rules of thumb 26 What Beowulfs are Good For 30 Experiments 31 3D nonlinear acoustic fields 35 Incompressible Navier–Stokes 42 3D nonlinear water wave 44 Beowulf Clusters 2 WinterSchool 2001 Å. Ødegård Introduction Why clusters ? ² “Work harder” – More CPU–power, more memory, more everything ² “Work smarter” – Better algorithms ² “Get help” – Let more boxes work together to solve the problem – Parallel processing ² by Greg Pfister Beowulf Clusters 3 WinterSchool 2001 Å. Ødegård ² Beowulfs in the Parallel Computing picture: Parallel Computing MetaComputing Clusters Tightly Coupled Vector WS farms Pile of PCs NOW NT/Win2k Clusters Beowulf CC-NUMA Beowulf Clusters 4 WinterSchool 2001 Å. Ødegård What is a Beowulf ² Mass–market commodity off the shelf (COTS) ² Low cost local area network (LAN) ² Open Source UNIX like operating system (OS) ² Execute parallel application programmed with a message passing model (MPI) ² Anything from small systems to large, fast systems. The fastest rank as no.84 on todays Top500. ² The best price/performance system available for many applications ² Philosophy: The cheapest system available which solve your problem in reasonable time Beowulf Clusters 5 WinterSchool 2001 Å. Ødegård The history of Beowulf ² 1993: Perfect conditions for the first Beowulf – Major CPU performance advance: 80286 ¡! 80386 – DRAM of reasonable costs and densities (8MB) – Disk drives of several 100MBs available for PC – Ethernet (10Mbps) controllers and hubs cheap enough – Linux improved rapidly, and was in a usable state – PVM widely accepted as a cross–platform message passing model ² Clustering was done with commercial UNIX, but the cost was high.
    [Show full text]
  • The Component Architecture of Open Mpi: Enabling Third-Party Collective Algorithms∗
    THE COMPONENT ARCHITECTURE OF OPEN MPI: ENABLING THIRD-PARTY COLLECTIVE ALGORITHMS∗ Jeffrey M. Squyres and Andrew Lumsdaine Open Systems Laboratory, Indiana University Bloomington, Indiana, USA [email protected] [email protected] Abstract As large-scale clusters become more distributed and heterogeneous, significant research interest has emerged in optimizing MPI collective operations because of the performance gains that can be realized. However, researchers wishing to develop new algorithms for MPI collective operations are typically faced with sig- nificant design, implementation, and logistical challenges. To address a number of needs in the MPI research community, Open MPI has been developed, a new MPI-2 implementation centered around a lightweight component architecture that provides a set of component frameworks for realizing collective algorithms, point-to-point communication, and other aspects of MPI implementations. In this paper, we focus on the collective algorithm component framework. The “coll” framework provides tools for researchers to easily design, implement, and exper- iment with new collective algorithms in the context of a production-quality MPI. Performance results with basic collective operations demonstrate that the com- ponent architecture of Open MPI does not introduce any performance penalty. Keywords: MPI implementation, Parallel computing, Component architecture, Collective algorithms, High performance 1. Introduction Although the performance of the MPI collective operations [6, 17] can be a large factor in the overall run-time of a parallel application, their optimiza- tion has not necessarily been a focus in some MPI implementations until re- ∗This work was supported by a grant from the Lilly Endowment and by National Science Foundation grant 0116050.
    [Show full text]
  • Exploring Massive Parallel Computation with GPU
    Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Exploring Massive Parallel Computation with GPU Ian Bond Massey University, Auckland, New Zealand 2011 Sagan Exoplanet Workshop Pasadena, July 25-29 2011 Ian Bond | Microlensing parallelism 1/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Assumptions/Purpose You are all involved in microlensing modelling and you have (or are working on) your own code this lecture shows how to get started on getting code to run on a GPU then its over to you . Ian Bond | Microlensing parallelism 2/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Outline 1 Need for parallelism 2 Graphical Processor Units 3 Gravitational Microlensing Modelling Ian Bond | Microlensing parallelism 3/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Paralel Computing Parallel Computing is use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer (Wilkinson 2002). How does one achieve parallelism? Ian Bond | Microlensing parallelism 4/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Grand Challenge Problems A grand challenge problem is one that cannot be solved in a reasonable amount of time with todays computers’ Examples: – Modelling large DNA structures – Global weather forecasting – N body problem (N very large) – brain simulation Has microlensing
    [Show full text]
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 12-2019 Improving MPI Threading Support for Current Hardware Architectures Thananon Patinyasakdikul University of Tennessee, [email protected] Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss Recommended Citation Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures. " PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631 This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the equirr ements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School (Original signatures are on file with official studentecor r ds.) Improving MPI Threading Support for Current Hardware Architectures A Dissertation Presented for the Doctor of Philosophy Degree The University of Tennessee, Knoxville Thananon Patinyasakdikul December 2019 c by Thananon Patinyasakdikul, 2019 All Rights Reserved. ii To my parents Thanawij and Issaree Patinyasakdikul, my little brother Thanarat Patinyasakdikul for their love, trust and support.
    [Show full text]
  • Dcuda: Hardware Supported Overlap of Computation and Communication
    dCUDA: Hardware Supported Overlap of Computation and Communication Tobias Gysi Jeremia Bar¨ Torsten Hoefler Department of Computer Science Department of Computer Science Department of Computer Science ETH Zurich ETH Zurich ETH Zurich [email protected] [email protected] [email protected] Abstract—Over the last decade, CUDA and the underlying utilization of the costly compute and network hardware. To GPU hardware architecture have continuously gained popularity mitigate this problem, application developers can implement in various high-performance computing application domains manual overlap of computation and communication [23], [27]. such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent In particular, there exist various approaches [13], [22] to programming model for GPU clusters. We therefore introduce overlap the communication with the computation on an inner the dCUDA programming model, which implements device- domain that has no inter-node data dependencies. However, side remote memory access with target notification. To hide these code transformations significantly increase code com- instruction pipeline latencies, CUDA programs over-decompose plexity which results in reduced real-world applicability. the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a High-performance system design often involves trading thread stalls, the hardware scheduler immediately proceeds with off sequential performance against parallel throughput. The the execution of another thread ready for execution. This latency architectural difference between host and device processors hiding technique is key to make best use of the available hardware perfectly showcases the two extremes of this design space.
    [Show full text]
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software Paul Messina, ECP Director Stephen Lee, ECP Deputy Director ASCAC Meeting, Arlington, VA Crystal City Marriott April 19, 2017 www.ExascaleProject.org ECP scope and goals Develop applications Partner with vendors to tackle a broad to develop computer spectrum of mission Support architectures that critical problems national security support exascale of unprecedented applications complexity Develop a software Train a next-generation Contribute to the stack that is both workforce of economic exascale-capable and computational competitiveness usable on industrial & scientists, engineers, of the nation academic scale and computer systems, in collaboration scientists with vendors 2 Exascale Computing Project, www.exascaleproject.org ECP has formulated a holistic approach that uses co- design and integration to achieve capable exascale Application Software Hardware Exascale Development Technology Technology Systems Science and Scalable and Hardware Integrated mission productive technology exascale applications software elements supercomputers Correctness Visualization Data Analysis Applicationsstack Co-Design Programming models, Math libraries and development environment, Tools Frameworks and runtimes System Software, resource Workflows Resilience management threading, Data Memory scheduling, monitoring, and management and Burst control I/O and file buffer system Node OS, runtimes Hardware interface ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce
    [Show full text]
  • On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the Gvirtus Framework
    On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework Montella, R., Giunta, G., Laccetti, G., Lapegna, M., Palmieri, C., Ferraro, C., Pelliccia, V., Hong, C-H., Spence, I., & Nikolopoulos, D. (2017). On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework. International Journal of Parallel Programming, 45(5), 1142-1163. https://doi.org/10.1007/s10766-016-0462-1 Published in: International Journal of Parallel Programming Document Version: Peer reviewed version Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal Publisher rights © 2016 Springer Verlag. The final publication is available at Springer via http://dx.doi.org/ 10.1007/s10766-016-0462-1 General rights Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact [email protected]. Download date:08. Oct. 2021 Noname manuscript No. (will be inserted by the editor) On the virtualization of CUDA based GPU remoting on ARM and X86 machines in the GVirtuS framework Raffaele Montella · Giulio Giunta · Giuliano Laccetti · Marco Lapegna · Carlo Palmieri · Carmine Ferraro · Valentina Pelliccia · Cheol-Ho Hong · Ivor Spence · Dimitrios S.
    [Show full text]
  • Parallel Data Mining from Multicore to Cloudy Grids
    311 Parallel Data Mining from Multicore to Cloudy Grids Geoffrey FOXa,b,1 Seung-Hee BAEb, Jaliya EKANAYAKEb, Xiaohong QIUc, and Huapeng YUANb a Informatics Department, Indiana University 919 E. 10th Street Bloomington, IN 47408 USA b Computer Science Department and Community Grids Laboratory, Indiana University 501 N. Morton St., Suite 224, Bloomington IN 47404 USA c UITS Research Technologies, Indiana University, 501 N. Morton St., Suite 211, Bloomington, IN 47404 Abstract. We describe a suite of data mining tools that cover clustering, information retrieval and the mapping of high dimensional data to low dimensions for visualization. Preliminary applications are given to particle physics, bioinformatics and medical informatics. The data vary in dimension from low (2- 20), high (thousands) to undefined (sequences with dissimilarities but not vectors defined). We use deterministic annealing to provide more robust algorithms that are relatively insensitive to local minima. We discuss the algorithm structure and their mapping to parallel architectures of different types and look at the performance of the algorithms on three classes of system; multicore, cluster and Grid using a MapReduce style algorithm. Each approach is suitable in different application scenarios. We stress that data analysis/mining of large datasets can be a supercomputer application. Keywords. MPI, MapReduce, CCR, Performance, Clustering, Multidimensional Scaling Introduction Computation and data intensive scientific data analyses are increasingly prevalent. In the near future, data volumes processed by many applications will routinely cross the peta-scale threshold, which would in turn increase the computational requirements. Efficient parallel/concurrent algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such scientific data analyses.
    [Show full text]
  • HPVM: Heterogeneous Parallel Virtual Machine
    HPVM: Heterogeneous Parallel Virtual Machine Maria Kotsifakou∗ Prakalp Srivastava∗ Matthew D. Sinclair Department of Computer Science Department of Computer Science Department of Computer Science University of Illinois at University of Illinois at University of Illinois at Urbana-Champaign Urbana-Champaign Urbana-Champaign [email protected] [email protected] [email protected] Rakesh Komuravelli Vikram Adve Sarita Adve Qualcomm Technologies Inc. Department of Computer Science Department of Computer Science [email protected]. University of Illinois at University of Illinois at com Urbana-Champaign Urbana-Champaign [email protected] [email protected] Abstract hardware, and that runtime scheduling policies can make We propose a parallel program representation for heteroge- use of both program and runtime information to exploit the neous systems, designed to enable performance portability flexible compilation capabilities. Overall, we conclude that across a wide range of popular parallel hardware, including the HPVM representation is a promising basis for achieving GPUs, vector instruction sets, multicore CPUs and poten- performance portability and for implementing parallelizing tially FPGAs. Our representation, which we call HPVM, is a compilers for heterogeneous parallel systems. hierarchical dataflow graph with shared memory and vector CCS Concepts • Computer systems organization → instructions. HPVM supports three important capabilities for Heterogeneous (hybrid) systems; programming heterogeneous systems: a compiler interme- diate representation (IR), a virtual instruction set (ISA), and Keywords Virtual ISA, Compiler, Parallel IR, Heterogeneous a basis for runtime scheduling; previous systems focus on Systems, GPU, Vector SIMD only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for het- 1 Introduction erogeneous systems.
    [Show full text]