A Thesis Submitted for the Degree of PhD at the University of Warwick

Permanent WRAP URL: http://wrap.warwick.ac.uk/102343/

Copyright and reuse: This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself. Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository home page.

For more information, please contact the WRAP Team at: [email protected]

warwick.ac.uk/lib-publications

The Readying of Applications for Heterogeneous Computing

by

Andy Herdman

A thesis submitted to The University of Warwick in partial fulfilment of the requirements for admission to the degree of Doctor of Philosophy

Department of Computer Science
The University of Warwick
September 2017

Abstract

High performance computing is approaching a potentially significant change in architectural design. With pressure on both the cost and the sheer amount of power consumed, additional architectural features are emerging which require a re-think of the programming models deployed over the last two decades.

Today's emerging high performance computing (HPC) systems maximise performance per unit of power consumed, with the result that the constituent parts of the system are made up of a range of different specialised building blocks, each with its own purpose. This heterogeneity is not limited to the hardware components, but extends to the mechanisms that exploit them. These multiple levels of parallelism, instruction sets and memory hierarchies result in truly heterogeneous computing in all aspects of the global system.

These emerging architectural solutions will require software to exploit tremendous amounts of on-node parallelism, and programming models to address this are indeed emerging. In theory, the application developer can design new software using these models to exploit emerging low power architectures. However, in practice, real industrial scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next generation supercomputing platforms.

Identifying that migration path is non-trivial: with applications spanning many decades, consisting of many millions of lines of code and multiple scientific algorithms, any change to the programming model will be extensive and invasive, and the new model may turn out to be the incorrect one for the application in question. This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available, due either to commercial or security sensitivity constraints.

This thesis highlights this problem by assessing current and emerging hardware with an industrial strength code, demonstrating the issues described above. In turn it examines the methodology of using proxy applications in place of real industry applications to assess their suitability on the next generation of low power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address.

Evaluations of the maturity and performance portability of a number of alternative programming methodologies are explored on a number of architectures, highlighting the broader adoption of these proxy applications, both within the author's own organisation and across the industry as a whole.
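As a point of reference for the directive-based programming models mentioned above, the sketch below shows the general shape of such a port applied to a simple explicit update loop. This is an illustrative example only, not code taken from the thesis or from the applications it studies; the kernel, array names and problem size are hypothetical, and it assumes a C compiler (if no OpenACC support is available, the pragma is simply ignored and the loop runs serially on the host).

/* Illustrative sketch only: a simple explicit update loop annotated with an
 * OpenACC directive, in the spirit of the directive-based ports discussed in
 * this thesis. Names and sizes are hypothetical. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    double *u     = malloc(n * sizeof(double));
    double *u_new = malloc(n * sizeof(double));

    for (int i = 0; i < n; ++i)
        u[i] = (double)i;

    /* Offload the update to an accelerator if one is available; the data
     * clauses make the host/device transfers explicit. */
    #pragma acc parallel loop copyin(u[0:n]) copyout(u_new[0:n])
    for (int i = 1; i < n - 1; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

    printf("u_new[1] = %f\n", u_new[1]);
    free(u);
    free(u_new);
    return 0;
}

Even in this toy form, the data clauses expose the host/device data movement that becomes a central performance and portability concern when such an approach is applied at industrial scale.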
Acknowledgements

This thesis, and the supporting research, was made possible by the support of a number of people; their academic, professional and personal support throughout the years spent on this work has been invaluable.

Firstly, I'd like to thank my employer AWE, which has not only invested financially but also allowed dedicated time for me to carry out this research and the subsequent write up. In particular thanks go to Paul Tomlinson, Paul Ellicott and Andrew Randewich. A special thank you also goes to Wayne Gaudin for his technical input and general willingness as a sounding board.

Colleagues and peers across the HPC industry have also been of assistance: LANL, LLNL, SNL, PGI, Cray, Intel and IBM have either provided access to hardware or given technical input, without which this work would not have been possible. I would like to thank Sriram Swaminarayan, Todd Gamblin, Scott Futral, Si Hammond, Jim Ang, Mike Glass, Doug Miles, Michael Wolfe, Craig Toepfer, Dave Norton, Alistair Hart, John Levesque, Fiona Burgess, Andy Mason, Andy Mallinson, Victor Gamayunov, Stephen Blair-Chappell and Peter Mayes.

I would also like to thank the PhD students in the High Performance and Scientific Systems Group throughout my time at Warwick: Dr Si Hammond, Dr Andy Mallinson, Dr Olly Perks, Dr David Beckingsale, Dr James Davis, Dr Steve Wright, Dr John Pennycook and Dr Bob Bird, for their advice and technical steer. I'd also like to express a big thank you to Prof. Stephen Jarvis for acceptance onto the PhD programme and his supervision.

Finally, I dedicate this PhD to my family, Sarah, Katie, Jack, Mam and sister; and to my Dad - hope you would have been proud - I love you and I miss you.

Andy Herdman - September 2017

Declarations

This thesis is submitted to the University of Warwick in support of the author's application for the degree of Doctor of Philosophy. It has been composed by the author and has not been submitted in any previous application for any degree.

Parts of this thesis have been previously published by the author in peer reviewed conferences and technical papers:

• Herdman, J. A., et al. "Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study." In: 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), New Orleans, LA, USA. ACM SIGMETRICS Performance Evaluation Review 38.4 (2011): 16-22 [103].

• Herdman, J. A., et al. "Accelerating hydrocodes with OpenACC, OpenCL and CUDA." SC Companion: High Performance Computing, Networking Storage and Analysis, IEEE, Salt Lake City, UT, 10-16 Nov 2012 [101].

• Herdman, J. A., et al. "Achieving portability and performance through OpenACC." Proceedings of the First Workshop on Accelerator Programming using Directives (WACCPD '14), held as part of SC14, The International Conference for HPC, Networking, Storage and Analysis. New Orleans, LA, USA. IEEE Press, 2014 [102].

In addition, developments made as part of this thesis's contributions were among the integrated collection of mini-applications included in the 2013 R&D 100 "Oscars of Innovation" [172] award winning Mantevo test suite [104].
Also, Chapter 5 has been selected for inclusion in the forthcoming book "Parallel Programming with OpenACC" [88].

Other related publications by the author, although not reproduced in this thesis, include:

• Davis, J. A., Mudalige, G. R., Hammond, S. D., Herdman, J. A., Miller, I., and Jarvis, S. A. "Predictive analysis of a hydrodynamics application on large-scale CMP clusters." Computer Science - Research and Development 26, no. 3-4 (2011): 175-185 [79].

• Mallinson, A. C., Beckingsale, D. A., Gaudin, W. P., Herdman, J. A., Levesque, J. M., and Jarvis, S. A. "CloverLeaf: Preparing hydrodynamics codes for Exascale." In: A New Vintage of Computing: CUG2013, Napa, CA, 6-9 May 2013. Published in: A New Vintage of Computing: Preliminary Proceedings (2013) [137].

• Pennycook, S. J., Hammond, S. D., Wright, S. A., Herdman, J. A., Miller, I., and Jarvis, S. A. "An investigation of the performance portability of OpenCL." Journal of Parallel and Distributed Computing 73, no. 11 (2013): 1439-1450 [162].

• Mallinson, A. C., Beckingsale, D. A., Gaudin, W. P., Herdman, J. A., and Jarvis, S. A. "Towards portable performance for explicit hydrodynamics codes." (2013) [138].

• Gaudin, W. P., Mallinson, A. C., Perks, O., Herdman, J. A., Beckingsale, D. A., Levesque, J. M., and Jarvis, S. A. "Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite." The Cray User Group (2014): 4-8 [94].

• Mudalige, G. R., Reguly, I. Z., Giles, M. B., Mallinson, A. C., Gaudin, W. P., and Herdman, J. A. "Performance Analysis of a High-Level Abstraction Based Hydrocode on Future Computing Systems." In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, pp. 85-104. Springer International Publishing, 2014 [149].

• Mallinson, A. C., Jarvis, S. A., Gaudin, W. P., and Herdman, J. A. "Experiences at scale with PGAS versions of a Hydrodynamics application." In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 9. ACM, 2014 [139].

In addition, research conducted during the period of registration has also led to the following publications, which are not directly linked to the contents of this thesis:

• Hammond, S. D., Mudalige, G. R., Smith, J. A., Davis, J. A., Jarvis, S. A., Holt, J., Miller, I., Herdman, J. A., and Vadgama, A. "To upgrade or not to upgrade? Catamount vs. Cray Linux Environment." In Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-8. IEEE, 2010 [99].

• Wright, S. A., Hammond, S. D., Pennycook, S. J., Miller, I., Herdman, J. A., and Jarvis, S. A. "LDPLFS: improving I/O performance without application modification." In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pp. 1352-1359. IEEE, 2012 [200].

• Perks, O., Beckingsale, D. A., Hammond, S. D., Miller, I., Herdman, J. A., Vadgama, A., Bhalerao, A. H., He, L., and Jarvis, S. A. "Towards Automated Memory Model Generation via Event Tracing." The Computer Journal 56(2), June 2012 [165].

• Perks, O., Beckingsale, D. A., Dawes, A. S., Herdman, J. A., Mazauric, C., and Jarvis, S. A. "Analysing the influence of InfiniBand choice on OpenMPI memory consumption." In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pp.