Modern C++, Heterogeneous Computing & OpenCL SYCL

Modern C++, heterogeneous computing & OpenCL SYCL
Ronan Keryell
Khronos OpenCL SYCL committee
05/12/2015, IWOCL 2015 SYCL Tutorial

Outline
1 C++14
2 C++ dialects for OpenCL (and heterogeneous computing)
3 OpenCL SYCL 1.2 C++... putting everything all together
4 OpenCL SYCL 2.1...
5 Conclusion

C++14
• 2 Open Source compilers available before ratification (GCC & Clang/LLVM)
• Confirm new momentum & pace: 1 major (C++11) and 1 minor (C++14) version on a 6-year cycle
• Next big version expected in 2017 (C++1z)
  - Already being implemented!
• Monolithic committee replaced by many smaller parallel task forces
  - Parallelism TS (Technical Specification) with Parallel STL
  - Concurrency TS (threads, mutex...)
  - Array TS (multidimensional arrays à la Fortran)
  - Transactional Memory TS...
Race to parallelism! Definitely matters for HPC and heterogeneous computing!
C++ is a completely new language
• Forget about C++98, C++03...
• Send your proposals and get involved in the C++ committee (pushing heterogeneous computing)!

Modern C++ & HPC (I)
• Huge library improvements
  - <thread> library and multithread memory model, <atomic>
  - Hash-map
  - Algorithms
  - Random numbers
  - ...
  => HPC
• Uniform initialization and range-based for loop

    std::vector<int> my_vector { 1, 2, 3, 4, 5 };
    for (int &e : my_vector)
      e += 1;

• Easy functional programming style with lambda (anonymous) functions

    // The third argument is the output iterator: here v is transformed in place
    std::transform(std::begin(v), std::end(v), std::begin(v),
                   [] (int v) { return 2*v; });

Modern C++ & HPC (II)
• Lots of meta-programming improvements to make meta-programming easier: variadic templates, type traits (<type_traits>)...
• Make simple things simpler to be able to write generic numerical libraries, etc.
• Automatic type inference for terse programming
  - Python 3.x (interpreted):

    def add(x, y):
        return x + y
    print(add(2, 3))      # 5
    print(add("2", "3"))  # 23

  - Same in C++14, but compiled + static compile-time type-checking:

    // "2"s is a std::string literal: needs using namespace std::string_literals;
    auto add = [] (auto x, auto y) { return x + y; };
    std::cout << add(2, 3) << std::endl;       // 5
    std::cout << add("2"s, "3"s) << std::endl; // 23

    Without writing explicit templated code (no template <typename ...>)!

Modern C++ & HPC (III)
• R-value references & std::move semantics
  - matrix_A = matrix_B + matrix_C
    Avoid copying (TB, PB, EB...) when assigning or returning from a function
• Avoid raw pointers, malloc()/free()/new/delete[]: use references and smart pointers instead

    // Allocate a double and wrap it in a smart pointer
    auto gen() { return std::make_shared<double>(3.14); }
    [...]
    {
      auto p = gen(), q = p;
      *q = 2.718;
      // Out of scope, the memory is no longer used: deallocation happens here
    }

• Lots of other amazing stuff...
• Allow both low-level & high-level programming... Useful for heterogeneous computing

C++ dialects for OpenCL (and heterogeneous computing)

OpenCL 2.1 C++ kernel language (I)
• Announced at GDC, March 2015
• Move from C99-based kernel language to C++14-based

    // Template classes to express OpenCL address spaces
    local_array<int, N> array;
    local<float> v;
    constant_ptr<double> p;
    // Use C++11 generalized attributes, to ignore vector dependencies
    [[safelen(8), ivdep]]
    for (int i = 0; i < N; i++)
      // Can infer that offset >= 8
      array[i+offset] = array[i] + 42;

OpenCL 2.1 C++ kernel language (II)
• Kernel side enqueue
  - Replace OpenCL 2 infamous Apple GCD block syntax by C++11 lambda

    kernel void main_kernel(int N, int *array) {
      // Only work-item 0 will launch a new kernel
      if (get_global_id(0) == 0)
        // Wait for the end of this work-group before starting the new kernel
        get_default_queue().enqueue_kernel(CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP,
                                           ndrange { N },
                                           [=] kernel { array[get_global_id(0)] = 7; });
    }

• C++14 memory model and atomic operations
• Newer SPIR-V binary IR format

OpenCL 2.1 C++ kernel language (III)
• Amazing progress but no single source solution à la CUDA yet
  - Still need to play with OpenCL host API to deal with buffers, etc.
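To make the control flow of the kernel-side enqueue example above concrete without an OpenCL 2.1 toolchain, here is a host-only C++14 sketch that emulates it sequentially. Everything in it is an assumption made for illustration: `emulate_ndrange` and `run_main_kernel` are hypothetical helpers, not OpenCL API, and the work-items run one after another in a loop instead of in parallel. It only mirrors the ordering: one parent work-item enqueues a child kernel, and the child runs over all work-items once the parent work-group has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an nd-range launch: calls kernel(id) for each
// global id in [0, n), sequentially (a real device would run them in parallel).
template <typename Kernel>
void emulate_ndrange(std::size_t n, Kernel kernel) {
    for (std::size_t id = 0; id < n; ++id)
        kernel(id);
}

// Hypothetical host emulation of main_kernel from the slide. Child kernels
// enqueued by the parent are deferred until every parent work-item has run,
// which mimics CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP for a single work-group.
void run_main_kernel(std::vector<int>& array) {
    std::vector<std::function<void()>> deferred;
    const std::size_t n = array.size();

    emulate_ndrange(n, [&](std::size_t id) {
        // Only work-item 0 enqueues the child kernel
        if (id == 0)
            deferred.push_back([&array, n] {
                emulate_ndrange(n, [&array](std::size_t child_id) {
                    array[child_id] = 7;
                });
            });
    });

    // Sanity check: only work-item 0 enqueued, so at most one child is pending
    assert(deferred.size() <= 1);

    // The parent work-group is done: now run what it enqueued
    for (auto& child : deferred)
        child();
}
```

With a `std::vector<int> a(8)`, calling `run_main_kernel(a)` leaves every element equal to 7. The deferred list is the part worth noticing: the child kernel is recorded by the parent but only starts after the parent work-group completes.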
Bolt C++ (I)
• Parallel STL + map-reduce
  https://github.com/HSA-Libraries/Bolt
• Developed by AMD on top of OpenCL, C++AMP or TBB

    #include <bolt/cl/sort.h>
    #include <vector>
    #include <algorithm>
    #include <cstdlib>

    int main() {
      // Generate random data (on host)
      std::vector<int> a(8192);
      std::generate(a.begin(), a.end(), rand);
      // Sort, run on best device in the platform
      bolt::cl::sort(a.begin(), a.end());
      return 0;
    }

• Simple!

Bolt C++ (II)
• But...
  - No direct interoperability with OpenCL world
  - No specific compiler required with OpenCL => some special syntax to define operations on the device
    => OpenCL kernel source strings for complex operations, with macros BOLT_FUNCTOR(), BOLT_CREATE_TYPENAME(), BOLT_CREATE_CLCODE()...
  - Works better with AMD Static C++ Kernel Language Extension (now in OpenCL 2.1) & best with C++AMP (but no OpenCL interoperability...)
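The point about kernel source strings and macros can be illustrated with plain preprocessor stringification. The sketch below is not Bolt's implementation: `DEVICE_FUNCTOR` and `call_on_host` are hypothetical names, and the macro only shows the general technique of emitting the same code twice, once as ordinary C++ for the host compiler and once as text that a runtime could pass to an OpenCL compiler.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro (not Bolt's): emits the code as normal C++ and also
// keeps its text in <name>_source via preprocessor stringification.
#define DEVICE_FUNCTOR(name, ...) \
    __VA_ARGS__ \
    static const char* const name##_source = #__VA_ARGS__;

DEVICE_FUNCTOR(add_four,
    struct add_four {
        float operator()(float x) const { return x + 4; }
    };
)

// Host-side use of both halves: check that the captured text contains the
// functor definition, then call the compiled functor directly.
float call_on_host(float x) {
    assert(std::strstr(add_four_source, "struct add_four") != nullptr);
    return add_four{}(x);
}
```

Here `call_on_host(1.0f)` returns 5.0f, while `add_four_source` holds the functor definition as a C string. This is why such libraries need no special compiler, and also why the functor body has to stay within what the device language accepts.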
Boost.Compute (I)
• Boost library accepted in 2015
  https://github.com/boostorg/compute
• Provide 2 levels of abstraction
  - High-level parallel STL
  - Low-level C++ wrapping of OpenCL concepts

Boost.Compute (II)

    // Get a default command queue on the default accelerator
    auto queue = boost::compute::system::default_queue();
    // Allocate a vector in a buffer on the device
    boost::compute::vector<float> device_vector { N, queue.get_context() };
    boost::compute::iota(device_vector.begin(), device_vector.end(), 0);
    // Create an equivalent OpenCL kernel
    BOOST_COMPUTE_FUNCTION(float, add_four, (float x), { return x + 4; });
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(), add_four, queue);
    boost::compute::sort(device_vector.begin(), device_vector.end(), queue);
    // Lambda expression equivalent
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(),
                              boost::compute::lambda::_1 * 3 - 4, queue);

Boost.Compute (III)
• Elegant implicit C++ conversions between OpenCL and Boost.Compute types for finer control and optimizations

    auto command_queue = boost::compute::system::default_queue();
    auto context = command_queue.get_context();
    auto program =
      boost::compute::program::create_with_source_file(kernel_file_name, context);
    program.build();
    boost::compute::kernel im2col_kernel { program, "im2col" };
    boost::compute::buffer im_buffer { context, image_size*sizeof(float),
                                       CL_MEM_READ_ONLY };
    command_queue.enqueue_write_buffer(im_buffer, 0 /* Offset */,
                                       im_data.size()*sizeof(decltype(im_data)::value_type),
                                       im_data.data());

Boost.Compute (IV)

    im2col_kernel.set_args(im_buffer, height, width,
                           ksize_h, ksize_w, pad_h, pad_w,
                           stride_h, stride_w,
                           height_col, width_col, data_col);
    command_queue.enqueue_nd_range_kernel(im2col_kernel,
                                          boost::compute::extents<1> { 0 } /* global work offset */,
                                          boost::compute::extents<1> { workitems } /* global work-item */,
                                          boost::compute::extents<1> { workgroup_size } /* Work group size */);

• Provide program caching
• Direct OpenCL interoperability for extreme performance
• No specific compiler required => some special syntax to define operations on the device
• Probably the right tool to use to translate CUDA & Thrust to the OpenCL world

VexCL (I)
• Parallel STL similar to Boost.Compute + mathematical libraries
  https://github.com/ddemidov/vexcl
  - Random generators (Random123)
  - FFT
  - Tensor operations
  - Sparse matrix-vector products
  - Stencil convolutions
  - ...
• OpenCL (CL.hpp or Boost.Compute) & CUDA back-end
• Allow device vectors & operations to span different accelerators from different vendors in the same context

    vex::Context ctx { vex::Filter::Type { CL_DEVICE_TYPE_GPU }
                       && vex::Filter::DoublePrecision };
    vex::vector<double> A { ctx, N }, B { ctx, N }, C { ctx, N };
    A = 2 * B - sin(C);

VexCL (II)
  - Allow easy interoperability with back-end
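As a reference point for the VexCL vector expression `A = 2 * B - sin(C)` shown above, the following host-only sketch spells out the element-wise arithmetic in standard C++. It does not use VexCL and says nothing about how VexCL evaluates the expression on the devices of its context; `two_b_minus_sin_c` is a hypothetical name chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side equivalent of A = 2 * B - sin(C):
// every output element depends only on the elements at the same index.
std::vector<double> two_b_minus_sin_c(const std::vector<double>& b,
                                      const std::vector<double>& c) {
    assert(b.size() == c.size());  // the operands must have the same length
    std::vector<double> a(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        a[i] = 2 * b[i] - std::sin(c[i]);
    return a;
}
```

Because each element is independent of the others, the loop body is exactly the kind of work that can be split across several accelerators, which is what the single-line vector expression hides from the user.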