PROGRAMMING Intel® Processor Graphics

Chi-Keung (CK) Luk, Principal Engineer, Intel Software & Services Group

Agenda
• Compute programming on Intel Graphics with:
  - OpenCL
  - CilkPlus
• Tools
• Workload performance

Acknowledgments for Slide Sources
• OpenCL: Khronos; Uri Levy, Yuval Eshkol, Doron Singer; Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek
• CilkPlus: Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev
• VTune: Alexandr Kurylev, Julia Fedorova
• Workload Performance: Sharad Tripathi, Chinang Ma, Akhila Vidiyala; Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi

OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
3. Using Shared Virtual Memory (SVM)

OpenCL* (Open Computing Language)
• An open standard managed by the Khronos group
• A set of C-based APIs on the host that defines a run-time environment
• Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)

Based on C99
• Mostly C-like
• Kernels (functions that execute a work-item) have a kernel prefix and a void return type:

    kernel void foo(global int* ptr)
    {
        ...
        for (int i = 0; i < 10; i++) {
            if (ptr == NULL) continue;
            ptr[i] = i;
        }
        ...
    }

• No support for library functions
  . No stdio.h / stdlib.h / math.h / etc.
  . But printf is supported

Supported data types
• Scalar data types
  . char / uchar / short / ushort / int / uint / long / ulong
  . float / double
  . size_t
  . Pointers
• Derived data types
  . Arrays
  . Structures
• Vector data types

Working with vectors
• Vectors exist for all scalar types
• Vector widths are 2, 3, 4, 8, 16
• All arithmetic operations work on vector types:

    uint3 vec0, vec1;
    ...
    uint3 result = vec0 + vec1;

• Component access (XYZW):

    double res1 = dvec0.x + dvec1.z;
    double2 res2 = dvec0.wy + dvec1.xx; // Swizzle

• Vectors wider than 4 use numeric (hexadecimal) indices:

    float res1 = vec16.s5 + vec16.sf;
    float2 res2 = vec8.s37 + vec16.sca; // Swizzle

Additional details
• Functions for querying the index:
  . get_global_id(int dimension);  // index of work-item in the entire execution
  . get_local_id(int dimension);   // index of work-item in its work-group
  . get_group_id(int dimension);   // index of the work-group
  . A few others
• Memory:
  . No support for dynamic memory allocation
  . When passing buffers as arguments, specify global / local / constant:

    kernel void foo(global const int* ptr, local float* scratch) { ...
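As a hedged illustration (not from the deck) tying these language features together, the sketch below combines work-item indexing, vector types, a swizzle, and explicit conversions in one kernel; the kernel name, buffer layout, and gain parameter are all invented for the example.

    // Illustrative sketch: brighten an RGBA image, one pixel per work-item.
    // The name "brighten" and the gain parameter are hypothetical.
    kernel void brighten(global const uchar4* in,
                         global uchar4* out,
                         float gain)
    {
        size_t gid = get_global_id(0);        // index of this work-item
        float4 px = convert_float4(in[gid]);  // explicit vector conversion
        px.xyz *= gain;                       // swizzle: scale RGB, leave alpha
        out[gid] = convert_uchar4_sat(px);    // saturating conversion back to uchar4
    }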
Built-in functions
• Many functions are supported, overloaded for all relevant types and vector widths. A small sample:
  . Math: sin / cos / min / max / log / pow / sqrt / ...
  . Geometric: dot / cross / distance / length / ...
  . Relational: isequal / isgreater / all / any / select / ...
• Example:

    kernel void dot_product(global const float4* a,
                            global const float4* b,
                            global float* out)
    {
        size_t tid = get_global_id(0);
        out[tid] = dot(a[tid], b[tid]);
    }

Casts and type conversions
• Scalar values use "normal" (C-style) casts
• Vector conversions must be done explicitly:

    dstValue = convert_destType(srcValue);

  . Source and destination types must have the same vector width
  . Example:

    int8 intVec;
    double8 dVec;
    float8 fVec = convert_float8(intVec);
    float8 fVec2 = convert_float8(dVec);

• Vector construction (from scalars):

    ushort3 vecUshort = (ushort3)(0, 12, 7);

Language summary
• Based on C99, with some extra features:
  . Vector data types
  . An extensive built-in functions library
  . Image handling, work-group synchronization
• Minus others:
  . Recursion
  . Function pointers
  . Pointers to pointers
  . Dynamic memory allocation

OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
  . Use zero copying to transfer data between CPU and GPU
  . Maximize EU occupancy
  . Maximize compute performance
  . Avoid divergent control flow
  . Take advantage of the large register space
  . Optimize data accesses
3. Using Shared Virtual Memory (SVM)

Optimizing Host-to-Device Transfers: Zero Copying
• Host (CPU) and device (GPU) share the same physical memory
• For buffers allocated through the OpenCL™ runtime:
  - Let the OpenCL runtime allocate the system memory:
    . Create the buffer with CL_MEM_ALLOC_HOST_PTR
  - OR use pre-allocated system memory:
    . Create the buffer with the system memory pointer and CL_MEM_USE_HOST_PTR
    . Allocate the memory aligned to a page (4096 bytes), e.g., with _aligned_malloc or memalign
    . Allocate a multiple of the cache-line size (64 bytes)
  - Use clEnqueueMapBuffer() to access the data
  - No transfer needed (zero copy)! A minimal host-side sketch follows.
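A hedged host-side sketch of the zero-copy recipe above, assuming an existing cl_context and cl_command_queue. The helper names are hypothetical, posix_memalign is the Linux spelling (use _aligned_malloc on Windows), and error handling is omitted for brevity.

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Hypothetical helper: wrap page-aligned, cache-line-padded host memory
       in a zero-copy buffer via CL_MEM_USE_HOST_PTR. */
    static cl_mem create_zero_copy_buffer(cl_context ctx, size_t nbytes,
                                          void **host_ptr_out)
    {
        size_t size = (nbytes + 63) & ~(size_t)63;  /* pad to 64-byte cache lines */
        void *host_ptr = NULL;
        posix_memalign(&host_ptr, 4096, size);      /* page-aligned allocation */
        *host_ptr_out = host_ptr;
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              size, host_ptr, &err);
    }

    /* Access the contents without a copy: map, touch, unmap. */
    static void update_first_byte(cl_command_queue q, cl_mem buf, size_t size)
    {
        cl_int err;
        unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
            q, buf, CL_TRUE /* blocking */, CL_MAP_READ | CL_MAP_WRITE,
            0, size, 0, NULL, NULL, &err);
        p[0] ^= 1;                                  /* work on the data in place */
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    }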
Maximizing EU Occupancy
• Occupancy is a measure of EU thread utilization
• Two primary things to consider:
  - Launch enough work-items to keep the EU threads busy
  - In short kernels: use short vector data types and compute multiple pixels per work-item to better amortize the thread-launch cost
• For example, color conversion:

  Before (one pixel per work-item):

    __global uchar *src, *dst;
    ...
    p = src[src_idx]     * B2Y +
        src[src_idx + 1] * G2Y +
        src[src_idx + 2] * R2Y;
    dst[dst_idx] = p;

  After (four pixels per work-item):

    __global uchar *src_ptr, *dst_ptr;
    ...
    uchar16 src = vload16(0, src_ptr);
    uchar4 c0 = src.s048c;
    uchar4 c1 = src.s159d;
    uchar4 c2 = src.s26ae;
    uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
    vstore4(Y, 0, dst_ptr);

Maximize Compute Performance
• Use floats instead of integer data types
  - Because an EU can issue two float operations per cycle
• Floating-point throughput depends on the data width:
  - half (16-bit) throughput = 2 × float (32-bit) throughput
  - float (32-bit) throughput = 4 × double (64-bit) throughput
• Trade accuracy for speed, where appropriate:
  - Use the "native" built-ins (or compile with -cl-fast-relaxed-math)
  - Use mad() / fma() (or compile with -cl-mad-enable)

    x = cos(i);          // precise
    x = native_cos(i);   // faster, lower accuracy

Avoid Divergent Control Flow
• "SIMT" ISA with predication and branching
• "Divergent" code executes both branches:
  . Reduced SIMD efficiency, increased power and execution time
• Example:

    this();
    if (x)
      that();
    else
      another();
    finish();

  [Figure: two SIMD lane-versus-time diagrams comparing the case where "x" is sometimes true with the case where "x" is never true.]

Optimizing Data Accesses

Take Advantage of the Large Register Space
• Each work-item in an OpenCL™ kernel has access to 256-512 bytes of register space
• Bandwidth to registers is faster than to any memory
• Loading and processing blocks of pixels in registers is very efficient
• Use the available registers (up to 512 bytes) instead of memory, where possible:

    float sum[PX_PER_WI_X] = { 0.0f };
    float k[KERNEL_SIZE_X];                  // allocated in registers
    float d[PX_PER_WI_X + KERNEL_SIZE_X];    // allocated in registers
    // Load the filter kernel into k and the input data into d
    ...
    // Compute the convolution
    for (px = 0; px < PX_PER_WI_X; ++px)
      for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
        sum[px] = mad(k[sx], d[px + sx], sum[px]);

Global and Constant Memory
• Global memory accesses go through the L3 cache
  . An L3 cache line is 64 bytes
  . EU thread accesses to the same cache line are collapsed
• The order of data within a cache line does not matter
• Bandwidth is determined by the number of cache lines accessed
  . Maximum bandwidth (L3 to EU): 64 bytes / clock / sub-slice
• Good: load at least 32 bits of data at a time, starting from a 32-bit-aligned address (a short sketch follows)
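As a hedged illustration of the last point (invented kernels, not from the deck): reading the same byte array through a uchar4 pointer turns four 8-bit accesses into one 32-bit access, so each EU thread touches fewer, fuller cache lines. The second kernel is launched with a quarter of the global size.

    // 8-bit loads/stores: one byte per work-item
    kernel void copy_bytes(global const uchar* in, global uchar* out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid];
    }

    // 32-bit loads/stores over the same layout: four bytes per work-item,
    // launched with global size N/4 instead of N
    kernel void copy_uchar4(global const uchar4* in, global uchar4* out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid];
    }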