Cross-Platform GPGPU With

George van Venrooij
Organic Vectory B.V.

Organic Vectory BV
● Project Services
● Consultancy Services
● Expertise: 3D Visualization, GPU Computing, Architecture/Design
● Markets: Embedded Software, GIS, Finance

OpenCL
● As defined by the Khronos Group (www.khronos.org):
  ● OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software.
● Khronos Group controls many other standards like:
  ● OpenGL (ES)
  ● OpenVG
  ● COLLADA
  ● WebGL
  ● … and many more

OpenCL vs. CUDA

OpenCL & CUDA Device Memory Model
[Figure: OpenCL and CUDA device memory models side by side (source: nVidia OpenCL Tutorial Slides)]

Terminology
  OpenCL              CUDA
  Work-Item           Thread
  Work-Group          Thread-Block
  Global Memory       Global Memory
  Constant Memory     Constant Memory
  Local Memory        Shared Memory
  Private Memory      Local Memory

Qualifiers
  OpenCL                  CUDA
  __kernel                __global__
  (no qualifier needed)   __device__ (function)
  __constant              __constant__
  __global                __device__ (variable)
  __local                 __shared__

Indexing
  OpenCL              CUDA
  get_num_groups()    gridDim
  get_local_size()    blockDim
  get_group_id()      blockIdx
  get_local_id()      threadIdx
  get_global_id()     (calculate manually)
  get_global_size()   (calculate manually)

API Objects
  OpenCL              CUDA
  cl_device_id        CUdevice
  cl_context          CUcontext
  cl_program          CUmodule
  cl_kernel           CUfunction
  cl_mem              CUdeviceptr
  cl_command_queue    CUstream

Kernel Language
  OpenCL:
  ● Subset of C99, data-parallel extensions
  ● Requires run-time compilation by OpenCL driver
  ● No function pointers or recursion
  CUDA:
  ● “C for CUDA”, subset of C, data-parallel extensions
  ● C++ features (function templates) enable higher productivity through meta-programming techniques
  ● Compilation through separate compiler: NVCC
  ● No function pointers or recursion

Device Thread Synchronization
  OpenCL              CUDA
  barrier()           __syncthreads()
  (no equivalent)     __threadfence()
  mem_fence()         __threadfence_block()
  read_mem_fence()    (no equivalent)
  write_mem_fence()   (no equivalent)

Performance Comparison (1)
[Figure: GPU Profiler Tool results, OpenCL vs CUDA tested on Tesla C2050 (source: accelereyes.com)]

Performance Comparison (2)
[Figure: Particle simulation (x-axis: Count) and PI approximation (x-axis: Iterations); y-axis: Total time in sec. Left panels: NVidia GeForce GTX 285, OpenCL vs CUDA. Right panels: ATI Radeon HD 5870 vs NVidia GeForce GTX 285.]

Performance Comparison (3)
[Figure: Random global memory reads and Random global memory writes (x-axis: Count); y-axis: Total time in sec. Left panels: NVidia GeForce GTX 285, OpenCL vs CUDA. Right panels: ATI Radeon HD 5870 vs NVidia GeForce GTX 285.]
Preliminary Conclusions
● There are cases where OpenCL performs better than CUDA
● There are cases where CUDA performs better than OpenCL
● OpenCL seems to have slightly higher overhead for kernel launches compared to CUDA on NVidia's platform
● For some cases the differences can be large, but...
Measuring = knowing!

Back to the Host

Host Synchronization: CUDA Streams
● A stream is a sequence of commands that execute in order
● Streams can contain kernel launches and/or memory transfers
● Host code can wait for stream completion using the cudaStreamSynchronize() call
● Events can be inserted into the stream
● Host code can query event completion or perform a blocking wait for an event
● Useful for synchronization with host code and timing

Host Synchronization: OpenCL Command Queues
● Default behavior of command queues is similar to CUDA Streams
● One big difference: out-of-order execution mode
● clEnqueue...() commands can be given a set of events to wait for
● Each command itself can generate an event
● Based on the dependencies between commands in the queue, OpenCL can determine which commands are allowed to execute simultaneously
● It is possible to create multiple queues for a device
● It is possible to have commands in one queue wait for events from a different queue

Task & Data Parallelism
● The commands, and the events they must wait for, create a “task graph”
● OpenCL will execute the commands in the queue as it sees fit, respecting the dependencies specified
● The end result is a task-parallel framework supporting data-parallel tasks
● Your application could be written entirely in OpenCL kernels, requiring only a small framework that fills the command queue

Intermediate Conclusions
● The programming methodology for data-parallel applications is virtually identical, i.e. if you can program in one language/environment, you can program in the other
● CUDA currently offers certain productivity advantages at the kernel level
● NVidia's hardware seems to be more capable on the GPGPU side when compared to ATi's hardware
● OpenCL has the platform advantage in that it presents a unified platform API for ALL computing hardware in your machine
● OpenCL programs can be run on hardware from different vendors

OpenCL Implementations
AVAILABLE:
  Vendor                   Type          Hardware
  Apple                    CPU           x86_64 (Intel)
  nVidia                   GPU           GeForce 8/9 series and higher
  ATi                      GPU           R700/800 series
  AMD                      CPU           any x86/x86_64 with SSE3 extensions
  Samsung                  CPU           ARM A9
  IBM                      ACCELERATOR   CELL BE
  ZiiLabs                  CPU           ARM
ANNOUNCED/UPCOMING:
  Imagination Technology   GPU           PowerVR SGX Series 5
  VIA                      GPU           VN 1000 Chipset
  S3                       GPU           Chrome 5400E Graphics Processor
  Apple                    CPU           ARM A4

Portability to other platforms
● Results of a kernel are guaranteed across platforms
● Optimal performance is not
● All platforms are required to support data-parallelism, but are not required to support task-parallelism
● OpenCL can be considered a replacement for OpenMP (data-parallel)
● OpenCL can be considered a replacement for Threads (task-parallel)

Libraries & Tools for OpenCL
● ATi StreamProfiler (ATi hardware only)
● NVidia Visual Profiler (NVidia hardware only)
● Stream KernelAnalyzer (ATi hardware only)
● NVidia NSight (NVidia hardware only)
● gDebugger CL (Windows, Mac, Linux, currently in beta)
● libstdcl (wrapper around context/queue management functions)
● GATLAS (Matrix multiplication)
● ViennaCL (BLAS level 1 and 2)
● Language bindings for C++, Fortran, Java, Matlab, .NET, Python and Scala are available (Unofficial)

Libraries & Tools for CUDA
● cuBLAS (closed-source)
● cuFFT (closed-source)
● CUDPP (data-parallel primitives)
● Thrust (high-level CUDA & OpenMP-based algorithms)
● CULATools (LAPACK)
● NSight debugger
● NVidia Visual Profiler
● Language bindings for Python, Java, .NET, MATLAB, Fortran, Perl, Ruby, Lua

Sneak Preview

Things to consider
● Platforms
  ● OpenCL is currently the only choice if you do not want to tie your application to NVidia's hardware
● API stability/agility
  ● OpenCL changes more slowly, retains backward compatibility
  ● CUDA changes more rapidly, unlocks new hardware features quicker
● Third-party library availability
  ● OpenCL is about 2 years younger, so fewer and less mature libraries are available
  ● CUDA has spawned a host of initiatives and various libraries are available, especially in the scientific computing domain
● Supporting tools
  ● OpenCL has a fairly young, but decent set of tools
  ● NVidia recently launched the NSight debugger, which seems more mature

Questions?

Further Reading
● GPGPU
  ● www.gpgpu.org
● OpenCL General
  ● www.khronos.org/opencl
● OpenCL Implementations
  ● http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx
  ● http://developer.nvidia.com/object/opencl.html
  ● http://www.alphaworks.ibm.com/tech/opencl
  ● http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/WhatisOpenCL/WhatisOpenCL.html
● OpenCL/CUDA Comparisons
  ● http://www.gpucomputing.net/?q=node/128
● Mobile/Embedded OpenCL announcements
  ● http://www.imgtec.com/News/Release/index.asp?NewsID=557
  ● http://www.ziilabs.com/technology/opencl.aspx

References
● http://blog.accelereyes.com/blog/2010/05/10/nvidia-fermi-cuda-and-opencl/
● http://www.s3graphics.com/en/news/news_detail.aspx?id=44
● http://www.gremedy.com/gDEBuggerCL.php
● http://browndeertechnology.com/stdcl.html
● http://golem5.org/gatlas/
● http://www.mainconcept.com/products/sdks/hw-acceleration/opencl-h264avc.html
● http://awaregeek.com/news/some-pictures-of-old-computers/