Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

Total pages: 16

File type: PDF; size: 1020 KB

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

A Dissertation Presented by Yash Ukidave to the Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering. Northeastern University, Boston, Massachusetts, December 2015. © Copyright 2015 by Yash Ukidave. All Rights Reserved.

Contents

List of Figures
List of Tables
List of Acronyms
Acknowledgments
Abstract

1 Introduction
  1.1 Types of Concurrency on GPUs
  1.2 Benefits of Multi-Level Concurrency
  1.3 Challenges in Implementing Multi-Level Concurrency
    1.3.1 Runtime-Level Challenges
    1.3.2 Architectural Challenges
    1.3.3 Scheduling Challenges for Multi-Tasked GPUs
  1.4 Contributions of the Thesis
  1.5 Organization of the Thesis

2 Background
  2.1 GPU Architecture
    2.1.1 AMD Graphics Core Next (GCN) Architecture
    2.1.2 NVIDIA Kepler Architecture
    2.1.3 Evolution of GPU Architecture
  2.2 The OpenCL Programming Model
    2.2.1 Platform Model
    2.2.2 Memory Model
    2.2.3 Runtime and Driver Model
  2.3 Multi2Sim: Cycle-Level GPU Simulator
    2.3.1 Disassembler
    2.3.2 Functional Simulation / Emulation
    2.3.3 Architectural Simulator
    2.3.4 OpenCL Runtime and Driver Stack for Multi2Sim
  2.4 Concurrent Kernel Execution on GPUs
    2.4.1 Support for Concurrent Kernel Execution on GPUs
  2.5 Concurrent Context Execution on GPUs
    2.5.1 Context Concurrency for Cloud Engines and Data Centers
    2.5.2 Key Challenges for Multi-Context Execution

3 Related Work
  3.1 GPU Modeling and Simulation
    3.1.1 Functional Simulation of GPUs for Compute and Graphics Applications
    3.1.2 Architectural Simulation of GPUs
    3.1.3 Power and Area Estimation on GPUs
  3.2 Design of Runtime Environments for GPUs
  3.3 Resource Management for Concurrent Execution
    3.3.1 Spatial Partitioning of Compute Resources
    3.3.2 QoS-Aware Shared Resource Management
    3.3.3 Workload Scheduling on GPUs
    3.3.4 GPU Virtualization for Data Centers and Cloud Engines
    3.3.5 Scheduling on Cloud Servers and Datacenters
    3.3.6 Address Space and Memory Management on GPUs

4 Adaptive Spatial Partitioning
  4.1 Mapping Multiple Command Queues to Sub-Devices
  4.2 Design of the Adaptive Spatial Partitioning Mechanism
    4.2.1 Compute Unit Reassignment Procedure
    4.2.2 Case Study for Adaptive Partitioning
  4.3 Workgroup Scheduling Mechanisms and Partitioning Policies
  4.4 Evaluation Methodology
    4.4.1 Platform for Evaluation
    4.4.2 Evaluated Benchmarks
  4.5 Evaluation Results
    4.5.1 Performance Enhancements Provided by Multiple Command Queue Mapping
    4.5.2 Effective Utilization of Cache Memory
    4.5.3 Effects of Workgroup Scheduling Mechanisms and Partitioning Policies
    4.5.4 Timeline Describing Load-Balancing Mechanisms for Adaptive Partitioning
  4.6 Summary

5 Multi-Context Execution
  5.1 Need for Multi-Context Execution
  5.2 Architectural and Runtime Enhancements for Multi-Context Execution
    5.2.1 Extensions in the OpenCL Runtime
    5.2.2 Address Space Isolation
  5.3 Resource Management for Efficient Multi-Context Execution
    5.3.1 Transparent Memory Management Using CML
    5.3.2 Remapping Compute Resources Between Contexts
  5.4 Firmware Extensions on the Command Processor (CP) for Multi-Context Execution
  5.5 Evaluation Methodology
    5.5.1 Platform for Evaluation
    5.5.2 Applications for Evaluation
  5.6 Evaluation Results
    5.6.1 Speedup and Utilization with Multiple Contexts
    5.6.2 Evaluation of Global Memory Efficiency
    5.6.3 TMM Evaluation
    5.6.4 Evaluation of the Compute-Unit Remapping Mechanism
    5.6.5 L2 Cache Efficiency for Multi-Context Execution
    5.6.6 Firmware Characterization and CP Selection
    5.6.7 Overhead in Multi-Context Execution
  5.7 Discussion
    5.7.1 Static L2 Cache Partitioning for Multiple Contexts
  5.8 Summary

6 Application-Aware Resource Management for GPUs
  6.1 Need and Opportunity for QoS
  6.2 Virtuoso-Defined QoS
    6.2.1 QoS Policies
    6.2.2 QoS Metrics
  6.3 Virtuoso Architecture
    6.3.1 QoS Interface: Settings Register (QSR)
    6.3.2 QoS Assignment: Performance Analyzer
    6.3.3 QoS Enforcement: Resource Manager
  6.4 Evaluation Methodology
    6.4.1 Platform for Evaluation
    6.4.2 Applications for Evaluation
    6.4.3 Choosing the QoS Reallocation Interval
  6.5 Virtuoso Evaluation
    6.5.1 Evaluation of QoS Policies
    6.5.2 Impact of Compute Resource Allocation
    6.5.3 Effects of QoS-Aware Memory Management
    6.5.4 Case Study: Overall Impact of Virtuoso
    6.5.5 Evaluation of Memory Oversubscription
    6.5.6 Sensitivity Analysis of High-Priority Constraints
    6.5.7 Measuring QoS Success Rate
  6.6 Discussion
  6.7 Summary

7 Machine Learning Based Scheduling for GPU Cloud Servers
  7.1 Scheduling Challenges for GPU Cloud Servers and Datacenters
  7.2 GPU Remoting for Cloud Servers and Datacenters
  7.3 Collaborative Filtering for Application Performance Prediction
  7.4 Mystic Architecture
    7.4.1 The Causes of Interference (CoI)
    7.4.2 Stage 1: Initializer and Profile Generator
    7.4.3 Stage 2: Collaborative Filtering (CF) Based Prediction
    7.4.4 Stage 3: Interference-Aware Scheduler
    7.4.5 Validation
  7.5 Evaluation Methodology
    7.5.1 Server Setup
    7.5.2 Workloads for Evaluation
    7.5.3 Metrics
  7.6 Evaluation Results
    7.6.1 Scheduling Performance
    7.6.2 System Performance Using Mystic
    7.6.3 GPU Utilization
    7.6.4 Quality of Launch Sequence Selection
    7.6.5 Scheduling Decision Quality
    7.6.6 Scheduling Overheads Using Mystic
  7.7 Summary

8 Conclusion
  8.1 Future Work

Bibliography

List of Figures

2.1 Basic architecture of a GPU
2.2 AMD GCN compute unit block diagram
2.3 NVIDIA Advanced Stream Multiprocessor (SMX)
2.4 OpenCL host-device architecture
2.5 The OpenCL memory model
2.6 The runtime and driver model of OpenCL
2.7 Multi2Sim's three-stage architectural simulation
2.8 Pipeline model of the GCN SIMD unit
2.9 Multi2Sim runtime and driver stack for OpenCL
2.10 Compute stages of CLSurf
2.11 Kernel scheduling for the Fermi GPU
2.12 Multiple hardware queues in GPUs
2.13 Active time of data buffers in GPU applications
4.1 High-level model describing multiple command queue mapping
4.2 Flowchart for the adaptive partitioning handler
4.3 Procedure to initiate compute unit reallocation between kernels
4.4 Case study to demonstrate our adaptive partitioning mechanism
4.5 Workgroup scheduling mechanisms
4.6 Performance speedup using multiple command queue mapping
4.7 Memory hierarchy of the modeled AMD HD 7970 GPU device
4.8 L2 cache hit rate when using multiple command queue mapping
4.9 Effect of different partitioning policies and workgroup scheduling mechanisms
4.10 Timeline analysis of compute-unit reassignment
4.11 Overhead for CU transfer
5.1 Resource utilization and execution time of Rodinia kernels
5.2 Position of the CML and context dispatch in the runtime
5.3 Address space isolation using CID
5.4 Combined host and CP interaction for TMM
5.5 TMM operation, managing three concurrent contexts
5.6 Speedup and resource utilization improvement using multi-context execution
5.7 Compute unit utilization using multi-context execution
5.8 Increase in execution time for multi-context execution
5.9 Global memory bandwidth utilization using multi-context execution
5.10 Timeline of global memory usage for contexts from different mixes
5.11 Evaluation of different global memory management mechanisms
5.12 Performance of pinned memory updates over transfer stalling
5.13 Timeline analysis of the compute unit
Recommended publications
  • Small Form Factor 3D Graphics for Your PC
    VisionTek Part# 900701, Productivity Series: Small Form Factor 3D Graphics for Your PC. The VisionTek Radeon R7 240SFF graphics card offers a perfect balance of performance, features, and affordability for the gamer seeking a complete solution. It offers support for the DirectX® 11.2 graphics standard and 4K Ultra HD for stunning 3D visual effects, realistic lighting, and lifelike imagery. Its short form factor design enables it to fit into the latest low-profile desktops and workstations, yet the R7 240SFF can be converted to a standard ATX design with the included tall bracket. With 2GB of DDR3 memory, the award-winning Graphics Core Next (GCN) architecture, and DVI-D/HDMI outputs, the VisionTek Radeon R7 240SFF is big on features and light on your wallet. Radeon R7 240 specs: Graphics Engine: Radeon R7 240 • Video Memory: 2GB DDR3 • Memory Interface: 128-bit • DirectX® Support: 11.2 • Bus Standard: PCI Express 3.0 • Core Speed: 780MHz • Memory Speed: 800MHz x2 • VGA Output: VGA* • DVI Output: SL DVI-D • HDMI Output: HDMI (video/audio) • UEFI Ready: supported. System requirements: a PCI Express®-based PC with one x16-lane graphics slot available on the motherboard • 400W (or greater) power supply recommended; 500W for AMD CrossFire™ technology in dual mode • minimum 1GB of system memory • installation software requires a CD-ROM drive. GCN Architecture: a new design for AMD's unified graphics processing and compute cores that allows them to achieve higher utilization for improved performance and efficiency. 4K Ultra HD Support: experience what you've been missing even at 1080p! With support for 3840 x 2160 output via the HDMI port, textures and other detail normally compressed for lower resolutions.
  • Exploring Weak Scalability for FEM Calculations on a GPU-Enhanced Cluster
    Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Dominik Göddeke, Sven H.M. Buijssen, Matthias Grajewski and Stefan Turek (Institute of Applied Mathematics, University of Dortmund); Robert Strzodka (Stanford University, Max Planck Center); Jamaludin Mohd-Yusof and Patrick McCormick (Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory). Abstract: The first part of this paper surveys co-processor approaches for commodity-based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. Thus, even the addition of low-end, out-of-date GPUs leads to improvements in both performance- and power-related metrics. Key words: graphics processors, heterogeneous computing, parallel multigrid solvers, commodity based clusters, Finite Elements. PACS: 02.70.-c (Computational Techniques (Mathematics)), 02.70.Dc (Finite Element Analysis), 07.05.Bx (Computer Hardware and Languages), 89.20.Ff (Computer Science and Technology). Corresponding author: Vogelpothsweg 87, 44227 Dortmund, Germany; email: [email protected]; phone: (+49) 231 755-7218; fax: -5933. Supported by the German Science Foundation (DFG), project TU102/22-1; by a Max Planck Center for Visual Computing and Communication fellowship; and partially supported by the U.S.
  • Hardware Developments V (E-CAM Deliverable 7.9; report, delivered in July 2020)
    Hardware Developments V, E-CAM Deliverable 7.9 (report, public dissemination). E-CAM: The European Centre of Excellence for Software, Training and Consultancy in Simulation and Modelling, funded by the European Union under grant agreement 676531. Project title: E-CAM: An e-infrastructure for software, training and discussion in simulation and modelling. Project website: https://www.e-cam2020.eu. EC Project Officer: Juan Pelegrín. Contractual date of delivery: project month 54 (31st March, 2020); actual date of delivery: 6th July, 2020. Description: update on "Hardware Developments IV" (Deliverable 7.7), which covers a report on hardware developments that will affect the scientific areas of interest to E-CAM and detailed feedback to the project software developers (STFC); discussion of project software needs with hardware and software vendors, and completion of a survey of what is already available for particular hardware platforms (FR-IDF); and detailed output from direct face-to-face sessions between the project end-users, developers and hardware vendors (ICHEC). Written by Alan Ó Cais (JSC); contributors: Christopher Werner (ICHEC), Simon Wong (ICHEC), Padraig Ó Conbhuí (ICHEC), Alan Ó Cais (JSC), Jony Castagna (STFC), Godehard Sutmann (JSC); reviewed by Luke Drury (NUID UCD) and Jony Castagna (STFC); approved by Godehard Sutmann (JSC). Available at: https://www.e-cam2020.eu/deliverables. Keywords: E-CAM, HPC, Hardware, CECAM, Materials. Disclaimer: this deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement.
  • Conga-TR4 User's Guide
    COM Express™ conga-TR4: COM Express Type 6 Basic module based on the 4th Generation AMD Embedded V- and R-Series SoC. User's Guide, Revision 1.10. Copyright © 2018 congatec GmbH. Revision history (revision, date, author, changes):
    • 0.1, 2018.01.15, BEU: preliminary release.
    • 1.0, 2018.10.15, BEU: updated "Electrostatic Sensitive Device" information on page 3; corrected single/dual channel MT/s rates for two variants in table 2; updated section 2.2 "Supported Operating Systems"; added values for four variants in section 2.5 "Power Consumption"; added values in section 2.6 "Supply Voltage Battery Power"; updated images in section 4 "Cooling Solutions"; added note about requiring a re-driver on the carrier for USB 3.1 Gen 2 in sections 5.1.2 "USB" and 7.4 "USB Host Controller"; added Intel® Ethernet Controller i211 as an assembly option in table 4 "Feature Summary" and section 5.1.4 "Ethernet"; corrected section 7.4 "USB Host Controller"; added section 9 "System Resources".
    • 1.1, 2019.03.19, BEU: corrected image in section 2.4 "Supply Voltage Standard Power"; updated section 10.4 "Supported Flash Devices".
    • 1.2, 2019.04.02, BEU: corrected supported memory in tables 2 and 3, and added information about supported memory in table 4; added information about the new industrial variant in tables 3 and 7.
    • 1.3, 2019.07.30, BEU: updated note in section 4 "Cooling Solutions"; changed the number of supported USB 3.1 Gen 2 interfaces to two throughout the document; added note regarding USB 3.1 Gen 2 in section 7.4 "USB Host Controller".
    • 1.4, 2020.01.07, BEU
  • Monte Carlo Evaluation of Financial Options Using a GPU (MSc thesis)
    Monte Carlo Evaluation of Financial Options using a GPU. Claus Jespersen (20093084). A thesis presented for the degree of Master of Science, Computer Science Department, Aarhus University, Denmark, 02-02-2015. Supervisor: Gerth Brodal. Abstract: The financial sector has in the last decades introduced several new financial instruments. Among these instruments are the financial options, which for some cases can be difficult if not impossible to evaluate analytically. In those cases the Monte Carlo method can be used for pricing these instruments. The Monte Carlo method is a computationally expensive algorithm for pricing options, but is at the same time an embarrassingly parallel algorithm. Modern graphical processing units (GPUs) can be used for general-purpose parallel computing, and the Monte Carlo method is an ideal candidate for GPU acceleration. In this thesis, we will evaluate the classical vanilla European option, an arithmetic Asian option, and an up-and-out barrier option using the Monte Carlo method accelerated on a GPU. We consider two scenarios: a single option evaluation, and a sequence of a varying amount of option evaluations. We report performance speedups of up to 290x versus a single-threaded CPU implementation and up to 53x versus a multi-threaded CPU implementation. Contents, Part I (Theoretical Aspects of Computational Finance): Computational Finance; Options (types of options, exotic options); Pricing of options (the Black-Scholes partial differential equation; solving the PDE and pricing vanilla European options).
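    The estimator the abstract describes — average the discounted payoffs of many simulated price paths — is compact enough to sketch in plain Python. This serial sketch only illustrates the method; the parameter values are hypothetical, and the thesis's GPU version would spread the per-path loop across thousands of threads instead:

```python
import math
import random

def mc_european_call(s0, k, r, sigma, t, n_paths, seed=42):
    """Monte Carlo price of a vanilla European call under Black-Scholes dynamics."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t   # risk-neutral drift of the log-price
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Terminal price of one geometric-Brownian-motion path.
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(s_t - k, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-r * t) * payoff_sum / n_paths

price = mc_european_call(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0,
                         n_paths=100_000)
```

    Because every path is independent, the algorithm is embarrassingly parallel, which is exactly what makes the GPU speedups reported above attainable: one GPU thread per path (or per batch of paths), followed by a parallel reduction of the payoffs.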
  • Hardware Developments IV (E-CAM Deliverable 7.7; report, delivered in June 2019)
    Hardware Developments IV, E-CAM Deliverable 7.7 (report, public dissemination). E-CAM: The European Centre of Excellence for Software, Training and Consultancy in Simulation and Modelling, funded by the European Union under grant agreement 676531. Project title: E-CAM: An e-infrastructure for software, training and discussion in simulation and modelling. Project website: https://www.e-cam2020.eu. EC Project Officer: Juan Pelegrín. Contractual date of delivery: project month 44 (31st May, 2019); actual date of delivery: 25th June, 2019. Description: update on "Hardware Developments III" (Deliverable 7.5), which covers a report on hardware developments that will affect the scientific areas of interest to E-CAM and detailed feedback to the project software developers (STFC); discussion of project software needs with hardware and software vendors, and completion of a survey of what is already available for particular hardware platforms (CNRS); and detailed output from direct face-to-face sessions between the project end-users, developers and hardware vendors (ICHEC). Written by Alan Ó Cais (JSC); contributors: Jony Castagna (STFC), Godehard Sutmann (JSC); reviewed by Jony Castagna (STFC); approved by Godehard Sutmann (JSC). Available at: https://www.e-cam2020.eu/deliverables. Keywords: E-CAM, HPC, Hardware, CECAM, Materials. Disclaimer: this deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement.
  • ATI Radeon™ HD 2000 Series Technology Overview
    Confidential. ATI Radeon™ HD 2000 Series Technology Overview. Richard Huddy, Worldwide DevRel Manager, AMD Graphics Products Group. Introducing the ATI Radeon™ HD 2000 Series: ATI Radeon™ HD 2900 Series (enthusiast), ATI Radeon™ HD 2600 Series (mainstream), ATI Radeon™ HD 2400 Series (value). ATI Radeon™ HD 2000 Series highlights. Technology leadership: highest clock speeds (up to 800 MHz); highest transistor density (up to 700 million transistors); lowest power for mobile. Cutting-edge image quality features: advanced anti-aliasing and texture filtering capabilities; fast High Dynamic Range rendering; programmable tessellation unit. 2nd-generation unified architecture: superscalar design with up to 320 stream processing units; optimized for Dynamic Game Computing and Accelerated Stream Processing. ATI Avivo™ HD technology: delivering the ultimate HD video experience; HD display and audio connectivity. DirectX® 10: massive shader and geometry processing performance; enabling the next generation of visual effects. Native CrossFire™ technology: superior multi-GPU support. The March to Reality: Radeon 8500, Radeon 9700/9800, Radeon X800, Radeon X1800, Radeon X1950, Radeon HD 2900. 2nd Generation Unified Shader Architecture: development from the proven and successful "Xenos" design (XBOX 360 graphics); a new dispatch processor handling thousands of threads. (Block diagram labels: command processor, programmable tessellator, vertex setup, shader engine, scan converter / rasterizer engine.)
  • Deep Dive: Asynchronous Compute
    Deep Dive: Asynchronous Compute. Stephan Hodes (Developer Technology Engineer, AMD) and Alex Dunn (Developer Technology Engineer, NVIDIA); joint session. Terminology: AMD — Graphics Core Next (GCN), Compute Unit (CU), wavefronts; NVIDIA — Maxwell, Pascal, Streaming Multiprocessor (SM), warps. Asynchronous: not independent; async work shares hardware. Work pairing: items of GPU work that execute simultaneously. Async tax: the overhead cost associated with asynchronous compute. Async compute: more performance. Queue fundamentals — three queue types: copy/DMA queue, compute queue, graphics queue; all run asynchronously. General advice: always profile (async can make or break performance); maintain non-async paths and profile with async on and off, since some hardware won't support async; remember hyper-threading? Similar rules apply: avoid throttling shared hardware resources. Regime pairing — good pairing: shadow render (geometry limited) with light culling (ALU heavy); poor pairing: G-buffer (bandwidth limited) with SSAO (bandwidth limited). (Technique pairing doesn't have to be 1-to-1.) Red flags, in problem/solution format; topics: resource contention, descriptor heaps, synchronization models, avoiding the "async-compute tax". Hardware details: 4 SIMDs per CU; up to 10 wavefronts scheduled per SIMD to accomplish latency hiding; graphics and compute can execute simultaneously on the same CU; graphics workloads usually have priority over compute. Resource contention — problem: per-SIMD resources are shared between wavefronts. SIMD executes
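    The "pairing" idea above — overlap work that stresses different resources — has a rough CPU analogy that can be mimicked with ordinary threads. The sketch below is purely illustrative (the job names and workloads are invented, not from the talk): a latency-bound "copy" task overlaps a CPU-bound "compute" task, much as a DMA transfer can overlap an ALU-heavy kernel:

```python
import concurrent.futures
import hashlib
import time

def copy_job(n_bytes):
    """Stand-in for a copy/DMA-queue task: mostly waiting, almost no CPU."""
    time.sleep(0.05)  # pretend transfer latency
    return n_bytes

def compute_job(seed):
    """Stand-in for a compute-queue task: ALU-heavy iterated hashing."""
    digest = seed
    for _ in range(1000):
        digest = hashlib.sha256(digest).digest()
    return digest

# Submit both "queues" and let them overlap; total wall time approaches
# max(copy, compute) rather than their sum, which is the point of good pairing.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    copy_future = pool.submit(copy_job, 1 << 20)
    compute_future = pool.submit(compute_job, b"frame-0")
    n_copied, digest = copy_future.result(), compute_future.result()
```

    The talk's warning carries over to this analogy: pairing two tasks that contend for the same resource (two bandwidth-limited passes on a GPU, or two CPU-bound jobs here) yields little or no overlap.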
  • Contributions of Hybrid Architectures to Depth Imaging: a CPU, APU and GPU Comparative Study
    Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study. Issam Said. Hardware Architecture [cs.AR]. Université Pierre et Marie Curie - Paris VI, 2015. English. NNT: 2015PA066531. tel-01248522v2. HAL Id: tel-01248522, https://tel.archives-ouvertes.fr/tel-01248522v2, submitted on 20 May 2016. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Doctoral thesis of Université Pierre et Marie Curie, specialty Computer Science, doctoral school Informatique, Télécommunications et Électronique (Paris), presented and publicly defended by Issam Said for the degree of Doctor of Science of Université Pierre et Marie Curie. French title: "Apports des architectures hybrides à l'imagerie profondeur : étude comparative entre CPU, APU et GPU". Thesis supervised by Jean-Luc Lamotte and Pierre Fortin, defended on Monday, 21 December 2015, after review by the referees M. François Bodin (Professor, Université de Rennes 1) and M. Christophe Calvin (project lead, CEA), before a jury composed of M. François Bodin (Professor, Université de Rennes 1), M.
  • AMD APP SDK V2.8.1 Developer Release Notes
    AMD APP SDK v2.8.1 Developer Release Notes 1 What’s New in AMD APP SDK v2.8.1 1.1 New features in AMD APP SDK v2.8.1 AMD APP SDK v2.8.1 includes the following new features: Bolt: With the launch of Bolt 1.0, several new samples have been added to demonstrate the use of the features of Bolt 1.0. These features showcase the use of valuable Bolt APIs such as scan, sort, reduce and transform. Other new samples highlight the ease of porting from STL and the performance benefits achieved over equivalent STL implementations. Other samples demonstrate the different fallback options in Bolt 1.0 when no GPU is available. These options include a fallback to multicore-CPU if TBB libraries are installed, or falling all the way back to serial-CPU if needed to ensure your code runs correctly on any platform. OpenCV: AMD has been working closely with the OpenCV open source community to add heterogeneous acceleration capability to the world’s most popular computer vision library. These changes are already integrated into OpenCV and are readily available for developers who want to improve the performance and efficiency of their computer vision applications. The new samples illustrate these improvements and highlight how simple it is to include them in your application. For information on the latest OpenCV enhancements, see Harris’ blog. GCN: AMD recently launched a new Graphics Core Next (GCN) Architecture on several AMD products. GCN is based on a scalar architecture versus the VLIW vector architecture of prior generations, so carefully hand-tuned vectorization to optimize hardware utilization is no longer needed.
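    Bolt's scan, sort, reduce, and transform deliberately mirror the STL algorithms they accelerate. As a language-neutral illustration of what those four primitives compute (Bolt itself is a C++ library; this Python stand-in is only for exposition), the standard library offers direct equivalents:

```python
from functools import reduce
from itertools import accumulate
from operator import add

data = [3, 1, 4, 1, 5]

scanned = list(accumulate(data, add))  # inclusive scan: running sums
total = reduce(add, data, 0)           # reduce: one combined value
doubled = [2 * x for x in data]        # transform: elementwise operation
ordered = sorted(data)                 # sort
```

    The ease-of-porting point the new samples make is that code already written against these primitive shapes maps onto Bolt's accelerated versions with little restructuring, and can fall back to multicore-CPU (TBB) or serial-CPU paths when no GPU is present.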
  • θ(1) Time Complexity Parallel Local Binary Pattern Feature Extractor on a Graphical Processing Unit
    ICIC Express Letters, ICIC International © 2019, ISSN 1881-803X, Volume 13, Number 9, September 2019, pp. 867-874. θ(1) Time Complexity Parallel Local Binary Pattern Feature Extractor on a Graphical Processing Unit. Ashwath Rao Badanidiyoor and Gopalakrishna Kini Naravi, Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India ({ashwath.rao; ng.kini}[email protected]). Received February 2019; accepted May 2019. Abstract: The Local Binary Pattern (LBP) feature is used widely as a texture feature in object recognition, face recognition, real-time recognitions, etc. in an image. LBP feature extraction time is crucial in many real-time applications. To accelerate feature extraction time, in many previous works researchers have used a CPU-GPU combination. In this work, we provide a θ(1) time complexity implementation for determining the Local Binary Pattern (LBP) features of an image. This is possible by employing the full capability of a GPU. The implementation is tested on LISS medical images. Keywords: local binary pattern, medical image processing, parallel algorithms, graphical processing unit, CUDA. 1. Introduction. The local binary pattern is a visual descriptor proposed in 1990 by Wang and He [1, 2]. The local binary pattern provides a distribution of intensity around a center pixel. It is a non-parametric visual descriptor helpful in describing a distribution around a center value. Since digital images are distributions of intensity, it is helpful in describing an image. The pattern is widely used in texture analysis, object recognition, and image description. The local binary pattern is used widely in real-time description and analysis of objects in images and videos, owing to its computational efficiency and computational simplicity.
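    The descriptor the introduction defines packs directly into an 8-bit code: threshold each of the 8 neighbors against the center pixel and set one bit per neighbor. A minimal Python sketch for a single pixel (the clockwise bit ordering here is an illustrative convention, not taken from the paper):

```python
def lbp_code(img, y, x):
    """8-bit local binary pattern at pixel (y, x) of a 2-D grayscale image."""
    center = img[y][x]
    # Eight neighbors, clockwise from the top-left (bit 0 .. bit 7).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:  # neighbor at-or-above center sets its bit
            code |= 1 << bit
    return code

img = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
code = lbp_code(img, 1, 1)
```

    Each pixel's code depends only on its own 3x3 neighborhood, so one GPU thread per pixel can compute all codes simultaneously, the independence the paper's θ(1) formulation exploits.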
  • Warehouse-Scale Video Acceleration: Co-Design and Deployment in the Wild
    Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild. Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorfman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela, Raghu Balasubramanian, Sandeep Bhatia, Prakash Chauhan, Anna Cheung, In Suk Chong, Niranjani Dasharathi, Jia Feng, Brian Fosco, Samuel Foss, Ben Gelb, Sarah J. Gwin, Yoshiaki Hase, Da-ke He, C. Richard Ho, Roy W. Huffman Jr., Elisha Indupalli, Indira Jayaram, Poonacha Kongetira, Cho Mon Kyaw, Aaron Laursen, Yuan Li, Fong Lou, Kyle A. Lucke, JP Maaninen, Ramon Macias, Maire Mahony, David Alexander Munday, Srikanth Muroor, Narayana Penukonda, Eric Perkins-Argueta, Devin Persaud, Alex Ramirez, Ville-Mikko Rautio, Yolanda Ripley, Amir Salek, Sathish Sekar, Sergey N. Sokolov, Rob Springer, Don Stark, Mercedes Tan, Mark S. Wachsler, Andrew C. Walton, David A. Wickeraad, Alvin Wijaya, Hon Kwan Wu. Google Inc., USA. Contact: [email protected]. Abstract: Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and, with the slowing of Moore's law, specialized hardware accelerators to deliver more computing [...] management, and new workload capabilities not otherwise possible with prior systems. To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments. CCS Concepts: • Hardware → Hardware-software codesign; • Computer systems organization → Special purpose systems.