ACCELERATION OF CFD AND DATA ANALYSIS USING GRAPHICS PROCESSORS

A Dissertation Presented

by ALI KHAJEH-SAEED

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

February 2012

Mechanical and Industrial Engineering

© Copyright by Ali Khajeh-Saeed 2012

All Rights Reserved

ACCELERATION OF CFD AND DATA ANALYSIS USING GRAPHICS PROCESSORS

A Dissertation Presented by ALI KHAJEH-SAEED

Approved as to style and content by:

J. Blair Perot, Chair

Stephen de Bruyn Kops, Member

Rui Wang, Member

Hans Johnston, Member

Donald L. Fisher, Department Head
Mechanical and Industrial Engineering

ACKNOWLEDGMENTS

This thesis would not have been possible without the aid and support of countless people over the past four years. I must first express my gratitude towards my advisor, Professor J. Blair Perot. I would also like to thank the members of my graduate committee, Professor Stephen de Bruyn Kops, Professor Rui Wang and Professor Hans Johnston, for their guidance and suggestions. Special thanks to my friend Michael B. Martell Jr. for his invaluable assistance with all things CFD and Windows. I would like to thank my friend Timothy McGuiness for his invaluable help with CUDA.

It is a pleasure to thank Shivasubramanian Gopalakrishnan, Sandeep Menon, Kshitij Neroorkar, Dnyanesh Digraskar, Nat Trask, Michael Colarossi, Thomas Furlong, Kyle Mooney, Chris Zusi, Brad Shields, Maija Benitz and Saba Almalkie for making the lab a fun place to work.

The author would also like to thank the Department of Defense and Oak Ridge National Lab (ORNL) for their support. Some of the development work occurred on the NSF Teragrid/XSEDE GPU clusters Lincoln and Forge, located at the National Center for Supercomputing Applications (NCSA), and on the Keeneland supercomputer, located at the National Institute for Computational Science (NICS). The author would also like to thank NVIDIA and AMD for the generous donation of two Tesla C2070 and two ATI FirePro V7800 graphics cards.

ABSTRACT

ACCELERATION OF CFD AND DATA ANALYSIS USING GRAPHICS PROCESSORS

FEBRUARY 2012

ALI KHAJEH-SAEED

B.Sc., SHARIF UNIVERSITY OF TECHNOLOGY

M.Sc., SHARIF UNIVERSITY OF TECHNOLOGY

Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor J. Blair Perot

Graphics processing units function well as high performance devices for scientific computing. The non-standard processor architecture and high memory bandwidth allow graphics processing units (GPUs) to provide some of the best performance in terms of FLOPS per dollar. Recently these capabilities became accessible for general purpose computations with the CUDA programming environment on NVIDIA GPUs and the ATI Stream Computing environment on ATI GPUs. Many applications in computational science are constrained by memory access speeds and can be accelerated significantly by using GPUs as the compute engine. Using graphics processing units as a compute engine gives the personal desktop computer a processing capacity that competes with supercomputers. Graphics processing units also represent an energy efficient architecture for high performance computing in flow simulations and many other fields. This document reviews the graphics processing unit, its features, and its limitations.

TABLE OF CONTENTS

Page

ACKNOWLEDGMENTS ...... iv

ABSTRACT ...... v

LIST OF TABLES ...... xi

LIST OF FIGURES...... xiii

CHAPTER

1. INTRODUCTION ...... 1

1.1 Central Processing Unit (CPU) ...... 1 1.2 Graphics Processing Unit (GPU) ...... 1 1.3 Architecture of a Modern GPU ...... 6 1.4 Compute Unified Device Architecture (CUDA) ...... 8 1.5 Computational Fluid Dynamics (CFD) and Parallel Processing ...... 9 1.6 Supercomputers and Specifications ...... 13

1.6.1 Orion In-House Supercomputer ...... 13 1.6.2 Lincoln Supercomputer ...... 14 1.6.3 Forge Supercomputer ...... 15 1.6.4 Keeneland Supercomputer ...... 15

2. AN INTRODUCTION TO GPU PROGRAMMING ...... 18

2.1 Basic GPU Programming Model and Interface ...... 18 2.2 Device Memories ...... 19

2.2.1 Global Memory (Device Memory) ...... 20 2.2.2 Local Memory ...... 21 2.2.3 Texture Memory ...... 21 2.2.4 Shared Memory ...... 22 2.2.5 Constant Memory ...... 23 2.2.6 Page-Locked Host Memory ...... 24

2.2.6.1 Pinned Memory ...... 24 2.2.6.2 Portable Memory ...... 25 2.2.6.3 Write-Combining Memory ...... 25 2.2.6.4 Mapped Memory ...... 25

2.3 Performance Optimization Strategies ...... 26 2.4 Multi-GPU Optimization Strategies ...... 27

3. DATA ANALYSIS ON THE GPUS ...... 29

3.1 STREAM Benchmark ...... 29

3.1.1 Problem Statement ...... 29 3.1.2 Single GPU ...... 30 3.1.3 Weak Scaling on Many GPUs ...... 31 3.1.4 Strong Scaling on Many GPUs ...... 33 3.1.5 Power Consumption ...... 34 3.1.6 GPU Temperature ...... 35

3.2 Unbalanced Tree Search ...... 36

3.2.1 Problem Statement ...... 36 3.2.2 Results for Unbalanced Tree Search ...... 38

4. SEQUENCE MATCHING ...... 40

4.1 Introduction ...... 40 4.2 The Smith-Waterman Algorithm...... 42 4.3 The Optimized Smith-Waterman Algorithm ...... 46

4.3.1 Reduced Dependency ...... 46 4.3.2 Anti-Diagonal Approach ...... 47 4.3.3 Parallel Scan Smith-Waterman Algorithm ...... 48 4.3.4 Overlapping Search ...... 50 4.3.5 Data Packing ...... 52 4.3.6 Hash Table ...... 52

4.4 Literature Review ...... 53 4.5 Results ...... 56

4.5.1 Single GPU ...... 57 4.5.2 Strong Scaling on Many GPUs ...... 64 4.5.3 Weak Scaling on Many GPUs ...... 64

5. HIGH PERFORMANCE COMPUTING ON THE 64-CORE TILERA PROCESSOR ...... 67

5.1 Introduction ...... 67 5.2 Giga Update Per Seconds (GUPS) ...... 70

5.2.1 Strong Scaling ...... 70 5.2.2 Weak Scaling ...... 72 5.2.3 Performance Variation ...... 73 5.2.4 Power Consumption ...... 74

5.3 Sparse Vector-Matrix Multiplication ...... 75

5.3.1 Performance Results ...... 76 5.3.2 Power Consumption ...... 78

5.4 Smith-Waterman Algorithm ...... 80

5.4.1 Anti-Diagonal Algorithm ...... 80 5.4.2 Row Approach ...... 81

5.4.2.1 Strong Scaling ...... 81 5.4.2.2 Weak Scaling ...... 84

5.4.3 Power Consumption ...... 86

5.5 Fast Fourier Transform (FFT) ...... 88

5.5.1 1D FFT ...... 88 5.5.2 3D FFT ...... 90 5.5.3 Power Consumption ...... 93

5.6 Conclusions ...... 94

6. CFD AND GPUS ...... 97

6.1 Introduction ...... 97 6.2 Literature Review ...... 98 6.3 Optimization Techniques ...... 103

6.3.1 Removing Ghost Cells and Maximizing Coalesced Access ...... 103 6.3.2 Conjugate Gradient and Matrix Multiplication ...... 104 6.3.3 Reduction and Maximum ...... 107 6.3.4 Multi-GPU Optimization Algorithm ...... 108 6.3.5 Avoid cudaThreadSynchronize and unnecessary cudaStreamSynchronize ...... 112 6.3.6 Maximize The GPU Occupancy ...... 112

6.3.7 Minimize Shared Memory and Maximize Constant Memory Usage...... 113 6.3.8 Memory Allocation and Deallocation ...... 113 6.3.9 Minimize The Data Transfer Between Host and Device ...... 113

7. STAG++ PERFORMANCE RESULTS ...... 114

7.1 Introduction ...... 114 7.2 Optimization Techniques Using NVIDIA Parallel Nsight ...... 114 7.3 CG and Laplace Results for Single Processor on Orion ...... 120 7.4 Single CPU and GPU Results for Different Computers ...... 121

7.4.1 Orion Single Processor Results ...... 121 7.4.2 Lincoln Single Processor Results ...... 123 7.4.3 Forge Single Processor Results ...... 123 7.4.4 Keeneland Single Processor Results ...... 124

7.5 Strong Scaling Results ...... 125

7.5.1 Lincoln Strong Scaling Results ...... 126 7.5.2 Forge Strong Scaling Results ...... 127 7.5.3 Keeneland Strong Scaling Results ...... 128

7.6 Weak Scaling Results ...... 129

7.6.1 Lincoln Weak Scaling Results ...... 129 7.6.2 Forge Weak Scaling Results ...... 130 7.6.3 Keeneland Weak Scaling Results ...... 132

7.7 Forge and Keeneland Supercomputers Efficiency Results ...... 133

8. DIRECT NUMERICAL SIMULATION OF TURBULENCE ..... 135

8.1 Introduction ...... 135 8.2 Software ...... 135 8.3 Partitioning ...... 137 8.4 Isotropic Turbulence Decay ...... 139

9. CONCLUSION ...... 144

9.1 Bioinformatics (Sequence Matching)...... 144 9.2 Computational Fluid Dynamics (CFD) ...... 145 9.3 GPU as High Performance Computational Resource ...... 146 9.4 Publication List ...... 147

APPENDICES

A. EQUIVALENCE OF THE ROW PARALLEL ALGORITHM ..... 149 B. MODIFIED PARALLEL SCAN...... 152 C. CUDA SOURCE CODE FOR A 1-POINT STENCIL USED IN LAPLACE OPERATOR ...... 154

BIBLIOGRAPHY ...... 155

LIST OF TABLES

Table Page

1.1 Features and technical specifications for different GPU architectures [1] ...... 8

3.1 NVIDIA hardware specifications for GTX 295, GTX 480, Tesla S1070 and Tesla C2070 ...... 30

3.2 Temperatures for GTX 295 cards, running the STREAM benchmark for 331 seconds on Orion (see figure 1.7 for fan and GPU configuration) ...... 35

5.1 GUPS for strong scaling case with 2^25 64-bit integer unknowns (256 MB) ...... 71

5.2 Weak scaling for GUPS benchmark. Problem size is 2^20/Tile. The largest problem size uses 32 tiles and the smallest size uses 1 tile...... 73

5.3 Strong scaling results for GUPS benchmark for Tilera Pro64 comparing the power consumption to the single core of an AMD quad-core Phenom II X4 and Tesla C2070 GPU ...... 74

5.4 MCUPS and speedup for 128 × 128 × 128 problem size for Tilera Pro64 comparing to the single core of AMD quad-core Phenom II X4 ...... 77

5.5 MCUPS and speedup for 256 × 256 × 256 problem size for Tilera Pro64 comparing to the single core of AMD quad-core Phenom II X4 ...... 78

5.6 Power consumption for sparse vector-matrix multiplication (256 × 256 × 256) for Tilera Pro64 comparing to the single core of AMD quad-core Phenom II X4 and Tesla C2070 ...... 79

5.7 Results for anti-diagonal Smith-Waterman algorithm with 59 tiles compared to a single core of an AMD quad-core Phenom II X4 .....81

5.8 Strong scaling results for row-access Smith-Waterman algorithm for kernel 1 with single core of AMD quad-core Phenom II X4 and Tilera ...... 83

5.9 Strong scaling results for row-access Smith-Waterman algorithm for kernel 2 with single core of AMD quad-core Phenom II X4 and Tilera ...... 83

5.10 Strong scaling results for row approach Smith-Waterman algorithm for kernel 1 with single core of AMD quad-core Phenom II X4 and Tilera ...... 85

5.11 Weak scaling results for row approach Smith-Waterman algorithm for kernel 2 with single core of AMD quad-core Phenom II X4 and Tilera ...... 86

5.12 Results for first kernel of SSCA#1 for Tilera Pro64 comparing to the single core of AMD quad-core Phenom II X4 ...... 87

5.13 Timing for single and double precision for 1-D FFT running on a single core of the AMD quad-core Phenom II X4...... 90

5.14 Results for 3D FFT with a single core of AMD quad-core Phenom II X4 and Tilera Pro64 ...... 91

5.15 3D FFT results for Tilera Pro64 for 256 × 256 × 256 compared to the single core of an AMD quad-core Phenom II X4 ...... 94

LIST OF FIGURES

Figure Page

1.1 FLOPS and memory bandwidth for the CPU and GPU [2] ...... 2

1.2 The GPU devotes more transistors to data processing [2] ...... 3

1.3 Acceleration of communication with network and storage devices with NVIDIA GPUDirect I [3] ...... 5

1.4 Acceleration of communication between GPUs in the same node or motherboard with NVIDIA GPUDirect II [3] ...... 6

1.5 Fermi architecture with 16 SM of 32 cores each and a block diagram of a single SM [4] ...... 7

1.6 NVIDIA Tesla S1070 [5] and Tesla S2070 [6] include four GT200 and GF100 GPUs respectively in the single 1U rack mount ...... 11

1.7 Orion in-house supercomputer with AMD quad-core Phenom II X4 CPU and four GTX 295 cards ...... 14

1.8 Lincoln Teragrid/XSEDE GPU cluster with 384 Tesla 10 series GPUs (from [7]) ...... 15

1.9 Forge Teragrid/XSEDE GPU cluster with 288 Tesla 20 series GPUs (from [8]) ...... 16

1.10 Keeneland Initial Delivery Teragrid/XSEDE GPU cluster with 360 Tesla 20 series GPUs (from [9]) ...... 17

2.1 GPU hardware model (from [2]) ...... 20

3.1 Time and bandwidth for single NVIDIA GTX 295 and GTX 480 for four different STREAM kernels ...... 30

3.2 Time and bandwidth for single NVIDIA Tesla S1070 and Tesla C2070 for four different STREAM kernels ...... 31

3.3 Results for weak scaling of the four STREAM benchmark kernels on Lincoln with 2M elements per GPU, (a) MCUPS, (b) speedup comparing with a single core of the AMD processor on Orion, (c) Actual and ideal bandwidth, (d) bandwidth per GPU for various numbers of GPUs ...... 32

3.4 Results of strong scaling of the four STREAM benchmark kernels on the Lincoln with 32M total elements,(a) MCUPS, (b) speedup comparing with single Tesla S1070 GPU, (c) Actual and ideal bandwidth, (d) bandwidth per GPU for various numbers of GPUs ...... 33

3.5 Power consumption for the weak scaling STREAM benchmark with the AMD CPU and GTX 295 GPUs ...... 34

3.6 Representations of (a) a binary tree, with nodes having 0 or 8 children, and (b) a geometric tree, with 1-4 randomly assigned children ...... 36

3.7 Implemented algorithm for UTS ...... 38

3.8 Strong scaling speedups for the unbalanced tree search relative to (a) a single CPU and (b) a single GPU using GTX 295 GPUs (Orion) ...... 38

4.1 Dependency of the values in the Smith-Waterman table ...... 44

4.2 Similarity matrix and best matching for two small sequences CAGCCUCGCUUAG (top) and AAUGCCAUUGCCGG (left). The best alignment is: GCC-UCGC and GCCAUUGC which adds one gap to the test subsequence and which has one dissimilarity (3rd to last unit)...... 45

4.3 Anti-diagonal method and dependency of the cells ...... 48

4.4 Graphical representation of row-access for each step in row-parallel scan Smith-Waterman algorithm. This example is calculating the 6th row from the 5th row (of the example problem shown in Figure 4.1)...... 50

4.5 Example of the overlapping approach to parallelism...... 51

4.6 (a) Time and (b) speedup for kernel 1, for different problem sizes and different processors ...... 58

4.7 (a) Time and (b) speedup for kernel 2, for different problem sizes and different processors ...... 59

4.8 (a) Time and (b) speedup for kernel 3, for different problem sizes and different processors ...... 60

4.9 (a) Time and (b) speedup for kernel 4, for different problem sizes and different processors ...... 62

4.10 (a) Time and (b) speedup for kernel 5, for different problem sizes and different processors ...... 63

4.11 Strong scaling timings with 16M for database and 128 for test sequence (a) Time (b) Speedup versus one core of a 3.2 GHz AMD quad-core Phenom II X4 CPU ...... 64

4.12 Weak scaling GCUPS (a) and speedups (b) for Kernel 1 using various numbers of GPUs on Lincoln with 2M elements per GPU for the database size, and a 128-element test sequence ...... 65

5.1 Tile processor hardware architecture with detail of an individual tile’s structure (Figure from Tilera data sheet [10]) ...... 68

5.2 (a) GUPS vs. number of tiles (b) Speedup for GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line). This case uses a 2^25 64-bit integer table size (256 MB) with 2^27 updates ...... 71

5.3 GUPS for weak scaling for the Tilera Pro64 (a) GUPS vs. number of tiles (b) GUPS/Tile vs. number of tiles ...... 72

5.4 (a) GUPS vs. problem size (b) Speedup for GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line) using 32 tiles on the Tilera Pro64 ...... 73

5.5 Power efficiency for GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line)...... 75

5.6 (a) MCUPS for 128^3 and 256^3 problem sizes using a Tilera Pro64 (with different numbers of tiles), a single core of an AMD quad-core Phenom II X4, and a Tesla C2070 GPU. (b) Speedup of the Tilera versus one core of the AMD CPU ...... 76

5.7 Speedup and energy efficiency for the 256^3 problem size for Tilera Pro64 compared to the single core of an AMD quad-core Phenom II X4 and compared to a Tesla C2070 GPU...... 80

5.8 Strong scaling for kernel 1 (a) MCUPS and (b) speedup for row-access Smith-Waterman algorithm with Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4...... 82

5.9 Strong scaling for kernel 2 (a) time (seconds) and (b) speedup for Smith-Waterman algorithm with Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4...... 82

5.10 Weak scaling for kernel 1 (a) MCUPS and (b) speedup for row-access Smith-Waterman algorithm with Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4...... 84

5.11 Weak scaling for kernel 2 (a) time (s) and (b) speedup for row-access Smith-Waterman algorithm with Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4...... 85

5.12 Smith-Waterman benchmark (a) power (W) and (b) energy efficiency compared to a single core of an AMD quad-core Phenom II X4 .....88

5.13 Timing for single and double precision for 1-D FFT running on a single core of the AMD quad-core Phenom II X4...... 89

5.14 (a) MCUPS and (b) speedup for 3D FFT with 256^3 for different numbers of tiles ...... 92

5.15 (a) MCUPS and (b) speedup for 3D FFT with 32 tiles for different problem sizes with and without transpose ...... 93

6.1 Data distribution between two GPUs [from [11]] ...... 100

6.2 Two phases of a time step for a 2-GPU [from [11]] ...... 102

6.3 and block distribution for a XY plane ...... 105

6.4 General flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators ...... 109

6.5 Efficient flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators for (a) regular pinned memory (b) mapped memory; red, green and purple boxes are using stream 2, stream 1 and CPU respectively to execute the box ...... 111

7.1 Efficient flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators for (a) regular pinned memory (b) mapped memory; red, green and purple boxes are using stream 2, stream 1 and CPU respectively to execute the box ...... 115

7.2 Timeline for Laplace kernel for 64^3 with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers ...... 116

7.3 Timeline for Laplace kernel for 128^3 with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers ...... 117

7.4 Timeline for Laplace kernel for 256^3 with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers ...... 119

7.5 Time for (a) CG and (b) Laplace subroutines for different problem sizes ...... 120

7.6 Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Orion with single (SP) and double (DP) precision ...... 122

7.7 Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Lincoln supercomputer with single (SP) and double (DP) precision ...... 123

7.8 Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Forge supercomputer with single (SP) and double (DP) precision ...... 124

7.9 Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Keeneland supercomputer with single (SP) and double (DP) precision ...... 125

7.10 (a) Speedup and (b) Performance per processor for strong scaling of the 128^3, 256^3 and 512^3 CFD problem on Lincoln supercomputer using GPUs and CPUs ...... 126

7.11 (a) Speedup and (b) Performance per processor for strong scaling of the 128^3, 256^3 and 512^3 CFD problem on Forge supercomputer using GPUs and CPUs ...... 127

7.12 (a) Speedup and (b) Performance per processor for strong scaling of the 128^3, 256^3 and 512^3 CFD problem on Keeneland supercomputer using GPUs and CPUs ...... 128

7.13 (a) Speedup and (b) Performance per processor for weak scaling of the 128^3 and 256^3 CFD problem on Lincoln supercomputer using GPUs and CPUs ...... 129

7.14 (a) Speedup and (b) Performance per processor for weak scaling of the 128^3 and 256^3 CFD problem on Forge supercomputer using GPUs and CPUs ...... 130

7.15 (a) Speedup and (b) Performance per processor for weak scaling of the 128^3 and 256^3 CFD problem on Forge supercomputer using 4 GPUs per node and 8 and 16 CPU cores per node for 256^3 and 128^3 respectively ...... 131

7.16 (a) Speedup and (b) Performance per processor for weak scaling of the 128^3 and 256^3 CFD problem on Keeneland supercomputer using GPUs and CPUs ...... 132

7.17 (a) Forge and (b) Keeneland efficiency results for weak scaling of the 128^3 and 256^3 CFD problem using GPUs and CPUs ...... 133

8.1 Simulation domain with 768 randomly distributed cubes ...... 137

8.2 (a) Domain and (b) Subdomains with boundary planes for MPI communication ...... 138

8.3 Validation of TKE with de Bruyn Kops and Riley result [12] ...... 140

8.4 (a) TKE and ε and (b) Re number, large-eddy length scale and decay exponent for isotropic decay...... 141

8.5 u velocity contours for isotropic turbulence decay in (a) 5s (b) 7s (c) 12s (d) 20s (e) 40s and (f) 110s for plane strain case 1 ...... 142

8.6 Diagonal Reynolds stress component for (a) ST = 0.4 and 1 (b) ST = 2.5 and 10 cases ...... 143

8.7 TKE and ε for different plain strain cases ...... 143

B.1 Up-sweep for modified parallel scan for calculating the E for the Smith-Waterman Algorithm ...... 152

B.2 Down-sweep for modified parallel scan for calculating the E for the Smith-Waterman Algorithm...... 153

CHAPTER 1

INTRODUCTION

1.1 Central Processing Unit (CPU)

Intel and AMD produce microprocessors based on several Central Processing Unit (CPU) cores that are efficient and cost-effective. Historically, the speed of these microprocessors roughly doubled every three years. This rate of increase held until about 2003: as the power of the CPU grew, so did the number of calculations per second. But there is a limit to increasing the power (overheating issues), so Intel and AMD decided to increase the number of cores in their microprocessors rather than the clock frequency. Before multi-core microprocessors, most programs ran sequentially. The switch to multi-core processors has had a tremendous effect on the software and hardware developer communities, but it has not brought corresponding speed increases to high performance computing (HPC), because speed is now bottlenecked by the memory subsystem, not the processor [13].

1.2 Graphics Processing Unit (GPU)

After 2003, a new class of many-core processors called Graphics Processing Units (GPUs) was introduced to the scientific community, as AMD (ATI) and NVIDIA moved from multi-core microprocessors to many-core processors. This shift moved peak performance from the gigaflop range to the teraflop range (see Figure 1.1). Roughly every six months these vendors introduce new hardware with higher performance than the previous generation [2]. As of 2010, the ratio of peak (and actual) floating-point operations per second between GPUs and CPUs is close to 10: new GPUs and CPUs have theoretical speeds of roughly 1350 gigaflops and 140 gigaflops respectively. NVIDIA also introduced new software that allows the G80 microprocessors to be easily programmed by the scientific community. In June 2008, NVIDIA introduced a new series of microprocessors (the Tesla 10-series GPUs) that supports double precision for the first time and delivers more than 1,000 peak gigaflops of single precision and close to 100 peak gigaflops of double precision performance. In September 2009, NVIDIA introduced the next-generation GPU architecture codenamed "Fermi", the Tesla 20-series family, which delivers more than 1,300 gigaflops of single precision and approximately 515 gigaflops of double precision performance, roughly 5 times faster than the previous generation of their GPU architecture [2, 1, 14].

Figure 1.1. FLOPS and memory bandwidth for the CPU and GPU [2]

The main reason for such a large bandwidth and GFLOPS gap between GPU and CPU microprocessors lies in the differences in the fundamental design strategies of the two types of microprocessors, as illustrated in Figure 1.2 [2].

The CPU is usually designed and optimized for sequential code performance rather than for parallel tasks. This philosophy uses a complicated control logic unit to let sequential instructions execute in parallel while keeping the appearance of a sequential program.

Figure 1.2. The GPU devotes more transistors to data processing [2]

In contrast, the GPU is designed so that more transistors are dedicated to data processing and fewer to flow control and data caching. In addition, GPU designers are aware of the memory wall and routinely use newer and faster memory in their designs. Although the GPU is well suited for scientific computing, it is still important to understand the hardware and its limitations if high performance is desired. The main strategy of GPU design is to optimize for the execution of an enormous number of threads at the same time with very little overhead. The GPU exploits this large number of execution threads to find work to do when some threads are idle or waiting on memory. Memory bandwidth is another important parameter of the GPU. In late 2006, the G80 delivered about 80 gigabytes per second (GB/s) to the main DRAM (GDDR3), and by 2010 GPU bandwidth reached 180 GB/s (GDDR5). A small amount of fast memory (shared memory) is provided to help control the bandwidth requirements of these applications, but this fast memory (unlike the CPU cache) has to be managed explicitly. With this fast memory, multiple threads can access data with low latency and do not all need to go to the GPU main memory. In newer generations of GPUs, small L1 and L2 caches (like a CPU) have also been added to hide the long memory latency [2].

It is clear that the GPU is designed as a data computing engine, and it will not perform well on tasks that CPUs are designed for, such as sequential tasks with small data streams. Fortunately, every GPU is hosted by a computer that has a CPU, so it is not necessary to use the GPU when a task is not well suited to that hardware. The GPU is a supplement, not a substitute, for the CPU. In fact, most applications will use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU for better performance. Every communication between them must pass through the relatively slow PCI-Express bus. Although hardware developers increased this communication speed by moving from PCI-Express 1 (×8) to PCI-Express 2 (×16), it is still far from the main memory speed (5 GB/s vs. 180 GB/s). PCI-Express 3 (maximum theoretical bandwidth of 16 GB/s) is expected to be available in the last quarter of 2011 [15].

NVIDIA introduced the first version of the GPUDirect technology in June 2010. The first GPUDirect accelerated communication with network and storage devices from Mellanox and QLogic [3]. Before this first version, in order to use pinned (zero-copy) memory, a copy was required from one region of system memory (pinned memory) to another region of system memory (both in the same system). Figure 1.3 shows how the communication is done with and without GPUDirect technology.

Figure 1.3. Acceleration of communication with network and storage devices with NVIDIA GPUDirect I [3]

The latest generation of GPUs from NVIDIA supports the GPUDirect II technology. In the second generation of NVIDIA GPUDirect, all GPUs in the same node or motherboard can access each other's memory via PCI-e without going through CPU memory, and data can be copied between GPUs without involving the CPU. This technology eliminates system memory allocation and copy overhead. Figure 1.4 shows how communication is done between two GPUs in the same node or motherboard connected by PCI-e using GPUDirect technology.

Figure 1.4. Acceleration of communication between GPUs in the same node or motherboard with NVIDIA GPUDirect II [3]

1.3 Architecture of a Modern GPU

Figure 1.5 shows the new generation of GPUs (Fermi-based GPUs). A Fermi GPU has almost 3.0 billion transistors, with 16 multiprocessors, each with 32 cores (512 total). Each Fermi multiprocessor core executes a single- or double-precision floating point or integer instruction per clock for a thread [16]. The Fermi-based GPU has six 64-bit memory partitions, for a 384-bit memory interface, with 3 or 6 GB of GDDR5 DRAM depending on the GPU model. The GPU is connected to the CPU via a PCI-Express slot. The GigaThread global scheduler unit distributes thread blocks to the Stream Multiprocessor (SM) thread schedulers. Each stream multiprocessor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Previous generations of GPUs used

IEEE 754-1985 floating point arithmetic, but the Fermi architecture adopts the new IEEE 754-2008 floating-point standard, using the fused multiply-add (FMA) instruction for both single and double precision arithmetic.

Figure 1.5. Fermi architecture with 16 SM of 32 cores each and a block diagram of a single SM [4]

FMA has better performance

over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding. Because FMA has just one rounding step, it is more accurate than performing the operations separately as MAD does. Four Special Function Units (SFUs) in each multiprocessor execute instructions such as cosine, sine, reciprocal, logarithm and square root in hardware. Each SFU executes one instruction per thread, per clock; a single warp (a group of 32 parallel threads) executes over eight clocks. Double precision arithmetic is necessary for most scientific applications such as linear algebra, numerical simulation, and quantum physics and chemistry. The Fermi architecture has been specifically designed to offer much better double precision performance; up to 16 double precision FMA operations can be performed per SM, per clock, an enormous improvement over the previous architecture (GT200) [4]. In practice, however, this has only a moderate effect on scientific computations, which are invariably memory limited rather than computation limited. Table 1.1 gives the features and technical specifications for each GPU architecture [1].

Table 1.1. Features and technical specifications for different GPU architectures [1]

GPU                                   G80 (2006)    GT200 (2008)   Fermi (2010)
Transistors                           681 Million   1.4 Billion    3.0 Billion
Total number of cores                 128           240            512
Double Precision (ops/clock)          None          30 FMA         256 FMA
Single Precision (ops/clock)          128 MAD       240 MAD        512 FMA
Warp schedulers/SM                    1             1              2
Special Function Units/SM             2             2              4
Shared Memory/SM (Configurable)       16 KB         16 KB          48 KB or 16 KB
L1 Cache/SM (Configurable)            None          None           16 KB or 48 KB
L2 Cache/SM                           None          None           768 KB
ECC Memory Support                    No            No             Yes
Concurrent Kernels                    No            No             Up to 16
Load/Store Address Width              32-bit        32-bit         64-bit
Max number of threads/block           512           512            1024
Max number of threads/SM              768           1024           1536
Number of 32-bit registers/SM         8 K           16 K           32 K
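As a small illustration of the FMA rounding behavior discussed above (an illustrative sketch, not code from this work), the first device function below rounds the product before the addition, while the second uses the CUDA fma() math function, which maps to the hardware FMA instruction and rounds only once.

    // Two roundings: the product is rounded to double before the addition.
    __device__ double separate_ops(double a, double b, double c)
    {
        return __dmul_rn(a, b) + c;
    }

    // One rounding: fma() computes a*b + c with a single final rounding (IEEE 754-2008).
    __device__ double fused_op(double a, double b, double c)
    {
        return fma(a, b, c);
    }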

1.4 Compute Unified Device Architecture (CUDA)

In November 2006, NVIDIA introduced CUDA, a general purpose parallel computing architecture that uses NVIDIA GPUs as a parallel compute engine to solve many complicated computational problems more efficiently and faster than on a CPU [17]. CUDA exposes this performance via standard application programming interfaces (APIs) like DirectCompute and OpenCL, and high level programming environments such as C/C++, Java, Python, Fortran (the Portland Group) and the Microsoft .NET Framework. CUDA provides both a low and a high level API. The initial CUDA SDK was published on 15 February 2007, for the Microsoft Windows and Linux operating systems. Mac OS X support was added later, in version 2.0 on February 14, 2008. CUDA supports all NVIDIA GPUs from the G80 series onwards, including GeForce GTX, FX and the high performance Tesla GPUs [17]. NVIDIA states that programs written for the GeForce 8 series will also work without modification on all future NVIDIA video cards because of binary (backward) compatibility. CUDA provides developers access to the memories of the

GPU and native instruction set. In this way, CUDA effectively makes NVIDIA GPUs open architectures like CPUs. Because GPUs have a parallel many-core architecture, each multiprocessor can execute thousands of threads simultaneously, and the GPU can deliver high performance computation to desktop personal computers. With the CUDA architecture and tools, software developers are achieving tremendous speedups in fields such as computational fluid dynamics (CFD), molecular dynamics (MD), finance, signal processing, computational biology, medical imaging and natural resource exploration, and creating breakthrough applications in areas such as image recognition, ray tracing, and real-time HD video playback and encoding [2].

1.5 Computational Fluid Dynamics (CFD) and Parallel Processing

The development of computational fluid dynamics (CFD) began with the appearance of the digital computer in the early 1950s. Finite difference methods (FDMs) and finite element methods (FEMs), which are the basic tools used in the solution of partial differential equations (PDEs) in general and in CFD in particular, have different histories. At the Royal Society of London, Richardson presented the first FDM solution, to analyze the stress of a masonry dam, in 1910. In contrast, the first FEM work was published in the Aeronautical Science Journal by Turner, Clough, Martin and Topp to analyze aircraft stress in 1956. Since that time, both methods have been developed and modified dramatically for use in fluid dynamics, heat transfer, combustion and other related areas. There are also benefits accruing from a combination of FDM and FEM: the finite volume method (FVM) combines aspects of both and has become popular in recent years because of its conservative form and easy applicability to unstructured meshes [18]. Scientific programs, especially CFD codes, are usually run on computers that have

large memories and high processing speeds. In addition, massive data storage systems must be provided to store and analyze the computed results and it is necessary to have methods to transmit and examine the enormous amounts of data and the computed results [18].

Supercomputers appeared in the 1960s, designed initially by Seymour Cray at Control Data Corporation (CDC), and became available to the market in the 1970s [19]. Cray supercomputers were the top ranked supercomputers between 1985 and 1990. Nowadays, most supercomputers are essentially PC clusters assembled by well-known companies such as Dell, SuperMicro, Cray, IBM and Hewlett-Packard [19]. As of November 2010, the Cray Jaguar is the fastest CPU-based supercomputer and second overall in the world according to the TOP500 [20]. Jaguar is a petascale supercomputer built by Cray at Oak Ridge National Laboratory in Oak Ridge, Tennessee [21]. It has a max (maximal LINPACK performance achieved) and peak (theoretical peak) performance of almost 1759 and 2331 teraflops respectively. Jaguar has 224,162 Opteron processor cores and runs a version of Linux called the Cray Linux Environment [21].

GPUs, which were originally used for graphics rendering, have now become massively-parallel "co-processors" to the CPU. The new generation of NVIDIA GPUs beats Intel and AMD CPUs on single and double precision floating point performance by factors of roughly six and four respectively, and on memory bandwidth by a factor of roughly five.

Nowadays small desktop supercomputers with GPUs can deliver high performance at the price of conventional workstations.

Not only are GPUs cost-effective multi-core accelerators, but they also have reduced space, power, and cooling demands. To this end, NVIDIA has begun producing commercially available Tesla GPU accelerators for use in scientific supercomputer clusters. The Tesla GPUs are available as standard single-GPU cards (with a single video output for the Fermi architecture, and without any video output for the 200-series based Tesla), or in 1U rack mount cases with four GPU devices inside the unit. Each 1U rack mount has its own power and cooling system [5, 6]. Figure 1.6 shows a Tesla S1070 and a Tesla S2070, each equipped with four GPUs in a single 1U rack mount.

(a) Tesla S1070 (b) Tesla S2070

Figure 1.6. The NVIDIA Tesla S1070 [5] and Tesla S2070 [6] include four GT200 and GF100 GPUs respectively in a single 1U rack mount

Several GPU clusters have been set up in the last decade. However, most of these clusters were used as visualization systems rather than for scientific computation; only in the last five years have GPU clusters for scientific computation been deployed. The first two GPU clusters were a 160-node cluster at Los Alamos National Lab (LANL) [22] and a 16-node cluster at the National Center for Supercomputing Applications (NCSA) [23]. NVIDIA QuadroPlex GPUs, which are best suited to visualization applications, were installed on both clusters, and the main reason for these installations was experimentation. GPU clusters specifically built for scientific computation are still rare. At NCSA, two GPU clusters built around the NVIDIA Tesla S1070 Computing System were deployed in 2009: a cluster of 96 Tesla S1070 1U rack mounts (384 total GPUs), "Lincoln" [24], and an experimental cluster of 16 Tesla S1070 1U rack mounts (64 total GPUs), "AC" [25]. There are three important components that should be matched in a GPU cluster: the host nodes (CPU nodes), the GPUs, and the interconnect between the host and the GPUs. Since the GPUs perform the majority of the calculations, the main memory (RAM), PCI-e bus, and network interconnect performance characteristics need to be matched with the GPU performance in order to achieve high overall performance. In particular, high performance GPUs like the NVIDIA Tesla require full-bandwidth PCI-e Gen 2 ×16 slots that do not drop to ×8 speeds even when multiple GPUs share the PCI-e bus. Additionally, an InfiniBand QDR interconnect is highly desirable to match the GPU-to-CPU bandwidth when multiple GPUs are used for computation. A single CPU core per GPU may be desirable to simplify the development of MPI-based applications, because CUDA can stall a CPU core. As with CPU parallel programs, there is a choice between MPI and OpenMP or pthreads; OpenMP or pthreads can only execute programs on a single node (the same motherboard), so MPI is necessary for running programs on clusters [26]. As of November 2010, the first, third and fourth fastest supercomputers are GPU-based supercomputers. China's new Tianhe-1A (Milky Way in English) is the fastest supercomputer in the world according to the TOP500, with max and peak performances of almost 2,566 and 4,701 teraflops respectively [20]. Tianhe-1A overtook the Jaguar supercomputer with its 2.566 petaFLOPS, and it consumes 4,040 kW versus Jaguar's 6,950.60 kW (about 40% less than Jaguar). Tianhe-1A has 186,368 X5670 processor cores and 7,168 NVIDIA Tesla M2050 GPUs. Each Tesla M2050 is a single GPU

with 3 GB of memory and a passive heatsink cooled by the host system airflow. Tianhe-1A cost $88 million to build and requires roughly $20 million annually for energy and operating costs [27].

China's Nebulae supercomputer is built from a Dawning TC3600 Blade system. Nebulae has 55,680 Xeon X5650 processor cores and 4,640 Tesla C2050 GPUs (Fermi architecture) [28]. Nebulae is the second fastest GPU system worldwide in theoretical peak performance, at 2.98 PFlop/s, with a Linpack performance of 1.271 PFlop/s. Nebulae, which is located at the newly built National Supercomputing Centre in Shenzhen, holds the number 3 spot on the TOP500 list of supercomputers behind Tianhe-1A (2.566 petaFLOPS) and Jaguar (1.75 petaFLOPS) [28].

1.6 Supercomputers and Specifications

In this project, four different machines were used for computations: Orion, Lincoln, Forge and Keeneland. Orion and Forge have AMD CPUs, while Lincoln and Keeneland have Intel CPUs; the results sections show that the Intel and AMD processors behave differently. Orion, Forge and Keeneland are equipped with the new Tesla 20-series (Fermi) GPUs, whereas Lincoln uses the Tesla 10-series. Each supercomputer's specifications are described in detail below.

1.6.1 Orion In-House Supercomputer

Orion contains an AMD quad-core Phenom II X4 CPU, operating at 3.2 GHz, with 4×512 KB of L2 cache, 6 MB of L3 cache and 8 GB of RAM. In terms of GPUs, Orion contains four NVIDIA GTX 295 cards (occupying four PCI-e ×16 slots), each of which carries two GPUs sharing a single PCI-e slot (so 8 GPUs in total in Orion). When a single GTX 295 GPU is referred to, it means one of the two processors located on a single GTX 295 card. Each GPU has 240 cores and a memory bandwidth of 111.9 GB/s. The first and second cards were also replaced with GTX 480 and Tesla C2070 cards in order to run some cases with these newer GPUs. All code was written in C++ with NVIDIA's CUDA language extensions for the GPU. Results on Orion were compiled using Microsoft Visual Studio 2005 (VS 8) under Windows XP Professional x64; the bulk of the NVIDIA SDK examples use this configuration. Orion employs six fans for cooling purposes. Figure 1.7 shows Orion's configuration and fan locations.

Figure 1.7. Orion in-house supercomputer with AMD quad-core Phenom II X4 CPU and four GTX 295 cards

1.6.2 Lincoln Supercomputer

Lincoln is a Teragrid/XSEDE GPU cluster located at NCSA (see figure 1.8). Lincoln has 96 Tesla S1070 units (384 GPUs). Lincoln's 192 host servers each hold two Intel 64 (Harpertown) 2.33 GHz quad-core processors (dual socket) with 2 × 6 MB of L2 cache and 2 GB of RAM per core. Each server is connected to 2 Tesla processors via PCI-e Gen2 ×8 slots. The Lincoln results were compiled using Red Hat Enterprise Linux 4 (Linux 2.6.19) and the gcc compiler [24]. Lincoln has one processor (4 cores) for each GPU.

Figure 1.8. Lincoln Teragrid/XSEDE GPU cluster with 384 Tesla 10 series GPUs (from [7])

1.6.3 Forge Supercomputer

Lincoln was replaced by Forge. Forge is a newer Teragrid/XSEDE GPU cluster located at NCSA (see figure 1.9). Forge has 288 Tesla M2070 NVIDIA Fermi GPUs, each with 6 GB of GDDR5 memory. Forge's 36 servers each hold two 2.4 GHz eight-core AMD Opteron (Magny-Cours) 6136 processors (dual socket) and 3 GB of RAM per core. Each server is connected to 8 Tesla processors via PCI-e Gen2 ×16 slots.

The Forge results were compiled using Red Hat Enterprise Linux 6 (Linux 2.6.32) and the GNU compiler [29].

Figure 1.9. Forge Teragrid/XSEDE GPU cluster with 288 Tesla 20 series GPUs (from [8])

1.6.4 Keeneland Supercomputer

Keeneland Initial Delivery (KID) is another Teragrid/XSEDE GPU cluster, located at the National Institute for Computational Science (NICS) (see figure 1.10). KID has 360 Tesla M2070 NVIDIA Fermi GPUs, each with 6 GB of GDDR5 memory. KID's 120 servers each hold two hex-core Intel Xeon (Westmere-EP) 2.93 GHz processors (11.72 GFlops) and 2 GB of DDR3 RAM per CPU core. Each server is connected to 3 Tesla processors via PCI-e Gen2 ×16 slots. The nodes are connected by an ×8 InfiniBand QDR (single rail) network [30].

Figure 1.10. Keeneland Initial Delivery Teragrid/XSEDE GPU cluster with 360 Tesla 20 series GPUs (from [9])

CHAPTER 2

AN INTRODUCTION TO GPU PROGRAMMING

2.1 Basic GPU Programming Model and Interface

Before tackling more advanced topics, it is necessary to clarify some basic terms in the CUDA programming model (a short kernel sketch after the definitions shows how they appear in code):

Thread: is a ready-for-execution/running instance of a kernel. Each thread has its own instruction address counter and register state. Each thread can be identified by a combination of its three-dimensional thread index and three-dimensional thread-block index. CUDA threads are lightweight, which means that creating and destroying a thread is very fast.

Warp: is a group of 32 parallel threads. The multiprocessors create and execute warps in order. All threads in a warp start together but each of them can have its own instruction and registers.

Block: is a group of Warps. A block is executed on one multiprocessor. Every block has its own shared memory and registers in the multiprocessor.

Grid: is a group of blocks. There should be at least as many blocks as multiprocessors (and preferably at least 2× as many).

Host: is the CPU in CUDA applications.

Device: is the GPU in CUDA applications.

SIMT: stands for Single-Instruction, Multiple-Thread and is closely related to the more commonly used term SIMD. A multiprocessor can execute hundreds of threads concurrently (1536 threads in the Fermi architecture). There is special hardware in the GPU architecture to create, manage, schedule and execute such a large number of threads.
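As a minimal sketch of how these terms appear in code (the kernel and array names are illustrative, not taken from this work), the example below launches a one-dimensional grid of blocks; each block is a group of warps, and each thread computes its global index from its block and thread indices.

    // Kernel: one thread per vector element.
    __global__ void scale(float *a, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the final, partially full block
            a[i] *= alpha;
    }

    // Host-side launch: a grid of "blocks" blocks, each with 256 threads (8 warps).
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_a, 2.0f, n);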

2.2 Device Memories

In the CUDA programming model, the host and device have their own separate memories. Each device is typically a hardware card that comes with its own Dynamic Random Access Memory (DRAM); for example, the NVIDIA GeForce GTX 480 card comes with 1.5 GB of DRAM. In order to execute code on the GPU, the programmer needs to allocate memory on the device and copy data from the host to the device. After execution, the data should be copied back to the host and the device memory freed [13]. Figure 2.1 shows a schematic of the CUDA device memory model. There are memories that the host or device can write to and read from, and there are also read-only memories. The description and features of each memory type are discussed in detail in the next sections.

19 Figure 2.1. GPU hardware model (from [2])
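A sketch of the allocate/copy/execute/copy-back/free cycle described above, using the standard CUDA runtime calls; the kernel is the illustrative scale kernel from the previous sketch, and the host array h_a is assumed to exist.

    size_t bytes = n * sizeof(float);
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);                      // allocate global (device) memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // copy input: host -> device
    scale<<<blocks, threads>>>(d_a, 2.0f, n);              // execute the kernel on the device
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // copy result: device -> host
    cudaFree(d_a);                                         // release the device memory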

2.2.1 Global Memory (Device Memory)

Global memory resides in device memory and is the largest and primary memory on the device. Global memory has a high latency, so reading data from it or writing to it is slow, taking 400 to 600 cycles [2]. The access pattern to device memory is a key factor in hiding this latency and increasing speed. A global memory request for a single warp is divided into two memory requests, one for each half-warp (16 threads), which can be issued independently. Depending on the device, the memory accesses of the threads are coalesced into one or more memory transactions. In order to maximize the device bandwidth, it is therefore important to maximize coalesced access (a short sketch follows the lists below). Global memory has its own advantages:

• A large amount of memory is available to a CUDA application

• Accessible from both host and device

• Possibility to hide memory latency when coalesced access is applied

On the other hand, global memory has some disadvantages:

• High memory latency

• The access pattern must be coalesced (spatially local) to achieve high bandwidth
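The two kernels below are an illustrative sketch of the difference (not code from this work): in the first, consecutive threads of a half-warp read consecutive addresses and the accesses coalesce into a few transactions; in the second, the stride scatters the threads of a warp across memory and forces many separate transactions.

    // Coalesced: thread i touches element i, so 16 consecutive threads read 16 consecutive words.
    __global__ void copy_coalesced(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided: thread i touches element i*stride, breaking spatial locality within the warp.
    __global__ void copy_strided(float *dst, const float *src, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) dst[i] = src[i * stride];
    }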

2.2.2 Local Memory

Local memory resides in device memory and is per-thread memory. Local memory is 'overflow' memory; it is slow and should be avoided. If a kernel requires too many register values, the compiler uses local memory as an additional memory store [2]. Because local memory is essentially just overflow registers being stored in global memory, it has a very high latency. By default, short arrays are also saved in local memory.

2.2.3 Texture Memory

CUDA supports texture memory, which is a subset of the texturing hardware and a holdover from graphics processing (where all memory was texture memory). Texture memory is read-only and cached, and resides in device memory. If memory accesses are not spatially local but are temporally local, then the caching of this memory can improve the performance of the application [2]. Reading data from texture memory instead of global memory has several performance advantages:

• Increased performance from reading data through the cache when coalesced access is hard to achieve with global memory

• Broadcasting packed data to separate variables in a single operation (decreasing the number of accesses to global memory)

• Converting 8-bit and 16-bit values to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] for free (color processing)

• Large texture memory allocations on the device are possible

• The texture cache is specially optimized for 2D rectilinear spatial locality (like a screen display)

• Linear, bilinear, and tri-linear interpolation are almost free of charge (using dedicated interpolation hardware)

On the other hand, using texture memory has some disadvantages:

• So far CUDA only supports up to 32-bit floating-point values for texture memory (no double precision)

• If coalesced access to global memory is easy to achieve, then texture memory offers no performance advantage and in some cases decreases performance because of the texture setup cost

• Because the texture cache is optimized for 2D rectilinear spatial locality, its utility is limited for anything but graphics

2.2.4 Shared Memory

Shared memory is on-chip and has much faster access than local and global memory. In order to increase bandwidth, shared memory is divided into 32 equally-sized memory modules (in the Fermi architecture), called banks, which can be accessed simultaneously by all threads in the same warp. If there are no bank conflicts among the threads of a warp, then access to shared memory is very fast. A bank conflict happens when two threads in the same warp try to read from or write to different addresses in the same bank; in that case the accesses have to be serialized. If all threads read from the same address (a broadcast), there is no bank conflict [2]. Using shared memory has some benefits (a short example follows the lists below):

• Low latency compared with global memory

• Supports single and double precision

• Reads and writes can be synchronized across all threads in the same block

On the other hand, shared memory has some disadvantages:

• Limited size (48 KB per multiprocessor for the Fermi architecture)

• Avoiding bank conflicts is in some cases hard to achieve

• Different blocks cannot access each other's shared memory, even if they are running on the same multiprocessor
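A generic block-sum reduction (an illustrative sketch assuming 256 threads per block, not a kernel from this work) shows the typical pattern: each block stages its data in shared memory, synchronizes, and then works on it cooperatively; consecutive threads touch consecutive 32-bit words, so the accesses fall in different banks and no conflicts occur.

    __global__ void block_sum(const float *in, float *out, int n)
    {
        __shared__ float s[256];                        // one word per thread of the block
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;                // stage data in fast on-chip memory
        __syncthreads();                                // wait until every thread has stored its value
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s[tid] += s[tid + stride];              // consecutive words -> different banks, no conflict
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0];           // one partial sum per block
    }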

2.2.5 Constant Memory

The constant memory space resides in device memory but has a limited size (64 KB) and is cached in the constant cache. The constant cache is 6 KB or 8 KB per multiprocessor. Like texture memory, constant memory is read-only. Access to constant memory has a one-cycle latency when there is a cache hit and hundreds of cycles when there is a cache miss [2]. Constant memory has some benefits (a brief example follows the lists below):

• Usually fast

• Saves registers and procedure arguments

• Can decrease the amount of shared memory needed

• Decreases the kernel loading overhead on the device

On the other hand, constant memory has some disadvantages:

• Limited size

• High memory latency for first-time accesses
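A brief sketch of constant memory use (the coefficient names and the one-dimensional stencil are illustrative assumptions, not taken from this work): data that every thread reads but never writes is declared __constant__ and filled once from the host.

    __constant__ float c_coef[3];                       // small, read-only, shared by all threads

    __global__ void stencil_1d(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = c_coef[0] * in[i - 1] + c_coef[1] * in[i] + c_coef[2] * in[i + 1];
    }

    // Host side: load the constant bank once, before the kernel launches.
    float coef[3] = { 1.0f, -2.0f, 1.0f };
    cudaMemcpyToSymbol(c_coef, coef, sizeof(coef));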

2.2.6 Page-Locked Host Memory

There is another memory allocation in CUDA called page-locked host memory. This is a type of memory on the CPU side that cannot be moved about (paged) by the operating system. Page-locked memory has its own benefits:

• It is possible to copy between page-locked memory and device memory while running a kernel on that device at the same time

• CUDA can map page-locked memory into the address space of the device memory, which eliminates the need for an explicit copy command from the device to the host or vice versa

• The bandwidth between page-locked memory and the device is high

Page-locked memory has some disadvantages too:

• Using too much page-locked memory decreases the overall performance of the operating system

• Reading from some kinds of page-locked memory (like write-combining memory) from the host is prohibitively slow

There are four different types of page-locked memory available to CUDA applications. The flags parameter in the page-locked allocation enables different options that affect the allocation. All of these flags are orthogonal to one another: a programmer may allocate memory that is portable, mapped and/or write-combined with no restrictions. So far the benefits of all these options are marginal (about 20% faster).

2.2.6.1 Pinned Memory

This is the default type of page-locked memory. With this type of memory, copying data between the host and the device (or vice versa) must be done explicitly by the program. By using different streams for the kernel launch and the copy command, it is possible to overlap and therefore hide the copying time. To use this type of memory, the programmer passes the cudaHostAllocDefault flag to the page-locked allocation [2].
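A sketch of that pattern (the buffer and kernel names are placeholders, and the kernel in the second stream is assumed to work on data already resident on the device): the copy is issued asynchronously in one stream so that a kernel in another stream can run at the same time.

    float *h_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);   // page-locked (pinned) host memory

    cudaStream_t copy_stream, exec_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    // Pinned memory allows a truly asynchronous copy, so this transfer overlaps the kernel below.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);
    work_kernel<<<blocks, threads, 0, exec_stream>>>(d_other, n);

    cudaStreamSynchronize(copy_stream);                            // wait before using d_buf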

2.2.6.2 Portable Memory

By default, page-locked memory can only be used by the host thread that allocated it. By passing the cudaHostAllocPortable flag to the page-locked allocation, CUDA makes the page-locked memory available to all host threads [2].

2.2.6.3 Write-Combining Memory

By default, page-locked memory is allocated as cacheable memory. Write-combining memory frees up L1 and L2 cache resources, leaving more cache available for the rest of the application. Transfer rates can also improve by up to 40% when using write-combining memory. However, reading write-combining memory from the host is extremely slow, so it should only be used when the host writes to that memory and the device reads from it. By passing the cudaHostAllocWriteCombined flag to the page-locked allocation, CUDA treats the page-locked memory as write-combined [2].

2.2.6.4 Mapped Memory

Page-locked host memory can be mapped into the address space of the device by passing the cudaHostAllocMapped flag to the page-locked allocation. There are then two addresses for that memory: one on the host side and one on the device side [2]. Mapped memory has several benefits:

• Copying data between the host and the device is done implicitly when the data is needed by the kernel

• Because CUDA implicitly handles the copy, there is no need to use streams to overlap the copy with execution

Because every device call is asynchronous by default, it is necessary to synchronize with the device kernel to avoid read-after-write, write-after-read and write-after-write hazards.
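A sketch of the mapped (zero-copy) pattern, assuming mapping is enabled before the CUDA context is created; the buffer and kernel names are illustrative.

    cudaSetDeviceFlags(cudaDeviceMapHost);             // must precede context creation

    float *h_halo, *d_halo;
    cudaHostAlloc((void **)&h_halo, bytes, cudaHostAllocMapped);   // pinned and mapped host memory
    cudaHostGetDevicePointer((void **)&d_halo, h_halo, 0);         // device-side alias of the same buffer

    // The kernel reads and writes d_halo directly; PCI-e traffic happens implicitly as needed.
    exchange_kernel<<<blocks, threads>>>(d_halo, n);
    cudaDeviceSynchronize();                           // synchronize before the host touches h_halo again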

2.3 Performance Optimization Strategies

There are several optimization techniques that should be considered in order to achieve high performance on GPUs [2, 31, 32].

• Minimize the data transfer between host and device

• Maximize coalesced access to global memory where possible

• Minimize the use of global memory; use shared memory or constant memory instead where possible

• Use a suitable memory type for the data (texture, constant, mapped or write-combining memory)

• Overlap copying data with kernel launches where possible

• Run kernels concurrently where possible

• Minimize CUDA synchronization calls

• Use a single large copy instead of many small copies

• Minimize thread divergence

• Avoid bank conflicts when using shared memory

• Maximize occupancy where possible; in general, occupancy should be greater than 25%

• Launch blocks with multiples of 32 threads

• Use faster, specialized math functions instead of the slower, more accurate ones where possible

• Use page-locked memory where possible

• Avoid atomic functions. For example, if two threads perform an atomic operation at the same memory address at the same time, those operations will be serialized. The order in which the operations complete is undefined, which is fine, but the serialization can be quite costly.

• Avoid integer division and modulo operations (close to 20 instructions) and use shift and mask operations where possible (a short sketch follows this list)
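For example (a generic illustration, assuming the divisor is a power of two), a row/column split of a linear index can be done with a shift and a mask instead of a division and a modulo:

    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int row = i >> 8;       // i / 256 when the row length is 256
    int col = i & 255;      // i % 256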

2.4 Multi-GPU Optimization Strategies Hiding the communication time on the GPU is much more complicated than on the CPU. There are three steps that should be overlapped with the computation in order to completely conceal the communication time: loading data from the GPU to the CPU, sending and receiving the data (between CPUs) with MPI, and finally copying the data back to the GPU. The bandwidth between the CPU and GPU is quite low (5-10 GB/s), and the bandwidth of QDR InfiniBand is close to 10 GB/s. Compared with the high bandwidth of the GPUs themselves (178 GB/s), these rates make hiding the communication time a key factor in multi-GPU implementations. Currently, all graphics cards support concurrent copy and kernel execution. The idea behind the hiding is to keep the GPU working on data while copying data from it, sending/receiving boundary values to/from other nodes with MPI, and copying the data back to the GPU (a sketch of this overlap structure is given at the end of this section).

CPUs always have better performance when the problem is small, because all the data can easily fit into the CPU's cache. In contrast, GPUs work best when the problem size is large. A sufficiently large problem has two benefits. First, the cost of initialization is relatively small, and it is more efficient to copy/send/receive one large block of data instead of many small data packets. Second, a large problem makes it easier to keep the GPU busy while communication is taking place. Details of the implementation and the effect of these parameters are discussed in chapter 5.
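A possible structure for this overlap is sketched below, assuming two CUDA streams, page-locked host buffers, and a simple halo exchange; the kernels, buffers and MPI neighbors are hypothetical and stand in for whatever a particular solver needs.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Hypothetical kernels; h_send/h_recv are assumed to be page-locked host buffers.
    __global__ void interior_kernel(double *field);
    __global__ void boundary_kernel(double *field, const double *halo);

    void exchange_and_compute(double *d_field, double *d_boundary, double *d_halo,
                              double *h_send, double *h_recv,
                              size_t boundary_bytes, int count, int left, int right,
                              dim3 grid, dim3 halo_grid, dim3 block)
    {
        cudaStream_t compute_stream, copy_stream;
        cudaStreamCreate(&compute_stream);
        cudaStreamCreate(&copy_stream);

        // 1. Start the interior work (the bulk of the domain) asynchronously.
        interior_kernel<<<grid, block, 0, compute_stream>>>(d_field);

        // 2. While the GPU computes, move boundary data GPU -> CPU on the copy stream.
        cudaMemcpyAsync(h_send, d_boundary, boundary_bytes, cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);

        // 3. Exchange halos between nodes with MPI (CPU side).
        MPI_Sendrecv(h_send, count, MPI_DOUBLE, right, 0,
                     h_recv, count, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // 4. Copy the received halo back and finish the boundary cells.
        cudaMemcpyAsync(d_halo, h_recv, boundary_bytes, cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);
        boundary_kernel<<<halo_grid, block, 0, compute_stream>>>(d_field, d_halo);
        cudaStreamSynchronize(compute_stream);

        cudaStreamDestroy(compute_stream);
        cudaStreamDestroy(copy_stream);
    }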

CHAPTER 3

DATA ANALYSIS ON THE GPUS

3.1 STREAM Benchmark 3.1.1 Problem Statement The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The STREAM benchmark is composed of four kernels. In the first kernel one vector is copied to another within the same device (a = b, one read and one write). In the second kernel, the vector entries are multiplied by a constant and the result is written to another vector (a = αb, one read and one write). Kernel 3 adds two vectors (a = b + c, two reads and one write). Finally, kernel 4 is a combination of kernels 2 and 3, sometimes referred to as a DAXPY operation (a = αb + c, two reads and one write). The benchmark was

computed using a single GPU operating on different vector sizes. Table 3.1 shows the specifications for the four types of NVIDIA GPUs used. The GTX 480, Tesla C2070, GTX 295 and Tesla S1070 house one, one, two and four GPUs in a single box, respectively. Accordingly, the hardware specifications from NVIDIA shown below are for one of these GPUs (1/2 of the GTX 295 and 1/4 of the Tesla S1070) [33, 34, 35, 36]. The tests shown below used 64k blocks with 256 threads each, operating on double precision vectors. Each GPU kernel was called 100 times, and each kernel performs the STREAM operation 100 times, so the timings below are reported for a single STREAM operation (total time / 10,000).
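For reference, the fourth (DAXPY-like) STREAM kernel is essentially the following CUDA kernel; this is a simplified sketch rather than the exact benchmark code, and the launch configuration in the comment reflects the block/thread setup described above.

    // Sketch of the STREAM "TriAdd" (DAXPY-like) kernel: a = alpha*b + c.
    __global__ void triad(double *a, const double *b, const double *c,
                          double alpha, size_t n)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) a[i] = alpha * b[i] + c[i];   // two reads and one write per element
    }

    // Example launch (hypothetical device pointers):
    // triad<<<num_blocks, 256>>>(d_a, d_b, d_c, 3.0, n);   // tests here used 64k blocks of 256 threads
    // Sustained bandwidth for this kernel is 3 * 8 * n bytes divided by the kernel time.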

Table 3.1. NVIDIA hardware specifications for GTX 295, GTX 480, Tesla S1070 and Tesla C2070

Model         Cores   Memory (MB)   Theoretical Bandwidth (GB/s)   Memory Interface Width (bit)   Max Power (W)
GTX 295         240           896                          119.9                            448             145
GTX 480         480          1500                          177.4                            384             250
Tesla S1070     240          4000                          102                              512             200
Tesla C2070     448          6000                          144                              384             238

3.1.2 Single GPU

Figures 3.1 and 3.2 show the single-GPU execution time and bandwidth for the GTX 295, GTX 480, Tesla S1070 and Tesla C2070 GPUs, respectively. For vectors with lengths less than 10^5 elements, the time is nearly constant and the bandwidth is less than the maximum value for the 10 series and 20 series respectively. However, for vector sizes larger than 10^5 the bandwidth is close to the maximum value and the execution time increases linearly with vector length. This shows that the startup cost of simply initiating a GPU kernel is high, and large vector lengths are required for good GPU performance.


Figure 3.1. Time and bandwidth for single NVIDIA GTX 295 and GTX 480 for four different STREAM kernels

The Tesla S1070 has more memory (4 GB) than the GTX 295 (896 MB). However, figure 3.2 shows that for large vector lengths (greater than 5 × 10^7) the Tesla S1070 bandwidth begins to decrease. For the largest possible lengths on the Tesla S1070, the bandwidth is approximately 50% of the maximum value. There is no such efficiency loss for the Tesla C2070 when using large vector sizes. Figure 3.2 also shows that for small problem sizes the Tesla S1070 has better performance than the C2070: the kernel startup time for the C2070 is an order of magnitude larger than that of its predecessor. The bandwidth achieved by the STREAM benchmark is consistently about 20% below the theoretical peak predicted by NVIDIA for all GPUs.


Figure 3.2. Time and bandwidth for single NVIDIA Tesla S1070 and Tesla C2070 for four different STREAM kernels

3.1.3 Weak Scaling on Many GPUs In the weak scaling case, the vector length is constant per GPU (at 2M double precision elements) as the number of GPUs increases. This makes the number of operations constant per GPU as the number of GPUs increases.


Figure 3.3. Results for weak scaling of the four STREAM benchmark kernels on Lincoln with 2M elements per GPU: (a) MCUPS, (b) speedup compared with a single core of the AMD processor on Orion, (c) actual and ideal bandwidth, (d) bandwidth per GPU for various numbers of GPUs

Figure 3.3 shows the results for weak scaling. Figure 3.3(a) shows millions of cell updates per second (MCUPS) for different numbers of GPUs while Figure 3.3(b) shows the speedup compared with a single core of the AMD CPU. Figures 3.3(c) and 3.3(d) show total bandwidth as well as bandwidth per GPU for various numbers of

GPUs. Because 2M is large enough for a single GPU to be efficient, the bandwidth is close to the maximum value and the speedup is linear.

3.1.4 Strong Scaling on Many GPUs Figure 3.4 shows the strong scaling results. In the strong scaling case, the total vector length remains constant as the number of processors varies. This means that an increased number of GPUs decreases the vector length per GPU. The total vector length used for the strong scaling case was 32M, which results in 0.5M elements per GPU when 64 GPUs are used for the computation.


Figure 3.4. Results of strong scaling of the four STREAM benchmark kernels on Lincoln with 32M total elements: (a) MCUPS, (b) speedup compared with a single Tesla S1070 GPU, (c) actual and ideal bandwidth, (d) bandwidth per GPU for various numbers of GPUs

Figure 3.4(a) shows the MCUPS while figure 3.4(b) shows the speedup for strong scaling compared with a single GPU. Figures 3.4(c) and 3.4(d) show the total bandwidth and bandwidth per GPU for various numbers of GPUs, respectively. The superlinear speedup in figure 3.4(b) is an artifact of the single GPU case being relatively slow: the vector length is so large (greater than 10^7, see figure 3.2(b)) that the performance is suboptimal for the single GPU case.

3.1.5 Power Consumption Figures 3.5(a) and 3.5(b) show power consumption for the AMD quad-core Phenom II X4, operating at 3.2 GHz, and for the GTX 295 GPUs, respectively. This test involved the STREAM benchmark operating on double precision vectors of length 2M per GPU or per CPU core.

(The measured trends in figure 3.5 are approximately Watt = 12.3 × (CPU cores) + 60 for the CPU and Watt = 115 × (GPUs) + 31 for the GPUs.)

Figure 3.5. Power consumption for the weak scaling STREAM benchmark with the AMD CPU and GTX 295 GPUs

Each GPU uses approximately 115 W over idle consumption when running the STREAM benchmark. It was also found that the idle power for one GTX 295 (2 GPUs) is 71 W (or about 30 W per GPU). This was ascertained by physically removing GPUs from the machine and re-running the code. The NVIDIA hardware specifications imply that each Tesla S1070 GPU uses more power than the GTX 295 GPUs (200 W vs. 145 W maximum). One reason for this difference could be the amount of memory: the Tesla S1070 has 4000 MB per GPU but the GTX 295 has only 896 MB per GPU. The Tesla also contains its own power supply and cooling system, which may play a role in the difference in consumption levels. Unfortunately, we did not have direct access to the Tesla hardware (at NCSA) to measure its power consumption directly.

3.1.6 GPU Temperature The STREAM benchmark was computed with GTX 295 GPUs to obtain a measurement of each GPU's temperature. The code was run with 8 GPUs for 331 seconds. Table 3.2 shows the initial and maximum temperatures for each GPU while the STREAM benchmark was running for those 331 seconds. The maximum allowable temperature for the GTX 295 listed on NVIDIA's website is 105 ◦C [35]. The maximum GPU temperatures range between 88-96 ◦C when running for this prolonged period of time. Since most GPGPU applications involve many short kernel calls rather than this type of extended exertion, GPU temperatures are typically lower than these maximum measured values.

Table 3.2. Temperatures for GTX 295 cards running the STREAM benchmark for 331 seconds on Orion (see figure 1.7 for the fan and GPU configuration)

Temperature         GPU 1  GPU 2  GPU 3  GPU 4  GPU 5  GPU 6  GPU 7  GPU 8
Initial Temp (◦C)      66     64     72     68     72     67     63     61
Max Temp (◦C)          90     88     96     92     96     93     91     88
∆T (◦C)                24     24     24     24     24     26     28     27

3.2 Unbalanced Tree Search 3.2.1 Problem Statement The Unbalanced Tree Search (UTS) benchmark performs an exhaustive search on an unbalanced tree. The tree is generated on the fly using a splittable random number generator (RNG) that allows the random stream to be split and processed in parallel while still producing a deterministic tree. Two kinds of trees are evaluated in this work: binary trees and geometric trees. The binary tree is based on a probability that each node has children or not; for the binary tree used here, each node either has 8 children or none at all. For the geometric tree, each node has 1 to 4 children, with the number of children assigned randomly (see figure 3.6). Geometric trees are terminated at a predefined level; nodes beyond the terminating level have no children.


Figure 3.6. Representations of (a) a binary tree, with nodes having 0 or 8 children, and (b) a geometric tree, with 1-4 randomly assigned children

There are two well-known schemes for dynamic load balancing: work sharing and work stealing. In a work sharing approach there is a global shared queue, and each processor has its own chunk of data. If the number of nodes on a processor increases beyond a fixed number, then it will start to write the extra data to the shared queue. Similarly, if there are not enough nodes on a processor, it will start to read nodes from the queue.

In work stealing there is no central queue (which can be a bottleneck). When there are not enough nodes for a processor to work on, it takes nodes directly from processors that have too many. The advantage of work stealing is that there is no communication when all processors are working on their own data set, making this

an efficient scheme. In contrast, work sharing can be inefficient because it requires load balancing messages to be sent even when all the processors have work to do [37]. Unfortunately, neither approach is particularly well-suited for the GPU because it is not possible to communicate directly between two GPUs. All communication must go

through the CPU (via MPI). Copying data from the GPU to the CPU or vice versa is expensive and should be avoided. To circumvent these problems, load balancing is divided into two parts, load balancing between the CPUs and load balancing on the GPU.

Each computational team is comprised of one CPU and one GPU, each with its own queue. Each GPU works on the computation while the CPUs perform the load balancing via MPI. After launching the GPU kernel, the CPUs begin load balancing amongst themselves. For this CPU balancing, they rank themselves by nodes per process and then begin sending work to one another based on this ranked list. The protocol for load redistribution is that the CPU with the most work shares with the CPU with the least amount of work, the CPU with the second most work shares with the CPU with the second least work, and so on (a sketch of this pairing is shown below). When a GPU finishes its work, it begins reading new data from the CPU. If a GPU has too much work, it will periodically stop and deliver that data to the CPU to be redistributed elsewhere. The GPU memory copies to and from the CPU are easily overlapped with GPU computation using the cudaMemcpyAsync command [32].
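The CPU-side pairing can be illustrated with the following sketch (hypothetical data structures; the choice to move half of the load difference and the MPI messages that actually move the tree nodes are illustrative assumptions, not the exact protocol used here):

    #include <algorithm>
    #include <vector>

    struct RankLoad { int rank; long nodes; };
    struct Transfer { int from; int to; long amount; };

    static bool more_loaded(const RankLoad &a, const RankLoad &b) { return a.nodes > b.nodes; }

    // Sketch: pair the most-loaded rank with the least-loaded one, the second most
    // with the second least, and so on, moving roughly half of the difference in each pair.
    std::vector<Transfer> plan_balancing(std::vector<RankLoad> loads)
    {
        std::vector<Transfer> plan;
        if (loads.size() < 2) return plan;
        std::sort(loads.begin(), loads.end(), more_loaded);
        for (size_t i = 0, j = loads.size() - 1; i < j; ++i, --j)
        {
            long diff = loads[i].nodes - loads[j].nodes;
            if (diff > 1)
            {
                Transfer t;
                t.from = loads[i].rank;  t.to = loads[j].rank;  t.amount = diff / 2;
                plan.push_back(t);
            }
        }
        return plan;   // the selected tree nodes would then be exchanged with MPI
    }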

The last box in figure 3.7 represents two check conditions. The whole UTS algorithm is inside the while loop; if the number of nodes on every GPU is equal to zero, the program terminates.

Figure 3.7. Implemented algorithm for UTS

3.2.2 Results for Unbalanced Tree Search Figure 3.8 shows results for two different tree searches using various numbers of

GPUs. The binary tree is started with 5000 nodes and the maximum depth of the tree is 653. The total number of nodes is 1,098,896. The geometric tree is started with four nodes and has a maximum depth of 12.


Figure 3.8. Strong scaling speedups for the unbalanced tree search relative to (a) a single CPU and (b) a single GPU using GTX 295 GPUs (Orion)

Because the binary tree starts with only 5,000 nodes, there is not enough work initially for even a single GPU. Also, the random number generator is so fast that it is not possible to successfully hide the load balancing. For the geometric tree, again, there is not enough work to require multiple cards until the tree grows beyond level 7. Additionally, once this level is reached, the geometric tree begins to grow very quickly. Finally, like the binary tree, the random number generator is very fast compared to data transfers via MPI. For all binary tree searches and for the first 7 levels of the geometric tree, the CPU is capable of keeping all the data in its cache. For load balancing, the GPU has to exit the kernel and write data to global memory, which is 100 times slower than the CPU's cache memory.

CHAPTER 4

SEQUENCE MATCHING

4.1 Introduction This chapter describes the solution of a single very large pattern matching search using a supercomputing cluster of GPUs. The objective is to compare a query sequence with a length on the order of 10^2 - 10^6 against a 'database' sequence with a size on the order of roughly 10^8, and find the locations where the test sequence best matches parts of the database sequence. Finding the optimal matches that can account for gaps and mismatches in the sequences is a problem that is non-trivial to parallelize. This study examines how to achieve efficient parallelism for this problem on a single GPU and between GPUs when they have a relatively slow interconnect. Pattern matching in the presence of noise and uncertainty is an important computational problem in a variety of fields. It is widely used in computational biology and bioinformatics; in that context DNA or amino acid sequences are typically compared to a genetic database. More recently, optimal pattern matching algorithms have been applied to voice and image analysis, and to the data mining of scientific simulations of physical phenomena.

The Smith-Waterman algorithm is a dynamic programming algorithm for finding the optimal alignment between two sequences once the relative penalty for mismatches and gaps in the sequences is specified. For example, given two RNA sequences, CAGCCUCGCUUAG (the database) and AAUGCCAUUGCCGG (the test sequence), the optimal region of overlap (when the gap penalties are specified according to the Scalable Synthetic Compact Application No. 1 (SSCA#1) benchmark) is shown below in bold

CAGCC–UCGCUUAG

AAUGCCAUUGCCGG

The optimal match between these two sequences is eight items long and requires the insertion of one gap in the database and the toleration of one mismatch (the third to last character in the match), as well as a shifting of the starting point of the query sequence. The Smith-Waterman algorithm determines the optimal alignment by constructing a table that has an entry for every pairing of an item in the query sequence with an item in the database sequence. When either sequence is large, constructing this table is a computationally intensive task. In general, it is also a relatively difficult algorithm to parallelize, because every item in the table depends on all the values above it and to its left. This chapter discusses the algorithm changes necessary to solve a single very large

Smith-Waterman problem on many internet connected GPUs in a way that is scalable to any number of GPUs and to any problem size. Prior work on the Smith-Waterman algorithm on GPUs [38] has focused on the quite different problem of solving many (roughly 400,000) entirely independent small Smith-Waterman problems of different

sizes (but averaging about 1K by 1K) on a single GPU [39]. Within each GPU this chapter shows how to reformulate the Smith-Waterman algorithm so that it uses a memory efficient parallel scan to circumvent the inherent dependencies. Between GPUs, the algorithm is modified so as to reduce inter-GPU

communication.

4.2 The Smith-Waterman Algorithm The Smith-Waterman algorithm is a well-known dynamic programming method for finding similarity between DNA, RNA, nucleotide or protein sequences. The main idea behind dynamic programming is to divide the problem into small segments and solve each segment separately; in the end, the results are combined to complete the solution. The algorithm was first proposed in 1981 by Smith and Waterman and identifies similar regions between sequences by searching for optimal local alignments [40]. To find the best local alignment between two sequences, an appropriate scoring

system including a set of specified gap penalties and similarity values is required. The Smith-Waterman algorithm is built to find segments of all possible lengths between two sequences that are similar to each other based on a defined scoring system. This means that the Smith-Waterman algorithm is very accurate and finds an optimal

alignment between two sequences. Unfortunately, the Smith-Waterman algorithm is both time consuming and CPU intensive. Because it is so time consuming, there have been many developments and suggested optimizations. One example is the well-known Basic Local Alignment Search Tool, BLAST, which is based on heuristic acceleration algorithms [41, 42]. It can be difficult to identify good alignments between two sequences that are only distantly related, as they sometimes have regions of low similarity. Local alignment searches are useful in exactly these cases, where a global alignment may fail. Identifying the best local alignment between two sequences is essentially what the Smith-Waterman algorithm does [43]. In contrast to the Needleman-Wunsch algorithm [44], on which the Smith-Waterman algorithm is built, the Smith-Waterman algorithm searches for the best local alignments, not global or multiple alignments, considering segments of all possible lengths to find the similarities between sequences [43].

The Smith-Waterman algorithm uses individual pair-wise comparisons between characters as:

H_{i,j} = max[ max(H_{i-1,j-1} + S_{i,j}, 0),  max_{0<k<i}( H_{i-k,j} - (G_s + k G_e) ),  max_{0<l<j}( H_{i,j-l} - (G_s + l G_e) ) ]     (4.1)

Here S_{i,j} is the similarity score for comparing the character at position i of the query sequence with the character at position j of the database sequence, and G_s and G_e are the gap-start and gap-extension penalties.

The first data sequence (the database) is usually placed along the top row, and the second sequence (the query sequence) is usually placed in the first column. The table values are called H_{i,j}, where i is the row index and j is the column index (see figure 4.1). The algorithm is initiated by placing zeros in the first row and column of the table. The other entries in the table are then set via equation 4.1.

Figure 4.1. Dependency of the values in the Smith-Waterman table

The algorithm assigns a score to each residue comparison between the two sequences. By assigning scores for matches, substitutions and insertions/deletions, the comparison of each pair of characters is weighted into a matrix by calculating every possible path to a given cell. The value in any matrix cell represents the score of the optimal alignment ending at those coordinates, and the highest scoring cell in the matrix identifies the optimal alignment (see figure 4.2). To construct the optimal local alignment from the matrix, the starting point is the highest scoring matrix cell. The path is then traced back through the array until a cell scoring zero is met. Because the score in each cell is the maximum possible score for an alignment of any length ending at the coordinates of that specific cell, aligning this highest scoring segment will yield the optimal local alignment.

Figure 4.2. Similarity matrix and best matching for two small sequences CAGC- CUCGCUUAG (top) and AAUGCCAUUGCCGG (left). The best alignment is: GCC-UCGC and GCCAUUGC which adds one gap to the test subsequence and which has one dissimilarity (3rd to last unit).

The Smith-Waterman algorithm is quite time demanding because of the search for optimal local alignments, and it also imposes some requirements on the computer's memory resources because the comparison takes place on a character-by-character basis. The fact that similarity searches using the Smith-Waterman algorithm take a long time often prevents it from being the first choice, even though it is the most precise algorithm for identifying similar regions between sequences. BLAST [41] and FastA [45, 46] are heuristic approximations of the Needleman-Wunsch and Smith-Waterman algorithms. These approximations are less sensitive and do not guarantee finding the best alignment between two sequences, but they are much less time-consuming and reduce both computation time and CPU usage. Today's research requires fast and effective data analysis, and algorithms like BLAST have therefore largely replaced Smith-Waterman searches as the demands of handling large amounts of data keep growing. On the other hand, database searches are becoming more and more intensive, and researchers are becoming increasingly concerned about both the time and the accuracy of search algorithms. As a result, efficient implementation of the Smith-Waterman algorithm is again becoming an active area of research [43].

The Smith-Waterman algorithm can be accelerated with FPGA (Field-Programmable Gate Array) chips, with SIMD (Single Instruction, Multiple Data) technology, or with GPUs, all of which parallelize and thereby accelerate the computations.

4.3 The Optimized Smith-Waterman Algorithm 4.3.1 Reduced Dependency If equation 4.1 is naively implemented, then the column maximum and row maximum (the second and third items in equation 4.1) are repetitively calculated for each table entry, causing O(L1 L2 (L1 + L2)) computational work, where L1 and L2 are the lengths of the database and query sequences. This is alleviated by retaining the previous row and column sums [47], called F_{i,j} and E_{i,j} respectively. This increases the storage required by the algorithm by a factor of 3 but reduces the work significantly. The Smith-Waterman algorithm is now given by

E_{i,j} = max(E_{i,j-1}, H_{i,j-1} - G_s) - G_e

F_{i,j} = max(F_{i-1,j}, H_{i-1,j} - G_s) - G_e     (4.2)

H_{i,j} = max(H_{i-1,j-1} + S_{i,j}, E_{i,j}, F_{i,j}, 0)
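For reference, a direct serial implementation of Eqn. 4.2 might look like the following sketch (the array names, the row-major layout, and the scoring interface are assumptions made for illustration; this is not the GPU code used in this work):

    #include <algorithm>

    // Serial reference for Eqn. 4.2. H, E and F are (L2+1) x (L1+1) tables stored row-major;
    // S holds the similarity score of query item i versus database item j;
    // Gs and Ge are the gap-start and gap-extension penalties.
    void smith_waterman_ref(const int *S, int L1, int L2, int Gs, int Ge,
                            int *H, int *E, int *F)
    {
        int W = L1 + 1;                                          // row length including the zero column
        for (int j = 0; j <= L1; ++j) { H[j] = E[j] = F[j] = 0; } // zero first row
        for (int i = 1; i <= L2; ++i)
        {
            H[i*W] = E[i*W] = F[i*W] = 0;                         // zero first column
            for (int j = 1; j <= L1; ++j)
            {
                E[i*W+j] = std::max(E[i*W+j-1],   H[i*W+j-1]   - Gs) - Ge;
                F[i*W+j] = std::max(F[(i-1)*W+j], H[(i-1)*W+j] - Gs) - Ge;
                int diag = H[(i-1)*W+(j-1)] + S[(i-1)*L1 + (j-1)];
                H[i*W+j] = std::max(std::max(diag, E[i*W+j]),
                                    std::max(F[i*W+j], 0));
            }
        }
    }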

4.3.2 Anti-Diagonal Approach

The Smith-Waterman algorithm can be parallelized by operating along the anti-diagonal [48]. Figure 4.3 shows (in grey) the cells that can be updated in parallel. If the 3-variable approach is used, then the entire table does not need to be stored: an anti-diagonal row can be updated from the previous E and F anti-diagonal rows and the previous two anti-diagonal rows of H. The algorithm can then store the i and j location of the maximum H value found so far. When the entire table has been searched, it is possible to return to the maximum location and rebuild the small portion of the table necessary to reconstruct the alignment sub-sequences.

The anti-diagonal method has startup and shutdown issues that make it somewhat unattractive to program. It is inefficient (by a factor of roughly 50%) if the two input sequences have similar sizes. For very dissimilar sized input sequences, the maximum length of the anti-diagonal row is the minimum of the two input sequence lengths. On a GPU we would like to operate with roughly 128 to 256 threads per block and at least two blocks per multiprocessor (32-60 blocks). This means that query sequences of length 10^4 or greater are needed to efficiently occupy the GPU using the anti-diagonal method, which is an order of magnitude larger than a typical query sequence. The anti-diagonal method also cannot operate entirely with just registers and shared memory; the inherent algorithm dependencies still require extensive communication of the values of H, F, and E between the threads.

Figure 4.3. Anti-diagonal method and dependency of the cells

4.3.3 Parallel Scan Smith-Waterman Algorithm The NVIDIA G200 series requires a minimum of roughly 30K active threads for 100% occupancy in order to perform efficiently. This means the diagonal algorithm will only perform efficiently on a single GPU if the minimum of the two sequence lengths is at least 30K. In biological applications test sequence lengths of this size are rare, but this is not the real problem with the diagonal algorithm. The primary issue with the diagonal approach is how it accesses memory. Any hardware which can accelerate scientific computations is necessarily effective because it accelerates memory accesses. On the GPU, memory access is enhanced in hardware by using memory banks and memory streaming. On the GPU this means that it is very efficient to access up to 32 consecutive memory locations, and relatively inefficient (5-10× slower) to access random memory locations such as those dispersed along the anti-diagonal (and its neighbors).

To circumvent this problem, it is shown below how the Smith-Waterman algorithm can be reformulated so that the calculations can be performed in parallel one row (or column) at a time. Row (or column) calculations allow the GPU memory accesses to be consecutive and therefore fast. To create a parallel row-access Smith-Waterman algorithm, the typical algorithm (Eqn. 4.2) is decomposed into three parts. The first part neglects the influence of the row sum, E_{i,j}, and calculates a temporary variable \tilde{H}_{i,j}.

F_{i,j} = max(F_{i-1,j}, H_{i-1,j} - G_s) - G_e

\tilde{H}_{i,j} = max(H_{i-1,j-1} + S_{i,j}, F_{i,j}, 0)     (4.3)

This calculation depends only on data in the row above, and each item in the current row can therefore be calculated independently and in parallel. It is then necessary to calculate the row sums.

E_{i,j} = max_{1<=k<=j}( \tilde{H}_{i,j-k} - k G_e )     (4.4)

These row sums are performed on \tilde{H}_{i,j}, not H_{i,j}, since the latter is not yet available anywhere on the row. The key to the reformulated algorithm is noting that these row sums look dependent and inherently serial, but in fact they can be computed rapidly in parallel using a variation of a parallel maximum scan. Finally, we compute the true H_{i,j} values via the simple formula

H_{i,j} = max( \tilde{H}_{i,j}, E_{i,j} - G_s )     (4.5)

It was shown in [49] that this decomposition is mathematically equivalent to Eqn. 4.2, and the storage requirements are the same. Each step of this algorithm is completely parallel and, even more importantly for the GPU, has a contiguous memory access pattern. Figure 4.4 shows a graphical representation of the row-access parallel Smith-Waterman algorithm.

Figure 4.4. Graphical representation of row-access for each step in row-parallel scan Smith-Waterman algorithm. This example is calculating the 6th row from the 5th row (of the example problem shown in Figure 4.1).
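One possible way to organize a single row update of this algorithm on the GPU is sketched below. The three kernels correspond to Eqns. 4.3-4.5; exclusive_max_scan stands in for a standard parallel maximum scan (for example, the CUDA SDK scan with addition replaced by max), and the kernel and array names are hypothetical rather than the actual implementation used in this work.

    // Sketch of one row update of the row-parallel algorithm (row i computed from row i-1).
    // Arrays hold one row each (length L1+1, index 0 is the zero column).
    // Hprev/Fprev are row i-1; Hrow/Frow/Erow/Htilde are row i; Htilde[0] is assumed to be 0.

    __global__ void step1_tilde(const int *Hprev, const int *Fprev, const int *Srow,
                                int *Htilde, int *Frow, int Gs, int Ge, int L1)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;        // columns 1..L1
        if (j > L1) return;
        Frow[j]   = max(Fprev[j], Hprev[j] - Gs) - Ge;            // Eqn. 4.3
        Htilde[j] = max(max(Hprev[j-1] + Srow[j], Frow[j]), 0);
    }

    __global__ void step2_prepare(const int *Htilde, int *X, int Ge, int L1)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;            // columns 0..L1
        if (j > L1) return;
        X[j] = Htilde[j] + j * Ge;   // since E_{i,j} = max_{m<j}(Htilde_m + m*Ge) - j*Ge
    }

    // ... exclusive_max_scan(X, M, L1 + 1);   // M[j] = max(X[0..j-1]), computed in parallel

    __global__ void step3_combine(const int *Htilde, const int *M, int *Hrow, int *Erow,
                                  int Gs, int Ge, int L1)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
        if (j > L1) return;
        Erow[j] = M[j] - j * Ge;                                  // Eqn. 4.4 via the scan
        Hrow[j] = max(Htilde[j], Erow[j] - Gs);                   // Eqn. 4.5
    }

Every memory access in these steps is contiguous along the row, which is what makes the reformulation attractive on the GPU; the scan in the middle step is the only coupled operation.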

4.3.4 Overlapping Search

If an upper bound is known a priori for the length of the alignment sub-sequences that will result from the search, it is possible to start the Smith-Waterman search algorithm somewhere in the middle of the table. This allows an entirely different approach to parallelism. Figure 4.5 shows an example where the database sequence has been broken into 4 overlapping parts. If the region of overlap is wider than any sub-sequence that is found, then each part of the database can be searched independently. The algorithm starts each of the 4 searches with zeros in the left column. This is correct for the first section, but not for the other sections. Nevertheless, by the time the end of the overlap region has been reached, the table values computed by the other sections will be the correct values.

Figure 4.5. Example of the overlapping approach to parallelism.

This means that in the overlap region the values computed from the section on the overlap region's left are the correct values. The values computed at the beginning of the section that extends to the right are incorrect and are ignored (in the overlap region) for the purposes of determining the best alignment sub-sequences. This approach to parallelism is attractive for partitioning the problem onto the different GPUs. It does not require any communication at all between the GPUs except a transfer of the overlapping part of the database sequence at the beginning of the computation, and a collection (and sorting) of the best alignments at the very end. This approach also partitions a large database into much more easily handled/accessed database sections. The cost to be paid lies in the duplicate computation that occurs in the overlap regions, and the fact that the amount of overlap must be estimated before the computation occurs. For a 10^8 long database on 100 GPUs, each GPU handles a database section of a million items. If the expected subsequence alignments are less than 10^4 long, the overlap is less than 1% of the total computation. However, if we were to try to extend this approach to produce parallelism at the block level (with 100 blocks on each GPU), the amount of overlap being computed would be 100%, which would be quite inefficient. Thread level parallelism using this approach (assuming 128 threads per block and no more than 10% overlap) would require that sub-sequences be guaranteed to be less than eight items long. This is far too small, so thread level parallelism is not possible using overlapping. The approach used in this chapter, which only uses overlapping at the GPU level, can efficiently handle potential sub-sequence alignments that are comparable to the size of the database section being computed on each GPU (sizes up to 10^6 in our current example). For larger alignments, the GPU parallel scan would be necessary.

4.3.5 Data Packing

Because the Smith-Waterman algorithm is memory constrained, one simple method to improve the performance significantly is to pack the sequence data and the intermediate table values. For example, when working with DNA or RNA sequences, there are only 4 possibilities (2 bits) for each item in the sequence. 16 RNA items can therefore be stored in a single 32-bit integer, increasing the performance of memory reads of the sequence data by a factor of 16. Similarly, for protein problems, 5 bits are sufficient to represent one amino acid, resulting in 6 amino acids per 32-bit integer read (or write).
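For illustration, 16 RNA items could be packed into and unpacked from a single 32-bit integer as follows (a sketch that assumes the four bases have already been encoded as the values 0-3):

    // Pack 16 RNA items (2 bits each, values 0-3) into one 32-bit word.
    __host__ __device__ unsigned int pack16(const unsigned char *items)
    {
        unsigned int word = 0;
        for (int k = 0; k < 16; ++k)
            word |= (unsigned int)(items[k] & 0x3) << (2 * k);
        return word;
    }

    // Unpack item k (0-15) from a packed word.
    __host__ __device__ unsigned char unpack(unsigned int word, int k)
    {
        return (word >> (2 * k)) & 0x3;
    }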

The bits required for the table values are limited by the maximum expected length of the sub-sequence alignments and by the similarity scoring system used for a particular problem. The SSCA#1 benchmark sets the similarity value to 5 for a match, so the maximum table values will be less than or equal to 5 × (the alignment length); 16 bits are therefore sufficient for sequences of length up to 10^4. Data packing is problem dependent, so we have not implemented it in our code distribution [50] or used it to improve the speed of the results presented below.

4.3.6 Hash Table Some commonly used Smith-Waterman derivatives (such as FASTA and BLAST) use heuristics to improve the search speed. One common heuristic is to insist upon

perfect matching of the first n items before a sub-sequence is evaluated as a potential possibility for an optimal alignment. This and some preprocessing of the database

allows the first n items of each query sequence to be used as a hash table to reduce the search space. For example, if the first 4 items are required to match perfectly, then there are 16 possibilities for the first 4 items. The database is then preprocessed into a linked list with 16 heads. Each linked list head points to the first location in the database of that particular 4-item string, and each location in the database points to the next location of that same type. In theory this approach can reduce the computational work by roughly a factor of 2^n. It also eliminates the optimality of the Smith-Waterman algorithm: if the first 8 items of a 64-item final alignment must match exactly, there is roughly a 12.5% chance that the alignment found is actually suboptimal. On a single CPU, the BLAST and FASTA programs typically go about 60× faster than optimal Smith-Waterman.

Because this type of optimization is very problem dependent and is difficult to parallelize, we have not included this type of optimization in our code or the following results.
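Purely for illustration (this preprocessing is not part of our code or results), the linked-list construction described above might be sketched as follows; the choice of n, the 2-bit encoding, and all names are assumptions.

    #include <vector>

    // Sketch: preprocess the database into a linked list keyed on the first n items.
    const int n        = 4;
    const int NUM_KEYS = 1 << (2 * n);     // number of distinct n-item leading strings for a 2-bit alphabet

    // Key formed from the n items starting at position p (db holds values 0-3).
    static int key_at(const std::vector<unsigned char> &db, size_t p)
    {
        int key = 0;
        for (int k = 0; k < n; ++k) key = (key << 2) | db[p + k];
        return key;
    }

    // head[key] : first database position whose leading n items produce that key (-1 if none)
    // next[p]   : next database position with the same key (-1 if none)
    void build_hash(const std::vector<unsigned char> &db,
                    std::vector<long> &head, std::vector<long> &next)
    {
        head.assign(NUM_KEYS, -1);
        next.assign(db.size(), -1);
        if (db.size() < (size_t)n) return;
        for (long p = (long)db.size() - n; p >= 0; --p)   // walk backwards so lists run front-to-back
        {
            int key = key_at(db, (size_t)p);
            next[p]   = head[key];
            head[key] = p;
        }
    }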

4.4 Literature Review There are already a number of efficient implementations of the Smith-Waterman (SW) algorithm on different CPU architectures [51, 52, 53, 54, 55, 56] and GPU architectures [38, 39, 57, 58, 59, 60, 61, 62, 63].

Liu et al. [57] implemented the SW algorithm on the GPU in the OpenGL environment. They achieved a speedup of sixteen over a straightforward, optimized CPU SW implementation. The basic idea is to compute the dynamic programming matrix in anti-diagonal order. All elements in the same anti-diagonal of the Smith-Waterman matrix can be computed independently of each other in parallel. The anti-diagonals are stored as textures in the texture memory. Fragment programs are then used to apply the arithmetic operations specified by the recurrence relation.

The main problem with this approach is how the memory is accessed (read and written): calculating one diagonal requires the two previous diagonals, and keeping the method memory efficient requires complicated memory management. Liu et al. [58] implemented the double affine SW (DASW) algorithm, which compares and aligns genomic sequences such as DNA and proteins. They performed their calculation on an NVIDIA GeForce 7800 GTX graphics card using the OpenGL API and the GL shading language (GLSL). Their implementation involved two stages of the OpenGL rendering pipeline: geometry transformation and fragment rasterization.

The geometry acts as the proxy that arranges the pipeline for the dynamic programming computation and sets the area of computation. At the beginning, the query and database sequences are copied to the texture memory of the GPU. After this step, the geometry transformation stage for each diagonal is passed a quadrilateral that can hold the dynamic programming cells of that diagonal. A unique texture coordinate address is assigned to each fragment from the quadrilateral. The resulting pixel values are saved into an image buffer in the GPU texture memory, to be reused in later passes. They proceed with this computation loop until all diagonals have been updated. They achieved 0.25 giga cell updates per second (GCUPS) with their implementation. Liu et al. [59] introduced two streaming algorithms for dynamic programming-based biological sequence alignment that map efficiently onto graphics hardware. They reformulated their algorithms as streaming algorithms that are efficient

when implemented on a graphics processing unit. Their experimental results show that the dynamic programming-based approach obtains speedups of more than one order of magnitude compared with optimized CPU implementations. Manavski and Valle [60] also implemented the SW algorithm on single and dual

GPUs. They achieved more than 1.8 GCUPS for a single GPU and 3.5 GCUPS for two GeForce 8800 GTX cards. The strategy adopted in their CUDA implementation was to make each thread in the GPU perform the whole alignment of the query sequence

with one database sequence. They sorted the database sequences by length in order to make each thread in a block perform the same amount of work for its query sequence, so when each block is launched, the threads in the block align the test sequence with two database sequences having the closest possible sizes. Because there are so many query sequences in the database, the whole approach is inherently parallel, but it is still necessary to be careful about the memory access pattern. Liu et al. [61] implemented SW sequence database searches on single GTX 280 and dual-GPU GTX 295 graphics cards. Their code can support query sequences of length up to 59K, which is longer than the maximum sequence length of 35,213 in Swiss-Prot release 56.6. The query length is limited in their application by the constant memory on the GPU (64K). For the single-GPU version, they achieved consistent performance for query sequences of length varying from 144 to 5,478, where the performance figures vary from a low of 9.039 GCUPS to a high of 9.660 GCUPS, with an average performance of 9.509 GCUPS. They also executed their code on multiple GPUs and obtained better performance. With increasing sequence length they attained better performance, from a low of 10.660 GCUPS to a high of 16.087

GCUPS, with an average performance of 10.660 GCUPS. Striemer et al. [62] and Akoglu et al. [63] also implemented the SW algorithm on the GPU. Like the other works, they used a single thread block to find the best alignment for a single sequence in the database. They performed the whole sequence matching on a single GPU and avoided using the CPU. They achieved an average speedup of 9 compared with the SSEARCH software and a speedup of the same order compared with Farrar [53] (an efficient CPU implementation). Ligowski and Rudnicki [39] executed the SW algorithm on the CPU and GPU. The control loop resides on the CPU. The program loads a sorted database into GPU memory and waits for queries. Each thread in the block compares a query sequence with one of the database sequences. The query sequence is shared by all threads in the block.

Since all threads in the block are synchronized, the lengths of the database sequences processed by a single block should be close to each other. Otherwise, all threads in the block have to wait for the one searching for alignments in the longest database sequence. All calculations within the loop are performed in fast shared memory and registers. The bandwidth is limited by the availability of shared memory. Based on this approach they achieved 6 GCUPS for a single GPU and 12 GCUPS for dual GPUs with a 9800 GX2. Liu et al. [38] optimized their Smith-Waterman code (CUDASW++). They inves-

tigated a partitioned vectorized SW algorithm using CUDA based on the virtualized SIMD abstraction of the GPUs. They found that the optimized single-instruction multi-thread (SIMT) and the partitioned vectorized algorithms have similar performance characteristics when benchmarked by searching the Swiss-Prot release 56.6

database. They also illustrated that the optimized SIMT algorithm produced reasonably stable performance, while the partitioned vectorized algorithm sometimes has small fluctuations around the average performance for some gap penalties. Fluctuations increase with higher gap open and gap extension penalties. On the other

hand, CUDASW++ 2.0 supports multiple GPU devices installed in a single host. It has higher performance than CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm. They achieved a highest performance of 17 GCUPS on the GTX 280 and 30 GCUPS on the GTX 295. Although the optimal alignment scores of the Smith-Waterman algorithm can be used to find similar sequences, the scores are affected by sequence length and composition.

4.5 Results Timings were performed on 4 different GPU architectures and a high-end CPU. The NVIDIA 9800 GT has 112 cores and its memory bandwidth is 57.6 GB/s. The NVIDIA 9800 GTX has 128 cores and its memory bandwidth is 70.4 GB/s. The

newer generation NVIDIA GTX 295 comes as two GPU cards that share a PCI-e slot. When we refer to a single GTX 295 we refer to one of these two cards, which has 240 cores and a memory bandwidth of 111.9 GB/s. The CPU is an AMD quad-core Phenom II X4, operating at 3.2 GHz; it has 4 × 512 KB of L2 cache and 6

MB of L3 cache. For the parallel timings, a PC that contains 4 NVIDIA GTX 295 cards (occupying 2 PCI-e ×16 slots) was used. All code was written in C++ with NVIDIA's CUDA language extensions for the GPU, and compiled using Microsoft Visual Studio 2005 (VS 8) under Windows XP. The bulk of the NVIDIA SDK examples use this configuration. The SSCA#1 benchmark was developed as a part of the High Productivity Computing Systems (HPCS) [64] benchmark suite that was designed to evaluate the true performance capabilities of supercomputers. It consists of 5 kernels that are various

permutations of the Smith-Waterman algorithm.

4.5.1 Single GPU

In the first kernel the Smith-Waterman table is computed and the largest table values (200 of them) and their locations in the table are saved. These are the end points of well aligned sequences, but the sequences themselves are not constructed or saved. The sequence construction is performed in kernel 2. Because the traceback step is not performed in kernel 1, only a minimal amount of data in the table needs to be stored at any time. For example, in the row-parallel version, only the data from the previous row needs to be retained. This means that kernel 1 is both memory efficient and has a high degree of fine scale parallelism. Single processor timings and speedup for kernel 1 are shown in figures 4.6(a) and 4.6(b) for different table sizes and different processors. The table size is the length of the database multiplied by the length of the test sequence (which was always 1024 for the kernel 1 tests). For smaller size problems, the overhead associated with saving and keeping the 200 best matches becomes

a significant burden for the GPUs and their efficiency decreases. For a finely parallel (mostly SIMT) algorithm (kernel 1) the latest GPU (GTX 480) is over 100 times faster than a single core of the latest CPU. This performance benefit might be reduced to roughly 25× if all four cores of the (Phenom II) CPU were put to use, but the factor of 100× can also be regained by putting 4 GPUs on the motherboard. Many motherboards now have two (or even four) PCI-e ×16 slots.


Figure 4.6. (a) Time and (b) speedup for kernel 1, for different problem sizes and different processors

The second kernel in the SSCA#1 benchmark is very fast compared to kernel 1. In the second kernel, subsequences are constructed from the 200 endpoint locations found in kernel 1. Since the table was not stored during the first kernel, small parts of the table need to be recomputed. This reconstruction happens by applying the Smith-Waterman algorithm 'backwards', since the algorithm works just as well in either direction. Starting at the end point location, the corresponding row and column are zeroed. The algorithm is then applied moving upwards and to the left (rather than downwards and to the right). When the endpoint value is obtained in the table, this is the starting point of the sequence. The traceback process then occurs by tracing to the right and down. In our GPU implementation, groups of 64 threads work on each of these small sub-tables, and the sub-tables are all computed in parallel. Single processor timings and speedup for kernel 2 are shown in figures 4.7(a) and 4.7(b) for different table sizes and different processors.


Figure 4.7. (a) Time and (b) speedup for kernel 2, for different problem sizes and different processors

The traceback procedure is an inherently serial process and difficult to implement well on the GPU. A GPU operates by assigning threads into a single-instruction, multiple-thread (SIMT) group called a block. In kernel 2 a single block is assigned to compute the small sub-table for each endpoint. 64 threads are assigned to each block (32 is the minimum possible on the GPU), so the sub-table is constructed in 64 × 64 chunks. In SSCA#1, a single 64 × 64 block is almost always sufficient to capture the alignment sequence, but it is almost always wasteful as well. The 64 threads assigned to a sub-table use the anti-diagonal method to construct each 64×64 chunk. As expected the performance is not a function of the problem size. Kernel 2 always works on 200 sequences. It does not matter how large the table is that they originally came from.

Timings and speedup for kernel 3 of the SSCA#1 benchmark are shown in figures 4.8(a) and 4.8(b). Kernel 3 is similar to kernel 1 in that it is another Smith-Waterman calculation. However, in this case the sequences are required to be computed at the same time (in a single pass) as the table evaluation. The table is, of course, too large

to store in its entirety as it is calculated, so only ‘enough’ of the table to perform a traceback is kept. Kernel 3 therefore requires the maximum possible subsequence length to be known beforehand.


Figure 4.8. (a) Time and (b) speedup for kernel 3, for different problem sizes and different processors

In SSCA#1 this is possible because the test sequences in kernel 3 are the 100 most interesting subsequences (100 highest scoring sequences) found in kernel 2. In kernel 3, the 100 best matches for each of the 100 test sequences (10,000 matches) must be computed. By assigning groups of threads into a SIMT block, a single block is assigned to find 100 best matches for one test sequence. The results of kernel 3 are 100 × 100 subsequences. Kernel 3 therefore has coarse grained parallelism that might be used to make it faster than kernel 1 on general purpose hardware. However, the kernel 3 algorithm requires different threads and different thread blocks to be doing

different things at the same time, which is ill-suited to the GPU (essentially SIMT) architecture. The performance of the GPUs for this kernel is therefore roughly the same as one core of a (large cache) CPU. The fourth kernel goes through each set of matches (100 × 100 best matches) found by the third kernel and performs a global pairwise alignment for each pair. The output of this kernel is 100 × 100 × (100 − 1)/2 global alignment scores. The global alignment of two (same length) subsequences is given by equation 4.6 below. The global similarity, dissimilarity and gap penalty (given in Eqn. 4.6) are dictated by SSCA#1 but could be arbitrary.

H_{i,j} = min[ H_{i-1,j-1} + 0 (if A_j = B_i),  H_{i-1,j-1} + 1 (if A_j ≠ B_i),  H_{i-1,j} + 2,  H_{i,j-1} + 2 ]     (4.6)

Global alignment is also performed by creating a table. The global pairwise alignment score of two subsequences is the number in the bottom right corner of the table. The table is started with H_{i,0} = i × 2 and H_{0,j} = j × 2. On the GPU, each thread performs a single global pairwise alignment for each pair. Every table construction is completely independent, so this kernel is highly parallel and very fast. Figures 4.9(a) and 4.9(b) show the timing and speedup for this kernel. Kernel 4 is plotted on a log scale of the problem size because the sequence lengths get slightly longer as the database gets larger (due to the probability of extreme events increasing).
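A sketch of the per-thread work for this kernel (a device function with hypothetical arguments; the two subsequences are assumed to have equal length L, and only two table rows are kept) is:

    // Sketch: global pairwise alignment score (Eqn. 4.6) computed by one thread.
    // A and B are two subsequences of equal length L (as produced by kernel 3);
    // prev/curr are two scratch rows of length L+1 in local or shared memory.
    __device__ int global_alignment_score(const char *A, const char *B, int L,
                                          int *prev, int *curr)
    {
        for (int j = 0; j <= L; ++j) prev[j] = 2 * j;           // H(0,j) = j * 2
        for (int i = 1; i <= L; ++i)
        {
            curr[0] = 2 * i;                                     // H(i,0) = i * 2
            for (int j = 1; j <= L; ++j)
            {
                int diag = prev[j-1] + ((A[j-1] == B[i-1]) ? 0 : 1);
                int up   = prev[j]   + 2;
                int left = curr[j-1] + 2;
                curr[j] = min(min(diag, up), left);
            }
            for (int j = 0; j <= L; ++j) prev[j] = curr[j];      // roll the rows
        }
        return prev[L];                                          // bottom-right corner of the table
    }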


Figure 4.9. (a) Time and (b) speedup for kernel 4, for different problem sizes and different processors

The fifth kernel performs multiple sequence alignment on each of the sets of alignments generated by the third kernel, using the simple, approximate center star algorithm [47]. The center sequence is chosen based on the maximum score obtained in the fourth kernel. Each block performs a single multiple sequence alignment for one set of alignments generated by the third kernel. Timings and speedups for this kernel are shown in figures 4.10(a) and 4.10(b) for different problem sizes. This kernel does not map well to the GPU but is very fast nonetheless, as it involves 10,000 independent tasks. Kernel 5 is also plotted on a log scale of the problem size.


Figure 4.10. (a) Time and (b) speedup for kernel 5, for different problem sizes and different processors

Figures 4.6(b) and 4.8(b) show that the performance of the GPU tends to be strongly problem size dependent. Like a vector supercomputer from the 1990s, the startup costs for processing a vector are high. On a GPU the highest efficiencies are found when computing 10^5 or more elements at a time. Because the test sequence length is 1024, a problem size of 10^9 translates into an entire row of 10^6 database elements being processed during each pass of the algorithm. These figures also show that the relative performance of the GPU cards on computationally intensive problems is proportional to their hardware capabilities (number of cores and memory bandwidth), irrespective of whether the algorithm works efficiently on the GPU. It is probably the memory bandwidth that dictates the performance for the memory bound Smith-Waterman problem, but this is difficult to prove since the GPU architectures tend to scale up the number of cores and the bandwidth simultaneously. Of course, the constant of proportionality is a strong function of the algorithm structure and the problem size.

4.5.2 Strong Scaling on Many GPUs When the problem size is held constant and the number of processors is varied, we obtain strong scaling results. Figures 4.11(a) and 4.11(b) show the timing and speedup obtained when a 16M long database is searched with a 128 long query sequence on Lincoln. Ideally, the speedup in this case should be linear with the number of GPUs. In practice, as the number of GPUs increases, the problem size per GPU decreases, and small problem sizes are less efficient on the GPU. When 120 GPUs are working on this problem, each GPU is only constructing a sub-table of roughly size 10^7. From figure 4.6(b) it is clear that this size is an order of magnitude too small to be reasonably efficient on the GPU.


Figure 4.11. Strong scaling timings with a 16M database and a 128-element test sequence: (a) time, (b) speedup versus one core of a 3.2 GHz AMD quad-core Phenom II X4 CPU

4.5.3 Weak Scaling on Many GPUs Weak scaling allows the problem size to grow proportionally with the number

of GPUs so that the work per GPU remains constant. The GCUPS for this weak scaled problem are shown in figure 4.12(a), when 2M elements per GPU are used for

the database size, and a 128-element query sequence is used. Since the amount of work being performed increases proportionally with the number of GPUs used, the GCUPS should be linear as the number of GPUs is increased. However, the time increases slightly because of the increased communication burden between GPUs as their number is increased. The total time varies from 270 ms to 350 ms, and it increases fairly slowly with the number of GPUs used. Figure 4.12(b) shows speedups for kernel 1 on the Lincoln supercomputer compared to a single core of an AMD quad-core CPU. Compared to a single CPU core, the speedup is almost 56× for a single GPU, and 5335× for 120 GPUs (44× faster per GPU).


Figure 4.12. Weak scaling GCUPS (a) and speedups (b) for Kernel 1 using various numbers of GPUs on Lincoln with 2M elements per GPU for the database size, and a 128-element test sequence

The scan is the most expensive part of the three steps. This scan was adapted from an SDK distribution that accounts for bank conflicts and is probably performing at close to the peak possible efficiency. If a single Smith-Waterman problem is being solved on the GPU, then the problem is inherently coupled. The row-parallel algorithm isolates the dependencies in the Smith-Waterman problem as much as is possible, and solves for those dependencies as efficiently as possible using the parallel scan. The maximum memory bandwidth that can be achieved on Lincoln is close to 85 GB/s for a single GPU. Based on our algorithm (7 four-byte memory accesses per cell), the maximum theoretical rate is close to 3.0 GCUPS (85/4/7) for a single GPU. Since this code obtains 1.04 GCUPS per GPU, the algorithm is reasonably efficient (34% of the peak memory throughput). Higher GCUPS rates for a single GPU would be possible by using problem dependent optimizations such as data packing or hash tables.

The row parallel algorithm doesn’t have any limitation on the query or database sequence sizes. For example, a database length of 4M per GPU was compared with a query sequence of 1M. This query took 4229 seconds on Lincoln and 2748 seconds on the GTX 480. This works out to 1.03 and 1.6 GCUPS per GPU for the Tesla S1070

and GTX 480 GPUs respectively. This is very similar to the performance obtained for smaller test sequences, because the algorithm works row by row and is therefore independent of the number of rows (which equals the query or test sequence size).

CHAPTER 5

HIGH PERFORMANCE COMPUTING ON THE 64-CORE TILERA PROCESSOR

5.1 Introduction Processors with a large number of on chip cores are now becoming commonly commercially available. Tilera processors have been available since late 2007. They currently come with 64 cores, and 100 cores are announced for late 2011. Intel released a 48-core prototype to University researchers in April 2011, and AMD has been selling a 12-core Opteron since that time. One important aspect of these many-core designs is their low power consumption per core. The Intel chip reportedly draws no more than 125 W for all 48 cores. The Tilera-64 draws only about 50 W for 64 cores. Since power consumption, and the resultant cooling costs, can dominate supercomputer costs it is some interest to evaluate the potential of these many core chips for typical supercomputing applications. Our evaluation of these many-core architectures will focus on scalability and when applied to scientific computing applications.

The Tilera and Intel marketing plans are focused on commercial server farms, not supercomputers for scientific computing. However, high performance computing (HPC) is one potential application for these processors. GPUs provide similar computing benefits and are also now being incorporated into recent supercomputer designs. This project will evaluate the potential performance attributes of many-core processors on a few select and hopefully reasonably representative scientific computing benchmarks.

In 2007, Tilera launched its first many-core architecture with 64 cores on a single chip. The main goals of the Tilera architecture are to provide high performance cores that communicate via the cache-coherent iMesh interconnect network and low power hardware [65]. Figure 5.1 shows the Tile processor hardware architecture with detail of an individual tile's structure.

Figure 5.1. Tile processor hardware architecture with detail of an individual tile’s structure (Figure from Tilera data sheet [10])

The iMesh interconnect architecture provides high bandwidth and low latency communication between tiles. Each tile is a powerful, full-featured computing system that can independently run an entire symmetric multi-processing operating system. Each tile operates at 866 MHz and implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with a 64-bit instruction bundle. Each tile has a 16 KB L1 instruction cache, an 8 KB L1 data cache and a unified 64 KB L2 cache. There is also a 4 MB L3 cache distributed across the tiles. There are four on-chip memory controllers, each supporting 64-bit DDR2 DRAM (with optional ECC protection) and operating at 6.4 GB/s, for a peak memory bandwidth

of 25.6 GB/s. Tilera also has two full-duplex XAUI-based 10 Gb Ethernet ports, two 4-lane PCI Express ports and two onboard 10/100/1000 Ethernet MACs with RGMII (Reduced Gigabit Media Independent Interface) interfaces [10]. There are a number of previous benchmark implementations on the Tilera [66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]. Bornstein et al. [72] implemented image analysis onboard a Mars rover. However, Tilera processors do not support floating point in hardware, so all floating point operations are emulated in software, which reduces the Tilera's performance. They reported

that a conversion from floating point input data to integer increased the speed by a factor of 5. When using many cores they achieved 8 and 9.7 times the speed of a single tile (single core) when using 16 and 32 tiles respectively. Ha et al. [73] studied the scalability problem for dynamic analysis on the TILE64 processor. They obtained a benefit from using the fast inter-core communication between tiles for dynamic analysis. Berezecki et al. [74] implemented the key-value store Memcached on the TILEPro64 and compared results with a 4-core Intel Xeon L5520 and an 8-core AMD Opteron 6128 HE. They achieved a 5 times speedup compared to a single core of those CPUs and almost a 2 times speedup compared to all the cores of the CPUs. They also reported that the TILEPro had 3 to 4 times better performance/Watt compared to the CPUs. Yan et al. [75] implemented an accelerated deblocking filter for H.264/AVC.

They achieved overall decoding speedups of 1.5 and 2 times for HD and SD videos respectively. Richardson et al. [76] implemented some space applications on the TILEPro64 processor. They achieved a 23 times speedup compared to a single tile when 32 tiles

were used for the sum of the absolute difference. They also reported linear speedup up to 8 tiles.

Ulmer et al. [77] applied a text document similarity benchmark based on the Term Frequency Inverse Document Frequency metric on the Tilera and an FPGA and compared results with sequential CPU runs. They achieved a 4 times speedup for the Tilera when compared to a single core of a 2.2 GHz x86 processor.

In this chapter we are interested in HPC benchmarks related to scientific computations and in the scalability, performance, and power consumption of the Tilera processor. The benchmark cases are: GUPS, a large unstructured sparse matrix multiply, the Smith-Waterman algorithm (SSCA#1 kernels 1-2) and the 3D FFT.

5.2 Giga Updates Per Second (GUPS) The GUPS benchmark [78] was ported to the Tilera. The main goal of this benchmark is to test the random memory access speed of the hardware. The CPU and GPU versions of the GUPS benchmark were also used in order to compare the Tilera results with other known architectures. The computations were iterated 10 times in order to get accurate timings. The results presented herein are based on the average of the 10 iterations.
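A minimal sketch of the kind of random read-modify-write loop this benchmark times is shown below (the variable names and the simple pseudo-random recurrence are ours and stand in for the actual generator specified by the GUPS benchmark [78]):

#include <stdint.h>
#include <stddef.h>

// One pass of GUPS-style random updates over a table whose size is a power of two.
void gups_updates(uint64_t *table, size_t table_size, size_t n_updates)
{
    uint64_t ran = 1;
    for (size_t i = 0; i < n_updates; ++i) {
        // simple 64-bit pseudo-random recurrence (a stand-in for the benchmark's stream)
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        // random read-modify-write; this access pattern defeats the caches, so the
        // loop measures random memory access speed rather than arithmetic speed
        table[ran & (table_size - 1)] ^= ran;
    }
}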

5.2.1 Strong Scaling Figure 5.2(a) and table 5.1 show the GUPS (giga-updates per second) for the strong scaling case (a constant problem size with varying numbers of tiles). Figure 5.2(b) shows the relative speedup of the Tilera compared to one core of an AMD quad-core Phenom II X4 (red line) and a Tesla C2070 GPU (blue line). This figure shows that the speedup is linear with the number of cores only up to 4 cores (because there are 4 memory channels on the Tilera). With larger numbers of cores, the speedup is below the linear ideal performance. With 48 cores active (out of a maximum of 63) the Tilera is roughly 2 times faster than one core of the CPU and 8 times slower than the GPU (the GPU performance is divided by 10 to more easily place it on the graph).


Figure 5.2. (a) GUPS vs. number of tiles (b) Speedup for the GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line). This case uses a 2^25 64-bit integer table size (256 MB) with 2^27 updates

Because the Tilera code is parallel, there is a possibility of a read-before-write conflict. The Tilera results were compared with the CPU results (which do not have this issue). The error rate was negligible, and a low level of error is acceptable under the GUPS benchmark specification.

Table 5.1. GUPS for the strong scaling case with 2^25 64-bit integer unknowns (256 MB)

5.2.2 Weak Scaling Figure 5.3 shows the GUPS and GUPS/Tile for weak scaling (the problem size per tile is kept constant at 1M). As the number of tiles increases, the GUPS/Tile decreases. This occurs because the tiles must share the limited memory bandwidth. This contention is also what causes the deviation from the ideal linear speedup that is shown in figure 5.3(a).


Figure 5.3. GUPS for weak scaling for the Tilera Pro64 (a) GUPS vs. number of tiles (b) GUPS/Tile vs. number of tiles

Table 5.2 shows the results for weak scaling for the CPU, GPU and Tilera. The CPU results are for a single core only. The Tesla C2070 GPU is used to run the GUPS benchmark on the GPU. The CPU and GPU results are for the corresponding problem size (the number of tiles is irrelevant for those processors). The results for the weak scaling are essentially the same as the strong scaling results.

Table 5.2. Weak scaling for the GUPS benchmark. The problem size is 2^20/Tile. The largest problem size uses 32 tiles and the smallest size uses 1 tile.

5.2.3 Performance Variation Figure 5.4 shows the performance of 32 cores of the Tilera (63 cores are almost the same speed) compared to one core of a CPU and compared with a GPU as the problem size varies. Note that all the processors lose efficiency as the problem size increases. This occurs because all the processors get some small advantage from the L1 and L2 caches at small problem sizes.


Figure 5.4. (a) GUPS vs. problem size (b) Speedup for GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line) using 32 tiles on the Tilera Pro64

5.2.4 Power Consumption Table 5.3 and figure 5.5 show the power consumption results for the GUPS benchmark. The Tilera Pro64 uses 9 times less energy than a single core of the AMD quad-core Phenom II X4. Compared to the GPU, however, the energy consumption is roughly equal. The energy efficiency is given by

\[
\mathrm{Energy\ Efficiency} = \frac{\mathrm{Power}_{(\mathrm{CPU\ or\ GPU})} \times \mathrm{Time}_{(\mathrm{CPU\ or\ GPU})}}{\mathrm{Power}_{(\mathrm{Tilera})} \times \mathrm{Time}_{(\mathrm{Tilera})}} \tag{5.1}
\]

Using less energy to complete a task is better: an efficiency value less than one implies the CPU or GPU is more energy efficient than the Tilera, and a value greater than one implies the Tilera is more energy efficient.

Table 5.3. Strong scaling results for the GUPS benchmark for the Tilera Pro64, comparing the power consumption to a single core of an AMD quad-core Phenom II X4 and a Tesla C2070 GPU


Figure 5.5. Power efficiency for GUPS benchmark compared to a single core of an AMD quad-core Phenom II X4 (red line) and compared to a Tesla C2070 GPU (blue line).

5.3 Sparse Vector-Matrix Multiplication Sparse vector-matrix multiplication is a common random memory operation found in many high performance computing codes. The equation below shows the core formula for the 7-point-stencil sparse vector-matrix multiplication that represents the classic discrete 3D Laplacian operation.

\[
D[i][j][k] = B[i][j][k] \times \Big[ (A[i{+}1][j][k] + A[i{-}1][j][k]) \times dxic[i] \times dxiv[i]
+ (A[i][j{+}1][k] + A[i][j{-}1][k]) \times dyic[j] \times dyiv[j]
+ (A[i][j][k{+}1] + A[i][j][k{-}1]) \times dzic[k] \times dziv[k] \Big]
+ A[i][j][k] \times C[i][j][k] \tag{5.2}
\]
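Written out as code, equation 5.2 is the following triple loop (a serial sketch over interior cells only; the array names follow the equation, and the grid-metric arrays dxic, dxiv, etc. are assumed to be precomputed):

// Serial sketch of the 7-point-stencil sparse vector-matrix product of Eq. (5.2).
// Boundary cells are omitted for clarity.
for (int i = 1; i < nx - 1; ++i)
    for (int j = 1; j < ny - 1; ++j)
        for (int k = 1; k < nz - 1; ++k) {
            double lap =
                  (A[i+1][j][k] + A[i-1][j][k]) * dxic[i] * dxiv[i]
                + (A[i][j+1][k] + A[i][j-1][k]) * dyic[j] * dyiv[j]
                + (A[i][j][k+1] + A[i][j][k-1]) * dzic[k] * dziv[k];
            D[i][j][k] = B[i][j][k] * lap + A[i][j][k] * C[i][j][k];
        }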

Two different algorithmic approaches on the Tilera were tested: plane marching and block marching. In plane marching, every tile is responsible for an XY plane of the 3D domain. In the block marching approach, the 3D domain is divided into hexahedral subdomains of size 8 × 8 × NZ, and every tile is responsible for computing one subdomain. This is similar to the GPU layout (which uses a 16 × 16 × NZ subdomain for each GPU multiprocessor). Plane and block marching have different advantages. In the plane marching approach the hardware can load the XY plane into its own cache (for small problem sizes). In the block marching approach, the hardware can load the z − 1 and z data items into its cache. We tested different block sizes, and 8 × 8 is the most efficient block size for this approach on the Tilera Pro64 card.

5.3.1 Performance Results Figure 5.6 shows the MCUPS (millions of cell updates per second) and speedup for 128 × 128 × 128 and 256 × 256 × 256 total problem sizes.


Figure 5.6. (a) MCUPS for 1283 and 2563 problem sizes using a Tilera Pro64 (with different numbers of tiles), a single core of an AMD quad-core Phenom II X4, and a Tesla C2070 GPU. (b) Speedup of the Tilera versus one core of the AMD CPU

In order to show the GPU result on the same graph, the MCUPS for the GPU is divided by 100. MCUPS are millions of cell updates (or results, D[i][j][k]) computed per second. Each result formally requires reading 7 data values (A) and two constants (B and C) and performing a number of additions and multiplications.

For both algorithmic approaches on the Tilera the performance is very similar. For small sizes, plane marching has slightly better performance than block marching. On the other hand, block marching has better performance than plane marching for large problems. A speedup of around 3× is obtained for sparse vector-matrix multiplication on very large matrices (256^3) when the whole Tilera is compared to 1 core of the AMD CPU. Tables 5.4 and 5.5 show the results for the CPU and Tilera for the 128 × 128 × 128 and 256 × 256 × 256 problem sizes respectively.

Table 5.4. MCUPS and speedup for the 128 × 128 × 128 problem size for the Tilera Pro64 compared to a single core of an AMD quad-core Phenom II X4

Table 5.5. MCUPS and speedup for the 256 × 256 × 256 problem size for the Tilera Pro64 compared to a single core of an AMD quad-core Phenom II X4

5.3.2 Power Consumption

Table 5.6 shows the power consumption results for the sparse vector-matrix multiplication benchmark. The Tilera Pro64 uses up to 21 times less power than a single core of the AMD quad-core Phenom II X4. However, it is perhaps better to look at the energy efficiency of the Tilera, which compares the total energy used by the CPU or the GPU to the energy used by the Tilera.

\[
\mathrm{Energy\ Efficiency} = \frac{\mathrm{Power}_{(\mathrm{CPU\ or\ GPU})} \times \mathrm{Time}_{(\mathrm{CPU\ or\ GPU})}}{\mathrm{Power}_{(\mathrm{Tilera})} \times \mathrm{Time}_{(\mathrm{Tilera})}} \tag{5.3}
\]

Table 5.6 shows that the Tilera is 16× more energy efficient than the CPU and 5× less efficient than the GPU. However, in practice, the Tilera requires a host (which draws 200-300 W of base power). If we were to account for the base power and use 4 cores (rather than 1 core) of the CPU, the Tilera would have the same efficiency as the CPU.

Table 5.6. Power consumption for sparse vector-matrix multiplication (256 × 256 × 256) for the Tilera Pro64 compared to a single core of an AMD quad-core Phenom II X4 and a Tesla C2070

Figure 5.7 shows the speedup and energy efficiency of the Tilera Pro64 compared to a CPU and a GPU. The results show that up to 8 tiles the energy efficiency increases; beyond that, increasing the number of tiles leaves the energy efficiency the same or decreasing. This shows that the power consumption on the Tilera is directly proportional to the number of memory accesses being performed.


Figure 5.7. Speedup and energy efficiency for the 2563 problem size for Tilera Pro64 compared to the single core of an AMD quad-core Phenom II X4 and compared to a Tesla C2070 GPU.

5.4 Smith-Waterman Algorithm 5.4.1 Anti-Diagonal Algorithm Two different approaches to solving the Smith-Waterman algorithm were developed for the Tilera Pro64. The first one is the anti-diagonal approach. Because every entry on a Smith-Waterman table anti-diagonal is independent, it is possible to do the calculation in parallel. Table 5.7 shows the results for this approach. The same approach was used on the CPU to get a fair comparison.

Table 5.7. Results for the anti-diagonal Smith-Waterman algorithm with 59 tiles compared to a single core of an AMD quad-core Phenom II X4

With this approach the MCUPS (millions of cell updates per second) is less than or equal to that of a single core of the AMD quad-core Phenom II X4. The problem with this approach is that, when marching along the diagonal, the memory accesses miss the cache and the Smith-Waterman algorithm is then limited by the random memory access speed.

5.4.2 Row Approach In the second approach (the row approach), the database is divided between the tiles and each subsection of the Smith-Waterman table is calculated in parallel [49, 79]. In this approach some table overlap is necessary to overcome the dependencies between the subsections of the table. For large Smith-Waterman problems the overlap length is small compared to the length of each subsection and is negligible.
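A sketch of the corresponding index arithmetic is shown below (the names and the exact overlap policy are ours; the real implementation follows [49, 79]):

// Divide a database of length db_len among n_tiles tiles for the row approach.
// Each tile (except the first) also processes 'overlap' extra columns on its left,
// so the dependencies crossing the subsection boundary are recomputed locally
// rather than communicated.
void tile_range(int tile, int n_tiles, long db_len, long overlap,
                long *start, long *end)
{
    long chunk = (db_len + n_tiles - 1) / n_tiles;   // columns owned by each tile
    *start = tile * chunk - (tile > 0 ? overlap : 0);
    *end   = (tile + 1) * chunk;
    if (*end > db_len) *end = db_len;
}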

5.4.2.1 Strong Scaling Figures 5.8 and 5.9 show the results for strong scaling (the problem size is constant) for kernel 1 and kernel 2 of the SSCA#1 benchmark respectively. Kernel 2 is very fast, so its performance is not as important.


Figure 5.8. Strong scaling for kernel 1: (a) MCUPS and (b) speedup for the row-access Smith-Waterman algorithm on the Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4. The table size is 5,799,936 × 128.


Figure 5.9. Strong scaling for kernel 2: (a) time (seconds) and (b) speedup for the Smith-Waterman algorithm on the Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4. The table size is 5,799,936 × 128.

Almost a 15× speedup is obtained for the table evaluation (kernel 1) compared to a single core of an AMD quad-core Phenom II X4. Kernel 2 is very fast (milliseconds) and independent of the problem size. Table 5.8 (kernel 1) and Table 5.9 (kernel 2) show the MCUPS and speedup for the strong scaling case.

Table 5.8. Strong scaling results for the row-access Smith-Waterman algorithm for kernel 1 with a single core of an AMD quad-core Phenom II X4 and the Tilera

Table 5.9. Strong scaling results for the row-access Smith-Waterman algorithm for kernel 2 with a single core of an AMD quad-core Phenom II X4 and the Tilera

Note that the MCUPS for the row-access method on the Tilera is almost 100× higher than for the anti-diagonal approach.

5.4.2.2 Weak Scaling Figures 5.10 and 5.11 show the results for weak scaling (the problem size per tile is constant) for kernel 1 and kernel 2 respectively. The Tilera achieves almost a 14× speedup compared to a single core of an AMD quad-core Phenom II X4. Again, because kernel 2 is very fast, it is not worth optimizing this kernel.


Figure 5.10. Weak scaling for kernel 1: (a) MCUPS and (b) speedup for the row-access Smith-Waterman algorithm on the Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4. The table size is 32,768/Tile × 128.


Figure 5.11. Weak scaling for kernel 2: (a) time (s) and (b) speedup for the row-access Smith-Waterman algorithm on the Tilera Pro64 compared with a single core of an AMD quad-core Phenom II X4. The table size is 32,768/Tile × 128.

Tables 5.10 and 5.11 show the MCUPS and speedup for weak scaling with the row approach for the Smith-Waterman algorithm. This case uses 32,768 table entries per tile.

Table 5.10. Weak scaling results for the row-approach Smith-Waterman algorithm for kernel 1 with a single core of an AMD quad-core Phenom II X4 and the Tilera

Table 5.11. Weak scaling results for the row-approach Smith-Waterman algorithm for kernel 2 with a single core of an AMD quad-core Phenom II X4 and the Tilera

5.4.3 Power Consumption

Table 5.12 shows the power and energy consumption results for the first kernel of the sequence matching benchmark (SSCA#1). The Tilera Pro64 uses 30 to 50 times less energy than a single core of an AMD quad-core Phenom II X4. It was determined that the problem size does not have any effect on the power consumption.

Table 5.12. Results for the first kernel of SSCA#1 for the Tilera Pro64 compared to a single core of an AMD quad-core Phenom II X4

Figure 5.12 shows the power and energy efficiency of the Tilera Pro64 compared to a single core of an AMD quad-core Phenom II X4. The energy efficiency is given by,

\[
\mathrm{Energy\ Efficiency} = \frac{\mathrm{Power}_{(\mathrm{CPU\ or\ GPU})} \times \mathrm{Time}_{(\mathrm{CPU\ or\ GPU})}}{\mathrm{Power}_{(\mathrm{Tilera})} \times \mathrm{Time}_{(\mathrm{Tilera})}} \tag{5.4}
\]


Figure 5.12. Smith-Waterman benchmark (a) power (W) and (b) energy efficiency compared to a single core of an AMD quad-core Phenom II X4

Note that after 8 tiles the energy consumption is constant, so the power draw is directly proportional to the work being performed. It should also be noted that these energy efficiency calculations do not include the base power. In practice the Tilera and the AMD quad core require a host that draws 200-300 W when idle. This base power (of the host) will dominate the energy consumption. For this reason, the additional power the Tilera or the CPU draws when operating is essentially negligible in practice.

5.5 Fast Fourier Transform (FFT) 5.5.1 1D FFT Because the CPU and Tilera architectures are similar in many respects, the first step was to implement different 1-D FFT algorithms on the CPU in order to find algorithms that suit the architecture. In general, all FFT algorithms need a bit reversal somewhere in the algorithm. It is possible to do the bit reversal first or at the end of the FFT calculation.
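A sketch of the bit-reversal reordering referred to here is given below (shown for a transform of length n = 2^log2n, applied in place before the butterfly stages; the use of separate real and imaginary arrays is an assumption of ours):

// Reorder re[]/im[] (length n = 2^log2n) into bit-reversed index order, in place.
void bit_reverse(double *re, double *im, int n, int log2n)
{
    for (int i = 0; i < n; ++i) {
        int rev = 0;
        for (int b = 0; b < log2n; ++b)              // reverse the log2n index bits
            rev |= ((i >> b) & 1) << (log2n - 1 - b);
        if (rev > i) {                               // swap each pair exactly once
            double tr = re[i]; re[i] = re[rev]; re[rev] = tr;
            double ti = im[i]; im[i] = im[rev]; im[rev] = ti;
        }
    }
}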

Figure 5.13 shows the timing in milliseconds for the different methods that were implemented in C++ and run on a single core of an AMD quad-core Phenom II X4. The main difference between bit reversed 1 and 2 is how the indexes and twiddle factors are calculated. As expected, Radix-4 is always faster than Radix-2, but in order to use radix-4 the FFT length must be a power of 4.


Figure 5.13. Timing for single and double precision 1-D FFTs (recursive, radix-2, radix-4, and the two bit-reversed variants) running on a single core of the AMD quad-core Phenom II X4.

Table 5.13 highlights (in blue) the fastest algorithms. It was found that up to a size of 262144 (2^18), bit reversed 2 has the best performance. 3D FFTs require many small-length FFTs (of roughly size 256), so performance at this size is of interest. Radix-4 has the best performance for very large sizes. If one is interested in running large FFTs on a single core, Radix-4 is suitable. But our goal was to run large FFTs on many cores (the Tilera), so bit reversed 2 was selected for the Tilera. It was determined that for parallel approaches it is better to do the bit reversal first and then calculate the FFT, because in this case each tile or core can put the data in its own cache and then do the rest of the calculation in the fast cache memory.

Table 5.13. Timing for single and double precision for 1-D FFT running on a single core of the AMD quad-core Phenom II X4.

5.5.2 3D FFT

The 1D FFT algorithm was extended to the 3D FFT. The data layout in memory is important to 3D FFTs. Usually the layout means that one FFT direction is efficient and the others are not (because those directions have a long stride through memory). In our layout, the FFT in the X-direction is more efficient than those in the Y- and Z-directions. The common solution to this problem is changing the data layout in memory before the FFT in the next direction. This means that the data is transposed in such a way that the memory layout is linear (and stride 1) for the new FFT direction. In order to obtain properly ordered results, two matrix transposes are needed (one forward, and one in reverse). But there is a smart way to reduce or even hide the first transpose by doing the transpose and the bit reversal at the same time. Table 5.14 shows the time and percentage of each FFT direction for 256^3 with 59 tiles.

Table 5.14. Results for the 3D FFT with a single core of an AMD quad-core Phenom II X4 and the Tilera Pro64

The matrix transpose was only applied in the Z-direction for both the CPU and the Tilera. It is also possible to do a matrix transpose before the Y-direction FFT, but this saves only 5-10%. Based on this approach the Tilera is almost 3× faster than a single core of the AMD quad-core Phenom II X4.

Figure 5.14 shows the MCUPS (millions of cell updates per second) and speedup for a 256^3 3D FFT with different numbers of tiles compared with one core of the AMD Phenom II for the two different approaches. Results are presented based on the average of 10 iterations. For the CPU timings we used an in-house 3D FFT code, so that the algorithm is identical. The tiles speed up nearly linearly up to 4 tiles; after that the gain from additional tiles decreases, and becomes very small with more than 32 tiles.


Figure 5.14. (a) MCUPS and (b) speedup for the 3D FFT with 256^3 for different numbers of tiles

Figure 5.15 shows the 3D FFT with 32 tiles for different problem sizes. With a 64^3 FFT the performance of the Tilera is almost 2.5 times that of a single core of the CPU. The results with the matrix transpose also show the effect of the cache on the Tilera performance: with the matrix transpose, beyond 64^3, the code shows constant speedup.


Figure 5.15. (a) MCUPS and (b) speedup for 3D FFT with 32 tiles for different problem sizes with and without transpose

5.5.3 Power Consumption Table 5.15 shows the power consumption and energy efficiency for the 3D FFT benchmark. The Tilera Pro64 uses up to 8 times less power than a single core of an AMD quad-core Phenom II X4. We should mention here that we used just a single core of the CPU; the cost of using additional cores is far less (12 W per additional core). In addition, the base power (200-300 W) to host the processors far exceeds the incremental power costs associated with actually running a computation on the processors.

Table 5.15. 3D FFT results for the Tilera Pro64 for 256 × 256 × 256 compared to a single core of an AMD quad-core Phenom II X4

5.6 Conclusions We have implemented four important high performance computing benchmarks on the Tilera Pro64. Like every multi- and many-core architecture, the Tilera has some advantages and disadvantages. We have divided these features into seven different design metrics: performance, bandwidth, price, software, hardware, power and scalability. Based purely on speed, the Tilera is roughly competitive with a CPU and significantly slower than a GPU. The relatively low speed is primarily due to the design of the memory subsystem.

Memory bandwidth is the most critical hardware performance measure for all the benchmarks tested in this work. The Tilera Pro64 has less bandwidth to main memory than a typical CPU or GPU. For example, the Tilera Pro64 has a maximum of 14 GB/s, while modern CPUs are around 40 GB/s and GPUs are roughly 180 GB/s. Newer GPUs have an L1 and an L2 cache (like the CPU and Tilera). The newest Tilera (100-core, not yet released) uses DDR3 instead of DDR2, so its maximum theoretical bandwidth is close to that of a modern CPU (around 65 GB/s). GPUs currently use GDDR5 memory. The Tilera has essentially 4 memory channels to main memory. After 8 tiles start accessing memory, the performance gain from more tiles is relatively small. It has a fast inter-tile communication network, but we found it very difficult to use this

hardware characteristic to any substantial advantage for the 4 benchmarks presented

here.

One can buy the latest GPU for around $500-$3000, an Intel CPU (10-core/20-thread E7 Xeon series) for around $4000, and an AMD 12-core for around $1600. The new Tilera with 100 tiles (cores) costs around $11000. It seems that Tilera hardware is expensive compared to CPUs and GPUs. This is probably due to the economies of scale and the high demand for CPU and GPU products outside the area of high performance computing.

The CPU and GPU (CUDA and OpenCL) programming languages are well known and more up to date than the Tilera's Multicore Development Environment (MDE). However, we should mention that porting codes that are already written in C (not C++) to the Tilera is very straightforward; one just needs to modify the algorithm in order to take advantage of the Tilera architecture. The Tilera appears to be compatible with only a few system configurations. For example, the available MDE compilers are compatible with only three specific versions of Linux and three hardware systems (CPUs and motherboards).

The Tilera uses roughly 75% less power (above the base/host power): roughly 24 W extra for the whole TilePro64, whereas a GPU takes roughly 115 W and a 4-core CPU takes about 110 W for all four cores. More power means more heat and more cooling, so saving 1 W in power also means saving another watt in cooling costs.

Note, however, that the Tilera requires a host computer, and a typical host takes 200-300 W to run, so the power savings of the Tilera are largely offset by the power cost of maintaining the Tilera's host. In addition, the fast speed of the GPU means that it typically consumes less total energy for a task than the Tilera even though its power consumption is much higher. Only one Tilera card was tested in this work. However, the Tilera has a number of interconnect options on the card itself, so unlike a GPU it may be possible to bypass the CPU host and directly connect the Tilera cards together. This might alleviate the MPI bottleneck currently experienced when using many GPUs in a cluster environment.

CHAPTER 6

CFD AND GPUS

6.1 Introduction The in-house CFD code (Stag3D) is used to simulate the incompressible Navier-Stokes equations. This code was originally written in FORTRAN 95. A second-order accurate staggered mesh scheme for the spatial discretization along with a three-step, second-order, convectively stable Runge-Kutta time marching scheme are employed. A classic projection method is used for the pressure solver. Non-uniform spacing in the vertical direction, the ability to set a constant pressure drop across the computational domain in both the streamwise and spanwise directions, and parallelization for use on supercomputer clusters were added by Michael Martell [80]. Stag3D was used for hybrid RANS/LES model development [81, 82] and for direct numerical simulation of turbulent 3D channel flows with superhydrophobic walls [83, 84].

Stag3D is validated for laminar and turbulent flow cases. For the laminar regime, Poiseuille flow and Couette flow were compared with analytical solutions [80]. The 2D Taylor vortex decay problem was run to validate the code's transient solution capabilities. The code was also validated against widely known prior numerical results for turbulent channel flow [85, 86]. In order to run Stag3D on the GPU, it was converted to C++ and the CUDA environment by Ali Khajeh-Saeed and Timothy McGuiness. The C++ version of Stag3D is named Stag++. The code was also made object-oriented and optimized for the CPU by Prof. Blair Perot. In addition, non-uniform meshes in all directions are implemented in Stag++. The code is now fully parallel, using

MPI libraries, and optimized for execution on CPU and GPU based supercomputer clusters. The important optimization techniques applied in Stag++ will be discussed in detail in the next sections. The code is capable of domain decomposition in every direction, and each partition is then handled by an individual CPU core or GPU. Stag++ currently supports only a Cartesian mesh, but it is designed in such a way that adding unstructured mesh support in the future is straightforward.

6.2 Literature Review Graphics Processing Units (GPUs) present an energy and time efficient architecture for High Performance Computing (HPC) in flow simulations [87, 88, 89, 90, 11, 91, 92, 93, 94]. Several researchers have shown that GPU simulations obtain one or two orders of magnitude speedup over optimized CPU implementations [95, 96, 97, 98]. Some cases that have been implemented on the GPU are Lattice Boltzmann methods [99, 100, 101], 2D [102] and 3D [103] Euler solvers, atmospheric dispersion simulations [66], an incompressible Navier-Stokes solver with the Boussinesq approximation [67], a multi-GPU 3D incompressible Navier-Stokes solver with a Pthreads-CUDA implementation [104], and finite elements [105]. Some of the important and useful papers for CFD computations on GPUs are reviewed in detail below. The three dimensional Euler equations were solved on the GPU with OpenGL by Hagen et al. [87]. In their implementation, each pixel (fragment) in an off-screen frame buffer is assigned to a grid cell. The data stream (cell averages, fluxes, etc.) is saved in texture memory and is invoked by rendering the geometry to a frame buffer. They ran two test cases, a 2D shock-bubble interaction and a 3D Rayleigh-Taylor instability, and achieved maximum speedups of 25 and 14 respectively with a GTX 7800 compared with an AMD Athlon X2 4400+.

Harris [88] proposed a flow solver based on Stam's "stable fluids" algorithm. He simulated a periodic volume of fluid on a two-dimensional rectangular domain using a Cartesian power-of-two mesh. Elsen et al. [89] presented a simulation of a hypersonic vehicle configuration on the

GPU using the compressible Euler equations. They used the Navier-Stokes Stanford University Solver (NSSUS). The NSSUS program solves the three-dimensional Unsteady Reynolds-Averaged Navier-Stokes (URANS) equations on multi-block meshes using a vertex-centered solution with first to sixth order finite difference and artificial dissipation operators. A geometric multi-grid scheme with support for irregular coarsening was also used to accelerate the solution of the system. They ported their code to the GPU using Brook. Brook is a source-to-source translator which converts Brook code into C++ code and a high-level shading language like Cg or HLSL. They achieved speedups of over 20 based on a comparison of an Intel Core 2 Duo and an NVIDIA GTX 8800. Corrigan et al. [90] solved the three-dimensional Euler equations for inviscid, compressible flow on an unstructured grid. The computationally intensive part of the

solver consists of a loop in the host (CPU) which repeatedly computes the time derivatives of the conserved variables. For the time-stepping scheme they implemented an explicit Runge-Kutta algorithm. The most expensive computation in their application consists of collecting flux contributions and artificial viscosity across each face when calculating the time derivatives. One thread per element is assigned for the time derivative computation. First, each thread loads the element's volume and conserved variables (from global memory), from which derived quantities such as the velocity, pressure, the speed of sound, and the flux contribution are calculated. After that the GPU kernel loops over each of the four faces of the tetrahedral element, in order to compute fluxes and artificial viscosity. All required derived quantities are calculated on the GPU and then the artificial viscosity and flux are added to the

element's residual. The main issue with this approach is the redundant computation of the flux contributions and other quantities derived from the conserved variables. They also tried another approach in which they precomputed each element's flux contribution in order to avoid redundant computation. However, they found that this approach is slower even though it avoids redundant computation. This is because the computational grid is unstructured, and the global memory access required for loading the conserved variables of neighboring elements is highly non-coalesced, which results in lower effective memory bandwidth. They avoided some non-coalesced accesses by renumbering the grid points: if neighboring elements are near each other in memory, then the likelihood of coalesced memory access is increased. Micikevicius [11] applied 3D finite difference computation using CUDA on the

GPU. He introduced a method that attempts to hide GPU communication time with computation time. In this work, data is partitioned by assigning each GPU half the data set plus (k/2) slices of ghost nodes (see figure 6.1). Each GPU computes its half of the output, receiving the updated ghost nodes from the neighboring GPU.

Figure 6.1. Data distribution between two GPUs [from [11]]

He divided the data along the slowest varying dimension so that contiguous memory regions are guaranteed to be copied during ghost node exchanges. In order to maximize scaling, this approach overlaps the exchange of ghost nodes with kernel execution by defining different CUDA streams. Depending on the GPU architecture, different streams can be executed at the same time. On most GPU architectures it is possible to copy data from the CPU to the GPU (or vice versa) and execute a kernel at the same time. Each time step is executed in two phases. In the first phase, a GPU computes the slices of ghost nodes needed by the neighboring GPU. In the second phase, a GPU executes the compute kernel on its remaining data (the large data set), in the meantime exchanging the ghost nodes with its neighbor using cudaMemcpy and MPI (see figure 6.2). For each CPU process controlling a GPU, the exchange involves three memory copies: GPU to CPU (cudaMemcpy with cudaMemcpyDeviceToHost), CPU to CPU (MPI_Isend and MPI_Irecv), and CPU to GPU (cudaMemcpy with cudaMemcpyHostToDevice). After copying the data and executing the kernel, it is necessary to synchronize the copy procedure before sending or receiving data with MPI. Based on this approach he achieved linear

speedup for running up to four GPUs. Rossinelli et al. [92] described a GPU solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. They used OpenGL

to perform efficient particle-grid operations. For solving the Poisson equation they applied a CUFFT-based (CUDA FFT) solver. They also used regularly spaced grid nodes for the computational domain. They employed an RGB texture to represent a set of particles, where each texture element (texel) contains the state of one particle. The red and green channels of a given texel represent the particle position in these two-dimensional flows. The blue channel saves the

Figure 6.2. Two phases of a time step for a 2-GPU system [from [11]]

transported vorticity [91]. For simplicity they assigned a single texel to each grid node in the computational domain. They achieved a speedup of 25 for their implementation. Antoniou et al. [93] implemented a finite-difference weighted essentially non-oscillatory (WENO) scheme on the programmable GPU with CUDA. On the GPU they assigned a single thread per element in the 2D plane (x and y axes), marching along the z axis. They implemented this 2D domain decomposition strategy in both single-GPU and multi-GPU modes and found that this approach adds additional capability for large scale DNS or LES. Furthermore, they used multiples of 32 points in each block to maximize the throughput with an appropriate expression of parallelism. They anticipated that better handling of the data transfer among GPUs could further increase the processing speed. In their results they did not get linear speedup for multiple GPUs. One possible reason for this less than optimal performance is that the data communication between the CPU and the GPU was not overlapped. Due to the low bandwidth of the PCI-e bus, it is necessary to implement a method that decreases the number of communications or hides the data transfers.

Jacobsen et al. [94] discretized the Navier-Stokes equations on a uniform Cartesian staggered grid with a second-order central difference scheme for the spatial derivatives and a second-order accurate Adams-Bashforth scheme for the time derivatives. The projection method is applied to solve the Navier-Stokes equations for incompressible fluid flows. For the pressure, the Poisson equation is solved using a Jacobi iterative solver. They applied three different data communication approaches: non-blocking MPI with no overlapping of computation, overlapping computation with MPI, and finally overlapping computation with MPI communications and GPU transfers. Their results show that overlapping computation with MPI and overlapping computation with both MPI communications and GPU transfers have the same effect. In some cases, overlapping computation with MPI is even faster than the other approaches when more than 16 GPUs are used. They achieved an 11× speedup compared with an 8-core CPU (using OpenMP)

for a single GPU and 130× speedup for 128 GPUs for the strong scaling case.

6.3 Optimization Techniques 6.3.1 Removing Ghost Cells and Maximizing Coalesced Access

As mentioned in chapter 2, an important optimization for the GPU is coalesced access to global memory. Stag++ is designed in such a way that there are no ghost cells around each sub-domain. The solver loads one XY plane at a time and solves that plane before marching along to the next plane in the z direction. To obtain the maximum bandwidth of the GPU, the number of cells in every plane must be a multiple of 16. For example, if there are 16 threads executing the same kernel, 16 sequential positions in global memory (1 position per thread) can be accessed in the same time that it would take 1 thread to read 1 position in memory (coalesced access). If all memory accesses are performed this way, performance can speed up by a factor of 16 (in the memory access code). In addition, the address of the start point for reading the data from global memory should be a multiple of 64 bytes (16 data items × 4 bytes for single precision) to get maximum coalesced access. If the code were to use ghost cells in every sub-domain, coalesced access would be impossible. Even if one assumes that the sub-domain size is a multiple of 16 (including the ghost cells), there is still a problem: thread divergence. Instead of using 16 threads (a half-warp), 14 threads would be used (16 − 2 ghost cells at the beginning and end of the plane). Because we are marching in the Z direction we keep two extra planes (the planes just before and just after the current one) in order to compute the z-direction fluxes. One major change from Stag3D to Stag++ was removing the ghost cells in the XY planes.
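The index arithmetic this implies can be sketched as a trivial kernel fragment (the names are ours; nxy is the number of cells in one XY plane, padded to a multiple of 16):

// Each thread handles one cell of the current XY plane. Consecutive threads touch
// consecutive global-memory addresses, so every half-warp access is coalesced.
__global__ void plane_op(const float *in, float *out, int nxy)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. nxy-1, stride-1 layout
    if (idx < nxy)
        out[idx] = in[idx];                            // stand-in for the real per-cell work
}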

6.3.2 Conjugate Gradient and Matrix Multiplication

The code solves the pressure Poisson equation using a polynomial preconditioned Conjugate Gradient (CG) iterative method. The conjugate gradient method is an efficient iterative method and is guaranteed to converge for a symmetric, positive-definite matrix. The CG method is composed of one matrix multiply, one preconditioner matrix multiply, 2 scalar (dot) products, and 3 AXPY (Alpha X Plus Y) operations. The three AXPY parts are easily mapped to the GPU architecture. However, the most computationally intensive part of the solution procedure is the matrix multiply which computes the Cartesian-mesh discrete Laplacian. In the current implementation the preconditioner has the same sparsity pattern as the Laplacian matrix and is therefore implemented in exactly the same way as the Laplacian. To compute the Laplacian for a particular cell, all neighboring cells and the central cell are needed (7 cells in 3D). This is more difficult than the simple AXPY to map to the GPU. When performed naively, the 7-point matrix stencil reads each data item 7 times from the main GPU memory. Only 3 of the points are linear, stride-one, and therefore fast; the others are large-stride memory accesses and in terms of speed are essentially random memory operations. The code is made more efficient by using a modified version of the Micikevicius [11] implementation. This involves reading the data once into the shared memory on each GPU multiprocessor, and then accessing it from the fast memory location 7 times. To do this each multiprocessor keeps three XY planes of

data (from the data chunk) in its memory. The middle XY plane contains 5 of the stencil points (in the X and Y directions) saved in shared memory, and the upper and lower XY planes contain the 6th and 7th stencil values (just above and below the middle XY plane) saved in registers. After the discrete Laplacian is computed for the middle plane, the middle (shared memory) and upper (register) planes are copied to the lower (register) and middle (shared memory) planes respectively. The top plane then reads in new data from the main (global) GPU memory into register memory (see figure 6.3).

Figure 6.3. Thread and block distribution for an XY plane
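A simplified version of this plane-marching kernel is sketched below (a uniform-mesh, constant-coefficient Laplacian rather than the full variable-coefficient operator of Stag++, with nx and ny assumed to be multiples of the 16 × 16 block size; the names are ours):

#define BX 16
#define BY 16

// Sketch of the plane-marching 7-point Laplacian. Each thread block marches in z:
// the middle plane is staged in shared memory (with a one-cell halo) and the planes
// just below and just above are kept in per-thread registers.
__global__ void laplacian7(const double *p, double *Ap, int nx, int ny, int nz)
{
    __shared__ double mid[BY + 2][BX + 2];            // middle plane plus halo

    int i  = blockIdx.x * BX + threadIdx.x;           // global x index
    int j  = blockIdx.y * BY + threadIdx.y;           // global y index
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   // position inside the halo tile

    long plane = (long)nx * ny;
    long base  = (long)j * nx + i;

    double below = p[base];                           // plane z = 0 (register)
    double here  = p[base + plane];                   // plane z = 1
    double above;

    for (int k = 1; k < nz - 1; ++k) {
        long c = base + (long)k * plane;              // index of the middle-plane cell
        above  = p[c + plane];                        // read the next plane once

        __syncthreads();                              // previous iteration done with mid[]
        mid[ty][tx] = here;                           // 5 stencil points come from here
        if (threadIdx.x == 0      && i > 0)      mid[ty][0]      = p[c - 1];
        if (threadIdx.x == BX - 1 && i < nx - 1) mid[ty][BX + 1] = p[c + 1];
        if (threadIdx.y == 0      && j > 0)      mid[0][tx]      = p[c - nx];
        if (threadIdx.y == BY - 1 && j < ny - 1) mid[BY + 1][tx] = p[c + nx];
        __syncthreads();

        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)
            Ap[c] = mid[ty][tx - 1] + mid[ty][tx + 1]
                  + mid[ty - 1][tx] + mid[ty + 1][tx]
                  + below + above - 6.0 * here;

        below = here;                                 // shift the three planes upward
        here  = above;
    }
}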

Note, however, that in order to compute a 16 × 16 Laplacian stencil result, an 18 × 18 data input is actually required (minus the 4 corners). This is read in as a 16 × 18 block (with stride 1 fast access), and two 1 × 16 strips for the two sides. These last two strips have a stride equal to the subdomain's size in the x-direction, and therefore are much slower to read. It therefore takes roughly the same amount of time to read the two 1 × 16 strips as it does the rest of the data (18 × 16). This is the first example of where the internal data is processed so efficiently that the unusual operations (2 boundary strips in this case) take just as much time as the far more numerous (but much more efficient) common operations. The other option would be to read 16 × 16 blocks efficiently (no side strips) but only compute a 14 × 14 region of the stencil. Because the SIMD cores process 16 items at a time, this means that the code would have an instruction divergence. This is where some cores do an operation, and some others do something else. On these SIMD cores this results in slower execution (by about a factor of 2). In addition, this approach means the multiprocessors are reading blocks of 16 that overlap at the edges. This makes the 16 × 16 read slower. Therefore, the advantage of reading no boundary strips is actually lost. One way or the other, there is a price to be paid for the fact that the chunks of data being computed on each multiprocessor must use data from a different chunk. We have structured the algorithm so that the data transfer between chunks is an absolute minimum (it is about 1/8th of the internal data on each chunk). Nevertheless, the slower times for boundary data between chunks mean that this smaller amount of data still has a significant impact on the total solution time. The second major optimization in the CG algorithm is to perform the two dot products at the same time as the matrix multiply and the preconditioner matrix multiply (one dot-product for each). Both arrays for the dot-product are already in fast shared memory when performing the matrix multiply, so this saves reading the two

arrays for each dot-product (4 array reads in total). The dot-products are therefore essentially free of any time impact on the code, except that their final result must be summed among all the GPUs. This requires an MPI all-to-all communication that cannot be hidden by any useful computations (but the amount of data communicated is very small, one word per GPU). We recognize that a restructuring of the CG algorithm can be performed in order to overlap the dot-product summations with computation; however, this also leads to a CG algorithm with more storage and more memory reads/writes. So the speed improvement of a modified CG algorithm is not expected to be significant. With 512^3 meshes, naive summation (for turbulence averages) can lead to round-off errors that are on the order of 10^9 times the machine precision (for single precision this would mean order 1 error). Even though we use double precision in all the computations, the summation is still performed in stages to reduce the round-off error. The 3D array is first collapsed into a 2D array using the GPU, by summing along the Z-direction. Further reduction is then performed by reducing in the Y-direction, and then the X-direction on the CPU, and then by summing the results from all the GPUs using MPI all-to-all communication (4 stages in total). This procedure only loses roughly two decimal places of accuracy during the summation, and allows the expensive portion of the computation (the first reduction to XY planes) to be performed on the fast GPUs.

6.3.3 Reduction and Maximum For reductions (dot products) and for finding the maximum value in a vector the same algorithm is used. Each block is responsible for a single XY plane of the domain. The 256 threads in each block compute the sum or maximum of their XY plane using shared memory. After this step every block has reduced its plane to a single number. Based on its block index, every block writes its result to global memory. The output of these kernels is nz numbers (nz is the number of cells in the z direction). Page-locked memory with the mapped memory flag is used for the output of these kernels, so there is no need to explicitly copy the data between the device and the host. After executing the kernel, the final sum or maximum (of the nz values) is calculated by the CPU. It is possible to compute the single summation or maximum value entirely in GPU kernels, but there is a very small amount of work left for the GPU at the final stage and this is not efficient (nz < 256).
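A sketch of this per-plane kernel is shown below for the sum (the maximum version replaces the additions with fmax; the output array is assumed to be mapped page-locked memory so the nz partial results are visible to the host without an explicit copy):

// One block per XY plane: 256 threads cooperatively sum their plane in shared
// memory, and each block writes a single partial result to out[].
__global__ void sum_planes(const double *a, double *out, int plane_size)
{
    __shared__ double s[256];
    const double *plane = a + (long)blockIdx.x * plane_size;

    double t = 0.0;
    for (int i = threadIdx.x; i < plane_size; i += blockDim.x)
        t += plane[i];                       // strided loop over the whole plane
    s[threadIdx.x] = t;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];  // tree reduction in shared memory
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];              // one value per plane; nz values in total
}

It would be launched as sum_planes<<<nz, 256>>>(d_field, h_mapped_out, nx*ny), and the CPU then finishes by summing the nz partial results.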

A second optimization is important when many GPUs (or CPU cores) are operating. Because the summation or maximum value from each node (CPU or GPU) must be sent to all nodes, it is necessary to copy a single data item from the GPU to the CPU and send it with MPI. When the copy amount is small, the initial cost of the copy is dominant. For these two reasons, page-locked memory is used in these two kernels, with the results finalized on the host side.

6.3.4 Multi-GPU Optimization Algorithm The main goal here is to hide the time spent sending and receiving boundary data. The basic structure of a subroutine is therefore:

(A) On the GPU, load the 6 boundary planes of the subdomain data (which resides in the GPU main memory) into 6 smaller (and stride-1) arrays.

(B) Copy the small boundary arrays from step (A) to the CPU

(C) Send the data planes using MPI.

(D) On the GPU, start the internal calculation. This step is the primary action of the subroutine.

(E) Receive the data planes using MPI.

(F) Copy the received data from the CPU back to the GPU.

(G) Apply the boundary data to the calculation.

The flow chart for the Laplace, Gradient, Divergent, Convection and Laplace Inverse is shown in figure 6.4.

Figure 6.4. General flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators

Because all the data is on the GPU, it is necessary to copy data to the CPU in order to send it with MPI. When the mesh is partitioned, partition boundary data needs to be sent via MPI because there are no ghost cells. When sending the boundary data planes, the boundary values must first be copied from the GPU to the CPU. Copying data between the GPU and the CPU (and vice versa) is described here. In order to hide this copying time, it is necessary to use different streams for the extraction/copy kernels and for the interior execution. In Stag++ we used stream number 2 for the extraction and copying code and stream number 1 for executing the interior kernel. Because these are different streams, the GPU can execute these kernels at the same time. The modified structure of a subroutine is therefore:

(A) On the GPU, load the 6 boundary planes of the subdomain data (which resides in the GPU main memory) into 6 smaller (and stride-1) arrays. This requires the GPU and usually cannot be well overlapped with GPU computations.

(B) On the GPU, start the internal calculation. This step is the primary action of the subroutine.

(C) Copy the small boundary arrays from step (A) to the CPU. This can overlap with part (B).

(D) When part (C) is finished, send/receive the data planes using MPI. The CPU handles all MPI operations and is otherwise idle, so this can also overlap with part (B).

(E) Copy the received data from the CPU back to the GPU. Again, this can still overlap with part (B).

(F) When both part (B) and part (E) are finished, apply the boundary data to the calculation.

Alternatively, if we use zero-copy memory types there is no need to explicitly copy data from the GPU to the CPU or vice versa; we just need to synchronize the kernel to make sure the data is on the CPU or the GPU. In order to get high performance, we chose regular pinned memory for the MPI send buffer used to load data from the GPU to the CPU. To receive data we also used regular memory. There is also an alternative way to hide the copying and MPI by using mapped and write-combined memories. In this approach, mapped memory is used for the send buffer and write-combined memory is used for the receive buffers. As mentioned before, it is more efficient to use write-combined memory for receive buffers, because this kind of memory is fast and efficient when the CPU mostly writes to the memory and the GPU reads from it. With these approaches, the algorithms shown in figure 6.5 are:


Figure 6.5. Efficient flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators for (a) regular pinned memory (b) mapped memory; red, green and purple boxes are using stream 2, stream 1 and CPU respectively to execute the box

In the latest GPU architecture (Fermi), it is possible to run up to 16 kernels at the same time, so in theory parts (A) and (B) can be executed at the same time if there are available resources on the GPU. In practice, the code rarely goes faster when doing this. Since pinned memory is used in this algorithm, there is no need to explicitly copy data from the GPU to the CPU, but synchronization is necessary to make sure that the data is available for the CPU to send with MPI (third box in figure 6.5). When the boundaries are available to send, the CPU sends the data (via MPI) and in the meantime the GPU is executing the interior kernel (second box in figure 6.5). Also there is no need to copy data from the CPU to the GPU, because pinned memory with the write-combined flag is used for the MPI receive buffers. Finally, the kernel that

fixes the result of the interior calculation to account for the boundary data is on the same stream as the interior kernel, so there is no need to synchronize the GPU before

launching the fix-boundaries kernel (because it must happen after the interior kernel is finished).
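The host-side pattern of figure 6.5(a) can be sketched as follows (the kernel names, launch configurations and variable names are hypothetical stand-ins for the actual Stag++ code, and error checking is omitted):

#include <mpi.h>
#include <cuda_runtime.h>

// Stand-ins for the real Stag++ kernels (prototypes only; names are hypothetical).
__global__ void extract_boundaries(const double *field, double *sendBuf);
__global__ void interior_kernel(const double *field, double *result);
__global__ void fix_boundaries(double *result, const double *recvBuf);

void overlapped_step(double *d_field, double *d_result,
                     double *d_sendBuf, double *d_recvBuf,
                     double *h_sendBuf, double *h_recvBuf,   // pinned host buffers
                     long boundaryCount, int nbr, MPI_Comm comm,
                     cudaStream_t stream1, cudaStream_t stream2,
                     dim3 gridB, dim3 blockB, dim3 gridI, dim3 blockI)
{
    MPI_Request sreq, rreq;
    size_t boundaryBytes = boundaryCount * sizeof(double);

    // (A) pack the 6 boundary planes into contiguous buffers (stream 2)
    extract_boundaries<<<gridB, blockB, 0, stream2>>>(d_field, d_sendBuf);

    // (B) launch the interior computation (stream 1); this is the bulk of the work
    interior_kernel<<<gridI, blockI, 0, stream1>>>(d_field, d_result);

    // (C) copy the packed boundaries to pinned host memory (stream 2), overlapping (B)
    cudaMemcpyAsync(h_sendBuf, d_sendBuf, boundaryBytes,
                    cudaMemcpyDeviceToHost, stream2);
    cudaStreamSynchronize(stream2);        // the data is now visible to the CPU

    // (D) exchange boundaries with the neighboring rank; also overlaps (B)
    MPI_Isend(h_sendBuf, (int)boundaryCount, MPI_DOUBLE, nbr, 0, comm, &sreq);
    MPI_Irecv(h_recvBuf, (int)boundaryCount, MPI_DOUBLE, nbr, 0, comm, &rreq);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);

    // (E) move the received planes back to the GPU (stream 2)
    cudaMemcpyAsync(d_recvBuf, h_recvBuf, boundaryBytes,
                    cudaMemcpyHostToDevice, stream2);
    cudaStreamSynchronize(stream2);

    // (F) fix the boundary-adjacent cells on stream 1; launching on the same stream
    //     as (B) guarantees it runs after the interior kernel finishes
    fix_boundaries<<<gridB, blockB, 0, stream1>>>(d_result, d_recvBuf);

    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
}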

6.3.5 Avoid cudaThreadSynchronize and unnecessary cudaStreamSynchronize

cudaThreadSynchronize is very expensive: it makes the CPU and GPU stall and wait for all the GPU kernels to complete before continuing execution. cudaThreadSynchronize sometimes takes 200 µs and also prevents the CPU from launching further work on the GPU. cudaStreamSynchronize, like cudaThreadSynchronize, makes the CPU and GPU stall if the specified stream is still running, but if the stream has already finished, cudaStreamSynchronize does not take much time (about 10 µs).

6.3.6 Maximize The GPU Occupancy By passing the "--ptxas-options=-v" flag to the NVIDIA CUDA Compiler (NVCC) and specifying the CUDA compute capability, the number of registers and the amount of static shared memory and local memory used by a specific kernel can be determined before executing the code. From these numbers and the number of threads per block, the CUDA Occupancy Calculator spreadsheet identifies the number of active threads per multiprocessor (the occupancy). If the occupancy is less than 25% and is restricted by the number of registers per thread, it is possible to decrease the number of registers and increase the occupancy. If the occupancy is restricted by the amount of shared memory, constant memory should be used whenever possible. Each kernel in Stag++ is optimized for double precision and for the GT200 and Fermi series. In CUDA version 3.0 and higher, it is possible to control the number of registers for individual kernels by using the __launch_bounds__ qualifier in the declaration of the kernel.
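For example (an illustrative sketch only; the thread and block counts shown are not the actual Stag++ values):

// Compile with: nvcc --ptxas-options=-v -arch=sm_20 ...
// The __launch_bounds__ qualifier asks the compiler to cap register usage so that
// at least 4 blocks of 256 threads can be resident on each multiprocessor.
__global__ void __launch_bounds__(256, 4)
axpy(int n, double alpha, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}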

6.3.7 Minimize Shared Memory and Maximize Constant Memory Usage Grid information such as ∆x, ∆y, ∆z and grid points are copied to constant memory. Because Stag++ is capable of mesh motion at every time step, two copies of ∆x, ∆y and ∆z are kept in the code, one copy in global memory and one copy in constant

memory. Every time mesh motion is needed, data in global memory is updated and then data are copied from global memory to the constant memory (device to device copy). Device to device copy is fast and close to the device bandwidth (177 GB/s for Fermi architecture GPUs) and these copies are running on different streams. So it is

straightforward to hide the copying time with kernel execution and this approach is efficient enough.
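A sketch of this pattern is shown below; the array name, the maximum size and the helper function are illustrative assumptions, not the Stag++ declarations.

#include <cuda_runtime.h>

#define NX_MAX 512                       // illustrative maximum grid size

// Grid spacing kept in constant memory for fast, broadcast-style reads.
__constant__ double DXIC[NX_MAX];

// d_dxic is the global-memory copy that the mesh-motion kernels update.
// The refresh is a device-to-device copy, so it runs near device bandwidth
// and can be placed on its own stream to overlap with other kernels.
void refresh_constant_spacing(const double *d_dxic, int nx, cudaStream_t s)
{
    cudaMemcpyToSymbolAsync(DXIC, d_dxic, nx * sizeof(double), 0,
                            cudaMemcpyDeviceToDevice, s);
}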

6.3.8 Memory Allocation and Deallocation Memory allocation and deallocation are expensive. Also, when there is an allocation call in the code, the GPU is synchronized implicitly and the CPU is stalled. For this reason, temporary space is allocated only once and reused wherever necessary. Similarly, these temporary arrays are all deallocated only once, at the end of the execution.
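A minimal sketch of this allocate-once pattern is shown below (the names and the growth policy are illustrative assumptions, not the Stag++ implementation).

#include <cuda_runtime.h>
#include <stddef.h>

static double *d_scratch = NULL;      // temporary space shared by many operators
static size_t scratch_bytes = 0;

// Allocate once (growing only if a larger request ever appears) instead of
// calling cudaMalloc/cudaFree inside every operator, since each allocation
// implicitly synchronizes the device.
double *get_scratch(size_t bytes)
{
    if (bytes > scratch_bytes) {
        if (d_scratch) cudaFree(d_scratch);
        cudaMalloc((void**)&d_scratch, bytes);
        scratch_bytes = bytes;
    }
    return d_scratch;
}

void release_scratch(void)            // called once at the end of the run
{
    cudaFree(d_scratch);
    d_scratch = NULL;
    scratch_bytes = 0;
}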

6.3.9 Minimize The Data Transfer Between Host and Device Data transfers between the host and device, and vice versa, are slow and expensive (5-10 GB/s). Data transfers are minimized in Stag++; data is copied to the host only when it is needed by other nodes (to be sent with MPI). When data needs to be copied to the host, pinned memory is used, since pinned memory has a fast transfer rate compared to the other types of memory in CUDA.

CHAPTER 7

STAG++ PERFORMANCE RESULTS

7.1 Introduction As mentioned earlier, pressure solvers are expensive in incompressible flow solvers. In Stag++ almost 90% of the time is spent in the pressure solver. The most time consuming part of the pressure solver in the Stag++ code is the Laplace kernel (extract the boundaries (A) + interior calculation (B) + fix the boundaries (F)). Almost 50% of the GPU time is spent in these kernels (for a 128³ mesh). Based on these timing results, these kernels have the most influence on the performance of the solver and should be optimized the most.

7.2 Optimization Techniques Using NVIDIA Parallel Nsight NVIDIA Parallel Nsight [106] software is an integrated development environment

(IDE) for General Purpose GPU (GPGPU) accelerated applications that works with Microsoft Visual Studio. Parallel Nsight is a powerful plug-in that allows programmers to debug and analyze both GPU and CPU applications within Microsoft Visual Studio.

Here again the efficient structures of a Laplace subroutine are shown in figure 7.1 for comparison with actual results from NVIDIA Parallel Nsight.


Figure 7.1. Efficient flow chart for Laplace, Gradient, Divergent, Convection and Laplace Inverse operators for (a) regular pinned memory (b) mapped memory; red, green and purple boxes are using stream 2, stream 1 and CPU respectively to execute the box

Figure 7.2 shows the timeline from the NVIDIA Parallel Nsight software with a Tesla C2070 GPU for the 64³ Laplace kernel with regular and with mapped write-combined memories for the send and receive buffers. Part (D) in figure 7.2(a) is the MPI send

and receive time. This time is completely hidden by the part (B) execution. The copying time (71µs) from the GPU to the CPU (C) is also completely hidden by the internal calculation (B). But the copying time (74µs) for part (E) is done after the internal calculation, and in theory this copy too can be hidden by kernel execution. The

major problem with 64³ is the large (light green) GPU idle time (360µs), which is even larger than the internal calculation time. The major reason for such a large gap between the internal calculation (B) and fixing the boundaries (F) is the second cudaStreamSynchronize (210µs). As mentioned before, cudaThreadSynchronize and

cudaStreamSynchronize are very expensive. Every time the CPU hits a cudaStreamSynchronize command, it stops there and waits for the GPU to finish the assigned

stream. The large gap between the last two cudaStreamSynchronize calls is the time for loading the fix-boundaries code to the GPU and initializing this kernel. Because mapped memory is used here, there is no way to avoid the cudaStreamSynchronize command. The only way to decrease the GPU idle time is to hide both the copying time from the CPU to the GPU and the second cudaStreamSynchronize call with kernel execution.


Figure 7.2. Timeline for Laplace kernel for 64³ with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers

With mapped and write-combined memories for the send and receive buffers, as shown in figure 7.2(b), parts (A) and (C), and parts (D) and (E), are each combined into a single transaction. When mapped and write-combined memories are used for the send and receive buffers, the time for (A) plus (C) is smaller than when regular pinned memory is used for these parts. Also, part (F) is faster with regular memory than with write-combined memory; reading from regular pinned memory is clearly faster than from write-combined memory. Figure 7.2 also shows that part (B) with regular pinned memory is faster than in the mapped memory case. The main reason for this difference is that with mapped memory, parts (A), (C) and (B) run concurrently on the GPU, so some of the resources (multiprocessors) are used to execute parts (A) and (C). We should also mention

that when write-combined memory is used for the receive buffers, there is no need for a cudaStreamSynchronize after the MPI, because the CPU runs synchronously and the GPU idle time is much smaller than with regular memory. The total run time for 64³ when pinned memory is used is 704µs and for the

mapped memory it is 467µs. In this case mapped memory is almost 50% faster than pinned memory. The main reason for this difference, as mentioned before, is the second cudaStreamSynchronize that stops the GPU while part (F) is loaded to the GPU. Figure 7.3 shows the timeline for the 128³ Laplace kernel with regular and

mapped memories for send and receive buffers.


Figure 7.3. Timeline for Laplace kernel for 128³ with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers

In this case, the MPI and copying times are 4 times larger and the interior calculation time is 8 times larger than for 64³. When communicating the data between nodes, six boundary planes are sent and received ((128 × 128)/(64 × 64) = 4), but the number of interior nodes in 128³ is 8 times larger than in 64³ ((128 × 128 × 128)/(64 × 64 × 64) = 8). Although this result is just for a single GPU and not many GPUs, Stag++ uses periodic boundary

conditions, and the MPI and data copying are always performed even for a single GPU, although the network connection probably is not used in this single-GPU simulation. There is almost a 700µs safety margin (the gap between the last two cudaStreamSynchronize calls in figure 7.3(a)) for MPI when many nodes are used for the simulation.

Figure 7.3(b) shows that parts (A) and (C) run fully concurrently with part (B). This feature is available only on the Fermi architecture. The total run time for 128³ when pinned memory is used is 1990µs and for the mapped memory it is 2332µs. In this case pinned memory is almost 17% faster than mapped memory.

The main reason for this difference, as mentioned before, is that writing to and reading from mapped memory is more expensive than regular pinned memory. So parts (A+C) and (F) in the mapped and write-combined memory case take much more time than parts (A) and (F) with regular pinned memory (part (C) in the pinned memory case is completely

hidden by part (B)). We should also mention that when mapped memory is used, there is only a small time margin available in which to hide the MPI communications for a large number of GPUs. Figure 7.4 shows the timeline for the 256³ Laplace kernel with regular and

mapped memories for the send and receive buffers. In this case, the interior calculation is 4 times larger than the MPI plus copying times. There is also almost an 8500µs safety margin (the gap between the last two cudaStreamSynchronize calls in figure 7.4(a)) for MPI when many nodes are used for the simulation.

Figure 7.4(b) shows that parts (A) and (C) run only partly concurrently with part (B) for 256³. The main reason for this is that parts (A) and (C) require more resources (blocks or multiprocessors) than are available on the device, so only some sections of parts (A) and (C) run concurrently on the GPU.

The total run time for 256³ when pinned memory is used is 12403µs and for the mapped and write-combined memories it is 14043µs. In this case pinned memory is

almost 13% faster than the mapped and write-combined memories. The same reasoning as for 128³ applies here.


Figure 7.4. Timeline for Laplace kernel for 256³ with (a) regular pinned memory and (b) mapped and write-combined memories for send and receive buffers

In general, if there is the possibility of hiding the copying time with kernel execution, it is more efficient to use regular pinned memory instead of mapped memory. Mapped memory is useful only when the copying time cannot be hidden by kernel execution, such as when finding the summation or maximum value across all nodes.
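As an illustration of that exception, the generic reduction sketch below uses mapped (zero-copy) memory so that the CPU can finish the last stage of a dot product without a separate explicit copy. This is not the Stag++ code; the kernel, block size and helper names are assumptions, and it assumes the device was initialized with cudaSetDeviceFlags(cudaDeviceMapHost).

#include <cuda_runtime.h>

__global__ void partial_dot(const double *a, const double *b, int n,
                            double *partial)          // one value per block
{
    __shared__ double s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

double dot(const double *d_a, const double *d_b, int n)
{
    int blocks = (n + 255) / 256;
    double *h_partial, *d_partial;
    // Mapped pinned memory: the kernel writes straight into host memory,
    // so no explicit device-to-host copy is needed before the CPU reads it.
    cudaHostAlloc((void**)&h_partial, blocks * sizeof(double),
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_partial, h_partial, 0);

    partial_dot<<<blocks, 256>>>(d_a, d_b, n, d_partial);
    cudaDeviceSynchronize();              // this result cannot be overlapped

    double sum = 0.0;                     // last stage of the sum on the CPU
    for (int k = 0; k < blocks; ++k) sum += h_partial[k];
    cudaFreeHost(h_partial);
    return sum;
}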

As mentioned before, unlike cache-based hardware (the CPU), the GPU performs much better on large data sets. Using this fact, doubling the grid points in every direction increases the GPU execution time for part (B) twice as much as it increases the sending, receiving and copying times, without any other penalty. So if the MPI time becomes dominant in many-GPU simulations, it is possible to increase the number of grid points to 256³ per GPU and hide the communication time with the kernel execution.

7.3 CG and Laplace Results for Single Processor on Orion The performance of the code on a single GPU on Orion (our in-house GPU cluster) is now analyzed more closely. Timings indicate that 87% of the code execution time is spent in the Conjugate Gradient (CG) solver (which solves for the pressure and implicit diffusion terms) and fully 50% of the time is spent in the sparse matrix multiply subroutines. A breakdown of the time spent in the CG and Laplace algorithms is shown in Figure 7.5.


Figure 7.5. Time for (a) CG and (b) Laplace subroutines for different problem sizes

The summation item in figure 7.5(b) includes copying the dot product results from the GPU to the CPU plus the last steps of the summation on the CPU. This figure shows that the interior (part B) is the most time consuming part of the Laplace solver. The copying time plus MPI for the largest cases is 4 times smaller than the interior time. So on Orion, subdomain problem sizes of 128³ per GPU and larger are sufficient to hide the MPI and copying time. On Lincoln this is not actually true: because the copy time is 4× slower (dashed black line), it is always as large as the useful computation time. Current GPUs do not have enough memory to handle problem sizes greater than

288³ per GPU. Because every node has to send the data on its boundaries to other nodes, it needs to copy six boundary surfaces to the CPU. So parts (A), (C), (D) and (E) scale like N², but part (B), solving the interior points, grows like N³. As mentioned before, Lincoln has 4 times lower bandwidth between the CPU and GPU. Figure 7.5(b) shows the extrapolated time for copying between the CPU and GPU on Lincoln without the MPI time. As can be seen, even for 256³ the copying times are barely overlapped by the domain computation.

7.4 Single CPU and GPU Results for Different Computers In the single processor case, although there is no MPI communication via the network card, there are still copies because of the periodic boundary conditions: two on the GPU side and one on the CPU. These copying times should be hidden even for a single processor. The two copies on the GPU side use the PCI-e bandwidth; the maximum theoretical PCI-e speeds for ×4, ×8 and ×16 are 2, 4 and 8 GB/s. The single copy on the CPU depends on the CPU memory bandwidth, which is different for each CPU architecture.
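A simple way to measure the achievable PCI-e rate on a given node is sketched below. This is a generic bandwidth test (not part of Stag++); the 64 MB transfer size is an arbitrary illustrative choice.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 64 << 20;                    // 64 MB transfer
    double *h_buf, *d_buf;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // pinned
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("Host-to-device: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}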

7.4.1 Orion Single Processor Results Figure 7.6 shows time per iteration for a single CPU and GPU and the speedup for single GPU on Orion for single and double precision with ECC on and off options.


Figure 7.6. Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Orion with single (SP) and double (DP) precision

The highest speedups, 57× and 36×, occurred for single and double precision respectively when the ECC is off. There is only a small difference in the CPU timing results between single and double precision. One possible reason for this small difference is the 64-bit operating system (OS): Orion has a 64-bit CPU and OS, and on a 64-bit architecture there appears to be little difference between single and double precision. In the GPU case, however, even for the latest generation of GPUs, single precision is 2 times faster than double precision. As mentioned earlier, Stag++ is mostly throughput (bandwidth) limited, so reading double precision data from GPU memory takes 2 times longer than single precision. Figure 7.6 shows that when the ECC is on, the performance drops by 30%. We also examined the effect of ECC on the GPU's raw bandwidth (not shown here); when the ECC is on, the raw bandwidth is reduced by 12.5% for the Tesla C2070.

7.4.2 Lincoln Single Processor Results Figure 7.7 shows the time per iteration for a single CPU and GPU and the speedup for a single GPU on the Lincoln supercomputer for single and double precision.


Figure 7.7. Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Lincoln supercomputer with single (SP) and double (DP) precision

The highest speedups, 33× and 28×, occurred for single and double precision respectively. Lincoln uses Tesla 10-series GPUs. In this case single precision is 2 times faster than double precision, just as for the Tesla 20 series. As mentioned before, there is only a small difference in the CPU timing results between single and double precision.

7.4.3 Forge Single Processor Results Figure 7.8 shows time per iteration for a single CPU and GPU and the speedup

for single GPU on Forge supercomputer for single and double precision.


Figure 7.8. Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Forge supercomputer with single (SP) and double (DP) precision

Forge has the same GPUs as Orion, but uses a newer CPU core architecture. Forge has two AMD Opteron Magny-Cours 6136 eight-core processors (dual socket) running at 2.4 GHz, and Orion has an AMD quad-core Phenom II X4 CPU operating at 3.2 GHz. Figures 7.6(a) and 7.8(a) show the same behavior for the CPU because both machines use AMD processors. The results for SP and DP for the CPUs on Forge are roughly the same.

Figure 7.8(b) shows the speedup for a single GPU on the Forge supercomputer for single and double precision. The highest speedups for Forge, 31× and 20×, occurred for single and double precision, respectively.

7.4.4 Keeneland Single Processor Results Figure 7.9 shows time per iteration for a single CPU and GPU and the speedup for single GPU on Keeneland supercomputer for single and double precision. Keeneland

and Lincoln, both have Intel CPUs, but Keeneland uses a new CPU core architecture.

Keeneland has two hex-core Intel Xeon (Westmere-EP) 2.93 GHz processors and Lincoln has two quad-core Intel 64 (Harpertown) 2.33 GHz processors (dual socket).


Figure 7.9. Single CPU and GPU results, (a) Time (ms) per iteration and (b) Speedup for different problem sizes on Keeneland supercomputer with single (SP) and double (DP) precision

Figures 7.7(a) and 7.9(a) show the same behavior for the CPUs because they are both Intel. Keeneland also has the same GPUs as Orion and Forge, but uses different CPUs (Intel instead of AMD). The results for SP and DP for the CPUs are roughly the same. Figure 7.9(b) shows the speedup for a single GPU on the Keeneland supercomputer for single

and double precision. The highest speedups for Keeneland, 20× and 11×, occurred for single and double precision respectively.

7.5 Strong Scaling Results In the strong scaling situation, the problem size is constant, and as the number of processors is increased, the problem size per processor gets smaller and smaller. Below are the results for three different supercomputers. Lincoln, Forge and Keeneland have ×4, ×8 and ×16 PCI-e bandwidth when all GPUs per node are used

respectively. The PCI-e bandwidth is a very important factor in large high performance computing codes.

7.5.1 Lincoln Strong Scaling Results Figure 7.10 shows the speedup (versus the same number of CPU cores) and millions of cell updates per second per GPU or CPU core (MCUPS/Processor) for strong scaling results.


Figure 7.10. (a) Speedup and (b) Performance per processor for strong scaling of the 128³, 256³ and 512³ CFD problem on Lincoln supercomputer using GPUs and CPUs

A MCUPS represents how many millions of finite-volume cells can be updated (one CG iteration) during a second of wall-clock time. For strong scaling, the highest speedup (45×) occurred for 16 GPUs compared to 16 cores (on 4 CPUs). In theory,

the MCUPS/Processor (which is directly related to the hardware efficiency) should look like a horizontal line (a constant). Figure 7.10 shows that from 2 GPUs to 64 GPUs the performance loss is roughly 50%. (One GPU and one core can each access more memory bandwidth, and therefore perform better, than when the memory

system is loaded to its typical state.) We should mention that using 64 GPUs on Lincoln means using 32 nodes and 4 times more network traffic than 64 CPU cores (which only require 8 nodes). Figure 7.10(b) also shows that, as the number of processors increases, the performance increases for the CPU and decreases for the GPU.

As mentioned before, GPUs are more efficient for large problem sizes and CPUs for small problem sizes.
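For reference, the MCUPS metric quoted above is simply the number of cells divided by the wall-clock time per CG iteration,

\[ \mathrm{MCUPS} = \frac{N_{\text{cells}}}{10^{6}\; t_{\text{iter}}} , \]

so a hypothetical 128³ subdomain updated in 2 ms would correspond to 128³/(10⁶ × 0.002 s) ≈ 1050 MCUPS (the 2 ms here is purely an illustrative number, not a measured result).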

7.5.2 Forge Strong Scaling Results Figure 7.11 shows the speedup (versus the same number of CPU cores) and MCUPS/Processor for strong scaling results.


Figure 7.11. (a) Speedup and (b) Performance per processor for strong scaling of the 128³, 256³ and 512³ CFD problem on Forge supercomputer using GPUs and CPUs

For strong scaling, the highest speedup (24×) occurred for 4 GPUs compared to 4 cores. Figure 7.11 shows that for 256³ the scaling from a single GPU to 4 GPUs is perfect. From 4 to 8 GPUs the performance loss is roughly 50%. The main reason for this loss is again the PCI-e bandwidth. When 4 GPUs per node are used, the PCI-e bandwidth is ×16 for each GPU, but when 8 GPUs per node are used, the PCI-e bandwidth

is ×8, which is two times slower than ×16. Because there are 8 GPUs per node on Forge, for 512³, from 8 GPUs to 16 GPUs the MPI time increases and there is almost a 50% performance loss in this case. The CPU, however, has almost steady performance for all cases. Figure 7.11 also shows that for the CPU cases almost all MPI communications are hidden by useful computation.

7.5.3 Keeneland Strong Scaling Results

Figure 7.12 shows the speedup and MCUPS/Processor for strong scaling results. For strong scaling, the highest speedup (25×) occurred for 32 GPUs compared to 32 cores.


Figure 7.12. (a) Speedup and (b) Performance per processor for strong scaling of the 128³, 256³ and 512³ CFD problem on Keeneland supercomputer using GPUs and CPUs

Figure 7.12 shows that for 256³ the scaling from a single GPU to 4 GPUs is perfect. From 4 to 16 GPUs the performance loss is roughly 25%. The main reason for this

loss is that the problem per GPU gets smaller as the number of GPUs increases.

7.6 Weak Scaling Results In the weak scaling situation, the problem size per processor is constant; as the number of processors is increased, the communication time (MPI time) gets larger and larger. Below are the results for the three different supercomputers, Lincoln, Forge and

Keeneland.

7.6.1 Lincoln Weak Scaling Results

Figure 7.13 shows the speedup (versus the same number of CPU cores) and MCUPS per GPU or CPU core (MCUPS/Processor) for weak scaling results.


Figure 7.13. (a) Speedup and (b) Performance per processor for weak scaling of the 128³ and 256³ CFD problem on Lincoln supercomputer using GPUs and CPUs

For weak scaling, the highest speedup (40×) occurred for 32 GPUs compared to 32 cores (on 8 CPUs). In theory, the MCUPS/Processor should look like a horizontal line (a constant). Figure 7.13 shows that from 2 GPUs to 64 GPUs the performance loss is roughly 50%. The main reason for this loss in efficiency is the low bandwidth between the CPU and GPU (PCI-e ×4).

As mentioned earlier, when the copying times increase, there is not enough time to hide the MPI communication time with computation.

7.6.2 Forge Weak Scaling Results Figure 7.14 shows the speedup (versus the same number of CPU cores) and MCUPS per GPU or CPU core (MCUPS/Processor) for weak scaling results.


Figure 7.14. (a) Speedup and (b) Performance per processor for weak scaling of the 128³ and 256³ CFD problem on Forge supercomputer using GPUs and CPUs

For weak scaling, the highest speedup (29×) occurred for 4 GPUs compared to

4 CPU cores. In theory, the MCUPS/Processor should look like a horizontal line (a constant). As mentioned earlier, up to 4 GPUs the MCUPS/Processor is indeed a horizontal line, but beyond 4 GPUs per node, as more GPUs are added, the PCI-e bandwidth is reduced to ×8 and the performance loss is roughly 50%. As with strong scaling,

the CPU shows stable performance and the MPI communications are completely hidden by computation.

Figure 7.15 shows the comparison of the speedup (versus the same number of CPU cores) and MCUPS/Processor for weak scaling results with a maximum of 4 and 8 GPUs per node. For weak scaling, the highest speedup, 34× versus 29×, occurred when 4 GPUs per node were used instead of 8 GPUs, with a total of 32 GPUs compared to 32

CPU cores. As mentioned earlier, with 4 GPUs per node the PCI-e bandwidth is ×16 and there is no performance loss for the 256³ case. There is a performance loss for 128³ per GPU at 64 GPUs; it seems that this performance loss is related to MPI, not to the PCI-e bandwidth.


Figure 7.15. (a) Speedup and (b) Performance per processor for weak scaling of the 128³ and 256³ CFD problem on Forge supercomputer using 4 GPUs per node and 8 and 16 CPU cores per node for 256³ and 128³ respectively

7.6.3 Keeneland Weak Scaling Results Figure 7.16 shows the speedup (versus the same number of CPU cores) and MCUPS/Processor for weak scaling results. For weak scaling, the highest speedup (21×) occurred for 8 GPUs compared to 8 CPU cores. Figure 7.16(b) shows perfect

performance for the GPU up to 192 GPUs, which means that all the MPI communications are completely hidden by kernel executions. For the CPU case, 8 and 4 CPU cores per node are used for the 128³ and 256³ cases, respectively. So for 128³ the performance decreases up to 8 CPU cores, and after that the performance is almost constant, while for 256³

the performance decreases up to 4 CPU cores and after that it is almost constant. There is also a performance loss in figure 7.16(b) for 128³ per GPU. The main reason for this loss is that 3 GPUs per node, with slower bandwidth, were used, and also that because the problem size is small, the MPI and copying are barely hidden

with kernel executions.


Figure 7.16. (a) Speedup and (b) Performance per processor for weak scaling of the 128³ and 256³ CFD problem on Keeneland supercomputer using GPUs and CPUs

7.7 Forge and Keeneland Supercomputers Efficiency Results Figure 7.17 shows the MCUPS/Processor relative to a single GPU for the weak scaling results on the Forge and Keeneland supercomputers. This figure represents the efficiency of the code and the supercomputer. Figures 7.17(a) and 7.17(b) show that for 256³ per

GPU, up to 64 and 192 GPUs the efficiency is reduced by only 4% and 10% for the Forge and Keeneland supercomputers respectively.


Figure 7.17. (a) Forge and (b) Keeneland efficiency results for weak scaling of the 128³ and 256³ CFD problem using GPUs and CPUs

Figure 7.17(a) shows that the number of GPUs per node has a significant effect on the efficiency of the Forge supercomputer. As mentioned before, when 8 GPUs are used per node, the PCI-e speed is reduced to ×8 instead of ×16. The results also show that PCI-e ×16 is good enough to hide the copying and MPI communications in large CFD codes, if suitable approaches are applied. There is a different explanation behind the CPU results. The CPUs scale well up to the number of memory pipelines; after that the performance decreases until all the cores per CPU are in use. Because, like the GPU, all the MPI communications are hidden, the CPU results show roughly stable performance as the number of

processors increases. We should also mention that all of these conclusions are based on memory-bandwidth-limited codes; all CFD codes and a large portion of scientific codes are memory bound.

CHAPTER 8

DIRECT NUMERICAL SIMULATION OF TURBULENCE

8.1 Introduction The direct numerical simulation (DNS) of turbulence is a computationally intensive scientific problem that can benefit significantly from improvements in computational hardware performance. The memory streams in a direct numerical solution of turbulence, such as a pressure or velocity field, are on the order of 1 billion bytes each. We would like to efficiently stream this data in, do a few computations, and return the same field but at the next time level. Because processor speeds have been increasing over the last decade but memory speeds have not, all CFD simulations (and in fact almost all PDE solution techniques) are entirely memory bound. This means that on modern computers the computations are not important to the performance of the algorithm. The critical factor is the ability to read data in and write results out. GPU memory subsystems do this an order of magnitude more quickly than

CPU memory subsystems based on caches. Perhaps not too surprisingly, the GPU memory subsystem functions (and is optimized) remarkably similarly to the memory subsystem of the original Cray supercomputer vector processors.

8.2 Software The solution method uses a three-step, low-storage Runge-Kutta scheme [107] for

time advancement that is second-order accurate in time. This scheme is stable for eigenvalues on the imaginary axis less than 2, which implies CFL < 2 for advective

stability. The simulations always use a maximum CFL < 1. The diffusive terms are advanced with the trapezoidal method for each Runge-Kutta substep, and the pressure is solved using a classical fully-discrete fractional step method [108], although an exact fractional step method [109] is also possible.
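Schematically, each Runge-Kutta substep therefore has the familiar projection structure shown below. This is a generic sketch (substep coefficients and the staggered-mesh details are omitted), not the exact Stag++ discretization:

\begin{align*}
\frac{\mathbf{u}^{*}-\mathbf{u}^{k}}{\Delta t_k} &= -(\mathbf{u}^{k}\cdot\nabla)\mathbf{u}^{k} + \nu\nabla^{2}\tfrac{1}{2}\left(\mathbf{u}^{*}+\mathbf{u}^{k}\right), \\
\nabla^{2} p^{k+1} &= \frac{\nabla\cdot\mathbf{u}^{*}}{\Delta t_k}, \\
\mathbf{u}^{k+1} &= \mathbf{u}^{*} - \Delta t_k\,\nabla p^{k+1},
\end{align*}

so that the updated velocity satisfies the discrete divergence-free constraint at the end of every substep.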

For the spatial discretization, a second order Cartesian staggered-mesh scheme is used. This not only conserves mass and momentum to machine precision, but because it is a type of Discrete Calculus method [110] it also conserves vorticity (or circulation) and kinetic energy in the absence of viscosity. As a result, there is no artificial

viscosity/diffusion in this method except that induced by the time-stepping scheme [111]. In addition, the staggered mesh discretization is free from spurious pressure modes and the need for pressure stabilization terms. This discretization method also treats the wall boundary condition well because the wall normal velocity unknown

lies exactly on the wall, so no interpolation is required to enforce the kinematic no penetration condition. Higher order versions of this method exist but are more complicated to parallelize [112]. Many of the simulations presented below were performed on 512³ meshes with fully periodic boundary conditions on the exterior of the computational domain, and wall boundary conditions on interior embedded objects. The highest Reynolds numbers simulated in this work are comparable to the Reynolds numbers found in laboratory wind tunnel turbulence experiments (such as Comte-Bellot and Corrsin [113, 114]).

In addition, the highest Reynolds numbers tested in this work are sufficient to show decay rates that are very consistent with high Reynolds number decay theories [115]. The simulations of turbulence are initialized by driving fluid past stationary “mixing boxes”. This initialization procedure has the advantage of allowing the initial turbu-

lence spectrum to develop naturally rather than being imposed as an initial condition. Further details of the simulations can be found in Perot [115].

Physical units can be helpful for the reader to put the simulations in perspective. If the simulated fluid is water at standard temperature and pressure (with ν = 10⁻⁶ m²/s) then the domain size is a cube that is 48 cm on a side. The small cubes that initialize the turbulence are 1.4 cm on a side. In the 512³ simulations there are 768 initialization

cubes randomly placed in the domain (Figure 8.1). The total volume of all the stirring elements is therefore 1.92% of the total simulation volume. The mesh size itself is 0.9375 mm (which is 1/15th of the stirring cube size). At early times in the simulation, the timestep can be as small as 1/1000th of a second. In all the simulations, the

timestep is never larger than 1/10th of a second.

Figure 8.1. Simulation domain with 768 randomly distributed cubes

8.3 Partitioning All PDE solution methods ultimately involve placing a large number of unknowns (which approximate the solution) into a 3D domain and evolving those unknowns in time. Parallel solution algorithms explicitly or implicitly partition these unknowns among the available processing units. Because of the local nature of many PDEs it is

often advantageous if the partitioning is performed so that the unknowns are grouped into physical clusters that are spatially coherent. These are called subdomains (see figure 8.2). For a Cartesian mesh, this type of partitioning is not a difficult task, and results in the large cuboid computational domain being divided into smaller

cuboid subdomains. Each subdomain is allocated to one GPU, and each GPU communicates via MPI with its 6 local neighbors. A few calculations, such as a sum(), require very small amounts of data to be globally communicated between all the GPUs. For example, for a 512³ simulation running on 64 GPUs, each GPU solves

a 128³ subdomain problem (2 million mesh points), but it communicates only the data at the surface of that subdomain with its six neighbors. So 6 × 128² ≈ 0.1 million data items (or about 5% of the data) are communicated from/to each GPU. As mentioned before, most of the communication instructions (MPI) can be overlapped with regular computations, so the communication is hidden and does not take any extra time to perform.


Figure 8.2. (a) Domain and (b) Subdomains with boundary planes for MPI communication

Although the boundary data is typically small (< 10%) compared to the internal data of the partitions, the MPI communication operations are also typically very slow (> 10× slower) than the internal operations. It is therefore quite possible for the boundary operations, and not the large number of bulk internal operations, to dominate the solution time and dictate the scaling behavior of the code. When the boundary operations dominate, the code speeds up by roughly a factor of 2^{2/3} ≈ 1.59 when the number of processors is doubled (rather than producing the expected doubling of the speed).
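This factor comes directly from the surface-to-volume scaling of the subdomains: each of the P subdomains holds N³/P cells, and its boundary (surface) data scales like the 2/3 power of that, so when boundary work dominates,

\[ t \;\propto\; \left(\frac{N^{3}}{P}\right)^{2/3} \quad\Longrightarrow\quad \frac{t(P)}{t(2P)} \;=\; 2^{2/3} \;\approx\; 1.59 . \]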

8.4 Isotropic Turbulence Decay In order to generate turbulent flow, fluid is initially moved past randomly distributed small no-slip cubes (Figure 8.1). Because the small boxes are distributed

randomly, it is possible that two or more cubes intersect each other. This is allowed and is computed correctly. The small cubes remain fixed in the domain and an external (constant in space) acceleration is applied to the fluid to force it past the cubes for almost 5 s. The direction of this acceleration is random, but its magnitude is

chosen by the user. Two values of the acceleration are applied, 80 and 100. In these simulations the direction of the acceleration is changed to a new random direction (with the same magnitude) every 0.3 seconds [115]. The primary acceleration is then turned off, and from 5 s to 7 s a restoring acceleration drives the final motion of the domain back to its rest position. After 2 seconds this restoring acceleration causes the mean flow to be extremely close to zero. A mean flow of zero is not necessary for the code, but it helps the solver to take slightly larger timesteps (by minimizing the CFL stability criteria), and it seems to lead to better statistical accuracy at very long times. During this 2 second time period the turbulence changes from being accelerated to being in isotropic decay. At the end of this period (when the mean flow is zero), the boxes instantaneously turn into (zero velocity) fluid [115].

Because all the boxes are removed from the domain at 7 seconds of simulation time, the flow is left to relax until 12 seconds. After 12 seconds the turbulent flow is in isotropic decay. Figure 8.3 shows the validation of the turbulent kinetic energy (TKE) against the

de Bruyn Kops and Riley result [12].


Figure 8.3. Validation of TKE with de Bruyn Kops and Riley result [12]

Figure 8.4 shows the TKE, ε, Re number, large-eddy length scale and decay exponent for isotropic decay.


Figure 8.4. (a) TKE and ε and (b) Re number, large-eddy length scale and decay exponent for isotropic decay

Four different plane strain cases were also executed. In these cases the initial domain

is a rectangular box. The same procedure is applied up to 12 seconds. From 12 seconds up to a time specified by the user, plane strains are applied in the X and Y directions. Four different strain rates are used: 0.025 (ST = 0.4), 0.0625 (ST = 1), 0.15625 (ST = 2.5) and 0.625 (ST = 10). The amount of strain time is chosen so that at the end of the strain cases the rectangular box becomes a cube. So plane strain was applied up to 32, 20, 15.2 and 12.8 seconds for ST = 0.4, 1, 2.5 and 10, respectively.

Figure 8.5 shows the turbulence decay from 5 s to 110 s for the plane strain 1 case.


Figure 8.5. u velocity contours for isotropic turbulence decay at (a) 5 s, (b) 7 s, (c) 12 s, (d) 20 s, (e) 40 s and (f) 110 s for plane strain case 1

Figure 8.6 shows the diagonal Reynolds stress components for all four cases.


Figure 8.6. Diagonal Reynolds stress component for (a) ST = 0.4 and 1 (b) ST = 2.5 and 10 cases

Figure 8.7 shows the TKE and ε for all four plane strain cases.


Figure 8.7. TKE and ε for different plane strain cases

CHAPTER 9

CONCLUSION

The research presented in this document demonstrates several notable advances in the area of high performance computing with many-core GPUs, especially in the areas of Bioinformatics and Computational Fluid Dynamics (CFD). In particular, the contributions of this dissertation can be summarized in three different areas: Bioinformatics, CFD and GPUs.

9.1 Bioinformatics (Sequence Matching) Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. The well-known Smith-Waterman algorithm is modified and used in this research. In particular, the advances in the Bioinformatics research can be summarized as follows:

• In this work it was shown that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm. In order to accomplish sequential memory accesses, a novel row (or column) parallel version of the Smith-Waterman algorithm was formulated. The performance of this new version of the

algorithm was demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to 120 GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 105 times faster than one core of a CPU for this application, and the parallel implementation shows

almost linear speed up on up to 120 GPUs (Ali Khajeh-Saeed and J. Blair Perot [79]).

• The issue of programming multiple GPUs is interesting because it requires a completely different type of parallelism to be exploited. A single GPU functions well with massive fine-grained (at least 30 k threads) nearly SIMT parallelism. With multiple GPUs, which are potentially on different computers connected via

MPI and Ethernet cards, very coarse-grained MIMD parallelism is desired. For this reason, all our multi-GPU implementations partition the problem coarsely for the GPUs, and then use fine-grained parallelism within each GPU (Ali Khajeh-Saeed, Steve Poole and J. Blair Perot [49]).

• Performance increases of an order of magnitude over a conventional high-end CPU are possible for the Smith-Waterman algorithm when graphics processors are used as the compute engine. Like most scientific algorithms, the Smith-Waterman algorithm is a memory bound computation when the problem sizes

become large. The increased performance of the GPUs in this context is due almost entirely to the GPU's different memory subsystem. The GPU uses memory banks rather than caches. This fundamental difference in the memory architecture means that these results are not unique to this problem, and superior

GPU performance (of roughly an order of magnitude) is to be expected from a great number of scientific computing problems (most of which are memory bound like the Smith-Waterman algorithm).

9.2 Computational Fluid Dynamics (CFD) Direct numerical simulations of turbulence are optimized for up to 192 graphics processors. The results from three large GPU clusters (Lincoln, Forge and Keeneland) are compared to the performance of fairly new CPUs. A number of important algorithm changes are necessary to access the full computational power of graphics processors and these adaptations are discussed. In particular, the aspects of the CFD research can be summarized as follows:

• The GPU has a fairly narrow operating range in terms of the number of unknowns per subdomain that the GPU should process. With less than 1M mesh points, CFD calculations do not have enough internal work to hide the communication times, and with more than 4M mesh points, standard GPUs run out of

memory; Tesla GPUs can go another 4× larger (up to 16M mesh points) (Ali Khajeh-Saeed and J. Blair Perot [116]).

• Detailed timings were performed and the best approaches are highlighted in order to get more efficiency from the different GPU memories (Ali Khajeh-Saeed and J.

Blair Perot [117]).

• An algorithm was developed to overlap the bottleneck of copying data between the GPU and CPU in order to optimize MPI communication for large DNS simulations (Ali Khajeh-Saeed and J. Blair Perot [118]).

• Perfect scale-up was presented for the GPU, and the bottlenecks for CPU scale-up are highlighted for large memory bound scientific codes (Ali Khajeh-Saeed and J. Blair Perot [119]).

9.3 GPU as a High Performance Computational Resource Different benchmarks were implemented on single and many GPUs. Advantages and disadvantages of GPUs as high performance computing hardware were explained for different scientific benchmarks. In particular, the aspects of the many-core GPU research can be summarized as follows:

• The GPU is well-suited for many memory bound problems (Ali Khajeh-Saeed and J. Blair Perot [120]).

• The power consumed per unit of performance of new GPUs is low compared to that of traditional multi-core

CPUs (Ali Khajeh-Saeed, Stephen Poole and J. Blair Perot [121]).

• Different GPU memory types were used in order to find the most suitable memory for various applications.

• Different algorithms were explored to circumvent the low speed PCI-e bandwidth between the GPU and CPU.

• The most important features were determined for building large GPU-based supercomputers.

• Two different optimization approaches (using the CUDA Visual Profiler and Parallel Nsight) were explored to fully optimize the application.

9.4 Publication List Publications directly resulting from this work are:

1. Ali Khajeh-Saeed, Stephen Poole, and J. Blair Perot. Acceleration of the Smith-

Waterman algorithm using single and multiple graphics processors. Journal of Computational Physics, 229:4247–4258, 2010.

2. Ali Khajeh-Saeed and J. Blair Perot. GPU-supercomputer acceleration of pat- tern matching. In Wen-Mei W. Hwu, editor, GPU Computing Gems, chapter

13, pages 185–198. Morgan Kaufmann, emerald edition, 2011.

3. Ali Khajeh-Saeed and J. Blair Perot. Computational fluid dynamics simulations using many graphics processors. Submitted to the Computing in Science and Engineering, August 2011.

4. Ali Khajeh-Saeed and J. Blair Perot. Direct numerical simulation of turbulence using GPU accelerated supercomputers. Submitted to the Journal of Computa- tional Physics, November 2011.

5. Ali Khajeh-Saeed and J. Blair Perot. High performance computing on the 64-core Tilera processor. Submitted to the Journal of Parallel and Distributed Computing, November 2011.

6. Ali Khajeh-Saeed and J. Blair Perot. Turbulence simulation using many graph-

ics processors. In The 64th Annual Meeting of the APS Division of Fluid Dy- namics, Baltimore, MD, November 20-22 2011.

7. Ali Khajeh-Saeed and J. Blair Perot. Efficient implementation of CFD algo- rithms on GPU accelerated supercomputers. Submitted to the GPU Technology

Conference (GTC), San Jose, California, May 14-17 2012. NVIDIA.

8. Ali Khajeh-Saeed, Stephen Poole and J. Blair Perot. A comparison of Multi- Core Processors on Scientific Computing Tasks. Submitted to the Innovative : Foundations and Applications of GPU, Manycore, and

Heterogeneous Systems (InPar2012), San Jose, California, May 13-14 2012.

APPENDIX A

EQUIVALENCE OF THE ROW PARALLEL ALGORITHM

The row parallel algorithm has a mathematical form very close to the original Smith-Waterman algorithm. However, the equivalence of these two forms is not trivial. Using equations 4.1 and 4.5 one can write

H_n = \max\left( \widetilde{H}_n ,\; E_n \right)   (A.1)

From the definition of E_n in equation 4.1 this becomes,

H_n = \max\left( \widetilde{H}_n ,\; \max\left( H_{n-1} - G_s - G_e ,\; H_{n-2} - G_s - 2G_e ,\; \dots ,\; H_0 - G_s - nG_e \right) \right)   (A.2)

Expanding H_{n-1} in a similar fashion gives,

H_{n-1} = \max\left( \widetilde{H}_{n-1} ,\; \max\left( H_{n-2} - G_s - G_e ,\; H_{n-3} - G_s - 2G_e ,\; \dots ,\; H_0 - G_s - (n-1)G_e \right) \right)   (A.3)

After placing equation A.3 into equation A.2 we have

H_n = \max\left( \widetilde{H}_n ,\; \max\left( \max\left( \widetilde{H}_{n-1} ,\, H_{n-2} - G_s - G_e ,\, \dots ,\, H_0 - G_s - (n-1)G_e \right) - G_s - G_e ,\; H_{n-2} - G_s - 2G_e ,\; \dots ,\; H_0 - G_s - nG_e \right) \right)   (A.4)

Comparing the items produced by the expanded first term with the items already present, it is clear that all of the expanded items (except the first one) are smaller than the corresponding items already in the list if G_s \ge 0. So we can write this as,

H_n = \max\left( \widetilde{H}_n ,\; \max\left( \widetilde{H}_{n-1} - G_s - G_e ,\; H_{n-2} - G_s - 2G_e ,\; \dots ,\; H_0 - G_s - nG_e \right) \right)   (A.5)

This process can be repeated. So the next item is,

H_{n-2} = \max\left( \widetilde{H}_{n-2} ,\; \max\left( H_{n-3} - G_s - G_e ,\; H_{n-4} - G_s - 2G_e ,\; \dots ,\; H_0 - G_s - (n-2)G_e \right) \right)   (A.6)

And inserting this into A.5 gives

H_n = \max\left( \widetilde{H}_n ,\; \max\left( \widetilde{H}_{n-1} - G_s - G_e ,\; \max\left( \widetilde{H}_{n-2} ,\, H_{n-3} - G_s - G_e ,\, \dots ,\, H_0 - G_s - (n-2)G_e \right) - G_s - 2G_e ,\; H_{n-3} - G_s - 3G_e ,\; \dots ,\; H_0 - G_s - nG_e \right) \right)   (A.7)

Again the expanded items are smaller except the first, so

H_n = \max\left( \widetilde{H}_n ,\; \max\left( \widetilde{H}_{n-1} - G_s - G_e ,\; \widetilde{H}_{n-2} - G_s - 2G_e ,\; H_{n-3} - G_s - 3G_e ,\; \dots ,\; H_0 - G_s - nG_e \right) \right)   (A.8)

After repeating this substitution for all the items, one obtains,

H_n = \max\left( \widetilde{H}_n ,\; \max\left( \widetilde{H}_{n-1} - G_s - G_e ,\; \widetilde{H}_{n-2} - G_s - 2G_e ,\; \dots ,\; \widetilde{H}_0 - G_s - nG_e \right) \right)   (A.9)

And with the definition of \widetilde{E} (equation 4.4), this shows that,

H_n = \max\left( \widetilde{H}_n ,\; \widetilde{E}_n - G_s \right)   (A.10)

which is also equation 4.5.

MODIFIED PARALLEL SCAN

The modified parallel maximum scan algorithm for calculating E in the row par- allel Smith-Waterman Algorithm is described. The implementatione basically uses the work-efficient formulation of Blelloch [122] and GPU implementation of Sengupta et al. [123]. This efficient scan needs two steps to scan the array, called up-sweep and

down-sweep. The algorithms for these two steps is shown in figures B.1 and B.2, respectively. Each of these two steps requires log n parallel steps. Since the amount of work becomes half at each step. The overall work is O(n).

for d =0 to log2 n − 1 in parallel for k =0 to n − 1 by 2d+1 d+1 d d+1 d E[k +2 − 1] = max(E[k +2 − 1], E[k +2 − 1]+2 × Ge)

e e e Figure B.1. Up-sweep for modified parallel scan for calculating the E for Smith- Waterman Algorithm e

152 E[n − 1]=0

for d = log2 n − 1 to 0 e in parallel for k =0 to n − 1 by 2d+1 T emp = E[k +2d − 1] E[k +2d − 1] = E[k +2d+1 − 1] e d+1 d+1 d E[k +2 − 1] = max(T emp, E[k +2 − 1]) − 2 × Ge e e e e Figure B.2. Down-sweep for modified parallel scan for calculating the E for Smith- Waterman Algorithm. e

Each thread processes two elements and if the number of elements is more than the size of a single block, the array is divided into many blocks and the partial modified scan results are used as an input to the next level of recursive scan. The dark gray cells in figures B.1 and B.2 are the values of E in these cells that are updated in each

step. The dash lines in figure B.2 means thee data is copied to that cell.

153 APPENDIX C

CUDA SOURCE CODE FOR A 1-POINT STENCIL USED IN LAPLACE OPERATOR

__global__ void Interior_Lap_Kernel( real * mult, real * diag, real * pp, real * ww, real * acc) { __shared__ real pp_Sh[BDIMY+2][BDIMX+2]; real upper, lower, ww_Loc;

int ix = blockIdx .x* blockDim .x + threadIdx .x; int iy = blockIdx .y* blockDim .y + threadIdx .y; int idx = iy*NX + ix;

int i = threadIdx .x + 1; // thread’s x -index into corresponding shared memory tile int j = threadIdx .y + 1; // thread’s y -index into corresponding shared memory tile

int loc = idx; // fill the "in-front" and "behind" data pp_Sh[j][i] = pp[idx]; //z=0 upper = pp[idx+NXY]; //z=1 real sum = 0.0; for (int iz=0; iz

///////////////////////////////////////// // update the edges of the block if (threadIdx .y == 0) { // halo above/below pp_Sh[0][i] = pp[idx-NX]; pp_Sh[BDIMY+1][i] = pp[idx+BDIMY*NX]; //wrong if size not= *16 } if (threadIdx .x == 0) { // halo left/right (all the same warp - so OK) pp_Sh[j][0] = pp[idx-1]; pp_Sh[j][BDIMX+1] = pp[idx+BDIMX]; // also can be wrong } __syncthreads (); //all shared memory loaded and ready to go ///////////////////////////////////////// // compute the output value ww_Loc = diag[idx]* pp_Sh[j][i] + mult[idx]*( (pp_Sh[j][i+1]*DXIV[ix+1] + pp_Sh[j][i-1]*DXIV[ix])*DXIC[ix] + (pp_Sh[j+1][i]*DYIV[iy+1] + pp_Sh[j-1][i]*DYIV[iy])*DYIC[iy] + (upper *DZIV[iz+1] + lower *DZIV[iz])*DZIC[iz] );

ww[idx] = ww_Loc; sum += ww_Loc*pp_Sh[j][i]; } acc[loc] = sum; }

154 BIBLIOGRAPHY

[1] NVIDIA. NVIDIA Next Generation CUDA Compute Architecture: Fermi, 2010.

[2] NVIDIA. NVIDIA CUDA C Programming Guide, 2010.

[3] http://developer.nvidia.com/gpudirect.

[4] NVIDIA. Whitepaper, NVIDIA Next Generation CUDA Compute Architecture: Fermi, 2010.

[5] http://www.nvidia.com/object/tesla_computing_solutions.html.

[6] http://tiny.cc/MicrowayTeslaS2050.

[7] http://queue.acm.org/detail.cfm?id=1629155.

[8] http://gladiator.ncsa.illinois.edu/Images/forge/IMG_3006.jpg.

[9] http://blogs.nvidia.com/wp-content/uploads/2011/04/Keeneland.jpg.

[10] TILERA Corporation. TILEPro64 Processor Data Sheet, 2010.

[11] P. Micikevicius. 3d finite difference computation on gpus using . In Proceed- ings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009.

[12] S. M. de Bruyn Kops and J. J. Riley. Direct numerical simulation of laboratory experiments in isotropic turbulence. Physics of Fluids, 10(9):2125–2127, 1998.

[13] D. B. Kirk and W. Hwu. Programming Massively Parallel Processors: A Hands- on Approach. Morgan-Kaufmann Publishers, 2009.

[14] NVIDIA. Tuning CUDA Applications for Fermi, 2010.

[15] http://en.wikipedia.org/wiki/PCI_Express#PCI_Express_3.0.

[16] http://www.nvidia.com/object/fermi_architecture.html.

[17] http://www.nvidia.com/object/cuda_home_new.html.

[18] T. J. Chung. Computational Fluid Dynamics. Cambridge University Press, 2002.

[19] http://en.wikipedia.org/wiki/Supercomputer.

155 [20] http://www.top500.org/list/2010/11/100.

[21] http://www.nccs.gov/jaguar.

[22] D. Goddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Gra- jewski, and S. Tureka. Exploring weak scalability for fem calculations on a gpu-enhanced cluster. Parallel Computing, 33:685 – 699, 2007.

[23] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, and W. Hwu. Qp: A heterogeneous multi-accelerator cluster. In In proceeeding of 10th LCI International Conference on High-Performance Clustered Comput- ing, 2009.

[24] Intel 64 tesla linux cluster lincoln webpage. http://tiny.cc/Lincoln_NCSA, 2008.

[25] Accelerator cluster webpage. http://iacat.illinois.edu/resources/cluster/, 2009.

[26] J. Enos V. Kindratenko, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. Hwu. Gpu clusters for high-performance computing. In In proceeeding of Workshop on Parallel Programming on Accelerator Clusters, IEEE Cluster, 2009.

[27] http://en.wikipedia.org/wiki/Tianhe-I.

[28] http://www.top500.org/lists/2010/06/press-release.

[29] http://tiny.cc/Froge_NCSA.

[30] http://keeneland.gatech.edu/node/7.

[31] NVIDIA. CUDA C Best Practices Guide, 2010.

[32] NVIDIA. NVIDIA CUDA Reference Manual, 2010.

[33] http://www.nvidia.com/object/product_geforce_gtx_480_us.html.

[34] http://www.nvidia.com/object/preconfigured-clusters.html.

[35] http://www.nvidia.com/object/product_geforce_gtx_295_us.html.

[36] http://www.nvidia.com/object/product_tesla_s1070_us.html.

[37] J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C. Tseng. Dynamic load balancing of unbalanced computations using message passing. In IEEE International Parallel and Distributed Processing Symposium, 2007.

[38] Y. Liu, B. Schmidt, and D. L. Maskell. Cudasw++2.0: enhanced smith- waterman protein database search on cuda-enabled gpus based on simt and virtualized abstractions. BMC Research Notes, 3:73 – 85, 2010.

156 [39] L. Ligowski and W. Rudnicki. An efficient implementation of smith water- man algorithm on gpu using cuda, for massively parallel scanning of sequence databasesm. In Proceeding of the 23th IEEE International Parallel and Dis- tributed Processing Symposium, Aurelia Convention Centre and Expo Rome, Italy, 2009.

[40] T. F. Smith and M. S. Waterman. Identification of common molecular subse- quences. Jounal of Molecular Biology, 147:195 – 197, 1981.

[41] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipmanl. Basic local alignment search tool. Jounal of Molecular Biology, 215:403 – 410, 1990.

[42] E. G. Shpaer, M. Robinson, D. Yee, J. D. Candlin, R. Mines, and T. Hunkapiller. Sensitivity and selectivity in protein similarity searches: a comparison of smith- waterman in hardware to blast and fasta. Genomics,, 38(2):179 – 191, 1996.

[43] http://www.clcbio.com/index.php?id=1046.

[44] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Jounal of Molecular Biology, 48(3):443 – 453, 1970.

[45] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity search. Science, 227:1435 – 1441, 1985.

[46] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence com- parison. Proceedings of the National Academy of Science of the United States of America, 85:2444 – 2448, 1988.

[47] O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705 – 708, 1982.

[48] O. O. Storaasli and D. Strenski. Accelerating science applications up to 100× with fpgas. In Proc. of 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, Trondheim, Norway, 2008.

[49] A. Khajeh-Saeed, S. Poole, and J. B. Perot. Acceleration of the smith-waterman algorithm using single and multiple graphics processors. Journal of Computational Physics, 229:4247 – 4258, 2010.

[50] http://www.ecs.umass.edu/mie/tcfd/Programs.htm.

[51] A. Wozniak. Using video-oriented instructions to speed up sequence comparison. Computer Applications in the Biosciences, 13(2):145 – 150, 1997.

[52] T. Rognes and E. Seeberg. Six-fold speed-up of smith-waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699 – 706, 2000.

[53] M. Farrar. Striped smith-waterman speeds database searches six times over other simd implementations. Bioinformatics, 23(2):156 – 161, 2007.

[54] A. Wirawan, C. K. Kwoh, N. T. Hieu, and B. Schmidt. Cbesw: Sequence alignment on the playstation 3. BMC Bioinformatics, 9:377 – 387, 2008.

[55] B. Alpern, L. Carter, and K.S. Gatlin. Microparallelism and high-performance protein matching. In ACM/IEEE Supercomputing Conference, San Diego, California, 1995.

[56] W. R. Rudnicki, A. Jankowski, A. Modzelewski, A. Piotrowski, and A. Zadrozny. The new simd implementation of the smith-waterman algorithm on cell microprocessor. Fundamenta Informaticae, 96(1):181 – 194, 2009.

[57] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig. Bio-sequence database scanning on gpu. In Proceeding of the 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, 2006.

[58] Y. Liu, W. Huang, J. Johnson, and Sh. Vaidya. Gpu accelerated smith-waterman. In International Conference on Computational Science, University of Reading, UK, 2006.

[59] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig. Streaming algorithms for biological sequence alignment on gpus. IEEE Transactions on Parallel and Distributed Systems, 18(9):1270 – 1281, 2007.

[60] S. A. Manavski and G. Valle. Cuda compatible gpu cards as efficient hardware accelerator for smith-waterman sequence alignment. BMC Bioinformatics, 9:10 – 19, 2008.

[61] Y. Liu, D. L. Maskell, and B. Schmidt. Cudasw++: optimizing smith-waterman sequence database searches for cuda-enabled graphics processing units. BMC Research Notes, 2:73 – 83, 2009.

[62] G. M. Striemer and A. Akoglu. Sequence alignment with gpu: Performance and design challenges. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, Aurelia Convention Centre and Expo, Rome, Italy, 2009.

[63] A. Akoglu and G. M. Striemer. Scalable and highly parallel implementation of smith-waterman on graphics processing unit using cuda. Cluster Computing, 12:341 – 352, 2009.

[64] http://www.highproductivity.org/SSCABmks.htm.

[65] TILERA Corporation. Multicore Development Environment System Programmers Guide, June 2010.

[66] D. G. Waddington, C. Tian, and KC Sivaramakrishnan. Scalable lightweight task management for mimd processor. In Systems for Future Multicore Architectures, EuroSys workshop, pages 1–6, Salzburg, Austria, April 2011.

[67] D. Abts, N. D. E. Jerger, J. Kim, D. Gibson, and Mikko H. Lipasti. Achieving predictable performance through better memory controller placement in many-core cmps. In the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 451 – 461, New York, NY, USA, 2009.

[68] L.J. Karam, I. AlKamal, A. Gatherer, G.A. Frantz, D.V. Anderson, and B.L. Evans. Trends in multicore dsp platforms. Signal Processing Magazine, IEEE, 26(6):38–49, November 2009.

[69] I. Choi, M. Zhao, X. Yang, and D. Yeung. Experience with improving distributed shared cache performance on tilera's tile processor. IEEE Computer Architecture Letters, 10(2):45–48, July 2011.

[70] C. Hernández, A. Roca, J. Flich, F. Silla, and J. Duato. Characterizing the impact of process variation on 45 nm noc-based cmps. J. Parallel Distrib. Comput., 71:651–663, 2011.

[71] C. Chen, J. B. Manzano, G. Gan, G. R. Gao, and Vivek Sarkar. A study of a software cache implementation of the openmp memory model for multicore and manycore architectures. In the 16th International Euro-Par Conference on Parallel Processing: Part II, Euro-Par'10, pages 341 – 352, Berlin, Heidelberg, 2010. Springer-Verlag.

[72] B. Bornstein, T. Estlin, B. Clement, and P. Springer. Using a multicore processor for rover autonomous science. In Aerospace Conference, 2011 IEEE, pages 1–9, March 2011.

[73] J. Ha and S. P. Crago. Opportunities for concurrent dynamic analysis with explicit inter-core communication. In the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE '10, pages 17 – 20, 2010.

[74] M. Berezecki, E. Frachtenberg, M. Paleczny, and K. Steele. Many-core key-value store. In Green Computing Conference and Workshops (IGCC), 2011 International, pages 1–8, July 2011.

[75] C. Yan, F. Dai, and Y. Zhang. Parallel deblocking filter for h.264/avc on the tilera many-core systems. In Advances in Multimedia Modeling, 6523:51–61, 2011.

[76] J. Richardson, C. Massie, H. Lam, K. Gosrani, and A. George. Space applications on tilera. In Workshop for Multicore Processors For Space - Opportunities and Challenges, IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), pages 19–23, Pasadena, CA, July 2009.

[77] C. Ulmer, M. Gokhale, B. Gallagher, P. Top, and T. Eliassi-Rad. Massively parallel acceleration of a document-similarity classifier to detect web attacks. J. Parallel Distrib. Comput., 71:225–235, 2011.

[78] http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/.

[79] A. Khajeh-Saeed and J. Blair Perot. Gpu-supercomputer acceleration of pattern matching. In Wen-Mei W. Hwu, editor, GPU Computing Gems, chapter 13, pages 185–198. Morgan Kaufmann, emerald edition, 2011.

[80] Michael B. Martell Jr. Simulation of turbulence over superhydrophobic surface. Master's thesis, University of Massachusetts, Amherst, February 2009.

[81] J. Gadebusch and J. B. Perot. Self-adapting turbulence model for hybrid rans/les. In Meeting of the Canadian CFD Society, Toronto, CA, June 2007.

[82] J. B. Perot and J. Gadebusch. A self-adapting turbulence model for flow simulation at any mesh resolution. Physics of Fluids, 19:115105–115116, 2007.

[83] M. B. Martell, J. B. Perot, and J. Rothstein. Direct numerical simulations of turbulent flows over superhydrophobic surfaces. Journal of Fluid Mechanics, 620:31 – 41, 2009.

[84] M. B. Martell, J. P. Rothstein, and J. B. Perot. An analysis of superhydrophobic turbulent drag reduction mechanisms using direct numerical simulation. Physics of Fluids, 22(6):1 – 13, 2010.

[85] J. Kim, P. Moin, and R. Moser. Turbulence statistics in fully developed channel flow at low reynolds number. Journal of Fluid Mechanics, 177:133 – 166, 1987.

[86] R. Moser, J. Kim, and N. Mansour. Direct numerical simulation of turbulent channel flow up to reτ = 590. Physics of Fluids, 11(4):943 – 945, 1998.

[87] T. R. Hagen, K. Lie, and J. R. Natvig. Solving the euler equations on graphics processing units. In International Conference on Computational Science, University of Reading, UK, 2006.

[88] M.J. Harris. Fast fluid dynamics simulation on the gpu. In GPU Gems, chapter 38, pages 637 – 665. Addison Wesley, 2004.

[89] E. Elsen, P. LeGresley, and E. Darve. Large calculation of the flow over a hypersonic vehicle using a gpu. Journal of Computational Physics, 227:10148 – 10161, 2008.

[90] A. Corrigan, F. Camelli, R. Lohner, and J. Wallin. Running unstructured grid based cfd solvers on modern graphics hardware. In 19th AIAA Computational Fluid Dynamics, San Antonio, Texas, 2009.

[91] D. Rossinelli and P. Koumoutsakos. Vortex methods for incompressible flow simulations on the gpu. Visual Computer, 24:699 – 708, 2008.

[92] D. Rossinelli, M. Bergdorf, G. Cottet, and P. Koumoutsakos. Gpu accelerated simulations of bluff body flows using vortex particle methods. Journal of Computational Physics, 229:3316 – 3333, 2010.

[93] A. S. Antoniou, K. I. Karantasis, E. D. Polychronopoulos, and J. A. Ekaterinaris. Acceleration of a finite-difference weno scheme for large-scale simulations on many-core architectures. In American Institute of Aeronautics and Astronautics Paper, 2010.

[94] D. A. Jacobsen, J. C. Thibault, and I. Senocak. An mpi-cuda implementation for massively parallel incompressible flow computations on multi-gpu clusters. In 48th AIAA Aerospace Sciences, Orlando, Florida, January 2010.

[95] J.E. Stone, J.C. Phillips, P.L. Freddolino, D.J. Hardy, L.G. Trabuco, and K. Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28(16):2618 – 2640, 2007.

[96] J.S. Meredith, G. Alvarez, T.A. Maier, T.C. Schulthess, and J.S. Vetter. Accuracy and performance of graphics processors: a quantum monte carlo application case study. Parallel Computing, 35(3):151 – 163, 2009.

[97] D. Juba and A. Varshney. Parallel, stochastic measurement of molecular surface area. Journal of Molecular Graphics and Modelling, 27(1):82 – 87, 2008.

[98] N.A. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. Journal of Computational Physics, 227(18):8290 – 8313, 2008.

[99] W. Li, X.M. Wei, and A. Kaufman. Implementing lattice boltzmann computation on graphics hardware. Visual Computer, 19(7-8):444 – 456, 2003.

[100] J. Tolke and M. Krafczyk. Teraflop computing on a desktop pc with gpus for 3d cfd. International Journal of Computational Fluid Dynamics, 22(7):443 – 456, 2008.

[101] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. Gpu cluster for high performance computing. In Proceedings of the 2004 ACM/IEEE conference on Supercomputing, IEEE Computer Society, Washington, DC, USA, 2004.

[102] E. H. Phillips, Y. Zhang, R. L. Davis, and J. D. Owens. Rapid aerodynamic performance prediction on a cluster of graphics processing units. In Proceedings of the 47th AIAA Aerospace Sciences Meeting, 2009.

[103] T. Brandvik and G. Pullan. Acceleration of a 3d euler solver using commodity graphics hardware. In 46th AIAA Aerospace Sciences Meeting and Exhibit, 2008.

[104] J. C. Thibault and I. Senocak. Cuda implementation of a navier-stokes solver on multi-gpu platforms for incompressible flows. In 47th AIAA Aerospace Science Meeting, 2009.

[105] D. Goddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, and S. Turek. Using gpus to improve multigrid solver performance on a cluster. International Journal of Computational Science and Engineering, 4(1):36 – 55, 2008.

[106] http://developer.nvidia.com/object/nsight.html.

[107] J. B. Perot and J. Gadebusch. A stress transport equation model for simulating turbulence at any mesh resolution. Theoretical and Computational Fluid Dynamics, 23(4):271–286, 2009.

[108] J. B. Perot. An analysis of the fractional step method. Journal of Computational Physics, 108(1):183–199, 1993.

[109] W. Chang, F. Giraldo, and J. B. Perot. Analysis of an exact fractional step method. Journal of Computational Physics, 179:1–17, 2002.

[110] J. B. Perot and V. Subramanian. Discrete calculus methods for diffusion. Journal of Computational Physics, 224(1):59–81, 2007.

[111] J. B. Perot. Conservation properties of unstructured staggered mesh schemes. Journal of Computational Physics, 159(1):58–89, 2000.

[112] V. Subramanian and J. B. Perot. Higher-order mimetic methods for unstructured meshes. Journal of Computational Physics, 219(1):68–85, 2006.

[113] G. Comte-Bellot and S. Corrsin. The use of a contraction to improve the isotropy of grid-generated turbulence. Journal of Fluid Mechanics, 25:657–682, 1966.

[114] G. Comte-Bellot and S. Corrsin. Simple eulerian time correlation of full- and narrow-band velocity signals in grid-generated, isotropic turbulence. Journal of Fluid Mechanics, 48:273–337, 1971.

[115] J. Blair Perot. Determination of the decay exponent in mechanically stirred isotropic turbulence. AIP Advances, 1(2):022104–022122, 2011.

[116] A. Khajeh-Saeed and J. Blair Perot. Computational fluid dynamics simulations using many graphics processors. Submitted to Computing in Science and Engineering, August 2011.

[117] A. Khajeh-Saeed and J. Blair Perot. Direct numerical simulation of turbulence using gpu accelerated supercomputers. Submitted to the Journal of Computational Physics, November 2011.

[118] A. Khajeh-Saeed and J. Blair Perot. Turbulence simulation using many graphics processors. In The 64th Annual Meeting of the APS Division of Fluid Dynamics, Baltimore, MD, November 20-22 2011.

[119] A. Khajeh-Saeed and J. Blair Perot. Efficient implementation of cfd algorithms on gpu accelerated supercomputers. Submitted to the GPU Technology Conference (GTC), San Jose, California, May 14-17 2012. NVIDIA.

[120] A. Khajeh-Saeed and J. Blair Perot. High performance computing on the 64-core tilera processor. Submitted to the Journal of Parallel and Distributed Computing, November 2011.

[121] A. Khajeh-Saeed, S. Poole, and J. Blair Perot. A comparison of multi-core processors on scientific computing tasks. Submitted to Innovative Parallel Computing: Foundations and Applications of GPU, Manycore, and Heterogeneous Systems (InPar2012), San Jose, California, May 13-14 2012.

[122] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.

[123] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for gpu computing. In Graphics Hardware, pages 97 – 106, San Diego, CA, August 2007.
