Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU
D. V. Krishnababu (1), N. Munikrishna (1), Nikhil Vijay Shende (1), N. Balakrishnan (2), Thejaswi Rao (3)

1. S & I Engineering Solutions Pvt. Ltd., Bangalore, India
2. Aerospace Engineering, Indian Institute of Science, Bangalore, India
3. NVIDIA Graphics Pvt. Ltd., Bangalore, India

GPU Technology Conference, Silicon Valley, March 26-29, 2018

Introduction (http://www.sandi.co.in)

The HiFUN Software
- High Resolution Flow solver on UNstructured meshes.
- A Computational Fluid Dynamics (CFD) flow solver.
- The primary product of the company SandI.
- A robust, fast, accurate, and efficient tool.

About SandI
- A technology company incubated from the Indian Institute of Science, Bangalore.
- Promotes high-end CFD technologies with uncompromising quality standards.

Features of HiFUN (http://www.sandi.co.in/home/products)

General
- Feature overview (shown as a figure in the original slides).

Well Validated
- AIAA Drag Prediction Workshop (DPW) cases.
- SPICES.
- AIAA High Lift Prediction Workshop (HiLiftPW) cases.

Super Scalable
Workload: 165 million volumes.

Simulation   CPU Cores   Time
RANS         256         30 hours (1.25 days)
RANS         10,000      1 hour
URANS        256         108 hours (4.5 days)
URANS        10,000      3 hours
DES          256         525 hours (22 days)
DES          10,000      15 hours

SandI-NVIDIA Collaboration
- 2014: Joint development initiative kicks off.
- 2015: NVIDIA Innovation Award.
- 2016: Poster presentation at GTC 2016; GTCx Mumbai; HiFUN listed in the NVIDIA GPU Apps Catalogue.
- 2018: GTC 2018.
- Way ahead: HiFUN on NVIDIA Pascal and Volta GPUs; NVLink with IBM POWER CPUs.

HiFUN on NVIDIA GPU

Hybrid Supercomputers
- Consist of CPUs and NVIDIA GPUs.
- Need less power to achieve the same FLOPS.
- Need less cooling and space.

GPU
- Thousands of computing cores sharing the same RAM.
- Higher memory bandwidth than the CPU.
- High data-transfer overheads between CPU and GPU.

Parallelization Model on GPU
- Shared memory.
- Many FLOPS must be performed per byte of data moved from CPU to GPU.
- Calls for a re-look at the parallelization of CFD algorithms.

Parallelization Challenges
- General-purpose algorithms.
- Implicit schemes: global data dependence.
- Complex, multi-layered unstructured data structures (illustrated in the sketch below).
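To make the last challenge concrete, the following is a minimal sketch (not HiFUN's actual data structure, which is not public) of a face-based residual accumulation loop in C. The indirect addressing through the face-to-cell connectivity produces scattered memory accesses, and two faces may update the same cell, which is exactly what makes a naive GPU parallelization of unstructured-mesh solvers unsafe.

```c
#include <stdio.h>

/* Hypothetical face-based connectivity for an unstructured mesh
 * (illustrative only).  Each face knows the two cells it separates;
 * fluxes computed on faces are scattered into cell residuals via
 * indirect addressing. */
typedef struct {
    int           nfaces;
    const int    *left;   /* left[f]:  cell on one side of face f */
    const int    *right;  /* right[f]: cell on the other side     */
    const double *flux;   /* flux[f]:  flux across face f         */
} FaceMesh;

static void accumulate_residual(const FaceMesh *m, double *residual)
{
    for (int f = 0; f < m->nfaces; ++f) {
        /* Indirect, scattered accesses: two faces may update the same
         * cell, so a naive GPU parallelization of this loop races.
         * Typical cures are face colouring or atomic updates. */
        residual[m->left[f]]  -= m->flux[f];
        residual[m->right[f]] += m->flux[f];
    }
}

int main(void)
{
    /* Toy mesh: 3 cells in a row, 2 interior faces. */
    int    left[]  = {0, 1};
    int    right[] = {1, 2};
    double flux[]  = {0.5, -0.25};
    FaceMesh m = {2, left, right, flux};

    double residual[3] = {0.0, 0.0, 0.0};
    accumulate_residual(&m, residual);
    for (int c = 0; c < 3; ++c)
        printf("cell %d: residual %+.2f\n", c, residual[c]);
    return 0;
}
```

Face colouring and atomic updates are the usual ways to make such a loop safe for massively parallel execution; the slides do not state which approach HiFUN uses.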
Porting Constraints
- No compromise on distributed-memory scalability.
- Source-code maintainability should not suffer.
- Software portability should not suffer.

Parallel Strategy
- Accelerate single-node performance via an offload model.
- Hybrid parallelism: MPI across nodes, OpenACC directives on the GPU.

Offload Model
- The computationally intensive part is offloaded to the GPU.
- Data communication between CPU and GPU is kept optimal (a sketch of this model follows the concluding remarks).

Test Configurations and Workloads (million volumes)
- ONERA M6 wing: 1.1, 9.3, 12.12, 15.4.
- NASA Common Research Model (CRM): 6.2, 26.5, 30.
- NASA Trap Wing: 20, 66.
- Simulation type: steady RANS.

Computing Platform: NVIDIA PSG Cluster

Node configuration
- Two 16-core Intel Xeon (Haswell) processors.
- Eight NVIDIA Tesla K80 GPUs, 12 GB of memory per GPU.
- 256 GB of CPU memory per node.
- InfiniBand interconnect.

Software
- PGI Compiler 16.7.
- OpenMPI 1.10.2.
- OpenACC 2.0.

Parallel Performance Parameters
- Ideal speed-up: the ratio of the number of nodes used for a given run to the reference number of nodes.
- Actual speed-up: the ratio of the time per iteration on the reference number of nodes to the time per iteration on the number of nodes used for the given run.
- Accelerator speed-up: the ratio of the time per iteration using a given number of CPUs to the time per iteration using the same number of CPUs working in tandem with GPUs.
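Written symbolically (the notation is ours, not the slides': n is the number of nodes in a run, n_0 the reference number of nodes, and t_CPU, t_CPU+GPU the times per iteration without and with GPU acceleration):

```latex
S_{\text{ideal}} = \frac{n}{n_0},
\qquad
S_{\text{actual}} = \frac{t_{\mathrm{CPU}}(n_0)}{t_{\mathrm{CPU}}(n)},
\qquad
S_{\text{accel}} = \frac{t_{\mathrm{CPU}}(n)}{t_{\mathrm{CPU+GPU}}(n)}
```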
Single-Node Performance

Accelerator Speed-up on 2 GPUs
- Increasing the grid size increases GPU utilization and hence accelerator speed-up.
- It is important to load the GPU completely.

Varying the Number of GPUs
- Increasing the number of GPUs increases accelerator speed-up.
- Using 4 GPUs per node is optimal.

Time to RANS Solution
- About 15 minutes on the 1 million volume grid.
- About half a day on the 30 million volume grid.
- A single node can serve as a desktop supercomputer.

Multi-Node Performance

Parallel Speed-up: 66 Million Volume Workload
- Near-linear speed-up using 2 GPUs per node.
- Speed-up drops for larger node counts and/or more GPUs per node, owing to lower GPU utilization.

Normalized Time per Iteration: 66 Million Volume Workload
- Time per iteration drops as the number of nodes and/or GPUs increases.
- Time to solution with 8 nodes is about 4 hours.

Concluding Remarks
- An offload model is used to port HiFUN to the GPU.
- A GPU-based computing node is powerful enough to serve as a desktop supercomputer.
- HiFUN is ideally suited to solving grand-challenge problems on GPU-based hybrid supercomputers.
- An OpenACC directive-based offload model is an attractive option for porting legacy CFD codes to the GPU.
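As a concrete illustration of the directive-based offload model referred to above, here is a minimal MPI + OpenACC sketch in C. It is not HiFUN code (the slides show none, and HiFUN's source is proprietary); the 1-D slab decomposition, variable names, and smoothing stencil are all hypothetical stand-ins. The `acc data` region keeps the arrays resident on the GPU across iterations, so only the halo values exchanged through MPI cross the CPU-GPU boundary, reflecting the "optimal data communication" goal of the offload model.

```c
/* Minimal MPI + OpenACC offload sketch (hypothetical; not HiFUN source).
 * Each rank owns a 1-D slab of cells; the compute-intensive update runs
 * on the GPU inside an `acc data` region, and only the two halo cells
 * per rank move between host and device for the MPI exchange.
 * Build with, e.g., `mpicc -acc offload.c` using the PGI/NVHPC
 * toolchain named on the platform slide. */
#include <mpi.h>
#include <stdlib.h>

#define N 1000000  /* interior cells per rank (illustrative size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u    = malloc((N + 2) * sizeof *u);  /* cells 1..N + 2 halos */
    double *unew = malloc((N + 2) * sizeof *unew);
    for (int i = 0; i < N + 2; ++i) u[i] = (double)rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Keep the solution arrays resident on the GPU across iterations. */
    #pragma acc data copy(u[0:N+2]) create(unew[0:N+2])
    {
        for (int iter = 0; iter < 100; ++iter) {
            /* Move only the boundary cells to the host for MPI. */
            #pragma acc update host(u[1:1], u[N:1])
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            #pragma acc update device(u[0:1], u[N+1:1])

            /* The computationally intensive part, offloaded to the GPU
             * (a smoothing stencil stands in for a real flux update). */
            #pragma acc parallel loop present(u, unew)
            for (int i = 1; i <= N; ++i)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

            #pragma acc parallel loop present(u, unew)
            for (int i = 1; i <= N; ++i)
                u[i] = unew[i];
        }
    }

    free(u);
    free(unew);
    MPI_Finalize();
    return 0;
}
```

In a real unstructured solver the halo is a scattered set of cells that must be gathered into contiguous send buffers, ideally on the GPU itself; CUDA-aware MPI, or the NVLink hardware mentioned on the way-ahead slide, can remove the host round-trip shown here.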