
Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing systems with millions of cores

WANG Yong-Xian (a,*), ZHANG Li-Lun (a), LIU Wei (a), CHENG Xing-Hua (a), Zhuang Yu (b), Anthony T. Chronopoulos (c,d)

(a) National University of Defense Technology, Changsha, Hu'nan 410073, China
(b) Texas Tech University, Lubbock, TX 79409, USA
(c) University of Texas at San Antonio, San Antonio, TX 78249, USA
(d) Visiting Faculty, Dept. of Computer Engineering & Informatics, University of Patras, 26500 Rio, Greece

Abstract

For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy to reduce the computational time. How to achieve the best performance on modern HPC systems, especially those with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) co-processor configurations, is still a great challenge. An in-house parallel CFD code capable of simulating three-dimensional structured-grid applications is developed and tested in this study. Several methods of parallelization, performance optimization and code tuning, both on the CPU-only homogeneous system and on the heterogeneous system, are proposed. They are based on identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a machine node with hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. Several benchmark cases from model and industrial CFD applications are tested on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reaches 780 billion. The tuned solver successfully scales up to half of the entire Tianhe-2 supercomputer system with over 1.376 million heterogeneous cores. The results and performance analysis are discussed in detail.

Keywords: computational fluid dynamics, parallel computing, Tianhe-2 supercomputer, CPU+MIC heterogeneous computing

1. Introduction

In recent years, with the advent of powerful computers and advanced numerical methods, computational fluid dynamics (CFD) has been increasingly applied to the aerospace and aeronautical industries; CFD is reducing the dependence on experimental testing and has emerged as one of the important design tools for these fields. Parallel computing on modern high performance computing (HPC) platforms is also required to take full advantage of the computational capability and huge memories of these platforms, as an increasing number of large-scale CFD applications are used nowadays to meet the needs of engineering design and scientific research. It is well known that HPC systems are undergoing a rapid evolution driven by power consumption constraints, and as a result, multi-/many-core CPUs have turned into energy-efficient system-on-chip architectures, and HPC nodes integrating main processors and co-processors/accelerators have become popular in current supercomputer systems. The typical and widely used co-processors and/or accelerators include GPGPUs (general-purpose graphics processing units), the first-generation Intel Knights Corner MIC (Many Integrated Core) co-processors, FPGAs (field-programmable gate arrays), and so on. The fact that there are more than sixty systems with CPU+GPU and more than twenty systems with CPU+MIC in the latest Top500 supercomputer list, released in November 2016, indicates such a trend in HPC platforms.

Consequently, the CFD community faces an important challenge of how to keep up with the pace of rapid changes in the HPC systems. It could be hard, if not impossible, to port codes that perform efficiently on traditional homogeneous platforms to the current new HPC systems with heterogeneous architectures seamlessly. The existing codes need to be re-designed and tuned to exploit the different levels of parallelism and the complex memory hierarchies of the new heterogeneous systems.

During the past years, researchers in the CFD field have made great efforts to implement CFD codes efficiently on heterogeneous systems. Among many recent studies, researchers have paid more attention to CPU+GPU hybrid computing.

* Corresponding author. Email addresses: [email protected] (WANG Yong-Xian), [email protected] (ZHANG Li-Lun), [email protected] (LIU Wei), [email protected] (CHENG Xing-Hua), [email protected] (Zhuang Yu), [email protected] (Anthony T. Chronopoulos)

Ref. [1] studied an explicit Runge-Kutta CFD solver for the three-dimensional compressible Euler equations using a single NVIDIA Tesla GPU and obtained roughly a 9.5x performance gain over a quad-core CPU. In [2], the authors implemented and optimized a two-phase solver for the Navier-Stokes equations using Runge-Kutta time integration on a multi-GPU platform and achieved an impressive speedup of 69.6x on eight GPUs/CPUs. Xu et al. proposed an MPI-OpenMP-CUDA parallelization scheme in [3] to utilize both CPUs and GPUs for a complex, real-world CFD application using an explicit Runge-Kutta solver on the Tianhe-1A supercomputer, and achieved a speedup of about 1.3x when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs, and a parallel efficiency of above 60% on 1024 Tianhe-1A nodes.

On the other hand, many studies aim to exploit the capability of CPU + MIC heterogeneous systems for massively parallel computing of CFD applications. Ref. [4] gave a small-scale preliminary performance test for a benchmark code and two CFD applications on a 128-node CPU + MIC heterogeneous platform and evaluated the early performance results, where application-level testing was primarily limited to the case of a single machine node. In the same year, the authors in [5] made an effort to port their CFD code from a traditional HPC platform to the then newly installed Tianhe-2 supercomputer with CPU+MIC heterogeneous architecture, and the performance evaluations of massive CFD cases used up to 3072 machine nodes of the fastest supercomputer in that year. Ref. [6] reported large-scale simulations of turbulent flows on massively parallel accelerators including GPUs and Intel Xeon Phi coprocessors, and found that the different GPUs considered substantially outperform the Intel Xeon Phi accelerator for some basic OpenCL kernels of the algorithm. Ref. [7] implemented an unstructured-mesh CFD benchmark code on Intel Xeon Phis with both explicit and implicit schemes, and their results showed that a good scalability can be observed when using the MPI programming technique; however, the OpenMP multi-threading and MPI hybrid case remains untested in their paper. As a case study, Ref. [8] compared the performance of a high-order weighted essentially non-oscillatory scheme CFD application on both a K20c GPU and a Xeon Phi 31SP MIC, and the results showed that when the vector processing units are fully utilized the MIC can achieve performance equivalent to that of GPUs. Ref. [9] reported the performance and scalability of an unstructured-mesh-based CFD workflow on the TACC Stampede supercomputer and the NERSC Babbage MIC-based system using up to 3840 cores for different configurations.

In this paper, we aim to study the parallelization and performance-tuning techniques of a parallel CFD code on a heterogeneous HPC platform. A set of parallel optimization methods considering the characteristics of both the hardware architecture and typical CFD applications are developed. The portability and the device-oriented optimizations are discussed in detail. The test results of large-scale CFD simulations on the Tianhe-2 supercomputer with its CPU + Xeon Phi co-processor hybrid architecture show that a good performance and scalability can be achieved.

The rest of this paper is organized as follows. In Sect. 2 the numerical methods, the CFD code and the heterogeneous HPC system are briefly introduced. The parallelization and performance optimization strategies of large-scale CFD applications running on the heterogeneous system are discussed in detail in Sect. 3. Numerical simulations to evaluate the proposed methods, together with results and analysis, are reported in Sect. 4, and concluding remarks are given in Sect. 5.

2. Numerical Methods and High Performance Computing Platform

2.1. Governing Equations

In this paper, the classical Navier-Stokes equations are used to model three-dimensional viscous compressible unsteady flow. The governing equations in differential form in the curvilinear coordinate system can be written as:

    ∂Q/∂t + ∂(F − F_v)/∂ξ + ∂(G − G_v)/∂η + ∂(H − H_v)/∂ζ = 0,    (1)

where Q = (ρ, ρu, ρv, ρw, ρE)^T denotes the conservative state (vector) variable, F, G and H are the inviscid (convective) flux variables, and F_v, G_v and H_v are the viscid flux variables in the ξ, η and ζ coordinate directions, respectively. Here, ρ is the density, u, v and w are the Cartesian velocity components, and E is the total energy. All these physical variables are non-dimensional in the equations, and for the three-dimensional flow field they are vector variables with five components. The details of the definition and expression of each flux variable can be found in [10].

2.2. Numerical Methods

For the numerical method, a high-order weighted compact nonlinear finite difference method (FDM) is used for the spatial discretization. Specifically, let us first consider the inviscid flux derivative along the ξ direction. Using the fifth-order explicit weighted compact nonlinear scheme (WCNS-E-5) [11], its cell-centered FDM discretization can be expressed as

    ∂F_i/∂ξ = 75/(64h) (F_{i+1/2} − F_{i−1/2}) − 25/(384h) (F_{i+3/2} − F_{i−3/2}) + 3/(640h) (F_{i+5/2} − F_{i−5/2}),    (2)

where h is the grid size along the ξ direction, and the cell-edge flux (vector) variable F is computed by some kind of flux-splitting method combining the corresponding left-hand and right-hand cell-edge flow variables. The discretization of the other inviscid fluxes can be computed by a similar procedure. A fourth-order central differencing scheme is carefully chosen for the discretization of the viscid flux derivatives to ensure that all the derivatives have matching discretization errors. The reason for designing such a complex weighted stencil is to prevent numerical oscillations around discontinuities of the flow field.
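As an illustration only (not code from the in-house solver), the difference formula of Eq. (2) reduces to a short loop along each grid line once the cell-edge fluxes are available. In the sketch below, fh is a hypothetical array holding the cell-edge fluxes produced by the flux-splitting/interpolation step, with fh(i) standing for F_{i+1/2}.

    ! Minimal sketch of Eq. (2) for one grid line, assuming the cell-edge
    ! fluxes F_{i+1/2} are already stored as fh(i).
    subroutine wcns_e5_derivative(n, h, fh, dfdxi)
      implicit none
      integer, intent(in)  :: n          ! number of cell centers on this grid line
      real(8), intent(in)  :: h          ! grid spacing along the xi direction
      real(8), intent(in)  :: fh(-2:n+2) ! cell-edge fluxes, fh(i) = F_{i+1/2}
      real(8), intent(out) :: dfdxi(1:n) ! approximation of dF/dxi at cell centers
      integer :: i
      real(8), parameter :: c1 = 75.0d0/64.0d0, c2 = 25.0d0/384.0d0, c3 = 3.0d0/640.0d0

      do i = 1, n
         dfdxi(i) = ( c1*(fh(i)   - fh(i-1))   &   ! (F_{i+1/2} - F_{i-1/2})
                    - c2*(fh(i+1) - fh(i-2))   &   ! (F_{i+3/2} - F_{i-3/2})
                    + c3*(fh(i+2) - fh(i-3)) ) / h ! (F_{i+5/2} - F_{i-5/2})
      end do
    end subroutine wcns_e5_derivative

In the real scheme, the weighted nonlinear interpolation that produces the cell-edge values dominates the cost; the linear combination above is cheap and vectorizes well.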

When the flow field is composed of multiple grid blocks, in order to guarantee the requirement of high accuracy in the numerical approach, the grid points shared by adjacent blocks need to be dealt with carefully. As a result, the geometric conservation law proposed by Deng et al. [11] and some strict conditions on the boundaries of neighboring blocks must be satisfied for complex configurations; see [10, 11] for more details.

Once the spatial discretization is completed, we obtain the semi-discrete form of Eq. (1) as follows:

    ∂Q/∂t = −R(Q),    (3)

where R(Q) collects the discretized spatial terms of Eq. (1). The discretization in time of Eq. (3) results in a multiple-step advancing method. In this paper, the third-order explicit Runge-Kutta method is used. Letting the superscript (n) denote the n-th time step, the time stepping can be expressed as:

    Q^(n,1) = Q^(n) + λ R(Q^(n)),    (4)
    Q^(n,2) = (3/4) Q^(n) + (1/4) [ Q^(n,1) + λ R(Q^(n,1)) ],    (5)
    Q^(n+1) = (1/3) Q^(n) + (2/3) [ Q^(n,2) + λ R(Q^(n,2)) ].    (6)
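For illustration, one time step of Eqs. (4)-(6) can be coded as follows. This is a minimal sketch, not the production solver: it follows the equations literally, treats λ as the time-step increment, and assumes a user-supplied routine eval_r that returns R(Q) of Eq. (3).

    ! One step of the three-stage scheme of Eqs. (4)-(6).
    subroutine rk3_step(nq, q, lambda, eval_r)
      implicit none
      integer, intent(in)    :: nq
      real(8), intent(inout) :: q(nq)        ! conservative variables, all points packed
      real(8), intent(in)    :: lambda       ! time-step increment
      interface
         subroutine eval_r(nq, q, r)         ! r := R(q) of Eq. (3)
           integer, intent(in)  :: nq
           real(8), intent(in)  :: q(nq)
           real(8), intent(out) :: r(nq)
         end subroutine eval_r
      end interface
      real(8) :: q1(nq), q2(nq), r(nq)

      call eval_r(nq, q, r)
      q1 = q + lambda*r                                     ! Eq. (4)
      call eval_r(nq, q1, r)
      q2 = 0.75d0*q + 0.25d0*(q1 + lambda*r)                ! Eq. (5)
      call eval_r(nq, q2, r)
      q  = (1.0d0/3.0d0)*q + (2.0d0/3.0d0)*(q2 + lambda*r)  ! Eq. (6)
    end subroutine rk3_step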

2.3. Code implementation

An in-house CFD code capable of simulating three-dimensional multi-block structured-grid problems using the numerical method mentioned above has been developed for the case study in this paper. The solver contains more than 15,000 lines of Fortran 90 code. In our previous work [12], the code was parallelized and tuned for homogeneous HPC systems composed of multi-/many-core CPUs. Shortly after that, we made a simple port to a CPU + MIC heterogeneous system and conducted an early performance evaluation on the Tianhe-2 supercomputer [5]. In this study, we re-design and re-factor the code for the heterogeneous many-core HPC platform to achieve a better overall performance.

The flowchart of the CFD code is shown in Fig. 1. Similar to other CFD applications, our code consists of pre-processing, main, and post-processing parts. In the pre-processing phase, the CFD code reads the mesh data and initializes the flow field for the whole domain. Then the time-consuming main part, indicated by the time-marching loop in the flowchart, follows. In each time-step iteration of this phase, the linear systems resulting from discretizing the governing equations in the domain are solved by the numerical methods. When the convergence criterion is met during the loop, the post-processing phase follows and finalizes the simulation.

Figure 1: Flowchart of the CFD code. (Initialization; time-marching loop consisting of BC: boundary conditions, invFlux: calculation of inviscid flux derivatives, visFlux: calculation of viscous flux derivatives, eqnSolver: Runge-Kutta equation solver, CopyComm: intra-process data exchange, MPIComm: inter-process data exchange; convergence check; post-processing.)

The traditional parallelization method for the CFD program is mainly based on the technique of domain decomposition, and it can be implemented on shared-memory and/or distributed-memory systems employing the respective programming models, such as the OpenMP multi-threaded model and the MPI multi-process model. In these implementations, the same solver is run for each domain, and some boundary data of neighboring domains need to be exchanged mutually during the iterations.
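A minimal, self-contained skeleton of this "one or more mesh zones per MPI process, several OpenMP threads per process" pattern is shown below. It is only the programming pattern, not the in-house code; the real solver advances its own mesh zones and performs the boundary-data exchange inside the region indicated by the comments.

    program hybrid_skeleton
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, rank, nprocs, nthreads, provided

      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      !$omp parallel
      !$omp master
      nthreads = omp_get_num_threads()
      print *, 'rank', rank, 'of', nprocs, 'uses', nthreads, 'threads'
      !$omp end master
      ! ... each rank would advance its own mesh zone(s) here, exchanging
      ! ... boundary data with neighbouring ranks once per iteration
      !$omp end parallel

      call MPI_Finalize(ierr)
    end program hybrid_skeleton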
3. Parallelization and Optimization Methods

3.1. Multi-level parallelization strategy considering both application features and hardware architecture

In order to explore the potential capability of modern HPC systems and improve the scalability of applications, a typical CFD parallel simulation running on these systems is based on a multi-level and/or multi-granularity hybrid parallelization strategy. Heterogeneous HPC systems using more than one kind of processor or core are becoming a trend. These systems gain performance or energy efficiency not just by adding more of the same type of processors, but by adding different accelerators and/or coprocessors, usually incorporating specialized processing capabilities to handle particular tasks. For heterogeneous computer systems equipped with accelerators and/or coprocessors, a common paradigm is the "CPU + accelerator and/or coprocessor" heterogeneous programming model. For example, the "MPI + OpenMP + CUDA" hybrid heterogeneous model is used when GPU accelerators are employed, and the "MPI + OpenMP + Offload" hybrid heterogeneous model is used instead when MIC coprocessors are employed. In both circumstances, a proper decision on how to divide the simulation task into many sub-tasks and map them to the hardware units of the heterogeneous computer system should be carefully made in order to achieve good load balancing, high efficiency and parallel speedup.

3.1.1. Principles of multi-level parallelization in typical CFD applications

Multi-level parallelization according to different granularities is the main means to improve the performance of applications running on high-performance computers. In typical CFD applications, we try to exploit the parallelism at the following levels, from the coarsest granularity to the finest.

(1) Parallelism of multiple zones of the flow field. At this level, the whole domain of the flow field to be simulated is decomposed into many sub-domains, and the same solver can run on each sub-domain. Toward this goal, it is common practice to partition the points/cells of the discretized grid/mesh in the domain into many zones; as a result, the parallelism of data processing among these zones can be exploited. It is worth noting that, for the purpose of workload balancing, it is far from trivial to partition the grid/mesh and map the zones to hardware units in a modern computer system consisting of a variety of heterogeneous devices with different computing capabilities. There are also some obstacles in the domain decomposition. Among others, the convergence of an implicit solver for the linear system would degrade dramatically, which means more computational work is needed to make up for this, thus resulting in a longer time to solution.

(2) Parallelism of multiple tasks within each zone. Considering the simulation process of each zone resulting from the domain decomposition mentioned above, there is still a series of processing phases, including updating the convection flux variables and the viscid flux variables in the three coordinate directions, respectively. The fact that no data dependency exists among these six phases (i.e. 3 convection flux calculation phases plus 3 viscid flux calculation phases) implies the concurrency and the parallelism of multiple tasks.

(3) Parallelism of single-instruction-multiple-data. Even within each zone and each processing phase, the computing characteristics are highly similar among all grid points/cells, and lots of data distributed over different grid points/cells can be processed by the same computer instruction. Besides the parallelization, some architecture-oriented performance-tuning techniques, such as data blocking, loop-based optimization, and so on, are also applied at this level.

3.1.2. Parallelization of CFD applications in homogeneous systems

In typical homogeneous computing platforms, the multi-level parallel computing of traditional CFD applications is mainly implemented by a combination of the MPI multi-process programming model, the OpenMP multi-threaded programming model and single-instruction-multiple-data (SIMD) vectorization techniques.

For the coarse-grain parallelism at the domain-decomposition level, as shown in Fig. 2(a), the classical parallel computation uses a static mesh-partitioning strategy to obtain more mesh zones. For the commonly used structured grid, each mesh zone of the flow-field domain is actually a block of grid points/cells. In the following mapping phase, each mesh zone is assigned to an executing unit, either a running process (in the MPI case) or a running thread (in the OpenMP multi-threading case), and these executing units are further attached to specific hardware units, namely CPUs or cores. In practice, an "MPI + OpenMP" hybrid method is applied in order to better balance the computational workload and avoid generating too-small grid blocks (which usually lead to too-small workloads). This method does not need to over-divide the existing mesh blocks, and the workload can be fine-adjusted at the thread level by specifying a number of threads in proportion to the number of points/cells in the block.

For the parallelism of intermediate granularity, as shown in Fig. 2(b), multi-threaded parallelization based on the task-distribution principle using the OpenMP programming model is applied. The most time-consuming computation in a single iteration of a typical CFD solver involves the calculation of the convection flux variables (the invFlux procedure shown in Fig. 1), the viscous flux variables (the visFlux procedure), the large sparse linear-system solver (the eqnSolver procedure) and other procedures. The parallelization at this granularity can be further divided into two sub-levels: (1) at the upper sub-level, the computations in the invFlux and visFlux procedures along the three coordinate directions naturally form six concurrent task workloads and thus can be parallelized on the computing platform (a sketch is given below); (2) at the lower sub-level, considering the processing within each procedure, when discriminating between grid points/cells on the boundary of the zone and the inner ones of the zone, it can easily be found that, by rearranging the code structure and adjusting the order of calculation, there is more potential parallelism to exploit based on some loop transformation techniques [12].

For the fine-grain parallelism within a single zone, as shown in Fig. 2(c), because the computation of each zone is usually assigned to a specific execution unit (process or thread), and thus also mapped to a CPU core, the traditional SIMD (single instruction multiple data) parallelization can be applied at this level. For example, for the Tianhe-2 supercomputer system, one of our test platforms, the CPU processor supports 256-bit wide vector instructions and the MIC coprocessor supports 512-bit wide vector instructions, which provides a great opportunity to exploit instruction-level parallelism based on the SIMD method. In our previous experience, SIMD vectorization can usually bring a 4X-8X performance improvement measured in double-precision floating point in a typical CFD application.
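The upper sub-level concurrency described above can be expressed directly with OpenMP worksharing constructs. The following self-contained sketch shows the shape of such a phase; the routine names echo the procedures mentioned in the text (see also Sect. 3.2.1), but their bodies here are only placeholders, not the real flux computations.

    program six_task_concurrency
      implicit none
      !$omp parallel sections
      !$omp section
      call invFlux_X()
      !$omp section
      call invFlux_Y()
      !$omp section
      call invFlux_Z()
      !$omp section
      call visFlux_X()
      !$omp section
      call visFlux_Y()
      !$omp section
      call visFlux_Z()
      !$omp end parallel sections
    contains
      subroutine invFlux_X()
        print *, 'inviscid flux derivatives, xi direction'
      end subroutine invFlux_X
      subroutine invFlux_Y()
        print *, 'inviscid flux derivatives, eta direction'
      end subroutine invFlux_Y
      subroutine invFlux_Z()
        print *, 'inviscid flux derivatives, zeta direction'
      end subroutine invFlux_Z
      subroutine visFlux_X()
        print *, 'viscous flux derivatives, xi direction'
      end subroutine visFlux_X
      subroutine visFlux_Y()
        print *, 'viscous flux derivatives, eta direction'
      end subroutine visFlux_Y
      subroutine visFlux_Z()
        print *, 'viscous flux derivatives, zeta direction'
      end subroutine visFlux_Z
    end program six_task_concurrency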

3.1.3. Parallelization of CFD applications in heterogeneous systems

A heterogeneous system consists of more than one kind of processor or core, which can result in huge differences in computing capability, memory bandwidth and storage capacity among the different sub-systems. For the parallelization of CFD applications on such heterogeneous systems, in addition to the implementation methods for the homogeneous platforms introduced in the previous section, more attention should be paid to the cooperation between the main processors and the accelerators/coprocessors. The strategies of load balancing, task scheduling, task distribution and the programming model in such circumstances should be carefully re-checked, or even re-designed. Let us take the Tianhe-2 supercomputer system as an example. It consists of 16,000 machine nodes, and each node is equipped with both CPUs as the main processors and MICs as the coprocessors. The task distribution and collaboration between CPUs and MICs are implemented by the so-called "offload programming model" with the supporting libraries provided by Intel.

As illustrated in Fig. 2(d), task scheduling and load balancing can be divided into three layers. The computing tasks of a multi-zone grid CFD application, shown at the topmost layer (namely the "application layer"), are first organized and assigned to multiple processes and/or threads (shown at the "executing layer"), and thus mapped to specific hardware devices, say, a core of either a CPU or a MIC, at the "hardware layer". For simplicity and ease of implementation, the task assignment, scheduling and mapping between adjacent layers can be built in a static mode.

Considering the fact that each machine node of the Tianhe-2 system contains two general-purpose CPUs and three MIC devices, in order to facilitate a balanced load between these two kinds of computing devices, we designed the following configuration of parallelization: multiple processes are distributed to each machine node via the MPI programming model, and many OpenMP threads run within each process to take advantage of the many-core features of both CPUs and coprocessors. Offloading computational tasks onto the three MIC devices is carried out by one of the OpenMP threads, namely the main thread. Using this arrangement, each process can distribute some share of its work to the corresponding MIC devices, and it is easier to utilize the capabilities of all MIC devices equipped on the machine nodes. The communication between CPUs and MIC devices is implemented by the pragmas or directive statements as well as the support libraries provided by the vendor. In typical programming practice, the main thread of a CPU process picks up some data to offload onto the MIC devices, and the MIC devices return the resulting data back to the main thread running on the CPU.

The load balancing between CPU and MIC is a big challenge in these applications, and a simple static strategy is applied in our implementation. We introduce a further intermediate layer in the "application layer" shown in Fig. 2(d), which organizes the mesh zones into groups. As an example, five individual zones, say G0, G1, G2, G3 and G4, within a group of mesh zones can further be divided into two sub-groups {{G0, G1}; {G2, G3, G4}} according to their volumes and shapes, where G0 and G1 have almost equal volume and are assigned to the two CPUs of the process, respectively, while G2, G3 and G4 have another, nearly equal size and the computing tasks on them are assigned to the three MIC devices, respectively. As an extra benefit of introducing the intermediate layer, regardless of whether the parallelization and optimization are carried out on the CPU or on the MIC device, the strategies applied at the finer-grain levels, i.e. thread level, SIMD level, etc., can be handled in a similar way.

Figure 2: Multi-granularity parallelization of typical CFD problems (a-c), and its specific multi-level mapping method (d).

3.2. Implementations of parallelization and optimization for both homogeneous and heterogeneous platforms

3.2.1. Load balancing method of parallel computing

The load balancing of a parallel CFD simulation mainly results from re-partitioning the existing mesh of the whole flow-field domain. Large-scale parallel CFD simulation imposes great requirements on the routine grid-partitioning procedure. For example, when submitting such a simulation job to a specific HPC platform, the available computing resources, such as the number of available CPU nodes and/or cores, the capacity of the memory system, etc., vary and depend on the status in situ. Among other options, a flexible solution is to partition the original mesh zones into many small-sized blocks and then re-group these blocks into the final groups of blocks. We developed a pre-processing software tool to fulfill this task, and the details of the algorithm and implementation can be found in [13]. In addition, some re-partitioning techniques are introduced in a second processing pass for fine-adjusting the workload of computation and communication in the solution phase.

For the load balancing of the OpenMP multi-threaded parallelization, two task scheduling strategies are adopted in the CFD applications. The static task scheduling strategy, based on distributing the workload equally among threads, is applied in the iteration parallel regions. However, for the concurrent execution of the invFlux_X, invFlux_Y, invFlux_Z, visFlux_X, visFlux_Y and visFlux_Z procedures as described in Sect. 3.1.2, the dynamic task scheduling strategy is more appropriate due to the differences in the number of grid points/cells and the amount of computation among these procedures.

The load balancing between different computing devices, such as CPU and MIC, is another issue for the heterogeneous computing platform. A static task assignment method is applied to address this issue, and an adjustable ratio parameter, which denotes the ratio of the workload assigned to one MIC device to that assigned to one CPU processor, is introduced to match the different computing platforms, guided by performance data measured in real applications (a small sketch of this split is given below). We will give examples and results for a series of test cases to discuss this issue in Sect. 4.3.
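The arithmetic behind this static split is simple; the hedged sketch below only illustrates how a measured ratio r translates into a per-device cell count for one node (2 CPUs + 3 MICs on Tianhe-2). The numbers used are examples, and the real implementation works on whole block groups produced by the pre-processing tool of [13] rather than on raw cell counts.

    program static_split
      implicit none
      integer, parameter :: ncpu = 2, nmic = 3   ! devices per Tianhe-2 node
      real(8) :: r                               ! MIC/CPU workload ratio (measured)
      real(8) :: cells_per_node, cells_per_cpu, cells_per_mic

      r = 0.6d0                  ! example value; see the measurements in Sect. 4.3
      cells_per_node = 40.0d6    ! example: total cells assigned to one machine node
      ! split so that each MIC receives r times the cells of each CPU
      cells_per_cpu = cells_per_node / (ncpu + r*nmic)
      cells_per_mic = r * cells_per_cpu
      print '(a,f12.1)', 'cells per CPU: ', cells_per_cpu
      print '(a,f12.1)', 'cells per MIC: ', cells_per_mic
    end program static_split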
3.2.2. Multi-threaded parallelization and optimization

It is well known that an increasing number of processor cores are integrated into a single chip to gain computing capability and to balance power and performance in modern computer platforms. On these many-core computer systems, the multi-threaded programming model is widely used to effectively exploit the performance of all processor cores. Thus, the parallelization and optimization of multi-threaded CFD simulation on many-core platforms must be implemented carefully. We emphasize the following methods of parallelization and optimization.

(1) Achieve a balance between the granularity and the degree of parallelism. In order to achieve this goal in typical CFD applications, the OpenMP multi-threaded parallelization is applied to the computations over the mesh points/cells within each single zone by splitting the iteration space of the zone in the three coordinate directions, as illustrated in Fig. 3.

(2) Reduce the overhead of thread creation and destruction. If possible, the OpenMP pragma/directive that creates the multi-threaded region should be placed at the outer loop of the nested loop, to reduce the additional overhead caused by repeated dynamic creation and destruction of threads.

(3) Reduce the amount of memory occupied by a single thread. As is often done in general scientific and engineering computing, many variables in an OpenMP multi-threaded region are declared as private variables to ensure the correctness of the computations. As a result, the memory footprint of the application can become so large that the performance of memory access deteriorates. This issue is especially severe on the MIC coprocessor because it has a smaller storage capacity. To address it, we try to minimize the amount of memory allocated for private variables and let each thread allocate, access and deallocate only the necessary data, to obtain the maximum performance gains.

(4) Bind the threads to the processor cores on NUMA architectures. Ensuring a static mapping and binding between each software thread and a hardware processor core on a computing platform with NUMA architecture can improve the performance of CFD applications dramatically. This can be achieved by using a system call provided by the operating system or the affinity facilities provided by the OpenMP runtime on the CPU platform, and by setting some environment variables supported by the vendor on the MIC platform.

    !$OMP PARALLEL DO COLLAPSE(LEVEL)   ! LEVEL = 1/2/3
    do kb = 1, nk, kblksize
      do jb = 1, nj, jblksize
        do ib = 1, ni, iblksize
          do k = kb, kb + kblksize - 1
            do j = jb, jb + jblksize - 1
              !DIR$ SIMD
              do i = ib, ib + iblksize - 1
                ... update u(i, j, k) ...
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo

Figure 3: Data blocking used in the parallelization and optimization.

3.2.3. Multi-level optimization of communication

In modern computer architectures, gains in processor performance, via increased clock speeds, have greatly outpaced improvements in memory performance, causing a growing discrepancy known as the "memory gap" or "memory wall". Therefore, as time has passed, high-performance applications have become more and more limited by memory bandwidth, leaving fast processors frequently idle as they wait for memory. As a result, minimizing the data communication, including data movement and data transfer, between different hardware units is an effective optimization for applications running on modern HPC platforms. On the other hand, it is well known that in traditional parallel CFD applications there is a lot of data communication between different machine nodes, between different processes running on the same node, between CPU cores and MIC devices, and so on. One of the keys to effectively decreasing the cost of data communication is therefore to shorten the communication time as much as possible, or to "hide" the communication cost by overlapping the communication with other CPU-consuming work.

In order to reduce the number of communications and the time cost per communication, a reasonable "task-to-hardware" mapping is established. When assigning the computational tasks of the mesh grid to the MPI processes and scheduling the MPI processes onto the machine nodes or processors at run time, we prefer to distribute blocks that are adjacent in the spatial topology to the same process or the same machine node if possible, thus minimizing the number of communications across different nodes. If multiple communications between a pair of processes can be done concurrently, the messages are packed so that only one communication is needed. Furthermore, the non-blocking communication mode is used instead of blocking communication, in conjunction with overlapping the communication and the computation, to hide the cost of communication fully or partially (a minimal sketch of this pattern is given below).

In large-scale CFD applications with multi-block meshes and a high-order accuracy solver, there is another issue with some neighboring mesh points. These so-called singular mesh points, located on common boundaries, are shared by three or more adjacent mesh blocks. When wider-stencil schemes are used, as is usually done in high-order accuracy CFD applications, many very small messages are needed for communication, which brings additional overhead. An optimization method to address this issue is discussed in [14] by introducing non-blocking point-to-point communication.
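A minimal sketch of the non-blocking, overlapped halo exchange is given below. It is not the solver's actual communication layer: a 1-D slab per rank and a single-cell halo are assumed purely for illustration.

    program halo_overlap
      use mpi
      implicit none
      integer, parameter :: n = 1024
      real(8) :: u(0:n+1)
      integer :: ierr, rank, nprocs, left, right, i
      integer :: reqs(4)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = rank - 1; if (left  < 0)       left  = MPI_PROC_NULL
      right = rank + 1; if (right >= nprocs) right = MPI_PROC_NULL
      u = real(rank, 8)

      ! post the halo exchange first ...
      call MPI_Irecv(u(0),   1, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_Irecv(u(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, reqs(2), ierr)
      call MPI_Isend(u(1),   1, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, reqs(3), ierr)
      call MPI_Isend(u(n),   1, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(4), ierr)

      ! ... overlap: interior cells do not depend on the halo values
      do i = 2, n-1
         u(i) = u(i) + 1.0d0
      end do

      call MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE, ierr)
      ! boundary cells can now use the received halo values u(0) and u(n+1)
      u(1) = u(1) + 1.0d0
      u(n) = u(n) + 1.0d0

      call MPI_Finalize(ierr)
    end program halo_overlap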
Another main obstacle to performance improvement for CFD applications on a CPU + MIC heterogeneous platform with the offload programming model is that a large amount of data needs to be transferred back and forth between the CPU and the MIC coprocessor over the PCIe interface, which typically has a lower bandwidth than memory access. This can be optimized in two ways. Firstly, we use an asynchronous instead of a synchronous mode to transfer data between the CPU and the MIC coprocessor, and start the data transfer as soon as possible, so as to partially or completely hide the data-transfer overhead by overlapping the CPU work with the data transfer. Secondly, we optimize the communication between the CPU and the MIC coprocessor by reusing data that has either been transferred to the MIC device earlier or has resulted from previous computations, as long as it can be reused in the following processing phase, such as in the next loop of an iterative solver. If only a part of the data in an array will be updated on the MIC devices, we can use the compiler directive statement !Dir$ offload supported by the Intel Fortran compiler to offload only a slice of that array to the coprocessor. In other cases, when transferring some induced variables which can be calculated from other primary variables, only the primary data are transferred to the coprocessor, along with offloading the computing procedures that calculate the induced data to the coprocessor. This means that we recalculate the derived data on the coprocessor instead of transferring it between the CPU and the MIC device directly.

3.2.4. The optimization of wide vector instruction usage

Both the CPUs and the MIC coprocessors of a modern computer system support wide vector instruction sets to utilize SIMD (single instruction multiple data), or vectorization, techniques at the instruction level. For example, on the Tianhe-2 supercomputer, the CPU supports 256-bit wide vector instructions and the MIC coprocessor supports 512-bit wide vector instructions. Vectorization and its optimization is one of the key methods to improve the overall floating-point computational performance. In our previous experience, it can theoretically bring a 4x-16x speedup, depending on the precision of the floating-point operations and the specific type of processor used for the CFD simulation. In development practice, we use compiler options, such as the -vec option of the Intel Fortran compiler, in addition to user-specified compiler directive statements, to accomplish this goal. In fact, most modern compilers can perform this kind of vectorization automatically, and what the user should do is check the optimization report generated by the compiler, pick out the candidate code fragments that were not auto-vectorized, and add proper advanced directives or declarations to vectorize them manually, provided that correctness is ensured.
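As a generic illustration (not a kernel from the solver), the following loop is marked for SIMD execution with an OpenMP directive; in practice this is combined with the compiler options and vectorization reports mentioned above.

    subroutine axpy_cells(n, a, x, y)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: a
      real(8), intent(in)    :: x(n)
      real(8), intent(inout) :: y(n)
      integer :: i

      !$omp simd
      do i = 1, n
         y(i) = y(i) + a*x(i)    ! same instruction applied to many grid points
      end do
    end subroutine axpy_cells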
3.2.5. CPU + MIC collaborative parallelization and optimization

There are significant differences in computing capability, storage capacity, network bandwidth and latency, etc., between the different types of devices of a heterogeneous platform, so we must carefully analyze, design and organize the appropriate order of computation, memory access and communication to achieve a collaborative parallel simulation of CFD applications on such a heterogeneous platform. The collaborative parallelization is mainly applied within a single machine node of the high-performance computer system.

The load balancing between CPUs and MIC coprocessors is designed via the assignment of the amounts of mesh and tasks, as described in Sect. 3.1.3. All the grid blocks assigned to each process are divided into two types of groups according to their volume and size, i.e. the total number of mesh points/cells: the first type of mesh groups, consisting of two groups, is assigned to the CPUs of the machine node, and the second type, consisting of the remaining three groups, is assigned to the MIC coprocessors of the same machine node, as shown in Fig. 4. By adjusting the ratio of the sizes of the two types of grid blocks, the best value for load balancing can be obtained. On the CPU side, multiple threads are created through the OpenMP programming model to utilize the capability of the many-core processors. The main thread running on the CPU is responsible for interacting with the MIC devices, and the rest of the threads are responsible for the computing work distributed to the CPUs. A series of parallel optimizations of collaborative computing are applied based on this framework; a sketch of the basic node-level arrangement is given below, and the individual optimizations follow.

Figure 4: Task assignments and mapping on CPU + MIC heterogeneous platforms.
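The node-level arrangement can be pictured with the following self-contained OpenMP sketch: the master thread of the process drives the MIC offloads while the remaining threads handle the CPU block groups. The offload directives themselves are omitted here (a hedged sketch of them is given after Fig. 5); the print statements only mark which role each thread plays.

    program node_level_split
      use omp_lib
      implicit none
      integer :: tid

      !$omp parallel private(tid)
      tid = omp_get_thread_num()
      if (tid == 0) then
         ! master thread: would issue the offload directives that send the three
         ! MIC block groups to the coprocessors and collect the results
         print *, 'thread', tid, ': driving the MIC offloads'
      else
         ! remaining threads: share the computation of the CPU block groups
         print *, 'thread', tid, ': computing CPU blocks'
      end if
      !$omp end parallel
    end program node_level_split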

(1) Reducing the overhead of offloading tasks to the MIC devices. Due to the limited bandwidth of the communication between CPUs and MIC devices via the PCIe interface, one effective method to improve performance is to reduce the number of offload operations and increase the volume of data transferred in each one between these two types of devices. We can refactor the code and re-organize the small kernels to be executed on the MIC devices into kernels that are as large as possible. By this method, all the tasks can be completed through a single offload instead of many offload operations that would cause much more data-movement overhead. Furthermore, when using the compiler directive statement !Dir$ offload in the Fortran code, we add alloc_if, free_if and other similar clauses as much as possible to avoid repeatedly allocating and deallocating space for the big arrays offloaded to the MIC device (a hedged sketch of these directives is given after Fig. 5). Moreover, the initialization and allocation of variables should be arranged as early as possible, at latest in the "warm-up" stage prior to the main iterations of the computation.

(2) Overlapping different levels of communication. In the flowchart shown in Fig. 1, once each eqnSolver procedure is completed, data exchange and synchronization are needed among all the blocks distributed over different processes and/or threads, even within the same machine node. These communications can be further divided into two categories: inter-process MPI data communication and the data transfer between CPUs and MIC coprocessors. If the mesh partitioning is carefully designed based on the configuration of the CPUs and MIC coprocessors, we can avoid any dependence between the inter-process MPI communications and the CPU-MIC data transfers. This enables us to overlap them via asynchronous communication and thus further hide the communication overhead. For example, suppose a one-dimensional mesh-partitioning strategy is adopted as shown in Fig. 4, and five neighboring blocks (or block groups), namely {b_i : i = 0,...,4}, are assigned to the same computing node. Suppose also that only the left-most block b0 and the right-most block b4 are assigned to the CPU processors, and the remaining three blocks b1, b2, b3 are assigned to the MIC coprocessors. It is apparent that only the CPUs holding the data of the boundary blocks {b0, b4} need to exchange data with neighboring nodes by MPI communication, while the data exchange among b1, b2 and b3 can be confined within the node via offloading between the CPUs and the MICs. As shown in Fig. 5, the total communication overhead can be effectively reduced by this asynchronous overlapping technique.

(3) Parallelizing the processing of heterogeneous processors. There is inherent concurrency between the two types of computing devices, namely CPU and MIC coprocessor, and among different devices of the same type. This means that if the computational tasks are launched asynchronously among these heterogeneous devices, both CPUs and MICs, parallel execution of the computing tasks and an overall performance improvement can be gained.

The communication overlapping and the minimization of network data movement at multiple levels and phases mentioned in this subsection are illustrated in Fig. 5.

Figure 5: Cooperative parallel optimization of CPU + MIC on a single node: (a) no optimization; (b) overlapping the computations and communication among CPUs and MICs.
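The following hedged sketch illustrates the data-management ideas of items (1) and (2) using the Intel offload directives as we understand them; the exact clause spellings (alloc_if, free_if, signal, wait, nocopy) should be checked against the compiler documentation, and the array and the computation here are placeholders rather than solver data.

    subroutine offload_sketch(n, a, niter)
      implicit none
      integer, intent(in)    :: n, niter
      real(8), intent(inout) :: a(n)
      integer :: it, sig

      ! allocate the device copy once, before the main iterations ("warm-up" stage)
      !dir$ offload_transfer target(mic:0) in(a : alloc_if(.true.) free_if(.false.))

      do it = 1, niter
         ! start refreshing the device copy asynchronously ...
         !dir$ offload_transfer target(mic:0) in(a : alloc_if(.false.) free_if(.false.)) signal(sig)
         ! ... CPU-side work that does not need the device copy of 'a' can run here ...

         ! compute on the coprocessor once the transfer has arrived, reusing the buffer
         !dir$ offload begin target(mic:0) wait(sig) nocopy(a : alloc_if(.false.) free_if(.false.))
         a = 2.0d0*a
         !dir$ end offload

         ! copy the result back while keeping the device allocation alive
         !dir$ offload_transfer target(mic:0) out(a : alloc_if(.false.) free_if(.false.))
      end do

      ! final copy-out that also releases the device buffer
      !dir$ offload_transfer target(mic:0) out(a : alloc_if(.false.) free_if(.true.))
    end subroutine offload_sketch

The persistent device allocation avoids the repeated allocate/free cost on every iteration, and the signal/wait pair lets the host continue with independent work while the transfer is in flight.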

In addition to the aforementioned parallelization and optimization, the traditional performance-tuning methods for a single processor, such as cache-oriented memory optimization, data alignment, asynchronous I/O operations, buffering, etc., are also applicable to large-scale parallel CFD applications. Another issue worth noting is that in extremely large-scale CFD simulations, which take a long time to complete, the mean time between failures (MTBF) of the high-performance computer system decreases dramatically. As a result, it is necessary to build a fault-tolerance mechanism both at the system level and at the application level; however, this is beyond the scope of this paper.

4. Numerical Experiments and Discussions

4.1. Platform and configurations of numerical experiments

In order to evaluate the performance of the various parallelization and optimization methods of the previous section, we designed and ran a series of numerical experiments, including early results obtained on the Tianhe-1A supercomputer system, a typical many-core CPU homogeneous platform, and the latest results on the Tianhe-2 supercomputer with its CPU + MIC heterogeneous architecture. The basic information of these two platforms is briefly summarized in Table 1. As shown in the table, the Tianhe-2 system is composed of 16,000 computing nodes, and each node has a heterogeneous architecture, including two Intel Xeon E5-2692 processors and three Intel Xeon Phi 31S1P coprocessors with the Many Integrated Core (MIC) architecture.

Table 1: The configurations of the Tianhe-1A and Tianhe-2 platforms

                              Tianhe-1A                            Tianhe-2
  CPU                         Intel Xeon X5670 (6 cores per CPU)   Intel Xeon E5-2692 (12 cores per CPU)
  Frequency of CPU            2.93 GHz                             2.2 GHz
  Configuration per node      2 CPUs                               2 CPUs + 3 MICs
  Memory capacity per node    48 GB                                64 GB for CPUs + 24 GB for MICs
  Coprocessors used           not used in this paper               Intel Xeon Phi (MIC) 31S1P (57 cores per MIC)

The CFD application used for the large-scale parallel tests is the in-house code for multi-block structured-grid problems, which is developed in the Fortran 90 programming language and built with version 13 of the Intel Fortran compiler. The -O3 compiler option is applied on the CPU in all test cases unless otherwise noted, and the -O3 -xAVX compiler options are applied for the tests on the CPU + MIC heterogeneous platform in order to generate the correct vectorization instructions for the coprocessor.

There are four configurations of test cases in the following discussion. (1) The "DeltaWing" case simulates the flow field around a delta wing, and has 44 grid blocks and a total of 2.4 million grid cells. (2) The "NACA0012" case simulates the flow field around the NACA0012 wing; it has a single grid block and a total of 10 million grid cells. (3) In the "DLR-F6" case, the number of grid cells is 17 million. (4) In the "CompCorner" case (Fig. 11), the flow field of the compressible corner problem is computed [15] for a variety of problem sizes measured by the number of grid cells. For this purpose, a special three-dimensional grid-generating software tool, which can vary the total number of grid cells, the number and the connectivity of the grid blocks, and so on, was developed to generate the different configurations for the tests.

All performance results reported in this section are the best ones taken from five independent runs, and the timing results are only for the phase of the main iterations in the flowchart shown in Fig. 1. We also limit the number of main iterations to no more than 50 for the purpose of performance comparison.

4.2. Performance results on homogeneous platforms

The Tianhe-1A supercomputer system, as a typical homogeneous HPC platform, is used in this subsection for the performance evaluation of the following tests.

We first designed a group of numerical experiments to test the methods of load balancing and communication optimization proposed in Sect. 3.1, and the "NACA0012" case with 10 million mesh cells was employed on the Tianhe-1A supercomputer system. For this purpose, the test uses 64 symmetric MPI processes, one per machine node, and we evaluate the average wall time of both computation and communication per iteration for the two versions of the application, namely without and with optimization, respectively. The performance results are shown in Fig. 6. In the CFD simulation of the original version (without optimization), the time of communication among MPI processes is about 63% of the total time per iteration. In the tuned version, as described in Sect. 3.1, we eliminate redundant global communication operations and maximize the use of non-blocking communication by refactoring the code and overlapping communication and computation as much as possible. These optimizations significantly reduce the total overhead of communication, resulting in a nearly 10-fold increase of the ratio of computation to communication.

Figure 6: Comparison of the ratio of computation to communication between the original and tuned versions (average time per iteration; original: about 6.08 s computation and 10.35 s communication, tuned: about 6.09 s computation and 1.07 s communication).

For the multi-threaded parallelization and its optimization discussed in Sect. 3.2, among other methods, the impact of the affinity (or binding) of OpenMP threads to CPU cores on parallel CFD application performance is the focus of our tests. The "DeltaWing" case is employed on the Tianhe-1A platform for this purpose. The results (not shown here) indicate that using a thread-to-CPU-core binding strategy can significantly improve the performance of the parallel simulation; the more threads used per process, the greater the performance gain. Additional numerical experiments evaluating the impact of the other thread-level parallel optimizations proposed in Sect. 3.2, such as data blocking, reducing the memory footprint, etc., were also conducted (the results are not shown here). Although the results indicate some minor performance improvements from these optimizations, the thread-affinity strategy shows the better gains.

To assess the overall effect of the optimization methods proposed in Sect. 3.2 for the homogeneous system, the performance of two test cases with different runtime configurations on the Tianhe-1A system is reported in Fig. 7, where the horizontal axis is the number of CPU cores used in the test, and the vertical axis is the speedup relative to the baseline. In Fig. 7(a), the DLR-F6 case with 17 million grid cells is tested and the baseline is the performance when using two CPU cores. As a larger-scale case, the CompCorner case with 800 million grid cells is used in Fig. 7(b), and the performance with 480 CPU cores is taken as the baseline. In both tests, two MPI ranks with six threads each are created on each machine node to make full use of the CPU cores of each machine node.

As can be seen from Fig. 7(a), for a medium-sized CFD application like the DLR-F6 case, the parallel speedup grows essentially linearly when the number of CPU cores is no more than 256. However, further increasing the number of CPU cores decreases the parallel speedup significantly due to the lower ratio of computation to communication for each CPU core. More specifically, there are two reasons. Firstly, increasing the number of processes significantly increases the number of singular points shared by neighboring MPI processes, which leads to much more MPI communication overhead resulting from a large number of tiny messages transferred among machine nodes [14]. Secondly, for a fixed-size CFD problem, using more CPU cores means decreasing the size of the sub-problem running on each CPU core (in a typical situation, the number of grid cells assigned to each CPU core is then less than 10,000) as well as the size of the sub-problem of each thread, which further degrades the parallel performance. In Fig. 7(b), the test case has a larger size of 800 million mesh cells, running on configurations of 480, 600, 960, 1200 and 2400 CPU cores, respectively. Since the problem size in each grid block is large enough (more than 10 million grid cells), the ratio of computation to communication is relatively high, and the relative parallel speedup has a near-linear relationship with the number of CPU cores used in the simulations.

Figure 7: Parallel speedup curves for two examples: (a) DLR-F6 case with 17 million mesh cells; (b) CompCorner case with 800 million mesh cells.

In order to compare the parallel performance of the CFD simulation under different configurations with a variety of process and thread combinations, we conducted comprehensive tests for the medium-sized DLR-F6 case (17 million grid cells), as shown in Fig. 8. Fig. 8(a) shows the relationship between the running time per iteration and the number of CPU cores used in different configurations of MPI ranks and OpenMP threads, where we omit the results when the number of threads per process exceeds 4. In each specific combination, the total number of CPU cores is kept equal to the total number of threads. Some facts can be seen from the results. Firstly, when the number of CPU cores is less than 256, where a linear parallel speedup is observed, there is no significant performance variation among different combinations of processes and threads as long as the total number of cores is fixed. Secondly, when we limit the maximum number of threads per process to no more than 3, the parallel efficiency stays high as the number of cores increases, and then declines after the number of cores reaches 256. In contrast, when limiting the maximum number of threads per process to no more than 4, the critical point of parallel efficiency is extended to 1024 CPU cores. In both cases the number of MPI processes is limited to 256, beyond which the parallel efficiency begins to decline. This also shows that the cost of communication among MPI processes becomes the main obstacle to scaling the parallel simulation further.

In Fig. 8(b) we rearrange the performance results as the running time versus the number of threads per MPI process. It shows that improving the parallel performance only by increasing the number of threads has an upper limit. For example, in the case of 256 MPI processes, the best performance is achieved with 3 threads per process. However, in the case of 512 MPI processes, the performance shows no further improvement when more than 2 threads per process are used. The main reason is again the low ratio of computation to communication in each MPI process when more and more threads are used. In fact, for the case of 512 processes, only about 30,000 grid cells are distributed to each process, which is indeed a light load for the powerful CPU. If more threads are used in this case, the additional overhead they bring is higher than the performance gain from the accelerated computation, and thus the overall performance declines.

Figure 8: Performance results for different combinations of processes and threads: (a) running time vs. configurations of MPI ranks and OpenMP threads; (b) running time vs. number of threads per MPI process.

4.3. Performance results on heterogeneous platforms

We use the collaborative parallelization methods proposed in Sect. 3.2 for the tests on the CPU + MIC heterogeneous platform. That is, only one MPI process runs on each machine node, with OpenMP multi-threading inside each MPI process for the finer parallelization; among those threads, one thread, called the main thread, is responsible for offloading the sub-tasks to the three MIC devices within the machine node and collecting the results from them. During the task-assignment phase, each process reads in five grid blocks, two of which have the same size and are assigned to the two CPUs, while the other three blocks, of another size, are assigned to the three MICs within the same node.

To find the best load balance between the two types of devices, that is, CPUs and MICs, we first fixed the grid size at 8 million (8M) cells for each CPU, i.e. 16M cells in total for the two CPUs of a machine node, and varied the grid size per MIC coprocessor over 4M, 6M, 8M and 10M cells, respectively. As a contrast, each of these heterogeneous-computing tests is accompanied by another test with the same total grid size but using only the CPU devices of the same machine node. The acceleration results of all four pairs of tests, using the "CompCorner" case on 16 Tianhe-2 nodes, are reported in Fig. 9, and the flow field near the corner area is shown in Fig. 11. The results show that when the grid size for each MIC device is about 4M-6M cells, an optimal acceleration of about 2.62X can be achieved for the heterogeneous computation compared with the CPU-only run.

Figure 9: Speedup relative to the CPU-only test for the four configurations (16M+12M, 16M+18M, 16M+24M and 16M+30M grid cells per node, CPU+MIC).

To further study the optimal ratio of the MIC workload to the CPU workload in heterogeneous computing, in the following series of tests we also vary the grid size per CPU over 8M, 16M, 24M and 32M cells. Fig. 10(a) shows the performance, measured in million cell updates per second (MCUPS), as the ratio of the grid size per MIC to the grid size per CPU varies. When the ratio is between 0.6 and 0.8, the highest performance is observed in each group of tests, which is also nearly consistent with the results shown in Fig. 9. In Fig. 10(b) we fix 16M grid cells for each CPU and 9.6M grid cells for each MIC (thus the load ratio is 0.6) and scale the problem size up to 2048 Tianhe-2 machine nodes. The observed good weak scalability confirms the effectiveness of the performance optimization methods proposed in Sect. 3.2.

Figure 10: Balancing the load between CPUs and MIC coprocessors: (a) performance (MCUPS) under different values of the load-ratio parameter, for 8M, 16M, 24M and 32M cells per CPU; (b) scaling up to 2048 machine nodes with a load ratio of 0.6.

As a larger-scale test of our parallelization and optimization for both the CPU-only homogeneous platform and the CPU + MIC heterogeneous platform, the "CompCorner" cases with two types of configurations are used again on the Tianhe-2 supercomputer system. In the first type of configuration, denoted "coarse" in Fig. 12, each machine node processes 40 million (40M) grid cells, while each node processes 95.2M grid cells in the "fine" type of configuration. In the CPU-only homogeneous tests, the number of machine nodes goes up to 8192 nodes with a total of 196,600 CPU cores, and the largest problem size in the "fine" configuration reaches 780 billion grid cells. In the CPU + MIC heterogeneous tests, the maximum number of machine nodes is 7168, with 1.376 million CPU + MIC processor/coprocessor cores, and the largest case has 680 billion grid cells. The weak scalability results in Fig. 12 show how the running time changes with the problem size, or equivalently, with the number of machine nodes. From the results we find that, whatever the platform type, either the CPU-only homogeneous system or the CPU + MIC heterogeneous system, and whatever the load per node, either as in the "coarse" configuration or as in the "fine" configuration, there is basically no change in running time when the problem size is increased in proportion to the computing resources, thus showing very good weak scalability.

Figure 11: Flow field result near the corner area in the simulation of the test case "CompCorner".

Figure 12: Weak scalability for the two platforms and the two test configurations (average time per iteration vs. number of machine nodes, for the coarse and fine configurations on the CPU-only and CPU+MIC platforms).

5. Conclusions

In this paper, efficient parallelization and optimization methods for large-scale CFD flow-field simulations on modern homogeneous and/or heterogeneous computer platforms are studied, focusing on the techniques of identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a machine node with hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. A series of numerical experiments were carried out on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reaches 780 billion. The tuned solver successfully scales to half of the Tianhe-2 supercomputer system with over 1.376 million heterogeneous cores, and the results show the effectiveness of our proposed methods.

Acknowledgments

We would like to thank NSCC-Guangzhou for providing access to the Tianhe-2 supercomputer as well as for their technical guidance. This work was funded by the National Natural Science Foundation of China (NSFC) under grant no. 61379056.

References

[1] A. Corrigan, F. F. Camelli, R. Löhner, J. Wallin, Running unstructured grid-based CFD solvers on modern graphics hardware, International Journal for Numerical Methods in Fluids 66 (2) (2011) 221-229.
[2] M. Griebel, P. Zaspel, A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations, Computer Science - Research and Development 25 (1) (2010) 65-73.
[3] C. Xu, L. Zhang, X. Deng, J. Fang, G. Wang, W. Cao, Y. Che, Y. Wang, W. Liu, Balancing CPU-GPU collaborative high-order CFD simulations on the Tianhe-1A supercomputer, in: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IEEE, 2014, pp. 725-734.
[4] S. Saini, H. Jin, D. Jespersen, H. Feng, J. Djomehri, W. Arasin, R. Hood, P. Mehrotra, R. Biswas, An early performance evaluation of many integrated core architecture based SGI rackable computing system, in: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, 2013, pp. 1-12.
[5] Y.-X. Wang, L.-L. Zhang, Y.-G. Che, et al., Heterogeneous computing and optimization on Tianhe-2 supercomputer system for high-order accurate CFD applications, in: Proceedings of HPC China 2013, Guilin, China, 2013.
[6] A. Gorobets, F. Trias, R. Borrell, G. Oyarzun, A. Oliva, Direct numerical simulation of turbulent flows with parallel algorithms for various computing architectures, in: 6th European Conference on Computational Fluid Dynamics, Barcelona, Spain, 2014.
[7] M. Vázquez, F. Rubio, G. Houzeaux, J. González, J. Giménez, V. Beltran, R. de la Cruz, A. Folch, Xeon Phi performance for HPC-based computational mechanics codes, Tech. rep., PRACE-RI (2014).
[8] L. Deng, H. Bai, D. Zhao, F. Wang, Kepler GPU vs. Xeon Phi: Performance case study with a high-order CFD application, in: 2015 IEEE International Conference on Computer and Communications (ICCC), 2015, pp. 87-94. doi:10.1109/CompComm.2015.7387546.
[9] C. W. Smith, B. Matthews, M. Rasquin, K. E. Jansen, Performance and scalability of unstructured mesh CFD workflow on emerging architectures, Tech. rep., SCOREC Reports (2015).
[10] X. Deng, H. Maekawa, Compact high-order accurate nonlinear schemes, Journal of Computational Physics 130 (1) (1997) 77-91.
[11] X. Deng, M. Mao, G. Tu, H. Liu, H. Zhang, Geometric conservation law and applications to high-order finite difference schemes with stationary grids, Journal of Computational Physics 230 (4) (2011) 1100-1115.
[12] Y.-X. Wang, L.-L. Zhang, W. Liu, Y.-G. Che, C.-F. Xu, Z.-H. Wang, Y. Zhuang, Efficient parallel implementation of large scale 3D structured grid CFD applications on the Tianhe-1A supercomputer, Computers & Fluids 80 (2013) 244-250.
[13] Y.-X. Wang, L.-L. Zhang, W. Liu, Y.-G. Che, C.-F. Xu, Z.-H. Wang, Grid repartitioning method of multi-block structured grid for parallel CFD simulation, Journal of Computer Research and Development 50 (8) (2013) 1762-1768.
[14] Y.-X. Wang, L.-L. Zhang, Y.-G. Che, et al., Improved algorithm for reconstructing singular connection in multi-block CFD application, Transactions of Nanjing University of Aeronautics and Astronautics 30 (S) (2013) 51-57.
[15] B. Li, L. Bao, B. Tong, Theoretical modeling for the prediction of the location of peak heat flux for hypersonic compression ramp flow, Chinese Journal of Theoretical and Applied Mechanics 44 (5) (2012) 869-875.