Optimizations for Seismic Applications on the NEC SX-Aurora TSUBASA

Raghunandan Mathur, NEC Technologies India; Reid Atcheson, Numerical Algorithms Group; Yoshiyuki Kubo, NEC Corporation

13-Oct-2020, Poster Station 11 D, Exhibit Hall B, George R. Brown Convention Center

1 © NEC Corporation 2020 In collaboration with

Stencil Code

▌ Stencil code refers to a procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc.

▌ Stencil patterns require updates to each element in a multidimensional array by referring to its neighbor elements:

$$B_{i,j,k} = c\,B_{i,j,k} + \sum_{l=1}^{m} F_l\, A_{i+p(l),\,j+q(l),\,k+r(l)}$$

▌ These codes require high performance in both computation and memory access, since they load the value of each element several times but store a new value only once.

▌ Stencil shapes and sizes differ based on the problem they solve:

Shape            Load   Store
Directional      3      1
2D Planar        9      1
2D Axial         5      1
2D Diagonal      5      1
3D Volumetric    27     1
3D Axial         7      1
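The general update formula above can be sketched in C as a table of offsets and coefficients applied at every interior point. This is a minimal illustration, not the poster's implementation: the names `StencilPoint`, `idx`, `apply_stencil`, and `AXIAL_3D`, the halo handling, and the Laplacian-style coefficients are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One stencil point: offsets (p, q, r) and coefficient F, matching the
   symbols in the update formula. */
typedef struct { int p, q, r; double F; } StencilPoint;

/* Linear index into an n0 x n1 x n2 array, fastest in the first index. */
static size_t idx(int i, int j, int k, int n0, int n1) {
    return (size_t)i + (size_t)n0 * ((size_t)j + (size_t)n1 * (size_t)k);
}

/* Apply B[i,j,k] = c*B[i,j,k] + sum_l F_l * A[i+p(l), j+q(l), k+r(l)]
   to every interior point. `halo` must cover the largest |offset|. */
void apply_stencil(double *B, const double *A, double c,
                   const StencilPoint *pts, int m,
                   int n0, int n1, int n2, int halo) {
    assert(halo >= 0 && 2 * halo < n0 && 2 * halo < n1 && 2 * halo < n2);
    for (int k = halo; k < n2 - halo; ++k)
        for (int j = halo; j < n1 - halo; ++j)
            for (int i = halo; i < n0 - halo; ++i) {
                double acc = c * B[idx(i, j, k, n0, n1)];
                for (int l = 0; l < m; ++l) {
                    const StencilPoint s = pts[l];
                    acc += s.F * A[idx(i + s.p, j + s.q, k + s.r, n0, n1)];
                }
                B[idx(i, j, k, n0, n1)] = acc;
            }
}

/* The "3D Axial" shape: 7 loads (centre plus 6 face neighbours), 1 store.
   The coefficients are an illustrative discrete Laplacian (assumption). */
const StencilPoint AXIAL_3D[7] = {
    { 0, 0, 0, -6.0 }, { -1, 0, 0, 1.0 }, { 1, 0, 0, 1.0 },
    { 0, -1, 0, 1.0 }, { 0, 1, 0, 1.0 }, { 0, 0, -1, 1.0 }, { 0, 0, 1, 1.0 }
};
```

Calling `apply_stencil(B, A, c, AXIAL_3D, 7, n0, n1, n2, 1)` performs seven loads of `A` and one store to `B` per grid point, which is the load/store imbalance the table describes.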

Stencil Code

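▌ Example of a Stencil Code: Laplace Equation

$$\frac{\partial^2 A}{\partial x^2} + \frac{\partial^2 A}{\partial y^2} = 0$$

Discretization (finite-difference):

$$b_{i,j} = \frac{1}{4}\left(a_{i,j-1} + a_{i-1,j} + a_{i+1,j} + a_{i,j+1}\right)$$

    real :: a(0:ni+1,0:nj+1)   ! input field with a one-cell halo
    real :: b(1:ni,1:nj)       ! updated interior values
    ! ...
    do itr=1,maxitr
      do j=1,nj
        do i=1,ni
          ! average of the four axial neighbors
          b(i,j)=0.25*( &
                a(i  ,j-1) &
               +a(i-1,j  ) &
               +a(i+1,j  ) &
               +a(i  ,j+1))
        end do
      end do
      ! ...
      ! copy the interior back; the halo of a is left untouched
      a(1:ni,1:nj)=b(:,:)
    end do

[Figure: 2D axial stencil on the (i, j) grid, shown for i-2 to i+2 and j-2 to j+2]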

Reverse Time Migration (RTM)

▌ Reverse time migration (RTM) modeling is a critical component in the seismic processing workflow of oil and gas exploration.

▌ RTM imaging enables accurate imaging in areas of complex structures and velocities by gathering a two-way acoustic image of seismic data in place of a one-way image.

▌ RTM spends most of its computation time in wave propagation kernels that utilize stencil codes.

As observed on 2x Xeon Gold 6148 (Skylake, 2.40 GHz, 40 cores), approximately 90% of the compute time is spent in RTM kernels:

[Figure: elapsed time ratio [%] on a 0 to 100 scale, split into stencil code, other computation, and I/O]

▌ Full simulations of a generalized kernel called the Anisotropic Elastic Wave Equation propagator can provide significant seismic information under a wide variety of geological assumptions.

Fully anisotropic elastic wave equation propagator

▌ The wave propagation kernels numerically represent the type of physics the user needs to emphasize for the migration.

▌ Isotropic acoustics is a common and simple wave propagation kernel for driving RTM, but making fewer assumptions about the subsurface geology leads to more accurate and more expensive kernels such as Vertical Transverse Isotropy (VTI) or Tilted Transverse Isotropy (TTI).

▌ The elasto-dynamic wave equation for anisotropic media can be expressed as:

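$$\left(\delta_{kl}\,\partial_{tt} - A_{kl}[\nabla]\right) u_l = \frac{1}{\rho} F_k$$

where

$$A_{kl}[\nabla] = \frac{1}{\rho}\,\partial_i\, C_{iklj}\,\partial_j$$

is the acoustic differential operator, and $\delta_{kl}$ is the Kronecker delta.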
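Substituting the operator into the wave equation and multiplying through by $\rho$ gives the component form that a propagator actually discretizes. This expansion assumes summation over the repeated indices $i$, $j$, $l$, which the poster does not state explicitly:

```latex
\partial_{tt} u_k - \frac{1}{\rho}\,\partial_i\!\left(C_{iklj}\,\partial_j u_l\right) = \frac{1}{\rho} F_k
\quad\Longrightarrow\quad
\rho\,\partial_{tt} u_k = \partial_i\!\left(C_{iklj}\,\partial_j u_l\right) + F_k
```

Each spatial derivative on the right-hand side becomes a finite-difference stencil, which is why the kernel cost grows with the number of nonzero entries of $C_{iklj}$.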

▌ This study covers both low frequency and high frequency fully anisotropic wave equation propagators.

▌ Both propagators are relevant when considering Reverse Time Migration (RTM) and Full Waveform Inversion (FWI).

Vector Processor on PCIe Card (High Memory Capacity & Bandwidth Processor)

• 8 cores / processor
• 1.22 TB/s memory bandwidth, 48 GB memory (Very High Memory Bandwidth)
• Standard programming with Fortran/C/C++ (No Special Programming Model Needed)
• 2.15 TF performance (double precision)
• 4.30 TF performance (single precision)
• Power consumption < 300 W

Vector Processor

▌ A SIMD architecture that processes a ‘Vector’ as a single data element.

▌ Arbitrarily large vector length can be processed in a single vector instruction.

▌ Large memory bandwidth enabling high sustained performance.

[Figure: data and compute layout compared for Scalar Processor, Vector Processor, and GPGPU]

▌ Large capacity of vector registers.
• Every core has 64 vector registers.
• Every vector register has 512 single-precision elements.
• Vector registers can be accessed much faster than a cache.

▌ The memory access can be reduced by retaining intermediate results in vector registers.

STREAM Benchmark – A Reference

▌ STREAM benchmark evaluates memory bandwidth and is a good benchmark for preliminary comparison.

Comparison with Intel Xeon Gold 6148:
• On Intel machine STREAM Triad achieves 180 GB/s
• On NEC machine STREAM Triad achieves 984 GB/s
• Expected maximum speedup on NEC => 984/180 = 5.5x

Comparison with Nvidia Tesla V100:
• On Nvidia machine STREAM Triad achieves 830 GB/s
• On NEC machine STREAM Triad achieves 984 GB/s
• Expected maximum speedup on NEC => 984/830 = 1.2x

[Figure: STREAM Triad memory bandwidth (GB/s)]
Intel Xeon Gold 6148 (2 CPU)    180
Nvidia Tesla V100 (1 GPU)       830
A64FX (1 CPU)                   830
NEC VE 10B (1 Card)             984
NEC VE 20B (1 Card)             1245

▌ The calculated speed-up sets the expected result for the experiment.
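The expected speedups quoted above are plain ratios of sustained STREAM Triad bandwidth. A small sketch of that arithmetic follows; the function name `bandwidth_bound_speedup` is hypothetical, and the bound only holds under the assumption that the kernel is purely memory-bandwidth-bound.

```c
#include <assert.h>

/* Upper bound on speedup for a purely memory-bandwidth-bound kernel:
   the ratio of sustained (STREAM Triad) bandwidths, both in GB/s. */
double bandwidth_bound_speedup(double target_gbs, double baseline_gbs) {
    assert(target_gbs > 0.0 && baseline_gbs > 0.0);
    return target_gbs / baseline_gbs;
}
```

With the figures from the poster, `bandwidth_bound_speedup(984.0, 180.0)` is roughly 5.5 and `bandwidth_bound_speedup(984.0, 830.0)` is roughly 1.2. Measured speedups above these values indicate that something other than raw bandwidth, such as compute throughput or cache behavior, also differs between the systems.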

Experiment Setup

Evaluation Target System

▌ The evaluation target systems were chosen from the following three popular HPC architectures:
• 2x Intel Xeon Gold 6148
• 1x NEC Vector Engine (Type 10B)
• 1x Nvidia Tesla V100

Software setup

▌ The implementation of the anisotropic wave equation kernel computes results for three problem sizes and three stencil lengths:

3D problem sizes:
• (nx,ny,nz) = 64x64x64
• (nx,ny,nz) = 128x128x128
• (nx,ny,nz) = 256x256x256

Stencil Lengths:
• Length = 2
• Length = 4
• Length = 8

[Figure: illustration of Stencil Length = 4]

Experiment Setup

▌ Only the core computation is timed; the reported timings are an average over 10 iterations of the wave equation solver.

▌ The code also computes the minimum, maximum, and standard deviation of the timings, but these serve mainly to confirm that the timing is sensible: if the standard deviation is large, the result is discarded and the run is repeated.

▌ Each iteration is compared against an analytic solution to ensure correctness, but this comparison is not timed.

▌ On an ideal system with a perfect number of registers and perfect caching, the stencil length would not change performance at all. However, because cache sizes and vector register counts vary, source code tuning was attempted on all architectures.
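The timing procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the poster's harness: the names `TimingStats`, `summarize`, `timing_is_stable`, and `time_kernel` are hypothetical, and the acceptance threshold `rel_tol` is an assumption since the poster does not say what counts as a "large" standard deviation.

```c
#include <assert.h>
#include <time.h>

#define N_TIMED_ITERS 10   /* the poster averages over 10 iterations */

typedef struct { double mean, min, max, variance; } TimingStats;

/* Summarise n timing samples (two-pass population variance). */
TimingStats summarize(const double *t, int n) {
    assert(n > 0);
    TimingStats s = { 0.0, t[0], t[0], 0.0 };
    for (int i = 0; i < n; ++i) {
        s.mean += t[i];
        if (t[i] < s.min) s.min = t[i];
        if (t[i] > s.max) s.max = t[i];
    }
    s.mean /= n;
    for (int i = 0; i < n; ++i)
        s.variance += (t[i] - s.mean) * (t[i] - s.mean);
    s.variance /= n;
    return s;
}

/* Accept a run only if stddev <= rel_tol * mean. Both sides are compared
   as squares, so no sqrt is needed. rel_tol is a hypothetical threshold. */
int timing_is_stable(TimingStats s, double rel_tol) {
    double limit = rel_tol * s.mean;
    return s.variance <= limit * limit;
}

/* Wall-clock seconds via C11 timespec_get (not guaranteed monotonic). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Time only the solver call; the comparison against the analytic
   solution belongs outside the timed region. */
TimingStats time_kernel(void (*kernel)(void *), void *ctx) {
    double t[N_TIMED_ITERS];
    for (int it = 0; it < N_TIMED_ITERS; ++it) {
        double t0 = now_seconds();
        kernel(ctx);
        t[it] = now_seconds() - t0;
    }
    return summarize(t, N_TIMED_ITERS);
}
```

A driver would call `time_kernel` in a loop and repeat the measurement until `timing_is_stable` returns nonzero, then report `mean`.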

Source Code modifications

This is an abridged version of the actual loop in the code, depicting the nature of the modifications:

    #define id(i0,i1,i2) ((i0)+n0*((i1)+n1*(i2)))
    #pragma omp parallel for
    for(int i0=HALO;i0 …

The indexing scheme is fastest in the “n0” dimension, but originally, we have the “n2” …

Source Code modifications

This is an abridged version of the actual loop in the code, depicting the nature of the modifications:

    #define id(i0,i1,i2) ((i0)+n0*((i1)+n1*(i2)))
    #pragma omp parallel for
    for(int i2=HALO;i2 …
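Because the loop bodies on these two slides did not survive, the following is only a sketch of the kind of reorder the loop headers suggest: moving from `i0` as the outermost loop to `i2` outermost with `i0` innermost, so that the inner loop walks memory with unit stride. The kernel body `point` is a hypothetical 3-point sum standing in for the real stencil, and the `ID` macro takes `n0` and `n1` as explicit arguments instead of reading them from scope.

```c
#include <assert.h>

#define HALO 1
/* Same indexing scheme as the abridged loop: i0 is the fastest dimension. */
#define ID(i0, i1, i2, n0, n1) ((i0) + (n0) * ((i1) + (n1) * (i2)))

/* Hypothetical stand-in for the real kernel body: 3-point sum along i0. */
static double point(const double *in, int i0, int i1, int i2, int n0, int n1) {
    return in[ID(i0 - 1, i1, i2, n0, n1)]
         + in[ID(i0, i1, i2, n0, n1)]
         + in[ID(i0 + 1, i1, i2, n0, n1)];
}

/* i0 outermost, i2 innermost: the inner loop strides by n0*n1 elements. */
void sweep_i2_innermost(double *out, const double *in, int n0, int n1, int n2) {
    #pragma omp parallel for
    for (int i0 = HALO; i0 < n0 - HALO; ++i0)
        for (int i1 = HALO; i1 < n1 - HALO; ++i1)
            for (int i2 = HALO; i2 < n2 - HALO; ++i2)
                out[ID(i0, i1, i2, n0, n1)] = point(in, i0, i1, i2, n0, n1);
}

/* i2 outermost, i0 innermost: the inner loop has unit stride and vectorises. */
void sweep_i0_innermost(double *out, const double *in, int n0, int n1, int n2) {
    #pragma omp parallel for
    for (int i2 = HALO; i2 < n2 - HALO; ++i2)
        for (int i1 = HALO; i1 < n1 - HALO; ++i1)
            for (int i0 = HALO; i0 < n0 - HALO; ++i0)
                out[ID(i0, i1, i2, n0, n1)] = point(in, i0, i1, i2, n0, n1);
}
```

Both functions compute identical results; only the memory access order changes. Without OpenMP enabled the pragma is ignored and the loops run serially.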

Source Code modifications

▌ Some tuning modifications that brought out the best performance of each architecture:

On Intel Xeon
• Simple Loop Reorder: Fastest dimension to be kept in the innermost loop for best effect of vectorization through AVX512.
• Loop Blocking: Outer loops to be blocked for best cache utilization.

On NEC VE
• Simple Loop reorder: Fastest dimension to be kept in the innermost loop for best effect of vectorization.
• Loop Collapse: Collapsing the inner two loops in order to ensure a single long vector for best utilization of the long vector pipe (256-words).

On Nvidia V100
• Threading: In general threading on GPU is more flexible than vectorization.
• Avoiding branching operations: Specific cases where the ratio of taken branches was high.

• Loop Unroll: Compiler automatically unrolls the outer-loop

▌ While more complicated tuning approaches offered scope for further performance gains, we kept our focus on ease and simplicity of tuning for high performance.
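The loop collapse listed for the NEC VE can be illustrated with a pointwise update, where fusing the two inner loops is straightforward. This is a sketch, not the poster's kernel: the function names are hypothetical and the update `y += s*x` stands in for the real stencil body. For a 64x64x64 grid the nested inner loop has only 64 iterations, while the collapsed loop has 64*64 = 4096, which can fill 256-element vectors.

```c
#include <assert.h>

/* Nested form: the innermost (vectorised) loop has only n0 iterations. */
void axpy_nested(double *y, const double *x, double s, int n0, int n1, int n2) {
    assert(n0 > 0 && n1 > 0 && n2 > 0);
    for (int i2 = 0; i2 < n2; ++i2)
        for (int i1 = 0; i1 < n1; ++i1)
            for (int i0 = 0; i0 < n0; ++i0) {
                long k = i0 + (long)n0 * (i1 + (long)n1 * i2);
                y[k] += s * x[k];
            }
}

/* Collapsed form: i0 and i1 are fused into one loop of n0*n1 iterations.
   This is valid because each i2-plane is contiguous in memory. */
void axpy_collapsed(double *y, const double *x, double s, int n0, int n1, int n2) {
    assert(n0 > 0 && n1 > 0 && n2 > 0);
    const long plane = (long)n0 * n1;
    for (int i2 = 0; i2 < n2; ++i2) {
        double *yp = y + plane * i2;
        const double *xp = x + plane * i2;
        for (long k = 0; k < plane; ++k)
            yp[k] += s * xp[k];
    }
}
```

For a stencil with a halo, the collapsed loop also sweeps over halo cells, so those entries would need to be masked or restored afterwards; that detail is omitted here.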

Performance Results

Problem Size: 64x64x64

▌ For small problem sizes, the NEC Vector Engine outperforms both the CPU and the GPU, although it performs very similarly to the GPU.

▌ These small problem sizes also do not utilize the large data processing capability of the vector engine due to smaller vector lengths.

▌ Smaller vector lengths provide better cache friendliness on CPUs, which tend to perform well.

▌ As expected, the speed-up is in the range of 3.5x ~ 4.5x for CPU, which is short of the expected theoretical speedup based on memory bandwidth considerations.

[Figure: Execution Time (secs), lower is better, for Stencil Length = 2, 4, and 8 on Intel Skylake, Nvidia V100, and NEC VE]

These performance recordings are as of October 2020

Performance Results

Problem Size: 128x128x128

▌ As the dataset size increases, the speedup improves.

▌ The NEC Vector Engine still outperforms both the CPU and GPU based systems, with a noticeable performance benefit compared to the GPU.

▌ The speedup is in the 6.0x ~ 8.5x range compared to CPU, which is much more representative of the speedup expected on theoretical grounds.

▌ NEC VE provides nearly 1.7x faster performance compared to GPUs.

[Figure: Execution Time (secs), lower is better, for Stencil Length = 2, 4, and 8 on Intel Skylake, Nvidia V100, and NEC VE]

These performance recordings are as of October 2020

Problem Size: 256x256x256

▌ For larger problem sizes, the speedup of the NEC VE over Intel Xeon reaches up to 16.0x, much higher than the theoretical best speedup of 6x based solely on memory bandwidth considerations.

▌ Even relative to the Nvidia GPU, the speedup is in the 1.5x ~ 2.1x range, which is higher than memory bandwidth considerations alone would predict.

▌ The speedup suggests that the NEC Vector Engine provides a good combination of boosts in memory performance as well as computational performance.

[Figure: Execution Time (secs), lower is better, for Stencil Length = 2, 4, and 8 on Intel Skylake, Nvidia V100, and NEC VE]

These performance recordings are as of October 2020

Gridpoints-per-second

[Figure: gridpoints per second, higher is better, on a 0 to 800,000,000 scale, for Intel Skylake, Nvidia V100, and NEC VE at Stencil Length = 2, 4, and 8 and Problem Size = 64x64x64, 128x128x128, and 256x256x256]

▌ This plot represents the number of grid-points being evaluated per-second for each architecture.

▌ NEC Vector Engine consistently outperforms the CPU and GPU architectures, up to 16x faster than Intel Skylake, and more than 2x faster than Nvidia V100 for large datasets.

These performance recordings are as of October 2020

Performance Results

▌ The performance patterns also reveal the ideal grid size based on best performance per-core for each architecture:

                      Intel Skylake Gold 6148   Nvidia Tesla V100   NEC VE Type 10B
Stencil Length = 2    64-grid                   256-grid            256-grid
Stencil Length = 4    64-grid                   256-grid            256-grid
Stencil Length = 8    64-grid                   128-grid            256-grid

▌ This table can help a developer design the granularity of parallelism for their code based on what architecture they are working on.

▌ For every stencil length, the best-performing grid size on the NEC VE is consistently the largest one tested (256-grid).
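The gridpoints-per-second metric behind this table is the grid volume divided by the time of one solver iteration, and the ideal grid size is the one that maximizes it. A minimal sketch, assuming cubic grids; the names `gridpoints_per_second` and `best_grid_index` are hypothetical, and any timings passed in must come from real measurements.

```c
#include <assert.h>

/* Grid points evaluated per second for one solver iteration. */
double gridpoints_per_second(int nx, int ny, int nz, double seconds) {
    assert(seconds > 0.0);
    return (double)nx * (double)ny * (double)nz / seconds;
}

/* Index of the cubic grid with the highest throughput.
   grid[i] is the edge length (e.g. 64, 128, 256) and secs[i] is the
   measured time per iteration for that grid. */
int best_grid_index(const int *grid, const double *secs, int n) {
    assert(n > 0);
    int best = 0;
    double best_rate = gridpoints_per_second(grid[0], grid[0], grid[0], secs[0]);
    for (int i = 1; i < n; ++i) {
        double rate = gridpoints_per_second(grid[i], grid[i], grid[i], secs[i]);
        if (rate > best_rate) {
            best_rate = rate;
            best = i;
        }
    }
    return best;
}
```

Running `best_grid_index` once per architecture and stencil length, over the three measured problem sizes, reproduces the kind of table shown above.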

▌ Stencil codes place heavy demands on both memory and compute on any given architecture.

▌ Reverse Time Migration is a performance-intensive, predominantly memory-bound code and poses a genuine challenge relevant to the oil and gas industry.

▌ Vector architectures, particularly the NEC Vector Engine, are capable of meeting the recurring challenges in seismic processing and providing better performance than other available leading architectures.

▌Power efficient solution with minimal software engineering effort.