Stan Posey NVIDIA, Santa Clara, CA, USA; [email protected]

GPU Progress Summary for GPU-Parallel CAE

Strong GPU investments by commercial CAE vendors (ISVs)
GPU adoption led by implicit FEA and CEM, followed by CFD
GPUs provide second-level parallelism and preserve costly MPI investments
ISV focus on hybrid parallel CAE that utilizes all CPU cores plus the GPU

GPUs now production HPC for leading CAE end-user sites
Led by the automotive, electronics, and aerospace industries

GPU acceleration contributing to growth in emerging CAE
New ISV developments in particle-based CFD (LBM, SPH, etc.)
Rapid growth across a range of CEM applications and GPU adoption


NVIDIA Use of CAE in Product Engineering

ANSYS Icepak – active and passive cooling of IC packages

ANSYS Mechanical – large deflection bending of PCBs

ANSYS Mechanical – comfort and fit of 3D emitter glasses

ANSYS Mechanical – shock and vibration of solder ball assemblies


NVIDIA GPUs Accelerate CAE at Any Scale

The same GPU technology scales from MAXIMUS workstations to TITAN at ORNL: 20+ petaflops, 18,688 NVIDIA Tesla K20X GPUs, #2 at Top500.org

MAXIMUS Workstation

Key application: S3D for turbulent combustion. How can next-generation diesel and bio fuels be burned efficiently?

MAXIMUS for Workstations and CAE

NVIDIA® MAXIMUS: Visual Computing + Parallel Computing

MAXIMUS platform:
• Intelligent GPU job allocation
• Unified driver for Quadro + Tesla
• ISV application certifications
• Systems from HP, Dell, Lenovo, and others
• Now with Kepler-based GPUs
• Available since November 2011

Workflow coverage: CAD operations, pre-processing, post-processing, FEA, CFD, CEM

DELPHI and ANSYS Mechanical GPU Acceleration

ANSYS Mechanical 14, based on the SMP direct solver: Tesla C2075 (CUDA 3.2) vs. multi-core Intel Westmere


[Chart: speedup of Tesla C2075 vs. Xeon 5600 series, higher is better; 1.46x average]
Battery Pack, 3.2 MDOF, Block Lanczos, 100 modes: 1.73x (8 cores + GPU vs. 8 cores)
MCLC, 4.7 MDOF, PCG, static: 1.41x (6 cores + GPU vs. 6 cores)
Connector, 1.9 MDOF, sparse, static: 1.24x (6 cores + GPU vs. 6 cores)

System configuration: HP Z800, 2x Xeon 5650 CPUs (12 cores, 2.67 GHz), 48 GB memory, Windows 64-bit, Quadro 6000 GPU, Tesla C2075 GPU, CUDA 3.2

GPU Status of Select Automotive CAE Software

Select automotive CAE application | Select CAE software | GPU status
CSM: Durability (stress) and fatigue | MSC Nastran | Available today
Road handling and VPG | Adams (for MBD) | Evaluation
Powertrain stress analysis | Abaqus/Standard | Available today
Body NVH | MSC Nastran | Available today
Crashworthiness and safety | LS-DYNA | Implicit only, beta
CFD: Aerodynamics / underhood thermal | ANSYS Fluent | Available today, beta
IC engine combustion | STAR-CCM+ | Evaluation
Aerodynamics / HVAC | OpenFOAM | Available today
Mold injection | Moldflow | Available today

GPU Progress – Commercial CAE Software

Status | Structural mechanics | Fluid dynamics | Electromagnetics
Available today | ANSYS Mechanical; Abaqus/Standard; MSC Nastran; Marc; AFEA | ANSYS CFD (FLUENT); Moldflow; Culises (OpenFOAM); Particleworks; SpeedIT (OpenFOAM); AcuSolve | EMPro; CST MWS; XFdtd; SEMCAD X; FEKO; Nexxim
Product in 2013 | AMLS, FastFRS; NX Nastran; HyperWorks OptiStruct; PAM-CRASH implicit; LS-DYNA implicit | CFD-ACE+; Abaqus/CFD | JMAG; HFSS; Xpatch
Product evaluation | RecurDyn; SCSK Adventure Cluster | LS-DYNA CFD; CFD++ | –
Research evaluation | Abaqus/Explicit; RADIOSS; PAM-CRASH | FloEFD; STAR-CCM+; XFlow | –

Status Summary of ISVs and GPU Acceleration

Every primary ISV has products available on GPUs or an ongoing evaluation

The four largest ISVs all have products based on GPUs, some in their 3rd generation: ANSYS, MSC Software, Altair

Four of the top five ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran; LS-DYNA is implicit only

Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream

Availability of commercial CEM software expanding with ECAE growth: CST, Remcom, Agilent, and EMSS on 3rd-generation GPUs; JSOL to release JMAG, ANSYS to release HFSS


CAE Priority of ISV Software Adoption for GPUs

#1: ANSYS / ANSYS Mechanical; SIMULIA / Abaqus/Standard; MSC Software / MSC Nastran; MSC Software / Marc; LSTC / LS-DYNA implicit; Altair / RADIOSS Bulk; Siemens / NX Nastran; Autodesk / Mechanical

#2: Altair / AcuSolve (CFD); Autodesk / Moldflow

#3: ANSYS / ANSYS Fluent; OpenFOAM (various ISVs); CD-adapco / STAR-CCM+; ESI / CFD-ACE+; SIMULIA / Abaqus/CFD; Autodesk / Simulation CFD

#4: LSTC / LS-DYNA; SIMULIA / Abaqus/Explicit; Altair / RADIOSS; ESI / PAM-CRASH

Commercial CAE Focus on Sparse Solvers

CFD application software flow (schematic):
Read input, matrix set-up (CPU)
Implicit sparse matrix operations (GPU): 40%–75% of profile time, a small % of lines of code; parallelized via hand-written CUDA, GPU libraries (CUBLAS), and OpenACC directives (see the sketch below)
Global solution, write output (CPU); investigating OpenACC for moving more tasks onto the GPU
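Since the sparse matrix operations above are where the GPU earns its speedup, a minimal CUDA sketch of the workhorse kernel may help: a sparse matrix-vector product in CSR format, the core operation inside implicit iterative solvers. The kernel, the toy matrix, and the launch configuration are illustrative assumptions, not code from any ISV product named in this deck.

#include <cuda_runtime.h>
#include <stdio.h>

// y = A*x for a sparse matrix A in CSR format: one thread per row.
// Kernels like this sit inside the "implicit sparse matrix operations"
// stage above; set-up and I/O stay on the CPU.
__global__ void spmv_csr(int nrows, const int *rowPtr, const int *colIdx,
                         const double *val, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows) {
        double sum = 0.0;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}

int main(void)
{
    // Illustrative 2x2 system: A = [[4, 1], [1, 3]], x = [1, 1].
    int    h_rowPtr[] = {0, 2, 4};
    int    h_colIdx[] = {0, 1, 0, 1};
    double h_val[]    = {4, 1, 1, 3};
    double h_x[]      = {1, 1}, h_y[2];

    int *d_rowPtr, *d_colIdx; double *d_val, *d_x, *d_y;
    cudaMalloc(&d_rowPtr, sizeof(h_rowPtr));
    cudaMalloc(&d_colIdx, sizeof(h_colIdx));
    cudaMalloc(&d_val, sizeof(h_val));
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colIdx, h_colIdx, sizeof(h_colIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmv_csr<<<1, 32>>>(2, d_rowPtr, d_colIdx, d_val, d_x, d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = (%g, %g)\n", h_y[0], h_y[1]);   // expect (5, 4)

    cudaFree(d_rowPtr); cudaFree(d_colIdx);
    cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return 0;
}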

Basics of GPU Computing for ISV Software

ISV software use of GPU acceleration is user-transparent

Jobs launch and complete without additional user steps
The user informs the ISV application (via GUI or command) that a GPU exists

Schematic of a CPU with an attached GPU accelerator: the CPU begins and ends the job, while the GPU manages the heavy computations

[Schematic: x86 CPU with cache and DDR memory, connected through an I/O hub over PCI-Express to a GPU with GDDR memory]

1. ISV job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. ISV job completes on CPU
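These four steps map directly onto the standard CUDA host-side pattern. A minimal sketch follows; the scaling kernel stands in for "solver operations" and is an illustrative assumption, not ISV code.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Stand-in for the solver's heavy computation.
__global__ void scale(double *v, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(double);
    double *h = (double *)malloc(bytes);               // 1. job starts on the CPU
    for (int i = 0; i < n; ++i) h[i] = 1.0;

    double *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 2. solver data sent to GPU
    scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);        //    heavy computation on GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. results back to CPU

    printf("h[0] = %f\n", h[0]);                       // 4. job completes on CPU
    cudaFree(d); free(h);
    return 0;
}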

Motivation for CAE ISVs to Make Strong Investments in GPUs

GPUs are a key HPC technology central to current CAE trends

CAE Trends and GPU Acceleration Benefits

Higher fidelity (better models): GPUs permit higher fidelity within existing CPU-only job times

Parameter sensitivities (more models): GPUs increase job throughput beyond the existing CPU-only number of jobs, and at lower cost

Advanced methods: GPUs make practical higher-order methods, more transient (vs. steady) simulations, use of solid finite elements vs. 2D shells, etc.

Growth in ISV software budgets: GPUs provide more use of existing ISV software investments

Cost Trends of CAE During Recent 20 Years

Cost trends: hardware is cheap; people and software costs continue to increase

• Historically, hardware was very expensive vs. ISV software and people

• ISV software budgets are now 4x vs. hardware

• Increasingly important that hardware choices drive cost-performance efficiency in people and ISV software

ANSYS and NVIDIA Collaboration Roadmap

Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0, Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | – | ANSYS Nexxim
14.0, Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5, Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0, Q4 2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (transient)

ANSYS Mechanical 14.5 GPU Acceleration

Results for Distributed ANSYS 14.5 with 8-core CPUs and single GPUs; V14sp-5 benchmark, ANSYS Mechanical jobs per day, higher is better:
Westmere (Xeon X5690, 3.47 GHz): 8 cores: 164; 8 cores + Tesla C2075: 341 (2.1x acceleration)
Sandy Bridge (Xeon E5-2687W, 3.10 GHz): 8 cores: 210; 8 cores + Tesla K20: 395 (1.9x acceleration)

Model: turbine geometry, 2,100,000 DOF, SOLID187 FEs; static, nonlinear; one iteration (the final solution requires 25); direct sparse solver. Results from a Supermicro X9DR3-F with 64 GB memory.

ANSYS Mechanical Automotive Benchmark

Source: ANSYS Automotive Simulation World Congress, 30 Oct 2012 – Detroit, MI “High-Performance Computing for Mechanical Simulations using ANSYS” By Jeff Beisheim, ANSYS, Inc.

GPU performance, speedup relative to 2 cores (no GPU): 2 cores (no GPU): 1.0x; 8 cores (no GPU): 2.6x; 8 cores (1 GPU): 3.8x
• 6.5 million DOF
• Linear static analysis
• Sparse solver (DMP)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

ANSYS Mechanical 14.5 Multi-GPU Performance

Source: ANSYS Automotive Simulation World Congress, 30 Oct 2012 – Detroit, MI “High-Performance Computing for Mechanical Simulations using ANSYS” By Jeff Beisheim, ANSYS, Inc.

GPU performance, speedup relative to 2 cores (no GPU): 2 cores (no GPU): 1.0x; 8 cores (1 GPU): 2.7x; 16 cores (4 GPUs): 5.2x (multi-GPU)
• 11.8 million DOF
• Linear static analysis
• PCG solver (DMP)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

SIMULIA and Abaqus GPU Release Progression

Abaqus 6.11 (June 2011): direct sparse solver accelerated on the GPU; single-GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)
Abaqus 6.12 (June 2012): multiple GPUs per node; multi-node DMP clusters; flexibility to run jobs on specific GPUs; Fermi GPUs + Kepler hotfix (since November 2012)
Abaqus 6.13 (June 2013): unsymmetric sparse solver on the GPU; official Kepler support (Tesla K20/K20X)

Rolls Royce: Abaqus 3.5x Speedup with 5M DOF

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Sandy Bridge + Tesla K20X, single server
• Direct sparse solver, 100 GB memory

[Chart: elapsed time in seconds and speedup relative to 8 cores (1x), for 8c, 8c + 1g (2.11x), 8c + 2g (2.42x), 16c, and 16c + 2g]

Server with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Rolls Royce: Abaqus Speedups on an HPC Cluster

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Sandy Bridge + Tesla K20X, up to 4 servers
• Direct sparse solver, 100 GB memory

[Chart: elapsed time in seconds for 24c vs. 24c + 4g (2 servers), 36c vs. 36c + 6g (3 servers), and 48c vs. 48c + 8g (4 servers); GPU speedups between 1.8x and 2.2x]

Servers with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

MSC Nastran Release 2013 for GPUs

MSC Nastran Direct Equation Solver is GPU accelerated

Sparse direct factorization with no limit on model size
Real, complex, symmetric, and unsymmetric matrices
Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)

Support for multiple GPUs, on Linux and Windows

With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs (see the sketch below)
Supported NVIDIA GPUs include the Tesla 20-series, Tesla K20/K20X, and Quadro 6000
Functionality developed with CUDA 5
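As a rough illustration of how concurrent fronts could land on different devices, here is a minimal CUDA sketch of a per-rank GPU binding; the DMP_RANK environment variable and the round-robin mapping are illustrative assumptions, not MSC Nastran's actual scheme.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each DMP rank binds to one GPU so that independent frontal
// factorizations can proceed concurrently, one front per device.
int main(void)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu == 0) { printf("no GPU found\n"); return 1; }

    // A real DMP code would take the rank from MPI_Comm_rank; this
    // sketch reads a hypothetical DMP_RANK variable to stay MPI-free.
    const char *r = getenv("DMP_RANK");
    int rank = r ? atoi(r) : 0;

    cudaSetDevice(rank % ngpu);   // round-robin rank-to-GPU binding
    printf("rank %d bound to GPU %d of %d\n", rank, rank % ngpu, ngpu);

    // ... factorize this rank's fronts on the selected device ...
    return 0;
}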

MSC Nastran 2013 and GPU Performance

SMP + GPU acceleration of SOL101 and SOL103

[Chart: speedup vs. serial, higher is better; configurations serial, 4c, 4c + 1g]
SOL101, 2.4M rows, 42K front: serial 1x; 4c 2.7x; 4c + 1g 6x
SOL103, 2.6M rows, 18K front: serial 1x; 4c 1.9x; 4c + 1g 2.8x

Lanczos solver (SOL103): sparse matrix factorization; iterate on a block of vectors (solve); orthogonalization of vectors
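The three Lanczos steps listed above are exactly the stages of the shifted block Lanczos method for the normal-modes eigenproblem; schematically (notation assumed, not from the slide):

\begin{aligned}
K\,\phi &= \lambda\,M\,\phi && \text{normal modes eigenproblem (SOL103)}\\
K - \sigma M &= L\,D\,L^{T} && \text{sparse matrix factorization}\\
X &= (L\,D\,L^{T})^{-1} M\,Q_{j} && \text{iterate on a block of vectors (solve)}\\
Q_{j+1} &= \operatorname{orth}\!\left(X;\,Q_{1},\dots,Q_{j}\right) && \text{orthogonalization of vectors}
\end{aligned}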

Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory

MSC Nastran 2013 and NVH Simulation on GPUs

Coupled structural-acoustics simulation with SOL108

Europe auto OEM: 710K nodes, 3.83M elements; 100 frequency increments (FREQ1); direct sparse solver
[Chart: elapsed time in minutes, lower is better] serial: 1x; 1c + 1g: 2.7x; 4c (SMP): 4.8x; 4c + 1g: 5.2x; 8c (DMP=2): 5.5x; 8c + 2g (DMP=2): 11.1x

Server node: Sandy Bridge 2.6 GHz, 2x 8-core, 2x Tesla K20X GPUs, 128 GB memory

GPU Server Performance of RADIOSS PCG Solver

Problem: Hood of a car with pressure loads, displacements and stresses

Benchmark: 2.2 million degrees of freedom, 62 million nonzeros; 380,000 shells + 13,000 solids + 1,100 RBE3; 5,300 iterations

Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs (CUDA 3.2) and Intel Westmere 2x6 X5670 @ 2.93 GHz; Linux RHEL 5.4 with Intel MPI 4.0

2x GPU on one node gives 7.5x (elapsed seconds):
SMP 6-core: 1106 s
Hybrid 2 MPI x 6 SMP: 572 s
SMP 6-core + 1 GPU: 254 s (4.3x)
Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 s (7.5x)

4x GPU on two nodes gives 13x (elapsed seconds):
Hybrid 4 MPI x 6 SMP: 306 s
Hybrid 4 MPI x 6 SMP + 4 GPUs: 85 s (13x)
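For readers unfamiliar with the solver being accelerated here, a minimal sketch of a preconditioned conjugate gradient loop (Jacobi preconditioner, tiny dense-stored SPD system) follows; it is illustrative only, not RADIOSS code. The matrix-vector product, dot products, and vector updates inside the loop are the operations that hybrid MPI x SMP + GPU runs offload.

#include <cstdio>
#include <cmath>

// Jacobi-preconditioned conjugate gradient on a small dense SPD system.
// In a production code, the matvec, dots, and axpy updates below are
// exactly the per-iteration operations moved to GPUs.
const int N = 3;
const double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};

void matvec(const double *x, double *y) {        // y = A x (GPU-heavy: SpMV)
    for (int i = 0; i < N; ++i) {
        y[i] = 0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
}

double dot(const double *a, const double *b) {   // GPU-friendly reduction
    double s = 0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    double b[N] = {1, 2, 3}, x[N] = {0, 0, 0};
    double r[N], z[N], p[N], q[N];

    matvec(x, r);
    for (int i = 0; i < N; ++i) r[i] = b[i] - r[i];    // r = b - A x
    for (int i = 0; i < N; ++i) z[i] = r[i] / A[i][i]; // Jacobi: z = D^-1 r
    for (int i = 0; i < N; ++i) p[i] = z[i];
    double rz = dot(r, z);

    for (int it = 0; it < 100 && sqrt(dot(r, r)) > 1e-10; ++it) {
        matvec(p, q);
        double alpha = rz / dot(p, q);
        for (int i = 0; i < N; ++i) x[i] += alpha * p[i];
        for (int i = 0; i < N; ++i) r[i] -= alpha * q[i];
        for (int i = 0; i < N; ++i) z[i] = r[i] / A[i][i];
        double rz_new = dot(r, z);
        for (int i = 0; i < N; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    printf("x = (%f, %f, %f)\n", x[0], x[1], x[2]);
    return 0;
}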

Summary of GPU Progress for CAE

GPUs provide significant speedups for solver intensive jobs

Improved product quality with higher-fidelity modeling
Shortened product engineering cycles with faster simulation turnaround

Simulations recently considered impractical now possible

FEA: larger-DOF models, more complex material behavior, FSI
CFD: unsteady RANS and LES simulations practical in cost and time
Effective parameter optimization from a large increase in the number of jobs

Stan Posey NVIDIA, Santa Clara, CA, USA; [email protected]