Stan Posey NVIDIA, Santa Clara, CA, USA; [email protected]
GPU Progress Summary for GPU-Parallel CAE
Strong GPU investments by commercial CAE vendors (ISVs)
GPU adoption led by implicit FEA and CEM, followed by CFD
GPUs for 2nd-level parallelism preserve costly MPI investments
ISV focus on hybrid parallel CAE that utilizes all CPU cores + GPUs
GPUs now production-HPC for leading CAE end-user sites
Led by automotive, electronics, and aerospace industries
GPU acceleration contributing to growth in emerging CAE
New ISV developments in particle-based CFD (LBM, SPH, etc.)
Rapid growth across a range of CEM applications and GPU adoption
NVIDIA Use of CAE in Product Engineering
ANSYS Icepak – active and passive cooling of IC packages
ANSYS Mechanical – large deflection bending of PCBs
ANSYS Mechanical – comfort and fit of 3D emitter glasses
ANSYS Mechanical – shock & vib of solder ball assemblies
NVIDIA GPUs Accelerate CAE at Any Scale
Same GPU technology from MAXIMUS workstations to TITAN at ORNL
TITAN: 20+ PetaFlops, 18,688 NVIDIA Tesla K20X GPUs; #2 at Top500.org
MAXIMUS Workstation
Key application: S3D for turbulent combustion
How to efficiently burn next-generation diesel and bio fuels?
MAXIMUS for Workstations and CAE Software
NVIDIA® MAXIMUS: Visual Computing (Quadro) + Parallel Computing (Tesla)
Visual computing: CAD operations, pre-processing, post-processing
Parallel computing: FEA, CFD, CEM
Intelligent GPU job allocation; unified driver for Quadro + Tesla
ISV application certifications; HP, Dell, Lenovo, and others
Now Kepler-based GPUs; available since November 2011

DELPHI and ANSYS Mechanical GPU Acceleration
ANSYS Mechanical 14: Based on SMP Direct Solver – Tesla C2075, CUDA 3.2 vs. Multi-core Intel Westmere
Speedup of Tesla C2075 vs. Xeon 5600 series (higher is better), 1.46x average:
Battery Pack (3.2 MDOF, Block Lanczos, 100 modes): 1.73x, 8 cores + GPU vs. 8 cores
MCLC (4.7 MDOF, PCG, static): 1.41x, 6 cores + GPU vs. 6 cores
Connector (1.9 MDOF, sparse, static): 1.24x, 6 cores + GPU vs. 6 cores
System configuration: HP Z800, 2x Xeon 5650 CPUs (12 cores, 2.67 GHz), 48 GB memory, Windows 64-bit, Quadro 6000 GPU, Tesla C2075 GPU, CUDA 3.2

GPU Status of Select Automotive CAE Software
Select Automotive CAE Application      ISV CAE Software    GPU Status
CSM:
Durability (stress) and fatigue        MSC Nastran         Available today
Road handling and VPG                  Adams (for MBD)     Evaluation
Powertrain stress analysis             Abaqus/Standard     Available today
Body NVH                               MSC Nastran         Available today
Crashworthiness and safety             LS-DYNA             Implicit only, beta
CFD:
Aerodynamics / thermal UH              ANSYS Fluent        Available today, beta
IC engine combustion                   STAR-CCM+           Evaluation
Aerodynamics / HVAC                    OpenFOAM            Available today
Plastic mold injection                 Moldflow            Available today
GPU Progress – Commercial CAE Software

GPU Status           Structural Mechanics                      Fluid Dynamics                            Electromagnetics
Available today      ANSYS Mechanical, Abaqus/Standard,        ANSYS CFD (FLUENT), Moldflow,             EMPro, CST MWS, XFdtd,
                     MSC Nastran, Marc, AFEA, AMLS, FastFRS    Culises (OpenFOAM), Particleworks,        SEMCAD X, FEKO, Nexxim
                                                               SpeedIT (OpenFOAM), AcuSolve, CFD-ACE+
Product in 2013      NX Nastran, HyperWorks OptiStruct,        Abaqus/CFD                                JMAG, HFSS, Xpatch
                     PAM-CRASH implicit, LS-DYNA implicit
Product evaluation   RecurDyn, SCSK Adventure Cluster          LS-DYNA CFD, CFD++
Research evaluation  Abaqus/Explicit, RADIOSS, PAM-CRASH       FloEFD, STAR-CCM+, XFlow
Status Summary of ISVs and GPU Acceleration
Every primary ISV has products available on GPUs or ongoing evaluation
The 4 largest ISVs all have products based on GPUs, some at 3rd generation: ANSYS, SIMULIA, MSC Software, Altair
4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran; LS-DYNA is implicit only
Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream
Availability of commercial CEM software expanding with ECAE growth: CST, Remcom, Agilent, EMSS on 3rd-gen; JSOL to release JMAG, ANSYS to release HFSS
CAE Priority of ISV Software Adoption for GPUs
#1: ANSYS / ANSYS Mechanical; SIMULIA / Abaqus/Standard; MSC Software / MSC Nastran; MSC Software / Marc; LSTC / LS-DYNA implicit; Altair / RADIOSS Bulk; Siemens / NX Nastran; Autodesk / Mechanical
#2: ANSYS / ANSYS Mechanical; Altair / RADIOSS; Altair / AcuSolve (CFD); Autodesk / Moldflow
#3: ANSYS / ANSYS Fluent; OpenFOAM (various ISVs); CD-adapco / STAR-CCM+; Autodesk Simulation CFD; ESI / CFD-ACE+; SIMULIA / Abaqus/CFD
#4: LSTC / LS-DYNA; SIMULIA / Abaqus/Explicit; Altair / RADIOSS; ESI / PAM-CRASH

Commercial CAE Focus on Sparse Solvers
CFD Application Software
Read input, matrix set-up (CPU)
Implicit sparse matrix operations: 40% - 75% of profile time, small % of lines of code; run on GPU via:
- Hand-coded CUDA parallelism
- GPU libraries, e.g. CUBLAS
- OpenACC directives
Global solution, write output (CPU)
(Investigating OpenACC for more tasks on GPU)
Basics of GPU Computing for ISV Software
ISV software use of GPU acceleration is user-transparent
Jobs launch and complete without additional user steps
User informs the ISV application (GUI, command) that a GPU exists
Schematic of a CPU with an attached GPU accelerator
CPU begins/ends job, GPU manages heavy computations
Schematic: x86 CPU (cache, DDR memory) attached via PCI-Express to a GPU accelerator (GDDR memory)
1. ISV job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. ISV job completes on CPU

Motivation for CAE ISVs to Make Strong Investments in GPUs
GPUs are a Key HPC Technology Central to Current CAE Trends
CAE Trends and GPU Acceleration Benefits
Higher fidelity (better models): GPUs permit higher fidelity within existing CPU-only job times
Parameter sensitivities (more models): GPUs increase job throughput over existing CPU-only job counts, and at lower cost
Advanced methods: GPUs make practical high-order methods, more transient vs. steady analyses, use of solid finite elements vs. 2D shells, etc.
Growth in ISV software budgets: GPUs extract more use from existing ISV software investments
Cost Trends of CAE Over the Past 20 Years
Cost trends: hardware is cheap; people and software costs continue to increase
• Historically, hardware was very expensive relative to ISV software and people
• ISV software budgets are now 4x hardware budgets
• Increasingly important that hardware choices drive cost-performance efficiency in people and ISV software
ANSYS and NVIDIA Collaboration Roadmap
Release           ANSYS Mechanical                              ANSYS Fluent                               ANSYS EM
13.0, Dec 2010    SMP, single GPU, sparse and PCG/JCG solvers   -                                          ANSYS Nexxim
14.0, Dec 2011    + Distributed ANSYS; + multi-node support     Radiation heat transfer (beta)             ANSYS Nexxim
14.5, Nov 2012    + Multi-GPU support; + hybrid PCG;            + Radiation HT; + GPU AMG solver (beta),   ANSYS Nexxim
                  + Kepler GPU support                          single GPU
15.0, Q4-2013     + CUDA 5 Kepler tuning                        + Multi-GPU AMG solver;                    ANSYS Nexxim,
                                                                + CUDA 5 Kepler tuning                     ANSYS HFSS (transient)
ANSYS Mechanical 14.5 GPU Acceleration
Results for Distributed ANSYS 14.5 with 8-core CPUs and single GPUs (ANSYS Mechanical jobs per day, higher is better):
V14sp-5 model: turbine geometry, 2,100,000 DOF, SOLID187 FEs; static, nonlinear; one iteration (final solution requires 25); direct sparse solver
Westmere (Xeon X5690, 3.47 GHz): 164 CPU only; 341 with 8 cores + Tesla C2075 (2.1x acceleration)
Sandy Bridge (Xeon E5-2687W, 3.10 GHz): 210 CPU only; 395 with 8 cores + Tesla K20 (1.9x acceleration)
Results from Supermicro X9DR3-F, 64 GB memory

ANSYS Mechanical Automotive Benchmark
Source: ANSYS Automotive Simulation World Congress, 30 Oct 2012 – Detroit, MI “High-Performance Computing for Mechanical Simulations using ANSYS” By Jeff Beisheim, ANSYS, Inc.
GPU performance, speedup relative to 2 cores (no GPU): 8 cores (no GPU): 2.6x; 8 cores + 1 GPU: 3.8x
• 6.5 million DOF
• Linear static analysis
• Sparse solver (DMP)
• 2x Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4x Tesla C2075, Win7

ANSYS Mechanical 14.5 Multi-GPU Performance
GPU performance, speedup relative to 2 cores (no GPU): 8 cores + 1 GPU: 2.7x; 16 cores + 4 GPUs: 5.2x
• 11.8 million DOF
• Linear static analysis
• PCG solver (DMP)
• 2x Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4x Tesla C2075, Win7

SIMULIA and Abaqus GPU Release Progression
Abaqus 6.11, June 2011
Direct sparse solver is accelerated on the GPU
Single-GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)

Abaqus 6.12, June 2012
Multi-GPU/node; multi-node DMP clusters
Flexibility to run jobs on specific GPUs
Fermi GPUs + Kepler hotfix (since November 2012)

Abaqus 6.13, June 2013
Un-symmetric sparse solver on GPU
Official Kepler support (Tesla K20/K20X)
Rolls-Royce: Abaqus 3.5x Speedup with 5M DOF
• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Direct sparse solver, 100 GB memory
Sandy Bridge + Tesla K20X, single server; speedup relative to 8 cores (1x):
8 cores + 1 GPU: 2.11x; 8 cores + 2 GPUs: 2.42x (16 cores and 16 cores + 2 GPUs also shown)
Server with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Rolls-Royce: Abaqus Speedups on an HPC Cluster
• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear static (6 steps)
• Direct sparse solver, 100 GB memory
Sandy Bridge + Tesla K20X on up to 4 servers; elapsed time in seconds:
GPU speedups of roughly 1.8x-2.2x over CPU only at each scale (24c+4g vs. 24c on 2 servers; 36c+6g vs. 36c on 3 servers; 48c+8g vs. 48c on 4 servers)
Servers with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

MSC Nastran Release 2013 for GPUs
MSC Nastran Direct Equation Solver is GPU accelerated
Sparse direct factorization with no limit on model size
Real, complex, symmetric, and un-symmetric matrices
Impacts several solution sequences: high impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)
Multi-GPU support, for both Linux and Windows
With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs
Supported NVIDIA GPUs include Tesla 20-series, Tesla K20/K20X, Quadro 6000
Functionality developed with CUDA 5
MSC Nastran 2013 and GPU Performance
SMP + GPU acceleration of SOL101 and SOL103
Speedup vs. serial (higher is better), bars for serial / 4c / 4c+1g:
SOL101 (2.4M rows, 42K front): 1x serial, 2.7x 4 cores, 6x 4 cores + 1 GPU
SOL103 (2.6M rows, 18K front): 1x serial, 1.9x 4 cores, 2.8x 4 cores + 1 GPU
Lanczos solver (SOL103) phases: sparse matrix factorization; iterate on a block of vectors (solve); orthogonalization of vectors
Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory

MSC Nastran 2013 and NVH Simulation on GPUs
Coupled structural-acoustics simulation with SOL108
Europe auto OEM model: 710K nodes, 3.83M elements; 100 frequency increments (FREQ1); direct sparse solver
Elapsed time in minutes, lower is better (serial = 1x, ~800 min)
Speedups over serial of 2.7x-5.5x for 1c+1g, 4c (smp), 4c+1g, and 8c (dmp=2), and 11.1x for 8c+2g (dmp=2)
Server node: Sandy Bridge 2.6 GHz, 2x 8-core, 2x Tesla K20X GPUs, 128 GB memory

GPU Server Performance of RADIOSS PCG Solver
Problem: Hood of a car with pressure loads, displacements and stresses
Benchmark: 2.2 million degrees of freedom, 62 million non-zeros; 380,000 shells + 13,000 solids + 1,100 RBE3; 5,300 iterations
Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs (CUDA v3.2) and Intel Westmere 2x6 X5670 @ 2.93 GHz; Linux RHEL 5.4 with Intel MPI 4.0
1 node, elapsed (s): SMP 6-core: 1106; hybrid 2 MPI x 6 SMP: 572; SMP 6 + 1 GPU: 254 (4.3x); hybrid 2 MPI x 6 SMP + 2 GPUs: 143 (7.5x)
2 nodes, elapsed (s): hybrid 4 MPI x 6 SMP: 306; hybrid 4 MPI x 6 SMP + 4 GPUs: 85 (13x)
Summary of GPU Progress for CAE
GPUs provide significant speedups for solver intensive jobs
Improved product quality with higher fidelity modeling
Shortened product engineering cycles with faster simulation turnaround
Simulations recently considered impractical now possible
FEA: larger DOF counts in models, more complex material behavior, FSI
CFD: unsteady RANS and LES simulations practical in cost and time
Effective parameter optimization from a large increase in number of jobs