PARTNER SHOWCASE

NVIDIA is an MSC Software Performance partner with Tesla® and Quadro® Professional Solution product lines that provide excellent performance for Patran and MSC Nastran on Windows® and Linux® systems.

MSC Software: Partner Showcase - NVIDIA GPU Accelerates Simulation Performance for MSC Nastran Users

Key Highlights:

Industry
High-Performance Computing

Challenge
Increase computing performance by developing a hybrid computing model

MSC Software Solutions
MSC Nastran 2012 to support GPU computing capability including multiple GPU computing capability for DMP runs

Benefits
• Vastly reduce use of pinned host memory
• Handle arbitrarily large fronts, for very large models

The power wall (resulting from the increase in power consumption and heat dissipation due to increased processor speeds) has introduced radical changes in computer architectures. Increasing core counts, and hence increasing parallelism, have replaced increasing clock speeds as the primary way of delivering greater hardware performance. A modern GPU (Graphics Processing Unit) consists of hundreds of simple processing cores; this degree of parallelism on a single processor is typically referred to as ‘many-core’, relative to ‘multi-core’, which refers to processors with at most a few dozen cores.

Many-core GPUs will often demand a high degree of fine-grained parallelism: the application program should create many threads so that, while some threads are waiting for data to return from memory, other threads can be executing. This offers a different approach to hiding memory latency, because of their specialization to inherently parallel problems.

With the ever-increasing demand for more computing performance, the HPC industry is moving towards a hybrid computing model, where GPUs and CPUs work together to perform general purpose computing tasks. In this hybrid computing model, the GPU serves as a co-processor to the CPU. Co-processing refers to the use of an accelerator, a GPU, to offload the CPU and to increase computational efficiency. In order to exploit this hybrid computing model and the massively parallel GPU architecture, application software will need to be redesigned. MSC Software and NVIDIA engineers have been working together over the last year on the use of GPUs to accelerate the sparse direct solver in MSC Nastran.


Solver Acceleration in MSC Nastran 2012:

A sparse direct solver is possibly the most important component in a finite element structural analysis program. Typically, a multi-frontal algorithm is implemented, with out-of-core capability for solving extremely large problems and BLAS level 3 kernels for the highest compute efficiency. Elimination tree and compute kernel level parallelism with dynamic scheduling are used to ensure the best scalability. The BLAS level 3 compute kernels in a sparse direct solver are the prime candidate for GPU computing due to their high floating point density and favorable compute to communication ratio.

The proprietary symmetric MSCLDL and asymmetric MSCLU sparse direct solvers in MSC Nastran employ a super-element analysis concept instead of dynamic tree level parallelism. In this super-element analysis, the structure/matrix is first decomposed into large sub-structures/sub-domains according to user input and load balance heuristics. The out-of-core multi-frontal algorithm is then used to compute the boundary stiffness, or the Schur complement, followed by the transformation of the load vector, or the right hand side, to the boundary. The global solution is found after the boundary stiffness matrices are assembled into the residual structure and the residual structure is factorized and solved. The GPU is a natural fit for each sub-structure boundary stiffness/Schur complement calculation.

Today’s GPUs can provide memory bandwidth and floating-point performance that are several factors faster than the latest CPUs. In MSC Nastran, the most time consuming part is the BLAS level 3 operations in the multi-frontal factorization process. To date, only the trailing matrix updates of the front factorization are implemented as CUDA kernels, and these update kernels are the subject of collaborative work between NVIDIA and MSC engineers.

GPU Computing Implementation and Target Analysis (Solution Sequences):

NVIDIA’s CUDA parallel programming architecture is used to implement the update kernels. CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written with C, C++, FORTRAN, OpenCL, and other languages. Vastly reduced use of pinned host memory and the ability to handle arbitrarily large fronts, for very large models (greater than 15M DOF) on a single Tesla C2050 GPU, are some strengths of the GPU implementation in MSC Nastran 2012.

‘Staging’ is a term that is used to describe how very large fronts are handled. If the trailing submatrix is too large to fit in the GPU device memory, then it is broken up into approximately equal-sized ‘stages’ and the stages are completed in order. Multiple streams are used within a stage. For example, an arbitrarily large submatrix of, say, 40GB would be solved in, say, 10 stages of 4GB each. The actual sizes of the stages can be varied for performance tuning.

In addition, the MSC Nastran implementation supports multiple GPU computing capability for DMP (Distributed Memory Parallel) runs. In such cases of DMP>1, multiple fronts are factorized concurrently on multiple GPUs. The matrix is decomposed into two domains, and each domain is computed by an MPI process.

A typical MSC Nastran job submission command with multiple GPUs is shown below:

    nastran2012 jid=myinput mem=48gb buffsize=65537 dmp=2 gpuid=0:1 gputhresh=12000 sys205=192 sys151=1 mode=i8 sdir=/local/skodiyal/tmp bat=no scr=yes

gpuid is the ID of a licensed GPU device to be used in the analysis. Multiple IDs may be assigned to MSC Nastran DMP runs. gputhresh represents the minimum threshold for GPU computing in the multi-frontal sparse factorization. If the product of the rank size and the front size of a front is smaller than this value, the rank update of the front is processed on the CPU. Otherwise, the GPU device is used for the rank update of the front.

The GPUs supported with this implementation are the Tesla 20-series (shown in Figure 1) and Quadro GPUs based on the Fermi architecture (compute capability 2.0). Linux and Windows 64-bit platforms are supported.

Any ‘fat’ BLAS3 code path would be a potential candidate for GPU computing. The sparse direct solver intensive SOL101 (linear statics), SOL108 (direct frequency) and SOL400 (nonlinear) fall into this category.
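The gputhresh rule and the ‘staging’ of very large fronts described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not MSC Nastran's implementation: the names Front, use_gpu and plan_stages, and the stage_budget_bytes parameter, are hypothetical, and gputhresh is assumed to be compared directly against the product of rank size and front size.

```python
from dataclasses import dataclass

GB = 1024 ** 3  # bytes per gigabyte, used only for the example below


@dataclass
class Front:
    """Hypothetical description of one front in the multi-frontal factorization."""
    rank_size: int   # size of the rank update for this front
    front_size: int  # order of the frontal matrix


def use_gpu(front: Front, gputhresh: int) -> bool:
    """Return True if the rank update should go to the GPU.

    Follows the rule in the text: a front whose rank_size * front_size
    is smaller than gputhresh is processed on the CPU, otherwise on the GPU.
    """
    return front.rank_size * front.front_size >= gputhresh


def plan_stages(submatrix_bytes: int, stage_budget_bytes: int) -> list[int]:
    """Split a trailing submatrix into approximately equal-sized stages.

    stage_budget_bytes is an assumed tuning knob: the largest stage that
    is allowed to sit in GPU device memory at one time.
    """
    if stage_budget_bytes <= 0:
        raise ValueError("stage_budget_bytes must be positive")
    if submatrix_bytes <= 0:
        return []
    # Integer ceiling division: the fewest stages that respect the budget.
    n_stages = -(-submatrix_bytes // stage_budget_bytes)
    base, extra = divmod(submatrix_bytes, n_stages)
    # Spread the remainder so stage sizes differ by at most one byte.
    return [base + 1 if i < extra else base for i in range(n_stages)]


if __name__ == "__main__":
    small = Front(rank_size=100, front_size=100)
    large = Front(rank_size=200, front_size=100)
    for f in (small, large):
        target = "GPU" if use_gpu(f, gputhresh=12000) else "CPU"
        print(f, "->", target)

    stages = plan_stages(40 * GB, 4 * GB)
    print(len(stages), "stages of", stages[0] // GB, "GB each")
```

With gputhresh=12000, as in the sample command, the first front (product 10,000) would stay on the CPU and the second (product 20,000) would go to the GPU. The 40GB submatrix from the text splits into 10 stages of 4GB when the assumed budget is 4GB; changing the budget corresponds to the performance tuning of stage sizes mentioned above.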

Figure 1: NVIDIA Tesla 20-series GPUs (workstation & server form factors)


Figure 2: Automotive crank shaft (945K DOF) and engine (15.2M DOF) models

Figure 3: Performance speed-ups with Single and Multiple GPUs using MSC Nastran 2012 models

SOL108 would need a complex sparse direct solver, which is not supported in the MSC Nastran 2012 implementation; however, this feature is currently under development and testing for an upcoming point release. Likewise, conventional SOL111 (modal frequency) with large MPYADs (multiply-add) should also benefit from GPU computing in a later release.

Performance analysis with GPU Computing:

Linear and nonlinear structural stress analysis are the target applications with this first implementation of GPU computing in MSC Nastran 2012. Structural finite element models dominated by solid elements provide more concentrated computational work in the sparse matrix factorization, which is highly desirable for the GPU. A range of models with varying fidelity, from around 1M degrees of freedom (DOF) to 15M DOF, is considered (Figure 2). Performance comparisons are relative to a serial Nastran run, which is still widely adopted within the customer community, as well as with multi-core (2x quad-core Nehalem) CPUs.

The hardware configurations used with these benchmark runs consisted of:

(1) AMAX server, Linux, 2x hex-core Westmere, 2.67GHz, 32GB memory, 2x Tesla C2050 GPU for the 945K and 1.3M DOF model

(2) Super Micro server, Linux, 2x quad-core Nehalem 2.27GHz, 96GB memory, 2.2 TB SATA 5-way striped RAID and 2x Tesla C2050 GPU for all other models.

Figure 3 shows the end-to-end (total) speed-up for single and multiple GPU runs. In general, based on the benchmark models, we see speed-ups in the range of 4-6X with a single GPU over a serial run and in the range of 1.4-2X with 2 GPUs over an 8-core DMP run.

Summary:

GPU computing is implemented in MSC Nastran 2012 to significantly lower the simulation times for industry standard analysis models. Vastly reduced use of pinned memory and the ability to handle arbitrarily large front sizes for very large models are some of the strengths of this implementation. Further, multiple GPUs can be used with Nastran DMP analysis. The performance speed-ups enabled by GPU computing will help MSC Nastran users add more realism to their models, thus improving the quality of the simulations. A rapid CAE simulation capability from GPUs has the potential to transform current practices in engineering analysis and design optimization procedures.

This initial GPU computing implementation also identified certain issues. For one, the larger the model, the higher the DMP overhead in MSC Nastran. This increased CPU side overhead reduces the overall speed-up resulting from GPU computing. Future releases of MSC Nastran will address such issues as well as expand the GPU computing capability to include complex solver kernels for the NVH and dynamics markets.
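The effect of CPU side overhead on the end-to-end speed-up can be illustrated with Amdahl's law. The sketch below is a generic model, not a measurement from these benchmarks: the function name end_to_end_speedup and the fractions and kernel speed-up used in the example are hypothetical values chosen only to show the trend.

```python
def end_to_end_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: total speed-up when only part of the run is accelerated.

    accelerated_fraction: share of the original run time spent in the part
        that is offloaded to the GPU (between 0 and 1).
    kernel_speedup: how much faster that part runs on the GPU.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction  # time that stays on the CPU
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical numbers: the same GPU kernel speed-up, but a growing
    # share of the run time left on the CPU (for example, DMP overhead).
    for fraction in (0.95, 0.90, 0.80):
        total = end_to_end_speedup(fraction, kernel_speedup=10.0)
        print(f"accelerated share {fraction:.0%}: end-to-end speed-up {total:.2f}x")
```

In this model, the end-to-end speed-up falls as the share of time left on the CPU grows, even though the GPU kernels themselves are no slower, which is consistent with the overhead issue described above.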


About MSC Software

MSC Software is one of the ten original software companies and the worldwide leader in multidiscipline simulation. As a trusted partner, MSC Software helps companies improve quality, save time and reduce costs associated with design and test of manufactured products. Academic institutions, researchers, and students employ MSC technology to expand individual knowledge as well as expand the horizon of simulation. MSC Software employs 1,000 professionals in 20 countries. For additional information about MSC Software’s products and services, please visit www.mscsoftware.com.

About MSC Nastran

MSC Nastran Structural & Multidiscipline FEA

MSC Nastran is the world’s most widely used Finite Element Analysis (FEA) solver that helped MSC Software become recognized in 2011 as one of the “10 Original Software Companies”. When it comes to solving for stress/strain behavior, dynamic and vibration response and thermal gradients in real-world systems, MSC Nastran is recognized as the most trusted multidiscipline solver in the world.

MSC Nastran is built on work done by NASA scientists and researchers, and is trusted for the design of mission critical systems in every industry. Nearly every spacecraft, aircraft, and vehicle designed in the last 40 years has been analyzed using MSC Nastran. In recent years, several extensions to its capabilities have resulted in a single multidisciplinary solver providing users with a trusted solution to simulate everything from a single component to complex assemblies under diverse conditions.

MSC Nastran offers a complete set of linear static and dynamic analysis capabilities along with unparalleled support for superelements enabling users to solve large, complex assemblies more efficiently. MSC Nastran also offers a complete set of implicit and explicit nonlinear analysis capabilities, thermal and interior/exterior acoustics, and coupling between various disciplines such as thermal, structural, and fluid interaction. New modular packaging that enables you to get only what you need makes it more affordable to own MSC Nastran than ever before.

Please visit www.mscsoftware.com for more partner showcases

Corporate
MSC Software Corporation
2 MacArthur Place
Santa Ana, California 92707
Telephone 714.540.8900
www.mscsoftware.com

Europe, Middle East, Africa
MSC Software GmbH
Am Moosfeld 13
81829 Munich, Germany
Telephone 49.89.431.98.70

Asia-Pacific
MSC Software Japan LTD.
Shinjuku First West 8F
23-7 Nishi Shinjuku 1-Chome, Shinjuku-Ku
Tokyo, Japan 160-0023
Telephone 81.3.6911.1200

Asia-Pacific
MSC Software (S) Pte. Ltd.
100 Beach Road
#16-05 Shaw Tower
Singapore 189702
Telephone 65.6272.0082

The MSC Software corporate logo, MSC, and the names of the MSC Software products and services referenced herein are trademarks or registered trademarks of the MSC Software Corporation in the United States and/or other countries. All other trademarks belong to their respective owners. © 2012 MSC Software Corporation. All rights reserved.

NVIDIA*2012MAY*PS