
Recent Advances in Computing Infrastructure at Penn State and A Proposal for Integrating Computational Science in Undergraduate Education

a presentation for the HPC User Forum

2013 Fall Conference of the Korean Society of Computational Science & Engineering
Sponsored by the National Institute of Supercomputing & Networking
Seoul, South Korea, October 1, 2013

Vijay K. Agarwala
Senior Director, Research Computing and CI
Information Technology Services
The Pennsylvania State University
University Park, PA 16802 USA

What is missing in today's undergraduate education?

• Lack of comprehensive training in computational science principles in most majors other than Computer Science:
– Object Oriented Programming and Discrete Mathematics
– Data Structures and Algorithm Analysis
– Mathematical Methods and Numerical Analysis
– Parallel Programming and Numerical Algorithms

• A typical BS/BA graduate is not well-prepared for a computationally intensive graduate program

• Computational proficiency is being acquired on the job, either in graduate school or at the place of employment

What could universities add or change?

• An integrated 5-year program, either a BS/MS or a dual BS/BA
– Keep the general education requirements intact; the tradition of a liberal arts education is highly valued.
– The discipline-specific part of the undergraduate curriculum cannot be shortened either: there is more to be taught than ever before in all majors.
– Add 8-10 classes to teach core principles of computational science and to extend the computational aspects of a discipline.
– The intent is not to produce mere computationalists, but graduates who are solidly grounded in science and engineering.
• e.g., a student earns a BS in Mechanical Engineering and concurrently an MS or a second BS in Computational Science
– Incentivize the 5th year via scholarships and/or tuition support.

A few computationally-intensive majors

For a dual-degree program, a BS in any of the majors below can be combined with a concurrent MS or a second BS in Computational Science

• Aerospace Engineering
• Agricultural Engineering
• Bioengineering
• Biological Engineering
– Agricultural
– Natural Resources
• Civil Engineering
• Chemical Engineering
– Biomolecular and Bioprocessing
– Energy and Fuels
• Computer Engineering
• Electrical Engineering
• Engineering Science
• Mechanical Engineering
• Nuclear Engineering
• Astronomy and Astrophysics
• Biochemistry and Molecular Biology
• Neuroscience
• Mathematics
– Computational Mathematics
– Systems Analysis
• Physics
• Statistics
– Computational
– Biostatistics
• Political Science
• Economics
– Quantitative Methods
• Sociology
and many more academic disciplines/majors

A potential example of an Integrated Program:
B.S. (Mechanical Engineering) and M.Engrg. (Computational Science), or
B.S. (Mechanical Engineering) and B.S. (Computational Science)

TOTAL CREDITS = 161

Program Requirements

First Semester (17 credits)
• ENGL 015 Rhetoric and Composition, or ENGL 030 Honors Freshman Composition (3)
• FYS First Year Seminar (1)
• EDSGN 100 Introduction to Engineering Design (3)
• GA, GH, or GS course (3)
• CHEM 110 Chemical Principles (3)
• MATH 140 Calculus with Analytical Geometry I, or MATH 140E Calculus with Engineering Applications I (4)

Second Semester (17 credits)
• ECON 002 Microeconomic Analysis and Policy, or ECON 004 Macroeconomic Analysis and Policy, or ECON 014 Principles of Economics (GS, Social Science) (3)
• CHEM 112 Chemical Principles (3)
• GA, GH, or GS course (3)
• PHYS 211 Gen Physics: Mechanics (4)
• MATH 141 Calculus with Analytical Geometry II, or MATH 141E Calculus with Engineering Applications II (4)

Third Semester (17 credits)
• CAS 100A/B Effective Speech (3)
• E MCH 211 Statics (3)
• MATH 220 Matrices (2)
• MATH 231 Calculus of Several Variables (2)
• PHYS 212 Gen Physics: Electricity and Magnetism (4)
• Introduction to Programming (C/C++, Fortran) (3)

Fourth Semester (17.5 credits)
• E MCH 212 Dynamics (3)
• E MCH 213 Strength of Materials, or E MCH 213D Strength of Materials with Design (3)
• GHA Health and Physical Activity (1.5)
• M E 300 Engineering Thermodynamics I (3)
• MATH 251 Ordinary and Partial Differential Equations (4)
• Object Oriented Programming (3)

Fifth Semester (15.5 credits)
• E E 212 Introduction to Electronic Measuring Systems (3)
• E MCH 315 Mechanical Response of Engineering Materials (2)
• GHA Health and Physical Activity (1.5)
• M E 320 Fluid Flow (3)
• MATSE 259 Properties and Processing of Engineering Materials (3)
• Data Structures and Algorithm Analysis I (3)

Sixth Semester (17 credits)
• ENGL 202C Technical Writing (3)
• M E 340 Mechanical Engineering Design Methodology (3)
• M E 360 Mechanical Design (3)
• M E 370 Vibration of Mechanical Systems (3)
• PHYS 214 Wave Motion and Quantum Physics (2)
• Data Structures and Algorithm Analysis II (3)

First Summer Internship in Industry

Seventh Semester (17 credits)
• M E 450 Modeling of Dynamic Systems (3)
• M E 345 Instrumentation, Measurements, and Statistics (4)
• M E 410 Heat Transfer (3)
• M E Lab Course (1)
• GA, GH, or GS course (3)
• Mathematical Methods and Numerical Analysis I (3)

Eighth Semester (16 credits)
• Engineering Technical Elective (ETE) (3)
• GA, GH, or GS course (3)
• GA, GH, or GS course (3)
• I E 312 Design and Manufacturing Processes (3)
• M E Lab Course (1)
• Mathematical Methods and Numerical Analysis II (3)

Second Summer Internship in Industry

Ninth Semester (15 credits)
• Engineering Technical Elective (ETE) (3)
• M E Technical Elective (METE) (3)
• M E 442W Design Project (3)
• Parallel Programming and Numerical Algorithms I (3)
• Computational Analysis – Special Topics (3)

Tenth Semester (12 credits)
• General Technical Elective (GTE) (3)
• M E 443W Design Project (3)
• Parallel Programming and Numerical Algorithms II (3)
• Computational Analysis – Special Topics (3)

Research Computing and Cyberinfrastructure (RCC) Services

At a high level, RCC services can be grouped into four categories:
1. Running state-of-the-art computational and data engines
2. Teaching workshops, seminars, and guest lectures in courses in the areas of large-scale computational techniques and systems
3. Providing one-on-one support and consulting for independent, curiosity-driven research
4. Providing support for sponsored research

Services provided by RCC

• Provide systems services by researching current practices in file systems, data storage, and job scheduling, as well as computational support related to compilers, parallel computation, libraries, and other tools. Also support the use of GIS data and large computed datasets to gain better insight from the results of simulations.
• Enable large-scale computations and data management by building and operating several state-of-the-art computational engines.
• Consolidate, and thus significantly increase, the research computing resources available to each faculty participant. Faculty members can frequently exceed their share of the machine to meet peak computing needs.
• Provide support and expertise for using programming languages, libraries, and specialized data and software for several disciplines.
• Investigate emerging visual computing technologies and implement leading-edge solutions in a cost-effective manner to help faculty better integrate data visualization tools and immersive facilities into their research and instruction.
• Investigate emerging architectures for data- and numerically intensive computations and work with early-stage companies in areas such as interconnects, networking, and graphics processors.
• Help build inter- and intra-institutional research communities using cyberinfrastructure technologies.
• Maintain close contacts with NSF, DoE, NIH, NASA, and other federally funded centers, and help faculty members with porting and scaling of codes across different systems.

Programs, Libraries, and Application Codes in Support of Computational Research

• Bioinformatics: BLAST/BLAST+, CUDASW++, RepeatMasker, Trinity
• Code Development: Boost C++ Libraries, CMake, CUDA, Eclipse, FFTW, GCC, GNU Scientific Library (GSL), Haskell, HDF5, Intel, Java, MAGMA, Intel-MKL, OpenMPI, PETSc, PGI, Python, Subversion, TotalView, Valgrind
• Computational Fluid Dynamics: ANSYS Fluent, GAMBIT, GridPro, OpenFOAM, Pointwise, STAR-CCM+
• Computer Vision: OpenCV
• Connectivity Solutions: Exceed onDemand
• Electromagnetics: Meep
• Electron Microscopy: Auto3Dem, EMAN2
• Electronic Structure/DFT: ABINIT, BigDFT, Quantum ESPRESSO, SIESTA, VASP, WanT, WIEN2k
• Evolutionary Biology: BEAGLE, BEAST, MrBayes, OpenBUGS
• FEM: Abaqus, ANSYS, COMSOL Multiphysics, LS-DYNA, MSC Nastran, OpenSees, Patran
• Materials Science: PARATEC
• Mathematics: GAUSS, gridMathematica, Maple, Mathematica, MATLAB
• Molecular Modeling: Amber12, DL_POLY, GROMACS, LAMMPS, NAMD, nMOLDYN, OpenMM, Schrodinger
• Optimization: AMPL, CPLEX, GAMS, MATLAB, SeDuMi
• Quantum Chemistry: CHARMM, NWChem, Q-Chem, TeraChem/PetaChem
• Scientific Collaboration: HUBzero
• Statistics: GAUSS, MATLAB, R, SAS, Stata
• Visualization: Avizo, DCV, formZ, GView, IDL, Maya, ParaView, PyMOL, SCIRun, Tecplot, VisIt, VMD, VTK

All software installations are driven by faculty. The software stack on every system is customized and entirely shaped by faculty needs.

Visualization Services and Facilities

Promote effective use of visualization and VR techniques in research and teaching via strategic deployment of facilities and related support. Staff members provide consulting, teach seminars, assist faculty, and support facilities for visualization and VR. Recent support areas include:
• Visualization and VR system design and deployment
• 3D modeling applications: FormZ, Maya
• Data visualization applications: OpenDX, VTK, VisIt
• 3D VR development and device libraries: VRPN, VRML, JAVA3D, OpenGL, OpenSG, CaveLib
• Domain-specific visualization tools: Avizo (Earth, Fire, Green, Wind), VMD, SCIRun
• Telecollaborative tools and facilities: Access Grid, inSORS, VNC
• Parallel graphics for scalable visualization: ParaView, DCV, VisIt
• Programming support for graphics (e.g., C/C++, JAVA3D, Tcl/Tk, Qt)

Successful installations:

• Immersive Environments Lab: in partnership with the School of Architecture and Landscape Architecture
• Immersive Construction Lab: in partnership with Architectural Engineering
• Visualization/VR Lab: in partnership with the Materials Simulation Center
• Visualization/VR Lab: in partnership with Computer Science and Engineering
• Sports Medicine VR Lab: a partnership between Kinesiology, Athletics, and the Hershey Medical Center

Immersive Construction Lab: Virtual Environments for Building Design, Construction Planning and Operations Management

Dr. John I. Messner, Associate Professor of Architectural Engineering, College of Engineering

Immersive Environments Lab: Virtual Environments for Architectural Design Education and Telecollaborative Design Studio

Loukas Kalisperis, Professor of Architecture, College of Arts and Architecture
Katsuhiko Muramoto, Associate Professor of Architecture, College of Arts and Architecture

Virtual Reality Lab: VR for Assessment of Traumatic Brain Injury

Dr. Semyon Slobounov, Professor of Kinesiology, College of Health and Human Development
Dr. Wayne Sebastianelli, Director of Sports Medicine, College of Medicine

RCC support of sponsored research programs

• Lion-X clusters are built with a combination of RCC/ITS funds and investments from research grants by participating faculty members.
• RCC partners with faculty to provide a guaranteed level of service in exchange for keeping Lion-X clusters open to the broader community at Penn State.
• Faculty partnerships give researchers access to a larger shared system than they would be able to build on their own, an extensive and well-supported software stack, and the computing expertise of RCC staff members.

Penn State RCC/ITS Distributed Data Center

Lion-XG – The latest compute engine at Penn State RCC/ITS
286 nodes with 5,008 cores and 64-256 GB of memory per node

• 144 nodes with two 8-core Intel Xeon "Sandy Bridge" E5-2665 processors and 64 GB of memory
• 48 nodes with two 8-core Intel Xeon "Sandy Bridge" E5-2665 processors and 128 GB of memory
• 96 nodes with two 10-core Intel Xeon "Ivy Bridge" E5-2670 v2 processors and 256 GB of memory
• 5,008 total cores and 40 TB of total memory; estimated maximum performance of 90 TFLOPS
• 10 Gb/s Ethernet and 56 Gb/s FDR InfiniBand to each node; the InfiniBand fabric has full bisection bandwidth within each 16-node chassis and half bisection bandwidth among the 12 chassis
• Brocade VDX 6730 10 Gb/s Ethernet switching with 16 ports of 8 Gb/s Fibre Channel to enable FCoE
• Transparent Interconnection of Lots of Links (TRILL) connection to RCC core switches using Brocade VCS technology

Performance Impact of Recent Advanced Computing Infrastructure Developments
1. Storage upgrade to Fibre Channel attached arrays
2. GPFS metadata upgrade with Fibre Channel attached flash memory arrays
3. Large memory servers accelerate Big Data science
4. GPU-enabled clusters accelerate research across disciplines

Storage upgrade to Fibre Channel attached arrays

• Previous GPFS storage architecture
– Most storage was SAS-attached to pairs of servers.
– If both servers in a pair became unavailable, the data became unavailable.
– Access to most storage for the GPFS NSD servers was over 10 Gb/s Ethernet.
• Current GPFS storage architecture
– All storage and all GPFS NSD servers are Fibre Channel attached to a 16 Gb/s Fibre Channel switch.
– Any GPFS NSD server has direct block-level access to all storage, increasing the speed of system operations such as backup and of GPFS operations such as migrating data to new disk.
– Seven NSD servers are in the failover group for any filesystem, so it is possible to lose up to six servers and still have access to data, assuming quorum is not lost.
– Allows creation of fast Data Mover hosts with direct Fibre Channel access to GPFS storage for users to upload or download data, with no penalty for sending data over Ethernet to the NSD servers to reach storage.

GPFS metadata upgrade with Fibre Channel attached flash memory arrays

• GPFS metadata issues faced
– Metadata-I/O-intensive system operations, such as backup and GPFS policy scans, were bottlenecked by the performance of the previous metadata disks.
– GPFS file system responsiveness to user operations was impacted during intensive system operations.
• Previous metadata configuration
– 100 15K RPM SAS drives in multiple drive enclosures, drawing about 2.5 kW and using 20U of rack space, providing 19,000 IOPS to five GPFS filesystems with 0.5 PB capacity.
• Current metadata configuration
– 2 flash memory arrays, drawing about 500 W and using 2U of rack space, providing 400,000 IOPS to nine GPFS filesystems with 2 PB capacity.
• Effect of moving to flash memory arrays for GPFS metadata
– Nightly incremental backups: time to run a nightly incremental backup on 212 million files reduced from six hours to one hour.
– Able to run a GPFS policy scan on 400 million files in five minutes.
– Responsiveness of file systems to users is no longer impacted by metadata-intensive system operations or general metadata load; time to open/save/create files is greatly reduced (a sketch of the corresponding commands follows below).
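The backup and policy-scan timings above come from standard GPFS administrative operations. As a hedged sketch only (the filesystem path, node class, and policy file name are hypothetical placeholders, not taken from this presentation), the two operations look roughly like this:

    # Nightly incremental backup of a GPFS filesystem (placeholder path and node class)
    mmbackup /gpfs/work -t incremental -N nsdNodes

    # Metadata policy scan over all inodes, driven by a simple LIST policy file;
    # -I defer writes the candidate file lists out instead of acting on them
    mmapplypolicy /gpfs/work -P policy.rules -I defer -N nsdNodes -g /gpfs/work/.mmtmp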

Large memory servers accelerate Big Data science
Thermal-Hydraulic Analysis of the New Penn State Breazeale Reactor Core using the ANSYS Fluent® Code

• The Penn State Breazeale Reactor (PSBR) is a MARK-III type TRIGA design with 1 MWt nominal power
• The existing PSBR core has some inherent design problems which limit the experimental capability of the facility
• A thermal-hydraulics analysis of the new design was initiated to verify safe operation of the new core-moderator configuration
• The purpose of the study is to generate ANSYS Fluent® CFD models of the existing and new PSBR designs to calculate the temperature and flow profiles in the core and around the core-moderator designs

(Figure: 3D CAD drawings of the existing and new PSBR designs.)

Large memory servers accelerate Big Data science
Thermal-Hydraulic Analysis of the New Penn State Breazeale Reactor Core using the ANSYS Fluent® Code

• Two CFD models, of the existing and new PSBR designs, were prepared using ANSYS GAMBIT®, each with a total of ~30 million structured and unstructured 3D cells
• Simulating either model requires ~300 GB of memory, something that can only be accommodated on our large-memory machine, Lion-LSP
• Simulation runtime on 24 processors is about a day

(Figure: velocity profile calculated by Fluent in the center of the existing PSBR design, side view and top view.)

GPU-enabled clusters accelerate research across disciplines

(Diagram: application classes positioned between GPU performance and CPU performance: Finite Differences, Small BLAS/LAPACK, Large BLAS/LAPACK, Monte Carlo, Density Functional Theory, Sparse Linear Algebra, Quantum Chemistry, Reverse Time Migration, Kirchhoff Depth Migration, Kirchhoff Time Migration.)

• There are several key entry points to GPU acceleration, and PSU engages at several levels:
– Existing GPU applications, e.g., PetaChem, ANSYS, ABAQUS
– Linking against libraries, e.g., cuBLAS
– Using OpenACC compiler directives (see the sketch below)
– Porting code to the GPU using CUDA
• Depending on the algorithm/application, one realistically sees a 2x-50x speedup over optimized CPU code, and GPU FLOPS/watt >> CPU FLOPS/watt
• The following examples compare M2090 performance against Intel Westmere; all applications were compiled with -O2 optimization and SSE where applicable
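As a hedged illustration of the directive-based entry point (a generic sketch, not code from this presentation; the function, array layout, and problem size are hypothetical), a small Jacobi-style stencil sweep can be offloaded with OpenACC and built with an OpenACC compiler such as PGI (e.g., pgc++ -acc):

    // One Jacobi-style stencil sweep offloaded to the GPU with OpenACC directives.
    // The data clauses move the arrays between host and device memory.
    void jacobi_sweep(const float* in, float* out, int n)
    {
        #pragma acc parallel loop copyin(in[0:n*n]) copy(out[0:n*n])
        for (int i = 1; i < n - 1; ++i) {
            #pragma acc loop
            for (int j = 1; j < n - 1; ++j) {
                out[i*n + j] = 0.25f * (in[(i-1)*n + j] + in[(i+1)*n + j] +
                                        in[i*n + j - 1] + in[i*n + j + 1]);
            }
        }
    }

The same loop nest runs unchanged on the CPU when the directives are ignored, which is what makes this entry point attractive relative to a full CUDA port.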

(Chart: solution time in seconds vs. number of CPUs (1, 2, 4, 8, 12), with and without GPU acceleration; reported times range from 2457 s down to 631 s.)

GPU-enabled clusters accelerate research across disciplines
Accelerating Finite Element Analyses with GPUs: ABAQUS/Standard 6.11-1, 2.2 million DOFs

(Chart: solution time in seconds vs. number of CPUs (1, 2, 4, 8, 12), with and without GPU acceleration; reported times range from 3306 s down to 493 s.)

GPU-enabled clusters accelerate research across disciplines
Cluster configurations for benchmarking

• Lion-GA: CPU – 12 X5675 cores @ 3.07 GHz per node; GPU – 8 M2070 or 8 M2090 per node; nodes equipped with GPUs – 8; interconnect – 40 Gb/s Mellanox QDR InfiniBand
• Stampede: CPU – 16 E5-2680 cores @ 2.70 GHz per node; GPU – 1 K20c per node; nodes equipped with GPUs – 120; interconnect – 56 Gb/s Mellanox FDR InfiniBand

GPU-enabled clusters accelerate research across disciplines
Cluster configurations for benchmarking

• Lion-GA configuration
– 3 GPUs per PCIe switch
– 3 to 5 GPUs per IOH chip

• Bottleneck: GPU-to-GPU communication
– A single IOH supports 36 PCIe lanes; NVIDIA recommends 16 per device
– The IOH does not currently support peer-to-peer transfers between GPU devices on different chipsets
– It is therefore difficult to achieve peak transfer rates across GPUs on different sockets (a small query sketch follows below)
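A minimal sketch (not from the presentation) of how this limitation can be observed with the CUDA runtime API; on nodes like those described above, device pairs attached to different IOH chips would typically report no peer access:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int a = 0; a < ndev; ++a) {
            for (int b = 0; b < ndev; ++b) {
                if (a == b) continue;
                int ok = 0;
                // ok becomes 1 if device 'a' can directly access device 'b' memory
                cudaDeviceCanAccessPeer(&ok, a, b);
                printf("GPU %d -> GPU %d : peer access %s\n", a, b, ok ? "yes" : "no");
            }
        }
        return 0;
    }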

GPU-enabled clusters accelerate research across disciplines
Fractional Quantum Hall Effect

• FQHE at filling fraction ν = 5/2 is believed to support excitations that exhibit non-Abelian statistics, so far not observed in any other known system
• Research involves numerical studies of competing variational wave functions Ψ that contain intricate correlations; this requires N! / [(N/2)! (N/2)!] LU decompositions of (N/2) × (N/2) matrices
• Scaling is the same when increasing the batch size (small matrices < cache size)
• For example, for N = 11 with 1024 Monte Carlo iterations, computation time using 8 cores and 8 GPU devices (with MPI) is ~246 seconds, down from ~31,488 seconds on a single core
• Batch decompositions of small matrices greatly benefit from nodes with multiple GPUs
• Each GPU is assigned a smaller subset of the matrix set

• Not much communication across the GPUs is required for the decomposition (see the batched-LU sketch below)

GPU-enabled clusters accelerate research across disciplines
Fractional Quantum Hall Effect

(Charts: batch LU decomposition of 1M 16x16 matrices on Lion-GA – number of matrices decomposed per second, and total time to completion in ms, vs. number of GPUs (1-8 Tesla M2090).)
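The batched approach maps naturally onto cuBLAS's batched LU routine. The sketch below is illustrative only (it is not the research group's code, and the batch size is kept far smaller than the ~1M matrices in the charts purely to keep the example light):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>
    #include <cstdio>

    int main()
    {
        const int n = 16;           // each matrix is 16x16
        const int batch = 100000;   // illustrative batch size

        cublasHandle_t handle;
        cublasCreate(&handle);

        // All matrices live contiguously on the device; the batched routine takes
        // an array of per-matrix device pointers.
        double* dA = nullptr;
        cudaMalloc(&dA, sizeof(double) * n * n * batch);
        std::vector<double*> hPtrs(batch);
        for (int i = 0; i < batch; ++i) hPtrs[i] = dA + static_cast<size_t>(i) * n * n;
        double** dPtrs = nullptr;
        cudaMalloc(&dPtrs, sizeof(double*) * batch);
        cudaMemcpy(dPtrs, hPtrs.data(), sizeof(double*) * batch, cudaMemcpyHostToDevice);

        // Fill every matrix with a simple diagonally dominant pattern so the
        // factorization is well defined (illustrative values only).
        std::vector<double> hA(static_cast<size_t>(n) * n * batch);
        for (size_t k = 0; k < hA.size(); ++k)
            hA[k] = ((k / n) % n == k % n) ? n : 1.0;
        cudaMemcpy(dA, hA.data(), sizeof(double) * hA.size(), cudaMemcpyHostToDevice);

        int *dPiv = nullptr, *dInfo = nullptr;
        cudaMalloc(&dPiv,  sizeof(int) * n * batch);  // pivot indices per matrix
        cudaMalloc(&dInfo, sizeof(int) * batch);      // per-matrix status codes

        // One call factors the whole batch; a multi-GPU run simply hands each
        // device its own subset of the batch, as described on the slide.
        cublasDgetrfBatched(handle, n, dPtrs, n, dPiv, dInfo, batch);
        cudaDeviceSynchronize();

        printf("factored %d %dx%d matrices\n", batch, n, n);
        cudaFree(dInfo); cudaFree(dPiv); cudaFree(dPtrs); cudaFree(dA);
        cublasDestroy(handle);
        return 0;
    }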

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

• Molecular Dynamics is widely used for the simulation of solvated proteins and molecules and makes use of various force fields (AMBER, ReaxFF, etc.)

• The AMBER force field is implemented in the eponymous software suite

• The PMEMD program in AMBER is used for both explicit-solvent Particle Mesh Ewald (PME) and implicit-solvent Generalized Born (GB) simulations

• AMBER does not require extensive communication between GPUs or between CPU and GPU, and does not take advantage of the CPU if GPUs are used

• GPU acceleration allows for longer simulated times, on the order of nanoseconds or more (a hedged run sketch follows below)
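A hedged sketch of how such runs are typically launched with PMEMD; the file names are hypothetical and the &cntrl settings are generic explicit-solvent PME/NPT values, not the exact benchmark inputs used for the charts that follow:

    Generic explicit-solvent PME run in the NPT ensemble (illustrative mdin file)
     &cntrl
       imin=0, irest=0, ntx=1,
       nstlim=50000, dt=0.002,
       ntc=2, ntf=2, cut=8.0,
       ntb=2, ntp=1, taup=2.0,
       ntt=3, gamma_ln=2.0, tempi=300.0, temp0=300.0,
       ntpr=1000, ntwx=1000,
     /

    # Single GPU:
    pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd
    # Several GPUs on one node, one MPI rank per GPU:
    mpirun -np 4 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd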

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

PME simulation of the DHFR protein in water (NPT ensemble, 23,558 atoms): achieved performance on Lion-GA
(Chart: ns/day for 12 X5675 CPU cores and for 1-8 M2090 GPUs; y-axis up to 80 ns/day.)

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

PME simulation of the Factor IX molecule in water (NPT ensemble, 90,906 atoms): achieved performance on Lion-GA
(Chart: ns/day for 12 X5675 CPU cores and for 1-8 M2090 GPUs; y-axis up to 18 ns/day.)

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

PME simulation of a cellulose molecule in water (NPT ensemble, 408,609 atoms): achieved performance on Lion-GA
(Chart: ns/day for 12 X5675 CPU cores and for 1-8 M2090 GPUs; y-axis up to 4.5 ns/day.)

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

Implicit-solvent GB simulation of myoglobin (2,492 atoms): achieved performance on Lion-GA
(Chart: ns/day for 12 X5675 CPU cores and for 1-8 M2090 GPUs; y-axis up to 200 ns/day.)

GPU-enabled clusters accelerate research across disciplines
Molecular Dynamics with AMBER force fields

Implicit-solvent GB simulation of a nucleosome (25,095 atoms): achieved performance on Lion-GA
(Chart: ns/day for 12 X5675 CPU cores and for 1-8 M2090 GPUs; y-axis up to 7 ns/day.)

GPU-enabled clusters accelerate research across disciplines
DFT using Quantum Espresso

• Density Functional Theory (DFT) has enjoyed huge growth in popularity owing to computational and numerical advancements, and is used widely in materials science
• Quantum Espresso (QE) is an open-source DFT package which has recently added GPU acceleration, largely through BLAS and FFT routines
• When building QE with MAGMA (UT/ORNL), one introduces heterogeneous linear algebra routines which use both CPU and GPU

• Benchmark (a hedged input-file sketch follows this list):
– Self-consistent field calculation using PBE pseudopotentials
– 168 atoms
– Periodic boundary conditions
– Kinetic energy cutoff for the charge density of 80 Ry
– Davidson diagonalization
– Semi-empirical dispersion correction (DFT+D) applied
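For reference, a hedged sketch of a pw.x input for an SCF run of this kind. The prefix, pseudopotential directory, number of atom types, and convergence threshold are placeholders; the 80 Ry cutoff quoted on the slide is shown here as ecutwfc (adjust ecutwfc/ecutrho to match the actual benchmark), and the CELL_PARAMETERS, ATOMIC_SPECIES, ATOMIC_POSITIONS, and K_POINTS cards are omitted:

    &CONTROL
      calculation = 'scf'
      prefix      = 'cellulose'
      pseudo_dir  = './pseudo'
    /
    &SYSTEM
      ibrav    = 0
      nat      = 168
      ntyp     = 3
      ecutwfc  = 80.0
      vdw_corr = 'DFT-D'
    /
    &ELECTRONS
      diagonalization = 'david'
      conv_thr        = 1.0d-8
    /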

GPU-enabled clusters accelerate research across disciplines
DFT using Quantum Espresso

SCF calculation of cellulose: total walltime (in hours) on Lion-GA and on Stampede@TACC
(Charts: Lion-GA walltime, y-axis up to 25 hours, for 1 X5675 CPU node and 1, 2, 4, and 8 M2070 GPUs; Stampede walltime, y-axis up to 7 hours, for 1, 2, 4, 8, 16, and 32 K20 GPUs.)

GPU-enabled clusters accelerate research across disciplines
Quantum chemistry using PetaChem

• A quantum chemistry package designed to run on NVIDIA GPU hardware
• Several key features:
– Restricted Hartree-Fock and grid-based Kohn-Sham single-point energy and gradient calculations
– Various functionals supported
– Geometry optimization
– Ab initio molecular dynamics
– Support for multiple GPUs
• Benchmark: single-point energy for Olestra using the 6-31g basis set (a hedged input sketch follows below)
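A hedged sketch of a TeraChem/PetaChem input for such a single-point run; the coordinate file name, charge, and GPU count are hypothetical placeholders. The job is then launched by passing this file to the terachem executable:

    # single-point RHF energy in the 6-31g basis (illustrative start file)
    coordinates  olestra.xyz
    basis        6-31g
    method       rhf
    charge       0
    run          energy
    gpus         8
    end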

GPU-enabled clusters accelerate research across disciplines
Quantum chemistry using PetaChem

PetaChem Olestra SCF calculation: total walltime (in seconds) on Lion-GA
(Chart: walltime, y-axis up to 600 s, for 1 through 8 M2070 GPUs.)