
Case Study: Beyond Homogeneous Decomposition with Scaling Long-Range Forces on Massively Parallel Systems

Donald Frederick, LLNL – Presented at Supercomputing '11
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
LLNL-PRES-508651

Case Study: Outline

• Problem Description
• Computational Approach
• Changes for Scaling

Computer simulations of materials

Computer simulations are widely used to predict the properties of new materials or understand the properties of existing ones.

Simulation of Materials from First-Principles

First-principles methods: Calculate properties of a given material directly from fundamental physics equations.

• No empirical parameters
Can make predictions about complex or novel materials, under conditions where experimental data is unavailable or inconclusive.

• Chemically dynamic
As atoms move, chemical bonds can be formed or broken.
• Computationally expensive
Solving quantum equations is time-consuming, limiting system sizes (e.g. hundreds of atoms) and simulation timescales (e.g. picoseconds).
[Figure: electron density surrounding water, calculated from first principles]

First-Principles Simulation is Hard

Properties of a many-body quantum system are given by Schrödinger's Equation:

\[
\left[-\frac{\hbar^2}{2m}\nabla^2 + V_{ion}(\vec r) + V_H(\vec r) + V_{xc}(\vec r)\right]\psi_i(\vec r) = \varepsilon_i\,\psi_i(\vec r)
\]

Exact numerical solution has exponential complexity in the number of quantum degrees of freedom, making a brute-force approach impractical (e.g. a system with 10,000 electrons would take 2.69 × 10^43 times longer to solve than one with only 100 electrons!).

Approximate solutions are needed:
• Quantum Monte Carlo: stochastically sample the high-dimensional phase space
• Density Functional Theory: use an approximate form for the exchange and correlation of electrons

Both methods are O(N³), which is expensive but tractable.

Density Functional Theory

Density functional theory: the total energy of the system can be written as a functional of the electron density.

\[
E\bigl[\{\psi_i\},\{R_I\}\bigr] = 2\sum_i^{occ}\int d\vec r\,\psi_i^*(\vec r)\Bigl(-\tfrac{1}{2}\nabla^2\Bigr)\psi_i(\vec r) + \int d\vec r\,\rho(\vec r)\,V^{ion}(\vec r) + \frac{1}{2}\sum_{I\neq J}\frac{Z_I Z_J}{\left|R_I - R_J\right|} + \frac{1}{2}\iint d\vec r\,d\vec r'\,\frac{\rho(\vec r)\,\rho(\vec r')}{\left|\vec r - \vec r'\right|} + E^{xc}[\rho]
\]

(many-body effects are approximated by the exchange-correlation density functional)

Solve the Kohn-Sham equations self-consistently to find the ground-state electron density, defined by the minimum of this energy functional (for a given set of ion positions).

The total energy allows one to directly compare different structures and configurations.

Static Calculations: Total Energy

Basic idea: For a given configuration of atoms, we find the electronic density which gives the lowest total energy.

Bonding is determined by the electronic structure; no bonding information is supplied a priori.

First-Principles Molecular Dynamics (FPMD)

For dynamical properties, both ions and electrons have to move simultaneously and consistently:
1. Solve Schrödinger's Equation to find the ground-state electron density.
2. Compute inter-atomic forces.
3. Move atoms incrementally forward in time.
4. Repeat.

Because step 1 represents the vast majority (>99%) of the computational effort, we want to choose time steps that are as large as possible, but small enough that the previous solution is close to the current one (see the sketch below).
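As a sketch, the loop above might look like the following in C++; the helper functions and the simple time integrator are hypothetical placeholders, not Qbox's actual interface:

```cpp
// Minimal sketch of a Born-Oppenheimer FPMD loop (illustrative only;
// solve_ground_state/compute_forces stand in for the real solver).
#include <vector>

struct Atom { double pos[3], vel[3], force[3], mass; };

void solve_ground_state(std::vector<Atom>&) { /* SCF solve (omitted) */ }
void compute_forces(std::vector<Atom>&)     { /* Hellmann-Feynman forces */ }

void fpmd_run(std::vector<Atom>& atoms, double dt, int nsteps) {
  for (int step = 0; step < nsteps; ++step) {
    solve_ground_state(atoms);   // step 1: >99% of the effort
    compute_forces(atoms);       // step 2
    for (Atom& a : atoms)        // step 3: explicit update (real codes
      for (int d = 0; d < 3; ++d) {          // use velocity Verlet)
        a.vel[d] += dt * a.force[d] / a.mass;
        a.pos[d] += dt * a.vel[d];
      }
  }                              // step 4: repeat
}
```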

time t → move atoms → time t + dt

Predictive simulations of melting

[Figure: two-phase snapshots. SOLID at T = 800 K (t = 1.2 ps, t = 2.4 ps: freezing); LIQUID at T = 900 K (t = 0.6 ps, t = 1.7 ps: melting)]

A two-phase simulation approach combined with local order analysis allows one to calculate phase behavior, such as melting lines, with unprecedented accuracy.

High-Z Metals from First-Principles

• A detailed theoretical understanding of high-Z metals is important for stockpile stewardship (effects of radioactive decay, response to shocks, evolution of aging damage).
• Simulations may require many atoms to capture anisotropy. Dynamical properties such as melting require enough atoms to minimize finite-size effects.
• Computational cost depends on the number of electrons, not atoms. A melting simulation of a high-Z metal with 10-20 valence electrons/atom will be 1000-8000 times as expensive as hydrogen!
• The complex electronic structure of high-Z metals requires large basis sets and multiple k-points.

We need a really big computer!

The Platform: BlueGene/L

• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 65,536 nodes (131,072 CPUs), 512 MB/node, 180/360 TF/s, 16 TB DDR
• 367 TFlop/s peak performance

Can an FPMD code use 131,072 CPUs efficiently?

The Code: Qbox

Qbox, written by François Gygi, was designed from the ground up to be massively parallel:
– C++/MPI
– Parallelized over plane waves and electronic states
– Parallel linear algebra handled with the ScaLAPACK/PBLAS libraries
– Communication handled by the BLACS library and direct MPI calls
– Single-node dual-core dgemm, zgemm optimized by John Gunnels, IBM
– Uses custom distributed 3D complex Fast Fourier Transforms

Implementation details:
– Wave function described with a plane-wave basis
– Periodic boundary conditions
– Pseudopotentials are used to represent electrons in the atomic core, and thereby reduce the total number of electrons in the simulation

Qbox Code Structure (library stack, top to bottom):
• Qbox
• ScaLAPACK/PBLAS; XercesC (XML parser)
• BLAS/MASSV; BLACS; FFTW lib
• DGEMM lib; MPI

Parallel Solution of the Kohn-Sham Equations

How do we efficiently distribute the problem over tens of thousands of processors?

\[
\begin{cases}
-\Delta\varphi_i + V(\rho,\mathbf r)\,\varphi_i = \varepsilon_i\,\varphi_i, & i = 1\ldots N_{el}\\[4pt]
V(\rho,\mathbf r) = V_{ion}(\mathbf r) + \displaystyle\int \frac{\rho(\mathbf r')}{\left|\mathbf r-\mathbf r'\right|}\,d\mathbf r' + V_{XC}\bigl(\rho(\mathbf r),\nabla\rho(\mathbf r)\bigr)\\[4pt]
\rho(\mathbf r) = \displaystyle\sum_{i=1}^{N_{el}} \left|\varphi_i(\mathbf r)\right|^2\\[4pt]
\displaystyle\int \varphi_i^*(\mathbf r)\,\varphi_j(\mathbf r)\,d\mathbf r = \delta_{ij}
\end{cases}
\]

Kohn-Sham equations:
– solutions represent molecular orbitals (one per electron)
– molecular orbitals are complex scalar functions in R³
– coupled, non-linear PDEs
– periodic boundary conditions

We use a plane-wave basis for the orbitals:

\[
\varphi_{k,n}(\mathbf r) = \sum_{|\mathbf k+\mathbf q|^2 < E_{cut}} c_{k+q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]

Careful distribution of the wave function is key to achieving good parallel efficiency.

Algorithms used in FPMD

• Solving the KS equations: a constrained optimization problem in the space of coefficients c_{q,n}
• Poisson equation: 3-D FFTs
• Computation of the electronic charge: 3-D FFTs
• Orthogonality constraints require dense, complex linear algebra (e.g. A = C^H C)

Overall cost is O(N³) for N electrons.

Qbox communication patterns

The matrix of coefficients c_{q,n} is block-distributed (ScaLAPACK data layout):

\[
\varphi_n(\mathbf r) = \sum_{|\mathbf q|^2 < E_{cut}} c_{q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]

[Figure: the wave function matrix c (N plane-wave rows, with q >> n, by N state columns) mapped onto an 8x4 process grid; ranks 0-31 are numbered column-major]
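For reference, a minimal sketch of ScaLAPACK's 2D block-cyclic ownership rule; the grid shape matches the figure above, while the block sizes are illustrative assumptions:

```cpp
// ScaLAPACK-style 2D block-cyclic ownership, assuming an 8x4 process
// grid with column-major rank numbering (as in the figure) and
// illustrative mb x nb block sizes.
#include <cstdio>

struct Grid { int prows, pcols, mb, nb; };

// Rank that owns global matrix element (i, j), 0-indexed.
int owner(const Grid& g, long i, long j) {
  int prow = (i / g.mb) % g.prows;   // process row of the block row
  int pcol = (j / g.nb) % g.pcols;   // process column of the block col
  return pcol * g.prows + prow;      // column-major rank numbering
}

int main() {
  Grid g{8, 4, 64, 64};
  std::printf("element (1000, 50) lives on rank %d\n", owner(g, 1000, 50));
}
```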

Qbox communication patterns

• Computation of 3-D Fourier transforms: MPI_Alltoallv
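The transpose at the heart of a distributed 3-D FFT can be expressed with a single MPI_Alltoallv call; this is a generic sketch (buffer packing and communicator setup are assumed to happen elsewhere), not Qbox's actual routine:

```cpp
// Sketch of the all-to-all transpose inside a distributed 3-D FFT.
// Counts/displs are in units of complex values.
#include <mpi.h>
#include <vector>
#include <complex>

void fft_transpose(const std::vector<std::complex<double>>& send,
                   std::vector<std::complex<double>>& recv,
                   const std::vector<int>& scounts,
                   const std::vector<int>& sdispls,
                   const std::vector<int>& rcounts,
                   const std::vector<int>& rdispls,
                   MPI_Comm comm) {
  // Each rank sends a differently-sized piece to every other rank,
  // e.g. turning z-slabs into x-pencils (packing omitted).
  MPI_Alltoallv(send.data(), scounts.data(), sdispls.data(),
                MPI_C_DOUBLE_COMPLEX,
                recv.data(), rcounts.data(), rdispls.data(),
                MPI_C_DOUBLE_COMPLEX, comm);
}
```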

Qbox communication patterns

• Accumulation of the electronic charge density: MPI_Allreduce
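A minimal sketch of this step, assuming each task has already summed |φ_i|² over its locally stored states into rho:

```cpp
// Combine partial charge densities across tasks with an in-place
// all-reduce; afterwards every rank holds the full density.
#include <mpi.h>
#include <vector>

void accumulate_density(std::vector<double>& rho, MPI_Comm comm) {
  MPI_Allreduce(MPI_IN_PLACE, rho.data(), static_cast<int>(rho.size()),
                MPI_DOUBLE, MPI_SUM, comm);
}
```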

Qbox communication patterns

• ScaLAPACK dense linear algebra operations: PDGEMM/PZGEMM, PDSYRK, PDTRSM
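The core of the orthogonality work is forming the overlap matrix A = C^H C, which the code computes in parallel with PZGEMM; here is a serial C++ illustration of the same operation (our own sketch, not Qbox code):

```cpp
// Serial illustration of the Gram/overlap matrix A = C^H C, where C is
// the Nq x Nel plane-wave coefficient matrix (column-major storage).
#include <vector>
#include <complex>

using cplx = std::complex<double>;

std::vector<cplx> gram(const std::vector<cplx>& C, int nq, int nel) {
  std::vector<cplx> A(static_cast<size_t>(nel) * nel, cplx(0.0, 0.0));
  for (int j = 0; j < nel; ++j)
    for (int i = 0; i < nel; ++i) {
      cplx s(0.0, 0.0);
      for (int q = 0; q < nq; ++q)   // s = <phi_i | phi_j>
        s += std::conj(C[q + static_cast<size_t>(nq) * i]) *
             C[q + static_cast<size_t>(nq) * j];
      A[i + static_cast<size_t>(nel) * j] = s;
    }
  return A;
}
```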

Controlling Numerical Errors

• For systems with complex electronic structure, like high-Z metals, the desired high accuracy is achievable only through careful control of all numerical errors.

• Numerical errors which must be controlled:
  – Convergence of the Fourier series
\[
\varphi_n(\mathbf r) = \sum_{|\mathbf q|^2 < E_{cut}} c_{q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]
  – Convergence of the k-space integration
\[
\rho(\mathbf r) = \int_{BZ} \left|\psi_{k,n}(\mathbf r)\right|^2\, d^3k
\]
• We need to systematically increase:
  – Plane-wave energy cutoff
  – Number of atoms
  – Number of k-points in the Brillouin zone integration
  (let's start with these)

BlueGene/L allows us to ensure convergence of all three approximations.

High-Z test problem: Mo1000
• Electronic structure of a 1000-atom molybdenum sample
• 12,000 electrons
• 32 non-local projectors for pseudopotentials

• 112 Ry plane-wave energy cutoff
• High-accuracy parameters
[Figure: the Mo1000 sample]

We wanted a test problem that was representative of the next generation of high-Z materials simulations while at the same time straightforward for others to replicate and compare performance.

Qbox Performance

[Figure: Qbox sustained performance for 1000 Mo atoms (112 Ry cutoff, 12 electrons/atom). Communication-related optimizations (comm. optimizations, complex arithmetic, optimal node mapping, dual-core MM) and additional k-points (1, 4, 8) raise performance to 207.3 TFlop/s, 56% of peak (2006 Gordon Bell Award)]

Performance measurements

• We use the PPC440 HW performance counters
• Access the HW counters using the APC library (J. Sexton, IBM)
  – Provides a summary file for each task
• Not all double-FPU operations can be counted
  – DFPU fused multiply-add: OK
  – DFPU add/sub: not counted
• FP operations on the second core are not counted

Performance measurements

• Using a single-core DGEMM/ZGEMM library
  – Measurements are done in three steps:
    1) count FP ops without using the DFPU
    2) measure timings using the DFPU
    3) compute the flop rate using 1) and 2)
  – Problem: some libraries use the DFPU and are not under our control
  – DGEMM/ZGEMM uses mostly fused multiply-add ops
  – In practice, we use 2). Some DFPU add/sub are not counted.

Performance measurements

• Using a dual-core DGEMM/ZGEMM library:
  – FP operations on the second core are not counted
  – In this case, we must use a 3-step approach:
    1) count FP ops using the single-core library
    2) measure timing using the dual-core library
    3) compute the flop rate using 1) and 2)
• Our performance numbers represent a strict lower bound of the actual performance.
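The arithmetic of the 3-step estimate is simply a ratio; a sketch with illustrative numbers (not measured values):

```cpp
// Flop rate = FP ops counted in the single-core run divided by the
// wall time of the dual-core run. Since some DFPU add/sub ops are
// never counted, the result is a strict lower bound.
#include <cstdio>

double flop_rate(double counted_flops_single_core, double dual_core_seconds) {
  return counted_flops_single_core / dual_core_seconds;  // flop/s
}

int main() {
  // Illustrative numbers only.
  double rate = flop_rate(1.2e15, 3600.0);
  std::printf("sustained: %.1f GFlop/s (lower bound)\n", rate / 1e9);
}
```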

Mo1000 scaling results: sustained performance

[Figure: sustained performance; 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point]

Single-node kernels

Exploiting the BG/L hardware:
– Use double-FPU instructions ("double hummer")
– Use both CPUs on the node:
  • use virtual node mode, or
  • program for two cores (not L1 coherent)
– We use BG/L in co-processor mode:
  • 1 MPI task per node
  • use the second core via dual-core kernels

DGEMM/ZGEMM kernel (John Gunnels, IBM):
– Hand-optimized, uses the double FPU very efficiently
– Algorithm tailored to make best use of L1/L2/L3
– Dual-core version available: uses all 4 FPUs on the node

FFTW kernel (Technical University of Vienna):
– Uses hand-coded intrinsics for DFPU instructions

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, with the dual-core matrix multiplication library (dgemm)]

Need for Scalable Tools

• Support the complete development cycle
  – Debugging
  – Performance analysis
  – Optimization/transformation
• New challenges with scalability
  – Large volumes of data to store and analyze
  – Central processing/control infeasible
  – Light-weight kernel
• New tool strategies
  – Scalable infrastructures
  – Application-specific tool support
⇒ Flexible and interoperable toolboxes

Assembling Application Specific Tool Prototypes

§ Tools need to react to specific application requirements
§ Limited use for monolithic tools with their own stacks
§ New and unprecedented scenarios
§ Limit data collection and specialize data analysis
§ Example: Dynamic tool assembly with PnMPI

Virtualizing MPI Tools

• MPI tools based on the PMPI interface
  – Intercept arbitrary MPI calls using tool wrappers
  – Transparently track, modify, or replace MPI calls
[Figure: PMPI tools stacked in various configurations between the application and the MPI library]
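To illustrate the PMPI mechanism (a standard part of MPI, though this particular counter tool is our own hypothetical example), a wrapper library intercepts the MPI_ entry point and forwards to the PMPI_ one:

```cpp
// Minimal PMPI interposition sketch (MPI-3 signatures): count MPI_Send
// calls and forward to the real implementation via the PMPI_ entry
// point. Build as a library linked (or LD_PRELOADed) before MPI.
#include <mpi.h>
#include <cstdio>

static long send_count = 0;

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
  ++send_count;                                        // tool logic
  return PMPI_Send(buf, count, type, dest, tag, comm); // real send
}

extern "C" int MPI_Finalize() {
  std::fprintf(stderr, "MPI_Send calls on this rank: %ld\n", send_count);
  return PMPI_Finalize();
}
```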

§ PNMPI: Dynamically stack multiple (binary PMPI) tools
  • Provide generic tool services for other tools
  • Transparently modify the context of existing tools (e.g. a stack configured as `module mpitrace`, `module mpiprofile`)

PNMPI Switch Modules

• Multiple tool stacks
  – Defined independently
  – Initial tool stack called by application
• Switch modules
  – Dynamic stack choice
  – Based on arguments or dynamic steering
• Duplication of tools
  – Multiple contexts
  – Separate global state
[Figure: application calling PMPI Tools 1-2; a switch module routes to either PMPI Tools 3-4 or two duplicated instances of PMPI Tool 5, all above the MPI library]

Distinguishing Communicators

• Example: dense matrix
  – Row and column communicators
  – Global operations
• Need to profile separately
  – Different operations
  – Separate optimization
• Switch module to split communication (111 lines of code; see the sketch after this list)
  – Create three independent tool stacks
  – Apply an unmodified profiler (mpiP) in each stack
[Figure: matrix blocks grouped into Rows and Columns communicators]
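A hedged sketch of what such a splitting module does, written here directly as a PMPI wrapper rather than in PnMPI itself; the size-based classification and the sizes 4 and 8 are illustrative assumptions mirroring the configuration on the next slide:

```cpp
// Hypothetical PMPI wrapper tallying broadcasts per communicator kind,
// classified by communicator size (illustrative sizes only).
#include <mpi.h>

static long bcasts[3] = {0, 0, 0};  // 0: global, 1: row, 2: column

static int classify(MPI_Comm comm) {
  int size = 0;
  PMPI_Comm_size(comm, &size);
  if (size == 4) return 1;   // assumed row communicator size
  if (size == 8) return 2;   // assumed column communicator size
  return 0;                  // everything else counts as global
}

extern "C" int MPI_Bcast(void* buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm) {
  ++bcasts[classify(comm)];
  return PMPI_Bcast(buf, count, type, root, comm);
}
```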

Profiling Setup

• Configuration file (a switch module whose arguments control the stack choice; each target stack applies its own mpiP instance, giving multiple profiling instances):

  module commsize-switch        (switch module)
  argument sizes 8 4            (arguments controlling the switch module)
  argument stacks column row
  module mpiP                   (default stack)

  stack row                     (target stack 1)
  module mpiP1

  stack column                  (target stack 2)
  module mpiP2

Communicator Profiling Results for FPMD Code

Operation   Sum      Global   Row      Column
Send        317245    31014   202972    83259
Allreduce   319028   269876    49152        0
Alltoallv   471488   471488        0        0
Recv        379265    93034   202972    83259
Bcast       401042    11168   331698    58176

(measured on an AMD Opteron/Infiniband cluster)

• Information helpful for…
  – Understanding behavior
  – Locating optimization targets
  – Optimizing collectives
  – Identifying better node mappings
[Figure: performance chart for 1000 Mo atoms (112 Ry cutoff, 12 electrons/atom, 1 k-point) annotated with comm. optimizations, complex arithmetic, optimal node mapping, dual-core MM]

Conclusions

• System sizes are growing towards Petascale
  – Tools must scale with systems and codes
  – New concepts and infrastructures necessary
• Scalable debugging with STAT
  – Lightweight tool to narrow the search space
  – Tree-based stack trace aggregation
• Dynamic MPI tool creation with PNMPI
  – Ability to quickly create application-specific tools
  – Transparently reuse and extend existing tools
• Tool interoperability increasingly important

Mapping tasks to physical nodes

BG/L is a 3-dimensional torus of processors. The physical placement of MPI tasks can thus be expected to have a significant effect on communication times.
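A small sketch of the geometry involved; the default-ordering assumption (x varies fastest) and the helper names are ours, not BG/L system calls:

```cpp
// Illustrative torus geometry helpers for a 64x32x32 (65,536-node)
// torus, assuming the default "xyz" rank order with x varying fastest.
struct Coord { int x, y, z; };

constexpr int NX = 64, NY = 32, NZ = 32;

Coord default_xyz(int rank) {
  return { rank % NX, (rank / NX) % NY, rank / (NX * NY) };
}

// Wrap-around hop count between two nodes: the cost a good task
// mapping tries to minimize for heavily-communicating rank pairs.
int torus_hops(Coord a, Coord b) {
  auto wrap = [](int d, int n) {
    d = d < 0 ? -d : d;
    return d < n - d ? d : n - d;
  };
  return wrap(a.x - b.x, NX) + wrap(a.y - b.y, NY) + wrap(a.z - b.z, NZ);
}
```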




BG/L node mapping: 65,536 nodes in a 64x32x32 torus

[Figure: sustained performance for different task mappings with 512 tasks per MPI subcommunicator: xyz (default) 39.5 TF; "htbixy" 50.0 TF; bipartite 64.0 TF; quadpartite 64.7 TF; 8x8x8 38.2 TF; a 64% speedup over the default]

Physical task distribution can significantly affect performance.

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, with optimal node mapping]

With over a factor of two speed-up from initial runs, this lays the groundwork for more complex calculations involving multiple simultaneous solutions of the Kohn-Sham equations.

High symmetry requires k-point sampling

Crystalline systems require explicit sampling of the Brillouin zone to accurately represent electrons:
– Typically, a Monkhorst-Pack grid is used, with symmetry-equivalent points combined (a generator sketch follows below).
– Each wave vector ("k-point") requires a separate solution of Schrödinger's Equation, increasing the computational cost.
– The electron density is symmetry-averaged over k-points every iteration.
[Figure: grid of k-points in momentum space, with spacing 2π/L]

Qbox was originally written for non-crystalline calculations and assumed k = 0. Rewriting the code for k-points had several challenges:
– more complicated parallel distribution
– complex linear algebra
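For concreteness, a minimal generator for the Monkhorst-Pack fractional k-point grid mentioned above, using the standard formula k_r = (2r − q − 1)/(2q); symmetry reduction is omitted:

```cpp
// Monkhorst-Pack grid in fractional reciprocal coordinates, before
// combining symmetry-equivalent points.
#include <vector>
#include <array>

std::vector<std::array<double, 3>> monkhorst_pack(int q1, int q2, int q3) {
  std::vector<std::array<double, 3>> kpts;
  auto node = [](int r, int q) { return (2.0 * r - q - 1.0) / (2.0 * q); };
  for (int r1 = 1; r1 <= q1; ++r1)
    for (int r2 = 1; r2 <= q2; ++r2)
      for (int r3 = 1; r3 <= q3; ++r3)
        kpts.push_back({node(r1, q1), node(r2, q2), node(r3, q3)});
  return kpts;  // q1*q2*q3 points in total
}
```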

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, comparing the k = (0,0,0) special case with the general k-point code (complex linear algebra) using the new dual-core zgemm]

Node Mapping Optimization

Working with Bronis de Supinski and Martin Schulz, we conducted a detailed analysis of communication patterns within the bipartite node mapping scheme to further improve performance.

• Analysis of the MPI tree broadcast algorithm in sub-communicators uncovered communication bottlenecks.

Reordering tasks within each “checkerboard” using a modified space-filling curve improves performance
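The modified curve itself is not spelled out here; as a hedged illustration, a plain Z-order (Morton) key of the kind such reorderings build on, which places nearby torus coordinates near each other in rank order:

```cpp
// Z-order (Morton) key for 3-D coordinates: sorting tasks by this key
// yields a locality-preserving ordering (illustrative only; the actual
// modified curve used in the optimization differs).
#include <cstdint>

// Interleave the low 10 bits of x, y, z into a 30-bit Morton code.
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t code = 0;
  for (int b = 0; b < 10; ++b) {
    code |= ((x >> b) & 1u) << (3 * b)
          | ((y >> b) & 1u) << (3 * b + 1)
          | ((z >> b) & 1u) << (3 * b + 2);
  }
  return code;
}
```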

• The BLACS library was using an overly general communication scheme for our application, resulting in many unnecessary Type_commit operations.

BLACS was rewritten to include a check for cases where these operations are unnecessary.

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point]

Single k-point performance has now been well-optimized, but metals require multiple k-points!

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, with 1, 4, and 8 k-points, reaching 207.3 TFlop/s (2006 Gordon Bell Award)]

Conclusions

• Use existing numerical libraries (where possible)
  – ScaLAPACK/BLAS
• System-specific practices
  – Use optimized routines where necessary
  – single-node dual-core dgemm, zgemm
  – custom 3D complex FFT routines
  – mapping of tasks to physical nodes

• Important to have a suitable tool set
  – May be necessary to modify or develop new tools to assist in scaling

Lessons for Petascale Tools

• Tools are essential in Petascale efforts
  – Need to debug at large scale
  – Performance optimization to exploit machines
  – Need performance measurement tools for large-scale use
• Centralized infrastructures will not work
  – Tree-based aggregation schemes
  – Distributed storage and analysis
  – Node-count-independent infrastructures
• Need for flexible and scalable toolboxes
  – Integration and interoperability
  – Comprehensive infrastructures
  – Community effort necessary

Acknowledgments

Qbox Team:
Eric Draeger, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

François Gygi, University of California, Davis

Martin Schulz, Bronis R. de Supinski, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

Franz Franchetti, Department of Electrical and Computer Engineering, Carnegie Mellon University

Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber, Institute of Analysis and Scientific Computing, Vienna University of Technology

John A. Gunnels, Vernon Austel, James C. Sexton, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
