
Case Study: Beyond Homogeneous Decomposition with Scaling Long-Range Forces on Massively Parallel Systems

Donald Frederick, LLNL – Presented at Supercomputing '11
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
LLNL-PRES-508651

Case Study: Outline

• Problem Description
• Computational Approach
• Changes for Scaling

Computer simulations of materials

Computer simulations are widely used to predict the properties of new materials or understand the properties of existing ones.

Simulation of Materials from First-Principles

First-principles methods: Calculate properties of a given material directly from fundamental physics equations.

• No empirical parameters
Can make predictions about complex or novel materials, under conditions where experimental data is unavailable or inconclusive.

• Chemically dynamic
As atoms move, chemical bonds can be formed or broken.
• Computationally expensive
Solving quantum equations is time-consuming, limiting system sizes (e.g. hundreds of atoms) and simulation timescales (e.g. picoseconds).
[Figure: electron density surrounding water, calculated from first principles]

First-Principles Simulation is Hard

Properties of a many-body quantum system are given by Schrödinger's Equation:

\[
\left[-\frac{\hbar^2}{2m}\nabla^2 + V_{ion}(\vec r) + V_H(\vec r) + V_{xc}(\vec r)\right]\psi_i(\vec r) = \varepsilon_i\,\psi_i(\vec r)
\]

Exact numerical solution has exponential complexity in the number of quantum degrees of freedom, making a brute-force approach impractical (e.g. a system with 10,000 electrons would take 2.69 × 10^43 times longer to solve than one with only 100 electrons!).

Approximate solutions are needed:
• Quantum Monte Carlo: stochastically sample the high-dimensional phase space
• Density Functional Theory: use an approximate form for the exchange and correlation of electrons

Both methods are O(N³), which is expensive but tractable.

Density Functional Theory

Density functional theory: the total energy of the system can be written as a functional of the electron density.

\[
E\bigl[\{\psi_i\},\{R_I\}\bigr] = 2\sum_i^{occ}\int d\vec r\,\psi_i^*(\vec r)\Bigl(-\tfrac{1}{2}\nabla^2\Bigr)\psi_i(\vec r) + \int d\vec r\,\rho(\vec r)\,V^{ion}(\vec r) + \frac{1}{2}\sum_{I\neq J}\frac{Z_I Z_J}{\left|R_I - R_J\right|} + \frac{1}{2}\iint d\vec r\,d\vec r'\,\frac{\rho(\vec r)\,\rho(\vec r')}{\left|\vec r - \vec r'\right|} + E^{xc}[\rho]
\]

(many-body effects are approximated by the exchange-correlation density functional)

Solve the Kohn-Sham equations self-consistently to find the ground-state electron density, defined by the minimum of this energy functional (for a given set of ion positions).

The total energy allows one to directly compare different structures and configurations.

Static Calculations: Total Energy

Basic idea: For a given configuration of atoms, we find the electronic density which gives the lowest total energy.

Bonding is determined by the electronic structure; no bonding information is supplied a priori.

First-Principles Molecular Dynamics (FPMD)

For dynamical properties, both ions and electrons have to move simultaneously and consistently:
1. Solve Schrödinger's Equation to find the ground-state electron density.
2. Compute inter-atomic forces.
3. Move atoms incrementally forward in time.
4. Repeat.

Because step 1 represents the vast majority (>99%) of the computational effort, we want to choose time steps that are as large as possible, but small enough that the previous solution is close to the current one (see the sketch below).
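As a sketch, the loop above might look like the following in C++; the helper functions and the simple time integrator are hypothetical placeholders, not Qbox's actual interface:

```cpp
// Minimal sketch of a Born-Oppenheimer FPMD loop (illustrative only;
// solve_ground_state/compute_forces stand in for the real solver).
#include <vector>

struct Atom { double pos[3], vel[3], force[3], mass; };

void solve_ground_state(std::vector<Atom>&) { /* SCF solve (omitted) */ }
void compute_forces(std::vector<Atom>&)     { /* Hellmann-Feynman forces */ }

void fpmd_run(std::vector<Atom>& atoms, double dt, int nsteps) {
  for (int step = 0; step < nsteps; ++step) {
    solve_ground_state(atoms);   // step 1: >99% of the effort
    compute_forces(atoms);       // step 2
    for (Atom& a : atoms)        // step 3: explicit update (real codes
      for (int d = 0; d < 3; ++d) {          // use velocity Verlet)
        a.vel[d] += dt * a.force[d] / a.mass;
        a.pos[d] += dt * a.vel[d];
      }
  }                              // step 4: repeat
}
```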

time t → move atoms → time t + dt

Predictive simulations of melting

[Figure: two-phase snapshots. SOLID at T = 800 K (t = 1.2 ps, t = 2.4 ps: freezing); LIQUID at T = 900 K (t = 0.6 ps, t = 1.7 ps: melting)]

A two-phase simulation approach combined with local order analysis allows one to calculate phase behavior, such as melting lines, with unprecedented accuracy.

High-Z Metals from First-Principles

• A detailed theoretical understanding of high-Z metals is important for stockpile stewardship (effects of radioactive decay, response to shocks, evolution of aging damage).
• Simulations may require many atoms to capture anisotropy. Dynamical properties such as melting require enough atoms to minimize finite-size effects.
• Computational cost depends on the number of electrons, not atoms. A melting simulation of a high-Z metal with 10-20 valence electrons/atom will be 1000-8000 times as expensive as hydrogen!
• The complex electronic structure of high-Z metals requires large basis sets and multiple k-points.

We need a really big computer!

The Platform: BlueGene/L

• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 65,536 nodes (131,072 CPUs), 512 MB/node, 180/360 TF/s, 16 TB DDR
• 367 TFlop/s peak performance

Can an FPMD code use 131,072 CPUs efficiently?

The Code: Qbox

Qbox, written by François Gygi, was designed from the ground up to be massively parallel:
– C++/MPI
– Parallelized over plane waves and electronic states
– Parallel linear algebra handled with the ScaLAPACK/PBLAS libraries
– Communication handled by the BLACS library and direct MPI calls
– Single-node dual-core dgemm, zgemm optimized by John Gunnels, IBM
– Uses custom distributed 3D complex Fast Fourier Transforms

Implementation details:
– Wave function described with a plane-wave basis
– Periodic boundary conditions
– Pseudopotentials are used to represent electrons in the atomic core, and thereby reduce the total number of electrons in the simulation

Qbox Code Structure (library stack, top to bottom):
• Qbox
• ScaLAPACK/PBLAS; XercesC (XML parser)
• BLAS/MASSV; BLACS; FFTW lib
• DGEMM lib; MPI

Parallel Solution of the Kohn-Sham Equations

How do we efficiently distribute the problem over tens of thousands of processors?

\[
\begin{cases}
-\Delta\varphi_i + V(\rho,\mathbf r)\,\varphi_i = \varepsilon_i\,\varphi_i, & i = 1\ldots N_{el}\\[4pt]
V(\rho,\mathbf r) = V_{ion}(\mathbf r) + \displaystyle\int \frac{\rho(\mathbf r')}{\left|\mathbf r-\mathbf r'\right|}\,d\mathbf r' + V_{XC}\bigl(\rho(\mathbf r),\nabla\rho(\mathbf r)\bigr)\\[4pt]
\rho(\mathbf r) = \displaystyle\sum_{i=1}^{N_{el}} \left|\varphi_i(\mathbf r)\right|^2\\[4pt]
\displaystyle\int \varphi_i^*(\mathbf r)\,\varphi_j(\mathbf r)\,d\mathbf r = \delta_{ij}
\end{cases}
\]

Kohn-Sham equations:
– solutions represent molecular orbitals (one per electron)
– molecular orbitals are complex scalar functions in R³
– coupled, non-linear PDEs
– periodic boundary conditions

We use a plane-wave basis for the orbitals:

\[
\varphi_{k,n}(\mathbf r) = \sum_{|\mathbf k+\mathbf q|^2 < E_{cut}} c_{k+q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]

Careful distribution of the wave function is key to achieving good parallel efficiency.

Algorithms used in FPMD

• Solving the KS equations: a constrained optimization problem in the space of coefficients c_{q,n}
• Poisson equation: 3-D FFTs
• Computation of the electronic charge: 3-D FFTs
• Orthogonality constraints require dense, complex linear algebra (e.g. A = C^H C)

Overall cost is O(N³) for N electrons.

Qbox communication patterns

The matrix of coefficients c_{q,n} is block-distributed (ScaLAPACK data layout):

\[
\varphi_n(\mathbf r) = \sum_{|\mathbf q|^2 < E_{cut}} c_{q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]

[Figure: the wave function matrix c (N plane-wave rows, with q >> n, by N state columns) mapped onto an 8x4 process grid; ranks 0-31 are numbered column-major]
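For reference, a minimal sketch of ScaLAPACK's 2D block-cyclic ownership rule; the grid shape matches the figure above, while the block sizes are illustrative assumptions:

```cpp
// ScaLAPACK-style 2D block-cyclic ownership, assuming an 8x4 process
// grid with column-major rank numbering (as in the figure) and
// illustrative mb x nb block sizes.
#include <cstdio>

struct Grid { int prows, pcols, mb, nb; };

// Rank that owns global matrix element (i, j), 0-indexed.
int owner(const Grid& g, long i, long j) {
  int prow = (i / g.mb) % g.prows;   // process row of the block row
  int pcol = (j / g.nb) % g.pcols;   // process column of the block col
  return pcol * g.prows + prow;      // column-major rank numbering
}

int main() {
  Grid g{8, 4, 64, 64};
  std::printf("element (1000, 50) lives on rank %d\n", owner(g, 1000, 50));
}
```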

Qbox communication patterns

• Computation of 3-D Fourier transforms: MPI_Alltoallv
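The transpose at the heart of a distributed 3-D FFT can be expressed with a single MPI_Alltoallv call; this is a generic sketch (buffer packing and communicator setup are assumed to happen elsewhere), not Qbox's actual routine:

```cpp
// Sketch of the all-to-all transpose inside a distributed 3-D FFT.
// Counts/displs are in units of complex values.
#include <mpi.h>
#include <vector>
#include <complex>

void fft_transpose(const std::vector<std::complex<double>>& send,
                   std::vector<std::complex<double>>& recv,
                   const std::vector<int>& scounts,
                   const std::vector<int>& sdispls,
                   const std::vector<int>& rcounts,
                   const std::vector<int>& rdispls,
                   MPI_Comm comm) {
  // Each rank sends a differently-sized piece to every other rank,
  // e.g. turning z-slabs into x-pencils (packing omitted).
  MPI_Alltoallv(send.data(), scounts.data(), sdispls.data(),
                MPI_C_DOUBLE_COMPLEX,
                recv.data(), rcounts.data(), rdispls.data(),
                MPI_C_DOUBLE_COMPLEX, comm);
}
```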

Qbox communication patterns

• Accumulation of the electronic charge density: MPI_Allreduce
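A minimal sketch of this step, assuming each task has already summed |φ_i|² over its locally stored states into rho:

```cpp
// Combine partial charge densities across tasks with an in-place
// all-reduce; afterwards every rank holds the full density.
#include <mpi.h>
#include <vector>

void accumulate_density(std::vector<double>& rho, MPI_Comm comm) {
  MPI_Allreduce(MPI_IN_PLACE, rho.data(), static_cast<int>(rho.size()),
                MPI_DOUBLE, MPI_SUM, comm);
}
```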

Qbox communication patterns

• ScaLAPACK dense linear algebra operations: PDGEMM/PZGEMM, PDSYRK, PDTRSM
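The core of the orthogonality work is forming the overlap matrix A = C^H C, which the code computes in parallel with PZGEMM; here is a serial C++ illustration of the same operation (our own sketch, not Qbox code):

```cpp
// Serial illustration of the Gram/overlap matrix A = C^H C, where C is
// the Nq x Nel plane-wave coefficient matrix (column-major storage).
#include <vector>
#include <complex>

using cplx = std::complex<double>;

std::vector<cplx> gram(const std::vector<cplx>& C, int nq, int nel) {
  std::vector<cplx> A(static_cast<size_t>(nel) * nel, cplx(0.0, 0.0));
  for (int j = 0; j < nel; ++j)
    for (int i = 0; i < nel; ++i) {
      cplx s(0.0, 0.0);
      for (int q = 0; q < nq; ++q)   // s = <phi_i | phi_j>
        s += std::conj(C[q + static_cast<size_t>(nq) * i]) *
             C[q + static_cast<size_t>(nq) * j];
      A[i + static_cast<size_t>(nel) * j] = s;
    }
  return A;
}
```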

Controlling Numerical Errors

• For systems with complex electronic structure, like high-Z metals, the desired high accuracy is achievable only through careful control of all numerical errors.

• Numerical errors which must be controlled:
  – Convergence of the Fourier series
\[
\varphi_n(\mathbf r) = \sum_{|\mathbf q|^2 < E_{cut}} c_{q,n}\, e^{i\mathbf q\cdot\mathbf r}
\]
  – Convergence of the k-space integration
\[
\rho(\mathbf r) = \int_{BZ} \left|\psi_{k,n}(\mathbf r)\right|^2\, d^3k
\]
• We need to systematically increase:
  – Plane-wave energy cutoff
  – Number of atoms
  – Number of k-points in the Brillouin zone integration
  (let's start with these)

BlueGene/L allows us to ensure convergence of all three approximations.

High-Z test problem: Mo1000
• Electronic structure of a 1000-atom molybdenum sample
• 12,000 electrons
• 32 non-local projectors for pseudopotentials

• 112 Ry plane-wave energy cutoff
• High-accuracy parameters
[Figure: the Mo1000 sample]

We wanted a test problem that was representative of the next generation of high-Z materials simulations while at the same time straightforward for others to replicate and compare performance.

Qbox Performance

[Figure: Qbox sustained performance for 1000 Mo atoms (112 Ry cutoff, 12 electrons/atom). Communication-related optimizations (comm. optimizations, complex arithmetic, optimal node mapping, dual-core MM) and additional k-points (1, 4, 8) raise performance to 207.3 TFlop/s, 56% of peak (2006 Gordon Bell Award)]

Performance measurements

• We use the PPC440 HW performance counters
• Access the HW counters using the APC library (J. Sexton, IBM)
  – Provides a summary file for each task
• Not all double-FPU operations can be counted
  – DFPU fused multiply-add: OK
  – DFPU add/sub: not counted
• FP operations on the second core are not counted

Performance measurements

• Using a single-core DGEMM/ZGEMM library
  – Measurements are done in three steps:
    1) count FP ops without using the DFPU
    2) measure timings using the DFPU
    3) compute the flop rate using 1) and 2)
  – Problem: some libraries use the DFPU and are not under our control
  – DGEMM/ZGEMM uses mostly fused multiply-add ops
  – In practice, we use 2). Some DFPU add/sub are not counted.

Performance measurements

• Using a dual-core DGEMM/ZGEMM library:
  – FP operations on the second core are not counted
  – In this case, we must use a 3-step approach:
    1) count FP ops using the single-core library
    2) measure timing using the dual-core library
    3) compute the flop rate using 1) and 2)
• Our performance numbers represent a strict lower bound of the actual performance.
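The arithmetic of the 3-step estimate is simply a ratio; a sketch with illustrative numbers (not measured values):

```cpp
// Flop rate = FP ops counted in the single-core run divided by the
// wall time of the dual-core run. Since some DFPU add/sub ops are
// never counted, the result is a strict lower bound.
#include <cstdio>

double flop_rate(double counted_flops_single_core, double dual_core_seconds) {
  return counted_flops_single_core / dual_core_seconds;  // flop/s
}

int main() {
  // Illustrative numbers only.
  double rate = flop_rate(1.2e15, 3600.0);
  std::printf("sustained: %.1f GFlop/s (lower bound)\n", rate / 1e9);
}
```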

Mo1000 scaling results: sustained performance

[Figure: sustained performance; 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point]

Single-node kernels

Exploiting the BG/L hardware:
– Use double-FPU instructions ("double hummer")
– Use both CPUs on the node:
  • use virtual node mode, or
  • program for two cores (not L1 coherent)
– We use BG/L in co-processor mode:
  • 1 MPI task per node
  • use the second core via dual-core kernels

DGEMM/ZGEMM kernel (John Gunnels, IBM):
– Hand-optimized, uses the double FPU very efficiently
– Algorithm tailored to make best use of L1/L2/L3
– Dual-core version available: uses all 4 FPUs on the node

FFTW kernel (Technical University of Vienna):
– Uses hand-coded intrinsics for DFPU instructions

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, with the dual-core matrix multiplication library (dgemm)]

Need for Scalable Tools

• Support the complete development cycle
  – Debugging
  – Performance analysis
  – Optimization/transformation
• New challenges with scalability
  – Large volumes of data to store and analyze
  – Central processing/control infeasible
  – Light-weight kernel
• New tool strategies
  – Scalable infrastructures
  – Application-specific tool support
⇒ Flexible and interoperable toolboxes

Assembling Application Specific Tool Prototypes

§ Tools need to react to specific application requirements
§ Limited use for monolithic tools with their own stacks
§ New and unprecedented scenarios
§ Limit data collection and specialize data analysis
§ Example: Dynamic tool assembly with PnMPI

Virtualizing MPI Tools

• MPI tools based on the PMPI interface
  – Intercept arbitrary MPI calls using tool wrappers
  – Transparently track, modify, or replace MPI calls
[Figure: PMPI tools stacked in various configurations between the application and the MPI library]
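To illustrate the PMPI mechanism (a standard part of MPI, though this particular counter tool is our own hypothetical example), a wrapper library intercepts the MPI_ entry point and forwards to the PMPI_ one:

```cpp
// Minimal PMPI interposition sketch (MPI-3 signatures): count MPI_Send
// calls and forward to the real implementation via the PMPI_ entry
// point. Build as a library linked (or LD_PRELOADed) before MPI.
#include <mpi.h>
#include <cstdio>

static long send_count = 0;

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
  ++send_count;                                        // tool logic
  return PMPI_Send(buf, count, type, dest, tag, comm); // real send
}

extern "C" int MPI_Finalize() {
  std::fprintf(stderr, "MPI_Send calls on this rank: %ld\n", send_count);
  return PMPI_Finalize();
}
```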

§ PNMPI: Dynamically stack multiple (binary PMPI) tools
  • Provide generic tool services for other tools
  • Transparently modify the context of existing tools (e.g. a stack configured as `module mpitrace`, `module mpiprofile`)

PNMPI Switch Modules

• Multiple tool stacks
  – Defined independently
  – Initial tool stack called by application
• Switch modules
  – Dynamic stack choice
  – Based on arguments or dynamic steering
• Duplication of tools
  – Multiple contexts
  – Separate global state
[Figure: application calling PMPI Tools 1-2; a switch module routes to either PMPI Tools 3-4 or two duplicated instances of PMPI Tool 5, all above the MPI library]

Distinguishing Communicators

• Example: dense matrix
  – Row and column communicators
  – Global operations
• Need to profile separately
  – Different operations
  – Separate optimization
• Switch module to split communication (111 lines of code; see the sketch after this list)
  – Create three independent tool stacks
  – Apply an unmodified profiler (mpiP) in each stack
[Figure: matrix blocks grouped into Rows and Columns communicators]
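A hedged sketch of what such a splitting module does, written here directly as a PMPI wrapper rather than in PnMPI itself; the size-based classification and the sizes 4 and 8 are illustrative assumptions mirroring the configuration on the next slide:

```cpp
// Hypothetical PMPI wrapper tallying broadcasts per communicator kind,
// classified by communicator size (illustrative sizes only).
#include <mpi.h>

static long bcasts[3] = {0, 0, 0};  // 0: global, 1: row, 2: column

static int classify(MPI_Comm comm) {
  int size = 0;
  PMPI_Comm_size(comm, &size);
  if (size == 4) return 1;   // assumed row communicator size
  if (size == 8) return 2;   // assumed column communicator size
  return 0;                  // everything else counts as global
}

extern "C" int MPI_Bcast(void* buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm) {
  ++bcasts[classify(comm)];
  return PMPI_Bcast(buf, count, type, root, comm);
}
```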

Profiling Setup

• Configuration file (a switch module whose arguments control the stack choice; each target stack applies its own mpiP instance, giving multiple profiling instances):

  module commsize-switch        (switch module)
  argument sizes 8 4            (arguments controlling the switch module)
  argument stacks column row
  module mpiP                   (default stack)

  stack row                     (target stack 1)
  module mpiP1

  stack column                  (target stack 2)
  module mpiP2

Communicator Profiling Results for FPMD Code

Operation   Sum      Global   Row      Column
Send        317245    31014   202972    83259
Allreduce   319028   269876    49152        0
Alltoallv   471488   471488        0        0
Recv        379265    93034   202972    83259
Bcast       401042    11168   331698    58176

(measured on an AMD Opteron/Infiniband cluster)

• Information helpful for…
  – Understanding behavior
  – Locating optimization targets
  – Optimizing collectives
  – Identifying better node mappings
[Figure: performance chart for 1000 Mo atoms (112 Ry cutoff, 12 electrons/atom, 1 k-point) annotated with comm. optimizations, complex arithmetic, optimal node mapping, dual-core MM]

Conclusions

• System sizes are growing towards Petascale
  – Tools must scale with systems and codes
  – New concepts and infrastructures necessary
• Scalable debugging with STAT
  – Lightweight tool to narrow the search space
  – Tree-based stack trace aggregation
• Dynamic MPI tool creation with PNMPI
  – Ability to quickly create application-specific tools
  – Transparently reuse and extend existing tools
• Tool interoperability increasingly important

Mapping tasks to physical nodes

BG/L is a 3-dimensional torus of processors. The physical placement of MPI tasks can thus be expected to have a significant effect on communication times.
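A small sketch of the geometry involved; the default-ordering assumption (x varies fastest) and the helper names are ours, not BG/L system calls:

```cpp
// Illustrative torus geometry helpers for a 64x32x32 (65,536-node)
// torus, assuming the default "xyz" rank order with x varying fastest.
struct Coord { int x, y, z; };

constexpr int NX = 64, NY = 32, NZ = 32;

Coord default_xyz(int rank) {
  return { rank % NX, (rank / NX) % NY, rank / (NX * NY) };
}

// Wrap-around hop count between two nodes: the cost a good task
// mapping tries to minimize for heavily-communicating rank pairs.
int torus_hops(Coord a, Coord b) {
  auto wrap = [](int d, int n) {
    d = d < 0 ? -d : d;
    return d < n - d ? d : n - d;
  };
  return wrap(a.x - b.x, NX) + wrap(a.y - b.y, NY) + wrap(a.z - b.z, NZ);
}
```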




BG/L node mapping: 65,536 nodes in a 64x32x32 torus

[Figure: sustained performance for different task mappings with 512 tasks per MPI subcommunicator: xyz (default) 39.5 TF; "htbixy" 50.0 TF; bipartite 64.0 TF; quadpartite 64.7 TF; 8x8x8 38.2 TF; a 64% speedup over the default]

Physical task distribution can significantly affect performance.

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, with optimal node mapping]

With over a factor of two speed-up from initial runs, this lays the groundwork for more complex calculations involving multiple simultaneous solutions of the Kohn-Sham equations.

High symmetry requires k-point sampling

Crystalline systems require explicit sampling of the Brillouin zone to accurately represent electrons:
– Typically, a Monkhorst-Pack grid is used, with symmetry-equivalent points combined (a generator sketch follows below).
– Each wave vector ("k-point") requires a separate solution of Schrödinger's Equation, increasing the computational cost.
– The electron density is symmetry-averaged over k-points every iteration.
[Figure: grid of k-points in momentum space, with spacing 2π/L]

Qbox was originally written for non-crystalline calculations and assumed k = 0. Rewriting the code for k-points had several challenges:
– more complicated parallel distribution
– complex linear algebra
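For concreteness, a minimal generator for the Monkhorst-Pack fractional k-point grid mentioned above, using the standard formula k_r = (2r − q − 1)/(2q); symmetry reduction is omitted:

```cpp
// Monkhorst-Pack grid in fractional reciprocal coordinates, before
// combining symmetry-equivalent points.
#include <vector>
#include <array>

std::vector<std::array<double, 3>> monkhorst_pack(int q1, int q2, int q3) {
  std::vector<std::array<double, 3>> kpts;
  auto node = [](int r, int q) { return (2.0 * r - q - 1.0) / (2.0 * q); };
  for (int r1 = 1; r1 <= q1; ++r1)
    for (int r2 = 1; r2 <= q2; ++r2)
      for (int r3 = 1; r3 <= q3; ++r3)
        kpts.push_back({node(r1, q1), node(r2, q2), node(r3, q3)});
  return kpts;  // q1*q2*q3 points in total
}
```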

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point, comparing the k = (0,0,0) special case with the general k-point code (complex linear algebra) using the new dual-core zgemm]

Node Mapping Optimization

Working with Bronis de Supinski and Martin Schulz, we conducted a detailed analysis of communication patterns within the bipartite node mapping scheme to further improve performance.

• Analysis of the MPI tree broadcast algorithm in sub-communicators uncovered communication bottlenecks.

Reordering tasks within each “checkerboard” using a modified space-filling curve improves performance
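The modified curve itself is not spelled out here; as a hedged illustration, a plain Z-order (Morton) key of the kind such reorderings build on, which places nearby torus coordinates near each other in rank order:

```cpp
// Z-order (Morton) key for 3-D coordinates: sorting tasks by this key
// yields a locality-preserving ordering (illustrative only; the actual
// modified curve used in the optimization differs).
#include <cstdint>

// Interleave the low 10 bits of x, y, z into a 30-bit Morton code.
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t code = 0;
  for (int b = 0; b < 10; ++b) {
    code |= ((x >> b) & 1u) << (3 * b)
          | ((y >> b) & 1u) << (3 * b + 1)
          | ((z >> b) & 1u) << (3 * b + 2);
  }
  return code;
}
```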

• The BLACS library was using an overly general communication scheme for our application, resulting in many unnecessary Type_commit operations.

BLACS was rewritten to include a check for cases where these operations are unnecessary.

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, 1 k-point]

Single k-point performance has now been well-optimized, but metals require multiple k-points!

Mo1000 scaling results: sustained performance

[Figure: 1000 Mo atoms, 112 Ry cutoff, 12 electrons/atom, with 1, 4, and 8 k-points, reaching 207.3 TFlop/s (2006 Gordon Bell Award)]

Conclusions

• Use existing numerical libraries (where possible)
  – ScaLAPACK/BLAS
• System-specific practices
  – Use optimized routines where necessary
  – single-node dual-core dgemm, zgemm
  – custom 3D complex FFT routines
  – mapping of tasks to physical nodes

• Important to have a suitable tool set
  – May be necessary to modify or develop new tools to assist in scaling

Lessons for Petascale Tools

• Tools are essential in Petascale efforts
  – Need to debug at large scale
  – Performance optimization to exploit machines
  – Need performance measurement tools for large-scale use
• Centralized infrastructures will not work
  – Tree-based aggregation schemes
  – Distributed storage and analysis
  – Node-count-independent infrastructures
• Need for flexible and scalable toolboxes
  – Integration and interoperability
  – Comprehensive infrastructures
  – Community effort necessary

Acknowledgments

Qbox Team:
Eric Draeger, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

François Gygi, University of California, Davis

Martin Schulz, Bronis R. de Supinski, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

Franz Franchetti, Department of Electrical and Computer Engineering, Carnegie Mellon University

Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber, Institute of Analysis and Scientific Computing, Vienna University of Technology

John A. Gunnels, Vernon Austel, James C. Sexton, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
