Large-Scale Electronic Structure Calculations of High-Z Metals on the Bluegene/L Platform
Total Page:16
File Type:pdf, Size:1020Kb
Large-Scale Electronic Structure Calculations of High-Z Metals on the BlueGene/L Platform Francois Gygi Erik W. Draeger, Martin Schulz, Department of Applied Science Bronis R. de Supinski University of California, Davis Center for Applied Scientific Computing Davis, CA 95616 Lawrence Livermore National Laboratory 530-752-4042 Livermore, CA 94551 [email protected] {draeger1,schulz6,bronis}@llnl.gov John A.Gunnels, Vernon Austel, Franz Franchetti James C. Sexton Department of Electrical and Computer Engineering Carnegie Mellon University IBM Thomas J. Watson Research Center Pittsburgh, PA 15213 Yorktown Heights, NY 10598 [email protected] {gunnels,austel,sextonjc}@us.ibm.com Stefan Kral, Christoph W. Ueberhuber, Juergen Lorenz Institute of Analysis and Scientific Computing Vienna University of Technology, Vienna, Austria [email protected], [email protected] [email protected] ABSTRACT First-principles simulations of high-Z metallic systems using the Qbox code on the BlueGene/L supercomputer demonstrate General Terms unprecedented performance and scaling for a quantum simulation Algorithms, Measurement, Performance. code. Specifically designed to take advantage of massively- parallel systems like BlueGene/L, Qbox demonstrates excellent parallel efficiency and peak performance. A sustained peak Keywords performance of 207.3 TFlop/s was measured on 65,536 nodes, Electronic structure. First-principles Molecular Dynamics. Ab corresponding to 56.5% of the theoretical full machine peak using initio simulations. Parallel computing. BlueGene/L, Qbox. all 128k CPUs. 1. INTRODUCTION Categories and Subject Descriptors First-Principles Molecular Dynamics (FPMD) is an accurate J.2 [Physical Sciences and Engineering]:– Chemistry, Physics. atomistic simulation approach that is routinely applied to a variety of areas including solid-state physics, chemistry, biochemistry and nanotechnology [1]. It includes a quantum mechanical description of electrons, and a classical description of atomic Permission to make digital or hard copies of all or part of this work for nuclei. FPMD simulations integrate the Newton equations of personal or classroom use is granted without fee provided that copies are motion for all nuclei in order to simulate dynamical properties of not made or distributed for profit or commercial advantage and that physical systems at finite temperature. At each discrete time step copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, of the trajectory, the forces acting on the nuclei are derived from a requires prior specific permission and/or a fee. calculation of the electronic properties of the system. 0-7695-2700-0/06 $20.00 ©2006 IEEE In this paper, we consider the case of FPMD simulations of heavy We have used the Qbox code to perform electronic structure (or “high-Z”) metals such as molybdenum or tantalum. In calculations of molybdenum on the BlueGene/L (BG/L) computer particular, we focus on high-accuracy electronic structure installed at Lawrence Livermore National Laboratory. Qbox is a calculations needed to evaluate the energy of isolated defects. C++ implementation of the FPMD method [3]. It uses the MPI Such calculations are especially challenging since they require the message-passing paradigm and solves the KS equations [2] within inclusion of a large number of atoms in a periodic simulation cell. the pseudopotential, plane wave formalism. The solution of the This in turn implies that a large number of valence electrons must KS equations has been extensively discussed by various authors be included in the calculation. Furthermore, high-accuracy [1] and requires the capability to perform three-dimensional calculations often require the use of additional tightly bound Fourier transforms and dense linear algebra efficiently. The electrons (known as semi-core electrons) in the simulation. implementation of these two operations will be discussed in detail below. Qbox was designed specifically for large parallel The electronic structure calculation is the most time-consuming platforms, including BlueGene/L. The design of Qbox yields part of an FPMD simulation. It consists in solving the Kohn-Sham good load balance through an efficient data layout and a careful (KS) equations [2]. The KS equations are one-particle, non-linear management of the data flow during the most time consuming PDEs that approximate the many-particle Schroedinger equation. operations. The solutions of the KS equations describe electrons in the presence of an average effective potential that mimics the The sample chosen for the present performance study contains complex many-body interactions between electrons. In periodic 1000 molybdenum atoms, and includes a highly accurate solids the KS equations take the form treatment of electron-ion interactions. Norm-conserving semi- local pseudopotentials were used to represent the electron-ion interactions. A total of 64 projectors were used (8 radial Hrψ ()= εψ () r knk nknk quadrature points for p and d channels) on each atom to represent the semi-local potentials. A plane wave energy cutoff of 112 Ry where H is the Hamiltonian, solutions ψ ()r are Bloch was used to describe the electronic wave functions. Semi-core p k nk electrons were included in the valence shell. Calculations waves, n is a band index and k is a wave vector (also called k- including 1, 4 and 8 k-points were performed. point) confined to the first Brillouin zone of the crystal. The one- particle KS Hamiltonian Hk depends non-linearly on the Simulations were performed using up to 65,536 nodes. Performance measurements were carried out by counting floating- electronic density ()r which in turn depends on the solutions ρ point operations using hardware counters. Qbox realizes a 41% parallel efficiency between 2k and 64k CPUs for single k-point ψ nk ()r through the relation calculations. When using multiple k-points, we show that 56.5% of peak performance, or 207.3 TFlop/s, can be achieved on the 2 full machine (65,536 nodes, 131,072 CPUs). ρψ()rr= ∑ nk () nk This kind of simulation is considerably larger that any previously The KS equations can therefore be solved independently for each feasible FPMD simulation. Our demonstration that BG/L’s large value of k, which leads to a natural division of the computational computing power makes such large simulations feasible opens the work on a parallel computer. However, the dependence of the way to accurate simulations of the properties of metals, including the calculation of melting temperatures, defect energies and defect electronic charge density ρ()r on the solutions at all values of k migration processes, studying the effects of aging on the introduces a coupling between solutions at different k values. This structural and electronic properties of heavy metals and the coupling appears in the dependence of the effective one-particle properties of materials subjected to extreme conditions[4]. potential on the total electronic charge distribution ρ()r through the electrostatic and exchange-correlation potentials. In the systems considered here (metals in large supercells), the number 2. KEY ASPECTS OF THE BLUEGENE/L of k-points needed is of the order of four to eight. Solutions of the KS equations are real if k=0 and complex otherwise. For this ARCHITECTURE FOR FPMD reason electronic structure computations involving multiple k- BlueGene/L (BG/L) presents several opportunities for efficient points are more costly than computations performed with a single implementation of FPMD simulations. Details of the tightly- k-point (if k=0) since they involve complex arithmetic. integrated large-scale system architecture are covered elsewhere [5], including aspects of it that are particularly relevant to Qbox Thus the electronic structure problem can be solved efficiently on [6]. Overall, LLNL’s BG/L platform has 65,536 compute nodes a parallel computer if i) a one-particle (single k-point) KS and a total peak performance of 367TFlop/s.We briefly cover its problem can be solved efficiently on one fourth to one eighth of general architectural aspects here, focusing on those related to the machine, and ii) the solutions for all k-points can be combined recent or planned optimizations in Qbox. efficiently to compute the total charge density. We show in this Each compute node is built from a single compute node ASIC and paper that both these conditions can be met and lead to efficient a set of memory chips. The compute ASIC features two 32-bit electronic structure calculations on the BlueGene/L platform. superscalar 700 MHz PowerPC 440 cores, with two copies of the PPC floating point unit associated with each core that function as a SIMD-like double FPU [7]. Achieving high performance 50% of the theoretical peak rate of the processor. Fortunately, requires the use of an extensive set of parallel instructions for general matrix-multiplication (C op= A*B) is dominated by which the double precision operands can come from the register FMAs and BG/L’s relatively rich instruction set allows one to file of either unit and that include a variety of paired multiply-add utilize the SIMD FMA instructions for all of the computations operations, resulting in a peak of four floating point operations involved in zgemm. The only prerequisite to taking advantage of (Flop) per cycle per core. Later in this paper, we discuss the these instructions is to load the registers utilized for computations DGEMM, ZGEMM and FFT implementations that allow Qbox to with useful data (i.e. not pad them or throw away half of their use these instructions and, thus, to achieve a high percentage of result). Because the input data, complex double-precision values peak performance. in this case, is assumed to be aligned on 16-byte boundaries, the loads and stores of the C matrix are strictly SIMD. If this BG/L includes five networks; we focus on the 3-D torus, the assumption regarding alignment were not made, the load-primary broadcast/reduction tree and the global interrupt for Qbox and load-secondary instructions would allow one to load the optimizations.