DIIS and Hamiltonian diagonalisation for total-energy minimisation in the ONETEP program using the ScaLAPACK eigensolver.

Álvaro Ruiz Serrano

August 27, 2010

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2010

Abstract

Direct Inversion in the Iterative Subspace (DIIS) combined with Hamiltonian diagonalisation has been implemented in the ONETEP code for minimising the total energy of large systems using Density Functional Theory. This novel approach mixes the quantum single-particle density matrix at each iteration to accelerate the convergence of the self-consistent method in the inner loop of ONETEP. At each iteration the Hamiltonian matrix is diagonalised and a new density matrix is generated from its eigenvectors. Eigensolver routines working on dense matrices scale with the third power of the system size, which makes it difficult to simulate systems of thousands of atoms. The ScaLAPACK parallel eigensolver is assessed as a method for diagonalising the Hamiltonian matrix in this kind of calculation. The results show that the DIIS-Hamiltonian diagonalisation method works to a very high accuracy for systems of up to nearly 600 atoms and becomes unstable for larger systems. The ScaLAPACK parallel eigensolver has proven to enhance this approach by distributing the data over the processors, which cooperate to find the solution of the system. The DIIS method is intended to be combined with ensemble-DFT methods for simulating metallic systems, which are not achievable with linear-scaling DFT approaches. The performance results show that it is possible to simulate systems of a few thousand atoms with a computational effort that is comparable to linear-scaling DFT programs. The code that has been developed for this work will be added to the main ONETEP code repository and will be available to other users for active research on condensed matter theory.

Contents

1 Introduction

2 Background theory
  2.1 First-principles methods in computational chemistry
  2.2 Density Functional Theory
  2.3 Linear-scaling DFT
  2.4 The ONETEP code

3 The DIIS algorithm: theory and computational implementation in ONETEP
  3.1 Description of the original DIIS
  3.2 Implementation of the DIIS algorithm in the ONETEP code
  3.3 Description of the code
  3.4 Eigensolvers
    3.4.1 LAPACK serial eigensolver, DSYGVX
    3.4.2 ScaLAPACK parallel eigensolver, PDSYGVX

4 Target machines architecture, compilation details and benchmarks
  4.1 Target machines architecture
  4.2 Compilation details
  4.3 Benchmarks

5 Validation tests
  5.1 Results
  5.2 Methods for solving the convergence problems of the DIIS algorithm for large systems
    5.2.1 DIIS initialisation
    5.2.2 Kerker preconditioning
    5.2.3 Level shifting
    5.2.4 Other approaches

6 Performance results and analysis
  6.1 Comparison of LAPACK and ScaLAPACK eigensolvers
  6.2 Comparison of LNV and DIIS algorithms for the inner loop of ONETEP
  6.3 Future performance optimisations

7 Conclusions

List of Tables

4.1 Set of small test cases
4.2 Set of benchmarks corresponding to silicon nanorods of different size
4.3 Set of benchmarks corresponding to the amyloid protein of increasing size

5.1 Energy obtained by the LNV and DIIS methods for the set of small test cases. The asterisk * indicates that the calculation did not converge to the standard threshold.
5.2 Energy obtained by the LNV and DIIS methods for the benchmark set of silicon nanorods. The asterisk * indicates that the calculation did not converge to the standard threshold.

6.1 Successful calculations using the LAPACK and ScaLAPACK eigensolvers on HECToR and Iridis 3. In all cases, 4 CPUs per node have been used. The asterisk * indicates that a smaller number of nodes has not been tested and it is possible that the calculation fits on fewer nodes. To save computational time, the protein systems have not been simulated using LAPACK on Iridis 3.

List of Figures

2.1 ONETEP energy minimisation procedure. The energy functional is minimised in two nested loops: the inner loop optimises the elements of the density kernel, while the outer loop optimises the NGWFs.

3.1 Implementation of the DIIS algorithm in ONETEP.

4.1 Set of small benchmarks.
4.2 Set of silicon nanorods benchmarks.
4.3 Set of benchmarks of cuts to the amyloid protein.

5.1 Convergence of the energy of the LNV and DIIS methods for the set of small systems.
5.2 Potential energy well generated by the H-bonds between the monomers of the Tennis protein. The calculations with LNV and DIIS show very good agreement for this system.
5.3 Convergence of the energy of the LNV and DIIS methods for the set of silicon nanorods.

6.1 Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the eigensolvers for the Si766H402 silicon nanorod on Iridis 3.
6.2 Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the eigensolvers for the p16_20 protein benchmark on HECToR.
6.3 Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the inner loop of ONETEP for the p64_20 protein benchmark on HECToR.
6.4 Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the inner loop of ONETEP for the p64_20 protein benchmark on HECToR.
6.5 Comparison of the scaling of the LNV and DIIS algorithms with the system size for the set of silicon nanorods. The systems have been simulated using 20 nodes and 4 CPUs per node on Iridis 3.
6.6 Comparison of the scaling of the LNV and DIIS algorithms with the system size for the set of amyloid benchmarks. The systems have been simulated using 20 nodes and 4 CPUs per node on HECToR and Iridis 3.

6.7 Scaling of the DIIS-Hamiltonian diagonalisation routines with the system size of the amyloid protein. The systems have been simulated using 20 nodes and 4 CPUs per node on Iridis 3.

Acknowledgements

The author would like to acknowledge the members of the ONETEP developers group for their valuable help and his MSc supervisor, Bartosz Dobrzelecki, for his advice during the completion of this work. This project was done as part of a High-End Computing PhD studentship granted to the author, funded by EPSRC and supervised by Dr. Chris-Kriton Skylaris at the University of Southampton. The first two years of this studentship include part-time participation in the MSc in High Performance Computing at the University of Edinburgh, for which this dissertation is submitted.

Chapter 1

Introduction

Solid State Physics is the branch of science that studies the formation and properties of condensed matter. As a result of research in this field, novel materials and compounds have been developed for engineering and medical purposes that expand the capabilities of technology. Condensed matter is formed by atoms that interact with each other according to the laws of Quantum Mechanics, which define the observable properties of the material. Experimental work in the field has to be combined with theoretical models of atoms, molecules and solids in order to overcome the intrinsic difficulty of working at the nanoscale. First-principles calculations provide an accurate description of the processes that take place at the atomic scale, which is derived directly from the basic equations of Quantum Mechanics and does not require parameterisations based on experimental results. With these calculations, often based on the Born-Oppenheimer approximation [1], it is possible to determine the structure of a molecule or group of molecules and its electronic properties, which, in turn, result in a detailed description of the final behaviour of the real compound. For these reasons, due to its ability to describe the nature of materials, the development of first-principles techniques constitutes an important field of research for academia and industry. Molecules are formed by many atoms which may contain a very large number of electrons. Codes for solving such systems represent a major challenge for computational science, as they are an example of a problem of N interacting bodies which, when translated into equations, results in an intractable number of degrees of freedom. The application of parallel computing techniques to first-principles Quantum Mechanical codes has made it possible to simulate systems that would never have been achievable with serial computers. The efficiency and scaling of these algorithms are the key to reaching new and larger systems such as modern nanomaterials or drug compounds.

The ONETEP code [2] performs First-Principles Quantum Mechanics calculations based on Density Functional Theory (DFT) [3] with a computational effort that scales linearly with the number of atoms in the system. This program takes advantage of parallel computing to simulate systems of tens of thousands of atoms with great quantum accuracy.

The code, which uses Message-Passing directives for massive parallelisation on distributed-memory machines, has been designed for maximum parallel performance on modern state-of-the-art clusters. ONETEP uses a Self-Consistent Field (SCF) approach [1] for finding the ground state of the system by minimising the total energy with respect to the quantum single-particle density matrix in two nested loops. In this work, the Direct Inversion in the Iterative Subspace (DIIS) method [4] combined with explicit diagonalisation of the Hamiltonian matrix has been introduced as a new method for minimising the total energy in the inner loop of ONETEP. This algorithm is known to scale with the third power of the number of atoms (O(N^3)-scaling), and therefore it is very computationally demanding compared to standard ONETEP. However, it represents a necessary and critical step towards simulating metallic systems in ONETEP using the ensemble-DFT approach [5]. Hence, the code developed during this work is intended to be merged with the main branch of the ONETEP repository for future academic research on solid-state physics. To achieve top performance, the routine that diagonalises the Hamiltonian has to show optimum parallel efficiency. The ScaLAPACK library [6] parallel eigensolver is assessed for diagonalising the Hamiltonian in contrast to the equivalent serial eigensolver from the LAPACK library [7]. The advantage of a parallel library such as ScaLAPACK is that all the processors cooperate to solve the given mathematical problem, which should result in a faster execution time of the algorithm and in distributed memory usage for storing large data structures. On the other hand, inter-node communications can represent a bottleneck for this kind of parallel dense algebra routine and reduce the efficiency of the implementation on large clusters with slow interconnects. For this reason, this work contains an analysis of the performance of the ScaLAPACK eigensolver routine and its implementation in the ONETEP code for large dense matrices on two modern parallel clusters: HECToR (UK National Supercomputer Facility) and Iridis 3 (University of Southampton's cluster). The aims of this work are i) to develop and integrate the code for DIIS and Hamiltonian diagonalisation as a method for energy minimisation in the inner loop of ONETEP, and ii) to evaluate and analyse the performance of the ScaLAPACK parallel eigensolver in contrast to the serial LAPACK version. The next chapter describes the underlying theory of first-principles methods, traditional DFT, linear-scaling DFT and the ONETEP code from a mathematical point of view. It also contains a brief summary of the most important characteristics that make ONETEP a linear-scaling code, such as the employment of the Li-Nunes-Vanderbilt (LNV) [8] algorithm as the standard method for energy minimisation in the inner loop. Chapter three describes the theory behind the DIIS algorithm as a method for accelerating the convergence of SCF calculations and its addition to ONETEP, combined with Hamiltonian diagonalisation, for minimising the energy in the inner loop. This chapter also describes the details of the computational implementation of the algorithm, the most relevant aspects of the routines included in the code and their potential effect on the final performance. In particular, this chapter focuses on the description of the eigensolvers as they are used in the code and their suitability for diagonalising the Hamiltonian.

The fourth chapter gives a description of the target machines, HECToR and Iridis 3, and summarises the compilation options, the timing routines and the set of benchmarks that are used for the validation and performance tests of the DIIS code. The fifth chapter is dedicated to the validation of the DIIS code for minimising the energy. At this point, results regarding the convergence, numerical stability and agreement with the standard LNV method are provided, as well as a discussion of the range of validity of the implemented code and possible solutions to the misbehaviours detected in large systems. Chapter six provides the results and analysis regarding the performance of the DIIS code on the target machines. The analysis compares the parallel performance of the ScaLAPACK eigensolver with the serial LAPACK version on increasing numbers of processors and discusses the range of applicability of each library for diagonalising large matrices. Then, the performance of the DIIS-Hamiltonian diagonalisation method is compared to standard LNV, with particular interest in the scaling of each algorithm with the system size on HECToR and Iridis 3. The last section of this chapter is dedicated to summarising some performance optimisations that could be introduced in the code in future developments. Finally, chapter seven contains the conclusions regarding the most important points of the research carried out for the development of the DIIS and Hamiltonian diagonalisation method for total-energy minimisation within the ONETEP code. It also includes a discussion of the suitability of the ScaLAPACK parallel eigensolver routine as a method for extracting top performance from modern parallel clusters, which will allow scaling to larger atomic systems.

Chapter 2

Background theory

2.1 First-principles methods in computational chemistry

Atomic interaction in molecular systems is a case of an N-body problem whose complexity increases with the number of interacting particles in the system. This kind of problem has been widely analysed over the years by researchers from many fields due to the inherent difficulty of dealing with a large number of degrees of freedom, and it represents one of the major challenges in computational science [9][10]. In the field of first-principles methods, the interactions between the atomic particles are described by the Schrödinger equation:

\hat{H}\,\Psi = E\,\Psi,   (2.1)

where \hat{H} is the Hamiltonian operator (or energy operator), which describes the kind of interaction between the particles, Ψ is the wave function of the system and E is the total energy. Equation (2.1) is an eigenvalue problem, where the wave function Ψ is the eigenvector and E is the eigenvalue. This characteristic of the Schrödinger equation is very important in order to understand the nature of the problem and the computational challenge that solving the N-body system represents for quantum mechanics. The Hamiltonian contains the interaction terms between all the particles in the system. Under an atomistic approximation, the particles involved are the protons in the core of each atom and the electrons moving in orbitals around the nuclei. These particles interact with each other via electromagnetic interactions, introducing a large number of degrees of freedom in the equations that, in most cases (more than two atoms), make the problem intractable. The Born-Oppenheimer approximation (also known as the adiabatic approximation) addresses this difficulty by assuming that the nuclei are stationary due to their great mass compared to the electrons. Therefore, it is assumed that the energy due to interactions between the protons within a nucleus is negligible compared to the electronic interactions. The energy terms due to the protons can be suppressed from the

Hamiltonian, which reduces the degrees of freedom of the Schrödinger equation exclusively to those associated with the electrons. Under this approximation, the electronic Hamiltonian can be written in atomic units as:

\hat{H}_{el} = -\frac{1}{2}\sum_j \nabla_j^2 \;-\; \sum_j \sum_\alpha \frac{Z_\alpha}{|\mathbf{r}_j - \mathbf{R}_\alpha|} \;+\; \frac{1}{2}\sum_j \sum_{k \neq j} \frac{1}{|\mathbf{r}_j - \mathbf{r}_k|}.   (2.2)

The energy terms that involve interactions between the nuclei can be reduced to a constant. Changes in the spatial coordinates of the nuclei map an adiabatic energy potential that can be explored in order to find the geometry of the system in equilibrium (geometry optimisation calculations) [11]. Usually, for a single energy calculation, the nuclear positions are initialised to experimental coordinates (from X-ray diffraction data) or to the coordinates given by a previously optimised geometry. In Quantum Mechanics, observables are determined by their expectation value. For the electronic Hamiltonian in equation (2.2), the expectation value corresponds to the total energy of the system:

\langle \hat{H}_{el} \rangle = \frac{\langle \Psi \,|\, \hat{H}_{el} \,|\, \Psi \rangle}{\langle \Psi \,|\, \Psi \rangle} = E[\Psi].   (2.3)

The energy of the system is a functional of the wave function Ψ, in the sense that variations in Ψ will cause a change in E. Moreover, it can be proven that the energy functional E[Ψ] has a minimum at the ground state of the system, when Ψ is equal to the ground-state wave function, Ψ_0. Therefore, variations in Ψ towards Ψ_0 will minimise the energy functional, which, in turn, is equivalent to solving the Schrödinger equation for the ground state of the system.

\delta E[\Psi] = 0 \quad \text{if} \quad \Psi = \Psi_0.   (2.4)

Variational methods can be used in order to minimise the energy with respect to the wave function Ψ. These algorithms use well-established numerical procedures for seeking the minimum of the energy. To describe the electronic structure of a system of atoms, a number of different approaches have been developed in recent decades, such as tight-binding methods [12], Hartree-Fock (HF) methods [13], Density Functional Theory (DFT) [14], Coupled-Cluster [15], Quantum Monte Carlo [16] or Car-Parrinello methods [17]. Each of these approaches has a range of applications, with their usability being limited by the degree of physical accuracy, the properties of the systems that can be simulated, the size of the molecules and the computational cost.

2.2 Density Functional Theory

Amongst these methods, Density Functional Theory (DFT) has been extensively investigated in recent years by the international community of condensed matter researchers. DFT provides a quantum description of the system by solving the Schrödinger equation (2.1) to find the ground-state wavefunctions and energy of the molecules under simulation. DFT calculations are based on the Hohenberg-Kohn theorems [18], outlined below. Before describing these theorems, it is worth noticing that the electronic Hamiltonian in (2.2) can be written as \hat{H}_{el} = \hat{F} + \hat{V}, where:

\hat{F} = -\frac{1}{2}\sum_j \nabla_j^2 + \frac{1}{2}\sum_j \sum_{k \neq j} \frac{1}{|\mathbf{r}_j - \mathbf{r}_k|},   (2.5)

\hat{V} = \sum_i v_{ext}(\mathbf{r}_i) = -\sum_j \sum_\alpha \frac{Z_\alpha}{|\mathbf{r}_j - \mathbf{R}_\alpha|}.   (2.6)

The operator \hat{F} contains the interaction between the electrons and the kinetic energy, and is the same for all N-body systems. On the other hand, the operator \hat{V} contains the interaction between the electrons and the nuclei, and defines the external potential, v_ext(r), on each electron due to the static nuclei in the system.

Hohenberg-Kohn Theorem I: There is a one-to-one correspondence between the ground-state charge density of the system of N interacting electrons and the external potential, v_ext, acting on each electron.

The external potential is uniquely determined by the electronic density in the ground state. Hence, by determining the ground-state electronic density, the entire Hamiltonian is uniquely defined, and so is the wave function and thus all the properties of the system. At this point, the energy becomes a functional of the density that can be written as:

E[n] = F[n] + \int d\mathbf{r}\, n(\mathbf{r})\, v_{ext}(\mathbf{r}).   (2.7)

Hohenberg-Kohn Theorem II: For any electronic density n(r) different from the ground-state electronic density n_0(r), the following relation is true:

E[n(\mathbf{r})] \geq E[n_0(\mathbf{r})].   (2.8)

This theorem makes it possible to perform a variational minimisation of the energy based on the electronic density instead of the wave function. The absolute minimum of the energy functional is uniquely determined by the electronic density at the ground state of the system.

The Hohenberg-Kohn theorems make DFT more attractive for solving the Schrödinger equation in condensed matter applications. The advantage of DFT is clear: instead of having 3N degrees of freedom in the system (the three spatial coordinates of each of the N particles), the problem has been reduced to finding a function of three spatial coordinates, the ground-state electronic density, n_0(r). The energy minimisation, as the second theorem remarks, can be done in a variational way that is equivalent to solving the Schrödinger equation (2.1) for the ground state. However, the N-body problem remains intractable: the difficulty is caused by the fact that the exact form of the functional F[n], which can be rewritten as

F[n] = E_{coul}[n] + E_{xc}[n],   (2.9)

is system-dependent and remains unknown due to the complexity of the exchange-correlation interactions, E_xc, in the N-body system. Over the years, researchers have developed a range of functionals to overcome this difficulty. Some examples are the LDA, GGA, PBE or B3LYP functionals [19]. All of them are based on what is called the Kohn-Sham mapping [20]. The Kohn-Sham mapping assumes that there is a fictitious system of N non-interacting electrons that has the same exact ground-state electronic energy as the original interacting system of electrons. As a consequence, it is now possible to define a single-electron Schrödinger equation:

\hat{h}_i\, \psi_i(\mathbf{r}) = \epsilon_i\, \psi_i(\mathbf{r}),   (2.10)

where ψ_i(r) are electronic molecular orbitals containing a number of electrons between 0 and 2, and ε_i is the energy of the molecular orbital ψ_i. There are N/2 states ψ_i(r) which are solutions of the Schrödinger equation and obey the following orthonormality condition:

\int d\mathbf{r}\, \psi_i(\mathbf{r})\, \psi_j^*(\mathbf{r}) = \delta_{ij}.   (2.11)

The total energy of the system can be decomposed as

E[n] = E_{kin}[n] + E_{coul}[n] + E_{xc}[n] + E_{ext}[n],   (2.12)

where E_kin[n] is the kinetic energy of the system of non-interacting electrons, E_coul[n] takes into account the Coulombic interaction of the electrons as if they were immersed in an effective potential [1], E_xc[n] is the exchange-correlation energy of non-interacting electrons and E_ext[n] is the nuclei-electron interaction. At this point, the electronic density can be written in terms of the Kohn-Sham orbitals as

n(\mathbf{r}) = 2 \sum_{i=1}^{N/2} \psi_i(\mathbf{r})\, \psi_i^*(\mathbf{r}).   (2.13)

The Kohn-Sham equations simplify the Hamiltonian by isolating the exchange-correlation term and introducing the well-known kinetic energy term for non-interacting electrons. In fact, if E_xc were exact, the ground state of the fictitious non-interacting electrons would be the same as that of the interacting electrons. Furthermore, as the energy in equation (2.12) depends on the electronic density and, according to equation (2.13), n(r) depends on the molecular orbitals ψ_i which, at the same time, are defined by the Hamiltonian via equation (2.10), it should be possible to solve the Kohn-Sham equations self-consistently in a variational way. This procedure would lead to the ground-state electronic density n_0(r), hence solving the Schrödinger equation of the system of electrons.
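To make the self-consistency idea concrete, the cycle described above can be sketched in a few lines of Python/NumPy. This is only an illustration of equations (2.10), (2.12) and (2.13), not part of ONETEP (which is written in Fortran90): build_hamiltonian is a hypothetical placeholder for whatever density-functional approximation is used, and an orthonormal basis is assumed for the density update.

```python
import numpy as np
from scipy.linalg import eigh

def scf_loop(build_hamiltonian, S, n_electrons, n_start, tol=1e-6, max_iter=50):
    """Schematic Kohn-Sham self-consistency cycle (illustration only).

    build_hamiltonian(n) is a hypothetical callback returning the Kohn-Sham
    Hamiltonian matrix for a given density vector n; S is the basis overlap
    matrix (the identity for an orthonormal basis)."""
    n = n_start
    for iteration in range(max_iter):
        H = build_hamiltonian(n)              # H depends on n through (2.12)
        eps, M = eigh(H, S)                   # solve (2.10) in the chosen basis
        occ = M[:, : n_electrons // 2]        # lowest N/2 orbitals, doubly occupied
        n_new = 2.0 * np.sum(occ**2, axis=1)  # diagonal of the density, cf. (2.13)
        if np.linalg.norm(n_new - n) < tol:   # self-consistency reached
            return eps, M, n_new
        n = n_new                             # plain update; real codes mix densities
    raise RuntimeError("SCF did not converge within max_iter iterations")
```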

2.3 Linear-scaling DFT

Traditional DFT codes scale with the cube of the system size (O(N^3)-scaling), due to the nature of the self-consistent cycle that is used to solve the Kohn-Sham equations in a variational manner. Computationally speaking, the cycle involves dense matrix diagonalisations that cause the cubic scaling. For this reason, traditional DFT has been limited by the available computational power and the system size, making it possible to simulate only systems of a few hundred atoms on parallel machines. The CASTEP [21] and VASP [22] codes are good examples of popular cubic-scaling DFT approaches for condensed-matter applications. In the last decade the effort has been put towards the development of linear-scaling methods (O(N)-scaling) for DFT calculations. Many such codes are based on a reformulation of DFT in terms of the quantum one-particle density matrix, ρ(r, r′), which contains all the information about the state of the particles in the system. The density matrix can be written in terms of the Kohn-Sham orbitals as:

\rho(\mathbf{r}, \mathbf{r}') = \sum_{i=1}^{\infty} \psi_i(\mathbf{r})\, f_i\, \psi_i^*(\mathbf{r}'),   (2.14)

where the coefficients f_i are called occupation numbers, and lie in the range between 0 and 1 (empty orbital to fully occupied orbital). The electronic density is related to the density matrix by the diagonal of ρ(r, r′):

n(\mathbf{r}) = 2\,\rho(\mathbf{r}, \mathbf{r}).   (2.15)

According to the last equation, the ground-state density matrix obeys the same properties stated by the Hohenberg-Kohn theorems of DFT. Therefore, minimising the

energy with respect to the density matrix elements is equivalent to solving the Kohn-Sham equation (2.10). The minimisation procedure, however, has to take into account a number of constraints that apply to the density matrix:

1. The number of electrons must remain constant:

   N = \int d\mathbf{r}\, n(\mathbf{r}) = 2 \int d\mathbf{r}\, \rho(\mathbf{r}, \mathbf{r}).   (2.16)

2. The density matrix in the ground state has to be idempotent, which means that the following condition has to be true:

   \rho^2(\mathbf{r}, \mathbf{r}') = \rho(\mathbf{r}, \mathbf{r}').   (2.17)

This will have an important effect on the calculation. Expanding ρ in terms of the Kohn-Sham orbitals as in equation (2.14), and bearing in mind that the ψ_i are orthogonal functions, it can be shown that, for idempotency to be achieved, the occupation numbers f_i have to satisfy the relation f_i^2 = f_i, which, in turn, means that the f_i are restricted to be either 0 or 1 for unoccupied and occupied orbitals respectively, and no fractional occupancies are allowed. As a consequence, equation (2.14) can be rewritten as

   \rho(\mathbf{r}, \mathbf{r}') = \sum_{i=1}^{N_{occ}} f_i\, \psi_i(\mathbf{r})\, \psi_i^*(\mathbf{r}'),   (2.18)

where N_occ refers to the number of occupied orbitals.

The second constraint represents a major challenge for linear-scaling codes, as the idempotency condition cannot be enforced explicitly with a cost that scales linearly with the system size. This condition is closely related to the orthonormality constraint on the molecular orbitals, which, if enforced directly, will cause the code to scale as N^3. In recent years, different approaches have been tested in order to overcome this difficulty. Some of the most successful approaches are those based on the McWeeny method [23], which are able to impose weak idempotency in a linear-scaling fashion for a range of systems. Many linear-scaling codes are based on the property of locality (or nearsightedness) of the density matrix, originally proposed by Kohn [24]. It can be shown that the elements of the density matrix decay exponentially in systems with a non-zero band gap, such as insulators or semiconductors. It is an intuitive observation that the interaction of two particles will decay with the distance between them, so the effect can be neglected if the two particles are separated far enough. This behaviour is predominant in non-metallic systems, where the density matrix decays as

\rho(\mathbf{r}, \mathbf{r}') \sim e^{-\gamma\, |\mathbf{r} - \mathbf{r}'|},   (2.19)

where γ is a coefficient whose value becomes larger with the band gap. The property of locality in non-metallic systems can be exploited in order to truncate the density matrix, so that the interaction between particles separated by more than a radial cut-off, |r − r′| > r_cutoff, is set to zero. With this approximation, the degrees of freedom are substantially reduced, simplifying the problem. From a computational point of view, truncation of the density matrix means that the matrices involved in the calculation acquire a sparse pattern, so dealing with the algebra requires less computational effort, closer to linear scaling. Nevertheless, actual linear scaling also depends on the numerical implementation of the algorithm, which should take advantage of the locality of the density matrix and exploit the sparsity of the matrices in the system after the truncation. The choice of the basis set and the efficiency of the algebra routines play a major role in the final scaling of the code. In the field of linear-scaling codes, an important measure is the break-even point, defined as the size of the system for which a linear-scaling code becomes faster than a traditional cubic-scaling approach. The break-even point is system-dependent, but usually lies in the range of a few hundreds of atoms. This means that, provided similar computational resources, linear-scaling codes are able to simulate larger systems, often in the range of thousands or tens of thousands of atoms. Examples of linear-scaling codes used by the international community of researchers are SIESTA [25], CONQUEST [26] and the ONETEP program [2], which has been used for the development of this work.
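The idempotency and truncation ideas in equations (2.17)–(2.19) can be illustrated with a short NumPy sketch. Everything below (the random orbitals, the one-dimensional "positions" and the cut-off value) is invented purely for illustration and has no connection to the ONETEP implementation.

```python
import numpy as np

# Build a density matrix from occupied orthonormal orbitals as in (2.18),
# verify idempotency (2.17), and apply a radial truncation suggested by the
# exponential decay (2.19).
rng = np.random.default_rng(0)
n_basis, n_occ = 40, 10

# Random orthonormal orbitals: columns of Q span the occupied subspace.
Q, _ = np.linalg.qr(rng.standard_normal((n_basis, n_occ)))
rho = Q @ Q.T                                   # f_i = 1 for occupied states

print(np.allclose(rho @ rho, rho))              # True: rho is idempotent

# Fictitious one-dimensional "positions" for each basis function, used only
# to mimic the distance-based cut-off |r - r'| > r_cutoff.
positions = np.linspace(0.0, 20.0, n_basis)
r_cutoff = 5.0
mask = np.abs(positions[:, None] - positions[None, :]) <= r_cutoff
rho_truncated = np.where(mask, rho, 0.0)        # sparse pattern after truncation

# Truncation breaks idempotency slightly; linear-scaling codes restore it
# approximately, e.g. with McWeeny-type purification (see section 2.4).
print(np.max(np.abs(rho_truncated @ rho_truncated - rho_truncated)))
```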

2.4 The ONETEP code

ONETEP (Order-N Electronic Total Energy Package) [2] is a linear-scaling approach for total-energy calculations that takes advantage of parallel computers to simulate large systems of up to tens of thousands of atoms. It is written in Fortran90 and uses Message-Passing calls to allow scaling to very large processor counts on modern distributed-memory parallel clusters.

The ONETEP approach is based on the single-particle density matrix, ρ(r, r′), which contains all the information about the quantum state of the system. At the same time, ONETEP exploits the exponential decay of the density matrix in non-metallic systems to truncate the density matrix for those elements that are separated by more than a given distance cut-off, |r − r′| > r_cutoff. In order to take advantage of the localisation of the density matrix, ONETEP expands the Kohn-Sham molecular orbitals as a linear combination of a set of Non-orthogonal Generalised Wannier Functions (NGWFs) [27], φ_α(r):

\psi_i(\mathbf{r}) = \sum_\alpha M^{\alpha}_{\ i}\, \phi_\alpha(\mathbf{r}).   (2.20)

The NGWFs are chosen to be non-orthogonal in order to impose strict spatial localisation within a sphere centred on each atom.

Also, the NGWFs decay exponentially with the distance from the origin until they vanish at the border of the localisation sphere. The overlap between two NGWFs is defined as

S_{\alpha\beta} = \int d\mathbf{r}\, \phi^*_\alpha(\mathbf{r})\, \phi_\beta(\mathbf{r}).   (2.21)

The quality of the physical description of the system depends on the NGWF sphere radii r_α (the larger the radii, the better the accuracy) and on the number of NGWFs per atom (the more NGWFs, the more complete the NGWF basis and, therefore, the better the accuracy). However, these parameters have to be tuned carefully in order to keep the balance between accuracy and speed of the calculation. This method exploits locality in the sense that, if two NGWFs are separated by more than r_α, the overlap between them will be zero, thus providing sparsity to the matrices involved in the calculation. Furthermore, ONETEP uses the pseudopotential technique for dealing with the core electrons in a computationally efficient way [28], which, in turn, reduces the problem to the accurate description of the valence electrons. The density matrix can be written in the NGWF representation by introducing the relation in (2.20) into equation (2.18):

\rho(\mathbf{r}, \mathbf{r}') = \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K^{\alpha\beta}\, \phi_\beta(\mathbf{r}'),   (2.22)

where K^{αβ} is called the density kernel and is the representation of the density matrix in the NGWF notation:

K^{\alpha\beta} = \sum_i f_i\, M^{\alpha}_{\ i}\, M^{\beta}_{\ i}.   (2.23)

The combination of both properties, the truncation of the density kernel and the localisation of the NGWFs, is a key factor in achieving linear scaling in ONETEP. In addition to this, the NGWFs are expanded in a basis set of psinc functions [29], D_{k,α}(r), as

\phi_\alpha(\mathbf{r}) = \sum_k D_{k,\alpha}(\mathbf{r})\, q_k,   (2.24)

which are spike-like, orthogonal functions centred on each computational grid point and are equivalent to a plane-wave basis set via a unitary transformation. This has a number of important advantages over other choices of basis sets:

1. Plane waves are a near-complete basis set, which provides a superior accuracy for the representation of the NGWFs and hence the molecular orbitals.

2. The quality of the calculation is controlled by a single parameter: the kinetic energy cut-off, E_cutoff. The higher the energy cut-off, the more plane waves are used in the representation of the NGWFs and the better the accuracy will be.

3. The psinc basis set allows the NGWFs to be optimised in situ during the calculation, which eliminates the Basis Set Superposition Error (BSSE) [30].

4. Plane waves are the natural basis set for performing Fast Fourier Transforms (FFTs). FFTs are a central part of the ONETEP core, as they are necessary for calculating the components of the total energy in a computationally efficient way.

5. Pulay forces [31] are zero as the psinc functions do not move with the atomic positions.

The ONETEP code minimises the total energy self-consistently in two nested loops per iteration, as shown in figure 2.1. The inner loop keeps the NGWFs, φ_α(r), fixed while minimising the energy with respect to the elements of the density kernel, K^{αβ}. In the outer loop, the density kernel is fixed and the total energy is minimised with respect to the NGWFs in the psinc representation, to find the set of optimum q_k coefficients. The whole procedure is done in such a way that the density kernel K stays idempotent, the NGWFs remain localised within a sphere and the number of electrons in the system remains constant. From equation (2.22) it can be seen that the two-nested-loop scheme is equivalent to minimising the energy self-consistently with respect to the density matrix, ρ(r, r′).

All the procedures included in the self-consistent loop of ONETEP are intended to scale linearly with the system size. In order to do that, a number of advanced numerical and computational techniques have been introduced in ONETEP and published in the literature. Amongst them, it is important to remark on the Li-Nunes-Vanderbilt algorithm for optimising the density kernel [32], the conjugate-gradient method for optimising the NGWFs in the outer loop [33], the FFT box technique [34] for performing multiple Fast Fourier Transforms whose cost does not depend on the system size, and the SPAM3 algebra that takes advantage of the sparsity of the matrices to efficiently distribute and perform matrix operations on parallel computers [35].

Density kernel optimisation: the inner loop

The inner loop of ONETEP is in charge of minimising the total energy with respect to the elements of the density kernel [32], K^{αβ}. During the execution of this loop, the NGWFs, φ_α, remain fixed, so the energy functional depends only on the density kernel elements, E = E[K]. ONETEP uses the Li-Nunes-Vanderbilt (LNV) algorithm [8], based on the McWeeny method [23], in order to minimise the energy while keeping the idempotency constraint on the density kernel. Provided a nearly idempotent density kernel, L, a fully idempotent density kernel can be built iteratively by applying the relation

K = 3\,LSL - 2\,LSLSL,   (2.25)

Figure 2.1: ONETEP energy minimisation procedure. The energy functional is minimised in two nested loops: the inner loop optimises the elements of the density kernel, while the outer loop optimises the NGWFs.

where S is the overlap matrix defined in equation (2.21). Provided sparsity of the density kernel (via the density cut-off radius, r_cutoff), this method scales linearly with the system size.

NGWF optimisation: the outer loop

The outer loop of ONETEP minimises the energy with respect to the NGWFs, φ_α. During the outer loop, the density kernel, which has already been optimised, remains constant, so the energy is a functional of the NGWFs only, E = E[φ_α]. The NGWFs are expanded as a linear combination of psinc functions as indicated in equation (2.24). The aim of this loop is to find the optimum set of coefficients q_k that minimises the energy. This task is carried out by conjugate-gradient methods, which scale linearly with the system size due to the localisation constraint on the NGWFs, which provides sparsity to the matrices involved in the process.

Fast Fourier Transforms (FFTs)

Fast Fourier Transforms (FFTs) are necessary in order to calculate the energy contributions efficiently. The FFTs scale as N log N, and the number of FFTs required for

the calculation scales with the number of NGWFs, i.e., with N. This would make the algorithm scale as N^2 with the system size, eliminating the linear-scaling behaviour of ONETEP. However, ONETEP incorporates a development that takes advantage of the localisation of the NGWFs to make the cost of each FFT independent of the system size. ONETEP defines a box around the NGWFs, called the FFT box [34], which is big enough to contain any overlapping NGWFs but small compared to the entire simulation box. The result is that the number of grid points that is taken into account for each FFT is system-independent, so the final algorithm scales linearly with the system size.

Efficient parallel strategy for communications and algebra

ONETEP incorporates its own parallel algebra routines which fully exploit the truncation of the density kernel and the localisation of the NGWFs. These developments have been published in a series of papers [35][36] that analyse the performance of ONETEP on different numbers of cores. The matrix elements are distributed over the processors in such a way that the communications required for the algebra operations (such as matrix-matrix multiplication) are minimised. In order to do that, neighbouring atoms are allocated to the same processor, so the elements of the resulting matrices are clustered in a nearly block-diagonal scheme where each block belongs to a different node. Therefore, blocks outside the diagonal are sparse or zero, which optimises the memory usage and minimises the communications between nodes. The code includes LAPACK calls for certain operations on sparse matrices. A similar scheme is used for dense matrices which, for performance, incorporate calls to ScaLAPACK parallel matrix operations.
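The McWeeny-type purification (2.25) on which LNV is based can be illustrated with the following NumPy sketch. It is not the ONETEP Fortran code: the overlap matrix, the starting kernel and the number of iterations are invented for the example, and idempotency is checked in the non-orthogonal form K S K = K.

```python
import numpy as np

def mcweeny_purify(L, S, n_steps=10):
    """Iterate K <- 3 K S K - 2 K S K S K, the non-orthogonal McWeeny step
    of equation (2.25), starting from a nearly idempotent kernel L."""
    K = L.copy()
    for _ in range(n_steps):
        KS = K @ S
        K = 3.0 * KS @ K - 2.0 * KS @ KS @ K
    return K

# Toy example: perturb an exactly idempotent kernel and purify it back.
rng = np.random.default_rng(1)
n, n_occ = 30, 8
S = np.eye(n) + 0.01 * rng.standard_normal((n, n))
S = 0.5 * (S + S.T)                      # symmetric, close to identity

# Exactly idempotent kernel in the S metric: K = C C^T with C^T S C = I.
C = rng.standard_normal((n, n_occ))
C = C @ np.linalg.inv(np.linalg.cholesky(C.T @ S @ C).T)  # S-orthonormalise
K_exact = C @ C.T
L = K_exact + 1e-3 * rng.standard_normal((n, n))
L = 0.5 * (L + L.T)                      # nearly idempotent starting guess

K = mcweeny_purify(L, S)
print(np.max(np.abs(K @ S @ K - K)))     # close to zero: K S K = K restored
```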

Chapter 3

The DIIS algorithm: theory and computational implementation in ONETEP

Direct Inversion in the Iterative Subspace (DIIS) is a method for speeding up the energy minimisation with respect to a set of coordinates by constructing and improving the solution as a point in the Krylov subspace [37]. This algorithm was originally developed by Pulay [4] and later extended by a number of authors [38][39]. In fact, DIIS was first presented as a method for speeding up SCF calculations by improving the electronic density with the values of previous iterations after each self-consistent cycle, thus reducing the number of iterations needed to converge. There are many examples in the literature of implementations of DIIS within an SCF problem at a computational cost that scales with the square of the system size (O(N^2)-scaling). Most of these codes combine modern quasi-Newton-Raphson methods (that make use of an improving approximation of the inverse Hessian matrix and the gradient vector) and density mixing in order to minimise a system of N independent particles in a self-consistent, iterative way. However, the approach taken in this work is different and has a computational cost that scales with the cube of the system size (O(N^3)-scaling). This cubic scaling is caused by the diagonalisation of the Hamiltonian to enforce self-consistency during the energy minimisation. The ScaLAPACK parallel eigensolver is used to diagonalise the Hamiltonian and to improve the performance of the code. This technique has been included in other codes such as CASTEP and CONQUEST [40]. DIIS can be applied to different variables of the SCF cycle, such as the electronic density or the Hamiltonian. This work attempts a novel implementation of DIIS in ONETEP that mixes the single-particle density matrix, ρ(r, r′). The following sections develop the theory of the DIIS method for accelerating SCF energy minimisation and the equations behind its adaptation to the ONETEP framework. Then the main routines and the computational details of the actual code implementation are explained, with special interest in the library calls to the eigensolver routines for diagonalising the Hamiltonian

matrix, as this is the most expensive and computationally demanding part of the program.

3.1 Description of the original DIIS

In the original Pulay DIIS scheme, the total ground-state energy of the system is a functional of a set of parameters, contained in the vector p of coordinates:

p = \{p_1, p_2, \ldots, p_n\},   (3.1)

so the total energy of the system is a functional of these coordinates:

E = E[p_1, p_2, \ldots, p_n].   (3.2)

DIIS is inserted within an iterative SCF algorithm that minimises the energy in a variational fashion. Each iteration produces a new parameter vector p^{(i+1)} that, ideally, should be closer to the ground-state coordinates. A residuum vector can be associated with each new parameter vector by calculating the distance to the previous one:

\Delta p^{(i+1)} = p^{(i+1)} - p^{(i)}.   (3.3)

However, convergence of the SCF cycle can be slow or non-existent for certain systems, so stable numerical algorithms are needed to solve the energy minimisation problem. In this context, the DIIS algorithm improves the solution at each SCF iteration by constructing a new parameter vector from a linear combination of the solutions of previous iterations

p^{(i+1)} = \sum_{m=1}^{i} p^{(m)}\, d_m.   (3.4)

The DIIS method is based on the assumption that the residuum vector

\Delta p = \sum_{m=1}^{i} \Delta p^{(m)}\, d_m,   (3.5)

converges to zero in the mean-squares sense, provided the condition

\sum_{m=1}^{i} d_m = 1,   (3.6)

to preserve the normalisation constraint. This leads to the system of m + 1 linear equations

\begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1m} & -1 \\
B_{21} & B_{22} & \cdots & B_{2m} & -1 \\
\vdots & \vdots &        & \vdots & \vdots \\
B_{m1} & B_{m2} & \cdots & B_{mm} & -1 \\
-1     & -1     & \cdots & -1     & 0
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_m \\ \lambda \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ -1 \end{pmatrix},   (3.7)

where B_{ij} = \langle \Delta p^{(i)} \,|\, \Delta p^{(j)} \rangle with a suitable scalar product of the residuum vectors, and λ is the Lagrange multiplier that imposes the condition (3.6).

The solution of this system of equations gives the d_m coefficients that are used to generate the parameter vector for the next iteration according to equation (3.4).
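A minimal NumPy sketch of the Pulay extrapolation, equations (3.4)–(3.7), is given below. It is illustrative only: the plain Euclidean dot product stands in for whatever scalar product is chosen for the residuum vectors, and the function name diis_extrapolate is hypothetical.

```python
import numpy as np

def diis_extrapolate(params, residuals):
    """Pulay DIIS step: given the history of parameter vectors p^(m) and
    residuum vectors Delta p^(m), solve the bordered system (3.7) for the
    coefficients d_m and return the extrapolated parameter vector (3.4)."""
    m = len(residuals)
    A = np.zeros((m + 1, m + 1))
    rhs = np.zeros(m + 1)
    for i in range(m):
        for j in range(m):
            A[i, j] = np.dot(residuals[i], residuals[j])   # B_ij
    A[:m, m] = -1.0        # bordered row/column enforcing sum(d) = 1
    A[m, :m] = -1.0        # via the Lagrange multiplier lambda
    rhs[m] = -1.0
    coeffs = np.linalg.solve(A, rhs)
    d = coeffs[:m]                                          # last entry is lambda
    return sum(d_m * p_m for d_m, p_m in zip(d, params))

# Usage sketch: inside an SCF loop one would append the latest p and
# Delta p = p_new - p_old to the histories and call diis_extrapolate.
```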

3.2 Implementation of the DIIS algorithm in the ONETEP code

The DIIS algorithm has been implemented as a total-energy minimisation method in the inner loop of ONETEP. It represents an alternative to the LNV algorithm [8] which, although suitable for most cases, has proven not to be reliable in certain cases [41]. LNV is a linear-scaling method that takes advantage of the locality of matter to provide sparsity to the density kernel while keeping the density matrix idempotent. Due to its favourable properties, it has become popular among researchers in theoretical condensed matter. However, it is limited to systems with a non-vanishing energy band gap between occupied and unoccupied states. On the other hand, DIIS, as it has been implemented in ONETEP, scales with the third power of the system size (O(N^3)-scaling), since it requires one diagonalisation of the Hamiltonian matrix per SCF iteration. Although this implementation might limit the size of the systems that can be simulated, it provides a more stable method for minimising the energy. Furthermore, DIIS is the initial step of a series of modifications to the ONETEP code to make it able to perform ensemble-DFT calculations [5], which, after some further modifications, will provide a method for simulating metals (zero-band-gap systems).

The DIIS in ONETEP mixes the density matrices, ρ(r, r′), at each iteration of the inner loop of the SCF cycle. During the execution of the inner loop of ONETEP, the NGWFs φ_α(r) are kept fixed, which means that they are no longer degrees of freedom (parameters) of the energy functional. The total ground-state energy is a functional only of the density kernel elements, K^{αβ}, which are the parameters to be optimised in order to minimise the energy:

E_{inner} = E[K^{\alpha\beta}].   (3.8)

Therefore, within the inner loop, mixing the density kernels is equivalent to mixing the density matrices at each iteration, as the NGWFs do not change:

\rho_{m+1}(\mathbf{r},\mathbf{r}') = \sum_{i=1}^{m} d_i\,\rho_i(\mathbf{r},\mathbf{r}')
= \sum_{i=1}^{m} d_i \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K_i^{\alpha\beta}\, \phi_\beta^{*}(\mathbf{r}')
= \sum_{\alpha\beta} \phi_\alpha(\mathbf{r}) \left( \sum_{i=1}^{m} d_i\, K_i^{\alpha\beta} \right) \phi_\beta^{*}(\mathbf{r}').   (3.9)

Each SCF iteration with embedded density kernel mixing involves the following steps, summarised in figure 3.1 (a schematic code sketch of steps 2 to 5 is given at the end of this subsection):

1.- Build the Hamiltonian from the density kernel:

At each iteration it is possible to define two density kernels: the input density kernel, K_in, which comes from the previous iteration as the result of mixing the preceding kernels and is used to build the Hamiltonian; and the output density kernel, K_out, which is the truly idempotent kernel built from the eigenvectors of the Hamiltonian and which, after convergence has been reached, becomes the final outcome of the DIIS algorithm. The input density kernel is used to build the Hamiltonian matrix at the beginning of each iteration. In order to do that, the density-dependent matrices have to be recalculated for the values of K_in. This is a time-consuming routine, as it has to calculate many two-body integrals using Fast Fourier Transforms. However, the effort scales linearly with the system size due to the FFT box technique used in ONETEP. The density-independent matrices are calculated once before entering the inner loop, as they remain unchanged during the execution of DIIS. All the components are added up to obtain the Hamiltonian matrix of the system.

2.- Diagonalise the Hamiltonian and obtain the eigenvectors:

The eigenvalue problem to be solved by diagonalisation is the Schrödinger equation in the NGWF representation, which is obtained by combining equations (2.10) and (2.20):

H_{\alpha\beta}\, M^{\beta}_{\ i} = \epsilon_i\, S_{\alpha\beta}\, M^{\beta}_{\ i}.   (3.10)

The Hamiltonian diagonalisation is the most expensive step, as the program has to find the eigenvectors of a large dense matrix. This operation scales with the third power of the system size, ultimately defined by the number of atoms in the system. In order to speed up the calculation, this project takes advantage of the parallel algebra routines of ScaLAPACK, so the diagonalisation is performed in parallel with all the cores cooperating to find the solution. A detailed description of the parallel diagonalisation is provided in section 3.4.

3.- Construct the new density kernel from the eigenvectors:

Once the eigenstates of the Hamiltonian have been calculated, the output density kernel is built as

K_{out}^{\alpha\beta} = \sum_i f_i\, (M^{\alpha}_{\ i})^{\dagger}\, M^{\beta}_{\ i},   (3.11)

where, in order to enforce idempotency of the density matrix, the occupation numbers f_i are set to 1 for occupied states and 0 for the unoccupied ones. When convergence is achieved, K_in and K_out should differ by less than a certain given threshold. The residuum vector is calculated as the difference between the input and output kernels corresponding to the same iteration:

where, in order to enforce idempotency of the density matrix, the occupation numbers fi are set to 1 for occupied states and 0 for the unoccupied. When convergence is achieved, Kin and Kout should differ less than a certain given threshold. The residuum vector is calculated as the difference between the input and output kernels corresponding to the same iteration:

R^{\alpha\beta} = K_{out}^{\alpha\beta} - K_{in}^{\alpha\beta}.   (3.12)

4.- Check for convergence:

In practice, since the code deals with dense matrices, a calculation is said to have converged when the RMS gradient of the residuum matrix R, calculated as

\mathrm{RMS}(R) = \sqrt{\frac{\sum_{\alpha\beta} R^{\alpha\beta} \cdot R^{\alpha\beta}}{\text{number of non-zero elements}}},   (3.13)

is less than a specified threshold (set to 10^{-6} by default).

5.- Mix density kernels:

If convergence has not yet been achieved, then the density matrices are mixed. Following the original Pulay DIIS, the system of m linear equations (3.7) can be built at each iteration from all the residuum matrices of the preceding ones:

\begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1m} & -1 \\
B_{21} & B_{22} & \cdots & B_{2m} & -1 \\
\vdots & \vdots &        & \vdots & \vdots \\
B_{m1} & B_{m2} & \cdots & B_{mm} & -1 \\
-1     & -1     & \cdots & -1     & 0
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_m \\ \lambda \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ -1 \end{pmatrix},   (3.14)

where the B_{ij} are calculated using a Frobenius-type inner product that is tensorially correct for non-orthogonal representations:

B_{ij} = \langle R_i \,|\, R_j \rangle = \mathrm{tr}\!\left[ R_i^{\dagger} \cdot S \cdot R_j \cdot S \right].   (3.15)

The solution of (3.14) gives the coefficients d_m that are used to generate the next input density kernel in order to achieve faster convergence of the SCF cycle:

K_{m+1,\,in}^{\alpha\beta} = \sum_{i=1}^{m} K_{i,\,out}^{\alpha\beta}\, d_i.   (3.16)

After these steps the new K_in^{αβ} is used to build the next Hamiltonian, and the self-consistent cycle starts again.
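The steps above can be condensed into a schematic NumPy/SciPy sketch of one inner-loop iteration. This is a translation of the equations, not of the ONETEP Fortran routines: H and S are assumed to be already available as dense arrays (in the real code they come from the Hamiltonian builder and the NGWF overlaps, stored in SPAM3 format), and the kernel and residuum histories are plain Python lists.

```python
import numpy as np
from scipy.linalg import eigh

def diis_iteration(H, S, n_occ, K_in, K_out_history, R_history):
    """One inner-loop iteration (steps 2-5 above), written as a sketch of the
    equations rather than of the ONETEP Fortran routines."""
    # 2.- Diagonalise: solve H M = eps S M, the generalised problem (3.10).
    eps, M = eigh(H, S)

    # 3.- Output kernel (3.11): f_i = 1 for the n_occ lowest states, 0 otherwise.
    M_occ = M[:, :n_occ]
    K_out = M_occ @ M_occ.T

    # Residuum (3.12) and RMS convergence measure (3.13).
    R = K_out - K_in
    rms = np.sqrt(np.mean(R**2))

    K_out_history.append(K_out)
    R_history.append(R)

    # 5.- Pulay mixing: B_ij = tr(R_i^T S R_j S) as in (3.15), then the
    #     bordered linear system (3.14) for the coefficients d_i.
    m = len(R_history)
    B = np.zeros((m + 1, m + 1))
    for i in range(m):
        for j in range(m):
            B[i, j] = np.trace(R_history[i].T @ S @ R_history[j] @ S)
    B[:m, m] = -1.0
    B[m, :m] = -1.0
    rhs = np.zeros(m + 1)
    rhs[m] = -1.0
    d = np.linalg.solve(B, rhs)[:m]

    # Next input kernel (3.16), used to rebuild the Hamiltonian in step 1.
    K_in_next = sum(d_i * K_i for d_i, K_i in zip(d, K_out_history))
    return K_in_next, K_out, rms
```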

Figure 3.1: Implementation of the DIIS algorithm in ONETEP.

3.3 Description of the code

The density matrix DIIS code has been developed from scratch by the author of this work and added to ONETEP as a new module called DM_mod.F90. The program is written in Fortran90 and uses Message-Passing routines to scale to large numbers of processors on parallel clusters. In order to avoid code replication, which could potentially lead to bug spreading, the code reuses many subroutines written by other authors of the ONETEP code. Some of the notable routines that are reused are the Hamiltonian builder, the MPI communication wrappers, and the sparse and dense algebra routines and wrappers. In total, more than 2000 lines of code have been written, of which nearly 25% is dedicated to 12 test routines. The driver routine dm_mix_density performs the inner-loop energy minimisation using DIIS. It is called from the outer loop that performs the SCF energy minimisation procedure (module ngwf_cg_mod.F90), which means that the DIIS routines are higher in the hierarchy than the algebra, communications and matrix-builder routines of the code. The code takes the initial guess of the density kernel and returns the last output density kernel (K_out), the last input density kernel (K_in) and the total energy of the system.

During the execution, the sequence of input and output density kernels, together with the sequence of Hamiltonian matrices and residues, is stored in arrays of a fixed size. This is memory-consuming, but it is nevertheless necessary in order to perform the kernel mixing according to the Pulay method. The Hamiltonians are stored in memory to allow future changes that perform Hamiltonian mixing instead of density kernel mixing. The number of previous output kernels that are taken into account for the mixing according to the linear combination (3.16) can be tuned by the user to obtain the maximum convergence rate. Generally speaking, this is system-dependent, but the last 15 kernels should be enough for most systems. This is numerically more stable, as the first set of K_out should be further away from the final solution than the latest kernels. Therefore, by neglecting the contribution of the first output kernels, the convergence of the calculation is enhanced. Moreover, using a small number of preceding output kernels compared to the system size is beneficial, since it simplifies the solution of the linear equations in (3.14). A serial solver for systems of linear equations such as the LAPACK routine DGETRS [42] should do a good job without a performance hit. The number of previous output kernels to be mixed is controlled by the parameter dm_history, whose default value is 15 and which can be changed by the user in the ONETEP input file.
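The fixed-size history described above behaves like a bounded queue: once dm_history entries have been stored, the oldest kernel and residue are discarded. The following Python fragment is only a behavioural sketch of that book-keeping (the real storage uses preallocated SPAM3 arrays in Fortran):

```python
from collections import deque

# Illustrative only: a bounded history of kernels/residues, mimicking the
# behaviour of the fixed-size arrays controlled by dm_history.
dm_history = 15
kernels_out = deque(maxlen=dm_history)   # oldest entries are discarded
residues = deque(maxlen=dm_history)      # automatically once the deque is full

def store_iteration(K_out, R):
    """Keep only the last dm_history output kernels and residues, so the
    B matrix in (3.14) never grows beyond dm_history x dm_history."""
    kernels_out.append(K_out)
    residues.append(R)
```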

To be consistent with the ONETEP parallelisation strategy, all the arrays containing matrices are of type SPAM3. This datatype takes advantage of the ONETEP routines for calculations with sparse matrices, which have also proven to be very efficient for dense matrices [35]. Before starting the DIIS loop, the subroutine dm_sparse_create allocates memory space for the arrays that will contain the matrices during the calculation, which are the Hamiltonian, the input and output density kernels and the residue at each iteration. After the calculation is finished, dm_sparse_destroy deallocates the memory at the end of the DIIS procedure. After the initialisations, the self-consistent energy minimisation procedure begins. The code will perform a maximum of dm_maxit kernel mixing iterations and stop if convergence has been achieved. The convergence threshold can be controlled by the input parameter dm_threshold, whose default value is 10^{-6}. The default value of dm_maxit is set to 25. During the DIIS loop, these are the main steps and routines:

• The Hamiltonian is built at each iteration by dm_build_ham, which calculates all the density-dependent matrices and adds all the components to the final Hamiltonian matrix. This routine requires FFTs in order to calculate the energy terms, which makes it one of the most expensive routines in the code, especially for small systems.

• The subroutine dm_diag_and_build_kernel takes the Hamiltonian matrix and performs the diagonalisation using the LAPACK or ScaLAPACK eigensolvers. Within this routine, the Hamiltonian and the kernel matrices, stored in SPAM3 format, are converted to DEM format, which is used by the eigensolver wrapper dense_eigensolve. After the diagonalisation has completed, the new output density kernel is built from the eigenvectors. For large systems, this routine is expected to take most of the time, as it scales with the third power of the number of atoms.

• After the new output kernel has been calculated, the constants of the calculation are monitored with the subroutine dm_constants_summary. This subroutine calculates the commutator of H and K for checking self-consistency, the commutator of K and K^2 for checking idempotency, and the number of electrons in the system.

• The subroutine dm_energy calculates the total energy of the system based on the new output kernel. This subroutine is expected to be expensive, as it requires FFTs for completion.

• In order to check for convergence, the residual of the calculation is computed by the subroutine dm_residue_inout according to equation (3.12). The routine dm_check_convergence calculates the RMS gradient of the residuum according to equation (3.13) and stops the loop if convergence has been achieved.

• If the energy minimisation has not yet converged, then memory is allocated for the calculation of the B matrix based on equation (3.15) and for the set of coefficients, dm, that will be used for the kernel mixing. The system of equations (3.14) is solved in dm_find_coeffs, which calls the LAPACK serial solver DGETRS.

• The routine dm_mix_kernels builds the next input density kernel from a linear combination of the output density kernels and the dm coefficients, storing the result in the NextKernIn SPAM3 array.

• Before moving to the next iteration, the code shifts the entries in the SPAM3 arrays if the number of iterations is higher than dm_history. This eliminates the oldest density kernels and saves memory space.

To activate the DIIS minimisation (instead of LNV), the user has to set the logical flag use_DIIS : T in the ONETEP input file. The flag DIIS_method selects between the different mixing procedures: P for Pulay and L for linear. In addition, the code includes a basic method of performing DIIS known as linear mixing [38], which simply builds the next input kernel by:

K_{in}^{(m+1)} = d_{in}\, K_{in}^{(m)} + d_{out}\, K_{out}^{(m)}.   (3.17)

This method is slower and less reliable than the Pulay mixing, and it is out of the scope of this dissertation. However, it can be used in small systems for testing and debugging purposes.

22 3.4 Eigensolvers

The essential part of the energy minimisation using DIIS is the Hamiltonian diagonalisation as a method of achieving self-consistency in the inner loop. As the code deals with large dense matrices, the diagonalisation process scales with the third power of the system size (O(N^3)-scaling), making it the most computationally expensive part of the calculation. Therefore, the performance of the eigensolver is crucial for the overall performance of the DIIS code, particularly when the systems to be simulated are very large. Mathematical libraries are of great importance in HPC codes, as they are able to achieve great performance in algebra operations with large matrices. In this context, LAPACK [7] and ScaLAPACK [6] have been chosen for solving the eigenvalue problem that the Schrödinger equation represents. In fact, solving (2.10) is equivalent to diagonalising the Kohn-Sham Hamiltonian which, for a dense matrix, scales with the cube of the matrix size. Since this is the most expensive part of the DIIS calculations, the efficiency of these libraries largely determines the overall performance of the code.

In the NGWF notation that is used in ONETEP, the Schrödinger equation to be solved is given by

H_{\alpha\beta} M^{\beta}_{i} = \epsilon_i S_{\alpha\beta} M^{\beta}_{i} ,    (3.18)

which can be treated as a general eigenvalue problem of the form Ax = \lambda Bx, where A = H, B = S, x = M and \lambda = \epsilon. It is worth noticing that the Hamiltonian H and the overlap matrix S are real and symmetric (hence Hermitian), so the eigenvalues are real. Therefore the required eigensolver must be able to solve real generalised symmetric-definite eigenproblems. The eigensolvers DSYGVX [43] (LAPACK) and PDSYGVX [44] (ScaLAPACK) are the candidates to solve such an eigenvalue problem. Both library routines are based on the same algorithm, which comprises the following steps (a minimal sketch of the equivalent dense solve is given after the list):

1. Cholesky factorisation [45] of the matrix B.
2. Transformation to reduce the generalised symmetric-definite eigenproblem to standard form (DSYGST or PDSYNGST routines [46]).
3. Solution of the standard eigenvalue problem (DSYEVX or PDSYEVX routines [47]).
4. Back transformation to obtain the eigenvectors of the original problem (DTRSM or PDTRSM routines [48]).
5. Scaling of the eigenvalues and eigenvectors if necessary.
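The same sequence of steps can be written compactly with NumPy/SciPy as a rough illustration; this is not the DSYGVX/PDSYGVX call path used by ONETEP, and the function below is purely a sketch of the reduction to standard form and the back transformation.

    # Sketch of the dense generalised eigenproblem H x = lambda S x, mirroring
    # the steps listed above. Illustration only, not the ONETEP implementation.
    import numpy as np
    from scipy.linalg import cholesky, eigh, solve_triangular

    def generalised_eigh(H, S):
        L = cholesky(S, lower=True)                  # step 1: S = L L^T
        M = solve_triangular(L, H, lower=True)       # step 2: form A = L^-1 H L^-T
        A = solve_triangular(L, M.T, lower=True).T
        eps, y = eigh(A)                             # step 3: standard problem
        x = solve_triangular(L.T, y, lower=False)    # step 4: back transformation
        return eps, x                                # eigenvalues and eigenvectors

    # scipy.linalg.eigh(H, S) performs the same reduction internally in one call.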

The efficiency of the eigensolver is seriously limited by the size of the matrices. For large systems, H and S can be too large to fit in memory, which can cause the program to crash or to produce inaccurate results. Whereas LAPACK is a serial routine intended

to run on a single core [49], ScaLAPACK is the equivalent parallel algebra library, which takes advantage of parallel computers [50]. ScaLAPACK distributes the matrices over the processors so that the required memory per node is minimised, which reduces the risk of the program running out of memory. In addition, ScaLAPACK routines require the CPUs to cooperate to perform the diagonalisation of the Hamiltonian. The aim of the ScaLAPACK routines is to improve the performance of the algebra operations and to scale to larger systems than the serial LAPACK. For these reasons, a parallel eigensolver like the one included in ScaLAPACK will, in theory, bring better performance and results to the DIIS calculations. These mathematical libraries have been chosen as they are widely tested and used by the computational science community, and a range of publications on their performance is available [51][52][53].

The ONETEP code contains a wrapper routine for the LAPACK and ScaLAPACK eigensolvers. The choice between LAPACK and ScaLAPACK is made at compilation time via conditional compilation flags (#ifdef). If no flag is given, LAPACK is the default solver, whereas if the flag -DSCALAPACK is passed at compilation time, the ScaLAPACK solver is enabled. The ONETEP routine dense_eigensolve converts the SPAM3 parallel matrix storage into the LAPACK or ScaLAPACK formats and converts the result back into SPAM3 format after the calculation. The wrapper routine feeds the arguments to the LAPACK or ScaLAPACK calls automatically, with the programmer only having to specify the A and B matrices, the vector of eigenvalues λ, the matrix of eigenvectors x and the type of problem that the routine has to solve.

3.4.1 LAPACK serial eigensolver, DSYGVX

As the routines included in LAPACK are intended to run on a single node, the size of the problems that are tractable with LAPACK is seriously limited by the memory on one node. It is nevertheless possible to link LAPACK routines from a parallel program (such as ONETEP), although each node then computes a separate problem which is independent of the others. This is not true parallelism, as the CPUs do not cooperate in the solution of a single problem. As a result, all-to-all communications and synchronisation are necessary to broadcast the solutions to the other nodes and to ensure that all the cores have the same values stored in memory. This represents an obvious performance bottleneck that gets worse for large atomic systems. The wrapper dense_eigensolve converts the block-distributed matrices in SPAM3 format to two-dimensional Fortran arrays. The array is then fed into the LAPACK eigensolver, which performs the diagonalisation. The matrix of eigenvectors, x, is converted back to SPAM3 format for efficiency.

3.4.2 ScaLAPACK parallel eigensolver, PDSYGVX

ScaLAPACK takes advantage of parallel computers by distributing the matrix elements over the nodes so that each of them owns a fraction of the entire system. Communications are then necessary in order to perform operations involving the distributed matrices; as a result, the overall performance of the ScaLAPACK routines is determined by the amount of computation per core and the efficiency of the communications between the nodes. The ScaLAPACK library interface is built on top of a hierarchy of libraries that are necessary for its correct installation. PBLAS [54] is a set of parallel algebra routines for distributed-memory machines that forms the basis of the ScaLAPACK parallel routines. ScaLAPACK also requires BLACS [55] for setting up the parallel communications, and the serial LAPACK and BLAS interfaces for algebra operations on each node. ScaLAPACK uses a 2D block-cyclic distribution of the matrices over the processors that minimises the data movement between the memory levels and optimises the communications between the nodes; a toy illustration of this mapping is given below.
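The sketch below shows the standard 2D block-cyclic ownership rule on a Pr × Pc process grid; the block size and grid shape are placeholder values chosen for illustration and are not the parameters used by ONETEP.

    # Sketch of the 2D block-cyclic mapping used by ScaLAPACK: the process that
    # owns global matrix element (i, j) on a Pr x Pc grid with block size nb.
    # Grid shape and block size are illustrative, not ONETEP defaults.
    def owner(i, j, nb=64, Pr=2, Pc=2, rsrc=0, csrc=0):
        prow = (i // nb + rsrc) % Pr      # process row owning global row i
        pcol = (j // nb + csrc) % Pc      # process column owning global column j
        return prow, pcol

    # Example: element (200, 513) of a matrix distributed with 64x64 blocks
    print(owner(200, 513))                # -> (1, 0) on a 2x2 process grid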

The ONETEP wrapper routine dense_eigensolve initialises all the inter-node communications (BLACS) and transforms the SPAM3 matrices to distributed matrices according to the ScaLAPACK standard. After the calculation has completed, the data is transformed back to SPAM3 for efficiency of the ONETEP code, and the communications and all the associated processes are terminated.

One of the properties of the eigenvectors is that they are orthogonal. For a parallel algebra eigensolver like PDSYGVX, keeping the eigenvectors orthogonal requires a large amount of all-to-all communication between the nodes. This is expected to take a substantial part of the total runtime, so the overall speed is affected by the quality of the interconnect and the message-passing implementation on the target machine. The precision to which the eigenvectors are kept orthogonal also affects the performance of the solver: as expected, the higher the precision, the slower the convergence of the eigenvectors. Moreover, convergence is limited by the amount of memory given to the ScaLAPACK workspace. Thus, these two parameters have to be tuned in order to get the maximum performance without affecting the accuracy of the calculation. The workspace is given to the ScaLAPACK eigensolver through the variable WORK, whereas the precision is controlled by ABSTOL, which sets the absolute tolerance to which the eigenvalues are computed, and ORFAC, which tells the library routine which eigenvectors need to be reorthogonalised. The ONETEP wrapper routine dense_eigensolve automatically allocates WORK so as to compute the eigenvalues and eigenvectors, and sets ABSTOL and ORFAC to their default values.

Chapter 4

Target machines architecture, compilation details and benchmarks

The ONETEP code with DIIS has been compiled on HECToR and Iridis 3 to compare the performance on each parallel cluster. The hardware and the compiler options can play a decisive role in performance, hence the first sections offer a description of the machine architectures and the compilation details on each cluster. After that, the last section presents the set of benchmarks that have been used for validation and performance tests of the DIIS-Hamiltonian diagonalisation method.

4.1 Target machines architecture

HECToR (Cray XT4, Phase 2A) [56] has a total of 3072 AMD 2.3 GHz quad-core Opteron processors and 8 GB of memory per node, giving a theoretical peak performance of 113 Tflops. The nodes have two levels of cache, L1 and L2, designed to enhance the data transfer between main memory and the CPU. The nodes are connected by a high-bandwidth network, a three-dimensional torus of SeaStar2 chips, that offers a peak bi-directional bandwidth of 7.6 GB/s.

Iridis 3 [57] comprises a total of 8064 cores in 1008 Intel Nehalem nodes, with two 4-core processors and 22 GB of memory per node, adding up to a theoretical peak performance of 72 TFlops. The nodes are connected via a high-bandwidth, low-latency Infiniband network. The Nehalem cores have three levels of cache, which allow data structures of different sizes to fit in cache; this is expected to produce performance peaks as the number of processors increases for a given system.

4.2 Compilation details

The ONETEP code with DIIS has been compiled on HECToR using the Portland Group (PGI) Fortran compiler with the -fastsse optimisation flag. For the message-passing library, HECToR provides an optimised implementation of the MPICH-2 library. The PGI compiler environment has to be loaded with the directive module load PrgEnv-pgi, and the wrapper ftn takes care of the compilation of Fortran code with MPI calls. As ONETEP needs efficient fast Fourier transforms at runtime, the code is linked against FFTW 3.3.1 by loading the appropriate module, module load fftw/3.1.1. Finally, the code links to LAPACK and, if the preprocessor flag -DSCALAPACK is enabled, to ScaLAPACK via the Cray LibSci mathematical library.

On Iridis 3 the ONETEP code has been compiled using the Intel Fortran compiler (ifort) with the OpenMPI and IntelMPI message-passing interfaces. FFTW 3.2.2 is linked for performing fast Fourier transforms, while the mathematical libraries, LAPACK and ScaLAPACK, are linked from the Intel MKL mathematical library interface. It has been found, however, that the OpenMPI implementation was causing ScaLAPACK to hang in all-to-all communications, which eventually crashed the application. The reason for this behaviour is complicated, but the gdb debugger reveals that the ScaLAPACK eigensolver hangs on all-to-all communications and the calculation gets killed before completion. These communications are required for the orthogonalisation of the eigenvectors of the Hamiltonian and should account for a large part of the total communication cost of the eigensolver routine. It seems that the OpenMPI implementation on Iridis 3 is not compatible with ScaLAPACK or with the Infiniband interconnect. This hypothesis is reinforced by the fact that the calculations with IntelMPI succeeded, which suggests that the problem lies not with the ScaLAPACK installation but with the OpenMPI interface. Therefore, IntelMPI was chosen for the performance tests during this work. The optimisation flags for ifort were -O2 to enable standard compiler optimisations, -vec-report0 to control the information reported by the vectoriser, -m64 to specify a 64-bit architecture, -msse4.1 to enable vectorisation and -heap-arrays to put automatic arrays on the heap (instead of the stack), which avoids runtime segmentation faults caused by large automatic arrays.

4.3 Benchmarks

The DIIS code has been tested on a range of systems of different characteristics. These benchmarks have been selected for two purposes. The first is to validate the correctness of the DIIS code in describing the physics of atomic systems by reproducing the results given by the standard LNV algorithm of ONETEP. The second is to analyse the performance of the code on HECToR and Iridis 3 for different numbers of processors. The important parameters characterising the benchmarks are:

1. Number of atoms in the molecule. The more atoms, the more expensive the calculation will be, as there are more bodies in the system, which introduce extra degrees of freedom.

2. Number of NGWFs in the system. The number of elements of the Hamiltonian matrix (and, in fact, of all the matrices in the NGWF representation, such as the density kernel) is given by N_NGWFs × N_NGWFs. Therefore, the more NGWFs, the bigger the matrix to be diagonalised. The effort to diagonalise the matrix scales with the cube of N_NGWFs (a memory estimate for such dense matrices is sketched after this list).

3. The kinetic energy cut-off. This parameter defines the computational grid used to represent the NGWFs in the psinc basis set (equation (2.24)). The higher the cut-off, the finer the grid.
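The quadratic growth of the matrix storage with N_NGWFs can be made concrete with a back-of-the-envelope estimate; the sizes below are taken from Tables 4.2 and 4.3, and the node memory figure from section 4.1.

    # Memory footprint of one dense N x N double-precision matrix for a few of
    # the benchmark sizes listed in Tables 4.2 and 4.3.
    for name, n in [("Si766H402", 7296), ("Si1186H366", 11040), ("p128_20", 36352)]:
        gib = n * n * 8 / 2**30              # 8 bytes per double-precision element
        print(f"{name}: {n} NGWFs -> {gib:.1f} GiB per dense matrix")
    # The largest protein benchmark needs ~9.8 GiB for a single dense matrix,
    # which already exceeds the 8 GB of memory available on one HECToR node.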

A set of small test cases has been selected so that the calculations take only a few iterations to converge. These systems lie in the range of sizes that can be simulated with conventional cubic-scaling DFT codes like CASTEP, which are capable of using (and can eventually reproduce the results with) their own DIIS implementation [58] [41]. Table 4.1 summarises these systems, represented in figure 4.1.

Test case      N_atoms   N_NGWFs   Energy cut-off (eV)
Water               3         6    700
Ethene              6        12    400
H_bond              7        19    600
Benzene            12        60    300
C69N               70       280    250
Tennis ball       172       496    800
Catechol 5Å       339       780    800
Catechol 7Å       662      1597    800

Table 4.1: Set of small test cases

The DIIS code has also been tested using two sets of larger systems. The first set is a group of five silicon nanorods of different sizes, which should show some of the typical properties of semiconductors, such as a moderate band gap. A smaller band gap means that the nearsightedness of the density matrix is less pronounced, and some linear-scaling approaches may fail during the simulation. However, recent work using this

Figure 4.1: Set of small benchmarks.

set of silicon benchmarks [59] demonstrated that it is possible to simulate this kind of semiconductor using ONETEP without loss of accuracy. A DIIS approach should be, in any case, more stable for this kind of material, as no truncation of the density kernel is needed. Table 4.2 summarises the set of silicon benchmarks, represented in figure 4.2.

Test case     N_atoms   N_NGWFs   Energy cut-off (eV)
Si29H36            65       297    650
Si242H140         382      2318    650
Si532H224         756      5012    600
Si766H402        1168      7296    600
Si1186H366       1552     11040    600

Table 4.2: Set of benchmarks corresponding to silicon nanorods of different size

Figure 4.2: Set of silicon nanorods benchmarks.

The last set of benchmarks is formed by a sequence of cuts of different size applied to a double-helix shaped Amyloid protein [60]. This biological system has a larger band gap

that should make a linear-scaling approach based on the locality of the density matrix more accurate. However, the protein is not a proper insulator, so the band gap is not particularly large. A cubic-scaling approach that diagonalises the Hamiltonian should be very stable and accurate for this kind of system. Table 4.3 and figure 4.3 summarise the computational parameters of this set of calculations.

Test case   N_atoms   N_NGWFs   Energy cut-off (eV)
p16_20         1712      4544   600
p32_20         3424      9088   600
p48_20         5136     13632   600
p64_20         6848     18176   600
p80_20         8560     22720   600
p96_20        10272     27264   600
p112_20       11984     31808   600
p128_20       13696     36352   600

Table 4.3: Set of benchmarks corresponding to the amyloid protein of increasing size

Figure 4.3: Set of benchmarks of cuts to the amyloid protein.

Chapter 5

Validation tests

In this chapter the results regarding the validation of the DIIS method for minimising the total energy are presented. The calculations perform several iterations until, eventually, convergence is achieved in the inner loop. The values of the energy obtained with the DIIS and standard LNV methods are compared.

The first section shows the convergence results obtained for the different sets of benchmarks and the ability of the DIIS method to simulate the physical properties of real molecules. The next section discusses some methods to resolve the discrepancies between the LNV and DIIS algorithms and to enhance the convergence of DIIS for large, metallic systems.

5.1 Results

The tests carried out with the set of small benchmarks show a good agreement between the LNV algorithm and DIIS. However, the DIIS calculations seem to become more unstable with increasing system size, which leads to large iteration counts in order to achieve convergence. In fact, the calculation for the largest of these systems (Catechol 7Å) diverged from the value given by LNV. This molecule is right at the limit of the range of systems that cubic-scaling codes can simulate, which makes it difficult to reproduce the results with other codes that incorporate DIIS, such as CASTEP. With the exception of Catechol 7Å, for which the energy diverges, the agreement of the total energy in the inner loop with the value given by standard LNV is very good for the rest of the small test cases. Figure 5.1 compares the convergence rate of the DIIS and LNV methods for the different systems, while table 5.1 compares the values of the energy obtained with both methods at the end of the calculation.

A further test has been performed in order to validate the DIIS method. The “Tennis ball” protein [61] is formed by two symmetrical monomers that fold to form a tennis ball-like structure, with each part interacting with the other via eight hydrogen bonds.

Figure 5.1: Convergence of the energy of the LNV and DIIS methods for the set of small systems.

By varying the distance between the two monomers, the energy of the hydrogen bonds changes. This generates a potential curve that should show a minimum at the optimal value of the distance. This plot has been generated using both the LNV and DIIS codes in the inner loop, and the results are shown in figure 5.2. As can be seen, the agreement is very good, with a difference of 10^-6 Ha in the depth of the well. This suggests that the DIIS algorithm, at least for systems of this size, produces correct results, as it is capable of reproducing the physical properties of the molecule under simulation up to chemical accuracy (1 kcal/mol). The fluctuating behaviour of both the LNV and DIIS algorithms in the potential well of the Tennis ball protein can be explained in terms of the basis set superposition error: as the plot refers only to the inner loop, the NGWFs have not been optimised. It can be shown that explicit optimisation of the NGWFs eliminates the BSSE [62], so

no further enhancements are necessary. However, the DIIS code needs to be adapted to the outer loop, where the NGWF optimisation takes place.

Figure 5.2: Potential energy well generated by the H-bonds between the monomers of the Tennis ball protein. The calculations with LNV and DIIS show very good agreement for this system.

Test case      Energy LNV (Ha)   Iterations LNV   Energy DIIS (Ha)   Iterations DIIS
Water             -15.8068920          13             -15.8068920         10
Ethene            -12.5701824           7             -12.5701824          8
H_bond            -38.8553433          35             -38.8553433         66
Benzene           -68.6356775          26             -68.6356775         27
C69N             -368.5969759          50*           -368.5969135        100*
Tennis ball      -796.4267427          25            -796.4267415         66
Catechol 5Å     -1151.8477085          50*          -1148.8190442        100*
Catechol 7Å     -2487.7889570          50*          10160.2862602         41*

Table 5.1: Energy obtained by LNV and DIIS methods for the set of small test cases. The asterisk * indicates that the calculation did not converge to the standard threshold.

The results for the silicon nanorods show that DIIS achieves convergence for the first two systems (Si29H36 and Si242H140) with a value of the energy close to LNV. However, the next system, Si532H224, with matrices of size 5012 × 5012, failed to converge. It shows a remarkable divergence that drives the energy to very high and positive values. This tendency can be observed in the remaining, larger silicon test cases. Table 5.2 compares the values of the energy obtained by LNV and DIIS, while figure 5.3 shows this behaviour in the convergence of the energy.

The smallest of the amyloid protein test cases, p16_20, diverges in the DIIS energy minimisation, whereas the LNV code works correctly. This system is approximately of the same size as the biggest silicon nanorod, therefore a divergent behaviour of the energy is expected for all the other amyloid calculations. Due to the large size of these molecules and the O(N^3)-scaling of the DIIS algorithm, it has been decided not to waste valuable computational time on the rest of the calculations in this set, as they will most likely diverge.

Figure 5.3: Convergence of the energy of the LNV and DIIS methods for the set of silicon nanorods.

Test case     Energy LNV (Ha)   Iterations LNV   Energy DIIS (Ha)   Iterations DIIS
Si29H36          -131.7835469         11             -131.7835469         14
Si242H140       -1011.9586231         11            -1011.9586143         42
Si532H224       -2219.2305021         13            53245.0016217         31*

Table 5.2: Energy obtained by LNV and DIIS methods for the benchmark set of silicon nanorods. The asterisk * indicates that the calculation did not converge to the standard threshold.

The plots for the set of small systems and the silicon nanorods show the same convergence pattern. At the first iteration, the total energy of the calculation is very similar to the value given by LNV, but after the second iteration the energy explodes and reaches a very high value. Then, after a few more DIIS iterations, the energy drops towards the value given by LNV and the calculation eventually converges. For larger systems such as Si532H224 or Catechol 7Å, the DIIS minimisation becomes unstable and diverges. In other cases, such as Catechol 5Å, energy peaks can be observed during the DIIS minimisation. In this calculation, the peaks in energy appear periodically as a function of the iteration number, which indicates that the DIIS algorithm implemented here may suffer from some kind of ill-conditioning that introduces this pattern into the convergence of the energy. The next section discusses some methods to improve the convergence of the DIIS algorithm for large systems, together with the possible causes of this misbehaviour.

5.2 Methods for solving the convergence problems of the DIIS algorithm for large systems

According to the results, for systems with a large number of atoms and NGWFs the DIIS algorithm diverges considerably, causing the total energy to increase to very high values. Although DIIS is considered a robust method for improving the convergence of SCF calculations, there are some technicalities that must be considered for its efficient implementation.

5.2.1 DIIS initialisation

First of all, it is worth noticing that at the first DIIS iteration there is only one output kernel in the sequence. This means that it is only possible to generate one residue vector, R^{(1)}, which, according to equation (3.14), will produce only one mixing coefficient, d_1. As the sum of coefficients is constrained by equation (3.6), it follows that at the first iteration d_1 = 1. As a result, the input kernel at iteration 2 is equal to the output kernel at iteration 1: K_{in}^{(2)} = K_{out}^{(1)}. Effectively, no kernel mixing is performed to improve the second SCF iteration, which means that the Hamiltonian at the second iteration is built from the sole contribution of K_{out}^{(1)}. This kind of procedure is known to be very unstable and divergent for many systems [38]. As can be seen in the plots in figure 5.1, the first iteration gives a reasonable value of the total energy, whereas at the second iteration the energy becomes unpredictable and divergent. The consequence of this behaviour is that, at the end of the second iteration, the elements of K_{out}^{(2)} are far from those that would make the energy converge, and, moreover, this effect propagates to the following DIIS iterations via the kernel mixing. This effect is more visible in large systems, as the K matrices are bigger: the previous density kernels, which according to equation (3.16) act as a basis set for the next input kernel, do not form a complete or near-complete basis for building a next density kernel whose elements are all close to those that minimise the total energy. As a consequence, the energy never converges for this kind of system.

There are a number of techniques to avoid this behaviour. The simplest one is to perform a number of SCF iterations before the density mixing starts. This provides a number of initial, well-behaved kernels that are included, together with the kernels produced by diagonalisation, in the kernel mixing. The problem, in this case, is to find an elegant method to produce the initial kernels without DIIS. Within the ONETEP framework, a number of LNV iterations could be performed before DIIS with no kernel preconditioning [63]. By doing this, it would be possible to accumulate kernels while keeping the occupation numbers free of constraints, so they can be fractional. The idempotency of the density matrix would then be imposed by hand during the execution of DIIS and the Hamiltonian diagonalisation.

5.2.2 Kerker preconditioning

Another problem that this implementation of DIIS can suffer from is charge sloshing [64]. The Kerker preconditioning method [65] is a reliable technique for avoiding this issue, which is more important in metallic systems and semiconductors. The charge sloshing effect adds low-frequency oscillations to the electronic density that can cause slow convergence or even divergence of the DIIS method. Kerker preconditioning is an adaptation of the Pulay method in which the electronic density mixing is done in the following way:

n_{in}^{(m+1)}(\vec{G}) = \sum_{i}^{m} d_i n_{out}^{(i)}(\vec{G}) + P(\vec{G}) \sum_{i}^{m} d_i R^{(i)}(\vec{G}) ,    (5.1)

where P(\vec{G}) is the Kerker preconditioning term, which depends on the frequency in reciprocal space, \vec{G}, and can be calculated as

P(\vec{G}) = a \frac{|\vec{G}|^2}{|\vec{G}|^2 + G_0^2} .    (5.2)

The Kerker preconditioning scheme requires fine-tuning of the parameters a and G_0, which are system dependent. This is normally done by trial and error, although some automated techniques for finding suitable coefficients can be found in the literature [66]. The major challenge, however, is to adapt the Kerker preconditioning scheme, which requires the density to be represented on a grid so that it can be transformed to reciprocal space by an FFT, to the DIIS code presented here (a reciprocal-space illustration is sketched below). The DIIS method as implemented in ONETEP mixes the density kernels directly in real space, with no need to build the quantum single-particle density matrix at each step. Thus, for computational efficiency, our method would require a version of Kerker preconditioning in real space. There are some examples in the literature that could be used to introduce Kerker preconditioning in ONETEP without using a reciprocal-space representation [67].
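To make equations (5.1)-(5.2) concrete, the sketch below applies the Kerker factor to a residual held on a cubic real-space grid via an FFT. The grid spacing and the values of a and G_0 are placeholder values chosen for illustration, not parameters of ONETEP or of any particular system.

    # Minimal reciprocal-space illustration of the Kerker preconditioner of
    # equation (5.2) applied to a residual on a cubic grid. The values of a, G0
    # and the grid spacing are placeholders, not ONETEP parameters.
    import numpy as np

    def kerker_precondition(residual, spacing=0.5, a=0.8, G0=1.5):
        n = residual.shape[0]
        g = 2.0 * np.pi * np.fft.fftfreq(n, d=spacing)     # reciprocal-space grid
        gx, gy, gz = np.meshgrid(g, g, g, indexing="ij")
        G2 = gx**2 + gy**2 + gz**2
        P = a * G2 / (G2 + G0**2)                          # equation (5.2)
        return np.real(np.fft.ifftn(P * np.fft.fftn(residual)))

    # The preconditioner damps the long-wavelength (small |G|) components of the
    # residual, which are the ones responsible for charge sloshing.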

5.2.3 Level shifting

Another method that can be used to improve the convergence is the level-shifting method [68][69]. Metallic systems are characterised by a zero band gap, which often causes convergence problems in DIIS. In some cases, the highest occupied orbital and the lowest unoccupied orbitals can exchange their positions during the diagonalisation, which makes the DIIS method diverge. In this approach, a constant value is added to the energies of the unoccupied orbitals (which become pseudocanonical orbitals). By doing this, the band gap between the occupied and unoccupied levels is increased artificially, which should improve the efficiency of the mixing method. The convergence and reliability, however, depend on the parameter that sets the artificial band gap, which is often system-dependent and must be tuned by the user (a sketch of such a shift is given below).
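The sketch below shows one common way of applying such a shift in a non-orthogonal basis; it assumes generalised eigenvectors normalised against the overlap matrix (C^T S C = I) and a hypothetical shift parameter sigma, and it is an illustration rather than the scheme used by any particular code.

    # Sketch of level shifting in a non-orthogonal basis: the energies of the
    # unoccupied (virtual) orbitals are raised by a constant sigma, widening the
    # gap artificially. C holds S-orthonormal generalised eigenvectors
    # (C^T S C = I), n_occ is the number of occupied orbitals and sigma is a
    # tunable parameter; all names here are hypothetical.
    import numpy as np

    def level_shift(H, S, C, n_occ, sigma=0.5):
        C_virt = C[:, n_occ:]                  # virtual-orbital coefficients
        P_virt = C_virt @ C_virt.T             # projector onto the virtual subspace
        return H + sigma * (S @ P_virt @ S)    # shifted Hamiltonian

    # Diagonalising the shifted H with the same S leaves the occupied eigenvalues
    # unchanged and raises the virtual ones by sigma, with the same eigenvectors.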

5.2.4 Other approaches

The methods mentioned above have been implemented successfully in other codes such as CASTEP [21] or CONQUEST [26] and are the most likely to succeed in achieving convergence of large systems using DIIS on ONETEP. Nevertheless, other methods can also be tested, such as:

1. A different choice of the metric of the system to avoid low frequencies [38].
2. A different way of defining the residual vector to increase the efficiency of the Pulay and Kerker DIIS [70].
3. The introduction of Thomas-Fermi density mixing to allow fractional occupancies of the Kohn-Sham orbitals [71][72]. This is necessary in order to simulate metals with ONETEP within the ensemble-DFT approach [5].

Chapter 6

Performance results and analysis

In this chapter the performance results of the DIIS-Hamiltonian diagonalisation method and of LNV are presented. In the first section, the LAPACK and ScaLAPACK eigensolvers are tested as methods for diagonalising the Hamiltonian matrix. The calculations have been submitted to the HECToR and Iridis 3 clusters for different numbers of nodes, in multiples of 4 cores. On HECToR, whole 4-core processors are used, whereas on Iridis 3 the calculations can be divided into full-node calculations (8 CPUs per node) and half-node calculations (4 CPUs per node and exclusive access to the node by the job). In the half-node calculations more memory and cache are dedicated to each CPU on Iridis 3, as the other 4 cores are not active during the execution.

The results are obtained for only one iteration of the inner loop of ONETEP. In the case of DIIS-Hamiltonian diagonalisation, this ensures that only one diagonalisation is carried out during the calculation, so the ScaLAPACK or LAPACK eigensolver is called only once. The following performance metrics have been measured and derived from the experiments: the runtime of the program, T(p), defined as the time for p processors to complete one iteration, and the parallel speed-up, S(p), defined as

S(p) = \frac{T(p_0)}{T(p)} ,    (6.1)

where p_0 is the reference number of CPUs, taken as the minimum number of processors on which the calculation completed successfully. The parallel efficiency, E(p), is defined as

E(p) = \frac{S(p)}{p} ,    (6.2)

and the Karp-Flatt metric [73], κ(p), is defined as

\kappa(p) = \frac{1/S(p) - 1/p}{1 - 1/p} ,    (6.3)

which gives a measure of the serial fraction of the program and can be associated with the potential parallelisation difficulties of the code. If κ(p) tends to a constant as p increases, it means that the code does not present further opportunities for parallelism, and κ can be associated with the serial fraction of the code. If, on the other hand, κ grows with the number of processors, this can be interpreted as overheads, such as communication or load-imbalance bottlenecks, degrading the performance of the code. A small helper computing these metrics is sketched below.
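The helpers below simply evaluate equations (6.1)-(6.3) from measured runtimes; the timing values in the example call are invented placeholders, not measured ONETEP timings.

    # Helpers for the metrics of equations (6.1)-(6.3). The runtimes in the
    # example call are invented placeholders, not measured ONETEP timings.
    def speedup(t_ref, t_p):
        return t_ref / t_p                               # S(p), eq. (6.1)

    def efficiency(t_ref, t_p, p):
        return speedup(t_ref, t_p) / p                   # E(p), eq. (6.2)

    def karp_flatt(t_ref, t_p, p):
        s = speedup(t_ref, t_p)
        return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)     # kappa(p), eq. (6.3)

    print(karp_flatt(t_ref=100.0, t_p=30.0, p=8))        # roughly 0.2 serial fraction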

The ONETEP code incorporates its own timing routine, timer_clock, which is a wrapper around the MPI library timer MPI_WTIME. As the code runs in parallel, each processor has its own timing value. For synchronisation purposes, the final runtime associated with a subroutine is the average time over all the processors involved in the calculation.

6.1 Comparison of LAPACK and ScaLAPACK eigensolvers

The performance of the routine dm_diag_and_build_kernel has been measured for different numbers of cores on HECToR and Iridis 3. This routine is in charge of diagonalising the Hamiltonian matrix and building the output density kernel from the eigenvectors. The diagonalisation of the Hamiltonian is expected to be the most computationally demanding step, so the performance of this routine has the greatest impact on the overall code. For this reason, the ScaLAPACK parallel eigensolver is tested against the equivalent LAPACK serial eigensolver. The sets of silicon nanorod and protein benchmarks described in section 4.3 have been run using DIIS and Hamiltonian diagonalisation with both the LAPACK and ScaLAPACK routines for different numbers of processors on HECToR and Iridis 3.

It has been found that the LAPACK eigensolver was unable to cope with the largest systems, as they require big matrices to be stored in the memory of a single node. These calculations crashed with an out-of-memory error, as the amount of available memory was less than the space required by the matrices. Hence the calculations using LAPACK are limited to those systems that fit in one node's memory. For distributed-memory programs such as ONETEP that run on multicore nodes, the maximum available memory is obtained when only one CPU is used on each node, which limits the performance of the code, as intra-node communications via the memory chip are normally much faster than inter-node communications via the interconnect. Table 6.1 summarises the successful calculations using LAPACK.

The second observation that has to be made about LAPACK is that, as it is a serial library, the speed-up of calculations using the LAPACK eigensolver will always stay below 1, so, as expected, there is no advantage in using LAPACK on a parallel cluster.

Test case     LAPACK (HECToR)   LAPACK (Iridis 3)   ScaLAPACK (HECToR)   ScaLAPACK (Iridis 3)
Si29H36            Yes               Yes               Yes (1 node)         Yes (1 node)
Si242H140          Yes               Yes               Yes (1 node)         Yes (1 node)
Si532H224          Yes               Yes               Yes (1 node)         Yes (1 node)
Si766H402          No                Yes               Yes (2 nodes)        Yes (1 node)
Si1186H366         No                No                Yes (3 nodes)        Yes (1 node)
p16_20             Yes               -                 Yes (7 nodes)        Yes (15 nodes)*
p32_20             No                -                 Yes (9 nodes)        Yes (20 nodes)*
p48_20             No                -                 Yes (10 nodes)       Yes (5 nodes)
p64_20             No                -                 Yes (10 nodes)       Yes (5 nodes)
p80_20             No                -                 Yes (11 nodes)       Yes (20 nodes)*
p96_20             No                -                 Yes (13 nodes)       Yes (20 nodes)*
p112_20            No                -                 Yes (17 nodes)       Yes (20 nodes)*
p128_20            No                -                 Yes (19 nodes)       Yes (20 nodes)*

Table 6.1: Successful calculations using LAPACK and ScaLAPACK eigensolvers on HECToR and Iridis 3. In all cases, 4 CPUs per node have been used. The asterisk * indicates that a smaller number of nodes has not been tested and it is possible for the calculation to fit on fewer nodes. To save computational time, the protein systems have not been simulated using LAPACK on Iridis 3.

LAPACK routines called within parallel codes require collective communications to synchronise the results on all the nodes, which causes a loss of performance when the number of CPUs increases. The results show that LAPACK is very inefficient compared to ScaLAPACK, as it does not take advantage of parallel computers to accelerate the calculation, which makes the parallel efficiency drop very quickly with the number of processors. As the performance of the diagonaliser drops, the overall parallel efficiency of the entire inner loop using DIIS-Hamiltonian diagonalisation suffers. The Karp-Flatt metric, which measures the fraction of serial computation in a parallel code, reaches values that are very high compared to those obtained with ScaLAPACK in the inner loop. As the diagonaliser scales with the third power of the system size, such an inefficient eigensolver is not sustainable for simulating large molecules of thousands of atoms. Figures 6.1 and 6.2 compare the parallel efficiency of both libraries' eigensolvers for the Si766H402 and p16_20 systems.

The results show that there are two advantages of using ScaLAPACK instead of LAPACK for simulating large systems. First of all, ScaLAPACK has proven able to simulate systems with very large matrix sizes. As ScaLAPACK distributes the matrix elements over the nodes, each of them holds a smaller portion of the total matrix. Furthermore, increasing the number of cores translates into more available memory for storing large matrices, which implies that, in theory, provided enough cores, ScaLAPACK can deal with any matrix size. However, as ScaLAPACK uses parallel computation, its performance is limited by the ratio between inter-node communication and computation on each node.


Figure 6.1: Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the eigensolvers for the Si766H402 silicon nanorod on Iridis 3.

These two aspects compete during a parallel calculation: the more matrix elements per node, the fewer communications are required and the more computation takes place on each CPU, and vice versa. Hence, for top parallel efficiency, the balance between the number of elements per node and the communications required to complete the operation is critical. The ScaLAPACK eigensolver is particularly communication-demanding, as it needs a global transfer of data to reorthogonalise the eigenvectors at each step. As a result, the speed-up scaling is not dramatically efficient for small systems, whereas for medium-to-large systems that require a large number of processors the parallel behaviour becomes more efficient.

The second advantage is that the ScaLAPACK eigensolver has been shown to be faster than its serial equivalent for medium-to-large systems (those in the range of LAPACK usability) for different numbers of processors. As can be seen in figures 6.1 and 6.2, the cooperative strategy of ScaLAPACK for solving the problem speeds up the calculations up to a certain number of processors. At that point, communications start to dominate and the parallel efficiency drops. The peak in the speed-up plot is system dependent, as the number of matrix elements per core changes. As a general rule, the larger the system, the more CPUs are needed to reach the top speed-up. For the Si766H402 system the diagonaliser is nearly 8 times faster on 27x4 than on 1x4 Iridis 3 cores, whereas for the p16_20 system the code speeds up by a factor of 1.4 when the number of HECToR CPUs is roughly doubled.

The Karp-Flatt metric tends to a constant (around 0.2) for the execution of the ScaLAPACK eigensolver on different numbers of CPUs in the case of the Si766H402 system. This means that the overhead introduced by communications or load imbalance is not heavily increasing with the processor count and the calculation does not offer further opportunities for parallelism.


Figure 6.2: Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the eigensolvers for the p16_20 protein benchmark on HECToR.

However, this parameter increases with the number of processors for the p16_20 system, which can be due to the overhead introduced by communications and the small data structures per node. This behaviour can be explained by the fact that the matrix size in the Si766H402 system (7296x7296 elements) is larger than that of the matrices involved in the p16_20 protein (4544x4544 elements). As a result, the matrix chunks allocated to each processor for the silicon nanorod are larger than for the protein, which makes a more efficient use of inter-node communications and provides better load balance.

6.2 Comparison of LNV and DIIS algorithms for the inner loop of ONETEP

As both the LNV and DIIS-Hamiltonian diagonalisation methods are intended to minimise the energy in the inner loop of ONETEP, it is necessary to compare the computational performance of the two algorithms on the set of selected benchmarks. The parallel implementation of the LNV algorithm is more efficient than the DIIS method for large systems. The standard ONETEP code has been carefully optimised by the developers over the years to achieve high performance on modern clusters. As can be seen in figures 6.3 and 6.4, which compare the performance of LNV and DIIS for the largest protein benchmark (p128_20), the serial fraction of the LNV method tends to a constant, which means that the parallel strategy is efficient, as it implicitly minimises the communications needed to perform the LNV calculation.


Figure 6.3: Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the inner loop of ONETEP for the p64_20 protein benchmark on HECToR.

The peaks of performance of both the LNV and the DIIS methods on HECToR and Iridis 3 can be attributed to network noise caused by many programs running at the same time on the cluster. As the ScaLAPACK eigensolver requires a substantial amount of communication during the runtime, it is more sensitive to fluctuations in the performance of the interconnect due to shared use of the network with other users. This can explain why the peaks in the DIIS method are sharper than in the LNV code. Another interesting effect regarding the ScaLAPACK and SPAM3 matrix distributions has to do with the different levels of cache in the target machines. The size of the matrix portion that is stored in memory on each node decreases with the number of CPUs in the calculation. As a result, beyond a given number of CPUs, the data structures fit into the L2 cache (HECToR) or L3 cache (Iridis 3), which enhances the memory access and hence the overall performance of the code. This can be observed in figure 6.2, where the speed-up is above the theoretical limit for the LNV method, which produces a parallel efficiency greater than one.

Figures 6.5 and 6.6 show that the computational effort required by the DIIS method scales with the cube of the number of atoms in the system. This is expected, as the diagonaliser takes most of the runtime, in a percentage that increases with the system size. On the other hand, the LNV algorithm, provided a suitable kernel cut-off, scales linearly with the number of atoms. The break-even point at which the linear-scaling method becomes faster than the cubic-scaling approach is system dependent and its exact position is not clear from these sets of benchmarks. Nevertheless, it seems that for the silicon nanorods LNV is faster than DIIS at any size, provided a sensible choice of the kernel cut-off. The case of Si1186H366 suffers from a small kernel cut-off, so the matrices are not sparse, and hence it is slower than the DIIS approach. This indicates that the break-even point is below 1500 atoms. On the other hand, extrapolating the scaling curve of the amyloid protein, it can be seen that the break-even point would be around 1000 atoms.


Figure 6.4: Timings, speed-up, parallel efficiency and serial fraction (Karp-Flatt metric) of the inner loop of ONETEP for the p64_20 protein benchmark on HECToR.

Figure 6.5: Comparison of the scaling of the LNV and DIIS algorithms with the system size for the set of silicon nanorods. The systems have been simulated using 20 nodes and 4 CPUs per node on Iridis 3.

The scaling of each of the subroutines needed for the DIIS approach can be seen in figure 6.7. This plot shows that the routines that calculate the Hamiltonian and the energy (dm_build_ham and dm_energy) are more demanding for small systems, but as they scale linearly with the system size (they have been reused from linear-scaling ONETEP), the diagonaliser dm_diag_and_build_kernel rapidly takes most of the runtime for large systems.

Figure 6.6: Comparison of the scaling of the LNV and DIIS algorithms with the system size for the set of amyloid benchmarks. The systems have been simulated using 20 nodes and 4 CPUs per node on HECToR and Iridis 3.

This emphasises the need for a high-performing diagonaliser in order to enhance the performance of the DIIS code and to make it suitable for simulating larger systems.

Figure 6.7: Scaling of the DIIS-Hamiltonian diagonalisation routines with the system size of the amyloid protein. The systems have been simulated using 20 nodes and 4 CPUs per node on Iridis 3.

The DIIS algorithm, once the discrepancies reported in chapter 5 have been fixed, should be a more reliable algorithm than LNV, as it is not based on approximations to truncate and purify the density matrix. On the other hand, DIIS is a more computationally demanding method that, as shown in figure 6.6, doubles the runtime of the LNV algorithm for systems of nearly 8000 atoms on 20 × 4 Iridis 3 cores. This performance, however, brings the opportunity to simulate large systems of a few thousand atoms using DIIS and Hamiltonian diagonalisation on a time scale and with computational resources comparable to LNV. Metallic systems with a zero band gap cannot be simulated

using LNV in a linear-scaling fashion, as the matrices involved in the calculation are dense. Traditional DFT approaches, which are not based on the locality of matter, can simulate metals of up to a few hundred atoms. The combination of DIIS and Hamiltonian diagonalisation presented in this work, together with the addition of ensemble-DFT to ONETEP to enable partial occupation numbers, could bring the opportunity to simulate large metallic compounds of thousands of atoms with a fast and accurate algorithm.

6.3 Future performance optimisations

The most computationally demanding part of the energy minimisation using DIIS is the diagonalisation of the Hamiltonian at each iteration of the inner loop. This step is well known to scale with the cube of the matrix size for dense systems, even in the best implementations. The results presented in figure 6.7 show that this is the case for this code, and hence the only possible way to improve the speed of the eigensolver is by reducing the prefactor of the cubic behaviour. ScaLAPACK takes advantage of parallelisation to reduce this factor and offers better performance than the LAPACK serial eigensolver. As a future development, the ARPACK [74] and PARPACK (parallel ARPACK) [75] libraries offer an interesting alternative to LAPACK and ScaLAPACK for the diagonalisation routines. These libraries are specialised in eigenvalue problems that involve dense matrices, as is the case for the DIIS method studied here, particularly for metallic systems, which are not affected by the locality of the density matrix.

Another point to take into account for performance is memory usage. At each iteration the code stores in memory the Hamiltonian, the input density kernel, the output density kernel and the residual matrix, which accumulate up to a predefined maximum of dm_history entries each. As memory usage represents a common bottleneck in HPC applications, reducing the number of matrices stored at each iteration to the minimum can potentially enhance the performance of the DIIS code. It is important to bear in mind that other types of DIIS can be introduced in the code (such as Hamiltonian mixing or a mix of input and output kernels or residuals), so the code must be designed to cope with the minimum storage requirements regardless of the chosen method (specified by the user).

In addition to this, the enhancements to improve the convergence of the method (section 5.2) should also enhance the performance of the overall code, as they will require fewer kernels to be mixed in order to achieve convergence. Computationally speaking, the code will need to store fewer matrices in memory, which will improve the memory access and the re-use of data from cache.

Chapter 7

Conclusions

First-principles methods for quantum mechanical calculations in chemistry and solid-state physics represent a developing field of science that requires a combined effort from researchers in multiple disciplines. Simulations of materials and molecular systems offer the opportunity to investigate the properties of new compounds that are of interest to academia and industry, with the potential of opening new horizons for technology. The intrinsic complexity of such calculations requires parallel computing techniques in order to obtain accurate, valuable results on a human time scale.

In this work the theory of quantum mechanical calculations from first-principles methods using Density Functional Theory has been presented. In this context, the Direct Inversion in the Iterative Subspace (DIIS) method, in combination with Hamiltonian diagonalisation for achieving self-consistency, has been implemented in the ONETEP code for DFT calculations. This new approach mixes the quantum single-particle density matrix in the inner loop of ONETEP in order to minimise the total energy with respect to the elements of the density kernel in the NGWF representation. Effectively, this approach, as has been shown, requires a computational effort that scales with the third power of the number of atoms in the system, due to the cost of diagonalising a dense matrix at each iteration. This algorithm, although it limits the range of applicability in comparison with conventional linear-scaling approaches, is a key step towards the implementation of ensemble-DFT in the ONETEP code for simulating metallic systems efficiently. In order to reduce the impact of the diagonalisation on the overall performance, the LAPACK and ScaLAPACK eigensolvers DSYGVX and PDSYGVX have been assessed as a method for speeding up the DIIS-Hamiltonian diagonalisation calculations.

The results show that the density matrix mixing scheme works for systems of up to ~600 atoms and matrix sizes of ~2000x2000 elements. For these test cases the DIIS method has proven to give values of the energy very close to those given by the standard LNV method of ONETEP. In fact, by using the DIIS method, it has been possible to reproduce the potential well generated by the two monomers of the “Tennis ball” protein to a precision of 10^-6 Ha, which is considered to be within the range of chemical accuracy. Therefore, the DIIS implementation presented in this work is able to simulate

the physics of the range of molecular systems that are within the scope of traditional O(N^3)-scaling DFT programs.

However, for systems of hundreds of atoms, convergence of the energy minimisation can be very slow and unstable, and for larger systems it diverges. This problem has been widely documented in the literature and can be overcome by applying different numerical stabilisation methods, described in this work, such as Kerker preconditioning or level-shifting approaches.

The computational implementation of DIIS combined with Hamiltonian diagonalisation is severely limited by the diagonaliser routine. The serial LAPACK library does not take advantage of distributed-memory machines for storing large matrices on multiple nodes. As a result, the size of the systems that can be simulated with LAPACK routines is limited by the available memory per node in the target machine. Hence large systems cannot be simulated on HECToR or Iridis 3 using the LAPACK serial eigensolver due to excessive memory requirements. On the other hand, ScaLAPACK routines distribute the matrix elements over all the available nodes, which means that the memory accessible for storing matrices can be expanded simply by adding more nodes to the calculation. As a consequence, very large atomic systems (tens of thousands of atoms), represented by very big matrices, can be simulated using the DIIS and Hamiltonian diagonalisation method with the ScaLAPACK eigensolver.

The performance of the code depends greatly on the performance of the diagonaliser. The ScaLAPACK eigensolver has been designed so that all the CPUs cooperate to solve a particular problem in order to decrease the runtime of the program. The results show that the performance of the parallel eigensolver is limited by the inter-node communications, the load balance and the size of the distributed matrix on each node. In particular, the eigensolver routine is very communication-demanding and is very much affected by the performance of the interconnect. However, on both the HECToR and Iridis 3 clusters, which have high-quality network interconnects, the speed-up and efficiency of the ScaLAPACK diagonaliser are similar. On the other hand, the results show that, as expected, the diagonaliser scales with the cube of the number of atoms in the system, while the other procedures involved in the energy minimisation during the inner loop scale in a linear fashion. This, in turn, means that for large molecules of thousands of atoms the diagonaliser takes most of the runtime of the DIIS method. The break-even point at which the linear-scaling LNV approach becomes faster than the DIIS-Hamiltonian method as presented in this work is around 1000 atoms, although this measure is system-dependent and can also vary with the number of processors used for the calculation. Further optimisations of the code will necessarily involve reducing the runtime of the diagonaliser, and will have the potential of moving the break-even point closer to thousands of atoms.

Linear-scaling codes are unable to perform calculations on metals due to their small band gap. Traditional DFT codes can simulate metallic systems of a few hundred atoms with a computational cost that scales with the cube of the system size. The results show that with the DIIS algorithm it is possible to simulate systems of up to a few

thousand atoms with a computational cost that doubles the LNV execution runtime. The addition of the ensemble-DFT approach to allow fractional occupation numbers, combined with the DIIS and Hamiltonian diagonalisation approach, will make it possible to simulate metallic systems in a computationally efficient way using ONETEP.

Bibliography

[1] Attila Szabo and Neil S. Ostlund. Modern Quantum Chemistry. Introduction to Advanced Electronic Structure Theory. Dover Publications Inc., 2nd edition, 1989. [2] C.-K. Skylaris, P.D. Haynes, A.A. Mostofi, and M.C. Payne. Introducing ONETEP: Linear-scaling density functional simulations on parallel computers. J. Chem. Phys., 122(8), 2005. [3] Klaus Capelle. A bird’s-eye view of density-functional theory. Brazilian Journal of Physics, 36(4A), 2006. [4] P Pulay. Convergence acceleration of iterative sequences - the case of scf iteration. Chem. Phys. Lett., 73(2), 1980. [5] N Marzari, D Vanderbilt, and MC Payne. Ensemble density-functional theory for ab initio of metals and finite-temperature insulators. Phys. Rev. Lett., 79(7), 1997. [6] Scalapack library web page, 2010. http://www.netlib.org/scalapack/. [7] Lapack library web page, 2010. http://www.netlib.org/lapack/. [8] XP Li, RW Nunes, and D Vanderbilt. Density-matrix electronic-structure method with linear system-size scaling. Phys. Rev. B, 47(16), 1993. [9] J. Barnes and P. Hut. A hierarchical o(n-log-n) force-calculation algorithm. Na- ture, 324(6096), 1986. [10] J.P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy. Load balancing and data locality in adaptive hierarchical n-body methods - barnes-hut, fast multipole, and radiosity. J. Parallel Distrib. Comput., 27(2), 1995. [11] Bernd G. Pfrommer, Michel Cote, Steven G. Louie, and Marvin L. Cohen. Relax- ation of crystals with the quasi-newton method. J. Comput. Phys., 131(1), 1997. [12] CM Goringe, DR Bowler, and E Hernandez. Tight-binding modelling of materials. Rep. Prog. Phys., 60(12), 1997. [13] P. Echenique and J. L. Alonso. A mathematical and computational review of Hartree-Fock SCF methods in quantum chemistry. Mol. Phys., 105(23-24), 2007.

50 [14] Nathan Argaman and Guy Makov. Density functional theory – an introduction. arXiv, 1998. [15] Ove Christiansen. Coupled cluster theory with emphasis on selected new devel- opments. Theor. Chem. Acc., 116(1-3), 2006. [16] WMC Foulkes, L Mitas, RJ Needs, and G Rajagopal. Quantum Monte Carlo simulations of solids. Rev. Mod. Phys., 73(1), 2001. [17] R Car and M Parrinello. Unified approach for molecular-dynamics and density- functional theory. Phys. Rev. Lett., 55(22), 1985. [18] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. Phys. Rev., 136(3B), 1964. [19] Sergio Filipe Sousa, Pedro Alexandrino Fernandes, and Maria Joao Ramos. Gen- eral performance of density functionals. J. Phys. Chem. A, 111(42), 2007. [20] W. Kohn and L. J. Sham. Self-consistent equations including exchange and corre- lation effects. Phys. Rev., 140(4A), 1965. [21] SJ Clark, MD Segall, CJ Pickard, PJ Hasnip, MJ Probert, K Refson, and MC Payne. First principles methods using CASTEP. Z. Kristallogr. Kristallgeom. Kristallphys., 220(5-6), 2005. [22] Guangyu Sun, J. Kurti, P. Rajczy, M. Kertesz, J. Hafner, and G. Kresse. Perfor- mance of the Vienna ab initio simulation package (VASP) in chemical applica- tions. THEOCHEM, 624, 2003. [23] R. McWeeny. Some recent advances in density matrix theory. Rev. Mod. Phys., 32(2), 1960. [24] E. Prodan and W. Kohn. Nearsightedness of electronic matter. Proc. Natl. Acad. Sci. USA, 102(33), 2005. [25] JM Soler, E Artacho, JD Gale, A Garcia, J Junquera, P Ordejon, and D Sanchez- Portal. The SIESTA method for ab initio order-N materials simulation. J. Phys. Condens. Matter, 14(11), 2002. [26] T. Miyazaki, D.R. Bowler, M.J. Gillan, T. Otsuka, and T. Ohno. Large-scale DFT calculations with the CONQUEST code. In AIP Conference Proceedings, volume 1148, pages 685–8, USA, 2009 2009. American Institute of Physics. Computa- tional Methods in Science and Engineering. Advances in Computational Science, 25-30 September 2008, Hersonissos, Crete, Greece. [27] C.-K. Skylaris, A.A. Mostofi, P.D. Haynes, O. Dieguez, and M.C. Payne. Nonorthogonal generalized Wannier function pseudopotential plane-wave method. Phys. Rev. B., 66(3), 2002. [28] MC Payne, MP Teter, DC Allan, TA Arias, and JD Joannopoulos. Iterative min- imization techniques for abinitio total-energy calculations: molecular-dynamics and conjugate gradients. Rev. Mod. Phys., 64(4), 1992.

51 [29] AA Mostofi, CK Skylaris, PD Haynes, and MC Payne. Total-energy calculations on a real space grid with localized functions and a plane-wave basis. Comput. Phys. Commun., 147(3), 2002. [30] PD Haynes, CK Skylaris, AA Mostofi, and MC Payne. Elimination of basis set superposition error in linear-scaling density-functional calculations with local or- bitals optimised in situ. Chem. Phys. Lett., 422(4-6), 2006. [31] P. Pulay. Ab initio calculation of force constants and equilibrium geometries in polyatomic molecules. I. Theory. Mol. Phys., 17(2), 1969. [32] P.D. Haynes, C.-K. Skylaris, A.A. Mostofi, and M.C. Payne. Density kernel opti- mization in the ONETEP code. J. Phys.: Condens. Matter, 20(29), 2008. [33] CK Gan, PD Haynes, and MC Payne. Preconditioned conjugate gradient method for the sparse generalized eigenvalue problem in electronic structure calculations. Comput. Phys. Commun., 134(1), 2001. [34] CK Skylaris, AA Mostofi, PD Haynes, CJ Pickard, and MC Payne. Accurate kinetic energy evaluation in electronic structure calculations with localized func- tions on real space grids. Comput. Phys. Commun., 140(3), 2001. [35] N. D. M. Hine, P. D. Haynes, A. A. Mostofi, C.-K. Skylaris, and M. C. Payne. Linear-scaling density-functional theory with tens of thousands of atoms: Expand- ing the scope and scale of calculations with ONETEP. Comput. Phys. Commun., 180(7), 2009. [36] CK Skylaris, PD Haynes, AA Mostofi, and MC Payne. Implementation of linear- scaling plane wave density functional theory on parallel computers. Phys. Status Solidi B-Basic Solid State Physics, 243(5), 2006. [37] Owe Axelsson. Iterative Solution Methods. Cambridge University Press, 2nd edition, 1996. [38] G. Kresse and J. Furthmüller. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B, 54(16), 1996. [39] A. Famulari, G. Calderoni, F. Moroni, M. Raimondi, and P. B. Karadakov. Ap- plication of the diis technique to improve the convergence properties of the scf-mi algorithm. J. Mol. Struct. THEOCHEM, 549(1-2), 2001. [40] D.R. Bowler, R. Choudhury, M.J. Gillan, and T. Miyazaki. Recent progress with large-scale ab initio calculations: the CONQUEST code. Phys. Status Solidi B- Basic Solid State Physics, 243(5), 2006. [41] DR Bowler and MJ Gillan. Recent progress in first principles O(N) methods. Mol. Simul., 25(3-4), 2000. [42] Lapack dgetrs. http://www.netlib.org/lapack/double/dgetrs.f. [43] Lapack dsygvx eigensolver routine. http://netlib.org/lapack/double/dsygvx.f.

[44] ScaLAPACK PDSYGVX eigensolver routine. http://www.netlib.org/scalapack/double/pdsygvx.f.
[45] Jaeyoung Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5(3), 1996.
[46] ScaLAPACK PDSYNGST routine. http://www.hemisphere-education.com/codes/scalapack_documentation/documentation/VisualSourceCodes/D0_pdsyngst.htm.
[47] ScaLAPACK PDSYEVX routine. http://www.netlib.org/scalapack/double/pdsyevx.f.
[48] ScaLAPACK PDTRSM routine. http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pessl33.prog.doc/am601_ltrsm.html.
[49] E. C. Anderson and J. Dongarra. Performance of LAPACK: a portable library of numerical linear algebra routines. Proc. IEEE, 81(8), 1993.
[50] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: a portable linear algebra library for distributed memory computers. Design issues and performance. In J. Dongarra, K. Madsen, and J. Wasniewski, editors, Applied Parallel Computing: Computation in Physics, Chemistry and Engineering Science. Second International Workshop, PARA '95, Proceedings, pages 95–106, Berlin, Germany, 1996. Springer-Verlag.
[51] Elena Breitmoser and Andrew G. Sunderland. A performance study of the PLAPACK and ScaLAPACK eigensolvers on HPCx for the standard problem. Technical Report from the HPCx Consortium, 2003. http://www.hpcx.ac.uk/research/hpc/HPCxTR0406.pdf.
[52] A. G. Sunderland and E. Y. Breitmoser. An overview of eigensolvers for HPCx. Technical Report from the HPCx Consortium, 2003. http://www.hpcx.ac.uk/research/publications/HPCxTR0312.pdf.
[53] Joachim Hein. Improved parallel performance of SIESTA for the HPCx Phase2 system. Technical Report from the HPCx Consortium, 2004. http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0410.pdf.
[54] PBLAS library web page. http://www.netlib.org/scalapack/pblas_qref.html.
[55] BLACS library web page. http://www.netlib.org/blacs/.
[56] HECToR hardware specifications. http://www.hector.ac.uk/support/documentation/userguide/hardware.php.
[57] Iridis 3 hardware specifications. http://www.southampton.ac.uk/isolutions/computing/hpc/iridis/index.html.
[58] M. D. Segall, P. J. D. Lindan, M. J. Probert, C. J. Pickard, P. J. Hasnip, S. J. Clark, and M. C. Payne. First-principles simulation: ideas, illustrations and the CASTEP code. J. Phys.: Condens. Matter, 14(11), 2002.

[59] N. Zonias, P. Lagoudakis, and C.-K. Skylaris. Large-scale first principles and tight-binding density functional theory calculations on hydrogen-passivated silicon nanorods. J. Phys.: Condens. Matter, 22(2), 2010.
[60] Joshua T. Berryman, Sheena E. Radford, and Sarah A. Harris. Thermodynamic description of polymorphism in Q- and N-rich peptide aggregates revealed by atomistic simulation. Biophys. J., 97(1), 2009.
[61] N. Branda, R. Wyler, and J. Rebek Jr. Encapsulation of methane and other small molecules in a self-assembling superstructure. Science, 263(5151), 1994.
[62] J. Phillip Bowen, Jennifer B. Sorensen, and Karl N. Kirschner. Calculating interaction energies using first principle theories: consideration of basis set superposition error and fragment relaxation. J. Chem. Educ., 84(7), 2007.
[63] P. D. Haynes and M. C. Payne. Corrected penalty-functional method for linear-scaling calculations within density-functional theory. Phys. Rev. B, 59(19), 1999.
[64] M. Kohyama. Ab initio calculations for SiC-Al interfaces: tests of electronic-minimization techniques. Modell. Simul. Mater. Sci. Eng., 4(4), 1996.
[65] G. P. Kerker. Efficient iteration scheme for self-consistent pseudopotential calculations. Phys. Rev. B, 23(6):3082–3084, 1981.
[66] A. Sawamura. An adaptive preconditioner in first-principles electronic-structure calculations. Transactions of the Japan Society for Industrial and Applied Mathematics, 18(4), 2008.
[67] Yoshinori Shiihara, Osamu Kuwazuru, and Nobuhiro Yoshikawa. Real-space Kerker method for self-consistent calculation using non-orthogonal basis functions. Modell. Simul. Mater. Sci. Eng., 16(3), 2008.
[68] H. Sellers. The C2-DIIS convergence acceleration algorithm. Int. J. Quantum Chem., 45(1), 1993.
[69] V. R. Saunders and I. H. Hillier. Level-shifting method for converging closed shell Hartree-Fock wave functions. Int. J. Quantum Chem., 7(4), 1973.
[70] I. V. Ionova and E. A. Carter. Error vector choice in direct inversion in the iterative subspace method. J. Comput. Chem., 17(16), 1996.
[71] A. D. Rabuck and G. E. Scuseria. Improving self-consistent field convergence by varying occupation numbers. J. Chem. Phys., 110(2), 1999.
[72] D. Raczkowski, A. Canning, and L. W. Wang. Thomas-Fermi charge mixing for obtaining self-consistency in density functional calculations. Phys. Rev. B, 64(12), 2001.
[73] A. H. Karp and H. P. Flatt. Measuring parallel processor performance. Commun. ACM, 33(5), 1990.
[74] ARPACK library web page. http://www.caam.rice.edu/software/ARPACK/.

[75] PARPACK library web page. http://www.caam.rice.edu/~kristyn/parpack_home.html.
