<<

Zurich, July 12th 2017

CP2K : GPW and GAPW

Marcella Iannuzzi

Department of Chemistry, University of Zurich

http://www.cp2k.org Basis Set Representation

KS matrix formulation when the wavefunction is expanded into a basis

System size {Nel, M}, P [MxM], C [MxN]

⇥i(r)= Ci(r) Variational n(r)= fiCiC⇥i(r)⇥(r)= P⇥(r)⇥(r) principle i ⇥ ⇥ Constrained minimisation P = PSP problem KS total energy

E[ ]=T [ ]+Eext[n]+EH[n]+EXC[n]+EII { i} { i} Matrix formulation of the KS equations

H xc K(C)C = T(C)+Vext(C)+E (C)+E (C)=SC

2 Self-consistency SCF Method Generate a starting density ⇒ ninit Input 3D Coordinates init Generate the KS potential ⇒ VKS of atomic nuclei

Solve the KS equations ⇒ , ψ � Initial Guess Molecular Orbitals Fock Matrix (1-electron vectors) Calculation 1 Calculate the new density ⇒ n

1 New KS potential ⇒ VKS Fock Matrix Diagonalization

New orbitals and energies ⇒ �1 , ψ Calculate 2 Properties Yes SCF No New density ⇒ n Converged? End …..

until self-consistency 3to required precision Classes of Basis Sets

Extended basis sets, PW : condensed matter

Localised basis sets centred at atomic positions, GTO

Idea of GPW: auxiliary basis set to represent the density

Mixed (GTO+PW) to take best of two worlds, GPW

Augmented basis set, GAPW: separated hard and soft density domains

4 GPW Ingredients linear scaling KS matrix computation for GTO

Gaussian basis sets (many terms analytic)

2 mx my mz ↵mr ⇥i(r)= Ci(r) ↵(r)= dm↵gm(r) gm(r)=x y z e m X

Pseudo potentials

Plane waves auxiliary basis for Coulomb integrals

Regular grids and FFT for the density

Sparse matrices (KS and P)

Efficient screening

G. Lippert et al, Molecular Physics, 92, 477, 1997 J. VandeVondele et al, Comp. Phys. Comm.,167 (2), 103, 2005 5 Basis Set

Localised, atom-position dependent GTO basis

µ(r)= dmµgm(r) m CP2K: Ab initio Simulations Towards Linear Scaling HF/Exact Exchange Summary Acknowledgment Expansion of the density using the density matrix

Sparse Matrices n(r)= Pµ µ(r) (r) µ Gaussian basis: • Operator matricesOperator are rather matrices sparse are sparse The sparsity of H and S

Sαβ=∫ϕα(r)ϕβ (r)dr Sµ⌫ = 'µ(r)'⌫ (r)dr H αβ=∫ ϕα (r)v(r)ϕβ (r)dr r ϕ (r) Z ϕα ( ) β The sparsity pattern of S and H depends on the basis and the spatial location of the atoms, but not on theH µchemical⌫ = properties'µ(r)V ( ofr) the'⌫ (r)dr system in GGAZ DFT. Sαβ

HIV-1 Protease-DMP323 complex in solution (3200 atoms) 6 • Orbital matrices are invariant under unitary transformation The overlap (integral of the product) rapidly decays with the spatial separation of the basis Chemical localization: Boys, Edminston-Rudenberg, etc. functions.

Mathematical localization Analytic Integrals Cartesian Gaussian

n n n (r R)2 g(r, n, , R)=(x R ) x (y R ) y (z R ) z e x y z

Differential relations @ @ @ n)=2⌘ n + 1i) ni n 1i) n)= n) @Ri | | | @Ri | @ri |

Obara-Saika recursion relations

(0 (r) 0 ) (a + 1i (r) b) a|O | b |O |

Obara and Saika JCP 84 (1986), 3963 7 Basis Set library

GTH_BASIS_SETS ; BASIS_MOLOPT ; EMSL_BASIS_SETS O 6-31Gx 6-31G* 4O SZV-MOLOPT-GTHO SZV-GTH SZV-MOLOPT-GTH-q6 1 0 10 6 1 2 0 1 7 1 1 5484.67170000 2 0 1 4 1 1 0.00183110 O 6-311++G3df3pd 6-311++G(3df,3pd) 12.015954705512 -0.060190841200 0.036543638800 825.23495000 8.3043855492 0.01395010 0.1510165999 -0.0995679273 9 5.108150287385 -0.129597923300 0.120927648700 188.04696000 2.4579484191 0.06844510 -0.0393195364 -0.3011422449 1 0 0 6 1 2.048398039874 0.118175889400 0.251093670300 52.96450000 0.23271430 8588.50000000 0.00189515 0.832381575582 0.7597373434 0.462964485000 -0.6971724029 0.352639910300 -0.4750857083 16.89757000 0.47019300 1297.23000000 0.01438590 0.352316246455 0.2136388632 0.450353782600 -0.3841133622 0.294708645200 -0.3798777957 5.79963530 0.35852090 299.29600000 0.07073200 #0.142977330880 0.092715833600 0.173039869300 1 0 1 3 1 1 87.37710000 0.24000100 O0.046760918300 DZVP-GTH -0.000255945800 0.009726110600 15.53961600 -0.11077750 0.07087430 25.67890000 0.59479700 # 23.59993360 -0.14802630 0.33975280 3.74004000 0.28080200 O DZVP-MOLOPT-GTH DZVP-MOLOPT-GTH-q6 21.01376180 0 1 4 2 2 1.13076700 0.72715860 1 0 1 3 1 1 1 1 0 1 1 8.3043855492 1 1 0.1510165999 0.0000000000 -0.0995679273 0.0000000000 42.11750000 0.11388900 0.03651140 2 0 2 7 2 2 1 0.27000580 2.4579484191 1.00000000 -0.0393195364 1.00000000 0.0000000000 -0.3011422449 0.0000000000 9.62837000 0.92081100 0.23715300 12.015954705512 -0.060190841200 0.065738617900 0.036543638800 -0.034210557400 0.014807054400 1 2 2 1 0.7597373434 1 -0.6971724029 0.0000000000 -0.4750857083 0.0000000000 2.85332000 -0.00327447 0.81970200 5.108150287385 -0.129597923300 0.110885902200 0.120927648700 -0.120619770900 0.068186159300 0.80000000 1.00000000 1 0 1 1 1 1 2.048398039874 0.2136388632 0.118175889400 -0.3841133622 -0.053732406400 1.0000000000 0.251093670300 -0.3798777957 -0.213719464600 1.0000000000 0.290576499200 # 0.90566100 1.00000000 1.00000000 0.832381575582 3 2 2 1 1 0.462964485000 -0.572670666200 0.352639910300 -0.473674858400 1.063344189500 O 6-31Gxx 6-31G** 1 0 1 1 1 1 0.352316246455 1.1850000000 0.450353782600 1.0000000000 0.186760006700 0.294708645200 0.484848376400 0.307656114200 4 0.25561100 1.00000000 1.00000000 #0.142977330880 0.092715833600 0.387201458600 0.173039869300 0.717465919700 0.318346834400 1 0 0 6 1 1 2 2 1 1 0.046760918300 -0.000255945800 0.003825849600 0.009726110600 0.032498979400 -0.005771736600 O5484.67170000 TZVP-GTH 0.00183110 5.16000000 1.00000000 # 825.23495000 2 0.01395010 1 2 2 1 1 O TZVP-MOLOPT-GTH TZVP-MOLOPT-GTH-q6 188.04696000 2 0 1 5 3 3 0.06844510 1.29200000 1.00000000 1 52.96450000 10.2674419938 0.23271430 0.0989598460 0.0000000000 0.0000000000 -0.0595856940 1 2 2 1 10.0000000000 0.0000000000 2 0 2 7 3 3 1 16.89757000 3.7480495696 0.47019300 0.1041178339 0.0000000000 0.0000000000 -0.1875649045 0.32250000 0.0000000000 1.00000000 0.0000000000 12.015954705512 -0.060190841200 0.065738617900 0.041006765400 0.036543638800 -0.034210557400 -0.000592640200 0.014807054400 5.79963530 0.35852090 1 3 3 1 1 5.108150287385 1.3308337704 -0.129597923300 -0.3808255700 0.110885902200 0.0000000000 0.080644802300 0.0000000000 0.120927648700 -0.3700707718 -0.120619770900 0.0000000000 0.009852349400 0.0000000000 0.068186159300 1 0 1 3 1 1 1.40000000 1.00000000 2.048398039874 0.4556802254 0.118175889400 -0.6232449802 -0.053732406400 1.0000000000 -0.067639801700 0.0000000000 0.251093670300 -0.4204922615 -0.213719464600 1.0000000000 0.001286509800 0.0000000000 0.290576499200 15.53961600 -0.11077750 0.07087430 1 0 1 1 1 1 0.832381575582 0.1462920596 0.462964485000 -0.1677863491 -0.572670666200 0.0000000000 -0.435078312800 1.0000000000 0.352639910300 -0.2313901687 -0.473674858400 0.0000000000 -0.021872639500 1.0000000000 1.063344189500 3.59993360 -0.14802630 0.33975280 0.08450000 1.00000000 1.00000000 0.352316246455 31.01376180 2 2 1 1 0.450353782600 1.13076700 0.186760006700 0.72715860 0.722792798300 0.294708645200 0.484848376400 0.530504764700 0.307656114200 1 0 0.142977330880 1 1 1.1850000000 1 1 0.092715833600 1.0000000000 0.387201458600 -0.521378340700 0.173039869300 0.717465919700 -0.436184043700 0.318346834400 0.046760918300 0.27000580 -0.000255945800 1.00000000 0.003825849600 1.00000000 0.175643142900 0.009726110600 0.032498979400 0.073329259500 -0.005771736600 1 2 2 1 1 0.80000000 1.00000000 8 GTO in CP2K

The repository contains several GTO libraries /data/ ALL_BASIS_SETS BASIS_RI_cc-TZ GTH_POTENTIALS dftd3.dat ALL_POTENTIALS BASIS_SET HFX_BASIS nm12_parameters.xml BASIS_ADMM BASIS_ZIJLSTRA HF_POTENTIALS rVV10_kernel_table.dat BASIS_ADMM_MOLOPT DFTB MM_POTENTIAL t_c_g.dat BASIS_LRIGPW_AUXMOLOPT ECP_POTENTIALS NLCC_POTENTIALS t_sh_p_s_c.dat BASIS_MOLOPT EMSL_BASIS_SETS POTENTIAL vdW_kernel_table.dat BASIS_MOLOPT_UCL GTH_BASIS_SETS README

Tools for the optimisation of GTO basis sets are available in cp2k, based on atomic and molecular electronic structure calculations

9 Pseudopotentials Outline Recap of Previous lecture All electrons vs pseudopotentials The Hartree-Fock-Kohn-Sham method Core electronsClasses of Basis-set are eliminated ZV=Z-Zcore The exchange and correlation hole Condensed phase: Bloch’s th and PBC Solving the electronic problem in practice Atomic 1s : exp{-Z r}

Smooth nodeless pseudo-wfn close to nuclei

Bare Coulomb replaced by screened Coulomb

Inclusion of relativistic effects

Transferable

Angular dependent potentials:

Pt p peaked at 3.9Å s peaked at 2.4Å

10 d peaked at 1.3Å

Marialore Sulpizi Density Functional Theory: from theory to Applications GTH Pseudopotentials

Norm-conserving, separable, dual-space

Local PP : short-range and long-range terms

4 (2i 2) 2 PP PP PP PPr Zion PP V (r)= C (2) r e( ) erf r loc i r i=1 ⇧ ⇤⌃analytically⌅ part of ES ⇥

Non-Local PP with Gaussian type projectors

Accurate and PP lm l lm V (r, r0)= r p h p r0 nl h | i i ijh j | i Transferable ij Xlm X Scalar 2 1 r lm l lm (l+2i 2) relativistic r p = N Y (ˆr) r e 2 rl | i i “ ” ⇥ Few parameters Goedeker, Teter, Hutter, PRB 54 (1996), 1703; Hartwigsen, Goedeker, Hutter, PRB 58 (1998) 3641

11 Electrostatic Energy

Periodic system PP n˜⇤(G)˜n(G) 1 ZAZB EES = Vloc (r)n(r)dr +2⇡⌦ 2 + G 2 RA RB G A=B Z X X6 | | total charge distribution ntot(r)=n(r)+ nA(r) including n(r) and Z A

r R A ZA rc A ZA r RA n (r)= 3/2 e A V (r)= erf | | A c 3 „ « core r R rc (rA) | A| A ⇥

c PP rA = 2 rlocA cancels the long range term of local PP

H E [ntot] long range SR 1 ntot(r)ntot(r) EES = Vloc (r)n(r)+ drdr 2 r r smooth ⌅ ⌅⌅ | | 1 Z Z R R 1 Z2 + A B erfc | A B A c c c 2 RA RB 2 2 ⇥ r A=B (rA) +(rB) ⇥ A 2 A ⇤⇥ | | ⇤ ⇧ Eov short range, pair Eself 12 Auxiliary Basis Set

Long range term : Non-local Hartree potential

H 1 ntot(r)ntot(r) E [n ]= drdr tot 2 r r | |

Orthogonal, unbiased, naturally periodic PW basis Efficient Mapping 1 iG r n˜(r)= n˜(G) e · FFT G Grid spacing [Å] 0.15 0.13 0.11 0.10 0.09 0.08

-1 10 H2O, GTH, TZV2P -2 10 -3 Electrostatic Linear scaling solution of the Poisson10 equation -4 10 Energy n˜ (G)˜n -5 (G) H tot Error [a.u.] 10tot E [ntot]=2 -6 G210 G -7 10

100 200 300 400 500 Plane wave cutoff [Ry] 13 Fig. 1. Shown is the rapid convergence of the absolute error in the electrostatic

energy Eq. 11 with respect to plane wave cuto at fixed density matrix. The system

is a single water molecule described with fairly hard GTH pseudo potentials and a ˚ 2 TZV2P basis in a 10A cubic cell. The relation Ecuto = 2h2 is used throughout this work to convert the grid spacing h to the corresponding plane wave cuto.

infinite. All terms of the electrostatic energy are therefore treated simultane-

ously

ES PP n˜(G) n˜(G) 1 ZI ZJ E = Vloc (r)n(r)dr + 2 2 + (7) ⇥ G G 2 I=J RI RJ ⇥ | |

using the Ewald sum method [42] as it is commonly implemented in plane

wave electronic structure codes [6]. The long range part of all electrostatic

interactions is treated in Fourier space, whereas the short range part is treated

in real space. This separation is conveniently achieved for the ionic cores if a

I Gaussian charge distribution (nc(r)) for each nucleus is introduced and defined

9 Real Space Integration

Finite cutoff and simulation box define a real space grid

Screening Density collocation Real Space Grid Truncation n(r)= Pµ µ(r) (r) Pµ ¯µ (R) = n(R) Finite cuto and computational box define a real space grid R µ µ { }

nˆ(G) nˆ(G) V (G)= V (R) H G2 H

13 Numerical approximation of the gradient n(R) → ∇n(R)

∂ϵxc ϵ v n r → V R R XC and derivatives evaluated on the grid XC[ ]( ) XC( )= ∂n ( )

µν r ′ Real space integration HHXC = ⟨µ|VHXC( )|ν⟩→! VHXC(R)ϕµν (R) R

G. Lippert et al, Molecular Physics, 92, 477, 1997 J. VandeVondele et al, Comp. Phys. Comm.,167 (2), 103, 2005 14 Multiple Grids Integration 1 i Ecut Ecut = (i 1) ,i=1..N Exponent = 1 For the integartion of the exponent of Gaussian product selects the grid a Gaussian function with exponent 1 an ac- number of grid points is exponent-independent 10 curacy of 10 re- Accuracy quires an integration 2 Multiple Grids ⇥p =1/2p => Relative Cutoff range of 10 bohr, a ~30 Ry cuto of 25 Rydberg, Multiple Grids resulting in 22 integra- tion points. Number of pairs 70000 5000 integration points/integral batch ⇥ 15 50000

30000

10000 f c nj = Ij(ni ) 0 2 4 6 8 Exponent 15 16

16 Analysis of Multigrid

Bulk Si, 8 atoms, a=5.43Å, Ecut =100 Ry, Erel =60 Ry

------MULTIGRID INFO ------count for grid 1: 2720 cutoff [a.u.] 50.00 count for grid 2: 5000 cutoff [a.u.] 16.67 count for grid 3: 2760 cutoff [a.u.] 5.56 count for grid 4: 16 cutoff [a.u.] 1.85 total gridlevel count : 10496

Changing Ecut from 50 to 500 Ry

# REL_CUTOFF = 60 # Cutoff (Ry) | Total Energy (Ha) | NG on grid 1 | NG on grid 2 | NG on grid 3 | NG on grid 4 50.00 -32.3795329864 5048 5432 16 0 100.00 -32.3804557631 2720 5000 2760 16 150.00 -32.3804554850 2032 3016 5432 16 200.00 -32.3804554982 1880 2472 3384 2760 250.00 -32.3804554859 264 4088 3384 2760 300.00 -32.3804554843 264 2456 5000 2776 350.00 -32.3804554846 56 1976 5688 2776 400.00 -32.3804554851 56 1976 3016 5448 450.00 -32.3804554851 0 2032 3016 5448 500.00 -32.3804554850 0 2032 3016 5448

16 GPW Functional

1 Eel[n]= P ⇥ 2 + V SR + V ⇥ µ µ 2⇥ loc nl µ ⇥ ⇤ ⌃ n˜ (G)˜n (G) +2 tot tot + n˜(R )V XC(R) G2 ⌃G ⌃R 1 2 ext HXC = P ⇥ + V ⇥ + V (R)⇥⇥ (R) µ µ 2⇥ µ µ µ ⌅⇥ ⇤ R ⇧ ⌃ ⌃

Linear scaling KS matrix construction

17 CP2K DFT input

&FORCE_EVAL METHOD Quickstep &XC_GRID XC_DERIV SPLINE2_smooth &DFT XC_SMOOTH_RHO NN10 BASIS_SET_FILE_NAME GTH_BASIS_SETS &END XC_GRID POTENTIAL_FILE_NAME GTH_POTENTIALS &END XC LSD F &END DFT MULTIPLICITY 1 CHARGE 0 &SUBSYS &MGRID &CELL CUTOFF 300 PERIODIC XYZ REL_CUTOFF 50 ABC 8. 8. 8. &END MGRID &END CELL &QS &COORD EPS_DEFAULT 1.0E-10 O 0.000000 0.000000 -0.065587 &END QS H 0.000000 -0.757136 0.520545 &SCF H 0.000000 0.757136 0.520545 MAX_SCF 50 &END COORD EPS_SCF 2.00E-06 &KIND H SCF_GUESS ATOMIC BASIS_SET DZVP-GTH-PBE &END SCF POTENTIAL GTH-PBE-q1 &XC &END KIND &XC_FUNCTIONAL &KIND O &PBE BASIS_SET DZVP-GTH-PBE &END PBE POTENTIAL GTH-PBE-q6 &END XC_FUNCTIONAL &END KIND &END SUBSYS &END FORCE_EVAL

18 Hard and Soft Densities

Formaldehyde

Pseudopotential ➯ frozen core Augmented PW ➯ separate regions (matching at edges) LAPW, LMTO (OK Andersen, PRB 12, 3060 (1975)

Dual representation ➯ localized orbitals and PW PAW (PE Bloechl, PRB, 50, 17953 (1994))

19 Partitioning of the Density

n = n˜ + ! nA − ! n˜A I A A A

n(r) − n˜(r) =0 ⎫ ⎬r ∈ I nA(r) − n˜A(r) =0 A ⎭ A A n(r) − nA(r) =0⎫ r ∈ A n r − n r ⎬ ˜( ) ˜A( ) =0⎭ r A A r → G iG·R nA( ) = ! Pµν χµ χν n˜( ) = ! Pµν ϕ˜µϕ˜ν ! nˆ( )e µν µν G Gaussian Augmented Plane Waves

20 Local Densities

r A A nA( ) = ! Pµν χµ χν μ µν A

Χµ projection of φµ in ΩA through atom-dependent d’ ν μ ν ′A r overlap in A χµ = ! dµα gα( ) α projector basis (same size) = α = ′A {pα} λα k λmin ⟨pα|ϕµ⟩ ! dµβ⟨pα|gβ⟩ β

r ′A ′A r r ′A r r nA( )= Pµν dµαdνβ gα( )gβ( )= Pαβ gα( )gβ( ) αβ " µν # αβ ! ! !

21 Density Dependent Terms: XC

Semi-local functionals like local density approximation, generalised gradient approximation or meta-functionals

Gradient: ∇n(r)=∇n˜(r) + ! ∇nA(r) − ! ∇n˜A(r) A A

A A E[n]= Vloc(r)n(r)= V˜loc(r) + Vloc(r) − V˜loc(r) " A A $ ! ! # #

× n˜(r) + nA(r) − n˜A(r) dr A " A A $ # #

A A = V˜loc(r)˜n(r) + Vloc(r)nA(r) − V˜loc(r)˜nA(r) " A A $ ! # #

22 Density Dependent Terms: ES

Non local Coulomb operator

0 0 L L Compensation n (r) = nA(r) = QA gA(r) A A " L # charge A ! ! !

Same multipole expansion as the local densities

L Z l 2 Q = nA(r) − n˜A(r)+n (r) r Ylm(θφ)r dr sin(θ)dθdφ A ! A " #

0 Z 0 V [n˜ + n ]+! V [nA + nA] − ! V [n˜A + nA] A A

Interstitial region Atomic region

23 GAPW Functionals

Exc[n]=Exc[n˜]+! Exc[nA] − ! Exc[n˜A] A A Z 0 EH [n + n ]=EH [n˜ + n ]+ Z − n0 ! EH [nA + nA] ! EH [n˜A + ] A A

on global grids Analytic integrals via collocation + FFT Local Spherical Grids

Lippert et al., Theor. Chem. Acc. 103, 124 (1999); Iannuzzi, Chassaing, Hutter, Chimia (2005); Krack et al, PCCP, 2, 2105 (2000) VandeVondele , Iannuzzi, Hutter, CSCM2005 proceedings

24 GAPW Input

&DFT &SUBSYS … …

&QS &KIND O EXTRAPOLATION ASPC BASIS_SET DZVP-MOLOPT-GTH-q6 EXTRAPOLATION_ORDER 4 POTENTIAL GTH-BLYP-q6 EPS_DEFAULT 1.0E-12 LEBEDEV_GRID 80 METHOD GAPW RADIAL_GRID 200 EPS_DEFAULT 1.0E-12 &END KIND QUADRATURE GC_LOG &KIND O1 EPSFIT 1.E-4 ELEMENT O EPSISO 1.0E-12 # BASIS_SET 6-311++G2d2p EPSRHO0 1.E-8 BASIS_SET 6-311G** LMAXN0 4 POTENTIAL ALL LMAXN1 6 LEBEDEV_GRID 80 ALPHA0_H 10 RADIAL_GRID 200 &END QS &END KIND

&END DFT &END SUBSYS

25 Energy Functional Minimisation

T C = arg min E(C):C SC =1 C Standard: Diagonalisation + mixing (DIIS, Pulay, J.⇥ Comput. Chem. 3, 556,(1982); iterative diag. Kresse G. et al, PRB, 54(16), 11169, (1996) )

Direct optimisation: Orbital rotations (maximally localised Wannier functions) Example: DNA Crystal

Linear scaling methods: Efficiency2388 atoms, depends 3960 on orbitals, sparsity 38688 of P ( BSFS. (TZV(2d,2p)) Goedecker, Rev. Mod. Phys. 71, 1085,(1999)) density matrix, overlap matrix

c⇥Egap r r P(r, r⇥) e | | P 1 1 P = S S (r)P(r, r ) (r⇥)drdr⇥ µ µp q p q pq ⇥⇥ S

26 28 Traditional Diagonalisation

Eigensolver from standard parallel program library: SCALAPACK KC = SC

Transformation into a standard eigenvalues problem

T Cholesky decomposition S = U U C0 = UC

T T 1 1 KC = U UC (U ) KU C⇥ = C⇥ Diagonalisation of K’ and back transformation⇥ of MO coefficients (occupied only (20%))

error matrix DIIS for SCF convergence acceleration: few iterations e = KPS SPK

scaling (O(M3)) and stability problems

27 Orbital Transform

• Seeks to find the minimum of the Orbitalenergy functional Transformation with respect to Method the MO coefficients, with the Direction of steepest decent constraint that MO are normalised. Auxiliary X, linearly constrained variables to parametrison E surfacee the occupiedcorrection subspace tangent to Optimisation problem on a M- manifold at n Orbital• Transform Cn+1 not linear orthonormalitydimensional spherical constraint surface. Linear constraintC geodesic T n • CPerformSC =a variableI transformation, XSC0 =0 • Seeks to find the minimum of the from MO coefficients C to a set of Constraint energy functional with respect to manifold auxiliary variables X such that the for CTSC = 1 the MO coefficients, with the Direction of steepest decent optimisation of E is now on a M-1 constraint that MO are normalised. Direction of on E surface dimensionalcorrection linear space w.r.t. X steepest decent tangent to on E surface Optimisation problem on a M- manifold at n tangent to Orbital Transform • Cn+1 1 C(X)=C0 cos(U)+XU sin(U) manifold at n dimensional spherical surface. Cn geodesic Xn+1 Xn T 1 Perform a variable transformation, U =(X SX) 2Seeks to find the minimum of the • • Constraint Constraint manifold from MO coefficients C to a set of energymanifold functional with respect to for XTSC =0 auxiliary variables X such that the • With constraintthe forMO (fixes CT coefficients,SC the= 1 direction with the 0 M dimensional of the plane): M-1 dimensional optimisation of E is now on a M-1 Direction of constraint that MO are normalised. dimensional linear space w.r.t. X steepest decent T X SC =0 C0 on E surface 0 • Optimisation problem on a M- 1 tangent to dimensional spherical surface. X, C XT SC =0 C(X)=C0 cos(U)+XU sin(U) manifold at n 1 C 0 0 C(X)=C0 cos(XUn+1)+XU sin(U) h i⌘ Xn Perform a variable transformation, T 1 cos(✓) ˆ U =(X SX) 2 • T 1/2 C = = C0 cos(✓)+X sin(✓) U = X fromSX ConstraintMO coefficients C to a set of ✓ sin(✓) manifold  T X X X auxiliaryfor X SCvariables=0 X such that the ✓ = k k = X Xˆ = With constraint (fixes the direction 0 C X • matrix functionals by Taylor k k k k k k optimisation of E is now on a M-1 1 1 of the plane): T 2 T X = X, X 2 = X SX expansionsdimensional in X SX linear space w.r.t. X k k h i T X SC0 =0 28 1 C(X)=C0 cos(U)+XU sin(U)

T 1 U =(X SX) 2 • With constraint (fixes the direction of the plane):

T X SC0 =0 Preconditioned OT

minimisation in the auxiliary tangent space, idempotency verified

E(C(X)) + Tr(X†SC0) E C HC and SX = + SC0 X C X dominated CG(LS) or DIIS O(MN) Preconditioned gradients P(H S)X X 0 X ⇥PX ⇥ X = X P E n+1 n nr n

1 T ideal preconditioner P =(H S" ) " = C HC n n n n n

Full All Full Single Inverse Full Kinetic Full S Inverse Full Single

VandeVondele et al, JCP, 118, 4365 (2003) 29 OT Performance

Use Inner and Outer loop

Guaranteed convergence with CG + line search

Various choices of preconditioners

Limited number of SCF iterations

KS diagonalisation avoided

Sparsity of S and H can be exploited

Based on matrix-matrix and matrix-vector products

Scaling O(N2M) in cpu and O(NM) in memory

Optimal for large system, high quality basis set

30 OTBut Performance OT is hard to beat !

Refined preconditioner,Refined preconditioner, most effective most effectiveduring MD during of largeMD of systems large systems with well with well conditioned basis sets conditioned basis sets

onImproved Daint (XC30) Single Inverse 3844Preconditioner nodes (8 Preconditionercores + 1 GPU) Solver based on an inverse update.

Schiffmann, VandeVondele, JCP 142 244117 (2015) Schiffmann, VandeVondele, JCP 142 244117 (2015) 31 OT input

&SCF EPS_SCF 1.01E-07 &OUTER_SCF MAX_SCF 20 EPS_SCF 1.01E-07 &END OUTER_SCF SCF_GUESS RESTART MAX_SCF 20 &OT MINIMIZER DIIS PRECONDITIONER FULL_ALL &END OT &END SCF

32 Linear Scaling SCF Linear Scaling SCF in CP2K

3 Traditional Based on approachessparse matrix to solve matrix the self- Largest O(N ) calculation with CP2K multiplicationsconsistent field (SCF) (iterative equations proc.) are O(N3) (~6000 atoms) limiting system size significantly.

1 m

1 1 n

A newlyP = implementedI sign algorithmS H isµI O(N),S 4 allowing for2 far larger systems to be studied. m

Self consistent solution by mixing n 2 2 Hn+1(Pn+1) New regime: small devices, heterostructures, interfaces,Hˆ =(1nano-particles,↵)Hˆ a↵ smallH virus. n+1 n n+1

Chemical potential by bisecting until nm 22nm 22

µ : trace(P S) N < 1/2 Largest O(N) calculation with CP2K n+1 | n+1 el| (~1'000'000 atoms)

VandeVondele J; Borstnik U; Hutter J; 2012, Linear scaling self-consistent field calculations for millions of atoms in the condensed phase. JCTC 10: 3566 (2012) VandeVondele, Borstnik, Hutter; JCTC 10, 3566 (2012) 33 DBCSR: a sparse matrix library Distributed Blocked Compressed Sparse Row Distributed Blocked CannonSparse Sparse Recursive Matrix Library Optimized for the science case: 10000s of non-zeros per row. The denseDBCSR: limit as importantDistributed as the Blocked sparse limit.Compressed Sparse Row

For massively parallel architectures Cannon style communication Optimised for 10000s of non-zeros peron row a homogenized (dense limit) matrix for strong scaling Stored in block form : atoms or molecules

Cannons algorithm: 2D layout (rows/columns) and 2D distribution of data

Homogenised for load balance

Borstnik et al. : submitted given processor communicates only with nearest neighbours transferred data decreases as number of processors increases

34 Millions of atoms Millions of atoms in the condensed phase

Accurate basis sets, DFT 46656 cores

The electronic structure O(106) atoms in < 2 hours

Minimal basis sets: DFT, NDDO, DFTB 9216 cores

Bulk liquid water. Dashed lines represent ideal linear scaling.

VandeVondele, Borstnik, Hutter, JCTC, DOI: 10.1021/ct200897x

35 Metallic Electronic Structure

1 3 3 E = ( E )d k wknk(nk Ef )d k band ⇥ nk nk f ⇥ n BZ BZ n k ⇥ Rh band structure CKS and �KS needed

Ef Ef

charge sloshing and exceedingly slow convergence Wavefunction must be orthogonal to unoccupied bands close in energy

Discontinuous occupancies generate instability (large variations in n(r))

Integration over k-points and iterative diagonalisation schemes

36 Smearing & Mixing in G-space

Mermin functional: minimise the free energy

F (T )=E kBTS(fn) S(f )= [f ln f + (1 f ) ln(1 f )] n n n n n n

Any smooth operator that allows accurate S(fn) to recover the T=0 result

n Ef 1 fn = Fermi- E kT exp n f +1 ⇤ ⌅ kBT ⇥

Trial density mixed with previous densities: damping oscillations

m 1 ninp = ninp + GI [ninp]+ ↵ n + GI m+1 m R m i i Ri i=1 X residual minimise the residual [ninp]=nout[ninp] ninp R G preconditioning matrix damping low G

37 Iterative Improvement of the the n(r)

Input density matrix

Pin nin(r) ↵ !

Update of KS Hamiltonian

diagonalization plus iterative refinement Cn "n CPU Time Calculation of Fermi energy and occupations Ef fn Time[s]/SCF cycle on 256 CPUs IBM Power 5 : 116.2 New density matrix Pout nnew(r) ↵ Pout nout(r) ↵ !

Check convergence max Pout Pin ↵ ↵ Density mixing nout nin nh ... nnew !

38 Rhodium: Bulk and Surface

Rh(111) d-projected Rhodium: Bulk and Surface LDOS Bulk: 4x4x4 Q17 DZVP Surface: 6x6 7 layers

2 Basis PP a0 [Å] B[GPa] Es[eV/Å ]Wf [eV]

3s2p2df 17e 3.80 258.3 0.186 5.11 2s2p2df 9e 3.83 242.6 0.172 5.14 Q9 DZVP 2sp2d 9e 3.85 230.2 0.167 5.20 SZVP spd 9e 3.87 224.4 0.164 5.15 SZV d-projected LDOS

Minimal model for Rh(111) surface: -8 -4 0 4 8 4 layer slab, 576 Rh atoms, 5184 electrons, 8640 basis function E-Ef [eV]

39 T. Auckenthaler et al. / Parallel Computing 37 (2011) 783–794 785 avoiding system-specific complications such as the exact form of the eigenspectrum, or the choice of an optimal precondi- tioning strategy [11,9]. Even for (i)–(iii), though, a conventional diagonalization of some kind may still be required or is a necessary fallback. In general, the solution of (1) proceeds in five steps: (A) Transformation to a dense standard eigenproblem (e.g., by Chole- sky decomposition of S), H c =  Sc [ Aq = kq , k  ; (B) Reduction to tridiagonal form, A [ T; (C) Solution of the tridi- KS l l l A A  l agonal problem for k eigenvalues and vectors, TqT = kqT; (D) Back transformation of k eigenvectors to dense orthonormal form, qT [ qA; (E) Back transformation to the original, non-orthonormal basis, qA [ cl. Fig. 1 shows the overall timings of these operations on a massivelyT. parallel Auckenthaler IBM et al. / BlueGene/P Parallel Computing system, 37 (2011) 783–794 for one specific example: the electronic785 structure of a 1003-atom polyalanine peptide (small protein) conformation in an artificially chosen, fixed a-helical geometry. The example isavoiding set up using system-specific the ‘‘Fritz complications Haber Institute such asab the initio exactmolecular form of the simulations’’ eigenspectrum, (FHI-aims) or the choice all-electron of an optimal electronic precondi- structure package [8,32]tioning, at strategy essentially[11,9]. converged Even for (i)–(iii), basis though, set accuracy a conventional for DFT diagonalization (tier 2 [8]). For of some(1), this kind means may stilln be= 27,069. required The or is number a of calculated necessary fallback. eigenpairsIn general, is thek = solution 3410, somewhat of (1) proceeds more in five than steps: the (A) theoretical Transformation minimum to a densekmin standard= 1905, eigenproblem one state per(e.g., two by Chole- electrons. Steps (A)–(E) weresky decomposition performed using of S), H onlyc = subroutine Sc [ Aq = callskq , k as in; (B) the Reduction ScaLAPACK to tridiagonal[33] library form, whereA [ T; available,(C) Solution as of theimplemented tridi- in IBM’s sys- KS l l l A A  l tem-specificagonal problem ESSL for library,k eigenvalues combined and vectors, as describedTqT = kqT; briefly (D) Back in transformation[8, Section 4.2] of k. Theeigenvectors reason isto that dense ScaLAPACK orthonormal or its interfaces are widelyform, q usedT [ qA for; (E) (massively) Back transformation parallel to linear the original, algebra non-orthonormal and readily basis,available;qA [ c nol. Fig. claim 1 shows as to the whether overall timings our use of is the best or only possiblethese operations alternative on a massively is implied. parallel ScaLAPACK IBM BlueGene/P provides system, the for driver one specific routine example:pdsyevd the electronic, whichstructure calls pdsytrd of a , pdstedc, and 1003-atom polyalanine peptide (small protein) conformation in an artificially chosen, fixed a-helical geometry. The example pdormtris set up usingfor the tridiagonalization, ‘‘Fritz Haber Institute solutionab initio ofmolecular the tridiagonal simulations’’ eigenproblem (FHI-aims) all-electron and back electronic transformation structure package respectively. pdstedc is[8,32] based, at on essentially the divide-and-conquer converged basis set accuracy (D&C) algorithm, for DFT (tier tridiagonalization2 [8]). For (1), this means andn back= 27,069. transformation The number of arecalculated done using Householder transformationseigenpairs is k = 3410, and somewhat blocked more versions than thethereof theoretical[34,35] minimum. The backkmin = transformation 1905, one state per was two done electrons. only Steps for (A)–(E) the needed eigenvectors. wereOur performed point here using are only some subroutine key conclusions, calls as in the in ScaLAPACK agreement[33] withlibrary reports where in available, the wider as implemented literature [12,6,36] in IBM’s sys-. What is most appar- enttem-specific from Fig. ESSL 1 is library, that even combined for this as described large electronic briefly in [8, structure Section 4.2] problem,. The reason the is calculation that ScaLAPACK does or not its interfaces scale beyond are 1024 cores, thus widely used for (massively) parallel linear algebra and readily available; no claim as to whether our use is the best or only limitingpossible the alternative performance is implied. of any ScaLAPACK full electronic provides the structure driver routine calculationpdsyevd with, which more calls processors.pdsytrd, Bypdstedc timing, and steps (A)–(E) individ- ually,pdormtr it isfor obvious tridiagonalization, that (B) the solution reduction of the to tridiagonal tridiagonal eigenproblem form, and and then back (C) transformation the solution respectively. of the tridiagonalpdstedc problem using the D&Cis based approach on the divide-and-conquer dominate the calculation, (D&C) algorithm, and tridiagonalizationprevent further and scaling. back transformation For (B), the main are done reason using Householder is that the underlying House- holdertransformations transformations and blocked involve versions matrix–vector thereof [34,35]. operationsThe back transformation (use of BLAS-2 was done subroutines only for the and needed unfavorable eigenvectors. communication pat- tern);Our the point magnitude here are some of key (C) conclusions, is more surprising in agreement (see with below). reports in By the contrast, wider literature the matrix[12,6,36] multiplication-based. What is most appar- transformations ent from Fig. 1 is that even for this large electronic structure problem, the calculation does not scale beyond 1024 cores, thus (A),limiting (D), the and performance (E) either of still any scale full electronic or take structure only a small calculation fraction with of more the processors. overall time. By timing steps (A)–(E) individ- Angewandte Informatik ually,In the it is present obvious that paper, (B) the we reduction assume tothat tridiagonal stepThe ELPA (A) form, project already and then hasThe (C) beenproject the completed, solution of the and tridiagonal step (E) problem will not using be considered, the either. We presentD&C approach a new dominate parallel the implementation calculation,Beyond and prevent based the basic further on ELPA-Lib the scaling. two-stepAlgorithmic For (B), band paths the reduction main for eigenproblems reason of is Bischof that the underlying et al. [37] House-concerning step (B), tri- Algorithmik diagonalization;holder transformations Section involve2.1, matrix–vector with improvements operations mainly (use of BLAS-2 forImprovements step subroutines (D), with back ELPA and transformation; unfavorable communication Section 2.2 pat-. We also extend the Efficient tridiagonalization D&Ctern); algorithm, the magnitude thus of speeding (C) is more up surprising step (C); (see Section below).3. By Some contrast, additional the matrix optimization multiplication-based steps in transformations the algorithmic parts not specif- (A), (D), and (E) either stillGeneralized scale or take only Eigenvalue a small fraction Problem of the overall time. ically discussed here (reduction to banded form, optimized one-step reduction to tridiagonal form, and corresponding back In the present paper, we assume that step (A)State already of has the been Art completed, and step (E) will not be considered, either. We transformations)presentAlgorithmic a new parallel will implementation be published paths based as part for on the ofELPA eigenproblems two-stepan overall project band implementation reduction of Bischof inIII et[38] al..[37] Theseconcerning routines step are (B), also tri- included in recent productiondiagonalization; versions Section of2.1 FHI-aims., with improvementsGeneralized For simplicity Eigenvalue mainlyELPA we for Problem in will step cp2k present (D), back only transformation; the real symmetric Section 2.2. case; We also the extend complex the Hermitian case is similar.D&C algorithm,Problems thus speeding up with step (C); this Section approach:3State. Some of additionalthe Art optimization steps in the algorithmic parts not specif- icallyIn addition discussed hereto synthetic (reduction testcases, to banded form, we show optimizedELPA benchmarks project one-step reduction for two to large, tridiagonal real-world form, and problems corresponding from back all-electron electronic transformations) will be published as part of an overallELPA implementation in cp2k in [38]. These routines are also included in recent ScaLAPACKstructure theory: first, the n = 27,069, k = 3410 polyalanine case of Fig. 1, which will be referred to as Poly27069 problem production versions of FHI-aims.Generalized For simplicity Eigenvalue we Problem will present only the real symmetric case; the complex Hermitian case is insimilar. the following,ScaLAPACK and second, an n performance= 67,990State of generalized the Art eigenproblem arising from a periodic Pt (100)-‘‘(5 40)’’, large-scale ELPA project  reconstructedIn addition to surface synthetic calculation testcases, we with show 1046ELPA benchmarks in heavy-element cp2k for two large, atoms, real-world as needed problems in [39] from. In all-electron the latter electronic calculation, the large frac- tionstructure of core theory: electrons first, the forn Pt= 27,069, (atomick = number 3410 polyalanineZ = 78) makes case of Fig. for 1a, much which willhigher be referred ratio of to needed as Poly27069 eigenstates problem to overall basis size, in the following, and second, anScaLAPACKn = 67,990 generalized eigenproblem for arising diagonlis from a periodic Pt (100)-‘‘(5ation40)’’, large-scale k = 43,409 ScaLAPACK64%, than in the performance polyalanine case, even though the basis set used is similarly well converged. This problem will reconstructed surface calculation with 1046 heavy-element atoms, as needed in [39]. In the latter calculation, the large frac- be referred to as Pt67990.All electron Benchmarks electronic are performed structure on two calculation distinct computer with FHI-aims:systems: The IBM BlueGene/P machine tion of core electronsAT for Pt (atomic number Z = 78) makes for a much higher ratio( of λ needed ,q ) eigenstates to overall basisq size, ‘‘genius’’k = 43,409 used64%, in thanFig. in 1 the, and polyalanine a Sun Microsystems-built, case, even thoughpolyalanine the basis Infiniband-connected set peptide used is similarlyT well Intel converged. Xeon (Nehalem) This problem cluster willA with individual  reduction to compute transform eight-corebe referred nodes. to as Pt67990. WeAll note electron Benchmarks that electronic for are all performed standard structure on ScaLAPACK two calculation distinct or computer PBLAS with calls, FHI-aims:systems: i.e., The those IBM parts BlueGene/P not implemented machine by ourselves, the‘‘genius’’ optimized used in ScaLAPACK-likeFig. 1, and atridiagonal Sun Microsystems-built, implementations formpolyalanine Infiniband-connected byeigenvalues IBM peptide (ESSL) and or Intel Xeon (MKL) (Nehalem) wereeigenvectors employed. cluster with individual eight-core nodes. We note that for all standard ScaLAPACK− orvectors PBLAS of calls, T i.e., those parts not implemented by ourselves, the optimized ScaLAPACK-like implementations by IBM (ESSL) or Intel (MKL) were employed. one half BLAS 2 QR too slow pdsyevd (ESSL) on IBM BGP 1003 atoms 10033410 atoms MOSscaling BisInvIt slow, not robust Tridiagonalization TransformationPolyalanine to tridiagonalpeptide form basedTridiagonalization on around 50% 3410Generalized27069 MOS BSf Eigenvalue Problem D & C scaling 27069 BSf State of the Art Solution ELPA project Solution BLAS-2 operations.ELPA in cp2k not partial Cho. 1 Cho. 1 Back trans. ScaLAPACKEigen-decomposition in cp2k ofMRRRT traditionallynot robust enoughBack done trans. with routines Cho. 2 Cho. 2 such as bisection and inverse iterations. Eigenvalue Solvers—The ELPA Project and Beyond, Bruno Lang 9/31

Fig. 1. Left: SegmentDivide-and-conquer-based of the a-helical polyalanine molecule Ala100 as described in method the text. Right: Timings (D&C) for the five steps (A): reduction to standard Fig.eigenproblem, 1. Left: Segment (B): tridiagonalization, of the a-helical (C): solution polyalanine of the tridiagonal molecule problem, Ala100 andas back described transformation in the of text. eigenvectors Right: toTimings the full standard for the pro fiveblem steps (D) and (A): reduction to standard eigenproblem,the generalized576 (B): problem Cu,on tridiagonalization, IBM nao=14400, (E), of a BGP complete with (C): eigenvalue/-vector Nelect.=6336, solution ESSL: ofpdsyevd the solution tridiagonalk forof this eigen-pairs=3768 problem, molecule, n and= 27,069, back transformationk = 3410, as a function of eigenvectors of the number to of the processor full standard problem (D) and thecores. generalized The calculation problem was performed (E), of a on complete an IBM BlueGene/P eigenvalue/-vector system, using solution a completely for ScaLAPACK-based this molecule, implementation.n = 27,069, k = Step 3410, (C) aswas a performed function using of the number of processor Multipleon IBM relatively BGP with ESSL: robustpdsyevd representationstime x method SCF,7 / 25 on CRAY (MRRR) XE6 cores.the divide-and-conquer The calculationnprocs method. was performed syevd on an IBM BlueGene/P syevr system, Cholesky using a completely ScaLAPACK-based implementation. Step (C) was performed using 7 / 25 the divide-and-conquer32 method. 106 (49%) 72 (40%) 38 (21%) Parallel64 performance 69 (46%) 48 depends (37%) 34 (26%) on data locality>70% in eigenvalueand scalability solver ScaLAPACK128 41 need (41%) improvements 29 (34%) 23 (28%) in numerical stability, parallel 256 35 (41%) 26 (34%) 24 (32%) poor scaling scalability,Syevd and:D&C memory bandwidth limitations Syevr: MRRR 40 6 / 25

9 / 25 Angewandte Informatik The ELPA project The project Beyond the basic ELPA-Lib Algorithmic paths for eigenproblems Improvements with ELPA Algorithmik Efficient tridiagonalization Generalized Eigenvalue Problem State of the Art ELPA project ImprovementsELPA with in cp2k ELPA V Two-step StrategyELPA (http://elpa.rzg.mpg.de) Two-step reduction II: banded tridiagonal: Improved efficiency by a two-step transformation and back! transformation

better scaling better scaling higher per−node perf complexB complex better scaling qB complex band form by mainly "cheap" complex partial blocked BLAS 3 partial orthogonal λ AT( ,q T ) qA transformations reduction to compute transform tridiagonal form eigenvalues and eigenvectors

−vectors of T Generalized Eigenvalue Problem two−step State of the Art ELPA project one half BLAS 2 QR too slow ELPA in cp2k Benchmark on CRAY-XE6 N atom=scaling 2116; Nel = 16928;BisInvIt slow,Benchmark not robust on BG-PN atom= 480; Nel = 6000; nmo = 10964;variant with nao = 31740D & C scaling better scalingnmo = 7400; nao = 14240 Reductionbetter to band scaling form by blocked orthogonal transformations not partial partial variant All - syevd CRAY XE6 All - syevd BG-P Tridiagonalization by nMRRR2 stagesnot robust of enough a bulge-chasingimproved robustness algorithm Diag - syevd Diag - syevd 1000 All - syevr 10000 Optimized kernel for non-blocked HouseholderAll - ELPA transformations D&CAll - ELPA for partial eigensystemDiag - syevr Perspective:+ Extended MRRR based to tridiagonal complex eigensolver; hybrid openMP/MPI version Total time for SCF 12 + Improved parallelization Diag - ELPA Diag - ELPA 14 / 25 1000

500 1000 1500 2000 2500 3000 500 1000 1500 2000 2500 3000 3500 4000 4500 EigenvalueNumber Solvers—The of cores ELPA Project and Beyond, Bruno Lang 41 15/31 24 / 25 Large metallic systems

hBN/Rh(111) Nanomesh 13x13 hBN on 12x12 Rh slab graph./Ru(0001) Superstructure 25x25 g on 23x23 Ru

2116 Ru atoms (8 valence el.) + 1250 C atoms, Nel=21928, Nao=47990 ;

~ 25 days per structure optimisation, on 1024 cpus

Slab 12x12 Rh(111) slab, a0=3.801 Å, 1 layer hBN 13x13 4L: 576Rh + 169BN: Nao=19370 ; Nel=11144 7L: 1008Rh + 338BN: Nao=34996 ; Nel=19840

Structure opt. > 300 iterations => 1÷2 week on 512 cores

42 SCF for Metals

&SCF SCF_GUESS ATOMIC MAX_SCF 50 EPS_SCF 1.0e-7 &XC EPS_DIIS 1.0e-7 &XC_FUNCTIONAL PBE &SMEAR &END METHOD FERMI_DIRAC &vdW_POTENTIAL ELECTRONIC_TEMPERATURE 500. DISPERSION_FUNCTIONAL PAIR_POTENTIAL &END SMEAR &PAIR_POTENTIAL &MIXING TYPE DFTD3 METHOD BROYDEN_MIXING PARAMETER_FILE_NAME dftd3.dat ALPHA 0.6 REFERENCE_FUNCTIONAL PBE BETA 1.0 &END PAIR_POTENTIAL NBROYDEN 15 &END vdW_POTENTIAL &END MIXING &END XC ADDED_MOS 20 20 &END SCF

43