arXiv:1601.07728v1 [physics.comp-ph] 28 Jan 2016

Efficient implementation of core-excitation Bethe-Salpeter equation calculations

K. Gilmore a,b,∗, John Vinson c, E.L. Shirley c, D. Prendergast d, C.D. Pemmaraju d, J.J. Kas e, F.D. Vila e, J.J. Rehr e

a European Synchrotron Radiation Facility (ESRF), Grenoble 38000, France
b Jiangsu Key Laboratory for Carbon-Based Functional Materials & Devices, Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University, Suzhou, Jiangsu 215123, PR China
c National Institute of Standards and Technology (NIST), Gaithersburg, Maryland 20899, USA
d Lawrence Berkeley National Lab (LBL), Berkeley, California 94720, USA
e Department of Physics, University of Washington, Seattle, Washington 98195, USA

∗ Corresponding author at: European Synchrotron Radiation Facility, 38043 Grenoble, France. Tel.: +33 476881727.
Email address: [email protected] (K. Gilmore)

Preprint submitted to Computer Physics Communications, January 29, 2016

Abstract

We present an efficient implementation of the Bethe-Salpeter equation (BSE) method for obtaining core-level spectra including x-ray absorption (XAS), x-ray emission (XES), and both resonant and non-resonant inelastic x-ray scattering spectra (N/RIXS). Calculations are based on density functional theory (DFT) electronic structures generated either by abinit or Quantum ESPRESSO, both plane-wave basis, pseudopotential codes. This electronic structure is improved through the inclusion of a GW self energy. The projector augmented wave technique is used to evaluate transition matrix elements between core-level and band states. Final two-particle scattering states are obtained with the NIST core-level BSE solver (NBSE). We have previously reported this implementation, which we refer to as ocean (Obtaining Core Excitations from Ab initio electronic structure and NBSE) [Phys. Rev. B 83, 115106 (2011)]. Here, we present additional efficiencies that enable us to evaluate spectra for systems ten times larger than previously possible, containing up to a few thousand electrons. These improvements include the implementation of optimal basis functions that reduce the cost of the initial DFT calculations, more complete parallelization of the screening calculation and of the action of the BSE Hamiltonian, and various memory reductions. Scaling is demonstrated on supercells of SrTiO3 and example spectra for the organic light emitting molecule Tris-(8-hydroxyquinoline)aluminum (Alq3) are presented. The ability to perform large-scale spectral calculations is particularly advantageous for investigating dilute or non-periodic systems such as doped materials, amorphous systems, or complex nano-structures.

1. Introduction

Core-level spectroscopies provide a quantitative element- and orbital-specific probe of the local chemical environment and the atomic and electronic structure of materials. For example, x-ray absorption (XAS) and emission (XES) spectra probe the unoccupied and occupied densities of states, respectively. The near-edge absorption region (XANES) is sensitive to oxidation state, spin configuration, the crystal field, and chemical bonding, whereas the extended region can be used to reconstruct local coordination shells. However, extracting this information requires a reliable interpretation of the measured spectra. Often this is done by comparison with reference spectra, but such comparisons are at best qualitative; it is preferable to calculate spectra quantitatively and predictively.

An accurate description of core-level excitations must take into account both the highly localized nature of the core hole and the extended nature of the condensed system. The problem of predictive computational x-ray spectroscopy has been approached from many directions, but most can be divided by scale into two major categories. Atomic and cluster models have sought to include many-body effects by considering a more exact treatment of a small subsystem coupled loosely or parametrically to the larger system. At the other end, various single-particle theories are able to treat many hundreds of atoms by approximating the electron-electron and electron-hole interactions. The utility of calculated spectra hinges upon compromises between the ability to accurately model a given system and the ability to address systems large enough to be representative of experiment: defects, dopants, interfaces, etc.

Within a first-principles approach, it is easiest to use the independent-particle approximation. For x-ray absorption, the extended region of materials is very well reproduced by the real-space Green's function code feff, which is widely used over extensive energy ranges [1]. However, the current implementation loses accuracy at the edge when non-spherical corrections to the potential are important. To reproduce the near-edge structure, accurate independent quasi-particle models for deep-core XAS can be constructed for s-levels [2]. In this case, the x-ray absorption intensity is proportional to the unoccupied projected density of states in the presence of a screened core-hole, weighted by final-state transition matrix elements. This approach has been used with different treatments of the core-hole including the "Slater transition state" model of a half-occupied core state [3] or the final state picture of a full core hole with [4] or without [5] the corresponding excited electron. Due to the simplicity of their implementation and modest computational cost, such core-hole schemes have been implemented in several standard DFT distributions [6–12] for near-edge spectra. However, these approaches often fail when the hole has non-zero orbital angular momentum. Extensions to DFT such as time-dependent DFT (TDDFT) have been problematic due to the lack of sufficiently accurate exchange-correlation functionals for core-excited states. Although recent development of more accurate short-range exchange-corrected functionals [13, 14] has improved the viability of TDDFT for calculating core-level spectra, more advances will be needed to make TDDFT a generally applicable approach.

The aforementioned approaches are all inherently single-particle descriptions. For calculating x-ray spectra this can be problematic, not only when the core hole has non-zero angular momentum, but also when many-body or multi-electron excitations are important. Many-electron wavefunction-based methods constitute one obvious approach to this problem. Contrary to single-particle descriptions that typically begin from the perspective of the extended system, atomic and cluster models start with the goal of completely describing the local problem. In particular, atomic multiplet theory [15] or configuration interaction [16, 17] and exact diagonalization methods [18] can accurately reproduce complicated L edges of transition metal systems. However, these techniques that take a local description typically ignore solid-state effects and are limited to small cluster models, although recent progress at incorporating band structure within the multiplet model should be noted [19–21].

Further improvements in DFT-based approaches to calculating spectra require a two-particle picture including particle-hole interactions, particle and hole self-energies, and full-potential electronic structure, within the context of many-body perturbation theory. Specifically, this involves solving the Bethe-Salpeter equation (BSE), i.e., a particle-hole Green's function. The Bethe-Salpeter equation description of absorption includes single-particle terms that describe the quasi-particle energies of the core hole and the excited photoelectron, together with the interaction between them. To leading order the interaction consists of two terms: the Coulomb interaction, which includes adiabatic screening of the core hole, and an unscreened exchange term. Even this two-particle description of the many-body final state is already a significant improvement for L edges [22]. For example, when considering the L2,3 edges of the transition metals, the independent-particle approximation predicts a 2:1 branching ratio between the intensities of the L3 and L2 edges, in contrast to experiments, which exhibit branching ratios ranging from 0.7:1 for Ti to beyond 2:1 for Co and Ni [23–29]. The BSE largely resolves this discrepancy, yielding branching ratios in reasonable agreement with experiment [30]. However, simpler approaches such as TDDFT can also account for these corrections [31–33].

BSE solvers have been implemented in a few core-level codes to date [34–36] (as well as some valence level codes [37–40]), but their utility has been limited to a specialist community. In part, this is due to significantly increased computational cost. feff and DFT-based core-hole approaches require little more effort than standard DFT calculations [41], and calculations on systems of hundreds of atoms have become routine. BSE calculations are considerably more intensive and have heretofore typically been limited to a few tens of atoms. This added cost is largely associated with including GW self-energy corrections to the electronic structure, obtaining the screening response to the core hole, and acting with the Bethe-Salpeter Hamiltonian on the electron-hole wavefunction to obtain the excitation spectrum. However, given the significantly improved accuracy of the BSE method it is desirable to make this a more widely used technique. This necessitates improving its ease of use and reducing the computational cost. Toward this second objective we report herein several efficiency improvements to our existing BSE code [42] that now allow BSE calculations on systems of hundreds of atoms and significantly reduce the time required for previously viable smaller systems.

The most time-consuming steps of a BSE calculation are (1) obtaining the ground-state electronic structure, (2) correcting the quasiparticle energies by adding a (GW) self-energy, (3) evaluating the screening response to the core-hole, and (4) determining the excitation spectrum of the BSE Hamiltonian. Our BSE calculations build on self-consistent field (SCF) DFT calculations of the ground-state charge density and the accompanying Kohn-Sham potential. We then use non-self-consistent field (NSCF) calculations, i.e., direct calculations that solve the one-electron Schrödinger equation in the already computed Kohn-Sham potential, to obtain all desired occupied and unoccupied Bloch states. To alleviate the burden of k-space sampling during the NSCF calculation, and to reduce the plane-wave basis, we implement a k-space interpolation scheme that solves a k-dependent Hamiltonian over a reduced set of optimal basis functions [43, 44]. This is described in section 3.1. In most cases, rather than evaluating the GW self-energy in the typical random-phase approximation (G0W RPA), we instead use a much more computationally efficient approximation based on a multi-pole model for the loss function. This has been previously described in detail [45], and we will not discuss it further here. To reduce the time required to calculate the screening response to the core-hole we take advantage of the fact that this screening is highly localized around the excited atom and partition space accordingly. The electronic response is evaluated locally and a model dielectric response proves adequate for the rest of space [46]. Here we reduce the time needed to evaluate the screening response and action of the BSE Hamiltonian by parallelizing these portions of the code. This is discussed in section 3.2 for the BSE Hamiltonian and section 3.3 for the screening. In section 4 we demonstrate the effectiveness of these improvements through XAS calculations on a series of supercells of SrTiO3. Section 4.1 characterizes the time-scaling of the code with respect to system size. Section 4.2 evaluates the savings realized by employing the optimal basis functions. The efficacy of parallelization is reported in section 4.3 for the action of the BSE Hamiltonian and in section 4.4 for the screening response. Example XAS and XES spectra of the commonly studied organic molecule Tris-(8-hydroxyquinoline)aluminum (Alq3) are presented in section 4.5. We end with a summary of the capabilities of ocean and some general comments on its applicability.

2. Formalism

The theoretical description of the absorption of a photon by a material may be expressed in terms of the loss function $\sigma(\mathbf{q},\omega) = -\mathrm{Im}\,\epsilon^{-1}(\mathbf{q},\omega)$. The dielectric response function $\epsilon$ depends on the photon energy $\omega$ and the momentum transfer $\mathbf{q}$. A formal many-body expression for the loss function may be given as

$$\sigma(\mathbf{q},\omega) \propto -\frac{1}{\pi}\,\mathrm{Im}\,\langle 0|\hat{O}^{\dagger} G(\omega)\, \hat{O}|0\rangle, \qquad (1)$$

where $|0\rangle$ is the many-body ground-state wavefunction, the operator $\hat{O}$ describes the interaction between the photon field and the system, and $G(\omega)$ is the Green's function for the many-body excited state. The form used for the operator $\hat{O}$ depends on the physical process being studied, e.g., $e^{i\mathbf{q}\cdot\mathbf{r}}$ for non-resonant inelastic x-ray scattering (NRIXS) or the expansion $(\hat{\mathbf{e}}\cdot\mathbf{r}) + (i/2)(\hat{\mathbf{e}}\cdot\mathbf{r})(\mathbf{q}\cdot\mathbf{r}) + \ldots$ for x-ray absorption (XAS), $\hat{\mathbf{e}}$ being the photon polarization vector. Using the Bethe-Salpeter Hamiltonian, the Green's function for the excitation can be approximated in a two-particle form as

$$G(\omega) = [\omega - H_{\mathrm{BSE}}]^{-1}, \qquad (2)$$

where the Bethe-Salpeter Hamiltonian is typically given by

$$H_{\mathrm{BSE}} = H_e - H_h - V_D + V_X. \qquad (3)$$

The term for the core-hole

$$H_h = \epsilon_c + \chi - i\Gamma \qquad (4)$$

contains the average core-level energy $\epsilon_c$, the spin-orbit interaction $\chi$, and the core-hole life-time $\Gamma$. In most practical work, the excited electron Hamiltonian

$$H_e = H_{\mathrm{KS}} + \Sigma^{GW} - V_{xc} \qquad (5)$$

is approximated by the Kohn-Sham Hamiltonian $H_{\mathrm{KS}}$ with a GW self-energy correction $\Sigma^{GW}$, where the exchange-correlation energy, $V_{xc}$, is subtracted off to avoid double-counting. The excited electron and hole interact via the Coulomb interaction within the mean-field of the remaining electrons. This is separated into the attractive direct term

$$V_D = \hat{a}_v^{\dagger}(\mathbf{r},\sigma)\,\hat{a}_c(\mathbf{r}',\sigma')\,W(\mathbf{r},\mathbf{r}',\omega)\,\hat{a}_v(\mathbf{r},\sigma)\,\hat{a}_c^{\dagger}(\mathbf{r}',\sigma'), \qquad (6)$$

which is screened by the other electrons in the system, and the repulsive exchange term

$$V_X = \hat{a}_v^{\dagger}(\mathbf{r},\sigma)\,\hat{a}_c(\mathbf{r},\sigma)\,\frac{1}{|\mathbf{r}-\mathbf{r}'|}\,\hat{a}_v(\mathbf{r}',\sigma')\,\hat{a}_c^{\dagger}(\mathbf{r}',\sigma'), \qquad (7)$$

which is treated as a bare interaction. The $\hat{a}_v^{\dagger}$ ($\hat{a}_v$) operator creates (annihilates) an electron in the valence level while $\hat{a}_c^{\dagger}$ ($\hat{a}_c$) creates (annihilates) an electron in a core level.

Our implementation of the GW-BSE method is referred to as ocean (Obtaining Core Excitations from Ab initio electronic structure and NBSE) and we have previously presented it in detail [42]; NBSE refers to the NIST BSE solver. Because solving the Bethe-Salpeter equation at the level of approximation described above is computationally intensive compared to other methods of calculating x-ray spectra, our approach makes several reasonable approximations to improve the efficiency of the calculation. Specifically, within a plane-wave approach to solving the Kohn-Sham equations, we use pseudopotentials to reduce the number of electrons and size of the plane-wave basis. The GW self-energy is obtained through the highly efficient many-pole self-energy approximation when appropriate [45]. The effort required to obtain the screening for the direct interaction is also reduced by utilizing a hybrid real-space approach in which the screening response is evaluated at the RPA level locally about the absorbing site, but the long range screening is approximated with a model dielectric function [46].

Despite these efficacious strategies, the largest system treated with our previous implementation of ocean was a water cell consisting of 17 molecules [47]. To extend the capabilities of ocean to treat larger systems we have made several improvements. A calculation with ocean consists of four stages

1. DFT
2. Translator
3. Screening
4. BSE

where stage 2 is a translation layer that allows different DFT packages to be used as the foundation for ocean. The limiting points of the calculation previously were stages 1 and 3, solving the Kohn-Sham equations and evaluating the screening response to the core-hole. Stage 4, the actual evaluation of the Bethe-Salpeter Hamiltonian, was also a limiting point for systems that required sampling numerous atomic sites. Therefore, our efforts focused on (i) improving the efficiency of the DFT calculation and (ii) parallelizing the evaluation of the screening response and the Bethe-Salpeter Hamiltonian.

3. Implementation

3.1. Optimal Basis Functions

Our previous version of ocean used abinit [6–9] as the DFT solver. ocean may now be alternatively based on wavefunctions obtained with Quantum ESPRESSO [10]. For the purpose of this paper, we report results based on the use of Quantum ESPRESSO rather than abinit, though either DFT solver may be chosen depending on the preference of the user.

The loss function in eqn. 1 is calculated by transforming the implied integral over all space into sums over reciprocal-space k-points within the Brillouin zone and real-space x-points within the unit cell, as is standard in calculations of periodic systems. The sum over k-points requires a denser mesh for converging spectra than for properties such as the density. Additionally, the BSE approach necessitates summing over a large number of unoccupied states, which are also not needed when looking at ground-state properties. Thus, a considerable number of Kohn-Sham states must be constructed.

We reduce the computational cost of generating the Bloch functions through the use of optimal basis functions (OBF) [43]. We have implemented the OBF routines of Prendergast and Louie as a middle-layer in the ocean code [44]. The OBFs are a method of k-space interpolation and basis reduction. A fully self-consistent DFT calculation is carried out to converge the density, using only enough bands to cover the occupied states. With this density a non-SCF calculation is performed, including unoccupied bands for the screening and BSE calculations (the density is held constant and the Kohn-Sham eigensystem is solved for all the bands needed for the BSE). This second calculation is used as a basis to create the OBFs. By using the OBFs we achieve a significant reduction in the time spent calculating the Bloch functions for a given system. Further details and quantitative results are presented in section 4.2.

3.2. BSE Hamiltonian

In ocean the BSE Hamiltonian acts on a space containing a core-level hole with index $\alpha$ and a conduction band electron with indices $n, \mathbf{k}$. A vector in this space is described by the coefficients $\psi_{\alpha,n,\mathbf{k}}$, and the photoelectron wavefunction for a given core index $\alpha$ is easily expanded in real space from the conduction-band Bloch functions

$$|\Phi_\alpha(\mathbf{x},\mathbf{R})\rangle = \sum_{n,\mathbf{k}} \psi_{\alpha,n,\mathbf{k}}\, e^{i\mathbf{k}\cdot(\mathbf{R}+\mathbf{x})}\, |u_{n,\mathbf{k}}(\mathbf{x})\rangle. \qquad (8)$$

Within ocean we are interested primarily in the resulting x-ray spectrum, rather than obtaining the actual eigenvalues and eigenvectors of the Bethe-Salpeter Hamiltonian, and so the iterative Haydock technique is used [48, 49]. The advantage of an iterative technique is that the complete Hamiltonian does not need to be explicitly constructed or stored. Rather, at each step in the Haydock scheme the BSE Hamiltonian acts on the vector determined by the previous iteration,

$$\psi^{i+1}_{\alpha,n,\mathbf{k}} = H_{\mathrm{BSE}}\, \psi^{i}_{\alpha,n,\mathbf{k}}, \qquad (9)$$

where $i$ gives the $i$th vector, and convergence of the spectrum can be achieved in only a few hundred iterations (true diagonalization requires a number of steps equal to the dimension of the matrix). A more explicit explanation is given by Benedict and Shirley in Ref. 49. We divide up equation 9 by separating the BSE Hamiltonian into pieces, each of which is evaluated in its ideal or most compact basis. This is outlined in the following two sections. The full vector $\psi^{i+1}$ is then the sum of contributions from each piece.

3.2.1. Long-range

Only the direct interaction has a long-range component, as the exchange term requires both $\mathbf{r}$ and $\mathbf{r}'$ be at the core-hole site. The screened Coulomb interaction can be expanded in spherical harmonics and separated by angular momentum $l$,

$$W(\mathbf{r},\mathbf{r}') = \int d\mathbf{r}''\, \frac{\epsilon^{-1}(\mathbf{r},\mathbf{r}'')}{|\mathbf{r}''-\mathbf{r}'|} = \sum_{l=0}^{\infty} W_l(\mathbf{r},\mathbf{r}'). \qquad (10)$$

The long-range part includes only the $l = 0$ term,

$$W_0(\mathbf{r},\mathbf{r}') = \int d\mathbf{r}''\, \frac{\epsilon^{-1}(\mathbf{r},\mathbf{r}'')}{[r_>]_{\mathbf{r}'',\mathbf{r}'}}, \qquad (11)$$

that goes as $1/r$ away from the core hole. All higher $W_l$ are treated as short ranged. By integrating out the core-hole dependence via the core-hole density $\rho_\alpha$,

$$W_0(\mathbf{r}) = \int d\mathbf{r}'\, \rho_\alpha(\mathbf{r}') \int d\mathbf{r}''\, \frac{\epsilon^{-1}(\mathbf{r},\mathbf{r}'')}{[r_>]_{\mathbf{r}'',\mathbf{r}'}}, \qquad (12)$$

the long-range component of the BSE Hamiltonian is a function of the electron spatial coordinate only.

We evaluate the action of $W_0$ on a vector by first transforming to a super-cell space

$$\phi_\alpha(\mathbf{x},\mathbf{k}) = \sum_{n} \psi_{\alpha,n,\mathbf{k}}\, e^{i\mathbf{k}\cdot\mathbf{x}}\, |u_{n,\mathbf{k}}(\mathbf{x})\rangle, \qquad (13)$$
$$\phi_\alpha(\mathbf{x},\mathbf{R}) \equiv \mathcal{F}_{\mathbf{k}\to\mathbf{R}}[\phi_\alpha(\mathbf{x},\mathbf{k})],$$

where $\mathbf{R}$ are the lattice vectors and $\mathcal{F}$ indicates a Fourier transform. The core-hole index $\alpha$ can refer to different angular momentum and spin ($lm\sigma$) states of the core hole, but the long-range component of the direct term does not mix spin or angular momentum, and so these are treated sequentially. The k-space grid used in an ocean calculation defines a maximum range of the screened core-hole potential.

The operation $\psi^{i+1} = W_0 \psi^{i}$ is laid out in algorithm 1. Since the direct interaction is diagonal in real-space this procedure is easily parallelized by distributing the x-points among all the processors. The parallel algorithm has the additional final step of summing the vector $\psi^{i+1}$ over all the processors. This limits the scaling, but the vector itself is not large. Both the number of x-points and the number of empty bands required for a given energy range scale as the volume of the system.

Algorithm 1 Computing $\psi^{i+1} = W_0 \psi^{i}$
1: for each $\mathbf{x}$ and $\alpha$ do
2:   $\phi(\mathbf{k}) = \sum_n \psi^{i}_{\alpha,n,\mathbf{k}}\, e^{i\mathbf{k}\cdot\mathbf{x}}\, u_{n,\mathbf{k}}(\mathbf{x})$
3:   $\phi(\mathbf{R}) = \mathcal{F}_{\mathbf{k}\to\mathbf{R}}[\phi(\mathbf{k})]$
4:   $\phi(\mathbf{R}) = W_0(\mathbf{x},\mathbf{R}) \times \phi(\mathbf{R})$
5:   $\phi(\mathbf{k}) = \mathcal{F}_{\mathbf{R}\to\mathbf{k}}[\phi(\mathbf{R})]$
6:   $\psi^{i+1}_{\alpha,n,\mathbf{k}} \mathrel{+}= \phi(\mathbf{k})\, e^{-i\mathbf{k}\cdot\mathbf{x}}\, u^{*}_{n,\mathbf{k}}(\mathbf{x})$
7: end for

3.2.2. Short-range

The short-range components of the Hamiltonian are calculated by projecting the conduction-band states into a local basis around the core hole. These basis functions have a well-defined angular momentum around the absorbing atom and are reasonably complete such that near the atom

$$\varphi_{n,\mathbf{k}}(\mathbf{r})\big|_{r<r_c} = \sum_{\nu,l,m} A^{\nu,l,m}_{n,\mathbf{k}}\, R_{\nu,l}(r)\, Y_{l,m}(\hat{\mathbf{r}}). \qquad (14)$$

3.3. Screening

In the direct interaction the core-hole potential is screened by the dielectric response of the system. We calculate this response within the random phase approximation (RPA)

$$\chi^{0}(\mathbf{r},\mathbf{r}',\omega) = \int \frac{d\omega'}{2\pi i}\, G(\mathbf{r},\mathbf{r}',\omega')\, G(\mathbf{r}',\mathbf{r},\omega'-\omega). \qquad (15)$$

We evaluate this expression in real-space around the core-hole, and, taking advantage of the localized nature of near-edge core excitations, we limit our full calculation to a sphere with a radius of approximately r = 8 a.u., splicing on a model dielectric function for the long-range behavior [46]. The cross-over radius from RPA to model is a convergence parameter so that for each material one may ensure that this approximation has no discernible effect on the calculated spectra. We use static screening, $\omega = 0$, which assumes that the exciton binding energy is small compared to the energy scale for changes in the dielectric response, i.e., smaller than the band gap.

Our real-space grid is 900 points determined from a 36 point angular grid and 25 uniformly spaced radial points. We carry out the integral over energy in eqn. 15 explicitly along the imaginary axis, and for large systems the bulk of time is spent projecting the wavefunctions onto this grid [...]
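The long-range step in Algorithm 1 can be illustrated with a minimal one-dimensional sketch. This is not the ocean source code: the function names `toy_basis` and `apply_w0`, the choice of lattice constant 1, the k-grid $k_j = 2\pi j/N_k$, the plane-wave form of the periodic functions, and the naive $O(N_k^2)$ Fourier sums are all assumptions made for illustration, and the loop over the core-hole index $\alpha$ is omitted.

```python
import cmath
import math


def toy_basis(n_bands, n_k, n_x):
    """Hypothetical toy inputs: plane-wave u_{n,k}(x), orthonormal on the x-grid."""
    kpts = [2.0 * math.pi * j / n_k for j in range(n_k)]  # k_j = 2*pi*j/N_k
    xpts = [m / n_x for m in range(n_x)]                  # points in the unit cell
    u = [[[cmath.exp(2j * math.pi * n * x) / math.sqrt(n_x) for x in xpts]
          for _ in kpts] for n in range(n_bands)]         # u[n][j][m]
    return u, kpts, xpts


def apply_w0(psi, u, w0, kpts, xpts):
    """Return W0 acting on psi for a single core-hole index (toy 1D model).

    psi[n][j]  : coefficient for band n and k-point j
    u[n][j][m] : periodic part of the Bloch function at x-point m
    w0[m][r]   : screened potential at x-point m in lattice cell R = r
    """
    n_bands, n_k = len(psi), len(kpts)
    out = [[0j] * n_k for _ in range(n_bands)]
    for m, x in enumerate(xpts):  # step 1: loop over x (parallelizable)
        # step 2: phi(k) = sum_n psi_{n,k} e^{ikx} u_{n,k}(x)
        phi_k = [sum(psi[n][j] * u[n][j][m] for n in range(n_bands))
                 * cmath.exp(1j * k * x) for j, k in enumerate(kpts)]
        # step 3: Fourier transform k -> R (naive sum, not an FFT)
        phi_r = [sum(phi_k[j] * cmath.exp(1j * k * r)
                     for j, k in enumerate(kpts)) / n_k for r in range(n_k)]
        # step 4: the potential is diagonal in real space
        phi_r = [w0[m][r] * phi_r[r] for r in range(n_k)]
        # step 5: Fourier transform R -> k
        phi_k = [sum(phi_r[r] * cmath.exp(-1j * k * r) for r in range(n_k))
                 for k in kpts]
        # step 6: project back onto the Bloch functions and accumulate
        for n in range(n_bands):
            for j, k in enumerate(kpts):
                out[n][j] += (phi_k[j] * cmath.exp(-1j * k * x)
                              * u[n][j][m].conjugate())
    return out


u, kpts, xpts = toy_basis(n_bands=3, n_k=4, n_x=5)
psi = [[complex(n + 1, j) for j in range(4)] for n in range(3)]
flat = [[2.0] * 4 for _ in range(5)]  # constant potential: W0 psi = 2 psi
result = apply_w0(psi, u, flat, kpts, xpts)
```

Because each pass of the outer loop touches only one x-point, the x-points could be split across processes and the partial `out` vectors summed at the end, which mirrors the final reduction step described in section 3.2.1.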

[Figure 1: plot; horizontal axis: Relative Number of States (1, 8, 18, 27, 36, 64); vertical axes: (Relative Run-Time)]

Figure 1: Run-Time Scaling. The relative run-times for the screening and BSE portions of the calculation are presented versus system size. The screening (left axis, red circles) scales with the system size to the second power and the BSE (right axis, blue squares) goes as the system size to the fourth power.

[...] comes from the increasing number of bands which grows proportionally with the supercell size. In this example we have considered a single site in the supercell. The number of sites also grows with volume, leading to an overall scaling of volume cubed for both the screening and BSE. Given that as the system size increases the BSE stage comes to dominate the overall run-time, it is clearly advantageous to find an effective parallelization scheme for this stage. This is discussed below in section 4.3.

From eqn. 16 it appears that the screening calculation should scale directly with the number of bands as the radial grid is independent of system size. However, for large systems much of the time is spent projecting the DFT wave functions onto the radial grid from the planewave basis. Both the number of planewaves and bands grow with system size, leading to this second power behavior. This indicates that an improved approach to projecting the Kohn-Sham states onto the local basis is an avenue to further reducing the time spent in the screening routine.

Per section 3.2, the BSE Hamiltonian is broken into three sections: non-interacting, short-range, and long-range. The non-interacting term is diagonal, and, for a single site, scales directly with system size. The short-range part, like the screening calculation, relies on a small, localized basis that does not change with system size. Also like the screening calculation, for large systems the projection of the DFT orbitals into this localized basis to determine the coefficients A (eqn. 14) will grow with both the number of planewaves and bands. While this projection happens only once at the beginning of the BSE stage, it does lead to scaling with the second power of the system size. The long-range part of the BSE is the most computationally expensive. As shown in algorithm 1, the action of $W_0$ grows with both the number of x-points and the number of empty states, both of which scale linearly with [...]

[...]$5^3$ to a $3^3$ grid). As NSCF DFT calculations scale linearly with the number of k-points this represents a potential 8x speedup (3.4x for the conventional cell). This gain, however, is partially offset by the additional steps needed to carry out the OBF interpolation.

Table 2: The relative time required for each step in generating Bloch functions and the speedup with respect to the total run time achieved by using the OBFs (The total run-time includes steps not explicitly listed in the table).

Supercell (N)       1       2       3       4
Rel. System Size    1       8       27      64
# Processors        8       32      64      128
SCF                 0.48    2.20    6.45    42.5
NSCF                0.06    0.38    7.11    38.6
OBF                 0.18    1.55    12.1    49.7
Total               1.00    4.90    30.3    152
Speedup             1.27x   1.22x   2.24x   2.46x

In Table 2 we show the relative time for the SCF, NSCF, and OBF stages that are needed to generate Bloch functions for the BSE calculation. By assuming that the NSCF would necessarily take 8x (3.4x) longer without the OBF interpolation we estimate the savings as a percentage of the total run-time. While the small cells show only a modest improvement, the $3^3$ and $4^3$ cells complete in less than half the time when using the OBF scheme. Generically, the expected savings will depend most strongly on the needed k-point sampling for the system under investigation. Metallic systems require much denser k-point grids for convergence and will yield correspondingly larger savings.

4.3. BSE Scaling

To investigate the scalability of our parallel BSE solver with processor number we focus on the 4 × 4 × 4 supercell of SrTiO3. This system approaches the limits of single node execution due to memory considerations. A significant amount of time for each run is spent reading in the wavefunctions. For this example the wavefunctions require 12 GB of space (3072 conduction bands, 8 k-points, and 32,768 x-points). The time needed to read these data typically ranged from 40 to 60 seconds [57]. Currently, the wavefunctions are read in by a single MPI task and then distributed, so this time is relatively constant with processor count. To give a better picture of the scaling, we subtract out the time required for this step before comparing runs. If multiple runs are carried out on the same cell, varying atomic site, edge, or x-ray photon (polarization or momentum transfer), the wavefunctions are kept in memory, and therefore this upfront cost will be amortized over all the calculations. In the present example we calculate the spectra for three polarizations, using 200 Haydock iterations.

To assess the scaling of the BSE section we use the metric of cost. The cost reflects what the user would be charged on a computing facility: the run-time multiplied by the number of processors. A perfectly parallelized code would maintain a cost of 1.0 when run across any number of processors. Costs less than 1.0 would indicate some super-scaling behavior, most likely from accidental cache reuse. In figure 2 we show the relative cost of the BSE section as a function of the number of processors. The testbed for this section consists of identical, dual socket nodes with 6 processors (cores) per chip, connected via high-speed interconnects. Each multi-node test was run five times and the best time for each was used.

The BSE section is a hybrid MPI/OpenMP code so we compare execution with various levels of threading. As can be seen in figure 2, the use of threads improves the speed of the calculation over pure MPI. This is true even when 12 threads are used, requiring communication across the two sockets via OpenMP. We also see very acceptable scaling with processor count. The benchmark calculation takes a little over 8.5 hours on a single node and single thread, while running on 192 cores can be done in approximately 3.5 minutes for a real-world speedup of 114x.

[Figure 2: plot; horizontal axis: # of Processors (12, 24, 48, 96, 192); vertical axis: Cost (1 to 2.2); legend: 1, 2, 3, 6, 12 threads]

Figure 2: BSE Cost. The cost, ratio of actual run-time to theoretical linear scaling, as a function of processor count. Each line represents a different number of threads per node. One or two MPI tasks per socket (6 or 3 threads) are seen to give the best performance.

Sublinear scaling, a cost in excess of 1.0, is, in general, the result of non-parallelized sections of code, synchronization costs, and memory bandwidth constraints. With the exception of the aforementioned wavefunction initialization, the BSE code has very little explicit serial execution. Further work is needed to investigate and alleviate the bottlenecks preventing linear scaling.

4.4. Core-hole Screening Scaling

In this section we investigate the timing of calculating the valence electron screening of the core-hole. As stated before, for core-level spectroscopy we are mainly interested in the local electronic response. Limiting our calculation of the RPA susceptibility to a region of space around the absorbing atom makes the calculation much cheaper than traditional planewave approaches without sacrificing accuracy [46]. Within this approach, and for most other methods of calculating core-excitation spectra, a separate screening calculation is required for each atomic site. In systems with inequivalent sites due to differences in bonding, defects, or vibrational disorder, contributions from each atom must be summed to generate a complete spectrum.

To test the behavior of the screening section we use the same test case as for the BSE, the 4 × 4 × 4 supercell of SrTiO3. The RPA susceptibility is calculated using a single k-point and 4096 bands. We show results for a single site, 16 sites, and 32 sites (out of the 64 titanium atoms in our supercell). The number of sites will vary based on the system being investigated. Impurities may require only a single site, but liquids or other disordered systems necessitate averaging over many sites [47, 58]. As is evident in figure 3, this section of the calculation does not scale beyond a few dozen processors. This is especially true when only investigating a single site. While better scaling is desirable, as observed in Table 1 the total time for this section is quite small. In this particular test the screening calculation took just under 8 min for a single site while the initial DFT calculations required approximately 3 hours on 128 processors.

[Figure 3: plot; horizontal axis: # of Processors (16 to 64); vertical axis: Relative Time (0 to 2); legend: 1 Site, 16 Sites, 32 Sites]

Figure 3: Screening scaling. The relative time for the screening section as a function of the number of processors. We show both the total time required for a single site, 16 sites, and 32 sites relative to the time needed to run the single site on 16 processors. The per site time is lower with a larger number of sites due to amortization of I/O and other initialization costs. While the scaling with processor number is poor, we see only very limited detrimental effects from using more processors, and therefore the screening calculation is run using the same processor count as the other stages.

4.5. Example Spectra

Figure 4 presents the Ti L edge of pure SrTiO3 calculated for both the primitive cell (one formula unit) using our previous version of ocean [42] and for the 5 × 5 × 4 supercell (100 formula units) obtained with the code improvements described herein. This comparison serves to verify that the fidelity of the computational scheme has been preserved through the modifications we have made and demonstrates the feasibility of calculating spectra of much larger systems than previously possible. Doping SrTiO3 yields a wide range of interesting physical properties and we imagine that in future investigations it will be fruitful to apply ocean to studies of such systems. Compounds of SrTiO3 doped with transition metals are investigated for use in photocatalysis and as permeable membranes, among other uses. Further, there is some evidence that doping with Mn, Fe and Co may yield dilute magnetic semiconductors [59–61]. The task of improving the performance of such materials is assisted by a better understanding of how the dopant element interacts with the host system. It is now possible to model such spectra with a realistic, first-principles approach.

[Figure 4: plot; horizontal axis: Energy (eV) (452 to 468); vertical axis: Intensity (arb. units); legend: Primitive Cell, 5x5x4 Supercell]

Figure 4: Ti L-edge XAS of SrTiO3. Comparison of the Ti L-edge XAS of SrTiO3 calculated from the primitive cell using abinit and the serial version of ocean and from the 5×5×4 supercell using the OBFs and parallelized ocean code.

As a second example system we consider the organic molecule Tris-(8-hydroxyquinoline)aluminum (Alq3). Alq3 has remained a leading molecule for the electron transport and emitting layer in organic light emitting devices since it was first proposed for this purpose [63]. [...]

[...] simulation. The spectra presented in figure 5 show the average produced from 10 configurations of an MD simulation performed at 300 K within Quantum ESPRESSO. In addition to averaging over the MD configurations, each spectrum is an average over each atomic site in the molecule for the given element. Thus, these results represent a series of 30 calculations for the N edge, 30 calculations for the O edge, and 270 calculations for the C edge (keeping in mind that the same ground state DFT calculation can be used for all edges of a given MD configuration).
Despite Despite numerous academic and industrial investigations the large sampling required to produce these spectra the of such systems, significant problems remain, particularly calculations are not particularly burdensome. After av- regarding device lifetime. The Alq3 molecule is susceptible eraging over all samples and sites, self-energy corrections to decomposition through reaction with water molecules to the electronic structure were incorporated within the [64] or metal atoms from the cathode layer [65, 66]. multi-pole self-energy scheme of Kas et al. [45]. Finally, X-ray spectroscopy is commonly used to further reveal an ad hoc rigid energy shift was applied to each spectrum the interaction of water or metal ions with the Alq3 to align the absolute energies with the experimental values. molecule and the pathway by which the complex decom- poses. However, results are difficult to properly interpret Our calculations are generally in good agreement with because such experimental work is rarely coupled with the measured spectra of Ref. [62]. The primary structure calculated spectra and because the molecule is sensitive of the XAS is reproduced for each element with only a few to decomposition due to x-ray beam exposure. In this minor differences. For carbon, the feature near 289 eV section we demonstrate the ability of ocean to produce in our calculation appears in the experiment around 288 reliable XAS and XES spectra for this system. eV while, for nitrogen, the second feature is about 0.5 eV Alq3 exhibits two isomers, commonly referred to as fa- too low in energy in the calculation. The three peaks of cial and meridional. The meridional isomer is favored en- the oxygen XAS match the experimental spectrum quite ergetically and we consider only this structure. In figure closely. 
The minor differences could originate in differ- 5 we present the C, N and O K-edge XAS and XES from ences in electronic structure between the gas-phase, as we the meridional Alq3 molecule, which consists of 52 atoms. consider, and the condensed-phase material probed in ex- We treat the molecule in the gas-phase, though devices periment. Additionally, we presently neglect vibronic cou- typically contain amorphous films of the molecule. Since pling in the excited-state [67–69] which can be particularly thermal atomic motion can noticeably impact spectral fea- important in molecules with light elements [70, 71]. Nev- tures for lighter elements we sum spectra from a series of ertheless, agreement with experiment is generally quite fa- configurations generated by a (MD) vorable. 9 Carbon Nitrogen Oxygen
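The sampling counts quoted for the Alq3 spectra (30, 30, and 270 calculations) follow from simple bookkeeping, which the short sketch below reproduces. It assumes the standard composition of Alq3, one Al atom bound to three identical C9H6NO ligands; the function and variable names are illustrative and are not part of ocean.

```python
# Hedged sketch: count the core-level calculations behind each averaged spectrum.
# Assumption (not stated explicitly in the text): Alq3 is Al(C9H6NO)3,
# i.e. one Al atom plus three identical 8-hydroxyquinoline ligands.
N_LIGANDS = 3
ATOMS_PER_LIGAND = {"C": 9, "H": 6, "N": 1, "O": 1}
N_MD_CONFIGS = 10  # number of MD configurations averaged in the text


def sites_per_element():
    """Return the number of atomic sites of each element in one molecule."""
    sites = {el: n * N_LIGANDS for el, n in ATOMS_PER_LIGAND.items()}
    sites["Al"] = 1
    return sites


def calculations_per_edge(edges=("C", "N", "O")):
    """One calculation per absorbing site per MD configuration."""
    sites = sites_per_element()
    return {el: N_MD_CONFIGS * sites[el] for el in edges}


if __name__ == "__main__":
    print("atoms per molecule:", sum(sites_per_element().values()))
    print("calculations per edge:", calculations_per_edge())
```

Under this assumption the molecule has 52 atoms, and each K edge requires one calculation per site per configuration, while a single ground-state DFT run per configuration serves all three edges.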

[Figure 5 plot: three panels (Carbon, Nitrogen, Oxygen), each showing XES and XAS intensity (arb. units) versus energy (eV).]

Figure 5: XAS and XES of Alq3. K-edge XAS (color) and XES (black) of carbon (left, red), nitrogen (middle, green) and oxygen (right, blue) for gas-phase meridional Tris-(8-hydroxyquinoline)aluminum (Alq3). Grey curves show the experimental data reproduced from reference [62].
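Before the summary, it may help to make the cost metric of Section 4.3 concrete, since the scaling figures are quoted in terms of it. The sketch below is a minimal illustration rather than code from ocean: the function names are hypothetical, and it assumes the quoted 114x speedup on 192 processors is measured relative to a single processor.

```python
# Hedged sketch of the "cost" metric: run-time multiplied by processor count,
# normalized to a reference run. Function names are illustrative only.


def relative_cost(run_time, n_procs, ref_time, ref_procs=1):
    """Cost of a run relative to a reference run; 1.0 means ideal scaling."""
    return (run_time * n_procs) / (ref_time * ref_procs)


def cost_from_speedup(speedup, n_procs, ref_procs=1):
    """Same metric expressed through the measured speedup over the reference."""
    return n_procs / (speedup * ref_procs)


if __name__ == "__main__":
    # Assumption: the 114x speedup on 192 processors is relative to one processor.
    print(round(cost_from_speedup(114, 192), 2))
```

Under that assumption, a 114x speedup on 192 processors corresponds to a cost of about 1.68, that is, roughly 68% more processor-time than ideal linear scaling would need.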

5. Conclusion

We have implemented a series of improvements that allow core-level spectral calculations based on solving the Bethe-Salpeter equation to be performed for much larger systems than previously possible. Whereas ocean was previously limited to systems of a few tens of atoms, here we have reported calculations on systems as large as a 5x5x4 supercell of SrTiO3, which consists of 500 atoms and 3200 valence electrons and would allow for the direct simulation of doping at the 1% level. This particular calculation required only 12.5 hours on 128 cores. This enhanced capability makes spectral calculations on amorphous and dilute systems feasible.

We reduced the cost of the non-self-consistent field DFT calculation through a k-space interpolation scheme based on a reduced set of optimal basis functions. This yielded a speed-up by a factor of 2-2.5 for large systems. Parallelization of the screening calculation and evaluation of the BSE Hamiltonian provided further savings. We find that the parallelization of the screening response scales only to a few dozen processors. However, since the evaluation of the screening at this level of parallelization has a limited cost compared to the initial DFT calculation, this is not a significant concern at present. The BSE section of the code scales well to a few hundred processors with only a small growth in overhead that appears linear in processor count. When both MPI and OpenMP parallelization are used we achieve a speedup of 114x on 192 processors. However, when only MPI is used there is evidence that the overhead grows superlinearly. There remains room for future improvement to reduce the MPI overhead.

In addition to x-ray absorption spectra and x-ray emission spectra, inelastic x-ray scattering spectra may also be calculated with ocean. We view ocean as uniquely capable of evaluating such spectra, particularly at L edges, with ab initio accuracy and predictive ability for complex and dynamic systems. We envision use of ocean to interpret data collected on a wide range of systems including in operando studies of fuel cell materials, photocatalysts, gas sensor and energy storage materials, as well as from liquid environments.

One should keep in mind that the Bethe-Salpeter equation is one of several approaches to calculating x-ray spectra. While the BSE method holds advantages in being a predictive, first-principles approach, its two-particle formulation, in certain cases, is still a crude approximation to the actual many-body problem. In contrast, many-particle approaches such as multiplet calculations and cluster models capture many-body physics more completely, though typically at the cost of band-structure effects. The challenge for these local methods is to incorporate the electronic structure of the extended system in a scalable fashion. It appears possible to make considerable progress toward this goal by working within a localized basis constructed from an extended electronic structure [19]. For the BSE technique, it will be necessary to incorporate additional many-particle effects. We expect that the better description of self-energy effects made accessible by recent cumulant expansion developments will prove advantageous in this respect [72, 73].

The ocean source code is now available for general use; for details and documentation see [74].

6. Acknowledgments

This work is supported in part by DOE Grant DE-FG03-97ER45623 (KG, JJK, FDV, JJR). KG was additionally supported by the National Natural Science Foundation of China (Grant 11375127), the Natural Science Foundation of Jiangsu Province (Grant BK20130280) and the Chinese 1000 Talents Plan. Calculations were conducted with computing resources at Lawrence Berkeley National Laboratory and the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Certain software packages are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that these are necessarily the best available for the purpose.

References

[1] J. J. Rehr, J. J. Kas, M. P. Prange, A. P. Sorini, Y. Takimoto, F. Vila, Comptes Rendus Physique 10 (2009) 548.
[2] J. J. Rehr, A. J. Soininen, E. L. Shirley, Physica Scripta T115 (2005) 207.
[3] L. Triguero, L. G. M. Pettersson, H. Agren, Physical Review B 58 (1998) 8097.
[4] D. Prendergast, G. Galli, Physical Review Letters 96 (2006) 215502.
[5] P. Rez, J. R. Alvarez, C. Pickard, Ultramicroscopy 78 (1999) 175.
[6] The abinit code is a common project of the Université Catholique de Louvain, Corning Incorporated, and other contributors; www.abinit.org.
[7] X. Gonze, et al., Computer Physics Communications 180 (2009) 2582.
[8] X. Gonze, et al., Zeitschrift fur Kristallographie 220 (2005) 558.
[9] X. Gonze, et al., Computational Materials Science 25 (2002) 478.
[10] P. Giannozzi, et al., Journal of Physics: Condensed Matter 21 (2009) 395502.
[11] C. Gougoussis, M. Calandra, A. P. Seitsonen, F. Mauri, Physical Review B 80 (2009) 075102.
[12] P. Blaha, K. Schwarz, G. K. H. Madsen, D. Kvasnicka, J. Luitz, WIEN2k, An Augmented Plane Wave + Local Orbitals Program for Calculating Crystal Properties, Techn. Universitat Wien, Austria, 2001.
[13] N. A. Besley, M. J. G. Peach, D. J. Tozer, Physical Chemistry Chemical Physics 11 (2009) 10350.
[14] N. A. Besley, F. A. Asmuruf, Physical Chemistry Chemical Physics 12 (2010) 12024.
[15] F. de Groot, A. Kotani, Core Level Spectroscopy of Solids (Advances in Condensed Matter Science), CRC Press, Boca Raton, FL, 2008.
[16] H. Ikeno, I. Tanaka, Physical Review B 77 (2008) 075127.
[17] P. S. Miedema, H. Ikeno, F. M. F. de Groot, Journal of Physics: Condensed Matter 23 (14) (2011) 145501.
[18] C. C. Chen, M. Sentef, Y. F. Kung, C. J. Jia, R. Thomale, B. Moritz, A. P. Kampf, T. P. Devereaux, Physical Review B 87 (2013) 165144.
[19] M. W. Haverkort, M. Zwierzycki, O. K. Andersen, Physical Review B 85 (2012) 165113.
[20] H. Ikeno, T. Mizoguchi, I. Tanaka, Physical Review B 83 (2011) 155107.
[21] I. Josefsson, K. Kunnus, S. Schreck, A. Föhlisch, F. de Groot, P. Wernet, M. Odelius, The Journal of Physical Chemistry Letters 3 (23) (2012) 3565–3570.
[22] E. L. Shirley, Physical Review Letters 80 (1998) 794.
[23] R. D. Leapman, L. A. Grunes, Physical Review Letters 45 (1980) 397.
[24] J. Fink, et al., Physical Review B 32 (1985) 4899.
[25] M. W. Haverkort, et al., Physical Review Letters 95 (2005) 196404.
[26] G. van der Laan, et al., Physical Review B 33 (1986) 4253.
[27] K. C. Prince, et al., Physical Review B 71 (2005) 085102.
[28] C. T. Chen, et al., Physical Review Letters 75 (1995) 152.
[29] C. T. Chen, et al., Physical Review B 43 (1991) 6785.
[30] J. Vinson, J. J. Rehr, Physical Review B 86 (2012) 195135.
[31] J. Schwitalla, H. Ebert, Physical Review Letters 80 (1998) 4586.
[32] A. L. Ankudinov, A. I. Nesvizhskii, J. J. Rehr, Physical Review B 67 (2003) 115120.
[33] A. L. Ankudinov, Y. Takimoto, J. J. Rehr, Physical Review B 71 (2005) 165110.
[34] W. Olovsson, I. Tanaka, T. Mizoguchi, P. Puschnig, C. Ambrosch-Draxl, Physical Review B 79 (2009) 041102(R).
[35] P. Kruger, Physical Review B 81 (2010) 125121.
[36] R. Laskowski, P. Blaha, Physical Review B 82 (2010) 205104.
[37] H. M. Lawler, J. J. Rehr, F. Vila, S. D. Dalosto, E. L. Shirley, Z. H. Levine, Physical Review B 78 (2008) 205108.
[38] S. Albrecht, G. Onida, V. Olevano, L. Reining, F. Sottile, the exc code, www.bethe-salpeter.org.
[39] J. Deslippe, G. Samsonidze, D. A. Strubbe, M. Jain, M. L. Cohen, S. G. Louie, Computer Physics Communications 183 (6) (2012) 1269–1289.
[40] M. Grüning, A. Marini, X. Gonze, Nano Letters 9 (8) (2009) 2820–2824.
[41] When periodic boundary conditions are used spurious interactions between the exciton and its periodic images exist. In this case, it is necessary to use supercells to minimize these interactions.
[42] J. Vinson, J. J. Rehr, J. J. Kas, E. L. Shirley, Physical Review B 83 (2011) 115106.
[43] E. L. Shirley, Physical Review B 54 (1996) 16464.
[44] D. Prendergast, S. Louie, Physical Review B 80 (2009) 235126.
[45] J. J. Kas, A. P. Sorini, M. P. Prange, L. W. Cambell, J. A. Soininen, J. J. Rehr, Physical Review B 76 (2007) 195116.
[46] E. L. Shirley, Ultramicroscopy 106 (2006) 986.
[47] J. Vinson, J. J. Kas, F. D. Vila, J. J. Rehr, E. L. Shirley, Physical Review B 85 (2012) 045101.
[48] R. Haydock, Comput. Phys. Commun. 20 (1980) 11.
[49]
[50] P. E. Blöchl, Physical Review B 50 (1994) 17953–17979.
[51] E. L. Shirley, Journal of Electron Spectroscopy and Related Phenomena 136 (1–2) (2004) 77–83.
[52] A. Ohtomo, H. Y. Hwang, Nature 427 (2004) 423.
[53] N. Reyren, et al., Science 317 (2007) 1196.
[54] A. Brinkman, et al., Nature Materials 6 (2007) 493.
[55] L. Li, et al., Nature Physics 7 (2011) 762.
[56] J. A. Bert, et al., Nature Physics 7 (2011) 767.
[57] This is an effective bandwidth around 60 MB/s which compares poorly to commercial spinning disk contiguous read speeds which range 200-300 MB/s.
[58] J. Vinson, T. Jach, W. T. Elam, J. D. Denlinger, Physical Review B 90 (2014) 205207.
[59] D. Norton, N. Theodoropoulou, A. Hebard, J. Budai, L. Boatner, S. Pearton, R. Wilson, Electrochemical and Solid State Letters 6 (2003) G19.
[60] J. Lee, Z. Khim, Y. Park, D. Norton, J. Budai, L. Boatner, S. Pearton, R. Wilson, Electrochemical and Solid State Letters 6 (2003) J1.
[61] L. M. da Costa Pereira, Searching room temperature ferromagnetism in wide gap semiconductors, thesis, Departamento de Física da Faculdade de Ciências da Universidade do Porto (2007).
[62] A. DeMasi, L. F. J. Piper, Y. Zhang, I. Reid, S. Wang, K. E. Smith, J. E. Downes, N. Peltekis, C. McGuinness, A. Matsuura, Journal of Chemical Physics 129 (2008) 224705.
[63] C. W. Tang, S. A. VanSlyke, Applied Physics Letters 51 (1987) 913.
[64] J. E. Knox, M. D. Halls, H. P. Hratchian, H. B. Schlegel, Physical Chemistry Chemical Physics 8 (2006) 1371.
[65] C. Shen, I. G. Hill, A. Kahn, J. Schwartz, Journal of the American Chemical Society 122 (2000) 5391.
[66] S. Meloni, A. Palma, A. Kahn, J. Schwartz, R. Car, Journal of Applied Physics 98 (2005) 023707.
[67] F. K. Gel’Mukhhanov, L. N. Mazalov, A. V. Kondratenko, Chemical Physics Letters 46 (1977) 133.
[68] S. Tinte, E. L. Shirley, Journal of Physics: Condensed Matter 20 (2008) 365221.
[69] K. Gilmore, E. L. Shirley, Journal of Physics: Condensed Matter 22 (2010) 315901.
[70] M. P. Ljungberg, L. G. M. Pettersson, A. Nilson, Journal of Chemical Physics 134 (2011) 044513.
[71] J. Vinson, J. T. Terrence, W. T. Elam, J. D. Denlinger, Physical Review B 90 (2014) 205207.
[72] J. J. Kas, J. J. Rehr, L. Reining, Physical Review B 90 (2014) 085112.
[73] J. J. Kas, F. D. Vila, J. J. Rehr, A. S. Chambers, Physical Review B 91 (2015) 121112(R).
[74] For details of the ocean code please visit http://feffproject.org/ocean.