
DATA-DRIVEN QUANTUM CHEMISTRY PREDICTIONS FOR UNIQUE

LEWIS ACID-BASE PAIRS AND SMALL ORGANIC MOLECULES

A Thesis

Submitted to the Graduate Faculty

in Partial Fulfilment of the Requirements

for the Degree of Doctor of Philosophy

Molecular and Macromolecular Sciences

Department of Chemistry

Faculty of Science

University of Prince Edward Island

QAMMAR ALMAS

Charlottetown, Prince Edward Island

December 2018

© 2018. Q. Almas

To all of my sincere teachers.

"It matters that you don’t just give up." - Dr. Stephen Hawking

Acknowledgements

First of all, I would like to acknowledge my supervisor, Dr. Jason K. Pearson, for his guidance in the field of computational chemistry and his expertise in directing me through academic technicalities. I thank him for training me to the high standards of scientific research. I also express my gratitude for his constant support throughout all phases of my PhD studies. I am grateful to the members of the Supervisory Committee, namely Dr. Rabin Bissessur, Dr. Brian Wagner and Dr. James Polson, for their productive directions. I am particularly thankful to Dr. Bissessur and Dr. Wagner for their guidance in improving myself in aspects other than computational chemistry and in becoming a better researcher overall. I am greatly thankful to all faculty members of the Chemistry Department. They have all been my teachers and guides in one way or another throughout the years. I would also like to express my gratitude to all the members of administration and management in the Chemistry Department. I particularly thank Ms. Janette Paquet, Ms. Jillian MacDonald and Mr. Stephen Scully, who have been kind and supportive in academic matters as well as in general. I thank them for making my life easier. I am also thankful to all the members of the Pearson Lab throughout my time of studies. In particular, I would like to express my thanks to Dr. Adam Proud, Brenden Sheppard and Dalton K. Mackenzie, who have been supportive colleagues and friends. I am also thankful to all the grad and senior students in the Chemistry Department. It was fun for the most part. I am very grateful to the University of Prince Edward Island for allowing me to use all of the resources required to complete a number of tasks which would hardly have been possible without the facilities provided on campus as well as abroad via the internet.
I am also very grateful to Compute Canada and other research supporting agencies that allowed me to use high performance computing systems (ACE-NET as well as Westgrid) with abundantly available resources and support.

I would like to express my thanks to the Graduate Science Committee for their consistent support in academic matters. I thank Dr. Pedro Quijon, Ms. Colleen Gallant and Dr. Amy Hsiao, who have helped in many official matters where the Chemistry Department was not enough. And last but not least, I thank all of my friends, family members and relatives all over the world. I thank them for their love and support, which encouraged me to complete my PhD studies. I am also thankful to my haters, for they have been a great source of motivation to improve myself.

Abstract

Alongside many other fields of science and technology, big data discoveries and inventions have been emerging in the field of quantum chemistry. This thesis presents two data-driven projects and one investigation that emerged from one of them. In Chapter 3, a test set of 24 computational chemistry models is assessed for its ability to reproduce the potential energy surfaces of 8 unique Lewis acid-base pairs relative to high-accuracy reference calculations. The assessment of density functionals (computational chemistry methods based on density functional theory, which states that all ground-state information of a molecular system is contained in its electron density and that the energy of the system can be calculated as a functional of that density) was carried out by means of an automated Python program applied to data held at a central repository, followed by queries and analytics. The results reveal that density functionals in general are inaccurate for the prediction of potential energy surfaces of the Lewis pairs. During the analysis of the potential energy surfaces of Lewis acid-base pairs, an inflection in the potential energy surface was observed. Chapter 4 is devoted to the investigation of this novel phenomenon. Several medium- and high-level computational models have been employed to reproduce the potential energy surfaces, which are then compared to standard reference calculations. It is shown that the inflection is the result of a competition between the energetic cost of the required pyramidalization of the Lewis acid and the stabilization from the electrostatic potential between the Lewis acid and base for the formation of the dative bond. Chapter 5 focuses on the power of big data; the methods employed, however, are a combination of ab initio quantum mechanics models and machine learning (QM/ML) algorithms.
The position intracules (electron pair distributions) and the electron correlation energies of 5660 small organic molecules composed of hydrogen, carbon, nitrogen, oxygen and sulphur are calculated at the Hartree-Fock and G4 levels of theory, respectively. This dataset was split into test and training fractions of varying percentages for training the QM/ML model and testing its predictions. A kernel ridge regression algorithm has been employed to develop a QM/ML model for the prediction of the correlation energies of these molecules. The regression model contains only a single hyperparameter, sigma, which was found to produce optimum results at a value of 0.00004. The results are then compared to G4 reference correlation energies. It is shown that the predictions approach the so-called "chemical accuracy" of 1 kcal/mol and show great potential for further improvement. The inaccurate performance of density functionals in reproducing potential energy surfaces of Lewis acid-base complexes led to the discovery of an anomalous behavior in the potential energy surface of a phosphine complex. This revelation and its investigation are anticipated to be crucial to Lewis acid-base chemistry. The other constructive outcome of the inaccuracy of density functionals in reproducing the potential energy surfaces in Chapter 3 is the motivation to test an alternative approach to quantum chemistry, such as machine learning. The successful application of a kernel ridge regression based QM/ML model shows the promise of such a model for the prediction of electronic properties of more complex systems like frustrated Lewis pairs.

Table of contents

List of figures

List of tables

1 Introduction
  1.1 Frustrated Lewis Pairs
  1.2 Scope of Thesis

2 Theory and Methods
  2.1 Schrödinger Wave Equation
  2.2 Born-Oppenheimer Approximation
  2.3 Variational Theorem
  2.4 Basis Set Approximation
  2.5 Hartree-Fock Method
  2.6 Post Hartree-Fock Methods
    2.6.1 Møller-Plesset Perturbation Theory
    2.6.2 Configuration Interaction
    2.6.3 Multi-Configurational Self-Consistent Field
    2.6.4 Coupled Cluster Methods
  2.7 Density Functional Theory
    2.7.1 Hohenberg-Kohn Theorems
    2.7.2 Kohn-Sham DFT Formulation
    2.7.3 Density Functional Theory Methods
  2.8 Composite methods
  2.9 Potential Energy Surfaces
  2.10 Machine Learning and Quantum Chemistry

3 Automated Benchmark of Density Functionals for Stretched Dative Bond Complexes
  3.1 Introduction
  3.2 Methods
    3.2.1 Description of Workflow
    3.2.2 Computational Models
    3.2.3 Data Processing
  3.3 Results and Discussion
    3.3.1 Workflow Performance
    3.3.2 Model Chemistry Performance
  3.4 Conclusion

4 A Novel Bonding Mode in Phosphine-Haloboranes
  4.1 Introduction
  4.2 Computational Methods
  4.3 Results and Discussion
    4.3.1 PES of F3B-PH3
    4.3.2 Energy Decomposition
    4.3.3 Molecular Orbital Analysis
    4.3.4 Comparison with Analogous Systems
  4.4 Conclusion

5 Intracules as New Molecular Descriptor in QM/ML
  5.1 Introduction
  5.2 Methods
    5.2.1 Model
    5.2.2 Descriptor
    5.2.3 Dataset
    5.2.4 Preparation
  5.3 Results and Discussion
  5.4 Conclusion

6 Conclusion and Future Perspective

References

Appendix A Python program written for machine learning calculations

Appendix B Data of potential energy surface calculations of selected Lewis pairs

List of figures

1.1 (a) There was no reaction upon mixing trimethylboron and lutidine due to steric hindrance caused by bulky groups attached, and (b) No formation of classic Lewis acid-base adduct due to steric hindrance between bulky moieties attached to LA and LB. (c) The hindrance between LA and LB caused frustration which was overcome by breaking a hydrogen molecule bond and reaction was also reversible at 150°C.

1.2 The scheme for the analysis of the reversible hydrogenation by Lewis acid-base pair with bulky groups attached.

2.1 HF or SCF optimization algorithm.

2.2 Electronic excitations from ground state to singly, doubly and triply excited states. Green represents electrons at ground state whereas red represents excited states. G. state in figure means ground state.

2.3 Schematic of CASSCF and RASSCF depicting permitted electronic excitations.

2.4 Kohn-Sham density functional theory algorithm for optimization energy.

2.5 PES of an arbitrary diatomic molecule where r_AB represents the distance between two nuclei, E(r) represents the potential energy depending on AB bond distance and r_eq is the distance between two atoms at equilibrium structure.

3.1 Organizational chart describing the proposed workflow.

3.2 Potential energy surfaces for H3X-YH3 (X = B, Al; Y = N, P), computed at the reference CCSD(T)/CBS(D,T) level of theory and various model chemistries corresponding to the top performing models as measured by MAAD (EDF2), AAD (BMK), MUE (EDF1), and MSE (EDF1). Also shown are models having the least (Rank 1) and most (Rank 20) average absolute deviations for the specific chemical system shown.

4.1 Comparison of PES scans of F3B-PH3, F3B-NH3 and H3B-PH3 along the dative bond calculated at the density functional M06L/aug-cc-pVTZ level of theory. The blue dot highlights the inflection point at r < r_eq.

4.2 PES scans of F3B-PH3 by several levels of theory are presented here. The blue dots on each curve illustrate the position of inflection points at r < r_eq. r represents the B-P bond length in Angstroms (Å) from the relaxed PES scan.

4.3 PESs of the dative bond in F3B-PH3, BF3, PH3, and the interaction energy between the two components calculated at the M06L/aug-cc-pVQZ level of theory.

4.4 Frontier MO energy level diagrams of the trifluoride fragment of F3B-PH3, calculated at the geometries of several bond lengths, r. Data is from a PBE1PBE/6-311+G* calculation and images are plotted at an isovalue of 0.07 a.u.

4.5 PES scans of the dative bonds in (A) F3B-NH3, (B) H3B-PH3, and (C) Cl3B-PH3 along with their decomposed components and the electrostatic interaction energy between the fragments at various geometries, calculated at the M06L/cc-pVQZ level of theory. The inset plot of part C illustrates E_total for BCl3PH3 magnified.

4.6 Optimized geometry of the BCl3PH3 dative complex at the MP2/aug-cc-pVTZ level of theory.

5.1 Intracule P(u) plots of four randomly selected molecules from the dataset are being used. QM7-1230 has the largest u-point value with a P(u) larger than 0.0001 a.u. All four molecules are covered within this range and they are all different from each other that shows the uniqueness of the descriptor.

5.2 Chart of σ values in terms of the log of mean absolute error (MAE) for the KRR model. Sigma values ranging from 3 x 10^-9 to 3 x 10^-1 are plotted as horizontal axis in the figures. Selected region has been zoomed in so that the minimum of the curve can be seen clearly. The best sigma value that predicted the smallest MAE was found equal to 4 x 10^-5 or σ = 0.00004.

5.3 Algorithm of Python Program employed is depicted as a flow chart. Int represents intracules, Ref. stands for reference, Diff. is the difference and KMat is the abbreviation for kernel matrix.

5.4 MAE in kcal/mol obtained for test fractions ranging from 1 to 90%. Clear trend of decrease in MAEs with the increase in percentage of training fractions.

5.5 The scatter plots of reference correlation energies versus the predicted correlation energies obtained from a 10, 40, 70 and 90% test fractions.

List of tables

3.1 Timings of relevant processes in our workflow, relative to the Solr query response (absolute time = 0.004 s).

3.2 Various performance metrics (in a.u.) of the test set molecules for all models employed, listed in order of ascending MAADM. Methods yielding the smallest errors in each metric are bolded.

3.3 MAADM values (in a.u.) for all models employed when vetted from r_eq to 4.0 Å.

5.1 The QM/ML results predicted at σ equal to 0.00004 are tabulated in terms of maximum error (Max E), minimum error (Min E), mean error (ME), Standard Deviation (SD), mean absolute error (MAE), mean absolute percent error (MAE %). All errors are given in kcal/mol units.

Chapter 1

Introduction

There are numerous computational chemistry models that produce results with varying levels of accuracy for different types of molecular systems and properties of interest. Therefore, the study of any particular kind of molecular system and the determination of a property of interest require the discovery or invention of the "best" computational chemistry model, one that offers high accuracy at an affordable computational cost. The "best" model for a particular purpose is found by assessing the performance of a range of existing computational models against experimental data or a high-level theoretical method. The best performer is then employed in further studies of the systems and properties of interest. In most cases, determining one property of a system requires that many other properties of the system are calculated by default[1, 2]. For example, the calculations performed by any composite computational chemistry method include single point energies, vibrational energies[3] and many other properties of a molecular system which may not be considered useful. The extensive data calculated for such ‘undesired’ properties remain unused but nevertheless occupy storage space. The continuous increase in system- and property-specific models, the associated benchmark studies[1, 4, 2, 5–7] and the consequent generation of ‘undesired’ data contribute to the big data problem in computational chemistry. Despite the fact that computational chemistry data are all digital in nature, only a few developments have been made to provide openly accessible repositories[8–13]. There is a need to develop more fruitful management of computational chemistry data, which can be used to make data-driven discoveries in the field of chemistry. This thesis is a modest effort towards data-driven discoveries that originated sequentially from an automated benchmark study.
A demonstration of the performance of a new repository for benchmarking computational chemistry models for a unique set of stretched Lewis acid-base pairs was the initial research goal. The results obtained led to other interesting aspects of Lewis acid-base chemistry. The performance of the density functional theory methods used was found to be poor for Lewis pairs, and an unusual behavior of a Lewis acid-base pair was also discovered from the benchmark studies. This motivated a thorough investigation of the anomalous behavior and also a search for other computational chemistry models and techniques that could be employed to study Lewis acid-base pairs. The former question was addressed by employing a range of high-level quantum chemistry models, whereas for the latter a relatively new technique was employed. This technique combines a machine learning algorithm with quantum chemistry models, and its performance was assessed in predicting the electron correlation energies of small organic molecules.

1.1 Frustrated Lewis Pairs

The concept of Lewis acid-base reactivity was first proposed by Gilbert Lewis in 1923[14]. He discovered that an electron pair donor donates its electron pair to an electron pair acceptor, resulting in the formation of a stable adduct. This bonding was classified as coordinate covalent bonding or dative bonding. The dative complexes formed were known to be non-reactive adducts. This remained an axiom of main group chemistry for several decades. However, there have been a few exceptions in which the interaction between the highest occupied molecular orbital (HOMO) of a Lewis base (LB) and the lowest unoccupied molecular orbital (LUMO) of a Lewis acid (LA) failed to produce a classic Lewis acid-base adduct. Brown and co-workers[15, 16] studied a series of reactions involving pyridines and found that trimethylboron (B(CH3)3), on mixing with lutidine ((CH3)2C5H3N), did not form any adduct, as shown in Figure 1.1(a). The methyl moieties attached to boron and in lutidine sterically hindered the interaction between the HOMO of nitrogen and the LUMO of boron. Wittig and Benz[17] also found that o-fluorobromobenzene (FPhBr), triphenylphosphine (PPh3) and triphenylborane (BPh3) did not form a Lewis adduct upon mixing. Instead, they produced the PPh3-Ph-BPh3 complex shown in Figure 1.1(b). There are other examples of such exceptions in the literature[18]. However, a recent and the most relevant example is the observation of Stephan and his group in 2006[19, 20]. They observed no formation of a classic Lewis adduct while attempting to produce a dimer of the phosphino-borane complex Mes2P(C6F4)B(C6F5)2. It was realized that phosphorus and boron both have bulky groups attached to them which sterically hinder the formation of a Lewis acid-base bond. The simultaneous attraction and hindrance between the LA and LB caused a ‘frustrating situation’ in which relatively unstable Lewis acid-base complexes are formed. These complexes were termed Frustrated Lewis Pairs (FLPs)[20–23].

Fig. 1.1 (a) There was no reaction upon mixing trimethylboron and lutidine due to steric hindrance caused by bulky groups attached, and (b) No formation of classic Lewis acid-base adduct due to steric hindrance between bulky moieties attached to LA and LB. (c) The hindrance between LA and LB caused frustration which was overcome by breaking a hydrogen molecule bond and reaction was also reversible at 150°C.

In Figure 1.2, a scheme of the reversible cleavage and re-formation of the hydrogen molecule is shown. This mechanism was supported by experimental analyses including crystallography and nuclear magnetic resonance[20]. The resonance in one of the three aromatic rings of B(C6F5)3 induces a partial negative charge (shown as an encircled negative sign) on the Lewis acidic boron and a partial positive charge on the carbon at the meta position to boron. The zwitterion of the complex, B(C6F5)3, is allowed to react with Mes2PH, which does not result in the formation of a typical adduct. Instead, it forms another zwitterion through an aromatic substitution. Mes2PH was too large to coordinate with boron, and the reaction resulted in the formation of 1. Compound 1 is further treated with (CH3)2SiHCl to form 2. Compounds 2 and 3 represent the reversible metal-free hydrogenation by a Lewis acid-base complex: 2 releases a hydrogen molecule when heated up to 150°C, and 3 reverts to 2 when provided with hydrogen gas at room temperature (RT)[19, 23, 24]. This unique catalytic property can be exploited to perform chemical reactions at industrial scale without creating toxic waste involving transition metals. The catalytic chemistry of FLPs has also been reported for several other molecules, including the capture of CO2, CO, NO2, NO and SO2[25–27]. These catalytic reactions are also of industrial importance and are crucial for a cleaner and greener environment. This new reactivity exhibited by Lewis acid-base complexes composed of bulky ligands has attracted many researchers from both experimental and theoretical fields of chemistry. However, for a theoretical investigation of such novel reactivity in rather simple molecular systems, it is vital to find the most promising theoretical approach from a pool of computational methods.

1.2 Scope of Thesis

The background and description of the computational theory are summarized in Chapter 2. Chapter 2 also describes the assumptions made in the development of computational methods. Furthermore, the electronic properties calculated with the methods used are described along with their importance in chemistry. Chapter 3 illustrates the benchmarking of density functional theory (DFT) methods for the study of stretched dative bond complexes (i.e. FLPs). This chapter also summarizes the repository used and the computational tools employed for automated assessment of the DFT methods.

Fig. 1.2 The scheme for the analysis of the reversible hydrogenation by Lewis acid-base pair with bulky groups attached.

DFT methods are purely quantum mechanical methods which have their own merits and demerits in terms of computational cost and accuracy of results. After the assessment of DFT methods, there was a need to discover a different computational approach. In addition, the discovery of an unusual behavior in the potential energy surface of one of the Lewis acid-base complexes (namely F3B-PH3) aroused curiosity about the reason behind it. Chapter 4 describes the investigation and the results for the unusual potential energy surface of F3B-PH3.

Chapter 5 is an attempt to search for a new computational chemistry approach that could be employed to study FLPs. For this, the performance of a combination of quantum mechanics and machine learning has been tested on a set of small organic molecules, which could lead to the study of the chemistry of more complex molecules like FLPs. Chapter 5 is followed by the conclusion of the thesis and a discussion of future work.

Chapter 2

Theory and Methods

2.1 Schrödinger Wave Equation

The early 1900s was the time when the principles of classical mechanics were found to be inadequate to explain the behavior of subatomic particles. Light had been discovered to exhibit both particle and wave nature[28]. Matter, in general, was also found to have wave-particle duality. However, large masses were considered not to exhibit wave-like properties. In 1924, Louis de Broglie proposed that the wavelength associated with a particle is inversely proportional to its mass and velocity (λ = h/mv, where λ represents the wavelength of a particle, h is Planck’s constant, and m and v are the mass and velocity of the particle, respectively)[29]. Thus, the wave nature of subatomic particles is much more pronounced than that of larger matter. In order to comprehend and explain the science of the atomic cosmos, there was a need to develop a concrete formulation to describe the wave nature of matter. Among the many founding fathers of quantum mechanics, Erwin Schrödinger was the one who presented an equation to describe small particles. The equation is now known as the Schrödinger Wave Equation (SWE), given below[30]. The SWE can describe the behavior of subatomic particles exactly, but its exact analytical solutions are limited to hydrogen-like systems.

\[
\hat{H}\Psi = E\Psi \tag{2.1}
\]

where $\hat{H}$ is the Hamiltonian operator applied to the function written to its right. Ψ, read as Psi, is an eigenfunction, referred to as a wavefunction. The wavefunction represents the constituents of the molecular system under consideration and E is the eigenvalue of the wavefunction, that is, the energy of the system. The Hamiltonian contains the components of the total energy of a system and is given as follows:

\[
\hat{H} = -\sum_{i=1}^{N} \frac{1}{2}\nabla_i^2 - \sum_{A=1}^{M} \frac{1}{2m_A}\nabla_A^2 - \sum_{i=1}^{N}\sum_{A=1}^{M} \frac{Z_A}{r_{iA}} + \sum_{i=1}^{N}\sum_{j>i}^{N} \frac{1}{r_{ij}} + \sum_{A=1}^{M}\sum_{B>A}^{M} \frac{Z_A Z_B}{r_{AB}} \tag{2.2}
\]

where the first two terms are the kinetic energy operators of the N electrons and M nuclei, indexed by i and A respectively. Z represents the nuclear charge, whereas in atomic units the electronic charge is -1 and the reduced Planck’s constant and the mass of the electron are both 1. The inverted delta squared ($\nabla^2$) is the Laplacian, which sums the second derivatives with respect to the three spatial coordinates of a particle and is defined as:

\[
\nabla_i^2 = \frac{\partial^2}{\partial x_i^2} + \frac{\partial^2}{\partial y_i^2} + \frac{\partial^2}{\partial z_i^2} \tag{2.3}
\]
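As a worked instance of (2.2), consider the helium atom, chosen here only as an illustration. It has N = 2 electrons and M = 1 nucleus with $Z_A = 2$, so there is no nucleus-nucleus term and the general expression collapses to six terms:

\[
\hat{H}_{\mathrm{He}} = -\frac{1}{2}\nabla_1^2 - \frac{1}{2}\nabla_2^2 - \frac{1}{2m_A}\nabla_A^2 - \frac{2}{r_{1A}} - \frac{2}{r_{2A}} + \frac{1}{r_{12}}
\]

Even in this small case, the electron-electron term $1/r_{12}$ couples the coordinates of the two electrons, which is what prevents a separation into independent one-electron problems.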

In (2.2), the third term represents the attraction between electrons and nuclei, which lowers the total energy and hence carries a negative sign. The fourth and fifth terms represent the electron-electron and nucleus-nucleus repulsions. The SWE can in principle yield the exact energy of the system; however, the description given above reflects the complexity involved in obtaining a solution even for a small molecular system. Paul Dirac’s[31] comment on the SWE, given below, has become an integral part of the history and the present of quantum mechanics: "The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed." - P. A. M. Dirac (1929)

The comment given above was completely true at that time because Dirac could not foresee the invention and rapid advancement of computer hardware and software. For the most part it still holds true today, because there has always been a need to develop approximate methods that achieve increasingly accurate solutions to the SWE with less complicated computations.

2.2 Born-Oppenheimer Approximation

One of the central approximations made to the SWE is the Born-Oppenheimer approximation (BOA), named after Max Born and J. Robert Oppenheimer[32]. This approximation rests on the fact that the mass of a nucleus is much larger than the mass of an electron, which makes nuclear motion negligible compared to electronic motion. Therefore, nuclei can be assumed to be at fixed positions, exerting a constant potential on the surrounding electrons of the system. This approximation simplifies the Hamiltonian in (2.2): the second term, the kinetic energy of the nuclei, is removed, and the last term, the nucleus-nucleus potential ($V_{MM}$), becomes a constant that can be incorporated later into the total electronic energy. Equation (2.2) therefore reduces to (2.4), given as follows:

\[
\hat{H} = -\sum_{i=1}^{N} \frac{1}{2}\nabla_i^2 - \sum_{i=1}^{N}\sum_{A=1}^{M} \frac{Z_A}{r_{iA}} + \sum_{i=1}^{N}\sum_{j>i}^{N} \frac{1}{r_{ij}} + V_{MM} \tag{2.4}
\]

The last term can also be removed because it contributes only a constant for fixed nuclei treated as point charges. Therefore, the Hamiltonian can be expressed as follows:

\[
\hat{H} = -\sum_{i=1}^{N} \frac{1}{2}\nabla_i^2 - \sum_{i=1}^{N}\sum_{A=1}^{M} \frac{Z_A}{r_{iA}} + \sum_{i=1}^{N}\sum_{j>i}^{N} \frac{1}{r_{ij}} \tag{2.5}
\]
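Because the nuclei are fixed under the BOA, the nucleus-nucleus term $V_{MM}$ is a single number for a given geometry. The following is a minimal sketch of that evaluation in atomic units. The function name `nuclear_repulsion` and the H2 geometry (nuclei 1.4 bohr apart) are assumptions made for illustration and are not taken from this thesis.

```python
import itertools
import math


def nuclear_repulsion(charges, coords_bohr):
    """Sum Z_A * Z_B / r_AB over unique nuclear pairs (atomic units)."""
    energy = 0.0
    for a, b in itertools.combinations(range(len(charges)), 2):
        r_ab = math.dist(coords_bohr[a], coords_bohr[b])
        energy += charges[a] * charges[b] / r_ab
    return energy


# Hypothetical example geometry: H2 with the nuclei 1.4 bohr apart.
h2_charges = [1, 1]
h2_coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
v_mm = nuclear_repulsion(h2_charges, h2_coords)
print(f"V_MM = {v_mm:.6f} hartree")
```

Since this value does not depend on the electronic coordinates, it can be added to the electronic energy after the electronic problem has been solved, which is why it is dropped in going from (2.4) to (2.5).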

The wavefunction is a function of the spin and positions of all the electrons (r) and contains all the information about the molecular system under consideration. The squared magnitude of the wavefunction, $|\Psi(r)|^2$, obtained as the product with its complex conjugate, $\Psi^*(r)\Psi(r)$, gives the probability density of the system, from which other properties of the system can be derived. Integrating the probability density over three-dimensional space gives 100% of the density, which is taken as 1:

\[
\int \Psi^*(r)\Psi(r)\,d\tau = 1 \tag{2.6}
\]

where dτ indicates integration over all spatial dimensions. Simplifications were made not only to reduce the mathematical complications; simpler notations were also developed to handle the complicated mathematics. The integral of the wavefunction with its complex conjugate can be written in Dirac’s notation as follows:

\[
\int \Psi^*(r)\Psi(r)\,d\tau = \langle \Psi | \Psi \rangle \tag{2.7}
\]

where $\langle\Psi|$ and $|\Psi\rangle$ are known as the ‘bra’ and the ‘ket’ respectively, and $\Psi^*$ is the complex conjugate of Ψ.
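The normalization condition in (2.6) can be checked numerically for a known wavefunction. The sketch below does this for the hydrogen 1s orbital in atomic units, using a radial grid and an explicit trapezoidal rule. The grid cutoff of 30 bohr and the number of grid points are arbitrary choices made for this illustration.

```python
import numpy as np

# Radial grid in bohr; the 30 bohr cutoff and grid size are assumptions.
r = np.linspace(0.0, 30.0, 20001)

# Hydrogen 1s orbital in atomic units: psi(r) = exp(-r) / sqrt(pi).
psi_1s = np.exp(-r) / np.sqrt(np.pi)

# |psi|^2 multiplied by the area of a spherical shell of radius r.
integrand = psi_1s**2 * 4.0 * np.pi * r**2

# Trapezoidal rule written out explicitly.
norm = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r))
print(f"<psi|psi> = {norm:.6f}")
```

The angular part of the integral is carried out analytically (the factor of 4π), so only the radial integral is done numerically.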

2.3 Variational Theorem

The SWE can in principle yield the exact energy (the ground state energy here and elsewhere in this thesis) of the electronic system, but it does not prescribe how to construct a wavefunction that represents a molecular system. An arbitrary wavefunction depending on the appropriate electronic and nuclear coordinates is guessed and operated on by the Hamiltonian. Traditionally, a guessed wavefunction Φ is taken as a linear combination of the exact eigenfunctions $\Psi_i$, given below as:

\[
\Phi = \sum_i c_i \Psi_i \tag{2.8}
\]

The square of the guessed wavefunction integrated over all dimensions is given as:

\[
\begin{aligned}
\int \Phi^2\,dr &= \int \sum_i c_i \Psi_i \sum_j c_j \Psi_j\,dr \\
&= \sum_{ij} c_i c_j \int \Psi_i \Psi_j\,dr \\
&= \sum_{ij} c_i c_j \delta_{ij} \\
&= \sum_i c_i^2
\end{aligned}
\tag{2.9}
\]

where $\delta_{ij}$ is the Kronecker delta (equal to one when i = j and equal to zero otherwise), which can replace $\int \Psi_i \Psi_j\,dr$ because the $\Psi_i$ are orthonormal. The energy of Φ can be evaluated as:

\[
\begin{aligned}
\int \Phi \hat{H} \Phi\,dr &= \int \Big(\sum_i c_i \Psi_i\Big) \hat{H} \Big(\sum_j c_j \Psi_j\Big)\,dr \\
&= \sum_{ij} c_i c_j \int \Psi_i \hat{H} \Psi_j\,dr \\
&= \sum_{ij} c_i c_j E_j \delta_{ij} \\
&= \sum_i c_i^2 E_i
\end{aligned}
\tag{2.10}
\]

By combining (2.9) and (2.10), and noting that every eigenvalue satisfies $E_i \geq E_0$, where $E_0$ is the ground state energy, the following is obtained:

\[
\int \Phi \hat{H} \Phi\,dr - E_0 \int \Phi^2\,dr = \sum_i c_i^2 (E_i - E_0) \geq 0 \tag{2.11}
\]

Rearranging the above gives an upper bound on $E_0$:

\[
\frac{\int \Phi \hat{H} \Phi\,dr}{\int \Phi^2\,dr} \geq E_0 \tag{2.12}
\]
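The bound in (2.12) can be illustrated with a standard textbook case that is not taken from this thesis: a single Gaussian trial function exp(-αr²) for the hydrogen atom. For this trial function the energy expectation value has the closed form E(α) = 3α/2 - 2√(2α/π) in atomic units, and the exact ground state energy is -0.5 hartree. The sketch below scans α over an arbitrarily chosen range and confirms that no trial energy falls below the exact value.

```python
import numpy as np

EXACT_E0 = -0.5  # exact hydrogen ground-state energy in hartree


def trial_energy(alpha):
    """Energy expectation value of exp(-alpha r^2) for the hydrogen atom."""
    return 1.5 * alpha - 2.0 * np.sqrt(2.0 * alpha / np.pi)


# Scan range and resolution are arbitrary choices for this illustration.
alphas = np.linspace(0.01, 2.0, 2000)
energies = trial_energy(alphas)
best = np.argmin(energies)

print(f"best alpha = {alphas[best]:.4f}, E = {energies[best]:.5f} hartree")
print("all trial energies >= E0:", bool(np.all(energies >= EXACT_E0)))
```

The analytical minimum lies at α = 8/(9π) with E = -4/(3π), which is above -0.5 hartree. The gap reflects the fact that a single Gaussian cannot reproduce the shape of the exact exponential wavefunction.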

The above expression shows the strength of the variational theorem which states that the energy obtained by the guessed wavefunction will be above or equal to the energy obtained by the exact wavefunction of the system. This helps to assess the quality of the wavefunction guessed in the initial state. Different trial wavefunctions can be constructed and assessed to obtain the most accurate results. SWE yields exact analytical solutions for single-electron systems but in reality most of the systems of interest are multi-electronic systems. Historically, wavefunction representing system larger than hydrogen was represented as a product of many one-electron wavefunctions and it was termed as the Hartree-Product (HP) given as:

HP Ψ (r1,r2,...,rN) = ψ1(r1)ψ2(r2)ψ3(r3)...ψN(rN) (2.13) { } where ψ(rn) represents single-electron wavefunctions that are multiplied with each other to represent a total wavefunction Ψ. (Initially, single-electron wavefunctions were based on the hydrogen atom, and thus were called hydrogenic spin-orbitals. Currently wavefunctions are constructed from basis sets which are described later). This product form of the wavefunction 12 Theory and Methods has a crucial flaw. In HP, electrons in the wavefunction can have the same quantum spinwhich goes against Pauli’s exclusion principle. Wolfgang Pauli discovered that two electrons with the same spin cannot be accommodated in one orbital; therefore, electrons must have opposite spin in order to reside the same orbital. In other words, two electrons in one molecular system must have a unique set of quantum numbers. They must be described by a different spin orbital. The electrons, however, in a wavefunction are indistinguishable, meaning that they cannot be distinguished by any experimental means. When a wavefunction is permuted and 2 electrons exchange their positions, the probability density should not change. Electronic systems must also obey the principle of anti-symmetric which states that when two electrons exchange their positions, the new wavefunction must be anti-symmetric to the previous one. Slater discovered that a wavefunction expressed as a determinant of a matrix naturally adheres to the anti-symmetry principle. Henceforth, the wavefunctions were expressed as determinants known as Slater determinants, sometimes denoted by SD in the superscript. The wavefunction could now be seen as a Slater determinant given below:

$$\Psi^{SD}(\mathbf{r}_1,\mathbf{r}_2,\ldots,\mathbf{r}_N) = \frac{1}{\sqrt{N!}}
\begin{vmatrix}
\psi_1(\mathbf{r}_1) & \psi_2(\mathbf{r}_1) & \cdots & \psi_N(\mathbf{r}_1) \\
\psi_1(\mathbf{r}_2) & \psi_2(\mathbf{r}_2) & \cdots & \psi_N(\mathbf{r}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\psi_1(\mathbf{r}_N) & \psi_2(\mathbf{r}_N) & \cdots & \psi_N(\mathbf{r}_N)
\end{vmatrix} \tag{2.14}$$

where $1/\sqrt{N!}$ is the normalization factor for a system of N electrons. Vladimir Fock extended the Slater determinant representation of a wavefunction to the determination of molecular orbitals (MOs). MOs are constructed from atomic orbitals, and atomic orbitals are in turn expressed by mathematical functions for the sake of convenience. These functions are called basis functions, and a set of basis functions is called a basis set. The basis functions are ultimately the building blocks of a wavefunction. The orbitals constructed from them are chosen to be orthonormal. Orthonormality refers to two properties: the inner product of a normalized wavefunction with itself is equal to one, and the inner product of two different non-degenerate solutions to the same SWE is equal to zero.
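The antisymmetry built into Eq. (2.14) can be checked numerically. The following is a minimal sketch, assuming three hypothetical one-dimensional toy orbitals (they are not taken from this thesis) and ignoring spin, so all three electrons are treated as having the same spin. It evaluates the determinant through the Leibniz formula, i.e. as a signed sum over all assignments of electrons to orbitals.

```python
import math
from itertools import permutations


def orbital(k, x):
    """Hypothetical toy 1D orbital number k (0, 1 or 2) evaluated at position x."""
    envelope = math.exp(-0.5 * x * x)
    return (1.0, x, 2.0 * x * x - 1.0)[k] * envelope


def permutation_sign(perm):
    """Return +1 for an even permutation and -1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def slater_determinant(positions):
    """Evaluate Eq. (2.14) as a signed sum over orbital assignments, scaled by 1/sqrt(N!)."""
    n = len(positions)
    total = 0.0
    for perm in permutations(range(n)):
        term = float(permutation_sign(perm))
        for electron, orb in enumerate(perm):
            term *= orbital(orb, positions[electron])
        total += term
    return total / math.sqrt(math.factorial(n))


psi = slater_determinant([0.3, -0.7, 1.1])
psi_swapped = slater_determinant([-0.7, 0.3, 1.1])    # electrons 1 and 2 exchanged
psi_coincident = slater_determinant([0.3, 0.3, 1.1])  # two electrons at the same point

print(psi, psi_swapped, psi_coincident)
```

Exchanging two electrons changes the sign of the value, and placing two same-spin electrons at the same point makes it vanish, which is how the determinant enforces the Pauli principle that the Hartree product lacks.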

2.4 Basis Set Approximation

The wavefunction is a multi-dimensional function, so the analytical solution of the SWE becomes rapidly more complicated with even a small increase in system size. A very useful approximation that reduces this complexity is to construct the wavefunction as a linear combination of basis functions. To represent all N electrons in a system, a basis set is used to construct a trial wavefunction Φ. Each MO is represented by a linear combination of basis functions, and all the MOs together contribute to the total wavefunction that represents the electronic structure of the molecular system. MOs are sometimes themselves referred to as wavefunctions and are denoted by a lowercase φ. A total wavefunction Φ is given below as a linear combination of the φi, which are in turn constructed as a linear combination of atomic orbitals (LCAO):

$$\Phi = \sum_{i=1}^{N} c_i\,\varphi_i \tag{2.15}$$

where $c_i$ is the expansion coefficient of the corresponding function $\varphi_i$. A basis function is selected according to the shape it can represent. Two types of functions are commonly used; one is the Slater function[33], defined as:

$$\varphi^{STO} = N x^{a} y^{b} z^{c} e^{-\zeta r} \tag{2.16}$$

The other is the Gaussian function[34], defined as:

$$\varphi^{GTO} = N x^{a} y^{b} z^{c} e^{-\zeta r^{2}} \tag{2.17}$$

where N is the normalization constant; x, y and z are the Cartesian variables of the polynomial prefactor; the exponents satisfy a + b + c = l, the angular momentum that determines the shape of the orbital; ζ is the orbital exponent; and r ranges from 0 to ∞. The orbitals are named after the type of basis function chosen, so the two types of functions above give Slater type orbitals (STOs) or Gaussian type orbitals (GTOs)[34–36]. GTOs were found to be more useful because integrals involving more than one orbital can be evaluated analytically. This is due to the Gaussian product theorem, which states that the product of any two Gaussians is another Gaussian, and which therefore simplifies calculations involving products of multiple Gaussians. A further consequence of the squared r in the exponent is that Gaussian functions decay faster than Slater functions and are smooth at the nucleus, which makes their gradients straightforward to evaluate.
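The Gaussian product theorem mentioned above can be verified directly. The sketch below is a one-dimensional illustration with arbitrary, hypothetical exponents and centres; it uses the standard closed form in which the product of Gaussians with exponents α and β centred at A and B is a single Gaussian with exponent α + β centred at (αA + βB)/(α + β), scaled by a constant prefactor.

```python
import math


def gaussian(alpha, center, x):
    """Unnormalized 1D Gaussian exp(-alpha * (x - center)^2)."""
    return math.exp(-alpha * (x - center) ** 2)


def gaussian_product(alpha, a_center, beta, b_center):
    """Return (prefactor, exponent, center) of the single Gaussian equal to the product."""
    p = alpha + beta
    center = (alpha * a_center + beta * b_center) / p
    prefactor = math.exp(-alpha * beta / p * (a_center - b_center) ** 2)
    return prefactor, p, center


# Hypothetical exponents and centres chosen only for illustration.
alpha, a_center = 0.8, -0.5
beta, b_center = 1.3, 1.0
prefactor, p, center = gaussian_product(alpha, a_center, beta, b_center)

# Compare the explicit product with the single combined Gaussian on a grid.
grid = [i * 0.1 - 3.0 for i in range(61)]
max_error = max(
    abs(
        gaussian(alpha, a_center, x) * gaussian(beta, b_center, x)
        - prefactor * gaussian(p, center, x)
    )
    for x in grid
)
print(p, center, max_error)
```

The combined centre lies between the two original centres, which is why a four-centre two-electron integral over GTOs reduces to a much simpler two-centre one.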

Currently, there are basis sets constructed from a combination of the two types of functions. For example, in the STO-3G basis set each orbital is a Slater-type orbital that is approximated by a fixed combination (contraction) of three primitive Gaussian functions[35, 36]. Sir John Pople developed more advanced basis sets that offer more flexible orbital shapes. 3-21G is a Pople-type basis set in which the 3 indicates that each core orbital (describing the non-valence electrons of the system) is a contraction of 3 primitive Gaussian functions. The 2 and 1 after the hyphen indicate that each valence atomic orbital is described by two functions: a contraction of 2 primitive Gaussian functions and one single primitive Gaussian function. Pople proposed to keep the core description as it is but to split the valence description into two or more sets, which gives a better orbital shape and a better account of bonding between the atoms of a molecular system. Other functions are included in basis sets to enhance performance. The double-star notation, as in the 6-31G** basis set, indicates polarization functions added to the basis set. The first * adds d polarization functions to the heavy atoms (atoms other than hydrogen), whereas the second * adds p polarization functions to hydrogen atoms[37]. These polarization functions have been found to improve the shapes of the orbitals. Polarization functions of higher angular momentum can also be added, such as 2df, 3p3g etc. Basis sets also include diffuse functions, which are useful for loosely bound electrons in species such as anions. In a basis set such as 6-31++G, the first plus sign indicates diffuse functions added to the heavy atoms, and the second plus sign adds diffuse functions to hydrogen atoms[38]. Diffuse functions are simply Gaussians with small exponents that therefore extend far out into space.
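To make the idea of a contraction concrete, the sketch below builds the hydrogen 1s function of STO-3G from three primitive Gaussians and compares it with the Slater function it approximates. The exponents, contraction coefficients and Slater exponent ζ = 1.24 are the commonly tabulated values as recalled here, not values taken from this thesis; they should be treated as assumptions and checked against a basis-set library before any real use.

```python
import math

# Assumed (commonly tabulated) STO-3G parameters for the hydrogen 1s orbital.
EXPONENTS = (3.42525091, 0.62391373, 0.16885540)
COEFFICIENTS = (0.15432897, 0.53532814, 0.44463454)
ZETA = 1.24  # assumed Slater exponent that the contraction is fitted to


def primitive_overlap(a, b):
    """Overlap of two normalized s-type Gaussians on the same centre."""
    return (2.0 * math.sqrt(a * b) / (a + b)) ** 1.5


def contracted_gto(r):
    """Value of the contracted function: a fixed sum of three normalized primitives."""
    total = 0.0
    for a, c in zip(EXPONENTS, COEFFICIENTS):
        norm = (2.0 * a / math.pi) ** 0.75
        total += c * norm * math.exp(-a * r * r)
    return total


def slater_1s(r):
    """Normalized 1s Slater function with exponent ZETA."""
    return math.sqrt(ZETA ** 3 / math.pi) * math.exp(-ZETA * r)


# Self-overlap of the contraction; close to 1 if the parameters are consistent.
self_overlap = sum(
    ci * cj * primitive_overlap(ai, aj)
    for ai, ci in zip(EXPONENTS, COEFFICIENTS)
    for aj, cj in zip(EXPONENTS, COEFFICIENTS)
)

print(self_overlap)
print(contracted_gto(0.0), slater_1s(0.0))  # at the nucleus
print(contracted_gto(1.0), slater_1s(1.0))  # one bohr away
```

With these parameters the contraction follows the Slater function closely away from the nucleus but falls below it at r = 0, since no finite sum of Gaussians can reproduce the cusp of a Slater function.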
The description given above applies to Pople basis sets; however, there are other families of basis sets. Dunning basis sets, for example, comprise the same kinds of functions with a different notation[39–42]. For example, cc-pVNZ stands for correlation-consistent polarized valence N-zeta, in which each valence orbital is split into N functions. N can be D (double), T (triple), Q (quadruple), 5, 6, 7 and so on. By applying the approximations described above, a guess wavefunction is constructed from basis functions. The guess wavefunction is then evaluated by applying the variational theorem to its eigenvalue (i.e. the energy of the system). The variational theorem states that the energy $E_i$ of a trial wavefunction representing a molecular system can be as low as the exact energy $E_0$ of the system but can never be lower than it, i.e. $E_i \geq E_0$[43, 44]. In this way, a guess wavefunction can be improved systematically by seeking lower and lower energies.
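As a worked illustration of the variational theorem with a Gaussian basis function, consider the hydrogen atom with the single normalized trial function exp(−αr²). This is a textbook example rather than a calculation from this thesis; the closed-form energy used below (kinetic term 3α/2, potential term −2√(2α/π), in hartree) is the standard result for that trial function. The sketch scans α and confirms that no value gives an energy below the exact −0.5 hartree.

```python
import math

EXACT_ENERGY = -0.5  # hartree, exact ground-state energy of the hydrogen atom


def trial_energy(alpha):
    """Energy (hartree) of the normalized trial function exp(-alpha r^2) for hydrogen."""
    kinetic = 1.5 * alpha
    potential = -2.0 * math.sqrt(2.0 * alpha / math.pi)
    return kinetic + potential


# Scan the variational parameter on a simple grid and keep the lowest energy.
alphas = [0.001 * k for k in range(1, 2001)]
best_alpha = min(alphas, key=trial_energy)
best_energy = trial_energy(best_alpha)

print(best_alpha, best_energy)
```

The best single Gaussian has α close to 8/(9π) and an energy close to −4/(3π), roughly −0.424 hartree, which lies above the exact value as the theorem requires. Adding more basis functions lowers the energy towards −0.5 hartree but never below it.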

2.5 Hartree-Fock Method

Different procedures are employed to obtain efficient and cost-effective results from approximate models. The Hartree-Fock (HF) method is one of the earliest, in which a guess wavefunction is constructed from basis functions. The guess wavefunction is represented as a Slater determinant, and the resulting equations are solved iteratively for the lowest energy until the energy obtained in successive iterations agrees within some threshold. This approach is called the self-consistent field (SCF) method of approximately solving the SWE. The major contributors to the development of the SCF method were Hartree and Fock, for whom the approach is also called the Hartree-Fock method. The mathematics of the HF method for the approximation of a wavefunction was described by Roothaan and Hall[45]. The Hamiltonian operator in the HF method can be expressed as the Fock operator:

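$$\hat{F} = \hat{h} + \hat{V}_{HF} \tag{2.18}$$

where $\hat{h}$ represents the one-electron operator and $\hat{V}_{HF}$ represents the operator for the average repulsion exerted on one electron by all the other electrons. Written out for electron i and M nuclei of charge $Z_A$:

$$\hat{F} = -\frac{1}{2}\nabla_i^{2} - \sum_{A=1}^{M}\frac{Z_A}{r_{iA}} + \hat{V}_{HF} \tag{2.19}$$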

The inter-electronic repulsion in the Fock operator is described by two operators, known as the Coulomb operator ($\hat{J}_i$) and the exchange operator ($\hat{K}_i$). The Fock operator can then be expressed as shown below:

$$\hat{F} = -\frac{1}{2}\nabla_i^{2} - \sum_{A=1}^{M}\frac{Z_A}{r_{iA}} + \sum_{i=1}^{N}\left(\hat{J}_i(\mathbf{r}_i) - \hat{K}_i(\mathbf{r}_i)\right) \tag{2.20}$$

The SWE for the HF method is obtained by applying the Fock operator to a wavefunction Ψ, which gives the energy as the eigenvalue:

$$\hat{F}\Psi = E\Psi \tag{2.21}$$

The HF energy is the sum of the one-electron integrals $h_i$ and the two types of two-electron integrals (Coulomb and exchange), as shown below:

$$E_{HF} = \sum_{i=1}^{N} h_i + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(J_{ij} - K_{ij}\right) + V_{MM} \tag{2.22}$$

where the double summation over the two-electron integrals is divided by 2 because it would otherwise count the interaction of every pair of electrons twice, and $V_{MM}$ is added to include the nuclear-nuclear repulsion in the total HF energy. The computational cost of the HF method scales as $O(N^4)$, where N represents the number of basis functions. The energy predicted by the HF method is not the exact energy of the system, because each electron is assumed to experience only the mean field of all the other electrons and the Coulomb and exchange integrals therefore miss the instantaneous correlation of the electronic motions. This error is defined as the difference between the exact energy of the system and the HF energy and is termed the correlation energy $E_c$, expressed below:

$$E_c = E_{exact} - E_{HF} \tag{2.23}$$

A schematic algorithm of the HF procedure is presented in Figure 2.1. It starts with the choice of a basis set, which contains the functions representing the atomic orbitals, followed by the choice of a geometry for the molecular system to be studied. All one- and two-electron integrals are then computed and an initial density matrix is guessed. The secular equation is constructed and solved, and a new density matrix is built from the occupied HF MOs. This may or may not produce a density matrix different from the previous one. If the new matrix is not sufficiently similar to the previous one, it replaces the previous matrix and the secular equation is solved again. If the two matrices agree within a threshold, the SCF cycle has converged. When no geometry optimization is requested, the output data are produced for the unoptimized geometry. Otherwise, the geometry is tested against the given optimization criteria. If they are satisfied, the output data are for the optimized geometry of the molecule. If they are not, a new geometry is chosen by the optimization algorithm and a new iteration starts from computing the integrals and guessing a new density matrix. The process is repeated until consistent results are achieved, which is why it is also referred to as the self-consistent field method. This process is generally adequate for geometry optimization; however, it does not account for $E_c$, which is crucial when studying processes involving inter-molecular interactions.

[Flowchart omitted; its steps are described in the paragraph above.]

Fig. 2.1 HF or SCF optimization algorithm.
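The self-consistency loop at the centre of Fig. 2.1 can be sketched with a deliberately tiny model. The code below is not a Hartree-Fock implementation: it assumes a hypothetical two-site system with one occupied orbital, a made-up one-electron matrix `H_CORE` and a made-up repulsion strength `U`, and a mean-field matrix that depends on the current site populations. It only illustrates the cycle of guessing a density, building the matrix, solving the eigenproblem, rebuilding the density and comparing with the previous one.

```python
import math

H_CORE = ((0.0, -1.0), (-1.0, 0.5))  # hypothetical one-electron matrix
U = 1.0                               # hypothetical on-site repulsion strength


def lowest_orbital_density(fock):
    """Diagonalize a symmetric 2x2 matrix analytically and return the site
    populations (squared coefficients) of its lowest eigenvector."""
    (a, t), (_, b) = fock
    delta = a - b
    root = math.sqrt(delta * delta + 4.0 * t * t)
    p1 = 0.5 * (1.0 - delta / root)
    return (p1, 1.0 - p1)


def build_fock(density):
    """Mean-field matrix: core part plus a repulsion proportional to the
    population already present on each site."""
    return (
        (H_CORE[0][0] + U * density[0], H_CORE[0][1]),
        (H_CORE[1][0], H_CORE[1][1] + U * density[1]),
    )


def scf(initial_density=(0.5, 0.5), threshold=1e-10, max_cycles=100):
    """Iterate until the density stops changing within the threshold."""
    density = initial_density
    for cycle in range(1, max_cycles + 1):
        new_density = lowest_orbital_density(build_fock(density))
        change = abs(new_density[0] - density[0])
        density = new_density  # "replace matrix" step of the flowchart
        if change < threshold:
            return density, cycle
    raise RuntimeError("SCF did not converge")


density, cycles = scf()
print(density, cycles)
```

In a real calculation the density matrix has the dimension of the basis set, the eigenproblem is the Roothaan-Hall secular equation, and convergence accelerators are normally used in place of this plain fixed-point iteration.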

2.6 Post Hartree-Fock Methods

In order to approximate the exact energy of a chemical system, several modifications have been made to the HF method by incorporating additional components into the HF energy equation. Such ‘advanced’ HF methods are generally called post-HF methods. Post-HF methods aim to calculate $E_c$ by treating the electron-electron repulsion explicitly rather than as an average over all electronic effects. Generally, the correlation energy is added as a small correction to the ground-state energy obtained by HF. Commonly employed post-HF methods are described in the following:

2.6.1 Møller-Plesset Perturbation Theory

Møller-Plesset (MP) perturbation theory is one of the early post-HF theories that adds $E_c$ to the HF energy. Perturbation theory is an approach for systematically improving a previously obtained solution to a problem. The error that the HF approximation makes in the Coulomb and exchange potentials of the many-electron problem can be treated in this way. Each perturbation corrects the previous Hamiltonian so that it approaches the true Hamiltonian, and the total Hamiltonian is constructed as a sum of the perturbation terms. This can be expressed mathematically as:

$$\hat{H} = \hat{H}_0 + \hat{H}_1 + \hat{H}_2 + \cdots \tag{2.24}$$

$\hat{H}_0$ is the unperturbed Hamiltonian obtained from the HF method, which Møller and Plesset took as the zeroth-order term. The same expansion applies to the corresponding wavefunctions $\Psi_n$ and their eigenvalues $a_n$. MP2 is second-order perturbation theory, and its computational cost scales as $O(N^5)$ for N basis functions. MP3, MP4 and higher orders are used less frequently because of the rapid increase in computational cost. MPn methods are generally expensive and do not obey the variational theorem, so they can overcorrect the correlation energy; going to a higher order of perturbation is therefore not always a good choice[46].
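The non-variational character of a truncated perturbation series can be seen in the simplest possible case. The sketch below is a generic two-level model, not MP2 itself: the level energies and the coupling are hypothetical numbers, and the second-order expression is the standard Rayleigh-Schrödinger result for a purely off-diagonal perturbation. It compares the second-order estimate of the lower level with the exact lowest eigenvalue.

```python
import math


def exact_ground_energy(e1, e2, v):
    """Lowest eigenvalue of the 2x2 matrix [[e1, v], [v, e2]]."""
    mean = 0.5 * (e1 + e2)
    half_gap = 0.5 * (e2 - e1)
    return mean - math.sqrt(half_gap * half_gap + v * v)


def second_order_energy(e1, e2, v):
    """Energy of the lower level through second order. The first-order term
    vanishes because the perturbation has no diagonal elements."""
    return e1 - v * v / (e2 - e1)


# Hypothetical unperturbed levels and coupling strength.
e1, e2, v = 0.0, 1.0, 0.3
exact = exact_ground_energy(e1, e2, v)
estimate = second_order_energy(e1, e2, v)

print(exact, estimate)
```

For this model the second-order energy always lies at or below the exact eigenvalue, so the correction overshoots. This mirrors the behaviour noted above: a perturbative estimate is not an upper bound to the true energy.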

2.6.2 Configuration Interaction

Configuration interaction (CI) is, in its complete form, the most accurate approach to correcting for the correlation energy[43]. CI starts from the HF ground state and then promotes electrons into the virtual orbitals. The excited configurations contribute to a better description of the correlation energy. These excitations are classified as single, double, triple and beyond. Singly, doubly, triply, et cetera, excited configurations are included as separate determinants and are then linearly combined with the HF determinant, as given below:

$$\Psi_{CI} = a_0\Psi_{HF} + \sum_{S} a_S\Psi_S + \sum_{D} a_D\Psi_D + \sum_{T} a_T\Psi_T + \cdots \tag{2.25}$$

where S, D and T in the subscripts represent single, double and triple excitations. Figure 2.2 depicts the idea behind CI and the rise in the number of determinants to be handled: in addition to the ground-state HF determinant, there are different classes of excited determinants. CI methods obey the variational principle and allow additional excited-state determinants to be included in order to achieve increasingly accurate results. Exciting all the electrons to all possible excited states is called Full CI; with a complete basis set it gives the exact energy. Full CI includes as many determinants as there are distinct excited configurations, which makes these calculations very costly; nevertheless, the accuracy approaches that of experimental results. Truncated variants are named after the excitations included: CIS stands for configuration interaction with single excitations, CID includes double excitations, and so on.

[Energy-level diagram omitted: six orbital levels, numbered 1 to 6, shown for the ground state (G. State) and for singly, doubly and triply excited configurations.]

Fig. 2.2 Electronic excitations from ground state to singly, doubly and triply excited states. Green represents electrons at ground state whereas red represents excited states. G. state in figure means ground state.
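The growth in the number of determinants can be estimated by simple counting. The sketch below assumes a hypothetical system of 4 electrons in 10 spin orbitals and ignores any reduction from spin or spatial symmetry: an n-fold excitation chooses n of the occupied spin orbitals to vacate and n of the virtual spin orbitals to fill.

```python
from math import comb


def excited_determinants(n_electrons, n_spin_orbitals, level):
    """Number of determinants reached by exactly `level` simultaneous excitations."""
    n_virtual = n_spin_orbitals - n_electrons
    return comb(n_electrons, level) * comb(n_virtual, level)


def full_ci_determinants(n_electrons, n_spin_orbitals):
    """All ways of distributing the electrons over the spin orbitals."""
    return comb(n_spin_orbitals, n_electrons)


# Hypothetical small system used only for illustration.
n_electrons, n_spin_orbitals = 4, 10
counts = {
    level: excited_determinants(n_electrons, n_spin_orbitals, level)
    for level in range(n_electrons + 1)
}

print(counts)
print(sum(counts.values()), full_ci_determinants(n_electrons, n_spin_orbitals))
```

Level 0 is the HF reference itself. Summing over all excitation levels reproduces the Full CI count, and the binomial growth of that count with system and basis-set size is what restricts Full CI to very small molecules.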

2.6.3 Multi-Configurational Self-Consistent Field

Multi-configurational self-consistent field (MCSCF) is a high-level electronic structure method used to generate qualitatively correct reference states in cases where a single HF reference ground state is inadequate. The exact electronic wavefunction of a molecular system is approximated by a linear combination of configuration determinants. An MCSCF calculation adjusts both the coefficients of the determinants and the coefficients of the basis functions in order to obtain the total electronic wavefunction with the lowest possible energy. Thus, MCSCF is a combination of the CI and HF approaches, expanding the wavefunction in additional determinants while also varying the MOs in each determinant. MCSCF wavefunctions are also employed in multi-reference CI (MRCI) and multi-reference perturbation theories (MRPT). Complete Active Space SCF (CASSCF) is a subclass of MCSCF in which the configuration state functions (CSFs, spin-adapted combinations of Slater determinants sharing the same quantum numbers) arising from all possible excitations of electrons within a given number of orbitals are considered. A CASSCF calculation is usually written as CASSCF(n,o), where n represents the number of electrons and o represents the total number of orbitals available to be occupied by those n electrons. Restricted Active Space SCF (RASSCF) is another approach in this category, which is useful for lowering the computational cost incurred when all electrons are considered active for excitations. In RASSCF, only a few electrons are considered for excitation. For example, one can treat only the excitations out of the highest occupied MOs (HOMOs) as active in a particular orbital space of a molecular system. This reduces the computational cost several-fold. Figure 2.3 shows a schematic of the CASSCF and RASSCF procedures. On the left side of the figure, six orbitals are selected as the complete active space, of which three are occupied and three are completely unoccupied. The six electrons are allowed to be excited into the three empty orbitals.

On the right side, three different spaces are selected, named RAS1, RAS2 and RAS3, and the following procedure is generally adopted. RAS2, shown as a red block, has two occupied and two unoccupied orbitals. Its electrons are allowed to be excited to all available empty orbitals within the given restricted space, which can be considered a Full CI for the selected space. From RAS1, only double excitations are permitted into the empty orbitals of RAS2 and RAS3; that is, the maximum number of electrons excited from RAS1 into each of the other spatial regions is 2. For this reason, the excitation mode of RAS1 and RAS3 can also be described as CISD.

[Orbital diagram omitted: a single CAS block on the left; RAS1, RAS2 and RAS3 blocks on the right.]

Fig. 2.3 Schematic of CASSCF and RASSCF depicting permitted electronic excitations.
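The cost of a complete active space can be gauged by counting Slater determinants for CASSCF(n,o). The sketch below is a counting exercise only, under the assumption that determinants are enumerated for the spin projection equal to the total spin, so the α and β electrons are distributed independently over the o active orbitals. The helper name `cas_determinants` is hypothetical, and the number of spin-adapted CSFs is smaller than the determinant count it returns.

```python
from math import comb


def cas_determinants(n_electrons, n_orbitals, multiplicity=1):
    """Number of Slater determinants in CAS(n_electrons, n_orbitals) for the
    given spin multiplicity, taking M_S equal to S."""
    n_unpaired = multiplicity - 1
    if n_unpaired > n_electrons or (n_electrons - n_unpaired) % 2:
        raise ValueError("electron count and multiplicity are incompatible")
    n_beta = (n_electrons - n_unpaired) // 2
    n_alpha = n_electrons - n_beta
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)


small = cas_determinants(2, 2)     # two electrons in two orbitals
medium = cas_determinants(6, 6)    # the six-in-six space of the CAS example above
large = cas_determinants(12, 12)   # doubling the active space

print(small, medium, large)
```

Doubling the active space from six to twelve orbitals increases the count by more than three orders of magnitude, which is the motivation for restricting excitations as in RASSCF.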

2.6.4 Coupled Cluster Methods

Coupled cluster (CC) methods add all types of excitation corrections to the reference wavefunction through an exponential operator, so that each correction is effectively carried to infinite order[47, 48]. CC methods are denoted by the acronyms CCS, CCSD, CCSDT etc., where S, D and T stand for single, double and triple excitations, respectively. The excitation operator T is defined as:

$$T = T_1 + T_2 + T_3 + \cdots + T_N \tag{2.26}$$

On the right-hand side of the above equation, each $T_i$ generates all i-fold excited determinants from the reference wavefunction $\Psi_0$. For example, $T_1$ and $T_2$ are expressed as follows:

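$$T_1\Psi_0 = \sum_{i}^{occ}\sum_{a}^{vir} t_i^{a}\,\Psi_i^{a} \tag{2.27}$$

$$T_2\Psi_0 = \sum_{i<j}^{occ}\sum_{a<b}^{vir} t_{ij}^{ab}\,\Psi_{ij}^{ab} \tag{2.28}$$

where the coefficients t are called amplitudes. The CC wavefunction is written as:

$$\Phi_{CC} = e^{T}\Psi_0 \tag{2.29}$$

where

$$e^{T} = 1 + T + \frac{1}{2!}T^{2} + \frac{1}{3!}T^{3} + \cdots \tag{2.30}$$

Substituting (2.26), $e^{T}$ can be expanded as follows:

$$e^{T} = 1 + T_1 + \left(T_2 + \frac{1}{2!}T_1^{2}\right) + \left(T_3 + T_2T_1 + \frac{1}{3!}T_1^{3}\right) + \cdots \tag{2.31}$$

where the first term generates the reference wavefunction and the second generates all singly excited states. The third term, which is the first parenthesis, generates all doubly excited states, the fourth term generates all triply excited states, and so on. The SWE for CC can be given as:

$$\hat{H}e^{T}\Psi_0 = E_{CC}\,e^{T}\Psi_0 \tag{2.32}$$

The expectation value for the energy of a CC wavefunction can in principle be determined by employing the variational principle, as given in the following:

$$E_{CC} = \frac{\langle e^{T}\Psi_0|\hat{H}|e^{T}\Psi_0\rangle}{\langle e^{T}\Psi_0|e^{T}\Psi_0\rangle} \tag{2.33}$$

In practice, multiplying both sides of (2.32) from the left by the conjugate of the reference wavefunction and integrating gives the following:

$$\langle\Psi_0|\hat{H}e^{T}|\Psi_0\rangle = E_{CC}\,\langle\Psi_0|e^{T}\Psi_0\rangle \tag{2.34}$$

Because all excited determinants are orthogonal to the normalized reference $\Psi_0$, the overlap on the right-hand side equals one, so that

$$E_{CC} = \langle\Psi_0|\hat{H}e^{T}|\Psi_0\rangle \tag{2.35}$$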
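The grouping of terms in Eq. (2.31) can be reproduced mechanically. The sketch below assumes that the cluster operators commute, so the exponential of their sum factorizes into a product of exponentials and the coefficient of any product of operators is the product of 1/m! over the multiplicity m of each repeated operator. Each integer partition of the excitation rank then corresponds to one term. The function names are hypothetical helpers for this illustration.

```python
from fractions import Fraction
from math import factorial


def partitions(rank, largest=None):
    """Yield the integer partitions of `rank` as non-increasing tuples."""
    if largest is None:
        largest = rank
    if rank == 0:
        yield ()
        return
    for part in range(min(rank, largest), 0, -1):
        for rest in partitions(rank - part, part):
            yield (part,) + rest


def cluster_terms(rank):
    """Map each product of T_k operators with total excitation `rank` to its
    coefficient in exp(T1 + T2 + ...). The key (2, 1) stands for T2*T1."""
    terms = {}
    for parts in partitions(rank):
        coefficient = Fraction(1)
        for k in set(parts):
            coefficient /= factorial(parts.count(k))
        terms[parts] = coefficient
    return terms


doubles = cluster_terms(2)
triples = cluster_terms(3)
quadruples = cluster_terms(4)

print(doubles)
print(triples)
print(quadruples)
```

The double and triple groups match the two parentheses of Eq. (2.31). The quadruple group shows the terms that would come next, including the product of two double excitations.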

The above equation, when solved with all the excited-state determinants generated by the operators up to $T_N$, would predict an energy equivalent to the Full CI energy, which is also the exact energy. These calculations are prohibitively expensive for all but the smallest molecular systems, so truncation of the expansions involved in the CC method becomes essential. The amplitudes of the truncated terms are set equal to zero. The results produced are then no longer exact, yet they remain highly accurate.

Coupled cluster doubles (CCD), the lowest-level CC method, involves the operator $T = T_2$. For CCSD the operator is $T = T_1 + T_2$, and for CCSDT it is $T = T_1 + T_2 + T_3$. CCSDT scales as $O(N^8)$ for N basis functions. A hybrid CC method, CCSD(T), is also commonly employed, in which the triples contribution (T) is estimated by perturbation theory in the manner of MP4. CCSD(T) scales as $O(N^7)$. The accuracy obtained by CC methods relative to the HF method (which recovers no correlation energy) can be generalized in increasing order as follows:

HF << CCD < CCSD < CCSD(T) < CCSDT < CCSDTQ...
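As a short worked example of what truncation does and does not discard, consider CCD, where only the double-excitation operator is kept. Writing out the exponential from the general series for this case gives:

```latex
\begin{aligned}
\Phi_{CCD} &= e^{T_2}\Psi_0 \\
           &= \left(1 + T_2 + \frac{1}{2!}T_2^{2} + \frac{1}{3!}T_2^{3} + \cdots\right)\Psi_0
\end{aligned}
```

The term in $T_2^{2}$ generates quadruply excited determinants as products of two double excitations, and $T_2^{3}$ generates hextuply excited ones. A truncated CC wavefunction therefore still contains higher excitations in this product form, whereas CID, which is linear in the double excitations, contains none beyond doubles.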

There are also other post-HF methods, such as Quadratic Configuration Interaction and Complete Basis Set methods, which are beyond the scope of this thesis. However, quantum chemistry composite methods, which are also post-HF methods, are included in later sections. All post-HF methods recover correlation energy at the expense of computational cost. In general, the higher the accuracy achieved, the greater the computational cost, and post-HF methods are prohibitively expensive for all but very small systems. Despite the availability of high-accuracy methods, it is therefore desirable to have an alternative approach that offers high accuracy without requiring such complex calculations. The root of the mathematical complexity in HF and post-HF methods is the wavefunction representation of molecular systems. As the wavefunction depends on one spin and three spatial coordinates for each electron, the calculations become more complicated with every electron added. There are, however, computational chemistry methods that do not rely on the wavefunction representation of molecular systems.

2.7 Density Functional Theory

Density functional theory (DFT) is the most widely employed quantum mechanical theory used to predict the observables of a molecular system. DFT methods are computationally inexpensive and generally predict results with higher accuracy than HF. HF, the basic wavefunction method, formally scales as $O(N^4)$ and does not include electron correlation energy. In contrast, DFT formally scales as $O(N^3)$, and its performance is comparable to that of many post-HF methods[43].
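The fourth-power scaling quoted for HF comes from the two-electron integrals, each of which carries four basis-function indices. The sketch below counts them for N basis functions, assuming real basis functions so that each integral has the usual eight-fold permutational symmetry; the function names are hypothetical. A brute-force enumeration for a small N confirms the closed-form count.

```python
def total_two_electron_integrals(n_basis):
    """Every combination of four basis-function indices."""
    return n_basis ** 4


def unique_two_electron_integrals(n_basis):
    """Symmetry-unique integrals (ij|kl) for real basis functions."""
    pairs = n_basis * (n_basis + 1) // 2
    return pairs * (pairs + 1) // 2


def brute_force_unique(n_basis):
    """Count unique integrals by reducing every index quadruple to a canonical form."""
    seen = set()
    indices = range(n_basis)
    for i in indices:
        for j in indices:
            for k in indices:
                for l in indices:
                    bra = (max(i, j), min(i, j))
                    ket = (max(k, l), min(k, l))
                    seen.add((max(bra, ket), min(bra, ket)))
    return len(seen)


for n_basis in (5, 10, 20, 40):
    print(n_basis, total_two_electron_integrals(n_basis),
          unique_two_electron_integrals(n_basis))
```

Symmetry reduces the count by a factor approaching eight, but the growth with the fourth power of N remains. Doubling the basis set multiplies the number of integrals by roughly sixteen.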

DFT is based on the idea that the electron density contains all the ground-state information of a molecular system. Integrating the electron density over all space gives the number of electrons. The points in space where the electron density has a maximum indicate the presence of nuclei, and can be utilized to determine the positions and types of the nuclei. The charge of any nucleus can be determined from the gradient of the density at that point of maximum electron density. The mathematical expressions are given as:

$$N = \int \rho(\mathbf{r})\,d\mathbf{r} \tag{2.36}$$

where N represents the number of electrons and r represents a position in three-dimensional space. The atomic number of a nucleus M with nuclear charge $Z_M$ is related to the radial derivative of the spherically averaged density $\bar{\rho}$ as follows:

$$\left.\frac{\partial\bar{\rho}(r_M)}{\partial r_M}\right|_{r_M=0} = -2Z_M\,\rho(r_M) \tag{2.37}$$

where $r_M$ is the distance from nucleus M. The above gives a logical description of why the electron density is a complete source of information about an electronic system. The rigorous foundation was provided by the two Hohenberg-Kohn theorems[49, 50], which are described next.
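Both relations can be checked on the one system for which the density is known in closed form. The sketch below assumes the exact ground-state density of the hydrogen atom in atomic units, ρ(r) = e^(−2r)/π. It integrates the density numerically to recover the number of electrons, and uses a finite-difference slope at the nucleus to recover the nuclear charge from the cusp relation.

```python
import math

Z = 1.0  # nuclear charge of hydrogen


def density(r):
    """Exact ground-state density of a one-electron atom, Z^3/pi * exp(-2 Z r)."""
    return Z ** 3 / math.pi * math.exp(-2.0 * Z * r)


def simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


# Number of electrons: integrate the density over all space in spherical shells.
n_electrons = simpson(lambda r: 4.0 * math.pi * r * r * density(r), 0.0, 30.0, 6000)

# Cusp relation: slope at the nucleus equals -2 Z rho(0), so Z can be recovered.
h = 1e-6
slope_at_nucleus = (density(h) - density(0.0)) / h
recovered_charge = -slope_at_nucleus / (2.0 * density(0.0))

print(n_electrons, recovered_charge)
```

The density alone thus yields the electron count and the identity of the nucleus, which is the information needed to write down the Hamiltonian of the system.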

2.7.1 Hohenberg-Kohn Theorems

The first Hohenberg-Kohn theorem states that the ground-state electron density determines the corresponding ground-state wavefunction. In other words, the wavefunction Ψ of a molecular system is a functional of the ground-state density $\rho_0$. Therefore, all ground-state properties are functionals of the ground-state electron density:

$$E = E[\rho(\mathbf{r})] = E[\Psi(\mathbf{r})] \tag{2.38}$$

where E is the energy and [ ] denotes a functional. The second theorem states that the electron density that minimizes the energy functional is the true ground-state electron density; that is, the energy as a functional of the density obeys the variational theorem. Mathematically, this can be given as:

$$E[\rho(\mathbf{r})] \geq E_0[\rho_0(\mathbf{r})] \tag{2.39}$$

These two theorems explain conceptually why the electron density can be used to determine the observables of a molecular system while avoiding the expensive computations involved with the wavefunction.

2.7.2 Kohn-Sham DFT Formulation

Kohn and Sham considered a fictitious system of non-interacting electrons having the same density as the real system and proposed the Kohn-Sham (KS) self-consistent field method, in which the total energy functional is divided into terms that can be evaluated and terms that cannot.

$$E[\rho(\mathbf{r})] = T_{ni}[\rho(\mathbf{r})] + V_{ne}[\rho(\mathbf{r})] + V_{ee}[\rho(\mathbf{r})] + \Delta T[\rho(\mathbf{r})] + \Delta V_{ee}[\rho(\mathbf{r})] \tag{2.40}$$

The first term in the above equation is the kinetic energy of the non-interacting electrons, the second term is the electron-nuclear attraction, the third is the classical electron-electron repulsion, the fourth is the correction to the kinetic energy for treating the electrons as non-interacting, and the fifth collects the non-classical corrections to the electron-electron interaction. The first term is calculated from the KS orbitals, which are obtained from the selected basis set. The second and third terms are calculated classically from the Coulomb interaction formula. The last two terms are collectively known as the exchange-correlation functional, often written as $E_{xc}[\rho(\mathbf{r})]$.

Kohn and Sham thus gave a rigorous formalism for calculating the total energy as a functional of the electron density, except for the functional forms of the exchange and correlation energy components. The correct form of the $E_{xc}[\rho(\mathbf{r})]$ functional would yield the exact energy of the system. Because the exact form is unknown, a large number of approximate $E_{xc}[\rho(\mathbf{r})]$ functionals have been developed for predicting the total energy of the system.
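Collecting the two correction terms of Eq. (2.40) under a single symbol gives the compact form in which the KS energy is usually quoted. The following only restates the grouping described in the text:

```latex
\begin{aligned}
E_{xc}[\rho(\mathbf{r})] &= \Delta T[\rho(\mathbf{r})] + \Delta V_{ee}[\rho(\mathbf{r})] \\
E[\rho(\mathbf{r})] &= T_{ni}[\rho(\mathbf{r})] + V_{ne}[\rho(\mathbf{r})] + V_{ee}[\rho(\mathbf{r})] + E_{xc}[\rho(\mathbf{r})]
\end{aligned}
```

Everything except $E_{xc}$ can be evaluated for a given density and set of KS orbitals, so the accuracy of a practical DFT calculation rests on the approximation chosen for this one term.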

2.7.3 Density Functional Theory Methods

There are several families of approximate $E_{xc}$ functionals[43], and they all differ in the way the functional forms of the exchange and correlation energies are incorporated. The exchange energy is evaluated as the sum of contributions from the α and β spin densities, and the correlation energy is defined as the difference between the exact energy and the HF energy. The classification of density functionals is as follows; the functionals given as examples may or may not be employed in the research work presented in this thesis:

• Local Density Approximation

The first complete approximation developed is known as the Local Density Approximation (LDA). In LDA, the density is assumed to behave locally as a uniform electron gas. The exchange energy of a uniform electron gas is given by Dirac's formula:

$$E_x^{LDA}[\rho] = -C_x\int \rho^{4/3}(\mathbf{r})\,d\mathbf{r} \tag{2.41}$$

where $C_x$ is a numerical constant. In the above functional form the α and β densities are assumed to be equal. When the α and β densities are not equal, the more general functional is known as the Local Spin Density Approximation (LSDA). In the LSDA functional, the exchange energy is calculated as a sum over the separate α and β densities, each raised to the power 4/3, expressed as follows:

$$E_x^{LSDA}[\rho] = -2^{1/3}C_x\int\left(\rho_\alpha^{4/3} + \rho_\beta^{4/3}\right)d\mathbf{r} \tag{2.42}$$

The correlation energy $E_c$ is obtained by subtracting $E_x$ from the total $E_{xc}$. Vosko, Wilk and Nusair (VWN) developed functionals that fit the energy obtained from the difference of $E_{xc}$ and $E_x$.

• Generalized Gradient Approximation

Generalized Gradient Approximations (GGAs) are improvements on the LDA models. The aim was to correct the LDA model, which is based on a uniform electron density, towards the more realistic non-uniform electron density: in a real system the electron density changes from one point to another. This change can be accounted for by including the gradient of the density in the functional. GGAs therefore contain a correction term, added to the LSDA, that depends on the gradient of the density. A GGA functional can be expressed as:

$$E_{xc}^{GGA}[\rho] = E_{xc}^{LSDA}[\rho] + \Delta E_{xc}\left[\frac{|\nabla\rho|}{\rho^{4/3}}\right] \tag{2.43}$$

There are other approaches in which gradient corrections are incorporated differently. For instance, in LYP (Lee, Yang and Parr) the correlation energy is calculated separately; combined with Becke's gradient-corrected exchange functional, it gives the functional named BLYP. Other commonly known GGA functionals include B86, PBE and PW91. Density functionals are customarily named after their developers. In B86, B stands for Becke, the developer, and 86 for the year 1986; PBE stands for Perdew, Burke and Ernzerhof; and the P and W in PW91 stand for Perdew and Wang[51–53].

• Meta-Generalized Gradient Approximation

A logical further improvement on GGAs is the extension to higher-order derivatives of the density. The Laplacian of the density, which is its second derivative, is added to the gradient-corrected functional. These methods are known as meta-Generalized Gradient Approximations (m-GGAs), and they describe the energy density more accurately than GGA models do. Alternatively, m-GGA functionals can be formulated in terms of the orbital kinetic energy density (τ), which is related to the von Weizsäcker kinetic energy density shown below:

$$\tau_W(\mathbf{r}) = \frac{|\nabla\rho(\mathbf{r})|^{2}}{8\rho(\mathbf{r})} \tag{2.44}$$

This coincides with the orbital kinetic energy density for a single orbital. In general, τ is defined as a sum over the occupied orbitals:

$$\tau(\mathbf{r}) = \frac{1}{2}\sum_{i}^{occ}|\nabla\Phi_i(\mathbf{r})|^{2} \tag{2.45}$$

τ and $\tau_W$ are both related to the KS orbitals $\Phi_i$, their eigenvalues $E_i$ and the effective potential ($\nu_{eff}$). For real orbitals with $\rho = \sum_i|\Phi_i|^{2}$, the KS equations give the relation:

$$\tau(\mathbf{r}) = \sum_{i}^{occ}E_i|\Phi_i(\mathbf{r})|^{2} - \nu_{eff}(\mathbf{r})\rho(\mathbf{r}) + \frac{1}{4}\nabla^{2}\rho(\mathbf{r}) \tag{2.46}$$

More detailed information on specific functionals is beyond the scope of this thesis and is therefore not included. Some examples of m-GGAs are B95 (Becke's functional, developed in 1995) and M06L (M stands for Minnesota, L stands for local, and it was developed in 2006)[54].

• Hybrid Density Functionals

Hybrid functionals are a further improvement on density functionals that often outperform m-GGAs by combining a wavefunction method with density functionals. Hybrid functionals include some percentage of HF exchange, which in many cases leads to better performance than LDA, GGAs and m-GGAs. There is no single method that is best for all kinds of molecular systems and all properties of interest; nevertheless, the incorporation of HF exchange is an advancement in density functional models. Some examples of hybrid functionals are BMK (Boese and Martin for Kinetics), B3LYP (Becke's three-parameter functional with Lee, Yang and Parr correlation) and PBE0 (which combines three-fourths of PBE exchange with one-fourth of HF exchange and takes all of its correlation from PBE)[55–57]. The exchange-correlation expressions of PBE0 and B3LYP illustrate how HF exchange terms are included in DFT methods:

$$E_{xc}^{PBE0} = \frac{1}{4}E_x^{HF} + \frac{3}{4}E_x^{PBE} + E_c^{PBE} \tag{2.47}$$

$$E_{xc}^{B3LYP} = (1-a)E_x^{LSDA} + aE_x^{HF} + b\,\Delta E_x^{B} + (1-c)E_c^{LSDA} + cE_c^{LYP} \tag{2.48}$$

where a, b and c are the three parameters. A coefficient written as (1 − a) or (1 − c) is the remaining fraction, so the weights of the two exchange terms, and likewise of the two correlation terms, sum to one.
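Returning to the exchange expressions (2.41) and (2.42), a small numerical example shows how they are evaluated. The sketch below assumes the hydrogen atom with its exact density e^(−2r)/π in atomic units, and uses the standard value of the Dirac constant, C_x = (3/4)(3/π)^(1/3), which is not quoted in the text above. It evaluates the LDA form with the total density and the LSDA form with the single electron placed entirely in the α spin density.

```python
import math

C_X = 0.75 * (3.0 / math.pi) ** (1.0 / 3.0)  # Dirac exchange constant (assumed standard value)


def density(r):
    """Exact hydrogen-atom ground-state density in atomic units."""
    return math.exp(-2.0 * r) / math.pi


def zero_density(r):
    """Empty spin channel."""
    return 0.0


def radial_integral(f, r_max=40.0, n=8000):
    """Integrate f(r) over all space for a spherically symmetric f (Simpson rule)."""
    h = r_max / n
    total = 0.0
    for k in range(n + 1):
        r = k * h
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * 4.0 * math.pi * r * r * f(r)
    return total * h / 3.0


def lda_exchange(rho):
    """Eq. (2.41): spin-unpolarized Dirac exchange energy."""
    return -C_X * radial_integral(lambda r: rho(r) ** (4.0 / 3.0))


def lsda_exchange(rho_alpha, rho_beta):
    """Eq. (2.42): exchange energy from separate alpha and beta densities."""
    return -(2.0 ** (1.0 / 3.0)) * C_X * radial_integral(
        lambda r: rho_alpha(r) ** (4.0 / 3.0) + rho_beta(r) ** (4.0 / 3.0)
    )


e_lda = lda_exchange(density)
e_lsda = lsda_exchange(density, zero_density)
e_lsda_paired = lsda_exchange(lambda r: 0.5 * density(r), lambda r: 0.5 * density(r))

print(e_lda, e_lsda, e_lsda_paired)
```

When the α and β densities are equal, the LSDA expression reduces to the LDA one, as the text states. For the spin-polarized atom the LSDA value is about −0.268 hartree, compared with the exact exchange energy of −0.3125 hartree for this density. That shortfall is the kind of error gradient corrections and HF exchange mixing aim to reduce.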

A schematic of a DFT procedure is presented in Figure 2.4; it is almost the same as the HF procedure for the optimization of a molecular system. The only difference is the type of orbitals involved: in DFT, KS orbitals are employed to build the secular equation, after which all calculations proceed in the same manner as in HF. DFT applications have been increasing broadly in the chemical and materials sciences in order to explain the behavior of complex systems at the molecular level without requiring the solution of a wavefunction. DFT methods are applied to electrochemical investigations of semiconductor materials and have also been employed to predict the mechanical properties of nanostructured materials. DFT methods are versatile and, in general, produce results with higher accuracy than HF and comparable to those obtained by post-HF methods, at a much lower computational cost than the HF-based (post-HF) methods. However, DFT methods also have limitations with respect to the kinds of molecular systems and properties of interest, and thus cannot be considered ideal computational methods for all systems. There is a wide range of DFT methods, of which a broad classification has been given above. In the present thesis, several DFT methods from each of the above-mentioned classes have been employed to study small Lewis acid-base pairs.

[Figure 2.4: flowchart. Boxes, in order: choose a basis set; choose a molecular geometry; compute all one- and two-electron integrals and guess an initial density matrix; construct and solve the Kohn-Sham secular equation; construct the density matrix from the occupied KS orbitals; is the new density matrix similar to the previous one? (no: replace the matrix and repeat; yes: continue); optimize the molecular geometry? (no: output data for the unoptimized geometry; yes: does the geometry match the given criteria? no: choose a new geometry by the optimization algorithm and repeat; yes: output data for the optimized geometry).]

Fig. 2.4 Kohn-Sham density functional theory algorithm for optimization energy.
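The self-consistency loop at the centre of Figure 2.4 (build the matrix, solve it, rebuild the density, compare with the previous density) can be sketched on a toy problem. This is only an illustration of the control flow, not a Kohn-Sham calculation: it assumes an orthonormal two-function basis, a made-up density-dependent term with strength `u`, and a hypothetical core matrix. The geometry optimization branch of the flowchart is not included.

```python
import math


def lowest_orbital(f):
    """Lowest-eigenvalue eigenvector of a symmetric 2x2 matrix (Jacobi rotation)."""
    a, b, d = f[0][0], f[0][1], f[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)

    def energy(v):
        # Rayleigh quotient of a normalized eigenvector equals its eigenvalue.
        return a * v[0] ** 2 + 2.0 * b * v[0] * v[1] + d * v[1] ** 2

    return min([(c, s), (-s, c)], key=energy)


def density(orbital, n_electrons=2):
    """Density matrix from one doubly occupied orbital."""
    return [[n_electrons * orbital[i] * orbital[j] for j in range(2)]
            for i in range(2)]


def scf(h_core, u=0.1, tol=1e-10, max_iter=100):
    """Iterate until the density matrix stops changing (toy model, see lead-in)."""
    p = [[0.0, 0.0], [0.0, 0.0]]  # initial density guess
    for iteration in range(1, max_iter + 1):
        # Effective matrix: core part plus a made-up density-dependent term.
        f = [[h_core[0][0] + u * p[0][0], h_core[0][1]],
             [h_core[1][0], h_core[1][1] + u * p[1][1]]]
        p_new = density(lowest_orbital(f))
        change = max(abs(p_new[i][j] - p[i][j])
                     for i in range(2) for j in range(2))
        p = p_new  # "replace matrix" step of the flowchart
        if change < tol:
            return p, iteration, True
    return p, max_iter, False


# Hypothetical core matrix, for illustration only.
density_matrix, n_iter, converged = scf([[-1.0, -0.2], [-0.2, -0.5]])
```

The convergence test on the largest change in any density matrix element is one simple choice; production codes typically combine energy and density criteria and use convergence acceleration.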

2.8 Composite methods

Composite methods are another class of high-accuracy methods. They are generally referred to as Gn methods, where G stands for Gaussian and n represents the version of the Gaussian composite method. To date, there are four Gaussian methods, G1 to G4. After John Pople and his research group developed G1, every succeeding Gn method was an advancement on the previous one. G4 is the latest in this series and is the result of modifications to G2 and G3[58–60]. These methods involve several calculations of thermodynamic properties of chemical systems. The calculations are performed by combining a relatively low level of theory with a large basis set and vice versa. The results of all calculations are combined at the end to obtain an electronic energy for the system. The performance of this scheme has been tested on test sets consisting of hundreds of molecular properties and verified by comparison with the experimental results for such test sets. These methods have been found to achieve chemical accuracy, within 1 kcal/mol, and are widely accepted and used as a standard[58–60]. Gn methods involve a number of calculations performed by wavefunction as well as DFT based methods with a range of basis sets. The steps taken in a G4 calculation are stated below:

1. B3LYP/6-31G(2df, p) to predict equilibrium geometry of the system.

2. Zero point energy (ZPE) from vibrational frequencies for geometry calculated above. Frequencies are scaled by 0.9854.

3. HF energy obtained as E[HF/aug-cc-pV5Z], which is a combination of a lower-level method and a large basis set. It is a smart way of predicting high-accuracy properties at low computational cost.

4. This step involves a series of single point energy (SPE) calculations for correction energies, including corrections for diffuse functions, higher polarization functions, correlation effects beyond MP4, and larger basis set effects together with a further correction to the above-mentioned basis set effects. These corrections are stated respectively as sub-steps (a), (b), (c) and (d) as follows:

(a) \[ \Delta E(+) = E[\text{MP4/6-31+G(d)}] - E[\text{MP4/6-31G(d)}] \]
Here the small contribution of a diffuse function to the energy is extracted by taking the difference of two energies obtained with the same model, except that one includes a diffuse function.

(b) \[ \Delta E(2df,p) = E[\text{MP4/6-31G(2df,p)}] - E[\text{MP4/6-31G(d)}] \]
The same difference technique is applied in this part to calculate the small energy contribution of the (2df,p) polarization functions.

(c) \[ \Delta E(\text{CC}) = E[\text{CCSD(T)/6-31G(d)}] - E[\text{MP4/6-31G(d)}] \]
This part calculates a small contribution to the correlation energy from the difference of the energies obtained by CCSD(T) and MP4 with the same basis set. The result is the correlation energy contribution of CCSD(T) beyond MP4.

(d) \[ \Delta E(\text{G3LargeXP}) = E[\text{MP2(Full)/G3LargeXP}] - E[\text{MP2/6-31G(2df,p)}] - E[\text{MP2/6-31+G(d)}] + E[\text{MP2/6-31G(d)}] \]
This difference is the small energy contribution of the G3LargeXP basis set at the MP2(Full) level after excluding the contributions of the diffuse and polarization functions obtained at the MP2 level.

5. \[ E_{\text{combined}} = E[\text{MP4/6-31G(d)}] + \Delta E(+) + \Delta E(2df,p) + \Delta E(\text{CC}) + \Delta E(\text{G3LargeXP}) + E_{\text{HF/limit}} - E_{\text{HF/G3LargeXP}} + \Delta E(\text{SO}) \]
where SO stands for the spin-orbit correction. Here all the small contributions calculated above are combined.

6. \[ E_{e}(\text{G4}) = E_{\text{combined}} + E_{\text{HLC}} \]
where \( E_{\text{HLC}} = -0.006386\,(\text{valence electron pairs}) - 0.002977\,(\text{unpaired valence electrons}) \). HLC stands for the high-level correction, which is added as a combination of parameters. The HLC is added to the sum of all small contributions combined above.

7. \[ E_{0}(\text{G4}) = E_{e}(\text{G4}) + E(\text{ZPE}) \]
Finally, the G4 energy at 0 Kelvin is calculated by adding the zero point energy to the sum of all combined energy contributions and the parameterised HLC.
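The bookkeeping of steps 4 to 7 can be sketched as a single function that assembles the final energy from a dictionary of single-point energies. This is a hedged illustration only: the dictionary keys are hypothetical labels, the base energy is taken as MP4/6-31G(d) (an assumption, chosen so that the diffuse-function correction of sub-step (a) is not counted twice), and the HLC coefficients default to the values quoted in step 6.

```python
def g4_energy(e, n_pairs, n_unpaired, e_zpe, de_so=0.0,
              a=0.006386, b=0.002977):
    """Assemble E0(G4) from single-point energies keyed by hypothetical labels."""
    base = e["MP4/6-31G(d)"]
    # Sub-steps (a) to (d): each correction is a difference of two energies.
    de_plus = e["MP4/6-31+G(d)"] - base
    de_2dfp = e["MP4/6-31G(2df,p)"] - base
    de_cc = e["CCSD(T)/6-31G(d)"] - base
    de_g3largexp = (e["MP2(Full)/G3LargeXP"] - e["MP2/6-31G(2df,p)"]
                    - e["MP2/6-31+G(d)"] + e["MP2/6-31G(d)"])
    de_hf = e["HF/limit"] - e["HF/G3LargeXP"]
    # Step 5: combine all contributions.
    e_combined = (base + de_plus + de_2dfp + de_cc
                  + de_g3largexp + de_hf + de_so)
    # Step 6: parameterised high-level correction.
    e_hlc = -a * n_pairs - b * n_unpaired
    # Step 7: add the zero point energy.
    return e_combined + e_hlc + e_zpe


# Made-up energies (hartree): all equal, so every difference term vanishes.
LABELS = ["MP4/6-31G(d)", "MP4/6-31+G(d)", "MP4/6-31G(2df,p)",
          "CCSD(T)/6-31G(d)", "MP2(Full)/G3LargeXP", "MP2/6-31G(2df,p)",
          "MP2/6-31+G(d)", "MP2/6-31G(d)", "HF/limit", "HF/G3LargeXP"]
toy_energies = {label: -1.0 for label in LABELS}
toy_e0 = g4_energy(toy_energies, n_pairs=4, n_unpaired=0, e_zpe=0.02)
```

With identical toy energies, only the base energy, the HLC and the ZPE survive, which makes the arithmetic easy to check by hand.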

The composite methods produce results with an accuracy comparable to experimental results. However, the cost of the computations is prohibitive for any larger system. As described above, Gn methods include several other calculations to attain the final result of a composite method. Therefore, the composite methods are preferably used only as a reference against which the performance of lower-level computational chemistry models is compared. Such calculations usually involve predictions of thermodynamic quantities such as enthalpies of formation, atomization energies, ionization energies and electron affinities. This thesis employs G4 calculations to obtain electronic energies of small organic molecules.

2.9 Potential Energy Surfaces

The potential energy surface (PES) is a central concept in computational chemistry[43]. The PES can be described as a multi-dimensional energy profile of a system that depends only on its internal (nuclear) coordinates. In practice, however, one-dimensional constructs are more common. The PES of a molecular system is constructed as a function of the nuclear positions; the energy profile is independent of the translational and rotational motion of the system as a whole. The PES describes only the potential energy of a system, not the kinetic energy of the nuclei. This is a consequence of applying the BOA while constructing a PES: the nuclei are held fixed in their positions and the electronic motion under the electron-electron and electron-nuclear potentials is evaluated. The N nuclei of a system can each move in 3 dimensions, giving 3N Cartesian coordinates; once overall translation and rotation are removed, the PES depends on 3N − 6 internal coordinates (3N − 5 for a linear molecule). The nuclei are fixed, for example, at a given bond distance and the potential energy is calculated. This is repeated at varying distances for the bond under consideration, and a surface consisting of the potential energies at the various structures is formed. The PES allows one to comprehend the properties, equilibrium geometries, reactivity, spectra and critical points of a system. Bond distances, bond angles and torsions are common examples of internal coordinates along which the PES of a system can be constructed. Stationary points on the PES provide crucial information about a system. At a stationary point, the gradient of the energy with respect to every internal coordinate is zero. There can be a single minimum or many minima, depending on the system being studied. The lowest-energy minimum is called the global minimum. The minima are important because they correspond to stable structures: for example, the reactants, products and intermediates of a system are minima with different energies relative to each other. The PES may also contain saddle points, which are minima along all dimensions but one, along which they are maxima. These correspond to transition states of a system.

Fig. 2.5 PES of an arbitrary diatomic molecule where rAB represents the distance between two nuclei, E(r) represents the potential energy depending on AB bond distance and req is the distance between two atoms at equilibrium structure.

Figure 2.5 depicts the PES as a function of the bond distance between two hypothetical nuclei, A and B. The geometry of the molecule is most stable when the interatomic distance is req, which corresponds to the global minimum of the PES. Further shortening of rAB causes a rapid increase in energy because the atoms are so close that they repel each other. Increasing rAB from req (i.e. moving towards the right) raises the energy until the curve forms a plateau. The difference between the energy of the plateau and the equilibrium energy is defined as the bond dissociation energy (BDE). There are several computational chemistry programs with which the PES can be constructed; however, finding an optimal method for the chemical system of interest remains a challenge.
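The shape described for Figure 2.5 can be reproduced with a simple analytic curve. The sketch below uses a Morse potential, which is an assumption made here for illustration (the figure itself is for an arbitrary diatomic and no functional form is specified). The well depth, width parameter and equilibrium distance are made-up values, and the BDE is estimated exactly as defined above, as the plateau energy minus the equilibrium energy.

```python
import math


def morse(r, d_e=0.17, a=1.0, r_eq=1.4):
    """Morse potential with zero energy at the minimum r = r_eq (made-up parameters)."""
    return d_e * (1.0 - math.exp(-a * (r - r_eq))) ** 2


def scan(r_min=0.8, r_max=8.0, step=0.1):
    """Evaluate the curve on a regular grid of bond distances."""
    n_points = int(round((r_max - r_min) / step)) + 1
    return [(r_min + i * step, morse(r_min + i * step)) for i in range(n_points)]


points = scan()
# The grid point with the lowest energy approximates r_eq.
r_at_min, e_min = min(points, key=lambda point: point[1])
# Plateau energy (last, most stretched point) minus equilibrium energy.
bde_estimate = points[-1][1] - e_min
```

A rigid scan like this mirrors the procedure in the text: fix the bond distance, evaluate the energy, and repeat over a range of distances.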

2.10 Machine Learning and Quantum Chemistry

Machine learning (ML) is a data analysis technique which generates analytical models by learning from the available data. It is a branch of artificial intelligence which recognizes the patterns in data and learns from those patterns to make decisions by itself. Its roots lie in a theory of computer science which states that computers can learn by themselves, from experience alone, to perform particular tasks. This has been found to work, and ML has become an active field of study with considerable potential. ML is currently employed in a number of fields, including personalized suggestions and recommendations on the internet, stock market predictions and self-driving vehicles, to mention a few. Quantum mechanics is no exception. ML has been combined with quantum mechanics in order to predict quantum mechanical properties at much lower computational cost than can be achieved by quantum chemistry methods alone. ML is a data-driven technique and requires big data sets to work. The strengths of ML are statistics and mathematics with the support of computational tools. The data is organized in a convenient, easily comprehensible manner, for which it can be expressed as vectors, lists, sets, matrices, etc. The data is analyzed by statistical measures, followed by mathematical optimization to develop a ML model. The performance of a ML model is measured by statistical measures of error. Mean absolute error, root mean square error, mean absolute percentage error and standard deviation are common examples of such measures[61–65]. Generally, the larger the dataset, the better the learning of patterns and the smaller the errors in ML predictions. Besides large datasets, ML requires an appropriate descriptor of the data that can represent the nature of a given input. In principle then, ML is able to learn from any given quantum chemistry data and predict quantum mechanical properties without the use of any wavefunction, density functional or other similar entity. The combination of quantum mechanics data and ML algorithms is collectively written as a QM/ML technique. QM/ML is expected to be far less expensive than traditional quantum chemistry models. ML can be classified into three types: unsupervised ML, supervised ML and reinforcement ML[65]. The present work is only concerned with supervised learning, in which the dataset consists of features and their corresponding labels. Features are ideally independent characteristics of the dataset, whereas labels are feature-dependent properties of the dataset. For instance, a feature could be the structure of a molecule and its total energy could be the corresponding label. A ML algorithm would learn from a fraction of the dataset of molecules and would then be able to predict the energy of the molecules in the fraction it has not learned from. This is an example of a regression problem solved by supervised ML (classification is another problem solved by supervised ML). Regression problems can be solved by employing a range of algorithms such as simple linear regression, kernel ridge regression, Gaussian process regression, artificial neural networks and decision trees[66–68]. A kernel ridge regression model has been chosen for the present work because it is a non-linear model that is easy to train and it fits the requirements of the study included in this thesis. Furthermore, it has been employed successfully to make predictions from chemical data.
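The error measures named above can be written down in a few lines. The following is a minimal sketch using the usual textbook definitions; the function names and the reference and predicted values are made up for illustration and do not come from this thesis.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared error; penalises large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mean_absolute_percentage_error(y_true, y_pred):
    """Average error relative to the reference value, in percent."""
    relative = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * relative / len(y_true)


# Made-up labels for three held-out molecules and their predictions.
reference = [-1.0, -2.0, -4.0]
predicted = [-1.1, -1.9, -4.4]
mae = mean_absolute_error(reference, predicted)
rmse = root_mean_square_error(reference, predicted)
mape = mean_absolute_percentage_error(reference, predicted)
```

In a supervised setting these measures would be evaluated on the fraction of the dataset that the model has not learned from, so that they reflect predictive rather than fitting performance.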

A general form of kernel ridge regression with a dataset consisting of features xi and labels yi can be given as:

\[ f(x) = \sum_{i}^{N} \alpha_i\, k(x_i, x') \tag{2.49} \]

where f(x) is an estimator function, N represents the number of data points, and i denotes the ith term. k is the kernel function and α are the respective weights given to each independent feature. These weights are obtained by minimizing the loss function L(α), which is defined as:

\[ L(\alpha) = \sum_{i}^{N} \left( f(x_i) - y_i \right)^{2} + \lambda\, \alpha^{T} k\, \alpha \tag{2.50} \]

where λ is called the ridge parameter and is used as a regularization factor to control overfitting of the predicted values. T denotes the transpose of the α vector. More detail is given later in Chapter 5.

Chapter 3

Automated Benchmark of Density Functionals for Stretched Dative Bond Complexes

3.1 Introduction

This chapter includes a benchmark of density functional theory methods for a unique set of small Lewis acid-base pairs. The benchmarking process is automated and involves a demonstration of the repository used. This work has also been published and is used here with the permission of the co-authors[69]. In a recent article, Clark et al. commented on the growing importance of data management in chemistry[70]. In particular, making research data immediately accessible and machine readable would be a major step forward in enabling algorithms to realize chemical insight on a scale not possible by traditional publication methods. It is imperative to develop best practices for data stewardship if one is to take advantage of the vast potential and promise of “Big Data” in the chemical sciences. Specifically, because computational chemistry is inherently digital in nature, it is somewhat surprising that relatively slow gains have been made in terms of the development of trusted open repositories and access to the myriad of data produced on an ongoing basis. While there have been important advances both in terms of digital infrastructure[8–13] and applications[71–84] for “big data” in computational quantum chemistry, more is needed to broaden access and increase the value of such data to those who can benefit most from it.

Importantly, computational quantum chemistry can be used in principle to predict any observable chemical property; the problem is of course that the solution of the Born-Oppenheimer Schrödinger wave equation (SWE)[30] (a usual target in computational quantum chemistry) is too computationally demanding to be tractable for the vast majority of chemical space. Accurate computational prediction of chemical properties on larger and larger swaths of chemical space is accessible, though, by approximating the solution to the SWE, or by resorting to density functional theory (DFT),[85] which has become a standard technique with broad applicability. However, the number of approximate models and density functionals for this purpose is large and continues to grow[86]. This growth is driven both by continuous algorithmic developments and the desire to probe novel chemical problems on larger scales. Unfortunately, the quality of computed properties varies significantly with the chosen model and also depends on the nature of the chemical system and the chemical property one is interested in. Consequently, the lack of a reasonably “universal” electronic structure model necessitates a validation step prior to any serious computational investigation into the behaviour of chemical species. This is achieved most commonly by assessing the performance of a variety of selected models in predicting properties of interest by comparing against a standard reference, which is usually experimentally derived data or, when possible, data from high-level ab initio calculations. Once a model is found that yields a favourable comparison between its computed results and known references, one can then proceed confidently in predicting similar properties for other similar chemical systems. As a result of the necessity of such a validation, there is a large body of literature highlighting specific use cases and applications of specific models to specific problems (for example see reference 87). 
Though it is often possible for one to make use of previous model validations more generally, there are two important problems with this paradigm. The first is that the literature is an inherently static record. New models are developed often and the scope of chemical space and property space of interest to the community is a moving target. As such, a static reference is not always appropriate for assessing the ever-changing landscape of model chemistries and applicable problems. Second, there is often a wealth of data “hidden” from the literature that could have otherwise been quite useful, the so-called “dark data”. For example, it is routine to predict harmonic vibrational frequencies to assess the nature of a stationary point along a potential energy hypersurface when optimizing molecular structures. However, it is far less common for the numerical values of these computed frequencies (and associated atomic displacement vectors and absorption intensities, etc.) to ever see the light of day, even within supporting information files. Consequently, the reader learns about the performance of model chemistries on properties of interest to the author but nothing about their performance on other response properties, which may otherwise be readily available if one had access to the original data. A much more attractive solution to the problem of validating model chemistries would be to refer to a “living” central repository of benchmark data “on the fly”, where research results are continuously contributed and accessible (in their entirety) by the scientific community. This is not an entirely new idea,[8, 11, 13, 88] though much is left to be done to realize it. 
If one imagines a suitable repository housing the detailed computational results published over the past several decades of development, one immediate and obvious utility of such a resource would be to automate practices that are common, effectively standardized, and of significant importance within the computational chemistry community, like model validation. With large-scale datasets becoming an emerging currency in chemistry[13, 11, 84, 88, 89] (particularly in computational chemistry) and given the large and growing number of publications in computational quantum chemistry (for example, the interested reader is directed to the “Quantum Chemistry Literature Database”, http://qcldb2.ims.ac.jp/pub.html), one may reasonably expect that a database sufficient for this purpose is possible and probably even likely within the near future. Therefore, the focus is on the development of an efficient computational workflow for harnessing the power of such a repository for the purpose of validating model chemistries, while simultaneously presenting new data of interest to the broader chemical community. In this chapter, the performance of a test set of 24 computational electronic structure models has been investigated for reproducing the CCSD(T)/CBS(D,T) PES of dative bond stretching between 8 unique Lewis acid-base pairs. Such systems (at stretched bond lengths) can be useful approximations for modeling the electronic structure of FLPs,[19, 90] which are novel species of emerging interest in many areas. The electrostatic attraction between a Lewis pair, countered by the steric hindrance due to bulky ligands, elongates the equilibrium dative bond and creates a so-called “frustrated” environment. In order to overcome such steric frustration, these species give rise to novel reactivity and are the first to catalyze metal-free hydrogenation[19, 91–93, 23, 94–99] in addition to being efficient catalysts for a wide range of synthetic transformations[21, 100, 101]. 
FLP chemistry has been investigated intensively by experimental techniques in recent years, but there have been relatively few theoretical investigations of the electronic structure and reactivity of FLPs[102] owing to the theoretical challenge in modeling them accurately[103]. The properties of FLP models in the test set, such as long-range dispersion forces, non-equilibrium structures, and dative bonding, have all been shown to be particularly problematic for modern DFT[104], and so these species afford a particularly relevant and interesting test case where it is not otherwise clear how one should proceed with modeling[103]. Moreover, FLPs (by definition) consist of bulky ligands, which generally makes them intractable to model with highly accurate ab initio electronic structure techniques, and it is rational to investigate the performance of a wide variety of DFT models in reproducing the potential energy surface of dative bond stretching in small molecule model systems. By examining the fidelity of each of the chosen approximate electronic structure models against a coupled cluster complete basis set CCSD(T)/CBS(D,T) reference dataset, the present work offers insight into appropriate models for such systems and also demonstrates a generalized and automated workflow of model validation. Such a generalized workflow requires open repositories that simultaneously i) fully index computational chemistry data, ii) offer customized, nested queries against that data, and iii) cover a suitable breadth of model chemistries, properties and chemical space for current purposes. Currently, no suitable repositories fit these criteria, though it is expected that this won’t always be the case. 
For this reason, new data have been generated for systems chosen to be both relevant to the subject of interest and of reasonable complexity for the purpose of demonstrating our general approach.

3.2 Methods

3.2.1 Description of Workflow

To facilitate the storage and retrieval of computational chemical information, data is hosted within an Islandora instance[105]. Islandora is an open-source software framework designed to facilitate the management and discovery of digital assets, and was originally developed at the authors’ institution. It integrates Drupal,[106] Fedora,[107] and Solr[108] to provide a robust platform capable of facilitating a wide range of applications and queries. What is particularly attractive about such a platform are the advanced full-text searching capabilities as well as the built-in support for open standards such as extensible markup languages (vide infra) and JavaScript Object Notation (JSON) for data representation within the Solr interface. Additionally, the availability of a Representational State Transfer (REST) application program interface (API) readily facilitates routine and automated access to the repository and allows a seamless integration of post-query data processing.

Figure 3.1 illustrates the general workflow in schematic form. In step 1, a user contributes raw output from a standard computational chemistry calculation to the repository. The raw output is stored but is also parsed into several other datastreams in step 2, most notably the chemical markup language (CML),[109, 9, 10] which is the most robust and widely used machine-readable data representation system in chemistry. The translation from a native output format to CML is dependent on available file parsers and in our case it is achieved by the open-source JUMBO converters code[110]. After conversion to CML, Solr is then able to index the repository (step 3) according to the population of available fields. The CML:CompChem convention is designed to capture the qualitative and quantitative features of quantum chemical calculations and their relationships in a well defined implicit semantic structure[10]. As such, extensive (in principle, lossless) information is captured including the molecular structure, computational model, calculation input parameters, numerical results, etc. In principle, any number of additional datastreams may be generated from the original output and it may be advantageous to do so in several contexts. For example, detailed structural information may be condensed in a lightweight machine-readable format to avoid searching via Cartesian coordinates, which could often be meaningless. Alternatively, a canonicalized representation of precise structural features would be far more efficient. In our somewhat limited context, this is not necessary but should be considered for future applications in larger, more general data mining exercises. The data may be accessed (step 4) via a REST API, facilitated by Solr, whereby users can query any of the indexed CML fields. This approach is highly amenable to an algorithmic or automated implementation. 
[Figure 3.1: schematic of the workflow. Labels: Repository; OUT; JUMBO; CML; DAT; url-based index; REST API; JSON; ANALYSIS; steps numbered 1 to 5.]

Fig. 3.1 Organizational chart describing the proposed workflow.

In our case, users can query all available data for a particular swath of chemical space, say dative bonds between Lewis acids and bases, and subsequently compute standard statistical metrics of the accuracy of any particular approximate prediction of a known property (step 5). This is conveniently accomplished by employing the JSON data representation, which can encode all of the dependent (property of interest) and independent (computational model) variables in a lightweight, easily parsed standard format. For the example presented herein, a query is based on the nested set of independent variables that will yield comparable calculation results within the repository and subsequently returns a minimally sufficient subset of those variables (sufficient to differentiate all results from each other) along with the value of the dependent property of interest. Because the interest is in relaxed PES evaluations of particular molecules at particular points along the PES, these queries must include the computational method, basis set, optimization and constraint keywords, molecular formula, and dative bond length within the initialization module of the CML:CompChem convention, in addition to the cc:totalEnergy field within the finalization module. The chosen test case offers the luxury of a priori knowledge of the full scope of the repository contents, so one can be sure that the applied optimization constraint is the correct one for the purposes of interest and that the dative bond length is known from a specific element of the distance matrix (captured in the cc:distance field of CML:CompChem). For more general applications, one would need to further search within the initialization module to confirm that the constrained parameters were as expected. Because all results will necessarily be constrained optimizations, these fields are not reproduced in the Solr output and only the unique subset of fields sufficient for the identification of the computational model are included in the results (i.e. computational method, formula, dative bond length, energy). Users provide reference property values in JSON format by specifying the same unique independent variables as above (formula, dative bond length) and the reference dependent variable (energy).
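Steps 4 and 5 of the workflow can be sketched as follows. This is a hedged illustration rather than the code used in this work: the base URL, the flat field names (`method`, `formula`, `bond_length`, `energy`) and all numerical values are hypothetical stand-ins for the indexed CML fields, and only the standard Solr `select` parameters `q`, `fl` and `wt` are assumed. The repository records are matched to the user-supplied references on the shared independent variables (formula, dative bond length), and a mean absolute error is accumulated per computational model.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode


def build_query_url(base_url, formula, bond_length):
    """Build a Solr select URL (field names are hypothetical placeholders)."""
    params = {
        "q": f'formula:"{formula}" AND bond_length:{bond_length:.1f}',
        "fl": "method,formula,bond_length,energy",  # fields to return
        "wt": "json",                                # JSON response format
    }
    return f"{base_url}/select?{urlencode(params)}"


def mean_absolute_error_by_model(records, references):
    """Compare repository records with reference energies, grouped by model."""
    ref = {(r["formula"], r["bond_length"]): r["energy"] for r in references}
    errors = defaultdict(list)
    for rec in records:
        key = (rec["formula"], rec["bond_length"])
        if key in ref:  # skip records without a matching reference
            errors[rec["method"]].append(abs(rec["energy"] - ref[key]))
    return {method: sum(errs) / len(errs) for method, errs in errors.items()}


# Made-up JSON payloads standing in for a query result and user references.
records = json.loads("""[
  {"method": "PBE0",  "formula": "H3BNH3", "bond_length": 1.6, "energy": -82.9},
  {"method": "PBE0",  "formula": "H3BNH3", "bond_length": 1.7, "energy": -82.8},
  {"method": "B3LYP", "formula": "H3BNH3", "bond_length": 1.6, "energy": -83.2}
]""")
references = json.loads("""[
  {"formula": "H3BNH3", "bond_length": 1.6, "energy": -83.0},
  {"formula": "H3BNH3", "bond_length": 1.7, "energy": -83.0}
]""")

url = build_query_url("http://localhost:8983/solr/compchem", "H3BNH3", 1.6)
mae_by_model = mean_absolute_error_by_model(records, references)
```

Matching on a floating-point bond length works here because both payloads carry the same decimal text; a more defensive implementation might round the bond length or index the scan points by integer step.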

3.2.2 Computational Models

In the context of computational chemistry a ‘model’ is defined as a mathematical procedure for approximating some well defined ‘property’ of a chemical system and is usually (though not necessarily) defined by the combination of a particular level of theory with a basis set of atomic orbitals. Common properties of interest include the electronic energy and response properties such as structure or spectroscopic characteristics of the system. The ‘property’ one seeks to predict does not in general have to be experimentally observable, but must obviously be readily defined in terms of available quantities within the scope of a chosen model and be comparable to analogous reference quantities (or else the very concept of validation loses meaning). Additionally, many experimentally observable characteristics of a chemical system are indirectly predicted by the combination of two or more properties. For example, the heterolytic bond dissociation energy (BDE) of a covalent bond between fragments ‘A’ and ‘B’ in a particular molecular structure refers to the enthalpy (per mole) required for the process[111]

\[ \mathrm{BDE(A{-}B)} = \Delta_f H^{0}_{298}(\mathrm{A}^{-}) + \Delta_f H^{0}_{298}(\mathrm{B}^{+}) - \Delta_f H^{0}_{298}(\mathrm{A{-}B}). \tag{3.1} \]

It is an important thermodynamic quantity accessible by computational models if one can obtain sufficiently accurate properties of each of A⁻, B⁺, and A-B. As such, BDE(A-B) might be referred to as a ‘derived property’, meaning that it is not generally produced from a single application of a particular model but rather the combination of several independent applications of the model across two or more chemical systems. A general validation workflow must therefore be capable of assessing the performance of models in reproducing derived properties, such as the shape of the potential energy surface (PES) which governs many other derived properties, including BDE.

PESs are generated by employing relaxed scans of the dative bond in a series of 8 small molecules that consist of Lewis acid-base pairs with varying small ligands (X3Y−ZH3, where X = H, F; Y = B, Al; Z = N, P). In these optimizations, the length of the dative bond was held fixed at a series of lengths from 1.0 Å to 4.0 Å in 0.1 Å increments, while the remaining atoms in the complex were allowed to geometrically relax. The quantum chemical methods applied in the current chapter span 19 DFTs including GGAs (B86, B88, PBE, PBE0, B3LYP, B3LYP5, B3PW91, HCTH, ωB97X, PW91),[112, 52, 53, 55, 113, 57, 51, 114, 115, 51] meta GGAs (BMK, M05, M052X, M06, M06L, M06HF, M062X),[56, 116–118, 54, 118] and explicitly empirical functionals (EDF1, EDF2),[119, 120] in addition to the ab initio MP2 model[46]. This, of course, allows us to traverse several rungs of the so-called Jacob’s ladder[121] in DFT and probe the extent to which one must climb to achieve the “heavens of chemical accuracy” for FLP chemistry. 
Grimme’s empirical dispersion correction[122, 123] is also applied to the optimized structures of the series of PBE0, M052X, M062X and B3LYP functionals (indicated herein as PBE0-D, M052X-D, M062X-D and B3LYP-D, respectively), yielding four additional models. All optimizations have been performed with the 6-31G* Gaussian basis set. Although it is expected that such a double-ζ basis is well short of a converged complete basis set (CBS) limit, it is important to note that any future applications of a model chemistry validated herein to FLPs will almost certainly be applied to large enough systems where going much beyond a double-ζ won’t be feasible. Consequently, efforts are explicitly made to search for a cancellation of errors and thus potentially a “right answer for the wrong reason”. Furthermore, the intent is also to demonstrate an efficient workflow for the parsing, storage, indexing, and searching of computational data in chemistry, and to that end any basis will be more than sufficient. The benchmark energies are obtained at the CCSD(T)/CBS(D,T)//MP2/6-31G* level[124–128] (where the MP2 geometries are determined according to the procedure defined above) by extrapolating to a complete basis set from the cc-pVDZ and cc-pVTZ basis set results,[41, 42, 129] according to:[130]

$$E_{DT} = \frac{E^{\mathrm{corr}}_{D} D^{3} - E^{\mathrm{corr}}_{T} T^{3}}{D^{3} - T^{3}} \tag{3.2}$$

where $E^{\mathrm{corr}}_{D}$ and $E^{\mathrm{corr}}_{T}$ represent the correlation energies predicted at the CCSD(T)/cc-pVDZ and CCSD(T)/cc-pVTZ levels, respectively, while D = 2 and T = 3. $E_{DT}$ is added to the CCSD(T)/cc-pVTZ energy and thus the total benchmark energy of the system is given as:

$$E_{\mathrm{CCSD(T)/CBS(D,T)}} = E_{\mathrm{CCSD(T)/cc\text{-}pVTZ}} + E_{DT} \tag{3.3}$$

All DFT and MP2 calculations have been performed with the Q-Chem[131] software package, while the coupled cluster results were obtained with the Gaussian 09[132] suite of programs. Atomic units are used throughout.
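The arithmetic of Eqs. 3.2 and 3.3 can be illustrated with a short Python sketch. It is not the code used in this work: the function names and the numerical inputs are hypothetical, and the sketch follows the two equations literally as they are written above. Which energy E_DT is combined with should be checked against the cited extrapolation scheme[130].

```python
def e_dt(e_corr_d, e_corr_t, d=2, t=3):
    """Eq. 3.2: two-point term built from the double-zeta (D) and
    triple-zeta (T) correlation energies, with cardinal numbers d and t."""
    return (e_corr_d * d**3 - e_corr_t * t**3) / (d**3 - t**3)


def e_benchmark(e_ccsdt_tz, e_corr_d, e_corr_t):
    """Eq. 3.3 as written in the text: CCSD(T)/cc-pVTZ energy plus E_DT."""
    return e_ccsdt_tz + e_dt(e_corr_d, e_corr_t)


if __name__ == "__main__":
    # Hypothetical correlation energies in atomic units (illustration only).
    print(e_dt(-0.30, -0.36))
```

A quick sanity check on the formula: if both basis sets gave the same correlation energy, E_DT would simply equal that value, because the factor (D^3 - T^3) cancels.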

3.2.3 Data Processing

The goal is to assess the performance of a series of computational models in reproducing chemical properties according to a CCSD(T)/CBS(D,T) reference. The incremental series of data points along the PES for each of the 8 chemical systems under investigation (31 points per system) combine for a total of N = 248 unique energy evaluations. These define 248 mutually exclusive search criteria, which are each complemented by any additional necessary control conditions. The reference energies are predicted under the controlled condition that the dative bond be held fixed at a specific value while all other atoms are allowed to relax. Data is readily filtered for these (or any other) conditions by querying the initialization module within the CML:CompChem convention, where a parameterList is defined containing fields for each of the calculation keywords specifying the molecular species, whether or not a constrained optimization was performed, what the constraints were, structural information, as well as the computational model employed. The finalization module within the CML:CompChem convention reports calculation results, where the totalEnergy field is of interest in the current work, though any result may be returned. An appropriate query of sufficient conditions made through the REST API will produce a series of JSON entries detailing all records in the repository where any one of the 248 unique properties was calculated, along with all necessary information regarding the computational model employed (i.e. the cc:method and cc:basis CML fields). To afford a robust and automated assessment of model chemistries for producing an arbitrary set of (derived) properties, Python code (available in Appendix A) is implemented to subsequently calculate several straightforward statistical measures of accuracy by ingesting the aforementioned JSON representation of available data and comparing it against the known benchmarks (supplied by the user).
46 Automated Benchmark of Density Functionals for Stretched Dative Bond Complexes

Specifically, the error ($E_M(i) = R(i) - M(i)$) for each available model ($M$) over all available points on the PES ($i \in \{1, 2, \ldots, N\}$) is computed, where $R(i)$ refers to the reference property value for point $i$; the maximum absolute error ($E_{\max,M} = \max |E_M(i)|$) over all available data points; the mean signed error ($\mathrm{MSE}_M = \frac{1}{N'} \sum_{i}^{N'} E_M(i)$) over all $N'$ available data points, where $N' \le N$ depending on data availability; the mean unsigned error ($\mathrm{MUE}_M = \frac{1}{N'} \sum_{i}^{N'} |E_M(i)|$) over all $N'$ available data points; and the average absolute deviation ($\mathrm{AAD}_M = \frac{1}{N'} \sum_{i}^{N'} |\mathrm{MSE}_M - E_M(i)|$) over all $N'$ data points.

Finally, in cases where the set of N target properties is distributed over more than 1 (but fewer than N) chemical species, it is useful to compute separate AADM values for each chemical species. This affords a metric of how consistent the performance of a particular model (M) is over a range of compounds and avoids bias introduced by calculating MSEs over the entire set of N properties or by inconsistent data availability. Subsequently, it is instructive to average these per-species AADM values for each available model, which is reported as the “mean” AADM or MAADM.
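The metrics defined above can be sketched in a few lines of Python. This is a minimal illustration and not the Appendix A code: the function names, the dictionary layout, and the use of (species, bond length) tuples as keys are assumptions made for the example.

```python
def model_metrics(reference, model):
    """Compute Emax, MSE, MUE and AAD for one model.

    reference, model: dicts mapping a data-point key to a property value.
    Only keys present in both dicts (the N' available points) are used.
    """
    shared = [key for key in reference if key in model]
    if not shared:
        raise ValueError("no overlapping data points")
    errors = [reference[key] - model[key] for key in shared]  # E_M(i) = R(i) - M(i)
    n_avail = len(errors)
    mse = sum(errors) / n_avail
    return {
        "Emax": max(abs(e) for e in errors),
        "MSE": mse,
        "MUE": sum(abs(e) for e in errors) / n_avail,
        "AAD": sum(abs(mse - e) for e in errors) / n_avail,
        "N_available": n_avail,
    }


def mean_aad(reference, model):
    """MAAD: average of the per-species AAD values.

    Assumes (hypothetically) that every key is a (species, bond_length) tuple.
    """
    species = {key[0] for key in reference if key in model}
    if not species:
        raise ValueError("no overlapping data points")
    per_species = []
    for name in species:
        ref_subset = {k: v for k, v in reference.items() if k[0] == name}
        per_species.append(model_metrics(ref_subset, model)["AAD"])
    return sum(per_species) / len(per_species)
```

Computing the AAD separately for each species before averaging means that a constant offset within one species does not inflate the deviation reported for another.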

3.3 Results and Discussion

3.3.1 Workflow Performance

Workflow performance is primarily evaluated based on computation time for a representative example of each independent process, where possible. Within our workflow, such processes include data generation (which is, of course, dependent on the computational chemistry software one chooses to work with), data ingestion (i.e. conversion to CML), indexing, querying, and post-query processing (see Figure 3.1). Table 3.1 lists relative timing data for each process. The computational cost of generating the relevant quantum chemical data is linked directly to the configuration of particular users and is not of concern here. It is highly dependent on a wide range of factors including the property of interest to be computed, algorithm, hardware, etc. However, the relative timings for two representative calculations (performed on the same machine as the other processes) are listed as a point of reference to provide some discipline-specific context for comparison. These are a CCSD(T)/cc-pVDZ electronic energy calculation on the BH3NH3 molecule employing the frozen core approximation and a much cheaper B3LYP/6-31G* electronic energy evaluation on the same molecule.

Table 3.1 Timings of relevant processes in our workflow, relative to the Solr query response (absolute time = 0.004 s).

Process                                        Rel. Time
CCSD(T)/cc-pVDZ Electronic Energy (BH3NH3)        5250.0
B3LYP/6-31G* SCF (BH3NH3)                         1125.0
Data ingest                                          N/A
Conversion to CML (SCF)                            568.3
Conversion to CML (Optimization)                   729.3
Query (REST API)                                   376.0
Query response                                     237.3
Query response (Solr)                                1.0
Post-query processing                              576.5

Estimates of data ingest timing are more difficult to determine definitively due to the inherent complexity of the ingest process within the fedora module of the Islandora installation. “Ingest” is an umbrella term that collectively refers to over a dozen specific tasks including assigning persistent identifiers, creating triples, and indexing the content. However, benchmarking of the ingest process with parameters similar to our own has shown the scaling of data ingest to be roughly constant with respect to the number of objects that one submits to the database, having an upper bound of roughly 10 objects per second per processor,[133] though this does not include the conversion of the computational chemistry legacy format to CML in our case (discussed below). The ingestion step is, however, neither a bottleneck nor even a relevant step for a user who wishes only to access the data a posteriori.

Conversion to CML is accomplished via the JUMBO module and the timing is weakly dependent on the size of the file to be converted. That is, structural optimizations or other iterative processes where information is output repeatedly in the legacy format will generally require marginally more time to canonicalize to CML. Alternatively, simpler calculations such as a self-consistent-field procedure to determine electronic energy will be processed more quickly. In Table 3.1, the timings reported are given per file and the reported timing for optimization output parsing to CML is for calculations exhibiting typically 9-14 optimization cycles prior to convergence. This is the most costly step in our workflow; however, conversion need only occur once for any computational chemistry data file before it is accessible in perpetuity at far greater speeds (see below).

For our purposes, querying the database can be partitioned into three processes. One is the computational overhead associated with the REST API, involving translation of a human-readable set of query parameters into a Solr URL and subsequently logging in to the remote Islandora database server over a network. A second “process” is the response to the query, which involves receipt of the query parameters over the network, subsequent data retrieval, and transmission of the result file (JSON) to the user. Though these are relatively fast processes in comparison to data ingestion, it must be noted that both are highly dependent on network speeds, which can vary substantially on a case-by-case basis and are the dominant factor in our timings. The third is the actual server-side data (or metadata) retrieval by Solr, which in our case takes about 4 ms in absolute time. This query response time is a critical property of our workflow as it is the network-independent portion of the total query response time and is the only component of the workflow which will change as a function of the number of stored triples in the repository (though data transmission speeds will obviously be affected with larger volumes of data being transmitted). The almost negligible cost in our test case is very promising for scaling up such a workflow on large databases within the computational chemistry domain, a goal of several groups[13, 11, 84, 88, 89].

Finally, the processing of the JSON produced via a Solr query is accomplished with about the same relative cost as conversion of a single computational chemistry output format to CML. In total, for a user accessing an existing Islandora database for the purpose of validating model chemistries on a scale similar to that presented here, the process would take just under 5 seconds to produce robust metrics of model quality over the full range of all results indexed in the database. It is therefore a highly efficient path to choosing optimal domain-specific models and provides a compelling argument for the adoption of machine-driven workflows, particularly for the myriad of “black box” practitioners of computational chemistry interested only in a path of least resistance to reasonable quality computations.
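The "just under 5 seconds" estimate can be reproduced from Table 3.1 with a short Python sketch. The relative timings and the 0.004 s Solr reference time are taken from the table and its caption. The variable and function names are hypothetical, and the sketch assumes that the Solr retrieval time is already contained in the "Query response" entry (adding it separately would change the total by only 0.004 s).

```python
SOLR_ABSOLUTE_S = 0.004  # absolute Solr query response time (Table 3.1 caption)

# Relative timings from Table 3.1 for the steps a user of an existing
# repository has to pay for.
RELATIVE_TIMES = {
    "Query (REST API)": 376.0,
    "Query response": 237.3,
    "Post-query processing": 576.5,
}


def absolute_seconds(relative_time):
    """Convert a relative timing from Table 3.1 into seconds."""
    return relative_time * SOLR_ABSOLUTE_S


def total_user_time():
    """Sum the absolute times of the user-facing steps, in seconds."""
    return sum(absolute_seconds(t) for t in RELATIVE_TIMES.values())


if __name__ == "__main__":
    print(f"{total_user_time():.2f} s")  # prints "4.76 s"
```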

3.3.2 Model Chemistry Performance

A summary of the analytics for each model chemistry is listed in Table 3.2, sorted according to MAADM (see above), though methods yielding the smallest errors in each other metric are also marked. In general, the performance of any particular model varied significantly over the chemical space of interest in the current work. The most generally consistent model was EDF2 (as measured by lowest MAAD); however, several models produced MAADs within 1 kcal/mol of EDF2, such as M06L, PBE0, B3PW91, M06, and EDF1. Although some models performed exceptionally well, none were universally accurate for the purposes of assessing bond dissociation energies in our context, reflecting the known difficulty in modeling our chosen chemical systems. M06, for example, exhibited excellent agreement (AADM06 within 0.6 kcal/mol) with the CCSD(T)/CBS(D,T) reference over the entire PES curve of AlF3NH3, but showed considerably poorer performance with BH3PH3 (AADM06 of 8.5 kcal/mol). Several other models also performed reasonably well, including EDF1 and BMK. BMK exhibited the lowest overall AADM over the entire dataset (as opposed to averaging independent AADs for each chemical species), while EDF1 produced the lowest MSEM, MUEM, and lowest maximum error over the complete dataset.

Table 3.2 Various performance metrics (in a.u.) of the test set molecules for all models employed, listed in order of ascending MAADM. Methods yielding the smallest errors in each metric are marked with an asterisk (*).

Model     MAADM     AADM      MUEM      MSEM       Emax,M    Class
EDF2      0.0071*   0.3645    0.3911    -0.2334    0.9730    Empirical
M06L      0.0073    0.3330    0.3502    -0.1837    0.8679    meta-GGA
PBE0      0.0074    0.3941    0.5388    -0.4822    1.2028    GGA (hybrid)
B3PW91    0.0079    0.3441    0.3970    -0.2616    0.9660    GGA
M06       0.0084    0.3480    0.3832    -0.2421    0.9550    meta-GGA (hybrid)
EDF1      0.0086    0.3329    0.3329*   -0.0945*   0.7807*   Empirical
BMK       0.0092    0.3131*   0.3991    -0.3224    0.9642    meta-GGA (hybrid)
M05       0.0093    0.3512    0.3711    -0.2206    0.9468    meta-GGA (hybrid)
ωB97X     0.0094    0.3399    0.3667    -0.2103    0.9115    GGA (hybrid)
B3LYP     0.0098    0.3329    0.3339    -0.1365    0.8222    GGA (hybrid)
HCTH      0.0098    0.3719    0.3827    -0.1719    0.9384    GGA (hybrid)
B3LYP5    0.0099    0.3405    0.3947    -0.2794    0.9806    GGA (hybrid)
M062X     0.0100    0.3377    0.3849    -0.2631    0.9591    meta-GGA (hybrid)
M052X     0.0100    0.3205    0.3389    -0.1939    0.8547    meta-GGA (hybrid)
M06HF     0.0111    0.3344    0.3831    -0.2622    0.9463    meta-GGA (hybrid)
MP2       0.0111    0.5649    1.1620    -1.1620    1.8024    Wavefunction
PW91      0.0184    0.7755    1.6787    -1.6787    2.5288    GGA
PBE       0.0191    0.8336    1.8320    -1.8320    2.8012    GGA
B88       0.0202    0.7543    1.5988    -1.5988    2.4401    GGA
B86       0.0212    0.7189    1.5097    -1.5097    2.3330    GGA
The relatively impressive performance of the empirical EDF1 functional is not surprising in our case, as EDF1 is a combined exchange and correlation functional that is specifically adapted to yield good results with the relatively modest-sized 6-31+G* basis set by direct fitting to thermochemical data. As such, one might not expect EDF1 to yield similar performance when larger basis sets are employed. Interestingly, the EDF2 functional, which is specifically designed to model the potential energy curve well, improved significantly upon the EDF1 results, even though EDF2 is parametrized for the larger, correlation-consistent cc-pVTZ basis. In general there does not appear to be any particular class of method (i.e. GGA, meta-GGA, hybrid, wavefunction) that consistently outperforms the others, though pure GGAs do not generally do well. This is not unexpected, particularly for the unbalanced exchange-only functionals employed.

It has been noted that density functional models of electronic structure often require dispersion corrections, which motivated us to also include empirical dispersion corrections for several popular functionals, namely PBE0-D, M052X-D, M062X-D and B3LYP-D. Such dispersion corrections did not significantly improve on the performance of the base functionals. For example, an improvement of approximately 0.0004 atomic units between the errors of B3LYP and B3LYP-D has been observed. The difference is even smaller between PBE0 and PBE0-D; likewise, the improvements in the cases of M052X-D and M062X-D are negligible and are not discussed further.

Figure 3.2 illustrates the potential energy curves for the H3X-YH3 (X = B, Al; Y = N, P) species in our test set computed at various levels of theory. In addition to the CCSD(T)/CBS(D,T) reference data, the data for EDF2, BMK, and EDF1 are shown, each being a model highlighted in Table 3.2. Also shown are data for the model chemistries exhibiting the smallest and largest AADM for each specific molecule (labelled “Rank 1” and “Rank 20”, respectively) to illustrate the full scope of model performance in each case.

Among other things, Figure 3.2 shows that MAADM is indeed a robust metric for model quality in predicting derived properties over a range of structures. For example, the M052X functional ranks first with respect to MAAD for the AlH3NH3 system, despite the fact that EDF2 seems to track closer to the reference curve in that case. This is due to their respective performance at very small interatomic separations (not shown). The test data begins at interatomic separations of 1.0 Å, which is essentially non-physical for systems with longer equilibrium bond lengths, such as AlH3NH3. All models perform relatively poorly at these small bond lengths and therefore inherit significant errors that contaminate the overall performance metrics. In several cases this can even lead to a change in the rank of some model chemistries for a specific chemical species, relative to the ranking that would be achieved on a dataset that included only the longer bond lengths in our set. To probe this effect further, the analysis was rerun using reference data only for bond lengths along the PES from the CCSD(T)/CBS(D,T) equilibrium value (req) to 4.0 Å (as listed in Table 3.3). It was found that model performance is universally and significantly improved. Regardless of this fact, however, it is noted that the leading model chemistries did not significantly change when assessed at only the longer bond lengths. Again, it is found that EDF2, M06L, PBE0, B3PW91, and M06 are all within 1 kcal/mol (according to MAADM) of the top of the list. Therefore, the robustness of a validation is obviously improved with large data sampling over many data points and chemical species, as has been done here.

[Figure 3.2: four panels (H3BNH3, H3BPH3, H3AlNH3, H3AlPH3) plotting Rel. E (au) against the dative bond length r (Å). Curves in each panel: CCSD(T)/CBS, EDF1, BMK, EDF2, Rank 1 (PBE0, PBE0, M052X, and M06L, respectively) and Rank 20 (B86, B86, B86, and MP2, respectively).]

Fig. 3.2 Potential energy surfaces for H3X-YH3 (X = B, Al; Y = N, P), computed at the reference CCSD(T)/CBS(D,T) level of theory and various model chemistries corresponding to the top performing models as measured by MAAD (EDF2), AAD (BMK), MUE (EDF1), and MSE (EDF1). Also shown are models having the least (Rank 1) and most (Rank 20) average absolute deviations for the specific chemical system shown.

Table 3.3 MAADM values (in a.u.) for all models employed when vetted from req to 4.0 Å.

Model     MAADM
PBE0      0.0012
B3PW91    0.0016
BMK       0.0017
EDF2      0.0017
ωB97X     0.0019
M062X     0.0020
MP2       0.0020
M052X     0.0020
M06L      0.0020
M06       0.0023
B3LYP     0.0024
B3LYP5    0.0024
M06HF     0.0027
M05       0.0028
EDF1      0.0032
HCTH      0.0038
PW91      0.0052
PBE       0.0054
B88       0.0062
B86       0.0063

Generally speaking, the Lewis acids containing aluminum are modeled more accurately than their boron-containing analogues. Likewise, the Lewis bases containing nitrogen are generally modeled more accurately than their phosphorus-containing analogues, though this is less consistent. Consequently, BH3PH3 is the most poorly modeled system in our test set (shown in Figure 3.2). Fluorinating the Lewis acids does not significantly change the bond length but does impart significant changes on the potential energy well of the dative bond (see following chapter). The well becomes shallower in the case of fluorinated boron atoms, while it becomes steeper in the case of fluorinated aluminum atoms. This consequently provides a simple and convenient procedure for varying dative bond strength in our dataset. In general, the fluorinated analogues of selected dative species are modeled better than the hydrogen-containing counterparts in boron-containing compounds, while the opposite is true for the aluminum-containing compounds. As such, AlH3NH3 is the chemical species modeled most accurately over the set of chosen methods, though the effect is small and AlF3NH3 is also modeled very well.

3.4 Conclusion

An automated workflow for the purpose of validating computational chemistry models has been developed and tested in the "big data" context. Such a context is becoming increasingly relevant, especially in the chemical sciences, and there is a need to adapt and develop strategies for harnessing the power of shared open repositories within discipline-specific domains. The workflow is part of an ongoing attempt to automate practices that are common, effectively standardized, and of significant importance within the computational chemistry community, like model validation. It combines a wide range of open-source functionalities including conversion of legacy formats in computational chemistry to the CML:CompChem standard via JUMBO; storage and indexing within a central networked repository via Drupal, Fedora and Apache Solr (all packaged within the Islandora suite); and finally data retrieval and analytics via the Islandora REST API and in-house code that computes various standard statistical measures of model performance. The performance of the workflow has been shown to be quite promising, and it will be an increasingly attractive methodology when applied to larger datasets and repositories. In total, retrieval of 4960 data points representing 31 constrained optimizations along the PES of each of 8 different chemical species using 20 different model chemistries, coupled with the subsequent calculation of statistical measures of performance, can easily be completed within about 5 seconds (depending on network speeds). In particular, the critical step of query response within the Solr module is completed in about 4 ms and will scale quite favourably with a growing repository.

As a test case, the performance of 20 electronic structure theories in terms of their ability to reproduce the PES of 8 unique dative bonding structures has been assessed. These are properties of the chosen chemical systems known to be challenging for many of the models employed. The chosen models were benchmarked against a robust CCSD(T)/CBS(D,T) reference set of calculations. It is found that several density functional models (namely EDF2, M06L, PBE0, B3PW91, and M06) coupled with the 6-31G* basis can be quite good for evaluating derived properties of these structures, though it is noted that some species are not well represented by any of the employed approximate computational models. In particular, dative structures containing aluminum are more reliably modeled than boron analogues, and those containing nitrogen are more reliably modeled than phosphorus analogues. Equilibrium structures, however, are reasonably well predicted for all species.

The conclusions of this chapter also point to directions for further investigation. For instance, there is a need to properly investigate the unusual PES of F3B-PH3, which has been consistently observed with various methods and for only this complex. The second important conclusion of this chapter is that DFT methods are found to be inadequate for the study of Lewis acid-base complexes. This also suggests testing other computational chemistry methods that can lead to more accurate PESs.

Chapter 4

A Novel Bonding Mode in Phosphine-Haloboranes

4.1 Introduction

The results obtained in the previous chapter revealed that one of the complexes in the test set (namely F3B-PH3) exhibits an anomalous PES. This unusual behavior was noticed in several DFT methods employed to predict the PES of F3B-PH3. This was found to be a unique and interesting case that required a full investigation with high-accuracy computational methods. The current chapter presents the investigation performed to discover the chemistry behind this unusual behavior of F3B-PH3; this work has also been published[134].

The elucidation of properties of boron- and phosphorus-based adducts has been of interest to chemists for many decades[135–143]. These compounds play many important roles in synthetic chemistry and are high-potential candidate materials for [144–148]. Furthermore, B-P compounds are particularly known as efficient refractory semiconductor materials, and thus are also of interest in solid state physics[147], and have shown potential as reducing agents in biological systems[145, 146]. The importance of B-P complexes in a wide range of applications makes it critical to fully understand their structure and stability, which has been attempted both theoretically and experimentally dating back as far as the 1940s[135, 149–151, 136, 152, 137, 153, 138–141, 154–160, 142]. Understanding the bonding in these dative complexes has been the subject of much research and many differing explanations have been offered for the trends in the strengths of such Lewis acids and bases.

From a theoretical perspective, however, understanding dative complexes can be a challenging task, as it is well known that many computational methods (density functional theory, or DFT, in particular) do not describe the properties of dative bonded complexes well[161–163]. Specifically with respect to B-P dative complexes, there have been several theoretical studies employing a variety of techniques that have revealed problematic or contradictory results[149, 139, 164, 165, 138, 166].
Janesko, for example, studied substitutions on boron of B-P compounds and found that the popular B3LYP[57] functional cannot reproduce accurate results for bond dissociation energies (BDEs)[139]. It has also been reported that DFT methods produce C1 geometries of B-P-benzene and B-P-coronene molecular systems whereas the ab initio MP2/TZVPP[46, 167, 168] and CC2/TZVPP[169, 167, 168] levels of theory produce the correct D3h geometries[164]. Nevertheless, there have still been several valid attempts at rationalizing B-P bonding from a theoretical perspective[160].

As mentioned previously, there are other seemingly spurious computational results in the test set of non-metallic Lewis acid-base pairs[69]. While surveying the potential energy surfaces associated with the dative bond stretching of a series of small Lewis acid-base pairs, it was found that fluorinated phospho-boranes were unique in that they exhibited an inflection point on both sides of the global minimum associated with the equilibrium bond length (see Figure 4.1). The PES of F3B-PH3 predicted at the reasonably reliable M06L/cc-pVQZ density functional level is shown in Figure 4.1, where the inflection “shoulder” is observed at r ≈ 2.3 Å (illustrated with a blue dot) and the global minimum is at r ≈ 3.1 Å. This is to be contrasted with the PESs of H3B-PH3 and F3B-NH3 predicted at the same level of theory, whose curves exhibit the usual Morse-like PES behaviour for bond stretching with their minima at r ≈ 1.9 Å and r ≈ 1.7 Å, respectively. The inset plot shows the magnified PES of F3B-PH3 and highlights the unusual behaviour of the PES before the global minimum. This anomaly is not observed for either H3B-PH3 or F3B-NH3, suggesting that this phenomenon cannot be attributed directly to the trifluoroborane or the phosphine alone, but rather is the result of the combined effects of each, perhaps in combination with the inaccuracies of the theoretical model.

Despite a rather long history of theoretical and experimental investigation of Lewis acidity and basicity in a wide range of contexts,[157] a survey of the literature indicates that this unusual behaviour of the dative bond PES of the F3B-PH3 complex has not been previously reported, nor systematically investigated or explained. Is this behaviour an artifact of the calculations or a bona fide chemical phenomenon that has yet to be explored? Given the importance of these species, efforts have been made to further explore the nature of the bonding within these and other halogenated phospho-boranes by employing a wide range of computational techniques including several density functional and highly accurate ab initio wave function models, comparison with the PESs of analogous complexes, molecular orbital (MO) analysis, and energy decomposition by systematic fragmentation of the Lewis acid and base components. This project aims to reveal and characterize some rather remarkable chemical behaviour. Atomic units are used throughout unless otherwise indicated.

[Figure 4.1: plot of energy relative to E(req) (a.u.) against r / Å for F3B-PH3, H3B-PH3, and F3B-NH3, with an inset magnifying the F3B-PH3 curve.]

Fig. 4.1 Comparison of PES scans of F3B-PH3, F3B-NH3 and H3B-PH3 along the dative bond calculated at the density functional M06L/aug-cc-pVTZ level of theory. The blue dot highlights the inflection point at r < req.

4.2 Computational Methods

To probe the nature of the PES of dative bond stretching of halogenated phospho-boranes, a relaxed scan (where a molecular system is allowed to reach its equilibrium structure) of the boron-phosphorus bond was performed for several species (vide infra) with a wide variety of electronic structure techniques. The presence of an inflection in the PES at r < req was confirmed by the analysis of the second derivative of the PES curve, characterizing the inflection as a point where the second derivative vanishes. For all optimizations, the Lewis acid-base bond was constrained at one of 31 bond distances from 1.0 to 4.0 Å (in steps of 0.1 Å) while all other geometrical parameters were allowed to fully relax.

For DFT calculations, shown herein are results from the M06L functional[118] with the aug-cc-pVQZ basis (unless otherwise indicated); however, this functional has also been coupled to a much wider series of standard basis sets (M06L/6-31G*, M06L/6-31+G**, M06L/6-311+G*, M06L/cc-pVTZ, M06L/aug-cc-pVTZ) to investigate the basis set dependence of the PES. Additionally, a much larger swath of functionals has been probed, all of which exhibit qualitatively identical results (see Appendix B). The PESs were also predicted by ab initio wave function methods including MP2/6-31G*, MP2/6-311G*, MP2(Full)/cc-pVTZ, CCSD(T)/cc-pVDZ//MP2/6-31G*, CCSD(T)/cc-pVTZ//MP2/6-31G*,[170, 171] CR-CCL/ACCD//MP2/6-31G*, CR-CCL/ACCT//MP2/6-31G* and an extrapolated “complete basis set” (CBS) approach for each of the coupled cluster schemes[172–176] obtained from the following extrapolation formula:[177]

$$E_{DT} = \frac{E^{\mathrm{corr}}_{D} D^{3} - E^{\mathrm{corr}}_{T} T^{3}}{D^{3} - T^{3}} \tag{4.1}$$

where $E^{\mathrm{corr}}_{D}$ and $E^{\mathrm{corr}}_{T}$ represent correlation energies predicted at the double-zeta and triple-zeta levels respectively, while D = 2 and T = 3. $E_{DT}$ is added to the triple-zeta energy and thus the total benchmark energy of the system (for the CR-CCL case, for instance) is given as:

$$E_{\mathrm{CR\text{-}CCL/CBS(D,T)}} = E_{\mathrm{CR\text{-}CCL/ACCT}} + E_{DT} \tag{4.2}$$
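The second-derivative criterion for locating an inflection point, described at the start of this section, can be sketched numerically as follows. This is an illustrative Python sketch under stated assumptions and not the analysis code used in this work: it assumes a uniformly spaced scan, approximates the second derivative by central finite differences, and uses a Morse curve with hypothetical parameters as stand-in data. All function names are hypothetical.

```python
import math


def second_derivative(r, energy):
    """Central finite-difference d2E/dr2 on a uniformly spaced grid.

    Returns the interior grid points and the derivative values there.
    """
    h = r[1] - r[0]
    d2 = [(energy[i - 1] - 2.0 * energy[i] + energy[i + 1]) / h**2
          for i in range(1, len(energy) - 1)]
    return r[1:-1], d2


def inflection_points(r, energy):
    """Bond lengths where d2E/dr2 changes sign (linear interpolation)."""
    grid, d2 = second_derivative(r, energy)
    points = []
    for i in range(len(d2) - 1):
        a, b = d2[i], d2[i + 1]
        if a == 0.0:
            points.append(grid[i])
        elif a * b < 0.0:
            points.append(grid[i] + (grid[i + 1] - grid[i]) * a / (a - b))
    return points


def morse(r, depth=0.05, width=1.5, r_eq=1.7):
    """Morse curve with made-up parameters, used only as test data."""
    return depth * (1.0 - math.exp(-width * (r - r_eq))) ** 2


# 31 scan points from 1.0 to 4.0 Å in 0.1 Å steps, as in the text.
scan_r = [1.0 + 0.1 * i for i in range(31)]
scan_e = [morse(x) for x in scan_r]
morse_inflections = inflection_points(scan_r, scan_e)
```

An ordinary Morse-like curve has a single inflection, located beyond the minimum at r_eq + ln(2)/width. An additional sign change at r < req is the feature of interest here, and it can be selected by filtering the returned list against the equilibrium bond length.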

Complete active space (CAS)[178] calculations were also performed to further probe the dependence of the PES on the effects of static correlation. CAS calculations correct for static correlation errors arising from an inadequate description of the possible configuration state functions accessible to a chemical system. In CAS, all possible Slater determinants of a selected set of active occupied and virtual orbitals are taken into account and full configuration interaction is calculated on that so-called “active space”. CASSCF(8,8)/cc-pVTZ calculations were performed after preliminary calculations on the density matrices of a variety of sets of active spaces, where the (8,8) designation signifies that 8 electrons were allowed to be permuted among 8 active molecular orbitals. The selection of the active orbital space was based on two observations from preliminary calculations: (i) occupancies of orbitals outside of the (8,8) active space did not change when larger spaces were tested, and (ii) the trend of the PESs obtained was qualitatively the same regardless of the selected active space and the number of basis functions in the level of theory that was employed. Specifically, it is noticed that CASSCF(6,6) and CASSCF(8,8) produce qualitatively identical results (Appendix B). In cases when the curvature of the PES was especially flat, fine integration grids and stringent convergence criteria have been employed to ensure the reliability of the results. Additionally, relevant minima were confirmed by performing the standard frequency analysis. Consequently, a thorough exploration of the space of available high-accuracy computational models suitable for this application has been made.

All DFT calculations were performed using the Q-Chem[131] software package. All CR-CCL calculations were performed using the GAMESS[179, 176] suite of programs and all MP2, CCSD(T), and CAS calculations were performed with the Gaussian 09 software[132].
Furthermore, the PESs of the relevant compounds of interest have been decomposed into their Lewis acid and base components by a systematic fragmentation approach. In this approach, the contribution to the total PES due to the structural rearrangement of the Lewis acid and base from their isolated structures has been made. To accomplish this, at each point along the PES, the electronic energy of each of the Lewis acid and base are calculated separately by holding them frozen in the geometry they adopted in the full complex at that point. These are referred to as Eacid and Ebase, respectively, and they quantify the energetic cost of rearranging the geometries of the isolated Lewis acid or base from their individual minimum-energy structure.

The electrostatic interaction energy between the “frozen” fragments (Eint) may then also be calculated for each point as the difference between the energy of the original point along the PES (Etotal) and the sum of the component energies (Eacid, Ebase).

[Figure 4.2: plot of E(r) − E(req) / (a.u.) versus r / Å (1.5 to 3.5 Å). Legend: M06L/aug-cc-pVQZ, MP2/cc-pVTZ, MP2/cc-pVQZ, CCSD(T)/VTZ, CCSD(T)/CBS, CR-CCL/ACCD, CR-CCL/ACCT, CR-CCL/CBS, CASSCF(8,8)/6-311+G*.]

Fig. 4.2 PES scans of F3B-PH3 by several levels of theory are presented here. The blue dots on each curve illustrate the position of inflection points at r < req. r represents the B-P bond length in Angstroms (Å) from the relaxed PES scan.

\[ E_{\mathrm{int}} = E_{\mathrm{total}} - E_{\mathrm{PH_3}} - E_{\mathrm{F_3B}} \tag{4.3} \]

Decomposing the PES in this way provides considerable insight into the nature of the bonding as the bond is stretched and/or compressed, and it is useful for interpreting the features of the PES discussed below.
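As a minimal sketch of Eq. 4.3, the decomposition at each scan point reduces to a subtraction. The function name and the numerical values below are hypothetical placeholders rather than data from this work, and all energies are assumed to be expressed in a.u. on a common scale (relative to the relaxed, isolated fragments).

```python
def interaction_energy(e_total, e_acid, e_base):
    """Return E_int = E_total - E_base - E_acid for every point of a PES scan."""
    return [t - b - a for t, a, b in zip(e_total, e_acid, e_base)]


# Hypothetical values for two scan points (a.u.), for illustration only.
e_total = [-0.010, -0.004]
e_acid = [0.030, 0.002]   # assumed cost of distorting the Lewis acid
e_base = [0.001, 0.000]   # assumed cost of distorting the Lewis base
e_int = interaction_energy(e_total, e_acid, e_base)
```

With these placeholder numbers, a negative Eint alongside a positive Eacid mirrors the competition between stabilizing interaction and distortion cost discussed in this chapter.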

4.3 Results and Discussion

4.3.1 PES of F3B-PH3

The PES of F3B-PH3 has been calculated by employing several fundamentally different computational techniques of varying theoretical rigour. Some representative results are plotted in Figure 4.2 and, with very few exceptions, the results are surprisingly consistent. In some cases the inflection is more pronounced than in others; however, it persists in all cases. The global minimum in each case is usually found at ≈ 3.1 Å, which is consistent with the relative weakness of the dative bond (low dissociation limit relative to the minimum) in this case[151]. Though it is non-trivial to ascertain exactly which model should be considered the theoretical “reference” in such a case, it is interesting to note that the pre-equilibrium inflection point persists despite an extensive exploration of the space of configuration state functions and the space of basis functions. A range of basis set sizes has been employed to observe the behaviour of the PES. In combination with density functionals, the basis sets range from Pople’s double-zeta basis sets to the rather large augmented quadruple-zeta basis set of Dunning,[40, 39] aug-cc-pVQZ, whereas for highly correlated wavefunction-based models triple-zeta basis sets have been employed. It is tempting to conjecture that the inflection is a basis set artifact because it appears to diminish with increasing basis set size (see, for example, CR-CCL/ACCD versus CR-CCL/ACCT). This cannot be confirmed, however, as the inflection persists for all models, even when a complete basis extrapolation is employed (though the CR-CCL/CBS curve appears to be an outlier). What can be confirmed is only that increasing the basis set generally destabilizes the complex at smaller bond lengths relative to longer ones, and that beyond about 3 Å the curves seem to converge.
It is noteworthy that this behaviour is consistently observed in various coupled cluster methodologies, including CR-CCL, which is meant to be superior to the “gold-standard” CCSD(T) approach for bond breaking situations due to the triples contribution to the energy being calibrated from the difference between the exact solution and the CCSD energy. Even with the exploration of configuration space via a CAS approach, a typical Morse-like potential shape of the PES is not observed.
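One way to check a scanned curve for inflection points is to look for sign changes in a finite-difference second derivative. The sketch below is illustrative only: the function name is hypothetical, and the test curve is a Morse potential with made-up parameters, not data from this work. For a Morse curve the only inflection lies beyond req (at req + ln 2 / a), which is the behaviour that the PESs reported here depart from.

```python
import numpy as np


def inflection_points(r, energy):
    """Estimate r values where the second derivative of E(r) changes sign."""
    r = np.asarray(r, dtype=float)
    energy = np.asarray(energy, dtype=float)
    d2 = np.gradient(np.gradient(energy, r), r)
    # Drop two points at each end, where one-sided differences are less accurate.
    rr, dd = r[2:-2], d2[2:-2]
    idx = np.where(dd[:-1] * dd[1:] < 0)[0]
    # Linear interpolation of each zero crossing between neighbouring grid points.
    return rr[idx] - dd[idx] * (rr[idx + 1] - rr[idx]) / (dd[idx + 1] - dd[idx])


# Stand-in Morse curve with hypothetical parameters (not data from this work).
depth, a, r_eq = 1.0, 1.0, 2.0
r = np.linspace(1.5, 5.0, 351)
morse = depth * (1.0 - np.exp(-a * (r - r_eq))) ** 2
points = inflection_points(r, morse)
```

Applied to a real scan, an entry of `points` smaller than the position of the minimum would correspond to the pre-equilibrium inflection discussed above; the accuracy depends on the spacing of the scan points.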

4.3.2 Energy Decomposition

To further explore the theoretical rationale for such a result, the F3B-PH3 PES was decomposed into its Lewis acid and base constituents (Eacid, Ebase) as well as the electrostatic interaction component of their total energy (Eint), as described above. These are plotted in Figure 4.3, along with the overall PES, Etotal. Not surprisingly, the phosphine component (Ebase) does not experience a large fluctuation in energy anywhere along the curve, which reflects the relatively small structural change of this trigonal pyramidal fragment as the dative bond is formed. The boron trifluoride fragment (Eacid), however, exhibits a steep relaxation energy profile as it progresses from a highly “compressed” (and relatively unstable) trigonal pyramidal structure at small r (and small ∠FBF) to a fully relaxed trigonal planar structure at large r (∠FBF = 120°). The stabilizing component of the PES comes from the electrostatic interaction (Eint) between these fragments, which is plotted in green. Though Eint is negative for most of the range plotted, the rather deep well in the Eint curve demonstrates a strong stabilizing interaction between the F3B- and -PH3 fragments near ≈ 1.9 Å, coinciding with the inflection in the PES. At long range the electrostatic attraction of the fragments is small but significant enough to bind them, owing to the fact that Ebase and Eacid are negligible, and the result is a global minimum predicted at r ≈ 3.2 Å. In other words, the Lewis acid and base experience an electrostatic attraction prior to any geometrical rearrangement.

[Figure 4.3: plot of E(r) − E(req) / (a.u.) versus r / Å (1.5 to 3.5 Å). Legend: F3B-PH3 (Etotal), PH3 (Ebase), BF3 (Eacid), Eint.]

Fig. 4.3 PESs of the dative bond in F3B-PH3, BF3, PH3, and the interaction energy between the two components calculated at the M06L/aug-cc-pVQZ level of theory.

4.3.3 Molecular Orbital Analysis

Analysis of the molecular orbitals (MOs) by visualizing and comparing them at key bond distances provides deeper insight into these components of the PES of the complex. Figure 4.4 illustrates the frontier MOs of the F3B- fragment of F3B-PH3 in geometries that match its structure at bond lengths of 1.3, 1.9, 2.5 and 3.1 Å in the full dative complex. Generally, the pyramidalization of the BF3 component induces a concomitant rise in the energy of almost all of the MOs, with the exception of the LUMO. This orbital is primarily composed of side-on overlap of equivalent p orbitals on adjacent F atoms and opposite-phase p orbitals on the central B atom. This side-on overlap of F orbitals is enhanced as the BF3 becomes more pyramidal. In the range 1.9 Å ≤ r ≤ 3.1 Å, this effect lowers the LUMO energy much more significantly than it raises the energy of the occupied orbitals. The HOMO-LUMO gap therefore shrinks significantly with decreasing ∠FBF, and this is the driving force behind the favourable electrostatic component of the PES. The occupied orbitals of BF3 increase in energy as ∠FBF decreases, which destabilizes the molecule (Eacid) but simultaneously makes the molecule more susceptible to dative bonding by an incoming lone pair. Dative bond formation can be thought of as the sp3σ → pσ donation of electron density from the Lewis base to the acid and thus the LUMO of the Lewis acid is a significant factor in determining the strength of the interaction[159]. All other things being equal, the lower the LUMO level, the more favourable dative bond formation will be and the more stabilizing Eint will be. This stabilization of the BF3 LUMO increases indefinitely with the pyramidalization of the BF3 fragment but it is ultimately outweighed by the unfavourable change in each of the other MOs in the system as ∠FBF decreases.

[Figure 4.4: frontier MO energy level diagrams; molecular orbital energies / (a.u.) shown at r / Å = 1.3, 1.9, 2.5 and 3.1.]

Fig. 4.4 Frontier MO energy level diagrams of the boron trifluoride fragment of F3B-PH3, calculated at the geometries of several bond lengths, r. Data is from a PBE1PBE/6-311+G* calculation and images are plotted at an isovalue of 0.07 a.u.

It is therefore apparent that the unusual behaviour observed in the PES of dative bond stretching in F3B-PH3 is the result of a confluence of several factors. The Lewis acid and base experience a long-range electrostatic attraction beginning at about 3.5 Å. The magnitude of this attraction slowly increases until the components reach about 2.5 Å separation, at which point the attraction becomes much stronger and |Eint| ultimately reaches its maximum at 1.9 Å. This is consistent with the expected Lennard-Jones-like r⁻⁶ behaviour of interacting molecular species at long range. The overall potential, however, is further complicated by the necessary pyramidalization of the BF3 fragment to allow for dative bond formation. Such pyramidalization results in (i) the lowering of the BF3 LUMO with the decrease in ∠FBF (subsequently stabilizing the complex) and (ii) an associated increase in steric repulsion (destabilizing the complex). These factors are of similar magnitude but the destabilizing steric repulsion is sufficient to negate the electrostatic attraction of the fragments at small r. Since the electrostatic attraction is present even at large r, prior to any significant degree of pyramidalization of the BF3 fragment, the global minimum energy structure for this species occurs at r ≈ 3.2 Å and a feature resembling a shoulder is observed in the PES at smaller r, where |Eint| is larger. This behaviour is in agreement with many earlier experimental investigations of F3B-PH3[156, 151, 157], which report a loosely formed adduct whose bond length is difficult to determine.

4.3.4 Comparison with Analogous Systems

We can use this result as the basis for a guided exploration of chemical space to further elucidate this phenomenon. For example, if the multiple inflections in the PES of F3B-PH3 are the result of the electrostatic attraction of the Lewis acid/base pair being similar in magnitude (albeit smaller) to the energetic cost of the required pyramidalization of the Lewis acid, then one should probe cases with modified electrostatic interactions and steric penalties associated with dative bond formation.

Several obvious candidates include the F3B-NH3 and H3B-PH3 complexes presented earlier in Figure 4.1. A decomposition of their potential energy surfaces is presented in Figure 4.5 (parts A and B, respectively) in the same manner as shown in Figure 4.3. As in the case of Figure 4.3, Ebase does not contribute significantly to the features in any of the overall PESs.

The F3B-NH3 system has exactly the same Eacid component as F3B-PH3 (see Figure 4.5A), whereas the NH3 Lewis base is different and must get closer to the BF3 fragment for the dative bond to form. The ∠FBN angle at req is about 104°, which compares with the ∠FBP angle of 105° in F3B-PH3 at req. The electrostatic interaction between the acid/base pair, however, is significantly stronger for NH3 than for PH3, which is consistent with the earlier theoretical work of Bessac and Frenking[160]. It is well known that NH3 is a stronger Lewis base in this context and this effect outweighs the cost of geometric rearrangement of the BF3 fragment, which results in a far more “typical” PES. Alternatively, the arsenic analogue F3B-AsH3 should exhibit far weaker electrostatic interaction and, accordingly, has not been successfully synthesized[156].

[Figure 4.5: three panels, (A) BF3NH3, (B) BH3PH3 and (C) BCl3PH3, each plotting energy / (a.u.) from −0.075 to 0.1 versus r / Å from 1 to 4. Legend: Eacid, Ebase, Etotal, Eint. Panel C includes an inset spanning 1.5 to 3.5 Å.]

Fig. 4.5 PES scans of the dative bonds in (A) F3B-NH3, (B) H3B-PH3, and (C) Cl3B-PH3 along with their decomposed components and the electrostatic interaction energy between the fragments at various geometries, calculated at the M06L/cc-pVQZ level of theory. The inset plot of part C illustrates Etotal for BCl3PH3 magnified.

The energetic cost of pyramidalization in the case of F3B-PH3 can be decreased by replacing F3B with H3B. Figure 4.5B illustrates the intuitive result that the BH3 fragment does not exhibit a severe pyramidalization penalty (Eacid) in the relevant distance range and, as such, the overall PES resembles the Eint curve quite closely. Again, the result is that Eint is able to dominate Eacid and produce a typical PES.

The search for simple substituents on boron that would induce a higher cost of pyramidalization led us to consider the chlorinated analogue BCl3, owing to the larger size of the chlorine atoms relative to fluorine. In their thorough review, Staubitz et al. reported that the cost of distorting BCl3 from trigonal planar to pyramidal was higher than that of BF3[157]; however, this is in stark contrast to both Hirao et al.[158] and Frenking[159]. Nonetheless, the results for the Cl3B-PH3 complex are presented in Figure 4.5C and reveal a particularly remarkable result. The first point to be noticed is that Eacid is nearly identical for BCl3 and BF3 paired with PH3, in agreement with earlier reports[158, 159]. What is particularly noteworthy, however, is that the overall PES of the Cl3B-PH3 complex actually exhibits two distinct minima. The first is observed at req ≈ 2.0 Å, in close agreement with experimental estimations of the dative bond length of this complex[150] and previous geometry optimizations performed at the density functional BP86/TZ2p level[160]. The second minimum is observed at r ≈ 3.5 Å and is confirmed by several independent, fully relaxed geometry optimizations at the density functional and ab initio levels of theory (see Appendix B). To the best of our knowledge, this is the first reported example of a bimodal dissociation curve of a single small molecule. Consequently, it can be claimed that this is the first example of a single bond that exhibits two distinct minima while stretching along its axis.
This rather remarkable feature is the result of a stronger close-range interaction between BCl3 and PH3, relative to BF3, due to the smaller HOMO-LUMO gap in the pyramidal chlorinated analogue (0.211 a.u. versus 0.336 a.u. for BF3). Though decreasing ∠ClBCl comes at a similar energetic cost to decreasing ∠FBF, the LUMO of pyramidal boron trichloride is significantly lower than that of pyramidal boron trifluoride. This stabilizes the dative bond and is in close agreement with earlier comparisons of these Lewis acids[159]. Because this enhanced stabilization only occurs at close range (i.e. r ≈ 1.9 Å), the long-range behaviour of the PES is relatively unaffected and the minimum at r ≈ 3.5 Å persists and is comparable to the minimum observed for the F3B-PH3 analogue. At this separation, the Lewis acid and base are electrostatically bound without the acid having to distort its trigonal planar geometry. The fully optimized structure of Cl3B-PH3 at r ≈ 3.5 Å is shown in Figure 4.6. Between 2.5 and 3.0 Å, the energetic cost of pyramidalization for BCl3 increases faster than the electrostatic attraction of the fragments and a maximum is observed separating the two distinct minima.

[Figure 4.6: optimized structure with labelled distances of 1.76 Å, 3.47 Å and 1.43 Å.]

Fig. 4.6 Optimized geometry of the BCl3PH3 dative complex at the MP2/aug-cc-pVTZ level of theory.

4.4 Conclusion

It has been shown that the potential energy surface of dative bond stretching in a series of haloboranes demonstrates highly unusual behaviour and cannot be fit to any traditional Morse or Lennard-Jones type potential. Specifically, in the case of F3B-PH3, the PES scan of the dative bond exhibits an inflection point before the global minimum. Additionally, the chlorinated analogue (Cl3B-PH3) exhibits similar behaviour but the effects are amplified and the PES is observed to have two distinct minima. This is explained by the competition between the energetic cost of the required pyramidalization of the Lewis acid to form a dative bond, and the stabilization from the electrostatic attraction between the Lewis acid and base (while being held frozen at their respective geometries from the complex). When the cost of pyramidalization of the Lewis acid is high relative to the electrostatic attraction between the acid and base, the potential well associated with dative bonding is significantly affected and the result is a relatively flat PES that is susceptible to the unusual characteristics described herein.

The species studied in this work have been investigated by a wide variety of computational techniques including many DFT methods and many levels of wave function approximations with a wide range of basis sets. All calculations yielded qualitatively similar results. Though the PES behaviour described herein has persisted in all the calculations, it is important to note that the atypical inflections do seem to diminish with increasing basis set size. In particular, increasing the size and flexibility of the basis set seems to destabilize the PES at small r, relative to the large r portion of the curve (e.g. see Figure 4.1). It is conceivable that the multiple inflections observed for the F3B-PH3 complex PES could be resolved once higher levels of theory are applied; however, the inflections are even more pronounced for the Cl3BPH3 complex and the small r portion of the curve presented already agrees well with both experimental and additional theoretical benchmarks.
In any case, it is expected that isolation of the weakly bound Cl3BPH3 complex at r ≈ 3.5 Å would be a significant experimental challenge.

This chapter was a follow-up to the previous one, in which the anomalous behaviour of the PES of F3B-PH3 was observed. The conclusions of the detailed investigations have been presented above. The following chapter of the thesis is another follow-up, motivated by the poor performance of the DFT methods for the study of Lewis acid-base pairs.

Chapter 5

Intracules as New Molecular Descriptor in QM/ML

5.1 Introduction

The DFT methods employed in Chapter 3 failed to produce PESs comparable to the reference. There is a need for an alternative computational approach that is inexpensive but can predict the PESs of Lewis acid-base pairs within an error limit of 1 kcal/mol. This chapter demonstrates some new quantum chemistry calculations on molecular systems that are simpler than Lewis acid-base complexes, namely small organic molecules. The results of this chapter suggest whether such alternative computational techniques should be applied to Lewis acid-base complexes.

With the rapid advancement in computer hardware and algorithms, the approximate models of quantum chemistry can predict the properties of molecular systems to acceptable accuracy[180, 181]. However, mathematical complexity increases with the size of a system, which results in a greater compromise between performance and computational cost. Several ongoing efforts have been devoted to the development of faster and more accurate models[182, 183, 46, 184]. This has resulted in numerous approximate models over the past few decades. These approximations originate from two exact quantum theories: the Schrödinger wave equation (SWE)[185] and density functional theory (DFT)[49, 50]. These approximations aim to approach the exact solution of the SWE for the prediction of molecular properties. SWE-based methods are the Hartree-Fock (HF) and post-HF methods[182, 183, 46, 184]. Post-HF methods predict the correlation energy, which is equivalent to the error in HF calculations. The coupled cluster methods are considered the most accurate post-HF methods; however, they are not always the preferred choice due to their high computational cost (e.g. for CCSD(T)[48, 47], the cost scales as O(N⁷), where N is the number of basis functions).
DFT methods, with a computational cost scaling up to O(N³), are based on the Hohenberg-Kohn theorem[49], which states that the exact energy of the system is a functional of its electron density (ρ(r)). DFT methods have become the most popular methods because the wave function keeps track of the coordinates of all N electrons whereas the electron density depends on a single spatial coordinate, r. The accuracy of any DFT method depends on the molecular system and the property of interest. Its computational cost is usually lower than that of the HF method[186].

In an era of rapidly growing basic health and household needs worldwide, the design of novel molecules for the pharmaceutical and materials industries also requires rapid techniques. High-throughput screening is useful but the screening of hundreds of thousands of molecules is hardly possible with quantum chemistry methods alone[187–189]. Therefore, there is a need for computing approaches that can predict the electron correlation energies of a pool of molecules expeditiously and with high accuracy.

The marriage of quantum mechanics and machine learning (QM/ML) has been found to be promising for the prediction of a range of molecular properties with acceptable accuracy[61–64]. The properties include atomization energies, eigenvalues of molecular orbitals, electron correlation energies, ionization potentials and electron affinities, to name a few[67, 190, 191, 62]. The basic idea of QM/ML involves quantum chemistry calculations at a cheap computational cost followed by fast and inexpensive prediction of molecular properties by machine learning algorithms[66–68]. This eliminates the need for expensive quantum mechanics calculations.
More accurate predictions can be obtained simply by increasing the amount of data on which machine learning models are trained[65]. The essence of ML is the ability to recognise similarities in the patterns of given data. The molecules in the dataset are composed of only a few kinds of atoms, but their connectivity and the number of each kind of atom turn them into thousands of different molecules that differ by minor structural and/or atomic changes. Therefore, this does not decrease the advantage that ML can take from molecular systems. The presence of structural and atomic similarities, which is inherent in molecular systems, is substantially useful for an ML model to work efficiently. However, the molecules must be represented in such a way that pattern recognition becomes as simple as possible.

Molecular representations must possess certain properties that improve the performance of machine learning algorithms. The most invariant representation of the molecular systems is probably the most crucial factor in QM/ML[66, 191, 67]. A number of molecular representations (Coulomb matrices, crystal structures, nuclear chemical shifts, Fourier transformation of atomic radial distributions, etc.) have been reported in the past decade in order to predict more accurate results[191, 67, 192–194]. In machine learning, these representations are sometimes called ‘descriptors’. Descriptors are developed based on ad hoc reasoning and can only be assessed by their performance. A ‘good’ descriptor in QM/ML is unique for each molecular system, has the least possible dimensionality, is independent of atomic indexing, and is invariant to rotational and translational motions. In this chapter, molecular systems are represented by a descriptor called an intracule. Intracules are of different kinds; the position intracule is the one relevant to this thesis. A position intracule is a probability function of an electron pair according to their relative positions in space[195]. Position intracules are invariant to translational and rotational motions, and are independent of nuclear indexing. Besides the descriptor, the choice of machine learning algorithm is also crucial to machine learning predictions. The appropriate choice of algorithm depends on the problem at hand, the desired accuracy, scalability and how fast the model must work.

5.2 Methods

5.2.1 Model

Here the dataset under consideration is of a continuous nature and it has features with corresponding labels; thus, the goal is to solve a regression problem. Kernel Ridge Regression (KRR) models are fast and are popular for QM/ML calculations[61, 196, 197]. Therefore, in this chapter, a KRR model has been chosen for machine learning predictions of the correlation energies of the test set molecules. The test set is one of the two subsets of the whole dataset; the other subset is called the training set, on which KRR is trained. In KRR, ridge regression is a regularized form of regression which controls overfitting. This control is desirable in order to keep the predicted properties of chemical systems reliable. However, no regularization factor is used in this chapter since QM data is ‘noiseless’, and it is therefore impossible to overfit in this context. The kernel function in the model involves the transformation of the feature vectors from a lower-dimensional space to a higher-dimensional space while keeping the transformed vectors implicit. The direct application of the regression model in that feature space would require a tremendous amount of computation, but the use of a kernel function bypasses this step and expresses the feature space as a dot product, which is then employed in the algorithm. Thus, a kernelized regression model has been employed to reduce the cost of computations. The kernel function is also called a similarity function because it computes the similarity between features, which can then be separated based on their similarities. The Gaussian kernel, defined below, is employed in our KRR model. It interprets the similarity between two features and scores it from 0 to 1, 1 being the most similar and 0 being the least similar. The Gaussian kernel is given as:

\[ k(p, p') = e^{-(p - p')^2 / 2\sigma^2} \tag{5.1} \]

where σ is the width of the function and p and p′ are two input features. The dataset consists of n pairs including position intracules ‘p’ representing molecules and their respective electron correlation energies ‘c’. Both can be represented in terms of column matrices as pi = [p1, p2, p3,...pn] and ci = [c1,c2,c3,...cn] respectively. The general parametric regression equation in terms of matrices can be written as:

\[ C = \alpha\, k(p, p') + \varepsilon \tag{5.2} \]

where ε is a very small irreducible error independent of p (intracules) in the above equation. The estimator function of the predicted correlation energies Ĉ is defined as f(p) = α̂P, where α̂ represents the coefficients. For this chapter, the kernelized regression equation is given as follows:

\[ f(p) = \sum_{j=1}^{N} \alpha_j\, k(p, p_j) \tag{5.3} \]

where k represents the Gaussian kernel employed and αj represents the weights or coefficients of the input features. Intuitively speaking, the coefficients depend on the difference between the actual correlation energies and the ones predicted by the estimator function, i.e. L = C − Ĉ. Therefore, the coefficients can be optimized by the following loss function:

\[ L(\alpha) = \sum_{j=1}^{N} \left( c_j - f(p_j) \right)^2 \tag{5.4} \]

where cj are the labels corresponding to the features pj. The above equation can also be written as:

\[ L(\alpha) = \sum_{j=1}^{N} \left( c_j - \hat{\alpha} K \right)^2 \tag{5.5} \]

The above penalty function is missing the regularization penalty component since regularization has not been employed in the machine learning model used. The coefficients α can be solved for as:

\[ \alpha = K^{-1} C \tag{5.6} \]

where K is the kernel matrix of training samples and C is the target list of labels from the test set.
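Equations 5.1, 5.3 and 5.6 can be sketched together in a few lines of NumPy. This is an illustrative sketch, not the program used in this work: the function names and toy numbers are hypothetical, (p − p′)² is interpreted as the squared Euclidean distance between feature vectors, and C is assumed to hold the labels of the same molecules used to build K (the standard KRR formulation), so that the dimensions of Eq. 5.6 match.

```python
import numpy as np


def kernel_matrix(A, B, sigma):
    """Gaussian kernel (Eq. 5.1) between every row of A and every row of B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_alpha(P_train, c_train, sigma):
    """Solve K alpha = C (Eq. 5.6) without forming the explicit inverse."""
    K = kernel_matrix(P_train, P_train, sigma)
    return np.linalg.solve(K, np.asarray(c_train, dtype=float))


def predict(P_new, P_train, alpha, sigma):
    """Evaluate f(p) = sum_j alpha_j k(p, p_j) (Eq. 5.3) for each row of P_new."""
    return kernel_matrix(P_new, P_train, sigma) @ alpha


# Hypothetical toy features and labels, for illustration only.
P_train = np.array([[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]])
c_train = np.array([-0.30, -0.45, -0.60])
alpha = fit_alpha(P_train, c_train, sigma=0.5)
fitted = predict(P_train, P_train, alpha, sigma=0.5)
```

Because no regularization term is present, the fitted values reproduce the training labels, so the loss of Eq. 5.4 vanishes on the training set. Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice; it still fails if K is singular, for example when two molecules share an identical descriptor.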

5.2.2 Descriptor

A good molecular descriptor contains the information that corresponds to the property of interest. In this chapter, position intracules (P(u)) are introduced as a novel molecular descriptor that represents the distribution of two electrons according to their relative positions[198, 195]. The position intracules are one-dimensional functions which are defined based on electron density[198, 195]. Intracules have been explored and discussed by many theoreticians over the years[199–203]. They have been proposed as functional models for the prediction of electron correlation energy[204, 205, 198, 195]. Recently, one-dimensional intracule functions have even been employed to model the correlation intracule functional[206]. Here, only one-dimensional position intracules P(u) are employed to represent each molecular system. P(u) can be defined as the probability function of the electron density of the two electrons at their relative positions[202].

The position intracules have been taken as a molecular descriptor because of the properties they possess. As mentioned above, position intracules are one-dimensional and are therefore simple to interpret. They are invariant to rotational and translational motions, as the distance between two electrons remains the same no matter how the whole molecular system is moved. They are obtained from a simple HF calculation, which is also an important factor because complex computational chemistry methods are not required. The position intracules are unique and are thus suitable for representing similar molecular systems. The functions representing different molecular systems also show similarity with each other, as can be seen from Figure 5.1. The similarity between descriptors is vital since ML works on repetitive patterns in the features, which in this thesis are position intracules.
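For reference, the position intracule is commonly written in the intracule literature as an integral over the two-electron density. The symbols ρ2, r1, r2 and r12 are introduced here for illustration and are not used elsewhere in this chapter.

```latex
P(u) = \int \rho_2(\mathbf{r}_1, \mathbf{r}_2)\,
       \delta\!\left(r_{12} - u\right)\,
       d\mathbf{r}_1\, d\mathbf{r}_2 ,
\qquad r_{12} = \left|\mathbf{r}_1 - \mathbf{r}_2\right|
```

Because P(u) depends only on the interelectronic distance r12, it is unchanged when the molecule is translated or rotated and does not depend on how the nuclei are indexed.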

5.2.3 Dataset

A dataset of 5660 molecular systems has been taken from the QM7 subset that belongs to GDB-13[207, 208]. QM7 consists of 7165 organic molecules that are composed of up to 23 atoms with different combinations of C, N, O and S, including H atoms for saturation. The dataset contains a range of molecular structures involving single, double and triple bonds, cyclic systems, alcohols, cyanides, epoxies, and amides. The input structures of the dataset molecules were taken in Cartesian coordinates and their intracules were calculated with the QChem software package at the HF/6-31G* level[131]. Thus, intracules are the features and electron correlation energies are their corresponding labels. Correlation energies are the target values calculated by a high-accuracy quantum chemistry method, G4[58, 59]. G4 is the Gaussian-4 composite method, which predicts thermodynamic properties within sub-kcal/mol accuracy.

The performance of the machine learning model is measured by the mean error, ME, mean absolute error, MAE, maximum error, Emax, minimum error, Emin, standard deviation, SD, and mean absolute percent error, MAE%. ME, MAE, MAE% and SD are defined as:

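\[ \mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} E_i \tag{5.7} \]

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |E_i| \tag{5.8} \]

\[ \mathrm{MAE\%} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{E_i}{y} \right) \times 100 \tag{5.9} \]

\[ S_D = \sqrt{ \frac{ \sum_{i=1}^{n} \left( E_i - \mathrm{ME} \right)^2 }{ n - 1 } } \tag{5.10} \]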

where Emax and Emin are respectively the biggest and the smallest error values among all error values.
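The statistical measures of Eqs. 5.7 to 5.10 can be sketched as follows. The function name is hypothetical, and three points are assumptions rather than statements from the text: Ei is taken as predicted minus reference, y in Eq. 5.9 is taken as the reference value of the same molecule, and the absolute value of each ratio is used in MAE%, as the name of the measure suggests.

```python
import math


def error_summary(predicted, reference):
    """Return ME, MAE, MAE%, SD, Emax and Emin for paired predictions and references."""
    errors = [p - r for p, r in zip(predicted, reference)]  # assumed sign convention
    n = len(errors)
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Assumption: y is the reference value and the ratio is taken in absolute value.
    mae_pct = 100.0 * sum(abs(e / r) for e, r in zip(errors, reference)) / n
    sd = math.sqrt(sum((e - me) ** 2 for e in errors) / (n - 1))
    return {"ME": me, "MAE": mae, "MAE%": mae_pct, "SD": sd,
            "Emax": max(errors), "Emin": min(errors)}


# Hypothetical numbers, for illustration only.
summary = error_summary([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
```

Reporting ME alongside MAE helps to distinguish a systematic offset (ME far from zero) from scatter around the reference (small ME but larger MAE).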

5.2.4 Preparation

The intracule data are organized on a new uniform grid of points with a grid spacing equal to 0.05 a.u. The grid vector ranges from 0 to uMax, where uMax is the largest u-point value among all the intracule data that has a P(u) value larger than the threshold (= 0.0001). This range covers all other P(u) values and makes it possible to compare and measure the similarity of all the P(u) values on the same grid, as shown in Figure 5.1. The similarity is measured by the Gaussian kernel function, which forms a correlation matrix. Four molecular systems (QM7-0234, QM7-0432, QM7-1230 and QM7-4500) were selected randomly from the database used and are represented with their respective structures and intracule plots in Figure 5.1, which depicts how the interpolated intracules are arranged on a uniform grid. QM7-1230, with the largest u-point value, defines uMax, and the rest of the intracules are covered within 0 to uMax.

Sigma, σ, is the only hyperparameter employed and is crucial to the performance of the QM/ML model. In order to find a σ value that predicts the correlation energies with the minimum error, the model has been tested for a range of σ values, which are plotted against the corresponding MAE values in Figure 5.2. The MAE values are plotted on a logarithmic scale because the values were changing exponentially. The model was also tested on much larger values (1.0, 1.8, 2.8, 3) that are not displayed in the figure. However, the MAE seemed to be constant after any increment up to 0.0012. The value σ = 0.00004 showed the minimum MAE among all tested values; therefore σ was set to 0.00004 for the rest of the ML calculations.

After obtaining the intracules and single point energies of the molecules with the QChem and Gaussian software, all other calculations were performed by means of a program written in the Python language using Python modules including numpy, scipy.interpolate and tabulate. The algorithm is illustrated as a flowchart in Figure 5.3.
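The construction of the unified grid can be sketched as below. This is an illustrative sketch under stated assumptions: the function name and toy data are hypothetical, linear interpolation with `np.interp` stands in for whichever scipy.interpolate routine the original program used, and grid points beyond a molecule's own data are assumed to be zero.

```python
import numpy as np


def unified_grid(intracules, spacing=0.05, threshold=1e-4):
    """Interpolate each (u, P) pair onto one uniform grid from 0 to uMax."""
    # uMax: largest u whose P(u) exceeds the threshold, over all molecules.
    u_max = max(np.max(u[P > threshold]) for u, P in intracules)
    n_steps = int(np.ceil(u_max / spacing - 1e-9))  # small guard against round-off
    grid = spacing * np.arange(n_steps + 1)
    # Assumption: linear interpolation, with zero beyond the last raw data point.
    features = np.array([np.interp(grid, u, P, right=0.0) for u, P in intracules])
    return grid, features


# Hypothetical raw intracules (u in a.u., P(u)), for illustration only.
raw = [
    (np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.0, 0.5, 0.2, 0.00005])),
    (np.array([0.0, 1.0, 2.0, 3.0, 4.0]), np.array([0.0, 0.4, 0.3, 0.01, 0.0])),
]
grid, features = unified_grid(raw)
```

Each row of `features` then has the same length, so any two molecules can be compared point by point with the Gaussian kernel.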
Fig. 5.1 Intracule P(u) plots (P(u) versus u in a.u.) of four molecules randomly selected from the dataset used. QM7-1230 has the largest u value with a P(u) larger than 0.0001 a.u. All four molecules are covered within this range, and the four intracules all differ from each other, which shows the uniqueness of the descriptor.

Fig. 5.2 Chart of σ values against the log of the mean absolute error (MAE) for the KRR model. Sigma values ranging from 3 × 10⁻⁹ to 3 × 10⁻¹ are plotted on the horizontal axis. The selected region has been zoomed in so that the minimum of the curve can be seen clearly. The sigma value that gave the smallest MAE was found to be 4 × 10⁻⁵, i.e. σ = 0.00004.

[Figure 5.3: flowchart. Box labels recovered from the figure: Program Starts; Raw Data (QChem Intracule, G4 Reference); Normalization (Int_JSON, Ref_JSON); Unified Int Grid; Full KMat; Distribution into Test and Training Fractions (Similarity Based or Random); Training Fraction; Test Fraction; KMat for Training; Target List; Solving for Alphas; Applying KRR; Summary of Errors; Program Ends.]

Fig. 5.3 The algorithm of the Python program employed, depicted as a flowchart. Int represents intracules, Ref. stands for reference, Diff. is the difference and KMat is the abbreviation for kernel matrix.

The very first step was to normalize the raw data, consisting of intracules and reference energies, into one format. Therefore, the raw intracule and reference energy directories were normalized into respective JSON files (JavaScript Object Notation is a data-interchange format; it is useful because it is lightweight, easy for humans to read and write, and easy for machines to parse and generate). The Int_JSON file consisted of information such as molecular IDs (i.e., QM7-XXXX), molecular formulae, HF energies, intracule weights, etc., whereas the Ref_JSON file consisted of molecular IDs, molecular formulae and G4 single point reference energies. This information makes it convenient to match corresponding data in later steps of the program. The Int_JSON file is then used to form a unified grid of intracules so that the intracules of all the molecules are placed on one regular grid for the sake of comparison. In the next step, the intracules are transformed into a kernel matrix, namely Full KMat, that contains all new features, similar or otherwise. These features are then distributed into training and test sets based on their similarity scores (and on the test and training fraction options given when the program is run). A similarity-based kernel matrix (training set KMat) is used to represent the training fraction, whereas the rest are taken as the test fraction. The alphas, or coefficients, are calculated using the training KMat and the target list. The target list is obtained by matching IDs from the Ref_JSON file and the test fraction. After the alphas have been obtained, the KRR model is applied and a list of predicted correlation energies is obtained as the result. The difference between the predicted and reference correlation energies gives the errors in the predictions of the machine learning model applied. These error values are then employed to calculate the different statistical measures of performance defined above.
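The sequence of steps above (split, training kernel matrix, solving for the alphas, applying KRR, collecting errors) can be illustrated with a short sketch. It is not the program used in this work. The function names, the random split (one of the two options named in Figure 5.3), the Gaussian kernel form and the small ridge term `lam` added for numerical stability are assumptions of the example; the text itself names σ as the only hyperparameter.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); this form is an assumption."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def split_indices(n, test_fraction, seed=0):
    """Random split into (training, test) index arrays."""
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(round(test_fraction * n))
    return order[n_test:], order[:n_test]


def krr_fit(x_train, y_train, sigma, lam=1e-8):
    """Solve (K + lam * I) alphas = y for the regression coefficients."""
    k = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(k)), y_train)


def krr_predict(x_new, x_train, alphas, sigma):
    """Prediction is the kernel between new and training points times alphas."""
    return gaussian_kernel(x_new, x_train, sigma) @ alphas


# Toy run on made-up features and targets (not intracule data).
rng = np.random.default_rng(1)
x = rng.random((40, 5))
y = np.sin(x.sum(axis=1))

train, test = split_indices(len(x), test_fraction=0.25)
alphas = krr_fit(x[train], y[train], sigma=1.0)
errors = krr_predict(x[test], x[train], alphas, sigma=1.0) - y[test]
mae = np.abs(errors).mean()
```

Repeating the last three lines for several values of `sigma` and recording `mae` gives the kind of hyperparameter scan described for Figure 5.2.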

5.3 Results and Discussion

QM/ML results are presented in terms of the differences between the predicted correlation energies and the correlation energies calculated using the G4 reference. All predictions were performed at σ equal to 0.00004. Table 5.1 summarizes the results for test fractions of up to 50% in terms of different statistical errors in units of kcal/mol. The ME values are small, ranging from -2.70 to 3.83 kcal/mol, and for the 20% and 40% test fractions (0.56 and 0.44 kcal/mol) they fall within the so-called chemical accuracy of 1 kcal/mol. In general, a decline in the errors with an increasing percentage of training fraction has also been observed, indicating that the ML model has been learning successfully. MAEs were calculated for test fractions from 1 to 90% and are shown graphically in Figure 5.4. For a 10% test fraction out of the dataset of 5660 molecules the MAE is 9.54 kcal/mol, and for a 1% test fraction the MAE is 5.4 kcal/mol. These results are fairly close to chemical accuracy; moreover, the predictions made by QM/ML in this chapter are compared against the G4 reference, which is a very high level method. These predictions might therefore be closer to experimental values than the other QM/ML predictions reported in the literature, because the current literature reports DFT calculations as the reference for QM/ML models [67, 191–194, 209]. According to those standards, chemical accuracy has been achieved with the use of different descriptors, different references, different sizes and aspects of datasets, and sometimes for different properties of interest [67, 191–194, 209]. For example, Faber et al. [209] reported that chemical accuracy has been achieved for the energy calculations of a few thousand small organic molecules. Another work reported the prediction of atomization energies with an error of 7.2 kcal/mol compared to B3LYP. For "bag of bonds" training samples (used to model atomization energies), the accuracy reached 2.0 kcal/mol on 30% of the molecules of the 134K dataset [61]. For the atomization energies with a training set of 10K molecules, the MAE reached 8.0 kcal/mol, whereas for the Coulomb Matrix representation the error was found to be 6.2 kcal/mol [193].

Table 5.1 The QM/ML results predicted at σ equal to 0.00004, tabulated in terms of maximum error (Max E), minimum error (Min E), mean error (ME), standard deviation (SD), mean absolute error (MAE) and mean absolute percent error (MAE %). All errors are given in kcal/mol units.

Test Fraction (%)    10.00    20.00    30.00    40.00    50.00
Max E                  102       94       98      111      178
Min E                    0        0        0        0        0
ME                   -2.70     0.56     2.95     0.44     3.83
MAE                   9.54    12.61    14.43    15.37    19.89
MAE (%)               0.19     0.25     0.31     0.31     0.44
SD                   12.74    16.88    18.89    20.39    25.85
No. of UMol (a)        565     1119     1694     2259     2824
(a) UMol = the number of molecules in test fractions

The statistical correlation between the predicted and reference correlation energies is represented as scatter charts in Figure 5.5. Test fractions of 10, 40, 70 and 90% are plotted, which shows that there are hardly any outliers, particularly for the smaller test fractions. As also shown in Table 5.1, the maximum errors in each case are mostly about 100 kcal/mol (the sign of the error does not matter here). This exhibits the strong correlation between the reference and predicted energies. The scatter shrinks with increasing size of the training fraction. Therefore, the correlation can be made stronger, and the deviations of the predicted energies can be decreased, by increasing the size of the dataset.
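As an illustration of how the quantities in Table 5.1 and the correlation shown in Figure 5.5 could be computed from lists of predicted and reference energies, a minimal sketch is given below. The function name and the toy numbers are hypothetical, and several conventions are assumptions rather than the definitions used in this thesis: the error is taken as predicted minus reference, Max E and Min E are taken over absolute errors, SD is the sample standard deviation of the signed errors, and the percent error is taken relative to the magnitude of the reference value.

```python
import numpy as np


def summarize_errors(predicted, reference):
    """Return the error statistics of Table 5.1 plus the Pearson correlation."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    err = predicted - reference          # signed errors (assumed convention)
    abs_err = np.abs(err)
    return {
        "Max E": abs_err.max(),
        "Min E": abs_err.min(),
        "ME": err.mean(),
        "MAE": abs_err.mean(),
        "MAE (%)": 100.0 * np.mean(abs_err / np.abs(reference)),
        "SD": err.std(ddof=1),           # sample standard deviation (assumed)
        "r": np.corrcoef(reference, predicted)[0, 1],
    }


# Toy numbers only, chosen so the statistics are easy to check by hand.
stats = summarize_errors(predicted=[1.0, 2.0, 4.0], reference=[1.0, 3.0, 2.0])
```

Taking Max E and Min E over absolute errors matches the remark that the sign of the maximum error does not matter.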
Fig. 5.4 MAE (kcal/mol) plotted against test fraction (%) for test fractions ranging from 1 to 90%. There is a clear trend of decreasing MAE with increasing percentage of training fraction.

Fig. 5.5 Scatter plots of the reference correlation energies versus the predicted correlation energies (both in a.u.) obtained for the 10, 40, 70 and 90% test fractions (T.F.).

Overall, a strong statistical correlation has been observed and high accuracy is obtained for most of the predicted correlation energies. Besides accuracy, the strength of the results also depends on the time taken to complete these calculations. The time taken to predict the correlation energies of 50% of the dataset using the QM/ML program was about 52 minutes. On the other hand, when the reference calculations were performed by employing G4 for 100% of the dataset, the time consumed on the same computer systems was 9166.7 hours. It can be deduced that the G4 reference would have taken about half of that, roughly 4583.3 hours, to predict the same properties for 50% of the dataset. This shows a huge difference between the computational time taken by G4 and by the ML model used. Provided one is not required to generate intracules and reference data but simply applies QM/ML to already existing features, one can predict many useful quantum chemistry properties in only a fraction of the time required otherwise. In any case, the difference in computational cost between high accuracy quantum chemical calculations and QM/ML calculations is so large that it compensates for a relatively low accuracy.
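The timing comparison can be checked with a few lines of arithmetic. The sketch below takes the 9166.7 hours reported for G4 on the full dataset and the 52 minutes reported for QM/ML on half of it, and assumes that the G4 cost scales linearly with the number of molecules, so that half the dataset costs half the time (about 4583 hours). The variable names are illustrative only, and the QM/ML time excludes the cost of generating the intracules and reference data.

```python
# Values quoted in the text.
g4_hours_full = 9166.7    # G4 on 100% of the dataset, in hours
ml_minutes_half = 52.0    # QM/ML prediction for 50% of the dataset, in minutes

# Assumption: G4 cost is proportional to the number of molecules.
g4_hours_half = 0.5 * g4_hours_full
ml_hours_half = ml_minutes_half / 60.0

# Ratio of the two wall times for the same 50% of the dataset.
speedup = g4_hours_half / ml_hours_half

print(f"G4, 50% of dataset:    {g4_hours_half:.2f} h")
print(f"QM/ML, 50% of dataset: {ml_hours_half:.2f} h")
print(f"Approximate speed-up:  {speedup:.0f}x")
```

Under these assumptions the ratio comes out at a few thousand, which is consistent with describing the saving as roughly thousandfold.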

5.4 Conclusion

The correlation energies of 5660 small organic molecules have been predicted by employing QM/ML. The introduction of intracules as a new descriptor to represent molecules has been found to be efficient because of their inherent properties, which justify them as a good descriptor for the QM/ML model. The MAEs of the results reached 5.4, 9.54 and 12.61 kcal/mol for the 1, 10 and 20% test fractions, respectively, for a dataset that is relatively small in this context. For the model employed, the size of the dataset is crucial to the accuracy of the results, as has also been seen in the results above; therefore, higher accuracies can be reached by applying the same model to larger datasets. The calculations exhibit a dramatic difference in computational cost between typically used quantum chemistry methods and the QM/ML approach. It has also been observed that there is a strong statistical correlation between features and labels, which validates the QM/ML model used. On the whole, the QM/ML approach with intracules as a molecular descriptor has been found to have a thousandfold lower computational cost for the prediction of correlation energies and has high potential to obtain results with higher accuracy by increasing the size of the dataset. All the above-mentioned properties of intracules, and the results produced, make them a promising molecular descriptor for further employment in the field of QM/ML. The success of the results obtained by the QM/ML model employed encourages the application of QM/ML to more complex molecules, such as Lewis acid-base complexes, and the prediction of electronic properties at a cost even lower than that of DFT methods.

Chapter 6

Conclusion and Future Perspective

Every computational study of novel molecular systems requires the assessment of computational chemistry models against experimental results or results obtained by high level theoretical methods. This approach consumes huge amounts of computational resources and, besides the properties of interest, many other quantities are calculated but remain unused. In this thesis, an automated benchmarking of density functionals for unique Lewis acid-base pairs stored in a repository has been employed. It has been found that several density functional models, coupled with the 6-31G* basis, are efficient for evaluating derived properties of stretched dative bond complexes, though it is also noted that some species are not well represented by any of the employed approximate computational models. It has been concluded that automated calculation of the quantum chemical properties of molecular systems is crucial given the continuously growing body of computational chemistry data. The automated approach applied to the created repositories has reduced the computational cost severalfold. Automation applied to existing repositories can produce results at a far lower computational cost still. Therefore, automated benchmarking should be a more common practice for the determination of molecular properties by computational methods. The PESs produced revealed the unusual behaviour of a phosphine complex, and the failure of DFT methods to reproduce PESs comparable to those produced by reference methods led to the search for an alternative computational chemistry approach.

The interesting case of the novel PES behavior of F3B−PH3 has been investigated. The reproducibility of the PES with a range of computational chemistry models, together with molecular orbital analysis, revealed that the inflection in the PES depends on the pyramidalization energy cost of F3B and the electrostatic attraction between F3B and PH3 in forming a dative bond. It was also found that an increase in the size of the halogen attached to the LA resulted in an even more pronounced inflection in the PES. It can also be deduced that the electronegativity of the substituent attached to the LA plays an important role in producing inflected PESs. This unusual behavior has not been reported before. Therefore, it is anticipated that this finding is an important contribution to the Lewis acid-base chemistry literature.

The alternative computational chemistry approach to DFT methods employed for the Lewis acid-base study was QM/ML. QM/ML, another data-driven computational chemistry approach, has been found to predict molecular properties at much lower computational cost. The prediction of electron correlation energies by QM/ML has been attempted by the use of a new molecular descriptor known as the position intracule. Position intracules are among the molecular descriptors least sensitive to translational and rotational motions and are independent of atomic indices. The dataset employed for the QM/ML model consisted of a variety of small organic molecules which are synthetically accessible. It has been concluded that the accuracy of the results depends crucially on the size of the dataset and the choice of hyperparameters for a given model. The combination of ab initio calculations of position intracules and the KRR machine learning algorithm produces fairly accurate results. The predictions are compared to the G4 reference, which is comparable to experimental standards. Therefore, the predictions made by our QM/ML model are fairly accurate compared to the references generally used to assess the performance of QM/ML models in the literature.
This, however, does not make the employed QM/ML model any better, but the reference data would be a major contribution to any quantum chemistry repository and can readily be employed as a reference for other property predictions by computational methods. The benchmark study of the computational chemistry models for the PES calculations has revealed that the DFT methods are not useful for predicting the properties of Lewis acid-base pairs within the desired accuracy range. Despite this failure, the automated analytic approach employed in Chapter 3 has shown a new way of predicting quantum chemical properties in a cheaper and faster manner. The data available in the repository can be used as a reference for further studies of such systems. The inclusion of the automated benchmark is crucial for making faster predictions from large datasets without investing too much computational resource. It is anticipated that in future the repository will be enriched with any relevant computational chemistry data and that each new entry could help to obtain up-to-date results. This will certainly diminish the ‘unnecessary’ production of data as a byproduct of the desired quantum chemical predictions.

The discovery of the anomaly in the PES of F3B−PH3 is certainly an interesting case which encourages the discovery and investigation of other related cases. It is important to remember that only 8 stretched dative bond complexes were studied in Chapter 3, and there could have been more variety in the ligands attached to the Lewis acid-base atoms. This might have revealed more interesting cases of PESs. The thorough investigation of the anomalous behavior of the PES of F3B−PH3 by computational chemistry tools should also entice experimental chemists to verify the case by experimental means. Chapter 5 is probably the most impactful for future quantum chemistry discoveries. Several other molecular descriptors have been used in the QM/ML approach, and the search for newer descriptors that could represent molecular systems better continues. There is potential for intracules to be used in QM/ML models for the prediction of other molecular properties. It is important to note that the intracules used were only position intracules; future work may include momentum intracules. The most anticipated future work related to Chapter 5 is increasing the size of the dataset, since the larger the dataset used, the better the ML prediction. In the present thesis, only a single ML algorithm has been employed; artificial neural networks are expected to be employed in future for the prediction of electron correlation energies. The motivation for employing the QM/ML model developed after the failure of the DFT methods to perform within an acceptable error limit (1 kcal/mol) for predicting the PESs of Lewis acid-base pairs. Another important point to note is that QM/ML has only been employed on simple organic molecules, but the results from Chapter 5 encourage moving a step further and studying more complex systems with the QM/ML approach.
This will help to finally attain the unachieved goal of Chapter 3, that is, the prediction of PESs of Lewis acid-base pairs using cheap computational resources. Furthermore, the results produced can be compared to the reference used in the automated benchmark. It is also possible that the QM/ML model would be able to predict electronic properties of systems as complex as FLPs.

References

[1] Iogann Tolbatov and Daniel M Chipman. Benchmarking density functionals and Gaussian basis sets for calculation of core-electron binding energies in amino acids. Theoretical Chemistry Accounts, 136(7):1–16, June 2017.
[2] Gabriel Kocher and Nikolas Provatas. New Density Functional Approach for Solid-Liquid-Vapor Transitions in Pure Materials. Physical Review Letters, 114:1–6, April 2015.
[3] Larry A Curtiss, Paul C Redfern, and Krishnan Raghavachari. Gaussian-4 theory. The Journal of Chemical Physics, 126:1–13, November 2007.
[4] Brina Brauer, Manoj K Kesharwani, Sebastian Kozuch, and Jan M L Martin. The S66x8 benchmark for noncovalent interactions revisited: explicitly correlated ab initio methods and density functional theory. pages 1–22, July 2016.
[5] Rebecca L M Gieseking, Mark A Ratner, and George C Schatz. Benchmarking Semiempirical Methods To Compute Electrochemical Formal Potentials. The Journal of Physical Chemistry A, 122(33):6809–6818, August 2018.
[6] P Ganesh, Jeongnim Kim, Changwon Park, Mina Yoon, Fernando A Reboredo, and P R C Kent. Binding and Diffusion of Lithium in Graphite: Quantum Monte-Carlo benchmarks and validation of van der Waals density functional methods. Journal of Chemical Theory and Computation, pages 1–20, July 2016.
[7] Annia Galano and Juan Raul Alvarez-Idaboy. Kinetics of Radical-Molecule Reactions in Aqueous Solution: A Benchmark Study of the Performance of Density Functional Methods. Journal of Computational Chemistry, 35:2019–2026, September 2014.
[8] S Adams and P Murray-Rust. Chempound-a Web 2.0-inspired repository for physical science data. Journal of Digital Information, 13(1):D32–D37, 2012.
[9] Peter Murray-Rust, Joe A Townsend, Sam E Adams, Weerapong Phadungsukanan, and Jens Thomas. The semantics of Chemical Markup Language (CML): dictionaries and conventions. Journal of Cheminformatics, 3(1):43, 2011.
[10] Weerapong Phadungsukanan, Markus Kraft, Joe A Townsend, and Peter Murray-Rust. The semantics of Chemical Markup Language (CML) for computational chemistry: CompChem. Journal of Cheminformatics, 4(1):15, 2012.

[11] Sam Adams, Pablo de Castro, Pablo Echenique, Jorge Estrada, Marcus D Hanwell, Peter Murray-Rust, Paul Sherwood, Jens Thomas, and Joe Townsend. The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age. Journal of Cheminformatics, 3(1):38, 2011.
[12] Peter Murray-Rust. Chemistry for everyone. Nature, 451(7179):648–651, February 2008.
[13] Peter J Linstrom and William G Mallard. The NIST Chemistry WebBook: A Chemical Data Resource on the Internet. Journal of Chemical and Engineering Data, 46(5):1059–1063, March 2001.
[14] G. N. Lewis. Valence and the Structure of Atoms and Molecules. Chemical Catalogue Company, New York, 1923.
[15] H C Brown and H I Schlesinger. Studies in Stereochemistry. I. Steric Strains as a Factor in the Relative Stability of Some Coordination Compounds of Boron. Journal of American Chemical Society, 64(2):325–329, 1942.
[16] H C Brown and J Kannerp. . Journal of American Chemical Society, 88:986–992, 1966.
[17] G Wittig and E Benz. . Chemische Berichte, 92:1999–2013, 1959.
[18] Tochtermann W. . Angewandte Chemie, 78:355–375, 1966.
[19] Gregory C Welch and Douglas W Stephan. Facile heterolytic cleavage of dihydrogen by phosphines and boranes. Journal of American Chemical Society, 129(7):1880–1881, February 2007.
[20] Gregory C Welch, Ronan R San Juan, Jason D Masuda, and Douglas W Stephan. Reversible, Metal-Free Hydrogen Activation. Science, 314(5802):1124–1126, November 2006.
[21] Cornelia M Mömming, Edwin Otten, Gerald Kehr, Roland Fröhlich, Stefan Grimme, Douglas W Stephan, and Gerhard Erker. Reversible metal-free carbon dioxide binding by frustrated Lewis pairs. Angewandte Chemie International Edition, 48(36):6643–6646, 2009.
[22] Jenny S J McCahill, Gregory C Welch, and Douglas W Stephan. Reactivity of "frustrated Lewis pairs": three-component reactions of phosphines, a borane, and olefins. Angewandte Chemie International Edition, 46(26):4968–4971, 2007.
[23] Preston A Chase, Titel Jurca, and Douglas W Stephan. Lewis acid-catalyzed hydrogenation: B(C6F5)3-mediated reduction of imines and nitriles with H2. Chemical Communications, (14):1701–1703, April 2008.
[24] Preston A Chase, Gregory C Welch, Titel Jurca, and Douglas W Stephan. Metal-free catalytic hydrogenation. Angewandte Chemie International Edition, 46(42):8050–8053, 2007.

[25] Douglas W Stephan and Gerhard Erker. Frustrated Lewis Pair Chemistry: Development and Perspectives. Angewandte Chemie International Edition, 54(22):6400–6441, May 2015.
[26] M. Sajid, Kehr G., C. G. Daniliuc, and G Erker. . Angewandte Chemie International Edition, 53:1118–1121, 2014.
[27] Andrew E Ashley, Amber L Thompson, and Dermot O’Hare. Non-metal-mediated homogeneous hydrogenation of CO2 to CH3OH. Angewandte Chemie International Edition, 48(52):9839–9843, 2009.
[28] Albert Einstein and Leopold Infeld. Evolution of Physics. Cambridge University Press, UK, 1938.
[29] Louis De Broglie. Recherches sur la théorie des Quanta. Annalen Der Physik., 10(3):22–128, 1925.
[30] E Schrödinger. Quantization as an eigenvalue problem. Ann. Phys., 44:455, 1926.
[31] P. A. M. Dirac. Quantum mechanics of many-electron systems. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 123:714–733, 1929.
[32] von Max Born and J Robert Oppenheimer. Zur Quantentheorie der Molekeln. Annalen der Physik, 389(20):1–28, October 2005.
[33] J C Slater. . Physical Review, 36, 1930.
[34] S F Boys. . Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, A200:542, 1950.
[35] Hehre W J, Stewart R. F., and Pople J. A. . Journal of Chemical Physics, 51, 1969.
[36] H J Monkhorst and F E Harris. . Chemical Physics Letters, 3:537, 1969.
[37] Michelle M. Francl, William J. Pietro, Warren J. Hehre, J. Stephen Binkley, Mark S. Gordon, Douglas J. DeFrees, and John A. Pople. Self-consistent molecular orbital methods. xxiii. a polarization-type basis set for second-row elements. The Journal of Chemical Physics, 77(7):3654–3665, 1982.
[38] Michael J. Frisch, John A. Pople, and J. Stephen Binkley. Self-consistent molecular orbital methods 25. supplementary functions for gaussian basis sets. The Journal of Chemical Physics, 80(7):3265–3269, 1984.
[39] D. E. Woon and Thom H. Dunning Jr. Gaussian basis sets for use in correlated molecular calculations. 3. the atoms aluminum through argon. Journal of Chemical Physics, 98:1358–1371, 1993.

[40] Thom H. Dunning Jr. Gaussian basis sets for use in correlated molecular calculations. i. the atoms boron through neon and hydrogen. Journal of Chemical Physics, 90:1007–1023, 1989.
[41] Thom H. Dunning Jr. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen. Journal of Chemical Physics, 90:1007–1023, 1989.
[42] D E Woon and Thom H. Dunning Jr. Gaussian basis sets for use in correlated molecular calculations. 3. The atoms aluminum through argon. Journal of Chemical Physics, 98:1358–1371, 1993.
[43] Christopher J Cramer. Essentials of Computational Chemistry. Theories and Models. John Wiley & Sons Ltd, 2005.
[44] Michael J. Frisch, John A. Pople, and J. Stephen Binkley. Lorentz trial function for the hydrogen atom: A simple, elegant exercise. Journal of Chemical Education, 88(11):1521–1524, 2011.
[45] C C J Roothaan. Reviews Modern Physics, 23(69), 1951.
[46] Martin Head-Gordon, J A Pople, and M J Frisch. MP2 energy evaluation by direct methods. Chemical Physics Letters, 153(6):503–506, 1988.
[47] J A Pople. Quadratic configuration interaction - a general technique for determining electron correlation energies. Journal of Chemical Physics, 87:5968–5975, 1987.
[48] G E et al. Scuseria and Henry F. Schaefer III. Is coupled cluster singles and doubles (CCSD) more computationally intensive than quadratic configuration-interaction (QCISD)? Journal of Chemical Physics, 90:3700–3703, 1989.
[49] P Hohenberg and W Kohn. Inhomogeneous electron gas. Physical Review, 136:864–871, November 1964.
[50] L J Sham and W Kohn. Self-consistent equations including exchange and correlation effects. Physical Review, 140, 1965.
[51] John P. Perdew and Y. Wang. Accurate and simple analytic representation of the electron-gas correlation energy. Physical Review, B Condens. Matter, 45(23):13244–13249, June 1992.
[52] Axel D. Becke. Density-functional exchange-energy approximation with correct asymptotic behavior. Physical Review A, 38(6):3098–3100, September 1988.
[53] John P. Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized Gradient Approximation Made Simple. Physical Review Letters, 77(18):3865–3868, October 1996.
[54] Yan Zhao and Donald G. Truhlar. A new local density functional for main-group thermochemistry, transition metal bonding, thermochemical kinetics, and noncovalent interactions. Journal of Chemical Physics, 125(19):194101, November 2006.

[55] Carlo Adamo and Vincenzo Barone. Toward reliable density functional methods without adjustable parameters: The PBE0 model. Journal of Chemical Physics, 110(13):6158–6170, April 1999.
[56] A Daniel Boese and Jan M L Martin. Development of density functionals for thermochemical kinetics. Journal of Chemical Physics, 121(8):3405–3416, August 2004.
[57] Axel D Becke. A new mixing of Hartree–Fock and local density-functional theories. Journal of Chemical Physics, 98(2):1372–1377, January 1993.
[58] L. A. Curtiss, P. C. Redfern, and K. Raghavachari. Gaussian-4 Theory. Journal of Chemical Physics, 126:084108, 2007.
[59] L. A. Curtiss, P. C. Redfern, and K. Raghavachari. Assessment of Gaussian-3 and Density Functional Theories on the G3/05 test set of experimental energies. Journal of Chemical Physics, 123(12):124107, 2005.
[60] Ohlinger W S, Klunzinger P E, Deppmeler B J, and Hehre W J. Efficient Calculation of Heats of Formation. Journal of Physical Chemistry A, 113(10):2165–2175, 2009.
[61] Katja Hansen, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O Anatole von Lilienfeld, Klaus-Robert Müller, and Alexandre Tkatchenko. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space. The Journal of Physical Chemistry Letters, 6:2326–2331, June 2015.
[62] O Anatole von Lilienfeld. First principles view on space: Gaining rigorous atomistic control of molecular properties. International Journal of Quantum Chemistry, 113(12):1676–1689, June 2013.
[63] Grégoire Montavon, Matthias Rupp, Vivekanand Gobre, Alvaro Vazquez-Mayagoitia, Katja Hansen, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilienfeld. Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics, 15(9):095003, September 2013.
[64] K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross. How to represent crystal structures for machine learning: Towards fast prediction of electronic properties. Physical Review B, 89(5):205118, May 2014.
[65] I. H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, volume 4. Morgan Kaufmann Publishers, 2 edition, 2005.
[66] Matthias Rupp. Machine learning for quantum mechanics in a nutshell. International Journal of Quantum Chemistry, 115(16):1058–1073, July 2015.
[67] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert Müller. Learning Invariant Representations of Molecules for Atomization Energy Prediction. NIPS’12. Curran Associates Inc., USA, 2012.

[68] O Anatole von Lilienfeld. First principles view on chemical compound space: Gaining rigorous atomistic control of molecular properties. International Journal of Quantum Chemistry, 113(12):1676–1689, June 2013.
[69] Qammar L Almas, Benjamin L Keefe, Trevor Profitt, and Jason K Pearson. Choosing an appropriate model chemistry in a big data context: Application to dative bonding. Computational and Theoretical Chemistry, 1085(C):46–55, June 2016.
[70] Alex M Clark, Antony J Williams, and Sean Ekins. Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data. Journal of Cheminformatics, 7(1):1–20, March 2015.
[71] Katja Hansen, Grégoire Montavon, Franziska Biegler, Siamac Fazli, Matthias Rupp, Matthias Scheffler, O Anatole von Lilienfeld, Alexandre Tkatchenko, and Klaus-Robert Müller. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. Journal of Chemical Theory and Computation, 9(8):3404–3419, July 2013.
[72] Roman M Balabin and Ekaterina I Lomakina. Support vector machine regression (LS-SVM)–an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data? Physical Chemistry Chemical Physics, 13(24):11710–11718, June 2011.
[73] Roman M Balabin and Ekaterina I Lomakina. Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies. Journal of Chemical Physics, 131(7):074104, August 2009.
[74] Jörg Behler. Neural network potential-energy surfaces in chemistry: a tool for large-scale simulations. Physical Chemistry Chemical Physics, 13(40):17930–17955, October 2011.
[75] Geoffroy Hautier, Christopher C Fischer, Anubhav Jain, Tim Mueller, and Gerbrand Ceder. Finding Nature’s Missing Ternary Oxide Compounds Using Machine Learning and Density Functional Theory. Chem. Mater., 22(12):3762–3767, May 2010.
[76] K V Jovan Jose, Nongnuch Artrith, and Jörg Behler. Construction of high-dimensional neural network potentials using environment-dependent atom pairs. Journal of Chemical Physics, 136(19):194111, May 2012.
[77] Eric Martin and Eddie Cao. Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens. Journal of Computer-Aided Molecular Design, pages 1–9, December 2014.
[78] Tobias Morawietz, Vikas Sharma, and Jörg Behler. A neural network potential-energy surface for the water dimer based on environment-dependent atomic energies and charges. Journal of Chemical Physics, 136(6):064103, February 2012.
[79] Xiaohui Qu, Diogo Ars Latino, and Joao Aires-de Sousa. A big data approach to the ultra-fast prediction of DFT-calculated bond energies. Journal of Cheminformatics, 5(1):34, 2013.

[80] Brajesh K Rai and Gregory A Bakken. Fast and accurate generation of ab initio quality atomic charges using nonparametric statistical regression. Journal of Computational Chemistry, 34(19):1661–1671, July 2013. [81] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilien- feld. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Physics Review Letters, 108(5):058301, January 2012. [82] John C Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding density functionals with machine learning. Physics Review Letters, 108(25):253002, June 2012. [83] Kevin Vu, John C Snyder, Li Li, Matthias Rupp, Brandon F Chen, Tarek Khelif, Klaus- Robert Müller, and Kieron Burke. Understanding kernel ridge regression: Common behaviors from simple functions to density functionals. International Journal of Quantum Chemistry, May 2015. [84] J Hachmann and R Olivares-Amaya. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. Journal of Physical Chemistry Lett., 2:2241–2251, 2011. [85] Robert G Parr and Yang Weitao. Density-Functional Theory of Atoms and Molecules. Oxford University Press, April 1989. [86] C. David Sherrill. Frontiers in electronic structure theory. Journal of Chemical Physics, 132(11):110902–7, March 2010. [87] Christopher J. Cramer and Donald G. Truhlar. Density functional theory for transi- tion metals and transition metal chemistry. Physical Chemistry Chemical Physics, 11(46):10757–10816, 2009. [88] David Feller. The role of databases in support of computational chemistry calculations. Journal of Computational Chemistry, 17:1571–1586, 1996. [89] Yan Zhao and Donald G. Truhlar. Benchmark Databases for Nonbonded Interactions and Their Use To Test Density Functional Theory. Journal of Chemical Theory and Computation, 1(3):415–432, April 2005. [90] Z Yuan, N J Taylor, T B Marder, and I D Williams. 
Three coordinate phosphorus and boron as π-donor and π-acceptor moieties respectively, in conjugated organic molecules for nonlinear optics: crystal and molecular structures of E-Ph-CH=CH-B(mes)2, E-4- MeO-C6H4-CH=CH-B(mes)2, and E-Ph2P-CH=CH-B(mes)2 [mes = 2,4,6-Me3C6H2]. Journal of the Chemical Society, Chemical Communications, pages 1489–1492, 1990. [91] Charles P Casey, Galina A Bikzhanova, and Ilia A Guzei. Stereochemistry of imine re- duction by a hydroxycyclopentadienyl ruthenium . Journal of American Chemical Society, 128(7):2286–2293, February 2006. References 95

[92] K V Axenov, G Kehr, R Frohlich, and G Erker. Functional group chemistry at the group 4 bent metallocene frameworks: formation and “metal-free” catalytic hydrogenation of bis (imino-Cp) zirconium complexes. Organometallics, 28:5148–5155, 2009. [93] Dianjun Chen and Jürgen Klankermayer. Metal-free catalytic hydrogenation of imines with tris(perfluorophenyl)borane. Chemical Communications, (18):2130–2131, May 2008. [94] Charles W Hamilton, R Tom Baker, Anne Staubitz, and Ian Manners. B-N compounds for chemical hydrogen storage. Chemical Society Reviews, 38(1):279–293, January 2009. [95] Bernd Wrackmeyer, Bernd Schwarze, D Michael Durran, and Graham A Webb. Mult- inuclear magnetic resonance study of N,N’,N”-tris(trimethylsilyl). Magnetic Resonance in Chemistry, 33(7):557–560, July 1995. [96] Takatomo Kimura, Tohru Takahashi, Mitsuko Nishiura, and Kimiaki Yamamura. Novel metal-free hydrogenation of the carbon-carbon double bond in azulenoid enones by use of cycloheptatriene and protic acid. Organic Letters, 8(14):3137–3139, July 2006. [97] P Liptau, L Tebben, G Kehr, Roland Fröhlich, Gerhard Erker, Frank Hollmann, and Bernhard Rieger. Preparation of Enantiomerically Pure [3] Ferrocenophane-Based Chelate Bis-Phosphane Ligands and Their Use in Asymmetric Alternating Carbon Monoxide/Propene Copolymerization. Europeon Journal of Organic Chemistry, pages 1909–1918, 2005. [98] Patrick Spies, Sina Schwendemann, Stefanie Lange, Gerald Kehr, Roland Fröhlich, and Gerhard Erker. Metal-free catalytic hydrogenation of enamines, imines, and conjugated phosphinoalkenylboranes. Angewandte Chemie International Edition, 47(39):7543–7546, 2008. [99] Kirill V Axenov, Gerald Kehr, Roland Fröhlich, and Gerhard Erker. Catalytic hydro- genation of sensitive organometallic compounds by antagonistic N/B Lewis pair catalyst systems. Journal of American Chemical Society, 131(10):3454–3455, March 2009. [100] Edwin Otten, Rebecca C Neu, and Douglas W Stephan. 
Complexation of nitrous oxide by frustrated Lewis pairs. Journal of American Chemical Society, 131(29):9918–9919, July 2009.

[101] Meghan A Dureen and Douglas W Stephan. Reactions of boron amidinates with CO2 and CO and other small molecules. Journal of American Chemical Society, 132(38):13559– 13568, September 2010. [102] Tibor András Rokob, Andrea Hamza, and Imre Pápai. Rationalizing the reactivity of frustrated Lewis pairs: thermodynamics of H2 activation and the role of acid-base properties. Journal of American Chemical Society, 131(30):10701–10710, August 2009. [103] F Huang, J Jiang, M Wen, and Z X Wang. Assessing the performance of commonly used DFT functionals in studying the chemistry of frustrated Lewis pairs. . . . and Computational Chemistry, 13(01):1350074, 2014. 96 References

[104] T M Gilbert. Tests of the MP2 Model and Various DFT Models in Predicting the Structures and B-N Bond Dissociation Energies of -Boranes (X3C)mH3-mB- N(CH3)nH3-n (X = H, F; m=0-3; n=0-3): Poor Performance of the B3LYP Approach for Datvie B-N Bonds. Journal of Physical Chemistry A, 108:2550–2554, 2004. [105] http://islandora.ca/about. [106] https://www.drupal.org/. [107] http://www.fedora-commons.org/. [108] http://lucene.apache.org/solr/. [109] Peter Murray-Rust and Henry S Rzepa. CML: Evolution and design. Journal of Cheminformatics, 3(1):44, 2011. [110] Peter Murray-Rust. JUMBO: an object-based XML browser. World Wide Web Journal, 2(4):197–206, November 1997. [111] Yu-Ran Luo. Handbook of Bond Dissociation Energies in Organic Compounds. CRC Press, New York, December 2002. [112] Axel D. Becke. Density functional calculations of molecular bond energies. Journal of Chemical Physics, 84(8):4524–4529, April 1986. [113] Axel D. Becke. Density-functional thermochemistry. III. The role of exact exchange. Journal of Chemical Physics, 98(7):5648, 1993. [114] Fred A Hamprecht, Aron J Cohen, David J Tozer, and Nicholas C Handy. Development and assessment of new exchange-correlation functionals. Journal of Chemical Physics, 109(15):6264–6271, October 1998. [115] Jeng-Da Chai and Martin Head-Gordon. Systematic optimization of long-range corrected hybrid density functionals. Journal of Chemical Physics, 128(8):084106–16, 2008. [116] Yan Zhao, Nathan E Schultz, and Donald G. Truhlar. Exchange-correlation functional with broad accuracy for metallic and nonmetallic compounds, kinetics, and noncovalent interactions. Journal of Chemical Physics, 123(16):161103, October 2005. [117] Y Zhao, N E Schultz, and Donald G. Truhlar. Design of density functionals by com- bining the method of constraint satisfaction with parametrization for thermochemistry, thermochemical kinetics, and noncovalent interactions. Journal of Chemical Theory and Computation, 2(2):364–382, 2006. 
[118] Yan Zhao and Donald G. Truhlar. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theoretical Chemistry Accounts, 120(1-3):215–241, 2008. References 97

[119] Ross D. Adamson, Peter M. W. Gill, and John A. Pople. Empirical density functionals. Chemical Physics Letters, 284:6–11, February 1998. [120] Ching Yeh Lin, Michael W George, and Peter M W Gill. EDF2: A Density Functional for Predicting Molecular Vibrational Frequencies. Australian Journal of Chemistry, 57(4):365–366, 2004. [121] J P Perdew and K Schmidt. Jacob’s ladder of density functional approximations for the exchange-correlation energy. American Institute of Physics Conference Proceedings, 577(1):1–20, 2001. [122] Stefan Grimme. Density functional theory with London dispersion corrections. WIREs Computational Molecular Science, 1(2):211–228, March 2011. [123] Stefan Grimme. Accurate description of van der Waals complexes by density func- tional theory including empirical corrections. Journal of Computational Chemistry, 25(12):1463–1473, September 2004. [124] J Cizek. Advances in Chemical Physics. page 35. Wiley Interscience, New York, 1969. [125] G D Purvis III and R J Bartlett. A full coupled-cluster singles and doubles model - the inclusion of disconnected triples. Journal of Chemical Physics, 76:1910–1918, 1982. [126] G E et al. Scuseria, C L Janssen, and Henry F. Schaefer III. An efficient reformulation of the closed-shell coupled cluster single and double excitation (CCSD) equations. Journal of Chemical Physics, 89:7382–7387, 1988. [127] G E et al. Scuseria and Henry F. Schaefer III. Is coupled cluster singles and dou- bles (CCSD) more computationally intensive than quadratic configuration-interaction (QCISD)? Journal of Chemical Physics, 90:3700–3703, 1989. [128] J A Pople. Quadratic configuration interaction - a general technique for determining electron correlation energies. Journal of Chemical Physics, 87:5968–5975, 1987. [129] E R Davidson. Comment on ’Comment on Dunning’s correlation-consistent basis sets’. Chemical Physics Letters, 260:514–518, 1996. [130] A Halkier, T Helgaker, P Jørgensen, and W Klopper. 
Basis-set convergence in correlated calculations on Ne, N2, and H2O. Chemical Physics, 286(3-4):243–252, 1998. [131] Yihan Shao, Laszlo Fusti Molnar, Yousung Jung, Jörg Kussmann, Christian Ochsenfeld, Shawn T Brown, Andrew T B Gilbert, Lyudmila V Slipchenko, Sergey V Levchenko, Darragh P O’Neill, Robert A DiStasio, Rohini C Lochan, Tao Wang, Gregory J O Beran, Nicholas A Besley, John M Herbert, Ching Yeh Lin, Troy Van Voorhis, Siu- Hung Chien, Alex Sodt, Ryan P Steele, Vitaly A Rassolov, Paul E Maslen, Prakashan P Korambath, Ross D. Adamson, Brian Austin, Jon Baker, Edward F C Byrd, Holger Dachsel, Robert J Doerksen, Andreas Dreuw, Barry D Dunietz, Anthony D Dutoi, Thomas R Furlani, Steven R Gwaltney, Andreas Heyden, So Hirata, Chao-Ping Hsu, 98 References

Gary Kedziora, Rustam Z Khalliulin, Phil Klunzinger, Aaron M Lee, Michael S Lee, Wanzhen Liang, Itay Lotan, Nikhil Nair, Baron Peters, Emil I Proynov, Piotr A Pieniazek, Young Min Rhee, Jim Ritchie, Edina Rosta, C. David Sherrill, Andrew C Simmonett, Joseph E Subotnik, H Lee Woodcock, Weimin Zhang, Alexis T Bell, Arup K Chakraborty, Daniel M Chipman, Frerich J Keil, Arieh Warshel, Warren J Hehre, Henry F Schaefer, Jing Kong, Anna I Krylov, Peter M W Gill, and Martin Head-Gordon. Advances in methods and algorithms in a modern quantum chemistry program package. Physical Chemistry Chemical Physics, 8(27):3172–3191, July 2006. [132] M J Frisch, G W Trucks, H B Schlegel, G E et al. Scuseria, M A Robb, J R Cheeseman, G Scalmani, Vincenzo Barone, B Mennucci, G A Petersson, H Nakatsuji, M Caricato, X Li, H P Hratchian, A F Izmaylov, J Bloino, G Zheng, J L Sonnenberg, M Hada, M Ehara, K Toyota, R Fukuda, J Hasegawa, M Ishida, T Nakajima, Y Honda, O Kitao, H Nakai, T Vreven, J E Peralta J A Montgomery Jr and, F Ogliaro, M Bearpark, J J Heyd, E Brothers, K N Kudin, V N Staroverov, T Keith, R Kobayashi, J Normand, K Raghavachari, A Rendell, J C Burant, S S Iyengar, J Tomasi, M Cossi, N Rega, J M Millam, M Klene, J E Knox, J B Cross, V Bakken, Carlo Adamo, J Jaramillo, R Gomperts, R E Stratmann, O Yazyev, A J Austin, R Cammi, C Pomelli, J W Ochterski, R L Martin, K Morokuma, V G Zakrzewski, G A Voth, P Salvador, J J Dannenberg, S Dapprich, A D Daniels, O Farkas, J B Foresman, J V Ortiz, J Cioslowski, and D J Fox. Gaussian 09, revision D.01. Technical report. [133] http://fedora.fiz-karlsruhe.de/. [134] Qammar L Almas and Jason K Pearson. Novel Bonding Mode in Phosphine Haloboranes. ACS Omega, pages 608–614, January 2018. [135] Philip S Bryan and Robert L Kuczkowski. Microwave spectra, structures, and dipole mo- ments of trimethylphosphine-borane and methylphosphine-borane. Inorganic Chemistry, 11(3):553–559, March 1972. [136] J R Durig and Zhongnan Shen. 
Investigation of the structure, bonding, vibrational frequencies and stability by ab initio calculations of H3B-PH3,H3B-PHF2, and H3B-PF3. Journal of Molecular Structure: Theochem, 397(1-3):179–190, June 1997. [137] Thomas L Allen and William H Fink. Theoretical studies of boranamine and its conju- gate base. Comparison of the boron-nitrogen and boron-phosphorus .pi. bond energies. Inorganic Chemistry, 32(20):4230–4234, September 1993. [138] Hafid Anane, Abdellah Jarid, Abderrahim Boutalib, Ignacio Nebot-Gil, and Francisco Tomás. Ab initio molecular orbital study of the substituent effect on phosphine-borane complexes. Chemical Physics Letters, 296(3-4):277–282, November 1998. [139] Benjamin G Janesko. Using Nonempirical Semilocal Density Functionals and Empirical Dispersion Corrections to Model Dative Bonding in Substituted Boranes. Journal of Chemical Theory and Computation, 6(6):1825–1833, June 2010. References 99

[140] James O Jensen. Vibrational frequencies and structural determination of phosphirane. Journal of Molecular Structure: Theochem, 686(1-3):165–172, October 2004. [141] Joshua A Plumley and Jeffrey D Evanseck. Covalent and Ionic Nature of the Dative Bond and Account of Accurate Borane Binding Enthalpies. Journal of Physical Chemistry A, 111:13472–13483, December 2007. [142] Joshua A Plumley and Jeffrey D Evanseck. Periodic Trends and Index of Boron Lewis Acidity. The Journal of Physical Chemistry A, 4(8):1249–1253, April 2009. [143] Joshua A Plumley and Jeffrey D Evanseck. Hybrid Meta-Generalized Gradient Functional Modeling of Boron-Nitrogen Coordinate Covalent Bonds. Journal of Chemical Theory and Computation, 113:5985–5992, 2008. [144] Zhenguo Huang and Tom Autrey. Boron-nitrogen-hydrogen (BNH) compounds: recent developments in hydrogen storage, applications in hydrogenation and catalysis, and new syntheses. Energy & Environmental Science, 5(11):9257–9268, 2012. [145] Emily A Seidler, Christopher J Lieven, Alex F Thompson, and Leonard A Levin. Effec- tiveness of Novel Borane-Phosphine Complexes In Inhibiting Cell Death Depends on the Source of Superoxide Production Induced by Blockade of Mitochondrial Electron Transport. ACS Chemical Neuroscience, 1(2):95–103, February 2010. [146] Christopher R Schlieve, Annie Tam, Bradley L Nilsson, Christopher J Lieven, Ronald T Raines, and Leonard A Levin. Synthesis and characterization of a novel class of reducing agents that are highly neuroprotective for retinal ganglion cells. Experimental Eye Research, 83(5):1252–1259, November 2006. [147] Rongwei Shi, Jingling Shao, Cheng Wang, Xiaolei Zhu, and Xiaohua Lu. Search for structures, potential energy surfaces, and stabilities of planar BnP(n=1-7). Journal of Molecular Modeling, 17(5):1007–1016, July 2010. [148] Frances H Stephens, R Tom Baker, Myrna H Matus, Daniel J Grant, and David A Dixon. Acid initiation of ammonia-borane dehydrogenation for hydrogen storage. 
Angewandte Chemie (International ed. in English), 46(5):746–749, 2007. [149] H. Umeyama, T. Kudo, and S. Nakagawa. Molecular orbital study on the structure and barrier to internal rotation of phosphine borane. Chemical & Pharmaceutical Bulletin, 29:287–292, 1981. [150] J. D. Odom, S. Riethmiller, J. D. Witt, and J. R. Durig. Spectra and structure of phosphorus-boron compounds. ii. a vibrational and nuclear magnetic resonance spectral investigation of phosphine-trichloroborane. Inorg. Chem., 12(5):1123–1127, 1973. [151] J. R. Durig, S. Riethmiller, V. F. Kalasinsky, and J. D. Odom. Spectra and structure of phosphorus-boron compounds. vi. vibrational spectra ans structure of some phosphine- trihaloboranes. Inorganic Chemistry, 13(11):2729–2735, 1974. 100 References

[152] Paul A Sibbald. A theoretical analysis of substituent electronic effects on phosphine- borane bonds. Journal of Molecular Modeling, 22(11):261, November 2016. [153] Volker Jonas, Gernot Frenking, and Manfred T. Reetz. Comparative theoretical study of lewis acid-base complexes of bh3, bf3, bcl3, alcl3, and so2. Journal of American Chemical Society, 116:8741–8753, 1994. [154] Donald Ray Martin. Coordination compounds of boron trichloride. i. a review. Accounts of Chemical Research, 34(461-473), 1944. [155] H.S. Booth and D.R. Martin. Boron Trifluoride and its Derivatives, chapter 4. John Wiley and Sons, Inc., New York, N.Y., 1949. [156] D.R. Martin and Roy Everett Dial. Systems of boron trifluoride with phosphine, arsine, and hydrogen bromide. Journal of American Chemical Society, 72:852–856, 1950. [157] Anne Staubitz, Alasdair P. M. Robertson, Matthew E. Sloan, and Ian Manners. Amine- and phosphine-borane adducts: New interest in old molecules. Chemical Reviews, 110(7):4023–4078, 2010. [158] Hajime Hirao, Kiyoyuki Omoto, and Hioshi Fujimoto. Lewis acidity of boron trihalides. Journal of Physical Chemistry A, 103:5807–5811, 1999.

[159] Fabienne Bessac and Gernot Frenking. Why is bcl3 a stronger lewis acid with respect to strong bases than b f3? Inorganic Chemistry, 42:7990–7994, 2003. [160] Fabienne Bessac and Gernot Frenking. Chemical bonding in phosphane and amine complexes of main group elements and transition metals. Inorganic Chemistry, 45:6956– 6964, 2006. [161] Fang Huang, Jinliang Jiang, Mingwei Wen, and Zhi-Xiang Wang. Assessing the perfor- mance of commonly used dft functionals in studying the chemistry of frustrated lewis pairs. Journal of Theoretical and Computational Chemistry, 13(1350074), 2014. [162] Thomas A. Holme and Thanh N. Truong. A test of density functional theory for dative bonding systems. Chemical Physics Letters, 215(1):53–57, 1993. [163] T. M. Gilbert. Tests of the mp2 model and various dft models in predicting the structures and b–n bond dissociation energies of amine-boranes (x3c)mh3-mb-n(ch3)nh3-n (x = h, f; m = 0-3; n = 0-3): poor performance of the b3lyp approach for dative b-n bonds. Journal of Physical Chemistry A, 108:2550–2554, 2004. [164] Carina F Pupim, Nelson H Morgon, and Alejandro Lopez-Castillo. Spurious Phosphorus Pyramidalization Induced by Some DFT Functionals. Journal of the Brazilian Chemical Society, 26(8):1648–1655, 2015. [165] Tom Leyssens, Daniel Peeters, A Guy Orpen, and Jeremy N Harvey. Insight into metal-phosphorus bonding from analysis of the electronic structure of redox pairs of metal-phosphine complexes. New Journal of Chemistry, 29(11):1424–1430, October 2005. References 101

[166] H Anane, S El Houssame, A El Guerraze, A Jarid, A Boutalib, I Nebot-Gil, and F Tomás. Ab initio molecular orbital study of the substituent effect on ammonia and phosphine- borane complexes. Journal of Molecular Structure: Theochem, 709(1-3):103–107, November 2004. [167] Florian Weigend and Reinhart Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics, 7(18):3297–3305, 2005. [168] Florian Weigend. Accurate coulomb-fitting basis sets for h to rn. Physical Chemistry Chemical Physics, 8:1057–1065, 2006. [169] Ove Christiansen, Henrik Koch, and Poul Jørgensen. The second-order approximate coupled cluster singles and doubles model CC2. Chemical Physics Letters, 243(5-6):409– 418, September 1995. [170] G E et al. Scuseria and Henry F. Schaefer III. Is coupled cluster singles and dou- bles (CCSD) more computationally intensive than quadratic configuration-interaction (QCISD)? Journal of Chemical Physics, 90:3700–3703, 1989. [171] J A Pople. Quadratic configuration interaction - a general technique for determining electron correlation energies. Journal of Chemical Physics, 87:5968–5975, 1987. [172] P Piecuch and M Włoch. Renormalized coupled-cluster methods exploiting left eigen- states of the similarity-transformed Hamiltonian. The Journal of Chemical Physics, 123:224105–224115, 2005. [173] Marta Włoch, Jeffrey R Gour, and Piotr Piecuch. Extension of the Renormalized Coupled-Cluster Methods Exploiting Left Eigenstates of the Similarity-Transformed Hamiltonian to Open-Shell Systems: A Benchmark Study. Journal of Physical Chemistry A, 111(44):11359–11382, 2007. [174] Walter J Stevens, Morris Krauss, Harold Basch, and Paul G Jasien. Relativistic compact effective potentials and efficient, shared-exponent basis sets for the third-, fourth-, and fifth-row atoms. Canadian Journal of Chemistry, 70:612–630, 2011. 
[175] Kirsten Aarset, Kolbjørn Hagen, Reidar Stølevik, and Sæbø Per Christian. Molecular structure and conformational composition of 1-Chlorobutane, 1-Bromobutane, and 1- Iodobutane as determined by gas-phase electron diffraction and ab initio calculations. Structural Chemistry, 6(3):197–205, 1995. [176] M W Schmidt, K K Baldridge, and J A Boatz. General atomic and molecular electronic structure system - Schmidt - 1993 - Journal of Computational Chemistry - Wiley Online Library. Journal of Computational Chemistry, 14:1347–1363, 1993.

[177] Asger Halkier and et al. Basis-set convergence in correlated calculations on ne, n2 and h2o. Chemical Physics Letters, 286:243–252, 1998. 102 References

[178] Per E M Siegbahn, Jan Almlöf, Anders Heiberg, and Björn O Roos. The complete active space SCF (CASSCF) method in a Newton-Raphson formulation with application to the HNO molecule. The Journal of Chemical Physics, 74(4):2384–2396, February 1981. [179] Mark S. Gordon, Michael, and W. Schmidt. Advances in electronic structure theory: Gamess a decade later. Theory and Applications of Computational Chemistry: The First Forty Years, (n.a.):1167–1189, 2005. [180] Ivan Duchemin and François Gygi. A scalable and accurate algorithm for the computation of Hartree–Fock exchange. Computer Physics Communications, 181(5):855–860, May 2010. [181] Anouar Benali, Luke Shulenburger, Nichols A Romero, Jeongnim Kim, and O Anatole von Lilienfeld. Application of Diffusion Monte Carlo to Materials Dominated by van der Waals Interactions. Journal of Chemical Theory and Computation, 10(8):3417–3422, August 2014. [182] David Sherrill, Per-Olov Lwdin, John R. Sabin, Michael C. Zerner, and Erkki Brändas. The Configuration Interaction Method: Advances in Highly Correlated Approaches, volume 34. Academic Press, 1999. [183] Chr. Møller and M. S. Plesset. Note on an Approximation Treatment for Many-Electron Systems. Physical Review, 46:618–622, 1934. [184] Jiri Cizek. On the Correlation Problem in Atomic and Molecular Systems. Calculation of Wavefunction Components in Ursell-Type Expansion Using Quantum-Field Theoretical Methods. Journal of Chemical Physics, 45:4256, 1966. [185] E Schrödinger. Quantisierung als Eigenwertproblem. Annalen der Physik, 79:361, December 1926. [186] J K Nørskov, T Bligaard, J Rossmeisl, and C H Christensen. Towards the computational design of solid catalysts. Nature Chemistry, 1(1):37–46, April 2009. [187] S. Curtarolo, Gus L. W. Hart, M. B. Nardelli, N. Mingo, S. Sanito, and O. Levy. The high- throughput highway to computational materials design. Nature Materials, 12:191–201, 2013. [188] Lusann Yang and Gerbrand Ceder. 
Data-mined similarity function between material compositions. Physical Review B, 88(22):224107, December 2013. [189] Gisbert Schneider. Virtual screening: an endless staircase? Nature Reviews Drug Discovery, 9:273–276, 2010. [190] Xinguo Ren, Patrick Rinke, Volker Blum, Jürgen Wieferink, Alexandre Tkatchenko, Andrea Sanfilippo, Karsten Reuter, and Matthias Scheffler. Resolution-of-identity ap- proach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GWwith numeric atom-centered orbital basis functions. New Journal of Physics, 14(5):053020–17, May 2012. References 103

[191] Felix Faber, Alexander Lindmaa, O Anatole von Lilienfeld, and Rickard Armiento. Crystal structure representations for machine learning models of formation energies. International Journal of Quantum Chemistry, 115(16):1094–1101, April 2015. [192] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilien- feld. Big Data Meets Quantum Chemistry Approximations: The ∆-Machine Learning Approach. Journal of Chemical Theory and Computation, 11(5):2087–2096, May 2015. [193] O Anatole von Lilienfeld, Raghunathan Ramakrishnan, Matthias Rupp, and Aaron Knoll. Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties. International Journal of Quantum Chemistry, 115(16):1084–1093, April 2015. [194] Tristan Bereau, Denis Andrienko, and O Anatole von Lilienfeld. Transferable Atomic Multipole Machine Learning Models for Small Organic Molecules. Journal of Chemical Theory and Computation, 11:3225–3233, July 2015. [195] Peter M W Gill. Intracule functional models. Annual Reports Section “C" (Physical Chemistry), 107(0):229–241, 2011. [196] Matthias Rupp, Raghunathan Ramakrishnan, and O Anatole von Lilienfeld. Machine Learning for Quantum Mechanical Properties of Atoms in Molecules. The Journal of Physical Chemistry Letters, 6(16):3309–3313, August 2015. [197] Pavlo O Dral, O Anatole von Lilienfeld, and Walter Thiel. Machine Learning of Parame- ters for Accurate Semiempirical Quantum Chemical Calculations. Journal of Chemical Theory and Computation, 11(5):2120–2125, May 2015. [198] Peter M W Gill, Darragh P O’Neill, and Nicholas A Besley. Two-electron distribution functions and intracules. Theoretical Chemistry Accounts, 109(5):241–250, 2003. [199] Russell J Boyd and Main C Yee. Angular aspects of electron correlation and the Coulomb hole. The Journal of Chemical Physics, 77(7):3578–3582, August 1998. [200] A. J. Coleman. 
Density matrices in the quantum theory of matter: Energy, intracules and extracules. International Journal of Quantum Chemistry, 1:457–464, 1967. [201] Ajit J Thakkar, A N Tripathi, and Vedene H Smith. Anisotropic electronic intracule densities for diatomics. International Journal of Quantum Chemistry, 26(2):157–166, August 1984. [202] C A Coulson and A H Neilson. Electron Correlation in the Ground State of Helium. Proceedings of the Physical Society, 78(5):831, November 1961. [203] Vitaly A. Rassolov. An ab initio linear electron correlation functional. The Journal of Chemical Physics, 10:3672, 1999. [204] Jason K Pearson, Deborah L Crittenden, and Peter M W Gill. Intracule functional models. IV. Basis set effects. The Journal of Chemical Physics, 130(16):164110–8, April 2009. 104 References

[205] Adam J Proud and Jason K Pearson. A simultaneous probability density for the intracule and extracule coordinates. The Journal of Chemical Physics, 133(13):134113–6, October 2010. [206] Rutvij Vihang Bhavsar and Raghunathan Ramakrishnan. Machine Learning Modeling of Wigner Intracule Functionals for Two Electrons in one Dimension. arXiv.v physics.chem- ph, pages 1–9, February 2018. [207] L. C. Blum abd J. l. Reymond. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Journal of American Chemical Society, 131:3700–3703, 2009. [208] Grégoire Montavon, Matthias Rupp, Vivekanand Gobre, Alvaro Vazquez-Mayagoitia, Katja Hansen, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilien- feld. Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics, 15(9):095003, 2013. [209] Felix A Faber, Anders S Christensen, Bing Huang, and O Anatole von Lilienfeld. Al- chemical and structural distribution based representation for universal quantum machine learning. The Journal of Chemical Physics, 148(24):241717–13, June 2018. Appendix A

Python program written for machine learning calculations

#!/usr/local/bin/python
# Usual header declarations
from __future__ import division
import sys
import re
import os
import json
import numpy as np
#import numpy.ma as ma
from scipy import stats
from tabulate import tabulate
from pprint import pprint
from optparse import OptionParser

# we want to use options
parser=OptionParser()

# Make everything print out all pretty to the output file
def print_model_summary():
    outfile.write("The user has provided NRef = " + str(len(property_list)) + ' unique properties and associated reference values\n')
    outfile.write("for " + str(len(molecule_list)) + " unique chemical species.\n\n")
    outfile.write("We found the following " + str(len(model_list)) + " model chemistries with at least N = 1 matching property available.\n\n")
    outfile.write(tabulate(zip(model_list,N_list), ["Model","N"], tablefmt="orgtbl")+'\n\n')

def print_model_error_summary():
    summary_headers=["Model","MAAD", "AAD", "MUE", "MSE", "MaxE", "N"]
    outfile.write("Below is a summary of error measures for each model, sorted by "+error_measure_dict.get(sort_by)+".\n")
    outfile.write("Models listed have N greater than or equal to " + str(summary_percent) + "% of NRef")
    outfile.write(", or " + str(summary_threshold) + " data points.\n\n\n")
    outfile.write(tabulate(summary_data, summary_headers, tablefmt="orgtbl",floatfmt=".6f") + '\n\n')
    print_summary_legend()

def print_summary_legend():
    outfile.write("Legend:")
    outfile.write(" \n\tMAAD = Mean Absolute Average Deviation in properties, per molecule")
    outfile.write(" \n\tAAD = Average Absolute Deviation\n\tMUE = Mean Unsigned Error")
    outfile.write("\n\tMSE = Mean Signed Error\n\tMaxE = Maximum Error")
    outfile.write("\n\tN = Number of data points found for the specific model chemistry\n\n")
    outfile.write("Error[i] = Reference[i] - Model[i]\n")
    outfile.write("MaxE[Model] = Max(Error[i])\n")
    # Backslashes are escaped so that "\sum" is written literally
    outfile.write("MSE[Model] = (\\sum_i^N[Error[i]])/N \n")
    outfile.write("MUE[Model] = (\\sum_i^N[Abs(Error[i])])/N \n")
    outfile.write("AAD[Model] = (\\sum_i^N[Abs(MSE[Model]- Error[i])])/N \n")
    outfile.write("MAAD[Model] = Mean of AAD, when AAD is calculated separately for each unique chemical species.\n\n\n")

def print_detailed_output(NTotal_cols):
    # Print the per-property errors, four model columns at a time
    NPrint_cols = 4
    common_header = ["#","Property", "ref value"]
    outfile.write("Detailed listing of all properties, associated reference values and model errors.\n\n")
    for i in range(int(np.ceil(NTotal_cols/NPrint_cols))):
        j = i*NPrint_cols
        # list(...) keeps the indexing valid when zip returns an iterator
        outfile.write(tabulate(zip(range(1,len(property_list)+1), property_list,
                                   list(zip(*reference_list))[1],
                                   error_data[:,j:j+1], error_data[:,j+1:j+2],
                                   error_data[:,j+2:j+3], error_data[:,j+3:j+4]),
                               (common_header + model_list[i*NPrint_cols:NPrint_cols*(i+1)]),
                               tablefmt="orgtbl", floatfmt=".6f") + '\n\n')

def print_detailed_AAD(NTotal_cols):
    # Print the per-molecule AAD, four model columns at a time
    NPrint_cols = 4
    common_header = ["#","Molecule"]
    outfile.write("\nBecause you've provided multiple properties for multiple molecules,\n")
    outfile.write("we report the Average Absolute Deviation PER MOLECULE below.\n\n")
    outfile.write("Detailed listing of the Average Absolute Deviation (AAD) for each model averaged over all properties for each molecule.\n\n")
    for i in range(int(np.ceil(NTotal_cols/NPrint_cols))):
        j = i*NPrint_cols
        outfile.write(tabulate(zip(range(1,len(molecule_list)+1), molecule_list,
                                   AAD_by_molecule[:,j:j+1], AAD_by_molecule[:,j+1:j+2],
                                   AAD_by_molecule[:,j+2:j+3], AAD_by_molecule[:,j+3:j+4]),
                               (common_header + model_list[i*NPrint_cols:NPrint_cols*(i+1)]),
                               tablefmt="orgtbl", floatfmt=".6f") + '\n\n')

def print_warning_duplicate_input(duplicate_property):
    outfile.write("Warning! User has provided duplicate reference for property: "+ str(duplicate_property))
    outfile.write("\nAnalysis is performed using the first unique instance.\n\n")

def error_nodatafile():
    outfile.write("Error! No data file found!\nExiting...")
    sys.exit()

def error_nodata():
    outfile.write("No data found matching your reference properties!\nExiting...")
    sys.exit()

def warning_low_data():
    outfile.write("Warning: No models were found having N greater than " + str(summary_percent) + "% of NRef")
    outfile.write(", or " + str(summary_threshold) + " data points.")
    outfile.write("\nPrinting summary and detailed results for all available models...\n\n")

def avg_abs_dev(A):
    # return the AAD of a 1D numpy array that may contain "nan" entries
    meanA = np.mean(np.ma.masked_array(A,np.isnan(A)),axis=0)
    lenA = np.count_nonzero(~np.isnan(A))
    diff = 0
    for i in range(len(A)):
        if not np.isnan(A[i]):
            diff += abs(meanA - A[i])
    if lenA > 0:
        return diff/lenA
    else:
        return np.nan
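The error measures named in the legend can be checked by hand on a small example. The following is a minimal plain-Python sketch, not part of the original program: the function name error_measures and the sample numbers are hypothetical, and it assumes that a missing model value is stored as nan and is simply skipped. It follows the legend's definitions, with Error[i] = Reference[i] - Model[i].

```python
import math


def error_measures(reference, model):
    """Return MSE, MUE, MaxE, AAD and N for one model (hypothetical helper).

    Entries whose model value is nan are treated as missing and skipped.
    """
    errors = [r - m for r, m in zip(reference, model)]
    valid = [e for e in errors if not math.isnan(e)]
    n = len(valid)
    if n == 0:
        nan = float("nan")
        return {"MSE": nan, "MUE": nan, "MaxE": nan, "AAD": nan, "N": 0}
    mse = sum(valid) / n                               # mean signed error
    return {
        "MSE": mse,
        "MUE": sum(abs(e) for e in valid) / n,         # mean unsigned error
        "MaxE": max(valid),                            # Max(Error[i]), as in the legend
        "AAD": sum(abs(mse - e) for e in valid) / n,   # average absolute deviation
        "N": n,
    }


# Hypothetical reference and model values; the third model value is missing.
reference = [1.0, 2.0, 3.0, 4.0]
model = [0.5, 2.5, float("nan"), 3.0]
measures = error_measures(reference, model)
print(measures)
```

Note that MaxE here is the largest signed error, as the legend defines it, rather than the largest absolute error.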

###########################################################
'''
Reads a JSON summary of data and populates a 2D array where
the row number corresponds to a particular property and the
column number corresponds to a particular model's value for
the property
'''
###########################################################

# Some variables
reference_list = []
property_list = []
molecule_list = []
value_list = []
model_list = []
outfilename = 'Results.out'
data_file = 'JSON_report.txt'
reference_file = 'JSON_references.txt'
multi_mol_stats = False

# Setup a few parameters based on user preference:
#Only report summary for models with at least this % of matching data
summary_percent = 80
#summary_percent = int(raw_input("use model if it has at least __% of data: "))
detail = "y"
#while ((summary_percent < 0) or (summary_percent > 100)): summary_percent = int(raw_input("Please enter an integer between 0 and 100: "))
#detail = str(raw_input("Would you like a detailed output? (y or n): "))
#while ((detail != "y") and (detail != "n")): detail = str(raw_input("Please enter either y or n (lower case): "))
sort_by = 1
#sort_by = int(raw_input("Sort the summary table by:\nModel = 0\nMAAD = 1\nAAD = 2\nMUE = 3\nMSE = 4\nMaxE = 5\nN = 6\nEnter an integer between 0 and 6: "))
#while ((sort_by < 0) or (sort_by > 6)): sort_by = int(raw_input("Please enter an integer between 0 and 6: "))
'''
Users can choose to sort the summary table by setting sort_by to any of the following
Model = 0
MAAD = 1
AAD = 2
MUE = 3
MSE = 4
MaxE = 5
N = 6
'''
error_measure_dict = {0:'Model', 1:'MAAD', 2:'AAD', 3:'MUE', 4:'MSE', 5:'MaxE', 6:'N'}

# Open the output file for writing
outfile = open(outfilename, 'w')
outfile.write("Welcome to Retrievium's automated model validator\n \n")

# Import the reference data from the user (JSON file in this case)
with open(reference_file) as JSON_file:
    reference_data = json.load(JSON_file)

# From the reference data, collect all properties of interest and their reference values
# We assume the data fields are completely populated and that the properties index form
# a complete list
for datum in range(len(reference_data["Data"])):
    current_property = []
    current_formula = reference_data["Data"][datum]["formula"]
    current_property.append(current_formula)
    if current_formula not in molecule_list:
        molecule_list.append(current_formula) #Keep track of unique molecules
    current_property.append(reference_data["Data"][datum]["bond_len"]) #Identify property of interest,
    if current_property not in property_list:
        property_list.append(current_property) #and add it to the list
        current_value = reference_data["Data"][datum]["Energy"] #Get the reference value,
        value_list.append(current_value) #and add it to the list
    else:
        print_warning_duplicate_input(current_property) # Report when the user supplies duplicates, only use first instance
#The full reference list is then assembled and sorted
reference_list = sorted(zip(property_list, value_list), key=lambda tup: tup[0])
property_list.sort() # and the property list is sorted as well

#If multiple properties are supplied for multiple molecules, calculate some "per molecule" statistics
if (len(molecule_list)>1 and len(reference_list)/len(molecule_list)>1):
    multi_mol_stats = True

# Now read the JSON data file from Retrievium
if not os.path.isfile(data_file):
    error_nodatafile() # Exit if there is no JSON result file
# Import the retrieved data from the JSON file
with open(data_file) as JSON_file:
    JSON_data = json.load(JSON_file)

# From the JSON file, count and index the available models
for datum in range(len(JSON_data["Data"])):
    current_model = JSON_data["Data"][datum]["method"] #Find Model
    if current_model not in model_list:
        model_list.append(current_model)
if len(model_list) == 0:
    error_nodata() # Exit if there is no data present in the JSON file

# Sort the resultant model list for later reporting and tabulation
model_list.sort()

# Now initialize (to nan) error_data array; difference between reference value and model predicted value
error_data = [[np.nan for j in range(len(model_list))] for i in range(len(property_list))]

# Compute the difference between the reference value and that produced by each model
# to populate the error_data array
for datum in range(len(JSON_data["Data"])):
    current_property = []
    current_property.append(JSON_data["Data"][datum]["formula"])
    current_property.append(JSON_data["Data"][datum]["bond_len"]) #Get current property
    current_model = JSON_data["Data"][datum]["method"] #Get current model
    # Find matching property in reference property_list and populate error_data array for current_model
    for i in range(len(reference_list)):
        if reference_list[i][0] == current_property:
            error_data[property_list.index(current_property)][model_list.index(current_model)] = reference_list[i][1] - JSON_data["Data"][datum]["Energy"]
# Translate list to numpy array
error_data = np.array(error_data)

# Make a list of the number of data points found (N), for each model chemistry and print
# a summary
N_list = []
for i in range(len(model_list)):
    N_list.append(np.count_nonzero(~np.isnan(error_data[:,i])))
print_model_summary()

# Calculate error measures
MSE = np.mean(np.ma.masked_array(error_data,np.isnan(error_data)),axis=0)
MUE = np.mean(np.ma.masked_array(abs(np.array(error_data)),np.isnan(abs(np.array(error_data)))),axis=0)
MaxE = np.nanmax(abs(np.array(error_data)), axis=0)
AAD = []
for model in model_list:
    AAD.append(str(avg_abs_dev(error_data[:,model_list.index(model)])))
print molecule_list
print model_list
#If multiple properties are supplied for multiple molecules, calculate some "per molecule" statistics
if multi_mol_stats:
    AAD_by_molecule = [[np.nan for j in range(len(model_list))] for i in range(len(molecule_list))]
    mol_offset = 0
    for molecule in molecule_list:
        mol_range = (zip(*property_list))[0].count(molecule)
        for model in model_list:
            AAD_by_molecule[molecule_list.index(molecule)][model_list.index(model)] = avg_abs_dev(error_data[mol_offset:(mol_offset+mol_range),model_list.index(model)])
        mol_offset += mol_range
    AAD_by_molecule = np.array(AAD_by_molecule)
    print AAD_by_molecule
    MAAD_per_mol = []
    for i in range(len(model_list)):
        MAAD_per_mol.append(np.mean(np.ma.masked_array(AAD_by_molecule[:,i], np.isnan(AAD_by_molecule[:,i])),axis=0))
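The error measures computed above (MSE, MUE, MaxE and the per-model count N) can also be written with NumPy's NaN-aware reductions. The following is a minimal sketch under stated assumptions: the function name error_measures and the example array are hypothetical, the code targets Python 3, and every model column is assumed to contain at least one valid entry (an all-NaN column would yield NaN together with a runtime warning).

```python
import numpy as np


def error_measures(error_data):
    """Column-wise error statistics for a (property x model) error array.

    Each entry is (reference value - model value); NaN marks missing data.
    """
    errors = np.asarray(error_data, dtype=float)
    return {
        "MSE": np.nanmean(errors, axis=0),            # mean signed error
        "MUE": np.nanmean(np.abs(errors), axis=0),    # mean unsigned error
        "MaxE": np.nanmax(np.abs(errors), axis=0),    # largest absolute error
        "N": np.count_nonzero(~np.isnan(errors), axis=0),
    }


if __name__ == "__main__":
    # Hypothetical errors for three properties (rows) and two models (columns)
    example = [[0.1, -0.2],
               [np.nan, 0.4],
               [-0.3, np.nan]]
    for name, values in error_measures(example).items():
        print(name, values)
```

The NaN-aware functions skip missing entries in the same way as the masked arrays used in the listing, so each model is averaged only over the properties for which data were retrieved.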

# Merge all lists for reporting... [(Model, MAAD, AAD, MUE, MSE, MaxE, N),...]
summary_data = zip(model_list, MAAD_per_mol, AAD, MUE, MSE, MaxE, N_list)
summary_data = sorted(summary_data, key=lambda tup: tup[6], reverse = True)
# Check to see if there are any models with greater than summary_percent of available data
# (dividing by 100.0 avoids Python 2 integer division, which would truncate the fraction to zero)
summary_threshold = int((summary_percent / 100.0) * len(reference_list))
if len([i for i in N_list if int(i) >= summary_threshold]) == 0:
    warning_low_data() #Warn if no models have enough data
    summary_threshold = 0 # then set appropriate parameters to print a detailed report
    summary_percent = 0 # whether the user wanted it or not
    detail = "y"

# Remove entries having N's that are less than user-specified % of total NRef (i.e. summary_percent)
for i in range(len(N_list)-1,-1,-1):
    if summary_data[i][6] <= summary_threshold:
        del summary_data[i]

# Sort the data_list by a chosen tuple (sort_by); see the variable list above for details
summary_data = sorted(summary_data, key=lambda tup: tup[sort_by])

# Print results to the output
print_model_error_summary()
if detail == "y":
    print_detailed_output(len(model_list))
    if multi_mol_stats:
        print_detailed_AAD(len(model_list))
outfile.close()

Appendix B

Data of potential energy surface calculations of selected Lewis pairs

Potential energy surface data for PH3BF3 at various dative bond lengths

Bond Length M06L 6-31G* M06L cc-pVTZ M06L aug-cc-pVQZ MP2 6-31G* 1 -666.2425567 -666.4099252 -666.4610608 -664.930161 1.1 -666.717198 -666.8786988 -666.9267158 -665.415757 1.2 -667.041032 -667.1986924 -667.2448748 -665.748291 1.3 -667.2568927 -667.4126231 -667.4577056 -665.970893 1.4 -667.3999473 -667.5533414 -667.5976652 -666.117538 1.5 -667.4918312 -667.6442175 -667.6879707 -666.212524 1.6 -667.5498739 -667.7016024 -667.744946 -666.272683 1.7 -667.5853458 -667.736713 -667.7797151 -666.309644 1.8 -667.606249 -667.7576005 -667.8003816 -666.331446 1.9 -667.6180042 -667.7694698 -667.8120874 -666.343626 2 -667.6244386 -667.7761146 -667.8187087 -666.349959 2.1 -667.6278281 -667.7797327 -667.8223731 -666.352992 2.2 -667.6296243 -667.7816963 -667.8244114 -666.354398 2.3 -667.6309368 -667.7830569 -667.8258483 -666.355214 2.4 -667.6321315 -667.7842055 -667.8270979 -666.35598 2.5 -667.6334255 -667.7853527 -667.8283708 -666.356872 2.6 -667.6346761 -667.7864475 -667.8295968 -666.357842 2.7 -667.635931 -667.7874555 -667.8307107 -666.358764 2.8 -667.6370038 -667.7882928 -667.8316239 -666.359525 2.9 -667.6377553 -667.7888182 -667.8322122 -666.360062 3 -667.6383121 -667.7891719 -667.8326141 -666.36036 3.1 -667.6386884 -667.7893876 -667.8328878 -666.360438 3.2 -667.6388918 -667.7893682 -667.8329413 -666.360332 3.3 -667.6388748 -667.789193 -667.832836 -666.360086 3.4 -667.6385586 -667.7887999 -667.8325151 -666.359741 3.5 -667.6380209 -667.7882139 -667.8320099 -666.359334 3.6 -667.6374557 -667.7876818 -667.8315024 -666.358894 3.7 -667.636974 -667.7873106 -667.8311407 -666.358444 3.8 -667.6366246 -667.7871285 -667.8309487 -666.358003 3.9 -667.6364079 -667.7870175 -667.8308705 -666.357582 4 -667.6362385 -667.7869342 -667.830819 -666.357189 4.1 -667.6362385 4.2 -667.6362385 4.3 -667.6362385 4.4 -667.6362385 4.5 -667.6362385 Bond Length MP2 6-311G* CCSD(T) CBS MP2(Full-OPT) cc-pVTZ 1 -665.3298532 -664.0122814 -665.5211634 1.1 -665.8110483 -664.485573 -665.9989864 1.2 -666.1413198 
-664.8096745 -666.326629 1.3 -666.3628384 -665.0268438 -666.5457614 1.4 -666.5089709 -665.1701671 -666.6897151 1.5 -666.603778 -665.2636065 -666.7826697 1.6 -666.6640104 -665.3235084 -666.8414633 1.7 -666.7012075 -665.3612812 -666.8776372 1.8 -666.723304 -665.3846849 -666.8990752 1.9 -666.7357484 -665.3989968 -666.9111638 2 -666.7422629 -665.4077878 -666.9175687 2.1 -666.7453575 -665.4133734 -666.9207463 2.2 -666.7466807 -665.4171954 -666.9222864 2.3 -666.7472565 -665.4201355 -666.9231501 2.4 -666.7476532 -665.4226531 -666.9238434 2.5 -666.7481058 -665.4249267 -666.924555 2.6 -666.7486664 -665.4270143 -666.9252879 2.7 -666.7492545 -665.4288887 -666.9259674 2.8 -666.7497862 -665.4305182 -666.9265169 2.9 -666.7501998 -665.431836 -666.9268901 3 -666.7504657 -665.432921 -666.9270758 3.1 -666.750577 -665.4337383 -666.9270922 3.2 -666.75056 -665.4343729 -666.9269613 3.3 -666.750424 -665.4346255 -666.9267186 3.4 -666.7501964 -665.4349784 -666.9263956 3.5 -666.7499012 -665.4351215 -666.9260211 3.6 -666.7495586 -665.4351512 -666.9256188 3.7 -666.7491874 -665.4351676 -666.9252078 3.8 -666.7488035 -665.4347986 -666.9248024 3.9 -666.7484754 -665.4347597 -666.9244133 4 -666.7480462 -665.4346085 -666.9240477 4.1 4.2 4.3 4.4 4.5 Bond Length CCSD(T)/cc-VDZ CCSD(T) cc-VTZ-SPE//MP2/cc-pVTZ 1 -664.7607075 -665.5751202 1.1 -665.3049626 -666.0539835 1.2 -665.6803043 -666.3821461 1.3 -665.9365467 -666.6016415 1.4 -666.1106827 -666.7459035 1.5 -666.2292093 -666.8391411 1.6 -666.3105554 -666.8982045 1.7 -666.3671585 -666.9346422 1.8 -666.4072231 -666.9563405 1.9 -666.4360936 -666.968686 2 -666.457239 -666.9753442 2.1 -666.4729194 -666.9787737 2.2 -666.4846253 -666.9805675 2.3 -666.4933626 -666.9816884 2.4 -666.4998358 -666.982643 2.5 -666.5045614 -666.9835882 2.6 -666.5079358 -666.9845219 2.7 -666.5102726 -666.9853639 2.8 -666.5118223 -666.9860353 2.9 -666.5127836 -666.9864963 3 -666.5133116 -666.9867427 3.1 -666.5135253 -666.9867981 3.2 -666.5135148 -666.9866916 3.3 -666.5133485 
-666.9864628 3.4 -666.5130781 -666.9861457 3.5 -666.5127429 -666.9857724 3.6 -666.5123724 -666.9853674 3.7 -666.5119889 -666.9849522 3.8 -666.5116083 -666.9845416 3.9 -666.5112415 -666.9841472 4 -666.5108955 -666.9837754 4.1 4.2 4.3 4.4 4.5 Relative E Bond Length CASSCF(6,6)/6-311+G* CASSCF(6,6)/6-311+G* CASSCF(8 8)/6-311+G* 1 -664.3275428 1.4731883 -664.3223388 1.1 -664.8236896 0.9770415 -664.8118442 1.2 -665.1503611 0.65037 -665.1553374 1.3 -665.3816905 0.4190406 -665.3861346 1.4 -665.5351063 0.2656248 -665.5400106 1.5 -665.6354115 0.1653196 -665.6403319 1.6 -665.6997969 0.1009342 -665.7047346 1.7 -665.7401044 0.0606267 -665.7450596 1.8 -665.7645053 0.0362258 -665.7694733 1.9 -665.7786523 0.0220788 -665.7836338 2 -665.7864448 0.0142863 -665.7900518 2.1 -665.7905549 0.0101762 -665.7941678 2.2 -665.7927625 0.0079686 -665.7963954 2.3 -665.794183 0.0065481 -665.797854 2.4 -665.7954269 0.0053042 -665.7991551 2.5 -665.7967364 0.0039947 -665.8005336 2.6 -665.7981216 0.0026095 -665.8019861 2.7 -665.799488 0.0012431 -665.8034058 2.8 -665.8007311 0 -665.8046813 2.9 -665.8017791 -665.8057403 3 -665.8026045 -665.8065609 3.1 -665.8032167 -665.8071485 3.2 -665.8036329 -665.8075326 3.3 -665.8038876 -665.8041217 3.4 -665.8040133 -665.8040283 3.5 -665.8040396 -665.8040521 3.6 -665.8039919 -665.8040034 3.7 -665.8038913 -665.8039033 3.8 -665.8037551 -665.8037666 3.9 -665.8035968 -665.8036098 4 -665.8034272 -665.8034402 4.1 4.2 4.3 4.4 4.5 Relative E Relative E Bond Length CASSCF(8 8)/6-311+G* CASCSF(8 8)/cc-pVTZ CASCSF(8 8)/cc-pVTZ 1 1.4823425 -664.4018222 1.4647538 1.1 0.9928371 -664.8732167 0.9933593 1.2 0.6493439 -665.2131458 0.6534302 1.3 0.4185467 -665.4411852 0.4253908 1.4 0.2646707 -665.5940523 0.2725237 1.5 0.1643494 -665.6955845 0.1709915 1.6 0.0999467 -665.759731 0.106845 1.7 0.0596217 -665.8000715 0.0665045 1.8 0.035208 -665.8247462 0.0418298 1.9 0.0210475 -665.8393637 0.0272123 2 0.0146295 -665.8477837 0.0187923 2.1 0.0105135 -665.8526274 0.0139486 2.2 0.0082859 
-665.8556153 0.0109607 2.3 0.0068273 -665.8577985 0.0087775 2.4 0.0055262 -665.8597273 0.0068487 2.5 0.0041477 -665.8616941 0.0048819 2.6 0.0026952 -665.8635338 0.0030422 2.7 0.0012755 -665.8651065 0.0014695 2.8 0 -665.866576 0 2.9 -665.8703722 3 -665.8714518 3.1 -665.8694127 3.2 -665.8727785 3.3 -665.8731189 3.4 -665.8631706 3.5 -665.8631429 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 Bond Length MP2 cc-pVQZ(OPT) CR-CCL ACCD CR-CCL ACCT CR-CCL CBS 1 -666.9809439 1.1 -667.0390515 1.2 -667.0747253 1.3 -667.095819 1.4 -667.1077018 1.5 -667.1139933 -666.4382225 -666.7700545 -667.982032 1.6 -667.1170916 -666.5004815 -666.8295321 -668.0362801 1.7 -667.1185528 -666.5388169 -666.8663173 -668.069204 1.8 -667.119332 -666.5613781 -666.8882681 -668.0882545 1.9 -667.1199448 -666.5737575 -666.9007717 -668.0985703 2 -667.1205935 -666.5799802 -666.9074936 -668.1036196 2.1 -667.1212863 -666.5825493 -666.9108276 -668.1055049 2.2 -667.1219476 -666.5833051 -666.9124025 -668.105743 2.3 -667.1224969 -666.5835102 -666.9133132 -668.105487 2.4 -667.1228838 -666.5837208 -666.9140861 -668.1052331 2.5 -667.1230953 -666.5840891 -666.9148631 -668.1050131 2.6 -667.1231473 -666.5845952 -666.9156751 -668.1048618 2.7 -667.1230626 -666.5851275 -666.9165078 -668.1048838 2.8 -667.1228745 -666.5856087 -666.9171426 -668.1046443 2.9 -667.122616 -666.585944 -666.917558 -668.1042084 3 -667.1223131 -666.5861323 -666.9177607 -668.1035772 3.1 -667.1219876 -666.5861885 -666.917837 -668.1029942 3.2 -667.1216557 -666.5860988 -666.917725 -668.1022135 3.3 -667.121329 -666.5859151 -666.9175026 -668.1014647 3.4 -667.1210154 -666.5856058 -666.9172143 -668.1006517 3.5 -667.1207197 -666.5852156 -666.9169173 -668.0999684 3.6 -667.1204448 -666.5848251 -666.9165619 -668.0992499 3.7 -667.1201915 -666.5844228 -666.9161762 -668.0985114 3.8 -667.11996 -666.5839778 -666.9157749 -668.0977756 3.9 -667.1197497 -666.5836517 -666.9154215 -668.0971973 4 -667.1195597 -666.5833127 -666.9151882 -668.0969168 4.1 -666.582985 -666.9148644 
-668.4670346 4.2 -666.5827216 -666.9145898 -668.46661 4.3 -666.5824831 -666.9143493 -668.4662567 4.4 -666.5822228 -666.914135 -668.4659529 4.5 -666.5819972 -666.9139785 -668.4657732 Bond Length CR-CCL/CBS CCSD(T)/CBS B3LYP/6-31G* B3LYP-D3(0)/6-31G* 1 1.1 1.2 1.3 -667.711291 -667.313309 -666.0664368 1.4 -667.848705 -667.455529 -666.5369148 1.5 0.12371095 -667.937117 -667.547541 -666.8587168 1.6 0.06946284 -667.992707 -667.605876 -667.0743636 1.7 0.03653899 -668.026647 -667.642023 -667.2166142 1.8 0.01748844 -668.046557 -667.663772 -667.3088347 1.9 0.00717268 -668.057629 -667.676436 -667.367446 2 0.00212339 -668.063356 -667.683602 -667.4036591 2.1 0.00023812 -668.066026 -667.687614 -667.4256446 2.2 0 -668.067134 -667.689959 -667.4385584 2.3 0.00025596 -668.06758 -667.691535 -667.4462927 2.4 0.00050985 -668.067818 -667.692805 -667.4509871 2.5 0.00072989 -668.068033 -667.693955 -667.4539524 2.6 0.0008812 -668.068222 -667.695009 -667.4562343 2.7 0.00085918 -668.068317 -667.695933 -667.4582719 2.8 0.00109868 -668.068298 -667.696685 -667.4602468 2.9 0.00153461 -668.06811 -667.697242 -667.4621228 3 0.00216575 -668.067764 -667.6976 -667.4638277 3.1 0.00274877 -668.067316 -667.697774 -667.4652469 3.2 0.00352946 -668.066765 -667.697794 -667.4662982 3.3 0.00427832 -668.066178 -667.697696 -667.4670901 3.4 0.00509128 -668.065565 -667.697503 -667.4676888 3.5 0.0057746 -668.064955 -667.697202 -667.4679503 3.6 0.00649304 -668.064342 -667.696969 -667.46811 3.7 0.0072316 -668.063801 -667.696704 -667.4679991 3.8 0.00796742 -668.063292 -667.696434 -667.4675859 3.9 0.00854573 -668.062826 -667.696172 -667.4671374 4 -668.062396 -667.695921 -667.4669737 4.1 4.2 4.3 4.4 4.5 Bond Length B3LYP5 /6-31G* B3PW91 /6-31G* B86 /6-31G* B88 /6-31G* 1 1.1 1.2 1.3 -667.123667 -667.141005 -665.461025 -665.341079 1.4 -667.265939 -667.283104 -665.604969 -665.484531 1.5 -667.358 -667.374927 -665.698931 -665.578151 1.6 -667.41638 -667.433165 -665.759267 -665.638181 1.7 -667.452567 -667.469181 -665.797394 
-665.676123 1.8 -667.474353 -667.490759 -665.821094 -665.699703 1.9 -667.487051 -667.503182 -665.835656 -665.714183 2 -667.494248 -667.510034 -665.84467 -665.723141 2.1 -667.498289 -667.513652 -665.850455 -665.728881 2.2 -667.50066 -667.515538 -665.85447 -665.732846 2.3 -667.50226 -667.516606 -665.857605 -665.735916 2.4 -667.503552 -667.517365 -665.860319 -665.738552 2.5 -667.504727 -667.518016 -665.862806 -665.740937 2.6 -667.505796 -667.51862 -665.865109 -665.743138 2.7 -667.506733 -667.519161 -665.867203 -665.745135 2.8 -667.507498 -667.5196 -665.86905 -665.746897 2.9 -667.508068 -667.51993 -665.870579 -665.748407 3 -667.508436 -667.520144 -665.871865 -665.749607 3.1 -667.508619 -667.520247 -665.872868 -665.750609 3.2 -667.508648 -667.520235 -665.873619 -665.751421 3.3 -667.508557 -667.520144 -665.874186 -665.751948 3.4 -667.508369 -667.519998 -665.874556 -665.752416 3.5 -667.508076 -667.519816 -665.874821 -665.752755 3.6 -667.507848 -667.519532 -665.874949 -665.752982 3.7 -667.507587 -667.519332 -665.874979 -665.753007 3.8 -667.507319 -667.519135 -665.875052 -665.75321 3.9 -667.50706 -667.519005 -665.875079 -665.753251 4 -667.506812 -667.51883 -665.875005 -665.753255 4.1 4.2 4.3 4.4 4.5 Bond Length BMK /6-31G* EDF1 /6-31G* EDF2 /6-31G* HCTH /6-31G* 1 1.1 1.2 1.3 -667.07199 -667.389785 -667.195667 -667.25688 1.4 -667.216951 -667.529443 -667.336185 -667.396866 1.5 -667.310467 -667.619625 -667.426885 -667.487371 1.6 -667.370264 -667.676685 -667.484148 -667.544755 1.7 -667.407621 -667.711934 -667.519486 -667.580341 1.8 -667.430749 -667.733078 -667.540617 -667.601763 1.9 -667.444215 -667.745308 -667.552791 -667.614208 2 -667.451901 -667.752158 -667.559568 -667.621212 2.1 -667.455893 -667.755888 -667.563241 -667.625066 2.2 -667.457991 -667.757938 -667.565276 -667.627239 2.3 -667.459171 -667.759199 -667.56656 -667.628639 2.4 -667.459997 -667.760134 -667.567532 -667.629758 2.5 -667.460728 -667.76097 -667.568376 -667.630818 2.6 -667.461246 -667.761775 -667.569139 
-667.631872 2.7 -667.462141 -667.762546 -667.569786 -667.632909 2.8 -667.462883 -667.763258 -667.570282 -667.633888 2.9 -667.463273 -667.76389 -667.570608 -667.634765 3 -667.463634 -667.764423 -667.570759 -667.635511 3.1 -667.46381 -667.764856 -667.570764 -667.636132 3.2 -667.463746 -667.765184 -667.570627 -667.636611 3.3 -667.463567 -667.765428 -667.570402 -667.636828 3.4 -667.46326 -667.765501 -667.570102 -667.637108 3.5 -667.462743 -667.765607 -667.569709 -667.637292 3.6 -667.462453 -667.765746 -667.569407 -667.637401 3.7 -667.461976 -667.765755 -667.569082 -667.637442 3.8 -667.461693 -667.765842 -667.568762 -667.637431 3.9 -667.461615 -667.765789 -667.568457 -667.637389 4 -667.4614 -667.765788 -667.568171 -667.637328 4.1 4.2 4.3 4.4 4.5 Bond Length MP2 /6-31G* M05 /6-31G* M052X /6-31G*M052X-D3(0) /6-31G* 1 1.1 1.2 1.3 -665.970893 -667.20852 -667.251261 -666.0012228 1.4 -666.117536 -667.35219 -667.395463 -666.4736778 1.5 -666.212522 -667.445878 -667.489609 -666.7975828 1.6 -666.272679 -667.505268 -667.549306 -667.0145356 1.7 -666.309638 -667.542152 -667.586288 -667.1575542 1.8 -666.33144 -667.564019 -667.608459 -667.2501357 1.9 -666.343617 -667.576228 -667.621317 -667.308913 2 -666.349952 -667.582606 -667.628385 -667.3450351 2.1 -666.352987 -667.585901 -667.632167 -667.3668206 2.2 -666.354392 -667.587718 -667.634321 -667.3795754 2.3 -666.355208 -667.589183 -667.635724 -667.3871547 2.4 -666.355975 -667.590458 -667.636813 -667.3916681 2.5 -666.356869 -667.591678 -667.637779 -667.3944534 2.6 -666.357841 -667.592824 -667.638634 -667.3965493 2.7 -666.35876 -667.593719 -667.639319 -667.3983829 2.8 -666.359522 -667.594585 -667.639798 -667.4001468 2.9 -666.36006 -667.595349 -667.640087 -667.4018418 3 -666.360358 -667.595531 -667.640076 -667.4032827 3.1 -666.360436 -667.595755 -667.639852 -667.4044119 3.2 -666.360331 -667.595786 -667.63941 -667.4051012 3.3 -666.360083 -667.595821 -667.638971 -667.4056441 3.4 -666.35974 -667.595587 -667.638525 -667.4060828 3.5 -666.359333 
-667.595235 -667.638043 -667.4062653 3.6 -666.358868 -667.594821 -667.63756 -667.406292 3.7 -666.358429 -667.594407 -667.637121 -667.4060651 3.8 -666.357994 -667.59398 -667.636687 -667.4056319 3.9 -666.357576 -667.593524 -667.636269 -667.4051684 4 -666.357185 -667.593125 -667.635927 -667.4048927 4.1 4.2 4.3 4.4 4.5 Bond Length M06 /6-31G* M062X /6-31G* M062X-D3(0) /6-31G* M06HF /6-31G* 1 1.1 1.2 1.3 -667.177573 -667.148967 -665.8987898 -667.150277 1.4 -667.320459 -667.292939 -666.3710078 -667.292313 1.5 -667.412817 -667.386583 -666.6944038 -667.385798 1.6 -667.471319 -667.446328 -666.9113966 -667.445938 1.7 -667.507698 -667.483345 -667.0544392 -667.483274 1.8 -667.529443 -667.505662 -667.1471577 -667.505921 1.9 -667.542093 -667.518921 -667.206328 -667.520034 2 -667.549269 -667.526461 -667.2429171 -667.528186 2.1 -667.553434 -667.530674 -667.2651346 -667.532803 2.2 -667.555995 -667.533089 -667.2781544 -667.535433 2.3 -667.557977 -667.534723 -667.2859707 -667.536795 2.4 -667.55958 -667.535962 -667.2906431 -667.537456 2.5 -667.560956 -667.53702 -667.2935314 -667.537697 2.6 -667.562109 -667.538012 -667.2957763 -667.537803 2.7 -667.562959 -667.538936 -667.2978589 -667.537842 2.8 -667.56365 -667.539665 -667.2998818 -667.537784 2.9 -667.564134 -667.5401 -667.3017308 -667.537309 3 -667.564425 -667.54022 -667.3033087 -667.536847 3.1 -667.564504 -667.540064 -667.3045129 -667.536328 3.2 -667.564357 -667.539715 -667.3052992 -667.535374 3.3 -667.563978 -667.539238 -667.3058081 -667.534715 3.4 -667.563521 -667.538654 -667.3061078 -667.534044 3.5 -667.563081 -667.537981 -667.3060953 -667.533357 3.6 -667.56245 -667.53728 -667.305899 -667.532795 3.7 -667.561957 -667.536741 -667.3055751 -667.532174 3.8 -667.561419 -667.536282 -667.3051219 -667.531681 3.9 -667.561133 -667.535936 -667.3047344 -667.53126 4 -667.560762 -667.535637 -667.3045047 -667.530944 4.1 4.2 4.3 4.4 4.5 Bond Length M06L /6-31G* PBE /6-31G* PBE0 /6-31G* PBE0-D3(0) /6-31G* 1 1.1 1.2 1.3 -667.256893 -665.026844 
-666.844761 -665.5965108 1.4 -667.399947 -665.170167 -666.987417 -666.0674258 1.5 -667.491832 -665.263607 -667.079639 -666.3894278 1.6 -667.549874 -665.323508 -667.138121 -666.6051986 1.7 -667.585346 -665.361281 -667.174223 -666.7473752 1.8 -667.606249 -665.384685 -667.19579 -666.8393857 1.9 -667.618004 -665.398997 -667.208115 -666.897641 2 -667.624439 -665.407788 -667.214815 -666.9333851 2.1 -667.627828 -665.413374 -667.218242 -666.9547916 2.2 -667.629624 -665.417195 -667.219923 -666.9670514 2.3 -667.630937 -665.420135 -667.220796 -666.9740847 2.4 -667.632131 -665.422653 -667.221362 -666.9780721 2.5 -667.633426 -665.424927 -667.221837 -666.9803624 2.6 -667.634676 -665.427014 -667.222283 -666.9820423 2.7 -667.635931 -665.428889 -667.222673 -666.9835619 2.8 -667.637004 -665.430518 -667.222967 -666.9851068 2.9 -667.637755 -665.431836 -667.223147 -666.9866448 3 -667.638312 -665.432921 -667.223209 -666.9881017 3.1 -667.638688 -665.433738 -667.223165 -666.9893519 3.2 -667.638892 -665.434373 -667.223009 -666.9902712 3.3 -667.638874 -665.434625 -667.222761 -666.9909561 3.4 -667.638558 -665.434978 -667.222462 -666.9914898 3.5 -667.638021 -665.435121 -667.22213 -666.9917913 3.6 -667.637455 -665.435151 -667.221706 -666.991817 3.7 -667.636974 -665.435168 -667.221355 -666.9916141 3.8 -667.636624 -665.434798 -667.221044 -666.9912289 3.9 -667.636408 -665.43476 -667.22073 -666.9907644 4 -667.636239 -665.434609 -667.220434 -666.9905617 4.1 4.2 4.3 4.4 4.5 Bond Length PW91 /6-31G* wB97X/6-31G* 1 1.1 1.2 1.3 -665.245422 -667.214416 1.4 -665.388304 -667.358295 1.5 -665.481343 -667.451552 1.6 -665.540905 -667.510612 1.7 -665.578413 -667.546999 1.8 -665.601612 -667.568609 1.9 -665.615766 -667.580845 2 -665.624443 -667.587432 2.1 -665.629933 -667.590725 2.2 -665.633685 -667.592367 2.3 -665.636562 -667.593269 2.4 -665.639012 -667.594023 2.5 -665.641224 -667.59481 2.6 -665.643249 -667.595666 2.7 -665.645064 -667.5965 2.8 -665.646638 -667.597197 2.9 -665.64795 -667.597624 3 -665.648997 
-667.597785 3.1 -665.649769 -667.597781 3.2 -665.650366 -667.597594 3.3 -665.650747 -667.597282 3.4 -665.650963 -667.59687 3.5 -665.651051 -667.596385 3.6 -665.650846 -667.595886 3.7 -665.650771 -667.595402 3.8 -665.65073 -667.594949 3.9 -665.650568 -667.594542 4 -665.650402 -667.594161 4.1 4.2 4.3 4.4 4.5

Optimized Geometries of the BCl3PH3 molecule

PBE1PBE/aug-cc-pVTZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.000000     0.000000    -0.923432
     2       15        0     0.000000     0.000000     2.807861
     3       17        0     0.000000     1.740629    -0.945578
     4       17        0     1.507429    -0.870315    -0.945578
     5       17        0    -1.507429    -0.870315    -0.945578
     6        1        0    -1.036056     0.598167     3.574571
     7        1        0     1.036056     0.598167     3.574571
     8        1        0     0.000000    -1.196334     3.574571

wB97XD/aug-cc-pVTZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.900259     0.000498     0.015926
     2       15        0    -2.745645    -0.001606    -0.066107
     3       17        0     0.979575    -1.383301    -1.045656
     4       17        0     0.840043    -0.229891     1.746552
     5       17        0     0.955653     1.614749    -0.647171
     6        1        0    -3.601036     0.210907    -1.175025
     7        1        0    -3.431657     0.913014     0.770102
     8        1        0    -3.463525    -1.128809     0.403588

wB97XD/aug-cc-pVQZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.000000     0.000000    -0.902390
     2       15        0     0.000000     0.000000     2.751644
     3       17        0     0.000000     1.744362    -0.927073
     4       17        0     1.510662    -0.872181    -0.927073
     5       17        0    -1.510662    -0.872181    -0.927073
     6        1        0    -1.035795     0.598016     3.506002
     7        1        0     1.035795     0.598016     3.506002
     8        1        0     0.000000    -1.196033     3.506002

M062X/aug-cc-pVTZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.843966    -0.000114     0.021099
     2       15        0    -2.591417     0.001064    -0.086570
     3       17        0     0.942623    -1.383990    -1.032169
     4       17        0     0.765882    -0.225629     1.748211
     5       17        0     0.919718     1.608844    -0.642698
     6        1        0    -3.449346     0.237063    -1.184002
     7        1        0    -3.273017     0.893831     0.770453
     8        1        0    -3.305994    -1.133101     0.359759

MP2/aug-cc-pVDZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.000000     0.000000    -0.852433
     2       15        0     0.000000     0.000000     2.619386
     3       17        0     0.000000     1.756573    -0.885585
     4       17        0     1.521237    -0.878286    -0.885585
     5       17        0    -1.521237    -0.878286    -0.885585
     6        1        0    -1.045041     0.603355     3.378746
     7        1        0     1.045041     0.603355     3.378746
     8        1        0     0.000000    -1.206709     3.378746

MP2/aug-cc-pVTZ

Center   Atomic   Atomic              Coordinates (Angstroms)
Number   Number     Type            X            Y            Z
---------------------------------------------------------------
     1        5        0     0.000000     0.000000    -0.836173
     2       15        0     0.000000     0.000000     2.574829
     3       17        0     0.000000     1.741883    -0.871086
     4       17        0     1.508515    -0.870941    -0.871086
     5       17        0    -1.508515    -0.870941    -0.871086
     6        1        0    -1.034287     0.597146     3.327941
     7        1        0     1.034287     0.597146     3.327941
     8        1        0     0.000000    -1.194292     3.327941
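The B-P separation implied by a tabulated geometry can be recovered directly from the Cartesian coordinates. The short sketch below does this for the MP2/aug-cc-pVTZ structure above, taking center 1 (atomic number 5) as boron and center 2 (atomic number 15) as phosphorus. The helper name distance and the variable names are illustrative assumptions and do not come from the thesis scripts.

```python
import numpy as np

# MP2/aug-cc-pVTZ Cartesian coordinates (Angstroms) copied from the table above
boron = np.array([0.000000, 0.000000, -0.836173])       # center 1, Z = 5
phosphorus = np.array([0.000000, 0.000000, 2.574829])   # center 2, Z = 15


def distance(a, b):
    """Euclidean distance between two points given as coordinate triples."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


bp_distance = distance(boron, phosphorus)
print("B-P distance: %.6f Angstroms" % bp_distance)
```

Because both atoms lie on the z axis in this orientation, the result is simply the difference of their z coordinates; the same helper applies unchanged to the geometries that are not axis-aligned.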