Seminar Thesis in the Degree Program Scientific Programming, Module Code 95200

Automated Generation of Input Data for Machine-Learning-Based Predictions of Ni(I) Dimer Formation

Nina Löseke, Matriculation Number 3147366

17 December 2019

Supervised by Prof. Dr. rer. nat. Hans Joachim Pflug and Jannis Klinkenberg, M.Sc., IT Center of RWTH Aachen University

Declaration

I hereby declare that I have written this seminar thesis on the topic 'Automated Generation of Input Data for Machine-Learning-Based Predictions of Ni(I) Dimer Formation' independently and have used no sources or aids other than those indicated, that all passages taken literally or in substance from other works are marked as such, and that this thesis has not previously been part of any coursework or examination in the same or a similar form. I commit to keeping a copy of this seminar thesis for five years and to handing it over to the examination office of the Faculty of Medical Engineering and Technomathematics upon request.

Place, Date        Signature

Abstract

Ni(I) dimers are useful for selective synthesis and catalyze complex reactions. Generally, dimeric catalysts offer a wide range of applications in selective synthesis, which is why there is a high demand for finding similar catalysts. An experimental approach for testing a large number of potential candidates to form this specific type of catalyst is not feasible because it is time-consuming and difficult to study systematically. Therefore, the researchers of the Schoenebeck Group at the RWTH Institute of Organic Chemistry hope to explore suitable ligands with the aid of machine learning. Ligands are ions or molecules attached to a metal atom that, in this case, need to stabilize a dinuclear Ni(I)–Ni(I) core. Previously, the workflow for creating the data from which ML features are extracted was only partially implemented and required a lot of manual interaction, which made it prone to errors. This seminar thesis focuses on developing a fully automated, Python-based framework for generating ML input data to identify ligands that form Ni(I) dimers. The framework adapts the previous workflow in a more efficient way that allows for automatic error detection and requires little to no user interference. Instead of experiments, the training data set is generated in silico by applying DFT calculations, carried out with the external program Gaussian, to a library of structures, the so-called species that the ligands and nickel can form. DFT stands for 'density functional theory', a method for quantum mechanical modeling. This approach results in an input data set large enough to provide innovative insights into identifying novel, reactive Ni(I) dimers through ML. This is more difficult to achieve with purely experimental data, which is often limited in size or incidentally biased; these are some of the main challenges machine learning still faces in the field of chemistry. For my bachelor's thesis, I plan to investigate different methods and techniques of unsupervised machine learning and analyze their effectiveness for the data set generated by the constructed framework.

Contents

1 Introduction
  1.1 Motivation
  1.2 Focus and Structure of This Thesis

2 Important Tools and Chemical Background
  2.1 Open Babel
  2.2 xtb and CREST
  2.3 Gaussian

3 Status Quo

4 Implementation of the Framework
  4.1 Molecule Assembly
  4.2 xtb/CREST and Gaussian
  4.3 Error Checking for Gaussian Logs
  4.4 Extraction of Descriptors

5 Production Run

6 Summary and Outlook

List of Abbreviations

List of Figures

Listings

List of Tables

References

1 Introduction

1.1 Motivation

The use of machine learning in the field of chemistry is increasing in popularity [1], since it can help predict reaction outcomes [13, 12, 3] or molecular characteristics [14]. This is helpful for exploring new, suitable structures, e.g. when searching for novel catalysts, without having to test each possible candidate through time-consuming and expensive experiments in the laboratory. Machine learning can also help discern certain patterns for finding new candidates, e.g. the most important molecular properties that characterize suitable structures for catalyst formation.

Current data sets for ML input, however, are largely based on experimental data, which is limited in size due to the required effort and might contain inaccuracies. Moreover, certain molecular descriptors (so-called 'features' for machine learning) are often hand-picked, so that bias is introduced. As a result, the training and validation data sets are often similar. This reduces the amount of new knowledge to be gained from machine learning. Computational methods help combat this challenge. DFT calculations in particular allow much larger data sets, from which features can be extracted, to be generated in silico, e.g. via Gaussian¹. DFT, short for 'density functional theory', is a method for quantum mechanical modeling that helps optimize and collect data on large sets of molecular structures purely through computation, without experimental effort. These molecular characteristics can in turn provide a training data set that is large enough to yield more innovative results.

One of the use cases for this is the dinuclear Ni(I) catalyst. Ni(I) dimers are important for selective synthesis and support complex reactions, like the highly selective isomerization of terminal olefins [7]. There is a high demand for finding similar dimeric catalysts [11, 8], especially a chiral version of the Ni(I) dimer, which could be applied in many ways in enantioselective catalysis [2]. However, potential 'species' to form these types of dimers cannot be identified strategically through experiments because this is too time-consuming. Apart from that, the exact correlations that make a species suitable are unknown due to the complex factors that influence the formation of a specific catalyst. Species are groups of nickel and ligands, which are ions or molecules attached to a metal atom and, in this case, stabilize a dinuclear Ni(I)–Ni(I) core. Therefore, the long-term goal of the research conducted by the Schoenebeck Group at the Institute of Organic Chemistry at RWTH Aachen University is to analyze ML-based methods for determining ligands that favor the formation of a reactive Ni(I) dimer.

¹ https://gaussian.com/g16main/


1.2 Focus and Structure of This Thesis

In this seminar thesis, I develop a fully automated, Python-based framework that applies DFT calculations to a library of structures based on provided ligands and generates ML input data from the results. This framework replaces the previous, non-automated, partial implementation of the steps needed to acquire the ML features. To make the following chapters easier to follow, figure 1.1 depicts a top-level overview of the workflow, which consists of five main steps.

Figure 1.1: Overview of Workflow. The five main steps are: assemble molecules, optional xtb/CREST optimization, Gaussian calculation, check output for errors (repeated if the error can be solved), and extract descriptors from the output.

First, the molecules each ligand can form need to be assembled programmatically from the ligand library in order to obtain a structure library of molecular data (see figure 3.1 for details). As an optional step, the molecular structure can be pre-optimized using the external programs xtb² and CREST³, which is helpful for some structures. The subsequent Gaussian run most importantly optimizes the total energy of the molecule and outputs an optimized arrangement of the atoms. Some errors arising from Gaussian can be resolved by repeating the calculation with a few adaptations (see chapter 4.3); otherwise, a manual check and evaluation might be required in case of more complex errors. To avoid bias in the resulting ML data, a wide range of molecular properties and energies is either extracted directly from the Gaussian results or computed with additional external applications. These so-called descriptors constitute the features that can serve as input data for various supervised and unsupervised ML algorithms.

In the first part of this thesis, I give an overview of the most important software and tools used for this project and of the chemical background of the xtb/CREST optimization and the DFT calculations. Then, I discuss how the previous partial implementation of the workflow can be improved. Based on this, I outline the requirements for the fully automated framework and describe concrete details of its implementation and how to use it. Finally, I present data on a 'production run' of this framework and summarize my work.

² https://github.com/grimme-lab/xtb/
³ https://xtb-docs.readthedocs.io/en/latest/crest.html

2 Important Tools and Chemical Background

This chapter not only looks at the software and tools involved in this project but also explains the most important chemical details of xtb/CREST and the DFT calculations in sections 2.2 and 2.3. The calculations performed by xtb/CREST and Gaussian are very compute-intensive and optimized to run on high-performance computer architectures, in this case the RWTH Compute Cluster¹. To utilize these compute resources, the programs can be run as batch jobs by submitting batch scripts via the cluster's workload manager SLURM². SLURM first queues all incoming jobs that users submit and then assigns resources in terms of CPUs, memory and runtime to the jobs waiting to run on the cluster.
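As an illustration of how such batch jobs can be driven from Python, the following minimal sketch submits a batch script with sbatch and parses the job ID that SLURM reports on success. The script name xbatch.sh and the helper submit_job are placeholders for illustration; this is not the framework's actual submission code.

import re
import subprocess

def submit_job(script_path):
    """Submit a SLURM batch script and return the assigned job ID."""
    # On success, sbatch prints a line like "Submitted batch job 1234567".
    result = subprocess.run(["sbatch", script_path],
                            capture_output=True, text=True, check=True)
    match = re.search(r"Submitted batch job (\d+)", result.stdout)
    return int(match.group(1)) if match else None

job_id = submit_job("xbatch.sh")  # hypothetical xtb/CREST batch script
print(f"queued as job {job_id}")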

2.1 Open Babel

Open Babel³ is a chemical software toolbox originally written in C++ that provides functionality for converting between different file formats for molecular data. The library also includes a Python interface with the same API for the essential functions, which I utilize for this seminar thesis. The automated framework requires at least version 2.3, since this version introduces new classes for substructure search. Through substructure search, Open Babel can find the atom indices of certain bonds or substructures in a molecule based on defined patterns, which is relevant to the ML feature extraction explained in chapter 4.4.

2.2 xtb and CREST

CREST and xtb are OpenMP-parallelized programs used for simulating various molecular conformations [6]. OpenMP, or 'Open Multi-Processing', is an API that provides directives for parallelizing C/C++ or Fortran programs on shared-memory architectures via multi-threading [10]. This increases efficiency and decreases the total runtime. The calculations performed by xtb and then CREST are molecule-specific: xtb minimizes the energy of the molecule locally, and CREST subsequently conducts a global geometry optimization based on this result, which serves the same purpose. This computation is particularly relevant for the larger metal complexes that are part of the structure library, since they have more degrees of freedom. For some structures, a pre-optimization reduces the computational effort of the subsequent Gaussian calculations. The xtb version used for this framework is 6.2.1.

¹ https://doc.itc.rwth-aachen.de/display/CC/Home
² https://doc.itc.rwth-aachen.de/display/CC/Using+the+SLURM+Batch+System
³ http://openbabel.org/dev-api/changes23.shtml

xtb and Gaussian (see section 2.3) work in very similar ways. As opposed to Gaussian, however, xtb only provides a single method for optimizing the molecular structure, which is less precise and therefore less compute-intensive and time-consuming. Starting from the atomic arrangement generated during the molecule assembly (compare figure 1.1), xtb performs a local geometry optimization resulting in an improved atomic arrangement. Using this new conformation, CREST further minimizes the total energy of the structure through a global geometry optimization.
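To make this two-stage procedure concrete, the following Python sketch runs both tools in sequence for a single structure via subprocess. It is a simplification under stated assumptions: the real batch jobs also pass the .xoptcontrol/.xcontrol files and parallelization settings, which are omitted here, and start.xyz is a hypothetical input name; xtbopt.xyz and crest_best.xyz are the output names referred to elsewhere in this thesis.

import subprocess

# Local geometry optimization with xtb; the optimized geometry is written to xtbopt.xyz.
subprocess.run(["xtb", "start.xyz", "--opt"], check=True)

# Global conformer search/optimization with CREST, starting from the xtb result;
# the best conformer is written to crest_best.xyz.
subprocess.run(["crest", "xtbopt.xyz"], check=True)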

2.3 Gaussian

Gaussian is an extensive program package used by many computational chemists. It helps predict molecular characteristics, and its DFT calculations scale well. Similar to xtb and CREST, Gaussian employs OpenMP for parallelization and is efficient in localizing stationary points, which makes it suitable for this project. For each molecule, the Gaussian run consists of several calculations. This framework currently utilizes version 16 A.03.

Gaussian offers different types of optimizations and calculations [4]. The main goal for this project is to find an optimal atomic arrangement that minimizes the total energy of the molecule. With this optimal molecular conformation, Gaussian then performs additional calculations to determine certain molecular descriptors that are later used as ML features (see figure 1.1). The basic Gaussian calculation, from which all others are derived, is the 'single-point energy calculation'. It evaluates the wave function Ψ(r;R) and returns the total energy of the molecule, E(R), with R representing the coordinates of all atomic cores of the molecule. This problem is solved via an eigenvalue equation, the 'time-independent Schrödinger equation':

Ĥ · Ψ(r;R) = E(R) · Ψ(r;R)    (2.1)

Ĥ is the 'Hamiltonian', which can be described as an energy-related operator in quantum mechanics. In this project, Gaussian needs a starting point for R, which can be either the output generated by the molecule assembly or the xtb/CREST output. This initial atomic arrangement is modified with a local geometry optimization. Again, the purpose of this is to minimize the energy of the molecule by computing the minimum of the 'potential energy surface' E(R). With the pre-optimization done by xtb/CREST for some structures, better atomic arrangements with a smaller total energy can be found. Next, Gaussian conducts a 'vibrational analysis' to ensure that the optimized arrangement R* really is a minimum. With this roughly optimized data R*, Gaussian then executes an additional single-point energy calculation, which yields a much higher precision for E(R*) than the previous steps and is therefore more costly in terms of runtime. The remaining calculations are done solely to determine a number of molecular descriptors that are significant for the ML input data. Among these descriptors are bond lengths or angles between two or three atoms within the molecule.

3 Status Quo

The status quo, i.e. the previous partial implementation of the workflow (see figure 1.1), consists of a Python script to assemble the molecules and of templates for the batch scripts for xtb, CREST and Gaussian and their respective input files. A small library of ligands, each of which can form about 15 structures, is also provided. Ligands are ions or molecules attached to a metal atom. In this case, they need to stabilize the dinuclear Ni(I)–Ni(I) core of the dimeric structures that might qualify as novel catalysts. Based on this previous set-up of the workflow, this chapter analyzes opportunities for improvement and the requirements for the fully automated framework that stem from them.

Regarding the ligand library, all molecules a ligand can form are split into 'cores' and 'substituents'. This split is independent of their actual chemical fragments, as shown in figure 3.1. The idea is that a single ligand can generate several structures, depending on the kinds of substituents connected to the core.

Figure 3.1: Fragments of a Single Structure (Courtesy of Schoenebeck Group). Implementation-wise, the structure is split into a 'core', which contains the dimeric Ni(I)–Ni(I) core, and one or several substituents, as pictured on the left-hand side. The actual chemical components shown on the right are different from this fragmentation, but form the same structure.

The cores are sorted into two core groups: unsaturated and saturated cores. The reason for this classification is that this project focuses solely on ligands that are N-heterocyclic carbenes (NHCs), which are divided into a saturated and an unsaturated sub-class. Other classes of ligands can also form Ni(I) dimers, but they are unable to catalyze reactions like the isomerization of terminal olefins [7].

The molecular data needed to assemble the structures is provided within two directories, cores and substituents. For each core or substituent, there is a .mol file and a .yml file. The .mol data can be exported from graphical tools for editing molecules, such as XDrawChem¹. The .yml files specify which labels to assign to which atom indices in the molecule (Ri and Xi in figure 3.1). For the cores, the labels are Ri; for the substituents, they are Xi. The resulting labeled molecules are saved as .mol2 files. To achieve this, the Python class MolAssembler has been designed to read a .mol file via Open Babel, set the atom labels according to the .yml specification and generate .mol2 output for the core or substituent. With the status quo of the workflow, this process is done for one core or substituent at a time, which is time-consuming and should be optimized in the automated framework.

The labels are needed in order to assemble the actual structures that are going to be tested. The Python script created for this purpose only assembles one molecule at a time by loading the .mol2 file for one core via Open Babel, loading the respective substituents and connecting the labeled atoms. Where to connect each core and its substituents for the different structures is hard-coded in the Python script of the previous workflow and not documented anywhere. However, the way the cores are connected to the substituents is the same for all cores of the same set, e.g. the unsaturated core group, so this could be generalized and easily automated for the improved framework. Listing 3.1 demonstrates that the previous implementation requires the user to have prior knowledge of how to assemble the structures. This can be solved by finding a convenient format for defining a so-called 'recipe', which is then provided for all users. A recipe defines where to connect which substituents to each core of a single core group, so that an entire set of structures can be generated from one recipe. Moreover, the hard-coded paths for loading the input files need to be adapted for every single structure, which is inefficient and lacks flexibility. What is missing here is an automated way to create all desired structures at once without having to specify every single path for each structure. This could be realized by constructing a suitable directory structure. The fully assembled molecule is then written to a .mol2 and an .xyz output file. The xyz data, which defines the coordinates of every atom in a molecule, is needed to create the input for the Gaussian calculations.

The optional xtb/CREST optimization (see figure 1.1) requires two input files, .xoptcontrol for xtb and .xcontrol for CREST. The pre-optimization done by xtb is subsequently utilized by CREST; however, both input files can be created in advance. For these two input files, a template with placeholders for the data that differs from structure to structure already exists (see listing 3.2 for the .xcontrol template). In the status quo, these placeholders are replaced manually for every xtb/CREST run, which makes the previous workflow more susceptible to errors.

¹ http://www.woodsidelabs.com/chemistry/xdrawchem.php


import openbabel
import MolAssembler

obconv = openbabel.OBConversion()
obconv.SetInFormat('mol2')
core_A = openbabel.OBMol()
obconv.ReadFile(core_A, '../cores/unsat/A.mol2')
XMes = openbabel.OBMol()
obconv.ReadFile(XMes, '../substituents/XMes.mol2')
XMe = openbabel.OBMol()
obconv.ReadFile(XMe, '../substituents/XMe.mol2')
output = MolAssembler.Join(core_A, XMes, [('R1', 'X1')])
output = MolAssembler.Join(output, XMe, [('R2', 'X1')])
output = MolAssembler.Join(output, XMe, [('R3', 'X1')])
output = MolAssembler.Join(output, XMes, [('R4', 'X1')])
output = MolAssembler.Preoptimize(output)

Listing 3.1: Snippet of Python Script for Assembling Molecules

$chrg {CHARGE}
$spin {SPIN}
$fix
  atoms: {FIXED_ATOMS}
$end
$constrain
  atoms: {FIXED_ATOMS}
  force constant=1.0
  reference=xtbopt.xyz
$end
$metadyn
  atoms: {RMSD_ATOMS}
$end

Listing 3.2: Input Template for .xcontrol

Using a template is a crucial step towards more automation, since only a few lines of the input files have to be adapted for each structure. However, the placeholders have previously been set manually: the values 'charge' and 'spin' of the molecule are either known or determined with the help of tools like XDrawChem. The atom indices for {FIXED_ATOMS} can also be taken from graphical tools because it is easy to find atoms with fixed positions that way. Concerning the automation, the more flexible way that does not involve graphical tools is to use a function already implemented by MolAssembler that returns a list of indices for all fixed atoms. For {RMSD_ATOMS}, all other, non-fixed atom indices are listed. Instead of finding these with the help of external tools as before, their indices can be derived from the fixed atom indices if the total number of atoms is known; the Open Babel library provides a function to get the total number of atoms in the molecule. Another aspect that needs to be automated regarding this template is that the lists of atom indices need to be shortened, so that consecutive indices are not listed individually but abbreviated with a hyphen (see listing 3.3). Otherwise, the program might not be able to parse the input file properly.

# atom indices returned by MolAssembler
[1, 3, 4, 5, 6, 7, 10, 11, 12, 14]
# string for FIXED_ATOMS
1,3-7,10-12,14

Listing 3.3: List of Indices of Fixed Atoms
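The range compression and the placeholder substitution can both be expressed in a few lines of Python. The following sketch is illustrative only: the function names and the way the template text is filled are assumptions rather than the framework's actual implementation (the placeholders of listing 3.2 happen to match Python's str.format syntax).

def compress_indices(indices):
    """Turn [1, 3, 4, 5, 6, 7, 10, 11, 12, 14] into '1,3-7,10-12,14'."""
    indices = sorted(indices)
    parts = []
    start = prev = indices[0]
    for i in indices[1:] + [None]:          # None acts as a sentinel closing the last run
        if i is not None and i == prev + 1:
            prev = i
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = i
    return ",".join(parts)

def fill_xcontrol(template_text, charge, spin, fixed, rmsd):
    """Fill the placeholders of the .xcontrol template shown in listing 3.2."""
    return template_text.format(CHARGE=charge, SPIN=spin,
                                FIXED_ATOMS=compress_indices(fixed),
                                RMSD_ATOMS=compress_indices(rmsd))

print(compress_indices([1, 3, 4, 5, 6, 7, 10, 11, 12, 14]))  # prints 1,3-7,10-12,14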

Per molecule, the user can start the xtb/CREST optimization by submitting a batch job, for which the status quo of the implementation also provides a template. The output of the xtb/CREST optimization is a set of slightly altered .xyz coordinates. The only value that has to be adjusted for each batch script is the name of the batch job; the rest, e.g. the requested machines, time and memory, stays the same for each job. This could be improved by setting batch job names that include information about which structure or recipe is being optimized. With the previous implementation, the user has to make sure that an xtb/CREST run has finished before the subsequent Gaussian calculation can be started, i.e. these two steps depend on each other. Another important feature of the enhanced framework would be to automatically track which xtb/CREST jobs are currently running and which ones have finished, so that the user does not have to track them manually. This could be facilitated by designing the automated framework in such a way that it integrates these dependencies when managing and running the batch jobs. Similar dependencies have been realized programmatically before, e.g. as dependent tasks in OpenMP [5]. With OpenMP tasks, a new task can be generated from within another task that is currently being executed, which is an implicit dependency, or separate tasks have explicit dependencies that are managed by the task scheduler. In this case, however, this concept has to be transferred to SLURM batch jobs.

The xyz data, either generated from the labeled molecule or by optimizing the molecular structure with xtb/CREST, is then used for the Gaussian input file, a .com file. These files contain directions about the different calculations to perform, the number of processes to use, the name of the output file, and the charge, spin multiplicity and .xyz coordinates of the molecule. For the status quo, a template for such a .com file has already been defined (see listing 3.4). However, it has to be ensured that all values in the .com file that refer to the batch job (such as memory and number of cores) are set according to the ones used in the actual job script. This makes the template prone to errors, since this has previously been done manually. Another aspect is that the coordinates are copied and pasted from the .xyz file manually. It would make sense to automate this step in such a way that the xtb/CREST-optimized xyz data is used automatically, if it exists. The charge and spin multiplicity are the same for each core regardless of the substituents connected to it. These values previously had to be known by the user, but they can also be set via Open Babel. However, the methods GetTotalSpinMultiplicity and GetTotalCharge of the Open Babel class OBMol, which represents a molecular structure, are not guaranteed to work correctly and might set these values incorrectly. For the automated framework, there should therefore be an option to specify spin multiplicity and charge per core, so that no errors are introduced into the calculations by Open Babel.


%chk =
%NProc =
%Mem =
# B3LYP/genecp opt freq=noraman EmpiricalDispersion=GD3BJ

Title Card Required

{CHARGE} {MULTIPLICITY}
{XYZ_DATA}
...

Listing 3.4: Excerpt from Gaussian .com Template gauinp.com

The template for the Gaussian batch job (gbatch.sh) does not require much adjustment per molecule. Only the name of the job has to be changed; all other resource-related settings as well as the names of the input and output files are the same for each structure. After a Gaussian run has finished, the Gaussian output, a .log file, is checked for errors. These checks verify the 'normal termination' of each calculation and ensure that none of the 'frequencies' values are negative. This is tedious, since frequencies appear between 50 and 110 times in the output files for this project. Searching for these values in a non-automated fashion is time-consuming and could be accelerated with the use of regular expressions. With the previous implementation, the completion of Gaussian batch jobs also has to be tracked manually in order to check for errors afterwards. Another requirement for the automated framework is that the independent steps (compare figure 1.1) are executed in parallel for all structures at once, to speed up the DFT calculations for the entire structure library. What needs to be implemented from scratch is the extraction of the descriptors from the Gaussian logs. Furthermore, it would be helpful to flag those .log files that contain more complex errors for manual checking.

4 Implementation of the Framework

The fully automated framework realizes the potential improvements discussed in chapter 3 and comprises several directories. This structure clearly separates input and output files, templates and the Python scripts needed to run each step from each other. The modular design enables the user to run just a specific step independently or to run the entire workflow in an automated fashion up to the feature extraction. A more detailed illustration of the complete workflow is depicted in figure 4.1.

Figure 4.1: Scheme of Full Framework with File Types. The figure traces the data flow: cores and substituents (.mol files plus .yml label files) and recipes are combined into pre-optimized, labeled molecules (.mol2, .xyz); an optional xtb optimization feeds CREST, whose .xyz output (or the assembly output) is used to create the Gaussian .com input; the resulting .log is checked for errors, with a correction script for negative frequencies, re-runs for solvable errors and manual checking otherwise, before the feature values are extracted from the log.

The directory structure consists of these parts:

• cores: separated by core group
• substituents
• recipes
• scripts: Python scripts and global configuration file
• calculations: intermediate and final output files
• input_templates: templates for batch scripts and input files
• manual_check: created if necessary

The inputs that need to be defined before running the whole workflow are the cores, substituents and recipes. What remains the same is the specification of the labels via .yml files and the use of .mol files to provide the initial molecular data. However, for setting the spin and charge of each structure later on, there is now a way to define those values per core instead of relying on Open Babel to set them correctly. Spin and charge stay the same for all structures that share the same core, so the values can optionally be set in the .yml file of the core (see listing 4.1).

To indicate that it is helpful to run xtb/CREST optimizations before Gaussian for a structure, the .yml core file may contain the optional flag use_crest. If it is not set, false is assumed as the default.

atom_labels:
  R1: [16, 17]
  R2: [20, 21]
  R3: [19, 22]
  R4: [15, 18]
use_crest: True
spin_multiplicity: 1
total_charge: 0

Listing 4.1: .yml File for Unsaturated Core A
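For illustration, such a core .yml file could be read as follows; the file path, variable names and fall-back defaults are assumptions of this sketch rather than the framework's exact code.

import yaml  # PyYAML

with open("cores/unsat/A.yml") as f:
    core_cfg = yaml.safe_load(f)

atom_labels = core_cfg["atom_labels"]          # e.g. {'R1': [16, 17], ...}
use_crest = core_cfg.get("use_crest", False)   # defaults to False if the flag is absent
spin = core_cfg.get("spin_multiplicity")       # optional, may be None
charge = core_cfg.get("total_charge")          # optional, may be None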

The recipes are written in .yml format so that they can easily be converted into Python dictionaries for the molecule assembly. Listing 4.2 shows an example recipe. The first line specifies the core group (i.e. the subdirectory of cores) and the files to use for assembling the structures. This recipe is therefore valid for all unsaturated cores, and all labeled .mol2 files from the directory cores/unsat will be loaded. For each substituent, the file name and the labels for connecting the substituent to the core are given. In this example, two substituents of type 3 are connected to the core at positions R1 and R4.

cores: unsat/*.mol2
substituents:
  - file: 3.mol2
    connections:
      - [R1, X1]
  - file: 3.mol2
    connections:
      - [R4, X1]

Listing 4.2: Example Recipe
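A recipe of this form maps onto a whole set of structures. The sketch below shows one way this expansion could look in Python; the function name and the return format are illustrative assumptions, not the framework's actual code.

import glob
import os
import yaml

def expand_recipe(recipe_path, cores_dir):
    """Yield (core .mol2 file, substituent specification) pairs for one recipe."""
    with open(recipe_path) as f:
        recipe = yaml.safe_load(f)
    # 'cores' holds a glob pattern relative to the cores directory, e.g. 'unsat/*.mol2'.
    for core_file in sorted(glob.glob(os.path.join(cores_dir, recipe["cores"]))):
        yield core_file, recipe["substituents"]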

All intermediate files are kept in the calculations directory. These include the labeled .mol2 file for each structure, the input and output files for xtb/CREST (if needed), and the input and output files for Gaussian, including the job scripts. The subdirectories in calculations are sorted by recipe name and core. If a Gaussian calculation finishes successfully, all intermediate files are deleted except for the most recent log, which can then be passed on to the ML feature extraction. There is also a manual_check directory that is created if necessary. This is where subdirectories of calculations are moved if Gaussian fails due to a more complex error or if errors still persist even after re-running the Gaussian calculations twice. After manual inspection and adjustments, the files can easily be re-introduced into the workflow by copying them back to the corresponding subdirectory of calculations.


Furthermore, the directory input_templates contains the templates for the xtb/CREST and Gaussian jobs and for another external program called sambvca¹, which is used for the feature extraction, the last step of the workflow (see figure 4.1). Lastly, scripts is where all Python scripts for executing each workflow step are located, as well as the configuration file for the correct set-up of all paths, e.g. the path for the cores (see listing 4.3). The usage of xtb/CREST can be enabled or disabled globally by setting the use_crest flag in the configuration file to true or false.

¹ https://www.molnac.unisa.it/OMtools/sambvca2.1/help/help.html

# Insert the absolute path to the directory containing the scripts
working_dir: /home/gu620893/seminar_thesis_ioc/example/scripts/
# Set these paths relative to working_dir!
cores: ../cores
substituents: ../substituents
recipes: ../recipes
calculations: ../calculations
input_templates: ../input_templates
# Enable CREST usage by setting flag to True
use_crest: True

Listing 4.3: Example Configuration File config.yml

The progress of the entire workflow is tracked for each structure in a shared CSV file log.csv in the root of the directory structure (see listing 4.4). In case several processes try to append an event to the log for different structures at the same time, file locking in Python prevents data races. The CSV format was chosen in case the user wants to parse the data. For each structure, the current status in the workflow is logged along with a timestamp whenever the status changes. The different stages a molecule can reach during the workflow are pictured in figure 4.1. Finished, error-free Gaussian jobs are marked with _done in the calculations directory. The feature extraction is applied to all correct Gaussian logs at once.

timestamp,recipe,core,status
2019-10-24 13:45:02,19,H,assembled
2019-10-24 13:47:24,19,H,gaussian run 0
2019-10-25 06:38:57,19,H,gaussian check 0: rerun (FormBX error or # steps exceeded)
2019-10-25 06:38:57,19,H,gaussian run 1
2019-11-08 21:06:59,19,H,gaussian check 1: successful

Listing 4.4: Excerpt from log.csv of All the Events Logged for Structure H, Recipe 19
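One way to realize such race-free appends is an advisory lock around the write, e.g. via the fcntl module on Linux. The sketch below is an assumption about how this could look, not necessarily the framework's exact mechanism.

import csv
import fcntl
from datetime import datetime

def log_status(log_path, recipe, core, status):
    """Append one status line to log.csv while holding an exclusive lock."""
    with open(log_path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)          # blocks until the lock is free
        try:
            writer = csv.writer(f)
            writer.writerow([datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                             recipe, core, status])
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

log_status("log.csv", 19, "H", "assembled")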

The full workflow, up to the completion of the Gaussian calculations, can be run with the script run.py. As an input argument, it requires the name of the configuration file, e.g. --config config.yml. It automatically iterates over all recipes, assembles the molecules and starts the first Gaussian or xtb/CREST job (if enabled) per molecule. All consecutive calculations are triggered automatically by commands added to the end of the current batch script, which amounts to applying implicit dependencies for the different workflow steps as discussed in chapter 3 and illustrated in figure 4.3. All of these batch jobs are started successively with the correct dependencies but run independently from each other, so that independent steps for different structures can be executed in parallel. The configuration file is used to pass the correct paths on to the other scripts that run each step of the workflow, so that the correct calculations subdirectory etc. can be found. These scripts are responsible for the assembly of the structures, the input generation for and submission of xtb/CREST and Gaussian jobs, the error checking for the .log files and the feature extraction (see figure 4.1). The function of each script is explained in more detail in the following subchapters.

4.1 Molecule Assembly

The script assembler.py, along with the class MolAssembler, is responsible for the assembly of all structures. The assembler can be run on its own via command-line arguments without starting any further calculations (see listing 4.5). The configuration file provides the paths for the recipes, cores and substituents as well as the calculations directory to write the output to.

usage: assembler.py --config CONFIG

arguments:
  --config CONFIG    config file with relative paths etc.

Listing 4.5: Input Arguments for Assembler

The molecule assembly is composed of three steps: the preparation of the cores, the preparation of the substituents and the assembly of the molecules, which is illustrated in figure 4.2.

Figure 4.2: Scheme of Molecule Assembly (Courtesy of Schoenebeck Group)

First, the assembler extracts all core groups from the cores directory and the .mol files for each group. If a .mol file has no matching .yml file, an exception is raised. With the help of Open Babel, the .mol file for each core is loaded, the atom labels are read from the matching .yml file and the atoms are labeled by index with the help of MolAssembler. In case the .yml file for the core additionally specifies spin and charge, these values are set as properties of the OBMol object that represents the molecular structure. When serializing this OBMol object as a labeled .mol2 file, additional properties of the molecule are not serialized implicitly, i.e. the spin and charge attributes would be lost. Therefore, a so-called OBPairData structure is used to add a comment to the .mol2 output file. These values can later be read from the .mol2 file as a Python dictionary via Open Babel when they are needed. The OBPairData also specifies whether xtb/CREST usage is enabled for that core. An xtb/CREST optimization is only done for cores whose CREST flag is set to true in the .yml file, and only if xtb/CREST usage is enabled in the global configuration file as well.

The preparation of the substituents works the same way: all .mol and .yml files are read from the substituents directory, labeled accordingly and saved as .mol2 files.

For the last step, the assembler iterates over all recipes, parses them into Python dictionaries and loads the needed .mol2 files. In case several recipes use the same core group, the OBMol objects created from the .mol2 files are cached for later use, along with the substituents. For each recipe and each of its cores, a subdirectory is created in the calculations folder; for example, for recipe 1 and core A, the output files for the assembled molecule are written to path/to/calculations/1/A. MolAssembler already pre-optimizes the molecular structure and generates both .mol2 and .xyz output for the assembled molecule. All finished assemblies are appended to the log file log.csv in the root directory.
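The following sketch illustrates how such auxiliary data can be attached to an OBMol and read back with the Open Babel 2.x Python bindings. The file names and the attribute name spin_and_charge are assumptions of this sketch, and whether the .mol2 writer emits the pair data as a comment may depend on the Open Babel version.

import openbabel

mol = openbabel.OBMol()
conv = openbabel.OBConversion()
conv.SetInAndOutFormats("mol", "mol2")
conv.ReadFile(mol, "cores/unsat/A.mol")

# Attach charge/spin as generic key-value data that travels with the molecule object.
pair = openbabel.OBPairData()
pair.SetAttribute("spin_and_charge")                         # hypothetical attribute name
pair.SetValue("{'spin_multiplicity': 1, 'total_charge': 0}")
mol.CloneData(pair)                                          # OBMol stores a copy of the data

conv.WriteFile(mol, "A_labeled.mol2")

# Reading the values back later:
data = mol.GetData("spin_and_charge")
if data is not None:
    print(openbabel.toPairData(data).GetValue())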

4.2 xtb/CREST and Gaussian

The scripts for generating the xtb/CREST input and the Gaussian input work in similar ways. Both scripts have an optional command-line argument called --run, which indicates whether the batch scripts for a particular structure should also be submitted or only generated.

For xtb, the script generate_xtb_input.py requires several input arguments. The arguments root_dir, path_core_subdir and path_templates are set according to the paths specified in the configuration file. The script works on one structure at a time by creating the necessary input files in the corresponding subdirectory in calculations (set via the input arguments recipe and core). The two input files for the xtb/CREST batch job are .xcontrol for CREST and .xoptcontrol for xtb. For the automated framework, the placeholders in the already existing template files are filled in automatically. This means that the fixed atom indices are determined via MolAssembler and turned into a correctly formatted string. The xyz data is parsed from the .xyz output of the assembly and inserted into the templates. Charge and spin are set by converting the .mol2 files into OBMol objects and reading their OBPairData attribute, which specifies total charge and spin multiplicity. The spin is defined as

spin = (multiplicity − 1) / 2.    (4.1)

Another file generated by generate_xtb_input.py is the batch script for the xtb/CREST run, called xbatch.sh. The input template for the batch job is modified so that the name of the job is set to a combination of the recipe name and the name of the core. Other placeholders include the name of the SLURM log file and the working directory for xtb/CREST, which is defined as the subdirectory in calculations where all of the output for that specific structure is written. At the end of each xbatch.sh script, there is a new placeholder for Gaussian commands, which are executed after the xtb/CREST optimization has finished. These call the script generate_gaussian_input.py with the correct command-line arguments for that particular structure in order to generate the Gaussian input from the optimized xyz data. These additional commands take the dependencies among the batch jobs into account, so that a new batch job for the Gaussian calculation is automatically submitted from within the xtb/CREST job without any user interference (see figure 4.3).

Figure 4.3: Scheme of xtb/CREST and Gaussian Batch Jobs and Their Dependencies. run.py assembles the molecules and, per molecule, either starts an xtb/CREST batch job (which submits the Gaussian job using the optimized .xyz data once it finishes) or starts the Gaussian job directly. The log check re-submits the Gaussian job for negative frequencies or other reversible Gaussian errors as long as fewer than three runs have been performed; otherwise the files are moved for manual checking.

The script generate_gaussian_input.py works similarly to the xtb/CREST input generation. An important additional input argument for this script is --run_count, which starts from 0. This argument specifies the ID of the Gaussian run, which is relevant for the subsequent error checking. For the Gaussian jobs, the template for the .com files is set up automatically first. This is done by parsing the values for the memory and number of processes from the batch script template and inserting them into the .com files, which guarantees that the .com file and the batch script are consistent. If the configuration of the batch job changes, only the template of the batch script needs to be modified. The next step, the generation of the input files, is specific to each structure. The remaining placeholders in the .com file are filled in with the names of core and recipe and the xyz data. If present, the xyz coordinates are taken from the optimized xtb/CREST output called crest_best.xyz; otherwise, the data is extracted from the .xyz output of the molecule assembly.


The batch script gbatch.sh for each structure is derived from the existing template as well. At the end of each batch script, the script for checking the .log file for errors is called with the corresponding input arguments. For this purpose, the new placeholder {ERROR_CHECK_COMMANDS} has been added to the gbatch.sh template.

4.3 Error Checking for Gaussian Logs

Following the completion of each Gaussian job, the finished .log file is checked for errors. The implementation of this dependency, including the re-submission of erroneous Gaussian runs, is pictured in figure 4.3. The script check_gaussian_logs.py can be run on its own by specifying the command-line arguments for the paths as configured in config.yml, the name of the log to check and the optional argument --clean. If the clean option is enabled, all files except for the most recent correct Gaussian log are deleted from the structure's subdirectory in calculations. This is useful for saving storage when running the entire workflow automatically for many structures at once (e.g. as described in chapter 5), since the output and intermediate files in case of a Gaussian failure can take up more than 200 MB.

Three criteria determine whether a Gaussian run was successful: the 'normal termination' of each partial calculation, positive frequencies and no miscellaneous errors, e.g. exceeding the maximum number of steps. Possible outcomes are that the run was successful, that the error can be resolved by a re-run or that the log file requires manual checking. Regular expressions help find all the 'frequencies' values in the log, which are formatted as shown in listing 4.6. None of the frequencies may be negative; otherwise the molecular conformation is not a minimum of the energy function (compare chapter 2.3). In order to fix this, the Schoenebeck Group has implemented a script for the new workflow called gmod. This script reads the coordinates from the .log file and modifies them in such a way that this error should not re-occur when re-running the same Gaussian calculations. gmod outputs corrected xyz data that can be used as new input for an updated .com file. The generate_gaussian_input.py script is then called from the error check with the updated input files and an increased --run_count argument, and the new job is submitted to the batch system (compare figure 4.1). The type of error is also logged in log.csv.

 A                 A                 A
 Frequencies --     23.7449           36.9418           48.4119
 Red. masses --      3.9325            4.0949            4.0771
 [...]
 A                 A                 A
 Frequencies --    385.9112          398.6226          403.3570
 Red. masses --      4.2792            4.4275            4.5503

Listing 4.6: Excerpt from Gaussian Log for Core L
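A regular-expression check along these lines can flag negative frequencies. The pattern, function name and log file name below are illustrative assumptions based on the format shown above, not the framework's exact code.

import re

def has_negative_frequency(log_text):
    """Return True if any 'Frequencies --' line in a Gaussian log lists a negative value."""
    freq_lines = re.findall(r"Frequencies\s*--\s*(.+)", log_text)
    values = [float(v) for line in freq_lines for v in line.split()]
    return any(v < 0.0 for v in values)

with open("gaussian_output.log") as f:   # hypothetical log file name
    print(has_negative_frequency(f.read()))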

An error that is rather complex to solve is the failure of individual Gaussian calculations. The current .com template contains five individual calculations. The directions for these are specified in so-called 'route sections' in the .com files (shown in listing 4.7). Each section begins with a '#' as the first non-blank character. Based on this, the Python script automatically counts the number of calculations and compares it to the occurrences of 'normal termination' in the log file. If these do not match, one or more calculations have failed and the output files for the structure are put aside in manual_check for manual investigation.

# B3LYP/genecp opt=(MaxCycles=100) freq=noraman EmpiricalDispersion=GD3BJ scf=xqc
#p B3LYP/genecp guess=read geom=allcheck NMR=GIAO EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck TD=50-50 EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck pop=nboread EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck SCRF=(CPCM,solvent=toluene) EmpiricalDispersion=GD3BJ scf=xqc

Listing 4.7: Definition of Route Sections in .com Template
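The comparison of route sections against termination messages could be implemented roughly as follows; the function name and the exact counting heuristic are assumptions of this sketch.

import re

def all_calculations_terminated(com_text, log_text):
    """Compare the number of route sections in the .com input with the number of
    'Normal termination' messages in the .log output (illustrative sketch)."""
    # A route section starts with '#' as the first non-blank character of a line.
    n_route = sum(1 for line in com_text.splitlines() if line.lstrip().startswith("#"))
    n_done = len(re.findall(r"Normal termination", log_text))
    return n_done >= n_route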

Other errors that are detected and resolved automatically are an exceeded number of steps or 'FormBX' errors. The .com file limits the number of steps Gaussian is allowed to take to find a solution with a certain accuracy. Exceeding this limit can occur if the minimizer for the energy function does not converge to the minimum quickly enough. Most of the time, this error does not re-occur after adding the keywords tight or calcfc to the optimizer in the .com file and re-running the calculation from the coordinates of the failed optimization, since that geometry is closer to the minimum than the original starting point. The run_count for the structure affected by this error is increased by 1 with each re-run.

'FormBX' errors also arise from internal issues of the Gaussian implementation and cannot be reproduced deterministically. These errors bear no chemical significance. Internally, Gaussian transforms the coordinates into a representation that includes distances between two atomic cores, angles between sets of three atoms and torsion angles between sets of four atoms. If, for instance, one of these angles ends up being 180° due to the geometry optimization, Gaussian crashes with this error.

If any of the easily solvable errors persists throughout two repetitions of the Gaussian run, the structure is also inspected manually: the subdirectory for the structure is moved from calculations to the directory manual_check and the name of the subdirectory is marked with _error. In case of a successfully finished Gaussian run, the subdirectory in calculations is marked as _done.

4.4 Extraction of Descriptors

The script extract_features_from_logs.py can be called at any time; it automatically finds all finished .log files and parses the values from them. Calling this script automatically after every successful Gaussian run would also be possible, but it would require adapting the global data structure for all descriptors many times. With the current implementation, all molecular descriptors to extract can be defined in one place instead of defining the feature extraction for every single structure in separate scripts.


Regarding the parsing of the data, two types of descriptors can be distinguished: those that can be read directly from the logs via regular expressions, and geometry descriptors that are determined for a certain substructure inside the molecule. The OBMol methods used to obtain the latter values usually need atom indices as input. These atom indices depend on the initial core structure of the molecule and can be determined via Open Babel's OBSmartsPattern support. These patterns offer a syntax to define substructures or certain bonds within a molecule. Open Babel version 2.3 or higher can then search for this sequence of atoms within the molecule and returns a list of matches. The matches are tuples of indices that identify the atoms in the pattern. It is necessary to apply this pattern search to the OBMol object that is based on the labeled .mol2 file, since these files, as opposed to the .log files, include additional information about bonds. An example of such a pattern is [Ni]~[Ni], which returns a list of index tuples for all types of Ni–Ni bonds.

The atom indices for all geometry descriptors are managed by the script get_pattern_indices_from_cores.py. This script contains functions that each return the atom indices for a specific bond or substructure within the molecule needed for a single geometry descriptor. The indices are collected in a dictionary, sorted by the name of the geometry descriptor, e.g. r(Ni-Ni), and the name of the core. This data is serialized and read from the output file when running the script for extracting all ML features, extract_features_from_logs.py.

Some features involve molecular descriptors for constants like ions or predefined molecules, e.g. Cl−. These values can be added or modified in a CSV file that is converted into a Python dictionary for the calculation of some features. For extracting all features, certain filters can be defined, which are whitelists of cores that certain molecular descriptors should be extracted for, since, depending on the conformation of the molecule, it does not make sense to extract every descriptor for every structure.

The script iterates over all finished Gaussian logs in calculations, recipe by recipe, and extracts all molecular descriptors per structure. The results, i.e. the features for the ML input, are first saved in a two-dimensional Python dictionary that is sorted by recipe and name of the feature. The name of every feature is appended with the name of the core of the structure, e.g. r(Ni-Ni) (A) for core A. After the extraction is done, this result dictionary is transformed into a pandas DataFrame object. pandas² is a Python library that provides functionality for data analysis. The DataFrame format is common for machine learning input and is compatible with many Python-based ML libraries [9]. An excerpt from the resulting DataFrame structure is given in figure 4.4.

Figure 4.4: Excerpt from pandas DataFrame Containing Features for ML Input

² https://pandas.pydata.org/index.html
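Returning to the substructure search described above, the following sketch shows how a pattern match and a simple geometry descriptor such as the Ni–Ni distance could be obtained with the Open Babel 2.x Python bindings; the input file name is an assumption for illustration.

import openbabel

mol = openbabel.OBMol()
conv = openbabel.OBConversion()
conv.SetInFormat("mol2")
conv.ReadFile(mol, "A_labeled.mol2")          # labeled .mol2 file of the structure

# Find all Ni-Ni bonds via a SMARTS pattern (requires Open Babel >= 2.3).
pattern = openbabel.OBSmartsPattern()
pattern.Init("[Ni]~[Ni]")
if pattern.Match(mol):
    for atom_indices in pattern.GetUMapList():     # tuples of matching atom indices
        ni1 = mol.GetAtom(atom_indices[0])
        ni2 = mol.GetAtom(atom_indices[1])
        print("r(Ni-Ni) =", ni1.GetDistance(ni2))  # bond length in Angstrom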

5 Production Run

In order to generate an input data set varied enough to use machine learning effectively, the Schoenebeck Group plans to evaluate about 500 sets of ligands with the help of the automated framework. For every set, defined by a recipe (see chapter 4), at least 12 Gaussian jobs are performed, i.e. a minimum of one Gaussian job per structure if no re-runs are necessary. At least one structure per set is optimized with xtb/CREST first. For the DFT calculations, example batch job runtimes for two medium-sized structures, jobs A and B, are shown in table 5.1. This data was collected by the Schoenebeck Group and can be used to estimate the computational resources required for testing the entire structure library. The steps listed as 'sub-jobs' are the different parts of the Gaussian run. The optimization refers to the geometry optimization, while calculations like the 'NMR shift' compute molecular descriptors. The NMR shift for a specific atom within a molecule is a distinctive identifier of the molecular structure.

DFT sub-job                  CPU time (h) for job A    CPU time (h) for job B
optimization                                  149.8                      70.5
analytical frequency                           31.4                       7.4
NMR shift                                     101.3                      13.1
TD-DFT                                        111.5                      15.5
NBO population analysis                         3.6                       0.4
single-point energy                            26.4                       5.7
total                                         424.0                     112.5

Table 5.1: Example of Used Resources for Gaussian Run of a Medium-Sized Ligand (Courtesy of Schoenebeck Group)

Apart from the geometry optimization, the time-dependent DFT (TD-DFT) and NMR shift calculations, which compute part of the molecular descriptors used as ML features, consume a lot of CPU time. The DFT calculations for all structures A to L that can be formed from this medium-sized ligand take about 2.5K hours of CPU time, so-called 'core hours', in batch. For larger ligands, a total of about 5K core hours has been estimated. The pre-optimization performed by xtb and CREST is less significant for the overall resource requirements, since it only takes about 10 core hours per ligand, regardless of its size. Therefore, the entire production run for all 500 ligands, generating structure sets for about 250 medium-sized and 250 larger ligands, would require on the order of 250 · 2.5K + 250 · 5K ≈ 1.9M core hours for the DFT calculations, i.e. roughly 2.0 M core hours in total. Managing a data collection of this size would not be feasible without automation.

6 Summary and Outlook

In comparison to the previous implementation outlined in chapter 3, the automated framework is more efficient because independent steps of the workflow are run concurrently. It allows its users to collect data in the background, e.g. by starting the production run described in chapter 5, without having to monitor its progress regularly. The handling of the batch job dependencies is managed by the framework itself. The only thing that requires manual interaction is solving more complex errors that might occur in the Gaussian logs, but this can be done at any time and does not interfere with the rest of the calculations in progress. Apart from that, no intermediate input files have to be adapted manually, so the framework is less susceptible to careless mistakes when copying data or replacing placeholders.

The framework I developed in this seminar thesis can serve as the basis for generating ML-suitable data sets that are more varied than the ones previously created through experiments. Due to its modular design, it can easily be modified or extended to extract different sets of molecular descriptors or to run different DFT calculations. The features extracted in this workflow can be utilized for both supervised and unsupervised ML algorithms. For a consecutive bachelor's thesis, I plan to explore methods of unsupervised machine learning for the data set created during the production run of 500 ligands. The aim is to analyze how suitable and efficient different approaches are for discovering novel and promising Ni(I) dimer catalysts. With the help of my framework, the Schoenebeck Group is currently working on a project that focuses on testing supervised ML methods to find reactive Ni(I) dimers. Apart from that, the Schoenebeck Group intends to offer this framework as a tool to other Gaussian users in the field of chemistry who specialize in similar research topics. This could contribute to the further establishment of machine learning in chemistry, as the framework provides a new way to obtain extensive data sets.

List of Abbreviations

IOC      Institute of Organic Chemistry at RWTH Aachen University
DFT      density functional theory
ML       machine learning
OpenMP   Open Multi-Processing
NHC      N-heterocyclic carbenes

List of Figures

1.1 Overview of Workflow
3.1 Fragments of a Single Structure (Courtesy of Schoenebeck Group)
4.1 Scheme of Full Framework with File Types
4.2 Scheme of Molecule Assembly (Courtesy of Schoenebeck Group)
4.3 Scheme of xtb/CREST and Gaussian Batch Jobs and Their Dependencies
4.4 Excerpt from pandas DataFrame Containing Features for ML Input

Listings

3.1 Snippet of Python Script for Assembling Molecules
3.2 Input Template for .xcontrol
3.3 List of Indices of Fixed Atoms
3.4 Excerpt from Gaussian .com Template gauinp.com
4.1 .yml File for Unsaturated Core A
4.2 Example Recipe
4.3 Example Configuration File config.yml
4.4 Excerpt from log.csv of All the Events Logged for Structure H, Recipe 19
4.5 Input Arguments for Assembler
4.6 Excerpt from Gaussian Log for Core L
4.7 Definition of Route Sections in .com Template

List of Tables

5.1 Example of Used Resources for Gaussian Run of a Medium-Sized Ligand (Courtesy of Schoenebeck Group)

References

[1] D. T. Ahneman et al. 'Predicting reaction performance in C–N cross-coupling using machine learning'. In: Science 360.6385 (2018), pp. 186–190. ISSN: 0036-8075. DOI: 10.1126/science.aar5169. URL: https://science.sciencemag.org/content/360/6385/186.

[2] Z.-Y. Cao et al. 'Recent advances in the use of chiral metal complexes with achiral ligands for application in asymmetric catalysis'. In: Catal. Sci. Technol. 5.7 (2015), pp. 3441–3451. DOI: 10.1039/C5CY00182J. URL: http://dx.doi.org/10.1039/C5CY00182J.

[3] C. W. Coley et al. 'Prediction of Organic Reaction Outcomes Using Machine Learning'. In: ACS Central Science 3.5 (2017), pp. 434–443. PMID: 28573205. DOI: 10.1021/acscentsci.7b00064. URL: https://doi.org/10.1021/acscentsci.7b00064.

[4] M. J. Frisch et al. Gaussian 16 Revision C.01. Gaussian Inc., Wallingford CT, 2016.

[5] P. Ghosh et al. 'A Prototype Implementation of OpenMP Task Dependency Support'. In: OpenMP in the Era of Low Power Devices and Accelerators. Ed. by A. P. Rendell, B. M. Chapman and M. S. Müller. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 128–140. ISBN: 978-3-642-40698-0.

[6] S. Grimme. 'Exploration of Chemical Compound, Conformer, and Reaction Space with Meta-Dynamics Simulations Based on Tight-Binding Quantum Chemical Calculations'. In: Journal of Chemical Theory and Computation 15.5 (2019), pp. 2847–2862.

[7] A. Kapat et al. 'E-Olefins through intramolecular radical relocation'. In: Science 363.6425 (2019), pp. 391–396. ISSN: 0036-8075. DOI: 10.1126/science.aav1610. URL: https://science.sciencemag.org/content/363/6425/391.

[8] S. T. Keaveney, G. Kundu and F. Schoenebeck. 'Modular Functionalization of Arenes in a Triply Selective Sequence: Rapid C(sp2) and C(sp3) Coupling of C-Br, C-OTf, and C-Cl Bonds Enabled by a Single Palladium(I) Dimer'. In: Angewandte Chemie (International ed. in English) 57.38 (Sept. 2018), pp. 12573–12577. ISSN: 1433-7851. DOI: 10.1002/anie.201808386. URL: http://europepmc.org/articles/PMC6175235.

[9] A. C. Müller, S. Guido et al. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, Inc., 2016.

[10] OpenMP Application Programming Interface. 8 Nov. 2018. URL: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf (visited on 11/12/2019).


[11] T. Scattolin et al. 'Site-Selective C-S Bond Formation at C-Br over C-OTf and C-Cl Enabled by an Air-Stable, Easily Recoverable, and Recyclable Palladium(I) Catalyst'. In: Angewandte Chemie International Edition 57.38 (2018), pp. 12425–12429. DOI: 10.1002/anie.201806036. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.201806036.

[12] G. Skoraczyński et al. 'Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient?' In: Scientific Reports (2017).

[13] J. N. Wei, D. Duvenaud and A. Aspuru-Guzik. 'Neural Networks for the Prediction of Organic Chemistry Reactions'. In: ACS Central Science 2.10 (2016), pp. 725–732. PMID: 27800555. DOI: 10.1021/acscentsci.6b00219. URL: https://doi.org/10.1021/acscentsci.6b00219.

[14] J. G. P. Wicker and R. I. Cooper. 'Will it crystallise? Predicting crystallinity of molecular materials'. In: CrystEngComm 17.9 (2015), pp. 1927–1934. DOI: 10.1039/C4CE01912A. URL: http://dx.doi.org/10.1039/C4CE01912A.
