Seminar Thesis in the Degree Program Scientific Programming, Module Code 95200

Automated Generation of Input Data for Machine-Learning-Based Predictions of Ni(I) Dimer Formation

Nina Löseke, Matriculation Number 3147366

17 December 2019

Supervised by Prof. Dr. rer. nat. Hans Joachim Pflug and Jannis Klinkenberg, M.Sc., IT Center of RWTH Aachen University

Declaration

I hereby declare that I have written this seminar thesis on the topic 'Automated Generation of Input Data for Machine-Learning-Based Predictions of Ni(I) Dimer Formation' independently and have used no sources or aids other than those indicated, that all passages taken literally or in substance from other works are marked as such, and that this thesis has not previously been part of any coursework or examination in the same or a similar form. I commit to keeping a copy of this seminar thesis for five years and to handing it over to the examination office of the Faculty of Medical Engineering and Technomathematics upon request.

Place, Date        Signature

Abstract

Ni(I) dimers are useful for selective synthesis and catalyze complex reactions. Generally, dimeric catalysts offer a wide range of applications in selective synthesis, which is why there is a high demand for finding similar catalysts. An experimental approach for testing a large number of potential candidates to form this specific type of catalyst is not feasible because it is time-consuming and difficult to study systematically. Therefore, the researchers of the Schoenebeck Group at the RWTH Institute of Organic Chemistry hope to explore suitable ligands with the aid of machine learning. Ligands are ions or molecules attached to a metal atom that, in this case, need to stabilize a dinuclear Ni(I)–Ni(I) core. Previously, the workflow for creating the data from which ML features are extracted was only partially implemented and required a lot of manual interaction, which made it prone to errors. This seminar thesis focuses on developing a fully automated, Python-based framework for generating ML input data to identify ligands that form Ni(I) dimers. The framework adapts the previous workflow in a more efficient way that allows for automatic error detection and requires little to no user interference. Instead of experiments, the training data set is generated in silico by applying DFT calculations, carried out with the external program Gaussian, to a library of structures, the so-called species that the ligands and nickel can form. DFT stands for 'density functional theory', a method for quantum mechanical modeling. This approach results in an input data set large enough to provide innovative insights into identifying novel, reactive Ni(I) dimers through ML. This is more difficult to achieve with purely experimental data, which is often limited in size or incidentally biased; these are some of the main challenges machine learning still faces in the field of chemistry. For my bachelor's thesis, I plan to investigate different methods and techniques of unsupervised machine learning and analyze their effectiveness for the data set generated by the constructed framework.

Contents

1 Introduction
  1.1 Motivation
  1.2 Focus and Structure of This Thesis

2 Important Tools and Chemical Background
  2.1 Open Babel
  2.2 xtb and CREST
  2.3 Gaussian

3 Status Quo

4 Implementation of the Framework
  4.1 Molecule Assembly
  4.2 xtb/CREST and Gaussian
  4.3 Error Checking for Gaussian Logs
  4.4 Extraction of Descriptors

5 Production Run

6 Summary and Outlook

List of Abbreviations

List of Figures

Listings

List of Tables

References

1 Introduction

1.1 Motivation

The use of machine learning in the field of chemistry is increasing in popularity [1], since it can help predict reaction outcomes [13, 12, 3] or molecular characteristics [14]. This is helpful for exploring new, suitable structures, e.g. when searching for novel catalysts, without having to test each possible candidate through time-consuming and expensive experiments in the laboratory. Machine learning can also help discern certain patterns for finding new candidates, e.g. the most important molecular properties that characterize suitable structures for catalyst formation.

Current data sets for ML input, however, are largely based on experimental data, which is limited in size due to the required effort and might contain inaccuracies. Moreover, certain molecular descriptors (so-called 'features' for machine learning) are often hand-picked, so that bias is introduced. As a result, the training and validation data sets are often similar. This reduces the amount of new knowledge to be gained from machine learning. Computational methods help combat this challenge. DFT calculations in particular allow much larger data sets, from which features can be extracted, to be generated in silico, e.g. via Gaussian¹. DFT, short for 'density functional theory', is a method for quantum mechanical modeling that helps optimize and collect data on large sets of molecular structures purely through computation, without experimental effort. These molecular characteristics can in turn provide a training data set that is large enough to yield more innovative results.

One of the use cases for this is the dinuclear Ni(I) catalyst. Ni(I) dimers are important for selective synthesis and support complex reactions, like the highly selective isomerization of terminal olefins [7]. There is a high demand for finding similar dimeric catalysts [11, 8], especially a chiral version of the Ni(I) dimer, which could be applied in many ways in enantioselective catalysis [2]. However, potential 'species' to form these types of dimers cannot be identified strategically through experiments because this is too time-consuming. Apart from that, the exact correlations that make a species suitable are unknown due to the complex factors that influence the formation of a specific catalyst. Species are groups of nickel and ligands, which are ions or molecules attached to a metal atom and, in this case, stabilize a dinuclear Ni(I)–Ni(I) core. Therefore, the long-term goal of the research conducted by the Schoenebeck Group at the Institute of Organic Chemistry at RWTH Aachen University is to analyze ML-based methods for determining ligands that favor the formation of a reactive Ni(I) dimer.

¹ https://gaussian.com/g16main/


1.2 Focus and Structure of This Thesis

In this seminar thesis, I develop a fully automated, Python-based framework that applies DFT calculations to a library of structures based on provided ligands and generates ML input data from the results. This framework replaces the previous, non-automated, partial implementation of the steps needed to acquire the ML features. To make the following chapters easier to follow, figure 1.1 depicts a top-level overview of the workflow, which consists of five main steps.

Figure 1.1: Overview of Workflow. The five main steps are: assemble molecules, optional xtb/CREST optimization, Gaussian calculation, check output for errors (repeated if the error can be solved), and extract descriptors from the output.

First, the molecules each ligand can form need to be assembled programmatically from the ligand library in order to obtain a structure library of molecular data (see figure 3.1 for details). As an optional step, the molecular structure can be pre-optimized using the external programs xtb² and CREST³, which is helpful for some structures. The subsequent Gaussian run most importantly optimizes the total energy of the molecule and outputs an optimized arrangement of the atoms. Some errors arising from Gaussian can be resolved by repeating the calculation with a few adaptations (see chapter 4.3); otherwise, a manual check and evaluation might be required in case of more complex errors. To avoid bias in the resulting ML data, a wide range of molecular properties and energies is either extracted directly from the Gaussian results or computed with additional external applications. These so-called descriptors constitute the features that can serve as input data for various supervised and unsupervised ML algorithms.

In the first part of this thesis, I give an overview of the most important software and tools used for this project and of the chemical background of the xtb/CREST optimization and the DFT calculations. Then, I discuss how the previous partial implementation of the workflow can be improved. Based on this, I outline the requirements for the fully automated framework and describe concrete details of its implementation and how to use it. Finally, I present data on a 'production run' of this framework and summarize my work.

² https://github.com/grimme-lab/xtb/
³ https://xtb-docs.readthedocs.io/en/latest/crest.html

2 Important Tools and Chemical Background

This chapter not only looks at the software and tools involved in this project but also explains the most important chemical details of xtb/CREST and the DFT calculations in sections 2.2 and 2.3. The calculations performed by xtb/CREST and Gaussian are very compute-intensive and optimized to run on high-performance computer architectures, in this case the RWTH Compute Cluster¹. To utilize these compute resources, the programs can be run as batch jobs by submitting batch scripts via the cluster's workload manager SLURM². SLURM first queues all incoming jobs that users submit and then assigns resources in terms of CPUs, memory and runtime to the jobs waiting to run on the cluster.
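As an illustration of how such batch jobs can be driven from Python, the following minimal sketch submits a batch script with sbatch and parses the job ID that SLURM reports on success. The script name xbatch.sh and the helper submit_job are placeholders for illustration; this is not the framework's actual submission code.

import re
import subprocess

def submit_job(script_path):
    """Submit a SLURM batch script and return the assigned job ID."""
    # On success, sbatch prints a line like "Submitted batch job 1234567".
    result = subprocess.run(["sbatch", script_path],
                            capture_output=True, text=True, check=True)
    match = re.search(r"Submitted batch job (\d+)", result.stdout)
    return int(match.group(1)) if match else None

job_id = submit_job("xbatch.sh")  # hypothetical xtb/CREST batch script
print(f"queued as job {job_id}")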

2.1 Open Babel

Open Babel³ is a chemical software toolbox originally written in C++ that provides functionality for converting between different file formats for molecular data. The library also includes a Python interface with the same API for the essential functions, which I utilize for this seminar thesis. The automated framework requires at least version 2.3, since this version introduces new classes for substructure search. Through substructure search, Open Babel can find the atom indices of certain bonds or substructures in a molecule based on defined patterns, which is relevant to the ML feature extraction explained in chapter 4.4.

2.2 xtb and CREST

CREST and xtb are OpenMP-parallelized programs used for simulating various molecular conformations [6]. OpenMP, or 'Open Multi-Processing', is an API that provides directives for parallelizing C/C++ or Fortran programs on shared-memory architectures via multi-threading [10]. This increases efficiency and decreases the total runtime. The calculations performed by xtb and then CREST are molecule-specific: xtb minimizes the energy of the molecule locally, and CREST subsequently conducts a global geometry optimization based on this result, which serves the same purpose. This computation is particularly relevant for the larger metal complexes that are part of the structure library, since they have more degrees of freedom. For some structures, a pre-optimization reduces the computational effort of the subsequent Gaussian calculations. The xtb version used for this framework is 6.2.1.

¹ https://doc.itc.rwth-aachen.de/display/CC/Home
² https://doc.itc.rwth-aachen.de/display/CC/Using+the+SLURM+Batch+System
³ http://openbabel.org/dev-api/changes23.shtml

xtb and Gaussian (see section 2.3) work in very similar ways. As opposed to Gaussian, however, xtb only provides a single method for optimizing the molecular structure, which is less precise and therefore less compute-intensive and time-consuming. Starting from the atomic arrangement generated during the molecule assembly (compare figure 1.1), xtb performs a local geometry optimization resulting in an improved atomic arrangement. Using this new conformation, CREST further minimizes the total energy of the structure through a global geometry optimization.
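To make this two-stage procedure concrete, the following Python sketch runs both tools in sequence for a single structure via subprocess. It is a simplification under stated assumptions: the real batch jobs also pass the .xoptcontrol/.xcontrol files and parallelization settings, which are omitted here, and start.xyz is a hypothetical input name; xtbopt.xyz and crest_best.xyz are the output names referred to elsewhere in this thesis.

import subprocess

# Local geometry optimization with xtb; the optimized geometry is written to xtbopt.xyz.
subprocess.run(["xtb", "start.xyz", "--opt"], check=True)

# Global conformer search/optimization with CREST, starting from the xtb result;
# the best conformer is written to crest_best.xyz.
subprocess.run(["crest", "xtbopt.xyz"], check=True)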

2.3 Gaussian

Gaussian is an extensive program package used by many computational chemists. It helps predict molecular characteristics, and its DFT calculations scale well. Similar to xtb and CREST, Gaussian employs OpenMP for parallelization and is efficient in localizing stationary points, which makes it suitable for this project. For each molecule, the Gaussian run consists of several calculations. This framework currently utilizes version 16 A.03.

Gaussian offers different types of optimizations and calculations [4]. The main goal for this project is to find an optimal atomic arrangement that minimizes the total energy of the molecule. With this optimal molecular conformation, Gaussian then performs additional calculations to determine certain molecular descriptors that are later used as ML features (see figure 1.1). The basic Gaussian calculation, from which all others are derived, is the 'single-point energy calculation'. It evaluates the wave function Ψ(r;R) and returns the total energy of the molecule, E(R), with R representing the coordinates of all atomic cores of the molecule. This problem is solved via an eigenvalue equation, the 'time-independent Schrödinger equation':

Ĥ · Ψ(r;R) = E(R) · Ψ(r;R)    (2.1)

Ĥ is the 'Hamiltonian', which can be described as an energy-related operator in quantum mechanics. In this project, Gaussian needs a starting point for R, which can be either the output generated by the molecule assembly or the xtb/CREST output. This initial atomic arrangement is modified with a local geometry optimization. Again, the purpose of this is to minimize the energy of the molecule by computing the minimum of the 'potential energy surface' E(R). With the pre-optimization done by xtb/CREST for some structures, better atomic arrangements with a smaller total energy can be found. Next, Gaussian conducts a 'vibrational analysis' to ensure that the optimized arrangement R* really is a minimum. With this roughly optimized data R*, Gaussian then executes an additional single-point energy calculation, which yields a much higher precision for E(R*) than the previous steps and is therefore more costly in terms of runtime. The remaining calculations are done solely to determine a number of molecular descriptors that are significant for the ML input data. Among these descriptors are bond lengths or angles between two or three atoms within the molecule.

3 Status Quo

The status quo, i.e. the previous partial implementation of the workflow (see figure 1.1), consists of a Python script to assemble the molecules and of templates for the batch scripts for xtb, CREST and Gaussian and their respective input files. A small library of ligands, each of which can form about 15 structures, is also provided. Ligands are ions or molecules attached to a metal atom. In this case, they need to stabilize the dinuclear Ni(I)–Ni(I) core of the dimeric structures that might qualify as novel catalysts. Based on this previous set-up of the workflow, this chapter analyzes opportunities for improvement and the requirements for the fully automated framework that stem from them.

Regarding the ligand library, all molecules a ligand can form are split into 'cores' and 'substituents'. This split is independent of their actual chemical fragments, as shown in figure 3.1. The idea is that a single ligand can generate several structures, depending on the kinds of substituents connected to the core.

Figure 3.1: Fragments of a Single Structure (Courtesy of Schoenebeck Group). Implementation-wise, the structure is split into a 'core', which contains the dimeric Ni(I)–Ni(I) core, and one or several substituents, as pictured on the left-hand side. The actual chemical components shown on the right are different from this fragmentation, but form the same structure.

The cores are sorted into two core groups: unsaturated and saturated cores. The reason for this classification is that this project focuses solely on ligands that are N-heterocyclic carbenes (NHCs), which are divided into a saturated and an unsaturated sub-class. Other classes of ligands can also form Ni(I) dimers, but they are unable to catalyze reactions like the isomerization of terminal olefins [7].

The molecular data needed to assemble the structures is provided within two directories, cores and substituents. For each core or substituent, there is a .mol file and a .yml file. The .mol data can be exported from graphical tools for editing molecules, such as XDrawChem¹. The .yml files specify which labels to assign to which atom indices in the molecule (Ri and Xi in figure 3.1). For the cores, the labels are Ri; for the substituents, they are Xi. The resulting labeled molecules are saved as .mol2 files. To achieve this, the Python class MolAssembler has been designed to read a .mol file via Open Babel, set the atom labels according to the .yml specification and generate .mol2 output for the core or substituent. With the status quo of the workflow, this process is done for one core or substituent at a time, which is time-consuming and should be optimized in the automated framework.

The labels are needed in order to assemble the actual structures that are going to be tested. The Python script created for this purpose only assembles one molecule at a time by loading the .mol2 file for one core via Open Babel, loading the respective substituents and connecting the labeled atoms. Where to connect each core and its substituents for the different structures is hard-coded in the Python script of the previous workflow and not documented anywhere. However, the way the cores are connected to the substituents is the same for all cores of the same set, e.g. the unsaturated core group, so this could be generalized and easily automated for the improved framework. Listing 3.1 demonstrates that the previous implementation requires the user to have prior knowledge of how to assemble the structures. This can be solved by finding a convenient format for defining a so-called 'recipe', which is then provided for all users. A recipe defines where to connect which substituents to each core of a single core group, so that an entire set of structures can be generated from one recipe. Moreover, the hard-coded paths for loading the input files need to be adapted for every single structure, which is inefficient and lacks flexibility. What is missing here is an automated way to create all desired structures at once without having to specify every single path for each structure. This could be realized by constructing a suitable directory structure. The fully assembled molecule is then written to a .mol2 and an .xyz output file. The xyz data, which defines the coordinates of every atom in a molecule, is needed to create the input for the Gaussian calculations.

The optional xtb/CREST optimization (see figure 1.1) requires two input files, .xoptcontrol for xtb and .xcontrol for CREST. The pre-optimization done by xtb is subsequently utilized by CREST; however, both input files can be created in advance. For these two input files, a template with placeholders for the data that differs from structure to structure already exists (see listing 3.2 for the .xcontrol template). In the status quo, these placeholders are replaced manually for every xtb/CREST run, which makes the previous workflow more susceptible to errors.

¹ http://www.woodsidelabs.com/chemistry/xdrawchem.php


import openbabel
import MolAssembler

obconv = openbabel.OBConversion()
obconv.SetInFormat('mol2')
core_A = openbabel.OBMol()
obconv.ReadFile(core_A, '../cores/unsat/A.mol2')
XMes = openbabel.OBMol()
obconv.ReadFile(XMes, '../substituents/XMes.mol2')
XMe = openbabel.OBMol()
obconv.ReadFile(XMe, '../substituents/XMe.mol2')
output = MolAssembler.Join(core_A, XMes, [('R1', 'X1')])
output = MolAssembler.Join(output, XMe, [('R2', 'X1')])
output = MolAssembler.Join(output, XMe, [('R3', 'X1')])
output = MolAssembler.Join(output, XMes, [('R4', 'X1')])
output = MolAssembler.Preoptimize(output)

Listing 3.1: Snippet of Python Script for Assembling Molecules

$chrg {CHARGE}
$spin {SPIN}
$fix
  atoms: {FIXED_ATOMS}
$end
$constrain
  atoms: {FIXED_ATOMS}
  force constant=1.0
  reference=xtbopt.xyz
$end
$metadyn
  atoms: {RMSD_ATOMS}
$end

Listing 3.2: Input Template for .xcontrol

Using a template is a crucial step towards more automation, since only a few lines of the input files have to be adapted for each structure. However, the placeholders have previously been set manually: the values 'charge' and 'spin' of the molecule are either known or determined with the help of tools like XDrawChem. The atom indices for {FIXED_ATOMS} can also be taken from graphical tools because it is easy to find atoms with fixed positions that way. Concerning the automation, the more flexible way that does not involve graphical tools is to use a function already implemented by MolAssembler that returns a list of indices for all fixed atoms. For {RMSD_ATOMS}, all other, non-fixed atom indices are listed. Instead of finding these with the help of external tools as before, their indices can be derived from the fixed atom indices if the total number of atoms is known; the Open Babel library provides a function to get the total number of atoms in the molecule. Another aspect that needs to be automated regarding this template is that the lists of atom indices need to be shortened, so that consecutive indices are not listed individually but abbreviated with a hyphen (see listing 3.3). Otherwise, the program might not be able to parse the input file properly.

# atom indices returned by MolAssembler
[1, 3, 4, 5, 6, 7, 10, 11, 12, 14]
# string for FIXED_ATOMS
1,3-7,10-12,14

Listing 3.3: List of Indices of Fixed Atoms
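The range compression and the placeholder substitution can both be expressed in a few lines of Python. The following sketch is illustrative only: the function names and the way the template text is filled are assumptions rather than the framework's actual implementation (the placeholders of listing 3.2 happen to match Python's str.format syntax).

def compress_indices(indices):
    """Turn [1, 3, 4, 5, 6, 7, 10, 11, 12, 14] into '1,3-7,10-12,14'."""
    indices = sorted(indices)
    parts = []
    start = prev = indices[0]
    for i in indices[1:] + [None]:          # None acts as a sentinel closing the last run
        if i is not None and i == prev + 1:
            prev = i
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = i
    return ",".join(parts)

def fill_xcontrol(template_text, charge, spin, fixed, rmsd):
    """Fill the placeholders of the .xcontrol template shown in listing 3.2."""
    return template_text.format(CHARGE=charge, SPIN=spin,
                                FIXED_ATOMS=compress_indices(fixed),
                                RMSD_ATOMS=compress_indices(rmsd))

print(compress_indices([1, 3, 4, 5, 6, 7, 10, 11, 12, 14]))  # prints 1,3-7,10-12,14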

Per molecule, the user can start the xtb/CREST optimization by submitting a batch job, for which the status quo of the implementation also provides a template. The output of the xtb/CREST optimization is a set of slightly altered .xyz coordinates. The only value that has to be adjusted for each batch script is the name of the batch job; the rest, e.g. the requested machines, time and memory, stays the same for each job. This could be improved by setting batch job names that include information about which structure or recipe is being optimized. With the previous implementation, the user has to make sure that an xtb/CREST run has finished before the subsequent Gaussian calculation can be started, i.e. these two steps depend on each other. Another important feature of the enhanced framework would be to automatically track which xtb/CREST jobs are currently running and which ones have finished, so that the user does not have to track them manually. This could be facilitated by designing the automated framework in such a way that it integrates these dependencies when managing and running the batch jobs. Similar dependencies have been realized programmatically before, e.g. as dependent tasks in OpenMP [5]. With OpenMP tasks, a new task can be generated from within another task that is currently being executed, which is an implicit dependency, or separate tasks have explicit dependencies that are managed by the task scheduler. In this case, however, this concept has to be transferred to SLURM batch jobs.

The xyz data, either generated from the labeled molecule or by optimizing the molecular structure with xtb/CREST, is then used for the Gaussian input file, a .com file. These files contain directions about the different calculations to perform, the number of processes to use, the name of the output file, and the charge, spin multiplicity and .xyz coordinates of the molecule. For the status quo, a template for such a .com file has already been defined (see listing 3.4). However, it has to be ensured that all values in the .com file that refer to the batch job (such as memory and number of cores) are set according to the ones used in the actual job script. This makes the template prone to errors, since this has previously been done manually. Another aspect is that the coordinates are copied and pasted from the .xyz file manually. It would make sense to automate this step in such a way that the xtb/CREST-optimized xyz data is used automatically, if it exists. The charge and spin multiplicity are the same for each core regardless of the substituents connected to it. These values previously had to be known by the user, but they can also be set via Open Babel. However, the methods GetTotalSpinMultiplicity and GetTotalCharge of the Open Babel class OBMol, which represents a molecular structure, are not guaranteed to work correctly and might set these values incorrectly. For the automated framework, there should therefore be an option to specify spin multiplicity and charge per core, so that no errors are introduced into the calculations by Open Babel.


%chk =
%NProc =
%Mem =
# B3LYP/genecp opt freq=noraman EmpiricalDispersion=GD3BJ

Title Card Required

{CHARGE} {MULTIPLICITY}
{XYZ_DATA}
...

Listing 3.4: Excerpt from Gaussian .com Template gauinp.com

The template for the Gaussian batch job (gbatch.sh) does not require much adjustment per molecule. Only the name of the job has to be changed; all other resource-related settings as well as the names of the input and output files are the same for each structure. After a Gaussian run has finished, the Gaussian output, a .log file, is checked for errors. These checks verify the 'normal termination' of each calculation and ensure that none of the 'frequencies' values are negative. This is tedious, since frequencies appear between 50 and 110 times in the output files for this project. Searching for these values in a non-automated fashion is time-consuming and could be accelerated with the use of regular expressions. With the previous implementation, the completion of Gaussian batch jobs also has to be tracked manually in order to check for errors afterwards. Another requirement for the automated framework is that the independent steps (compare figure 1.1) are executed in parallel for all structures at once, to speed up the DFT calculations for the entire structure library. What needs to be implemented from scratch is the extraction of the descriptors from the Gaussian logs. Furthermore, it would be helpful to flag those .log files that contain more complex errors for manual checking.

4 Implementation of the Framework

The fully automated framework realizes the potential improvements discussed in chapter 3 and comprises several directories. This structure clearly separates input and output files, templates and the Python scripts needed to run each step from each other. The modular design enables the user to run just a specific step independently or to run the entire workflow in an automated fashion up to the feature extraction. A more detailed illustration of the complete workflow is depicted in figure 4.1.

Figure 4.1: Scheme of Full Framework with File Types. The figure traces the data flow: cores and substituents (.mol files plus .yml label files) and recipes are combined into pre-optimized, labeled molecules (.mol2, .xyz); an optional xtb optimization feeds CREST, whose .xyz output (or the assembly output) is used to create the Gaussian .com input; the resulting .log is checked for errors, with a correction script for negative frequencies, re-runs for solvable errors and manual checking otherwise, before the feature values are extracted from the log.

The directory structure consists of these parts:

• cores: separated by core group
• substituents
• recipes
• scripts: Python scripts and global configuration file
• calculations: intermediate and final output files
• input_templates: templates for batch scripts and input files
• manual_check: created if necessary

The inputs that need to be defined before running the whole workflow are the cores, substituents and recipes. What remains the same is the specification of the labels via .yml files and the use of .mol files to provide the initial molecular data. However, for setting the spin and charge of each structure later on, there is now a way to define those values per core instead of relying on Open Babel to set them correctly. Spin and charge stay the same for all structures that share the same core, so the values can optionally be set in the .yml file of the core (see listing 4.1).

To indicate that it is helpful to run xtb/CREST optimizations before Gaussian for a structure, the .yml core file may contain the optional flag use_crest. If it is not set, false is assumed as the default.

atom_labels:
  R1: [16, 17]
  R2: [20, 21]
  R3: [19, 22]
  R4: [15, 18]
use_crest: True
spin_multiplicity: 1
total_charge: 0

Listing 4.1: .yml File for Unsaturated Core A
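For illustration, such a core .yml file could be read as follows; the file path, variable names and fall-back defaults are assumptions of this sketch rather than the framework's exact code.

import yaml  # PyYAML

with open("cores/unsat/A.yml") as f:
    core_cfg = yaml.safe_load(f)

atom_labels = core_cfg["atom_labels"]          # e.g. {'R1': [16, 17], ...}
use_crest = core_cfg.get("use_crest", False)   # defaults to False if the flag is absent
spin = core_cfg.get("spin_multiplicity")       # optional, may be None
charge = core_cfg.get("total_charge")          # optional, may be None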

The recipes are written in .yml format so that they can easily be converted into Python dictionaries for the molecule assembly. Listing 4.2 shows an example recipe. The first line specifies the core group (i.e. the subdirectory of cores) and the files to use for assembling the structures. This recipe is therefore valid for all unsaturated cores, and all labeled .mol2 files from the directory cores/unsat will be loaded. For each substituent, the file name and the labels for connecting the substituent to the core are given. In this example, two substituents of type 3 are connected to the core at positions R1 and R4.

cores: unsat/*.mol2
substituents:
  - file: 3.mol2
    connections:
      - [R1, X1]
  - file: 3.mol2
    connections:
      - [R4, X1]

Listing 4.2: Example Recipe
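A recipe of this form maps onto a whole set of structures. The sketch below shows one way this expansion could look in Python; the function name and the return format are illustrative assumptions, not the framework's actual code.

import glob
import os
import yaml

def expand_recipe(recipe_path, cores_dir):
    """Yield (core .mol2 file, substituent specification) pairs for one recipe."""
    with open(recipe_path) as f:
        recipe = yaml.safe_load(f)
    # 'cores' holds a glob pattern relative to the cores directory, e.g. 'unsat/*.mol2'.
    for core_file in sorted(glob.glob(os.path.join(cores_dir, recipe["cores"]))):
        yield core_file, recipe["substituents"]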

All intermediate files are kept in the calculations directory. These include the labeled .mol2 file for each structure, the input and output files for xtb/CREST (if needed), and the input and output files for Gaussian, including the job scripts. The subdirectories in calculations are sorted by recipe name and core. If a Gaussian calculation finishes successfully, all intermediate files are deleted except for the most recent log, which can then be passed on to the ML feature extraction. There is also a manual_check directory that is created if necessary. This is where subdirectories of calculations are moved if Gaussian fails due to a more complex error or if errors still persist even after re-running the Gaussian calculations twice. After manual inspection and adjustments, the files can easily be re-introduced into the workflow by copying them back to the corresponding subdirectory of calculations.


Furthermore, the directory input_templates contains the templates for the xtb/CREST and Gaussian jobs and for another external program called sambvca¹, which is used for the feature extraction, the last step of the workflow (see figure 4.1). Lastly, scripts is where all Python scripts for executing each workflow step are located, as well as the configuration file for the correct set-up of all paths, e.g. the path for the cores (see listing 4.3). The usage of xtb/CREST can be enabled or disabled globally by setting the use_crest flag in the configuration file to true or false.

¹ https://www.molnac.unisa.it/OMtools/sambvca2.1/help/help.html

# Insert the absolute path to the directory containing the scripts
working_dir: /home/gu620893/seminar_thesis_ioc/example/scripts/
# Set these paths relative to working_dir!
cores: ../cores
substituents: ../substituents
recipes: ../recipes
calculations: ../calculations
input_templates: ../input_templates
# Enable CREST usage by setting flag to True
use_crest: True

Listing 4.3: Example Configuration File config.yml

The progress of the entire workflow is tracked for each structure in a shared CSV file log.csv in the root of the directory structure (see listing 4.4). In case several processes try to append an event to the log for different structures at the same time, file locking in Python prevents data races. The CSV format was chosen in case the user wants to parse the data. For each structure, the current status in the workflow is logged along with a timestamp whenever the status changes. The different stages a molecule can reach during the workflow are pictured in figure 4.1. Finished, error-free Gaussian jobs are marked with _done in the calculations directory. The feature extraction is applied to all correct Gaussian logs at once.

timestamp,recipe,core,status
2019-10-24 13:45:02,19,H,assembled
2019-10-24 13:47:24,19,H,gaussian run 0
2019-10-25 06:38:57,19,H,gaussian check 0: rerun (FormBX error or # steps exceeded)
2019-10-25 06:38:57,19,H,gaussian run 1
2019-11-08 21:06:59,19,H,gaussian check 1: successful

Listing 4.4: Excerpt from log.csv of All the Events Logged for Structure H, Recipe 19
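One way to realize such race-free appends is an advisory lock around the write, e.g. via the fcntl module on Linux. The sketch below is an assumption about how this could look, not necessarily the framework's exact mechanism.

import csv
import fcntl
from datetime import datetime

def log_status(log_path, recipe, core, status):
    """Append one status line to log.csv while holding an exclusive lock."""
    with open(log_path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)          # blocks until the lock is free
        try:
            writer = csv.writer(f)
            writer.writerow([datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                             recipe, core, status])
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

log_status("log.csv", 19, "H", "assembled")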

The full workflow, up to the completion of the Gaussian calculations, can be run with the script run.py. As an input argument, it requires the name of the configuration file, e.g. --config config.yml. It automatically iterates over all recipes, assembles the molecules and starts the first Gaussian or xtb/CREST job (if enabled) per molecule. All consecutive calculations are triggered automatically by commands added to the end of the current batch script, which amounts to applying implicit dependencies for the different workflow steps as discussed in chapter 3 and illustrated in figure 4.3. All of these batch jobs are started successively with the correct dependencies but run independently from each other, so that independent steps for different structures can be executed in parallel. The configuration file is used to pass the correct paths on to the other scripts that run each step of the workflow, so that the correct calculations subdirectory etc. can be found. These scripts are responsible for the assembly of the structures, the input generation for and submission of xtb/CREST and Gaussian jobs, the error checking for the .log files and the feature extraction (see figure 4.1). The function of each script is explained in more detail in the following subchapters.

4.1 Molecule Assembly

The script assembler.py, along with the class MolAssembler, is responsible for the assembly of all structures. The assembler can be run on its own via command-line arguments without starting any further calculations (see listing 4.5). The configuration file provides the paths for the recipes, cores and substituents as well as the calculations directory to write the output to.

usage: assembler.py --config CONFIG

arguments:
  --config CONFIG    config file with relative paths etc.

Listing 4.5: Input Arguments for Assembler

The molecule assembly is composed of three steps: the preparation of the cores, the preparation of the substituents and the assembly of the molecules, which is illustrated in figure 4.2.

Figure 4.2: Scheme of Molecule Assembly (Courtesy of Schoenebeck Group)

First, the assembler extracts all core groups from the cores directory and the .mol files for each group. If a .mol file has no matching .yml file, an exception is raised. With the help of Open Babel, the .mol file for each core is loaded, the atom labels are read from the matching .yml file and the atoms are labeled by index with the help of MolAssembler. In case the .yml file for the core additionally specifies spin and charge, these values are set as properties of the OBMol object that represents the molecular structure. When serializing this OBMol object as a labeled .mol2 file, additional properties of the molecule are not serialized implicitly, i.e. the spin and charge attributes would be lost. Therefore, a so-called OBPairData structure is used to add a comment to the .mol2 output file. These values can later be read from the .mol2 file as a Python dictionary via Open Babel when they are needed. The OBPairData also specifies whether xtb/CREST usage is enabled for that core. An xtb/CREST optimization is only done for cores whose CREST flag is set to true in the .yml file, and only if xtb/CREST usage is enabled in the global configuration file as well.

The preparation of the substituents works the same way: all .mol and .yml files are read from the substituents directory, labeled accordingly and saved as .mol2 files.

For the last step, the assembler iterates over all recipes, parses them into Python dictionaries and loads the needed .mol2 files. In case several recipes use the same core group, the OBMol objects created from the .mol2 files are cached for later use, along with the substituents. For each recipe and each of its cores, a subdirectory is created in the calculations folder; for example, for recipe 1 and core A, the output files for the assembled molecule are written to path/to/calculations/1/A. MolAssembler already pre-optimizes the molecular structure and generates both .mol2 and .xyz output for the assembled molecule. All finished assemblies are appended to the log file log.csv in the root directory.
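The following sketch illustrates how such auxiliary data can be attached to an OBMol and read back with the Open Babel 2.x Python bindings. The file names and the attribute name spin_and_charge are assumptions of this sketch, and whether the .mol2 writer emits the pair data as a comment may depend on the Open Babel version.

import openbabel

mol = openbabel.OBMol()
conv = openbabel.OBConversion()
conv.SetInAndOutFormats("mol", "mol2")
conv.ReadFile(mol, "cores/unsat/A.mol")

# Attach charge/spin as generic key-value data that travels with the molecule object.
pair = openbabel.OBPairData()
pair.SetAttribute("spin_and_charge")                         # hypothetical attribute name
pair.SetValue("{'spin_multiplicity': 1, 'total_charge': 0}")
mol.CloneData(pair)                                          # OBMol stores a copy of the data

conv.WriteFile(mol, "A_labeled.mol2")

# Reading the values back later:
data = mol.GetData("spin_and_charge")
if data is not None:
    print(openbabel.toPairData(data).GetValue())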

4.2 xtb/CREST and Gaussian

The scripts for generating the xtb/CREST input and the Gaussian input work in similar ways. Both scripts have an optional command-line argument called --run, which indicates whether the batch scripts for a particular structure should also be submitted or only generated.

For xtb, the script generate_xtb_input.py requires several input arguments. The arguments root_dir, path_core_subdir and path_templates are set according to the paths specified in the configuration file. The script works on one structure at a time by creating the necessary input files in the corresponding subdirectory in calculations (set via the input arguments recipe and core). The two input files for the xtb/CREST batch job are .xcontrol for CREST and .xoptcontrol for xtb. For the automated framework, the placeholders in the already existing template files are filled in automatically. This means that the fixed atom indices are determined via MolAssembler and turned into a correctly formatted string. The xyz data is parsed from the .xyz output of the assembly and inserted into the templates. Charge and spin are set by converting the .mol2 files into OBMol objects and reading their OBPairData attribute, which specifies total charge and spin multiplicity. The spin is defined as

spin = (multiplicity − 1) / 2.    (4.1)

Another file generated by generate_xtb_input.py is the batch script for the xtb/CREST run, called xbatch.sh. The input template for the batch job is modified so that the name of the job is set to a combination of the recipe name and the name of the core. Other placeholders include the name of the SLURM log file and the working directory for xtb/CREST, which is defined as the subdirectory in calculations where all of the output for that specific structure is written. At the end of each xbatch.sh script, there is a new placeholder for Gaussian commands, which are executed after the xtb/CREST optimization has finished. These call the script generate_gaussian_input.py with the correct command-line arguments for that particular structure in order to generate the Gaussian input from the optimized xyz data. These additional commands take the dependencies among the batch jobs into account, so that a new batch job for the Gaussian calculation is automatically submitted from within the xtb/CREST job without any user interference (see figure 4.3).

Figure 4.3: Scheme of xtb/CREST and Gaussian Batch Jobs and Their Dependencies. run.py assembles the molecules and, per molecule, either starts an xtb/CREST batch job (which submits the Gaussian job using the optimized .xyz data once it finishes) or starts the Gaussian job directly. The log check re-submits the Gaussian job for negative frequencies or other reversible Gaussian errors as long as fewer than three runs have been performed; otherwise the files are moved for manual checking.

The script generate_gaussian_input.py works similarly to the xtb/CREST input generation. An important additional input argument for this script is --run_count, which starts from 0. This argument specifies the ID of the Gaussian run, which is relevant for the subsequent error checking. For the Gaussian jobs, the template for the .com files is set up automatically first. This is done by parsing the values for the memory and number of processes from the batch script template and inserting them into the .com files, which guarantees that the .com file and the batch script are consistent. If the configuration of the batch job changes, only the template of the batch script needs to be modified. The next step, the generation of the input files, is specific to each structure. The remaining placeholders in the .com file are filled in with the names of core and recipe and the xyz data. If present, the xyz coordinates are taken from the optimized xtb/CREST output called crest_best.xyz; otherwise, the data is extracted from the .xyz output of the molecule assembly.


The batch script gbatch.sh for each structure is derived from the existing template as well. At the end of each batch script, the script for checking the .log file for errors is called with the corresponding input arguments. For this purpose, the new placeholder {ERROR_CHECK_COMMANDS} has been added to the gbatch.sh template.

4.3 Error Checking for Gaussian Logs

Following the completion of each Gaussian job, the finished .log file is checked for errors. The implementation of this dependency, including the re-submission of erroneous Gaussian runs, is pictured in figure 4.3. The script check_gaussian_logs.py can be run on its own by specifying the command-line arguments for the paths as configured in config.yml, the name of the log to check and the optional argument --clean. If the clean option is enabled, all files except for the most recent correct Gaussian log are deleted from the structure's subdirectory in calculations. This is useful for saving storage when running the entire workflow automatically for many structures at once (e.g. as described in chapter 5), since the output and intermediate files in case of a Gaussian failure can take up more than 200 MB.

Three criteria determine whether a Gaussian run was successful: the 'normal termination' of each partial calculation, positive frequencies and no miscellaneous errors, e.g. exceeding the maximum number of steps. Possible outcomes are that the run was successful, that the error can be resolved by a re-run or that the log file requires manual checking. Regular expressions help find all the 'frequencies' values in the log, which are formatted as shown in listing 4.6. None of the frequencies may be negative; otherwise the molecular conformation is not a minimum of the energy function (compare chapter 2.3). In order to fix this, the Schoenebeck Group has implemented a script for the new workflow called gmod. This script reads the coordinates from the .log file and modifies them in such a way that this error should not re-occur when re-running the same Gaussian calculations. gmod outputs corrected xyz data that can be used as new input for an updated .com file. The generate_gaussian_input.py script is then called from the error check with the updated input files and an increased --run_count argument, and the new job is submitted to the batch system (compare figure 4.1). The type of error is also logged in log.csv.

 A                 A                 A
 Frequencies --     23.7449           36.9418           48.4119
 Red. masses --      3.9325            4.0949            4.0771
 [...]
 A                 A                 A
 Frequencies --    385.9112          398.6226          403.3570
 Red. masses --      4.2792            4.4275            4.5503

Listing 4.6: Excerpt from Gaussian Log for Core L
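A regular-expression check along these lines can flag negative frequencies. The pattern, function name and log file name below are illustrative assumptions based on the format shown above, not the framework's exact code.

import re

def has_negative_frequency(log_text):
    """Return True if any 'Frequencies --' line in a Gaussian log lists a negative value."""
    freq_lines = re.findall(r"Frequencies\s*--\s*(.+)", log_text)
    values = [float(v) for line in freq_lines for v in line.split()]
    return any(v < 0.0 for v in values)

with open("gaussian_output.log") as f:   # hypothetical log file name
    print(has_negative_frequency(f.read()))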

An error that is rather complex to solve is the failure of individual Gaussian calculations. The current .com template contains five individual calculations. The directions for these are specified in so-called 'route sections' in the .com files (shown in listing 4.7). Each section begins with a '#' as the first non-blank character. Based on this, the Python script automatically counts the number of calculations and compares it to the occurrences of 'normal termination' in the log file. If these do not match, one or more calculations have failed and the output files for the structure are put aside in manual_check for manual investigation.

# B3LYP/genecp opt=(MaxCycles=100) freq=noraman EmpiricalDispersion=GD3BJ scf=xqc
#p B3LYP/genecp guess=read geom=allcheck NMR=GIAO EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck TD=50-50 EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck pop=nboread EmpiricalDispersion=GD3BJ scf=xqc
# B3LYP/genecp guess=read geom=allcheck SCRF=(CPCM,solvent=toluene) EmpiricalDispersion=GD3BJ scf=xqc

Listing 4.7: Definition of Route Sections in .com Template
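The comparison of route sections against termination messages could be implemented roughly as follows; the function name and the exact counting heuristic are assumptions of this sketch.

import re

def all_calculations_terminated(com_text, log_text):
    """Compare the number of route sections in the .com input with the number of
    'Normal termination' messages in the .log output (illustrative sketch)."""
    # A route section starts with '#' as the first non-blank character of a line.
    n_route = sum(1 for line in com_text.splitlines() if line.lstrip().startswith("#"))
    n_done = len(re.findall(r"Normal termination", log_text))
    return n_done >= n_route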

Other errors that are detected and resolved automatically are an exceeded number of steps or 'FormBX' errors. The .com file limits the number of steps Gaussian is allowed to take to find a solution with a certain accuracy. Exceeding this limit can occur if the minimizer for the energy function does not converge to the minimum quickly enough. Most of the time, this error does not re-occur after adding the keywords tight or calcfc to the optimizer in the .com file and re-running the calculation from the coordinates of the failed optimization, since that geometry is closer to the minimum than the original starting point. The run_count for the structure affected by this error is increased by 1 with each re-run.

'FormBX' errors also arise from internal issues of the Gaussian implementation and cannot be reproduced deterministically. These errors bear no chemical significance. Internally, Gaussian transforms the coordinates into a representation that includes distances between two atomic cores, angles between sets of three atoms and torsion angles between sets of four atoms. If, for instance, one of these angles ends up being 180° due to the geometry optimization, Gaussian crashes with this error.

If any of the easily solvable errors persists throughout two repetitions of the Gaussian run, the structure is also inspected manually: the subdirectory for the structure is moved from calculations to the directory manual_check and the name of the subdirectory is marked with _error. In case of a successfully finished Gaussian run, the subdirectory in calculations is marked as _done.

4.4 Extraction of Descriptors

The script extract_features_from_logs.py can be called at any time; it automatically finds all finished .log files and parses the values from them. Calling this script automatically after every successful Gaussian run would also be possible, but it would require adapting the global data structure for all descriptors many times. With the current implementation, all molecular descriptors to extract can be defined in one place instead of defining the feature extraction for every single structure in separate scripts.


Regarding the parsing of the data, two types of descriptors can be distinguished: those that can be read directly from the logs via regular expressions, and geometry descriptors that are determined for a certain substructure inside the molecule. The OBMol methods used to obtain the latter values usually need atom indices as input. These atom indices depend on the initial core structure of the molecule and can be determined via Open Babel's OBSmartsPattern support. These patterns offer a syntax to define substructures or certain bonds within a molecule. Open Babel version 2.3 or higher can then search for this sequence of atoms within the molecule and returns a list of matches. The matches are tuples of indices that identify the atoms in the pattern. It is necessary to apply this pattern search to the OBMol object that is based on the labeled .mol2 file, since these files, as opposed to the .log files, include additional information about bonds. An example of such a pattern is [Ni]~[Ni], which returns a list of index tuples for all types of Ni–Ni bonds.

The atom indices for all geometry descriptors are managed by the script get_pattern_indices_from_cores.py. This script contains functions that each return the atom indices for a specific bond or substructure within the molecule needed for a single geometry descriptor. The indices are collected in a dictionary, sorted by the name of the geometry descriptor, e.g. r(Ni-Ni), and the name of the core. This data is serialized and read from the output file when running the script for extracting all ML features, extract_features_from_logs.py.

Some features involve molecular descriptors for constants like ions or predefined molecules, e.g. Cl−. These values can be added or modified in a CSV file that is converted into a Python dictionary for the calculation of some features. For extracting all features, certain filters can be defined, which are whitelists of cores that certain molecular descriptors should be extracted for, since, depending on the conformation of the molecule, it does not make sense to extract every descriptor for every structure.

The script iterates over all finished Gaussian logs in calculations, recipe by recipe, and extracts all molecular descriptors per structure. The results, i.e. the features for the ML input, are first saved in a two-dimensional Python dictionary that is sorted by recipe and name of the feature. The name of every feature is appended with the name of the core of the structure, e.g. r(Ni-Ni) (A) for core A. After the extraction is done, this result dictionary is transformed into a pandas DataFrame object. pandas² is a Python library that provides functionality for data analysis. The DataFrame format is common for machine learning input and is compatible with many Python-based ML libraries [9]. An excerpt from the resulting DataFrame structure is given in figure 4.4.

Figure 4.4: Excerpt from pandas DataFrame Containing Features for ML Input

² https://pandas.pydata.org/index.html
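Returning to the substructure search described above, the following sketch shows how a pattern match and a simple geometry descriptor such as the Ni–Ni distance could be obtained with the Open Babel 2.x Python bindings; the input file name is an assumption for illustration.

import openbabel

mol = openbabel.OBMol()
conv = openbabel.OBConversion()
conv.SetInFormat("mol2")
conv.ReadFile(mol, "A_labeled.mol2")          # labeled .mol2 file of the structure

# Find all Ni-Ni bonds via a SMARTS pattern (requires Open Babel >= 2.3).
pattern = openbabel.OBSmartsPattern()
pattern.Init("[Ni]~[Ni]")
if pattern.Match(mol):
    for atom_indices in pattern.GetUMapList():     # tuples of matching atom indices
        ni1 = mol.GetAtom(atom_indices[0])
        ni2 = mol.GetAtom(atom_indices[1])
        print("r(Ni-Ni) =", ni1.GetDistance(ni2))  # bond length in Angstrom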

5 Production Run

In order to generate an input data set varied enough to use machine learning effectively, the Schoenebeck Group plans to evaluate about 500 sets of ligands with the help of the automated framework. For every set, defined by a recipe (see chapter 4), at least 12 Gaussian jobs are performed, i.e. a minimum of one Gaussian job per structure if no re-runs are necessary. At least one structure per set is optimized with xtb/CREST first. For the DFT calculations, example batch job runtimes for two medium-sized structures, jobs A and B, are shown in table 5.1. This data was collected by the Schoenebeck Group and can be used to estimate the computational resources required for testing the entire structure library. The steps listed as 'sub-jobs' are the different parts of the Gaussian run. The optimization refers to the geometry optimization, while calculations like the 'NMR shift' compute molecular descriptors. The NMR shift for a specific atom within a molecule is a distinctive identifier of the molecular structure.

DFT sub-job                  CPU time (h) for job A    CPU time (h) for job B
optimization                                  149.8                      70.5
analytical frequency                           31.4                       7.4
NMR shift                                     101.3                      13.1
TD-DFT                                        111.5                      15.5
NBO population analysis                         3.6                       0.4
single-point energy                            26.4                       5.7
total                                         424.0                     112.5

Table 5.1: Example of Used Resources for Gaussian Run of a Medium-Sized Ligand (Courtesy of Schoenebeck Group)

Apart from the geometry optimization, the time-dependent DFT (TD-DFT) and NMR shift calculations, which compute part of the molecular descriptors used as ML features, consume a lot of CPU time. The DFT calculations for all structures A to L that can be formed from this medium-sized ligand take about 2.5K hours of CPU time, so-called 'core hours', in batch. For larger ligands, a total of about 5K core hours has been estimated. The pre-optimization performed by xtb and CREST is less significant for the overall resource requirements, since it only takes about 10 core hours per ligand, regardless of its size. Therefore, the entire production run for all 500 ligands, generating structure sets for about 250 medium-sized and 250 larger ligands, would require on the order of 250 · 2.5K + 250 · 5K ≈ 1.9M core hours for the DFT calculations, i.e. roughly 2.0 M core hours in total. Managing a data collection of this size would not be feasible without automation.

6 Summary and Outlook

In comparison to the previous implementation outlined in chapter 3, the automated framework is more efficient because independent steps of the workflow are run concurrently. It allows its users to collect data in the background, e.g. by starting the production run described in chapter 5, without having to monitor its progress regularly. The handling of the batch job dependencies is managed by the framework itself. The only thing that requires manual interaction is solving more complex errors that might occur in the Gaussian logs, but this can be done at any time and does not interfere with the rest of the calculations in progress. Apart from that, no intermediate input files have to be adapted manually, so the framework is less susceptible to careless mistakes when copying data or replacing placeholders.

The framework I developed in this seminar thesis can serve as the basis for generating ML-suitable data sets that are more varied than the ones previously created through experiments. Due to its modular design, it can easily be modified or extended to extract different sets of molecular descriptors or to run different DFT calculations. The features extracted in this workflow can be utilized for both supervised and unsupervised ML algorithms. For a consecutive bachelor's thesis, I plan to explore methods of unsupervised machine learning for the data set created during the production run of 500 ligands. The aim is to analyze how suitable and efficient different approaches are for discovering novel and promising Ni(I) dimer catalysts. With the help of my framework, the Schoenebeck Group is currently working on a project that focuses on testing supervised ML methods to find reactive Ni(I) dimers. Apart from that, the Schoenebeck Group intends to offer this framework as a tool to other Gaussian users in the field of chemistry who specialize in similar research topics. This could contribute to the further establishment of machine learning in chemistry, as the framework provides a new way to obtain extensive data sets.

List of Abbreviations

IOC      Institute of Organic Chemistry at RWTH Aachen University
DFT      density functional theory
ML       machine learning
OpenMP   Open Multi-Processing
NHC      N-heterocyclic carbenes

List of Figures

1.1 Overview of Workflow
3.1 Fragments of a Single Structure (Courtesy of Schoenebeck Group)
4.1 Scheme of Full Framework with File Types
4.2 Scheme of Molecule Assembly (Courtesy of Schoenebeck Group)
4.3 Scheme of xtb/CREST and Gaussian Batch Jobs and Their Dependencies
4.4 Excerpt from pandas DataFrame Containing Features for ML Input

Listings

3.1 Snippet of Python Script for Assembling Molecules
3.2 Input Template for .xcontrol
3.3 List of Indices of Fixed Atoms
3.4 Excerpt from Gaussian .com Template gauinp.com
4.1 .yml File for Unsaturated Core A
4.2 Example Recipe
4.3 Example Configuration File config.yml
4.4 Excerpt from log.csv of All the Events Logged for Structure H, Recipe 19
4.5 Input Arguments for Assembler
4.6 Excerpt from Gaussian Log for Core L
4.7 Definition of Route Sections in .com Template

List of Tables

5.1 Example of Used Resources for Gaussian Run of a Medium-Sized Ligand (Courtesy of Schoenebeck Group)

References

[1] D. T. Ahneman et al. 'Predicting reaction performance in C–N cross-coupling using machine learning'. In: Science 360.6385 (2018), pp. 186–190. ISSN: 0036-8075. DOI: 10.1126/science.aar5169. URL: https://science.sciencemag.org/content/360/6385/186.

[2] Z.-Y. Cao et al. 'Recent advances in the use of chiral metal complexes with achiral ligands for application in asymmetric catalysis'. In: Catal. Sci. Technol. 5.7 (2015), pp. 3441–3451. DOI: 10.1039/C5CY00182J. URL: http://dx.doi.org/10.1039/C5CY00182J.

[3] C. W. Coley et al. 'Prediction of Organic Reaction Outcomes Using Machine Learning'. In: ACS Central Science 3.5 (2017), pp. 434–443. PMID: 28573205. DOI: 10.1021/acscentsci.7b00064. URL: https://doi.org/10.1021/acscentsci.7b00064.

[4] M. J. Frisch et al. Gaussian 16 Revision C.01. Gaussian Inc., Wallingford CT, 2016.

[5] P. Ghosh et al. 'A Prototype Implementation of OpenMP Task Dependency Support'. In: OpenMP in the Era of Low Power Devices and Accelerators. Ed. by A. P. Rendell, B. M. Chapman and M. S. Müller. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 128–140. ISBN: 978-3-642-40698-0.

[6] S. Grimme. 'Exploration of Chemical Compound, Conformer, and Reaction Space with Meta-Dynamics Simulations Based on Tight-Binding Quantum Chemical Calculations'. In: Journal of Chemical Theory and Computation 15.5 (2019), pp. 2847–2862.

[7] A. Kapat et al. 'E-Olefins through intramolecular radical relocation'. In: Science 363.6425 (2019), pp. 391–396. ISSN: 0036-8075. DOI: 10.1126/science.aav1610. URL: https://science.sciencemag.org/content/363/6425/391.

[8] S. T. Keaveney, G. Kundu and F. Schoenebeck. 'Modular Functionalization of Arenes in a Triply Selective Sequence: Rapid C(sp2) and C(sp3) Coupling of C-Br, C-OTf, and C-Cl Bonds Enabled by a Single Palladium(I) Dimer'. In: Angewandte Chemie (International ed. in English) 57.38 (Sept. 2018), pp. 12573–12577. ISSN: 1433-7851. DOI: 10.1002/anie.201808386. URL: http://europepmc.org/articles/PMC6175235.

[9] A. C. Müller, S. Guido et al. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, Inc., 2016.

[10] OpenMP Application Programming Interface. 8 Nov. 2018. URL: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf (visited on 11/12/2019).


[11] T. Scattolin et al. 'Site-Selective C-S Bond Formation at C-Br over C-OTf and C-Cl Enabled by an Air-Stable, Easily Recoverable, and Recyclable Palladium(I) Catalyst'. In: Angewandte Chemie International Edition 57.38 (2018), pp. 12425–12429. DOI: 10.1002/anie.201806036. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.201806036.

[12] G. Skoraczyński et al. 'Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient?' In: Scientific Reports (2017).

[13] J. N. Wei, D. Duvenaud and A. Aspuru-Guzik. 'Neural Networks for the Prediction of Organic Chemistry Reactions'. In: ACS Central Science 2.10 (2016), pp. 725–732. PMID: 27800555. DOI: 10.1021/acscentsci.6b00219. URL: https://doi.org/10.1021/acscentsci.6b00219.

[14] J. G. P. Wicker and R. I. Cooper. 'Will it crystallise? Predicting crystallinity of molecular materials'. In: CrystEngComm 17.9 (2015), pp. 1927–1934. DOI: 10.1039/C4CE01912A. URL: http://dx.doi.org/10.1039/C4CE01912A.
