Machine Learning Models in Fullerene/Metallofullerene Chromatography Studies

Xiaoyang Liu

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science and Applications

Yang Cao, Advisor
Harry C. Dorn
Lenwood S. Heath

August 8, 2019 Blacksburg, VA 24060, U.S.

Keywords: Machine learning, Neural Network, Chromatography, Fullerene, Modeling, Random Forest, XGBoost, Linear Regression, SVM regression, Nearest Neighbor

ABSTRACT

Machine learning methods are now widely applied to build models in many areas of scientific research. Unlike conventional models, machine learning models are data-driven: the algorithms learn patterns from the available data that are hard to recognize by other means. Data-driven approaches give algorithms and computers a larger role and can accelerate computation by offering alternative views of a problem. In this thesis, we explore the possibility of applying machine learning models to the prediction of chromatographic retention behavior. Chromatographic separation is a key technique for the discovery and analysis of fullerenes. In previous studies, differential equation models have achieved great success in predicting chromatographic retention. However, most differential equation models require experimental measurements or theoretical computations for many parameters, which are not easy to obtain. Fullerenes/metallofullerenes are rigid, spherical molecules with cages made only of carbon atoms, which makes the prediction of their chromatographic retention behavior, as well as other properties, much simpler than for flexible molecules that have many more possible conformations. In this thesis, I propose that the polarizability of a fullerene molecule can be estimated directly from its structure. Structural motifs are used to simplify the model, and the models with motifs provide satisfactory predictions. The data set contains 31,947 isomers with their polarizability data and is split into a training set with 90% of the data points and a complementary testing set. In addition, a second testing set of large fullerene isomers is prepared to test whether a model trained on small fullerenes can give good predictions for large fullerenes.

GENERAL AUDIENCE ABSTRACT

Machine learning models can be applied in a wide range of areas, including scientific research. In this thesis, machine learning models are applied to predict the chromatographic behavior of fullerenes from their molecular structures. Chromatography is a common technique for separating mixtures, and the separation arises from differences in the interactions between the molecules and a stationary phase. In real experiments, a mixture usually contains a large family of different compounds, and it takes a great deal of work and resources to isolate the target compound. Therefore, models are extremely important for studies of chromatography.

Traditional models are built on physical laws and involve several parameters. These physical parameters are either measured experimentally or computed theoretically, and both routes are time-consuming and difficult to carry out. For fullerenes, my previous studies have shown that the chromatography model can be simplified so that only one parameter, the polarizability, is required. A machine learning approach is introduced to enhance the model by predicting the molecular polarizabilities of fullerenes from their structures. The structure of a fullerene is represented by several local structures. Several types of machine learning models are built and tested on our data set, and the results show that the neural network gives the best predictions.

ACKNOWLEDGEMENT

I would like to express my appreciation to my advisor, Dr. Yang Cao, for giving me the opportunity to pursue a master's degree in computer science. Studying toward a computer science degree has opened a new world for me. The project lies at the intersection of computer science and my Ph.D. dissertation. I also thank my committee members, Dr. Harry Dorn and Dr. Lenwood Heath. I thank Dr. Harry Dorn for his support and understanding of my work and study in computer science. I thank Dr. Lenwood Heath for his help in the discussion and revision of my thesis. I am very lucky to have such helpful committee members.

Table of Contents
1. Background
1.1 Chromatography
1.2 High Performance Liquid Chromatography
1.3 HPLC in Fullerenes/Metallofullerenes Separations
2. Machine Learning Models Based on Molecular Geometry
2.1 Machine Learning Models and Applications in Chemistry
2.2 Feature Selection
2.3 Model Selection
3. Conclusion
References

1. Background

Machine learning has developed rapidly over the past few decades[1]. It now appears in news feeds, search, and image recognition. Machine learning combines statistics and computer science to build models for tasks that are hard to accomplish with conventional methods. Fundamentally, machine learning algorithms extract information or key features from the available data and represent the data set with general models. There are numerous machine learning methods, and it is well established that different models suit different tasks. Generally, tasks fall into two categories, regression and classification[2], and most machine learning models can be adapted to both. Machine learning models are trained on raw data and then applied to make inferences about unseen data. Training is a major challenge, but advances in algorithms and computing resources now provide several ways to train models sufficiently. Each machine learning method has its own structure, which is kept general enough to cover most tasks. Training a model is the step in which it acquires the hidden key features of the data, and the knowledge extracted from the raw data takes different forms in different models. For example, linear regression describes a linear relationship between the features and the target values, whereas a decision tree learns a set of rules, each with its own parameters, for making decisions under different conditions. The application of machine learning algorithms is also referred to as data mining, the process of obtaining key information or features from a large amount of data.

Among machine learning algorithms, neural networks, often referred to as deep learning, are among the most commonly used techniques in scientific research. In the past few decades, various neural networks have been developed. For example, the convolutional neural network has shown its power in image processing. The ability to process images and recognize patterns also carries over to chemistry problems that involve chemical structures. Another essential point is that deep learning can extract features automatically. Designing features is one of the big challenges of machine learning, because it requires not only machine learning knowledge but also domain knowledge. Therefore, deep learning is versatile across different fields of study. Nowadays, with enough data, deep learning models can outperform human experts in some of their own fields.

Behind the impressive results of machine learning, mathematics plays an essential role. Machine learning algorithms are fundamentally built on linear algebra, statistics, and programming. The first step in building a machine learning model is to convert the data into vectors, the form a computer program can work with. Training the model then amounts to solving linear algebra problems within the structure imposed by the algorithm. Many machine learning algorithms start from probability, and the decision or result is estimated from probability distributions or likelihoods. A machine learning model has a fixed structure but contains several parameters that are determined by the data. Learning from a data set means finding the parameters that suit it, so training a machine learning model is an optimization of the parameters of the general model. The development of optimization algorithms has been a central topic in machine learning for decades. A typical method for handling the large amount of computation in machine learning and convex optimization is gradient descent. Gradient descent uses the current slope of the function surface to decrease the error step by step. In addition, stochastic gradient descent and other techniques have been developed to accelerate the optimization and to mitigate problems such as getting trapped in local minima.
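
As an illustration of the step-by-step error reduction described above, the following is a minimal sketch of gradient descent that fits a straight line by minimizing the mean squared error. The function name `gradient_descent`, the learning rate, the number of steps, and the toy data are assumptions chosen for this example and are not taken from the thesis.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    lr (learning rate) and steps are illustrative choices, not tuned values.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Move against the slope to reduce the error
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example values)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

With a learning rate that is too large the updates would diverge instead of converging, which is one reason refinements such as stochastic gradient descent and adaptive step sizes exist.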

Currently, machine learning methods are applied in scientific research, especially in chemistry and materials science. Quantum chemistry has provided some of the most successful models for chemistry problems. Quantum chemistry methods solve chemical property problems mainly on the basis of electronic structure. With the development of computer algorithms and supercomputers, some large-scale problems can be solved within a reasonable time. However, the computing resources required for quantum chemical calculations still limit their application. In particular, density functional theory based methods are widely used because of their good balance between accuracy and efficiency. Interest in applications of machine learning models has exploded, especially in chemistry research. In most cases, the chemical system is complex and hard to model with conventional methods. Therefore, machine learning methods are a good choice for learning from data without much prior knowledge about the system.

Data-driven models enhance the ability to model chemical systems for which the physical theories are incomplete or unavailable. The rapid emergence of machine learning methods in chemistry is partly due to the explosion of data. In the past few decades, with the rapid evolution of instruments and computers, a large amount of data has accumulated and numerous databases have become available. However, conventional methods fail to extract information from these databases. The development of machine learning algorithms, especially those for pattern recognition, is the other driving factor. Several studies have been reported that use available data sets and machine learning methods to learn hidden knowledge. For example, in materials design research, data from failed experiments are usually not considered useful. However, Raccuglia et al. proposed a machine learning model that learns materials design knowledge from such discarded data and showed that its performance is comparable even to that of experienced scientists[3].

Machine learning methods also help accelerate chemical computations. Most conventional computational methods for chemical systems require substantial resources and a relatively long time. Machine learning methods provide an alternative way to estimate chemical properties. In addition, several tasks, such as drug design, are hard to model. As noted previously, the convolutional neural network can recognize local structures and estimate their contributions to a certain property. Before such methods, drug design was mainly done manually and required a considerable amount of lab work. We also note that in the past, scientists made great efforts to summarize rules and theories from observations. Now machine learning can extract rules from the available data, in some cases even better than a scientist.

In this thesis, we present our results in building machine learning models for chromatography, which is essential for chemical separation. The input of the model is the molecular structure, and the output is the parameter used to estimate the chromatographic retention time. The machine learning model can simplify the simulation of the chromatographic behavior of fullerenes.

1.1 Chromatography

Chromatography is a technique for mixture separation and sample purification[4]. It is extensively applied in science and industry, for example in chemical engineering and pharmacy. The history of chromatography dates back to 1903, when Mikhail Tsvet, a Russian scientist, first applied the technique to separate plant pigments, which show different colors (chlorophyll is green, carotene is orange, and xanthophyll is yellow). The technique takes its name, chromatography, from chroma, the Greek word for color[5]. Modern chromatography developed rapidly from the 1930s to the 1950s, and the 1952 Nobel Prize in Chemistry was awarded to Archer John Porter Martin and Richard Laurence Millington Synge for their contributions to chromatographic techniques[6]. Since then, chromatographic techniques have been developed continually to obtain better resolution for various mixtures.

A chromatography system contains two phases, a mobile phase and a stationary phase[7]. The mobile phase dissolves the mixture and carries it through the stationary phase. For a given stationary phase, the affinities of the components of the mixture differ, so each component travels through the stationary phase at a different speed. The time a component takes to travel from the starting point to the end point is called its retention time, and the differences in retention time produce the separation. This principle can be implemented in various ways, such as size-exclusion chromatography and ion-exchange chromatography.

Figure 1.1 Scheme of thin-layer chromatography.

Thin-layer chromatography (TLC) is one of the earliest chromatography techniques and uses a thin layer of silica gel or cellulose on a flat plate as the bed[8]. The first chromatography used by Mikhail Tsvet is classified into the TLC family. The sample, called the analyte in chromatography, is applied to the surface of the thin layer, and the plate is placed in a container with solvent. Capillary action drives the solvent up the plate. The solvent carries the analytes upward, but different compounds move at different rates because of their different affinities for the stationary phase, and the separation is thereby achieved. The scheme for TLC is shown in Figure 1.1. The capacity of TLC is limited, and its resolution is low compared with other chromatography techniques. Nevertheless, TLC is still widely used, especially in organic synthesis to track the progress of a reaction. In addition to separation, TLC is also applied to compound identification based on the Rf value. The Rf value is the ratio of the distance traveled by the analyte to the distance traveled by the solvent, and it remains the same for the same compound under the same conditions.
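
The Rf ratio defined above can be computed directly from two distances measured on the plate. The sketch below is only illustrative: the function name `retention_factor` and the distances are hypothetical, and both distances are assumed to be measured from the spotting line in the same unit.

```python
def retention_factor(analyte_distance, solvent_distance):
    """Return Rf = distance traveled by the analyte / distance traveled by the solvent."""
    if solvent_distance <= 0:
        raise ValueError("solvent front distance must be positive")
    if not 0 <= analyte_distance <= solvent_distance:
        raise ValueError("analyte cannot travel farther than the solvent front")
    return analyte_distance / solvent_distance


# Hypothetical plate: the spot moved 2.1 cm while the solvent front moved 6.0 cm
rf = retention_factor(2.1, 6.0)
print(f"Rf = {rf:.2f}")
```

Because Rf is a ratio, it lies between 0 and 1 and does not depend on how far the plate was developed.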

Figure 1.2 Scheme of column chromatography.

In column chromatography (Figure 1.2), the stationary phase takes the form of small particles packed into a column[9]. In most cases, solid particles carrying the stationary material fill the whole tube to maximize the contact area between the mobile phase and the stationary phase and thus the separation ability. The stationary phase may also be coated on the inside wall of the tube, leaving the rest of the space open; this type is called an open tubular column. Another advantage of column chromatography is that the mobile phase can be driven by pressure, which shortens the chromatographic process. Since diffusion is inevitable during chromatography, a shorter run time leads to a higher separation resolution. The first pressure-driven column chromatography was reported by W. Clark Still[10], and the new technique was a big step forward for chromatography. Modern column chromatography instruments use pre-packed columns, and the mobile phase is pumped through the column by gradient pumps. In addition, detectors such as UV/vis can be installed after the column to analyze the chemical components in real time.

The mobile phase of chromatography may also vary, and each physical state of the mobile phase has its pros and cons. Gas and liquid are the two frequently used and efficient mobile phases, giving gas chromatography (GC) and liquid chromatography (LC), respectively[11]. Gas chromatography is conducted with a packed-column stationary phase. The driving force of the separation is the partition equilibrium between the stationary phase and the mobile gas phase. At partition equilibrium, the distribution of a substance between the two phases no longer changes and the distribution ratio remains constant. The stationary phase of GC is either a solid or a liquid, and the mobile gas is often helium. A GC column has a small diameter, usually around 0.5 mm, and even smaller for a capillary column. In most instruments, GC uses high pressure to push the gas mobile phase through the column. The high pressure makes GC efficient but, on the other hand, prevents its application to high molecular weight molecules such as polymers and biomolecules. Therefore, GC is generally applied to small-molecule analysis, such as air quality monitoring. LC is more versatile: it can be conducted with a column, paper, or a thin-layer bed and can separate most molecules. Nowadays, the liquid mobile phase is normally forced by high pressure through a column packed with small particles or a porous membrane, a technique referred to as high-performance liquid chromatography (HPLC). Generally, the mobile phase of HPLC, usually an organic solvent, is considered less polar than the stationary phase, which is usually a silica-based material.

Several special chromatographic techniques share the principle of regular chromatography and have been introduced for chemical analysis and separation. For example, ion-exchange chromatography is based on the ion-exchange mechanism and separates analytes with different charges. It is mostly employed in biology and biochemistry for the purification of charged compounds, such as proteins. Hydrophobic interaction chromatography is based on polarity differences: the non-polar side chains of proteins can interact with hydrophobic groups on the stationary phase.

Depending on the application, a chromatography system is designed for preparation, analysis, or both[7]. Preparative chromatography is normally designed for separating relatively large amounts of mixture and employs recirculation to enhance the resolution. Analytical chromatography, on the other hand, is more sensitive and accurate and is usually designed for tiny amounts. Chromatography techniques continue to develop rapidly. New stationary materials and mechanical designs boost both the separation ability and the efficiency of chromatography. A common HPLC can now separate a mixture in a couple of minutes and analyze a sample in as little as a few seconds. In addition, the development of control software and simulation methods contributes much to the evolution of chromatography techniques[12,13].

1.2 High Performance Liquid Chromatography

High performance liquid chromatography, previously called high pressure liquid chromatography, was developed from classical liquid chromatography. Classical liquid chromatography, which uses a glass tube packed with silica, is still extensively used in organic chemistry and drug synthesis[8]. As illustrated in Figure 1.3 a, the mixture is loaded on the top of the column and the solvent flows through the column under gravity. Each component of the mixture interacts with the silica with a different strength, and these differences make the components travel through the column at different speeds. Since the components leave the column at different times, each purified component can be collected one by one. HPLC is based on the same principle as classical liquid chromatography but uses a high driving pressure for better efficiency and resolution. As illustrated in Figure 1.3 b and c, a typical modern HPLC system consists of three main parts: the pressure supply, the column, and the attached detector. Samples are injected into the column automatically, and the solvent is driven by a gradient pump instead of gravity. When the sample components pass through the column and reach the detector, real-time signals are recorded. The data obtained from an HPLC run are summarized as a chromatogram (Figure 1.3 d), which is a plot of the sample signal versus time. Since 1978, when the prototype of modern HPLC, called flash liquid chromatography, was first applied, HPLC has evolved into one of the premier methods for separation and analysis. As shown in Figure 1.3 a, an HPLC can separate a mixture of 15 different molecules within one minute. In Figure 1.3 b, the two isomers Sc3N@C80-D5h and Sc3N@C80-Ih, which are extremely similar, can also be successfully separated by HPLC using a 5-PYE column[14]. In most cases, a mixture is separated by HPLC in multiple steps using several different columns to maximize the separation ability. In the 1980s, the invention of the protein separation column started the application of HPLC in biology and biochemistry[15,16].

Figure 1.3 Column chromatography. (a) Basic column chromatography driven by gravity. (b) Modern high performance liquid chromatography. (c) Diagram of key components of HPLC. (d) A sample HPLC chromatogram for a fullerene derivative.

The development of modern HPLC has dramatically enhanced the ability to analyze and purify samples. Great effort continues to be devoted to improving HPLC systems to achieve even better resolution and capability. The theory of chromatography plays an essential role in this development. Before the first HPLC system was built, theory had already predicted that pressure-driven chromatography could achieve high efficiency in mixture separations. The design of an HPLC system relies largely on theoretical calculations. For example, the radius of the particles is chosen on the basis of theoretical calculations to achieve the maximum separation capacity.

Chromatographic models to date are based mainly on mass transport and differential equations. Chromatography is affected by several variables: the temperature, the flow rate of the solvent, the column material, and the analytes. Under controlled conditions, the experimental results depend only on the analytes. The retention time $t_R$, the time a component takes to travel from injection to its appearance at the detector, is introduced to measure the retention strength. On a chromatogram, the retention time is the interval between the start of the run and the top of the peak of an analyte component. The solvent also has a retention time, called the dead time $t_0$. During the chromatographic process, the analyte molecules are distributed between the stationary and mobile phases and remain in equilibrium at all times. To measure the traveling rate of a component a, the migration rate $u_a$ is calculated as $u_a = R\,u$, where R is the fraction of component a in the mobile phase and u is the velocity of the solvent. The capacity factor $k = (1 - R)/R$ measures the distribution of a solute between the stationary phase and the mobile phase. It is also related to the dead time and the retention time by $k = (t_R - t_0)/t_0$[7]. The capacity factor is often used to measure the retention behavior of a molecule in a given column at a given solvent flow rate. It has been previously reported that the capacity factor is related to certain physical properties of the molecules.
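
The relations in this paragraph can be checked numerically. The sketch below assumes the migration rate is the mobile-phase fraction R times the solvent velocity u, and uses k = (tR - t0)/t0 together with k = (1 - R)/R, which rearranges to R = 1/(1 + k). The function names, the column length, and the times are hypothetical values chosen for illustration.

```python
def capacity_factor(t_r, t_0):
    """k = (tR - t0) / t0, from the retention time and the dead time."""
    return (t_r - t_0) / t_0


def mobile_fraction(k):
    """R = 1 / (1 + k), obtained by rearranging k = (1 - R) / R."""
    return 1.0 / (1.0 + k)


def migration_rate(k, solvent_velocity):
    """u_a = R * u: the analyte moves only while it is in the mobile phase."""
    return mobile_fraction(k) * solvent_velocity


# Hypothetical run: 300 mm column, dead time 2 min, retention time 6 min
column_length = 300.0                 # mm
t_0, t_r = 2.0, 6.0                   # min
u = column_length / t_0               # solvent velocity, mm/min
k = capacity_factor(t_r, t_0)
u_a = migration_rate(k, u)
print(k, mobile_fraction(k), u_a, column_length / u_a)
```

The last printed value recovers the retention time from the migration rate, which shows that the two expressions for k describe the same quantity.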

During the chromatographic process, there is always a partition equilibrium for each molecule distributed between the solvent and the stationary phase. The equilibrium is governed by the intermolecular interactions between the analyte molecules and the stationary phase material. Theoretically, the retention behavior is predictable from the intermolecular interactions, and the real chromatogram could be calculated if all interactions were considered. However, the interactions are too complicated to be modeled comprehensively. Therefore, identifying the primary interactions and modeling the retention behavior on the basis of these is a feasible plan. The accuracy of the prediction depends largely on the structures of the molecules, and customized models for a certain family of molecules are helpful. For example, fullerenes/metallofullerenes have rigid structures, and the interactions between these molecules and the stationary phases are not difficult to track. Therefore, relatively simple models give good estimates of fullerene/metallofullerene retention behavior[17,18]. Intermolecular interactions fall into five groups: dispersion interactions, dipole-dipole interactions, hydrogen bonding, ionic interactions, and π-π interactions[19]. Dispersion interactions are universal between any pair of molecules and are significant for less polar molecules. They exist not only in the mobile phase but also in the stationary phase, so they are minor factors in the partition equilibrium. Hydrogen bonding partially shares features of covalent bonding and is much stronger than most other intermolecular interactions. Hydrogen bonding attractions normally occur between a hydrogen atom and an electronegative atom, which may be nitrogen, oxygen, or fluorine. Ionic interactions occur between a charged molecule and other charged or polar molecules around it. They reach a maximum when positively and negatively charged molecules are aligned next to one another. The π-π interactions occur between any two aromatic systems and are critical for holding the structures of biomolecules. Charge transfer often occurs when two π systems form a complex with each other. For the separation of fullerenes/metallofullerenes by HPLC, the π-π interactions play a major role[20,21].

Figure 1.4 Details of HPLC column with solvent flows.

The design of the column structure is critical for effective separations. As illustrated in Figure 1.4, silica particles carrying the stationary material are packed in a steel column. Column design in HPLC has long been investigated in pursuit of high separation resolution and efficiency. In practice, the separation capacity of a column is affected by a number of factors, such as the particle size and the column diameter. On a chromatogram, each solute appears as a peak, and each peak is not a sharp line but has a width. The peak width arises from the injection band and from diffusion, and it is a critical characteristic of the separation. Usually, the peak width is measured as the baseline width or the half-height width. Within a certain time, the separation ability of a column is determined by the number of peaks that can be fitted on the chromatogram. The plate number, $N = 16\,(t_R/W)^2$, is applied to estimate the separation capacity of a column, and N is independent of the solute. Theoretically, a chromatographic peak follows a Gaussian curve, and the baseline peak width $W$ equals four times the standard deviation σ. Based on the previous equations, the peak width can be expressed as $W = 4N^{-0.5}\,t_0(1 + k)$[22–24]. Usually, the peak width in a chromatogram increases with the retention time. It is still a challenge to isolate molecules that interact strongly with the stationary phase and have long retention times, because diffusion makes their peaks extremely wide.
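
To make the link between plate number and peak width concrete, the sketch below assumes the standard baseline-width definition N = 16 (tR/W)^2 and its rearrangement W = 4 N^(-0.5) t0 (1 + k), using tR = t0 (1 + k). The function names and the numerical values are hypothetical.

```python
import math


def plate_number(t_r, baseline_width):
    """N = 16 * (tR / W)^2, with W the baseline peak width (same time unit as tR)."""
    return 16.0 * (t_r / baseline_width) ** 2


def baseline_width(n, t_0, k):
    """W = 4 * N^(-1/2) * t0 * (1 + k), the rearranged form of the plate number."""
    return 4.0 * t_0 * (1.0 + k) / math.sqrt(n)


# Hypothetical peak: tR = 6.0 min with a 0.24 min baseline width
n = plate_number(6.0, 0.24)

# For the same column (same N and t0 = 2.0 min), a more retained solute gives a wider peak
w_early = baseline_width(n, 2.0, 2.0)   # k = 2, tR = 6 min
w_late = baseline_width(n, 2.0, 8.0)    # k = 8, tR = 18 min
print(round(n), w_early, w_late)
```

In this model the width grows in proportion to the retention time, which is the reason strongly retained molecules give very wide peaks.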

The plate number N is an essential parameter for estimating the retention behavior within a chromatography column. A larger plate number leads to a higher separation resolution and capacity. From the plate number N and the column length L, the plate height is defined as H = L / N. The plate height H measures the separation capability of a single plate layer: the smaller H is, the more plates fit in a given length. According to this relationship, we may increase the total length L or reduce the plate height H to enhance the separation ability of a chromatography column. However, in reality, the length of a chromatography column is quite limited. The diffusion of the analytes during retention is one of the key factors limiting the column length. A longer column increases the retention time of an analyte, and a longer retention time broadens the peak because of diffusion. A broad peak may be hard to recognize, because the peak height decreases for a fixed amount of analyte and the signal-to-noise ratio is reduced.
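
The trade-off described above can be sketched with H = L / N. The example assumes that doubling the column length at a fixed plate height and flow velocity doubles both the plate number and the retention time, and it uses the baseline-width relation W = 4 tR / sqrt(N). The helper names and numbers are hypothetical.

```python
import math


def plate_height(length, n):
    """H = L / N."""
    return length / n


def peak_width(t_r, n):
    """Baseline width W = 4 * tR / sqrt(N)."""
    return 4.0 * t_r / math.sqrt(n)


# Hypothetical 250 mm column with N = 10000 plates and tR = 6.0 min
length, n, t_r = 250.0, 10000.0, 6.0
h = plate_height(length, n)            # plate height in mm

# Doubling the length at the same H doubles N and, at the same velocity, doubles tR
long_n = (2.0 * length) / h
long_t_r = 2.0 * t_r
width_ratio = peak_width(long_t_r, long_n) / peak_width(t_r, n)
print(h, long_n, width_ratio)
```

Under these assumptions the longer column doubles the plate number, but each peak also becomes wider by a factor of the square root of two, so its height drops for the same amount of analyte.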

The plate number is affected by several factors, such as the particle diameter and the mobile phase flow rate. It is well established that, as the flow rate increases, the plate number first increases and then decreases. The plate number keeps increasing as the particle diameter decreases. What limits a continued increase of the flow rate and decrease of the particle size is the pressure. Up to a numerical prefactor fixed by the choice of units, the pressure required for a column scales as $P \propto \dfrac{L\,\eta\,F}{d_p^{2}\,d_c^{2}}$, where L is the column length (mm), η is the viscosity of the mobile phase liquid (cP), F is the mobile phase flow rate (mL/min), $d_p$ is the particle diameter (µm), and $d_c$ is the internal diameter of the column (mm)[7]. The pressure is limited by the column and the other parts of the HPLC, and the pressure applied nowadays in general HPLC is between 10,000 and 15,000 psi.
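
To see why pressure limits the particle size, the sketch below compares two columns under the assumption that the pressure scales as L η F / (dp^2 dc^2), the usual packed-bed dependence on the variables listed above. The numerical prefactor is deliberately left out, so only ratios between columns are meaningful. The function name and the column dimensions are hypothetical.

```python
def relative_pressure(length, viscosity, flow, particle_diameter, column_diameter):
    """Pressure up to an unspecified constant factor: L * eta * F / (dp^2 * dc^2).

    Assumed scaling law for illustration; it returns a relative value, not psi.
    """
    return length * viscosity * flow / (particle_diameter ** 2 * column_diameter ** 2)


# Same 150 mm x 4.6 mm column, 1 cP solvent, 1 mL/min; particles of 5 um versus 2.5 um
p_large = relative_pressure(150.0, 1.0, 1.0, 5.0, 4.6)
p_small = relative_pressure(150.0, 1.0, 1.0, 2.5, 4.6)
ratio = p_small / p_large
print(ratio)
```

Under this assumption, halving the particle diameter raises the required pressure fourfold at the same flow rate, which is why smaller particles call for higher-pressure instruments.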

The solute in the mobile phase forms a band in the column as it moves from the injector toward the end of the column. The band is broadened by extra-column and intra-column contributions. First, the injected solute always occupies a finite volume before it enters the column. In most cases, the injection volume is small and the extra-column effect contributes little to peak broadening. In certain situations, for example in the scale-up of liquid chromatography, the extra-column effect may be considerable and has to be modeled for a good estimate. The diffusion of the solute molecules along the length of the tube is a major cause of peak broadening. This diffusion increases with the retention time, so the peak of an analyte with a long retention time is relatively broad and may even disappear from the chromatogram. Increasing the flow rate, on the other hand, reduces the diffusion broadening. Besides longitudinal diffusion, eddy diffusion also contributes to peak broadening. Eddy diffusion is a diffusion process caused by the swirling of a fluid. The arrangement of particles inside the column inevitably leaves uneven spaces between particles: the solute migrates quickly through the wide channels and slowly through the narrow ones. Although narrowing the particle size distribution helps decrease eddy diffusion, it is impossible in reality to make all particles identical, so eddy diffusion is unavoidable. Eddy diffusion is hard to include in a mathematical model, but fortunately experiments suggest that it is not significant. Mass transfer in the mobile phase and in the stationary phase also causes peak broadening. In the mobile phase, the flow at the center is always faster than the flow near the margin, which widens the band of molecules. In the stationary phase, molecules may penetrate to different depths into the particles, and molecules that move deeper take longer to return to the mobile phase. The broadening effects are summarized in Figure 1.5.


Figure 1.5 Major diffusion effects. (a) The initial peak width from the injection. (b) The mobile phase diffusion. (c) The eddy diffusion. (d) The diffusion from adsorption on the stationary phase.

The final peak-broadening effects are expressed as

$$W^2 = \sum_i W_i^2,$$

where $W$ is the total peak width and $W_i$ represents the contribution of the i-th component. As discussed above, each $W_i$ is categorized as either an extra-column or an intra-column effect. Then

$$W^2 = W_0^2 + W_{EC}^2,$$

in which $W_0$ and $W_{EC}$ represent the intra- and extra-column effects, respectively. $W_{EC}$ is independent of the retention time, while $W_0$ increases with it. Extra-column broadening therefore mostly affects early peaks, and intra-column broadening is significant for late peaks. In most cases, the intra-column effect is the major reason for peak broadening. If we ignore the extra-column effects and show the intra-column effects in detail, the total broadening is

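$$W^2 = W_L^2 + W_E^2 + W_{MP}^2 + W_{SP}^2,$$

where $W_L$ is the longitudinal diffusion, $W_E$ is the eddy diffusion, $W_{MP}$ is the mobile-phase mass transfer and $W_{SP}$ is the stationary-phase mass transfer.

Considering the equations $W = 4N^{-0.5}t_R$ and $N = L/H$, in which $N$ is the plate number, $H$ is the plate height and $L$ is the column length, $W^2$ can be estimated as

$$W^2 = \frac{16\,t_R^2}{L} \cdot H \approx \text{constant} \times H$$

for a given peak. Therefore, the relationship between the components of the peak width is converted to a relationship between the corresponding $H$ values, which is

$$H = H_L + H_E + H_{MP} + H_{SP}.$$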

This equation is similar to the peak-width relationship, but it is not always correct, since the $H$ values are not independent of each other. For example, experiments suggest that the eddy diffusion and the mobile-phase mass transfer interact with each other and can be combined into one coefficient. Therefore, an equation estimating the band broadening from the theories and the experimental results is

$$H = \frac{B}{F} + A F^{1/3} + C F,$$

where $F$ is the solvent flow rate, $B$ is the longitudinal diffusion coefficient, $A$ is the combined eddy diffusion and mobile-phase mass transfer coefficient and $C$ is the stationary-phase mass transfer coefficient. This is the Knox equation[25,26]. Although these coefficients are closely related to the column properties, an approximate set of coefficient values is $A = 1$, $B = 2$, and $C = 0.05$.
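The Knox equation with the approximate coefficients quoted above can be explored numerically. The following is a minimal sketch rather than part of the original work: the function names `knox_plate_height` and `optimal_flow`, the scan range and the step count are illustrative assumptions, and $F$ is treated as a dimensionless flow variable.

```python
def knox_plate_height(flow, a=1.0, b=2.0, c=0.05):
    """Plate height H = B/F + A*F**(1/3) + C*F for a flow rate F > 0."""
    return b / flow + a * flow ** (1.0 / 3.0) + c * flow


def optimal_flow(a=1.0, b=2.0, c=0.05, f_min=0.1, f_max=20.0, steps=20000):
    """Scan a grid of flow rates and return (flow, height) at the smallest H.

    The scan range and resolution are assumptions made for illustration.
    """
    best_flow, best_height = None, float("inf")
    for i in range(steps + 1):
        flow = f_min + (f_max - f_min) * i / steps
        height = knox_plate_height(flow, a, b, c)
        if height < best_height:
            best_flow, best_height = flow, height
    return best_flow, best_height


best_f, best_h = optimal_flow()
print(f"minimum plate height {best_h:.3f} at flow {best_f:.2f}")
```

At low flow the $B/F$ term dominates, and at high flow the $CF$ term dominates, so the scan finds the compromise between longitudinal diffusion and mass-transfer broadening.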

These parameters come from experience and have no strict theoretical explanation. Together with the Knox equation, such estimates are used as a first step in selecting a column for practical HPLC applications. The empirical equations are a good starting point for building mathematical models of HPLC retention behaviors. As will become clear, most ideas behind HPLC models are generated by extending simple relationships observed in experiments.

Besides the retention time and peak width, the peak shape also carries essential information about the chromatography. In experiments, solutes may be recognized by their asymmetric peaks. Mathematical models are also designed to simulate the peak shape. Under ideal conditions, the peak of a solute on the chromatogram follows the Gaussian equation,

$$y = (2\pi)^{-0.5}\, e^{-x^2/2},$$

where $x = (t - t_R)/\sigma$ is the time measured from the retention time $t_R$ in units of $\sigma$, $y$ is the height of the peak (the peak intensity) and $\sigma = W/4$.

A real separation always deviates from the ideal condition to some extent for several reasons, such as the tailing effect. The models therefore also include a factor that converts the symmetric peak into an asymmetric shape.

Tailing is one of the key factors that turn symmetric peaks into asymmetric ones, and it is quantified by the asymmetry factor $A_s$ or the tailing factor $TF$, which are related by

$$A_s \approx 1 + 1.5\,(TF - 1).$$

The estimation of the asymmetry factor and the peak tailing factor is shown in Figure 1.6. The shape of a real HPLC peak (an asymmetric curve) is similar to a Landau distribution curve.

Figure 1.6 Asymmetric Gaussian curve and factors of curve characterization.

The tailing of an HPLC peak is inevitable and may decrease the resolution of the separation. Generally, tailing requires correction if the TF value is larger than 2. In most cases, tailing does not have a significant negative effect on the separation.
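The peak-shape relations above are simple enough to write down directly. The sketch below is illustrative only: the function names and the example numbers are assumptions, while the formulas are the ideal Gaussian peak with $\sigma = W/4$, the approximation $A_s \approx 1 + 1.5(TF - 1)$ and the TF > 2 rule of thumb from the text.

```python
import math


def gaussian_peak(t, t_r, width):
    """Ideal Gaussian peak height at time t.

    Uses x = (t - t_r) / sigma with sigma = width / 4, as in the text.
    """
    sigma = width / 4.0
    x = (t - t_r) / sigma
    return (2.0 * math.pi) ** -0.5 * math.exp(-0.5 * x * x)


def asymmetry_from_tailing(tf):
    """Approximate asymmetry factor: As ~ 1 + 1.5 * (TF - 1)."""
    return 1.0 + 1.5 * (tf - 1.0)


def needs_correction(tf, threshold=2.0):
    """Rule of thumb from the text: tailing needs correction when TF > 2."""
    return tf > threshold


# Hypothetical peak: retention time 10.0 min, baseline width 0.4 min.
apex = gaussian_peak(10.0, t_r=10.0, width=0.4)
one_sigma = gaussian_peak(10.1, t_r=10.0, width=0.4)
print(apex, one_sigma, asymmetry_from_tailing(1.2), needs_correction(1.2))
```

A perfectly symmetric peak has TF = 1 and therefore $A_s = 1$ under this approximation.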

Detectors are also a key part of a modern HPLC system. Although detectors have no effect on the separation process, they have a significant impact on the HPLC system’s resolution and efficiency. Several detectors are available for use in an HPLC system.

The UV/vis detector is generally used because of its high sensitivity. Normally, the limit of detection corresponds to a signal-to-noise ratio of S/N > 3. The UV/vis detector requires that the samples absorb in the UV or visible region, which is 190 – 700 nm. The detection follows Beer’s law,

$$A = \log\frac{I_0}{I} = \varepsilon b c,$$

where $A$ is the absorbance, $I_0$ is the intensity of the incident light, $I$ is the transmitted light intensity, $\varepsilon$ is the molar absorptivity coefficient, $b$ is the light path length in cm, and $c$ is the concentration in M. Generally, the UV/vis detector is quite robust and is almost unaffected by the environment, such as the temperature. UV/vis detectors typically use one of three configurations: fixed-wavelength, variable-wavelength, and diode-array detectors. Light at 254 nm is the most commonly used source, and this wavelength has a wide range of applications. The variable-wavelength detector is now the most popular for modern HPLC systems. The ability to change the wavelength increases the range of applications as well as the accuracy. The variable-wavelength detector is also able to perform real-time analysis of samples. The diode-array detector spreads the spectrum across an array of photodiodes and can detect a range of wavelengths at the same time.
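Beer’s law as stated above can be used in both directions: to compute an absorbance from light intensities, and to estimate a concentration from a measured absorbance. The snippet below is a minimal sketch; the function names and the example molar absorptivity (1.0e4 per M per cm) are hypothetical values chosen only for illustration.

```python
import math


def absorbance(incident, transmitted):
    """Absorbance A = log10(I0 / I)."""
    return math.log10(incident / transmitted)


def concentration(absorbance_value, epsilon, path_cm=1.0):
    """Invert A = epsilon * b * c to obtain the concentration c in M."""
    return absorbance_value / (epsilon * path_cm)


# Hypothetical reading: 10% of the incident light is transmitted.
a_value = absorbance(100.0, 10.0)
c_value = concentration(a_value, epsilon=1.0e4, path_cm=1.0)
print(a_value, c_value)
```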

Depending on the properties of the solute molecules, other detectors are also useful. For example, electrochemical detectors can be applied to molecules that can be oxidized or reduced. This type of detector is quite selective and sensitive. However, the electrochemical detector cannot be widely applied because of many limitations, such as the conductivity of the solvent. In summary, most analytical techniques can serve as the basis of a detector, and each has its advantages and disadvantages.

1.3 HPLC in Fullerenes/Metallofullerenes Separations

Since their discovery, fullerenes have drawn major interest from the science and engineering communities. Previous studies have established that fullerenes and metallofullerenes are potential materials for various applications, such as healthcare and electronic devices[27–29]. It has been shown that, for a given number of carbon atoms, there can be thousands of isomers, and each isomer may have its own advantages due to its cage structure[30,31]. Metallofullerenes are those with metal clusters encapsulated inside the cage, such as (Sc3N)6+. Nowadays, fullerenes/metallofullerenes are synthesized via the arc-discharge method using a Huffman-Krätschmer generator[32]. The modern synthesis method produces a mixture of fullerene/metallofullerene isomers. The mixture is first handled by chemical separations and then further purified by multi-step HPLC. The chemical separation and the coarse HPLC step eliminate most graphite pieces and impurities, and refined HPLC is then employed to separate the individual isomers.

The refined HPLC separation of fullerene isomers requires high resolution, since two fullerene isomers may differ only subtly[20]. As illustrated in Figure 1.7, the two isomers of C90 differ only slightly, and their separation is a challenge. However, such pairs of similar fullerenes are good examples for fundamental studies, because a subtle change in the structure may lead to a large difference in physical properties. HPLC has shown outstanding ability in fullerene/metallofullerene separations. However, why these molecules can be separated is not yet well understood theoretically. Therefore, it is urgent to understand the mechanism of fullerene separations[14].

Figure 1.7 The subtle difference between C90-D5d and C90-D5h.

The prediction of the retention behaviors of small molecules has long been investigated, and several models provide good results[33,34]. It is still a big challenge to predict the retention behaviors of large molecules, such as proteins[35–38]. The retention of fullerenes/metallofullerenes, due to their rigid structure and spherical surface, is relatively easy to predict[39]. Also, the delocalized electron structure of fullerenes/metallofullerenes provides a readily available example for studying the π-π interactions between the fullerene/metallofullerene molecules and the stationary phase. The π-π interactions have been demonstrated in previous studies to be one of the main factors for the retention of proteins in chromatography columns. We believe the mechanism of fullerene/metallofullerene retention provides essential information for improving the design of chromatography methods for proteins as well as for other large biological molecules.

The structure of a fullerene/metallofullerene molecule contains a large delocalized π system[40]. The interactions with the stationary phase, although the details remain unknown, mostly depend on this large delocalized electron system. In addition, fullerene and metallofullerene molecules show electron-donating and electron-withdrawing abilities toward other aromatic molecules. These structural features suggest that stationary phases for fullerene/metallofullerene separation should contain large π systems to obtain efficient retention. Several π-system stationary phases have been designed and are commercially available. One of the commonly employed stationary phases is the 3-[(pentabromobenzyl)oxy]propylsilyl (PBB) derivatized silica phase, whose high capacity suits preparative separations. Other columns, such as 5PYE, may have a lower capacity but retain fullerenes/metallofullerenes more strongly, which makes them useful for hyperfine separations[41]. Purifying a single fullerene/metallofullerene is always a challenge, and the isolation involves multiple steps and several different columns because of the limited resolution of a single separation step. The design of a stationary phase requires cross-disciplinary knowledge, such as physics and organic synthesis. It is also a challenge to characterize and evaluate a novel stationary phase. In biology and biochemistry studies, most molecules contain more than one aromatic ring, and these molecules are grouped into the polycyclic aromatic hydrocarbon (PAH) family. Fullerene/metallofullerene molecules are ideal molecular probes for characterizing stationary phases designed for PAHs because of their rigid structures[42]. New stationary phases can then be designed first on the basis of the retention of fullerene/metallofullerene molecules and afterwards applied to PAH separations.

The study of the separation mechanisms of fullerene/metallofullerene molecules is essential for stationary phase design[20]. Although there are a significant number of stationary phases, the mechanisms are quite similar. Since plenty of experimental data are readily available for the 5PBB and 5PYE columns, these two columns are a good choice as examples[43]. As reported previously[19], the interaction between a fullerene molecule and the stationary phase is mainly the van der Waals interaction, which can be modeled by dispersion forces, but the modeling is nontrivial. Although we cannot readily model the retention behaviors from the physical forces, a linear relationship between the chromatographic retention parameter, the capacity factor (K), and the empty-cage fullerene carbon cage number (N) has long been observed[44–46].

The relationship has been validated on different chromatographic stationary phases and with various fullerene/metallofullerene molecules. The empty-cage fullerene carbon number is the number of π electrons on the fullerene cage. This observation confirms the analysis that the retention mainly depends on the attraction between the π electrons and the stationary phase. The empty-cage carbon number is easy to count for empty-cage fullerenes, but it is hard to estimate for metallofullerenes. Most metallofullerene cages accept electrons from the inner clusters, and the total number of electrons is uncertain. The number of electrons transferred from the inner clusters to the outer cage is hard to measure. Therefore, the linear relationship helps to predict metallofullerene retention behaviors. On the other hand, if we know the corresponding empty-cage fullerene retention behaviors, we can estimate the charge transfer based on this linear relationship. It is also important to find other relationships, especially those based on physical properties. As reported by Kappes and coworkers[46], certain low-polarity chromatographic stationary phases, like PBB and PYE, show weak induced dipole-dipole interactions and London dispersion forces with fullerene/metallofullerene molecules. Further studies confirm that the π-π interactions between the stationary phase and the fullerene cage dominate the retention and also verify the validity of the linear relationship between the fullerene empty-cage number and the retention capacity factor. An essential observation is that empty-cage fullerenes, although they may have very different shapes, have only a tiny dipole moment. This means that the effect of the dipole moment on their retention behaviors can be neglected.
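The use of the linear relationship to estimate charge transfer can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the cited studies: the slope, intercept and retention value below are synthetic placeholders (not measured data), the function names are hypothetical, and the electron transfer is read as the difference between the effective cage number obtained from the line and the actual carbon number.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def effective_cage_number(log_k, slope, intercept):
    """Invert the empty-cage line to get the cage number matching log K."""
    return (log_k - intercept) / slope


# Synthetic placeholder data, generated from an assumed line lg K = 0.02 N - 1.0.
cage_numbers = [60, 70, 76, 78, 84]
log_k_values = [0.02 * n - 1.0 for n in cage_numbers]
slope, intercept = fit_line(cage_numbers, log_k_values)

# Hypothetical metallofullerene on a C80 cage with an assumed lg K of 0.72.
n_eff = effective_cage_number(0.72, slope, intercept)
electron_transfer = n_eff - 80
print(f"effective cage number {n_eff:.1f}, estimated transfer {electron_transfer:.1f}")
```

With real data the points scatter around the line, so the estimated transfer inherits the uncertainty of the fit and of any dipole-moment contribution that the line does not capture.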

The first discovered fullerene is the empty-cage C60 molecule, which is also the most abundant fullerene/metallofullerene molecule[47]. Shortly after that, the second-most abundant molecule, C70, was isolated[48]. Then other empty-cage fullerenes, such as C76, C78, and C84, were reported[49]. Along with the discovery of empty-cage fullerenes, metallofullerenes were also found and isolated[50]. The yield of metallofullerenes at that time was extremely low, and they could not be collected in macroscopic amounts for further investigation. The study of metallofullerenes progressed slowly until 1999, when the first trimetallic nitride template endohedral metallofullerene (TNT-EMF) was synthesized and reported[51]. An interesting observation on all these known fullerene/metallofullerene isomers is that none of them has fused pentagonal faces[52,53]. This has been summarized as the isolated pentagon rule (IPR), which states that any fused pair of pentagons destabilizes the fullerene/metallofullerene molecule. In 2000, two metallofullerenes violating the IPR were reported independently by Shinohara et al. and Dorn et al. Along with this discovery of non-IPR isomers, it was rapidly recognized that these non-IPR isomers have significantly long retention times on HPLC stationary phases[54,55]. After that, as more and more non-IPR isomers were synthesized, it became clear that fused pentagons increase the retention on stationary phases, and therefore all of them have longer retention times than IPR isomers of the same size. In 2012, it was reported that fused pentagons enhance the dipole moments of fullerene/metallofullerene molecules[56]. Based on the retention and dipole moment results, it was readily recognized that the dipole moment can be a dominant factor for the retention behaviors of fullerene/metallofullerene molecules[19].

The large collection of chromatographic retention data for empty-cage fullerenes yields a linear relationship between the empty-cage number and the chromatographic capacity factor. The retention times of metallofullerenes deviate from this line, and the deviation was first attributed to electron transfer effects. However, for some metallofullerenes, the deviation is too large to be attributed to the contribution of extra electrons from the inner clusters. Considering the observations on non-IPR metallofullerenes, the dipole moment is considered a key factor. Mono-metallofullerenes are an excellent example, because their electron transfer is relatively easy to study: the single metal cation has the least ability to stabilize electron density and transfers a specific number of electrons to the fullerene cage. Formally, we assume that each rare-earth metal cation donates three electrons to the outer cage. Although this estimate has since been shown to be inexact, the error is not significant. As illustrated in Figure 1.8, the two metallofullerenes with enhanced dipole moments, La@C82 and La2@C80, deviate considerably from the linear regression line between the empty-cage number and the chromatographic retention factor. The deviation is attributed to the effect of the dipole moment[19]. The dipole moments of fullerene/metallofullerene molecules are one of the dominant factors for the retention time and the separation.


Figure 1.8 The linear relationship between the fullerene cage number and the retention parameter lg K. The two metallofullerenes La@C82 and La2@C82 (green) deviate from the regression line due to significant dipole moments.

The separation mechanism based on the empty-cage carbon number and the dipole moment works well for a variety of fullerenes and metallofullerenes. However, the separation of two TNT-EMF molecules, Sc3N@C80-Ih and Sc3N@C80-D5h, is quite unexpected. Both isomers have an identical inner cluster and similar fullerene cages. The difference between a C80-Ih cage and a C80-D5h cage is subtle: the C80-D5h cage is slightly ellipsoidal, whereas the C80-Ih cage has a perfectly spherical surface. Neither isomer has a significant dipole moment, and both should have the same number of electrons on the fullerene cage. In theory, HPLC columns should not be able to separate these two isomers; however, they are separated in experiments. Several subtle factors affect the separation, including the polarizability, charge transfer, and dipole moments. As reported previously, the empty-cage number has a linear relationship with the retention times. The separation of these two isomers does not follow that rule, and we find that the cage number of empty fullerenes has an approximately linear relationship with the polarizabilities. Therefore, we believe the polarizability of a fullerene dominates the retention behavior if the fullerene has no significant dipole moment. In addition, based on the available data, we propose a linear regression model for further prediction. We acknowledge that the model is relatively coarse because of the limited amount of data.

Recently, a class of spheroidal A3N@C80-Ih and ellipsoidal A3N@C80-D5h metallofullerenes has become readily available for several metals. We measured all the retention times on a PBB column and a PYE column. As mentioned previously, we employ the linear relationship between the empty-cage number and the capacity factor to estimate the cage numbers of each of the spheroidal A3N@C80-Ih and ellipsoidal A3N@C80-D5h molecules. As shown in Figure 8, the A3N@C80-Ih series is predicted to have formally six electrons transferred from the inner cluster to stabilize the neutral (C80-Ih)0 cage. On a PBB column, the predicted electron transfer is 4.6 – 6.5, and on a PYE column, the number is 5.4 – 6.2. The spread of the predicted electron transfer clearly shows that the retention depends not only on electron transfer but also on other factors that are affected by the change of stationary phase. As expected, Sc3N@C80-Ih exhibits reduced transfer (3.3 electrons) and weaker retention. The reduced electron transfer and weaker chromatographic retention are due to p-d orbital back-donation, which has been demonstrated in previous studies. However, a surprising feature of the data is the small but constant difference between the chromatographic retention parameter K of the A3N@C80-Ih and the A3N@C80-D5h isomers. To a first-order approximation, this difference between isomers is independent of the metal atom in the internal cluster. We then inspected the polarizabilities of a series of empty-cage fullerenes (C60, C70, C78, and C84) as well as the TNT-EMFs Sc3N@C68-D3, Sc3N@C78-D3h, Sc3N@C80-Ih, and Sc3N@C80-D5h. We plot these polarizability data versus the experimentally measured chromatographic parameter, ln(tR/t0).

As illustrated in Figure 1.9, there is a linear dependence between the polarizabilities and the chromatographic retention behaviors on both PBB and PYE columns[14]. As expected, the slopes on the same stationary phase are the same, while on different phases, where the interactions differ, the slopes are distinct. The observed characteristic offset between the line for the empty-cage fullerenes and the line for the EMFs is due to charge transfer. This data set, although limited, shows an identical slope for both empty-cage fullerenes and EMFs on the same stationary phase. The polarizability difference between Sc3N@C80-Ih and Sc3N@C80-D5h is approximately 0.7 Å³. This subtle polarizability difference leads to a significant difference in the chromatographic retention times, which allows these two isomers to be separated well. We believe the polarizability is a driving force for the fine separation of fullerene/metallofullerene isomers. In addition, we acknowledge that the polarizability is not evenly distributed over fullerene/metallofullerene cages, and site-specific polarizabilities may also affect the separation. Further investigation is required to understand the site-specific polarizability effects.
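The picture described above, one slope per stationary phase with a constant offset between empty-cage fullerenes and EMFs, corresponds to a least-squares fit of parallel lines. The sketch below shows one way to compute it; the function name `fit_shared_slope` is hypothetical, and the numbers are synthetic placeholders generated from assumed lines, not the measured data of Figure 1.9.

```python
def fit_shared_slope(groups):
    """Fit y = slope * x + intercept_g with a single slope for all groups.

    groups maps a label to a list of (x, y) pairs, where x is the
    polarizability and y is ln(tR/t0) on one stationary phase.
    """
    sxx = 0.0
    sxy = 0.0
    means = {}
    for label, points in groups.items():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        means[label] = (mean_x, mean_y)
        # Pool the within-group sums so that each group keeps its own offset.
        sxx += sum((x - mean_x) ** 2 for x, _ in points)
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercepts = {label: my - slope * mx for label, (mx, my) in means.items()}
    return slope, intercepts


# Synthetic placeholders: both groups share an assumed slope of 0.03.
data = {
    "empty": [(x, 0.03 * x - 2.0) for x in (80.0, 100.0, 110.0, 120.0)],
    "emf": [(x, 0.03 * x - 1.5) for x in (95.0, 105.0, 110.0)],
}
shared_slope, offsets = fit_shared_slope(data)
print(shared_slope, offsets["emf"] - offsets["empty"])
```

In this reading, the shared slope reflects the polarizability dependence on a given column, and the difference between the two intercepts is the offset that the text attributes to charge transfer.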

Figure 1.9 The linear dependence between the average polarizability and the chromatography retention time.

We observe that the polarizabilities increase with increasing empty-cage carbon number. Therefore, we believe the effect of the number of electrons on the cage is consistent with the polarizability effect on retention. The chromatographic retention can be modeled based on dipole moments and polarizabilities. As mentioned previously[14], the retention behavior is an equilibrium process, and the retention time depends on the Gibbs free energy difference $\Delta G$ between the fullerene in solution and the fullerene adsorbed on the stationary phase (taken here as positive when adsorption is favorable),

$$\ln\frac{t_R}{t_0} \propto \frac{\Delta G}{RT},$$

where $t_R$ is the retention time, $t_0$ is the dead time, $R$ is the gas constant, and $T$ is the temperature. We first model the retention of empty-cage fullerenes and then extend the model to EMFs, which have significant dipole moments. The interactions between fullerene/metallofullerene molecules and the stationary phases are mainly van der Waals interactions. The London expression of the long-range attractive potential can be written as,

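$$\Delta E = \frac{3\,\alpha_s\,\alpha\,I_s I_a}{2\,r^6\,(I_s + I_a)},$$

where $\Delta E$ is the magnitude of the attractive potential, $\alpha_s$ and $\alpha$ are the polarizabilities of the stationary phase and the analyte, $r$ is the mean center-to-center van der Waals distance between the fullerene/metallofullerene molecule and the stationary phase, and $I_s$ and $I_a$ are the ionization potentials of the stationary phase and the analyte, respectively.

Combining the two equations, we find that the logarithm of the chromatographic retention factor is proportional to the polarizability of the analyte on the same stationary phase if $\Delta G = \Delta E$:

$$\ln\frac{t_R}{t_0} \propto \frac{\Delta E}{RT} = \text{constant} \times \alpha.$$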

This equation holds for empty-cage fullerenes, which have no significant dipole moment. For EMFs with significant dipole moments, we have to include an additional contribution in the formula. This extra chromatographic retention is due to the dipole moment. Based on equilibrium theory, the dipole-induced interactions between a fullerene/metallofullerene molecule and the stationary phase lead to an extra contribution to the Gibbs free energy, which is denoted $\Delta(\Delta G)$. This increase in the free energy difference between fullerene/metallofullerene molecules in the solvent and those adsorbed on the stationary phase increases the retention time. This part is the contribution of the dipole moments to the chromatographic retention, and can be modeled in terms of the gas-phase system,

$$\Delta(\Delta G) = \frac{\mu_s^2\,\alpha + \mu^2\,\alpha_s}{r^6},$$

where $\mu_s$ and $\mu$ are the dipole moments of the stationary phase and the fullerene/metallofullerene molecule, respectively, and $\alpha_s$ and $\alpha$ are the corresponding polarizabilities. Most stationary phases for fullerene/metallofullerene separations have a negligible dipole moment. We therefore assume that the stationary phase is nonpolar, and the formula can be simplified by neglecting the term containing the stationary phase dipole moment:

$$\Delta(\Delta G) = \frac{\mu^2\,\alpha_s}{r^6}.$$

Considering the linear relationship between the empty-cage carbon number and the chromatographic retention factor, the polarizability is approximately proportional to the empty-cage number, that is, the total number of electrons on the cage. This is true for both empty-cage fullerenes and EMFs, and the enhanced retention of EMFs compared with empty-cage fullerenes comes entirely from the dipole moment contribution. Therefore, we can summarize the equations as one,

$$\ln\frac{t_R}{t_0} = C + A\alpha + B\mu^2,$$

in which $\alpha$ and $\mu$ are the polarizability and the dipole moment of the analyte, $C$ is a constant, and $A$ and $B$ collect the stationary-phase quantities defined above, $A \propto \frac{3\,\alpha_s I_s I_a}{2\,r^6 RT\,(I_s + I_a)}$ and $B \propto \frac{\alpha_s}{r^6 RT}$. If the fullerene/metallofullerene is nonpolar, the third term can be neglected, and the formula simplifies for empty-cage fullerenes and metallofullerenes with no dipole moment to

$$\ln\frac{t_R}{t_0} = C + A\alpha.$$
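In practice the coefficients of such a relationship would be obtained by fitting rather than from first principles. The following is a minimal least-squares sketch for a model of the form ln(tR/t0) = C + A·α + B·μ². All function names are hypothetical, and the polarizabilities, dipole moments and coefficients are synthetic placeholders chosen only to show that the fit recovers known values; they are not measured or computed data.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_retention_model(alphas, mus, log_ratios):
    """Least-squares fit of ln(tR/t0) = C + A*alpha + B*mu**2 (normal equations)."""
    rows = [[1.0, a, m * m] for a, m in zip(alphas, mus)]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(rows, log_ratios)) for i in range(3)]
    c, a, b = solve_linear_system(normal, rhs)
    return a, b, c


def predict_log_ratio(alpha, mu, a, b, c):
    """Predicted ln(tR/t0) for one analyte."""
    return c + a * alpha + b * mu * mu


# Synthetic placeholders generated from assumed A = 0.03, B = 0.05, C = -1.5.
alphas = [80.0, 95.0, 100.0, 110.0, 105.0, 115.0]
mus = [0.0, 0.0, 0.0, 0.0, 2.0, 3.5]
log_ratios = [-1.5 + 0.03 * a + 0.05 * m * m for a, m in zip(alphas, mus)]
coef_a, coef_b, coef_c = fit_retention_model(alphas, mus, log_ratios)
print(coef_a, coef_b, coef_c)
```

Setting `mu` to zero in `predict_log_ratio` reproduces the simplified two-term form for nonpolar cages.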

However, the estimation of chromatographic retention behavior depends on experimental factors, which are not easy to obtain. It has been recognized that computed polarizabilities and dipole moments may be used to evaluate the retention times. We test the model on a series of yttrium-based and terbium-based metallofullerenes. As discussed previously, fullerene/metallofullerene molecules with fused pentagons always have significant dipole moments and therefore much longer retention times than those of the same size. As illustrated in Figure 1.9, the predicted results, although not perfect, are helpful for a rough estimation of the retention times. The simple mathematical model works well for predicting retention times from polarizabilities and dipole moments. The prediction based on the average polarizability works well for fullerenes/metallofullerenes because most known fullerenes/metallofullerenes have a spherical shell, so the effect of site polarizability is not significant. Recently, we discovered a series of large empty-cage fullerenes with nanotube shapes. These fullerenes, although without significant dipole moments, do not follow the polarizability-retention relationship. The reason is that the non-spherical structure causes large differences among the polarizabilities of different sites, so the average polarizability no longer makes sense for describing the retention behavior. Besides, HPLC provides much more information than retention times. As mentioned previously, other features, such as the peak shape, are also essential for characterizing the separation. Therefore, we extend the current model to include the prediction of more information. Several mathematical models are available for general chromatography. Although these models are able to estimate fullerene/metallofullerene separations, their performance is not satisfactory. The separation mechanism of fullerene/metallofullerene molecules, however, is much simpler than that of flexible molecules. We propose that it is possible to build a model specific to fullerenes/metallofullerenes by modifying the general mathematical models based on knowledge of fullerene/metallofullerene chromatography.

2. Machine Learning Models Based on Molecular Geometry

The prediction of chromatographic retention time is essential for locating the target molecule among the various molecules in a chromatogram. Chromatographic models to date are mostly based on mass transport and derived using differential equations[57]. As discussed in Chapter 1, the rigid and spherical structures of fullerenes simplify the retention time prediction, and therefore it is possible to estimate the retention times directly from the molecular geometry.

2.1 Machine Learning Models and Applications in Chemistry

In the past few decades, the combination of computer science and chemistry has achieved great success in simulation, prediction, and experiment assistance[57–59]. With the rise of machine learning algorithms and the growth in computing power, machine learning models now have a significant impact on chemistry research[60,61]. Machine learning algorithms, which have shown their success in many areas, such as natural language processing[62,63] and computer vision[64], are suitable for handling the large number of possible structures and for molecular pattern recognition[65,66]. Linear regression is one of the simplest examples of a machine learning model[67]. In linear regression, the output depends linearly on the input features, which makes the model simple but also limits its range of application. Although a linear regression model is simple, it possesses all the features of a machine learning model. For example, a linear regression model is trained by minimizing a loss function. Non-linear features, such as squared inputs, can be added to a linear regression model, but this requires knowing much about the relationship between the input and output data. A deep learning algorithm[68], such as a feedforward neural network, may solve this problem and has more power than a linear regression model to map the input features to an output.
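The two points made above, that linear regression minimizes a loss function and that a squared feature can be added by hand, can be illustrated with a small example. This is a sketch under stated assumptions rather than the implementation used in this thesis: the data are synthetic (generated from y = 1 + 2x + 3x²), the function names are hypothetical, and plain gradient descent on the mean squared error stands in for whatever optimizer a library would use.

```python
def mse_loss(weights, features, targets):
    """Mean squared error of a linear model on the given samples."""
    total = 0.0
    for row, target in zip(features, targets):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (prediction - target) ** 2
    return total / len(targets)


def fit_gradient_descent(features, targets, learning_rate=0.1, steps=5000):
    """Minimize the mean squared error by batch gradient descent."""
    weights = [0.0] * len(features[0])
    n = len(targets)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, target in zip(features, targets):
            error = sum(w * x for w, x in zip(weights, row)) - target
            for j, x in enumerate(row):
                gradient[j] += 2.0 * error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Synthetic data from y = 1 + 2x + 3x^2 on x in [-1, 1].
xs = [i / 10.0 for i in range(-10, 11)]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]

# The model stays linear in its weights; x^2 is simply an extra input feature.
features = [[1.0, x, x * x] for x in xs]
weights = fit_gradient_descent(features, ys)
print(weights, mse_loss(weights, features, ys))
```

The squared column had to be chosen by the modeler, which is the prior knowledge about the input-output relationship that the paragraph refers to.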

Machine learning models are used to perform specific tasks without explicit instructions[2]. A machine learning model is made of a general structure and then optimized with sample data, known as “training data”. Machine learning algorithms are generally classified into two categories, supervised learning and unsupervised learning, depending on whether the training data include desired output labels. In unsupervised learning, a machine learning model is built from a set of data containing only inputs, and no desired output labels are given. Unsupervised learning algorithms, such as clustering, are used to find structures or patterns in data. In particular, dimensionality reduction belongs to the unsupervised learning algorithms[69,70]. Principal component analysis (PCA) is one of the most commonly used dimensionality reduction algorithms. PCA uses an orthogonal transformation to find the principal components, which are a set of linearly uncorrelated variables. The principal components are the directions with the largest possible variances[71]. An implementation of PCA in Python is available in the Appendix.
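To make the idea of a principal component concrete, the sketch below finds the first principal component of a small two-dimensional data set. It is not the Appendix implementation: it uses power iteration on the sample covariance matrix instead of a full eigendecomposition, the function names are hypothetical, and the data points are made up so that they lie close to the line y = 2x.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of rows of observations."""
    n = len(data)
    dim = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def first_principal_component(data, iterations=500):
    """Direction of largest variance, found by power iteration.

    Assumes the starting vector is not orthogonal to that direction.
    """
    cov = covariance_matrix(data)
    dim = len(cov)
    vector = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    variance = sum(
        vector[i] * sum(cov[i][j] * vector[j] for j in range(dim))
        for i in range(dim)
    )
    return vector, variance


# Made-up points scattered around the line y = 2x.
points = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]
direction, explained_variance = first_principal_component(points)
print(direction, explained_variance)
```

Projecting each centered point onto `direction` reduces the two coordinates to a single value while keeping almost all of the variance in this example.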

In supervised learning, there are two types of tasks, classification and regression[72,73]. A classification task has a discrete output, which is limited to several classes. For example, email filtering is a typical application of classification. A regression task, on the other hand, has continuous outputs; usually, the output values lie in a given range. Prediction of molecular properties is an example of a regression task. Several algorithms are available for regression tasks.

2.2 Feature Selection

In previous studies, the properties of fullerenes have been studied based on local structural motifs, with the overall properties treated as combinations of local properties[74–76]. Shortly after the discovery of fullerenes and the isolation of several examples, it was quickly recognized that all stable fullerene cages have pentagons isolated from each other, which was summarized as the isolated pentagon rule (IPR)[77]. The IPR is understood as the tendency of local structural motifs to satisfy the Hückel rule and a stable electrostatic distribution. Metallofullerenes, due to the effects of the inner clusters on the electrostatic distribution, may not obey the IPR, and several examples of non-IPR isomers have been isolated, such as Sc2@C66, Sc3N@C68, and La2@C72[78–81]. However, the discovery of non-IPR metallofullerene isomers does not invalidate the IPR, since the fundamental principle behind it is that the energy decreases if the arrangement of structural motifs yields a stable electrostatic distribution[82]. Early studies have also demonstrated that the local structural motifs have a direct relationship with properties such as the dipole moment. In 2013, Zhang et al. reported, from experimental observations of a family of yttrium-based metallofullerenes, that the pentalene motif, which contains fused pentagons, enhances the dipole moments of the fullerene molecules[82]. The enhanced dipole moment is confirmed by solubilities, HPLC retention behaviors, and theoretical computation[39,83,84]. Polarizability is a parameter describing the dynamic response of a molecule to external electric fields and shows a strong relation to molecular structure[85]. It is well known that the polarizability is connected with various physical and chemical processes that are dominated by the electron density distribution[86]. For example, the interactions between fullerenes and the stationary phases are mainly determined by the π-π interactions, and an electrostatic model shows that the interaction is dominated by electrostatics[87]. The local structural motifs and their arrangement on fullerene cages are strongly related to the electrostatics of fullerene molecules. Therefore, the structural motifs are linked with the polarizability and have the potential to predict the overall polarizability through certain combinations.


Figure 2.1 Face-centered motifs. Hexagon-centered motifs on the left and pentagon-centered motifs on the right.

There are various methods to represent the topology of a molecule. For example, an adjacency matrix is a direct way to show the connections between atoms[88,89]. In previous studies, this representation has been applied, and several indices derived from the adjacency matrix have been used to estimate the stabilities of fullerenes. Motifs, on the other hand, consider only local effects and estimate the overall properties from the combination of local motif contributions[90]. We acknowledge that, compared with the overall topology of a fullerene molecule, the motif representation may lose information and may count some local structures more than once.

However, it has been well established that the motif approach is successful for energy predictions, and the accuracy of predictions may be improved by selecting representative motifs according to prior knowledge. In previous studies, several motif selections have been examined to connect motifs with the properties of fullerenes. For example, Austin et al.[76] applied sixteen structural motifs to fit the energies of 1812 neutral C60 isomers, and Cioslowski et al.[74] used thirty distinct structural motifs to estimate the standard enthalpies of formation of 115 neutral IPR fullerenes. Larger motifs and a greater number of distinct motifs can give better results. However, with a limited amount of data, problems such as overfitting have to be taken into consideration[91]. Counting motifs manually also becomes too complex if the chosen motifs are too large. Therefore, it is essential to balance accuracy and simplicity by choosing representative motifs of moderate size. Recently, Wang et al. reported a comprehensive study on energy predictions of charged fullerene cages using eight hexagon-centered motifs[90]. They showed that the energies of fullerenes with a charge of -6 can be estimated by a linear combination of the eight key motifs. According to previous studies, a motif consisting of a central face and the faces around it is a good choice: it provides relatively accurate predictions without making the counting of motifs too complex. In this study, we examine each face together with the faces surrounding it and then determine the motif type. Overall, there are 13 different hexagon-centered motifs and 8 pentagon-centered motifs (see Figure 2.1).
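The face-by-face inspection described above can be sketched in a few lines of Python. The thesis does not spell out how a motif type is assigned, so this sketch assumes that a motif is defined by the cyclic sequence of pentagons and hexagons surrounding the central face, and that two sequences related by a rotation or a mirror image count as the same motif. Under that assumption the enumeration reproduces the 13 hexagon-centered and 8 pentagon-centered types. The input format of `count_motifs` and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def canonical_motif(ring):
    """Return a canonical label for the ring of faces around a central face.

    `ring` lists the sizes (5 or 6) of the neighbouring faces in cyclic
    order. Rings that differ only by a rotation or a reflection describe
    the same motif, so the lexicographically smallest variant is used.
    """
    ring = tuple(ring)
    variants = []
    for seq in (ring, ring[::-1]):            # original and mirror image
        for shift in range(len(seq)):         # all rotations
            variants.append(seq[shift:] + seq[:shift])
    return min(variants)


def count_motifs(faces):
    """Count face-centred motifs.

    `faces` is an assumed input format: one (central_face_size, ring)
    pair per face of the cage.
    """
    counts = Counter()
    for size, ring in faces:
        counts[(size, canonical_motif(ring))] += 1
    return counts


# Enumerate every pentagon/hexagon ring to see how many distinct motif
# types this definition allows for each kind of central face.
hexagon_motifs = {canonical_motif(r) for r in product((5, 6), repeat=6)}
pentagon_motifs = {canonical_motif(r) for r in product((5, 6), repeat=5)}
print(len(hexagon_motifs), len(pentagon_motifs))  # 13 8

# Ih-C60: each pentagon is surrounded by five hexagons, each hexagon by
# alternating pentagons and hexagons.
c60 = [(5, (6, 6, 6, 6, 6))] * 12 + [(6, (5, 6, 5, 6, 5, 6))] * 20
c60_counts = count_motifs(c60)
```

The resulting counts form the feature vector of an isomer; for Ih-C60 only two motif types occur, with counts 12 and 20.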

Although it is not burdensome to inspect each face on a fullerene cage and use a linear function to estimate the isotropic polarizability, we are still interested in the contribution of each type of motif and in reducing the number of distinct motifs used in the model. Therefore, we apply principal component analysis (PCA) to the data set and calculate the contribution of each motif. We use the error ratio to measure the performance. If the 13 hexagon-centered motifs are used as features, the error ratio is 1%; in other words, 99% of the useful structural information is retained. If only the first 8 motifs in Figure 2.1 are used, the error ratio is 5%, so 95% of the information is captured by these 8 motifs. Notably, leaving out the pentagon-centered motifs causes almost no decrease in accuracy. The eight most important motifs are consistent with previous studies, in which the same eight motifs were applied to obtain good results for charged metallofullerene energies. Our results indicate that the isotropic polarizability of a fullerene is dominated by the counts of hexagon-centered motifs and depends only minimally on pentagon-centered motifs.
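As a rough illustration of how the retained information can be computed, the sketch below derives the explained variance ratio of each principal component with NumPy. It assumes that the error ratio quoted above corresponds to one minus the cumulative explained variance, which the thesis does not state explicitly. The small count matrix is made up for illustration; its columns sum to a constant so that one component carries no variance, much as motif counts on a cage of fixed size are linearly dependent.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # centre each feature
    cov = np.cov(Xc, rowvar=False)              # feature covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)       # drop tiny negative round-off
    return eigvals / eigvals.sum()


def components_needed(ratios, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target - 1e-12)) + 1
    return min(k, len(ratios))


# Made-up motif counts (rows: isomers, columns: motif types). Every row
# sums to 20, so the third principal component has zero variance.
X = [[12, 5, 3], [10, 6, 4], [8, 8, 4], [11, 4, 5], [9, 7, 4], [7, 9, 4]]
ratios = explained_variance_ratio(X)
n_for_99 = components_needed(ratios, 0.99)
```

With real data, `components_needed(ratios, 0.95)` would give the number of components required to keep 95% of the variance.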


Figure 2.2 Principal component analysis plot showing the relationship between the number of features versus the percentage of information remaining.

2.3 Model Selection

To build a predictive model using fullerene structures, several options are available, ranging from simple linear regression models to relatively complex deep learning models. In this thesis, six different models are trained and tested on the polarizability data of C60 to C80; 90% of the data are used for training and 10% for testing, and the two sets are split randomly. In addition, a data set containing polarizability data of C82 to C86 is used as a second testing set to test the generalization abilities of the models. Interestingly, although all the machine learning models perform well on the first testing set, their performances on the second testing set, ranging from C82 to C86, vary a great deal. The coefficient of determination, denoted R2, is applied to measure the performances of the models. In statistics, the R2 value measures how well a model predicts, based on the proportion of the total variation of the outcomes that the model explains[92]. Consider a data set with n observed values y1, y2, ……, yn, each denoted yi, and for each observed value an associated predicted value, f1, f2, ……, fn. The residuals are defined as ei = yi - fi. The mean of the observed values is

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .$$

The total sum of squares (proportional to the variance of the data) is

$$SS_{\mathrm{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2 .$$

The regression sum of squares is

$$SS_{\mathrm{reg}} = \sum_{i=1}^{n} (f_i - \bar{y})^2 .$$

The sum of squares of residuals is

$$SS_{\mathrm{res}} = \sum_{i=1}^{n} (y_i - f_i)^2 = \sum_{i=1}^{n} e_i^2 .$$

The R2 value is defined as

$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} .$$
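The definition can be checked numerically with a few lines of plain Python. The sketch below is a minimal implementation of the formula; the numbers are arbitrary illustrative values, not polarizability data. It also shows why R2 can be negative: whenever the predictions are worse than simply predicting the mean of the observed values, the residual sum of squares exceeds the total sum of squares.

```python
def r2_score(observed, predicted):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y, y)                        # exact predictions
baseline = r2_score(y, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
shifted = r2_score(y, [3.0, 4.0, 5.0, 6.0])     # systematic offset of +2
print(perfect, baseline, shifted)
```

Here the perfect predictions give 1, the mean predictor gives 0, and the systematically shifted predictions give a negative value even though they follow the trend of the data exactly.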

The selection of hyperparameters is a key step in building machine learning models. In this study, the hyperparameters of each model are chosen based on experience and then tuned by cross-validation[93]. One round of cross-validation involves partitioning a data set into a training set and a complementary validation set. The performance of a model with a given set of hyperparameters is then estimated over several rounds of validation. In this way, the hyperparameters that give the best validation performance are selected. Cross-validation can be performed in different ways, such as leave-one-out cross-validation and k-fold cross-validation. In this thesis, 10-fold cross-validation is applied to find the best hyperparameters for each model. I acknowledge that, because of limited time, the models used in this study have not been fully optimized and could be further improved.
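The partitioning step of k-fold cross-validation can be sketched as follows. This is a generic illustration in plain Python rather than the code used in the thesis; `score_fn` is a hypothetical callback that would fit a model with one hyperparameter setting on the training indices and return its score on the validation indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


def cross_validate(score_fn, n_samples, k=10):
    """Average a user-supplied score over the k validation folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, 10))
```

Each sample appears in exactly one validation fold, so every data point is used for validation once and for training k - 1 times. Repeating `cross_validate` for each candidate hyperparameter setting and keeping the best average score gives the selection procedure described above.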

To build a machine learning model, several models with different complexities are available. In this study, six different types of models are built to predict the polarizabilities of fullerene isomers from their structures. The models are implemented in Python with two machine learning packages, scikit-learn and PyTorch. A k-nearest neighbors (kNN) model is one of the simplest models and estimates the result from the values of the k most similar molecules. The value of k is a key factor for a kNN model. If k is too small, the model tends to overfit; if k is too large, the model underfits. In this project, several values of k are tried to obtain the best performance[94]. With k = 5, the kNN model gives the best predictions, with R2 = 0.93. A 5NN model estimates the polarizability from the 5 nearest data points in the training set; that is, the predicted value is the average of the polarizabilities of the 5 most similar fullerenes.
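The 5NN prediction rule can be written in a few lines. The sketch below is a plain-Python illustration, not the scikit-learn model used in the thesis. It assumes Euclidean distance between feature vectors, which the thesis does not specify, and the one-dimensional features and target values are made up.

```python
import math


def knn_predict(train_X, train_y, query, k=5):
    """Predict a value as the mean target of the k nearest training points."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up training data: one feature per sample and an illustrative target.
train_X = [(0,), (1,), (2,), (3,), (4,), (5,), (10,)]
train_y = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 20.0]

prediction = knn_predict(train_X, train_y, (2,), k=5)
print(prediction)  # 12.0
```

For the query at 2, the five nearest points are 0 to 4, so the prediction is the mean of their targets. Because the output is always an average of training targets, a kNN model cannot return a value outside the range seen during training.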

Random forests are an ensemble learning algorithm for classification and regression[95]. A random forest trains a multitude of decision trees, and for regression its output is the mean of the predictions of the individual trees. Both kNN and random forests can be viewed as weighted neighborhood schemes, as pointed out by Lin et al. in 2002[96]. The support vector machine (SVM) is a machine learning algorithm that uses a separating hyperplane, mainly for classification[97]. Training an SVM means finding the optimal hyperplane that separates the data into two parts with the maximum margin:

$$\max_{w,\,b}\ \frac{2}{\lVert w \rVert}, \quad \text{s.t.} \quad y_i\,(w^{T} x_i + b) \ge 1, \quad i = 1, 2, \ldots, n.$$

An SVM can also find a nonlinear decision boundary by using a kernel. Support vector regression (SVR) shares the same idea as SVM but is designed for continuous values instead of classification[98,99]. An SVR model consists of a hyperplane and two boundary lines placed at a given error tolerance on either side of it. Optimizing an SVR model means finding the hyperplane that keeps as many points as possible within the boundaries. Linear regression belongs to both statistics and machine learning. In a linear regression model, the target value is estimated by a linear combination of the features. XGBoost is an implementation of the gradient boosting algorithm for regression as well as classification[100]. Like other boosting models, a gradient boosting model produces its prediction from an ensemble of weak prediction models, such as decision trees.
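To make the idea of an ensemble of weak models concrete, the following sketch implements gradient boosting for squared error with one-split decision stumps in plain Python. It is a generic textbook version and not XGBoost itself, which adds regularization and many other refinements. The single feature, the target values, and the default settings are arbitrary choices for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    """Sum the base value and the scaled outputs of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        (left if x <= threshold else right) for threshold, left, right in stumps
    )


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [float(x * x) for x in xs]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_mean = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each stump on its own is a poor model, but because every round fits the residuals left by the previous rounds, the training error of the ensemble falls below that of the constant mean predictor.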

The neural network is an algorithm inspired by biological neural networks[101]. Each biological neuron is mimicked by an artificial neuron, a node that applies an activation function, and the connections between biological neurons are modeled as weights. A neural network is a collection of artificial neurons connected in a designed way. Each neuron receives real-valued inputs, and its nonlinear activation function computes an output from the weighted sum of those inputs. Neurons are connected by edges; each edge carries a weight, and the weights are optimized during training. Neural networks, as a popular deep learning approach, have achieved great success as prediction models in scientific research.
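The computation performed by each neuron can be illustrated with a forward pass through a very small network. The sketch below uses plain Python instead of PyTorch, a ReLU activation, one hidden layer, and hand-picked weights. All of these are assumptions made for illustration, since the thesis does not describe the architecture of its network.

```python
def neuron(inputs, weights, bias, activation):
    """One artificial neuron: activation applied to the weighted input sum."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def forward(inputs, hidden_layer, output_neuron):
    """Forward pass through one hidden layer and a single linear output."""
    hidden = [neuron(inputs, w, b, relu) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out, identity)


# Hypothetical weights chosen by hand; in practice they are learned.
hidden_layer = [([1.0, -1.0], 0.0), ([0.5, 0.5], -1.0)]
output_neuron = ([2.0, 1.0], 0.5)

prediction = forward([3.0, 1.0], hidden_layer, output_neuron)
print(prediction)  # 5.5
```

For the input (3, 1) the two hidden neurons output 2 and 1, and the output neuron returns 2 x 2 + 1 x 1 + 0.5 = 5.5. Training consists of adjusting the weights and biases so that such outputs match the target values.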

Figure 2.3 Performances of different machine learning models on testing set. (a) 5-nearest neighbors model with R2 = 0.93 on the testing set. (b) Random forest model with R2 = 0.95 on the testing set. (c) Support vector regression model with R2 = 0.91 on the testing set. (d) Linear regression model with R2 = 0.96 on the testing set. (e) XGBoost model with R2 = 0.96 on the testing set. (f) Neural network model with R2 = 0.96 on the testing set.

The data set contains polarizability data of 31947 isomers ranging from C60 to C80. The data set is split randomly, with 90% in the training set and 10% in the testing set. As illustrated in Figure 2.3, the predictions are evaluated by the R2 value. All the models produce results in good agreement with the reference values. Among all the machine learning models, the neural network model produces the best results, with an R2 value of 0.97. In previous studies[90], the R2 values of energies predicted by statistical models are around 0.5–0.7. Although there is no previous report on the prediction of polarizability, it is still instructive to compare the performance of polarizability prediction with that of energy prediction. Energies are generally considered easier to predict from structures than polarizabilities, which are affected by many different factors. Because there are millions of possible isomers, it is hard to calculate the polarizabilities of all of them using traditional quantum chemistry methods. Machine learning models are well suited to screening such large numbers of candidates and finding essential patterns in a large amount of data.
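A random 90/10 split of this size can be sketched as follows. The rounding convention and the random seed are assumptions, since the thesis does not report the exact sizes of the two sets.

```python
import random


def train_test_split(n_samples, train_fraction=0.9, seed=0):
    """Randomly split sample indices into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * n_samples)   # round down (assumed convention)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split(31947)
print(len(train_idx), len(test_idx))
```

With this convention the 31947 isomers give 28752 training samples and 3195 testing samples, and no isomer appears in both sets.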

Figure 2.4 Performances of different machine learning models on C82 testing set. (a) 5-nearest neighbors model with R2 = -1.44 on the testing set. (b) Random forest model with R2 = -0.99 on the testing set. (c) Support vector regression model with R2 = -6.80 on the testing set. (d) Linear regression model with R2 = 0.39 on the testing set. (e) XGBoost model with R2 = -0.37 on the testing set. (f) Neural network model with R2 = 0.57 on the testing set.

In addition, we examine the performance of the six machine learning models on larger fullerenes that are not included in the training set. As described above, the six models are trained on polarizability data of fullerenes ranging from C60 to C80. We prepared new testing sets containing C82, C84, and C86 polarizability data. The results for C82 are shown in Figure 2.4. The performances of all six models decrease. The results vary greatly between models, and the neural network model achieves the best prediction accuracy on all of the C82 to C86 testing sets. For the neural network model alone, the performance remains almost the same across the C82, C84, and C86 testing sets. These results suggest that a neural network model learns the structural patterns relevant to polarizability better than the other models: it learns from small fullerenes and still gives reasonable predictions for large fullerenes, whereas the other models, trained on small fullerenes, do not give reliable predictions for large fullerenes. The results on the prediction of polarizability are also consistent with previous reports on the prediction of energies.

Figure 2.5 Performances of different machine learning models on C84 testing set. (a) 5-nearest neighbors model with R2 = -3.41 on the testing set. (b) Random forest model with R2 = -2.80 on the testing set. (c) Support vector regression model with R2 = -13.50 on the testing set. (d) Linear regression model with R2 = 0.36 on the testing set. (e) XGBoost model with R2 = -1.16 on the testing set. (f) Neural network model with R2 = 0.60 on the testing set.

This thesis has shown that machine learning models can predict polarizabilities directly from structures. Structural motifs, as used in this thesis, represent the structural patterns well. However, it is still an open question which kinds of structural patterns contribute to the polarizability and to other properties. Machine learning models are widely applied in pattern recognition. In future work, we aim to learn fullerene structural patterns and their effects on polarizability as well as other properties.

Figure 2.6 Performances of different machine learning models on C86 testing set. (a) 5-nearest neighbors model with R2 = -5.82 on the testing set. (b) Random forest model with R2 = -4.80 on the testing set. (c) Support vector regression model with R2 = -20.70 on the testing set. (d) Linear regression model with R2 = 0.26 on the testing set. (e) XGBoost model with R2 = -3.34 on the testing set. (f) Neural network model with R2 = 0.58 on the testing set.

The size of the fullerene has a remarkable effect on the performance of the machine learning models. Models trained on polarizability data of small fullerenes perform relatively poorly when predicting the polarizabilities of large fullerenes. It is of interest to study this size effect and identify the key factor. Therefore, we perform an extra test in which the models are trained on polarizability data of C60 to C80 without C76, and C76 is then used as the testing set. As shown in Figure 2.7, the accuracy increases compared with the previous tests but is still not as good as the results shown in Figure 2.3. This suggests that each fullerene size has its own specific features, and that models provide good predictions only when they can extract this size factor from the training data. We acknowledge that more work is required to understand the size effect in machine learning models before the accuracy of these models can be improved.
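The hold-out test described above amounts to splitting the data by cage size instead of at random. A minimal sketch, assuming each record stores the number of carbon atoms alongside its features and target (a hypothetical layout, with made-up placeholder values):

```python
def hold_out_size(records, held_out_atoms):
    """Split records into training and testing sets by cage size.

    `records` is an assumed list of (n_carbon_atoms, features, target)
    tuples; every isomer of the held-out size goes to the testing set.
    """
    train = [r for r in records if r[0] != held_out_atoms]
    test = [r for r in records if r[0] == held_out_atoms]
    return train, test


# Placeholder records: the feature vectors and targets are made up.
records = [
    (60, (20, 0), 1.00),
    (70, (25, 5), 1.20),
    (76, (28, 10), 1.30),
    (76, (27, 11), 1.31),
    (80, (30, 10), 1.40),
]
train, test = hold_out_size(records, 76)
```

Unlike the C82 to C86 tests, the held-out size here lies inside the range of sizes seen during training, so the test probes interpolation between cage sizes rather than extrapolation beyond them.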

Figure 2.7 Performances of different machine learning models on C76 testing set. (a) 5-nearest neighbors model with R2 = -0.99 on the testing set. (b) Random forest model with R2 = -3.70 on the testing set. (c) Support vector regression model with R2 = -0.23 on the testing set. (d) Linear regression model with R2 = 0.50 on the testing set. (e) XGBoost model with R2 = 0.30 on the testing set. (f) Neural network model with R2 = 0.58 on the testing set.

3. Conclusion

In this thesis, traditional chromatographic models are reviewed. The simulation of chromatographic retention behaviors is essential in research as well as in industry. Previous studies have shown that differential equation models achieve great success in predicting chromatographic retention. However, these models, which depend on the underlying physical processes, are not easy to set up. In our previous studies, we have shown that the retention behaviors of fullerenes/metallofullerenes are much easier to estimate because of their rigid structures, so that the retention time can be predicted from the average molecular polarizability. In this thesis, machine learning models are proposed to estimate polarizabilities directly from structures by counting structural motifs. Combining the two models, retention times can be estimated from fullerene structures alone.


References

[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature. 521 (2015) 436. [2] C.M. Bishop, Pattern recognition and machine learning, springer, 2006. [3] P. Raccuglia, K.C. Elbert, P.D.F. Adler, C. Falk, M.B. Wenny, A. Mollo, M. Zeller, S.A. Friedler, J. Schrier, A.J. Norquist, Machine-learning-assisted materials discovery using failed experiments, Nature. 533 (2016) 73–76. doi:10.1038/nature17439. [4] I. Smith, Chromatography, Elsevier, 2013. [5] L.S. Ettre, A. Zlatkis, 75 years of chromatography: a historical dialogue, Elsevier, 2011. [6] L.S. Ettre, K.I. Sakodynskii, MS Tswett and the discovery of chromatography II: Completion of the development of chromatography (1903–1910), Chromatographia. 35 (1993) 329–338. [7] L.R. Snyder, J.J. Kirkland, J.W. Dolan, Introduction to modern liquid chromatography, John Wiley & Sons, 2011. [8] H.K. Mangold, E. Stahl, Thin-layer chromatography, (1969). [9] L.S. Ettre, Nomenclature for chromatography (IUPAC Recommendations 1993), Pure and Applied Chemistry. 65 (1993) 819–872. [10] W.C. Still, M. Kahn, A. Mitra, Rapid chromatographic technique for preparative separations with moderate resolution, J. Org. Chem. 43 (1978) 2923–2925. doi:10.1021/jo00408a041. [11] H.J. Cortes, B. Winniford, J. Luong, M. Pursch, Comprehensive two dimensional gas chromatography review, Journal of Separation Science. 32 (2009) 883–904. [12] O. Schulz-Trieglaff, N. Pfeifer, C. Gröpl, O. Kohlbacher, K. Reinert, LC-MSsim–a simulation software for liquid chromatography mass spectrometry data, BMC Bioinformatics. 9 (2008) 423. [13] T. Gu, Mathematical modeling and scale-up of liquid chromatography: With application examples, Springer, 2015. [14] X. Liu, T. Zuo, H.C. Dorn, Polarizability effects dominate the chromatographic retention behavior of spheroidal and elipsoidal metallofullerene nanospheres, The Journal of Physical Chemistry C. 121 (2017) 4045–4049. [15] I.E. 
Bush, The Chromatography of Steroids: International Series of Monographs on Pure and Applied Biology: Biochemistry, Elsevier, 2013. [16] M.J. Kamlet, M.H. Abraham, P.W. Carr, R.M. Doherty, R.W. Taft, Solute–solvent interactions in chemistry and biology. Part 7. An analysis of mobile phase effects on high pressure liquid chromatography capacity factors and relationships of the latter with octanol–water partition coefficients, Journal of the Chemical Society, Perkin Transactions 2. (1988) 2087–2092. [17] L.A. Kartsova, A.A. Makarov, New fullerene-based stationary phases for gas chromatography, Journal of Analytical Chemistry. 59 (2004) 724–729. [18] A. Shareef, G. Li, R.S. Kookana, Quantitative determination of fullerene (C60) in soils by high performance liquid chromatography and accelerated solvent extraction technique, Environmental Chemistry. 7 (2010) 292–297. [19] D. Fuchs, H. Rietschel, R.H. Michel, A. Fischer, P. Weis, M.M. Kappes, Extraction and chromatographic elution behavior of endohedral metallofullerenes: inferences regarding effective dipole moments, The Journal of Physical Chemistry. 100 (1996) 725–729. [20] Y. Saito, H. Ohta, K. Jinno, Peer Reviewed: Chromatographic Separation of Fullerenes, ACS Publications, 2004.

46 [21] J.R. Baena, M. Gallego, M. Valcarcel, Fullerenes in the analytical sciences, TrAC Trends in Analytical Chemistry. 21 (2002) 187–198. [22] E. Glueckauf, Theory of chromatography. Part 9. The “theoretical plate” concept in column separations, Transactions of the Faraday Society. 51 (1955) 34–44. [23] J.C. Giddings, S.L. Seager, L.R. Stucki, G.H. Stewart, Plate height in gas chromatography, Analytical Chemistry. 32 (1960) 867–870. [24] J. Cazes, R.P. Scott, Chromatography theory, CRC press, 2002. [25] A. Berthod, On the Use of the Knox Equation. I. the Fit Problem, Journal of Liquid Chromatography. 12 (1989) 1169–1185. [26] A. Berthod, On the use of the Knox equation. II. The efficiency measurement problem, Journal of Liquid Chromatography. 12 (1989) 1187–1201. [27] Y. Li, X. Liu, C. Chen, J. Duchamp, R. Huang, T.-F. Chung, M. Young, T. Chalal, Y.P. Chen, J.R. Heflin, Differences in self-assembly of spherical C60 and planar PTCDA on rippled graphene surfaces, Carbon. 145 (2019) 549–555. [28] T. Li, S. Murphy, B. Kiselev, K.S. Bakshi, J. Zhang, A. Eltahir, Y. Zhang, Y. Chen, J. Zhu, R.M. Davis, A new interleukin-13 amino-coated gadolinium metallofullerene nanoparticle for targeted MRI detection of glioblastoma tumor cells, Journal of the American Chemical Society. 137 (2015) 7881–7888. [29] J. Zhang, Y. Ye, Y. Chen, C. Pregot, T. Li, S. Balasubramaniam, D.B. Hobart, Y. Zhang, S. Wi, R.M. Davis, Gd3N@ C84 (OH) x: a new egg-shaped metallofullerene magnetic resonance imaging contrast agent, Journal of the American Chemical Society. 136 (2014) 2630–2636. [30] M. Salami, M.B. Ahmadi, A mathematical programming model for computing the fries number of a fullerene, Applied Mathematical Modelling. 39 (2015) 5473–5479. [31] P.W. Fowler, D.E. Manolopoulos, An atlas of fullerenes, Courier Corporation, 2007. [32] A.A. Popov, S. Yang, L. Dunsch, Endohedral fullerenes, Chemical Reviews. 113 (2013) 5989–6113. [33] F. Bedani, P.J. Schoenmakers, H.-G. 
Janssen, Theories to support method development in comprehensive two-dimensional liquid chromatography–A review, Journal of Separation Science. 35 (2012) 1697–1711. [34] G.W. Slater, C. Holm, M.V. Chubynsky, H.W. de Haan, A. Dubé, K. Grass, O.A. Hickey, C. Kingsburry, D. Sean, T.N. Shendruk, Modeling the separation of macromolecules: A review of current computer simulation methods, Electrophoresis. 30 (2009) 792–818. [35] L. Zhang, Y. Sun, Molecular simulation of adsorption and its implications to protein chromatography: A review, Biochemical Engineering Journal. 48 (2010) 408–415. [36] S. Dimartino, C. Boi, G.C. Sarti, A validated model for the simulation of protein purification through affinity membrane chromatography, Journal of Chromatography A. 1218 (2011) 1677–1690. [37] H. Kempe, A. Axelsson, B. Nilsson, G. Zacchi, Simulation of chromatographic processes applied to separation of proteins, Journal of Chromatography A. 846 (1999) 1–12. [38] F.J. Stevens, Analysis of protein-protein interaction by simulation of small-zone size- exclusion chromatography: application to an antibody-antigen association, Biochemistry. 25 (1986) 981–993. [39] X. Liu, H.C. Dorn, DFT prediction of chromatographic retention behavior for a trimetallic nitride metallofullerene series, Inorganica Chimica Acta. 468 (2017) 316–320. [40] J. Zhang, S. Stevenson, H.C. Dorn, Trimetallic nitride template endohedral metallofullerenes: discovery, structural characterization, reactivity, and applications, Accounts of Chemical Research. 46 (2013) 1548–1557.

47 [41] K. Kimata, T. Hirose, K. Moriuchi, K. Hosoya, T. Araki, N. Tanaka, High-capacity stationary phases containing heavy atoms for HPLC separation of fullerenes, Analytical Chemistry. 67 (1995) 2556–2561. [42] S.A. Wise, L.C. Sander, W.E. May, Determination of polycyclic aromatic hydrocarbons by liquid chromatography, Journal of Chromatography A. 642 (1993) 329–349. [43] T. Zuo, Synthesis, Isolation, and Characterization of Tb-Based Large Cage TNT-EMFs and Dimetallic Endohedral Metalloazafullerenes, PhD Thesis, Virginia Tech, 2008. [44] R.C. Klute, H.C. Dorn, H.M. McNair, HPLC separation of higher (C84+) fullerenes, Journal of Chromatographic Science. 30 (1992) 438–442. [45] S. Stevenson, P. Burbank, K. Harich, Z. Sun, H.C. Dorn, P.H.M. Van Loosdrecht, M.S. DeVries, J.R. Salem, C.-H. Kiang, R.D. Johnson, La2@ C72: Metal-mediated stabilization of a carbon cage, The Journal of Physical Chemistry A. 102 (1998) 2833– 2837. [46] T. Zuo, M.M. Olmstead, C.M. Beavers, A.L. Balch, G. Wang, G.T. Yee, C. Shu, L. Xu, B. Elliott, L. Echegoyen, Preparation and Structural Characterization of the I h and the D 5h Isomers of the Endohedral Fullerenes Tm3N@ C80: Icosahedral C80 Cage Encapsulation of a Trimetallic Nitride Magnetic Cluster with Three Uncoupled Tm3+ Ions, Inorganic Chemistry. 47 (2008) 5234–5244. [47] H.W. Kroto, Health, JR; O’Brien, SC; Curl, RF; Smalley, RE, Nature. 318 (1985) 162. [48] K. Ala’a, Isolation, separation and characterisation of the fullerenes C 60 and C 70: the third form of carbon, Journal of the Chemical Society, Chemical Communications. (1990) 1423–1425. [49] F. Diederich, R. Ettl, Y. Rubin, R.L. Whetten, R. Beck, M. Alvarez, S. Anz, D. Sensharma, F. Wudl, K.C. Khemani, The higher fullerenes: isolation and characterization of C76, C84, C90, C94, and C70O, an oxide of D5h-C70, Science. 252 (1991) 548–551. [50] Y. Chai, T. Guo, C. Jin, R.E. Haufler, L.P.F. Chibante, J. Fure, L. Wang, J.M. Alford, R.E. Smalley, Fullerenes with metals inside, J. Phys. 
Chem. 95 (1991) 7564–7568. doi:10.1021/j100173a002. [51] S. Stevenson, G. Rice, T. Glass, K. Harich, F. Cromer, M.R. Jordan, J. Craft, E. Hadju, R. Bible, M.M. Olmstead, K. Maitra, A.J. Fisher, A.L. Balch, H.C. Dorn, Small-bandgap endohedral metallofullerenes in high yield and purity, Nature. 401 (1999) 55. doi:10.1038/43415. [52] H.W. Kroto, K. McKay, The formation of quasi-icosahedral spiral shell carbon particles, Nature. 331 (1988) 328. doi:10.1038/331328a0. [53] K. Kobayashi, S. Nagase, M. Yoshida, E. Ōsawa, Endohedral Metallofullerenes. Are the Isolated Pentagon Rule and Fullerene Structures Always Satisfied?, J. Am. Chem. Soc. 119 (1997) 12693–12694. doi:10.1021/ja9733088. [54] S. Stevenson, P.W. Fowler, T. Heine, J.C. Duchamp, G. Rice, T. Glass, K. Harich, E. Hajdu, R. Bible, H.C. Dorn, A stable non-classical metallofullerene family, Nature. 408 (2000) 427. doi:10.1038/35044199. [55] C.-R. Wang, T. Kai, T. Tomiyama, T. Yoshida, Y. Kobayashi, E. Nishibori, M. Takata, M. Sakata, H. Shinohara, C 66 fullerene encaging a scandium dimer, Nature. 408 (2000) 426. doi:10.1038/35044195. [56] J. Zhang, D.W. Bearden, T. Fuhrer, L. Xu, W. Fu, T. Zuo, H.C. Dorn, Enhanced Dipole Moments in Trimetallic Nitride Template Endohedral Metallofullerenes with the Pentalene Motif, J. Am. Chem. Soc. 135 (2013) 3351–3354. doi:10.1021/ja312045t. [57] D.E. Bautz, J.W. Dolan, W.D. Raddatz, L.R. Snyder, Computer simulation (based on a linear-elution-strength approximation) as an aid for optimizing separations by programmed-temperature gas chromatography, Anal. Chem. 62 (1990) 1560–1567. doi:10.1021/ac00214a004.

48 [58] R. Leardi, Genetic algorithms in chemometrics and chemistry: a review, Journal of Chemometrics. 15 (2001) 559–569. doi:10.1002/cem.651. [59] L.F. Capitán-Vallvey, N. López-Ruiz, A. Martínez-Olmos, M.M. Erenas, A.J. Palma, Recent developments in computer vision-based analytical chemistry: A tutorial review, Analytica Chimica Acta. 899 (2015) 23–56. doi:10.1016/j.aca.2015.10.009. [60] G.B. Goh, N.O. Hodas, A. Vishnu, Deep learning for computational chemistry, Journal of Computational Chemistry. 38 (2017) 1291–1307. [61] V. Botu, R. Ramprasad, Adaptive machine learning framework to accelerate ab initio molecular dynamics, International Journal of Quantum Chemistry. 115 (2015) 1074–1083. [62] G.G. Chowdhury, Natural language processing, Annual Review of Information Science and Technology. 37 (2003) 51–89. [63] K.S. Jones, J.R. Galliers, Evaluating natural language processing systems: An analysis and review, Springer Science & Business Media, 1995. [64] A. Voulodimos, N. Doulamis, A. Doulamis, E. Protopapadakis, Deep learning for computer vision: A brief review, Computational Intelligence and Neuroscience. 2018 (2018). [65] S. Ekins, A. Clark, A. Perryman, J. Freundlich, A. Korotcov, V. Tkachenko, Accessible machine learning approaches for toxicology, Computational Toxicology: Risk Assessment for Chemicals. (2018) 1–29. [66] Y.-C. Lo, S.E. Rensi, W. Torng, R.B. Altman, Machine learning in chemoinformatics and drug discovery, Drug Discovery Today. 23 (2018) 1538–1546. [67] D.T. Ahneman, J.G. Estrada, S. Lin, S.D. Dreher, A.G. Doyle, Predicting reaction performance in C–N cross-coupling using machine learning, Science. 360 (2018) 186– 190. [68] K.T. Butler, D.W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science, Nature. 559 (2018) 547. [69] L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, Siam Review. 60 (2018) 223–311. [70] M. Mohri, A. Rostamizadeh, A. 
Talwalkar, Foundations of machine learning, MIT press, 2018. [71] I. Jolliffe, Principal component analysis, Springer, 2011. [72] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: A review of classification techniques, Emerging Artificial Intelligence Applications in Computer Engineering. 160 (2007) 3–24. [73] A. Singh, N. Thakur, A. Sharma, A review of supervised machine learning algorithms, in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 2016: pp. 1310–1315. [74] J. Cioslowski, N. Rao, D. Moncrieff, Standard enthalpies of formation of fullerenes and their dependence on structural motifs, Journal of the American Chemical Society. 122 (2000) 8265–8270. [75] M. Alcamí, G. Sánchez, S. Díaz-Tendero, Y. Wang, F. Martín, Structural patterns in fullerenes showing adjacent pentagons: C20 to C72, Journal of Nanoscience and Nanotechnology. 7 (2007) 1329–1338. [76] S.J. Austin, P.W. Fowler, D.E. Manolopoulos, G. Orlandi, F. Zerbetto, Structural motifs and the stability of fullerenes, The Journal of Physical Chemistry. 99 (1995) 8076–8081. [77] J. Aihara, Bond resonance energy and verification of the isolated pentagon rule, Journal of the American Chemical Society. 117 (1995) 4130–4136. [78] K. Kobayashi, S. Nagase, A stable unconventional structure of Sc2@ C66 found by density functional calculations, Chemical Physics Letters. 362 (2002) 373–379.

[79] M. Yamada, H. Kurihara, M. Suzuki, J.D. Guo, M. Waelchli, M.M. Olmstead, A.L. Balch, S. Nagase, Y. Maeda, T. Hasegawa, Sc2@C66 revisited: an endohedral fullerene with scandium ions nestled within two unsaturated linear triquinanes, Journal of the American Chemical Society. 136 (2014) 7611–7614.
[80] M.M. Olmstead, H.M. Lee, J.C. Duchamp, S. Stevenson, D. Marciu, H.C. Dorn, A.L. Balch, Sc3N@C68: folded pentalene coordination in an endohedral fullerene that does not obey the isolated pentagon rule, Angewandte Chemie International Edition. 42 (2003) 900–903.
[81] H. Kato, A. Taninaka, T. Sugai, H. Shinohara, Structure of a missing-caged metallofullerene: La2@C72, Journal of the American Chemical Society. 125 (2003) 7782–7783.
[82] X. Lu, H. Nikawa, T. Nakahodo, T. Tsuchiya, M.O. Ishitsuka, Y. Maeda, T. Akasaka, M. Toki, H. Sawa, Z. Slanina, Chemical understanding of a non-IPR metallofullerene: stabilization of encaged metals on fused-pentagon bonds in La2@C72, Journal of the American Chemical Society. 130 (2008) 9129–9136.
[83] Y. Marcus, Solubilities of buckminsterfullerene and sulfur hexafluoride in various solvents, The Journal of Physical Chemistry B. 101 (1997) 8617–8623.
[84] D. Fuchs, H. Rietschel, R.H. Michel, A. Fischer, P. Weis, M.M. Kappes, Extraction and chromatographic elution behavior of endohedral metallofullerenes: inferences regarding effective dipole moments, The Journal of Physical Chemistry. 100 (1996) 725–729.
[85] D.S. Sabirov, R.G. Bulgakov, Reactivity of fullerene derivatives C60O and C60F18 (C3v) in terms of local curvature and polarizability, Fullerenes, Nanotubes, and Carbon Nanostructures. 18 (2010) 455–457.
[86] D.S. Sabirov, Polarizability as a landmark property for fullerene chemistry and materials science, RSC Advances. 4 (2014) 44996–45028.
[87] C.A. Hunter, J.K. Sanders, The nature of π-π interactions, Journal of the American Chemical Society. 112 (1990) 5525–5534.
[88] E. Estrada, Spectral moments of the edge adjacency matrix in molecular graphs. 3. Molecules containing cycles, Journal of Chemical Information and Computer Sciences. 38 (1998) 23–27.
[89] F.R. Burden, A chemically intuitive molecular index based on the eigenvalues of a modified adjacency matrix, Quantitative Structure-Activity Relationships. 16 (1997) 309–314.
[90] Y. Wang, S. Díaz-Tendero, F. Martín, M. Alcamí, Key structural motifs to predict the cage topology in endohedral metallofullerenes, Journal of the American Chemical Society. 138 (2016) 1551–1560.
[91] T. Dietterich, Overfitting and undercomputing in machine learning, ACM Computing Surveys. 27 (1995) 326–327.
[92] J.A. Colton, K.M. Bower, Some misconceptions about R2, International Society of Six Sigma Professionals, EXTRAOrdinary Sense. 3 (2002) 20–22.
[93] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: IJCAI, Montreal, Canada, 1995: pp. 1137–1145.
[94] N.S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician. 46 (1992) 175–185.
[95] T.K. Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, IEEE, 1995: pp. 278–282.
[96] Y. Lin, Y. Jeon, Random forests and adaptive nearest neighbors, Journal of the American Statistical Association. 101 (2006) 578–590.
[97] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning. 20 (1995) 273–297.

[98] H. Drucker, C.J. Burges, L. Kaufman, A.J. Smola, V. Vapnik, Support vector regression machines, in: Advances in Neural Information Processing Systems, 1997: pp. 155–161.
[99] S.R. Gunn, Support vector machines for classification and regression, ISIS Technical Report. 14 (1998) 5–16.
[100] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016: pp. 785–794.
[101] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences. 79 (1982) 2554–2558.
