Machine Learning Models in Fullerene/Metallofullerene Chromatography Studies
Xiaoyang Liu
Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science and Applications
Yang Cao, Advisor Harry C. Dorn Lenwood S. Heath
August 8, 2019 Blacksburg, VA 24060, U.S.
Keywords: Machine learning, Neural Network, Chromatography, Fullerene, Modeling, Random Forest, XGBoost, Linear Regression, SVM regression, Nearest Neighbor
ABSTRACT
Machine learning methods are now extensively applied in many areas of scientific research to build models. Unlike conventional models, machine learning models are data-driven: the algorithms learn patterns from the available data that are otherwise hard to recognize. Data-driven approaches enhance the role of algorithms and computers and can accelerate computation by offering alternative views of a problem. In this thesis, we explore the possibility of applying machine learning models to the prediction of chromatographic retention behaviors. Chromatographic separation is a key technique for the discovery and analysis of fullerenes. In previous studies, differential equation models have achieved great success in predicting chromatographic retention. However, most differential equation models require experimental measurements or theoretical computations for many parameters, which are not easy to obtain. Fullerenes/metallofullerenes are rigid, spherical molecules whose cages consist only of carbon atoms, which makes predicting their chromatographic retention behaviors, as well as other properties, much simpler than for flexible molecules that have more conformational variation. In this thesis, I propose that the polarizability of a fullerene molecule can be estimated directly from its structure. Structural motifs are used to simplify the model, and the models with motifs provide satisfactory predictions. The data set contains 31947 isomers and their polarizability data and is split into a training set with 90% of the data points and a complementary testing set. In addition, a second testing set of large fullerene isomers is prepared to test whether a model trained on small fullerenes can give good predictions for large fullerenes.
GENERAL AUDIENCE ABSTRACT
Machine learning models can be applied in a wide range of areas, including scientific research. In this thesis, machine learning models are applied to predict the chromatographic behavior of fullerenes from their molecular structures. Chromatography is a common technique for separating mixtures, and the separation arises from differences in the interactions between molecules and a stationary phase. In real experiments, a mixture usually contains a large family of different compounds, and isolating the target compound requires a great deal of work and resources. Therefore, models are extremely important for studies of chromatography.
Traditional models are built on physical laws and involve several parameters. These physical parameters are either measured experimentally or computed theoretically. However, both approaches are time consuming and difficult to carry out. For fullerenes, my previous studies have shown that the chromatography model can be simplified so that only one parameter, polarizability, is required. A machine learning approach is introduced to enhance the model by predicting the molecular polarizabilities of fullerenes from their structures. The structure of a fullerene is represented by several local structures. Several types of machine learning models are built and tested on our data set, and the results show that the neural network gives the best predictions.
ACKNOWLEDGEMENT
I would like to express my appreciation to my advisor, Dr. Yang Cao, for giving me the opportunity to pursue a master's degree in computer science. Studying toward a computer science degree has opened a new world for me. The project lies at the intersection of computer science and my Ph.D. dissertation. I also thank my committee members, Dr. Harry Dorn and Dr. Lenwood Heath. I thank Dr. Harry Dorn for his support and understanding of my work and study in computer science. I thank Dr. Lenwood Heath for his help in discussing and revising my thesis. I am very fortunate to have such helpful committee members.
Table of Contents
1. Background
1.1 Chromatography
1.2 High Performance Liquid Chromatography
1.3 HPLC in Fullerenes/Metallofullerenes Separations
2. Machine Learning Models Based on Molecular Geometry
2.1 Machine Learning Models and Applications in Chemistry
2.2 Feature Selection
2.3 Model Selection
3. Conclusion
References
1. Background
Machine learning has developed rapidly over the past few decades[1]. It now appears in news feeds, search, and image recognition. Machine learning combines statistics and computer science to build models for tasks that are hard to accomplish with conventional methods. Fundamentally, machine learning algorithms extract information or key features from available data and represent the data set with general models. There are numerous machine learning methods, and it is well established that different models suit different tasks. Generally, tasks fall into two categories, regression and classification[2], and most machine learning models can be adapted to both. Machine learning models are trained on raw data and then applied to make inferences about unseen data. Training is a major challenge, but advances in algorithms and computing resources now make it possible to train models sufficiently. Each machine learning method has its own general structure, which makes it applicable to many tasks. Training is the step in which a model acquires the hidden key features of the data, and the knowledge learned from the raw data takes a different form in each method. For example, linear regression describes a linear relationship between the features and the target. A decision tree instead applies a set of learned rules to make decisions under different conditions. The application of machine learning algorithms is also referred to as data mining, the process of obtaining key information or features from a large amount of data.
Among machine learning algorithms, neural networks, particularly in the form of deep learning, are among the most commonly used techniques in scientific research. In the past few decades, various neural networks have been developed. For example, the convolutional neural network has shown its power in image processing. The ability to process images and recognize patterns can also be transferred to chemistry problems that involve chemical structures. Another essential feature is that deep learning can extract features automatically. One of the big challenges in machine learning is feature design, which requires not only machine learning knowledge but also domain knowledge. Deep learning is therefore versatile across different fields of study. Nowadays, with enough data, deep learning models can outperform human experts on some tasks in their own fields.
Behind the impressive results of machine learning, mathematics plays an essential role. Machine learning algorithms are fundamentally built on linear algebra, statistics, and programming. The first step in building a machine learning model is to convert the data into vectors, the representation a computer program can work with. Training then largely amounts to solving linear algebra problems that follow the structure of the chosen algorithm. Many machine learning algorithms start from probability, and the decision or result is estimated from probability distributions or likelihoods. A machine learning model has a fixed structure but contains parameters that are determined by the data. Learning from a data set means finding the parameters that suit it, so training a machine learning model is an optimization over the parameters of the general model. The development of optimization algorithms has been a central topic in machine learning for decades. A typical method for handling the large amount of computation in machine learning and convex optimization is gradient descent. Gradient descent uses the current slope of the error surface to decrease the error step by step. In addition, stochastic gradient descent and other techniques have been developed to accelerate the optimization and to mitigate problems such as becoming trapped in local minima.
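To make the step-by-step error reduction concrete, the following is a minimal sketch of batch gradient descent for a one-feature linear regression with a mean squared error loss. It is an illustration only and is not the training code used in this thesis; the function name gradient_descent_line, the learning rate, the number of steps, and the synthetic data are all assumptions chosen for the example.

```python
def gradient_descent_line(xs, ys, lr=0.1, n_steps=2000):
    """Fit y ~ a * x + b by minimizing the mean squared error.

    At every step the parameters move against the gradient (the current
    slope of the error surface), so the error decreases step by step.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Residuals of the current model on every data point.
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. a and b.
        grad_a = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        # Take a small step downhill.
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical noise-free data generated from y = 3 x + 0.5.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [3.0 * x + 0.5 for x in xs]
a_fit, b_fit = gradient_descent_line(xs, ys)
print(a_fit, b_fit)
```

On this convex problem the fitted slope and intercept approach the values used to generate the data. Stochastic gradient descent would follow the same update rule but estimate the gradient from a random subset of the points at each step.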
Currently, machine learning methods are applied in scientific research, especially in chemistry and materials science. In the past, quantum chemistry provided some of the most successful models for chemistry problems. Quantum chemistry methods solve chemical property problems mainly on the basis of electronic structure. With the development of computer algorithms and supercomputers, some large-scale problems can be solved within a reasonable time. In particular, methods based on density functional theory are widely used because of their good balance between accuracy and efficiency. However, the computing resources required for quantum chemistry calculations still limit their application. Interest in applying machine learning models has grown explosively, especially in chemistry research. In most cases, the chemical system is complex and hard to model with conventional methods. Machine learning methods are therefore a good choice for learning from data without much prior knowledge of the system.
Data-driven models now enhance the ability to model chemical systems for which the physical theories are incomplete or unavailable. The rapid emergence of machine learning methods in chemistry is partly due to the explosion of data. In the past few decades, with the rapid evolution of instruments and computers, a large amount of data has accumulated and numerous databases have become available. However, conventional methods fail to extract information from these databases. The development of machine learning algorithms, especially those for pattern recognition, is another driver. Several studies have reported using available data sets and machine learning methods to uncover hidden knowledge. For example, in materials design research, data from failed experiments are usually not considered useful. However, Raccuglia et al. proposed a machine learning model that learns materials design knowledge from such discarded data and showed that its performance is even comparable to that of experienced scientists[3].
Machine learning methods also help accelerate chemistry computations. Most conventional computational methods for chemical systems require substantial resources and a relatively long time, and machine learning provides alternative ways to estimate chemical properties. In addition, several tasks, such as drug design, are hard to model. As noted previously, the convolutional neural network can recognize local structures and estimate their contributions to a given property. Before such methods were available, drug design was done mainly by hand and required a considerable amount of lab work. In the past, scientists also made great efforts to summarize rules and theories from observations. Machine learning can now extract rules from available data and, in some cases, do so even better than a scientist.
In this thesis, we present our results on building machine learning models for chromatography, which is essential for chemical separation. The input of the model is the molecular structure, and the output is the parameter used to estimate the chromatographic retention time. The machine learning model can simplify the simulation of the chromatographic behavior of fullerenes.
1.1 Chromatography
Chromatography is a technique for mixture separation and sample purification[4]. It is extensively applied in science and industry, for example in chemical engineering and pharmacy. The history of chromatography dates back to 1903, when the Russian scientist Mikhail Tsvet first applied the technique to separate plant pigments, which show different colors (chlorophyll is green, carotene is orange, and xanthophyll is yellow). The technique takes its name, chromatography, from the Greek word chroma, meaning color[5]. Modern chromatography developed rapidly from the 1930s to the 1950s, and the 1952 Nobel Prize in Chemistry was awarded to Archer John Porter Martin and Richard Laurence Millington Synge for their contributions to chromatographic techniques[6]. Since then, chromatographic techniques have been developed continually to achieve better resolution for various mixtures.
A chromatographic system contains two phases, a mobile phase and a stationary phase[7]. The mobile phase dissolves the mixture and carries it through the stationary phase. For a given stationary phase, the affinities of the components in the mixture differ, so each component travels through the stationary phase at a different speed. The time a component takes to travel from the starting point to the end point is called the retention time, and the differences in retention time produce the separation. This principle is implemented in various forms, such as size-exclusion chromatography and ion-exchange chromatography.
Figure 1.1 Scheme of thin-layer chromatography.
Thin-layer chromatography (TLC) is one of the earliest chromatography techniques and uses a thin layer of silica gel or cellulose on a flat plate as the bed[8]. Mikhail Tsvet's original separations, by contrast, were carried out on a packed column of adsorbent rather than a thin layer. The sample, called the analyte in chromatography, is applied to the surface of the thin layer, and the plate is placed in a container with solvent. The solvent is drawn up the plate by capillary action. It carries the analytes upward, but different compounds move at different rates because of their different affinities for the stationary phase, and separation is thereby achieved. The scheme for TLC is shown in
Figure 1.1. The capacity of TLC is limited, and its resolution is low compared with other chromatography techniques. Nevertheless, TLC is still widely used, especially in organic synthesis to track the progress of a reaction. In addition to separation, TLC is also used for compound identification based on the Rf value. The Rf value is the ratio of the distance traveled by the analyte to the distance traveled by the solvent, and it remains the same for a given compound under the same conditions.
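The Rf calculation can be sketched in a few lines of Python. The function name retention_factor and the distances in the usage example are hypothetical and are chosen only for illustration; both distances are assumed to be measured from the line where the sample was applied.

```python
def retention_factor(analyte_distance_mm, solvent_front_distance_mm):
    """Rf = distance traveled by the analyte / distance traveled by the solvent."""
    if solvent_front_distance_mm <= 0:
        raise ValueError("the solvent front must have moved a positive distance")
    if not 0 <= analyte_distance_mm <= solvent_front_distance_mm:
        raise ValueError("the analyte cannot travel farther than the solvent front")
    return analyte_distance_mm / solvent_front_distance_mm


# Hypothetical plate: the spot moved 21 mm while the solvent front moved 60 mm.
rf = retention_factor(21.0, 60.0)
print(round(rf, 2))
```

Because the analyte never moves farther than the solvent front, Rf always lies between 0 and 1, which is why the function rejects distances outside that range.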
Figure 1.2 Scheme of column chromatography.
Column chromatography (Figure 1.2) uses a stationary phase in the form of small particles packed into a column[9]. In most cases, solid particles carrying the stationary material fill the whole tube to maximize the contact area between the mobile phase and the stationary phase and thus the separation ability. The stationary phase may also be coated on the inside wall of the tube with the remaining space left open; this type is called an open tubular column. Another advantage of column chromatography is that the mobile phase can be driven by pressure, which shortens the chromatographic process. Since diffusion is inevitable during chromatography, a shorter run time leads to a higher resolution. The first pressure-driven column chromatography was reported by W. Clark Still[10], and the technique was a major advance in chromatography. Modern column chromatography instruments use pre-packed columns, and the mobile phase is pumped through the column with gradient pumps. In addition, detectors such as UV/vis detectors can be installed after the column to analyze the chemical components in real time.
The mobile phase may also vary, and each physical state of the mobile phase has its pros and cons. Gas and liquid are the two most frequently used and efficient mobile phases, giving gas chromatography (GC) and liquid chromatography (LC), respectively[11]. Gas chromatography is conducted with a packed column as the stationary phase. The driving force of the separation is the partition equilibrium between the stationary phase and the mobile gas phase. At partition equilibrium, the distribution of a substance between the two phases is stable and the distribution ratio remains constant. The stationary phase of GC is either solid or liquid, and the mobile gas is often helium. A GC column has a small diameter, usually around 0.5 mm, and even smaller for a capillary column. In most instruments, high pressure is used to drive the gas mobile phase through the column. The high pressure makes GC efficient but, on the other hand, prevents its application to high-molecular-weight molecules such as polymers and biomolecules. Therefore, GC is generally applied to small-molecule analysis, such as air quality monitoring. LC is more versatile: it can be conducted on a column, paper, or a thin-layer bed and can separate most molecules. Nowadays, the liquid mobile phase is normally forced at high pressure through a column packed with small particles or a porous membrane, which is referred to as high-performance liquid chromatography (HPLC). Generally, the mobile phase of HPLC, usually an organic solvent, is less polar than the stationary phase, which is usually a silica-based material.
Building on the principle of conventional chromatography, several specialized chromatographic techniques have been introduced for chemical analysis and separation. For example, ion-exchange chromatography is based on the ion-exchange mechanism and separates analytes with different charges. It is mostly employed in biology and biochemistry for the purification of charged compounds such as proteins. Hydrophobic interaction chromatography is based on differences in polarity: non-polar side chains of proteins can interact with hydrophobic groups on the stationary phase.
Depending on the application, a chromatography system is designed for preparation, analysis, or both[7]. Preparative chromatography is normally designed for separating relatively large amounts of a mixture and employs recirculation to enhance the resolution. Analytical chromatography, on the other hand, is more sensitive and accurate and is usually designed for tiny amounts. Chromatography techniques continue to develop rapidly. New stationary materials and mechanical designs boost both the separation ability and the efficiency of chromatography. A typical modern HPLC can separate a mixture in a couple of minutes and analyze a sample in as little as a few seconds. In addition, the development of control software and simulation methods has contributed much to the evolution of chromatography techniques[12,13].
1.2 High Performance Liquid Chromatography
High performance liquid chromatography, previously called high pressure liquid chromatography, is developed from classical liquid chromatography. Classical liquid chromatography, which uses a glass tube packed with silica, is still extensively used in organic chemistry and drug synthesis[8]. As illustrated in Figure 1.3 a, the mixture is loaded on the top of the column and the solvent flows through the column under gravity. Each component of the mixture interacts with the silica with a different strength, so the components travel through the column at different speeds. Since each component leaves the column at a different time, the purified components can be collected one by one. HPLC is based on the same principle as classical liquid chromatography but uses a high driving pressure for better efficiency and resolution. As illustrated in Figure 1.3 b and c, a typical modern HPLC system generally contains three parts: the pressure supply, the column, and the linked detector. Samples are injected into the column automatically, and the solvent is driven by a gradient pump instead of gravity. When the sample components pass through the column and reach the detector, real-time signals are recorded. The data obtained from the HPLC process are summarized as a chromatogram (Figure 1.3 d), which is a plot of the sample signal versus time. Since 1978, when the prototype of modern HPLC, called flash liquid chromatography, was first applied, HPLC has evolved into one of the premier methods for separation and analysis. As shown in Figure 1.3 a, an HPLC can separate a mixture of 15 different molecules within one minute. In Figure 1.3 b, the two isomers Sc3N@C80-D5h and Sc3N@C80-Ih, which are extremely similar, are also successfully separated by HPLC using a 5-PYE column[14]. In most cases, the separation of a mixture by HPLC is conducted in multiple steps using several different columns to maximize the separation ability. In the 1980s, the invention of the protein separation column started the application of HPLC in biology and biochemistry[15,16].
Figure 1.3 Column chromatography. (a) Basic column chromatography driven by gravity. (b) Modern high performance liquid chromatography. (c) Diagram of key components of HPLC. (d) A sample HPLC chromatogram for a fullerene derivative.
The development of modern HPLC has dramatically enhanced the ability to analyze and purify mixtures. Great effort continues to be devoted to improving HPLC systems to achieve even better resolution and capacity. The theory of chromatography plays an essential role in the development of HPLC systems. Before the first HPLC system was invented, theory had already predicted that high-pressure-driven chromatography could achieve high efficiency in mixture separations. The design of an HPLC system relies largely on theoretical calculations. For example, the radius of the particles is chosen on the basis of theoretical calculations to achieve the maximum separation capacity.
Chromatographic models to date have been developed mainly on the basis of mass transport and differential equations. Chromatography is affected by several variables: the temperature, the flow rate of the solvent, the column material, and the analytes. Under controlled conditions, the experimental results depend only on the analytes. The retention time $t_R$, the time a given component takes to travel from injection to its appearance at the detector, is introduced to measure the retention strength. On a chromatogram, the retention time is the interval between the start of the run and the time of the peak maximum of the component. The solvent also has a retention time, called the dead time $t_0$. During the chromatographic process, the analyte molecules are distributed between the stationary and mobile phases and remain in equilibrium at all times. To measure the traveling rate of a component $a$, the migration rate $u_a$ is calculated as $u_a = R u$, where $R$ is the fraction of component $a$ in the mobile phase and $u$ is the velocity of the solvent. The capacity factor $k = (1 - R)/R$ measures the distribution of a solute between the stationary phase and the mobile phase. The capacity factor is also related to the dead time and the retention time by $k = (t_R - t_0)/t_0$[7]. The capacity factor is often used to measure the retention behavior of a molecule in a given column at a given solvent flow rate. It has been previously reported that the capacity factor is related to certain physical properties of the molecules.
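The two expressions for the capacity factor are consistent with each other. Assuming a component migrates at a uniform rate through a column of length $L$ (an assumption made here for illustration), it elutes at $t_R = L/u_a$ while the solvent elutes at $t_0 = L/u$, so $u_a = R u$ gives $R = t_0/t_R$ and therefore $(1 - R)/R = (t_R - t_0)/t_0$. The short sketch below checks this numerically; the function names and the retention times are hypothetical and are not measured values from this work.

```python
def capacity_factor(t_r, t_0):
    """Capacity factor k = (t_R - t_0) / t_0 from retention time and dead time."""
    if t_0 <= 0:
        raise ValueError("dead time t_0 must be positive")
    if t_r < t_0:
        raise ValueError("retention time t_R cannot be shorter than t_0")
    return (t_r - t_0) / t_0


def mobile_phase_fraction(k):
    """Fraction R of the component in the mobile phase, from k = (1 - R) / R."""
    if k < 0:
        raise ValueError("capacity factor k cannot be negative")
    return 1.0 / (1.0 + k)


def migration_rate(k, solvent_velocity):
    """Migration rate u_a = R * u of the component."""
    return mobile_phase_fraction(k) * solvent_velocity


# Hypothetical chromatogram: dead time 3 min, analyte peak at 12 min.
t_0, t_r = 3.0, 12.0
k = capacity_factor(t_r, t_0)
r = mobile_phase_fraction(k)
print(k, r, t_0 / t_r)
```

In this example the analyte spends one quarter of its time in the mobile phase, so it migrates at one quarter of the solvent velocity and elutes four times later than the solvent.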
During the chromatographic process, each molecule is always in partition equilibrium between the solvent and the stationary phase. This equilibrium is governed by the intermolecular interactions between the analyte molecules and the stationary phase material. In theory, retention behaviors are predictable from the intermolecular interactions, and the real chromatogram could be calculated if all interactions were considered. However, the interactions are too complicated to be modeled comprehensively. A feasible plan is therefore to identify the primary interactions and model the retention behaviors on the basis of these. The accuracy of prediction depends largely on the structures of the molecules, so models customized for a particular family of molecules are helpful. For example, fullerenes/metallofullerenes have rigid structures, and their interactions with the stationary phase are not difficult to track; as a result, relatively simple models give good estimates of fullerene/metallofullerene retention behaviors[17,18]. Intermolecular interactions fall into five groups: dispersion interactions, dipole-dipole interactions, hydrogen bonding, ionic interactions, and π-π interactions[19]. Dispersion interactions exist between any pair of molecules and are significant for less polar molecules. Because they occur in both the mobile phase and the stationary phase, they are minor factors in the partition equilibrium. Hydrogen bonding partially shares features of covalent bonding and is stronger than most other intermolecular interactions. Hydrogen bonds normally form between a hydrogen atom and an electronegative atom, which may be nitrogen, oxygen, or fluorine. Ionic interactions occur between a charged molecule and other charged or polar molecules around it, and they are strongest when positively and negatively charged species are aligned next to one another. The π-π interactions occur between any two aromatic systems and are critical for maintaining the structures of biomolecules. Charge transfer often occurs when two π systems form a complex with each other. For the separation of fullerenes/metallofullerenes using HPLC, the π-π interactions play a major role[20,21].
Figure 1.4 Details of HPLC column with solvent flows.
The design of the column structure is critical for effective separations. As illustrated in Figure 1.4, silica particles carrying the stationary material are packed in a steel column. Column design in HPLC has long been investigated in pursuit of high separation resolution and efficiency. In practice, the separation capacity of a column is affected by a number of factors, such as the particle size and the column diameter. On a chromatogram, each solute appears as a peak, and each peak is not an ideal line but has a width. The peak width arises from the injection band and diffusion. The width of a peak is a critical characteristic of the separation. Usually, the peak width is measured as the baseline width or the half-height width. Within a given time, the separation ability of a column is determined by the number of peaks that can fit on the chromatogram. The plate