
Journal of Nanotoxicology and Nanomedicine Volume 2 • Issue 1 • January-June 2017

Exploring Simple, Interpretable, and Predictive QSPR Model of Fullerene C60 Solubility in Organic Solvents

Lyudvig S. Petrosyan, Department of Physics, Jackson State University, Jackson, MS, USA
Supratik Kar, Interdisciplinary Center for Nanotoxicity, Department of Chemistry and Biochemistry, Jackson State University, Jackson, MS, USA
Jerzy Leszczynski, Interdisciplinary Center for Nanotoxicity, Department of Chemistry and Biochemistry, Jackson State University, Jackson, MS, USA
Bakhtiyor Rasulev, Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, ND, USA

ABSTRACT

Buckminsterfullerene (C60) and its derivatives are promising nanomaterials for diagnostic and therapeutic agents. They are applied in the pharmaceutical industry because of their nanostructure characteristics, stability, and hydrophobic character. Because C60 is only sparingly soluble, its solubility has attracted considerable attention among carbon nanostructure investigators, owing to its fundamental importance and practical interest in nanotechnology and the medical industry. To study the diverse roles of C60 and its derivatives, the dependence of fullerene solubility on the molecular structure of the solvent must be understood. The current study explores the solubility of fullerene C60 in 156 organic solvents using simple, interpretable, and predictive 1D and 2D descriptors within a quantitative structure-property relationship (QSPR) approach. The authors employed a genetic algorithm followed by multiple linear regression analysis (GA-MLR) to generate the correlation models. The best performance is achieved by the four-variable MLR model, with internal and external prediction coefficients of Q² = 0.86 and R²pred = 0.89. The study identified vital properties and structural fragments that are particularly valuable for guiding future synthetic and prediction efforts. The model, generated with the highest number of organic solvents to date and with simple descriptors, can be reproduced quickly to predict the solubility of C60 in any new or existing organic solvent. This approach can be used as an efficient predictor of fullerene solubility in various organic solvents.

Keywords

C60, Chemometrics, Fullerene, GA, MLR, QSPR, Solubility

1. INTRODUCTION

Fullerene, a highly symmetrical cage-like molecule, interacts specifically with organic solvents, and knowledge of these interactions can provide significant information on the mechanisms of solute-solvent interactions. Fullerenes have well-defined rigid geometries, in contrast to other solutes whose shapes undergo conformational changes. In addition, intramolecular vibrational partition functions may undergo solvent-dependent changes (Prylutskyy et al., 2003). Because C60 is only sparingly soluble in most organic solvents, the production cost of this nanomaterial is still high (Shunaev et al., 2015). Therefore, an understanding of fullerene solubility assists in the purification, extraction, bioavailability, reactivity, and risk assessment of fullerenes. This information is vital because of the wide application of carbon nanostructures, such as C60 and its derivatives, in nanotechnology, pharmaceuticals, cosmetics, medicinal chemistry, environmental applications (Cook et al., 2010; Bogatu & Leszczynska, 2016), and materials science (Gharagheizi & Alamdari, 2008; Sivaraman et al., 2001).

Quantitative structure-property relationship (QSPR) modeling is a powerful tool for modeling and predicting physicochemical properties. The QSPR method rests on a mathematical algorithm that provides a rational basis for establishing a predictive correlation model. Apart from providing a mathematical correlation, it also enables the exploration of the chemical features encoded within the parameters (descriptors) (Roy, Kar, & Das, 2015a; Toropova, 2016). Hence, a diverse set of descriptors plays a noteworthy role in recognizing and analyzing the chemical basis of a studied phenomenon. A reliable QSPR model can therefore offer a time- and cost-effective estimate of C60 solubility values in organic solvents in the absence of experimental data.

DOI: 10.4018/JNN.2017010103

Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

A series of investigations predicting C60 solubility in organic solvents with QSPR models has been reported over the last 12 years. Liu et al. (2005) generated a linear model as well as a least-squares support vector machine (LSSVM) model for predicting the solubility of C60 in 128 and 122 organic solvents, respectively. Toropov et al. (2007, 2009) demonstrated two kinds of descriptor methods for predicting the solubility of C60 in different organic solvents. The same dataset was used to build a one-variable model, once with optimal descriptors calculated from the simplified molecular input line entry system (SMILES) (Toropov et al., 2007) and once with the International Chemical Identifier (InChI) (Toropov et al., 2009), both with high statistical quality. Petrova et al. (2011) showed that the GA-MLR technique in combination with quantum-chemical and topological descriptors yields reliable four-variable models for 122 organic solvents. Pourbasheer et al. (2011) developed a GA-MLR model to predict fullerene solubility in 36 benzene derivatives. Ghasemi et al. (2013) proposed the first 3D-QSAR model, employing VolSurf-based descriptors with the SPA-SVM (successive projection algorithm-support vector machine) method to predict C60 solubility in 132 organic solvents with acceptable statistical results. Recently, Xu et al. (2016) proposed a QSPR model for predicting the solubility of fullerene C60 in 156 diverse organic solvents using norm indexes. In this regard, we aimed to find a simple, predictive, computationally time-efficient, and mechanistically interpretable model to predict the solubility of C60 in the same set of organic solvents considered by Xu et al. (2016). In addition, the study aims to estimate the predictive potential of simple 1D and 2D descriptors for modeling the solubility of fullerene C60 in a large number of organic solvents.

2. MATERIALS AND METHODS

2.1. Data Set

The experimental solubility (S) data of C60 in 156 organic solvents (Table 1) were collected from two datasets: Beck and Mándi (1997) and Semenov et al. (2010). Because the logarithm of the molar fraction corresponds to the free energy change of the solvation process, solubility was expressed as logS rather than in weight units (e.g., mg/mL).
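As a rough illustration of the unit conversion implied here, the sketch below turns a solubility given in mg/mL into the logarithm of a mole fraction. It assumes that 1 mL of saturated solution contains about 1 mL of solvent, which is reasonable only for dilute solutions. The function name and the toluene-like input values are illustrative assumptions and are not taken from the datasets used in this work.

```python
from math import log10

M_C60 = 60 * 12.011  # molar mass of C60 in g/mol


def log_mole_fraction(s_mg_per_ml, solvent_molar_mass, solvent_density):
    """log10 of the C60 mole fraction, per 1 mL of solvent (dilute-solution assumption)."""
    n_c60 = s_mg_per_ml * 1e-3 / M_C60                 # mol of C60 per mL
    n_solvent = solvent_density / solvent_molar_mass   # mol of solvent per mL
    return log10(n_c60 / (n_c60 + n_solvent))


# Illustrative input only: 2.8 mg/mL in a toluene-like solvent
# (molar mass 92.14 g/mol, density 0.867 g/mL).
example_log_s = log_mole_fraction(2.8, 92.14, 0.867)
print(round(example_log_s, 2))
```

Because the result depends on the solvent's molar mass and density, two solvents with the same mg/mL solubility can have different logS values on the mole-fraction scale.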


2.2. Descriptors Calculation

Solvent structures were prepared in the HyperChem 8 software package (HyperChem(TM)) and saved as .mol files. Thereafter, the DRAGON 6 software (DRAGON, TALETE srl, Italy) was employed to generate a pool of descriptors to correlate with logS, followed by a search for the features most responsible for C60 solubility in organic solvents. A total of 319 descriptors were considered, drawn from constitutional indices, topological indices, walk and path counts, connectivity indices, functional group counts, ETA indices, atom-centered fragments, atom-type E-state indices, and molecular properties. Details of the descriptors are available at http://www.talete.mi.it/products/dragon_molecular_descriptor_list.pdf

2.3. Model Development

The initial dataset was divided into training and test sets after sorting the solvents by experimental logS from lowest to highest. From each consecutive group of four solvents in the sorted list, the second was assigned to the test set, and this process was carried out over the whole dataset. The dataset was therefore divided in a 3:1 ratio, with 117 and 39 solvents in the training and test sets, respectively. A genetic algorithm (GA), as implemented in the Genetic Algorithm 1.4 software package (http://teqip.jdvu.ac.in/QSAR_Tools/), was then employed as the descriptor selection tool. We applied GA to select only the best combinations of descriptors for building models with the highest predictive power for solubility. Multiple linear regression (MLR) analysis was then performed with the MLR Plus Validation GUI 1.2 software (http://teqip.jdvu.ac.in/QSAR_Tools/), followed by validation of the model using the test set compounds. Overall, the combined GA-MLR technique was used to select the appropriate descriptors and to generate different QSPR models, selecting the best models with one to six variables.
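The sorted-response division described above can be sketched as follows. This is a minimal illustration, not the tool used in the study: the function name is hypothetical, the input is synthetic, and which position within each group of four goes to the test set is an assumption exposed as a parameter.

```python
def sorted_response_split(log_s, test_position=1, group_size=4):
    """Sort by response and send one member of each consecutive group to the test set.

    test_position is the 0-based position inside each group of `group_size`;
    1 means "every second of four", which is how the text is read here.
    """
    order = sorted(range(len(log_s)), key=lambda i: log_s[i])
    train, test = [], []
    for rank, idx in enumerate(order):
        (test if rank % group_size == test_position else train).append(idx)
    return train, test


# Synthetic responses standing in for the 156 experimental logS values.
demo_log_s = [-9.0 + 0.045 * ((i * 37) % 156) for i in range(156)]
train_idx, test_idx = sorted_response_split(demo_log_s)
print(len(train_idx), len(test_idx))
```

Sorting before splitting keeps the test set spread over the whole response range, so external validation is not confined to one region of logS.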

2.4. Model Validation

A set of stringent statistical metrics was used to assess the fitness of the in silico models through internal and external validation. The goodness-of-fit of each equation was checked by the determination coefficient (R²), the internal validation metric from leave-one-out cross-validation (Q²LOO), and the external validation metric (R²pred). The r²m metrics (Roy et al., 2012; Roy & Kar, 2015a) and Golbraikh and Tropsha's criteria were also checked for each model (Roy & Kar, 2015a). We also checked the prediction quality of all the models in terms of the mean absolute error (MAE)-based criteria (Roy et al., 2016). The robustness of the models was checked by Y-randomization, generating 100 random models by shuffling the dependent variable while keeping the original independent variables. The cR²p parameter (Roy & Kar, 2015a), which must be greater than 0.5 for an acceptable QSPR model, was also checked.

As stated in the OECD principles, a reliable and transparent QSPR model should have a defined applicability domain (AD) (Roy & Kar, 2015b). Thus, the AD of the best QSPR model was checked employing two techniques: a) the standardization-based technique (Roy, Kar, & Ambure, 2015b) and b) the Euclidean distance approach (http://teqip.jdvu.ac.in/QSAR_Tools/). The principle of the standardization-based technique is as follows: for a normal distribution, 99.7% of the population lies within the range mean ± 3 standard deviations (SD). Thus, mean ± 3SD represents the zone to which most of the training set compounds belong. Any compound outside this zone is significantly different from the majority of the compounds, and any test compound outside this limit is considered an outlier.

The Euclidean distance approach is based on distance scores calculated from Euclidean distance norms. First, the normalized mean distance scores of the training set compounds are calculated; these values range from 0 to 1 (0 = least diverse, 1 = most diverse training set compound). Then the normalized mean distance scores of the test set compounds are calculated, and test compounds with a score outside the 0 to 1 range are considered to be outside the applicability domain. This can also be checked with a scatter plot (normalized mean distance vs. respective activity/property) including both training and test sets.
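The validation metrics named in this section can be sketched in plain Python as below. This is an illustrative sketch, not the software used in the study. It assumes the commonly used definitions Q²LOO = 1 − PRESS/SS(train), R²pred = 1 − Σ(y − ŷ)²/Σ(y − ȳtrain)² over the test set, and cR²p = R × sqrt(R² − R²r), where R²r is the mean R² of the randomized models. All function names and the demonstration data are hypothetical.

```python
import random


def fit_mlr(X, y):
    """Least-squares fit with intercept via the normal equations; returns [b0, b1, ...]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for k in range(p):
            if k != c:
                f = M[k][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] for i in range(p)]


def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]


def r2(y, y_hat):
    m = sum(y) / len(y)
    return 1 - sum((a - b) ** 2 for a, b in zip(y, y_hat)) / sum((a - m) ** 2 for a in y)


def q2_loo(X, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, [X[i]])[0]) ** 2
    m = sum(y) / len(y)
    return 1 - press / sum((a - m) ** 2 for a in y)


def r2_pred(coef, X_test, y_test, y_train_mean):
    """External R2, referenced to the training-set mean of the response."""
    num = sum((a - b) ** 2 for a, b in zip(y_test, predict(coef, X_test)))
    return 1 - num / sum((a - y_train_mean) ** 2 for a in y_test)


def c_r2_p(X, y, n_random=100, seed=0):
    """Y-randomization: shuffle y, refit, and penalize R2 by the mean random R2."""
    rng = random.Random(seed)
    r2_model = r2(y, predict(fit_mlr(X, y), X))
    r2_rand = []
    for _ in range(n_random):
        y_shuffled = y[:]
        rng.shuffle(y_shuffled)
        r2_rand.append(r2(y_shuffled, predict(fit_mlr(X, y_shuffled), X)))
    gap = max(r2_model - sum(r2_rand) / n_random, 0.0)
    return (r2_model ** 0.5) * (gap ** 0.5)


# Synthetic two-descriptor data, for illustration only.
X_demo = [[float(i), float((i * i) % 7)] for i in range(12)]
y_demo = [1.0 + 2.0 * a - b + 0.3 * (-1) ** i for i, (a, b) in enumerate(X_demo)]
r2_demo = r2(y_demo, predict(fit_mlr(X_demo, y_demo), X_demo))
q2_demo = q2_loo(X_demo, y_demo)
crp2_demo = c_r2_p(X_demo, y_demo)
print(round(r2_demo, 3), round(q2_demo, 3), round(crp2_demo, 3))
```

With these definitions Q²LOO can never exceed R², and cR²p can never exceed R², so a large gap between either pair signals overfitting or chance correlation.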


Table 1. Experimental and predicted (calculated according to equation (4)) fullerene C60 solubility for studied organic solvents

ID | CAS no. | Solvent name | logS (Exp) | logS (Cal) | Residual
Training set
1 | 109-66-0 |  | -6.10 | -5.80 | -0.30
2 | 110-54-3 |  | -5.10 | -5.41 | 0.31
4 | 26635-64-3 | Isooctane | -5.20 | -4.71 | -0.49
5 | 124-18-5 |  | -4.70 | -4.32 | -0.38
6 | 112-40-3 |  | -3.50 | -3.92 | 0.42
7 | 493-01-6 | cis-Decahydronaphthalene | -3.30 | -3.59 | 0.29
8 | 493-02-7 | trans-Decahydronaphthalene | -3.50 | -3.59 | 0.09
10 | 542-18-7 | Cyclohexylchloride | -4.10 | -3.99 | -0.11
12 | 626-62-0 | Cyclohexyliodide | -2.80 | -3.03 | 0.23
14 | 110-83-8 | Cyclohexene | -3.80 | -4.31 | 0.51
15 | 108-87-2 | Methylcyclohexane | -4.50 | -4.34 | -0.16
17 | 75-09-2 | Dichloromethane | -4.60 | -5.58 | 0.98
18 | 56-23-5 | Carbontetrachloride | -4.40 | -4.24 | -0.16
19 | 74-95-3 | Dibromomethane | -4.50 | -4.34 | -0.16
22 | 74-97-5 | Bromochloromethane | -4.20 | -4.93 | 0.73
24 | 75-03-6 | Iodoethane | -4.50 | -4.54 | 0.04
25 | 79-34-5 | 1,1,2,2-Tetrachloroethane | -3.10 | -3.39 | 0.29
27 | 71-55-6 | 1,1,1-Trichloroethane | -4.70 | -4.69 | -0.01
28 | 540-54-5 | 1-Chloropropane | -5.60 | -5.65 | 0.05
30 | 75-29-6 | 2-Chloropropane | -5.90 | -5.58 | -0.32
31 | 75-26-3 | 2-Bromopropane | -5.40 | -4.96 | -0.44
32 | 75-30-9 | 2-Iodopropane | -4.80 | -4.16 | -0.64
33 | 78-87-5 | 1,2-Dichloropropane | -4.90 | -4.67 | -0.23
34 | 142-28-9 | 1,3-Dichloropropane | -4.80 | -4.98 | 0.18
35 | 78-75-1 | 1,2-Dibromopropane | -4.30 | -3.66 | -0.64
36 | 627-31-6 | 1,3-Diiodopropane | -3.40 | -3.14 | -0.26
40 | 513-38-2 | 1-Iodo-2-methylpropane | -4.30 | -3.87 | -0.43
41 | 507-19-7 | 2-Bromo-2-methylpropane | -5.00 | -5.14 | 0.14
43 | 127-18-4 | Tetrachloroethylene | -3.80 | -3.24 | -0.56
44 | 513-37-1 | 1-Chloro-2-methylpropene | -4.50 | -4.95 | 0.45
45 | 71-43-2 | Benzene | -4.00 | -3.82 | -0.18
46 | 95-47-6 | 1,2-Dimethylbenzene | -2.90 | -3.53 | 0.63
47 | 108-38-3 | 1,3-Dimethylbenzene | -3.30 | -3.61 | 0.31
49 | 95-63-6 | 1,2,4-Trimethylbenzene | -2.50 | -3.38 | 0.88
50 | 108-67-8 | 1,3,5-Trimethylbenzene | -3.50 | -3.50 | 0.00
51 | 527-53-7 | 1,2,3,5-Tetramethylbenzene | -2.40 | -3.25 | 0.85
52 | 119-64-2 | Tetralin | -2.50 | -2.58 | 0.08
53 | 103-65-1 | N-propylbenzene | -3.50 | -3.67 | 0.17
55 | 104-51-8 | n-Butylbenzene | -3.40 | -3.55 | 0.15
57 | 462-06-6 | Fluorobenzene | -4.10 | -3.87 | -0.23
58 | 108-90-7 | Chlorobenzene | -3.00 | -3.41 | 0.41
59 | 108-86-1 | Bromobenzene | -3.30 | -3.01 | -0.29
60 | 95-50-1 | 1,2-Dichlorobenzene | -2.40 | -2.90 | 0.50
62 | 694-80-4 | 1-Bromo-2-chloro-benzene | -2.40 | -2.55 | 0.15
64 | 120-82-1 | 1,2,4-Trichlorobenzene | -2.80 | -2.57 | -0.23
65 | 100-42-5 | Styrene | -3.20 | -3.53 | 0.33
68 | 100-66-3 | Anisole | -3.10 | -3.98 | 0.88
69 | 100-52-7 | Benzaldehyde | -4.20 | -3.69 | -0.51
70 | 103-71-9 | Phenylisocyanate | -3.40 | -3.61 | 0.21
72 | 108-98-5 | Thiophenol | -3.00 | -3.53 | 0.53
73 | 100-39-0 | Benzylbromide | -3.10 | -3.04 | -0.06
74 | 30583-33-6 | Trichlorotoluene | -3.00 | -2.62 | -0.38
75 | 90-12-0 | 1-Methylnaphthalene | -2.20 | -2.24 | 0.04
76 | 28804-88-8 | Dimethylnaphthalene | -2.10 | -2.19 | 0.09
77 | 605-02-7 | 1-Phenylnaphthalene | -1.90 | -1.34 | -0.56
80 | 71-41-0 | 1-Pentanol | -5.30 | -5.64 | 0.34
81 | 67-64-1 | Acetone | -7.00 | -6.26 | -0.74
82 | 68-12-2 | N,N-Dimethylformamide | -5.30 | -5.86 | 0.56
83 | 110-01-0 | Tetrahydrothiophene | -5.40 | -4.51 | -0.89
84 | 110-02-1 | Thiophene | -4.40 | -3.99 | -0.41
85 | 554-14-3 | 2-Methylthiophene | -3.00 | -3.92 | 0.92
86 | 872-50-4 | N-methyl-2-pyrrolidone | -3.90 | -4.39 | 0.49
88 | 91-22-5 | Quinoline | -2.90 | -2.31 | -0.59
89 | 62-53-3 | Aniline | -3.90 | -3.98 | 0.08
90 | 100-61-8 | N-methylaniline | -3.80 | -3.86 | 0.06
94 | 110-82-7 | Cyclohexane | -5.30 | -4.53 | -0.77
95 | 591-49-1 | 1-Methyl-1-cyclohexene | -3.80 | -4.15 | 0.35
96 | 2207-01-4 | cis-1,2-Dimethylcyclohexane | -4.60 | -4.05 | -0.55
97 | 1678-91-7 | Ethylcyclohexane | -4.30 | -4.24 | -0.06
98 | 67-66-3 | Chloroform | -4.80 | -4.57 | -0.23
99 | 106-93-4 | 1,2-Dibromoethane | -4.20 | -4.02 | -0.18
100 | 106-94-5 | 1-Bromopropane | -5.20 | -5.03 | -0.17
101 | 109-64-8 | 1,3-Dibromopropane | -4.20 | -4.15 | -0.05
102 | 78-77-3 | 1-Bromo-2-methylpropane | -4.90 | -4.59 | -0.31
103 | 507-20-0 | 2-Chloro-2-methylpropane | -5.70 | -5.66 | -0.04
104 | 558-17-8 | 2-Iodo-2-methylpropane | -4.40 | -4.49 | 0.09
105 | 79-01-6 | Trichloroethylene | -3.80 | -4.06 | 0.26
106 | 108-88-3 | Toluene | -3.40 | -3.74 | 0.34
107 | 106-42-3 | 1,4-Dimethylbenzene | -3.30 | -3.51 | 0.21
108 | 488-23-3 | 1,2,3,4-Tetramethylbenzene | -2.90 | -3.16 | 0.26
110 | 135-98-8 | sec-Butylbenzene | -3.60 | -3.41 | -0.19
111 | 591-50-4 | Iodobenzene | -3.50 | -2.47 | -1.03
112 | 541-73-1 | 1,3-Dichlorobenzene | -3.40 | -3.06 | -0.34
113 | 583-53-9 | 1,2-Dibromobenzene | -2.60 | -2.20 | -0.40
114 | 88-72-2 | 2-Nitrotoluene | -3.40 | -3.55 | 0.15
115 | 100-44-7 | Benzylchloride | -3.40 | -3.40 | 0.00
117 | 2586-62-1 | 1-Bromo-2-methylnapthalene | -2.10 | -1.72 | -0.38
118 | 71-23-8 | 1-Propanol | -6.40 | -6.53 | 0.13
119 | 111-27-3 | 1-Hexanol | -5.10 | -5.29 | 0.19
120 | 111-87-5 | 1-Octanol | -5.00 | -4.69 | -0.31
121 | 107-13-1 | Acrylonitrile | -6.40 | -5.98 | -0.42
122 | 111-96-6 | 2-Methoxyethyl ether | -5.20 | -5.21 | 0.01
124 | 79-00-5 | 1,1,2-Trichloroethane | -4.78 | -4.22 | -0.56
126 | 629-04-9 | Bromoheptane | -3.30 | -4.87 | 1.57
127 | 111-83-1 | Bromooctane | -3.09 | -3.84 | 0.75
128 | 112-71-0 | 1-Bromotetradecane | -2.59 | -2.89 | 0.30
129 | 112-89-0 | 1-Bromooctadecane | -2.53 | -2.51 | -0.02
130 | 67-56-1 | Methanol | -8.87 | -8.26 | -0.61
132 | 112-30-1 | Decan-1-ol | -4.15 | -4.24 | 0.09
133 | 112-42-5 | Undecan-1-ol | -3.99 | -4.04 | 0.05
134 | 67-63-0 | Propan-2-ol | -6.65 | -6.46 | -0.19
135 | 78-92-2 | Butan-2-ol | -6.34 | -5.90 | -0.44
136 | 6032-29-7 | Pentan-2-ol | -5.57 | -5.59 | 0.02
138 | 504-63-2 | 1,3-Propanediol | -7.05 | -6.36 | -0.69
141 | 102-04-5 | 1,3-Diphenylacetone | -3.40 | -1.99 | -1.41
143 | 2398-37-0 | m-Bromoanisole | -2.55 | -3.15 | 0.60
144 | 98-07-7 | 1,1,1-Trichloromethylbenzene | -3.02 | -2.82 | -0.20
145 | 573-98-8 | 1,2-Dimethylnaphthalene | -2.12 | -2.19 | 0.07
146 | 75-05-8 | Acetonitrile | -7.54 | -6.83 | -0.71
147 | 109-99-9 | Tetrahydrofuran | -5.17 | -5.20 | 0.03
148 | 108-75-8 | 2,4,6-Trimethylpyridine | -2.80 | -3.63 | 0.83
150 | 79-09-4 | Propionic acid | -5.79 | -6.02 | 0.23
151 | 107-92-6 | Butanoic acid | -5.74 | -5.67 | -0.07
152 | 109-52-4 | Pentanoic acid | -5.05 | -5.26 | 0.21
153 | 142-62-1 | Hexanoic acid | -4.50 | -4.94 | 0.44
154 | 111-14-8 | Heptanoic acid | -4.26 | -4.66 | 0.40
156 | 112-05-0 | Nonanoic acid | -4.41 | -4.17 | -0.24
Test set
3 | 111-65-9 |  | -5.20 | -4.80 | -0.40
9 | 137-43-9 | Cyclopentylbromide | -4.20 | -3.75 | -0.45
11 | 108-85-0 | Cyclohexylbromide | -3.40 | -3.58 | 0.18
13 | 5401-62-7 | 1,2-Dibromocyclohexane | -2.60 | -3.03 | 0.43
16 | 6876-23-9 | trans-1,2-Dimethylcyclohexane | -4.60 | -4.05 | -0.55
20 | 75-25-2 | Bromoform | -3.20 | -3.03 | -0.17
21 | 74-88-4 | Iodomethane | -4.20 | -4.89 | 0.69
23 | 74-96-4 | Bromoethane | -5.20 | -5.48 | 0.28
26 | 107-06-2 | 1,2-Dichloroethane | -5.00 | -5.13 | 0.13
29 | 107-08-4 | 1-Iodopropane | -4.60 | -4.23 | -0.37
37 | 96-11-7 | 1,2,3-Tribromopropane | -2.90 | -2.92 | 0.02
38 | 96-18-4 | 1,2,3-Trichloropropane | -4.00 | -4.10 | 0.10
39 | 513-36-0 | 1-Chloro-2-methylpropane | -5.40 | -5.13 | -0.27
42 | 540-49-8 | 1,2-Dibromoethylene | -3.70 | -3.83 | 0.13
48 | 526-73-8 | 1,2,3-Trimethylbenzene | -3.10 | -3.37 | 0.27
54 | 98-82-8 | iso-Propylbenzene | -3.60 | -3.47 | -0.13
56 | 98-06-6 | tert-Butylbenzene | -3.70 | -3.77 | 0.07
61 | 108-36-1 | 1,3-Dibromobenzene | -2.60 | -2.49 | -0.11
63 | 108-37-2 | 1-Bromo-3-chloro-benzene | -3.00 | -2.76 | -0.24
66 | 98-95-3 | Nitrobenzene | -3.90 | -3.68 | -0.22
67 | 100-47-0 | Benzonitrile | -4.20 | -3.64 | -0.56
71 | 99-08-1 | 3-Nitrotoluene | -3.40 | -3.48 | 0.08
78 | 64-17-5 | Ethanol | -7.10 | -6.82 | -0.28
79 | 71-36-3 | 1-Butanol | -5.90 | -6.07 | 0.17
87 | 110-86-1 | Pyridine | -4.00 | -4.01 | 0.01
91 | 121-69-7 | N,N-Dimethylaniline | -3.20 | -3.54 | 0.34
92 | 4904-61-4 | 1,5,9-Cyclododecatriene | -2.70 | -2.71 | 0.01
93 | 629-59-4 | Tetradecane | -4.30 | -3.59 | -0.71
109 | 100-41-4 | Ethylbenzene | -3.40 | -3.71 | 0.31
116 | 90-13-1 | 1-Chloronaphthalene | -2.00 | -2.03 | 0.03
123 | 111-84-2 |  | -4.92 | -4.55 | -0.37
125 | 109-65-9 | Bromobutane | -3.74 | -4.87 | 1.13
131 | 143-08-8 | Nonan-1-ol | -4.29 | -4.45 | 0.16
137 | 584-02-1 | Pentan-3-ol | -5.36 | -5.50 | 0.14
139 | 110-63-4 | 1,4-Butanediol | -6.57 | -5.87 | -0.70
140 | 111-29-5 | 1,5-Pentanediol | -6.19 | -5.49 | -0.70
142 | 104-92-7 | p-Bromoanisole | -2.54 | -3.03 | 0.49
149 | 64-19-7 | Acetic acid | -6.27 | -6.63 | 0.36
155 | 124-07-2 | Octanoic acid | -4.98 | -4.40 | -0.58
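The residuals listed in Table 1 allow a quick consistency check of the external error statistics. The sketch below is an illustrative recomputation, not part of the original workflow: it copies the 39 test-set residuals from the table and computes the mean square error, root mean square error, and mean absolute error. The variable names are assumptions. The resulting MSE should come out close to the value of 0.159 reported in Section 3.1, with small differences expected because the tabulated residuals are rounded to two decimals.

```python
from math import sqrt

# Residual column of Table 1, test set only (39 solvents, in table order).
test_residuals = [
    -0.40, -0.45, 0.18, 0.43, -0.55, -0.17, 0.69, 0.28, 0.13, -0.37,
    0.02, 0.10, -0.27, 0.13, 0.27, -0.13, 0.07, -0.11, -0.24, -0.22,
    -0.56, 0.08, -0.28, 0.17, 0.01, 0.34, 0.01, -0.71, 0.31, 0.03,
    -0.37, 1.13, 0.16, 0.14, -0.70, -0.70, 0.49, 0.36, -0.58,
]

mse = sum(r * r for r in test_residuals) / len(test_residuals)
rmse = sqrt(mse)
mae = sum(abs(r) for r in test_residuals) / len(test_residuals)
print(round(mse, 3), round(rmse, 3), round(mae, 3))
```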

3. RESULTS AND DISCUSSION

3.1. Statistical Performance and Validation Quality of QSAR Models

To find statistically acceptable and reliable models, we built models with an increasing number of descriptors until the correlation coefficients for internal and external validation reached their peak. The four-variable model (Equation (4)) yields the best predictive potential for the solubility of C60 in organic solvents, which is supported by the comparative plot of correlation coefficients for the individual models (Figure 1). Figure 1 shows that external prediction quality drops for the five- and six-descriptor models compared with the four-descriptor model, whereas the internal validation statistics are almost comparable for all three models. Thus, the four-descriptor model achieves similar internal prediction quality and better external prediction than the five- and six-descriptor models. In addition, as the number of descriptors increases, the complexity of the model increases and its interpretation becomes more difficult. Therefore, we chose Equation (4) as the best model over Equations (5) and (6). Collinearity among the modeled descriptors was checked through Pearson correlation. The statistical outcomes of the one- to six-variable models are given in Table 2. The developed model equations are as follows:

One-variable model:

$$\log S = 9.513(\pm 0.952) - 1.582(\pm 0.110) \times \mathrm{Psi\_e\_A} \qquad (1)$$

Two-variable model:

$$\log S = -14.639(\pm 0.546) + 11.427(\pm 0.692) \times \mathrm{PDI} + 0.444(\pm 0.087) \times \mathrm{ON1} \qquad (2)$$

Three-variable model:

$$\log S = -13.637(\pm 0.509) + 0.318(\pm 0.039) \times \mathrm{X0v} + 0.575(\pm 0.371) \times \mathrm{NssssN} + 9.556(\pm 0.692) \times \mathrm{PDI} \qquad (3)$$

Four-variable model:

$$\log S = -16.199(\pm 0.760) + 0.423(\pm 0.056) \times \mathrm{X0sol} + 11.825(\pm 1.01) \times \mathrm{PDI} - 0.021(\pm 0.009) \times \mathrm{VAR} - 0.303(\pm 0.088) \times \mathrm{X4sol} \qquad (4)$$

Five-variable model:

$$\log S = -13.648(\pm 0.646) - 0.011(\pm 0.025) \times \mathrm{BBI} + 5.539(\pm 0.497) \times \frac{\sum \alpha}{N_V} - 0.016(\pm 0.003) \times \mathrm{SAacc} + 4.061(\pm 0.876) \times \sum \beta'_s + 2.205(\pm 0.233) \times \mathrm{piPC01} \qquad (5)$$

Six-variable model:

$$\log S = -8.591(\pm 0.235) + 0.063(\pm 0.005) \times \mathrm{AMW} + 1.001(\pm 0.117) \times \mathrm{piPPC02} + 0.681(\pm 0.188) \times \mathrm{Cl\text{-}086} + 0.329(\pm 0.315) \times \mathrm{NsssN} - 0.015(\pm 0.003) \times \mathrm{SAacc} + 0.653(\pm 0.093) \times \mathrm{MPC03} \qquad (6)$$
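To show how the four-variable model would be applied to a new solvent, the sketch below evaluates Equation (4) as a plain function. The coefficients are those of Equation (4); the function name and the descriptor values in the example are hypothetical and serve only to exercise the arithmetic. In practice the four descriptors would be taken from DRAGON output for the solvent of interest.

```python
# Coefficients of the four-variable model, Equation (4).
EQ4 = {
    "intercept": -16.199,
    "X0sol": 0.423,
    "PDI": 11.825,
    "VAR": -0.021,
    "X4sol": -0.303,
}


def predict_log_s(x0sol, pdi, var, x4sol):
    """Predicted logS of C60 in a solvent from its four descriptor values."""
    return (EQ4["intercept"]
            + EQ4["X0sol"] * x0sol
            + EQ4["PDI"] * pdi
            + EQ4["VAR"] * var
            + EQ4["X4sol"] * x4sol)


# Hypothetical descriptor values, chosen only to exercise the function.
hypothetical = {"x0sol": 5.0, "pdi": 0.85, "var": 10.0, "x4sol": 1.0}
print(round(predict_log_s(**hypothetical), 3))
```

Because the model is linear, the effect of each descriptor can be read directly from its coefficient: for example, raising PDI by 0.1 while holding the other descriptors fixed raises the predicted logS by about 1.18.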

As the training set composition is fixed, there may be a bias in the selection of the descriptors. Therefore, we also performed a double cross-validation strategy (Roy & Ambure, 2016) to check whether our four-descriptor model is the best or whether a better model can be obtained. In double cross-validation, the best model was obtained by consensus model predictions among three methods (method parameters and results are provided in the supplementary material) using the software available at http://teqip.jdvu.ac.in/QSAR_Tools/. Comparing our previous best model with the best model from double cross-validation, the previous four-descriptor GA-MLR model emerged as the best model based on regression-based as well as error-based metrics. Bias in prediction (Roy, Ambure, & Aher, 2017) was also checked for the best model (Equation (4)), with acceptable results: Variance: 0.0087, Bias²: 0.160, Variance + Bias²: 0.169, mean square error (MSE): 0.159.

A graphical illustration of the experimental and predicted solubility (see Table 1) according to the best model (Equation (4)) for the training and test sets is shown in Figure 2. All models were developed with well explainable and transparent descriptors. The lack of chance correlation in the QSPR models is reflected in the value of cR²p, which is higher than the acceptable threshold of 0.5 for all six models.

Thereafter, we performed the AD study with two different approaches to make sure that the best model is reliable and robust. With the Euclidean distance-based AD technique, no test compounds were found outside the AD. With the standardization-based AD approach, one compound from the test set (ID: 93, Tetradecane) was found to lie outside the domain defined by the training set (Figure 3). Thus, the collective outcome of the AD studies indicates that, out of 39 solvents in the test set, only 1 solvent is outside the AD and its prediction is not reliable. The AD study shows that the solubility of 97.4% of the test solvents can be predicted with confidence.

Figure 1. Comparative plot of correlation coefficient values for individual model with one to six variables

Figure 2. Scatter plot for the best four-variable model
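The mean ± 3SD idea behind the standardization-based AD check can be sketched as follows. This is a deliberately simplified illustration: it flags a compound as soon as any of its descriptors lies more than three training-set standard deviations from the training-set mean, whereas the published technique (Roy, Kar, & Ambure, 2015b) may include further steps that are not described in this article. The function names and the synthetic data are assumptions.

```python
from statistics import mean, stdev


def standardized_abs(train, compounds):
    """|x - mean| / SD per descriptor, using training-set mean and SD."""
    columns = list(zip(*train))
    mu = [mean(c) for c in columns]
    sd = [stdev(c) for c in columns]
    return [[abs(v - m) / s for v, m, s in zip(row, mu, sd)] for row in compounds]


def outside_domain(train, compounds, limit=3.0):
    """Flag a compound when any standardized descriptor exceeds the limit."""
    return [max(z) > limit for z in standardized_abs(train, compounds)]


# Synthetic two-descriptor data, for illustration only.
train_demo = [[float(i), float((i * 3) % 10)] for i in range(20)]
test_demo = [[10.0, 5.0], [100.0, 5.0]]
flags = outside_domain(train_demo, test_demo)
print(flags)

# Coverage reported in the text: 38 of 39 test solvents inside the AD.
reported_coverage = round(100 * 38 / 39, 1)
print(reported_coverage)
```

The second test compound is flagged because its first descriptor lies far outside the range spanned by the training set, which mirrors how a single unusual descriptor value can place a solvent outside the domain.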

Table 2. Statistical outcome for the selected models with one to six descriptors

Metric | Model 1 | Model 2 | Model 3 | Model 4 | Model 6 | Model 5
Internal validation
R² | 0.64 | 0.79 | 0.84 | 0.87 | 0.89 | 0.88
Q²LOO | 0.63 | 0.78 | 0.83 | 0.86 | 0.87 | 0.86
RMSEc | 0.762 | 0.579 | 0.507 | 0.461 | 0.422 | 0.447
Average r²m(LOO) scaled | 0.49 | 0.69 | 0.76 | 0.79 | 0.82 | 0.80
Δr²m(LOO) scaled | 0.27 | 0.17 | 0.14 | 0.12 | 0.10 | 0.11
External validation
R²pred | 0.68 | 0.81 | 0.89 | 0.89 | 0.86 | 0.87
RMSEp | 0.684 | 0.526 | 0.398 | 0.400 | 0.416 | 0.439
Average r²m(test) scaled | 0.51 | 0.66 | 0.81 | 0.82 | 0.84 | 0.82
Δr²m(test) scaled | 0.26 | 0.18 | 0.10 | 0.10 | 0.01 | 0.00
MAE-based criteria (a) | Good | Good | Good | Good | Good | Moderate
Randomization
cR²p | 0.79 | 0.83 | 0.86 | 0.64 | 0.85 | 0.85
GTC | Pass | Pass | Pass | Pass | Pass | Pass

(a) Good predictions: MAE ≤ 0.1 × training set range and MAE + 3σ ≤ 0.2 × training set range. Bad predictions: MAE > 0.15 × training set range or MAE + 3σ > 0.25 × training set range. Here, MAE is the average absolute error and σ denotes the standard deviation of the absolute error values for the test set data.
GTC: Golbraikh and Tropsha's criteria.


Figure 3. Euclidean-AD plot for the best four variable model

3.2. Mechanistic Interpretation of the Best Model

The success and acceptability of any QSPR model rely on its mechanistic interpretation, as suggested in OECD principle 5. The best four-variable model is considered for interpretation in this section. The contributions of the descriptors (Figure 4) to the solubility of C60 according to Equation (4) follow the order PDI > X0sol > X4sol > VAR. The most important feature of the best model is PDI, a molecular property defined as the packing density index (Todeschini & Consonni, 2000). PDI is the ratio between the McGowan volume (Vx) and the total surface area from P_VSA-like descriptors. The McGowan volume (Vx, mL/mol) is the actual volume of a mole of molecules when the molecules are not in motion (McGowan, 1978). It is proportional to the parachor (Vp), which is the molecular volume at temperatures at which surface tensions are equal and is given by the following equation:

$$V_p = \frac{M \gamma^{1/4}}{d_l - d_g} \qquad (7)$$

where M is the molecular weight of the liquid, γ is its surface tension, d_l is its density, and d_g is the density of the vapor in equilibrium with the liquid. The P_VSA descriptors are defined as the amount of van der Waals surface area (VSA) having the computed property in a certain range (Labute, 2000).

As suggested by Equation (4) and Figure 4, PDI has the highest positive contribution to the solubility of C60 in diverse organic solvents. The two- and three-variable models also identified PDI as an important property correlated with the solubility response. A higher packing density index value results in better solubility of C60 in the studied solvents.


Figure 4. Descriptors’ contribution plot for the four-variable model

X0sol and X4sol are the solvation connectivity indices of order 0 and order 4, which encode the solvation property of the compound (Todeschini & Consonni, 2000). These descriptors are defined in order to model solvation entropy and dispersion interactions in solution, and they relate the characteristic dimension of the molecule to atomic parameters. Taking the characteristic dimensions of the molecules into account through atomic parameters, both descriptors are defined by the following equation (Antipin et al., 1991):

$$X_{sol} = {}^{m}\chi^{s}_{q} = \frac{1}{2^{m+1}} \cdot \sum_{k=1}^{K} \frac{\left(\prod_{a=1}^{n} L_a\right)_k}{\left(\prod_{a=1}^{n} \delta_a\right)_k^{1/2}} \qquad (8)$$

where L_a is the principal quantum number (2 for C, N, and O atoms; 3 for Si, S, Cl, etc.) of the a-th atom in the k-th subgraph; δ_a is the corresponding vertex degree; K is the total number of m-th order subgraphs; and n is the number of vertices in the subgraph.

Interestingly, X0sol contributes much more to the solubility property than X4sol. Moreover, X0sol and X4sol have positive and negative contributions, respectively. The positive coefficient of X0sol indicates that an increase in the descriptor value results in an increase in solubility, and the negative coefficient of X4sol indicates that an increase in the descriptor value results in a decrease in the solubility of C60 in the considered solvent. Therefore, the solvation property captured by the order-0 connectivity index favors the solubility of C60 in the studied organic solvents, whereas that captured by the order-4 index disfavors it. It is worth noting that a descriptor of the same class (X1sol) was selected as the best in one of the previous studies, performed by Petrova et al. (2011).

The least contributing descriptor in the model is VAR, a topological index describing the variation of the solvent structure (Todeschini & Consonni, 2000), which has a negative contribution to the solubility of C60 in organic solvents. As suggested by its negative coefficient, a lower descriptor value helps to increase the solubility of C60. Overall, among all the models generated, the four-variable model is the superior candidate for further use in C60 solubility predictions. It has good predictive power and uses only transparent, easy-to-calculate descriptors.
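To make Equation (8) concrete, the sketch below evaluates the solvation connectivity index for the two lowest orders on a hydrogen-depleted molecular graph. It assumes the form of Equation (8) with prefactor 1/2^(m+1), so that order 0 sums L/sqrt(δ) over atoms with a factor of 1/2 and order 1 sums L_i·L_j/sqrt(δ_i·δ_j) over bonds with a factor of 1/4. Higher orders such as X4sol require enumerating fourth-order subgraphs and are not covered here. The function names and the graph encoding are assumptions, and the values are not guaranteed to match DRAGON output.

```python
from math import sqrt


def vertex_degrees(n_atoms, bonds):
    """Vertex degree of every atom in a hydrogen-depleted graph."""
    deg = [0] * n_atoms
    for i, j in bonds:
        deg[i] += 1
        deg[j] += 1
    return deg


def x0sol(quantum_numbers, bonds):
    """Order-0 index: (1/2) * sum over atoms of L / sqrt(degree).

    Requires every atom to have at least one bond.
    """
    deg = vertex_degrees(len(quantum_numbers), bonds)
    return 0.5 * sum(l / sqrt(d) for l, d in zip(quantum_numbers, deg))


def x1sol(quantum_numbers, bonds):
    """Order-1 index: (1/4) * sum over bonds of L_i * L_j / sqrt(deg_i * deg_j)."""
    deg = vertex_degrees(len(quantum_numbers), bonds)
    return 0.25 * sum(quantum_numbers[i] * quantum_numbers[j] / sqrt(deg[i] * deg[j])
                      for i, j in bonds)


chain = [(0, 1), (1, 2), (2, 3)]          # four heavy atoms in a row
butane = [2, 2, 2, 2]                     # C-C-C-C
chloropropane = [2, 2, 2, 3]              # C-C-C-Cl (Cl has L = 3)

print(round(x0sol(butane, chain), 4), round(x1sol(butane, chain), 4))
print(round(x0sol(chloropropane, chain), 4))
```

For a molecule containing only second-row atoms (L = 2), the prefactor cancels the quantum numbers and the indices reduce to the ordinary connectivity indices; replacing a terminal carbon by chlorine raises the order-0 value by 0.5 in this example.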


3.3. Comparison with Developed Model

To evaluate the predictive ability of our model, a comparative analysis was performed with existing models for C60 solubility in organic solvents (Liu et al., 2005; Toropov et al., 2007; Toropov et al., 2009; Petrova et al., 2011; Pourbasheer et al., 2011; Ghasemi, Salahinejad, & Rofouei, 2013; Xu et al., 2016), and the results are listed in Table 3. Among all the models, Xu et al. (2016) and the present study considered the highest number of solvents as data points, and a linear method was used for model generation in both cases. Although the two models give almost similar statistical results, the current study outperforms the model of Xu et al. (2016) by using far fewer descriptors. The present model explains the solubility correlation well with only four interpretable descriptors for such a large and diverse dataset, with highly acceptable statistical results.

4. CONCLUSION

This study shows that the GA-MLR technique with simple and transparent descriptors yields reliable, predictive, and interpretable models. Although we compared the outcomes of one- to six-variable models, the best performance is achieved by the four-variable model. The weaker performance of the one- to three-variable models can be explained by the limited information provided by a small number of variables, whereas for the five- and six-variable models the correlation and prediction quality have already reached saturation. Among all models developed to date for predicting the solubility of C60, the current model combines the highest number of organic solvents with the least number of descriptors, while providing satisfactory prediction results along with a mechanistic interpretation. The developed models can be employed for future predictions of fullerene C60 solubility in various organic solvents and for a deeper understanding of this phenomenon.

CONFLICT OF INTEREST

The authors confirm that this article has no conflicts of interest.

ACKNOWLEDGMENT

L.S.P., S.K., and J.L. thank the National Science Foundation (NSF/CREST HRD-1547754 and NSF RISE HRD-1547836) for financial support. B.R. acknowledges support from the National Science Foundation under ND EPSCoR Award #IIA-1355466 and from the State of North Dakota.

Table 3. A comparative table for current and existing logS prediction models based on statistical parameters

Work reference | Model | Data points | No. of variables | R² | Q² | R²pred
Liu et al. (2005) | Nonlinear model (LSSVM) | 128 | - | 0.91 | 0.91 | 0.90
Toropov et al. (2007) | One-variable model (SMILES-based descriptor) | 122 | 1 | 0.86 | 0.85 | 0.88
Toropov et al. (2009) | One-variable model (InChI-based descriptor) | 122 | 1 | 0.94 | 0.94 | 0.93
Petrova et al. (2011) | GA-MLR | 122 | 4 | 0.86 | 0.84 | 0.90
Pourbasheer et al. (2011) | GA-MLR | 36 | 5 | 0.87 | 0.80 | 0.81
Ghasemi, Salahinejad, & Rofouei (2013) | SPA-SVM (VolSurf method) | 132 | 5 | 0.88 | 0.72 | 0.87
Xu et al. (2016) | MLR | 156 | 10 | 0.88 | 0.89 | 0.93
Present study | GA-MLR | 156 | 4 | 0.87 | 0.86 | 0.89


REFERENCES

Antipin, I. S., Arslanov, N. A., Palyulin, V. A., Konovalov, A. I., & Zefirov, N. S. (1991). Solvation topological index. Topological description of dispersion interaction (in Russian). Doklady Akademii Nauk SSSR, 316, 925–927.

Beck, M. T., & Mándi, G. (1997). Solubility of C60. Fullerenes, Nanotubes and Carbon Nanostructures, 5(2), 291–310. doi:10.1080/15363839708011993

Bogatu, C., & Leszczynska, D. (2016). Transformation of Nanomaterials in Environment: Surface Activation of SWCNTs during Disinfection of Water with Chlorine. Journal of Nanotoxicology and Nanomedicine, 1(1), 45–57. doi:10.4018/JNN.2016010104

Cook, S. M., Aker, W. G., Rasulev, B., Hwang, H. M., Leszczynski, J., Jenkins, J. J., & Shockley, V. (2010). Choosing safe dispersing media for C60 fullerenes by using cytotoxicity tests on the bacterium Escherichia coli. Journal of Hazardous Materials, 176(1-3), 367–373. doi:10.1016/j.jhazmat.2009.11.039 PMID:19962827

Gharagheizi, F. R., & Alamdari, F. (2008). A molecular-based model for prediction of solubility of C60 fullerene in various solvents. Fullerenes, Nanotubes and Carbon Nanostructures, 16(1), 40–57. doi:10.1080/15363830701779315

Ghasemi, J. B., Salahinejad, M., & Rofouei, M. K. (2013). Alignment independent 3D-QSAR modeling of fullerene (C60) solubility in different organic solvents. Fullerenes, Nanotubes and Carbon Nanostructures, 21(5), 367–380. doi:10.1080/1536383X.2011.629751

HyperChem(TM); 1115 NW 4th Street, Gainesville, Florida 32601, USA.

Labute, P. (2000). A widely applicable set of descriptors. Journal of Molecular Graphics & Modelling, 18(4-5), 464–477. doi:10.1016/S1093-3263(00)00068-1 PMID:11143563

Liu, H., Yao, X., Zhang, R., Liu, M., Hu, Z., & Fan, B. (2005). Accurate quantitative structure-property relationship model to predict the solubility of C60 in various solvents based on a novel approach using a least-squares support vector machine. The Journal of Physical Chemistry B, 109(43), 20565–20571. doi:10.1021/jp052223n PMID:16853662

McGowan, J. C. (1978). Estimates of the properties of liquids. Journal of Applied Chemistry and Biotechnology, 28, 599–607. doi:10.1002/jctb.5700280902

Petrova, T., Rasulev, B. F., Toropov, A. A., Leszczynska, D., & Leszczynski, J. (2011). Improved model for fullerene C60 solubility in organic solvents based on quantum-chemical and topological descriptors. Journal of Nanoparticle Research, 13(8), 3235–3247. doi:10.1007/s11051-011-0238-x

Pourbasheer, E., Riahi, S., Ganjali, M., & Norouzi, R. P. (2011). Prediction of solubility of fullerene C60 in various organic solvents by genetic algorithm-multiple linear regression. Fullerenes, Nanotubes and Carbon Nanostructures, 19(7), 585–598. doi:10.1080/1536383X.2010.504952

Prylutskyy, Y. I., Yashchuk, V. M., Kushnir, K. M., Golub, A. A., Kudrenko, V. A., Prylutska, S. V., Matyshevska, O. P., et al. (2003). Biophysical studies of fullerene-based composite for bionanotechnology. Materials Science and Engineering: C, 23(1-2), 109–111. doi:10.1016/S0928-4931(02)00244-8

Roy, K., & Ambure, P. (2016). The double cross-validation tool for MLR QSAR model development. Chemometrics and Intelligent Laboratory Systems, 159, 108–126. doi:10.1016/j.chemolab.2016.10.009

Roy, K., Ambure, P., & Aher, R. (2017). How important is to detect systematic error in predictions and understand statistical applicability domain of QSAR models? Chemometrics and Intelligent Laboratory Systems, 162, 44–54. doi:10.1016/j.chemolab.2017.01.010

Roy, K., Das, R. N., Ambure, P., & Aher, R. B. (2016). Be aware of error measures. Further studies on validation of predictive QSAR models. Chemometrics and Intelligent Laboratory Systems, 152, 18–33. doi:10.1016/j.chemolab.2016.01.008

Roy, K., & Kar, S. (2015a). How to Judge Predictive Quality of Classification and Regression Based QSAR Models. In Z. U. Haq & J. Madura (Eds.), Frontiers of Computational Chemistry (pp. 71–120). Bentham. doi:10.2174/9781608059782115020005


Roy, K., & Kar, S. (2015b). Importance of applicability domain of QSAR models. Quantitative Structure–Activity Relationships in Drug Design, Predictive Toxicology, and Risk Assessment (pp. 180–211). PA: IGI Global. doi:10.4018/978-1-4666-8136-1.ch005

Roy, K., Kar, S., & Ambure, P. (2015b). On a simple approach for determining applicability domain of QSAR models. Chemometrics and Intelligent Laboratory Systems, 145, 22–29. doi:10.1016/j.chemolab.2015.04.013

Roy, K., Kar, S., & Das, R. N. (2015a). Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.

Roy, K., Mitra, I., Kar, S., Ojha, P., Das, R. N., & Kabir, H. (2012). Comparative studies on some metrics for external validation of QSPR models. Journal of Chemical Information and Modeling, 52(2), 396–408. doi:10.1021/ci200520g PMID:22201416

Semenov, K. N., Charykov, N. A., Keskinov, V. A., Piartman, A. K., Blokhin, A. A., & Kopyrin, A. A. (2010). Solubility of light fullerenes in organic solvents. Journal of Chemical & Engineering Data, 55(1), 13–36. doi:10.1021/je900296s

Shunaev, V. V., Savostyanov, G. V., Slepchenkov, M. M., & Glukhova, O. E. (2015). Phenomenon of current occurrence during the motion of a C60 fullerene on substrate-supported graphene. RSC Advances, 5(105), 86337–86346. doi:10.1039/C5RA12202C

Sivaraman, N., Srinivasan, T., Vasudeva Rao, P., & Natarajan, R. (2001). QSPR modeling for solubility of fullerene (C60) in organic solvents. Journal of Chemical Information and Computer Sciences, 41(4), 1067–1074. doi:10.1021/ci010003a PMID:11500126

Talete. Dragon ver. 6. (n.d.). Retrieved from http://www.talete.mi.it/products/dragon_molecular_descriptors.htm

Todeschini, R., & Consonni, V. (2000). Handbook of molecular descriptors. Weinheim: Wiley-VCH. doi:10.1002/9783527613106

Toropov, A. A., Rasulev, B. F., Leszczynska, D., & Leszczynski, J. (2007). Additive SMILES based optimal descriptors: QSPR modeling of fullerene C60 solubility in organic solvents. Chemical Physics Letters, 444, 209–214. doi:10.1016/j.cplett.2007.07.024

Toropov, A. A., Toropova, A. P., Benfenati, E., Leszczynska, D., & Leszczynski, J. (2009). Additive InChI-based optimal descriptors: QSPR modeling of fullerene C60 solubility in organic solvents. Journal of Mathematical Chemistry, 46(4), 1232–1251. doi:10.1007/s10910-008-9514-0

Toropova, A. P., Achary, P. G. R., & Toropov, A. A. (2016). Quasi-SMILES for nano-QSAR prediction of toxic effect of Al2O3 nanoparticles. Journal of Nanotoxicology and Nanomedicine, 1(1), 17–28. doi:10.4018/JNN.2016010102

Xu, X., Yan, L., Jia, Q., Wang, Q., & Peisheng, M. (2016). Predicting solubility of fullerene C60 in diverse organic solvents using norm indexes. Journal of Molecular Liquids, 223, 603–610. doi:10.1016/j.molliq.2016.08.085


Lyudvig Petrosyan is a theoretical physicist working in condensed matter and quantum physics. He is currently a post-doctoral research associate and adjunct professor in the Department of Physics at Jackson State University, Mississippi, USA. He received his PhD degree in condensed matter physics in 2002 from Yerevan State University, Armenia. Dr. Petrosyan has a diverse range of research experience in condensed matter physics; broadly, his main topics of interest are the electronic and optical properties of low-dimensional nanostructures and resonant tunneling effects in quantum nanostructures. He has over 15 years of teaching experience in the US and Armenia. He has authored more than 40 publications, including research and review articles and 2 textbooks. Over the last 2 years, Dr. Petrosyan’s scientific interests have extended to computational statistics and machine learning methods, particularly materials informatics and cheminformatics, including structure-activity relationship studies for predicting biological activity and physico-chemical properties, with applications to nanoparticles and polymers. In computational statistics, Dr. Petrosyan collaborates closely with North Dakota State University (USA), the Ernest Mario School of Pharmacy at Rutgers University (USA), and The Focus Foundation (USA).

Supratik Kar has been a post-doctoral research associate at the Interdisciplinary Center for Nanotoxicity at Jackson State University, Mississippi, USA, in Prof. Jerzy Leszczynski’s research group since April 2015. He completed his BPharm (Gold Medallist, 2008) and MPharm (Gold Medallist, 2010) degrees at Jadavpur University, securing first position in both. He earned his PhD (2015) from the Department of Pharmaceutical Technology, Jadavpur University (Kolkata, India) under the guidance of Prof. Kunal Roy. He is a former visiting researcher at the University of Gdańsk (Gdańsk, Poland) under the Marie Curie International Research Staff Exchange Scheme in Prof. Tomasz Puzyn’s group (http://nanobridges.eu/supratik-kar-ju-in-university-of-gdansk-ug-in-gdansk/). He has eight years of experience in QSAR and chemometric modeling studies. His research covers a range of topics in structure-activity relationship studies, including biological activity prediction of natural and organic compounds and toxicity prediction of various chemicals, including nanoparticles. He is also actively involved in the design of solar cells with high power conversion efficiency through molecular modeling and quantum studies. To date, he has published 41 research and review articles and 5 book chapters (http://orcid.org/0000-0002-9411-2091). He has also coauthored the QSAR-related books “Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment” (Elsevier, 2015) and “A Primer on QSAR/QSPR Modeling: Fundamental Concepts” (Springer, 2015). His current h-index is 16 and his i10-index is 18 according to Scopus. He serves as an associate editor of the International Journal of Quantitative Structure-Property Relationships (IJQSPR) [IGI-Global publishers]. He has acted as a reviewer for many reputed journals, including Molecular Diversity (Springer), Nanoscale (RSC), Science of the Total Environment (Elsevier), Structural Chemistry (Springer), Energies (MDPI), and Journal of Nanotoxicology and Nanomedicine (IGI).

Bakhtiyor Rasulev is a professor in the Department of Coatings and Polymeric Materials at North Dakota State University. He received his PhD degree in Chemistry in 2002 from the Uzbek Academy of Sciences. Dr. Rasulev’s scientific interests are in cheminformatics and structure-activity relationship studies, dealing with the prediction of biological activity, physico-chemical properties, and toxicity of chemicals, including nanoparticles and polymers. He is an author of many contributions devoted to QSAR modeling and quantum-chemical applications. Dr. Rasulev collaborates closely with the Istituto di Ricerche Farmacologiche Mario Negri (Italy), Jackson State University (USA), the University of Zagreb (Croatia), and Johns Hopkins University (USA), among others. His accomplishments have been widely recognized. He is a regular reviewer for more than 20 peer-reviewed journals. Dr. Rasulev has received many scholarships and awards, including a scholarship from Drew University (Residential School of Medicinal Chemistry, Madison, NJ), a Young Investigators Award from the Toxicological Division of ACS, an award from the CRDF Foundation, and a UNESCO scholarship to visit the Institute of Desert Study of Ben-Gurion University, Israel.
