QSAR Modeling
Best practices for developing predictive QSAR models
Alexander Tropsha
Laboratory for Molecular Modeling and Carolina Center for Exploratory Cheminformatics Research
School of Pharmacy, UNC-Chapel Hill

OUTLINE
• Introduction: Brief outline of the QSAR approach
• Why models fail (bad practices)
• Good practices
  – Predictive QSAR Modeling Workflow
  – Examples of the Workflow applications
  – Emerging applications of QSAR: chemocentric informatics
• Conclusions: QSAR modeling is a decision support

The rumors of QSAR demise have been greatly exaggerated
[Figure: two time-series plots, 1972 to 2006, showing the number of QSAR papers in PubMed, the number of compounds in CAS (in 1000s), and the number of protein structures in PDB.]
Graphs are courtesy of Prof. A. Cherkasov

Principles of QSAR modeling
Introduction
[Figure: compounds are encoded as descriptors and related to measured activity values (0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005): Quantitative Structure Activity Relationships.]

Principles of QSAR/QSPR modeling
Introduction
[Figure: the same scheme with property values in place of activity values: Quantitative Structure Property Relationships.]

The utility of QSAR models
[Figure: workflow diagram. Chemical structures → chemical descriptors → predictive QSAR models → property/activity. A chemical database (~10^6 to 10^9 molecules) undergoes virtual screening, which separates hits from inactives.]

QSAR Modeling appears easy…
Goal: Establish correlations between descriptors and the target property capable of predicting activities of novel compounds

Chemistry    Biology (IC50, Kd...)    Cheminformatics (Molecular Descriptors)
Comp.1       Value1                   D1  D2  D3  ...  Dn
Comp.2       Value2                   "   "   "   ...  "
Comp.3       Value3                   "   "   "   ...  "
...          ...                      ...
Comp.N       ValueN                   "   "   "   ...  "

$$q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

BA = F(D) {e.g., …} (e.g., -LogIC50 = k1D1 + k2D2 + … + knDn)

But … the unbearable lightness of model building for training sets…
[Figure: predicted vs. actual LogED50 (ED50 = mM/kg) for the training set, with a linear fit.]

…leads to unacceptable prediction accuracy.
EXTERNAL TEST SET PREDICTIONS
[Figure: two plots of observed vs. predicted values for external test sets. Fitted lines: y = 0.5958x + 2.3074 (R² = 0.2135) and y = 0.4694x + 2.9313 (R² = 0.1181).]

BEWARE OF q2 (Kubinyi paradox)!!!
[Figure: test set R² plotted against training set q² for many models, with regions labeled "Poor Models" and "Good Models".]
• Only a small fraction of "predictive" training set models with LOO q2 > 0.6 is capable of making accurate predictions (r2 > 0.6) for the test sets.
Golbraikh & Tropsha, J. Mol. Graphics Mod. 2002, 20, 269-276.

Major components of QSAR modeling
• Target properties (dependent variable)
  – Continuous (e.g., IC50)
  – Categorical unrelated (e.g., different pharmacological classes)
  – Categorical related (e.g., subranges described as classes)
• Descriptors (or independent variables)
  – Continuous (allows distance based similarity)
  – Categorical related (allows distance based similarity)
  – Categorical unrelated (require special similarity metrics)
• Correlation methods (with and w/o variable selection)
  – Linear (e.g., LR, MLR, PCR, PLS)
  – Non-linear (e.g., kNN, RP, ANN, SVM)
• Validation and prediction
  – Internal (training set) vs. external (test set) vs. independent evaluation set
• Examples of applications and pitfalls

Complexity of QSAR modeling: Choices and Practices
• Descriptors (thousands and counting)
• Data-analytical methods (dozens and counting)
• Validation approaches (unfortunately (!) only a handful but counting)
• Experimental validation as part of model building (very rare)
BUT
• We typically use one (or at best very few) modeling techniques
• Publish successes only
• Compete but (mostly) indirectly

Why models may fail
• Incorrect data (structures and activities) in the dataset
• Modeling set is too small
• No external validation
• Incorrect selection of an external test set
• Incorrect division of a dataset into training and test sets
• Incorrect measure of prediction accuracy
• Insufficient statistical criteria to estimate predictive power of models
• Lack or incorrect definition of applicability domain
• No Y-randomization test (overfitting)
• Presence of leverage (structure) and activity outliers
Also, see Dearden JC, Cronin MT, Kaiser KL. How not to develop a quantitative structure-activity or structure-property relationship (QSAR/QSPR). SAR QSAR Environ Res. 2009;20(3-4):241-66

Some reasons why QSAR models may fail: using incorrect target function in classification QSAR for biased datasets
• A typical target function (Classification Rate): CR = N(classified correctly)/N(total)
  A dataset: Class 1: 80 compounds; Class 2: 20 compounds
  Model: assign all compounds to Class 1.
  Target function: CR = 0.8
  The model appears to have high classification accuracy
• Better target function: CCR = 0.5(Sensitivity + Specificity)
  In the above example, CCR = 0.5
• General formula:

$$\mathrm{CCR} = \frac{1}{K} \sum_{k=1}^{K} \frac{N_k^{\mathrm{corr}}}{N_k^{\mathrm{total}}}$$

  K: the number of classes
  N_k^corr: the number of compounds of class k assigned to class k
  N_k^total: total number of compounds of class k
• For categorical response variable, target functions can depend also on the absolute errors (differences between predicted and observed classes).
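The difference between CR and CCR on the biased dataset above can be checked with a few lines of Python. This is a minimal sketch; the function names `classification_rate` and `correct_classification_rate` are illustrative and do not come from the presentation.

```python
from collections import Counter


def classification_rate(y_true, y_pred):
    """CR = N(classified correctly) / N(total)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def correct_classification_rate(y_true, y_pred):
    """CCR = (1/K) * sum over classes k of N_k(correct) / N_k(total)."""
    totals = Counter(y_true)
    # Count, per observed class, the compounds assigned to their own class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[k] / totals[k] for k in totals) / len(totals)


# The dataset from the slide: 80 compounds of Class 1, 20 of Class 2,
# and a trivial model that assigns every compound to Class 1.
y_true = [1] * 80 + [2] * 20
y_pred = [1] * 100

cr = classification_rate(y_true, y_pred)
ccr = correct_classification_rate(y_true, y_pred)
print(f"CR = {cr:.2f}, CCR = {ccr:.2f}")  # CR = 0.80, CCR = 0.50
```

For two classes, the per-class fractions are sensitivity and specificity, so the general formula reduces to CCR = 0.5(Sensitivity + Specificity).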
HOW TO DEFINE A PREDICTIVE QSAR MODEL
[Figure: three scatter plots of observed vs. predicted (and predicted vs. observed) activities, each with an ordinary regression line and a regression line through the origin. Regression lines shown: y = 0.3154x + 3.4908, y = 3.1007x - 10.715, y = 1.2458x - 1.8812, with R² = 0.9778, 0.9778, 0.8604. Lines through the origin shown: y = 1.0023x, y = 0.9383x, y = 0.9796x, with R0² = 0.5238, -3.3825, 0.8209.]

With y_i the observed and \tilde{y}_i the predicted activities:

Correlation coefficient

$$R = \frac{\sum (y_i - \bar{y})(\tilde{y}_i - \bar{\tilde{y}})}{\sqrt{\sum (y_i - \bar{y})^2 \, \sum (\tilde{y}_i - \bar{\tilde{y}})^2}}$$

Regression

$$y^{r} = a\tilde{y} + b, \qquad a = \frac{\sum (y_i - \bar{y})(\tilde{y}_i - \bar{\tilde{y}})}{\sum (\tilde{y}_i - \bar{\tilde{y}})^2}, \qquad b = \bar{y} - a\bar{\tilde{y}}$$

$$\tilde{y}^{r} = a'y + b', \qquad a' = \frac{\sum (y_i - \bar{y})(\tilde{y}_i - \bar{\tilde{y}})}{\sum (y_i - \bar{y})^2}, \qquad b' = \bar{\tilde{y}} - a'\bar{y}$$

Regression through the origin

$$y^{r_0} = k\tilde{y}, \qquad k = \frac{\sum y_i \tilde{y}_i}{\sum \tilde{y}_i^2}$$

$$\tilde{y}^{r_0} = k'y, \qquad k' = \frac{\sum y_i \tilde{y}_i}{\sum y_i^2}$$

Coefficients of determination

$$R_0^2 = 1 - \frac{\sum (\tilde{y}_i - \tilde{y}_i^{r_0})^2}{\sum (\tilde{y}_i - \bar{\tilde{y}})^2}$$

$$R_0'^2 = 1 - \frac{\sum (y_i - y_i^{r_0})^2}{\sum (y_i - \bar{y})^2}$$

CRITERIA
• q² > 0.5
• R² > 0.6
• k or k' ≈ 1.0
• R0² or R0'² ≈ R²

Some reasons why QSAR models may fail: No Applicability Domain is defined for the Model
• Compounds which are highly dissimilar from all compounds of the training set (according to the set of descriptors selected) cannot be predicted reliably
Lack of the AD: unjustified extrapolation → wrong prediction
Typical situation: a compound of the test set for which error of prediction is high is considered an outlier
HOWEVER: a compound of the test set dissimilar from all compounds of the training set can be by chance predicted accurately

Applicability domain of QSAR models
For a given model, two parameters are calculated:
- <Dk> : average Euclidean distance between each compound of the training set and its k nearest neighbors in the descriptor space;
- sk : standard deviation of the distances between each compound of the training set and its k nearest neighbors in the descriptor space.
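A minimal NumPy sketch of this distance-based applicability domain follows. It computes <Dk> and sk from the training set and then applies the cutoff Di ≤ <Dk> + Z × sk (Z = 0.5 by default) that this presentation uses to decide whether a new compound is inside the domain. The function names are illustrative, and two details are assumptions rather than statements from the slides: the N × k neighbor distances are pooled before taking the mean and standard deviation, and the population standard deviation is used.

```python
import numpy as np


def knn_distances(X_ref, X_query, k, exclude_self=False):
    """Euclidean distances from each query row to its k nearest rows of X_ref."""
    diff = X_query[:, None, :] - X_ref[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    if exclude_self:
        # Only valid when X_query is X_ref: a compound is not its own neighbor.
        np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k]


def domain_cutoff(X_train, k=1, z=0.5):
    """Return <Dk> + Z * sk, pooling all training-set neighbor distances (assumption)."""
    d = knn_distances(X_train, X_train, k, exclude_self=True)
    return d.mean() + z * d.std()


def in_domain(X_train, X_new, cutoff, k=1):
    """Di = mean distance from compound i to its k nearest training compounds."""
    d_i = knn_distances(X_train, X_new, k).mean(axis=1)
    return d_i <= cutoff


# Toy descriptor space: four training compounds on the corners of a unit square.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = np.array([[0.5, 0.5], [5.0, 5.0]])

cutoff = domain_cutoff(X_train, k=1, z=0.5)
inside = in_domain(X_train, X_new, cutoff, k=1)
print(cutoff, inside)  # 1.0 [ True False]
```

In the toy example every training compound lies at distance 1 from its nearest neighbor, so <Dk> = 1 and sk = 0. The compound at the center of the square is inside the domain and the remote one is not.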
[Figure: training set compounds and a new compound plotted in descriptor space (Descriptor 1 vs. Descriptor 2), with regions inside and outside the domain.]

Atelier Descripteurs 200

For each test compound i, the distance Di is calculated as the average of the distances between i and its k nearest neighbors in the training set.

The new compound will be predicted by the model only if:

Di ≤ <Dk> + Z × sk

with Z an empirical parameter (0.5 by default).
- INSIDE THE DOMAIN: will be predicted by the model
- OUTSIDE THE DOMAIN: will not be predicted by the model

*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:… Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.

Applicability domain vs. prediction accuracy (Ames Genotoxicity dataset)
[Figure: test set % accuracy (axis range 80 to 92) plotted against the kNN Z-score used for the prediction cutoff (0.5, 1, 2, 3, 5, 10).]

Some reasons why QSAR models may fail: Y-randomization test is not carried out
• Y-randomization test:
  – Scramble activities of the training set
  – Build models and get model statistics.
  – If the statistics are comparable to those obtained for models built with the real activities of the training set, the latter models are unreliable and should be discarded.
Frequently, the Y-randomization test is not carried out.
The Y-randomization test is of particular importance when:
- the number of compounds in the training or test set is small
- the response variable is categorical

Activity randomization: model robustness
[Figure: table of structures and properties (Struc.1 / Pro.1, Struc.2 / Pro.2, Struc.3 / Pro.3, ...) next to a plot; the remaining content is truncated in the source.]
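The Y-randomization procedure can be sketched as follows. This is only an illustration under stated assumptions: the model is an ordinary least-squares linear regression, the statistic is the training set R² (the presentation does not prescribe a particular model or statistic), and the data are synthetic. The names `fit_r2` and `y_randomization` are hypothetical.

```python
import numpy as np


def fit_r2(X, y):
    """Fit y = b0 + X @ b by least squares and return the training-set R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def y_randomization(X, y, n_rounds=100, seed=0):
    """Return R^2 for the real activities and for n_rounds scrambled copies."""
    rng = np.random.default_rng(seed)
    real = fit_r2(X, y)
    # Scramble activities, rebuild the model, and collect the statistic.
    scrambled = np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_rounds)])
    return real, scrambled


# Synthetic example (assumption): 30 compounds, 2 descriptors, and an
# activity that truly depends on the descriptors plus small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=30)

real_r2, scrambled_r2 = y_randomization(X, y)
print(f"real R2 = {real_r2:.3f}")
print(f"scrambled R2: mean = {scrambled_r2.mean():.3f}, max = {scrambled_r2.max():.3f}")
```

If the scrambled statistics approach the real one, the model built on the real activities should be treated as unreliable. Here the real activities carry genuine signal, so the real R² stays well above every scrambled value.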