
Prediction of Rodent Carcinogenicity Bioassays from Molecular Structure Using Inductive Logic Programming

Ross D. King¹ and Ashwin Srinivasan²

¹Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom; ²Computing Laboratory, University of Oxford, Oxford, United Kingdom

The machine learning program Progol was applied to the problem of forming the structure-activity relationship (SAR) for a set of compounds tested for carcinogenicity in rodent bioassays by the U.S. National Toxicology Program (NTP). Progol is the first inductive logic programming (ILP) algorithm to use a fully relational method for describing chemical structure in SARs, based on using atoms and their bond connectivities. Progol is well suited to forming SARs for carcinogenicity as it is designed to produce easily understandable rules (structural alerts) for sets of noncongeneric compounds. The Progol SAR method was tested by prediction of a set of compounds that have been widely predicted by other SAR methods (the compounds used in the NTP's first round of carcinogenesis predictions). For these compounds no method (human or machine) was significantly more accurate than Progol. Progol was the most accurate method that did not use data from biological tests on rodents (however, the difference in accuracy is not significant). The Progol predictions were based solely on chemical structure and the results of tests for Salmonella mutagenicity. Using the full NTP database, the prediction accuracy of Progol was estimated to be 63% (±3%) using 5-fold cross validation. A set of structural alerts for carcinogenesis was automatically generated and the chemical rationale for them investigated; these structural alerts are statistically independent of the Salmonella mutagenicity. Carcinogenicity is predicted for the compounds used in the NTP's second round of carcinogenesis predictions. The results for prediction of carcinogenesis, taken together with the previous successful applications of predicting mutagenicity in nitroaromatic compounds, and inhibition of angiogenesis by suramin analogues, show that Progol has a role to play in understanding the SARs of cancer-related compounds. Environ Health Perspect 104(Suppl 5):1031-1040 (1996)

Key words: machine learning, artificial intelligence, SAR, National Toxicology Program

This paper is part of the NIEHS Predictive-Toxicology Evaluation Project. Manuscript received May 8, 1996; manuscript accepted 9 August 1996.

We thank M.J.E. Sternberg and S.H. Muggleton. This work was supported by ESPRIT (6020), the SERC project, experimental application and development of ILP, and the Imperial Cancer Research Fund.

Address correspondence to R.D. King, Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, P.O. Box 123, London, WC2A 3PX, U.K. Telephone: +44 171 269 3565. Fax: +44 171 269 3417. E-mail: [email protected]

Abbreviations used: CASE, computer-augmented structure evaluation; DEREK, deductive estimation of risk from existing knowledge; ILP, inductive logic programming; MTD, minimally toxic dose; NTP, National Toxicology Program; QSAR, quantitative structure-activity relationship; SAR, structure-activity relationship.

Environmental Health Perspectives, Vol 104, Supplement 5, October 1996

Introduction

An understanding of the molecular mechanisms of chemical carcinogenesis is central to the prevention of many environmentally induced cancers. One approach is to form structure-activity relationships (SARs) that empirically relate molecular structure with ability to cause cancer. This work has been greatly advanced by the long-term carcinogenicity tests of compounds in rodents by the National Toxicology Program (NTP) of the National Institute of Environmental Health Sciences (1). These tests have resulted in a database of more than 300 compounds that have been shown to be carcinogens or noncarcinogens. The database of compounds can be used to form general SARs relating molecular structure to formation of cancer.

The compounds in the NTP database present a problem for many conventional SAR techniques because the compounds in the NTP databases are structurally very diverse, and many different molecular mechanisms are involved. Most conventional SAR methods are designed to deal with compounds having a common molecular template and presumed similar molecular mechanisms of action (congeneric compounds).

Numerous approaches have been taken to forming SARs for carcinogenesis. Ashby and co-workers (2-4) developed a successful semiobjective method of predicting carcinogenesis based on the identification of chemical substructures (alerts) that are associated with carcinogenesis. A similar but more objective approach was taken by Sanderson and Earnshaw (5), who developed an expert system based on rules obtained from expert chemists. An inductive approach, not directly based on expert chemical knowledge, is the computer-automated structure evaluation (CASE) system (6,7). This system empirically identifies structural alerts that are statistically related to a particular activity. A number of other approaches have been applied based on a variety of sources of information and SAR learning methods (8-13). The effectiveness of these different SAR methods was evaluated on a test set of compounds for which predictions were made before the trials were completed (round 1 of the NTP's tests for carcinogenesis prediction) (8,14,15). There is currently a second round of tests.

The machine-learning methodology Inductive Logic Programming (ILP) has been applied to a number of SAR problems. Initial work was done using the program Golem to form SARs for the inhibition of dihydrofolate reductase by pyrimidines (16-18). This work was extended by the development of the program Progol (19) and its adaptation for application to noncongeneric SAR problems (20). Progol has been successfully applied to predicting the mutagenicity of a series of structurally diverse nitroaromatic compounds (21), and the inhibition of angiogenesis by suramin analogues (20). The Progol SAR method is designed to produce easily understandable rules (structural alerts). For the nitroaromatic and suramin compounds the rules generated provided insight into the chemical basis of action.

Most existing SAR methods describe chemical structure using attributes (general properties of objects). Such descriptions can be displayed in tabular form, with the compounds along one dimension and the attributes along the other dimension. This type of description is very inefficient at representing structural information. A more general method of describing chemical structure is to use logical statements, or relations. This method is also clearer, as chemists are used to relating chemical properties and functions for groups of atoms. The Progol method is the first to use a general relational method for describing chemical structure in SARs. The method is based on using atoms and their bond connectivities and is simple, powerful, and generally applicable to any SAR. The method also appears robust and suited to SAR problems difficult to model conventionally (21).

The most similar approaches to Progol are those of CASE (6), MULTICASE (7), and the symbolic machine learning approaches of Bahler and Bristol (8) and Lee (22). However, the Progol methodology is more general, as the other approaches are based on attributes and therefore have built-in limitations in representing structural relationships.

This article describes application of the Progol SAR method to predicting chemical carcinogenesis. Progol was first benchmarked on the test data of round 1 and then applied to produce predictions for [...]

[...] consisted of 291 compounds, 161 (55%) carcinogens and 130 noncarcinogens. In addition to this train/test split, a 5-fold cross-validation split of the 330 compounds was tested for a more accurate estimate of the efficacy of Progol. The compounds were randomly split into five sets, and Progol was successively trained on four of the splits and tested on the remaining split.

Progol

In inductive logic programming (ILP) all the inputs and outputs are logical rules (23) in the computer language PROLOG. Such rules are easily understandable because they closely resemble natural language. For any application the input to Progol consists of a set of positive examples (i.e., for SAR, the active compounds), negative examples (i.e., nonactive compounds), and background knowledge about the problem (e.g., the atom/bond structure of the compounds) (Figure 1). Progol outputs [...]

[...] the examples. This guarantee of optimality does not extend to sets of rules constructed by Progol, as it does not follow that a set of rules consisting of individually optimal rules is itself optimal for information compression. Information compression is defined as the difference in the amount of information needed to explain the examples with and without using the rule. It is statistically highly improbable that a rule with high compression does not represent a real pattern in the data (24). The use of compression balances accuracy (number of correct predictions/number of total predictions) and coverage (number of examples predicted by the rule/number of examples), i.e., it is a compromise between sensitivity and specificity. The validity of the compression measure was empirically shown by the results of the 5-fold cross-validation trial. Progol generates rules in a stepwise manner until all the examples are covered or [...]