
Progress Report of the Air Force Project

Covering research period 7/1/94 to 6/30/95

Agency No: DOD/F49620-94-1-0401
U of M No: 0756-5140 (1613-189-6090)
NRRI Technical Report Number: NRRI/TR-95/35
Center for Water and the Environment Contribution Number 161

Predicting Toxicity and Degradability of Quadricyclane, Ethers and Their Analogs

Submitted By:

Subhash C. Basak, Ph.D. Principal Investigator Natural Resources Research Institute and 5013 Miller Trunk Highway Duluth, MN 55811 Phone: (218)720-4230 Fax: (218)720-9412 Email: [email protected]

Keith B. Lodge, Ph.D. Co-principal Investigator Department of Chemical Engineering University of Minnesota, Duluth Duluth, MN 55812

Joseph Schubauer-Berigan, Ph.D. Principal Investigator University of South Carolina Subcontract NIWB NERR Baruch Marine Laboratory P.O. Box 1630 Georgetown, SC 29442

The University of Minnesota is an equal opportunity educator and employer.

Table of Contents

Acknowledgments
Objectives
Status of Effort
Accomplishments/New Findings
  A. Background
  B. Progress of Tasks of the Project
    TASK 1: Development of data bases
    TASK 2: Development of methods to quantify molecular similarity
    TASK 3: Selection of analogs
    TASK 4: Estimation of properties of the target chemical from the probe-induced subset
    TASK 5: Measurement of hydrophobicity
    TASK 6: Microbial degradation studies
    TASK 7: Photochemical transformations
Publications
Interactions/Transitions
  A. Participation/Presentations at Meetings, Seminars, etc.
    I. Presentations at Scientific Conferences
    II. Invited seminars/lectures
  B. Consultative and Advisory Function
  C. Transitions
New Discoveries
Honors/Awards

Acknowledgments

During the first year of the Air Force grant I have benefited considerably through interaction with numerous colleagues. I would like to specially mention Drs. William Fisanick (Research Department, Chemical Abstracts Service, Columbus, OH), Mic Lajiness (Computer-Aided Drug Discovery, The Upjohn Company, Kalamazoo, MI), Dr. Douglas Bristol (NIH/NIEHS, RTP, NC), Don Gieschen (PSI International, Towson, MD), Professor A. T. Balaban (Polytechnic Institute, Bucharest, Romania), Professor K. Balasubramanian (Department of Chemistry and Biochemistry, Arizona State University, Tempe, AZ), Professor Timothy Colburn (Department of Computer Science, University of Minnesota, Duluth), Dr. Axel Drefahl (Department of Civil Engineering, Stanford University, Palo Alto, CA), Dr. Mark Johnson (Computer-Aided Drug Discovery, The Upjohn Company, Kalamazoo, MI), Professor Milan Randic (Department of Mathematics and Computer Science, Drake University, IA), Dr. Romualdo Benigni (Director, Unit of Structure-Activity Relationships, Istituto Superiore di Sanità, Rome) and Professor Rainer Bruggemann (GSF-Forschungszentrum für Umwelt und Gesundheit PUC, Germany) for many fruitful discussions on topics related to quantification of molecular structure, molecular similarity and structure-activity relationships (SAR) pertaining to environmental toxicology.

I would like to acknowledge the help of Dr. William Fisanick in generating the 75 analogs of Quadricyclane from a Chemical Abstracts Service database of about 120,000 chemicals, and of Dr. Mic Lajiness in generating a subset of 6,836 seven-carbon compounds from the Available Chemicals Directory (ACD) and selecting analogs of Quadricyclane from the ACD.

I am thankful to Mr. Greg Grunwald, Applications Programmer of NRRI, for his excellent and sustained efforts in developing databases and carrying out computations resulting in many publications supported by the Air Force. I would like to thank my graduate students, David Axtell (Computer Science, UMD) and Brian Gute (Toxicology Ph.D. program, U of Minnesota), for collaborating with me in developing neural net and similarity methods for estimating toxicological properties of chemicals.

Finally I would like to thank AFOSR for providing us financial support for carrying out the research reported here.

Subhash C. Basak, Ph.D. Principal Investigator

Predicting toxicity and degradability of quadricyclane, fluorocarbon ethers and their analogs
Principal Investigator: Subhash C. Basak

OBJECTIVES (The same as in the original proposal)

In a large number of cases, we have to assess the risk of chemicals and predict the toxic potential of molecules in the face of limited experimental data. Structural criteria and functional criteria (if available) are routinely used to estimate the possible hazard posed by a chemical to the environment and ecosystem. Frequently, no biological or relevant physicochemical properties of the chemical species of interest are available to the risk assessor.

In the proposed project, we will develop and implement a number of methods of quantifying molecular similarity of chemicals using techniques of computational and mathematical chemistry. Some of the methods are new and will be based on our own research on the theoretical development and implementation of molecular similarity methods. These techniques will be implemented in a user-friendly computer environment on a Silicon Graphics workstation. The similarity methods will be used to select analogs of chemicals of interest to the Air Force, viz., QUADRICYCLANE, FLUOROCARBON ETHERS AND THEIR ANALOGS, from databases containing high quality physicochemical data and toxicity endpoints for a large number of chemicals. The databases used in the project will come from three sources: a) public domain databases, b) our own in-house databases, and c) databases acquired from commercial vendors.

The set of selected analogs, called probe-induced subsets, will be used to: a) develop structure-activity relationships (SAR), and b) carry out ranking of chemicals. Both of these methods will be used to estimate the hazard of the chemicals of interest.

A set of chemicals (five to ten) will be chosen for experimental work with the purpose of evaluating and refining computer models. The set will include quadricyclane and fluorocarbon ethers of interest to the Air Force. It will also include a selection of analogs (probe-induced subset) that are readily available, suitable for experimentation, and for which data are lacking. Experiments will be performed to assess the biodegradability and photochemical degradability of the members of the set. Their toxicity will be tested by MicroTox and MutaTox. In cases where significant degradation is observed, the toxicity of the degradation products will also be tested. Direct measurement of the hydrophobicity (octanol-water partition coefficient) will be performed on the members of the set.

Status of Effort

A number of novel molecular similarity methods have been developed using nonempirical parameters which can be computed directly from molecular structure. These parameters include topological indices (TIs), atom pairs (APs) and semiempirical quantum chemical parameters. The relative effectiveness of these similarity techniques in selecting analogs and estimating properties of toxicological importance has been tested on a selected set of properties such as lipophilicity (logP, the logarithm of the octanol/water partition coefficient), normal boiling point and mutagenicity. The K nearest neighbor (KNN) method, with K = 1, 2, ..., 25, has been used in generating probe-induced subsets from different databases for each probe. Results show that the KNN method gives the best estimates at K = 5-10 across the diverse range of properties studied.
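The KNN estimation step described above can be sketched as follows. This is a minimal illustration only: it assumes the property of a probe is estimated as the unweighted mean of the property values of its K most similar database chemicals (the report does not state the averaging rule), and the function name `knn_estimate` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def knn_estimate(neighbors: Sequence[Tuple[float, float]], k: int) -> float:
    """Estimate a property as the mean over the k most similar neighbors.

    neighbors: (similarity_to_probe, property_value) pairs for database
    chemicals; a higher similarity means a closer neighbor.
    Assumption: unweighted mean of the k nearest neighbors.
    """
    if not 1 <= k <= len(neighbors):
        raise ValueError("k must be between 1 and the number of neighbors")
    # Sort by decreasing similarity and keep the k closest chemicals.
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(value for _, value in ranked) / k


# Hypothetical (similarity, logP) pairs; these are not data from the report.
toy_neighbors = [(0.70, 2.0), (0.65, 3.0), (0.30, 9.0), (0.61, 2.5)]
estimate_k3 = knn_estimate(toy_neighbors, k=3)
```

With K = 3 the dissimilar chemical (similarity 0.30) is excluded from the average, which is the reason for restricting the estimate to near neighbors.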

Probe-induced subsets of seventy-five analogs each have been generated for Quadricyclane from three different databases: a) STARLIST logP database of Daylight, Inc., containing more than 4,000 high quality logP values, b) Available Chemicals Directory (ACD), containing over 180,000 chemicals which are currently available from suppliers worldwide, and c) a Chemical Abstracts Service database containing about 120,000 diverse chemicals.

Both water solubility and logP of Quadricyclane (Q) have been determined experimentally. Preliminary biodegradation studies on Q and norbornane (NB) have been carried out. Preliminary studies indicate that Q has high potential for aerobic biodegradation (BOD test).

Accomplishments/New Findings

A. Background

In many instances one has to assess the hazardous potential of environmental pollutants in the face of limited or no available test data. Such data-poor situations dictate the use of estimated properties, calculated directly from molecular structure, for hazard estimation of chemicals. In environmental toxicology, two distinct approaches are used for assessing the hazard of a chemical under such circumstances: a) estimation of properties using class-specific quantitative structure-activity relationships (QSARs), and b) prediction of toxic and ecotoxic effects of a pollutant from the toxicity data of its analogs.

Many chemicals of interest to the US Air Force, e.g., Quadricyclane and fluorocarbon ethers (FCEs), fall in this category. Very little empirical data necessary for risk assessment are available for these chemicals. Table 1 gives a partial list of physicochemical and toxicological data necessary for a reasonable hazard assessment of chemicals.

Table 1: List of properties necessary for risk assessment of chemicals

Physicochemical                   Biological

Molar Volume                      Receptor Binding (KD)
Boiling Point                     Michaelis Constant (Km)
Vapor Pressure                    Inhibitor Constant (Ki)
Aqueous Solubility                Biodegradation
Dissociation Constant (pKa)       Bioconcentration
Partition Coefficient             Alkylation Profile
  : Octanol-water (log P)         Metabolic Profile
  : Air-Water                     Chronic Toxicity
  : Sediment-Water                Carcinogenicity
Reactivity (Electrophile)         Mutagenicity
                                  Acute Toxicity
                                    : LD50
                                    : LC50

The majority of the properties listed in Table 1 are not known for Q and FCEs. Therefore, for the foreseeable future, hazard assessment of these chemicals will have to be carried out using estimated properties. To this end, we proposed in our project to develop property estimation methods for Q and FCEs using the structure-activity relationship (SAR) methods being developed in our laboratory. Specifically, we proposed the following:

a) Development of methods for computing intermolecular similarity using algorithmically derived parameters that can be calculated directly from molecular structure,
b) Selection of analogs (called probe-induced subsets) of probe chemicals (Q and FCEs) from databases of chemicals with high quality endpoints relevant to hazard assessment,
c) Estimation of physicochemical and toxicological properties of Q and FCEs from the properties of the probe-induced subsets,
d) Measurement of hydrophobicity for the validation of the estimation methods developed in items a) through c) above,
e) Experimental determination of microbial degradability of Q, FCEs and their analogs in an effort to validate SAR models developed in items a) through c) above, and
f) Estimation of photochemical degradability of Q, FCEs and a limited number of probe-induced subsets to test the validity of similarity-based estimation methods.

B. Progress of Tasks of the Project

The project consisted of seven distinct tasks. Below we present the progress in the various task areas during the first year of the project.

TASK 1: Development of data bases

The Syracuse Research Corporation databases BIODEG and CHEMFATE have been purchased and installed on an IBM-compatible personal computer. These databases will be used for future creation of probe-induced subsets and estimation of relevant properties.

Access to the STARLIST database of high quality experimental logP values has been provided through the courtesy of the USEPA. Initial searches of this database have already been carried out and are detailed under TASK 3.

Access to a database of 120,155 chemical structures has been provided through a collaborative study involving Chemical Abstracts Service, the Upjohn Company and the NRRI QSAR group. Initial searches of this database have been carried out and are detailed under TASK 3.

TASK 2: Development of methods to quantify molecular similarity

Molecular similarity is an emerging area of structure-activity relationships. We carried out research on the development of different techniques of quantifying molecular similarity using K nearest neighbor (KNN), discriminant function analysis (DFA) and neural net methods. The successful methods developed under this task were applied in estimating properties in Task 4 below. Specific accomplishments of this task were:

a) application of KNN methods (K = 1-10, 15, 20, 25) in selecting analogs and estimating properties of probe chemicals using principal components (PCs) derived from topological indices (TIs) and atom pairs (APs) as axes in N-dimensional structure space,
b) comparative study of TIs, atom pairs and quantum chemical parameters in quantifying intermolecular similarity, and
c) analysis of the implications of the notion that similarity space is a TOLERANCE SPACE.

Below we provide brief summaries of results published in, or accepted for, peer-reviewed journals:

I. Application of graph theoretical parameters in quantifying molecular similarity and structure-activity studies, by S. C. Basak, S. Bertelsen and G. D. Grunwald, J. Chem. Inf. Comput. Sci., 34, 270, 1994. This paper reported the development of molecular similarity methods based on Euclidean distance of calculated TIs and an associative measure using APs. These similarity measures have been used in selecting analogs from: a) a combined set of octane and nonane, b) a diverse set of 382 chemicals, c) a subset of 3,692 chemicals found in the Toxic Substances Control Act (TSCA) inventory, and d) the STARLIST data base consisting of 4,067 chemicals. Two probe chemicals were selected from each of the four data sets mentioned above. For all eight probes, the neighbors selected by the similarity methods show that these methods are good at analog selection and reflect an intuitive notion of chemical similarity.

II. Predicting mutagenicity of chemicals using topological and quantum chemical parameters: a similarity based study, by S. C. Basak and G. D. Grunwald, Chemosphere, in press, 1995. This paper dealt with the similarity based estimation of mutagenicity of 73 aromatic and heteroaromatic amines. Similarity spaces were developed using APs, TIs, a set of physicochemical and quantum chemical properties, viz., logP (octanol/water), energy of the highest occupied molecular orbital (HOMO) and energy of the lowest unoccupied molecular orbital (LUMO), and a combination of TIs, logP, HOMO energy and LUMO energy. The results showed that addition of logP, HOMO energy and LUMO energy did not significantly improve the quality of estimation.

III. Tolerance space and molecular similarity, by S. C. Basak and G. D. Grunwald, SAR and QSAR in Environmental Research, in press, 1995. This paper dealt with the mathematical concept of tolerance when quantifying molecular similarity. Specifically, if A is similar to B, and B is similar to C, it does not necessarily follow that A is similar to C. We argued that a SIMILARITY SPACE IS A TOLERANCE SPACE. We then carried out studies on similarity based KNN estimation of mutagenicity of a set of 95 aromatic amines and the boiling point of over 2900 chemicals (a subset of TSCA). We showed that the similarity based estimated values follow the pattern expected for a TOLERANCE SPACE, i.e., more distant neighbors are too dissimilar to use in KNN estimation relative to near neighbors.

Copies of the above three papers are attached (vide infra, Publication Section).
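The tolerance-space argument of paper III above can be illustrated with a small sketch. It assumes, purely for illustration, that each chemical is described by a single numeric descriptor and that two chemicals count as "similar" when their descriptors differ by no more than a fixed tolerance; the function name, descriptor values and tolerance are hypothetical and are not taken from the paper.

```python
def similar(x: float, y: float, tol: float = 1.0) -> bool:
    """Tolerance relation: x and y are 'similar' if they lie within tol."""
    return abs(x - y) <= tol


# Hypothetical one-dimensional descriptor values for chemicals A, B and C.
a, b, c = 0.0, 1.0, 2.0

reflexive = similar(a, a)                  # every chemical is similar to itself
symmetric = similar(a, b) == similar(b, a)  # similarity does not depend on order
chain = similar(a, b) and similar(b, c)    # A ~ B and B ~ C both hold
transitive = similar(a, c)                 # but A ~ C fails: |0 - 2| > 1
```

The relation is reflexive and symmetric but not transitive, which is the defining behavior of a tolerance relation and the reason distant members of a neighbor chain should not be pooled with near neighbors in KNN estimation.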

TASK 3: Selection of analogs

Analogs or "probe-induced subsets" selected from databases with good quality experimental data can be useful in predicting properties of probe chemicals. Taking Quadricyclane as the probe, we selected 75 analogs in each of the following searches:

Search 1: The STARLIST database contains high quality experimental logP values for over 4,000 chemicals. LogP is an important determinant of toxicity and ecotoxicity of chemicals. We therefore selected 75 analogs of Q from the STARLIST database using the atom pair based similarity method. Table 2 gives similarity scores, ID#, and ranks of the 75 nearest neighbors of Q selected by this search. The structures of the first ten analogs are given in Figure 1.

Similarity scores range from zero, when there are no common atom pairs between two chemicals, to one, when they are identical. It is clear from the similarity scores of the nearest neighbors that no chemical in the STARLIST database strongly resembles Q structurally.
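A score with the zero-to-one behavior described above can be sketched as follows. The report does not give the formula, so this sketch assumes one commonly used form for atom pair similarity, S = 2C / (T1 + T2), where C is the number of atom pairs the two molecules have in common (counting multiplicity) and T1 and T2 are their total atom pair counts. The atom pair labels below are hypothetical strings, not the encoding used in the project.

```python
from collections import Counter


def atom_pair_similarity(ap_a: Counter, ap_b: Counter) -> float:
    """Return 2C / (T1 + T2) for two multisets of atom pair descriptors.

    Assumed formula (not stated in the report): C is the multiset
    intersection size, T1 and T2 are the total atom pair counts.
    """
    # Counter '&' keeps the minimum count of each shared key.
    common = sum((ap_a & ap_b).values())
    total = sum(ap_a.values()) + sum(ap_b.values())
    return 2.0 * common / total if total else 0.0


# Hypothetical atom pair multisets for three molecules.
mol_1 = Counter({"C-1-C": 3, "C-2-C": 2})
mol_2 = Counter({"C-1-C": 2, "C-1-O": 1})
mol_3 = Counter({"N-1-O": 4})

score_identical = atom_pair_similarity(mol_1, mol_1)  # all pairs shared
score_partial = atom_pair_similarity(mol_1, mol_2)    # some pairs shared
score_disjoint = atom_pair_similarity(mol_1, mol_3)   # no pairs shared
```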

Table 2. Seventy-five analogs of Quadricyclane from the STARLIST database selected using the atom pair method.

No.  STARLIST ID  Score  Rank
1    3720         0.303  1
2    4410         0.303  1
3    339          0.276  2
4    1851         0.276  2
5    2099         0.276  2
6    2263         0.276  2
7    3036         0.276  2
8    3209         0.276  2
9    4916         0.276  2
10   3197         0.273  3
11   921          0.218  4
12   3403         0.214  5
13   4389         0.214  5
14   1558         0.204  6
15   709          0.198  7
16   702          0.190  8
17   1216         0.190  8
18   2154         0.190  8
19   3594         0.190  8
20   3692         0.190  8
21   1609         0.182  9
22   2059         0.182  9
23   934          0.175  10
24   1816         0.175  10
25   3060         0.175  10
26   1165         0.152  11
27   3223         0.152  11
28   548          0.140  12
29   2366         0.140  12
30   115          0.132  13
31   612          0.132  13
32   4267         0.128  14
33   2482         0.127  15
34   1515         0.122  16
35   1026         0.121  17
36   1186         0.121  17
37   1541         0.115  18
38   3391         0.115  18
39   4785         0.115  18
40   242          0.115  18
41   1914         0.115  18
42   3028         0.115  18
43   1454         0.115  18
44   1208         0.111  19
45   2279         0.111  19
46   3519         0.111  19
47   4072         0.111  19
48   191          0.105  20
49   1001         0.105  20
50   1002         0.105  20
51   1122         0.105  20
52   1374         0.105  20
53   1778         0.105  20
54   2122         0.105  20
55   2258         0.105  20
56   3418         0.105  20
57   3575         0.105  20
58   4802         0.105  20
59   4805         0.105  20
60   163          0.104  21
61   495          0.104  21
62   1867         0.104  21
63   2159         0.104  21
64   2941         0.104  21
65   1873         0.102  22
66   4433         0.102  22
67   4928         0.100  23
68   2660         0.099  24
69   3062         0.099  24
70   3313         0.095  25
71   646          0.095  25
72   963          0.095  25
73   1269         0.095  25
74   2729         0.095  25
75   3547         0.095  25

Figure 1. Ten nearest neighbors of quadricyclane (Q) from atom pair search in the STARLIST database.

Search 2: Ready availability of analogs of probe chemicals is important for testing the validity of similarity methods in predicting properties. So, we carried out a similarity search of a subset of the ACD consisting of compounds with seven carbon atoms to select analogs for Q. ACD is a commercially available database of about 180,000 chemicals which are currently available from suppliers worldwide. Table 3 below gives similarity scores, ranks and ACD ID# of the 75 analogs of Q selected from ACD by the atom pair based similarity method. The structures of the first ten ACD analogs are given in Figure 2. The first twenty of the ACD analogs are being considered as potential analogs for predicting properties of Q.

Table 3. Seventy-five analogs of Quadricyclane from the Available Chemicals Directory (ACD) selected by the AP method.

No.  ACD No.  Score  Rank
1    124155   0.702  1
2    60935    0.653  2
3    149458   0.606  3
4    65846    0.490  4
5    59125    0.490  4
6    81609    0.421  5
7    81630    0.421  5
8    81752    0.421  5
9    81753    0.421  5
10   81607    0.421  5
11   132284   0.421  5
12   156832   0.421  5
13   81601    0.386  6
14   132155   0.367  7
15   132161   0.367  7
16   19621    0.367  7
17   54433    0.367  7
18   110621   0.367  7
19   132317   0.367  7
20   132318   0.367  7
21   132325   0.367  7
22   41003    0.351  8
23   59192    0.333  9
24   41028    0.333  9
25   58813    0.333  9
26   131525   0.316  10
27   81610    0.316  10
28   132151   0.316  10
29   132162   0.316  10
30   132144   0.286  11
31   22280    0.286  11
32   41850    0.286  11
33   53297    0.286  11
34   53535    0.286  11
35   3411     0.286  11
36   14933    0.286  11
37   44463    0.286  11
38   54556    0.286  11
39   54557    0.286  11
40   146920   0.286  11
41   3459     0.286  11
42   81755    0.281  12
43   109377   0.263  13
44   71169    0.246  14
45   119337   0.246  14
46   42589    0.246  14
47   121621   0.246  14
48   16372    0.246  14
49   56229    0.246  14
50   100369   0.246  14
51   82094    0.245  15
52   59193    0.245  15
53   15397    0.245  15
54   44481    0.245  15
55   3417     0.245  15
56   54558    0.245  15
57   54559    0.245  15
58   3462     0.245  15
59   21451    0.242  16
60   63544    0.238  17
61   41030    0.238  17
62   53532    0.238  17
63   56444    0.238  17
64   58808    0.238  17
65   58817    0.238  17
66   58858    0.238  17
67   132176   0.237  18
68   108810   0.230  19
69   5168     0.222  20
70   42944    0.214  21
71   135772   0.214  21
72   109313   0.212  22
73   63755    0.212  22
74   124492   0.212  22
75   14034    0.212  22

Figure 2. Ten nearest neighbors of quadricyclane (Q) from atom pair search of ACD C7 database.

Search 3: Dr. William Fisanick of Chemical Abstracts Service (CAS) carried out an analog selection for Q from a CAS database of 120,155 chemicals at the request of Dr. Subhash Basak. This similarity search is based on measures of intermolecular similarity developed by CAS using 2-D parameters. Table 4 lists the 75 nearest analogs of Q from this database. Figure 3 presents the two dimensional structures of the first ten analogs derived from this search.

Table 4. Seventy-five Nearest Neighbors of Quadricyclane (CAS 278-06-8) from CAS Similarity Search of 120,155 Chemicals

No.  CAS Reg. No.  Score  Rank
1    277-10-1      81.40  1
2    25107-14-6    77.55  2
3    285-43-8      68.09  3
4    7075-86-7     67.39  4
5    60606-97-5    67.31  5
6    3104-87-8     66.67  6
7    25079-41-8    66.67  6
8    16282-07-9    65.52  7
9    278-07-9      65.12  8
10   286-34-0      65.12  8
11   281-23-2      64.58  9
12   2292-79-7     64.29  10
13   286-08-8      63.64  11
14   279-23-2      62.79  12
15   285-58-5      62.79  12
16   253-14-5      62.00  13
17   695-02-3      61.82  14
18   278-30-8      60.87  15
19   6004-38-2     60.38  16
20   280-33-1      60.00  17
21   694-72-4      60.00  17
22   25107-18-0    59.62  18
23   6053-75-4     59.26  19
24   280-65-9      58.70  20
25   496-10-6      58.70  20
26   2146-36-3     58.49  21
27   77523-29-6    57.63  22
28   284-10-6      57.45  23
29   3296-50-2     57.45  23
30   4551-51-3     57.45  23
31   6221-55-2     57.45  23
32   43000-53-9    57.41  24
33   2090-14-4     57.14  25
34   2435-85-0     56.67  26
35   5744-03-6     56.14  27
36   185-94-4      56.10  28
37   91-17-8       55.10  29
38   282-53-1      55.10  29
39   6004-36-0     54.41  30
40   80657-74-5    54.39  31
41   174-74-3      54.35  32
42   281-72-1      54.00  33
43   493-01-6      54.00  33
44   493-02-7      54.00  33
45   4443-69-0     54.00  33
46   5661-80-3     54.00  33
47   13380-94-4    53.97  34
48   5743-97-5     53.45  35
49   6596-35-6     53.45  35
50   700-58-3      53.33  36
51   1521-78-4     53.33  36
52   7314-85-4     53.33  36
53   7346-41-0     53.33  36
54   1521-92-2     53.23  37
55   13380-89-7    53.12  38
56   81-21-0       52.86  39
57   7158-25-0     52.86  39
58   700-57-2      52.46  40
59   4488-57-7     52.46  40
60   13074-39-0    52.46  40
61   19398-83-5    52.46  40
62   23695-65-0    52.46  40
63   24669-56-5    52.46  40
64   4292-90-4     52.17  41
65   21681-47-0    52.17  41
66   4160-49-0     51.85  42
67   10218-02-7    50.91  43
68   13756-54-2    50.88  44
69   497-38-1      50.00  45
70   2566-48-5     50.00  45
71   2958-75-0     50.00  45
72   29342-53-8    50.00  45
73   30651-03-7    50.00  45
74   42577-33-3    50.00  45
75   52794-33-9    50.00  45

Figure 3. Ten nearest neighbors of quadricyclane (Q) from 2-D topological search of CAS 120K database.

TASK 4: Estimation of properties of the target chemical from the probe-induced subset

We studied the effectiveness of the similarity methods developed in Task 2 above by applying them to the estimation of various physicochemical and toxicological endpoints. Three papers were submitted out of the research carried out in this task. These results were also presented at numerous national and international symposia and in invited presentations. We summarize below the main findings of the three papers:

I. Estimation of boiling point of haloalkanes using molecular similarity, by S. C. Basak, B. D. Gute and G. D. Grunwald, Croat. Chim. Acta, submitted, 1995. A molecular similarity measure has been used to estimate the normal boiling points of a set of 267 haloalkanes with 1-4 carbon atoms. Molecular similarity and dissimilarity were quantified in terms of Euclidean distances of molecules in the eight-dimensional principal component space derived from fifty-nine topological indices. The correlation coefficients between experimental and estimated boiling points ranged from 0.842 to 0.944 in the K nearest neighbor estimation of boiling points using different numbers of nearest neighbors (K = 1-25).

II. Predicting genotoxicity of chemicals using nonempirical parameters, by S. C. Basak and G. D. Grunwald, in: Proceedings of the XVI International Cancer Congress, pp. 413-416, 1995. This paper dealt with the classification of a set of 463 chemicals into mutagens and nonmutagens using graph theoretic and quantum chemical parameters. The results showed that inclusion of quantum chemical indices in the set of independent variables did not make any significant difference in the predictive power of the models developed.

III. Use of graph theoretic parameters in predicting inhibition of microsomal p-hydroxylation of aniline by alcohols: a similarity based study, by S. C. Basak and B. D. Gute, in: Proceedings of the International Congress on Hazardous Waste: Impact on Human and Ecological Health, Atlanta, GA, in press, 1995. This paper used two similarity methods, one based on Euclidean distance on PCs derived from TIs and the other using APs, in predicting inhibition of microsomal p-hydroxylation of aniline by alkanols. The results showed that these methods give reasonable estimates of the inhibitory potencies of the alcohols.

Copies of the above three papers are attached (vide infra, Publication Section).
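The kind of evaluation summarized for paper I above (KNN estimation from Euclidean distances in principal component space, followed by a correlation between experimental and estimated values) can be sketched as follows. Several points are assumptions made for illustration: the principal component scores are taken as already computed, each chemical is estimated from the others in a leave-one-out fashion, and the estimate is the unweighted mean of the K nearest neighbors. The function names and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two points in PC space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def loo_knn_estimates(points: Sequence[Sequence[float]],
                      values: Sequence[float], k: int) -> List[float]:
    """Leave-one-out KNN: estimate each value from its k nearest others."""
    estimates = []
    for i, probe in enumerate(points):
        # Distances from the probe to every other chemical in the set.
        others = [(euclidean(probe, p), values[j])
                  for j, p in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        estimates.append(sum(v for _, v in nearest) / len(nearest))
    return estimates


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical PC scores and property values; not data from the report.
pc_scores = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
observed = [10.0, 20.0, 30.0, 40.0]
estimated_k2 = loo_knn_estimates(pc_scores, observed, k=2)
r_k2 = pearson_r(observed, estimated_k2)
```

Repeating the last two lines for each K of interest gives a correlation coefficient per K, which is how a best range of K could be identified under these assumptions.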

TASK 5: Measurement of hydrophobicity

The water solubility of quadricyclane at 20° C was determined to be 0.7 mg/g. Samples of water were equilibrated with a slight excess of Q and the water concentrations were determined directly by HPLC (Hypersil ODS 5 µm, 100 x 4.6 mm reverse phase column held at 40° C, isocratic, mobile phase 20% v/v in water, detection at 210 nm). Direct determinations of Q in octanol and water, equilibrated together at 20° C, by HPLC indicate that Log P > 2, where P is the 1-octanol/water partition coefficient. We are now applying an HPLC-retention time method for measurement of P for analogs of Q; by this method, the value of Log P for Q is 3. The common empirical estimation method, CLOGP (MedChem Software), gives a value of 1.5.

TASK 6: Microbial degradation studies

To support degradation work, extraction methods and analytical methods by GC/FID for Q and norbornane (NB) in water have been developed. Acceptably high recoveries (> 95%), using dichloromethane as the extraction solvent, were obtained with a "through-septum" extraction method. Extraction methods in which the head space is opened to the atmosphere lead to low recoveries. The retention times for Q and NB are 11.5 and 9.1 min, respectively, on a 30 m SE-30 column (0.53 mm ID, 1.2 µm film, Alltech Cat. No. 19656) with the following parameters: injector temperature, 180° C; detector temperature, 250° C; the oven is held at 45° C for 12 min, then ramped at 30° C/min to 100° C and held for 2 min. With a 5 µL injection, the detection limit is 1 µg/mL.

Similar methods have been developed for sediments. A 1:1 sediment/MilliQ water (MQW) slurry was spiked with Q and NB and extracted using the "through-septum" method. Recoveries were also acceptably high (> 95%). These methods have been applied to samples of sediment collected from sites near the University of South Carolina's Baruch Marine Laboratory in Georgetown, SC. For these samples, 2 mL of sediment slurry was added to a 10 mL amber serum vial; the vial was then capped with a Teflon-lined butyl rubber septum. The vials are spiked through the septum with Q and NB (in ethanol) to give a concentration of 0.1 mg/mL. The vials are shaken for 15 min and stored in the refrigerator at 4° C until analysis.

The contents of two vials are analyzed at various times. The vials are spiked with the recovery standard, toluene, and then 5 mL of dichloromethane (DCM) is added; these chemicals are added through the septum using syringes. The vials are shaken for 15 min and centrifuged to separate the aqueous and DCM phases; otherwise an emulsion is formed. The DCM layer is spiked with the internal standard, octane, just before analysis. A sample is removed and injected on the GC. The concentrations are corrected using the recovery of the toluene. Equivalent experiments were done using MilliQ water; one series was done with water taken directly from the purification unit ("as is", or untreated), and the other series with water whose pH was adjusted to nine using potassium hydroxide.

The results are summarized in Table 5. We have interpreted the data in terms of a first-order rate law; the integrated rate law is ln C = -kt + constant, where C is the concentration of Q or NB, k is the rate constant and t is the time. Values of k are determined by linear regression of ln C upon t. For the sample from Winyah Bay, Creekside, the results are displayed in Figure 4 for Q and NB.
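The regression of ln C upon t described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `fit_first_order` and the time/concentration values are hypothetical, and the data are generated from an exact first-order decay so that the recovered k is known in advance.

```python
import math

def fit_first_order(times_hr, concentrations):
    """Least-squares fit of ln C = -k*t + b.

    Returns (k, b, r_squared), where k is the first-order rate constant
    in 1/hr and b is the fitted intercept (ln of the initial concentration).
    """
    y = [math.log(c) for c in concentrations]
    n = len(times_hr)
    t_mean = sum(times_hr) / n
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in times_hr)
    sxy = sum((t - t_mean) * (yi - y_mean) for t, yi in zip(times_hr, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 1.0
    return -slope, intercept, r_squared

# Hypothetical data: exact first-order decay, k = 0.05 1/hr, C0 = 0.1 mg/mL.
times = [0.0, 6.0, 12.0, 24.0, 48.0]
conc = [0.1 * math.exp(-0.05 * t) for t in times]

k, b, r2 = fit_first_order(times, conc)
print(f"k = {k:.4f} 1/hr, C0 = {math.exp(b):.3f} mg/mL, R^2 = {r2:.3f}")
```

With real measurements the points scatter about the line, and R^2 indicates how well a first-order law describes the data.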

For the sediments, NB does not degrade significantly in comparison to Q; the rate for NB is essentially zero. However, Q does degrade in the sediments, and there appears to be a significant difference in the degradation rate between the samples. In decreasing order of degradation rate, the samples are: Winyah Bay, Creekside; Oyster Landing, Creekside; and Alderly Landing. The degradation rate of Q is lower in water at pH = 9 than in the untreated water, which in turn is lower than the rate in the sediment samples. The surprising observations are that NB degrades in water and that it degrades faster than Q in both types of water. This is in contrast to the sediments, in which NB does not degrade significantly.

The MilliQ water will contain dissolved oxygen (DO), unless special steps are taken to remove it. We speculate that the NB is oxidized in the water, but in the sediments the competition for the DO from the sediment is so great (sediment oxygen demand) that no oxidation of NB can occur. Another possibility is that the NB becomes tightly bound to sediments and is thereby stabilized.

As described in Task 3 above, we have a list of 225 potential analogs of Q in addition to NB. We have obtained small samples of some analogs that are commercially available in the USA. For these, we are checking estimates of Log P using the HPLC retention-time method and also evaluating how well they can be analyzed by our methods. The compounds are:

exo-2,3-epoxynorbornane
(±) endo-norborneol
(±) exo-norborneol
exo-2-bromonorbornane
(±) exo-2-chloronorbornane
cis-exo-2,3-norbornanediol

Table 5. Summary of Degradation Experiments with Quadricyclane and Norbornane

                                       Quadricyclane (Q)                    Norbornane (NB)
                                       Rate Constant                        Rate Constant
Sample                                 k, 1/hr   s.e. in k   N    R2        k, 1/hr   s.e. in k   N    R2
MilliQ Water "as is"                   0.026     0.002       18   0.90      0.040     0.003       18   0.89
MilliQ Water at pH=9                   0.008     0.002       12   0.63      0.014     0.002       12   0.58
Alderly Landing Sediments              0.040     0.03        14   0.14      0.000     0.001       18   0.002
Oyster Landing Sediments, Creekside    0.053     0.005       14   0.92      0.0015    0.0007      18   0.23
Winyah Bay Sediments, Creekside        0.068     0.007       14   0.88      -0.001    0.001       14   0.04
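For a first-order process, a rate constant k corresponds to a half-life of ln 2 / k. The sketch below applies that standard relation to the Q rate constants listed in Table 5. The half-lives are derived here for illustration only and are not reported in the original work; the helper name `half_life_hr` is hypothetical.

```python
import math

def half_life_hr(k_per_hr):
    """Half-life (hr) of a first-order process, ln 2 / k.

    Returns None when k is zero or negative, i.e. when no decay was detected.
    """
    if k_per_hr <= 0:
        return None
    return math.log(2) / k_per_hr

# Quadricyclane rate constants (1/hr) as listed in Table 5.
k_quadricyclane = {
    'MilliQ water "as is"': 0.026,
    "MilliQ water at pH 9": 0.008,
    "Alderly Landing sediments": 0.040,
    "Oyster Landing sediments, Creekside": 0.053,
    "Winyah Bay sediments, Creekside": 0.068,
}

for sample, k in k_quadricyclane.items():
    t_half = half_life_hr(k)
    label = "no decay detected" if t_half is None else f"{t_half:.1f} hr"
    print(f"{sample}: k = {k} 1/hr, half-life = {label}")
```

Because the standard errors in Table 5 are sizeable for some samples (Alderly Landing in particular), half-lives computed this way carry the same uncertainty.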

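Figure 4. Degradation results and regression lines for Q (QDC) and NB in the Winyah Bay, Creekside sediment sample. [Plot not reproduced; legend: QDC results, QDC regression, NB results, NB regression.]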

Results from the University of South Carolina subcontract: Dr. Schubauer-Berigan and his team have collected sediments from different sites in South Carolina and have characterized them. Biodegradation studies on Q and NB have been carried out with the collected sediments. Initial results show that Q has a low potential for aerobic biodegradation (BOD test), and it may be toxic at the concentrations tested. Preliminary studies show that both Q and NB degrade in formalin-killed and frozen treatments. The report from Dr. Schubauer-Berigan, the principal investigator of the USC subcontract, is attached in Appendix 1.

TASK 7: Photochemical transformations

This task was not scheduled in Year 1 of the project.

Personnel Supported

University of Minnesota, Duluth: Subhash Basak, Keith Lodge, Greg Grunwald, and Gloria Bly
University of South Carolina: Joseph Schubauer-Berigan and Darcy Wood

Publications

The following peer-reviewed papers, which are currently either published, accepted, or in submission, report results of research carried out between July 1, 1994 and June 30, 1995.

1. Molecular similarity and estimation of molecular properties, S. C. Basak and G. D. Grunwald, J. Chem. Inf. Comput. Sci. 35, 366-372, 1995.

2. Predicting genotoxicity of chemicals using nonempirical parameters, S. C. Basak and G. D. Grunwald, in: Proceedings of the XVI International Cancer Congress, pp. 413-416, 1995.

3. Predicting mutagenicity of chemicals using topological and quantum chemical parameters: a similarity based study, S. C. Basak and G. D. Grunwald, Chemosphere, in press, 1995.

4. Estimation of normal boiling points of haloalkanes using molecular similarity, S. C. Basak, B. D. Gute and G. D. Grunwald, Croat. Chim. Acta (submitted), 1995.

5. Use of nonempirical parameters in estimating mutagenicity of chemicals: a molecular similarity approach, S. C. Basak and B. D. Gute, in: Proceedings of the International Congress on Hazardous Waste: Impact on Human and Ecological Health, Atlanta, GA, in press, 1995.

6. Tolerance space and molecular similarity, S. C. Basak and G. D. Grunwald, SAR and QSAR in Environmental Research, in press, 1995.

Copies of the six articles mentioned above are attached in Appendix 2.

Interactions/Transitions

A. Participation/Presentations at Meetings, Seminars, etc

I. Presentations at Scientific Conferences

i) Subhash C. Basak presented an invited paper "Molecular similarity and estimation of molecular properties" at the 208th National Meeting of the American Chemical Society, August 20-25, 1994.

ii) Subhash C. Basak presented a paper "Tolerance space and molecular similarity" at the Sixth International Workshop on Quantitative Structure-Activity Relationships (QSAR) in Environmental Sciences, Belgirate, Italy, September 10-18, 1994.

iii) Subhash C. Basak presented a paper "Predicting genotoxicity of chemicals using nonempirical parameters" at the Sixteenth International Cancer Congress, New Delhi, October 30-November 5, 1994.

iv) Subhash C. Basak presented a paper "Use of molecular similarity in the assessment of toxicity of chemicals" at the International Congress on Hazardous Waste: Impact on Human and Ecological Health, organized by the Agency for Toxic Substances and Disease Registry (ATSDR), in Atlanta, GA, June 5-8, 1995.

v) Subhash Basak, Greg Grunwald and Brian Gute presented a poster "Estimation of toxicological and environmental properties of chemicals using molecular similarity" at the Sigma Xi poster exhibition, February 13-17, 1995.

vi) Subhash C. Basak, Timothy Colburn, and David Axtell presented a paper "Modelling mutagenicity with neural networks" at the Tenth International Conference on Mathematical and Computer Modelling, in Boston, MA, July 5-8, 1995.

Copies of abstracts/overheads of the above-mentioned six presentations are attached in Appendix 3.

II. Invited seminars/lectures

i) Subhash C. Basak presented an invited seminar "Use of graph-theoretic methods in risk assessment of environmental chemicals" on July 29, 1994, at the National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina.

ii) Subhash C. Basak presented an invited lecture to the Pharmaceutical Discussion Group (PDG) on "Use of nonempirical parameters in QSAR and novel drug discovery" in November, 1994. PDG, located in New Delhi, is a discussion group consisting of drug companies, government regulators and academics working in the field of drug design and pharmacy.

iii) Subhash C. Basak presented an invited seminar "Use of graph theoretic parameters in drug design and environmental toxicology" at the Department of Biochemistry, University College of Science, University of Calcutta, India, November, 1994.

iv) Subhash C. Basak gave an invited lecture "A biomathematical approach to risk assessment and drug design" at the Bose Institute, Calcutta, India, December, 1994.

v) Subhash C. Basak presented a seminar on "Development of nonempirical molecular descriptors and their applications in drug design" at Alza Drug Company, Palo Alto, CA, in January, 1995.

vi) Subhash C. Basak presented an invited seminar at Stanford University on "Use of graph theoretic approaches in the quantification of chemical similarity and structure-activity studies" in February, 1995.

B. Consultative and Advisory Function

i) Subhash C. Basak, Keith Lodge, and Greg Grunwald met with WPAFB scientists to discuss the use of QSAR and nonempirical parameters in predicting gene-modulating, pharmacokinetic and pharmacodynamic properties of molecules of interest to the US Air Force.

ii) Subhash C. Basak was a member of the national panel of experts which reviewed the extramural research projects submitted to NIEHS in the area of predicting genotoxicity of chemicals.

C. Transitions

i) Subhash C. Basak consulted with Alza on the use of topological indices and molecular similarity methods in designing more effective pharmaceutical agents.

ii) Subhash C. Basak consulted with Ranbaxy Laboratories, Ltd, India in the use of similarity/dissimilarity methods based on nonempirical parameters in molecular design.

iii) Subhash C. Basak participated in a workgroup meeting with computational chemists from Chemical Abstracts Service and the Upjohn Company to study the relative effectiveness of about ten different molecular similarity methods in molecular design and hazard assessment of chemicals. The similarity methods studied in this project will include those developed by Basak's group, CAS, and the Upjohn Company. The results of this large study will help users of similarity methods make a more rational selection of similarity techniques and software for computing intermolecular similarity.

New Discoveries

None.

Honors/Awards

i) Subhash C. Basak was invited to become a member of the organizing committee of the Tenth International Conference on Mathematical and Computer Modelling and Scientific Computing. This series of conferences is an interdisciplinary and integrative forum which brings together mathematicians, computer scientists, biomedical scientists and engineers to explore the development of new methods and applications of mathematical and computer modelling in science and engineering.

ii) Subhash C. Basak delivered a distinguished endowment lecture "Development of novel molecular similarity approaches and their applications in drug discovery" at the University Department of Chemical Technology (UDCT), University of Bombay, in November, 1994.

Appendix 1

Appendix 1-1 Annual Progress Report of the University of South Carolina subcontract for AFOSR/PKC Grant F49620-94-1-0401.

Annual Progress Report for AFOSR/PKC Grant F49620-94-1-0401 to the University of Minnesota
Subcontract 1613-189-6090-7901 to the University of South Carolina
Submitted by Dr. Joseph P. Schubauer-Berigan, Principal Investigator
August 19, 1995

Grant Administration

The grant was awarded to the University of Minnesota on July 1, 1994. Also on July 1, 1994, Dr. Schubauer-Berigan, one of the project's Principal Investigators, previously with the University of Minnesota, began a new position with the University of South Carolina. After consulting and negotiating with the other Principal Investigators and with AFOSR and University officials, a subcontract to support the work proposed by Dr. Joseph P. Schubauer-Berigan was requested by the University of South Carolina from the University of Minnesota on 13 September 1994. On 26 October 1994, the University of Minnesota made a request to AFOSR for the subcontract. The subcontract was approved by AFOSR and awarded by the University of Minnesota to the University of South Carolina on 13 February 1995.

Grant Activities (July 1, 1994 - June 30, 1995)

Prior to receiving the subcontract at USC, I worked closely with Dr. Keith Lodge and the project technician, Ms. Gloria Bly, to develop protocols and to plan and coordinate future activities of the two labs. Since the award, we have stayed in contact almost daily by email and telephone to coordinate work, discuss developments and collaborate on experiments. On two occasions, August 26, 1994 and March 13, 1995, I met with Drs. Basak and Lodge in Duluth, MN to discuss grant progress and strategy. At the request of Dr. Basak, I sent him two interim progress reports by email, one in February 1995 and the other in May 1995. In June 1995, I attended an AFOSR PIs meeting in Boulder, CO. Attending this meeting was very useful for making contacts and discussing problems and experimental findings with other researchers working on quadricyclane (QDC) and fluorocarbon ethers.

Upon receiving the subcontract on February 13, 1995, I initially focused my attention on locating and hiring support personnel and ordering the necessary equipment (e.g., an anaerobic hood) and materials to get the experiments underway. On March 1, 1995, Ms. Darcy Wood was hired as the project technician.

In March, we made our first collection of sediments to be used in the degradation experiments. Surface sediments were collected using a ponar dredge from sites located within the boundaries of the North Inlet - Winyah Bay National Estuarine Research Reserve and returned to the University of South Carolina's Baruch Marine Laboratory (BML) in Georgetown, South Carolina for analysis. A portion of these sediments was also shipped to Dr. Lodge in Minnesota. The sites, Oyster Landing Creekside (OLC), Oyster Landing High Marsh (OLH), Winyah Bay Creekside (WBC) and Alderly Landing (AL), are all wetland sites located less than a mile from the BML. The sites were selected because the sediments are thought to be uncontaminated and provide a diverse array of abiotic and biotic characteristics. For the past few months we have spent considerable effort characterizing these sediments for ongoing and future experiments. The results of these analyses are presented in Table 1. The sediments vary dramatically in their salinity, nutrient and organic matter content and bulk densities, all of which can potentially influence the fate of the test chemicals. Bacterial numbers determined for the sediments were highest at the high organic matter site, WBC, but all sites were within the range we have observed for other aquatic sediments and soils. In the coming months we will also screen these sediments for toxicity and mutagenicity and determine their particle size distributions.

In the past five months, we have also begun to make strides in examining the biodegradability of quadricyclane and one of its potential analogs, norbornane. We have used two approaches for this work: 1) a series of initial degradation experiments in sediment and water in collaboration with Dr. Lodge at UMN; and 2) determining the Biological Oxygen Demand (BOD) of these compounds at a series of test concentrations ranging from 25 to 100 ppm final concentration. BOD is a common standard measurement used to examine the degradation of chemicals, and it should allow us to compare our results to literature values for other compounds.

In the preliminary degradation experiments we found that both QDC and NOB degraded rapidly in water at neutral pH (k = 0.026 and 0.040, respectively) and more slowly in water at high pH (pH 9: k = 0.008 and 0.014, respectively). However, whereas QDC also degrades rapidly in sediments (AL: k=0.04; OLC: k=0.053; WBC: k=0.068), NOB degradation is inhibited (AL: k=0.000; OLC: k=0.0015; WBC: k=-0.0001). It is not clear at the present time why NOB degradation is inhibited in sediments. Perhaps NOB degradation is prevented by high sediment respiration leading to reduced conditions, or perhaps NOB partitions differently in the sediments. Both possibilities will be explored in the next year of the grant. Preliminary results also indicate that QDC and NOB both degrade in formalin-killed and frozen treatments; however, these experiments need to be repeated and verified. We also observed a secondary degradation product for QDC, which we suspect is the same tricyclyl alcohol reported by Dr. Pat Sulivan's group at the University of Wyoming. His group has synthesized this compound and he has offered to send us some of it so that we can verify this observation. This alcohol is potentially toxic, and whereas QDC is apparently degraded rapidly abiotically, we expect the daughter product to undergo biotic degradation. We plan to examine this in the coming grant year. We also hope to determine whether all the analogs of QDC have similar daughter products, and to collaborate with Dr. Pat Sulivan's group on some of this work.

As mentioned previously, we determined 5-day BODs at 20°C for QDC and NOB at a series of test concentrations ranging from 25 to 100 ppm final concentration, following methods outlined in Standard Methods For The Examination Of Water And Wastewater. A commercially available BOD seed was purchased and used for these determinations. BODs for QDC ranged from 0.16 mg/L O2 to 1.22 mg/L O2 for the 25 ppm and 50 ppm treatments, respectively (Table 2). This indicates that QDC has a low potential for aerobic biological degradation or that it may be toxic at the test concentrations. In an attempt to determine the relative toxicity of QDC at the test concentrations, we measured the BOD of the positive control with added glucose/glutamic acid (G/GA) and the BODs of treatments with G/GA + varying concentrations of QDC (Table 3). In all cases the BODs were higher in the treatments with QDC and G/GA than in the treatment with G/GA alone, indicating that the compound was not toxic to the bacterial seed at these concentrations. Interestingly, however, the BODs of the QDC + G/GA treatments decreased with increasing QDC concentration, perhaps indicating some inhibition of biotic QDC degradation at the higher concentrations.
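The comparison against the G/GA control can be made explicit with a short calculation. The sketch below uses the mean BOD values from Table 3; it assumes that the daily oxygen demand column is the 5-day BOD divided by 5, which is consistent with the tabulated values but is not stated in the report. The function names are hypothetical.

```python
INCUBATION_DAYS = 5  # 5-day BOD test

# Mean BOD values (mg/L) from Table 3.
bod_mg_per_l = {
    "G/GA + no QDC": 5.96,
    "G/GA + 25 ppm QDC": 7.17,
    "G/GA + 50 ppm QDC": 6.80,
    "G/GA + 100 ppm QDC": 6.67,
}

def daily_demand(bod):
    """Average oxygen demand per day (mg/L/day), assuming BOD / 5 days."""
    return bod / INCUBATION_DAYS

def percent_change(bod, control):
    """Change in BOD relative to the control, in percent."""
    return 100.0 * (bod - control) / control

control = bod_mg_per_l["G/GA + no QDC"]
for treatment, bod in bod_mg_per_l.items():
    print(f"{treatment}: {daily_demand(bod):.2f} mg/L/day, "
          f"{percent_change(bod, control):+.1f}% vs. control")
```

A positive change for every QDC treatment corresponds to the observation that QDC did not suppress G/GA consumption, while the shrinking margin at higher concentrations reflects the trend noted in the text.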

Whereas QDC easily dissolved in water at all the test concentrations used, we had to dissolve NOB in a carrier of 2.5% (final concentration) of 95% EtOH to keep NOB in solution. EtOH at this concentration appeared to be toxic to the bacterial seed in the positive control (Table 4), but not in the treatments with added NOB. We later verified, by varying the concentration of EtOH, that EtOH reduces the BOD of the bacterial seed at a concentration of 2.5% or higher (Table 5). Why the treatments containing NOB do not seem to be inhibited by 2.5% EtOH is not clear. Perhaps NOB or one of its degradation products is such a good substrate for the bacteria that the treatments with NOB overcome EtOH inhibition. Showing a pattern similar to that of QDC, the BODs of the NOB + G/GA treatments decreased slightly with increasing NOB concentration, perhaps indicating some inhibition of biotic NOB degradation at the higher concentrations.

Table 1. Physical and biological characteristics for sediments sampled at the North Inlet/Winyah Bay National Estuarine Research Reserve (NERR). Sites include: Oyster Landing Creekside (OLC), Oyster Landing High Marsh (OLH), Winyah Bay Creekside (WBC), and Alderly Landing (AL). Means of three or more replicates with associated standard deviations in parentheses are presented. Nutrients were analyzed on the centrifuged porewater fraction using a Technicon Autoanalyzer.

PARAMETERS                       OLC                    OLH                    WBC                    AL
pH                               6.73 (.14)             6.46 (.18)             6.37 (.7)              6.47 (0)
Salinity (ppt)                   31.8 (.41)             41.5 (3.8)             7 (3.3)                3.3 (.58)
Ammonia (µg-atom N/L)            2.05 (.65)             1.6 (16)               19.9 (.79)             191 (6)
Nitrate/Nitrite (µg-atom N/L)    0.25 (.04)             0.38 (.01)             0.28 (.03)             1.39 (.28)
Phosphate (µg-atom P/L)          3.02 (.08)             0.77 (0)               0.49 (.01)             0.6 (.14)
% Moisture                       49 (12.7)              43.6 (.62)             79.6 (.66)             54.4 (.61)
% Loss On Ignition               6.88 (3)               5.25 (.46)             14.5 (3.9)             10.9 (.34)
Bacteria (cells/mL)              3.89x10^8 (4.8x10^7)   1.51x10^8 (2.3x10^7)   1.09x10^9 (3.5x10^7)   6.10x10^8 (1.2x10^8)

Table 2. Biochemical Oxygen Demand (BOD) for Quadricyclane. Means of three replicates with associated standard deviations in parentheses are presented.

Treatments      BOD (mg/L)      Oxygen Demand (mg/L/day)
G/GA            7.41 (.071)     1.48 (.014)
25 ppm QDC      0.16 (.097)     0.10 (.095)
50 ppm QDC      1.22 (.189)     0.24 (.04)
100 ppm QDC     0.92 (.274)     0.18 (.059)

G/GA = Glucose/Glutamic Acid solution.

Table 3. Effects of Quadricyclane on Glucose/Glutamic Acid consumption. Means of three replicates with associated standard deviations in parentheses are presented.

Treatments             BOD (mg/L)      Oxygen Demand (mg/L/day)
G/GA + No QDC          5.96 (.805)     1.19 (.159)
G/GA + 25 ppm QDC      7.17 (.155)     1.43 (.031)
G/GA + 50 ppm QDC      6.80 (.305)     1.36 (.061)
G/GA + 100 ppm QDC     6.67 (.655)     1.33 (.133)

G/GA = Glucose/Glutamic Acid solution.

Table 4. Biochemical Oxygen Demand (BOD) for Norbornane. Means of three replicates with associated standard deviations in parentheses are presented.

Treatment       BOD (mg/L)      Oxygen Demand (mg/L/day)
G/GA            1.14 (.082)     0.23 (.015)
25 ppm NOR      7.68 (.279)     1.54 (.057)
50 ppm NOR      4.94 (.244)     0.99 (.486)
100 ppm NOR     5.53 (.226)     1.11 (.448)

G/GA = Glucose/Glutamic Acid solution. All treatments contained 2.5% final concentration of 95% Ethanol.

Table 5. Effects of Ethanol on Glucose/Glutamic Acid consumption. Means of three replicates with associated standard deviations in parentheses are presented.

Treatments                  BOD (mg/L)      Oxygen Demand (mg/L/day)
G/GA (No EtOH)              7.41 (.071)     1.48 (.014)
G/GA + 1% (95% EtOH)        8.64 (.363)     1.73 (.074)
G/GA + 2.5% (95% EtOH)      1.14 (.082)     0.23 (.015)
G/GA + 5.0% (95% EtOH)      1.56 (.121)     0.31 (.025)

G/GA = Glucose/Glutamic Acid solution.

Appendix 2

Publications

Appendix 2-1 Molecular similarity and estimation of molecular properties

Appendix 2-2 Predicting genotoxicity of chemicals using nonempirical parameters

Appendix 2-3 Predicting Mutagenicity of chemicals using topological and quantum chemical parameters: A similarity based study

Appendix 2-4 Estimation of normal boiling points of haloalkanes using molecular similarity

Appendix 2-5 Use of nonempirical parameters in estimating mutagenicity of chemicals: A molecular similarity approach

Appendix 2-6 Tolerance space and molecular similarity

Reprinted from the Journal of Chemical Information and Computer Sciences, 1995, 35, 366. Copyright © 1995 by the American Chemical Society and reprinted by permission of the copyright owner.

Molecular Similarity and Estimation of Molecular Properties

Subhash C. Basak* and Gregory D. Grunwald

Natural Resources Research Institute, The University of Minnesota, 5013 Miller Trunk Highway, Duluth, Minnesota 55811

Received September 8, 1994

Five molecular similarity methods have been used to select K nearest neighbors of chemicals for K = 1-10, 15, 20, 25. The properties of the selected neighbors have been used to estimate properties of two sets of chemicals: (a) normal boiling point of a group of 139 hydrocarbons and (b) mutagenicity of a set of 95 aromatic and heteroaromatic amine compounds. The similarity methods are based on calculated topological indices and atom pairs. The results show that each of these methods gives reasonable estimates of the molecular properties investigated in this paper.
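The K-nearest-neighbor idea summarized in the abstract can be illustrated with a small sketch. It assumes that similarity is measured by Euclidean distance in a descriptor space and that the estimate is the unweighted mean of the neighbors' property values; the paper's five similarity methods are not reproduced here, and the descriptor vectors and property values below are invented purely for illustration.

```python
import math

def knn_estimate(target_descriptors, reference_set, k):
    """Estimate a property as the mean over the k nearest reference chemicals.

    reference_set is a list of (descriptor_vector, property_value) pairs.
    Distance is Euclidean distance between descriptor vectors (an assumption).
    """
    ranked = sorted(
        reference_set,
        key=lambda item: math.dist(target_descriptors, item[0]),
    )
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical two-descriptor reference set with made-up property values.
reference = [
    ([0.0, 0.0], 10.0),
    ([1.0, 0.0], 20.0),
    ([5.0, 5.0], 100.0),
]

target = [0.1, 0.0]
print(knn_estimate(target, reference, k=1))  # uses only the closest chemical
print(knn_estimate(target, reference, k=2))  # averages the two closest
```

Raising K smooths the estimate but pulls in less similar chemicals, which is the trade-off the paper examines for K = 1-10, 15, 20, 25.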

1. INTRODUCTION

In recent years, there has been an upsurge of interest in the use of molecular similarity methods in drug design, selection of analogs for chemicals, and estimation of properties.¹⁻¹⁰ In drug design, similarity/dissimilarity based methods have been very useful in the rational selection of candidate chemicals from large databases.⁹ In risk assessment, we often have to select analogs of chemicals of interest when adequate sets of test data are not available.¹¹ The properties of the selected "similar" neighbors can be used to estimate the hazard posed by the chemical of interest.

The use of molecular similarity methods is based on the structure-property similarity principle. This notion states that similar structures usually have similar properties.¹,³ The principle has been used in ordering a wide range of chemical phenomena ranging from atomic properties to macromolecular behavior.¹² Intermolecular similarity can be defined in terms of the number of structural features and their mutual arrangements which two chemical species have in common.¹³ The structural feature(s) used to quantify similarity will vary with the level of organization to which the chemical species belong, viz., atomic, molecular, macromolecular, etc. It is also dependent on the mode of representation of the species, choice of the set of structural descriptors, and the selection of the particular mathematical function used to quantify similarity from the chosen set of descriptors.

Recently, a number of different methods have been developed for quantitative molecular similarity analysis (QMSA) of chemicals.¹⁻⁸,¹⁰ Any similarity method gives us an ordered set of molecules (analogs) structurally related to the chemical of interest. Properties of the K selected neighbors (analogs) can then be used to estimate properties of the candidate chemical.¹⁴ Similarity of chemicals has been quantified using empirical and nonempirical properties or parameters.¹,⁴⁻⁸,¹⁰,¹² In view of the fact that the majority of new and existing chemicals will have scanty experimental data, QMSA methods based on nonempirical parameters would be most useful for hazard assessment and drug design. Graph theoretical parameters like topological indices and substructures have been used in the quantification of molecular similarity.¹⁻¹⁰,¹² In a recent paper, Basak and Grunwald¹⁴ used properties of the closest neighbor of chemicals to estimate properties. The similarity methods used in the study were based on topological indices and atom pairs. It was of interest to see whether a higher number of K neighbors (K > 1) results in better property estimation. Therefore, in this paper, we have carried out a comparative study of five QMSA techniques in predicting (a) boiling points of a set of 139 hydrocarbons and (b) mutagenic potency of 95 aromatic and heteroaromatic amines. Results of these analyses are presented along with a critical evaluation of the applicability of these QMSA techniques in estimating molecular properties.

2. METHODS

2.1. Databases. The 139 hydrocarbons used in this study were taken from literature and included 73 C3-C9 alkanes,¹⁵ 29 alkylbenzenes,¹⁶ and 37 polycyclic aromatic hydrocarbons (PAHs).¹⁷ A listing of the hydrocarbons is presented in Table 1.

The 95 aromatic and heteroaromatic amines used to study mutagenic potency were taken from the literature.¹⁸ The mutagenic activities of these compounds in S. typhimurium TA98 + S9 microsomal preparation had been collected and are expressed as the mutation rate, ln(R), in log(revertants/nmol). Table 2 lists the compounds used for this study.

2.2. Topological Indices and Atom Pairs. The calculation of the topological indices (TIs) and atom pairs² have previously been described in detail.⁵ However, four additional indices developed by A. T. Balaban¹⁹⁻²¹ were calculated and used in these analyses. Balaban denoted these as J indices and they are based upon the distance sums s_i of a chemical graph. J is defined as

$$ J = q(\mu + 1)^{-1} \sum_{\text{edges}} (s_i s_j)^{-1/2} \qquad (1) $$

where the cyclomatic number μ (or number of rings in graph) is μ = q - n + 1, with q adjacencies or edges and n vertices. In the original definition of J, the term s_i referred to either the row distance sum for vertex i in the distance matrix (D) or the multigraph distance matrix (M):

$$ s_i = \sum_{j} d_{ij} \qquad (2) $$

Abstract published in Advance ACS Abstracts, December 15, 1994.
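As an illustration of the J index, the sketch below evaluates q/(μ + 1) times the sum over edges of (s_i s_j)^(-1/2), where s_i is the row sum of the topological distance matrix and μ = q - n + 1. It covers only the simple-graph distance matrix D for a connected hydrogen-suppressed graph; the multigraph matrix M and the other Balaban-type variants used in the paper are not implemented, and the function names and adjacency-list input format are assumptions made for this example.

```python
from collections import deque

def distance_matrix(adjacency):
    """Topological distance matrix of a connected graph via breadth-first search.

    adjacency[i] lists the vertices bonded to vertex i.
    """
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            node, d = queue.popleft()
            dist[source][node] = d
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, d + 1))
    return dist

def balaban_j(adjacency):
    """J = q / (mu + 1) * sum over edges of (s_i * s_j) ** -0.5."""
    n = len(adjacency)
    edges = [(i, j) for i in range(n) for j in adjacency[i] if i < j]
    q = len(edges)
    mu = q - n + 1  # cyclomatic number (number of rings)
    s = [sum(row) for row in distance_matrix(adjacency)]  # distance sums
    return q / (mu + 1) * sum((s[i] * s[j]) ** -0.5 for i, j in edges)

# Carbon skeleton of n-butane: a path of four vertices.
n_butane = [[1], [0, 2], [1, 3], [2]]
print(round(balaban_j(n_butane), 4))  # 1.9747
```

For n-butane the distance sums are 6, 4, 4 and 6, so the three edges contribute 24^(-1/2) + 16^(-1/2) + 24^(-1/2), and with q = 3 and μ = 0 the index is about 1.9747.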


Table 1. Normal Boiling Points (°C) for 139 Hydrocarbons

no.  compound                       bp        no.  compound                          bp
  1  propane                        -42.1      71  2,2,3,4-tetramethylpentane        133.0
  2  n-butane                       -0.5       72  2,2,4,4-tetramethylpentane        122.3
  3  2-methylpropane                -11.7      73  2,3,3,4-tetramethylpentane        141.6
  4  n-pentane                      36.1       74  benzene                           80.1
  5  2-methylbutane                 27.8       75  toluene                           110.6
  6  2,2-dimethylpropane            9.5        76  ethylbenzene                      136.2
  7  n-hexane                       68.7       77  o-xylene                          144.4
  8  2-methylpentane                60.3       78  m-xylene                          139.1
  9  3-methylpentane                63.3       79  p-xylene                          138.4
 10  2,2-dimethylbutane             0.7        80  n-propylbenzene                   159.2
 11  2,3-dimethylbutane             58.0       81  1-methyl-2-ethylbenzene           165.2
 12  n-heptane                      98.4       82  1-methyl-3-ethylbenzene           161.3
 13  2-methylhexane                 90.0       83  1-methyl-4-ethylbenzene           162.0
 14  3-methylhexane                 91.8       84  1,2,3-trimethylbenzene            176.1
 15  3-ethylpentane                 93.5       85  1,2,4-trimethylbenzene            169.4
 16  2,2-dimethylpentane            79.2       86  1,3,5-trimethylbenzene            164.7
 17  2,3-dimethylpentane            89.8       87  n-butylbenzene                    183.3
 18  2,4-dimethylpentane            80.5       88  1,2-diethylbenzene                183.4
 19  3,3-dimethylpentane            86.1       89  1,3-diethylbenzene                181.1
 20  2,2,3-trimethylbutane          80.9       90  1,4-diethylbenzene                183.8
 21  n-octane                       125.7      91  1-methyl-2-n-propylbenzene        184.8
 22  2-methylheptane                117.6      92  1-methyl-3-n-propylbenzene        181.8
 23  3-methylheptane                118.9      93  1-methyl-4-n-propylbenzene        183.8
 24  4-methylheptane                117.7      94  1,2-dimethyl-3-ethylbenzene       193.9
 25  3-ethylhexane                  118.5      95  1,2-dimethyl-4-ethylbenzene       189.8
 26  2,2-dimethylhexane             106.8      96  1,3-dimethyl-2-ethylbenzene       190.0
 27  2,3-dimethylhexane             115.6      97  1,3-dimethyl-4-ethylbenzene       188.4
 28  2,4-dimethylhexane             109.4      98  1,3-dimethyl-5-ethylbenzene       183.8
 29  2,5-dimethylhexane             109.1      99  1,4-dimethyl-2-ethylbenzene       186.9
 30  3,3-dimethylhexane             112.0     100  1,2,3,4-tetramethylbenzene        205.0
 31  3,4-dimethylhexane             117.7     101  1,2,3,5-tetramethylbenzene        198.2
 32  2-methyl-3-ethylpentane        115.6     102  1,2,4,5-tetramethylbenzene        196.8
 33  3-methyl-3-ethylpentane        118.3     103  naphthalene                       218.0
 34  2,2,3-trimethylpentane         109.8     104  acenaphthalene                    270.0
 35  2,2,4-trimethylpentane         99.2      105  acenaphthene                      279.0
 36  2,3,3-trimethylpentane         114.8     106  fluorene                          294.0
 37  2,3,4-trimethylpentane         113.5     107  phenanthrene                      338.0
 38  2,2,3,3-tetramethylbutane      106.5     108  anthracene                        340.0
 39  n-nonane                       150.8     109  4H-cyclopenta[def]phenanthrene    359.0
 40  2-methyloctane                 143.3     110  fluoranthene                      383.0
 41  3-methyloctane                 144.2     111  pyrene                            393.0
 42  4-methyloctane                 142.5     112  benzo[a]fluorene                  403.0
 43  3-ethylheptane                 143.0     113  benzo[b]fluorene                  398.0
 44  4-ethylheptane                 141.2     114  benzo[c]fluorene                  406.0
 45  2,2-dimethylheptane            132.7     115  benzo[ghi]fluoranthene            422.0
 46  2,3-dimethylheptane            140.5     116  cyclopenta[cd]pyrene              439.0
 47  2,4-dimethylheptane            133.5     117  chrysene                          431.0
 48  2,5-dimethylheptane            136.0     118  benz[a]anthracene                 425.0
 49  2,6-dimethylheptane            135.2     119  triphenylene                      429.0
 50  3,3-dimethylheptane            137.3     120  naphthacene                       440.0
 51  3,4-dimethylheptane            140.6     121  benzo[b]fluoranthene              481.0
 52  3,5-dimethylheptane            136.0     122  benzo[j]fluoranthene              480.0
 53  4,4-dimethylheptane            135.2     123  benzo[k]fluoranthene              481.0
 54  2-methyl-3-ethylhexane         138.0     124  benzo[a]pyrene                    496.0
 55  2-methyl-4-ethylhexane         133.8     125  benzo[e]pyrene                    493.0
 56  3-methyl-3-ethylhexane         140.6     126  perylene                          497.0
 57  3-methyl-4-ethylhexane         140.4     127  anthanthrene                      547.0
 58  2,2,3-trimethylhexane          133.6     128  benzo[ghi]perylene                542.0
 59  2,2,4-trimethylhexane          126.5     129  indeno[1,2,3-cd]fluoranthene      531.0
 60  2,2,5-trimethylhexane          124.1     130  indeno[1,2,3-cd]pyrene            534.0
 61  2,3,3-trimethylhexane          137.7     131  dibenz[a,c]anthracene             535.0
 62  2,3,4-trimethylhexane          139.0     132  dibenz[a,h]anthracene             535.0
 63  2,3,5-trimethylhexane          131.3     133  dibenz[a,j]anthracene             531.0
 64  2,4,4-trimethylhexane          130.6     134  picene                            519.0
 65  3,3,4-trimethylhexane          140.5     135  coronene                          590.0
 66  3,3-diethylpentane             146.2     136  dibenzo[a,e]pyrene                592.0
 67  2,2-dimethyl-3-ethylpentane    133.8     137  dibenzo[a,h]pyrene                596.0
 68  2,3-dimethyl-3-ethylpentane    142.0     138  dibenzo[a,i]pyrene                594.0
 69  2,4-dimethyl-3-ethylpentane    136.7     139  dibenzo[a,l]pyrene                595.0
 70  2,2,3,3-tetramethylpentane     140.3

368 J. Chem. Inf. Comput. Sci., Vol. 35, No. 3, 1995 Basak and Grunwald

For distance matrix D, each matrix element d_ij represents the distance from vertex i to vertex j. The diagonal entries are all zero, and the distance between any two adjacent vertices would be one. All other entries in this matrix would be the number of edges or bonds traversed in the shortest path from i to j. For the multigraph distance matrix M, for

Table 2. Mutagenicity (S. typhimurium TA98 with Metabolic Activation) of 95 Aromatic and Heteroaromatic Amines

no.   compound   log rev./nmol
1   2-bromo-7-aminofluorene   2.62
2   2-methoxy-5-methylaniline   -2.05
3   5-aminoquinoline   -2.00
4   4-ethoxyaniline   -2.30
5   1-aminonaphthalene   -0.60
6   4-aminofluorene   1.13
7   2-aminoanthracene   2.62
8   7-aminofluoranthene   2.88
9   8-aminoquinoline   -1.14
10   1,7-diaminophenazine   0.75
11   2-aminonaphthalene   -0.67
12   4-aminopyrene   3.16
13   3-amino-3'-nitrobiphenyl   -0.55
14   2,4,5-trimethylaniline   -1.32
15   3-aminofluorene   0.89
16   3,3'-dichlorobenzidine   0.81
17   2,4-dimethylaniline   -2.22
18   2,7-diaminofluorene   0.48
19   3-aminofluoranthene   3.31
20   2-aminofluorene   1.93
21   2-amino-4'-nitrobiphenyl   -0.62
22   4-aminobiphenyl   -0.14
23   3-methoxy-4-methylaniline   -1.96
24   2-aminocarbazole   0.60
25   2-amino-5-nitrophenol   -2.52
26   2,2'-diaminobiphenyl   -1.52
27   2-hydroxy-7-aminofluorene   0.41
28   1-aminophenanthrene   2.38
29   2,5-dimethylaniline   -2.40
30   4-amino-2'-nitrobiphenyl   -0.92
31   2-amino-4-methylphenol   -2.10
32   2-aminophenazine   0.55
33   4-aminophenylsulfide   0.31
34   2,4-dinitroaniline   -2.00
35   2,4-diaminoisopropylbenzene   -3.00
36   2,4-difluoroaniline   -2.70
37   4,4'-methylenedianiline   -1.60
38   3,3'-dimethylbenzidine   0.01
39   2-aminofluoranthene   3.23
40   2-amino-3'-nitrobiphenyl   -0.89
41   1-aminofluoranthene   3.35
42   4,4'-ethylenebis(aniline)   -2.15
43   4-chloroaniline   -2.52
44   2-aminophenanthrene   2.46
45   4-fluoroaniline   -3.32
46   9-aminophenanthrene   2.98
47   3,3'-diaminobiphenyl   -1.30
48   2-aminopyrene   3.50
49   2,6-dichloro-1,4-phenylenediamine   -0.69
50   2-amino-7-acetamidofluorene   1.18
51   2,8-diaminophenazine   1.12
52   6-aminoquinoline   -2.67
53   4-methoxy-2-methylaniline   -3.00
54   3-amino-2'-nitrobiphenyl   -1.30
55   2,4'-diaminobiphenyl   -0.92
56   1,6-diaminophenazine   0.20
57   4-aminophenyldisulfide   -1.03
58   2-bromo-4,6-dinitroaniline   -0.54
59   2,4-diamino-n-butylbenzene   -2.70
60   4-aminophenylether   -1.14
61   2-aminobiphenyl   -1.49
62   1,9-diaminophenazine   0.04
63   1-aminofluorene   0.43
64   8-aminofluoranthene   3.80
65   2-chloroaniline   -3.00
66   3-amino-α,α,α-trifluorotoluene   -0.80
67   2-amino-1-nitronaphthalene   -1.17
68   3-amino-4'-nitrobiphenyl   0.69
69   4-bromoaniline   -2.70
70   2-amino-4-chlorophenol   -3.00
71   3,3'-dimethoxybenzidine   0.15
72   4-cyclohexylaniline   -1.24
73   4-phenoxyaniline   0.38
74   4,4'-methylenebis(o-ethylaniline)   -0.99
75   2-amino-7-nitrofluorene   3.00
76   benzidine   -0.39
77   1-amino-4-nitronaphthalene   -1.77
78   4-amino-3'-nitrobiphenyl   1.02
79   4-amino-4'-nitrobiphenyl   1.04
80   1-aminophenazine   -0.01
81   4,4'-methylenebis(o-fluoroaniline)   0.23
82   4-chloro-2-nitroaniline   -2.22
83   3-aminoquinoline   -3.14
84   3-aminocarbazole   -0.48
85   4-chloro-1,2-phenylenediamine   -0.49
86   3-aminophenanthrene   3.77
87   3,4'-diaminobiphenyl   0.20
88   1-aminoanthracene   1.18
89   1-aminocarbazole   -1.04
90   9-aminoanthracene   0.87
91   4-aminocarbazole   -1.42
92   6-aminochrysene   1.83
93   1-aminopyrene   1.43
94   4,4'-methylenebis(o-isopropylaniline)   -1.77
95   2,7-diaminophenazine   3.97

each entry d_ij, 1/b is used, where b is the conventional bond order (b = 1, 2, 3, and 1.5 for single, double, triple, and aromatic bonds, respectively). To account for periodicity of chemical properties for heteroatoms, Balaban proposed two J variants:21 J^X, which includes corrections for heteroatom electronegativities, and J^Y, which has corrections for heteroatom covalent radii.

Table 3 lists the 102 TIs used in this paper. The set of TIs was generated for all compounds in each data set. Several of these indices were completely correlated (r = 1.0) and were dropped. Coincidentally, this left 90 TIs for use in the analyses for both data sets. The indices were calculated by POLLY.22 The set of atom pairs was generated by APPROBE.23

2.3. Data Reduction. Initially, all TIs were transformed by the natural log of the index plus one. This was done since the scale of some TIs may be several orders of magnitude greater than other TIs.

Since many of the TIs are highly intercorrelated, the original 90 dimensioned space of these indices can be represented by a subspace without significant loss of information. One such method, available under SAS as the VARCLUS24 procedure, is to divide the indices into disjoint subsets of variables based on the correlation matrix.

The VARCLUS procedure divides the set of TIs into disjoint clusters so that each cluster is essentially unidimensional. Initially, all TIs start in one cluster, and the first and second principal components (PCs) are determined from the correlation matrix. Then the cluster is split by assigning all TIs most correlated with the first PC to one cluster and all TIs most correlated with the second PC to another cluster. The first and second PCs for each new cluster are determined after the original cluster is split. Each new cluster is split into two new clusters until each cluster has only one significant PC, i.e., the eigenvalue for the second PC of the subset of clustered TIs is less than one. The output of this procedure is the first principal component (PC) from each cluster of TIs. These PCs were subsequently used in the similarity measures described below.
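The two preprocessing steps of section 2.3 that precede variable clustering (removing completely correlated indices and applying the ln(index + 1) transformation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the POLLY or SAS code used in the paper: the function names, the row-per-compound list layout, and the tolerance `tol` are all hypothetical, and the VARCLUS splitting itself is not reproduced.

```python
import math


def log_transform(matrix):
    """Apply ln(x + 1) to every value (rows = compounds, columns = TIs)."""
    return [[math.log(x + 1.0) for x in row] for row in matrix]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_perfectly_correlated(matrix, tol=1e-9):
    """Return the column positions kept after dropping any column whose
    |r| with an earlier kept column is 1 (within the assumed tolerance)."""
    ncols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(ncols)]
    kept = []
    for j in range(ncols):
        if all(abs(pearson(columns[j], columns[k])) < 1.0 - tol for k in kept):
            kept.append(j)
    return kept
```

For example, with three hypothetical index columns where the second is exactly twice the first, `drop_perfectly_correlated` keeps columns 0 and 2; the retained columns can then be passed through `log_transform` before clustering.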

Table 3. Topological Index Symbols and Definitions

I_D^W   information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī_D^W   mean information index for the magnitude of distance
W   Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D   degree complexity
H^V   graph vertex complexity
H^D   graph distance complexity
ĪC   information content of the distance matrix partitioned by frequency of occurrences of distance h
O   order of neighborhood when IC_r reaches its maximum value for the hydrogen-filled graph
I_ORB   information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
O_ORB   maximum order of neighborhood of vertices for I_ORB within the hydrogen-suppressed graph
M_1   a Zagreb group parameter = sum of square of degree over all vertices
M_2   a Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
IC_r   mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC_r   structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC_r   complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
^hχ   path connectivity index of order h = 0-6
^hχ_C   cluster connectivity index of order h = 3-6
^hχ_Ch   chain connectivity index of order h = 3-6
^hχ_PC   path-cluster connectivity index of order h = 4-6
^hχ^b   bonding path connectivity index of order h = 0-6
^hχ^b_C   bonding cluster connectivity index of order h = 3-6
^hχ^b_Ch   bonding chain connectivity index of order h = 3-6
^hχ^b_PC   bonding path-cluster connectivity index of order h = 4-6
^hχ^v   valence path connectivity index of order h = 0-6
^hχ^v_C   valence cluster connectivity index of order h = 3-6
^hχ^v_Ch   valence chain connectivity index of order h = 3-6
^hχ^v_PC   valence path-cluster connectivity index of order h = 4-6
P_h   number of paths of length h = 0-10
J   Balaban's J index based on distance
J^B   Balaban's J index based on multigraph bond orders
J^X   Balaban's J index based on relative electronegativities
J^Y   Balaban's J index based on relative covalent radii
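To make two of the simpler definitions in Table 3 concrete, the sketch below builds the topological distance matrix of a hydrogen-suppressed graph by breadth-first search and derives the Wiener index W and the first Zagreb parameter from it. It is an illustration only, assuming a connected graph given as an adjacency list; the function names and the input layout are hypothetical and unrelated to POLLY.

```python
from collections import deque


def distance_matrix(adjacency):
    """Shortest-path (bond count) distances between all vertex pairs.

    adjacency[i] lists the neighbors of vertex i; the graph is assumed
    to be connected and unweighted.
    """
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            vertex, d = queue.popleft()
            dist[source][vertex] = d
            for neighbor in adjacency[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, d + 1))
    return dist


def wiener_index(dist):
    """Half-sum of the off-diagonal elements of the distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(n) if i != j) // 2


def zagreb_m1(adjacency):
    """Sum of squared vertex degrees over all vertices."""
    return sum(len(neighbors) ** 2 for neighbors in adjacency)


# Carbon skeletons of n-butane (a path) and 2-methylpropane (a star).
n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1], [0, 2, 3], [1], [1]]
print(wiener_index(distance_matrix(n_butane)))   # 10
print(wiener_index(distance_matrix(isobutane)))  # 9
```

The branched isomer has the smaller W and the larger sum of squared degrees, which is the kind of size and branching information these indices are meant to encode.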

In a standard principal component analysis, each of the orthogonal axes derived from the set of variables is composed of linear combinations of all the original variables. This can make interpretation of the new axes difficult. By using disjoint clusters of variables, axis interpretation can be enhanced, although orthogonality can no longer be guaranteed.

In addition to using the cluster or PC variables derived above, we selected from each cluster the TI which was most correlated with the cluster to which it belonged. These TIs were then used in the similarity measures as well.

2.4. Similarity Measures. Five measures of intermolecular similarity were used in these studies. The first similarity measure was an associative coefficient using the atom pairs. Similarity (S) between two molecules i and j is defined as19

$$S_{ij} = \frac{2C}{T_i + T_j} \qquad (3)$$

where C is the number of atom pairs common to molecules i and j. T_i and T_j are the total number of atom pairs in i and j, respectively. The numerator is multiplied by two to account for the atom pairs belonging to both molecules.

The remaining four similarity measures are based on Euclidean distance (ED) within an n-dimensional space. ED between molecules i and j is defined as

$$ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2} \qquad (4)$$

where n equals the number of dimensions and D_ik equals the data value of the kth dimension for compound i. The dimensions are the clusters (PCs) or TIs as selected by the procedure outlined in section 2.3. The four ED based approaches included (a) TIs, scaled TIs, (b) TIu, unscaled TIs, (c) PCs, scaled cluster variables, and (d) PCu, unscaled cluster variables. The two methods using scaled dimensions involved rescaling the variables to have mean zero and variance one.

2.5. K Neighbor Selection and Property Estimation. Using the similarity methods described in section 2.4, intermolecular similarity of the chemicals within each data set was quantified. For each chemical, the K nearest neighbors were determined using each of the similarity techniques, where K was 1-10, 15, 20, and 25.

For the hydrocarbons, the mean observed boiling point of the K nearest neighbors for a compound was used as the estimated boiling point for the compound. The correlation (r) of observed boiling point with estimated boiling point and the standard error of the estimates were used to assess the relative efficacy of the five similarity methods.

Since the probe compound is essentially removed from the database when selecting analogs, i.e., the probe is not a candidate neighbor to itself, the estimation of the property is the same as if the probe were a complete unknown.

The analog selection and property prediction method used for the set of hydrocarbons was used to estimate the mutagenic activity, ln(R), of the aromatic and heteroaromatic amines.

3. RESULTS

3.1. Variable Clustering and TI Selection. There were 10 significant clusters derived from the variable clustering of 90 TIs for the 139 hydrocarbons. These 10 clusters explained a total of 90.5% of the total variation within the original TIs.
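The similarity measures of section 2.4 and the K-neighbor estimation of section 2.5 can be sketched in a few lines. This is a hedged illustration, not the APPROBE or SAS implementation: the function names are hypothetical, atom pairs are assumed to be held as multisets (`collections.Counter`) so that repeated pairs are counted, scaling is assumed to use the sample variance (n - 1 denominator), and ties in distance are broken by database order.

```python
import math
from collections import Counter


def atom_pair_similarity(pairs_i, pairs_j):
    """Eq 3: S_ij = 2C / (T_i + T_j), with atom pairs held as Counters."""
    common = sum((pairs_i & pairs_j).values())
    total = sum(pairs_i.values()) + sum(pairs_j.values())
    return 2.0 * common / total if total else 0.0


def euclidean_distance(x, y):
    """Eq 4: Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize(matrix):
    """Rescale each column to mean zero and (sample) variance one."""
    n = len(matrix)
    ncols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(ncols)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in matrix) / (n - 1))
        for j in range(ncols)
    ]
    return [
        [(row[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0 for j in range(ncols)]
        for row in matrix
    ]


def knn_estimates(descriptors, values, k):
    """Estimate each compound's property as the mean over its k nearest
    neighbors; the probe itself is never a candidate neighbor."""
    estimates = []
    for i, probe in enumerate(descriptors):
        others = sorted(
            (euclidean_distance(probe, d), j)
            for j, d in enumerate(descriptors)
            if j != i
        )
        nearest = others[:k]
        estimates.append(sum(values[j] for _, j in nearest) / len(nearest))
    return estimates
```

Passing `standardize(descriptors)` instead of the raw descriptors corresponds to the scaled variants (TIs, PCs); passing the raw values corresponds to the unscaled ones (TIu, PCu). The correlation between `values` and the returned estimates, computed over a range of k, would give curves of the kind summarized in Figures 1 and 2.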

Table 4. Topological Indices Selected for Similarity Measures and the Correlation of the TI with the Cluster from which It Was Selected

hydrocarbons   aromatic amines

Cluster TI r TI r

1 P o 0.991 0.985 2 IC* 0.984 SIC4 0.984 3 0.922 0.926 5x1c 4Xpc 0.967 4 5X a 0.998 CIC, 5 0.986 P o 0.990

6 6X 0.987 SXCh 0.940 7 3x'c 0.986 IC 4 0.987 8 CIC, 0.956 7s 0.954 9 SIC3 0.951 SIC2 0.896 10 SIC, 0.916 O 0.944

The TIs selected for use in the TIs and TIu similarity methods are listed in Table 4. Shown in Table 4 are the TI selected and the correlation of the TI with the cluster from which it was selected. Each of these TIs was the variable most correlated with the cluster from which it was selected.

For the 95 amine-substituted aromatic and heteroaromatic chemicals, there were 10 significant clusters derived from the 90 TIs. These clusters explained 90.5% of the total variation within the original TIs. As for the hydrocarbons, Table 4 shows the TIs selected for use in the TIs and TIu similarity methods.

3.2. K-Neighbor Property Estimation. Figure 1 presents the correlation (r) and standard error of prediction (SE) for boiling points of the hydrocarbons for the K values examined (1-10, 15, 20, 25). Each line in the figure represents one of the five similarity methods. Table 5 shows the best boiling point model obtained for each similarity method. Shown in the table are the K value used for the similarity method, the correlation of the mean boiling point of the K neighbors with the observed boiling points, and the residual standard deviation (SE) seen with the predictive model.

As can be seen from Figure 1 and Table 5, best boiling point estimates for each method occur when K is equal to two or three. The range of K from two to approximately five appears to give the best boiling point estimates.

Figure 2 presents the correlation and SE of prediction for mutagenic potency, ln(R), where R is the number of revertants per nanomole. Table 6 shows the best potency model obtained for each similarity method. The best mutagenic rate estimates were for K in the range of two to five for the four ED based methods. For the AP method, the best estimates were for K from five to eight, with the best mutagenic rate estimates being at K equal to five.

For both data sets, using ED with scaled TIs gave superior results and the AP method slightly inferior results. These differences, however, were only slight.

Figure 1. Pattern of correlation (r) and standard error of the estimates (SE) according to K nearest neighbor selection for 139 hydrocarbon boiling points. (Panels: BP correlation and BP s.e.; similarity methods plotted: AP, TIS, TIU, CLS, CLU.)

Table 5. Comparison of the Five Similarity Methods for Prediction of Boiling Point for 139 Hydrocarbons

similarity method   K   r   SE
AP   2   0.987   25.9
TIs   3   0.993   19.2
TIu   2   0.988   24.2
PCs   2   0.987   25.8
PCu   3   0.989   23.6

Figure 2. Pattern of correlation (r) and standard error of the estimates (SE) according to K nearest neighbor selection for 95 aromatic amine mutagenic rates [ln(R)]. (Panels: ln(R) correlation and ln(R) s.e.; similarity methods plotted: AP, TIS, TIU, CLS, CLU.)

Table 6. Comparison of the Five Similarity Methods for Prediction of Mutagenic Potency, ln(R), for 95 Aromatic and Heteroaromatic Amines

similarity method   K   r   SE
AP   5   0.837   1.06
TIs   4   0.850   1.02
TIu   4   0.846   1.03
PCs   3   0.847   1.03
PCu   4   0.848   1.02

4. DISCUSSION

The major goals of this study were to investigate the effectiveness of similarity measures based on graph theoretic parameters in estimating properties and to explore the effect of increasing the number of analogs used for property estimation on the efficacies of the various methods. To this end, we estimated normal boiling points of a set of 139 hydrocarbons and mutagenic potencies for a group of 95 aromatic and heteroaromatic amines. Testing these methods on diverse properties gives us a better assessment of their general applicability. Furthermore, each of these data sets has published models upon which we could draw comparisons.

The results in Tables 5 and 6 show that for both data sets, all five similarity methods give reasonable estimates of the properties. In Basak et al.,25 a principal component regression (PCR) model derived from topological indices for this set of hydrocarbons had a correlation coefficient of 0.989 with a SE of 23.9 °C. The results seen here are comparable, with the correlation ranging from 0.987 to 0.993 and SE ranging from 19.2° to 25.9° (see Table 5). The somewhat high standard errors seen with these compounds can be attributed to the structural diversity between alkanes, alkyl benzenes, and polycyclic aromatic hydrocarbons.23

In the original study of the aromatic amines by Debnath et al.,18 four physicochemical parameters were used to model mutagenicity. The model had a correlation coefficient of 0.898 and a standard error of 0.86. In Table 6, the correlation ranges from 0.837 for the AP method to 0.850 for the TIs method. The standard errors for these methods were 1.06 down to 1.02, respectively.

The most interesting result was the effect of changing the number of analogs used in the estimation of properties. In all cases, a relatively small number of analogs (2-5) gives the best estimates of normal boiling point and mutagenicity. In a recent paper, Basak and Grunwald14 used the nearest neighbor (K = 1) to estimate molecular properties. The results of this paper indicate that higher numbers of neighbors represent the neighborhoods of property space better than the single nearest neighbor does.

The same methods of estimation using principles of similarity were used for two diverse properties, viz., boiling points of hydrocarbons and mutagenicity of aromatic amines. We were interested in seeing how each of the methods performs as a generalized method. Further studies with other properties and larger, diverse data sets are needed to assess the general applicability of the methods and to test the effect of increasing the neighborhood space on the accuracy of estimated properties.

ACKNOWLEDGMENT

The authors are delighted to dedicate this paper to Professor Alexandru T. Balaban in appreciation of his pioneering work in chemical graph theory. In addition, the authors are grateful to Sharon Bertelsen for technical support. This is contribution number 136 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported in this paper was supported, in part, by cooperative agreement CR 819621 from the United States Environmental Protection Agency and by Grant F49620-94-1-0401 from the United States Air Force.

REFERENCES AND NOTES

(1) Johnson, M. A.; Maggiora, G. M. Concepts and Applications of Molecular Similarity; Wiley: New York, 1990.
(2) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.
(3) Johnson, M. A.; Basak, S. C.; Maggiora, G. A Characterization of Molecular Similarity Methods for Property Prediction. Mathl. Comput. Modelling 1988, 11, 630-634.
(4) Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R. Determining Structural Similarity of Chemicals Using Graph-Theoretic Indices. Discrete Appl. Math. 1988, 19, 17-44.
(5) Basak, S. C.; Bertelsen, S.; Grunwald, G. D. Application of Graph Theoretical Parameters in Quantifying Molecular Similarity and Structure-Activity Studies. J. Chem. Inf. Comput. Sci. 1994, 34, 270-276.
(6) Basak, S. C.; Grunwald, G. D. Use of Topological Space and Property Space in Selecting Structural Analogs. Math. Modelling Sci. Comput., in press.
(7) Willet, P.; Winterman, V. A Comparison of Some Measures for the Determination of Intermolecular Structural Similarity: Measures of Intermolecular Structural Similarity. Quant. Struct.-Act. Relat. 1986, 5, 18-25.
(8) Fisanick, W.; Cross, K. P.; Rusinko, A., III. Similarity Searching on CAS Registry Substances 1. Molecular Property and Generic Atom Triangle Geometric Searching. J. Chem. Inf. Comput. Sci. 1992, 32, 664-674.
(9) Lajiness, M. S. In Computational Chemical Graph Theory; Rouvray, D. H., Ed.; Nova: New York, 1990; pp 299-316.

(10) Wilkins, C. L.; Randić, M. A Graph Theoretic Approach to Structure-Property and Structure-Activity Correlations. Theor. Chim. Acta (Berl.) 1980, 55, 45-68.
(11) Auer, C. M.; Nabholz, J. V.; Baetcke, K. P. Mode of Action and the Assessment of Chemical Hazards in the Presence of Limited Data: Use of Structure-Activity Relationships (SAR) Under TSCA, Section 5. Environ. Health Perspect. 1990, 87, 183-197.
(12) Basak, S. C.; Grunwald, G. D. Estimation of Lipophilicity from Structural Similarity. New J. Chem., in press.
(13) Austel, V. In Topics in Current Chemistry; Charton, M., Motoc, I., Eds.; Springer-Verlag: Berlin, 1983; Vol. 114, pp 7-19.
(14) Basak, S. C.; Grunwald, G. D. Molecular Similarity and Risk Assessment: Analog Selection and Property Estimation Using Graph Invariants. SAR QSAR Environmental Res., in press.
(15) Needham, D. E.; Wei, I. C.; Seybold, P. G. Molecular Modelling of the Physical Properties of Alkanes. J. Am. Chem. Soc. 1988, 110, 4186-4194.
(16) Mekenyan, O.; Bonchev, D.; Trinajstić, N. Chemical Graph Theory: Modelling the Thermodynamic Properties of Molecules. Int. J. Quantum Chem. 1980, 18, 369-380.
(17) Karcher, W. Spectral Atlas of Polycyclic Aromatic Hydrocarbons, Vol. 2; Kluwer Academic: Dordrecht/Boston/London, 1988; pp 16-19.
(18) Debnath, A. K.; Debnath, G.; Shusterman, A. J.; Hansch, C. A QSAR Investigation of the Role of Hydrophobicity in Regulating Mutagenicity in the Ames Test: 1. Mutagenicity of Aromatic and Heteroaromatic Amines in Salmonella typhimurium TA98 and TA100. Environ. Mol. Mutagen. 1992, 19, 37-52.
(19) Balaban, A. T. Highly Discriminating Distance-Based Topological Index. Chem. Phys. Lett. 1982, 89, 399-404.
(20) Balaban, A. T. Topological Indices Based on Topological Distances in Molecular Graphs. Pure Appl. Chem. 1983, 55, 199-206.
(21) Balaban, A. T. Chemical Graphs. Part 48. Topological Index J for Heteroatom-Containing Molecules Taking into Account Periodicities of Element Properties. MATCH 1986, 21, 115-122.
(22) Basak, S. C.; Harriss, D. K.; Magnuson, V. R. POLLY; Copyright of the University of Minnesota, 1988.
(23) Basak, S. C.; Grunwald, G. D. APPROBE; Copyright of the University of Minnesota, 1994.
(24) SAS Institute Inc. In SAS/STAT User's Guide, Release 6.03 Edition; SAS Institute Inc.: Cary, NC, 1988; Chapter 34, pp 949-965.
(25) Basak, S. C.; Niemi, G. J.; Veith, G. D. Predicting Properties of Molecules Using Graph Invariants. J. Math. Chem. 1991, 7, 243-272.

CI940103Z

XVI International Cancer Congress, New Delhi, India, 30 Oct. - 5 Nov. 1994

Predicting genotoxicity of chemicals using nonempirical parameters

S.C. BASAK and G.D. GRUNWALD
Natural Resources Research Institute, University of Minnesota, Duluth, MN (USA)

SUMMARY

We have attempted to predict mutagenicity of a diverse set of 463 chemicals using nonempirical and semiempirical molecular descriptors. The set of independent parameters used for model development consisted of: a) numerical graph invariants, and b) calculated semiempirical quantum chemical parameters. Graph invariants alone can correctly classify 75% of the mutagens. Inclusion of quantum chemical parameters, viz., dipole moment (μ), HOMO energy (E_HOMO), LUMO energy (E_LUMO), and heat of formation (HF), in the set of independent variables did not improve the results over graph invariants alone.

INTRODUCTION

A current interest in predictive toxicology is the estimation of genotoxic potential of chemicals from their calculated structural parameters [1-4]. The major reason underlying this trend is that for the majority of new and existing industrial chemicals, adequate empirical genotoxicity data necessary for hazard assessment are not available [5].

In this paper, we have used nonempirical and semiempirical structural parameters of molecules to predict their mutagenicity/nonmutagenicity. Two classes of calculated parameters have been used for our study. The first group, viz., numerical graph invariants, encodes information about molecular size, shape, symmetry, and branching. Such parameters have been widely used in the quantitative structure-activity relationships (QSARs) of various congeneric sets. On the other hand, quantum chemical parameters, which quantify chemical reactivity, have also been useful in predicting genotoxicity [4].

Copyright 1994 by Monduzzi Editore S.p.A. - Bologna (Italy)

In a recent paper, Basak et al [1] used graph invariants to discriminate mutagens from nonmutagens with 75% correct classification in a set of 520 chemicals. It was of interest to see whether inclusion of quantum chemical parameters could be useful in developing better models as compared to those derived from graph invariants alone. Therefore, in this paper, we have carried out a discriminant function analysis of a set of 463 mutagens/nonmutagens using 108 graph invariants and four quantum chemical parameters, viz., E_HOMO, E_LUMO, dipole moment (μ), and heat of formation (HF).

MATERIALS AND METHODS

A diverse set of 242 mutagens and 221 nonmutagens was used for model development. The data were taken from the CRC Handbook of Identified Carcinogens and Noncarcinogens [6] and consisted of those compounds that had a positive or negative response to Ames' mutagenicity test.

For each of these compounds, 96 topological indices were generated using POLLY [7]. The set of parameters included connectivity and information theoretic invariants. Also, a set of indicator variables showing the presence or absence of 12 substructures relevant to genotoxicity was determined. This set of indicator variables consisted of [8]: a) nitrosos, b) hydrazines, c) halogen substituted mustard, sulfur mustard or oxygen mustard, d) haloalkyl ethers, e) organic sulfates or sulfonates, f) β- or γ-lactones, g) azoxys, h) orthotoluidine derivatives, i) biphenyl amine, benzidine, or 4,4'-methylenedianiline derivatives, j) ethyl or acetylenic carbamates, k) epoxides and l) strained 3 or 4 membered rings. Finally, four quantum chemical parameters, E_LUMO, E_HOMO, μ, and heat of formation (HF), were generated using the AM1 procedure of MOPAC [9].

RESULTS AND DISCUSSION

The purpose of this study was to investigate whether the addition of semiempirical quantum chemical structural parameters to the set of graph theoretic predictors could improve discrimination between mutagens and nonmutagens. In an earlier paper, Basak et al [1] found that a combined set of 6 topological indices and four substructural indicators of genotoxicity could correctly classify 74.6% of mutagens and nonmutagens. In the present study, four calculated quantum chemical parameters were added to the set of independent variables with the idea that these parameters, being indicators of molecular reactivity, might be useful in predicting genotoxicity. We generated quantum chemical parameters for a group of 463 chemicals instead of the full set of 520 because of limitations in computational resources.

The results of linear discriminant analysis on this set using a combined group of topological indices, indicator variables and quantum chemical indices showed a marginal improvement in the classification of the chemicals. The best model was able to correctly classify 74.1% of the 463 chemicals. The final model consisted of five topological indices, four indicators of genotoxicity and one of the quantum chemical parameters. The topological indices included two information based indices: information content of the graph orbitals (I_ORB [10]) and structural information content at 0th order (SIC_0 [11]). Two connectivity indices [12] were included, 3rd order bond-corrected cluster connectivity (³χ^b_C) and 3rd order valence-corrected chain or cycle connectivity (³χ^v_Ch). The final topological index was number of paths of length 10 (P_10). The four substructural indicators were: a) nitroso compounds, b) halogen substituted mustard, sulfur mustard or oxygen mustard, c) sulfate or sulfonate, and d) biphenyl amine, benzidine or 4,4'-methylenedianiline derivatives. The quantum chemical parameter used was HF.

The best model on the set of 463 chemicals using only the topological variables could correctly classify 73% of the chemicals. This 10-variable model consisted of the same four indicator variables as above and the same five topological indices as above, viz., I_ORB, SIC_0, ³χ^b_C, ³χ^v_Ch, and P_10. The final parameter was IC_6 [11], information content of the graph at 6th order.

It is of interest to note that the set of ten topological variables used for the set of 463 chemicals was identical with the variable set used for the classification of 520 chemicals. The results indicate that topological variables encode sufficient structural information to classify about 75% of mutagens and nonmutagens.

REFERENCES

1. Basak, S.C., Bertelsen, S., and Grunwald, G.D. (1994) Use of Graph Theoretic Parameters in Risk Assessment of Chemicals. Toxicology Letters, in press.

2. Rosenkranz, H.S., Klopman, G., Oshima, H. and Bartsch, H. (1990) Structural basis of genotoxicity of nitrophenols and derivatives present in smoked products. Mut. Res. 230, 9-27.

3. Debnath, A.K., Debnath, G., Shusterman, A.J. and Hansch, C. (1992) A QSAR investigation of the role of hydrophobicity in regulating mutagenicity in the Ames test: mutagenicity of aromatic and heterocyclic amines in Salmonella typhimurium TA98 and TA100. Environ. Mol. Mutagen. 19, 37-52.

4. Gombar, V.K. and Enslein, K. (1990) Quantitative structure-activity relationship (QSAR) studies using electronic descriptors calculated from topological and molecular orbital (MO) methods. Quant. Struct.-Act. Relat. 9, 321-325.

5. Auer, C.M., Nabholz, J.V. and Baetcke, K.P. (1990) Mode of action and the assessment of chemical hazards in the presence of limited data: use of structure-activity relationships (SAR) under TSCA, Section 5. Environ. Health Perspect. 87, 183-197.

6. Soderman, J.V. (Ed.) (1982) CRC Handbook of Identified Carcinogens and Noncarcinogens: Carcinogenicity-Mutagenicity Database, Volume I. CRC Press, Inc., Boca Raton, FL.

7. Basak, S.C., Harriss, D.K. and Magnuson, V.R. (1988) POLLY. Copyright of the University of Minnesota.

8. Basak, S.C. (1987) Genetox: A system for the prediction of genotoxicity of chemicals from structure. Copyright of the University of Minnesota.

9. MOPAC 5.0 (QCPE #455), Quantum Chemistry Program Exchange, Creative Arts Building 181, Indiana University, Bloomington, Indiana, 47405, USA.

10. Rashevsky, N. (1955) Life, information theory, and topology. Bull. Math. Biophys. 17, 229-235.

11. Roy, A.B., Basak, S.C., Harriss, D.K., and Magnuson, V.R. (1984) Neighborhood complexities and symmetry of chemical graphs and their biological applications. In: X.J.R. Avula, R.E. Kalman, A.I. Lipais and E.Y. Rodin (Eds.), Mathematical Modelling in Sciences and Technology. Pergamon Press, pp. 745-750.

12. Kier, L.B. and Hall, L.H. (1986) Molecular Connectivity in Structure-Activity Analysis. Research Studies Press, Letchworth, Hertfordshire, U.K.

ACKNOWLEDGEMENTS

This paper is contribution number 137 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported in this paper was supported, in part, by grant F49620-94-1-0401 from the United States Air Force, Exxon Biomedical Sciences, Inc. and the Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute at the University of Minnesota.

CHEMOSPHERE, IN PRESS, 1995

PREDICTING MUTAGENICITY OF CHEMICALS USING TOPOLOGICAL AND QUANTUM CHEMICAL PARAMETERS: A SIMILARITY BASED STUDY

Subhash C. Basak* and Gregory D. Grunwald

Natural Resources Research Institute, University of Minnesota 5013 Miller Trunk Highway, Duluth, MN 55811

ABSTRACT

Five molecular similarity methods have been used to estimate mutagenicity of a set of 73 aromatic and heteroaromatic amines. Two of the similarity methods (AP, PCTI) are based on topological parameters. Two other methods (PCPROP, PROP) are derived from physicochemical and electronic parameters. The fifth method, PCALL, is based on a combination of both topological and physicochemical parameters. The effectiveness of the five similarity methods in the rapid evaluation of mutagenicity is discussed.

*To whom correspondence should be sent.

INTRODUCTION

An important area of research in contemporary toxicology is the prediction of genotoxicity of drugs, industrial chemicals, pesticides and natural products [1,2]. Short-term tests (STTs) have been developed to assess the potential carcinogenic hazard of chemicals to humans. These STTs include mutagenesis in Salmonella and mouse lymphoma cells, and chromosomal aberration and sister chromatid exchanges in Chinese hamster ovary cells. The role of these STTs for detecting DNA damage has increased in view of findings that many rodent carcinogens are mutagenic in STTs carried out in vitro [2,3].

Nearly 13 million distinct chemical entities have been registered in the Chemical Abstracts Service (CAS) Registry and the list is growing by more than 500,000 per year. Of these half million chemicals, about 1,000 end up in societal use [4]. In the USA, the Toxic Substances Control Act (TSCA) Inventory has nearly 70,000 entries and the list is growing at the rate of more than 2,000 per year from new submissions to the United States Environmental Protection Agency (USEPA) for the Premanufacture Notification (PMN) process [5]. At present, risk assessment of the PMN chemicals is carried out with very limited test data. For example, approximately 15% of PMN submissions have empirical mutagenicity data [5]. Therefore, assessing mutagenic potential of chemicals directly from structure is crucial for risk assessment of chemicals.

Under these circumstances, USEPA has used two approaches to assess environmental risk: a) quantitative structure activity relationships (QSARs) and b) reasoning based on structural and functional analogy of the PMN chemical with existing chemicals which have test data [5]. While the former class of methods is based on models which are capable of predicting through extrapolation, the latter method is a two stage approach, consisting of the selection of analog(s) and use of the test data on analogs to estimate the hazard posed by the chemical with no or very little experimental data. These methods are based on the perception of structural similarity among chemicals.

Recently, a number of methods have been developed for quantifying molecular similarity of chemicals [6-16]. Similarity methods provide ordered sets of molecules (analogs) structurally related to the probe chemical of our interest. Properties of the selected neighbors (analogs) of compounds can then be used to estimate properties of the probe. Similarity of chemicals has been quantified using experimental properties and nonempirical properties or parameters.

The scarcity of available experimental data on environmental chemicals dictates that quantitative molecular similarity analysis (QMSA) techniques which utilize nonempirical parameters would be most useful for hazard assessment of chemicals.

Graph theoretical parameters like topological indices and substructures have been used in the quantification of molecular similarity. In recent studies, Basak and Grunwald [8-12] used properties of the closest neighbor or K neighbors (K = 1-25) of a chemical to estimate its properties. The similarity methods used in these studies were based on numerical graph invariants and atom pairs. Such methods have been used in estimating hydrophobicity [10], mutagenicity [9,12,17] and boiling points [9,11,12] of different groups of chemicals. In this paper, we have carried out a comparative study of five QMSA techniques in predicting mutagenic potency of 73 aromatic and heteroaromatic amines.

MATERIALS AND METHODS

The compounds used in this study consisted of 73 aromatic and heteroaromatic amines for which mutagenicity, log P, and the electronic descriptors eLUMO and eHOMO were available from the literature [1]. A list of these chemicals and their physico-chemical properties is presented in Table 1. The mutagenic activity of the aromatic and heteroaromatic amino compounds in S. typhimurium TA100 + S9 microsomal preparation is expressed as the log of revertants per nanomole. Log P values were experimental values when available; otherwise log P values were calculated using CLOGP, release 3.54 [18]. The electronic descriptors, eLUMO and eHOMO, were calculated using Dewar's [19] semi-empirical AM1 method of MOPAC 4.10 (Quantum Chemistry Program Exchange No. 455).

Table 1. Mutagenicity (S. typhimurium TA100 with metabolic activation) and physico-chemical parameters for 73 aromatic and heteroaromatic amines

| No. | Compound | log Rev/nmol | log P | eLUMO | eHOMO |
|---|---|---|---|---|---|
| 1 | 2,3-dimethylaniline | -1.36 | 1.91 | 0.581 | -8.403 |
| 2 | 2,5-dimethylaniline | -1.43 | 1.83a | 0.581 | -8.404 |
| 3 | 4-chloro-1,2-phenylenediamine | -1.44 | 1.28a | 0.241 | -8.325 |
| 4 | 4-aminophenylsulfide | 0.48 | 2.18a | 0.047 | -7.528 |
| 5 | 4-aminopyrene | 2.69 | 3.72 | -0.850 | -7.837 |
| 6 | 2-amino-4-methylphenol | -1.68 | 1.16a | 0.478 | -8.274 |
| 7 | 1-aminofluoranthene | 2.34 | 3.72 | -0.816 | -8.159 |
| 8 | 2-aminofluorene | 0.78 | 3.14a | -0.108 | -8.106 |
| 9 | Benzidine | -0.66 | 1.34a | 0.147 | -7.931 |
| 10 | 4-methyl-2-bromoaniline | -0.64 | 2.58 | 0.228 | -8.498 |
| 11 | 8-aminoquinoline | -0.34 | 1.79a | -0.294 | -8.134 |
| 12 | 3,4-dimethylaniline | -1.08 | 1.91 | 0.602 | -8.318 |
| 13 | 3-aminofluorene | 0.10 | 2.70 | -0.182 | -8.333 |
| 14 | 4-methyl-2-chloroaniline | -0.40 | 2.43 | 0.293 | -8.477 |
| 15 | 4-aminofluorene | 0.64 | 2.70 | -0.157 | -8.256 |
| 16 | 4-chloroaniline | -1.51 | 1.88a | 0.284 | -8.568 |
| 17 | 8-aminofluoranthene | 1.98 | 3.72 | -0.853 | -8.166 |
| 18 | 2-ethyl-4-chloroaniline | 0.08 | 2.96 | 0.305 | -8.502 |
| 19 | 2-aminopyrene | 2.58 | 3.72 | -0.865 | -8.144 |
| 20 | 2-aminonaphthalene | 0.39 | 2.28a | -0.198 | -8.227 |
| 21 | 4-cyclohexylaniline | -0.14 | 3.65a | 0.620 | -8.380 |
| 22 | 6-aminoquinoline | -1.22 | 1.28a | -0.383 | -8.443 |
| 23 | 2-amino-1-methylnaphthalene | 0.84 | 2.59 | -0.203 | -8.142 |
| 24 | 4-amino-3-methylbiphenyl | 1.12 | 3.30 | 0.069 | -8.208 |
| 25 | 4,4'-ethylenebis(aniline) | -1.51 | 2.13 | 0.589 | -8.318 |
| 26 | 2-methoxy-5-methylaniline | -1.85 | 1.74a | 0.419 | -8.449 |
| 27 | 2,4,5-trimethylaniline | -0.26 | 2.41 | 0.581 | -8.224 |
| 28 | 2,4-diamino-n-butylbenzene | -0.84 | 1.77 | 0.702 | -8.130 |
| 29 | 7-aminofluoranthene | 2.76 | 3.72 | -0.895 | -8.186 |
| 30 | 1-aminocarbazole | -0.25 | 2.30 | -0.115 | -8.035 |
| 31 | 1-aminophenanthrene | 1.79 | 3.26 | -0.357 | -8.132 |
| 32 | 3-amino-4-methylbiphenyl | 0.09 | 3.30 | -0.007 | -8.398 |
| 33 | 3-aminocarbazole | -0.11 | 2.30 | -0.107 | -7.877 |
| 34 | 4-methoxy-2-methylaniline | -2.10 | 1.23 | 0.474 | -8.152 |
| 35 | 2-aminobiphenyl | -0.51 | 2.84a | 0.107 | -8.407 |
| 36 | 3-aminofluoranthene | 2.25 | 4.20a | -0.818 | -8.033 |
| 37 | 4-aminobiphenyl | 0.85 | 2.86a | 0.048 | -8.263 |
| 38 | 3,3'-dichlorobenzidine | 0.66 | 3.51a | -0.142 | -8.125 |
| 39 | 2,6-dichloro-1,4-phenylenediamine | -1.12 | 1.79 | -0.029 | -8.180 |
| 40 | 3,3'-dimethoxybenzidine | -0.85 | 1.81a | -0.005 | -7.927 |
| 41 | 4-aminophenyldisulfide | 0.54 | 1.99 | -1.334 | -8.209 |
| 42 | 2-aminocarbazole | -0.56 | 2.30 | 0.025 | -8.023 |
| 43 | 4-aminocarbazole | -0.47 | 2.30 | -0.029 | -7.993 |
| 44 | 1-aminofluorene | -0.04 | 3.18a | -0.191 | -8.467 |
| 45 | 2-aminoanthracene | 2.76 | 3.26 | -0.771 | -7.869 |
| 46 | 2-amino-3-methylnaphthalene | 1.09 | 2.59 | -0.199 | -8.214 |
| 47 | 2-aminofluoranthene | 2.87 | 3.72 | -0.887 | -8.273 |
| 48 | 3-aminoquinoline | 0.07 | 1.63a | -0.376 | -8.534 |
| 49 | 4,4'-methylenebis(o-fluoraniline) | -1.16 | 2.50 | 0.186 | -8.467 |
| 50 | 3-methoxy-4-methylaniline | -0.81 | 1.52 | 0.583 | -8.274 |
| 51 | 2-chloroaniline | -2.05 | 1.90a | 0.268 | -8.632 |
| 52 | 4-phenoxyaniline | 0.63 | 2.96a | 0.255 | -8.598 |
| 53 | 2-amino-4-chlorophenol | -2.00 | 1.81 | 0.154 | -8.508 |
| 54 | 1-amino-2-methylnaphthalene | -0.37 | 2.59 | -0.176 | -8.034 |
| 55 | 6-aminochrysene | 2.41 | 4.98 | -0.592 | -7.926 |
| 56 | 2-methyl-4-bromoaniline | 0.46 | 2.58 | 0.226 | -8.562 |
| 57 | 4-aminophenanthrene | -0.11 | 3.26 | -0.411 | -8.524 |
| 58 | 4,4'-methylenebis(o-ethylaniline) | -0.55 | 3.66 | 0.605 | -8.181 |
| 59 | 4-aminophenylether | -0.27 | 1.36a | 0.317 | -8.167 |
| 60 | 4-ethoxyaniline | -0.61 | 1.24a | 0.513 | -8.182 |
| 61 | 5-aminoquinoline | 0.14 | 1.16a | -0.395 | -8.395 |
| 62 | 2-methyl-4-chloroaniline | 0.38 | 2.43 | 0.289 | -8.508 |
| 63 | 1-aminonaphthalene | -1.00 | 2.25a | -0.195 | -8.108 |
| 64 | 2,4-dimethylaniline | -0.23 | 1.68a | 0.605 | -8.288 |
| 65 | 2,4-difluoroaniline | -2.52 | 1.54 | -0.085 | -8.695 |
| 66 | 4,4'-methylenedianiline | -0.15 | 1.59a | 0.576 | -8.274 |
| 67 | 9-aminophenanthrene | 2.79 | 3.56a | -0.364 | -8.099 |
| 68 | 3,4'-diaminobiphenyl | 0.65 | 1.58 | 0.103 | -8.187 |
| 69 | 3-aminophenanthrene | 2.66 | 3.26 | -0.203 | -8.122 |
| 70 | 2-aminophenanthrene | 2.74 | 3.26 | -0.365 | -8.233 |
| 71 | 1-aminoanthracene | 0.36 | 3.69a | -0.804 | -7.820 |
| 72 | 1-aminopyrene | 1.05 | 4.31a | -0.700 | -7.580 |
| 73 | 9-aminoanthracene | -0.24 | 3.26 | -0.712 | -7.600 |

a Experimental log P.

Calculation of graph-theoretic parameters

Two types of graph invariants have been used: a) numerical graph invariants or topological indices, and b) substructural invariants called atom pairs. These invariants are based on both simple and weighted molecular graphs. Table 2 provides a list of symbols for the topological indices. Definitions and computations of these invariants are described below.

The Wiener index W [20], the first topological index reported in the chemical literature, may be calculated from the distance matrix D(G) of a hydrogen-suppressed chemical graph G as the sum of the entries in the upper triangular distance submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a symmetric n × n matrix $(d_{ij})$, where $d_{ij}$ is equal to the distance between vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We give below the distance matrix $D(G_1)$ of the labelled hydrogen-suppressed graph $G_1$ of isobutane (Fig. 1):

$$
D(G_1) =
\begin{array}{c|cccc}
  & (1) & (2) & (3) & (4) \\ \hline
1 & 0 & 1 & 2 & 2 \\
2 & 1 & 0 & 1 & 1 \\
3 & 2 & 1 & 0 & 2 \\
4 & 2 & 1 & 2 & 0
\end{array}
$$

W is calculated as:

$$W = \tfrac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \qquad (1)$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h.
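Both forms of Eq. 1 can be checked on the isobutane matrix above. The following is a minimal sketch, not the POLLY implementation used in the study; the function names `wiener_index` and `distance_counts` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# Distance matrix D(G1) of hydrogen-suppressed isobutane, as given in the text.
D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]


def wiener_index(dist):
    """Sum of the entries above the diagonal of a symmetric distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


def distance_counts(dist):
    """g_h: number of unordered vertex pairs found at each distance h."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))


W = wiener_index(D)
g = distance_counts(D)
# Second form of Eq. 1: sum over h of h * g_h.
W_from_counts = sum(h * count for h, count in g.items())

print(W, dict(g), W_from_counts)
```

For isobutane there are three vertex pairs at distance 1 and three at distance 2, so both forms give W = 9.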

Randic's connectivity index [21], and higher-order connectivity path, cluster, path-cluster and chain types of simple, bond and valence connectivity parameters were calculated using the method of Kier and Hall [22]. $P_h$ parameters, the numbers of paths of length h (h = 0, 1, ..., 10) in the hydrogen-suppressed graph, are calculated using standard algorithms.

Table 2. The 90 topological indices used for the analysis of amines.

| Symbol | Definition |
|---|---|
| $I_D^W$ | Information index for the magnitudes of distances between all possible pairs of vertices of a graph |
| $\bar{I}_D^W$ | Mean information index for the magnitude of distance |
| $W$ | Wiener number = half-sum of the off-diagonal elements of the distance matrix of a graph |
| $I^D$ | Degree complexity |
| $H^V$ | Graph vertex complexity |
| $H^D$ | Graph distance complexity |
| $\overline{IC}$ | Information content of the distance matrix partitioned by frequency of occurrences of distance h |
| $O$ | Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph |
| $I_{ORB}$ | Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices |
| $M_1$ | A Zagreb group parameter = sum of square of degree over all vertices |
| $M_2$ | A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices |
| $IC_r$ | Mean information content or complexity of a graph based on the r-th (r = 0-5) order neighborhood of vertices in the hydrogen-filled graph |
| $SIC_r$ | Structural information content for r-th (r = 0-5) order neighborhood of vertices in the hydrogen-filled graph |
| $CIC_r$ | Complementary information content for r-th (r = 0-5) order neighborhood of vertices in the hydrogen-filled graph |
| ${}^h\chi$ | Path connectivity index of order h = 0-6 |
| ${}^h\chi_C$ | Cluster connectivity index of order h = 3 and 5 |
| ${}^6\chi_{Ch}$ | Chain connectivity index of order six |
| ${}^h\chi_{PC}$ | Path-cluster connectivity index of order h = 4-6 |
| ${}^h\chi^b$ | Bonding path connectivity index of order h = 0-6 |
| ${}^h\chi^b_C$ | Bonding cluster connectivity index of order h = 3 and 5 |
| ${}^6\chi^b_{Ch}$ | Bonding chain connectivity index of order six |
| ${}^h\chi^b_{PC}$ | Bonding path-cluster connectivity index of order h = 4-6 |
| ${}^h\chi^v$ | Valence path connectivity index of order h = 0-6 |
| ${}^h\chi^v_C$ | Valence cluster connectivity index of order h = 3 and 5 |
| ${}^6\chi^v_{Ch}$ | Valence chain connectivity index of order six |
| ${}^h\chi^v_{PC}$ | Valence path-cluster connectivity index of order h = 4-6 |
| $P_h$ | Number of paths of length h = 0-10 |
| $J$ | Balaban's J index |
| $J^B$ | Balaban's bond-weighted J index |
| $J^X$ | Balaban's electronegativity-weighted J index |
| $J^Y$ | Balaban's covalent radii-weighted J index |

Figure 1. Graph $G_1$, the hydrogen-suppressed labelled molecular graph of isobutane (vertices labelled 1-4).

Balaban [23,24,25] defined a series of indices, which he designated as J indices, based upon distance sums within the distance matrix of a chemical graph. These indices are highly discriminating, with low degeneracy. Unlike W, the range of values of the J indices is independent of molecular size.

Information-theoretic topological indices are calculated by the application of information theory on chemical graphs. An appropriate set A of n elements is derived from a molecular graph G depending upon certain structural characteristics. On the basis of an equivalence relation defined on A, the set A is partitioned into disjoint subsets $A_i$ of order $n_i$ ($i = 1, 2, \ldots, h$; $\sum_i n_i = n$). A probability distribution is then assigned to the set of equivalence classes:

$$p_1, p_2, \ldots, p_h$$

where $p_i = n_i/n$ is the probability that a randomly selected element of A will occur in the i-th subset.

The mean information content of an element of A is defined by Shannon's relation [26]:

$$IC = -\sum_{i=1}^{h} p_i \log_2 p_i \qquad (2)$$

The logarithm is taken at base 2 for measuring the information content in bits. The total information content of the set A is then n times IC. It is to be noted that information content of a graph G is not uniquely defined. It depends on the set A derived from G as well as on the equivalence relation which partitions A into disjoint subsets $A_i$. For example, when A constitutes the vertex set of a chemical graph G, two methods of partitioning have been widely used: a) chromatic-number coloring of G where two vertices of the same color are considered equivalent, and b) determination of the orbits of the automorphism group of G whereafter vertices belonging to the same orbit are considered equivalent.
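The dependence of IC on the chosen partition can be made concrete with a short sketch of Eq. 2. The function name `information_content` is hypothetical, and the three partitions below are illustrative assumptions for the four-vertex isobutane graph of Fig. 1 (its automorphism orbits are taken to be the central vertex 2 and the three equivalent terminal vertices 1, 3 and 4).

```python
import math


def information_content(class_sizes):
    """Mean information content in bits (Eq. 2) for a partition given by class sizes."""
    n = sum(class_sizes)
    return -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0)


# Orbit partition of hydrogen-suppressed isobutane: {2} and {1, 3, 4}.
ic_orbits = information_content([1, 3])

# Two limiting partitions of the same vertex set, for comparison.
ic_all_distinct = information_content([1, 1, 1, 1])  # every vertex in its own class
ic_single_class = information_content([4])           # all vertices equivalent

print(round(ic_orbits, 4), ic_all_distinct, abs(ic_single_class))
```

The same vertex set gives about 0.81 bits under the orbit partition, log2 4 = 2 bits when every vertex is its own class, and 0 bits when all vertices are treated as equivalent.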

Rashevsky [27] was the first to calculate information content of graphs where "topologically equivalent" vertices were placed in the same equivalence class. In Rashevsky's approach, two vertices u and v of a graph are said to be topologically equivalent if and only if for each neighboring vertex $u_i$ (i = 1, 2, ..., k) of the vertex u, there is a distinct neighboring vertex $v_i$ of the same degree for the vertex v. While Rashevsky used simple linear graphs with indistinguishable vertices to symbolize molecular structure, weighted linear graphs or multigraphs are better models for conjugated or aromatic molecules because they more properly reflect the actual bonding patterns, i.e., electron distribution.

To account for the chemical nature of vertices as well as their bonding pattern, Sarkar et al. [28] calculated information content of chemical graphs on the basis of an equivalence relation where two atoms of the same element are considered equivalent if they possess an identical first-order topological neighborhood. Since properties of atoms or reaction centers are often modulated by stereo-electronic characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed essential to extend this approach to account for higher-order neighbors of vertices. This can be accomplished by defining open spheres for all vertices of a chemical graph. If r is any non-negative real number and v is a vertex of the graph G, then the open sphere S(v,r) is defined as the set consisting of all vertices $v_j$ in G such that $d(v, v_j) < r$. Therefore, $S(v,0) = \emptyset$, $S(v,r) = \{v\}$ for 0 < r < 1, and S(v,r) is the set consisting of v and all vertices $v_j$ of G situated at unit distance from v, if 1 < r < 2.

One can construct such open spheres for higher integral values of r. For a particular value of r, the collection of all such open spheres S(v,r), where v runs over the whole vertex set V, forms a neighborhood system of the vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets consisting of vertices which are topologically equivalent for r-th order neighborhood. Such an approach has been developed and the information-theoretic indices calculated are called indices of neighborhood symmetry [29].

In this method, chemicals are symbolized by weighted linear graphs. Two vertices $u_0$ and $v_0$ of a molecular graph are said to be equivalent with respect to r-th order neighborhood if and only if corresponding to each path $u_0, u_1, \ldots, u_r$ of length r, there is a distinct path $v_0, v_1, \ldots, v_r$ of the same length such that the paths have similar edge weights, and both $u_0$ and $v_0$ are connected to the same number and type of atoms up to the r-th order bonded neighbors. The detailed equivalence relation is described in earlier studies [29,30].
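For the first-order case the partitioning can be sketched in a few lines. This is a simplified illustration rather than the published algorithm: it treats all bonds as single (true for n-propanol), so edge weights are ignored, and it classifies each atom only by its own element and the multiset of elements bonded to it. The atom labels follow Fig. 2; the data layout and the name `first_order_key` are assumptions made for this example.

```python
from collections import defaultdict

# Hydrogen-filled graph of n-propanol, H1-O1-C1-C2-C3, labelled as in Fig. 2.
atoms = {
    "H1": "H", "O1": "O", "C1": "C", "C2": "C", "C3": "C",
    "H2": "H", "H3": "H", "H4": "H", "H5": "H",
    "H6": "H", "H7": "H", "H8": "H",
}
bonds = [
    ("H1", "O1"), ("O1", "C1"), ("C1", "C2"), ("C2", "C3"),
    ("C1", "H2"), ("C1", "H3"), ("C2", "H4"), ("C2", "H5"),
    ("C3", "H6"), ("C3", "H7"), ("C3", "H8"),
]

adj = defaultdict(list)
for a, b in bonds:
    adj[a].append(b)
    adj[b].append(a)


def first_order_key(atom):
    """Element of the atom plus the sorted elements of its bonded neighbors."""
    return (atoms[atom], tuple(sorted(atoms[nb] for nb in adj[atom])))


# Group atoms that share the same first-order neighborhood.
classes = defaultdict(list)
for atom in atoms:
    classes[first_order_key(atom)].append(atom)

class_sizes = sorted(len(members) for members in classes.values())
for key, members in classes.items():
    print(key, members)
```

Under these assumptions the twelve atoms fall into six classes, with the seven carbon-bound hydrogens H2-H8 sharing one class, which is the partition shown in Fig. 2.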

Once partitioning of the vertex set for a particular order of neighborhood is completed, ICr is calculated by Eq. 2. Subsequently, Basak et al. [31] defined another information-theoretic measure, structural information content (SICr), which is calculated as:

$$SIC_r = IC_r / \log_2 n \qquad (3)$$

where $IC_r$ is calculated from Eq. 2 and n is the total number of vertices of the graph.

Another information-theoretic invariant, complementary information content (CICr), is defined as [32]:

$$CIC_r = \log_2 n - IC_r \qquad (4)$$
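Equations 2-4 can be evaluated together once the class sizes of a partition are known. The sketch below uses the first-order class sizes of n-propanol from Fig. 2 (one class of seven atoms and five singleton classes); the function name `ic_sic_cic` is hypothetical.

```python
import math


def ic_sic_cic(class_sizes):
    """Return (IC, SIC, CIC) in the sense of Eqs. 2-4 for the given class sizes."""
    n = sum(class_sizes)
    ic = -sum((k / n) * math.log2(k / n) for k in class_sizes)
    sic = ic / math.log2(n)   # Eq. 3
    cic = math.log2(n) - ic   # Eq. 4
    return ic, sic, cic


# First-order partition of the 12 atoms of n-propanol (Fig. 2).
ic1, sic1, cic1 = ic_sic_cic([1, 7, 1, 1, 1, 1])
print(round(ic1, 3), round(sic1, 3), round(cic1, 3))
```

Evaluated in floating point, this gives values that agree with those printed in Fig. 2 (1.950, 0.544 and 1.635) to within about 0.003.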

$CIC_r$ represents the difference between maximum possible complexity of a graph (where each vertex belongs to a separate equivalence class) and the realized topological information of a chemical species as defined by $IC_r$. In Fig. 2, the calculation of $IC_1$, $SIC_1$ and $CIC_1$ is demonstrated for the hydrogen-filled graph ($G_2$) of n-propanol.

The information-theoretic index on graph distance, $I_D^W$, is calculated from the distance matrix D(G) of a chemical graph G as follows [33]:

$$I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \qquad (5)$$

The mean information index, $\bar{I}_D^W$, is found by dividing the information index by W. The information-theoretic parameters defined on the distance matrix, $H^D$ and $H^V$, were calculated by the method of Raychaudhury et al. [34]. All of the topological indices were calculated by the program POLLY [35].
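As a quick check of Eq. 5, the isobutane graph of Fig. 1 can be worked through by hand. This is an illustrative calculation added for clarity, not a value reported in the study; it uses W = 9, $g_1 = 3$ and $g_2 = 3$, read from the distance matrix of isobutane given earlier.

```latex
\begin{align*}
I_D^W &= W \log_2 W - \sum_h g_h \, h \log_2 h \\
      &= 9 \log_2 9 - \left( 3 \cdot 1 \cdot \log_2 1 + 3 \cdot 2 \cdot \log_2 2 \right) \\
      &\approx 28.529 - 6 = 22.529 \ \text{bits}, \\
\bar{I}_D^W &= I_D^W / W \approx 22.529 / 9 \approx 2.503 \ \text{bits}.
\end{align*}
```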

Atom pairs were calculated using the method of Carhart et al. [13]. An atom pair is defined as a substructure consisting of two non-hydrogen atoms i and j and their interatomic separation:

<atom descriptor i> - <separation> - <atom descriptor j>

In Fig. 3, Xn (n = 1 or 3 in this example) represents the number of non-hydrogen neighbors, a dot (·) represents one π-electron, and C and O are atomic symbols. These are the elements of the atom descriptors. The "-k-", k = 2 and 3, are the separation values.

[Figure 2 shows the labelled hydrogen-filled graph $G_2$ of n-propanol, with atoms H1, O1, C1, C2, C3 and H2-H8, together with the first-order neighborhood of each atom.]

| Subset | I | II | III | IV | V | VI |
|---|---|---|---|---|---|---|
| Atoms | (H1) | (H2-H8) | (O1) | (C1) | (C2) | (C3) |
| Probability | 1/12 | 7/12 | 1/12 | 1/12 | 1/12 | 1/12 |

$IC_1 = \tfrac{5}{12} \log_2 12 + \tfrac{7}{12} \log_2 \tfrac{12}{7} = 1.950$ bits

$SIC_1 = IC_1 / \log_2 12 = 0.544$ bits

$CIC_1 = \log_2 12 - IC_1 = 1.635$ bits

Figure 2. Example first-order partitioning of n-propanol into neighborhood partitions and the calculation of $IC_1$, $SIC_1$ and $CIC_1$.

[Figure 3 shows acetone, CH3-C(=O)-CH3, with atoms labelled (a) methyl carbon, (b) carbonyl carbon, (c) methyl carbon and (d) oxygen.]

Figure 3. Determination of atom pairs and their frequency for acetone.

The first atom pair (CX1 - 2 - C·X3) corresponds to path ab, a methyl carbon with 1 non-hydrogen neighbor bonded to a carbonyl carbon with 3 non-hydrogen neighbors. The path length of ab is 2. Path cb is represented by the first atom pair as well, hence this atom pair has a frequency of 2. Path bd, involving the carbon and oxygen of the carbonyl group, defines atom pair 2. Paths abd and cbd define atom pair 3. Atom pair 4 represents path abc. The atom pairs were determined using APPROBE [36].
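The acetone example can be reproduced with a small sketch. This is not APPROBE; it is a hypothetical illustration in which the π-electron count of each atom is supplied by hand, descriptors are written as plain strings with "." standing in for the π-electron dot, and separation is counted as the number of atoms on the shortest path (bonds + 1), matching the statement that path ab has length 2.

```python
from collections import Counter, deque
from itertools import combinations

# Acetone heavy atoms labelled as in Fig. 3: (element, number of pi electrons).
# The pi-electron counts are assumed inputs, not derived by this code.
atoms = {"a": ("C", 0), "b": ("C", 1), "c": ("C", 0), "d": ("O", 1)}
bonds = [("a", "b"), ("b", "c"), ("b", "d")]

adj = {name: [] for name in atoms}
for u, v in bonds:
    adj[u].append(v)
    adj[v].append(u)


def descriptor(atom):
    """Element, one '.' per pi electron, then Xn with n non-hydrogen neighbors."""
    element, n_pi = atoms[atom]
    return f"{element}{'.' * n_pi}X{len(adj[atom])}"


def bond_distance(src, dst):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    raise ValueError("atoms are not connected")


def atom_pairs():
    """Count atom pairs over all unordered pairs of non-hydrogen atoms."""
    pairs = Counter()
    for u, v in combinations(atoms, 2):
        first, second = sorted((descriptor(u), descriptor(v)))
        separation = bond_distance(u, v) + 1  # atoms on the path, not bonds
        pairs[(first, separation, second)] += 1
    return pairs


acetone_pairs = atom_pairs()
for pair, frequency in acetone_pairs.items():
    print(pair, frequency)
```

Four heavy atoms give six unordered pairs, which collapse to the four distinct atom pairs described in the text, with frequencies 2, 1, 2 and 1.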

Data reduction

Principal component analysis (PCA) was used to reduce the dimensionality of the set of 90 topological indices (TIs). With PCA, linear combinations of the TIs, called principal components (PCs), are derived from the correlation matrix. The first PC has the largest variance, or eigenvalue, of the linear combination of TIs. Each subsequent PC explains maximal index variance orthogonal to previous PCs. With 90 TIs available, 90 PCs can be generated. For this study, PCs with an eigenvalue greater than one were retained. For the PCA, all TIs were transformed by the natural log of the TI plus one. Basak et al. [6] provide more detail on this approach.

PCA was repeated two more times, once with the three physico-chemical properties, viz., log P, eLUMO and eHOMO, and once with the combined set of TIs and physico-chemical properties. For all three PCAs, the PCs were standardized to have unit variance.
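The data-reduction steps described above (log(x + 1) transform, PCA on the correlation matrix, retention of PCs with eigenvalue greater than one, and rescaling of the retained PCs to unit variance) can be sketched with NumPy. The matrix `X` below is synthetic random data standing in for the index table, and the function name `pca_eigenvalue_gt_one` is hypothetical; the study's own software is not specified beyond what the text states.

```python
import numpy as np

# Synthetic stand-in for a compounds x indices table (73 rows, 10 non-negative columns).
rng = np.random.default_rng(0)
latent = rng.gamma(shape=2.0, scale=1.0, size=(73, 3))
mixed = latent @ rng.uniform(0.2, 1.0, size=(3, 7))
X = np.hstack([latent, mixed]) + rng.uniform(0.0, 0.1, size=(73, 10))


def pca_eigenvalue_gt_one(X):
    """PCA on the correlation matrix of log(X + 1); keep PCs with eigenvalue > 1."""
    Z = np.log(X + 1.0)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)  # standardize each column
    R = np.corrcoef(Z, rowvar=False)                  # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)              # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    scores = Z @ eigvecs[:, keep]
    scores = scores / scores.std(axis=0, ddof=1)      # unit-variance PC scores
    return eigvals, scores


eigvals, scores = pca_eigenvalue_gt_one(X)
percent = 100.0 * eigvals / eigvals.sum()
print(np.round(eigvals, 2), np.round(np.cumsum(percent), 1), scores.shape)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the percent of variance explained by each PC is its eigenvalue divided by that count, which is how the percentages in Tables 3 and 4 relate to the listed eigenvalues.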

Measurement of Similarity

Several methods to measure intermolecular similarity were used in this study. The atom pair method (AP) uses an associative measure described by Carhart et al. [13] and is based on atom pair descriptors. The measurement is the ratio of the number of shared atom pairs between two molecules over the total number of atom pairs present in the two molecules. Similarity (S) between molecules i and j is then defined as:

$$S_{ij} = 2C / (T_i + T_j) \qquad (6)$$

where C is the number of atom pairs common to chemicals i and j. $T_i$ and $T_j$ are the total number of atom pairs in chemicals i and j, respectively. The numerator is multiplied by a factor of 2 to reflect the presence of shared atom pairs in both compounds.
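Equation 6 can be sketched with multisets of atom pairs. The sketch assumes that repeated atom pairs are counted with their frequencies, so that C is the sum over atom pair types of the smaller of the two counts; the text does not spell this detail out. The acetone counts follow the example of Fig. 3, while the isobutane counts, the descriptor strings and the function name `atom_pair_similarity` are illustrative assumptions.

```python
from collections import Counter


def atom_pair_similarity(pairs_i, pairs_j):
    """S_ij = 2C / (T_i + T_j), with C taken as the multiset intersection size."""
    common = sum((pairs_i & pairs_j).values())  # min count per shared atom pair
    total = sum(pairs_i.values()) + sum(pairs_j.values())
    return 2 * common / total if total else 0.0


# Atom pairs written as (descriptor, separation, descriptor); '.' marks a pi electron.
acetone = Counter({
    ("C.X3", 2, "CX1"): 2,
    ("C.X3", 2, "O.X1"): 1,
    ("CX1", 3, "O.X1"): 2,
    ("CX1", 3, "CX1"): 1,
})
# Hydrogen-suppressed isobutane: three methyl carbons around one central carbon.
isobutane = Counter({
    ("CX1", 2, "CX3"): 3,
    ("CX1", 3, "CX1"): 3,
})

s_self = atom_pair_similarity(acetone, acetone)
s_cross = atom_pair_similarity(acetone, isobutane)
print(s_self, round(s_cross, 3))
```

A molecule compared with itself gives S = 1, and the measure is symmetric in i and j. Here the two molecules share only one methyl-to-methyl pair, so S = 2(1)/(6 + 6), about 0.17.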

The remaining methods used Euclidean distance (ED) within an n-dimensional space. These n-dimensional spaces consisted of orthogonal variables (PCs) derived from the TIs only, from physico-chemical properties only, and from the combined set of TIs and physico-chemical properties. A fourth approach used the original physico-chemical properties as a property space in which to measure ED. ED between molecules i and j is defined as:

$$ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2} \qquad (7)$$

where n equals the number of dimensions retained from PCA or the number of physico-chemical properties. $D_{ik}$ and $D_{jk}$ are the data values of the k-th dimension for chemicals i and j, respectively.
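As an illustration of Eq. 7 in the original property space (the PROP method), the sketch below computes ED between a few compounds using their (log P, eLUMO, eHOMO) values from Table 1. The function name `euclidean_distance` and the choice of compounds are assumptions made for this example only.

```python
import math


def euclidean_distance(x, y):
    """Eq. 7: square root of the summed squared differences over all dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# (log P, eLUMO, eHOMO) for compounds 1, 2 and 5 of Table 1.
props = {
    "2,3-dimethylaniline": (1.91, 0.581, -8.403),
    "2,5-dimethylaniline": (1.83, 0.581, -8.404),
    "4-aminopyrene": (3.72, -0.850, -7.837),
}

d_near = euclidean_distance(props["2,3-dimethylaniline"], props["2,5-dimethylaniline"])
d_far = euclidean_distance(props["2,3-dimethylaniline"], props["4-aminopyrene"])
print(round(d_near, 3), round(d_far, 3))
```

The two dimethylaniline isomers lie very close together in this space, while 4-aminopyrene is far from both, so the isomers would be selected as neighbors of one another well before the pyrene derivative.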

Estimation of Mutagenic Potency

Using the similarity methods described above, the intermolecular similarity of the chemicals was quantified. For each chemical, the K nearest neighbors (K = 1-5) were then determined based on each of the similarity methods. The mean observed log rev/nmol of the K nearest neighbors for a compound was used as the estimated log rev/nmol for the compound. The correlation (r) of observed log rev/nmol with estimated log rev/nmol and the standard error (s.e.) of the estimates were used to assess the relative efficacy of the similarity methods.
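The estimation procedure can be sketched as follows. The data are a toy one-dimensional example, not values from this study, and the function names are hypothetical. The text does not give the formula used for the standard error, so the root-mean-square residual used here is an assumption. For a similarity measure such as AP, where larger values mean closer neighbors, a dissimilarity such as 1 - S could be passed in place of the distance matrix.

```python
import math


def knn_estimates(distance, observed, k):
    """Estimate each value as the mean of its k nearest neighbors, excluding itself."""
    n = len(observed)
    if not 1 <= k <= n - 1:
        raise ValueError("k must be between 1 and n - 1")
    estimates = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: distance[i][j])[:k]
        estimates.append(sum(observed[j] for j in nearest) / k)
    return estimates


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rms_error(observed, estimated):
    """Assumed stand-in for the s.e.: root-mean-square residual."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


# Toy data: six "compounds" on a line, forming two well-separated groups.
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
observed = [0.0, 0.2, 0.1, 2.0, 2.2, 2.1]
distance = [[abs(p - q) for q in positions] for p in positions]

estimated = knn_estimates(distance, observed, k=2)
r = pearson_r(observed, estimated)
se = rms_error(observed, estimated)
print([round(e, 2) for e in estimated], round(r, 3), round(se, 3))
```

Each compound is left out of its own neighbor list, so every estimate is made from the other compounds only, as in the procedure described above.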

RESULTS AND DISCUSSION

From the PCA of 90 TIs for 73 compounds, six PCs were retained. These six PCs explained, cumulatively, 95.9% of the total variation within the TI data. Table 3 lists the eigenvalues of the six PCs, the proportion of variance explained by each PC and the cumulative variance explained. In addition, Table 3 lists the two TIs most correlated with each PC. The first PC is strongly correlated with parameters which characterize size of the molecular graph, viz., $M_2$ and $P_2$. The second PC is highly correlated with higher order complexity indices including $SIC_4$ and $SIC_5$. For the third PC, the highest correlations are with the cluster and path-cluster connectivity TIs such as ${}^5\chi_C$ and ${}^4\chi_{PC}$. The fourth PC was dominated by low order information theoretic indices such as $IC_1$ and $IC_2$, and the fifth PC by the fifth and sixth order chain connectivity indices. Only the zeroth order information theoretic indices showed good correlation with the sixth PC. These six PCs were used for the PCTI method of determining similarity.

For the PCA of the physico-chemical properties, the eigenvalues were 1.9, 0.7, and 0.4 with 61.8, 23.8 and 14.5 percent of the variance being explained, respectively. With only three physico-chemical properties, only three PCs are possible. Since the third PC was explaining nearly 15% of the total variation within the physico-chemical property data, all three PCs were retained. The first PC was negatively correlated with eLUMO (-0.85) and positively with log P (0.81). The first PC showed moderate correlation with eHOMO (0.69). The second PC was strongly correlated with eHOMO (0.71) and showed a slight negative correlation with log P (-0.41). The third PC was positively correlated with eLUMO and log P, with r = 0.49 and 0.42, respectively. These three PCs were used for the PCPROP method of determining similarity.

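Table 3. Summary of principal components of 90 TIs for 73 compounds and the correlation coefficients of two TIs most correlated with each principal component.

| PC | Eigenvalue | Percent of variance | Cumulative percent | First correlated TI | r | Second correlated TI | r |
|---|---|---|---|---|---|---|---|
| 1 | 58.3 | 64.8 | 64.8 | $M_2$ | 0.99 | $P_2$ | 0.99 |
| 2 | 13.1 | 14.5 | 79.3 | $SIC_5$ | -0.96 | $SIC_4$ | -0.96 |
| 3 | 6.3 | 7.0 | 86.3 | ${}^5\chi^b_C$ | 0.82 | ${}^4\chi^b_{PC}$ | 0.72 |
| 4 | 4.3 | 4.8 | 91.0 | $IC_2$ | 0.64 | $IC_1$ | 0.57 |
| 5 | 2.7 | 3.1 | 94.1 | ${}^5\chi^v_{Ch}$ | -0.61 | ${}^5\chi_{Ch}$ | -0.60 |
| 6 | 1.6 | 1.8 | 95.9 | $SIC_0$ | 0.46 | (illegible in source) | 0.46 |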

The PCA of the combined set of TIs and physico-chemical properties resulted in seven PCs with eigenvalues greater than 1.0. Cumulatively, these PCs explained 95.9% of the total variation within the original parameters. The PCA results are summarized in Table 4. Table 4 lists the two most correlated parameters with each PC, as well. By comparing the results in Table 3 and Table 4, one can see that the results are very similar for the analysis of TIs only and the combined set of TIs and physico-chemical properties. A significant difference does not occur until the sixth PC when the most correlated parameter is the electronic descriptor, eLUMO, when looking at the combined set of parameters. The second significant difference is the addition of one more PC for the combined set of parameters. The two parameters, log P and $SIC_2$, were the only parameters to show correlation (p < 0.05) with the seventh PC. These seven PCs were used for the PCALL method of determining similarity.

In Table 5, we present a summary of the mutagenicity estimation when using each of the similarity measurements, viz., atom pairs (AP), ED within 6-dimensional PC space derived from TIs (PCTI), ED within 7-dimensional space derived from TIs and physico-chemical properties (PCALL), ED within 3-dimensional space derived from physico-chemical properties (PCPROP) and ED within 3-dimensional space consisting of the original physico-chemical properties (PROP). Each line of Table 5 presents the similarity measurement used, the K value (number of neighbors used for estimation), the correlation of observed and estimated mutagenicity and the standard error of the estimates. For brevity, only the best K value for each similarity method is reported. Table 6 lists the estimated log rev/nmol for each compound using the five similarity methods at the best K value for each method.

Table 4. Summary of principal components of 90 TIs and 3 physico-chemical properties for 73 compounds and the correlation coefficients of the two parameters most correlated with each principal component.

| PC | Eigenvalue | Proportion of variance | Cumulative percent | First correlated parameter | r | Second correlated parameter | r |
|---|---|---|---|---|---|---|---|
| 1 | 59.5 | 64.0 | 64.0 | $M_2$ | 0.99 | $P_2$ | 0.99 |
| 2 | 13.2 | 14.2 | 78.2 | $SIC_5$ | 0.96 | $SIC_4$ | 0.96 |
| 3 | 6.3 | 6.8 | 85.0 | ${}^5\chi^b_C$ | 0.82 | ${}^4\chi^b_{PC}$ | 0.71 |
| 4 | 4.4 | 4.8 | 89.8 | $IC_2$ | 0.64 | $IC_1$ | 0.56 |
| 5 | 2.8 | 3.0 | 92.7 | ${}^5\chi^v_{Ch}$ | -0.60 | ${}^5\chi_{Ch}$ | -0.59 |
| 6 | 1.9 | 2.1 | 94.8 | eLUMO | 0.60 | $IC_0$ | -0.59 |
| 7 | 1.0 | 1.1 | 95.9 | log P | 0.28 | $SIC_2$ | 0.25 |

In many practical situations, one has to carry out rapid estimation of toxicity endpoints like mutagenicity. Viable models based on nonempirical parameters which can be calculated fast are useful in such cases. A number of molecular similarity methods constructed from topological parameters have been used in the estimation of physicochemical and toxicological properties for different sets of chemicals [9-12].

Table 5. Comparison of the five similarity methods for estimation of mutagenic potency (log Rev/nmol in S. typhimurium TA100 with metabolic activation) for 73 aromatic and heteroaromatic amines

| Similarity method | K | r | s.e. |
|---|---|---|---|
| AP | 4 | 0.77 | 0.88 |
| PCTI | 5 | 0.72 | 0.96 |
| PCALL | 4 | 0.79 | 0.85 |
| PCPROP | 5 | 0.84 | 0.75 |
| PROP | 5 | 0.83 | 0.77 |

Such similarity spaces are based on parameters which quantify size, shape, branching and bonding patterns in molecular architecture and do not encode geometric or stereoelectronic information explicitly. On the other hand, if one has an idea about the mechanism of action of a particular class of chemicals and can develop parameters which adequately reflect different aspects of action of those chemicals at the molecular level, one can construct a similarity space using such parameters. From a mechanistic point of view, Debnath et al. [1] used lipophilicity, eHOMO and eLUMO to correlate mutagenicity of aromatic and heteroaromatic amines. Therefore, we created similarity spaces using: a) log P, eHOMO, eLUMO, b) topological parameters, and c) a combination of topological parameters, log P, eHOMO and eLUMO. The results in Table 5 show that the AP based similarity method is almost as effective as the property based (PROP) method in predicting mutagenicity. In view of the fact that calculation of atom pairs is orders of magnitude faster as compared to properties like eHOMO and eLUMO, one might use the AP method for rapid estimation of mutagenicity of aromatic and heteroaromatic amines. Further work using other mutagenicity data might be needed to understand the effectiveness of physicochemically-based vis-a-vis topologically-based similarity methods in estimating toxicity of chemicals.

Table 6. Estimated mutagenicity (S. typhimurium TA100 + S9) for 73 aromatic and heteroaromatic amines using five similarity methods

| No. | Obs. log Rev/nmol | AP (K=4) | PCTI (K=5) | PCALL (K=4) | PCPROP (K=5) | PROP (K=5) |
|---|---|---|---|---|---|---|
| 1 | -1.36 | -0.75 | -0.73 | -0.75 | -1.22 | -1.22 |
| 2 | -1.43 | -0.81 | -0.75 | -0.73 | -1.21 | -1.21 |
| 3 | -1.44 | -0.88 | -0.45 | -1.84 | -0.45 | -0.75 |
| 4 | 0.48 | 0.17 | 0.27 | 0.18 | -0.47 | -0.53 |
| 5 | 2.69 | 2.25 | 1.30 | 1.65 | 1.54 | 1.54 |
| 6 | -1.68 | -1.29 | -0.06 | -1.09 | -0.79 | -1.02 |
| 7 | 2.34 | 2.64 | 1.88 | 2.46 | 2.49 | 2.49 |
| 8 | 0.78 | 0.03 | 0.00 | 0.15 | 1.80 | 1.41 |
| 9 | -0.66 | 1.06 | -0.11 | -0.26 | -0.54 | -0.74 |
| 10 | -0.64 | -0.31 | -0.23 | 0.13 | -0.13 | -0.13 |
| 11 | -0.34 | -0.50 | -0.69 | 0.02 | -0.27 | -0.27 |
| 12 | -1.08 | -0.68 | -0.82 | -0.82 | -0.94 | -0.94 |
| 13 | 0.10 | 0.32 | -0.03 | 0.32 | 0.49 | 0.49 |
| 14 | -0.40 | -0.39 | -0.28 | 0.07 | -0.29 | -0.18 |
| 15 | 0.64 | 0.09 | 0.02 | 0.09 | 0.65 | 0.65 |
| 16 | -1.51 | -0.64 | -0.95 | -0.89 | -1.18 | -1.18 |
| 17 | 1.98 | 2.19 | 1.95 | 2.55 | 2.56 | 2.56 |
| 18 | 0.08 | -0.42 | -1.21 | -0.35 | 0.06 | -0.24 |
| 19 | 2.58 | 2.57 | 1.81 | 1.56 | 2.44 | 2.44 |
| 20 | 0.39 | -0.14 | 0.55 | 0.11 | 0.09 | 0.33 |
| 21 | -0.14 | -0.60 | 0.04 | 0.14 | 0.27 | 0.27 |
| 22 | -1.22 | 0.07 | -0.51 | -0.50 | -0.93 | -0.93 |
| 23 | 0.84 | 0.03 | 0.74 | 0.43 | 0.15 | 0.15 |
| 24 | 1.12 | 0.27 | 0.41 | 0.41 | 1.01 | 1.01 |
| 25 | -1.51 | 1.61 | 0.08 | 0.07 | -0.87 | -0.87 |
| 26 | -1.85 | -1.21 | -0.83 | -1.13 | -1.48 | -1.48 |
| 27 | -0.26 | -1.02 | -0.87 | -1.03 | -1.12 | -1.00 |
| 28 | -0.84 | -0.53 | -0.71 | -0.69 | -0.78 | -0.58 |
| 29 | 2.76 | 2.54 | 1.79 | 2.36 | 2.40 | 2.40 |
| 30 | -0.25 | -0.30 | -0.05 | -0.13 | -0.31 | -0.31 |
| 31 | 1.79 | 2.02 | 1.69 | 1.41 | 1.93 | 1.93 |
| 32 | 0.09 | 0.93 | 0.37 | 0.19 | 0.30 | 0.30 |
| 33 | -0.11 | -0.30 | -0.08 | -0.29 | -0.50 | -0.50 |
| 34 | -2.10 | -1.08 | -0.78 | -1.06 | -0.70 | -0.70 |
| 35 | -0.51 | 0.92 | 0.12 | 0.22 | -0.41 | -0.16 |
| 36 | 2.25 | 2.16 | 2.49 | 2.49 | 2.47 | 2.47 |
| 37 | 0.85 | 0.74 | 0.31 | 0.25 | 0.49 | 0.49 |
| 38 | 0.66 | 0.45 | -0.04 | 0.01 | 1.83 | 1.83 |
| 39 | -1.12 | -0.86 | -0.30 | -0.88 | -0.11 | -0.11 |
| 40 | -0.85 | 0.19 | 0.03 | 0.02 | -0.41 | -0.41 |
| 41 | 0.54 | -1.05 | -1.43 | -0.30 | 0.28 | 1.97 |
| 42 | -0.56 | -0.01 | 0.01 | -0.18 | -0.44 | -0.44 |
| 43 | -0.47 | -0.07 | -0.01 | -0.07 | -0.46 | -0.46 |
| 44 | -0.04 | 0.32 | 0.16 | 0.26 | 0.46 | 0.46 |
| 45 | 2.76 | 1.89 | 1.69 | 2.14 | 1.57 | 1.55 |
| 46 | 1.09 | 0.50 | 0.50 | 0.54 | 0.19 | 0.19 |
| 47 | 2.87 | 2.42 | 1.77 | 2.33 | 2.38 | 2.38 |
| 48 | 0.07 | -0.26 | -0.98 | -0.82 | -1.42 | -1.04 |
| 49 | -1.16 | -0.15 | 0.33 | -0.02 | -0.14 | -0.14 |
| 50 | -0.81 | -1.31 | -1.03 | -1.39 | -0.75 | -0.75 |
| 51 | -2.05 | -1.34 | -0.69 | -0.93 | -0.61 | -0.90 |
| 52 | 0.63 | 0.14 | 0.24 | 0.18 | -0.18 | -0.18 |
| 53 | -2.00 | -0.79 | -0.75 | -1.10 | -1.24 | -1.09 |
| 54 | -0.37 | 0.26 | 0.99 | 0.90 | -0.29 | -0.29 |
| 55 | 2.41 | 2.15 | 1.16 | 1.35 | 1.83 | 1.74 |
| 56 | 0.46 | -0.48 | -0.45 | -0.15 | -0.24 | -0.24 |
| 57 | -0.11 | 1.76 | 2.07 | 1.89 | 0.48 | 0.48 |
| 58 | -0.55 | -0.11 | -0.03 | -0.31 | 0.45 | 0.33 |
| 59 | -0.27 | 0.93 | 0.42 | 0.41 | -1.04 | -1.04 |
| 60 | -0.61 | -0.75 | -0.22 | -0.30 | -1.00 | -1.00 |
| 61 | 0.14 | -0.62 | -0.47 | -0.83 | -0.81 | -1.03 |
| 62 | 0.38 | -0.33 | -0.44 | -0.13 | -0.33 | -0.33 |
| 63 | -1.00 | -0.05 | 1.30 | 0.94 | 0.05 | 0.34 |
| 64 | -0.23 | -0.89 | -0.99 | -1.03 | -1.00 | -0.97 |
| 65 | -2.52 | -0.51 | -0.60 | -1.79 | -1.34 | -1.34 |
| 66 | -0.15 | 0.01 | -0.24 | -0.27 | -0.88 | -0.88 |
| 67 | 2.79 | 1.03 | 1.49 | 1.17 | 1.73 | 1.73 |
| 68 | 0.65 | 0.20 | 1.04 | 0.24 | -1.05 | -1.05 |
| 69 | 2.66 | 1.91 | 1.51 | 1.19 | 1.75 | 1.75 |
| 70 | 2.74 | 1.77 | 1.49 | 1.17 | 1.71 | 1.74 |
| 71 | 0.36 | 1.77 | 1.98 | 2.03 | 1.70 | 1.70 |
| 72 | 1.05 | 2.45 | 2.06 | 2.06 | 1.60 | 1.60 |
| 73 | -0.24 | 1.21 | 1.60 | 2.12 | 1.35 | 1.82 |

ACKNOWLEDGEMENTS

This is contribution number 141 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported in this paper was supported, in part, by cooperative agreement CR 819621 from the United States Environmental Protection Agency, by grant F49620-94-1-0401 from the United States Air Force, Exxon Biomedical Sciences, Inc., and the Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute at the University of Minnesota.

REFERENCES

1. A. K. Debnath, G. Debnath, A. J. Shusterman, and C. Hansch. (1992) Environmental and Molecular Mutagenesis, 19, 37-52.
2. R. W. Tennant, B. H. Margolin, D. Shelby, E. Zeiger, J. K. Haseman, J. Spalding, W. Caspary, M. Resnick, S. Stasiewicz, B. Anderson, and R. Minor. (1987) Science, 236, 933-94.
3. J. McCann, E. Choi, E. Yamasaki, and B. N. Ames. (1975) Proc. Natl. Acad. Sci., 72, 5135-5139.
4. J. Arcos. (1987) Environ. Sci. Technol., 21, 743-745.
5. C. M. Auer, M. Zeeman, J. V. Nabholz, and R. G. Clements. (1994) SAR and QSAR Environ. Res., 2, 29-38.
6. S. C. Basak, V. R. Magnuson, G. J. Niemi, and R. R. Regal. (1988) Discrete Appl. Math., 19, 17-44.
7. S. C. Basak, S. Bertelsen, and G. D. Grunwald. (1994) J. Chem. Inf. Comput. Sci., 34, 270-276.
8. S. C. Basak and G. D. Grunwald, Mathematical Modelling and Scientific Computing, (In press).
9. S. C. Basak and G. D. Grunwald, SAR and QSAR in Environ. Res., 2, 289-307.
10. S. C. Basak and G. D. Grunwald, New Journal of Chemistry, (In press).
11. S. C. Basak and G. D. Grunwald, J. Chem. Inf. Comput. Sci., (In press).
12. S. C. Basak and G. D. Grunwald, SAR and QSAR in Environ. Res., (In press).
13. R. E. Carhart, D. H. Smith, and R. Venkataraghavan. (1985) J. Chem. Inf. Comput. Sci., 25, 64-73.
14. W. Fisanick, K. P. Cross, and A. Rusinko III. (1992) J. Chem. Inf. Comput. Sci., 32, 664-674.
15. C. L. Wilkins and M. Randic. (1980) Theoret. Chim. Acta (Berl), 58, 45-68.
16. P. Willet and V. Winterman. (1986) Quant. Struct-Act. Relat., 5, 18-25.
17. S. C. Basak, S. Bertelsen, and G. D. Grunwald, Toxicology Letters, (In press).
18. A. Leo. (1988) Medicinal Chemistry Project, Pomona College, Claremont, CA.
19. M. J. S. Dewar, E. G. Zoebisch, E. F. Healy, and J. J. P. Stewart. (1985) J. Am. Chem. Soc., 107, 3902-3909.
20. H. Wiener. (1947) J. Am. Chem. Soc., 69, 17-20.
21. M. Randic. (1975) J. Am. Chem. Soc., 97, 6609-6615.
22. L. B. Kier and L. H. Hall. (1986) Molecular Connectivity in Structure-Activity Analysis. Research Studies Press: Letchworth, Hertfordshire, UK.
23. A. T. Balaban. (1982) Chemical Physics Letters, 89, 399-404.
24. A. T. Balaban. (1983) Pure & Appl. Chem., 55, 199-206.
25. A. T. Balaban. (1986) MATCH, 21, 115-122.
26. C. E. Shannon. (1948) Bell Syst. Tech. J., 27, 379-423.
27. N. Rashevsky. (1955) Bull. Math. Biophys., 17, 229-235.
28. R. Sarkar, A. B. Roy, and P. K. Sarkar. (1978) Math. Biosci., 39, 299-312.
29. A. B. Roy, S. C. Basak, D. K. Harriss, and V. R. Magnuson. (1984) In Mathematical Modelling in Science and Technology. Pergamon Press.
30. V. R. Magnuson, D. K. Harriss, and S. C. Basak. (1983) In Chemical Applications of Topology and Graph Theory. Elsevier.
31. S. C. Basak, A. B. Roy, and J. J. Ghosh. (1979) In Proceedings of the IInd International Conference on Mathematical Modelling; Rolla, Missouri.
32. S. C. Basak and V. R. Magnuson. (1983) Arzneimittel-Forschung/Drug Research, 33, 501-503.
33. D. Bonchev and N. Trinajstic. (1977) J. Chem. Phys., 67, 4517-4533.
34. C. Raychaudhury, S. K. Ray, J. J. Ghosh, A. B. Roy, and S. C. Basak. (1984) J. Comput. Chem., 5, 581-588.
35. S. C. Basak, D. K. Harriss, and V. R. Magnuson. (1988) POLLY: Copyright of the University of Minnesota.
36. S. C. Basak and G. D. Grunwald. (1994) APPROBE: Copyright of the University of Minnesota.

Estimation of Normal Boiling Points of Haloalkanes Using Molecular Similarity

Subhash C. Basak, Brian D. Gute and Gregory D. Grunwald

Natural Resources Research Institute, University of Minnesota,

5013 Miller Trunk Highway, Duluth, MN 55811, USA

Croatica Chemica Acta, Submitted, 1995

A molecular similarity measure has been used to estimate the normal boiling points of a set of 276 haloalkanes with 1-4 carbon atoms. Molecular similarity/dissimilarity was quantified in terms of Euclidean distances of molecules in the eight-dimensional principal component space derived from fifty-nine topological indices. The correlation coefficients between experimental and estimated boiling points ranged from 0.842 to 0.944 in the K-nearest-neighbor estimation of boiling points using different numbers of nearest neighbors (K = 1-25).

INTRODUCTION

The use of structural analogy as a tool to classify chemicals as well as predict the

behavior of chemical species is as old as chemistry. In 1819, Mitscherlich1 described the

phenomenon of isomorphism, in which substitution of one atom by another leads to similar lattice

structures and at the turn of this century, Langmuir2 observed that chemical species containing

the same total number of atoms and electrons have very similar properties. He called such

chemicals isosteric. Members of isosteric pairs, like N2-CO and N2O-CO2, have many similar

physical constants.3 The structural similarity of the isosteric amino acids valine and threonine

poses some interesting problems in the protein synthesis mechanism of cells. Being sterically

similar, valine and threonine may be charged to the same t-RNA. The incorrectly formed

aminoacyl adenylate and aminoacyl t-RNA are discriminated and destroyed via a "double sieve"

involving steric exclusion and ineffective binding before they are used in protein synthesis.4

Similarity plays an important role in biological activity. The enzyme dihydrofolate reductase normally facilitates the reduction of dihydrofolate to tetrahydrofolate. Methotrexate,

a compound whose structure is substantially similar to dihydrofolate, inhibits the action of the enzyme.5 Competitive inhibition of enzymes can also result from interaction of the enzyme with transition state analogs of the substrate. For example, proline racemase from Clostridium sticklandii preferentially binds the transition state of proline. As a result, it is subject to

inhibition by compounds which are structural analogs of the transition state of proline, such as

pyrrole-2-carboxylate and pyrroline-2-carboxylate, which bind to the enzyme with a much greater

affinity than does proline.6 Furthermore, the structural similarity between a macromolecular

biotarget and its antiidiotypic antibody is believed to be the reason for the use of such antibodies

as model receptors for the screening of chemicals for drug discovery.7

The last decade has seen an upsurge of interest in the development of similarity measures

and their applications in chemical research, drug design, and toxicology.8-25 Such methods are

based on different representations of chemical species, viz., topological, geometric, quantum

chemical, etc. In drug design, similarity searching of databases is used to identify potential leads.

Also, dissimilarity-based methods are used to select chemicals for screening in the drug discovery process.11 In toxicology, structural and functional analogy are used to assess the ecological and human health risk of new and existing chemicals.26-28

In the United States, the majority of chemicals submitted to the Environmental Protection

Agency (USEPA) for registration do not have any test data.27 One of the methods used by regulators for the hazard assessment of such chemicals is to select their analogs and, subsequently, estimate the hazard of the chemical of interest from the hazard of the analogs.

Such selection of analogs is often done subjectively by individual experts based on an intuitive notion of similarity.27 In USEPA’s approach to ecological risk assessment, class specific QSARs are preferred over the use of analogs, although in human health hazard assessment, analog-based estimation of toxic potential is still the most important factor.28

Rapid selection of analogs for drug design and hazard assessment requires automated methods which are computationally feasible. Similarity methods based on parameters which can be calculated directly from molecular structure fall into this category.8-25

In some of our recent studies, we have developed novel methods of quantifying molecular

similarity using topological indices and substructural features like atom pairs.13-22 We have

applied similarity techniques in the selection of analogs and in the estimation of molecular

properties such as boiling point, lipophilicity, and mutagenicity for different sets of chemicals.

In this paper we have carried out a similarity based estimation of normal boiling point for a set

of 276 chlorofluorocarbons (CFCs). Our results will then be compared with the results from a

neural network analysis29 of the same 276 compounds.

DATABASES

The data analyzed in this study consisted of normal boiling points for 276 CFCs with 1-4

carbon atoms. These compounds were previously analyzed by Balaban et al.29,30 Table I is a

listing of the compounds used in this study and their normal boiling points.

METHODS

Calculation of Topological Indices

The Wiener index W,31 the first topological index reported in the chemical literature, may

be calculated from the distance matrix D(G) of a hydrogen-suppressed chemical graph G as the

sum of the entries in the upper triangular distance submatrix. The distance matrix D(G) of a

nondirected graph G with n vertices is a symmetric n x n matrix with elements $d_{ij}$ equal to the distance between vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We give below the distance matrix $D(G_1)$ of the labeled hydrogen-suppressed graph $G_1$ of isobutane (Figure 1), with rows and columns ordered by vertices (1) to (4):

$$D(G_1) = \begin{bmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 1 & 1 \\ 2 & 1 & 0 & 2 \\ 2 & 1 & 2 & 0 \end{bmatrix}$$

W is calculated as:

$$W = \tfrac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \tag{1}$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h.
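As a worked check of eq. 1, the following minimal Python sketch evaluates W for the isobutane distance matrix given above, once as the upper-triangle sum and once from the pair counts $g_h$. The function names are illustrative assumptions and are not part of POLLY.

```python
from collections import Counter

# Distance matrix D(G1) of hydrogen-suppressed isobutane, as given in the text
# (vertex 2 is the central carbon).
D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

def wiener_index(dist):
    """Sum of the upper-triangle entries, i.e. half the sum of all entries."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

def distance_counts(dist):
    """g_h: number of unordered vertex pairs separated by distance h."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))

W = wiener_index(D)
g = distance_counts(D)
W_from_counts = sum(h * count for h, count in g.items())
print(W, dict(g), W_from_counts)  # 9 {1: 3, 2: 3} 9
```

Both routes give W = 9 for isobutane: three pairs at distance 1 and three pairs at distance 2.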

Randic’s32 connectivity index, as well as the higher-order path, cluster, and path-cluster

types of simple and valence connectivity indices developed by Kier and Hall,33 were calculated

by the computer program POLLY.34 $P_h$ parameters, the number of paths of length h (h = 0-5)

in the hydrogen-suppressed graph, are calculated using standard algorithms.

Information-theoretic topological indices are calculated by the application of information

theory on chemical graphs. An appropriate set A of n elements is derived from a molecular

graph G depending upon certain structural characteristics. On the basis of an equivalence relation defined on A, the set A is partitioned into disjoint subsets $A_i$ of order $n_i$ ($i = 1, 2, \ldots, h$; $\sum_i n_i = n$).

A probability distribution is then assigned to the set of equivalence classes:

$$\begin{matrix} A_1, & A_2, & \ldots, & A_h \\ p_1, & p_2, & \ldots, & p_h \end{matrix}$$

where $p_i = n_i/n$ is the probability that a randomly selected element of A will occur in the i-th subset. The mean information content of an element of A is defined by Shannon's35 relation:

$$IC = -\sum_{i=1}^{h} p_i \log_2 p_i \tag{2}$$

The logarithm is taken at base 2 for measuring the information content in bits. The total

information content of the set A is then n times IC.
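Eq. 2 can be illustrated with a short Python sketch. The partition used here, the four carbon atoms of isobutane split into three methyl carbons and one central carbon, is an assumed example chosen for simplicity; it is not claimed to be the equivalence relation applied by the authors.

```python
import math

def information_content(class_sizes):
    """Mean information content IC (bits) of a partition with the given class sizes."""
    n = sum(class_sizes)
    return -sum((size / n) * math.log2(size / n) for size in class_sizes)

# Assumed illustrative partition of n = 4 vertices into classes of size 3 and 1.
class_sizes = [3, 1]
n = sum(class_sizes)
ic = information_content(class_sizes)
total_information = n * ic  # total information content of the set A
print(round(ic, 4), round(total_information, 4))  # 0.8113 3.2451
```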

Rashevsky was the first to calculate information content of graphs where "topologically

equivalent" vertices are placed in the same equivalence class. In Rashevsky’s approach, two

vertices u and v of a graph are said to be topologically equivalent if and only if for each

neighboring vertex $u_i$ ($i = 1, 2, \ldots, k$) of the vertex u, there is a distinct neighboring vertex $v_i$ of

the same degree for the vertex v. Subsequently, Trucco36 defined topological information of

graphs on the basis of graph orbits. In this method, vertices which belong to the same orbit of

the automorphism group are considered topologically equivalent. While Rashevsky37 used simple

linear graphs with indistinguishable vertices to symbolize molecular structure, weighted linear

graphs or multigraphs are better models for conjugated or aromatic molecules because they more properly reflect the actual bonding patterns, i.e. electron distribution.

To account for the chemical nature of vertices as well as their bonding pattern, Sarkar,

Roy and Sarkar38 calculated information content of chemical graphs on the basis of an equivalence relation where two atoms of the same element are considered equivalent if they possess an identical first-order topological neighborhood. Since properties of atoms or reaction centers are often modulated by physicochemical characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed essential to extend this approach to account for higher-order neighbors of vertices. This can be accomplished by defining open spheres for all vertices of a chemical graph. If r is any non-negative real number and v is a vertex of the graph G, then the open sphere S(v,r) is defined as the set consisting of all vertices $v_i$ in G such that $d(v, v_i) < r$. Then $S(v,0) = \emptyset$, $S(v,r) = \{v\}$ for $0 < r < 1$, and $S(v,r)$ is the set consisting of v and all vertices $v_i$ of G situated at unit distance from v for $1 < r < 2$.

One can construct such open spheres for higher integral values of r. For a particular value of r, the collection of all such open spheres S(v,r), where v runs over the whole vertex set V, forms a neighborhood system of the vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets consisting of topological neighborhoods of vertices of up to r-th order neighbors. Such an approach has already been initiated and the information-theoretic indices calculated are called indices of neighborhood symmetry.39

In this method, chemical species are symbolized by weighted linear graphs. Two vertices $u_0$ and $v_0$ of a molecular graph are said to be equivalent with respect to the r-th order neighborhood if, and only if, corresponding to each path $u_0, u_1, \ldots, u_r$ of length r, there is a distinct path $v_0, v_1, \ldots, v_r$ of the same length, such that the paths have similar edge weights, and both $u_0$ and $v_0$ are connected to the same number and type of atoms up to the r-th order bonded neighbors. The detailed equivalence relation is described in our earlier studies.

Once partitioning of the vertex set for a particular order of neighborhood is completed,

$IC_r$ is calculated from eq. 2. Subsequently, Basak, Roy and Ghosh40 defined another information-theoretic measure, structural information content ($SIC_r$), which is calculated as:

$$SIC_r = IC_r / \log_2 n \tag{3}$$

where $IC_r$ is calculated from eq. 2 and n is the total number of vertices of the graph. Another information-theoretic invariant, complementary information content ($CIC_r$),41 is defined as:

$$CIC_r = \log_2 n - IC_r \tag{4}$$

$CIC_r$ represents the difference between the maximum possible complexity of a graph (where each vertex belongs to a separate equivalence class) and the realized topological information of a chemical species as defined by $IC_r$.
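Eqs. 3 and 4 both rescale $IC_r$ against its upper bound $\log_2 n$. The sketch below shows this for an assumed partition of n = 4 vertices into classes of size 3 and 1; the partition and the function name are illustrative only.

```python
import math

def sic_and_cic(class_sizes):
    """Return (IC, SIC, CIC) for a vertex partition, following eqs. 2-4."""
    n = sum(class_sizes)
    ic = -sum((size / n) * math.log2(size / n) for size in class_sizes)
    max_ic = math.log2(n)  # reached when every vertex is its own class
    return ic, ic / max_ic, max_ic - ic

# Assumed illustrative partition: 4 vertices in classes of size 3 and 1.
ic_r, sic_r, cic_r = sic_and_cic([3, 1])
print(round(ic_r, 4), round(sic_r, 4), round(cic_r, 4))  # 0.8113 0.4056 1.1887
```

By construction SIC lies between 0 and 1, and IC and CIC always sum to $\log_2 n$.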

The information-theoretic index on graph distance, $I_D^W$,42 is calculated from the distance matrix D(G) of a chemical graph G as follows:

$$I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \tag{5}$$

The mean information index, $\bar{I}_D^W$, is found by dividing the information index $I_D^W$ by W. $IC_r$, $SIC_r$, $CIC_r$, $I_D^W$, and $\bar{I}_D^W$ were calculated by POLLY. The information-theoretic parameters defined on the distance matrix, $H^D$ and $H^V$, were calculated by the method of Raychaudhury et al.43
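Eq. 5 needs only the pair counts $g_h$. A minimal sketch, using the counts for hydrogen-suppressed isobutane (three pairs at distance 1 and three at distance 2); the function name is an assumption made for illustration:

```python
import math

def distance_information(pair_counts):
    """Return (W, I_D^W, mean I_D^W) from pair counts {h: g_h}, following eq. 5."""
    W = sum(h * count for h, count in pair_counts.items())
    total = W * math.log2(W) - sum(
        count * h * math.log2(h) for h, count in pair_counts.items()
    )
    return W, total, total / W

# Pair counts of hydrogen-suppressed isobutane.
W, i_dw, mean_i_dw = distance_information({1: 3, 2: 3})
print(W, round(i_dw, 4), round(mean_i_dw, 4))  # 9 22.5293 2.5033
```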

Four additional indices, developed by A.T. Balaban,44-46 were calculated and used in this analysis. Balaban denoted these as J indices, which are based upon the distance sums $s_i$ of a chemical graph. J is defined as:

$$J = \frac{q}{\mu + 1} \sum_{\text{edges } ij} (s_i s_j)^{-1/2} \tag{6}$$

where the cyclomatic number $\mu$ (or number of rings in the graph) is $\mu = q - n + 1$, with q adjacencies or edges and n vertices. In the original definition of J, the term $s_i$ referred to either the row distance sum for vertex i in the distance matrix (D) or the multigraph distance matrix (M):

$$s_i = \sum_{j} d_{ij} \tag{7}$$

For distance matrix D, each matrix element $d_{ij}$ represents the distance from vertex i to vertex j. The diagonal entries are all zero and the distance between any two adjacent vertices would be one. All other entries in this matrix would be the number of edges or bonds traversed in the shortest path from i to j. For the multigraph distance matrix M, for each entry $d_{ij}$, 1/b is used, where b is the conventional bond order (b = 1, 2, 3, and 1.5 for single, double, triple and aromatic bonds, respectively). To account for the periodicity of chemical properties for heteroatoms, Balaban proposed two J variants: $J^X$, which includes corrections for heteroatom electronegativities, and $J^Y$, which has corrections for heteroatom covalent radii.46
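Eqs. 6 and 7 can be checked on the isobutane distance matrix used earlier. The sketch below treats the plain distance matrix D only (no bond-order or heteroatom corrections) and recovers the edges as the vertex pairs at distance 1; the function name is an illustrative assumption.

```python
import math

# Distance matrix of hydrogen-suppressed isobutane (vertex 2 is the central carbon).
D = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]

def balaban_j(dist):
    """Balaban's J from a simple-graph distance matrix, following eqs. 6 and 7."""
    n = len(dist)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == 1]
    q = len(edges)
    mu = q - n + 1                      # cyclomatic number
    s = [sum(row) for row in dist]      # distance sums s_i (eq. 7)
    return q / (mu + 1) * sum((s[i] * s[j]) ** -0.5 for i, j in edges)

J = balaban_j(D)
print(round(J, 4))  # 2.3238
```

Here q = 3, the cyclomatic number is 0, and the distance sums are 5, 3, 5, 5, so J = 3 x 3/sqrt(15).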

Table II lists the 59 TIs used in this paper. This set of TIs was generated for all compounds in the data set.

Data Reduction

Principal component analysis (PCA) was used to reduce the dimensionality of the set of 59 topological indices (TIs). With PCA, linear combinations of the TIs, called principal components (PCs), are derived from the correlation matrix. The first PC has the largest variance, or eigenvalue, of the linear combination of TIs. Each subsequent PC explains maximal index variance orthogonal to previous PCs. With 59 TIs available, 59 PCs can be generated. For this study, PCs with an eigenvalue greater than one were retained. For the PCA, all TIs were transformed by the natural log of the TI plus one. The PCA and selection of PCs were accomplished using SAS procedure PRINCOMP.47 Basak et al. provide more detail on this approach.
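The data-reduction steps described above (log(TI + 1) transform, PCA on the correlation matrix, retention of components with eigenvalue greater than one) can be sketched in Python with NumPy. This is an assumed re-implementation for illustration, not the SAS PRINCOMP procedure used in the study, and the small index matrix is synthetic.

```python
import numpy as np

def retain_components(ti_matrix):
    """PCA on the correlation matrix of log(TI + 1); keep PCs with eigenvalue > 1."""
    x = np.log(np.asarray(ti_matrix, dtype=float) + 1.0)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardized indices
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending order
    order = np.argsort(eigvals)[::-1]                  # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    return eigvals[keep], z @ eigvecs[:, keep]

# Synthetic stand-in for a TI table: 40 "compounds" by 5 "indices" (hypothetical data).
rng = np.random.default_rng(0)
size = rng.uniform(1.0, 10.0, 40)
shape = rng.uniform(1.0, 10.0, 40)
ti = np.column_stack([size, 2.0 * size, size ** 2, shape, 3.0 * shape])

eigvals, scores = retain_components(ti)
print(eigvals.shape, scores.shape)
```

Each retained score column has sample variance equal to its eigenvalue, which is the sense in which a PC with eigenvalue above one carries more variance than any single standardized index.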

Computation of Similarity

Intermolecular similarity was measured by Euclidean distance (ED) within an n-dimensional space. This n-dimensional space consisted of orthogonal variables (PCs) derived from the TIs. ED between molecules i and j is defined as:

$$ED_{ij} = \left[ \sum_{k=1}^{n} (D_{ik} - D_{jk})^2 \right]^{1/2}$$

where n equals the number of dimensions retained from PCA. $D_{ik}$ and $D_{jk}$ are the data values of the k-th dimension for chemicals i and j, respectively.
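The Euclidean distance defined above, and the nearest-neighbor averaging that this study builds on it, can be sketched as follows. The PC scores and boiling points are hypothetical toy values, and excluding a compound from its own neighbor list is an assumption of this sketch.

```python
import math

def euclidean_distance(a, b):
    """ED between two molecules given their PC scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_estimate(scores, values, target, k):
    """Mean observed value of the k nearest neighbors of compound `target`."""
    others = [j for j in range(len(scores)) if j != target]  # never its own neighbor
    others.sort(key=lambda j: euclidean_distance(scores[target], scores[j]))
    nearest = others[:k]
    return sum(values[j] for j in nearest) / k

# Hypothetical two-dimensional PC scores and boiling points for four compounds.
scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
boiling_points = [10.0, 20.0, 40.0, 100.0]

estimates = [knn_estimate(scores, boiling_points, i, k=2) for i in range(len(scores))]
print(estimates)  # [30.0, 25.0, 15.0, 30.0]
```

The isolated fourth compound is estimated poorly (30 against an observed 100), which illustrates why the quality of a neighbor-based estimate depends on how well the data set covers the region around the target.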

K-nearest Neighbor Selection and Boiling Point Estimation

Following the quantification of the intermolecular similarity of the CFCs, the K-nearest neighbors (K = 1-10, 15, 20, 25) were determined based on ED. The mean observed boiling point of the K-nearest neighbors for a compound was used as the estimated boiling point for the compound. The correlation (r) of observed boiling point with estimated boiling point and the standard error (s.e.) of the estimates were used to assess the efficacy of the similarity method relative to the back-propagation neural network model.

RESULTS

From the PCA of 59 TIs for 276 CFCs, eight PCs were retained. These eight PCs explained, cumulatively, 95.4% of the total variation within the TI data, and all eigenvalues were greater than 1.0. Table III lists the eigenvalues of the eight PCs, the proportion of variance explained by each PC, and the cumulative variance explained. In addition, Table III lists the two TIs most correlated with each PC. The first PC is strongly correlated with parameters which characterize the size of the molecular graph and the increasing number of chlorofluoro substitutions, viz., $P_0$ and $P_1$. The second PC is highly correlated with higher-order complexity indices of the SIC type. For the third PC, the highest correlations occur with valence cluster connectivity TIs such as $^4\chi^v_C$ and $^3\chi^v_C$. The fourth PC was characterized by low-order information-theoretic complexity indices of the CIC and SIC types, and the fifth PC by the fifth-order path-connectivity indices. Interpretation beyond the fifth PC becomes more difficult, as can be seen in Table III. These PC/TI correlations agree with our expectations based on previous research.16,17,19,20 Generally, PCs and TIs correlate as follows: PC1 with the size of the molecular graph, PC2 with higher-order complexity indices, PC3 with cluster and path-cluster connectivity, and PC4 with low-order information-theoretic indices.

Table IV reports the correlations and standard errors of boiling point estimates with observed values. Each line of the table represents a different K level. The standard error of estimation was at its minimum of 26.78 °C for K = 5. The correlation, however, continued an upward trend as K increased. The patterns of correlations and standard errors between experimental and estimated values are represented in Figures 2a and 2b, respectively.

REFERENCES

1. E. Mitscherlich, Abhandl. Akad. Wiss. Berlin, (1819) 427.

2. I. Langmuir, J. Am. Chem. Soc. 41 (1919) 1543.

3. T. Moeller, Inorganic Chemistry, John Wiley & Sons, New York 1952.

4. A. Cornish-Bowden and C.W. Wharton, Enzyme Kinetics, Edited by D. Rickwood, IRL Press, Oxford, UK 1988.

5. P. Calabresi and B.A. Chabner, Antineoplastic Agents, in: The Pharmacological Basis of Therapeutics, Edited by A.G. Gilman, T.W. Rall, A.S. Nies, and P. Taylor, Eighth Edition, Pergamon Press, New York 1990.

6. D. Voet and J.G. Voet, Biochemistry, John Wiley & Sons, New York 1990.

7. J. Couraud, E. Escher, D. Regoli, V. Imhoff, B. Rossignol, and P. Pradelles, J. Biol. Chem. 260 (1985) 9461.

8. R.E. Carhart, D.H. Smith, and R. Venkataraghavan, J. Chem. Inf. Comp. Sci. 25 (1985) 64.

9. C.L. Wilkins and M. Randić, Theoret. Chim. Acta (Berl.) 58 (1980) 45.

10. D.H. Rouvray, The evolution of the concept of molecular similarity, in: Concepts and Applications of Molecular Similarity, Edited by M.A. Johnson and G.M. Maggiora, John Wiley and Sons, New York, 15-42 (1990).

11. M.S. Lajiness, Molecular similarity-based methods for selecting compounds for screening, in: Computational Chemical Graph Theory, Edited by D.H. Rouvray, Nova, New York, 299-316 (1990).

12. Concepts and Applications of Molecular Similarity, Edited by M.A. Johnson and G.M. Maggiora, John Wiley & Sons, New York 1990.

13. S.C. Basak, V.R. Magnuson, G.J. Niemi, and R.R. Regal, Discrete Appl. Math. 19 (1988) 17.

14. M. Johnson, S.C. Basak, and G. Maggiora, Math. and Comp. Modelling 11 (1988) 630.

15. S.C. Basak, S. Bertelsen, and G. Grunwald, J. Chem. Inf. Comp. Sci. 34 (1994) 270.

16. S.C. Basak and G.D. Grunwald, SAR and QSAR Environ. Res. 2 (1994) 289.

17. S.C. Basak and G.D. Grunwald, New J. Chem., in press.

29. A.T. Balaban, S.C. Basak, T. Colburn, and G.D. Grunwald, J. Chem. Inf. Comp. Sci. 34 (1994) 1118.

30. A.T. Balaban, N. Joshi, L.B. Kier, and L.H. Hall, J. Chem. Inf. Comp. Sci. 32 (1992) 233.

31. H.J. Wiener, J. Am. Chem. Soc. 69 (1947) 17.

32. M. Randić, J. Am. Chem. Soc. 97 (1975) 6609.

33. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis, Research Studies Press: Letchworth, Hertfordshire, UK 1986.

34. S.C. Basak, D.K. Harriss, and V.R. Magnuson, POLLY: Copyright of the University of Minnesota, 1988.

35. C.E. Shannon, Bell Sys. Tech. J. 27 (1948) 379.

36. E. Trucco, Bull. Math. Biophys. 18 (1956) 65.

37. N. Rashevsky, Bull. Math. Biophys. 17 (1955) 229.

38. R. Sarkar, A.B. Roy, and P.K. Sarkar, Math. Biosci. 39 (1978) 299.

39. A.B. Roy, S.C. Basak, D.K. Harriss, and V.R. Magnuson, Neighborhood complexities and symmetry of chemical graphs, and their biological applications, in: Math. Modelling in Sci. and Tech., Edited by X.J.R. Avula, R.E. Kalman, A.I. Liapis, and E.Y. Rodin, Pergamon Press, New York, 745-750 (1984).

40. S.C. Basak, A.B. Roy, and J.J. Ghosh, Proc. IInd Int. Conf. on Math. Modelling II (1980) 851.

41. S.C. Basak and V.R. Magnuson, Arzneimittel-Forschung/Drug Res. 33 (1983) 501.

42. D. Bonchev and N. Trinajstić, J. Chem. Phys. 67 (1977) 4517.

43. C. Raychaudhury, S.K. Ray, J.J. Ghosh, A.B. Roy, and S.C. Basak, J. Comp. Chem. 5 (1984) 581.

44. A.T. Balaban, Chem. Phys. Letters 89 (1982) 399.

45. A.T. Balaban, Pure & Appl. Chem. 55 (1983) 199.

46. A.T. Balaban, MATCH 21 (1986) 115.

47. SAS Institute Inc. (1988). In, SAS/STAT User's Guide, Release 6.03 Edition. SAS Institute Inc.: Cary, NC, Chapter 34, pp. 949-965.

Table I. Normal boiling points of 276 haloalkanes with 1-4 carbon atoms.

Normal Boiling No. Chemical Name Point

1 Carbon tetrachloride 76.7

2 Trichloromethane 61.2

3 Dichloromethane 39.8

4 Trichlorofluoromethane 23.7

5 Dichlorofluoromethane 8.9

6 Chlorofluoromethane -9.1

7 Chloromethane -24.3

8 Dichlorodifluoromethane -29.8

9 Chlorodifluoromethane -40.8

10 Difluoromethane -51.7

11 Fluoromethane -78.3

12 Chlorotrifluoromethane -81.3

13 Trifluoromethane -82.2

14 Carbon tetrafluoride -128.1

15 Hexachloroe thane 184.4

16 1,1,1,2,2-pentachloro-2-fluoroethane 137.9

17 1,1,1,2-tetrachloro-2-fluoroethane 117

18 1,1,2,2-tetrachloro- 1-fluoroe thane 116.6 Table L Continued.

Normal Boiling No. Chemical Name Point

19 1,1,2-trichloroethane 113.7

20 1,1,2-trichloro-2-fluoroethane 102.4

21 1,1,2,2-tetrachloro-1,2-difluoroethane 92.7

22 1,1,1 -trichloroethane 74

23 1,2-dichloro-1 -fluoroethane 73.8

24 1,1,2-trichloro-1,2-difluoroethane 72.5

25 1,2-dichloro-1,2-difluoroethane 58.5

26 1,1-dichloroe thane 57.2

27 1,1,1 -trichloro-2,2,2-trifluoroethane r. 45.8

28 1,2-dichloro-l ,1-difluoroethane 46.6

29 2-chloro-1,1 -difluoroethane 35.1

30 1,1 -dichloro-1 -fluoroethane 32

31 2,2-dichloro-1,1,1 -trifluoroethane 28.7

32 1 -chloro-1 -fluoroethane 16.1

33 Chloroe thane 12.3

34 1-chloro-1,1,2-trifluoroethane 12

35 2-chloro-1,1,1 - trifluoroethane 6.9

36 1,1,2-trifluoroethane 5 Table I. Continued.

Normal Boiling No. Chemical Name Point

37 1,2-dichloro-1,1,2,2-tetrafluoroethane 3.6

38 2,2-dichloro-1,1,1,2-tetrafluoroethane 3.6

39 1 -chloro-1,1,2,2-tetrafluoroethane -12

40 1,1,2,2-tetrafluoroethane -22.8

41 1,1 -difluoroe thane -25.8

42 1,1,1,2-tetrafluoroethane -26.1

43 Fluoroethane -37.8

44 1,1,1 -trifluoroethane -47.3

45 1,1,1,2,2-pentafluoroethane -48.3

46 Octachloropropane 259

47 1,1,1,2,2,3,3-heptachloropropane 247

48 1,1,1,2,2,3,3-heptachloro-3-fluoropropane 236.8

49 1,1,2,2,3,3-hexachloropropane 218.5

50 1,1,1,2,2,3-hexachloropropane 218

51 1,1,1,2,3,3-hexachloro-2,3-difluoropropane 196

52 1,1,1,2,2,3-hexachloro-3,3-difluoropropane 193.4

53 1,1,1,2,2-pentachloro-3,3-difluoropropane 175

54 1,1,1,3,3-pentachloro-2,2-difluoropropane 174 Table L Continued.

Nonnal Boiling No. Chemical Name Point

55 1,2,2,1 -tetrachloropropane 165.5

56 1,1,3,3-tetrachloropropane 161.9

57 1,2,3-trichloropropane 156.8

58 1,1,2,3,3-pentachloro- 1,2,3-trifluoropropane 154.7

59 1,1,2,2-tetrachloropropane 153

60 1,1,2,2,3-pentachloro-1,3,3-trifluoropropane 152.3

61 1,1,1,3-tetrachloro-2,2-difluoropropane 151.2

62 1,1,1,2-tetrachloropropane 150.4

63 1,1,2,2-tetrachloro-3,3-difluoropropane 147.6

64 1,1,3-trichloropropane 145.5

65 1,1,2,2-tetrachloro- 1,3,3-trifluoropropane 134.6

66 1,2,3-trichloro-2-fluoropropane 130.8

67 1,1,2,3-tetrachloro-2,3,3-trifluoropropane 129.8

68 1,1,3-trichloro-2,2-difluoropropane 127.3

69 1,1,2,2-tetrachloro-3,3,3-trifluoropropane 126.2

70 1,1,2-trichloro-2-fluoropropane 116.7

71 1,1,3,3-tetrachloro-1,2,2,3-tetrafluoropropane 114

72 1,1,1,3-tetrachloro-2,2,3,3-tetrafluoropropane 113.9 Table L Continued.

Normal Boiling No. Chemical Name Point

73 1,1,2-trichloro-1 -fluoropropane 113.5

74 1,1,1,2- tetrachloro-2,3,3,3-tetrafluoropropane 112.5

75 1,1,2,2-tetrachloro-1,3,3,3-tetrafluoropropane 112.3

76 1,2,2,3-tetrachloro-1,1,3,3-tetrafluoropropane 112.2

77 1,1,3-trichloro-l ,2,2-trifluoropropane 109.5

78 1,1,1 -trichloropropane 108

79 1,2,2-trichloro-3,3,3-trifluoropropane 104.5

80 1,3-dichloro-2,2-difluoropropane 96.7

81 1,3,3-trichloro-1,1,2,2-tetrafluoropropane 91.8

82 1,2,2-trichloro-1,1-difluoropropane 90.2

83 1,2,3-trichloro-1,1,2,3-tetrafluoropropane 90

84 2,3 -dichloro-1,1,2,3-tetrafluoropropane 89.8

85 1,2,3-trichloro-1,1,3,3-tetrafluoropropane 88

86 1 -chIoro-3-fluoropropane 81

87 1,2,3-trichloro-1,1,2,3,3-pentafluoropropane 73.7

88 2,3,3-trichloro-l, 1,1,2,3-pentafluoropropane 73.4

89 1,3,3-trichloro-1,1,2,2,3-pentafluoropropane 73

90 2,2,3-trichloro-1,1,1,3,3-pentafluoropropane 72 Table L Continued.

Normal Boiling No. Chemical Name Point

91 1,2-dichloro-1,1 -difluoropropane 70

92 2,2-dichloropropane 69.3

93 1 -chloro-2-fluoropropane 68.5

94 1,1 -dichloro-1 -fluoropropane 66.6

95 1 -chloro-2,2-difluoropropane 55.1

96 1-chloro-1,2-difluoropropane 52.9

97 2,2-dichloro-1,1,1-trifluoropropane 48.8

98 1-chloropropane 46.6

99 3,3-dichloro-1,1,1,2,2-pentafluoropropane 45.5

100 1,3-difluoropropane 41.6

101 2-chloropropane 35.7

102 1,3-dichloro-1,1,2,2,3,3-hexafluoropropane 35.7

103 2-chloro-2-fluoropropane 35.2

104 3,3-dichloro-1,1,1,2,2,3-hexafluoropropane 35

105 3-chloro-1,1,1,2,2-pentafluoropropane 27.6

106 1-chloro-1,1-difluoropropane 25.4

107 1-chloro-1,1,2,2,3,3-hexafluoropropane 21

108 1,1,1,2,3-pentafluoropropane 20

109 1,1,2,2,3,3-hexafluoropropane 10.5

110 1,1-difluoropropane 7.5

111 1,1,1,2,3,3-hexafluoropropane 5

112 1,1,1,2,2,3-hexafluoropropane 1.2

113 2-chloro-1,1,1,2,3,3,3-heptafluoropropane -2

114 2-fluoropropane -9.7

115 1,1,1-trifluoropropane -12.5

116 1,1,1,2,3,3,3-heptafluoropropane -19

117 Hexafluoroethane -78.2

118 1-chloro-2-fluoroethane 53

119 1-chloro-1,1-difluoroethane -9.8

120 1-chloro-1,1,2,2,2-pentafluoroethane -38

121 1,2-dichloroethane 83.5

122 1,1-dichloro-2,2-difluoroethane 60

123 1,1-dichloro-1,2,2-trifluoroethane 30.2

124 1,2-dichloro-1,1,2-trifluoroethane 28.2

125 1,1,2-trichloro-1-fluoroethane 88.5

126 1,1,1-trichloro-2,2-difluoroethane 73

127 1,1,2-trichloro-2,2-difluoroethane 71.2

128 1,1,2-trichloro-1,2,2-trifluoroethane 47.6

129 1,1,1,2-tetrachloroethane 130.5

130 1,1,2,2-tetrachloroethane 146.3

131 1,1,1,2-tetrachloro-2,2-difluoroethane 91.6

132 1,1,1,2,2-pentachloroethane 161.9

133 1,1,2,2,3-pentachloropropane 196

134 1,1,2,3,3-pentachloropropane 199

135 1,1,2,2,3-pentachloro-3,3-difluoropropane 168.4

136 1,1,2,3,3-pentachloro-1,3-difluoropropane 167.4

137 1,1,1,2,2-pentachloro-3,3,3-trifluoropropane 153

138 1,1,1,2,3-pentachloro-2,3,3-trifluoropropane 153.3

139 1,1,1,3,3-pentachloro-2,2,3-trifluoropropane 153

140 1,1,1,2,3,3-hexachloropropane 217

141 1,1,1,3,3,3-hexachloropropane 206

142 1,1,1,2,2,3-hexachloro-3-fluoropropane 210

143 1,1,1,2,3,3-hexachloro-3-fluoropropane 207

144 1,1,2,2,3,3-hexachloro-1-fluoropropane 210

145 1,1,1,3,3,3-hexachloro-2,2-difluoropropane 194.2

146 1,1,2,2,3,3-hexachloro-1,3-difluoropropane 194.2

147 1,1,1,2,3,3,3-heptachloropropane 249

148 1,2-dichloro-1,1,2,3,3-pentafluoropropane 56.3

149 2,3-dichloro-1,1,1,2,3-pentafluoropropane 56

150 1,1,2-trichloropropane 133

151 1,2,2-trichloropropane 122

152 1,1,1-trichloro-2,2-difluoropropane 102

153 1,2,2-trichloro-1,1,3,3-tetrafluoropropane 92

154 3,3,3-trichloro-1,1,1,2,2-pentafluoropropane 70.5

155 1,1,2-trichloro-1,2-difluoropropane 97.7

156 1,1,3-trichloro-3,3-difluoropropane 107.8

157 3-chloro-1,1,1,3,3-pentafluoropropane 28.4

158 2-chloro-1,1,1,3,3,3-hexafluoropropane 15.5

159 3-chloro-1,1,1,2,2,3,3-heptafluoropropane -2.5

160 3-chloro-1,1,1,2,2,3-hexafluoropropane 20

161 1,1-dichloropropane 88.1

162 1,2-dichloropropane 96

163 1,3-dichloropropane 120.8

164 1,2-dichloro-2-fluoropropane 88.6

165 1,2-dichloro-1-fluoropropane 93

166 1,1-dichloro-2,2-difluoropropane 79

167 1,3-dichloro-1,1-difluoropropane 80.8

168 1,1-dichloro-1,2,2-trifluoropropane 60.2

169 3,3-dichloro-1,1,1-trifluoropropane 72.4

170 1,2-dichloro-1,1,2-trifluoropropane 55.6

171 2,3-dichloro-1,1,1-trifluoropropane 76.7

172 1,3-dichloro-1,1,2,2-tetrafluoropropane 68.2

173 2,3-dichloro-1,1,1,3,3-pentafluoropropane 50.4

174 2,3-dichloro-1,1,1,2,3,3-hexafluoropropane 34.7

175 1,2,3-trichloro-1,1-difluoropropane 114.3

176 1,1,1-trichloro-3,3,3-trifluoropropane 95.1

177 1,1,2-trichloro-3,3,3-trifluoropropane 106.8

178 2,3,3-trichloro-1,1,1,3-tetrafluoropropane 87.2

179 1,1,3-trichloro-1,2,2,3-tetrafluoropropane 90.5

180 1,1,1,3-tetrachloropropane 158

181 1,1,2,3-tetrachloropropane 180

182 1,1,1,2-tetrachloro-2-fluoropropane 139.6

183 1,1,2,2-tetrachloro-1-fluoropropane 135

184 1,1,1,3-tetrachloro-3,3-difluoropropane 132

185 1,1,1,2-tetrachloro-3,3,3-trifluoropropane 125.1

186 1,1,2,3-tetrachloro-1,3,3-trifluoropropane 128.7

187 1,1,3,3-tetrachloro-2,2,3-trifluoropropane 127

188 1,1,2,3-tetrachloro-1,2,3,3-tetrafluoropropane 112.5

189 1-fluoropropane -2.3

190 -38

191 2,2-difluoropropane -0.5

192 1,1,1,3-tetrafluoropropane 29.4

193 1,1,1,3,3,3-hexafluoropropane 0.8

194 1,1,1,2,2,3,3-heptafluoropropane -17

195 1-chloro-1-fluoropropane 48

196 3-chloro-1,1,1-trifluoropropane 45.1

197 2-chloro-1,1-difluoropropane 52

198 2-chloro-1,1,1-trifluoropropane 30

199 1-fluorobutane 32.2

200 2-fluorobutane 24.7

201 1,1,1,2,2,4,4,4-octafluorobutane 18

202 1,1,2,2,3,3,4,4-octafluorobutane 43

203 1,1,1,2,2,3,3,4,4-nonafluorobutane 14

204 Decafluorobutane -2

205 1-chlorobutane 78.5

206 2-chlorobutane 68.5

207 1-chloro-4-fluorobutane 115

208 1-chloro-1,1-difluorobutane 55.5

209 3-chloro-1,1,1-trifluorobutane 66

210 1-chloro-1,1,3,3-tetrafluorobutane 70.5

211 2-chloro-1,1,1,3,3,3-hexafluorobutane 51

212 4-chloro-1,1,1,2,2,3,3-heptafluorobutane 54

213 4-chloro-1,1,1,2,2,3,3,4,4-nonafluorobutane 30

214 1,1-dichlorobutane 115

215 1,2-dichlorobutane 123.5

216 1,3-dichlorobutane 133

217 1,4-dichlorobutane 155

218 1,3-dichloro-1,1,3-trifluorobutane 129

219 3,4-dichloro-1,1,1,2,2,3-hexafluorobutane 72

220 1,4-dichloro-1,1,3-trifluorobutane 118.5

221 2,3-dichloro-1,1,1,4,4,4-hexafluorobutane 78

222 4,4-dichloro-1,1,1,2,2,3,3-heptafluorobutane 76.5

223 4,4-dichloro-1,1,1,2,2,3,3,4-octafluorobutane 62.8

224 3,4-dichloro-1,1,1,2,2,3,4,4-octafluorobutane 66

225 1,4-dichloro-1,1,2,2,3,3,4,4-octafluorobutane 64

226 2,2-dichloro-1,1,1,3,3,4,4,4-octafluorobutane 64

227 2,3-dichloro-1,1,1,2,3,4,4,4-octafluorobutane 64

228 1,1,1-trichlorobutane 133.5

229 1,1,2-trichlorobutane 156.8

230 1,1,3-trichlorobutane 153.8

231 1,1,4-trichlorobutane 183.8

232 2,2,3-trichloro-1,1,1,4,4,4-hexafluorobutane 104

233 4,4,4-trichloro-1,1,1,2,2,3,3-heptafluorobutane 96.5

234 1,3,4-trichloro-1,1,2,2,3,4,4-heptafluorobutane 99

235 2,2,3-trichloro-1,1,1,3,4,4,4-heptafluorobutane 97.4

236 1,1,4,4-tetrachlorobutane 200

237 1,2,4,4-tetrachloro-1,1,2,3,3,4-hexafluorobutane 134

238 1,2,3,4-tetrachloro-1,1,2,3,4,4-hexafluorobutane 134

239 1,1,2,3,4,4-hexachloro-1,2,3,4-tetrafluorobutane 208

240 1-chloroisobutane 68.3

241 2-chloroisobutane 50.7

242 1-chloro-1-fluoroisobutane 82.5

243 1,1-dichloroisobutane 105

244 1,2-dichloroisobutane 106.5

245 1,3-dichloroisobutane 136

246 1,1-dichloro-1-fluoroisobutane 107

247 1,2,3-trichloroisobutane 163

248 1,1,2,3-tetrachloroisobutane 191

249 1,2,3-trichloro-2-chloromethylpropane 211

250 1,1,2,3-tetrachloro-2-chloromethylpropane 227

251 1-fluoroisobutane 16

252 2-fluoroisobutane 12

271 2,3-dichlorobutane 116

272 2,2,3-trichlorobutane 143

273 1,2,3-trichlorobutane 166

274 1,4-difluorobutane 77.8

275 2,2-difluorobutane 30.9

276 1,2-difluoroethane 26

Table II. Topological index symbols and definitions.

$I_D^W$  Information index for the magnitudes of distances between all possible pairs of vertices of a graph

$\bar{I}_D^W$  Mean information index for the magnitude of distance

$W$  Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph

$I^D$  Degree complexity

$H^V$  Graph vertex complexity

$H^D$  Graph distance complexity

$IC$  Information content of the distance matrix partitioned by frequency of occurrences of distance h

$O$  Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph

$I_{ORB}$  Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices

$O_{ORB}$  Maximum order of neighborhood of vertices for $I_{ORB}$ within the hydrogen-suppressed graph

$M_1$  A Zagreb group parameter = sum of square of degree over all vertices

$M_2$  A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices

$IC_r$  Mean information content or complexity of a graph based on the rth (r = 0-3) order neighborhood of vertices in a hydrogen-filled graph

$SIC_r$  Structural information content for rth (r = 0-3) order neighborhood of vertices in a hydrogen-filled graph

$CIC_r$  Complementary information content for rth (r = 0-3) order neighborhood of vertices in a hydrogen-filled graph

$^h\chi$  Path connectivity index of order h = 0-5

$^h\chi_C$  Cluster connectivity index of order h = 3-6

$^h\chi_{PC}$  Path-cluster connectivity index of order h = 4-6

$^h\chi^v$  Valence path connectivity index of order h = 0-5

$^h\chi^v_C$  Valence cluster connectivity index of order h = 3-6

$^h\chi^v_{PC}$  Valence path-cluster connectivity index of order h = 4-6

$P_h$  Number of paths of length h = 0-5

$J$  Balaban's J index based on distance

$J^X$  Balaban's J index based on relative electronegativities

$J^Y$  Balaban's J index based on relative covalent radii

Table IV. Summary of K nearest neighbor normal boiling point estimation for 276 chlorofluorohydrocarbons.

K r s.e. (°C)

1 0.842 38.2

2 0.914 28.1

3 0.920 27.5

4 0.923 27.4

5 0.928 26.8

6 0.929 27.4

7 0.931 27.3

8 0.937 26.8

9 0.940 26.8

10 0.939 27.4

15 0.937 29.3

20 0.943 30.2

25 0.944 31.4

Figure Captions

Figure 1. Hydrogen-suppressed graph of isobutane.

Figure 2a. Pattern of correlation (r) for different values of K (1-25) in the K nearest neighbor estimation of boiling point using PC based similarity method.

Figure 2b. Pattern of standard error of estimates (s.e.) for different values of K (1-25) in the K nearest neighbor estimation of boiling point using PC based similarity method.

[Figures not reproduced: Figure 1 (graph G1 with vertices labeled 1-4); Figure 2 panels "Correlation" and "Standard Error".]

Use of Nonempirical Parameters in Estimating Mutagenicity of Chemicals: A Molecular Similarity Approach

Subhash C. Basak and Brian D. Gute

Natural Resources Research Institute

University of Minnesota, Duluth

5013 Miller Trunk Highway

Duluth, MN 55811, USA

Proceedings of the International Congress on Hazardous Waste: Impact on Human and Ecological Health, in press, 1995.

Introduction

Environmental and human health risk assessment of chemicals is often carried out using

insufficient experimental data. This is true for the large number of industrial chemicals as well as

for substances identified in industrial effluent, hazardous waste sites and environmental monitoring

surveys (Auer et al. 1990). In 1984, the National Research Council studied the availability of

toxicity data on industrial chemicals and found that many of these chemicals have little or no test

data (NRC 1984). About 13 million distinct chemicals have been registered with Chemical Abstracts

Service (CAS) and the list is growing by nearly 500,000 per year. Out of these chemicals, about

1,000 enter into societal use every year (Arcos 1987). Few of these chemicals are submitted with

the empirical data necessary for risk assessment. In the United States, the Toxic Substances Control

Act (TSCA) Inventory has over 74,000 entries and the list is growing by nearly 3,000 per year (Auer

et al. 1990, TSCA 1976). Of the roughly 3,000 chemicals submitted yearly to the United States

Environmental Protection Agency (USEPA) for the premanufacture notification (PMN) process,

more than 50% have no experimental data at all, less than 15% have empirical mutagenicity data,

and only about 6% have experimental ecotoxicological and environmental fate data. This dearth of

empirical data is also true for many of the over 700 chemicals found on the Superfund list of

hazardous substances (Auer et al. 1990).

A large number of physicochemical and biological test data on chemicals are a prerequisite to the proper estimation of the hazards posed by a chemical species. Table 1 gives a partial list of such properties. Because such data are often lacking, a variety of structural, physicochemical, and biochemical properties are used in hazard estimation. For example, in assessing the carcinogenic potential of chemicals, three classes of criteria have been used by experts: a) structural criteria; b) functional criteria; and c) guilt-by-association criteria. Structural criteria consist of structural analogy of a chemical species with known and well-established chemical carcinogens. Structural factors considered could be molecular size, shape, branching pattern, symmetry, charge, etc., to list just a few. Frequently, structural characteristics of molecules are not enough for the reliable estimation of the carcinogenic risk posed by them. Functional criteria, i.e., results of short-term tests, are often used to supplement structural criteria to make the hazard estimation more reliable. The Ames test, mammalian cell transformation, unscheduled DNA synthesis and the pattern of regulation of gene expression are examples of frequently used functional criteria relevant to carcinogenic risk assessment. The guilt-by-association criterion holds that, even though a chemical may have been found to be inactive through normal testing (bioassays), it may need to be retested more stringently if it belongs to a class of compounds which contains potent carcinogens (Arcos 1987).

In the assessment of environmental hazards, often the physicochemical and biological test

data essential to hazard estimation are unavailable. In such cases, regulators use a two-tiered

approach for the prediction of hazard from chemical structure: a) class-specific quantitative structure-activity relationship (QSAR) models, and b) use of analogs of chemicals (Auer et al.

1990).

QSARs are mathematical models which use various quantifiers of chemical structure and empirical parameters (or properties) in predicting physicochemical and biological properties of molecules (Basak et al. 1990, Hansch 1976). In class-specific QSARs, a chemical is first assigned a specific structural class and the QSAR of that particular class of chemicals is used to predict the potential toxicity of the molecule of interest.

If a chemical is very complex, e.g., contains a large number of functional groups, a simplistic

attempt at classification is almost certain to fail. The use of class-specific QSARs in hazard

assessment of such chemicals will be limited. In such cases, one resorts to the approach of selecting

analogs of the chemical of interest and estimating the hazardous potential of the chemical from the

toxicity of its analogs. Analogs of new chemicals are routinely used by regulatory agencies like

USEPA in hazard assessment (Auer et al. 1990). Chemical X is considered to be an analog of (or

similar to) chemical Y if X and Y resemble each other in one or more critical aspects, e.g.,

structurally, stereo-electronically, physicochemically, etc. The use of analogs is based on the tacit

assumption that similar structures have similar properties (Johnson et al. 1988, Johnson and

Maggiora 1990).

A perusal of the approaches used in carcinogenic risk assessment, and in environmental and

ecotoxicological hazard estimation, indicates that the candidate chemical is compared with known

toxicants using structural and/or functional criteria. These analogs (structurally-related chemicals) are often selected by experts based on their individual judgements and a selected set of structural features.

Analogs of chemicals can be selected using empirical descriptors or theoretical molecular descriptors (Basak and Grunwald 1995). In view of the paucity of available experimental data for environmental pollutants, it is desirable to develop methods for the selection of chemical analogs using nonempirical variables which can be computed directly from the molecular structure (Basak and Grunwald 1995a).

In recent years, we have developed a number of methods for quantifying intermolecular similarity. Such methods are based on topological indices (TIs) and substructural variables like atom pairs. TIs are numerical graph invariants which encode information like size, shape, branching pattern, symmetry and certain aspects of stereo-electronic factors associated with molecules. Topological parameters have been found to be useful in predicting physicochemical as well as biological properties of many different congeneric sets of molecules (Basak 1988; Basak and Grunwald 1993; Basak et al. 1982, 1983, 1984, 1986, 1987a, 1987b, 1990, 1991; Kier and Hall 1986; Niemi et al. 1992; Randic 1975). Molecular similarity methods based on substructures and TIs have been used successfully in the selection of analogs and in the discovery of novel drugs active against human immunodeficiency virus (HIV), as well as in the estimation of different physicochemical and toxicological properties (Basak et al. 1988, 1994, 1995g; Basak and Grunwald 1994, 1995a, 1995b, 1995c, 1995d, 1995e, 1995f; Lajiness 1990; Wilkins and Randic 1980).

In this paper, similarity methods based on topological indices and atom pairs have been used to estimate the inhibitory effects (pIC50) of a set of nineteen aliphatic alcohols on microsomal p-hydroxylation of anilines by cytochrome P450.

Database

Experimental pIC50 values for the inhibition of microsomal cytochrome P450 p-hydroxylation of anilines by nineteen alkanols are given in Table 2. These data were taken from Cohen and

Mannering (1973). The original set contained twenty compounds; one of them, methanol, was deleted. As a result of its single unique atom pair, the similarity of methanol to other compounds cannot be computed using the atom pair methodology.

Computation of Parameters

Topological Indices

The sixty-four topological indices used in this study were calculated using POLLY 2.3, which is capable of calculating a total of ninety-eight TIs from the SMILES line notation input of chemical structures (Basak et al. 1988a). The TIs calculated are listed in Table 3 and include the Wiener index (Wiener 1947), connectivity indices (Kier and Hall 1986, Randic 1975), information theoretic indices defined on distance matrices of graphs (Bonchev and Trinajstic 1977, Raychaudhury et al. 1984), parameters derived on the neighborhood complexity of vertices in hydrogen-filled molecular graphs (Basak 1987, Basak and Magnuson 1983, Basak et al. 1980, Roy et al. 1984), path lengths, and Balaban's J indices (Balaban 1982, 1983, 1986). Methods for the calculation of a few of the TIs used in this paper are described below.

The Wiener index W, the first topological index reported in the chemical literature, may be calculated from the distance matrix D(G) of a hydrogen-suppressed chemical graph G as the sum of the entries in the upper triangular distance submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a symmetric n×n matrix with elements $d_{ij}$ equal to the distance between vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We give below the distance matrix $D(G_1)$ of the labeled hydrogen-suppressed graph $G_1$ of 1-butanol (Figure 1):

«Insert Figure 1 here»

$$
D(G_1) =
\begin{array}{c|ccccc}
  & (1) & (2) & (3) & (4) & (5) \\ \hline
1 & 0 & 1 & 2 & 3 & 4 \\
2 & 1 & 0 & 1 & 2 & 3 \\
3 & 2 & 1 & 0 & 1 & 2 \\
4 & 3 & 2 & 1 & 0 & 1 \\
5 & 4 & 3 & 2 & 1 & 0
\end{array}
$$

W is calculated as:

$$
W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \qquad (1)
$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h.
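As a check on eq. 1, the short Python sketch below rebuilds the distance matrix of the 1-butanol graph and evaluates W both ways. It is an illustration only, not the POLLY implementation; the adjacency list and the function names are assumptions made for this example.

```python
from collections import Counter, deque


def distance_matrix(adjacency):
    """All-pairs shortest-path distances (in bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def wiener_index(dist):
    """W as the sum of the upper-triangular entries of the distance matrix."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


def distance_counts(dist):
    """g_h: number of unordered vertex pairs separated by distance h."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))


# Hydrogen-suppressed graph of 1-butanol, chain C-C-C-C-O, vertices 0..4
# (hypothetical encoding chosen for this sketch).
BUTANOL = [[1], [0, 2], [1, 3], [2, 4], [3]]

D = distance_matrix(BUTANOL)
W = wiener_index(D)
g = distance_counts(D)
W_from_counts = sum(h * count for h, count in g.items())
```

Both routes give W = 20 for this five-vertex chain, with g_1 = 4, g_2 = 3, g_3 = 2 and g_4 = 1.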

Information-theoretic topological indices are calculated by the application of information

theory on chemical graphs. An appropriate set A of n elements is derived from a molecular graph

G depending upon certain structural characteristics. On the basis of an equivalence relation defined

on A, the set A is partitioned into disjoint subsets $A_i$ of order $n_i$ ($i = 1, 2, \ldots, h$; $\sum_i n_i = n$). A probability

distribution is then assigned to the set of equivalence classes:

$$
\begin{pmatrix}
A_1, & A_2, & \ldots, & A_h \\
p_1, & p_2, & \ldots, & p_h
\end{pmatrix}
$$

where $p_i = n_i/n$ is the probability that a randomly selected element of A will occur in the ith subset.

The mean information content of an element of A is defined by Shannon's relation (Shannon 1948):

$$
IC = -\sum_{i=1}^{h} p_i \log_2 p_i \qquad (2)
$$

The logarithm is taken at base 2 for measuring the information content in bits. The total information content of the set A is then n times IC.
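A minimal sketch of eq. 2 in Python follows. The partition sizes are hypothetical and chosen only so the arithmetic is easy to follow by hand; the function name is likewise an assumption.

```python
import math


def mean_information_content(class_sizes):
    """Shannon mean information content (bits) of a partition given as
    a list of equivalence-class sizes n_i."""
    n = sum(class_sizes)
    return -sum((k / n) * math.log2(k / n) for k in class_sizes if k > 0)


# Hypothetical partition of a set of 8 elements into classes of size 4, 2, 2,
# so p = (1/2, 1/4, 1/4).
sizes = [4, 2, 2]
ic = mean_information_content(sizes)   # 0.5*1 + 0.25*2 + 0.25*2 bits
total_ic = sum(sizes) * ic             # total information content, n times IC
```

For this partition IC is 1.5 bits and the total information content is 12 bits. A partition with a single class carries no information, and a partition into n singletons gives the maximum, log2 n.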

To account for the chemical nature of vertices as well as their bonding pattern, Sarkar et al.

(1978) calculated information content (IC) of chemical graphs on the basis of an equivalence relation

where two atoms of the same element are considered equivalent if they possess an identical first-

order topological neighborhood. Since properties of atoms or reaction centers are often modulated

by physicochemical characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed

essential to extend this approach to account for higher-order neighbors of vertices. This can be

accomplished by defining open spheres for all vertices of a chemical graph. One can construct such

open spheres for higher integral values of r. For a particular value of r, the collection of all such

open spheres S(v,r), where v runs over the whole vertex set V, forms a neighborhood system of the

vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets

consisting of topological neighborhoods of vertices of up to rth order neighbors.

This approach has been used to generate the indices of neighborhood symmetry. In this method, chemical species are symbolized by weighted linear graphs. Two vertices $u_0$ and $v_0$ of a molecular graph are said to be equivalent with respect to the rth order neighborhood if, and only if, corresponding to each path $u_0, u_1, \ldots, u_r$ of length r, there is a distinct path $v_0, v_1, \ldots, v_r$ of the same length, such that the paths have similar edge weights, and both $u_0$ and $v_0$ are connected to the same number and type of atoms up to the rth order bonded neighbors. The detailed equivalence relation is described in our earlier studies (Roy et al. 1984).

Once partitioning of the vertex set for a particular order of neighborhood is completed, $IC_r$ is calculated from eq. 2. Subsequent information theoretic invariants include structural information content ($SIC_r$), shown in eq. 3 (Basak et al. 1980), and complementary information content ($CIC_r$), shown in eq. 4 (Basak and Magnuson 1983). In both equations, n is the total number of vertices of the graph:

$$
SIC_r = IC_r / \log_2 n \qquad (3)
$$

$$
CIC_r = \log_2 n - IC_r \qquad (4)
$$
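Equations 2 to 4 can be illustrated for the simplest case, r = 0. The sketch below assumes that at zero order the vertices of the hydrogen-filled graph are partitioned by element type alone; that convention is an assumption of this example, and the higher-order partitions described in the text would need the full neighborhood equivalence relation of Roy et al. (1984).

```python
import math
from collections import Counter


def ic_sic_cic(vertex_classes):
    """Return (IC, SIC, CIC) for a vertex partition given as a list of
    class labels, one label per vertex of the hydrogen-filled graph."""
    n = len(vertex_classes)
    counts = Counter(vertex_classes)
    ic = -sum((k / n) * math.log2(k / n) for k in counts.values())
    sic = ic / math.log2(n)      # eq. 3
    cic = math.log2(n) - ic      # eq. 4
    return ic, sic, cic


# Hydrogen-filled graph of 1-butanol, C4H10O: 15 vertices.
# Zero-order classes assumed to be the element symbols.
butanol_atoms = ["C"] * 4 + ["H"] * 10 + ["O"]
ic0, sic0, cic0 = ic_sic_cic(butanol_atoms)
```

By construction SIC_r lies between 0 and 1, and IC_r and CIC_r always sum to log2 n, so the two indices carry complementary information about the same partition.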

The information-theoretic index on graph distance, $I_D^W$, is calculated from the distance matrix D(G) of a chemical graph G by the method of Bonchev and Trinajstic (1977):

$$
I_D^W = W \log_2 W - \sum_{h} g_h \cdot h \log_2 h \qquad (5)
$$

The mean information index, $\bar{I}_D^W$, is found by dividing the information index $I_D^W$ by W.
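The following sketch evaluates eq. 5 and the mean information index from a table of distance counts g_h. The counts used are those of the five-vertex chain of 1-butanol (four pairs at distance 1, three at 2, two at 3, one at 4); the function name and the dictionary encoding are assumptions made for illustration.

```python
import math


def information_index_on_distance(g):
    """Given g[h] = number of unordered vertex pairs at distance h,
    return (W, I_D^W, mean I_D^W) following eq. 5."""
    W = sum(h * count for h, count in g.items())
    i_dw = W * math.log2(W) - sum(
        count * h * math.log2(h) for h, count in g.items()
    )
    return W, i_dw, i_dw / W


# Distance counts for the hydrogen-suppressed graph of 1-butanol.
g_butanol = {1: 4, 2: 3, 3: 2, 4: 1}
W, i_dw, mean_i_dw = information_index_on_distance(g_butanol)
```

Pairs at distance 1 contribute nothing to the subtracted sum because log2 1 = 0, so only the longer distances reduce the index below W log2 W.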

Indices developed by A.T. Balaban (1982, 1983, 1986) were calculated and used in this analysis. Balaban denoted these as J indices, which are based upon the distance sums $s_i$ of a chemical graph. J is defined as:

$$
J = \frac{q}{\mu + 1} \sum_{\text{edges } ij} (s_i s_j)^{-1/2} \qquad (6)
$$

where the cyclomatic number $\mu$ (or number of rings in the graph) is $\mu = q - n + 1$, with q adjacencies or edges and n vertices. In the original definition of J, the term $s_i$ referred to the row distance sum for vertex i in either the distance matrix (D) or the multigraph distance matrix (M):

$$
s_i = \sum_{j} d_{ij} \qquad (7)
$$

For distance matrix D, each matrix element $d_{ij}$ represents the distance from vertex i to vertex j. The diagonal entries are all zero and the distance between any two adjacent vertices is one. All other entries in this matrix are the number of edges or bonds traversed in the shortest path from i to j. To account for the periodicity of chemical properties for heteroatoms, Balaban proposed two J variants: $J^X$, which includes corrections for heteroatom electronegativities, and $J^Y$, which has corrections for heteroatom covalent radii (Balaban 1986).
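A small Python sketch of eqs. 6 and 7 is given below for the plain distance-based J. It treats every vertex and bond alike, so it does not include the heteroatom corrections of the electronegativity and covalent-radius variants; the adjacency-list encoding and function names are assumptions of this example, and the graph is assumed to be connected.

```python
from collections import deque


def distance_sums(adjacency):
    """Row sums s_i of the distance matrix (eq. 7), by breadth-first search."""
    sums = []
    for start in range(len(adjacency)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        sums.append(sum(dist.values()))
    return sums


def balaban_j(adjacency):
    """Distance-based Balaban J (eq. 6) for a connected simple graph."""
    n = len(adjacency)
    edges = [(i, j) for i in range(n) for j in adjacency[i] if i < j]
    q = len(edges)
    mu = q - n + 1                      # cyclomatic number
    s = distance_sums(adjacency)
    return q / (mu + 1) * sum((s[i] * s[j]) ** -0.5 for i, j in edges)


# Five-vertex chain (skeleton of 1-butanol with the oxygen treated as a
# plain vertex, i.e. without heteroatom corrections).
CHAIN5 = [[1], [0, 2], [1, 3], [2, 4], [3]]
s = distance_sums(CHAIN5)
J = balaban_j(CHAIN5)
```

For this acyclic chain the distance sums are 10, 7, 6, 7, 10, the cyclomatic number is zero, and J comes out close to 2.19.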

Atom Pairs

Atom pairs were calculated using the method of Carhart et al. (1985). An atom pair is defined as a substructure consisting of two non-hydrogen atoms i and j and their interatomic separation:

<atom descriptor i> - <separation> - <atom descriptor j>

where each atom descriptor contains information about the element type, number of non-hydrogen neighbors and the number of π electrons. Interatomic separation of two atoms is the number of atoms traversed in the shortest bond-by-bond path containing both atoms.

Figure 2 demonstrates the calculation of atom pairs for 1-butanol. 1-Butanol has 10 total atom pairs, 9 of which are unique.

«Insert Figure 2 here»

In Figure 2, $X_n$ (n = 1 or 2 in this example) represents the number of non-hydrogen neighbors and the C and O are atomic symbols. These are the elements of the atom descriptors. The "-k-", k = 2, 3, 4, and 5, are the separation values.

The first atom pair ($CX_1$ - 2 - $CX_2$) corresponds to path ab, a methyl carbon with 1 non-hydrogen neighbor bonded to a methylene carbon with 2 non-hydrogen neighbors. The path length of ab is 2. Paths bc and cd are identical: each is a path of length 2 joining two methylene carbons, each of which has two non-hydrogen neighbors; hence this atom pair has a frequency of 2. Path de, involving a methylene carbon and the oxygen of the hydroxyl group, defines atom pair 3.
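The atom pair count for 1-butanol can be reproduced with a few lines of Python. This is a simplified sketch, not the original atom pair software: the π-electron field of the atom descriptor is left out, which is harmless here only because saturated alkanols have no π electrons, and the string format of the descriptors is an assumption of this example.

```python
from collections import Counter, deque


def bond_distances(adjacency, start):
    """Shortest bond counts from `start` to every vertex (breadth-first)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def atom_pairs(elements, adjacency):
    """Count atom pairs (descriptor, separation, descriptor).

    Descriptor = element symbol + 'X' + number of non-hydrogen neighbors.
    Separation = number of atoms on the shortest path, both ends included.
    """
    desc = [f"{e}X{len(nb)}" for e, nb in zip(elements, adjacency)]
    pairs = Counter()
    for i in range(len(elements)):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, len(elements)):
            a, b = sorted((desc[i], desc[j]))   # orientation-independent key
            pairs[(a, dist[j] + 1, b)] += 1
    return pairs


# 1-butanol, atoms a..e = C, C, C, C, O along the chain.
elements = ["C", "C", "C", "C", "O"]
adjacency = [[1], [0, 2], [1, 3], [2, 4], [3]]
pairs = atom_pairs(elements, adjacency)
total_pairs = sum(pairs.values())
unique_pairs = len(pairs)
```

The sketch finds 10 atom pairs in total and 9 distinct ones, with the pair CX2 - 2 - CX2 occurring twice (paths bc and cd), in agreement with the description of Figure 2.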

Statistical Analysis and Similarity Measures

Data Reduction

The TIs used in this paper are shown in Table 3. Initially, all TIs were transformed by the

natural logarithm of the value of the index plus one. This was done because the scale of some TIs

may be several orders of magnitude greater than others.

Principal Components Analysis (PCA)

The data analyzed in this paper may be viewed as n (number of chemicals) vectors in p (number of TIs) dimensions. The data for each set can be represented by a matrix X which has n rows and p columns. For each of the molecular structures, the number of calculated parameters was 64 (TIs of Table 3). Each chemical is therefore represented by a point in $R^{64}$. If each molecule were represented in $R^2$, one could plot and investigate the extent of relationship between individual parameters. In $R^{64}$ such a simple analysis is not possible. However, since many of the TIs are highly intercorrelated, the points in $R^{64}$ can likely be represented by a subspace of fewer dimensions. The method of principal components analysis (PCA), or the Karhunen-Loeve transformation, is a standard method for reduction of dimensionality (Gnanadesikan 1977). The first principal component (PC) is the line which comes closest to the points in the sense of minimizing the sum of the squared Euclidean distances from the points to the line. The second PC is given by projections onto the basis vector orthogonal to the first PC. For points in $R^p$, the first r principal components give the subspace which comes closest to approximating the n points. The first PC is the first axis of the points. Successive axes are major directions orthogonal to previous axes. The PCs are the closest approximating hyperplane and, because they are calculated from eigenvectors of a p×p matrix, the computations are relatively accessible. But there are important scaling choices, because PCs are scale dependent. To control this dependence, the most commonly used convention is to rescale the variables so that each variable has mean zero and standard deviation one. The covariance matrix for these rescaled variables is the correlation matrix. The PCA on the TIs has been carried out using SAS software (SAS 1989).
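The data reduction steps described above (log transform, rescaling to mean zero and unit standard deviation, eigen-decomposition of the correlation matrix) can be sketched with NumPy as follows. The analysis in the paper was done in SAS; this block is only an illustration, and the data matrix is synthetic and hypothetical.

```python
import numpy as np


def pca_scores(X, n_components):
    """PCA on ln(x + 1)-transformed, standardized columns of X.

    Returns the PC scores and the proportion of variance explained by
    each retained component.
    """
    Z = np.log1p(X)                                    # ln(index + 1)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # mean 0, s.d. 1
    R = np.corrcoef(Z, rowvar=False)                   # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # largest first
    explained = eigvals[order] / eigvals.sum()
    return Z @ eigvecs[:, order], explained


# Synthetic stand-in for a TI matrix: 19 "chemicals", 4 strongly
# intercorrelated "indices" on very different scales (hypothetical data).
rng = np.random.default_rng(0)
size = rng.uniform(1.0, 10.0, size=(19, 1))
X = np.hstack(
    [size * k + rng.uniform(0.0, 0.5, size=(19, 1)) for k in (1, 5, 50, 500)]
)
scores, explained = pca_scores(X, 2)
```

Because the synthetic columns all track the same underlying size variable, the first component absorbs most of the variance, which mirrors the behavior reported for intercorrelated TIs.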

Similarity Measures

Two measures of intermolecular similarity were used in the study of the alkanols. The

methods have been described in detail previously (Basak and Grunwald 1995b) and include an associative measure using atom pairs (AP) and Euclidean distance (ED) within a five-dimensional

PC space.

K-Neighbor Selection

Using the topologically-based methods described above, the intermolecular similarity of the chemicals was quantified. For each chemical, the K nearest neighbors (K = 1-5) were then determined using the two similarity techniques. The mean observed pIC50 of the K-nearest neighbors for a compound was used as the estimated pIC50 for the compound. The correlation (r) of observed pIC50 with estimated pIC50 and the standard error (s.e.) of the estimates were used to assess the relative efficacy of the two similarity methods.
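The K-nearest-neighbor estimation can be sketched in Python as below, using Euclidean distance in a PC space as the similarity measure. The coordinates and property values are hypothetical, ties are broken by index order, and the standard error is taken here as the root-mean-square residual, which is an assumption since the text does not give its formula.

```python
import math


def knn_estimates(points, values, k):
    """Estimate each compound's property as the mean property of its k
    nearest neighbors (Euclidean distance), excluding the compound itself."""
    estimates = []
    for i, p in enumerate(points):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = [j for _, j in others[:k]]
        estimates.append(sum(values[j] for j in nearest) / k)
    return estimates


def pearson_r(x, y):
    """Pearson correlation between observed and estimated values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rms_error(observed, estimated):
    """Root-mean-square residual (assumed form of the standard error)."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


# Hypothetical compounds: positions in a 2-D PC space and a property value.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
observed = [1.0, 2.0, 3.0, 4.0, 5.0]

estimated = knn_estimates(points, observed, k=2)
r = pearson_r(observed, estimated)
se = rms_error(observed, estimated)
```

Repeating the last three lines for each K of interest and tabulating r and s.e. gives a summary of the same shape as the K-level tables in this report. Swapping `math.dist` for an atom pair similarity would give the associative variant, with neighbors ranked by decreasing similarity instead of increasing distance.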

Results

From the PCA of 64 TIs for 19 alkanols, five PCs were retained. These five PCs explain,

cumulatively, 94.5% of the total variation within the TI data. Table 4 lists the eigenvalues of the five

PCs, the proportion of variance explained by each PC, the cumulative variance explained, and the

two TIs most correlated with each PC. The first PC is strongly correlated with parameters which

characterize the size of the molecular graph such as, P0 and The second PC is highly correlated

with the higher order complexity indices including CIC4 and SIC4. For the third PC, the highest

correlations occur with low order information theoretic complexity TIs such as IC2 and IC, The

fourth PC was characterized by sixth order path-cluster and valence path-cluster connectivity indices,

$^6\chi_{PC}$ and $^6\chi^v_{PC}$. The fifth PC is characterized by parameters which describe larger linear graphs, $P_7$ and $^6\chi_{PC}$.

These PC/TI correlations agree with our general expectations based on previous research (Basak and

Grunwald 1994, 1995b, 1995d, 1995e). Generally, PCs and TIs correlate as follows: PC1 with the

size of the molecular graph, PC2 with higher order complexity indices, PC3 with cluster and

path-cluster connectivity, and PC4 with low order information theoretic indices. In this particular case,

we see PC3 and PC4 reversed.

Table 2 presents estimates of pIC50 for each similarity method at each K-level (K = 1-5). The atom pair method gave the overall best results. The standard errors fell within the range of the PC standard errors, and the correlations were all ten to twenty-five percent higher. The best correlation for the atom pair method was 0.878 for K = 1. It should be noted that the results for K of 1, 2, and 3 were all very close, within 0.013 units for correlation and within 0.01 -log units for standard error.

Table 5 reports the correlation and standard errors of pIC50 estimates with observed values for both the atom pair (AP) and the principal component (PC) methods. Each line of the table represents a different K level. The standard error of estimation was at its minimum of 0.17 -log units for the PC method with K = 5. The correlation, however, was at its maximum of 0.878 using the AP method with K = 1.

Discussion

The objective of this study was to investigate the utility of nonempirically based molecular

similarity methods in estimating the inhibitory potency (pIC50) of a group of aliphatic alcohols for

microsomal p-hydroxylation of aniline. The result shows that the atom pair method of quantifying

similarity gives a reasonable estimate of pIC50 values of alkanols (Table 2).

It is evident from an analysis of results in Table 5 that the AP method is superior to the ED method in predicting pIC50 values. This is true for K = 1-5. This indicates that atom pairs quantify structural aspects of alkanols relevant to inhibition of aniline p-hydroxylation by microsomal cytochrome P450 better than the Euclidean space derived from the calculated numerical graph invariants. Further work is in progress to determine the relative effectiveness of AP vis-a-vis ED methods in estimating physicochemical as well as toxicological properties of chemicals.

Acknowledgements

This paper is contribution number 154 from the Center for Water and the Environment of the

Natural Resources Research Institute. Research reported in this paper was supported, in part, by grant F49620-94-1-0401 from the United States Air Force, Cooperative Agreement CR-819621 from the United States Environmental Protection Agency, Exxon Biomedical Sciences, Inc. and the

Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute at the University of Minnesota. The authors would also like to extend their thanks to Greg Grunwald helpful discussions.

References

Arcos JC. (1987). Structure-Activity Relationships: criteria for predicting carcinogenic activity of

chemical compounds. Environ Sci Technol 21:743-745.

Auer CM, Nabholz JV, Baetcke KP. (1990). Mode of action and the assessment of chemical

hazards in the presence of limited data: use of structure-activity relationships (SAR) under TSCA,

Section 5. Environ Health Perspect 87: 183-197.

Balaban AT. (1982). Highly discriminating distance-based topological index. Chem Phys Lett 89: 399-404.

Balaban AT. (1983). Topological indices based on topological distances in molecular graphs. Pure Appl Chem 55: 199-206.

Balaban AT. (1986). Chemical graphs. Part 48. Topological index J for heteroatom-containing molecules taking into account periodicities of element properties. Math Chem (MATCH) 21: 115-122.

Basak SC. (1987). Use of molecular complexity indices in predictive pharmacology and toxicology:

a QSAR approach. Med Sci Res 15:605-609.

Basak SC, Bertelsen S, Grunwald GD. (1994). Application of graph theoretical parameters in quantifying molecular similarity and structure-activity studies. J Chem Inf Comput Sci 34: 270-276.

Basak SC, Rosen ME, Magnuson VR. (1986). Molecular topology and mutagenicity: A QSAR

study of nitrosamines. IRCS Med Sci 14: 848-849.

Basak SC, Frane CM, Rosen ME, Magnuson VR. (1987a). Molecular topology and acute toxicity: A QSAR study of monoketones. Med Sci Res 15: 887-888.

Basak SC, Gieschen DP, Harriss DK, Magnuson VR. (1983). Physicochemical and topological correlates of enzymatic acetyl transfer reaction. J Pharm Sci 72: 934-937.

Basak SC, Gieschen DP, Magnuson VR. (1984). A quantitative correlation of the LC50 values of esters in Pimephales promelas using physicochemical and topological parameters. Environ Toxicol Chem 3: 191-199.

Basak SC, Gieschen DP, Magnuson VR, Harriss DK. (1982). Structure-activity relationships and pharmacokinetics: a comparative study of hydrophobicity, van der Waals' volume and topological parameters. IRCS Med Sci 10: 619-620.

Basak SC, Grunwald GD. (1993). Use of graph invariants, volume and total surface area in predicting boiling point of alkanes. Math Modelling Sci Computing 2:735-740.

Basak SC, Grunwald GD. (1994). Molecular similarity and risk assessment: analog selection and property estimation using graph invariants. SAR and QSAR in Environ Res 2: 289-307.

Basak SC, Grunwald GD. (1995a). Use of topological space and property space in selecting

structural analogs. Math Modelling Sci Computing, in press.

Basak SC, Grunwald GD. (1995b). Estimation of lipophilicity from molecular structural similarity.

New J Chem 19:231-237.

Basak SC, Grunwald GD. (1995c). Molecular similarity and estimation of molecular properties.

J Chem Inf Comput Sci in press.

Basak SC, Grunwald GD. (1995d). Tolerance space and molecular similarity. SAR and QSAR

in Environ Res, in press.

Basak SC, Grunwald GD. (1995e). Predicting mutagenicity of chemicals using topological and quantum chemical parameters: a similarity based study. Chemosphere, in press.

Basak SC, Grunwald GD. (1995f). Use of graph theoretic parameters in risk assessment of chemicals. Toxicol Lett, in press.

Basak SC, Gute BD, Grunwald GD. (1995g). Estimation of normal boiling points of haloalkanes using molecular similarity. Croat Chim Acta, in press.

Basak SC, Harriss DK, Magnuson VR. (1988a). POLLY 2.3: Copyright of the University of Minnesota.

Basak SC, Magnuson VR. (1983). Molecular topology and narcosis: A quantitative structure-

activity relationship (QSAR) study of alcohols using complementary information content (CIC).

Arzneim Forsch/ Drug Research 33: 501-503.

Basak SC, Magnuson VR, Niemi GJ, Regal RR. (1988b). Determining structural similarity of

chemicals using graph-theoretic indices. Discrete Appl Math 19: 17-44.

Basak SC, Magnuson VR, Niemi GJ, Regal RR, Veith GD. (1987b). Topological indices: their nature, mutual relatedness, and applications. Mathematical Modelling 8: 300-305.

Basak SC, Niemi GJ, Veith GD. (1990). Optimal characterization of structure for prediction of properties. J Math Chem 4: 185-205.

Basak SC, Niemi GJ, Veith GD. (1991). Predicting properties of molecules using graph invariants. J Math Chem 7: 243-272.

Basak SC, Roy AB, Ghosh JJ. (1980). Study of the structure-function relationship of pharmacological and toxicological agents using information theory. In: Proceedings of the IInd International Conference on mathematical modelling, II. Rolla: University of Missouri-Rolla: 851-856.

Bonchev D, Trinajstic N. (1977). Information theory, distance matrix and molecular branching.

J Chem Phys 67: 4517-4533.

Carhart RE, Smith DH, Venkataraghavan R. (1985). Atom pairs as molecular features in

structure-activity studies: definition and applications. J Chem Inf Comput Sci 25: 64-73.

Cohen GM, Mannering GJ. (1973). Involvement of a hydrophobic site in the inhibition of the

microsomal p-hydroxylation of aniline by alcohols. Mol Pharmacol 9: 383-397.

Gnanadesikan R. (1977). Methods for Statistical Analysis of Multivariate Observations. New

York: Wiley.

Hansch C. (1976). On the structure of medicinal chemistry. J Med Chem 19: 1-6.

Johnson MA, Maggiora GM. (1990). Concepts and Applications of Molecular Similarity. New York: Wiley.

Johnson M, Basak SC, Maggiora G. (1988). A characterization of molecular similarity methods for property prediction. Mathematical and Computer Modelling 11: 630-635.

Kier LB, Hall LH. (1986). Molecular Connectivity in Structure-Activity Analysis. Research

Studies Press: Letchworth, Hertfordshire, U.K.

Shannon CE. (1948). A Mathematical Theory of Communication. Bell Sys Tech J 27: 379-423.

Toxic Substances Control Act (TSCA). Public Law 94-469, 90 Stat. 2003, October 11, 1976.

Wiener H. (1947). Structural determination of paraffin boiling points. J Am Chem Soc 69: 17-20.

Wilkins CL, Randic M. (1980). A graph theoretic approach to structure-property and structure-activity correlations. Theoret Chim Acta (Berl.) 58: 45-68.

TABLES AND FIGURES

Table 1. List of properties necessary for risk assessment of chemicals.

Physicochemical                   Biological

Molar Volume                      Receptor Binding (KD)
Boiling Point                     Michaelis Constant (Km)
Melting Point                     Inhibitor Constant (Ki)
Vapor Pressure                    Biodegradation
Aqueous Solubility                Bioconcentration
Dissociation Constant (pKa)       Alkylation Profile
Partition Coefficient             Metabolic Profile
  : Octanol-Water (log P)         Chronic Toxicity
  : Air-Water                     Carcinogenicity
  : Sediment-Water                Mutagenicity
Reactivity (Electrophile)         Acute Toxicity
                                    : LD50
                                    : LC50

Table 2. 19 Alkanols and their observed and predicted inhibition of microsomal p-hydroxylation of anilines (pIC50).

No. Alkanol obs. | est. pIC50 - AP method (K = 1 2 3 4 5) | est. pIC50 - ED method (K = 1 2 3 4 5)

1 Ethanol -1.10 -0.48 -0.48 -0.34 -0.34 -0.15 -0.48 -0.48 -0.33 -0.34 -0.22
2 1-Propanol -0.48 -0.05 -0.05 -0.15 -0.20 -0.28 -0.05 -0.20 -0.04 -0.13 -0.12
3 1-Butanol -0.05 0.27 0.27 0.02 -0.11 0.02 -0.48 -0.42 -0.30 -0.16 -0.16
4 1-Pentanol 0.27 0.54 0.54 0.34 0.25 0.31 -0.07 -0.06 -0.20 -0.24 -0.08 -0.01
5 1-Hexanol 0.54 0.68 0.68 0.54 0.48 0.26 0.25 0.26 0.02 -0.01
6 1-Heptanol 0.68 0.54 0.54 0.45 0.41 0.23 0.54 0.40 0.14 0.01 0.06
7 2-Methyl-1-propanol -0.39 -0.15 -0.17 -0.16 -0.17 -0.45 -0.37 -0.28 -0.30 -0.24 -0.29
8 2-Methyl-1-butanol -0.15 -0.05 -0.22 -0.16 -0.22 -0.27 -0.07 -0.13 -0.10 -0.17 -0.31 -0.24 -0.21
9 3-Methyl-1-butanol -0.19 -0.07 -0.07 -0.18 -0.23 -0.10 -0.37 -0.26 -0.30
10 2,2-Dimethyl-1-propanol -0.67 -0.39 -0.39 -0.42 -0.35 -0.52 -0.47 -0.43 -0.40 -0.24 -0.29
11 2-Propanol -0.47 -0.39 -0.37 -0.54 -0.46 -0.38 -0.39 -0.75 -0.66 -0.58 -0.47
12 2-Butanol -0.35 -0.15 -0.15 -0.12 -0.11 -0.18 -0.07 -0.06 0.05 -0.08 -0.10
13 2-Pentanol -0.07 0.15 0.15 -0.06 -0.16 -0.18 -0.35 -0.04 -0.04 -0.07 -0.15
14 2-Hexanol 0.15 0.25 0.25 0.14 0.09 -0.01 -0.47 -0.10 -0.09 -0.01 0.10
15 2-Heptanol 0.25 0.15 0.15 0.28 0.35 0.31 0.54 0.04 0.11 0.12 0.08
16 3-Pentanol -0.37 -0.47 -0.68 -0.61 -0.68 -0.39 -0.19 -0.29 -0.24 -0.20 -0.23
17 3-Hexanol -0.47 -0.07 -0.22 -0.17 -0.22 -0.04 0.15 0.21 0.12 0.22 0.11
18 2-Methyl-3-pentanol -0.89 -1.38 -1.38 -1.04 -0.88 -0.51 -0.07 -0.11 -0.14 -0.22 -0.25
19 2,4-Dimethyl-3-pentanol -1.38 -0.89 -0.89 -0.72 -0.64 -0.56 -0.89 -0.89 -0.58 -0.45 -0.40

Table 3. Topological index symbols and definitions.

I_D^W          Information index for the magnitudes of distances between all possible pairs of vertices of a graph

\bar{I}_D^W    Mean information index for the magnitude of distance

W              Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph

I^D            Degree complexity

H^V            Graph vertex complexity

H^D            Graph distance complexity

\overline{IC}  Information content of the distance matrix partitioned by frequency of occurrences of distance h

O              Order of neighborhood when ICr reaches its maximum value for the hydrogen-filled graph

I_ORB          Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices

O_ORB          Maximum order of neighborhood of vertices for I_ORB within the hydrogen-suppressed graph

M1             A Zagreb group parameter = sum of square of degree over all vertices

M2             A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices

ICr            Mean information content or complexity of a graph based on the rth (r = 0-4) order neighborhood of vertices in a hydrogen-filled graph

SICr           Structural information content for rth (r = 0-4) order neighborhood of vertices in a hydrogen-filled graph

CICr           Complementary information content for rth (r = 0-4) order neighborhood of vertices in a hydrogen-filled graph

hχ             Path connectivity index of order h = 0-6

hχC            Cluster connectivity index of order h = 3-5

hχPC           Path-cluster connectivity index of order h = 4-6

hχv            Valence path connectivity index of order h = 0-6

hχvC           Valence cluster connectivity index of order h = 3-5

hχvPC          Valence path-cluster connectivity index of order h = 4-6

Ph             Number of paths of length h = 0-7

J              Balaban's J index based on distance

JX             Balaban's J index based on relative electronegativities

JY             Balaban's J index based on relative covalent radii

Table 4. Summary of the principal component analysis (PCA) using 64 TIs for 19 alkanols.

PC   Eigenvalue   Percent of variance   Cumulative percent   First correlated TI (r)   Second correlated TI (r)

1    34.7         54.2                  54.2                 P0 (0.997)                0χ (0.997)
2    16.3         25.5                  79.7                 CIC4 (0.932)              SIC4 (-0.93)
3    4.9          7.7                   87.4                 IC2 (-0.603)              IC1 (0.595)
4    3.5          5.5                   92.9                 Y rc (-0.565)             Y c (-0.543)
5    1.0          1.6                   94.5                 P7 (-0.444)               6χPC (0.265)

Table 5. -log IC50 estimation for alkanols by K-nearest neighbors using AP and ED similarity approaches.

       AP Method        ED Method

K      r       se       r       se

1      0.878   0.26     0.661   0.36
2      0.865   0.27     0.707   0.30
3      0.871   0.26     0.595   0.23
4      0.855   0.29     0.566   0.19
5      0.811   0.34     0.638   0.17

Figure Captions

Figure 1. Hydrogen-suppressed graph of 1-butanol.

Figure 2. Determination of atom pairs for 1-butanol.

Figure 2

1-Butanol: CH3-CH2-CH2-CH2-OH, with the non-hydrogen atoms labeled a, b, c, d, e from left to right.

     Atom Pair          Freq. of Occurrence   Path

1.   CX1 - 2 - CX2      1                     ab
2.   CX2 - 2 - CX2      2                     bc, cd
3.   CX2 - 2 - OX1      1                     de
4.   CX1 - 3 - CX2      1                     abc
5.   CX2 - 3 - CX2      1                     bcd
6.   CX2 - 3 - OX1      1                     cde
7.   CX1 - 4 - CX2      1                     abcd
8.   CX2 - 4 - OX1      1                     bcde
9.   CX1 - 5 - OX1      1                     abcde

SAR and QSAR in Environmental Research, in press, 1995

TOLERANCE SPACE AND MOLECULAR SIMILARITY

Subhash C. Basak* and Gregory D. Grunwald

Natural Resources Research Institute, The University of Minnesota

5013 Miller Trunk Highway, Duluth, MN 55811, U.S.A.

Running Title: TOLERANCE SPACE AND MOLECULAR SIMILARITY

* Author to whom all correspondence should be addressed.

ABSTRACT

Molecular similarity methods were used in selecting K nearest neighbors and in estimating mutagenicity of 95 aromatic amines and boiling points of a large set of over 2,900 compounds. Similarity is analyzed in terms of the concept of tolerance space.

Specifically, the role of non-transitivity of the tolerance relation in estimating properties using similarity methods is examined.

KEYWORDS

Molecular similarity; Tolerance space; K nearest neighbor estimation; Mutagenicity; Boiling point; Risk assessment

INTRODUCTION

Notions of molecular similarity/dissimilarity have been used in selecting analogs of molecules for risk assessment, in choosing subsets of chemicals from databases in rational drug discovery and in estimating molecular properties.1-16 Such applications of molecular similarity methods are based on the structure-property similarity principle. This notion states that similar structures usually have similar properties.12,13

Practical quantification of molecular similarity/dissimilarity is realized in terms of molecular descriptors which are derived from empirical or nonempirical properties of molecules.

Intermolecular similarity can be defined in terms of the number of structural features and their mutual arrangements which two chemical species have in common.17 The set of descriptors used to quantify similarity may vary with the level of organization to which the chemical species belong, viz., atomic, molecular, macromolecular, etc. Similarity measures are also dependent on the mode of representation of the chemical species, choice of the set of structural descriptors, and the selection of the particular mathematical function used to quantify similarity from the chosen set of descriptors.

A number of quantitative molecular similarity analysis (QMSA) methods have used graph theoretical descriptors. Topological and substructural parameters derived from molecular graphs are related to the shape, size, symmetry, branching and bonding pattern associated with molecular structure. Such parameters can be calculated for any arbitrary chemical structure.2-5,8,10,11,14-16 Graph theoretically based QMSA methods have been used for the selection of analogs of chemicals, rational drug design as well as for the estimation of physicochemical and toxicological properties.3,5,8,9,14-16

Similarity methods order or partially order a set C of chemicals according to the degree of similarity with the selected probe.2-10,15 Usually, the elements of C are represented by n-dimensional vectors where each element of the vector is a descriptor. A set of such vectors representing the chemicals in C and a tolerance relation given in the set of vectors defines the tolerance space. Mathematically, tolerance space is the basis for quantifying similarity/dissimilarity within a set of molecules. A tolerance relation is reflexive, symmetric, and non-transitive in nature.18 Recently, Basak et al.2,3,6-9 used QMSA methods based on atom pairs and topological indices to estimate physicochemical properties and toxicity of different groups of molecules. Such methods were based on K-nearest neighbor estimation of properties for probe chemicals.

In this paper we have carried out a systematic study on the effect of changing K on the estimation of properties using several QMSA techniques. Three similarity measures have been used to quantify similarity between molecules and determine K neighbors in a database of 2966 compounds for which the normal boiling point was known. For a database of 95 aromatic and heteroaromatic amines, the concept of non-transitivity of the tolerance relation has been used in rationalizing the results of mutagenicity prediction using five similarity measures.

MATERIALS AND METHODS

Databases

Normal Boiling Point Database

The normal boiling point database consisted of 2966 compounds taken from the U.S. EPA ASTER19 system. Compounds selected were those that have measured boiling point values.

Aromatic and Heteroaromatic Amines Mutagen Database

The 95 aromatic and heteroaromatic amines used to study mutagenic potency were taken from the literature.20 The mutagenic activities of these compounds in S. typhimurium TA98 + S9 microsomal preparation had been collected. Mutagenic activities are expressed as the mutation rate, Ln(R), in log(revertants/nmol).

Table I lists the compounds used for this study.

Graph Theoretical Parameters

The calculation of the topological indices (TIs)7 and atom pairs10 has previously been described in detail. However, four additional indices developed by A.T. Balaban21-23 were calculated and used in these analyses. Balaban denoted these as J indices and they are based upon the distance sums s_i of a chemical graph. J is defined as:

$$ J = \frac{q}{\mu + 1} \sum_{(i,j)\,\in\,\text{edges}} (s_i s_j)^{-1/2} \qquad (1) $$

where the cyclomatic number μ (or number of rings in the graph) is μ = q - n + 1, with q adjacencies or edges and n vertices. In the original definition of J, the term s_i referred to the row distance sum for vertex i in either the distance matrix (D) or the multigraph distance matrix (M):

$$ s_i = \sum_{j=1}^{n} d_{ij} \qquad (2) $$

For distance matrix D, each matrix element d_ij represents the distance from vertex i to vertex j. The diagonal entries are all zero and the distance between any two adjacent vertices is one. All other entries in this matrix are the number of edges or bonds traversed in the shortest path from i to j. For the multigraph distance matrix M, each bond traversed contributes 1/b to the entry d_ij instead of one, where b is the conventional bond order (b = 1, 2, 3, and 1.5 for single, double, triple and aromatic bonds, respectively). To account for periodicity of chemical properties for heteroatoms, Balaban proposed two J variants23: JX, which includes corrections for heteroatom electronegativities, and JY, which has corrections for heteroatom covalent radii.
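As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes J from the ordinary distance matrix D of a hydrogen-suppressed graph. It is not the POLLY implementation; the function names and the adjacency-list input format are assumptions made for this example, and the multigraph matrix M and the heteroatom corrections (JX, JY) are not handled.

```python
from collections import deque
from math import sqrt


def distance_sums(adjacency):
    """Row sums s_i of the topological distance matrix D (Eq. 2)."""
    sums = {}
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search yields shortest path lengths
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        sums[start] = sum(dist.values())
    return sums


def balaban_j(adjacency):
    """Balaban J (Eq. 1) for a connected simple graph given as adjacency lists."""
    n = len(adjacency)
    edges = {frozenset((v, w)) for v in adjacency for w in adjacency[v]}
    q = len(edges)
    mu = q - n + 1  # cyclomatic number
    s = distance_sums(adjacency)
    total = 0.0
    for edge in edges:
        v, w = tuple(edge)
        total += 1.0 / sqrt(s[v] * s[w])
    return q / (mu + 1) * total


# Carbon skeletons used only as examples: propane (a path) and cyclohexane (a ring).
propane = {0: [1], 1: [0, 2], 2: [1]}
cyclohexane = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(round(balaban_j(propane), 4))
print(round(balaban_j(cyclohexane), 4))
```

For propane the distance sums are 3, 2 and 3, so J = 2 x 2/sqrt(6); for cyclohexane every vertex has distance sum 9 and μ = 1, so J = 3 x 6/9 = 2.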

Table II lists the 102 TIs used in this paper. The set of TIs was generated for all compounds in each data set. Several indices related to electronegativities were undefined for 40 compounds in the boiling point data set. In analyses which include TIs, these compounds have been dropped. Also, 14 compounds in the boiling point data set consisted of only two non-hydrogen atoms which, by definition, would result in a completely unique atom pair. Therefore, it is not possible to select analogs using atom pairs for these compounds, and these 14 compounds were dropped from any study that used atom pairs. For the mutagenicity data set, the third and fourth order chain connectivity index values were zero for all compounds, so these indices were dropped. Also, the fourth and sixth order cluster connectivity indices were dropped since their values were non-zero for only one compound. This left 90 TIs for use in the analyses of the mutagenicity data set. The TIs were calculated by POLLY24 and the atom pairs by APPROBE.25
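To make the atom pair descriptor concrete, the sketch below enumerates atom pairs for a hydrogen-suppressed graph. It is a simplified illustration, not APPROBE: the atom type is taken as the element symbol plus the number of non-hydrogen neighbors (for example CX2), the pi-electron part of the atom type is omitted, and the function name and input format are assumptions made for this example. The separation is the number of atoms on the shortest path, counting both end atoms.

```python
from collections import Counter, deque
from itertools import combinations


def atom_pairs(elements, bonds):
    """Count simplified atom pairs in a hydrogen-suppressed molecular graph.

    elements: element symbols, one per non-hydrogen atom.
    bonds: (i, j) index pairs of bonded atoms.
    """
    n = len(elements)
    neighbors = {i: [] for i in range(n)}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Atom type: element plus number of non-hydrogen neighbors, e.g. CX2.
    types = [f"{el}X{len(neighbors[i])}" for i, el in enumerate(elements)]

    def atoms_on_path(start):
        # Breadth-first search; the count includes both end atoms.
        count = {start: 1}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in count:
                    count[w] = count[v] + 1
                    queue.append(w)
        return count

    path_counts = [atoms_on_path(i) for i in range(n)]
    pairs = Counter()
    for i, j in combinations(range(n), 2):
        a, b = sorted((types[i], types[j]))
        pairs[f"{a}-{path_counts[i][j]}-{b}"] += 1
    return pairs


# 1-Butanol, CH3-CH2-CH2-CH2-OH, and methanol as a two-atom example.
butanol = atom_pairs(["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3), (3, 4)])
methanol = atom_pairs(["C", "O"], [(0, 1)])

for pair, freq in sorted(butanol.items()):
    print(pair, freq)
```

A molecule with only two non-hydrogen atoms, such as methanol here, yields a single atom pair, which is why such compounds cannot be matched to analogs by this descriptor.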

Statistical Methods and Computation of Similarity

Data Reduction

Initially, all TIs were transformed by the natural log of the

index plus one. This was done since the scale of some TIs may be several orders of magnitude greater than other TIs.

The set of TIs was divided into disjoint subsets or clusters based on the correlation matrix by using the SAS procedure VARCLUS.26 The VARCLUS procedure divides the set of TIs into disjoint clusters so that each cluster is essentially unidimensional. The output of this procedure is the first principal component (PC) from each cluster of TIs. These PCs were subsequently used in the similarity measures described below.

Besides using the cluster or PC variables derived above, we selected from each cluster the TI that was most correlated with the cluster to which it belonged. These TIs were then used in the similarity measures as well.

Similarity Measures

Five measures of intermolecular similarity were used in the study of the aromatic amines. The methods have been described in detail previously7 and include one associative measure using atom pairs (AP) and four Euclidean distance (ED) measures using: a) scaled TIs (TIS), b) unscaled TIs (TIU), c) scaled variable clusters (PCS) and d) unscaled variable clusters (PCU). The two methods using scaled dimensions involved rescaling the variables to have mean zero and variance one prior to calculating ED.

For the normal boiling point database, only three measures of similarity were used. These measures included AP, TIS and TIU.
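The data preparation behind the ED measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the descriptor values are made up, and the sample variance (divisor n - 1) is assumed for the rescaling step.

```python
from math import log, sqrt


def log_transform(matrix):
    """Replace each index value x by ln(x + 1)."""
    return [[log(x + 1.0) for x in row] for row in matrix]


def standardize(matrix):
    """Rescale each column to mean zero and variance one.

    Assumes the sample variance (n - 1) and no constant columns.
    """
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    sds = [sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in matrix]


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up values of two indices on very different scales for three chemicals.
raw = [[4.0, 100.0], [5.0, 400.0], [6.0, 900.0]]
unscaled = log_transform(raw)      # input to an unscaled measure such as TIU
scaled = standardize(unscaled)     # input to a scaled measure such as TIS

print(euclidean(unscaled[0], unscaled[1]))
print(euclidean(scaled[0], scaled[1]))
```

Without rescaling, the index with the larger spread dominates the distance; after rescaling, each dimension contributes on a comparable footing.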

K Neighbor Selection and Property Estimation

Using the previously described similarity methods, intermolecular similarity of the chemicals within each data set was quantified. For each chemical, the K nearest neighbors were determined using each of the similarity techniques. For the boiling point data set, K was set to values 1-10, 15, 20, and 25.

For the aromatic amines, K was set to values 1-7. However, as an initial exploratory analysis of the non-transitivity of the tolerance relation, we examined selecting K neighbors, excluding the nearest seven neighbors. In other words, neighbors 8-14 were used as analogs for property prediction.

The mean of the observed property of the K nearest neighbors about a compound was used as the estimated property for the compound. The correlation (r) of observed property with estimated property and the standard error of the estimates were used to assess the relative efficacy of the similarity methods.
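The estimation procedure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names, the `skip` parameter (used to draw analogs from neighbors 8-14 by passing `skip=7`), the toy data, and the use of the root-mean-square residual as the standard error are all assumptions, since the exact formula for the standard error is not stated in the text.

```python
from math import sqrt


def knn_estimates(distance, observed, k, skip=0):
    """Estimate each compound's property as the mean over K of its neighbors.

    distance: symmetric matrix of pairwise dissimilarities.
    observed: measured property values, in the same order.
    skip: number of nearest neighbors passed over before selecting K.
          Requires skip + 1 <= number of other compounds.
    """
    n = len(observed)
    estimates = []
    for i in range(n):
        # The compound itself is never used as its own neighbor.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance[i][j])
        chosen = others[skip:skip + k]
        estimates.append(sum(observed[j] for j in chosen) / len(chosen))
    return estimates


def correlation(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standard_error(observed, estimated):
    """Root-mean-square residual (an assumed definition of the s.e.)."""
    n = len(observed)
    return sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


# Toy data: five hypothetical compounds placed on a line.
positions = [0, 1, 2, 3, 4]
distance = [[abs(a - b) for b in positions] for a in positions]
observed = [0.0, 1.0, 2.0, 3.0, 4.0]

estimated = knn_estimates(distance, observed, k=2)
print(estimated)
print(correlation(observed, estimated), standard_error(observed, estimated))
```

Ties in distance are broken by input order here; a production implementation would need an explicit tie rule.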

RESULTS

Variable Clustering and TI Selection

There were 10 significant clusters derived from the variable clustering of 90 TIs for the 95 aromatic amines. These 10 clusters explained a total of 90.5% of the total variation within the original TIs.

The TIs selected for use in the TIS and TIU similarity methods are listed in Table III. Shown in Table III are the TI selected and the correlation of the TI with the cluster from which it was selected. Each of these TIs was the variable most correlated with the cluster from which it was selected.

For the set of 2926 compounds with boiling point data for which all TIs were defined, there were 13 significant clusters derived from the 102 TIs. These clusters explained 85.4% of the total variation within the original TIs. As for the amines, Table III shows the TIs selected for use in the TIS and TIU similarity methods.

K-Neighbor Property Estimation and the Tolerance Relation

Figure 1 presents the correlation (r) and standard error of prediction (s.e.) for mutagenic potency of the 95 aromatic amines for the K values examined (1-7). The lines in the figure represent the results when selecting from neighbors 1-7 and from neighbors 8-14. Since the results for each of the five similarity methods were similar, for brevity, only the TIS method is displayed. Table IV lists the results for each of the five methods. The correlations are considerably better and the errors smaller when selecting from the seven nearest neighbors as opposed to neighbors 8-14. When selecting from the seven nearest neighbors, the best results were seen with K from 3 to 5 with the correlation ~0.84 and s.e. ~1.03. When selecting from neighbors 8-14, results were highly variable. The best K ranged from 3 to 5 with the correlation anywhere from ~0.57 to ~0.73 and s.e. from 1.34 to 1.62.

[Insert figure 1]

Figure 2 presents the correlation and s.e. of prediction for normal boiling points for three similarity methods. Table V shows the best boiling point model obtained for each similarity method. The best boiling point estimates were for K in the range of 7 to 15 for the two ED based methods. For the AP method, the best estimates were for K from 6 to 10 with the best boiling point estimates being at K equal to 8. The AP method gave the best result with a correlation of 0.905 and s.e. of 34.6 °C.

[Insert figure 2]

DISCUSSION

The objective of this paper was twofold: 1) to explore the utility of molecular similarity methods in estimating properties of a large and diverse database, and 2) to analyze the nature of similarity techniques in terms of the tolerance relation.

Three similarity methods have been used in estimating normal boiling points of a diverse set of over 2,900 chemicals. These methods are based on graph theoretic parameters like atom pairs and TIs. The results of K nearest neighbor estimates of boiling point using the three methods are shown in Table V and Fig 2. The results show that the atom pair method gives slightly better estimates than the TI-based ED methods.

Five similarity methods have been used in predicting mutagenicity of aromatic amines. For K = 1-7, the results of this analysis are given in Table IV. In this case, all four TI-based ED methods gave very similar estimates and were only slightly superior to the atom pair method.

An interesting result came from the study of the effectiveness of using relatively close and distant K neighbors to estimate mutagenicity of the aromatic amines (Figure 1). The closest seven neighbors (K=1-7) gave a better estimation of the property than when more distant neighbors (K=8-14) were used. This is presumably because the similarity relation on a set of chemicals is a tolerance relation18 which is reflexive, symmetric, but nontransitive. This means that accumulation of successive small differences in a series can lead to objects which are quite dissimilar to the initial object. Figure 3 provides such an example using atom pairs as the molecular descriptors. Let us define similarity such that an arbitrary pair (X,Y) of chemicals will be called similar if they have at least one atom pair in common. By this definition, of the three chemicals A, B, and C in Fig 3, A~B, B~C, and yet the relation A~C does not hold, where ~ represents the similarity relation defined above. Of course, this definition of similarity is too broad to be of use in practical situations. To be useful for analog selection and, ultimately, property estimation, a similarity measure needs to define the local neighborhood of a probe chemical such that the probe and its neighbor(s) have sufficient structural analogy relevant to the property. In our studies presented in this paper, we have used either distance or an associative coefficient to order chemicals relative to a probe chemical. Subsequently, we used the K nearest neighbors to characterize the local space of the probe chemical.
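The non-transitivity of this relation can be checked with a short Python sketch. The three atom-pair sets below are invented for illustration and are not the structures of Figure 3; the relation coded is the one defined in the text, namely that two chemicals are similar if they share at least one atom pair.

```python
def similar(x, y):
    """Tolerance relation from the text: at least one atom pair in common."""
    return bool(x & y)


# Hypothetical atom-pair sets for three chemicals (not those of Figure 3).
A = {"CX1-2-CX2", "CX2-2-CX2"}
B = {"CX2-2-CX2", "CX2-2-OX1"}
C = {"CX2-2-OX1", "CX2-2-NX1"}

reflexive = all(similar(m, m) for m in (A, B, C))
symmetric = similar(A, B) == similar(B, A)
# Transitivity would require A~C whenever A~B and B~C both hold.
transitive = not (similar(A, B) and similar(B, C)) or similar(A, C)

print(reflexive, symmetric, transitive)
```

Here A and B share one atom pair, B and C share another, but A and C share none, so the relation is reflexive and symmetric without being transitive.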

[INSERT FIGURE 3]

ACKNOWLEDGMENT

The authors are grateful to Professor Milan Randic for his review of the research reported in this paper. This is contribution number 138 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported in this paper was supported, in part, by cooperative agreement CR 819621 from the United States Environmental Protection Agency, by grant F49620-94-1-0401 from the United States Air Force, Exxon Biomedical Sciences, Inc., and the Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute at the University of Minnesota.

REFERENCES

1. Auer, C.M., Nabholz, J.V. and Baetcke, K.P. (1990). Mode of action and the assessment of chemical hazards in the presence of limited data: use of structure-activity relationships (SAR) under TSCA, Section 5. Environ. Health Perspect. 87, 183-197.

2. Basak, S.C., Magnuson, V.R., Niemi, G.J. and Regal, R.R. (1988). Determining structural similarity of chemicals using graph-theoretic indices. Discrete Appl. Math. 19, 17-44.

3. Basak, S.C., Bertelsen, S. and Grunwald, G.D. (1994). Application of graph theoretical parameters in quantifying molecular similarity and structure-activity studies. J. Chem. Inf. Comput. Sci. 34, 270-276.

4. Basak, S.C., Bertelsen, S. and Grunwald, G.D. (1995). Use of graph theoretic parameters in risk assessment of chemicals. Toxicology Letters, in press.

5. Basak, S.C. and Grunwald, G.D. (1995). Use of topological space and property space in selecting structural analogs. Mathematical Modelling and Scientific Computing, in press.

6. Basak, S.C. and Grunwald, G.D. (1995). Estimation of lipophilicity from structural similarity. New J. Chem. 19, 231-237.

7. Basak, S.C. and Grunwald, G.D. (1994). Molecular similarity and risk assessment: analog selection and property estimation using graph invariants. SAR and QSAR in Environmental Research 2, 289-307.

8. Basak, S.C. and Grunwald, G.D. (1995). Molecular similarity and estimation of molecular properties. J. Chem. Inf. Comput. Sci. 35, 366-372.

9. Basak, S.C. and Grunwald, G.D. (1995). Predicting mutagenicity of chemicals using topological and quantum chemical parameters: a similarity based study. Chemosphere, in press.

10. Carhart, R.E., Smith, D.H. and Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. 25, 64-73.

11. Fisanick, W., Cross, K.P. and Rusinko, III, A. (1992). Similarity searching on CAS registry substances 1. molecular property and generic atom triangle geometric searching. J. Chem. Inf. Comput. Sci. 32, 664-674.

12. Johnson, M.A. and Maggiora, G.M., Eds. (1990). Concepts and Applications of Molecular Similarity. Wiley, New York.

13. Johnson, M.A., Basak, S.C. and Maggiora, G. (1988). A characterization of molecular similarity methods for property prediction. Mathl Comput. Modelling 11, 630-634.

14. Lajiness, M.S. (1990). Molecular similarity-based methods for selecting compounds for screening. In, Computational Chemical Graph Theory (D. H. Rouvray, Ed.). Nova, New York, pp. 299-316.

15. Wilkins, C.L. and Randic, M. (1980). A graph theoretic approach to structure-property and structure-activity correlations. Theoret. Chim. Acta (Berl.) 58, 45-68.

16. Willet, P. and Winterman, V. (1986). A comparison of some measures for the determination of inter-molecular structural similarity: measures of intermolecular structural similarity. Quant. Struct-Act. Relat. 5, 18-25.

17. Austel, V. (1983). Features and problems in practical drug design. In, Topics in Current Chemistry (M. Charton and I. Motoc, Eds.). Springer-Verlag, Berlin, Vol. 114, pp. 7-19.

18. Schreider, J.A. (1975). Equality, Resemblance and Order. Mir Publishers, Moscow, p. 81.

19. Russom, C.L. (1992). Assessment Tools for the Evaluation of Risk (ASTER) v. 1.0. U.S. Environmental Protection Agency.

20. Debnath, A.K., Debnath, G., Shusterman, A.J. and Hansch, C. (1992). A QSAR Investigation of the Role of Hydrophobicity in Regulating Mutagenicity in the Ames Test: 1. Mutagenicity of Aromatic and Heteroaromatic Amines in Salmonella typhimurium TA98 and TA100. Environ. Mol. Mutagen. 19, 37-52.

21. Balaban, A.T. (1982). Highly discriminating distance-based topological index. Chemical Physics Letters 89, 399-404.

22. Balaban, A.T. (1983). Topological indices based on topological distances in molecular graphs. Pure & Appl. Chem. 55, 199-206.

23. Balaban, A.T. (1986). Chemical graphs. Part 48. Topological index J for heteroatom-containing molecules taking into account periodicities of element properties. MATCH 21, 115-122.

24. Basak, S.C., Harriss, D.K. and Magnuson, V.R. (1988). POLLY: Copyright of the University of Minnesota.

25. Basak, S.C. and Grunwald, G.D. (1994). APPROBE: Copyright of the University of Minnesota.

26. SAS Institute Inc. (1988). In, SAS/STAT User's Guide, Release 6.03 Edition. SAS Institute Inc.: Cary, NC, Chapter 34, pp. 949-965.

Table I. Mutagenicity (S. typhimurium TA98 with metabolic activation) of 95 aromatic and heteroaromatic amines.

No.  Compound  log Rev./nmol

1  2-bromo-7-aminofluorene  2.62
2  2-methoxy-5-methylaniline  -2.05
3  5-aminoquinoline  -2.00
4  4-ethoxyaniline  -2.30
5  1-aminonaphthalene  -0.60
6  4-aminofluorene  1.13
7  2-aminoanthracene  2.62
8  7-aminofluoranthene  2.88
9  8-aminoquinoline  -1.14
10  1,7-diaminophenazine  0.75
11  2-aminonaphthalene  -0.67
12  4-aminopyrene  3.16
13  3-amino-3'-nitrobiphenyl  -0.55
14  2,4,5-trimethylaniline  -1.32
15  3-aminofluorene  0.89
16  3,3'-dichlorobenzidine  0.81
17  2,4-dimethylaniline  -2.22
18  2,7-diaminofluorene  0.48
19  3-aminofluoranthene  3.31
20  2-aminofluorene  1.93
21  2-amino-4'-nitrobiphenyl  -0.62
22  4-aminobiphenyl  [illegible]
23  3-methoxy-4-methylaniline  -1.96
24  2-aminocarbazole  0.60
25  2-amino-5-nitrophenol  -2.52
26  2,2'-diaminobiphenyl  -1.52
27  2-hydroxy-7-aminofluorene  0.41
28  1-aminophenanthrene  2.38
29  2,5-dimethylaniline  -2.40
30  4-amino-2'-nitrobiphenyl  -0.92
31  2-amino-4-methylphenol  -2.10
32  2-aminophenazine  0.55
33  4-aminophenylsulfide  0.31
34  2,4-dinitroaniline  -2.00
35  2,4-diaminoisopropylbenzene  -3.00
36  2,4-difluoroaniline  -2.70
37  4,4'-methylenedianiline  -1.60
38  3,3'-dimethylbenzidine  0.01
39  2-aminofluoranthene  3.23
40  2-amino-3'-nitrobiphenyl  -0.89
41  1-aminofluoranthene  3.35
42  4,4'-ethylenebis(aniline)  -2.15
43  4-chloroaniline  -2.52
44  2-aminophenanthrene  2.46
45  4-fluoroaniline  -3.32
46  9-aminophenanthrene  2.98
47  3,3'-diaminobiphenyl  -1.30
48  2-aminopyrene  3.50
49  2,6-dichloro-1,4-phenylenediamine  -0.69
50  2-amino-7-acetamidofluorene  1.18
51  2,8-diaminophenazine  1.12
52  6-aminoquinoline  -2.67
53  4-methoxy-2-methylaniline  -3.00
54  3-amino-2'-nitrobiphenyl  -1.30
55  2,4'-diaminobiphenyl  -0.92
56  1,6-diaminophenazine  0.20
57  4-aminophenyldisulfide  -1.03
58  2-bromo-4,6-dinitroaniline  -0.54
59  2,4-diamino-n-butylbenzene  -2.70
60  4-aminophenylether  -1.14
61  2-aminobiphenyl  -1.49
62  1,9-diaminophenazine  0.04
63  1-aminofluorene  0.43
64  8-aminofluoranthene  3.80
65  2-chloroaniline  -3.00
66  3-amino-α,α,α-trifluorotoluene  -0.80
67  2-amino-1-nitronaphthalene  -1.17
68  3-amino-4'-nitrobiphenyl  0.69
69  4-bromoaniline  -2.70
70  2-amino-4-chlorophenol  -3.00
71  3,3'-dimethoxybenzidine  0.15
72  4-cyclohexylaniline  -1.24
73  4-phenoxyaniline  0.38
74  4,4'-methylenebis(o-ethylaniline)  -0.99
75  2-amino-7-nitrofluorene  3.00
76  Benzidine  -0.39
77  1-amino-4-nitronaphthalene  -1.77
78  4-amino-3'-nitrobiphenyl  1.02
79  4-amino-4'-nitrobiphenyl  1.04
80  1-aminophenazine  -0.01
81  4,4'-methylenebis(o-fluoroaniline)  0.23
82  4-chloro-2-nitroaniline  -2.22
83  3-aminoquinoline  -3.14
84  3-aminocarbazole  -0.48
85  4-chloro-1,2-phenylenediamine  -0.49
86  3-aminophenanthrene  3.77
87  3,4'-diaminobiphenyl  0.20
88  1-aminoanthracene  1.18
89  1-aminocarbazole  -1.04
90  9-aminoanthracene  0.87
91  4-aminocarbazole  -1.42
92  6-aminochrysene  1.83
93  1-aminopyrene  1.43
94  4,4'-methylenebis(o-isopropylaniline)  -1.77
95  2,7-diaminophenazine  3.97

Table II. Topological index symbols and definitions.

Symbol    Definition
I^W_D     Information index for the magnitudes of distances between all possible pairs of vertices of a graph
Ī^W_D     Mean information index for the magnitude of distance
W         Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I^D       Degree complexity
H^V       Graph vertex complexity
H^D       Graph distance complexity
IC        Information content of the distance matrix partitioned by frequency of occurrences of distance h
O         Order of neighborhood when ICr reaches its maximum value for the hydrogen-filled graph
IORB      Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
OORB      Maximum order of neighborhood of vertices for IORB within the hydrogen-suppressed graph
M1        A Zagreb group parameter = sum of square of degree over all vertices
M2        A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
ICr       Mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SICr      Structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CICr      Complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
hχ        Path connectivity index of order h = 0-6
hχC       Cluster connectivity index of order h = 3-6
hχCh      Chain connectivity index of order h = 3-6
hχPC      Path-cluster connectivity index of order h = 4-6
hχb       Bonding path connectivity index of order h = 0-6
hχbC      Bonding cluster connectivity index of order h = 3-6
hχbCh     Bonding chain connectivity index of order h = 3-6
hχbPC     Bonding path-cluster connectivity index of order h = 4-6
hχv       Valence path connectivity index of order h = 0-6
hχvC      Valence cluster connectivity index of order h = 3-6
hχvCh     Valence chain connectivity index of order h = 3-6
hχvPC     Valence path-cluster connectivity index of order h = 4-6
Ph        Number of paths of length h = 0-10
J         Balaban's J index based on distance
JB        Balaban's J index based on multigraph bond orders
JX        Balaban's J index based on relative electronegativities
JY        Balaban's J index based on relative covalent radii
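
Three of the simpler definitions in Table II (W, M1 and M2) can be illustrated with a short sketch. This is not the POLLY implementation; it assumes a hydrogen-suppressed graph given as an adjacency list, and the function names and the two example molecules are chosen here purely for illustration.

```python
from collections import deque

def distance_matrix(adj):
    """All-pairs topological distances (bond counts) by breadth-first search."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist

def wiener_index(adj):
    """W = half-sum of the off-diagonal elements of the distance matrix."""
    return sum(sum(row) for row in distance_matrix(adj)) // 2

def zagreb_m1(adj):
    """M1 = sum of squared vertex degrees."""
    return sum(len(nbrs) ** 2 for nbrs in adj)

def zagreb_m2(adj):
    """M2 = sum over edges of the product of the two end-vertex degrees."""
    return sum(len(adj[u]) * len(adj[v])
               for u in range(len(adj)) for v in adj[u] if u < v)

# Illustrative carbon skeletons (hydrogens suppressed).
n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1], [0, 2, 3], [1], [1]]

print(wiener_index(n_butane), zagreb_m1(n_butane), zagreb_m2(n_butane))
print(wiener_index(isobutane), zagreb_m1(isobutane), zagreb_m2(isobutane))
```

The branched isomer has the smaller W and the larger M1 and M2, which is the kind of structural discrimination these indices are meant to provide.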

Table III. Topological indices selected for similarity measures and the correlation of the TI with the cluster from which it was selected.

          Normal Boiling Point Data      Aromatic Amines
Cluster   TI                    r        TI              r
 1        [illegible]         0.977      4χ            0.985
 2        χvPC [order illegible] 0.931   SIC4          0.984
 3        SIC3                0.946      4χbPC         0.926
 4        4χCh                0.881      CIC1          0.967
 5        5χC                 0.945      P0            0.990
 6        IC4                 0.991      5χCh          0.940
 7        6χCh                0.874      [illegible]   0.987
 8        6χ                  0.924      [illegible]   0.954
 9        CIC1                0.933      SIC2          0.896
10        JB                  0.981      O             0.944
11        3χbC                0.908
12        W                   0.974
13        P [subscript illegible] 0.968

Table IV. Correlation (r) and standard error (s.e.) of estimating Ln(R) of 95 aromatic amines when selecting neighbors 1-7 and neighbors 8-14 using five different similarity methods.

                     Neighbors 1-7      Neighbors 8-14
Method          K     r      s.e.        r      s.e.
AP              1   0.795    1.28      0.617    1.76
                2   0.814    1.16      0.641    1.57
                3   0.823    1.11      0.649    1.52
                4   0.828    1.10      0.670    1.46
                5   0.837    1.06      0.712    1.37
                6   0.837    1.06      0.699    1.39
                7   0.830    1.08      0.687    1.42

TIs             1   0.832    1.14      0.629    1.70
                2   0.836    1.09      0.702    1.42
                3   0.840    1.06      0.728    1.34
                4   0.850    1.02      0.696    1.40
                5   0.834    1.06      0.705    1.37
                6   0.837    1.05      0.722    1.34
                7   0.834    1.06      0.711    1.36

[illegible]     1   0.798    1.23      0.443    1.92
                2   0.842    1.08      0.495    1.74
                3   0.836    1.07      0.464    1.80
                4   0.846    1.03      0.524    1.68
                5   0.845    1.03      0.524    1.70
                6   0.830    1.07      0.538    1.67
                7   0.811    1.13      0.571    1.62

PCs             1   0.789    1.24      0.531    1.85
                2   0.829    1.10      0.597    1.64
                3   0.847    1.03      0.599    1.61
                4   0.842    1.04      0.616    1.57
                5   0.833    1.07      0.657    1.47
                6   0.824    1.09      0.675    1.43
                7   0.821    1.10      0.665    1.45

[illegible]     1   0.793    1.24      0.524    1.85
                2   0.825    1.11      0.588    1.61
                3   0.843    1.05      0.576    1.64
                4   0.848    1.02      0.601    1.59
                5   0.831    1.07      0.608    1.57
                6   0.825    1.09      0.620    1.54
                7   0.827    1.08      0.635    1.51

Table V. Comparison of three similarity methods for prediction of normal boiling point (°C) using K nearest neighbors.

Similarity Method     N      K      r      s.e. (°C)
AP                  2952     8a   0.905     34.6
TIs                 2926    10    0.874     40.0
TIn                 2926     7    0.872     40.3

a K = 3 for one compound, K = 5 for two compounds.
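
The K-nearest-neighbor estimates summarized in Tables IV and V can be sketched as follows. This is a minimal illustration, not the APPROBE code: it assumes each compound is described by a numeric descriptor vector, that Euclidean distance is the dissimilarity measure, and that the estimate is the unweighted mean of the neighbors' property values. The small library below is invented for the example.

```python
import math

def knn_estimate(probe, library, k):
    """Estimate a property of `probe` as the mean over its k nearest neighbors.

    library: list of (descriptor_vector, property_value) pairs that
    does not contain the probe itself.
    """
    ranked = sorted(library, key=lambda item: math.dist(probe, item[0]))
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical two-descriptor library with a made-up property.
library = [
    ((0.0, 0.0), 10.0),
    ((1.0, 0.0), 20.0),
    ((0.0, 2.0), 40.0),
    ((5.0, 5.0), 100.0),
]
probe = (0.1, 0.1)

for k in (1, 2, 3):
    print(k, knn_estimate(probe, library, k))
```

Repeating such an estimate for every compound at each K, and correlating estimates with observed values, would give r and s.e. curves of the kind tabulated above.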

Figure 2 Pattern of: (a) correlation (r) and (b) standard error of the estimates (s.e.) according to K nearest neighbor selection for >2900 boiling points. [Plots not reproduced: r and s.e. versus number of neighbors (K) for the three similarity methods.]

Figure 3 An example of non-transitivity for three compounds for which A ~ B and B ~ C, but A ≁ C, when at least one common atom pair defines compounds as similar. [Structure drawings not reproduced.]

Appendix 3

Abstracts/Presentations

Appendix 3-1 Molecular similarity and estimation of molecular properties (abstract)

Appendix 3-2 Quantitative Structure-Activity Relationships (QSAR) in environmental sciences: Tolerance space and molecular similarity (abstract)

Appendix 3-3 Proceedings of the International Cancer Congress: Predicting genotoxicity of chemicals using nonempirical parameters (abstract)

Appendix 3-4 Estimation of toxicological and environmental properties of chemicals using molecular similarity (abstract)

Appendix 3-5 International Congress on Hazardous Waste: Impact on human and ecological health: Use of molecular similarity in the assessment of toxicity of chemicals (abstract)

Appendix 3-6 Tenth International Conference on Mathematical and Computer Modelling and Scientific Computing: Modeling mutagenicity with neural networks (presentation)

MOLECULAR SIMILARITY AND ESTIMATION OF MOLECULAR PROPERTIES. S. C. Basak and G. D. Grunwald, Natural Resources Research Institute, Duluth, MN 55811

Recently there has been an upsurge in the use of molecular similarity methods in the selection of analogs and estimation of properties. We have developed three methods of quantifying molecular similarity using numerical graph invariants (topological indices). These methods have been used in selecting analogs and estimating physicochemical and toxicological properties of different and diverse sets of chemicals. Results of these analyses will be presented along with a critical evaluation of the relative effectiveness of the three similarity techniques in property estimation.

Division: Chemical Information
Title of paper: Molecular Similarity and Estimation of Molecular Properties
Presentation preference: oral
Principal author: Subhash C. Basak, Natural Resources Research Institute, 5013 Miller Trunk Highway, Duluth, MN 55811. Telephone: (218) 720-4230. Fax: (218) 720-4219.
Co-author: Gregory D. Grunwald

6th International Workshop on Quantitative Structure-Activity Relationships (QSAR) in Environmental Sciences

Hotel Villa Carlotta - Belgirate (No) - Italy
13-17 September 1994

ABSTRACTS

Joint Research Centre, European Commission: Environment Institute; European Chemicals Bureau
Sponsored by the World Health Organisation and in collaboration with the European Commission (DG V, Luxembourg; DG XI, Brussels; DG XII, Brussels)

TOLERANCE SPACE AND MOLECULAR SIMILARITY

Subhash C. Basak and Gregory D. Grunwald

Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Highway, Duluth, MN 55811, U.S.A.

In recent years a number of molecular similarity/dissimilarity methods have been developed for analog selection and estimation of molecular properties. Such methods are based on topological, geometrical, physicochemical, and quantum chemical parameters. We have used four similarity methods in estimating physicochemical and toxicological properties of chemicals. These methods are based on graph invariants, viz., topological indices and atom pairs, which can be computed for any arbitrary chemical structure. The property of the nearest neighbor of a probe chemical, as selected by a particular similarity method, has been taken as the estimate of the property. Each of these four methods gave reasonable results in estimating properties. The results were, however, dependent on the scaling of parameters. Similarity space defined on a set of chemicals C is a tolerance space. The relation R defined on such a space is reflexive, symmetric and non-transitive. The implication of nontransitivity of similarity space in property estimation will be discussed with specific examples.

Keywords : molecular similarity; graph invariant; topological indices; atom pair; tolerance space.
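
The tolerance-relation properties named in the abstract (reflexive, symmetric, non-transitive) can be checked on a toy example. The sketch below assumes the simplest possible criterion, that two compounds are similar when they share at least one atom pair; the three compounds and their atom-pair labels are hypothetical placeholders, not real structures.

```python
def similar(x, y):
    """Tolerance relation: two compounds are similar if they share a feature."""
    return len(x & y) > 0

# Hypothetical atom-pair sets for three compounds A, B and C.
A = {"pair_1"}
B = {"pair_1", "pair_2"}
C = {"pair_2"}
compounds = [A, B, C]

reflexive = all(similar(x, x) for x in compounds)
symmetric = all(similar(x, y) == similar(y, x)
                for x in compounds for y in compounds)
# Transitivity would require A ~ C whenever A ~ B and B ~ C.
transitive = all(similar(x, z)
                 for x in compounds for y in compounds for z in compounds
                 if similar(x, y) and similar(y, z))

print(reflexive, symmetric, transitive)
```

Here A ~ B and B ~ C hold but A ~ C does not, so a chain of nearest neighbors can drift away from the probe even though every link in the chain is a valid similarity.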

XVI INTERNATIONAL CANCER CONGRESS
Oct. 30 - Nov. 5, 1994, New Delhi, India

Abstract Submission Form

Name of Registrant: Basak, Subhash
Mailing Address: Natural Resources Research Institute, 5013 Miller Trunk Highway, Duluth, MN
Zip Code: 55811 Country: USA
Telephone: (218) 720-4230
Fax: (218) 780-4219

Predicting genotoxicity of chemicals using nonempirical parameters

S. C. Basak and G. D. Grunwald, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Highway, Duluth, MN 55811, USA.

Assessing the genotoxic potential of chemicals is an important factor in their risk assessment. At present more than 500,000 new chemicals are registered with Chemical Abstracts Service (CAS) each year. Very few of these chemicals have the empirical data necessary for assessing genotoxic potential. Therefore, the logical alternative is to use nonempirical approaches in assessing the genotoxic potential of chemicals.

We have employed nonempirical parameters to predict genotoxicity of a diverse database of chemicals. The descriptors used in this study include topological, substructural and quantum chemical parameters calculated directly from structure. Discriminant function analysis using the set of nonempirical parameters shows that our approach can correctly classify the chemicals 75% of the time.

In view of the fact that less than 15% of new chemicals submitted to USEPA have any genotoxicity data, our model can be used to estimate the genotoxic potential of chemicals.

genotoxicity, nonempirical parameters, risk

ESTIMATION OF TOXICOLOGICAL AND ENVIRONMENTAL PROPERTIES OF CHEMICALS USING MOLECULAR SIMILARITY

Subhash C. Basak, Gregory D. Grunwald and Brian Gute

Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Highway, Duluth, MN 55811

Presented at the Sigma Xi conference, Duluth, MN, 1995.

Risk assessment of industrial chemicals and environmental pollutants is often carried out with limited or no test data necessary for hazard estimation. In such cases, regulators often use selected "analogs" of the candidate chemicals to assess the risk posed by such chemicals. Such analogs are usually chosen based on the intuitive notion of the individual scientist about chemical similarity (1). We have been involved in the development of novel computationally feasible methods of quantifying chemical structural similarity based on algorithmically-defined parameters (2-10). Such parameters can be calculated for any chemical species, real or hypothetical. We have applied these methods in the selection of analogs, and in the estimation of properties of different congeneric sets.

In this presentation we will present new results on the application of the similarity methods in estimating: (a) mutagenicity of a set of 95 aromatic amines, and (b) boiling points of a group of 276 chlorofluorocarbons (CFCs).

References

1. C. M. Auer, M. Zeeman, J. V. Nabholz and R. G. Clements, SAR and QSAR in Environmental Research, 2, 29, 1994.
2. S. C. Basak and G. D. Grunwald, Mathematical Modelling and Scientific Computing, in press, 1995.
3. S. C. Basak and G. D. Grunwald, SAR and QSAR in Environmental Research, 2, 289, 1994.
4. S. C. Basak and G. D. Grunwald, J. Chem. Inf. Comput. Sci., in press, 1995.
5. S. C. Basak and G. D. Grunwald, New Journal of Chemistry, in press, 1995.
6. S. C. Basak and G. D. Grunwald, SAR and QSAR in Environmental Research, in press, 1995.
7. S. C. Basak and G. D. Grunwald, in: Proceeding of the XVI International Cancer Congress, vol.1, p. , 1994.
8. S. C. Basak and G. D. Grunwald, Chemosphere, in press, 1995.
9. S. C. Basak, S. Bertelsen and G. D. Grunwald, Toxicology Letters, in press, 1995.
10. S. C. Basak, S. Bertelsen and G. Grunwald, J. Chem. Inf. Comput. Sci., 34, 270, 1994.

INTERNATIONAL CONGRESS ON HAZARDOUS WASTE:

IMPACT ON HUMAN AND ECOLOGICAL HEALTH AGENDA

June 5 - 8, 1995 Marriott Marquis Hotel Atlanta, Georgia

U.S. Department of Health and Human Services
Public Health Service
Agency for Toxic Substances and Disease Registry

Use of Molecular Similarity in the Assessment of Toxicity of Chemicals

Subhash C. Basak, Ph.D., Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Highway, Duluth, MN 55811, USA.

Risk assessment of environmental pollutants is often carried out with limited or no test data necessary for hazard estimation. In such cases, regulators often use selected "analogs" of candidate chemicals to assess the risk posed by these chemicals. Such analogs are chosen based on the intuitive notion of the individual scientist about chemical similarity. We have been involved in the development of novel computationally feasible methods of quantifying chemical structural similarity using calculated parameters. Such parameters can be derived for any chemical species, real or hypothetical. We have applied these methods in the selection of analogs, and in the estimation of properties of different congeneric sets. In this paper, we will present new results on the application of different molecular similarity methods in estimating toxicologically important properties of chemicals.

Presentation Outline

1. Data and Parameter Description

2. Neural Network Architecture

3. Avoiding Overtraining

4. Training Procedure

5. Final Model Validation

6. Comparison with Statistical Methods

The Data

95 Aromatic and Heteroaromatic Amines taken from the literature.

Measure of Mutagenic Activity: log(revertants/nmol) = ln(R)

MODEL 2: Topological Indices + LFER Parameters

Same parameters as Model 1 plus:

IJ Indicator 1 if three or more fused rings are present, 0 otherwise

Logp Log (base 10) of the octanol/water partition coefficient

Homo Energy potential of the highest occupied molecular orbital

Lumo Energy potential of the lowest unoccupied molecular orbital

Neural Network Architecture

For both models:

* Three layers

* Fully connected

* All parameter values scaled to [0,1]

* Learning by backpropagation with maximum tolerated Mean Squared Error = .0003

Avoiding Overtraining

Overtraining occurs if the neural network "memorizes" input/output patterns causing inability to generalize and predict.

Overtraining is caused by:

* Too many weights in the network in relation to the amount of data used to train.

* Too many cycles (epochs) used during training.

Options which control these factors are:

* Number of hidden nodes

* Training rate coefficient (ETA)

* Momentum coefficient (ALPHA)

80/20 Methodology

To determine these options we used the following methodology in preliminary experiments:

Step 1: Train a network on 80% of the compounds using a combination of options.

Step 2: Use the trained network to predict (estimate) ln(R) for the 20% held out of training.

Step 3: Compare predicted with actual output for the 20% using a correlation coefficient.

The 20% to hold out was chosen by randomly dividing the data into five equal-size "bins", B1, B2, ..., B5.
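
The random binning and the resulting five train/hold-out splits might look like the sketch below. It is an illustration only: the compounds are represented by their Table I numbers 1-95, the seed and helper names are arbitrary, and the network training in Step 1 is deliberately left out.

```python
import random

def make_bins(items, n_bins=5, seed=0):
    """Shuffle the items and deal them into n_bins bins of (near) equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_bins] for i in range(n_bins)]

def holdout_splits(bins):
    """Yield (training set, held-out bin) pairs, one per bin."""
    for i, held_out in enumerate(bins):
        training = [x for j, b in enumerate(bins) if j != i for x in b]
        yield training, held_out

compounds = list(range(1, 96))      # the 95 amines, by number
bins = make_bins(compounds)

for training, held_out in holdout_splits(bins):
    # Steps 1-3 would go here: train on `training`, predict ln(R) for
    # `held_out`, then correlate predicted with actual values.
    print(len(training), len(held_out))
```

With 95 compounds each bin holds 19, so every run trains on 76 compounds and tests on 19, which is the 80/20 proportion.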

Steps 1-3 were performed for each of B1, ..., B5.

Correlation of Predicted and Actual ln(R) for Model 1 by Number of Hidden Nodes

             Number of Hidden Nodes
Bin       10      6      4      2      1
B1      .855   .835   .847   .858   .814
B2      .663   .850   .891   .833   .786
B3      .487   .813   .776   .840   .659
B4      .549   .674   .751   .801   .602
B5      .547   .484   .685   .712   .662
AVG     .630   .730   .750   .810   .700

Correlation of Predicted and Actual ln(R) for Model 2 by Number of Hidden Nodes

             Number of Hidden Nodes
Bin       10      6      4      2      1
B1      .819   .794   .838   .817   .544
B2      .704   .747   .679   .915   .793
B3      .692   .657   .888   .897   .863
B4      .784   .875   .874   .712   .600
B5      .560   .702   .652   .782   .458
AVG     .712   .760   .790   .820   .650
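
The bottom row of each table appears to be the column average over the five bins, and the choice of network size follows from it. The sketch below recomputes those averages for the Model 2 table and picks the node count with the highest mean correlation; the variable names are illustrative.

```python
hidden_nodes = [10, 6, 4, 2, 1]

# Model 2 correlations by bin, in the column order of `hidden_nodes`.
model2 = {
    "B1": [.819, .794, .838, .817, .544],
    "B2": [.704, .747, .679, .915, .793],
    "B3": [.692, .657, .888, .897, .863],
    "B4": [.784, .875, .874, .712, .600],
    "B5": [.560, .702, .652, .782, .458],
}

# Average each column (one column per hidden-node count) over the bins.
means = [sum(col) / len(col) for col in zip(*model2.values())]
best = hidden_nodes[means.index(max(means))]

for n, m in zip(hidden_nodes, means):
    print(n, round(m, 3))
print("best number of hidden nodes:", best)
```

The smallest networks are not automatically best (one hidden node scores lowest here), but two hidden nodes give the highest mean correlation on held-out data.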

Therefore, for both models, the number of hidden nodes used was 2.

Choosing ETA and ALPHA

To achieve an MSE <= .0003 with a minimum of cycles, we found that "shaking" the values of ETA and ALPHA helped to speed convergence.

That is, during training we changed ETA and ALPHA periodically. After much experimentation we settled on the following:

Training Procedure

Set ETA = 0.7 and ALPHA = 0.9
Perform 5000 cycles
Perform the following 10 times:
    Set ETA = 0.7 and ALPHA = 0.9
    Perform 200 cycles
    Set ETA = 0.475 and ALPHA = 0.675
    Perform 100 cycles
    Set ETA = 0.3 and ALPHA = 0.6
    Perform 100 cycles
    Set ETA = 0.075 and ALPHA = 0.275
    Perform 100 cycles
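
The schedule above can be written as a small generator, which also makes the cycle budget easy to check. This sketch only enumerates the (ETA, ALPHA, cycles) stages; the backpropagation step itself is not shown, and the function name is illustrative.

```python
def training_schedule():
    """Yield (eta, alpha, cycles) stages in the order they are applied."""
    yield (0.7, 0.9, 5000)          # long initial stage
    for _ in range(10):             # ten rounds of "shaking"
        yield (0.7, 0.9, 200)
        yield (0.475, 0.675, 100)
        yield (0.3, 0.6, 100)
        yield (0.075, 0.275, 100)

stages = list(training_schedule())
total_cycles = sum(cycles for _, _, cycles in stages)

print(len(stages), "stages,", total_cycles, "cycles in total")
```

A training loop would iterate over these stages, set the two coefficients, and run the stated number of cycles, stopping early once the MSE target is reached.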

This procedure limited training to 10000 cycles and guaranteed an MSE <= .0003 in all 80/20 tests.

Final Model Validation

With the neural net architecture and training procedure decided, we validated the models with a "jackknife" procedure.

This is a highly granular variation of the 80/20 methodology:

Step 1: Train a network on all compounds but one.

Step 2: Use the trained network to predict (estimate) ln(R) for the one compound held out of training.

Step 3: Repeat Steps 1 and 2 for all 95 compounds.

Step 4: Compare predicted with actual output for all 95 compounds using a correlation coefficient.

Jackknife Results for Model 1
Correlation = .925

predicted   actual     predicted   actual
   1.95      2.62        -2.56     -3.00
  -2.56     -2.05        -2.56     -2.70
  -1.63     -0.67        -1.93     -1.60
   2.66      3.16        -0.27      0.01
  -0.27     -0.55         2.64      3.23
  -0.94     -1.32        -0.27     -0.89
   0.86      0.89        -0.54     -0.60
   0.63      0.81         2.64      3.35
  -2.56     -2.22        -1.54     -2.15
  -0.24      0.48        -2.25     -2.52
   2.66      3.31         2.54      2.46
   1.18      1.93        -2.25     -3.32
  -2.56     -2.00         2.64      2.98
  -0.28     -0.62        -1.83     -1.30
  -0.41     -0.14         2.66      3.50
  -2.56     -1.96        -0.24     -0.69
  -0.07      0.60         0.73      1.18
  -2.56     -2.52         0.16      1.13
  -0.24     -1.52         2.37      1.12
   1.35      0.41        -2.56     -2.67
   2.65      2.38        -2.56     -3.00
  -2.56     -2.40        -0.35     -1.30
  -0.87     -0.92        -0.91     -0.92
  -2.56     -2.30         0.29      0.20
  -2.56     -2.10        -2.56     -1.03
   0.65      0.55        -0.42     -0.54
   0.18      0.31        -2.53     -2.70

Jackknife Results for Model 2
Correlation = .952

predicted   actual     predicted   actual
   2.71      2.62        -2.12     -2.00
  -2.44     -2.05        -2.44     -3.00
  -1.12     -0.67        -2.39     -2.70
   2.75      3.16        -2.34     -1.60
  -0.48     -0.55        -0.46      0.01
  -1.18     -1.32         2.75      3.23
   0.95      0.89        -0.36     -0.89
   0.00      0.81        -0.79     -0.60
  -2.44     -2.22         2.75      3.35
   0.15      0.48        -1.93     -2.15
   2.75      3.31        -2.43     -2.52
   2.23      1.93         2.71      2.46
  -2.44     -2.00        -2.43     -3.32
  -0.06     -0.62         2.72      2.98
  -0.32     -0.14        -2.41     -1.30
  -2.44     -1.96         2.75      3.50
   0.06      0.60        -0.02     -0.69
  -2.15     -2.52         1.12      1.18
  -1.10     -1.52         0.34      1.13
  -0.00      0.41         1.50      1.12
   2.75      2.38        -2.44     -2.67
  -2.43     -2.40        -2.44     -3.00
  -1.28     -0.92        -1.76     -1.30
  -2.44     -2.30        -0.69     -0.92
  -2.44     -2.10         0.12      0.20
   0.11      0.55        -0.78     -1.03
   0.55      0.31        -0.53     -0.54

Jackknife Results for Model 2 (continued)

predicted   actual     predicted   actual
  -2.44     -2.70         0.11     -0.48
  -0.51     -1.14        -0.75     -0.49
   2.73      2.62         2.70      3.77
  -1.29     -1.49        -0.20      0.20
   0.09      0.04         2.75      1.18
   0.64      0.43         0.00     -1.04
   2.75      3.80         2.75      0.87
  -2.28     -3.00         0.74      0.75
  -1.07     -0.80         0.00     -1.42
  -1.17     -1.17         2.75      1.83
   0.91      0.69         2.75      1.43
  -2.43     -2.70        -1.43     -1.77
  -2.42     -3.00         2.61      3.97
   2.75      2.88
  -0.29      0.15
  -1.38     -1.24
   0.33      0.38
  -0.97     -0.99
   2.75      3.00
  -0.40     -0.39
  -1.38     -1.77
  -0.05      1.02
   0.86      1.04
   0.06     -0.01
  -2.43     -1.14
  -0.10      0.23
  -2.40     -2.22
  -2.44     -3.14
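
The jackknife (leave-one-out) procedure in Steps 1-4 can be sketched generically. In the sketch below a least-squares line stands in for the neural network so that the example stays self-contained; the function names and the five data points are hypothetical and are not taken from the amine data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Stand-in model: ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def jackknife(xs, ys, fit, predict):
    """Train on all points but one, predict the held-out point, repeat."""
    predictions = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(predict(model, xs[i]))
    return predictions

# Hypothetical descriptor values and property values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.1]

predicted = jackknife(xs, ys, fit_line, predict_line)
r = pearson_r(predicted, ys)
print([round(p, 2) for p in predicted], round(r, 3))
```

Replacing `fit_line` and `predict_line` with network training and prediction gives the procedure used for the two models, at the cost of one full training run per compound.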