An Integrated Computational Approach for Prioritizing Potential Estrogenic Endocrine Disruptors

An Integrated "Four-Phase" Approach for Endocrine Disruptor Priority Setting

- Part 1: Phase I and II predictions of Potential Estrogenic Endocrine Disruptors

Leming Shi1+, Weida Tong1*, Hong Fang2, Roger Perkins1, Jie Wu1, Meihua Tu1++, Robert

M Blair2, William S Branham2, Chris Waller3, Daniel M Sheehan2

1 R.O.W. Sciences, Inc., Jefferson, Arkansas 72079

2 Division of Genetic and Reproductive Toxicology, National Center for Toxicological

Research (NCTR), Jefferson, Arkansas 72079

3 Sphinx Pharmaceuticals, A Division of Eli Lilly and Company, RTP, NC 27709

+ Current address: American Cyanamid Co., American Home Products, Princeton, NJ

08543

++ Current address: Pfizer Inc., Groton, CT 06340

* Address correspondence to Dr. Weida Tong, R.O.W. Sciences, Inc., 3900 NCTR Road,

MC 910, Jefferson, AR 72079. Telephone: (870) 543-7142. Fax: (870) 543-7382. E-mail: [email protected]

Running Title: Priority Setting of Potential Estrogenic Endocrine Disruptors

Key Words: Endocrine Disruptors; Estrogens; Androgens; EDSTAC; Priority Setting;

Estrogen Receptor Binding; Predictive Toxicity; QSAR; Classification; Clustering;

Pharmacophores; Structural Alerts; Filters; Expert System; Knowledge Base

Abbreviations: Endocrine Disruptor (ED); Estrogen Receptor (ER); Relative Binding

Affinity (RBA); Endocrine Disruptor Screening and Testing Advisory Committee

(EDSTAC); Environmental Protection Agency (EPA); Food and Drug Administration

(FDA); National Center for Toxicological Research (NCTR); Quantitative Structure-

Activity Relationship (QSAR); K-Nearest Neighbors (KNN); Classification and

Regression Tree (CART); Diethylstilbestrol (DES); Comprehensive Descriptors for

Structural and Statistical Analysis (CODESSA); Comparative Molecular Field Analysis

(CoMFA); Tier 1 Screening (T1S); Tier 2 Testing (T2T); High Throughput Pre-

Screening (HTPS); Hologram QSAR (HQSAR); Hydrogen Bond Donor (H-Donor);

Hydrogen Bond Acceptor (H-Acceptor)

Acknowledgments

This work was partially supported by the FDA’s Office of Women’s Health and the

Chemical Manufacturers Association (CMA). We thank Dr. John Walker from the U.S.

Environmental Protection Agency (EPA) for providing us with the data for analysis. Dr. Hong

Fang gratefully acknowledges the Oak Ridge Institute for Science and Education

Program, supported by the U.S. Department of Energy (DOE) and the U.S. Food and

Drug Administration (FDA), for postdoctoral support.

Abstract

There is increasing evidence that some natural and man-made chemicals, now termed endocrine disruptors (EDs), have potential adverse effects on human beings and wildlife by mimicking endogenous hormones. In the U.S., the EPA must meet a legislative mandate for assessing a large number of commercial chemicals, some produced in huge volumes, for ED activity. A committee of experts convened by EPA recommended a tiered approach of screening and testing for estrogen, androgen, and thyroid endpoints that incorporates in vitro and in vivo assays. Conducting these screens and tests on even a fraction of the chemicals appears difficult in terms of cost, time, and test animal expenditures. There is a crucial need for priority setting before this large strategy can be effectively implemented. Here we report an integrated computational approach to priority setting using estrogen receptor (ER) binding as an example. This approach rationally integrates different predictive computational models into a "Four-Phase" scheme so that it can effectively identify potential estrogenic EDs based on their predicted ER relative binding affinity (RBA). The system has been validated using an in-house ER binding assay dataset for 232 chemicals that was designed to have both broad structural diversity and a wide range of binding affinity. When applied to 58,000 chemicals identified by EPA as requiring assessment for endocrine disruption, some 9,100 chemicals were predicted to bind to ER. Of these, only 3,600 were expected to bind to ER at RBA values up to 100,000-fold less than that of 17β-estradiol. The method ruled out 83% of the chemicals as non-binders with an anticipated very low rate of false negatives. We believe that the same integrated scheme will be equally applicable to endpoints of other endocrine disrupting mechanisms, e.g., androgen receptor binding.

Introduction

Reminiscent of the environmental movement spawned in the 1960s regarding chemicals in the environment causing human cancers, the potential for endocrine disruptors (EDs) to cause a broad range of adverse effects has created concern, if not alarm, among the public and governments worldwide (1). Adverse effects such as compromised reproductive fitness, learning disabilities, cancer, and immune disorders have been reported widely in the popular press (2). The resulting public concern has led to government regulatory actions (3, 4) and expanded research across Europe, Japan and North America. EDs are chemicals that may mimic endogenous hormones or alter their pharmacokinetics or mechanisms of action, among other possibilities. The scientific debate has escalated, fueled in part by the fact that some suspected EDs are high-volume, economically important chemicals.

The U.S. Congress passed laws that resulted in the Environmental Protection Agency (EPA) developing and implementing a strategy for screening and testing for estrogen, androgen and thyroid endpoints (5). A two-tiered, multiple-endpoint strategy, which incorporates more than 20 different in vitro and in vivo assays (6), was recommended by EPA's Endocrine Disruptor Screening and Testing Advisory Committee (EDSTAC). More than 87,000 chemicals may need to be experimentally evaluated under this requirement (7). The large number of chemicals and assays makes it difficult for each chemical to be run through these assay batteries in a reasonable time. There is a crucial need for priority setting to identify the chemicals most likely to possess ED activity for early entry into screening.

Priority setting using computational approaches is widely applied in the process of drug discovery. The objective of priority setting in the pharmaceutical industry is to increase the chance of finding active compounds or "hits" that are more likely to be developed into "leads". Hence false positives are of great concern. In contrast, minimizing false negatives is critical for regulatory purposes because chemicals labeled as inactive are dropped into a lower priority category. For this purpose, we developed an integrated computational system that rationally combines different computational models into a sequential "Four-Phase" scheme according to the strength of each type of model (Figure 1). In Phase I, several simple rejection filters or rules are used to exclude those chemicals that are most unlikely to exhibit estrogenic activity. Phase II uses three different types of models (structural alerts, pharmacophores, and classification methods) to make a qualitative activity prediction. In Phase III, multiple QSAR models are used to predict activity quantitatively. In Phase IV, an expert system is recommended to combine Phase II and Phase III predictions with exposure, fate and other data to set priorities. In this scheme, each Phase is used as a screen to reduce the number of compounds to be considered in the subsequent Phase. Therefore, these four Phases work in a hierarchical way to incrementally reduce the size of a dataset with increasing precision of prediction. Within each Phase, different models have been selected to work complementarily in representing key activity-determining structure features to minimize the rate of false negatives.

Previously, we evaluated the performance of a number of QSAR models in Phase III (8). This paper reports results from Phases I and II of the integrated approach for the NCTR dataset and a subset of ~58,000 chemicals identified by EPA from a total list of 87,000 chemicals.

Materials and Methods

Datasets

A computational model is generally trained and validated first on a reasonable amount of reliable data; it can then be expected to provide reliable predictions for new chemicals. To build robust and predictive computational models, it is important to have a reliable training set of chemicals with known biological activity. Because literature data are not sufficient for this purpose, an in-house rat ER binding assay was established to provide data for model development (9). Chemicals in the dataset were selected to reflect the structural diversity of endocrine disruptors and the distribution of biological activity needed for building robust models. This selection process has been a highly interactive one, involving computational chemists and experimental toxicologists, and has resulted in steady improvement in the performance of our models (10). Our current dataset consists of 129 active and 103 inactive compounds (9). This dataset, called the NCTR dataset, has been extensively used to build and validate a series of computational models proposed for priority setting. The distribution of the binding activity (calculated as Relative Binding Affinity, or RBA) and chemical classes of the dataset are shown in Figures 2 and 3, respectively. A chemical with an RBA more than a million-fold below that of 17β-estradiol is defined as inactive, where the RBA of 17β-estradiol is set to 100.
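As a worked illustration of the activity threshold, the minimal sketch below converts an RBA value (17β-estradiol = 100) into its fold difference from 17β-estradiol and applies the million-fold cutoff, which corresponds to an RBA of 100/10^6 = 10^-4. The function names, and the choice to represent non-measurable binding as `None`, are assumptions made for illustration and are not part of the assay protocol.

```python
E2_RBA = 100.0        # RBA assigned to 17beta-estradiol (from the text)
INACTIVE_FOLD = 1e6   # "million-fold below" cutoff (from the text)


def fold_below_e2(rba):
    """Return how many fold weaker than 17beta-estradiol a chemical binds."""
    if rba <= 0:
        raise ValueError("RBA must be positive")
    return E2_RBA / rba


def is_inactive(rba):
    """Hypothetical helper: inactive if binding was not measurable (None)
    or if the RBA is more than a million-fold below 17beta-estradiol."""
    return rba is None or fold_below_e2(rba) > INACTIVE_FOLD


# The cutoff expressed directly as an RBA value: 100 / 1e6 = 1e-4
RBA_CUTOFF = E2_RBA / INACTIVE_FOLD
```

Under these assumptions, a chemical with an RBA of 0.01 (10,000-fold below 17β-estradiol) counts as active, while one with an RBA of 0.00001 counts as inactive.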

The EPA originally identified about 87,000 compounds requiring testing for potential endocrine disruption. This dataset (the EPA dataset) mainly contains a large and diverse collection of known environmental chemicals as well as some food additives and drugs, of which some 8,000 chemicals are regulated by the U.S. Food and Drug Administration (FDA). A final 57,810 compounds remained for our study after EPA exclusion of all polymers that are unlikely to directly cause endocrine disruption. The molecular structures of these 57,819 chemicals were preprocessed according to the following criteria (11):

1. The records are valid, i.e., they contain the connection table fields and there are no obvious errors in the structure description.

2. Counterions and solvent molecules were removed in order to obtain single compound records.

3. Charges at acidic and basic groups were neutralized by adding or removing protons. This prevented structural differences caused by different protonation states, which might lead to differences in the calculation of the molecular descriptors.

General Approaches and Methodological Considerations

The proposed integrated approach for priority setting is composed of four sequential phases (Figure 1). Each phase contains a number of rules and/or models to estimate a compound's binding affinity. Briefly,

- Phase I: Filtering - A set of rejection filters is designed to eliminate, with high confidence, a significant number of chemicals that are extremely unlikely to bind ER. The key criterion in selecting these filters is to rule out as many true negatives as possible, while having an extremely low probability of rejecting a true binder (i.e., introducing a false negative).

- Phase II: Active/Inactive assignment - Three different methods, i.e., structural alerts, pharmacophore searching, and classification/clustering, are used in parallel in this phase to distinguish active from inactive compounds. Each method incorporates and weighs differently the various structural features that endow a chemical with the ability to bind the ER. In order to further reduce false negatives, a chemical predicted to be active by any of these methods is subsequently evaluated in Phase III. At Phase II, the combined output derived from the three approaches could be used for initial categorical priority setting.

- Phase III: Quantitative predictions - A number of quantitative structure-activity relationship (QSAR) methods are used for quantitative prediction of the binding affinity of the compounds identified as active in Phase II (8). Compounds with higher predicted binding affinity are given higher priority for further evaluation.

- Phase IV: Rule-based decision-making system - In this final stage of our integrated priority setting approach, we propose to use a rule-based (or knowledge-based) decision-making system to foster definitive decision making. This system will be useful only after we incorporate our accumulated human knowledge and expertise, i.e., rules, into the system.

The importance of the incorporation of multiple models is twofold. First, different techniques have different strengths and weaknesses in their ability to correlate and encode specific chemical structure features that endow a chemical with its activity. Multiple models enable the system to aggregately encode these features in a complementary manner. Second, some techniques are fast, but others are moderately to very time-consuming. Hence, hierarchical sequencing of the models allows faster models to be used to eliminate the majority of inactive chemicals with an extremely low rate of false negatives. Subsequently, the progressively more time-consuming but more precise models can be used to refine predictions for an increasingly smaller number of remaining chemicals. The application of the more refined models further eliminates true negatives as well as false positives from earlier models.

Phase I: Rejection filters

Four simple rejection filters were used in Phase I. The first filter ruled out chemicals with molecular weight (MW) <94 or >1000. The MW of phenol (94) was considered the lowest limit for a compound to bind ER, whereas a MW of 1000 was considered the upper limit of ER ligands, as suggested by the EDSTAC. The second filter was "number of rings = 0", which implies that an active compound needs at least one ring. This is based on the lack of known estrogens without a ring. The third filter was "number of carbons = 0". It is believed that a chemical containing no carbon would be insufficiently lipophilic to cross a membrane, even though it might exhibit binding activity. The fourth filter ruled out inorganic compounds because their probability of being estrogenic was quite low. To the best of our knowledge, there are no known inorganics that exhibit endocrine disrupting potential via the mechanism of ER-binding.

Phase II: Structural alerts, pharmacophore queries and classification models

Phase II comprises three types of models for qualitative activity prediction: structural alerts, pharmacophore searching, and classification. The structural alerts were three 2D substructures that were identified as key 2D structural features for most estrogens. The pharmacophore search used seven different 3D queries based on known 3D structural features for ER binding. Two classification models were developed based on KNN and CART methods to qualitatively categorize compounds into active and inactive subsets on the basis of their similarity in physicochemical properties. These twelve models were designed to work complementarily to distinguish active compounds from inactive ones. A prediction of active or inactive was first made by applying each of the 12 models to each chemical. The results of the individual model predictions were then combined, and only chemicals identified as inactive by all twelve models were eliminated from further evaluation in Phase III.
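The combination rule amounts to a logical OR over the individual model calls. The sketch below illustrates it with three toy stand-ins for the twelve models; the model names, the dictionary-based chemical record, and its keys are hypothetical and serve only to show the rule, not the actual ISIS Base, CATALYST, or S-Plus models.

```python
def phase2_calls(chemical, models):
    """Apply each qualitative model; True means 'predicted active'."""
    return {name: bool(predict(chemical)) for name, predict in models.items()}


def passes_to_phase3(calls):
    """Only chemicals called inactive by ALL models are eliminated,
    so a single 'active' call is enough to pass to Phase III."""
    return any(calls.values())


# Toy stand-ins for three of the twelve models (all hypothetical).
toy_models = {
    "phenolic_ring_alert": lambda c: c["has_phenolic_ring"],
    "pharmacophore_query_1": lambda c: c["matches_query_1"],
    "knn_classifier": lambda c: c["knn_active"],
}

chemical = {"has_phenolic_ring": False, "matches_query_1": True, "knn_active": False}
calls = phase2_calls(chemical, toy_models)
print(passes_to_phase3(calls))  # True: one model called this chemical active
```

This rule deliberately trades a higher false positive rate for a lower false negative rate, since any single model can rescue a chemical from elimination.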

Structural Alerts. Figure 4 depicts three structural alerts that were designed to identify as potentially active all chemicals with any of these substructural features. The steroid skeleton alert was designed to make sure that all steroidal compounds, which include most endogenous hormones, would be passed to more advanced computational prediction models. The DES skeleton, two phenyl rings separated by two carbons bound with any bond-type, was used to make sure that compounds similar to DES, one of the most active synthetic estrogens, would not be missed. The third structural alert was the phenolic ring.

The precise overlapping of the A-rings of crystal structures for ER-estradiol, ER-raloxifene, ER-DES, and ER-4-hydroxytamoxifen (12, 13), as shown in Figure 5, and our knowledge of the structural requirements for ER-ligand binding, all suggest the importance of the A-ring phenolic structural feature for ER-binding. Hence, compounds with steroid, DES, or phenolic skeletons will be classified as active and moved to Phase III for quantitative activity prediction, regardless of the results of pharmacophore searching and classification models.

Pharmacophore Query Construction and Database Searching. A pharmacophore is a combination of a few molecular features (e.g., H-donor, H-acceptor, hydrophobic centers and associative geometry) needed for a molecule to exhibit a certain type of biological activity (14). A training set of molecules may be used in CATALYST to automatically generate a number of pharmacophore queries that specify the minimum requirement for binding based on a predefined number of interactions between a receptor and its ligands.

Alternatively, pharmacophore queries may be constructed manually (15), guided by a ligand-receptor crystal structure, as was done in the present study. In the latter case, one usually starts with template molecules that either are derived from bound receptor-ligand crystal structure or from the 3D structure of highly active chemicals. Molecular features are delineated from those templates and combined to form pharmacophore queries, where the 3D distance tolerance among these features can be adjusted for optimal performance.

The detailed construction and validation of the pharmacophore queries used in the present study will be published elsewhere (16). Briefly, the crystal structures of 17β-estradiol and DES bound to ER were selected as the templates. The molecular features of 17β-estradiol and DES selected for development of queries included H-bond donors and acceptors (such as phenolic hydroxyl groups) and hydrophobic centers (such as phenolic ring centers). Over 20 pharmacophore queries were originally developed by rational selection of combinations of 3 to 5 features. These queries were further validated by elimination of duplications and minimization of false negatives against the NCTR dataset, which resulted in seven queries for final application. The developed 3D queries were in turn used to search the EPA dataset of 3D chemical structures for "hits" that match these queries. The 3D structures of chemicals were prepared using the catDB command in CATALYST to generate up to 100 conformations per chemical. The hits were presumed to be active, i.e., potential estrogens.

K-Nearest Neighbors (KNN). KNN is a widely used pattern recognition technique that can categorize an unknown chemical based on its proximity to samples already placed in categories. Specifically, the predicted class, i.e., active or inactive in this study, of an unknown chemical depends on the distribution of class assignment of its K nearest neighbors in the training set, which accounts for the name of the technique. The nearness is generally measured by a Euclidean distance metric in an N-dimensional space of molecular descriptors, although other distance metrics can be applied. In a fashion analogous to polling, each of the K nearest training set samples votes once for its class; the unknown chemical is then assigned to the class with the most votes. With a chosen distance metric, the most important part of the KNN process is to determine an optimal K value for the final model development, which was selected by leave-one-out cross-validation in this study. The standard KNN process was implemented as follows: (1) Remove a chemical from the dataset; (2) Calculate the distance metric between the omitted chemical and all remaining chemicals in the dataset; (3) Select the K chemicals nearest (or most similar) to the left-out chemical according to the calculated distances; (4) Assign the left-out chemical the activity class to which a majority of the K chemicals belongs; (5) Count the misclassification by comparing the predefined class with the predicted class of the left-out chemical; and (6) Repeat steps 1-5 until each chemical in the dataset is left out once for prediction and the rate of misclassification is summarized for the predefined K value. Theoretically, the K value can vary from 1 to N-1, where N is the size of the dataset. In our application, steps 1-6 were repeated for each K value from 3 to 10, and the optimal K value, which was 3 for the final model, was determined according to the smallest rate of misclassification.
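Steps 1-6 and the selection of K can be sketched in a few lines of plain Python. This is a minimal illustration, not the S-Plus implementation used in the study: it assumes the descriptors have already been scaled to comparable ranges, breaks voting ties in favor of the class of the nearest neighbor, and prefers the smallest K when several K values give the same error rate. The function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def loo_misclassification_rate(X, y, k):
    """Leave each chemical out once and predict it from its k nearest neighbors."""
    errors = 0
    for i in range(len(X)):
        # Steps 1-2: omit chemical i and compute its distance to all others.
        dists = sorted((euclidean(X[i], X[j]), j) for j in range(len(X)) if j != i)
        # Step 3: take the classes of the k nearest chemicals (nearest first).
        neighbors = [y[j] for _, j in dists[:k]]
        # Step 4: majority vote; ties go to the class seen first (assumption).
        predicted = Counter(neighbors).most_common(1)[0][0]
        # Step 5: count a misclassification.
        if predicted != y[i]:
            errors += 1
    # Step 6: summarize over all left-out chemicals.
    return errors / len(X)


def select_k(X, y, k_values=range(3, 11)):
    """Repeat the leave-one-out procedure for each K and keep the best one."""
    rates = {k: loo_misclassification_rate(X, y, k) for k in k_values}
    best_k = min(rates, key=lambda k: (rates[k], k))
    return best_k, rates
```

In the study, `X` would hold the molecular descriptors of the 232 NCTR chemicals and `y` their active/inactive labels; the default `k_values` mirrors the range of 3 to 10 described above.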

Classification and Regression Tree (CART). CART uses a decision tree to determine how a chemical may be classified or predicted through a series of rules based on selection of variables (or descriptors). These rules are expressed as "if … then …" statements. Since tree-construction methods are recursive in nature, CART is also called the recursive partitioning method for pattern recognition in drug discovery (17).

Depending on the nature of activity data (endpoint), the tree can be constructed for either regression or classification. Each terminal node of the regression tree gives a quantitative prediction, while the classification tree gives a categorical prediction. The classification tree is used most commonly in data analysis, where the endpoint is usually binomial. In the present application, the tree method used to classify chemicals into active and inactive categories is described by Clark & Pregibon (18) and implemented in the S-Plus software. The development of a tree model involves two processes: tree construction and pruning. In the tree construction process, a parent population is split into two children nodes that become parents for further splits. The split is selected to maximally distinguish the response variable in the left and the right branches. Splitting continues until nodes are pure or data are too sparse. To avoid over-fitting the training data, the tree needs to be cut down to the desired size using tree cost-complexity pruning. In this study, the number of the terminal nodes was set to 10, which corresponds to about 23 compounds per node.
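A single step of tree construction, choosing the split that best separates active from inactive chemicals on one descriptor, can be illustrated as follows. This sketch uses Gini impurity as the split criterion, which is a common choice swapped in here for simplicity and may differ from the criterion used by the S-Plus tree implementation; the function names are hypothetical, and pruning is not shown.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0 = inactive, 1 = active)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the threshold on one descriptor that minimizes the weighted
    impurity of the two child nodes. Returns (threshold, impurity);
    the threshold is None if no split improves on the parent node."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical descriptor values
        threshold = (lo + hi) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best
```

Applying `best_split` recursively to each child node, over all descriptors, yields the "if … then …" rules of the tree; each chosen threshold corresponds to one rule of the form "if descriptor <= threshold then go left".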

General Computational Modeling

The Phase I filters and Phase II substructural search were done using ISIS Base (MDL Information Systems, Inc., San Leandro, CA). LogP was calculated using the atom/fragment contribution method (19). The S-Plus software (MathSoft, Inc., Seattle, Washington) was used to develop the K-Nearest Neighbors (KNN) and Classification and Regression Tree (CART) models (20). The pharmacophore searching was performed with the CATALYST package (Molecular Simulations Inc., San Diego, CA).

Results

The feasibility of the integrated computational approach for priority setting was tested on the NCTR and EPA datasets. The NCTR dataset was used as the training set to develop all the models. These models were then integrated to form the "Four-Phase" system for priority setting of the EPA dataset. The results of the first two phases are summarized in Table 1. In summary, none of the 129 active chemicals in the NCTR dataset was predicted to be negative. Thus, although only 21% of the chemicals in the NCTR dataset were ruled out from further evaluation, no false negatives were introduced in either Phase I or II. Consequently, about 80% of the chemicals in the EPA dataset were predicted to be non-ER binders, and an extremely low rate of false negatives would be expected.

Phase I: Filtering

Several simple rejection filters were applied to eliminate the chemicals with very high or very low molecular weight, or other characteristics that make a chemical unlikely to bind to the ER. Table 1 shows that Phase I filters correctly excluded 7 inactive compounds, or about 3% of the original NCTR dataset. This percentage is as would be expected for a dataset designed to cover active and inactive compounds in the chemical space around ER binders. In real situations, like the EPA database, a much higher percentage of true negatives would be excluded in this phase. As shown in Table 1, more than 30% of the EPA dataset is expected to be excluded at Phase I. The total number of the EPA dataset compounds that passed to Phase II was about 40K. Most importantly, there were no false negatives introduced in this phase using these filters based on the NCTR dataset.

Simple filters are commonly employed in drug discovery to eliminate compounds from further testing. Filters used vary according to application, but are generally derivatives of the well-known Lipinski's "rule of 5" (21). Typically, criteria are applied for molecular weight, hydrophobicity, and number of H-donors and acceptors that make a chemical "drug-like". The rules are aimed at ruling out candidate compounds lacking "drug-likeness" properties, since false positives waste time and resources. When we applied similar rules to the NCTR ER-binding dataset, many false negatives resulted (i.e., many estrogens lacked good drug-like properties).

In contrast to drug discovery, false negatives are an unacceptable outcome for regulatory purposes. Our approach to filter design embodied in Phase I is to minimize the number of false negatives while keeping the rate of false positives low. Phase I rules reflect our aggregate knowledge of the structural requirements for estrogenic activity via the mechanism of ER-binding. These filters were designed to significantly and confidently reduce, without false negatives, the number of compounds to be analyzed further by other more advanced and time-intensive computational models in the remaining phases.
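The four Phase I rejection filters described in Materials and Methods can be sketched as a simple sequence of checks. The record type and its field names below are hypothetical; in the study the filters were applied with ISIS Base rather than with code of this kind, and how "inorganic" is determined for a structure is left as a precomputed flag here.

```python
from dataclasses import dataclass


@dataclass
class Chemical:
    """Hypothetical record; field names are illustrative only."""
    name: str
    mol_weight: float
    n_rings: int
    n_carbons: int
    is_inorganic: bool


def phase1_rejection_reason(chem):
    """Return the first filter that rejects the chemical, or None if it passes."""
    if chem.mol_weight < 94 or chem.mol_weight > 1000:
        return "MW outside 94-1000"
    if chem.n_rings == 0:
        return "no ring"
    if chem.n_carbons == 0:
        return "no carbon"
    if chem.is_inorganic:
        return "inorganic"
    return None


def phase1_filter(chemicals):
    """Split chemicals into those passed to Phase II and those rejected."""
    passed, rejected = [], []
    for chem in chemicals:
        reason = phase1_rejection_reason(chem)
        if reason is None:
            passed.append(chem)
        else:
            rejected.append((chem, reason))
    return passed, rejected
```

Recording the rejection reason alongside each rejected chemical makes it easy to audit the filters against a dataset of known binders and confirm that no active compound is excluded.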

Phase II: Active/Inactive Assignment

In principle, the biological activity of a chemical is determined by its structure, which can be encoded in three distinct but related structural representations: 2D substructures, 3D pharmacophores and physicochemical properties. A 2D substructure is a structural fragment of a molecule, which can often be used as a strong indicator of a particular activity (a structural alert). A 3D pharmacophore is a portion of a chemical's 3D structure that is considered essential in eliciting the biological activity of interest. A physicochemical property of a molecule is a measure of one property of a whole molecule represented by a single value. For example, logP measures a chemical's hydrophobicity. The biological activity of a chemical is related to the aforementioned structural features, but for a particular mechanism, the functional dependence is better represented by some features than by others. That is, a feature important for one mechanism may be less relevant for a different mechanism. Similarly, for a single mechanism such as ER-binding modeled here, some features may well represent binding dependencies for one structural class, while other features will better represent binding dependencies for a different structural class. Phase II encompasses multiple representations of structural features among the different structural alerts, pharmacophore, and classification/clustering models. Consequently, when used in parallel, these differing approaches will in a complementary manner encompass the diversity of structural features related to activity. This approach is particularly critical for regulation, where minimizing false negatives is a paramount concern. To attain minimal false negatives, we developed the rule that only chemicals predicted to be inactive by all models were categorized as inactive and eliminated. Any chemical predicted to be an ER-binder by any one or more models is passed to Phase III for quantitative QSAR evaluation.

Three structural alerts were used in this phase. Each was designed to represent 2D substructural features highly associated with estrogens. Figure 6 shows the performance of these alerts when applied to the NCTR dataset. Most chemicals matching the steroid and DES skeletons were strong estrogens. This is consistent with the observation that a common structural feature of many strong estrogens is the presence of two rings (one of them usually a phenolic ring) separated by two carbons (22). In contrast, the phenolic ring alert was a less specific structural indicator of activity. Chemicals matching this alert were structurally diverse, of which about 80% were active chemicals for the NCTR dataset.

In the drug design industry, 3D pharmacophores have proven valuable as queries for lead discovery, whether applied alone or in conjunction with 2D substructure queries. A pharmacophore query is defined by specifying distance and/or angular constraints to characterize activity. A query-matched chemical is considered positive and segregated for further evaluation. One of the advantages of pharmacophore searching is that it can identify chemicals similar to the template in a 3D sense that may not be discernible by chemists in a 2D sense. Seven pharmacophore queries were developed using 17β-estradiol and DES as templates. Chemicals matching any of these queries were labeled as active. A chemical could match none, a few, or many of the seven separate queries, and the number of matches should increase in direct proportion to the probability of activity. This so-called "pharmacophore hit frequency" could be used to rank order chemicals in accordance with potential activity. Figure 7 shows that the chemicals in the NCTR dataset with a hit frequency >2 were mostly strong estrogens, while those with a hit frequency <=2 were less active. The majority of the chemicals not matched by any query were inactive.

Classification models, which use pattern recognition methods, group compounds on the basis of their similarities in structural features or biological activity patterns. Two types of classification methods, i.e., (1) non-supervised (untrained) and (2) supervised (trained) methods, are widely used in various areas of science and technology to find the regularities and patterns in datasets (23-25). Their applications are widespread in chemistry, biology, and drug discovery (26, 27). Non-supervised classification techniques group a set of chemicals into subsets on the basis of descriptors representing only structure so that the chemicals within a particular subset generally have similar structural patterns. The supervised classification methods identify the structural features that determine the biological activity patterns and refine the model to amplify the importance of these structural determinants. To construct a supervised classification model, a set of molecular representations (descriptors) is first generated for chemicals in the dataset. Next, a chemometric method is applied to correlate these descriptors with their biological activity, usually represented on a categorical scale.

A number of supervised and non-supervised classification methods were evaluated to classify chemicals into two estrogenic categories, active and inactive (28). While these approaches are different in a number of ways, they generally produced similar classification results. The nature of the descriptors used, and more particularly the effectiveness with which they encode the structural features of the molecules related to the estrogenic activity, was far more critical than the specific method employed. A number of computational codes can generate a large number of descriptors for a chemical, such as CODESSA (Semichem, Shawnee, KS), which can generate over 400 descriptors. The selection of molecular descriptors is paramount for the model development. Since our models were intended for application to a very large number of chemicals (about 40,000 chemicals for the EPA dataset), a small number of descriptors is desirable, as long as the model predictivity is not compromised. A close examination of descriptor correlation with activity enabled identification of four simple descriptors, logP, MW, phenolic indicator and pharmacophore hit frequency, which yielded good supervised classification models.

Together, these four descriptors encode the key 2D- and 3D-structural features, as well as the global molecular properties, that determine ER binding. The KNN and CART methodologies were chosen from among many possibilities because they are easy to implement and automate. Figure 8 shows that the predictions of both models for the NCTR dataset were about 85-90% correct for chemicals in either the active or the inactive group, corresponding to false positive and false negative rates of about 10~15%.
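As a rough illustration of how a KNN classifier can operate on the four descriptors named above, the following Python sketch standardizes the descriptors and assigns a query chemical the majority label of its nearest neighbors. It is not the implementation used in this work: the choice of k = 3, the Euclidean distance, the standardization step, and all descriptor values shown are assumptions made up for illustration and are not taken from the NCTR dataset.

```python
import math
from collections import Counter

# Hypothetical training rows:
# (logP, MW, phenolic indicator, pharmacophore hit frequency).
# All values are invented for illustration and are NOT from the NCTR dataset.
TRAIN_X = [
    (4.0, 272.0, 1, 5),
    (5.1, 268.0, 1, 6),
    (3.3, 228.0, 1, 3),
    (1.2, 150.0, 0, 0),
    (0.5, 90.0, 0, 0),
    (6.8, 390.0, 0, 1),
]
TRAIN_Y = ["active", "active", "active", "inactive", "inactive", "inactive"]


def fit_scaler(rows):
    """Return per-descriptor means and standard deviations."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def scale(row, means, stds):
    """Put one descriptor row on the standardized scale."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training chemicals closest to the query."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]


def classify(descriptors, k=3):
    """Classify one chemical from its four raw descriptor values."""
    means, stds = fit_scaler(TRAIN_X)
    scaled_train = [scale(r, means, stds) for r in TRAIN_X]
    return knn_predict(scaled_train, TRAIN_Y, scale(descriptors, means, stds), k)
```

Standardizing first keeps MW, which spans hundreds of units, from dominating the distance over the binary phenolic indicator.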

Each of the 12 models produces some false negatives. Importantly, however, when all models were combined, every active chemical in the NCTR dataset was predicted to be a binder by one or more of the models. Hence, the combined models result in no false negatives for the NCTR dataset (Table 1), confirming that the models are complementary in identifying structural features for ER binding.
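The complementarity argument can be checked mechanically: an active chemical is a combined false negative only if every model misses it. The Python sketch below computes per-model and combined false negatives from boolean predictions. The three models and five chemicals in the usage note are hypothetical placeholders, not results from this study.

```python
def false_negatives(predictions, is_active):
    """Indices of truly active chemicals that one model predicts inactive."""
    return {i for i, (pred, active) in enumerate(zip(predictions, is_active))
            if active and not pred}


def combined_false_negatives(model_predictions, is_active):
    """Actives missed by every model: the intersection of per-model misses."""
    missed = [false_negatives(p, is_active) for p in model_predictions]
    return set.intersection(*missed) if missed else set()


# Hypothetical example: chemicals 0-2 are active, 3-4 are inactive.
IS_ACTIVE = [True, True, True, False, False]
MODELS = [
    [True, False, True, False, False],   # misses chemical 1
    [False, True, True, True, False],    # misses chemical 0
    [True, True, False, False, False],   # misses chemical 2
]
```

In this made-up case each model misses one active, yet `combined_false_negatives(MODELS, IS_ACTIVE)` is empty because no active is missed by all three.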

Consensus Ranking for Priority Setting

There was a major overlap between the hit lists from the three structural alerts, seven pharmacophore queries and two classification models. A chemical could be predicted to be active by none, a few, or many of these twelve models. Since each model identifies specific structural attributes associated with activity, the number of models predicting a chemical to be active should tend to increase with its actual activity. In other words, chemicals can be ranked by the number of models in consensus. This so-called "consensus ranking" thus provides a suitable index for priority setting. To test this approach, the NCTR dataset was divided into five groups in accordance with the consensus ranking: Groups 1-5 contained chemicals with consensus rankings of >6, 5~6, 3~4, 1~2 and 0, respectively (Table 2). Correspondingly, the NCTR dataset was grouped into five RBA categories: Categories A through D contained chemicals with RBAs, compared to 17β-estradiol, no less than 10-fold below, between 10- and 10^3-fold below, between 10^3- and 10^5-fold below, and between 10^5- and 10^6-fold below, respectively. Category E contained chemicals with RBA values more than 10^6-fold below 17β-estradiol (Table 2). As expected, the mean RBA values of the priority groups decreased as the consensus ranking decreased, and were consistent with the RBA ranges assigned to the activity categories (Table 2). The majority of active chemicals (93%) were within Groups 1~3, and most of these had RBAs no less than 10^5-fold below 17β-estradiol (Figure 9). Both Table 2 and Figure 9 show a clear trend of fewer inactive chemicals with increasing consensus ranking. There were no hits for most inactive chemicals. Among the approximately 40,000 chemicals of the EPA dataset that were not ruled out in Phase I, 9,810 chemicals (those in Groups 1~4, Table 2) were identified as potential estrogens by at least one model. Of these, some 3,600 chemicals were in Groups 1~3. These groups would be expected to contain the majority of active chemicals with RBA > 10^-3 (that is, within 10^5-fold of 17β-estradiol). These chemicals would also be expected to compete for binding to the ER at less than 100 µM in our assay because the IC50 of 17β-estradiol is about 1 nM.
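The grouping rules in this paragraph and in Table 2 can be written down directly. The Python sketch below maps a consensus count to a priority group and a measured RBA to an activity category. How values falling exactly on a category boundary are assigned is an assumption here, since the text and Table 2 do not settle it.

```python
def priority_group(consensus):
    """Map the number of models (0-12) predicting activity to Groups 1-5."""
    if consensus > 6:
        return 1
    if consensus >= 5:
        return 2
    if consensus >= 3:
        return 3
    if consensus >= 1:
        return 4
    return 5


def rba_category(rba):
    """Map an RBA (17β-estradiol = 100) to Categories A-E of Table 2.

    Assumption: a value exactly on a boundary goes to the more active category.
    """
    if rba >= 10:
        return "A"
    if rba >= 1e-1:
        return "B"
    if rba >= 1e-3:
        return "C"
    if rba >= 1e-4:
        return "D"
    return "E"


def rank_by_consensus(hits):
    """Sort {chemical name: consensus count} from highest to lowest priority."""
    return sorted(hits, key=lambda name: (priority_group(hits[name]), name))
```

For example, a chemical flagged by four of the twelve models falls in Group 3, the lowest of the three groups expected to hold most actives with RBA > 10^-3.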

Discussion

The EDSTAC (Endocrine Disruptor Screening and Testing Advisory Committee), organized by the EPA, worked for two years on a strategy for identifying potential endocrine disruptors. It recommended two phases: Tier 1 Screening (T1S) and Tier 2 Testing (T2T) (5, 6). T1S consists of several in vitro and short-term in vivo assays to identify hazards, while T2T consists of multi-generation, multi-endpoint animal tests for chemicals shown to be positive in T1S. For a very small portion of the 58,000 chemicals that would undergo some level of assessment, there is enough information to warrant screening and testing. For most of the chemicals, however, little or no hormonal activity information is available. Conducting these screens and tests on all of the appropriate chemicals will be slow because of the cost, time and test animals required. Therefore, methods for priority setting are essential for identifying the chemicals most likely to possess hormonal activity, and thus for efficient implementation of the EDSTAC strategy. Transcription-based high throughput pre-screening (HTPS) was recommended by the EDSTAC as the primary source of biological effect information for priority setting. However, in a pilot study undertaken by the EPA, the HTPS system did not perform well (29).

Here we have demonstrated the utility of computer-based methods for developing a scheme that prioritizes chemicals in rank order based on estimates of their RBA range. This alternative to HTPS for priority setting uses a suite of integrated computational models that includes rule-based, pharmacophoric, chemometric and QSAR techniques. There are two critical aspects of the system. First, the suite of models was developed using the NCTR dataset, which was designed for calibrating and validating models. Second, appropriate predictive computational models were selected and sequenced so that they work in a hierarchical and complementary manner, resulting in both effective and efficient prioritization of a very large database of chemicals for subsequent assessment.

The advantage of the computational approach to priority setting is its efficiency of scale when applied to a large number of chemicals. When several endpoints are analyzed simultaneously, this efficiency is even more pronounced. Importantly, results from Phases I and II alone, together with information on exposure level, production volume, and environmental persistence of chemicals, may be sufficient to enable initial priority setting for estrogens. A similar procedure appears appropriate for androgen and thyroid hormone mimics.

Phases I and II ruled out as potential estrogens all but 9,810 chemicals from the original 58,000 in the EPA dataset. Of these, about 3,600 were predicted to be in the highest priority groups, which identify chemicals with ER binding affinities corresponding to an IC50 of less than 100 µM. How best to use this predictive information depends on the application. The highest three priority groups from Phase II might be of equal concern in a regulatory context, where false negatives are the primary issue. The highest three groups would also be suitable for evaluating large combinatorial libraries for adverse activity early in the drug discovery process. In contrast, the chemicals in priority group 1 (Table 2) might be important for lead discovery in the pharmaceutical industry, which requires a low incidence of false positives.

The RBAs of the chemicals that were predicted to be active in the first two phases will be quantitatively predicted using multiple QSAR models in Phase III. The performance of several QSAR methods, including CoMFA, CODESSA and HQSAR, developed from literature data was previously reported (8, 30-32). Among these QSAR methods, CoMFA stands out as superior in precision. Currently, we have adopted for Phase III a high-performance CoMFA model developed from the NCTR ER-binding dataset (33). The model has been rigorously validated and provides accurate predictions for test chemicals. Furthermore, while time-intensive to use, the CoMFA model can eliminate a large number of the false positives resulting from Phase II. Quantitative predictions of RBAs can also be used for more refined priority setting. In general, compounds with higher predicted binding affinity should be given a higher priority for earlier assessment in the EDSTAC-recommended screening and testing assays.

In Phase IV, a knowledge-based system, or expert system, is proposed to make final decisions on priority setting. We anticipate that the number of chemicals predicted to be active by the QSAR models in Phase III will be on the order of 4,000, or less than 10% of the original EPA dataset. Clearly, conducting the EDSTAC battery of assays on 4,000 chemicals concurrently is intractable, and there remains a need for prioritizing within this group. The expert system will be useful only after human knowledge and expertise, or predictive models, have been incorporated into it as rules. The expert system will then be able to make decisions on individual chemicals based on the rules in its knowledge base.

The expert system’s performance will only be as good as the correctness and usefulness of its knowledge base. We suggest that rule determination is inherently a multidisciplinary undertaking, and offer the following general suggestions on the design of priority setting rules, which should be subject to routine changes to accommodate regulatory needs and public concerns.

First, information gained at each phase of the integrated computational approach should be used for setting priorities. For example, the RBAs predicted in Phase III could be converted into categorical values to rank order chemicals in a manner similar to that used in Phase II. The initial priority setting results from Phases II and III can then be combined to determine a chemical's priority. Second, structural novelty should be considered. For chemicals with similar predicted RBAs, those with novel structural features should be given higher priority, because such chemicals are more likely to be missed and are therefore more likely to cause regulatory problems.
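One way the two suggestions above could be encoded as rules is sketched below in Python. It is purely illustrative: the text does not specify how Phase II and Phase III results are combined, so taking the more urgent of the two ranks, the `novel` flag, and the tie-breaking order are all assumptions, and the chemical names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chemical:
    name: str
    phase2_group: int      # 1 (highest priority) to 5, from consensus ranking
    predicted_rba: float   # Phase III prediction, 17β-estradiol = 100
    novel: bool            # hypothetical flag for structurally novel chemicals


def rba_rank(rba):
    """Convert a predicted RBA to a rank 1-5 using the Table 2 RBA ranges."""
    for rank, floor in enumerate((10, 1e-1, 1e-3, 1e-4), start=1):
        if rba >= floor:
            return rank
    return 5


def priority_key(chem):
    """Sort key: combined rank, then novelty, then higher predicted RBA."""
    # Assumption: keep the more urgent of the Phase II and Phase III ranks.
    combined = min(chem.phase2_group, rba_rank(chem.predicted_rba))
    return (combined, not chem.novel, -chem.predicted_rba)


def prioritize(chemicals):
    """Return chemicals ordered from highest to lowest priority."""
    return sorted(chemicals, key=priority_key)


# Hypothetical inputs, invented for illustration only.
EXAMPLE = [
    Chemical("X1", 3, 0.5, False),
    Chemical("X2", 3, 0.2, True),
    Chemical("X3", 1, 0.0005, False),
    Chemical("X4", 5, 0.00001, False),
]
```

Taking the more urgent rank errs toward keeping a chemical in play when either phase flags it, which suits a regulatory setting where false negatives are the main concern; a drug discovery application might instead require agreement between phases.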

The various models presented here as an integrated suite also have utility when applied individually, or when integrated in part in different ways. For example, the Phase II models together have particular utility for application to a large number of chemicals. Alternatively, for a small number of chemicals, the Phase III QSAR models may be preferable since they provide more refined quantitative predictions.

Concluding Remarks

Various mechanisms are involved in endocrine disruption. The integrated computational approach reported in this paper is based on prediction of chemical binding to the ER, which, in turn, is correlated with numerous estrogenic endpoints. We anticipate that the same scheme will be equally applicable to other mechanistic steps (e.g., androgen receptor binding) involved in endocrine disruption, and to the associated endpoints. The stringent requirement for developing models for additional mechanisms is an appropriately designed training dataset similar to the one employed here for the ER-binding models. Properly validated data allow the structural rules that govern activity to be determined and used to develop robust predictive models.

While the results presented here clearly show both the feasibility and utility of using a computational approach for priority setting, it is important to note that predictions from any model are intrinsically no better than the experimental data employed for calibration. Any limitations of the assay used to generate the calibration data apply equally to the model’s predictions. It is difficult to guarantee that no active chemicals are predicted to be inactive, other than by assaying all chemicals predicted to be inactive.

Moreover, false negatives and false positives depend on the cut-off value defined to distinguish active from inactive. As the cut-off value is lowered, it is likely that error will increase even for a well designed and executed assay, and false positives and false negatives will both increase. Similarly, more false predictions might be introduced for chemicals with activity close to the cut-off value. For a large number of chemicals, the rates of false positives and false negatives in predicted RBA values must be characterized experimentally by running assays on a sufficiently large number of chemicals. However, for the NCTR dataset there were no false negatives among the 129 chemicals that assayed positive, even though there were numerous low affinity ligands among them.
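The dependence of error rates on the cut-off can be made concrete with a small calculation. The Python sketch below counts false positives and false negatives for a chosen RBA cut-off, using the conventional definitions (misclassified inactives over all inactives, and misclassified actives over all actives), which is an assumption and may differ from the convention used for Figure 8. The measured and predicted values are invented for illustration.

```python
def error_rates(measured, predicted, cutoff):
    """Return (false_positive_rate, false_negative_rate) at an RBA cut-off.

    A chemical counts as active when its RBA is at or above the cut-off.
    """
    fp = fn = actives = inactives = 0
    for m, p in zip(measured, predicted):
        truly_active = m >= cutoff
        called_active = p >= cutoff
        if truly_active:
            actives += 1
            fn += not called_active
        else:
            inactives += 1
            fp += called_active
    fpr = fp / inactives if inactives else 0.0
    fnr = fn / actives if actives else 0.0
    return fpr, fnr


# Hypothetical RBA values (17β-estradiol = 100), invented for illustration.
MEASURED = [50, 2, 0.05, 0.002, 0.0002, 0.00005]
PREDICTED = [30, 0.5, 0.2, 0.0005, 0.002, 0.00002]
```

With these made-up values the misclassified chemicals are those whose measured and predicted RBAs straddle the cut-off, and which chemicals those are changes as the cut-off moves.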

References

1. Colborn T, Dumanoski D, Myers JP. Our Stolen Future. Plume: New York, 1996.
2. Kavlock RJ, Daston GP, DeRosa C, Fenner-Crisp P, Gray LE, Kaattari S, Lucier G, Luster M, Mac MJ, Maczka C, Miller R, Moore J, Rolland R, Scott G, Sheehan DM, Sinks T, Tilson HA. Research needs for the risk assessment of health and environmental effects of endocrine disruptors: a report of the U.S. EPA-sponsored workshop. Environmental Health Perspectives 104 Suppl 4:715-40; 1996.
3. 104th U.S. Congress. Food Quality Protection Act, 21 U.S.C. 346a(p). The Safe Drinking Water Act, 42 U.S.C. 300j-17. Public Law 104-182, 1996.
4. US-Congress. The Food Quality Protection Act (FQPA) and the Safe Drinking Water Act (SDWA), 1996.
5. EDSTAC. http://www.epa.gov/opptintr/opptendo/finalrpt.htm. In: Final Report.
6. Gray LE, Jr. Tiered screening and testing strategy for xenoestrogens and antiandrogens. Toxicol Lett 102-103:677-80; 1998.
7. Patlak M. A testing deadline for endocrine disrupters. Environmental Science & Technology 30:540A-544A; 1996.
8. Tong W, Lowis DR, Perkins R, Chen Y, Welsh WJ, Goddette DW, Heritage TW, Sheehan DM. Evaluation of quantitative structure-activity relationship methods for large-scale prediction of chemicals binding to the estrogen receptor. Journal of Chemical Information and Computer Sciences 38:669-77; 1998.
9. Blair R, Fang H, Branham WS, Hass B, Dial SL, Moland CL, Tong W, Shi L, Perkins R, Sheehan DM. Estrogen Receptor Relative Binding Affinities of 188 Natural and Xenochemicals: Structural Diversity of Ligands. Toxicological Sciences; 1999 (in press).
10. Perkins R, Anson JF, Blair R, Branham WS, Chen Y, Dial SL, Fang H, Hass B, Jackson M, Lu M, Meehan J, Moland C, Nossaman R, Leming M, Shi L, Tong W, Tu M, Welsh WJ, Sheehan DM. The Endocrine Disruptor Knowledge Base: A Prototype Toxicological Knowledge Base and Predictive Models for Endocrine Disrupting Compounds. In: Society of Environmental Toxicology and Chemistry, 8th International Workshop on Quantitative Structure Activity Relationships (QSARs) in the Environmental Sciences, Baltimore, Maryland, 1998.
11. Sadowski J, Kubinyi H. A Scoring Scheme for Discriminating between Drugs and Nondrugs. J Med Chem 41:3325-3329; 1998.
12. Brzozowski AM, Pike AC, Dauter Z, Hubbard RE, Bonn T, Engstrom O, Ohman L, Greene GL, Gustafsson JA, Carlquist M. Molecular basis of agonism and antagonism in the oestrogen receptor. Nature 389:753-8; 1997.
13. Shiau AK, Barstad D, Loria PM, Cheng L, Kushner PJ, Agard DA, Greene GL. The structural basis of estrogen receptor/coactivator recognition and the antagonism of this interaction by tamoxifen. Cell 95:927-37; 1998.
14. Hong H, Neamati N, Wang S, Nicklaus MC, Mazumder A, Zhao H, Burke TR, Pommier Y, Milne GWA. Discovery of HIV-1 integrase inhibitors by pharmacophore searching. Journal of Medicinal Chemistry 40:930-936; 1997.
15. Greenidge PA, Carlsson B, Bladh L-G, Gillner M. Pharmacophores Incorporating Numerous Excluded Volumes defined by X-ray Crystallographic Structure in Three-Dimensional Database Searching: Application to the Thyroid Hormone Receptor. J. Med. Chem. 41:2503-2512; 1998.
16. Fang H, Tong W, Tu M, Perkins R, Sheehan D. Pharmacophore searching as a priority setting technique for screening and testing of endocrine disruptors; 2000 (in preparation).
17. Hawkins DM, Young SS, Rusinko III A. Analysis of Large Structure-Activity Data Set Using Recursive Partitioning. Quant. Struct.-Act. Relat. 16:296-302; 1997.
18. Clark LA, Pregibon D. Tree-based models, Chapter 9: Chambers & Hastie, 1992.
19. Meylan W, Howard P. Atom/Fragment Contribution Method for Estimating Octanol-Water Partition Coefficients. J. Pharma. Sci. 84:83-92; 1995.
20. Venables WN, Ripley BD. Modern Applied Statistics with S-PLUS: Springer, 1997.
21. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews 23:3-25; 1997.
22. Fang H, Tong W, Perkins R, Soto A, Prechtl N, Sheehan DM. Quantitative Comparison of in vitro assays for Estrogenic Activity. Environmental Health Perspectives; 1999 (submitted).
23. Arabie P, Hubert LJ, De Soete G, eds. Clustering and Classification. Singapore: World Scientific Pub Co, 1996.
24. Mirkin BG. Mathematical Classification and Clustering. Boston: Kluwer Academic Publishers, 1996.
25. Gordon AD. Classification: Methods for the Exploratory Analysis of Multivariate Data. London: Chapman and Hall, 1981.
26. Willett P. Similarity and Clustering in Chemical Information Systems, vol 12. Letchworth: Research Studies Press, 1987.
27. Shi LM, Tong W, Perkins R, Chen Y, Fang H, Poirier L, Sheehan D. Classification methods for toxicity prediction and anticancer drug discovery. In: 1998 Joint Annual Meeting of the Classification Society of North America and the Psychometric Society, Urbana, Illinois, June 17-21 1998.
28. Shi LM, Tong W, Fang H, Perkins R, Sheehan D. Classification and clustering methods for the identification of estrogenic compounds. Journal of Chemical Information and Computer Sciences; 2000 (in preparation).
29. E/ELetter. EPA Scientific Advisors: HTPS Needs More Work; Mixtures Low Priority. In: Endocrine/Estrogen Letter, 1999.
30. Tong W, Perkins R, Sheehan DM. Perspectives on Three-Dimensional Quantitative Structure-Activity Relationship (3D-QSAR)/Comparative Molecular Field Analysis (CoMFA) in Determining Estrogenic Effects. Japan Chemistry Today 2:50-57; 1999.
31. Tong W, Perkins R, Strelitz R, Collantes ER, Keenan S, Welsh WJ, Branham WS, Sheehan DM. Quantitative structure-activity relationships (QSARs) for estrogen binding to the estrogen receptor: predictions across species. Environmental Health Perspectives 105:1116-24; 1997.
32. Tong W, Perkins R, Xing L, Welsh WJ, Sheehan DM. QSAR models for binding of estrogenic compounds to estrogen receptor alpha and beta subtypes. Endocrinology 138:4022-5; 1997.
33. Shi LM, Tong W, Fang H, Perkins R, Blair R, Branham W, Sheehan D. Comparative Molecular Field Analysis (CoMFA) of a large diverse set of estrogenic compounds. Journal of Chemical Information and Computer Sciences; 2000 (in preparation).

Table 1. Summary of Phase I and II results for the NCTR and EPA datasets.

                                  NCTR dataset                 EPA dataset
                          Active   Inactive   Total
Original Size             129      103        232              57,810
Reduced Size, Phase I     129      96         225 (97%)        39,822 (69%)
Reduced Size, Phase II    129      54         183 (79%)        9,810 (17%)

Table 2. Priority setting of the NCTR and EPA datasets

Experiment                   Prediction                 NCTR dataset                     EPA dataset
Assigned     RBA             Prioritized   Consensus    Mean RBA*      Active/Inactive   # of Compds
Categories   Range           Groups        Ranking
A            >10             1             >6           1.2            32 / 1            124
B            10^-1~10        2             5~6          1.1 x 10^-2    44 / 9            317
C            10^-3~10^-1     3             3~4          3 x 10^-3      43 / 16           3,183
D            10^-4~10^-3     4             1~2          2.3 x 10^-4    9 / 29            6,186
E            <10^-4          5             0            <10^-4         0 / 49            30,012

* RBA value for inactives was set to 10^-4 to calculate mean RBA.

List of Figures

Figure 1. Overview diagram of the integrated computational approach for priority setting. Different Phases work in a hierarchical way; different methods within each Phase work complementarily.

Figure 2. Binding activity distribution of the NCTR dataset. The activity is represented as RBA (Relative Binding Affinity). The RBA for the endogenous ligand, 17β-estradiol, was set to 100.

Figure 3. Chemical class distribution of the NCTR dataset.

Figure 4. Structural alerts for priority setting. Compounds with any of these substructural features, which are commonly seen in ER ligands, should be passed to Phase III (multiple QSAR predictions) of the integrated priority setting approach. This ensures that compounds with these features receive proper attention in priority setting.

Figure 5. Superimposition of ligands bound to ER. The phenolic A-ring is very important for ER binding.

Figure 6. Performance of three structural alerts, the steroid skeleton, the DES skeleton and the phenolic ring, on the NCTR dataset. The activity categories were arbitrarily defined as follows: strong estrogens had activities no less than 10^3-fold below 17β-estradiol, weak estrogens had activities between 10^3- and 10^6-fold below, and inactive chemicals had activities more than 10^6-fold below.

Figure 7. Rank ordering of the NCTR dataset based on pharmacophore hit frequency. The activity categories are as defined in Figure 6.

Figure 8. Classification results for both active and inactive chemicals in the NCTR dataset using KNN and CART. False positive and false negative rates were the number of misclassified chemicals divided by the total number of chemicals predicted for the inactive and active groups, respectively.

Figure 9. Ranking of the chemicals of the NCTR dataset according to the consensus ranking derived from structural alerts, pharmacophore queries and classification models. RBA = 100 for 17β-estradiol.

Figure 1

Phase I: Eliminate True Negatives
    4 Simple Filters
Phase II: Qualitative Activity Prediction
    3 Structural Alerts (2D)
    7 Pharmacophore Queries (3D)
    2 Classification Methods (Physicochemical Properties)
Phase III: Quantitative Activity Prediction
    Classical QSAR (Physicochemical Properties)
    HQSAR (2D)
    CoMFA (3D)
Phase IV: Priority Setting
    Expert System
Side labels: Compound Priority; Computational Intensity

Figure 2

[Bar chart. x-axis: Relative Binding Affinity (RBA); y-axis: No. of Chemicals.]

Figure 3

[Bar chart. x-axis: Chemical Classes; y-axis: No. of Chemicals.]

Figure 4

[Structure drawings of the three structural alerts. Steroid Skeleton: C1-C17, any bond type. DES Skeleton: two phenyl rings linked by two carbons, with any bond type. Phenolic ring.]

Figure 5

[Structure image of superimposed ligands, with the hydroxyl (OH) group labeled.]

Figure 6

[Bar chart. Legend: Steroid Skeleton, DES Skeleton, Phenolic Ring. x-axis: Binding Activity Category (Inactive, Weak, Strong); y-axis: Number of Chemicals.]

Figure 7

[Bar chart. Legend: Hit Freq = 0, 1~2, >2. x-axis: Binding Activity Category (Inactive, Weak, Strong); y-axis: Number of Chemicals.]

Figure 8

[Bar chart. Legend: Exp. Results, KNN Model, CART Model. x-axis: Categories (Active, Inactive, False Positives, False Negatives); y-axis: Prediction Ratio (%).]

Figure 9

[Bar chart. Legend (Category, RBA): A, >10; B, 10~0.1; C, 0.1~0.001; D, 0.001~0.0001; E, <0.0001. x-axis: Prioritized Groups (1-5); y-axis: No. of Chemicals.]
