Profiling the ToxCast Library With a Pluripotent Human (H9) Stem Cell Line-Based Biomarker Assay for Developmental Toxicity


Todd J. Zurlinden,* Katerine S. Saili,* Nathaniel Rush,* Parth Kothiya,* Richard S. Judson,* Keith A. Houck,* E. Sidney Hunter,† Nancy C. Baker,‡ Jessica A. Palmer,§ Russell S. Thomas,* and Thomas B. Knudsen*,1

*National Center for Computational Toxicology (NCCT) and †National Health and Environmental Effects Research Laboratory (NHEERL), Office of Research and Development (ORD), U.S. Environmental Protection Agency (USEPA), Research Triangle Park, North Carolina 27711; ‡Leidos, Research Triangle Park, North Carolina 27711; and §Stemina Biomarker Discovery, Inc, Madison, Wisconsin 53719

1. To whom correspondence should be addressed at National Center for Computational Toxicology (B205-01), U.S. Environmental Protection Agency, Research Triangle Park, NC 27711. E-mail: [email protected].

Disclaimer: The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

ABSTRACT

The Stemina devTOX quickPredict platform is a human pluripotent stem cell-based assay that predicts developmental toxicity potential based on changes in cellular metabolism following chemical exposure [Palmer, J. A., Smith, A. M., Egnash, L. A., Conard, K. R., West, P. R., Burrier, R. E., Donley, E. L. R., and Kirchner, F. R. (2013). Establishment and assessment of a new human embryonic stem cell-based biomarker assay for developmental toxicity screening. Birth Defects Res. B Dev. Reprod. Toxicol. 98, 343–363]. Using this assay, we screened 1065 ToxCast phase I and II chemicals in single-concentration or concentration-response for the targeted biomarker (ratio of ornithine to cystine secreted into or consumed from the media). The dataset from the Stemina (STM) assay is annotated in the ToxCast portfolio as STM. Major findings from the analysis of the ToxCast_STM dataset include (1) 19% of 1065 chemicals yielded a prediction of developmental toxicity, (2) assay performance reached 79%–82% accuracy with high specificity (> 84%) but modest sensitivity (< 67%) when compared with in vivo animal models of human prenatal developmental toxicity, (3) sensitivity improved as more stringent weights of evidence requirements were applied to the animal studies, and (4) statistical analysis of the most potent chemical hits on specific biochemical targets in ToxCast revealed positive and negative associations with the STM response, providing insights into the mechanistic underpinnings of the targeted endpoint and its biological domain. The results of this study will be useful for improving our ability to predict in vivo developmental toxicants based on in vitro data and in silico models.

Keywords: predictive toxicology; developmental toxicity; embryonic stem cells

© Copyrights 2019

In 2007, the National Research Council published Toxicity Testing in the 21st Century: A Vision and a Strategy (National Research Council, 2007). This report addressed the potential for automated high-throughput screening (HTS) and high-content screening (HCS) assays and technologies to identify chemically induced biological activity in human cells and to develop predictive models of in vivo biological response that would ignite a shift from traditional animal endpoint-based testing to human pathway-based risk assessment (Collins et al., 2008). Concurrent with the NRC 2007 report, the U.S. Environmental Protection Agency (USEPA) launched the ToxCast research program that utilized statistical methods and machine learning algorithms in combination with HTS/HCS data for profiling biological pathways and building bioactivity signatures predictive of toxicity (Judson et al., 2010, 2016; Kavlock et al., 2012; Richard et al., 2016). An abundance of HTS/HCS data has since fueled the building and testing of integrative models for “encoding the toxicological blueprint of active substances that interact with living systems” (Juberg et al., 2017; Sturla et al., 2014).

Impetus for the research and application of HTS/HCS assays is bolstered by the regulatory need to fill information gaps on potential hazards that chemicals might pose to human health and the environment and to identify and implement appropriate health-protective risk management measures under the Registration, Evaluation, and Authorization of Chemicals (REACh) regulation (European Parliament, Council of the European Union, 2006) and The Frank R. Lautenberg Chemical Safety for the 21st Century Act (amended Toxic Substances Control Act) in the United States (US Public Law 114–182, 2016). Under the amended Toxic Substances Control Act, for example, the USEPA must encourage and facilitate “… the use of scientifically valid test methods and strategies that reduce or replace the use of vertebrate animals while providing information of equivalent or better scientific quality and relevance that will support regulatory decisions …” and consider the impacts of chemicals and chemical mixtures to “… potentially exposed or susceptible subpopulation … who, due to either greater susceptibility or greater exposure, may be at greater risk than the general population of adverse health effects from exposure to a chemical substance or mixture, such as infants, children, pregnant women, workers, or the elderly.” (US Public Law 114–182, 2016). The REACh regulation cites identification of derived no effect levels “for each relevant human population (e.g. workers, consumers and humans liable to exposure indirectly via the environment) and possibly for certain vulnerable sub-populations (e.g. children, pregnant women) …” and the need “… to replace, reduce or refine testing on vertebrate animals” (European Parliament, Council of the European Union, 2006). These regulations highlight the need for in vitro assays and in silico models that can be used to evaluate the developmental toxicity potential of chemicals in screening and prioritization contexts, with less reliance on animal testing.

The in vivo protocol commonly used to test for prenatal developmental toxicity (ie, OECD TG 414) is designed for a health-protective effects assessment based on observation of fetal malformations and variations in a study designed to produce a dose-response. These in vivo developmental studies are costly and animal-resource intensive, and responses can differ across species (Knudsen and Daston, 2018; Leist et al., 2014). As such, HTS/HCS-based methodologies should consider novel in vitro data and in silico models that can effectively and efficaciously profile chemicals for critical effects on human development and also point to mechanistic pathways. Some of the most promising nonanimal alternatives exploit the self-organizing potential of embryonic stem cells (ESCs) to recapitulate developmental processes that may be sensitive to chemical exposure (Bremer and Hartung, 2004; Luz and Tokar, 2018).
Endpoints that provide mechanistic support for tissue-specific developmental processes include cardiomyocyte differentiation (Chandler et al., 2011; Genschow et al., 2002; Seiler and Spielmann, 2011), gene expression (Panzica-Kelly et al., 2013; Pennings et al., 2011), metabolic profiling (Kleinstreuer et al., 2011; Palmer et al., 2013; West et al., 2010), regulatory, gene-specific biomarkers (Kameoka et al., 2014; Le Coz et al., 2015), stem cell migration (Xing et al., 2015), axial patterning (Warkus and Marikawa, 2017), and histodifferentiation in 3D organoids (Huch and Koo, 2015). For example, the validated mEST (Genschow et al., 2002) monitors emergence of beating cardiomyocytes from pluripotent murine ESCs as the targeted read-out (in parallel with cytotoxicity) to discriminate nonteratogens from weak teratogens and strong teratogens. Because the cardiopoietic lineage is ultimately dependent on heterogeneous interactions with other cell lineages in the culture, either via “embryoid bodies” (Seiler and Spielmann, 2011) or dense monolayers (Chandler et al., 2011), the cardiogenic read-out is an effective surrogate for complex pathways in teratogenicity. These examples show the diversity of alternative test modalities amenable to ESC-based methodologies for developmental hazard prediction in embryogeny.

Assays currently represented in the ToxCast portfolio evaluate hundreds of biochemical targets, dozens of signaling pathways, and a broad range of cellular effects. To increase the diversity of HTS assays used to predict developmental toxicants, we describe the addition of a human stem cell-based platform to the ToxCast portfolio based on the devTOX quickPredict (devTOXqP) platform (Palmer et al., 2013).
This assay, contracted from Stemina Biomarker Discovery, utilizes undifferentiated H9 human embryonic stem cells (hESCs) and measures relative changes in 2 metabolites, ornithine (ORN) and cystine (CYSS), targeting the ORN/CYSS ratio as a biomarker for developmental toxicity (Palmer et al., 2013, 2017). Ornithine is a nonproteinogenic amino acid that functions in several biochemical pathways including ammonia detoxification in the urea cycle, pyrimidine synthesis via ornithine transcarbamylase, and polyamine synthesis via ornithine decarboxylase. Ornithine is initially absent from the medium but released from viable cells; as such, decreased cellular release reflects general metabolic states for these pathways. Cystine is initially present in the medium and used by cells in glutathione production; as such, the change connected to decreased CYSS uptake likely reflects a change in cellular glutathione synthesis and redox balance. Ultra-performance liquid chromatography–high-resolution mass spectrometry (UPLC-HRMS) measures these metabolites in the conditioned medium of H9 hESCs maintained in a pluripotent state during a 3-day chemical exposure.

Although it is not known if bidirectional changes in ORN and CYSS are coupled directly to the same pathways or indirectly to cellular metabolic state, an imbalance dropping the ORN/CYSS ratio below a critical level (e.g., < 0.88) has positive predictive value (PPV) for a chemical’s potential to invoke teratogenicity (Palmer et al., 2013). The teratogenicity index (TI) potential, as defined by Palmer et al. (2013) for pluripotent H9 hESCs and more recently the potential for developmental toxicity in induced human pluripotent stem cells (iPSCs) (Palmer et al., 2017), is a concentration-based comparison between the ORN/CYSS ratio relative to cell viability established using a 23-pharmaceutical compound training set (Palmer et al., 2013). Predictive performance has been shown by the assay provider for 80 diverse chemical compounds at 85% accuracy (0.89 specificity and 0.82 sensitivity) based on observed developmental toxicity in rodents or humans (Zhu et al., 2016).

Here, we provide a comprehensive description and analysis of the ToxCast_STM platform. Results are shown for 1065 chemicals from the phase I and II ToxCast library (1065 unique structures) (Richard et al., 2016).
We describe the ToxCast assay annotation, hereafter referred to simply as “STM,” with regard to (1) data processing through the ToxCast pipeline (tcpl) R package version 2.0.1 (https://cran.r-project.org/web/packages/tcpl/vignettes/Data_retrieval.html) (Filer et al., 2017), (2) quality control metrics and performance ratings in predictivity across published benchmark compounds of developmental toxicity (Augustine-Rauch et al., 2016; Daston et al., 2014; Genschow et al., 2002; West et al., 2010; Wise et al., 2016), (3) a broader evaluation of STM assay performance when anchored to prenatal developmental toxicity endpoints collated in the ToxRefDB database for pregnant rat and rabbit studies (Knudsen et al., 2009; Martin et al., 2009; Watford et al., 2019), and (4) an initial analysis of sensitive and insensitive pathways in the assay relative to 440 biochemical features in the ToxCast NovaScreen (NVS) dataset (Knudsen et al., 2011; Sipes et al., 2013). Concurrent with this publication, the ToxCast STM dataset (https://doi.org/10.23645/epacomptox.11819265; ftp://newftp.epa.gov/COMPTOX/CCTE_Publication_Data/BCTD_Publication_Data/Knudsen/ToxCast_STM_dataset) is being released on the ToxCast data download page (https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data) for public access.

MATERIALS AND METHODS

ToxCast chemical library

EPA’s ToxCast chemical library has been constructed iteratively using criteria including chemical nomination and procurement, dimethyl sulfoxide (DMSO) solubility, and suitability for testing in automated or semiautomated systems. Drivers for procurement included availability of animal toxicity data, mechanistic knowledge to support model development predicting toxicity, and chemicals of heightened regulatory concern for which data are lacking. The ToxCast chemical inventory file is available at the following link: (ftp://newftp.epa.gov/comptox/Sustainable_Chemistry_Data/Chemistry_Dashboard/2018/September/), last accessed February 6, 2020. For a detailed description of the library, see Richard et al. (2016).

The present evaluation addresses the phase I and II ToxCast library, which includes pesticides accompanied by guideline animal studies (OCSPP 870 series, and some NTP studies that are guideline-like), data-poor industrial chemicals, and over a hundred pharmaceutical compounds. This list has 1065 unique structures and 13 duplicates for a total of 1078 samples tested here. Chemical compounds were commercially procured, diluted in DMSO to a stock concentration of up to 100 mM (approximately 30% of the chemicals were provided at concentrations lower than 100 mM), and plated by Evotec (US), Inc (Watertown, Massachusetts). Aliquots from the stock plates were first diluted with 100% DMSO (Sigma-Aldrich, St Louis, Missouri) to a concentration 1000 times the highest test concentration (HTC) (if necessary) and then diluted 1:1000 in the cell culture media for testing. The final concentration of 0.1% DMSO was a major determinant of the HTC because DMSO itself decreases the ORN/CYSS ratio in hESCs at concentrations above 0.2% (Palmer, unpublished data) and adversely impacts mESCs at concentrations > 0.25% (Adler et al., 2006).
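The two-step dilution described above fixes the vehicle at 0.1% DMSO and caps the HTC at one-thousandth of the stock concentration. The following Python sketch works through that arithmetic. The function name `dilution_plan` and its return fields are hypothetical and serve only as illustration; they are not part of the study protocol.

```python
def dilution_plan(stock_mM: float, htc_uM: float, media_fold: int = 1000) -> dict:
    """Illustrative arithmetic for the DMSO pre-dilution and 1:1000 media dilution.

    stock_mM   : concentration of the DMSO stock (mM)
    htc_uM     : desired highest test concentration in media (uM)
    media_fold : fold dilution into culture media (1:1000 in the text)
    """
    # The intermediate DMSO solution must be media_fold x HTC (converted uM -> mM).
    intermediate_mM = htc_uM * media_fold / 1000.0
    if intermediate_mM > stock_mM:
        raise ValueError("HTC is not reachable from this stock at the stated dilution")
    return {
        "intermediate_mM": intermediate_mM,
        # Fold pre-dilution of the stock in 100% DMSO (1.0 means none is needed).
        "dmso_predilution_fold": stock_mM / intermediate_mM,
        # Vehicle carried into the well, as percent DMSO by volume.
        "final_dmso_percent": 100.0 / media_fold,
    }


if __name__ == "__main__":
    print(dilution_plan(stock_mM=100.0, htc_uM=10.0))
```

Under these assumptions, a 100 mM stock reaches an HTC of 100 µM with no pre-dilution, whereas an HTC above 100 µM would require either a more concentrated stock or more than 0.1% DMSO in the well.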

We coded stock plates to blind the assay provider to chemical identity. For 7 chemical samples, neat compound was procured from Evotec to enable retesting at concentrations above the range achievable by stock dilutions.

Pluripotent H9 hESC culture

H9 cells (NIH code WA09, WiCell Research Institute, Inc, Madison, Wisconsin) were used as approved for federally funded research and selected because of their commercial availability, genetic stability (normal female karyotype), and scientific legacy (hundreds of publications). Derivation and characterization of the H9 cell line were originally reported by Thomson et al. (1998). Cells were handled as described (Palmer et al., 2013). Briefly, cells were maintained under feeder-free conditions with mTeSR1 media (StemCell Technologies, Vancouver, Canada) on Matrigel hESC-Qualified Matrix (Corning, Bedford, Massachusetts) coated 6-well plates. Cultures were incubated at 37°C in a humidified atmosphere of 5% CO2. Differentiated colonies were removed daily through aspiration to maintain the undifferentiated stem cell population. Differentiation was assessed by visual inspection; there is typically < 5% differentiation in a culture, and only highly pure undifferentiated H9 cell populations were used for these experiments. Cultures were passaged using Versene (Life Technologies, Grand Island, New York) or ReLeSR (StemCell Technologies) at 85%–90% confluency and karyotyped approximately every 10 passages, and the absence of mycoplasma was routinely confirmed with the MycoAlert Mycoplasma Detection Kit (Lonza, Rockland, Maine).

All treatments were carried out in Matrigel-coated 96-well plates. H9 cells were plated with a seeding density of 100 000 cells per well in mTeSR1 medium containing 10 µM Y27632 Rho-associated kinase inhibitor (ATCC, Manassas, Virginia) to increase plating efficiency. Y27632 was removed prior to compound addition at 24 h after plating.
The passage number of H9 cells used over the course of this study ranged from 31 to 48; cultures above passage 40 were karyotyped within 10 passages prior to use in the assay.

Chemical exposure

H9 hESCs maintained in a pluripotent state were exposed to test compound for 72 h, with media replacement and chemical replenishment every 24 h. Cell-conditioned media from the final 24-h treatment period was collected for analysis of the targeted biomarker, and cell viability of the corresponding cell layer was assessed as described below. The chemical library was tested in blinded fashion. Plate design and sample workflow are summarized in Figure 1 for the single-concentration screen (no cell viability measures, n = 3) and/or 8-point concentration-response series (with cell viability measure, n = 3). Each test plate included 1 µM methotrexate (MTX; Selleck Chemicals, Houston, Texas) as a positive reference, 5 nM MTX as a negative reference, 0.1% DMSO as the neutral (vehicle) control, and sample-level media blanks.

Figure 1. Workflow for the ToxCast STM dataset. “Samples” indicates chemicals tested in triplicate from stock plates; “Compounds” indicates chemicals entered into the dataset. The total number of samples (1373) reflects all duplicate measures and single-concentration screens that collapse to 1065 chemical records. Plate design is mapped for single-concentration (upper) and concentration-response (lower) series; individual plate-level controls included negative control (5 nM methotrexate, green wells), positive control (1 µM methotrexate, red wells), and neutral control (dimethyl sulfoxide; DMSO, gray wells). “Records” refers to individual chemical entries into the ToxCast data pipeline (tcpl) at level 0 from which the virtual plate diagram is reconstructed for QA purposes.
Records from chemicals tested in the concentration-response series were processed in tcpl to level 6 and entered into invitrodb (Filer et al., 2017); records from chemicals tested at a single concentration were entered directly to invitrodb from tcpl level 0 and pipelined to level 2. All subsequent data analysis was performed from the processed data at level 6 for concentration-response and level 2 for single-concentration data.

Dosing strategy for the initial screen was guided by the “cytotoxicity point” previously reported across 38 different ToxCast assays (Judson et al., 2016). We initially selected 141 chemicals for concentration-response testing; the remaining 924 chemical samples were tested at a single concentration. Of those, 252 were retested in concentration-response to confirm a response or adjust the concentration range. The HTC for the initial screen was, for most samples, set to 1, 10, or 100 µM so as not to exceed the chemical-specific median AC50 cytotoxicity point based on Z-score as defined by Judson et al. (2016) unless otherwise limited by compound availability. Other considerations for setting the HTC in concentration-response evaluation included outcome of the single-concentration screen, relevant data from ToxRefDB prenatal developmental toxicity animal studies, and reference compounds used by other studies on alternative test platforms for developmental toxicity (Augustine-Rauch et al., 2016; Daston et al., 2014; Genschow et al., 2002; West et al., 2010). In all, 379 chemicals were tested in concentration-response and the remaining 686 were negatives at the initial screen. Among the latter, 14 chemicals remain for which the HTC was an order of magnitude below the ToxCast lower bounds of the median cytotoxicity burst (LBC) (see Supplementary Table S1 for details).

Biosample processing

H9 cell-conditioned media from the final 24-h treatment period was collected for analysis of the targeted biomarker, and cell viability was measured from the corresponding cell layer. Cell viability was measured using the CellTiter-Fluor assay (Promega, Madison, Wisconsin), which is based on proteolytic cleavage of a substrate to yield a fluorescent signal proportional to the number of living cells (Niles et al., 2007). The cell viability Relative Fluorescence Unit (RFU) was background corrected and normalized to mean RFU of the neutral control (0.1% DMSO).
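As a minimal sketch of the viability normalization just described, the Python function below subtracts a background value and scales each test well to the mean of the neutral (0.1% DMSO) controls. The text does not state how the background was estimated; using the mean RFU of media-blank wells is an assumption made here for illustration, and the function and argument names are hypothetical.

```python
from statistics import mean


def normalize_viability(test_rfu, neutral_rfu, blank_rfu):
    """Return background-corrected RFU scaled to the neutral-control mean.

    test_rfu    : raw RFU values for test-compound wells
    neutral_rfu : raw RFU values for 0.1% DMSO (neutral control) wells
    blank_rfu   : raw RFU values for media-blank wells (assumed background source)
    """
    background = mean(blank_rfu)
    # Mean of the background-corrected neutral controls defines 100% viability.
    neutral_mean = mean(r - background for r in neutral_rfu)
    return [(r - background) / neutral_mean for r in test_rfu]


if __name__ == "__main__":
    # Made-up RFU values: full, half, and no viability relative to DMSO.
    print(normalize_viability([1100, 600, 100], [1100, 1100], [100, 100]))
```

A value of 1.0 corresponds to viability equal to the vehicle control, and values near 0 indicate near-complete loss of viable cells.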
The collected H9-cell-conditioned media samples were processed for targeted biomarker analysis as described (Palmer et al., 2013). Briefly, spent media samples were deproteinized (40% acetonitrile) and processed for UPLC-HRMS. Data acquisition was performed using 4 separate UPLC-HRMS systems, each consisting of an Agilent 1290 Infinity LC system (Agilent Technologies) interfaced with an Agilent high-resolution mass spectrometer (models G6520A, G6520B, G6530A, and G6224A). A Waters Acquity UPLC BEH Amide column (2.1 mm × 50 mm, 1.7 μm particle size; Waters, Milford, Massachusetts) maintained at 40°C was used to separate metabolites with a 6.5 min solvent gradient of 0.1% formic acid in water and 0.1% formic acid in acetonitrile (1.0 ml/min flow rate). Data were acquired using MassHunter Acquisition software (version B 04.00, Agilent Technologies). The extracted ion chromatogram (EIC) areas for ORN, CYSS, and their respective spike-in C13-standards were determined using the Agilent MassHunter Quantitative Analysis software, version B.05.00 or newer (Agilent Technologies), and data were normalized as described in Palmer et al. (2017).

ToxCast annotations

All raw data and metadata were loaded into a central database for ToxCast using standard nomenclature to pipeline the data as outlined by Filer et al. (2017) into invitrodb (v3 pending release, March 2020) under the assay source identifier (asid) for the Stemina devTOXqP platform, designated STM_H9 (asid 14); specifically, the 2 assay identifiers (aid) STM_H9_secretome (aid 428) and STM_H9_viability (aid 437), representing data measures for the conditioned media and cell monolayer, respectively. This included identifiers for tracking each chemical sample (spid), plate identifier (apid), well position (row, column), micromolar concentration tested (dose), well quality (0 = fail, 1 = pass), and well type (wllt).
The well type identifiers included media blank “b” lacking H9 cells, neutral control “n” 0.1% DMSO, test compound “t,” negative control “m” 0.005 µM MTX, and positive control “p” 1.0 µM MTX. Invitrodb_v3 stores raw data values (rval) for the following assay components (acid): peak area for C13-cystine (acid 1023) and C13-ornithine (acid 1024) tracers, peak area for measured cystine (acid 1025) and cystine standardized to the C13 tracer (acid 1026), cystine normalized to the plate median value of the neutral controls (acid 1027), peak area for measured ornithine (acid 1028) and ornithine standardized to the C13 tracer (acid 1029), ornithine normalized to the plate median value of the neutral controls (acid 1030), the ORN/CYSS ratio calculated from DMSO-normalized values (acid 1031), the targeted biomarker prediction (acid 1032, which is an empty placeholder), background-corrected RFU from the CellTiter-Fluor assay (acid 1113), and cell viability normalized to mean RFU of the DMSO control (acid 1114). Virtual plate maps reconstructed from the ORN/CYSS ratio (Figure 1) were visually inspected to confirm consistency in each well before entering the data into invitrodb_v3.

The corresponding assay endpoint identifiers (aeids) analyzed in the Results presented here address the 4 main assay component features for the predictive model: the decrease in media ornithine reflecting reduced cellular release (STM_H9_ornithineISnorm_perc_dn, aeid 1689), the increase in media cystine reflecting less utilization (STM_H9_cystineISnorm_perc_up, aeid 1682), the decrease in the ORN/CYSS ratio as the primary biomarker (STM_H9_OrnCyssISnorm_ratio_dn, aeid 1693), and normalized cell viability (STM_H9_Viability_norm, aeid 1858). The processed STM dataset described and analyzed here comprised 79 398 individual data points across 1065 unique chemical structures.

ToxCast Data Pipeline (tcpl)

Individual data points from the 379-chemical concentration-response series were processed through the ToxCast Data Pipeline (tcpl) (Filer et al., 2017). Level 0 of tcpl is the entry point for rval from the 4 key assay component features. All individual data points for the same chemical were concatenated at this level and processed through 6 levels of data processing.

Level 1 flagged the DMSO-normalized samples for plate position effects, any missing replicate(s), or samples with poor well quality. The latter was found for individual wells across 30 chemicals where the highest concentration caused significant loss of cell viability and no measurable ORN. Generally, the few invalid samples reflected extreme cytotoxicity at higher concentrations. All were included in the concentration-response assessment, and a teratogenic index could still be computed if lower concentrations remained in the noise belt. If not, we retested the samples at a lower concentration range and included all of the data in the tcpl profile. For plotting purposes, we assigned a minimal recorded ORN value (0.001) that was well below the limits of detection of the metabolite (0.01). This was applied only to the 30 data points having poor well quality.

Level 2 transformed the individual replicate data points into an appropriate “response unit” computed on a log2 scale.
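The chain from tracer-standardized peak areas to a log2 response unit can be sketched as follows. This is an illustration under stated assumptions, not the tcpl implementation: standardization to the C13 tracer is assumed to be a simple peak-area ratio, the function names are hypothetical, and any sign convention applied at later pipeline levels is not modeled.

```python
import math
from statistics import median


def standardize(peak_area: float, tracer_area: float) -> float:
    """Assumed internal-standard correction: analyte area over C13 tracer area."""
    return peak_area / tracer_area


def orn_cyss_response(orn: float, cyss: float, orn_neutral, cyss_neutral):
    """Return (ORN/CYSS ratio, log2 ratio) for one test well.

    orn, cyss                 : tracer-standardized values for the test well
    orn_neutral, cyss_neutral : tracer-standardized values for the plate's
                                neutral (0.1% DMSO) control wells
    """
    # Normalize each metabolite to the plate median of the neutral controls.
    orn_norm = orn / median(orn_neutral)
    cyss_norm = cyss / median(cyss_neutral)
    ratio = orn_norm / cyss_norm
    return ratio, math.log2(ratio)


if __name__ == "__main__":
    # Made-up values: ORN release halved and CYSS in media doubled vs. DMSO.
    print(orn_cyss_response(0.5, 2.0, [0.9, 1.0, 1.1], [0.9, 1.0, 1.1]))
```

In this made-up example, halved ornithine release combined with doubled residual cystine gives a ratio of 0.25, that is, a two-unit decrease on the log2 scale.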
Level 3, typically a normalization step in tcpl, applied no additional normalization but instead inverted the log2 data to graph response profiles in a manner consistent with other ToxCast assay platforms, ie, responses increase from a baseline of zero. At this level, tcpl calculated a baseline median absolute deviation (bmad) of the response variable using only the lowest 2 test compound concentrations from all test wells, concentrations where we expect no activity for the vast majority of chemicals. Note that the 731 chemicals with STM data calls based solely on single-concentration screen data (ie, that had not been tested to date in a definitive concentration series) are pipelined to level 2, with single-concentration hit calls summarized based on a global threshold of 3*bmad across all tested compounds, but not plate-level controls.

For the multiconcentration series, the same 3*bmad was implemented as a noise belt distribution in level 4, which calculated the parameters for automated curve-fitting models. Three curve-fitting models were applied (Constant, Hill, and Gain-Loss) and the winning model, selected by Akaike Information Criterion (AIC), progressed to level 5 for graphing. Level 6 then applied warning flags for curve-fitting issues or data quality concerns such as local plate effects, single point hits, and noisy data. Chemicals with a measured response on the ORN/CYSS ratio greater than 3*bmad were classified as “active” at the concentrations tested here. Because “inactive” results (hitc = 0) are not stored in the database, these were replaced by an arbitrary value of 10⁶ µM post-tcpl for computing purposes, consistent with other assays in the ToxCast database.

Developmental toxicity (in vivo) correlation analysis

To assess STM model performance against in vivo developmental toxicity, we identified an anchoring set of 42 benchmark compounds from the ToxCast phase I/II inventory.
An initial list was compiled having overlap with previous literature studies that aimed to evaluate alternative methods for developmental toxicity (Augustine-Rauch et al., 2016; Daston et al., 2014; Genschow et al., 2002; West et al., 2010; Wise et al., 2016). Note that 18 of the 42 compounds were in the training set and 2 of the compounds were in the test set from the original devTOX publication (Palmer et al., 2013). From this list, we selected those chemicals having traditional (pre-2015) US FDA (Food and Drug Administration) categories for potential risk to the developing fetus if pregnant women are exposed (or in a few cases, well-defined developmental toxicants from the National Toxicology Program). A final set of positives (n = 26) and negatives (n = 16) was developed using evidence from teratology cohorts tested under EPA Health Effects Test Guidelines OPPTS 870.3700 (https://ntp.niehs.nih.gov/testing/types/devrepro/index.html, last accessed February 6, 2020) (now OCSPP 870.3700, even though it has not been corrected on regulations.gov).

STM model performance was computed from 2 × 2 binary classification tables of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) conditions (Powers, 2011). Sensitivity, specificity, and overall (regular) accuracy (Rand ACC) were computed as sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), and Rand ACC = (TP + TN)/(TP + FP + FN + TN). Balanced accuracy (BAC) was computed as the average of PPV = TP/(TP + FP) and negative predictive value (NPV) = TN/(TN + FN); BAC = (PPV + NPV)/2. Model performance was also assessed with the Matthews correlation coefficient (Matthews, 1975). Because the testing protocol was independent of the original training by Palmer et al. (2013), the subset of 20 compounds in the 42-compound benchmark set that were used to develop the model (Palmer et al., 2013) is not used here for training; rather, these compounds confer confidence in the testing strategy because the contractor was blinded to their identity.

To evaluate a broader correlation of STM models to in vivo animal studies, we used prenatal developmental toxicity studies (study type 870.3700) downloaded from the ToxRefDB v1.0 data release (https://www.epa.gov/chemical-research/toxcast-toxrefdb-data-release, October 16, 2015, accessed July 25, 2017). This included chemical-endpoint-specific dosage for the Lowest Effect Level (LEL) in mg/kg/day if the endpoint was tested (or assumed to have been tested) and a treatment-related effect was observed (Knudsen et al., 2009). Lowest Effect Level values were culled from the “Endpoint_summary” download file. The “study-endpoint-category” calls reflected developmental (dLEL) and maternal (mLEL) endpoints for each available study, else “NULL” meaning no LEL was observed at the highest dosage tested (assigned an LEL of 10⁶ mg/kg/day for computing purposes). For the 1049 rat and rabbit studies, 924 ToxRefDB study records (approximately 400 chemicals) were oral route, 60 records (16 chemicals) inhalation, 48 records (16 chemicals) direct, 42 records (16 chemicals) dermal, and 4 records (2 chemicals) “other.” “Direct administration” in ToxRefDB refers to routes of exposure other than oral, dermal, or inhalation (eg, intraperitoneal, intramuscular, and intravenous). Bioavailability might be questionable for the dermal or inhalation study records with a negative outcome for developmental toxicity in ToxRefDB.
This situation applied to 35 of 1049 records, of which 30 records correlated with a negative response (TN) in the STM response and only 5 records correlated with a positive response (FP). There may be overlap wherein some chemicals have records on alternate routes of exposure. The list of chemicals is provided in Supplementary Table S2. For the few inhalation studies having ppm as the exposure unit, we used a 1:1 mass conversion rather than assuming a particular breathing rate to get at this volume.

In all, 3496 records were obtained for dLEL, mLEL, or NULL entries. These records collapsed into 791 outcomes (424 rat, 331 rabbit, 33 mouse, 2 hamster, and 1 dog). Given the preponderance of rat and rabbit studies, we focused only on outcomes from those 2 species to build the endpoint classifier model (400 rat OR rabbit, 389 rat, 323 rabbit, and 203 rat AND rabbit). This removed outcomes reported in mouse/hamster/dog from consideration, even if they may have been reported in rat/rabbit as well.

We first assigned levels of evidence for developmental toxicity for each individual ToxRefDB record as follows: “clear evidence” (dLEL ≤ 200 mg/kg/day; dLEL < mLEL), “some evidence” (dLEL ≤ 200 mg/kg/day; dLEL ≤ mLEL), “equivocal evidence” (dLEL < 10³ mg/kg/day; dLEL > mLEL), and “no evidence” demonstrated by data from a study with appropriate experimental design but no developmental effects observed (eg, no dLEL or dLEL ≥ 1000 mg/kg/day) (see Supplementary Table S2). This cutoff was arbitrary, selected for convenience to match the arbitrary in vitro cutoff of 200 µM for the Teratogenicity Index (TI) reflecting the critical change in the ORN/CYSS biomarker. Replicate studies conducted with the same chemical compound and species were collapsed into a single dLEL or mLEL value with the evidence hierarchy: clear > no > some > equivocal.
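The evidence rules and the collapse hierarchy above can be written as a short Python sketch. It follows the thresholds as stated in the text and encodes NULL as 10⁶ mg/kg/day. The stated rules do not cover every combination (for example, a dLEL between 200 and 1000 mg/kg/day that does not exceed the mLEL), so returning "unclassified" for such cases is an assumption of this sketch, as are the function names.

```python
NULL_LEL = 1e6  # placeholder dose (mg/kg/day) when no LEL was observed

# Collapse order for replicate studies, strongest first, as given in the text.
HIERARCHY = ["clear", "no", "some", "equivocal"]


def evidence_level(dlel: float, mlel: float) -> str:
    """Assign a level of evidence to one study record from its dLEL and mLEL."""
    if dlel >= 1000:
        return "no"          # no dLEL (NULL) or dLEL >= 1000 mg/kg/day
    if dlel <= 200 and dlel < mlel:
        return "clear"       # developmental effect below the maternal LEL
    if dlel <= 200 and dlel <= mlel:
        return "some"        # only reached when dLEL equals mLEL
    if dlel > mlel:
        return "equivocal"   # effect only at maternally toxic doses
    return "unclassified"    # assumption: combination not covered by the text


def collapse(levels) -> str:
    """Collapse replicate-study calls using clear > no > some > equivocal."""
    for level in HIERARCHY:
        if level in levels:
            return level
    return "unclassified"


if __name__ == "__main__":
    calls = [evidence_level(50, 200), evidence_level(500, 100), evidence_level(NULL_LEL, NULL_LEL)]
    print(calls, "->", collapse(calls))
```

Because "clear" is tested before "some," the second rule is reached only when the dLEL equals the mLEL, which is how the overlapping inequalities in the text are resolved in this sketch.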
Essentially, “equivocal evidence” indicates that a dLEL was determined for the compound, in a rat or rabbit study, but there is not enough evidence to attribute the observed fetal endpoint to a specific developmental effect as opposed to a maternal effect. It is important that “evidence” not be equated to “importance.” A maternally mediated adverse developmental outcome is important, but perhaps falls outside the biological domain of the assay. The final number of compounds for in vivo anchoring was 401 from ToxRefDB, completed to 432 by adding information from the 42 benchmark compounds identified in the literature (described above) that were not compiled in ToxRefDB (11 of the 42 overlap with ToxRefDB). To produce a TP, FP, FN, or TN call for each chemical tested, we looked for rat-rabbit concordance where available. This was done by discordance (OR = hit in either species constituted a positive) and concordance (AND = same calls from both species). Stringency models were built onto the BM-42 calls as follows.
Base model: defines any chemical with a dLEL as positive in either species, else negative, and the STM data accepts as positive any dTP < 1000; n = 432.
Low stringency model: defines a positive as CLEAR or SOME evidence, else negative (EQUIVOCAL, NO); the STM-positive data are filtered at an arbitrary cutoff of TI ≤ 200 µM; n = 432.
Medium stringency: defines a positive when either species (rat OR rabbit) shows CLEAR evidence, else negative; n = 285.
High stringency: defines positive as a concordant response (rat AND rabbit) showing CLEAR or NO evidence in both cases; n = 127.
Note that the BM-42 set was defined as CLEAR and NO; 11 of the 42 are in ToxRefDB as tested in rat, of which 4 were also tested in rabbit.

Biochemical (in vitro) correlation analysis

To evaluate the biological perturbations underlying the targeted biomarker in the STM assay, we utilized the HTS results from the ToxCast NVS cell-free biochemical-assay platform (Knudsen et al., 2011; Sipes et al., 2013). The information for this analysis is provided in Supplementary Table S3. AC50 (half-maximal activity concentration) values for NVS aeids were selected from the invitrodb_v3 hit-call matrix (August 10, 2018 release, accessed September 18, 2018) (https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data). Note that the public download has AC50s for chemical-assay relationships that are flagged as inactive because they do not meet the assay-specific efficacy threshold for bioactivity; the hit-call file converts them to inactive and inserts log10 = 3 for all inactive AC50s. The NVS dataset had AC50 values for activation or inhibition across 337 biochemical assays for substrate-product enzymatic activity or ligand-receptor binding, corresponding to 420 total features, due to the possibility of enzymatic activation/inhibition or nuclear receptor (NR) agonist/antagonist binding activity. This produced 2918 hit calls in the 420-feature × 1065 chemical matrix expressed as log10 micromolar AC50. The NVS AC50 matrix was further consolidated to a “gene-score” feature matrix using chemical-specific Z-scores from the global ToxCast cytotoxicity burst (Judson et al., 2016) and the official gene symbol for the human function that is annotated in each particular ToxCast NVS assay. This filtered out any chemical-assay pair having an AC50 value within 3 standard deviations of the cytotoxicity burst (Leung et al., 2016). The resulting gene-score feature matrix, G(chemical, gene_dir), comprised the mean log10(AC50) for each homologous gene symbol and was assigned an “up” or “down” extension to account for directionality (dir) in terms of enzymatic activator/inhibitor or receptor ligand binding activities.
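The consolidation step above can be sketched as follows. This is a hedged illustration, not the authors' code: the record layout, the function name, the example gene symbols in the usage note, and the reading of "within 3 standard deviations" as "keep pairs with Z ≥ 3" are all assumptions made for the sketch.

```python
from collections import defaultdict
from statistics import mean

Z_CUTOFF = 3.0  # assumption: retain pairs at least 3 SD away from the cytotoxicity burst

def gene_score_matrix(records, z_cutoff=Z_CUTOFF):
    """Build G(chemical, gene_dir) as the mean log10(AC50) per gene and direction.

    records: iterable of (chemical, gene, direction, log10_ac50, z_score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    grouped = defaultdict(list)
    for chemical, gene, direction, log10_ac50, z_score in records:
        if z_score < z_cutoff:
            continue  # filtered out: too close to the cytotoxicity burst
        grouped[(chemical, f"{gene}_{direction}")].append(log10_ac50)
    return {key: mean(values) for key, values in grouped.items()}
```

With two retained assays for the same gene and direction (say, hypothetical entries for "ESR1" with log10 AC50 values of 0.0 and 1.0), the gene score for that chemical would be their mean, 0.5.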
Finally, this matrix was transformed to micromolar concentration potency terms using the following equation:

$$\mathrm{GPS} = 10^{-\log_{10} G},$$

where $G$ is the log AC50 gene-score matrix and GPS is the transformed gene-potency score matrix (1062 chemicals × 267 NVS gene features) in the analysis. Binary calls from the ToxCast STM dataset (1 = active STM response, 0 = inactive STM response) were then used as the categorical endpoint to fit a multiple logistic regression model (Scikit-learn, v. 0.18.1; Python, v.2.7.13):

$$\operatorname{logit}(p) = w_0 + \sum_{i=1}^{N_{\mathrm{genes}}} w_i\,\mathrm{GPS}_i,$$

where $w_i$ represents the log-odds coefficient term for each gene. These weights can be positive or negative to represent how strongly the gene association correlated to an active or inactive STM response, respectively.

To characterize the phenotypic importance for each gene in the dataset, we used the Human-Mouse: Disease Connection (HMDC) database (http://www.informatics.jax.org/humanDisease.shtml, accessed July 31, 2018). Although rat is the preferred species in a tiered approach for developmental toxicity testing, performing evaluations in the rat and mouse or mouse and rabbit detected 80% of the 105 teratogens examined in the veterinary pharmaceutical products study by Hurtt et al. (2003). As such, the HMDC database provides a reasonable surrogate for modeling rat-human disease connections. The phenotypic weighting was conducted independently of the logistic regression described above and serves as a way to determine the biological relevance of each gene present in the dataset. HMDC integrates phenotype and disease model data from the MGI Mammalian Phenotype Ontology (MP) with human gene-to-disease relationships from the National Center for Biotechnology Information, Online Mendelian Inheritance in Man (OMIM), and the Human Phenotype Ontology. We mapped all NVS gene symbols to the 28 available HMDC phenotype systems (extracted from the HMDC database on July 31, 2018) and assigned a bin score (b) based on evidence weighting levels from the number of HMDC entries (n) for any (n = 1, b = 1), some (n = 2–6, b = 2), or many (n > 6, b = 3) annotations, or no HMDC data (n = 0, b = 0). We then calculated a gene-specific MP weight by summing the bin score across all 28 MP categories for the 233 HMDC gene symbols that could be linked to a NVS feature. This list was standardized from 0 to the maximum value and used to weight each NVS target gene by relevance to a human-mouse disease system.
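The bin scoring and MP weighting described above can be sketched as follows. This is a minimal illustration: the function names and the input layout are hypothetical, and dividing by the maximum summed score is one reading of "standardized from 0 to the maximum value."

```python
def bin_score(n):
    """Map the number of HMDC entries (n) for one phenotype system to a bin score (b)."""
    if n == 0:
        return 0  # no HMDC data
    if n == 1:
        return 1  # any
    if n <= 6:
        return 2  # some (n = 2-6)
    return 3      # many (n > 6)

def mp_weights(annotation_counts):
    """Sum bin scores across phenotype systems per gene, then scale by the maximum.

    annotation_counts: {gene_symbol: [n for each phenotype system]}
    (hypothetical layout; the study used 28 phenotype systems).
    """
    raw = {gene: sum(bin_score(n) for n in counts)
           for gene, counts in annotation_counts.items()}
    top = max(raw.values(), default=0)
    # Assumption: scale to 0..1 by dividing by the largest summed score.
    return {gene: (score / top if top else 0.0) for gene, score in raw.items()}
```

A gene with no HMDC record receives a weight of 0 in this sketch rather than being dropped, which matches the statement that the schema does not filter out such entries.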
It should be noted that each phenotype system has subordinate terms, not all of which may directly apply to developmental toxicity (the deeper level of ontology was not considered here). Also note that a few NVS gene entries lacked an HMDC record: the normalized MP weighting schema does not filter out those but scales them at the low end of the evidence bar for a phenotypic system. The top and bottom 40 from the NVS gene targets list linked to the STM-positive and STM-negative responses, respectively, were annotated separately using the Functional Annotation Tool from the DAVID Bioinformatics Resources 6.8, NIAID/NIH (https://david.ncifcrf.gov/). Categories were selected for a minimum of 4 genes and maximum of 30 genes from GO Direct (molecular function, cellular compartment, and biological process), OMIM, KEGG, BioCarta, Reactome, and INTERPRO at a Bonferroni-adjusted p-value ≤ .05 per category. Nothing from OMIM or BioCarta passed. All other redundancies were resolved manually by the lowest false discovery rate value (as these have lesser uncertainties across the specific annotation) or in some cases the most informative annotation record. The resulting annotation records were visualized in a Spearman correlation matrix based on summed weights of represented gene targets from the logistic regression model, colored by STM on one axis and developmental toxicity on the other.

RESULTS

The STM dataset currently holds concentration-response profiles for 379 chemicals processed through the tcpl pipeline, and single-concentration screens for another 686 chemicals that were inactive in the initial screen (Figure 1). All data files are available for download at https://doi.org/10.23645/epacomptox.11819265, and the processed raw data files from tcpl level 2 and level 6 have been submitted to Dryad (http://datadryad.org/, doi.org/10.5061/dryad.gqnk98shm). Results described here will focus on the 2 key output parameters, namely the targeted biomarker (ORN/CYSS ratio) and cell viability. Beyond the present study, some of the single-concentration samples will be retested in concentration-response and the information added to the ToxCast database for public access. Fewer than 10% of the chemicals were tested at < 10 µM, and we might anticipate some activity in those if tested at higher concentrations.

Determination of STM Positivity
The bmad determined in tcpl is calculated as the median absolute deviation of the response variable from the lowest 2 test compound concentrations in the assay between tcpl levels 3 and 4 (Filer et al., 2017). Because of the way tcpl computes the noise belt, a point of departure represents the state of the platform in ToxCast. This threshold is dynamic and could change as more and more compounds are tested in the future. As such, the ORN/CYSS ratio = 0.76 represents the biomarker’s point of departure threshold (eg, Teratogenicity Index, TI) in the current state of 1065 cases and is not a recommendation for general applicability. Figure 2 plots the distribution of tcpl data at level 3 for plate-level neutral controls (n = 1176) and Methotrexate (MTX) reference for positive (1 µM, n = 589) and negative (5 nM, n = 590) response; and at level 4 for the lowest 2 test concentrations of ToxCast samples (n = 2169).
A threshold drawn at 3*bmad for each feature distinguished all MTX-positive and MTX-negative plate references. CYSS values showed wider variability than ORN values due to a minor change in the culture medium formulation that increased CYSS utilization over the course of the contracting period. Aside from a few outliers that were not removed when calculating bmad, the lowest 2 concentrations from all ToxCast test wells mimicked the neutral control and MTX-negative reference groups.

Figure 2. Boxplots of log2-fold induction for STM response data (75% box/95% whiskers/outlier points). Tcpl response data and Mann-Whitney rank sum test comparison of these distributions versus plate-level neutral (DMSO) controls (n = 1158), MTX-negative (n = 581), and MTX-positive (n = 580, p < .05) references; and tcpl baseline median absolute deviation (bmad) (n = 2069). 3*bmad for each feature (dashed line) correctly classified all MTX-positive reference values (a few outliers elicited a response outside 3*bmad at 1 or both of the 2 lowest concentrations tested).
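As a rough sketch of the noise-belt calculation described above, the baseline median absolute deviation can be computed from the pooled responses at the 2 lowest test concentrations. The function names are hypothetical, and the `scale` argument is an assumption left at 1.0 (a plain median absolute deviation, as worded in the text); note that R's `mad()` applies a consistency constant of 1.4826 by default.

```python
from statistics import median

def bmad(baseline_responses, scale=1.0):
    """Median absolute deviation of responses at the lowest 2 test concentrations.

    scale=1.0 gives the plain MAD; whether a consistency constant applies
    is an assumption to check against the pipeline actually used.
    """
    values = list(baseline_responses)
    center = median(values)
    return scale * median(abs(v - center) for v in values)

def noise_belt(baseline_responses, multiplier=3.0, scale=1.0):
    """Return the (lower, upper) bounds of the 0 +/- multiplier*bmad band."""
    width = multiplier * bmad(baseline_responses, scale=scale)
    return (-width, width)
```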

Feature plots for each chemical tested in concentration-response (n = 379) are given in Supplementary Figures S1a–S1f. As an example, Figure 3 displays results for Methotrexate as part of the ToxCast library. The gray hatched zone indicates the global noise belt computed as 3*bmad for each main feature. Due to the way tcpl calculates bmad, the threshold ORN/CYSS ratio (0.88) originally described (Palmer et al., 2013) fell within the noise belt. Consistent with other ToxCast assays, we used a statistical threshold for positivity that reflects the test concentration at which the activity departs from the noise belt (acb, activity concentration at baseline). This is indicated by the vertical yellow line in Figure 3 and for the ORN/CYSS ratio gives a TI that reflects the concentration threshold predicted for human developmental toxicity. Because several outliers widened the global noise belt, this in turn resulted in a right-shift in the TI relative to the default ORN/CYSS model from the original assay description (Palmer et al., 2013). For example, 3*bmad (log2) was 0.403, rendering the acb threshold 2^−0.403 = 0.756 versus 0.88 in the default model. For normalized cell viability (percent control), 3*bmad = 0.161, rendering the acb threshold 2^−0.161 = 0.894. This threshold corresponds to a global decrease of 11% in cell viability, which is taken as a positive effect for this feature but may or may not reflect overt cytotoxicity. Although minor effects on cell viability could account for changes in cellular ORN release and/or CYSS uptake measured in the ORN/CYSS ratio, altered cell growth and/or survival are potential modes of action in developmental toxicity. As such, compounds that impact the ORN/CYSS biomarker due to minimal effects on cell viability should not be discounted because of it. The blinded ToxCast Methotrexate sample registered a TI = 0.059 µM and 11% loss of cell viability at 0.062 µM, consistent with the MTX plate references.
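As a worked check of the thresholds quoted above, the 3*bmad values on the log2 scale can be converted back to fold-change ratios. This is a minimal sketch; the function name is hypothetical and the only inputs are the two 3*bmad values reported in the text.

```python
def ratio_threshold(three_bmad_log2):
    """Convert a 3*bmad cutoff on the log2 scale into a fold-change ratio (decrease)."""
    return 2.0 ** (-three_bmad_log2)

orn_cyss_threshold = ratio_threshold(0.403)   # approx. 0.756 (ORN/CYSS ratio)
viability_threshold = ratio_threshold(0.161)  # approx. 0.894 (fraction of control)

# Percent decrease in cell viability implied by the viability threshold (approx. 11%).
viability_drop_percent = (1.0 - viability_threshold) * 100.0
```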
Figure 3. Sample level 6 tcpl outputs on each feature in the ToxCast_STM dataset, exemplified by methotrexate. Automated curve-fits from DMSO-normalized triplicate measures modeled by 3 objective functions: constant (CNST), Hill (HILL), and gain-loss (GNLS). HILL parameters are value and standard deviation for the top asymptote (tp), log10(µM AC50) in the gain direction (ga), and Hill coefficient (gw) in the gain direction. GNLS is the product of 2 Hill models with a shared tp, log10(µM) AC50s in gain (ga) and loss (la) directions, and Hill coefficients in the gain (gw) and loss (lw) directions. The gray striped box denotes the feature noise belt computed from 0 ± 3 bmad for the lowest 2 concentrations across all samples in the feature set. Model summary values include AIC, probability (P), and root mean square error (RMSE). The lowest AIC is selected as the winning model (red font), and the vertical yellow line denotes point of departure of activity (activity concentration at baseline, acb) from the noise belt (horizontal gray line); vertical blue and red lines depict the AC50s for the specific model plots. See Filer et al. (2017) for details on HIT-CALL (0 = inactive, 1 = active), fit-category (FITC, 01 to 68+), activity probability (ACTP), and efficacy cutoff (COFF). The y-axis indicates the log2-fold change in (A) ORN levels, (B) CYSS levels, (C) ORN/CYSS ratio, and (D) cell viability versus neutral (DMSO) control value.
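The Hill and gain-loss curves named in the caption take the following general form in the tcpl documentation, with concentration on the log10 scale. The sketch below is illustrative and does not reproduce tcpl's fitting procedure; `lw` follows tcpl's name for the loss-direction Hill coefficient, and the helper for picking the winning model simply takes the lowest AIC, as the caption describes.

```python
def hill(logc, tp, ga, gw):
    """Hill model: response rises toward the top asymptote tp.

    logc: log10 concentration (uM); ga: log10 AC50; gw: Hill coefficient.
    """
    return tp / (1.0 + 10.0 ** ((ga - logc) * gw))

def gnls(logc, tp, ga, gw, la, lw):
    """Gain-loss model: product of a gain Hill term and a loss Hill term sharing tp."""
    gain = 1.0 / (1.0 + 10.0 ** ((ga - logc) * gw))
    loss = 1.0 / (1.0 + 10.0 ** ((logc - la) * lw))
    return tp * gain * loss

def winning_model(aic_by_model):
    """Return the model name with the lowest AIC."""
    return min(aic_by_model, key=aic_by_model.get)
```

At `logc == ga` the Hill model returns half of `tp`, which is what makes `ga` the log10 AC50 in the gain direction.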

To examine the impact of the right-shift in the tcpl-derived critical concentration for positivity, we next plotted the difference in hit calls between the original value (0.88) (Palmer et al., 2013) and the tcpl-derived value (0.76). We binned each chemical by measured maximum response into a distance from bmad. For example, chemicals in the first bin had a maximal response within 1 bmad, chemicals in the second bin had a maximal response within 2 bmads, and so forth. Quantitatively, the tcpl model shifted active calls to a slightly higher concentration versus the published default that fell within the noise belt. This shift, however, had little impact, resulting in the loss of only 2 chemicals from the active list (Maneb and 2-methoxy-5-nitroaniline) that displayed marginal efficacy.

Profiling the ToxCast Phase I/II Inventory
Level 2 and level 6 ToxCast STM data are contained in Supplementary Table S1. Because of the way ToxCast assays were contracted, multiwell stock plates (100 mM where possible) permitted testing most chemicals at HTC = 100 µM (due to potential interference with DMSO). The HTC was in some cases limited to < 100 µM or < 10 µM based on a low “median cytotoxicity burst” across dozens of cytotoxicity and cell stress assays in the ToxCast portfolio (Judson et al., 2016). On the other hand, tcpl’s automated curve-fitting extrapolated TI above the HTC of 100 µM in several cases. Across the 1065 chemical samples tested, we observed a critical decrease in the ORN/CYSS ratio for 208; however, if we consider the response positive only if TI < 200 µM, as a reasonable cutoff, then this condition was met by 205 (19%) of the ToxCast phase I/II library. Concentration-response profiles are shown for ORN, CYSS, the ORN/CYSS ratio, and cell viability plots in Supplementary Figures S1a–S1f.
We retested all compounds with activity in the single-concentration screen in concentration-response aside from tetracycline and 3,3′-dimethylbenzidine, which produced a weak effect on the ORN/CYSS ratio at 100 µM but were included among the positives. Concentration-dependent drops in the ORN/CYSS ratio were driven by ORN alone for 15 chemicals (7.3% of the positives, including thalidomide), by CYSS alone for 36 chemicals (17.6% of the positives), and by both metabolites in 147 cases (71.7% of the positives). A few chemicals (n = 7) produced an overall effect on the ORN/CYSS ratio without scoring a hit on either metabolite alone due to noisy data flagged for individual metabolites. Although the reduction in CYSS utilization seemed to drive the ORN/CYSS response for most of the positive chemicals, minor variations were observed for a few compounds (see Supplementary Figures S1a and S1d). all-trans-Retinoic acid, for example, increased ORN production (an effect shown for retinoids as a class through inhibition of ornithine decarboxylase [Palmer et al., 2017]), but the active response was driven by a stronger reduction in CYSS utilization, indicating that different mechanisms may lead to changes in the ORN/CYSS ratio. As an example of more complex concentration-response behavior, octyl gallate produced a transient spike in CYSS utilization (acb = 0.39 µM) prior to concentrations that decreased CYSS utilization (acb = 1.83 µM) (Supplementary Figures S1a–S1f). The nonmonotonic changes in CYSS may foreshadow cytotoxicity but perhaps not as a general phenomenon.

The general relationship between TI (Supplementary Figure S1e) and H9 cell viability (Supplementary Figure S1f) is shown graphically for the 205 positives confirmed in the concentration-response series (Figure 4). Although cell viability is measured, it is not included in the prediction by the assay, which is based solely on the ORN/CYSS ratio response.
Cell viability is provided as an additional endpoint to aid in the interpretation of the data. The 11% decline in viable H9 cell number (point of departure from the noise belt) could be due to impaired growth, reduced viability, or both. For example, 2 well-recognized human teratogens, Thalidomide (Figure 4, No. 8) and Methotrexate (Figure 4, No. 154), invoke a biomarker response at concentrations that differ in whether they are accompanied by changes in viable cell number. It should be noted that the highest concentration tested here approached within an order of magnitude or exceeded the lower bounds of the ToxCast cytotoxicity burst (LBC) for 204 of 205 STM-positive compounds (99.5%) and 843 of 883 STM-negative compounds (95.3%). Approximately one-third of the STM-positive compounds triggered a critical drop in the ORN/CYSS ratio without altering cell viability at the concentrations tested here (sector A in Figure 4 and Table 1). This is exemplified by all-trans-retinoic acid and thalidomide (TI = 0.003 and 1.27 µM, respectively). In the remaining two-thirds of the STM-positive compounds, an effect on the ORN/CYSS ratio cannot be generally dissociated from an 11% drop in cell viability. Methotrexate and cytarabine, for example, dropped the ORN/CYSS ratio (TI = 0.059 and 0.054 µM, respectively) as well as cell viability (acb = 0.062 and 0.082 µM, respectively).

Figure 4. Stratification of 205 STM-positive chemicals (TI < 200 µM) by teratogen index (TI), cell viability (CV), and lower bounds of the ToxCast cytotoxicity burst (LBC). Plots are −log10(acb) micromolar concentration for each chemical indicated by TI (blue line), 11% reduction in H9 cell viability (red line), and LBC (gray stippled line). (NOTE: color image available in the online version). Samples are ranked low (No. 1) to high (No. 205) (see Table 1 for chemical key), first by effect on the biomarker relative to no effect on H9 cell viability (sector A), then by biomarker potency (low to high) relative to an effect on H9 cell viability (sector B), and finally where the effect on H9 cell viability was more potent than the biomarker (sector C). The radial scale is −log10 µM from no effect (−log10 = −6, center) to potent effect (−log10 = +6 at the periphery). The stippled gray circle marks the 200 µM cutoff for activity (inward designated inactive) on a −log10 scale.
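The sector assignment described in the Figure 4 caption can be sketched as a simple rule on the two points of departure. The function name and the `None` convention for "no effect on cell viability" are assumptions for this sketch, as is the handling of ties (assigned to sector B); the example values in the usage note are taken from Table 2.

```python
def sector(ti_um, cv_acb_um):
    """Assign an STM-positive chemical to sector A, B, or C.

    ti_um: TI (uM); cv_acb_um: cell viability acb (uM), or None when
    no effect on viability was detected (hypothetical convention).
    """
    if cv_acb_um is None:
        return "A"  # biomarker effect without an effect on cell viability
    if ti_um <= cv_acb_um:
        return "B"  # biomarker at least as potent as the viability effect (tie: assumption)
    return "C"      # viability effect more potent than the biomarker
```

For instance, thalidomide (TI = 1.27 µM, no viability effect) falls in sector A, methotrexate (TI = 0.059 µM, CV = 0.062 µM) in sector B, and indomethacin (TI = 72.7 µM, CV = 44.1 µM) in sector C, in line with their positions in Table 1.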

Table 1. Key for Positive Chemicals Plotted in Figure 4

| Sector A | Sector B | Sector C |
| 1* all-trans-Retinoic acid | 73* Valproic acid | 160 Phenanthrene |
| 2 PharmaGSID_47333 | 74* Hydroxyurea | 161 Endosulfan |
| 3 Mirex | 75 Fenarimol | 162 Clodinafop-propargyl |
| 4 Spiroxamine | 76 Tetracycline | 163* Indomethacin |
| 5 SAR150640 | 77 3,3′-Dimethylbenzidine | 164 Corticosterone |
| 6 Aplaviroc hydrochloride | 78 Propetamphos | 165 Fipronil |
| 7 3′-Azido-3′-deoxythymidine | 79 Linuron | 166 Hexaconazole |
| 8* Thalidomide | 80 2,4-Dinitrotoluene | 167 Myclobutanil |
| 9 7,12-Dimethylbenz(a)anthracene | 81 Fluthiacet-methyl | 168 Flumioxazin |
| 10* Carbamazepine | 82 Tebuconazole | 169 Malathion |
| 11 Tridemorph | 83 4-Nonylphenol, branched | 170 Ethofumesate |
| 12* Rifampicin | 84 Diisobutyl phthalate | 171 Norethindrone |
| 13 Darbufelone mesylate | 85 Cyhalofop-butyl | 172 Diniconazole |
| 14 Chlorpropham | 86 Endrin | 173 2,3-Diaminotoluene |
| 15 Nitrofurazone | 87 Thidiazuron | 174 Pioglitazone hydrochloride |
| 16 Carbaryl | 88 Diazinon | 175* Dexamethasone sodium phosphate |
| 17 AVE8488 | 89 4-(2-Methylbutan-2-yl)phenol | 176 Oxytetracycline hydrochloride |
| 18 GW473178E methyl benzene sulfonic acid | 90 Nitrofen | 177 Fenamidone |
| 19 Elzasonan | 91 Flumetralin | 178 Zoxamide |
| 20 PharmaGSID_47259 | 92 Fluridone | 179 Testosterone propionate |
| 21* Amiodarone hydrochloride | 93 o,p′-DDT | 180 SSR240612 |
| 22 Volinanserin | 94 N,N′-Methylenebisacrylamide | 181 Bifenazate |
| 23 Besonprodil | 95 Methyl methanesulfonate | 182 3,4-Diaminotoluene |
| 24 Dihexyl phthalate | 96 4-Chlorobenzotrichloride | 183 Fluazinam |
| 25 Carbendazim | 97 Fluazifop-P-butyl | 184 1,2-Phenylenediamine |
| 26 Tri-allate | 98 4,4′-Oxydianiline | 185 PD 0343701 |
| 27* Lovastatin | 99 Triticonazole | 186 Thiram |
| 28 SAR102608 | 100 Propiconazole | 187 2,4-Diaminotoluene |
| 29 Prometon | 101 Disulfoton | 188 Sodium dimethyldithiocarbamate |
| 30 Cycloate | 102 Triclosan | 189 Azathioprine |
| 31 Dipentyl phthalate | 103 Flutamide | 190 SR125047 |
| 32 Ametryn | 104 Genistein | 191 Nitrofurantoin |
| 33 N,N-Dimethyldecylamine oxide | 105 Flusilazole | 192 Milbemectin (mixture of 70% Milbemycin A4, 30% Milbemycin A3) |
| 34 Pirinixic acid | 106 Benzyl butyl phthalate | 193* 5-Fluorouracil |
| 35 PharmaGSID_48507 | 107 4-Vinyl-1-cyclohexene dioxide | 194 Octyl gallate |
| 36 Isazofos | 108 Dibutyl phthalate | 195 PharmaGSID_48505 |
| 37 Atrazine | 109 Fludioxonil | 196 Propargite |
| 38 Tricresyl phosphate | 110 1,3-Dinitrobenzene | 197 N-Phenyl-1,4-benzenediamine |
| 39 Diallyl phthalate | 111 Pyrimethamine | 198 Pyraclostrobin |
| 40 Diethanolamine | 112 5HPP-33 | 199 PharmaGSID_48519 |
| 41 Di(2-ethylhexyl) phthalate | 113 Ketoconazole | 200 Mercuric chloride |
| 42 Triadimenol | 114 4-Methylaniline | 201 Gentian Violet |
| 43 Nitrilotriacetic acid | 115 Propazine | 202 Disulfiram |
| 44 2,4,7,9-Tetramethyl-5-decyne-4,7-diol | 116 Fenaminosulf | 203 Tebufenpyrad |
| 45* Stavudine | 117 N,N,N-Trimethyl(oxiran-2-yl)methanaminium chloride | 204 PharmaGSID_48511 |
| 46 Isopropyl triethanolamine titanate | 118 2-(Thiocyanomethylthio)benzothiazole | 205 PharmaGSID_48166 |
| 47 Triadimefon | 119 Methylene bis(thiocyanate) | |
| 48 Clomazone | 120 Diquat dibromide monohydrate | |
| 49 Cyclanilide | 121 Napropamide | |
| 50 Boscalid | 122 SSR 241586 HCl | |
| 51 17α-Hydroxyprogesterone | 123 Difenzoquat metilsulfate | |
| 52 Esfenvalerate | 124 Octhilinone | |
| 53 Cymoxanil | 125 CP-728663 | |
| 54 Fluometuron | 126* Busulfan | |
| 55 Flumiclorac-pentyl | 127 UK-337312 | |
| 56 2-tert-Butyl-5-methylphenol | 128 SB236057A | |
| 57 Procymidone | 129* Diphenhydramine hydrochloride | |
| 58 Coumaphos | 130 Benomyl | |
| 59 Tributyl phosphate | 131 Fluoxastrobin | |
| 60 2,4-Dinitrophenol | 132 Clomiphene citrate (1:1) | |
| 61 Etridiazole | 133 Sodium (2-pyridylthio)-N-oxide | |
| 62 Norflurazon | 134 MK-274 | |
| 63 Tralkoxydim | 135 SR271425 | |
| 64 Acibenzolar-S-methyl | 136 Dodecyltrimethylammonium chloride | |
| 65 Diuron | 137 SAR115740 | |
| 66 Cyproconazole | 138 Famoxadone | |
| 67 Dinoseb | 139 Chlorpromazine hydrochloride | |
| 68 N-Nitrosodiphenylamine | 140 Difenoconazole | |
| 69 Paclobutrazol | 141 Captafol | |
| 70 1,3-Propane sultone | 142 Didecyldimethylammonium chloride | |
| 71 Carminic acid | 143 CI-959 | |
| 72* MEHP | 144 Cycloheximide | |
| | 145 Cladribine | |
| | 146 6-Thioguanine | |
| | 147 Prometryn | |
| | 148 Tributyltin chloride | |
| | 149 Tributyltin methacrylate | |
| | 150 Phenylmercuric acetate | |
| | 151 Triphenyltin hydroxide | |
| | 152 Triglycidyl isocyanurate | |
| | 153* Cytarabine hydrochloride | |
| | 154* Methotrexate | |
| | 155 Rotenone | |
| | 156 Pyridaben | |
| | 157 TNP-470 | |
| | 158 Colchicine | |
| | 159 Fenpyroximate (Z, E) | |

Asterisk indicates the chemical is a positive in the BM-42 set.

Benchmarking STM Assay Performance for Teratogenicity
To evaluate ToxCast STM assay performance, we compared the in vitro classification with a set of 42 benchmark compounds that have been used by others to evaluate developmental toxicity alternatives (Augustine-Rauch et al., 2016; Daston et al., 2014; Genschow et al., 2002; West et al., 2010; Wise et al., 2016) (Table 2). Most of these compounds have information from FDA labels on potential risk to the developing fetus if pregnant women are exposed, ranging from safe for use during pregnancy (category A) to contraindicated during pregnancy (category X). Although the FDA labels are no longer used (since 2015), they have been used in evaluating alternatives to animal testing for developmental toxicity and, as such, the category descriptors are given for convenience in parentheses as follows. (FDA category A, where adequate and well-controlled studies have failed to demonstrate a risk to the fetus in the first trimester of pregnancy [and there is no evidence of risk in later trimesters]; category B, where either animal-reproduction studies have not demonstrated a fetal risk but there are no controlled studies in pregnant women, or animal-reproduction studies have shown an adverse effect [other than a decrease in fertility] that was not confirmed in controlled studies in women in the first trimester [and there is no evidence of a risk in later trimesters]; category C, where animal-reproduction studies have shown an adverse effect on the fetus and there are no adequate and well-controlled studies in humans, but potential benefits may warrant use of the drug in pregnant women despite potential risks; category D, where there is positive evidence of human fetal risk based on adverse reaction data from investigational or marketing experience or studies in humans, but potential benefits may warrant use of the drug in pregnant women despite potential risks; and category X, where studies in animals or human beings have demonstrated fetal abnormalities, or there is evidence of fetal risk based on human experience, or both, and the risk of the use of the drug in pregnant women clearly outweighs any possible benefit. The drug is contraindicated in women who are or may become pregnant.)

The ToxCast STM assay, based on defining a positive using the tcpl threshold of 0.76, performed with Rand ACC = 78.6% (sensitivity 0.65, specificity 1.00, n = 42) and BAC = 82.0% (PPV 1.00, NPV 0.64, n = 42). This was consistent with the pharmaceutical-trained model from the assay provider (77% accuracy, 0.57 sensitivity, 1.00 specificity, n = 23) (Palmer et al., 2013). The targeted biomarker outperformed the 11% reduction in H9 cell viability as a predictor of teratogenicity (Rand ACC = 54%, sensitivity 0.28, specificity 0.94, n = 42).
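The contingency statistics quoted above can be recomputed from the STM Class column of Table 2 (17 TP, 0 FP, 9 FN, 16 TN). The sketch below uses the standard textbook definitions; the function name is hypothetical, and the code is offered as a check on the arithmetic rather than as the authors' implementation.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard 2x2 contingency statistics; ratios with a zero denominator return None."""
    def ratio(num, den):
        return num / den if den else None

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation coefficient
    }

# Counts read from the STM Class column of Table 2.
benchmark = binary_metrics(tp=17, fp=0, fn=9, tn=16)
```

With these counts the accuracy is 33/42 (78.6%), sensitivity 17/26 (0.65), specificity 16/16 (1.00), PPV 1.00, NPV 16/25 (0.64), and the MCC is approximately 0.647.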

Table 2. ToxCast STM Performance Anchored to 42 Benchmark Compounds

| CASRN | Chemical | HTC (µM) | CV^a (µM) | TI^b (µM) | Preg. Class^c | STM Class^d |
| 302-79-4 | all-trans-Retinoic acid | 10 | NA | 0.003 | X | TP |
| 69-74-9 | Cytarabine hydrochloride | 1 | 0.083 | 0.054 | D | TP |
| 59-05-2 | Methotrexate | 1 | 0.062 | 0.059 | X | TP |
| 147-24-0 | Diphenhydramine hydrochloride | 100 | 3.76 | 0.588 | B | TP |
| 50-35-1 | Thalidomide | 100 | NA | 1.27 | X | TP |
| 51-21-8 | 5-Fluorouracil | 100 | 1.45 | 2.02 | D | TP |
| 298-46-4 | Carbamazepine | 100 | NA | 2.29 | C | TP |
| 55-98-1 | Busulfan | 100 | 4.91 | 2.31 | D | TP |
| 13292-46-1 | Rifampicin | 10 | NA | 2.46 | C | TP |
| 19774-82-4 | Amiodarone hydrochloride | 10 | NA | 5.10 | D | TP |
| 75330-75-5 | Lovastatin | 20 | NA | 5.10 | X | TP |
| 3056-17-5 | Stavudine | 100 | NA | 32.5 | C | TP |
| 2392-39-4 | Dexamethasone sodium phosphate | 100 | 21.8 | 37.7 | C | TP |
| 53-86-1 | Indomethacin | 100 | 44.1 | 72.7 | D | TP |
| 127-07-1 | Hydroxyurea | 1000 | 237 | 74.9 | D | TP |
| 99-66-1 | Valproic acid | 1000 | 271 | 155 | D | TP |
| 4376-20-9 | MEHP | 500 | NA | 167 | D | TP |
| 57-41-0 | 5,5-Diphenylhydantoin | 100 | NA | NA | D | FN |
| 51-52-5 | 6-Propyl-2-thiouracil | 100 | NA | NA | D | FN |
| 10043-35-3 | Boric acid | 40.7 | NA | NA | NTP | FN |
| 4449-51-8 | Cyclopamine | 10 | NA | NA | D | FN |
| 6055-19-2 | Cyclophosphamide monohydrate | 20 | NA* | NA | D | FN |
| 56-53-1 | Diethylstilbestrol | 10 | NA | NA | X | FN |
| 107-21-1 | Ethylene glycol | 100 000 | NA | NA | NTP | FN |
| 57-30-7 | Phenobarbital sodium | 100 | NA* | NA | D | FN |
| 81-81-2 | Warfarin | 100 | NA | NA | X | FN |
| 69-72-7 | Salicylic acid | 10 000 | 1795 | 513 | C | TN |
| 103-90-2 | Acetaminophen | 100 | NA* | NA | B | TN |
| 79-06-1 | Acrylamide | 36 | NA | NA | NTP | TN |
| 50-78-2 | Aspirin | 100 | NA* | NA | C | TN |
| 80-05-7 | Bisphenol A | 100 | 39.4 | NA | NTP | TN |
| 94-26-8 | Butylparaben | 100 | NA | NA | GRAS | TN |
| 58-08-2 | Caffeine | 500 | NA | NA | B | TN |
| 464-49-3 | d-Camphor | 20 | NA | NA | C | TN |
| 131-11-3 | Dimethyl phthalate | 100 | NA | NA | NTP | TN |
| 59-30-3 | Folic acid | 100 | NA | NA | A | TN |
| 54-85-3 | Isoniazid | 8.8 | NA* | NA | C | TN |
| 57-55-6 | 1,2-Propylene glycol | 1 000 000 | 246 664 | 327 552 | NTP | TN |
| 68-26-8 | Retinol | 10 | NA | NA | A | TN |
| 81-07-2 | Saccharin | 100 | NA | NA | A | TN |
| 134-03-2 | Sodium l-ascorbate | 20 | NA* | NA | A | TN |
| 599-79-1 | Sulfasalazine | 100 | NA* | NA | B | TN |
| | TP rate (sensitivity) | | 0.28 | 0.65 | | |
| | TN rate (specificity) | | 0.94 | 1.00 | | |
| | Balanced accuracy^e | | 53.7% | 82.0% (MCC = 0.647) | | |

Abbreviations: CV, cell viability; FN, false negative; FP, false positive; HTC, highest tested concentration; TI, teratogenicity index; TN, true negative; TP, true positive.
a Point of departure (acb) at 11% reduced cell number; asterisk (*) inferred inactivity from single-concentration screen.
b TI positivity set for a STM response; NA indicates no activity detected at the highest concentration tested.
c Anchors labeled by FDA pregnancy risk (categories A, B, C, D, and X); generally regarded as safe (GRAS); NTP class based on evidence from teratology cohort study in a rat, rabbit, or mouse.
d Consensus across published studies (Augustine-Rauch et al., 2016; Daston et al., 2014; Genschow et al., 2002; West et al., 2010; Wise et al., 2016) for teratogens (actual) and nonteratogens.
e Contingency analysis accepts TI ≤ 200 µM as a STM-positive (predicted), else STM-negative.

Ten compounds in ToxCast phase I/II are found among 28 compounds on the Daston List (DL) for exposure-based comparisons of developmental toxicity with regards to Cmax for maternal plasma toxicokinetic studies at a dosage producing, or not producing, teratogenicity/embryolethality (Daston et al., 2014). We evaluated concordance of the STM model for these 10 compounds across 7 negative and 7 positive exposure-based dosimetry calls from the Daston et al. (2014) study. In 7 cases, the highest concentration exceeded what could be achieved by diluting from 100 mM stock plates at the final DMSO concentration. Those were tested from neat compound in solid or solution form. Results are shown in Table 3. The TI correctly called 6 of 7 DL-negatives and 5 of 7 DL-positives, again yielding 78.6% concordance. The 3 misses were 1,2-propylene glycol (FP), caffeine (FN), and ethylene glycol (FN). Evaluation of 1,2-propylene glycol yielded a TI of 246 664 µM, although the result is not considered reliable given the exceedingly high (1 M) test concentration required to achieve 850 000 µM for assessing DL-negativity (Daston et al., 2014). Daston List concordance improves to 84.6% if this FP call is discarded. Caffeine is negative in animal studies when consideration is given to appropriate control data (Wise, 2016). Model performance further improves to 91.7% if this FN call is disregarded. No explanation can be offered for missing ethylene glycol aside from the lack of in vitro bioactivation; however, the ToxRefDB result for this chemical is equivocal with regards to prenatal developmental toxicity in rats and rabbits.
As such, the exposure-based DL model performed the same (78.6%) or better (> 84.6%) than the 42-benchmark compound model.
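The concordance figures above are simple proportions over the 14 exposure-based calls. As a minimal sketch, the three percentages can be reproduced as follows. The dictionary is transcribed from the tallies reported in the text, and the function and variable names are illustrative rather than part of the study's code.

```python
# Each key is (chemical, Daston List anchor); the value records whether the
# STM call agreed with that anchor, per the tallies in the text (3 misses of 14).
DL_CALLS = {
    ("all-trans-retinoic acid", "DL-"): True,
    ("1,2-propylene glycol", "DL-"): False,   # reported as a false positive
    ("butylparaben", "DL-"): True,
    ("caffeine", "DL-"): True,
    ("ethylene glycol", "DL-"): True,
    ("MEHP", "DL-"): True,
    ("saccharin", "DL-"): True,
    ("all-trans-retinoic acid", "DL+"): True,
    ("caffeine", "DL+"): False,               # reported as a false negative
    ("ethylene glycol", "DL+"): False,        # reported as a false negative
    ("hydroxyurea", "DL+"): True,
    ("MEHP", "DL+"): True,
    ("salicylic acid", "DL+"): True,
    ("valproic acid", "DL+"): True,
}


def concordance(calls, discard=()):
    """Percent agreement over all calls whose key is not in `discard`."""
    kept = [agree for key, agree in calls.items() if key not in discard]
    return round(100.0 * sum(kept) / len(kept), 1)


pg_fp = ("1,2-propylene glycol", "DL-")
caffeine_fn = ("caffeine", "DL+")

all_calls = concordance(DL_CALLS)                                   # 11 of 14
without_pg = concordance(DL_CALLS, discard={pg_fp})                 # 11 of 13
without_pg_caffeine = concordance(DL_CALLS, discard={pg_fp, caffeine_fn})  # 11 of 12
```

Discarding a call removes it from both the numerator and the denominator, which is why the number of agreeing calls stays at 11 while the total shrinks.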

Table 3. ToxCast STM Performance Anchored to 10 DL Chemicals

| CASRN | Chemical | HTC (µM) | CV (µM) | TI (µM) | DL(−) (µM) | DL(+) (µM) | STM Response (a) |
|---|---|---|---|---|---|---|---|
| 302-79-4 | all-trans-Retinoic acid | 10 | NA | 0.0030 | 0.0017 | 0.03 | DL(−), DL(+) |
| 57-55-6 | 1,2-Propylene glycol | 1 000 000 | 327 552 | 246 664 | 850 000 | - | DL(−)* |
| 94-26-8 | Butylparaben | 100 | NA | NA | 110 | - | DL(−) |
| 58-08-2 | Caffeine | 500 | NA | NA | 7.7 | 325 | DL(−)* |
| 107-21-1 | Ethylene glycol | 100 000 | NA | NA | 1400 | 57 000 | DL(−) |
| 127-07-1 | Hydroxyurea | 1000 | 237.4 | 74.9 | - | 350 | DL(+) |
| 4376-20-9 | MEHP | 500 | NA | 166.6 | 1 | 146 | DL(−), DL(+) |
| 81-07-2 | Saccharin | 100 | NA | NA | 24 | - | DL(−) |
| 69-72-7 | Salicylic acid | 10 000 | 1795 | 513.4 | - | 3000 | DL(+) |
| 99-66-1 | Valproic acid | 1000 | 271.4 | 155.0 | - | 800 | DL(+) |

Abbreviations: CV, cell viability; DL(−), exposure-based negative from the Daston List (Daston et al., 2014); DL(+), exposure-based positive from the Daston List (Daston et al., 2014); HTC, highest tested concentration; TI, teratogenicity index.
a. Concordance: TI correctly called 6 of 7 DL-negatives and 5 of 7 DL-positives, yielding 78.6% accuracy; accuracy improves to 84.6% if the hit* is discarded due to the exceptionally high concentration required for 1,2-propylene glycol, and to 91.7% if the miss* is discarded due to ambiguities associated with caffeine in animal studies.

Evaluating STM Assay Performance Against ToxRefDB Prenatal Studies

To evaluate a broader concordance of the ToxCast STM assay to in vivo animal studies, we culled the data from endpoint summary files in ToxRefDB prenatal studies that provide no-effect level and lowest effect level (LEL) values (Knudsen et al., 2009) and manually collapsed these data into evidence-based calls for developmental toxicity in rat and/or rabbit studies (see Materials and Methods for details). Each study attempts to achieve a maternal (mLEL) and fetal (dLEL) dosage for developmental parameters observed at term. Some endpoint categories (eg, placenta, resorptions, and postimplantation loss) do not strictly map to dLEL outcomes and are thus not reflected in this analysis. Information from Table 2 was carried across the performance models, although only 11 of 42 benchmark compounds had corresponding ToxRefDB data. Table 4 summarizes the results from 2 × 2 contingency models, and Supplementary Table S2 provides the data and models used for this analysis. In all, we identified 432 chemicals with STM data and endpoint effects data.
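Most of the performance statistics in Table 4 follow directly from the four cells of each 2 × 2 contingency model. The sketch below uses the standard textbook definitions; the function name is illustrative and this is not the authors' code. Balanced accuracy is left out because this passage does not state how the tabulated BAC values were computed.

```python
from math import sqrt


def contingency_metrics(tp, fp, fn, tn):
    """Standard 2 x 2 contingency statistics (in vitro call vs in vivo call)."""
    n = tp + fp + fn + tn
    return {
        "n": n,
        "sensitivity": tp / (tp + fn),   # fraction of in vivo positives detected
        "specificity": tn / (tn + fp),   # fraction of in vivo negatives detected
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy_pct": 100.0 * (tp + tn) / n,
        # Matthews correlation coefficient
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Cell counts taken from the "High" and "BM-42" columns of Table 4.
high = contingency_metrics(tp=20, fp=16, fn=10, tn=81)
bm42 = contingency_metrics(tp=17, fp=0, fn=9, tn=16)
```

Rounded to three decimals, the High-stringency counts give sensitivity 0.667, specificity 0.835, and MCC 0.473, in line with the tabulated values.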

Table 4. ToxCast STM Performance Anchored to ToxRefDB Prenatal Developmental Toxicity Studiesa

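Stringency filter applied to DevTox anchor (columns Base through BM-42)

| Condition (b) | Base (c,d) | Low (c,e) | Medium (c,f) | High (c,g) | BM-42 (c) |
|---|---|---|---|---|---|
| TP | 98 | 68 | 44 | 20 | 17 |
| FP | 24 | 52 | 38 | 16 | 0 |
| FN | 204 | 119 | 42 | 10 | 9 |
| TN | 106 | 193 | 161 | 81 | 16 |
| n | 432 | 432 | 285 | 127 | 42 |
| Sensitivity | 0.325 | 0.364 | 0.512 | 0.667 | 0.654 |
| Specificity | 0.815 | 0.788 | 0.809 | 0.835 | 1.000 |
| PPV | 0.803 | 0.567 | 0.537 | 0.556 | 1.000 |
| NPV | 0.342 | 0.619 | 0.793 | 0.890 | 0.640 |
| Rand ACC (%) | 47.2 | 60.4 | 71.9 | 79.5 | 78.6 |
| BAC (%) | 57.3 | 60.4 | 66.5 | 72.3 | 82.0 |
| MCC | 0.143 | 0.173 | 0.325 | 0.473 | 0.647 |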

Abbreviations: BAC, balanced accuracy; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; Rand ACC, regular accuracy.
a. Data from ToxRefDB (v1) endpoint summary file (Supplementary Table S2).
b. Predicted condition (in vitro) against true condition (in vivo).
c. Benchmark set of 42 reference compounds from Table 2, carried across the subsequent models (11 occur in ToxRefDB).
d. Base model anchored STM positivity (TI < 1000 µM) to any dLEL, whether recorded in a rat or rabbit study.
e. Low stringency model anchored STM positivity (≤ 200 µM) to (dLEL ≤ mLEL, dLEL ≤ 200 mg/kg/day) in either species (rat or rabbit).
f. Medium stringency model anchored STM positivity to clear evidence (dLEL < mLEL, dLEL ≤ 200 mg/kg/day) in one species (rat or rabbit).
g. High stringency model anchored STM positivity to a concordant response (dLEL < mLEL, dLEL ≤ 200 mg/kg/day) in both species (rat AND rabbit) on top of the 42 compound benchmark set.

Fundamentally, ToxCast STM performed weakly against an unfiltered compilation of developmental toxicity studies that included any dLEL call in a rat or rabbit (Rand ACC = 47.2%, 0.33 sensitivity, 0.82 specificity, n = 432). Concordance to the in vivo classification increased when the criteria for evidence of developmental toxicity became more stringent. Rand ACC increased to 60.4% (n = 432), 71.9% (n = 285), and 79.5% (n = 127) for low, medium, and high stringency anchors, respectively (Table 4). The latter sets a true condition as dLEL < mLEL and dLEL ≤ 200 mg/kg/day, and no developmental toxicity as ≥ 1000 mg/kg/day concordantly in rat and rabbit studies. Therefore, ToxCast STM accuracy reaches 78.6% (Rand ACC) to 82.0% (BAC) when there is high confidence in the call for developmental toxicity.

Biochemical Stratification of the STM Response

We next sought to correlate the STM-positive and STM-negative responses to biochemical profiles from the ToxCast NVS dataset. This dataset was selected because it reflects potential molecular initiating events (MIEs) in chemical-target interaction, and because the STM-positive and STM-negative sets both include compounds with known MIEs that can indicate how well the correlation recovers MIEs. A binary classification outcome model of the ORN/CYSS ratio was built with a logistic regression strategy using assay-specific AC50s from 420 NVS features. Figure 5 shows the workflow for this operation and Supplementary Table S3 provides the corresponding data.

Figure 5. Workflow used to identify biological pathways and processes demarcating sensitive and insensitive domains of the STM response (see Materials and Methods for details, Supplementary Figure S2 for the Python code, and Supplementary Table S3 for corresponding data). Vertical arrows contain the number of chemical-NovaScreen (NVS) assay targets filtered through the workflow, leaving 82 potential molecular initiating events (MIEs) for functional annotation (see Supplementary Table S3 for details) and 68 MIEs that finally map to 60 statistically significant functional annotations (see Figure 6 for details). (NOTE: color image available in the online version).
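To make the logistic regression strategy concrete, the following is a minimal single-feature sketch fitted by plain gradient descent. It is not the authors' implementation (their Python code is in Supplementary Figure S2). The toy potency scores, the outcome labels, and all function names are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * (x - mean_x) + b) by batch gradient descent."""
    n = len(x)
    mean_x = sum(x) / n
    xc = [xi - mean_x for xi in x]        # center the feature for stable steps
    w = b = 0.0
    for _ in range(steps):
        p = [sigmoid(w * xi + b) for xi in xc]
        grad_w = sum((pi - yi) * xi for pi, yi, xi in zip(p, y, xc)) / n
        grad_b = sum(pi - yi for pi, yi in zip(p, y)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mean_x


def predict(model, x_new):
    """Predicted probability of an STM-positive call for one potency score."""
    w, b, mean_x = model
    return sigmoid(w * (x_new - mean_x) + b)


# Hypothetical data: a potency score for one biochemical target (higher means
# more potent) and a binary STM call (1 = positive) for six made-up chemicals.
potency = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
stm_positive = [0, 0, 1, 0, 1, 1]

model = fit_logistic(potency, stm_positive)
```

A positive fitted weight indicates that higher potency at this hypothetical target goes with STM positivity; a negative weight would place the target on the STM-negative side.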

The NVS chemical-assay matrix started with 2918 chemical-biochemical features in the NVS dataset connected to a ToxCast gene score, whereby assays were aggregated for the gene symbol most closely linked to a biochemical target where the cell-free AC50s fell 3 median absolute deviations (3 × MAD) below the ToxCast cytotoxicity burst (Judson et al., 2016; Leung et al., 2016). For assay selection, logistic regression identified 267 molecular targets with greater statistical concordance across the STM-positive or STM-negative compounds. The algorithm tries to maximize the distance between weak and strong potency, resulting in gene-potency scores (GPS) typically in the lower nanomolar range (eg, < 75 nM). Supplementary Table S3 lists them in rank order. Perturbed ligand binding to NR3C1 (glucocorticoid receptor) and ESR1 (estrogen receptor) represented the top GPS values for the STM-positive and STM-negative responses, respectively.
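The cytotoxicity-burst filter described above can be sketched as a threshold on log-scale AC50 values. This is an illustration under stated assumptions: the cytotoxicity values are hypothetical, they are assumed to be log10 µM, and the MAD is computed here from the same list for simplicity, whereas the ToxCast pipeline may derive it differently.

```python
from statistics import median


def median_abs_deviation(values):
    """Unscaled median absolute deviation of a list of numbers."""
    m = median(values)
    return median(abs(v - m) for v in values)


def burst_lower_bound(cytotox_log_ac50s, k=3.0):
    """Median of the cytotoxicity log-AC50s minus k MADs."""
    return median(cytotox_log_ac50s) - k * median_abs_deviation(cytotox_log_ac50s)


def below_burst(log_ac50, lower_bound):
    """True when a biochemical AC50 is more potent than the burst lower bound."""
    return log_ac50 < lower_bound


# Hypothetical cytotoxicity AC50s for one chemical (log10 µM).
cytotox = [1.0, 1.2, 1.5, 1.7, 2.0]
lower = burst_lower_bound(cytotox)          # 1.5 - 3 * 0.3, about 0.6

# A cell-free AC50 of 0.1 µM (log10 = -1.0) passes; 10 µM (log10 = 1.0) does not.
keep_potent = below_burst(-1.0, lower)
keep_weak = below_burst(1.0, lower)
```

Working on the log scale keeps the filter symmetric with respect to fold-changes in concentration, which is the usual convention for AC50 comparisons.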

For further gene reduction, we next weighted the 267-gene list by evidence for gene-to-disease connections across 28 phenotypic systems in the HMDC database. This rearranged the gene list by evidence for gene-to-disease associations from curated literature. Supplementary Table S3 gives the adjusted rank order, and Figure 5 shows the top 10 genes surfacing to the STM-positive and STM-negative responses, respectively, mapped to canonical pathways and biological processes. For example, receptor tyrosine kinases (RTKs) and nonreceptor tyrosine phosphatases joined the NR3C1 glucocorticoid receptor as top STM-positive contenders, whereas several NRs and G-protein coupled receptors joined ESR1 as top STM-negative contenders. Further demarcation of potential pathways and processes selected the top 40 phenotype-weighted NVS targets in each domain for functional annotation via NIAID/NIH DAVID Bioinformatics Resources 6.8 (see Materials and Methods). Note that 2 gene products (HDAC6, PTPN9) were present in both up/down sides, resulting in 82 annotation records. Best results were obtained when STM-positive and STM-negative lists were annotated separately, utilizing a minimum of 4 genes and a maximum of 30 genes from GO Direct (molecular function, cellular compartment, and biological process), OMIM, KEGG, BioCarta, Reactome, and INTERPRO annotation systems. Annotation records were selected for Bonferroni-adjusted p-value ≤ .05 per category; redundancies were resolved manually by the lowest false discovery rate value (as these have less uncertainty across the specific annotation) or in some cases the most informative annotation record. This resulted in 60 overlapping annotation records (Supplementary Table S3) visualized by Spearman correlation (Figure 6).

Figure 6. Spearman correlation matrix for 60 potential biological pathways and processes translated from biochemical targets in the ToxCast STM domain and annotated with the NIH DAVID bioinformatics resources. NovaScreen assay targets were demarcated for the top 40 STM-positive and STM-negative response domains produced by the workflow operation shown in Figure 5. Each record had 4–30 targets represented and passed at a Bonferroni-adjusted p-value ≤ .05. Vertical axis: annotation records colored by the NVS biochemical domain (STM-positive in red and STM-negative in blue). Horizontal axis: annotation records colored by the developmental toxicity domain from ToxRefDB medium stringency model (positive in red and negative in blue).
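For readers unfamiliar with the statistic behind Figure 6, Spearman correlation is the Pearson correlation of rank-transformed values, with ties given their average rank. The sketch below is a generic pure-Python version. How the annotation records were encoded for the published matrix is not stated here, so the binary target-membership vectors are a hypothetical encoding.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical membership of five targets in two annotation records (1 = member).
record_a = [1, 1, 0, 0, 1]
record_b = [1, 0, 0, 0, 1]
rho = spearman(record_a, record_b)
```

Computing `spearman` for every pair of records yields a symmetric matrix of the kind that can be displayed as a clustered heat map.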

The 60 annotation records, which together covered 68 potential MIEs, showed an inconsistent relationship to the medium stringency anchor when colored by STM response on one axis and developmental toxicity on the other (ie, rat or rabbit outcomes, n = 285) (Figure 6). We further collapsed the 60 records into several “keystone” pathways that accounted for 75% of the MIEs (51 of 68, Supplementary Table S3). Figure 7 maps 34 MIEs (NVS) that correlated with developmental toxicity and were detected (TP) or not (FN) by the STM response. The flow of regulatory pathway information points to AKT/FoxO signaling and focal adhesion as determinants in the applicability domain (RTK signaling), and to Ca++ second messenger generation via G(q) pathways (G-protein coupled receptor [GPCR] signaling), as well as NR-mediated gene expression, as determinants of developmental toxicity outside the applicability domain. Aside from the glucocorticoid receptor (NR3C1), the pathway model provides the basis for mechanistic speculation about where the STM response is lacking in sensitivity.

Figure 7. Map of potential MIEs (NVS) to the applicability domain (STM) anchored to the medium stringency model of DevTox. Connections were largely made from information in GeneCards (https://www.genecards.org/). Molecular initiating events in dotted outline are implicit by pathway/group annotation from explicit MIEs (solid outline). Chemicals that hit MIEs in the red zone trip the STM biomarker and had some/clear evidence of DevTox in at least one species (rat or rabbit); those that hit MIEs in the green zone were positive for DevTox but missed by the STM assay under the conditions employed here (ie, false negatives [FNs]). The flow of regulatory information points to AKT/FoxO signaling in the TP domain, and G(q) signaling and nuclear receptor-mediated gene expression in the FN domain. (NOTE: color image available in the online version).

DISCUSSION

Profiling hESCs for their secreted metabolites has been proposed as an alternative testing platform for identifying compounds with potential developmental toxicity (Kleinstreuer et al., 2011; Palmer et al., 2013, 2017; West et al., 2010). Dynamic variations in metabolite abundance with functional changes in biochemical pathways and cellular metabolic response may be direct or secondary consequences of chemical exposure (Allison et al., 2015). Taking this into consideration, the profile of intermediary metabolites and small molecules released by hESCs to their environment (secretome) could help identify the extent of adverse outcome pathways in the developing embryo.

The ToxCast STM platform described here provides a potency read-out of a chemical compound’s exposure-based potential for developmental toxicity based on a critical imbalance in the targeted biomarker (decreased ORN/CYSS ratio detected in the H9 hESC conditioned medium) (Palmer et al., 2013). Testing the ToxCast phase I and II library of 1065 chemicals revealed a teratogenicity index for 205 compounds consisting of mostly environmental and commercial chemicals and some pharmaceutical compounds. Despite the wide diversity of chemical structures (Richard et al., 2016), the resulting performance metrics in predicting in vivo developmental toxicity were consistent with those reported by the assay provider using pharmaceutical compounds (Palmer et al., 2013).

Major findings from an initial analysis of the ToxCast STM dataset may be summarized as follows: (1) 19% of 1065 chemicals tested here showed a positive biomarker response, yielding a prediction of developmental toxicity, (2) biomarker performance in general reached accuracies of 79% (Rand ACC) to 82% (BAC) with excellent to outstanding specificity (> 84%) but modest sensitivity (< 67%) when compared with in vivo animal models of human prenatal developmental toxicity, (3) sensitivity improved as more stringent criteria were applied to the animal studies, and (4) statistical analysis of the most potent chemical hits on specific biochemical targets in ToxCast revealed positive and negative associations with the STM response, providing insights into the mechanistic underpinnings of the targeted endpoint and its biological domain. The results must be interpreted with caution, insofar as the in vitro response is not a direct test of in vivo toxicity in the absence of kinetics, metabolism, genetic diversity, and biological coverage (Crump et al., 2010).

(1) Targeted biomarker.
Recognizing the potential for FP and FN calls, the STM dataset has been analyzed and interpreted under the assumption of direct chemical-biological interaction of developmental toxicants with target proteins in the H9 cell culminating in a toxicodynamic response altering the ORN/CYSS ratio. Ornithine is an amino acid produced in the urea cycle by splitting urea from L-arginine. When transported by SLC25A15 into the mitochondrial matrix, ORN can be carbamylated to L-citrulline by ornithine transcarbamylase. Alternatively, ORN is metabolized by ornithine decarboxylase in the cytosol to putrescine, which is rate limiting in polyamine biosynthesis and thus important for the stabilization of newly synthesized DNA. Cystine taken up by cells from the medium is used in glutathione production, and so decreased CYSS uptake likely reflects a change in cellular glutathione synthesis and redox balance. Decreasing the ORN/CYSS ratio reflects, therefore, an imbalance in H9 cells that may predict a chemical’s human teratogenic potential (Palmer et al., 2013, 2017).
Using a benchmark set of 42 compounds compiled from literature on nonanimal alternatives, ToxCast STM perfectly classified STM-positives (1.00 PPV) but misclassified about a third as negatives (0.64 NPV). Absence of FPs bolsters confidence in the assay’s qualitative predictivity and adds quantitative value with an exposure-based prediction of teratogenicity; however, the relative insensitivity toward some chemicals showing evident developmental toxicity in guideline animal studies is an important factor in modeling assay performance that must be considered when using the data for health-protective hazard identification. For example, cyclopamine, cyclophosphamide, and diethylstilbestrol were classified as FNs. As noted in the Materials and Methods section, the HTC for each chemical was set based on information available for the cytotoxicity burst in ToxCast (Judson et al., 2016). The LBC (lower bounds of the median cytotoxicity burst) for cyclopamine, cyclophosphamide, and diethylstilbestrol was 15.7, > 1000, and 21.5 µM, respectively. We tested cyclopamine up to 20 µM and do not think its FN call was an underdosing issue. Although we only tested cyclophosphamide monohydrate to 20 µM, a response would not be expected because this agent needs to be metabolically activated to its proximate teratogen (our contractor finds no response in human iPSCs testing this compound up to 300 µM). For diethylstilbestrol, the HTC (10 µM) was at the cytotoxicity burst; the contractor finds a response for this compound in iPSCs between 30 and 100 µM, so it would likely be a TP if tested higher.

(2) Assay performance.
Overall accuracy of the ToxCast STM platform was assessed using in vivo outcomes in the ToxRefDB database’s prenatal developmental toxicity studies for pregnant rats and/or rabbits (Knudsen et al., 2009) and 42 benchmark compounds with confident calls from the literature, including 20 compounds from the original development of the assay (Palmer et al., 2013). Our data yielded a BAC of 82% that was consistent with the original description of the assay (Palmer et al., 2013). In characterizing the performance of the STM assay across in vitro and in vivo cutoff models, we found a convenient cutoff for positivity at ≤ 200 µM (in vitro), and ≤ 200 mg/kg/day for positivity and ≥ 1000 mg/kg/day for negativity in vivo. The median dLEL values for rat and rabbit developmental toxicity were 100 and 80 mg/kg/day, respectively. A study by van Ravenzwaay et al. (2017) to evaluate prenatal developmental toxicity on 480 chemicals from the REACh and BASF databases reported median dLEL values of 320 and 65 mg/kg/day for rat and rabbit studies, respectively, using ≤ 1000 mg/kg/day for taking developmental toxicity into account. As such, the 200 mg/kg/day cutoff for positivity is reasonable but could be more fully explored in future evaluation of compound-specific concentration (Cmax) dosimetry profiles predicted using high-throughput toxicokinetic models.

Expanding the 42-benchmark set with ToxCast chemicals having concordant ToxRefDB outcomes in rat and rabbit studies resulted in Rand ACC of 79.5% and BAC of 72.3% (n = 127). The 127 chemical set is highlighted for its value in assessing performance metrics that correspond to the higher stringency chemical set, where we have more confidence about whether a chemical produces developmental toxicity or not based on consistent results across 2 species of animal prenatal developmental toxicity studies. These metrics fall short of the assay’s performance of 85% BAC for 80 compounds cited in a recent review (Zhu et al., 2016).
An improvement was not evident when the analysis was expanded with calls from one species only, rat or rabbit (Rand ACC = 71.9%, BAC = 66.5%, n = 285). Relaxing the stringency of a positive call in that case would lessen the weight of evidence from ToxRefDB calls to tests in one species or species discordance based on the cutoffs used in our analysis for positivity (≤ 200 mg/kg/day) and negativity (≥ 1000 mg/kg/day). Given the caveat of species discordance in developmental toxicity findings for some chemicals (Carney et al., 2011; Hurtt et al., 2003; Janer et al., 2008; Knudsen et al., 2009; Rorije et al., 2012; Teixidó et al., 2018; Theunissen et al., 2016), and even more when using the base model, it may be that 82% BAC versus animal studies is quite good as a prediction of human developmental toxicity. Regulatory nonclinical safety testing of human pharmaceuticals typically requires embryo-fetal developmental toxicity testing in 2 species (1 rodent and 1 nonrodent). Discordance across species may be attributed to factors such as maternal toxicity, study design differences, pharmacokinetic differences, and pharmacologic relevance of species. If rat and rabbit studies combined are more predictive of human developmental toxicity, then an important question is whether using rat or rabbit as the benchmark for assessing model performance with hESC lines is truly health protective. An analysis of 379 pharmaceutical compounds having prenatal studies in both rat and rabbit animal models found both species to be equally sensitive by overall dLEL comparison, but selective developmental toxicity in one species was not uncommon, suggesting that the use of both species has a higher probability of detecting developmental toxicants than either one alone (Teixidó et al., 2018; Theunissen et al., 2016). Relaxing the stringency criteria for developmental toxicity would likely be used in a more conservative health-protective developmental hazard assessment for chemicals management programs that do not require detailed animal protocols.
Given the tradeoff between assay sensitivity and specificity, further investigation is needed to determine how information from ToxCast or read-across methods might guide concentration selection to optimize the assay for sensitivity (eg, low FNs) or specificity (eg, low FPs).

(3) Assay sensitivity.
Potential reasons for FNs in STM predictivity include limited in vitro solubility of chemicals, chemical degradation, lack of xenobiotic metabolism, incomplete biological coverage, or simply not testing high enough concentrations. Solubility and stability properties of ToxCast chemicals have been reported (Richard et al., 2016). These, as well as the caveats associated with biotransformation, may be considered with regards to negative predictivity on a case-by-case basis and will not be discussed here. As part of the ToxCast portfolio, we designed the highest tested concentration for each chemical using its median cytotoxicity burst across 37 cytotoxicity and cell stress assays as an initial guide for setting the HTC (Judson et al., 2016). Based on the distance between the highest concentration tested here (ToxCast STM) and the lower bounds of the ToxCast cytotoxicity burst (Judson et al., 2016), we estimate that 40–60 inactive chemicals may have been tested at too low a concentration in this study. Applying the overall 19% hit rate from the assay profile obtained here (0.19 × 40 to 0.19 × 60, or roughly 8 to 11 chemicals) suggests dose insufficiency might account for fewer than a dozen FNs in the tests conducted to date. Incomplete biological coverage will be discussed below as part of the qualification of this assay for the ToxCast dataset with regards to applicability for developmental hazard characterization.

(4) Potential MIEs.
Although performance indicators can be successfully anchored to in vivo developmental effects data, statistical analysis of the most potent chemical hits on specific biochemical targets in ToxCast that demarcated potential positivity or negativity of the STM response can provide insights into potential MIEs and mechanistic support for the assay biomarker (ie, the ORN/CYSS ratio). Focusing on the ToxCast NVS dataset as an initial proof of concept, functional annotation identified 60 “keystone” pathways and processes from the top correlations in the STM-positive and STM-negative space, respectively. Annotated functions most strongly correlated with the sensitive domain of the assay were RTKs and their associated downstream kinases and phosphatases that regulate cell growth, differentiation, and survival. At the top of the list were several “class-III RTKs” characterized by conserved Ig-like repeats (FLT1, FLT4, KDR, and CSF1R) activated by ligand-induced dimerization leading to autophosphorylation and signaling through downstream phosphorylation cascades (SRC, RAS, RAP1, MAPK, PI3-AKT, and FOXO). Positivity is generally consistent with findings from a study in H9 cells using nuclear immunoreactivity for SOX17, a transcription factor of endoderm and hematopoietic differentiation, as a biomarker of positivity for teratogenicity (Kameoka et al., 2014). Those authors screened 302 kinase inhibitors that cover a majority of the human kinome. Their positives were enriched for some of the top kinase targets identified in our ToxCast STM model (PI3K, AURKA, CSNK1D, KDR, FLT3, and FLT1), which further qualifies the relevance of the sensitive domain of the assay. Among the NR superfamily, only the glucocorticoid receptor (NR3C1) had a strong correlation with a STM-positive response. Annotated functions most strongly correlated with the insensitive domain of the assay included NRs and GPCRs.
Estrogen receptors (ESR1) stood out among the NRs as having a very strong correlation to a STM-negative response, wherein a positive might have been expected. For example, 17-ethinylestradiol, 17-α-estradiol, 17-β-estradiol, and diethylstilbestrol all tested negative in the ToxCast STM assay despite showing sub-nanomolar AC50s in the ToxCast NVS human ESR1 ligand binding assay. An obvious question is why the ORN/CYSS ratio was unaffected following exposure to test concentrations (1–10 µM) that were several orders of magnitude greater than biochemical AC50s. A related question is whether estrogen receptors are even expressed or functional in undifferentiated hESCs. In their quantitative polymerase chain reaction (qPCR)-based characterization of NR gene expression across 3 mouse and human ESC lines (including H9), Xie et al. (2009) did not detect ERα or ERβ expression in the undifferentiated cells. This absence is consistent with a query of the NIH Stem Cell Data Management System Database (StemCellDB) (https://stemcelldb.nih.gov/public.do). On the other hand, both ERα and ERβ transcripts as well as nuclear immunoreactivity are highly expressed in undifferentiated hESC from the Miz-hES1 cell line as well as embryoid bodies grown on murine feeder cells (Hong et al., 2004). For this reason, we cannot rule out the important concern that negativity may reflect an absence of ESR1 signaling dynamics.

G-protein coupled receptors at the top of the STM-negative list included muscarinic receptors (CHRM 1/3/5) and endothelin receptors (EDNRA and EDNRB) acting through G alpha (q) to stimulate calcium signaling through phosphatidylinositol generation and phospholipase C activity. Negativity in response to chemicals with strong antagonist effects on the CHRM system is consistent with negligible expression of these receptors in the NIH StemCellDB, as well as studies showing undifferentiated H9 cells to be unresponsive to various neurotransmitters, including acetylcholine and substance P, that robustly increase intracellular calcium in differentiated hESCs (Carpenter et al., 2004). Notably, antagonists of the neurokinin receptors TACR1 (substance P) and TACR2 (neurokinin A) correlated with a STM-negative response versus a STM-positive response for TACR3 (neurokinin B).
Because all 3 neurokinin receptors couple to G(q) signaling events via the phosphatidylinositol-calcium second messenger (Regoli et al., 1994), we may speculate that differential GPCR expression explains at least some of the relative insensitivity of the STM biomarker to some chemicals in the ToxCast library.

Understanding the biochemical space not covered by ToxCast STM may lead to improved models for predictive developmental toxicity. Consider, for example, NVS assays for endothelin receptors EDNRA and EDNRB. Endothelin signaling through these GPCR systems is essential for normal neural crest development and disruption is teratogenic (Clouthier et al., 1998; Spence et al., 1999; Treinen et al., 1999). Two compounds with potent inhibitory effects on these NVS receptor assays (SB-217242 and SB-209670) were STM-negative. As such, statistical analysis of the STM response with regards to the NVS biochemical domain shows at least some teratogenic pathways will be missed by the ToxCast STM profile. A machine learning approach using the entire ToxCast assay portfolio will be needed as a follow-up study to more completely define the STM applicability domain and expand the predictive models with other in vitro data and in silico models relevant to prenatal developmental toxicity but otherwise missed by the ToxCast STM assay.

Deconvoluting the flow of regulatory information in the STM-positive keystone annotations suggests a convergence of TP signals on FOXO signaling. The mammalian “Forkhead box O” family of transcription factors (FOXO1, FOXO3, and FOXO4) are well established downstream targets of AKT-phosphorylation that, in turn, lead to nuclear export via 14-3-3 binding (Ro et al., 2013). FOXO is part of an energy-sensing system that controls stem cell self-renewal and differentiation relative to the metabolic state of the cell (Ochocki and Simon, 2013; Rafalski et al., 2012).
FOXO1 is essential for the maintenance of hESC pluripotency (Zhang et al., 2011) and its inactivation is linked with AKT, serum- and glucocorticoid kinase, IKB kinase, and ERK pathways (Hagenbuchner and Ausserlechner, 2013). These pathways are all represented in the STM-positive domain. FOXO transcription factors act as metabolic sensors by virtue of redox modifications of their cysteine residues (Wang et al., 2013). ROS modulate FOXO activity at multiple levels, including posttranslational modifications (phosphorylation and acetylation), interaction with coregulators, alterations in subcellular localization, protein synthesis, and stability (Klotz et al., 2015). Important roles are proposed for FOXO signaling in several embryonic processes (Hosaka et al., 2004; Yeo et al., 2013), leading to the hypothesis that the ORN/CYSS ratio reflects the H9 metabolic cell state in a manner linked to hypophosphorylation and nuclear retention (FOXO1) or mitochondrial ROS homeostasis (FOXO3) (Ochocki and Simon, 2013; Yeo et al., 2013) in redox balance. Further investigation is required to determine if the ORN/CYSS ratio somehow signifies the potential for a developmentally relevant alteration in FOXO switching.

In conclusion, the data presented here support the application of the Stemina devTOXqP platform for predictive toxicology (Palmer et al., 2013, 2017) and further demonstrate its value in ToxCast as a novel resource that can generate testable hypotheses aimed at characterizing potential pathways for teratogenicity and HTS prioritization of environmental chemicals for an exposure-based assessment of developmental hazard. These “hierarchical performance-based models” bring new and potentially valuable information on the ORN/CYSS biomarker for systematic and unbiased prediction of adverse fetal endpoints across a relatively large number of animal studies. This analysis, we believe, points out strengths and weaknesses in the translatability of the performance-based models for both scientific and regulatory purposes. The present analysis demonstrated the positive predictive capability of this assay and showed that balanced accuracy increases as the stringency criteria tighten for in vivo developmental toxicity (57%–82%). For an untested chemical, a positive test in this assay would indicate its likelihood of being a developmental hazard at a particular internal concentration, with the understanding that a negative response does not imply a nonhazard. Further analysis, both experimental and computational, will be necessary to determine whether the limitations in sensitivity reflect the robustness of in vivo endpoints used in the models or the biological domain of the hESC biomarker response, in generating alerts for further testing.

DATA AVAILABILITY

Supplementary data are available at https://doi.org/10.5061/dryad.gqnk98shm. The ToxCast_STM dataset is available for download at https://doi.org/10.23645/epacomptox.11819265 (last accessed February 2020).

DECLARATION OF CONFLICTING INTERESTS

The authors declare that they have no conflicts of interest to disclose in connection with this study. J.A.P. is an employee of Stemina Biomarker Discovery Inc.

FUNDING

US EPA under NCCT contract EP-D-13-055 with Stemina Biomarker Discovery (Madison, Wisconsin).

ACKNOWLEDGMENTS

We gratefully acknowledge technical support from Michael Colwell and Laura Egnash (Stemina) as well as Ann Richard and David Murphy (NCCT).
We thank our technical reviewers, Dr. Nicole Kleinstreuer (NTP/NIEHS/NIH) and Dr. Katie Paul-Friedman (NCCT/ORD/EPA), for their critical comments during the Agency’s clearance review of this manuscript. We also thank Madison Feshuk for compiling the assay annotations for release of this assay to EPA’s CompTox Chemicals Dashboard (anticipated March 2020).

AUTHOR CONTRIBUTIONS

T.B.K. conceived the study and served as the contract’s Technical Contract Officer Representative (TCOR). K.A.H. served as EPA Project Lead and J.A.P. served as Stemina Project Lead. P.K. entered the raw data into invitroDB and N.R. processed the data through tcpl. T.B.K. was responsible for final data curation in collaboration with J.A.P. T.B.K. developed the performance models and T.J.Z. developed the statistical models, in collaboration with R.S.J. All authors contributed to interpretation and preparation of the manuscript.

REFERENCES

Adler, S., Pellizzer, C., Paparella, M., Hartung, T., and Bremer, S. (2006). The effects of solvents on embryonic stem cell differentiation. Toxicol. In Vitro 20, 265–271.

Allison, T. F., Powles-Glover, N. S., Biga, V., Andrews, P. W., and Barbaric, I. (2015). Human pluripotent stem cells as tools for high-throughput and high-content screening in drug discovery. Int. J. High Throughput Screen. 5, 1–13.

Augustine-Rauch, K., Zhang, C. X., and Panzica-Kelly, J. M. (2016). A developmental toxicity assay platform for screening teratogenic liability of pharmaceutical compounds. Birth Defects Res. B Dev. Reprod. Toxicol. 107, 4–20.

Bremer, S., and Hartung, T. (2004). The use of embryonic stem cells for regulatory developmental toxicity testing in vitro: The current status of test development. Curr. Pharm. Des. 10, 2733–2747.

Carney, E. W., Ellis, A. L., Tyl, R. W., Foster, P. M. D., Scialli, A. R., Thompson, K., and Kim, J. (2011). Critical evaluation of current developmental toxicity testing strategies: A case of babies and their bathwater. Birth Defects Res. B Dev. Reprod. Toxicol. 92, 395–403.

Carpenter, M. K., Rosler, E. S., Fisk, G. J., Brandenberger, R., Ares, X., Miura, T., Lucero, M., and Rao, M. S. (2004). Properties of four human embryonic stem cell lines maintained in a feeder-free culture system. Dev. Dynam. 229, 243–258.

Chandler, K. J., Barrier, M., Jeffay, S., Nichols, H. P., Kleinstreuer, N. C., Singh, A. V., Reif, D. M., Sipes, N. S., Judson, R. S., Dix, D. J., et al. (2011). Evaluation of 309 environmental chemicals using a mouse embryonic stem cell adherent cell differentiation and cytotoxicity assay. PLoS One 6, e18540.

Clouthier, D. S., Hosoda, K., Richardson, J. A., Williams, C. A., Yanagisawa, H., Kuwaki, T., Kumada, M., Hammer, R. E., and Yanagisawa, M. (1998). Cranial and cardiac neural crest defects in endothelin-A receptor-deficient mice. Development 125, 813–824.

Collins, F. S., Gray, G. M., and Bucher, J. R. (2008). Transforming environmental health protection. Science 319, 906–907.

Daston, G. P., Beyer, B. K., Carney, E. W., Chapin, R. E., Friedman, J. M., Piersma, A. H., Rogers, J. M., and Scialli, A. R. (2014). Exposure-based validation list for developmental toxicity screening assays. Birth Defects Res. B Dev. Reprod. Toxicol. 101, 423–428.

European Parliament, Council of the European Union. (2006). Regulation (EC) No. 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No. 793/93 and Commission Regulation (EC) No. 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC. Available at: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32006R1907. Last accessed July 1, 2019.

Filer, D. L., Kothiya, P., Setzer, R. W., Judson, R. S., and Martin, M. T. (2017). tcpl: The ToxCast pipeline for high-throughput screening data. Bioinformatics 33, 618–620.

Genschow, E., Spielmann, H., Scholz, G., Seiler, A., Brown, N., Piersma, A., Brady, M., Clemann, N., Huuskonen, H., Paillard, F., et al. (2002). The ECVAM international validation study on in vitro embryotoxicity tests: Results of the definitive phase and evaluation of prediction models. Altern. Lab. Anim. 30, 151–176.

Hagenbuchner, J., and Ausserlechner, M. J. (2013). Mitochondria and FOXO3: Breath or die. Front. Physiol. 4, 147.

Hong, S. H., Nah, H. Y., Lee, Y. J., Lee, J. W., Park, J. H., Kim, S. J., Lee, J. B., Yoon, H. S., and Kim, C. H. (2004). Expression of estrogen receptor-alpha and -beta, glucocorticoid receptor, and receptor genes in human embryonic stem cells and embryoid bodies. Mol. Cells 31, 320–325.

Hosaka, T., Biggs, W. H., 3rd, Tieu, D., Boyer, A. D., Varki, N. M., Cavenee, W. K., and Arden, K. C. (2004). Disruption of forkhead transcription factor (FOXO) family members in mice reveals their functional diversification. Proc. Natl. Acad. Sci. U.S.A. 101, 2975–2980.

Huch, M., and Koo, B.-K. (2015). Modeling mouse and human development using organoid cultures. Development 142, 3113–3125.

Hurtt, M. E., Cappon, G. D., and Browning, A. (2003). Proposal for a tiered approach to developmental toxicity testing for veterinary pharmaceutical products for food-producing animals. Food Chem. Toxicol. 41, 611–619.

Janer, G., Slob, W., Hakkert, B. C., Vermeire, T., and Piersma, A. H. (2008). A retrospective analysis of developmental toxicity studies in rat and rabbit: What is the added value of the rabbit as an additional test species? Regul. Toxicol. Pharmacol. 50, 206–217.

Juberg, D. R., Knudsen, T. B., Sander, M., Beck, N. B., Faustman, E. M., Mendrick, D. L., Fowle, J. R., Hartung, T., Tice, R. R., Lemazurier, E., et al. (2017). FutureTox III: Bridges for translation. Toxicol. Sci. 155, 22–31.

Judson, R., Houck, K., Martin, M., Richard, A. M., Knudsen, T. B., Shah, I., Little, S., Wambaugh, J., Setzer, R. W., Kothiya, P., et al. (2016). Analysis of the effects of cell stress and cytotoxicity on in vitro assay activity in the ToxCast dataset. Toxicol. Sci. 152, 323–329.

Judson, R. S., Houck, K. A., Kavlock, R. J., Knudsen, T. B., Martin, M. T., Mortensen, H. M., Reif, D. M., Richard, A. M., Rotroff, D. M., Shah, I., et al. (2010). In vitro screening of environmental chemicals for targeted testing prioritization: The ToxCast project. Environ. Health Perspect. 118, 485–492.

Kameoka, S., Babiarz, J., Kolaja, K., and Chiao, E. (2014). A high-throughput screen for teratogens using human pluripotent stem cells. Toxicol. Sci. 137, 76–90.

Kavlock, R., Chandler, K., Houck, K., Hunter, S., Judson, R., Kleinstreuer, N., Knudsen, T., Martin, M., Padilla, S., Reif, D., et al. (2012). Update on EPA’s ToxCast program: Providing high throughput decision support tools for chemical risk management. Chem. Res. Toxicol. 25, 1287–1302.

Kleinstreuer, N. C., Smith, A. M., West, P. R., Conard, K. R., Fontaine, B. R., Weir-Hauptman, A. M., Palmer, J. A., Knudsen, T. B., Dix, D. J., Donley, E., et al. (2011). Identifying developmental toxicity pathways for a subset of ToxCast chemicals using human embryonic stem cells and metabolomics. Toxicol. Appl. Pharmacol. 257, 111–121.

Klotz, L.-O., Sanchez-Ramos, C., Prieto-Arroyo, I., Urbanek, P., Steinbrenner, H., and Monsalve, M. (2015). Redox regulation of FoxO transcription factors. Redox Biol. 6, 51–72.

Knudsen, T. B., and Daston, G. P. (2018). Systems toxicology and virtual tissue models. In Comprehensive Toxicology, Third Edition (McQueen C. A., Ed.), Vol. 5, pp. 351–362. Elsevier Ltd, Oxford.

Knudsen, T. B., Houck, K., Sipes, N. S., Judson, R. S., Singh, A. V., Weissman, A., Kleinstreuer, N. C., Mortensen, H., Reif, D., Setzer, R. W., et al. (2011). Activity profiles of 320 ToxCast™ chemicals evaluated across 292 biochemical targets. Toxicology 282, 1–15.

Knudsen, T. B., Martin, M. T., Kavlock, R. J., Judson, R. S., Dix, D. J., and Singh, A. V. (2009). Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA’s ToxRefDB. Reprod. Toxicol. 28, 209–219.

Le Coz, F., Suzuki, N., Nagahori, H., Omori, T., and Saito, K. (2015). Hand1-Luc embryonic stem cell test (Hand1-Luc EST): A novel rapid and highly reproducible in vitro test for embryotoxicity by measuring cytotoxicity and differentiation toxicity using engineered mouse ES cells. J. Toxicol. Sci. 40, 251–261.

Leist, M., Hasiwa, N., Rovida, C., Daneshian, M., Basketter, D., Kimber, I., Clewell, H., Gocht, T., Goldberg, A., Busquet, F., et al. (2014). Consensus report on the future of animal-free systemic toxicity testing. Altex 31, 341–356.

Leung, M. C. K., Phuong, J., Baker, N. C., Sipes, N. S., Klinefelter, G. R., Martin, M. T., McLaurin, K. W., Setzer, R. W., Darney, S. P., Judson, R. S., et al. (2016). Systems toxicology of male reproductive development: Profiling 774 chemicals for molecular targets and adverse outcomes. Environ. Health Perspect. 124, 1050–1061.

Luz, A. L., and Tokar, E. J. (2018). Pluripotent stem cells in developmental toxicity testing: A review of methodological advances. Toxicol. Sci. 165, 31–39.

Martin, M. T., Mendez, E., Corum, D. G., Judson, R. S., Kavlock, R. J., Rotroff, D. M., and Dix, D. J. (2009). Profiling the reproductive toxicity of chemicals from multigenerational studies in the toxicity reference database. Toxicol. Sci. 110, 181–190.

Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451.

National Research Council. (2007). Toxicity Testing in the 21st Century: A Vision and a Strategy, 196 pp. The National Academies Press, Washington, DC.

Niles, A., Moravec, R. A., Hesselberth, P. E., Scurria, M. A., Daily, W. J., and Riss, T. L. (2007). A homogeneous assay to measure live and dead cells in the same sample by detecting different protease markers. Anal. Biochem. 366, 197–206.

Ochocki, J. D., and Simon, M. C. (2013). Nutrient-sensing pathways and metabolic regulation in stem cells. J. Cell Biol. 203, 23–33.

Palmer, J. A., Smith, A. M., Egnash, L. A., Colwell, M. R., Donley, E. L. R., Kirchner, F. R., and Burrier, R. E. (2017). A human induced pluripotent stem cell-based in vitro assay predicts developmental toxicity through a retinoic acid receptor-mediated pathway for a series of related retinoid analogues. Reprod. Toxicol. 73, 350–361.

Palmer, J. A., Smith, A. M., Egnash, L. A., Conard, K. R., West, P. R., Burrier, R. E., Donley, E. L. R., and Kirchner, F. R. (2013). Establishment and assessment of a new human embryonic stem cell-based biomarker assay for developmental toxicity screening. Birth Defects Res. B Dev. Reprod. Toxicol. 98, 343–363.

Panzica-Kelly, J. M., Brannen, K. C., Ma, Y., Zhang, C. X., Flint, O. P., Lehman-McKeeman, L. D., and Augustine-Rauch, K. A. (2013). Establishment of a molecular embryonic stem cell developmental toxicity assay. Toxicol. Sci. 131, 447–457.

Pennings, J. L. A., van Dartel, D. A. M., Robinson, J. F., Pronk, T. E., and Piersma, A. H. (2011). Gene set assembly for quantitative prediction of developmental toxicity in the embryonic stem cell test. Toxicology 284, 63–71.

Powers, D. M. W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63.

Rafalski, V. A., Mancini, E., and Brunet, A. (2012). Energy metabolism and energy-sensing pathways in mammalian embryonic and adult stem cell fate. J. Cell Sci. 125, 5597–5608.

Regoli, D., Boudon, A., and Fauchere, J. L. (1994). Receptors and antagonists for substance P and related peptides. Pharmacol. Rev. 46, 551–599.

Richard, A. M., Judson, R. S., Houck, K. A., Grulke, C. M., Volarath, P., Thillainadarajah, I., Yang, C., Rathman, J., Martin, M. T., Wambaugh, J. F., et al. (2016). ToxCast chemical landscape: Paving the road to 21st century toxicology. Chem. Res. Toxicol. 29, 1225–1251.

Ro, S.-H., Liu, D., Yeo, H., and Paik, J.-H. (2013). FoxOs in neural stem cell fate decision. Arch. Biochem. Biophys. 534, 55–63.

Rorije, E., van Hienen, F. J., Dang, Z. C., Hakkert, B. H., Vermeire, T., and Piersma, A. H. (2012). Relative parameter sensitivity in prenatal toxicity studies with substances classified as developmental toxicants. Reprod. Toxicol. 34, 284–290.

Seiler, A. E. M., and Spielmann, H. (2011). The validated embryonic stem cell test to predict embryotoxicity in vitro. Nat. Protoc. 6, 961–978.

Sipes, N. S., Martin, M. T., Kothiya, P., Reif, D. M., Judson, R., Richard, A., Houck, K. A., Dix, D. J., Kavlock, R. J., and Knudsen, T. B. (2013). Profiling 976 ToxCast chemicals across 331 enzymatic and receptor signaling assays. Chem. Res. Toxicol. 26, 878–895.

Spence, S., Anderson, C., Cukierski, M., and Patrick, D. (1999). Teratogenic effects of the endothelin receptor antagonist L-753,037 in the rat. Reprod. Toxicol. 13, 15–29.

Sturla, S. J., Boobis, A. R., Fitzgerald, R. E., Hoeng, J., Kavlock, R. J., Schirmer, K., Whelan, M., Wilks, M. F., and Peitsch, M. C. (2014). Systems toxicology: From basic research to risk assessment. Chem. Res. Toxicol. 27, 314–329.

Teixidó, E., Krupp, E., Amberg, A., Czich, A., and Scholz, S. (2018). Species-specific developmental toxicity in rats and rabbits: Generation of a reference compound list for development of alternative testing approaches. Reprod. Toxicol. 76, 93–102.

Theunissen, P. T., Beken, S., Beyer, B., Breslin, W. J., Cappon, G. D., Chen, C. L., Chmielewski, G., de Schaepdrijver, L., Enright, B., Foreman, J. E., et al. (2016). Comparing rat and rabbit embryo-fetal developmental toxicity data for 379 pharmaceuticals: On the nature and severity of developmental effects. Crit. Rev. Toxicol. 46, 900–910.

Thomson, J. A., Itskovitz-Eldor, J., Shapiro, S. S., Waknitz, M. A., Swiergiel, J. J., Marshall, V. S., and Jones, J. M. (1998). Embryonic stem cell lines derived from human blastocysts. Science 282, 1145–1147.

Treinen, K. A., Louden, C., Dennis, M. J., and Wier, P. J. (1999). Developmental toxicity and toxicokinetics of two endothelin receptor antagonists in rats and rabbits. Teratology 59, 51–59.

US Public Law 114-182. (2016). Available at: https://www.congress.gov/114/plaws/publ182/PLAW-114publ182.pdf and https://www.epa.gov/assessing-and-managing-chemicals-under-tsca/alternative-test-methods-and-strategies-reduce.

van Ravenzwaay, B., Jiang, X., Luechtefeld, T., and Hartung, T. (2017). The threshold of toxicological concern for prenatal developmental toxicity in rats and rabbits. Regul. Toxicol. Pharmacol. 88, 157–172.

Wang, K., Zhang, T., Dong, Q., Nice, E. C., Huang, C., and Wei, Y. (2013). Redox homeostasis: The linchpin in stem cell self-renewal and differentiation. Cell Death Dis. 4, e537.

Warkus, E. L. L., and Marikawa, Y. (2017). Exposure-based validation of an in vitro gastrulation model for developmental toxicity assays. Toxicol. Sci. 157, 235–245.

Watford, S., Pham, L. L., Wignall, J., Shin, R., Martin, M. T., and Paul Friedman, K. (2019). ToxRefDB version 2.0: Improved utility for predictive and retrospective toxicology analyses. Reprod. Toxicol. 89, 145–158.

West, P. R., Weir, A. M., Smith, A. M., Donley, E. L. R., and Cezar, G. G. (2010). Predicting human developmental toxicity of pharmaceuticals using human embryonic stem cells and metabolomics. Toxicol. Appl. Pharmacol. 247, 18–27.

Wise, L. D. (2016). Numeric estimates of teratogenic severity from embryo-fetal developmental toxicity studies. Birth Defects Res. B Dev. Reprod. Toxicol. 107, 60–70.

Xie, C. Q., Jeong, Y., Fu, M., Bookout, A. L., Garcia-Barrio, M. T., Sun, T., Kim, B., Xie, Y., Root, S., Zhang, J., et al. (2009). Expression profiling of nuclear receptors in human and mouse embryonic stem cells. Mol. Endocrinol. 23, 724–733.

Xing, J., Toh, Y.-C., Xu, S., and Yu, H. (2015). A method for human teratogen detection by geometrically confined cell differentiation and migration. Sci. Rep. 5, 10038.

Yeo, H., Lyssiotis, C. A., Zhang, Y., Ying, H., Asara, J. M., Cantley, L. C., and Paik, J. H. (2013). FoxO3 coordinates metabolic pathways to maintain redox balance in neural stem cells. EMBO J. 32, 2589–2602.

Zhang, X., Yalcin, S., Lee, D.-F., Yeh, T. Y. J., Lee, S.-M., Su, J., Mungamuri, S. K., Rimmelé, P., Kennedy, M., Sellers, R., et al. (2011). FOXO1 is an essential regulator of pluripotency in human embryonic stem cells. Nat. Cell Biol. 13, 1092–1099.

Zhu, H., Bouhifd, M., Donley, E., Egnash, L., Kleinstreuer, N., Kroese, E. D., Liu, Z., Luechtefeld, T., Palmer, J., Pamies, D., et al. (2016). Supporting read-across using biological data. Altex 33, 167–182.


© Copyrights 2018