Genes and Environment, Vol. 30, No. 4 pp. 139–149 (2008)

Mini-review

Experimental Design and Statistical Analysis of Studies to Demonstrate a Threshold in Genetic Toxicology: A Mini-review¹

David P. Lovell²
Postgraduate Medical School, University of Surrey, Surrey, UK

(Received September 25, 2008; Revised October 25, 2008; Accepted October 26, 2008)

¹ Presented at the International Symposium on Genotoxic and Carcinogenic Thresholds, Tokyo, Japan, July 22-23, 2008.
² Correspondence to: David P. Lovell, Postgraduate Medical School, University of Surrey, Daphne Jackson Road, Manor Park, Guildford, Surrey, GU2 7WG, UK. Tel: +44-1483-688609, Fax: +44-1483-688501, E-mail: d.lovell@surrey.ac.uk

A mechanistic understanding of genotoxicity is important for the risk assessment of the exposure of human populations to chemicals. The nature of the dose response relationship at low doses is valuable information in the evaluation of the biological importance of such exposures. A range of mathematical and statistical approaches have been used to try to characterize responses at these low doses. Methods include mathematical models which do or do not include thresholds and statistical methods which try to identify No-observable effect levels (NOELs). It is important to appreciate that determination of an NOEL is not evidence for a threshold. There is an increasing appreciation of the potential to identify 'pragmatic' thresholds using experimental systems with a range of biomarkers. The accurate characterization and estimation of these dose-response relationships requires careful experimental design which can improve the accuracy of the estimates of the response while avoiding the introduction of artifactual effects. Statistical approaches such as Design of Experiment (DoE) methodology, which builds on the traditional factorial design, can provide efficient approaches for the description and estimation of dose-response relationships of both individual and combinations of agents. Estimation approaches such as the benchmark dose methodology and the concept of thresholds of toxicological concern provide practical methods for addressing the threshold problem.

Key words: experimental design, threshold, genotoxicity

Introduction

The objective of this paper is to provide an overview of the statistical and experimental design issues involved in the design and interpretation of studies to identify thresholds associated with exposure to genotoxic agents.

It has become an axiom that genotoxic chemicals induce DNA damage at any level of exposure and do not have a threshold in their dose-response relationships. Madle et al. (1), for instance, stated that "...it is generally agreed that there are no thresholds for genotoxic effects of chemicals, i.e., that there are no doses without genotoxic effects." This concept is the basis of risk assessment strategies for chemicals with genotoxic potential. These consider that genotoxic carcinogens do not have a threshold while those that act by non-genotoxic mechanisms may have a threshold (the default assumption in other areas of toxicology). Chemicals with thresholds are regulated via limit values such as ADIs (Allowable Daily Intake) or TDIs (Tolerable Daily Intake) while non-thresholded chemicals are regulated using concepts such as ALARP (As Low As Reasonably Practical) or ALARA (As Low As Reasonably Achievable). (For recent reviews, see (2,3)).

Genetic damage can be gene mutation, chromosome damage (clastogenicity) and chromosome loss (aneuogenicity) and is detected by batteries of short-term mutagenicity tests. Increasingly, some aneuogens, based upon their mechanism of action (MOA), are considered to have thresholds but, in the absence of evidence to the contrary, gene mutagens are not. Chemicals, however, may have a number of MOAs including both direct and indirect action on the DNA.

Many of the ideas relating to low dose modelling such as the Linear Non-Threshold (LNT) dose-response model derive from radiation biology (4,5). The concepts remain contentious and there continues to be an active debate in the field (5,6). Gene mutations are assumed to have linear kinetics because they arise from single (one-hit) events with the dose-response relationship consequently being linear at low doses. Chromosomal damage may result from two or more hits (such as a chromosomal break followed by a rearrangement) with a linear-quadratic (non-linear or curved) dose-response relationship. It is assumed that there is a non-zero, although small, chance that a single 'hit' of radiation

The Japanese Environmental Mutagen Society

will damage genetic material which then results in a mutagenic event leading ultimately to a cancer. This translates into the concept of 'no safe dose of radiation' and subsequently into the 'one molecule can cause cancer' concept applied to chemicals (7).

In practice, descriptions of what occurs at very low doses, whether in terms of molecular or statistical models, are theoretical and may only partially reflect what is actually happening. Even at very low doses millions of molecules of a compound may be involved. Genomic DNA is exposed to continual 'attack' by endogenous mutagens (such as reactive oxygen species) which may result in some 'spontaneous' or 'background' genetic damage. This can provide a reference level for the assessment of the potential 'added' risk that might arise from a low level of exposure leading, perhaps, to a 'de minimis' dose based upon some change compared with the spontaneous level of damage.

Different types of genetic damage may show different types of dose-response relationships. Biomarkers of exposure like DNA adducts appear to show linear responses in low-dose studies using accelerator mass spectrometry and other mass spectrometry methods while biomarkers of effect such as gene mutations may show non-linear responses because of defence mechanisms such as DNA error repair and detoxification. The dose-response relationships for adducts and gene mutations are not parallel and the relationship between the two markers may be complex (8,9).

Methods exist for comparing the induction of chromosome loss and non-disjunction in combination studies which allow the exploration of effects at low doses (10). Experimental evidence of thresholds has been provided for a number of chemicals including spindle poisons and topoisomerase II inhibitors (11–13). Thresholds may exist for aneuogens because they act through the disruption to the protein structure making up the spindle such as by binding to tubulin. Thresholds may arise because a critical number of target sites must be affected before the effect occurs and because there is some redundancy in the target. The evidence that alkylating genotoxins may have thresholds has recently been reviewed (14).

Various terms have been suggested to qualify the term threshold. Kirsch-Volders et al. (15) describe "absolute", "real or biological", "apparent" and "statistical" thresholds. Lovell (16) has argued for 'practical' or 'pragmatic' thresholds while Hengstler et al. refer to "perfect" and "practical" thresholds (17). Jenkins et al. (14) noted that terms like absolute, biological, apparent, acceptable, statistical, NOEL, real, alleged, were used to qualify the term.

The ICPEMC (International Commission for Protection against Environmental Mutagens and Carcinogens) stated that "A threshold dose-response relationship is one in which a range of sub-critical doses is incapable of producing the specified response; as dose increases, the minimal dose that can elicit the response is the threshold dose" (18). ECETOC (The European Centre for Ecotoxicology and Toxicology of Chemicals) defined an 'absolute' threshold as "... a concentration below which a cell would not 'notice' the presence of the chemical. In other words, the chemical is present but does not interact with the cellular target" (15).

Thresholds are well known in pharmacology for many receptor-activation-dependent processes and with non-carcinogenic endpoints because of detoxification and error repair mechanisms. There are also proponents of the concept of U- or J-shaped curves where high doses are toxic while low doses are protective. This concept of biphasic responses or hormesis has been reviewed by Calabrese and co-workers (19,20). Examples include adaptive responses where it is thought that a low priming dose may reduce the effect of a second higher dose. Effects such as 'bystander effects', where a biological effect occurs not in the cell that has been 'hit' but in one in close proximity, and 'genomic instability' have been suggested as explanations at the mechanistic/cellular level of possible threshold effects (reviewed by Preston (21)).

Pragmatic attempts to move forward from the no safe dose/no threshold default assumption for genotoxic chemicals have included the use of concepts such as de minimis (from the phrases de minimis non curat praetor or de minimis non curat lex, taken to mean that the law is not interested in trivial matters) and the virtually safe dose (VSD) often associated with it and the concept of the threshold of toxicological concern (see later).

Linear and Non-linear

The terms linear and non-linear can cause some confusion. In the context of graphical presentation of dose-response, linear is equated with a 'straight line' relationship where a change is directly proportional to the exposure. The slope, in effect, is the regression coefficient, which represents the change in the dependent variable for each unit change in the independent variable. Non-linear is often used to refer to a curved relationship where the relationship between exposure and effect is more complex. A simple definition of non-linear is where the effect is disproportionate to the cause. Non-linear relationships can take many forms. It is important to appreciate that the term non-linear is not synonymous with a threshold relationship.

Dose-response relationships can be referred to as linear, sub-linear (or convex as the relationship curves below the linear) or supra-linear (concave as it curves above the linear) (Fig. 1). The sigmoid curve is a combination of both. Supra-linear may indicate decreased toxic effects or saturation at higher doses; sub-linear


may indicate repair or deactivation at low doses. Lutz et al. proposed statistical tests for sub-linearity at low doses (22).

Fig. 1. Illustration of linear, sub-linear and supra-linear curves.

Linear and some non-linear relationships are monotonic. A monotonic relationship is where the responses at higher dose levels are always equal to or greater than the responses at lower dose levels. A monotonic relationship can be a prerequisite for some model fitting programs/software.

In the context of statistical models there is a distinction between linear and non-linear regressions. Non-linear refers to a model where the parameter values need to be fitted by iteration. Non-linear regression is an iterative approach to fitting a curve through data. It uses iterative (trial and error) methods which require an initial value, a statistical approach (e.g. maximum likelihood) and some measure (e.g. a goodness of fit test) of when the method has converged (or has failed to converge). Such methods are now tractable given the increase in computing power.

Linear regression is the statistical methodology for fitting the best fitting line, defined by its intercept and slope, through a set of points and does not involve iteration. The standard approach minimizes the sum of squares of the distance between the points and the line. The least squares approach generalizes to multiple regression where there are multiple independent variables fitted to the dependent variable.

A statistical linear model is one that can be expressed in a linear form. An example is the General Linear Model (GLM) which links a series of apparently unconnected statistical methods. For example, the common two-sample t-test is a special case of the analysis of variance methodology (which links to ANCOVA and MANOVA) which, in turn, is related to linear and multiple regression methods through the GLM. The GLM is a special case of generalized linear models (GZM) (23), which extend the type of data that can be analysed using linear models by using link functions (forms of transformations) which are related to the underlying distribution of the data.

Transformations

A transformation is a process for preparing data for analysis. Examples include the use of the log dose or converting a response to the change from control or baseline. Transformations aim to simplify the mathematics, ensure that the underlying assumptions are met, allow linear modelling, stabilize the variance and linearize the relationship for presentation of data as straight lines, which is more convenient for interpretation.

The apparent shape of the dose response relationship depends upon how the data are graphed. Visual inspection (or 'ocular regression' (24)) of the dose-response relationship can be misleading, and whether a threshold is present cannot be determined just by graphing the data. It is, therefore, important to check how the data are actually presented. Beware of 'optical illusions' as the pattern will change using raw dose, log dose and extended log dose. Note particularly any discontinuities in the dose axis. Many graphs produced by Excel have equally spaced points on the X axis independent of the actual dose. This can create a graph which is based on neither the original nor the log transformed dose metric.

Transformation (by changing the scale) of either the X or Y axes can change a straight line into a curved line or vice versa. Lutz et al. (25) noted that "Logarithmic representation of the dose axis transforms a straight line into a sublinear (up-bent) curve, which can be misinterpreted to indicate a threshold." Fig. 2 shows examples of a dose-response relationship plotted when one or both scales are converted to log scales: raw/raw plot, log/raw plot, raw/log plot and log/log plot.

Fig. 2. Illustration of data plotted on different dose metrics: a) untransformed responses v. untransformed doses b) untransformed responses v. log 10 transformed doses c) log 10 transformed responses v. untransformed doses d) log 10 transformed responses v. log 10 transformed doses. Hypothetical data to illustrate how transforming the axis of graphs can change the shape of the dose response relationship.

Problems can arise over the presentation of zero on the graph as the log of 0 cannot be calculated. Statistical analysis of trend tests using log doses needs a substitute value for the zero dose level to carry out an analysis. The software package GraphPad, for instance, suggests using a value about 2 log units below the lowest "real" X value on a log X scale. The presentation of data using dose metrics and its effect on low dose extrapolation has resulted in considerable debate between Waddell and others (26–36).

Slob discussed the concept of a practical threshold and argued that an attempt to identify one is not possible (37). He stressed that trying to show experimental evidence of a threshold was impossible. He showed how the GST-P positive foci data of Wanibuchi et al. (38) on MeIQx could be presented in 4 different ways. One relationship was thresholded, one linear, another supralinear and the fourth sublinear. He noted that using the alternative presentations provided no evidence of a threshold and that the interpretation of the data would be changed depending upon which was used. His preferred or 'best way' of presentation of the data was on a double log scale plot, arguing that the log scale for the response may be more appropriate because it represented a relative change compared with an absolute change. He stressed the benefit of the benchmark dose approach (see later) compared to the pair-wise statistical testing NOAEL approach.

Curve Fitting and Modelling

A range of methods have been developed for fitting curves to data. Complex curve fits can be achieved by approaches including polynomials, splines and Lowess regression curves (39,40,41). An important point about approaches such as the use of polynomials is that the predicted response values can become negative outside the experimental data range, producing unrealistic extrapolations. A number of statistical packages including GraphPad and Statistica have the capability to fit many different curves to data.

The threshold or 'Hockey stick' model is an example of a piece-wise linear regression (42). Specific models to detect discontinuities, inhomogeneities, change points or inflections in a relationship such as a time-series are of interest in areas such as climate forecasting and financial markets. Identifying such discontinuities is not simple.

Much of modern statistical work is centred on modelling. This is a continuous process of building, refining, testing and modifying the models. Box (43), reviewing the contributions of RA Fisher, the geneticist and statistician, to modelling as part of a description of the philosophy for statistical modelling, used the phrase "All models are wrong", which he subsequently modified to "All models are wrong (but) some are useful" (44).

The objective of model building is to explain a relationship and then use this to make predictions. Approaches such as multiple regression include Forward, Reverse, Stepwise and Best-subset approaches (45). Using the Forward approach parameters are added to the model until there is no improvement in the model fit; the Reverse approach starts with a full model and terms are dropped until the fit is reduced. The Stepwise is a combination of the two approaches. These different methods can result in different models. In general, the more parameters the better the fit, leading eventually (if enough parameters can be fitted) to a perfect fit. One of the potential problems of model fitting is 'over fitting' where the model provides a good fit to the observed data but provides poor predictivity (45).

Many of the model fitting methods have a goodness of fit measure such as a P value which indicates the degree by which the predicted values differ from the observed values. Small P values usually indicate a poor fit to the data. P values of less than 0.10, rather than the

usual 0.05 or 0.01, have been suggested in the benchmark dose modelling software as the critical value for defining a poor fit. P values >0.10 do not imply a correct model, only that the results from the model are consistent with the observed data. Many models may give adequate fits to the observed data. The goodness of fit tests are not suitable for distinguishing between different models and the size of the P value provides no evidence for distinguishing between models.

Formal testing is only possible if the competing models form a family of models (nested or hierarchical models). In the case of models where the same method of fitting is used, likelihood ratio tests and Akaike's information criterion (AIC) can be used to test whether the addition or subtraction of a parameter improves the fit of the model. The tests compensate for the increased fit associated with increasing terms in the model. Although useful, the methods should be used with care. If the models are not from the same family, such as, for instance, the probit and logistic models, the goodness of fit tests cannot be used.

A key aspect of model fitting is the need to check, using a set of 'diagnostic' tools, that the assumptions underlying the model, such as independence of the data, homogeneity of variances and normality of residuals, are met. Graphical methods should be used in conjunction with the statistical tests. Judgement and experience are needed in the modelling process: for example, the choice of whether to include or exclude data from the high doses in the model or to drop results of a dose to achieve monotonicity. The top dose may be influential in determining the best fitting model but if the fit to the data at the top dose is less good than at lower doses then the model may be inappropriate. Ideally there should be results for doses near the likely benchmark dose to provide more accurate estimates. Model uncertainty should be addressed. In practice a number of models with a similar number of parameters can give satisfactory fits to the same data set. In the case of the benchmark dose approach the range of estimates obtained from all the acceptable models can be presented.

Modellers often work to the principle of parsimony or Occam's razor: "the simpler the explanation, the better". The overall objective should be to find a model which provides a satisfactory description of the dose-response data using the minimum number of parameters. So if the linear model is not significantly worse than the threshold or quadratic one then it could be argued that the simpler, more parsimonious linear model should be used. However, the key message is that ultimately any model must not just provide a good fit but also good prediction when tested with new data.

In the context of prediction a distinction is sometimes made between the terms interpolation and extrapolation. Interpolation refers to the estimation of effects within the range of doses measured (between a pair of known data points); extrapolation relates to estimation outside the range. In general, it is usually assumed that interpolation should be reasonably reliable but that extrapolation (which also has a wider qualitative interpretation) outside the observed range needs to be carried out with care as there is a risk of serious error. An example is how the predictions from polynomials can become negative outside the observed range and differ greatly from the 'target curve'. There is a debate on whether estimation between a low dose and the negative control or spontaneous level is interpolation or extrapolation.

Thresholds and the NOEL/NOAEL

The most important point in this section is that identifying a No Observable Effect Level (NOEL) or No Observable Adverse Effect Level (NOAEL) using a statistical approach does not mean that a threshold exists. The NOEL (and NOAEL) is widely used in the assessment of non-cancer endpoints. However, the NOEL should not be equated with a threshold dose below which effects do not occur. The NOEL/NOAEL depends upon specific characteristics of the experimental design: sample size, statistical test, dose spacing. The limitations of the NOEL are well known (46) and have, in part, resulted in interest in the development of the benchmark dose methodology (below).

Determination of the NOEL is often based upon a hypothesis test. A statistical test based upon a test of a null hypothesis is able to show a positive effect (a statistically significant increase) but not to show a negative effect, i.e. that the null hypothesis is true. Failure to find statistical significance shows only that there is not sufficient evidence to reject the null hypothesis. It is a serious error to equate a non-significant result with a lack of effect.

The danger of equating statistical significance with a threshold is increased by the use of multiple comparison methods (such as the Bonferroni correction) in a dose-response study to adjust the significance levels reported. These methods are aimed at avoiding type I errors (falsely declaring a result as significant when it is not) when making many statistical comparisons. In practice, the approach 'dampens' down the significance levels and effectively increases the dose at which the NOEL is identified. The choice of a critical value associated with a P value of 0.05 is also arbitrary and probability levels of 0.01 or lower are sometimes used as the cut-offs to designate effects as significant. Combining multiple comparison levels with more stringent significance levels can change what is considered the NOEL appreciably, especially if there are many dose groups in a dose-response experiment, as the Bonferroni correction will become more severe as it is dependent upon the number of groups in an experiment.
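The effect of a Bonferroni adjustment on the dose identified as the NOEL can be sketched numerically. The Python example below is a minimal sketch under stated assumptions, not a method taken from the paper: normally distributed responses, equal group sizes, a normal approximation, and pair-wise comparisons of each dose group with the control. The function name detectable_effect is hypothetical. It computes the smallest difference from control, in SD units, that a pair-wise comparison could detect with 80% power as the number of dose groups being compared increases.

```python
import math
from statistics import NormalDist


def detectable_effect(n_per_group, n_comparisons, alpha=0.05, power=0.80):
    """Smallest difference (in SD units) detectable in one pair-wise comparison.

    alpha is split equally across n_comparisons (Bonferroni correction).
    A normal approximation is used, which is an assumption of this sketch.
    """
    z = NormalDist()                         # standard normal distribution
    alpha_adj = alpha / n_comparisons        # Bonferroni-adjusted level
    z_alpha = z.inv_cdf(1 - alpha_adj / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)               # quantile for the required power
    return (z_alpha + z_power) * math.sqrt(2 / n_per_group)


if __name__ == "__main__":
    # Ten animals per group; 1, 3, 5 or 10 dose groups compared with control.
    for k in (1, 3, 5, 10):
        print(k, round(detectable_effect(10, k), 2))
```

With more dose groups a larger effect is needed to reach significance, so under these assumptions the highest dose declared non-significant tends to rise even though the underlying dose-response relationship is unchanged.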


Background/Control Incidence

The size of the spontaneous or background level of a biomarker has implications for the statistical power of the study (especially for qualitative endpoints). Effects can be either absolute or relative changes. The absolute size of a fold-change, obviously, depends upon the control incidence: with control values of 2 or 20 units, a doubling is an increase of 2 or 20 units respectively.

The argument has been made that low-dose linearity will result if just some of the cancers caused by a chemical and those in the control group were indistinguishable because they were a consequence of the chemical-induced damage being produced by the same mechanism as that which produced the spontaneous or 'background' cancers (47). The argument has been generalized further by Crawford and Wilson (48) who suggested that low dose linearity is generalizable to a much wider set of non-cancer outcomes.

Another argument has been that while there may be a threshold at the individual animal level there will be a distribution of tolerances so that some individuals may not have a threshold. Lutz (49), for instance, argued that genetic heterogeneity in a population means that at least some members of a population will not have a threshold for a toxic agent. He argued that this breaks down the distinction between both carcinogenic and non-carcinogenic agents and between genotoxic and non-genotoxic carcinogens.

Experimental and Statistical Issues in Designs at Low Doses and Curve Fitting

Design Issues

The standard regulatory tests are designed for hazard identification where the objective is to produce a qualitative (positive or negative) result rather than risk estimation. Standard statistical methods such as pair-wise comparisons or trend tests such as for linear regression or the Cochran-Armitage trend test are often used. Tests of a linear trend in a dose-response in a design are statistically more powerful than pair-wise comparisons because of the natural or inherent ordering imposed on it by the experimenter but need a more specific null hypothesis.

The results will depend upon the choice of the dose metric. This may be the applied dose (mg/kg), the log dose (which needs a non-zero dose for the control group to circumvent the problem of taking the logarithm of zero) or the target dose (i.e. the dose delivered to the target tissue, related to the pharmacokinetics of the substance). Some care may be needed in the interpretation if the linear and/or the quadratic components in the analysis are significant.

An alternative approach is to estimate the size of an effect. As discussed above, while experiments cannot be designed to show that a threshold exists (i.e. to prove the null hypothesis of no difference), the alternative statistical approach of estimating the size of effect can produce a bound on the size of effect that could be detected by a particular design. The size of effect could then be put into perspective with both the background level and the degree of variability around this measure. In this context some concept of effects which may be real but are small enough to treat as unimportant can lead to a de minimis approach. The benchmark dose approach (discussed later) is one such approach.

Identifying confidence intervals of a specific width around the dose-response relationship requires an appropriate and careful design. Biomarkers such as genotoxic endpoints are potentially sensitive to producing artefacts such as can arise through non-randomization. Therefore, if the objective is to investigate low level effects then careful experimental design is critical.

The traditional experimental methods of randomization, replication, and local control (originally developed by Fisher) remain important in controlling variability. Fisher's concept of the factorial design, which has been extended into Design of Experiment (DOE) methodology, is a highly efficient and cost-effective approach to the investigation of complex problems and is appreciably more efficient than the use of the traditional One Factor at a Time (OFAT) approach (50).

The more experimental/dose groups, the more precise will be the description of the dose-response relationship with multiple dose levels in the area of interest (16). Using 6 or more dose groups, with perhaps fewer than the typical 5–6 animals per group, will enhance the precision with which the overall dose-response curve can be estimated (51).

An assumption of most standard statistical tests (t-test, ANOVA, chi-square, Fisher's exact test) is that the experimental units are independent. The concept of the experimental unit is fundamental to the statistical analysis of designed experiments. It is the unit in the experiment that is randomly assigned to the treatment. Mis-specification of the experimental unit can lead to serious misinterpretation of the statistical analysis.

It is important to appreciate that while the individual cell may be the smallest unit which can be measured, cells from the same animal or culture are not randomly assigned to the treatment but rather receive the same treatment and are likely to show some degree of correlation in their responses. A failure to appreciate this can lead to errors in analysis and interpretation. In particular, failure to take into account hidden levels of variation can lead to serious false positives.

The design of studies for characterising effects (estimation) is different from that for hazard identification, with more doses required and more resources in the area of interest, but there are implications in attempting to power the study to show no or small genotoxic effects.
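The resource implications of powering a study for small effects can be sketched with the standard normal-approximation formula for comparing two group means, in which the number per group is approximately 2(z_alpha + z_power)^2 / d^2 and d is the difference to be detected in SD units. The Python sketch below is illustrative only: the function name n_per_group is hypothetical, and because the normal approximation ignores small-sample t corrections its results may differ slightly from tabulated values.

```python
import math
from statistics import NormalDist


def n_per_group(effect_sd, power=0.80, alpha=0.05):
    """Approximate sample size per group for a two-sided, two-group comparison.

    effect_sd is the difference to detect in SD units. The calculation uses a
    normal approximation (an assumption of this sketch), so small-sample
    t corrections are ignored.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the two-sided test
    z_power = z.inv_cdf(power)            # quantile matching the required power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_sd ** 2)


if __name__ == "__main__":
    # Print effect size, then group size at 80% and at 90% power.
    for d in (1.0, 0.5, 0.25, 0.125):
        print(d, n_per_group(d), n_per_group(d, power=0.90))
```

Because the effect size enters the formula as a square, each halving of the effect to be detected multiplies the required group size by about four.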


Table 1. Sample size per group associated with 80% and 90% power for detecting an effect in standard deviation (SD) units in a two-sided test at P=0.05

              Power
SD units    80%      90%
0.0625      4020*    5381
0.1         1571     2103
0.125       1006     1346
0.2          394      527
0.25         253      338
0.3          176      235
0.4          100      133
0.5           64       86
0.6           45       60
0.8           26       34
1             17       23
1.25          12       15
1.5            9       11
2              6        7
2.5            4        5
3              4        4
4              3        3

*Bold numbers illustrate the approximate four-fold increase in sample size with each halving of the effect size.

Table 1 shows the size of effect that can be detected for a given power. A rule of thumb is that for 80% power for a 2-sided test at P=0.05 a 4-fold increase in sample sizes is needed for each halving of effect size. For an effect equal to one standard deviation (SD) the group sizes are approximately 16 for 80% power; for a 0.5 SD unit difference the group sizes need to be about 64. Historical data on the inter-experimental unit variability can be used to provide estimates of the size of a standard deviation unit.

The level of effect detectable in an experiment is illustrated using GST-P positive foci data from Table 3 in Murai et al. (52). The negative control mean is 22.9 foci with approximate inter-individual within group standard deviations of about 5 foci. Experiments with sample sizes of 50 would have 80% power to detect a difference from the control level of about 0.57 SD units or about 2.8 foci (in a pair-wise comparison using a two-sided test at P=0.05).

If the determination of an effect were based solely on whether there is a significant difference between two concentrations then there might be different NOAELs for each biomarker because each biomarker would have different sensitivities or 'resolutions' based upon their 'intrinsic' variability. Studies of an equivalent size with a qualitative endpoint such as whether a tumour is present or absent will have lower power: a fact long appreciated in the interpretation of cancer bioassay data.

Example of a Test System to Explore Thresholds

The development of flow cytometry for the analysis of micronuclei (both in vitro and in vivo) is a potentially powerful tool for exploring thresholds. Automated scoring procedures allow previously unattainable numbers of cells to be readily scored (53). It is now possible to 'score' over a million cells per animal. Such an approach can appear to be very powerful. Tourus et al. (53) suggested that by scoring up to 3 million cells from a single sample it is possible to detect a difference between 0.10% and 0.11% (based upon simulation experiments using samples with 'spiked malarial micronuclei') using Fisher's exact test. (It is important to note, though, that Fisher's exact test assumes independence of the measures.) Such large potential sample sizes mean that, while measuring more cells from an experiment makes the estimate for that specific animal or culture more precise, apparently significant differences between dose levels could be a consequence of artifacts.

Asano et al. (54) in an in vivo experiment analyzed micronuclei for 1-β-D-arabinofuranosylcytosine using flow cytometry. Analysis was carried out on 1M, 200K, 20K and 2000 cells per animal using automatic and manual scoring. Scores for individual animals were reported by Asano et al. The results are presented as parallel curves using log log scales and show apparently increased variability between animals at 2K cells.

Fig. 3. Plot of mean of groups from Table II of Asano et al. (54) plotted against mg/kg dose and against dose on log 10 dose scale. (As doses were logarithmically spaced on the mg/kg dose scale the zero dose was represented by a dose level an equivalent spacing below the lowest dose on the log dose scale and the log dose scale aligned with the mg/kg dose scale.)

Graphical presentation of the data shows the different shaped dose-response curves (Fig. 3). Using the untransformed dose the response is supra-linear while the curve is sub-linear against the logarithm of the dose, illustrating that care is needed with visual inspection of

145 David P. Lovell

Table 2. Analysis of table of orthogonal breakdown of in- no evidence of a threshold. dividual animal scores from raw data provided in Table II of Asano et al.(54) Benchmark Dose Approach a) Using original dose metric (mg/kg) Benchmark dose methodology is an alternative ap- Source df MS F P proach to the NOEL/NOAEL (56). The Benchmark Dose (BMD) is the dose associated with a pre-deˆned Between groups 5 0.019 5.58 0.002 Linear 1 0.091 27.36 º0.001 biologically signiˆcant/important diŠerence, the Ben- Quadratic 1 0.000 0.01 0.97 chmark Response (BMR), in an endpoint of interest Cubic 1 0.001 0.42 0.91 compared with the negative control response. The BMR Deviations 2 0.000 0.06 0.95 could be a 10z change in average adult body weight or Within groups 24 0.003 a doubling of a liver enzyme level. The BMD is derived Total 29 0.173 by ˆtting a dose-response model to the experimental data. The BMD related to a change in response equal to b) Using log dose one standard deviation from the control mean is also es- Source df MS F P timatedtohelpwiththecomparisons. The BMDL is the lower one-sided 95z conˆdence Between groups 5 0.019 5.58 0.002 limit or bound on the BMD. Using the BMDL is a way Linear 1 0.068 20.37 º0.001 Quadratic 1 0.016 4.91 0.036 of expressing that there is 95z conˆdence that the true Cubic 1 0.007 2.12 0.16 eŠect at this dose would be less than the eŠect associated Deviations 2 0.002 0.24 0.79 with the BMDL. The BMDs and BMDLs from the Within groups 24 0.003 models are compared for similarity and consistency. Total 29 0.173 The chosen BMDL is used as the Reference Point (RP) from which low dose extrapolation begins in order to ˆnd a reference dose (RfD) such as an ADI. Table 3. Summary of comparisons between negative control and The NOEL/NOAEL is one example of an RP or dose groups from Asano et al. 
(54) Point of Departure (PoD) used to identify an ADI using safety or uncertainly factors (SF/UF).A number of or- Cells Mice Dose ganization such as EFSA and JECFA have proposed the (mg/kg BW) 2K 20K 200K 1M 1M use of BMD as the RP for the calculation of the Margin 0.06 of Exposure (MOE) of genotoxic carcinogens. 0.19 ** 0.60 ** ** Thresholds of Toxicological Concern ** ** ** 1.89 A pragmatic approach to the problem of no safe level 6.00 ** ** ** ** of a genotoxic agent is the concept of the threshold of *Pº0.05 toxicological concern (TTC). The TTC is an intake by **Pº0.01 humans that would be associated with a high probability Data from Table I of Asano et al. (54) of frequencies of micronucleat- of negligible risk. The TTC was originally developed as ed reticulocytes of various doses of 1-b-D-arabinofuranosylcytosine. Statistical comparisons using the cell as the carried out a method for regulating food contact materials as a con- using Fisher's exact test, comparison using the animal as the statistical centration level below which there would be no appreci- unit carried out using Student's t-test. able risk to human health irrespective of whether or not there are chemical-speciˆc toxicity data (reviewed by Kroes et al., (57)). Based upon studies on a database of graphs. carcinogenic chemicals a daily exposure of less that 1.5 In both cases in the analysis there is a signiˆcant mg/day was considered a virtually safe dose in that in a linear component to the dose-response with also a sig- worst case scenario this corresponded to a 10-6 lifetime niˆcant quadratic component for the log dose but this is risk. To accommodate possible higher risk chemicals an not signiˆcant for the mg/kg dose (Table 2). Statistical exposure level of 0.15 mg/day was proposed (58). 
analysis of the pair-wise comparisons using the cell as The underlying approach is based upon a statistical the unit showed signiˆcant eŠects at lower doses but analysis of a database of the carcinogenic potencies of a with the animals as the experimental units only at the large number of chemicals calculated on the basis of a top dose because of appreciable between animal linear extrapolation from the dose (TD50) estimated to variability (using hierarchical/nested analyses) (Table give a 50z tumour incidence. The probability distribu- 3). Abramsson-Zetterberg (55) also used ‰ow cytometry tion of the potencies was used to identify a threshold in two experiments with a large number of dose groups concentration which had a high probability of negligible of acrylamide which showed a linear dose response with risk (59, 60). The threshold is thus a pragmatic concen-

146 Design and Statistical Analysis of Threshold Experiments trationtohelpwithregulationratherthanindicatingab- 7 Efron E. The apocalyptics. New York: Simon and solute certainty of no risk below the concentration. Schuster; 1984. The European Medicines Agency (EMEA) (61) has 8 Perera EP. The signiˆcance of DNA and protein adducts proposed that concentrations of less than 1.5 mg/day in human biomonitoring studies. Mutat Res. 1988; 205: (corresponding to a 10-5 lifetime risk) would be accept- 255–69. able for genotoxic impurities and contaminants in phar- 9 Zito R. Low doses and thresholds in genotoxicity: from theories to experiments. J Exp Clin Cancer Res. 2001; 20: maceuticals where there is a risk:beneˆt consideration. 315–25. Humphrey has argued for a more ‰exible approach than 10 Parry JM, and E.M. Parry EM. The use of the in vitro a standard ˆgure (62). micronucleus assay to detect and assess the aneugenic ac- tivity of chemicals. Mutat Res. 2006; 607: 5–8. Conclusions 11 Elhajouji A, Van Hummelen P, Kirsch-Volders M. Indi- A wide range of curves and mathematical models can cations for a threshold of chemically-induced aneuploidy be ˆtted to experimental data but cannot prove the exis- in vitro in human lymphocytes. Environ Mol Mutagen. tence of an absolute threshold in a dose-response 1995; 26: 292–304. relationship. Equating a threshold with the identiˆca- 12 Elhajouji A, Tibaldi F, Kirsch-Volders M. Indication for tion of a NOEL/NOAEL based upon statistical sig- thresholds of chromosome non-disjunction versus chro- mosome lagging induced by spindle inhibitors in vitro in niˆcance using a hypothesis testing has serious limita- human lymphocytes. Mutagenesis. 1997; 12: 133–40. tions. Statistical methods based upon the estimation of 13 Bentley KS, Kirkland D, Murphy M, Marshall R. 
Evalua- the size of a response together with the associated conˆ- tion of thresholds for benomyl- and carbendazim-induced dence intervals can provide estimates of doses where aneuploidy in cultured human lymphocytes using ‰uores- there is high conˆdence of negligible risk. Approaches cence in situ hybridization. Mutat Res. 2000; 464: 41–51. based upon the benchmark dose methodology and 14 Jenkins GJS, Doak SH, Johnson GE, Quick E, Waters threshold of toxicological concern have the potential to EM, Parry JM. Do dose response thresholds exist for be used in this way. Future advances in the ˆeld may in- genotoxic alkylating agents? Mutagenesis 2005; 20: clude the development of more sophisticated mathemat- 389–98. ical models which include consideration of DNA repair 15 Kirsch-Volders M, Aardema M Elhajouji A. Concepts of and metabolic detoxiˆcation (63) and the use of mul- threshold in mutagenesis and carcinogenesis. Mutat Res. 2000; 464: 3–11. tivariate methods in association with -omics technol- 16 Lovell DP. Dose-response and threshold-mediated ogies to investigate the pattern of responses in low dose mechanisms in mutagenesis: statistical models and study studies. design. Mutat Res. 2000; 464: 87–95. 17 Hengstler JG, BogdanŠy MS, Bolt MH, Oesch F. References Challenging dogma: thresholds for genotoxic carcino- 1 Madle S, von der Hude W, Broschinski L, J äanig GR. gens? The case of vinyl acetate. Annu Rev Pharmacol Threshold eŠects in genetic toxicity: perspective of chemi- Toxicol. 2003; 43: 485–520. cals regulation in Germany, Mutat Res. 2000; 464: 18 Ehling DAUH, Cerutti PA, Friedman J, Greim H, Kol- 117–21. bye AC Jr, Mendelsohn ML. Review of the evidence for 2 Barlow S, Renwick AG, Kleiner J, Bridges JW, Busk L, the presence or absence of thresholds in the induction of Dybing E, Edler L, Eisenbrand G, Fink-Gremmels J, genetic eŠects by genotoxic chemicals. Mutat Res. 1983; Knaap A, Kroes R, Liem D, M äuller DJG, Page S, Rol- 123: 281–341. 
land V, Schlatter J, Tritscher A, Tueting W, W äurtzen G. 19 Calabrese EJ, Baldwin LA. Hormesis: U-shaped dose Risk assessment of substances that are both genotoxic responses and their centrality in toxicology. Trends Phar- and carcinogenic: Report of an International Conference macol Sci. 2001; 22: 285–91. organized by EFSA and WHO with support of ILSI Eu- 20 Calabrese EJ, Blain R. The occurrence of hormetic dose rope. Fd Chem Toxicol. 2006; 44: 1636–50. responses in the toxicological literature, the hormesis 3 O'Brien J, Renwick AG, Constable A, Dybing E, Muller database: an overview. Toxicol Appl Pharmacol. 2005; DJ,SchlatterJ,SlobW,TuetingW,vanBenthemJ,Wil- 202: 289–301. liams GM, Wolfreys A. Approaches to the risk assess- 21 Preston RJ. Bystander eŠects, genomic instability, adap- ment of genotoxic carcinogens in food: A critical ap- tive response, and cancer risk assessment for radiation praisal. Fd Chem Toxicol. 2006; 44: 1613–35. and chemical exposures. Toxicol Applied Pharmacol. 4 Coggle JE. Biological eŠects of radiation. London: Tay- 2005; 207: 550–6. lor and Francis; 1983. 22 Lutz RW, Stahel WA, Lutz WK. Statistical procedures to 5 ICRP ICRP Publication 60. 1990 Recommendations of test for linearity and estimate threshold doses for tumor the International Commission on Radiological Protec- induction with nonlinear dose-response relationships in tion, Annals of the ICRP 21. 1991; 201. bioassays for carcinogenicity. Regul Toxicol Pharmacol. 6 Wakeford R. The risk to health from exposure to low lev- 2002; 36: 331–7. els of ionising radiation, Annals of the ICRP 35. 2005; v- 23 Nelder J, McCullagh P. Generalized linear models, Lon- vii.

147 David P. Lovell

don: Chapman and Hall; 1989. 44 Box GEP, Draper NR. Empirical model-building and 24 Greenland S. Dose-response and trend analysis in response surfaces. New York: Wiley; 1987. . Epidemiology. 1995; 6: 356–65. 45 Draper NR, Smith H. Applied . 3rd 25 Lutz WK, Gaylor DW, Conolly RB, Lutz RW. Non- ed., New York: Wiley; 1998. linearity and thresholds in dose-response relationships for 46 Crump K. A new method for determining allowable daily carcinogenicity due to variation, logarithmic intakes. Fundam Appl Toxicol. 1984; 4: 854–71. dose scaling, or small diŠerences in individual susceptibil- 47 Crump K, Hoel D, Langley C, R. Peto R. Fundamental ity. Toxicol Appl Pharmacol. 2005; 207: 565–9. carcinogenic processes and their implications to low-dose 26 Waddell WJ. Thresholds of carcinogenicity of ‰avors. risk assessment. Cancer Res. 1976; 36: 2973–9. Toxicol Sci. 2002; 68: 275–9. 48 Crawford M, Wilson R. Low-dose linearity: the rule or 27 Waddel WJ. Comparison of human exposures to selected the exception. Hum Ecol Risk Assess. 1996; 2: 303–50. chemicals with thresholds from NTP carcinogenicity stu- 49 Lutz WK. Susceptibility diŠerences in chemical carcino- dies in rodents. Hum Exp Toxicol. 2003; 22: 501–6. genesis linearize the dose-response relationship: threshold 28 Waddell WJ. Thresholds in chemical carcinogenesis: doses can be deˆned only for individuals. Mutat Res. what are animal experiments telling us? Toxicol Pathol. 2001; 482: 71–6. 2003; 31: 260–2. 50 Montgomery DC. Design and analysis of experiments. 29 Waddell WJ. Threshold for carcinogenicity of N- 6th ed. New York: Wiley; 2004. nitrosodiethylamine for esophageal tumors in rats. Fd 51 Kavlock RJ, Schmid JE, Setzer RW Jr. A simulation Chem Toxicol. 2003; 41: 739–41. study of the in‰uence of study design on the estimation of 30 Waddell WJ. Thresholds of carcinogenicity in the ED01 benchmark doses for developmental toxicity. Risk Anal. study. Toxicol Sci. 2003; 72: 158–63. 1996; 16: 399–410. 31 Waddell WJ. 
Critique of dose response in carcinogenesis. 52 Murai T, Mori S, Kang JS, Morimura K, Wanibuchi H, Hum Exp Toxicol. 2006; 25: 413–36. Totsuka Y, Fukushima S. Evidence of a threshold- 32 Haseman JK. An alternative perspective: a critical evalu- eŠect for 2-Amino-3,8-dimethylimidazo-[4,5-f]quinoxa- ation of the Waddell threshold extrapolation model in line liver carcinogenicity in F344/DuCrj rats. Toxicol chemical carcinogenesis. Toxicol Pathol. 2003; 31: Pathol. 2008; 36: 472–7. 468–70. 53 Torous D, Asano N, Tometsko C, Sugunan S, Dertinger 33 Crump KS, Clewell HJ. Evidence of a ``clear and consis- S, Morita T, Hayashi M. Performance of ‰ow cytometric tent threshold'' for bladder and liver cancer in the large analysis for the micronucleus assay–a reconstruction ED01 carcinogenicity study. Toxicol Sci. 2003; 74: 485–6. model using serial dilutions of malaria-infected cells with 34 Haseman JK. Response to Waddell & Rozman. Toxicol normal mouse peripheral blood. Mutagenesis. 2006; 21: Pathol. 2003; 31: 715–6. 11–3. 35 Gaylor DW. Letter to the editor. Toxicol Pathol. 2003; 54 Asano N, Torous DK, Tometsko CR, Dertinger SD, 31: 572. Morita T, Hayashi H. Practical threshold for 36 Anderson ME, Conolly RB, Gaylor DW. Letter to the e- micronucleated reticulocyte induction observed for low ditor. Toxicol Sci. 2003; 74: 486. doses of mitomycin C, Ara-C and colchicine. Mutagene- 37 Slob W. What is a practical threshold? Toxicol Pathol. sis. 2006; 21 15–20. 2007; 35: 848–9. 55 Abramsson-Zetterberg L. The dose-response relationship 38 WanibuchiH,WeiHM,KarimMR,MorimuraK,DoiK, at very low doses of acrylamide is linear in the ‰ow Kinoshita A, Fukushima S. Existence of no hepatocar- cytometer-based mouse micronucleus assay. Mutat Res. cinogenic eŠect levels of 2-amino-3,8- 2003; 535: 215–22. dimethylimidazo[4,5-f]quinoxaline with or without coad- 56 Filipsson AF, Sand S, Nilsson J, Victorin K. The ben- ministration with ethanol. Toxicol Pathol. 
2006; 34: chmark dose method–review of available models, and 232–6. recommendations for application in health risk assess- 39 Royston P, Altman DG. Approximating statistical func- ment. Crit Rev Toxico. 2003; 33 505–42. tion by using fractional polynomial. The Statistician. 57 Kroes R, Kleiner J, Renwick A. The threshold of toxico- 1997; 46: 411–22. logical concern concept in risk assessment. Toxicol Sci. 40 Goetghebeur EJT, Pocock SJ. Detection and estimation 2005; 86: 226–30. of J-shaped risk-response relationships. J Rad Stat Soc. 58 Kroes R, Renwick AG, Cheeseman M, Kleiner J, et al. 1995; 185A: 107–21. Structure-based thresholds of toxicological concern 41 Silverman BW. Some aspects of the spline smoothing ap- (TTC): Guidance for application to substances present at proach to non-parametric regression curve ˆtting. J Rad low levels in the diet. Fd Chem Toxicol. 2004; 42: 65–83. Stat Soc. 1985; B: 1–52. 59 CheesemanMA,MachugaEJBaileyAB.Atieredap- 42 Gennings C, CarterWH Jr, Carchman RA, Teuschler proach to threshold of regulation. Fd Chem Toxicol. LK,.Simmons JE, Carney EW. A unifying concept for 1999; 37: 387–412. assessing toxicological interactions: changes in slope. 60 Renwick AG. Toxicological databases and the concept of Toxicol Sci. 2005; 88: 287–97. thresholds of toxicological concern as used by the JECFA 43 Box GEP. Science and statistics. J Am Stat Assoc. 1976; for the safety evaluation of ‰avouring agents. Toxicol 71: 791–9. Lett. 2004; 149: 223–4.


61 European Medicines Evaluation Agency, Committee for Medicinal Products for Human Use (CHMP). Guideline on the limits of genotoxic impurities. CPMP/SWP/5199/02 (June 28, 2006) London. (http://www.emea.europa.eu/pdfs/human/swp/519902en.pdf).
62 Humfrey CDN. Recent developments in the risk assessment of potentially genotoxic impurities in pharmaceutical drug substances. Toxicol Sci. 2007; 100: 24–8.
63 Watanabe M. Threshold-like dose-response relationships in a modified linear-no-threshold model: application of experimental data and risk evaluation. Genes Environ. 2008; 30: 17–24.
