
Perception & Psychophysics 2001, 63 (8), 1293-1313

The psychometric function: I. Fitting, sampling, and goodness of fit

FELIX A. WICHMANN and N. JEREMY HILL
University of Oxford, Oxford, England

The psychometric function relates an observer's performance to an independent variable, usually some physical quantity of a stimulus in a psychophysical task. This paper, together with its companion paper (Wichmann & Hill, 2001), describes an integrated approach to (1) fitting psychometric functions, (2) assessing the goodness of fit, and (3) providing confidence intervals for the function's parameters and other estimates derived from them, for the purposes of hypothesis testing. The present paper deals with the first two topics, describing a constrained maximum-likelihood method of parameter estimation and developing several goodness-of-fit tests. Using Monte Carlo simulations, we deal with two specific difficulties that arise when fitting functions to psychophysical data. First, we note that human observers are prone to stimulus-independent errors (or lapses). We show that failure to account for this can lead to serious biases in estimates of the psychometric function's parameters and illustrate how the problem may be overcome. Second, we note that psychophysical data sets are usually rather small by the standards required by most of the commonly applied statistical tests. We demonstrate the potential errors of applying traditional χ² methods to psychophysical data and advocate use of Monte Carlo techniques that do not rely on asymptotic theory. We have made available the software to implement our methods.

The performance of an observer on a psychophysical task is typically summarized by reporting one or more response thresholds—stimulus intensities required to produce a given level of performance—and by characterization of the rate at which performance improves with increasing stimulus intensity. These measures are derived from a psychometric function, which describes the dependence of an observer's performance on some physical aspect of the stimulus: One example might be the relation between the contrast of a visual stimulus and the observer's ability to detect it.

Fitting psychometric functions is a variant of the more general problem of modeling data. Modeling data is a three-step process. First, a model is chosen, and the parameters are adjusted to minimize the appropriate error metric or loss function. Second, error estimates of the parameters are derived, and third, the goodness of fit between the model and the data is assessed. This paper is concerned with the first and the third of these steps, parameter estimation and goodness-of-fit assessment. Our companion paper (Wichmann & Hill, 2001) deals with the second step and illustrates how to derive reliable error estimates on the fitted parameters. Together, the two papers provide an integrated approach to fitting psychometric functions, evaluating goodness of fit, and obtaining confidence intervals for parameters, thresholds, and slopes, avoiding the known sources of potential error.

This paper is divided into two major subsections, fitting psychometric functions and goodness of fit. Each subsection is itself again subdivided into two main parts: first, an introduction to the issue, and second, a set of simulations addressing the issue raised in the respective introduction.

This research was supported by a Wellcome Trust Mathematical Biology Studentship, a Jubilee Scholarship from St. Hugh's College, Oxford, and a Fellowship by Examination from Magdalen College, Oxford, to F.A.W., and by a grant from the Christopher Welch Trust Fund and a Maplethorpe Scholarship from St. Hugh's College, Oxford, to N.J.H. We are indebted to Andrew Derrington, Karl Gegenfurtner, and Bruce Henning for helpful comments and suggestions. This paper benefited considerably from conscientious peer review, and we thank our reviewers David Foster, Stan Klein, Marjorie Leek, and Bernhard Treutwein for helping us improve our manuscript. Software implementing the methods described in this paper is available written in MATLAB; contact F.A.W. at the address provided or see http://users.ox.ac.uk/~sruoxfor/psychofit/. Correspondence concerning this article should be addressed to F. A. Wichmann, Max-Planck-Institut für Biologische Kybernetik, Spemannstraße 38, D-72076 Tübingen, Germany (e-mail: felix@tuebingen.mpg.de).

NOTATION

We adhere mainly to the typographic conventions frequently encountered in statistical texts (Collett, 1991; Dobson, 1990; Efron & Tibshirani, 1993). Variables are denoted by uppercase italic letters, and observed values are denoted by the corresponding lowercase letters—for example, y is a realization of the random variable Y. Greek letters are used for parameters, and a circumflex for estimates; thus, parameter β is estimated by β̂. Vectors are denoted by boldface lowercase letters, and matrices by boldface italic uppercase letters. The ith element of a vector x is denoted by xᵢ. The probability density function of the random variable Y (or the probability distribution if Y is discrete) with θ as the vector of parameters of the distribution is written as p(y; θ). Simulated data sets (replications) are indicated by an asterisk—for example, α̂ᵢ* is the value of α̂ in the ith Monte Carlo simulation. The nth quantile of a distribution x is denoted by x(n).


FITTING PSYCHOMETRIC FUNCTIONS

Background

To determine a threshold, it is common practice to fit a two-parameter function to the data and to compute the inverse of that function for the desired performance level. The slope of the fitted function at a given level of performance serves as a measure of the change in performance with changing stimulus intensity. Statistical estimation of the parameters is routine in data modeling (Dobson, 1990; McCullagh & Nelder, 1989): In the context of fitting psychometric functions, probit analysis (Finney, 1952, 1971) and a maximum-likelihood search method described by Watson (1979) are most commonly employed. Recently, Treutwein and Strasburger (1999) have described a constrained generalized maximum-likelihood procedure that is similar in some respects to the method we advocate in this paper.

In the following, we review the application of maximum-likelihood estimators in fitting psychometric functions and the use of Bayesian priors in constraining the fit according to the assumptions of one's model. In particular, we illustrate how an often disregarded feature of psychophysical data—namely, the fact that observers sometimes make stimulus-independent lapses—can introduce significant biases into the parameter estimates. The adverse effect of nonstationary observer behavior (of which lapses are an example) on maximum-likelihood parameter estimates has been noted previously (Harvey, 1986; Swanson & Birch, 1992; Treutwein, 1995; cf. Treutwein & Strasburger, 1999). We show that the biases depend heavily on the sampling scheme chosen (by which we mean the pattern of stimulus values at which samples are taken) but that they can be corrected, at minimal cost in terms of parameter variability, by the introduction of an additional free but highly constrained parameter determining, in effect, the upper bound of the psychometric function.

The psychometric function. Psychophysical data are taken by sampling an observer's performance on a psychophysical task at a number of different stimulus levels. In the method of constant stimuli, each sample point is taken in the form of a block of experimental trials at the same stimulus level. In this paper, K denotes the number of such blocks or datapoints. A data set can thus be described by three vectors, each of length K: x will denote the stimulus levels or intensities of the blocks, n the number of trials or observations in each block, and y the observer's performance, expressed as a proportion of correct responses (in forced-choice paradigms) or positive responses (in single-interval or yes/no experiments). We will use N to refer to the total number of experimental trials, N = Σᵢ nᵢ.

To model the process underlying experimental data, it is common to assume the number of correct responses yᵢnᵢ in a given block i to be the sum of random samples from a Bernoulli process with a probability of success pᵢ. A model must then provide a psychometric function ψ(x), which specifies the relationship between the underlying probability of a correct (or positive) response p and the stimulus intensity x. A frequently used general form is

ψ(x; α, β, γ, λ) = γ + (1 − γ − λ) F(x; α, β).   (1)

The shape of the curve is determined by the parameters {α, β, γ, λ}, to which we shall refer collectively by using the parameter vector θ, and by the choice of a two-parameter function F, which is typically a sigmoid function, such as the Weibull, logistic, cumulative Gaussian, or Gumbel distribution.¹

The function F is usually chosen to have range [0, 1], [0, 1), or (0, 1). Thus, the parameter γ gives the lower bound of ψ(x; θ), which can be interpreted as the base rate of performance in the absence of a signal. In forced-choice paradigms (n-AFC), this will usually be fixed at the reciprocal of the number of alternatives per trial. In yes/no paradigms, it is often taken as corresponding to the guess rate, which will depend on the observer and experimental conditions. In this paper, we will use examples from only the 2AFC paradigm, and thus assume γ to be fixed at .5. The upper bound of the curve—that is, the performance level for an arbitrarily large stimulus signal—is given by 1 − λ. For yes/no paradigms, λ corresponds to the miss rate, and in n-AFC experiments, it is, similarly, a reflection of the rate at which observers lapse, responding incorrectly regardless of stimulus intensity.² Between the two bounds, the shape of the curve is determined by α and β. The exact meaning of α and β depends on the form of the function chosen for F, but together they will determine two independent attributes of the psychometric function: its displacement along the abscissa and its slope.

We shall assume that F describes the performance of the underlying psychological mechanism of interest. Although it is important to have correct values for γ and λ, the values themselves are of secondary scientific interest, since they arise from the stimulus-independent mechanisms of guessing and lapsing. Therefore, when we refer to the threshold and slope of a psychometric function, we mean the inverse of F at some particular performance level as a measure of displacement and the gradient of F at that point as a measure of slope. Where we do not specify a performance level, the value .5 should be assumed: Thus, threshold refers to F⁻¹(0.5), and slope refers to F′ evaluated at F⁻¹(0.5). In our 2AFC examples, these values will roughly correspond to the stimulus value and slope at the 75% correct point, although the exact predicted performance will be affected slightly by the (small) value of λ.

Maximum-likelihood estimation. Likelihood maximization is a frequently used technique for parameter estimation (Collett, 1991; Dobson, 1990; McCullagh & Nelder, 1989). For our problem, provided that the values of y are assumed to have been generated by Bernoulli processes, it is straightforward to compute a likelihood value for a particular set of parameters θ, given the observed values y. The likelihood function L(θ; y) is the same as the probability function p(y|θ) (i.e., the probability of having obtained data y given hypothetical generating parameters θ)—note, however, the reversal of order in the notation, to stress that once the data have been gathered, y is fixed, and θ is the variable. The maximum-likelihood estimator θ̂ of θ is simply that set of parameters for which the likelihood value is largest: L(θ̂; y) ≥ L(θ; y) for all θ. Since the logarithm is a monotonic function, the log-likelihood function l(θ; y) = log L(θ; y) is also maximized by the same estimator θ̂, and this frequently proves to be easier to maximize numerically. For our situation, it is given by

l(θ; y) = Σᵢ₌₁ᴷ { log (nᵢ choose yᵢnᵢ) + yᵢnᵢ log ψ(xᵢ; θ) + (1 − yᵢ)nᵢ log [1 − ψ(xᵢ; θ)] }.   (2)

In principle, θ̂ can be found by solving for the points at which the derivative of l(θ; y) with respect to all the parameters is zero: This gives a set of local minima and maxima, from which the global maximum of l(θ; y) is selected. For most practical applications, however, θ̂ is determined by iterative methods, maximizing those terms of Equation 2 that depend on θ. Our implementation of log-likelihood maximization uses the multidimensional Nelder–Mead simplex search algorithm,³ a description of which can be found in chapter 10 of Press, Teukolsky, Vetterling, and Flannery (1992).

Bayesian priors. It is sometimes possible that the maximum-likelihood estimate θ̂ contains parameter values that are either nonsensical or inappropriate. For example, it can happen that the best fit to a particular data set has a negative value for λ, which is uninterpretable as a lapse rate and implies that an observer's performance can exceed 100% correct—clearly nonsensical psychologically, even though l(θ; y) may be a real value for the particular stimulus values in the data set. It may also happen that the data are best fit by a parameter set containing a large λ (greater than .06, for example). A large λ is interpreted to mean that the observer makes a large proportion of incorrect responses no matter how great the stimulus intensity—in most normal psychophysical situations, this means that the experiment was not performed properly and that the data are invalid. If the observer genuinely has a lapse rate greater than .06, he or she requires extra encouragement or, possibly, replacement. However, misleadingly large λ values may also be fitted when the observer performs well, but there are no samples at high performance values.

In both cases, it would be better for the fitting algorithm to return parameter vectors that may have a lower log-likelihood than the global maximum but that contain more realistic values. Bayesian priors provide a mechanism for constraining parameters within realistic ranges, based on the experimenter's prior beliefs about the likelihood of particular values. A prior is simply a relative probability distribution W(θ), specified in advance, which weights the likelihood calculation during fitting: The fitting process therefore maximizes W(θ)L(θ; y) or log W(θ) + l(θ; y), instead of the unweighted metrics.

The exact form of W(θ) is to be chosen by the experimenter, given the experimental context. The ideal choice for W(λ) would be the distribution of rates of stimulus-independent error for the current observer on the current task. Generally, however, one has not enough data to estimate this distribution. For the simulations reported in this paper, we chose W(λ) = 1 for 0 ≤ λ ≤ .06; otherwise, W(λ) = 0⁴—that is, we set a limit of .06 on λ and weight smaller values equally with a flat prior.⁵,⁶ For data analysis, we generally do not constrain the other parameters, except to limit them to values for which ψ(x; θ) is real.
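The constrained maximum-likelihood procedure just described can be sketched in a few lines of code. The following is only an illustrative sketch, not the authors' MATLAB implementation: it assumes a Weibull form for F, a 2AFC guess rate γ = .5, and the flat prior W(λ) = 1 on [0, .06] described above; all function and variable names are our own.

```python
# Sketch of constrained maximum-likelihood fitting of Equation 1 (hypothetical names).
import numpy as np
from scipy.optimize import minimize

def weibull(x, alpha, beta):
    """Assumed form for F: F(x; alpha, beta) = 1 - exp(-(x/alpha)**beta)."""
    return 1.0 - np.exp(-(x / alpha) ** beta)

def psi(x, alpha, beta, gamma, lam):
    """Equation 1: psi(x; theta) = gamma + (1 - gamma - lambda) * F(x; alpha, beta)."""
    return gamma + (1.0 - gamma - lam) * weibull(x, alpha, beta)

def neg_penalized_loglik(theta, x, y, n, gamma=0.5, lam_max=0.06):
    """Negative of log W(lambda) + l(theta; y) (Equation 2).  The binomial coefficient
    does not depend on theta, so it is dropped without changing the maximizer."""
    alpha, beta, lam = theta
    if alpha <= 0 or beta <= 0 or lam < 0 or lam > lam_max:
        return np.inf                      # flat prior: W = 0 outside [0, lam_max]
    p = np.clip(psi(x, alpha, beta, gamma, lam), 1e-12, 1 - 1e-12)
    return -np.sum(n * y * np.log(p) + n * (1 - y) * np.log(1 - p))

def fit_psychometric(x, y, n, theta0=None):
    """Maximize the weighted log-likelihood with the Nelder-Mead simplex search."""
    x, y, n = map(np.asarray, (x, y, n))
    if theta0 is None:
        theta0 = (np.median(x), 3.0, 0.01)  # crude but serviceable starting point
    res = minimize(neg_penalized_loglik, theta0, args=(x, y, n), method="Nelder-Mead")
    return res.x                            # (alpha_hat, beta_hat, lambda_hat)
```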
Avoiding bias caused by observers' lapses. In studies in which sigmoid functions are fitted to psychophysical data, particularly where the data come from forced-choice paradigms, it is common for experimenters to fix λ = 0, so that the upper bound of ψ(x; θ) is always 1.0. Thus, it is assumed that observers make no stimulus-independent errors. Unfortunately, maximum-likelihood parameter estimation as described above is extremely sensitive to such stimulus-independent errors, with a consequent bias in threshold and slope estimates (Harvey, 1986; Swanson & Birch, 1992).

Figure 1 illustrates the problem. The dark circles indicate the proportion of correct responses made by an observer in six blocks of trials in a 2AFC visual detection task. Each datapoint represents 50 trials, except for the last one, at stimulus value 3.5, which represents 49 trials: The observer still has one trial to perform to complete the block. If we were to stop here and fit a Weibull function to the data, we would obtain the curve plotted as a dark solid line. Whether or not λ is fixed at 0 during the fit, the maximum-likelihood parameter estimates are the same: {α̂ = 1.573, β̂ = 4.360, λ̂ = 0}. Now suppose that, on the 50th trial of the last block, the observer blinks and misses the stimulus, is consequently forced to guess, and happens to guess wrongly. The new position of the datapoint at stimulus value 3.5 is shown by the light triangle: It has dropped from 1.00 to .98 proportion correct.

The solid light curve shows the results of fitting a two-parameter psychometric function (i.e., allowing α and β to vary, but keeping λ fixed at 0). The new fitted parameters are {α̂ = 2.604, β̂ = 2.191}. Note that the slope of the fitted function has dropped dramatically in the space of one trial—in fact, from a value of 1.045 to 0.560. If we allow λ to vary in our new fit, however, the effect on parameters is slight—{α̂ = 1.543, β̂ = 4.347, λ̂ = .014}—and thus, there is little change in slope: dF̂/dx evaluated at x = F⁻¹(0.5) is 1.062.

The misestimation of parameters shown in Figure 1 is a direct consequence of the binomial log-likelihood error metric, because of its sensitivity to errors at high levels of predicted performance:⁷ as ψ(x; θ) → 1, the third term of Equation 2, (1 − yᵢ)nᵢ log[1 − ψ(xᵢ; θ)], tends to −∞ unless the coefficient (1 − yᵢ)nᵢ is 0.
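To make the contrast between the two regimes concrete, the sketch above can be called once with λ clamped at 0 and once with λ free to vary within [0, .06]. The six data values below are invented for illustration only; they are not the data of Figure 1.

```python
# Hypothetical usage of the fitting sketch above; the data are invented.
import numpy as np
from scipy.optimize import minimize

x = np.array([0.8, 1.2, 1.6, 2.0, 2.6, 3.5])
n = np.full(6, 50)
y = np.array([0.54, 0.60, 0.76, 0.90, 0.96, 0.98])   # lapse-like point in the top block

def fit_lambda_fixed(x, y, n, lam=0.0, ab0=(1.5, 3.0)):
    """Two-parameter fit (alpha, beta) with lambda clamped at a fixed value."""
    obj = lambda ab: neg_penalized_loglik((ab[0], ab[1], lam), x, y, n,
                                          lam_max=max(lam, 0.06))
    return minimize(obj, ab0, method="Nelder-Mead").x

ab_fixed   = fit_lambda_fixed(x, y, n)    # lambda fixed at 0 (cf. light curve, Figure 1)
theta_free = fit_psychometric(x, y, n)    # lambda free within [0, .06] (cf. broken curve)
```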


Figure 1. Dark circles show data from a hypothetical observer prior to lapsing. The solid dark line is a maximum-likelihood Weibull fit to the data set. The triangle shows a datapoint after the observer lapsed once during a 50-trial block. The solid light line shows the (poor) traditional two-parameter Weibull fit with λ fixed; the broken light line shows the suggested three-parameter Weibull fit with λ free to vary.

Since yᵢ represents observed proportion correct, the coefficient is 0 as long as performance is perfect. However, as soon as the observer lapses, the coefficient becomes nonzero and allows the large negative log term to influence the log-likelihood sum, reflecting the fact that observed proportions less than 1 are extremely unlikely to have been generated from an expected value that is very close to 1. Log-likelihood can be raised by lowering the predicted value at the last stimulus value, ψ(3.5; θ). Given that λ is fixed at 0, the upper asymptote is fixed at 1.0; hence, the best the fitting algorithm can do in our example to lower ψ(3.5; θ) is to make the psychometric function shallower. Judging the fit by eye, it does not appear to capture accurately the rate at which performance improves with stimulus intensity. (Proper Monte Carlo assessments of goodness of fit are described later in this paper.)

The problem can be cured by allowing λ to take a nonzero value, which can be interpreted to reflect our belief as experimenters that observers can lapse and that, therefore, in some cases, their performance might fall below 100% despite arbitrarily large stimulus values. To obtain the optimum value of λ and, hence, the most accurate estimates for the other parameters, we allow λ to vary in the maximum-likelihood search. However, it is constrained within the narrow range [0, .06], reflecting our beliefs concerning its likely values⁸ (see the previous section, on Bayesian priors).

The example of Figure 1 might appear exaggerated; the distortion in slope was obtained by placing the last sample point (at which the lapse occurred) at a comparatively high stimulus value relative to the rest of the data set. The question remains: How serious are the consequences of assuming a fixed λ for sampling schemes one might readily employ in psychophysical research?

Simulations

To test this, we conducted Monte Carlo simulations; six-point data sets were generated binomially assuming a 2AFC design and using a standard underlying performance function F(x; {α_gen, β_gen}), which was a Weibull function with parameters α_gen = 10 and β_gen = 3. Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis. They are shown in Figure 2: Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows 0.5 + 0.5 F(x; {α_gen, β_gen}), with the 55%, 75%, and 95% performance levels marked by dotted lines.

Our seven schemes were designed to represent a range of different sampling distributions that could arise during "everyday" psychophysical laboratory practice, including those skewed toward low performance values (s4) or high performance values (s3 and s7), those that are clustered around threshold (s1), those that are spread out away from the threshold (s5), and those that span the range from 55% to 95% correct (s2).


Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0} on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes used throughout the remainder of the paper.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (this paper), as well as the width of confidence intervals (companion paper, Wichmann & Hill, 2001), all depend heavily on the distribution of stimulus values x. The number of datapoints was always 6, but the number of observations per point was 20, 40, 80, or 160. This meant that the total number of observations N could be 120, 240, 480, or 960.

We also varied the rate at which our simulated observer lapsed. Our model for the processes involved in a single trial was as follows: For every trial at stimulus value xᵢ, the observer's probability of correct response y_gen(xᵢ) is given by 0.5 + 0.5 F(xᵢ; {α_gen, β_gen}), except that there is a certain small probability that something goes wrong, in which case y_gen(xᵢ) is instead set at a constant value k. The value of k would depend on exactly what goes wrong. Perhaps the observer suffers a loss of attention, misses the stimulus, and is forced to guess; then, k would be .5, reflecting the probability of the guess' being correct. Alternatively, lapses might occur because the observer fails to respond within a specified response interval, which the experimenter interprets as an incorrect response, in which case k = 0. Or perhaps k has an intermediate value that reflects a probabilistic combination of these two events and/or other potential mishaps. In any case, provided we assume that such events are independent, that their probability of occurrence is constant throughout a single block of trials, and that k is constant, the simulated observer's overall performance on a block is binomially distributed, with an underlying probability that can be expressed with Equation 1—it is easy to show that variations in k or in the probability of a mishap are described by a change in λ. We shall use λ_gen to denote the generating function's λ parameter, possible values for which were 0, .01, .02, .03, .04, and .05.

To generate each data set, then, we chose (1) a sampling scheme, which gave us the vector of stimulus values x, (2) a value for N, which was divided equally among the elements of the vector n denoting the number of observations in each block, and (3) a value for λ_gen, which was assumed to have the same value for all blocks of the data set.⁹ We then obtained the simulated performance vector y: For each block i, the proportion of correct responses yᵢ was obtained by finding the proportion of a set of nᵢ random numbers that were less than or equal to y_gen(xᵢ).¹⁰ A maximum-likelihood fit was performed on the data set described by x, y, and n, to obtain estimated parameters α̂, β̂, and λ̂. There were two fitting regimes: one in which λ̂ was fixed at 0, and one in which it was allowed to vary but was constrained within the range [0, .06]. Thus, there were 336 conditions: 7 sampling schemes × 6 values for λ_gen × 4 values for N × 2 fitting regimes. Each condition was replicated 2,000 times, using a new randomly generated data set, for a total of 672,000 simulations overall, requiring 3.024 × 10⁸ simulated 2AFC trials.
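The data-generation step of these simulations can be sketched as follows, under the stated assumptions (2AFC, γ = .5, Weibull F with α_gen = 10 and β_gen = 3, and a constant λ_gen absorbed into Equation 1). Drawing block totals binomially is equivalent to the uniform-random-number procedure described above; the stimulus values in the example are hypothetical, since the exact x values of the seven schemes are shown only graphically in Figure 2.

```python
# Sketch of the simulated-observer data generation (hypothetical stimulus values).
import numpy as np

rng = np.random.default_rng(0)

def simulate_block_proportions(x, n_per_block, alpha_gen=10.0, beta_gen=3.0,
                               gamma=0.5, lam_gen=0.0):
    """Proportion correct per block, drawn binomially from
    p_gen = psi(x; alpha_gen, beta_gen, gamma, lambda_gen)."""
    p_gen = psi(np.asarray(x), alpha_gen, beta_gen, gamma, lam_gen)  # psi() from the earlier sketch
    return rng.binomial(n_per_block, p_gen) / n_per_block

# Example: one data set with K = 6 blocks of 40 trials (N = 240) and lambda_gen = .02
x_example = np.array([6.0, 8.0, 9.0, 10.0, 12.0, 15.0])   # invented, for illustration only
y_sim = simulate_block_proportions(x_example, 40, lam_gen=0.02)
```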
Simulation results: 1. Accuracy. We are interested in measuring the accuracy of the two fitting regimes in estimating the threshold and slope, whose true values are given by the stimulus value and gradient of F(x; {α_gen, β_gen}) at the point at which F(x; {α_gen, β_gen}) = 0.5. The true values are 8.85 and 0.118, respectively—that is, F(8.85; {10, 3}) = 0.5 and F′(8.85; {10, 3}) = 0.118.
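The quoted true values can be checked directly, assuming the standard two-parameter Weibull form F(x; α, β) = 1 − exp[−(x/α)^β], which reproduces them for α_gen = 10 and β_gen = 3:

```python
# Check of the quoted true threshold and slope under the assumed Weibull form.
import numpy as np

alpha_gen, beta_gen = 10.0, 3.0

# threshold: F^{-1}(0.5) = alpha * (ln 2)**(1/beta)
threshold = alpha_gen * np.log(2.0) ** (1.0 / beta_gen)

# slope: F'(x) = (beta/alpha) * (x/alpha)**(beta-1) * exp(-(x/alpha)**beta), at x = threshold
slope = (beta_gen / alpha_gen) * (threshold / alpha_gen) ** (beta_gen - 1) \
        * np.exp(-(threshold / alpha_gen) ** beta_gen)

print(f"{threshold:.2f} {slope:.4f}")   # -> 8.85 0.1175 (cf. the quoted 8.85 and 0.118)
```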

From each simulation, we use the estimated parameters to obtain the threshold and slope of F(x; {α̂, β̂}). The medians of the distributions of 2,000 thresholds and slopes from each condition are plotted in Figure 3. The left-hand column plots median estimated threshold, and the right-hand column plots median estimated slope, both as a function of λ_gen. The four rows correspond to the four values of N: The total number of observations increases down the page. Symbol shape denotes sampling scheme as per Figure 2. Light symbols show the results of fixing λ at 0, and dark symbols show the results of allowing λ to vary during the fitting process. The true threshold and slope values (i.e., those obtained from the generating function) are shown by the solid horizontal lines.

The first thing to notice about Figure 3 is that, in all the plots, there is an increasing bias in some of the sampling schemes' median estimates as λ_gen increases. Some sampling schemes are relatively unaffected by the bias. By using the shape of the symbols to refer back to Figure 2, it can be seen that the schemes that are affected to the greatest extent (s3, s5, s6, and s7) are those containing sample points at which F(x; {α_gen, β_gen}) > 0.9, whereas the others (s1, s2, and s4) contain no such points and are affected to a lesser degree. This is not surprising, bearing in mind the foregoing discussion of Equation 1: Bias is most likely where high performance is expected.

Generally, then, the variable-λ regime performs better than the fixed-λ regime in terms of bias. The one exception to this can be seen in the plots of median slope estimates for N = 120 (20 observations per point): Here, there is a slight upward bias in the variable-λ estimates, an effect that varies according to sampling scheme but that is relatively unaffected by the value of λ_gen. In fact, for λ_gen ≤ .02, the downward bias from the fixed-λ fits is smaller, or at least no larger, than the upward bias from fits with variable λ. Note, however, that an increase in N to 240 or more improves the variable-λ estimates, reducing the bias and bringing the medians from the different sampling schemes together. The variable-λ fitting scheme is essentially unbiased for N ≥ 480, independent of the sampling scheme chosen. By contrast, the value of N appears to have little or no effect on the absolute size of the bias inherent in the susceptible fixed-λ schemes.

Simulation results: 2. Precision. In Figure 3, the bias seems fairly small for threshold measurements (maximally, about 8% of our stimulus range when the fixed-λ fitting regime is used or about 4% when λ is allowed to vary). For slopes, only the fixed-λ regime is affected, but the effect, expressed as a percentage, is more pronounced (up to 30% underestimation of gradient). However, note that however large or small the bias appears when expressed in terms of stimulus units, knowledge about an estimator's precision is required in order to assess the severity of the bias. Severity in this case means the extent to which our estimation procedure leads us to make errors in hypothesis testing: finding differences between experimental conditions where none exist (Type I errors) or failing to find them when they do exist (Type II). The bias must thus be evaluated relative to its variability. A frequently applied rule of thumb is that a good estimator should be biased by less than 25% of its standard deviation (Efron & Tibshirani, 1993).

The variability of estimates in the context of fitting psychometric functions is the topic of our companion paper (Wichmann & Hill, 2001), in which we shall see that one's chosen sampling scheme and the value of N both have a profound effect on confidence interval width. For now, and without going into too much detail, we are merely interested in knowing how our decision to employ a fixed-λ or a variable-λ regime affects variability (precision) for the various sampling schemes and to use this information to assess the severity of bias.

Figure 4 shows two illustrative cases, plotting results for two contrasting schemes at N = 480. The upper pair of plots shows the s7 sampling scheme, which, as we have seen, is highly susceptible to bias when λ is fixed at 0 and observers lapse. The lower pair shows s1, which we have already found to be comparatively resistant to bias. As before, thresholds are plotted on the left and slopes on the right, light symbols represent the fixed-λ fitting regime, dark symbols represent the variable-λ fitting regime, and the true values are again shown as solid horizontal lines. Each symbol's position represents the median estimate from the 2,000 fits at that point, so they are exactly the same as the upward triangles and circles in the N = 480 plots of Figure 3. The vertical bars show the interval between the 16th and the 84th percentiles of each distribution (these limits were chosen because they give an interval with coverage of .68, which would be approximately the same as the mean plus or minus one standard deviation (SD) if the distributions were Gaussian). We shall use WCI68 as shorthand for width of the 68% confidence interval in the following.

Applying the rule of thumb mentioned above (bias ≤ 0.25 SD), bias in the fixed-λ condition in the threshold estimate is significant for λ_gen > .02, for both sampling schemes. The slope estimate of the fixed-λ fitting regime and the sampling scheme s7 is significantly biased once observers lapse—that is, for λ_gen ≥ .01. It is interesting to see that even the threshold estimates of s1, with all sample points at p < .8, are significantly biased by lapses. The slight bias found with the variable-λ fitting regime, however, is not significant in any of the cases studied.

We can expect the spread of the distribution of estimates to be a function of N—the WCI68 gets smaller with increasing N. Given that the absolute magnitude of bias stays the same for the fixed-λ fitting regime, however, bias will become more problematic with increasing N: For N = 960, the bias in threshold and slope estimates is significant for all sampling schemes and virtually all nonzero lapse rates. Frequently, the true (generating) value is not even within the 95% confidence interval. Increasing the number of observations while using the fixed-λ fitting regime, contrary to what one might expect, increases the likelihood of Type I or II errors.


Figure 3. Median estimated thresholds and median estimated slopes are plotted in the left-hand and right-hand columns, respectively; both are shown as a function of λ_gen. The four rows correspond to four values of N (120, 240, 480, and 960). Symbol shapes denote the different sampling schemes (see Figure 2). Light symbols show the results of fixing λ at 0; dark symbols show the results of allowing λ to vary during the fitting. The true threshold and slope values are shown by the solid horizontal lines.


Figure 4. Median estimated thresholds and median estimated slopes are plotted as a function of λ_gen on the left and right, respectively. The vertical bars show WCI68 (see the text for details). Data for two sampling schemes, s1 and s7, are shown for N = 480.

Again, the variable-λ fitting regime performs well: The magnitude of median bias decreases approximately in proportion to the decrease in WCI68. Estimates are essentially unbiased.

For small N (i.e., N = 120), WCI68s are larger than those shown in Figure 4. Very approximately,¹¹ they increase with 1/√N; even for N = 120, bias in slope and threshold estimates is significant for most of the sampling schemes in the fixed-λ fitting regime once λ_gen reaches .02 or .03. The variable-λ fitting regime performs better again, although for some sampling schemes (s3, s5, and s7) bias is significant at λ_gen ≥ .04, because bias increases disproportionally relative to the increase in WCI68.

A second observation is that, in the case of s7, correcting for bias by allowing λ to vary carries with it the cost of reduced precision. However, as was discussed above, the alternative is uninviting: With λ fixed at 0, the true slope does not even lie within the 68% confidence interval of the distribution of estimates for λ_gen ≥ .02. For s1, however, allowing λ to vary brings neither a significant benefit in terms of accuracy nor a significant penalty in terms of precision. We have found that these two contrasting cases are representative: Generally, there is nothing to lose by allowing λ to vary in those cases where it is not required in order to provide unbiased estimates.

Fixing lambda at nonzero values. In the previous analysis, we contrasted a variable-λ fitting regime with one having λ fixed at 0. Another possibility might be to fix λ at a small but nonzero value, such as .02 or .04. Here, we report Monte Carlo simulations exploring whether a fixed small value of λ overcomes the problems of bias while retaining the desirable property of (slightly) increased precision relative to the variable-λ fitting regime.

Simulations were repeated as before, except that λ was either free to vary or fixed at .01, .02, .03, .04, and .05, covering the whole range of λ_gen. (A total of 2,016,000 simulations, requiring 9.072 × 10⁹ simulated 2AFC trials.) Figure 5 shows the results of the simulations in the same format as that of Figure 3. For clarity, only the variable-λ fitting regime and λ fixed at .02 and .04 are plotted, using dark, intermediate, and light symbols, respectively. Since bias in the fixed-λ regimes is again largely independent of the number of trials N, only data corresponding to the intermediate numbers of trials are shown (N = 240 and 480).

The data for the fixed-λ regimes indicate that both are simply shifted copies of each other—in fact, they are more or less merely shifted copies of the λ-fixed-at-zero data presented in Figure 3. Not surprisingly, minimal bias is obtained for λ_gen corresponding to the fixed-λ value. The zone of insignificant bias around the fixed-λ value is small, however, only extending to, at most, λ ± .01. Thus, fixing λ at, say, .01 provides unbiased and precise estimates of threshold and slope, provided the observer's lapse rate is within 0 ≤ λ_gen ≤ .02. In our experience, this zone or range of good estimation is too narrow: One of us (F.A.W.) regularly fits psychometric functions to data from discrimination and detection experiments, and even for a single observer, λ in the variable-λ fitting regime takes on values from 0 to .05—no single fixed λ is able to provide unbiased estimates under these conditions.


Figure 5. Data shown in the format of Figure 3; median estimated thresholds and median estimated slopes are plotted as a function of λ_gen in the left-hand and right-hand columns, respectively. The two rows correspond to N = 240 and 480. Symbol shapes denote the different sampling schemes (see Figure 2). Light symbols show the results of fixing λ at .04; medium gray symbols, those for λ fixed at .02; dark symbols show the results of allowing λ to vary during the fitting. True threshold and slope values are shown by the solid horizontal lines.

Discussion and Summary

Could bias be avoided simply by choosing a sampling scheme in which performance close to 100% is not expected? Since one never knows the psychometric function exactly in advance before choosing where to sample performance, it would be difficult to avoid high performance values, even if one were to want to do so. Also, there is good reason to choose to sample at high performance values: Precisely because data at these levels have a greater influence on the maximum-likelihood fit, they carry more information about the underlying function and thus allow more efficient estimation. Accordingly, Figure 4 shows that the precision of slope estimates is better for s7 than for s1 (cf. Lam, Mills, & Dubno, 1996). This issue is explored more fully in our companion paper (Wichmann & Hill, 2001). Finally, even for those sampling schemes that contain no sample points at performance levels above 80%, bias in threshold estimates was significant, particularly for large N.

Whether sampling deliberately or accidentally at high performance levels, one must allow for the possibility that observers will perform at high rates and yet occasionally lapse: Otherwise, parameter estimates may become biased when lapses occur. Thus, we recommend that varying λ as a third parameter be the method of choice for fitting psychometric functions.

Fitting a tightly constrained λ is intended as a heuristic to avoid bias in cases of nonstationary observer behavior. It is worth noting, as well, that the estimated parameter λ̂ is, in general, not a very good estimator of a subject's true lapse rate (this was also found by Treutwein & Strasburger, 1999, and can be seen clearly in their Figures 7 and 10). Lapses are rare events, so there will only be a very small number of lapsed trials per data set. Furthermore, their directly measurable effect is small, so that only a small subset of the lapses that occur (those at high x values where performance is close to 100%) will affect the maximum-likelihood estimation procedure; the rest will be lost in binomial noise. With such minute effective sample sizes, it is hardly surprising that our estimates of λ per se are poor. However, we do not need to worry, because as psychophysicists we are not interested in lapses: We are interested in thresholds and slopes, which are determined by the function F that reflects the underlying mechanism. Therefore, we vary λ not for its own sake, but purely in order to free our threshold and slope estimates from bias. This it accomplishes well, despite numerically inaccurate λ estimates. In our simulations, it works well both for sampling schemes with a fixed nonzero λ_gen and for those with more random lapsing schemes (see note 9 or our example shown in Figure 1).

In addition, our simulations have shown that N = 120 appears too small a number of trials to be able to obtain reliable estimates of thresholds and slopes for some sampling schemes, even if the variable-λ fitting regime is employed. Similar conclusions were reached by O'Regan and Humbert (1989) for N = 100 (K = 10; cf. Leek, Hanna, & Marshall, 1992; McKee, Klein, & Teller, 1985). This is further supported by the analysis of bootstrap sensitivity in our companion paper (Wichmann & Hill, 2001).

GOODNESS OF FIT

Background

Assessing goodness of fit is a necessary component of any sound procedure for modeling data, and the importance of such tests cannot be stressed enough, given that fitted thresholds and slopes, as well as estimates of variability (Wichmann & Hill, 2001), are usually of very limited use if the data do not appear to have come from the hypothesized model. A common method of goodness-of-fit assessment is to calculate an error term or summary statistic, which can be shown to be asymptotically distributed according to χ²—for example, Pearson X²—and to compare the error term against the appropriate χ² distribution. A problem arises, however, since psychophysical data tend to consist of small numbers of points, and it is, hence, by no means certain that such tests are accurate. A promising technique that offers a possible solution is Monte Carlo simulation, which, being computationally intensive, has become practicable only in recent years with the dramatic increase in desktop computing speeds. It is potentially well suited to the analysis of psychophysical data, because its accuracy does not rely on large numbers of trials, as do methods derived from asymptotic theory (Hinkley, 1988). We show that for the typically small K and N used in psychophysical experiments, assessing goodness of fit by comparing an empirically obtained statistic against its asymptotic distribution is not always reliable: The true small-sample distribution of the statistic is often insufficiently well approximated by its asymptotic distribution. Thus, we advocate generation of the necessary distributions by Monte Carlo simulation.

Lack of fit—that is, the failure of goodness of fit—may result from failure of one or more of the assumptions of one's model. First and foremost, lack of fit between the model and the data could result from an inappropriate functional form for the model—in our case of fitting a psychometric function to a single data set, the chosen underlying function F is significantly different from the true one. Second, our assumption that observer responses are binomial may be false: For example, there might be serial dependencies between trials within a single block. Third, the observer's psychometric function may be nonstationary during the course of the experiment, be it due to learning or fatigue.

Usually, inappropriate models and violations of independence result in overdispersion or extra-binomial variation: "bad fits" in which datapoints are significantly further from the fitted curve than was expected. Experimenter bias in data selection (e.g., informal removal of outliers), on the other hand, could result in underdispersion: fits that are "too good to be true," in which datapoints are significantly closer to the fitted curve than one might expect (such data sets are reported more frequently in the psychophysical literature than one would hope¹²).

Typically, if goodness of fit of fitted psychometric functions is assessed at all, however, only overdispersion is considered. Of course, this method does not allow us to distinguish between different sources of overdispersion (wrong underlying function or violation of independence) and/or effects of learning. Furthermore, as we will show, models can be shown to be in error even if the summary statistic indicates an acceptable fit.

In the following, we describe a set of goodness-of-fit tests for psychometric functions (and parametric models in general). Most of them rely on different analyses of the residual differences between data and fit (the sum of squares of which constitutes the popular summary statistics) and on Monte Carlo generation of the statistic's distribution, against which to assess lack of fit. Finally, we show how a simple application of the jackknife resampling technique can be used to identify so-called influential observations—that is, individual points in a data set that exert undue influence on the final parameter set. Jackknife techniques can also provide an objective means of identifying outliers.

Assessing overdispersion. Summary statistics measure closeness of the data set as a whole to the fitted function. Assessing closeness is intimately linked to the fitting procedure itself: Selecting the appropriate error metric for fitting implies that the relevant currency within which to measure closeness has been identified. How good or bad the fit is should thus be assessed in the same currency. In maximum-likelihood parameter estimation, the parameter vector θ̂ returned by the fitting routine is such that L(θ̂; y) ≥ L(θ; y) for all θ. Thus, whatever error metric, Z, is used to assess goodness of fit,

Z(θ̂; y) ≤ Z(θ; y)   (3)

should hold for all θ.

Deviance. The log-likelihood ratio, or deviance, is a monotonic transformation of likelihood and therefore fulfills the criterion set out in Equation 3. Hence, it is commonly used in the context of generalized linear models (Collett, 1991; Dobson, 1990; McCullagh & Nelder, 1989). Deviance, D, is defined as

D = 2 log [ L(θ_max; y) / L(θ̂; y) ] = 2 [ l(θ_max; y) − l(θ̂; y) ],   (4)

where L(θ_max; y) denotes the likelihood of the saturated model—that is, a model with no residual error between empirical data and model predictions. (θ_max denotes the parameter vector such that this holds, and the number of free parameters in the saturated model is equal to the total number of blocks of observations, K.) L(θ̂; y) is the likelihood of the best-fitting model; l(θ_max; y) and l(θ̂; y) denote the logarithms of these quantities, respectively. Because, by definition, l(θ_max; y) ≥ l(θ̂; y) for all θ, and l(θ_max; y) is independent of θ (being purely a function of the data, y), deviance fulfills the criterion set out in Equation 3. From Equation 4 we see that deviance takes values from 0 (no residual error) to infinity (observed data are impossible given model predictions).

For goodness-of-fit assessment of psychometric functions (binomial data), Equation 4 reduces to

D = 2 Σᵢ₌₁ᴷ { nᵢ yᵢ log (yᵢ / pᵢ) + nᵢ (1 − yᵢ) log [ (1 − yᵢ) / (1 − pᵢ) ] }   (5)

(pᵢ refers to the proportion correct predicted by the fitted model).

Deviance is used to assess goodness of fit, rather than likelihood or log-likelihood directly, because, for correct models, deviance for binomial data is asymptotically distributed as χ²_K, where K denotes the number of datapoints (blocks of trials).¹³ For a derivation, see, for example, Dobson (1990, p. 57), McCullagh and Nelder (1989), and, in particular, Collett (1991, sects. 3.8.1 and 3.8.2). Calculating D from one's fit and comparing it with the appropriate χ² distribution hence allows simple goodness-of-fit assessment, provided that the asymptotic approximation to the (unknown) distribution of deviance is accurate for one's data set. The specific values of L(θ̂; y) or l(θ̂; y) by themselves, on the other hand, are less generally interpretable.

Pearson X². The Pearson X² test is widely used in goodness-of-fit assessment of multinomial data; applied to K blocks of binomial data, the statistic has the form

X² = Σᵢ₌₁ᴷ nᵢ (yᵢ − pᵢ)² / [ pᵢ (1 − pᵢ) ],   (6)

with nᵢ, yᵢ, and pᵢ as in Equation 5. Equation 6 can be interpreted as the sum of squared residuals (each residual being given by yᵢ − pᵢ), standardized by their variance, pᵢ(1 − pᵢ)nᵢ⁻¹. Pearson X² is asymptotically distributed according to χ² with K degrees of freedom, because the standardized residuals are asymptotically normal and χ²_K is defined as the distribution of the sum of K squared unit-variance normal deviates. Indeed, deviance D and Pearson X² have the same asymptotic χ² distribution (but see note 13).

There are two reasons why deviance is preferable to Pearson X² for assessing goodness of fit after maximum-likelihood parameter estimation. First and foremost, for Pearson X², Equation 3 does not hold—that is, the maximum-likelihood parameter estimate θ̂ will not generally correspond to the set of parameters with the smallest error in the Pearson X² sense—that is, Pearson X² errors are the wrong currency (see the previous section, Assessing Overdispersion). Second, differences in deviance between two models of the same family—that is, between models where one model includes terms in addition to those in the other—can be used to assess the significance of the additional free parameters. Pearson X², on the other hand, cannot be used for such model comparisons (Collett, 1991). This important issue will be expanded on when we introduce an objective test of outlier identification.
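Equations 5 and 6 translate directly into code. The following sketch (our notation, not the authors' implementation) takes the observed proportions y, the model predictions p, and the block sizes n:

```python
# Sketch of deviance (Equation 5) and Pearson X^2 (Equation 6) for K binomial blocks.
import numpy as np
from scipy.special import xlogy   # xlogy(a, b) = a*log(b), with 0*log(0) treated as 0

def deviance(y, p, n):
    """Equation 5; xlogy handles blocks with y_i = 0 or y_i = 1 gracefully."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    return 2.0 * np.sum(xlogy(n * y, y / p) + xlogy(n * (1 - y), (1 - y) / (1 - p)))

def pearson_x2(y, p, n):
    """Equation 6: squared residuals standardized by their variance p(1-p)/n."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    return np.sum(n * (y - p) ** 2 / (p * (1 - p)))
```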
Simulations

As we have mentioned, both deviance for binomial data and Pearson X² are only asymptotically distributed according to χ². In the case of Pearson X², the approximation to the χ² will be reasonably good once the K individual binomial contributions to Pearson X² are well approximated by a normal—that is, as long as both nᵢpᵢ ≥ 5 and nᵢ(1 − pᵢ) ≥ 5 (Hoel, 1984). Even for only moderately high p values like .9, this already requires nᵢ values of 50 or more, and p = .98 requires an nᵢ of 250.

No such simple criterion exists for deviance, however. The approximation depends not only on K and N but, importantly, on the size of the individual nᵢ and pᵢ—it is difficult to predict whether or not the approximation is sufficiently close for a particular data set (see Collett, 1991, sect. 3.8.2). For binary data (i.e., nᵢ = 1), deviance is not even asymptotically distributed according to χ², and for small nᵢ, the approximation can thus be very poor even if K is large.

Monte-Carlo-based techniques are well suited to answering any question of the kind "What distribution of values would we expect if . . . ?" and, hence, offer a potential alternative to relying on the large-sample χ² approximation for assessing goodness of fit. The distribution of deviances is obtained in the following way. First, we generate B data sets yᵢ*, using the best-fitting psychometric function, ψ(x; θ̂), as the generating function. Then, for each of the i ∈ {1, . . . , B} generated data sets yᵢ*, we calculate deviance Dᵢ*, using Equation 5, yielding the deviance distribution D*. The distribution D* reflects the deviances we should expect from an observer whose correct responses are binomially distributed with success probability ψ(x; θ̂). A confidence interval for deviance can then be obtained by using the standard percentile method: D*(n) denotes the 100·nth percentile of the distribution D*, so that, for example, the two-sided 95% confidence interval is written as [D*(.025), D*(.975)].

Let D_emp denote the deviance of our empirically obtained data set. If D_emp > D*(.975), the agreement between data and fit is poor (overdispersion), and it is unlikely that the empirical data set was generated by the best-fitting psychometric function, ψ(x; θ̂); ψ(x; θ̂) is, hence, not an adequate summary of the empirical data or the observer's behavior.

When using Monte Carlo methods to approximate the true deviance distribution D by D*, one requires a large value of B so that the approximation is good enough to be taken as the true or reference distribution—otherwise, we would simply substitute errors arising from the inappropriate use of an asymptotic distribution for numerical errors incurred by our simulations (Hämmerlin & Hoffmann, 1991). One way to see whether D* has indeed approached D is to look at the convergence of several of the quantiles of D* with increasing B. For a large range of different values of N, K, and nᵢ, we found that for B ≥ 10,000, D* has stabilized.¹⁴

Assessing errors in the asymptotic approximation to the deviance distribution. We have found, by a large amount of trial-and-error exploration, that errors in the large-sample approximation to the deviance distribution are not predictable in a straightforward manner from one's chosen values of N, K, and x. To illustrate this point, in this section we present six examples in which the χ² approximation to the deviance distribution fails in different ways. For each of the six examples, we conducted Monte Carlo simulations, each using B = 10,000.
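The Monte Carlo procedure described above might be sketched as follows, reusing psi() and deviance() from the earlier sketches; B = 10,000 follows the recommendation in the text, and the comparison uses the two-sided 95% interval [D*(.025), D*(.975)]:

```python
# Sketch of the Monte Carlo goodness-of-fit assessment (our naming, not the authors' code).
import numpy as np

def monte_carlo_deviance_test(x, y_emp, n, theta_hat, gamma=0.5, B=10_000, seed=0):
    rng = np.random.default_rng(seed)
    x, y_emp, n = map(np.asarray, (x, y_emp, n))
    alpha_hat, beta_hat, lam_hat = theta_hat
    p_hat = psi(x, alpha_hat, beta_hat, gamma, lam_hat)

    d_emp = deviance(y_emp, p_hat, n)                      # deviance of the empirical data

    # B simulated data sets generated from the fitted function, and their deviances D*
    y_star = rng.binomial(n[None, :], p_hat[None, :], size=(B, len(p_hat))) / n
    d_star = np.array([deviance(row, p_hat, n) for row in y_star])

    lo, hi = np.percentile(d_star, [2.5, 97.5])            # [D*(.025), D*(.975)]
    cpe = np.sum(d_star <= d_emp) / (B + 1)                # Monte Carlo cumulative probability
    return {"D_emp": d_emp, "interval_95": (lo, hi), "CPE_MC": cpe,
            "overdispersed": d_emp > hi, "underdispersed": d_emp < lo}
```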

[Figure 6 histograms: both panels N = 300, K = 6, nᵢ = 50. Left panel: ΔP_RMS = 4.63, ΔP_max = 6.95, P_F = .011, P_M = 0. Right panel: ΔP_RMS = 14.18, ΔP_max = 18.42, P_F = 0, P_M = .596.]

Figure 6. Histograms of Monte-Carlo-generated deviance distributions D* (B = 10,000). Both panels show distributions for N = 300, K = 6, and nᵢ = 50. The left-hand panel was generated from p_gen = {.52, .56, .74, .94, .96, .98}; the right-hand panel was generated from p_gen = {.63, .82, .89, .97, .99, .9999}. The solid dark line drawn with the histograms shows the χ²₆ distribution (appropriately scaled).

[Figure 7 histograms: left panel N = 120, K = 60, nᵢ = 2, ΔP_RMS = 39.51, ΔP_max = 56.32, P_F = .310, P_M = 0. Right panel N = 480, K = 240, nᵢ = 2, ΔP_RMS = 55.16, ΔP_max = 86.87, P_F = .899, P_M = 0.]

Figure 7. Histograms of Monte-Carlo-generated deviance distributions D* (B = 10,000). For both distributions, p_gen was uniformly distributed on the interval [.52, .85], and nᵢ was equal to 2. The left-hand panel was generated using K = 60 (N = 120), and the solid dark line shows χ²₆₀ (appropriately scaled). The right-hand panel's distribution was generated using K = 240 (N = 480), and the solid dark line shows the appropriately scaled χ²₂₄₀ distribution.

Critically, for each set of simulations, a set of generating probabilities p_gen was chosen: In a real experiment, these values would be determined by the positioning of one's sample points x and the observer's true psychometric function ψ_gen. The specific values of p_gen in our simulations were chosen by us to demonstrate typical ways in which the χ² approximation to the deviance distribution fails: For the two examples shown in Figure 6, we change only p_gen, keeping K and nᵢ constant; for the two shown in Figure 7, nᵢ was constant, and p_gen covered the same range of values; Figure 8, finally, illustrates the effect of changing nᵢ while keeping p_gen and K constant.

In order to assess the accuracy of the χ² approximation in all examples, we calculated the following four error terms: (1) the rate at which using the χ² approximation would have caused us to make Type I errors of rejection, rejecting simulated data sets that should not have been rejected at the 5% level (we call this the false alarm rate, or P_F), (2) the rate at which using the χ² approximation would have caused us to make Type II errors of rejection, failing to reject a data set that the Monte Carlo distribution D* indicated as rejectable at the 5% level (we call this the miss rate, P_M), (3) the root-mean squared error ΔP_RMS in cumulative probability estimate (CPE), given by Equation 9 (this is a measure of how different the Monte Carlo distribution D* and the appropriate χ² distribution are, on average), and (4) the maximal CPE error ΔP_max, given by Equation 10, an indication of the maximal error in percentile assignment that could result from using the χ² approximation instead of the true D*.

The first two measures, P_F and P_M, are primarily useful for individual data sets. The latter two measures, ΔP_RMS and ΔP_max, provide useful information in meta-analyses (Schmidt, 1996), where models are assessed across several data sets. (In such analyses, we are interested in CPE errors even if the deviance value of one particular data set is not close to the tails of D: A systematic error in CPE in individual data sets might still cause errors of rejection of the model as a whole, when all data sets are considered.)

In order to define ΔP_RMS and ΔP_max, it is useful to introduce two additional terms, the Monte Carlo cumulative probability estimate CPE_MC and the χ² cumulative probability estimate CPE_χ². By CPE_MC, we refer to

CPE_MC(D) = #{Dᵢ* ≤ D} / (B + 1),   (7)

that is, the proportion of deviance values in D* smaller than some reference value D of interest. Similarly,

CPE_χ²(D, K) = ∫₀ᴰ χ²_K(x) dx = P(χ²_K ≤ D)   (8)

provides the same information for the χ² distribution with K degrees of freedom. The root-mean squared CPE error ΔP_RMS (average difference or error) is defined as

ΔP_RMS = 100 · √{ (1/B) Σᵢ₌₁ᴮ [ CPE_MC(Dᵢ*) − CPE_χ²(Dᵢ*, K) ]² },   (9)

and the maximal CPE error ΔP_max (maximal difference or error) is given by

ΔP_max = 100 · maxᵢ | CPE_MC(Dᵢ*) − CPE_χ²(Dᵢ*, K) |.   (10)
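Given a Monte Carlo deviance distribution D*, the four error measures just defined (Equations 7 to 10, plus P_F and P_M at the 5% level) can be computed as in the following sketch (our naming and implementation, not the authors'):

```python
# Sketch of the chi-square-approximation error measures for one simulated condition.
import numpy as np
from scipy.stats import chi2

def approximation_errors(d_star, K, alpha=0.05):
    d_star = np.sort(np.asarray(d_star, float))
    B = len(d_star)

    cpe_mc = np.arange(1, B + 1) / (B + 1)          # Equation 7 at each D_i* (ignoring ties)
    cpe_chi2 = chi2.cdf(d_star, df=K)               # Equation 8

    diff = cpe_mc - cpe_chi2
    dp_rms = 100 * np.sqrt(np.mean(diff ** 2))      # Equation 9
    dp_max = 100 * np.max(np.abs(diff))             # Equation 10

    reject_mc = cpe_mc > 1 - alpha                  # rejectable according to D*
    reject_chi2 = cpe_chi2 > 1 - alpha              # rejectable according to chi-square
    p_f = np.mean(reject_chi2 & ~reject_mc)         # Type I: false alarms
    p_m = np.mean(~reject_chi2 & reject_mc)         # Type II: misses
    return {"dP_RMS": dp_rms, "dP_max": dp_max, "P_F": p_f, "P_M": p_m}
```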

[Figure 8 panels. Left: N = 120, K = 60, n_i = 2; ΔP_RMS = 25.34, ΔP_max = 34.92, P_F = 0, P_M = .950. Right: N = 240, K = 60, n_i = 4; ΔP_RMS = 5.20, ΔP_max = 8.50, P_F = 0, P_M = .702. Ordinate: number per bin; abscissa: deviance (D).]

Figure 8. Histograms of Monte-Carlo-generated deviance distributions D* (B = 10,000). For both distributions, p_gen was uniformly distributed on the interval [.72, .99], and K was equal to 60. The left-hand panel's distribution was generated using n_i = 2 (N = 120); the right-hand panel's distribution was generated using n_i = 4 (N = 240). The solid dark line drawn with the histograms shows the χ²_60 distribution (appropriately scaled).

Figure 6 illustrates two contrasting ways in which the approximation can fail.¹⁵ The left-hand panel shows results from the test in which the data sets were generated from p_gen = {.52, .56, .74, .94, .96, .98} with n_i = 50 observations per sample point (K = 6, N = 300). Note that the χ² approximation to the distribution is (slightly) shifted to the left. This results in ΔP_RMS = 4.63, ΔP_max = 6.95, and a false alarm rate P_F of 1.1%. The right-hand panel illustrates the results from very similar input conditions: As before, n_i = 50 observations per sample point (K = 6, N = 300), but now p_gen = {.63, .82, .89, .97, .99, .9999}. This time, the χ² approximation is shifted to the right, resulting in ΔP_RMS = 14.18, ΔP_max = 18.42, and a large miss rate: P_M = 59.6%.

Note that the reversal is the result of a comparatively subtle change in the distribution of generating probabilities. These two cases illustrate the way in which asymptotic theory may result in errors for sampling schemes that may occur in ordinary experimental settings, using the method of constant stimuli. In our examples, the Type I errors (i.e., erroneously rejecting a valid data set) of the left-hand panel may occur at a low rate, but they do occur. The substantial Type II error rate (i.e., accepting a data set whose deviance is really too high and should thus be rejected) shown on the right-hand panel, however, should be cause for some concern. In any case, the reversal of error type, for the same values of K and n_i, indicates that the type of error is not predictable in any readily apparent way from the distribution of generating probabilities, and the error cannot be compensated for by a straightforward correction, such as a manipulation of the number of degrees of freedom of the χ² approximation.

It is known that the large-sample approximation of the binomial deviance distribution improves with an increase in n_i (Collett, 1991). In the above examples, n_i was as large as it is likely to get in most psychophysical experiments (n_i = 50), but substantial differences between the true deviance distribution and its large-sample χ² approximation were nonetheless observed. Increasingly frequently, psychometric functions are fitted to the raw data obtained from adaptive procedures (e.g., Treutwein & Strasburger, 1999), with n_i being considerably smaller. Figure 7 illustrates the profound discrepancy between the true deviance distribution and the χ² approximation under these circumstances. For this set of simulations, n_i was equal to 2 for all i. The left-hand panel shows results from the test in which the data sets were generated from K = 60 sample points uniformly distributed over [.52, .85] (p_gen = {.52, . . . , .85}), for a total of N = 120 observations. This results in ΔP_RMS = 39.51, ΔP_max = 56.32, and a false alarm rate P_F of 31.0%. The right-hand panel illustrates the results from similar input conditions, except that K equaled 240 and N was thus 480. The χ² approximation is even worse, with ΔP_RMS = 55.16, ΔP_max = 86.87, and a false alarm rate P_F of 89.9%.

The data shown in Figure 7 clearly demonstrate that a large number of observations N, by itself, is not a valid indicator of whether the χ² approximation is sufficiently good to be useful for goodness-of-fit assessment.
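As a concrete illustration of how such a comparison can be carried out, here is a minimal NumPy/SciPy sketch (ours, not the authors' MATLAB software). It simulates B data sets from a chosen set of generating probabilities, computes each data set's binomial deviance against those same probabilities (cf. note 13), and reports how often the nominal 5% χ² criterion would reject.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def deviance(y, p, n):
    """Binomial deviance between observed proportions y and predictions p,
    for blocks of n trials; 0 * log(0) is taken as 0."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    t1 = np.where(y > 0, n * y * np.log(np.where(y > 0, y, 1) / p), 0.0)
    t2 = np.where(y < 1, n * (1 - y) * np.log(np.where(y < 1, 1 - y, 1) / (1 - p)), 0.0)
    return 2.0 * np.sum(t1 + t2)

def mc_deviance_distribution(p_gen, n_i, B=10_000):
    """Monte Carlo deviance distribution D* under the generating probabilities."""
    K = len(p_gen)
    return np.array([deviance(rng.binomial(n_i, p_gen, size=K) / n_i, p_gen, n_i)
                     for _ in range(B)])

# Example loosely modeled on the left-hand panel of Figure 7: K = 60, n_i = 2
p_gen = np.linspace(0.52, 0.85, 60)
D_star = mc_deviance_distribution(p_gen, n_i=2)
reject = np.mean(D_star > chi2.ppf(0.95, df=60))
print(f"chi-square criterion rejects {reject:.1%} of data sets (nominally 5%)")
```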
After showing the effect of a change in p_gen on the χ² approximation in Figure 6, and of K in Figure 7, Figure 8 illustrates the effect of changing n_i while keeping p_gen and K constant: K = 60, and p_gen were uniformly distributed on the interval [.72, .99]. The left-hand panel shows results from the test in which the data sets were generated with n_i = 2 observations per sample point (K = 60, N = 120). The χ² approximation to the distribution is shifted to the right, resulting in ΔP_RMS = 25.34, ΔP_max = 34.92, and a miss rate P_M of 95%.

The right-hand panel shows results from very similar generation conditions, except that n_i = 4 (K = 60, N = 240). Note that, unlike in the other examples introduced so far, the mode of the distribution D* is not shifted relative to that of the χ²_60 distribution, but the distribution is more leptokurtic (it has a larger kurtosis, or fourth moment). ΔP_RMS equals 5.20, ΔP_max equals 8.50, and the miss rate P_M is still a substantial 70.2%.

Comparing the left-hand panels of Figures 7 and 8 further points to the impact of p on the accuracy of the χ² approximation to the deviance distribution: For constant N, K, and n_i, we obtain either a substantial false alarm rate (P_F = .31; Figure 7) or a substantial miss rate (P_M = .95; Figure 8).

In general, we have found that very large errors in the χ² approximation are relatively rare for n_i ≥ 40, but they still remain unpredictable (see Figure 6). For data sets with n_i ≤ 20, on the other hand, substantial differences between the true deviance distribution and its large-sample χ² approximation are the norm, rather than the exception. We thus feel that Monte-Carlo-based goodness-of-fit assessments should be preferred over χ²-based methods for binomial deviance.

Deviance residuals. Examination of residuals—the agreement between individual datapoints and the corresponding model prediction—is frequently suggested as being one of the most effective ways of identifying an incorrect model in linear and nonlinear regression (Collett, 1991; Draper & Smith, 1981). Given that deviance is the appropriate summary statistic, it is sensible to base one's further analyses on the deviance residuals, d.
Each deviance residual d_i is defined as the square root of the deviance value calculated for datapoint i in isolation, signed according to the direction of the arithmetic residual y_i − p_i. For binomial data, this is

\[ d_i = \operatorname{sgn}(y_i - p_i) \sqrt{ 2 \left[ n_i y_i \log\!\left( \frac{y_i}{p_i} \right) + n_i (1 - y_i) \log\!\left( \frac{1 - y_i}{1 - p_i} \right) \right] } . \tag{11} \]

Note that

\[ D = \sum_{i=1}^{K} d_i^2 . \tag{12} \]

Viewed this way, the summary statistic deviance is the sum of the squared deviations between model and data; the d_i are thus analogous to the normally distributed unit-variance deviations that constitute the χ² statistic.

Model checking. Over and above inspecting the residuals visually, one simple way of looking at the residuals is to calculate the correlation coefficient between the residuals and the p values predicted by one's model. This allows the identification of a systematic (linear) relation between deviance residuals d and model predictions p, which would suggest that the chosen functional form of the model is inappropriate—for psychometric functions, that presumably means that F is inappropriate. Needless to say, a correlation coefficient of zero implies neither that there is no systematic relationship between residuals and the model prediction nor that the model chosen is correct; it simply means that whatever relation might exist between residuals and model predictions, it is not a linear one.

Figure 9A shows data from a visual masking experiment with K = 10 and n_i = 50, together with the best-fitting Weibull psychometric function (Wichmann, 1999). Figure 9B shows a histogram of D* for B = 10,000 with the scaled χ²_10 PDF. The two arrows below the deviance axis mark the two-sided 95% confidence interval [D*(.025), D*(.975)]. The deviance of the data set, D_emp, is 8.34, and the Monte Carlo cumulative probability estimate is CPE_MC(8.34) = .479. The summary statistic deviance, hence, does not indicate a lack of fit. Figure 9C shows the deviance residuals d as a function of the model prediction p [p = ψ(x; θ̂) in this case, because we are using a fitted psychometric function]. The correlation coefficient between d and p is r = −.610. However, in order to determine whether this correlation coefficient r is significant (of greater magnitude than expected by chance alone if our chosen model was correct), we need to know the expected distribution of r. For correct models, large samples, and continuous data—that is, very large n_i—one should expect the distribution of the correlation coefficients to be a zero-mean Gaussian, but with a variance that itself is a function of p and, hence, ultimately of one's sampling scheme x. Asymptotic methods are, hence, of very limited applicability for this goodness-of-fit assessment.

Figure 9D shows a histogram of r* obtained by Monte Carlo simulation with B = 10,000, again with arrows marking the two-sided 95% confidence interval [r*(.025), r*(.975)]. Confidence intervals for the correlation coefficient are obtained in a manner analogous to those obtained for deviance. First, we generate B simulated data sets y_i*, using the best-fitting psychometric function as generating function. Then, for each simulated data set y_i*, we calculate the correlation coefficient r_i* between the deviance residuals d_i* calculated using Equation 11 and the model predictions p = ψ(x; θ̂). From r*, one then obtains 95% confidence limits, using the appropriate quantile of the distribution.

In our example, a correlation of −.610 is significant, the Monte Carlo cumulative probability estimate being CPE_MC(−.610) = .0015. (Note that the distribution is skewed and not centred on zero; a positive correlation of the same magnitude would still be within the 95% confidence interval.) Analyzing the correlation between deviance residuals and model predictions thus allows us to reject the Weibull function as underlying function F for the data shown in Figure 9A, even though the overall deviance does not indicate a lack of fit.
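A compact sketch of this residual analysis in NumPy (ours, not the authors' software; for brevity, each simulated data set is evaluated against the original fit rather than being refitted) computes the deviance residuals of Equation 11, the empirical correlation r between residuals and model predictions, and its Monte Carlo reference distribution r*.

```python
import numpy as np

rng = np.random.default_rng(1)

def deviance_residuals(y, p, n):
    """Signed deviance residuals d_i (Equation 11) for binomial blocks."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    t1 = np.where(y > 0, n * y * np.log(np.where(y > 0, y, 1) / p), 0.0)
    t2 = np.where(y < 1, n * (1 - y) * np.log(np.where(y < 1, 1 - y, 1) / (1 - p)), 0.0)
    return np.sign(y - p) * np.sqrt(2.0 * (t1 + t2))

def model_check(y, p_fit, n, B=10_000):
    """Correlation between deviance residuals and model predictions, with a
    Monte Carlo cumulative probability and 95% interval for r under the fit."""
    y, p_fit, n = np.asarray(y), np.asarray(p_fit), np.asarray(n)
    r_emp = np.corrcoef(deviance_residuals(y, p_fit, n), p_fit)[0, 1]
    r_star = np.empty(B)
    for b in range(B):
        y_star = rng.binomial(n.astype(int), p_fit) / n   # simulate from the fit
        r_star[b] = np.corrcoef(deviance_residuals(y_star, p_fit, n), p_fit)[0, 1]
    cpe = np.sum(r_star <= r_emp) / (B + 1)
    return r_emp, cpe, np.quantile(r_star, [0.025, 0.975])
```

This is the kind of computation summarized in Figure 9D; one would call model_check with the observed proportions, the fitted predictions ψ(x; θ̂), and the block sizes.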

[Figure 9 panels. (A) Proportion correct responses vs. signal contrast [%] (K = 10, N = 500), with the fitted Weibull psychometric function, θ = {4.7; 1.98; 0.5; 0}. (B) Deviance histogram: ΔP_RMS = 5.90, ΔP_max = 9.30, P_F = 0, P_M = .340. (C) Deviance residuals vs. model prediction. (D) Histogram of r; r = −.610. Ordinates of panels B and D: number per bin.]

Figure 9. (A) Raw data with N = 500, K = 10, and n_i = 50, together with the best-fitting Weibull psychometric function ψ_fit with parameter vector θ = {4.7, 1.98, .5, 0} on semilogarithmic coordinates. (B) Histogram of Monte-Carlo-generated deviance distribution D* (B = 10,000) from ψ_fit. The solid vertical line marks the deviance of the empirical data set shown in panel A, D_emp = 8.34; the two arrows below the x-axis mark the two-sided 95% confidence interval [D*(.025), D*(.975)]. (C) Deviance residuals d plotted as a function of model predictions p on linear coordinates. (D) Histogram of Monte-Carlo-generated correlation coefficients between d and p, r* (B = 10,000). The solid vertical line marks the correlation coefficient between d and p = ψ_fit of the empirical data set shown in panel A, r_emp = −.610; the two arrows below the x-axis mark the two-sided 95% confidence interval [r*(.025), r*(.975)].

Learning. Analysis of the deviance residuals d as a function of temporal order can be used to show perceptual learning, one type of nonstationary observer performance. The approach is equivalent to that described for model checking, except that the correlation coefficient of deviance residuals is assessed as a function of the order in which the data were collected (often referred to as their index; Collett, 1991). Assuming that perceptual learning improves performance over time, one would expect the fitted psychometric function to be an average of the poor earlier performance and the better later performance.¹⁶ Deviance residuals should thus be negative for the first few datapoints and positive for the last ones. As a consequence, the correlation coefficient r of deviance residuals d against their indices (which we will denote by k) is expected to be positive if the subject's performance improved over time.
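A sketch of this learning check (again ours, repeating the residual helper from the previous sketch so that the block is self-contained) simply substitutes the block index k for the model prediction p as the second variable in the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)

def deviance_residuals(y, p, n):
    """Signed deviance residuals d_i (Equation 11); same helper as before."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    t1 = np.where(y > 0, n * y * np.log(np.where(y > 0, y, 1) / p), 0.0)
    t2 = np.where(y < 1, n * (1 - y) * np.log(np.where(y < 1, 1 - y, 1) / (1 - p)), 0.0)
    return np.sign(y - p) * np.sqrt(2.0 * (t1 + t2))

def learning_check(y, p_fit, n, B=10_000):
    """Correlation of deviance residuals with the temporal index k of each block;
    a large positive r relative to r* suggests perceptual learning."""
    y, p_fit, n = np.asarray(y), np.asarray(p_fit), np.asarray(n)
    k = np.arange(1, len(p_fit) + 1)
    r_emp = np.corrcoef(deviance_residuals(y, p_fit, n), k)[0, 1]
    r_star = np.empty(B)
    for b in range(B):
        y_star = rng.binomial(n.astype(int), p_fit) / n
        r_star[b] = np.corrcoef(deviance_residuals(y_star, p_fit, n), k)[0, 1]
    return r_emp, np.quantile(r_star, [0.025, 0.975])
```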

[Figure 10 panels. (A) Proportion correct responses vs. signal contrast [%] (K = 10, N = 500), with the fitted Weibull psychometric function, θ = {11.7; 0.89; 0.5; 0}; each data point is labeled with its index. (B) Deviance histogram: ΔP_RMS = 3.82, ΔP_max = 5.77, P_F = .010, P_M = 0. (C) Deviance residuals vs. index. (D) Histogram of r; r = .752. Ordinates of panels B and D: number per bin.]

Figure 10. (A) Raw data with N = 500, K = 10, and n_i = 50, together with the best-fitting Weibull psychometric function ψ_fit with parameter vector θ = {11.7, 0.89, .5, 0} on semilogarithmic coordinates. The number next to each individual data point shows the index k_i of that data point. (B) Histogram of Monte-Carlo-generated deviance distribution D* (B = 10,000) from ψ_fit. The solid vertical line marks the deviance of the empirical data set shown in panel A, D_emp = 16.97; the two arrows below the x-axis mark the two-sided 95% confidence interval [D*(.025), D*(.975)]. (C) Deviance residuals d plotted as a function of their index k. (D) Histogram of Monte-Carlo-generated correlation coefficients between d and index k, r* (B = 10,000). The solid vertical line marks the empirical value of r for the data set shown in panel A, r_emp = .752; the two arrows below the x-axis mark the two-sided 95% confidence interval [r*(.025), r*(.975)].

Figure 10A shows another data set from one of F.A.W.'s discrimination experiments; again, K = 10 and n_i = 50, and the best-fitting Weibull psychometric function is shown with the raw data. Figure 10B shows a histogram of D* for B = 10,000 with the scaled χ²_10 PDF. The two arrows below the deviance axis mark the two-sided 95% confidence interval [D*(.025), D*(.975)]. The deviance of the data set, D_emp, is 16.97, the Monte Carlo cumulative probability estimate CPE_MC(16.97) being .914. The summary statistic D does not indicate a lack of fit. Figure 10C shows an index plot of the deviance residuals d. The correlation coefficient between d and k is r = .752, and the histogram of r* shown in Figure 10D indicates that such a high positive correlation is not expected by chance alone. Analysis of the deviance residuals against their index is, hence, an objective means to identify perceptual learning and, thus, reject the fit, even if the summary statistic does not indicate a lack of fit.

Influential observations and outliers. Identification of influential observations and outliers is an additional requirement for comprehensive goodness-of-fit assessment.

The jackknife resampling technique. The jackknife is a resampling technique in which K data sets, each of size K − 1, are created from the original data set y by successively omitting one datapoint at a time. The jth jackknife data set y_(−j) is thus the same as y, but with the jth datapoint of y omitted.¹⁷

Influential observations. To identify influential observations, we apply the jackknife to the original data set and refit each jackknife data set y_(−j); this yields K parameter vectors θ̂_(−1), . . . , θ̂_(−K). Influential observations are those that exert an undue influence on one's inferences—that is, on the estimated parameter vector θ̂—and to this end, we compare θ̂_(−1), . . . , θ̂_(−K) with θ̂. If a jackknife parameter set θ̂_(−j) is significantly different from θ̂, the jth datapoint is deemed an influential observation, because its inclusion in the data set alters the parameter vector significantly (from θ̂_(−j) to θ̂).

Again, the question arises: What constitutes a significant difference between θ̂ and θ̂_(−j)? No general rules exist to decide at which point θ̂ and θ̂_(−j) are significantly different, but we suggest that one should be wary of one's sampling scheme x if any one or several of the parameter sets θ̂_(−j) are outside the 95% confidence interval of θ̂. Our companion paper describes a parametric bootstrap method to obtain such confidence intervals for the parameters θ̂. Usually, identifying influential observations implies that more data need to be collected, at or around the influential datapoint(s).

Outliers. Like the test for influential observations, this objective procedure to detect outliers employs the jackknife resampling technique. (Testing for outliers is sometimes referred to as a test of discordancy; Collett, 1991.) The test is based on a desirable property of deviance—namely, its nestedness: Deviance can be used to compare different models for binomial data as long as they are members of the same family. Suppose a model M1 is a special case of model M2 (M1 is "nested within" M2), so M1 has fewer free parameters than M2. We denote the degrees of freedom of the models by v1 and v2, respectively. Let the deviance of model M1 be D1 and of M2 be D2. Then the difference in deviance, D1 − D2, has an approximate χ² distribution with v1 − v2 degrees of freedom. This approximation to the χ² distribution is usually very good even if each individual distribution, D1 or D2, is not reliably approximated by a χ² distribution (Collett, 1991); indeed, D1 − D2 has an approximate χ² distribution with v1 − v2 degrees of freedom even for binary data, despite the fact that, for binary data, deviance is not even asymptotically distributed according to χ² (Collett, 1991). This property makes this particular test of discordancy applicable to (small-sample) psychophysical data sets.

To test for outliers, we again denote the original data set by y and its deviance by D. In the terminology of the preceding paragraph, the fitted psychometric function ψ(x; θ̂) corresponds to model M1. Then, the jackknife is applied to y, and each jackknife data set y_(−j) is refit to give K parameter vectors θ̂_(−1), . . . , θ̂_(−K), from which to calculate deviance, yielding D_(−1), . . . , D_(−K). For each of the K jackknife parameter vectors θ̂_(−j), an alternative model M2 for the (complete) original data set y is constructed as

\[ M_2: \quad p_i = \begin{cases} \psi\!\left(x_i; \hat{\theta}_{(-j)}\right) & \text{if } x_i \ne x_j \\[4pt] \psi\!\left(x_i; \hat{\theta}_{(-j)}\right) + z & \text{if } x_i = x_j \end{cases} . \tag{13} \]

Setting z equal to y_j − ψ(x_j; θ̂_(−j)), the deviance of M2 equals D_(−j), because the jth datapoint, dropped during the jackknife, is perfectly fit by M2 owing to the addition of a dedicated free parameter, z.

To decide whether the reduction in deviance, D − D_(−j), is significant, we compare it against the χ² distribution with one degree of freedom, because v1 − v2 = 1. Choosing a one-sided 99% confidence interval, M2 is a better model than M1 if D − D_(−j) > 6.63, because CPE_χ²(6.63, 1) = .99. Obtaining a significant reduction in deviance for data set y_(−j) implies that the jth datapoint is so far away from the original fit ψ(x; θ̂) that the addition of a dedicated parameter z, whose sole function is to fit the jth datapoint, reduces overall deviance significantly. Datapoint j is thus very likely an outlier, and as in the case of influential observations, the best strategy generally is to gather additional data at stimulus intensity x_j, before more radical steps, such as removal of y_j from one's data set, are considered.
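The following NumPy/SciPy sketch outlines the jackknife outlier test (our illustration, not the authors' implementation; fit and psi are placeholders for one's own fitting routine and psychometric function, and the binomial deviance helper is the one sketched earlier).

```python
import numpy as np
from scipy.stats import chi2

def deviance(y, p, n):
    """Binomial deviance, with 0 * log(0) taken as 0."""
    y, p, n = np.asarray(y, float), np.asarray(p, float), np.asarray(n, float)
    t1 = np.where(y > 0, n * y * np.log(np.where(y > 0, y, 1) / p), 0.0)
    t2 = np.where(y < 1, n * (1 - y) * np.log(np.where(y < 1, 1 - y, 1) / (1 - p)), 0.0)
    return 2.0 * np.sum(t1 + t2)

def jackknife_outlier_test(x, y, n, fit, psi, alpha=0.01):
    """Flag block j when the deviance drop D - D_(-j) exceeds the one-sided
    chi-square(1) criterion (6.63 at the 99% level).  `fit(x, y, n)` must return
    a parameter vector and `psi(x, theta)` the model's predicted performance."""
    x, y, n = map(np.asarray, (x, y, n))
    theta_hat = fit(x, y, n)
    D = deviance(y, psi(x, theta_hat), n)
    crit = chi2.ppf(1 - alpha, df=1)                  # 6.63 for alpha = .01
    flagged = []
    for j in range(len(x)):
        keep = np.arange(len(x)) != j
        theta_j = fit(x[keep], y[keep], n[keep])      # refit the jackknife set y_(-j)
        D_j = deviance(y[keep], psi(x[keep], theta_j), n[keep])  # deviance of M2
        if D - D_j > crit:
            flagged.append(j)                         # block j is a likely outlier
    return flagged
```

The same loop also yields the jackknife parameter vectors needed for the influential-observation check; each refit theta_j would be compared against the bootstrap confidence interval for θ̂ described in the companion paper.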
Discussion

In the preceding sections, we introduced statistical tests to identify the following: first, inappropriate choice of F; second, perceptual learning; third, an objective test to identify influential observations; and finally, an objective test to identify outliers. The histograms shown in Figures 9D and 10D show the respective distributions r* to be skewed and not centered on zero. Unlike our summary statistic D, for which a large-sample approximation for binomial data with n_i > 1 exists even if its applicability is sometimes limited, neither of the correlation coefficient statistics has a distribution for which even a roughly correct asymptotic approximation can easily be found for the K, N, and x typically used in psychophysical experiments. Monte Carlo methods are thus without substitute for these statistics.

Figures 9 and 10 also provide another good demonstration of our warnings concerning the χ² approximation of deviance. It is interesting to note that, for both data sets, the Monte Carlo deviance histograms shown in Figures 9B and 10B, when compared against the asymptotic χ² distributions, have considerable ΔP_RMS values of 5.9 and 3.8, with ΔP_max = 9.3 and 5.8, respectively. Furthermore, the miss rate in Figure 9B is very high (P_M = .34). This is despite a comparatively large number of trials in total and per block (N = 500, n_i = 50), for both data sets. Furthermore, whereas the χ² is shifted toward higher deviances in Figure 9B, it is shifted toward lower deviance values in Figure 10B. This again illustrates the complex interaction between deviance and p.

SUMMARY AND CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the parameters of psychometric functions and derive estimates of thresholds and slopes. An essential part of the fitting procedure is an assessment of goodness of fit, in order to validate our estimates.

We have described a constrained maximum-likelihood algorithm for fitting three-parameter psychometric functions to psychophysical data. The third parameter, which specifies the upper asymptote of the curve, is highly constrained, but it can be shown to be essential for avoiding bias in cases where observers make stimulus-independent errors, or lapses. In our laboratory, we have found that the lapse rate for trained observers is typically between 0% and 5%, which is enough to bias parameter estimates significantly.

We have also described several goodness-of-fit statistics, all of which rely on resampling techniques to generate accurate approximations to their respective distribution functions or to test for influential observations and outliers. Fortunately, the recent sharp increase in computer processing speeds has made it possible to fulfill this computationally expensive demand. Assessing goodness of fit is necessary in order to ensure that our estimates of thresholds and slopes, and their variability, are generated from a plausible model for the data and to identify problems with the data themselves, be they due to learning, to uneven sampling (resulting in influential observations), or to outliers.

Together with our companion paper (Wichmann & Hill, 2001), we cover the three central aspects of modeling experimental data: parameter estimation, obtaining error estimates on these parameters, and assessing goodness of fit between model and data.

REFERENCES

Collett, D. (1991). Modeling binary data. New York: Chapman & Hall/CRC.
Dobson, A. J. (1990). Introduction to generalized linear models. London: Chapman & Hall.
Draper, N. R., & Smith, H. (1981). Applied regression analysis. New York: Wiley.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.
Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.
Forster, M. R. (1999). Model selection in science: The problem of language variance. British Journal for the Philosophy of Science, 50, 83-102.
Gelman, A. B., Carlin, J. S., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. New York: Chapman & Hall/CRC.
Hämmerlin, G., & Hoffmann, K.-H. (1991). Numerical mathematics (L. T. Schumacher, Trans.). New York: Springer-Verlag.
Harvey, L. O., Jr. (1986). Efficient estimation of sensory thresholds. Behavior Research Methods, Instruments, & Computers, 18, 623-632.
Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.
Hoel, P. G. (1984). Introduction to mathematical statistics. New York: Wiley.
Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.
Leek, M. R., Hanna, T. E., & Marshall, L. (1992). Estimation of psychometric functions from adaptive tracking procedures. Perception & Psychophysics, 51, 247-256.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman & Hall.
McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.
Nachmias, J. (1981). On the psychometric function for contrast detection. Vision Research, 21, 215-223.
O'Regan, J. K., & Humbert, R. (1989). Estimating psychometric functions in forced-choice situations: Significant biases found in threshold and slope estimation when small samples are used. Perception & Psychophysics, 45, 434-442.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). New York: Cambridge University Press.
Quick, R. F. (1974). A vector magnitude model of contrast detection. Kybernetik, 16, 65-67.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
Swanson, W. H., & Birch, E. E. (1992). Extracting thresholds from noisy psychophysical data. Perception & Psychophysics, 51, 409-422.
Treutwein, B. (1995). Adaptive psychophysical procedures. Vision Research, 35, 2503-2522.
Treutwein, B., & Strasburger, H. (1999). Fitting the psychometric function. Perception & Psychophysics, 61, 87-106.
Watson, A. B. (1979). Probability summation over time. Vision Research, 19, 515-522.
Weibull, W. (1951). Statistical distribution function of wide applicability. Journal of Applied Mechanics, 18, 292-297.
Wichmann, F. A. (1999). Some aspects of modelling human spatial vision: Contrast discrimination. Unpublished doctoral dissertation, Oxford University.
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: II. Bootstrap-based confidence intervals and sampling. Perception & Psychophysics, 63, 1314-1329.

NOTES

1. For illustrative purposes, we shall use the Weibull function for F (Quick, 1974; Weibull, 1951). This choice was based on the fact that the Weibull function generally provides a good model for contrast discrimination and detection data (Nachmias, 1981) of the type collected by one of us (F.A.W.) over the past few years. It is described by

\[ F(x; \alpha, \beta) = 1 - \exp\!\left[ -\left( \frac{x}{\alpha} \right)^{\beta} \right], \qquad 0 \le x < \infty . \]

2. Often, particularly in studies using forced-choice paradigms, λ does not appear in the equation, because it is fixed at zero. We shall illustrate and investigate the potential dangers of doing this.

3. The simplex search method is reliable but converges somewhat slowly. We choose to use it for ease of implementation: first, because of its reliability in approximating the global minimum of an error surface, given a good initial guess, and second, because it does not rely on gradient descent and is therefore not catastrophically affected by the sharp increases in the error surface introduced by our Bayesian priors (see the next section). We have found that the limitations on its precision (given error tolerances that allow the algorithm to complete in a reasonable amount of time on modern computers) are many orders of magnitude smaller than the confidence intervals estimated by the bootstrap procedure, given psychophysical data, and are therefore immaterial for the purposes of fitting psychometric functions.

4. In terms of Bayesian terminology, our prior W(λ) is not a proper prior density, because it does not integrate to 1 (Gelman, Carlin, Stern, & Rubin, 1995). However, it integrates to a positive finite value that is reflected, in the log-likelihood surface, as a constant offset that does not affect the estimation process. Such prior densities are generally referred to as unnormalized densities, distinct from the sometimes problematic improper priors that do not integrate to a finite value.

5. See Treutwein and Strasburger (1999) for a discussion of the use of beta functions as Bayesian priors in psychometric function fitting. Flat priors are frequently referred to as (maximally) noninformative priors in the context of Bayesian data analysis, to stress the fact that they ensure that inferences are unaffected by information external to the current data set (Gelman et al., 1995).

6. One's choice of prior should respect the implementation of the search algorithm used in fitting. Using the flat prior in the above example, an increase in λ from .06 to .0600001 causes the maximization term to jump from zero to negative infinity. This would be catastrophic for some gradient-descent search algorithms. The simplex algorithm, on the other hand, simply withdraws the step that took it into the "infinitely unlikely" region of parameter space and continues in another direction.

7. The log-likelihood error metric is also extremely sensitive to very low predicted performance values (close to 0). This means that, in yes/no paradigms, the same arguments will apply to assumptions about the lower bound as those we discuss here in the context of λ. In our 2AFC examples, however, the problem never arises, because γ is fixed at .5.

8. In fact, there is another reason why λ needs to be tightly constrained: It covaries with α and β, and we need to minimize its negative impact on the estimation precision of α and β. This issue is taken up in our Discussion and Summary section.

9. In this case, the noise scheme corresponds to what Swanson and Birch (1992) call "extraneous noise." They showed that extraneous noise can bias threshold estimates, in both the method of constant stimuli and adaptive procedures, with the small numbers of trials commonly used within clinical settings or when testing infants. We have also run simulations to investigate an alternative noise scheme, in which λ_gen varies between blocks in the same dataset: A new λ_gen was chosen for each block from a uniform random distribution on the interval [0, .05]. The results (not shown) were not noticeably different, when plotted in the format of Figure 3, from the results for a fixed λ_gen of .02 or .03.

10. Uniform random variates were generated on the interval (0, 1), using the procedure ran2() from Press et al. (1992).

11. This is a crude approximation only; the actual value depends heavily on the sampling scheme. See our companion paper (Wichmann & Hill, 2001) for a detailed analysis of these dependencies.

12. In rare cases, underdispersion may be a direct result of observers' behavior. This can occur if there is a negative correlation between individual binary responses and the order in which they occur (Collett, 1991). Another hypothetical case occurs when observers use different cues to solve a task and switch between them on a nonrandom basis during a block of trials (see the Appendix for proof).

13. Our convention is to compare deviance, which reflects the probability of obtaining y given θ̂, against a distribution of probability measures of y*_1 . . . y*_B, each of which is also calculated assuming θ̂. Thus, the test assesses whether the data are consistent with having been generated by our fitted psychometric function; it does not take into account the number of free parameters in the psychometric function used to obtain θ̂. In these circumstances, we can expect, for suitably large data sets, D to be distributed as χ² with K degrees of freedom. An alternative would be to use the maximum-likelihood parameter estimate for each simulated data set, so that our simulated deviance values reflect the probabilities of obtaining y*_1 . . . y*_B given θ̂*_1 . . . θ̂*_B. Under the latter circumstances, the expected distribution has K − P degrees of freedom, where P is the number of parameters of the discrepancy function (which is often, but not always, well approximated by the number of free parameters in one's model—see Forster, 1999). This procedure is appropriate if we are interested not merely in fitting the data (summarizing, or replacing, data by a fitted function), but in modeling data, or model comparison, where the particulars of the model(s) itself are of interest.

14. One of several ways we assessed convergence was to look at the quantiles .01, .05, .1, .16, .25, .5, .75, .84, .9, .95, and .99 of the simulated distributions and to calculate the root mean square (RMS) percentage change in these deviance values as B increased. An increase from B = 500 to B = 500,000, for example, resulted in an RMS change of approximately 2.8%, whereas an increase from B = 10,000 to B = 500,000 gave only 0.25%, indicating that for B = 10,000, the distribution has already stabilized. Very similar results were obtained for all sampling schemes.

15. The differences are even larger if one does not exclude datapoints for which model predictions are p = 0 or p = 1.0, because such points have zero deviance (zero variance). Without exclusion of such points, χ²-based assessment systematically overestimates goodness of fit. Our Monte Carlo goodness-of-fit method, on the other hand, is accurate whether such points are removed or not.

16. For this statistic, it is important to remove points with y = 1.0 or y = 0.0 to avoid errors in one's analysis.

17. Jackknife data sets have negative indices inside the brackets as a reminder that the jth datapoint has been removed from the original data set in order to create the jth jackknife data set. Note the important distinction between the more usual connotation of "jackknife," in which single observations are removed sequentially, and our coarser method, which involves removal of whole blocks at a time. The fact that observations in different blocks are not identically distributed and that their generating probabilities are parametrically related by ψ(x; θ̂) may make our version of the jackknife unsuitable for many of the purposes (such as variance estimation) to which the conventional jackknife is applied (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1993).

APPENDIX
Variance of Switching Observer

Assume an observer with two cues, c_1 and c_2, at his or her disposal, with associated success probabilities p_1 and p_2, respectively. Given N trials, the observer chooses to use cue c_1 on Nq of the trials and c_2 on N(1 − q) of the trials. Note that q is not a probability, but a fixed fraction: The observer always uses c_1 on exactly Nq of the trials. The expected number of correct responses of such an observer is

\[ E_s = N \left[ q p_1 + (1 - q) p_2 \right] . \tag{A1} \]

The variance of the responses around E_s is given by

\[ \sigma_s^2 = N \left[ q p_1 (1 - p_1) + (1 - q) p_2 (1 - p_2) \right] . \tag{A2} \]

Binomial variance, on the other hand, with E_b = E_s and, hence, p_b = q p_1 + (1 − q) p_2, equals

\[ \sigma_b^2 = N p_b (1 - p_b) = N \left[ q p_1 + (1 - q) p_2 \right] \left[ q (1 - p_1) + (1 - q)(1 - p_2) \right] . \tag{A3} \]

For q = 0 or q = 1, Equations A2 and A3 reduce to σ_s² = σ_b² = N p_2(1 − p_2) and σ_s² = σ_b² = N p_1(1 − p_1), respectively. However, simple algebraic manipulation shows that for 0 < q < 1, σ_s² < σ_b² for all p_1, p_2 ∈ [0, 1] if p_1 ≠ p_2. Thus, the variance of such a "fixed-proportion-switching" observer is smaller than that of a binomial distribution with the same expected number of correct responses. This is an example of underdispersion that is inherent in the observer's behavior.
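As a quick numerical illustration of this result (the example values of N, q, p_1, and p_2 are ours, not from the paper), the following snippet evaluates Equations A2 and A3 for one parameter choice and confirms the underdispersion.

```python
def switching_variance(N, q, p1, p2):
    """Variance of a fixed-proportion-switching observer (Equation A2)."""
    return N * (q * p1 * (1 - p1) + (1 - q) * p2 * (1 - p2))

def binomial_variance(N, q, p1, p2):
    """Binomial variance at the matched success probability p_b (Equation A3)."""
    p_b = q * p1 + (1 - q) * p2
    return N * p_b * (1 - p_b)

# Example: N = 100 trials, q = .4, p1 = .6, p2 = .9
print(switching_variance(100, 0.4, 0.6, 0.9))   # 15.0
print(binomial_variance(100, 0.4, 0.6, 0.9))    # 17.16, i.e. larger than 15.0
```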

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)