A method to integrate and classify normal distributions

Abhranil Das ∗1,3,4 and Wilson S Geisler2,3,4

1Department of Physics, The University of Texas at Austin 2Department of Psychology, The University of Texas at Austin 3Center for Perceptual Systems, The University of Texas at Austin 4Center for Theoretical and Computational Neuroscience, The University of Texas at Austin

July 29, 2021

Abstract performance (behavior) of humans, other animals, neural circuits or algorithms. They also serve as a starting point in developing Univariate and multivariate normal probability distribu- other models/theories that describe sub-optimal performance of tions are widely used when modeling decisions under uncer- agents. tainty. Computing the performance of such models requires integrating these distributions over specic domains, which To compute the performance predicted by such theories it is can vary widely across models. Besides some special cases necessary to integrate the normal distributions over specic do- where these integrals are easy to calculate, there exist no gen- mains. For example, a particularly common task in vision science eral analytical expressions, standard numerical methods or is classication into two categories (e.g. detection and discrim- software for these integrals. Here we present mathematical ination tasks). The predicted maximum accuracy in such tasks results and open-source software that provide (i) the proba- is determined by integrating normals over domains dened by bility in any domain of a normal in any dimensions with any a quadratic decision boundary.2,3 Predicted accuracy of some of parameters, (ii) the probability density, cumulative distribu- the possible sub-optimal models is determined by integrating over tion, and inverse cumulative distribution of any function of a other domains. normal vector, (iii) the classication errors among any num- Except for some special cases4,5 (e.g. multinormal probabilities ber of normal distributions, the Bayes-optimal discriminabil- ity index and relation to the operating characteristic, (iv) di- in rectangular domains, such as when two normals have equal mension reduction and visualizations for such problems, and covariance and the optimal classication boundary is at), there (v) tests for how reliably these methods may be used on given exists no general analytical expression for these integrals, and we data. We demonstrate these tools with vision research appli- must use numerical methods, such as integrating over a Cartesian cations of detecting occluding objects in natural scenes, and grid, or Monte-Carlo integration. detecting camouage. Since the tails o innitely outwards, it is Keywords: multivariate normal, integration, classication, inecient to numerically integrate it over a nite uniform Carte- signal detection theory, Bayesian ideal observer, vision sian grid, which would be large and collect ever-reducing masses outwards, yet omit some mass wherever the grid ends. Also, if the normal is elongated by unequal and strong covariances, 1 Introduction or the integration domain is complex and non-contiguous, naïve integration grids will waste resources in regions or directions that The univariate or multivariate normal (henceforth called sim- have low density or are outside the domain. One then needs to arXiv:2012.14331v7 [stat.ML] 27 Jul 2021 ply ‘normal’) is arguably the most important and widely-used visually inspect and arduously hand-tailor the integration grid to . It is frequently used because various t the shape of each separate problem. central-limit theorems guarantee that normal distributions will Monte-Carlo integration involves sampling from the multinor- occur commonly in nature, and because it is the simplest and most mal, then counting the fraction of samples in the integration do- tractable distribution that allows arbitrary correlations between main. 
This does not have the above ineciencies or problems, but the variables. has other issues. Unlike grid integration, the desired precision Normal distributions form the basis of many theories and mod- cannot be specied, but must be determined by measuring the els in the natural and social sciences. For example, they are the spread across multiple runs. Also, when the probability in the in- foundation of Bayesian statistical decision/classication theories 1 tegration domain is very small (e.g. to compute the classication using Gaussian discriminant analysis, and are widely applied in error rate or discriminability d0 for highly-separated normal dis- diverse elds such as vision science, neuroscience, probabilistic tributions), it cannot be reliably sampled without a large number planning in robotics, psychology, and economics. These theo- of samples, which costs resource and time (see the performance ries specify optimal performance under uncertainty, and are of- benchmark section for a comparison). ten used to provide a benchmark against which to evaluate the Thus, there is no single analytical expression, numerical ∗[email protected] method or standard software tool to quickly and accurately in-

1 tegrate arbitrary normals over arbitrary domains, or to compute dimensional sd, since it linearly scales the normal (like σ in 1d), classication errors and the discriminability index d0. Evaluat- and its eigenvectors and values are the axes of the 1 sd error el- ing these quantities is often simplied by making the limiting lipsoid. We rst invert this transform to standardize the normal: assumption of equal . This impedes the quick testing, z = S−1(x − µ). This decorrelates or ‘whitens’ the variables, comparison, and optimization of models. Here we describe a and transforms the integration domain to a dierent quadratic: mathematical method and accompanying software implementa- 0 0 tion which provides functions to (i) integrate normals with arbi- q˜(z) = z Qe 2z + q˜1z +q ˜0 > 0, with trary means and covariances in any number of dimensions over Qe 2 = SQ2S, arbitrary domains, (ii) compute the pdf, cdf and inverse cdf of (2) q˜ = 2SQ µ + Sq , any function of a multinormal variable (normal vector), and (iii) 1 2 1 compute the performance of classifying amongst any number of q˜0 = q(µ). normals. This software is available as a Matlab toolbox ‘Inte- Now the problem is to nd the probability of the standard normal grate and classify normal distributions’, and the source code is z in this domain. If there is no quadratic term Qe 2, q˜(z) is nor- at github.com/abhranildas/IntClassNorm. mally distributed, the integration domain boundary is a at, and We rst review and assimilate previous mathematical results the probability is Φ( q˜0 ), where Φ is the standard normal cdf.4 into a generalized chi-squared method that can integrate arbi- kq˜1k 0 trary normals over quadratic domains. Then we present a novel Otherwise, say Qe 2 = RDR is its eigen-decomposition, where 0 ray-trace method to integrate arbitrary normals over any domain, R is orthogonal, i.e. a rotoreection. So y = R z is also standard and consequently to compute the distribution of any real-valued normal, and in this space the quadratic is: function of a normal vector. We describe how these results can qˆ(y) = y0Dy + b0y +q ˜ (b = R0q˜ ) be used to compute error rates (and other relevant quantities) 0 1 X 2  X for Bayes-optimal and custom classiers, given arbitrary priors = Diyi + biyi + bi0 yi0 +q ˜0 and outcome cost matrix. We then present some methods to re- i i0 duce problems to fewer dimensions, for analysis or visualization. (i and i0 index the nonzero and zero eigenvalues) Next, we provide a way to test whether directly measured sam-  2  2 X bi X X bi ples from the actual distributions in a classication problem are = Di yi + + bi0 yi0 +q ˜0 − 2Di 2Di close enough to normal to trust the computations from the tool- i i0 i box. After describing the methods and software toolbox with ex- X 02 = Di χ 2 + x, amples, we demonstrate their accuracy and speed across a variety 1,(bi/2Di) i of problems. We show that for quadratic-domain problems both the generalized chi-squared method and the ray-trace method are a weighted sum of non-central chi-square variables χ02, each with accurate, but vary in relative speed depending on the particular 1 degree of freedom, and a normal variable x ∼ N(m, s). So this 2 problem. Of course, for domains that are not quadratic, only the is a generalized chi-square variable χ˜w,k,λ,m,s, where we merge ray-trace method applies. 
Finally, we illustrate the methods with the non-central chi-squares with the same weights, so that the two applications from our laboratory: modeling detection of oc- vector of their weights w are the unique nonzero eigenvalues cluding targets in natural scenes, and detecting camouage. among Di, their degrees of freedom k are the numbers of times the eigenvalues occur, and their non-centralities, and normal pa- rameters are: 2 Integrating the normal s 1 X 2 X 2 λj = 2 bi , m = q(µ) − w.λ, s = bi0 . 2.1 In quadratic domains: the generalized chi- 4wj 0 i:Di=wj i square method The required probability, p χ˜2 > 0, is now a 1d integral, com- Integrating the normal in quadratic domains is important for putable using, say, Ruben’s6 or Davies’7 methods. We use the computing the maximum possible classication accuracy. The Matlab toolbox ‘Generalized chi-square distribution’ that we de- problem is the following: given a column vector x ∼ N(µ, Σ), veloped (source code is at github.com/abhranildas/gx2), which nd the probability that can compute the generalized chi-square parameters correspond- 0 0 ing to a quadratic form of a normal vector, its , cdf (using q(x) = x Q2x + q1x + q0 > 0. (1) three dierent methods), pdf, inverse cdf and random numbers. (Here and henceforth, bold upper-case symbols represent ma- Previous software implements specic forms of this theory for trices, bold lower-case symbols represent vectors, and regular particular quadratics such as ellipsoids.5 The method described lower-case symbols represent scalars.) here correctly handles all quadratics (ellipsoids, hyperboloids, This can be viewed as the multi-dimensional integral of the paraboloids and degenerate conics) in all dimensions. normal probability over the domain q(x) > 0 (that we call the ‘normal probability view’), or the single-dimensional integral of 2.2 In any domain: the ray-trace method the probability of the scalar quadratic function q(x) of a normal vector, above 0 (the ‘function probability view’). We present below our method to integrate the normal distribu- Note that x = Sz + µ, where z is standard multinormal, and tion in an arbitrary domain, which takes an entirely dierent ap- 1 the symmetric square root S = Σ 2 may be regarded as the multi- proach than the generalized chi-square method. The overview of

2 the method is as follows. We rst standardize the normal to make a it spherically symmetric, then we integrate it in spherical polar coordinates, outwards from the center. We rst calculate the ra- n dial integral by sending ‘rays’ from the center to trace out the inte- gration domain in every direction, i.e. determine the points where each ray crosses into and out of the domain (akin to the computer dn graphics method of ray-tracing, that traces light rays outward from the projection center to compute where it hits the dierent z2 objects it has to render). By knowing these crossing points, we then calculate the probability amount on each ray. Then we add up these probabilities over all the angles. This method of break- ing up the problem produces fast and accurate results to arbitrary ~ tolerance for all problem shapes, without needing any manual ad- f(z)>0 justment.

z1 2.2.1 Standard polar form The problem is to nd the probability that f(x) > 0, where f(x) is a suciently general function (with a nite number of zeros in any direction within the integration span around the normal mean, i.e. without rare pathologies such as the Dirichlet function with innite zeros in any interval). As before, we rst standardize the space to obtain f˜(z) = f(Sz + µ). Then we switch to polar axis-angle coordinates z and n: any point z = zn, where the unit ~ b f (z ) vector n denotes the angle of that point, and z is its coordinate ray n φk along the axis in this direction. Then the integral can be written as:

Z 2 Z Z 2 z z − k − z − k − z k−1 1 2 (2π) 2 e 2 dz = dn (2π) 2 e 2 z dz . ˜ ˜ Ω n Ωn Figure 1: Method schematic. a. Standard normal error ellipse is blue. | {z } axial integral Arrow indicates a ray from it at angle n in an angular slice dn, crossing the gray integration domain f˜(z) > 0 at z1 and z2. b. 1d slice of this ˜ where Ω˜ is the domain where f(z) > 0, and Ω˜ n is its slice along picture along the ray. The standard normal density along a ray is blue. ˜ ˜ the axis n, i.e. the intervals along the axis where the axial domain fn(z) is the slice of the domain function f(z) along the ray, crossing 0 ˜ ˜ at z1 and z2. function fn(z) = f(zn) > 0. This may be called the ‘standard polar form’ of the integral. dn is the dierential angle element (dθ in 2d, sin θ dθ dφ in 3d etc). domain function does not matter). That is, we specify whether or not the beginning of the ray (at −∞) is inside the domain, and 2.2.2 Integration domain on a ray all the points at which the ray crosses the domain. We denote the ˜ First let us consider the axial integration along direction n. Imag- rst by the initial sign ψ(n) = sign(fn(−∞)) = 1/−1/0 if the ine that we ‘trace’ the integration domain with an axis through ray begins inside/outside/grazing the integration domain. For a the origin in this direction (a bi-directional ‘ray’ marked by the quadratic domain, for example: arrow in g.1a), i.e. determine the part of this ray axis that is in  ˜ sign (˜q2 (n)) , if q˜2(n) 6= 0, the integration domain, dened by fn(z) > 0. For example, if the  integration domain is a quadratic such as eq.2, its 1d trace by the ψ(n) = sign (˜qn (−∞)) = −sign (˜q1 (n)) , if q˜2(n) = 0,  ray is given by: sign (˜q0) , if q˜2(n) =q ˜1(n) = 0.

0 2 0 ˜ q˜n(z) =q ˜(zn) = n Qe 2nz + q˜1nz +q ˜0 The crossing points are the zeros zi(n) of fn(z) = f(zSn+µ) 2 =q ˜2(n) z +q ˜1(n) z +q ˜0 > 0. (zin are then the boundary points in the full space). For a quadratic domain q˜n(z), these are simply its roots. For a general This is a scalar quadratic domain in z that varies with the direc- domain, the zeros are harder to compute. Chebyshev polynomial tion. Fig.1b is an example of such a domain. The ray domain approximations8 aim to nd all zeros of a general function, but ˜ function fn crosses 0 at z1 and z2, and the integration domain is can be slow. Other numerical algorithms can nd all function below z1 (which is negative), and above z2. zeros in an interval to arbitrary accuracy. We use such an algo- ˜ Note that a sucient description of such domains on an axis is rithm to nd the zeros of fn(z) within (−m, m). This amounts to specify all the points at which the domain function crosses zero, to ray-tracing f(x) within a Mahalanobis distance m of the nor- and its overall sign, which determines which regions are within mal. The error in the integral due to this approximation is there- and which are outside the domain (so any overall scaling of the fore < 2Φ(¯ m), where Φ¯ is the complementary cdf of the standard

3 normal. inside the domain (ψ = 1), it stays inside, and the probability con- In g.1, the initial sign along the ray is 1, and z and z are the tent is 2dn . If it begins and stays outside (ψ = −1), it is 0. And if 1 2 Ωk crossing points. it grazes the domain throughout (ψ = 0), half of the angular vol- Most generally, this method can integrate in any domain for ume is inside the domain and half is outside, so the probability is which we can return its ‘trace’ (i.e. the initial sign and crossing dn . So without accounting for roots, the probability in general is Ωk points) along any ray n through any origin o. So if a domain is al- ψ(n)+1 . To this we add, sequentially for each root, the probability Ωk ready supplied in the form of these ‘ray-trace’ functions ψ(o, n) from the root to ∞, signed according to whether we are entering and zi(o, n), our method can readily integrate over it. For ex- or exiting the domain at that root. So we have, for g.1, ample, the ray-trace function of the line y = k in 2d returns k−oy  ray ray  ψ = −sign(ny) and z = . When supplied with quadratic 2 2Φ¯ (z1) 2Φ¯ (z2) ny dp (n) = − k + k dn. domain coecients, or a general implicit domain f(x) > 0, the Ωk Ωk Ωk toolbox ray-traces it automatically under the hood. For an im- plicit domain, the numerical root-nding works only in a nite The sign of the rst root term is always opposite to ψ, and subse- interval, and is slower and may introduce small errors. So, if pos- quent signs alternate as we enter and leave the domain. In general sible, a slightly faster and more accurate alternative to the implicit then, we can write: domain format is to directly construct its ray-trace function by " # X dn hand. dp (n) = ψ(n) + 1 + 2ψ(n) (−1)i Φ¯ ray (z (n)) k i Ω i k | {z } 2.2.3 Standard normal distribution on a ray α(n) In order to integrate over piecewise intervals of z such as g.1b, Thus, the axial integral is α(n) . The total probability we shall rst calculate the semi-denite integral up to some z, Ωk 1 R α(n) dn can be computed, for up to 3d, by numerically then stitch them together over the intervals with the right signs. Ωk Consider the probability in the angular slice dn below some integrating α(n) over a grid of angles spanning half the angu- negative z such as z1 in g.1a. Note that the probability of a lar space (since we account for both directions of a ray), using standard normal beyond some radius is given by the chi distribu- any standard scheme. An adaptive grid can match the shape of tion. If Ωk is the total angle in k dimensions (2 in 1d, 2π in 2d, 4π the integration boundary (ner grid at angles where the bound- in 3d), and Fχk (x) is the cdf of the chi distribution with k degrees ary is sharply changing), and also set its neness to evaluate of freedom, we have: the integral to a desired absolute or relative precision. Fig.2a, top, illustrates integrating a trivariate normal with arbitrary co- z<0 Z k z2 − 2 − 2 k−1 variance in an implictly-dened toroidal domain ft(x) = a − Ωk (2π) e z dz = 1 − Fχk (|z|). 2 −∞  p 2 2 2 b − x1 + x2 − x3 > 0. So the probability in the angular slice dn below a negative z is Beyond 3d, we can use Monte Carlo integration over the angles. dn [1 − Fχ (|z|)] . 
Now, for the probability in the angular slice We draw a sample of random numbers from the standard multi- k Ωk below a positive z (such as z2), we need to add two probabili- normal in those dimensions, then normalize their magnitudes, to ties: that in the nite cone from the origin to the point, which is get a uniform random sample of rays n, over which the expec- dn Fχ (z) , and that in the entire semi-innite cone on the nega- tation hα(n)i/2 is the probability estimate. Fig.2b shows the k Ωk dn dn tive side, which is , to obtain [1 + Fχ (z)] . Thus, the prob- computation of the 4d standard normal probability in the domain Ωk k Ωk P4 ability in an angular slice dn below a positive or negative z is fp(x) = i=1|xi| < 1, a 4d extension of a regular octahedron dn with plane faces meeting at sharp edges. [1 + sign(z)Fχk (|z|)] Ω . We normalize this by the total proba- k Since the algorithm already computes the boundary points over bility in the angular slice, 2 dn , to dene the distribution of the Ωk its angular integration grid, they may be stored for plotting and standard normal along a ray: Φray(z) = [1 + sign(z)F (|z|)] /2. k χk inspecting the boundary. Rather than an adaptive integration grid Its density is found by dierentiating: φray(z) = f (|z|)/2, so it k χk though, boundaries are often best visualized over a uniform grid is simply the chi distribution symmetrically extended to negative (uniform array of angles in 2D, or a Fibonacci sphere in 3D9), φray(z) = φ(z) numbers. Notice that 1 , but in higher dimensions which we can explicitly supply for this purpose. it rises, then falls outward (g.1b), due to the opposing eects of the density falling but the volume of the angular slice growing outward. Since Matlab does not yet incorporate the chi distribu- 2.2.5 Set operations on domains tion, we instead dene, in terms of the chi-square distribution, Some applications require more complex integration or ray h 2 i ray 2 Φ (z) = 1 + sign(z)F 2 (z ) /2 and φ (z) = |z|f 2 (z ). classication domains built using set operations (inver- k χk k χk sion/union/intersection) on simpler domains. With implicit domain formats this is easy. For example, if fA(x) > 0 and 2.2.4 Probability in an angular slice c fB(x) > 0 dene two domains A and B, then A , A ∩ B, and c We can now write the total probability in the angular slice of g.1 A ∪ B are described by −fA(x) > 0, min(fA(x), fB(x)) > 0 as the sum of terms accounting for the initial sign and each root. and max(fA(x), −fB(x)) > 0 respectively. Fig.2a, bottom, The total volume fraction of the double cone is 2dn . Now rst illustrates integrating a 2d normal in a domain built by the union Ωk consider only the initial sign and no roots. Then if the ray starts of two circles.

4 a p = 0.03 b p ≈ 0.0148 c p = 0.41 d 10 2 0 15 25 p(e) = 0.008 .017 -2 = 4.83 1 7 p 1 1 = 5.11 7 x = 4.45

cos x 2 .015 2 -7 = x

2 l = 0 3 p = 0.97 f .013 ’ 101 102 103 104 -15 p (e) = 0.013 -7 7 -15 10 -3 f = x sin x x -10 10 -6 6 sample size of rays 1 1 2 1 e p(e) = 0.13/0.26/0.10 f p(e) = 0.23 g p = 0.13 h p(e) = 0.04/0.06 15 10

-30 0 40 q(x) l = 0 p(e) = 0.14/0.22/0.17 γ = 0

30 -20 0 -10 15 -20 15 -10 10 β(x) (1,1,1,1) axis

Figure 2: Toolbox outputs for some integration and classication problems. a. Top: the probability of a 3d normal (blue shows 1 sd error ellipsoid) in an implicit toroidal domain ft(x) > 0. Black dots are boundary points within 3 sd traced by the ray method, across Matlab’s adaptive integration grid over angles. Inset: pdf of ft(x) and its integrated part (blue overlay). Bottom: integrating a 2d normal (blue error ellipse) in a domain built by P4 the union of two circles. b. Estimates of the 4d standard normal probability in the 4d polyhedral domain fp(x) = i=1|xi| < 1 using the ray-trace method with Monte Carlo ray-sampling, across 5 runs, converging with growing sample size of rays. Inset: pdf of fp(x) and its integrated part. c. Left: heat map of joint pdf of two functions of a 2d normal, to be integrated over the implicit domain f1 − f2 > 1 (overlay). Right: corresponding integral of the normal over the domain h(x) = x1 sin x2 − x2 cos x1 > 1 (blue regions), ‘traced’ up to 3 sd (black dots). Inset: pdf of h(x) and 0 its integrated part. d. Classifying two 2d normals using the optimal boundary l (which yields the Bayes-optimal discriminability db), and a custom 0 0 linear boundary. de and da are approximate discriminability indices. e. Classication based on samples (dots) from non-normal distributions. Filled ellipses are error ellipses of tted normals. γ is an optimized boundary between the samples. The three error rates are: of the normals with l, of the samples with l, and of the samples with γ. f. Classifying several 2d normals with arbitrary means and covariances. g. Top: 1d projection of a 4d normal integral over a quadratic domain q(x) > 0. Bottom: Projection of the classication of two 4d normals based on samples, with unequal priors, and unequal outcome values (correctly classifying the blue class is valued 4 times the red, hence the optimal criterion is shifted), onto the axis of the Bayes decision variable β. Histograms and smooth curves are the projections of the samples and the tted normals. The sample-optimized boundary γ = 0 cannot be uniquely projected to this β axis. h. Classication based on four 4d non-normal samples, with dierent priors and outcome values, projected on the axis along (1,1,1,1). The boundaries cannot be projected to this axis.

As we noted before, computations are faster and more accu- x. But in the function probability view, f(x) is instead seen as a rate when domains are supplied in explicit ray-trace form, than mapping from the multi-dimensional variable x to a scalar, which as implicit functions. The toolbox provides functions to convert can be considered a decision variable. Hence, integrating the nor- quadratic and general implicit domains to ray-trace format, and mal in the multi-dimensional domain f(x) > 0 corresponds to functions to use set operations on these to build complex ray-trace integrating the 1d pdf of the decision variable f(x) beyond 0. It domains. For example, when a domain is inverted, only the initial is helpful to plot this 1d pdf, especially when there are too many sign of a ray through it ips, and for the intersection of several dimensions of x to visualize the normal probability view. domains, the initial sign of a ray is the minimum of its individual Conversely, given any scalar function f(x) of a normal, its cdf, initial signs, and the roots are found by collecting those roots of Ff (c) = p(f(x) < c), is computed as the normal probability in each domain where every other domain is positive. the domain c − f(x) > 0. Dierentiating this gives us the pdf. (If it is a quadratic function, its generalized chi-square pdf can also 2.2.6 Probabilities of functions of a normal vector be computed by convolving the constituent noncentral chi-square pdf’s.) Figs.2a-c and g show 1d pdf’s of functions computed in We previously mentioned the equivalent ‘normal probability’ and this way. Also, inverting the function cdf using a numerical root- ‘function probability’ views of conceptualizing a normal integral. nding method gives us its inverse cdf. So far we have mostly used the normal probability view, seeing scalar functions f(x) as dening integral domains of the normal With these methods to obtain the pdf, cdf and inverse cdf of

5 functions of a normal vector, we can conveniently compute cer- mated classes): tain quantities. For example, if x and y are jointly normal with µx = 1, µy = 2, σx = .1, σy = .2, and ρxy = .8, we can compute hv(A|x)i 0 0 ln = β(x) = x Q2x + q x + q0 > 0, where: the pdf, cdf and inverse cdf of the function xy, and determine, say, hv(B|x)i 1 that its mean, and sd are respectively 1.03, 1 and 0.21.

The probability of a vector (multi-valued) function of the nor- 1 −1 −1   Q2 = Σ − Σa , (3) mal, e.g. f(x) = f1(x) f2(x) , in some f-domain (which 2 b may also be seen as the joint probability of two scalar func- −1 −1 q1 = Σa µa − Σb µb, tions), is again the normal probability in a corresponding x-   1 0 −1 0 −1 |Σb| pava domain. For example, the joint cdf Ff (c1, c2) is the function q0 = µbΣb µb − µaΣa µa + ln + ln . 2 |Σa| pbvb probability in an explicit domain: p (f1 < c1, f2 < c2), and can be computed as the normal probability in the intersection of This quadratic β(x) is the Bayes classier, or the Bayes deci- the x-domains f1(x) < c1 and f2(x) < c2, i.e. the domain sion variable that, when compared to 0, maximizes expected gain. min (c1 − f1 (x) , c2 − f2 (x)) > 0. Numerically computing ∂ ∂ When V = 1, the Bayes decision variable is the log posterior Ff (c1, c2) then gives the joint pdf of the vector function. ∂c1 ∂c2 Fig.2c, left, is an example of a joint pdf of two functions of a ratio, and this decision rule minimizes overall error. The error  10 −7 rates of dierent types, i.e. true and false positives and nega- bivariate normal with µ = −2 5 and Σ = , com- −7 10 tives, are then the probabilities of the normals on either side of puted in this way. the quadratic boundary β(x) = 0. These probabilities can be computed entirely numerically using the ray-trace method, or we The probability of such a vector function in an implicit domain, can rst arrive at mathematical expressions using the generalized i.e. p (g (f) > 0), is computed as the normal probability in the chi-square method (as follows), which are then numerically eval- implicit domain: p (h (x) > 0) where h = g ◦ f. Fig.2c illus- uated. The overall error p(e) is the prior-weighted sum of the trates the function probability and normal probability views of error rates of each normal. the implicit integral p(h = x sin x − x cos x > 1). The 83rd 1 2 2 1 Further, when priors are equal, the Bayes decision variable is percentile of this function h (using the inverse cdf) is 4.87. the log likelihood ratio (of a vs b), which can be called l(x).

3.1.1 Single-interval (yes/no) task 3 Classifying normal samples Consider a yes/no task where the stimulus x comes from one of two equally likely 1d normals a and b with means µa, µb and sd’s σa > σb (g.3a). The optimal decision (eq.3) is to pick a if the Suppose observations come from several normal distributions a b l a (x) > 0 Bayes decision variable (log likelihood ratio of vs ) b , with parameters µi, Σi and priors pi, and the outcome values i.e. if (rewards and penalties) of classifying them are represented in a  2  2 x − µb x − µa σb matrix V: vij is the value of classifying a sample from i as j. − + 2 ln > 0. σb σa σa If the true class is i, selecting i over others provides a relative P l a (x) is a scaled and shifted 1 dof noncentral chi-square for each value gain of vi := vii − j6=i vij. Given a sample x, the ex- b pected value gain of deciding i is therefore hv(i|x)i = p(i|x)vi = class (g.3b), and the Bayes error rates are: p(x|i) pivi. The Bayes-optimal decision is to assign each sample  02 2   02 2  to the class that maximizes this gain, or its log: p (B|a) = p χ 2 < σb c , p (A|b) = p χ 2 > σac , 1,σaλ 1,σb λ  2 2 ln σa µa − µb σb 1 0 −1 pivi where λ = 2 2 , c = λ + 2 2 , lnhv(i|x)i = − (x − µi) Σ (x − µi) + ln . σa − σb σa − σb i p k 2 |Σi|(2π) (4) and p(e) is their average.

When the outcome value is simply the correctness of classi- cation, V = 1 (so each vi = 1), then this quantity is the log 3.1.2 Two-interval task posterior, ln p(i|x), and when priors are also equal, it is the log Now consider an equal-priors two-interval task, where two stim- likelihood. uli x1 and x2 come from each of the (general) distributions a and b. A decision rule commonly employed here is to check which stimulus is larger.10, 11 But note that the optimal strategy is to determine whether the tuple x = (x1, x2) came from the joint 3.1 Two normals distribution ab of independent a and b (in that order), or from ba (opposite order). To do this we compute, given x, the log likeli- a b hood ratio l ab of ab vs ba, which turns out to be simply related to Suppose there are only two normal classes and . The Bayes- ba A l a optimal decision rule is to pick if (upper-case denotes the esti- the log likelihood ratios b for the individual stimuli in the single-

6 interval task: so has generalized chi-square distributions for each category (g.  2  p(ab|x) p(x|ab) p(x |a).p(x |b) p(a|x ).p(b|x ) 3d), and we can calculate that p(e) = p χ˜w,k,λ,0,0 < 0, where = = 1 2 = 1 2 p(ba|x) p(x|ba) p(x1|b).p(x2|a) p(b|x1).p(a|x2) µ − µ w = σ2 −σ2 , k = 1 1 , λ = a b σ2 σ2 . =⇒ l ab (x1, x2) = l a (x1) − l a (x2). a b 2 2 a b ba b b σa − σb

The optimal rule is to pick ab if l ab (x1, x2) > 0, i.e. if l a (x1) > If the two stimuli themselves arise from k-dimensional nor- ba b 2 l a (x ) mals N(µa, Σa) and N(µb, Σb), then the optimal discrimination b 2 . This is the familiar decision rule: the observer gets a log likelihood ratio from each distribution for the single-interval task is between 2k-dimensional normals ab and ba, whose means are (e.g. g.3b), and picks the larger likelihood ratio (not the larger the concatenations of µa and µb, and covariances are the block- stimulus). diagonal concatenations of Σa and Σb, in opposite order to each When a and b are normals, ab is the 2d normal with mean other. µab = (µa, µb) and sd Sab = diag(σa, σb), and ba is its its ipped version (g.3c). The optimal decision rule (eq.3) boils down to 3.1.3 m-interval task selecting ab when: Consider the m-interval (m-alternative forced choice) task with 2 2 2 2 2 2 m stimuli, one from the signal distribution N(µa, σa), and the σa − σb x1 − x2 + 2 µaσb − µbσa (x1 − x2) > 0. rest from N(µb, σb). Following previous reasoning, the probabil- When σa = σb, this is the usual condition of whether x1 > x2. ity of the ith stimulus being the signal is an m-d normal, with But when σa 6= σb, this optimal decision boundary comprises mean vector whose ith entry is µa and the rest are µb, and diag- two perpendicular lines, solid and dashed (g.3c). The x1 > x2 onal sd matrix whose ith entry is σa and the rest are σb. The part criterion is to use only the solid boundary, which is sub-optimal. of this log likelihood that varies across i is: The minimum error rate p(e) is the probability that the dier-  2  2 X xj − µb xi − µa ence distribution of the two categories of g.3b exceeds 0. l ab ba − − l a σb σa is the dierence of scaled and shifted noncentral chi-squares b , j6=i  2  2  2 X xj − µb xi − µb xi − µa = − + − . σ σ σ a p(c) = 0.70 j b b a b | {z } 1 | {z } varies with i b 10 b constant l (x)=0 The optimal response is to pick the m-d normal with highest like- lihood, i.e. pick the xi with the largest value of the second term above, i.e. with the largest log likelihood ratio l of a vs b, which a a is the familiar rule.2 Analogous to Wickens12 eq. 6.19, the maximum accuracy is then given by: 10-2 Z ∞ 0 m−1 x log likelihood ratio l (x) p(c) = Fb (l) fa(l) dl −∞ c p(c) = 0.74 d where fa and Fb are the pdf and cdf of l under a and b, which ba are known, so this can be evaluated numerically. For example, for 2, 3 and 4 intervals with the parameters of g.3a, the accuracy is 0.74 (g.3c), 0.64 and 0.58 (see example in the getting started x ab guide for the toolbox). When the variances are equal, these com- 2 12 (μ , μ ) puted accuracies match table 6.1 of Wickens. (μs , μa n) b σ a ba ab σb 3.1.4 Discriminability index

l (x1,x2)=0 Bayesian classiers are often used to model behavioral (or neural) 0 0 performance in binary classication. Within the Bayesian mod- x eling framework, it is possible to estimate, from the pattern of 1 log likelihood ratio l (x1,x2) errors, the separation (or overlap) of the decision variable distri- Figure 3: Binary yes/no and two-interval classication tasks. a. Optimal butions for the two categories, independent of the decision cri- yes/no decision between two unequal-variance 1d normal distributions. terion (which may dier from the optimal value of zero). The b. The same task transformed to the log likelihood ratio axis (log verti- discriminability index d0 measures this separation. If the two cal axis for clarity). c. Optimal two-interval discrimination between the underlying distributions are equal-variance univariate normals 0 same 1d normal distributions a and b is actually a discrimination between a and b, then d = |µa − µb|/σ, and if they are multivariate 2d normals ab and ba. d. The task transformed to the log likelihood ratio with equal covariance matrices, then it is their Mahalanobis dis- axis. 0 p 0 −1 −1 tance: d = (µa − µb) Σ (µa − µb) = kS (µa − µb)k =

7 −1 kµa − µbk/σµ, where σµ = 1/kS µk is the 1d slice of the ratios, these are: sd along the unit vector µ through the means, i.e. the multi- Z 0 dimensional d0 equals the d0 along the 1d slice through the means. p(B|a) = fa(l) dl = Fa(0), For unequal variances, there exist several contending discrim- −∞ Z ∞ inability indices.10, 12, 13 A common one is Simpson and Fitter’s p(A|b) = fb(l) dl = 1 − Fb(0). 0 10 da = |µa −µb|/σrms, extended to general dimensions as the Ma- 0 halanobis distance using the pooled covariance, i.e. with Srms = 1 (For 1d normals, these are given by eq.4.) The overall Bayes error [(Σ + Σ ) /2] 2 as the common sd.14 Another index is Egan and a b p(e) is the average of these two, and is the amount of overlap of Clarke’s d0 = |µ − µ |/σ ,15 which we here extend to general e a b avg the two distributions (e.g. the overlap area in g.3a). We now dimensions using S = (S + S ) /2. avg a b dene the Bayes discriminability index as the equal-variance in- dex that corresponds to this same Bayes error, i.e. the separation between two unit variance normals that have the same overlap as a 1d 2d s the two distributions, which comes out to be twice the z-score of 1000 1 the maximum accuracy:

0.8 0 100 db = −2Z (Bayes error / overlap fraction p (e)) 0.6 = 2Z (best accuracy / non-overlapping fraction p (c)) 0.4 10 0.2 This index is the best possible discriminability, i.e. by an ideal observer. It extends to all cases as a smooth function of the lay- 0 1 0 60 0 60 out and shapes of the distributions, and reduces to d0 for equal variance/covariance normals. d0 b b is a positive-denite statistical distance measure that is free of assumptions about the distributions, like the Kullback-Leibler divergence DKL(a, b), which is the expected log likelihood ra- Savg tio l under the a distribution (mean of the blue distributions in Sa 0 gs.3b and d). DKL(a, b) is asymmetric, whereas d (a, b) is Srms b Sb 0 symmetric for the two distributions. However, db does not sat- isfy the triangle inequality. For example, consider three equal- width, consecutively overlapping uniform distributions: a over [0, 3], b over [2, 5], and c over [4, 7]. b overlaps with a and c: 0 0 db(a, b) = db(b, c) = 2Z(2/3), but a and c do not overlap: 0 0 0 db(a, c) = ∞ ≮ db(a, b) + db(b, c). 0 0 0 In g.4a, we compare db with da and de for dierent mean- c separations and sd ratios of two normals, in 1d and 2d. We rst take two 1d normals and increase their discriminability by equally shrinking their sd’s while maintaining their ratio σa/σb = s, i.e. s eectively separating the means. We repeat this by starting with two 2d normals with dierent sd matrices, one of them scaled by 0 1 dierent values s each time, then shrink them equally. Extending previous ndings,10 we see that in 1d (g.4a left), Figure 4: Comparing discriminability indices. a. Plots of existing indices d0 ≤ d0 ≤ d0 . Thus, d0 and d0 underestimate the optimal dis- 0 0 0 a e b a e da and de as fractions of the Bayes index db, with increasing separation criminability of normal distributions. The worst case is when the between two 1d and two 2d normals, for dierent ratios s of their sd’s. 0 0 0 means are equal, so da = de = 0, but db is positive, since unequal b. Left : two normals with 1 sd error ellipses corresponding to their sd variances still provide discriminability. matrices Sa and Sb, and their average and rms sd matrices. Right: the space has been linearly transformed, so that a is now standard normal, Now consider the opposite end, where large mean-separation and b is aligned with the coordinate axes. c. Discriminating two highly- has a much greater eect on discriminability than sd ratios. Even 0 separated 1d normals. here, the underestimate by da persists, and worsens as the sd’s 0 become more unequal, reaching nearly 30% in the worst case. de 0 is a better estimate throughout, and equals db at large separation. 0 0 These unequal-covariance measures are simple approxima- In higher dimensions, da ≤ de still, and they still usually un- 0 tions that do not describe the exact separation between the dis- derestimate db (especially when means are close), but there are tributions. However, our methods can be used to dene a dis- exceptions (g.4a right, and g.2d). 0 0 criminability index that exactly describes the separation between We can theoretically show that da ≤ de in all dimensions and two arbitrary distributions (even non-normal). First, we deter- cases. In 1d, this is simply because σavg ≤√σrms, and at the limit√ 0 0 mine the minimum possible (Bayes) errors when V = 1 and pri- of highly unequal sd’s, σavg/σrms → 1/ 2, so da → de/ 2, ors are equal. In terms of the distributions of the log likelihood which is the 30% underestimate. 
In higher dimensions, we can

8 show analogous results using g.4b as an example. The left g- mals are N(0, 1) and N(µ, σ). From the single-criterion ROC ure shows two normals with error ellipses corresponding to their curve, we rst estimate µ and σ, then we use our method to com- 0 sd’s, and their average and rms sd’s. Now we make two linear pute db of normals with these parameters. 0 transformations of the space: rst we standardize normal a, then db can also be estimated from a likelihood-ratio ROC curve. 0 we diagonalize normal b (i.e. a rotation that aligns the axes of For any two distributions in any dimensions, db corresponds to error ellipse b with the coordinate axes). In this space (right g- the accuracy at the point along their likelihood-ratio ROC curve ure), Sa = 1, Sb is diagonal, and the axes of Savg and Srms are the that maximizes p(hit)−p(false alarm), which is the farthest point average and rms of the corresponding axes of Sa and Sb. Srms is from the diagonal, where the curve tangent is parallel to the di- hence bigger than Savg, so has larger overlap at the same separa- agonal (g.5a). 0 0 0 0 −1 −1 tion, so da ≤ de. The ratio of da and de is kSrmsµk/kSavgµk, the ratio of the 1d slices of the average and rms sd’s along the axis 3.1.6 Custom classiers through the means.√ When these are highly unequal, we again 0 0 Sometimes, instead of the optimal classier, we need to test and have da → de/ 2 in general dimensions. 0 compare sub-optimal classiers, e.g. one that ignores a cue, or We can also show that at large separation in 1d, de converges 0 some cue covariances, or a simple linear classier. So the toolbox to db. Consider normals at 0 and 1 with sd’s sσ and σ (g.4c). At large separation (σ → 0), the boundary points, where the dis- allows the user to extract the optimal boundary and change it, and s 1 explicitly supply some custom sub-optimal classication bound- tributions cross, are s±1 . The right boundary is σ(s−1) sd’s from ary. Fig.2d compares the classication of two bivariate normals each normal, so it adds as much accuracy for the left normal as 0 it subtracts for the right. So only the inner boundary is useful, using the optimal boundary (which corresponds to db), vs. using a which is 1 sd’s from each normal. The overlap here thus hand-supplied linear boundary. Just as with integration, one can σ(s+1) supply these custom classication domains in quadratic, ray-trace corresponds to d0 = 2 = d0 . So, when two 1d normals are b σ(s+1) e or implicit form, and use set operations on them. too far apart to compute their overlap (see performance section) 0 0 and hence db, the toolbox returns de instead. 0 3.2 Classifying using data Given that de is often the better approximation to the best dis- 0 0 10 criminability db, why is da used so often? Simpson and Fitter If instead of normal parameters, we have labelled data as input, 0 argued that da is the best index, because it is the accuracy in we can estimate the parameters. The maximum-likelihood esti- a two-interval task with stimuli x1 and x2, using the criterion mates of means, covariances and priors of normals are simply the x1 > x2. But as we saw, this is not the optimal way to do this sample means, covariances and relative frequencies. With these task. The optimal error p(e) is instead as calculated previously, parameters we can compute the optimal classier β(x) and the 0 and db(ab, ba) = 2Z (1 − p (e)) is the best discriminability. Un- error matrix. 
We can further calculate another quadratic bound- 0 fortunately, this does not have a simple relationship with db(a, b) ary γ(x) to better separate the given samples: starting with β(x), for the yes/no√ task. But we can calculate mathematically here that we optimize its (k+1)(k+2)/2 independent parameters to maxi- 0 0 de(ab, ba) = 2de(a, b), which may still√ better approximate the mize the classication outcome value of the given samples. This is 0 0 best discriminability than da(ab, ba) = 2da(a, b). A brief note about Grey and Morgan’s approximate index, which uses the geometric mean of the sd’s: this behaves incon- a b p(c) = 0.97 sistently; it underestimates d0 at small discriminability, but over- 1 b N estimates it at large discriminability. 0 In sum, db is the maximum discriminability between normals in all cases, including two-interval tasks, especially when means 0 are closer and variances are unequal. de often approximates it

0 p (hit) better than da, e.g. when the decision variable in a classication t task is modeled as two unequal-variance 1d normals. likelihood ratio (4d) likelihood ratio (1d) single criterion (1d) 3.1.5 ROC curves 0 1 -7 0 20 p(false alarm) ROC curves track the outcome rates from a single criterion swept log likelihood ratio l (x) across two 1d distributions (e.g. black curve of g.5a), or varying the likelihood-ratio between any two distributions in any dimen- Figure 5: ROC curves. a. Yes/no ROC curves for a single shifting cri- terion (black), vs. a shifting likelihood-ratio (green), between the two sions (green and purple curves), which corresponds to sweeping a 1d normals of g.3a (adapted from Wickens, 12 g. 9.3), and a shift- single criterion across the 1d distributions of the likelihood ratio ing likelihood-ratio between a normal and a t distribution in 4d (purple). l (gs.3b and5b for the green and purple curves). The optimal two-interval accuracies of the 1d normals (g.3c) and the Discriminability√ indices are frequently estimated from ROC 4d distributions (g.5b) are 0.74 and 0.97, equal to the areas under their 0 curves. da is 2 times the z-score of the single-criterion ROC likelihood-ratio curves here. The points marked on these curves are the 0 curve area. db has no such simple relationship with curve area, farthest from the diagonal, and correspond to the Bayes discriminability. 0 t but can be estimated in dierent ways. Even though db uses both b. Distributions of the log likelihood ratio of the 4d vs normal distri- criteria for unequal-variance 1d normals, it can still be estimated bution. Sweeping the criterion corresponds to moving along the purple from the usual single-criterion ROC curve. Assume that the nor- ROC curve of a.

9 important for non-normal samples, where the optimal boundary In the sections below, we shall see examples of such applica- between estimated normals may not be a good classier. This op- tions, where we map to fewer dimensions to see how well a mul- timization then improves classication while still staying within tivariate normal model works for a problem, and also combine the smooth quadratic family and preventing overtting. Fig.2e groups of cues to organize a problem and get visual insight. shows classication based on labelled non-normal samples. If, along with labelled samples, we supply a custom quadratic 3.5 Testing a normal model for classication classier, the toolbox instead optimizes this for the sample. This is useful, say, in the following case: suppose we have already com- The results developed here are for normal distributions. But even puted the optimal classier for samples in some feature space. when the variables in a classication problem are not exactly nor- Now if we augment the data with additional features, we may mal (e.g. either they are an empirical sample, or they are from start from the existing classier (with its coecients augmented some known, but non-normal distribution), we can still use the with zeros in the new dimensions) to nd the optimal classier in current methods if we check whether normals are an adequate the larger feature space. model for them. One test, as described before, is to project the distributions to one dimension, either by mapping to a quadratic 3.3 Multiple normals form (g.2g), or to an axis (g.2h), where we can visually com- pare the projections of the observed distributions and those of The optimal classier between two normals is a quadratic, so their tted normals. error rates can be computed using the generalized chi-square We could further explicitly test the normality of the variables method or the ray-trace method. When classifying amongst more with measures like negentropy, but this is stricter than neces- than two normals, the decision region for each normal is the in- i sary. If the nal test of the normal model is against outcomes of a tersection of its quadratic decision regions qn(x) > 0 with all the limited-trials classication experiment, then it is enough to check other normals i, and may be written as: for agreement between outcome counts predicted by the true dis- i tributions and their normal approximations, given the number of f(x) = min qn(x) > 0. i trials. For any classication boundary, we can calculate outcome rates, e.g. p(A|a) for a hit, determined from the true distributions, This is not a quadratic, so only the ray-trace method can compute vs. from the normal approximations. The count of hits in a task the error rates here, by using the intersection operation on the domains as described before. Fig.2f shows the classication of several normals with arbitrary means and covariances. a b 500 1 3.4 Combining and reducing dimensions a It is often useful to combine the multiple dimensions in a problem to fewer, or one dimension.16 Mapping many-dimensional inte- gration and classication problems to fewer dimensions allows b p(A|a) visualization, which can help us understand multivariate normal l = 0 p(A|b) models and their predictions, and to check how adequately they 0 represent the empirical or other theoretical probability distribu- -500 500 -35 0 100 tions for a problem. 
log likelihood ratio l As we have described, the multi-dimensional problem of in- c 1 d 1 tegrating a normal probability in the domain f(x) > 0 can be p(B|b) viewed as the 1d integral of the pdf of f(x) above 0. Similarly, multi-dimensional binary classication problems with a classier pe f(x) can be mapped to a 1d classication between two distribu- tions of the scalar decision variable f(x), with the criterion at 0, while preserving all classication errors. For optimally classify- ing between two normals, mapping to the Bayes decision variable 0 0 β(x) is the optimal quadratic combination of the dimensions. For 10-5 100 105 1010 10-1 101 103 integration and binary classication problems in any dimensions, assumed prior ratio p /(p +p +p ) assumed covariance scale the toolbox can plot these 1d ’function probability’ views (g.2g). b a c d With multiple classes, there is no single decision variable to map Figure 6: Testing normal approximations for classication. a. Classify- the space to, but the toolbox can plot the projection along any ing two empirical distributions (a is not normal). Gray curves are con- chosen vector direction. Fig.2h shows the classication of sam- tours of l, i.e. the family of boundaries corresponding to varying like- ples from four 4d t distributions using normal ts, projected onto lihood ratios of the two tted normals. b. Mean ± sd of hit and false the axis along (1,1,1,1). alarm fractions observed (color lls) vs. predicted by the normal model For a many-dimensional classication problem, we can also de- (outlines), along this family of boundaries. Vertical line is the optimal ne a decision variable on a subset of dimensions to combine boundary. c. Similar bands for class b hits and overall error, for the 4d them into one, then combine those combinations further etc, ac- 4-class problem of g.2h, across boundaries assuming dierent priors 0 cording to the logic of the problem. pb, and d. across boundaries assuming dierent covariance scales (d s).

10 is binomial with parameters equal to the number of a trials and cues (which changes only their variances but not their covari- p(A|a), so we can compare its count distribution between the true ances), ones that ignore certain cues or cue covariances, or simple and the normal model. at boundaries. As seen here, even for many-dimensional distri- If the classes are well-separated (e.g. for ideal observers), the butions that cannot be visualized, these tests can be performed optimal boundary provides near-perfect accuracy on both the to reveal some of their structure, and to show which specic out- true and the normal distributions, so comparing yields no insight. comes deviate from normal prediction for which boundaries. To make the test more informative, we repeat it as we sweep the When the problem variables have a known non-normal theo- boundary across the space into regions of error, to show if the nor- retical distribution, the maximum-likelihood normal model is the mal model still stands. This is similar to how the decision criterion one that matches its mean and covariance, and these tests can be between two 1d distributions is swept to create an ROC curve that performed by theoretically calculating or bootstrap sampling the characterizes the classication more richly than a single bound- error rate distributions induced by the known true distributions. ary. In multiple dimensions, there is more than one unique way to sweep the boundary. We pick two common suboptimal boundary families. The rst corresponds to an observer being biased to- 4 Matlab toolbox: functions and exam- wards one type of error or another, i.e. a change in the assumed ples ratio of priors or outcome values. The second is an observer hav- ing an internal discriminability dierent from the true (external) For an integration problem, the toolbox provides a function that one (e.g. due to blurring the distributions by internal noise), so inputs the normal parameters and the integration domain (as adopting a boundary corresponding to covariance matrices that quadratic coecients or a ray-trace or implicit function), and out- are scaled by a factor. When there are two classes, the boundaries puts the integral and its complement, the boundary points com- for both of these sub-optimal observers correspond to a shift in puted, and a plot of the normal probability or function probabil- q the constant oset 0 (eq.3), i.e. a shift in the likelihood ratio ity view. The function for a classication problem inputs normal of the two normals. So we are simply moving along the normal parameters, priors, outcome values and an optional classication likelihood-ratio ROC curves, as we compare the outcome rates of boundary, and outputs the coecients of the quadratic bound- the true and the normal distributions. ary β and points on it, the error matrix and discriminability in- Fig.6a shows the classication of two empirical distributions, 0 0 0 dices db, da and de, and produces a normal probability or func- where a is not normal, and gray curves show this family of bound- tion probability plot. With sample input, it additionally returns aries, which are simply contours of the log likelihood ratio l. 0 the coecients of γ and points on it, error matrices and db values Since a and b are well-separated, the ROC curves for both true corresponding to classication accuracies of the samples using β and normal distributions would hug the top and left margins, so and γ, and the mapped scalar decision variables β(x) and γ(x) they cannot be compared. 
Instead, we detach the hits and false from the samples. The toolbox also provides functions to compute alarms from each other and plot them individually against the pdf’s, cdf’s and inverse cdf’s of functions of normals. changing likelihood ratio criterion, which gives us more insight. Many dierent example problems, including every problem Fig.6b shows the mean ± sd bands of hits and false alarms from discussed in this paper (examples in gs.2,3,4 and5, tests in applying these boundaries on samples of 100 trials (typical of a gs.6 and7, and research applications in g.8) are available as psychophysics experiment) from each true distribution, vs. the interactive demos in the ‘getting started’ live script of the toolbox, normal approximations. They exactly coincide for false alarms / and can be easily adapted to other problems. correct rejections, but deviate for hits / misses, correctly reect- ing that b is normal but a is not. The investigator can judge if this deviation is small enough to be ignored for their problem. 5 Performance benchmarks Now consider the case of applying these tests to multi-class problems. The two kinds of sub-optimal boundaries we picked In this section we test the performance (accuracy and speed) of are no longer the same family here. Recall that the classication our Matlab implementations of the generalized chi-square and problem of2h had four 4d t distributions. Fig.6c shows simi- ray-trace algorithms, against a standard Monte Carlo integration lar tests to see if this problem (with priors now equal) are well- algorithm. We rst set up a case where we know the ground truth, modeled by normals. The family of boundaries corresponds to and compare the estimates of all three methods at the limit of varying the assumed prior pb. We may compare any of the 16 out- high discriminability (low error rates) where it is most challeng- come rates here, e.g. p(B|b), and also the overall error p(e). When ing (which occurs for computational models and ideal observers). there are multiple classes, for any given true class, the numbers We take two 3d normals with the same covariance matrix, so that of responses in the dierent classes are multinomially distributed, the true discriminability d0 is exactly calculated as their Maha- so that the total number of wrong responses is again binomially lanobis distance. Now we increase their separation, while com- distributed. p(e) is the prior-weighted sum of these binomially puting the optimal error with each of our methods at maximum distributed individual errors, so we can calculate its mean and sd precision, and a discriminability from it. The generalized chi- predicted by the observed vs. the normal distributions. Fig.6d square method is very fast (due to the trivial planar boundary shows the test across boundaries corresponding to all covariance here), and the ray-trace method takes an average of 40s. For a 0 matrices scaled by a factor, changing the d between the classes. fair comparison, we use 108 samples for the Monte Carlo, which Some other notable suboptimal boundaries to consider for this also takes ∼40s. Each method returns an estimate dˆ0. Fig.7a test are ones that correspond to adding independent noise to the shows the relative inaccuracies |dˆ0 − d0|/d0 as true d0 increases.

5 Performance benchmarks

In this section we test the performance (accuracy and speed) of our Matlab implementations of the generalized chi-square and ray-trace algorithms against a standard Monte Carlo integration algorithm. We first set up a case where we know the ground truth, and compare the estimates of all three methods at the limit of high discriminability (low error rates), where it is most challenging (which occurs for computational models and ideal observers). We take two 3d normals with the same covariance matrix, so that the true discriminability d′ is exactly calculated as their Mahalanobis distance. Now we increase their separation, while computing the optimal error with each of our methods at maximum precision, and a discriminability from it. The generalized chi-square method is very fast (due to the trivial planar boundary here), and the ray-trace method takes an average of 40 s. For a fair comparison, we use 10^8 samples for the Monte Carlo, which also takes ∼40 s. Each method returns an estimate d̂′. Fig. 7a shows the relative inaccuracies |d̂′ − d′|/d′ as the true d′ increases.

With increasing separation, the Monte Carlo method quickly becomes inaccurate, since the error rate, i.e. the probability content in the integration domain, becomes too small to be sampled. The Monte Carlo method stops working beyond d′ ≈ 10, where none of the 10^8 samples fall in the error domain. In contrast, inaccuracies in our methods are extremely small, of the order of the double-precision machine epsilon ε, demonstrating that the algorithms contain no systematic error besides machine imprecision (however, Matlab's native integration methods may not always reach the desired precision for a problem). This is possible because a variety of techniques are built into our algorithms to preserve accuracy, such as holding tiny and large summands separate to prevent rounding, using symbolic instead of numerical calculations, and using accurate tail probabilities. The inaccuracies do not grow with increasing separation, until d′ ≈ 75, which corresponds to the smallest error rate p(e) representable in double precision ('realmin' = 2e-308), beyond which both methods return p(e) = 0 and d′_b = ∞. For 1d problems beyond this, the toolbox returns d′_e instead.

Next, we compare the three methods across several problems of fig. 2. The values here are large enough that Monte Carlo estimates are reliable and quick, so we use it as a provisional ground truth. We compute the values with all three methods up to maximum practicable precisions, then we calculate the relative (fractional) differences of our methods from the Monte Carlo. If a value is within the spread of the Monte Carlo estimates, we call the relative difference 0. Table 7b lists these, along with the times to compute the values to 1% precision on an AMD Ryzen Threadripper 2950X 16-core processor. We see that both of our methods produce accurate values at comparable speeds.
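Returning to the ground-truth setup above, here is a minimal sketch in base Matlab plus mvnrnd/normcdf/norminv from the Statistics Toolbox; the means and covariance are illustrative values, not those used for fig. 7:

    % Two 3d normals with equal covariance: true d' is the Mahalanobis distance,
    % and the optimal error rate with equal priors is p(e) = Phi(-d'/2).
    mu_a = [0 0 0]';  mu_b = [4 2 1]';
    S    = [2 .5 0; .5 1 .2; 0 .2 1.5];                 % shared covariance (illustrative)

    d_true = sqrt((mu_b-mu_a)' / S * (mu_b-mu_a));      % true d' = Mahalanobis distance
    pe     = normcdf(-d_true/2);                        % exact optimal error (equal priors)

    % Monte Carlo estimate of the same error rate, using the optimal planar boundary
    n  = 1e6;                                           % far fewer than the 10^8 used above
    x  = mvnrnd(mu_a', S, n)';                          % samples of class a (columns)
    w  = S \ (mu_b - mu_a);  c = w'*(mu_a + mu_b)/2;    % boundary: w'x = c
    pe_mc = mean(w'*x > c);                             % fraction misclassified
    d_mc  = -2*norminv(pe_mc);                          % d' recovered from the MC error
    % As d_true grows, pe falls below 1/n, pe_mc becomes 0 and d_mc becomes Inf:
    % the Monte Carlo failure described above.

    % The double-precision ceiling quoted above: the smallest representable error
    % rate is realmin, so the largest finite d' is -2*norminv(realmin), about 75.
    d_max = 2*sqrt(2)*erfcinv(2*realmin);               % same as -2*norminv(realmin)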
Figure 7: Performance benchmarks of the generalized chi-square (denoted by χ̃²) and ray-trace methods, against a standard Monte Carlo method. a. Relative inaccuracies in d′ estimates by the Monte Carlo method (across multiple runs), and our two methods, as the true d′ increases. Monte Carlo estimates that take similar time rapidly become erroneous, failing beyond d′ ≈ 10. Our methods stay extremely accurate (around machine epsilon ε) up to d′ ≈ 75, which corresponds to the smallest error rate representable in double precision ('realmin'). b. For several problems of fig. 2, relative differences in the outputs of the two methods from the Monte Carlo estimate, and computation times for 1% precision:

    Problem          Rel. diff. from MC      Time (s)
                     χ̃²        ray           χ̃²       ray     MC
    2a top, p        n/a       0             n/a      0.9     0.02
    2a bottom, p     n/a       0             n/a      0.3     0.003
    2d, p(e)         0         0             0.7      0.1     0.05
    2d, p′(e)        0         0             0.006    0.08    0.04
    2e, 1st p(e)     0         0             1.7      0.12    0.01
    2g, p            0         8e-4          0.4      1.7     0.02
    2g, 1st p(e)     0         9e-4          0.7      1.4     0.01

6 Applications in visual detection

We demonstrate the use of these methods in visual detection tasks that have multiple cues with different variances and correlations.

6.1 Detecting targets in natural scenes

We have applied this method in a study to measure how humans compare against a nearly ideal observer in detecting occluding targets against natural scene backgrounds in a variety of conditions.17 We placed a target on a random subset of natural images, then blurred and downsampled them to mimic the effect of the early visual system (fig. 8a). We sought to measure how well the targets on these degraded images can be detected using three cues, related to the luminance in the target region, the target pattern, and the target boundary. We computed these cues on the set of images. They form two approximately trivariate normal distributions for the target-present and target-absent categories. We then computed the decision boundary, error rate and d′_b against varying conditions. Fig. 8b shows the result for one condition, with a hyperboloidal boundary. These error rates and d′_b's can then be compared across conditions.
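A minimal sketch of this kind of cue classification, with synthetic stand-in data in place of the measured cues (Statistics Toolbox: mvnrnd, mvnpdf, norminv); the same log-likelihood-ratio mapping is also what merges groups of cues in the next subsection:

    % Fit trivariate normals to 'present' and 'absent' cue vectors and classify
    % with the Bayes decision variable l (log likelihood ratio). Stand-in data.
    rng(1);
    cues_p = mvnrnd([1 1 1], [1 .3 0; .3 1 .2; 0 .2 1], 500);   % stand-in present cues
    cues_a = mvnrnd([0 0 0], eye(3), 500);                      % stand-in absent cues

    % maximum-likelihood normal fits (match mean and covariance)
    mu_p = mean(cues_p);  S_p = cov(cues_p);
    mu_a = mean(cues_a);  S_a = cov(cues_a);

    % Bayes decision variable: l(x) = log p(x|present) - log p(x|absent)
    l = @(x) log(mvnpdf(x, mu_p, S_p)) - log(mvnpdf(x, mu_a, S_a));

    % error rates of the quadratic boundary l(x) = 0 (equal priors)
    p_miss = mean(l(cues_p) < 0);                % present trials called absent
    p_fa   = mean(l(cues_a) > 0);                % absent trials called present
    p_e    = (p_miss + p_fa)/2;
    db_hat = -2*norminv(p_e);                    % sample-based discriminability estimate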

6.2 Detecting camouflage

We also applied this method in a study measuring performance in detecting camouflaged objects.18 The major cue for detecting the object (fig. 8c) is its edge, which we compute at scales of 2px, 4px and 8px. We extract two scalar features from the edge at each scale: the edge power captures its overall prominence, and the edge spectrum characterizes how this prominence is distributed along the boundary. We thus have 6 total features. Fig. 8d shows the classification of these images using these 6 features, projected onto the Bayes decision variable (log likelihood ratio) l. In this reduced dimension, we can see that the absent distribution is quite normal, and present is nearly so. Consistently, in a normality test for classification with 100 trials, fig. 8e, the hit fraction deviates only marginally from its normal prediction, so we accept the normal model here. Fig. 8f shows classification using only the 2px features. We use our dimension reduction technique to combine these two cues into the Bayes decision variable l of this space, which we call simply the 2px edge cue. Classifying using this single variable, fig. 8g, is the same as the 2d classification of fig. 8f, and preserves the errors. We do the same merging at 4px and 8px, thus mapping 6 features to 3. Fig. 8h shows the classification using these 3 merged cues. Due to the information in the two added scales, the classification has improved. The total number of classifier parameters used in this sequential classification is 28 (6 for each of the three 2d classifiers, then 10 when combining them in 3d). The classifier in full 6d has 28 parameters as well, yet it performs better since it can simultaneously optimize them all. Even so, merging features allows one to combine them in groups and sequences according to the problem structure, and visualize them.

Figure 8: Applying the method and toolbox to visual target detection studies. a. Example image of a target on a natural background. b. Classification of images with the target present or absent, in the space of three cues. p(e) and d′_b correspond to the classification error of the sample points using γ. c. Example image for camouflage detection. d. Classifying these using 6 cues, viewed in terms of the log likelihood ratio l. e. Bootstrap mean ± sd of hit and false alarm fractions from applying a family of boundaries (corresponding to varying the criterion likelihood ratio) on 100 samples of the true 6d cue distributions (color fills), vs. their normal approximations (outlines). f. Classifying with only two cues computed at 2px. Grey curves are contours of the log likelihood ratio l. g. Combining the two cues of plot f into one using l (i.e. the space of plot f has been projected along the gray contours). h. Classifying with such combined cues at 3 scales.

7 Conclusions

In this paper, we presented our methods and open-source software for computing integrals and classification performance for normal distributions over a wide range of situations.

We began by describing how to integrate any multinormal distribution in a quadratic domain using the generalized chi-square method, then presented our ray-trace method to integrate in any domain, using examples from our software. We explained how this is synonymous with computing cdf's of quadratic and arbitrary functions of normal vectors, which can then be used to compute their pdf's and inverse cdf's as well.

We then described how to compute, given the parameters of multiple multinormals or labelled data from them, the classification error matrix, with optimal or sub-optimal classifiers, and the maximum (Bayes-optimal) discriminability index d′_b between two normals. We showed that the common indices d′_a and d′_e underestimate this, and that contrary to common use, d′_e is often a better approximation than d′_a, even for two-interval tasks.

We next described methods to merge and reduce dimensions for normal integration and classification problems without losing information. We presented tests for how reliably all the above methods, which assume normal distributions, can be used for other distributions. We followed this by demonstrating the speed and accuracy of the methods and software on different problems.

Finally, we illustrated all of the above methods on two visual detection research projects from our laboratory.
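As a small 1d illustration of the index comparison recapped above, the sketch below computes the three indices for an unequal-variance pair; the expressions used here for d′_a (RMS sd), d′_e (mean sd) and d′_b (from the Bayes error as −2Φ⁻¹(e_B)) are standard signal-detection forms and should be checked against the definitions given earlier in the paper (Statistics Toolbox: normpdf, norminv):

    % Compare d'_a, d'_e and d'_b for two 1d normals with unequal sds.
    s1 = 1;  s2 = 3;  mu = 2;                                % illustrative values
    f1 = @(x) normpdf(x, 0,  s1);
    f2 = @(x) normpdf(x, mu, s2);

    eB  = 0.5*integral(@(x) min(f1(x), f2(x)), -Inf, Inf);   % Bayes error, equal priors
    d_b = -2*norminv(eB);                                    % Bayes discriminability index
    d_a = mu/sqrt((s1^2 + s2^2)/2);                          % RMS-sd index
    d_e = mu/((s1 + s2)/2);                                  % mean-sd index
    % for this pair d_a < d_e < d_b: both underestimate d'_b, d_e less so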

Although not developed here, the approach of our ray-trace integration method may carry over to other univariate and multivariate distributions. In the method, we spherically symmetrize the normal, find its distribution along any ray from the center, then add it over a grid of angles. This transforms all problem shapes to the canonical spherical form, then efficiently integrates outward from the center of the distribution. Some distributions, e.g. log-normal, can simply be transformed to a normal and then integrated with this method. For example, if y ∼ lognormal(µ = 1, σ = 0.5), then we can compute, say, p(sin y > 0) = p(sin e^x > 0) = 0.65 (where x is normal), and all other quantities such as pdf's, cdf's, and inverse cdf's of its arbitrary functions (see toolbox example guide).

For other distributions, our general method is still useful if they are already spherically symmetric (i.e. spherical distributions), or can be made so (e.g. elliptical distributions), and the ray distribution through the sphere can be found. When they cannot be spherized, the ray distribution (if calculable) will depend on the orientation, just as the integration domain does. But once this additional dependency has been taken into account, integrating along rays from the center should still be the efficient method for distributions that fall off away from their center.
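As a quick Monte Carlo check (a sketch, base Matlab only) of the log-normal example above:

    % y = e^x with x ~ N(1, 0.5^2), i.e. y ~ lognormal(mu = 1, sigma = 0.5)
    rng(3);
    x = 1 + 0.5*randn(1e6, 1);
    y = exp(x);
    p = mean(sin(y) > 0);        % should come out close to the 0.65 quoted above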

8 Acknowledgements

We thank Dr. Johannes Burge (University of Pennsylvania), Dr. R Calen Walshe (University of Texas at Austin), Kristoffer Frey (MIT) and Dr. David Green for discussions and improvements in the method, code and text. This work was supported by NIH grants EY11747 and EY024662.

References

1. Andrew Ng. Generative learning algorithms. CS229 Lecture Notes, IV, 2019.
2. David Marvin Green and John A. Swets. Signal detection theory and psychophysics, volume 1. Wiley, New York, 1966.
3. Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.
4. Harold Ruben. Probability content of regions under spherical normal distributions, I. The Annals of Mathematical Statistics, 31(3):598–618, 1960.
5. Alan Genz and Frank Bretz. Computation of multivariate normal and t probabilities, volume 195. Springer Science & Business Media, 2009.
6. Harold Ruben. Probability content of regions under spherical normal distributions, IV: The distribution of homogeneous and non-homogeneous quadratic functions of normal variables. The Annals of Mathematical Statistics, 33(2):542–570, 1962.
7. Robert B Davies. Numerical inversion of a characteristic function. Biometrika, 60(2):415–417, 1973.
8. Lloyd N Trefethen. Approximation theory and approximation practice, volume 164. SIAM, 2019.
9. Edward B Saff and A B J Kuijlaars. Distributing many points on a sphere. The Mathematical Intelligencer, 19(1):5–11, 1997.
10. Adrian J Simpson and Mike J Fitter. What is the best index of detectability? Psychological Bulletin, 80(6):481, 1973.
11. David M Green. A homily on signal detection theory. The Journal of the Acoustical Society of America, 148(1):222–225, 2020.
12. Thomas D Wickens. Elementary signal detection theory. Oxford University Press, USA, 2002.
13. R L Chaddha and L F Marcus. An empirical comparison of distance statistics for populations with unequal covariance matrices. Biometrics, pages 683–694, 1968.
14. S A Paranjpe and A P Gore. Selecting variables for discrimination when covariance matrices are unequal. Statistics & Probability Letters, 21(5):417–419, 1994.
15. James P Egan and Frank R Clarke. Psychophysics and signal detection. Technical report, Indiana University Hearing and Communication Laboratory, 1962.
16. Ipek Oruç, Laurence T Maloney, and Michael S Landy. Weighted linear cue combination with possibly correlated error. Vision Research, 43(23):2451–2468, 2003.
17. R Calen Walshe and Wilson S Geisler. Detection of occluding targets in natural backgrounds. Journal of Vision, 20(13):14, 2020.
18. Abhranil Das and Wilson Geisler. Understanding camouflage detection. Journal of Vision, 18(10):549, 2018.
