arXiv:1302.4881v1 [stat.ME] 20 Feb 2013 oot,Otro 3 P,Cnd e-mail: Canada 1P3, St, M3J Keele Ontario, 4700 Toronto, University, York and Department, Professor, Associate is Monette Fox John and Monette Georges Friendly, Michael Elliptical through Geometry Methods Understanding Statistical Insights: Elliptical 03 o.2,N.1 1–39 1, DOI: No. 28, Vol. 2013, Science Statistical [email protected] [email protected] okUiest,40 el t oot,Otro M3J e-mail: Ontario, Toronto, Canada St, 1P3, Keele 4700 University, York et aitn nai,LS44 aaae-mail: Canada 4M4, L8S Ontario, Street Hamilton, Main West, 1280 of University, Department McMaster Statistics, Sociology, Social of Professor McMaster

ihe redyi rfso,Pyhlg Department, Psychology Professor, is Friendly Michael c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint nttt fMteaia Statistics Mathematical of Institute 10.1214/12-STS402 ra ubro motn facts. important a of on number attention great the fixing of advantage possess the mind, the the fatiguing without to senses speak which projections Statistical figures. quantity geometrical by represented and be extent may to relates Whatever lxne o ublt[( Humboldt von Alexander .INTRODUCTION 1. Abstract. l,mliait eaaayi,rgeso aaoe,r geometry. paradoxes, tistical regression meta-analysis, multivariate er els, analysis, measurement , kissing plots, , pothesis-error data ellips ellipse, of phrases: centration terms and in words understood easily Key be often can emphasiz that - We ometry methods, models. statistical mixed-effect among relationships and l including models contexts, higher-di of linear her variety a and tivariate in ellipse purposes the these T both of graphs. for virtues statistical the based t illustrates achieved geometrically and be and often diagrams can purposes, ric analytic data and didactic onFxi eao William Senator is Fox John . . nttt fMteaia Statistics Mathematical of Institute , 03 o.2,N.1 1–39 1, No. 28, Vol. 2013, [email protected] 2013 , iulisgt noawd ait fsaitclmethods, statistical of variety wide a into insights Visual Georges . aeciii] page 1811 de-aibepos aeinetmto,con- estimation, Bayesian plots, Added-variable This . ), in 1 ao n xso h aiyo liss with , Figure of in family shown the result of the axes the showing minor lines and ellipses. added major concentric he picture, the the to complete To vertical horizon- and of tal the to approximately corresponded ohddci n aaaayi purposes. analytic data for and cousins virtues didactic higher-dimensional the both her and illustrates dimen- ellipse and higher the extols of to paper 3D onward This a grew sions. to then and the , escaped ellipse to bivariate the from multivariate, progressed methods statistical ern ae ota fterparents, their of that to lated characteristic, a scale. in concen- only to differing close ellipses tolerably tric Galton, to were, joint whose outlines the concentric in form to frequencies seemed distribution bivariate equal insight— visual and of remarkable contours a parents had of he offspring, traits their heritable between lationship r,Ave( try, Ave( summaries, calculated nhsdarm oadbhl,h a eodvi- ( second of means a of had lines he the behold, insight: and sual Lo diagram. his on hnFacsGlo ( Galton Francis When mod- As ellipse. an was there beginning, the In atnsga a ot rdc o xli)how explain) (or predict to to was goal Galton’s X | Y ,adpotdteea ie fmeans of lines as these plotted and ), o,mxdeetmod- mixed-effect ror, dergeso,sta- regression, idge rni atn hy- Galton, Francis Y csltosadge- and solutions ic na oes mul- models, inear eg,hih)o hlrnwsre- was children of height) (e.g., , esoa cousins mensional i ae extols paper his ruhgeomet- hrough es. h strong the e 1886 1 Y . o both for | rtsuidtere- the studied first ) X X ,ad o symme- for and, ), oti n,he end, this To . Y | X n ( and ) X | Y ) 2 M. FRIENDLY, G. MONETTE AND J. FOX

lines to the ) with remarkable clarity nearly 2000 years before the development of analytic ge- ometry by Descartes. Over time, the ellipse would be called to duty to provide simple explanations of phenomena once thought complex. Most notable is Kepler’s insight that the Copernican theory of the of plan- ets as concentric (which required notions of epicycles to account for observations) could be brought into alignment with the detailed observational data from Tycho Brahe and others by an exquisitely sim- ple law: “The of every is an ellipse with the sun at a .” One century later, Isaac New- ton was able to connect this elliptical geometry with astrophysics by deriving all three of Kepler’s laws as simpler consequences of general laws of motion and Fig. 1. Galton’s 1886 diagram, showing the relationship of universal gravitation. height of children to the average of their parents’ height. The This paper takes up the cause of the ellipse as a diagram is essentially an overlay of a geometrical interpreta- tion on a bivariate grouped frequency distribution, shown as geometric form that can provide similar service to numbers. statistical understanding and data analysis. Indeed, it has been doing that since the time of Galton, but It is not stretching the point too far to say that these graphic and geometric contributions have of- a large part of modern statistical methods descends ten been incidental and scattered in the literature from these visual insights:1 correlation and regres- [e.g., Bryant (1984), Campbell and Atchley (1981), sion [Pearson (1896)], the bivariate distri- Saville and Wood (1991), Wickens (1995)]. We fo- bution, and principal components [Pearson (1901), cus here on visual insights through ellipses in the Hotelling (1933)] all trace their ancestry to Galton’s of linear models, multivariate linear models geometrical diagram.2 and mixed-effect models. Our goal is to provide as Basic geometry goes back at least to Euclid, but comprehensive a treatment of this topic as possible the properties of the ellipse and other conic sections in a single article together with online supplements. may be traced to Apollonius of (ca. 262 BC– The plan of this paper is as follows: Section 2 ca. 190 BC), a Greek geometer and astronomer who provides the minimal notation and properties of el- gave the ellipse, and their mod- lipsoids3 necessary for the remainder of the paper. ern names. In a work popularly called the Conics Due to length restrictions, other useful and impor- [Boyer (1991)], he described the fundamental prop- tant properties of geometric and statistical ellipsoids erties of ellipses (eccentricity, axes, principles of tan- have been relegated to the Appendix. Section 3 de- gency, normals as minimum and maximum straight scribes the use of the data ellipsoid as a visual sum- mary for multivariate data. In Section 4 we apply 1Pearson [(1920), page 37] later stated, “that Galton should data ellipsoids and confidence ellipsoids for parame- have evolved all this from his observations is to my mind one ters in linear models to explain a wide range of phe- of the most noteworthy scientific discoveries arising from pure nomena, paradoxes and fallacies that are clarified analysis of observations.” by this geometric approach. This view is extended 2Well, not entirely. Auguste Bravais [1811–1863] (1846), an astronomer and physicist first introduced the mathematical to multivariate linear models in Section 5, primar- theory of the bivariate as a model for the ily through the use of ellipsoids to portray hypoth- joint frequency of errors in the geometric position of a point. esis (H) and error (E) covariation in what we call Bravais derived the formula for level slices as concentric el- HE plots. Finally, in Section 6 we discuss a diverse lipses and had a rudimentary notion of correlation but did not appreciate this as a representation of data. Nonetheless, Pearson (1920) acknowledged Bravais’s contribution, and the 3As in this paragraph, we generally use the term “ellipsoid” correlation coefficient is often called the Bravais-Pearson co- as to refer to “ellipse or ellipsoid” where dimensionality does efficient in France [Denis (2001)]. not matter or context is clear. ELLIPTICAL INSIGHTS 3

Table 1 Statistical and geometrical measures of “size” of an ellipsoid

Size Conceptual formula Geometry Function

(a) Generalized : det(Σ)= Qi λi , (hyper) (b) Average variance: tr(Σ)= λi linear sum −1 (c) Average precision: 1/ tr(Σ ) = 1/ Pi(1/λi) (d) Maximal variance: λ1 maximum supremum

collection of current statistical problems whose so- 2.2 Statistical Ellipsoids lutions can all be described and visualized in terms In statistical applications, C will often be the in- of “kissing ellipsoids.” verse of a matrix (or a sum of squares 2. NOTATION AND BASIC RESULTS and cross-products matrix) and the ellipsoid will be centered at the means of variables or at estimates of There are various representations of an ellipse (or parameters under some model. Hence, we will also ellipsoid in three or more ), both geomet- use the following notation: ric and statistical. Some basic notation and proper- For a positive definite matrix Σ we use (µ, Σ) ties are described below. to denote the ellipsoid E 2.1 Geometrical Ellipsoids T 1 (4) := x : (x µ) Σ− (x µ) = 1 . We refer to the common notion of a bounded ellip- E { − − } When Σ is the of a multivariate soid (with nonempty interior) in the p-dimensional vector x with eigenvalues λ λ , the follow- space Rp as a proper ellipsoid. An origin-centered 1 2 ing properties represent the “size”≥ of≥··· the ellipsoid in proper ellipsoid may be defined by the quadratic Rp form (see Table 1). T For testing hypotheses for parameters of multi- (1) := x : x Cx 1 , E { ≤ } variate linear models, these different senses of “size” where equality in (1) gives the boundary, correspond (with suitable transformations) to (a) T x = (x1,x2,...,xp) is a vector referring to the co- Wilks’s Λ, (b) the Hotelling–Lawley trace, (c) the ordinate axes and C is a symmetric positive defi- Pillai trace, and (d) Roy’s maximum root tests, as nite p p matrix. If C is only positive semi-definite, we describe below in Section 5. then the× ellipsoid will be improper, having the Note that every nonnegative definite matrix W of a with elliptical cross-sections and un- can be factored as W = AAT, and the matrix A bounded in the direction of the null space of C. To can always be selected so that it is square. A will be extend the definition to singular (sometimes known nonsingular if and only if W is nonsingular. A com- as “degenerate”) ellipsoids, we turn to a definition putational definition of an ellipsoid that can be used that is equivalent to equation (1) for proper ellip- for all nonnegative definite matrices and that cor- p soids. Let denote the unit in R , responds to the previous definition in the case of S T (2) := x : x x = 1 , positive-definite matrices is S { } and let (5) (µ, W)= µ + A , E S (3) := A , where is a unit sphere of conformable dimension E S S where A is a nonsingular p p matrix. Then is and µ is the of the ellipsoid. One convenient a proper ellipsoid that could× be defined using equa-E choice of A is the Choleski square root, W1/2, as we T 1 tion (1) with C = (AA )− . We obtain singular el- describe in Appendix A.3. Thus, for some results lipsoids by allowing A to be any matrix, not nec- below, a convenient notation in terms of W is essarily nonsingular or even square. A more gen- (6) (µ, W)= µ √W = µ W1/2, eral representation of ellipsoids based on the sin- E ⊕ ⊕ gular value decomposition (SVD) of C is given in where emphasizes that the ellipsoid is a ⊕ Appendix A.1. Some useful properties of geometric and of the unit sphere followed by transla- ellipsoids are described in Appendix A.2. tion to a center at µ and √W = W1/2 = A. This 4 M. FRIENDLY, G. MONETTE AND J. FOX

Fig. 2. Sunflower plot of Galton’s data on heights of parents and their children (in.), with 40%, 68% and 95% data ellipses and the regression lines of y on x (black) and x on y (grey). The ratio of the vertical to the regression (labeled “r”) to the vertical to the top of the ellipse gives a visual estimate of the correlation (r = 0.46, here). Shadows (projections) on the coordinate axes give standard intervals, x¯ ksx and y¯ ksy, with k = 1, 1.5, 2.45, having bivariate coverage 40%, 68% and ± ± 95% and univariate coverage 68%, 87% and 98.6%, respectively. Plotting children’s height on the abscissa follows Galton. representation is not unique, however: µ B = ν the frequency at each point is represented by a sun- C (i.e., they generate the same ellipsoid)⊕iff µ =⊕ν flower symbol. We also overlay the 40%, 68% and and BBT = CCT. From this result, it is readily seen 95% data ellipses, as described below. that under a linear transformation given by a matrix In Figure 2, the ellipses have the mean vector L the image of the ellipse is (¯x, y¯) as their center; the lengths of arms of the cen- T tral cross show the standard deviations of the vari- L[( (µ, W))] = (Lµ, LWL ) E E ables, which correspond to the shadows of the 40% (7) = Lµ √LWLT ellipse. In addition, the correlation coefficient can ⊕ be visually represented as the fraction of a vertical = Lµ L√W. line fromy ¯ to the top of the ellipse that is ⊕ below the regression line y x, shown by the arrow 3. THE DATA ELLIPSE AND ELLIPSOID labeled “r.” Finally, as Galton| noted, the regres- sion line for y x (or x y) canb be visually estimated The data ellipse [Monette (1990)] [or concentra- as the locus of| the points| of vertical (or horizon- tion ellipse , Dempster (1969), Chapter 7] provides tal) tangentsb with theb family of concentric ellipses. a remarkably simple and effective display for view- See Monette [(1990), Figures 5.1–5.2] and Friendly ing and understanding bivariate marginal relation- [(1991), page 183] for illustrations and further dis- ships in multivariate data. The data ellipse is typi- cussion of the properties of the data ellipse. cally used to add a visual summary to a scatterplot, More formally [Dempster (1969), Monette (1990)], indicating the means, standard deviations, correla- for a p-dimensional sample, Yn p, we recognize the tion and slope of the regression line for two vari- in equation (4)× as corresponding to ables. Under classical (Gaussian) assumptions, the 2 the squared Mahalanobis distance, DM (y) = (y T 1 T − data ellipse provides a statistically sufficient visual y¯) S− (y y¯), of the point y = (y1,y2,...,yp) from − T summary, as we describe below. the centroid of the sample, y¯ = (¯y1, y¯2,..., y¯p) . Thus, It is historically appropriate to illustrate the data we use a more explicit notation to define the data el- ellipse and describe its properties using Galton’s lipsoid c of size (“radius”) c as the set of all points E 2 2 [(1886), Table I] data, from which he drew Figure 1 y with DM (y) less than or equal to c , 4 as a conceptual diagram, shown in Figure 2, where T 1 2 (8) (y¯, S) := y : (y y¯) S− (y y¯) c , Ec { − − ≤ } 1 n T 4These data are reproduced in Stigler [(1986), Table 8.2, where S = (n 1)− (yi y¯)(yi y¯ ) is the − i=1 − − page 286]. sample covariance matrix.P In the computational no- ELLIPTICAL INSIGHTS 5

(a) (b)

Fig. 3. Scatterplot matrices of Anderson’s iris data: (a) showing data, separate 68% data ellipses, and regression lines for each species; (b) showing only ellipses and regression lines. Key—Iris setosa: blue, ; Iris versicolor: red, +; Iris virginca: △ green, . tation of equation (6), the boundary of the data el- any linear combination, then correspond to stan- lipsoid of radius c is thus dard Scheff´eintervals taking “fishing” (simultane- ous interfence) in a p = 2-dimensional space into (9) (y¯, S)= y¯ cS1/2. Ec ⊕ account. Many properties of the data ellipsoid hold regard- As useful as the data ellipse might be for a single, less of the joint distribution of the variables; but if unstructured sample, its value as a visual summary the variables are multivariate normal, then the data increases with the complexity of the data. For exam- ellipsoid approximates a contour of constant density ple, Figure 3 shows scatterplot matrices of all pair- 2 in their joint distribution. In this case DM (x,y) has wise plots of the variables from Edgar Anderson’s 2 a large-sample χp distribution or, in finite samples, (1935) classic data on three species of iris flowers approximately [p(n 1)/(n p)]Fp,n p). found in the Gasp´ePeninsula, later used by Fisher − − −2 2 Hence, in the bivariate case, taking c = χ2(0.95) = (1936) in his development of discriminant analysis. 5.99 6 encloses approximately 95% of the data ≈ The data ellipses show clearly that the means, vari- points under normal theory. Other radii also have ances, correlations and regression slopes differ sys- useful interpretations: tematically across the three iris species in all pair- In Figure 2 we demonstrate that c2 = χ2(0.40) 1 wise plots. We emphasize that the ellipses serve as • 2 gives a data ellipse of 40% coverage with the prop-≈ sufficient visual summaries of the important statis- 5 erty that its projection on either axis corresponds tical properties (first and second moments) by re- to a standard interval,x ¯ 1sx andy ¯ 1sy. The ± ± 5 same property of univariate coverage pertains to We recognize that a normal-theory summary (first and any linear combination of x and y. second moments), shown visually or numerically, can be dis- By analogy with a univariate sample, a 68% cov- torted by multivariate outliers, particularly in smaller sam- • 2 2 ples. In what follows, robust covariance estimates can, in prin- erage data ellipse with c = χ2(0.68) = 2.28 gives ciple, be substituted for the classical, normal-theory estimates a bivariate analog of the standardx ¯ 1s andy ¯ ± x ± in all cases. To save space, we do not explore these possibilities 1sy intervals. The univariate shadows, or those of further here. 6 M. FRIENDLY, G. MONETTE AND J. FOX moving the data points from the plots in the version the mean and standard deviation of the condi- at the right. tional distribution on that line. The standard deviation of the residuals, s , can be • e 4. LINEAR MODELS: DATA ELLIPSES AND visualized as the half-width of the vertical (red) CONFIDENCE ELLIPSES line at x =x ¯. The vertical distance between the mean of y and Here we consider how ellipses help to visualize re- • the points where the ellipse has vertical tangents lationships among variables in connection with lin- is rsy. (As a fraction of sy, this distance is r = 0.75 ear models (regression, ANOVA). We begin with in the figure.) views in the space of the variables (data space) and The (blue) regression line of y on x passes through progress to related views in the space of model pa- • the points of vertical tangency. Similarly, the re- rameters (β space). gression of x on y (not shown) passes through the 4.1 Simple points of horizontal tangency. 4.2 Visualizing a Confidence Interval for the Various aspects of the standard data ellipse of ra- Slope dius 1 illuminate many properties of simple linear regression, as shown in Figure 4. These properties A visual approximation to a 95% confidence inter- are also useful in more complex contexts: val for the slope, and thus a visual test of H0 : β = 0, can be seen in Figure 5. From the formula for a 95% One-half of the widths of the vertical and hor- 0.975 • confidence interval, CI0.95(β)= b tn 2 SE(b), we izontal projections (dotted black lines) give the ± − × can take t0.975 2 and SE(b) 1 ( se ), leading to n 2 √n sx standard deviations sx and sy, respectively. − ≈ ≈ Because the projection onto any 2 se • (10) CI0.95(β) b . line through the center of the ellipse, (¯x, y¯), corre- ≈ ± √n × sx  sponds to some linear combination, mx + ny, the To show this visually, the left panel of Figure 5 half-width of the corresponding projection of the displays the standard data ellipse surrounded by the ellipse gives the standard deviation of this linear “regression ,” formed with the vertical combination. tangent lines and the tangent lines parallel to the With a multivariate normal distribution the line • regression line. This corresponds to the conjugate segment through the center of the ellipse shows axes of the ellipse induced by the Choleski factor of Syx as shown in Figure A.3 in Appendix A.3. Simple algebra demonstrates that the diagonal lines through this parallelogram have slopes of s b e . ± sx So, to obtain a visual estimate of the 95% confi- dence interval for β (not, we note, the 95% CI for the regression line), we need only shrink the diago- nal lines of the regression parallelogram toward the regression line by a factor of 2/√n, giving the red lines in the right panel of Figure 5. In the data used for this example, n = 102, so the factor is approxi- mately 0.2 here.6 Now consider the horizontal line through the center of the data ellipse. If this line is outside the envelope of the confidence lines, as it is in Figure 5, we can reject H0 : β = 0 via this simple visual approximation.

Fig. 4. Annotated standard data ellipse showing standard 6The data are for the rated prestige and average years of deviations of x and y, residual standard deviation (se), slope education of 102 Canadian occupations circa 1970; see [Fox (b) and correlation (r). and Suschnigg (1989)]. ELLIPTICAL INSIGHTS 7

Fig. 5. Visual 95% confidence interval for the slope in linear regression. Left: Standard data ellipse surrounded by the regression parallelogram. Right: Shrinking the diagonal lines by a factor of 2/√n, gives the approximate 95% confidence interval for β.

4.3 Simpson’s Paradox, Marginal and ing some important factor or covariate, and the con- Conditional Relationships ditional relationship, adjusting (controlling) for that Because it provides a visual representation of factor or covariate. means, and correlations, the data ellipse Simpson’s paradox [Simpson (1951)] occurs when is ideally suited as a tool for illustrating and expli- the marginal and conditional relationships differ in cating various phenomena that occur in the anal- direction. This may be seen in the plots of Sepal ysis of linear models. One class of simple, but im- length against Sepal width for the iris data shown portant, examples concerns the difference between in Figure 6. Ignoring iris species, the marginal, total- the marginal relationship between variables, ignor- sample correlation is slightly negative as seen in

(a) Total sample, marginal ellipse, (b) Individual sample, (c) Pooled, within-sample ellipse ignoring species conditional ellipses — species

Fig. 6. Marginal (a), conditional (b) and pooled within-sample (c) relationships of Sepal length and Sepal width in the iris data. Total-sample data ellipses are shown as black, solid ; individual-group data and ellipses are shown with colors and dashed lines. 8 M. FRIENDLY, G. MONETTE AND J. FOX panel (a). The individual-sample ellipses in panel giving a slope βmarginal and correlation rmarginal. In (b) show that the conditional, within-species cor- other cases, we may not have individual data, but relations are all positive, with approximately equal only aggregate group data, (¯yi, x¯i), i = 1,...,g, from regression slopes. The group means have a negative which we can estimate the between-groups (“eco- relationship, accounting for the negative marginal logical”) association, with slope βbetween and cor- correlation. relation rbetween. When all data are available and A correct analysis of the (conditional) relation- the model is an ANCOVA model of the form y ship between these variables, controlling or adjust- x + group, we can estimate a common conditional,∼ ing for mean differences among species, is based on within-group slope, βwithin, or, with the model y the pooled within-sample covariance matrix, x + group + x group, the separate within-group∼ × g ni slopes, βi. 1 T Swithin = (N g)− (yij y¯i )(yij y¯i ) Figure 7 illustrates these estimates in a simula- − − · − · Xi=1 Xj=1 tion of five groups, with ni = 10, meansx ¯i = 2i + (11) ( 0.4, 0.4) andy ¯ =x ¯ + (0, 0.52), so that g U − i i N 1 rbetween 0.95. Here (a, b) represents the uniform = (N g)− (ni 1)Si, − − distribution≈ between Ua and b, and (µ,σ2) repre- Xi=1 sents the normal distribution with meanN µ and vari- where N = ni, and the result is shown in panel (c) ance σ2. For simplicity, we have set the within-group of Figure 6.P In this graph, the data for each species covariance matrices to be identical in all groups, were first transformed to deviations from the species with Var(x) = 6, Var(y)=2 and Cov(x,y)= 3 in means on both variables and then translated back ± the left and right panels, respectively, giving rwithin = to the grand means. 0.87. S In a more general context, within appears as the ± In the left panel, the conditional, within-group E matrix in a multivariate linear model, adjusting or slope is smaller than the ecological, between-group controlling for all fitted effects (factors and covari- slope, reflecting the smaller within-group than between- ates). For essentially correlational analyses (princi- group correlation. In general, however, it can be pal components, factor analysis, etc.), similar dis- shown that plays can be used to show how multi-sample analy- ses can be compromised by substantial group mean βmarginal [βwithin, βbetween], differences and corrected by analysis of the pooled ∈ within-sample covariance matrix, or by including which is also evident in the right panel, where the important group variables in the model. Moreover, within-group slope is negative. This result follows display of the individual within-group data ellipses from the fact that the marginal data ellipse for the can show visually how well the assumption of equal total sample has a shape that is a convex combina- covariance matrices, Σ1 =Σ2 = =Σg, is satisfied tion (weighted average) of the average within-group in the data, for the two variables· · ·displayed. covariance of (x,y), shown by the green ellipse in Figure 7, and the covariance of the means (¯xi, y¯i), 4.4 Other Paradoxes and Fallacies shown by the red between-group ellipse. In fact, Data ellipses can also be used to visualize and un- the between and within data ellipses in Figure 7 derstand other paradoxes and fallacies that occur are just (a scaling of) the H and E ellipses in an with linear models. We consider situations in which hypothesis-error (HE) plot for the MANOVA model, there is a principal relationship between variables y (x,y) group, as will be developed in Section 5. See ∼ and x of interest, but (as in the preceding subsec- Figure 8 for a visual demonstration, using the same tion) the data are stratified in g samples by a factor data as in Figure 7. (“group”) that might correspond to different sub- The right panels of Figures 7 and 8 provide a pro- populations (e.g., men and women, age groups), dif- totypical illustration of Simpson’s paradox, where ferent spatial regions (e.g., states), different points βwithin and βmarginal can have opposite signs. Un- in time or some combination of the above. derlying this is a more general marginal fallacy (re- In some cases, group may be unknown or may not quiring only substantively different estimates, but have been included in the model, so we can only es- not necessarily different signs) that can occur when timate the marginal association between y and x, some important factor or covariate is unmeasured ELLIPTICAL INSIGHTS 9

Fig. 7. Paradoxes and fallacies: between (ecological), within (conditional) and whole-sample (marginal) associations. In both panels, the five groups have the same group means, and Var(x) = 6 and Var(y) = 2 within each group. The within-group correlation is r = +0.87 in all groups in the left panel and is r = 0.87 in the right panel. The green ellipse shows the average − within-group data ellipse.

or has been ignored. The fallacy consists of esti- one views a scatterplot matrix of (y,x1,x2,...) and mating the unconditional or marginal relationship believes that the slopes of relationships in the sepa- (βmarginal) and believing that it reflects the con- rate panels reflect the pairwise conditional relation- ditional relationship, or that those pesky “other” ships with other variables controlled. In a regres- variables will somehow average out. In practice, the sion context, the antidote to the marginal fallacy is marginal fallacy probably occurs most often when the added-variable plot (described in Section 4.8),

Fig. 8. Visual demonstration that βmarginal lies between βwithin and βbetween. Each panel shows an HE plot for the MANOVA model (x,y) group, in which the within and between ellipses are identical to those in Figure 7, except for scale. ∼ 10 M. FRIENDLY, G. MONETTE AND J. FOX which displays the conditional relationship between countries. It would be fallacious to infer that the the response and a predictor directly, controlling for same slope (or even its sign) applies to a between- all other predictors. country analysis of heart disease mortality vs. GNP The right panels of Figures 7 and 8 also illustrate per capita. A positive value of βbetween in this con- Robinson’s paradox [Robinson (1950)], where βwithin text might result from the fact that, across coun- 7 and βbetween can have opposite signs. The more gen- tries, higher GNP per capita is associated with less eral ecological fallacy [e.g., Lichtman (1974), Kramer healthy diet (more fast food, red meat, larger por- (1983)] is to draw conclusions from aggregated data, tions), leading to increased heart disease. estimating βbetween or rbetween, believing that they 4.5 Leverage, Influence and Precision reflect relationships at the individual level, estimat- The topic of leverage and influence in regression ing βwithin or rwithin. Perhaps the earliest instance of this was Andr´e-Michel Guerry’s (1833) use of the- is often introduced with graphs similar to Figure 9, matic maps of France depicting rates of literacy, what we call the “leverage-influence quartet.” In these graphs, a bivariate sample of n = 20 points crime, suicide and other “moral statistics” by de- 2 partment to argue about the relationships of these was first generated with x (40, 10 ) and y 10 + 0.75x + (0, 2.52). Then,∼N in each of panels (b)–∼ moral variables as if they reflected individual be- (d) a single pointN was added at the locations shown, havior.8 As can be seen in Figure 7, the ecological to represent, respectively, a low-leverage point with fallacy can often be resolved by accounting for some a large residual,9 a high-leverage point with small confounding variable(s) that vary between groups. residual (a “good” leverage point) and a high-leverage Finally, there are situations where only a subset point with large residual (a “bad” leverage point). of the relevant data are available (e.g., one group The goal is to visualize how leverage [ (x x¯)2] in Figure 7) or when the relevant data are available ⋆ ⋆ ∝ − and residual (y yˆi ) (wherey ˆi is the fitted value only at the individual level, so that only the con- for observation i−, computed on the basis of an aux- ditional relationship, βwithin, can be estimated. The iliary regression in which observation i is deleted) atomistic fallacy (also called the fallacy of compo- combine to produce influential points—those that sition or the individualistic fallacy), for example, T affect the estimates of β = (β0, β1) . Alker (1969), Riley (1963), is the inverse to the eco- The “standard” version of this graph shows only logical fallacy and consists of believing that one can the fitted regression lines for each panel. So, for draw conclusions about the ecological relationship, the moment, ignore the data ellipses in the plots. βbetween, from the conditional one. The canonical, first-moment-only, story behind the The atomistic fallacy occurs most often in the con- standard version is that the points added in pan- text of multilevel models [Diez-Roux (1998)] where els (b) and (c) are not harmful—the fitted line does it is desired to draw inferences regarding variabil- not change very much when these additional points ity of higher-level units (states, countries) from data are included. Only the bad leverage point, “OL,” in collected from lower-level units. For example, imag- panel (d) is harmful. ine that the right panel of Figure 7 depicts the nega- Adding the data ellipses to each panel immedi- tive relationship of mortality from heart disease (y) ately makes it clear that there is a second-moment with individual income (x) for individuals within part to the story—the effect of unusual points on the precision of our estimates of β. Now, we see

7 directly William Robinson (1950) examined the relationship be- that there is a big difference in impact be- tween literacy rate and percentage of foreign-born immigrants tween the low-leverage outlier [panel (b)] and the in the U.S. states from the 1930 Census. He showed that there high-leverage, small-residual case [panel (c)], even was a surprising positive correlation, rbetween = 0.526 at the though their effect on coefficient estimates is neg- state level, suggesting that foreign birth was associated with ligible. In panel (b), the single outlier inflates the greater literacy; at the individual level, the correlation rwithin estimate of residual variance (the size of the vertical was 0.118, suggesting the opposite. An explanation for the − slice of the data ellipse atx ¯). paradox was that immigrants tended to settle in regions of greater than average literacy. 8Guerry was certainly aware of the logical problem of eco- 9In this context, a residual is “large” when the point in logical inference, at least in general terms [Friendly (2007a)], question deviates substantially from the regression line for and carried out several side analyses to examine potential the rest of the data—what is sometimes termed a “deleted confounding variables. residual;” see below. ELLIPTICAL INSIGHTS 11

(a) Original data (b) Low leverage, Outlier

(c) High leverage, good fit (d) High leverage, Outlier

Fig. 9. Leverage-Influence quartet with data ellipses. (a) Original data; (b) adding one low-leverage outlier (O); (c) adding one “good” leverage point (L); (d) adding one “bad” leverage point (OL). In panels (b)–(d) the dashed black line is the fitted line for the original data, while the thick solid blue line reflects the regression including the additional point. The data ellipses show the effect of the additional point on precision.

ellipse somewhat more elongated, giving increased precision of our estimates of β. Whether a “good” leverage point is really good de- pends upon our faith in the regression model (and in the point), and may be regarded either as increasing the precision of βˆ or providing an illusion of preci- sion. In either case, the data ellipse for the modified data shows the effect on precision directly. 4.6 Ellipsoids in Data Space and β Space Fig. 10. Data ellipses in the Leverage-Influence quartet. This graph overlays the data ellipses and additional points It is most common to look at data and fitted from the four panels of Figure 9. It can be seen that only the models in “data space,” where axes correspond to OL point affects the slope, while the O and L points affect variables, points represent observations, and fitted precision of the estimates in opposite directions. models are plotted as lines (or planes) in this space. As we’ve suggested, data ellipsoids provide informa- To make the added value of the data ellipse more tive summaries of relationships in data space. For apparent, we overlay the data ellipses from Figure 9 linear models, particularly regression models with in a single graph, shown in Figure 10, to allow direct quantitative predictors, there is another space—“β comparison. Because you now know that regression space”—that provides deeper views of models and lines can be visually estimated as the locus of verti- the relationships among them. In β space, the axes cal tangents, we suppress these lines in the plot to pertain to coefficients and points are models (true, focus on precision. Here, we can also see why the hypothesized, fitted) whose coordinates represent val- high-leverage point “L” [added in panel (c) of Fig- ues of parameters. ure 9] is called a “good leverage point.” By increas- In the sense described below, data space and β ing the standard deviation of x, it makes the data space are dual to each other. In simple linear re- 12 M. FRIENDLY, G. MONETTE AND J. FOX

Fig. 11. Scatterplot matrix, showing the pairwise relationships among Heart (y), Coffee (x1) and (x2), with linear regression lines and 68% data ellipses for the marginal bivariate relationships. gression, for example, each line in data space corre- example we assume that the main goal is to deter- sponds to a point in β space, the set of points on mine whether or not coffee is good or bad for your any line in β space corresponds to a of lines heart, and stress represents one potential confound- through a given point in data space, and the propo- ing variable among others (age, smoking, etc.) that sition that every pair of points defines a line in one might be useful to control statistically. space corresponds to the proposition that every two The plot in Figure 11 shows only the marginal re- lines intersect in a point in the other space. lationship between each pair of variables. The mar- Moreover, ellipsoids in these spaces are dual and ginal message seems to be that coffee is bad for your inversely related to each other. In data space, joint heart, stress is bad for your heart and coffee con- confidence intervals for the mean vector or joint pre- sumption is also related to occupational stress. Yet, diction regions for the data are given by the ellip- when we fit both variables together, we obtain the T soids (¯x1, x¯2) c√S. In the dual β space, joint con- following results, suggesting that coffee is good for ⊕ fidence regions for the coefficients of a response vari- you (the coefficient for coffee is now negative, though able y on (x1,x2) are given by ellipsoids of the form nonsignificant). How can this be? (See Table 2). 1 β c√S− . We illustrate these relationships in the Figure 12 shows the relationship between the pre- example⊕ below. dictors in data space and how this translates into b Figure 11 shows a scatterplot matrix among the joint and individual confidence intervals for the coef- variables Heart (y), an index of cardiac damage, ficients in β space. The left panel is the same as the Coffee (x1), a measure of daily coffee consumption, corresponding (Coffee, Stress) panel in Figure 11, and Stress (x2), a measure of occupational stress, but with a standard (40%) data ellipse. The right in a contrived sample of n = 20. For the sake of the panel shows the joint 95% confidence region and the ELLIPTICAL INSIGHTS 13

Table 2 simultaneous inference (“fishing”) in a 2-dimen- Coefficients and tests for the joint model predicting heart sional space. disease from coffee and stress 1 α/d 1 α/2d Similarly, using the factor F1,ν− = tν− Estimate β Std. error t value > t • q ( b) Pr( | |) would give an ellipse whose 1D shadows are 1 α − Intercept 7.7943 5.7927 1.35 0.1961 Bonferroni confidence intervals for d posterior hy- − − Coffee 0.4091 0.2918 1.40 0.1789 potheses. − − Stress 1.1993 0.2244 5.34 0.0001 Visual hypothesis tests and d = 1 confidence in- tervals for the parameters separately are obtained individual 95% confidence intervals in β space, de- from the red ellipse in Figure 12, which is scaled 0.95 0.975 termined as by F1,ν = tν . We call this the “confidence- q 0.95 1/2 interval generating ellipse” (or, more compactly, the β dFd,ν se SX− , ⊕ q × × “confidence-interval ellipse”). The shadows of the where d is theb number of dimensions for which we confidence-interval ellipse on the axes (thick red lines) want coverage, ν is the residual degrees of freedom give the corresponding individual 95% confidence for se, and SX is the covariance matrix of the pre- intervals, which are equivalent to the (partial, dictors. Type III) t-tests for each coefficient given in the Thus, the green ellipse in Figure 12 is the ellipse standard multiple regression output shown above. 0.95 Thus, controlling for Stress, the confidence interval of joint 95% coverage, using the factor 2F2,ν and q for the slope for Coffee includes 0, so we cannot re- covering the true values of (βStress, βCoffee) in 95% of ject the hypothesis that βCoffee = 0 in the multiple samples. Moreover: regression model, as we saw above in the numeri- Any joint hypothesis (e.g., H0 : βStress = 1, βCoffee = cal output. On the other hand, the interval for the • 1) can be tested visually, simply by observing slope for Stress excludes the origin, so we reject the whether the hypothesized point, (1, 1) here, lies null hypothesis that βStress = 0, controlling for Cof- inside or outside the joint confidence ellipse. fee consumption. The shadows of this ellipse on the horizontal and Finally, consider the relationship between the data • vertical axes give the Scheff´ejoint 95% confidence ellipse and the confidence ellipse. These have exactly intervals for the parameters, with protection for the same shape, but the confidence ellipse is exactly

Fig. 12. Data space and β space representations of Coffee and Stress. Left: Standard (40%) data ellipse. Right: Joint 95% confidence ellipse (green) for (βCoffee, βStress), CI ellipse (red) with 95% univariate shadows. 14 M. FRIENDLY, G. MONETTE AND J. FOX

H0 : βCoffee = 0 in the simple regression model. The oblique shadow of the red 95% confidence-interval ellipse onto the horizontal axis is slightly smaller. How much smaller is a function of the t-value of the coefficient for Stress? We can go further. As we noted earlier, all linear combinations of variables or parameters in data or models correspond graphically to projections (shad- ows) onto certain subspaces. Let’s assume that Cof- fee and Stress were measured on the same scales so it makes sense to ask if they have equal impacts on Heart disease in the joint model that includes them both. Figure 13 also shows an auxiliary axis through the origin with slope = 1 corresponding to values − of βStress βCoffee. The orthogonal projection of the coefficient− vector on this axis is the point estimate of βStress βCoffee and the shadow of the red ellipse along this− axis is the 95% confidence interval for the b b difference in slopes. This interval excludes 0, so we Fig. 13. Joint 95% confidence ellipse for (βCoffee, βStress), would reject the hypothesis that Coffee and Stress together with the 1D marginal confidence interval for βCoffee ignoring Stress (thick blue line), and a visual confidence in- have equal coefficients. terval for βStress βCoffee = 0 (dark cyan). − 4.7 Measurement Error a 90o rotation and rescaling of the data ellipse. In In classical linear models, the predictors are often directions in data space where the slice of the data considered to be fixed variables or, if random, to ellipse is wide—where we have more information be measured without error and independent of the about the relationship between Coffee and Stress— regression errors; either condition, along with the as- the projection of the confidence ellipse is narrow, sumption of linearity, guarantees unbiasedness of the reflecting greater precision of the estimates of coef- standard OLS estimators. In practice, of course, pre- ficients. Conversely, where slice of the the data el- dictor variables are often also observed indicators, lipse is narrow (less information), the projection of subject to error, a fact that is recognized in errors- the confidence ellipse is wide (less precision). See in-variables regression models and in more general Figure A.2 for the underlying geometry. structural equation models but often ignored other- The virtues of the confidence ellipse for visualiz- wise. Ellipsoids in data space and β space are well ing hypothesis tests and interval estimates do not suited to showing the effect of measurement error in end here. Say we wanted to test the hypothesis that predictors on OLS estimates. Coffee was unrelated to Heart damage in the sim- The statistical facts are well known, though per- ple regression ignoring Stress. The (Heart, Coffee) haps counter-intuitive in certain details: measure- panel in Figure 11 showed the strong marginal rela- ment error in a predictor biases regression coeffi- tionship between the variables. This can be seen in cients, while error in the measurement in y increases Figure 13 as the oblique projection of the confidence the standard errors of the regression coefficients but ellipse to the horizontal axis where βStress =0. The does not introduce bias. estimated slope for Coffee in the simple regression In the top row of Figure 11, adding measurement is exactly the oblique shadow of the center of the error to the Heart disease variable would expand ellipse (βCoffee, βStress) through the point where the the data ellipses vertically, but (apart from random ellipse has a horizontal tangent onto the horizontal variation) leaves the slopes of the regression lines b b axis at βStress = 0. The thick blue line in this fig- unchanged. Measurement error in a predictor vari- ure shows the confidence interval for the slope for able, however, biases the corresponding estimated Coffee in the simple regression model. The confi- coefficient toward zero (sometimes called regression dence interval does not cover the origin, so we reject attenuation) as well as increasing standard errors. ELLIPTICAL INSIGHTS 15

Fig. 14. Effects of measurement error in Stress on the marginal relationship between Heart disease and Stress. Each panel starts with the observed data (δ = 0), then adds random normal error, (0, δ SDStress), with δ = 0.75, 1.0, 1.5 , to the value N × { } of Stress. Increasing measurement error biases the slope for Stress toward 0. Left: 50% data ellipses; right: 50% confidence ellipses for (β0, βStress).

Figure 14 demonstrates this effect for the marginal relation between Heart disease and Stress, with data ellipses in data space and the corresponding confi- dence ellipses in β space. Each panel starts with the observed data (the darkest ellipse, marked 0), then adds random normal error, (0, δ SDStress), with δ = 0.75, 1.0, 1.5 , to the valueN of Stress,× while keeping the{ mean of Stress} the same. All of the data ellipses have the same vertical shadows (SDHeart), while the horizontal shadows increase with δ, driv- ing the slope for Stress toward 0. In β space, it can be seen that the estimated coefficients, (β0, βStress), vary along a line and approach βStress =0 for δ suf- ficiently large. The vertical shadows of ellipses for (β0, βStress) along the βStress axis also demonstrate the effects of measurement error on the standard error of βStress. Perhaps less well-known, but both more surpris- ing and interesting, is the effect that measurement error in one variable, x1, has on the estimate of the Fig. 15. Biasing effect of measurement error in one vari- coefficient for an other variable, x2, in a multiple able (Stress) on the coefficient of another variable (Coffee) in regression model. Figure 15 shows the confidence el- a multiple regression. The coefficient for Coffee is driven to- ward its value in the marginal model using Coffee alone, as lipses for (βCoffee, βStress) in the multiple regression predicting Heart disease, adding random normal er- measurement error in Stress makes it less informative in the joint model. ror (0, δ SDStress), with δ = 0, 0.2, 0.4, 0.8 , to theN value of× Stress alone. As can{ be plainly} seen, while this measurement error in Stress attenuates coefficient for Coffee toward that in the marginal its coefficient, it also has the effect of biasing the regression of Heart disease on Coffee alone. 16 M. FRIENDLY, G. MONETTE AND J. FOX

Fig. 16. Added variable plots for Stress and Coffee in the multiple regression predicting Heart disease. Each panel also shows ⋆ ⋆ the 50% conditional data ellipse for residuals (xk, y ), shaded red.

4.8 Ellipsoids in Added-Variable Plots Geometrically, in the space of the observations,11 the fitted vector y is the orthogonal projection of y In contrast to the marginal, bivariate views of the ⋆ ⋆ onto the subspace spanned by X. Then y and xk relationships of a response to several predictors (e.g., are the projectionsb onto the orthogonal complement such as shown in the top row of the scatterplot ma- of the subspace spanned by X[ k], so the simple re- trix in Figure 11), added-variable plots (aka partial ⋆ ⋆ −ˆ gression of y on xk has slope βk in the full model, regression plots) show the partial relationship be- ⋆ ˆ ⋆ and the residuals from the line y = βkxk in this tween the response and each predictor, where the plot are identically the residuals from the overall re- effects of all other predictors have been controlled gression of y on X. b or adjusted for. Again we find that such plots have Another way to describe the added-variable plot remarkable geometric properties, particularly when (AVP) for xk is as a 2D projection of the space of supplemented by ellipsoids. (y, X), viewed in a plane projecting the data along Formally, we express the fitted standard linear the intersection of two hyperplanes: the plane of the model in vector form as y y X = β 1 + β x + regression of y on all of X, and the plane of regres- ≡ | 0 1 1 β2x2 + + βpxp, with model matrix X = [1, x1,..., sion of y on X[ k]. A third plane, that of the re- · · · b b b b − xp]. Let X be the model matrix omitting the col- gression of xk on X[ k], also intersects in this space b [ kb] − umn for variable− k. Then, algebraically, the added and defines the horizontal axis in the AVP. This is variable plot for variable k is the scatterplot of the illustrated in Figure 17, showing one view defined residuals (x⋆, y⋆) from two auxillary regressions,10 by the intersection of the three planes in the right k panel.12 fitting y and xk from X[ k], − Figure 16 shows added-variable plots for Stress ⋆ and Coffee in the multiple regression predicting Heart y y others = y y X[ k], ≡ | − | − ⋆ xk xk others = xk b xk X[ k]. 11 ≡ | − | − The “space of the observations” is yet a third, n- dimensional, space, in which the observations are the axes b and each variable is represented as a point (or vector). See, 10These quantities can all be computed [Velleman and for example, Fox [(2008), Chapter 10]. Welsh (1981)] from the results of a single regression for the 12Animated 3D movies of this plot are included among the full model. supplementary materials for this paper. ELLIPTICAL INSIGHTS 17

Fig. 17. 3D views of the relationship between Heart, Coffee and Stress, showing the three regression planes for the marginal models, Heart Coffee (green), Heart Stress (pink), and the joint model, Heart Coffee + Stress (light blue). Left: a ∼ ∼ ∼ standard view; right: a view showing all three regression planes on edge. The ellipses in the side panels are 2D projections of the standard conditional (red) and marginal (blue) ellipsoids, as shown in Figure 18. disease, supplemented by data ellipses for the residu- pretation of variance inflation due to collinear pre- ⋆ ⋆ als (xk, y ). With reference to the properties of data dictors, as we describe below. ellipses in marginal scatterplots (see Figure 4), the (5) The vertical half-width of the data ellipse is following visual properties (among others) are useful proportional to the residual standard deviation se in this discussion. These results follow simply from in the multiple regression. ⋆ 2 translating “marginal” into “conditional” (or “par- (6) The squared horizontal positions, (xk) , in the tial”) in the present context. The essential idea is plot give the partial contributions to leverage on the ⋆ ⋆ coefficient βˆk of xk. that the data ellipse of the AVP for (xk,y ) is to the estimate of a coefficient in a multiple regression (7) Items (3) and (7) imply that the AVP for xk as the data ellipse of (x,y) is to simple regression. shows the partial influence of individual observa- ˆ Thus: tions on the coefficient βk, in the same way as in Fig- ure 9 for marginal models. These influence statistics (1) The simple regression least squares fit of y⋆ are often shown numerically as DFBETA statistics ⋆ ˆ on xk has slope βk, the partial slope for xk in the [Belsley, Kuh and Welsch (1980)]. full model (and intercept = 0). (8) The last three items imply that the collection (2) The residuals, (y⋆ y⋆), shown in this plot of added-variable plots for y and X provide an easy − are the residuals for y in the full model. way to visualize the leverage and influence that indi- b ⋆ ⋆ vidual observations—and indeed the joint influence (3 The correlation between xk and y , seen in the shape of the data ellipse for these variables, is of subsets of observations—have on the estimation of each coefficient in a given model. the partial correlation between y and xk with the other predictors in X[ k] partialled out. Elliptical insight also permits us to go further, − (4) The horizontal half-width of the AVP data to depict the relationship between conditional and ellipse is proportional to the conditional standard marginal views directly. Figure 18 shows the same deviation of xk remaining after all other predictors added-variable plots for Heart disease on Stress and have been accounted for, providing a visual inter- Coffee as in Figure 16 (with a zoomed-out scaling), 18 M. FRIENDLY, G. MONETTE AND J. FOX

Fig. 18. Added-variable + marginal plots for Stress and Coffee in the multiple regression predicting Heart disease. Each ⋆ ⋆ panel shows the 50% conditional data ellipse for xk,y residuals (shaded, red) as well as the marginal 50% data ellipse for the (xk,y) variables, shifted to the origin. Arrows connect the mean-centered marginal points (open circles) to the residual points (filled circles).

but here we also overlay the marginal data ellipses the marginal ellipse of (xk,y), with standard devia- for (xk,y), and marginal regression lines for Stress tion s(xk) to the width of the conditional ellipse of and Coffee separately. In 3D data space, these are (x⋆,y⋆), with standard deviation s(x others). This k k| the shadows (projections) of the data ellipsoid onto geometry implies interesting constraints among the the planes defined by the partial variables. In 2D three quantities: improvement in fit, VIF, and change AVP space, they are just the marginal data ellipses from the marginal to conditional slope. translated to the origin. Finally, Figure 18 also shows how conditioning on The most obvious feature of Figure 18 is that the other predictors works for individual observations, ⋆ ⋆ AVP for Coffee has a negative slope in the condi- where each point of (xk, y ) is the image of (xk, y) tional plot (suggesting that controlling for Stress, along the path of the marginal regression. This re- coffee consumption is good for your heart), while in minds us that the AVP is a 2D projection of the the marginal plot increasing coffee seems to be bad full space, where the regression plane of y on X[ k] − for your heart. This serves as a regression example becomes the vertical axis and the regression plane of Simpson’s paradox, which we considered earlier. of xk on X[ k] becomes the horizontal axis. − Less obvious is the fact that the marginal and AVP ellipses are easily visualized as a shadow versus 5. MULTIVARIATE LINEAR MODELS: a slice of the full data ellipsoid. Thus, the AVP el- HE PLOTS lipse must be contained in the marginal ellipse, as we Multivariate linear models (MvLMs) have a spe- can see in Figure 18. If there are only two x’s, then cial affinity with ellipsoids and elliptical geometry, the AVP ellipse must touch the marginal ellipse at as described in this section. To set the stage and es- two points. The shrinkage of the intersection of the tablish notation, we consider the MvLM [e.g., Timm AVP ellipse with the y axis represents improvement (1975)] given by the equation Y = XB + U, where in fit due to other x’s. Y is an n p matrix of responses in which each More importantly, the shrinkage of the width (pro- column represents× a distinct response variable; X is jected onto a horizontal axis) represents the square the n q model matrix of full column rank for the root of the variance inflation factor (VIF), which can regressors;× B is the q p matrix of regression co- be shown to be the ratio of the horizontal width of efficients or model parameters;× and U is the n p × ELLIPTICAL INSIGHTS 19

Table 3 Multivariate test statistics as functions of the eigenvalues λi solving det(H λE) = 0 or − eigenvalues ρi solving det[H ρ(H + E)]=0 − Criterion Formula “mean” of ρ Partial η2

s 1 s 2 1/s Wilks’s Λ Λ = Qi 1+λi = Qi (1 ρi) geometric η = 1 Λ s λi s− 2 −V Pillai trace V = Pi 1+λi = Pi ρi arithmetic η = s s s ρi 2 H Hotelling–Lawley trace H = Pi λi = Pi 1−ρi harmonic η = H+s ρ1 2 λ1 Roy maximum root R = λ1 = 1−ρ1 supremum η = 1+λ1 = ρ1

matrix of errors, with vec(U) p(0, In Σ), where is the Kronecker product. ∼N ⊗ ⊗ A convenient feature of the MvLM for general multivariate responses is that all tests of linear hy- potheses (for null effects) can be represented in the form of a general linear test,

(12) H0 : L B = 0 , (h q)(q p) (h p) × × × where L is a rank h q matrix of constants whose rows specify h linear≤ combinations or contrasts of the parameters to be tested simultaneously by a multivariate test. For any such hypothesis of the form given in equa- tion (12), the analogs of the univariate sums of squares for hypothesis (SSH ) and error (SSE) are the p p sum of squares and cross-products (SSP) matrices× given by T T T 1 (13) H SSPH = (LB) [L(X X)−L ]− (LB) ≡ Fig. 19. Geometry of the classical test statistics used in tests and b b of hypotheses in multivariate linear models. The figure shows the representation of the ellipsoid generated by (H + E) rel- T T T T ⋆ ⋆ (14) E SSPE = Y Y B (X X)B = U U, ative to E in canonical space where E = I and (H + E) is ≡ − the corresponding transformation of (H + E). where U = Y XB is the matrixb of residuals.b b Multi-b variate test statistics− (Wilks’s Λ, Pillai trace, Hotel- b b produce maximal univariate F statistics for the hy- ling–Lawley trace, Roy’s maximum root) for test- pothesis in equation (12); we refer to these as the ing equation (12) are based on the s = min(p, h) canonical discriminant dimensions. nonzero latent roots λ > λ > > λ of the matrix 1 2 · · · s Beyond the informal characterization of the four H relative to the matrix E, that is, the values of λ classical tests of hypotheses for multivariate linear for which det(H λE) = 0 or, equivalently, the la- − models given in Table 3, there is an interesting ge- tent roots ρi for which det[H ρ(H + E)]=0. The ometrical representation that helps one to appre- − details are shown in Table 3. These measures at- ciate their relative power for various alternatives. tempt to capture how “large” H is, relative to E This can be illustrated most simply in terms of the in s dimensions, and correspond to various “means” canonical representation, (H + E)⋆, of the ellipsoid as we described earlier. All of these statistics have generated by (H + E) relative to E, as shown in transformations to F statistics giving either exact Figure 19 for p = 2. or approximate null-hypothesis F distributions. The With λi as described above, the eigenvalues and ⋆ corresponding latent vectors provide a set of s or- squared radii of (H+E) are λi +1, so the lengths of thogonal linear combinations of the responses that the major and minor axes are a = √λ1 +1 and b = 20 M. FRIENDLY, G. MONETTE AND J. FOX

(a) Data ellipses (b) H and E matrices

Fig. 20. (a) Data ellipses and (b) corresponding HE plot for sepal length and petal length in the iris data set. The H ellipse is the data ellipse of the fitted values defined by the group means, y¯i· The E ellipse is the data ellipse of the residuals, (yij y¯i·). Using evidence (“significance”) scaling of the H ellipse, the plot has the property that the multivariate test for a − given hypothesis is significant by Roy’s largest root test iff the H ellipse protrudes anywhere outside the E ellipse.

√λ2 + 1, respectively. The diagonal of the represented visually by ellipses (or ellipsoids beyond comprising the segments a, b (labeled c) has length 2D) that express the size of covariation against a c = √a2 + b2. Finally, a from the origin multivariate null hypothesis (H) relative to error co- dropped perpendicularly to the diagonal joining the variation (E). The multivariate tests, based on the 1 two ellipsoid axes is labeled d. latent roots of HE− , are thus translated directly 1 In these terms, Wilks’s test, based on (1 + λi)− , to the sizes of the H ellipses for various hypotheses, is equivalent to a test based on a b whichQ is propor- E × relative to the size of the ellipse. Moreover, the tional to the area of the framing rectangle, shown shape and orientation of these ellipses show some- shaded in Figure 19. The Hotelling–Lawley trace thing more—the directions (linear combinations of test, based on λi, is equivalent to a test based the responses) that lead to various effect sizes and on c = λi +Pp. Finally, the Pillai Trace test, 1 significance. based onpPλi(1 + λi)− , can be shown to be equal Figure 20 illustrates this idea for two variables to 2 d 2 for p = 2. Thus, it is strictly monotone in − P from the iris data set. Panel (a) shows the data el- d and− equivalent to a test based directly on d. lipses for sepal length and petal length, equivalent to The geometry makes it easy to see that if there is a the corresponding plot in Figure 3. Panel (b) shows large discrepancy between λ and λ , Roy’s test de- 1 2 the HE plot for these variables from the one-way pends only on λ1 while the Pillai test depends more MANOVA model yij = µi + uij testing equal mean on λ2. Wilks’s Λ and the Hotelling–Lawley trace cri- vectors across species, H : µ = µ = µ . Let Y be terion are also functional averages of λ1 and λ2, with 0 1 2 3 the former being penalized when λ is small. In prac- the n p matrix of fitted values for this model, that 2 × T T b tice, when s 2, all four test criteria are equivalent, is, Y = y¯i . Then H = Y Y ny¯y¯ (where y¯ is { ·} − ≤ the grand-mean vector), and the H ellipse in the in that their standard transformations to F statis- b b b tics are exact and give rise to identical p-values. figure is then just the 2D projection of the data el- lipsoid of the fitted values, scaled as described be- 5.1 Hypothesis-Error (HE) Plots T low. Similarly, U = Y Y, and E = U U = (N − − The essential idea behind HE plots is that any g)S , so the E ellipse is the 2D projection of pooled b b b b multivariate hypothesis test, equation (12), can be the data ellipsoid of the residuals. Visually, the E ELLIPTICAL INSIGHTS 21 ellipsoid corresponds to shifting the separate within- length and sepal length, perhaps a measure of over- group data ellipsoids to the centroid, as illustrated all size. above in Figure 6(c). 5.2 Linear Hypotheses: Geometries of Contrasts In HE plots, the E matrix is first scaled to a co- and Sums of Effects variance matrix E/dfe, dividing by the error degrees of freedom, dfe. The ellipsoid drawn is translated to Just as in univariate ANOVA designs, important the centroid y of the variables, giving y cE1/2/df . overall effects (df > 1) in MANOVA may be use- ⊕ e h This scaling and translation also allows the means fully explored and interpreted by the use of con- for levels of the factors to be displayed in the same trasts among the levels of the factors involved. In space, facilitating interpretation. In what follows, we the general linear hypothesis test of equation (12), show these as “standard” bivariate ellipses of 68% contrasts are easily specified as one or more (h q) i × coverage, using c = 2F 0.68 , except where noted L matrices, L1, L2,..., each of whose rows sums to 2,dfe otherwise. q zero. The ellipse for H reflects the size and orientation As an important special case, for an overall effect of covariation against the null hypothesis. In relation with dfh degrees of freedom (and balanced sample to the E ellipse, the H ellipse can be scaled to show sizes), a set of dfh pairwise orthogonal (1 q) L matrices (LTL =0for i = j) gives rise to a set× of df either the effect size or strength of evidence against i j 6 h rank-one Hi matrices that additively decompose the H0 (significance). overall hypothesis SSCP matrix (by a multivariate For effect-size scaling, each H is divided by dfe to conform to E. The resulting ellipse is then ex- analog of Pythagoras’ Theorem), actly the data ellipse of the fitted values, and corre- H = H1 + H2 + + Hdfh , sponds visually to a multivariate analog of univari- · · · ate effect-size measures [e.g., (¯y y¯ )/s where s exactly as the univariate SSH may be decomposed 1 − 2 e e is the within-group standard deviation]. in an ANOVA. Each of these rank-one Hi matrices For significance scaling, it turns out to be most will plot as a vector in an HE plot, and their collec- visually convenient to use Roy’s largest root statis- tion provides a visual summary of the overall test, tic as the test criterion. In this case, the H ellipse is as partitioned by these orthogonal contrasts. Even scaled to H/(λαdfe), where λα is the critical value more generally, where the subhypothesis matrices of Roy’s statistic.13 Using this scaling gives a simple may be of rank > 1, the subhypotheses will have visual test of H0: Roy’s test rejects H0 at a given hypothesis ellipses of dimension rank(Hi) that are α level iff the corresponding α-level H ellipse pro- conjugate with respect to the hypothesis ellipse for trudes anywhere outside the E ellipse.14 Moreover, the joint hypothesis, provided that the estimators the directions in which the hypothesis ellipse exceed for the subhypotheses are statistically independent. the error ellipse are informative about the responses To illustrate, we show in Figure 21 an HE plot and their linear combinations that depart signifi- for the sepal width and sepal length variables in the cantly from H0. Thus, in Figure 20(b), the variation iris data, corresponding to panel (1:2) in Figure 3. of the means of the iris species shown for these two Overlayed on this plot are the one-df H matrices ob- variables appears to be largely one-dimensional, cor- tained from testing two orthogonal contrasts among responding to a weighted sum (or average) of petal the iris species: setosa vs. the average of versicolor and virginica (labeled “S:VV”), and versicolor vs. 13The F test based on Roy’s largest root uses the ap- virginica (“V:V”), for which the contrast matrices proximation F =(df2/df1)λ1 with degrees of freedom df1, df2, are where df1 = max(dfh, dfe) and df2 = dfe df1 + dfh. Invert- − ing the F statistic gives the critical value for an α-level test: L1 = ( 2 1 1) , 1−α − λα =(df1/df2)F . df1,df2 L2 =(0 1 1 ) , 14Other multivariate tests (Wilks’s Λ, Hotelling–Lawley − trace, Pillai trace) also have geometric interpretations in HE where the species (columns) are taken in alphabet- plots [e.g., Wilks’s Λ is the ratio of areas () of the H ical order. In this view, the joint hypothesis testing and E ellipses (ellipsoids); Hotelling–Lawley trace is based on equality of the species means has its major axis in the sum of the λi], but these statistics do not provide such simple visual comparisons. All HE plots shown in this paper data space largely in the direction of sepal length. use significance scaling, based on Roy’s test. The 1D degenerate “ellipse” for H1, representing 22 M. FRIENDLY, G. MONETTE AND J. FOX

F -statistic for zk, subject to being uncorrelated with all other linear combinations. The canonical projection of Y to canonical scores Z is given by

1 (15) Yn p Zn s = YE− V/dfe, × 7→ × where V is the matrix whose columns are the eigen- 1 vectors of HE− associated with the ordered nonzero eigenvalues, λi, i = 1,...,s. A MANOVA of all s lin- ear combinations is statistically equivalent to that of the raw data. The λi are proportional to the fractions of between-group variation expressed by these linear combinations. Hence, to the extent that the first one or two eigenvalues are relatively large, a two-dimensional display will capture the bulk of between-group differences. The 2D canonical dis- criminant HE plot is then simply an HE plot of the scores z and z on the first two canonical dimen- Fig. 21. H and E matrices for sepal width and sepal length 1 2 sions. (If s 3, an analogous 3D version may also in the iris data, together with H matrices for testing two or- ≥ thogonal contrasts in the species effect. be obtained.) Because the z scores are all mutually uncorre- lated, the H and E matrices will always have their the contrast of setosa with the average of the other axes aligned with the canonical dimensions. When, two species, is closely aligned with this axis. The as here, the z scores are standardized, the E ellipse “ellipse” for H2 has a relatively larger component will be circular, assuming that the axes in the plot aligned with sepal width. are equated so that a unit data length has the same 5.3 Canonical Projections: Ellipses in Data physical length on both axes. Space and Canonical Space Moreover, we can show the contributions of the original variables to discrimination as follows: Let HE plots show the covariation leading toward re- P be the p s matrix of the correlations of each jection of a hypothesis relative to error covariation column of Y×with each column of Z, often called for two variables in data space. To visualize these re- canonical structure coefficients. Then, for variable lationships for more than two response variables, we j, a vector from the origin to the point whose coor- can use the obvious generalization of a scatterplot dinates p j are given in row j of P has projections matrix showing the 2D projections of the H and E on the canonical· axes equal to these structure coef- ellipsoids for all pairs of variables. Alternatively, a ficients and squared length equal to the sum squares transformation to canonical space permits visualiza- of these correlations. tion of all response variables in the reduced-rank 2D Figure 22 shows the canonical HE plot for the (or 3D) space in which H covariation is maximal. iris data, the view in canonical space corresponding In the MANOVA context, the analysis is called to Figure 21 in data space for two of the variables canonical discriminant analysis (CDA), where the (omitting the contrast vectors). Note that for g = 3 emphasis is on dimension reduction rather than hy- groups, dfh =2, so s = 2 and the representation in pothesis testing. For a one-way design with g groups 2D is exact. This provides a very simple interpreta- and p-variate observations i in group j, yij, CDA tion: Nearly all (99.1%) of the variation in species finds a set of s = min(p, g 1) linear combinations, means can be accounted for by the first canonical T T − T z1 = c1 y, z2 = c2 y,...,zs = cs y, so that: (a) all zk dimension, which is seen to be aligned with three of are mutually uncorrelated; (b) the vector of weights the four variables, most strongly with petal length. c1 maximizes the univariate F statistic for the lin- The second canonical dimension is mostly related to ear combination z1; (c) each successive vector of variation in the means on sepal width, and this vari- weights, ck, k = 2,...,s, maximizes the univariate able is negatively correlated with the other three. ELLIPTICAL INSIGHTS 23

(or osculate) along a path between the two cen- troids. We call this path the locus of osculation, whose properties are described in Section 6.1. Perhaps more interesting and more productive is that the same geometric ideas apply equally well in parameter (β) space. Consider, for example, method A to be OLS estimation and several alternatives for method B, such as ridge regression (Section 6.3) or Bayesian estimation (Section 6.4). The remarkable fact is that the geometry of such kissing ellipsoids provides a clear visual interpretation of these cases and others, whenever we consider a convex combi- nation of information from two sources. In all cases, the locus of osculation is interpretable in terms of the statistical goal to be achieved, taking precision into account. 6.1 Locus of Osculation Fig. 22. Canonical HE plot for the Iris data. In this plot, the H ellipse is shown using effect-size scaling to preserve The problems mentioned above all have a simi- resolution, and the variable vectors have been multiplied by a lar and simple physical interpretation: Imagine two constant to approximately fill the plot space. The projections stones dropped into a pond at locations with co- of the variable vectors on the coordinate axes show the corre- lations of the variables with the canonical dimensions. ordinates m1 and m2. The waves emanating from the centers form concentric circles which osculate along the line from m1 to m2. Now imagine a world Finally, imagine a 4D version of the HE plot of with ellipse-generating stones, where instead of cir- Figure 21 in data space, showing the four-dimension- cles, the waves form concentric ellipses determined al ellipsoids for H and E. Add to this plot unit by the shape matrices A1 and A2. The locus of os- vectors corresponding to the coordinate axes, scaled culation of these ellipses will be the set of points to some convenient constant length. Some rotation where the tangents to the two ellipses are parallel would show that the H ellipsoid is really only two- (or, equivalently, that their normals are parallel). An dimensional, while E is 4D. Applying the transfor- example is shown in Figure 23, using m1 = ( 2, 2), 1 − mation given by E− as in Figure A.4 and projecting m2 = (2, 6), and into the 2D subspace of the nonzero dimensions of 1.0 0.5 1.5 0.3 H would give a view equivalent to the canonical HE (16) A = , A = − , 1  0.5 1.5  2  0.3 1.0  plot in Figure 22. The variable vectors in this plot − are just the shadows of the original coordinate axes. where we have found points of osculation by trial and error. 6. KISSING ELLIPSOIDS An exact general solution can be described as fol- lows: Let the ellipses for i = 1, 2 be given by In this section we consider some circumstances in T which there is a data stratification factor or there fi(x) = (x mi) Ai(x mi), are two (or more) principles or procedures for deriv- − − x x 2 m A ing estimates of a parameter vector β of a linear : fi( )= c = i i { } ⊕ p model, each with its associated estimated covari- and denote their gradient-vector functions as ance matrix, for example, βA with covariance matrix ∂f ∂f Var(βˆ A) and βB with covariance matrix Var(βˆ B). b (17) f(x1,x2)= , The simplest motivating example is two-group dis- ∇ ∂x1 ∂x2  d b d criminant analysis (Section 6.2). In data space, so- so that lutions to this statistical problem can be described f (x) = 2A (x m ), geometrically in terms of the property that the data ∇ 1 1 − 1 ellipsoids around the group will just “kiss” f (x) = 2A (x m ). ∇ 2 2 − 2 24 M. FRIENDLY, G. MONETTE AND J. FOX

Then, the points where f1 and f2 are parallel can be expressed in terms of∇ the condition∇ that their T vector cross product u ⊛ v = u1v2 u2v1 = v Cu = 0, where C is the skew-symmetric− matrix 0 1 C =  1 0  − satisfying C = CT. Thus, the locus of osculation is − 2 the set , given by = x R f1(x) ⊛ f2(x)= 0 , whichO implies O { ∈ |∇ ∇ } T T (18) (x m ) A CA (x m ) = 0. − 2 2 1 − 1 Equation (18) is a bilinear form in x, with central matrix ATCA , implying that is a 2 1 O in the general case. Note that when x = m1 or x = m2, equation (18) is necessarily zero, so the locus of osculation always passes through m1 and m2. A visual demonstration of the theory above is shown in Figure 24 (left), which overlays the ellipses in Figure 23 with contour lines (hyperbolae, here) of Fig. 23. Locus of osculation for two families of ellipsoidal the vector cross-product function contained in equa- level curves, with centers at m1 =( 2, 2) and m2 = (2, 6), and tion (18). When the contours of f1 and f2 have the − shape matrices A1 and A2 given in equation (16). The left same shape (A1 = cA2), as in the right panel of ellipsoids (red) have radii = 1, 2, 3. The right ellipsoids have Figure 24, equation (18) reduces to a line, in accord radii = 1, 1.74, 3.1, where the last two values were chosen to with the stones-in-pond interpretation. The above make them kiss at the points marked with squares. The black can be readily extended to ellipsoids in a higher di- curve is an approximation to the path of osculation, using a mension, where the development is more easily un- spline function connecting m1 to m2 via the marked points of osculation. derstood in terms of normals to the surfaces.

Fig. 24. Locus of osculation for two families of ellipsoidal level curves, showing contour lines of the vector cross-product function equation (18). The thick black curve shows the complete locus of osculation for these two families of ellipses, where the cross-product function equals 0. Left: with parameters as in Figure 23 and equation (16). Right: with the same shape matrix A1 for both ellipsoids. ELLIPTICAL INSIGHTS 25

6.2 Discriminant Analysis Ridge regression replaces the standard residual sum of squares criterion with a penalized form, The right panel of Figure 24, considered in data T T space, provides a visual interpretation of the clas- RSS(k) = (y Xβ) (y Xβ)+ kβ β, (20) − − sical, normal theory two-group discriminant analy- (k 0), sis problem under the assumption of equal popula- ≥ whose solution is easily seen to be tion covariance matrices, Σ1 = Σ2. Here, we imag- RR T 1 T ine that the plot shows the contours of data ellip- βk = (X X + kI)− X y soids for two groups, with mean vectors m and (21) 1 b = GβOLS, m2, and common covariance matrix A = Spooled = T 1 1 [(n1 1)S1 + (n2 1)S2]/(n1 + n2 2). where G = [I + k(X Xb )− ]− . Thus, as the “ridge − − − RR The discriminant axis is the locus of osculation be- constant” k increases, G decreases, driving βk to- tween the two families of ellipsoids. The goal in dis- ward 0 [Hoerl and Kennard (1970a, 1970b)]. The addition of a positive constant k to the diagonalb of criminant analysis, however, is to determine a classi- T T fication rule based on a linear function, (x)= bTx, X X drives det(X X + kI) away from zero even if D det(XTX) 0. such that an observation x will be classified as be- ≈ longing to Group 1 if (x) d, and to Group 2 The penalized Lagrangian formulation in equa- D ≤ tion (20) has an equivalent form as a constrained otherwise. In linear discriminant analysis, the dis- minimization problem, criminant function coefficients are given by T βRR = argmin(y Xβ) (y Xβ) 1 β − − b = Spooled− (m1 m2). − (22) b T subject to β β t(k), All boundaries of the classification regions deter- ≤ mined by d will then be the tangent lines (planes) to which makes the size constraint on the parameters the ellipsoids at points of osculation. The location explicit, with t(k) an of k. This form of the classification region along the line from m1 to provides a visual interpretation of ridge regression, m2 typically takes into account both the prior prob- as shown in Figure 25. Depicted in the figure are abilities of membership in Groups 1 and 2, and the costs of misclassification. Similarly, the left panel of Figure 24 is a visual representation of the same prob- lem when Σ1 = Σ2, giving rise to quadratic classifi- cation boundaries.6 6.3 Ridge Regression In the univariate linear model, y = Xβ + ε, high multiple correlations among the predictors in X lead to problems of —unstable OLS estimates of the parameters in β with inflated standard errors and coefficients that tend to be too large in abso- lute value. Although collinearity is essentially a data problem [Fox (2008)], one popular (if questionable) approach is ridge regression, which shrinks the es- timates toward 0 (introducing bias) in an effort to reduce sampling variance. Suppose the predictors and response have been centered at their means and the unit vector is omit- ted from X. Further, rescale the columns of X to Fig. 25. Elliptical contours of the OLS residual sum of T unit length, so that X X is a correlation matrix. squares for two parameters in a regression, together with cir- cular contours for the constraint function, β2 + β2 t. Ridge Then, the OLS estimates are given by 1 2 ≤ regression finds the point βRR where the OLS contours just OLS T 1 T (19) β = (X X)− X y. kiss the constraint region. b 26 M. FRIENDLY, G. MONETTE AND J. FOX the elliptical contours of the OLS regression sum of tion of k. A bivariate version of this plot, with confi- squares, RSS(0) around βOLS. Each ellipsoid marks dence ellipses for the parameters, is introduced here. T the point closest to the origin, that is, with min β β. This plot provides greater insight into the effects of b It is easily seen that the ridge regression solution is k on coefficient variance.15 the point where the elliptical contours just kiss the Confidence ellipsoids for the OLS estimator are constraint contour. generated from the estimated covariance matrix of Another insightful interpretation of ridge regres- the coefficients, sion [Marquardt (1970)] sees the ridge estimator as OLS 2 T 1 equivalent to an OLS estimator, when the actual Var(β ) =σ ˆe (X X)− . data in X are supplemented by some number of fic- For the ridged estimator, this becomes [Marquardt titious observations, n(k), with uncorrelated predic- (1970)] tors, giving rise to an orthogonal X0 matrix, and k RR 2 T 1 T where y = 0 for all supplementary observations. The Var(β ) =σ ˆe [X X + kI]− (X X) (25) linear model then becomes T 1 d [X X + kI]− , y X e · (23) = βRR + ,  0   X0   e0  which coincides with the OLS result when k = 0. k k Figure 26 uses the classic Longley (1967) data to which gives rise to the solution illustrate bivariate ridge trace plots. The data con- RR T 0 T 0 1 T sist of an economic time (n = 16) observed (24) β = [X X + (Xk) Xk]− X y. yearly from 1947 to 1962, with the number of people 0 0 T 0 But becauseb Xk is orthogonal, (Xk) Xk is a scalar Employed as the response and the following predic- multiple of I, so there exists some value of k mak- tors: GNP, Unemployed, Armed.Forces, Population, ing equation (24) equivalent to equation (21). As Year, and GNP.deflator (using 1954 as 100).16 For promised, the ridge regression estimator then re- each value of k, the plot shows the estimate β, to- flects a weighted average of the data [X, y] with gether with the variance ellipse. For the sake of this n(k) observations [X0, 0] biased toward β = 0. In b k example, we assume that GNP is a primary predic- Figure 25, it is easy to imagine that there is a direct tor of Employment, and we wish to know how other translation between the size of the constraint region, predictors modify the regression estimates and their t(k), and an equivalent supplementary sample size, variance when ridge regression is used. n(k), in this interpretation. For these data, it can be seen that even small val- This classic version of the ridge regression problem ues of k have substantial impact on the estimates β. can be generalized in a variety of ways, giving other geometric insights. Rather than a constant multi- b T 15 plier k of β β as the penalty term in equation (20), Bias and mean-squared error are a different matter: Al- T though Hoerl and Kennard (1970a) demonstrate that there is consider a penalty of the form β Kβ with a positive a range of values for the ridge constant k for which the MSE definite matrix K. The choice K = diag(k1, k2,...) of the ridge estimator is smaller than that of the OLS estima- gives rise to a version of Figure 25 in which the con- tor, to know where this range is located requires knowledge of straint contours are ellipses aligned with the coordi- β. As we explain in the following subsection, the constraint ˆ nate axes, with axis lengths inversely proportional on β incorporated in the ridge estimator can be construed to k . These constants allow for differential shrinkage as a Bayesian prior; the fly in the ointment of ridge regres- i sion, however, is that there is no reason to suppose that the of the OLS coefficients. The visual solution to the ridge-regression prior is in general reasonable. obvious modification of equation (22) is again the 16Longley (1967) used these data to demonstrate the ef- point where the elliptical contours of RSS(0) kiss the fects of numerical instability and round-off error in least squares computations based on direct computation of the contours of the (now elliptical) constraint region. T cross-products matrix, X X. Longley’s paper sparked the de- 6.3.1 Bivariate ridge trace plots Ridge regression velopment of a wide variety of numerically stable least squares is touted (optimistically we think) as a method to algorithms (QR, modified Gram-Schmidt, etc.) now used is counter the effects of collinearity by trading off a almost all statistical software. Even ignoring numerical prob- small amount of bias for an advantageous decrease in lems (not to mention problems due to lack of independence), these data would be anticipated to exhibit high collinearity variance. The results are often visualized in a ridge because a number of the predictors would be expected to have trace plot [Hoerl and Kennard (1970b)], showing the strong associations with year and/or population, yet both of changes in individual coefficient estimates as a func- these are also included among the predictors. ELLIPTICAL INSIGHTS 27

Fig. 26. Bivariate ridge trace plots for the coefficients of Unemployed and Population against the coefficient for GNP in Longley’s data, with k = 0, 0.005, 0.01, 0.02, 0.04, 0.08. In both cases the coefficients are driven on average toward zero, but the bivariate plot also makes clear the reduction in variance. To reduce overlap, all variance ellipses are shown with 1/2 the standard radius.

What is perhaps more dramatic (and unseen in uni- To focus on alternative estimators, we can complete variate trace plots) is the impact on the size of the the square around β = βOLS to give variance ellipse. Moreover, shrinkage in variance is T T (y Xβ) (y Xbβ) =b (y Xβ) (y Xβ) generally in a similar direction to the shrinkage in − − − − (26) T T the coefficients. This new graphical method is de- + (β bβ) (X X)(b β β). veloped more fully in Friendly (2013), including 2D − − and 3D plots, as well as more informative represen- With a little manipulation, a conjugateb prior,b of β 2 β 2 2 tations of shrinkage by ellipsoids in the transformed the form Pr( ,σ ) = Pr( σ ) Pr(σ ), can be ex- pressed with Pr(σ2) an inverse| × gamma distribution space of the SVD of the predictors. depending on the first term on the right-hand side of 6.4 Bayesian Linear Models equation (26) and Pr(β σ2) a normal distribution, | In a Bayesian alternative to standard least squares Pr(β σ2) | estimation, consider the case where our prior infor- 2 p (27) (σ )− mation about β can be encapsulated in a distribu- ∝ prior 1 T tion with a prior mean β and covariance matrix exp (β βprior) A(β βprior) . A. We show that under reasonable conditions the · −2σ2 − −  posterior Bayesian posterior estimate, β , turns out to 2 y X prior The posterior distribution is then Pr(β,σ , ) be a weighted average of the prior coefficients β Pr(y X, β,σ2) Pr(β σ2) Pr(σ2), whence,| after∝ OLS b and the OLS solution β , with weights propor- some| simplification,× the| posterior× mean can be ex- 1 tional to the conditional prior precision, A− , and pressed as b XTX the data precision given by . Once again, this posterior T 1 T OLS prior can be understood geometrically as the locus of os- (28) β = (X X + A)− (X Xβ + Aβ ) T 1 culation of ellipsoids that characterize the prior and withb covariance matrix (X X + A)−b . The poste- the data. rior coefficients are therefore a weighted average of Under Gaussian assumptions, the conditional like- the prior coefficients and the OLS estimates, with lihood can be written as weights given by the conditional prior precision, A, and the data precision, XTX. Thus, as we increase (y X, β,σ2) L | the strength of our prior precision (decreasing prior 2 n/2 1 T variance), we place greater weight on our prior be- (σ )− exp (y Xβ) (y Xβ) . ∝ −2σ2 − −  liefs relative to the data. 28 M. FRIENDLY, G. MONETTE AND J. FOX

In this context, ridge regression can be seen as where u and ε are assumed to have normal distri- the special case where βprior = 0 and A = kI, and butions with mean 0 and where Figure 25 provides an elliptical visualization. b u G 0 In equation (24), the number of observations, n(k) (31) Var = , 0  ε   0 R  corresponding to Xk, can be seen as another way of expressing the weight of the prior in relation to the where G = diag(G1,..., Gm), R = diag(R1,..., Rm) data. and m is the number of clusters. The variance of y is therefore V = ZGZT + R, and when Z = 0 and 6.5 Mixed Models: BLUEs and BLUPs R = σ2I, the mixed model in equation (30) reduces In this section we make implicit use of the du- to the standard linear model. ality between data space and β space, where lines We now consider the case in which Zi = Xi and we in one map into points in the other and ellipsoids wish to predict βi = β +ui, the vector of parameters help to visualize the precision of estimates in the for the ith cluster. At one extreme, we could simply context of the linear mixed model for hierarchical ignore clusters and use the common mixed-model data. We also show visually how the best linear un- generalized-least-square estimate, biased predictors (BLUPs) from the mixed model GLS T 1 1 T 1 can be seen as a weighted average of the best linear (32) β = (X V− X)− X V− y, unbiased estimates (BLUEs) derived from OLS re- T 1 1 whose samplingb variance is Var(βGLS) = (X V X)− . gressions performed within clusters of related data − It is an unbiased predictor of β since E(βGLS and the overall mixed model GLS estimate. b i β ) = 0. With moderately large m, the sampling− The mixed model for hierarchical data provides i b variance may be small relative to G and Var(βGLS a general framework for dealing with dependence i − among observations in linear models, such as occurs βi) Gi. ≈ b when students are sampled within schools, schools At the other extreme, we ignore the fact that clus- within counties and so forth [e.g., Raudenbush and ters come from a common population and we calcu- Bryk (2002)]. In these situations, the assumption late the separate BLUE estimate within each clus- of OLS that the errors are conditionally indepen- ter, dent is probably violated, because, for example, stu- blue T 1 T dents nested within the same school are likely to βi = (Xi Xi)− Xi yi (33) have more similar outcomes than those from differ- b blue 2 T 1 with Var(β β ) S = σ (Xi Xi)− . ent schools. Essentially the same model, with pro- i | i ≡ i vision for serially correlated errors, can be applied Both extremes have drawbacks:b whereas the pooled to longitudinal data [e.g., Laird and Ware (1982)], overall GLS estimate ignores variation between clus- although we will not pursue this application here. ters, the unpooled within-cluster BLUE ignores the The mixed model for the n 1 response vector i × common population and makes clusters appear to yi in cluster i can be given as differ more than they actually do. This dilemma led to the development of BLUPs yi = Xiβ + Ziui + εi, (best linear unbiased predictor) in models with ran- (29) u (0, G ), i ∼Nq i dom effects [Henderson (1975), Robinson (1991), Speed (1991)]. In the case considered here, the BLUPs εi ni (0, Ri), ∼N are an inverse-variance weighted average of the mixed- β where is a p 1 vector of parameters correspond- model GLS estimates and of the BLUEs. The BLUP ing to the fixed× effects in the n p model matrix i × is then Xi; ui is a q 1 vector of coefficients corresponding × blup 1 1 1 to the random effects in the n q model matrix Z ; β = (S − + G− )− i × i i i i Gi is the q q covariance matrix of the random ef- (34) × e 1 blue 1 GLS fects in u ; and R is the n n covariance matrix (S− βi + G− βi ). i i i i · i i of the errors in ε . × i This “partial pooling” optimallyb combinesb the infor- Stacking the y , X , Z and so forth in the obvious i i i mation from cluster i with the information from all way then gives blue GLS clusters, shrinking βi toward β . Shrinkage for (30) y = Xβ + Zu + ε, a given parameter β is greater when the sample b ij b ELLIPTICAL INSIGHTS 29

Fig. 27. Comparing BLUEs and BLUPs. Each panel plots the OLS estimates from separate regressions for each school (BLUEs) versus the mixed model estimates from the random intercepts and slopes model (BLUPs). Left: intercepts; Right: slopes for CSES. The shrinkage of the BLUPs toward the GLS estimate is much greater for slopes than intercepts. size n is small or when the variance of the corre- SES (mean SES) , for ease of interpretation (mak- i ij − i sponding random effect, gijj, is small. ing the within-school intercept for school i equal to Equation (34) is of the same form as equation (28) the mean SES in that school). and other convex combinations of estimates consid- For simplicity, we consider the case of CSES as ered earlier in this section. So once again, we can a single quantitative predictor in X in the example understand these results geometrically as the locus below. We fit and compare the following models: of osculation of ellipsoids. Ellipsoids kiss for a rea- (35) y (β + x β ,σ2) pooled OLS, son: to provide an optimal convex combination of i ∼N 0 i 1 2 information from two sources, taking precision into (36) yi (β0i + xiβ1i,σ ) unpooled BLUEs, ∼N i account. y (β + x β + u + x u ,σ2) i ∼N 0 i 1 0i i 1i i 6.5.1 Example: Math achievement and SES To il- (37) lustrate, we use a classic data set from Bryk and BLUPs: random intercepts and slopes, Raudenbush (1992) and Raudenbush and Bryk (2002) and also include a fixed effect of sector, common dealing with math achievement scores for a sub- to all models; for compactness, the sector effect is sample of 7185 students from 160 schools in the elided in the notation above. 1982 High School & Beyond survey of U.S. pub- In expositions of mixed-effects models, such mod- lic and Catholic high schools conducted by the Na- els are often compared visually by plotting predicted tional Center for Education Statistics (NCES). The values in data space, where each school appears as a data set contains 90 public schools and 70 Catholic fitted line under one of the models above (sometimes schools, with sample sizes ranging from 14 to 67. called “spaghetti plots”). Our geometric approach The response is a standardized measure of math leads us to consider the equivalent but simpler plots achievement, while student-level predictor variables in the dual β space, where each school appears as a include sex and student socioeconomic status (SES), point. and school-level predictors include sector (public or Figure 27 plots the unpooled BLUE estimates Catholic) and mean SES for the school (among other against the BLUPs from the random effects model, variables). Following Raudenbush and Bryk (2002), with separate panels for intercepts and slopes to student SES is considered the main predictor and is illustrate the shrinkage of different parameters. In typically analyzed centered within schools, CSESij = these data, the variance in intercepts (average math 30 M. FRIENDLY, G. MONETTE AND J. FOX

inferences and measures of heterogeneity across stud- ies. The application of mixed model ideas in this con- text differs from the standard situation in that indi- vidual data are usually unavailable and use is made instead of summary data (estimated treatment ef- fects and their ) from the published lit- erature. The multivariate extension of standard uni- variate methods of meta-analysis allows the correla- tions among outcome effects to be taken into ac- count and estimated, and regression versions can incorporate study-specific covariates to account for some inter-study heterogeneity. More importantly, we illustrate a graphical method based (of course) on ellipsoids that serves to illustrate bias, heterogeneity and shrinkage in BLUPs, and the optimism of fixed- effect estimates when study heterogeneity is ignored. Fig. 28. Comparing BLUEs and BLUPs. The plot shows The general mixed-effects multivariate meta-anal- ellipses of 50% coverage for the estimates of intercepts and ysis model can be written as slopes from OLS regressions (BLUEs) and the mixed model (BLUPs), separately for each sector. The centers of the el- (38) yi = Xiβ + δi + ei, lipses illustrate how the BLUPS can be considered a weighted average of the BLUEs and the mixed-model GLS estimate, where yi is a vector of p outcomes (means or treat- ignoring sector. The relative sizes of the ellipses reflect the ment effects) for study i; Xi is the matrix of study- smaller variance for the BLUPs compared to the BLUEs, par- level predictors for study i or a unit vector when no ticularly for slope estimates. covariates are available; β is the population-averaged vector of regression parameters or effects (intercepts, achievement for students at CSES = 0), g00, among means) when there are no covariates; δi is the p- schools in each sector is large, so the mixed-effects vector of random effects associated with study i, estimates have small weight and there is little shrink- whose p p covariance matrix ∆ represents the be- × age. On the other hand, the variance component for tween-study heterogeneity unaccounted for by Xiβ; slopes, g11, is relatively small, so there is greater and, finally, ei is the p-vector of random sampling shrinkage toward the GLS estimate. errors (independent of δi) within study i, having the For the present purposes, a more useful visual p p covariance matrix Si. representation of these model comparisons can be ×With suitable distributional assumptions, the shown together in the space of (β0, β1), as in Fig- mixed-effects model in equation (38) implies that ure 28. Estimates for individual schools are not shown, (39) y (X β, ∆ + S ) but rather these are summarized by the ellipses of i ∼Np i i 50% coverage for the BLUEs and BLUPs within with Var(yi)= ∆ + Si. When all the δi = 0, and each sector. The centers of the ellipsoids indicate thus ∆ = 0, equation (38) reduces to a fixed-effects the relatively greater shrinkage of slopes compared model, yi = Xiβ + ei, which can be estimated by to intercepts. The sizes of ellipsoids show directly GLS to give the greater precision of the BLUPs, particularly for GLS T 1 T 1 slopes. (40) β = (X SX)− X S− y, GLS T 1 6.6 Multivariate Meta-Analysis (41) Var(βb ) = (X SX)− ,

A related situation arises in random effects mul- where y dand bX are the stacked yi and Xi, and S tivariate meta-analysis [Berkey et al. (1998), Nam, is the block-diagonal matrix containing the Si. The Mengersen and Garthwaite (2003)], where several fixed-effects model ignores unmodeled heterogene- outcome measures are observed in a series of sim- ity among the studies, however, and consequently ilar research studies and it is desired to synthesize the estimated effects in equation (40) may be bi- those studies to provide an overall (pooled) sum- ased and the estimated uncertainty of these effects mary of the outcomes, together with meta-analytic in equation (41) may be too small. ELLIPTICAL INSIGHTS 31

Fig. 29. Multivariate meta-analysis visualizations for five periodontal treatment studies with outcome measures PD and AL. Left: Individual study estimates yi and 40% standard ellipses for Si (dashed, red) together with the pooled, fixed effects estimate and its associated covariance ellipse (blue). Right: BLUPs from the random-effects multivariate meta-analysis model and their associated covariance ellipses (red, solid), together with the pooled, population-averaged estimate and its covariance ellipse (blue), and the estimate of the between-study covariance matrix, ∆ (green). Arrows show the differences between the FE and the RE models.

The example we use here concerns the compari- are positive), while nonsurgical treatment results son of surgical (S) and nonsurgical (NS) procedures in better attachment level (all estimates are neg- for the treatment of moderate periodontal disease ative). As well, within each study, there is a con- in five randomized split-mouth design clinical tri- sistently positive correlation between the two out- als [Antczak-Bouckoms et al. (1993), Berkey et al. come effects: patients with a greater surgical vs. (1998)]. The two outcome measures for each patient nonsurgical difference on one measure tend to have were pre- to post-treatment changes after one year a greater such difference on the other, and greater in probing depth (PD) and attachment level (AL), within-study variation on PD than on AL.18 The in mm, where successful treatment should decrease overall sizes of the ellipses largely reflect (inversely) probing depth and increase attachment level. Each the sample sizes in the various studies. As far as study was summarized by the mean difference, yi = we know, these results were not noted in previous S NS (yi yi ), between S and NS treated teeth, to- analyses of these data. Finally, the fixed-effect es- − GLS gether with the covariance matrix Si for each study. timate, β = (0.307, 0.394), and its covariance Sample sizes ranged from 14 to 89 across studies. ellipse suggest that these− effects are precisely esti- b The left panel of Figure 29 shows the individ- mated. ual study estimates of PD and AL together with The random-effects model is more complex be- their covariances ellipses in a generic form that we cause β and ∆ must be estimated jointly. A vari- propose as a more useful visualization of multivari- ety of methods have been proposed (full maximum ate meta-analysis results than standard tabular dis- likelihood, restricted maximum likelihood, method plays: individual estimates plus model-based sum- of moments, Bayesian methods, etc.), whose details mary, all with associated covariance ellipsoids.17 [for which see Jackson, Riley and White (2011)] are It can be seen that all studies show that surgi- not relevant to the present discussion. Given an esti- cal treatment yields better probing depth (estimates 18Both PD and AL are measured on the same scale (mm), 17The analyses described here were carried out using the and the plots have been scaled to have unit aspect ratio, jus- mvmeta package for R [Gasparrini (2012)]. tifying this comparison. 32 M. FRIENDLY, G. MONETTE AND J. FOX mate ∆, however, the pooled, population-averaged error in the average point estimates V and the error point estimate of effects under the random-effects in the random deviation predicted for each study. modelb can be expressed as 1 7. DISCUSSION AND CONCLUSIONS RE T 1 − T 1 (42) β = X Σ− X X Σ− y ,  i i i  i i i I know of scarcely anything so apt to im- Xi Xi b press the imagination as the wonderful form where Σi = Si + ∆. The first term in equation (42) of cosmic order expressed by the “[Ellipti- gives the estimated covariance matrix V Var(βRE) cal] Law of Frequency of Error.” The law b of the random-effect pooled estimates. For≡ the present would have been personified by the Greeks example, this is shown in the right paneld ofb Fig- and deified, if they had known of it.... It ure 29 as the blue ellipse. The green ellipse shows is the supreme law of Unreason. When- the estimate of the between-study covariance, ∆, ever a large sample of chaotic elements are whose shape indicates that studies with a larger es- taken in hand and marshaled in the order timate of PD also tend to have a larger estimateb of their magnitude, an unsuspected and of AL (with correlation = 0.61). It is readily seen most beautiful form of regularity proves that, relative to the fixed-effects estimate βGLS, the to have been latent all along. RE unbiased estimate β = (0.353, 0.339) under the Sir Francis Galton, Natural Inheritance, random-effects model has been shifted− towardb the b London: Macmillan, 1889 (“[Elliptical]” centroid of the individual study estimates and that added). its covariance ellipse is now considerably larger, re- flecting between-study heterogeneity. In contrast to We have taken the liberty to add the word “Ellip- RE tical” to this famous quotation from Galton (1889). the fixed-effect estimates, inferences on H0 : β = 0 pertain to the entire population of potential studies His “supreme law of Unreason” referred to univari- of these effects. b ate distributions of observations tending to the Nor- Figure 29 (right) also shows the best linear un- mal distribution in large samples. We believe he biased predictions of individual study estimates and would not take us remiss, and might perhaps wel- their associated covariance ellipses, superposed (pure- come us for extending this view to two and more ly for didactic purposes) on the fixed-effects esti- dimensions, where ellipsoids often provide an “un- mates to allow direct comparison. For random-effects suspected and most beautiful form of regularity.” models, the BLUPs have the form In statistical data, theory and graphical meth- ods, one fundamental organizing distinction can be BLUP RE 1 RE (43) β = β + ∆Σ− (y β ), i i i − made depending on the dimensionality of the prob- with covarianceb matricesb b b lem. A coarse but useful scale considers the essential BLUP 1 defining distinctions to be among: (44) Var(βi )= V + (∆ ∆Σi− ∆). − ONE (univariate), Algebraically,d b the BLUPb outcomeb estimatesb in • equation (43) are thus a weighted average of the po- TWO (bivariate), • MANY (multivariate). pulation-averaged estimates and the study-specific • estimates, with weights depending on the relative This scale19 at least implicitly organizes much of sizes of the within- and between-study covariance ma- current statistical teaching, practice and software. trices Si and ∆. The point BLUPs borrow strength But within this classification, the data, theory and from the assumption of an underlying multivariate graphical methods are often treated separately (1D, distribution of study parameters with covariance ma- 2D, nD), without regard to geometric ideas and vi- trix ∆, shrinking toward the mean inversely pro- sualizations that help to unify them. portional to the within-study covariance. Geometri- This paper starts from the premise that one geo- cally, these estimates may be described as occurring metric form—the ellipsoid—provides a unifying along the locus of osculation of the ellipses (y , S ) E i i framework for many statistical phenomena, with sim- and (β, ∆). ple representations in 1D (a line) and 2D (an ellipse) Finally,E the right panel of Figure 29 also shows the b b covariance ellipses of the BLUPs from equation (43). 19This idea, as a unifying classification principle for data It is clear that their orientation is a blending of the analysis and graphics, was first suggested to the first author correlations in V and Si, and their size reflects the in seminars by John Hartigan at Princeton, c. 1968. ELLIPTICAL INSIGHTS 33 that extend naturally to higher dimensions (an el- Monette (1990) took these ideas several steps fur- lipsoid). The intellectual leap in statistical think- ther, developing interactive 3D graphics focused on ing from ONE to TWO in Galton (1886) was enor- linear models, geometry and ellipsoids, and demon- mous. Galton’s visual insights derived from the el- strating how many statistical properties and results lipse quickly led to an understanding of the ellipse as could be understood through the geometry of ellip- a contour of a bivariate normal . From here, soids. Yet, even at this later date, the graphical fa- the step from TWO to MANY would take another cilities of readily available statistical software were 20–30 years, but it is hard to escape the conclusion still rather primitive, and 3D graphics was available that geometric insight from the ellipse to the gen- only on high-end workstations. eral ellipsoid in nD played an important role in the Several features of the current discussion may help development of multivariate statistical methods. to present these ideas in a new light. First, the exam- In this paper, we have tried to show how ellip- ples we have presented rely heavily on software for soids can be useful tools for visual statistical think- statistical graphics developed separately and jointly ing, data analysis and pedagogy in a variety of con- by all three authors. These have allowed us to create texts often treated separately and from a univari- what we hope are compelling illustrations, all sta- ate perspective. Even in bivariate and multivariate tistically and geometrically exact, of the principles problems, first-moment summaries (a 1D regression and ideas that form the body of the paper. More- line or 2 + D regression surface) show only part of over, these are now general methods, implemented in the story—that of the expectation of a response y a variety of R packages, for example, Fox and Weis- given predictors X. In many cases, the more inter- berg (2011), Friendly (2007b), and a large collec- esting part of the story concerns the precision of tion of SAS macros (http://datavis.ca/sasmac), various methods of estimation, which we’ve shown so we hope this paper will contribute to turning the to be easily revealed through data ellipsoids and el- theory we describe into practice. liptical confidence regions for parameters. Second, we have illustrated, in a wide variety of The general relationships among statistical meth- contexts, comprising all classical (Gaussian) linear ods, matrix algebra and geometry are not new here. models, multivariate linear models and several ex- To our knowledge, Dempster (1969) was the first to tensions, how ellipsoids can contribute substantially exploit these relationships in a systematic fashion, to the understanding of statistical relationships, both establishing the connections among abstract vector in data analysis and in pedagogy. One graphical spaces, algebraic coordinate systems, matrix oper- theme underlying a number of our examples is how ations and properties, the dualities between obser- the simple addition of ellipses to standard 2D graph- vation space and variable space, and the geometry ical displays provides an efficient visual summary of ellipses and projections. The roots of these con- of important bivariate statistical quantities (means, nections go back much further—to Cram´er (1946) variances, correlation, regression slopes, etc.). While (the idea of the concentration ellipsoid), to Pearson first-moment visual summaries are now common ad- (1901) and Hotelling (1933) (principal components), juncts to graphical displays in standard software, of- and, we maintain, ultimately to Galton (1886). ten by default, we believe that the second-moment Throughout this development, elliptical geometry visual summaries of ellipses (and ellipsoids in 3D) has played a fundamental role, leading to important now deserve a similar place in statistical practice. visual insights. Finally, we have illustrated several recent or en- The separate and joint roles of statistical compu- tirely new visualizations of statistical methods and tation and computational graphics should not be un- results, all based on elliptical geometry. HE plots derestimated in appreciation of these developments. for MANOVA designs [Friendly (2007b)] and their Dempster’s analysis of the connections among geom- projections into canonical space (Section 5) provide etry, algebra and statistical methods was fueled by one class of examples where ellipsoids provide sim- the development and software implementation of al- ple visual summaries of otherwise complex statisti- gorithms [Gram-Schmidt orthogonalization, Choles- cal results. Our analysis of the geometry of added ky decomposition, sweep and multistandardize op- variable-plots suggested the idea of superposing mar- erators from Beaton (1964)] that allowed him to ginal and conditional plots, as in Figure 18, lead- show precisely the translation of theoretical rela- ing to direct visualization of the difference between tions from abstract algebra to numbers and from marginal and conditional relationships in linear mod- there to graphs and diagrams. els. The bivariate ridge trace plots described in Sec- 34 M. FRIENDLY, G. MONETTE AND J. FOX tion 6.3.1 are a direct outgrowth of the geometric ap- (some linear combinations will have infinite confi- proach taken here, emphasizing the duality between dence intervals) iff the corresponding data ellip- views in data space and in parameter (β) space. We soid is flat, as when p>n or some predictors are believe these all embody von Humboldt’s (1811) dic- collinear. T tum, quoted in the Introduction. Defining ellipsoids with x : x Cx 1 produces proper ellipsoids for C positive{ definite≤ and} unbound- APPENDIX: GEOMETRICAL AND ed, fat ellipsoids for C positive semi-definite. But it STATISTICAL ELLIPSOIDS does not produce degenerate (i.e., flat) ellipsoids. On the other hand, the representation in equation (3), This appendix outlines useful results and proper- := A , with the unit sphere, produces proper E S S T 1 ties concerning the representation of geometric and ellipsoids when C = (A A)− where A is a nonsin- statistical ellipsoids. A number of these can be traced gular p p matrix and degenerate ellipsoids when to or have more general descriptions within the ab- A is a singular,× but does not produce unbounded stract formulation of Dempster (1969), but casting ellipsoids. them in terms of ellipsoids provides a simpler and One representation that works for all ellipsoids— more easily visualized framework. fat or flat and bounded or unbounded—can be based on a decomposition (SVD) represen- A.1 Taxonomy and Representation of tation A = U∆VT, with Generalized Ellipsoids (A.1) := U(∆ ), Section 2.1 defined a proper (origin-centered) el- E S where U is orthogonal and ∆ is diagonal with non- lipsoid in Rp by := x : xTCx 1 that is bounded E { ≤ } negative reals or infinity.20 The “inverse” of an el- with nonempty interior (call these “fat” ellipsoids). 1 lipsoid is then simply U(∆− ). The connection For more general purposes, particularly for statis- with traditionalE representationsS is that, if ∆ is fi- tical applications, it is useful to give ellipsoids a nite, A = U∆VT, where V can be any orthogonal wider definition. To provide a complete taxonomy, 1 2 T matrix, and if ∆− is finite, C = U∆− U . this wider definition should also include ellipsoids The U(∆ ) representation also allows us to char- Rp that may be unbounded in some directions in acterize anyS such generalized ellipsoid in Rp by its (an infinite cylinder of ellipsoidal cross-section) and signature, degenerate (singular) ellipsoids that are “flat” in Rp (C)=#[δi > 0, δi = 0, δi = ] with empty interior, such as when a 3D ellipsoid (A.2) G ∞ has no extent in one dimension (collapsing to an with (C)= p, ellipse), or in two dimensions (collapsing to a line). G a 3-vector containing the number (#)P of positive, These ideas are made precise below with a definition zero and infinite singular values. For example, in of the signature, (C), of a generalized ellipsoid. 3 G R , any proper ellipsoid has the signature (C)= The motivation for this more general representa- (3, 0, 0); a flat, 2D ellipsoid has (C) = (2,G1, 0); a tion is to allow a notation for a class of generalized flat, 1D ellipsoid (a line) has (CG ) = (1, 2, 0). Un- ellipsoids to be algebraically closed under operations bounded examples include an infiniteG flat plane, with (a) image and preimage under a linear transforma- (C) = (0, 1, 2), and an infinite cylinder of elliptical tion, and (b) inversion. The goal is to be able to Gcross-section, with (C) = (2, 0, 1). think about, visualize and compute a linear trans- Figure A.1 illustratesG these ideas, using two gen- formation of an ellipsoid with central matrix C or its erating matrices, C1 and C2, in this more general 1 inverse transformation via an analog of C− , which representation, applies equally to unbounded and/or degenerate el- 6 2 1 6 2 0 lipsoids. Algebraically, the vector space of C is the 1 C1 =  2 3 2  , C2 =  2 3 0  , dual of that of C− [Dempster (1969), Chapter 6] 1 2 2 0 0 0 and vice-versa. Geometrical applications can show     how points, lines and hyperplanes in Rp are all spe- 20 cial cases of ellipsoids. Statistical applications con- Note that the parentheses in this notation are obliga- tory: ∆ as defined transforms the unit sphere, which is then T cern the relationship between a predictor data ellip- transformed by U. V , also orthogonal, plays no role in this soid and the corresponding β confidence ellipsoid representation, because an orthogonal transformation of is (Section 4.6): The β ellipsoid will be unbounded still a unit sphere. S ELLIPTICAL INSIGHTS 35

−1 Fig. A.1. Two views of an example of generalized ellipsoids. C1 (blue) determines a proper, fat ellipsoid; its inverse C1 −1 also generates a proper ellipsoid. C2 (red) determines an improper, flat ellipsoid, whose inverse C2 is an unbounded cylinder of elliptical . The scale of these images is defined by a unit sphere (gray). The left panel shows that C2 is a projection of C1 onto the plane where z = 0. The right panel shows a view illustrating the orthogonality of each C and its dual, C−1.

where C1 generates a proper ellipsoid and C2 gen- eral case, the hypervolume of the ellipsoid is pro- 1/2 erates an improper, flat ellipsoid. C1 and its dual, portional to C − = A and is given by 1 | | p k k C− , both have signatures (3, 0, 0). C2 has the sig- p/2 1/2 1 π det(C)− /[Γ( 2 +1)] [Dempster (1969), Sec- nature (2, 1, 0), while its inverse (dual) has the sig- tion 3.5], where the first two factors are familiar nature (2, 0, 1). These varieties of ellipsoids are more as the normalizing constant of the multivariate easily understood in the 3D movies included in the normal density function. online supplements. Principal axes: In general, the eigenvectors, v , i = • i A.2 Properties of Geometric Ellipsoids 1,...,p, of C define the principal axes of the el- lipsoid and the inverse of the square roots of the Translation: An ellipsoid centered at x0 has the • T ordered eigenvalues, λ1 > λ2, . . ., λp, are the prin- definition := x : (x x0) C(x x0) = 1 or E { − − } cipal radii. Eigenvectors belonging to eigenvalues := x0 A in the notation of Section 2.2. EOrthogonality:⊕ S If C is diagonal, then the origin- that are 0 are directions in which the ellipsoid is • centered ellipsoid has its axes aligned with the unbounded. With = A , we consider the singu- E S T coordinate axes and has the equation lar-value decomposition A = UDV , with U and T V orthogonal matrices and D a diagonal nonneg- (A.3) x Cx = c x2 + c x2 + + c x2 = 1, 11 1 22 2 · · · pp p ative matrix with the same dimension as A. The 1/2 column vectors of U, called the left singular vec- where 1/√cii = cii− are the radii (semi- lengths) along the coordinate axes. tors, correspond to the eigenvectors of C in the Area and volume: In two dimensions, the area of case of a proper ellipsoid. The positive diagonal • 1/2 the axis-aligned ellipse is π(c11c22)− . For p = elements of D, d1 > d2 > > dp > 0, are the prin- 4 1/2 · · · 3, the volume is 3 π(c11c22c33)− . In the gen- cipal radii of the ellipsoid with di = 1/√λi. In the 36 M. FRIENDLY, G. MONETTE AND J. FOX

p = 3, λ3 = 0 gives a cylinder with an elliptical cross-section in 3-space, and λ2 = λ3 = 0 gives an infinite slab with thickness 2√λ . With = A , 1 E S the dimension of the ellipsoid is equal to the num- ber of positive singular values of A. Projections: The projection of a p-dimensional el- • lipsoid into any subspace is P , where P is an E idempotent p p (projection) matrix, that is, PP = × P2 = P. For example, in R2 and R3, the matrices 1 0 0 1 1 P2 = , P3 =  0 1 0   0 0  0 0 0  

project, respectively, an ellipse onto the line x1 = x2 and an ellipsoid into the (x1,x2) plane. If P is symmetric, then P is the matrix of an orthogonal projection, and it is easy to visualize P as the E shadow of cast perpendicularly onto span(P). E Fig. A.2. Some properties of geometric ellipsoids, shown in Generally, P is the shadow of onto span(P) E E 2D. Principal axes of an ellipsoid are given by the eigenvec- along the null space of P. √ tors of C, with radii 1/ λi. For an ellipsoid defined by equa- Linear transformations: A linear transformation tion (1), the comparable ellipsoid for 2C has radii multiplied • by 1/√2. The ellipsoid for C−1 has the same principal axes, of an ellipsoid is an ellipsoid, and the pre-image but with radii √λi, making it small in the directions where C of an ellipsoid under a linear transformation is is large and vice-versa. an ellipsoid. A nonsingular linear transformation maps a proper ellipsoid into a proper ellipsoid in singular case, the left singular vectors form a set the form shown in Section 2.2, equation (7). of principal axes for the flattened ellipsoid.21 Slopes and tangents: The slopes of the ellipsoidal • Inverse: When C is positive definite, the eigenvec- surface in the directions of the coordinate axes • 1 T tors of C and C− are identical, while the eigen- are given by ∂/∂x(x Cx) = 2Cx. From this, it 1 values of C− are 1/λi. It follows that the ellipsoid follows that the tangent hyperplane to the unit 1 for C− has the same axes as that of C, but with T ellipsoidal surface at the point xα, where xα∂/ inversely proportional radii. In R2, the ellipsoid T T 1 ∂x(x Cx) = 0, has the equation xαCx = 1. for C− is, with rescaling, a 90◦ rotation of the ellipsoid for C, as illustrated in Figure A.2. A.3 Conjugate Axes and Inner-Product Spaces Generalized inverse: A definition for an inverse • ellipsoid that is equivalent in the case of proper For any nonsingular A in equation (5) that gener- ellipsoids, ates an ellipsoid, the columns of A = [a1, a2,..., ap] 1 T form a set of “conjugate axes” of the ellipsoid. (Two (A.4) − := y : x y 1, x , E { | |≤ ∀ ∈ E} are conjugate iff the tangent line at the generalizes to all ellipsoids. The inverse of a sin- endpoint of one diameter is parallel to the other di- gular ellipsoid is an improper ellipsoid and vice ameter.) Each vector ai lies on the ellipsoid, and the versa. tangent hyperplane at that point is parallel to the Dimensionality: The ellipsoid is bounded if C is • span of all the other column vectors of A. For p = 2 positive definite (all λi > 0). Each λi = 0 increases this result is illustrated in Figure A.3 (left) in which the dimension of the space along which the el- lipsoid is unbounded by one. For example, with 1 1.5 A = [ a a ]= 1 2  2 1  ⇒ 21Corresponding left singular vectors and eigenvectors are (A.5) T 3.25 3.5 not necessarily equal, but sets that belong to the same eigen- W = AA = . value/singular value span the same space.  3.5 5  ELLIPTICAL INSIGHTS 37

Fig. A.3. Conjugate axes of an ellipse with various factorizations of W and corresponding basis vectors. The conjugate vectors lie on the ellipse, and their tangents can be extended to form a parallelogram framing it. Left: for an arbitrary fac- torization, given in equation (A.5). Right: for the Choleski factorization (solid, green, b1,b2) and the principal component factorization (dashed, brown, c1, c2).

Consider the inner-product space with inner prod- W. The subspace c1b1 + + cp 1bp 1, ci R is uct matrix the plane of the regression{ of· · ·the last− variable− ∈ on} the others, a fact that generalizes naturally to ellipsoids 1 1.25 0.875 W− = − that are not necessarily centered at the origin.  0.875 0.8125  − In the principal-component (PC) factorization 1 and inner product x, y = x′W− y. (shown dashed in brown in Figure A.3, right) W = h i T 1/2 ′ T T T T T CC , where C = ΓΛ and, hence, W = ΓΛΓ is Because A W 1A = A (AA ) 1A = A (A ) 1 − − − the spectral decomposition of W. Here, the ellipse A 1A = I, we see that a and a are orthogonal· − 1 2 axes are orthogonal in the space of the ellipse (so unit vectors (in fact, an orthonormal basis) in this the bounding tangent parallelogram is a rectangle) inner product: 1 as well as in the inner-product space of W− . The PC factorization is unique in this respect (up to re- T 1 a , a = a W− a = 1, flections of the axis vectors). h i ii i i 1 As illustrated in Figure A.3, each pair of conjugate a , a = a′ W− a = 0. h 1 2i 1 2 axes has a corresponding bounding tangent parallel- Now, if W = BBT is any other factorization of ogram. It can be shown that all such W, then the columns of B have the same proper- have the same area and equal sums of squares of the ties as the columns of A. Particular factorizations lengths of their diameters. yield interesting and statistically useful sets of con- A.4 Ellipsoids in a Generalized Metric Space jugate axes. The illustration in Figure A.3 (right) shows two such cases with special properties: In the In Appendix A.3, we considered the positive semi- Choleski factorization (shown solid in green), where definite matrix W and corresponding ellipsoid to be B is lower triangular, the last conjugate axis, b2, is referred to a , perhaps with different aligned with the coordinate axis x2. Each previous basis vectors. We showed that various measures of axis (b1, here) is the orthogonal complement to all the “size” of the ellipsoid could be defined in terms 1 later axes in the inner-product space of W− . The of functions of the eigenvalues λi of W. Choleski factorization is unique in this respect, sub- We now consider the generalized case of an anal- ject to a permutation of the rows and columns of ogous p p positive semi-definite × 38 M. FRIENDLY, G. MONETTE AND J. FOX

Fig. A.4. Left: Ellipses for H and E in Euclidean “data space.” Right: Ellipses for H⋆ and E⋆ in the transformed “canonical space,” with the eigenvectors of H relative to E shown as blue arrows, whose radii are the corresponding eigenvalues, λ1,λ2.

H, but where measures of length, distance and an- the corresponding eigenvalues λi. This can be seen gles are referred to a metric defined by a positive- geometrically as a rotation of “data space” to an ori- definite symmetric matrix E. As is well known, the entation defined by the principal axes of E, followed generalized eigenvalue problem is to find the scalars by a re-scaling, so that the E ellipsoid becomes the λi and vectors vi, i = 1, 2,...,p, such that Hv = unit . In this transformed space (“canonical λEv, that is, the roots of det(H λE)=0. space”), functions of the squared radii λi of the axes For such H and E, we can always− find a factor of H⋆ give direct measures of the “size” of H rela- T A of E, so that E = AA , whose columns will be tive to E. The orientation of the eigenvectors vi can conjugate directions for E and whose rows will also be related to the (orthogonal) linear combinations be conjugate directions for H, in that H = ATDA, of the data variables that are successively largest in where D is diagonal. Geometrically, this means that the metric of E. there exists a unique pair of bounding parallelo- To illustrate, Figure A.4 (left) shows the ellipses grams for the H and E ellipsoids whose correspond- generated by ing sides are parallel. A linear transformation of E 9 3 1 0.5 H E H = and E = and that transforms the parallelogram for to  3 4   0.5 2  a square (or cuboid), and hence E to a (or spheroid), generates an equivalent view in what we together with their conjugate axes. For E, the con- jugate axes are defined by the columns of the right describe below as canonical space. T T In statistical applications (e.g., MANOVA, canon- factor, A , in E = AA ; for H, the conjugate axes ical correlation), the generalized eigenvalue problem are defined by the columns of A. The transforma- ⋆ 1/2 1/2 is transformed to an ordinary eigenvalue problem by tion to H = E− HE− is shown in the right considering the following equivalent forms with the panel of Figure A.4. In this “canonical space,” an- gles and lengths have the ordinary interpretation of same λi, vi: Euclidean space, so the size of H⋆ can be interpreted (H λE)v = 0, √ − directly in terms of functions of the radii λ1 and 1 √ (HE− λI)v = 0, λ2. ⇒ − 1/2 1/2 (E− HE− λI)v = 0, ACKNOWLEDGMENTS ⇒ − where the last form gives a symmetric matrix, H⋆ = We acknowledge first the inspiration we found in 1/2 1/2 E− HE− . Using the square root of E defined by Dempster (1969) for this geometric approach to sta- the principal-component factorization E1/2 = ΓΛ1/2 tistical thinking. This work was supported by Grant produces the ellipsoid H⋆, the orthogonal axes of OGP0138748 from the National Sciences and En- which correspond to the vi, whose squared radii are gineering Research Council of Canada to Michael ELLIPTICAL INSIGHTS 39

Friendly, and by grants from the Social Sciences and Bryk, A. S. and Raudenbush, S. W. (1992). Hierarchical Humanities Research Council of Canada to John Linear Models: Applications and Data Analysis Methods. Fox. We are grateful to Antonio Gasparini for help- Sage, Thousand Oaks, CA. Campbell, N. A. and Atchley, W. R. (1981). The geom- ful discussion regarding multivariate meta-analysis etry of canonical variate analysis. Systematic Zoology 30 models, to Duncan Murdoch for advice on 3D graph- 268–280. ics and comments on an earlier draft, and to the re- Cramer,´ H. (1946). Mathematical Methods of Statistics. viewers and Associate Editor for many helpful sug- Princeton Mathematical Series 9. Princeton Univ. Press, gestions. Princeton, NJ. MR0016588 Dempster, A. P. (1969). Elements of Continuous Multivari- ate Analysis. Addison-Wesley, Reading, MA. SUPPLEMENTARY MATERIAL Denis, D. (2001). The origins of correlation and regression: Francis Galton or Auguste Bravais and the error theorists. Supplementary materials for Elliptical insights: History and Philosophy of Psychology Bulletin 13 36–44. Understanding statistical methods through ellipti- Diez-Roux, A. V. (1998). Bringing context back into epi- demiology: Variables and fallacies in multilevel analysis. cal geometry (DOI: 10.1214/12-STS402SUPP; .pdf). Am. J. Public Health 88 216–222. The supplementary materials include SAS and R Fisher, R. A. (1936). The use of multiple measurements in scripts to generate all of the figures for this arti- taxonomic problems. Annals of Eugenics 8 379–388. cle. Several 3D movies are also included to show Fox, J. (2008). Applied Regression Analysis and Generalized phenomena better than can be rendered in static Linear Models, 2nd ed. Sage, Thousand Oaks, CA. Fox, J. Suschnigg, C. print images. A new R package, gellipsoid, provides and (1989). A note on gender and the prestige of occupations. Canadian Journal of Sociology computational support for the theory described in 14 353–360. Appendix A.1. These are also available at http:// Fox, J. and Weisberg, S. (2011). An R Companion to Ap- datavis.ca/papers/ellipses and described in plied Regression, 2nd ed. Sage, Thousand Oaks, CA. http://datavis.ca/papers/ellipses/supp.pdf Friendly, M. (1991). SAS System for Statistical Graphics, [Friendly, Monette and Fox (2012)]. 1st ed. SAS Institute, Cary, NC. Friendly, M. (2007a). A.-M. Guerry’s Moral statistics of France: Challenges for multivariable spatial analysis. 22 REFERENCES Statist. Sci. 368–399. MR2399897 Friendly, M. (2007b). HE plots for multivariate linear mod- Alker, H. R. (1969). A typology of ecological fallacies. In els. J. Comput. Graph. Statist. 16 421–444. MR2370948 Social Ecology (M. Dogan and S. Rokkam, eds.) 69–86. Friendly, M. (2013). The generalized ridge trace plot: Visu- MIT Press, Cambridge, MA. alizing bias and precision. J. Comput. Graph. Statist. 22. Anderson, E. (1935). The irises of the Gasp´epeninsula. Bul- To appear. letin of the American Iris Society 35 2–5. Friendly, M., Monette, G. and Fox, J. (2012). Antczak-Bouckoms, A., Joshipura, K., Burdick, E. and Supplement to “Elliptical Insights: Understanding Tulloch, J. F. (1993). Meta-analysis of surgical versus Statistical Methods Through Elliptical Geometry.” non-surgical methods of treatment for periodontal disease. DOI:10.1214/12-STS402SUPP. J. Clin. Periodontol. 20 259–268. Galton, F. (1886). Regression towards mediocrity in hered- Beaton, A. E. Jr (1964). The use of special matrix oper- itary stature. Journal of the Anthropological Institute 15 ators in statistical calculus. Ph.D. thesis, Harvard Univ., 246–263. ProQuest LLC, Ann Arbor, MI. Galton, F. (1889). Natural Inheritance. Macmillan, London. Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regres- Gasparrini, A. (2012). MVMETA: multivariate meta- sion Diagnostics: Identifying Influential Data and Sources analysis and meta-regression. R package version 0.2.4. of Collinearity. Wiley, New York. MR0576408 Guerry, A. M. (1833). Essai sur la statistique morale de Berkey, C. S., Hoaglin, D. C., Antczak-Bouckoms, A., la France. Crochard, Paris. [English translation: Hugh P. Mosteller, F. and Colditz, G. A. (1998). Meta-analysis Whitt and Victor W. Reinking, Edwin Mellen Press, Lewis- of multiple outcomes by regression with random effects. ton, NY (2002).] Stat. Med. 17 2537–2550. Henderson, C. R. (1975). Best linear unbiased estimation Boyer, C. B. (1991). . In A History of and prediction under a selection model. Biometrics 31 423– Mathematics, 2nd ed. 156–157. Wiley, New York. 448. Bravais, A. (1846). Analyse math´ematique sur les prob- Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regres- abilit´es des erreurs de situation d’un point. M´emoires sion: Biased estimation for nonorthogonal problems. Tech- Pr´esent´es Par Divers Savants `aL’Acad´emie Royale des nometrics 12 55–67. Sciences de l’Institut de France 9 255–332. Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regres- Bryant, P. (1984). Geometry, statistics, probability: Varia- sion: Applications to nonorthogonal problems. Technomet- tions on a common theme. Amer. Statist. 38 38–48. rics 12 69–82. [Correction: 12 723.] 40 M. FRIENDLY, G. MONETTE AND J. FOX

Hotelling, H. (1933). Analysis of a complex of statisti- Raudenbush, S. W. and Bryk, A. S. (2002). Hierarchical cal variables into principal components. Journal of Edu- Linear Models: Applications and Data Analysis Methods, cational Psychology 24 417–441. 2nd ed. Sage, Newbury Park, CA. Jackson, D., Riley, R. and White, I. R. (2011). Multi- Riley, M. W. (1963). Special problems of sociological variate meta-analysis: Potential and promise. Stat. Med. analysis. In Sociological Research I: A Case Approach 30 2481–2498. MR2843472 (M. W. Riley, ed.) 700–725. Harcourt, Brace, and World, Kramer, G. H. (1983). The ecological fallacy revisited: New York. Aggregate-versus individual-level findings on economics Robinson, W. S. (1950). Ecological correlations and the be- and elections, and sociotropic voting. The American Po- havior of individuals. American Sociological Review 15 351– litical Science Review 77 92–111. 357. Laird, N. M. and Ware, J. H. (1982). Random-effects mod- Robinson, G. K. (1991). That BLUP is a good thing: els for longitudinal data. Biometrics 38 963–974. The estimation of random effects. Statist. Sci. 6 15–32. Lichtman, A. J. (1974). Correlation, regression, and the eco- MR1108815 logical fallacy: A critique. The Journal of Interdisciplinary Saville, D. and Wood, G. (1991). Statistical Methods: History 4 417–433. The Geometric Approach. Springer Texts in Statistics. Longley, J. W. (1967). An appraisal of least squares pro- Springer, New York. grams for the electronic computer from the point of view of Simpson, E. H. (1951). The interpretation of interaction in the user. J. Amer. Statist. Assoc. 62 819–841. MR0215440 contingency tables. J. Roy. Statist. Soc. Ser. B. 13 238– Marquardt, D. W. (1970). Generalized inverses, ridge re- 241. MR0051472 gression, biased linear estimation, and nonlinear estima- Speed, T. (1991). That BLUP is a good thing: The estimation tion. Technometrics 12 591–612. of random effects: Comment. Statist. Sci. 6 42–44. Monette, G. (1990). Geometry of multiple regression and Stigler, S. M. (1986). The History of Statistics: The Mea- interactive 3-D graphics. In Modern Methods of Data Anal- surement of Uncertainty Before 1900. The Belknap Press ysis, Chapter 5 (J. Fox and S. Long, eds.) 209–256. Sage, of Harvard Univ. Press, Cambridge, MA. MR0852410 Beverly Hills, CA. Timm, N. H. (1975). Multivariate Analysis with Applications Nam, I.-S., Mengersen, K. and Garthwaite, P. (2003). in Education and Psychology. Brooks/Cole Publishing Co., Multivariate meta-analysis. Stat. Med. 22 2309–2333. Monterey, CA. MR0443216 Pearson, K. (1896). Contributions to the mathematical the- Velleman, P. F. and Welsh, R. E. (1981). Efficient com- ory of evolution—III, regression, heredity and panmixia. puting of regression diagnostics. Amer. Statist. 35 234–242. Philosophical Transactions of the Royal Society of London von Humboldt, A. (1811). Essai Politique sur le Royaume 187 253–318. de la Nouvelle-Espagne. (Political Essay on the Kingdom Pearson, K. (1901). On lines and planes of closest fit to of New I: Founded on Astronomical Observations, and systems of points in space. Philosophical Magazine 6 559– Trigonometrical and Barometrical Measurements), Vol. 1. 572. Riley, New York. Pearson, K. (1920). Notes on the history of correlation. Wickens, T. D. (1995). The Geometry of Multivariate Statis- Biometrika 13 25–45. tics. Lawrence Erlbaum Associates, Hillsdale, NJ.