The Geometric Interpretation of Correspondence Analysis Author(S): Michael Greenacre and Trevor Hastie Source: Journal of the American Statistical Association, Vol
Total Page:16
File Type:pdf, Size:1020Kb
The Geometric Interpretation of Correspondence Analysis Author(s): Michael Greenacre and Trevor Hastie Source: Journal of the American Statistical Association, Vol. 82, No. 398 (Jun., 1987), pp. 437-447 Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association Stable URL: https://www.jstor.org/stable/2289445 Accessed: 26-01-2020 22:24 UTC JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at https://about.jstor.org/terms Taylor & Francis, Ltd., American Statistical Association are collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association This content downloaded from 171.66.12.245 on Sun, 26 Jan 2020 22:24:02 UTC All use subject to https://about.jstor.org/terms The Geometric Interpretation of Correspondence Analysis MICHAEL GREENACRE and TREVOR HASTIE* Correspondence analysis is an exploratory multivariate technique that indicator (or "dummy") variables. The correspondence converts a data matrix into a particular type of graphical display in which analysis of this indicator matrix is known as multiple cor- the rows and columns are depicted as points. The method has a long respondence analysis or homogeneity analysis (Gifi 1981). and varied history and has appeared in different forms in the psycho- metric and ecological literature, among others. In this article we review In Section 4 we show how other data types may be reex- the geometry of correspondence analysis and its geometric interpretation. pressed in order to be geometrically interpretable using We also discuss various extensions of correspondence analysis to mul- correspondence analysis. tivariate categorical data (multiple correspondence analysis) and a variety of other data types. KEY WORDS: Graphical display; Singular value decomposition; Prin- 2. GEOMETRY OF SIMPLE cipal components; Contingency tables. CORRESPONDENCE ANALYSIS 2.1 Basic Geometry 1. INTRODUCTION The correspondence analysis of a two-way contingency Correspondence analysis is an exploratory multivariate table N(I x J) provides a leading case of the method's technique that converts a matrix of nonnegative data into geometry and interpretation. We shall denote the row and a particular type of graphical display in which the rows column totals of N by ni,+ (i = 1 ... I) and n+j (j = 1 ... and columns of the matrix are depicted as points. It is a J), respectively, and the overall total simply by n. We shall method that, algebraically at least, has been known for use the 5 x 3 contingency table of Table 1 throughout more than 50 years, the first mathematical account being this section as an illustration. It is a cross-tabulation of 312 by Hirschfeld (1935). Since then the same algebraic and people, all identified as readers of a particular newspaper numerical procedure has been rediscovered in different in a readership survey, according to five educational groups contexts, notably in ecology (reciprocal averaging) and and three categories of readership of the newspaper. This psychology (dual scaling). The method was rediscovered example has the advantage that its geometry is precisely in France in the early 1960s and has been used extensively three-dimensional so that we can visually depict the tech- in that country as a method of graphical data analysis nique's concepts and mechanism without abstraction. (Benzecri 1973; Lebart, Morineau, and Tabard 1977). De- We suppose initially that we are interested in comparing tailed descriptions of the evolution of the various forms the rows of the table. The proportions of readership types of correspondence analysis can be found in Benzecri (1977), in each education group are given in parentheses in Table Nash (1978), Nishisato (1980, sec. 1.2), Greenacre (1984, 1. Each of these vectors is known as a row profile, denoted sec. 1.3), and Tenenhaus and Young (1985). A bibliog- by ai = [ni1lni+ - nIni+]T, for example, a2 = [.214 .548 raphy by Nishisato (1986) lists more than 1,000 references .238]T. Each row in this example can be represented by in the period 1975-1986 that are directly or indirectly rel- its profile as a point vector in three-dimensional Euclidean evant to the topic of correspondence analysis. space. The fact that there is a linear dependency among In this article we focus specifically on the geometry of the coordinates of the profile vectors (they each sum to correspondence analysis and its geometric interpretation. 1) means geometrically that the five points are contained This will be seen to be a variant of principal components exactly in a two-dimensional regular simplex, namely, the analysis, but tailored to categorical rather than continuous triangle with vertices at the unit points along each of the data. The most basic form of correspondence analysis is three coordinate axes. The points may be plotted directly its application to a two-way contingency table, which we in this triangle in what is generally known as triangular shall call simple correspondence analysis. The geometry of (or barycentric) coordinates. this leading case, discussed in Section 2, provides the basic Apart from serving as the vertices of the triangle bound- rules of interpretation. All other forms of correspondence ing the row points, the unit points may be considered as analysis are the application of the same algorithm to other the most extreme, or most polarized, types of row profile types of data matrices, with a consequent adaptation of observable. For example, the unit point e1 = [1 0 0]T the interpretation. In Section 3 we treat the case of a represents a row profile in which the total frequency is multiway contingency table that is coded as a matrix of concentrated into the "casual" reading category. In this sense the jth column of the data matrix is represented by * Michael Greenacre is Professor, Department of Statistics, University the vertex e1 of the triangle. of South Africa, Pretoria 0001, South Africa. Trevor Hastie is a member of the Technical Staff, Statistics and Data Analysis, Research Depart- For reasons that will soon become apparent, we wish ment, AT&T Bell Laboratories, Murray Hill, NJ 07974. This article was to define distances between profiles not by the usual Eu- inspired in part by a series of seminars on correspondence analysis, organized by Perci Diaconis, at Stanford University in 1984. The authors thank the participants of the seminars for their contributions and ac- ? 1987 American Statistical Association knowledge the constructive suggestions of the referees and the editor of Journal of the American Statistical Association the Statistical Graphics special section. June 1987, Vol. 82, No. 398, Statistical Graphics 437 This content downloaded from 171.66.12.245 on Sun, 26 Jan 2020 22:24:02 UTC All use subject to https://about.jstor.org/terms 438 Journal of the American Statistical Association, June 1987 Table 1. Contingency Table From Readership Survey In other words, 2Iln can be defined geometrically as the weighted average of the squared (chi-squared) distances Category of readership Education of the row profiles to their centroid. The quantity x/ln group C1 C2 C3 Totals crops up frequently in correspondence analysis and is called the total inertia of the data matrix. El 5 7 2 14 (.357) (.500) (.143) The null hypothesis of row-column independence, nil E2 18 46 20 84 = ni+n+jln (i = 1 ..I, j = 1 ... J), is equivalent to the (.214) (.548) (.238) hypothesis of homogeneity of the rows: n11lnl+ = n2jln2+ E3 19 29 39 87 = ... = njln+ (j = 1 ... J). Each row of N may be viewed (.218) (.333) (.448) as the realization of a multinomial distribution, conditional E4 12 40 49 101 on the respective row total. Under the homogeneity as- (.119) (.396) (.485) sumption, this distribution is common to all of the rows E5 3 7 16 26 up to a rescaling by the row totals. This scaled multinomial (.115) (.269) (.615) distribution is completely described by a set of probabil- Totals 57 129 126 312 ities whose maximum likelihood estimates are the ele- (.183) (.413) (.404) ments cj of the row centroid. Thus a significant x2 can be interpreted geometrically as a significant deviation of the NOTE: Education groups-El, some primary; E2, primary completed; E3, some secondary; E4, secondary completed; E5, some tertiary. Readership categories-Cl, glance; C2, fairly row profiles from their centroid, that is, from the homo- thorough; C3, very thorough. Row profiles are given in parentheses (i.e., row frequencies as a geneity hypothesis. Figures 1 and 2 show the deviations proportion of row totals). clidean metric but rather by the following weighted Eu- clidean metric, called the chi-squared metric: dc(ai, ajt) (ai - aj;)TD Dl(ai - ajt) (nijlni+ - niljlni +) 2 =, (n+1/n) S (2.1) where D, is the diagonal matrix of elements cj = n+jln (j = 1 ... J). The vector c = [cl ... cJ]T, in this case the proportions of all respondents in the readership categories, . .....a..... .... is called the average row profile. To observe chi-squared distances between points, we can rescale the row profiles as follows: Ai = D- 1/2ai, so chi-squared distances are trans- 4s.-........ formed to ordinary Euclidean distances, since (ai - ail)TD l(ai - ai) = (ai - ii)T(Ai - Ai). Equivalently, 5a~~~ we can stretch the coordinate axes in proportion to the values Cl 1/2 So that each axis has a different scale. In our .3.... example this results in a stretched triangle, no longer equi- lateral, which bounds the cloud of points (Figs.