Correspondence Analysis

Stat 315c: Transposable Data Correspondence Analysis Art B. Owen Stanford Statistics Art B. Owen (Stanford Statistics) Correspondence Analysis 1 / 17 It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. The treatment by Greenacre is particularly clear. Correspondence Analysis Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. The treatment by Greenacre is particularly clear. Correspondence Analysis It plots both variables and cases in the same plane. Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. The treatment by Greenacre is particularly clear. Correspondence Analysis It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. The treatment by Greenacre is particularly clear. Correspondence Analysis It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 The treatment by Greenacre is particularly clear. Correspondence Analysis It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Correspondence Analysis It plots both variables and cases in the same plane. Clearest motivation is for contingency table data. It gets used elsewhere too. Emphasis is on presenting the data themselves as opposed to illuminating an underlying model. This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s. The treatment by Greenacre is particularly clear. Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17 Nomenclature Correspondence matrix P pij = nij/n•• Row masses ri = pi• = ni•/n•• Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . , pIj/cj) ∈ R These are conditional and marginal distributions Contingency tables I × J table of counts n11 n12 ··· n1J n21 n22 ··· n2J . .. nI1 nI2 ··· nIJ Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 These are conditional and marginal distributions Contingency tables I × J table of counts n11 n12 ··· n1J n21 n22 ··· n2J . .. nI1 nI2 ··· nIJ Nomenclature Correspondence matrix P pij = nij/n•• Row masses ri = pi• = ni•/n•• Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . , pIj/cj) ∈ R Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 Contingency tables I × J table of counts n11 n12 ··· n1J n21 n22 ··· n2J . .. nI1 nI2 ··· nIJ Nomenclature Correspondence matrix P pij = nij/n•• Row masses ri = pi• = ni•/n•• Column masses cj = pi• = ni•/n•• 0 J Row profiles ri = (pi1/ri, . , piJ /ri) ∈ R 0 I Column profiles cj = (p1j/cj, . , pIj/cj) ∈ R These are conditional and marginal distributions Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17 Column centroid J X 0 cjcj = (r1, . , rI ) ≡ r j=1 Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns First moments: centroids Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns First moments: centroids Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i Column centroid J X 0 cjcj = (r1, . , rI ) ≡ r j=1 Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 First moments: centroids Row centroid I I X X pi1 piJ 0 r r = r ,..., = (c , . , c )0 ≡ c i i i r r 1 J i=1 i=1 i i Column centroid J X 0 cjcj = (r1, . , rI ) ≡ r j=1 Upshot ‘Mass’ weighted average of row profiles is marginal distribution over columns Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17 This is the total inertia of the row profiles. It equals total inertia of column profiles. Second moments: inertias Chisquare for independence as weighted Euclidean distance 2 X X (nij − ni•n•j/n••) X2 = n n /n i j i• •j •• 2 X X (nij/ni• − n•j/n••) = n i• n /n i j •j •• 2 X ni• X (nij/ni• − n•j/n••) = n •• n n /n i •• j •j •• X 0 −1 = n•• ri(ri − c) diag(c) (ri − c) i = n•• × Inertia Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17 Second moments: inertias Chisquare for independence as weighted Euclidean distance 2 X X (nij − ni•n•j/n••) X2 = n n /n i j i• •j •• 2 X X (nij/ni• − n•j/n••) = n i• n /n i j •j •• 2 X ni• X (nij/ni• − n•j/n••) = n •• n n /n i •• j •j •• X 0 −1 = n•• ri(ri − c) diag(c) (ri − c) i = n•• × Inertia This is the total inertia of the row profiles. It equals total inertia of column profiles. Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17 Geometry For J = 3 12 8 8 7 7 6 8 8 10 6 9 8 9 8 7 3 We can plot profiles in R Low inertia Art B. Owen (Stanford Statistics) Correspondence Analysis 6 / 17 Geometry This example has higher inertia Art B. Owen (Stanford Statistics) Correspondence Analysis 7 / 17 Geometry Still higher inertia. χ2 statistics describes variation of row profiles Similarly for col profiles Art B. Owen (Stanford Statistics) Correspondence Analysis 8 / 17 Rescale Distances Euclidean distance in plot ignores column values rij Replace ri by ri with rij = √ e e cj 2 Euclidean dist between rei and rei0 is “χ dist” between ri and ri0 . Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17 Rescale Distances Euclidean distance in plot ignores column values rij Replace ri by ri with rij = √ e e cj 2 Euclidean dist between rei and rei0 is “χ dist” between ri and ri0 . Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17 Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass Role of χ2 in statistical significance is not considered important in this literature Reason for χ2 Invariance Suppose rows i and i0 are proportional nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows New ni∗j = nij + ni0j and delete originals Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 Role of χ2 in statistical significance is not considered important in this literature Reason for χ2 Invariance Suppose rows i and i0 are proportional nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows New ni∗j = nij + ni0j and delete originals Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 Reason for χ2 Invariance Suppose rows i and i0 are proportional nij/ni0j = α all j = 1,...,J Suppose also that we pool these rows New ni∗j = nij + ni0j and delete originals Then New χ2 distance between cols j and j0 equals old dist Principle of distributional equivalence Common profile, summed mass Role of χ2 in statistical significance is not considered important in this literature Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17 If J is too big reduce dimension by principal components of rij − cj √ cj plot in reduced dimension along with images of corners Dimension reduction Now we have a plot J−1 With rows and cols both in R Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17 Dimension reduction Now we have a plot J−1 With rows and cols both in R If J is too big reduce dimension by principal components of rij − cj √ cj plot in reduced dimension along with images of corners Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17 More notation Dr = diag(r) = diag(r1, . , rI ) Dc = diag(c) = diag(c1, . , cJ ) Duality Rows lie in min(I − 1,J − 1) dimensional space So do columns In PC of row profiles ... columns are outside In PC of column profiles ... rows are outside Symmetric correspondence analysis overlap the points after rescaling Art B. Owen (Stanford Statistics) Correspondence Analysis 12 / 17 Duality Rows lie in min(I − 1,J − 1) dimensional space So do columns In PC of row profiles ..

Load more