
Measuring association among ordinal categorical variables
Goodman and Kruskal's Gamma

1 An example

Effect of smoking at 45 years of age on self-reported health five years later.

Variable   Categories
Smoking    1 Never smoked
           2 Stopped smoking
           3 1-14 cigarettes/day
           4 15-24 cigarettes/day
           5 25+ cigarettes/day
SRH        1 Very good
           2 Good
           3 Fair
           4 Bad

Both variables have ordinal categories.

Expected monotonic association: increasing codes on Smoking go together with increasing codes on SRH.

2 Data on males from the Glostrup surveys

SMOKE45        | HEALTH51
               | Vgood  Good  Fair  Bad | TOTAL
---------------+------------------------+-------
Never          |    16    73     6    1 |    96
          row% |  16.7  76.0   6.3  1.0 | 100.0
No more        |    15    75     6    0 |    96
          row% |  15.6  78.1   6.3  0.0 | 100.0
1-14           |    13    59     7    1 |    80
          row% |  16.3  73.8   8.8  1.3 | 100.0
15-24          |    10    81    17    3 |   111
          row% |   9.0  73.0  15.3  2.7 | 100.0
25+            |     1    29     3    1 |    34
          row% |   2.9  85.3   8.8  2.9 | 100.0
---------------+------------------------+-------
TOTAL          |    55   317    39    6 |   417
          row% |  13.2  76.0   9.4  1.4 | 100.0

χ² = 16.2, df = 12, p = 0.182

No evidence of association, even though the expected monotonic relationship is as plain as the nose on your face.

3 Correlation coefficients for ordinal categorical variables

Pearson's correlation coefficient:
1) Measures linear association, which is not meaningful for ordinal variables.
2) Evaluation of significance requires normal distributions.

Rank correlations (Kendall's τ and Spearman's ρ) are more appropriate, but they assume continuous data with very little risk of ties.

4 Goodman and Kruskal's γ for ordinal categorical data

1) Similar to Kendall's τ.
2) Related to the odds ratio for 2×2 tables.
3) Well-known asymptotic properties.
4) A partial γ coefficient measuring conditional monotonic relationships among ordinal variables is available.

5 Monotonic relationships

Y increases when X increases (or decreases).

Two variables: X, Y
Probabilities: $p_{xy} = \Pr(X = x, Y = y)$

X and Y are independent $\iff p_{xy} = \Pr(X = x)\Pr(Y = y)$ for all x, y.

What exactly do we mean when we say that there is a monotonic relationship between X and Y?

6 Concordance and discordance

Compare the outcomes on (X, Y) for two stochastically independent cases, $(X_1, Y_1)$ and $(X_2, Y_2)$:

Concordance (C) if $X_1 < X_2$ and $Y_1 < Y_2$, or $X_1 > X_2$ and $Y_1 > Y_2$
Discordance (D) if $X_1 < X_2$ and $Y_1 > Y_2$, or $X_1 > X_2$ and $Y_1 < Y_2$
Tie (T) if $X_1 = X_2$ or $Y_1 = Y_2$

7 Concordance = same trend in X and Y; discordance = opposite trends in X and Y.

Probabilities of concordance and discordance:

$P_C = \sum_{(x_1,y_1,x_2,y_2) \in C} p_{x_1 y_1}\, p_{x_2 y_2}$
$P_D = \sum_{(x_1,y_1,x_2,y_2) \in D} p_{x_1 y_1}\, p_{x_2 y_2}$

Positive relationship: $P_C > P_D$
Negative relationship: $P_C < P_D$

8 The gamma coefficient

A measure of the strength of the monotonic relationship:

$\gamma = \dfrac{P_C - P_D}{P_C + P_D}$

It satisfies all conventional requirements of correlation coefficients:

$-1 \le \gamma \le +1$
$\gamma = 0$ if X and Y are independent
Positive association if $\gamma > 0$
Negative association if $\gamma < 0$
Reversing the order of the Y categories changes the sign: γ after recoding = −γ before recoding

9 Interpretation of γ

$P(C \mid C \cup D) = \dfrac{P(C)}{P(C) + P(D)}$ and $P(D \mid C \cup D) = \dfrac{P(D)}{P(C) + P(D)}$,

such that

$\gamma = P(C \mid C \cup D) - P(D \mid C \cup D)$

γ is the difference between two conditional probabilities.

10 Estimation of γ

Pairwise comparison of all persons in the data set:

$n_C$ = number of concordances
$n_D$ = number of discordances
$n_T$ = number of ties

Relative frequencies:

$h_C = \dfrac{n_C}{n_C + n_D + n_T}$, $h_D = \dfrac{n_D}{n_C + n_D + n_T}$, $h_T = \dfrac{n_T}{n_C + n_D + n_T}$

The estimate of γ:

$G = \dfrac{h_C - h_D}{h_C + h_D} = \dfrac{n_C - n_D}{n_C + n_D}$
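The slides describe G verbally; the following is a minimal sketch (not the authors' program) of how G can be computed from an r×c count table whose rows and columns are in increasing ordinal order. The function name `gamma_estimate` is hypothetical and NumPy is assumed; the cell-by-cell counting anticipates the A/D notation introduced on the next slide.

```python
import numpy as np

def gamma_estimate(table):
    """Goodman-Kruskal G = (nC - nD) / (nC + nD) for an r x c count table
    whose rows and columns are listed in increasing ordinal order."""
    t = np.asarray(table, dtype=float)
    r, c = t.shape
    n_conc = 0.0   # concordant comparisons
    n_disc = 0.0   # discordant comparisons
    for i in range(r):
        for j in range(c):
            # cells strictly below-right and strictly above-left are concordant with (i, j)
            conc = t[i + 1:, j + 1:].sum() + t[:i, :j].sum()
            # cells strictly below-left and strictly above-right are discordant with (i, j)
            disc = t[i + 1:, :j].sum() + t[:i, j + 1:].sum()
            n_conc += t[i, j] * conc
            n_disc += t[i, j] * disc
    # every pair of persons is counted twice; the factor of two cancels in G
    n_conc, n_disc = n_conc / 2, n_disc / 2
    return n_conc, n_disc, (n_conc - n_disc) / (n_conc + n_disc)

# Smoking (rows) by self-reported health (columns), counts from slide 2
smoke_srh = [[16, 73, 6, 1],
             [15, 75, 6, 0],
             [13, 59, 7, 1],
             [10, 81, 17, 3],
             [1, 29, 3, 1]]
nC, nD, G = gamma_estimate(smoke_srh)
print(nC, nD, round(G, 2))   # G is about 0.24, as reported on slide 16
```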
11 A little bit of notation

$n_{xy}$ = number of persons with X = x and Y = y

$A_{ij} = \sum_{x>i,\; y>j} n_{xy} + \sum_{x<i,\; y<j} n_{xy}$
$D_{ij} = \sum_{x>i,\; y<j} n_{xy} + \sum_{x<i,\; y>j} n_{xy}$

(Slide schematic: for a cell (x, y), the cells above-left and below-right contribute to $A_{xy}$, while the cells above-right and below-left contribute to $D_{xy}$.)

Numbers of concordances and discordances:

$n_C = \sum_{x,y} n_{xy} A_{xy}$, $n_D = \sum_{x,y} n_{xy} D_{xy}$

(Each pair of persons is counted twice in these sums; the factor of two cancels in G.)

12 The γ coefficient for 2×2 tables

| a  b |
| c  d |

$n_C = ad$, $n_D = bc$

$\gamma = \dfrac{n_C - n_D}{n_C + n_D} = \dfrac{ad - bc}{ad + bc} = \dfrac{ad/bc - 1}{ad/bc + 1} = \dfrac{OR - 1}{OR + 1}$

13 Gamma, odds ratio and logit

$\gamma = \dfrac{OR - 1}{OR + 1} \iff OR = \dfrac{1 + \gamma}{1 - \gamma}$

Gamma, odds-ratio and logit values:

 gamma | odds ratio | logit = ln(odds ratio)
 -1.00 |    0.00    |   -inf
 -0.90 |    0.05    |  -2.94
 -0.80 |    0.11    |  -2.20
 -0.70 |    0.18    |  -1.73
 -0.60 |    0.25    |  -1.39
 -0.50 |    0.33    |  -1.10
 -0.40 |    0.43    |  -0.85
 -0.30 |    0.54    |  -0.62
 -0.20 |    0.67    |  -0.41
 -0.10 |    0.82    |  -0.20
  0.00 |    1.00    |   0.00
  0.10 |    1.22    |   0.20
  0.20 |    1.50    |   0.41
  0.30 |    1.86    |   0.62
  0.40 |    2.33    |   0.85
  0.50 |    3.00    |   1.10
  0.60 |    4.00    |   1.39
  0.70 |    5.67    |   1.73
  0.80 |    9.00    |   2.20
  0.90 |   19.00    |   2.94
  1.00 |    +inf    |   +inf

Note: logit ≈ 2·gamma in the interval [-0.30, 0.30].

14 Properties of the estimate of γ

The estimate is unbiased, E(G) = γ, and asymptotically normally distributed with standard error $s_1$ given by

$s_1^2 = \dfrac{16}{(n_C + n_D)^4} \left[ \sum_{x,y} n_{xy} \left( n_D A_{xy} - n_C D_{xy} \right)^2 \right]$

If X and Y are independent, the standard error $s_0$ of G is given by

$s_0^2 = \dfrac{4}{(n_C + n_D)^2} \left[ \sum_{x,y} n_{xy} \left( A_{xy} - D_{xy} \right)^2 - \dfrac{(n_C - n_D)^2}{n} \right]$

15 Statistical inference

95% confidence interval: $G \pm 1.96\, s_1$

Test of significance: if X and Y are independent, then

$z = \dfrac{G}{s_0} \sim \mathrm{N}(0, 1)$

Notice that confidence intervals and assessment of significance use different estimates of the standard error.
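As a concrete companion to slides 14-15, here is a hedged sketch of the asymptotic inference, using the slide formulas for $s_1$ and $s_0$ directly (with $n_C = \sum n_{xy}A_{xy}$ and $n_D = \sum n_{xy}D_{xy}$ as on slide 11). The function name `gamma_inference` is hypothetical, and NumPy/SciPy are assumed; it is not the software used for the slides.

```python
import numpy as np
from scipy.stats import norm

def gamma_inference(table):
    """Asymptotic inference for Goodman-Kruskal's gamma (slide 14-15 formulas)."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    r, c = t.shape
    A = np.zeros_like(t)   # A[x, y]: observations concordant with cell (x, y)
    D = np.zeros_like(t)   # D[x, y]: observations discordant with cell (x, y)
    for i in range(r):
        for j in range(c):
            A[i, j] = t[i + 1:, j + 1:].sum() + t[:i, :j].sum()
            D[i, j] = t[i + 1:, :j].sum() + t[:i, j + 1:].sum()
    nC = (t * A).sum()
    nD = (t * D).sum()
    G = (nC - nD) / (nC + nD)
    # standard error used for the confidence interval (slide 14, s1)
    s1 = np.sqrt(16.0 / (nC + nD) ** 4 * (t * (nD * A - nC * D) ** 2).sum())
    # standard error under independence, used for the significance test (slide 14, s0)
    s0 = np.sqrt(4.0 / (nC + nD) ** 2 * ((t * (A - D) ** 2).sum() - (nC - nD) ** 2 / n))
    ci = (G - 1.96 * s1, G + 1.96 * s1)
    z = G / s0
    return G, ci, z, 2 * norm.sf(abs(z))   # two-sided p-value

# For the table on slide 2 this gives G of about 0.24 with a very small p-value,
# in line with the results quoted on slide 16.
```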
16 The example

The same SMOKE45 × HEALTH51 table as on slide 2, now with the gamma coefficient added:

χ² = 16.2, df = 12, p = 0.182
γ = 0.24, p < 0.0005

Very strong evidence of an effect of smoking on health.

For ordinal variables, γ is much more powerful than χ²-distributed test statistics.

17 Exact conditional inference

The problem: can the distributions of estimates and test statistics in small and moderate samples be adequately approximated by their asymptotic distributions?

The small number of persons with bad health would make most statistical programs warn that the asymptotic results probably do not work here.

If in doubt, use exact conditional tests instead of asymptotic tests.

18 The hypergeometric distribution

The contingency table: $n_{xy}$, x = 1,…,c; y = 1,…,r.

The margins of the table:

$n_{x+} = \sum_y n_{xy}$, $n_{+y} = \sum_x n_{xy}$, $n = \sum_{x,y} n_{xy}$

The probability of the table:

$P(n_{11},\dots,n_{cr}) = \binom{n}{n_{11} \cdots n_{cr}} \prod_{x,y} p_{xy}^{\,n_{xy}}$

19 Conditioning on the margins

Under $H_0$: $p_{xy} = p_{x+}\, p_{+y}$,

$P(n_{11},\dots,n_{cr}) = \binom{n}{n_{11} \cdots n_{cr}} \prod_x p_{x+}^{\,n_{x+}} \prod_y p_{+y}^{\,n_{+y}}$

The marginal tables, $n_{x+}$ and $n_{+y}$, are sufficient under $H_0$, and

$P(n_{11},\dots,n_{cr} \mid n_{1+},\dots,n_{c+}, n_{+1},\dots,n_{+r}) = \dfrac{\left(\prod_x n_{x+}!\right)\left(\prod_y n_{+y}!\right)}{n! \prod_{x,y} n_{xy}!}$

does not depend on unknown parameters.

20 The exact conditional test procedure

Find all tables with the same marginal tables as the observed table. For each of these tables calculate:

- the conditional probability of the table
- the test statistics of interest

The exact p-value = the sum of the probabilities of the tables whose test statistics are at least as extreme as the test statistic of the observed table.

21 The exact conditional test procedure (continued)

Test statistic T(M), where M is an r×c table. Observed test statistic = $t_{obs}$.

The exact p-value:

$p_{exact} = \sum_{M:\; m_{x+} = n_{x+},\; m_{+y} = n_{+y},\; T(M) \ge t_{obs}} P(M \mid n_{1+},\dots,n_{c+}, n_{+1},\dots,n_{+r})$

For 2×2 tables this is Fisher's exact test. It is also appropriate for r×c tables, but it may be very time consuming because of the number of tables fitting the margins — hence the Monte Carlo test.

22 The Monte Carlo test

Since the conditional probabilities are known exactly, we may ask the computer to generate a random sample consisting of a large number of independent tables from this distribution.

The MC test procedure:

Generate tables $M_1,\dots,M_{N_{sim}}$.
Calculate the test statistic for each table: $T_i = T(M_i)$, i = 1,…,$N_{sim}$.
Count the number of random test statistics that are at least as extreme as the observed statistic:

$S = \sum_{i=1}^{N_{sim}} 1\{T_i \ge t_{obs}\}$

$p_{MC} = S / N_{sim}$ is an unbiased estimate of $p_{exact}$.

The standard error of $p_{MC}$ depends on $N_{sim}$.
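A minimal sketch of a Monte Carlo gamma test in the spirit of slides 19-22, not the authors' program. It assumes that permuting the column labels of the individual observations (keeping the row labels fixed) samples tables from the conditional distribution of slide 19, and it uses a one-sided test in the direction of positive association, matching the one-sided gamma p-values reported later. Function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def mc_gamma_test(table, n_sim=10_000, seed=0):
    """Monte Carlo approximation of the exact conditional one-sided gamma test."""
    rng = np.random.default_rng(seed)
    t = np.asarray(table, dtype=int)
    r, c = t.shape
    # expand the table into one (row, column) label per observation
    rows = np.repeat(np.arange(r), t.sum(axis=1))
    cols = np.concatenate([np.repeat(np.arange(c), t[i]) for i in range(r)])

    def gamma_of(labels_y):
        m = np.zeros((r, c))
        np.add.at(m, (rows, labels_y), 1)          # rebuild the table
        nC = nD = 0.0
        for i in range(r):
            for j in range(c):
                nC += m[i, j] * (m[i + 1:, j + 1:].sum() + m[:i, :j].sum())
                nD += m[i, j] * (m[i + 1:, :j].sum() + m[:i, j + 1:].sum())
        return (nC - nD) / (nC + nD)

    g_obs = gamma_of(cols)
    # permuting the Y labels keeps both margins fixed, as required on slide 20
    extreme = sum(gamma_of(rng.permutation(cols)) >= g_obs for _ in range(n_sim))
    return g_obs, extreme / n_sim                   # p_MC = S / N_sim (slide 22)
```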
23 Sequential Monte Carlo tests

Interrupt the Monte Carlo procedure when it has become obvious that the test statistic will not be significant.

Example: $N_{sim}$ = 10,000 and critical level of the test = 5%.

The sequential Monte Carlo test interrupts the procedure as soon as the number of tables with $T(M_i) \ge t_{obs}$ reaches 501, because

$S \ge 501 \implies p_{MC} \ge 501/10000 > 0.05$

24 Repeated Monte Carlo tests

The repeated Monte Carlo test interrupts the Monte Carlo procedure when the "risk" of a significant $p_{MC}$-value has become very small.

Parameters of the repeated Monte Carlo test:

$N_{sim}$ = the total number of tables to be generated
$N_{start}$ = the minimum number of tables to be generated
Critical value
Max risk of stopping too soon (default = 0.1%)

25 Smoking and self-reported health

The same SMOKE45 × HEALTH51 table as on slide 2:

χ² = 16.2, df = 12, p = 0.182
Gamma = 0.24, p = 0.000

Confounding?

26 Analysis of the conditional association given self-reported health at 45 years

HEALTH45 = Very good

SMOKE45        | HEALTH51
               | V.good  Good  Fair  Bad | TOTAL
---------------+-------------------------+-------
Never          |     4     12     0    0 |    16
          row% |  25.0   75.0   0.0  0.0 | 100.0
No more        |     5      7     0    0 |    12
          row% |  41.7   58.3   0.0  0.0 | 100.0
1-14           |     9      6     0    0 |    15
          row% |  60.0   40.0   0.0  0.0 | 100.0
15-24          |     2      3     0    0 |     5
          row% |  40.0   60.0   0.0  0.0 | 100.0
25+            |     0      3     0    0 |     3
          row% |   0.0  100.0   0.0  0.0 | 100.0
---------------+-------------------------+-------
TOTAL          |    20     31     0    0 |    51
          row% |  39.2   60.8   0.0  0.0 | 100.0

χ² = 6.0, df = 4, p = 0.196
Gamma = -0.18, p = 0.188

27 HEALTH45 = Good

SMOKE45        | HEALTH51
               | V.good  Good  Fair  Bad | TOTAL
---------------+-------------------------+-------
Never          |    11     55     5    1 |    72
          row% |  15.3   76.4   6.9  1.4 | 100.0
No more        |    10     59     5    0 |    74
          row% |  13.5   79.7   6.8  0.0 | 100.0
1-14           |     3     50     4    1 |    58
          row% |   5.2   86.2   6.9  1.7 | 100.0
15-24          |     6     76     8    1 |    91
          row% |   6.6   83.5   8.8  1.1 | 100.0
25+            |     1     25     1    0 |    27
          row% |   3.7   92.6   3.7  0.0 | 100.0
---------------+-------------------------+-------
TOTAL          |    31    265    23    3 |   322
          row% |   9.6   82.3   7.1  0.9 | 100.0

χ² = 9.8, df = 12, p = 0.636
Gamma = 0.17, p = 0.041

HEALTH45 = Fair

SMOKE45        | HEALTH51
               | V.good  Good  Fair  Bad | TOTAL
---------------+-------------------------+-------
Never          |     1      6     1    0 |     8
          row% |  12.5   75.0  12.5  0.0 | 100.0
No more        |     0      6     1    0 |     7
          row% |   0.0   85.7  14.3  0.0 | 100.0
1-14           |     1      3     3    0 |     7
          row% |  14.3   42.9  42.9  0.0 | 100.0
15-24          |     2      1     6    2 |    11
          row% |  18.2    9.1  54.5 18.2 | 100.0
25+            |     0      1     2    1 |     4
          row% |   0.0   25.0  50.0 25.0 | 100.0
---------------+-------------------------+-------
TOTAL          |     4     17    13    3 |    37
          row% |  10.8   45.9  35.1  8.1 | 100.0

χ² = 17.5, df = 12, p = 0.131
Gamma = 0.52, p = 0.001

28 HEALTH45 = Bad

SMOKE45        | HEALTH51
               | V.good  Good  Fair  Bad | TOTAL
---------------+-------------------------+-------
Never          |     0      0     0    0 |     0
          row% |   0.0    0.0   0.0  0.0 |   0.0
No more        |     0      3     0    0 |     3
          row% |   0.0  100.0   0.0  0.0 | 100.0
1-14           |     0      0     0    0 |     0
          row% |   0.0    0.0   0.0  0.0 |   0.0
15-24          |     0      1     3    0 |     4
          row% |   0.0   25.0  75.0  0.0 | 100.0
25+            |     0      0     0    0 |     0
          row% |   0.0    0.0   0.0  0.0 |   0.0
---------------+-------------------------+-------
TOTAL          |     0      4     3    0 |     7
          row% |   0.0   57.1  42.9  0.0 | 100.0

χ² = 3.9, df = 1, p = 0.047
Gamma = 1.00, p = 0.001

** Local test results for strata defined by HEALTH45 **

                              p-values               p-values (1-sided)
HEALTH45        X²    df   asympt   exact    Gamma   asympt    exact
----------------------------------------------------------------------
1: V.good      6.04    4   0.1960   0.1880   -0.18   0.1884   0.2050
2: Good        9.77   12   0.6358   0.6310    0.17   0.0410   0.0290
3: Fair       17.52   12   0.1311   0.1430    0.52   0.0010   0.0030
4: Bad         3.94    1   0.0472   0.1540    1.00   0.0006   0.1220

29 Tests of conditional independence

$H_0$: P(X, Y | Z = z) = P(X | Z = z) P(Y | Z = z) for all z

For each stratum z = 1,…,k, compute the concordances and discordances, $N_{zC}$ and $N_{zD}$, and the local test statistics: the stratum-specific χ² and the local coefficient

$\hat\gamma_z = \dfrac{N_{zC} - N_{zD}}{N_{zC} + N_{zD}}$

Under conditional independence, all local test statistics must be insignificant.

30 Global tests of conditional independence

The global χ²:

$\chi^2 = \sum_z \chi^2_z$, with df $= \sum_z \mathrm{df}_z$

The partial γ coefficient: with $N_C = \sum_z N_{zC}$ and $N_D = \sum_z N_{zD}$,

$\hat\gamma_{partial} = \dfrac{N_C - N_D}{N_C + N_D} = \sum_z w_z \hat\gamma_z$, where $w_z = \dfrac{N_{zC} + N_{zD}}{\sum_i (N_{iC} + N_{iD})}$

The partial γ is a weighted mean of the local coefficients and has an asymptotic normal distribution.

31 Monte Carlo approximation of exact conditional p-values works as for two-way tables.

(Local test results as in the table on slide 28.)

Global χ² = 37.3, df = 29, p = 0.139, p_exact = 0.148
γ_partial = 0.17, p = 0.034, p_exact = 0.027
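A minimal sketch of the pooling on slide 30, assuming the stratum tables are supplied as a list of count matrices; standard errors and the global χ² are not included, and the function names are hypothetical. It also checks the identity between the pooled ratio and the weighted mean of the local coefficients.

```python
import numpy as np

def concordance_counts(table):
    """Return (nC, nD) = (sum n_xy * A_xy, sum n_xy * D_xy) for one table (slide 11)."""
    t = np.asarray(table, dtype=float)
    r, c = t.shape
    nC = nD = 0.0
    for i in range(r):
        for j in range(c):
            nC += t[i, j] * (t[i + 1:, j + 1:].sum() + t[:i, :j].sum())
            nD += t[i, j] * (t[i + 1:, :j].sum() + t[:i, j + 1:].sum())
    return nC, nD

def partial_gamma(strata_tables):
    """Partial gamma over a list of stratum tables (slide 30): pooled concordances
    and discordances, equal to a weighted mean of the local gammas."""
    counts = [concordance_counts(t) for t in strata_tables]
    NC = sum(c for c, d in counts)
    ND = sum(d for c, d in counts)
    local = [(c - d) / (c + d) for c, d in counts if c + d > 0]
    weights = [(c + d) / (NC + ND) for c, d in counts if c + d > 0]
    pooled = (NC - ND) / (NC + ND)
    # the weighted mean of local gammas reproduces the pooled coefficient
    assert abs(sum(w * g for w, g in zip(weights, local)) - pooled) < 1e-12
    return pooled, local, weights

# Applied to the four HEALTH45 stratum tables of slides 26-28, this should come
# close to the partial gamma of about 0.17 reported on slide 31.
```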
32 Are the local coefficients homogeneous?

Least squares estimate: Gamma = 0.1998, s.e. = 0.0777

HEALTH45      Gamma   variance    s.e.    weight   residual
------------------------------------------------------------
1: V.good     -0.18    0.0397    0.1993    0.152    -2.060
2: Good        0.17    0.0097    0.0987    0.620    -0.411
3: Fair        0.52    0.0264    0.1625    0.228     2.237
4: Bad         1.00    (standard error is not available)
------------------------------------------------------------
Incomplete set of Gammas

Test for partial association: χ² = 7.5, df = 2, p = 0.023

Pairwise comparisons of strata:
Comparison of strata 1 and 2: p = 0.11
Significant difference between strata 1+2 and 3: p = 0.025

Notice the similarity between this analysis of γ coefficients and the Mantel-Haenszel analysis of odds ratios.
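The slides do not spell out the formulas behind the least squares estimate and the homogeneity test, so the following is a sketch under the assumption of standard inverse-variance (weighted least squares) pooling of the local gammas; the residual definition is likewise an assumption. With the three strata that have standard errors it reproduces values close to those in the table above. Function names are hypothetical and NumPy/SciPy are assumed.

```python
import numpy as np
from scipy.stats import chi2

def pooled_gamma_homogeneity(gammas, variances):
    """Inverse-variance pooling of local gammas plus a chi-square homogeneity test
    (a sketch of one standard approach, not necessarily the slides' exact method)."""
    g = np.asarray(gammas, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    pooled = (w * g).sum() / w.sum()            # weighted least squares estimate
    se_pooled = np.sqrt(1.0 / w.sum())
    x2 = ((g - pooled) ** 2 * w).sum()          # homogeneity chi-square
    df = len(g) - 1
    p = chi2.sf(x2, df)
    resid = (g - pooled) / np.sqrt(v - 1.0 / w.sum())   # standardized residuals (assumed form)
    return pooled, se_pooled, x2, df, p, resid

# Local gammas and variances for the three strata with available standard errors (slide 32):
print(pooled_gamma_homogeneity([-0.18, 0.17, 0.52], [0.0397, 0.0097, 0.0264]))
# roughly: pooled 0.20, s.e. 0.078, chi-square about 7.5 with df = 2, p about 0.02
```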