A TEST OF INDEPENDENCE IN TWO-WAY CONTINGENCY TABLES BASED ON MAXIMAL CORRELATION

Deniz C. Yenigün

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2007

Committee:

Gábor Székely, Advisor

Maria L. Rizzo, Co-Advisor

Louisa Ha, Graduate Faculty Representative

James Albert

Craig L. Zirbel

ABSTRACT

Gábor Székely, Advisor

Maximal correlation has several desirable properties as a measure of dependence, including the fact that it vanishes if and only if the variables are independent. Except for a few special cases, it is hard to evaluate maximal correlation explicitly. In this dissertation, we focus on two-dimensional contingency tables and discuss a procedure for estimating maximal correlation, which we use for constructing a test of independence. For large samples, we present the asymptotic null distribution of the test statistic. For small samples or tables with sparseness, we use exact inferential methods, where we employ maximal correlation as the ordering criterion. We compare the maximal correlation test with other tests of independence by Monte Carlo simulations. When the underlying continuous variables are dependent but uncorrelated, we point out some cases for which the new test is more powerful.

ACKNOWLEDGEMENTS

I would like to express my sincere appreciation to my advisor, Gábor Székely, and my co-advisor, Maria Rizzo, for their advice and help throughout this research. I thank all the members of my committee, Craig Zirbel, Jim Albert, and Louisa Ha, for their time and advice. I also want to thank the faculty members of the Department of Mathematics and Statistics, and the Department of Applied Statistics and Operations Research, for excellent instruction and help throughout my graduate studies at Bowling Green State University. Finally, I would like to thank my fiancée, Güneş Ertan, and my family, for all their support and encouragement.

Table of Contents

CHAPTER 1: Introduction
1.1 Statement of the Problem
1.2 Summary and Objectives

CHAPTER 2: Analysis of Contingency Tables and Measures of Dependence
2.1 Analysis of Contingency Tables
2.1.1 Preliminaries
2.1.2 Pearson and Likelihood Ratio Chi-Squared Tests of Independence
2.1.3 Loglinear Models
2.1.4 Correspondence Analysis
2.1.5 Exact Tests
2.2 Measures of Dependence
2.2.1 Correlation Based Measures
2.2.2 Measures Based on Distribution and Density Functions
2.2.3 Measures of Dependence for Cross Classifications

CHAPTER 3: Maximal Correlation Test of Independence
3.1 Maximal Correlation
3.1.1 Definition
3.1.2 Literature Review
3.1.3 Attainment
3.2 Maximal Correlation in Case of Contingency Tables
3.2.1 Sample Maximal Correlation
3.3 Algebraic Form of Sample Maximal Correlation
3.4 Maximal Correlation Test for Contingency Tables
3.4.1 Large Sample Case
3.4.2 Small Sample Case
3.5 A Numerical Illustration
3.6 A Related Test of Independence: Correlation Ratio Test
3.7 An Example: Lissajous Curve Case

CHAPTER 4: Empirical Results
4.1 The Null Distribution of nS_n^2
4.2 Large Sample Behavior of S_n
4.3 Power Comparisons
4.3.1 Simulation Design
4.3.2 Empirical Significance
4.3.3 Results of Power Comparisons
4.4 Empirical Powers of Correlation Ratio Test and I-Test
4.5 Summary of Power Comparisons
4.6 An Exploratory Study for the Lissajous Curve Case

CHAPTER 5: Conclusions

BIBLIOGRAPHY

Appendices

CHAPTER A: Results of the Empirical Study

CHAPTER B: Algebraic Form of Maximal Correlation
B.1 2 × 2 Contingency Tables
B.2 3 × 3 Contingency Tables

CHAPTER C: Selected R Code
C.1 R Function for Maximal Correlation
C.2 R Function for Correlation Ratio
C.3 R Code for Table A.10

List of Figures

3.1 Plots of Lissajous Curves
4.1 Empirical Distribution of nS_n^2
4.2 Empirical MSE of S_n
4.3 Empirical Power for Example 1
4.4 Empirical Power for Example 2
4.5 Empirical Power for Example 3, Case 1
4.6 Empirical Power for Example 3, Case 2
4.7 Empirical Power for Example 4
4.8 Empirical Power for Example 5
A.1 Several Lissajous Curves

List of Tables

2.1 The General Form of I × J Contingency Tables
2.2 The Joint Distribution of X and Y
3.1 Postulates vs. Dependence Measures
3.2 Hair and Eye Color of 264 Males
A.1 Critical Values of nS_n^2
A.2 Empirical Squared Error of S_n
A.3 Empirical Significance
A.4 Empirical Power for Example 1
A.5 Empirical Power for Example 2
A.6 Empirical Power for Example 3, Case 1
A.7 Empirical Power for Example 3, Case 2
A.8 Empirical Power for Example 4
A.9 Empirical Power for Example 5
A.10 Empirical Power for I_n^* and S_n, Loglinear Case
A.11 Empirical Power for I_n^* and S_n, Lissajous Curve Case
A.12 I, K_X(Y), and K_Y(X) for Lissajous Curve Case

CHAPTER 1

Introduction

A variety of data come in the form of cross-classified tables of counts, referred to as contingency tables. After a century of great progress, the analysis of contingency tables is still an active field in statistics, mainly due to its important applications in biological and social sciences. For a comprehensive survey of the most important methods in the analysis of contingency tables, see Agresti (2002). A principal interest in many studies regarding contingency tables is to test if the variables are independent. Although many good tests are available, no single test is known to be optimal for all independence problems. An overview of several existing tests for independence in contingency tables follows in Chapter 2. In this dissertation, we construct a test of independence for two-dimensional contingency tables, based on maximal correlation. The details of this test are presented and the power performance is compared with other tests of independence.

1.1 Statement of the Problem

Consider two categorical response variables X and Y having I and J levels respectively.

Given a contingency table, we consider the problem of testing if X and Y are independent.

The null hypothesis of statistical independence of X and Y is given by H_0: π_{ij} = π_{i·}π_{·j} for i = 1, ..., I, j = 1, ..., J, where π_{ij} denotes the probability that a randomly selected individual falls into category i of variable X and category j of variable Y, and the subscript "·" denotes the sum over the index it replaces.

Our tool for approaching this problem is maximal correlation, which is a convenient measure of dependence. Maximal correlation has several desirable properties, including the fact that it vanishes if and only if the variables are independent. However, except for a few special cases, it is hard to evaluate the maximal correlation explicitly. We discuss a procedure for estimating the maximal correlation for contingency tables, which we use for constructing a test of independence.

1.2 Summary and Objectives

The main objective of this research is to introduce the maximal correlation test of indepen- dence for contingency tables, and evaluate the performance of this test.

We describe the computation of maximal correlation for an observed contingency table, and give a detailed procedure on how to carry out the test of independence. Exact inferential methods are used for contingency tables with small sample size or sparseness. For large samples, the asymptotic null distribution of maximal correlation is used to carry out the test. We also evaluate this test of independence against a wide range of alternatives, and compare its power with two well-known tests of independence by an extensive empirical study. This dissertation has five chapters. We introduce the problem of interest in Chapter 1.

Chapter 2 introduces the notation used in the analysis of contingency tables, and summarizes the well-known methods used in this field. This chapter also includes a review of several widely used measures of dependence, including the maximal correlation. In Chapter 3 we give further insight into the concept of maximal correlation, and present the evaluation of maximal correlation for two-way contingency tables. Then we introduce the maximal correlation test of independence by giving the asymptotic null distribution of the test statistic and describing the treatment for small samples. The empirical results for the performance of the test are given in Chapter 4. For several alternatives, this chapter includes the comparison of the power performance of the maximal correlation test of independence with the Pearson and likelihood ratio chi-squared tests of independence. Chapter 5 contains our concluding remarks.

CHAPTER 2

Analysis of Contingency Tables and Measures of Dependence

In this chapter we present an overview of the topics that form the basis for our study. In

Section 2.1 we discuss the well-known methods used in the analysis of contingency tables. The topics include Pearson and likelihood ratio chi-squared tests of independence, loglinear models, correspondence analysis, and exact tests. In Section 2.2 we present the most com- monly used measures of dependence, without limiting ourselves to the case of contingency tables. The topics of this section include correlation based measures, measures based on distribution and density functions, and measures for cross classifications.

2.1 Analysis of Contingency Tables

This section includes an overview of the well-known methods used in the analysis of contingency tables. The earliest applications in this field were tests of independence, using the well-known Pearson chi-square test of independence and its modifications, as well as the likelihood ratio test. By the 1960s and 1970s, the attention of researchers shifted from testing to modeling, and loglinear models gained significant attention. With the loglinear approach, the cell counts in a contingency table are modeled in terms of the associations between the variables. Another descriptive tool for analyzing contingency tables is correspondence analysis, which is a graphical way of representing associations in two-way contingency tables. When asymptotic results are not appropriate due to small sample size or sparseness in the tables, exact tests provide an alternative to large sample methods. The remainder of this section outlines all these methods briefly. We begin with the preliminary ideas.

2.1.1 Preliminaries

Definition and Notation

A contingency table is a table of counts, which is used to record and analyze the relationship between two or more variables. The general form of a two-dimensional contingency table is given in Table 2.1, where a sample of n observations is classified with respect to two

qualitative variables X and Y , taking values α1, ..., αI , and β1, ..., βJ , respectively. Such

tables are known as I × J contingency tables. Here, nij denotes the observed count in

the category αi of the variable X and category βj of the variable Y . In what follows, the subscript “·” denotes the sum over the index it replaces.

          Y
X         β_1     β_2     ···     β_J     Total
α_1       n_11    n_12    ···     n_1J    n_1·
α_2       n_21    n_22    ···     n_2J    n_2·
...       ...     ...             ...     ...
α_I       n_I1    n_I2    ···     n_IJ    n_I·
Total     n_·1    n_·2    ···     n_·J    n_·· = n

Table 2.1: The general form of I × J contingency tables.

Sampling Distributions

Consider the counts nij, i = 1, ..., I, j = 1, ..., J, in the cells of an I × J contingency

table. When we treat counts as random variables, each nij has a distribution concentrated

on nonnegative integers, with expected values m_{ij} = E(n_{ij}). The Poisson sampling model

for counts nij assumes that they are independent Poisson variables with probability mass function (p.m.f.)

\frac{\exp(-m_{ij})\, m_{ij}^{n_{ij}}}{n_{ij}!}  \quad \text{for } n_{ij} = 0, 1, 2, \ldots

The total sample size n = \sum_{i}\sum_{j} n_{ij} in the Poisson sampling model is random, which is an unusual feature. When we condition on n, the p.m.f. of the n_{ij} is

\frac{n!}{\prod_{i}\prod_{j} n_{ij}!} \prod_{i}\prod_{j} \pi_{ij}^{n_{ij}},

where \pi_{ij} = m_{ij} / \sum_{i}\sum_{j} m_{ij}. This is the multinomial distribution characterized by the sample size n and the cell probabilities {\pi_{ij}}. If the sample size is deterministic, then this is the joint distribution of cell counts n_{ij}, and the sampling scheme is called the multinomial sampling model.
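As a quick illustration of the two sampling schemes, the following R sketch (not taken from the dissertation; the table dimensions, probabilities, and sample size are hypothetical) generates one table under each model.

## Minimal sketch of Poisson vs. multinomial sampling for a 3 x 4 table
set.seed(1)
nr <- 3; nc <- 4
probs <- matrix(1 / (nr * nc), nr, nc)   # hypothetical cell probabilities pi_ij
n <- 200                                 # intended sample size

## Poisson sampling: independent Poisson counts with means m_ij = n * pi_ij
tab.pois <- matrix(rpois(nr * nc, lambda = n * probs), nr, nc)

## Multinomial sampling: n is fixed and the counts are jointly multinomial
tab.mult <- matrix(rmultinom(1, size = n, prob = probs), nr, nc)

sum(tab.pois)   # random total under Poisson sampling
sum(tab.mult)   # always equals n under multinomial sampling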

Independence

Let X and Y denote two categorical response variables taking values α1, ..., αI , and β1, ..., βJ , respectively. Consider the multinomial sampling model. When we classify subjects on both variables, the responses (X,Y ) of a randomly selected subject have a , which can be displayed in a rectangular table having I rows for categories of X and J columns for categories of Y . The probabilities {πij} for the multinomial distribution form the joint distribution of X and Y , which is given in Table 2.2. Here πij denotes the probability that

(X,Y) = (α_i, β_j). The marginal distributions are the row and column totals obtained by summing the joint probabilities π_{ij}. The marginal distribution of X is given by π_{i·}, i = 1, ..., I, and the marginal

distribution of Y is given by π·j, j = 1, ..., J. The variables are statistically independent if all the joint probabilities equal the product of the corresponding marginal probabilities.

          Y
X         β_1     β_2     ···     β_J     Total
α_1       π_11    π_12    ···     π_1J    π_1·
α_2       π_21    π_22    ···     π_2J    π_2·
...       ...     ...             ...     ...
α_I       π_I1    π_I2    ···     π_IJ    π_I·
Total     π_·1    π_·2    ···     π_·J    1

Table 2.2: The joint distribution of X and Y.

A principal interest in many studies regarding contingency tables is to test if the variables

are independent. The null hypothesis of statistical independence is given by

H0 : πij = πi·π·j (2.1)

for i = 1, ...I, j = 1, ...J.

2.1.2 Pearson and Likelihood Ratio Chi-Squared Tests of Independence

The Pearson chi-squared test statistic, X^2, and the likelihood ratio chi-squared statistic, G^2, are given by

X^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}},   (2.2)

G^2 = 2 \sum_{i}\sum_{j} n_{ij} \log(n_{ij}/\hat{m}_{ij}),   (2.3)

where n_{ij}, i = 1, ..., I, j = 1, ..., J, are the cell counts in an I × J contingency table, and \hat{m}_{ij} = (n_{i·} n_{·j})/n are the estimated expected frequencies under the independence hypothesis. When independence holds, X^2 and G^2 have asymptotic chi-squared distributions with (I−1)(J−1) degrees of freedom as n → ∞. These are the most popular tests of independence in contingency tables. However, the adequacy of the asymptotic distribution depends both on the sample size n and the number of cells N = IJ. For the X^2 test, Cochran (1954) suggests that a minimum expected value of 1 is permissible as long as no more than about 20% of the cells have expected values below 5. For the G^2 test, Koehler (1986) showed that the chi-square approximation is poor when n/N is less than 5. See Agresti (2002, page 391) for more discussion on the adequacy of the chi-square approximation for sparse contingency tables.
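A minimal R sketch of the two statistics, assuming a hypothetical 2 × 3 table of counts (the Pearson statistic could equivalently be obtained with chisq.test()):

## Pearson X^2 of (2.2) and likelihood ratio G^2 of (2.3) for an observed table
tab  <- matrix(c(30, 10, 15, 20, 25, 20), nrow = 2, byrow = TRUE)
n    <- sum(tab)
mhat <- outer(rowSums(tab), colSums(tab)) / n      # estimated expected counts

X2 <- sum((tab - mhat)^2 / mhat)                   # Pearson statistic
G2 <- 2 * sum(tab * log(tab / mhat))               # likelihood ratio statistic

df <- (nrow(tab) - 1) * (ncol(tab) - 1)
pchisq(c(X2, G2), df = df, lower.tail = FALSE)     # asymptotic p-values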

2.1.3 Loglinear Models

Loglinear models are regression-type models for categorical variables. Consider the multi- nomial sampling model for the categorical variables X and Y as discussed in Section 2.1.1.

Recall that the variables are independent when πij = πi·π·j, i = 1, ..., I, j = 1, ..., J. Let

mij denote the expected frequencies. Then the corresponding expression for the expected

frequencies is mij = nπij = nπi·π·j, which is used for constructing loglinear models. Since the expected frequencies are used, loglinear models also apply for the Poisson sampling model.

Independence Model

The independence equation above can be written as \log m_{ij} = \log n + \log \pi_{i·} + \log \pi_{·j}. This expression is equivalent to

\log m_{ij} = \mu + \lambda_i^X + \lambda_j^Y,   (2.4)

where

\lambda_i^X = \log \pi_{i·} - (\sum_k \log \pi_{k·})/I,
\lambda_j^Y = \log \pi_{·j} - (\sum_k \log \pi_{·k})/J,
\mu = \log n + (\sum_k \log \pi_{k·})/I + (\sum_k \log \pi_{·k})/J.

Note that the parameters satisfy \sum_i \lambda_i^X = \sum_j \lambda_j^Y = 0. The model (2.4) is called the loglinear model of independence for two-dimensional contingency tables.

Saturated Model

Suppose there is dependence between the random variables. The saturated loglinear model

is given by

\log m_{ij} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY},   (2.5)

where

\mu = (\sum_i \sum_j \log m_{ij})/IJ,
\lambda_i^X = (\sum_j \log m_{ij})/J - \mu,
\lambda_j^Y = (\sum_i \log m_{ij})/I - \mu,
\lambda_{ij}^{XY} = \log m_{ij} - (\sum_j \log m_{ij})/J - (\sum_i \log m_{ij})/I + \mu.

Note that the parameters satisfy \sum_i \lambda_i^X = \sum_j \lambda_j^Y = \sum_i \lambda_{ij}^{XY} = \sum_j \lambda_{ij}^{XY} = 0, i = 1, ..., I, and j = 1, ..., J. The right hand side of (2.5) resembles the formula for cell means in a two-way ANOVA model, allowing interaction. Here μ is the overall mean of the natural log of the expected frequencies, \lambda_i^X is the main effect for variable X, \lambda_j^Y is the main effect for variable Y, and \lambda_{ij}^{XY} is the interaction effect for variables X and Y. The independence model described above is a special case of the saturated model, where \lambda_{ij}^{XY} = 0. For multinomial sampling, the cell probabilities of the multinomial distribution corresponding to the saturated loglinear model (2.5) are given by

\pi_{ij} = \frac{\exp(\mu + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY})}{\sum_a \sum_b \exp(\mu + \lambda_a^X + \lambda_b^Y + \lambda_{ab}^{XY})}.   (2.6)

Loglinear models are especially useful for analyzing higher dimensional contingency tables; however, the number of model parameters increases rapidly. These models can also be viewed as generalized linear models, where the link function is the log function. The odds ratios are the building blocks of loglinear models. To illustrate, for the 2 × 2 independence model, in each row the quantity \exp(2\lambda_1^Y) is the odds that the column classification is category 1 rather than category 2. Given a contingency table, the loglinear model parameters can be estimated by the maximum likelihood (ML) method. Many loglinear models do not have a closed form for the

ML estimates. In such cases, iterative procedures such as the iterative proportional fitting and the Newton-Raphson method can be used for solving likelihood equations and obtaining the

ML estimates. When loglinear models are used, testing independence corresponds to testing

whether the interaction term is needed in the model or not. One usually follows a hierarchical model fitting approach, which requires starting with the saturated model and deleting higher order interaction terms until the fit of the model to the data becomes unacceptable. For a general treatment of loglinear models, see Bishop, Fienberg and Holland (1975).
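As a sketch of how such a fit can be carried out in practice, the base-R function loglin(), which uses iterative proportional fitting, can fit the independence model (2.4) to a hypothetical table; testing independence then amounts to asking whether the interaction term is needed:

## Fit the independence loglinear model to a hypothetical 2 x 3 table
tab <- matrix(c(30, 10, 15, 20, 25, 20), nrow = 2, byrow = TRUE)

fit.indep <- loglin(tab, margin = list(1, 2), fit = TRUE, print = FALSE)

fit.indep$lrt                                            # G^2 for the independence model
pchisq(fit.indep$lrt, df = fit.indep$df, lower.tail = FALSE)  # p-value of the fit

A small p-value indicates that the interaction term \lambda_{ij}^{XY} is needed, i.e., that independence is rejected.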

2.1.4 Correspondence Analysis

Correspondence analysis is a graphical way of representing associations in two-way contin- gency tables. Consider the following generalization of the independence model:

\pi_{ij} = \pi_{i·}\pi_{·j}\left(1 + \sum_{m=1}^{M} \rho_m x_{im} y_{jm}\right),   (2.7)

where M = min(I,J) − 1. Here, ρ_1, ..., ρ_M are the canonical correlations and x_{im} and y_{jm} are the corresponding row and column scores, which satisfy

\sum_{i=1}^{I} x_{im}\pi_{i·} = 0,   \sum_{j=1}^{J} y_{jm}\pi_{·j} = 0,
\sum_{i=1}^{I} x_{im}^2\pi_{i·} = 1,   \sum_{j=1}^{J} y_{jm}^2\pi_{·j} = 1,
\sum_{i=1}^{I} x_{im}x_{im'}\pi_{i·} = 0,   \sum_{j=1}^{J} y_{jm}y_{jm'}\pi_{·j} = 0,   for m ≠ m'.

The scores are used in the correspondence analysis graphical display. Unlike loglinear models, correspondence analysis is only useful for two-dimensional contingency tables. This method has been used as a descriptive tool in categorical data analysis, and it is very popular in Europe, especially France, due to a series of papers by Benzécri in the 1970s (see, e.g., Benzécri, 1973). Its development is confusing, as several researchers worked on it independently under different names, such as dual scaling. Nishisato (1980) gives an excellent survey of all these methods under the name of dual scaling. Dual scaling methods are not without criticism. In his review of Nishisato's book, Aitkin (1982) claims that dual scaling methods suffer severely from the lack of a statistical model, and adds "without a justification based on a statistical model, however, we must regard dual scaling as an ad hoc mathematical technique. Why are we maximizing this criterion?" An independence test for contingency tables based on correspondence analysis is given by Haberman (1981). This test requires assigning scores to the row and column variables,

so that the correlation of scores of row and column variables is maximized subject to some constraints based on the conditions given above. Haberman’s (1981) work is very closely related to the maximal correlation test of independence we discuss in this study. For other independence tests based on correspondence analysis, see Kuriki (2005) and the references therein.

2.1.5 Exact Tests

When working with contingency tables with a small number of observations or sparse data, exact inferential methods provide an alternative to large sample methods. Under the null

hypothesis of independence, the p.m.f. of {nij} includes nuisance parameters {πi·} and {π·j}, thus it has a limited use. These parameters can be eliminated by conditioning on sufficient statistics for them, {ni·} and {n·j}. The p.m.f. of {nij} conditional on the sufficient statistics is given by

\frac{(\prod_i n_{i·}!)(\prod_j n_{·j}!)}{n!\,\prod_i \prod_j n_{ij}!},   (2.8)

which is the multivariate hypergeometric distribution. Below is the commonly used algorithm for exact tests of independence.

1. Observe a contingency table with frequencies {nij} and calculate the row and column

sums ni· and n·j.

2. Find all possible contingency tables {aij} such that ai· = ni· and a·j = n·j. Compute

the probabilities of obtaining such tables by plugging aij for nij in (2.8).

3. Order all tables with respect to some measure of dependence.

4. The p-value of the exact test is the sum of probabilities of obtaining contingency tables which represent equal or greater deviation from independence compared to the observed table.

A well-known example of exact tests is the Fisher exact test for 2 × 2 contingency tables (Fisher, 1934), where exact enumeration by hand is possible. The availability of computational power makes exact tests possible for higher dimensional tables. However, in most cases complete enumeration is still impossible with current computational power. In such cases, one can simulate contingency tables with given marginals and approximate the p-value of the independence test. For two-dimensional tables, the algorithm given by Patefield (1981) is widely used. For higher dimensional tables, one can use the algorithm given by Diaconis and Sturmfels (1998). Exact tests are available in most statistical software packages such as SPSS, SAS and R. StatXact is a statistical package that specializes in exact tests. For a survey of exact tests, see Agresti (1992).
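A small R sketch of the Monte Carlo version of this procedure follows. Tables with the observed margins are simulated with r2dtable(), which implements Patefield's (1981) algorithm; here the Pearson statistic is used as the ordering criterion purely for illustration, whereas the dissertation orders tables by the sample maximal correlation. The example table is hypothetical.

## Approximate exact test of independence by simulating tables with fixed margins
exact.mc.test <- function(tab, B = 10000) {
  stat <- function(x) {
    e <- outer(rowSums(x), colSums(x)) / sum(x)
    sum((x - e)^2 / e)                              # Pearson statistic as ordering criterion
  }
  obs  <- stat(tab)
  sims <- r2dtable(B, rowSums(tab), colSums(tab))   # Patefield's algorithm
  mean(sapply(sims, stat) >= obs)                   # approximate p-value
}

tab <- matrix(c(8, 2, 3, 7), nrow = 2, byrow = TRUE)
exact.mc.test(tab)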

2.2 Measures of Dependence

In virtually any field of statistics, there is a need for measuring the dependence between random variables. There are several measures of dependence in the statistical literature, which are often considered as an intermediate step for obtaining tests of independence. In this section we present a review of commonly used measures of dependence. We begin with correlation based measures in Section 2.2.1. In Section 2.2.2 we discuss measures based on distribution functions and density functions. We focus on dependence measures for cross classifications in Section 2.2.3. For a comprehensive survey of the most important measures of dependence, see Liebetrau (2005).

2.2.1 Correlation Based Measures

Here we discuss several correlation coefficients, which are well-known measures of depen- dence. A correlation coefficient is intended to measure the strength of the relationship between two variables. The strength of the relationship usually refers to the strength of

the tendency to move in the same direction. Different correlation coefficients measure the strength of the relationship in different ways. Throughout Section 2.2.1, we will suppose that X and Y are random variables having

means μ_X and μ_Y and finite variances σ_X^2 and σ_Y^2, respectively. We will also suppose that

(X1,Y1), ..., (Xn,Yn) is a random sample of size n from the bivariate population (X,Y ).

Product Moment Correlation Coefficient

Since it was introduced in the late nineteenth century, product moment correlation has been the most popular measure of association. Sometimes it is referred to as Pearson’s product moment correlation, due to a well known paper by Pearson (1896). The product moment correlation coefficient between X and Y is defined by

\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.   (2.9)

The range of ρ is [−1, 1]. The maximum value is achieved in the case of an increasing linear relationship, and the minimum value is achieved in case of a decreasing linear relationship. If X and Y are independent, then ρ = 0, and we say that X and Y are uncorrelated. The converse is not true in general, but if X and Y have a bivariate , inde- pendence is equivalent to being uncorrelated. Given the observations (X1,Y1), ..., (Xn,Yn) from a bivariate distribution, ρ can be estimated by its sample analog

\hat\rho = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\left[\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2\right]^{1/2}},   (2.10)

where \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i and \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.

For a vector X = (X1, ..., Xp) of random variables, the dependencies as expressed in

terms of correlations are given by a correlation matrix {ρ_{ij}}, a p × p matrix whose i, j entry

is ρ(Xi,Xj). When X has a multivariate normal distribution, pairwise independence implies total independence and vice versa. For such cases statistical tests of independence based on the correlation matrix can be constructed. For further discussion on such tests, see Anderson (2003, Chapter 9).

Kendall’s τ

Kendall’s τ is a commonly used dependence measure which has a simple interpretation based on the concordance relationships among the variables. Two pairs of observations (Xi,Yi) and

(Xj,Yj) are said to be concordant if either Xi < Xj and Yi < Yj or Xi > Xj and Yi > Yj.

Equivalently, the pairs are concordant if (Xj − Xi)(Yj − Yi) > 0. The pairs are said to be

discordant if (Xj − Xi)(Yj − Yi) < 0. The pairs are said to be tied if (Xj − Xi)(Yj − Yi) = 0.

Let πc, πd and πt denote the probabilities that two randomly selected pairs are concordant, discordant or tied, respectively. Suppose X and Y are continuous variables such that the observations can be fully ranked,

thus πt = 0 by definition. Kendall’s coefficient of concordance is defined by

τ = πc − πd. (2.11)

If two observations (X_i, Y_i) and (X_j, Y_j) are randomly selected from a bivariate population, τ is the probability that they are concordant minus the probability that they are discordant. The range of τ is [-1,1]. If the order of the population values when ranked according to X is the same as the order when ranked according to Y (perfect agreement in ranking), the maximum is achieved. If one ordering is the reverse of the other (perfect disagreement in ranking), the minimum is achieved. The independence of X and Y implies that τ = 0, but the converse is not true. A natural estimator of τ is given by

\hat\tau_a = \frac{C - D}{\binom{n}{2}} = \frac{2(C - D)}{n(n-1)},   (2.12)

where C is the number of concordant pairs, and D is the number of discordant pairs of observations in a sample of size n. This estimator was proposed by Kendall (1938). The

range of \hat\tau_a is [-1,1]. The maximum is achieved if all observed pairs are concordant and the minimum value is achieved if all observed pairs are discordant. Under independence, the quantity [(C − D) − E(C − D)]/[\mathrm{Var}(C − D)]^{1/2} has an asymptotic standard normal

distribution, so an independence test can be constructed. See Hollander and Wolfe (1999, Chapter 8) for details.

Rank Correlation

We noted above that if X and Y are normally distributed, independence is equivalent to being uncorrelated, so the correlation is a convenient measure of dependence. In order to robustify the analysis to non-normal distributions, one can replace observations by ranks. The was first introduced by Spearman (1904). In principle, it is obtained

by replacing Xi and Yi in (2.10) by Ri and Si, where Ri is the rank of Xi among X’s, and Si

is the rank of Yi among Y ’s. In practice, simpler procedures are used to calculate the rank correlation. Each procedure has a different way of treating the ties in the data, and they give the same value when there are no ties. When X and Y are continuous variables and there are no ties in the data, Spearman’s rank correlation coefficient is given by

\hat\rho_s = \frac{12\sum_{i=1}^{n}[R_i - (n+1)/2][S_i - (n+1)/2]}{n(n^2 - 1)}.   (2.13)

The range of \hat\rho_s is [-1,1] and the extremes are attained only when there is perfect disagreement or agreement in the rankings. Unlike Kendall's \hat\tau_a, it is not easy to assign an operational interpretation to Spearman's rank correlation coefficient, as it is not an estimator of an easily defined population parameter. See Liebetrau (2005, Chapter 4) for an interpretation of \hat\rho_s which involves concordance relationships among three sets of observations. When X and Y are independent, the ranks are also independent and we have E(\hat\rho_s) = 0. See Hollander and Wolfe (1999, Chapter 8) for tests of independence based on rank correlations.

Correlation Ratio

The correlation ratio of Y with respect to X is defined by

K_X(Y) = \left[\frac{\mathrm{Var}(E(Y|X))}{\mathrm{Var}(Y)}\right]^{1/2}.   (2.14)

The range of KX (Y ) is [0,1]. If X and Y are independent, KX (Y ) = 0, but the converse is not true. However, KX (Y ) = 0 implies that the correlation coefficient between X and

Y vanishes. The quantity K_X(Y) equals 1 if and only if Y = f(X), where f is a Borel-measurable function. The correlation ratio was introduced by Kolmogorov (1933) and its mathematical properties were studied by Rényi (1959). Since K_X(Y) is not symmetric, Rényi considers the quantity

K(X,Y ) = max(KX (Y ),KY (X)). (2.15)

In Section 3.6 we take a closer look at K(X,Y ) and we discuss the computation of this quantity from an observed contingency table.
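A plug-in estimate of K_X(Y) when X is categorical can be formed directly from (2.14) as the variance of the group means of Y divided by the variance of Y. The sketch below is an illustration only (the dissertation's own R function is in Appendix C), and the data are hypothetical.

## Plug-in estimate of the correlation ratio K_X(Y) for categorical X
corr.ratio <- function(x, y) {
  group.means <- tapply(y, x, mean)
  group.prop  <- table(x) / length(x)
  num <- sum(group.prop * (group.means - mean(y))^2)   # plug-in Var(E(Y|X))
  sqrt(num / (mean(y^2) - mean(y)^2))                  # divided by plug-in Var(Y)
}

set.seed(3)
x <- sample(c("a", "b", "c"), 100, replace = TRUE)
y <- ifelse(x == "a", 1, 0) + rnorm(100)
corr.ratio(x, y)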

Maximal Correlation

The maximal correlation between X and Y is defined as

S(X,Y) = \sup_{f,g} \rho(f(X), g(Y)),   (2.16)

where the supremum is taken over all functions f of X and g of Y with finite and nonzero variance. Here ρ(U, V) denotes the product moment correlation coefficient between the random variables U and V. The range of maximal correlation is [0,1]. Maximal correlation

has several attractive properties, including the fact that it vanishes if and only if the random variables are independent. The maximum is achieved when one variable is a Borel-measurable function of the other. The maximal correlation was introduced by Gebelein (1941) and received considerable attention in the statistical literature. It is hard to evaluate maximal correlation explicitly, except for some special cases. The main purpose of this study is to compute maximal correla- tion for contingency tables and discuss an independence test based on maximal correlation. These results are given in Chapter 3, along with a more detailed discussion on maximal

correlation.

2.2.2 Measures Based on Distribution and Density Functions

Let X and Y be random variables with distribution functions F and G, and density functions f and g, respectively. Let H denote the joint distribution function of X and Y , and let h denote their joint density function. Let “⊗” denote the product of two functions. In this section we review several measures of dependence based on these functions.

Measures Based on Distribution Functions

The variables X and Y are said to be independent if H(x, y) = F (x)G(y). Therefore, the problem of measuring the dependence between X and Y can be considered as the problem of measuring the distance between H and F ⊗ G. Two well-known distance measures between

two distribution functions F1 and F2 are the Kolmogorov-Smirnov distance

\Delta_1(F_1, F_2) = \sup_x |F_1(x) - F_2(x)|,   (2.17)

and the Cramér-von Mises distance

\Delta_2(F_1, F_2) = \int (F_1(x) - F_2(x))^2 \, dF_1(x).   (2.18)

By letting F1 = H and F2 = F ⊗ G these distance functions can be used as measures of dependence. In this case both ∆1 and ∆2 are nonnegative and they vanish if and only if X and Y are independent. Two well known tests of independence based on Cramer-Von Mises distance were given by Hoeffding (1948), and Blum, Kiefer and Rosenblatt (1961).

Measures Based on Density Functions

Using the same principle as in the case of distribution functions, the degree of dependence

between two variables X and Y can be measured by the distance between the joint density function h and f ⊗ g, since the variables are said to be independent if h(x, y) = f(x)g(y).

Two well known distance measures between two density functions f_1 and f_2 are the Hellinger distance

H = \int \left\{\sqrt{f_1(x)} - \sqrt{f_2(x)}\right\}^2 dx,   (2.19)

and the Kullback-Leibler information distance

I = \int \log\left\{\frac{f_1(x)}{f_2(x)}\right\} f_1(x)\, dx.   (2.20)

By letting f_1 = h and f_2 = f ⊗ g these distance functions can be used as measures of dependence. In this case both H and I are nonnegative and they vanish if and only if X and Y are independent. See Tjøstheim (1996) for more discussion on independence tests based on density functions.

Mean Square Contingency

If μ and ν are two probability measures on a given probability space, then μ is absolutely continuous with respect to ν if μ(A) = 0 for every set A for which ν(A) = 0. The dependence between two variables is said to be regular if their joint distribution is absolutely continuous with respect to the direct product of their distributions. Suppose that the dependence between X and Y is regular. The mean square contingency of X and Y is given by

C(X,Y) = \left[\iint \left(\frac{h(x,y)}{f(x)g(y)} - 1\right)^2 dF(x)\, dG(y)\right]^{1/2}.   (2.21)

The dependence between two discrete variables is always regular. Suppose that X and Y are

discrete variables assuming the values αi (i = 1, 2, ...) and βj (j = 1, 2, ...). Let Ai and Bj

denote the events X = α_i and Y = β_j respectively. In this case, the mean square contingency between X and Y is defined as

C(X,Y) = \left[\sum_i \sum_j \frac{[P(A_i B_j) - P(A_i)P(B_j)]^2}{P(A_i)P(B_j)}\right]^{1/2}.   (2.22)

The notion of mean square contingency for discrete distributions was introduced by Pearson, and it will be revisited in the next section. The range of C is [0, +∞] but it can be

made [0, 1] with a simple transformation. The mean square contingency vanishes if and only if the variables are independent. See R´enyi (1959) for more on mean square contingency.

2.2.3 Measures of Dependence for Cross Classifications

Consider two categorical response variables X and Y having I and J levels respectively, and consider their cross classification as described in Section 2.1.1. In this section, we review some well known measures of dependence for cross classified variables. We discuss nominal variables in the first two sub-sections, and ordinal variables in the last two sub-sections.

Measures Based on the Chi-Squared Statistic

Consider a contingency table {n_{ij}}. In Section 2.1.2 we introduced the Pearson chi-squared test statistic given by

X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - n_{i·}n_{·j}/n)^2}{n_{i·}n_{·j}/n},   (2.23)

which enables us to test whether the observed cell frequencies are consistent with the expected cell frequencies under the hypothesis of independence. The population analog of (2.23) is given by

\phi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(\pi_{ij} - \pi_{i·}\pi_{·j})^2}{\pi_{i·}\pi_{·j}} = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\pi_{ij}^2}{\pi_{i·}\pi_{·j}} - 1.   (2.24)

Note that this is (2.22) applied to contingency tables, and it is known as Pearson’s coeffi- cient of mean squared contingency. The range of φ2 is [0, min(I,J) − 1]. The minimum is achieved when X and Y are independent and the maximum is achieved when there is perfect association. Since the range of φ2 depends on the dimensions of the table, it is not a suitable measure of association. To overcome this, Pearson (1904) proposed the measure

p = \left(\frac{\phi^2}{1 + \phi^2}\right)^{1/2}.   (2.25)

This measure is referred to as Pearson's contingency coefficient. The measure p is bounded between 0 and 1, but it cannot always attain the upper limit 1 and its range still depends on the dimensions of the table. Tschuprow (1919) proposed the measure of dependence

t = \left(\frac{\phi^2}{\sqrt{(I-1)(J-1)}}\right)^{1/2}.   (2.26)

Unless I and J are nearly equal, the range of t may be far from the desired interval [0,1]. Cramér (1946) proposed the coefficient

v = \left(\frac{\phi^2}{\min(I,J) - 1}\right)^{1/2},   (2.27)

which satisfies v = 1 for perfect association and v = 0 for independence. Estimators of p, t and v are obtained by replacing \phi by \hat\phi, which is obtained by replacing the \pi_{ij}'s in (2.24) with their maximum likelihood estimates n_{ij}/n. Note that \hat\phi^2 = X^2/n. Tests based on p, t and v can be constructed, where the percentage points of the distribution of the test statistics can be computed from the distribution of X^2. See, for example, Bishop, Fienberg and Holland (1975) for more on tests based on X^2. All three measures mentioned above are suitable for nominal data. The main criticism for measures of dependence based on X^2 is that they lack a suitable interpretation.

Goodman and Kruskal λ and τ

Measures based on X^2 treat the variables symmetrically. If a causal relationship is sought, asymmetric measures must be preferred. Two asymmetric measures of dependence for nominal variables are the Goodman and Kruskal λ and the Goodman and Kruskal τ. Unlike measures based on X^2, both measures permit direct interpretation. The Goodman and Kruskal λ is designed for tables where the goal is to predict the column variable Y from the row variable X, or vice versa. One can predict the category of Y from the category of X by (i) assuming Y is independent of X, or (ii) assuming Y is a function of X. Then the proportional reduction in error (PRE) measure defined by

PRE = [Probability of error in (i) − Probability of error in (ii)] / [Probability of error in (i)]

is the relative improvement in predicting the Y category obtained when the X category is known, as opposed to when the X category is unknown. For an I × J table, PRE leads to the Goodman and Kruskal λ given by

\lambda_{Y|X} = \frac{\sum_{i} \max_{j} \pi_{ij} - \max_{j} \pi_{·j}}{1 - \max_{j} \pi_{·j}}.   (2.28)

The range of \lambda_{Y|X} is [0,1]. The minimum is achieved if and only if the knowledge of the X category is of no help in predicting the Y category. The maximum is achieved if and only if the knowledge of the X category completely specifies the Y category. If X and Y are independent \lambda_{Y|X} is zero, but the converse is not true. The Goodman and Kruskal τ is the proportion of variation in the response (dependent) variable that can be explained by the predictor (independent) variable. When the rows of the contingency table represent the predictor variable, and the columns represent the response variable, the Goodman and Kruskal τ is given by

\tau_{Y|X} = \frac{\sum_{i}\sum_{j} \pi_{ij}^2/\pi_{i·} - \sum_{j} \pi_{·j}^2}{1 - \sum_{j} \pi_{·j}^2}.   (2.29)

In this definition, Goodman and Kruskal use Gini’s (1912) measure of variation for categorical

variables. The range of τY |X is [0,1]. The minimum is achieved when there is no association, and the maximum is achieved when there is perfect association. See Goodman and Kruskal

(1954) for more on λY |X and τY |X .
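The sample analogs of both measures are obtained by replacing the π_{ij}'s with the observed proportions, as in the following R sketch for a hypothetical table whose rows play the role of the predictor X.

## Sample analogs of lambda_{Y|X} (2.28) and tau_{Y|X} (2.29)
tab  <- matrix(c(30, 10, 15, 20, 25, 20), nrow = 2, byrow = TRUE)
p    <- tab / sum(tab)                     # estimated cell probabilities
prow <- rowSums(p); pcol <- colSums(p)

lambda.YX <- (sum(apply(p, 1, max)) - max(pcol)) / (1 - max(pcol))
tau.YX    <- (sum(p^2 / prow) - sum(pcol^2)) / (1 - sum(pcol^2))
c(lambda = lambda.YX, tau = tau.YX)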

Kendall τb and τc

Kendall’s τ was originally proposed for continuous variables, for which there are no ties and the samples can be fully ranked. For ordinal variables, the formula in (2.12) may be modified to handle ties. Kendall (1945) introduced

\hat\tau_b = \frac{C - D}{[(n(n-1)/2 - T_X)(n(n-1)/2 - T_Y)]^{1/2}},   (2.30)

where T_X = \sum_{i=1}^{I} n_{i·}(n_{i·} - 1)/2 and T_Y = \sum_{j=1}^{J} n_{·j}(n_{·j} - 1)/2. As defined above, C is the number of concordant pairs, and D is the number of discordant pairs of observations in a sample of size n. This index of ordinal association is referred to as Kendall's \tau_b. The estimator \hat\tau_b is in fact the maximum likelihood estimator of the quantity

\tau_b = \frac{\pi_c - \pi_d}{\left[(1 - \sum_{i=1}^{I} \pi_{i·}^2)(1 - \sum_{j=1}^{J} \pi_{·j}^2)\right]^{1/2}},   (2.31)

measure τb is bounded between -1 and 1. For tables that are not square,τ ˆb cannot attain extreme values −1 or 1. Following Kendall, Stuart (1953) introduced

2 min(I,J)(C − D) τˆ = (2.32) c n2(min(I,J) − 1)

as an estimator of τb. This index of ordinal association is useful for any rectangular contin-

gency table, and it is referred to as the Kendall-Stuart τc.

Goodman and Kruskal γ

The most commonly used measure of association for ordinal variables is the Goodman-Kruskal γ, which is given by

\gamma = \frac{\pi_c - \pi_d}{\pi_c + \pi_d} = \frac{\pi_c - \pi_d}{1 - \pi_t}.   (2.33)

Note that γ is the conditional probability that a pair of observations selected randomly from a population are concordant minus the probability that they are discordant, the condi- tion being that the pairs are not tied on either variable. The range of γ is [-1,1]. If X and Y are independent, γ = 0, but the converse is not necessarily true. The maximum likelihood estimator of γ under the multinomial sampling is

\hat\gamma = \frac{C - D}{C + D}.   (2.34)

See Goodman and Kruskal (1979) for more on γ.
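All of the ordinal measures above are built from the concordant and discordant pair counts C and D. The following R sketch, for a hypothetical 3 × 3 ordinal table, counts the pairs and forms \hat\gamma of (2.34) and \hat\tau_b of (2.30).

## Concordant/discordant pair counts and the estimators gamma-hat and tau-hat_b
concordance <- function(tab) {
  C <- 0; D <- 0
  for (i in 1:nrow(tab)) for (j in 1:ncol(tab)) {
    C <- C + tab[i, j] * sum(tab[row(tab) > i & col(tab) > j])
    D <- D + tab[i, j] * sum(tab[row(tab) > i & col(tab) < j])
  }
  c(C = C, D = D)
}

tab <- matrix(c(20, 10, 5, 8, 15, 12, 4, 9, 17), nrow = 3, byrow = TRUE)
cd  <- concordance(tab); C <- cd[["C"]]; D <- cd[["D"]]
n   <- sum(tab)
TX  <- sum(rowSums(tab) * (rowSums(tab) - 1) / 2)
TY  <- sum(colSums(tab) * (colSums(tab) - 1) / 2)

gamma.hat <- (C - D) / (C + D)
taub.hat  <- (C - D) / sqrt((n * (n - 1) / 2 - TX) * (n * (n - 1) / 2 - TY))
c(gamma = gamma.hat, tau.b = taub.hat)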

CHAPTER 3

Maximal Correlation Test of Independence in Two-Dimensional Contingency Tables

In this chapter we discuss the computation of maximal correlation for two-dimensional contingency tables and give a test of independence. We begin with the definition of maximal correlation and a literature review in Section 3.1. This section also contains a discussion about the cases for which the maximal correlation can be computed. In Section 3.2 we compute the maximal correlation for two-dimensional contingency tables. Section 3.3 contains some results on the algebraic form of maximal correlation for smaller tables. We introduce the maximal correlation test of independence in Section 3.4, and present a numerical illustration in Section 3.5. Section 3.6 introduces another independence test, which is related to the maximal correlation test. The last section includes an interesting case, where we compute the maximal correlation between two continuous variables.

3.1 Maximal Correlation

3.1.1 Definition

Consider two random variables X and Y defined on a given probability space. A measure of dependence δ(X,Y) of these variables should satisfy the following postulates.

A) δ(X,Y) is defined for any pair X, Y neither of which is constant with probability 1.
B) δ(X,Y) = δ(Y,X).
C) 0 ≤ δ(X,Y) ≤ 1.
D) δ(X,Y) = 0 iff X and Y are independent.
E) δ(X,Y) = 1 if either X = g(Y) or Y = f(X), where g(·) and f(·) are Borel-measurable functions.
F) If the Borel-measurable functions g(·) and f(·) map the real axis in a one-to-one way to itself, then δ(f(X), g(Y)) = δ(X,Y).
G) If the joint distribution of X and Y is normal, then δ(X,Y) = |ρ(X,Y)|, where ρ(X,Y) is the correlation coefficient of X and Y.

After listing these seven postulates, Rényi (1959) considers five measures of dependence, namely, the correlation coefficient, two correlation ratios, the mean square contingency, and the maximal correlation. Then he notes that only the maximal correlation satisfies all seven postulates.

For each of these five measures of dependence, Table 3.1 presents which of the seven postulates are satisfied. If a dependence measure satisfies a certain postulate, then the table has a check mark in the corresponding table cell; otherwise the cell is left blank. If a dependence measure does not satisfy a postulate but its modification does, then the modification is written in the corresponding table cell.

Table 3.1: Postulates vs. dependence measures. ρ: correlation, K_X(Y): correlation ratio of Y on X, K(X,Y): max(K_Y(X), K_X(Y)), S: maximal correlation, C: mean square contingency.

Definition The maximal correlation S between X and Y is defined as

S(X,Y) = \sup_{f,g} \rho(f(X), g(Y)),   (3.1)

where the supremum is taken over all Borel-measurable functions of X and Y with finite and positive variance. Here ρ(U, V ) denotes the product moment correlation coefficient between the random variables U and V .

3.1.2 Literature Review

The maximal correlation was introduced by Gebelein (1941) and received considerable attention in the statistical literature. Rényi (1959) gave the conditions for the existence of functions f_0 and g_0 such that S(X,Y) = ρ(f_0(X), g_0(Y)). We review Rényi's results in more detail in Section 3.1.3. Bell (1962) considered two normalizations of Shannon's mutual information as a measure of dependence and compared them with maximal correlation. Csáki and Fischer (1963) further studied the mathematical properties of maximal correlation and computed it for a number of examples. Abrahams and Thomas (1980) considered maximal correlation as a measure of dependence in stochastic processes.

Breiman and Friedman (1985) considered the regression problem, where often the response variable Y and the predictor variables X_1, ..., X_p are replaced by functions θ(Y) and φ_1(X_1), ..., φ_p(X_p), in order to simplify the model. Given the data, the authors discuss a procedure for estimating these functions which minimize the quantity E\{[θ(Y) − \sum_{j=1}^{p} φ_j(X_j)]^2\}/\mathrm{Var}[θ(Y)], based on an alternating conditional expectations algorithm. For the bivariate case, this corresponds to estimating the functions θ* and φ* which maximize the correlation between X and Y, thus the given procedure provides a method for estimating the maximal correlation between two variables. A multivariate analog of maximal correlation was considered by Koyak (1987). This work consists of transforming each of the variables, so that the largest partial sums of the eigenvalues of the resulting correlation matrix are maximized. For random variables that take only a finite number of values, maximal correlation is very closely related to the first canonical correlation. For this case Sethuraman (1990) gives a procedure to estimate the maximal correlation from the sample, and gives the asymptotic distribution of this estimate under the null hypothesis of independence. We discuss Sethuraman's results in more detail in Section 3.4.1. Gautam and Kimeldorf (1999) considered the calculation of maximal correlation in the case of 2 × k contingency tables. They also give the asymptotic distribution of the sample maximal correlation for 2 × k tables, under the null hypothesis of independence. Dembo, Kagan and Shepp (2001) showed that the maximal correlation between the partial sums of independent and identically distributed random variables with finite second moments equals the Pearson correlation coefficient between the sums, and so does not depend on the distribution of the random variables. This result was proved to be true when the variances of the variables are infinite by Novak (2004). For an arbitrary vector (X,Y) and an independent Z, Bryc, Dembo and Kagan (2005) studied the maximal correlation between X and Y + λZ.

3.1.3 Attainment

Maximal correlation is an attractive measure of dependence; however, since there does not

always exist functions f_0 and g_0 such that S(X,Y) = ρ(f_0(X), g_0(Y)), it cannot be evaluated explicitly except for special cases. If this equality holds for some f_0 and g_0, we say that the maximal correlation of X and Y can be attained. Rényi (1959) gives the conditions under which the maximal correlation can be attained. We present his results in this section. Let X and Y be two random variables on a given probability space. We are looking for functions f_0 and g_0 such that the maximal correlation between X and Y is attained. Let S = S(X,Y). Without loss of generality, we can limit our search to functions f and g such that f(X) and g(Y) have zero expectation and unit variance. Then by the Cauchy-Schwarz inequality we have

ρ(f(X), g(Y)) = E(f(X)g(Y)) = E[E(f(X)g(Y)|X)]
= E[f(X)E(g(Y)|X)] ≤ [E(f^2(X))\, E(E^2(g(Y)|X))]^{1/2}   (3.2)
= [\mathrm{Var}(E(g(Y)|X))]^{1/2},

and the equality holds if and only if f(X) = aE(g(Y)|X) for some constant a. Note that, in that case, \mathrm{Var}(f(X)) = a^2\mathrm{Var}[E(g(Y)|X)] = 1. Then we must have f(X) = E(g(Y)|X)/\sqrt{\mathrm{Var}[E(g(Y)|X)]}. Similarly,

ρ(f(X), g(Y)) = E[E(f(X)|Y)g(Y)] ≤ [\mathrm{Var}(E(f(X)|Y))]^{1/2},   (3.3)

and the equality holds if and only if g(Y) = E(f(X)|Y)/\sqrt{\mathrm{Var}[E(f(X)|Y)]}.

Now suppose that there exist functions f0 and g0 such that the maximal correlation is attained. Then by (3.2) and (3.3) respectively, we have

f_0(X) = \frac{E(g_0(Y)|X)}{S},   (3.4)

g_0(Y) = \frac{E(f_0(X)|Y)}{S}.   (3.5)

Thus f0 and g0 can be found as solutions of the system of equations (3.4) and (3.5). Plugging

g0(Y ) = E(f0(X)|Y )/S in (3.4), and f0(X) = E(g0(Y )|X)/S in (3.5), one can rearrange the system of equations as

E[E(f_0(X)|Y)|X] = S^2 f_0(X),   (3.6)

E[E(g_0(Y)|X)|Y] = S^2 g_0(Y).   (3.7)

The function f0 can be determined from (3.6). Once f0 is known, g0 can be obtained from

(3.5). Let L_X^2 denote the Hilbert space of all random variables of the form f(X) for which E(f(X)) = 0 and Var(f(X)) is finite. Similarly, let L_Y^2 denote the Hilbert space of all random variables of the form g(Y) for which E(g(Y)) = 0 and Var(g(Y)) is finite. For any f = f(X) ∈ L_X^2, let

Af = E[E(f(X)|Y)|X].   (3.8)

Then (3.6) can be written as

Af_0 = S^2 f_0.

Rényi (1959) continues his work by investigating the transformation A. He observes that A is a bounded self-adjoint transformation of L_X^2, and it is positive definite. We present some of the details below.

Define the inner product for f_1 = f_1(X) ∈ L_X^2 and f_2 = f_2(X) ∈ L_X^2 by

⟨f_1, f_2⟩ = E(f_1(X) f_2(X)),

and for f ∈ L_X^2, define ‖f‖ = ⟨f, f⟩^{1/2} = [\mathrm{Var}(f(X))]^{1/2}. Then we have

‖Af‖ = [\mathrm{Var}(E(E(f(X)|Y)|X))]^{1/2} ≤ [\mathrm{Var}(E(f(X)|Y))]^{1/2}
= \left[\mathrm{Var}(f(X)) \frac{\mathrm{Var}(E(f(X)|Y))}{\mathrm{Var}(f(X))}\right]^{1/2}
= [\mathrm{Var}(f(X))\, K_Y^2(f(X))]^{1/2} ≤ [\mathrm{Var}(f(X))]^{1/2} = ‖f‖.

Thus, Af is a bounded linear transformation of L_X^2. Note that the last inequality follows from the fact that K_Y(f(X)), the correlation ratio (see Section 2.2.1) of f(X) on Y, has range [0,1].

2 On the other hand, for f1, f2 ∈ LX we have

⟨Af_1, f_2⟩ = E(Af_1 · f_2) = E[E(E[f_1(X)|Y]|X)\, f_2(X)]
= E[E(f_2(X)E[f_1(X)|Y]\,|X)] = E[f_2(X)E(f_1(X)|Y)]
= E[E(f_2(X)E(f_1(X)|Y)\,|Y)] = E[E(f_1(X)|Y)E(f_2(X)|Y)].

Interchanging f1 and f2, we have

⟨Af_1, f_2⟩ = ⟨f_1, Af_2⟩,   (3.9)

thus A is a bounded self-adjoint transformation of L_X^2. By (3.9) we also have

⟨Af, f⟩ = E[E^2(f(X)|Y)] ≥ 0,

thus A is positive definite.

Now we are ready for the computation of S. For any f ∈ L_X^2 and g ∈ L_Y^2 with Var(f(X)) = Var(g(Y)) = 1, we have by the Cauchy-Schwarz inequality

ρ^2(f(X), g(Y)) = E^2(f(X)g(Y)) = E^2[E(f(X)g(Y)|Y)]
= E^2[g(Y)E(f(X)|Y)] ≤ E(g^2(Y))\, E[E^2(f(X)|Y)]
= E[E^2(f(X)|Y)] = ⟨Af, f⟩.

Then we have

S^2 ≤ λ,   (3.10)

where λ = \sup_f ⟨Af, f⟩, the supremum being taken over all f ∈ L_X^2 satisfying ‖f‖ = 1. Now, letting g(Y) = E(f(X)|Y) we have

\mathrm{Var}(g(Y)) = E(g(Y)E(f(X)|Y)) = E(f(X)g(Y)) ≤ S[\mathrm{Var}(g(Y))]^{1/2},

which implies that [\mathrm{Var}(g(Y))]^{1/2} ≤ S. Then we have

⟨Af, f⟩ = E[E(f(X)|Y)E(f(X)|Y)] = E[g(Y)E(f(X)|Y)]
= E(f(X)g(Y)) ≤ S[\mathrm{Var}(g(Y))]^{1/2} ≤ S^2,

thus,

λ ≤ S^2.   (3.11)

Combining (3.10) and (3.11) we have

S^2 = λ = \sup_f ⟨Af, f⟩,   (3.12)

where the supremum is taken over all f ∈ L_X^2 satisfying ‖f‖ = 1.

Definition An operator T on a Hilbert space H is said to be completely continuous if it satisfies the following property: if x_n ∈ H is a bounded sequence, then there exists a subsequence x_{n_k} such that T x_{n_k} converges to some element y ∈ H as k → ∞.

Since A is a bounded self-adjoint transformation, it is known that, if A is completely continuous, then λ is the greatest eigenvalue of A, and there exists an eigenfunction belonging to the eigenvalue λ. Rényi (1959) summarizes these observations in the following theorem.

Theorem 3.1 (Rényi) If the transformation A defined in (3.8) is completely continuous, then the maximal correlation between X and Y is attained for f_0(X) and g_0(Y), where f_0 is an eigenfunction belonging to the greatest eigenvalue S^2 = S^2(X,Y) of A and g_0(Y) = S^{-1}E(f_0(X)|Y).

The condition that A is completely continuous is generally hard to verify. Therefore Rényi (1959) gives another theorem:

Theorem 3.2 (Rényi) If the dependence between X and Y is regular (see Section 2.2.2) and the mean square contingency is finite, then the transformation A is completely continuous and thus the maximal correlation can be attained.

Proof See Rényi (1959).

3.2 Maximal Correlation in Case of Contingency Tables

We use R´enyi’s (1959) result to explicitly define the maximal correlation for two-dimensional

contingency tables. Let the categorical variables X and Y take values α1, ..., αI and β1, ..., βJ respectively. Without loss of generality, suppose I ≤ J. Consider the cross-classification defined in Section 2.1.1, where the cell probabilities of the joint distribution is given in matrix

{πij}. Assume that the matrix {πij} is positive, in other words there are no structural zeroes

0 0 in the contingency table. Let IX = (1α1 (X), ..., 1αI (X)) and IY = (1β1 (Y ), ..., 1βJ (Y )) , where 1 denotes the indicator function. Since X and Y can take only a finite number of outcomes, the functions f and g in their most general forms can be written as

f(X) = a'I_X,   (3.13)

g(Y) = b'I_Y,   (3.14)

where a = (a_1, ..., a_I)', b = (b_1, ..., b_J)', and a_i and b_j are arbitrary real numbers for i = 1, ..., I and j = 1, ..., J. Our task is to find f_0 and g_0 such that the maximal correlation between X and Y is attained. This is equivalent to finding vectors a and b such that the maximal

correlation is attained. We begin with writing the transformation (3.8) explicitly.

Let r_i = (π_{i1}, ..., π_{iJ})' for i = 1, ..., I, which is the transpose of the i-th row of {π_{ij}}. Similarly, let c_j = (π_{1j}, ..., π_{Ij})' for j = 1, ..., J, which is the j-th column of {π_{ij}}. Then we have

U_j := E(f(X)|Y = β_j) = \frac{1}{\pi_{·j}} c_j'a

for j = 1, ..., J. Let U = (U_1, ..., U_J)'. Then

V_i := E[E(f(X)|Y)|X = α_i] = \frac{1}{\pi_{i·}} r_i'U

for i = 1, ..., I. Let V = (V_1, ..., V_I)'. Note that V is the right hand side of equation (3.8).

Proposition 3.1 The vector V can be factored such that V = Aa, where A is an I × I matrix with the general term

A_{kl} = \sum_{r=1}^{J} \frac{\pi_{kr}\pi_{lr}}{\pi_{k·}\pi_{·r}},   (3.15)

where k, l = 1, ..., I.

Proof

V = \begin{pmatrix} \frac{1}{\pi_{1\cdot}} r_1'U \\ \vdots \\ \frac{1}{\pi_{I\cdot}} r_I'U \end{pmatrix}
  = \begin{pmatrix} \frac{1}{\pi_{1\cdot}} (\pi_{11}U_1 + \pi_{12}U_2 + \cdots + \pi_{1J}U_J) \\ \vdots \\ \frac{1}{\pi_{I\cdot}} (\pi_{I1}U_1 + \pi_{I2}U_2 + \cdots + \pi_{IJ}U_J) \end{pmatrix}
  = \begin{pmatrix} \frac{1}{\pi_{1\cdot}} \left( \frac{\pi_{11}}{\pi_{\cdot 1}} c_1'a + \frac{\pi_{12}}{\pi_{\cdot 2}} c_2'a + \cdots + \frac{\pi_{1J}}{\pi_{\cdot J}} c_J'a \right) \\ \vdots \\ \frac{1}{\pi_{I\cdot}} \left( \frac{\pi_{I1}}{\pi_{\cdot 1}} c_1'a + \frac{\pi_{I2}}{\pi_{\cdot 2}} c_2'a + \cdots + \frac{\pi_{IJ}}{\pi_{\cdot J}} c_J'a \right) \end{pmatrix}
  = \begin{pmatrix} \left( \sum_{r=1}^{J} \frac{\pi_{1r}\pi_{1r}}{\pi_{1\cdot}\pi_{\cdot r}}, \; \sum_{r=1}^{J} \frac{\pi_{1r}\pi_{2r}}{\pi_{1\cdot}\pi_{\cdot r}}, \; \ldots, \; \sum_{r=1}^{J} \frac{\pi_{1r}\pi_{Ir}}{\pi_{1\cdot}\pi_{\cdot r}} \right) a \\ \vdots \\ \left( \sum_{r=1}^{J} \frac{\pi_{Ir}\pi_{1r}}{\pi_{I\cdot}\pi_{\cdot r}}, \; \sum_{r=1}^{J} \frac{\pi_{Ir}\pi_{2r}}{\pi_{I\cdot}\pi_{\cdot r}}, \; \ldots, \; \sum_{r=1}^{J} \frac{\pi_{Ir}\pi_{Ir}}{\pi_{I\cdot}\pi_{\cdot r}} \right) a \end{pmatrix} = Aa.

Therefore, in the case of contingency tables, the transformation A in (3.8) is represented

by the matrix A. Now we are ready to compute the maximal correlation between X and Y .

We will also compute the vectors a and b such that the maximal correlation is attained. In Section 2.2.2 we introduced the notion of regular dependence between two variables, and we gave the definition of mean square contingency for cross classifications. In the case of contingency tables, the dependence between the response variables is always regular and the mean square contingency is finite. Therefore, by Theorem 3.2, the transformation A is completely continuous and we can compute the maximal correlation between X and Y by using Theorem 3.1. According to Theorem 3.1, the largest eigenvalue of transformation A is the square of the

maximal correlation between X and Y and the corresponding eigenvectors lead to f0 and g0 such that the maximal correlation is attained. However, Theorem 3.1 has the assumption that E(f(X)) = 0 and Var(f(X)) is finite. We impose these assumptions on our calculation as follows. A stochastic matrix is a square matrix whose rows consist of nonnegative real numbers that sum to one. For k = 1, ..., I, we have

\sum_{l=1}^{I} A_{kl} = \sum_{l=1}^{I}\sum_{r=1}^{J} \frac{\pi_{kr}\pi_{lr}}{\pi_{k·}\pi_{·r}} = \frac{1}{\pi_{k·}}\sum_{r=1}^{J} \frac{\pi_{kr}}{\pi_{·r}} \sum_{l=1}^{I} \pi_{lr} = \frac{1}{\pi_{k·}}\sum_{r=1}^{J} \pi_{kr} = 1,

thus A is a positive stochastic matrix. Then by the Perron-Frobenius theorem (see, for example, Aldrovandi, 2001, page 47) A has a single unit eigenvalue, which is larger than

the absolute value of any other eigenvalue. Let λ1 = 1 > |λ2| ≥ · · · ≥ |λI | ≥ 0 denote the

eigenvalues of A sorted in a decreasing fashion, and let e1, e2, ..., eI denote the corresponding

column eigenvectors. Then by the Perron-Frobenius theorem λ1 = 1. Moreover, since A is a

stochastic matrix, it is easy to see that e1 = 1I , where 1I denotes the I-dimensional column

vector, all of whose components are one. Then if we set a = e1 we have

E[f(X)] = E(a'I_X) = E(e_1'I_X) = (π_{1·}, ..., π_{I·})e_1 = 1,

therefore the assumption E[f(X)] = 0 is violated. Also note that, in this case Var(f(X)) = 0.

When we set a = ei for i = 2, ..., I, we have E(f(X)) = 0 and Var(f(X)) < ∞. So we discard the largest eigenvalue of A and conclude that the maximal correlation S between X and Y is the square root of the second largest eigenvector of A. Formally, we have

p S(X,Y ) = λ2. (3.16)

Now let us compute f0 and g0 such that the maximal correlation is attained. By Theorem 3.1, we have

0 f0(X) = e2IX , (3.17)

−1 i.e., we must set a = e2 in (3.13). Moreover, from Theorem 3.1, we have g0(Y ) = S E(f0(X)|Y ).

0 Let d = (d1, ..., dJ ) where

1 0 dj = E(f0(X)|Y = βj) = cje2 π·j

for j = 1, ..., J. Then we have

−1 0 g0(Y ) = S d IY , (3.18)

√ −1 −1 where S = 1/ λ2. In other words, we must set b = S d in equation (3.14). Let us 36 summarize our findings in the proposition below.

Proposition 3.2 Let the categorical variables X and Y take values α1, ..., αI and β1, ..., βJ respectively. Consider the cross-classification defined in Preliminaries, where the cell prob-

abilities of the joint distribution is given in the positive matrix {πij}. Then the population maximal correlation between X and Y is the square root of the second largest eigenvalue of

the matrix A defined in (3.15). The maximal correlation is attained when f0 and g0 are as defined in (3.17) and (3.18), respectively.

Here we may note that the eigenvalues of the matrix A are invariant with respect to

the row and column permutations of the matrix {πij}. We may also note that the maximal

correlation is the first canonical correlation between IX and IY .

Remark When we drop the assumption that the matrix of cell probabilities {πij} is positive, in other words, when we allow structural zeroes in the contingency table, the stochastic matrix A is not always positive. Therefore, in some special cases A may have more than

one unit eigenvalues. One such case is the case for which the cell πij is the only nonzero

probability in i-th row and j-th column of {πij}. In such cases the population maximal correlation is the square root of the largest non-unity eigenvalue of A.

3.2.1 Sample Maximal Correlation

The above results are population results. In practice, we want to estimate the maximal

correlation from an observed contingency table. Let {nij} denote an observed contingency table, as described in Section 2.1.1. Let Aˆ denote the estimator of A obtained by replacing ˆ πij’s in (3.15) by their maximum likelihood estimatorsπ ˆij = nij/n. Then A is an I × I matrix with general term XJ n n Aˆ = kr lr , (3.19) kl n n r=1 k· ·r

where k, l = 1, ..., I. The matrix Aˆ is well defined when the observed contingency table does

not have zero row or column sums. If a row or column sum equals zero for an observed 37 contingency table, we use the convention of replacing the zeros in the denominator of (3.19) by ε = 10−8, without changing the table dimensions for the analysis. The sample maximal ˆ correlation Sn between X and Y is the square root of the second largest eigenvalue of A. Formally, we have q ˆ Sn(X,Y ) = λ2, (3.20)

ˆ ˆ where λ2 is the second largest eigenvalue of A. The functions f0 and g0 can be computed

analogous to the above approach. Let en,2 denote the column eigenvector corresponding to ˆ 1 0 0 λ2. Let cn,j = n (n1j, ..., nIj) for j = 1, ..., J. Let dn = (dn,1, ..., dn,J ) , where

n 0 dn,j = cn,jen,2 n·j

for j = 1, ..., J. Then the functions f0 and g0, for which the sample maximal correlation is attained, are given by

0 f0(X) = en,2IX , (3.21)

−1 0 g0(Y ) = Sn dnIY , (3.22)

p −1 ˆ where Sn = 1/ λ2.

Remark When calculating the population maximal correlation, we assumed that the matrix

of cell probabilities {πij} is positive, therefore A is a positive stochastic matrix. Since one can observe empty cells in a contingency table, the stochastic matrix Aˆ is not always positive. Therefore, in some special cases, Aˆ may have more than one unit eigenvalues. One such case

is the case for which the cell nij is the only nonzero cell in i-th row and j-th column. In such cases, the sample maximal correlation is the square root of the largest non-unity eigenvalue of Aˆ. 38 3.3 Algebraic Form of Sample Maximal Correlation

Given a contingency table, one can compute the sample maximal correlation by comput- ing the matrix Aˆ and its second largest eigenvalue, as illustrated above. In practice, the eigenvalues and eigenvectors of large matrices are computed by numerical methods. In this section, we consider the symbolic computation of the eigenvalues of Aˆ and give an algebraic form of the sample maximal correlation for smaller contingency tables. We also make an in- teresting observation on the relation between the sample maximal correlation and the usual chi-squared test statistic for 2 × 2 tables. Consider an I × J contingency table for the categorical variables X and Y having I and

J levels respectively. Let nij denote the frequency counts of outcomes in the cell in row i and column j. Finding the sample maximal correlation of X and Y requires finding the eigenvalues of the I ×I matrix Aˆ, which requires the solution of the characteristic polynomial of degree I − 1, since we know that the matrix Aˆ will always have a unit eigenvalue. For 2 × 2, 3 × 3 and 4 × 4 tables, we computed the sample maximal correlation algebraically. We summarize our findings here and give some of the details in Appendix B. The following quantities are of interest.

XI XJ n2 θ = ij , (3.23) 1 n n i=1 j=1 i· ·j ¯ ¯ ¯ ¯2 ¯ ¯ ¯ nik nil ¯ ¯ ¯ ¯ ¯ X X ¯ njk njl ¯ θ2 = , (3.24) ni·nj·n·kn·l i

2 For 2 × 2 contingency tables, the squared sample maximal correlation Sn is given by

2 Sn = θ1 − 1. 39 For 3 × 3 contingency tables, we have

1 h p i S2 = θ − 1 + (θ + 1)2 − 4(θ + 1) . n 2 1 1 2

For 4×4 tables, we observe the quantities θ1, θ2 and an additional term θ3 which will not be presented here as a compact form is not available due to the high volume of expressions. Based on these observations we may claim that for an I × J contingency table the sample maximal correlation is

Sn = fI (θ1, ...θI−1),

where the evaluation of fI and θI−1 for I ≥ 3 may be considered as a future project.

Relation to Pearson Chi-Square Test Statistic The classical Pearson chi-square statistic for an I × J contingency table can be written as

2 X = n··(θ1 − 1), where θ1 is as given in (3.23). Thus, for 2 × 2 contingency tables we have the identity

2 2 X = nSn.

2 For tables with higher dimension, nSn has additional terms θ2, ..., θI−1 which introduce the deviation from the X2 statistic.

3.4 Maximal Correlation Test of Independence in Two-

Dimensional Contingency Tables

For the independence hypothesis (2.1) we construct a test based on the sample maximal correlation. For large sample sizes we use the results of Sethuraman (1990). For small

samples or tables with sparseness, we consider exact inferential methods. 40 3.4.1 Large Sample Case

Sethuraman (1990) considers the maximal correlation of variables that take only a finite number of values. He also considers the estimation of maximal correlation based on a sample. This work is closely related to contingency tables, but an explicit evaluation of

sample maximal correlation is not available. Sethuraman also gives the distribution of the sample maximal correlation under the null hypothesis of independence, which is directly applicable to the case of contingency tables. Here we present his distributional results, adapted to contingency tables. Consider categorical variables X and Y having I and J levels respectively. Consider the

cross-classification of X and Y which leads to a contingency table {nij}. Without loss of

generality, assume that I ≤ J. Let n denote the sample size and let Sn denote the sample maximal correlation between X and Y .

Theorem 3.3 (Sethuraman) Assume X and Y are independent. Then the limiting dis-

2 tribution of nSn as n → ∞ is the distribution of τ1, where τ1 is the maximum eigenvalue of

W which has a Wishart distribution W (II−1,J − 1). Here Ia denotes the identity matrix of size a.

Proof See Sethuraman (1990).

Note that for 2 × k contingency tables, the Wishart distribution becomes a chi-squared distribution with J −1 degrees of freedom, which is consistent with the results of Gautam and Kimeldorf (1999) on 2×k contingency tables. Recalling our observation on 2×2 contingency tables in Section 3.3, we may also note that for 2 × 2 tables the usual chi-squared test of independence and maximal correlation test of independence are literally the same. Thus for large samples, the size α maximal correlation test of independence has the following rule:

2 If nSn > C(α), reject H0, (3.25) 41 2 if nSn < C(α), cannot reject H0,

2 where C(α) is the 100(1 − α)% point of the limiting distribution of the test statistic nSn as described in Theorem 3.3. The critical points C(α) can be obtained from Table 51 of Pearson

and Hartley (1972, page 352) (set ν = I − 1 and p = J − 1 or vice versa), which gives the percentage points of the extreme eigenvalues of a Wishart matrix. When the dimensions or the significance level of interest cannot be found on this table, one can simulate the null

2 distribution of nSn and obtain the critical values. See Table A.1 for a table of critical values of the maximal correlation test statistic.

3.4.2 Small Sample Case

When working with contingency tables with a small number of observations or sparse data, we use exact inferential methods, which were introduced in Section 2.1.5. Recall that in exact tests, a measure of dependence is needed for the ordering of contingency tables. In our approach we employ maximal correlation as the ordering criterion. When complete enumer- ation is not feasible, we use the algorithm due to Patefield (1981) to simulate contingency tables with given marginals, and approximate the p-value of the exact test.

3.5 A Numerical Illustration

In this section we illustrate the computation of sample maximal correlation from an observed contingency table. We also carry out a maximal correlation test of independence. Consider

the following table taken from Snee (1974), which presents the distribution of hair color and eye color of 264 males. Let us compute the maximal correlation between the hair color and eye color based on 42 Eye Color Hair Color Brown Blue Hazel Green Black 32 11 10 3 Brown 38 50 25 15 Red 10 10 7 7 Blond 3 30 5 8

Table 3.2: Hair color and eye color for 264 males.

this sample. We begin with computing matrix Aˆ by using (3.19). We have

   0.394022 0.313580 0.187000 0.105397       0.257698 0.437607 0.168808 0.135889  Aˆ =   .    0.330235 0.362757 0.184119 0.122867    0.260590 0.415905 0.175035 0.143970

The eigenvalues and the corresponding eigenvectors of Aˆ are

0 e1 = 1, v1 = (0.5, 0.5, 0.5, 0.5) ,

0 e2 = 0.14399, v2 = (0.7168, −0.5261, 0.1644, −0.4268) ,

0 e3 = 0.01295, v3 = (−0.1434, −0.3335, 0.3697, 0.8552) ,

0 e4 = 0.00276, v4 = (0.2297, −0.0088, −0.7877, 0.5714) .

We know that the square of the sample maximal correlation is the second largest eigenvalue √ ˆ of A. Then we have Sn = 0.14399 = 0.3794. Suppose we want to test the null hypothesis that hair color is independent of eye color.

2 To carry out the independence test, the test statistic we need is nSn = 38.015. From Table A.1 the critical values for this 4 × 4 case are 11.229, 13.137 and 17.179 for significance levels 0.1, 0.05 and 0.01 respectively. The null hypothesis of independence is rejected at all three significance levels. 43 3.6 A Related Test of Independence: Correlation Ra-

tio Test

In this section we consider a dependence measure which is a modification of maximal cor- relation. This measure can be used for constructing an independence test for contingency

tables having ordered categories with known scores. For two random variables X and Y on a given probability space, consider the following quantity

S∗(X,Y ) = max[sup ρ(f(X),Y ), sup ρ(X, g(Y ))], (3.26) f g

where the supremum is taken over all Borel-measurable functions of X and Y with finite and positive variance. We first present a convenient computational formula for S∗. Then we consider calculating S∗ in the case of two-dimensional contingency tables. We begin with showing that one can always find functions f and g such that the maximum in (3.26) is attained. Without loss of generality, consider only functions f such that f(X) has zero expectation and unit variance, and suppose that E(Y ) = 0 and Var(Y ) = 1. By the Cauchy-Schwarz inequality

ρ(f(X),Y ) = E(f(X)Y ) = E(f(X)E(Y |X))

1/2 ≤ [Var(E(Y |X))] = KX (Y ), (3.27)

where KX (Y ) is the correlation ratio of Y on X, as defined in (2.14). With similar arguments as the ones used in establishing inequality (3.2), the equality in (3.27) holds if and only if

1/2 f0(X) = E(Y |X)/[Var(E(Y |X))] . Then we have

sup ρ(f(X),Y ) = KX (Y ). (3.28) f 44 Similarly, we have

sup ρ(X, g(Y )) = KY (X), (3.29) g

and thus

∗ S (X,Y ) = max(KX (Y ),KY (X)). (3.30)

Recall from Section 2.2.1 that this quantity is the same as K(X,Y ), the dependence measure given in (2.15), which was first studied by R´enyi (1959). From R´enyi’s work, we know that S∗ satisfies the postulates B, C, E and G among the seven postulates given in Section 3.1.1. We now compute S∗ for two-dimensional contingency tables. Let the categorical variables

X and Y take values α1, ..., αI and β1, ..., βJ respectively. In this section we suppose that X and Y are ordered categories with known values. Consider the cross-classification defined in

section 2.1.1, where the cell probabilities of the joint distribution are given in matrix {πij}.

0 We begin with computing KX (Y ). Let ri = (πi1, ..., πiJ ) for i = 1, ..., I, and let cj =

0 0 0 (π1j, ..., πIj) for j = 1, ..., J. Let a = (α1, ..., αI ) and b = (β1, ..., βJ ) . Let pr = (π1·, ..., πI·)

and pc = (π·1, ..., π·J ). Let

1 0 ui = E(Y |X = αi) = b ri, (3.31) πi·

for i = 1, ..., I. Then

0 E(Y |X) = u IX , (3.32)

0 0 2 where u = (u1, ...uI ) and IX = (1α1 (X), ..., 1αI (X)) . For a given vector k, let k denote the element by element multiplication of k by itself. Then we have

0 2 2 0 0 0 2 2 Var(E(Y |X)) = E[(u IX ) ] − E (u IX ) = pru − (pru ) . (3.33)

Moreover,

2 2 0 2 0 2 Var(Y ) = E(Y ) − (E(Y )) = pcb − (pcb) . (3.34) 45 Finally, we have

· ¸ 1 · ¸ 1 2 0 2 0 2 2 Var(E(Y |X)) pru − (pru) KX (Y ) = = 0 2 0 2 . (3.35) Var(Y ) pcb − (pcb)

1 0 Now let us compute KY (X) similarly. Let vj = E(X|Y = βj) = a cj for j = 1, ..., J, π·j 0 and let v = (v1, ..., vJ ) . Then with similar calculations we have

· ¸ 1 · ¸ 1 2 0 2 0 2 2 Var(E(X|Y )) pcv − (pcv) KY (X) = = 0 2 0 2 . (3.36) Var(X) pra − (pra)

Thus (· ¸ 1 · ¸ 1 ) 0 2 0 2 2 0 2 0 2 2 ∗ pru − (pru) pcv − (pcv) S (X,Y ) = max 0 2 0 2 , 0 2 0 2 . (3.37) pcb − (pcb) pra − (pra)

For an observed contingency table {nij} with n observations, the sample equivalent of

∗ (3.37), Sn(X,Y ), can be obtained by replacing the quantities pc, pr, u, and v by their natural

estimators which can be obtained by replacing all πij values by their maximum likelihood

∗ estimates, nij/n. We do not have large sample distributional results for Sn and this can

∗ be considered as a future project. However, exact tests based on Sn can be constructed. In Section 4.4, we present some empirical results for the performance of the exact test, for

∗ ∗ which Sn is used as the ordering criterion. Since S is related to correlation ratio, we will refer to this test as the correlation ratio test.

3.7 An Example: Lissajous Curve Case

In this section we evaluate the maximal correlation for a case where the two continuous

random variables are uncorrelated but dependent. We first introduce the example and show that the correlation between the variables is zero. We then calculate the maximal correlation between these two variables, and observe that the maximal correlation is one.

Let the random variable W have uniform distribution over the interval [0, 2π]. Let X =

sin aW and Y = sin bW where a and b are integers and a 6= b. The variables X and Y are 46 clearly dependent. The plot of the relationship between the variables is a special case of the well-known Lissajous curve, therefore we will refer to this example as the Lissajous curve case. See Figure 3.1 for some illustrations of these curves for several a and b values. Figure

A.1 presents more illustrations.

a=1, b=2 a=2, b=3 1 1

0.5 0.5

Y 0 Y 0

−0.5 −0.5

−1 −1 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 X X

a=1, b=4 a=5, b=6 1 1

0.5 0.5

Y 0 Y 0

−0.5 −0.5

−1 −1 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 X X

Figure 3.1: Plots of the Lissajous curve for several a and b.

Proposition 3.3 The variables X and Y are uncorrelated.

R 2π −1 Proof Note that E(X) = 0 sin(aw)(2π) dw = 0. Similarly, E(Y ) = 0. Then we have

ρ(X,Y ) = 1 E[(X − E(X))(Y − E(Y ))] σX σY = 1 E(XY ) = 1 E(sin(aW ) sin(bW )) σX σY σX σY R = 1 2π sin(aw) sin(bw)(2π)−1dw = 0. σX σY 0 47 In order to compute the maximal correlation between X and Y , we use the Chebyshev polynomials of the first kind, which are defined by

Tn(x) = cos(n arccos x), (3.38) where x ∈ [−1, 1], and n is the degree of the polynomials. Replacing x by cos θ, where θ ∈ R, we have

Tn(cos θ) = cos(nθ). (3.39)

In order to see that cos(nθ) is a polynomial of degree n in cos(θ), one can use de Moivre’s formula, given by (cos θ + i sin θ)n = cos(nθ) + i sin(nθ). Here, cos(nθ) is the real part of the right hand side of the equation, and the real part of the left hand side of the equation is a polynomial in cos θ and sin θ, where all powers of sin θ are even and thus can be replaced by cos θ using the relation sin2 θ + cos2 θ = 1.

Proposition 3.4 The maximal correlation between X and Y is 1.

Proof Let Z = cos(2abW ). Then there exists a Chebyshev polynomial Ta such that

2 2 Z = cos(2abW ) = Ta(cos(2bW )) = Ta(1 − 2 sin (bW )) = Ta(1 − 2Y ). (3.40)

Similarly, there exists a Chebyshev polynomial Tb such that

2 2 Z = cos(2abW ) = Tb(cos(2aW )) = Tb(1 − 2 sin (aW )) = Tb(1 − 2X ). (3.41)

Thus Z is a function of both X and Y . The correlation of Z with itself is 1, thus, the maximal correlation between X and Y is 1.

This example will be revisited three times. In Sections 4.3.3 (example 3) and 4.4, we generate continuous random variables using this example, with some noise added in both coordinates. Then we collapse the observations into contingency tables and carry out tests 48 of independence based only on the contingency tables. Empirical results are presented in this section, which compare the power of several independence tests. We will also consider this example in Section 4.6. Having observed that both correlation coefficient and maximal correlation take extreme values in this example, one may wish to use other quantities to measure the dependence between X and Y . In Section 4.6, we carry out an empirical exploratory study for approximately calculating two other dependence measures and investigating their behaviour with respect to the changes in the constants a and b. The dependence measures we consider are the correlation ratio discussed in Section 2.2.1 (page 14), and the I-coefficient introduced by Bakirov, Rizzo and Sz´ekely (2006). This section also includes the formulas for the I-coefficient and its sample counterpart, the statistic In. 49

CHAPTER 4

Empirical Results

In this chapter we present some empirical results for the maximal correlation test of indepen- dence for two-dimensional contingency tables. We begin with a study on the limiting null

2 distribution of the test statistic nSn in Section 4.1. Then we investigate the large sample behavior of Sn in Section 4.2. The main purpose of this chapter is to compare the power performance of the maximal correlation test of independence with Pearson chi-squared and likelihood ratio tests of independence. In Section 4.3 we present a comparison of these tests, which is evaluated by a Monte Carlo power study. Section 4.4 contains empirical power com- putations for the independence test discussed in Section 3.6, as well as an independence test recently introduced by Bakirov, Rizzo and Sz´ekely (2006). Section 4.5 summarizes the results of empirical power comparisons. The last section includes an empirical exploratory study for the behavior of two dependence measures for the Lissajous curve case. The computations are carried out in R 2.4.0.

2 4.1 The Null Distribution of nSn

2 This section includes our method for obtaining the critical values of the test statistic nSn. We also present an empirical study in order to verify the asymptotic null distribution of the test statistic. 50 In Section 3.4.1 we noted that for an I × J contingency table, under the null hypothesis

2 of independence, the limiting distribution of nSn is the distribution of τ1, where τ1 is the maximum eigenvalue of W , which has Wishart distribution W (II−1,J − 1). Since a compact form for the distribution of τ1 is not available, we estimate it by simulation and obtain the critical values of the test. For the limiting distribution of τ1 for large I and J, see Johnstone (2001).

For several choices of I and J, we generated Wishart matrices with II−1 and degrees of freedom J −1 by using the rwishart function available in the bayesm package for R. The empirical 90%, 95% and 99% points of the maximum eigenvalues of the 100,000 generated Wishart matrices are given in Table A.1. These are the critical values we use for constructing maximal correlation test of independence. As noted earlier, the percentile points for the maximum eigenvalues of Wishart matrices can also be found in Table 51 of Pearson and Hartley (1972, page 352) by setting ν = I − 1 and p = J − 1. We observed that our simulations are consistent with this table.

2 In order to verify the limiting distribution of nSn under independence, we carried out several empirical studies and we present one example here. For 3 × 3 contingency tables, consider the independence loglinear model

X Y log mij = µ + λi + λj , i, j = 1, 2, 3, (4.1)

X X X Y Y Y where λ1 = 0.2, λ2 = −0.4, λ3 = 0.2, λ1 = 0.1, λ2 = −0.3, and λ3 = 0.2. In this case

2 we generated 10,000 contingency tables of size 150, and computed the test statistic nSn. The empirical distribution of the test statistic is presented in Figure 4.1, which also includes the empirical distribution of the maximum eigenvalues of 10,000 simulated Wishart matrices

W (I2, 2). This figure presents the density estimates obtained by the density function in

2 the stats package for R. A graphical analysis shows that the empirical distribution of nSn under the above independence model shows perfect agreement with the theoretical limiting 51 null distribution. We also carried out a Kolmogorov-Smirnov test for equal distributions by using ks.test function in the stats package for R. The p-value of this test turned out to be 0.32, which suggests that there is no significant difference between two distributions.

M W density 0.00 0.05 0.10 0.15 0.20

0 5 10 15 20

x

Figure 4.1: Empirical distribution of maximal correlation test statistic (M) and empirical distribution of the maximum eigenvalue of Wishart matrix with scale parameter I2 and degrees of freedom 2 (W), based on model (4.1).

4.2 Large Sample Behavior of Sn

We do not have analytical results for the convergence of the sample maximal correlation

Sn to the population maximal correlation S. However, we observed empirically that Sn approaches S at a reasonable rate. To illustrate, we give the following example.

X Consider the saturated loglinear model (2.5) for a 3 × 3 contingency table, where λi =

Y XY XY XY XY XY λi = 0 for i = 1, 2, 3 and λ11 = 0.4, λ12 = −0.2, λ13 = −0.2, λ21 = 0.8, λ22 = −0.4,

XY XY XY XY λ23 = −0.4, λ31 = −1.2, λ32 = 0.6 and λ33 = 0.6. Given the loglinear parameters, one can calculate the cell probabilities, and then calculate S. In this example, S = 0.492. For 52 several sample sizes, we estimated the mean square error of Sn based on 10,000 simulated tables. Figure 4.2 presents the behavior of the empirical mean squared error (MSE) of Sn as the sample size increases. See Table A.2 for the empirical bias, variance and MSE of Sn for sample sizes 30 to 300. Empirical MSE 0.005 0.010 0.015 0.020

50 100 150 200 250 300

n

Figure 4.2: Empirical mean squared error of Sn.

4.3 Power Comparisons

In this section we compare the empirical power of the maximal correlation test of inde- pendence with two well-known tests, namely, the Pearson chi-square test of independence and the likelihood ratio test of independence. The empirical against a fixed alternative at a fixed significance level is obtained by simulating random samples from the al- ternative distribution and finding the proportion of the samples for which the null hypothesis is rejected at that significance level. 53 4.3.1 Simulation Design

Under a given dependence structure, we determine the empirical power of each independence test by simulating contingency tables under the dependence structure, and computing the proportion of times the independence hypothesis is rejected at a given significance level α.

Recall from Section 3.4 that, for large sample sizes we use the limiting null distribution

2 of nSn, and for small sample sizes, or for tables with sparseness, we use exact inferential methods. The empirical power comparisons for these two cases are carried out as follows.

Large Sample Case

1. Generate 10,000 I × J contingency tables with sample size n based on a specified dependence structure.

2 2. For each sample, compute the maximal correlation test statistic nSn using (3.20).

2 Reject the independence hypothesis at significance level α if nSn exceeds the critical value at level α. For several I and J values, the critical values are given in Table A.1 for 10%, 5% and 1% significance levels.

3. For each sample, compute the Pearson chi-square test statistic X2 using (2.2). Reject the independence hypothesis at significance level α if X2 exceeds the 100(1 − α)% percentile of the chi-squared distribution with (I − 1)(J − 1) degrees of freedom.

4. For each sample, compute the likelihood ratio test statistic G2 using (2.3). Reject the

independence hypothesis at significance level α if X2 exceeds the 100(1−α)% percentile of the chi-squared distribution with (I − 1)(J − 1) degrees of freedom.

The critical values of chi-squared distribution can be found in most of the standard statistics texts. To carry out Pearson chi-squared and likelihood ratio tests of independence, we used the loglin function available in the stats package for R. 54 Small Sample Case

1. Generate 10,000 I × J contingency tables with sample size n based on a specified dependence structure.

2. For each table, calculate the row and column sums, and find all possible contingency

tables which have the same row and column sums.

3. For each table, order all the tables obtained in Step 2 according to sample maximal

correlation Sn. Compute the p-value of the exact test based on Sn as described in Section 2.1.5. Reject the independence hypothesis at significance level α if the p-value is less than α.

4. For each table, order all the tables obtained in Step 2 according to Pearson chi-squared test statistic X2. Compute the p-value of the exact test based on X2 as described in Section 2.1.5. Reject the independence hypothesis at significance level α if the p-value is less than α.

5. For each table, order all the tables obtained in Step 2 according to the likelihood ratio test statistic G2. Compute the p-value of the exact test based on G2 as described in Section 2.1.5. Reject the independence hypothesis at significance level α if the p-value is less than α.

When the complete enumeration of the tables is not feasible, we use the r2dtable function in the stats package for R. This function uses the Patefield’s (1981) algorithm to simulate contingency tables with given marginals. Then we order these simulated tables and find the approximate p-value of the exact test as in steps 3, 4 and 5. Unless otherwise is indicated, we approximate the exact p-value based on 300 replicates. 55 4.3.2 Empirical Significance

We use the independence model (4.1) to compare the empirical significance level of maximal correlation test of independence with the nominal significance level. We generated 10,000 contingency tables using this model and carried out all three tests of independence. Table

A.3 presents the rejection percentages at nominal significance levels 10%, 5% and 1%. For all α levels considered, the empirical significance levels of maximal correlation test are consistent with the nominal significance levels, as the nominal levels are within the corresponding 95% confidence intervals.

4.3.3 Results of Power Comparisons

In this section we present the results of the empirical power comparisons for the three tests of independence. We consider five examples with different dependence structures. In the first

example, we consider a loglinear model and generate contingency tables by using (2.6), where we control the dependence between the variables by the interaction parameter λXY . In the following examples we assume that the categorical variables have an underlying continuous distribution. The empirical power curves presented in this section are obtained by smoothing the

scatter plots of sample sizes versus empirical powers. The smoothing is obtained by the loess

function in the stats package for R. For each empirical power curve, the actual sample sizes and the corresponding empirical powers are given in the indicated tables. Throughout this section and the tables in the Appendix, P stands for Pearson chi-square test of independence, L stands for likelihood ratio test of independence, and M stands for maximal correlation test of independence.

Example 1

The first example we consider is an example of saturated loglinear models given in (2.5).

Consider the 3×3 loglinear model with parameters as described in Section 4.2. We computed 56 the empirical powers of maximal correlation, Pearson chi-square, and likelihood ratio tests of independence for this case. The empirical power plots of three tests at significance level 0.05

are presented in Figure 4.3. See Table A.4 for the actual rejection percentages at significance levels 0.1, 0.05 and 0.01. For sample sizes 20 and 30, we carried out exact tests. For sample sizes 40 to 90, we used large sample results for the tests. The simulation results show that all three tests have good power and they are very similar. Likelihood ratio test is slightly more powerful. We carried out similar simulation studies for several other loglinear parameter settings and observed that maximal correlation test was comparable to the other two tests most of the time. However, we observed some cases for which likelihood ratio test was considerably more powerful. Power

P L M 40 50 60 70 80 90 100

20 30 40 50 60 70 80 90

n

Figure 4.3: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 1. Significance level α = 0.05.

Example 2

In this example we consider the continuous variables X and Y which are centered at points uniformly distributed along the unit circle, with N(0, σ2) noise in both coordinates. Let 57 2 2 W ∼ U[0, 2π], e1 ∼ N(0, σ ) and e2 ∼ N(0, σ ). Let X = cos(W ) + e1 and Y = sin(W ) + e2. Then the distribution of (X,Y ) pairs form a circular cloud of points as described above. For several cases, we generated X and Y , collapsed them into R × R contingency tables, and carried out tests of independence based on the contingency tables. We report the empirical power comparison for the following case. Let σ = 0.4. The variables X and Y are generated as described above and they are collapsed into 6 × 6 (R = 6) contingency tables, with limits [−1.3, 1.3] on both coordinates. Any sample points outside the table range are assigned to the nearest table cell. Indepen- dence tests are carried out based on these contingency tables. The empirical power plots of three tests at significance level 0.05 are presented in Figure 4.4, and the actual rejection

percentages at significance levels 0.1, 0.05 and 0.01 are given in Table A.5. The empirical

study shows that the likelihood ratio test is more powerful for sample sizes 70 to 150. For larger samples, the maximal correlation test is as powerful as the likelihood ratio test. Power

P L M 40 50 60 70 80 90 100

100 150 200

n

Figure 4.4: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 2. Significance level α = 0.05. 58 Example 3

In this example we revisit the Lissajous curve case discussed in Section 3.7. Let the random variable W have uniform distribution over the interval [0, 2π]. Let X = sin aW and Y = sin bW where a and b are integers and a 6= b. The variables X and Y are clearly dependent.

Recall from Section 3.7 that X and Y are uncorrelated, and their maximal correlation is one. For several cases we generated X and Y by transforming from the generated W and adding noise N(0, σ2)) on both coordinates. We then collapsed the observations into R × R contingency tables and carried out tests of independence based on the contingency tables. We refer to R as the resolution of contingency tables. Note that after adding noise and discretization, the population maximal correlation S between X and Y is no longer 1. For a given a, b, σ and R, it is possible to calculate S numerically.

Case 1

For a = 1 and b = 2, we consider the above relationship between X and Y on a Lissajous curve. Let σ = 0.03 and R = 5. For significance level α = 0.05, the empirical power plots of three tests are presented in Figure 4.5. The actual rejection percentages for significance

levels 0.1, 0.05 and 0.01 are given in Table A.6. In this example, exact tests are used where the p-value of the tests are approximated based on 300 simulated tables. We observe that maximal correlation test is more powerful in this example. Case 2 We consider the same example for a = 5 and b = 6. Here σ = 0.03 and R = 14. The

empirical power plots of three tests at significance level α = 0.05 are presented in Figure 4.6. See Table A.7 for the actual rejection percentages for significance levels 0.1, 0.05 and 0.01. The maximal correlation test is more powerful in this example, and the difference is larger

compared to Case 1. 59 Power

P L M 60 70 80 90 100

50 60 70 80 90 100 110

n

Figure 4.5: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 3, Case 1. Significance level α = 0.05.

Example 4

Motivated by Example 3, we investigated other cases for which the underlying continuous dis- tributions are uncorrelated but dependent. The following theorem is helpful for constructing

such variables.

Theorem 4.1 (Behboodian) Let g(x) be an odd and h(x) be an even real-valued function.

If X is a symmetric random variable, then the random variables Y = g(X) and Z = h(X) are uncorrelated, provided that Y and Z are non-degenerate and all the first and second moments of Y and Z exist.

Proof See Behboodian (1978).

Let U ∼ N(0, 1) and V = |U|. By Theorem 4.1, the dependent variables U and V are uncorrelated. In this example we consider variable U and V with some noise added. As in the previous examples, we collapse the observarions into R × R contingency tables and 60

P L M Power 75 80 85 90 95 100

200 220 240 260 280 300

n

Figure 4.6: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 3, Case 2. Significance level α = 0.05. perform independence tests based on the contingency tables. We performed empirical power comparisons for several cases and we report the following case.

2 2 Let U ∼ N(0, 1), e1 ∼ N(0, σ ) and e2 ∼ N(0, σ ), where σ = 0.8. Let X = U + e1 and

Y = |U| + e2. The variables X and Y are generated, and then they are collapsed into 6 × 6 contingency tables with limits [-2.5, 2.5] for X values, and limits [-1, 3] for Y values. Any sample point outside the range is assigned to the nearest cell. The empirical power plots of all three tests at significance level 0.05 is presented in Figure 4.7. For actual rejection percentages at significance levels 0.1, 0.05 and 0.01, see Table A.8. Empirical study shows that, likelihood ratio test is more powerful for sample sizes up to 180. For larger sample sizes, maximal correlation test is more powerful.

Example 5

In this example we consider another case for which the underlying continuous variables are dependent but not correlated. Let U ∼ U(−1, 1) and V = U 2. By Theorem 4.1, the depen- 61 Power

P L M 20 40 60 80 100

50 100 150 200 250 300 350 400

n

Figure 4.7: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 4. Significance level α = 0.05. dent variables U and V are uncorrelated. Consider the variables U and V with some noise added. We collapse the observations into R × R contingency tables and perform indepen- dence tests based on the contingency tables. We performed empirical power comparisons for several cases and we report the following case.

2 2 2 Let U ∼ U(−1, 1), e1 ∼ N(0, σ ) and e2 ∼ N(0, σ ), where σ = 0.3. Let X = U + e1

2 and Y = U + e2. The variables X and Y are generated, and then they are collapsed into 4×4 contingency tables with limits [-1.2, 1.2] for X values, and limits [-0.3, 1.2] for Y values.

The sample points outside the range are assigned to the nearest cell. Figure 4.8 presents the empirical power plots of all three tests at significance level 0.05. The actual rejection

percentages at significance levels 0.1, 0.05 and 0.01 are presented in Table A.9. Similar to

Example 4, empirical study shows that the likelihood ratio test is more powerful for sample sizes up to 250. For larger sample sizes, maximal correlation test is slightly more powerful. 62 Power

P L M 30 40 50 60 70 80 90 100

100 150 200 250 300 350 400 450

n

Figure 4.8: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 5. Significance level α = 0.05.

4.4 Empirical Powers of Correlation Ratio Test and I-

Test

The purpose of this section is to gain some insight on the power performances of two tests of independence, namely, the correlation ratio test discussed in Section 3.6, and the I-test introduced by Bakirov, Rizzo and Szekely (2006). The critical values for both tests are unavailable, so we construct exact tests based on these statistics. We compare the empirical power performance of these tests with the exact tests based on Pearson chi-squared, likelihood ratio, and maximal correlation statistics. The simulation design we use here is the same as the design for small sample cases described in Section 4.3.1, we only add two more ordering criteria here. Bakirov, Rizzo and Sz´ekely (2006) proposed a multivariate nonparametric test of inde-

pendence, which is based on a measure of association determined by the interpoint distances

between the sample points. This test is based on a population independence coefficient I, 63 which takes values between 0 and 1, and equals zero if and only if the variables are indepen- dent. Although their method is mainly developed for continuous variables, they provide the

following statistic for the case of two-dimensional contingency tables. Let {nij} denote the cell frequencies of an I × J contingency table. Then the test statistic is given by

PI PJ 2 2 i=1 j=1(nnij − ni·n·j) In = PI PJ . (4.2) i=1 j=1 ni·n·j(n − ni·)(n − n·j)

See Section 4.6 for the formulas of I and In for continuous random variables.

2 The statistic In can be used as the ordering criterion in the exact test algorithm described

∗ in Section 2.1.5. Similarly, the sample correlation ratio statistic Sn described in Section 3.5 can be used as the ordering criterion, and exact tests can be constructed based on both statistics. We consider the following two examples to see the power performance of these two tests.

A Loglinear Case

We consider a 3 × 3 saturated loglinear model. Exact tests are carried out based on five

2 2 ∗ statistics: X , G , Sn, In, and Sn. The empirical powers of these tests are calculated based on 1000 simulated contingency tables generated by using the loglinear model parameters given in Section 4.2. Note that this is also the case we consideed in Example 1 of Section 4.3.3. For each table generated, the exact p-values of the tests are approximated based on

300 simulated tables simulated by using Patefield’s (1981) algorithm. For significance levels

0.05 and 0.01, the rejection percentages are presented in Table A.10. In this table I stands for I-test, C stands for the correlation ratio test. We observed that the powers of all tests are close to each other, but the power of the correlation ratio test seems to have slightly smaller power.

∗ Note that the correlation ratio test statistic Sn can only be constructed for tables having

∗ ordered categories with known scores. In this example we calculated Sn assuming that 64 both row and column variables take values 1, 2 and 3. When we tried other values for the categories, we did not observe significant changes in the power of the test. We will further

∗ investigate the invariance of Sn test to the changes in the category values.

A Lissajous Curve Case

Consider the Lissajous curve case discussed in Example 3, Case 1 in Section 4.3.3. Exact

2 2 ∗ tests are carried out based on X , G , Sn, In, and Sn. The empirical powers of the tests are calculated based on 2000 simulated contingency tables. The exact p-values of the tests are approximated based on 300 simulated tables. For significance levels 0.05 and 0.01, the rejection percentages are presented in Table A.11.

We observed that the power of the exact test based on In is significantly smaller than the

2 2 exact tests based on X , G and Sn. A surprising observation was to see that the power of

∗ the exact test based on Sn turned out to be extremely low. For all sample sizes, the rejection percentage was close to the size of the test. We do not have a clear explanation for the poor

∗ performance of Sn for this case. However, we shall note that this statistic depends on the dependence measure S∗ which only satisfies the postulates B, C, E and G among the seven postulates given in Section 3.1.1.

4.5 Summary of Power Comparisons

In Section 4.3.3 we compared the empirical powers of Pearson chi-squared, likelihood ratio, and maximal correlation tests of independence. We considered several examples, and none of the tests turned out to be best against every type of dependence structure. In all examples Pearson chi-square test had the poorest power with a few exceptions in small sample sizes. When we considered loglinear models and introduced dependence to the simulated con- tingency tables by the loglinear interaction term, we observed that all three tests usually had similar power. However, there were cases such that likelihood ratio test was more powerful 65 than the other two. We then considered cases for which the categorical variables have an underlying continuous distribution. When the underlying continuous variables are uncor- related but dependent, we observed two cases (example 3) where the maximal correlation test is more powerful than the other two tests. The last two cases we considered showed an interesting pattern for the comparison of maximal correlation test and likelihood ratio test. In both cases the power of maximal correlation test was slightly lower than the likelihood ratio test for up to 80% power, then it was slightly higher. In Section 4.4 we carried out empirical power studies for the correlation ratio test and I-test. The large sample distributional results are unavailable for both tests, so we used exact methods. We considered two examples used in Section 4.3.3. For the first example, the loglinear case, both tests had good power. The I-test was as powerful as the three tests considered previously, correlation ratio test had slightly lower power. When we considered a case for which the underlying continuous variables are uncorrelated but dependent, the power of I-test dropped significantly. Moreover, we observed that the correlation ratio test was completely insensitive to this type of a dependence, and the power of this test was as low as the size of the test.

4.6 An Exploratory Study for the Lissajous Curve Case

We introduced the Lissajous curve case in Section 3.7, and discussed two measures of de- pendence for this case, namely, product moment correlation and maximal correlation. The purpose of this section is to carry out an exploratory study in order to understand the be- havior of two more measures of dependence for this case. Both dependence measures are approximately calculated using simulations. Let the random variable W have uniform distribution over the interval [0, 2π]. Let X = sin aW and Y = sin bW where a and b are integers and a 6= b. In Section 3.7 we showed that the correlation between X and Y is zero, and the maximal correlation between them is 66 one. Since both dependence measures take extreme values, it is of interest to compute other quantities to measure the dependence between X and Y . The first dependence measure we consider is the correlation ratio discussed in Section

2.2.1. For given a and b values, we generate 100,000 pairs of X and Y , which take values in the interval [-1,1]. We then divide this interval into 200 grids and treat X and Y as discrete

variables. We compute the correlation ratios KX (Y ) and KY (X) numerically by using (3.35) and (3.36), respectively. This corresponds to investigating the continuous variables in grids of size 0.01, which leads to our approximation to the correlation ratios between X and Y . The second dependence measure we consider is the I-coefficient introduced by Bakirov,

Rizzo and Sz´ekely (2006). For completeness, we present here the formulas for the I-coefficient and its sample counterpart, the statistic In.

p+q Consider a general population Z = (X,Y ) ∈ R . Let f(t, s), f1(t) and f2(s) denote the characteristic functions of random variables Z, X and Y , respectively. For complex

p q functions α defined on R × R , let kα(t, s)k denote the k · k-norm in the weighted L2-space of functions on Rp+q. Then the independence coefficient I is given by

kf(t, s) − f1(t)f2(s)k I = °p °. (4.3) ° 2 2 ° ° (1 − |f1(t)| )(1 − |f2(s)| )°

p q Let Zj = (Xj,Yj), Xj ∈ R , Yj ∈ R , j = 1, ..., n, be a random sample from this

d population. Let | · |d denote the Euclidean norm in R , and let Zkl = (Xk,Yl). Then the statistic I is given by n r 2¯z − z − z I = d , (4.4) n x + y − z

where

1 Xn z = |Z − Z | , d n2 kk ll p+q k,l=1 1 Xn Xn z = |Z − Z | , n4 kl ij p+q k,l=1 i,j=1 67 1 Xn x = |X − X | , n2 k l p k,l=1 1 Xn y = |Y − Y | , n2 k l q k,l=1 1 Xn Xn z¯ = |Z − Z | . n3 kk ij p+q k=1 i,j=1

Going back to the Lissajous curve case, for given a and b values, we generate X and Y pairs, and calculate the statistic In by using the indep.e function in energy package for R. For sample sizes larger than 150, the function indep.e is not feasible with our computational power, so we generate 100 samples of size 150, and we report the average of In values as our approximation to I-coefficient.

Empirical Results

The simulation results are given in Table A.12, and the plots of corresponding Lissajous curves are given Figure A.1. For correlation ratios, we observed that the cases we considered can be divided into two groups: the cases for which the Lissajous curve crosses itself, and the cases for which it does not. For the first group, we observed that both correlation ratios are close to 0.05. For the second group, we observed that one of the correlation ratios is close to 0.05, the other is close to one. Other than this, the correlation ratios are not sensitive to the changes in constants a and b. The observation above is not valid for the I-coefficient. For this dependence measure, the most important characteristic of a case is how much the graph wiggles. For example, in the case with a = 1 and b = 3, the I-coefficient equals 0.212, but in the case with a = 1 and b = 9, the I-coefficient equals 0.095.

It is of interest to better understand the behavior of the I-coefficient with respect to the changes in integers a and b, and we will consider this as a future project. We are cautious 68 about our current numerical results for I, since we cannot increase the sample size in the simulation study, due to our limited computational power. 69

CHAPTER 5

Conclusions

In this research we presented an independence test for two-dimensional contingency tables, which is based on maximal correlation. We explored the empirical power performance of this test and compared it with well-known independence tests. We also included an introductory chapter which summarizes the commonly used techniques for the analysis of contingency tables, and well-known measures of dependence. The main reference we used in this research is a paper by R´enyi (1959), which gives the conditions such that maximal correlation can be computed. Following R´enyi’s approach we presented a method of estimating the maximal correlation between two variables, given a contingency table. Using the sample maximal correlation, we described an independence test for the case of contingency tables. For the asymptotic null distribution of the test statistic, we used a result by Sethuraman (1990). For small sample sizes or contingency tables with sparseness, we used exact inferential methods, where we employed maximal correlation as the ordering criterion. In addition, by using a dependence measure related to maximal correlation, we con- structed another independence test, which we refer to as the correlation ratio test. This test can be used for contingency tables having ordered categories with known scores. The dependence measure leading to this test is not as attractive as the maximal correlation, in 70 the sense that this dependence measure satisfies only four of the seven postulates listed in Section 3.1.1, whereas maximal correlation satisfies all of them. This reflected directly to our simulation results, and correlation ratio test turned out to have very low power in the

Lissajous curve case. We carried out a simulation study to see the empirical power performance of the maximal correlation test and compare it with Pearson chi-squared and likelihood ratio tests of inde- pendence. When the underlying continuous variables are uncorrelated but dependent, we pointed out some cases for which the maximal correlation test appears to be more powerful. We are currently in search of other examples for which the maximal correlation test must be preferred. A natural extension of this work is the application to higher dimensional contingency tables, and we will consider this as a future project. As we pointed out in Section 2.1.4, the maximal correlation test of independence we discuss is very closely related to the independence tests based on correspondence analysis, which are claimed to severely suffer from the lack of a statistical model. The maximal correlation approach to the analysis of contingency tables can be considered as assigning scores to row and column variables so that the correlation between the scores is maximized, just like in the correspondence analysis approach. This shows that the maximization criteria in correspondence analysis is not an ad hoc method, but could have a clear interpretation. In this sense, our work gives some new insight on correspondence analysis. 71

BIBLIOGRAPHY

[1] Abrahams, J. and Thomas, J.B. (1980) Properties of the maximal correlation function. J. Franklin Inst., 310, 317-323.

[2] Aitkin, M. (1982) Review of Analysis of Categorical Data: Dual Scaling and its Appli- cations by Nishisato. Journal of the Royal Statistical Society, Ser. A, 145, 513-516.

[3] Agresti, A. (1992) A Survey of Exact Inference for Contingency Tables. Statistical Sci- ence, 7, 131-153.

[4] Agresti, A. (2002) Categorical Data Analysis, second ed., Wiley, New York.

[5] Aldrovandi, R. (2001) Special Matrices of Mathematical Physics: Stochastic, Circulant, and Bell Matrices, World Scientific.

[6] Anderson, T.W. (2003) An Introduction to Multivariate Statistical Analysis, third ed., Wiley, New York.

[7] Bakirov, N.K., Rizzo, M.L. and Sz´ekely, G.J. (2006) A multivariate nonparametric test of independence. Journal of Multivariate Analysis, 97, 1742-1756.

[8] Behboodian, J. (1978) Uncorrelated dependent random variables. Mathematics Maga- zine, 51, 303-304.

[9] Bell, C.B. (1962) and Maximal correlation as measures of depen-

dence. The Annals of , 33, 587-595. 72 [10] Benz´ecri,J.-P. (1973) L’Analyse des Donn´ees: 1. La Taxonomie: 2. L’Analyse des Correspondances, Paris, Dunod (2nd ed. 1976).

[11] Bishop, Y.M.M.,Fienberg, S.E., and Holland, P.W. (1975) Discrete Muitivariate Anal- ysis, MIT Press, Cambridge, Massachusetts.

[12] Blum, J.R., Kiefer, J., and Rosenblatt, M. (1961) Distribution free tests of independence

based on the sample distribution function. Annals of Mathematical Statistics, 32, 485- 498.

[13] Breiman, L. and Friedman, J. (1985) Estimation optimal transformations for multiple

regression and correlation (with discussion). J. Amer. Statist. Assoc., 80, 580-619.

[14] Bryc, W., Dembo, A. and Kagan, A. (2005) On the maximum correlation coefficient. Theory Probab. Appl, 49, 132-138.

[15] Cochran, W.G. (1954) Some methods of strengthening the common χ2 tests. Biometrics, 10, 417-451.

[16] Cram´er, H. (1946) Mathematical Methods of Statistics, Princeton University Press, Princeton, New Jersey.

[17] Cs´aki,P. and Fischer, J. (1963) On the general notion of maximum correlation. Magyar Tudom´anyosAkad. Mat. Kutat´oInt´ezetenkK¨ozlem´enyei (publ. Math. Inst. Hungar. Acad. Sci.), 8, 27-51.

[18] Dembo, A., Kagan, A. and Shepp, L.A. (2001) Remarks on the maximum correlation coefficient. Bernoulli, 7, 343-350.

[19] Diaconis, P. and Strumfels, B. (1998) Algebraic Algorithms for Sampling from Condi- tional Distributions. The Annals of Statistics, 26, 363-397.

[20] Fisher, R.A. (1934) Statistical Methods for Research Workers. Oliver and Boyd, Edin-

burgh (14th ed. 1970). 73 [21] Gautam, S. and Kimeldorf, G. (1999) Some results on the maximal correlation in 2 × 2 contingency tables. The American , 53, 336-341.

[22] Gebelein, H. (1941) Das statistische Problem der Korrelation als Variations - und Eigen-

werthproblem und sein Zusammenhang mit der Ausgleichsrechnung. Z. Angew. Math. Mech., 21, 364-379.

[23] Goodman, L.A. (2000) The analysis of cross-classified data: Notes on a century of progress in contingency table analysis, and some comments on its prehistory and its future, In: Statistics for the 21st Century (eds. C.R. Rao and G.J. Sz´ekely), Dekker,

New York, 189-231.

[24] Goodman, L.A. and Kruskal, W.H. (1954) Measures of association for cross- classifications. Journal of the American Statistical Association, 49, 732-764.

[25] Goodman, L.A. and Kruskal, W.H. (1979) Measures of Association for Cross-

Classifications. Springer-Verlag, New York.

[26] Haberman, S.J. (1981) Tests for independence in two-way contingency tables based on canonical correlation and on linear-by-linear interaction. The Annals of Statistics, 9, 1178-1186.

[27] Hoeffding, W. (1948) A class of statistics with asymptotically normal distribution. An- nals of Mathematical Statistics, 49, 293-325.

[28] Hollander, M. and Wolfe, D.A. (1999) Nonparametric Statistical Methods, second ed., Wiley, New York.

[29] Johnstone, I.M. (2001) On the distribution of the larges eigenvalue in principal compo-

nent analysis. The Annals of Statistics, 29, 295-327.

[30] Kendall, M.G. (1945) The treatment of ties in rank problems. Biometrika, 33, 239-251. 74 [31] Koehler, K. (1986) Goodness-of-fit tests for log-linear models in sparse contingency tables. J. Amer. Statist. Assoc. 81, 483-493.

[32] Kolmogorov, A. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergebnisse der Math. u. Grenzgebiete, Berlin.

[33] Koyak, R.A. (1987) On measuring internal dependence in a set of random variables. Ann. Statist., 15, 1215-1228.

[34] Kuriki, S. (2005) Asymptotic distribuion of inequality-restricted canonical correlation

with application to tests for independence in ordered contingency tables. J. Multivariate

Anal., 94, 420-449.

[35] Liebetrau, A.M. (2005) Measures of Association, second ed., SAGE Publications, Lon- don.

[36] Nishisato, S. (1980) Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto.

[37] Novak, S.Y. (2004) On Gebelein's correlation coefficient. Statist. Probab. Lett., 69, 299-303.

[38] Patefield, W.M. (1981) Algorithm AS159. An efficient method of generating r ×c tables with given row and column totals. Applied Statistics, 30, 91-97.

[39] Pearson, K. (1896) Mathematical contributions to the theory of evolution, III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253-318.

[40] Pearson, K. (1904) Mathematical contributions to the theory of evolution, XIII. On the theory of contingency and its relation to association and normal correlation. Draper's Co. Research Memoirs, Biometric Series, 1. (Reprinted 1948 in Karl Pearson's Early Statistical Papers, ed. by E.S. Pearson, Cambridge University Press, Cambridge.)

[41] Pearson, E.S. and Hartley, H.O. (1972) Biometrika Tables for Statisticians, Vol. 2. Cambridge University Press, Cambridge.

[42] Rényi, A. (1959) On measures of dependence. Acta Math. Acad. Sci. Hungar., 10, 441-451.

[43] Sethuraman, J. (1990) The asymptotic distribution of Rényi maximal correlation. Comm. Statist. Theory Methods, 19, 4291-4298.

[44] Snee, R.D. (1974) Graphical display of two-way contingency tables. The American Statistician, 28, 9-12.

[45] Spearman, C. (1904) The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

[46] Stuart, A. (1953) The estimation and comparison of strengths of association in contingency tables. Biometrika, 40, 105-110.

[47] Tjøstheim, D. (1996) Measures of dependence and tests of independence. Statistics, 28, 249-284.

[48] Tsuchuprov, A.A. (1919) On the mathematical expectation of the moments of frequency distributions. Biometrika, 13, 185-210.

Appendix A

Results of the Empirical Study

 I    J    α = 0.1   α = 0.05   α = 0.01
 3    3      6.998      8.599     12.057
 4    4     11.229     13.137     17.179
 5    5     15.441     17.584     21.976
 6    6     19.634     21.953     26.706
 8    8     27.872     30.359     35.636
10   10     36.122     38.875     44.710
12   12     44.233     47.215     53.330
14   14     52.506     55.686     62.133
16   16     60.583     63.892     70.359

Table A.1: Critical values of $nS_n^2$.

  n    E|Sn − S|   Var(Sn)   MSE(Sn)
 40       0.0491    0.0126    0.0150
 60       0.0336    0.0087    0.0098
 80       0.0256    0.0066    0.0072
100       0.0201    0.0055    0.0058
120       0.0179    0.0044    0.0047
140       0.0146    0.0034    0.0040
160       0.0121    0.0034    0.0035
180       0.0105    0.0031    0.0032
200       0.0117    0.0027    0.0028
250       0.0077    0.0022    0.0022
300       0.0070    0.0018    0.0018

Table A.2: Empirical mean square error of $S_n$ based on 10,000 simulations.

        α = 0.1             α = 0.05            α = 0.01
  n     P     L     M       P     L     M       P     L     M
 50   10.5  13.3   9.8    4.7   7.14  4.72    0.6   1.3   0.7
 60    9.5  11.9   9.3    4.5   6.3   4.4     0.8   1.4   0.8
 70    9.9  11.9   9.7    4.8   6.2   4.6     0.8   1.3   0.8
 80    9.7  11.6   9.9    4.7   6.0   4.6     0.9   1.4   0.8
 90   10.1  11.4  10.0    4.8   5.7   4.9     0.9   1.3   0.8
100    9.5  10.7   9.3    4.8   5.6   4.6     0.8   1.1   0.9

Table A.3: Empirical significance of Pearson chi-square (P), likelihood ratio (L), and maximal correlation (M) tests of independence.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
 20*   51    52    52       36    39    39       13    17    16
 30*   72    73    73       59    60    61       31    35    34
 40    85.4  88.8  85.8     74.5  80.8  75.8     45.8  60.4  48.8
 45    90.2  92.4  90.7     81.5  86.4  82.9     55.5  67.5  59.2
 50    93.3  94.9  93.8     86.8  90.4  87.8     64.0  74.3  67.2
 55    95.2  96.3  95.6     90.5  92.5  91.0     71.4  79.4  74.4
 60    96.8  97.5  97.0     93.3  94.6  93.7     76.8  83.6  79.8
 65    97.8  98.2  98.0     94.8  96.2  95.5     82.3  87.1  84.6
 70    98.6  98.9  98.7     97.0  97.8  97.3     87.3  90.8  88.9
 75    99.3  99.4  99.4     98.0  98.5  98.2     90.2  93.2  91.8
 80    99.4  99.5  99.4     98.4  98.8  98.6     92.2  94.6  93.7
 90    99.8  99.8  99.8     99.4  99.5  99.4     96.1  97.3  96.9

Table A.4: Empirical power of Pearson chi-square (P), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 1. Sample sizes marked with * indicate that exact tests are used.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
 70    50.1  73.8  53.2     35.9  59.8  38.6     13.9  31.5  15.1
 80    58.8  78.5  62.6     44.3  66.2  48.6     18.9  39.3  22.2
 90    65.5  83.3  71.0     50.6  72.4  58.0     25.3  45.9  30.4
100    71.3  85.9  78.1     58.2  76.5  66.4     30.9  52.7  39.6
110    79.2  90.8  89.7     66.9  83.2  74.4     39.7  61.0  50.0
120    82.8  92.0  88.7     71.8  85.8  81.0     45.3  65.5  57.8
130    86.8  94.2  92.1     77.6  88.7  85.6     51.7  71.3  65.8
140    90.2  95.9  94.1     82.4  91.5  89.5     59.0  76.1  73.1
150    92.6  96.8  96.2     85.6  93.3  92.3     64.3  80.0  78.3
160    94.0  97.2  97.2     88.8  94.7  94.6     70.6  83.7  83.2
170    96.0  98.3  98.1     91.4  96.4  96.1     75.8  86.9  87.3
180    96.8  98.8  98.8     93.4  97.1  97.5     80.5  89.7  90.4
190    97.9  99.2  99.1     95.5  98.0  98.1     83.9  92.2  93.0
200    98.5  99.5  99.5     96.2  98.5  98.8     86.8  93.2  94.5
210    99.0  99.7  99.7     97.6  99.1  99.2     90.4  95.4  96.5
220    99.3  99.7  99.8     98.1  99.3  99.5     91.9  96.2  97.4

Table A.5: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 2.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
 50    75.4  73.4  82.0     60.9  59.2  69.4     35.4  33.8  47.4
 55    77.9  77.1  83.7     65.2  64.1  73.5     39.8  38.2  51.6
 60    84.8  85.0  88.3     73.6  74.3  80.0     49.4  47.9  62.0
 65    91.3  89.6  92.5     81.3  81.3  87.3     57.1  56.7  69.1
 70    92.7  92.6  95.2     88.4  84.6  89.8     63.8  63.2  74.5
 75    93.6  94.6  95.5     86.9  88.9  91.3     68.2  69.5  77.8
 80    96.8  97.6  98.4     92.1  93.0  95.8     75.3  76.7  86.3
 85    97.1  98.2  98.6     93.9  95.0  96.0     78.1  80.5  86.7
 90    98.6  98.8  99.0     95.8  96.4  97.8     84.3  85.4  90.9
 95    99.3  99.6  99.1     97.0  97.8  98.3     84.9  88.9  91.1
100    99.2  99.3  99.4     98.2  98.5  98.7     90.5  92.1  95.5
110    99.7  99.8  99.6     99.1  99.5  99.4     94.4  95.9  97.5

Table A.6: Empirical power of Pearson chi-square (P), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 3, Case 1.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
200    68.0  77.2  83.5     54.4  60.5  72.5     30.2  26.5  44.2
210    70.9  82.1  86.5     57.6  67.1  76.6     32.3  32.6  50.9
220    75.1  87.0  90.5     62.7  75.1  82.8     36.5  43.5  59.7
230    77.7  90.8  93.5     65.7  81.0  88.1     39.4  50.4  66.8
240    79.2  93.4  95.0     68.2  85.0  90.4     42.6  56.8  72.5
250    83.6  95.3  96.9     73.2  89.3  93.5     47.1  65.0  78.7
260    85.6  97.0  97.9     75.9  91.7  95.3     50.6  70.5  84.1
270    87.8  97.9  98.6     78.2  94.1  96.7     54.3  76.5  88.1
280    89.0  98.2  99.0     79.8  95.7  97.5     57.2  80.7  90.7
290    90.4  99.0  99.3     82.5  96.6  98.5     60.8  83.8  93.4
300    91.7  99.5  99.7     84.1  97.7  99.2     64.0  86.9  95.3

Table A.7: Empirical power of Pearson chi-square (P ), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 3, Case 2.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
 50    30.5  39.9  25.2     19.4  24.8  15.4      6.8   6.9   4.56
 75    41.7  54.4  38.4     29.7  39.7  27.1     11.6  16.2  11.1
100    52.2  63.8  52.0     38.7  50.3  39.1     18.1  24.4  19.6
125    62.6  72.1  64.4     49.5  59.7  52.4     26.5  32.9  29.9
150    71.9  78.8  74.0     59.5  67.6  63.8     34.7  41.8  41.1
175    78.8  83.6  81.2     67.9  73.5  73.2     44.1  49.8  52.1
200    83.8  87.0  87.8     75.1  79.3  80.9     53.0  57.4  61.9
225    89.4  90.9  92.2     81.7  84.5  86.9     62.1  65.3  71.1
250    92.6  93.9  94.9     87.2  88.9  91.3     69.8  72.4  79.2
275    95.1  95.6  96.9     90.7  91.7  94.6     76.7  78.4  85.2
300    90.1  96.6  97.9     93.2  93.5  96.1     81.2  82.3  88.9
325    97.5  97.6  98.8     95.2  95.4  97.5     85.6  85.8  92.4
350    98.6  98.5  99.3     96.9  96.8  98.5     89.6  89.6  95.1
375    98.9  98.9  99.4     97.9  97.8  99.1     92.4  91.9  96.6
400    99.4  99.3  99.8     98.5  98.6  99.6     94.3  93.9  97.8

Table A.8: Empirical power of Pearson chi-square (P), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 4.

        α = 0.1              α = 0.05             α = 0.01
  n     P     L     M        P     L     M        P     L     M
100    42.0  55.2  40.9     29.1  40.3  28.5     11.7  16.5  11.1
125    52.4  65.1  54.0     39.3  51.2  40.6     17.6  25.5  19.5
150    60.1  71.7  62.9     46.9  58.9  50.9     24.1  33.1  27.4
175    69.0  78.3  73.3     56.5  67.4  61.5     37.8  41.2  36.9
200    76.5  83.6  81.1     65.1  73.6  71.2     39.3  49.8  47.9
225    81.5  87.9  86.2     71.7  79.2  78.4     47.5  56.7  57.9
250    85.5  90.0  89.9     76.8  82.9  83.7     54.2  62.5  64.9
275    89.5  92.9  93.1     82.1  86.9  87.9     61.3  69.4  72.5
300    92.3  94.9  95.8     86.4  90.3  92.1     68.1  74.7  79.1
325    94.9  96.3  96.8     90.3  93.1  94.3     74.9  80.5  84.9
350    96.0  97.6  98.1     92.3  94.5  96.2     79.9  83.9  88.3
375    97.4  98.3  98.9     95.0  96.5  97.6     84.4  87.9  92.0
400    98.3  98.8  99.3     96.2  97.2  98.4     87.9  90.9  94.2
425    98.9  99.3  99.6     97.7  98.3  99.0     91.6  93.7  96.0
450    99.4  99.6  99.6     98.3  98.9  99.4     93.4  95.1  97.5

Table A.9: Empirical power of Pearson chi-square (P), likelihood ratio (L), and maximal correlation (M) tests of independence for Example 5.

        α = 0.05                  α = 0.01
  n     P    L    M    I    C     P    L    M    I    C
 30    61   61   62   62   59    34   36   36   36   34
 35    68   69   70   70   67    43   48   46   48   42
 40    75   74   78   78   74    51   55   56   56   49
 45    83   83   84   84   80    62   64   65   67   58
 50    86   86   87   87   83    68   70   71   71   63
 55    90   91   92   91   89    77   77   79   78   71
 60    94   94   95   95   92    80   82   83   83   76
 65    94   94   95   95   93    85   85   86   86   80
 70    96   96   96   97   95    88   89   89   89   83

Table A.10: Empirical power of Pearson chi-square (P), likelihood ratio (L), maximal correlation (M), I-test (I), and correlation ratio (C) tests of independence for the loglinear case.

        α = 0.05                  α = 0.01
  n     P    L    M    I    C     P    L    M    I    C
 60    74   75   82   47   5     49   49   61   26   1
 65    81   82   85   52   6     61   60   73   29   1
 70    89   85   89   49   5     62   64   75   28   2
 75    88   88   90   53   4     68   70   79   30   2
 80    92   93   94   60   4     75   77   86   35   1
 85    92   93   95   57   3     77   80   86   35   1
 90    95   96   97   68   4     84   85   90   45   1
 95    97   98   98   67   3     87   91   94   43   1
100    98   98   99   70   4     90   92   96   47   1

Table A.11: Empirical power of Pearson chi-square (P), likelihood ratio (L), maximal correlation (M), I-test (I), and correlation ratio (C) tests of independence for the Lissajous curve case.

 a    b    K_X(Y)   K_Y(X)     I
 1    2     0.050    0.047   0.136
 1    3     0.989    0.048   0.212
 1    4     0.050    0.044   0.092
 1    5     0.998    0.050   0.135
 1    6     0.047    0.051   0.086
 1    7     0.998    0.054   0.111
 1    8     0.048    0.048   0.083
 1    9     0.999    0.075   0.095
 2    3     0.046    0.052   0.095
 2    5     0.047    0.044   0.089
 2    7     0.049    0.051   0.086
 2    9     0.055    0.053   0.079
 3    4     0.051    0.050   0.088
 3    5     0.051    0.049   0.105
 3    7     0.051    0.050   0.099
 3    8     0.051    0.052   0.079
 4    5     0.052    0.049   0.081
 4    7     0.053    0.049   0.089
 4    9     0.053    0.051   0.085
 4   11     0.049    0.053   0.074

Table A.12: Approximations for the I-coefficient, $K_X(Y)$, and $K_Y(X)$ for the Lissajous case.

[Figure A.1: Lissajous curves for several a and b values; panels for (a, b) = (1,2), (1,3), (1,4), (1,5), (1,6), (1,7), (1,8), (1,9), (2,3), (2,5), (2,7), (2,9), (3,4), (3,5), (3,7), (3,8), (4,5), (4,7), (4,9), (4,11).]

Appendix B

Algebraic form of Maximal Correlation

B.1 2 × 2 Contingency Tables

Consider the matrix $\hat{A}$ for a 2 × 2 contingency table, and denote the entries of the matrix by the letters K, L, M, N:
$$
\hat{A} =
\begin{pmatrix}
\dfrac{n_{11}^2}{n_{1\cdot}n_{\cdot 1}} + \dfrac{n_{12}^2}{n_{1\cdot}n_{\cdot 2}} &
\dfrac{n_{11}n_{21}}{n_{1\cdot}n_{\cdot 1}} + \dfrac{n_{12}n_{22}}{n_{1\cdot}n_{\cdot 2}} \\[2ex]
\dfrac{n_{21}n_{11}}{n_{2\cdot}n_{\cdot 1}} + \dfrac{n_{22}n_{12}}{n_{2\cdot}n_{\cdot 2}} &
\dfrac{n_{21}^2}{n_{2\cdot}n_{\cdot 1}} + \dfrac{n_{22}^2}{n_{2\cdot}n_{\cdot 2}}
\end{pmatrix}
=
\begin{pmatrix} K & L \\ M & N \end{pmatrix}.
$$

The eigenvalues of $\hat{A}$ are given by the solutions of $\det(\hat{A} - \lambda I) = 0$, which leads to the characteristic polynomial
$$
\det(\hat{A} - \lambda I) =
\det\begin{pmatrix} K - \lambda & L \\ M & N - \lambda \end{pmatrix}
= \lambda^2 - (K + N)\lambda + (KN - LM) = 0.
$$

Since 1 is always an eigenvalue of $\hat{A}$, we have

$$
(\lambda - 1)(\lambda + 1 - K - N) = 0,
$$
thus
$$
\lambda = K + N - 1
= \frac{n_{11}^2}{n_{1\cdot}n_{\cdot 1}} + \frac{n_{12}^2}{n_{1\cdot}n_{\cdot 2}} + \frac{n_{21}^2}{n_{2\cdot}n_{\cdot 1}} + \frac{n_{22}^2}{n_{2\cdot}n_{\cdot 2}} - 1
= \theta_1 - 1
$$
is the non-unity eigenvalue of $\hat{A}$, where $\theta_1$ is as defined in (3.23). So the sample maximal correlation for 2 × 2 contingency tables is

$$
S_n^2 = \theta_1 - 1. \tag{B.1}
$$

Note that the Pearson chi-squared test statistic can be written as

$$
X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left(n_{ij} - \frac{n_{i\cdot}n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot}n_{\cdot j}}{n}}
= \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{n\,n_{ij}^2}{n_{i\cdot}n_{\cdot j}} - n
= n(\theta_1 - 1),
$$
therefore, for 2 × 2 contingency tables we have

$$
X^2 = nS_n^2. \tag{B.2}
$$
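As a quick numerical check of (B.2), the following short R sketch (not part of the original derivation; the 2 × 2 counts are arbitrary illustrative values) computes $\theta_1$ directly from an observed table and compares $n(\theta_1 - 1)$ with the Pearson statistic returned by chisq.test without continuity correction.

# minimal sketch illustrating (B.2) on a hypothetical 2 x 2 table
tab <- matrix(c(12, 5, 7, 16), nrow = 2)
n   <- sum(tab)
ri  <- rowSums(tab)
cj  <- colSums(tab)
theta1 <- sum(tab^2 / outer(ri, cj))        # theta_1 = sum_ij n_ij^2 / (n_i. n_.j)
nSn2   <- n * (theta1 - 1)                  # n * S_n^2 from (B.1)
X2 <- unname(chisq.test(tab, correct = FALSE)$statistic)   # Pearson X^2
all.equal(X2, nSn2)                         # TRUE: X^2 = n * S_n^2 for 2 x 2 tables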

B.2 3 × 3 Contingency Tables

Consider the matrix $\hat{A}$ for a 3 × 3 contingency table, and denote the entries of this matrix by the letters K, ..., S:
$$
\hat{A} =
\begin{pmatrix}
\displaystyle\sum_{r=1}^{3}\frac{n_{1r}^2}{n_{1\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{1r}n_{2r}}{n_{1\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{1r}n_{3r}}{n_{1\cdot}n_{\cdot r}} \\[2ex]
\displaystyle\sum_{r=1}^{3}\frac{n_{2r}n_{1r}}{n_{2\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{2r}^2}{n_{2\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{2r}n_{3r}}{n_{2\cdot}n_{\cdot r}} \\[2ex]
\displaystyle\sum_{r=1}^{3}\frac{n_{3r}n_{1r}}{n_{3\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{3r}n_{2r}}{n_{3\cdot}n_{\cdot r}} &
\displaystyle\sum_{r=1}^{3}\frac{n_{3r}^2}{n_{3\cdot}n_{\cdot r}}
\end{pmatrix}
$$

$$
= \begin{pmatrix} K & L & M \\ N & O & P \\ Q & R & S \end{pmatrix}.
$$
The eigenvalues of $\hat{A}$ are given by the solutions of $\det(\hat{A} - \lambda I) = 0$, which leads to the characteristic polynomial
$$
\det(\hat{A} - \lambda I) =
\det\begin{pmatrix} K - \lambda & L & M \\ N & O - \lambda & P \\ Q & R & S - \lambda \end{pmatrix}
$$

$$
= (K - \lambda)\big[(O - \lambda)(S - \lambda) - PR\big] - L\big[N(S - \lambda) - PQ\big] + M\big[NR - (O - \lambda)Q\big] = 0.
$$

This can be rearranged as

$$
\lambda^3 + a_2\lambda^2 + a_1\lambda + a_0 = 0,
$$

where

$$
\begin{aligned}
a_2 &= -(K + O + S), \\
a_1 &= (KO + KS + OS) - (PR + LN + MQ), \\
a_0 &= -(KOS - KPR + LPQ - LNS + MNR - MOQ).
\end{aligned}
$$

Since 1 is always an eigenvalue of $\hat{A}$, we have

$$
(\lambda - 1)\big[\lambda^2 + (a_2 + 1)\lambda + (a_1 + a_2 + 1)\big] = 0,
$$
and the larger of the two non-unity eigenvalues of $\hat{A}$ is the larger root of

$$
\lambda^2 + (a_2 + 1)\lambda + (a_1 + a_2 + 1) = 0,
$$
which is
$$
\lambda = -\frac{a_2 + 1}{2} + \frac{1}{2}\sqrt{(a_2 + 1)^2 - 4(a_1 + a_2 + 1)}.
$$

Note that $a_2 = -\theta_1$ and $a_1 = \theta_2$, where $\theta_1$ and $\theta_2$ are as given in equations (3.23) and (3.24), respectively. Then the sample maximal correlation $S_n$ is given by

$$
S_n^2 = \frac{1}{2}\left[\theta_1 - 1 + \sqrt{(\theta_1 + 1)^2 - 4(\theta_2 + 1)}\right]. \tag{B.3}
$$
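The closed form (B.3) can also be checked numerically against a direct eigendecomposition of $\hat{A}$. The R sketch below is illustrative only: the 3 × 3 counts are hypothetical, and the coefficients $a_2$ and $a_1$ are obtained here from the trace and the principal 2 × 2 minors of $\hat{A}$ rather than from (3.23) and (3.24).

# hypothetical 3 x 3 table; verify (B.3) against the second-largest eigenvalue of A-hat
tab <- matrix(c(10, 4, 6,
                 3, 12, 5,
                 7, 2, 11), nrow = 3, byrow = TRUE)
ri <- rowSums(tab)
cj <- colSums(tab)
A  <- matrix(0, 3, 3)
for (i in 1:3) for (j in 1:3)
  A[i, j] <- sum(tab[i, ] * tab[j, ] / (ri[i] * cj))   # A[i,j] = sum_r n_ir n_jr / (n_i. n_.r)
a2 <- -sum(diag(A))                                    # a_2 = -(K + O + S)
a1 <- sum(sapply(1:3, function(k) det(A[-k, -k])))     # a_1 = sum of principal 2 x 2 minors
Sn2.closed <- 0.5 * (-(a2 + 1) + sqrt((a2 + 1)^2 - 4 * (a1 + a2 + 1)))
Sn2.eigen  <- sort(Re(eigen(A, only.values = TRUE)$values))[2]   # second-largest eigenvalue
all.equal(Sn2.closed, Sn2.eigen)                       # TRUE up to rounding error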

Appendix C

Selected R Code

C.1 R Function for Maximal Correlation

# This is a function for maximal correlation
# INPUT:  a contingency table
# OUTPUT: n*S^2
mc.ns2 <- function(C) {
  # identify size
  m <- dim(C)[1]
  k <- dim(C)[2]
  # find row and column totals
  nidot <- rowSums(C)
  ndoti <- colSums(C)
  a <- matrix(0, m, m)
  for (i in 1:m) {
    for (j in 1:m) {
      for (r in 1:k) {
        a[i, j] <- a[i, j] + (C[i, r] * C[j, r]) / (ndoti[r] * nidot[i])
      }
    }
  }
  L  <- eigen(a, only.values = TRUE)
  s3 <- sort(L$values)
  s2 <- s3[m - 1]
  ns2 <- sum(C) * s2
  ns2
}
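A brief usage sketch (the 3 × 3 counts below are hypothetical, not taken from the dissertation data): the value returned by mc.ns2 is $nS_n^2$, which may be compared with the critical values in Table A.1.

# hypothetical example: compute n * S_n^2 for a 3 x 3 table
tab <- matrix(c(20, 10, 5,
                 8, 25, 7,
                 6,  9, 30), nrow = 3, byrow = TRUE)
mc.ns2(tab)   # compare with the I = J = 3 row of Table A.1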

C.2 R Function for Correlation Ratio

# This is a FUNCTION which calculates the MAXIMUM CORRELATION RATIO
# INPUT:  (a contingency table, x values in vector, y values in vector)
# OUTPUT: maximum correlation ratio statistic
mc.ratio <- function(table, alpha, beta) {
  # Estimate the cell probabilities
  pi <- table / sum(table)
  # Row sum vector and column sum vector
  pir <- rowSums(pi)   # stored as column vector
  pic <- colSums(pi)
  # Find gamma and phi
  gamma <- (1 / pir) * (pi %*% beta)
  phi   <- (1 / pic) * (t(pi) %*% alpha)
  # Find the correlation ratio of Y on X
  thetaynum   <- pir %*% (gamma * gamma) - (pir %*% gamma)^2
  thetaydenom <- pic %*% (beta * beta) - (pic %*% beta)^2
  thetay      <- thetaynum / thetaydenom
  # Find the correlation ratio of X on Y
  thetaxnum   <- pic %*% (phi * phi) - (pic %*% phi)^2
  thetaxdenom <- pir %*% (alpha * alpha) - (pir %*% alpha)^2
  thetax      <- thetaxnum / thetaxdenom
  # Now find the maximum correlation ratio
  maxrat <- max(thetay, thetax)
  maxrat
}
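A short usage sketch for mc.ratio, with hypothetical counts and the equally spaced scores used in the simulation of Section C.3:

# hypothetical example: maximum correlation ratio with scores 1, 2, 3
tab  <- matrix(c(20, 10, 5,
                  8, 25, 7,
                  6,  9, 30), nrow = 3, byrow = TRUE)
xval <- 1:3   # row scores (alpha)
yval <- 1:3   # column scores (beta)
mc.ratio(tab, xval, yval)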

C.3 R Code for Table A.10

# This is the simulation program for computing Table A.10.
# For generating tables with given marginals, we use "r2dtable" of R,
# which is the procedure due to Patefield (1981).
rm(list = ls(all = TRUE))

# necessary sources for functions
source("C:/ProgramFiles/R/R-2.4.0/programlar/myfunctions/maxcorratio.R")
source("C:/ProgramFiles/R/R-2.4.0/programlar/myfunctions/energycont.R")
source("C:/ProgramFiles/R/R-2.4.0/programlar/myfunctions/maximalcorrelation.R")

Nsim <- 10000
nsim <- 200   # this is for the tables generated

# sample size
n <- 70
ccount1 <- 0; mcount1 <- 0; lcount1 <- 0; ecount1 <- 0; mcrcount1 <- 0
ccount5 <- 0; mcount5 <- 0; lcount5 <- 0; ecount5 <- 0; mcrcount5 <- 0
alpha1 <- 0.01
alpha5 <- 0.05

# enter the parameters of the loglinear model
lamxy <- matrix(c(0.4, 0.8, -1.2, -0.2, -0.4, 0.6, -0.2, -0.4, 0.6), 3, 3)
lamx  <- matrix(c(0, 0, 0), 1, 3)
lamy  <- matrix(c(0, 0, 0), 1, 3)
xdim <- 3
ydim <- 3
xval <- c(1:3)
yval <- c(1:3)

# calculate m[i,j]
m <- matrix(0, 3, 3)
for (i in 1:3) {
  for (j in 1:3) {
    m[i, j] <- exp(lamx[i] + lamy[j] + lamxy[i, j])
  }
}

# calculate the multinomial probabilities pij[i,j]
pij <- m / sum(m)

# we must write pij as a 1 x 9 vector
probs <- matrix(c(pij[1, ], pij[2, ], pij[3, ]), 1, 9)

# big simulation starts here ++++++++++++++++++++
for (g in 1:Nsim) {
  kay    <- matrix(0, nsim, 1)
  lik    <- matrix(0, nsim, 1)
  ns2    <- matrix(0, nsim, 1)
  energy <- matrix(0, nsim, 1)
  mcr    <- matrix(0, nsim, 1)

  # ***************************************************************

  # generate a multinomial sample of size n
  multi <- sample(1:length(probs), size = n, prob = probs, replace = TRUE)

  # finding contingency table
  C <- matrix(tabulate(multi, nbins = length(probs)), xdim, ydim, byrow = TRUE)

  # calculate row and column sums for r2dtable ======
  rs <- rowSums(C)
  cs <- colSums(C)

  # calculate chisqr and LRT for ORIGINAL table ------
  L <- loglin(C, margin = list(1, 2))
  kayoriginal <- L$pearson
  likoriginal <- L$lrt

  # calculate max cor for original table
  ns2original <- mc.ns2(C)

  # energy test for the original table ------
  eoriginal <- estat.cont(C)

  # MCratio stat for the original
  mcroriginal <- mc.ratio(C, xval, yval)

  # ----- end of calculations for original table ------

  # generate tables
  s <- r2dtable(nsim, rs, cs)

  # small simulation starts here
  for (w in 1:nsim) {
    # calculate chisqr for this table ------
    L <- loglin(s[[w]], margin = list(1, 2))
    kay[w] <- L$pearson
    lik[w] <- L$lrt
    # calculate n*s^2 for this table
    ns2[w] <- mc.ns2(s[[w]])
    # find energy for simulated table ------
    energy[w] <- estat.cont(s[[w]])
    # find MCRatio stat for simulated table
    mcr[w] <- mc.ratio(s[[w]], xval, yval)
  }
  # ----- small simulation ends ------

  if (kayoriginal > quantile(kay, 0.99)) ccount1 <- ccount1 + 1
  if (ns2original > quantile(ns2, 0.99)) mcount1 <- mcount1 + 1
  if (likoriginal > quantile(lik, 0.99)) lcount1 <- lcount1 + 1
  if (eoriginal > quantile(energy, 0.99)) ecount1 <- ecount1 + 1
  if (mcroriginal > quantile(mcr, 0.99)) mcrcount1 <- mcrcount1 + 1
  if (kayoriginal > quantile(kay, 0.95)) ccount5 <- ccount5 + 1
  if (ns2original > quantile(ns2, 0.95)) mcount5 <- mcount5 + 1
  if (likoriginal > quantile(lik, 0.95)) lcount5 <- lcount5 + 1
  if (eoriginal > quantile(energy, 0.95)) ecount5 <- ecount5 + 1
  if (mcroriginal > quantile(mcr, 0.95)) mcrcount5 <- mcrcount5 + 1
}
# +++++++++ big simulation ends here ++++++++

powerchi1 <- ccount1 / Nsim
powerns21 <- mcount1 / Nsim
powerlik1 <- lcount1 / Nsim
powere1   <- ecount1 / Nsim
powermcr1 <- mcrcount1 / Nsim
powerchi5 <- ccount5 / Nsim
powerns25 <- mcount5 / Nsim
powerlik5 <- lcount5 / Nsim
powere5   <- ecount5 / Nsim
powermcr5 <- mcrcount5 / Nsim

powerchi5
powerns25
powerlik5
powere5
powermcr5
powerchi1
powerns21
powerlik1
powere1
powermcr1