A Network Algorithm for Performing Fisher's Exact Test in R × C Contingency Tables

A Network Algorithm for Performing Fisher's Exact Test in r × c Contingency Tables Cyrus R. Mehta; Nitin R. Patel Journal of the American Statistical Association, Vol. 78, No. 382. (Jun., 1983), pp. 427-434. Stable URL: http://links.jstor.org/sici?sici=0162-1459%28198306%2978%3A382%3C427%3AANAFPF%3E2.0.CO%3B2-B Journal of the American Statistical Association is currently published by American Statistical Association. Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/journals/astata.html. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. For more information regarding JSTOR, please contact [email protected]. http://www.jstor.org Mon Apr 16 12:59:18 2007 A Network Algorithm for Performing Fisher's Exact Test in r x c Contingency Tables CYRUS R. MEHTA and NlTlN R. PATEL* An exact test of significance of the hypothesis that the perform the exact test of significance. One may get some row and column effects are independent in an r x c con- idea of the complexity of this numerical problem from the tingency table can be executed in principle by general- work of Gail and Mantel (1977). izing Fisher's exact treatment of the 2 x 2 contingency In a recent paper, Mehta and Pate1 (1980) developed table. Each table in a conditional reference set of r x c an algorithm for computing the exact significance level tables with fixed marginal sums is assigned a generalized in 2 x c contingency tables. Subsequently, Pagano and hypergeometric probability. The significance level is then Halvorsen (1981) provided an algorithm for the more gen- computed by summing the probabilities of all tables that eral r x c problem. The present article provides an al- are no larger (on the probability scale) than the observed ternative algorithm for the exact treatment of r x c con- table. However, the computational effort required to gen- tingency tables which considerably extends the bounds erate all r x c contingency tables with fixed marginal of computational feasibility relative to the algorithm pro- sums severely limits the use of Fisher's exact test. A vided by Pagano and Halvorsen. Many r x c tables that novel technique that considerably extends the bounds of are computationally feasible with our algorithm, but in- computational feasibility of the exact test is proposed feasible without it, are still small enough to justify an here. The problem is transformed into one of identifying exact computation rather than a X Z approximation. all paths through a directed acyclic network that equal or The network approach that forms the basis of our al- exceed a fixed length. Some interesting new optimization gorithm can be adapted to many other exact permutation theorems are developed in the process. The numerical tests. results reveal that for sparse contingency tables Fisher's exact test and Pearson's X 2 test frequently lead to con- 2. FORMULATION AS A NETWORK PROBLEM tradictory inferences concerning row and column independence. Given a r x c contingency table, X, let xij denote the KEY WORDS: Fisher's exact test; r x c contingency entry in row i and column j. Let Ri = x;= xij be the tables; Conditional reference set; Network algorithm; sum of all entries in row i and let Cj = x;=I xij be the Contingency tables; Implicit enumeration; Permutation sum of all entries in column j. Throughout we will assume distribution; Exact tests. that the xij's and their various partial sums are nonne- gative integers. Denote by 9the reference set of all pos- 1. INTRODUCTION sible r x c contingency tables with the same marginal totals as X. Thus Fisher's exact treatment of the 2 x 2 contingency table readily generalizes to an exact test of row and column independence in r x c contingency tables. One would in general prefer to report the exact p value associated with an observed r x c table especially when the entries in Under the null hypothesis of row and column independ- each cell are small. (See, for example, Cochran 1954). It ence the probability of observing any Y E 9can be ex- is more common, however, to report the tail area of the pressed as a product of multinomial coefficients x 2 distribution based on Pearson's X2 statistic, because of the formidable computational effort that is needed to where T = Ri. For later use we define D = * Cyrus R. Mehta is Assistant Professor of Biostatistics, Harvard x;=I School of Public Health, and Dana Farber Cancer Institute, 44 Binney T!IRI! Rz! ... R,!. Street, Boston, MA 02115. Nitin R. Patel is Professor, Indian Institute The exact significance level or p value associated with of Management, Vastrapur, Ahmedabad 380-015, India. The authors thank Professor Marvin Zelen for initially stimulating their interest in the observed table X is defined as the sum of the prob- these problems. The availability of the DASH library, developed by the abilities of all the tables in 3 that are no more likely than DASH Software Development Group, Division of Biostatistics and Ep- idemiology, Dana Farber Cancer Institute, Boston, made it possible for them to access the algorithm of Pagano and Halvorsen. The authors -- also thank Mrs. Rita Rama Rao for progamming assistance. This in- O Journal of the American Statistical Association vestigation was supported in part by Grants CA-33019 and CA-23415 June 1983, Volume 78, Number 382 of the National Cancer Institute. Theory and Methods Section Journal of the American Statistical Association, June 1983 X. Specifically, where Y = { Y : Y E 3 and P ( Y ) 5 P(X)). The problem is to evaluate p. Our approach is to construct a directed acycle network of nodes and arcs such that the set of all distinct paths from the initial node to the terminal node is isomorphic with the set 3.Moreover, by construction, the length of the path that corresponds to the table Y E 5equals D . P ( Y ) .The original problem is now equivalent to identifying and summing the lengths of all paths which are no longer than D . P ( X ) . 2.1 Construction of the Network We construct the network in c + 1 stages labeled successively c , c - 1 , . , 0 . At any stage k there exist a set of nodes each labeled by a unique vector ( k ,R I ~. ., . ,R A ) .Arcs emanate from each node at stage k and every arc is directed to exactly one node at stage k - 1 . The d o t t e d p a t h c o r r e s p o n d s t o t h e t a b l e 1 0 1 and t h e l e n g t h of t h i s The network is defined recursively by specifying all the ; nodes of the form ( k - 1 , R I, k - I , . , R,.,+ ,) that suc- [; ;] ceed the node ( k ,R l k ,. , Rrk)and are connected to it by arcs. The range of Ri,k- i = 1 , 2, . , r for these Figure 1. Network Representation for All the 3 x 3 Contingency successor nodes is given by Tables with R, = 2, R2 = 1, R3 = 6 and C1 = 3, C2 = 3, C3 = 3. 2.2 The Network Algorithm Suppose that X is the observed r x c table. In network terms our goal is to identify and sum all paths whose lengths do not exceed D . P ( X ) .If we systematically enu- where Sj = x./I C/l and= we follow the convention that a summation is null if its lower limit exceeds its upper merate each path through the network, compute its length limit. There is only one node at stage c , the initial node, and sum all those path lengths that do not exceed which is labeled ( c , R I , . ,. , R,,.) where Ri(.= R i , i = D . P ( X ) ,we are in effect examining each r x c table in 1 , 2, . ,r. By applying the recursion (2.1)successively 3 individually. This is usually computationally infeasible. The advantage of the network representation is that it at stages c , c - 1 , . , 1 , we obtain exactly one node at stage 0 , labeled (0, 0 , . , 0 ) . This is the terminal circumvents the need to explicitly enumerate each path. node. To see this, we define the following two items: The length of an arc from node ( k ,R I ~. ,. , Rrh)to SP(k, R l k , . , Rrk) =the length of the shortest sub- node ( k - 1 , R I X xI -, . , R,..x-I)is equal to path from node ( k ,R l k , . ., R,k) to node (0, 0 , . , O ) , A complete path through the network is defined as a LP(k, R I x ,. , R r k )= the length of the longest sub- succession of connected arcs directed from the initial path from node ( k ,R I x ,.

A Network Algorithm for Performing Fisher's Exact Test in R × C Contingency Tables

Research Brief March 2017 Publication #2017-16

Percent R, X and Z Based on Transformer KVA

The Constraint on Public Debt When R<G but G<M

Reading Source Data in R

R Markdown Cheat Sheet I

R for Beginners

On the Phonetic Nature of the Latin R

Proposal for Generation Panel for Latin Script Label Generation Ruleset for the Root Zone

Recommendation Itu-R S.1003-2*

Reference: Humulin R U-500 Information

MUFI Character Recommendation V. 3.0: Code Chart Order

R Color Cheatsheet