INFERENCE OF ATTRIBUTABLE RISK FOR MULTIPLE EXPOSURE LEVELS UNDER CROSS-SECTIONAL SAMPLING DESIGN

Tanweer Shapla

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2006

Committee:

Truc T. Nguyen, Co-Advisor

John T. Chen, Co-Advisor

Louisa Ha, Graduate Faculty Representative

Arjun K. Gupta


ABSTRACT

Truc T. Nguyen, Co-Advisor

John T. Chen, Co-Advisor

Attributable risk (AR) plays an important role in assessing the relationship between the risk factor and the disease in public health and biomedical sciences. This research is intended to develop point and interval estimation procedures for the inference of the attributable risk when the data set is obtained by means of a cross-sectional sampling design.

In this thesis, we develop a novel approach for estimating the variance of the Maximum Likelihood Estimate of AR for a dichotomous risk factor by using the Delta method. The new method is computationally much easier than the existing method using the Fisher Information Matrix. This method has also been extended for a risk factor with multiple exposure levels without and with confounders. The performance of the new method has been justified with real-life examples and by Monte Carlo simulation. The simulation shows that the confidence interval estimator performs very well in terms of the coverage probability and the average length of the interval estimated.

For the small-sample case, where large-sample approximation theory cannot be applied, we develop an inference procedure for a dichotomous risk factor using an exact test of positive association between the risk factor and the disease outcome, which has never been considered before for attributable risk. This procedure has been extended for a risk factor with multiple exposure levels. The attributable risk has also been studied for an intermediate base level, which is useful for detecting the significance of a particular level of a risk factor with multiple exposure levels. This technique can be used to amalgamate some of the insignificant exposure levels and hence reduce the number of exposure levels of the risk factor under study. Statistical properties of attributable risk have been explored under certain conditions on the cell probabilities.

The behavior of the test of positive dependence using test statistics based on the estimate of AR and on the estimate of the logarithm of the odds ratio, logOR, has been studied. It has been shown that in some subsets of the alternative, the test based on the estimate of AR is better than the test based on the estimate of logOR, and in some other subsets the converse holds. In an exact test for small samples, it has been shown that the two statistics based on the estimates of AR and logOR are equivalent.


ACKNOWLEDGEMENTS

I would like to thank my advisors, Professor Truc T. Nguyen and Professor John T. Chen, for their thoughtful guidance, encouragement and inspiration throughout this research. I would also like to express my thanks to the members of my committee, Professor Arjun K. Gupta and Professor Louisa Ha, for their advice and time.

I am grateful to the Department of Mathematics and Statistics for providing me with a teaching assistantship and a wonderful research environment. My special thanks go to Mary Busdeker, Marcia Seubert and Cyndi Patterson for their assistance and cooperation during my stay at Bowling Green State University.

I express my sincere gratitude to my parents and family for their love and sacrifice.

Finally, my deepest gratitude goes to my husband M. Khairul Islam for his continuous support, understanding and encouragement throughout my graduate study.

Bowling Green, OH Tanweer J. Shapla

August, 2006


TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Range of AR
1.3 Some terminologies related to AR
1.3.1 AR in exposed
1.3.2 Excess incidence
1.3.3 Prevented fraction
1.4 Estimation of AR
1.5 Application of the attributable risk in real life
1.6 Thesis objectives

CHAPTER 2: ESTIMATION OF $AR_k$ WITHOUT CONFOUNDERS
2.1 Introduction
2.2 Estimation of the AR for a dichotomous risk factor
2.2.1 The Fisher’s Information Matrix
2.2.2 The Delta method
2.3 The $AR_k$ for a risk factor with multiple exposure levels
2.3.1 A model setup for $AR_k$
2.3.2 Motivation for using the Multivariate Delta method
2.3.3 The derivation of the asymptotic variance of $\widehat{AR}_k$
2.4 Numerical example and simulations
2.5 Some special cases for $AR_k$
2.5.1 Monotonicity of $AR_k$
2.5.2 The AR with respect to the intermediate level

CHAPTER 3: ESTIMATION OF $AR_k$ WITH CONFOUNDERS
3.1 Introduction
3.2 Model development and point estimation of $AR_k$
3.3 Derivation of the asymptotic variance of $\widehat{AR}_k$
3.4 Example and simulation

CHAPTER 4: TESTING POSITIVE ASSOCIATION USING $\widehat{AR}$
4.1 Introduction
4.2 Test of Hypothesis using $\widehat{AR}$ for large sample
4.3 The variation of AR and OR in the set of 2 × 2 tables
4.4 Estimation of power of the test using $\widehat{AR}$
4.5 Comparing nominal size and power of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$
4.5.1 Comparing nominal size of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$
4.5.2 Comparing power of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$

CHAPTER 5: EXACT TEST FOR POSITIVE DEPENDENCE USING $\widehat{AR}$
5.1 Introduction
5.2 Small-sample test for a 2 × 2 table
5.2.1 Fisher’s exact test for a 2 × 2 table
5.2.2 Testing procedure for independence for a small-sample using $\widehat{AR}$
5.3 Exact test for a (K + 1) × 2 contingency table
5.3.1 Useful results regarding independence
5.3.2 Extension of Fisher’s exact test for an I × J table
5.3.3 Testing procedure for independence for a small-sample using overall $\widehat{AR}$
5.4 Comparing power of exact tests using $\widehat{AR}$ and $\log\widehat{OR}$

REFERENCES


LIST OF TABLES

1.1 Interpreting BMI scores
2.1 Classifying n individuals by exposure levels and status of the disease
2.2 Classifying subjects according to their exposure levels and the status of the disease
2.3 Distribution of 4270 subjects into four exposure levels with respective disease status
2.4 Probability distribution of subjects with respect to exposure levels and the disease status in the population to be considered for simulations
2.5 Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$, for population described in Table 2.4
2.6 Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i) and (ii)
2.7 Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$, for population described in Table 2.6
2.8 Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i) and (ii)
2.9 Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$, for population described in Table 2.8
2.10 Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i) and (ii)
2.11 Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$, for population described in Table 2.10
2.12 Distribution of 966 subjects into four exposure levels with the respective disease status
3.1 Classifying subjects by the exposure and confounding levels with respect to the status of the disease
3.2 Classifying 966 subjects by exposure levels, race and disease status
3.3 Estimated values of $AR_k$ and the confidence intervals
3.4 Probability distribution of subjects with respect to exposure levels, confounding levels and the disease status in the population to be considered for simulation
3.5 Simulated coverage probability and average length together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$, for population described in Table 3.4
4.1 Probability distribution of subjects with respect to exposure levels and the disease status
4.2 Cross-classification of 2784 subjects by the status of the respiratory disease and locomotor disease
4.3 Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.1$, $p_{0\cdot} = 0.9$
4.4 Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.9$, $p_{0\cdot} = 0.1$
4.5 Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.5$, $p_{0\cdot} = 0.5$
4.6 Estimated level ($\alpha = 0.05$) for tests $Z$ and $Z^*$
4.7 Estimated power for test statistics using $\widehat{AR}$ and $\log\widehat{OR}$ where $AR/A \ge \log OR/B$
4.8 Estimated power for test statistics using $\widehat{AR}$ and $\log\widehat{OR}$ where $AR/A < \log OR/B$
5.1 Distribution of 72730 subjects according to the birth weight and life status
5.2 Distribution of 28 infants according to the status of life and the birth weight
5.3 Probability distribution of subjects with respect to exposure levels and the disease status
5.4 Distribution of 17 subjects according to the maternal age and birth weight of offspring
5.5 Estimated power of exact test with the test statistic $\widehat{AR}$


LIST OF FIGURES

4.1 Geometric representation for $S$
4.2 The variation of AR on $S_{p_{1\cdot}}$
4.3 Power curve for AR and log OR for the subset $P_1^{+}$
4.4 Power curve for AR and log OR for the subset $P_2^{+}$


CHAPTER 1

INTRODUCTION

1.1 Introduction

Attributable risk (AR) is one of the most commonly used epidemiological indices for quantifying the impact of a risk factor on the development of a disease. It measures the fraction of the risk of the disease that might be avoided by eliminating the risk factor from the population. It is now widely used by epidemiologists and public health practitioners in disease prevention programs. Although the relative risk and the odds ratio are also measures of association between the disease and the risk factor, they cannot measure this fraction of the risk. For example, a risk factor with a high relative risk may have a low prevalence rate, whereas another risk factor with a relatively low relative risk may have a very high prevalence rate in the population and hence be responsible for a sizeable fraction of cases. Therefore, when studying a disease with several risk factors that vary both in their relative risks and in their prevalence, concentrating only on measures of relative risk may be inadequate. Because attributable risk takes into account both the association (between exposure and disease) and the prevalence of exposure, it plays an important role in locating the major risk factors.

While several definitions of AR are available in the literature, this thesis concentrates on the definition given by Levin (1953). It is defined as the proportion of the total disease risk in the population that could be avoided if the effect associated with the risk factor of interest were totally eliminated. It has also been termed etiologic fraction and fraction of etiology (Miettinen, 1974), attributable fraction (Ouellet et al., 1979; Greenland and Robins, 1988; Last, 1983), and population attributable risk per cent (Cole and MacMahon, 1971). Up to 16 different names have been used to denote the attributable risk in the literature (Gefeller, 1990).

Let $E_1$ and $E_0$ respectively denote the presence and absence of the risk factor under study, and $D$ and $\bar{D}$ the presence and the absence of the disease outcome, respectively. Exposure to the risk factor is not necessary for the occurrence of the disease. Then $P(D \mid E_0)$ is the chance of getting the disease in the unexposed group. If the risk factor were of no effect, one could expect the same rate in the exposed group. Therefore, if the risk factor were of no effect, then the probability of getting the disease in the exposed group would be $P(D \mid E_0)P(E_1)$. If the risk factor increases the chance of getting the disease, then the probability of getting the disease in the exposed group will be $P(D \mid E_1)P(E_1)$. Then Levin’s attributable risk AR is defined as

$$AR = \frac{P(D \mid E_1)P(E_1) - P(D \mid E_0)P(E_1)}{P(D)}. \tag{1.1.1}$$

By the law of total probability, $P(D) = P(D \mid E_1)P(E_1) + P(D \mid E_0)P(E_0)$, and the fact that $P(E_1) + P(E_0) = 1$, the above expression turns out to be

$$AR = \frac{P(D) - P(D \mid E_0)}{P(D)}. \tag{1.1.2}$$

The definition in (1.1.2) appears in the literature while using AR for a dichotomous risk factor with dichotomous disease outcome. It is interpreted as the proportion of reduction in P(D) , the rate of occurrence of the disease, given that the risk factor were totally eliminated from the population of concern.

From equation (1.1.1), it also follows that

$$AR = \frac{P(E_1)(RR - 1)}{1 + P(E_1)(RR - 1)}, \tag{1.1.3}$$

where $RR$ is the relative risk, $RR = P(D \mid E_1)/P(D \mid E_0)$. The definition in (1.1.3) has also been used for studying AR when the use of $RR$ is sensible under the sampling design at hand.
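The equivalence of (1.1.1), (1.1.2) and (1.1.3) can be checked numerically. The following is a minimal sketch; the function name `levin_ar` and the input probabilities are hypothetical and chosen only for illustration.

```python
def levin_ar(p_d_given_e1, p_d_given_e0, p_e1):
    """Return Levin's AR computed from (1.1.1), (1.1.2) and (1.1.3)."""
    p_e0 = 1.0 - p_e1
    # Law of total probability for the overall disease risk P(D).
    p_d = p_d_given_e1 * p_e1 + p_d_given_e0 * p_e0
    ar_1 = (p_d_given_e1 * p_e1 - p_d_given_e0 * p_e1) / p_d   # (1.1.1)
    ar_2 = (p_d - p_d_given_e0) / p_d                          # (1.1.2)
    rr = p_d_given_e1 / p_d_given_e0                           # relative risk
    ar_3 = p_e1 * (rr - 1.0) / (1.0 + p_e1 * (rr - 1.0))       # (1.1.3)
    return ar_1, ar_2, ar_3


# Hypothetical inputs: P(D|E1) = 0.3, P(D|E0) = 0.1, P(E1) = 0.4.
print(levin_ar(0.3, 0.1, 0.4))
```

With these inputs $P(D) = 0.18$, and all three expressions agree at $4/9 \approx 0.444$ up to floating-point rounding.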

1.2 Range of AR

It follows from the definition of AR in (1.1.3) that AR lies between 0 and 1 if $RR > 1$, that is, if the exposure factor under study is a risk factor. AR increases both with $RR$ and with the prevalence $P(E_1)$. AR is equal to $(RR - 1)/RR$ when $P(E_1) = 1$ and tends to 1 for an infinitely high relative risk provided the prevalence is greater than 0. AR is equal to 0 when either there is no association between exposure and disease ($RR = 1$) or no subject is exposed in the population ($P(E_1) = 0$). For a protective factor ($RR < 1$), AR takes negative values and varies from 0 to $-\infty$. In this case, use of the prevented fraction (Last, 1983) makes more sense.

1.3 Some terminologies related to AR

1.3.1 AR in exposed

The population attributable risk among the exposed ($ARE$) is defined as the proportion of disease cases that can be attributed to an exposure among the exposed group only (Cole et al., 1971; Levin, 1953; MacMahon and Pugh, 1970; Miettinen, 1974). It is mathematically expressed as

$$ARE = \frac{P(D \mid E_1) - P(D \mid E_0)}{P(D \mid E_1)}$$

and can easily be shown to be $ARE = (RR - 1)/RR$, a one-to-one increasing function of the relative risk. When the exposure factor under study is a risk factor ($RR > 1$), it follows from $ARE = (RR - 1)/RR$ that $ARE$ lies between 0 and 1. The value of $ARE$ increases with the strength of the association between exposure factor and disease measured by the relative risk and tends to 1 for an infinitely high relative risk. It is clear that $ARE$ equals 0 when there is no association between exposure and disease ($RR = 1$).

The attributable risk among the exposed has wide application in legal arguments, where it is important to determine whether an action in a lawsuit is associated with the disease and the exposure to a risk factor (Finkelstein and Levin, 2001). But this measure is less useful than the attributable risk, because it is only a one-to-one transformation of $RR$ and does not take into account the prevalence of the exposure $E_1$. Therefore, it does not indicate the impact of the risk factor among all who have the disease in the population. Thus, AR has different applications in public health research.

Because $P(D \mid E_1)P(E_1)/P(D) = P(E_1 \mid D)$, it easily follows from equation (1.1.1) that

$$AR = ARE \cdot P(E_1 \mid D),$$

which establishes a simple relationship between AR and $ARE$ and shows that the population attributable risk is equal to the attributable risk among the exposed scaled down by the prevalence of the risk factor among the diseased.
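The product form above can be illustrated with a short computation. This is a sketch under assumed inputs; the function name `ar_via_exposed` and the probabilities are hypothetical.

```python
def ar_via_exposed(p_d_given_e1, p_d_given_e0, p_e1):
    """Return (ARE, P(E1 | D), AR) with AR obtained as ARE * P(E1 | D)."""
    # Overall disease risk by the law of total probability.
    p_d = p_d_given_e1 * p_e1 + p_d_given_e0 * (1.0 - p_e1)
    # Attributable risk among the exposed.
    are = (p_d_given_e1 - p_d_given_e0) / p_d_given_e1
    # Prevalence of exposure among the diseased (Bayes' rule).
    p_e1_given_d = p_d_given_e1 * p_e1 / p_d
    return are, p_e1_given_d, are * p_e1_given_d


# Hypothetical inputs: P(D|E1) = 0.3, P(D|E0) = 0.1, P(E1) = 0.4.
print(ar_via_exposed(0.3, 0.1, 0.4))
```

For these inputs $ARE = 2/3$ and $P(E_1 \mid D) = 2/3$, so the product is $4/9$, the same value that (1.1.2) gives directly.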

1.3.2 Excess incidence

The excess incidence $\Delta$ (Berkson, 1958; MacMahon and Pugh, 1970; Mausner and Bahn, 1974) is defined as the difference between the incidence rate in the exposed and the incidence rate in the unexposed and is given by

$$\Delta = P(D \mid E_1) - P(D \mid E_0),$$

which can be re-expressed as

$$\Delta = P(D \mid E_0)(RR - 1).$$

Thus, it takes into account the incidence of the disease in the unexposed and the strength of the association between exposure and disease. It quantifies the difference in incidence that can be attributed to exposure for an individual. It has also been termed excess risk (Schlesselman, 1982), Berkson's simple difference (Walter, 1976), incidence density difference (Miettinen, 1976), excess prevalence (Walter, 1976) or even attributable risk (Markush, 1977; Schlesselman, 1982).

1.3.3 Prevented fraction

Miettinen (1974) has defined attributable risk or prevented fraction for beneficial factors. The prevented fraction PF measures the impact of an association between a protective factor or intervention and disease. It is defined as

PF = (P(D | E0 ) − P(D)) / P(D | E0 ) ,

where the denominator is the hypothetical probability of disease in the population in the absence of the protective factor.
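As an illustration of the definition above, the sketch below evaluates $PF$ for a hypothetical protective factor. It also evaluates $P(E_1)(1 - RR)$, which equals $PF$ by the law of total probability; that identity is worked out here and is not quoted from the text. The function name and inputs are assumptions.

```python
def prevented_fraction(p_d_given_e1, p_d_given_e0, p_e1):
    """Return PF from its definition and from P(E1) * (1 - RR)."""
    # Overall disease risk by the law of total probability.
    p_d = p_d_given_e1 * p_e1 + p_d_given_e0 * (1.0 - p_e1)
    pf_definition = (p_d_given_e0 - p_d) / p_d_given_e0
    rr = p_d_given_e1 / p_d_given_e0
    pf_from_rr = p_e1 * (1.0 - rr)
    return pf_definition, pf_from_rr


# Hypothetical protective factor: P(D|E1) = 0.05, P(D|E0) = 0.10, P(E1) = 0.6.
print(prevented_fraction(0.05, 0.10, 0.6))
```

Here $P(D) = 0.07$, so both expressions give $PF = 0.3$ up to rounding.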

1.4 Estimation of AR

Lilienfeld (1973) drew attention to the general importance of attributable risk in health research. Miettinen (1974) provided a systematic discussion of various fundamental aspects of the estimation of the AR. Walter (1976) discussed various measures of the AR together with a rationale for their use as an alternative to relative risk in health research. Leung and Kupper (1981) proposed the logit transformation for estimating the AR under various sampling schemes. Whittemore (1982) extended Levin’s measure of attributable risk to account for confounding by other aetiologic factors. She considered point estimates and confidence intervals for the extended measure based on matched or randomly sampled case-control data. The small-sample properties of these estimates and confidence intervals were also investigated by using simulated data. A straightforward and unified approach was presented by Bruzzi et al. (1985) for calculating population attributable risk in the general multivariate setting for a case-control study. The authors emphasized the benefits to be obtained from logistic regression models, so that risks need not be estimated separately in a large number of strata, some of which may contain inadequate numbers of individuals. The methods of adjustment for confounding factors in the estimation of the attributable risk in case-control studies were reviewed by Benichou (1991) and Chen (2001). Other substantial contributions to the estimation of attributable risks in case-control studies were made by Whittemore (1983), Coughlin et al. (1994), Benichou (1993), Breslow and Day (1980), Cole and MacMahon (1971), Coughlin et al. (1991), Drescher and Becher (1997), Kooperberg and Petitti (1991), Kuritz and Landis (1987, 1988a, 1988b), Mantel and Haenszel (1959), Mezzetti et al. (1996), Schlesselman (1982), Denman and Schlesselman (1983), Drescher and Schill (1991), Gefeller and Windeler (1991), Taylor (1977), Lubin (1981), Gefeller and Eide (1993), and Eide and Gefeller (1995). Most recently, Lui (2003) developed asymptotic interval estimators of the AR for multiple exposure levels in case-control studies in the presence of confounders.

A number of publications have also dealt with methodologies for the estimation of the AR in cross-sectional studies (Walter, 1976; Gefeller, 1990; 1992a). When there are confounders, Gefeller (1992b) compared several methods of adjusting for confounders, mostly relevant to the point estimation of AR. Eide and Gefeller (1995) proposed stepwise estimation of attributable fractions for a set of exposure variables with confounders. Basu and Landis (1995) estimated AR for cross-sectional studies based on a logistic regression model without including any interaction term between exposure and the stratum effect. For confidence interval construction, Fleiss (1979) provided the standard error of $\log(1 - \widehat{AR})$ and then constructed the confidence interval of AR for cross-sectional studies. Lui (2001a) constructed asymptotic confidence intervals for AR and studied their finite-sample performance under a cross-sectional sampling scheme with no confounders. Lui (2001b) further discussed interval estimators of AR for a dichotomous risk factor in cross-sectional studies in the presence of confounders and studied the finite-sample performance of asymptotic interval estimators of AR. Shapla et al. (2005) developed point and interval estimation procedures for the attributable risk under a cross-sectional sampling scheme in the presence of confounders for a risk factor with multiple exposure levels.

1.5 Application of the attributable risk in real life

The AR is one of the most important and commonly-used epidemiological indices to assess the potential impact of a risk factor and compare various prevention strategies. It has been used by the epidemiologists and public health administrators to locate the factors that may increase the chance of developing a particular disease and take initiatives to prevent those factors. Since it provides the proportion of the disease risk that could be reduced if the risk factor were totally eliminated from the population of interest, it plays an important role in the disease prevention programs. This section concentrates on the application of the attributable risk in public health where it is necessary to know the effect of a risk factor for the occurrence of a disease. The following example explaining the effect of body mass index (BMI) in developing high blood pressure would be helpful to discuss the application of the AR .

Body mass index (BMI) is a reliable indicator of total body fat. It is defined as the ratio of the person’s weight to the square of that person’s height. Table 1.1 provides different BMI scores and their respective meanings.

Table 1.1: Interpreting BMI scores

BMI               Interpretation
Below 18.5        Underweight
18.5-24.9         Normal
25.0-29.9         Overweight
30.0 and above    Obesity

Source: National Heart, Lung, and Blood Institute (NHLBI): Obesity Education Initiative, http://nhlbisupport.com/bmi/.
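The BMI definition and the cut-offs of Table 1.1 can be written as a small sketch. The units (kilograms and metres) are an assumption, since the text does not state them, and the function names are hypothetical. Values falling between the printed bands (for example 24.95) are assigned to the lower band here, which is also an assumption.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by the square of height (assumed kg and m)."""
    return weight_kg / height_m ** 2


def interpret_bmi(value):
    """Map a BMI score to the categories of Table 1.1."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal"
    if value < 30.0:
        return "Overweight"
    return "Obesity"


# Hypothetical person: 70 kg, 1.75 m.
score = bmi(70.0, 1.75)
print(round(score, 1), interpret_bmi(score))
```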

People who are overweight or obese have a greater chance of developing high blood pressure (hypertension), high blood cholesterol or other lipid disorders, type 2 diabetes, heart disease, stroke, and certain cancers. The body fat in overweight or obese people forms clots in their blood vessels, which narrows the vessels. Therefore, the pressure exerted on the arteries rises. High blood pressure, according to the American Heart Association, is a systolic pressure of 140 mm Hg or higher and/or a diastolic pressure of 90 mm Hg or higher. Unfortunately, in recent years hypertension has been on the rise. Some studies show an increase of 30% in the United States without any signs of these numbers decreasing. High blood pressure, or hypertension, directly increases the risk of coronary heart disease and stroke. Therefore, it is important to know the impact of the BMI on the development of hypertension. Suppose the BMI factor has been categorized into the following four levels: BMI < 23, 23 ≤ BMI < 25, 25 ≤ BMI < 27, and 27 ≤ BMI, and that a diastolic pressure greater than or equal to 90 mm Hg is considered as disease. Of these four levels, BMI < 23 is taken as the baseline level. In order to understand the association between the BMI and hypertension, one might be interested in the proportion of the risk of developing hypertension that could be eliminated if a BMI greater than or equal to 27 were reduced to the baseline level (< 23). This question can be answered by measuring the attributable risk. The procedure for point and interval estimation of the AR based on sample data is described in Chapter 2 of this thesis.

While studying the effect of a risk factor on the development of a disease, there might be a factor (or factors) associated both with the risk factor and with the disease outcome. Such a factor is called a confounding factor. For example, studies show that African-Americans are more likely than U.S. Whites to be overweight, obese, and physically inactive. Also, hypertension has a higher prevalence in Blacks than in Whites. Thus, the factor race (black and white) is associated both with the BMI and with hypertension, and is therefore a confounding factor. So, while measuring the AR for the BMI, one needs to control for race. Otherwise, the results will have limited usefulness. The method of adjustment for a confounding factor in the estimation of AR for a risk factor with multiple levels under the cross-sectional study design is carried out in Chapter 3 of this thesis.

1.6 Thesis objectives

The purpose of the thesis is to develop estimation procedures for the attributable risk under cross-sectional sampling scheme for dichotomous and multiple exposure levels, respectively. The specific topics of the investigation of this thesis are listed below.

i. Investigating the existing method of estimation of AR for dichotomous exposure levels and proposing an alternative estimation for the variance of the estimator.
ii. Generalizing the existing method for point and interval estimation of the AR to multiple exposure levels.
iii. Developing point and interval estimation procedures for AR for a risk factor with multiple exposure levels with confounding factors.
iv. Exploring mathematical/statistical properties of AR.
v. Expressing AR with respect to intermediate exposure levels.
vi. Studying exact test regarding positive association with the use of AR.


CHAPTER 2

ESTIMATION OF $AR_k$ WITHOUT CONFOUNDERS

2.1 Introduction

The attributable risk (AR) plays an important role in assessing the relationship between the risk factor and the disease in public health and biomedical sciences. This chapter deals with the point and interval estimation of the AR for a risk factor. Section 2.2.1 reviews the estimation of the AR and of its variance by using Fisher’s Information matrix for a dichotomous risk factor under a cross-sectional design. Section 2.2.2 derives, for comparison, an asymptotic variance of the estimate of the AR from the definition of the attributable risk by a direct application of the delta method. It follows that the two methods yield identical results. Section 2.3 extends the point and interval estimation procedure of the attributable risk to a risk factor with multiple exposure levels. In Section 2.3.1, a model is developed in order to estimate the attributable risk. In Section 2.3.2, a motivation for using the multivariate delta method is given. A formula for an asymptotic variance of $\widehat{AR}_k$ is developed in Section 2.3.3. A real-life example on cigarette smoking and lung disease is used to explain the estimation procedure in Section 2.4. Monte Carlo simulations have also been applied, generating 10000 random samples from a specified multinomial distribution, to assess the finite-sample performance of the confidence intervals constructed by using Wald’s test statistic (Walter, 1976). It follows from the simulations that the confidence interval estimators perform well with respect to coverage probabilities and average lengths. The Monte Carlo sample means of the estimates obtained from the repeated samples approach the corresponding true parameters of the distribution. The Monte Carlo standard error also decreases as the sample size increases. In Section 2.5.1, we study the behavior of $AR_k$ under certain conditions on the cell probabilities. Finally, in Section 2.5.2, we derive an expression for $AR_{jk}$, which is the attributable risk for reducing the exposure level from level $k$ to level $j$.
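The kind of simulation described above can be sketched for the simplest (dichotomous) case. This is an illustration only, not the simulation code of the thesis: it assumes a plain Wald interval built from the variance formula (2.2.3) of Section 2.2.1, uses hypothetical cell probabilities, and skips samples containing an empty cell. All function names are assumptions.

```python
import math
import random
from statistics import NormalDist


def ar_and_se(counts):
    """AR estimate and its standard error from counts (n11, n12, n01, n02)."""
    n = sum(counts)
    p11, p12, p01, p02 = (c / n for c in counts)
    ar = 1.0 - p01 / ((p01 + p02) * (p11 + p01))
    bracket = p01 * (p12 * p01 - p11 * p02) + p11 * p02
    var = (1.0 - ar) ** 4 * (p11 + p01) * (p01 + p02) * bracket / (n * p01 ** 3)
    return ar, math.sqrt(max(var, 0.0))


def simulate_coverage(probs, n, reps, level=0.95, seed=0):
    """Monte Carlo coverage and average length of the Wald interval for AR."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    p11, p12, p01, p02 = probs
    true_ar = 1.0 - p01 / ((p01 + p02) * (p11 + p01))
    covered, total_length, used = 0, 0.0, 0
    for _ in range(reps):
        # One multinomial sample of size n over the four cells.
        draws = rng.choices(range(4), weights=probs, k=n)
        counts = [draws.count(cell) for cell in range(4)]
        if 0 in counts:
            continue  # estimator undefined or degenerate; skipped in this sketch
        ar, se = ar_and_se(counts)
        lower, upper = ar - z * se, ar + z * se
        covered += lower <= true_ar <= upper
        total_length += upper - lower
        used += 1
    return covered / used, total_length / used


# Hypothetical population: (p11, p12, p01, p02) = (0.2, 0.3, 0.1, 0.4).
print(simulate_coverage([0.2, 0.3, 0.1, 0.4], n=400, reps=500, seed=1))
```

Increasing `reps` toward the 10000 replications mentioned above reduces the Monte Carlo error of the estimated coverage at the cost of run time.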

2.2 Estimation of the AR for a dichotomous risk factor

2.2.1 The Fisher’s Information Matrix

Let us consider the estimation of AR for a dichotomous risk factor and a dichotomous outcome variable under a cross-sectional sampling scheme. Let $E_k$, $k = 0, 1$, be the levels of the risk factor under investigation, with $E_0$ referred to as the baseline or reference level. Suppose that a random sample of $n$ individuals is considered and each individual is then dichotomized on the basis of presence and absence of both the disease and the risk factor. Let $n_{kd}$ be the random frequency of the $n$ individuals falling into the cell at exposure level $k$ with disease status $d$, $d = 1, 2$, where 1 (2) means the presence (absence) of the disease.

The table below summarizes the overall data structure given by the aforesaid design.

Table 2.1: Classifying n individuals by exposure levels and status of the disease

Exposure status     Disease status                      Total
                    $D$             $\bar{D}$
$E_1$ (present)     $n_{11}$        $n_{12}$            $n_{1\cdot}$
$E_0$ (absent)      $n_{01}$        $n_{02}$            $n_{0\cdot}$
Total               $n_{\cdot 1}$   $n_{\cdot 2}$       $n$

Let $p_{kd}$ be the probability of a subject falling into the cell having observed frequency $n_{kd}$. Note that $\sum_k \sum_d n_{kd} = \sum_d n_{\cdot d} = n$, $n_{k\cdot} = n_{k1} + n_{k2}$, $0 < p_{kd} < 1$, and $p_{k\cdot} = p_{k1} + p_{k2}$. Let $\varphi_1$ and $\varphi_2$ be the conditional probabilities of developing the disease among the exposed and unexposed groups, that is, $\varphi_1 = P(D \mid E_1)$ and $\varphi_2 = P(D \mid E_0)$. Let $\theta$ be the overall prevalence of the risk factor in the population, that is, $\theta = P(E_1)$. Then, of course,

$$\varphi_1 = \frac{p_{11}}{p_{11} + p_{12}}, \qquad \varphi_2 = \frac{p_{01}}{p_{01} + p_{02}}, \qquad \theta = p_{11} + p_{12}.$$

The relative risk of an exposed compared to a non-exposed individual is $\psi = \varphi_1/\varphi_2$. The attributable risk for a dichotomous risk factor can be written as

$$AR = \frac{P(D) - P(D \mid E_0)}{P(D)}.$$

Walter (1976) has expressed AR as

$$AR = 1 - \left[(p_{11} + p_{12})(\psi - 1) + 1\right]^{-1}. \tag{2.2.1}$$

Note that the random vector $N = (n_{11}, n_{12}, n_{01}, n_{02})'$ follows the multinomial distribution with parameters $n$ and $P = (p_{11}, p_{12}, p_{01}, p_{02})'$, for which the log-likelihood is given by

$$L = \ln K + n_{11}\ln p_{11} + n_{12}\ln p_{12} + n_{01}\ln p_{01} + n_{02}\ln p_{02}.$$

The maximum likelihood estimates of the $p_{kd}$ are $\hat{p}_{kd} = n_{kd}/n$, $k = 0, 1$; $d = 1, 2$. Then by the invariance property (Casella and Berger, 2002) of the MLE, the maximum likelihood estimator of AR, $\widehat{AR}$, is given by

$$\widehat{AR} = 1 - \left[(\hat{p}_{11} + \hat{p}_{12})(\hat{\psi} - 1) + 1\right]^{-1}.$$
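The plug-in estimator above translates directly into a few lines of code. The sketch below is illustrative; the function name `ar_mle` and the cell counts are hypothetical, and all four counts are assumed to be positive.

```python
def ar_mle(n11, n12, n01, n02):
    """Maximum likelihood estimate of AR from the counts of a 2 x 2 table."""
    n = n11 + n12 + n01 + n02
    # Cell proportions are the MLEs of the cell probabilities.
    p11, p12, p01, p02 = n11 / n, n12 / n, n01 / n, n02 / n
    phi1 = p11 / (p11 + p12)   # estimated P(D | E1)
    phi2 = p01 / (p01 + p02)   # estimated P(D | E0)
    psi = phi1 / phi2          # estimated relative risk
    return 1.0 - 1.0 / ((p11 + p12) * (psi - 1.0) + 1.0)


# Hypothetical table: n11 = 20, n12 = 30, n01 = 10, n02 = 40.
print(ar_mle(20, 30, 10, 40))
```

For this table $\hat{\psi} = 0.4/0.2 = 2$ and $\hat{p}_{11} + \hat{p}_{12} = 0.5$, so $\widehat{AR} = 1 - 1/1.5 = 1/3$.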

One may write (Walter, 1976)

$$p_{12} = 1 - p_{11} - \frac{p_{01}}{f} \quad \text{and} \quad p_{02} = -p_{01} + \frac{p_{01}}{f}, \quad \text{where } f = (p_{11} + p_{01})(1 - AR),$$

in order to express the likelihood in terms of $p_{11}$, $p_{01}$ and $AR$. The algebraically distinct elements of the information matrix, according to Walter (1976), are given by

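$$n^{-1}E\left(-\frac{\partial^2 L}{\partial AR^2}\right) = \frac{p_{01}^2\,(p_{12} + p_{02})}{f^2 (1 - AR)^2\, p_{12}\, p_{02}}$$

$$n^{-1}E\left(-\frac{\partial^2 L}{\partial p_{11}^2}\right) = \frac{1}{p_{11}} + \frac{1}{p_{12}} - \frac{2 p_{01}}{f\,(p_{11} + p_{01})\, p_{12}} + \frac{p_{01}^2\,(p_{12} + p_{02})}{f^2 (p_{11} + p_{01})^2\, p_{12}\, p_{02}}$$

$$n^{-1}E\left(-\frac{\partial^2 L}{\partial p_{01}^2}\right) = \frac{1}{p_{01}} + \frac{1}{p_{02}} - \frac{2 p_{11}}{f\,(p_{11} + p_{01})\, p_{02}} + \frac{p_{11}^2\,(p_{12} + p_{02})}{f^2 (p_{11} + p_{01})^2\, p_{12}\, p_{02}}$$

$$n^{-1}E\left(-\frac{\partial^2 L}{\partial AR\,\partial p_{11}}\right) = \frac{p_{01}}{f (1 - AR)\, p_{12}} - \frac{p_{01}^2\,(p_{12} + p_{02})}{f^2 (p_{11} + p_{01})(1 - AR)\, p_{12}\, p_{02}}$$

$$n^{-1}E\left(-\frac{\partial^2 L}{\partial AR\,\partial p_{01}}\right) = \frac{-p_{01}}{f (1 - AR)\, p_{02}} + \frac{p_{11}\, p_{01}\,(p_{12} + p_{02})}{f^2 (p_{11} + p_{01})(1 - AR)\, p_{12}\, p_{02}}$$

$$n^{-1}E\left(-\frac{\partial^2 L}{\partial p_{11}\,\partial p_{01}}\right) = \frac{p_{11} p_{02} + p_{12} p_{01}}{f\,(p_{11} + p_{01})\, p_{12}\, p_{02}} - \frac{p_{11}\, p_{01}\,(p_{12} + p_{02})}{f^2 (p_{11} + p_{01})^2\, p_{12}\, p_{02}}$$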

The determinant of the information matrix, after simplification, is

( p + p ) 4 n 3 01 02 (2.2.2) 2 3 ( p11 + p01 ) p11 p12 p01 p02

Then,

\[ v_I(\widehat{AR}) = \frac{(1-AR)^4\,(p_{11}+p_{01})(p_{01}+p_{02})\{p_{01}(p_{12}p_{01} - p_{11}p_{02}) + p_{11}p_{02}\}}{n\,p_{01}^3}. \tag{2.2.3} \]


2.2.2 The Delta method

In this section, we consider the delta method (Agresti, 2002) for the derivation of the asymptotic variance of $\widehat{AR}$, which can be shown to be identical to (2.2.3).

Let $(n, P) = (n, p_{11}, p_{12}, p_{01}, p_{02})$ be the parameters of a multinomial distribution, where $p_{ij} > 0$ and $\sum_{i=0}^{1}\sum_{j=1}^{2} p_{ij} = 1$.

The covariance matrix of the estimate of $P$ is given by

\[ \Sigma = \frac{1}{n}\begin{bmatrix} p_{11}(1-p_{11}) & -p_{11}p_{12} & -p_{11}p_{01} & -p_{11}p_{02} \\ -p_{12}p_{11} & p_{12}(1-p_{12}) & -p_{12}p_{01} & -p_{12}p_{02} \\ -p_{01}p_{11} & -p_{01}p_{12} & p_{01}(1-p_{01}) & -p_{01}p_{02} \\ -p_{02}p_{11} & -p_{02}p_{12} & -p_{02}p_{01} & p_{02}(1-p_{02}) \end{bmatrix}, \]

which is singular. We can partition $\Sigma$ in the following way:

\[ \Sigma = \frac{1}{n}\begin{bmatrix} \Sigma_1 & -u\,p_{02} \\ -u'\,p_{02} & p_{02}(1-p_{02}) \end{bmatrix}, \]

where

\[ \Sigma_1 = \begin{bmatrix} p_{11}(1-p_{11}) & -p_{11}p_{12} & -p_{11}p_{01} \\ -p_{12}p_{11} & p_{12}(1-p_{12}) & -p_{12}p_{01} \\ -p_{01}p_{11} & -p_{01}p_{12} & p_{01}(1-p_{01}) \end{bmatrix} \]

is a non-singular matrix, and $u' = (p_{11}, p_{12}, p_{01})$.

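Let us consider a function

\[ l(p_{11}, p_{12}, p_{01}) = h(p_{11}, p_{12}, p_{01}, 1 - p_{11} - p_{12} - p_{01}). \]

Then,

\[ \frac{\partial l}{\partial p_{11}} = \frac{\partial h}{\partial p_{11}} - \frac{\partial h}{\partial p_{02}} = h_{11} - h_{02}, \]

\[ \frac{\partial l}{\partial p_{12}} = \frac{\partial h}{\partial p_{12}} - \frac{\partial h}{\partial p_{02}} = h_{12} - h_{02}, \]

\[ \frac{\partial l}{\partial p_{01}} = \frac{\partial h}{\partial p_{01}} - \frac{\partial h}{\partial p_{02}} = h_{01} - h_{02}, \]

where $h_{ij} = \partial h/\partial p_{ij}$, $i = 0, 1$; $j = 1, 2$.

Define $\nabla l_{00}' = (h_{11}, h_{12}, h_{01}, h_{02}) = (v', h_{02})$ and

$\nabla l_{02}' = (h_{11} - h_{02},\, h_{12} - h_{02},\, h_{01} - h_{02}) = (v' - h_{02}\cdot 1')$,

where $v' = (h_{11}, h_{12}, h_{01})$ and $1' = (1, 1, 1)$.

Then it can be proved that $n\,\nabla l_{00}'\,\Sigma\,\nabla l_{00} = \nabla l_{02}'\,\Sigma_1\,\nabla l_{02}$, which enables using the delta method for the derivation of the asymptotic variance of $\widehat{AR}$. The proof is given in Section 2.3.2 for a more general setting.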

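Note that (2.2.1) can be rewritten as

\[ \begin{aligned} AR &= 1 - \left\{(p_{11}+p_{12})\left(\frac{p_{11}(p_{01}+p_{02})}{p_{01}(p_{11}+p_{12})} - 1\right) + 1\right\}^{-1} \\ &= 1 - \frac{p_{01}}{p_{11}p_{02} - p_{12}p_{01} + p_{01}} \\ &= \frac{p_{11}p_{02} - p_{12}p_{01}}{p_{11}p_{02} - p_{12}p_{01} + p_{01}}. \end{aligned} \]

By the invariance property of the MLE, the estimator of $AR$, $\widehat{AR}$, is given by

\[ \widehat{AR} = \frac{\hat p_{11}\hat p_{02} - \hat p_{12}\hat p_{01}}{\hat p_{11}\hat p_{02} - \hat p_{12}\hat p_{01} + \hat p_{01}}, \quad\text{where } \hat p_{kd} = \frac{n_{kd}}{n},\ k = 0, 1;\ d = 1, 2. \]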

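Let $\partial$ be the vector of the partial derivatives of $\widehat{AR}$ with respect to the components of $\hat P$, evaluated at $\hat P = P$. Then

\[ \partial' = \left.\left(\frac{\partial \widehat{AR}}{\partial \hat p_{11}},\ \frac{\partial \widehat{AR}}{\partial \hat p_{12}},\ \frac{\partial \widehat{AR}}{\partial \hat p_{01}},\ \frac{\partial \widehat{AR}}{\partial \hat p_{02}}\right)\right|_{\hat P = P}. \]

By using the delta method, the asymptotic variance of the estimate of $AR$, $\widehat{AR}$, is given by

\[ v_D(\widehat{AR}) = \partial'\,\Sigma\,\partial. \]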

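It can be easily shown that

\[ \frac{\partial AR}{\partial p_{11}} = \frac{p_{01}p_{02}}{\{p_{01} + p_{11}p_{02} - p_{12}p_{01}\}^2}, \qquad \frac{\partial AR}{\partial p_{12}} = \frac{-p_{01}^2}{\{p_{01} + p_{11}p_{02} - p_{12}p_{01}\}^2}, \]

\[ \frac{\partial AR}{\partial p_{01}} = \frac{-p_{11}p_{02}}{\{p_{01} + p_{11}p_{02} - p_{12}p_{01}\}^2}, \qquad \frac{\partial AR}{\partial p_{02}} = \frac{p_{11}p_{01}}{\{p_{01} + p_{11}p_{02} - p_{12}p_{01}\}^2}. \]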

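Also,

\[ n\,\partial'\Sigma\,\partial = \sum_{k=0}^{1}\sum_{d=1}^{2} p_{kd}\left(\frac{\partial AR}{\partial p_{kd}}\right)^2 - \left(\sum_{k=0}^{1}\sum_{d=1}^{2} p_{kd}\frac{\partial AR}{\partial p_{kd}}\right)^2. \]

Note that

\[ \sum_{k=0}^{1}\sum_{d=1}^{2} p_{kd}\left(\frac{\partial AR}{\partial p_{kd}}\right)^2 = \frac{p_{11}p_{01}^2p_{02}^2 + p_{12}p_{01}^4 + p_{01}p_{11}^2p_{02}^2 + p_{02}p_{11}^2p_{01}^2}{(p_{01} + p_{11}p_{02} - p_{12}p_{01})^4} \]

and

\[ \left(\sum_{k=0}^{1}\sum_{d=1}^{2} p_{kd}\frac{\partial AR}{\partial p_{kd}}\right)^2 = \frac{(p_{11}p_{01}p_{02} - p_{12}p_{01}^2 - p_{01}p_{11}p_{02} + p_{02}p_{11}p_{01})^2}{(p_{01} + p_{11}p_{02} - p_{12}p_{01})^4} = \frac{p_{01}^2(p_{11}p_{02} - p_{12}p_{01})^2}{(p_{01} + p_{11}p_{02} - p_{12}p_{01})^4}. \]

Then,

\[ n\,\partial'\Sigma\,\partial = \frac{p_{01}\{(p_{11}p_{01}p_{02}^2 + p_{12}p_{01}^3 + p_{11}^2p_{02}^2 + p_{02}p_{11}^2p_{01}) - p_{01}(p_{11}p_{02} - p_{12}p_{01})^2\}}{(p_{01} + p_{11}p_{02} - p_{12}p_{01})^4}. \]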

Therefore, after some algebraic manipulations, the asymptotic variance of the estimate of $AR$ is given by

\[ v_D(\widehat{AR}) = \frac{(1-AR)^4\{p_{11}p_{02}(p_{11}p_{02} + p_{11}p_{01} + p_{01}p_{02}) - p_{01}(p_{11}p_{02} - p_{12}p_{01})^2 + p_{12}p_{01}^3\}}{n\,p_{01}^3}. \tag{2.2.4} \]

Corollary 2.2.1 The expression for the variance obtained by Walter, $v_I(\widehat{AR})$, using the Fisher information matrix and the one obtained by the proposed delta method are the same, that is, $v_I(\widehat{AR}) = v_D(\widehat{AR})$.

Proof: Note that

\[ \begin{aligned} &(p_{11}+p_{01})(p_{01}+p_{02})\{p_{01}(p_{12}p_{01} - p_{11}p_{02}) + p_{11}p_{02}\} \\ &\quad= (p_{11}p_{01} + p_{11}p_{02} + p_{01}^2 + p_{01}p_{02})(p_{12}p_{01}^2 - p_{11}p_{01}p_{02} + p_{11}p_{02}) \\ &\quad= p_{11}p_{02}(p_{11}p_{01} + p_{11}p_{02} + p_{01}p_{02}) + p_{11}p_{02}p_{01}^2 + p_{11}p_{12}p_{01}^3 + p_{11}p_{12}p_{01}^2p_{02} \\ &\qquad + p_{12}p_{01}^4 + p_{12}p_{01}^3p_{02} - p_{11}^2p_{01}^2p_{02} - p_{11}^2p_{01}p_{02}^2 - p_{11}p_{01}^3p_{02} - p_{11}p_{01}^2p_{02}^2 \\ &\quad= p_{11}p_{02}(p_{11}p_{01} + p_{11}p_{02} + p_{01}p_{02}) + p_{11}p_{02}p_{01}^2 - p_{01}(p_{11}p_{02} - p_{12}p_{01})^2 \\ &\qquad + p_{12}p_{01}^3(p_{11}+p_{12}+p_{01}+p_{02}) - p_{11}p_{02}p_{01}^2(p_{11}+p_{12}+p_{01}+p_{02}) \\ &\quad= p_{11}p_{02}(p_{11}p_{01} + p_{11}p_{02} + p_{01}p_{02}) + p_{11}p_{02}p_{01}^2 - p_{01}(p_{11}p_{02} - p_{12}p_{01})^2 + p_{12}p_{01}^3 - p_{11}p_{02}p_{01}^2 \\ &\quad= p_{11}p_{02}(p_{11}p_{01} + p_{11}p_{02} + p_{01}p_{02}) - p_{01}(p_{11}p_{02} - p_{12}p_{01})^2 + p_{12}p_{01}^3. \end{aligned} \tag{2.2.5} \]

By (2.2.5), (2.2.3) and (2.2.4), it follows that $v_I(\widehat{AR}) = v_D(\widehat{AR})$.

Note that $\hat v_D(\widehat{AR})$ can be obtained by substituting the parameters by the corresponding MLEs.

Then an asymptotic $100(1-\alpha)$ per cent confidence interval for $AR$ by using Wald's statistic is given by

\[ \left[\widehat{AR} - Z_{\alpha/2}\sqrt{\hat v(\widehat{AR})},\ \min\left(\widehat{AR} + Z_{\alpha/2}\sqrt{\hat v(\widehat{AR})},\ 1\right)\right], \]

where $Z_\alpha$ is the upper $100(\alpha)$th percentile of the standard normal distribution.
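The formulas of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the counts in the last line are made up and do not come from the text, and the interval takes the square root of the estimated variance in the usual Wald fashion.

```python
import math
from statistics import NormalDist


def ar_2x2(p11, p12, p01, p02):
    """Attributable risk in the form obtained by rewriting (2.2.1)."""
    num = p11 * p02 - p12 * p01
    return num / (num + p01)


def var_delta(p11, p12, p01, p02, n):
    """Asymptotic variance from the delta method, expression (2.2.4)."""
    ar = ar_2x2(p11, p12, p01, p02)
    core = (p11 * p02 * (p11 * p02 + p11 * p01 + p01 * p02)
            - p01 * (p11 * p02 - p12 * p01) ** 2
            + p12 * p01 ** 3)
    return (1 - ar) ** 4 * core / (n * p01 ** 3)


def var_fisher(p11, p12, p01, p02, n):
    """Asymptotic variance from the information matrix, expression (2.2.3)."""
    ar = ar_2x2(p11, p12, p01, p02)
    core = ((p11 + p01) * (p01 + p02)
            * (p01 * (p12 * p01 - p11 * p02) + p11 * p02))
    return (1 - ar) ** 4 * core / (n * p01 ** 3)


def wald_ci(n11, n12, n01, n02, level=0.95):
    """Point estimate and Wald interval, upper limit truncated at 1."""
    n = n11 + n12 + n01 + n02
    p = [count / n for count in (n11, n12, n01, n02)]
    est = ar_2x2(*p)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * math.sqrt(var_delta(*p, n))
    return est, est - half, min(est + half, 1.0)


# Hypothetical 2x2 counts (n11, n12, n01, n02), not taken from the text.
estimate, lower, upper = wald_ci(40, 60, 20, 80)
```

Evaluating `var_delta` and `var_fisher` at the same cell probabilities gives a quick numerical check of Corollary 2.2.1.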

2.3 The ARk for a risk factor with multiple exposure levels

2.3.1 A model setup for ARk

Consider the estimation of $AR_k$ due to a risk factor with $K+1$ exposure levels in a cross-sectional study. Let $E_k$, $k = 0, 1, \ldots, K$, be the levels of the risk factor under investigation, with $E_0$ denoting the baseline or reference level. Suppose we take a random sample of $n$ subjects and simultaneously classify each subject by the presence or absence of a disease and by a suspected risk factor with $K+1$ exposure levels. Let $n_{kd}$ be the random frequency, out of $n$ individuals, falling into the cell at exposure level $k$ with disease status $d$, $d = 1, 2$, where 1 (2) means the presence (absence) of the disease.


The table below summarizes the overall data structure given by the aforesaid design.

Table 2.2: Classifying subjects according to their exposure levels and the status of the disease

                       Disease Status
Exposure Levels        D         D̄         Total
0                      n01       n02       n0.
1                      n11       n12       n1.
...                    ...       ...       ...
K                      nK1       nK2       nK.
Total                  n.1       n.2       n

Let $p_{kd}$ be the probability of a subject falling into the cell having observed frequency $n_{kd}$. Note that $\sum_k\sum_d n_{kd} = \sum_d n_{\cdot d} = n$, $n_{k\cdot} = n_{k1} + n_{k2}$, $0 < p_{kd} < 1$, and $p_{k\cdot} = p_{k1} + p_{k2}$. Then the random vector $N = (n_{01}, n_{02}, n_{11}, n_{12}, \ldots, n_{K1}, n_{K2})'$ follows the multinomial distribution with parameters $n$ and $P = (p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2})'$. The maximum likelihood estimator (MLE) of $p_{kd}$ is given by $\hat p_{kd} = n_{kd}/n$. When the number of subjects, $n$, is large, by the Multivariate Central Limit Theorem (Rao, 1973), the random vector $(\hat P - P)$ is asymptotically distributed as normal $N(0, \Sigma)$, where $0' = (0, 0, \ldots, 0)$ and $\Sigma$ is the $2(K+1) \times 2(K+1)$ covariance matrix of the estimate $\hat P$ of $P$, given by

\[ \Sigma = \frac{1}{n}\begin{bmatrix} p_{01}(1-p_{01}) & -p_{01}p_{02} & -p_{01}p_{11} & -p_{01}p_{12} & \cdots & -p_{01}p_{K1} & -p_{01}p_{K2} \\ -p_{02}p_{01} & p_{02}(1-p_{02}) & -p_{02}p_{11} & -p_{02}p_{12} & \cdots & -p_{02}p_{K1} & -p_{02}p_{K2} \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ -p_{K2}p_{01} & -p_{K2}p_{02} & -p_{K2}p_{11} & -p_{K2}p_{12} & \cdots & -p_{K2}p_{K1} & p_{K2}(1-p_{K2}) \end{bmatrix}. \]

Let $D$ and $\bar D$ be the events of a subject being diseased and non-diseased, respectively.

Then the $AR$ of a disease for reducing the exposure level from $E_k$ to $E_0$ is, by definition, equal to

\[ AR_k = \frac{[P(D \mid E_k) - P(D \mid E_0)]\,P(E_k)}{P(D)}. \tag{2.3.1} \]

Equation (2.3.1) can be written as

\[ AR_k = P(E_k \mid D) - \frac{P(E_k \mid D)}{RR_k}, \tag{2.3.2} \]

where $RR_k = P(D \mid E_k)/P(D \mid E_0)$ is the relative risk between exposure levels $E_k$ and $E_0$.

Note that $P(E_k \mid D) = p_{k1}/p_{\cdot 1}$ and

\[ RR_k = \frac{P(D \mid E_k)}{P(D \mid E_0)} = \frac{p_{k1}/p_{k\cdot}}{p_{01}/p_{0\cdot}} = \frac{p_{k1}\,p_{0\cdot}}{p_{01}\,p_{k\cdot}}. \]

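Then it follows from equation (2.3.2) that

\[ AR_k = \frac{p_{k1}}{p_{\cdot 1}} - \frac{p_{k1}}{p_{\cdot 1}}\cdot\frac{p_{01}\,p_{k\cdot}}{p_{k1}\,p_{0\cdot}} = \frac{1}{p_{\cdot 1}}\left\{p_{k1} - \frac{p_{01}\,p_{k\cdot}}{p_{0\cdot}}\right\}. \tag{2.3.3} \]

When $K+1 = 2$, that is, $K = k = 1$, equation (2.3.3) gives

\[ AR = \frac{1}{p_{\cdot 1}}\left\{p_{11} - \frac{p_{01}\,p_{1\cdot}}{p_{0\cdot}}\right\} = \frac{p_{11}\,p_{0\cdot} - p_{01}\,p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}}. \]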

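Note that

\[ p_{11}\,p_{0\cdot} - p_{01}\,p_{1\cdot} = p_{11}p_{02} - p_{01}p_{12}, \]

and

\[ \begin{aligned} p_{\cdot 1}\,p_{0\cdot} &= (p_{01}+p_{11})(p_{01}+p_{02}) \\ &= p_{01}^2 + p_{01}p_{02} + p_{01}p_{11} + p_{02}p_{11} \\ &= p_{01}(p_{01}+p_{02}+p_{11}) + p_{02}p_{11} \\ &= p_{02}p_{11} - p_{01}p_{12} + p_{01}, \quad\text{since } p_{01}+p_{02}+p_{11} = 1 - p_{12}. \end{aligned} \]

Therefore we have

\[ AR = \frac{p_{02}p_{11} - p_{01}p_{12}}{p_{02}p_{11} - p_{01}p_{12} + p_{01}}, \]

which is the same as the one obtained by Walter (1976).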

By the invariance property of the MLE, the MLE of $AR_k$, $\widehat{AR}_k$, is given by

\[ \widehat{AR}_k = \frac{1}{\hat p_{\cdot 1}}\left\{\hat p_{k1} - \frac{\hat p_{01}\,\hat p_{k\cdot}}{\hat p_{0\cdot}}\right\}. \]
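As a small illustration of (2.3.3), the sketch below evaluates $AR_k$ from a table of cells. The function name and the list-of-pairs layout are assumptions made for this example; the cell probabilities are those of Table 2.4 in Section 2.4.

```python
def ar_k(table, k):
    """AR_k from (2.3.3); table[i] = (cell for D, cell for not-D) at level i.

    Cells may be probabilities or raw counts, because (2.3.3) is unchanged
    when every cell is multiplied by the same constant.
    """
    p01, p02 = table[0]
    pk1, pk2 = table[k]
    p_dot1 = sum(row[0] for row in table)  # marginal probability of disease
    return (pk1 - p01 * (pk1 + pk2) / (p01 + p02)) / p_dot1


# Cell probabilities of Table 2.4 (exposure levels 0..3).
TABLE_2_4 = [(0.03, 0.12), (0.05, 0.14), (0.08, 0.20), (0.11, 0.27)]
ar_values = [ar_k(TABLE_2_4, k) for k in (1, 2, 3)]
```

Passing observed counts instead of probabilities returns the MLE $\widehat{AR}_k$ directly.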

2.3.2 Motivation for using the Multivariate Delta method

Let $(n, p) = (n, p_1, p_2, \ldots, p_k)$ be the parameters of a multinomial distribution with $\sum_{i=1}^{k} p_i = 1$, and let the covariance matrix of the estimates of $p$ be given by

\[ \Sigma = \frac{1}{n}\begin{bmatrix} p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_{k-1} & -p_1p_k \\ -p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_{k-1} & -p_2p_k \\ \vdots & \vdots & & \vdots & \vdots \\ -p_1p_{k-1} & -p_2p_{k-1} & \cdots & p_{k-1}(1-p_{k-1}) & -p_{k-1}p_k \\ -p_1p_k & -p_2p_k & \cdots & -p_{k-1}p_k & p_k(1-p_k) \end{bmatrix}, \]

which is singular.

We can partition $\Sigma$ in the following way:

\[ \Sigma = \frac{1}{n}\begin{bmatrix} \Sigma_1 & -u\,p_k \\ -u'\,p_k & p_k(1-p_k) \end{bmatrix}, \]

where

\[ \Sigma_1 = \begin{bmatrix} p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_{k-1} \\ -p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_{k-1} \\ \vdots & \vdots & & \vdots \\ -p_1p_{k-1} & -p_2p_{k-1} & \cdots & p_{k-1}(1-p_{k-1}) \end{bmatrix} \]

is a non-singular matrix, and $u' = (p_1, p_2, \ldots, p_{k-1})$.

Let us consider a function

\[ l(p_1, p_2, \ldots, p_{k-1}) = h(p_1, p_2, \ldots, p_{k-1}, 1 - p_1 - p_2 - \cdots - p_{k-1}). \]

Then,

\[ \frac{\partial l}{\partial p_i} = \frac{\partial h}{\partial p_i} - \frac{\partial h}{\partial p_k} = h_i - h_k, \qquad i = 1, 2, \ldots, k-1, \]

where $h_i = \partial h/\partial p_i$, $i = 1, 2, \ldots, k$.

Define $\nabla l_0' = (h_1, h_2, \ldots, h_k) = (v', h_k)$ and

$\nabla l_k' = (h_1 - h_k,\, h_2 - h_k,\, \ldots,\, h_{k-1} - h_k) = (v' - h_k\cdot 1')$,

where $v' = (h_1, h_2, \ldots, h_{k-1})$ and $1' = (1, 1, \ldots, 1)_{1\times(k-1)}$.


Lemma 2.3.1 With the above notation, $n\,\nabla l_0'\,\Sigma\,\nabla l_0 = \nabla l_k'\,\Sigma_1\,\nabla l_k$.

Proof: We have

\[ \begin{aligned} n\,\nabla l_0'\,\Sigma\,\nabla l_0 &= (v', h_k)\begin{bmatrix} \Sigma_1 & -u\,p_k \\ -u'\,p_k & p_k(1-p_k) \end{bmatrix}\begin{pmatrix} v \\ h_k \end{pmatrix} \\ &= v'\Sigma_1 v - u'h_kp_kv - v'u\,p_kh_k + h_k^2p_k(1-p_k) \\ &= v'\Sigma_1 v - (v'p_kh_ku)' - v'p_kh_ku + h_k^2p_k(1-p_k) \\ &= v'\Sigma_1 v - 2p_kh_k\,v'u + h_k^2p_k(1-p_k) \\ &= v'\Sigma_1 v - 2p_kh_k\sum_{i=1}^{k-1}h_ip_i + h_k^2p_k(1-p_k). \end{aligned} \]

Again,

\[ \begin{aligned} \nabla l_k'\,\Sigma_1\,\nabla l_k &= (v' - h_k1')\,\Sigma_1\,(v - h_k1) \\ &= (v'\Sigma_1 - h_k1'\Sigma_1)(v - h_k1) \\ &= v'\Sigma_1 v - v'\Sigma_1h_k1 - h_k1'\Sigma_1v + h_k^2\,1'\Sigma_11 \\ &= v'\Sigma_1 v - h_kv'\Sigma_11 - (h_kv'\Sigma_11)' + h_k^2\,1'\Sigma_11 \\ &= v'\Sigma_1 v - 2h_kv'\Sigma_11 + h_k^2\,1'\Sigma_11. \end{aligned} \]

By the fact that $p_1(1-p_1) - p_1p_2 - \cdots - p_1p_k = 0$, we have

\[ p_1(1-p_1) - p_1p_2 - \cdots - p_1p_{k-1} = p_1p_k. \]

Then it follows that $\Sigma_11 = u\,p_k$.

Therefore,

\[ h_kv'\Sigma_11 = h_kv'u\,p_k = h_kp_k\sum_{i=1}^{k-1}h_ip_i \]

and

\[ h_k^2\,1'\Sigma_11 = h_k^2\,1'u\,p_k = h_k^2p_k\sum_{i=1}^{k-1}p_i = h_k^2p_k(1-p_k). \]

Therefore, $\nabla l_k'\,\Sigma_1\,\nabla l_k = v'\Sigma_1v - 2p_kh_k\sum_{i=1}^{k-1}h_ip_i + h_k^2p_k(1-p_k)$, and Lemma 2.3.1 follows.
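The identity in Lemma 2.3.1 can be checked numerically. The sketch below uses an arbitrary probability vector and an arbitrary gradient vector; both are illustrative values chosen for this example, and the helper names are hypothetical.

```python
def n_sigma(p):
    """n times the multinomial covariance matrix, that is, diag(p) - p p'."""
    size = len(p)
    return [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j]
             for j in range(size)]
            for i in range(size)]


def quad_form(vec, mat):
    """Quadratic form vec' mat vec for plain Python lists."""
    size = len(vec)
    return sum(vec[i] * mat[i][j] * vec[j]
               for i in range(size) for j in range(size))


p = [0.1, 0.2, 0.3, 0.4]    # arbitrary cell probabilities summing to 1
h = [1.5, -0.7, 2.0, 0.4]   # arbitrary partial derivatives h_1, ..., h_k

full = n_sigma(p)                              # n * Sigma, a k x k matrix
sigma_1 = [row[:-1] for row in full[:-1]]      # leading (k-1) x (k-1) block
grad_k = [h_i - h[-1] for h_i in h[:-1]]       # components h_i - h_k

lhs = quad_form(h, full)          # n * grad_0' Sigma grad_0
rhs = quad_form(grad_k, sigma_1)  # grad_k' Sigma_1 grad_k
```

Both sides equal the variance of $h$ under the cell probabilities, which is why the reduced, non-singular form can be used in place of the singular one.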

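2.3.3 The derivation of the asymptotic variance of $\widehat{AR}_k$

Let $\varphi$ be the vector of partial derivatives of $\widehat{AR}_k$ with respect to the components of the vector $\hat P$, evaluated at $\hat P = P$. Then we have

\[ \varphi' = \left(\frac{\partial AR_k}{\partial p_{01}},\ \frac{\partial AR_k}{\partial p_{02}},\ \frac{\partial AR_k}{\partial p_{11}},\ \frac{\partial AR_k}{\partial p_{12}},\ \ldots,\ \frac{\partial AR_k}{\partial p_{K1}},\ \frac{\partial AR_k}{\partial p_{K2}}\right), \]

where

\[ \frac{\partial AR_k}{\partial p_{01}} = -\frac{1}{p_{\cdot 1}^2}\left\{p_{k1} - \frac{p_{01}p_{k\cdot}}{p_{0\cdot}}\right\} + \frac{1}{p_{\cdot 1}}\left\{0 - \left(\frac{p_{0\cdot}p_{k\cdot} - p_{01}p_{k\cdot}}{p_{0\cdot}^2}\right)\right\} = -\frac{1}{p_{\cdot 1}}\left(AR_k + \frac{p_{02}p_{k\cdot}}{p_{0\cdot}^2}\right), \]

\[ \frac{\partial AR_k}{\partial p_{m1}} = -\frac{1}{p_{\cdot 1}^2}\left\{p_{k1} - \frac{p_{01}p_{k\cdot}}{p_{0\cdot}}\right\} = -\frac{1}{p_{\cdot 1}}AR_k, \quad\text{for } m \neq k,\ m \geq 1, \]

\[ \frac{\partial AR_k}{\partial p_{k1}} = -\frac{1}{p_{\cdot 1}^2}\left\{p_{k1} - \frac{p_{01}p_{k\cdot}}{p_{0\cdot}}\right\} + \frac{1}{p_{\cdot 1}}\left\{1 - \frac{p_{01}}{p_{0\cdot}}\right\} = -\frac{1}{p_{\cdot 1}}\left(AR_k - \frac{p_{02}}{p_{0\cdot}}\right), \quad\text{since } p_{0\cdot} = p_{01} + p_{02}, \]

\[ \frac{\partial AR_k}{\partial p_{02}} = \frac{1}{p_{\cdot 1}}\left\{\frac{p_{01}p_{k\cdot}}{p_{0\cdot}^2}\right\} = \frac{p_{01}p_{k\cdot}}{p_{\cdot 1}\,p_{0\cdot}^2}, \]

\[ \frac{\partial AR_k}{\partial p_{m2}} = 0, \quad\text{for } m \neq k,\ m \geq 1, \]

\[ \frac{\partial AR_k}{\partial p_{k2}} = \frac{1}{p_{\cdot 1}}\left\{0 - \frac{p_{01}}{p_{0\cdot}}\right\} = -\frac{p_{01}}{p_{\cdot 1}\,p_{0\cdot}}. \]

By using the delta method, the asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, is $\varphi'\,\Sigma\,\varphi$.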

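Lemma 2.3.2 Under the above notation,

\[ \varphi'\,\Sigma\,\varphi = \frac{1}{n}\left\{\sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right)^2 - A_{id}^2\right\}, \quad\text{where } A_{id} = \sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right). \]

Proof: The variance-covariance matrix $\Sigma$ of the vector $\hat P$ can be expressed as

\[ \Sigma = \frac{1}{n}\left[\mathrm{diag}(p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2}) - PP'\right], \]

where $\mathrm{diag}(p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2})$ is the diagonal matrix with diagonal entries $p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2}$ and zeros elsewhere, and

\[ P' = (p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2}). \]

Then,

\[ n\,\varphi'\Sigma\varphi = \varphi'\,\mathrm{diag}(p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2})\,\varphi - \varphi'PP'\varphi. \]

After simplifying, it can be shown that

\[ \varphi'\,\mathrm{diag}(p_{01}, p_{02}, p_{11}, p_{12}, \ldots, p_{K1}, p_{K2})\,\varphi = \sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right)^2 \]

and

\[ \varphi'PP'\varphi = \left[\sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right)\right]^2 = A_{id}^2. \]

Hence the lemma follows.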

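Theorem 2.3.1 The asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, is given by

\[ v(\widehat{AR}_k) = \frac{1}{n\,p_{\cdot 1}^2}\left\{AR_k^2\,p_{\cdot 1} + 2AR_k\frac{p_{02}(p_{01}p_{k\cdot} - p_{k1}p_{0\cdot})}{p_{0\cdot}^2} + \frac{p_{k1}p_{02}^2 + p_{k2}p_{01}^2}{p_{0\cdot}^2} + \frac{p_{01}p_{k\cdot}^2p_{02}}{p_{0\cdot}^3}\right\}. \]

Proof: Note that

\[ \begin{aligned} \sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right)^2 &= p_{01}\left(\frac{\partial AR_k}{\partial p_{01}}\right)^2 + p_{k1}\left(\frac{\partial AR_k}{\partial p_{k1}}\right)^2 + \sum_{i=1,\,i\neq k}^{K} p_{i1}\left(\frac{\partial AR_k}{\partial p_{i1}}\right)^2 \\ &\quad + p_{02}\left(\frac{\partial AR_k}{\partial p_{02}}\right)^2 + p_{k2}\left(\frac{\partial AR_k}{\partial p_{k2}}\right)^2 + \sum_{i=1,\,i\neq k}^{K} p_{i2}\left(\frac{\partial AR_k}{\partial p_{i2}}\right)^2 \\ &= p_{01}\left\{\frac{1}{p_{\cdot 1}^2}\left(AR_k + \frac{p_{02}p_{k\cdot}}{p_{0\cdot}^2}\right)^2\right\} + p_{k1}\left\{\frac{1}{p_{\cdot 1}^2}\left(AR_k - \frac{p_{02}}{p_{0\cdot}}\right)^2\right\} + \sum_{i=1,\,i\neq k}^{K} p_{i1}\left\{\frac{1}{p_{\cdot 1}^2}AR_k^2\right\} \\ &\quad + p_{02}\left(\frac{p_{01}p_{k\cdot}}{p_{\cdot 1}p_{0\cdot}^2}\right)^2 + p_{k2}\left(-\frac{p_{01}}{p_{\cdot 1}p_{0\cdot}}\right)^2 \\ &= \frac{AR_k^2\left(p_{01} + p_{k1} + \sum_{i=1,\,i\neq k}^{K}p_{i1}\right)}{p_{\cdot 1}^2} + \frac{2AR_k\,p_{01}p_{k\cdot}p_{02}}{p_{\cdot 1}^2p_{0\cdot}^2} + \frac{p_{01}p_{k\cdot}^2p_{02}^2}{p_{\cdot 1}^2p_{0\cdot}^4} - \frac{2AR_k\,p_{k1}p_{02}}{p_{\cdot 1}^2p_{0\cdot}} \\ &\quad + \frac{p_{k1}p_{02}^2}{p_{\cdot 1}^2p_{0\cdot}^2} + \frac{p_{01}^2p_{k\cdot}^2p_{02}}{p_{\cdot 1}^2p_{0\cdot}^4} + \frac{p_{k2}p_{01}^2}{p_{\cdot 1}^2p_{0\cdot}^2} \\ &= \frac{1}{p_{\cdot 1}^2}\left\{AR_k^2\,p_{\cdot 1} + 2AR_k\frac{p_{02}(p_{01}p_{k\cdot} - p_{k1}p_{0\cdot})}{p_{0\cdot}^2} + \frac{p_{k1}p_{02}^2 + p_{k2}p_{01}^2}{p_{0\cdot}^2} + \frac{p_{01}p_{k\cdot}^2p_{02}}{p_{0\cdot}^3}\right\}. \end{aligned} \]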

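Also,

\[ \begin{aligned} A_{id} &= \sum_i\sum_d p_{id}\left(\frac{\partial AR_k}{\partial p_{id}}\right) \\ &= p_{01}\left(\frac{\partial AR_k}{\partial p_{01}}\right) + p_{k1}\left(\frac{\partial AR_k}{\partial p_{k1}}\right) + \sum_{i=1,\,i\neq k}^{K} p_{i1}\left(\frac{\partial AR_k}{\partial p_{i1}}\right) + p_{02}\left(\frac{\partial AR_k}{\partial p_{02}}\right) + p_{k2}\left(\frac{\partial AR_k}{\partial p_{k2}}\right) + \sum_{i=1,\,i\neq k}^{K} p_{i2}\left(\frac{\partial AR_k}{\partial p_{i2}}\right) \\ &= p_{01}\left\{-\frac{1}{p_{\cdot 1}}\left(AR_k + \frac{p_{02}p_{k\cdot}}{p_{0\cdot}^2}\right)\right\} + p_{k1}\left\{-\frac{1}{p_{\cdot 1}}\left(AR_k - \frac{p_{02}}{p_{0\cdot}}\right)\right\} + \sum_{i=1,\,i\neq k}^{K} p_{i1}\left\{-\frac{1}{p_{\cdot 1}}AR_k\right\} + p_{02}\left(\frac{p_{01}p_{k\cdot}}{p_{\cdot 1}p_{0\cdot}^2}\right) + p_{k2}\left(-\frac{p_{01}}{p_{\cdot 1}p_{0\cdot}}\right) \\ &= -\frac{AR_k}{p_{\cdot 1}}\left(p_{01} + p_{k1} + \sum_{i=1,\,i\neq k}^{K}p_{i1}\right) - \frac{p_{01}(p_{0\cdot} - p_{01})p_{k\cdot}}{p_{\cdot 1}p_{0\cdot}^2} + \frac{p_{k1}p_{02}}{p_{\cdot 1}p_{0\cdot}} + \frac{p_{01}p_{02}p_{k\cdot}}{p_{\cdot 1}p_{0\cdot}^2} - \frac{p_{k2}p_{01}}{p_{\cdot 1}p_{0\cdot}} \\ &= -\frac{1}{p_{\cdot 1}}\left(AR_k\,p_{\cdot 1} + \frac{p_{01}p_{02}p_{k\cdot}}{p_{0\cdot}^2} - \frac{p_{k1}p_{02}}{p_{0\cdot}} - \frac{p_{01}p_{02}p_{k\cdot}}{p_{0\cdot}^2} + \frac{p_{k2}p_{01}}{p_{0\cdot}}\right) \\ &= -\frac{1}{p_{\cdot 1}}\left(AR_k\,p_{\cdot 1} + \frac{p_{k2}p_{01} - p_{k1}p_{02}}{p_{0\cdot}}\right) \\ &= -\frac{1}{p_{\cdot 1}}\left(AR_k\,p_{\cdot 1} - AR_k\,p_{\cdot 1}\right) \\ &= 0. \end{aligned} \]

Hence, the asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, follows immediately.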
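The closed form of Theorem 2.3.1 can be compared numerically with the quadratic form of Lemma 2.3.2, built from the partial derivatives of Section 2.3.3. The sketch below is an illustration only: the function names are hypothetical, the sample size is arbitrary, and the cell probabilities are those of Table 2.4.

```python
def closed_form(table, k, n):
    """AR_k and its asymptotic variance from Theorem 2.3.1 (k >= 1)."""
    p01, p02 = table[0]
    pk1, pk2 = table[k]
    p0, pk = p01 + p02, pk1 + pk2
    pd1 = sum(row[0] for row in table)
    ar = (pk1 - p01 * pk / p0) / pd1
    braces = (ar ** 2 * pd1
              + 2 * ar * p02 * (p01 * pk - pk1 * p0) / p0 ** 2
              + (pk1 * p02 ** 2 + pk2 * p01 ** 2) / p0 ** 2
              + p01 * pk ** 2 * p02 / p0 ** 3)
    return ar, braces / (n * pd1 ** 2)


def gradient_form(table, k, n):
    """A_id and the variance from Lemma 2.3.2, using the partial derivatives."""
    p01, p02 = table[0]
    pk1, pk2 = table[k]
    p0, pk = p01 + p02, pk1 + pk2
    pd1 = sum(row[0] for row in table)
    ar = (pk1 - p01 * pk / p0) / pd1
    grad = []
    for m in range(len(table)):
        if m == 0:
            g1 = -(ar + p02 * pk / p0 ** 2) / pd1
            g2 = p01 * pk / (pd1 * p0 ** 2)
        elif m == k:
            g1 = -(ar - p02 / p0) / pd1
            g2 = -p01 / (pd1 * p0)
        else:
            g1, g2 = -ar / pd1, 0.0
        grad.append((g1, g2))
    pairs = [(p, g) for row, gr in zip(table, grad) for p, g in zip(row, gr)]
    a_id = sum(p * g for p, g in pairs)
    second_moment = sum(p * g * g for p, g in pairs)
    return a_id, (second_moment - a_id ** 2) / n


TABLE_2_4 = [(0.03, 0.12), (0.05, 0.14), (0.08, 0.20), (0.11, 0.27)]
checks = [(closed_form(TABLE_2_4, k, 500), gradient_form(TABLE_2_4, k, 500))
          for k in (1, 2, 3)]
```

For each level the two variances agree up to rounding error, and the first component returned by `gradient_form` is numerically zero, in line with $A_{id} = 0$.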

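Corollary 2.3.2 When $K+1 = 2$, that is, $K = k = 1$,

\[ v(\widehat{AR}) = \frac{1}{n\,p_{\cdot 1}^2}\left\{AR^2\,p_{\cdot 1} + 2AR\frac{p_{02}(p_{01}p_{12} - p_{11}p_{02})}{p_{0\cdot}^2} + \frac{p_{11}p_{02}^2 + p_{12}p_{01}^2}{p_{0\cdot}^2} + \frac{p_{1\cdot}^2p_{01}p_{02}}{p_{0\cdot}^3}\right\}. \]

Proof: When $K+1 = 2$, that is, $K = k = 1$, we have

\[ p_{01}p_{1\cdot} - p_{11}p_{0\cdot} = p_{01}p_{11} + p_{01}p_{12} - p_{11}p_{01} - p_{11}p_{02} = p_{01}p_{12} - p_{11}p_{02}, \]

so the stated expression follows from the theorem above. Also, note that

\[ AR = \frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}} = \frac{p_{11}p_{02} - p_{01}p_{12}}{p_{\cdot 1}\,p_{0\cdot}}. \]

Then

\[ n\,p_{\cdot 1}^2\,v(\widehat{AR}) = \frac{(p_{11}p_{02} - p_{01}p_{12})^2}{p_{\cdot 1}^2p_{0\cdot}^2}\,p_{\cdot 1} + 2\,\frac{(p_{11}p_{02} - p_{01}p_{12})}{p_{\cdot 1}\,p_{0\cdot}}\cdot\frac{p_{02}(p_{01}p_{12} - p_{11}p_{02})}{p_{0\cdot}^2} + \frac{p_{11}p_{02}^2 + p_{12}p_{01}^2}{p_{0\cdot}^2} + \frac{(p_{11}+p_{12})^2p_{01}p_{02}}{p_{0\cdot}^3}. \]

Multiplying both sides by $p_{\cdot 1}\,p_{0\cdot}^3$ and using $p_{\cdot 1}\,p_{0\cdot} = p_{11}p_{02} - p_{01}p_{12} + p_{01}$,

\[ \begin{aligned} n\,p_{\cdot 1}^3p_{0\cdot}^3\,v(\widehat{AR}) &= (p_{11}p_{02} - p_{01}p_{12})^2\,p_{0\cdot} - 2p_{02}(p_{01}p_{12} - p_{11}p_{02})^2 \\ &\quad + (p_{11}p_{02}^2 + p_{12}p_{01}^2)\,p_{0\cdot}\,p_{\cdot 1} + (p_{11}+p_{12})^2p_{01}p_{02}\,p_{\cdot 1} \\ &= (p_{11}p_{02} - p_{01}p_{12})^2(p_{01}+p_{02}) - 2p_{02}(p_{01}p_{12} - p_{11}p_{02})^2 \\ &\quad + (p_{11}p_{02}^2 + p_{12}p_{01}^2)(p_{11}p_{02} - p_{01}p_{12} + p_{01}) + (p_{11}+p_{12})^2p_{01}p_{02}(p_{01}+p_{11}). \end{aligned} \]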

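Using $p_{01}+p_{02}+p_{11}+p_{12} = 1$, the right-hand quantity can further be rewritten as

\[ \begin{aligned} &p_{01}\big(p_{02}^2p_{11}^2 + 2p_{02}^2p_{11}p_{12} + p_{02}p_{11}^2 + p_{01}p_{02}p_{11}p_{12} + p_{01}^2p_{12} + p_{01}p_{02}^2p_{11} + p_{02}^3p_{11} + p_{02}p_{11}^2p_{12} + p_{02}p_{11}p_{12}^2\big) \\ &\quad= p_{01}\{p_{01}^2p_{12} + p_{02}^2p_{11}(p_{01}+p_{02}+p_{11}+p_{12}) + p_{02}p_{11}p_{12}(p_{01}+p_{02}+p_{11}+p_{12}) + p_{02}p_{11}^2\} \\ &\quad= p_{01}(p_{01}^2p_{12} + p_{02}^2p_{11} + p_{02}p_{11}p_{12} + p_{02}p_{11}^2) \\ &\quad= p_{01}\{p_{01}^2p_{12} + p_{02}p_{11}(p_{11}+p_{12}+p_{02})\} \\ &\quad= p_{01}\{p_{01}^2p_{12} + p_{02}p_{11}(1-p_{01})\} \\ &\quad= p_{01}(p_{01}^2p_{12} - p_{01}p_{02}p_{11} + p_{02}p_{11}) \\ &\quad= p_{01}\{p_{01}(p_{01}p_{12} - p_{02}p_{11}) + p_{02}p_{11}\}. \end{aligned} \]

Thus,

\[ \begin{aligned} v(\widehat{AR}) &= \left(\frac{p_{01}}{p_{0\cdot}\,p_{\cdot 1}}\right)^4\frac{p_{0\cdot}\,p_{\cdot 1}\{p_{01}(p_{01}p_{12} - p_{02}p_{11}) + p_{02}p_{11}\}}{n\,p_{01}^3} \\ &= \frac{(1-AR)^4(p_{01}+p_{11})(p_{01}+p_{02})\{p_{01}(p_{01}p_{12} - p_{02}p_{11}) + p_{02}p_{11}\}}{n\,p_{01}^3}. \end{aligned} \]

The above expression is the same as the one obtained by Walter (1976).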

Note that $\hat v(\widehat{AR}_k)$ can be obtained by substituting the MLEs $\hat p_{kd}$ for $p_{kd}$ and $\widehat{AR}_k$ for $AR_k$.

Then an asymptotic $100(1-\alpha)$ per cent confidence interval for $AR_k$ using Wald's statistic is given by

\[ \left[\widehat{AR}_k - Z_{\alpha/2}\sqrt{\hat v(\widehat{AR}_k)},\ \min\left(\widehat{AR}_k + Z_{\alpha/2}\sqrt{\hat v(\widehat{AR}_k)},\ 1\right)\right], \]

where $Z_\alpha$ is the upper $100(\alpha)$th percentile of the standard normal distribution.


2.4 Numerical example and simulations

Example 2.1 Bakke et al. (1991) studied the occurrence of symptoms of lung disease (chronic cough and breathlessness during exercise) related to several risk factors (smoking, occupational exposure to dust or gas, and residency) for a sample of 4270 subjects in a cross-sectional study. For simplicity, we consider smoking as the only factor and study the association between smoking and chronic cough in the following example. The table below cross-classifies the 4270 subjects according to their smoking behavior (nonsmoker, 1-9 cigarettes/day, 10-19 cigarettes/day, ≥ 20 cigarettes/day) and the status of chronic cough ($D$: yes, $\bar D$: no).

Table 2.3: Distribution of 4270 subjects into four exposure levels with respective disease status

                       Disease Status
Exposure Levels        D       D̄       Total
0 (Nonsmoker)          123     2578    2701
1 (1-9 cig/day)        23      298     321
2 (10-19 cig/day)      137     743     880
3 (20+ cig/day)        103     265     368
Total                  386     3884    4270

For the data set in Table 2.3, we calculate the MLEs $\hat p_{kd}$ of $p_{kd}$ and, based on these estimates, we obtain the MLEs $\widehat{AR}_k$ of the $AR_k$ and the corresponding confidence intervals by using Wald's test statistic.

The estimates of the unknown parameters $AR_1$, $AR_2$ and $AR_3$ are obtained to be 0.0219, 0.2513 and 0.2232, respectively. The interpretation of the estimate of $AR_1$ is that 2 per cent of all the risk of chronic cough caused by smoking could be avoided by reducing the number of cigarettes from 1-9 to none. The estimated value of 0.2513 for $AR_2$ means that 25 per cent of all the risk of chronic cough caused by smoking could be prevented by reducing cigarette smoking from 10-19 to none. A similar interpretation holds for $AR_3$.

The corresponding 95% confidence intervals are found to be (-0.0026, 0.0463), (0.1964, 0.3061) and (0.1780, 0.2683). Since the lower limit of the confidence interval for $AR_1$ is less than zero, this result suggests that there is no significant evidence at the 5% level that the proportional reduction of the risk of chronic cough would be greater than zero if the number of cigarettes smoked were reduced from 1-9 to none per day. The interpretation of the second confidence interval is: with 95% confidence, we can assert that between about 20 per cent and 31 per cent of all the risk of chronic cough could be eliminated by reducing cigarette consumption from 10-19 to none per day. We can interpret the confidence interval for $AR_3$ in the same way.
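The calculation in Example 2.1 can be sketched as follows, applying (2.3.3) and Theorem 2.3.1 to the counts of Table 2.3. The function name is hypothetical, and the printed values should be close to those quoted above; differences in the last digits can arise from rounding.

```python
import math
from statistics import NormalDist

# Table 2.3: (chronic cough, no chronic cough) for smoking levels 0..3.
TABLE_2_3 = [(123, 2578), (23, 298), (137, 743), (103, 265)]


def ar_k_wald(counts, k, level=0.95):
    """MLE of AR_k with a Wald interval; the upper limit is truncated at 1."""
    n = sum(a + b for a, b in counts)
    p = [(a / n, b / n) for a, b in counts]
    p01, p02 = p[0]
    pk1, pk2 = p[k]
    p0, pk = p01 + p02, pk1 + pk2
    pd1 = sum(row[0] for row in p)
    ar = (pk1 - p01 * pk / p0) / pd1
    # Estimated asymptotic variance, Theorem 2.3.1 with MLEs plugged in.
    braces = (ar ** 2 * pd1
              + 2 * ar * p02 * (p01 * pk - pk1 * p0) / p0 ** 2
              + (pk1 * p02 ** 2 + pk2 * p01 ** 2) / p0 ** 2
              + p01 * pk ** 2 * p02 / p0 ** 3)
    var = braces / (n * pd1 ** 2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * math.sqrt(var)
    return ar, ar - half, min(ar + half, 1.0)


results = [ar_k_wald(TABLE_2_3, k) for k in (1, 2, 3)]
for k, (est, lower, upper) in enumerate(results, start=1):
    print(f"AR_{k}: {est:.4f}  95% CI ({lower:.4f}, {upper:.4f})")
```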

We employ Monte Carlo simulations to evaluate the finite-sample performance of the interval estimators in terms of coverage probability and average length of the resulting confidence intervals, together with the Monte Carlo sample mean and the standard error of the estimates of the $AR_k$ from 10000 samples. For the purpose of the simulations we consider a multinomial distribution with parameters $n$ and $P = (p_{01}, p_{02}, p_{11}, p_{12}, p_{21}, p_{22}, p_{31}, p_{32})'$, where $p_{kd}$, $k = 0, 1, 2, 3$, denotes the probability of an individual falling into the cell at exposure level $k$ with disease status $d$, $d = 1, 2$, where 1 (2) means the presence (absence) of the disease in the population. In keeping with the data structure of the design, the components of $P$ are summarized in the table below.

Table 2.4: Probability distribution of subjects with respect to exposure levels and the disease status in the population to be considered for simulations.

                       Disease status
Exposure Levels        D       D̄       Total
0                      0.03    0.12    0.15
1                      0.05    0.14    0.19
2                      0.08    0.20    0.28
3                      0.11    0.27    0.38
Total                  0.27    0.73    1.00

For the Monte Carlo simulations, we generate 10000 random samples from the multinomial distribution with parameters $n$, $n = 50, 100, 500, 1000$, and $p_{kd}$ as presented in Table 2.4. For each simulated sample, we add 0.5 to all cell frequencies for which $n_{kd}$ is equal to 0 in order to improve the performance of the interval estimators (Whittemore, 1982; Lui and Kelly, 1999, 2000; Lui, 1998). Note that the only information needed to estimate $AR_k$, or the asymptotic variance of the estimate of $AR_k$, is the cell frequencies or the estimated cell probabilities. We therefore generate multinomial random samples from the aforesaid distribution and use the estimated cell probabilities in the construction of the confidence intervals. We construct 95% confidence intervals using Wald's statistic and obtain the estimated coverage probability and the average length for each interval estimator. The three parameters $AR_1$, $AR_2$ and $AR_3$ for the population given by Table 2.4 are 0.0444, 0.0889 and 0.1259, respectively.
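A scaled-down sketch of this simulation is given below. It is an illustration under several assumptions: the function names and the seed are hypothetical, far fewer than 10000 replicates are drawn to keep the run short, and the 0.5 correction is read as applying only to empty cells, with cell probabilities re-estimated from the adjusted total.

```python
import math
import random
from statistics import NormalDist

# Cell probabilities of Table 2.4: (D, not-D) for exposure levels 0..3.
TABLE_2_4 = [(0.03, 0.12), (0.05, 0.14), (0.08, 0.20), (0.11, 0.27)]


def ar_k_and_variance(cells, k):
    """AR_k estimate and its estimated asymptotic variance (Theorem 2.3.1)."""
    n = sum(a + b for a, b in cells)
    p = [(a / n, b / n) for a, b in cells]
    p01, p02 = p[0]
    pk1, pk2 = p[k]
    p0, pk = p01 + p02, pk1 + pk2
    pd1 = sum(row[0] for row in p)
    ar = (pk1 - p01 * pk / p0) / pd1
    braces = (ar ** 2 * pd1
              + 2 * ar * p02 * (p01 * pk - pk1 * p0) / p0 ** 2
              + (pk1 * p02 ** 2 + pk2 * p01 ** 2) / p0 ** 2
              + p01 * pk ** 2 * p02 / p0 ** 3)
    return ar, braces / (n * pd1 ** 2)


def simulate(probs, k, n, reps, seed=0):
    """Return (coverage, average length, mean estimate) of the 95% Wald CI."""
    rng = random.Random(seed)
    flat = [p for row in probs for p in row]
    true_ar, _ = ar_k_and_variance(probs, k)
    z = NormalDist().inv_cdf(0.975)
    hits, total_length, total_est = 0, 0.0, 0.0
    for _ in range(reps):
        counts = [0.0] * len(flat)
        for cell in rng.choices(range(len(flat)), weights=flat, k=n):
            counts[cell] += 1
        # Assumed reading of the correction: only empty cells receive 0.5.
        counts = [c if c > 0 else 0.5 for c in counts]
        cells = list(zip(counts[0::2], counts[1::2]))
        est, var = ar_k_and_variance(cells, k)
        half = z * math.sqrt(max(var, 0.0))
        lower, upper = est - half, min(est + half, 1.0)
        hits += lower <= true_ar <= upper
        total_length += upper - lower
        total_est += est
    return hits / reps, total_length / reps, total_est / reps


coverage, avg_length, mean_est = simulate(TABLE_2_4, k=1, n=500, reps=300, seed=1)
```

With so few replicates the coverage estimate is noisy; increasing `reps` toward 10000 brings the design closer to the one described in the text.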

The simulated results in Table 2.5 correspond to the population described by Table 2.4. It follows that the coverage probabilities for the estimated confidence intervals consistently perform well at the 95% confidence level for both types of interval estimators. The length of the confidence intervals gets smaller as the sample size $n$ gets larger. The Monte Carlo sample means of the estimates obtained from the repeated samples approach the true values of $AR_k$ as the sample size $n$ gets larger. Note that while the values of $AR_1$, $AR_2$ and $AR_3$ for the population in Table 2.4 are 0.0444, 0.0889 and 0.1259, respectively, the corresponding Monte Carlo mean estimates are 0.0443, 0.0889 and 0.1264, respectively, when $n = 1000$, which is very encouraging for their use in real-life problems. For $n = 500$, the simulated means are also very close to the corresponding true parameters.

Thus, given an adequate sample size, these estimators together with their interval estimators are appropriate for use in biostatistical and epidemiological research when estimating $AR_k$ due to various levels of exposure in cross-sectional studies.


Table 2.5: Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$ for population described in Table 2.4.

n       ARk     Coverage       Average     Mean      Standard
                Probability    Length                Deviation
50      AR1     0.9828         0.5912      0.0325    0.1413
        AR2     0.9679         0.7977      0.0655    0.1930
        AR3     0.9636         1.0267      0.0939    0.2459
100     AR1     0.9509         0.4035      0.0436    0.1051
        AR2     0.9409         0.5516      0.0880    0.1432
        AR3     0.9357         0.7056      0.1242    0.1818
500     AR1     0.9505         0.1791      0.0445    0.0456
        AR2     0.9494         0.2447      0.0884    0.0625
        AR3     0.9447         0.3138      0.1262    0.0800
1000    AR1     0.9485         0.1264      0.0443    0.0326
        AR2     0.9468         0.1729      0.0889    0.0445
        AR3     0.9477         0.2215      0.1264    0.0570


2.5 Some special cases for ARk

2.5.1 Monotonicity of ARk

In this section, we study the monotonicity of $AR_k$ when certain conditions are imposed on the cell probabilities. We consider the following three cases.

Case 1: Let us assume that $p_{k\cdot}$ is decreasing in $k \in \{0, 1, 2, \ldots, K\}$, that is, the probability of being exposed to a higher level of the risk factor is less than that of a lower level. Then

\[ \begin{aligned} AR_k - AR_{k-1} &= \frac{1}{p_{\cdot 1}}\left[p_{k1} - p_{(k-1)1} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\frac{p_{k1}}{p_{k\cdot}}p_{k\cdot} - \frac{p_{(k-1)1}}{p_{(k-1)\cdot}}p_{(k-1)\cdot} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{k\cdot} - \left(\frac{p_{(k-1)1}}{p_{(k-1)\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{(k-1)\cdot}\right] \\ &\leq \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{k\cdot} - \left(\frac{p_{(k-1)1}}{p_{(k-1)\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{k\cdot}\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{(k-1)1}}{p_{(k-1)\cdot}}\right)p_{k\cdot}\right] \\ &= \frac{1}{p_{\cdot 1}\,p_{(k-1)\cdot}}\left(p_{k1}p_{(k-1)2} - p_{(k-1)1}p_{k2}\right) \\ &= \frac{p_{(k-1)1}\,p_{k2}}{p_{\cdot 1}\,p_{(k-1)\cdot}}\left(\frac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} - 1\right) \\ &< 0 \quad\text{if } \frac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} < 1. \end{aligned} \]

Hence with $p_{k\cdot}$ decreasing in $k$, $AR_k$ is decreasing in $k$ if the odds ratio $\dfrac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} < 1$. An odds ratio below 1 implies that it is less likely to get the disease in the $k$th exposure level than in the $(k-1)$th level.

Example 2.2 Let us consider the multinomial distribution given in Table 2.6, which satisfies the above conditions, that is,

(i) $p_{k\cdot}$ is decreasing in $k$

(ii) odds ratio $\dfrac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} < 1$.

Table 2.6: Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i) and (ii)

                       Disease status      Total
Exposure Levels        D       D̄           pk.
0                      0.10    0.28        0.38
1                      0.07    0.24        0.31
2                      0.02    0.14        0.16
3                      0.01    0.14        0.15
Total                  0.20    0.80        1.00

The true values of the $AR_k$ are found to be -0.0579, -0.1105 and -0.1474, which are decreasing. For the purpose of the simulations, we generate 10000 random samples from the aforesaid distribution and, for each sample, we find the estimates of the $AR_k$ and the 95% confidence intervals using Wald's statistic. We also estimate the Monte Carlo sample mean and standard deviation of the estimates of the $AR_k$ from the 10000 samples. These results are summarized in Table 2.7 below.

Table 2.7: Simulated coverage probability and average length of the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$ for population described in Table 2.6.

n       ARk     Coverage       Average     Mean       Standard
                Probability    Length                 Deviation
50      AR1     0.9470         0.8756      -0.0554    0.2316
        AR2     0.9789         0.5243      -0.0929    0.1230
        AR3     0.9651         0.4924      -0.1181    0.1078
100     AR1     0.9481         0.6358      -0.0567    0.1653
        AR2     0.9707         0.3598      -0.1073    0.0919
        AR3     0.9663         0.3245      -0.1396    0.0774
500     AR1     0.9513         0.2828      -0.0576    0.0720
        AR2     0.9547         0.1582      -0.1103    0.0403
        AR3     0.9562         0.1377      -0.1475    0.0346
1000    AR1     0.9491         0.1995      -0.0583    0.0514
        AR2     0.9558         0.1118      -0.1109    0.0280
        AR3     0.9519         0.0972      -0.1475    0.0248

Note that for each $n = 50, 100, 500, 1000$, the mean estimates of $AR_k$ are decreasing.

Therefore, the variations in the prevalence rate and odds ratio provide some information about the attributable risk in each level of the risk factor.
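The conditions of Case 1 and the resulting ordering of the $AR_k$ can be checked directly for the population of Table 2.6. The sketch below is a minimal illustration; the helper names are hypothetical.

```python
# Cell probabilities of Table 2.6: (D, not-D) for exposure levels 0..3.
TABLE_2_6 = [(0.10, 0.28), (0.07, 0.24), (0.02, 0.14), (0.01, 0.14)]


def ar_k(table, k):
    """AR_k from (2.3.3) for exposure level k >= 1."""
    p01, p02 = table[0]
    pk1, pk2 = table[k]
    p_dot1 = sum(row[0] for row in table)
    return (pk1 - p01 * (pk1 + pk2) / (p01 + p02)) / p_dot1


def adjacent_odds_ratios(table):
    """Odds ratios p_k1 p_(k-1)2 / (p_(k-1)1 p_k2) for k = 1, ..., K."""
    return [table[k][0] * table[k - 1][1] / (table[k - 1][0] * table[k][1])
            for k in range(1, len(table))]


def is_decreasing(values):
    """True when each value is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


row_totals = [a + b for a, b in TABLE_2_6]        # p_k. for k = 0..3
odds_ratios = adjacent_odds_ratios(TABLE_2_6)
ar_values = [ar_k(TABLE_2_6, k) for k in (1, 2, 3)]

condition_i = is_decreasing(row_totals)           # p_k. decreasing in k
condition_ii = all(ratio < 1 for ratio in odds_ratios)
ar_decreasing = is_decreasing(ar_values)
```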

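Case 2: Let us assume that $p_{k\cdot}$ is increasing in $k \in \{0, 1, 2, \ldots, K\}$. Then

\[ \begin{aligned} AR_k - AR_{k-1} &= \frac{1}{p_{\cdot 1}}\left[p_{k1} - p_{(k-1)1} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\frac{p_{k1}}{p_{k\cdot}}p_{k\cdot} - \frac{p_{(k-1)1}}{p_{(k-1)\cdot}}p_{(k-1)\cdot} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{k\cdot} - \left(\frac{p_{(k-1)1}}{p_{(k-1)\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{(k-1)\cdot}\right] \\ &\geq \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{(k-1)\cdot} - \left(\frac{p_{(k-1)1}}{p_{(k-1)\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{(k-1)\cdot}\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{(k-1)1}}{p_{(k-1)\cdot}}\right)p_{(k-1)\cdot}\right] \\ &= \frac{1}{p_{\cdot 1}\,p_{k\cdot}}\left(p_{k1}p_{(k-1)2} - p_{(k-1)1}p_{k2}\right) \\ &= \frac{p_{(k-1)1}\,p_{k2}}{p_{\cdot 1}\,p_{k\cdot}}\left[\frac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} - 1\right] \\ &> 0 \quad\text{if } \frac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} > 1. \end{aligned} \]

Hence with $p_{k\cdot}$ increasing in $k$, $AR_k$ is also increasing in $k$ if the odds ratio $\dfrac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} > 1$ for $k \in \{1, 2, \ldots, K\}$.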


Example 2.3 Let us consider the multinomial distribution given in Table 2.8, which satisfies the above conditions, that is,

(i) $p_{k\cdot}$ is increasing in $k$

(ii) odds ratio $\dfrac{p_{k1}\,p_{(k-1)2}}{p_{(k-1)1}\,p_{k2}} > 1$.

Table 2.8: Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i) and (ii)

                       Disease status      Total
Exposure Levels        D       D̄           pk.
0                      0.04    0.14        0.18
1                      0.08    0.15        0.23
2                      0.11    0.13        0.24
3                      0.17    0.18        0.35
Total                  0.40    0.60        1.00

The true values of the $AR_k$ are found to be 0.0722, 0.1417 and 0.2306, which are increasing. For simulation purposes, we generate 10000 random samples from the aforesaid distribution and, for each sample, we find the estimates of the $AR_k$ and the 95% confidence intervals using Wald's statistic. The Monte Carlo sample mean and standard deviation of the estimates of the $AR_k$ from the 10000 samples are summarized in the table below.


Table 2.9: Simulated coverage probability and average length for the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$ for population described in Table 2.8

n       ARk     Coverage       Average     Mean      Standard
                Probability    Length                Deviation
50      AR1     0.9618         0.4594      0.0683    0.1158
        AR2     0.9458         0.4969      0.1361    0.1246
        AR3     0.9480         0.6531      0.2222    0.1621
100     AR1     0.9446         0.3203      0.0714    0.0841
        AR2     0.9412         0.3466      0.1432    0.0908
        AR3     0.9308         0.4574      0.2307    0.1191
500     AR1     0.9498         0.1426      0.0722    0.0365
        AR2     0.9462         0.1544      0.1416    0.0395
        AR3     0.9457         0.2046      0.2311    0.0522
1000    AR1     0.9486         0.1007      0.0721    0.0258
        AR2     0.9502         0.1092      0.1418    0.0278
        AR3     0.9485         0.1445      0.2307    0.0367

It is evident from the simulation results that the mean estimates of ARk are increasing for each

n = 50 , 100 , 500 , 1000 , as is expected.


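Case 3: Let us assume that $\dfrac{p_{k1}}{p_{k\cdot}}$ is increasing in $k \in \{0, 1, 2, \ldots, K\}$. Then

\[ \begin{aligned} AR_k - AR_{k-1} &= \frac{1}{p_{\cdot 1}}\left[p_{k1} - p_{(k-1)1} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\frac{p_{k1}}{p_{k\cdot}}p_{k\cdot} - \frac{p_{(k-1)1}}{p_{(k-1)\cdot}}p_{(k-1)\cdot} - \frac{p_{01}}{p_{0\cdot}}(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{k\cdot} - \left(\frac{p_{(k-1)1}}{p_{(k-1)\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)p_{(k-1)\cdot}\right] \\ &\geq \frac{1}{p_{\cdot 1}}\left[\left(\frac{p_{k1}}{p_{k\cdot}} - \frac{p_{01}}{p_{0\cdot}}\right)(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{1}{p_{\cdot 1}\,p_{k\cdot}\,p_{0\cdot}}\left[(p_{k1}p_{02} - p_{k2}p_{01})(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &= \frac{p_{k2}\,p_{01}}{p_{\cdot 1}\,p_{k\cdot}\,p_{0\cdot}}\left[\left(\frac{p_{k1}\,p_{02}}{p_{k2}\,p_{01}} - 1\right)(p_{k\cdot} - p_{(k-1)\cdot})\right] \\ &> 0 \quad\text{if } p_{k\cdot} \text{ is increasing in } k \text{ and } \frac{p_{k1}\,p_{02}}{p_{k2}\,p_{01}} > 1. \end{aligned} \]

Hence with $\dfrac{p_{k1}}{p_{k\cdot}}$ increasing in $k$, $AR_k$ is also increasing in $k$ if $p_{k\cdot}$ is increasing in $k$ and $\dfrac{p_{k1}\,p_{02}}{p_{k2}\,p_{01}} > 1$ for $k \in \{1, 2, \ldots, K\}$.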

Example 2.4 Let us consider the multinomial distribution given in Table 2.10, which satisfies the above conditions, that is,

(i) $p_{k\cdot}$ is increasing in $k$

(ii) odds ratio $\dfrac{p_{k1}\,p_{02}}{p_{k2}\,p_{01}} > 1$

(iii) $\dfrac{p_{k1}}{p_{k\cdot}}$ is increasing in $k$

Table 2.10: Probability distribution of subjects with respect to exposure levels and the disease status in the population satisfying above conditions (i), (ii) and (iii)

                       Disease status      Total
Exposure Levels        D       D̄           pk.       pk1/pk.
0                      0.03    0.15        0.18      0.17
1                      0.09    0.14        0.23      0.39
2                      0.12    0.12        0.24      0.50
3                      0.18    0.17        0.35      0.51
Total                  0.42    0.58        1.00

The true values of ARk s are found to be 0.1230, 0.1905 and 0.2897, which are increasing. Then we generate 10000 random samples from the aforesaid distribution and for each sample, we find

estimates of ARk s, the 95% confidence interval using Wald’s statistics. We also estimate the

Monte Carlo sample mean and standard deviation of the estimates of ARk s from 10000 samples.

Table 2.11 provides the simulation results.


Table 2.11: Simulated coverage probability and average length for the confidence interval together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$ for population described in Table 2.10

n       ARk     Coverage       Average     Mean      Standard
                Probability    Length                Deviation
50      AR1     0.9463         0.4342      0.1150    0.1057
        AR2     0.9413         0.4685      0.1814    0.1147
        AR3     0.9541         0.5982      0.2755    0.1434
100     AR1     0.9477         0.3000      0.1227    0.0758
        AR2     0.9417         0.3253      0.1894    0.0836
        AR3     0.9401         0.4174      0.2896    0.1072
500     AR1     0.9473         0.1339      0.1223    0.0345
        AR2     0.9505         0.1453      0.1902    0.0369
        AR3     0.9446         0.1871      0.2901    0.0479
1000    AR1     0.9485         0.0947      0.1231    0.0241
        AR2     0.9528         0.1028      0.1905    0.0259
        AR3     0.9466         0.1322      0.2900    0.0340


2.5.2 The AR with respect to the intermediate level

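Following the notation, $AR_k$ represents the percentage of the risk of the disease that could be reduced if the level of the risk factor had been reduced from level $E_k$ to $E_0$, where $E_0$ is the baseline or reference level. In this section we express the attributable risk at exposure level $E_k$ with intermediate base level $E_j$, denoted by $AR_{jk}$, in terms of $AR_j$ and $AR_k$, $0 < j < k$.

We have,

$$AR_k = \frac{1}{p_{\cdot 1}}\left\{p_{k1} - \frac{p_{01}\,p_{k\cdot}}{p_{0\cdot}}\right\} \tag{2.5.1}$$

$$AR_j = \frac{1}{p_{\cdot 1}}\left\{p_{j1} - \frac{p_{01}\,p_{j\cdot}}{p_{0\cdot}}\right\} \tag{2.5.2}$$

$$AR_{jk} = \frac{1}{p_{\cdot 1}}\left\{p_{k1} - \frac{p_{j1}\,p_{k\cdot}}{p_{j\cdot}}\right\} \tag{2.5.3}$$

From (2.5.1) we have,

$$AR_k\,p_{\cdot 1}\,p_{0\cdot} = p_{k1}\,p_{0\cdot} - p_{01}\,p_{k\cdot}$$

$$p_{0\cdot}\,(p_{k1} - AR_k\,p_{\cdot 1}) = p_{01}\,p_{k\cdot}$$

$$\frac{p_{01}}{p_{0\cdot}} = \frac{p_{k1} - AR_k\,p_{\cdot 1}}{p_{k\cdot}} \tag{2.5.4}$$

Similarly we have,

$$\frac{p_{01}}{p_{0\cdot}} = \frac{p_{j1} - AR_j\,p_{\cdot 1}}{p_{j\cdot}} \tag{2.5.5}$$

From (2.5.4) and (2.5.5) we have,

$$\frac{p_{k1} - AR_k\,p_{\cdot 1}}{p_{k\cdot}} = \frac{p_{j1} - AR_j\,p_{\cdot 1}}{p_{j\cdot}}$$

$$\frac{p_{j1}}{p_{j\cdot}} = \frac{p_{k1} - AR_k\,p_{\cdot 1}}{p_{k\cdot}} + \frac{AR_j\,p_{\cdot 1}}{p_{j\cdot}} \tag{2.5.6}$$

By (2.5.6) it follows from (2.5.3) that,

$$AR_{jk} = \frac{1}{p_{\cdot 1}}\left\{p_{k1} - \left(\frac{p_{k1} - AR_k\,p_{\cdot 1}}{p_{k\cdot}} + \frac{AR_j\,p_{\cdot 1}}{p_{j\cdot}}\right)p_{k\cdot}\right\}$$

After simplifying we have,

$$AR_{jk} = AR_k - AR_j\,\frac{p_{k\cdot}}{p_{j\cdot}}.$$

By the invariance property of the MLE, the MLE of $AR_{jk}$, $\widehat{AR}_{jk}$, is given by

$$\widehat{AR}_{jk} = \widehat{AR}_k - \widehat{AR}_j\,\frac{\hat p_{k\cdot}}{\hat p_{j\cdot}}.$$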
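The identity $AR_{jk} = AR_k - AR_j\,p_{k\cdot}/p_{j\cdot}$ can be checked numerically. The sketch below, which uses the population probabilities of Table 2.10 purely as an illustration, compares the direct definition (2.5.3) with the identity; the helper names are hypothetical.

```python
# Table 2.10 probabilities: level -> (P(level, D), P(level, not D)).
CELLS = {0: (0.03, 0.15), 1: (0.09, 0.14), 2: (0.12, 0.12), 3: (0.18, 0.17)}


def attributable_risk(cells, k, base=0):
    """Attributable risk at level k relative to the chosen base level."""
    p_disease = sum(diseased for diseased, _ in cells.values())
    pk1, pk2 = cells[k]
    pb1, pb2 = cells[base]
    return (pk1 - pb1 * (pk1 + pk2) / (pb1 + pb2)) / p_disease


def intermediate_ar_via_identity(cells, j, k):
    """AR_jk = AR_k - AR_j * p_k. / p_j., with AR_j and AR_k relative to E_0."""
    return (attributable_risk(cells, k)
            - attributable_risk(cells, j) * sum(cells[k]) / sum(cells[j]))


pairs = [(1, 2), (1, 3), (2, 3)]
direct = [attributable_risk(CELLS, k, base=j) for j, k in pairs]
via_identity = [intermediate_ar_via_identity(CELLS, j, k) for j, k in pairs]
for (j, k), a, b in zip(pairs, direct, via_identity):
    print(f"AR_{j}{k}: direct = {a:.6f}, identity = {b:.6f}")
```

Both routes give the same value for every pair $0 < j < k$, up to floating-point rounding.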

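The variance of the estimator $\widehat{AR}_{jk}$ can be obtained by replacing 0 by $j$ in $v(\widehat{AR}_k)$. Therefore,

$$v(\widehat{AR}_{jk}) = \frac{1}{n\,p_{\cdot 1}^2}\left\{AR_{jk}^2\,p_{\cdot 1} + 2\,AR_{jk}\,\frac{p_{j2}\,(p_{j1}\,p_{k\cdot} - p_{k1}\,p_{j\cdot})}{p_{j\cdot}^2} + \frac{p_{k1}\,p_{j2}^2 + p_{k2}\,p_{j1}^2}{p_{j\cdot}^2} + \frac{p_{j1}\,p_{k\cdot}^2\,p_{j2}}{p_{j\cdot}^3}\right\}$$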

Example 2.5 The following example illustrates the method using sample data on 966 subjects obtained from the Second National Health and Nutrition Examination Survey (NHANES II), conducted from 1976 to 1980 (McDowell et al. 1981). These data were selected from a larger research project investigating secular trends in cardiovascular disease risk factors over the twenty-year period 1960-1980 in the United States among women aged 18-24 years. In the study, the body mass index (BMI), expressed as weight (kg)/height (m)², is considered as a risk factor for diastolic blood pressure (DBP), with race (black and white) as a confounding factor. A DBP value exceeding 82.6 mmHg (determined from the 90th percentile of the distribution) is considered as hypertension. To fit the data to our study, we cross-classified the 966 individuals with respect to the body mass index (BMI < 23, 23 ≤ BMI < 25, 25 ≤ BMI < 27, BMI ≥ 27) and the status of the DBP (yes: D, no: D̄).

Table 2.12: Distribution of 966 subjects into four exposure levels with the respective disease status

BMI level             DBP: D    DBP: D̄    Total
0 (BMI < 23)          50        590       640
1 (23 ≤ BMI < 25)     11        119       130
2 (25 ≤ BMI < 27)     8         69        77
3 (BMI ≥ 27)          39        80        119
Total                 108       858       966

For the data set in Table 2.12, we calculate the MLEs $\hat p_{kd}$ of $p_{kd}$ and, based on these estimates, we obtain the MLEs $\widehat{AR}_{jk}$ of the $AR_{jk}$ and the corresponding confidence intervals by using Wald's test statistic.

The estimates of the unknown parameters $AR_{12}$, $AR_{13}$ and $AR_{23}$ are 0.0130, 0.2680 and 0.2477, respectively. The interpretation of the estimate of $AR_{12}$ is that about 1 per cent of the risk of hypertension could be avoided by reducing the BMI from 25 ≤ BMI < 27 to 23 ≤ BMI < 25. Likewise, the interpretation of the estimate of $AR_{13}$ is that about 27 per cent of the risk of hypertension could be avoided by reducing the BMI from BMI ≥ 27 to 23 ≤ BMI < 25. The corresponding 95% confidence intervals are (-0.0462, 0.0723), (0.1601, 0.3758) and (0.1270, 0.3685), respectively.

The interpretation of the second confidence interval, for $AR_{13}$, is: with 95% confidence, we can assert that between 16 per cent and 38 per cent of the risk of developing hypertension could be eliminated by reducing the BMI from BMI ≥ 27 to 23 ≤ BMI < 25. The confidence interval for $AR_{23}$ can be interpreted in the same way. Note that the lower limit of the confidence interval for $AR_{12}$ is less than 0. This suggests that, based on these data, there is no significant evidence at the 5% level that the proportional reduction in the risk of hypertension would be positive if the BMI level were reduced from 25 ≤ BMI < 27 to 23 ≤ BMI < 25. Therefore, we can combine levels 1 and 2 into a single level 23 ≤ BMI < 27, which avoids creating an unnecessary category of the risk factor. Thus, studying the attributable risk with respect to an intermediate level may play an important role in detecting the significance of a particular level when dealing with a risk factor with multiple exposure levels.
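The calculation in Example 2.5 can be sketched as follows. This is an illustrative Python sketch, not the original computation: it applies the variance expression of $\widehat{AR}_{jk}$ with $AR_{jk}$ in place of the baseline quantity, takes the square root of the variance in the Wald interval, and uses the counts of Table 2.12 as printed. Computed this way, the figures agree with the reported ones only to within about 0.002, so small third-decimal differences should be expected.

```python
import math

# Counts from Table 2.12: BMI level -> (hypertensive, not hypertensive).
COUNTS = {0: (50, 590), 1: (11, 119), 2: (8, 69), 3: (39, 80)}
Z_975 = 1.959964  # upper 2.5% point of the standard normal distribution


def intermediate_ar_ci(counts, j, k, z=Z_975):
    """Point estimate and Wald interval for AR_jk (base level j, level k)."""
    n = sum(d + h for d, h in counts.values())
    pd = sum(d for d, _ in counts.values()) / n  # estimate of p_.1
    pj1, pj2 = counts[j][0] / n, counts[j][1] / n
    pk1, pk2 = counts[k][0] / n, counts[k][1] / n
    pj, pk = pj1 + pj2, pk1 + pk2
    ar = (pk1 - pj1 * pk / pj) / pd
    var = (ar ** 2 * pd
           + 2 * ar * pj2 * (pj1 * pk - pk1 * pj) / pj ** 2
           + (pk1 * pj2 ** 2 + pk2 * pj1 ** 2) / pj ** 2
           + pj1 * pk ** 2 * pj2 / pj ** 3) / (n * pd ** 2)
    half = z * math.sqrt(var)
    return ar, ar - half, min(ar + half, 1.0)


results = {pair: intermediate_ar_ci(COUNTS, *pair)
           for pair in [(1, 2), (1, 3), (2, 3)]}
for (j, k), (est, lower, upper) in results.items():
    print(f"AR_{j}{k}: {est:.4f}  95% CI ({lower:.4f}, {upper:.4f})")
```

As in the text, the interval for $AR_{12}$ contains 0 while the intervals for $AR_{13}$ and $AR_{23}$ lie above 0.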


CHAPTER 3

ESTIMATION OF ARk WITH CONFOUNDERS

3.1 Introduction

In chapter 2, we considered estimating the attributable risk for a risk factor with dichotomous and multiple exposure levels. But in practice, there are factors that are related to both the risk factor and the disease outcome, usually called the confounding factors. For example, while smoking is a risk factor for heart disease, smokers may have developed sedentary lifestyles that contribute to the higher rates of heart disease. Here the sedentary lifestyle is considered as a confounding factor. Other habitual factors might also be considered as confounding factors.

These situations in real life often produce a much smaller reduction in the risk of disease than the attributable risk would predict. Therefore, adjustment has to be made for these confounding factors in order to estimate the AR correctly. One form of this adjustment is to replace the rate of disease given no exposure (only) in the denominator of the relative risk RR by the rate of disease among those unexposed if the exposure of interest alone were removed, but assuming all other characteristics of the exposed persons remain the same. Such adjustments can be made by stratification in the cross-sectional sampling scheme. Lui (2003) derived the covariate-adjusted AR by stratification for the confounding factors in case-control studies for a risk factor with multiple levels. Basu and Landis (1995) developed a model-based estimation procedure for the asymptotic variance of the maximum likelihood estimator of AR based on a logit linear model. Their expressions were restricted to the overall attributable risk adjusted for the presence of confounding factors. However, when the risk factor is significant, it is also important to know which levels of the risk factor contribute most. This requires investigation at local levels. In this chapter we emphasize the estimation of the overall AR as well as the estimation at local levels by using stratification of the confounding factors for a risk factor having multiple exposure levels.

Section 3.2 develops the model in order to derive the formula for estimating the covariate-adjusted attributable risk for each level of a polytomous risk factor. Section 3.3 provides the explicit form of the asymptotic variance of the estimate of the covariate-adjusted AR.

In Section 3.4, an example of a cross-sectional study on cardiovascular disease is considered to illustrate the method. We also use a Monte Carlo simulation from a specified multinomial distribution to assess the finite-sample performance of confidence intervals constructed by using Wald's test statistic. The simulations, each based on 10000 Monte Carlo samples, show that the confidence interval constructed from the derived standard error performs well at the 95% confidence level in terms of coverage probability and average length. They also show that the Monte Carlo sample means of the estimates approach the corresponding true parameters of the distribution, and that the Monte Carlo standard error decreases as the sample size increases.

3.2 Model development and point estimation of ARk

Consider the estimation of AR due to a risk factor with $K + 1$ exposure levels in a cross-sectional study. Let $E_k$, $k = 0, 1, \ldots, K$, be the levels of the risk factor under investigation, with $E_0$ referred to as the baseline or reference level. Suppose we take a random sample of $n$ subjects and simultaneously classify each subject by the status of a disease and a suspected risk factor with $K + 1$ exposure levels. Suppose further that the combination of all confounders forms $S$ levels, denoted by $C_s$, $s = 1, 2, \ldots, S$. To control the effect of these confounders, we post-stratify the $n$ sampled subjects according to the level of $C_s$. Let $n_{kds}$ be the random frequency of the $n$ individuals falling into the cell at exposure level $k$ in stratum $s$ with disease status $d$, $d = 1, 2$, where 1 (2) means the presence (absence) of the disease. We use Table 3.1 to summarize the overall data structure given by the aforesaid design.

Let $p_{kds}$, $k = 0, 1, \ldots, K$; $s = 1, 2, \ldots, S$; $d = 1, 2$, be the probability of a subject falling into the cell having observed frequency $n_{kds}$. Note that $\sum_k\sum_d\sum_s n_{kds} = \sum_d\sum_s n_{\cdot ds} = n$, $n_{k\cdot s} = n_{k1s} + n_{k2s}$, $0 < p_{kds} < 1$, and $p_{k\cdot s} = p_{k1s} + p_{k2s}$. Then the random vector $N$ given by

$$N' = (n_{011}, n_{021}, \ldots, n_{01S}, n_{02S}, n_{111}, n_{121}, \ldots, n_{11S}, n_{12S}, \ldots, n_{K11}, n_{K21}, \ldots, n_{K1S}, n_{K2S})$$

follows the multinomial distribution with parameters $n$ and $P$ given by

$$P' = (p_{011}, p_{021}, \ldots, p_{01S}, p_{02S}, p_{111}, p_{121}, \ldots, p_{11S}, p_{12S}, \ldots, p_{K11}, p_{K21}, \ldots, p_{K1S}, p_{K2S}).$$

Table 3.1: Classifying subjects by the exposure and confounding levels with respect to the status of the disease

                  Levels of confounders, C_s, and status of disease under C_s
Exposure          C_1                C_2                ...    C_S
levels            D        D̄         D        D̄         ...    D        D̄
0                 n_011    n_021     n_012    n_022     ...    n_01S    n_02S
1                 n_111    n_121     n_112    n_122     ...    n_11S    n_12S
...               ...      ...       ...      ...       ...    ...      ...
K                 n_K11    n_K21     n_K12    n_K22     ...    n_K1S    n_K2S
Total             n_.11    n_.21     n_.12    n_.22     ...    n_.1S    n_.2S

It can be easily verified that the maximum likelihood estimator (MLE) of $p_{kds}$ is given by $\hat p_{kds} = n_{kds}/n$. When the number of subjects, $n$, is large, by the Central Limit Theorem, the random vector $(\hat P - P)$ is asymptotically distributed as normal $N(0, \Sigma)$, where $0' = (0, 0, \ldots, 0)$ and $\Sigma$ is a $2(K+1)S \times 2(K+1)S$ covariance matrix given by

$$\Sigma = \frac{1}{n}\begin{bmatrix}
p_{011}(1 - p_{011}) & -p_{011}p_{021} & \cdots & -p_{011}p_{01S} & -p_{011}p_{02S} & \cdots & -p_{011}p_{K1S} & -p_{011}p_{K2S} \\
-p_{021}p_{011} & p_{021}(1 - p_{021}) & \cdots & -p_{021}p_{01S} & -p_{021}p_{02S} & \cdots & -p_{021}p_{K1S} & -p_{021}p_{K2S} \\
\vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots \\
-p_{K2S}p_{011} & -p_{K2S}p_{021} & \cdots & \cdots & \cdots & \cdots & \cdots & p_{K2S}(1 - p_{K2S})
\end{bmatrix}$$

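Let $D$ and $\bar D$ be the events of a subject being diseased and non-diseased, respectively. The attributable risk of a disease for reducing the exposure level from $E_k$ to $E_0$, denoted by $AR_k$, is defined as

$$AR_k = \sum_s \frac{[P(D \mid E_k, C_s) - P(D \mid E_0, C_s)]\,P(E_k \mid C_s)\,P(C_s)}{P(D)}. \tag{3.2.1}$$

Lui (2003) showed that equation (3.2.1) can be written as

$$AR_k = P(E_k \mid D) - \sum_s \frac{P(E_k, C_s \mid D)}{RR_{k|s}} \tag{3.2.2}$$

where $RR_{k|s} = P(D \mid E_k, C_s) / P(D \mid E_0, C_s)$ is the relative risk between exposure levels $E_k$ and $E_0$ in stratum $s$, $s = 1, 2, \ldots, S$.

Note that

$$P(E_k \mid D) = \sum_s P(E_k, C_s \mid D) = \frac{\sum_s p_{k1s}}{p_{\cdot 1\cdot}} = \frac{p_{k1\cdot}}{p_{\cdot 1\cdot}},$$

$$RR_{k|s} = \frac{P(D \mid E_k, C_s)}{P(D \mid E_0, C_s)} = \frac{p_{k1s}/p_{k\cdot s}}{p_{01s}/p_{0\cdot s}} = \frac{p_{k1s}\,p_{0\cdot s}}{p_{01s}\,p_{k\cdot s}}.$$

Then it follows from equation (3.2.2) that

$$AR_k = \frac{p_{k1\cdot}}{p_{\cdot 1\cdot}} - \sum_s \frac{p_{k1s}}{p_{\cdot 1\cdot}}\,\frac{p_{01s}\,p_{k\cdot s}}{p_{k1s}\,p_{0\cdot s}} = \frac{1}{p_{\cdot 1\cdot}}\left\{p_{k1\cdot} - \sum_s \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}}\right\}. \tag{3.2.3}$$

When $K + 1 = 2$, that is, $K = k = 1$, from equation (3.2.3)

$$\begin{aligned}
AR &= \frac{1}{p_{\cdot 1\cdot}}\left\{p_{11\cdot} - \sum_s \frac{p_{01s}\,p_{1\cdot s}}{p_{0\cdot s}}\right\} \\
&= \frac{1}{p_{\cdot 1\cdot}}\left\{p_{\cdot 1\cdot} - \sum_s p_{01s} - \sum_s \frac{p_{01s}\,p_{1\cdot s}}{p_{0\cdot s}}\right\} \\
&= \frac{1}{p_{\cdot 1\cdot}}\left\{p_{\cdot 1\cdot} - \sum_s \left(\frac{p_{01s}\,p_{0\cdot s} + p_{01s}\,p_{1\cdot s}}{p_{0\cdot s}}\right)\right\} \\
&= 1 - \sum_s \frac{p_{01s}\,(p_{0\cdot s} + p_{1\cdot s})}{p_{\cdot 1\cdot}\,p_{0\cdot s}} \\
&= 1 - \sum_s \frac{p_{01s}\,p_{\cdot\cdot s}}{p_{0\cdot s}\,p_{\cdot 1\cdot}}
\end{aligned} \tag{3.2.4}$$

which coincides with the expression obtained by Lui (2001b).

By the invariance property of the MLE, the MLE of $AR_k$, $\widehat{AR}_k$, is given by

$$\widehat{AR}_k = \frac{1}{\hat p_{\cdot 1\cdot}}\left\{\hat p_{k1\cdot} - \sum_s \frac{\hat p_{01s}\,\hat p_{k\cdot s}}{\hat p_{0\cdot s}}\right\}.$$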

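3.3 Derivation of the asymptotic variance of $\widehat{AR}_k$

To construct the confidence interval for $AR_k$, it is necessary to find the asymptotic variance of its estimator. We use the delta method to find the asymptotic variance of $\widehat{AR}_k$.

Let $\phi$ be the vector of partial derivatives of $\widehat{AR}_k$ with respect to the components of the vector $\hat P$, evaluated at $\hat P = P$. Then we have,

$$\phi' = \left(\frac{\partial AR_k}{\partial p_{011}}, \frac{\partial AR_k}{\partial p_{021}}, \ldots, \frac{\partial AR_k}{\partial p_{01S}}, \frac{\partial AR_k}{\partial p_{02S}}, \frac{\partial AR_k}{\partial p_{111}}, \frac{\partial AR_k}{\partial p_{121}}, \ldots, \frac{\partial AR_k}{\partial p_{11S}}, \frac{\partial AR_k}{\partial p_{12S}}, \ldots, \frac{\partial AR_k}{\partial p_{K1S}}, \frac{\partial AR_k}{\partial p_{K2S}}\right)$$

where

$$\begin{aligned}
\frac{\partial AR_k}{\partial p_{01l}} &= -\frac{1}{p_{\cdot 1\cdot}^2}\left\{p_{k1\cdot} - \sum_s \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}}\right\} + \frac{1}{p_{\cdot 1\cdot}}\left\{0 - \left(\frac{p_{0\cdot l}\,p_{k\cdot l} - p_{01l}\,p_{k\cdot l}}{p_{0\cdot l}^2}\right)\right\} \\
&= -\frac{1}{p_{\cdot 1\cdot}}\left(AR_k + \frac{p_{k\cdot l}}{p_{0\cdot l}} - \frac{p_{01l}\,p_{k\cdot l}}{p_{0\cdot l}^2}\right), \quad l = 1, 2, \ldots, S
\end{aligned}$$

$$\frac{\partial AR_k}{\partial p_{m1l}} = -\frac{1}{p_{\cdot 1\cdot}^2}\left\{p_{k1\cdot} - \sum_s \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}}\right\} = -\frac{1}{p_{\cdot 1\cdot}}\,AR_k, \quad \text{for } m \neq k,\ m \geq 1$$

$$\begin{aligned}
\frac{\partial AR_k}{\partial p_{k1l}} &= -\frac{1}{p_{\cdot 1\cdot}^2}\left\{p_{k1\cdot} - \sum_s \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}}\right\} + \frac{1}{p_{\cdot 1\cdot}}\left\{1 - \frac{p_{01l}}{p_{0\cdot l}}\right\} \\
&= -\frac{1}{p_{\cdot 1\cdot}}\left(AR_k - \frac{p_{02l}}{p_{0\cdot l}}\right), \quad \text{since } p_{0\cdot l} = p_{01l} + p_{02l}
\end{aligned}$$

$$\frac{\partial AR_k}{\partial p_{02l}} = \frac{1}{p_{\cdot 1\cdot}}\left\{0 - \left(-\frac{1}{p_{0\cdot l}^2}\right)p_{01l}\,p_{k\cdot l}\right\} = \frac{p_{01l}\,p_{k\cdot l}}{p_{\cdot 1\cdot}\,p_{0\cdot l}^2}, \quad l = 1, 2, \ldots, S$$

$$\frac{\partial AR_k}{\partial p_{m2l}} = 0, \quad \text{for } m \neq k,\ m \geq 1$$

$$\frac{\partial AR_k}{\partial p_{k2l}} = \frac{1}{p_{\cdot 1\cdot}}\left\{0 - \frac{p_{01l}}{p_{0\cdot l}}\right\} = -\frac{p_{01l}}{p_{\cdot 1\cdot}\,p_{0\cdot l}}.$$

By using the delta method, the asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, is $\phi'\,\Sigma\,\phi$.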

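Lemma 3.3.1 Under the above notation,

$$\phi'\,\Sigma\,\phi = \frac{1}{n}\left\{\sum_{i=0}^{K}\sum_{d=1}^{2}\sum_{s=1}^{S} p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right)^2 - A_{ids}^2\right\}, \quad \text{where } A_{ids} = \sum_{i=0}^{K}\sum_{d=1}^{2}\sum_{s=1}^{S} p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right).$$

Proof: The variance-covariance matrix $\Sigma$ of the vector $\hat P$ can be expressed as

$$\Sigma = \frac{1}{n}\left[\mathrm{diag}(p_{011}, p_{021}, p_{012}, p_{022}, \ldots, p_{K1S}, p_{K2S}) - PP'\right]$$

where

$$\mathrm{diag}(p_{011}, p_{021}, p_{012}, p_{022}, \ldots, p_{K1S}, p_{K2S}) = \begin{bmatrix}
p_{011} & & & & & & \\
 & p_{021} & & & & 0 & \\
 & & p_{012} & & & & \\
 & & & p_{022} & & & \\
 & & & & \ddots & & \\
 & 0 & & & & p_{K1S} & \\
 & & & & & & p_{K2S}
\end{bmatrix}$$

and $P' = (p_{011}, p_{021}, p_{012}, p_{022}, \ldots, p_{K1S}, p_{K2S})$.

Then,

$$n\,\phi'\Sigma\phi = \phi'\,\mathrm{diag}(p_{011}, p_{021}, p_{012}, p_{022}, \ldots, p_{K1S}, p_{K2S})\,\phi - \phi' PP' \phi$$

After simplifying it can be shown that,

$$\phi'\,\mathrm{diag}(p_{011}, p_{021}, p_{012}, p_{022}, \ldots, p_{K1S}, p_{K2S})\,\phi = \sum_i\sum_d\sum_s p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right)^2$$

and

$$\phi' PP' \phi = \left[\sum_i\sum_d\sum_s p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right)\right]^2 = A_{ids}^2.$$

Hence Lemma 3.3.1 follows.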
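The algebraic identity behind the lemma holds for any probability vector and any gradient vector, so it can be checked numerically. The sketch below uses an arbitrary four-cell probability vector and an arbitrary vector in place of $\phi$; both are hypothetical inputs chosen only for the check.

```python
def quad_form_direct(p, phi, n):
    """phi' Sigma phi, with Sigma built entry by entry as (diag(p) - p p') / n."""
    total = 0.0
    for a in range(len(p)):
        for b in range(len(p)):
            if a == b:
                sigma_ab = p[a] * (1.0 - p[a]) / n
            else:
                sigma_ab = -p[a] * p[b] / n
            total += phi[a] * sigma_ab * phi[b]
    return total


def quad_form_lemma(p, phi, n):
    """(sum p_i phi_i^2 - (sum p_i phi_i)^2) / n, the right-hand side of the lemma."""
    first = sum(pi * fi * fi for pi, fi in zip(p, phi))
    a_term = sum(pi * fi for pi, fi in zip(p, phi))
    return (first - a_term ** 2) / n


p = [0.1, 0.2, 0.3, 0.4]       # hypothetical cell probabilities, summing to 1
phi = [1.0, -2.0, 0.5, 3.0]    # hypothetical gradient vector
direct_value = quad_form_direct(p, phi, n=50)
lemma_value = quad_form_lemma(p, phi, n=50)
print(direct_value, lemma_value)
```

The second form avoids building the full $2(K+1)S \times 2(K+1)S$ matrix, which is what makes the closed-form variance in the next theorem tractable.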

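Theorem 3.3.1 The asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, is given by

$$\begin{aligned}
v(\widehat{AR}_k) = \frac{1}{n\,p_{\cdot 1\cdot}^2}\Bigg[ & AR_k^2\,p_{\cdot 1\cdot} + 2\,AR_k \sum_s \frac{p_{02s}\,(p_{01s}\,p_{k\cdot s} - p_{k1s}\,p_{0\cdot s})}{p_{0\cdot s}^2} + \sum_s \frac{p_{k1s}\,p_{02s}^2 + p_{k2s}\,p_{01s}^2}{p_{0\cdot s}^2} \\
& + \sum_s \frac{p_{01s}\,p_{k\cdot s}^2\,p_{02s}}{p_{0\cdot s}^3} - \left(AR_k\,p_{\cdot 1\cdot} + \sum_s \frac{p_{k2s}\,p_{01s} - p_{k1s}\,p_{02s}}{p_{0\cdot s}}\right)^2 \Bigg]
\end{aligned}$$

Proof: Note that

$$\begin{aligned}
\sum_{i=0}^{K}\sum_{d=1}^{2}\sum_{s=1}^{S} p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right)^2
={}& \sum_s p_{01s}\left(\frac{\partial AR_k}{\partial p_{01s}}\right)^2 + \sum_s p_{k1s}\left(\frac{\partial AR_k}{\partial p_{k1s}}\right)^2 + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i1s}\left(\frac{\partial AR_k}{\partial p_{i1s}}\right)^2 \\
& + \sum_s p_{02s}\left(\frac{\partial AR_k}{\partial p_{02s}}\right)^2 + \sum_s p_{k2s}\left(\frac{\partial AR_k}{\partial p_{k2s}}\right)^2 + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i2s}\left(\frac{\partial AR_k}{\partial p_{i2s}}\right)^2 \\
={}& \sum_s p_{01s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\left(AR_k + \frac{p_{k\cdot s}}{p_{0\cdot s}} - \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2}\right)^2\right\} + \sum_s p_{k1s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\left(AR_k - \frac{p_{02s}}{p_{0\cdot s}}\right)^2\right\} \\
& + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i1s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\,AR_k^2\right\} + \sum_s p_{02s}\left(\frac{p_{01s}\,p_{k\cdot s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}^2}\right)^2 + \sum_s p_{k2s}\left(-\frac{p_{01s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}}\right)^2 \\
={}& \sum_s p_{01s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\left(AR_k^2 + \left(\frac{p_{k\cdot s}}{p_{0\cdot s}} - \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2}\right)^2 + 2\,AR_k\left(\frac{p_{k\cdot s}}{p_{0\cdot s}} - \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2}\right)\right)\right\} \\
& + \sum_s p_{k1s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\left(AR_k^2 + \left(\frac{p_{02s}}{p_{0\cdot s}}\right)^2 - 2\,AR_k\,\frac{p_{02s}}{p_{0\cdot s}}\right)\right\} + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i1s}\left\{\frac{1}{p_{\cdot 1\cdot}^2}\,AR_k^2\right\} \\
& + \sum_s p_{02s}\left(\frac{p_{01s}\,p_{k\cdot s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}^2}\right)^2 + \sum_s p_{k2s}\left(-\frac{p_{01s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}}\right)^2 \\
={}& \frac{AR_k^2}{p_{\cdot 1\cdot}^2}\,p_{01\cdot} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s p_{01s}\,\frac{(p_{0\cdot s}\,p_{k\cdot s} - p_{01s}\,p_{k\cdot s})^2}{p_{0\cdot s}^4} + \frac{2\,AR_k}{p_{\cdot 1\cdot}^2}\sum_s p_{01s}\,\frac{p_{0\cdot s}\,p_{k\cdot s} - p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2} \\
& + \frac{AR_k^2}{p_{\cdot 1\cdot}^2}\,p_{k1\cdot} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k1s}\,p_{02s}^2}{p_{0\cdot s}^2} - \frac{2\,AR_k}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k1s}\,p_{02s}}{p_{0\cdot s}} + \frac{1}{p_{\cdot 1\cdot}^2}\,AR_k^2 \sum_{i=1,\,i\neq k}^{K} p_{i1\cdot} \\
& + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{02s}\,p_{01s}^2\,p_{k\cdot s}^2}{p_{0\cdot s}^4} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k2s}\,p_{01s}^2}{p_{0\cdot s}^2} \\
={}& \frac{AR_k^2\left(p_{01\cdot} + p_{k1\cdot} + \sum_{i=1,\,i\neq k}^{K} p_{i1\cdot}\right)}{p_{\cdot 1\cdot}^2} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{01s}\,p_{k\cdot s}^2\,p_{02s}^2}{p_{0\cdot s}^4} + \frac{2\,AR_k}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{01s}\,p_{k\cdot s}\,p_{02s}}{p_{0\cdot s}^2} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k1s}\,p_{02s}^2}{p_{0\cdot s}^2} \\
& - \frac{2\,AR_k}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k1s}\,p_{02s}}{p_{0\cdot s}} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{02s}\,p_{01s}^2\,p_{k\cdot s}^2}{p_{0\cdot s}^4} + \frac{1}{p_{\cdot 1\cdot}^2}\sum_s \frac{p_{k2s}\,p_{01s}^2}{p_{0\cdot s}^2} \\
={}& \frac{1}{p_{\cdot 1\cdot}^2}\left(AR_k^2\,p_{\cdot 1\cdot} + 2\,AR_k\sum_s \frac{p_{02s}\,(p_{01s}\,p_{k\cdot s} - p_{k1s}\,p_{0\cdot s})}{p_{0\cdot s}^2} + \sum_s \frac{p_{k1s}\,p_{02s}^2 + p_{k2s}\,p_{01s}^2}{p_{0\cdot s}^2} + \sum_s \frac{p_{01s}\,p_{k\cdot s}^2\,p_{02s}}{p_{0\cdot s}^3}\right)
\end{aligned}$$

Also,

$$\begin{aligned}
A_{ids} ={}& \sum_i\sum_d\sum_s p_{ids}\left(\frac{\partial AR_k}{\partial p_{ids}}\right) \\
={}& \sum_s p_{01s}\left(\frac{\partial AR_k}{\partial p_{01s}}\right) + \sum_s p_{k1s}\left(\frac{\partial AR_k}{\partial p_{k1s}}\right) + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i1s}\left(\frac{\partial AR_k}{\partial p_{i1s}}\right) + \sum_s p_{02s}\left(\frac{\partial AR_k}{\partial p_{02s}}\right) \\
& + \sum_s p_{k2s}\left(\frac{\partial AR_k}{\partial p_{k2s}}\right) + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i2s}\left(\frac{\partial AR_k}{\partial p_{i2s}}\right) \\
={}& \sum_s p_{01s}\left\{-\frac{1}{p_{\cdot 1\cdot}}\left(AR_k + \frac{p_{k\cdot s}}{p_{0\cdot s}} - \frac{p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2}\right)\right\} + \sum_s p_{k1s}\left\{-\frac{1}{p_{\cdot 1\cdot}}\left(AR_k - \frac{p_{02s}}{p_{0\cdot s}}\right)\right\} \\
& + \sum_{i=1,\,i\neq k}^{K}\sum_s p_{i1s}\left\{-\frac{1}{p_{\cdot 1\cdot}}\,AR_k\right\} + \sum_s p_{02s}\left(\frac{p_{01s}\,p_{k\cdot s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}^2}\right) + \sum_s p_{k2s}\left(-\frac{p_{01s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}}\right) \\
={}& -\frac{AR_k}{p_{\cdot 1\cdot}}\left(p_{01\cdot} + p_{k1\cdot} + \sum_{i=1,\,i\neq k}^{K} p_{i1\cdot}\right) - \sum_s p_{01s}\,\frac{(p_{0\cdot s} - p_{01s})\,p_{k\cdot s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}^2} + \sum_s p_{k1s}\,\frac{p_{02s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}} \\
& + \sum_s p_{02s}\,\frac{p_{01s}\,p_{k\cdot s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}^2} - \sum_s p_{k2s}\,\frac{p_{01s}}{p_{\cdot 1\cdot}\,p_{0\cdot s}} \\
={}& -\frac{1}{p_{\cdot 1\cdot}}\left(AR_k\,p_{\cdot 1\cdot} + \sum_s \frac{p_{01s}\,p_{02s}\,p_{k\cdot s}}{p_{0\cdot s}^2} - \sum_s \frac{p_{k1s}\,p_{02s}}{p_{0\cdot s}} - \sum_s \frac{p_{02s}\,p_{01s}\,p_{k\cdot s}}{p_{0\cdot s}^2} + \sum_s \frac{p_{k2s}\,p_{01s}}{p_{0\cdot s}}\right) \\
={}& -\frac{1}{p_{\cdot 1\cdot}}\left(AR_k\,p_{\cdot 1\cdot} + \sum_s \frac{p_{k2s}\,p_{01s} - p_{k1s}\,p_{02s}}{p_{0\cdot s}}\right)
\end{aligned}$$

Hence, the asymptotic variance of $\widehat{AR}_k$, $v(\widehat{AR}_k)$, follows immediately.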


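Corollary 3.3.1 When $K + 1 = 2$,

$$\begin{aligned}
v(\widehat{AR}) = \frac{1}{n\,p_{\cdot 1\cdot}^2}\Bigg[ & AR^2\,p_{\cdot 1\cdot} + 2\,AR \sum_s \frac{p_{02s}\,(p_{01s}\,p_{12s} - p_{11s}\,p_{02s})}{p_{0\cdot s}^2} + \sum_s \frac{p_{11s}\,p_{02s}^2 + p_{12s}\,p_{01s}^2}{p_{0\cdot s}^2} \\
& + \sum_s \frac{p_{01s}\,p_{1\cdot s}^2\,p_{02s}}{p_{0\cdot s}^3} - \left(AR\,p_{\cdot 1\cdot} + \sum_s \frac{p_{12s}\,p_{01s} - p_{11s}\,p_{02s}}{p_{0\cdot s}}\right)^2 \Bigg]
\end{aligned}$$

Proof: When $K + 1 = 2$, that is, $K = k = 1$, we have

$$p_{01s}\,p_{1\cdot s} - p_{11s}\,p_{0\cdot s} = p_{01s}\,p_{11s} + p_{01s}\,p_{12s} - p_{11s}\,p_{01s} - p_{11s}\,p_{02s} = p_{01s}\,p_{12s} - p_{11s}\,p_{02s},$$

and it follows from the theorem that

$$\begin{aligned}
v(\widehat{AR}) = \frac{1}{n\,p_{\cdot 1\cdot}^2}\Bigg[ & AR^2\,p_{\cdot 1\cdot} + 2\,AR \sum_s \frac{p_{02s}\,(p_{01s}\,p_{12s} - p_{11s}\,p_{02s})}{p_{0\cdot s}^2} + \sum_s \frac{p_{11s}\,p_{02s}^2 + p_{12s}\,p_{01s}^2}{p_{0\cdot s}^2} \\
& + \sum_s \frac{p_{01s}\,p_{1\cdot s}^2\,p_{02s}}{p_{0\cdot s}^3} - \left(AR\,p_{\cdot 1\cdot} + \sum_s \frac{p_{12s}\,p_{01s} - p_{11s}\,p_{02s}}{p_{0\cdot s}}\right)^2 \Bigg]
\end{aligned}$$

This expression is exactly the same as the one obtained by Lui (2001b) for $K + 1 = 2$.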

Note that $\hat v(\widehat{AR}_k)$ can be obtained by substituting the MLEs $\hat p_{kds}$ for $p_{kds}$ and $\widehat{AR}_k$ for $AR_k$. Then an asymptotic $100(1 - \alpha)$ per cent confidence interval for $AR_k$ using Wald's statistic is given by

$$\left[\widehat{AR}_k - Z_{\alpha/2}\sqrt{\hat v(\widehat{AR}_k)},\ \ \min\left(\widehat{AR}_k + Z_{\alpha/2}\sqrt{\hat v(\widehat{AR}_k)},\ 1\right)\right],$$

where $Z_\alpha$ is the upper $100(\alpha)$th percentile of the standard normal distribution.

3.4 Example and simulation

The following example illustrates the application of the method in estimating the attributable risk and the corresponding confidence intervals. In this example, the risk factor (body mass index) has four exposure levels and the confounding factor (race) has two levels. The example appeared in a paper by Basu and Landis (1995), and the data were originally obtained from NHANES II, conducted from 1976 to 1980 (McDowell et al., 1981). This is a cross-sectional study on cardiovascular disease among young adult women aged between 18 and 24 years, where diastolic blood pressure exceeding 82.6 mmHg (determined from the 90th percentile of the distribution) is considered as a disease. The risk factor, body mass index (BMI), is defined as weight (kg)/height (m)², and the confounding factor has two levels, white and black. The data are given in Table 3.2.

Table 3.2: Classifying 966 subjects by exposure levels, race and disease status

Exposure levels       C_1 = White        C_2 = Black
                      D       D̄          D       D̄
0 (BMI < 23)          40      527        10      63
1 (23 ≤ BMI < 25)     8       101        3       18
2 (25 ≤ BMI < 27)     7       57         1       12
3 (BMI ≥ 27)          29      64         10      16
Total                 84      749        24      109

For the data set in Table 3.2, Basu and Landis (1995) used the logit linear model to estimate the covariate-adjusted attributable risk and the overall attributable risk for the risk factor. We calculate the MLEs $\hat p_{kds}$ of $p_{kds}$ and, based on these estimates, obtain the MLE $\widehat{AR}_k$ of $AR_k$ and the corresponding confidence intervals by using Wald's test statistic.

The estimates of the local attributable risks $AR_1$, $AR_2$ and $AR_3$ for BMI, after controlling for race, are 0.0040, 0.0151 and 0.2674, respectively. The interpretation of the estimate of $AR_1$ is that about 0.4 per cent of cardiovascular disease risk could be avoided by reducing body mass index from level 1 to level 0 after adjusting for race. The estimated value of 0.2674 for $AR_3$ means that about 27 per cent of cardiovascular disease risk could be prevented by reducing BMI from level 3 to level 0. A similar interpretation holds for $AR_2$.

The corresponding race-adjusted 95% confidence intervals are (-0.0591, 0.0671), (-0.0361, 0.0663) and (0.1699, 0.3648). Note that the confidence intervals for $AR_1$ and $AR_2$ include 0, and hence we may conclude that there is no significant evidence at the 5% level that the cardiovascular disease risk would be reduced by lowering body mass index from level 1 to level 0, or from level 2 to level 0, after adjusting for race. However, with 95% confidence, we can assert that between 17 per cent and 36 per cent of cardiovascular disease risk would be avoided by reducing body mass index from level 3 to level 0. The estimate of the unadjusted overall attributable risk, $AR_o$, obtained by collapsing over all the exposure levels and the confounding levels, is 0.3006, and the corresponding 95% confidence interval is (0.1665, 0.4347). The results are summarized in the following table.

Table 3.3: Estimated values of $AR_k$ and the confidence intervals

Parameter    Estimated value    Confidence interval
ARo          0.3006             (0.1665, 0.4347)
AR1          0.0040             (-0.0591, 0.0671)
AR2          0.0151             (-0.0361, 0.0663)
AR3          0.2674             (0.1699, 0.3648)
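The race-adjusted estimates and Wald intervals can be reproduced approximately from Table 3.2 with the point estimate (3.2.3) and the variance of Theorem 3.3.1. The Python sketch below is illustrative, with hypothetical names, and uses the counts as printed. Under that assumption it matches the reported values for $AR_1$ and $AR_3$ to about four decimals, while the value for $AR_2$ comes out slightly higher than the reported 0.0151, possibly because of rounding or transcription of the table.

```python
import math

# Table 3.2 counts: level -> [(D, not D) for White, (D, not D) for Black].
COUNTS = {
    0: [(40, 527), (10, 63)],
    1: [(8, 101), (3, 18)],
    2: [(7, 57), (1, 12)],
    3: [(29, 64), (10, 16)],
}
Z_975 = 1.959964  # upper 2.5% point of the standard normal distribution


def adjusted_ar_ci(counts, k, z=Z_975):
    """Stratum-adjusted AR_k with its Wald interval (Theorem 3.3.1 variance)."""
    n = sum(d + h for strata in counts.values() for d, h in strata)
    p = {lvl: [(d / n, h / n) for d, h in strata] for lvl, strata in counts.items()}
    pd = sum(d for strata in p.values() for d, _ in strata)  # p_.1.
    t2 = t3 = t4 = t5 = correction = 0.0
    for (p01, p02), (pk1, pk2) in zip(p[0], p[k]):
        p0, pk = p01 + p02, pk1 + pk2
        correction += p01 * pk / p0
        t2 += p02 * (p01 * pk - pk1 * p0) / p0 ** 2
        t3 += (pk1 * p02 ** 2 + pk2 * p01 ** 2) / p0 ** 2
        t4 += p01 * pk ** 2 * p02 / p0 ** 3
        t5 += (pk2 * p01 - pk1 * p02) / p0
    ar = (sum(d for d, _ in p[k]) - correction) / pd
    var = (ar ** 2 * pd + 2 * ar * t2 + t3 + t4 - (ar * pd + t5) ** 2) / (n * pd ** 2)
    half = z * math.sqrt(var)
    return ar, ar - half, min(ar + half, 1.0)


results = {k: adjusted_ar_ci(COUNTS, k) for k in (1, 2, 3)}
for k, (est, lower, upper) in results.items():
    print(f"AR_{k}: {est:.4f}  95% CI ({lower:.4f}, {upper:.4f})")
```

In this sketch the squared term `(ar * pd + t5) ** 2` is numerically zero: $AR_k$ is unchanged when all cell probabilities are multiplied by a common constant, so $\sum p_{ids}\,\partial AR_k/\partial p_{ids}$ vanishes. It is kept only so that the code mirrors the theorem term by term.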


To evaluate the finite-sample performance of the interval estimators, we use Monte Carlo simulation to compute the coverage probability and the average length of the resulting confidence intervals, together with the Monte Carlo sample mean and the standard error of the estimates of $AR_k$ from 10000 samples. For the purpose of the simulation we consider a multinomial distribution with parameters $n$ and $P$ given by

$$P' = (p_{011}, p_{021}, \ldots, p_{014}, p_{024}, p_{111}, p_{121}, \ldots, p_{114}, p_{124}, \ldots, p_{311}, p_{321}, \ldots, p_{314}, p_{324}),$$

where $p_{kds}$, $k = 0, 1, 2, 3$; $s = 1, 2, 3, 4$, denotes the probability of an individual falling into the cell at exposure level $k$ in stratum $s$ with disease status $d$, $d = 1, 2$, where 1 (2) means the presence (absence) of the disease in the population. The components of $P$ are summarized in the table below.

Table 3.4: Probability distribution of subjects with respect to exposure levels, confounding levels and the disease status in the population to be considered for simulation

Exposure    C_1               C_2               C_3               C_4
levels      D       D̄         D       D̄         D       D̄         D       D̄
0           0.002   0.023     0.006   0.029     0.005   0.035     0.012   0.028
1           0.005   0.035     0.016   0.030     0.012   0.047     0.019   0.058
2           0.016   0.058     0.021   0.023     0.021   0.058     0.023   0.067
3           0.023   0.070     0.023   0.035     0.027   0.058     0.035   0.080

For the Monte Carlo simulation we generate 10000 random samples from the multinomial distribution with parameters $n$, $n = 50, 100, 500, 1000$, and $p_{kds}$. For each simulated sample, we add 0.5 to every cell frequency $n_{kds}$ that is equal to 0 in order to improve the performance of the interval estimators. We construct 95% confidence intervals and obtain the estimated coverage probabilities and the average lengths for each interval estimator. We also estimate the Monte Carlo sample mean and the standard error of the estimates of $AR_k$ from the 10000 samples.

The values of the three parameters $AR_1$, $AR_2$ and $AR_3$ for the population given by Table 3.4 are 0.0392, 0.1153 and 0.1710, respectively. The overall attributable risk, $AR_o$, is 0.3287. The simulation results for the distribution presented in Table 3.4 are given in Table 3.5.
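These population values follow from Table 3.4 and equation (3.2.3). The Python sketch below recomputes them; it assumes that the overall $AR_o$ is the unadjusted quantity obtained by collapsing over strata and over the exposed levels, $1 - P(D \mid E_0)/P(D)$, which is consistent with the value 0.3287. The function names are illustrative.

```python
# Table 3.4: P[k][s] = (P(level k, stratum s, D), P(level k, stratum s, not D)).
P = {
    0: [(0.002, 0.023), (0.006, 0.029), (0.005, 0.035), (0.012, 0.028)],
    1: [(0.005, 0.035), (0.016, 0.030), (0.012, 0.047), (0.019, 0.058)],
    2: [(0.016, 0.058), (0.021, 0.023), (0.021, 0.058), (0.023, 0.067)],
    3: [(0.023, 0.070), (0.023, 0.035), (0.027, 0.058), (0.035, 0.080)],
}


def adjusted_ar(p, k):
    """Equation (3.2.3): AR_k = {p_k1. - sum_s p_01s p_k.s / p_0.s} / p_.1."""
    pd = sum(d for strata in p.values() for d, _ in strata)
    correction = sum(p01 * (pk1 + pk2) / (p01 + p02)
                     for (p01, p02), (pk1, pk2) in zip(p[0], p[k]))
    return (sum(d for d, _ in p[k]) - correction) / pd


def unadjusted_overall_ar(p):
    """Collapsed overall AR: 1 - P(D | E_0) / P(D) (assumed definition of AR_o)."""
    pd = sum(d for strata in p.values() for d, _ in strata)
    p01 = sum(d for d, _ in p[0])
    p0 = sum(d + h for d, h in p[0])
    return 1.0 - (p01 / p0) / pd


true_values = [adjusted_ar(P, k) for k in (1, 2, 3)]
overall = unadjusted_overall_ar(P)
print([f"{v:.4f}" for v in true_values], f"{overall:.4f}")
```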

From the simulation results, it follows that the estimated confidence intervals consistently attain coverage probabilities close to the 95% confidence level. The length of the confidence intervals gets smaller as the sample size $n$ gets larger. The Monte Carlo sample means of the estimates obtained from the repeated samples approach the true values of $AR_k$ for the distribution studied as the sample size $n$ gets larger. Note that while the true values of $AR_1$, $AR_2$ and $AR_3$ for the distribution in Table 3.4 are 0.0392, 0.1153 and 0.1710, respectively, the corresponding Monte Carlo simulated mean estimates are 0.0389, 0.1153 and 0.1714 when $n = 1000$. For $n = 500$, the simulated means are also very close to the corresponding true parameter values. The Monte Carlo standard error also decreases as the sample size increases. Thus, given an adequate sample size, these estimators together with their interval estimators are appropriate for use in biostatistical and epidemiological research for estimating $AR_k$ due to various levels of exposure in cross-sectional studies in the presence of confounders.

Table 3.5: Simulated coverage probability and average length together with Monte Carlo sample mean and the standard error of the estimates of $AR_k$, $k = 1, 2, 3$ for distribution in Table 3.4

n       AR_k    Coverage       Average    Mean       Standard
                probability    length                deviation
50      ARo     0.9661         1.7762     0.2382     0.4777
        AR1     0.9993         0.6776     -0.0450    0.1043
        AR2     0.9915         0.8103     -0.0434    0.1262
        AR3     0.9825         0.9412     -0.0413    0.1460
100     ARo     0.9156         1.3392     0.3216     0.3580
        AR1     0.9903         0.5199     -0.0198    0.0975
        AR2     0.9787         0.6465     0.0112     0.1197
        AR3     0.9686         0.7633     0.0371     0.1409
500     ARo     0.9393         0.6393     0.3293     0.1635
        AR1     0.9535         0.2066     0.0356     0.0521
        AR2     0.9581         0.2561     0.1090     0.0632
        AR3     0.9536         0.3051     0.1636     0.0757
1000    ARo     0.9435         0.4517     0.3286     0.1164
        AR1     0.9516         0.1441     0.0389     0.0365
        AR2     0.9510         0.1781     0.1153     0.0456
        AR3     0.9494         0.2123     0.1714     0.0540


CHAPTER 4

TESTING POSITIVE ASSOCIATION USING $\widehat{AR}$

4.1 Introduction

The relative risk (RR) and odds ratio (OR) have been used widely to measure the association between the risk factor and the disease outcome. However, the RR cannot be estimated for case-control studies. Cornfield (1951) proposed OR as an approximation to RR in order to measure the degree of association between the risk factor and the disease outcome. An odds ratio of 1 corresponds to independence of the risk factor and the disease outcome. Because

$$\frac{\ln(\widehat{OR}) - \ln(OR)}{se(\ln(\widehat{OR}))} \quad \text{converges more rapidly than} \quad \frac{\widehat{OR} - OR}{se(\widehat{OR})}$$

to normality, where $se(\ln(\widehat{OR}))$ and $se(\widehat{OR})$ are the standard errors of $\ln(\widehat{OR})$ and $\widehat{OR}$, respectively, $\ln(\widehat{OR})$ can also be used for inference purposes. The hypothesis testing procedure to test for any hypothesized value of $\ln(OR)$ has been discussed in the literature (Fleiss et al., 2003). The test statistic based on $\ln(\widehat{OR})$, given by

$$Z^{*} = \frac{\ln(\widehat{OR}) - \ln(OR)}{se(\ln(\widehat{OR}))},$$

converges to a standard normal distribution. An estimated standard error for $\ln(\widehat{OR})$ is

$$\widehat{se}(\ln(\widehat{OR})) = \left(\frac{1}{n_{11} + 0.5} + \frac{1}{n_{12} + 0.5} + \frac{1}{n_{01} + 0.5} + \frac{1}{n_{02} + 0.5}\right)^{1/2}$$

(Woolf, 1955; Haldane, 1956; Gart, 1966). Here the $n_{ij}$ are the observed cell frequencies, $i = 0, 1$; $j = 1, 2$, and the addition of 1/2 to each frequency is a bias reduction device (Haldane, 1956).
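The test based on $\ln(\widehat{OR})$ can be sketched in Python as follows. The sketch assumes that the same 0.5 adjustment is applied to the point estimate as to the standard error (the text specifies it only for the standard error), and the counts are those of the respiratory and locomotor disease data in Table 4.2, used here purely as an illustration. The function name is hypothetical.

```python
import math


def log_odds_ratio_test(n11, n12, n01, n02, log_or_null=0.0):
    """Z* for H0: ln(OR) = log_or_null, with 0.5 added to every cell."""
    a, b, c, d = (count + 0.5 for count in (n11, n12, n01, n02))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se, (log_or - log_or_null) / se


log_or, se, z_star = log_odds_ratio_test(17, 207, 184, 2376)
print(f"ln(OR) = {log_or:.4f}, se = {se:.4f}, Z* = {z_star:.4f}")
```

A value of $Z^{*}$ well inside $\pm 1.96$ would not lead to rejection of $\ln(OR) = 0$ at the 5% level.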

Although the relative risk and the odds ratio have enjoyed widespread use, neither takes into account the actual number of diseased cases, which may play an important role when studying a disease with several risk factors varying both in their relative risk and prevalence.

AR takes into account both the prevalence of the risk factor and the relative risk, and is estimable for all three sampling designs, namely, case-control, cross-sectional and cohort. It provides the proportion of the risk of disease that could be avoided if the whole population were hypothesized to attain the same risk of disease as the individuals in the unexposed or lowest exposure category. For these reasons it has become popular among epidemiologists and public health practitioners as a measure of the impact of a risk factor in developing a disease. In this chapter, we carry out a test for positive association between the risk factor and the disease outcome using a test statistic based on $\widehat{AR}$. Section 4.2 presents the approximate distribution of $\widehat{AR}$ and a useful result regarding independence between the risk factor and the disease outcome. An example on locomotor disease and respiratory disease is then provided to illustrate the hypothesis testing procedure for independence. In Section 4.3, we study the variation of AR and log OR in the set of 2 × 2 tables for fixed row sums. Section 4.4 concentrates on the estimation of the power of the test statistic using $\widehat{AR}$ to test $H_0$: independence versus $H_a$: positive association at different alternative points for three different combinations of the row sum marginals. In Section 4.5, a comparative study of AR and log OR is carried out to assess their testing size and power for testing positive association.


4.2 Test of Hypothesis using $\widehat{AR}$ for large sample

In this section, the approximate distribution of $\widehat{AR}$ will be studied and considered for the hypothesis testing procedure.

The following theorems will be useful.

Theorem 4.2.1 The random vector $\hat P$ given by $\hat P' = (\hat p_{11}, \hat p_{12}, \hat p_{01}, \hat p_{02})$ has an asymptotic singular 4-variate normal distribution with asymptotic mean vector $P' = (p_{11}, p_{12}, p_{01}, p_{02})$ and asymptotic covariance matrix $\Sigma$ defined before.

The proof of the theorem follows immediately from the direct application of the multivariate Lindeberg-Lévy CLT (Serfling, 2002).

Theorem 4.2.2 $\widehat{AR}$ is asymptotically normally distributed with mean $AR$ and variance $v(\widehat{AR})$.

The proof follows from the direct application of the delta method.

Therefore,

$$Z = \frac{\widehat{AR} - AR}{\sqrt{v(\widehat{AR})}}$$

can be approximated by a standard normal distribution, and hence we can employ a z-test for testing a specified value of $AR$ or for independence between the risk factor and the disease.

The following result will be useful for testing independence between the risk factor and the disease outcome. Let us consider the 2 × 2 contingency table under the usual setup already discussed.


Table 4.1: Probability distribution of subjects with respect to exposure levels and the disease status

Exposure status    Disease status      Total
                   D        D̄
E_1 (present)      p_11     p_12       p_1.
E_0 (absent)       p_01     p_02       p_0.
Total              p_.1     p_.2       1

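Lemma 4.2.1 Following the notation in Table 4.1,

$$\frac{p_{11}}{p_{1\cdot}} \geq \frac{p_{01}}{p_{0\cdot}} \quad \text{if and only if} \quad p_{11} \geq p_{1\cdot}\,p_{\cdot 1} \quad (\text{equivalently, } p_{01} \leq p_{0\cdot}\,p_{\cdot 1}).$$

Proof: Let $p_{11} \geq p_{1\cdot}\,p_{\cdot 1}$. Then

$$\frac{p_{11}}{p_{1\cdot}} \geq p_{\cdot 1}.$$

Now, $p_{01} = p_{\cdot 1} - p_{11} \leq p_{\cdot 1} - p_{1\cdot}\,p_{\cdot 1} = p_{\cdot 1}\,p_{0\cdot}$, so that

$$\frac{p_{01}}{p_{0\cdot}} \leq p_{\cdot 1}.$$

Therefore, $\frac{p_{11}}{p_{1\cdot}} \geq \frac{p_{01}}{p_{0\cdot}}$.

Conversely, let $\frac{p_{11}}{p_{1\cdot}} \geq \frac{p_{01}}{p_{0\cdot}}$. Then

$$\frac{p_{11}}{p_{1\cdot}} \geq \frac{p_{11} + p_{01}}{p_{1\cdot} + p_{0\cdot}} = p_{\cdot 1} \geq \frac{p_{01}}{p_{0\cdot}},$$

so that $p_{11} \geq p_{1\cdot}\,p_{\cdot 1}$ and $p_{01} \leq p_{0\cdot}\,p_{\cdot 1}$.

Because

$$AR = \frac{p_{11}\,p_{0\cdot} - p_{01}\,p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}},$$

$\frac{p_{11}}{p_{1\cdot}} = \frac{p_{01}}{p_{0\cdot}}$ implies $AR = 0$. Thus, independence between the risk factor and the disease outcome implies that the underlying $AR$ is equal to zero. Positive association between the risk factor and the disease outcome, that is, $\frac{p_{11}}{p_{1\cdot}} > \frac{p_{01}}{p_{0\cdot}}$, implies $AR > 0$.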

The testing procedure for independence is explained by using the following real-life example.

Example 4.1 This example appears in Fleiss (2003). A total of 2784 subjects in a community have been cross-classified by the presence or absence of respiratory disease and locomotor disease. Locomotor disease is a disease of the bones and organs of movement. The following table summarizes the distribution of the subjects with respect to respiratory disease and locomotor disease status.

Table 4.2: Cross-classification of 2784 subjects by the status of the respiratory disease and locomotor disease

Respiratory       Locomotor disease              Proportion with
disease           D       D̄        Total         locomotor disease
E_1 (present)     17      207      224           0.08
E_0 (absent)      184     2376     2560          0.07
Total             201     2583     2784          0.07

Since the rates of locomotor disease in people with and without respiratory disease (0.08 and 0.07, respectively) are virtually the same, we would like to test whether there is an association between respiratory disease and locomotor disease. We are therefore interested in testing $H_0$: independence between the risk factor and the disease outcome, which is equivalent to testing the hypothesis $H_0: AR = 0$ versus $H_a: AR \neq 0$. Under the null hypothesis, the observed value of the test statistic,

$$Z = \frac{\widehat{AR} - AR}{\sqrt{\hat v(\widehat{AR})}},$$

is found to be 0.2182. Therefore, at the 5% significance level the data do not provide sufficient evidence to conclude that respiratory disease has an effect on developing locomotor disease. Thus, the data are consistent with the two characteristics, respiratory disease and locomotor disease, being independent of each other.

In a similar way, the above test can be used to test for any hypothesized value of AR .
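The calculation in Example 4.1 can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the function name `ar_z_test` is hypothetical, and the variance is obtained here by applying the Delta method to $\log(1 - \widehat{AR})$ under multinomial sampling, which is assumed to agree with the variance estimator used in the text. The sketch requires a positive count of unexposed diseased subjects.

```python
from math import sqrt


def ar_z_test(n11, n12, n01, n02):
    """Return (AR_hat, standard error, Z) for H0: AR = 0 in a 2x2 table.

    Cell order: exposed diseased, exposed non-diseased,
    unexposed diseased, unexposed non-diseased. Requires n01 > 0.
    """
    n = n11 + n12 + n01 + n02
    probs = (n11 / n, n12 / n, n01 / n, n02 / n)
    p11, _, p01, p02 = probs
    p0_dot = p01 + p02  # proportion unexposed
    p_dot1 = p11 + p01  # proportion diseased

    # From the definition of AR: 1 - AR = p01 / (p0. * p.1).
    ar_hat = 1.0 - p01 / (p0_dot * p_dot1)

    # Gradient of log(1 - AR) with respect to (p11, p12, p01, p02).
    grad = (-1 / p_dot1, 0.0, 1 / p01 - 1 / p0_dot - 1 / p_dot1, -1 / p0_dot)

    # Delta method with multinomial covariance (diag(p) - p p^T) / n.
    mean_g = sum(p * g for p, g in zip(probs, grad))
    var_log = (sum(p * g * g for p, g in zip(probs, grad)) - mean_g ** 2) / n

    # Var(AR_hat) is approximately (1 - AR)^2 * Var(log(1 - AR_hat)).
    se = (1.0 - ar_hat) * sqrt(var_log)
    return ar_hat, se, ar_hat / se


# Table 4.2: respiratory disease (exposure) by locomotor disease (outcome).
ar_hat, se, z = ar_z_test(17, 207, 184, 2376)
print(f"AR_hat = {ar_hat:.4f}, Z = {z:.4f}")
```

For the counts of Table 4.2 this sketch gives a Z value that rounds to the 0.2182 reported above.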

4.3 The variation of AR and OR in the set of 2×2 tables

The set of all 2×2 tables, given by $S = \{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1\}$, can be written as the union of two subsets $S^{+}$ and $S^{-}$, where

\[ S^{+} = \left\{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1,\ \frac{p_{11}}{p_{1\cdot}} \ge \frac{p_{01}}{p_{0\cdot}}\right\} \]

is the set of all 2×2 tables of positive association and

\[ S^{-} = \left\{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1,\ \frac{p_{11}}{p_{1\cdot}} \le \frac{p_{01}}{p_{0\cdot}}\right\} \]

is the set of all 2×2 tables of negative association. On $S^{+}$, $AR \ge 0$ and on $S^{-}$, $AR \le 0$.

Fixing the row sum $p_{1\cdot}$ (then so is $p_{0\cdot}$), let

\[ S_{p_{1\cdot}} = \{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1,\ p_{1\cdot} \text{ is fixed}\}. \]

A 2×2 table in $S_{p_{1\cdot}}$ is completely determined when $p_{11}$ and $p_{01}$ are determined. The set $S_{p_{1\cdot}}$ can be written as the union of two sets $S^{+}_{p_{1\cdot}}$ and $S^{-}_{p_{1\cdot}}$, where

\[ S^{+}_{p_{1\cdot}} = \left\{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1,\ p_{1\cdot} \text{ is fixed},\ \frac{p_{11}}{p_{1\cdot}} \ge \frac{p_{01}}{p_{0\cdot}}\right\} \]

and

\[ S^{-}_{p_{1\cdot}} = \left\{(p_{11}, p_{12}, p_{01}) : p_{11}, p_{12}, p_{01} > 0,\ p_{11} + p_{12} + p_{01} < 1,\ p_{1\cdot} \text{ is fixed},\ \frac{p_{11}}{p_{1\cdot}} \le \frac{p_{01}}{p_{0\cdot}}\right\}. \]

The set $S_{p_{1\cdot}}$ of all 2×2 tables with fixed row sums is represented by the shaded area of Figure 4.1 below.

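In the set $S_{p_{1\cdot}}$, if we fix $p_{01}$, then

\[ AR = \frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}} = \frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{(p_{01} + p_{11})\,p_{0\cdot}} \]

is only a function of $p_{11}$. We have

\[ \frac{\partial AR}{\partial p_{11}} = \frac{(p_{01} + p_{11})\,p_{0\cdot}^{2} - p_{0\cdot}\,(p_{11}p_{0\cdot} - p_{01}p_{1\cdot})}{(p_{01} + p_{11})^{2}\,p_{0\cdot}^{2}} = \frac{p_{0\cdot}\,p_{01}\,(p_{0\cdot} + p_{1\cdot})}{(p_{01} + p_{11})^{2}\,p_{0\cdot}^{2}} = \frac{p_{01}}{(p_{01} + p_{11})^{2}\,p_{0\cdot}} > 0. \]

Thus, AR is increasing in $p_{11}$.

If we fix $p_{11}$, AR is a function of $p_{01}$. Then,

\[ \frac{\partial AR}{\partial p_{01}} = \frac{-(p_{01} + p_{11})\,p_{0\cdot}\,p_{1\cdot} - p_{0\cdot}\,(p_{11}p_{0\cdot} - p_{01}p_{1\cdot})}{(p_{01} + p_{11})^{2}\,p_{0\cdot}^{2}} = -\frac{p_{11}}{(p_{01} + p_{11})^{2}\,p_{0\cdot}} < 0. \]

Thus, AR is decreasing in $p_{01}$.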

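For a given point $(p_{11}, p_{01})$ of $S_{p_{1\cdot}}$, we change $p_{1\cdot}$ in the range $p_{11} < p_{1\cdot} < 1 - p_{01}$. Since $p_{\cdot 1} = p_{11} + p_{01}$ does not change with $p_{1\cdot}$, the variation of $AR = \frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}}$ and the variation of $\frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{0\cdot}}$ are in the same direction.

Since $\frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{0\cdot}} = p_{11} - \frac{p_{01}p_{1\cdot}}{p_{0\cdot}}$ is decreasing in $p_{1\cdot}$, AR with a fixed point $(p_{11}, p_{01})$ is decreasing for $p_{11} < p_{1\cdot} < 1 - p_{01}$.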

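Now we study the variation of the odds ratio $OR = \frac{p_{11}p_{02}}{p_{01}p_{12}}$ for the situations described above.

In the set $S_{p_{1\cdot}}$, if we fix $p_{01}$, then

\[ OR = \frac{p_{11}p_{02}}{p_{01}p_{12}} = \frac{p_{11}\,(p_{0\cdot} - p_{01})}{p_{01}\,(p_{1\cdot} - p_{11})} \]

is a function of $p_{11}$ alone. Then,

\[ \frac{\partial OR}{\partial p_{11}} = \frac{p_{0\cdot} - p_{01}}{p_{01}} \cdot \frac{p_{1\cdot}}{(p_{1\cdot} - p_{11})^{2}} > 0. \]

Also, for fixed $p_{11}$, OR is a function of $p_{01}$. Then,

\[ \frac{\partial OR}{\partial p_{01}} = -\frac{p_{11}}{p_{1\cdot} - p_{11}} \cdot \frac{p_{0\cdot}}{p_{01}^{2}} < 0. \]

Therefore, the direction of variation of OR is exactly the same as that of AR.

As in the study of AR, for a fixed point $(p_{11}, p_{01})$ of $S_{p_{1\cdot}}$, we change $p_{1\cdot}$ from $p_{11}$ to $1 - p_{01}$.

Then, from $OR = \frac{p_{11}p_{02}}{p_{01}p_{12}} = \frac{p_{11}\,(p_{0\cdot} - p_{01})}{p_{01}\,(p_{1\cdot} - p_{11})}$, it is to be noted that $\frac{p_{11}}{p_{01}}$ is fixed and

\[ \frac{p_{0\cdot} - p_{01}}{p_{1\cdot} - p_{11}} = \frac{1 - p_{1\cdot} - p_{01}}{p_{1\cdot} - p_{11}} \]

is decreasing in $p_{1\cdot}$ for $p_{11} < p_{1\cdot} < 1 - p_{01}$.

Hence, for a fixed point $(p_{11}, p_{01})$, OR also decreases in $p_{1\cdot}$ for $p_{11} < p_{1\cdot} < 1 - p_{01}$.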
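The monotonicity results of this section can be checked numerically. The short sketch below evaluates AR and OR on small grids; the function names and the grid values are illustrative choices, not taken from the text.

```python
def ar(p11, p01, p1_dot):
    """AR written in terms of p11, p01 and the fixed row sum p1."""
    p0_dot = 1.0 - p1_dot
    return (p11 * p0_dot - p01 * p1_dot) / ((p11 + p01) * p0_dot)


def odds_ratio(p11, p01, p1_dot):
    """OR written in terms of p11, p01 and the fixed row sum p1."""
    p0_dot = 1.0 - p1_dot
    return p11 * (p0_dot - p01) / (p01 * (p1_dot - p11))


def is_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


def is_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


# Vary p11 with p01 = 0.2 and p1. = 0.5 fixed.
p11_grid = [0.25, 0.30, 0.35, 0.40, 0.45]
print(is_increasing([ar(p, 0.2, 0.5) for p in p11_grid]),
      is_increasing([odds_ratio(p, 0.2, 0.5) for p in p11_grid]))

# Vary p01 with p11 = 0.25 and p1. = 0.5 fixed.
p01_grid = [0.1, 0.2, 0.3, 0.4]
print(is_decreasing([ar(0.25, p, 0.5) for p in p01_grid]),
      is_decreasing([odds_ratio(0.25, p, 0.5) for p in p01_grid]))

# Vary p1. with the point (p11, p01) = (0.2, 0.2) fixed.
p1_grid = [0.3, 0.4, 0.5, 0.6, 0.7]
print(is_decreasing([ar(0.2, 0.2, p) for p in p1_grid]),
      is_decreasing([odds_ratio(0.2, 0.2, p) for p in p1_grid]))
```

Each grid respects the constraints $p_{11} < p_{1\cdot} < 1 - p_{01}$, so all cell probabilities stay positive.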

The results of variations of AR and OR on the set $S_{p_{1\cdot}}$ are represented by the arrows in Figures 4.1 and 4.2 below. The shaded area in Figure 4.2 represents the subset $S^{+}_{p_{1\cdot}}$, and the remaining area represents the subset $S^{-}_{p_{1\cdot}}$.

[Figure 4.1: Geometric representation for S. Axes: $p_{11}$, $p_{12}$, $p_{01}$; labelled points O, A(0,0,1), B and C.]

[Figure 4.2: The variation of AR on $S_{p_{1\cdot}}$. Axes: $p_{11}$, $p_{01}$; labelled points D, E, F and G.]

4.4 Estimation of power of the test using $\widehat{AR}$

From the theoretical result developed above, we expect that the test of positive dependence has larger power at a point with a larger value of AR than at a point with a smaller value of AR. In this section, a Monte Carlo study has been carried out to assess the power of the test statistic based on $\widehat{AR}$ by fixing the row sum marginals and the entry $p_{01}$, and gradually increasing the value of $p_{11}$ for the large sample case.

We consider three different combinations for the row sum marginals described below.

Case 1: The proportion of the exposed group is $p_{1\cdot} = 10\%$ and that of the unexposed group is $p_{0\cdot} = 90\%$.

This kind of situation may arise in real life when a small proportion of people are exposed to a particular risk factor; for example, exposed to a chemical, coal dust, etc. in the community.

Case 2: The proportion of the exposed group is $p_{1\cdot} = 90\%$ and that of the unexposed group is $p_{0\cdot} = 10\%$.

For example, in most of the third world countries, a big portion of the population is usually exposed to contaminated water, polluted air, unhealthy environment, etc.

Case 3: The proportions of the exposed and unexposed groups are equal, that is, $p_{0\cdot} = p_{1\cdot} = 50\%$. This situation may arise in real life when there is a possibility of the outbreak of a particular disease, for example, flu, and the community has been urged to take the preventive medication.

The Monte Carlo simulation has been performed following the scheme given below.

1. Fix the significance level $\alpha$ and the Monte Carlo sample size M.

2. Given a case described above, form a 2×2 table satisfying $\frac{p_{11}}{p_{1\cdot}} > \frac{p_{01}}{p_{0\cdot}}$.

3. Generate a random sample from the given multinomial distribution for different values of n and find the value of the test statistic $Z = \frac{\widehat{AR} - AR}{\sqrt{\hat{v}(\widehat{AR})}}$ under the null hypothesis.

4. Compare the observed value of the test statistic with the critical value $z_{\alpha}$ and reject the null hypothesis if the observed value is greater than or equal to the critical value.

5. Repeat steps 3-4 M times and count the number of rejections. The proportion of rejections over all the simulations gives the estimated power.

6. Keeping the same row sum marginals and fixing the entry $p_{01}$, we consider a new 2×2 table such that $p_{11}' > p_{11}$ and $\frac{p_{11}'}{p_{1\cdot}} > \frac{p_{01}}{p_{0\cdot}}$. We repeat steps 3-5 to find the power with the new configuration given by $(p_{11}', p_{12}', p_{01}, p_{02})$.

The simulation results for M=10000 have been summarized in Tables 4.3, 4.4, and 4.5 for cases 1, 2 and 3, respectively for different values of n. From the simulation results it is evident that the power of the test statistic is increasing with the increase in the value of AR for all three cases considered in the study, which is consistent with the theoretical development above.
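The simulation scheme can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program used for the tables: the function names are hypothetical, the variance estimate comes from the Delta method applied to $\log(1 - \widehat{AR})$, and samples in which the statistic is undefined (for example, no unexposed diseased subjects) are counted as non-rejections, a convention the text does not specify. Estimates therefore need not reproduce the tabulated values exactly.

```python
import numpy as np
from statistics import NormalDist


def ar_z_statistics(counts):
    """Z = AR_hat / sqrt(v_hat(AR_hat)) for each row (n11, n12, n01, n02)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)
    p11, _, p01, p02 = (counts / n[:, None]).T
    p0_dot, p_dot1 = p01 + p02, p11 + p01
    with np.errstate(divide="ignore", invalid="ignore"):
        one_minus_ar = p01 / (p0_dot * p_dot1)
        # Gradient of log(1 - AR) in (p11, p12, p01, p02); its p12 entry is 0.
        g11 = -1 / p_dot1
        g01 = 1 / p01 - 1 / p0_dot - 1 / p_dot1
        g02 = -1 / p0_dot
        # Multinomial Delta method; sum(p * g) equals -1, hence the "- 1.0".
        var_log = (p11 * g11**2 + p01 * g01**2 + p02 * g02**2 - 1.0) / n
        z = (1.0 - one_minus_ar) / (one_minus_ar * np.sqrt(var_log))
    return z  # NaN marks samples where the statistic is undefined


def estimate_power(probs, n, n_rep=10000, alpha=0.05, rng=None):
    """Steps 3-5: proportion of samples with Z >= z_alpha."""
    rng = np.random.default_rng() if rng is None else rng
    counts = rng.multinomial(n, probs, size=n_rep)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    with np.errstate(invalid="ignore"):
        return float(np.mean(ar_z_statistics(counts) >= z_alpha))  # NaN -> no rejection


rng = np.random.default_rng(0)
for p11 in (0.25, 0.30, 0.35, 0.40):
    probs = [p11, 0.5 - p11, 0.20, 0.30]  # case 3: p1. = p0. = 0.5
    print(p11, estimate_power(probs, n=200, n_rep=2000, rng=rng))
```

The loop mirrors step 6: the row sums and $p_{01}$ stay fixed while $p_{11}$ increases.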

Table 4.3: Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.1$, $p_{0\cdot} = 0.9$

Matrix of probabilities              Estimated power
p11    p12    p01    p02    AR       n=30    n=50    n=70    n=90    n=120   n=150   n=200
0.045  0.055  0.360  0.540  0.0123   0.0960  0.0999  0.1086  0.1164  0.1171  0.1208  0.1338
0.050  0.050  0.360  0.540  0.0244   0.1170  0.1349  0.1429  0.1525  0.1642  0.1857  0.2096
0.055  0.045  0.360  0.540  0.0361   0.1479  0.1538  0.1872  0.2177  0.2369  0.2682  0.3289
0.060  0.040  0.360  0.540  0.0476   0.1554  0.1937  0.2433  0.2815  0.3404  0.3996  0.4761
0.065  0.035  0.360  0.540  0.0588   0.1970  0.2537  0.3078  0.3656  0.4463  0.5223  0.6276
0.070  0.030  0.360  0.540  0.0698   0.2323  0.3042  0.3800  0.4582  0.5585  0.6404  0.7648
0.075  0.025  0.360  0.540  0.0805   0.2670  0.3614  0.4648  0.5570  0.6648  0.7612  0.8655
0.080  0.020  0.360  0.540  0.0909   0.2964  0.4207  0.5425  0.6420  0.7652  0.8512  0.9254
0.085  0.015  0.360  0.540  0.1011   0.3511  0.4917  0.6335  0.7363  0.8500  0.9147  0.9718
0.090  0.010  0.360  0.540  0.1111   0.3931  0.5539  0.7049  0.8122  0.9051  0.9572  0.9901

Table 4.4: Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.9$, $p_{0\cdot} = 0.1$

Matrix of probabilities              Estimated power
p11    p12    p01    p02    AR       n=30    n=50    n=70    n=90    n=120   n=150   n=200
0.405  0.495  0.040  0.060  0.1011   0.0257  0.0639  0.1286  0.1373  0.1377  0.1394  0.1386
0.450  0.450  0.040  0.060  0.1837   0.0127  0.0993  0.1637  0.1849  0.1917  0.2044  0.2423
0.495  0.405  0.040  0.060  0.2523   0.0115  0.1596  0.2151  0.2394  0.2766  0.3041  0.3705
0.540  0.360  0.040  0.060  0.3103   0.0496  0.2395  0.2800  0.3413  0.3936  0.4573  0.5449
0.585  0.315  0.040  0.060  0.3600   0.0990  0.2981  0.3721  0.4288  0.5218  0.5933  0.7006
0.630  0.270  0.040  0.060  0.4030   0.1828  0.3712  0.4621  0.5295  0.6441  0.7153  0.8299
0.675  0.225  0.040  0.060  0.4406   0.2570  0.4511  0.5469  0.6446  0.7460  0.8251  0.9108
0.720  0.180  0.040  0.060  0.4737   0.3232  0.5317  0.6499  0.7396  0.8383  0.9011  0.9577
0.765  0.135  0.040  0.060  0.5031   0.4385  0.6033  0.7425  0.8166  0.9053  0.9513  0.9842
0.810  0.090  0.040  0.060  0.5294   0.4826  0.6929  0.8031  0.8820  0.9487  0.9763  0.9932

Table 4.5: Estimated power using $\widehat{AR}$ for the case $p_{1\cdot} = 0.5$, $p_{0\cdot} = 0.5$

Matrix of probabilities              Estimated power
p11    p12    p01    p02    AR       n=30    n=50    n=70    n=90    n=120   n=150   n=200
0.225  0.275  0.200  0.300  0.0588   0.1407  0.1392  0.1389  0.1439  0.1583  0.1622  0.1887
0.250  0.250  0.200  0.300  0.1111   0.1793  0.1935  0.2256  0.2514  0.2989  0.3331  0.4075
0.275  0.225  0.200  0.300  0.1579   0.2350  0.2805  0.3469  0.4065  0.4969  0.5718  0.6776
0.300  0.200  0.200  0.300  0.2000   0.2957  0.4030  0.5008  0.5925  0.6986  0.7853  0.8791
0.325  0.175  0.200  0.300  0.2381   0.3738  0.5293  0.6554  0.7602  0.8597  0.9145  0.9699
0.350  0.150  0.200  0.300  0.2727   0.4786  0.6704  0.7854  0.8809  0.9449  0.9779  0.9945
0.375  0.125  0.200  0.300  0.3043   0.5842  0.7694  0.8914  0.9453  0.9839  0.9953  0.9994
0.400  0.100  0.200  0.300  0.3333   0.6711  0.8633  0.9527  0.9837  0.9970  0.9998  1.0000
0.425  0.075  0.200  0.300  0.3600   0.7855  0.9336  0.9857  0.9967  0.9996  1.0000  1.0000
0.450  0.050  0.200  0.300  0.3846   0.8458  0.9727  0.9968  0.9995  1.0000  1.0000  1.0000

4.5 Comparing nominal size and power of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$

4.5.1 Comparing nominal size of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$

In this section, we study the size of the test statistics for testing $H_0$: independence versus $H_a$: positive association based on $\widehat{AR}$ and $\log\widehat{OR}$. To do so, we consider two matrices formed by the probabilities of two given multinomial distributions for which both AR and log OR are equal to zero. For each of the matrices, the simulation study has been performed in the following way.

1. Generate a random sample from the given multinomial distribution and find the values of the test statistics given by $Z = \frac{\widehat{AR}}{\sqrt{\hat{v}(\widehat{AR})}}$ and $Z^{*} = \frac{\log\widehat{OR}}{\sqrt{\hat{v}(\log\widehat{OR})}}$.

2. Compare the values of both statistics with the critical value $z_{\alpha}$ and reject the null hypothesis if $Z \ge z_{\alpha}$, $Z^{*} \ge z_{\alpha}$.

3. Repeat steps 1-2 M times. The proportion of rejections over all the simulations gives the estimated size for each test statistic.

From the simulated results it seems that under $H_0$: independence, the statistic Z converges to a standard normal distribution faster than the test statistic $Z^{*}$ does.
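The three steps above can be sketched as follows. The sketch rests on assumptions that are not stated in the text: $\hat{v}(\widehat{AR})$ is taken from the Delta method applied to $\log(1 - \widehat{AR})$, $\hat{v}(\log\widehat{OR})$ is the usual sum of reciprocal cell counts divided into the squared statistic as shown, and any sample with an undefined statistic (typically a zero cell) is counted as a non-rejection. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def z_statistics(counts):
    """Return (Z, Z*) for each row (n11, n12, n01, n02) of counts."""
    c = np.asarray(counts, dtype=float)
    n = c.sum(axis=1)
    n11, n12, n01, n02 = c.T
    p11, p01, p02 = n11 / n, n01 / n, n02 / n
    p0_dot, p_dot1 = p01 + p02, p11 + p01
    with np.errstate(divide="ignore", invalid="ignore"):
        one_minus_ar = p01 / (p0_dot * p_dot1)
        # Delta-method variance of log(1 - AR_hat) under multinomial sampling.
        var_log = (p11 / p_dot1**2
                   + p01 * (1 / p01 - 1 / p0_dot - 1 / p_dot1) ** 2
                   + p02 / p0_dot**2 - 1.0) / n
        z = (1.0 - one_minus_ar) / (one_minus_ar * np.sqrt(var_log))
        # Z* uses v_hat(log OR_hat) = 1/n11 + 1/n12 + 1/n01 + 1/n02.
        log_or = np.log(n11 * n02 / (n01 * n12))
        z_star = log_or / np.sqrt(1 / n11 + 1 / n12 + 1 / n01 + 1 / n02)
    return z, z_star  # NaN marks an undefined statistic


def estimate_sizes(probs, n, n_rep=10000, alpha=0.05, rng=None):
    """Estimated rejection rates of Z and Z* under the given null table."""
    rng = np.random.default_rng() if rng is None else rng
    z, z_star = z_statistics(rng.multinomial(n, probs, size=n_rep))
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    with np.errstate(invalid="ignore"):
        return float(np.mean(z >= z_alpha)), float(np.mean(z_star >= z_alpha))


rng = np.random.default_rng(1)
null_table = [0.040, 0.060, 0.360, 0.540]  # AR = 0 and log OR = 0
for n in (30, 50, 200):
    print(n, estimate_sizes(null_table, n, n_rep=4000, rng=rng))
```

Because zero cells make $Z^{*}$ undefined more often than Z, the zero-cell convention matters most for small n; a different convention would change the small-sample estimates.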

Table 4.6: Estimated level ($\alpha = 0.05$) for tests $Z$ and $Z^{*}$, where $Z = \widehat{AR}/\sqrt{\hat{v}(\widehat{AR})}$ and $Z^{*} = \log\widehat{OR}/\sqrt{\hat{v}(\log\widehat{OR})}$

Matrix of probabilities
p11    p12    p01    p02      n      Z        Z*
0.040  0.060  0.360  0.540    30     0.0483   0.0338
                              50     0.0495   0.0672
                              70     0.0546   0.0706
                              90     0.0534   0.0611
                              120    0.0530   0.0587
                              150    0.0526   0.0559
                              200    0.0553   0.0558
0.360  0.540  0.040  0.060    30     0.0312   0.0115
                              50     0.0429   0.0566
                              70     0.0418   0.0868
                              90     0.0560   0.0882
                              120    0.0656   0.0836
                              150    0.0595   0.0756
                              200    0.0554   0.0751

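4.5.2 Comparing power of the test statistics using $\widehat{AR}$ and $\log\widehat{OR}$

In this section, we study the power of the test statistics based on $\widehat{AR}$ and $\log\widehat{OR}$ for testing positive association between the risk factor and the disease outcome, that is, $H_0$: independence versus $H_a$: positive association. Under the null hypothesis, both $\frac{\widehat{AR}}{\sqrt{\hat{v}(\widehat{AR})}}$ and $\frac{\log\widehat{OR}}{\sqrt{\hat{v}(\log\widehat{OR})}}$ asymptotically follow a standard normal distribution. We reject the null hypothesis when $\frac{\widehat{AR}}{\sqrt{\hat{v}(\widehat{AR})}} > z_{\alpha}$ and $\frac{\log\widehat{OR}}{\sqrt{\hat{v}(\log\widehat{OR})}} > z_{\alpha}$, where $z_{\alpha}$ is the upper $100(\alpha)$th percentile of a standard normal distribution.

For large n and a given distribution P, the value of the statistic $\frac{\widehat{AR}}{\sqrt{\hat{v}(\widehat{AR})}}$ can be approximated by $\frac{AR}{\sqrt{v(\widehat{AR})}}$, and the value of the statistic $\frac{\log\widehat{OR}}{\sqrt{\hat{v}(\log\widehat{OR})}}$ can be approximated by $\frac{\log OR}{\sqrt{v(\log\widehat{OR})}}$.

Then $\frac{AR}{\sqrt{v(\widehat{AR})}} \ge (<)\ \frac{\log OR}{\sqrt{v(\log\widehat{OR})}}$ is equivalent to $\frac{AR}{A} \ge (<)\ \frac{\log OR}{B}$, where

\[ A = \sqrt{\frac{(1 - AR)^{4}\left\{p_{11}p_{02}\,(p_{11}p_{02} + p_{11}p_{01} + p_{01}p_{02}) - p_{01}\,(p_{11}p_{02} - p_{12}p_{01})^{2} + p_{12}\,p_{01}^{3}\right\}}{p_{01}^{3}}}, \]

\[ B = \sqrt{\frac{1}{p_{11}} + \frac{1}{p_{12}} + \frac{1}{p_{01}} + \frac{1}{p_{02}}}. \]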

We partition the set $P^{+}$ of 2×2 positive tables into two subsets, given by

\[ P_{1}^{+} = \left\{P : \frac{AR}{A} \ge \frac{\log OR}{B}\right\} \quad \text{and} \quad P_{2}^{+} = \left\{P : \frac{AR}{A} < \frac{\log OR}{B}\right\}. \]

Since both $\widehat{AR}$ and $\log\widehat{OR}$ are used as test statistics, it is expected that the test based on $\widehat{AR}$ will be at least as powerful as the test based on $\log\widehat{OR}$ for testing positive dependence for a 2×2 table when P is in the subset $P_{1}^{+}$; the reverse is expected to happen for any P in the subset $P_{2}^{+}$.

In order to verify this, a Monte Carlo simulation has been performed for different values of the sample size n. From the simulation result it is found that for a 2×2 table in the subset $P_{1}^{+}$, the power of the test statistic based on $\widehat{AR}$ is better than the power based on $\log\widehat{OR}$ for each large value of n. The reverse situation is evident in the subset $P_{2}^{+}$. The performance of the power for the two different situations described above can easily be compared from Figures 4.3 and 4.4.
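Membership in the two subsets can be checked directly from the cell probabilities. The sketch below assumes that A and B are the square roots of the bracketed expressions given above (so that they play the role of asymptotic standard deviations); the function names are hypothetical.

```python
from math import log, sqrt


def standardized_effects(p11, p12, p01, p02):
    """Return (AR / A, log OR / B) for a 2x2 table of cell probabilities."""
    ar = 1.0 - p01 / ((p01 + p02) * (p11 + p01))
    core = (p11 * p02 * (p11 * p02 + p11 * p01 + p01 * p02)
            - p01 * (p11 * p02 - p12 * p01) ** 2
            + p12 * p01 ** 3)
    a = sqrt((1.0 - ar) ** 4 * core / p01 ** 3)  # assumed form of A
    b = sqrt(1 / p11 + 1 / p12 + 1 / p01 + 1 / p02)  # assumed form of B
    return ar / a, log(p11 * p02 / (p01 * p12)) / b


def subset_of(p11, p12, p01, p02):
    """Classify a positively associated table into P1+ or P2+."""
    effect_ar, effect_or = standardized_effects(p11, p12, p01, p02)
    return "P1+" if effect_ar >= effect_or else "P2+"


print(subset_of(0.090, 0.010, 0.360, 0.540))
print(subset_of(0.765, 0.135, 0.040, 0.060))
```

The two tables passed to `subset_of` are the probability matrices used in the simulations of this section.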

Table 4.7: Estimated power for test statistics using $\widehat{AR}$ and $\log\widehat{OR}$ where $\frac{AR}{A} \ge \frac{\log OR}{B}$ (table in $P_{1}^{+}$), with $Z = \widehat{AR}/\sqrt{\hat{v}(\widehat{AR})}$ and $Z^{*} = \log\widehat{OR}/\sqrt{\hat{v}(\log\widehat{OR})}$

Matrix of probabilities
p11    p12    p01    p02      n      Z        Z*
0.090  0.010  0.360  0.540    30     0.4044   0.0995
                              50     0.7439   0.4957
                              70     0.8722   0.7056
                              90     0.9000   0.7666
                              120    0.9579   0.8857
                              150    0.9803   0.9420
                              200    0.9963   0.9863

Table 4.8: Estimated power for test statistics using $\widehat{AR}$ and $\log\widehat{OR}$ where $\frac{AR}{A} < \frac{\log OR}{B}$ (table in $P_{2}^{+}$), with $Z = \widehat{AR}/\sqrt{\hat{v}(\widehat{AR})}$ and $Z^{*} = \log\widehat{OR}/\sqrt{\hat{v}(\log\widehat{OR})}$

Matrix of probabilities
p11    p12    p01    p02      n      Z        Z*
0.765  0.135  0.040  0.060    30     0.4544   0.5590
                              50     0.5636   0.7265
                              70     0.7255   0.8107
                              90     0.8243   0.8980
                              120    0.8902   0.9446
                              150    0.9588   0.9837
                              200    0.9799   0.9936

[Figure 4.3: Power curve for AR and log OR for the subset $P_{1}^{+}$. Horizontal axis: n; vertical axis: power.]

[Figure 4.4: Power curve for AR and log OR for the subset $P_{2}^{+}$. Horizontal axis: n; vertical axis: power.]

CHAPTER 5

EXACT TEST FOR POSITIVE DEPENDENCE USING $\widehat{AR}$

5.1 Introduction

In this chapter, we concentrate on the hypothesis testing procedure to test independence between the risk factor and the disease outcome using AR in case of a small sample, which has always been neglected in the area of attributable risk. When the sample size is small, for example, in the case of a rare disease situation, we do not have sufficient information for each cell of the observed matrix. Therefore, the large sample approximation theory cannot be applied. In that case we use exact small-sample distributions rather than large-sample approximations. Fisher’s exact test (Fisher, 1934, 1935a,b; Irwin, 1935; Yates, 1934) is designed to test the independence for a 2× 2 contingency table. Section 5.2 discusses the testing procedure to test for independence for a dichotomous risk factor and a dichotomous outcome variable using AR . In section 5.3, the Fisher’s exact test has been extended to multiple exposure levels of the risk factor to test for independence for a risk factor and the disease outcome using an overall AR .

Section 5.4 concentrates on the estimation of testing power of tests using AR and logOR by means of exact test.

5.2 Small-sample test for a 2× 2 table

5.2.1 Fisher’s exact test for a 2× 2 table

Fisher’s exact test is based on conditioning on the marginal totals. Conditioning on the observed margins of the fourfold table yields a probability distribution which is free of the unknown parameters. Therefore, the formula obtained permits exact probability calculations.

In a 2×2 table,

\[ P(n_{11} = t_{11}, n_{12} = t_{12}, n_{01} = t_{01}, n_{02} = t_{02}) = \frac{n!\,\prod_{i=0}^{1}\prod_{j=1}^{2} p_{ij}^{t_{ij}}}{\prod_{i=0}^{1}\prod_{j=1}^{2} t_{ij}!}. \]

Under $H_0$: independence,

\[ P(n_{11} = t_{11}, n_{12} = t_{12}, n_{01} = t_{01}, n_{02} = t_{02}) = \frac{n!\,(p_{1\cdot}p_{\cdot 1})^{t_{11}} (p_{1\cdot}p_{\cdot 2})^{t_{12}} (p_{\cdot 1}p_{0\cdot})^{t_{01}} (p_{0\cdot}p_{\cdot 2})^{t_{02}}}{\prod_{i=0}^{1}\prod_{j=1}^{2} t_{ij}!}. \]

Fixing both sets of marginal totals $(n_{i\cdot})$ and $(n_{\cdot j})$, $i = 0,1$; $j = 1,2$, the conditional density of $(n_{ij})$ is given by

\[ P\big((n_{ij}) = (t_{ij}) \mid (n_{i\cdot}) = (t_{i\cdot}), (n_{\cdot j}) = (t_{\cdot j})\big) = \frac{\dfrac{n!\,(p_{1\cdot}p_{\cdot 1})^{t_{11}} (p_{1\cdot}p_{\cdot 2})^{t_{12}} (p_{\cdot 1}p_{0\cdot})^{t_{01}} (p_{0\cdot}p_{\cdot 2})^{t_{02}}}{\prod_{i=0}^{1}\prod_{j=1}^{2} t_{ij}!}}{\dfrac{n!\,p_{1\cdot}^{t_{1\cdot}} p_{0\cdot}^{t_{0\cdot}}}{t_{1\cdot}!\,t_{0\cdot}!} \cdot \dfrac{n!\,p_{\cdot 1}^{t_{\cdot 1}} p_{\cdot 2}^{t_{\cdot 2}}}{t_{\cdot 1}!\,t_{\cdot 2}!}} = \frac{t_{1\cdot}!\,t_{\cdot 1}!\,t_{0\cdot}!\,t_{\cdot 2}!}{n!\,t_{11}!\,t_{12}!\,t_{01}!\,t_{02}!}. \qquad (5.2.1) \]

It is the hypergeometric distribution and does not depend on unknown parameters.

For any bivariate cdf $F(x, y)$, $\min(F_X(x), F_Y(y))$ and $\max(0, F_X(x) + F_Y(y) - 1)$ are cdf’s and are called the Fréchet bounds of the family $F(x, y)$ with $F_X(x)$ and $F_Y(y)$ fixed. For any $F(x, y)$ of the family, $\max(0, F_X(x) + F_Y(y) - 1) \le F(x, y) \le \min(F_X(x), F_Y(y))$ (Hoeffding, 1940; Fréchet, 1951). Given the marginal totals, the entry $n_{11}$ completely determines the other three cell frequencies. The range of possible values of $n_{11}$ is given by $\max(0, n_{1\cdot} + n_{\cdot 1} - n) \le n_{11} \le \min(n_{1\cdot}, n_{\cdot 1})$, where $\max(0, n_{1\cdot} + n_{\cdot 1} - n)$ and $\min(n_{1\cdot}, n_{\cdot 1})$ are the Fréchet bounds defined above.
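As a small illustration, the conditional probability (5.2.1) and the range of $n_{11}$ can be computed with binomial coefficients, since (5.2.1) equals $\binom{t_{1\cdot}}{t_{11}}\binom{t_{0\cdot}}{t_{01}} / \binom{n}{t_{\cdot 1}}$. The function names and the margins used in the example are illustrative choices, not taken from the text.

```python
from math import comb


def frechet_bounds(n1_dot, n_dot1, n):
    """Smallest and largest n11 compatible with the fixed margins."""
    return max(0, n1_dot + n_dot1 - n), min(n1_dot, n_dot1)


def table_probability(n11, n1_dot, n_dot1, n):
    """Conditional probability (5.2.1) of the table whose first cell is n11."""
    n0_dot = n - n1_dot
    return comb(n1_dot, n11) * comb(n0_dot, n_dot1 - n11) / comb(n, n_dot1)


# Hypothetical margins: 4 exposed subjects, 5 diseased subjects, 12 in total.
low, high = frechet_bounds(4, 5, 12)
probabilities = [table_probability(k, 4, 5, 12) for k in range(low, high + 1)]
print(low, high, sum(probabilities))
```

The probabilities over the admissible range of $n_{11}$ sum to one (up to rounding), as expected for a conditional distribution.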

5.2.2 Testing procedure for independence for a small sample using $\widehat{AR}$

Suppose we are interested in testing whether there is a positive association between the risk factor and the disease, or, equivalently, we want to test $H_0$: no effect versus $H_a$: positive effect. The testing procedure consists of the following steps:

1. Calculate the observed value of AR, $AR_{ob}$, based on the given table.

2. Fixing the marginal totals, generate all 2×2 tables and calculate AR for each table.

3. Find the conditional probabilities given by (5.2.1) for all 2×2 tables for which the calculated AR is greater than $AR_{ob}$.

4. The sum of all those probabilities is called the p-value of the test. The decision regarding the acceptance or rejection of the hypothesis is made based on the p-value.

Example 5.1 Let us consider an example that appeared in Fleiss (1979). This example is on live births and infant deaths among whites in 1974 for New York City. A total of 72730 infants were involved in the study and had been cross-classified according to their birth weight status and survival and death status. The cross-classification of the subjects according to the birth weight, and survival and death status is summarized in the following table.

Table 5.1: Distribution of 72730 subjects according to the birth weight and life status

Birth weight    Died within first year    Survived first year    Total
≤ 2500g         616                       4594                   5210
> 2500g         425                       67095                  67520
Total           1041                      71689                  72730

For the above data set, Fleiss (1979) found the estimated value of AR to be 0.56 , and a 95% confidence interval was obtained to be (0.527, 0.591). In order to apply an exact test for small sample, the proportions as in the original data have been kept in a sample of 25 subjects. The cell frequencies have been rounded up to the nearest integer making the sample size eventually 28.

We are interested in testing whether there is an effect of birth weight on infants’ death. Therefore we want to test $H_0$: no effect versus $H_a$: positive effect.

Table 5.2: Distribution of 28 infants according to the status of life and the birth weight

Birth weight      Died within first year    Survived first year    Total
E1 (≤ 2500g)      1                         2                      3
E0 (> 2500g)      1                         24                     25
Total             2                         26                     28

Following the steps of exact test described above, the p − value is obtained to be 0.0079. Thus, at 5% significance level, the result is statistically significant. Therefore, based on the particular data we can conclude that prematurity (birth weight ≤ 2500g ) causes infant death, which is consistent with the large sample situation described by Fleiss (1979).
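The four steps of Section 5.2.2 can be sketched for a general 2×2 table as follows. This is an illustrative implementation with hypothetical function names; as in step 3, only tables whose AR is strictly greater than the observed value contribute to the p-value. Exact rational arithmetic is used so that the comparison of AR values is not affected by rounding.

```python
from fractions import Fraction
from math import comb


def ar_from_counts(n11, n12, n01, n02):
    """AR computed from cell counts (exposed row first, diseased column first)."""
    n1_dot, n0_dot, n_dot1 = n11 + n12, n01 + n02, n11 + n01
    return Fraction(n11 * n0_dot - n01 * n1_dot, n_dot1 * n0_dot)


def exact_ar_p_value(n11, n12, n01, n02):
    """p-value of the exact test of H0: no effect vs Ha: positive effect."""
    n1_dot, n0_dot, n_dot1 = n11 + n12, n01 + n02, n11 + n01
    n = n1_dot + n0_dot
    ar_observed = ar_from_counts(n11, n12, n01, n02)
    p_value = Fraction(0)
    # Step 2: all tables with the observed margins (Frechet bounds on n11).
    for k in range(max(0, n1_dot + n_dot1 - n), min(n1_dot, n_dot1) + 1):
        ar_k = ar_from_counts(k, n1_dot - k, n_dot1 - k, n0_dot - n_dot1 + k)
        if ar_k > ar_observed:  # step 3: strictly larger AR
            # Probability (5.2.1) written with binomial coefficients.
            p_value += Fraction(comb(n1_dot, k) * comb(n0_dot, n_dot1 - k),
                                comb(n, n_dot1))
    return float(p_value)


# Table 5.2: 28 infants cross-classified by birth weight and survival.
print(round(exact_ar_p_value(1, 2, 1, 24), 4))  # 0.0079
```

With the margins of Table 5.2 only three tables are possible ($n_{11} = 0, 1, 2$), and only the table with $n_{11} = 2$ has a larger AR than the observed one.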


5.3 Exact test for a (K +1) × 2 contingency table

5.3.1 Useful results regarding independence

Let the parameters for a (K +1) × 2 table be presented as follows.

Table 5.3: Probability distribution of subjects with respect to exposure levels and the disease status

Exposure Levels     Disease Status                      Total
                    $D$             $\bar{D}$
0                   $p_{01}$        $p_{02}$            $p_{0\cdot}$
1                   $p_{11}$        $p_{12}$            $p_{1\cdot}$
...                 ...             ...                 ...
K                   $p_{K1}$        $p_{K2}$            $p_{K\cdot}$
Total               $p_{\cdot 1}$   $p_{\cdot 2}$       1

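Lemma 5.3.1 The conditional probabilities of the disease outcome in each level of the risk factor are equal if and only if the risk factor and disease outcome are independent, that is, under the notation introduced in the table above,

\[ \frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{K1}}{p_{K\cdot}} \quad \text{if and only if} \quad p_{ij} = p_{i\cdot}\,p_{\cdot j},\ i = 0,1,\ldots,K,\ j = 1,2. \]

Proof: Let $p_{ij} = p_{i\cdot}\,p_{\cdot j}$, $i = 0,1,\ldots,K$, $j = 1,2$.

Then $p_{01} = p_{0\cdot}\,p_{\cdot 1}$, which implies $\frac{p_{01}}{p_{0\cdot}} = p_{\cdot 1}$. By a similar argument, we have $\frac{p_{k1}}{p_{k\cdot}} = p_{\cdot 1}$, $k = 1,2,\ldots,K$. Therefore, $\frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{K1}}{p_{K\cdot}}$.

Again, let $\frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{K1}}{p_{K\cdot}}$. Then $\frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{01} + p_{11} + \cdots + p_{K1}}{p_{\cdot\cdot}}$.

Therefore, $p_{01} = p_{0\cdot}\,p_{\cdot 1}$, $p_{11} = p_{1\cdot}\,p_{\cdot 1}$ and so on. Hence, $p_{ij} = p_{i\cdot}\,p_{\cdot j}$, $i = 0,1,\ldots,K$, $j = 1,2$.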

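Lemma 5.3.2 For a (K+1)×2 contingency table, $\frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{K1}}{p_{K\cdot}}$ implies that the underlying overall AR, $AR_{ovall}$, is equal to zero.

Proof: $\frac{p_{01}}{p_{0\cdot}} = \frac{p_{11}}{p_{1\cdot}} = \cdots = \frac{p_{K1}}{p_{K\cdot}}$ implies

\[ \frac{p_{01}}{p_{0\cdot}} = \frac{p_{11} + p_{21} + \cdots + p_{K1}}{p_{1\cdot} + p_{2\cdot} + \cdots + p_{K\cdot}} = \frac{\sum_{k=1}^{K} p_{k1}}{\sum_{k=1}^{K} p_{k\cdot}}. \]

Also, the overall AR, $AR_{ovall}$, can be found by collapsing over all the exposed categories $k$, $k = 1,2,\ldots,K$, and is given by the equation

\[ AR_{ovall} = \frac{1}{p_{\cdot 1}}\left(\sum_{k=1}^{K} p_{k1} - \frac{p_{01}}{p_{0\cdot}}\sum_{k=1}^{K} p_{k\cdot}\right). \]

Then,

\[ AR_{ovall} = \frac{1}{p_{\cdot 1}}\left(\sum_{k=1}^{K} p_{k1} - \frac{p_{01}}{p_{0\cdot}}\sum_{k=1}^{K} p_{k\cdot}\right) = \frac{1}{p_{\cdot 1}\,p_{0\cdot}}\left(p_{0\cdot}\sum_{k=1}^{K} p_{k1} - p_{01}\sum_{k=1}^{K} p_{k\cdot}\right) = 0. \]

Thus the independence between the risk factor and the disease outcome implies $AR_{ovall} = 0$.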

5.3.2 Extension of Fisher’s exact test for an I × J table

In an I × J table,

\[ P\big((n_{ij}) = (t_{ij})\big) = \frac{n!\,\prod_{i=1}^{I}\prod_{j=1}^{J} p_{ij}^{t_{ij}}}{\prod_{i=1}^{I}\prod_{j=1}^{J} t_{ij}!}. \]

Under $H_0$: independence,

\[ P\big((n_{ij}) = (t_{ij})\big) = \frac{n!\,\prod_{i=1}^{I}\prod_{j=1}^{J} (p_{i\cdot}\,p_{\cdot j})^{t_{ij}}}{\prod_{i=1}^{I}\prod_{j=1}^{J} t_{ij}!}. \]

Fixing both sets of marginal totals $(n_{i\cdot})$ and $(n_{\cdot j})$, $i = 1,\ldots,I$, $j = 1,2,\ldots,J$, the conditional density of $(n_{ij})$ is given by

\[ P\big((n_{ij}) = (t_{ij}) \mid (n_{i\cdot}) = (t_{i\cdot}), (n_{\cdot j}) = (t_{\cdot j})\big) = \frac{\dfrac{n!\,\prod_{i=1}^{I}\prod_{j=1}^{J} (p_{i\cdot}\,p_{\cdot j})^{t_{ij}}}{\prod_{i=1}^{I}\prod_{j=1}^{J} t_{ij}!}}{\dfrac{n!\,\prod_{i=1}^{I} p_{i\cdot}^{t_{i\cdot}}}{\prod_{i=1}^{I} t_{i\cdot}!} \cdot \dfrac{n!\,\prod_{j=1}^{J} p_{\cdot j}^{t_{\cdot j}}}{\prod_{j=1}^{J} t_{\cdot j}!}} = \frac{\prod_{i=1}^{I} t_{i\cdot}!\ \prod_{j=1}^{J} t_{\cdot j}!}{n!\,\prod_{i=1}^{I}\prod_{j=1}^{J} t_{ij}!}. \qquad (5.3.1) \]

It is the multiple hypergeometric distribution. Since the above conditional probability does not depend on unknown parameters, it permits exact probability calculations.

In order to perform an exact test, we need to generate all I × J matrices more concordant (Tchen, 1980; Ahmed et al., 1979) than the given I × J matrix. An algorithm for a 3×3 matrix appears in the literature (Nguyen and Sampson, 1985).

5.3.3 Testing procedure for independence for a small sample using overall $\widehat{AR}$

Suppose we want to test $H_0$: independence versus $H_a$: positive association between the risk factor and the disease outcome for a small sample in an I × J contingency table. The algorithm for the testing procedure is given below.

1. Find the observed value of the overall attributable risk, $AR_{ovall}$, by collapsing over all the exposure categories based on the observed table.

2. Fixing the marginal totals, generate all I × J matrices more concordant than the given I × J matrix.

3. Find the overall AR for each matrix generated in step 2 by collapsing over the exposure categories.

4. Calculate the conditional probabilities given by (5.3.1) for all I × J matrices for which the calculated AR is greater than or equal to $AR_{ovall}$. The sum of all those probabilities is called the p-value of the test.

5. Based on the p-value, make the decision.

Below is an algorithm for generating all 3×2 matrices more concordant than the given matrix.

Let us consider a matrix given by

\[ M = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \\ m_{31} & m_{32} \end{bmatrix}. \]

The row sum vector and the column sum vector are respectively given by $r = (m_{1\cdot}, m_{2\cdot}, m_{3\cdot})$ and $c = (m_{\cdot 1}, m_{\cdot 2})$.

Step 0: $M_0 = (m_{ij}^{(0)})$, $i = 1,2,3$, $j = 1,2$.

        Set $m_{11} = m_{11}^{(0)}$.

Step 1: Set $m_{21} = \max(0,\ m_{11}^{(0)} + m_{21}^{(0)} - m_{11})$.

Step 2: $m_{12} = m_{1\cdot} - m_{11}$,
        $m_{22} = m_{2\cdot} - m_{21}$,
        $m_{31} = m_{\cdot 1} - m_{11} - m_{21}$,
        $m_{32} = m_{\cdot 2} - m_{12} - m_{22}$.

        Display M.

Step 3: Let $a = m_{21} + 1$.

        (a) If $a \le \min(m_{2\cdot}, m_{\cdot 1})$, set $m_{21} = a$ and go to step 2.

        (b) If $a > \min(m_{2\cdot}, m_{\cdot 1})$, go to step 4.

Step 4: Let $b = m_{11} + 1$.

        (a) If $b \le \min(m_{1\cdot}, m_{\cdot 1})$, set $m_{11} = b$ and go to step 1.

        (b) If $b > \min(m_{1\cdot}, m_{\cdot 1})$, the algorithm ends.

The procedure of the small sample exact test for a 3×2 table can be explained by the following example.

Example 5.2 This kind of example appears in Fleiss (2003) for a 2×2 contingency table under a cross-sectional study design. Suppose we are interested in studying the association between the age of the mother (0 represents a maternal age less than or equal to 20 years; 1, a maternal age between 21 and 35 years; and 2, a maternal age over 35 years) and the birth weight of her offspring ($D$ represents a birth weight less than or equal to 2500 grams; $\bar{D}$, a birth weight over 2500 grams). In order to remove the effect of social and demographic factors, suppose that we study only women of a given level of education, race, demographic location and type of insurance. We take a sample of 17 subjects and cross-classify them according to maternal age and the birth weight of the offspring as shown in Table 5.4. We want to test $H_0$: independence between maternal age and the birth weight versus $H_a$: positive association, which is equivalent to testing $H_0: AR_{ovall} = 0$ versus $H_a: AR_{ovall} > 0$.

Table 5.4: Distribution of 17 subjects according to the maternal age and birth weight of offspring

Maternal age              Birth weight               Total
                          $D$       $\bar{D}$
0 (age ≤ 20 years)        1         3                4
1 (age: 21-35 years)      2         5                7
2 (age > 35 years)        2         4                6
Total                     5         12               17

Based on the data, the observed value of the overall AR, $AR_{ovall}$, is found to be 0.15. Then we generate all 3×2 matrices more concordant than the given 3×2 matrix. The generated tables are as follows.

\[ \left\{
\begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 2 & 4 \end{bmatrix},
\begin{bmatrix} 1 & 3 \\ 3 & 4 \\ 1 & 5 \end{bmatrix},
\begin{bmatrix} 1 & 3 \\ 4 & 3 \\ 0 & 6 \end{bmatrix},
\begin{bmatrix} 2 & 2 \\ 1 & 6 \\ 2 & 4 \end{bmatrix},
\begin{bmatrix} 2 & 2 \\ 2 & 5 \\ 1 & 5 \end{bmatrix},
\begin{bmatrix} 2 & 2 \\ 3 & 4 \\ 0 & 6 \end{bmatrix},
\begin{bmatrix} 3 & 1 \\ 0 & 7 \\ 2 & 4 \end{bmatrix},
\begin{bmatrix} 3 & 1 \\ 1 & 6 \\ 1 & 5 \end{bmatrix},
\begin{bmatrix} 3 & 1 \\ 2 & 5 \\ 0 & 6 \end{bmatrix},
\begin{bmatrix} 4 & 0 \\ 0 & 7 \\ 1 & 5 \end{bmatrix},
\begin{bmatrix} 4 & 0 \\ 1 & 6 \\ 0 & 6 \end{bmatrix}
\right\}. \]

Applying the above procedure of an exact test, we find the p-value to be 0.3620.

So we can conclude that, at the 5% significance level, the result indicates an association that is not statistically significant.
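The enumeration and the exact test of Example 5.2 can be sketched as follows. This is an illustration with hypothetical function names, not the author's code. One assumption is made explicit in the comments: in addition to the bounds stated in Steps 3 and 4, every cell is required to be non-negative, which is what the eleven tables listed in the example satisfy.

```python
from fractions import Fraction
from math import comb


def more_concordant_3x2(observed):
    """Enumerate 3x2 tables with the observed margins, following Steps 0-4."""
    rows = [a + b for a, b in observed]
    col1 = sum(a for a, _ in observed)
    m11_obs, m21_obs = observed[0][0], observed[1][0]
    tables = []
    for m11 in range(m11_obs, min(rows[0], col1) + 1):  # Step 4
        start = max(0, m11_obs + m21_obs - m11)  # Step 1
        stop = min(rows[1], col1 - m11)  # assumption: keeps m31 >= 0
        for m21 in range(start, stop + 1):  # Step 3
            m31 = col1 - m11 - m21
            if m31 > rows[2]:  # assumption: keeps m32 >= 0
                continue
            tables.append([(m11, rows[0] - m11),
                           (m21, rows[1] - m21),
                           (m31, rows[2] - m31)])  # Step 2
    return tables


def overall_ar(table):
    """Overall AR: row 0 is the baseline, the other rows are collapsed."""
    n = sum(a + b for a, b in table)
    n0_dot = sum(table[0])
    n_dot1 = sum(a for a, _ in table)
    return 1 - Fraction(table[0][0] * n, n0_dot * n_dot1)


def table_probability(table):
    """Probability (5.3.1) for a table with two columns."""
    n = sum(a + b for a, b in table)
    n_dot1 = sum(a for a, _ in table)
    numerator = 1
    for a, b in table:
        numerator *= comb(a + b, a)
    return Fraction(numerator, comb(n, n_dot1))


def exact_overall_ar_p_value(observed):
    """Sum (5.3.1) over generated tables with overall AR >= the observed one."""
    ar_observed = overall_ar(observed)
    return float(sum(table_probability(t) for t in more_concordant_3x2(observed)
                     if overall_ar(t) >= ar_observed))


table_5_4 = [(1, 3), (2, 5), (2, 4)]
print(len(more_concordant_3x2(table_5_4)))           # 11
print(float(overall_ar(table_5_4)))                  # 0.15
print(round(exact_overall_ar_p_value(table_5_4), 4)) # 0.362
```

For Table 5.4 the sketch generates the same eleven tables as listed above and gives a p-value that rounds to the reported 0.3620.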

5.4 Comparing power of exact tests using $\widehat{AR}$ and $\log\widehat{OR}$

For fixed row sum and column sum marginals, a 2×2 table is completely determined by $p_{11}$. Thus, for fixed $p_{1\cdot}$ and $p_{\cdot 1}$, and using $p_{01} = p_{\cdot 1} - p_{11}$, we can write

\[ AR = \frac{p_{11}p_{0\cdot} - p_{01}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}} = \frac{p_{11}p_{0\cdot} - (p_{\cdot 1} - p_{11})\,p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}} = \frac{p_{11}\,(p_{0\cdot} + p_{1\cdot}) - p_{\cdot 1}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}} = \frac{p_{11} - p_{\cdot 1}p_{1\cdot}}{p_{\cdot 1}\,p_{0\cdot}}. \]

Then,

\[ \frac{dAR}{dp_{11}} = \frac{1}{p_{\cdot 1}\,p_{0\cdot}} > 0. \]

Thus AR is an increasing function of $p_{11}$.

Similarly, for fixed $p_{1\cdot}$ and $p_{\cdot 1}$, the odds ratio can be written as

\[ OR = \frac{p_{11}p_{02}}{p_{01}p_{12}} = \frac{p_{11}\,(1 - p_{1\cdot} - p_{\cdot 1} + p_{11})}{(p_{1\cdot} - p_{11})\,(p_{\cdot 1} - p_{11})}. \]

Since both $\frac{p_{11}}{p_{1\cdot} - p_{11}}$ and $\frac{1 - p_{1\cdot} - p_{\cdot 1} + p_{11}}{p_{\cdot 1} - p_{11}}$ are increasing in $p_{11}$, and both are positive, OR is an increasing function of $p_{11}$. Therefore, log OR is also an increasing function of $p_{11}$.

In the exact test, we fix the row sum and column sum marginals, and $p_{ij}$, $i = 0,1$; $j = 1,2$, are replaced by $\hat{p}_{ij}$ in $\widehat{AR}$ and $\log\widehat{OR}$.

To study the power of the exact test based on $\widehat{AR}$ and $\log\widehat{OR}$ for testing positive dependence for a 2×2 table, the simulation study has been conducted in the following way.

1. Fix the significance level $\alpha$ and the Monte Carlo sample size M.

2. Form a 2×2 table satisfying $\frac{p_{11}}{p_{1\cdot}} > \frac{p_{01}}{p_{0\cdot}}$.

3. Generate a random sample from the aforesaid multinomial distribution for different values of n (n = 15, 20, 25, 30).

4. For each sample generated, fix the marginal totals and generate all 2×2 tables satisfying the Fréchet bounds.

5. Find the observed value of AR, $AR_{ob}$, for the sample generated in step 3, and the value of AR for each table generated in step 4.

6. Find the conditional probabilities given by (5.2.1) for all those tables for which the value of AR found in step 5 is greater than $AR_{ob}$.

7. The sum of all those probabilities is called the p-value. If the p-value is less than 0.05, reject the null hypothesis.

8. Repeat steps 3-7 M times.

9. The proportion of rejections over all the simulations gives the estimated power for the test based on $\widehat{AR}$.

At each of steps 5-8, do the same calculations based on $\log\widehat{OR}$. Then the proportion of rejections over all the simulations gives the estimated power for the test based on $\log\widehat{OR}$.

The estimated powers for the exact test based on $\widehat{AR}$ and $\log\widehat{OR}$ for a Monte Carlo sample of size M = 10000 are summarized in Table 5.5. From the simulation results, it is to be noted that the estimated powers of the tests based on $\widehat{AR}$ and $\log\widehat{OR}$ are the same. Therefore, in an exact test of a 2×2 table, the tests based on $\widehat{AR}$ and $\log\widehat{OR}$ are equivalent.

Table 5.5: Estimated power of exact test with the test statistic $\widehat{AR}$

Matrix of probabilities       Estimated power
p11   p12   p01   p02         n=15     n=20     n=25     n=30
0.25  0.25  0.20  0.30        0.0480   0.0470   0.0660   0.0680
0.30  0.20  0.20  0.30        0.0880   0.1210   0.1370   0.1750
0.35  0.15  0.20  0.30        0.1540   0.2330   0.2900   0.3650
0.40  0.10  0.20  0.30        0.2860   0.3950   0.4980   0.6120
0.45  0.05  0.20  0.30        0.4400   0.6000   0.7580   0.8260

REFERENCES

1. Agresti A. (2002). Categorical Data Analysis. 2nd ed., Wiley, New York.

2. Ahmed, A. H. N., Langberg, N. A., Leon, R. V. and Proschan, F. (1979). Partial ordering of

positive quadrant dependence, with applications. Florida State University, Technical Report.

3. Basu, S. and Landis, J. R. (1995). Model-based estimation of population attributable risk

under cross-sectional sampling. American Journal of Epidemiology, 142, 1338-1343.

4. Benichou, J. (1991). Methods of adjustment for estimating the attributable risk in case-control

studies: a review. Statistics in Medicine, 10, 1753-1773.

5. Benichou, J. (1993). Re: "Methods of adjustment for estimating the attributable risk in case-

control studies: a review"(letter). Statistics in Medicine, 12, 94-96.

6. Berkson, J. (1958). Smoking and lung cancer. Some observations on two recent reports.

Journal of American Statistical Association, 53, 28-38.

7. Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research. Vol. 1: The

Analysis of Case-Control Studies. Scientific Publications No. 32, International Agency for

Research on Cancer, Lyon.

8. Bruzzi, P., Green, S. B., Byar, D. P., Brinton, L. A. and Schairer, C. (1985). Estimating the

population attributable risk for multiple risk factors using case-control data. American Journal

of Epidemiology, 122, 904-914.

9. Casella G., Berger R. L. (2002). Statistical Inference. 2nd ed., Duxbury Press, Belmont,

California.

10. Chen, J. T. (2001). Re: "Methods of adjustment for estimating the attributable risk in case-

control studies: a review"(letter to the editor). Statistics in Medicine, 20, 979-982.

11. Cole, P. and MacMahon, B. (1971). Attributable risk percent in case-control studies. British

Journal of Preventive and Social Medicine, 25, 242-244.

12. Cornfield, J. (1951). A method of estimating comparative rates from clinical data.

Application to cancer of the lung, breast, and cervix. J. Natl. Cancer Inst., 11, 1269-1275.

13. Coughlin, S. S., Benichou, J. and Weed, D. L. (1994). Attributable risk estimation in case-

control studies. Epidemiologic Reviews, 16, 51-64.

14. Coughlin, S. S., Nass, C. C., Pickle, L. W., Trock, B. and Bunin, G. (1991). Regression

methods for estimating attributable risk in population-based case-control studies: a

comparison of additive and multiplicative models. American Journal of Epidemiology, 133,

1289-1294.

15. Denman, D. W. and Schlesselman, J. J. (1983). Interval estimation of the attributable risk for

multiple exposure levels in case-control studies. Biometrics, 39, 185-192.

16. Drescher, K. and Becher, H. (1997). Estimating the generalized attributable fraction from

case-control data. Biometrics, 53, 1170-1176.

17. Drescher, K. and Schill, W. (1991). Attributable risk estimation from case-control data via

logistic regression. Biometrics, 47, 1247-1256.

18. Eide, G. E. and Gefeller, O. (1995). Sequential and average attributable fractions as aids in

the selection of preventive strategies. Journal of Clinical Epidemiology, 48, 645-655.

19. Finkelstein, M. O. and Levin, B. (2001). Statistics for Lawyers. 2nd ed., Springer-

Verlag, New York.

20. Fisher, R. A. (1934, 1970). Statistical Methods for Research Workers (originally published

1925). 14th ed., 1970, Oliver and Boyd, Edinburgh.

21. Fisher, R. A. (1935a). The Design of Experiments. 8th ed., 1966, Oliver and Boyd, Edinburgh.

22. Fisher, R. A. (1935b). The logic of inductive inference. J. Roy. Statist. Soc., Ser. B, 98, 39-82.

23. Fleiss, J. L. (1979). Inference about population attributable risk from cross-sectional studies.

American Journal of Epidemiology, 110, 103-104.

24. Fleiss, J. L., Levin, B. and Paik, M. C. (2003). Statistical Methods for Rates and

Proportions. 3rd ed., Wiley, New York.

25. Fréchet, M. (1951). “Sur les tableaux de corrélation dont les marges sont données.” Annales de

l’Université de Lyon, Section A, Ser. 3, 14, 53-77.

26. Gart, J. J. (1966). Alternative analysis of contingency tables. J. Roy. Statist. Soc., Ser. B, 28,

164-179.

27. Gefeller, O. (1990). A simulation study on adjusted attributable risk estimators. Statistica

Applicata, 2, 323-331.

28. Gefeller, O. (1992a). An annotated bibliography on the attributable risk. Biometrical Journal,

34, 1007-1012.

29. Gefeller, O. (1992b). Comparison of adjusted attributable risk estimators. Statistics in

Medicine, 11, 2083-2091.

30. Gefeller, O. and Eide, G. E. (1993). Estimation of adjusted attributable risk in case-control

studies. Letter. Statistics in Medicine, 12, 91-94.

31. Gefeller, O. and Windeler, J. (1991). Risk factors for cervical cancer: comments on attributable

risk calculations and the evaluation of screening in case-control studies. International

Journal of Epidemiology, 20, 1140-1141.

32. Greenland, S. and Robins, J. M. (1988). Conceptual problems in the definition and

interpretation of attributable fractions. American Journal of Epidemiology, 128, 1185-1197.

33. Haldane, J. B. S. (1956). The estimation and significance of the logarithm of a ratio of

frequencies. Ann. Hum. Genet., 20, 309-311.

34. Hoeffding, W. (1940). "Masstabinvariante Korrelations-Theorie." Schriften Mathematische

Institut Universität, Berlin, 5, 181-233.

35. Irwin, J. O. (1935). Tests of significance for differences between percentages based on small

numbers. Metron, 12, 83-94.

36. Kooperberg, C. and Petitti, D. B. (1991). Using logistic regression to estimate the adjusted

attributable risk of low birthweight in an unmatched case-control study. Epidemiology, 2,

363-366.

37. Kuritz, S. J. and Landis, J. R. (1987). Attributable risk ratio estimation from matched-pairs

case-control data. American Journal of Epidemiology, 125, 324-328.

38. Kuritz, S. J. and Landis, J. R. (1988a). Summary attributable risk estimation from unmatched

case-control data. Statistics in Medicine, 7, 507-517.

39. Kuritz, S. J. and Landis, J. R. (1988b). Attributable risk estimation from matched case-

control data. Biometrics, 44, 355-367.

40. Last, J. M. (1983). A Dictionary of Epidemiology. Oxford University Press, New York.

41. Leung, H. K. and Kupper, L. L. (1981). Comparison of confidence intervals for attributable

risk. Biometrics, 37, 293-302.

42. Levin, M. L. (1953). The occurrence of lung cancer in man. Acta Unio Internationalis contra

Cancrum, 9, 531-541.

43. Lilienfeld, A. M. (1973). Epidemiology of infectious and non-infectious disease: some

comparisons. American Journal of Epidemiology, 97, 135-147.

44. Lubin, J. H. (1981). A computer program for the analysis of matched case-control studies.

Comput. Biomed. Res., 14, 138-143.

45. Lui, K. J. (1998). Confidence intervals for differences in correlated binary proportions.

Statistics in Medicine, 17, 2017-2021.

46. Lui, K. J. and Kelly, C. (1999). A note on interval estimation of kappa in a series of 2 × 2

tables. Statistics in Medicine, 18, 2041-2049.

47. Lui, K. J. and Kelly, C. (2000). A revisit on tests on homogeneity of the risk difference.

Biometrics, 56, 309-315.

48. Lui, K. J. (2001a). Notes on interval estimation of the attributable risk in cross-sectional

sampling. Statistics in Medicine, 20, 1797-1809.

49. Lui, K. J. (2001b). Confidence intervals of the attributable risk under cross-sectional

sampling with confounders. Biometrical Journal, 43, 767-779.

50. Lui, K. J. (2003). Interval estimation of the attributable risk for multiple exposure levels in

case-control studies with confounders. Statistics in Medicine, 22, 2443-2457.

51. MacMahon, B. and Pugh, T. F. (1970). Epidemiology: Principles and Methods. Little,

Brown, and Company, Boston.

52. Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from

retrospective studies of disease. J. Natl. Cancer Inst., 22, 719-748.

53. Markush, R. E. (1977). Levin's attributable risk statistic for analytic studies and vital

statistics. American Journal of Epidemiology, 105, 401-406.

54. Mausner, J. S. and Bahn, A. K. (1974). Epidemiology: An Introductory Text. W. B. Saunders,

Philadelphia.

55. McDowell, A., Engel, A., Massey, J. T. et al. (1981). Plan and Operation of the Second

National Health and Nutrition Examination Survey, 1976-1980. Hyattsville, MD: National

Center for Health Statistics. (Vital and health statistics, Ser. 1, no. 15) (DHHS publication

no. (PHS) 81-1317).

56. Mezzetti, M., Ferraroni, M., Decarli, A., La Vecchia, C. and Benichou, J. (1996). Software

for attributable risk and confidence interval estimation in case-control studies. Comput.

Biomed. Res., 29, 63-75.

57. Miettinen, O. S. (1974). Proportion of disease caused or prevented by a given exposure, trait

or intervention. American Journal of Epidemiology, 99, 325-332.

58. Miettinen, O. S. (1976). Estimability and estimation in case-referent studies. American

Journal of Epidemiology, 103, 226-235.

59. Nguyen, T. T. and Sampson, A. R. (1985). Counting the number of p × q integer matrices more

concordant than a given matrix. Discrete Applied Mathematics, 11, 187-205.

60. Ouellet, B. L., Romeder, J. M. and Lance, J. M. (1979). Premature mortality attributable to

smoking and hazardous drinking in Canada. American Journal of Epidemiology, 109, 451-

463.

61. Rao, C. R. (1973). Linear Statistical Inference and Its Applications. 2nd ed., Wiley,

New York.

62. Schlesselman, J. J. (1982). Case-Control Studies. Design, Conduct and Analysis. Oxford

University Press, New York.

63. Serfling, R. J. (2002). Approximation Theorems of Mathematical Statistics. John Wiley &

Sons, Inc.

64. Shapla, T. J., Nguyen, T. T., and Chen, J. T. (2005). Inference of Attributable Risk for

Multiple Exposure Levels with Confounders for Cross-sectional data. Department of

Mathematics and Statistics, Bowling Green State University, Technical Report No. 05-11.

65. Taylor, J. W. (1977). Simple estimation of population attributable risk from case-control

studies. American Journal of Epidemiology, 106, 260.

66. Tchen, A. H. (1980). Inequalities for distributions with given marginals. Ann. Probability, 8,

814-827.

67. Walter, S. D. (1975). The distribution of Levin's measure of attributable risk. Biometrika, 62,

371-374.

68. Walter, S. D. (1976). The estimation and interpretation of attributable risk in health research.

Biometrics, 32, 829-849.

69. Whittemore, A. S. (1982). Statistical methods for estimating attributable risk from

retrospective data. Statistics in Medicine, 1, 229-243.

70. Whittemore, A. S. (1983). Estimating attributable risk from case-control studies. American

Journal of Epidemiology, 117, 76-85.

71. Woolf, B. (1955). On estimating the relation between blood group and disease. Ann. Hum.

Genet., 19, 251-253.

72. Yates, F. (1934). Contingency tables involving small numbers and the χ² test. J. Roy. Statist.

Soc. Suppl. 1, 217-235.