Downloaded from https://academic.oup.com/biostatistics/advance-article-abstract/doi/10.1093/biostatistics/kxz067/5767138 by Harvard College Library, Cabot Science Library user on 27 April 2020

Biostatistics (2020), 0, 0, pp. 1–14, doi:10.1093/biostatistics/kxz067

Estimation and inference for the population attributable risk in the presence of misclassification

BENEDICT H. W. WONG
Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA

JOOYOUNG LEE
Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA and Department of , Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA

DONNA SPIEGELMAN
Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA, Department of Epidemiology, Harvard T.H. Chan School of Public Health, 181 Longwood Ave, Boston, MA 02115, USA, Department of Nutrition and Global Health & Population, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA and Department of Biostatistics, Center on Methods in Implementation and Prevention Science, Yale School of Public Health, 60 College St, New Haven, CT 06510, USA

MOLIN WANG∗
Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA, Department of Epidemiology, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA and Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 181 Longwood Ave, Boston, MA 02115
[email protected]

∗To whom correspondence should be addressed.

© The Author 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].

SUMMARY
Because it describes the proportion of disease cases that could be prevented if an exposure were entirely eliminated from a target population as a result of an intervention, estimation of the population attributable risk (PAR) has become an important goal of public health research. In epidemiologic studies, categorical covariates are often misclassified. We present methods for obtaining point and interval estimates of the PAR and the partial PAR (pPAR) in the presence of misclassification, filling an important existing gap in public health evaluation methods. We use a likelihood-based approach to estimate parameters in the models for the disease and for the misclassification process, under main study/internal validation study and main study/external validation study designs, and various plausible assumptions about transportability. We assessed the finite sample performance of this method via a simulation study, and used it to obtain corrected point and interval estimates of the pPAR for high red meat intake and alcohol intake in relation to colorectal cancer in the Health Professionals Follow-up Study (HPFS), where we found that the estimated pPAR for the two risk factors increased by up to 317% after correcting for bias due to misclassification.

Keywords: Attributable fraction; Attributable risk; Measurement error; Misclassification; Partial population attributable risk; Population attributable risk; Validation study.

1. INTRODUCTION
The population attributable risk (PAR) is the fraction of disease cases that would be prevented if an exposure were to be eliminated from a population of interest. It has attracted much attention in epidemiology and health policy research, as it evaluates the impact of public health interventions which remove the harmful exposure. If the research goal is to estimate the amount or proportion of cases of a disease attributable to a given risk factor, or to predict the impact of public health interventions on the health status of a population, then PARs are particularly relevant (Northridge, 1995). In 2018, a British Journal of Cancer editorial entitled “Population attributable fractions continue to unmask the power of prevention” stated that “The population attributable fraction is a critical driver of evidence-based cancer prevention.” (Bray and Soerjomataram, 2018). In a single-exposure setting, the PAR is a function of the relative risk (RR) and the prevalence of the exposure (Levin, 1952). In the presence of risk factors for the disease under study whose distribution is not affected by the interventions, the effect of the interventions can be evaluated using the partial PAR (pPAR) (Spiegelman and others, 2007). The pPAR is also called the adjusted attributable risk (Bruzzi and others, 1985; Benichou, 2001).

In the research motivating this article, cancer epidemiologists were interested in estimating the proportion of colorectal cancer (CRC) cases among men in the Health Professionals Follow-up Study (HPFS) that are attributable to a number of modifiable exposures, and thus might be preventable (Platz and others, 2000). The HPFS began in 1986 when 51 529 male health professionals were enrolled by responding to mailed questionnaires (Rimm and others, 1991). Every 2 years since the start of the study, these participants filled in questionnaires inquiring about topics such as dietary intake and health status.
The accuracy of the responses in the food frequency questionnaires was assessed by validation with dietary records in a sub-sample of 127 study participants (Rimm and others, 1992). By comparing the dietary records of the validation study participants to their responses in the food frequency questionnaire, we saw that red meat intake and alcohol intake were measured with moderate to substantial levels of misclassification. Most notably, the specificity for high red meat intake was 0.29, and the sensitivity of high alcohol intake was 0.78, leading to a large number of individuals being falsely classified into the high red meat category and/or falsely classified into the low alcohol category. Decisions about which risk factors should be targeted for reduction in health promotion programs could be misled by bias in the pPAR estimates that quantify the extent to which disease can be prevented by reducing each individual factor. This article provides a methodology to correct for this bias.

In epidemiologic studies, when categorical variables are misclassified, bias will arise in estimates of the exposure prevalences and the RRs. This, in turn, affects the validity of the PAR and pPAR estimates, which are functions of the exposure prevalences and the RR estimates, and will be biased if there is bias in estimates of either. There are publications on both the impact of misclassification on estimates of exposure prevalences and on exposure-disease associations (Goldberg, 1975; Copeland and others, 1977; Hsieh and Walter, 1988). The effect of non-differential exposure misclassification on the PAR estimates in the single-exposure setting has also been studied previously. Misclassification is said to be non-differential when it is independent of disease status, that is, when exposure sensitivity and specificity are the same for both the disease cases and the non-cases (Johnson and others, 2014).
One article (Hsieh and Walter, 1988) showed that when there is imperfect sensitivity of a single binary exposure, both the disease-exposure odds ratio (OR) and the PAR will be underestimated. This article also showed that when there is perfect sensitivity and imperfect specificity, the OR is again underestimated but the PAR is unbiased. On the other hand, when misclassification is differential, the bias in the OR can be in either direction (Copeland and others, 1977). Misclassification is said to be differential when exposure sensitivity and specificity differ between the disease cases and the non-cases (Johnson and others, 2014) and can arise through the dichotomization of a continuous exposure which is subject to non-differential measurement error (Dalen and others, 2009). Other studies have also examined the effect of outcome misclassification on the PAR (Hsieh, 1991; Vogel and others, 2005), and one article has examined the effect of exposure misclassification on the estimation of the pPAR in the two-exposure setting (Wong and others, 2018). In this latter study, it was shown that in the presence of non-differential exposure misclassification, the bias in the pPAR can be in either direction, unlike the bias in the single-exposure PAR, which can only be toward the null. In addition, these authors found that the magnitude of the bias is most dependent on the sensitivity of the exposure being eliminated. These findings further motivate the need for developing tools that can help researchers estimate unbiased pPARs and confidence intervals (CIs) in the presence of misclassification.

Statistical methods exist for correcting for the misclassification-caused bias in the prevalence estimators as well as association estimators, and there is an especially large literature on the latter (Marshall, 1990; Spiegelman and others, 2000; Yi and others, 2015). However, there are no existing statistical methods for correcting for the misclassification-caused bias in the pPAR estimates.
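The single-exposure results attributed to Hsieh and Walter (1988) above can be checked numerically. The following is a minimal sketch, not taken from the article: it assumes one binary exposure, non-differential misclassification, and illustrative values for the prevalence, RR, and baseline risk. The function names `true_par` and `observed_par` are hypothetical.

```python
def true_par(p, rr):
    """Single-exposure PAR from the true prevalence p and relative risk rr."""
    return p * (rr - 1) / (1 + p * (rr - 1))


def observed_par(p, rr, sens, spec, r0=0.05):
    """PAR that would be computed from a misclassified exposure Z.

    Misclassification is non-differential: sens = P(Z=1 | X=1) and
    spec = P(Z=0 | X=0) do not depend on disease status.
    r0 is the (illustrative) disease risk among the truly unexposed.
    """
    # Joint probabilities of (X, Z) among those classified as unexposed (Z = 0)
    p_x1_z0 = p * (1 - sens)
    p_x0_z0 = (1 - p) * spec
    # Overall disease risk in the population
    risk = p * r0 * rr + (1 - p) * r0
    # Disease risk among those classified as unexposed
    risk_z0 = (p_x1_z0 * r0 * rr + p_x0_z0 * r0) / (p_x1_z0 + p_x0_z0)
    return 1 - risk_z0 / risk


p, rr = 0.5, 2.0  # illustrative values, not estimates from the HPFS
par_true = true_par(p, rr)
par_perfect_sens = observed_par(p, rr, sens=1.0, spec=0.29)
par_imperfect_sens = observed_par(p, rr, sens=0.78, spec=1.0)
```

Under these assumptions, perfect sensitivity leaves the PAR unchanged whatever the specificity, because everyone classified as unexposed is truly unexposed, while imperfect sensitivity pulls the PAR toward zero.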
A main contribution of our work is the development of methods for correcting for the misclassification-caused bias in the pPAR estimates through correcting for the biases in both the marginal distribution estimates and the association estimates under various main study/validation study designs. In addition, we provide user-friendly software implementing the proposed methods. The weblink is at Section 6. In Section 2, we describe the methods for correcting the exposure-misclassification induced bias in pPAR point estimates and CIs. We assessed the finite sample performance of our method via an extensive simulation study in Section 3, and we applied our methods to estimate the pPARs for CRC in the HPFS in Section 4. Section 5 concludes this article.

2. METHODS
2.1. PAR and pPAR
The PAR represents the proportional reduction in disease prevalence that might occur if all other covariate distributions and associations of these with the outcome were unchanged, but the exposure was eliminated. We denote by $Y$ the binary disease outcome, by $X^{(1)}$ the targeted modifiable exposures, and by $Y_x$ the counterfactual value of $Y$ if we had set $X^{(1)} = x$ in the population under study. A general form of the PAR can be written as

$$\mathrm{PAR} = 1 - \frac{P(Y_0 = 1)}{P(Y = 1)} \tag{2.1}$$

(VanderWeele, 2010; Sjölander and Vansteelandt, 2010). When there is no confounding of the estimated exposures-outcome relationship, $P(Y_0 = 1) = P(Y = 1 \mid X^{(1)} = 0)$. However, when confounding exists, as would be the case in observational research, adjustment for confounders is necessary, and the pPAR (Spiegelman and others, 2007) estimates the proportion of cases that may have been avoided if the targeted modifiable exposures, $X^{(1)}$, were eliminated while the distributions of confounders and all associations in the hypothetical population remained unchanged (Rockhill and others, 1998). Use $X^{(2)}$ to denote the confounders of the estimated exposures-outcome relationship. As discussed in Sjölander and Vansteelandt (2010), in the presence of confounding, the PAR (2.1) may be expressed as

$\mathrm{pPAR} = 1 - E_{X^{(2)}}\{P(Y = 1 \mid X^{(1)} = 0, X^{(2)})\}/P(Y = 1)$. We can rewrite the pPAR as

$$\mathrm{pPAR} = 1 - \frac{\int P(Y = 1 \mid X^{(1)} = 0, X^{(2)} = x^{(2)})\, f(X^{(2)} = x^{(2)})\, dx^{(2)}}{\int P(Y = 1 \mid X = x)\, f(X = x)\, dx}, \tag{2.2}$$

where $f(\cdot)$ is a probability density function, $X = (X^{(1)}, X^{(2)})$, and for categorical elements of $X$, the corresponding integrals are replaced by sums. Dividing both the numerator and denominator in (2.2) by $P(Y = 1 \mid X = 0)$, we can rewrite (2.2) using the adjusted RRs and the density functions as

$$\mathrm{pPAR} = 1 - \frac{\int rr_{2x^{(2)}}\, f(X^{(2)} = x^{(2)})\, dx^{(2)}}{\int rr_{3x^{(1)}x^{(2)}}\, f(X = x)\, dx}, \tag{2.3}$$

where

$$rr_{2x^{(2)}} = \frac{P(Y = 1 \mid X^{(1)} = 0, X^{(2)} = x^{(2)})}{P(Y = 1 \mid X^{(1)} = X^{(2)} = 0)}, \qquad rr_{3x^{(1)}x^{(2)}} = \frac{P(Y = 1 \mid X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)})}{P(Y = 1 \mid X^{(1)} = X^{(2)} = 0)},$$

and $X = 0$ can be replaced by another reference level as appropriate. Note that $rr_{20} = 1$ and $rr_{300} = 1$. When $X^{(2)}$ contains continuous variables, we can use the empirical joint probabilities, or assume an empirically verified multivariate parametric model for the distribution of $X^{(2)}$ (Carroll and others, 2006). Greenland and Drescher (1993) and Dahlqwist and others (2016) used the sample distributions for the density functions in the estimation of the pPAR in cross-sectional, case–control, and cohort data.

Suppose $X^{(2)}$ can be written as $(\tilde{X}^{(2)}, \overset{\approx}{X}^{(2)})$, where the distribution of $\overset{\approx}{X}^{(2)}$ is independent from $\tilde{X} = (X^{(1)}, \tilde{X}^{(2)})$, which represents all the elements in $X$ except $\overset{\approx}{X}^{(2)}$. If the RR for $\overset{\approx}{X}^{(2)}$ does not depend on the values of $\tilde{X}$, the pPAR in (2.3) can be simplified as

$$\mathrm{pPAR} = 1 - \frac{\int rr_{2\tilde{x}^{(2)}}\, f(\tilde{X}^{(2)} = \tilde{x}^{(2)})\, d\tilde{x}^{(2)}}{\int rr_{3x^{(1)}\tilde{x}^{(2)}}\, f(\tilde{X} = \tilde{x})\, d\tilde{x}}, \tag{2.4}$$

where

$$rr_{2\tilde{x}^{(2)}} = \frac{P(Y = 1 \mid X^{(1)} = 0, \tilde{X}^{(2)} = \tilde{x}^{(2)}, \overset{\approx}{X}^{(2)} = 0)}{P(Y = 1 \mid X^{(1)} = X^{(2)} = 0)}, \qquad rr_{3x^{(1)}\tilde{x}^{(2)}} = \frac{P(Y = 1 \mid X^{(1)} = x^{(1)}, \tilde{X}^{(2)} = \tilde{x}^{(2)}, \overset{\approx}{X}^{(2)} = 0)}{P(Y = 1 \mid X^{(1)} = X^{(2)} = 0)}.$$

See supplementary Section 1.1 available at Biostatistics online for a proof. For categorical variables $X^{(1)}$ and $X^{(2)}$, suppose there are $S$ possible unique combinations from the set of targeted modifiable exposures, and $T$ possible unique combinations from the set of confounders. It follows that (2.3) can be expressed as

$$\mathrm{pPAR} = 1 - \frac{\sum_{t=0}^{T-1} p_{\cdot t}\, rr_{2t}}{\sum_{s=0}^{S-1} \sum_{t=0}^{T-1} p_{st}\, rr_{3st}} \tag{2.5}$$

(Spiegelman and others, 2007), where $s$ denotes a stratum of unique combinations of levels of all exposures that are modifiable, $t$ denotes a stratum of unique combinations of levels of all confounders, $p_{st}$ represents the proportion in the population with risk factor levels $s$ and $t$, with 0 indexing the lowest risk levels, and $p_{\cdot t} = \sum_{s=0}^{S-1} p_{st}$ for all $t$. As the pPAR is a function of the adjusted RRs, it is necessary to obtain estimates for these parameters in order to estimate the pPAR. For example, we may use a log-linear regression model for the exposure–outcome relationship to obtain the adjusted RRs:

$$f_1(Y \mid X; \beta) = e^{(\beta_0 + X^{*}\beta)Y} \left(1 - e^{\beta_0 + X^{*}\beta}\right)^{(1 - Y)}, \tag{2.6}$$

where $f_1$ denotes the probability function of $Y$, $X^{*}$ contains $X$ and possibly interactions for the elements in $X$, and $\beta$ is the $K$-vector of log-RRs representing the $X$–$Y$ associations. Alternatively, we may use the logistic regression model:

$$f_1(Y \mid X; \beta) = \frac{e^{Y(\beta_0 + X^{*}\beta)}}{1 + e^{(\beta_0 + X^{*}\beta)}}, \tag{2.7}$$

where $e^{\beta}$ represents ORs, which approximate the RRs under the rare disease assumption. If $X^{(1)}$ and $X^{(2)}$ contain only categorical variables, the RRs can be calculated using

$$\widehat{rr}_{3st} = \left(1 + e^{\hat{\beta}_0}\right) e^{x^{*}\hat{\beta}} \Big/ \left\{1 + e^{(\hat{\beta}_0 + x^{*}\hat{\beta})}\right\}, \tag{2.8}$$

where $\beta_0$ and $\beta$ are the regression parameters in (2.7), and the value of $x^{*}$ corresponds to the $s$th stratum of $X^{(1)}$ and the $t$th stratum of $X^{(2)}$. The variance of this $\widehat{rr}_{3st}$ can be estimated using the Delta method.
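For categorical exposures, formulas (2.5) and (2.8) can be sketched in a few lines. The code below is only an illustration: the function names are hypothetical, it assumes one binary modifiable exposure and one binary confounder with $X^{*} = X$, and it plugs in parameter values quoted in the simulation study of Section 3 (baseline log odds −3.14, ORs 1.40 and 1.33, independent prevalences of 0.5).

```python
import math


def rr_logistic(beta0, beta, x):
    """Adjusted RR of stratum x versus the all-zero reference, following (2.8)."""
    lin = sum(b * xi for b, xi in zip(beta, x))
    return (1 + math.exp(beta0)) * math.exp(lin) / (1 + math.exp(beta0 + lin))


def ppar_categorical(p, rr3):
    """pPAR following (2.5).

    p[s][t]   : proportion of the population in exposure stratum s and
                confounder stratum t
    rr3[s][t] : adjusted RR of stratum (s, t) versus (0, 0); by definition
                rr2[t] equals rr3[0][t]
    """
    S, T = len(p), len(p[0])
    numer = sum(sum(p[s][t] for s in range(S)) * rr3[0][t] for t in range(T))
    denom = sum(p[s][t] * rr3[s][t] for s in range(S) for t in range(T))
    return 1 - numer / denom


# Values quoted in Section 3; X1 is modifiable (index s), X2 is not (index t)
beta0, beta = -3.14, (math.log(1.40), math.log(1.33))
p = [[0.25, 0.25], [0.25, 0.25]]  # independent exposures, each with prevalence 0.5
rr3 = [[rr_logistic(beta0, beta, (s, t)) for t in (0, 1)] for s in (0, 1)]
ppar = ppar_categorical(p, rr3)
```

With these inputs `ppar` comes out at roughly 0.16, in line with the true pPAR listed for that scenario in Table 2.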

2.2. Misclassification models
When one or more categorical elements of the exposures are subject to non-differential misclassification, the observed exposures are referred to as surrogate exposures. The assumption of non-differential misclassification is also called the surrogacy assumption. An example of a common surrogate exposure in epidemiological research is self-reported dietary intake, an inaccurate substitute for actual dietary intake, the true exposure (Willett and Lenart, 2013). Define $X = (X_1, \ldots, X_K)$ to be the $1 \times K$ vector of the true exposure values, and let $Z = (Z_1, \ldots, Z_K)$ be the corresponding vector of the surrogate exposure values, with $Z_k$ the surrogate of $X_k$ for each $k = 1, \ldots, K$. When $X_k$ is not misclassified, $Z_k = X_k$. To correct for misclassification in estimating the pPAR, we need to model the misclassification process, which can be written as

$$f_2(Z \mid X; \psi) = f_2(Z_1, \ldots, Z_K \mid X_1, \ldots, X_K; \psi), \tag{2.9}$$

where $\psi$ is the vector of parameters that characterize the relationship between $Z$ and $X$. When the misclassification process is conditionally independent across the $K$ exposures, we can simplify Model (2.9) to

$$f_2(Z \mid X; \psi) = \prod_{k=1}^{K} f_{2,k}(Z_k \mid X_k; \psi_k), \tag{2.10}$$

where $\psi_k = (\psi_{k,1}, \psi_{k,2})$ can be seen as a re-parameterization of the exposure sensitivities, $\Pr(Z_k = 1 \mid X_k = 1)$, and specificities, $\Pr(Z_k = 0 \mid X_k = 0)$, for each $(X_k, Z_k)$ pair, for $k = 1, \ldots, K$.
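As a rough illustration of (2.10), the joint misclassification matrix for $K$ binary exposures can be assembled from the per-exposure sensitivities and specificities. This is a sketch under the conditional independence assumption only; the function name is hypothetical. The HRM specificity of 0.29 and the alcohol sensitivity of 0.78 are quoted in the Introduction, while the other two values are placeholders chosen purely for illustration.

```python
from itertools import product


def joint_misclassification(sens, spec):
    """Build M with M[j][h] = P(Z = strata[h] | X = strata[j]) under (2.10).

    sens[k] = P(Z_k = 1 | X_k = 1), spec[k] = P(Z_k = 0 | X_k = 0).
    """
    K = len(sens)
    strata = list(product((0, 1), repeat=K))  # (0,0), (0,1), (1,0), (1,1) for K = 2

    def f2k(z, x, k):
        # Per-exposure misclassification probability f_{2,k}(z | x)
        if x == 1:
            return sens[k] if z == 1 else 1 - sens[k]
        return spec[k] if z == 0 else 1 - spec[k]

    M = []
    for x in strata:
        row = []
        for z in strata:
            prob = 1.0
            for k in range(K):
                prob *= f2k(z[k], x[k], k)  # product over exposures, as in (2.10)
            row.append(prob)
        M.append(row)
    return strata, M


# Exposure 1 = HRM, exposure 2 = high alcohol.
# spec[0] = 0.29 and sens[1] = 0.78 are from the text; 0.90 and 0.95 are hypothetical.
strata, M = joint_misclassification(sens=(0.90, 0.78), spec=(0.29, 0.95))
```

Each row of `M` is a probability distribution over the observed strata, which is the form in which the misclassification model enters the likelihoods of Section 2.3 when all exposures are categorical.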

2.3. The likelihoods
In a main study/internal validation study (MS/IVS) design, validation data are obtained from participants who are also part of the main study. All participants in the main study provide data on $Y$ and $Z$, and, in addition, the participants in the internal validation study provide data on $X$. Let $i = 1, \ldots, n_M$ index the participants in the main study who do not provide validation data and let $i = n_M + 1, \ldots, n_M + n_V$ index the participants in the validation study. For the $i$-th subject, $X_i = (X_{i1}, \ldots, X_{iK})$ and $Z_i = (Z_{i1}, \ldots, Z_{iK})$. The joint likelihood for the observed data in this MS/IVS design is

$$\begin{aligned} L(\beta, \psi, \pi) &= \prod_{i=1}^{n_M} f(Y_i, Z_i) \prod_{i=n_M+1}^{n_M+n_V} f(Y_i, Z_i, X_i) \\ &= \prod_{i=1}^{n_M} \int f_1(Y_i \mid x; \beta)\, f_2(Z_i \mid x; \psi)\, f_0(x; \pi)\, dx \prod_{i=n_M+1}^{n_M+n_V} f_1(Y_i \mid X_i; \beta)\, f_2(Z_i \mid X_i; \psi)\, f_0(X_i; \pi), \end{aligned} \tag{2.11}$$

where $f_0(x; \pi)$ denotes the distribution of the true exposures, with parameters $\pi$, and the integral over $x$ can be written as a summation if all the elements of $x$ are categorical variables. When the misclassification process is conditionally independent across exposures, Equation (2.11) can be used with $f_2(Z \mid X; \psi)$ defined as in (2.10). This assumption can and should be empirically verified using the data. We have used the surrogacy assumption that $f(Y \mid X, Z) = f(Y \mid X)$ in deriving Equation (2.11).

When estimating the pPAR, we need to estimate the exposure prevalences, $f(X)$, in addition to the RRs. It is necessary to assume that the misclassification process for $Z \mid X$ is the same in the main and validation studies, that is, that the misclassification process is transportable between the main and validation studies (Carroll and others, 2006). We call this the assumption of single transportability. If the validation study is a random subsample of the main study, as is the usual situation in MS/IVS designs, this condition is guaranteed; otherwise the assumption is empirically unverifiable, although necessary in order for the validation study data to be validly used for misclassification correction. Where appropriate, we can additionally assume that the exposure prevalences in the validation study are the same as those in the main study. We refer to this as double transportability, because both the misclassification process and the exposure prevalences are assumed to be transportable from the validation study to the main study. Typically, the MS/IVS design guarantees the double transportability assumption. However, there are exceptions in practice, for example, when the validation study is not a random sample of the entire study. Then, although the misclassification process can be reasonably considered transportable, the distribution of $X$ will not be.
In this case, the appropriate likelihood for the data under the single transportability assumption is

$$L(\beta, \psi, \pi) = \prod_{i=1}^{n_M} \int f_1(Y_i \mid x; \beta)\, f_2(Z_i \mid x; \psi)\, f_0(x; \pi)\, dx \prod_{i=n_M+1}^{n_M+n_V} f_1(Y_i \mid X_i; \beta)\, f_2(Z_i \mid X_i; \psi). \tag{2.12}$$

In a main study/external validation study (MS/EVS) design, the participants in the main study provide data on $Y$ and $Z$, and the participants in the external validation study provide only data on $X$ and $Z$, but not $Y$. Under the surrogacy and double transportability assumptions, the joint likelihood for the observed data in the doubly transportable MS/EVS design can be modeled as

$$L(\beta, \psi, \pi) = \prod_{i=1}^{n_M} \int f_1(Y_i \mid x; \beta)\, f_2(Z_i \mid x; \psi)\, f_0(x; \pi)\, dx \prod_{i=n_M+1}^{n_M+n_V} f_2(Z_i \mid X_i; \psi)\, f_0(X_i; \pi). \tag{2.13}$$

If the exposure prevalences in the validation study differ from those in the main study, then only $f(Z \mid X)$ is transportable but $f(X)$ is not. It follows that the appropriate likelihood for this singly transportable MS/EVS design is

$$L(\beta, \psi, \pi) = \prod_{i=1}^{n_M} \int f_1(Y_i \mid x; \beta)\, f_2(Z_i \mid x; \psi)\, f_0(x; \pi)\, dx \prod_{i=n_M+1}^{n_M+n_V} f_2(Z_i \mid X_i; \psi). \tag{2.14}$$

Table 1. Information provided from the IVS/EVS data under the double/single transportability assumption and equations for the corresponding likelihoods

| Assumption | IVS | Equation | EVS | Equation |
|---|---|---|---|---|
| Double transportability | $f(Y \mid X)$, $f(Z \mid X)$, $f(X)$ | (2.11) | $f(Z \mid X)$, $f(X)$ | (2.13) |
| Single transportability | $f(Y \mid X)$, $f(Z \mid X)$ | (2.12) | $f(Z \mid X)$ | (2.14) |

Table 1 summarizes the information that is provided by the internal or external validation data when the single or double transportability assumption holds. Depending on the study design and the assumptions made, we can maximize the likelihood expression in (2.11), (2.12), (2.13), or (2.14), to obtain the maximum likelihood estimates for $(\beta, \psi, \pi)$. We then use the estimates of $(\beta, \pi)$ to calculate the pPAR estimates. Estimation of the model parameters, $(\beta, \psi, \pi)$, is invariant to the choice of which variables are the targeted modifiable exposures $X^{(1)}$ in (2.3) among the full set of risk factors, $X$. Therefore, the parameters only need to be estimated once, even if different choices of the modifiable exposures are subsequently considered.
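To make the structure of likelihoods (2.11) to (2.14) concrete, the following sketch evaluates the log-likelihood for two binary exposures, with the integral over $x$ replaced by a sum over the four strata. It is not the authors' software: the function names, the data layout, and all numeric inputs are assumptions made for illustration, and the disease model is the logistic model (2.7) with $X^{*} = X$.

```python
import math
from itertools import product

STRATA = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def f1(y, x, beta0, beta):
    """Logistic disease model (2.7) with X* = X."""
    lin = beta0 + sum(b * xi for b, xi in zip(beta, x))
    prob = 1 / (1 + math.exp(-lin))
    return prob if y == 1 else 1 - prob


def log_lik(beta0, beta, M, pi, main, valid, internal=True, double=True):
    """Log of (2.11)-(2.14) for categorical X.

    M[j][h] = P(Z = STRATA[h] | X = STRATA[j]); pi[j] = P(X = STRATA[j]).
    main  : list of (y, h) pairs from participants without validation data
    valid : list of (y, h, j) triples; y is ignored when internal=False (EVS)
    double: include f0(X_i; pi) for validation records (double transportability)
    """
    ll = 0.0
    for y, h in main:
        # Observed-data density f(y, z): sum over the unobserved true stratum
        ll += math.log(sum(f1(y, x, beta0, beta) * M[j][h] * pi[j]
                           for j, x in enumerate(STRATA)))
    for y, h, j in valid:
        ll += math.log(M[j][h])
        if internal:
            ll += math.log(f1(y, STRATA[j], beta0, beta))
        if double:
            ll += math.log(pi[j])
    return ll


# Hypothetical inputs, for illustration only
pi = [0.25, 0.25, 0.25, 0.25]
M = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1],
     [0.1, 0.1, 0.1, 0.7]]
main = [(0, 0), (1, 3), (0, 2)]
valid = [(0, 1, 1), (1, 3, 3)]
ll_ivs_double = log_lik(-3.14, (0.34, 0.29), M, pi, main, valid)
ll_evs_single = log_lik(-3.14, (0.34, 0.29), M, pi, main, valid,
                        internal=False, double=False)
```

In practice the negative of such a function would be handed to a general-purpose numerical optimizer, with $\psi$ and $\pi$ reparameterized so that the probabilities stay in $(0, 1)$ and sum to one; that step is omitted here.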

2.4. Interval estimates
The formula for the variance of $\widehat{\mathrm{pPAR}}$ accounts for the estimation of $\psi$ in addition to $\beta$ and $\pi$, as well as for the covariances between these estimates. Specifically, when $X$ contains only categorical variables, or when $X$ contains continuous variables and the distribution of $X$ is estimated from a parametric model, the variance of $\widehat{\mathrm{pPAR}}$ can be obtained using the multivariate delta method, as a function of the variance-covariance matrix of the corrected estimators of $(\beta, \pi)$, which is a submatrix of the variance-covariance matrix of the estimators of $(\beta, \psi, \pi)$, estimated as the inverse of the observed information matrix (the negative Hessian of the log-likelihood) of likelihood (2.11), (2.12), (2.13), or (2.14), depending upon the study design. The asymptotic variance is derived in supplementary Section 1.2 available at Biostatistics online when $X$ contains only categorical variables. When $X$ contains continuous variables and estimation of the distribution of $X$ involves nonparametric methods, such as the empirical density functions, the bootstrap method can be used to estimate the variance (Haukka, 1995).
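The multivariate delta method mentioned above can be sketched generically. The example below is an illustration rather than the derivation in the supplementary material: it uses a numerical gradient, a deliberately simple single-exposure PAR as the function of interest, and a made-up covariance matrix. All names and numbers are hypothetical.

```python
import math


def delta_method_variance(g, theta, cov, eps=1e-6):
    """Approximate Var{g(theta_hat)} by grad' * cov * grad.

    The gradient of g at theta is obtained by central finite differences.
    """
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((g(up) - g(dn)) / (2 * eps))
    n = len(theta)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(n) for j in range(n))


def par_single(theta):
    """Single-exposure PAR as a function of theta = (log RR, prevalence)."""
    beta, p = theta
    a = p * (math.exp(beta) - 1)
    return a / (1 + a)


theta_hat = [math.log(1.40), 0.5]   # illustrative point estimates
cov_hat = [[0.0100, 0.0005],        # hypothetical variance-covariance matrix
           [0.0005, 0.0004]]
var_par = delta_method_variance(par_single, theta_hat, cov_hat)
se_par = math.sqrt(var_par)
```

For the pPAR, `g` would be the map from $(\beta, \pi)$ to (2.5), and `cov` the corresponding submatrix of the inverse observed information; an analytic gradient could replace the finite differences.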

3. SIMULATION STUDY
We conducted a simulation study to assess the finite sample performance of our methods under the MS/IVS design and the MS/EVS design under both double and single transportability assumptions. The objective was to estimate the pPAR in a two-factor scenario, with one binary exposure being modifiable and the other binary exposure non-modifiable. We chose our underlying parameter values based on the estimates from the HPFS study described in Section 4. For feasibility, we reduced $n_M$ by an order of magnitude to 5000. We assumed the logistic regression model (2.7) with $X^{*} = X$ for the exposure–disease relationship. The baseline log odds of disease was increased from −6.28 to −3.14 so as to obtain a sufficient number of cases per simulated data set, and we increased the log OR values, $\beta_1$ and $\beta_2$, to 0.3364 and 0.2878, which correspond to ORs of 1.40 and 1.333, respectively. We also considered $\beta_1, \beta_2 = 0.1, 0.5$ to investigate how the results varied with alternative values of $\beta_1$ and $\beta_2$. These log OR values correspond to OR values of approximately 1.1 and 1.65, respectively. In the simulations, we set the underlying prevalences of $X_1$ and $X_2$ to be 0.5, with the exposures independently distributed, and used a simple model for the misclassification process, given in supplementary Section 1.3 available at Biostatistics online. We also investigated the potential impact of doubling the validation study size on the mean relative bias, mean squared error (MSE) and coverage probability (CP) of the point and interval estimates. The relative bias is defined as the percent difference between the simulated pPAR estimates and the true pPAR value, calculated from the underlying parameter values for the exposure prevalences and RRs.

For each simulated dataset, we estimated the pPAR for $X_1$, while treating $X_2$ as a non-modifiable exposure. The corrected RRs were estimated based on the corrected $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ estimates using Equation (2.8). In addition to the corrected pPAR estimates obtained under the MS/IVS design with validation study sizes of 125 and 250 as well as the corrected pPAR estimates under the MS/EVS design with a validation study of size 250, we also calculated the pPAR that would have been estimated if complete data were observed for all participants in the study ($n_V = 5000$, $n_M = 0$), and the uncorrected pPAR which is obtained from regressing $Y$ against $Z$ in the main study. The complete data scenario is equivalent to that which occurs when the true exposure values are observed for all participants in the study. We use the joint prevalences of $Z$ to calculate the uncorrected pPAR. The relative bias, MSE and CP of the pPAR estimates are reported in Table 2. These results show that with a validation study of size 125, we can substantially reduce the bias of our point estimates and improve the CP of our interval estimates for the pPAR, even in the face of substantial exposure misclassification. When we repeated the simulations for different sets of parameter values for $\beta_1, \beta_2 = 0.1, 0.5$, we saw that in all four of these new scenarios, our observations were similar to those from the scenario where $(\beta_1, \beta_2) = [\log(1.40), \log(1.33)]$.

Table 2. Each cell contains the relative/percentage bias (% Bias), MSE, and CP of pPAR estimates under both MS/IVS and MS/EVS designs when double transportability holds

| Method | $n_V, n_M$ | Measure | $\beta_1 = \log(1.40)$, $\beta_2 = \log(1.33)$, pPAR = 0.16 | $\beta_1 = \log(1.65)$, $\beta_2 = \log(1.65)$, pPAR = 0.23 | $\beta_1 = \log(1.1)$, $\beta_2 = \log(1.1)$, pPAR = 0.05 | $\beta_1 = \log(1.1)$, $\beta_2 = \log(1.65)$, pPAR = 0.05 | $\beta_1 = \log(1.65)$, $\beta_2 = \log(1.1)$, pPAR = 0.23 |
|---|---|---|---|---|---|---|---|
| CK | 5000, 0 | % Bias | 1.0 | 0.6 | 3.4 | 2.5 | 0.8 |
| | | MSE | 0.003 | 0.002 | 0.004 | 0.003 | 0.003 |
| | | CP | 95.2 | 95.3 | 95.8 | 96.4 | 94.9 |
| UC | 0, 5000 | % Bias | −38.5 | −38.8 | −34.8 | −34.3 | −38.8 |
| | | MSE | 0.007 | 0.010 | 0.005 | 0.003 | 0.011 |
| | | CP | 82.9 | 59.2 | 94.1 | 94.8 | 66 |
| IVS | 125, 5000 | % Bias | 5.4 | 1.70 | 4.1 | 0.89 | 1.08 |
| | | MSE | 0.020 | 0.019 | 0.022 | 0.023 | 0.018 |
| | | CP | 92.5 | 96.8 | 96.2 | 96.8 | 95.6 |
| IVS | 250, 5000 | % Bias | −1.2 | −0.4 | −4.2 | −4.6 | −0.5 |
| | | MSE | 0.009 | 0.007 | 0.011 | 0.009 | 0.009 |
| | | CP | 95.9 | 95.8 | 95.0 | 96.4 | 95.9 |
| EVS | 250, 5000 | % Bias | −0.2 | 0.0 | −1.1 | −2.6 | 0.2 |
| | | MSE | 0.010 | 0.009 | 0.013 | 0.011 | 0.011 |
| | | CP | 95.5 | 96.3 | 95.6 | 96.8 | 95.4 |

CK (Complete Knowledge): hypothetical scenario where $X$ is given in the main study.
UC (Uncorrected): pPAR is obtained from naive estimates using $Z$ in the main study.
IVS (IVS-Corrected): pPAR is derived from the method for misclassification correction under the MS/IVS design with double transportability assumed, using Equation (2.11).
EVS (EVS-Corrected): pPAR is derived from the method for misclassification correction under the MS/EVS design with double transportability assumed, using Equation (2.13).
The MSE and relative bias for the corrected estimates decreased on average when the validation study size was increased, but increased slightly when the likelihood for the MS/EVS was used instead of the likelihood for the MS/IVS design. The coverage probabilities for the corrected interval estimates were also closer to the ideal value of 95% when the validation study size increased. When $\beta_1 = \log(1.65)$, the coverage probabilities for the uncorrected interval estimates fell well below the nominal level (59.2% and 66% in Table 2), and they were also below nominal (82.9%) when $(\beta_1, \beta_2) = [\log(1.40), \log(1.33)]$. However, the coverage probabilities for the uncorrected interval estimates were close to 95% when $\beta_1 = \log(1.1)$, regardless of the value of $\beta_2$.

We repeated the simulation study to assess the performance of our method when the validation study is external and the transportability of exposure prevalences can no longer be assumed. We set the marginal prevalence of $X_1$ to be 0.5 in the main study but 0.6 in the validation study, and that of $X_2$ to be 0.5 in the main study but 0.7 in the validation study. However, we set the misclassification probabilities to be the same in both the main and validation studies, using the same model for the misclassification process given in supplementary Section 1.3 available at Biostatistics online. In this scenario, when only single transportability holds, the appropriate likelihood is given by Equation (2.14). We also computed the corrected estimates for the pPAR obtained from maximizing the likelihood in Equation (2.13), to observe how the incorrect assumption of double transportability affects the ‘corrected’ estimates. The results of the simulation are given in supplementary Table 1 available at Biostatistics online. We see that when the appropriate likelihood is maximized, the MSE and percentage bias of the corrected estimates are substantially lower than when the incorrect likelihood is used. Note that even in the latter scenario, the percentage bias is still less than that of the uncorrected estimates, although the MSE is smaller when $\beta_1 = 0.5$ and larger when $\beta_1 = 0.1$, compared with the uncorrected estimates. The coverage probabilities of the corrected CI estimates are close to the optimal value of 95% for all values of $(\beta_1, \beta_2)$.
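The three performance measures used in this section can be computed from the simulation replicates as sketched below. This is an illustrative helper, not the authors' code: the function name is hypothetical, and the coverage calculation assumes symmetric Wald-type intervals on the untransformed pPAR scale, which the text does not specify.

```python
def performance(estimates, std_errors, truth, z=1.96):
    """Relative bias (%), MSE, and coverage probability (%) over replicates.

    estimates, std_errors : per-replicate pPAR estimates and standard errors
    truth                 : true pPAR implied by the data-generating parameters
    Coverage assumes Wald intervals estimate +/- z * SE (an assumption here).
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    rel_bias = 100 * (mean_est - truth) / truth
    mse = sum((e - truth) ** 2 for e in estimates) / n
    covered = sum(1 for e, se in zip(estimates, std_errors)
                  if e - z * se <= truth <= e + z * se)
    return rel_bias, mse, 100 * covered / n


# Toy replicates, for illustration only
rel_bias, mse, cp = performance(
    estimates=[0.15, 0.17, 0.16, 0.10],
    std_errors=[0.01, 0.01, 0.01, 0.01],
    truth=0.16,
)
```

In a full simulation, `estimates` and `std_errors` would hold one entry per simulated data set, obtained by maximizing the relevant likelihood and applying the variance formula of Section 2.4.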

4. ILLUSTRATIVE EXAMPLE
We applied this likelihood-based method for calculation of the pPAR in the presence of exposure misclassification to data from the HPFS on risk factors for CRC (Platz and others, 2000). The HPFS began in 1986 when 51 529 male health professionals were enrolled by responding to mailed questionnaires. Every 2 years since the start of the study through 2016, these participants filled in questionnaires inquiring about topics such as dietary intake and health status. The accuracy of the responses in the food frequency questionnaires was assessed by validation with dietary records in a sub-sample of 127 participants (Rimm and others, 1992). The information from these dietary records, together with the information provided by these 127 participants in the questionnaires, formed our internal validation study. The information from the remaining 51 402 participants, including 2028 (3.9%) CRC cases, formed the main study.

Our aim was to estimate the pPAR for CRC, treating one exposure as modifiable, while treating a second exposure as non-modifiable, taking into account the misclassification in both exposures. Two exposures of interest are high red meat intake and high alcohol intake at baseline. High red meat (HRM) intake was defined as two or more servings of beef, pork, or lamb each week, and high alcohol intake was defined as seven or more servings of alcohol each week. First, let high red meat intake be the modifiable exposure and high alcohol intake and age be the non-modifiable exposures. Let $X_1$ and $Z_1$ represent the true and surrogate values for HRM, while $X_2$ and $Z_2$ represent the true and surrogate values for high alcohol intake. For completeness, we also provide the pPAR for when both exposures are eliminated from the population. A random sample of the main study participants was invited to take part in the validation study, and approximately half of the invited participants agreed to provide the validation data (Rimm and others, 1992).
Therefore, the double transportability assumption may not hold. However, we estimated pPARs under both the single and double transportability assumptions for illustration and comparison. We used the likelihood in Equation (2.11) for the MS/IVS design under the double transportability assumption, and the likelihood in Equation (2.12) under the single transportability assumption. For the exposure–disease relationship, we used the Poisson regression model (2.6) with X∗ = X, including HRM, high alcohol intake, and age at baseline as the covariates. Age was assumed to be continuous and measured without error. As shown in supplementary Section 1.4 available at Biostatistics online, in these data, the age distribution was approximately independent of HRM and high alcohol intake. Therefore, we used Formula (2.4) with ≈ X(2) = age to estimate the pPAR. If the confounders were not independent of the exposures, as would be the case in many applications, the pPAR would be estimated using Formula (2.3).

For the misclassification model, we used a polytomous logistic regression to model the conditional joint distribution of (Z1, Z2) given (X1, X2). Let j = 1, 2, 3, 4 index the unique combinations (0,0), (0,1), (1,0), (1,1) for X = (X1, X2), and h similarly defined for Z = (Z1, Z2). Then, for h = 2, 3, 4 and j = 1, 2, 3, 4, $\log\{f_2(Z = h \mid X = j) / f_2(Z = 1 \mid X = j)\} = \psi_{h-1,j}$. The sensitivity and specificity of the exposures, based on data in the validation study, are given in supplementary Table 2 available at Biostatistics online. We observed that there is a substantial level of misclassification in the data. In particular, the marginal specificity of HRM (X1) is 0.292.

The uncorrected RR estimates are shown in Table 3. The corrected RR estimates were obtained by maximizing the likelihood (2.11), which is for the MS/IVS design with double transportability assumed, and by maximizing the likelihood (2.12), with single transportability assumed. While the corrected RR estimate corresponding to a 1-year increase in age was similar to the uncorrected one, the corrected estimates for the RRs associated with high alcohol intake and HRM were moderately to substantially different from their uncorrected counterparts. In order to obtain corrected estimates for the pPAR based on (2.4), it is also necessary to validly estimate the joint prevalences for the four exposure strata cross-classified from the two binary exposures, HRM and high alcohol intake. In Table 4, we present the marginal prevalences for HRM and high alcohol intake separately, calculated by marginalizing over the estimated joint exposure prevalences, under each of the double and single transportability assumptions.

Table 3. RR estimates with 95% CIs based on data in the HPFS, using the methods for the MS/IVS design with double transportability or single transportability assumed, using likelihood (2.11) or (2.12), respectively

Exposure         Uncorrected         Double transportability   Single transportability
High red meat    1.07 (0.95–1.21)    1.33 (0.77–2.30)          1.44 (0.80–2.59)
High alcohol     1.22 (1.12–1.34)    1.34 (1.11–1.62)          1.46 (0.93–2.30)
Age (+1 year)    1.05 (1.04–1.05)    1.05 (1.05–1.05)          1.05 (1.05–1.05)

Table 4. Marginal exposure prevalences for the HPFS data. Subscripts indicate whether the values are observed in the internal validation study (V) or estimated using the main study (M) data as well

Exposure         PV(Xk = 1)   PV(Zk = 1)   PM(Zk = 1),    PM(Xk = 1), double   PM(Xk = 1), single
                                           uncorrected    transportability     transportability
High red meat    0.62         0.82         0.84           0.64                 0.75
High alcohol     0.39         0.35         0.31           0.35                 0.32
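The polytomous logistic model above fixes Z = 1, that is (Z1, Z2) = (0,0), as the reference category, so the four misclassification probabilities for a given true stratum j follow from the three log-ratios ψ1,j, ψ2,j, ψ3,j by normalization. A minimal sketch of that mapping is given below. The ψ values are hypothetical and chosen only so that the result is easy to check by hand, and the function name is illustrative rather than taken from the authors' software.

```python
import math


def misclassification_probs(psi_column):
    """Return [f2(Z=1|X=j), ..., f2(Z=4|X=j)] from (psi_{1,j}, psi_{2,j}, psi_{3,j}).

    The reference category h = 1 has log-ratio 0 by construction.
    """
    log_ratios = [0.0] + list(psi_column)
    shift = max(log_ratios)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in log_ratios]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical log-ratios for one true-exposure stratum j.
psi_j = [math.log(2.0), math.log(3.0), math.log(4.0)]
probs = misclassification_probs(psi_j)
print([round(p, 3) for p in probs])
```

With these inputs the category weights are proportional to 1, 2, 3, and 4, so the probabilities are 0.1, 0.2, 0.3, and 0.4. Repeating the calculation for each j = 1, 2, 3, 4 gives the full 4 × 4 misclassification matrix.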
The uncorrected and corrected pPAR estimates with 95% CIs are presented in Table 5. We also estimated the pPAR for CRC when alcohol intake was assumed modifiable but red meat intake and age were assumed non-modifiable, as well as the pPAR when both exposures were assumed to be modifiable and age was considered non-modifiable. We observed that the corrected estimates were moderately to substantially greater than the corresponding uncorrected estimates. Here, the uncorrected pPARs suggested a lower benefit from interventions aimed at reducing red meat intake than from those aimed at reducing alcohol intake, while after correction, the impact of red meat reduction was estimated to be even greater than that of alcohol intake reduction.

As shown in Tables 3 and 4, the RR estimates for HRM and high alcohol intake obtained under the single transportability assumption were greater than those under the double transportability assumption by 8% and 9%, respectively. While the marginal prevalence of high alcohol intake decreased by 0.03, the marginal prevalence for HRM increased from 0.64 to 0.75. We also computed the pPAR estimates and found that the corrected pPAR estimate for HRM increased by about 40%. This occurs because both the RR and the marginal prevalence estimates for HRM were greater under the single transportability assumption. However, the pPAR estimate for high alcohol intake only increased by 9%, from 0.11 to 0.12, because the marginal prevalence estimate decreased even though the RR estimate increased. We also observed a 30% increase in the pPAR estimate for when both exposures were considered modifiable, which is less than the relative increase in the pPAR for HRM only, but greater than that for high alcohol intake only.

Table 5. Point and interval estimates of the pPAR for risk factors for colorectal cancer, estimated from data in the HPFS, for the MS/IVS design with double or single transportability assumed, using likelihood (2.11) or (2.12), respectively

Modifiable exposure(s)   Non-modifiable exposure   Uncorrected             Double transportability   Single transportability
Red meat                 Alcohol, age              0.06 (−0.04 to 0.15)    0.18 (−0.17 to 0.52)      0.25 (−0.16 to 0.67)
Alcohol                  Red meat, age             0.07 (0.03–0.10)        0.11 (0.04–0.18)          0.12 (−0.08 to 0.33)
Red meat, alcohol        Age                       0.12 (0.00–0.23)        0.26 (−0.15 to 0.67)      0.34 (−0.12 to 0.79)
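The way the RR and prevalence estimates combine into the pPAR can be sketched numerically. The code below assumes the standard multiplicative-model form of the pPAR (Bruzzi and others, 1985; Spiegelman and others, 2007), 1 − Σ p_st RR2_t / Σ p_st RR1_s RR2_t, which may differ in detail from Formula (2.4). Because only marginal prevalences are reported in Table 4, the joint prevalences are approximated by treating HRM and high alcohol intake as independent, which is an assumption made here for illustration only. Age is omitted because, under the approximate independence noted earlier, it cancels from this ratio. The results are therefore close to, but not identical to, the entries of Table 5.

```python
def partial_par(joint_prev, rr_modifiable, rr_background):
    """pPAR = 1 - sum_st p_st * RR2_t / sum_st p_st * RR1_s * RR2_t.

    joint_prev[s][t] is the joint prevalence of modifiable stratum s and
    background stratum t; index 0 is the reference stratum (RR = 1).
    """
    numerator = 0.0
    denominator = 0.0
    for s, row in enumerate(joint_prev):
        for t, p_st in enumerate(row):
            numerator += p_st * rr_background[t]
            denominator += p_st * rr_modifiable[s] * rr_background[t]
    return 1.0 - numerator / denominator


def independent_joint(p1, p2):
    """Joint prevalences of two binary exposures ASSUMED independent."""
    return [[(1 - p1) * (1 - p2), (1 - p1) * p2],
            [p1 * (1 - p2), p1 * p2]]


def hrm_ppar(rr_hrm, rr_alcohol, prev_hrm, prev_alcohol):
    """pPAR with HRM modifiable and high alcohol intake as background."""
    joint = independent_joint(prev_hrm, prev_alcohol)
    return partial_par(joint, [1.0, rr_hrm], [1.0, rr_alcohol])


# Corrected RRs (Table 3) and marginal prevalences (Table 4).
ppar_double = hrm_ppar(1.33, 1.34, 0.64, 0.35)
ppar_single = hrm_ppar(1.44, 1.46, 0.75, 0.32)
relative_increase = ppar_single / ppar_double - 1.0

print(f"double: {ppar_double:.3f}, single: {ppar_single:.3f}, "
      f"relative increase: {relative_increase:.0%}")
```

Under these simplifications the HRM pPAR rises from roughly 0.17 to roughly 0.25 when moving from double to single transportability, a relative increase of about 40%, in line with the change described in the text.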

5. DISCUSSION

Misclassification continues to be a problem in large observational studies, and results in biased estimates of exposure prevalences and RRs that in turn lead to biased estimates of public health quantities of interest, such as the pPAR. In this article, we described a likelihood-based method for obtaining valid exposure prevalence and RR estimates in the presence of exposure misclassification, which requires the availability of validation study data. This allows us to obtain valid point and interval estimates for the pPAR.

In a simulation study covering multiple study size combinations, we saw that under the MS/IVS design, even though the corrected pPAR estimates have greater MSE than the uncorrected estimates, the corrected estimates have an average relative bias of less than 5%. In contrast, the uncorrected estimates have an average relative bias of about 40% toward the null. We concluded from the simulation study results that even a small validation study of size 125 can go a long way in reducing the bias of pPAR estimates, compared with the uncorrected estimates that use surrogate exposure data. We also observed that increasing the validation study size to 250 results in a marked decrease in the MSE and relative bias of the point estimates and an improvement in the CP of the interval estimates. Under the MS/EVS design, we saw from the results in supplementary Table 1 available at Biostatistics online that, when only single transportability holds but double transportability is incorrectly assumed, the estimated pPARs have greater MSE and relative bias than when the model for single transportability is used.

We applied the proposed method to a study of dietary risk factors for CRC in the HPFS, and observed that the exposure prevalences of the surrogate exposures were substantially different from the corrected estimates of the true exposure prevalences.
Consequently, when we compared the misclassification-corrected pPAR estimates to the uncorrected versions, we found that the bias in the uncorrected versions appeared to be substantial, between 57% and 317%. This means that CRC may be more preventable than currently believed. Because red meat was measured with substantially more misclassification than alcohol, both the estimated marginal prevalences and the RRs were more severely biased and, as a result, the correction for misclassification for the effect of HRM had a greater impact on the corresponding pPAR estimate than that for high alcohol intake. Thus, interventions focused on reducing red meat consumption as a means of preventing CRC emerged as potentially much more important, compared to those focused on reduction of alcohol intake, than what had been suggested by the standard uncorrected analysis.

In the illustrative example, male health professionals were the study population. They may not represent the general population in terms of the distribution of the exposures and confounders. When estimating the PAR and pPAR, a joint distribution of exposures and confounders that is more representative of a target population of more general interest could be available from another source and used instead of the one estimated in the study population, from which RRs are estimated. Furthermore, in scenarios where estimates of a more representative joint exposure and confounder distribution may be biased due to misclassification, the current validation study or another validation study could be used to correct for the bias in these prevalence estimates, as indicated by assumptions about transportability of the misclassification process. We can then use these more representative prevalence estimates, based upon a source different from that used to estimate the RRs, along with suitable misclassification model parameters, to estimate the pPAR. The proposed likelihood methods can be extended in these scenarios. For example, to combine the auxiliary data, the full likelihood can be a product of the likelihoods for data from different sources and we can then maximize this augmented likelihood to obtain the parameter estimates.
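The range of 57% to 317% quoted earlier in this section can be reproduced from the point estimates in Table 5 if the difference between each corrected and uncorrected pPAR is expressed as a percentage of the uncorrected value. That choice of denominator is inferred from the numbers rather than stated in the text, so the short check below should be read as an illustration.

```python
# Table 5 point estimates:
# (uncorrected, double transportability, single transportability)
TABLE_5 = {
    "red meat": (0.06, 0.18, 0.25),
    "alcohol": (0.07, 0.11, 0.12),
    "red meat and alcohol": (0.12, 0.26, 0.34),
}


def relative_difference(corrected, uncorrected):
    """Corrected minus uncorrected pPAR, as a percentage of the uncorrected value."""
    return 100.0 * (corrected - uncorrected) / uncorrected


differences = [
    relative_difference(corrected, uncorrected)
    for uncorrected, *corrected_pair in TABLE_5.values()
    for corrected in corrected_pair
]

print(f"smallest: {min(differences):.0f}%, largest: {max(differences):.0f}%")
```

The smallest value comes from alcohol under double transportability (0.07 versus 0.11) and the largest from red meat under single transportability (0.06 versus 0.25).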

6. SOFTWARE

An R function implementing the proposed method is available online at https://www.hsph.harvard.edu/molin-wang/software.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

The authors thank the associate editor and referees for their insightful comments, which have improved the article.

Conflict of Interest: None declared.

FUNDING

This research was partially supported by the National Institutes of Health grants DP1 ES025459, U01 CA167552, and R01 CA137178.

REFERENCES

BENICHOU, J. (2001). A review of adjusted estimators of attributable risk. Statistical Methods in Medical Research 10, 195–216.

BRAY, F. AND SOERJOMATARAM, I. (2018). Population attributable fractions continue to unmask the power of prevention. British Journal of Cancer 118, 1031–1032.

BRUZZI, P., GREEN, S. B., BYAR, D. P., BRINTON, L. A. AND SCHAIRER, C. (1985). Estimating the population attributable risk for multiple risk factors using case-control data. American Journal of Epidemiology 122, 904–914.

CARROLL, R. J., RUPPERT, D., STEFANSKI, L. A. AND CRAINICEANU, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton, Florida: CRC Press.

COPELAND, K. T., CHECKOWAY, H., MCMICHAEL, A. J. AND HOLBROOK, R. H. (1977). Bias due to misclassification in the estimation of relative risk. American Journal of Epidemiology 105, 488–495.

DAHLQWIST, E., ZETTERQVIST, J., PAWITAN, Y. AND SJÖLANDER, A. (2016). Model-based estimation of the attributable fraction for cross-sectional, case–control and cohort studies using the R package AF. European Journal of Epidemiology 31, 575–582.

DALEN, I., BUONACCORSI, J. P., SEXTON, J. A., LAAKE, P. AND THORESEN, M. (2009). Correction for misclassification of a categorized exposure in binary regression using replication data. Statistics in Medicine 28, 3386–3410.

GOLDBERG, J. D. (1975). The effects of misclassification on the bias in the difference between two proportions and the relative odds in the fourfold table. Journal of the American Statistical Association 70, 561–567.

GREENLAND, S. AND DRESCHER, K. (1993). Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics 49, 865–872.

HAUKKA, J. K. (1995). Correction for covariate measurement error in generalized linear models—a bootstrap approach. Biometrics 51, 1127–1132.

HSIEH, C.-C. (1991). The effect of non-differential outcome misclassification on estimates of the attributable and prevented fraction. Statistics in Medicine 10, 361–373.

HSIEH, C.-C. AND WALTER, S. D. (1988). The effect of non-differential exposure misclassification on estimates of the attributable and prevented fraction. Statistics in Medicine 7, 1073–1085.

JOHNSON, C. Y., FLANDERS, W. D., STRICKLAND, M. J., HONEIN, M. A. AND HOWARDS, P. P. (2014). Potential sensitivity of bias analysis results to incorrect assumptions of nondifferential or differential binary exposures misclassification. Epidemiology (Cambridge, Mass.) 25, 902.

LEVIN, M. L. (1952). The occurrence of lung cancer in man. Acta-Unio Internationalis Contra Cancrum 9, 531–541.

MARSHALL, R. J. (1990). Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology 43, 941–947.

NORTHRIDGE, M. E. (1995). Public health methods–attributable risk as a link between causality and public health action. American Journal of Public Health 85, 1202–1204.

PLATZ, E. A., WILLETT, W. C., COLDITZ, G. A., RIMM, E. B., SPIEGELMAN, D. AND GIOVANNUCCI, E. (2000). Proportion of colon cancer risk that might be preventable in a cohort of middle-aged US men. Cancer Causes and Control 11, 579–588.

RIMM, E. B., GIOVANNUCCI, E. L., STAMPFER, M. J., COLDITZ, G. A., LITIN, L. B. AND WILLETT, W. C. (1992). Reproducibility and validity of an expanded self-administered semiquantitative food frequency questionnaire among male health professionals. American Journal of Epidemiology 135, 1114–1126.

RIMM, E. B., GIOVANNUCCI, E. L., WILLETT, W. C., COLDITZ, G. A., ASCHERIO, A., ROSNER, B. AND STAMPFER, M. J. (1991). Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet 338, 464–468.

ROCKHILL, B., NEWMAN, B. AND WEINBERG, C. (1998). Use and misuse of population attributable fractions. American Journal of Public Health 88, 15–19.

SJÖLANDER, A. AND VANSTEELANDT, S. (2010). Doubly robust estimation of attributable fractions. Biostatistics 12, 112–121.

SPIEGELMAN, D., HERTZMARK, E. AND WAND, H. C. (2007). Point and interval estimates of partial population attributable risks in cohort studies: examples and software. Cancer Causes and Control 18, 571–579.

SPIEGELMAN, D., ROSNER, B. AND LOGAN, R. (2000). Estimation and inference for logistic regression with covariate misclassification and measurement error in main study/validation study designs. Journal of the American Statistical Association 95, 51–61.

VANDERWEELE, T. J. (2010). Attributable fractions for sufficient cause interactions. The International Journal of Biostatistics 6, 5.

VOGEL, C., BRENNER, H., PFAHLBERG, A. AND GEFELLER, O. (2005). The effects of joint misclassification of exposure and disease on the attributable risk. Statistics in Medicine 24, 1881–1896.

WILLETT, W. AND LENART, E. (2013). Reproducibility and validity of food frequency questionnaires. Nutritional Epidemiology, 96–141.

WONG, B. H. W., PESKOE, S. B. AND SPIEGELMAN, D. (2018). The effect of risk factor misclassification on the partial population attributable risk. Statistics in Medicine 37, 1259–1275.

YI, G. Y., MA, Y., SPIEGELMAN, D. AND CARROLL, R. J. (2015). Functional and structural methods with mixed measurement error and misclassification in covariates. Journal of the American Statistical Association 110, 681–696.

[Received September 17, 2018; revised December 27, 2019; accepted for publication December 29, 2019]