Practice of Epidemiology a Meta-Regression Method for Studying Etiological Heterogeneity Across Disease Subtypes Classified by M
Total Page:16
File Type:pdf, Size:1020Kb
American Journal of Epidemiology Advance Access published June 26, 2015 American Journal of Epidemiology © The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of DOI: 10.1093/aje/kwv040 Public Health. All rights reserved. For permissions, please e-mail: [email protected]. Practice of Epidemiology A Meta-Regression Method for Studying Etiological Heterogeneity Across Disease Subtypes Classified by Multiple Biomarkers Molin Wang*, Aya Kuchiba, and Shuji Ogino * Correspondence to Dr. Molin Wang, Departments of Biostatistics and Epidemiology, Harvard T.H. Chan School of Public Health, and Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, 677 Huntington Avenue, Boston, MA 02115 (e-mail: [email protected]). Downloaded from Initially submitted July 22, 2014; accepted for publication February 4, 2015. In interdisciplinary biomedical, epidemiologic, and population research, it is increasingly necessary to consider pathogenesis and inherent heterogeneity of any given health condition and outcome. As the unique disease prin- http://aje.oxfordjournals.org/ ciple implies, no single biomarker can perfectly define disease subtypes. The complex nature of molecular pathol- ogy and biology necessitates biostatistical methodologies to simultaneously analyze multiple biomarkers and subtypes. To analyze and test for heterogeneity hypotheses across subtypes defined by multiple categorical and/or ordinal markers, we developed a meta-regression method that can utilize existing statistical software for mixed-model analysis. This method can be used to assess whether the exposure-subtype associations are different across subtypes defined by 1 marker while controlling for other markers and to evaluate whether the difference in exposure-subtype association across subtypes defined by 1 marker depends on any other markers. To illustrate this method in molecular pathological epidemiology research, we examined the associations between smoking status at Harvard Library on July 7, 2015 and colorectal cancer subtypes defined by 3 correlated tumor molecular characteristics (CpG island methylator phenotype, microsatellite instability, and the B-Raf protooncogene, serine/threonine kinase (BRAF ), mutation) in the Nurses’ Health Study (1980–2010) and the Health Professionals Follow-up Study (1986–2010). This method can be widely useful as molecular diagnostics and genomic technologies become routine in clinical medicine and public health. causal inference; genomics; heterogeneity test; molecular diagnosis; omics; transdisciplinary epidemiology Abbreviations: BRAF, B-Raf protooncogene, serine/threonine kinase; CIMP, CpG island methylator phenotype; HPFS, Health Professionals Follow-up Study; MPE, molecular pathological epidemiology; MSI, microsatellite instability; MSS, microsatellite stable; NHS, Nurses’ Health Study; RR, relative risk. Based on the underlying premise that individuals with the (21–24) cancers, as well as nonneoplastic diseases such as same disease name have similar etiologies and disease evo- stroke (25), cardiovascular disease (26), autism (27), infec- lution, epidemiologic research typically aims to investigate tious disease (28), autoimmune disease (29), glaucoma (30), the relationship between exposure and disease. With the ad- and obesity (31). vancement of biomedical sciences, it is increasingly evident New statistical methodologies to address disease heteroge- that many human disease processes comprise a range of het- neity are useful in not only molecular pathological epidemi- erogeneous molecular pathological processes, modified by ology (MPE) (32) with bona fide molecular subclassification the exposome (1). Molecular classification can be utilized but also epidemiologic research that takes other features of in epidemiology because individuals with similar molecular disease heterogeneity (e.g., lethality, disease severity) into pathological processes likely share similar etiologies (2). consideration. There are statistical methods for evaluating Pathogenic heterogeneity has been considered in various neo- whether the association of an exposure with disease varies plasms such as endometrial (3), colorectal (3–20), and lung by subtypes that are defined by categorical (33–36) or ordinal 1 2 Wang et al. (33–35) subclassifiers (M.W., unpublished manuscript, 2015); the level of the pth marker variable corresponding to the jth the published methods by Chatterjee (33), Chatterjee et al. subtype; it is 1 or 0 if the pth marker variable is binary, and it (34), and Rosner et al. (35) apply to cohort studies, and the is the ordinal or median score of the marker level correspond- method by Begg et al. (36) focuses on case-control studies. ing to the jth subtype if the pth marker is an ordinal marker, For simplicity, we use the term “categorical variable” (or the p =1,...,P, j =1,...,J. adjective “categorical”) when referring to “nonordinal cate- One-stage method. The method developed by Rosner gorical variable” throughout this paper. et al. (35), Chatterjee (33), and Chatterjee et al. (34) can be Given the complexity of molecular pathology and patho- usefully applied in cohort studies to investigate multiple β genesis indicated by the unique disease principle (1), no single markers. In that method, 1j in model 1 is modeled byP using biomarker can perfectly subclassify any disease entity. Nota- β γ γ P the marker variables, for example, by 1jð Þ¼ 0 þ p¼1 bly, molecular disease markers are often correlated (37). For γ ; pspj where some interaction terms of marker variables can example, in colorectal cancer, there is a strong association be- be added. Model 1 then becomes tween high-level microsatellite instability (MSI) and high- level CpG island methylator phenotype (CIMP) and between λjðtjXiðtÞ; WiðtÞÞ high-level CIMP and the B-Raf protooncogene, serine/threonine () kinase (BRAF), mutation (38). XP λ γ X γ X β W : Cigarette smoking has been associated with the risk of high- ¼ 0jðtÞexp 0 iðtÞþ pspj iðtÞþ 2j iðtÞ ð2Þ level MSI colorectal cancer (16–18, 20, 39–42), high-level p¼1 Downloaded from CIMP colorectal cancer (17, 20, 42, 43), and BRAF-mutated colorectal cancer (17, 19, 20, 42). Given the correlations be- To distinguish this method from the proposed 2-stage method tween these molecular markers, the association of smoking below, we name it “1-stage method.” The parameters of inter- fi with a subtype de ned by 1 marker may solely (or in part) est, γ0 and each γp, which have the same dimension as β1j, reflect the association with a subtype defined by another characterize how the levels of multiple markers are associated http://aje.oxfordjournals.org/ marker. Thus, it remains unclear which molecular marker with differential exposure associations. We can obtain the subtypes are primarily differentially associated with smok- maximum partial likelihood estimate (33, 34)ofγ ={γ0, γp, ing, and how a marker can confound the association between p =1,...,P} using existing statistical software for the Cox smoking and subtypes defined by other markers. Although the model analysis, such as PROC PHREG in SAS (SAS Insti- published methods (33–35) are useful to analyze the exposure- tute, Inc., Cary, North Carolina), through the data duplication subtype associations according to multiple subtyping markers method (47), which is based on the following transformation in cohort studies using existing statistical software, analysis of model 2: using those methods can become computationally infeasible in large data sets. In this article, we present an intuitive and com- at Harvard Library on July 7, 2015 λjðtjXiðtÞ; WiðtÞÞ putationally efficient biostatistical method for the analysis of () disease and etiological heterogeneity when there are multiple XP XJ λ γ X γ X~ β W ; disease subtyping markers (categorical and/or ordinal), which ¼ 0jðtÞexp 0 iðtÞþ p pjiðtÞþ 2l liðtÞ are possibly, but not necessarily, correlated. p¼1 l¼1 ~ where XpjiðtÞ¼spjXiðtÞ; Wli(t)=Wi(t) for l = j, and Wli(t)= METHODS 0 for l ≠ j. In this data duplication method, model 2 can be fit Cohort and nested case-control studies by using stratified Cox regression (stratified by subtype) on an augmented data set, in which each block of person-time is ~ In cohort studies where age at disease onset is available, augmented for each subtype, and the variables Xpji and Wji a commonly used statistical model for evaluating subtype- are created for p =1,...,P, j =1,...,J. Rosner et al. (35) specific exposure-disease associations is the cause-specific also proposed an adjusted RR for the exposure-disease asso- hazards model (44, 45): ciation for a disease subtype defined by 1 or more marker(s) while adjusting for other markers. The data duplication λ X ;W t λ β X β W t ; jðtj iðtÞ ið ÞÞ ¼ 0jðtÞ expf 1j iðtÞþ 2j ið Þg ð1Þ method may become computationally infeasible when the augmented data set becomes too large; this can easily hap- where λj(t) is the incidence rate at age t for subtype j, λ0j(t)is pen when the original data set is sizable and the number of the baseline incidence rate for subtype j, Xi(t)isapossibly subtypes cross-classified from the multiple markers is large. time-varying column vector of exposure variables for the For example, in our colorectal cancer example, there are ith individual, Wi(t) is a possibly time-varying column vector 3,099,586 rows in our original data set. With P = 3 and J =8, of potential confounders, and β1j and β2j are row vector- in the augmented data set there will be about 3,099,586 × 8 = valued log relative risks (RRs) for the corresponding covari- 24,796,688 rows, P × J = 24 new variables created for each ates for subtype j. Model 1 can be estimated in cohort studies exposure variable, and J = 8 variables created for each con- and incidence density–sampled case-control studies (46). As- founding variable. If more markers are being considered, the sume that J subtypes result from cross-classification of mul- large augmented data set can easily make the Cox model anal- tiple categorical and/or ordinal markers. We create binary ysis computationally infeasible. indicators for categorical markers; thus, hereafter, we treat Two-stage method.