Quick viewing(Text Mode)

Population-Based Study Designs in Molecular Epidemiology Montserrat García-Closas, Roel Vermeulen, David Cox, Qing Lan, Neil E

Population-Based Study Designs in Molecular Epidemiology Montserrat García-Closas, Roel Vermeulen, David Cox, Qing Lan, Neil E

unit 4. integration of biomarkers into study designs

chapter 14. Population-based study designs in molecular epidemiology Montserrat García-Closas, Roel Vermeulen, David Cox, Qing Lan, Neil E. Caporaso, and Nathaniel Rothman

Summary Introduction

This chapter will discuss design There is a wealth of existing and by these technologies can still be considerations for epidemiological emerging opportunities to apply a classified into the classic biomarker studies that use biomarkers in the vast array of new biomarker discovery categories defined more than 20 framework of etiologic investigations. technologies, such as - years ago (Figure 14.1), which The main focus will be on describing wide scans of common genetic include biomarkers of exposure, the incorporation of biomarkers variants, mRNA and microRNA intermediate endpoints (e.g. into the main epidemiologic study expression arrays, proteomics, biomarkers of early biologic effect), designs, including cross-sectional and adductomics, and susceptibility (10–17). Unit 4 or short-term longitudinal designs to further our understanding of Biomarkers in epidemiological to characterize biomarkers, and the etiology of a broad range of studies can also be used to evaluate prospective cohort and case– (1–9). These approaches behavioural characteristics that 14 Chapter control studies to evaluate are allowing investigators to explore affect the likelihood of exposure, biomarker-disease associations. biologic responses to exogenous such as tobacco smoking, as well as The advantages and limitations of and endogenous exposures, clinical behaviour and progression each design will be presented, and evaluate potential modification of disease. The use of biomarkers the impact of study design on the of those responses by variants associated with exposure, feasibility of different approaches in essentially the entire genome, disease development, and clinical to exposure assessment and and define disease processes progression within the same overall biospecimen collection and at the chromosomal, DNA, RNA design is of increasing interest and processing will be discussed. and protein levels. At the same has recently been termed ‘integrative time, most biomarkers analysed epidemiology’ (18,19).

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 241 Figure 14.1. A continuum of biomarker categories reflecting the carcinogenic process resulting from xenobiotic exposures

Figure compiled from (10) and (21, copyright © 2008, Informa Healthcare. Reproduced with permission of Informa Healthcare).

Regardless of the appellation The focus of this chapter is with respect to exposure variables used to describe the use of on design considerations for and intermediate endpoints, and, biomarkers in epidemiologic epidemiological studies that use in some instances, disease at a research, be it “molecular biomarkers, primarily in the context specific point in time. Short-term epidemiology,” “integrative of etiologic research, including longitudinal biomarker studies epidemiology,” or the more limited cross-sectional or short-term are studies in which subjects are “genetic epidemiology,” the longitudinal designs to characterize prospectively followed for a short successful application of new and biomarkers, and prospective cohort period of time (usually a few weeks established biomarker technologies and case–control studies to evaluate to up to a year). Investigations still depends on integrating them into biomarker–disease associations. A are usually performed on healthy the appropriate study design with description of the general principles subjects exposed to particular careful attention to the time-tested of study design (29–31) is outside exogenous or endogenous agents principles of the epidemiologic the scope of this chapter. Instead, where the biomarker is treated as method (16,20–24). Basic principles the focus is on describing the the outcome variable. These studies in vetting new biomarkers and incorporation of biomarkers into the generally focus on exposure and technologies in pilot or transitional main epidemiologic study designs, intermediate endpoint biomarkers, studies apply now more than ever pointing out the advantages and and sometimes evaluate genetic and (25–28). Understanding how to limitations of each, and showing other modifiers of the exposure– collect, process and store biologic how study design affects the endpoint relationship. samples (see Chapter 3), and the feasibility of different approaches factors that influence biomarker to both exposure assessment Questions addressed by levels, with particular attention and biospecimen collection and cross-sectional and short-term to within- and between-person processing. longitudinal studies variation for non-fixed biomarkers (see Chapter 9), are key concerns. Study designs in molecular Cross-sectional and/or short-term Testing for and optimizing laboratory epidemiology longitudinal studies are often used accuracy and precision are also as follows: critical to the successful use of Cross-sectional and short- 1) To answer questions about biomarkers in epidemiology studies term longitudinal studies with whether or not a given population (see Chapter 8). Finally, selecting biomarker endpoints has been exposed to a particular the most appropriate, effective, and compound, the level of exposure, logistically feasible study design to In epidemiological terms, a cross- the range of the exposure, and the use a given biomarker technology sectional study refers to a study external and internal determinants that answers a particular in which all of the information of the exposure. For instance, question remains of paramount refers to the same point in time. recent studies on haemoglobin importance. As such, these studies provide a adducts of acrylamide have shown ‘snapshot’ of the population status that exposure to this toxic chemical

242 is widespread in the general recent introduction into commerce, patterns. Sometimes a biomarker of population, due to dietary and the time between first exposure and exposure can be used only in cross- lifestyle habits (32,33). the occurrence of any chronic health sectional studies to determine if a 2) To evaluate intermediate effect is most likely too short. In this population is exposed to an agent of biologic effects from a wide range particular example, the assessment concern, or used as an independent of exposures in the diet and of preclinical indicators of disease marker of exposure in studies environment, as well as from lifestyle (e.g. markers of pulmonary evaluating intermediate biomarker factors (e.g. obesity and reproductive inflammation) in asymptomatic endpoints. status). This design can be used individuals would be of importance The applicability of exposure to provide mechanistic insight into to identify potential adverse health biomarkers in cross-sectional well established exposure–disease effects at an early stage. studies depends on certain intrinsic relationships, and to supplement 4) To study changes in exposure features related to the marker suggestive but inconclusive evidence and/or intermediate endpoints to itself (e.g. half-life, variability, and on the possible adverse health determine the effectiveness of specificity of the marker) and the effects of an exposure. For instance, intervention studies. For instance, exposure pattern (see Chapter 9). studies have used haematological the effect of exercise and weight The first requirements for successful endpoints to investigate the effects loss interventions on serum application of an exposure marker of benzene on the blood forming levels of four biomarkers related are that the assay is reliable and system at low levels of exposure to knee osteoarthritis (cartilage accurate (see Chapter 8), the marker (34,35). These studies have found oligometric matrix protein (COMP), is detectable in human populations, decreased levels in peripheral blood hyaluronan, antigenic keratin and important effect modifiers (e.g. cell counts at exposures below 1 sulfate, and transforming growth nutrition and demographic variables) ppm, indicating that at low levels of factor-β-1 (TGF-β1)), and clinical and kinetics are known (20). Second, exposure, perturbations in the blood outcome measures (e.g. medial joint the timing of collection in forming system can be detected. space, pain) were examined (37). combination with the biological half- These results hinted at the possibility Intervention programmes indeed life of a biomarker of exposure is key, of increased risk of leukaemia at resulted in changes in COMP (which as this determines the exposure time low levels of benzene exposure, was associated with decreased knee window that a marker of exposure given the putative link between pain) and TGF-β1. reflects. The time of collection may benzene poisoning (a severe form of be critical if, as is often the case in haematotoxicity) and increased risk Cross-sectional and short- cross-sectional studies, only one for leukaemia. term longitudinal studies using sample per subject can be obtained 3) To evaluate whether or not exposure markers on a given occasion, and if the there are early biologic perturbations exposure is of brief duration, highly caused by new exposures, or recent Biomarkers of exposure measure variable in time, or has a distinct changes in lifestyle factors that have the level of an external agent, exposure pattern (e.g. diurnal not been present long enough to have its metabolic by-products in variation in certain endogenous been evaluated for their association either the free state or bound to markers, such as hormones) (38). with disease. For example, there macromolecules, or the specific Chronic, near-constant exposures Unit 4 is considerable public health immunologic response it elicits. pose fewer problems. However, concern about the increased use In addition, exposure biomarkers most biomarkers of internal dose of nanoparticles in both research measure endogenously produced generally provide information about 14 Chapter and manufacturing operations (36). compounds, which may be recent exposures (hours to days), Various initial research studies and influenced directly or indirectly by with the exception of markers of evaluations have demonstrated external factors (e.g. hormones), persistent pesticides, dioxins, greater biological activity of as well as by genetic factors. The polychlorobiphenyls, certain metals, nanoparticles compared with larger first epidemiological evaluation of and serological markers related to particles of the same material, potential biomarkers of exposure infectious agents, which can reflect and significant potential toxicity generally occurs in cross-sectional exposures received many years has been observed in laboratory studies in the general population, before. animals exposed to some types of or in subgroups with specific, well nanoparticles. However, given their characterized exposure and lifestyle

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 243 Cross-sectional and short- plasma that may have epigenetic have been extensively used as the term longitudinal studies using effects on disease development classic biomarker of early genotoxic intermediate endpoints (e.g. hormones, growth factors, effects in cross-sectional studies cytokines). of populations exposed to a wide Intermediate biomarkers directly For maximum utility, an variety of potential carcinogens or indirectly represent events on intermediate biomarker must (43–45). Several cohort studies the continuum between exposure be shown to be predictive of have reported that the and disease, and can provide disease occurrence, preferably in of chromosomal aberrations in important mechanistic insight into prospective cohort studies (40) or peripheral lymphocytes can predict the pathogenesis of disease. As potentially in carefully designed subsequent risk of cancer (46–51). such, they complement classic case–control studies. The criteria for The predictive performance of this epidemiological studies that use validating intermediate biomarkers biomarker was shown to be similar disease endpoints. For instance, have focused on the calculation irrespective of whether the subjects the use of intermediate biomarkers of the etiologic fraction of the had been smokers or occupationally in cross-sectional studies can intermediate endpoint, which varies exposed to carcinogenic agents provide initial clues about the from 0 to 1 (40,41). For intermediate (52). In contrast, such associations disease potential of new exposures endpoints with etiologic fractions were not observed for the sister years before a disease develops that are close to 1.0, either positive chromatid exchange assay, another (10,15,39 – 41). or negative results in cross- biomarker of genotoxicity also One group of intermediate sectional studies of an exposure– measured in peripheral lymphocytes biomarkers, biomarkers of early intermediate endpoint relationship (49–51). biologic effect (10), generally are particularly informative. For Interpretation of results from measure early biologic changes intermediate endpoints linked to cross-sectional studies using that reflect early, and generally risk of developing disease but with a intermediate endpoints is, as non-persistent, effects. Examples substantially lower etiologic fraction, indicated before, premised on the of early biologic effect biomarkers the interpretation must be more assumption that the intermediate include: measures of cellular toxicity; circumspect. Specifically, a positive endpoints reflect biological chromosomal alterations; DNA, association between an exposure changes considered relevant to RNA and protein expression; and and an intermediate biomarker is disease development. This may early non-neoplastic alterations in informative, but a null association be based on and animal cell function (e.g. altered DNA repair, does not rule out that the exposure models or on previous observations altered immune function). Generally, is associated with adverse health that the biomarker is altered in early biologic effect markers are outcomes, as the exposure may act human populations exposed to measured in substances such as through a mechanism not reflected known toxicants. However, these blood and blood components (red by the particular endpoint under studies are not capable, in and of blood cells, white blood cells, DNA, study. themselves, of directly establishing RNA, plasma, sera, urine) because One of the most well known or refuting a causal relationship they are easily accessible, and, in examples of a validated intermediate between a given exposure or a some instances, it is reasonable marker is low-density lipoprotein given level of exposure and risk to assume that they can serve as (LDL) cholesterol. Epidemiological for developing diseases. Results surrogates for other organs. Early studies have shown that elevated of studies using most intermediate biological effect markers also can LDL cholesterol is one of the major biomarkers as outcome measures be measured in other accessible causes of coronary heart disease are only suggestive; a biomarker tissues such as skin, cervical and (CHD). Given its high predictiveness, may be overly sensitive (i.e. it may colon biopsies, epithelial cells from risk management/intervention respond to low levels of chemical surface tissue scrapings or sputum programmes are focused on lowering exposures that are below the samples, exfoliated urothelial cells and identifying factors that would disease threshold, if one exists), be in urine, colonic cells in feces, and reduce LDL levels and thus the risk insensitive, reflect phenomena that epithelial cells in breast nipple for CHD (42). Unfortunately, there are irrelevant to the disease process, aspirates. Other early effect markers are very few examples like LDL. For or fail to reflect important processes include measures of circulating instance, chromosomal aberrations involved in the pathogenesis of biologically active compounds in in peripheral blood lymphocytes disease.

244 Variance in biomarker response potential confounders, and effect questions about whether or not a modifiers. Furthermore, they can given population has been exposed The applicability of exposure and take advantage of a wide range to a particular compound; evaluate intermediate endpoint markers of potential analytic (molecular) intermediate biologic effects from in cross-sectional and semi- approaches, particularly those that a wide range of exposures in the longitudinal studies depends to require cell culturing and extensive diet and environment; evaluate a large extent on the variability processing within an often short whether or not there are early in biomarker response between period of time after collection (e.g. biologic perturbations caused by persons and over time (see Chapters RNA, protein stabilization). new exposures or recent changes 8 and 9). If a biomarker response is Cross-sectional and short- in lifestyle factors that have not highly variable over time within a term longitudinal biomarker been present long enough to have person, then it is clear that a single studies can collect very accurate been evaluated for their association measurement of such a marker would information on the dose–response with disease; and to study changes be a poor estimate of the average relationship between external or in exposure and/or intermediate marker level of a certain individual. internal exposures and intermediate endpoints to determine the However, even if the variance over endpoints; these detailed exposure effectiveness of intervention studies. time is small, and thus a reasonable status data should be exploited to However, as indicated previously, estimate of the individual’s average the fullest. As most biologic markers the interpretation and therefore marker response, the applicability of exposure reflect exposures over the usefulness of these studies in epidemiological studies might the previous several days to months, depend heavily on the validity of the be limited if the variance between this information must be collected markers measured. The availability individuals is small as well. In the over the etiologically relevant time of numerous prospective cohort end, the applicability of a marker in period. For example, in a study studies with stored blood specimens epidemiological research depends on haematologic, cytogenetic and should enhance the ability to rapidly on the relative level between the molecular endpoints among workers test the relationship between a interindividual and intraindividual exposed to benzene, measurements wide variety of early biologic effect variability in marker response. A were collected for over a year before markers, using both standard and useful measure in this regard is the determination of the biological emerging technologies (55,56), and intraclass correlation coefficient endpoints to unequivocally disease risk. Such studies could (ICC), which can be defined as the assess individual exposures (53). ultimately produce a novel endpoint interindividual variance divided by Furthermore, given the increasing to evaluate the disease potential and the sum of the interindividual and interest in identifying potential gene- mechanisms of action of various risk intraindividual variance; in other environment interactions in chronic factors. words, it represents the fraction of the diseases, accurate measurement total variance that can be attributed of the environment becomes very Prospective cohort studies to differences between individuals. important. Simulation studies Short-term longitudinal studies are have shown that even a modest In contrast to cross-sectional studies ideal to collect information needed amount of nondifferential exposure where biomarkers are the outcome to estimate this key parameter. misclassification can dramatically variable, in prospective cohort and Unit 4 Chapter 9 provides a more detailed attenuate the estimate of the case–control studies the risk of description of methods used to interaction parameter and increase disease is the outcome of interest. quantify biomarker variability and sample size requirements (54). As Prospective cohort studies collect 14 Chapter its impact on biomarker-disease such, cross-sectional and semi- exposure information and biological associations. longitudinal studies could have a specimens from a group of healthy distinct advantage in elucidating subjects who are then followed-up to Strengths and limitations gene-environment interactions. identify those who develop disease. Establishing a is initially A distinct advantage of cross- Summary and future directions very costly and time-consuming, as sectional and short-term longitudinal large populations must be recruited studies is that detailed and accurate Cross-sectional and semi- and followed-up long enough information can be collected longitudinal study designs have to identify sufficient numbers of on current exposure patterns, been successfully applied to: answer cases with the disease of interest.

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 245 Although power is limited by the Exposure assessment, timing of to latent periods between the overall cohort size and frequency exposure, and misclassification exposures of interest and the of the outcome, in the long run the outcome. Theoretically, cohort cohort design becomes more cost- A major strength of the cohort studies have the advantage of efficient, since it can study multiple design is that the sequence collecting serial biological samples disease endpoints and provide a between exposure assessment and over time to evaluate biomarkers well defined population that can outcome is the same as the causal that vary in time. However, logistical be easily sampled for efficiency pathway: exposures are measured and cost constrains often result in (57–60). This section describes before disease diagnosis. This is large studies collecting a single key features of cohort studies, particularly important for biomarkers biological sample at one point in particularly with regards to the use of that are directly or indirectly affected time. This results in diminishing the biomarkers. Table 14.1 summarizes by the disease process (61), with value of evaluating the relevant time the strengths and limitations of this the caveat that undiagnosed or window of exposure for disease study design, as compared to the preclinical disease may alter levels causation, and studying markers case–control designs described of specific biomarkers measured with substantial seasonal or day-to- later in this chapter. on specimens collected close to day variations, such as short-term the date of diagnosis. Screening for exposure markers. Participation in the study disease at baseline (e.g. requiring Chapter 9 describes important recent colonoscopy for samples considerations in data analysis and Subjects in a cohort study often have used in studies of colorectal cancer), inference related to the timing of distinct characteristics compared to or excluding cases diagnosed in exposure assessment. Below is a their population of origin, by design the first few years after sample summary of these considerations in or because of the motivation and collection (lag analyses), can limit the context of cohort studies. level of commitment required to be the effect of preclinical disease on Misclassification due to random included in such studies. Collection biomarker levels. within-person variation. Most of biospecimens to measure Environmental data collected biomarkers vary from time to biomarkers can have adverse through is less prone time within the same person. effects on participation rates, even to recall biases (i.e. differential recall This variation could be due to when collection procedures require between cases and controls) than in multiple factors, including diurnal minimally invasive procedures case–control studies, thus facilitating (e.g. melatonin) or monthly (e.g. (e.g. buccal swab or oral rinse as the assessment of biomarker- estrogens) cycles, seasonal variation opposed to blood collection). Cohort environment interactions, such as (e.g. vitamin D), recent dietary or studies that collect questionnaires gene-environment interactions (62– supplement intake (e.g. vitamin and biological samples at baseline 66). However, prospective studies C), as well as from the specificity or before disease onset can often have a lower level of detail on of the assay used to measure the avoid selection biases, as long as specific exposures than case–control biomarker. If this variation is random specimens for each participant studies focusing on one or a few and nondifferential between cases remain available and follow-up is related diseases, due to the need to and controls, the bias will tend to complete. However, subjects with collect at least minimal data on the attenuate measures of association. biological specimens collected after multiple exposures relevant to multiple When within-person variability the cohort has been formed might outcomes. Therefore, although cohort for a particular biomarker is random, have different characteristics from studies can minimize the occurrence the correlation between single the rest of the cohort. Increasingly, of differential misclassification, measurements in a population concerns over privacy have nondifferential misclassification of and the average of multiple also affected the willingness of exposure might be larger than in measurements can be used to participants to take part in some alternative study designs. gauge the extent of attenuation in research studies. Cohort studies with extended the association measure, or be used follow-up provide a wide range of explicitly to correct the attenuation time periods between biomarker of estimates due to collection and diagnosis of the nondifferential misclassification (67). outcome. This can be used to This information can be obtained evaluate hypotheses relating from a representative subsample of

246 the cohort in which the biomarkers versus consistently low levels. If exposure and disease diagnosis are measured at two or more within-person variation in biomarker is often not known. The within- distinct time points. It is important measurements is assumed to person variability in the biomarker that the subjects with repeated be random, methods exist to of exposure is also important to measurements represent the larger estimate the number of replicate consider in respect to the optimum cohort, so that the correlation used measurements required to estimate time point for measuring it with to correct risk estimates can be the ‘true’ mean value of a biomarker respect to disease risk. If the within- generalized to the entire cohort. within a specified range of error. person variability is low, then careful However, true random On the other hand, if variation timing of exposure measurement is often difficult to perform in large is not random, due to changes is not necessary. However, if there cohort studies, due to geographic in behaviour or secular trends, is large within-person variability in dispersion and lower-than- multiple measurements can be used the biomarker, then measurements optimal participation rates in more to estimate exposure error. From should be made as close to the time burdensome subsampling studies. a practical point of view, collecting of predicted maximum effect as Time integration. The most multiple biological specimens from possible. common conceptual timeframes for large numbers of subjects in cohort exposure data in the epidemiology studies increases the cost of sample Considerations of chronic disease in cohort collection and storage, as well as in biospecimen collection, studies are long-term average the burden on study subjects, and processing and storage measurements, as the induction thus might not always be feasible or time of most chronic diseases (e.g. recommendable. The collection and storage of large cardiovascular disease, diabetes or Inference from biomarker/ numbers of samples needed for cancer) is thought to be in the order disease associations. The cohort studies using biological of years or decades. Therefore, association between a biomarker specimens is very complex (see biomarkers would optimally and disease may also be influenced Chapter 3 for considerations in represent cumulative exposure by the point in time in which the sample collection, processing, and over relatively long periods of time, biomarker was assayed with respect storage). Given that samples are such as months or years. Some to where it influences disease on often used years after collection, biomarkers may be able to integrate the causal pathway. In the case optimal biospecimen collection, exposure time, which might also of cancer, early events on the processing and storage protocols depend on sampling, processing, causal pathway (‘initiators’) may that will allow the performance and storage protocols. For example, need to be distinguished from later of a wide range of assays in concentrations of many nutrients events (‘promoters’). Therefore, the future are critical (68,69). are less susceptible to short-term it is important that any biomarker Therefore, validation studies aimed fluctuations in erythrocytes than in related to exposures that are either at optimizing sample handling and plasma or serum. Concentrations initiators or promoters be measured storage protocols, according to the in adipose tissue, which is often during a time period when the impact of these procedures on the more difficult to acquire, reflect even exposure is most likely to exert its stability of samples and biomarker more long-term exposure history. influence on disease. For example, measurements, are strongly Unit 4 It is therefore important to balance initiating exposures should most recommended (69–72). Validation feasibility of sample collection with likely be measured many years, studies can assess considerations implications for time-integration of possibly decades, before cancer such as the influence of time of 14 Chapter the biomarkers of interest. diagnosis. In contrast, exposures collection to arrival at the processing Multiple biomarker levels. considered to be promoters should laboratory (e.g. blood collection Obtaining multiple samples over time be measured more closely to the tubes with different preservatives, can increase the time-integration time of diagnosis. If a biomarker anti-coagulants or clot accelerators, of exposures and biomarker of exposure is not measured at temperature during shipping, measurements. Multiple biomarker the etiologically relevant time, the impact of time between collection to measurements can be used in association between the exposure processing), processing protocols several ways, including averaging and disease could be attenuated (e.g. isolation of serum for proteomic measurements or comparing or not observed at all. The problem analyses (73)), and long-term subjects with consistently high is that the true latency between storage (e.g. freezing temperature,

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 247 impact of thaw/freeze cycles) on Statistical power performed on fresh samples), some biomarker measurements. selection of cases and controls Of paramount consideration is The major weakness of cohort will be necessary. This can be limiting the loss of information due studies is that even for common attained by using sampling designs to exhaustion of archived samples. diseases the number of cases in which only samples from cases The problem of sample exhaustion is is limited by the cohort size and and a random subset of non-cases most evident in prospective cohorts follow-up time. Even a very large are analysed, thus considerably examining incident disease, as the cohort may not acquire enough reducing laboratory requirements amount of sample collected before cases of rare diseases to achieve and cost (60). diagnosis is by definition finite. adequate statistical power after long Moreover, due to the advantages follow-up periods. Considering that Nested case–control of the prospective cohort design, cohorts using biological samples interest in the utilization of biological tend to be small or subsets of larger The nested case–control study specimens could be great among -based cohorts, this is is an efficient sampling scheme the scientific community. Therefore, a particular problem for studies using that includes all cases identified in it is important for investigators biomarkers. Recently, a movement the cohort up to a particular point to try to minimize the volume of to form consortia of cohorts, such in time, and a random sample of sample used for measuring any as the Cohort Consortium to study subjects free of disease at the time one biomarker, either by reducing causes of cancer (http://epi.grants. of the case diagnosis. Increasing the volume of sample used or by cancer.gov/Consortia/cohort.html), the case-to-control ratio to two or maximizing the number of assays has begun to address the problem three controls per case can easily that can be made at any one time of statistical power by coordinating increase the efficiency of nested (multiplexing). Also beneficial is biomarker measurements and case–control studies. Optimally, the formation of an advisory board analyses. Many consortia have controls should be selected for each to aid investigators in evaluating been formed to support genome- case from the pool of participants both internal use and external wide association studies of many that have not developed the disease requests for access to precious diseases (updated information on at the time the case was diagnosed prospective samples. While these new publications from these efforts (risk set sampling). Alternatively, considerations are obvious for those can be found at http://www.genome. controls may be selected from all participants who are diagnosed with gov/gwastudies/). of the participants at baseline who disease, biological samples from were not diagnosed with disease healthy or control subjects should Sampling designs throughout follow-up. Simulation also be carefully preserved, as studies have shown that as long as these subjects may become cases When a cohort is chosen at random the proportion of the baseline cohort in the future. from the general population, the that acquires disease is low (e.g. Despite advances in technology, exposures in the cohort will be less than 5%), the bias introduced such as whole-genome amplification representative of the exposures by violating risk set sampling is to increase the amount of available in the general population. If the minimal. DNA for assays, many biomarkers hypotheses to be tested rely on still require large amounts of participants having either rare or Case–cohort biological samples. This limits the extreme exposures in the general number of measurements that population, then oversampling these A case–cohort design includes can be carried out on the limited people or restricting the cohort a random sample of the cohort resource of biological samples from to certain exposed groups would population at the onset of the study cohort studies. Collecting additional increase efficiency by increasing and all cases identified in the cohort, specimens is often difficult in cohort the prevalence of these exposures up to a particular point in time (74). studies, as members move, are lost in the cohort. This design allows for the evaluation to follow-up, or do not wish to go For many large cohort studies, of several disease endpoints through the further inconvenience of it is not feasible to assay all using the same comparison group providing an additional sample. participants for a given biomarker. (referred to as a subcohort). It may Therefore, with a few exceptions reduce the amount of laboratory (such as assays that can only be work by assaying a subcohort of

248 subjects at baseline, and then measurement should also be and security issues (e.g. backup adding case information as cases considered. This can and should generators, monitoring, distributing accrue. While there are statistical be minimized by assaying matched samples among different freezers). considerations that must be taken cases and controls at the same In the next few years, however, the into account when analysing time, regardless of the study design. cost-per-case for studies fielded case–cohort studies, these are from a cohort will offer economies in now included in most statistical Screening cohorts comparison to fielding a new case– packages. Of greater concern in control study (57). case–cohort studies are problems Prospective cohort studies are more unique to biomarker studies. sometimes designed within Summary and future directions For example, if the biomarker being screening cohorts. In this design, assayed degrades over time or if screening failures lead to missing Prospective cohort studies provide there is substantial laboratory drift in prevalent cases among cohort invaluable resources to study measurement, then cases assayed participants that are misclassified biomarkers of risk, particularly those at varying time periods after baseline as controls (75). Although repeated that can be affected by disease (when controls were assayed) can screening reduces misclassification processes. Multiple prospective lead to bias. Additionally, laboratory of subjects, cases discovered in cohort studies are currently being personal are less easily blinded to follow-up cannot be distinguished followed-up for disease case or control status, which can from prevalent cases missed by with basic risk factor information also lead to bias. These factors limit the initial screening or incident from questionnaires and stored the utility of the case–cohort design disease. However, the degree of blood components, including white in biomarker studies. Another misclassification of prevalent and blood cells that can be used as a limitation of this design is that since incident cases can be assessed source of DNA. At the completion of the same disease-free subjects by analyses of time to diagnosis ongoing collections, current studies are repeatedly used as controls or pathological characteristics. will have stored DNA samples for different disease endpoints, Intensive screening may also on over two million individuals depletion of samples from this group uncover a reservoir of latent disease (16). These studies will provide can become an issue. that would not otherwise become very large numbers of cases of clinically relevant, and that might the more common cancer sites Sample comparability differ from disease detected through (breast, lung, prostate, and colon) clinical symptoms (76,77). to evaluate genetic markers of The methods by which biological susceptibility; biomarkers in serum samples are collected, handled, Resources and infrastructure or plasma, such as hormone levels; and stored can influence the chemical carcinogen levels; and measurement of many biomarkers The vastly greater size of cohorts proteomic patterns. Most cohort of exposure (see Chapter 3). compared to other designs, such studies do not have cryopreserved Therefore, to have valid biomarker as case–control studies, and the blood samples, as the procedure studies, case and control samples time period required for the cohort is very expensive and logistically must be handled in the same way. to mature, mean that a substantially challenging in large studies. Also, Unit 4 For prospective studies, it is also greater initial investment is required cohort studies often have a limited important to consider the length to establish the cohort. For cohort capability to collect detailed disease and type of storage, as some studies that incorporate biological information or biological specimens 14 Chapter biomarkers may degrade over time materials, the infrastructure to to facilitate disease classification, even under ideal conditions. Thus, support biospecimens’ databases, as well as to follow-up cases for it is important to match cases and freezers, and processing require disease progression and survival controls on the method of sample a correspondingly greater effort studies. New cohort studies based collection, duration of storage, and cost. While all studies with on large institutions, such as as well as other factors that may biospecimens must consider the health maintenance organizations be related to the biomarker of risk of untoward events (e.g. freezer (HMOs), could enable access to interest, such as fasting status or failure), the anticipated long useful life clinical records with more detailed season of collection. Additionally, of the samples from cohorts requires disease information, archived batch-to-batch variation in assay special emphasis on quality control biological specimens, and easier

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 249 follow-up of cases for treatment population during a specified period the hospital for other diseases that response and survival. Caucasian of time, and controls are a random might themselves introduce bias populations in wealthier countries sample of the source population if they are related to the genetic or are overrepresented in studies where the cases came from. On the environmental exposures under of most diseases, and the recent other hand, cases and controls in study, particularly when evaluating establishment of consortia in other hospital-based studies are identified gene–environment interactions or populations, such as the Asia among subjects admitted to or seen joint effects (79). Further potential Cohort Consortium (http://www. in clinics associated with specific for occurs if cases or asiacohort.org/), will be critical to hospitals. As in the population- controls are less likely to participate study disease across geographically based design, the distribution of because of problems in the and ethnically diverse populations exposures in the control group collection of biospecimens. Since that might have different exposures should represent that from the the source population for cohorts to environmental risk factors and source population of the cases. is explicit, selection bias is less of a frequencies of susceptibility alleles. However, the source population problem as long as follow-up rates is often more difficult to define in are high (61). Low participation Case–control studies hospital-based studies. rates in case–control studies, and Molecular epidemiology studies particularly refusals related to Case-control studies are often use the hospital-based case– providing biological specimens, conceptualized as a retrospective control design, as the hospital can bias results, especially when sampling of cases and controls from setting facilitates the enrolment of cases are less likely to participate an underlying prospective cohort, subjects, thus enhancing response than controls and selection is referred to as the source population rates, as well as the collection and related to the biomarker of interest. (29,31). The case–control design processing of biological specimens. Low participation rates additionally has been a mainstay of molecular Enrolment of subjects is also made threaten the population-based epidemiology studies due to its well easier by having in-person contact nature of the study, undermining known traditional strengths including with study participants by doctors, its use for estimating absolute and depth and focus of questionnaire nurses or interviewers, which (29). Use of non- information, biologically intensive usually results in higher participation intensive biospecimen collection specimen collection, potential to rates (78). Because study subjects protocols can increase participation enrol large numbers of cases rapidly, are generally less geographically rates—for instance, the collection and ability to target rare diseases distributed than those in population- of buccal cells as a source of that occur in small numbers in based or cohort studies, rapid DNA instead of the more invasive prospective cohort studies. shipment of specimens to central phlebotomy (80). laboratories for more elaborate Types of case–control designs processing protocols, such as Single disease cyropreservation of lymphocytes, Case–control studies can be is made possible. Rapid Case–control studies are generally hospital- or population-based ascertainment of cases through limited to one disease outcome depending on how the cases and the hospitals also facilitates the (or a few related diseases), but controls are identified (Table 14.1). collection of specimens from cases are unconstrained by the rarity of A major concern of case–control before treatment, thus avoiding the the disease, while cohort studies studies is proper case and control potential influence of treatment on (including full cohort, nested case– selection. Proper controls are some biomarker measurements. control, or case–cohort studies) may representative of the study base Potential for selection bias is one identify multiple disease endpoints. from which the cases arise (29). of the most important limitations of The focus of a case–control study Identifying either a random sample case–control designs. The impact on one disease entity permits from the general population or of selection bias in hospital-based more detailed documentation of the source population for cases studies is not only related to the disease information and detailed presenting at a particular hospital(s) reasons for non-participation, but diagnostic procedures not routinely may be difficult. Population-based also to diseases in the control collected in clinical practice, such studies attempt to identify all population. An example of this is as specialized imaging and access cases occurring in a pre-defined selection of controls admitted to to pathologic tissue and other

250 Table 14.1. Advantages and limitations of prospective cohort and case-control designs in molecular epidemiology relevant to the collection of biological specimens and data interpretation Unit 4 biological specimens for application Costs for a series of case– Exposure assessment of novel biomarkers of disease. control studies of different diseases and misclassification Obtaining disease-related data in can sometimes be reduced by 14 Chapter cohorts entails mounting an effort sharing a single control group. When Exposure assessment through that is generally less efficient and different diseases require different questionnaires in case–control more costly. The advantage of exposures, the partial questionnaire studies of a single disease or cohorts’ ability to examine multiple design may offer reduction in the multiple diseases sharing risk outcomes may be somewhat limited burden to respondents, thereby factors (e.g. breast, ovarian, by resources and logistics, limited potentially increasing participation and endometrial cancer) can be exposure information, the diverse (81). Even if these options are more detailed and focused than approaches to documenting disease not feasible, using the same prospective cohort studies that often incidence or mortality, and the rarity infrastructure for control selection for study multiple unrelated diseases. of some outcomes. repeated studies can reduce costs. However, studies that rely on

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 251 retrospective exposure assessment near-constant exposures pose fewer not related to factors influencing may be affected by disease or problems. Ideally, the biomarker the likelihood of participation in a its treatment. Also, questionnaire should persist over time and not study, and therefore selection bias responses subject to rumination by be affected by disease status in in case–control studies is less of a respondents are susceptible to bias case–control studies. However, concern for studying the main effect from differential misclassification. most biomarkers of internal dose of genetic risk factors. Indeed, the Biomarkers (except germ-line generally provide information about robustness of genetic associations genetic variation) and responses recent exposures (hours to days), with disease for different study to questionnaires may change with the exception of markers such designs has been demonstrated in as a consequence of the early as persistent pesticides, dioxins, findings from consortia of studies disease process or diagnosis itself. polychlorinated biphenyls, certain that have shown remarkably Differential errors or recall bias metals, and serological markers consistent estimates of relative risk from questionnaire information related to infectious agents, which across studies of different design collected in case–control studies can reflect exposures received (82,83). Because genetic markers are possible, and their extent should many years before. If the pattern might influence disease progression, be evaluated in the context of of exposure being measured is incomplete ascertainment of cases in specific exposures and populations relatively continuous, short-term case–control studies can introduce under study. Similarly, levels of markers may be applicable in case– survival bias, particularly for cancers certain biomarkers measured after control studies of patients with early associated with high morbidity and diagnosis can be influenced by the disease, so that disease bias would mortality rates, such as pancreatic disease process or treatment, and be less likely. However, short-term and ovarian cancers. This is a must be considered and evaluated markers have generally limited use particular concern for population- to the extent possible for each in case–control studies, as they are based studies, unless a very rapid biomarker of interest. Differences less likely to reflect usual patterns, ascertainment system is implemented in biomarker levels among cases and the disease or its treatment that enrols cases as close as possible diagnosed at different stages of the might influence its absorption, to the time of diagnosis. disease can help evaluate whether metabolism, storage, and excretion. Susceptibility biomarkers can differences in biomarker levels also be measured at the functional/ between cases and controls reflect Biomarkers of susceptibility phenotypic level (e.g. metabolic an influence of the disease on the in case–control studies phenotypes, DNA repair capacity) biomarker rather than the contrary. (16). While genotypic measures The applicability of exposure The approaches to studying genetic are considerably easier to study biomarkers in case–control studies susceptibility factors for disease than phenotypic measures, since depends on certain intrinsic features have evolved very quickly over the they are stable over time and related to the marker itself (e.g. half- last several years, owing to advances much less prone to analytical life, variability, specificity) and the in genotyping technologies, measurement error, phenotypic exposure time window that a marker substantial reductions in genotyping measures are likely to be closer of exposure reflects in relation to costs, and improvements in the to the disease process and can the biologically relevant time of annotation of common genetic integrate the influences of multiple exposure and timing of sample variation, namely, the most common genetic and post-transcriptional collection. Methods to evaluate type of variant, the single nucleotide influences on protein expression these key biomarker features before polymorphism (SNP). The principles and function (84). Therefore, in spite their use in case–control and and quality control approaches of the advantages in measuring other epidemiological studies are for the use of genetic makers genotypic changes, when described in the previous section and in epidemiological studies are complex combinations of genetic in Chapter 8. The time of collection described in Chapter 6. Because variants and/or important post- may be critical if the exposure is inherited genetic markers measured transcriptional events determine a of brief duration, is highly variable at the DNA level are stable over substantial portion of interindividual in time, or has a distinct exposure time, the timing of measurement variation in a particular biologic pattern (e.g. diurnal variation for before disease diagnosis is process, phenotypic assays may be certain endogenous markers, such irrelevant. In addition, it is highly the only means to capture important as hormones). However, chronic, likely that most genetic markers are variation in the population.

252 For example, several studies Virus transformation to ensure reflect measurements from samples have assessed the role of DNA repair large quantities of DNA), since collected outside the hospital, as capacity (DRC) regarding cancer resources for collection, processing habits and exposures change during risk by using in vitro phenotypic and storage are often available in hospitalization (e.g. dietary habits, assays mostly on circulating diagnostic hospitals. This offers the medication used and physical lymphocytes (e.g. mutagen potential for conducting functional activity). Therefore, even if cases sensitivity, host cell reactivation assays that require live cells, such and controls are selected through a assay). These studies have shown as mutagen sensitivity (85), which hospital-based design, collection of differences in DRC between cases in general are not methodologically specimens after the patients return and controls; however, interpretation feasible in cohort studies. Pre- home and are no longer taking of these results must account for treatment specimens, critical for medications for the conditions that study design limitations, such as evaluation of biologic markers that brought them to the hospital should use of lymphocytes to infer DRC in could be affected by treatment, such be considered, if feasible. On the target tissues, the possible impact as chemotherapy, can be obtained other hand, specimens to measure of disease status on assay results, through rapid identification systems biomarkers that are influenced and confounding by unmeasured that recruit cases right at the time of by long-term effects of treatment risk factors that influence the diagnosis. should be collected before treatment assay (85–87). The application of Biomarker measurements can is started at the hospital, within functional assays in multiple, large- be very sensitive to differences in logistic limitations. scale epidemiological studies will handling of samples (e.g. fasting Case–control studies require development of less costly status at blood collection or time might also allow more detailed and labour-intensive assays. In the between collection and processing characterization of disease through future, assays that assess non- of specimens). Therefore, it is the use of biomarkers, such as clonal mutations in DNA, through important that samples from cases the presence of eosinophils in the analysis of DNA isolated and controls be collected during the sputum to identify eosinophilic and from circulating white blood cells, same timeframe and use identical non-eosinophilic asthma, typing may capture some of the same protocols to avoid differential biases. of viruses in infectious diseases, information as the above functional Ideally, the nursing and laboratory or molecular characterization of assays and have wider application staff should be blinded with respect tumours in cancer. This more because of greater logistic ease. to the case–control status of the detailed classification of disease subjects. However, because the permits the analysis of genetic Considerations differences in handling samples and environmental risk factors and in biospecimen collection, between cases and controls are not clinical outcomes by biologically processing and storage always avoidable, it is important to important disease subtypes. These record key information such as date analyses can lead to improvements A case–control study in a relatively and time of collection, processing in risk assessment by identifying small geographic region, or a defined and storage problems, time since diseases with distinct risk profiles. set of hospitals, can permit efficient last meal, current medication, and In addition, identifying subclasses collection of medical records or current tobacco and alcohol use to of disease of different etiology can Unit 4 specimens (e.g. blood, urine, be able to account for the influence aid in understanding the pathogenic surgical tissue and other pathologic of these variables at the data pathways to disease, as well as material) along with supporting analysis stage. This information developing targeted prevention 14 Chapter documentation. Hospital-based can also be used to match cases programmes (e.g. use of hormonal case–control studies or population- and controls selected for specific chemoprevention for women at based studies served by a small biomarker measurements in a high risk of estrogen-receptor number of hospitals can have direct subset of the study population. This positive breast tumours). Review contact with patients in a hospital will ensure efficient adjustment for of medical records can be used setting, thus offering advantages for these extraneous factors during to obtain information on disease the collection of different types of data analysis. characteristics determined for specimens or elaborate processing Biomarkers measured in clinical practice, such as histological protocols (e.g. cryopreserving samples collected from subjects tumour type and tumour grade in lymphocytes and Epstein-Barr during a hospital stay might not cancer patients. However, more

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 253 detailed characterization of disease database resources, such as death to attain very large sample sizes. might require large collections of registries in populations where 2) Inclusion of diverse population biological specimens to determine cases are diagnosed, can be helpful groups. Case–control studies can disease biomarkers, which is in determining survival from cases focus on enrolling a narrow range of facilitated in hospital-based studies. lost to follow-up. ethnic, racial, age or socioeconomic levels that are particularly interesting Follow-up of cases to determine The case–control method in or important but not adequately clinical outcomes relation to other epidemiological represented in existing cohort designs studies. The prospective collection of clinical 3) Specialized specimen information from cases enrolled Existing cohort studies and their collection and processing protocols. in case–control studies (e.g. consortial groups that have or are Case–control studies can use treatment, recurrence of disease, collecting blood samples will accrue labour- and technology-intensive and survival) greatly increases large numbers of cases with common biological collection, processing and the value of these studies, since diseases over the coming years. storage protocols that would not be critical questions on the relationship Appropriately, questions are being logistically feasible or cost-efficient between biomarkers and disease raised about the utility of carrying in a large . progression can be addressed in out new case–control studies, either 4) Depth of exposure data. Case– well characterized populations (see population- or hospital-based, to control studies can collect more Chapter 4). Designing a survival study the main effects of common detailed and broader information study within a case–control study is polymorphisms and their interaction about exposure from both interviews easier to do at the beginning of the with environmental exposures. and records than is feasible in a case–control study rather than later Designers of a new case–control cohort study. This is particularly after subject enrolment is completed. study will need to show that it offers important when there is concern Given the value that such studies benefits that cannot be obtained about a specific type of exposure have for carrying out translational from existing cohorts. Below are that is not generally assessed at all research in a very efficient manner, some considerations when planning or in adequate detail in the typical consideration should be given to to carry out a new case–control cohort questionnaire (which usually implementing this type of study study in contrast with performing focuses on diet and general lifestyle whenever possible. The collection nested studies within existing factors). Examples could include of clinical information is facilitated in cohorts: occupational and environmental hospital-based studies when cases 1) Disease incidence. A key exposures requiring complete are diagnosed in a relatively small advantage of case–control studies occupational and residential number of hospitals, and in stable is the ability to enrol large numbers histories, respectively. Cohorts have populations where patients are likely of cases with less common diseases an inherent limitation in that their to be followed-up in the diagnostic in a relatively short period of time. aim is to study multiple endpoints, hospitals or associated clinics. Given the need for large sample and thus they collect less extensive Information on clinical outcomes sizes (up to several thousand data on exposures relevant to any can be obtained through active cases and controls) to investigate one particular disease, although the follow-up of the cases, in which weak to moderate associations, opportunity to return to participants patients are contacted individually such as main effect for common at later time points may partially through the course of their susceptibility loci, as well as gene– ameliorate this point. Case–control treatment and medical follow-up, environment interaction (88), and studies can more readily focus or through passive follow-up by to explore data subsets, it is only on new exposures of concern extracting information from medical feasible to collect enough cases of for particular diseases, tailoring records. Passive follow-up is less the more common diseases in most methods to optimally capture target costly; however, it is often limited cohort studies. However, pooling data. In contrast, cohort studies will by difficulties in obtaining detailed efforts across cohorts or case– have instruments in place that will information on treatment from control studies, such as consortia inevitably lack precision or entirely medical records, or by loss to follow- of studies of specific tumour sites miss new exposures. up in populations where patients (http://epi.grants.cancer.gov/ change cities or hospitals. Use of Consortia/tablelist.html), are critical

254 Summary and future directions information on the disease to allow a disease types. It should be noted more accurate definition of disease that the relative risk from a case- Case–control studies play a critical and refined classification of complex only design would underestimate role in molecular epidemiologic diseases, such as cancer, diabetes the relative risk derived in a case– research, particularly for biomarkers or hypertension, into entities control design when the exposure of that are unlikely to have disease more biologically or etiologically interest is associated with more than bias, such as DNA-based markers homogeneous among groups. By one disease type. of genetic susceptibility. They can having direct access to patients, Another potential limitation rapidly enrol large numbers of biological specimens, and clinical of the case-series design is the cases, even with rare conditions, records, studies may be generalizability of findings, since in multicentre studies and by able to define diseases or preclinical this design can include highly combining across studies in conditions based on molecular selected cohorts of patients consortia. In addition, case– events driving biological processes to address specific treatment control studies can apply detailed rather than clinical symptoms. protocols, such as in clinical trials. diagnostic procedures, including For instance, cancers can be In etiological studies, it is always specialized imaging approaches classified according to pathological reassuring to observe associations not routinely used in the usual and molecular characteristics of between established factors and healthcare setting, and state-of- tumours, infectious diseases such as disease risk in a particular study the-art molecular analyses when hepatitis can be classified according population; however this cannot tissue samples are collected. Given to the causal virus, and asthma can be observed in case series. that a substantial number of rapidly be more precisely defined according Identification of cases through developing new “omic” technologies to pathophysiologic mechanisms well characterized population- can be readily applied to the case– (91). based registries, or evaluation of control setting, this design should The case-only design, however, established associations between continue to be a core component has limitations when evaluating disease characteristics and clinical of research programmes on the etiological questions, most notably outcomes or risk factors, could etiology of chronic diseases. related to the inability to directly address some of these limitations. estimate risk for disease. Although Case-only and other study the case-only design can be used to Other designs designs estimate multiplicative interactions between risk factors under certain Alternative study designs have Case-only studies assumptions, it is susceptible to been proposed to address some misinterpretation of the interaction of the limitations of the classical Studies including subjects with the parameter (92), is highly dependent epidemiological designs. For disease of interest without a control on the assumption of independence instance, the two-phase sampling population (for instance, case series between the exposure and the design can be used to improve or clinical trials), are often used to genotype under study (93), and it efficiency and reduce the cost of evaluate questions related to disease cannot be used to estimate additive measuring biomarkers in large treatment and progression, including interactions. The degree of etiologic epidemiological studies (95). The Unit 4 secondary effects of treatment. heterogeneity in case-series studies first phase of this design could be These designs are also well suited can be quantified by the ratio of a case–control or cohort study to evaluate the influence of genetic the relative risk for the effect of with basic exposure information 14 Chapter and environmental risk factors on exposure on one disease subtype to and no biomarker measurements. disease for disease progression and the relative risk for another subtype. In a second phase, more elaborate response to treatment, and can be This parameter is equivalent to exposure information and/or very valuable to evaluate etiological the relative risk for the association determination of biomarkers (with questions, such as gene–gene and between exposure and disease collection of biological specimens gene–environment interactions subtype (94). However, case-only if these were not collected in the (89,90), and etiologic heterogeneity studies are limited to the estimation first phase) is carried out in an for different disease subtypes. of the ratio of relative risk, and informative sample of individuals An advantage of these designs cannot be used to obtain estimates defined by disease and exposure is their ability to obtain extensive of the relative risk for different (e.g. subjects with extreme or

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 255 uncommon exposures). Multiple Further, important advances are to the fundamental epidemiologic statistical methods, such as simple being made in the development principles of careful study conditional likelihood (96) or of new approaches in exposure design, vigilant quality control, estimated-score (97), have been assessment (http://www.gei.nih. thoughtful data analysis, cautious developed to analyse data from gov/exposurebiology). At the same interpretation of results, and well two-sampling designs. Another time, large and high-quality case– powered replication of important example is the use of the kin- control studies of many diseases findings. cohort design as a more efficient have been established with detailed alternative to case–control or cohort exposure data and stored biological Acknowledgements studies, when the goal is to estimate specimens, previously established age-specific penetrance for rare cohorts are being followed-up, and This chapter has been adapted inherited mutations in the general new cohort studies with biological and updated from a book chapter population (98,99). In this design, samples are being established in on Application of biomarkers in relatives of selected individuals with developing as well as developed cancer epidemiology (16) and a genetic testing form a retrospective countries. The confluence of chapter on Design considerations cohort that is followed from birth to extraordinary technology and the in molecular epidemiology (21). onset of disease or censoring. availability of large epidemiologic We thank Elizabeth Azzato, David studies should ultimately lead to Hunter, Maria Teresa Landi, and Concluding remarks new insights into the etiology of Sholom Wacholder for their valuable many important diseases and help comments to the chapter. The field of molecular epidemiology to facilitate effective prevention, has undergone a transformational screening and treatment. However, change with the incorporation of this will only be achieved if powerful genomic technology. molecular epidemiologists adhere

256 References

1. Aardema MJ, MacGregor JT (2002). 14. Perera FP (2000). Molecular epidemiology: 25. Schulte PA, Perera FP. Transitional Toxicology and genetic toxicology in the on the path to prevention? J Natl Cancer studies. In: Toniolo P, Boffetta P, Shuker DEG new era of “toxicogenomics”: impact of Inst, 92:602–612.doi:10.1093/jnci/92.8.602 et al., editors. Application of biomarkers in “-omics” technologies. Mutat Res, 499:13–25. PMID:10772677 cancer epidemiology. Lyon: IARC Scientific PMID:11804602 Publications; 1997. p. 19–29. 15. Toniolo P, Boffetta P, Shuker DEG et 2. Wang W, Zhou H, Lin H et al. (2003). al., editors. Application of biomarkers in 26. Rothman N (1995). Genetic susceptibility Quantification of proteins and metabolites by cancer epidemiology. Lyon: IARC Scientific biomarkers in studies of occupational and mass spectrometry without isotopic labeling Publication; 1997. environmental cancer: methodologic issues. or spiked standards. Anal Chem, 75:4818– Toxicol Lett, 77:221–225.doi:10.1016/0378- 4826.doi:10.1021/ac026468x PMID:14674459 16. García-Closas M, Vermeulen R, Sherman 4274(95)03298-3 PMID:7618141 ME et al. Application of biomarkers in cancer 3. Hanash S (2003). Disease proteomics. epidemiology. In: Schottenfeld D, Fraumeni 27. Hulka BS, Margolin BH (1992). Nature, 422:226–232.doi:10.1038/ JF Jr, editors. Cancer epidemiology and Methodological issues in epidemiologic nature01514 PMID:12634796 prevention. 3rd ed. New York (NY): Oxford studies using biologic markers. Am J University Press; 2006. p. 70–88. Epidemiol, 135:200–209. PMID:1536135 4. Baak JP, Path FR, Hermsen MA et al. (2003). and proteomics in cancer. 17. Perera FP, Weinstein IB (1982). 28. Hulka BS (1991). ASPO Distinguished Eur J Cancer, 39:1199–1215.doi:10.1016/ Molecular epidemiology and carcinogen- Achievement Award Lecture. Epidemiological S0959-8049(03)00265-X PMID:12763207 DNA adduct detection: new approaches studies using biological markers: issues to studies of human cancer causation. J for epidemiologists. Cancer Epidemiol 5. Sellers TA, Yates JR (2003). Review of Chronic Dis, 35:581–600.doi:10.1016/0021- Biomarkers Prev, 1:13–19. PMID:1845163 proteomics with applications to genetic 9681(82)90078-9 PMID:6282919 epidemiology. Genet Epidemiol, 24:83–98. 29. Wacholder S, McLaughlin JK, Silverman doi:10.1002/gepi.10226 PMID:12548670 18. Spitz MR, Wu X, Mills G (2005). Integrative DT, Mandel JS (1992). Selection of controls epidemiology: from risk assessment in case-control studies. I. Principles. Am J 6. Staudt LM (2003). Molecular diagnosis to outcome prediction. J Clin Oncol, Epidemiol, 135:1019–1028. PMID:1595688 of the hematologic cancers. N Engl J Med, 23:267–275.doi:10.1200/JCO.2005.05.122 348:1777–1785.doi:10.1056/NEJMra020067 PMID:15637390 30. Breslow NE, Day NE. Design PMID:12724484 considerations. In: Breslow NE, Day NE, 19. Caporaso NE (2007). Integrative study editors. Statistical methods in cancer 7. Strausberg RL, Simpson AJ, Wooster R designs–next step in the evolution of molecular research. Vol 2. The design and analysis (2003). Sequence-based cancer genomics: epidemiology? Cancer Epidemiol Biomarkers of cohort studies. Lyon: IARC Scientific progress, lessons and opportunities. Nat Prev, 16:365–366.doi:10.1158/1055-9965. Publication; 1987. p. 272–315. Rev Genet, 4:409–418.doi:10.1038/nrg1085 EPI-07-0142 PMID:17372231 PMID:12776211 31. Rothman KJ, Greenland S, editors. 20. Rothman N, Stewart WF, Schulte PA Modern epidemiology. Philadelphia (PA): 8. Smith MT, Rappaport SM (2009). Building (1995). Incorporating biomarkers into cancer Lippincott-Raven; 1998. exposure biology centers to put the E into epidemiology: a matrix of biomarker and “G x E” interaction studies. Environ Health study design categories. Cancer Epidemiol 32. Paulsson B, Larsen KO, Törnqvist Perspect, 117:A334–A335. PMID:19672377 Biomarkers Prev, 4:301–311. PMID:7655323 M (2006). Hemoglobin adducts in the assessment of potential occupational 9. Schembri F, Sridhar S, Perdomo C et 21. García-Closas M, Lan Q, Rothman exposure to acrylamides – three case studies. al. (2009). MicroRNAs as modulators of N. Design considerations in molecular Scand J Work Environ Health, 32:154–159. smoking-induced gene expression changes epidemiology. In: Rebbeck T, Ambrosone C, PMID:16680386 in human airway epithelium. Proc Natl Acad Shields P, editors. Molecular epidemiology: Sci USA, 106:2319–2324.doi:10.1073/ applications in cancer and other human 33. Wirfält E, Paulsson B, Törnqvist M et pnas.0806383106 PMID:19168627 diseases. New York (NY): Informa Healthcare; al. (2008). Associations between estimated acrylamide intakes, and hemoglobin AA 10. Committee on Biological Markers of 2008. p. 1–18. adducts in a sample from the Malmö Diet and Unit 4 the National Research Council (1987). 22. Ransohoff DF (2009). Promises and Cancer cohort. Eur J Clin Nutr, 62:314–323. Biological markers in environmental health limitations of biomarkers. Recent Results doi:10.1038/sj.ejcn.1602704 PMID:17356560 research. Environ Health Perspect, 74:3–9. Cancer Res, 181:55–59.doi:10.1007/978-3- Chapter 14 Chapter PMID:3691432 540-69297-3_6 PMID:19213557 34. Qu Q, Shore R, Li G et al. (2002). Hematological changes among Chinese 11. Rothman N, Wacholder S, Caporaso NE 23. Pepe MS, Feng Z, Janes H et al. (2008). workers with a broad range of benzene et al. (2001). The use of common genetic Pivotal evaluation of the accuracy of a exposures. Am J Ind Med, 42:275–285. polymorphisms to enhance the epidemiologic biomarker used for classification or prediction: doi:10.1002/ajim.10121 PMID:12271475 study of environmental carcinogens. Biochim standards for study design. J Natl Cancer Biophys Acta, 1471:C1–C10. PMID:11342183 Inst, 100:1432–1438.doi:10.1093/jnci/djn326 35. Lan Q, Zhang L, Li G et al. (2004). PMID:18840817 Hematotoxicity in workers exposed to low 12. Schulte PA (1987). Methodologic issues in levels of benzene. Science, 306:1774–1776. the use of biologic markers in epidemiologic 24. Ransohoff DF (2007). How to improve doi:10.1126/science.1102443 PMID:15576619 research. Am J Epidemiol, 126:1006–1016. reliability and efficiency of research about PMID:3318408 molecular markers: roles of phases, guidelines, 36. Schulte PA, Geraci C, Zumwalde R et al. and study design. J Clin Epidemiol, 60:1205– (2008). Occupational risk management of 13. Perera FP (1987). Molecular cancer engineered nanoparticles. J Occup Environ Hyg, epidemiology: a new tool in cancer 1219.doi:10.1016/j.jclinepi.2007.04.020 PMID:17998073 5:239–249.doi:10.1080/15459620801907840 prevention. J Natl Cancer Inst, 78:887–898. PMID:18260001 PMID:3471998

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 257 37. Chua SD Jr, Messier SP, Legault C 47. Bonassi S, Norppa H, Ceppi M et al. 58. Rundle AG, Vineis P, Ahsan H (2005). et al. (2008). Effect of an exercise and (2008). Chromosomal aberration frequency Design options for molecular epidemiology dietary intervention on serum biomarkers in lymphocytes predicts the risk of cancer: research within cohort studies. Cancer in overweight and obese adults with results from a pooled cohort study of 22 358 Epidemiol Biomarkers Prev, 14:1899–1907. osteoarthritis of the knee. Osteoarthritis subjects in 11 countries. Carcinogenesis, doi:10.1158/1055-9965.EPI-04-0860 Cartilage, 16:1047–1053.doi:10.1016/j.joca. 29:1178–1183.doi:10.1093/carcin/bgn075 PMID:16103435 2008.02.002 PMID:18359648 PMID:18356148 59. Prentice RL (1995). Design issues in 38. Rejnmark L, Vestergaard P, Heickendorff 48. Smerhovsky Z, Landa K, Rössner P et al. cohort studies. Stat Methods Med Res, 4:273– L et al. (2001). Loop diuretics alter the (2001). Risk of cancer in an occupationally 292.doi:10.1177/096228029500400402 diurnal rhythm of endogenous parathyroid exposed cohort with increased level of PMID:8745127 hormone secretion. A randomized-controlled chromosomal aberrations. Environ Health study on the effects of loop- and thiazide- Perspect, 109:41–45.doi:10.1289/ehp.01109 60. Wacholder S (1991). Practical diuretics on the diurnal rhythms of calcitropic 41 PMID:11171523 considerations in choosing between the hormones and biochemical bone markers in case-cohort and nested case-control postmenopausal women. Eur J Clin Invest, 49. Liou SH, Lung JC, Chen YH et al. (1999). designs. Epidemiology, 2:155–158. 31:764–772.doi:10.1046/j.1365-2362.2001. Increased chromosome-type chromosome doi:10.1097/00001648-199103000-00013 00883.x PMID:11589718 aberration frequencies as biomarkers of PMID:1932316 cancer risk in a blackfoot endemic area. 39. Schulte PA, Rothman N, Schottenfeld Cancer Res, 59:1481–1484. PMID:10197617 61. Hunter DJ. Methodological issues in the use D. Design considerations in molecular of biological markers in cancer epidemiology: epidemiology. In: Schulte PA, Perera FP. 50. Bonassi S, Abbondandolo A, Camurri L cohort studies. In: Toniolo P, Boffetta P, Molecular epidemiology: principles and et al. (1995). Are chromosome aberrations in Shuker DEG et al., editors. Application of practices. San Diego (CA): Academic Press, circulating lymphocytes predictive of future biomarkers in cancer epidemiology. Lyon: Inc.;1993. p. 159–98. cancer onset in humans? Preliminary results IARC Scientific Publications; 1997. p. 39–46. of an Italian cohort study. Cancer Genet 40. Schatzkin A, Freedman LS, Schiffman Cytogenet, 79:133–135.doi:10.1016/0165- 62. Banks E, Meade T (2002). Study of MH, Dawsey SM (1990). Validation of 4608(94)00131-T PMID:7889505 genes and environmental factors in complex intermediate end points in cancer research. J diseases. Lancet, 359:1156–1157, author reply Natl Cancer Inst, 82:1746–1752.doi:10.1093/ 51. Hagmar L, Brøgger A, Hansteen IL et al. 1157.doi:10.1016/S0140-6736(02)08140-0 jnci/82.22.1746 PMID:2231769 (1994). Cancer risk in humans predicted by PMID:11943294 increased levels of chromosomal aberrations 41. Schatzkin A, Gail M (2002). The promise in lymphocytes: Nordic study group on the 63. Burton P, McCarthy M, Elliott P (2002). and peril of surrogate end points in cancer health risk of chromosome damage. Cancer Study of genes and environmental factors in research. Nat Rev Cancer, 2:19–27.doi:10. Res, 54:2919–2922. PMID:8187078 complex diseases. Lancet, 359:1155 –1156, 1038/nrc702 PMID:11902582 author reply 1157.doi:10.1016/S0140-6736 52. Bonassi S, Hagmar L, Strömberg U et (02)08138-2 PMID:11943293 42. Expert Panel on Detection, Evaluation, al.; European Study Group on Cytogenetic and Treatment of High Blood Cholesterol Biomarkers and Health (2000). Chromosomal 64. Clayton D, McKeigue PM (2001). in Adults (2001). Executive summary of the aberrations in lymphocytes predict human Epidemiological methods for studying third report of The National Cholesterol cancer independently of exposure to genes and environmental factors in Education Program (NCEP) expert panel on carcinogens. Cancer Res, 60:1619–1625. complex diseases. Lancet, 358:1356–1360. detection, evaluation, and treatment of high PMID:10749131 doi:10.1016/S0140-6736(01)06418-2 blood cholesterol in adults (adult treatment PMID:11684236 53. Vermeulen R, Li G, Lan Q et al. (2004). panel III). JAMA, 285:2486–2497.doi:10.1001/ 65. Wacholder S, García-Closas M, Rothman jama.285.19.2486 PMID:11368702 Detailed exposure assessment for a molecular epidemiology study of benzene in N (2002). Study of genes and environmental 43. Tucker JD, Eastmond DA, Littlefield two shoe factories in China. Ann Occup Hyg, factors in complex diseases. Lancet, LG. Cytogenetic end-points as biological 48:105–116.doi:10.1093/annhyg/meh005 359:1155, author reply 1157.doi:10.1016/ dosimeters and predictors of risk in PMID:14990432 S0140-6736(02)08137-0 PMID:11943292 epidemiological studies. In: Toniolo P, Boffetta 66. García-Closas M, Thompson WD, Robins P, Shuker DEG et al., editors. Application of 54. García-Closas M, Rothman N, Lubin J (1999). Misclassification in case-control JM (1998). Differential misclassification biomarkers in cancer epidemiology. Lyon: and the assessment of gene-environment IARC Scientific Publication; 1997. p. 185–200. studies of gene-environment interactions: assessment of bias and sample size. Cancer interactions in case-control studies. Am J 44. Zhang L, Eastmond DA, Smith MT Epidemiol Biomarkers Prev, 8:1043–1050. Epidemiol, 147:426–433. PMID:9525528 (2002). The nature of chromosomal PMID:10613335 67. Rosner B, Spiegelman D, Willett WC (1992). aberrations detected in humans exposed to 55. Nicholson JK, Wilson ID (2003). Opinion: Correction of logistic regression relative benzene. Crit Rev Toxicol, 32:1–42.doi:10. risk estimates and confidence intervals for 1080/20024091064165 PMID:11846214 understanding ‘global’ systems biology: metabonomics and the continuum of random within-person measurement error. Am 45. Zhang L, Rothman N, Wang Y et al. metabolism. Nat Rev Drug Discov, 2:668– J Epidemiol, 136:1400–1413. PMID:1488967 (1999). Benzene increases aneuploidy in 676.doi:10.1038/nrd1157 PMID:12904817 68. Holland NT, Smith MT, Eskenazi B, the lymphocytes of exposed workers: a 56. Merrick BA, Tomer KB (2003). Bastaki M (2003). Biological sample collection comparison of data obtained by fluorescence and processing for molecular epidemiological in situ hybridization in interphase and Toxicoproteomics: a parallel approach to identifying biomarkers. Environ Health studies. Mutat Res, 543:217–234.doi:10.1016/ metaphase cells. Environ Mol Mutagen, S1383-5742(02)00090-X PMID:12787814 34:260–268.doi:10.1002/(SICI)1098- Perspect, 111:A578 –A579.doi:10.1289/ehp. 2280(1999)34:4<260::AID-EM6>3.0.CO;2-P 111-a578 PMID:12940285 69. Tworoger SS, Hankinson SE (2006). PMID:10618174 57. Potter JD. Logistics and design issues in the Collection, processing, and storage of use of biological specimens in observational biological samples in epidemiologic studies: 46. Boffetta P, van der Hel O, Norppa H et sex hormones, carotenoids, inflammatory al. (2007). Chromosomal aberrations and epidemiology. In: Toniolo P, Boffetta P, Shuker DEG et al., editors. Application of biomarkers markers, and proteomics as examples. cancer risk: results of a cohort study from Cancer Epidemiol Biomarkers Prev, 15:1578– Central Europe. Am J Epidemiol, 165:36–43. in cancer epidemiology. Lyon: IARC Scientific Publications; 1997. p. 31–37. 1581.doi:10.1158/1055-9965.EPI-06-0629 doi:10.1093/aje/kwj367 PMID:17071846 PMID:16985015

258 70. Peakman TC, Elliott P (2008). The 81. Wacholder S, Benichou J, Heineman EF 91. Bel EH (2004). Clinical phenotypes of UK Biobank sample handling and storage et al. (1994). Attributable risk: advantages of a asthma. Curr Opin Pulm Med, 10:44–50. validation studies. Int J Epidemiol, 37 broad definition of exposure. Am J Epidemiol, doi:10.1097/00063198-200401000-00008 Suppl 1;i2–i6.doi:10.1093/ije/dyn019 140:303–309. PMID:8059765 PMID:14749605 PMID:18381389 82. Cox A, Dunning AM, García-Closas M 92. Schmidt S, Schaid DJ (1999). Potential 71. Jackson C, Best N, Elliott P (2008). et al.; Kathleen Cunningham Foundation misinterpretation of the case-only study to UK Biobank Pilot Study: stability of Consortium for Research into Familial assess gene-environment interaction. Am J haematological and clinical chemistry Breast Cancer; Breast Cancer Association Epidemiol, 150:878–885. PMID:10522659 analytes. Int J Epidemiol, 37 Suppl 1;i16–i22. Consortium (2007). A common coding variant doi:10.1093/ije/dym280 PMID:18381388 in CASP8 is associated with breast cancer 93. Albert PS, Ratnasinghe D, Tangrea J, risk. Nat Genet, 39:352–358.doi:10.1038/ Wacholder S (2001). Limitations of the case- 72. Elliott P, Peakman TC; UK Biobank ng1981 PMID:17293864 only design for identifying gene-environment (2008). The UK Biobank sample handling and interactions. Am J Epidemiol, 154:687–693. storage for the collection, processing 83. Easton DF, Pooley KA, Dunning AM doi:10.1093/aje/154.8.687 PMID:11590080 and archiving of human blood and urine. Int et al.; SEARCH collaborators; kConFab; J Epidemiol, 37:234–244.doi:10.1093/ije/ AOCS Management Group (2007). Genome- 94. Begg CB, Zhang ZF (1994). Statistical dym276 PMID:18381398 wide association study identifies novel analysis of molecular epidemiology studies breast cancer susceptibility loci. Nature, employing case-series. Cancer Epidemiol 73. Fu Q, Garnham CP, Elliott ST et al. (2005). 447:1087–1093.doi:10.1038/nature05887 Biomarkers Prev, 3:173–175. PMID:8049640 A robust, streamlined, and reproducible PMID:17529967 method for proteomic analysis of serum by 95. White JE (1982). A two stage design for delipidation, albumin and IgG depletion, 84. Ahsan H, Rundle AG (2003). Measures of the study of the relationship between a rare and two-dimensional gel electrophoresis. genotype versus gene products: promise and exposure and a rare disease. Am J Epidemiol, Proteomics, 5:2656–2664.doi:10.1002/ pitfalls in cancer prevention. Carcinogenesis, 115:119 –128. PMID:7055123 pmic.200402048 PMID:15924293 24:1429–1434.doi:10.1093/carcin/bgg104 96. Cain KC, Breslow NE (1988). Logistic PMID:12819189 74. Prentice RL (1986). On the design of regression analysis and efficient design for synthetic case-control studies. Biometrics, 85. Wu X, Gu J, Spitz MR (2007). Mutagen two-stage studies. Am J Epidemiol, 128:1198 – 42:301–310.doi:10.2307/2531051 sensitivity: a genetic predisposition factor 1206. PMID:3195561 PMID:3741972 for cancer. Cancer Res, 67:3493–3495. 97. Chatterjee N, Chen Y, Breslow N doi:10.1158/0008-5472.CAN-06-4137 (2003). A pseudoscore estimator for 75. Franco EL (2000). Statistical issues in PMID:17440053 human papillomavirus testing and screening. regression problems for two phase Clin Lab Med, 20:345–367. PMID:10863644 86. Berwick M, Vineis P (2000). Markers sampling. J Am Stat Assoc, 98:158–168 of DNA repair and susceptibility to cancer doi:10.1198/016214503388619184. 76. Welch HG, Black WC (1997). Using in humans: an epidemiologic review. J autopsy series to estimate the disease 98. Wacholder S, Hartge P, Struewing JP et Natl Cancer Inst, 92:874–897.doi:10.1093/ al. (1998). The kin-cohort study for estimating “reservoir” for ductal carcinoma in situ of the jnci/92.11.874 PMID:10841823 breast: how much more breast cancer can penetrance. Am J Epidemiol, 148:623–630. we find? Ann Intern Med, 127:1023–1028. 87. Spitz MR, Wei Q, Dong Q et al. (2003). doi:10.1093/aje/148.7.623 PMID:9778168 PMID:9412284 Genetic susceptibility to lung cancer: the 99. Chatterjee N, Shih J, Hartge P et al. role of DNA damage and repair. Cancer 77. Morrison AS. Screening. In: Rothman KJ, (2001). Association and aggregation analysis Epidemiol Biomarkers Prev, 12:689–698. using kin-cohort designs with applications Greenland S, editors. Modern epidemiology. PMID:12917198 2nd ed. Philadelphia (PA): Lippincott-Raven to genotype and family history data from Publishers; 1998. p. 499–518. 88. García-Closas M, Lubin JH (1999). Power the Washington Ashkenazi Study. Genet and sample size calculations in case-control Epidemiol, 21:123–138.doi:10.1002/gepi.1022 78. Morton LM, Cahill J, Hartge P (2006). studies of gene-environment interactions: PMID:11507721 Reporting participation in epidemiologic comments on different approaches. Am J studies: a of practice. Am J Epidemiol, Epidemiol, 149:689–692. PMID:10206617 163:197–203.doi:10.1093/aje/kwj036 PMID:16339049 89. Yang Q, Khoury MJ, Sun F, Flanders WD (1999). Case-only design to measure gene- 79. Wacholder S, Chatterjee N, Hartge P gene interaction. Epidemiology, 10:167–170. (2002). Joint effect of genes and environment doi:10.1097/00001648-199903000-00014 distorted by selection biases: implications for PMID:10069253 hospital-based case-control studies. Cancer Epidemiol Biomarkers Prev, 11:885–889. 90. Khoury MJ, Flanders WD (1996). PMID:12223433 Nontraditional epidemiologic approaches in Unit 4 the analysis of gene-environment interaction: 80. Lum A, Le Marchand L (1998). A simple case-control studies with no controls. Am J mouthwash method for obtaining genomic Epidemiol, 144:207–213. PMID:8686689 14 Chapter DNA in molecular epidemiological studies. Cancer Epidemiol Biomarkers Prev, 7:719– 724. PMID:9718225

Unit 4 • Chapter 14. Population-based study designs in molecular epidemiology 259 260