A Dissertation
entitled
Cis-acting Genetic Variants that Alter ERCC5 Regulation as a Prototype to Characterize
cis-regulation of Key Protective Genes in Normal Bronchial Epithelial Cells
by
Xiaolu Zhang
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the
Doctor of Philosophy Degree in
Biomedical Sciences
______Dr. James C. Willey, Committee Chair
______Dr. Bina Joe, Committee Member
______Dr. Keith Crist, Committee Member
______Dr. Ivana de la Serna, Committee Member
______Dr. Alexei Fedorov, Committee Member
______Dr. Patricia R. Komuniecki, Dean College of Graduate Studies
The University of Toledo
May 2016
Copyright 2016, Xiaolu Zhang
This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.
An Abstract of
Cis-acting Genetic Variants that Alter ERCC5 Regulation as a Prototype to Characterize Cis-regulation of Key Protective Genes in Normal Bronchial Epithelial Cells
by
Xiaolu Zhang
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Biomedical Sciences
The University of Toledo
May 2016
Evidence has suggested that there is inter-individual variation in susceptibility to lung cancer. This variation may be due, in part, to inter-individual variation in dysregulation of key genes in DNA repair, antioxidant, and cell cycle pathways. The excision repair cross-complementation group 5 (ERCC5) gene plays an important role in nucleotide excision repair (NER), and dysregulation of ERCC5 is associated with increased lung cancer risk. The goal of this study was to conduct haplotype and diplotype analyses in normal bronchial epithelial cells (NBEC) to understand inter-individual variation in ERCC5 transcript regulation.
We determined genotypes at putative ERCC5 cis-regulatory single nucleotide polymorphic sites (SNP) rs751402 and rs2296147, and transcribed marker SNPs rs1047768 and rs17655. Using a recently developed targeted sequencing method, ERCC5 allele-specific transcript abundance was assessed in NBEC RNA from 55 individuals heterozygous for rs1047768 and 21 subjects heterozygous for rs17655. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by Sanger sequencing. We assessed association of NBEC ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at putative ERCC5 promoter cis-regulatory SNPs rs751402 and rs2296147.
Genotype analysis revealed higher inter-individual variation in allelic ratios in cDNA samples relative to matched gDNA samples at both rs1047768 and rs17655 (p<0.0001 and p=0.0005, respectively). By haplotype analysis, mean expression was higher at the rs1047768 alleles syntenic with the rs2296147 T allele than at those syntenic with the rs2296147 C allele (p=0.0030). Sequence analysis predicted that the T allele at SNP rs2296147 creates a TP53 binding site. Mean expression was higher at the rs17655 G allele (p<0.0001), which is syntenic with the A allele at a linked SNP rs873601 (D’=0.95). The G allele at SNP rs873601 is predicted to create a miRNA binding site.
These data support the conclusion that the T allele at SNP rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to the TP53 transcription factor. Genotype at SNP rs17655 is also associated with variation in ERCC5 transcript abundance, likely due to an effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide a plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.
Dedications
To my husband, Xiaoming Fan, who always supports me but doesn’t let me neglect my mistakes. You knew my weaknesses so that you encourage me when I am depressed. You are a true friend of mine so that you would point out my weaknesses and teach me a lesson. Your dedication and support have provided the foundation for my continuous personal improvement. I am forever grateful of the tremendous sacrifices you have made along the way.
To my mother, Zhibi Zhang, for your strong support of my study in a different country. Your sacrifices to foster your daughter and make her intelligent serve as the philosophy I will follow with my children in the future.
Acknowledgements
I would like to express my gratitude to all those who have helped me complete this thesis. I am very grateful to Dr. James C. Willey, my major advisor, for his expert insights into research, for his tremendous patience in guiding and teaching me, and for his great support throughout my doctoral studies, especially the time spent fixing my poor writing. You helped me stay optimistic and not give up. Your encouragement will keep me moving forward in the future.
I am grateful to Erin L. Crawford, my lab manager, for her help in communication with others, study design, reagent ordering and writing the first manuscript of my life.
You are always happy to answer my questions.
I thank all my former and present colleagues, Tom Blomquist, Lauren Stanoszek, Jiyoun Yeo, Diego A. Morales, Rose T. Zolondek, and Daniel J. Craig, for all the great discussion and inspiration. I especially want to thank Tom for your kind courtesy in providing reagents, Jiyoun for your great advice, and Rose for your help.
I would like to thank my academic advisory committee members: Dr. Bina Joe,
Dr. Keith Crist, Dr. Ivana de la Serna, and Dr. Alexei Fedorov,
Table of Contents
Abstract ...... iii
Acknowledgements ...... vii
Table of Contents ...... viii
List of Tables ...... xii
List of Figures ...... xiii
List of Abbreviations ...... xiiii
1 Introduction ...... 1
2 Literature Review...... 9
2.1 Biomarker Development for Disease Risk ...... 9
2.1.1 Gene Transcript Expression ...... 9
2.1.1.1 Total Transcript Abundance ...... 9
2.1.1.2 Allele-specific Transcript Expression ...... 12
2.1.1.2.1 Dosage Effect ...... 13
2.1.1.2.2 Allelic Imbalance ...... 15
2.1.2 Transcript Regulation...... 16
2.1.2.1 Identify Trans-acting Variations ...... 17
2.1.2.2 Identify Cis-acting Variations ...... 19
2.1.3 Epidemiology Study...... 21
2.1.3.1 Case-control Study ...... 22
2.1.3.2 Cohort Study ...... 23
2.2 Identification of cis-acting Regulatory Variations ...... 24
2.2.1 Single Nucleotide Polymorphism Analysis ...... 24
2.2.1.1 Genotype Analysis ...... 24
2.2.1.2 Haplotype Analysis ...... 27
2.2.2 Empirical Approaches to Assess cis-acting Variations ...... 31
2.2.2.1 Assessing cis-acting Effects by Total Expression ...... 31
2.2.2.2 Assessing cis-acting Effects by ASE ...... 33
2.2.3 Experimental Approaches to Assess cis-acting Variations ...... 35
2.2.3.1 In vitro Approaches...... 35
2.2.3.2 In vivo Approaches ...... 38
2.3 Quality Controlled Molecular Diagnostic Tests Based on RNA ...... 40
2.3.1 Reverse Transcription-Polymerase Chain Reaction ...... 40
2.3.1.1 Relative Quantification ...... 41
2.3.1.2 Absolute Quantification ...... 42
2.3.1.2.1 Competitive PCR ...... 43
2.3.1.2.2 Multiplex Two-Color Fluorometric Real-time PCR with Quality Control ...... 45
2.3.2 RNA-Sequencing ...... 48
2.3.2.1 Whole Transcriptome RNA-seq ...... 48
2.3.2.2 Targeted RNA-seq ...... 50
2.3.2.2.1 Use of IS as Quality Control ...... 51
2.3.2.2.2 Use of Predicted CV as Quality Control ...... 52
2.4 Contributions...... 53
2.4.1 Manuscript I ...... 53
2.4.2 Manuscript II ...... 54
2.4.3 Manuscript III ...... 55
2.5 Future Study ...... 55
3 Haplotype and Diplotype Analyses of Variation in ERCC5 Transcription cis-regulation in Normal Bronchial Epithelial Cells ...... 57
3.1 Abstract ...... 58
3.2 Introduction ...... 59
3.3 Materials and Methods ...... 61
3.4 Results ...... 67
3.5 Discussion ...... 70
3.6 Disclosures ...... 75
3.7 Grants ...... 75
3.8 Table and Figure Legends ...... 76
3.9 Table and Figure ...... 80
3.10 Supplemental Table and Figure Legends ...... 87
3.11 Supplemental Table and Figure ...... 88
4 Lung Cancer Risk Test Trial: Study Design, Participant Baseline Characteristics, Bronchoscopy Safety, and Establishment of Biospecimen Repository ...... 91
4.1 Abstract ...... 92
4.2 Introduction ...... 94
4.3 Methods...... 97
4.4 Results ...... 104
4.5 Discussion ...... 110
4.6 Conclusions ...... 116
4.7 List of Abbreviations Used ...... 116
4.8 Competing Interests ...... 117
4.9 Author Contributions ...... 117
4.10 Acknowledgements ...... 117
4.11 Table and Figure Legends ...... 118
4.12 Table and Figure ...... 119
4.13 Supplemental Table and Figure Legends ...... 121
4.14 Supplemental Table and Figure ...... 121
5 Control for Stochastic Sampling Variation and Qualitative Sequencing Error in Next Generation Sequencing ...... 127
5.1 Abstract ...... 128
5.3 Methods...... 133
5.4 Results ...... 139
5.5 Discussion ...... 141
5.6 Table and Figure Legends ...... 145
5.7 Table and Figure ...... 148
5.8 Supplementary Table and Figure ...... 153
6 Conclusions and Summary ...... 154
References ...... 166
List of Tables
3.1 Summary of Haplotype Structures in ERCC5 Promoter Region...... 75
3.2 Summary of Diplotype Structures in ERCC5 Promoter Region ...... 75
S3.1 Demographic Characteristics of Enrolled 80 Subjects ...... 83
S3.2 Summary of Genotype and Diplotype for Heterozygotes at rs1047768 ...... 83
S3.3 ON-TARGET Plus SMARTpool siRNA Sequences ...... 85
4.1 LCRT Subject Characteristics...... 114
4.2 Chronic Obstructive Pulmonary Disease by PFT ...... 115
4.3 Adverse Events (AE) ...... 115
S4.1 Lung Cancer Risk Test Study Enrollment by Study Site ...... 116
S4.2 Work Types and Exposures ...... 117
S4.3 Medical History ...... 118
S4.4 Standard of Care (SOC) vs. Study Driven (SD) Bronchoscopies ...... 119
S4.5 Self-reported vs. Clinical COPD...... 120
S4.6 Transplant vs. Non-Transplant Subjects ...... 121
S5.1 Supplementary Tables ...... 148
List of Figures
3-1 Effects of CEBPG siRNAs on ERCC5 Transcript Abundance ...... 76
3-2 Schematic Overview of ERCC5 ASE Measurement ...... 77
3-3 Allelic Ratios Measured at rs1047768 and rs17655 ...... 78
3-4 Allelic Ratios Measured at rs1047768 Sorted by Various Diplotype ...... 79
3-5 Correlation of CDKN1A and ERCC5 Transcript Abundance ...... 80
3-6 Lung Cancer Risk Through Sub-Optimal Regulation of Protective Genes ...... 81
5-1 Overview of Specimen Preparation for Next-Generation Sequencing ...... 143
5-2 Performance of Models to Predict Observed Variance...... 144
5-3 Effects of Sequence Counts and Sample Molecule on Allelic Ratios ...... 145
5-4 Frequency Plot of Observed Sequencing Variation ...... 146
5-5 Performance of IS to Measure Frequency of Sequence Variation ...... 147
S5-1 Model Design ...... 148
List of Abbreviations
AI ...... Allelic Imbalance
ASE ...... Allele Specific Expression
Cq ...... Quantification Cycle
CV ...... Coefficient of Variation
CEBPG ...... CCAAT/Enhancer Binding Protein, Gamma
ERCC5 ...... Excision Repair Cross-Complementation Group 5
ESM ...... External Standards Mixture
FFPE ...... Formalin-Fixed, Paraffin-Embedded
GWAS ...... Genome Wide Association Study
IS ...... Internal Standard
ISM ...... Internal Standards Mixture
LCRT ...... Lung Cancer Risk Test
NBEC ...... Normal Bronchial Epithelial Cell
NGS ...... Next-Generation Sequencing
NT ...... Native Template
PCR ...... Polymerase Chain Reaction
RT ...... Reverse Transcription
RT-qPCR ...... Reverse Transcription Quantitative-PCR
SNP ...... Single Nucleotide Polymorphism
Chapter 1
Introduction
An estimated 158,040 Americans were expected to die from lung cancer in 2015, accounting for approximately 27 percent of all cancer deaths (American Cancer Society 2015). As the leading cancer killer in both men and women in the United States, lung cancer causes more deaths than the next three most common cancers combined (colon, breast and pancreatic). In contrast to traits that follow a simple Mendelian pattern, lung cancer risk is a complex genetic trait that commonly involves multiple genes in combination with environmental factors, and the same is true for chronic obstructive pulmonary disease (COPD) (Tockman, Anthonisen et al. 1987, Sundar, Mullapudi et al. 2011). Although smoking is the most common preventable cause of lung cancer and COPD, evidence supporting a genetic basis of risk for lung cancer and COPD has been emerging since the beginning of the twentieth century (Alberg and Samet 2003). The evidence for genetic causes is based on statistics of lung cancer cases among smokers and nonsmokers. Importantly, although smokers worldwide have a 20 times greater risk of developing lung cancer than nonsmokers, only 10-15% of smokers develop lung cancer in their lifetime, consistent with an interaction between smoking and genetic risk (Mattson, Pollack et al. 1987, Irshad and Maryum 2012). Conversely, 10-15% of all lung cancers occur among nonsmokers, and genetic predisposition among nonsmoking individuals with lung cancer is under active investigation (Hu, Mao et al. 2002, Wang, Vermeulen et al. 2015). Although a single high-penetrance gene for lung cancer has not yet been identified (Schwartz, Prysak et al. 2006, Schwartz 2016), transcript expression of multiple genes and genomic DNA mutation signatures were reported to differ between smokers with and without lung cancer (Spira, Beane et al. 2004, Pleasance, Stephens et al. 2010). All of this evidence strongly suggests that there is inter-individual variation in susceptibility to lung cancer and that this variation may result from genetic predisposition.
The five-year survival rate for people diagnosed with late-stage lung cancer is significantly lower than for those diagnosed in the early stages (American Cancer Society 2016). Therefore, it is important to identify lung cancer cases in the early stages, and there has been an increased effort to do so. For example, the National Lung Screening Trial (NLST), launched in 2002, revealed a 20% reduction in lung cancer mortality among individuals screened by low-dose spiral computed tomography (LDCT) compared to the group screened by standard chest X-ray (National Lung Screening Trial Research, Aberle et al. 2011). Since December 2013, annual LDCT screening has been recommended by the United States Preventive Services Task Force (USPSTF) for people at high lung cancer risk based on epidemiologic characteristics (i.e., age 55-80, ≥ 30 pack-years smoking history, and currently smoke or have quit within the past 15 years). NLST reported that 24.2% of screens across 3 rounds of LDCT screening were positive. However, 96.4% of these were false-positive, and approximately 2.5% of positive test results required an additional invasive diagnostic procedure (e.g., bronchoscopy, needle biopsy, etc.). Furthermore, LDCT was associated with health risk, emotional problems, and financial cost to the patient (National Lung Screening Trial Research, Church et al. 2013). Consequently, USPSTF urged that more research is needed on the use of biomarkers to focus LDCT efforts on persons who are at highest risk for lung cancer (Moyer and Force 2014). A molecular diagnostic that further stratifies the individuals at highest risk for lung cancer within the epidemiologically defined high-risk group will enable more accurate selection of individuals who are most likely to develop lung cancer in their lifetime and reduce the risk and cost of annual LDCT screening.
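To make the scale of the false-positive burden concrete, the NLST percentages quoted above can be converted into counts per 1,000 LDCT screens. The following is a minimal arithmetic sketch and not part of the original analysis; the cohort size of 1,000 and the function name are illustrative assumptions, and only the 24.2% and 96.4% figures come from the text.

```python
def screening_summary(n_screens, positive_rate, false_positive_fraction):
    """Split screens into positives, false positives and true positives."""
    positives = n_screens * positive_rate
    false_positives = positives * false_positive_fraction
    true_positives = positives - false_positives
    # Positive predictive value: share of positive screens that are true positives
    ppv = true_positives / positives
    return positives, false_positives, true_positives, ppv


# 1,000 screens is a hypothetical cohort size chosen for readability
positives, false_pos, true_pos, ppv = screening_summary(1000, 0.242, 0.964)
print(f"Positive screens per 1,000: {positives:.0f}")
print(f"False positives per 1,000:  {false_pos:.0f}")
print(f"True positives per 1,000:   {true_pos:.0f}")
print(f"Positive predictive value:  {ppv:.1%}")
```

Under these figures only about 3.6% of positive screens correspond to lung cancer, which is the motivation for a biomarker that narrows the screened population.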
Various studies have described, discovered, quantified and validated biomarkers at the transcribed mRNA level (Spira, Beane et al. 2004, Spira, Beane et al. 2007). To understand the role of inter-individual variation in genetic predisposition to lung cancer risk, our laboratory determined that key genes involved in DNA repair, antioxidant, and cell cycle pathways display altered regulation in normal bronchial epithelial cells (NBEC) of lung cancer subjects (Crawford, Khuder et al. 2000, Mullins, Crawford et al. 2005, Crawford, Blomquist et al. 2007) and identified a promising Lung Cancer Risk Test (LCRT) biomarker comprising transcript abundance measurement of fifteen genes in NBEC (Blomquist, Crawford et al. 2009). Based on the pattern observed in the prior case-control study, it is reasonable to hypothesize that the heritability of lung cancer may depend on inter-individual variation in DNA repair capacity, which is associated with sub-optimal expression of DNA repair genes and antioxidant genes in airway epithelium.
Sub-optimal expression of DNA repair and antioxidant genes can result from dysregulation of these genes, in part due to genetic variation. Genetic association studies can be conducted to test for correlation between disease status and genetic variation to identify candidate genes or genome regions that contribute to a specific disease (Lewis and Knight 2012). Numerous such studies have been done for DNA repair and antioxidant genes and have identified many risk loci associated with lung cancer (Shen, Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al. 2006, Gallegos Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Blomquist, Crawford et al. 2010, Tseden-Ish, Choi et al. 2012, Blomquist, Brown et al. 2013).
In addition to these genetic association studies, in which specific target loci were prioritized based on high prior likelihood of association with lung cancer risk, multiple allelic association studies of lung cancer risk have been carried out using dense arrays of genetic markers that capture a substantial proportion of common variants in the human genome (McCarthy, Abecasis et al. 2008), an approach known as the genome-wide association study (GWAS). GWAS have discovered many genetic variants (i.e., single nucleotide polymorphisms, SNP) that are significantly associated with common disease susceptibility in mouse and human genomes (Lohmueller, Pearce et al. 2003, Mehrabian, Allayee et al. 2005, Simmonds 2013). However, results from different GWAS have not been consistent.
One reason is that the effect size of each DNA variant associated with, and possibly mechanistically linked to, lung cancer risk is very small after adjustment for multiple comparisons. Genes associated with lung cancer risk are regulated by multiple loci; therefore, each locus contributes only modestly to the risk (Deutsch, Lyle et al. 2005, Wu, Kraft et al. 2010). In fact, additional evidence showed that region-based analysis (of a combination of SNPs in a particular genomic region) may have higher power than single SNP-based analysis (Zakharov, Wong et al. 2013). Together, these findings suggest benefits of region-based analysis over SNP-based analysis and have led to more interest in the haplotype, the combination of marker alleles on a single chromosome.
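The effect of adjusting for multiple comparisons can be illustrated with the Bonferroni correction, one common adjustment (the text does not state which correction the cited studies used). In the sketch below, the numbers of tests are illustrative assumptions: one million SNPs is a conventional round figure for a dense array, and 20,000 regions is a hypothetical count for a region-based analysis.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05
N_SNP_TESTS = 1_000_000   # assumed number of single-SNP tests
N_REGION_TESTS = 20_000   # hypothetical number of region-based tests

snp_threshold = bonferroni_threshold(ALPHA, N_SNP_TESTS)
region_threshold = bonferroni_threshold(ALPHA, N_REGION_TESTS)
print(f"Single-SNP threshold:   {snp_threshold:.1e}")
print(f"Region-based threshold: {region_threshold:.1e}")
```

Fewer tests give a less stringent per-test threshold, which is one way a region-based analysis can gain power; any gain from combining the signals of several SNPs within a region is not modeled here.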
A second reason for inconsistent GWAS results is that GWAS of susceptibility loci and lung cancer risk have been limited by the low and variable frequency of SNPs across populations and ethnic groups. Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants with lung cancer risk. Instead, identifying inherited variation in gene regulation as a phenotypic marker is a powerful intermediate step for determination of lung cancer risk. According to current results and our previous study, it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS aiming to determine association of each individual SNP with risk. Analysis of regulation of key antioxidant and DNA repair genes in NBEC, followed by identification of cis-regulatory SNPs (cis-rSNP) associated with sub-optimal regulation, is a more practical and effective way to evaluate lung cancer risk.
The proximate phenotypic markers of hereditary risk comprised by the LCRT are key protective antioxidant, DNA repair, and cell cycle control genes that are sub-optimally regulated in normal bronchial epithelial cells (NBEC). The rationale for this approach is that sub-optimal NBEC regulation of a protective gene has a greater effect on risk than an individual SNP. This conclusion is based on results of previous studies in which we identified cis-rSNPs associated with sub-optimal regulation of genes comprised by the LCRT, including ERCC5 (Blomquist, Crawford et al. 2010) and CEBPG (Blomquist, Brown et al. 2013). For example, we identified two cis-rSNPs that independently contribute to regulation of ERCC5 transcript abundance. Thus, a proximate phenotype based on sub-optimal NBEC regulation of a protective gene enriches for cis-rSNPs that may contribute to risk. Accordingly, the central hypothesis of this study is that genetic variations (i.e., SNPs) at regulatory regions contribute to cis-regulation of ERCC5, one of the key genes in the LCRT. The central hypothesis was tested through two working hypotheses: 1) cis-rSNPs are associated with ERCC5 allele-specific expression (ASE); 2) the contributions of cis-rSNPs to ERCC5 regulation may be independent. The mechanistic role of cis-acting genetic variations that regulate transcription of a target gene can be assessed by measuring ASE, because each allele serves as an internal control for the other (Pastinen and Hudson 2004, Pai, Pritchard et al. 2015), so trans-acting effects or environmental conditions that differentially influence gene expression among samples should not interfere. Only cis-acting changes in the relative expression of alleles yield reproducible differences between allelic abundances of transcripts. Previously, genotypes of putative cis-rSNPs responsible for regulation of some genes comprised by the LCRT, including excision repair cross-complementation group 5 (ERCC5) and CCAAT/enhancer binding protein gamma (CEBPG), have been associated with ASE (Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2010, Blomquist, Brown et al. 2013).
This study tested the hypotheses by accomplishing two specific aims. Aim 1: determine haplotypes comprised by putative cis-rSNPs using allele-specific PCR followed by Sanger sequencing in normal bronchial epithelial cells. Aim 2: assess ASE in ERCC5 using a targeted NGS method with quality and stochastic controls. With the addition of more subjects to the LCRT trial over the past 5 years and an understanding of ERCC5’s critical role in DNA repair, mechanistic understanding of the regional effects of heritable variations in cis-regulation of ERCC5 can be advanced by assessing haplotypes comprised by putative cis-rSNPs and ASE. In addition to cis-regulation, the trans-effect of CEBPG, a previously identified transcription factor for ERCC5, was determined in a lung cancer cell line.
The second part of this study was dedicated to development of quality control for molecular tests. Biomarkers are of increasing importance for personalized medicine, with applications at different stages of the disease process: not only for enhancing early disease detection but also for assessing clinical outcomes of a treatment, determining the most effective treatment for an individual, or monitoring response to treatment (Pfaffl 2013). Predictive biomarkers (a subset of biomarkers) are used to predict response to a treatment in terms of efficacy and safety. Chemoresistance and chemosensitivity assays specifically have been investigated in vitro and some have been approved by the FDA (Keedy, Temin et al. 2011, Mok 2011, Korpanty, Graham et al. 2014, Chamizo, Zazo et al. 2015). To augment immunohistochemistry (IHC), the most commonly used approach in clinical molecular diagnostics, and to enable the quantitation of predictive biomarkers, quality-controlled multiplex two-color fluorometric real-time PCR assays were developed for ten predictive markers that have shown clinical value in predicting response to general or targeted chemotherapeutic agents. The key components of this method are internal standards (IS) and external standards (ES). The competitive IS molecule was designed with identical priming sites and a 4-6 bp internal difference from each native target gene template (NT). This ensures identical thermodynamics and amplification efficiency for both template species as well as discrimination of IS from NT. ES corrects for the difference in fluorescence intensity between two probes labeled with different dyes that arises from variation in probe degradation or software selection of Cq values in each PCR plate.
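The competitive principle behind the IS can be sketched as a simple proportion: because NT and IS are designed to amplify with the same efficiency, the difference between their Cq values reflects the ratio of their starting copies. The sketch below is an illustration under the added assumption of perfect doubling per cycle, not the assay's actual analysis software; the function name, the cq_dye_offset parameter standing in for an ES-derived correction, and the example numbers are all hypothetical.

```python
def estimate_nt_copies(cq_nt, cq_is, is_copies_added, cq_dye_offset=0.0):
    """Estimate starting native template (NT) copies in a competitive real-time PCR.

    cq_nt, cq_is: quantification cycles measured for NT and internal standard (IS)
    is_copies_added: known number of IS copies spiked into the reaction
    cq_dye_offset: hypothetical correction (in cycles) for dye-specific
        fluorescence differences, as might be derived from external standards
    """
    if is_copies_added <= 0:
        raise ValueError("is_copies_added must be positive")
    # Assumes perfect doubling per cycle for both templates
    delta_cq = (cq_is - cq_nt) + cq_dye_offset
    return is_copies_added * 2.0 ** delta_cq


# Hypothetical example: NT crosses threshold one cycle before 1,000 IS copies
copies = estimate_nt_copies(cq_nt=24.0, cq_is=25.0, is_copies_added=1000)
print(f"Estimated NT copies: {copies:.0f}")
```

Because the IS is present in the same reaction as the NT, reaction-specific inhibition or efficiency loss affects both templates alike and cancels in the Cq difference.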
In addition to real-time PCR, we have implemented quality control in next generation sequencing (NGS) RNA-sequencing platforms. We previously developed a competitive multiplex-PCR amplicon-based library preparation for targeted RNA-sequencing on NGS platforms (Blomquist, Crawford et al. 2013). This method can control for sample overloading, signal saturation effects, and inter-assay and inter-sample variation in measurement. In addition, it enables control of stochastic sampling error when a low number of copies is present in the sample. Uncontrolled analytical variation due to stochastic sampling is potentially a major barrier that limits the application of NGS in the clinical setting. Hence, controlling for stochastic sampling error is important for mutation detection and differential gene expression measurement. The hypothesis that assay coefficient of variation (CV) due to stochastic sampling can be predicted based on a Poisson sampling-based mathematical equation was tested through cross-titration of cDNA samples from two lung cancer cell lines (Fu, Xu et al. 2014). The predicted CV may then be used to determine the confidence limits for each value acquired from NGS analysis. Therefore, false positive and false negative results can be reduced by minimizing variation due to stochastic sampling.
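The idea of a Poisson-based predicted CV can be sketched as follows. For a count that follows a Poisson distribution with mean N, the standard deviation is the square root of N, so the CV is 1/sqrt(N). For a ratio of two independent Poisson counts, a first-order approximation gives a CV of sqrt(1/N1 + 1/N2). The code below illustrates these textbook relationships only; it is not claimed to be the exact equation evaluated in this work, and the function names and example counts are assumptions.

```python
import math


def poisson_cv(n):
    """CV of a Poisson-distributed count with expected value n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / math.sqrt(n)


def ratio_cv(n_a, n_b):
    """First-order approximation of the CV of a ratio of two independent Poisson counts."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("counts must be positive")
    return math.sqrt(1.0 / n_a + 1.0 / n_b)


# Hypothetical molecule counts showing how the predicted CV shrinks with input
for n in (10, 100, 10_000):
    print(f"N = {n:>6}: predicted CV = {poisson_cv(n):.3f}")

print(f"Ratio of 100 vs 100 molecules: predicted CV = {ratio_cv(100, 100):.3f}")
```

With 10 sampled molecules the predicted CV is about 32%, while 10,000 molecules bring it down to 1%, which is why low-copy measurements need wider confidence limits.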
Chapter 2
Literature Review
2.1. Biomarker Development for Disease Risk
2.1.1. Gene Transcript Expression
Gene expression can be measured at the mRNA and protein levels. Although protein is widely known as the functional gene product, the abundance of mRNA, the intermediary between DNA and protein, correlates with protein expression level in complex biological samples under certain circumstances (Gry, Rimini et al. 2009, Maier, Guell et al. 2009, Vogel and Marcotte 2012, Evans 2015). mRNA, also known as transcript, is a direct reflection of gene expression. Thus, by determining the types and quantity of mRNA transcripts present in a cell, we can determine which genes are expressed, and at what level, in that cell at different stages of development and under different environmental conditions.
2.1.1.1. Total Transcript Abundance
Various studies have described, discovered, quantified and validated biomarkers measured at the mRNA level. Using DNA microarrays, Spira and colleagues in 2004 described smoking-induced changes in the gene expression of airway epithelial cells obtained during bronchoscopy from nonsmokers and from current and former smokers without lung cancer (Spira, Beane et al. 2004). They found that a subset of genes had consistently altered expression in former smokers and speculated that this may explain the risk these individuals have for developing lung cancer long after they have discontinued smoking. Using gene-expression profiles from Affymetrix HG-U133A microarrays, Spira and colleagues later identified an 80-gene biomarker that distinguishes smokers with and without lung cancer in a training set of subjects. They then tested the biomarker on an independent test set and on an additional validation set. Their biomarker had ~90% sensitivity for stage 1 cancer across all subjects (Spira, Beane et al. 2007). Using recently developed transcriptome sequencing, the same group found that RNA-Seq data detected additional smoking- and cancer-related transcripts whose expression was either not interrogated by microarrays or not found to be significantly altered when using them (Beane, Vick et al. 2011, Wang, Vermeulen et al. 2015). These findings support the notion that gene transcript expression (i.e., transcript abundance) in normal airway epithelial cells can serve as a lung cancer biomarker.
Using quality-controlled standardized RT-PCR (StaRT-PCR), our laboratory identified a set of key DNA repair, antioxidant and transcription factor genes that exhibited significant intergene total transcript abundance correlation in normal human bronchial epithelial cells (NBEC) among individuals without a lung cancer diagnosis (Mullins, Crawford et al. 2005). Conversely, in NBEC of individuals diagnosed with lung cancer, intergene total transcript abundance correlation was not observed. This difference between lung cancer cases and matched controls led to identification of a 14-gene transcript expression-based Lung Cancer Risk Test (LCRT). In addition to this pilot study, two additional case-control studies (first set: 25 lung cancer cases and 24 controls; second set: 18 cases and 22 controls) supported the association between the LCRT and prevalence of lung cancer (Blomquist, Crawford et al. 2009). The set of genes that separates lung cancer cases from non-lung cancer controls in this study has different characteristics from a set of genes with similar classification capabilities discovered through high-density microarray analysis by Spira and colleagues, as reviewed above (Spira, Beane et al. 2007). One difference is that 12 of the 14 genes reported in our findings are key antioxidant or DNA repair genes, whereas the remaining two are transcription factors expressed in normal airway epithelium (Blomquist, Crawford et al. 2009). In contrast, the set of genes reported by the Spira group comprises primarily signal transduction and small molecule transport genes. A second difference is that each of the genes comprising the multigene test reported by our group has increased dispersion among the lung cancer cases rather than altered mean levels. In contrast, each of the genes reported by Spira and colleagues has an altered central tendency of expression.
We speculate that the increased dispersion in total transcript expression of genes in NBEC reported in Blomquist et al. 2009 likely results from inherited characteristics at the germ cell level, and less likely from acquisition of genetic alterations in somatic cells in the airway epithelium, previously described as the field effect (Wistuba 2007, Walser, Cui et al. 2008). There is theoretical and empirical evidence supporting this inference. For example, the field effect has been observed in all smokers, not just those with lung cancer (Wistuba 2007, Kadara and Wistuba 2012). However, the observed increase in variation in transcript expression of genes comprising the LCRT separated the lung cancer group from the non-lung cancer group regardless of cigarette smoking. This observation indicates the existence of inter-individual variation in the field cancerization effect caused by smoking and led to the hypothesis that the basis for this variation is germ cell inheritance. This hypothesis is supported by accumulating evidence for germ cell inheritance of particular cis element single nucleotide polymorphisms (SNP) that cause inter-individual variation in regulation of the genes comprised by the LCRT and that are associated with increased lung cancer risk (Blomquist, Crawford et al. 2010, Blomquist, Brown et al. 2013). Specifically, particular polymorphisms in the regulatory region of ERCC5 are associated with increased dispersion of transcript expression around its median expression value and altered prevalence of lung cancer diagnosis (Blomquist, Crawford et al. 2010). Recently, a polymorphism (rs3213245, -77T>C) in the regulatory 5’ untranslated region (UTR) of X-ray repair cross-complementing 1 (XRCC1) was found to be significantly associated with altered XRCC1 expression and increased lung cancer risk (Hao, Miao et al. 2006). Both ERCC5 and XRCC1 are among the 14 genes comprised by the LCRT. Besides these two genes, we hypothesize that germ cell inheritance of particular alleles at regulatory polymorphic sites may be associated with increased dispersion of transcript expression in the other genes comprised by this test as well.
2.1.1.2. Allele-Specific Transcript Expression
In addition to total transcript abundance, allele-specific expression (ASE) is another measure of gene transcript expression. While total transcript abundance measures the copies of transcripts derived from all alleles at a gene locus, ASE refers to the differential expression of each allele. In a diploid genome, ASE can be represented by the allelic ratio, the
ratio of the amounts of products derived from two alleles. The detection of ASE at
transcript level in an individual, accordingly, is to quantitate the relative amounts of the
transcript which originated from a particular allele at the gene locus. Hereafter, ASE
refers to ASE at the transcript level unless otherwise specified.
Allele-specific differences in levels of gene transcript expression have been found
to be classically attributable to the associated cis-acting regulatory variation and a distinct
epigenetic pattern of its two parental alleles (Knight 2004, Pastinen, Ge et al. 2006, Song,
Kim et al. 2012). Therefore, ASE measurement is a more powerful method for determining the effect of cis-polymorphic sites on transcript regulation than total transcript abundance, which does not control for inter-individual variation in trans-regulatory effects (de la Chapelle
2009, Matera, Musso et al. 2013).
2.1.1.2.1. Dosage Effect
Diploid organisms can only have two alleles for a given gene; however, multiple alleles may exist at the population level such that many combinations of two alleles are observed. Traits inherited in a simple Mendelian pattern are either dominant or recessive, a pattern referred to as complete dominance. Such a trait is produced by only one gene. The complete dominance of a wild-type phenotype over all other mutants often occurs as an effect of "dosage" of a specific gene product: the wild-type allele supplies the correct amount of gene product whereas the mutant alleles cannot. One mutant allele can also be dominant over all other phenotypes, including the wild type (UC Davis).
But this is not the case for many traits, including lung cancer. In other words, dominance is not always complete. Incomplete dominance is the expression of two contrasting alleles such that the individual displays an intermediate phenotype. The phenotype expression is dependent on the dosage of the gene products. Two copies of the
gene products result in full expression, while only one copy produces partial expression and, in turn, an intermediate phenotype. A variation on incomplete dominance is codominance, in which both alleles for the same characteristic are simultaneously expressed in the heterozygote (Boundless 2015). Lung cancer is a complex genetic trait in which a complex pattern of inheritance is common, involving multiple genes in combination with environmental factors.
In recent decades, there has been an increasing number of reports that various SNPs have a dosage effect on gene expression, lung cancer risk, survival time, and response to treatment (Shen, Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al.
2006, Gallegos Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Tseden-Ish, Choi et al. 2012,
Litviakov, Freidin et al. 2015). Most association studies examine the relationship between genotypes and expression levels of genes that play an important role in tumorigenesis. For example, a deletion on 14q32.2-33, a common mutation in non-small cell lung cancer (NSCLC) (44%), significantly reduced gene expression for HSP90, residing on 14q32, which, in turn, led to longer survival time in 32 NSCLC patients (Gallegos Ruiz, Floor et al. 2008). As previously published, the variant genotypes (GC/AT + AT/AT) of one locus in p73 versus the wild-type genotype (GC/GC) and the variant genotypes (WM + MM) of one locus in p53 versus the wild-type (WW) were associated with increased lung cancer risk of borderline statistical significance after adjusting for age, gender, smoking status, and pack-years.
When the p73 and p53 variant alleles were combined and analyzed as a continuous variable, there was evidence of a gene-dosage effect in addition to a 13% increase in lung cancer risk for each additional variant allele (Schabath, Wu et al. 2006). According to this, one variant SNP may have only a modest independent effect on the phenotype of a
multigenic disease such as lung cancer, given the borderline effects from a single gene
SNP and the fact that combined effects from both the p73 and p53 variant alleles only increased lung cancer risk by 13%. Another example is a study consisting of 53 unrelated
HapMap CEU lymphoblastoid cell lines. In this study, Ge and colleagues detected that
5.8% of common SNPs on the Illumina 1M BeadChip were significantly associated with allelic ratios and cis regulation in individual transcripts. Specifically, the ASE linear regression results showed that the number of copies of the mutant allele at rs2732087 in XKR9 correlated significantly with ASE (Ge, Pokholok et al. 2009). More recently, a trend of association between SNP genotype and the expression of gene transcript was also observed in a GWAS of non-tumor lung tissues from 420 patients undergoing lung cancer surgery (Nguyen, Lamontagne et al. 2014).
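The per-allele estimate reported by Schabath and colleagues can be turned into a small worked example. The sketch below assumes a log-additive (multiplicative) model, in which each additional variant allele multiplies risk by the same factor. That model, the function name, and the range of allele counts are illustrative assumptions, not details taken from the cited study.

```python
def relative_risk_for_allele_count(n_variant_alleles, per_allele_increase=0.13):
    """Relative risk versus zero variant alleles under an ASSUMED log-additive model.

    per_allele_increase=0.13 reflects the 13% per-allele increase quoted in the text;
    treating the effect as multiplicative across alleles is an assumption.
    """
    if n_variant_alleles < 0:
        raise ValueError("allele count must be non-negative")
    return (1.0 + per_allele_increase) ** n_variant_alleles


# Two biallelic loci (p73 and p53) give 0 to 4 variant alleles per individual.
risk_by_count = {k: relative_risk_for_allele_count(k) for k in range(5)}
for k, rr in risk_by_count.items():
    print(f"{k} variant allele(s): relative risk {rr:.2f}")
```

Under these assumptions, an individual carrying all four variant alleles would have a relative risk of about 1.63 compared with an individual carrying none, which illustrates why the effect of any single variant remains modest.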
2.1.1.2.2. Allelic Imbalance
In contrast to dosage effect, the other form of relative allelic expression is called allelic imbalance (AI). In a diploid organism, the ratio of the abundance of transcripts from the two alleles is expected to be ~1. That is, theoretically, the mRNA transcribed from the maternal chromosome and that transcribed from the paternal chromosome will have roughly the same number of copies. However, this is not always the case. When the ratio of the expression levels is not 1 to 1, the condition is called allelic imbalance. In practice, the measured ratio also varies for experimental reasons, but this can be controlled for by using the source genomic DNA
(gDNA) as a control (Pastinen, Ge et al. 2006).
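The gDNA control described above can be sketched as a simple calculation: the allelic ratio measured in cDNA is divided by the ratio measured in gDNA from the same heterozygous individual, where the true ratio is 1:1. The function name and the read counts below are hypothetical and serve only to illustrate the normalization.

```python
def normalized_allelic_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """Return the cDNA allelic ratio (A/B) divided by the gDNA allelic ratio (A/B).

    Dividing by the gDNA ratio removes assay bias that affects both templates,
    because the two alleles are present 1:1 in a heterozygote's gDNA.
    """
    if min(cdna_a, cdna_b, gdna_a, gdna_b) <= 0:
        raise ValueError("all counts must be positive")
    cdna_ratio = cdna_a / cdna_b
    gdna_ratio = gdna_a / gdna_b
    return cdna_ratio / gdna_ratio


# Hypothetical allele-specific read counts for one heterozygous individual.
ratio = normalized_allelic_ratio(cdna_a=600, cdna_b=400, gdna_a=510, gdna_b=490)
print(f"gDNA-normalized allelic ratio: {ratio:.3f}")
```

A normalized ratio close to 1 is consistent with balanced expression, whereas a value that departs from 1 suggests allelic imbalance. How large a departure is considered meaningful depends on the assay and is not specified here.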
There are a variety of reasons why the expression may vary between the alleles.
One reason is epigenetic silencing of either the maternal or paternal allele, also known as genomic imprinting (Crowley, Zhabotynsky et al. 2015). If one allele is
silenced completely, then there will be an extreme case of allelic imbalance. Other reasons include cis-acting genetic variations that alter regulation for just one allele through a change to promoter/enhancer regions (transcription factor binding sites)
(Schork, Thompson et al. 2013, Gusev, Lee et al. 2014, Pickrell 2014, Albert and
Kruglyak 2015), or even through 3′ UTR mutations that affect mRNA stability or microRNA binding (Nicoloso, Sun et al. 2010, Obsteter, Dovc et al. 2015). Such variants may alter expression of one particular allele only slightly, resulting in a lesser degree of imbalance. Classically, loss of heterozygosity (LOH) is a common form of allelic imbalance used to identify somatic-cell genetic changes. The detection of LOH has been used to identify genomic regions that harbor tumor suppressor genes and to characterize tumor stages and progression (Mei, Galipeau et al. 2000, Zeki and Fitzgerald 2015).
2.1.2. Transcript Regulation
The mRNA expression level of a gene is typically determined by several input signals that act at the steps of transcript synthesis (initiation, elongation, and termination) and at post-transcriptional processes (5' capping, addition of the poly A tail, and splicing). Among the various factors that can affect these processes, regulatory sequence has received increasing attention in recent decades. Genetic variation plays a crucial role in disease susceptibility through regulating gene expression. Association studies in mouse and human genomes have discovered many significant genetic variants that influence susceptibility to common diseases (Lohmueller, Pearce et al. 2003,
Mehrabian, Allayee et al. 2005, Cirulli and Goldstein 2010, Simmonds 2013). There are two major classes of genetic variants affecting gene expression across the genome: cis and
trans. Cis-acting variants are close to the gene(s) that they regulate and affect transcript
synthesis or stability in an allele-specific manner, whereas trans-acting variants are not close (usually on different chromosomes) and can affect both alleles of a gene (Williams,
Chan et al. 2007, Pardini, Naccarati et al. 2012). Both cis- and trans- polymorphisms as well as the non-additive interactions between the two, consequently, can contribute to the variation in gene expression (Rockman and Kruglyak 2006, Albert and Kruglyak 2015,
Waszak, Delaneau et al. 2015). In addition to genetic variants described above, epigenetic mechanisms (Jaenisch and Bird 2003), chromatin conformation (Higgs, Vernimmen et al.
2007), copy number variation (Cahan, Li et al. 2009, Henrichsen, Vinckenbosch et al.
2009) and microRNA (Obsteter, Dovc et al. 2015) all may affect the transcriptional regulation of a given gene.
2.1.2.1. Identify Trans-acting Variations
It can be very difficult to distinguish trans-acting genetic variants from cis-acting genetic variants associated with disease risk. Current approaches to identify trans-acting variations associated with disease risk are based on genome-wide mapping of expression quantitative trait loci (eQTLs). If a trans-acting effect is mapped to a chromosomal locus, the underlying variant may be a coding variant or regulatory variant in a gene involved in the transcriptional control of the gene(s) that is (are) affected.
Genome-wide mapping studies in yeast and mice showed that trans-acting loci broadly dispersed across regulation pathways are responsible for differences in gene expression
(Brem, Yvert et al. 2002, Schadt, Monks et al. 2003, Yvert, Brem et al. 2003, Albert and
Kruglyak 2015, Waszak, Delaneau et al. 2015).
Because relatively complex functional studies are necessary to validate the suggested
trans-acting variants underlying eQTLs, positional cloning has been used to clarify the biological
function of putative trans-acting variations, and combination with linkage analysis has revealed
‘linkage hotspots’ that interact with multiple unlinked loci to influence gene expression
(Schadt, Monks et al. 2003, Yvert, Brem et al. 2003, Albert and Kruglyak 2015, Pai,
Pritchard et al. 2015). A decade ago, none of the suggested trans-acting variants
underlying human eQTLs had been conclusively validated (Pastinen, Ge et al. 2006).
This may have been because trans-acting effects were difficult to replicate and because their
causal effects on expression are not trivial to establish. Therefore, using a “less-biased” experimental
approach is crucial to characterize the effects of trans-acting variants on gene regulation
(Costa, Aprile et al. 2013). Many databases have now become available that
rank eQTLs based on functional data from a variety of Encyclopedia of DNA Elements
(ENCODE) assays (Rosenbloom, Dreszer et al. 2010), for example, DNase I
hypersensitive sites (DHSs), chromatin immunoprecipitation, predicted transcription
factor binding, or reporter gene assays. RegulomeDB (Boyle, Hong et al. 2012), for
example, is a good source that can help investigators narrow down the pool of candidate
SNPs within an eQTL region (Gibson, Powell et al. 2015). An exciting outcome of identification of trans-acting variants in genes is the potential for finding new drivers of gene-regulatory networks (Schadt, Monks et al. 2003, Fehrmann, Jansen et al. 2011,
Westra, Peters et al. 2013, Gibson, Powell et al. 2015), which may in turn reveal more pronounced causal factors in diseases with a complex genetic basis such as lung cancer.
There is a lack of large-scale studies dedicated to trans-acting effects on gene expression in humans. The most important reason may be that trans-regulatory factors and
cis-regulatory sequences always interact non-additively with each other, and such interaction is required in the gene expression process (Gibson and Weir 2005, Gibson,
Powell et al. 2015). One example of such interaction is epistatic interaction between cis-
and trans-acting regulators. This epistatic interaction is a common feature of the complex
genetic architecture underlying quantitative phenotypes. Wittkopp and colleagues
compared ASE among different trans-regulatory genetic backgrounds in Drosophila
melanogaster in 2008. Among eight genes analyzed, they identified five genes which
were affected by trans-acting variation that altered total transcript levels and two genes
that were affected by differences in cis-regulation. However, they failed to characterize
the direct epistatic interaction between cis- and trans- acting regulatory polymorphisms
(Wittkopp, Haerum et al. 2008). Westra et al. recently identified and replicated trans
eQTLs for 233 SNPs in peripheral blood samples from 5,311 individuals using eQTL
meta-analysis. They further demonstrated that one trans eQTL which altered expression
of the IFN-α pathway gene functioned through the cis-regulatory effects of that site on a
transcription factor encoding gene (Westra, Peters et al. 2013). Notably, going beyond
univariate SNP-transcript associations, Kirsten and colleagues found 18% of analyzed
genes were trans-regulated in peripheral blood mononuclear cells of 2112 individuals
(Kirsten, Al-Hasani et al. 2015).
2.1.2.2. Identify Cis-acting Variations
Cis-regulatory variations usually reside in noncoding regulatory DNA
sequences associated with the target gene (Davidson, McClay et al. 2003), which harbor regulatory
elements such as promoters and enhancers; these may lie immediately upstream of the
gene, but can also be found hundreds of kilobases away (Wittkopp and Kalay 2012).
Whether displaying dosage effects or non-additive effects, and whether acting independently of or interactively with trans-
regulatory variations, cis-regulatory loci are fundamental to many processes, including
physiological adaptation, generation of cell diversity, and morphological development
(Beer and Tavazoie 2004).
Prior to more recent integrative analysis combining genotype data and gene
expression profile, the approaches for studies of cis-acting variants were restricted to the
in vitro polymorphic reporter construct assays using established tissues or human cell
lines. Most cis-acting variants validated through this approach resulted in a less than 5-fold difference in gene expression (Rockman and Wray 2002). Moreover, most of these studies assessed only total expression of target genes and did not control for inter-individual variation in trans-acting factors (Forsberg, Lyrenas et al. 2001), environmental factors
(Gebhardt, Zanker et al. 1999) and epistatic interaction of cis- and trans-effects (Hayashi,
Watanabe et al. 1991, Brophy, Hastings et al. 2001). Cis-acting variants manifest as allele-specific variation (Buckland 2004).
The more recent approaches to identify cis-acting variants are usually based on genome-wide mapping of eQTLs, and hence they are capable of evaluating cis-acting variants on a large scale (Morley, Molony et al. 2004, Cheung, Spielman et al. 2005, Zhang,
Li et al. 2014, Fung, Holdsworth-Carson et al. 2015, Kirsten, Al-Hasani et al. 2015,
Lappalainen 2015). In addition to offering high throughput, this approach studies genes in their native background instead of an artificial environment, exploring unknown cis-acting variations across the whole genome. Lastly, this approach integrates with linkage analysis and thus allows the detection of regulatory factors (cis-acting variants or markers in high linkage disequilibrium with them) that are acting over short and long range (Schadt,
Monks et al. 2003, Cheung, Spielman et al. 2005, Albert and Kruglyak 2015, Lam, Tay et
al. 2015, Lappalainen 2015, Pai, Pritchard et al. 2015). Promoter construct studies remain
useful to validate cis-acting variations discovered by such hypothesis-free methods. The identified cis-acting regulatory polymorphisms can provide intuitive clues for novel regulatory motifs and transcription factors regulating a specific gene (Morley, Molony et al. 2004, Fung, Holdsworth-Carson et al. 2015, Kirsten, Al-Hasani et al. 2015).
2.1.3. Epidemiology Study
Lung cancer is associated with a low survival rate, in part because it typically is at an advanced stage when first detected and treated (Ganti and Mulshine 2006). Studies to improve the post-diagnosis outcome of lung cancer through early detection using low-
dose spiral computed tomography (LDCT) screening and surgical intervention are
promising (Unger 2006, Lock and Rodrigues 2007, Smith, Manassaram-Baptiste et al.
2015). However, because as many as 90 million active or former smokers in the United
States alone are candidates for screening according to the demographic criteria (i.e. age
55-80, > 30 pack-years smoking history and currently smoke or have quit within the past
15 years), the potential cost is very high and may be prohibitive (Ganti and Mulshine
2006, Burger, Kass et al. 2008). Additionally, LDCT screening studies completed thus far
are associated with a high incidence of false-positive findings which may lead to unnecessary follow-up diagnostic testing, including biopsies and surgical procedures, with associated risk and emotional and financial cost to the patient (Vansteenkiste,
Dooms et al. 2012, Marshall, Bowman et al. 2013).
Consequently, there is an unmet need to use biomarkers to focus LDCT efforts on persons who are at highest risk for lung cancer. A molecular diagnostic that further
stratifies the individuals at highest risk for lung cancer within the epidemiologically defined
high-risk group will enable more accurate selection of individuals who are most likely to
develop lung cancer in their lifetime and reduce the risk and cost of annual LDCT screening.
When we investigate the role of candidate genes or genetic variations that contribute to a
specific trait/disease by testing for a correlation between trait/disease status and genetic
variation, two kinds of observational study designs are most likely used: case-control
study and cohort study.
2.1.3.1. Case-control Study
Case-control studies are often used to identify factors that may contribute to a
phenotype or disease by comparing subjects who have that trait/disease (the "cases") with
patients who do not have the trait/disease but are otherwise similar (the "controls"). Data
about exposure to a risk factor or several risk factors are then collected retrospectively,
typically by interview, abstraction from records, or survey (Song and Chung 2010,
Basuli, Stevens et al. 2014). Case-control studies are comparatively quick and inexpensive, and require a relatively small sample size (Lewallen and Courtright 1998). Therefore, a
case-control study is usually conducted before a cohort or an experimental study to
identify the possible etiology of the disease.
Case-control studies are particularly appropriate when there is good evidence of
an association between a certain exposure and the disease and when disease is rare and
exposure is frequent among those with the disease. They are also well suited for (1)
investigating outbreaks, and (2) studying rare diseases or outcomes (Lewallen and
Courtright 1998). The limitations for a case-control study include (1) selection bias; (2)
recall bias; (3) inefficiency for rare exposures. Criteria or definition of cases must be well
formulated and documented (Raphael 1987). If cases are misclassified (include false positives), the findings may be false. Controls should be selected from the same
population that gives rise to the cases and independently of their exposure status
(Sedgwick 2015). The specific relative measure of effect (rate ratio, risk ratio or odds
ratio) that can be estimated from a case–control study depends on the type of sampling
design used in the selection of the controls.
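As an illustration of one of the relative measures named above, the following minimal sketch computes an odds ratio from a 2x2 case-control table, with an approximate 95% confidence interval on the log scale (the standard Woolf-type interval). The function name and the counts are hypothetical, and the sketch does not address which measure a given control-sampling design actually estimates.

```python
import math


def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 case-control table."""
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    # Odds of exposure among cases divided by odds of exposure among controls.
    estimate = (a * d) / (b * c)
    # Standard error of log(OR).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical counts: 100 cases and 100 controls, classified by exposure.
or_est, or_low, or_high = odds_ratio_with_ci(40, 60, 20, 80)
print(f"OR = {or_est:.2f} (95% CI {or_low:.2f} to {or_high:.2f})")
```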
2.1.3.2. Cohort Study
A cohort study, also called a longitudinal study or incidence study, is
appropriate when there is good evidence of an association of the disease with a certain
exposure, when exposure is rare and incidence of disease among the exposed is frequent,
or when the time between exposure and disease is short (Song and Chung 2010). A
cohort study can be prospective or retrospective. Prospective cohort studies begin with
disease-free patients, classify patients as exposed/unexposed, record outcomes in both groups, and then compare outcomes using relative risk. In contrast, retrospective cohort studies identify exposed and unexposed groups after both exposure and disease have occurred.
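The comparison of outcomes by relative risk can be sketched as follows. The function computes the absolute risk (cumulative incidence) in each group and their ratio. The function name and the cohort counts are hypothetical and are used only to show the arithmetic.

```python
def cohort_risks(exposed_with_disease, exposed_total,
                 unexposed_with_disease, unexposed_total):
    """Return (risk in exposed, risk in unexposed, relative risk) for a cohort."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    if unexposed_with_disease <= 0:
        raise ValueError("relative risk is undefined with no unexposed cases")
    risk_exposed = exposed_with_disease / exposed_total
    risk_unexposed = unexposed_with_disease / unexposed_total
    return risk_exposed, risk_unexposed, risk_exposed / risk_unexposed


# Hypothetical cohort: 1,000 exposed and 1,000 unexposed disease-free subjects.
risk_e, risk_u, rr = cohort_risks(30, 1000, 10, 1000)
print(f"risk exposed = {risk_e:.3f}, risk unexposed = {risk_u:.3f}, RR = {rr:.1f}")
```

Because group membership is defined by exposure at entry, both the per-group risks and their ratio can be estimated directly from the cohort.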
In addition to examining rare exposures, because entrance into a cohort study begins with exposure status, investigators can monitor the occurrence of multiple diseases potentially caused by an exposure. Finally, cohort studies allow the direct measurement of the absolute risk of developing a disease after an exposure (Euser, Zoccali et al. 2009).
The weaknesses of a cohort study compared to a case-control study include (1) exposure can change over time (e.g. aging, lifestyle (diet, smoking pattern), air pollution); (2) changes in methods over time can affect disease identification; (3) a long follow-up period; (4) high cost; (5) subject selection bias. These issues are especially problematic when studying relatively rare diseases, for instance, lung cancer (Looney and Hagan 2015, Crawford 2016). Overall, we believe a prospective cohort study gives us greater power to detect and address the absolute risk of putative genetic variations considered as rare exposures.
2.2. Identification of cis-acting Variations
2.2.1. Single Nucleotide Polymorphism Analysis
As the most common genetic variations, numerous single nucleotide polymorphisms (SNP) have been shown to be associated with disease susceptibility such as lung cancer risk, abnormal expression levels of genes, and altered gene regulation (Shen,
Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al. 2006, Gallegos
Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Blomquist, Crawford et al. 2010, Buch,
Diergaarde et al. 2012, Tseden-Ish, Choi et al. 2012, Blomquist, Brown et al. 2013,
Albert and Kruglyak 2015, Kang, Ma et al. 2015, Yoo, Jin et al. 2015). One can associate
SNPs with these diverse phenotypes or eQTLs by genotype of a single SNP or by haplotype comprising multiple SNPs. Cis-acting variations can be detected through univariate
SNP-based analysis (i.e., genotype analysis) and region-based analysis (i.e., haplotype analysis).
2.2.1.1. Genotype Analysis
As of the end of 2015, over 150 million SNPs had been validated in the human genome and deposited in a public database, compared to 12 million 8 years earlier.
Genotyping of SNPs is a procedure that identifies the alleles present at a given polymorphic site. Over the past couple of decades, genotype analysis of candidate SNPs
has become an important part of genetic association studies (Morley, Molony et al. 2004)
to reveal disease risk loci. Today, high-throughput genotyping is possible with SNP
arrays and whole-genome sequencing (Keating, Tischfield et al. 2008, Cirulli and
Goldstein 2010, Global Lipids Genetics, Willer et al. 2013, Simmonds 2013). These
recently developed methods enable genetic association studies with large-scale genotype
data (Cirulli and Goldstein 2010, Simmonds 2013). Moreover, new integrative tools
employ predicted gene functions to systematically prioritize the most likely causal genes
at associated loci, highlight enriched pathways and identify tissues/cell types where genes
from associated loci are highly expressed (Global Lipids Genetics, Willer et al. 2013,
Pers, Karjalainen et al. 2015). SNP genotyping is a straightforward process, and the
protocol relies on three major components: target amplification, allelic discrimination reactions, and
allele-specific product detection (Chen and Sullivan 2003,
He, Holme et al. 2014). The mechanism used for allelic discrimination largely determines
the specificity and accuracy of genotyping.
Primer extension (i.e. SNuPE) can quantitate ASE based on radiolabel or
fluorescence, as discussed in section 2.2.2.2, and can also identify the nucleotide at a
polymorphic site by utilizing the 5’ to 3’ DNA synthesis activity of DNA polymerase and
specific terminator nucleotides (ddNTP) complementary to the polymorphic site. The
detection system for this can be mass spectrometry, microarray-based hybridization,
fluorescence resonance energy transfer (FRET) or DNA sequencing (Nikiforov, Rendle
et al. 1994, Pastinen, Partanen et al. 1996, Chen, Levine et al. 1999, Pastinen, Raitio et al.
2000, Shi 2001). High-throughput platforms using primer extension have now emerged, for example, arrayed primer extension (Kranaster, Ketzer et al. 2008, Jiang,
Willner et al. 2013). The accuracy of this method depends heavily on the error rate of DNA polymerases
during DNA synthesis, which has been reported as non-negligible (from 0.2% to more than
15% per locus) (Kunkel 2004, Pompanon, Bonin et al. 2005). In contrast, allele-specific
PCR (AS-PCR) requires two allele-specific primers, each of which anneals to the target with its
3’-terminal base matching one of the two alleles of an SNP (Germer and Higuchi 1999,
Myakishev, Khripin et al. 2001, Liew, Pryor et al. 2004, Hayashi, Hagihara et al. 2008).
AS-PCR makes use of the difference in extension efficiency between primers with matched and mismatched 3’ bases (Wallace 1991, Shen, Tian et al. 2015).
The TaqMan assay uses dually fluorescence-labeled probes designed to anneal to the two alleles at the SNP site. Each of the two probes anneals to its matching allele to form a stable structure that leads to its degradation by the advancing DNA polymerase; when it anneals to the other allele, it forms a less stable structure, so that the probe is displaced from the template without being cleaved. The cleavage of the dually labeled probe changes the status of FRET between the two fluorophores, providing a mechanism for its detection (Biosystems). Because it requires PCR with SNP-specific probes and primers, it is well suited for a few markers but not for a large-scale study with a thousand markers.
Although 6.2 million SNP genotyping assays are pre-designed and available off-the-shelf, the probe design for rare variant SNPs remains empirical and requires substantial optimization.
The number of genotyped SNPs can be maximized by choosing tag SNPs and the relations of tag SNPs to biological pathways can prioritize candidate SNPs for association studies (Global Lipids Genetics, Willer et al. 2013). Utilizing the feature of tag SNPs and single-base extension, whole-genome genotyping is currently available through high-
density SNP arrays, for example, Human-1 Genotype BeadChip (Steemers, Chang et al.
2006), Infinium HD BeadChip (Illumina). Besides Illumina SNP chips,
Affymetrix SNP chips are also popular in the GWAS research area. The Affymetrix
microarray technology relies on the differential hybridization of genomic DNA to 25-mer
probes which match SNP alleles, while the Illumina Infinium technology uses
hybridization followed by primer extension (Jiang, Willner et al. 2013). Although these
SNP arrays work using different chemistries, they have several aspects in common. Both
rely on the biochemical principle that nucleotide bases bind to their complementary
partners: specifically, A binds to T and C binds to G, in Watson–Crick base pairs. Both array protocols call for the hybridization of fragmented single-stranded DNA to arrays containing millions of unique nucleotide probe sequences. Each probe is designed to bind to a target SNP. Both are scalable high-throughput SNP genotyping assays that deliver robust high-quality genotyping data.
2.2.1.2. Haplotype Analysis
Although single-SNP analysis has proven to be useful in discovering many disease-associated loci, this strategy may be limited by very stringent significance
thresholds, epistatic effects, and poor reproducibility (Wu, Kraft et al.
2010). For some genes whose expression was found to be heritable, no significant
eQTLs were identified even though the power calculation based on sample size suggested ~90%
power to detect such loci (Deutsch, Lyle et al. 2005). It is thus evident that in many cases, expression traits are regulated by multiple loci, each of which contributes only modestly to the trait. Recently, several GWAS found that in the chr15q24-25 region multiple SNPs were associated with smoking quantity (cigarettes per day, CPD) or lung
cancer risk while genotypic odds ratio at each SNP was not significantly different from
1.0 when smoking quantity was controlled (Amos, Wu et al. 2008, Hung, McKay et al.
2008, Thorgeirsson, Geller et al. 2008, Wang, Broderick et al. 2008, Caporaso, Gu et al.
2009, Dai, Zhu et al. 2015, Niu, Wang et al. 2015). Specifically, when two GWAS data
sets as part of the Cancer Genetic Markers of Susceptibility (CGEMS) project were
analyzed, association testing for each SNP revealed that none of the SNPs achieved
genome-wide significance (p < 10^-7), while in the chr15q25.1 region spanning the nicotinic receptor genes CHRNA3 and CHRNA5 multiple SNPs were associated with cigarettes per day
(CPD) (Caporaso, Gu et al. 2009). This suggests the benefit of region-based analysis over
SNP-based analysis and has led to more interest in the haplotype, the combination of marker alleles on a single chromosome.
The 1000 Genomes Project genotyped over 95% of SNPs that are in accessible genomic regions and that have an allele frequency of 1% or higher in 1092 individuals from different ethnic populations. It also provided haplotype information on these SNPs
(Genomes Project, Abecasis et al. 2010). Haplotype analysis of SNPs, which was neglected by traditional genetic association studies, provided more information than single-SNP analysis in a study of intramuscular fat percent (IMF) and discovered significantly associated genes (Barendse 2011). In addition, common heritable allelic imbalance phenotypes can now be mapped in unrelated individuals to establish regulatory haplotypes (Knight, Keating et al. 2003, Musunuru, Strong et al. 2010, Aroucha, Carmo et al. 2016). For example, the Pastinen group demonstrated a strong regulatory haplotype in the human BTN3A2 locus, which spanned at least 15 kb flanking the gene (Pastinen,
Sladek et al. 2004). A pioneering study by Musunuru and colleagues fine mapped
haplotypes associated with low-density lipoprotein cholesterol. A combination of systematic reporter assays and GWAS established that the minor allele at SNP rs12740374 was associated with ASE of SORT1 by creating a C/EBP transcription factor binding site (Musunuru, Strong et al. 2010). There is additionally theoretical and empirical
evidence (Akey, Jin et al. 2001, Schaid 2006) that haplotype-based analysis may possess higher power than SNP-based analysis. For example, Aroucha and colleagues evaluated
the association of SNPs in the TNF-α gene and SNPs in the IL-10 gene with hepatocellular
carcinoma (HCC) and found that a low IL-10 production diplotype was associated with HCC
risk (Aroucha, Carmo et al. 2016).
When the haplotype, a more powerful discriminator, is used in an association study,
one faces the problem of phase inference, that is, reconstructing haplotypes for individuals.
Haplotype phase can be estimated using computational approaches or it can be generated
through laboratory-based experimental methods for phasing single individuals (Browning and Browning 2011). Haplotype phasing has been accomplished through several statistical algorithms to infer unknown haplotypes from population genotype data sets and family genotype data sets (Stephens, Smith et al. 2001, Browning and Browning
2007, Howie, Donnelly et al. 2009). Clark’s algorithm, developed in 1990, was the first published method for haplotype phase inference for multiple markers in unrelated individuals, but it is only suitable for tightly linked SNPs (Clark 1990). Following this, a variety of algorithms such as EM (Excoffier and Slatkin 1995), hidden Markov models
and so on, were developed to improve haplotype phasing. PHASE, developed by Stephens
and colleagues (Stephens, Smith et al. 2001, Stephens and Scheet 2005), was considered
the gold standard for population-based haplotype phasing algorithms (Marchini, Cutler et
al. 2006). It is suitable for up to 100 markers and up to several hundred individuals.
MACH and IMPUTE2 (Howie, Donnelly et al. 2009) have primarily been used for the imputation of nongenotyped variants but can also be used for haplotype phase inference for larger data sets than PHASE.
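The statistical phasing idea described above can be illustrated with a minimal sketch of an EM-style haplotype frequency estimate for two biallelic SNPs. This is not the implementation of PHASE, MACH, or IMPUTE2; the function name, the 0/1 allele coding, and the fixed iteration count are assumptions made only for illustration. With two SNPs, only the double heterozygote is phase-ambiguous, so the expectation step splits each such individual between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
from collections import Counter


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2) tuples, where each g is the number of
    copies (0, 1, or 2) of allele "1" at that SNP (hypothetical coding).
    Returns a dict mapping each haplotype (a1, a2) to its frequency.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}  # uninformative starting values
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = Counter()
        for g1, g2 in genotypes:
            if (g1, g2) == (1, 1):
                # E-step: the double heterozygote is either 00/11 or 01/10.
                p_cis = freqs[(0, 0)] * freqs[(1, 1)]
                p_trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = p_cis + p_trans
                w = p_cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous when at most one site is heterozygous.
                counts[(g1 // 2, g2 // 2)] += 1
                counts[((g1 + 1) // 2, (g2 + 1) // 2)] += 1
        # M-step: update frequencies from the expected haplotype counts.
        freqs = {h: counts[h] / n_chrom for h in haps}
    return freqs
```

In this toy setting, a sample dominated by 00/00 and 11/11 homozygotes pulls the double heterozygotes toward the 00/11 phase, which mirrors how population data inform phase in unrelated individuals.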
By contrast, sequencing is an experimental phasing approach that automatically produces some information on phase. Experimental phasing is expensive and labor-intensive; nevertheless, it is more direct and accurate than computational phasing.
Dear and Cook have proposed concepts for extensively resolving local haplotypes (Dear
and Cook 1989) and Sauer and Olson developed clone-based, targeted haplotyping methods for long segments of the human genome (Burgtorf, Kepper et al. 2003,
Raymond, Subramanian et al. 2005). Since 2005, next generation sequencing (NGS) has become considerably less expensive than Sanger sequencing, but reads are shorter, up to 150 bp with paired-end sequencing (Illumina), providing less information for phasing.
Kitzman et al. in 2011 first experimentally phased 94% of ascertained heterozygous SNPs directly into long haplotype blocks (N50 of 386 kilobases (kbp)) by combining a fosmid library with NGS to determine the haplotype-resolved genome of a
South Asian individual (Kitzman, Mackenzie et al. 2011). Nevertheless, large-insert cloning is still technically challenging and not readily scalable. To address this limitation,
Paul and Apgar described a single molecule dilution followed by multiple strand displacement amplification for targeted haplotyping the human leukocyte antigen (HLA) genes (Paul and Apgar 2005). Peters and colleagues recently reported haplotyping from
10-20 human cells using long fragment read assembled by NGS without cloning (Peters,
Kermani et al. 2012). More recently, Kaper et al. (Kaper, Swamy et al. 2013)
implemented this technology into Illumina Genome Analyzer IIx and successfully phased
96% of heterozygous SNPs into 9,243 haplotype blocks with average size of 264 kb,
maximum size of 4.8 Mb and an N50 of 702 kb. Finally, Kuleshov et al. (Kuleshov, Xie
et al. 2014) applied a statistical algorithm to partially phased information contained in
long fragments assembled by NGS-generated short reads and phased 99% of SNPs into
0.2-1 Mb haplotype blocks to determine allele-specific methylation patterns in the human genome. In general, improvements in sequencing technologies will enable researchers to assemble haplotypes from sequencing data with very high accuracy, which opens up the
opportunity to use high-quality haplotypes and genotypes in sequencing association
studies.
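The N50 values quoted above summarize the size distribution of phased haplotype blocks. As a minimal sketch (the function name and the example block lengths in the usage note are hypothetical, not data from the cited studies), N50 can be computed as the block length at which the cumulative length of blocks, sorted from longest to shortest, first reaches half of the total phased length.

```python
def n50(block_lengths):
    """Return the N50 of a collection of haplotype block lengths.

    Blocks are sorted from longest to shortest; the N50 is the length of
    the block at which the running total first reaches half of the sum.
    """
    if not block_lengths:
        raise ValueError("at least one block length is required")
    ordered = sorted(block_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For example, blocks of 400, 300, 200 and 100 kb have a total length of 1,000 kb; the two longest blocks already cover 700 kb, so the N50 is 300 kb.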
Our lab used competitive PCR to quantitate the absolute amount of ASE. By
comparing ASE within same sample from NBECs of cases and controls, interindividual
variation in trans-factors and environmental factors is minimized and population-based
studies can be performed to sort out the single cis-acting SNP or haplotypes within each
locus. Compared to genotype of an individual SNP, haplotype is a more powerful
discriminator between cases and controls in disease GWAS. Use of haplotypes in GWAS
reduces the number of tests to be carried out and is necessary for linkage analysis. With
haplotypes we can identify causal SNPs regulating genes nearby, understand gene
regulation mechanism, and conduct disease association studies as well as evolutionary
studies.
2.2.2. Empirical Approaches to Assess the Effect of cis-acting Variations on Transcript Abundance
2.2.2.1. Assessing cis-acting Effects by Total Expression
To localize cis-acting determinants of eQTLs in humans, Cheung and colleagues combined microarray expression data with publicly available SNP genotype data and applied genome-wide mapping linkage analysis to identify master regulators of transcription (Morley, Molony et al. 2004, Cheung, Spielman et al. 2005, Albert and
Kruglyak 2015, Fung, Holdsworth-Carson et al. 2015, Lam, Tay et al. 2015, Lappalainen
2015, Pai, Pritchard et al. 2015). They found approximately 1.44 to 8 fold difference in mean expression level of target genes, suggesting the degree of differential total expression attributable to cis-effects varies considerably (Morley, Molony et al. 2004).
Similar platforms were used to identify the common genetic variations that explain gene total expression differences among different set of individuals (Monks, Leonardson et al.
2004, Stranger, Forrest et al. 2005). More recently, the Grundberg group used Illumina
Human HT-12 V3 BeadChips to profile genome-wide RNA expression and reveal the contribution of low-frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility (Grundberg, Small et al. 2012).
These microarrays or bead arrays are relatively prevalent so quite a few analysis tools are available. Needless to say, common sources of error can be sorted out by carefully following standard quality control steps (Churchill 2002, Dumur, Nasim et al. 2004,
Zakharkin, Kim et al. 2005, Guindalini and Pellegrino 2016). Additionally, the high-density coverage of probes can assay the known regions of the transcriptome at once.
In contrast to above massive parallel platforms, non-parallel methods, for example
TaqMan real-time PCR (Deutsch, Lyle et al. 2005), have also been applied to assess difference in total expression level of genes of interest affected by causal cis-acting variations. These methods look for correlation between SNP genotype and gene
expression level to infer the cis-acting effects. The restriction to local genes, nevertheless, limits the implementation for genome-wide study.
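The genotype-expression correlation underlying these total expression methods can be sketched as follows. The example assumes an additive coding of genotype as the number of copies of one allele (0, 1, or 2) and uses a plain Pearson correlation; both choices, and the function name, are illustrative assumptions rather than the analysis used in the cited studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: allele dosage at one SNP and total expression of a
# nearby gene (arbitrary units) in six individuals.
dosage = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
r = pearson_r(dosage, expression)
```

A strong correlation of this kind is consistent with a cis-acting effect but, as discussed in the following paragraph, total expression is also influenced by trans-acting and environmental factors.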
Total expression of target genes to infer cis-acting variations, however, has poor reproducibility for the reasons discussed at the beginning of section 2.2.2, including interference from interindividual variation in trans-regulatory DNA sequences and epistatic mechanisms, copy number variation, and environmental factors. Besides these evident pitfalls, total expression data have been found to be highly discordant between the same sample in different laboratories, between platforms, and even between replicates on the same platform (Pastinen, Ge et al. 2006). Even strong trans-acting signals are less likely to show reproducible expression of the target gene (r² values are
lower) across studies compared with cis-acting signals (Pastinen, Ge et al. 2006), and
the authors concluded that cell culture is a major source of variation and false-positive signals. These findings, along with other evidence (Hayashi, Watanabe et al. 1991,
Gebhardt, Zanker et al. 1999, Brophy, Hastings et al. 2001, Forsberg, Lyrenas et al. 2001,
Buckland 2004) support that an environmental confounder could create a spurious trans-
linkage or association, thus only the strongest cis-acting signals may seem replicable.
2.2.2.2. Assessing cis-acting Effects by Allele-Specific Expression
In addition to total expression, cis-regulatory effects can be alternatively
measured by allele-specific expression (ASE). ASE assays (Yan, Yuan et al. 2002,
Pastinen and Hudson 2004, Pastinen, Sladek et al. 2004, Song, Kim et al. 2012, Muir,
Perumbakkam et al. 2014) are optimal for detecting cis-acting effects, as each allele
serves as an internal control for the other, and trans-acting effects or environmental
conditions that differentially influence gene expression among samples should not
interfere. Only cis-acting changes in the relative expression of alleles yield reproducible differences between allelic abundances of transcripts. The cis-regulatory effects on ASE of a gene can be assessed by in vitro and in vivo methods.
Several different methods have been described to quantitate ASE in a number of genes. One of those methods is single nucleotide primer extension (SNuPE) where a transcribed polymorphism is used as a marker to distinguish between the mRNA products of the parental chromosomes, and radiolabeled or fluorescent nucleotides are used to distinguish the mRNA product of each allele (Singer-Sam, LeBon et al. 1992, Penny,
Kay et al. 1996, Bhatnagar, Zhu et al. 2014). Fluorescent dideoxy terminator–based method quantitated by RT-PCR revealed cis-acting inherited variations in gene expression by comparing the relative abundance of each allele of the same genes from a heterozygous individual in 96 CEPH samples (Yan, Yuan et al. 2002). A sophisticated method that enables measuring absolute ASE or AI is competitive PCR with a known number of artificial DNA standards spiked in reverse transcribed cDNAs (Apostolakos,
Schuermann et al. 1993). Ding and colleagues developed an approach combining competitive PCR and matrix-assisted mass spectroscopy that is capable of relative and absolute quantification of gene expression with high sensitivity, and claimed high throughput and accuracy similar to other methods (Ding and Cantor 2003). Gel electrophoresis following competitive PCR accurately quantitated absolute ASE or allelic imbalance (ASE ratio, the ratio of transcripts derived from the respective alleles) with excellent linear dynamic range and >95% allelic specificity, thus enabling measurement over a 400-fold range of one allele relative to the other (i.e., allele A/B ratio from 1:20 to 20:1)
(Blomquist, Crawford et al. 2010).
Both gene copies come from the same sample and have been subject to the same
environmental influences including genetic trans-acting factors and experimental variations including mRNA degradation. It makes ASE ratio or AI a powerful approach for determining the cis-acting effects on gene regulation (Yan, Yuan et al. 2002, Pastinen and Hudson 2004, Ge, Pokholok et al. 2009, Bell, Kane et al. 2013). As stated at the beginning of this section, in the absence of either cis-acting sequence variation or epigenetic effects affecting expression of the target mRNA, each chromosome should be equally expressed regardless of the absolute level of gene total transcript abundance. In samples that are heterozygous for a cis-acting regulatory variant or epigenetic modification, mRNA originating from one chromosome will be expressed at a higher level than that from its sister chromosome and this can be detected by changes in the ASE ratio or AI. The experimental variation in this ratio measured in cDNA reverse transcribed from mRNA sample can be controlled by using matched gDNA as a control.
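The use of matched gDNA as a control can be sketched numerically. The example below assumes that the correction is applied by dividing the allele A/B ratio measured in cDNA by the same ratio measured in gDNA from the same heterozygous individual, where the true allelic ratio is 1:1; this division, the function names, and the input values in the usage note are illustrative assumptions and not a description of a specific published protocol.

```python
import math


def normalized_ase_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """Allele A/B ratio in cDNA divided by the A/B ratio in matched gDNA.

    A value of 1.0 indicates balanced allelic expression after correcting
    for assay bias that affects both cDNA and gDNA measurements.
    """
    if min(cdna_a, cdna_b, gdna_a, gdna_b) <= 0:
        raise ValueError("all allele measurements must be positive")
    return (cdna_a / cdna_b) / (gdna_a / gdna_b)


def log2_imbalance(ratio):
    """Express an ASE ratio on a log2 scale, symmetric around zero."""
    return math.log2(ratio)
```

For instance, hypothetical cDNA signals of 300 (allele A) and 100 (allele B) with gDNA signals of 110 and 100 give a corrected ratio of about 2.7 rather than the uncorrected 3.0.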
In spite of the obvious benefits stated above, the drawbacks of the allelic imbalance approach are the lack of validated high-throughput assays for human genes (Yan, Yuan et al. 2002, Knight 2012), the limitation to samples that are heterozygous (Ge, Pokholok et al. 2009), and the fact that non-heritable factors (such as epigenetic events) may influence allelic representation (Jaenisch and Bird 2003, Higgs, Vernimmen et al. 2007,
Hutchinson, Raj et al. 2014).
2.2.3. Experimental Approaches to Assess cis-acting Variations
2.2.3.1. In vitro Approaches
In vitro methods most commonly involve transient transfection of a synthetic reporter construct containing candidate regulatory polymorphisms into existing cell lines
and tissues, followed by assessment of transcriptional activity through the reporter assay. They are therefore
best suited for hypothesis-driven studies to test whether putative regulatory
polymorphisms affect gene expression. Rockman and colleagues surveyed more than 400
studies through 2001 in which each cis-regulatory variant has been studied in depth as the
best approach presently available to study the dynamics of functional cis-regulatory
variants (Rockman and Wray 2002). The authors claimed that the functional
consequences of a cis-regulatory polymorphism depend on cell type, culture condition,
the distribution of exogenous inducers, and covariation at other sites in the genome.
These variations have been explained in section 2.2.2.1. Covariation at other sites in the
genome refers to numerous factors, for example, the transcriptional environment within
the cells. Transfection of the homozygous G allele of a SNP at -1607 bp in the matrix
metalloproteinase (MMP)-1 resulted in a low level of MMP-1 transcription in normal
fibroblasts but a higher level in melanoma cells (Rutter, Mitchell et al. 1998). This
indicates that combination of cis-acting sequences in the MMP-1 promoter and specific trans-acting factors can dramatically increase transcription. Maurano and colleagues have identified that cis-regulatory SNPs perturb transcription factor recognition sequences and these cis-regulatory SNPs were tissue-selectively enriched (Maurano, Humbert et al.
2012). Transient transfection studies thereby may be complicated by trans-acting influences on allelic expression.
Moreover, the experimental methods that most reporter construct studies used introduced additional variation. For example, a cis-regulatory variant may have no effect on transcription unless the necessary introns or downstream elements that physically interact with it to transduce the transcriptional output are present in the construct. The
spatial proximity of relevant DNA elements, irrespective of their relative genomic
position, has been demonstrated (Miele and Dekker 2009), although not very well
defined. Therefore, the putative promoter or upstream flanking regions most commonly
targeted by studies are often poorly characterized and do not represent the complete
promoter that is active in the cell line. In addition to the experimentally validated
promoter database, the initial choice of allele-specific constructs for transfection studies can be refined by deletion experiments (Zhang, Min et al. 1995, Xu, Berglund et al.
1999). More faithful reproduction of natural gene regulation system containing proximal promoter regions and enhancer can be achieved by cloning whole human genes in bacterial vectors (Wade-Martins, Saeki et al. 2003). Thirteen out of 17 cloned promoters showed reliably greater than 2-fold increase in transcription activity relative to the control across three human cell lines (Hoogendoorn, Coleman et al. 2003), suggesting that >30% of proximal promoters may harbor cis-acting variants. But most published studies have used relatively small promoter constructs (<1 kb) or oligonucleotide subset of minimal promoter (Zhang, Min et al. 1995, Cooper, Trinklein et al. 2006).
In addition, recent knowledge and methods gained from lung cancer genome-wide association studies (GWAS) (Amos, Wu et al. 2008, Wang, McKay et al. 2014) and
Encyclopedia of DNA Elements (ENCODE) studies are valuable for identification of potential cis-regulatory variations (Djebali, Davis et al. 2012, Melnikov, Murugan et al.
2012). Massively parallel reporter assay (MPRA) was developed where one can
synthesize and clone DNA regulatory elements to generate a library of reporter
constructs. Tag expression is then assayed by high-throughput sequencing
(Melnikov, Murugan et al. 2012). Generally, transient transfection studies to identify cis-
regulatory variants are often performed in preexisting human cell lines or tissues, which
is one source of variation relative to the human tissue of interest. Small trans-acting differences
resulting from other genetic variants (cis- and/or trans-) in the host would be another significant source of variations (Rutter, Mitchell et al. 1998).
2.2.3.2. In vivo Approaches
In vivo monitoring of relative allelic expression, first reported with the single-nucleotide primer extension (SNuPE) assay in 1992 (Singer-Sam, LeBon et al. 1992), is possible in tissues or cells of individuals heterozygous for a single regulatory variant or harboring specific haplotypes of well-circumscribed regions. The details of how SNuPE works have been discussed in section 2.1.2.2. This amplification-based in vivo method has several advantages: 1) quantifiable and sensitive allele-specific discrimination based on specific nucleotide selection by DNA polymerase bound to the DNA template; 2) the capability to quantify relative expression of two alleles in the same tissue sample.
The SNuPE assay described by Singer-Sam et al. was able to quantitatively measure changes in 0.1% of a transcript population, which was adequate for ASE studies, and has been successfully applied to identify the master regulator Xist for X chromosome inactivation
(Penny, Kay et al. 1996). In addition to radiolabeled or fluorescently labeled primers, allele-specific oligonucleotide primers with a 3’-terminal mismatch have been shown to discriminate the mRNA products derived from a specific allele with less labor.
When the 3’-end nucleotide of one primer is perfectly matched with desired DNA template, the PCR product will extend normally. On the other hand, when the 3’-end nucleotide of the primer is mismatched, the PCR reaction will be blocked or proceed at a reduced efficiency (Wallace 1991). Without additional steps of probe hybridization,
ligation, or restriction enzyme cleavage, two allele-specific oligonucleotide primers
correctly determined the genotypes in 12 individuals (Wu, Ugozzoli et al. 1989, Milbury,
Li et al. 2009). However, spurious amplification products were found together with the
expected ones. Concentrations of PCR buffer with optimal Mg2+ and allele-specific
primers, the number of PCR cycles, annealing temperature, etc. need to be carefully
adjusted to optimize the specificity (Wallace 1991).
In the 2000s, Morley et al. from Cheung’s group, mentioned earlier, compared the expression of the two alleles of the same marker using allele-specific quantitative RT-PCR
(qRT-PCR) and confirmed the allelic differences in cis (mean fold difference = 1.6)
(Morley, Molony et al. 2004). This indicated that qRT-PCR combined with allele-specific
priming is particularly accurate for detection of ASE. Now, more precise and
sensitive quantification of alleles present in a sample is possible using digital techniques
(Pekin, Skhiri et al. 2011, Mazutis, Gilbert et al. 2013) to genotype SNPs in circulating
tumor DNA samples (Diaz and Bardelli 2014).
In contrast to in vitro methods, which are often confounded by variation in cell
type, environmental factors, and trans-acting factors in unrelated individuals, in vivo
methods offer several advantages (Pastinen and Hudson 2004): (i) alleles are expressed in their
normal environment including genomic and chromatin context; (ii) comparison of alleles
is made within rather than between samples, maximizing the sensitivity of detecting cis-
acting effects by minimizing inter-individual variation in trans-factors; (iii) the
developmental and physiologic history of the tissue is unlikely to be perturbed by the
presence of two low- or two high-expressing alleles; and (iv) population-based studies
allow sampling of haplotype diversity within each locus.
2.3. Quality Controlled Molecular Diagnostic Tests Based on RNA
2.3.1. Reverse Transcription-Polymerase Chain Reaction
Reverse transcription-polymerase chain reaction (RT-PCR) is a powerful tool for the detection and quantification of mRNA. It has high sensitivity, good reproducibility, and wide dynamic range of quantification. A cDNA reverse transcribed from mRNA is amplified by PCR. One can detect the amplified product at “end-point” or by “real- time”. End-point determinations analyze the reaction after it is completed, and real-time determinations monitor the reaction in the thermal cycler as it progresses (Freeman,
Walker et al. 1999, Pfaffl 2004).
End-point detection methods include ethidium bromide gel staining, radioactivity labeling, high performance liquid chromatography, Southern blotting, fluorescence labeling, or densitometry analysis (Ferre 1992, Reischl and Kochanowski 1995). Without
appropriate controls, this post-PCR step leads to high intra- and inter-assay variability and a
lower dynamic range than real-time PCR. This property is a drawback for quantitative
measurements because small differences in the multiplication factor lead to large
differences in the amount of product (Raeymaekers 2000, Pfaffl 2004).
Heid et al. developed real-time PCR, which measures PCR product accumulation
through dual-labeled fluorogenic probes in which one fluorescent dye serves as a
reporter, FAM (6-carboxyfluorescein), and its emission is quenched by the second
fluorescent dye, TAMRA (6-carboxy-tetramethylrhodamine). Nuclease degradation of
the hybridization probe during each PCR cycle releases the quenching of the FAM
fluorescent emission (Holland, Abramson et al. 1991, Heid, Stevens et al. 1996). Real-
time PCR does not require post-PCR sample handling, preventing potential PCR product
carry-over contamination and resulting in much faster and higher throughput assays. This
method has an accurate dynamic range of 7 to 8 log orders of magnitude of starting target
molecule determination (Heid, Stevens et al. 1996, Wong and Medrano 2005). The
quantification cycle (Cq) is defined by software as the cycle when sample fluorescence
exceeds a chosen threshold above calculated background fluorescence. The Cq is
dependent on the starting template copy number, the efficiency of PCR amplification,
efficiency of cleavage or hybridization of the fluorogenic probe, and the sensitivity of
fluorescence detection. A Cq value is reported for each sample and can be translated into
a quantitative result by constructing a standard curve or comparing reference Cq values.
Generally, two quantification strategies are used in real-time RT-PCR, relative or
absolute quantification. Both methods need normalization to correct for sample-to-sample
variations in loading.
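The standard curve mentioned above can be sketched as a linear fit of Cq against the base-10 logarithm of the starting copy number. The sketch assumes the commonly used linear model Cq = slope × log10(copies) + intercept and the conventional relation efficiency = 10^(-1/slope) - 1; the function names are hypothetical and no instrument software is being reproduced.

```python
import math


def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    if len(copies) != len(cq_values) or len(copies) < 2:
        raise ValueError("need at least two (copies, Cq) pairs")
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means doubling)."""
    return 10 ** (-1 / slope) - 1


def copies_from_cq(cq, slope, intercept):
    """Invert the standard curve to estimate starting copies from a Cq."""
    return 10 ** ((cq - intercept) / slope)
```

Under these assumptions, a slope of about -3.32 corresponds to perfect doubling in each cycle, and an unknown sample is quantified by reading its Cq back through the fitted line.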
2.3.1.1. Relative Quantification
In relative quantification, the relative mRNA levels of two or more genes are compared across samples. For example, the fold difference of a target gene versus a reference gene (housekeeping or endogenous control gene) in one sample is compared with that in another sample. A reference gene is assumed to be expressed at a constant mRNA level across the samples being compared. Therefore, the fold difference of target mRNA relative to one or more reference mRNAs can be reported. The original RNA concentration is not reported because the result is a ratio of the expression level of the target gene to that of the reference gene. This quantification method is adequate for investigating physiological changes by comparing control and treated samples (Pfaffl 2004).
Basically, there are two calculation methods of the relative quantification, with
efficiency correction (Pfaffl 2001) and without (Livak and Schmittgen 2001). For
efficiency correction, the slope of the standard curve is required. A significant statistical
bias can be introduced that results in misleading biological interpretation when the
expression levels of target and normalizer are too different or when the target gene is
expressed at very low levels, as the relationship between the two may not be linear in
these cases (Vandesompele 2009).
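The two calculation methods can be compared in a short sketch. It uses the commonly cited forms of the calculation without efficiency correction, ratio = 2^(-ΔΔCq) (Livak and Schmittgen 2001), and with efficiency correction, ratio = E_target^(ΔCq_target) / E_ref^(ΔCq_ref) with ΔCq taken as control minus treated (Pfaffl 2001), where E is the per-cycle amplification factor derived from the standard curve slope as 10^(-1/slope). The function and argument names are hypothetical.

```python
def amplification_factor(slope):
    """Per-cycle amplification factor from a standard curve slope (2 = doubling)."""
    return 10 ** (-1 / slope)


def livak_ratio(cq_target_ctrl, cq_ref_ctrl, cq_target_trt, cq_ref_trt):
    """Relative expression (treated vs control) assuming doubling every cycle."""
    delta_delta_cq = (cq_target_trt - cq_ref_trt) - (cq_target_ctrl - cq_ref_ctrl)
    return 2 ** (-delta_delta_cq)


def pfaffl_ratio(e_target, e_ref,
                 cq_target_ctrl, cq_ref_ctrl, cq_target_trt, cq_ref_trt):
    """Relative expression (treated vs control) with per-gene efficiencies."""
    target_term = e_target ** (cq_target_ctrl - cq_target_trt)
    ref_term = e_ref ** (cq_ref_ctrl - cq_ref_trt)
    return target_term / ref_term
```

When both amplification factors equal 2, the two calculations agree. For example, a target Cq that falls from 25 to 23 while the reference stays at 18 gives a 4-fold increase by either method, whereas an amplification factor of 1.9 for the target lowers the efficiency-corrected estimate to about 3.6-fold.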
Despite these issues with reference genes, relative quantification is more
convenient than absolute quantification, because it requires less stringent controls.
However, when performing relative quantitation, the data (Cq) used for comparison are
arbitrary values and only applicable to the samples run within the same PCR. For
comparison of PCR results from two different experiments, it is necessary to include a
standard control in every plate or run (Wong and Medrano 2005).
2.3.1.2. Absolute Quantification
Absolute quantification reports the final result in copy numbers per total RNA
concentration, per genome, per cell, per gram of tissue, per ml blood, etc. Generally,
there are two approaches for absolute quantification, “standard curve” and “competitive
PCR”. The standard curve approach measures expression of a particular gene using serial
dilutions of known copy number or concentration of a selected sequence in a separate
PCR assay. Competitive PCR measures endogenous gene expression relative to known
numbers of synthetic RNA or DNA sequences placed in same PCR assay. In cases where
data to be compared are assayed on different days or in different laboratories, absolute
quantitation may be preferred because results are based on constant reference agents
(Peirson, Butler et al. 2003, Wong and Medrano 2005).
2.3.1.2.1. Competitive PCR
The importance and need for standardization and extended quality control studies
in RT-PCR were emphasized by researchers (Vlems, Ladanyi et al. 2003, Tichopad,
Kitchen et al. 2009, Ruijter, Pfaffl et al. 2013). The significant difference between our assays
(i.e., both two-color fluorometric real-time PCR and Star-Seq) and commonly used
real-time PCR and NGS is the inclusion of a known number of synthetic internal
competitive template molecules for each gene in one mixture.
A competitive PCR approach for quantitation of mRNA or DNA was first
introduced by Gilliland et al. (Gilliland, Perrin et al. 1990). The general concept of competitive PCR consists of co-amplification of two different templates sharing the same primer recognition sequences in the same tube. This approach ensures identical thermodynamics and amplification efficiency for both template species. The PCR tube containing the target templates is spiked with a known quantity of synthetic competitive templates, and the ratio of their initial copy numbers remains constant throughout the amplification to endpoint. Their amplicons can be distinguished by the addition of a restriction enzyme site to the standard (Gilliland, Perrin et al. 1990), or by varying its size or sequence (Gililland, Tseng et al. 1992, Celi, Zenilman et al. 1993, Scheuermann and
Bauer 1993, Blomquist, Crawford et al. 2013, Yeo, Crawford et al. 2014).
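The arithmetic behind this concept can be sketched as follows. Because the two templates are assumed to amplify with the same efficiency, the ratio of their products equals the ratio of their starting copies, so the unknown number of target copies is the known number of spiked competitor copies multiplied by the measured target:competitor signal ratio. The function names and the idealized amplification model are assumptions made for illustration only.

```python
def amplify(copies, efficiency, cycles):
    """Idealized PCR: each cycle multiplies copies by (1 + efficiency)."""
    return copies * (1 + efficiency) ** cycles


def target_copies_from_competitor(competitor_copies, target_signal, competitor_signal):
    """Starting target copies from a known competitor spike-in.

    Relies on the target:competitor ratio being unchanged by amplification.
    """
    if competitor_signal <= 0:
        raise ValueError("competitor signal must be positive")
    return competitor_copies * target_signal / competitor_signal
```

For example, if 1,000 competitor copies are spiked into a reaction and the endpoint target:competitor signal ratio is 0.6, the estimate is 600 starting target copies, regardless of whether the shared efficiency in that tube was 85% or 95%.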
The synthetic competitive RNA molecules could be spiked into the RNA samples before RT (Becker-Andre and Hahlbrock 1989, Vlems, Ladanyi et al. 2003, Pachmann,
Clement et al. 2005) or synthetic competitive DNA molecules that differ from the cDNA
of interest could be added to the endogenous cDNA following RT (Crawford, Peters et al.
2001). By including a known number of competitor RNA in RNA sample prior to RT, variable effects due to differences in conditions of the RT and the PCR amplification could be internally controlled (Wang, Doyle et al. 1989). However, pipetting errors may occur at two points: placing the same amount of RNA from each sample into its respective RT reaction and pipetting cDNA from each RT reaction into each PCR assay.
Further, once transcribed, competitor RNA may degrade during long-term storage, and heteroduplex formation between the nearly identical standard and target can result in variable sensitivity and accuracy (Henley, Schuebel et al. 1996, Bustin 2000).
When fluorescence is used to detect the amounts of endogenous template and standard (e.g.,
TaqMan probes, DNA intercalating dye), it is recommended that the competitive templates should be within 10-fold ratio of the target cDNA molecules because there are upper and lower limits of detection of fluorescence (Ferre 1992, Reischl and
Kochanowski 1995, Raeymaekers 2000). If an RNA competitor is used during RT and will be detected using a fluorescence method, it is not possible to know ahead of time the proper amount of competitive template for each gene to be placed into the RNA sample before RT. Therefore, several RT reactions per sample (i.e., a constant amount of the target of interest and varying amounts of competitor) will be needed. This is a very cumbersome process that depletes RNA stocks and limits analysis of the cDNA to the genes for which a competitive template was included in RT. By contrast, when synthetic DNA competitive templates are spiked in to control for cDNA loaded into the PCR reaction, one can run a series of PCR reactions with different competitor:target ratios, and the more abundant cDNA, rather than the limited supply of RNA, will be consumed. Thus,
when fluorescence emission is used to detect the cDNA signal, Willey and colleagues
state that measurement of cDNA with synthetic DNA competitor is most practical
(Willey).
However, when non-fluorescent methods of detection (e.g., next generation
sequencing as discussed later) are used, it is not necessary to keep the native template (NT) and internal standard (IS) within
a close ratio (Blomquist, Crawford et al. 2013), and RNA or cDNA standards may be
employed.
Ultimately, the main advantage of competitive PCR is that the results are not
affected by tube-to-tube variations in amplification efficiency, which controls for inter-sample variation in interfering substances (Bustin 2000).
2.3.1.2.2. Multiplex Two-Color Fluorometric Real-Time PCR with Quality Control
Chemoresistance and chemosensitivity assays have been investigated as in vitro
diagnostic tests intended to identify drugs which work most effectively for tumor in an
individual patient. Empiric chemotherapy is selecting a chemotherapy regimen based on
clinical trial evidence and on characteristics that determine whether a patient is likely to
benefit from certain drugs. In contrast, assay-directed chemotherapy, contributing to personalized medicine, is an alternative in which tumor cells of the patient are treated with and without a particular drug and the responses are analyzed (Samson, Seidenfeld et al.
2004). These molecular diagnostic tests can be applied to tissue after biopsy or blood cells
containing nuclei because genetic variability and the occurrence of specific
polymorphisms may affect susceptibility to tumor and the type of response to the therapy.
The most commonly used assay in pathology laboratories is immunohistochemistry (IHC),
which is not quantitative. Ziegler et al. have reviewed DNA predictive markers under
investigation (Ziegler, Koch et al. 2012). Research has shown that epidermal growth factor receptor (EGFR) mutation in patients with advanced non-small cell lung cancer
(NSCLC) was associated with resistance to first-line EGFR tyrosine kinase inhibitor therapy. Thus, a test for EGFR mutation for NSCLC patients has been approved by the
FDA (Keedy, Temin et al. 2011). An example of gene expression analysis as a predictive
biomarker is transcript level of the ERCC1, a gene encoding the key enzyme for DNA
repair, measured by quantitative RT-PCR. Other genes have been reported to affect chemo response (Mok 2011, Korpanty, Graham et al. 2014, Chamizo, Zazo et al. 2015).
Based on clinical evidence of values in predicting chemotherapy response in NSCLC
patients, we developed multiplex two-color fluorometric real-time PCR assays with
quality control for 10 predictive biomarkers: ERCC1, RRM1, MRP2 in response to
cisplatin; EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS in response to targeted chemotherapeutic agents.
The two-color fluorometric PCR method that we developed is a competitive PCR for absolute quantification in real-time PCR format for tests comprising multiple gene analytes. It is designed for quantification of compromised mRNA, such as that from formalin-fixed
paraffin-embedded (FFPE) samples, which undergo chemical modification, physical
fragmentation, and formalin-induced crosslinking among nucleic acids or between nucleic acids and
protein (Yeo, Crawford et al. 2014). In contrast to the other commercially
available tests, e.g. COBAS and Abbott methods for the HIV-1 test, two main differences
with our method are the application of internal standards mixture (ISM) and external
standards mixture (ESM) for the quality control. Both COBAS and Abbott assays are
developed for only one target gene (HIV-1), not for multiple genes (i.e., targets and
references). As at least one reference gene is needed for the loading control, a method for
measuring at least two genes (i.e., a target and a reference) is necessary for accurate
transcript abundance measurements.
The competitive IS molecule was designed with a 4-6 bp (two-color fluorometric
real-time PCR) or 6 bp (Star-Seq) difference from each native target gene template
(NT). This alteration from NT could be distinguished by probe (two-color fluorometric
real-time PCR) or sequencing program (Star-Seq). After synthesis of IS, each IS was mixed in an ISM with a constant ratio, and it was used for co-amplification of an unknown number of NT, in a single PCR reaction, for direct quantification (two-color fluorometric real-time PCR) and library preparation for sequencing (Star-Seq). ISM is a mixture of synthetic competitive internal standards (IS) with a known concentration of IS for each of multiple represented genes (e.g., one or more reference genes and one or more target genes). In each PCR reaction, the ISM controls for inter-sample variation due to the presence of interfering substances and prevents false negative results. If the same ISM is used across PCR experiments and laboratories, it controls for analytical variation thus enables comparison of gene expression measurements across multiple runs in different laboratories (Crawford, Peters et al. 2001, Crawford, Warner et al. 2002, Canales, Luo et al. 2006, Huggett, Novak et al. 2008).
The ESM is a mixture containing a 1:1 ratio of synthetic native templates (NT) and IS templates for each gene of interest, with a constant ratio of each gene relative to the others. It controls for the difference in fluorescence intensity between two probes labeled with different dyes, which can arise from variation in probe degradation or in software selection of
Cq values in each PCR plate. In every PCR plate, two concentrations of ESM per gene
under test must be amplified in separate wells, and the Cq difference between NT and IS (ΔCq
= NT Cq – IS Cq) is obtained for each of the two ESM wells. The mean of the two ΔCq values
from the two respective ESM reactions is used to normalize the fluorescence
difference between NT and IS probes and the auto-selected Cq difference in an unknown sample.
The final report is NT copies of a target gene per million copies of a reference gene (e.g. ACTB
in this case).
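The ESM normalization described above can be illustrated with a minimal sketch. It is not the software used in this work. It assumes that the ESM-derived mean ΔCq is subtracted from the sample ΔCq and that the NT:IS ratio follows an idealized amplification base of 2 per cycle; the function name, parameter names and all numeric values are hypothetical.

```python
def nt_copies_from_cq(sample_nt_cq, sample_is_cq, esm_delta_cqs,
                      is_copies, amplification_base=2.0):
    """Estimate NT copies in an unknown sample from two-color competitive qPCR.

    Assumptions (not taken from the source): the ESM bias is removed by
    subtraction, and each cycle multiplies product by `amplification_base`.
    """
    # Mean ΔCq (NT Cq - IS Cq) over the ESM wells, where NT:IS is known to be 1:1.
    # Any departure from zero reflects probe or Cq-selection bias, not template.
    esm_bias = sum(esm_delta_cqs) / len(esm_delta_cqs)

    # ΔCq of the unknown sample, corrected for that bias.
    corrected_delta_cq = (sample_nt_cq - sample_is_cq) - esm_bias

    # A lower Cq means more starting template, hence the negative exponent.
    nt_to_is_ratio = amplification_base ** (-corrected_delta_cq)

    # The IS was added at a known copy number, so the ratio gives NT copies.
    return nt_to_is_ratio * is_copies


# Hypothetical values: two ESM wells and one unknown sample with 1000 IS copies.
target_copies = nt_copies_from_cq(25.0, 24.0, [0.4, 0.6], is_copies=1000)
print(round(target_copies, 1))
```

Applying the same function to the reference gene and dividing the target result by the reference result, multiplied by one million, would give the reported unit of target NT copies per million reference gene copies.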
Crawford et al. demonstrated two-step PCR with ISM for
fresh FNA samples on the capillary electrophoresis platform (Crawford, Warner et al.
2002), and Yeo et al. successfully adapted pre-amplification with ISM for FFPE samples to increase signal relative to background and to increase the amount of cDNA reverse transcribed from a limited amount of RNA (Yeo, Crawford et al. 2014).
2.3.2. RNA-Sequencing
RNA-sequencing (RNA-seq) is a recently developed deep sequencing technology that sequences the same region multiple times for transcriptome profiling, simultaneously mapping and quantifying transcripts from targeted regions of interest or whole transcriptomes (Wang, Gerstein et al. 2009). It also provides differential gene expression data in the form of aligned read counts (Nagalakshmi 2010). Since 2005, NGS has rapidly become the method of choice for
RNA-seq, and it allows increased coverage of target size. The relative abundances of individual transcripts in a transcriptome can differ by several orders of magnitude. For the detection and quantification of low abundance transcripts with RNA-seq, the total number of reads per library can be increased (Haas, Chin et al. 2012).
2.3.2.1. Whole Transcriptome RNA-seq
Whole transcriptome RNA-sequencing (total RNA sequencing) captures a broad
range of gene expression levels (more than five orders of magnitude was estimated in
mouse (Mortazavi, Williams et al. 2008)) and enables the detection of novel transcripts in
both coding and noncoding RNA species. It gives researchers insight into
complex diseases by revealing altered expression associated with genetic variants and the
molecular mechanisms that regulate disease progression. In addition to transcript abundance
quantification, it provides detailed transcript structure information, including representation of
alternative transcripts (a variety of splice isoforms and novel exons) (Halvardson, Zaghlool
et al. 2013), polymorphisms, and mutations, including translocations
(Marioni, Mason et al. 2008).
Microgram quantities of total RNA are required for whole transcriptome
sequencing. RNA quality is important for successful sequencing; hence, an
RNase-free environment to prevent RNA degradation and the absence of genomic DNA
contamination in the RNA are essential. In addition, removal of ribosomal RNAs (rRNA) prior to
analysis increases the percentage of reads covering RNA species of interest because
rRNAs constitute 95-98% of total cellular RNA. The mRNA can be enriched either by
selection of polyadenylated (poly-A) RNAs or by depletion of rRNA (Chen and Duan
2011, Huang, Jaritz et al. 2011). For RNA-seq, however, doubly oligo(dT)-selected polyA+ RNA is preferred (Nagalakshmi, Wang et al. 2008, Nagalakshmi 2010). About
1µg of total enriched mRNA is subjected to cDNA synthesis by RT. Fragmentation can
be applied for either RNA (i.e., RNA hydrolysis or nebulization) or cDNA (i.e., DNase I
treatment or sonication). The fragmented cDNA is size-selected, blunt-ended and ligated
to platform-specific adaptors with or without barcode. Then the library preparation is
purified before sequencing (Nagalakshmi 2010). Following sequencing, two assembly methods are used to produce a transcriptome map from the resultant reads: de novo and
genome-guided. The de novo approach does not rely on the presence of a reference
genome to assemble the sequence reads (Grabherr, Haas et al. 2011). The genome-guided
approach is cheaper and easier: reads are mapped with tools such as Bowtie
(Langmead and Salzberg 2012) and the TopHat splice junction mapper (Trapnell, Pachter et al.
2009), and only reads that map uniquely to the reference genome are retained.
Recently, whole transcriptome analysis at single cell resolution has attracted growing
interest, especially for profiling rare or heterogeneous populations of cells (Tang,
Barbacioru et al. 2009). However, measurement of genes with low expression
can be negatively affected by stochastic amplification bias that results in the drop-out of
some RNA species and preferential amplification of others (Mamanova, Andrews et al.
2010, Ozsolak and Milos 2011).
For routine molecular diagnostic testing, however, the cost (the total number of
sequencing reads required, because high abundance targets are redundantly sequenced in order to quantify
low abundance targets) and complexity (i.e., difficulties in post-sequencing computational bioinformatics)
of whole transcriptome RNA-sequencing data sets are barriers to use
of this method (Blomquist, Crawford et al. 2013).
2.3.2.2. Targeted RNA-seq
Traditionally, RNA sequencing for transcriptome analysis requires preparation
of a whole transcriptome library. In many cases, however, sequencing the whole transcriptome is
unnecessary, or less important than focusing sequencing efforts on a specific
fraction of the transcriptome. The recently developed targeted RNA sequencing approach is a
method for measuring transcripts of interest that yields quantitative or qualitative
information. Biotinylated oligonucleotide probes (baits) are used to capture cDNAs from
a library prepared for NGS (Levin, Berger et al. 2009, Mamanova, Andrews
et al. 2010, Mercer, Gerhardt et al. 2012, Halvardson, Zaghlool et al. 2013). It would be
more efficient than whole transcriptome RNA-seq for a diagnostic test in a clinical setting.
Targeted RNA sequencing allows for allele-specific expression measurement (Zhang, Li
et al. 2009) and differential expression analysis (Mercer, Gerhardt et al. 2012) as well as
discovery of novel fusion transcripts from chromosomal rearrangements (Levin, Berger
et al. 2009). Further, data analysis of targeted RNA-seq is significantly faster than whole
transcriptome RNA-sequencing because it takes less time for alignment. However, these
studies have underscored some challenges, for example, inter-library variations
introduced by library preparation (Levin, Berger et al. 2009, Mamanova, Andrews et al.
2010, Mercer, Gerhardt et al. 2012) and the high sequencing depth required to
reproducibly quantify low-abundance transcripts (Turner, Ng et al. 2009, Tarazona,
Garcia-Alcalde et al. 2011).
2.3.2.2.1. Use of Internal Standards as Quality Control for Library Preparation
We developed a competitive multiplex PCR-based amplicon sequencing library
preparation method for targeted RNA-seq (Blomquist, Crawford et al. 2013). This method targets only the sequences of interest and controls for inter-target variation in PCR amplification during library preparation by measuring each transcript NT relative to a known number of synthetic competitive template IS copies. Briefly, cDNA from an unknown sample (NT) was mixed with ISM and co-amplified with tailed target-specific primers (the 1st round of PCR allows multi-targeting), and then the PCR products
were subjected to PCR amplification with primers that add unique barcode sequences at the
ends (the 2nd round of PCR allows multi-sampling), followed by amplification with primers
that incorporate platform-specific sequences at the ends (the 3rd round of PCR allows sequencing on a specific NGS platform). The resulting individual libraries were mixed, gel purified, and then sent to the specific NGS sequencer (e.g., Ion Torrent PGM™ or Illumina HiSeq or
MiSeq). The competitive multiplex-PCR amplicon library preparation method provides quality control for the reproducibility of RNA-seq library preparation. In addition, it reduces the over-sequencing of highly expressed transcripts relative to lowly expressed ones yet maintains the initial relative quantitative representation of targets, which, in turn, considerably lowers the cost of sequencing per target, per sample (Blomquist,
Crawford et al. 2013). A key advantage of competitive PCR is its insensitivity to the effect of saturation of the PCR (Reischl and Kochanowski 1995). This overcomes the limitation of previously developed targeted RNA-seq approaches described above.
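The principle of measuring each NT relative to a known number of IS copies can be sketched as follows. This is an illustrative calculation only, assuming that NT and IS are amplified and sequenced with equal efficiency so that their read-count ratio reflects their input ratio; the function name and the numbers are hypothetical.

```python
def nt_copies_from_reads(nt_reads, is_reads, is_copies_input):
    """Estimate NT input copies from sequence counts of NT and its competitive IS.

    Assumption: NT and IS share primer sites and amplify with equal efficiency,
    so the NT:IS read ratio equals the NT:IS ratio at the start of PCR.
    """
    if is_reads <= 0:
        # Without IS reads the reaction cannot be interpreted (possible failure).
        raise ValueError("no internal standard reads for this target")
    return nt_reads * is_copies_input / is_reads


# Hypothetical target: 300 NT reads, 600 IS reads, 10,000 IS copies loaded.
print(nt_copies_from_reads(300, 600, 10_000))

# Amplifying both templates further (or sequencing deeper) scales both counts,
# so the estimate does not change; this is why saturation has little effect.
print(nt_copies_from_reads(3_000, 6_000, 10_000))
```

Because only the ratio enters the calculation, a highly expressed target does not need proportionally more reads than a rare one, provided each is paired with an IS at a comparable concentration.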
2.3.2.2.2. Use of Predicted Coefficient of Variation as Quality Control for Stochastic Error
One challenge that limits wider clinical diagnostic application of NGS is the lack of appropriate quality control to accurately identify clinically actionable mutations in tumors
(Cibulskis, Lawrence et al. 2013, Spencer, Tyagi et al. 2014). Analytical variation due to stochastic sampling is important not only for mutation frequency determination but also for quantitative measurement of differential gene expression or ASE (Mortazavi,
Williams et al. 2008). The RNA-seq library preparation method described above, which utilizes competitive IS (Blomquist, Crawford et al. 2013), enables control for sample overloading, signal saturation effects, and inter-assay and inter-sample variations in
measurement. However, analytical variation due to stochastic sampling, which arises when the target analyte is
under-loaded into library preparation and/or when library product is under-loaded into the
sequencer (Fu, Xu et al. 2014), can still compromise the quality of NGS data and lead to
misinterpretation of biological meaning. Stochastic sampling error cannot be
controlled when few copies exist in the sample, because each PCR step that draws cDNA or
PCR products from the previous step can introduce stochastic sampling variation. Based on
Poisson sampling, a mathematical equation was developed to
predict assay coefficient of variation (CV). The formula is based on both the intact NT copies
loaded into the first PCR reaction of library preparation (input molarity) and the input of
resulting molecules from the library into the sequencer (i.e. sequence counts) (Blomquist,
Crawford et al. 2015). In the post-sequencing data analysis pipeline, the predicted CV is
used to determine the confidence limits for each value acquired from sequencing.
Any value that does not pass this confidence limit is eliminated; thus, stochastic
sampling errors are controlled. Since relative transcript abundance varies over six
orders of magnitude (Mortazavi, Williams et al. 2008, Blomquist, Crawford et al. 2013), the predicted CV prevents false positive results by minimizing stochastic variation and ensures NGS data quality for transcript abundance quantification, representation of alternative transcripts (Halvardson, Zaghlool et al. 2013), and detection of polymorphisms and mutations.
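To make the Poisson reasoning concrete, the sketch below shows how a CV could be predicted from the two sampling points named above. It is not the published equation of Blomquist, Crawford et al. (2015). It uses only the textbook property that a Poisson count with mean n has CV 1/sqrt(n), and it assumes, for illustration, that the two sampling steps are independent so that their squared CVs add. All names and values are hypothetical.

```python
import math


def poisson_cv(expected_count):
    """CV of a Poisson-distributed count: sd / mean = sqrt(n) / n = 1 / sqrt(n)."""
    if expected_count <= 0:
        return math.inf  # nothing sampled, so the measurement is uninformative
    return 1.0 / math.sqrt(expected_count)


def predicted_cv(input_molecules, sequence_counts):
    """Approximate CV from two sampling steps.

    Illustrative assumption: the input step and the sequencer-loading step are
    independent, so the squared CVs are summed (a first-order approximation).
    """
    cv_input = poisson_cv(input_molecules)
    cv_reads = poisson_cv(sequence_counts)
    return math.sqrt(cv_input ** 2 + cv_reads ** 2)


# Hypothetical cases: well-loaded versus under-loaded measurements.
for molecules, reads in [(10_000, 10_000), (100, 10_000), (4, 10_000)]:
    print(molecules, reads, round(predicted_cv(molecules, reads), 3))
```

Under this simplified model the smaller of the two counts dominates the predicted CV, which matches the qualitative point that deeper sequencing cannot rescue a target that was under-loaded into library preparation.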
2.4. Contributions
2.4.1. Manuscript I
Manuscript I is entitled “Cis-acting variant sites that alter ERCC5 transcription regulation in normal bronchial epithelial cells.” This manuscript has been submitted to
PLOS One. This study was conducted to characterize cis-acting genetic variants responsible for inter-individual variation in ERCC5 transcript regulation in normal bronchial epithelial cells (NBEC). Genotypes at putative ERCC5 cis-regulatory single nucleotide polymorphic sites (SNP) rs751402 and rs2296147, and marker SNPs rs1047768 and rs17655, were determined for the 80 currently enrolled subjects. Using a recently developed targeted sequencing method, ERCC5 allele-specific transcript abundance was assessed in NBEC RNA from 55 individuals heterozygous for rs1047768 and 21 subjects heterozygous for rs17655. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by
Sanger sequencing. Association of NBEC ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at putative ERCC5 promoter cis-regulatory SNPs rs751402 and rs2296147 was assessed.
2.4.2. Manuscript II
Manuscript II is entitled “Lung cancer risk test trial: study design, participant baseline characteristics, bronchoscopy safety, and establishment of a biospecimen repository.” This manuscript has been accepted and published in BMC Pulmonary
Medicine. In an effort to assess the accuracy and safety of the LCRT we initiated a multi-site prospective cohort trial. The purpose of this report is to describe 1) the LCRT trial study design and primary endpoint, 2) baseline characteristics of enrolled individuals including demographic and lung function data, and 3) secondary endpoints reached thus far, including a) analysis of safety for the bronchoscopic brush method used to obtain samples for LCRT testing, and b) establishment of a biospecimen repository containing
NBEC and peripheral blood samples collected from the LCRT cohort.
2.4.3. Manuscript III
Manuscript III is entitled “Control for stochastic sampling variation and qualitative sequencing error in next generation sequencing.” This manuscript has been accepted and published in Biomolecular Detection and Quantification. One challenge that limits wider clinical diagnostic application of NGS is the lack of appropriate quality control to accurately identify clinically actionable mutations in tumors. To address this challenge, we developed and tested two hypotheses. Hypothesis 1: Analytical variation in target analyte quantification is predicted by Poisson (i.e. stochastic) sampling effects at two key points: a) input of intact nucleic acid target molecules into the library preparation reaction, and b) input of amplicons from the library into the sequencer. Hypothesis 2:
Technically derived base substitution, insertion and deletion frequencies observed at each base position in each native target analyte are concordant with frequencies observed in competitive synthetic internal standards present in the same reaction. To test hypothesis
1, we derived equations using Monte Carlo simulation to predict assay coefficient of variation (CV) based on three working models: number of target molecules added to library preparation, number of target sequence read counts from the sequencer, or both. A serial dilution of gDNAs from two cell lines with known allelic composition was used to test this hypothesis. To test hypothesis 2, we measured the frequency of base substitutions, insertions and deletions at each base position within amplicons from each of 30 native target analytes, then compared these frequencies to those at corresponding base positions within 30 respective synthetic competitive internal standard templates present in the same NGS library preparation reactions.
2.5. Future Study
Through funding in part from RC2 CA148572 and HL108016, we have collected normal bronchial epithelial cell samples from over 500 subjects with lung cancer and/or
COPD or demographically at risk for lung cancer or COPD. We will focus on discovery of cis-regulatory SNPs or haplotypes that affect inter-individual variation in the expression of genes with altered regulation in subjects with lung cancer or COPD.
Integrating pathway analysis with GWAS data should more effectively highlight genes that are most likely altered in cases versus controls. To date, we have obtained authorization to access several GWAS data sets, including three COPD GWAS data sets and two lung cancer GWAS data sets. For the multiple putative risk genes, we will assess genotype at putative cis-rSNPs in gDNA from over 500 subjects, and allele-specific expression and total expression in matched NBEC samples from the same subjects using the targeted competitive multiplex NGS method. We will identify heritable susceptibility cis-rSNPs or haplotypes that contribute to lung cancer and COPD risk. We will evaluate the function of identified putative cis-rSNPs that are more likely to alter transcription factor binding and regulation of respective genes in NBEC, using NBEC in optimized conditional reprogramming culture (CRC) conditions. We will use the Massively Parallel
Reporter Assay (MPRA) combined with targeted NGS to measure inter-allelic difference in transcriptional activity at putative enhancers centered on putative cis-rSNPs.
Chapter 3 Haplotype and diplotype analyses of variation in ERCC5 transcription cis-regulation in normal bronchial epithelial cells
Xiaolu Zhang1, Erin L. Crawford1, Thomas M. Blomquist2, Sadik A. Khuder3, Jiyoun
Yeo1, Albert M. Levin4, James C. Willey1,2*
Authors’ Affiliations:
1 Division of Pulmonary/Critical Care and Sleep Medicine, Department of Medicine,
University of Toledo Health Sciences Campus, Toledo, Ohio, United States of America
2 Department of Pathology, University of Toledo Health Sciences Campus, Toledo, Ohio,
United States of America
3 Departments of Medicine and Public Health and Homeland Security, University of
Toledo Health Science Campus, Toledo, Ohio, United States of America
4 Department of Public Health Sciences, Henry Ford Health System, Detroit, Michigan,
United States of America
* Corresponding author: James C. Willey, M.D., Mail Stop #1186, 3000 Arlington
Avenue, Toledo, OH 43614, Phone: 419-383-3541, Fax: 419-383-2801, Email:
(Modified from “Haplotype and diplotype analyses of variation in ERCC5 transcription cis-regulation in normal bronchial epithelial cells” manuscript submitted to Physiological
Genomics)
3.1 Abstract
Background: Excision repair cross-complementation group 5 (ERCC5) gene plays an important role in nucleotide excision repair and dysregulation of ERCC5 is associated with increased lung cancer risk. Haplotype and diplotype analyses were conducted in normal bronchial epithelial cells (NBEC) to better understand mechanisms responsible for inter-individual variation in transcript abundance regulation of ERCC5.
Methods: We determined genotypes at putative ERCC5 cis-regulatory SNPs (cis-rSNP) rs751402 and rs2296147, and marker SNPs rs1047768 and rs17655. ERCC5 allele-specific transcript abundance was assessed by a recently developed targeted sequencing method. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by Sanger sequencing. We then assessed association of ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at cis-rSNPs rs751402 and rs2296147.
Results: Genotype analysis revealed significantly (p<0.005) higher inter-individual variation in allelic ratios in cDNA samples relative to matched gDNA samples at both rs1047768 and rs17655. By diplotype analysis, mean expression was higher at the rs1047768 alleles syntenic with the rs2296147 T allele compared to the rs2296147 C allele.
Further, mean expression was lower at the rs17655 C allele, which is syntenic with the G allele at the linked SNP rs873601 (D’=0.95).
Conclusions: These data support the conclusions that, in NBEC, the T allele at SNP rs2296147 up-regulates ERCC5, variation at rs751402 does not alter ERCC5 regulation, and the C allele at SNP rs17655 down-regulates ERCC5. Variation in ERCC5 transcript abundance associated with allelic variation at these SNPs could result in variation in NER function in NBEC and in lung cancer risk.
3.2 Introduction
Excision repair cross-complementation group 5 (ERCC5) gene, also known as
Xeroderma Pigmentosum complementation group G (XPG) (O'Donovan, Scherly et al. 1994), plays an important role in nucleotide excision repair (NER). In addition,
ERCC5 is among a set of key antioxidant, DNA repair and cell cycle control genes identified by this laboratory to associate with lung cancer risk (Mullins, Crawford et al. 2005, Blomquist, Crawford et al. 2009). Further, variation in ERCC5 regulation is reported to be associated with treatment response and outcome in bronchogenic carcinoma as well as other cancers (Spitz, Wei et al. 2003, Zienolddiny, Campa et al.
2006, Mathiaux, Le Morvan et al. 2011, Zhang, Sun et al. 2013, Somers, Wilson et al.
2015).
Known transcription regulators of ERCC5 in normal bronchial epithelial cells
(NBEC) include CCAAT/enhancer binding protein gamma (CEBPG), E2F
Transcription Factor 1 (E2F1) and YY1 (a transcription factor belonging to the GLI-
Kruppel class of zinc finger proteins) (Mullins, Crawford et al. 2005, Crawford,
Blomquist et al. 2007). CEBPG is a truncated C/EBP isoform that lacks a transcription activation domain and therefore functions through heterodimerization with other C/EBP members (Tsukada, Yoshida et al. 2011). Knockout of CEBPG or
its binding partner CEBPA results in emphysema, a condition associated with lung cancer
risk (Kaisho, Tsutsui et al. 1999, Didon, Roos et al. 2010). E2F1 is a critical regulator of
cell cycle progression (Nevins 1992). A study in a series of 58 lung tumors of all histological types supported a pivotal role of E2F1 in tumorigenesis (Eymin, Gazzeri et
al. 2001).
The common single nucleotide polymorphic (SNP) sites rs751402 and rs2296147
reside in the ERCC5 5’ untranslated region (UTR); rs751402 within a known CEBPG
binding site based on chromatin immunoprecipitation studies (Wang, Zhuang et al. 2012)
and rs2296147 within an experimentally confirmed binding site for E2F1 and YY1
(Crawford, Blomquist et al. 2007). Variation at rs2296147 is predicted to alter binding of
the TP53 transcription factor (Marinescu, Kohane et al. 2005). Both rs751402 and
rs2296147 are associated with lung cancer risk in molecular epidemiologic studies (Shen,
Berndt et al. 2005, Zienolddiny, Campa et al. 2006).
In a previous study, based on genotype analysis we found that SNPs rs751402 and
rs2296147 were associated with inter-individual variation in allelic imbalance in ERCC5
expression in NBEC (Blomquist, Crawford et al. 2010). This observation suggests that
one or both of these SNPs affects ERCC5 cis-regulation. However, based on genotyping
analysis it was not possible to sort out with confidence the independent roles of rs751402
and rs2296147. Further, the patterns of allelic imbalance variation observed indicated
that one or more cis-regulatory SNPs in addition to rs751402 and rs2296147 also played
a role. One candidate is rs17655, a common polymorphic site in the ERCC5 3’ UTR
reported to be associated with lung cancer risk (Matakidou, el Galta et al. 2007). Genetic
variation in the 3’ UTR of a gene can play a role in cis-regulation by influencing
microRNA (miRNA) binding activity (Yu, Li et al. 2007, Nicoloso, Sun et al. 2010,
Ryan, Robles et al. 2010).
In an effort to better characterize cis-acting genetic variants responsible for inter-individual variation in ERCC5 transcript regulation, we used recently developed methods to assess in more detail the role of the previously studied 5’UTR sites rs751402 and rs2296147 (Mullins, Crawford et al. 2005, Blomquist, Crawford et al.
2009) and the additional 3’UTR site rs17655. Specifically, using allele-specific PCR amplicon libraries prepared for next generation sequencing (NGS) according to a recently described method (Blomquist, Crawford et al. 2013), haplotype and diplotype structure of the ERCC5 promoter region containing rs751402 and rs2296147 was assessed by allele-specific polymerase chain reaction (PCR) followed by direct sequencing. We determined allele-specific expression (ASE), measured as allelic ratio, at the marker site rs1047768 in the ERCC5 coding region close to the 5’UTR and at rs17655 in the coding region near the 3’UTR. We then evaluated the association of ASE with each rs751402-rs2296147-rs1047768 haplotype and diplotype.
3.3 Materials and Methods
3.3.1 Study subjects
Bronchoscopic brush biopsy samples of NBEC and matched peripheral blood samples were obtained as previously described (Mullins, Crawford et al. 2005) from
60 subjects without cancer and 20 subjects with lung cancer. Demographic characteristics of subjects in the current study are presented in Supplementary Table
S3.1. This study was conducted under University of Toledo Institutional Review
Board approved protocol #106894. All subjects in the current study were a subset of the
Lung Cancer Risk Test (LCRT) trial, a prospective cohort study (Crawford 2016).
Individuals were recruited at 13 locations in the United States and provided informed consent to participate. Each subject agreed to the banking of residual nucleic acids for use in future studies under University of Toledo Biomedical Institutional Review Board protocol #108538. Then all samples were de-identified and links to identifying information were kept at each site for subjects recruited at that site. Inclusion criteria required subjects to be at high demographic risk for lung cancer based on age (50-90 years) and smoking history (≥ 20 pack-years). Both current and former smokers were eligible. Subjects had to be without a diagnosis of lung cancer prior to or at enrollment.
Subjects were excluded if they were previously diagnosed or treated for lung cancer or had a high pretest likelihood of lung cancer, if they were positive for hepatitis B, C, HIV, or had active TB or if the physician deemed them to be medically inappropriate due to safety concerns. Also excluded were children, pregnant women, prisoners, mentally disabled, those that had received a double lung transplantation, radiation or chemotherapy of any kind within the last month and those scheduled to receive either radiation or chemotherapy.
3.3.2 DNA and RNA extraction
NBEC samples obtained at bronchoscopic brush biopsy were shipped to ResearchDX
(Irvine, CA), where RNA was extracted, treated with DNase I (Qiagen, Valencia, CA) in order to eliminate contaminating genomic DNA (gDNA), and frozen in aliquots. One aliquot of each frozen RNA sample was shipped to the University of Toledo, and tested for gDNA contamination with a pair of primers designed to span an intron-exon junction
in Secretoglobin, Family 1A, Member 1 gene (CC10 hereafter) and thereby amplify
only gDNA (Blomquist, Brown et al. 2013). Total RNA was reverse transcribed to
cDNA using Moloney Murine Leukemia Virus Reverse Transcriptase (M-MLV RT)
and oligo-dT primers as described previously (Mullins, Crawford et al. 2005).
Matched gDNA was extracted from whole blood using FlexiGene DNA Kit (Qiagen,
Valencia, CA) according to the manufacturer’s protocol.
3.3.3 Genotyping and allelotyping
Genotype at each polymorphic site was determined by TaqMan SNP genotyping assay (Applied Biosystems) according to the manufacturer’s protocol. Direct assessment of the syntenic relationship of alleles at rs751402, rs2296147 and rs1047768 in individuals heterozygous for rs1047768 was accomplished by allele-specific PCR amplification followed by Sanger sequencing (The University of
Michigan DNA Sequencing Core, Ann Arbor, MI) as described previously
(Blomquist, Crawford et al. 2010). An overview of polymorphic sites and primers relative to ERCC5 gene coordinates is depicted in Figure 3.2. The sequences and design of allele-specific primers were described previously (Blomquist, Crawford et al. 2010).
It was not possible to conduct analysis in 11 samples. In three samples, there was no amplification due to poor quality of gDNA and cDNA; four subjects were heterozygous at rs751402 but homozygous at rs2296147 and the spanned region between rs751402 and rs1047768 was too long to assess synteny by direct sequencing; four samples were not amplified with primers determining rs2296147-
rs1047768 synteny, likely due to variation in the transcription start site as previously reported (Blomquist, Crawford et al. 2010).
3.3.4 Measurement of ERCC5 allele-specific and total expression
ERCC5 ASE was measured using a modified version of previously described methods (Blomquist, Crawford et al. 2013). Briefly, a custom, multiplex competitive
PCR amplicon library was prepared for targeted NGS, then sequenced at the University of Michigan DNA Sequencing Core (Ann Arbor, MI) using the Illumina HiSeq 2000 platform. To prepare the library, cDNA was combined with a mixture containing a) primers spanning SNP rs1047768 in the ERCC5 5’UTR region, SNP rs17655 in the
3’UTR region, and the ACTB loading control gene, and b) a known number of internal standard molecules for each of these targets. Each primer was designed with a universal tail sequence (similar to that used for arrayed primer extension: APEX-2) not present in the human genome to allow for multi-template PCR addition of barcode and platform-specific sequencing adapters. The internal standards mixture was prepared as described previously (Blomquist, Crawford et al. 2013).
Three sequential PCR amplifications were conducted to prepare the library prior to sequencing. In the first reaction each target sequence was amplified with 5µM APEX-tailed primers using an air thermal cycler (RapidCycler; Idaho Technology, Inc., Idaho
Falls, Idaho). PCR conditions were 95°C/3min (Taq DNA polymerase activation); 35 cycles of 94°C/5sec (denaturation), 58°C/10sec (annealing), 72°C/15 sec (extension).
Each product from the first PCR was purified using QIAquick PCR purification kit
(Qiagen, Valencia, CA) to remove residual primers and primer dimers then used as template for barcoding PCR. Each barcoding reaction was cycled in the air cycler under
the following conditions: 95°C/3min (Taq DNA polymerase activation); 15 cycles of
94°C/5sec (denaturation), 58°C/10sec (annealing), 72°C/15 sec (extension). The final
concentration of each forward and reverse barcoding primer was 1µM. The third PCR
for adding Illumina platform specific adaptors was conducted with the same PCR
conditions and primer concentrations as the barcoding PCR. Representative PCR products were checked for quality and quantity with Bioanalyzer 2100 (Agilent
Technologies, city, state) following each of the three amplifications. All products from the third step PCR were combined at equal volumes and then purified by
QIAquick PCR purification kit (Qiagen, Valencia, CA) to remove residual primers.
The concentration of purified products was checked by Bioanalyzer 2100 and then
sent for sequencing. ASE was presented as allelic ratio which was calculated as the
ratio of sequencing counts for each allele and filtered as described below.
3.3.5 Data processing pipeline
The University of Michigan Illumina Sequencing services provided raw
sequencing data in FASTQ format. Practical Extraction and Reporting Language
(PERL) scripts were used to combine Read 1 (forward) and Read 2 (reverse)
sequence reads for each template sequenced. These “joined” reads were then de-
multiplexed based on dual-index barcoding on each template, and the locus was
identified based on the region representing the primer sequences. Intervening
amplicon sequence that was “captured” between the primer sequences was aligned
using custom alignment with Approximate String matching algorithm as previously
described (Blomquist, Crawford et al. 2013, Blomquist, Crawford et al. 2015). These
alignment calls then provided relative abundance in the form of sequence “counts” for
each allele at each locus. The ratio of these allele specific sequence counts at each locus
then represents the allele-specific expression ratio.
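The locus identification and allele counting steps can be illustrated with a short sketch. It is written in Python rather than the PERL used in this work, and it substitutes a simple Hamming-distance comparison for the custom approximate string matching algorithm. The primer sequences, allele reference sequences, mismatch limit and function names are all hypothetical toy values, not ERCC5 sequences.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions; infinite if the lengths differ."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def call_allele(read, fwd_primer, rev_primer_rc, allele_refs, max_mismatches=1):
    """Return the allele label for one joined read, or None if it cannot be called."""
    # Identify the locus from the primer sequences at both ends of the read.
    if not (read.startswith(fwd_primer) and read.endswith(rev_primer_rc)):
        return None
    # The "captured" sequence lies between the two primer regions.
    captured = read[len(fwd_primer):len(read) - len(rev_primer_rc)]
    scored = sorted((hamming(captured, ref), label)
                    for label, ref in allele_refs.items())
    best_distance, best_label = scored[0]
    if best_distance > max_mismatches:
        return None  # too different from every allele reference
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # equally close to two alleles, so ambiguous
    return best_label


def allele_counts(reads, fwd_primer, rev_primer_rc, allele_refs):
    """Count reads assigned to each allele at one locus."""
    calls = (call_allele(r, fwd_primer, rev_primer_rc, allele_refs) for r in reads)
    return Counter(c for c in calls if c is not None)


# Toy locus with two alleles that differ at one position.
FWD, REV_RC = "ACGT", "TTGA"
ALLELES = {"C": "GGCAT", "T": "GGTAT"}
reads = [
    FWD + "GGCAT" + REV_RC,   # exact C allele
    FWD + "GGCAA" + REV_RC,   # C allele with one sequencing error
    FWD + "GGTAT" + REV_RC,   # exact T allele
    FWD + "GGAAT" + REV_RC,   # equally distant from both alleles, dropped
    "TTTT" + "GGCAT" + REV_RC,  # wrong primer, different locus
]
counts = allele_counts(reads, FWD, REV_RC, ALLELES)
print(counts["C"], counts["T"], counts["C"] / counts["T"])
```

Dropping ambiguous reads rather than assigning them to the nearer allele is a conservative design choice in this sketch; it reduces counts slightly but avoids biasing the allelic ratio.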
3.3.6 Filtering for stochastic sampling error
To control for stochastic sampling error, we implemented a previously developed
equation that identifies the minimum allowable input of target gene molecules into library
preparation obtained by measurement relative to known number of input IS, and
minimum allowable number of amplicons from library loaded into sequencer measured
as sequencing counts (Blomquist, Crawford et al. 2015).
A filter for coefficient of variation (CV) expected to result from stochastic sampling
determined analytical variation was applied to both allele-specific expression and total
transcript abundance. Only values with stochastic sampling dependent CV expected to be
less than 1 were subjected to subsequent analysis. Total transcript abundance was
presented as target gene NT molecules/10^6 ACTB molecules.
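To make the quantities in this section concrete, the following Python sketch estimates native template (NT) molecules from the NT:IS sequencing count ratio, normalizes to ACTB, and applies the CV threshold of 1. It is a minimal sketch under stated assumptions and is not the published equation (Blomquist, Crawford et al. 2015): it assumes simple Poisson sampling, so that the expected CV for n sampled molecules or reads is 1/sqrt(n), and it takes the smaller of molecules loaded and sequencing counts as the limiting n. All function names and numbers are hypothetical.

```python
import math


def nt_molecules(nt_counts, is_counts, is_molecules_input):
    """Estimate native template (NT) molecules from the NT:IS count ratio
    and the known number of internal standard (IS) molecules added."""
    if is_counts <= 0:
        raise ValueError("internal standard counts must be positive")
    return nt_counts * is_molecules_input / is_counts


def per_million_actb(target_molecules, actb_molecules):
    """Express target abundance as molecules per 10^6 ACTB molecules."""
    return target_molecules * 1_000_000 / actb_molecules


def expected_sampling_cv(molecules_loaded, sequencing_counts):
    """Assumed Poisson model: CV = 1/sqrt(n), with n the limiting quantity."""
    n = min(molecules_loaded, sequencing_counts)
    return math.inf if n <= 0 else 1.0 / math.sqrt(n)


def passes_cv_filter(molecules_loaded, sequencing_counts, max_cv=1.0):
    """Keep a measurement only if its expected sampling CV is below max_cv."""
    return expected_sampling_cv(molecules_loaded, sequencing_counts) < max_cv


# Hypothetical example: target gene and ACTB measured against their IS.
target = nt_molecules(nt_counts=500, is_counts=1000, is_molecules_input=600)
actb = nt_molecules(nt_counts=2000, is_counts=1000, is_molecules_input=500_000)
abundance = per_million_actb(target, actb)
```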
3.3.7 Small interfering RNA (siRNA) assay in cell culture
Human lung squamous carcinoma cell line H1703 was obtained from American Type
Culture Collection and was maintained at 37°C and 5% CO2 in RPMI 1640 medium
supplemented with 10% fetal bovine serum. Dharmacon ON-TARGETplus SMARTpool
siRNAs (Supplementary Table S3.3) for CEBPG (si05, si06, si07, and si08), negative
control siRNA and DharmFECT 1 reagent were purchased from Thermo Scientific
(Waltham, MA). One day prior to transfection, cells were trypsin dissociated and seeded
into 6-well plates at a predetermined density such that cells were 70-75% confluent at the
time of transfection. Cells were transfected with DharmFECT 1 reagent and serum-free
RPMI 1640 according to the manufacturer’s protocol. Cells were incubated with siRNA
transfection reagent for 24 h and then the media were replaced with RPMI 1640 +
10% FBS. Cells were harvested after an additional 48 h of incubation. A replicate
experiment was done.
3.3.8 Statistical analysis
Ratios of allele-specific expression values from individuals heterozygous at the
marker SNP were log2-transformed prior to analysis. Variance in allelic ratio
measured in cDNA samples was tested for difference from that in matched gDNA
samples by F-test. Difference in mean allelic ratio between cDNA samples and
matched gDNA samples was determined by Student’s t-test. One-way analysis of variance (ANOVA) was used to compare the mean of allelic ratios associated with genotype or diplotype. Pearson’s test was performed to assess inter-gene total transcript abundance bivariate correlation and Fisher Z-test was applied to assess inter-group (e.g. cancer vs non-cancer) difference in bivariate correlation. All statistical tests were two-sided with a statistical significance level of p < 0.05, using either the R statistical programming language (v 3.2.0) or SAS program (v.9.3). All graphs were plotted using GraphPad Prism 6.
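The Fisher Z-test used to compare correlations between groups can be sketched as follows. This is a generic textbook form of the test written with the Python standard library, not the R or SAS code used in this study; the function name and the example correlation values are hypothetical and are not the values reported in the Results.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson
    correlations, using Fisher's r-to-z transformation."""
    z1 = math.atanh(r1)  # Fisher transformation of each correlation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical correlations in two groups of 60 and 20 subjects.
z_stat, p_value = fisher_z_test(0.6, 60, -0.1, 20)
```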
3.4 Results
3.4.1 Effects of CEBPG siRNAs on ERCC5 transcript abundance
To validate the functional role of CEBPG in ERCC5 transcript regulation, we knocked down the CEBPG transcript level and assessed the effect on ERCC5 transcript abundance. As shown in Figure 3.1, CEBPG transcript abundance in cells treated with CEBPG siRNA was significantly reduced by
94% relative to cells treated with control siRNA. At the same time, ERCC5 transcript
abundance in the CEBPG siRNA group decreased 4-fold compared to the control group.
3.4.2 Inter-individual variation in allelic imbalance at rs1047768
Among the eighty newly enrolled subjects, thirty-three individuals were heterozygous
at the marker SNP rs1047768. For analysis of allelic imbalance at rs1047768, data from
these 33 subjects were combined with data from 22 previously enrolled subjects
heterozygous at rs1047768 (Blomquist, Crawford et al. 2010). Allelic ratios for the
cDNA sample and matched gDNA control of each heterozygote are summarized in
Supplementary Table S3.2 and presented graphically in Figure 3.3. Among 55 subjects
that were heterozygous for rs1047768 there was greater inter-individual variation in
rs1047768 allelic ratio in cDNA compared to matched gDNA controls (F-test, p<0.0001)
(Figure 3.3A).
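The variance comparison underlying this F-test can be illustrated with a short Python sketch. It uses only the first five subjects listed in Table S3.2, so it is not expected to reproduce the statistic for all 55 subjects, and it stops at the F statistic and degrees of freedom because the p-value requires an F distribution function that is not in the Python standard library. The function name is hypothetical.

```python
import math
import statistics

# T:C allelic ratios for the first five subjects listed in Table S3.2.
gdna_ratios = [1.04, 1.05, 1.02, 0.99, 1.00]
cdna_ratios = [0.84, 1.22, 0.89, 0.93, 0.47]


def variance_ratio(sample_a, sample_b):
    """Return F = var(a) / var(b) on log2-transformed ratios,
    together with the two degrees of freedom."""
    log_a = [math.log2(x) for x in sample_a]
    log_b = [math.log2(x) for x in sample_b]
    f_stat = statistics.variance(log_a) / statistics.variance(log_b)
    return f_stat, len(log_a) - 1, len(log_b) - 1


f_stat, df_cdna, df_gdna = variance_ratio(cdna_ratios, gdna_ratios)
```

An F statistic well above 1 indicates more spread among the cDNA ratios than among the matched gDNA ratios.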
3.4.3 Characterization of haplotype and diplotype structure in ERCC5 promoter
Of the 55 individuals heterozygous at the marker SNP rs1047768, it was possible to determine the synteny among alleles at rs1047768 and the promoter SNP sites rs751402 and rs2296147 in 44 samples. Six haplotypes of SNPs rs751402–rs2296147–rs1047768 (Table 3.1) and six unique diplotypes (haplotype pairs for an individual) were observed (Table 3.2). The most common haplotype structures were G–C–C (37%) and
G–T–C (21%). The most common diplotypes were GTT/GCC observed in 17/44
individuals (39%) and ATT/GCC observed in 11/44 (25%) (Table 3.2).
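The diplotype percentages quoted above follow directly from the counts in Table 3.2. A minimal Python sketch of that arithmetic is given here for illustration; the variable names are hypothetical and the counts are copied from Table 3.2.

```python
# Diplotype counts copied from Table 3.2 (44 individuals in total).
diplotype_counts = {
    "ATT/GCC": 11,
    "ATT/ATC": 1,
    "GCT/GCC": 3,
    "GCT/GTC": 3,
    "GTT/GCC": 17,
    "GTT/GTC": 9,
}

total_individuals = sum(diplotype_counts.values())

# Frequency of each diplotype as a fraction of all phased individuals.
diplotype_frequency = {
    diplotype: count / total_individuals
    for diplotype, count in diplotype_counts.items()
}

# Diplotypes ordered from most to least common.
ranked = sorted(diplotype_counts, key=diplotype_counts.get, reverse=True)
```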
3.4.4 Association of ERCC5 promoter diplotype with rs1047768 allelic imbalance
There was a significant difference in the mean allelic transcript ratio at rs1047768
among the six diplotype groups (p=0.0030). The mean allelic transcript ratio among
subjects with GTT/GCC diplotype was 1.2-fold higher than that of subjects with diplotype GTT/GTC (p=0.0280) and 1.4-fold higher than that of subjects with
GCT/GTC (p=0.010). There was no difference in mean allelic ratio between diplotype GTT/GCC and ATT/GCC (Figure 3.4).
The rs1047768 allele (T or C) syntenic with the rs2296147 T allele displayed higher transcript abundance than the allele syntenic with the rs2296147 C allele, regardless of the allele present at rs751402 (Figure 3.4). Consistent with this, the mean rs1047768 T/C allelic ratio in the combined group of subjects with either the GTT/GCC or ATT/GCC diplotype was significantly higher (p<0.05) than in subjects with GCT/GCC (1.44-fold), GCT/GTC (1.42-fold), or GTT/GTC (1.22-fold)
(Figure 3.4).
3.4.5 Inter-individual variation in allelic imbalance measured at rs17655 in ERCC5 3’UTR
Twenty-one of the 80 newly enrolled subjects were heterozygous at rs17655,
located in the ERCC5 transcript 3’ UTR. Among these subjects, inter-individual
variation in rs17655 G/C allelic ratio was significantly higher in cDNA compared to
matched gDNA (F-test, p=0.0005) (Figure 3.3B). In addition, mean G/C ratio was
30% higher in cDNA relative to matched gDNA (t-test, p<0.0001).
3.4.6 Analysis of rs2296147 effect on transcription factor binding and rs17655 effect on miRNA binding
The T allele at rs2296147 is predicted to participate in formation of a TP53
transcription factor-binding site, and this binding site is predicted to be lost when
the C allele is present (Marinescu, Kohane et al. 2005).
Although allelic variation at ERCC5 3’UTR SNP rs17655 is not predicted to affect miRNA binding, it is in high linkage disequilibrium (D’ = 0.95) with functional candidate
SNP rs873601 (Genomes Project, Abecasis et al. 2010), which is predicted to alter binding of multiple microRNAs (miRNAs) (Liu, Zhang et al. 2012, Zhu, Shi et al. 2012).
3.4.7 Bivariate analysis of ERCC5 and CDKN1A transcript abundance in non-cancer and cancer subjects
In order to investigate the potential role of TP53 functional activity in regulation of
ERCC5, we measured CDKN1A transcript abundance as surrogate marker for TP53 function (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al.
2005). Of the 80 subjects enrolled, there was significant bivariate correlation between ERCC5 and CDKN1A (also known as p21) among the 60 non-cancer subjects
(r=0.65, p=0.0005), consistent with co-regulation by a common transcription factor. In contrast, the ERCC5 and CDKN1A correlation was significantly lower (Fisher Z-test, p=0.0002) among the 20 cancer subjects (r=-0.52, p=0.0475) (Figure 3.5). Additionally, inter-individual variation in CDKN1A transcript abundance was higher in non-cancer compared to cancer subjects (F-test, p=0.00053).
3.5 Discussion
Known sources of inherited risk for lung cancer include variation in cis-acting regulatory single nucleotide polymorphisms (cis-rSNPs) and/or key transcription factors that regulate antioxidant, DNA repair, and cell proliferation control gene pathways in normal bronchial epithelial cells (NBEC), the progenitor cells for lung cancer (Harr,
Graves et al. 2005, Mullins, Crawford et al. 2005, Blomquist, Crawford et al. 2010).
Data presented here advance mechanistic understanding regarding heritable variation
in cis-regulation of the key NER gene ERCC5 (Figure 3.3).
3.5.1 Contribution of 5’UTR SNPs to ERCC5 cis-regulation
Analysis of diplotype structure at rs751402-rs2296147-rs1047768 demonstrated that higher abundance of transcript from the rs1047768 marker site C allele was associated with the T allele at putative cis-regulatory SNP rs2296147 (Figure 3.4) and was not associated with variation at rs751402. Notably, the rs2296147 T allele participates in formation of an in silico predicted TP53 transcription factor-binding site (Marinescu,
Kohane et al. 2005), and that site is predicted to be lost when the C allele is present. Previous studies showed that TP53 upregulates ERCC5 transcription (Kannan, Amariglio et al.
2000). Therefore, it is reasonable to hypothesize that TP53 upregulates ERCC5 transcription more effectively when T allele is present at rs2296147. Because TP53 is regulated primarily at the post-translational level, TP53 transcription factor functional activity is often measured indirectly as transcript abundance of key target genes such as CDKN1A (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al. 2005). Indeed, we observed a significant correlation of ERCC5 and CDKN1A at transcript level. As reported in Figure 3.5A, CDKN1A and ERCC5 total transcript abundance values were correlated in non-cancer subject NBEC samples and this correlation is lost in NBEC from lung cancer subjects (Figure 3.5B). Greater sample size will be necessary to evaluate association of rs2296147 and the altered CDKN1A correlation with ERCC5 in lung cancer subjects. In contrast to strong evidence for the cis-regulatory role of rs2296147 in ERCC5 regulation, haplotype and diplotype data
do not support a similar role for rs751402. The haplotype-based analyses presented here
distinguished the effects of rs751402 and rs2296147, which was not possible with the
previously reported genotype-based analyses (Blomquist, Crawford et al. 2010).
3.5.2 Contribution of 3’UTR SNPs to ERCC5 cis-regulation
We observed an increased mean G/C allelic ratio at rs17655 in cDNA compared to
matched gDNA controls (Figure 3.3B) indicating that this SNP or a SNP in linkage
disequilibrium with it influences ERCC5 transcript levels in NBEC. As described in
Results section, it is likely that the functional SNP responsible for this observation is
rs873601 which is linked to rs17655 and is predicted to alter binding of multiple miRNAs
(He, Qiu et al. 2012, Liu, Zhang et al. 2012, Zhu, Shi et al. 2012). Specifically, the C
allele at SNP rs17655 is linked to G allele at rs873601 (D’ = 0.95), which is putatively
more responsive to multiple miRNAs that will increase the rate of degradation and lower
abundance of transcripts originating from rs17655 C allele. Data presented here support
the need for functional studies to determine whether rs873601 actually binds any miRNA
and, if so, to evaluate allele-specific degradation. Importantly, because rs2296147 and
rs873601 are not linked, we conclude that any rs873601 effect is independent from that
of rs2296147.
3.5.3 Interaction between 5’UTR and 3’UTR SNPs in ERCC5 cis-regulation
Although functional validation is needed, there is potentially higher TP53-mediated
ERCC5 transcription rate from rs2296147 T allele and higher miRNA mediated ERCC5 transcript degradation at rs873601 G allele. Thus, if each of these cis-regulatory sites acted alone without any contribution from the other (for example, in hypothetical
alternative transcripts) or any other cis-acting SNP, we would expect not only the mean T/C ratio at marker SNP rs1047768 and the mean G/C ratio at SNP rs873601 (or linked SNP rs17655) to be >1, but also very little inter-individual variation around these mean ratios. However, we observed significant variation around the mean allelic ratio at each marker SNP. The likely explanation is that the predominantly expressed ERCC5 transcripts incorporate both marker SNPs (rs1047768 and rs17655), and the effects resulting from genotype at each of the unlinked cis-regulatory sites (rs2296147 and rs873601) interact to determine the allelic ratio measured at each marker SNP.
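The reasoning in this section can be illustrated with a toy calculation. The Python sketch below assumes a purely multiplicative model in which the rs2296147 T allele raises transcription and the rs873601 G allele lowers transcript stability; the model, the effect sizes (1.3 and 0.8), and the function names are hypothetical and were not measured in this study.

```python
def allele_output(promoter_allele, utr_allele, tx_effect=1.3, decay_effect=0.8):
    """Relative transcript output of one chromosome under a multiplicative toy model."""
    level = 1.0
    if promoter_allele == "T":  # rs2296147 T: assumed higher transcription
        level *= tx_effect
    if utr_allele == "G":       # rs873601 G: assumed faster miRNA-mediated decay
        level *= decay_effect
    return level


def expected_allelic_ratio(chrom1, chrom2):
    """Expected ratio of transcripts from chromosome 1 to chromosome 2."""
    return allele_output(*chrom1) / allele_output(*chrom2)


# Three hypothetical rs2296147 T/C heterozygotes that differ only in which
# rs873601 allele is syntenic with the rs2296147 T allele.
same_utr = expected_allelic_ratio(("T", "G"), ("C", "G"))
t_with_g = expected_allelic_ratio(("T", "G"), ("C", "non-G"))
t_with_non_g = expected_allelic_ratio(("T", "non-G"), ("C", "G"))
```

Under these assumptions, individuals with the same promoter genotype would show different allelic ratios depending on phase with the 3’UTR site, which is one way the two unlinked sites could produce inter-individual variation around the mean.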
3.5.4 Value of transcript abundance regulation as intermediate lung cancer risk marker
Consistent with a complex genetic mechanism of lung cancer risk, the effect size of each DNA variant associated with lung cancer risk is very small. Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants and lung cancer risk. The data presented here support the conclusion that inherited variation in gene regulation is a powerful intermediate phenotypic marker for lung cancer risk, as presented schematically in Figure 3.6. As we report here and previously (Blomquist, Crawford et al. 2010), it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS study aiming to determine association of each individual SNP with risk (Amos, Wu et al. 2008). Specifically, the association of a single genetic variant with transcription regulation (e.g. rs2296147 with ERCC5 regulation) or the association of inherited variation in transcript abundance regulation with lung cancer
risk (Blomquist, Crawford et al. 2009) may be assessed with hundreds of subjects
(Blomquist, Crawford et al. 2009). For example, starting with 161 subjects (Amos, Wu et
al. 2008) we observed significant association of rs2296147 genotype with ERCC5
ASE (Figure 3.4), and with fewer than 100 subjects we observed significantly altered
ERCC5 regulation with lung cancer (Mullins, Crawford et al. 2005) (Figure 3.5). In contrast, there was not a clear association of rs2296147 T allele dosage with lung cancer risk among the subjects enrolled for this study (data not shown).
These findings provide evidence to support, as a strategy to identify variants associated with lung cancer risk, analysis of NBEC regulation of key antioxidant, DNA repair, and cell cycle control genes, followed by identification of cis-regulatory variants associated with sub-optimal regulation.
Based on the findings in the current study, we conclude that the T allele at rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to TP53 transcription factor. Genotype at rs17655 also is associated with variation in ERCC5 transcript abundance, likely due to effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.
3.6 Disclosures
JCW has 5-10% equity interest in and serves as a consultant to Accugenomics, Inc.
Technology relevant to this manuscript was developed and patented by JCW, and is
licensed to Accugenomics. These relationships do not alter our adherence to all
BioMed Central policies on sharing data and materials.
3.7 Grants
Significant portions of this work were funded by the National Institutes of Health,
National Cancer Institute (RC2-CA147652 and IMAT R21-CA138397) and National
Heart Lung and Blood Institute (RO1-HL108016), and the University of Toledo
Medical Center George Isaac Cancer Research Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
3.8 Table and Figure Legends
Table 3.1 Summary of haplotype structures in ERCC5 promoter region.
Table 3.2 Summary of diplotype structures in ERCC5 promoter region.
Figure 3.1 Effects of CEBPG siRNAs on ERCC5 transcript abundance
Transcript abundance was normalized to the reference gene ACTB and is presented as target gene molecules/10^6 ACTB molecules. The average of transcript abundance measurements obtained from two replicates was calculated for CEBPG and ERCC5, respectively. The base-10 logarithm of the mean transcript abundance is plotted on the y axis. White rectangle, controls treated with control siRNA. Stripe-filled rectangle, CEBPG knocked down by siRNA. After siRNA treatment, the transcript abundance of CEBPG in cells treated with CEBPG siRNA was significantly reduced by 94% relative to cells treated with control siRNA. At the same time, ERCC5 transcript abundance in the CEBPG siRNA group decreased 4-fold compared to the control group.
Figure 3.2 Schematic overview of ERCC5 gene putative cis-regulatory polymorphic sites and orientation for allele-specific expression measurement.
ERCC5 gene coordinate represents NCBI Gene NC_000013.11. All positions noted here are relative to the reported NCBI mRNA RefSeq TSS for ERCC5 NM_000123.3. Grey or white arrowheads indicate the direction of transcription relative to gene orientation. TSS, transcript start site. (A) The syntenic relationship of alleles at rs751402, rs2296147 and
rs1047768 in individuals heterozygous for rs1047768 was assessed by allele-specific
PCR amplification followed by direct sequencing. C allele Reverse Primer, a C-allele-specific primer used in combination with Forward Primer for specific amplification from
C allele at rs2296147 to determine synteny between alleles at rs2296147 and rs751402. T allele Forward Primer and C allele Forward Primer were used in combination with
Reverse Primer in exon 2 for specific amplification from T allele or C allele, respectively, at rs2296147 to determine the synteny between alleles at rs2296147 and rs1047768. cDNA instead of gDNA was used for this amplification to avoid the large intron 1. The depicted cluster region of TSS represents highly variable ERCC5 transcription initiation sites as discussed previously (Blomquist, Crawford et al. 2010). (B) The allele-specific expression of ERCC5 was measured at two polymorphic sites, rs1047768 and rs17655.
Generally, native templates in the cDNA sample were amplified with a known number of internal standards, which contain identical priming sites and 6 nucleotides altered relative to native template. Barcodes that allowed for multiplexing and adapters specific for
Illumina HiSeq platform were added by PCR. The products were quantified and purified, then sent for sequencing on Illumina HiSeq. Asterisk indicates nucleotide alteration in internal standard (IS) relative to native template (NT).
Figure 3.3 Allelic ratios measured at rs1047768 and rs17655.
A base-2 logarithmic transformation was applied to allelic ratios measured at two polymorphic sites, rs1047768 and rs17655, in cDNA and matched gDNA samples, and the transformed values were used for statistical tests. The dashed line at 0 is the reference line for an allelic ratio of 1. (A) Inter-individual variation in T/C allelic ratios measured at polymorphic site rs1047768 located
in ERCC5 coding region exon 2 is significantly higher in cDNA samples relative to
matched gDNA controls (F-test, p<0.0001). The mean log2(T/C ratio) in cDNA (M=0.11,
SD=0.34) was significantly higher than that for matched gDNA (M=0.03, SD=0.11)
according to t-test (p=0.0216). (B) Similar to rs1047768, significantly higher inter-individual
variation in allelic ratios of cDNA compared to matched gDNA (F-test,
p=0.0005) was also observed at polymorphic site rs17655 located in ERCC5 exon 15
close to the 3’UTR. The mean log2(G/C ratio) was significantly higher in cDNA (M=0.25,
SD=0.35) than in gDNA (M=-0.14, SD=0.16) (t-test, p< 0.0001). M, mean. SD, standard deviation.
Figure 3.4 Allelic ratios measured at rs1047768 sorted by various diplotype.
ANOVA was used to assess the difference in T/C allelic ratios among groups. All effects were statistically significant at the 0.05 significance level. Allelic ratios are shown in relation to the six observed diplotypes at rs751402, rs2296147 and rs1047768 in ERCC5 5’UTR and exon 2. ANOVA revealed a significant difference in the mean allelic transcript ratio among the six diplotype groups (p=0.0030). The mean allelic transcript ratio among subjects with GTT/GCC diplotype (group 5) was 1.2-fold higher (p=0.0280) than that of subjects with GTT/GTC diplotype (group 6) and 1.4-fold higher (p=0.010) than that of subjects with GCT/GTC diplotype (group 4). There was no significant difference in mean allelic ratio between diplotype GTT/GCC and ATT/GCC (group 1), so they were combined and presented as NTT/GCC (group 7, N represents either G or A). The rs1047768 allele (T or C) syntenic with the rs2296147 T allele displayed higher transcript abundance than the allele syntenic with the rs2296147 C allele, regardless of the allele present at
rs751402. Consistent with this, the mean allelic ratio in group 7 was significantly higher
(p<0.05) than in groups 3, 4, and 6 (1.44-fold, 1.42-fold, and 1.22-fold,
respectively).
Figure 3.5 Correlation of CDKN1A and ERCC5 transcript abundance in NBEC
samples. (A) Correlation of CDKN1A and ERCC5 in non-cancer subjects. Pearson’s
correlation coefficient was calculated, r=0.58, p=0.0014. (B) Correlation of CDKN1A and
ERCC5 in cancer subjects, r=-0.08, p=0.7781. The correlation coefficient was
significantly decreased in cancer subjects relative to non-cancer subjects (Fisher Z-test,
p=0.0002) indicating a significantly altered correlation in cancer subjects. NC, non-
cancer. CA, cancer. NBEC, normal bronchial epithelial cells.
Figure 3.6 Increased lung cancer risk through sub-optimal normal bronchial
epithelial (NBEC) regulation of protective genes.
This schematic indicates the putative genetic basis for hereditary increased lung cancer
risk in three individuals. SNPs that affect transcript abundance regulation are indicated numerically and as diamonds (trans-regulatory SNPs) or circles (cis-regulatory SNPs). As indicated, each individual is at increased risk due to sub-optimal regulation of a different combination of genes. Further, when the same gene is sub-optimally regulated in multiple individuals (e.g. Gene C in Individuals 1 and 3), a different set of SNPs may be responsible in each individual.
3.9 Table and Figure
Table 3.1
Haplotype
rs751402   rs2296147   rs1047768   Count   Frequency
G          C           C           112     37%
G          T           C           63      21%
G          T           T           55      18%
A          T           T           51      17%
G          C           T           12      4%
A          T           C           7       2%
Table 3.2
rs751402-rs2296147-rs1047768 Diplotype
Parental Chromosome 1   Parental Chromosome 2   Count   Frequency
A-T-T                   G-C-C                   11      25%
A-T-T                   A-T-C                   1       2%
G-C-T                   G-C-C                   3       7%
G-C-T                   G-T-C                   3       7%
G-T-T                   G-C-C                   17      39%
G-T-T                   G-T-C                   9       20%
Figure 3.1
Figure 3.2
Figure 3.3
Figure 3.4
Figure 3.5
Figure 3.6
3.10 Supplemental Table and Figure Legends
Table S3.1 Demographic characteristics of enrolled 80 subjects.
Note:
a Two-sided Student's t-test to assess mean difference between non-cancers and cancers.
b Two-sided Chi-square test to determine the difference of distributions between non-cancers and cancers.
c Ethnicity information for three cancer subjects was missing.
Table S3.2 Summary of genotype, diplotype and allelic ratios for heterozygotes at rs1047768.
Table S3.3 ON-TARGETplus SMARTpool siRNA Sequences
3.11 Supplemental Table and Figure
Table S3.1
                            Non-Cancers (N=60)   Cancers (N=20)   p value
Age (years) [Mean ± SD]     64.7 ± 7.5           69.3 ± 9.2       0.0648a
Gender                                                            0.1548b
  Female                    28 (84.8%)           5 (15.2%)
  Male                      32 (71.1%)           13 (28.9%)
Ethnicity                                                         0.5305b
  Caucasian                 53 (76.8%)           16 (23.2%)
  African Americanc         7 (87.5%)            1 (12.5%)
Table S3.2
Genotype columns: rs751402, rs2296147, rs1047768. Diplotype columns (rs751402-rs2296147-rs1047768): Parental Chromosome 1, Parental Chromosome 2. rs1047768 T:C allelic ratio columns: gDNA, cDNA.
Subject ID   rs751402   rs2296147   rs1047768   Chromosome 1   Chromosome 2   gDNA   cDNA
1036         A/G        T/C         T/C         A-T-T          G-C-C          1.04   0.84
2007         G/G        T/C         T/C         G-T-T          G-C-C          1.05   1.22
3030         A/G        T/C         T/C         A-T-T          G-C-C          1.02   0.89
1016         G/G        T/C         T/C         G-T-C          G-C-T          0.99   0.93
1021         G/G        T/T         T/C         G-T-T          G-T-C          1.00   0.47
1060         G/G        T/C         T/C         G-T-T          G-C-C          1.04   1.01
1082         A/G        T/C         T/C         A-T-T          G-C-C          1.07   0.98
3002         G/G        T/C         T/C         G-T-T          G-C-C          1.04   1.07
3063         G/G        C/C         T/C         G-C-T          G-C-C          1.03   0.64
2021         G/G        T/T         T/C         G-T-T          G-T-C          1.12   1.12
3051         G/G        T/T         T/C         G-T-T          G-T-C          1.02   1.12
1084         A/G        T/C         T/C         A-T-T          G-C-C          1.01   1.17
2012         G/G        T/C         T/C         G-T-C          G-C-T          1.05   0.80
1080         A/A        T/T         T/C         A-T-T          A-T-C          0.96   1.03
1074         A/G        T/C         T/C         A-T-T          G-C-C          1.02   1.10
1067         A/G        T/C         T/C         A-T-T          G-C-C          1.09   0.94
2020         G/G        T/C         T/C         G-T-C          G-C-T          0.99   0.81
1079         G/G        T/C         T/C         G-T-T          G-C-C          1.01   0.84
2018         G/G        T/C         T/C         G-T-T          G-C-C          1.04   1.45
2035         G/G        T/T         T/C         G-T-T          G-T-C          1.02   0.88
J003         G/G        T/T         T/C         G-T-T          G-T-C          0.78   0.83
J004         G/G        C/C         T/C         G-C-T          G-C-C          1.06   0.98
J007         G/G        T/C         T/C         G-T-T          G-C-C          1.06   1.00
L005         A/G        T/C         T/C         A-T-T          G-C-C          0.99   1.14
L014         G/G        T/C         T/C         G-T-T          G-C-C          0.97   1.03
532          A/G        T/C         T/C         A-T-T          G-C-C          1.15   1.40
344          A/G        T/C         T/C         A-T-T          G-C-C          1.07   1.65
574          A/G        T/C         T/C         A-T-T          G-C-C          1.09   1.53
289          A/G        T/C         T/C         A-T-T          G-C-C          1.10   1.34
720          A/G        T/C         T/C         A-T-T          G-C-C          1.06   1.45
399          G/G        T/C         T/C         G-T-T          G-C-C          1.21   1.37
572          G/G        T/C         T/C         G-T-T          G-C-C          0.95   1.12
591          G/G        T/C         T/C         G-T-T          G-C-C          0.87   1.23
652          G/G        T/C         T/C         G-T-T          G-C-C          0.87   1.15
664          G/G        T/C         T/C         G-T-T          G-C-C          0.99   1.07
128          G/G        T/C         T/C         G-T-T          G-C-C          0.95   1.64
286          G/G        T/C         T/C         G-T-T          G-C-C          0.97   1.15
389          G/G        T/C         T/C         G-T-T          G-C-C          0.94   1.47
715          G/G        T/C         T/C         G-T-T          G-C-C          0.89   2.10
303          G/G        T/T         T/C         G-T-T          G-T-C          0.97   0.97
589          G/G        T/T         T/C         G-T-T          G-T-C          1.16   1.10
649          G/G        T/T         T/C         G-T-T          G-T-C          1.06   0.90
650          G/G        T/T         T/C         G-T-T          G-T-C          1.10   0.95
309          G/G        C/C         T/C         G-C-T          G-C-C          1.18   0.91
Table S3.3
Human CEBPG   Target Sequence
siRNA-05      UCGAAACAGUGACGAGUAU
siRNA-06      GAACGGAAUUAGUGUUAUC
siRNA-07      GGAAUUAAGUGUACUCAAA
siRNA-08      GACAGCAGAUGGCGACAAU
Chapter 4 Lung Cancer Risk Test Trial: Study Design, Participant Baseline Characteristics, Bronchoscopy Safety, and Establishment of Biospecimen Repository
E.L. Crawford1, A. Levin2, F. Safi1, M. Lu2, A. Baugh1, X. Zhang1, J. Yeo1, S.A.
Khuder1, A.M. Boulos1, P. Nana-Sinkam3, P.P. Massion4, D.A. Arenberg5, D. Midthun6,
P.J. Mazzone7, S.D. Nathan8, R. Wainz9, G. Silvestri10, J. Tita11, and J.C. Willey1,*
1Department of Pulmonary and Critical Care, The University of Toledo Medical Center –
Toledo, OH, 2Department of Biostatistics, Henry Ford Hospital System, Detroit, MI,
3Ohio State University James Comprehensive Cancer Center and Solove Research
Institute – Columbus, OH, 4Thoracic Program, Vanderbilt Ingram Cancer Center –
Nashville, TN/US, 5University of Michigan – Ann Arbor, MI, 6Mayo Clinic – Rochester,
MN/US, 7Cleveland Clinic – Cleveland, OH, 8Inova Fairfax Hospital – Falls Church,
VA/US, 9The Toledo Hospital – Toledo, OH, 10Medical University of South Carolina-
Charleston, SC/US, 11Mercy/St. Vincent’s Hospital, Toledo, OH.
* To whom correspondence should be addressed.
Published in BMC Pulmonary Medicine, January 22, 2016.
4.1 Abstract
Introduction. The Lung Cancer Risk Test (LCRT) trial is a prospective cohort study comparing lung cancer incidence among persons with a positive or negative value for the
LCRT, a 15-gene test measured in normal bronchial epithelial cells (NBEC). The purpose of this article is to describe the study design, primary endpoint, and safety; baseline characteristics of enrolled individuals; and establishment of a biospecimen repository.
Methods. Eligible participants were aged 50-90 years, were current or former smokers with a cigarette smoking history of 20 pack-years or more, were free of lung cancer, and were willing to undergo bronchoscopic brush biopsy for NBEC sample collection. NBEC, peripheral blood samples, baseline CT, and medical and demographic data were collected from each subject.
Results. Over a two-year span (2010-2012), 403 subjects were enrolled at 12 sites. At baseline, 384 subjects remained in the study; mean age and smoking history were 62.9 years and 50.4 pack-years, respectively, with 34% current smokers. Obstructive lung disease (FEV1/FVC <0.7) was present in 157 (54%). No severe adverse events were associated with bronchoscopic brushing. An NBEC and matched peripheral blood biospecimen repository was established.
Conclusions. The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for low-dose
Computed Tomography (LDCT) lung cancer screening. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population.
These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early detection of lung cancer and delay the onset of screening for individuals with results indicating low lung cancer risk. For these individuals, the small risk incurred by undergoing once-in-a-lifetime bronchoscopic sample collection for LCRT may be offset by a reduction in their CT-related risks. The LCRT biospecimen repository will enable additional studies of the genetic basis for COPD and/or lung cancer risk.
Trial Registration: The LCRT Study, NCT 01130285, was registered with
Clinicaltrials.gov on May 24, 2010.
Key words: lung cancer risk test, hereditary lung cancer risk, normal bronchial epithelial cells, lung cancer screening, bronchoscopy safety, bronchial brush safety
4.2 Introduction
Lung cancer claimed nearly 160,000 lives in 2014 in the United States alone (2014).
Prevention efforts have reduced cigarette smoking prevalence from about 50% in 1960 to
less than 20% today but, due to past and continued cigarette smoking and the lack of
effective treatment for advanced disease, lung cancer kills more than the next three most
deadly cancers (breast, colon, prostate) combined and is expected to do so for decades to
come (2014). Because prognosis is related to stage, there has long been interest in
detecting lung cancer in early stage when it is amenable to potentially curative treatment.
Thus, it is notable that the US Preventive Services Task Force (USPSTF) now
recommends lung cancer screening with LDCT for healthy individuals at high risk for
lung cancer on the basis of evidence that it will detect the majority of lung cancers in
early stage and thereby reduce lung cancer mortality by 20% (National Lung Screening
Trial Research, Aberle et al. 2011, Humphrey, Deffebach et al. 2013). However, the overall benefit of screening is associated with adverse consequences, including identification of large numbers of nodules, most of which will be nonmalignant, and the complications, costs, and anxiety associated with diagnostic tests (Bach, Mirkin et al.
2012). These adverse consequences could be reduced by restricting screening eligibility to only those at greatest risk. Among the approximately 8 million subjects eligible for screening according to current criteria, which include smoking history >30 pack-years
and age 55-80 years (Blomquist, Crawford et al. 2009, National Lung Screening Trial
Research, Aberle et al. 2011), risk varies widely from less than 0.08% per year to over
1% per year (van Klaveren, Habbema et al. 2001, Bach, Kattan et al. 2003, Spitz, Hong et
al. 2007, Cassidy, Myles et al. 2008, Field, Baldwin et al. 2011, Tammemagi, Pinsky et
al. 2011, Raji, Duffy et al. 2012, Kovalchik, Tammemagi et al. 2013). As such, a large majority of screened individuals will not develop lung cancer in their lifetime and the overall benefit of screening is reduced by the adverse events and large cost associated with screening subjects who will not benefit due to low risk. For these reasons, there is increasing interest in the development of an accurate diagnostic molecular test for lung cancer risk that will more accurately stratify subjects for screening. It is expected that limiting screening to those with a positive risk test will reduce the high cost and side effects of screening programs.
Different approaches are currently in progress to develop a molecular diagnostic test for lung cancer risk in the group eligible for annual CT screening based on demographic criteria. These approaches may be divided into two broad categories, early diagnosis and hereditary risk.
The early diagnosis strategy is to detect lung cancers in early stage before symptoms occur so that they can be treated with high chance for cure. This category includes approaches to identify pre-clinical early lung cancer based on blood tests for circulating proteins, antibodies, and/or microRNA (Boeri, Verri et al. 2011, Higgins, Roper et al.
2012, Pecot, Li et al. 2012, Cazzoli, Buttitta et al. 2013, Daly, Rinewalt et al. 2013,
Mehan, Williams et al. 2014, Birse, Lagier et al. 2015, Vachani, Pass et al. 2015), or gene expression tests measured in non-cancer bronchial or nasal airway epithelium that reflect presence of lung cancer due to a field effect (Spira, Beane et al. 2007, Gower, Steiling et al. 2011, Silvestri, Vachani et al. 2015). Because these tests are for early detection they
will need to be repeated periodically. A positive test will inform a decision regarding more conservative or more rigorous assessment for presence of lung cancer, including chest CT and/or PET-CT, followed by biopsy. If the intended use is to serve as the primary screening method, an early diagnosis test will need to demonstrate non-inferiority relative to the screening test currently recommended by the USPSTF, annual low dose helical CT.
The hereditary risk test strategy is to identify individuals who have a genetic predisposition to lung cancer so that they can be prioritized for annual chest CT screening. Approaches to identify hereditary risk include a) genome wide association studies (GWAS) to discover DNA polymorphisms associated with lung cancer (Wang,
McKay et al. 2014, Wang, Zhu et al. 2014) and b) studies to identify risk-associated proximate phenotypic markers (Blomquist, Crawford et al. 2009). The Lung Cancer Risk
Test (LCRT) falls into this latter category. The LCRT is a 15 gene test measured in grossly normal bronchial epithelial cells (NBEC) obtained through bronchial brush biopsy (Blomquist, Crawford et al. 2009). The proximate phenotypic markers of hereditary risk comprised by the LCRT are key protective antioxidant, DNA repair, and cell cycle control genes that are sub-optimally regulated in normal bronchial epithelial cells (NBEC). The rationale for this approach is that sub-optimal NBEC regulation of a protective gene has greater effect on risk than an individual single nucleotide polymorphism (SNP). This conclusion is based on results of previous studies in which we identified cis-regulatory SNPs associated with sub-optimal regulation of genes comprised by the LCRT, including ERCC5 (Blomquist, Crawford et al. 2010); [Zhang,
submitted] and CEBPG (Blomquist, Brown et al. 2013). For example, we identified two
cis-regulatory SNPs that independently contribute to regulation of ERCC5 transcript abundance (Blomquist, Crawford et al. 2009, Blomquist, Crawford et al.
2010); [Zhang, submitted]. Thus, a proximate phenotype based on sub-optimal NBEC
regulation of a protective gene enriches for risk determining SNPs.
The clinical setting for LCRT biomarker intended use is individuals who are approaching
annual CT screening eligibility according to USPSTF criteria (Humphrey, Deffebach et
al. 2013). In order to have clinical utility it is important that the test be both accurate and
safe to perform in this intended population. In an effort to assess the accuracy and safety
of the LCRT we initiated a multi-site prospective cohort trial. The purpose of this report is to describe 1) the LCRT trial study design and primary endpoint, 2) baseline
characteristics of enrolled individuals including demographic and lung function data, and
3) secondary endpoints reached thus far, including a) analysis of safety for the
bronchoscopic brush method used to obtain samples for LCRT testing, and b)
establishment of a biospecimen repository containing NBEC and peripheral blood
samples collected from the LCRT cohort.
4.3 Methods
4.3.1 Study design.
This LCRT study (ClinicalTrials.gov, NCT01130285) was conducted after approval by
an institutional review board at each participating institution (University of Toledo
Medical Center, Mayo Clinic, University of Michigan, The Toledo Hospital, Ohio State
University, Vanderbilt University Medical Center/Tennessee Valley VA Medical Center,
Henry Ford Health System, National Jewish Health, Medical University of South
Carolina, Inova Fairfax Hospital, Cleveland Clinic Foundation and Mercy St. Vincent
Medical Center, see Table S4.1) and under a Food and Drug Administration (FDA)
approved Investigational Device Exemption (IDE G090273). The original design to
assess the clinical utility of the LCRT biomarker was a prospective, blinded, nested case-
control study. The original primary endpoint was prediction of risk for development of
lung cancer with an odds ratio of at least 5.0. It was estimated that there would have been
sufficient power to test this endpoint by enrolling approximately 800 subjects and
following them for 3 years, resulting in identification of at least 15 prospective lung
cancer cases. LCRT analysis would then be conducted in NBEC of the 15 cases and 120
matched controls. However, the study was revised to a prospective cohort design due to
a) advances in technology that enable cost-effective measurement of LCRT in all subjects, and b) the greater power associated with this design. The new design and primary endpoints are described below.
The secondary endpoints and analyses were unchanged and include: 1) determination of
study safety at day 30, 2) establishment and maintenance of a biospecimen repository of
biological specimens derived from NBEC [RNA and cytology slides] and corresponding
blood samples [peripheral blood leukocyte Buffy Coat and frozen plasma] from the
subjects enrolled, 3) analysis of the predictive ability of LCRT positive for lung cancer
including sensitivity, specificity, positive predictive value, and negative predictive value,
4) calculation of absolute risk of LCRT positive for lung cancer and, 5) measurement of
the incidence of lung cancer in the study cohort every two years until the end of study.
Additionally, we will explore the influence of demographic or clinical variables for lung cancer on the predictive ability of LCRT.
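As an illustration only, the predictive-ability measures named in secondary endpoints 3 and 4 can be sketched from a 2x2 table of LCRT result against lung cancer outcome. The function name and the example counts below are hypothetical and are not study results; the risk ratio is included because it is the quantity named in the primary endpoint.

```python
def predictive_metrics(tp, fp, fn, tn):
    """Summarize a 2x2 table of test result versus outcome.

    tp: test positive, developed lung cancer
    fp: test positive, no lung cancer
    fn: test negative, developed lung cancer
    tn: test negative, no lung cancer
    """
    risk_pos = tp / (tp + fp)  # absolute risk among test-positive subjects (equals PPV)
    risk_neg = fn / (fn + tn)  # absolute risk among test-negative subjects
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": risk_pos,
        "npv": tn / (tn + fn),
        # Undefined (division by zero) when no test-negative subject develops cancer.
        "risk_ratio": risk_pos / risk_neg,
    }


# Hypothetical counts for illustration only (not study data).
example = predictive_metrics(tp=8, fp=40, fn=2, tn=250)
```

With these made-up counts, the risk among test-positive subjects is 8/48 and the risk among test-negative subjects is 2/252, so the risk ratio is 21.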
4.3.2 Revised study design.
After development of a novel targeted NGS platform (Blomquist, Crawford et al.
2013), we implemented LCRT measurement on this platform. The higher throughput of the NGS method enables cost-effective analysis of samples from all 384 subjects and conversion to a prospective cohort study with greater power compared to the original nested case-control design. We plan to assess association of the LCRT value with development of lung cancer in this cohort through follow-up every one to two years for up to 20 years. We will estimate disease-free probabilities for different measured LCRT values at six and eight years of follow up. The primary endpoint will be the prediction of risk for development of lung cancer with a risk ratio of at least 5.0 and we expect to reach this endpoint at the six year follow-up.
Assuming a 20% rate of failure to re-contact (due to death or other factors), approximately 300 individuals from the cohort will be available for analysis. Based on the demographic characteristics of the LCRT cohort, the expected cumulative incidence at six years following enrollment (which will be reached for all subjects between 2016 and 2018) is >5%. Assuming a two-tailed test of significance and a type-1 error rate of
0.05, there will be >80% power to detect a risk ratio associated with LCRT positivity of >
2.45, 1.82, 1.65, 1.57, 1.49, and 1.42 for cumulative incidence rates of lung cancer of 1%,
2%, 3%, 4%, 5%, and 6%, respectively, in the cohort at the six year follow-up. Thus, this proposed study is more than adequately powered to detect even modest LCRT effects at
the next planned follow-up. In addition to the risk ratio associated with a positive LCRT, we will also calculate the concordance index of the test based on the estimated Cox proportional hazards model. The concordance index in the Cox model is the analogue of the area under the receiver operating characteristic curve for a logistic regression model.
We will use it to measure LCRT biomarker accuracy in the full cohort analysis.
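A minimal sketch of the concordance index (Harrell's C) for right-censored follow-up data is given here for illustration. It assumes each subject has a follow-up time, an event indicator (1 = lung cancer observed, 0 = censored), and a risk score such as the linear predictor from a fitted Cox model. The function name and example values are hypothetical, and the handling of tied follow-up times is deliberately simplified (such pairs are skipped).

```python
def concordance_index(times, events, scores):
    """Harrell's C for right-censored data.

    A pair (i, j) is usable when subject i has an observed event strictly
    before subject j's follow-up time. The pair is concordant when the
    subject with the earlier event also has the higher risk score; tied
    scores count as half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable


# Hypothetical values for illustration only (not study data).
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 0, 1, 0],
    scores=[0.9, 0.3, 0.2, 0.4],
)
```

In this toy example four pairs are usable and three are concordant, giving C = 0.75. A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.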
4.3.3 Participants.
To participate in the study, subjects had to be willing and able to provide and sign both written Informed Consent and Health Insurance Portability and Accountability Act (HIPAA) Authorization forms for this study, and to undergo bronchoscopy and phlebotomy procedures for the collection of biological specimens as well as follow-up interviews and CT scans. Entry criteria required subjects to be at high demographic risk for lung cancer based on age 50-90 years and a minimum of 20 pack-years of cigarette smoking history, but to have low likelihood for lung cancer at the time of bronchoscopy.
Both current (defined as self-reported regular use of cigarettes) and former cigarette smokers were eligible. Consent included bronchial brush biopsy to obtain NBEC samples at time of either a) standard of care (SOC) bronchoscopy for a clinical indication for bronchoscopy, b) a study-driven (SD) bronchoscopy, or c) bronchoscopy done for another research study to which they had consented (also considered to be SD). Subjects had to be without a diagnosis of lung cancer prior to or at enrollment. Women with the potential for pregnancy had to have a negative result on a pregnancy test. Subjects were excluded if they were previously diagnosed or treated for lung cancer or had a high pretest likelihood of lung cancer, if they were positive for hepatitis B, C, HIV, or had active TB or if the physician deemed them to be medically inappropriate due to safety
concerns. Also excluded were children, pregnant women, prisoners, mentally disabled individuals, those who had received a double lung transplantation, those who had received radiation or chemotherapy of any kind within the last month, and those scheduled to receive either radiation or chemotherapy.
4.3.4 Recruitment strategies.
Twelve medical institutions participated in the LCRT study (ClinicalTrials.gov, NCT01130285, Table S4.1).
Participants were recruited through physician referral as well as by advertisements in local newspapers, on institutional web sites and through ClinicalTrials.gov. The goal was to enroll a sample representative of the US population at high risk of lung cancer death based on demographic criteria.
4.3.5 Enrollment.
Subjects were considered enrolled in the LCRT study when they underwent the study procedure (bronchial brush biopsy with NBEC sample collection). All enrolled subjects had a CT of the chest performed within 3 months prior to study entry or a research driven
CT scan within two weeks after study entry to rule out prevalent lung cancer. Study eligibility, including smoking history, was assessed through initial contact interview by a trained clinical coordinator at each site. The initial Contact Report Form (CRF) was designed to allow for computation of number of pack-years of cigarettes smoked as well as a detailed smoking history that included information on periods of smoking cessation and use of other forms of tobacco such as pipes and cigars. The CRF also contained questions on personal history of selected diseases, stroke, and diabetes, family history of lung cancer, occupational history (jobs and industries either previously demonstrated or
thought to be associated with increased risk for lung disease or lung cancer), education, and marital status.
4.3.6 Sample collection.
Standardized sample collection kits were provided to each site. Kits contained supplies for the collection and labeling of biological samples including a disposable bronchial cytology brush (ConMed Corporation, Utica, NY ref.#149) for the collection of NBEC, a
10 ml K2-EDTA vacutainer tube (Becton, Dickinson and Company, Franklin Lakes, NJ ref.#366643) for the collection of whole blood, and barcoded stickers. Following positioning of the bronchoscope, the cytology brush was inserted and NBEC were collected from a grossly normal region of either main stem bronchus. For SOC bronchoscopies, this occurred immediately after the diagnostic procedures and on the opposite side from, or in an area separate from, the lung region under clinical investigation. If the patient had received a lung transplant, the specimen was obtained from the recipient's native main stem bronchus. The brush was withdrawn, shaken into a tube of normal saline chilled on ice and re-inserted into the bronchoscope for collection of additional NBEC.
This procedure was repeated a total of 5-10 times. After the last brushing, the cytology brush was shaken in the saline and then dabbed onto a glass slide to enable assessment by a pathologist. Immediately prior to or immediately following bronchoscopy, approximately 10 ml of whole blood was obtained using standard phlebotomy techniques into a K2-EDTA vacutainer tube. Blood and NBEC samples were transferred to the lab within 10 min for processing and stabilization, which was initiated within 1 hour post-collection.
4.3.7 Follow up.
Subjects enrolled into the study were followed at 30 days for adverse events (AE) and
serious adverse events (SAE) possibly related to the study procedure and then every 3
months throughout the first two years following enrollment. A research driven CT was
done at the one and two year anniversaries of enrollment if a standard of care CT was not
done within three months of the anniversary. The next follow-up is planned for 2016 with
another in 2018. At each follow-up subjects will receive medical record review and phone interview. Those who meet USPSTF guidelines will be encouraged to enter the closest CT screening program for early detection of lung cancer. Those who do not meet current reimbursement criteria for CT screening will receive a study driven chest CT.
4.3.8 Safety analysis: Adverse events and serious adverse events.
Subjects were monitored for all adverse events (AE) immediately following bronchoscopy until deemed medically stable and ready for discharge and again at 30 days after study enrollment by way of a phone call with the subject. Subjects were monitored for serious adverse events (SAE) for two years following enrollment.
Possible AEs included, but were not limited to, fatigue, muscle aches, bitter taste in
mouth, dry or sore throat, hoarseness, fever [greater than 100°F for more than 24 hours],
bronchospasm, arrhythmia, pneumothorax, hemoptysis, shortness of breath and
infections. An SAE was defined as any serious effect on the health or safety or any life-
threatening problem or death caused by, or associated with the study procedure if that
effect, problem or death was not previously identified in the investigational plan or
application. These included hospitalization [>24 hours], death, disability, or any event
that required intervention to prevent damage.
AEs and SAEs were documented and classified in terms of severity [mild, moderate,
severe], expectedness [expected or unexpected] and relatedness [unlikely, possibly,
probably or unknown]. A medical monitor at the data coordinating center (Dr. Paul
Kvale at Henry Ford Health System) worked closely with each site PI and ultimately was
responsible for the final determination of SAE relatedness. Treatments or interventions
and outcomes also were documented.
4.3.9 Statistical analysis.
Statistical significance was determined using an F-test of equality of variances followed
by a Student’s t-test for comparison of groups on continuous variables and Chi square or
Fisher exact test for categorical variables. Differences were considered significant if p <
0.05. Power analysis was conducted as described above in the Revised Study Design
section.
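For illustration, the Fisher exact test used for categorical comparisons can be sketched in pure Python for a 2x2 table. This is a minimal sketch under stated assumptions, not the software used in the study (which is not named here); the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the one
    # observed; the small tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts for illustration only (not study data).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
significant = p_value < 0.05  # threshold stated in the text above
```

For the toy table [[3, 1], [1, 3]] the two-sided p-value is 34/70, approximately 0.486, which would not be considered significant at the 0.05 threshold.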
4.4. Results
Here we present the baseline characteristics of the enrolled LCRT cohort, and results for secondary endpoints that have been reached including safety analysis and establishment of the NBEC and peripheral blood sample biospecimen repository.
4.4.1 Enrollment.
Accrual for the LCRT study was completed in March 2012. We enrolled 403 subjects
with demographic risk factors for lung cancer into a prospective multi-site, blinded
LCRT study, performed bronchoscopy at enrollment, and collected NBEC and blood
(buffy coat and plasma) samples from each subject (Figure 1). Of the 403 subjects enrolled, 288 were enrolled at the time of a standard of care (SOC) bronchoscopy done
for diagnostic purposes and 115 were enrolled at time of a volunteer study driven (SD)
bronchoscopy. Of the 288 SOC bronchoscopies, 64 were done to evaluate for lung
cancer, 34 for monitoring following lung transplantation, and the remaining 190 for a
variety of indications. Of the 403 subjects enrolled, 18 were removed from the study as
screen failures due to diagnosis of prevalent lung cancer at enrolling bronchoscopy or
subsequent tests and one subject withdrew from the study leaving 384 subjects in the
cohort. We conducted a descriptive analysis of baseline data for the 384 remaining
subjects.
4.4.2 Demographic information.
Subject population characteristics are shown in Table 4.1. Of the 384 subjects, mean age
was 62.9 ± 8.2 years with a mean smoking history of 50.4 pack years. Thirty-four percent
were current smokers and approximately 10% were concomitant cigar and/or pipe
smokers. The cohort included 213 males (55%) and 171 females (45%), 89% Caucasians,
10% African Americans, and 1% other. Sixty percent of subjects were married or living
with a partner, 30% were widowed, and 10% were single. A majority (66%) were high
school graduates with or without some college less than a bachelor’s degree, 10% held a
bachelor’s degree and 6% held an advanced degree. Reported income was less than
$40,000 per year in 37% of subjects although 31% of subjects (120 individuals) chose not
to provide household income information. Forty-six percent were retired and 17% were disabled (Table 4.1). Work-related exposures were reported by 234 (61%) of subjects with the highest percentages being asbestos (n=54, 14%), farming (n=41, 11%), chemicals or plastics (10%), welding (10%), foundry or steel milling (9%), and painting
(9%) (Table S4.2). Each subject had a chest CT scan at the time of enrollment; 242
subjects (63%) had a clinically indicated (standard of care) CT scan within three months
prior to enrollment and the remaining 142 (37%) had a research driven CT scan within 2
weeks of enrollment. Twelve percent of subjects were undergoing evaluation for lung
cancer at time of enrollment and were negative for cancer (Table S4.3). Based on
responses to baseline questionnaire, self-reported prevalence of chronic obstructive pulmonary disease (COPD) was 41% (n = 156), chronic bronchitis 18% (n = 68), and emphysema 28% (n = 106) (Table S4.3). Because Pulmonary Function Test (PFT) data
were available for most subjects, it was possible to compare self-reported COPD prevalence to test data (see below). Prevalence of other self-reported lung diseases was: interstitial lung disease 9% (n = 35), and sarcoidosis 3% (n = 10) (Table S4.3).
4.4.3 SOC vs SD bronchoscopy characteristics.
The intended population for the LCRT includes both subjects for whom diagnostic bronchoscopy is indicated who also will benefit from LCRT measurement and subjects who will have bronchoscopy only to obtain NBEC samples for LCRT measurement.
Therefore, we compared baseline characteristics between the SOC and SD bronchoscopy subject groups, which represent each of these respective intended population categories.
Of 384 subjects enrolled, bronchoscopy was SOC in 269 (70%) and SD in 115 (30%).
There were no significant differences in pack years smoked (Table S4.4). SD subjects were slightly younger (mean age of 61.5 compared to 63.6, p = 0.021), more likely to be current smokers (55% vs. 25%, p < 0.001), and less likely to have COPD (41% vs. 60%, p = 0.002) (Table S4.4).
4.4.4 Lung cancer screening eligible sub-group.
The USPSTF age and smoking pack year eligibility criteria for lung cancer screening by annual low-dose helical chest CT are 55-80 years and a minimum of 30 pack-years, respectively. Among subjects enrolled into the LCRT study, 253/384 (65.9%) were eligible for annual screening at enrollment, according to these criteria. Seventy subjects did not meet the minimum age criterion at time of enrollment. By the 2016 follow up time point, 45 of these 70 will be eligible for screening and 69/70 will be eligible by the
2018 follow up.
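The eligibility check described above can be sketched as follows. This sketch applies only the age and pack-year criteria stated in this section; the function names are hypothetical, pack-years are computed with the conventional definition (packs per day multiplied by years smoked, 20 cigarettes per pack), and pack-years are assumed not to change between enrollment and follow-up, which understates exposure for current smokers.

```python
def pack_years(cigarettes_per_day, years_smoked):
    """Conventional pack-year definition: (cigarettes per day / 20) x years smoked."""
    return cigarettes_per_day / 20 * years_smoked


def screening_eligible(age, smoking_pack_years,
                       min_age=55, max_age=80, min_pack_years=30):
    """Age and pack-year criteria only, as stated in the text above."""
    return min_age <= age <= max_age and smoking_pack_years >= min_pack_years


def eligible_at_follow_up(age_at_enrollment, smoking_pack_years, years_elapsed):
    """Re-evaluate eligibility after aging, holding pack-years fixed (assumption)."""
    return screening_eligible(age_at_enrollment + years_elapsed, smoking_pack_years)


# Hypothetical subject for illustration only: 52 years old at enrollment,
# 40 cigarettes per day for 20 years.
exposure = pack_years(cigarettes_per_day=40, years_smoked=20)
eligible_now = screening_eligible(52, exposure)
eligible_later = eligible_at_follow_up(52, exposure, years_elapsed=4)
```

A subject of this kind fails only the age criterion at enrollment and becomes eligible once age 55 is reached, which is the mechanism behind the increase in eligible subjects at later follow-up time points.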
4.4.5 Chronic obstructive pulmonary disease.
We assessed COPD status in the enrolled cohort because COPD is an independent risk factor for lung cancer (Skillrud, Offord et al. 1986, Tockman, Anthonisen et al. 1987,
Mayne, Buenconsejo et al. 1999, Mannino, Aguayo et al. 2003, Wasswa-Kintu, Gan et al.
2005, Purdue, Gold et al. 2007, Young, Hopkins et al. 2009, Schwartz 2012, de-Torres,
Wilson et al. 2015). COPD was defined using GOLD criteria based on pulmonary function test (PFT) data. Demographic information relative to COPD status is displayed in Table 4.2. PFT information was available for 290 subjects. Fifty-four percent of these
(157 subjects) had COPD based on PFT. Among the 157 subjects with COPD based on
PFT, COPD severity was GOLD stage 2 or worse in more than 70% based on established criteria. Mean FEV1/FVC was 0.52 for the 157 subjects with COPD (all stages) compared to 0.78 for the 133 without COPD. Those with COPD were more likely to be male (62% vs. 38% female, p = 0.027) and have a higher mean pack year smoking history (56 vs. 45 for non-COPD, p < 0.001). No differences were noted in age, race or smoking status (current vs. former smokers) (Table 4.2).
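As an illustration of PFT-based classification, the sketch below applies the FEV1/FVC < 0.70 diagnostic threshold together with the standard GOLD spirometric severity cut-offs for FEV1 as percent of predicted (stage 1: at least 80%; stage 2: 50% to under 80%; stage 3: 30% to under 50%; stage 4: under 30%). That the study applied exactly these severity cut-offs is an assumption, and the function name and example values are hypothetical.

```python
def gold_stage(fev1_fvc_ratio, fev1_percent_predicted):
    """Return 0 when spirometry does not meet the COPD criterion, else GOLD stage 1-4."""
    if fev1_fvc_ratio >= 0.70:
        return 0  # airflow obstruction criterion not met
    if fev1_percent_predicted >= 80:
        return 1
    if fev1_percent_predicted >= 50:
        return 2
    if fev1_percent_predicted >= 30:
        return 3
    return 4


# Hypothetical spirometry values for illustration only (not study data).
stages = [
    gold_stage(0.78, 92),  # ratio above threshold: no COPD
    gold_stage(0.65, 85),  # obstruction with preserved FEV1: stage 1
    gold_stage(0.52, 45),  # stage 3
    gold_stage(0.40, 25),  # stage 4
]
```

"GOLD stage 2 or worse" then corresponds to a returned value of 2 or greater.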
Of the 157 subjects with COPD based on PFT criteria, clinical history of COPD based on
self-report or chart review was available for 150. Overall, self-reported status matched the diagnosis by PFT in 67% (Table S4.5).
4.4.6 Lung transplant.
Nine percent (34 subjects) of our cohort had received a (single) lung transplant prior to enrollment. We evaluated differences between lung transplant and non-lung transplant subjects to determine if there were comparable demographic risk factors for lung cancer.
Age (62.9 vs. 63.9 years, p = 0.202), gender, race and smoking history (51.7 vs. 50.0 pack years, p = 0.681) were statistically similar, but 100% of transplant subjects were former smokers compared to only 63% of non-transplant subjects (p < 0.001). Prevalence of COPD was comparable, 62% vs. 53%, p = 0.648. Interstitial lung disease, however, was more prevalent among transplant subjects 29% vs. 7%, p < 0.001 (Table S4.6).
4.4.7 Adverse events.
Serious Adverse Events (SAEs) included any serious effects on the health or safety or any life-threatening problems or death caused by, or associated with the study procedures.
There were no SAEs attributable to this study for either the 241 SOC bronchoscopy subjects or 142 SD bronchoscopy subjects. Adverse Events (AEs) were collected immediately post-procedure and again at the 30 day follow up. AEs classified as possibly or probably attributable to the study were those associated with bronchoscopy and bronchial brush biopsy such as sore throat, hoarseness, cough, throat swelling, chest
soreness, bleeding, fever, fatigue and upper respiratory infection. Since the SOC group received the bronchoscopy as part of their standard-of-care, study related AEs were those associated with the bronchial brushing only. There were no AEs classified as study related among the SOC group.
Among the SD group, there were 11 AEs noted in 9 subjects that were possibly (n= 9) or probably (n = 2) attributable to study procedures. Additionally, AEs were documented in two additional subjects that were deemed unlikely to be related (Table 4.3). All AEs were classified as mild.
4.4.8 Establishment of NBEC and peripheral blood sample biospecimen repository.
Matched blood and NBEC were collected for 361/384 (94%) subjects and banked in multiple aliquots. Blood samples were processed at each site at the time of collection to generate 2 aliquots of buffy coat and 2-5 aliquots of plasma from each subject. These aliquots were frozen and stored at -80°C until shipment to the Early Detection Research
Network (EDRN) Biorepository in Frederick, MD. One aliquot of buffy coat was transferred to the University of Toledo for analysis and the other remains in storage.
NBEC were stabilized at each site in RNAlater (Ambion, Austin, TX) and shipped along with matching slides to ResearchDx, Irvine, CA. RNA was extracted from NBEC within 24-48 hours of receipt, assessed for quality and quantity and stored in aliquots at -80°C. One NBEC RNA aliquot was shipped to the University of Toledo for analysis for those samples with a minimum yield of 1 microgram and aliquots for each subject remain in storage at ResearchDx.
At the University of Toledo, genomic DNA (gDNA) was extracted from one aliquot of buffy coat derived from the blood sample from approximately 80% of subjects and 100% of these yielded gDNA of sufficient quality and quantity for proposed molecular studies.
The quality and quantity of NBEC RNA from approximately 40% of subjects has been assessed to date. RNA from each sample was treated with DNase I, tested via PCR to ensure removal of contaminating gDNA from the RNA and then reverse transcribed into cDNA. For 90% of subjects the cDNA generated from these purified NBEC RNA samples was PCR amplifiable and of sufficient quantity to perform LCRT testing.
Additional aliquots of RNA remain for the roughly 10% of samples that did not pass this quality control. Samples from over 120 subjects were used successfully in preliminary targeted next generation sequencing (NGS) RNA sequencing analysis studies.
4.4.9 Lung cancer incidence.
Two years following initiation of the study, 5 subjects (1.3%) without prevalent lung cancer developed bronchogenic carcinoma. Due to the blinded status of the LCRT study, no further details are available regarding these subjects.
4.5. Discussion
4.5.1 Enrolled cohort is representative of LCRT target population.
The target population of the LCRT biomarker is individuals who meet USPSTF eligibility criteria for annual low dose helical CT screening (Humphrey, Deffebach et al.
2013). The enrollment criteria for the LCRT study included both current and former
smokers, individuals with and without concurrent pulmonary disease and/or respiratory
exposures as well as both subjects undergoing medically recommended bronchoscopy
(SOC group) and volunteers (SD group). At the time of enrollment into the LCRT study,
most subjects (66%) met USPSTF age and smoking pack-year eligibility criteria (55-80
years of age, ≥ 30 pack years). Additionally, most of those not eligible at enrollment will
be eligible for screening by the 2016 follow up time point due to increased age, and this
fraction is expected to further increase at the 2018 follow up. Therefore, this group is
highly representative of the LCRT biomarker target population.
4.5.2 Feasibility to reach LCRT study endpoint based on cohort characteristics.
Based on demographic characteristics of the enrolled population (Table 4.1), we expect
lung cancer incidence in the LCRT study to be similar to the 3.1% incidence over 3.9
years reported by Bach et al. (Bach, Jett et al. 2007), in which mean age was 60.1 years and
mean smoking history was 52 pack-years. The five incident lung cancers observed two years after initiation of the study are consistent with this rate. Taking into account that some of
the 384 study subjects will have died from causes other than bronchogenic carcinoma
prior to these time points and that some will be lost to follow up, we estimated incident
lung cancers in the cohort based on 300 subjects. As such, we expect to observe
approximately 12 incident lung cancers by the 2016 follow up (mean time since
enrollment approximately 5 years) and 17 by the 2018 follow up point (mean time since enrollment approximately 7 years), which will be more than sufficient to reach the proposed endpoint of a risk ratio of > 5.0.
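The arithmetic behind these projections can be reproduced with a short sketch. It assumes a constant annual incidence obtained by dividing the reference incidence (3.1% over 3.9 years) evenly across years, and a simple linear accumulation of cases over time; both are simplifying assumptions, and the function and variable names are hypothetical.

```python
# Reference incidence cited in the text above: 3.1% over 3.9 years.
REFERENCE_CUMULATIVE_INCIDENCE = 0.031
REFERENCE_YEARS = 3.9

# Assumption: incidence accrues at a constant annual rate.
annual_rate = REFERENCE_CUMULATIVE_INCIDENCE / REFERENCE_YEARS


def expected_cases(n_subjects, rate_per_year, years):
    """Expected number of new cases under a constant-rate, linear approximation."""
    return n_subjects * rate_per_year * years


# 300 subjects assumed available for analysis after attrition.
cases_2016 = expected_cases(300, annual_rate, 5)  # mean of about 5 years since enrollment
cases_2018 = expected_cases(300, annual_rate, 7)  # mean of about 7 years since enrollment
```

Under these assumptions the expected counts are about 11.9 and 16.7, which round to the 12 and 17 cases stated above.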
4.5.3 Feasibility of LCRT implementation (safety and acceptance by subjects).
The LCRT biomarker requires a one-time acquisition of NBEC through bronchial brush biopsy at the time of bronchoscopy. In addition to the LCRT study, the Department of
Defense Lung Cancer Research Program and the NIH recently funded other large studies assessing utility of biomarkers measured in NBEC obtained at bronchoscopy intended to more accurately determine lung cancer risk and/or to enable early lung cancer diagnosis
(Massion, Clinicaltrials.gov NCT01475500 CA152662 and CA102353; Spira,
Clinicaltrials.gov NCT02504697 DECAMP-2 and CA164783-04; Dubinett, CA152751-
05S2). Therefore, it is important to carefully evaluate the safety and comfort of this procedure, which will impact general acceptance by patients and clinicians. Based on published studies bronchoscopy with or without biopsy is considered a safe procedure and it is used not only for medical purposes but also to conduct research (Willey, Coy et al. 1996, Willey, Coy et al. 1997, Romagnoli, Vachier et al. 1999, Crawford, Khuder et al. 2000, Eissa and Erzurum 2001, Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2009, Lo Tam Loi, Hoonhorst et al. 2013, Barnes, Saetta et al. 2014, Kim, Oros et al. 2015). Reported complication rates (also known as serious adverse event/SAE rates) for all bronchoscopy procedures range from 0.08-1.93% and mortality rates range from
0.004-0.045% (Jin, Mu et al. 2008, Facciolongo, Patelli et al. 2009, Adare, Afanasiev et al. 2012). One large Japanese study of almost 50,000 patients who underwent bronchoscopy with brush biopsy in either central or peripheral airways reported a complication (SAE) rate of 0.46%. This risk of complication is similar to the 0.28-0.32% complication (SAE) rate reported for colonoscopy (Ko, Riffle et al. 2010, Fisher, Maple
et al. 2011) which is routinely used and repeated for colorectal cancer screening.
Importantly, a bronchoscopy with brush biopsy limited to the central airways for
collection of NBEC, the procedure used here, virtually eliminates risk for the primary
complications reported to be associated with bronchoscopy, including pneumothorax or
significant hemorrhage. Consistent with this, we observed no SAE associated with
bronchoscopic brush biopsy in the subjects enrolled based on SD bronchoscopy.
It is particularly important to assess safety and comfort in the subjects meeting accepted
criteria for lung cancer screening, a group that has increased prevalence for numerous
comorbidities. Results from at least one previous report have suggested that research
bronchoscopy and brush biopsy can be safely performed in subjects with heavy smoking
history and those with obstructive lung disease (Romagnoli, Vachier et al. 1999).
Previous guidelines have suggested that an FEV1 less than 60% is considered a
contraindication to performing research driven bronchoscopy. However, bronchoscopy in
adults with stable asthma and COPD has been performed safely at lower values of FEV1
(Hattotuwa, Gamble et al. 2002). Pulmonary function test data were available for more
than 75% of the subjects enrolled here (290 of 384 subjects). One hundred fifty-seven had clinical COPD and more than 70% had GOLD stage 2 or worse (Table 4.2).
Additionally, 9% of enrolled subjects had a history of interstitial lung disease, 9% were single-lung transplant recipients and a small percentage had other pulmonary disease
(Table S4.3) and bronchoscopy was safely performed on all of them. Specifically, no
complications (SAEs) were associated with bronchoscopic brushing and sample
collection in either standard of care (SOC) or study driven (SD) group.
In summary, bronchoscopic brush of the central airways to collect NBEC for lung cancer risk analysis was safe and well-tolerated in this study of subjects demographically at risk for lung cancer, including those with significant co-morbid conditions. Because the AE rate was much lower than that reported for routinely used screening colonoscopy (Fisher,
Maple et al. 2011) we expect that this procedure will be acceptable to patients and clinicians if the LCRT or other tests in development are validated to identify subjects with increased risk for lung cancer and/or early stage lung cancer.
4.5.4 COPD characteristics of LCRT Cohort.
The enrolled cohort had a high fraction of COPD based on PFT criteria. This is important because COPD is an independent risk factor for lung cancer (Skillrud, Offord et al. 1986,
Tockman, Anthonisen et al. 1987, Mayne, Buenconsejo et al. 1999, Mannino, Aguayo et al. 2003, Wasswa-Kintu, Gan et al. 2005, Purdue, Gold et al. 2007, Young, Hopkins et al.
2009, Schwartz 2012, de-Torres, Wilson et al. 2015). Notably, using PFT data
(FEV1/FVC <0.7) as the diagnostic criterion, one-third of individuals in this study misclassified their COPD status on the enrollment survey self-report. This is consistent with multiple reports of data acquisition through self-report leading to either misclassification or under-diagnosis of COPD (Barr, Herbstman et al. 2002, Straus,
McAlister et al. 2002, Eisner, Trupin et al. 2005, Zhai, Yu et al. 2014, Aldrich, Munro et al. 2015). Some of this misclassification could be due to patients being told they have
COPD on the basis of radiographic imaging although their PFT data do not meet criteria for
COPD diagnosis. Additionally, a portion of the subjects here underwent PFT at the time of enrollment that revealed COPD for the first time because the subject had not been tested prior to enrollment in the study. Given the importance of accurate COPD diagnosis, we plan to obtain both chest CT and PFT data from each subject at each subsequent follow-up. We will then evaluate COPD based on CT (presence of emphysema and/or bronchial thickening) or PFT criteria alone, or in combination as a risk factor for lung cancer.
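The agreement between self-reported and PFT-defined COPD status can be checked from the counts in Table S4.5. The short Python sketch below assumes that accuracy was computed only over subjects who answered the survey question and had PFT data. That reading is an assumption, although it reproduces the 67% shown in the table and the roughly one-third misclassification noted above. Variable names are illustrative.

```python
# Counts from Table S4.5 (subjects with both a self-report answer and PFT data).
# Assumption: "accuracy" = concordant answers / all answers with PFT data.
counts = {
    ("YES", "COPD by PFT"): 92,
    ("YES", "no COPD by PFT"): 32,
    ("NO", "COPD by PFT"): 58,
    ("NO", "no COPD by PFT"): 94,
}

concordant = counts[("YES", "COPD by PFT")] + counts[("NO", "no COPD by PFT")]
total = sum(counts.values())
accuracy = concordant / total
misclassified = 1 - accuracy

print(f"accuracy = {accuracy:.1%}, misclassified = {misclassified:.1%}")
```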
4.5.5 LCRT cohort and biospecimen repository as a resource for subsequent studies.
As presented here, the LCRT cohort is well characterized with respect to demographic characteristics. In addition, NBEC and matching blood samples were collected from each subject. Each subject had a baseline CT scan, and pulmonary function test (PFT) data are available for 76% of individuals. We plan to obtain repeat PFT and CT scans on all subjects at each subsequent follow-up. This information will enable longitudinal assessment of the rate of decline in pulmonary function by both physiologic and radiographic measures, and assessment of the presence or absence of lung cancer. More than
90% of samples assessed so far passed quality control (QC) criteria for the quality and quantity required for reliable LCRT measurement. The NBEC and matching blood samples collected in this study are archived and the majority of subjects have given consent for use of samples remaining after LCRT analysis for future IRB approved studies. Currently, we are using NBEC gene expression data and genotyping data from matched peripheral blood cell gDNA to identify proximate phenotypic biomarkers for COPD risk and additional biomarkers for
hereditary lung cancer risk. We are integrating these data with COPD genome wide association study (GWAS) data from the Lung Health Study and the COPDGene study, available online at NCBI dbGaP.
4.6 Conclusions
The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for LDCT lung cancer screening.
Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population. These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early detection of lung cancer and to delay the onset of screening for individuals with results indicating low lung cancer risk. For these individuals, the small risk incurred by undergoing once-in-a-lifetime bronchoscopic sample collection for LCRT may be offset by a reduction in their CT-related risks. The LCRT biospecimen repository will enable additional studies of the genetic basis for COPD and/or lung cancer risk.
4.7 List of abbreviations used
AE, adverse event; COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 second; FVC, forced vital capacity; GWAS, genome wide association study; LCRT, Lung Cancer Risk Test; LDCT, low dose computed tomography; NBEC, normal bronchial epithelial cells; PFT, pulmonary function test; SAE, serious adverse event; SD, study driven; SNP, single nucleotide polymorphism; SOC, standard of care; USPSTF, United States Preventive Services Task Force.
4.8 Competing interests
JCW has equity interest in and serves as a consultant to Accugenomics, Inc., which licenses technology utilized here. JCW and ELC are inventors on U.S. and international patents related to the technology and biomarkers presented here.
4.9 Author contributions
ELC contributed to the design of the study, interpretation of results and drafting of the manuscript, and performed quality assessment of gDNA and RNA. AL, ML and SAK participated in data analysis and statistical interpretation and contributed to the study design. FS and AMB participated in the preparation of the manuscript. XZ, JY and AB participated in the interpretation of data. PN, PPM, DAA, DM, PJM, SDN, RW, GS and JT served as site directors for the LCRT study and participated in the study design and coordination. JCW conceived of the study, participated in its design and coordination and contributed to the preparation of the manuscript.
4.10 Acknowledgements
This work was funded by grants from the NIH National Cancer Institute CA148572 and
National Heart Lung and Blood Institute HL108016, and the George Isaac Cancer
Research Fund. The NIH did not participate in the study design, data interpretation or
manuscript preparation. We thank Dr. Paul Kvale for his professional review and
assessment of adverse and serious adverse events in relation to the study,
Dr. James Jett for contributions to the planning of the study, and Dr. Ali Musani for
supporting enrollment at National Jewish Hospital.
4.11 Table and Figure Legends
Table 4.1 LCRT Subject Characteristics
Table 4.2 Chronic Obstructive Pulmonary Disease by PFT
Table 4.3 Adverse Events (AE)
4.12 Table and Figure
Table 4.1
Table 1. LCRT Subject Characteristics
Baseline characteristics                          n = 384
Age in years [mean (SD*)]                         62.9 (8.2)
Age in years [median]                             62
Male                                              213 (55%)
Female                                            171 (45%)
Caucasian                                         343 (89%)
African American                                  37 (10%)
Other or not reported                             4 (1%)
Cigarette pack years [mean (SD)]                  50.4 (25.5)
Cigarette pack years [median]                     43
Age in years at smoking inception [mean (SD)]     16.1 (3.8)
Age in years at smoking inception [median]        16
Total years of smoking [mean (SD)]                37.4 (10)
Total years of smoking [median]                   38
History of cigar use                              35 (9%)
History of pipe use                               29 (8%)
Married or living as married                      231 (60%)
Widowed                                           116 (30%)
Single                                            37 (10%)
Less than high school education                   52 (14%)
High school diploma or GED**                      118 (31%)
Associate degree or some college                  136 (35%)
Bachelor's degree                                 40 (10%)
Graduate degree                                   24 (6%)
Other or not reported                             14 (4%)
Employed                                          109 (28%)
Unemployed                                        30 (8%)
Retired                                           175 (46%)
Disabled                                          64 (17%)
Other or not reported                             6 (2%)
Income < $40,000/year                             141 (37%)
Income > $40,000/year                             123 (32%)
Other or not reported                             120 (31%)
* SD = standard deviation
** GED = General Educational Development
Table 4.2
Table 2. Chronic Obstructive Pulmonary Disease by PFT
Classification        n    M / F*    Mean age in years  Race C/AA/Other**  Smoking status current/former  Pack years smoked#  FEV1%$  FEV1/FVC+
No COPD               133  65 / 68   62                 113 / 16 / 4       46 / 87                        45                  80      0.78
COPD (all)            157  97 / 60   63                 145 / 12 / 0       50 / 107                       56                  58      0.52
COPD (stage 1)        45   32 / 12   62                 41 / 4 / 0         22 / 23                        54                  76      0.59
COPD (stage 2)        77   48 / 29   64                 73 / 4 / 0         22 / 55                        57                  57      0.55
COPD (stage 3)        26   15 / 11   64                 23 / 3 / 0         5 / 21                         53                  35      0.39
COPD (stage 4)        7    2 / 5     63                 7 / 0 / 0          1 / 6                          65                  22      0.27
COPD (stage unknown)  2    1 / 1     66                 1 / 1 / 0          0 / 2                          66                  -       0.57
Unknown               94   50 / 44   63                 85 / 9 / 0         35 / 60                        50                  -       -
* M = male, F = female
** C = Caucasian, AA = African-American, Other = other race or race not reported
# Pack years = packs of cigarettes smoked per day x years of smoking
$ FEV1% = forced expiratory volume in 1 second, percent of expected
+ FEV1/FVC = FEV1/Forced Vital Capacity
Table 4.3
Table 3 . Adverse Events (AE)
Subject #  AE Description                                    Severity  Relatedness  Treatment Notes
1035       Felt poorly (like he had a fever) for 3 days      Mild      Possible     Did not seek treatment or notify study personnel until 30 day follow-up
1044       Upper respiratory tract infection                 Mild      Possible     Treated with antibiotics and steroids, infection resolved
1048       Hoarseness for 2 days                             Mild      Probable
1050       Bruising around eyes                              Mild      Unlikely
1057       Bleeding from ears post bronchoscopy              Mild      Possible
1057       Petechiae around eyes                             Mild      Possible
1059       Cough                                             Mild      Possible
1059       Difficulty swallowing                             Mild      Possible
1060       Felt soreness in lung                             Mild      Possible
1061       Dry scratchy area in throat, feels need to cough  Mild      Possible
1076       Slight cough                                      Mild      Possible
1077       Back of throat swollen                            Mild      Probable
1078       Cough                                             Mild      Unlikely
4.13 Supplemental Table and Figure Legends
Table S4.1 Lung Cancer Risk Test Study Enrollment by Study Site
Table S4.2 Work Types and Exposures
Table S4.3 Medical History
Table S4.4 Standard of Care (SOC) vs. Study Driven (SD) Bronchoscopies
Table S4.5 Self-reported vs. Clinical COPD
Table S4.6 Transplant vs. Non-Transplant Subjects
4.14 Supplemental Table and Figure
Table S4. 1
Site                                                              Location         Subjects*
University of Toledo Medical Center                               Toledo, OH       83
Mayo Clinic                                                       Rochester, MN    43
University of Michigan                                            Ann Arbor, MI    86
The Toledo Hospital                                               Toledo, OH       19
Ohio State University                                             Columbus, OH     29
Vanderbilt University Medical Center/Tennessee Valley VA          Nashville, TN    25
Henry Ford Health System                                          Detroit, MI      6
National Jewish Health                                            Denver, CO       4
Medical University of South Carolina                              Charleston, SC   20
Inova Fairfax Hospital                                            Falls Church,    14
Cleveland Clinic Foundation                                       Cleveland, OH    51
Mercy St. Vincent Medical Center                                  Toledo, OH       4
Total                                                                              384
* number in final cohort
Table S4. 2
Total reported work exposures     n = 234/384 (61%)
Asbestos                          54 (14%)
Baking                            11 (3%)
Butchering/Meat Packing           13 (3%)
Chemicals/Plastics                39 (10%)
Coal Mining                       4 (1%)
Cotton or Jute Processing         2 (<1%)
Farming                           41 (11%)
Fire Fighting                     8 (2%)
Flour, Feed or Grain Milling      7 (2%)
Foundry or Steel Milling          34 (9%)
Hard Rock Mining                  1 (<1%)
Painting                          33 (9%)
Sandblasting                      15 (4%)
Welding                           37 (10%)
Table S4. 3
Total enrollment                                      n = 384
Standard of care bronchoscopy                         269 (70%)
Study driven (volunteer) bronchoscopy                 115 (30%)
Standard of care CT scan                              242 (63%)
Study driven CT scan                                  142 (37%)
Under investigation for lung cancer at enrollment     46 (12%)
Family history of lung cancer                         77 (20%)
Personal history of cancer                            72 (19%)
Personal history of COPD (self-reported)              156 (41%)
Personal history of chronic bronchitis                68 (18%)
Personal history of emphysema                         106 (28%)
Personal history of interstitial lung disease         35 (9%)
Personal history of sarcoidosis                       10 (3%)
Personal history of scleroderma                       1 (<1%)
Single lung transplant recipient                      34 (9%)
Table S4. 4
Baseline characteristics             SOC           SD            p value**
Total Enrolled                       269           115
Age in years [mean (SD*)]            63.6 (8.4)    61.5 (7.5)    p = 0.021
Male                                 153 (57%)     60 (52%)
Female                               116 (43%)     55 (48%)
Caucasian                            243 (91%)     100 (88%)
African American                     23 (9%)       14 (12%)
Other or not reported                3 (1%)        1 (<1%)
Cigarette pack years# [mean (SD)]    49.4 (24.2)   51.9 (22.8)
Current smoker                       67 (25%)      63 (55%)
Former smoker                        202 (75%)     52 (45%)      p < 0.001
COPD (all)$                          120 (60%)     37 (41%)      p = 0.002
Stage 1                              26 (13%)      19 (21%)
Stage 2                              62 (31%)      15 (16%)
Stage 3                              23 (12%)      2 (3%)
Stage 4                              7 (4%)        0 (0%)
* SD = standard deviation
# Pack years = packs of cigarettes smoked per day x years of smoking
$ staging info. unavailable for 2 subjects
** p value from Student's t-test reported if < 0.05
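As an illustration of how a Student's t-test p value can be recovered from the summary statistics in this table, the following Python sketch recomputes the age comparison between the SOC and SD groups. It assumes the equal-variance (pooled) form of the test and, because the degrees of freedom are large, approximates the t tail with a normal tail. Both choices are assumptions made for this sketch rather than a description of the original analysis, and the function name is illustrative. The result is close to the reported p = 0.021.

```python
import math
from statistics import NormalDist

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (equal-variance form) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Age in years from Table S4.4: SOC 63.6 (8.4), n = 269; SD group 61.5 (7.5), n = 115.
t_stat, df = pooled_t(63.6, 8.4, 269, 61.5, 7.5, 115)

# Assumption: with several hundred degrees of freedom the t distribution is
# close to normal, so a normal tail is used here instead of an exact t tail.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.3f}, df = {df}, approximate two-sided p = {p_approx:.4f}")
```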
Table S4. 5
PFT diagnostic criteria for COPD (all sites)
Self-reported          YES by PFT*   NO by PFT   No PFT data
YES                    92            32          32
NO                     58            94          51
Did not self-report    7             7           11
Accuracy: 67%
* PFT = Pulmonary Function Test
Table S4. 6
Baseline characteristics                         Transplant     Non-Transplant   p value**
Total Enrolled                                   34             350
Age in years [mean (SD*)]                        62.9 (8.5)     63.9 (3.7)
Male                                             22 (65%)       191 (55%)
Female                                           12 (35%)       159 (45%)
Caucasian                                        31 (91%)       312 (89%)
African American                                 3 (9%)         34 (10%)
Other or not reported                            0 (0%)         4 (1%)
Cigarette pack years# [mean (SD)]                51.7 (27.2)    50 (23.5)
Current smoker                                   0 (0%)         130 (37%)
Former smoker                                    34 (100%)      220 (63%)        p < 0.001
COPD (all)$                                      21/34 (62%)    136/236 (53%)
Personal history of interstitial lung disease    10 (29%)       25 (7%)          p < 0.001
* SD = standard deviation
** p value from Student's t-test reported if < 0.05
# Pack years = packs of cigarettes smoked per day x years of smoking
$ in subjects for whom Pulmonary Function Test (PFT) data are available
Chapter 5 Control for stochastic sampling variation and qualitative sequencing error in next generation sequencing
Thomas Blomquist (a), Erin L. Crawford (b), Jiyoun Yeo (b), Xiaolu Zhang (b), James C. Willey (a,b)*
Authors’ Affiliations:
(a) Department of Pathology, University of Toledo Health Sciences Campus, Toledo, OH 43614
(b) Department of Medicine, University of Toledo Health Sciences Campus, Toledo, OH 43614
* Corresponding author: James C. Willey, M.D., Tel. 001 419 383-3455
Email: [email protected]
Published in Biomolecular Detection and Quantification, September 01, 2015.
5.1 Abstract
Background: Clinical implementation of Next-Generation Sequencing (NGS) is
challenged by poor control of low sample input, library preparation biases and qualitative
sequencing error. To address these challenges we developed and tested two hypotheses.
Hypothesis 1: Analytical variation in target analyte quantification is predicted by Poisson
(i.e. stochastic) sampling effects at two key points; a) input of intact nucleic acid target
molecules into the library preparation reaction, and b) input of amplicons from the library
into the sequencer. Hypothesis 2: Technically derived base substitution, insertion and
deletion frequencies observed at each base position in each native target analyte are
concordant with frequencies observed in competitive synthetic internal standards present
in the same reaction. Methods: To test hypothesis 1, we derived equations using Monte
Carlo simulation to predict assay coefficient of variation (CV) based on three working
models: number of target molecules added to library preparation, number of target
sequence read counts from sequencer, or both. These models were tested against NGS
data from specimens with well characterized allelic ratios, molecule inputs and sequence
counts that were prepared using a competitive multiplex-PCR amplicon-based NGS library preparation method comprising synthetic internal standards. To test hypothesis 2, we measured the frequency of base substitutions, insertions and deletions at each base position within amplicons from each of 30 native target analytes, then compared these frequencies to those at corresponding base positions within 30 respective synthetic competitive internal standard templates present in the same NGS library preparation reactions. Results: For hypothesis 1, the Monte Carlo model derived from both sequencing counts and molecule input measurements best predicted CV and explained
74% of observed assay variance. For hypothesis 2, observed frequency and type of sequence variation at each base position within each competitive internal standard was concordant with frequency and type of sequence variation seen in native templates (NTs) (R² = 0.93).
Conclusion: Inclusion of synthetic competitive internal standard templates in targeted
NGS library preparation controls for low target input into NGS library preparation, low target library product into sequencer, and errors generated during library preparation and sequencing. These controls enable accurate clinical diagnostic reporting of confidence limits and limit of detection for NGS measurement of copy number and for base substitution, insertion and deletion rates at each base position within each target analyte.
5.2 Introduction
Quantitative analysis of transcript abundance and/or sequence variant frequency are common applications of next generation sequencing (NGS) (Mortazavi, Williams et al. 2008, Spencer, Tyagi et al. 2014). One important diagnostic NGS application includes accurate identification of clinically actionable sequence variation in tumors and the estimation of tumor cell fraction with the actionable mutation (Cibulskis, Lawrence et al.
2013, Spencer, Tyagi et al. 2014). However, lack of appropriate quality control limits wider clinical diagnostic application of NGS in this context. For example, under-loading of target analyte into library preparation and/or library product into sequencer will result in analytical variation due to stochastic sampling (Fu, Xu et al. 2014). At the same time, over-loading of prepared library onto sequencer will result in re-sampling of library amplicons from the same target analyte molecule, and without proper controls will give false assurance of adequate sampling. Moreover, polymerase errors generated during library preparation and/or sequencing steps can confound accurate estimation of the true proportion of clinically actionable sequence mutations present (Schmitt, Kennedy et al.
2012, Fu, Xu et al. 2014).
Thus, for diagnostic NGS applications, it is important to control for several sources of analytical variation, including sample loading into library preparation, efficiency of target amplification in library preparation, loading of prepared NGS library onto a sequencing platform, and the combined polymerase error rates throughout library preparation and sequencing (Gargis, Kalman et al. 2012, Aziz, Zhao et al. 2015, Gargis,
Kalman et al. 2015). Currently, the most prevalent practice is to rely on sequence count data alone to provide quality control for each potential source of analytical variation. For
example, many recently developed programs seek to quantify the fractional
representation of actionable tumor mutations, and enumeration of sequence reads
is the only source of data for assay variance analysis (Schmitt, Kennedy et al.
2012, Cibulskis, Lawrence et al. 2013, Frampton, Fichtenholtz et al. 2013, Fu, Xu
et al. 2014, Xu, DiCarlo et al. 2014). While these approaches address many
issues, they provide false assurance regarding control for stochastic sampling
variation due to low input of sample into the library preparation, and do not
provide frequency limit of detection for each type of base substitution, insertion
and deletion at each base position, in each target analyte (Fu, Xu et al. 2014,
Spencer, Tyagi et al. 2014). Recent barcoding methods combined with bait-
capture targeted sequencing provide better control for low sample input while,
again, using only sequence count data to estimate analytical variance (Mortazavi,
Williams et al. 2008, Casbon, Osborne et al. 2011, Jabara, Jones et al. 2011,
Kinde, Wu et al. 2011, Schmitt, Kennedy et al. 2012, Fu, Xu et al. 2014).
However, these methods do not provide a way to assess limit of detection for
observed biological variation (Jabara, Jones et al. 2011), and the bait-capture
method is associated with 100-1000-fold loss in signal (Fu, Xu et al. 2014).
Signal loss is a particular liability for analysis of small or degraded specimens, such as those routinely encountered in the clinical setting (Cibulskis, Lawrence et al. 2013). Furthermore, sequencing read counts are not always concordant with number of molecules “captured” during library preparation, resulting in false negative results (Frampton, Fichtenholtz et al. 2013). In addition, it is less well recognized that if the number of target analyte molecules loaded into the library
preparation is low, the analyte may be poorly quantified regardless of the number of analyte amplicons loaded into the sequencer due to over-amplification of a stochastically sampled specimen. In order to address these challenges, we developed and tested two hypotheses.
Hypothesis 1: We hypothesized that analytical variation in target analyte quantification can be predicted by Poisson (i.e. stochastic) sampling effects at two primary points: a) input of intact nucleic acid target molecules loaded into the library preparation reaction, and b) input of derived amplicons from library preparation into the sequencer (i.e. sequence counts) (Figure 5.1). Using Monte Carlo simulation we derived equations to predict assay coefficient of variation (CV) based on three working models: number of target molecules added to library preparation, number of target amplicons in library added to sequencer (i.e., sequence read count), or both (Figure 5.1). We then tested these working models using cell lines with known allelic composition. Cell lines were mixed and prepared for NGS such that a broad range of limiting allelic molar proportions and/or sequence read counts were observed. Each target allele was measured relative to a known number of synthetic internal standard molecules using a competitive multiplex-PCR amplicon-based NGS library preparation method (Blomquist, Crawford et al. 2013).
Hypothesis 2: The accuracy of frequency measurement of acquired mutations in specimens (e.g. circulating plasma DNA, tumors, etc.) is confounded by both sampling error (described above and tested in hypothesis 1), and nucleotide substitution, insertion and deletion errors encountered during both library preparation steps and sequencing
(Cibulskis, Lawrence et al. 2013, Frampton, Fichtenholtz et al. 2013). This latter, technically derived, sequence variation may to some extent be systematic for certain types of sequence variations, but may also depend largely on local sequence context.
We hypothesized that technically derived base substitution, insertion and deletion frequencies observed at each base position in each target analyte are concordant with frequencies observed in respective synthetic internal standards present in the same reaction. In order to characterize the contribution of technically derived nucleotide sequence error rate, we measured the frequency of base substitution, insertion and deletion errors in an NGS data set derived from 213 normal airway brushing derived cDNA specimens with both ample intact nucleic acid loading and sequence counts. Each normal airway brushing derived cDNA specimen was mixed with a known number of synthetic internal standard molecules for each target analyte prior to competitive multiplex PCR amplicon NGS library preparation to determine if the frequency of observed base substitutions, insertions and deletions in each native target was concordant with the frequency observed in each respective synthetic internal standard. If concordant, synthetic internal standards could provide control for both stochastic sampling in quantitative NGS, as well as control for technically derived sequencing error in qualitative NGS of low frequency alleles.
5.3 Methods
5.3.1 Sample Preparation
Hypothesis 1: To test the effect of stochastic sampling on variance in allelic frequency measurements, genomic DNA (gDNA) was extracted using the FlexiGene DNA kit (Qiagen) and quantified by NanoDrop (ThermoScientific, Wilmington, DE) spectrophotometry for two cell lines (H23 [ATCC CRL-5800] and H520 [ATCC HTB-182]). The cell lines were previously characterized as homozygous for opposite alleles at four polymorphic sites (rs769217, rs1042522, rs735482 and rs2298881) (Blomquist, Crawford et al. 2013). Cross-mixtures of these two cell lines were prepared to create a well characterized extreme limiting dilution of each of the four bi-allelic loci (see Mixing Design in Supplementary Table S5.3). These limiting dilutions of alleles were then loaded into the library preparation (see Methods: NGS Library Preparation), then limiting dilutions of NGS libraries were added to the Illumina HiSeq 2500 flow cell (see Methods: NGS Library Preparation).
Hypothesis 2: In order to characterize the base-specific substitution, insertion and
deletion rates imparted by combined library preparation and sequencing error, we used
213 normal human bronchial epithelial cell (NBEC) cDNA specimens. These specimens
were obtained as part of the ongoing Lung Cancer Risk Test (LCRT) study at the
University of Toledo Medical Center (Blomquist, Crawford et al. 2009). Approval for
specimen acquisition for this study was obtained from the institutional review board at the
University of Toledo Medical Center. These samples were chosen based on several key
features: 1) They represent a source of normal nucleic acid templates with presumably
low, or absent, acquired somatic mutations. 2) They were previously confirmed to have
high copy numbers of intact template for each native target, which minimized the chance that
stochastic sampling of templates would confound assessment of combined library
preparation and sequencing error on base-specific substitution, insertion and deletion
rates. 3) Competitive synthetic internal standards for targets comprised by the LCRT
were cloned into plasmids, and selected as pure clonal isolates, with Sanger sequencing confirmation of final sequence. This additional purification step was taken to eliminate any potential errors introduced by synthesis. We reason that these pure clonal competitive internal standards will have a frequency of technically acquired base substitutions, insertions and deletions that is similar to the native templates during the combined library preparation and sequencing steps.
5.3.2 Development of model to predict analytical variation due to stochastic sampling variation in NGS
Hypothesis 1: To test the hypothesis that analytical variation is dependent on both target analyte native template molecules added into library preparation reaction and resultant amplicon molecules added to sequencer, we developed three working models using Monte Carlo simulation and derived equations to predict expected assay coefficient of variation (CV) (Figure 5.1 and
Supplementary Method – Model Generation). These three models and their equations were based on: target molecules in library added to sequencer (i.e. sequence read counts; Model 1), target native molecules added to library preparation (Model 2), or both (Model 3). This model is based, in part, on a model of biallelic genetic drift provided by Dr. Stephen P. DiFazio that can easily be simulated in Excel
effects that result in genetic drift of bi-allelic loci should operate statistically in the same way as stochastic sampling of a bi-allelic locus present in a test tube in the laboratory setting, and that the act of pipetting and sampling the specimen DNA is analogous to a founding effect seen in population genetics. We further reasoned that there were two primary founding (i.e. stochastic sampling) effects present in the lab test tube analogy: 1) initial pipetting of the specimen into library preparation reaction, and 2) loading of the prepared library onto the sequencer and the number of sequencing counts enumerated for each target template (Figure 5.1 and Supplementary Method – Model Generation). This model was varied for both the number of input molecules and the number of sequence reads derived (Supplementary Method – Model Generation). This then produced a rich data set, from which three equations were derived by best curve fit analysis
(Supplementary Method – Model Generation). These derived equations were then tested against empirically derived data from cross-mixtures of cell lines to predict observed assay variance in targeted NGS (see Methods: Sample Preparation).
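The two-step sampling idea described above can be sketched with a small Monte Carlo simulation. The Python example below is only an illustration under stated assumptions: it is not the model or the fitted equations from the Supplementary Method, it treats both steps as Poisson sampling, it assumes amplification preserves proportions, and the input values (15 or 150 molecules, 200 reads) are arbitrary. Under these assumptions the expected CV has a simple closed form, CV^2 = 1/molecules + 1/reads, which the simulation should approximate.

```python
import math
import random
import statistics

def poisson(rng, lam):
    """Draw a Poisson variate (Knuth's method, applied in chunks for large means)."""
    total = 0
    while lam > 0:
        chunk = min(lam, 500.0)          # keeps exp(-chunk) well above underflow
        limit = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        lam -= chunk
    return total

def simulated_cv(mean_molecules, mean_reads, trials=2000, seed=1):
    """CV of read counts after two successive stochastic sampling steps."""
    rng = random.Random(seed)
    reads = []
    for _ in range(trials):
        molecules = poisson(rng, mean_molecules)            # step 1: pipetting into library prep
        expected = mean_reads * molecules / mean_molecules  # amplification preserves proportions
        reads.append(poisson(rng, expected))                # step 2: sampling by the sequencer
    return statistics.pstdev(reads) / statistics.mean(reads)

def analytic_cv(mean_molecules, mean_reads):
    """Compound-Poisson expectation: CV^2 = 1/molecules + 1/reads."""
    return math.sqrt(1 / mean_molecules + 1 / mean_reads)

for molecules in (15, 150):
    print(molecules,
          round(simulated_cv(molecules, 200), 3),   # simulated, both steps
          round(analytic_cv(molecules, 200), 3),    # expected, both steps
          round(1 / math.sqrt(200), 3))             # expected from read counts alone
```

In this sketch a prediction based on read counts alone stays the same for both inputs, while the simulated CV rises sharply when only 15 molecules are loaded, which is the behavior that motivates modeling both sampling points.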
5.3.3 NGS Library Preparation: Targeted Competitive Multiplex-PCR
5.3.3.1 Cell line cross-mixture specimens. Each of four target analytes was PCR-amplified in samples derived from the cross-mixture of two cell lines (see Mixing Design in Supplementary Table S5.3) that had each been mixed with a known number of synthetic competitive internal standard (IS) molecules as previously described (Supplementary Table S5.1) (Blomquist, Crawford et al. 2013). NBEC cDNA specimens. Each of 30 target analytes (two target assays for each of 15 genes) was PCR-amplified in the presence of a known number of respective synthetic competitive internal standard molecules as previously described (Supplementary Table S5.2) (Blomquist, Crawford et al. 2013).
Prepared libraries were then sent for Illumina HiSeq 2500 sequencing service at the
University of Michigan, Genomics Core facility.
5.3.3.2 Internal standard mixture preparation. Each competitive internal standard (IS) was designed to contain six nucleotide differences from the target analyte native template (NT), which enabled reliable differentiation of each NT from its respective IS during post-sequencing data analysis (Supplementary Tables S5.1 and S5.2) (Blomquist, Crawford et al. 2013). For IS used in the analysis of cell line cross-mixture samples, following synthesis, each IS was PCR-amplified with specific primers to ensure full length product, isolated by gel electrophoresis, quantified using NanoDrop, and mixed with IS for other analytes at equivalent concentration to prepare an internal standard mixture (Blomquist, Crawford et al. 2013). For IS used in analysis of NBEC cDNA samples, IS were prepared by Accugenomics, Inc. (Wilmington, NC). Briefly, following synthesis IS were cloned in bacteria and purified to ensure an accurate and uniform population of sequences for each competitive IS used (see Methods: Sample Preparation).
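Because a known number of IS molecules is added to each reaction, the read-count ratio of NT to IS can be converted to an estimate of NT copies. The Python sketch below illustrates that arithmetic under the usual competitive-PCR assumption that NT and IS amplify and sequence with equal efficiency. The function names and the read counts are hypothetical and are not taken from the study data.

```python
def native_copies(nt_reads, is_reads, is_copies_added):
    """Estimate native template copies from the NT:IS read ratio."""
    if is_reads == 0:
        raise ValueError("no internal standard reads; measurement is not interpretable")
    return nt_reads * is_copies_added / is_reads

def allele_fraction(allele_a_copies, allele_b_copies):
    """Fraction of allele A among all copies at a bi-allelic locus."""
    return allele_a_copies / (allele_a_copies + allele_b_copies)

# Hypothetical read counts for one bi-allelic target with 1,000 IS copies added.
a = native_copies(nt_reads=45, is_reads=3000, is_copies_added=1000)
b = native_copies(nt_reads=9000, is_reads=3000, is_copies_added=1000)
print(a, b, round(allele_fraction(a, b), 4))
```

Expressing each allele in copies rather than reads makes it visible when a large read count derives from only a handful of input molecules.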
5.3.4 NGS data analysis
FASTQ data files from the University of Michigan Genomics core facility were processed as previously described (Blomquist, Crawford et al. 2013).
FASTQ files for hypothesis 2 in this study, pertaining to the LCRT reagents, were additionally processed using the BLAST 2.2.26+ command line with a Practical Extraction and Reporting Language (PERL) wrapper to automate feeding of reference and query sequences to the BLAST command line interface (reference sequences in Supplementary Table S5.2). This same PERL script then identified and stored the frequency of each BLAST result for each template and for the type of base substitution, insertion or deletion that was identified across all reads in a Hash of Hashes of Hashes data table configuration (sequence error frequencies in Supplementary Tables S5.4 and S5.5). The PERL wrapper for BLAST 2.2.26+, and the input parameters used for BLAST to enumerate base substitution, insertion and deletion frequencies, are available upon request.
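The nested tally described above (template, then base position, then event type) can be illustrated with a short Python analogue of a hash of hashes of hashes. This is a sketch only: the original PERL wrapper is not reproduced here, the input is assumed to be already reduced to gapped pairwise alignments rather than raw BLAST output, and the template names and sequences are hypothetical.

```python
from collections import defaultdict

def tally_events(alignments):
    """Count substitutions, insertions and deletions per template and reference position.

    `alignments` yields (template, aligned_reference, aligned_read) tuples in which
    gaps are written as '-', as in a pairwise alignment.
    """
    tally = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for template, ref, read in alignments:
        pos = 0  # 1-based reference position of the most recent reference base
        for r, q in zip(ref, read):
            if r != "-":
                pos += 1
            if r == q:
                event = "match"
            elif r == "-":
                event = f"ins:{q}"       # extra base in the read after position pos
            elif q == "-":
                event = f"del:{r}"       # reference base missing from the read
            else:
                event = f"{r}>{q}"       # base substitution
            tally[template][pos][event] += 1
    return tally

# Hypothetical alignments for one native template (NT) and its internal standard (IS).
example = [
    ("GENE1_NT", "ACGTAC", "ACGTAC"),
    ("GENE1_NT", "ACGTAC", "ACATAC"),    # G>A at position 3
    ("GENE1_NT", "ACGTAC", "AC-TAC"),    # deletion of G at position 3
    ("GENE1_IS", "ACG-TAC", "ACGGTAC"),  # insertion of G after position 3
]
counts = tally_events(example)
print(dict(counts["GENE1_NT"][3]))
```

Dividing each event count by the total count at that position gives a per-position frequency for each event type.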
Because the goal of hypothesis 2 was to identify and characterize the base by base frequency of combined sequencing and library preparation errors, and not biological variation (which was tested in hypothesis 1), we surmised that the sequencing data could be aggregated into two larger pools of subjects (NT and IS Groups 1 and 2). This is feasible and beneficial for several reasons: 1) A combined data set of normal specimens with minimal biological sequence variation (Group 1 [115 NBEC specimen library preparations] and Group 2 [98 NBEC specimen library preparations], total 213 NBEC specimen library preparations), should provide adequate sampling of very rare technically derived base substitution, deletion and insertion events (1 in 1,000 to 100,000) across each specimen pool. 2) If these normal specimens do indeed have minimal biological variation in sequence, there should be a high degree of concordance in base substitution, insertion and deletion rates between the NTs and their respective competitive IS present in the same specimen (Supplementary Table S5.2). 3) By splitting the sequencing data into two pools, we can, in a surrogate way, assess the performance of external NT controls versus competitive synthetic IS controls, for accurately measuring technically derived base substitution, insertion and deletion frequencies.
All final NGS summary counts and absolute quantification of molecules
(where appropriate) are provided in Supplementary Tables S5.3-5.
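Concordance between NT and IS error frequencies can be summarized with a coefficient of determination. The Python sketch below shows one way to do this, assuming paired event counts and read depths at matched positions and event types. The numbers are invented for illustration, and the log10 transformation is an assumption of this sketch, not a documented step of the analysis.

```python
import math

def r_squared(x, y):
    """Square of the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical (event count, read depth) pairs at matched positions and event types.
nt = [(120, 100_000), (15, 100_000), (900, 120_000), (4, 80_000)]
is_ = [(100, 90_000), (20, 110_000), (1000, 125_000), (3, 85_000)]

# Assumption: frequencies are compared on a log10 scale because they span
# several orders of magnitude.
nt_freq = [math.log10(k / n) for k, n in nt]
is_freq = [math.log10(k / n) for k, n in is_]
print(round(r_squared(nt_freq, is_freq), 3))
```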
5.4 Results
5.4.1 Controlling for stochastic sampling error in NGS
For the equation derived from both sequencing coverage and input molarity (Model 3; see Supplementary Method – Model Generation), expected coefficient of variation (CV) was very close to observed (average [observed
CV/expected CV] = 1.01) and explained 74% of observed assay variance (Figure
5.2C). In contrast, observed CV was on average 13-fold or 1.5-fold higher than the CV expected from prediction models based on sequencing coverage alone (Model 1) or input molarity alone (Model 2), respectively (Figure 5.2A,B). For each assay, when input of target allele copies into library preparation was low (median of 15 molecules; open triangles), assay variance for measured allelic ratio was much higher than with high molecule input (median of 3313 molecules; closed circles) (Figure 5.3A-D).
Although there was an approximately 200-fold difference in median molecules loaded into library preparation for low and high loading conditions, sequence counts were high for both conditions (see Mixing Design and raw data in
Supplementary Table S5.3). When only specimens with high molecule input
(>500 molecules) were assessed, variance in measured allelic ratio followed a
Poisson distribution (plotted boxes and dashed line) for target sequence counts
(Figure 5.3E). Similarly, when only specimens with high sequence counts (>500 sequence counts) were assessed, variance in measured allelic ratio followed a
Poisson distribution for target molecule input (Figure 5.3F). All data presented in this
section are available in Supplementary Table S5.3.
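The two sampling steps discussed in this section (target molecules drawn into library preparation, then sequence counts drawn from the library) can be illustrated with a small Monte Carlo sketch. This is not the study's Model 1, 2 or 3. It assumes a true 1:1 allelic mixture, binomial sampling at each step, and amplification that preserves the allelic fraction. It reports the CV of the allelic fraction rather than the allelic ratio, and the function names are hypothetical.

```python
import random
import statistics

def binomial(n, p, rng):
    """Draw one binomial(n, p) variate using only the standard library."""
    return sum(rng.random() < p for _ in range(n))

def simulate_allelic_fraction_cv(molecules, reads, trials=400, seed=1):
    """Simulate the CV of a measured allelic fraction for a true 1:1 mixture.

    Stage 1: `molecules` target copies are drawn into library preparation.
    Stage 2: `reads` sequence counts are drawn from the amplified library.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(trials):
        allele_a = binomial(molecules, 0.5, rng)          # sampling into library prep
        library_fraction = allele_a / molecules
        reads_a = binomial(reads, library_fraction, rng)  # sampling into sequencer
        fractions.append(reads_a / reads)
    return statistics.stdev(fractions) / statistics.mean(fractions)

low_input_cv = simulate_allelic_fraction_cv(molecules=15, reads=500)
high_input_cv = simulate_allelic_fraction_cv(molecules=1000, reads=500)
print(f"CV at 15 molecules:   {low_input_cv:.3f}")
print(f"CV at 1000 molecules: {high_input_cv:.3f}")
```

With the same number of sequence counts in both conditions, the simulated CV is much larger at low molecule input, which is the qualitative pattern reported in Figure 5.3A-D.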
5.4.2 Controlling for qualitative sequencing error in NGS
Base substitutions were observed at varying frequencies for all nucleotides, and
rare deletion events were detected for guanine and adenine bases (Figure 5.4).
In general, most observed base substitution rates were lower than 1 in 100 for each base
location. Adenine to guanine and cytosine to thymine base transitions (purine-purine or pyrimidine-pyrimidine) were the most common types of sequence variation observed, followed by base transversions (purine-pyrimidine or pyrimidine-purine), which occurred at approximately 10-fold lower frequency (Figure 5.4). Furthermore, the type of base substitution and its average frequency were concordant between NT and IS for Group 1 (Figure 5.4). The coefficient of variation (CV) around the mean frequency of each type of base substitution was on average 0.28. This roughly translates to a standard deviation of 1.9-fold on either side of the population measurement mean for each type of sequence variation (2.8-fold detection limit with 95% confidence limits for detection of fold change). Data for Group 2 are available in Supplementary Table S5.5, and are nearly identical to those presented in Figure 5.4. Bivariate plots of the frequency of technically derived sequence variation for NT and corresponding type of sequence variation for each base position in competitive IS for Groups 1 and 2 (see Methods: NGS data analysis) are presented in Figure 5.5A,C,E and G. Frequency of observed sequence variation in IS explained 93-94% of observed sequence variation in NT (Figure 5.5A, C). Importantly, the vast majority of deviation from the regression line is explainable by the minimum
sequence counts observed for the technically derived sequence variation (Figure
5.5B, D). Concordance was slightly higher between NT and NT, or IS and IS
comparisons between groups 1 and 2 respectively, with each explaining 96-97%
of the frequency of base-specific sequence variation observed between the two
groups (Figure 5.5E,G). Again, deviation from the regression line in Figure 5.5E,
G was largely explainable by the minimum sequence counts observed for the rare
technically derived sequence variation (Figure 5.5F, H).
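The "percent of variation explained" values above correspond to the squared correlation between paired frequencies. A minimal sketch of that calculation follows. The frequencies are hypothetical values, not data from this study, and comparing them on a log10 scale is an assumption made here because error frequencies span several orders of magnitude.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical per-position error frequencies (not data from this study).
is_freq = [1e-2, 3e-3, 1e-3, 4e-4, 1e-4, 2e-5]
nt_freq = [8e-3, 4e-3, 1e-3, 3e-4, 2e-4, 2e-5]

log_is = [math.log10(f) for f in is_freq]
log_nt = [math.log10(f) for f in nt_freq]
print(f"R^2 = {r_squared(log_is, log_nt):.2f}")
```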
5.5 Discussion
Next-Generation Sequencing (NGS) technologies have the potential to
disrupt a large number of technologies presently used in clinical diagnostics.
However, their implementation in the clinical setting is impeded by a complex
specimen and data analysis process (Figure 5.1), which is compounded by an
equally complex goal of analyzing large multi-target panels. Because of the
profound clinical implications on treatment decision management based on NGS
methods, they should be held to the same analytical performance standards
applied to other methods used in the clinical chemistry laboratory. In an effort to
achieve this goal we developed a competitive multiplex PCR-based amplicon library preparation method that utilizes competitive internal standards (also known as internal amplification controls) (Blomquist, Crawford et al. 2013). The method enables control for sample overloading, excessive amplification cycles, other signal saturation effects and technical biases that can lead to inter-assay and inter-specimen variation in signal measurement. Data also suggested that this
method controls for suboptimal loading of sample into library preparation, suboptimal loading of library preparation into sequencer, and sequence errors generated during library preparation and sequencing. We decided to address these important challenges by formulating and experimentally testing Hypotheses 1 and 2.
Hypothesis 1 is supported by the data reported here. Specifically, the mathematical equation based on both NT loading into NGS library preparation and sequence read counts from the NGS instrument (Monte Carlo simulation Model 3) predicted observed assay coefficient of variation in four targeted NGS assays (Figures 5.2, 5.3 and Supplementary Methods – Figure S5.1 Model Design). While the predictive value of this equation remains to be confirmed across other types of NGS library preparation methods and sequencing platforms, generalizability is likely based on the similarity of the biochemical reactions involved. Implementation of this equation may be particularly helpful when only one technical or biological replicate measurement is feasible (e.g., limited clinical specimen). In this context, the laboratory clinician would be asked to comment on the confidence in the measurement of target analyte, or frequency of a clinically actionable mutation present in a tumor specimen. Using this equation, the laboratory information system would be able to easily derive confidence intervals for reporting. As an example, this would simplify a decision regarding whether to direct treatment to an actionable mutation. Importantly, as is clear from Figure 5.3F, large analytical variation from stochastic sampling will be observed if an insufficient concentration of target molecules is sampled, regardless of the concentration of amplification products sampled for loading into sequencer. This is why it is important to use quality control thresholds that address each of these sources of variation.
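To illustrate how a laboratory information system might turn an expected CV into confidence limits, the sketch below uses a simple closed-form approximation. It is not the study's Model 3 equation, which is given in the Supplementary Methods. It assumes two independent Poisson sampling steps, each contributing a relative variance of 1/count, and normal-approximation limits. The function names are hypothetical.

```python
import math

def expected_cv(molecules, reads):
    """Approximate CV when two independent Poisson sampling steps compound.

    Assumption (not the study's Model 3): each step contributes a relative
    variance of 1/count, and the two contributions add.
    """
    return math.sqrt(1.0 / molecules + 1.0 / reads)

def confidence_interval(measured, molecules, reads, z=1.96):
    """Normal-approximation confidence limits around a measured value."""
    half_width = z * expected_cv(molecules, reads) * measured
    return max(0.0, measured - half_width), measured + half_width

# Many reads cannot rescue a low molecule input: the 1/molecules term dominates.
for molecules, reads in [(15, 500), (15, 100_000), (3313, 500)]:
    cv = expected_cv(molecules, reads)
    low, high = confidence_interval(1.0, molecules, reads)
    print(f"{molecules:>5} molecules, {reads:>6} reads: CV={cv:.3f}, 95% CI=({low:.2f}, {high:.2f})")
```

Under this approximation the CV can never fall below 1/sqrt(molecules), however many sequence counts are obtained, which is consistent with the need for a quality control threshold on each source of variation.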
Hypothesis 2 also is supported by data from these studies. Specifically,
the frequency of technically derived sequence variation for each NT was largely
explained by that observed in the respective IS template (Figure 5.5A,C).
Furthermore, any deviation from the regression line observed (in Figures 5.5A,C),
was largely explained by stochastic sampling of low sequence counts for the
technically derived sequence variation (Figures 5.5B,D). Thus, with sufficient molecules loaded into the library preparation, and sequence counts obtained, the limit of detection of rare biological single nucleotide variations in native material can be easily determined using a competitive internal standard (Figure 5.5), and is more accurate than the 2.8-fold change limit of detection estimated by the type of sequence variation only (Figure 5.4). Importantly, base transitions were observed in approximately 10-fold excess compared to base transversion events (Figure
5.4). The error rates observed here are specific to the chosen combination of specimen preparation, sequencing and data analysis pipeline methods, and should not be blindly applied to other NGS pipelines.
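The distinction between transitions and transversions used above can be stated as a short classification rule. The sketch below is illustrative only; the function name and the use of "-" to mark a missing base follow the notation of Figure 5.4 but are otherwise assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref_base, read_base):
    """Label a single-base change as a transition, transversion, insertion or deletion."""
    if ref_base == "-":
        return "insertion"
    if read_base == "-":
        return "deletion"
    if ref_base == read_base:
        return "match"
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = ({ref_base, read_base} <= PURINES
                  or {ref_base, read_base} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"
```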
In summary, we present data that synthetic internal standards, in the
context of a targeted competitive PCR amplicon library preparation method
(Blomquist, Crawford et al. 2013), control for both stochastic sampling in
quantitative NGS and technically derived sequencing error in qualitative NGS
detection of low frequency alleles. By applying quality-control parameters based on these experimentally validated models that predict key sources of NGS analytical variation, we can now accurately report confidence limits for NGS measurement of clinically important analytical targets, as well as provide an
accurate limit of detection for observed base substitution, insertion and deletion rates at each base position within each native target. We are implementing quality control measures described here in analysis of promising diagnostic tests, including a lung cancer diagnostic test (Yeo, Crawford et al. 2014) and a lung cancer risk test (Blomquist,
Crawford et al. 2013). Incorporation of these quality controls provides an analysis pathway consistent with previously reported College of American Pathologists (CAP) and Nex-StoCT guidelines for NGS diagnostics in the clinical setting (Gargis, Kalman et al. 2012, Aziz, Zhao et al. 2015, Gargis, Kalman et al. 2015).
5.6 Table and Figure Legends
Figure 5.1 Overview of specimen preparation for Next-Generation Sequencing.
(Portrait, double-column)
This schematic illustrates our hypothesis that two primary points of stochastic
sampling error along the continuum of Next-Generation Sequencing (NGS) library preparation and sequencing can account for observed analytical variation in targeted PCR based NGS assays.
Figure 5.2 Performance of Monte Carlo simulation models to predict observed assay variance. (Portrait, double-column)
Equations used to plot expected coefficient of variation (CV) are presented in
Supplementary Methods – Model Design. Measured CV was obtained from 46 quadruplicate technical measurements; the 46 measurements of CV and the calculated CV based on Models 1, 2 and 3 are available in Supplementary Table S5.3.
Figure 5.3 Independent effects of sequence counts and sample molecule loading on measured allelic ratios. (Landscape, full page)
A-D) Effect of low molecule input into library preparation on measured allelic-ratio relative to expected. To eliminate effect of low sequence counts, only values based on at least 500 sequence counts were included. Closed Circles = high molecule input (median = 3313 molecules each replicate). Open Triangles = low molecule input (median = 15 molecules each replicate). Each data point is a single technical replicate.
E) Serially diluted PCR amplicon library samples from the undiluted 1:1 cell line mixture were loaded into sequencer. Effect of sequences counted (X-axis) on allelic-ratio (Y-axis) for each target with high molecule input (>500 molecules in each replicate). Combined results from all four loci are presented. F) Undiluted PCR amplicon library samples from serially diluted 1:1 cell line mixture were loaded into sequencer. Effect of target molecule number (X-axis) on allelic-ratio (Y-axis) for each target with high sequence count (>500 sequence read counts in each replicate for each target). Dashed line with open squares represents an expected frequency of error based on a Poisson distribution (Model 1 and Model 2). Mixing design of cell line DNA and titration of sequencing counts, and all measurements derived from these specimens, are available as full and individual subset analysis tables in Supplementary Table S5.3.
Figure 5.4 Frequency plot of observed technically derived sequencing variation.
(Portrait, double column)
A,B) Type of base substitution is plotted on X-axis. For example, “C > T” represents a transition from a cytosine to thymine base, and “G > -” represents a deletion of a guanine. The first base listed is the expected consensus base at that position based on sequences listed in Supplementary Tables S5.1 and S5.2. Each base position, for each template, and the frequency of that type of sequence variation is plotted as an individual data point along the Y-axis. In this figure, only Group 1 data are presented. Means and standard deviation error bars are plotted for each type of sequence variation. Group 2 data plotted essentially identically and are provided as raw data in Supplementary Table S5.5.
Figure 5.5 Performance of competitive internal standards to measure frequency of technically derived sequence variation. (Portrait, full page)
A,C,E and G) Bivariate plots of measured sequence variation frequency, for each base position along the length of each native template (NT) and internal standard (IS) for groups 1 and 2 (see Methods: NGS data analysis). B,D,F and H) Plots representing fold-deviation of NT:IS ratio away from regression line in respective plots A,C,E and G.
Sequence counts observed (minimum) on the X-axis is the number of sequence counts for the observed type of sequencing error, and not the total number of sequence counts for that assay. Dashed line with open squares represents an expected frequency of error based on a Poisson distribution (Model 1).
5.7 Table and Figure
Figure 5.1
Figure 5.2
Figure 5.3
Figure 5.4
Figure 5.5
5.8 Supplementary Table and Figure
Figure S5.1 Model Design
Available at http://www.sciencedirect.com/science/article/pii/S221475351530005X#MMCvFirst
Table S5.1 Supplementary Tables
Available at http://www.sciencedirect.com/science/article/pii/S221475351530005X#MMCvFirst
Chapter 6
Conclusions and Summary
In an effort to understand the role of interindividual variation in genetic predisposition to lung cancer risk, previously described work from this laboratory confirmed that there is interindividual variation in susceptibility to lung cancer, determined that DNA repair genes display altered regulation in normal bronchial epithelial cells (NBEC) of lung cancer subjects, and identified a promising Lung Cancer Risk Test (LCRT) biomarker comprising transcript abundance measurement of fifteen genes in NBEC (Crawford, Khuder et al. 2000, Mullins, Crawford et al. 2005, Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2009).
Genetic variation plays a crucial role in susceptibility to complex genetic disease through regulating gene expression. However, evidence indicates that in many cases, genes are regulated by multiple loci, each of which contributes only modestly to the trait (Deutsch, Lyle et al. 2005, Wu, Kraft et al. 2010). In a previous study, based on genotype analysis we found that SNPs rs751402 and rs2296147 were associated with inter-individual variation in allelic imbalance in ERCC5 expression in NBEC (Blomquist, Crawford et al. 2010). The goals of this study were to advance mechanistic understanding regarding heritable variation in cis-regulation of the key NER gene ERCC5, to develop a qPCR-based molecular diagnostic test comprising transcript abundance values of predictive biomarkers for FFPE samples, and to control analytical variation in NGS clinical assays.
Identification of cis-acting variant sites that alter ERCC5 transcription regulation in normal bronchial epithelial cells
Analysis of allelic ratios at marker site rs1047768 in association with diplotype structure at rs751402-rs2296147-rs1047768 demonstrates that the T allele at polymorphic site rs2296147 is associated with higher RNA transcript abundance relative to the C allele (Figure 3.3). Notably, the rs2296147 T allele participates in formation of an in silico predicted TP53 transcription factor-binding site (Marinescu, Kohane et al. 2005), and that site is predicted to be lost when the C allele is present. In previous studies, TP53 was shown to upregulate ERCC5 transcription (Kannan, Amariglio et al. 2000). Therefore, it is reasonable to hypothesize that TP53 upregulates ERCC5 transcription more effectively when the T allele is present at rs2296147. Because TP53 is regulated primarily at the post-translational level, TP53 transcription factor functional activity is often measured indirectly as transcript abundance of key target genes such as CDKN1A (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al. 2005). As reported here, CDKN1A and ERCC5 total transcript abundance values were correlated in non-cancer subject NBEC samples (Figure 3.4A). This correlation is consistent with the hypothesis that TP53 is a transcription factor regulator of both CDKN1A and ERCC5 transcription in NBEC. In turn, it is reasonable to
hypothesize that the significantly altered CDKN1A correlation with ERCC5 in cancer
subjects (Figure 3.4B) is due in part to genotype at rs2296147. In contrast to strong
evidence for the cis-regulatory role of rs2296147 in ERCC5 regulation,
haplotype/diplotype data do not support a similar role for rs751402.
We observed an increased mean G/C allelic ratio at rs17655 in cDNA compared
to matched gDNA controls (Figure 3.2B) indicating that this SNP or a SNP in linkage
disequilibrium with it influences ERCC5 transcript levels in NBEC. As described in
Results section, it is likely that the functional SNP responsible for this observation is
rs873601 which is linked to rs17655 and is predicted to alter binding of multiple miRNAs
(Liu, Zhang et al. 2012, Zhu, Shi et al. 2012). Specifically, the C allele at SNP rs17655 is linked to the A allele at rs873601 (r² = 0.74), which is putatively more responsive to multiple
miRNAs that will increase the rate of degradation and lower abundance of transcripts
originating from rs17655 C allele. Importantly, because rs2296147 and rs873601 are not
linked, we conclude that the presumed rs873601 effect is independent from that of
rs2296147.
The data presented here provide evidence for higher TP53-mediated ERCC5
transcription rate from rs2296147 T allele and higher miRNA mediated ERCC5 transcript
degradation at rs873601 G allele. Thus, if each of these cis-regulatory sites acted alone
without any contribution from the other (for example, in hypothetical alternative
transcripts) or any other cis-acting SNP, we would expect not only to observe mean T/C
ratio at marker SNP rs1047768 and mean G/C ratio at SNP rs873601 (or linked SNP
rs17655) to be greater than one, but also very little inter-individual variation around these
mean ratios. However, we observed significant variation around the mean allelic ratio at
each marker SNP. The likely explanation is that the predominant expressed
ERCC5 transcripts incorporate both marker SNPs (rs1047768 and rs17655) and
the effects resulting from genotype at each of the unlinked cis-regulatory sites
(rs2296147 and rs873601) will interact to determine the allelic ratio measured at each marker SNP.
Consistent with a complex genetic mechanism of lung cancer risk, the effect size of each DNA variant associated with lung cancer risk is very small.
Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants and lung cancer risk. The data presented here support the conclusion that inherited variation in gene regulation is a powerful intermediate phenotypic marker for lung cancer risk, as presented schematically in Figure 3.5. As we report here and previously (Blomquist, Crawford et al.
2010), it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS aiming to determine association of each individual SNP with risk (Amos, Wu et al. 2008).
Specifically, the association of a single genetic variant with transcription regulation (e.g., rs2296147 with ERCC5 regulation) or the association of inherited variation in transcript abundance regulation with lung cancer risk may be assessed with hundreds of subjects (Blomquist,
Crawford et al. 2009). For example, starting with 161 subjects (Amos, Wu et al.
2008) we observed significant association of rs2296147 genotype with ERCC5
ASE (Figure 3.3), and with fewer than 100 subjects we observed significantly altered ERCC5 regulation with lung cancer (Mullins, Crawford et al. 2005)
(Figure 3.4). In contrast, there was not a clear association of rs2296147 T allele dosage with lung cancer risk among the subjects enrolled for this study (data not shown).
Based on the findings in the current study, we conclude that the T allele at rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to TP53 transcription factor. Genotype at rs17655 also is associated with variation in ERCC5 transcript abundance, likely due to effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.
In addition to cis-regulation, the effect of CEBPG, a previously identified transcription factor for ERCC5, was determined in a lung cancer cell line. A CEBPG knock-down experiment in the H1703 lung cancer cell line confirmed the regulatory role of CEBPG in ERCC5 transcription, elucidating the effects of trans-regulation on gene expression and supporting our conclusion that interindividual variation in lung cancer risk is attributable to interindividual variation in gene regulation machinery, including both cis-acting and trans-acting variation.
Biospecimen repository of NBEC samples from bronchial brush biopsy/bronchoscopy
As described above, the Lung Cancer Risk Test (LCRT) trial is a prospective cohort study comparing lung cancer incidence among persons with a positive or negative value for the LCRT, a 15-gene test measured in normal bronchial epithelial cells (NBEC). NBEC were collected from the target population of the LCRT biomarker, namely individuals who meet USPSTF eligibility criteria for annual low dose helical CT screening (Humphrey, Deffebach et al. 2013). Lung cancer incidence in the LCRT study is expected to be similar to the 3.1% incidence over 3.9 years reported by Bach et al. (Bach, Jett et al. 2007). Bronchoscopy with or without biopsy is considered a safe procedure, and no complications (SAEs) were associated with bronchoscopic brushing and sample collection in either the standard of care (SOC) or study driven (SD) group. Pulmonary function test (PFT) data were available for more than 75% of the subjects enrolled (290 of 384 subjects). One hundred fifty-seven had clinical COPD and more than 70% had GOLD stage 2 or worse (Table 4.2). It is planned to obtain repeat PFT and CT scan on all subjects at each subsequent follow-up. This information will enable longitudinal assessment of the rate of decline in pulmonary function by both physiologic and radiographic measures, as well as assessment for presence or absence of lung cancer.
The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for LDCT lung cancer screening. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population. These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early
detection of lung cancer and delay the onset of screening for individuals with results
indicating low lung cancer risk. For these individuals, the small risk incurred by
undergoing once in a lifetime bronchoscopic sample collection for LCRT may be offset
by a reduction in their CT-related risks. The LCRT biospecimen repository will enable
additional studies of genetic basis for COPD and/or lung cancer risk.
Development of Multiplex Two–Color Fluorometric RT-qPCR for Predicting
Chemo Responses in FFPE Samples
Biomarkers have been of increasing importance for personalized medicine, especially for assessing clinical outcomes of a treatment, determining the most effective treatment for an individual, or monitoring ongoing treatment. Predictive biomarkers, which are connected with response to a treatment in terms of efficacy and/or safety, are usually measured with companion molecular diagnostic tests. To augment the most commonly used approach, immunohistochemistry (IHC), and to enable the quantitation of predictive biomarkers, efforts were made to develop quality-controlled multiplex two-color fluorometric real-time PCR assays for ten predictive markers that have shown clinical value in predicting response to general or targeted chemotherapeutic agents: ERCC1, RRM1 and MRP2 in response to cisplatin; and EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS in response to targeted chemotherapeutic agents.
Reagents for each of the 10 predictive biomarkers, ERCC1, RRM1, MRP2, EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS, had acceptable linearity (R² > 0.99), signal-to-analyte response (slope 1.0 ± 0.05), lower detection threshold (< 10 molecules) and imprecision (CV < 20%).
Specificity of each probe was tested by including it in PCR assays containing the synthetic NT or IS serially diluted from 10⁻¹¹ M to 10⁻¹⁵ M. For each
probe, at each NT or IS dilution, the signal (Cq value: quantification cycle)
observed with amplification in the presence of the non-homologous template was
compared to Cq value observed with amplification in the presence of the
homologous template. Non-homologous (non-specific) binding was <1% for both
NT and IS probes for all genes.
The key components of this method are internal standards (IS) and external standards (ES). The competitive IS molecule was designed with identical priming sites and a 4-6 bp internal difference from each native target gene template (NT). This ensures identical thermodynamics and amplification efficiency for both template species as well as discrimination of IS from NT. The ES corrects for differences in fluorescence intensity between the two probes, which are labeled with different dyes, arising from variation in probe degradation or software selection of Cq values in each PCR plate.
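The roles of the IS and the external standards can be illustrated with a small calculation. The sketch below is not the software used in this work. It assumes that the [NT Cq - IS Cq] difference is converted to an NT:IS ratio by perfect doubling per cycle, and that the external standards mixture contains NT and IS at a 1:1 ratio, so that its mean [NT Cq - IS Cq] value measures probe and threshold bias alone. The function name and parameters are hypothetical.

```python
def nt_copies(nt_cq, is_cq, is_copies, esm_delta_cqs=(0.0,), efficiency=2.0):
    """Estimate native template (NT) copies from a competitive internal standard (IS).

    esm_delta_cqs: [NT Cq - IS Cq] values measured on the external standards
    mixture, assumed to contain NT and IS at a known 1:1 ratio.
    efficiency: assumed fold-amplification per cycle (2.0 = perfect doubling).
    """
    correction = sum(esm_delta_cqs) / len(esm_delta_cqs)  # mean external standard offset
    delta_cq = (nt_cq - is_cq) - correction               # probe/threshold bias removed
    return is_copies * efficiency ** (-delta_cq)
```

For instance, if NT crosses threshold two cycles earlier than IS, with 600 IS copies in the reaction and no external standard offset, the estimate is 600 × 2² = 2,400 NT copies under these assumptions.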
At each serial 10-fold dilution of ESM (10⁻¹¹ M NT/10⁻¹¹ M IS to 10⁻¹⁷ M NT/10⁻¹⁷ M IS), the average coefficient of variation (CV) for measurement of each of the four genes was < 10% for > 60 molecules input (10⁻¹¹ M NT/10⁻¹¹ M IS to 10⁻¹⁶ M NT/10⁻¹⁶ M IS).
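The correspondence between molar concentration and molecule number can be checked with Avogadro's number. The sketch below assumes a 1 µL input volume, which is not stated in the text; it is chosen here only because it reproduces approximately 60 molecules at 10⁻¹⁶ M.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def molecules_in_volume(molar_concentration, volume_microliters=1.0):
    """Number of template molecules in a given volume of a solution."""
    return molar_concentration * AVOGADRO * volume_microliters * 1e-6

# Walk down the 10-fold dilution series from 1e-11 M to 1e-17 M.
for exponent in range(-11, -18, -1):
    concentration = 10.0 ** exponent
    print(f"1e{exponent} M -> {molecules_in_volume(concentration):,.1f} molecules per uL")
```

Under this assumption, the lowest dilution (10⁻¹⁷ M) corresponds to roughly 6 molecules per reaction, a range in which stochastic sampling would be expected to dominate measurement variation.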
Development of Control for stochastic sampling variation in next generation
sequencing
In addition to real-time PCR, quality control was implemented in next generation sequencing (NGS) RNA-sequencing platforms. We previously developed competitive multiplex-PCR amplicon library preparation for targeted RNA-sequencing on NGS platforms. One challenge in applying NGS in the clinical setting is analytical variation due to stochastic sampling. This is important for mutation detection and differential gene expression measurement. Although internal standards are useful as competitive controls for sample overloading, signal saturation effects, and inter-assay and inter-sample variation in measurement, stochastic sampling error could not be controlled when few copies were present in the samples.
As supported by the data reported in this study, the mathematical equation based on both NT loading into NGS library preparation and sequence read counts from the NGS instrument (Monte Carlo simulation Model 3) predicted observed assay coefficient of variation in four targeted NGS assays (Figures 5.2, 5.3 and Supplementary Methods 5.8 – Model Design). While the predictive value of this equation remains to be confirmed across other types of NGS library preparation methods and sequencing platforms, generalizability is likely based on the similarity of the biochemical reactions involved. Implementation of this equation may be particularly helpful when only one technical or biological replicate measurement is feasible (e.g., limited clinical specimen). In this context, the laboratory clinician would be asked to comment on the confidence in the measurement of target analyte, or frequency of a clinically actionable mutation present in a tumor specimen. Using this equation, the laboratory information system would be able to easily derive confidence intervals for reporting. As an example, this would simplify a decision regarding whether to direct treatment to an actionable mutation. Importantly, as is clear from Figure 5.3F, large analytical variation from stochastic sampling will be observed if an insufficient concentration of target molecules is sampled, regardless of the concentration of amplification products sampled for loading into sequencer. This is why it is important to use quality control thresholds that address each of these sources of variation. Based on Poisson sampling, a mathematical equation was developed to predict assay coefficient of variation (CV). The predicted CV is then used to determine the confidence limits for each value acquired from sequencing. Therefore, false positive results can be eliminated by minimizing stochastic variation.
By applying quality-control parameters based on these experimentally validated models that predict key sources of NGS analytical variation, we can now accurately report confidence limits for NGS measurement of clinically important analytical targets, as well as provide an accurate limit of detection for observed base substitution, insertion and deletion rates at each base position within each native target. Incorporation of these quality controls provides an analysis pathway consistent with previously reported College of American
Pathologists (CAP) and Nex-StoCT guidelines for NGS diagnostics in the clinical setting.
New Contributions from Chapters 3, 4 and 5
1. Haplotype- and diplotype-based analyses possess greater power than genotype-based
analysis. When multiple loci regulate the expression of one gene, which is common for
complex genetic diseases like lung cancer, it was not possible, based on genotyping analysis,
to sort out with confidence the independent roles of rs751402 and rs2296147.
Assessing the syntenic relationship of alleles at multiple suspected loci (i.e., haplotype
structure) enables characterization of the independent and/or interactive roles of
cis-regulatory SNPs (cis-rSNP).
2. Using allelic imbalance to assess cis-acting genetic variations controls for trans-acting
effects or environmental conditions that differentially influence gene expression
among samples. Association of allelic imbalance with haplotype and diplotype
comprising putative cis-rSNPs allows identification of hereditary cis-rSNPs without
interference from trans-acting effects as well as interaction of multiple cis-rSNPs.
3. The effect size of each genetic variant can be magnified and become detectable when
the variant is associated with an intermediate risk factor, namely regulation of key genes
that are associated with inherited disease susceptibility (e.g. lung cancer risk). It is possible to assess this
type of intermediate risk factor with far fewer patients than the thousands typically
necessary for a GWAS aiming to determine association of each individual SNP
with risk.
4. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated
in the LCRT recruited population. This biospecimen repository will
enable additional studies of genetic basis for COPD and/or lung cancer risk.
5. The findings of these studies have the potential to significantly narrow the population of
individuals requiring annual LDCT for early detection of lung cancer.
6. The developed two-color fluorometric real-time PCR demonstrated good analytical
performance. Probe specificity was <1% non-homologous binding, and primers
detected < 10 molecules. Across 6 orders of magnitude at a 1:1 ratio of NT:IS, and in
dilutions of NT relative to constant IS or vice versa at ratios < 10, the R² value was >
0.99 and the slope was 1.0 ± 0.05. The average coefficient of variation (CV) for
measurement of each gene was < 10% for > 60 molecules input.
7. ESM controlled for variation in fluorescent probe labeling and in selection of the
threshold. The unknown copy number of target NT was calculated by comparing the Cq
values of NT and IS: [NT Cq - IS Cq] multiplied by input IS copies. Variations that can
affect those Cq values, such as probe quality, probe activity, and the software selection
of Cq, were controlled by the mean of the two ESM [NT Cq - IS Cq] values.
8. The developed two-color fluorometric real-time PCR augments the accuracy,
specificity and sensitivity of commonly used clinical predictive biomarker tests, such as
immunohistochemistry (IHC) and FISH. In particular, it allows detection in
FFPE human tissue, which is abundant in hospitals nationwide.
9. Inclusion of synthetic competitive internal standard templates in targeted NGS library
preparation controls for low target input into NGS library preparation, low library
product input into the sequencer, and errors generated during library preparation and
sequencing.
10. A mathematical equation based on Poisson sampling was developed to predict the
assay coefficient of variation (CV). The predicted CV is then used to determine the
confidence limits for each value acquired from sequencing. False positive results can
therefore be eliminated by minimizing stochastic variation.
11. Incorporation of these quality controls provides an analysis pathway consistent with
previously reported College of American Pathologists (CAP) and Nex-StoCT
guidelines for NGS diagnostics in the clinical setting.
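The calculation summarized in item 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: it assumes ideal amplification (one doubling per cycle, so a Cq difference d corresponds to a 2^(-d) fold ratio of NT to IS), and it assumes the ESM contains NT and IS at a known 1:1 ratio so that the mean of its [NT Cq - IS Cq] values can be subtracted as a correction. The function and parameter names are hypothetical.

```python
def estimate_nt_copies(nt_cq, is_cq, is_copies, esm_delta_cqs=(0.0,), base=2.0):
    """Estimate native template (NT) copies from Cq values.

    nt_cq, is_cq   -- Cq values of NT and internal standard (IS) in the sample
    is_copies      -- known number of IS copies added to the reaction
    esm_delta_cqs  -- [NT Cq - IS Cq] values measured in the ESM wells
                      (assumed here to hold NT and IS at a 1:1 ratio)
    base           -- amplification per cycle (2.0 assumes ideal efficiency)
    """
    if is_copies <= 0:
        raise ValueError("is_copies must be positive")
    if len(esm_delta_cqs) == 0:
        raise ValueError("at least one ESM delta-Cq value is required")

    # Mean ESM delta-Cq captures probe- and threshold-related bias.
    correction = sum(esm_delta_cqs) / len(esm_delta_cqs)

    # Bias-corrected Cq difference between NT and IS in the sample.
    corrected_delta = (nt_cq - is_cq) - correction

    # A lower NT Cq than IS Cq means more NT than IS, hence the negative exponent.
    return is_copies * base ** (-corrected_delta)


if __name__ == "__main__":
    # Hypothetical values: NT crosses threshold one cycle before IS.
    print(estimate_nt_copies(nt_cq=24.0, is_cq=25.0, is_copies=1000))
```

Under these assumptions, a sample whose NT Cq equals its IS Cq after ESM correction is estimated to contain as many NT copies as the IS copies added.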
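The Poisson-based CV prediction described in item 10 can be illustrated with the textbook approximation below. The equation developed in this work is not reproduced here; the sketch assumes that the number of molecules sampled follows a Poisson distribution (so that CV = 1/sqrt(N)), that NT and IS counts are independent, and that confidence limits are symmetric normal-approximation limits. All function names are hypothetical.

```python
import math


def poisson_cv(count):
    """Predicted CV of a Poisson-distributed count: sqrt(N) / N = 1 / sqrt(N)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 1.0 / math.sqrt(count)


def ratio_cv(nt_count, is_count):
    """Approximate CV of an NT/IS ratio when both counts are independent
    Poisson samples (first-order error propagation)."""
    if nt_count <= 0 or is_count <= 0:
        raise ValueError("counts must be positive")
    return math.sqrt(1.0 / nt_count + 1.0 / is_count)


def confidence_limits(value, cv, z=1.96):
    """Symmetric limits value * (1 +/- z * cv); z = 1.96 gives roughly 95%
    limits under a normal approximation. The lower limit is floored at zero."""
    half_width = z * cv * value
    return max(0.0, value - half_width), value + half_width


if __name__ == "__main__":
    # Hypothetical example: 100 NT and 400 IS molecules sampled.
    cv = ratio_cv(100, 400)
    low, high = confidence_limits(100 / 400, cv)
    print(f"predicted CV = {cv:.3f}, limits = ({low:.3f}, {high:.3f})")
```

Under this approximation the predicted CV rises sharply as the number of sampled molecules falls, which is why an observation supported by few molecules carries wide confidence limits.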
References
"Database of Single Nucleotide Polymorphisms (dbSNP)." Bethesda (MD): National
Center for Biotechnology Information, National Library of Medicine.
(2014). Cancer Facts & Figures. American Cancer Society.
Adare, A., S. Afanasiev, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, H. Al-Ta'ani,
J. Alexander, A. Angerami, K. Aoki, N. Apadula, Y. Aramaki, H. Asano, E. C.
Aschenauer, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, B. Bannier,
K. N. Barish, B. Bassalleck, S. Bathe, V. Baublis, S. Baumgart, A. Bazilevsky, R.
Belmont, A. Berdnikov, Y. Berdnikov, X. Bing, D. S. Blau, K. Boyle, M. L. Brooks, H.
Buesching, V. Bumazhnov, S. Butsyk, S. Campbell, P. Castera, C. H. Chen, C. Y. Chi,
M. Chiu, I. J. Choi, J. B. Choi, S. Choi, R. K. Choudhury, P. Christiansen, T. Chujo, O.
Chvala, V. Cianciolo, Z. Citron, B. A. Cole, M. Connors, M. Csanad, T. Csorgo, S.
Dairaku, A. Datta, M. S. Daugherity, G. David, A. Denisov, A. Deshpande, E. J.
Desmond, K. V. Dharmawardane, O. Dietzsch, L. Ding, A. Dion, M. Donadelli, O.
Drapier, A. Drees, K. A. Drees, J. M. Durham, A. Durum, L. D'Orazio, S. Edwards, Y. V.
Efremenko, T. Engelmore, A. Enokizono, S. Esumi, K. O. Eyser, B. Fadem, D. E. Fields,
M. Finger, M. Finger, Jr., F. Fleuret, S. L. Fokin, J. E. Frantz, A. Franz, A. D. Frawley,
Y. Fukao, T. Fusayasu, K. Gainey, C. Gal, A. Garishvili, I. Garishvili, A. Glenn, X.
Gong, M. Gonin, Y. Goto, R. Granier de Cassagnac, N. Grau, S. V. Greene, M. Grosse
Perdekamp, T. Gunji, L. Guo, H. A. Gustafsson, T. Hachiya, J. S. Haggerty, K. I. Hahn,
H. Hamagaki, J. Hanks, K. Hashimoto, E. Haslum, R. Hayano, X. He, T. K. Hemmick, T.
Hester, J. C. Hill, R. S. Hollis, K. Homma, B. Hong, T. Horaguchi, Y. Hori, S. Huang, T.
Ichihara, H. Iinuma, Y. Ikeda, J. Imrek, M. Inaba, A. Iordanova, D. Isenhower, M. Issah,
A. Isupov, D. Ivanischev, B. V. Jacak, M. Javani, J. Jia, X. Jiang, B. M. Johnson, K. S.
Joo, D. Jouan, J. Kamin, S. Kaneti, B. H. Kang, J. H. Kang, J. S. Kang, J. Kapustinsky,
K. Karatsu, M. Kasai, D. Kawall, A. V. Kazantsev, T. Kempel, A. Khanzadeev, K. M.
Kijima, B. I. Kim, C. Kim, D. J. Kim, E. J. Kim, H. J. Kim, K. B. Kim, Y. J. Kim, Y. K.
Kim, E. Kinney, A. Kiss, E. Kistenev, J. Klatsky, D. Kleinjan, P. Kline, Y. Komatsu, B.
Komkov, J. Koster, D. Kotchetkov, D. Kotov, A. Kral, F. Krizek, G. J. Kunde, K. Kurita,
M. Kurosawa, Y. Kwon, G. S. Kyle, R. Lacey, Y. S. Lai, J. G. Lajoie, A. Lebedev, B.
Lee, D. M. Lee, J. Lee, K. B. Lee, K. S. Lee, S. H. Lee, S. R. Lee, M. J. Leitch, M. A.
Leite, M. Leitgab, B. Lewis, S. H. Lim, L. A. Linden Levy, A. Litvinenko, M. X. Liu, B.
Love, C. F. Maguire, Y. I. Makdisi, M. Makek, A. Malakhov, A. Manion, V. I. Manko,
E. Mannel, S. Masumoto, M. McCumber, P. L. McGaughey, D. McGlinchey, C.
McKinney, M. Mendoza, B. Meredith, Y. Miake, T. Mibe, A. C. Mignerey, A. Milov, D.
K. Mishra, J. T. Mitchell, Y. Miyachi, S. Miyasaka, A. K. Mohanty, H. J. Moon, D. P.
Morrison, S. Motschwiller, T. V. Moukhanova, T. Murakami, J. Murata, T. Nagae, S.
Nagamiya, J. L. Nagle, M. I. Nagy, I. Nakagawa, Y. Nakamiya, K. R. Nakamura, T.
Nakamura, K. Nakano, C. Nattrass, A. Nederlof, M. Nihashi, R. Nouicer, N. Novitzky,
A. S. Nyanin, E. O'Brien, C. A. Ogilvie, K. Okada, A. Oskarsson, M. Ouchida, K.
Ozawa, R. Pak, V. Pantuev, V. Papavassiliou, B. H. Park, I. H. Park, S. K. Park, S. F.
Pate, L. Patel, H. Pei, J. C. Peng, H. Pereira, V. Peresedov, D. Y. Peressounko, R. Petti,
C. Pinkenburg, R. P. Pisani, M. Proissl, M. L. Purschke, H. Qu, J. Rak, I. Ravinovich, K.
F. Read, R. Reynolds, V. Riabov, Y. Riabov, E. Richardson, D. Roach, G. Roche, S. D.
Rolnick, M. Rosati, P. Rukoyatkin, B. Sahlmueller, N. Saito, T. Sakaguchi, V.
Samsonov, M. Sano, M. Sarsour, S. Sawada, K. Sedgwick, R. Seidl, A. Sen, R. Seto, D.
Sharma, I. Shein, T. A. Shibata, K. Shigaki, M. Shimomura, K. Shoji, P. Shukla, A.
Sickles, C. L. Silva, D. Silvermyr, K. S. Sim, B. K. Singh, C. P. Singh, V. Singh, M.
Slunecka, R. A. Soltz, W. E. Sondheim, S. P. Sorensen, M. Soumya, I. V. Sourikova, P.
W. Stankus, E. Stenlund, M. Stepanov, A. Ster, S. P. Stoll, T. Sugitate, A. Sukhanov, J.
Sun, J. Sziklai, E. M. Takagui, A. Takahara, A. Taketani, Y. Tanaka, S. Taneja, K.
Tanida, M. J. Tannenbaum, S. Tarafdar, A. Taranenko, E. Tennant, H. Themann, T.
Todoroki, L. Tomasek, M. Tomasek, H. Torii, R. S. Towell, I. Tserruya, Y. Tsuchimoto,
T. Tsuji, C. Vale, H. W. van Hecke, M. Vargyas, E. Vazquez-Zambrano, A. Veicht, J.
Velkovska, R. Vertesi, M. Virius, A. Vossen, V. Vrba, E. Vznuzdaev, X. R. Wang, D.
Watanabe, K. Watanabe, Y. Watanabe, Y. S. Watanabe, F. Wei, R. Wei, S. N. White, D.
Winter, S. Wolin, C. L. Woody, M. Wysocki, Y. L. Yamaguchi, R. Yang, A. Yanovich,
J. Ying, S. Yokkaichi, Z. You, I. Younus, I. E. Yushmanov, W. A. Zajc, A. Zelenski and
L. Zolin (2012). "Evolution of π0 suppression in Au+Au collisions from √s_NN = 39 to 200 GeV." Phys Rev Lett 109(15): 152301.
Akey, J., L. Jin and M. Xiong (2001). "Haplotypes vs single marker linkage
disequilibrium tests: what do we gain?" Eur J Hum Genet 9(4): 291-300.
Alberg, A. J. and J. M. Samet (2003). "Epidemiology of lung cancer." Chest 123(1
Suppl): 21S-49S.
Albert, F. W. and L. Kruglyak (2015). "The role of regulatory variation in complex traits
and disease." Nat Rev Genet 16(4): 197-212.
Aldrich, M. C., H. M. Munro, M. Mumma, E. L. Grogan, P. P. Massion, T. S. Blackwell
and W. J. Blot (2015). "Chronic obstructive pulmonary disease and subsequent overall
and lung cancer mortality in low-income adults." PLoS One 10(3): e0121805.
American Cancer Society. (2016). "Non-small cell lung cancer survival rates by stage." from http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-survival-rates.
Amos, C. I., X. Wu, P. Broderick, I. P. Gorlov, J. Gu, T. Eisen, Q. Dong, Q. Zhang, X.
Gu, J. Vijayakrishnan, K. Sullivan, A. Matakidou, Y. Wang, G. Mills, K. Doheny, Y. Y.
Tsai, W. V. Chen, S. Shete, M. R. Spitz and R. S. Houlston (2008). "Genome-wide
association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1."
Nat Genet 40(5): 616-622.
Apostolakos, M. J., W. H. Schuermann, M. W. Frampton, M. J. Utell and J. C. Willey
(1993). "Measurement of gene expression by multiplex competitive polymerase chain
reaction." Anal Biochem 213(2): 277-284.
Aroucha, D. C., R. F. Carmo, L. R. Vasconcelos, R. E. Lima, T. F. Mendonca, L. E.
Arnez, M. S. Cavalcanti, M. T. Muniz, M. L. Aroucha, E. R. Siqueira, L. B. Pereira, P.
Moura, L. M. Pereira and M. R. Coelho (2016). "TNF-alpha and IL-10 polymorphisms
increase the risk to hepatocellular carcinoma in HCV infected individuals." J Med Virol.
Aziz, N., Q. Zhao, L. Bry, D. K. Driscoll, B. Funke, J. S. Gibson, W. W. Grody, M. R.
Hegde, G. A. Hoeltge, D. G. Leonard, J. D. Merker, R. Nagarajan, L. A. Palicki, R. S.
Robetorye, I. Schrijver, K. E. Weck and K. V. Voelkerding (2015). "College of American
Pathologists' laboratory standards for next-generation sequencing clinical tests." Arch
Pathol Lab Med 139(4): 481-493.
Bach, P. B., J. R. Jett, U. Pastorino, M. S. Tockman, S. J. Swensen and C. B. Begg
(2007). "Computed tomography screening and lung cancer outcomes." JAMA 297(9):
953-961.
Bach, P. B., M. W. Kattan, M. D. Thornquist, M. G. Kris, R. C. Tate, M. J. Barnett, L. J.
Hsieh and C. B. Begg (2003). "Variations in lung cancer risk among smokers." J Natl
Cancer Inst 95(6): 470-478.
Bach, P. B., J. N. Mirkin, T. K. Oliver, C. G. Azzoli, D. A. Berry, O. W. Brawley, T.
Byers, G. A. Colditz, M. K. Gould, J. R. Jett, A. L. Sabichi, R. Smith-Bindman, D. E.
Wood, A. Qaseem and F. C. Detterbeck (2012). "Benefits and harms of CT screening for lung cancer: a systematic review." JAMA 307(22): 2418-2429.
Barendse, W. (2011). "Haplotype analysis improved evidence for candidate genes for intramuscular fat percentage from a genome wide association study of cattle." PLoS One
6(12): e29601.
Barnes, N. C., M. Saetta and K. F. Rabe (2014). "Implementing lessons learned from previous bronchial biopsy trials in a new randomized controlled COPD biopsy trial with roflumilast." BMC Pulm Med 14: 9.
Barr, R. G., J. Herbstman, F. E. Speizer and C. A. Camargo, Jr. (2002). "Validation of self-reported chronic obstructive pulmonary disease in a cohort study of nurses." Am J
Epidemiol 155(10): 965-971.
Basuli, D., R. G. Stevens, F. M. Torti and S. V. Torti (2014). "Epidemiological associations between iron and cardiovascular disease and diabetes." Front Pharmacol 5:
117.
Beane, J., J. Vick, F. Schembri, C. Anderlind, A. Gower, J. Campbell, L. Luo, X. H.
Zhang, J. Xiao, Y. O. Alekseyev, S. Wang, S. Levy, P. P. Massion, M. Lenburg and A.
Spira (2011). "Characterizing the impact of smoking and lung cancer on the airway transcriptome using RNA-Seq." Cancer Prev Res (Phila) 4(6): 803-817.
Becker-Andre, M. and K. Hahlbrock (1989). "Absolute mRNA quantification using the polymerase chain reaction (PCR). A novel approach by a PCR aided transcript titration assay (PATTY)." Nucleic Acids Res 17(22): 9437-9446.
Beer, M. A. and S. Tavazoie (2004). "Predicting gene expression from sequence." Cell
117(2): 185-198.
Bell, G. D., N. C. Kane, L. H. Rieseberg and K. L. Adams (2013). "RNA-seq analysis of allele-specific expression, hybrid effects, and regulatory divergence in hybrids compared with their parents from natural populations." Genome biology and evolution 5(7): 1309-
1323.
Bhatnagar, S., X. Zhu, J. Ou, L. Lin, L. Chamberlain, L. J. Zhu, N. Wajapeyee and M. R.
Green (2014). "Genetic and pharmacological reactivation of the mammalian inactive X chromosome." Proceedings of the National Academy of Sciences 111(35): 12591-12598.
Biosystems, A. "TaqMan® SNP Genotyping Assays." PRODUCT BULLETIN.
Birse, C. E., R. J. Lagier, W. FitzHugh, H. I. Pass, W. N. Rom, E. S. Edell, A. O.
Bungum, F. Maldonado, J. R. Jett, M. Mesri, E. Sult, E. Joseloff, A. Li, J. Heidbrink, G.
Dhariwal, C. Danis, J. L. Tomic, R. J. Bruce, P. A. Moore, T. He, M. E. Lewis and S. M.
Ruben (2015). "Blood-based lung cancer biomarkers identified through proteomic discovery in cancer tissues, cell lines and conditioned medium." Clin Proteomics 12(1):
18.
Blomquist, T., E. L. Crawford, D. Mullins, Y. Yoon, D. A. Hernandez, S. Khuder, P. L.
Ruppel, E. Peters, D. J. Oldfield, B. Austermiller, J. C. Anders and J. C. Willey (2009).
"Pattern of antioxidant and DNA repair gene expression in normal airway epithelium
associated with lung cancer diagnosis." Cancer Res 69(22): 8629-8635.
Blomquist, T., E. L. Crawford, J. Yeo, X. Zhang and J. C. Willey (2015). "Control for
stochastic sampling variation and qualitative sequencing error in next generation
sequencing." Biomolecular Detection and Quantification.
Blomquist, T. M., R. D. Brown, E. L. Crawford, I. de la Serna, K. Williams, Y. Yoon, D.
A. Hernandez and J. C. Willey (2013). "CEBPG Exhibits Allele-Specific Expression in
Human Bronchial Epithelial Cells." Gene Regul Syst Bio 7: 125-138.
Blomquist, T. M., E. L. Crawford, J. L. Lovett, J. Yeo, L. M. Stanoszek, A. Levin, J. Li,
M. Lu, L. Shi, K. Muldrew and J. C. Willey (2013). "Targeted RNA-Sequencing with
Competitive Multiplex-PCR Amplicon Libraries." PLoS One 8(11): e79120.
Blomquist, T. M., E. L. Crawford and J. C. Willey (2010). "Cis-acting genetic variation at an E2F1/YY1 response site and putative p53 site is associated with altered allele-specific expression of ERCC5 (XPG) transcript in normal human bronchial epithelium."
Carcinogenesis 31(7): 1242-1250.
Boeri, M., C. Verri, D. Conte, L. Roz, P. Modena, F. Facchinetti, E. Calabrò, C. M.
Croce, U. Pastorino and G. Sozzi (2011). "MicroRNA signatures in tissues and plasma
predict development and prognosis of computed tomography detected lung cancer."
Proceedings of the National Academy of Sciences 108(9): 3713-3718.
Boundless (2015). "Alternatives to Dominance and Recessiveness." Boundless Biology.
Boyle, A. P., E. L. Hong, M. Hariharan, Y. Cheng, M. A. Schaub, M. Kasowski, K. J.
Karczewski, J. Park, B. C. Hitz, S. Weng, J. M. Cherry and M. Snyder (2012).
"Annotation of functional variation in personal genomes using RegulomeDB." Genome
Res 22(9): 1790-1797.
Brem, R. B., G. Yvert, R. Clinton and L. Kruglyak (2002). "Genetic dissection of transcriptional regulation in budding yeast." Science 296(5568): 752-755.
Brophy, V. H., M. D. Hastings, J. B. Clendenning, R. J. Richter, G. P. Jarvik and C. E.
Furlong (2001). "Polymorphisms in the human paraoxonase (PON1) promoter."
Pharmacogenetics 11(1): 77-84.
Browning, S. R. and B. L. Browning (2007). "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering." Am J Hum Genet 81(5): 1084-1097.
Browning, S. R. and B. L. Browning (2011). "Haplotype phasing: existing methods and new developments." Nat Rev Genet 12(10): 703-714.
Buch, S. C., B. Diergaarde, T. Nukui, R. S. Day, J. M. Siegfried, M. Romkes and J. L.
Weissfeld (2012). "Genetic variability in DNA repair and cell cycle control pathway genes and risk of smoking-related lung cancer." Mol Carcinog 51 Suppl 1: E11-20.
Buckland, P. R. (2004). "Allele-specific gene expression differences in humans." Hum
Mol Genet 13 Spec No 2: R255-260.
Burger, I. M., N. E. Kass, J. H. Sunshine and S. S. Siegelman (2008). "The use of CT for
screening: a national survey of radiologists' activities and attitudes." Radiology 248(1):
160-168.
Burgtorf, C., P. Kepper, M. Hoehe, C. Schmitt, R. Reinhardt, H. Lehrach and S. Sauer
(2003). "Clone-based systematic haplotyping (CSH): a procedure for physical haplotyping of whole genomes." Genome Res 13(12): 2717-2724.
Bustin, S. A. (2000). "Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays." J Mol Endocrinol 25(2): 169-193.
Cahan, P., Y. Li, M. Izumi and T. A. Graubert (2009). "The impact of copy number
variation on local gene expression in mouse hematopoietic stem and progenitor cells."
Nat Genet 41(4): 430-437.
Canales, R. D., Y. Luo, J. C. Willey, B. Austermiller, C. C. Barbacioru, C. Boysen, K.
Hunkapiller, R. V. Jensen, C. R. Knight, K. Y. Lee, Y. Ma, B. Maqsodi, A. Papallo, E. H.
Peters, K. Poulter, P. L. Ruppel, R. R. Samaha, L. Shi, W. Yang, L. Zhang and F. M.
Goodsaid (2006). "Evaluation of DNA microarray results with quantitative gene
expression platforms." Nat Biotechnol 24(9): 1115-1122.
Caporaso, N., F. Gu, N. Chatterjee, J. Sheng-Chih, K. Yu, M. Yeager, C. Chen, K.
Jacobs, W. Wheeler, M. T. Landi, R. G. Ziegler, D. J. Hunter, S. Chanock, S. Hankinson,
P. Kraft and A. W. Bergen (2009). "Genome-wide and candidate gene association study
of cigarette smoking behaviors." PLoS One 4(2): e4653.
Casbon, J. A., R. J. Osborne, S. Brenner and C. P. Lichtenstein (2011). "A method for
counting PCR template molecules with application to next-generation sequencing."
Nucleic Acids Res 39(12): e81.
Cassidy, A., J. P. Myles, M. van Tongeren, R. D. Page, T. Liloglou, S. W. Duffy and J.
K. Field (2008). "The LLP risk model: an individual risk prediction model for lung
cancer." Br J Cancer 98(2): 270-276.
Cazzoli, R., F. Buttitta, M. Di Nicola, S. Malatesta, A. Marchetti, W. N. Rom and H. I.
Pass (2013). "microRNAs derived from circulating exosomes as noninvasive biomarkers
for screening and diagnosing lung cancer." J Thorac Oncol 8(9): 1156-1162.
Celi, F. S., M. E. Zenilman and A. R. Shuldiner (1993). "A rapid and versatile method to
synthesize internal standards for competitive PCR." Nucleic Acids Res 21(4): 1047.
Chamizo, C., S. Zazo, M. Domine, I. Cristobal, J. Garcia-Foncillas, F. Rojo and J.
Madoz-Gurpide (2015). "Thymidylate synthase expression as a predictive biomarker of pemetrexed sensitivity in advanced non-small cell lung cancer." BMC Pulm Med 15:
132.
Chen, X., L. Levine and P. Y. Kwok (1999). "Fluorescence polarization in homogeneous
nucleic acid analysis." Genome Res 9(5): 492-498.
Chen, X. and P. F. Sullivan (2003). "Single nucleotide polymorphism genotyping:
biochemistry, protocol, cost and throughput." Pharmacogenomics J 3(2): 77-96.
Chen, Z. and X. Duan (2011). "Ribosomal RNA depletion for massively parallel bacterial
RNA-sequencing applications." Methods Mol Biol 733: 93-103.
Cheung, V. G., R. S. Spielman, K. G. Ewens, T. M. Weber, M. Morley and J. T. Burdick
(2005). "Mapping determinants of human gene expression by regional and genome-wide
association." Nature 437(7063): 1365-1369.
Churchill, G. A. (2002). "Fundamentals of experimental design for cDNA microarrays."
Nat Genet 32 Suppl: 490-495.
Cibulskis, K., M. S. Lawrence, S. L. Carter, A. Sivachenko, D. Jaffe, C. Sougnez, S.
Gabriel, M. Meyerson, E. S. Lander and G. Getz (2013). "Sensitive detection of somatic
point mutations in impure and heterogeneous cancer samples." Nat Biotechnol 31(3):
213-219.
Cirulli, E. T. and D. B. Goldstein (2010). "Uncovering the roles of rare variants in
common disease through whole-genome sequencing." Nat Rev Genet 11(6): 415-425.
Clark, A. G. (1990). "Inference of haplotypes from PCR-amplified samples of diploid
populations." Mol Biol Evol 7(2): 111-122.
Cooper, S. J., N. D. Trinklein, E. D. Anton, L. Nguyen and R. M. Myers (2006).
"Comprehensive analysis of transcriptional promoter structure and function in 1% of the
human genome." Genome Res 16(1): 1-10.
Costa, V., M. Aprile, R. Esposito and A. Ciccodicola (2013). "RNA-Seq and human complex diseases: recent accomplishments and future perspectives." Eur J Hum Genet
21(2): 134-142.
Crawford, E. L., A. Levin, F. Safi, M. Lu, A. Baugh, X. Zhang, J. Yeo, S. A.
Khuder, A. M. Boulos, P. Nana-Sinkam, P. P. Massion, D. A. Arenberg, D. Midthun, P.
J. Mazzone, S. D. Nathan, R. Wainz, G. Silvestri, J. Tita and J. C. Willey (2016). "Lung
cancer risk test trial: study design, participant baseline characteristics, bronchoscopy
safety, and establishment of a biospecimen repository." BMC Pulmonary Medicine
16(16).
Crawford, E. L., T. Blomquist, D. N. Mullins, Y. Yoon, D. R. Hernandez, M. Al-
Bagdhadi, J. Ruiz, J. Hammersley and J. C. Willey (2007). "CEBPG regulates
ERCC5/XPG expression in human bronchial epithelial cells and this regulation is modified by E2F1/YY1 interactions." Carcinogenesis 28(12): 2552-2559.
Crawford, E. L., S. A. Khuder, S. J. Durham, M. Frampton, M. Utell, W. G. Thilly, D. A.
Weaver, W. J. Ferencak, C. A. Jennings, J. R. Hammersley, D. A. Olson and J. C. Willey
(2000). "Normal bronchial epithelial cell expression of glutathione transferase P1, glutathione transferase M3, and glutathione peroxidase is low in subjects with bronchogenic carcinoma." Cancer Res 60(6): 1609-1618.
Crawford, E. L., G. J. Peters, P. Noordhuis, M. G. Rots, M. Vondracek, R. C. Grafstrom,
K. Lieuallen, G. Lennon, R. J. Zahorchak, M. J. Georgeson, A. Wali, J. F. Lechner, P. S.
Fan, M. B. Kahaleh, S. A. Khuder, K. A. Warner, D. A. Weaver and J. C. Willey (2001).
"Reproducible gene expression measurement among multiple laboratories obtained in a blinded study using standardized RT (StaRT)-PCR." Mol Diagn 6(4): 217-225.
Crawford, E. L., K. A. Warner, S. A. Khuder, R. J. Zahorchak and J. C. Willey (2002).
"Multiplex standardized RT-PCR for expression analysis of many genes in small samples." Biochem Biophys Res Commun 293(1): 509-516.
Crowley, J. J., V. Zhabotynsky, W. Sun, S. Huang, I. K. Pakatci, Y. Kim, J. R. Wang, A.
P. Morgan, J. D. Calaway, D. L. Aylor, Z. Yun, T. A. Bell, R. J. Buus, M. E. Calaway, J.
P. Didion, T. J. Gooch, S. D. Hansen, N. N. Robinson, G. D. Shaw, J. S. Spence, C. R.
Quackenbush, C. J. Barrick, R. J. Nonneman, K. Kim, J. Xenakis, Y. Xie, W. Valdar, A.
B. Lenarcic, W. Wang, C. E. Welsh, C. P. Fu, Z. Zhang, J. Holt, Z. Guo, D. W.
Threadgill, L. M. Tarantino, D. R. Miller, F. Zou, L. McMillan, P. F. Sullivan and F.
Pardo-Manuel de Villena (2015). "Analyses of allele-specific gene expression in highly
divergent mouse crosses identifies pervasive allelic imbalance." Nat Genet 47(4): 353-
360.
Dai, J., M. Zhu, C. Wang, W. Shen, W. Zhou, J. Sun, J. Liu, G. Jin, H. Ma, Z. Hu, D. Lin
and H. Shen (2015). "Systematical analyses of variants in CTCF-binding sites identified a novel lung cancer susceptibility locus among Chinese population." Sci Rep 5: 7833.
Daly, S., D. Rinewalt, C. Fhied, S. Basu, B. Mahon, M. J. Liptay, E. Hong, G.
Chmielewski, M. A. Yoder, P. N. Shah, E. S. Edell, F. Maldonado, A. O. Bungum and J.
A. Borgia (2013). "Development and validation of a plasma biomarker panel for
discerning clinical significance of indeterminate pulmonary nodules." J Thorac Oncol
8(1): 31-36.
Davidson, E. H., D. R. McClay and L. Hood (2003). "Regulatory gene networks and the
properties of the developmental process." Proc Natl Acad Sci U S A 100(4): 1475-1480.
de-Torres, J. P., D. O. Wilson, P. Sanchez-Salcedo, J. L. Weissfeld, J. Berto, A. Campo,
A. B. Alcaide, M. Garcia-Granero, B. R. Celli and J. J. Zulueta (2015). "Lung cancer in patients with chronic obstructive pulmonary disease. Development and validation of the
COPD Lung Cancer Screening Score." Am J Respir Crit Care Med 191(3): 285-291.
de la Chapelle, A. (2009). "Genetic predisposition to human disease: allele-specific
expression and low-penetrance regulatory loci." Oncogene 28(38): 3345-3348.
Dear, P. H. and P. R. Cook (1989). "Happy mapping: a proposal for linkage mapping the
human genome." Nucleic Acids Res 17(17): 6795-6807.
Deutsch, S., R. Lyle, E. T. Dermitzakis, H. Attar, L. Subrahmanyan, C. Gehrig, L.
Parand, M. Gagnebin, J. Rougemont, C. V. Jongeneel and S. E. Antonarakis (2005).
"Gene expression variation and expression quantitative trait mapping of human
chromosome 21 genes." Hum Mol Genet 14(23): 3741-3749.
Diaz, L. A. and A. Bardelli (2014). "Liquid biopsies: genotyping circulating tumor
DNA." Journal of Clinical Oncology 32(6): 579-586.
Didon, L., A. B. Roos, G. P. Elmberger, F. J. Gonzalez and M. Nord (2010). "Lung-
specific inactivation of CCAAT/enhancer binding protein alpha causes a pathological
pattern characteristic of COPD." Eur Respir J 35(1): 186-197.
Ding, C. and C. R. Cantor (2003). "A high-throughput gene expression analysis technique
using competitive PCR and matrix-assisted laser desorption ionization time-of-flight
MS." Proc Natl Acad Sci U S A 100(6): 3059-3064.
Djebali, S., C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J.
Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C.
Zaleski, J. Rozowsky, M. Roder, F. Kokocinski, R. F. Abdelhamid, T. Alioto, I.
Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen,
J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E.
Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D.
Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B.
King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D.
Robyr, M. Sammeth, L. Schaffer, L. H. See, A. Shahab, J. Skancke, A. M. Suzuki, H.
Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, J. Wrobel, Y. Yu, X. Ruan, Y.
Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Antonarakis, G.
Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guigo and T. R. Gingeras
(2012). "Landscape of transcription in human cells." Nature 489(7414): 101-108.
Dumur, C. I., S. Nasim, A. M. Best, K. J. Archer, A. C. Ladd, V. R. Mas, D. S.
Wilkinson, C. T. Garrett and A. Ferreira-Gonzalez (2004). "Evaluation of quality-control criteria for microarray gene expression analysis." Clin Chem 50(11): 1994-2002.
Eisner, M. D., L. Trupin, P. P. Katz, E. H. Yelin, G. Earnest, J. Balmes and P. D. Blanc
(2005). "Development and validation of a survey-based COPD severity score." Chest
127(6): 1890-1897.
Eissa, N. T. and S. C. Erzurum (2001). "Flexible bronchoscopy in molecular biology."
Clin Chest Med 22(2): 343-353, ix.
el-Deiry, W. S., J. W. Harper, P. M. O'Connor, V. E. Velculescu, C. E. Canman, J.
Jackman, J. A. Pietenpol, M. Burrell, D. E. Hill, Y. Wang, K. G. Wiman, W. E. Mercer,
M. B. Kastan, K. W. Kohn, S. J. Elledge, K. W. Kinzler and B. Vogelstein (1994).
"WAF1/CIP1 is induced in p53-mediated G1 arrest and apoptosis." Cancer Res 54(5):
1169-1174.
el-Deiry, W. S., T. Tokino, V. E. Velculescu, D. B. Levy, R. Parsons, J. M. Trent, D. Lin,
W. E. Mercer, K. W. Kinzler and B. Vogelstein (1993). "WAF1, a potential mediator of
p53 tumor suppression." Cell 75(4): 817-825.
Euser, A. M., C. Zoccali, K. J. Jager and F. W. Dekker (2009). "Cohort studies:
prospective versus retrospective." Nephron Clin Pract 113(3): c214-217.
Evans, T. G. (2015). "Considerations for the use of transcriptomics in identifying the
'genes that matter' for environmental adaptation." J Exp Biol 218(Pt 12): 1925-1935.
Excoffier, L. and M. Slatkin (1995). "Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population." Mol Biol Evol 12(5): 921-927.
Eymin, B., S. Gazzeri, C. Brambilla and E. Brambilla (2001). "Distinct pattern of E2F1
expression in human lung tumours: E2F1 is upregulated in small cell lung carcinoma."
Oncogene 20(14): 1678-1687.
Facciolongo, N., M. Patelli, S. Gasparini, L. Lazzari Agli, M. Salio, C. Simonassi, B. Del
Prato and P. Zanoni (2009). "Incidence of complications in bronchoscopy. Multicentre
prospective study of 20,986 bronchoscopies." Monaldi Arch Chest Dis 71(1): 8-14.
Fehrmann, R. S., R. C. Jansen, J. H. Veldink, H. J. Westra, D. Arends, M. J. Bonder, J.
Fu, P. Deelen, H. J. Groen, A. Smolonska, R. K. Weersma, R. M. Hofstra, W. A.
Buurman, S. Rensen, M. G. Wolfs, M. Platteel, A. Zhernakova, C. C. Elbers, E. M.
Festen, G. Trynka, M. H. Hofker, C. G. Saris, R. A. Ophoff, L. H. van den Berg, D. A.
van Heel, C. Wijmenga, G. J. Te Meerman and L. Franke (2011). "Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA." PLoS Genet 7(8): e1002197.
Ferre, F. (1992). "Quantitative or semi-quantitative PCR: reality versus myth." PCR
Methods Appl 2(1): 1-9.
Field, J. K., D. Baldwin, K. Brain, A. Devaraj, T. Eisen, S. W. Duffy, D. M. Hansell, K.
Kerr, R. Page, M. Parmar, D. Weller, P. Williamson, D. Whynes and U. Team (2011).
"CT screening for lung cancer in the UK: position statement by UKLS investigators
following the NLST report." Thorax 66(8): 736-737.
Fisher, D. A., J. T. Maple, T. Ben-Menachem, B. D. Cash, G. A. Decker, D. S. Early, J.
A. Evans, R. D. Fanelli, N. Fukami, J. H. Hwang, R. Jain, T. L. Jue, K. M. Khan, P. M.
Malpas, R. N. Sharaf, A. K. Shergill and J. A. Dominitz (2011). "Complications of
colonoscopy." Gastrointest Endosc 74(4): 745-752.
Forsberg, L., L. Lyrenas, U. de Faire and R. Morgenstern (2001). "A common functional
C-T substitution polymorphism in the promoter region of the human catalase gene
influences transcription factor binding, reporter gene transcription and is correlated to
blood catalase levels." Free Radic Biol Med 30(5): 500-505.
Frampton, G. M., A. Fichtenholtz, G. A. Otto, K. Wang, S. R. Downing, J. He, M.
Schnall-Levin, J. White, E. M. Sanford, P. An, J. Sun, F. Juhn, K. Brennan, K. Iwanik, A.
Maillet, J. Buell, E. White, M. Zhao, S. Balasubramanian, S. Terzic, T. Richards, V.
Banning, L. Garcia, K. Mahoney, Z. Zwirko, A. Donahue, H. Beltran, J. M. Mosquera,
M. A. Rubin, S. Dogan, C. V. Hedvat, M. F. Berger, L. Pusztai, M. Lechner, C. Boshoff,
M. Jarosz, C. Vietz, A. Parker, V. A. Miller, J. S. Ross, J. Curran, M. T. Cronin, P. J.
Stephens, D. Lipson and R. Yelensky (2013). "Development and validation of a clinical
cancer genomic profiling test based on massively parallel DNA sequencing." Nat
Biotechnol 31(11): 1023-1031.
Freeman, W. M., S. J. Walker and K. E. Vrana (1999). "Quantitative RT-PCR: pitfalls
and potential." Biotechniques 26(1): 112-122, 124-115.
Fu, G. K., W. Xu, J. Wilhelmy, M. N. Mindrinos, R. W. Davis, W. Xiao and S. P. Fodor
(2014). "Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations." Proc Natl Acad Sci U S A 111(5):
1891-1896.
Fung, J. N., S. J. Holdsworth-Carson, Y. Sapkota, Z. Z. Zhao, L. Jones, J. E. Girling, P.
Paiva, M. Healey, D. R. Nyholt, P. A. Rogers and G. W. Montgomery (2015).
"Functional evaluation of genetic variants associated with endometriosis near GREB1."
Hum Reprod 30(5): 1263-1275.
Gallegos Ruiz, M. I., K. Floor, P. Roepman, J. A. Rodriguez, G. A. Meijer, W. J. Mooi,
E. Jassem, J. Niklinski, T. Muley, N. van Zandwijk, E. F. Smit, K. Beebe, L. Neckers, B.
Ylstra and G. Giaccone (2008). "Integration of gene dosage and gene expression in non-
small cell lung cancer, identification of HSP90 as potential target." PLoS One 3(3):
e0001722.
Ganti, A. K. and J. L. Mulshine (2006). "Lung cancer screening." Oncologist 11(5): 481-
487.
Gargis, A. S., L. Kalman, M. W. Berry, D. P. Bick, D. P. Dimmock, T. Hambuch, F. Lu,
E. Lyon, K. V. Voelkerding, B. A. Zehnbauer, R. Agarwala, S. F. Bennett, B. Chen, E. L.
Chin, J. G. Compton, S. Das, D. H. Farkas, M. J. Ferber, B. H. Funke, M. R. Furtado, L.
M. Ganova-Raeva, U. Geigenmuller, S. J. Gunselman, M. R. Hegde, P. L. Johnson, A.
Kasarskis, S. Kulkarni, T. Lenk, C. S. Liu, M. Manion, T. A. Manolio, E. R. Mardis, J. D.
Merker, M. S. Rajeevan, M. G. Reese, H. L. Rehm, B. B. Simen, J. M. Yeakley, J. M.
Zook and I. M. Lubin (2012). "Assuring the quality of next-generation sequencing in clinical laboratory practice." Nat Biotechnol 30(11): 1033-1036.
Gargis, A. S., L. Kalman, D. P. Bick, C. da Silva, D. P. Dimmock, B. H. Funke, S.
Gowrisankar, M. R. Hegde, S. Kulkarni, C. E. Mason, R. Nagarajan, K. V. Voelkerding,
E. A. Worthey, N. Aziz, J. Barnes, S. F. Bennett, H. Bisht, D. M. Church, Z. Dimitrova,
S. R. Gargis, N. Hafez, T. Hambuch, F. C. Hyland, R. A. Luna, D. MacCannell, T. Mann,
M. R. McCluskey, T. K. McDaniel, L. M. Ganova-Raeva, H. L. Rehm, J. Reid, D. S.
Campo, R. B. Resnick, P. G. Ridge, M. L. Salit, P. Skums, L. J. Wong, B. A. Zehnbauer,
J. M. Zook and I. M. Lubin (2015). "Good laboratory practice for clinical next-generation sequencing informatics pipelines." Nat Biotechnol 33(7): 689-693.
Ge, B., D. K. Pokholok, T. Kwan, E. Grundberg, L. Morcos, D. J. Verlaan, J. Le, V.
Koka, K. C. Lam, V. Gagne, J. Dias, R. Hoberman, A. Montpetit, M. M. Joly, E. J.
Harvey, D. Sinnett, P. Beaulieu, R. Hamon, A. Graziani, K. Dewar, E. Harmsen, J.
Majewski, H. H. Goring, A. K. Naumova, M. Blanchette, K. L. Gunderson and T.
Pastinen (2009). "Global patterns of cis variation in human cells revealed by high-density allelic expression analysis." Nat Genet 41(11): 1216-1222.
Gebhardt, F., K. S. Zanker and B. Brandt (1999). "Modulation of epidermal growth factor receptor gene transcription by a polymorphic dinucleotide repeat in intron 1." J Biol
Chem 274(19): 13176-13180.
1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M.
Durbin, R. A. Gibbs, M. E. Hurles and G. A. McVean (2010). "A map of human genome variation from population-scale sequencing." Nature 467(7319): 1061-1073.
Germer, S. and R. Higuchi (1999). "Single-tube genotyping without oligonucleotide probes." Genome Res 9(1): 72-78.
Gibson, G., J. E. Powell and U. M. Marigorta (2015). "Expression quantitative trait locus analysis for translational medicine." Genome Med 7(1): 60.
Gibson, G. and B. Weir (2005). "The quantitative genetics of transcription." Trends
Genet 21(11): 616-623.
Gililland, J. L., Y. C. Tseng, V. Troche, S. Lahiri and L. Wartofsky (1992). "Atrial natriuretic peptide receptors in human endometrial stromal cells." J Clin Endocrinol
Metab 75(2): 547-551.
Gilliland, G., S. Perrin, K. Blanchard and H. F. Bunn (1990). "Analysis of cytokine mRNA and DNA: detection and quantitation by competitive polymerase chain reaction."
Proc Natl Acad Sci U S A 87(7): 2725-2729.
Global Lipids Genetics Consortium, C. J. Willer, E. M. Schmidt, S. Sengupta, G. M. Peloso, S.
Gustafsson, S. Kanoni, A. Ganna, J. Chen, M. L. Buchkovich, S. Mora, J. S. Beckmann,
J. L. Bragg-Gresham, H. Y. Chang, A. Demirkan, H. M. Den Hertog, R. Do, L. A.
Donnelly, G. B. Ehret, T. Esko, M. F. Feitosa, T. Ferreira, K. Fischer, P. Fontanillas, R.
M. Fraser, D. F. Freitag, D. Gurdasani, K. Heikkila, E. Hypponen, A. Isaacs, A. U.
Jackson, A. Johansson, T. Johnson, M. Kaakinen, J. Kettunen, M. E. Kleber, X. Li, J.
Luan, L. P. Lyytikainen, P. K. Magnusson, M. Mangino, E. Mihailov, M. E. Montasser,
M. Muller-Nurasyid, I. M. Nolte, J. R. O'Connell, C. D. Palmer, M. Perola, A. K.
Petersen, S. Sanna, R. Saxena, S. K. Service, S. Shah, D. Shungin, C. Sidore, C. Song, R.
J. Strawbridge, I. Surakka, T. Tanaka, T. M. Teslovich, G. Thorleifsson, E. G. Van den
Herik, B. F. Voight, K. A. Volcik, L. L. Waite, A. Wong, Y. Wu, W. Zhang, D. Absher,
G. Asiki, I. Barroso, L. F. Been, J. L. Bolton, L. L. Bonnycastle, P. Brambilla, M. S.
Burnett, G. Cesana, M. Dimitriou, A. S. Doney, A. Doring, P. Elliott, S. E. Epstein, G. I.
Eyjolfsson, B. Gigante, M. O. Goodarzi, H. Grallert, M. L. Gravito, C. J. Groves, G.
Hallmans, A. L. Hartikainen, C. Hayward, D. Hernandez, A. A. Hicks, H. Holm, Y. J.
Hung, T. Illig, M. R. Jones, P. Kaleebu, J. J. Kastelein, K. T. Khaw, E. Kim, N. Klopp, P.
Komulainen, M. Kumari, C. Langenberg, T. Lehtimaki, S. Y. Lin, J. Lindstrom, R. J.
Loos, F. Mach, W. L. McArdle, C. Meisinger, B. D. Mitchell, G. Muller, R. Nagaraja, N.
Narisu, T. V. Nieminen, R. N. Nsubuga, I. Olafsson, K. K. Ong, A. Palotie, T.
Papamarkou, C. Pomilla, A. Pouta, D. J. Rader, M. P. Reilly, P. M. Ridker, F.
Rivadeneira, I. Rudan, A. Ruokonen, N. Samani, H. Scharnagl, J. Seeley, K. Silander, A.
Stancakova, K. Stirrups, A. J. Swift, L. Tiret, A. G. Uitterlinden, L. J. van Pelt, S.
Vedantam, N. Wainwright, C. Wijmenga, S. H. Wild, G. Willemsen, T. Wilsgaard, J. F.
Wilson, E. H. Young, J. H. Zhao, L. S. Adair, D. Arveiler, T. L. Assimes, S. Bandinelli,
F. Bennett, M. Bochud, B. O. Boehm, D. I. Boomsma, I. B. Borecki, S. R. Bornstein, P.
Bovet, M. Burnier, H. Campbell, A. Chakravarti, J. C. Chambers, Y. D. Chen, F. S.
Collins, R. S. Cooper, J. Danesh, G. Dedoussis, U. de Faire, A. B. Feranil, J. Ferrieres, L.
Ferrucci, N. B. Freimer, C. Gieger, L. C. Groop, V. Gudnason, U. Gyllensten, A.
Hamsten, T. B. Harris, A. Hingorani, J. N. Hirschhorn, A. Hofman, G. K. Hovingh, C. A.
Hsiung, S. E. Humphries, S. C. Hunt, K. Hveem, C. Iribarren, M. R. Jarvelin, A. Jula, M.
Kahonen, J. Kaprio, A. Kesaniemi, M. Kivimaki, J. S. Kooner, P. J. Koudstaal, R. M.
Krauss, D. Kuh, J. Kuusisto, K. O. Kyvik, M. Laakso, T. A. Lakka, L. Lind, C. M.
Lindgren, N. G. Martin, W. Marz, M. I. McCarthy, C. A. McKenzie, P. Meneton, A.
Metspalu, L. Moilanen, A. D. Morris, P. B. Munroe, I. Njolstad, N. L. Pedersen, C.
Power, P. P. Pramstaller, J. F. Price, B. M. Psaty, T. Quertermous, R. Rauramaa, D.
Saleheen, V. Salomaa, D. K. Sanghera, J. Saramies, P. E. Schwarz, W. H. Sheu, A. R.
Shuldiner, A. Siegbahn, T. D. Spector, K. Stefansson, D. P. Strachan, B. O. Tayo, E.
Tremoli, J. Tuomilehto, M. Uusitupa, C. M. van Duijn, P. Vollenweider, L. Wallentin, N.
J. Wareham, J. B. Whitfield, B. H. Wolffenbuttel, J. M. Ordovas, E. Boerwinkle, C. N.
Palmer, U. Thorsteinsdottir, D. I. Chasman, J. I. Rotter, P. W. Franks, S. Ripatti, L. A.
Cupples, M. S. Sandhu, S. S. Rich, M. Boehnke, P. Deloukas, S. Kathiresan, K. L.
Mohlke, E. Ingelsson and G. R. Abecasis (2013). "Discovery and refinement of loci associated with lipid levels." Nat Genet 45(11): 1274-1283.
Gower, A. C., K. Steiling, J. F. Brothers, 2nd, M. E. Lenburg and A. Spira (2011).
"Transcriptomic studies of the airway field of injury associated with smoking-related lung disease." Proc Am Thorac Soc 8(2): 173-179.
Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X.
Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A.
Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N.
Friedman and A. Regev (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nat Biotechnol 29(7): 644-652.
Grundberg, E., K. S. Small, A. K. Hedman, A. C. Nica, A. Buil, S. Keildson, J. T. Bell,
T. P. Yang, E. Meduri, A. Barrett, J. Nisbett, M. Sekowska, A. Wilk, S. Y. Shin, D.
Glass, M. Travers, J. L. Min, S. Ring, K. Ho, G. Thorleifsson, A. Kong, U.
Thorsteindottir, C. Ainali, A. S. Dimas, N. Hassanali, C. Ingle, D. Knowles, M.
Krestyaninova, C. E. Lowe, P. Di Meglio, S. B. Montgomery, L. Parts, S. Potter, G.
Surdulescu, L. Tsaprouni, S. Tsoka, V. Bataille, R. Durbin, F. O. Nestle, S. O'Rahilly, N.
Soranzo, C. M. Lindgren, K. T. Zondervan, K. R. Ahmadi, E. E. Schadt, K. Stefansson,
G. D. Smith, M. I. McCarthy, P. Deloukas, E. T. Dermitzakis, T. D. Spector and the
Multiple Tissue Human Expression Resource (MuTHER) Consortium (2012). "Mapping cis- and trans-regulatory effects across multiple tissues in twins." Nat Genet 44(10): 1084-1089.
Gry, M., R. Rimini, S. Stromberg, A. Asplund, F. Ponten, M. Uhlen and P. Nilsson
(2009). "Correlations between RNA and protein expression profiles in 23 human cell
lines." BMC Genomics 10: 365.
Guindalini, C. and R. Pellegrino (2016). Gene Expression Studies Using Microarrays.
Rodent Model as Tools in Ethical Biomedical Research. L. M. Andersen and S. Tufik.
Cham, Springer International Publishing: 203-216.
Gusev, A., S. H. Lee, G. Trynka, H. Finucane, B. J. Vilhjalmsson, H. Xu, C. Zang, S.
Ripke, B. Bulik-Sullivan, E. Stahl, Schizophrenia Working Group of the Psychiatric
Genomics Consortium, SWE-SCZ Consortium, A. K. Kahler, C. M. Hultman, S. M. Purcell, S. A.
McCarroll, M. Daly, B. Pasaniuc, P. F. Sullivan, B. M. Neale, N. R. Wray, S.
Raychaudhuri and A. L. Price (2014). "Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases." Am J Hum Genet 95(5): 535-552.
Haas, B. J., M. Chin, C. Nusbaum, B. W. Birren and J. Livny (2012). "How deep is deep
enough for RNA-Seq profiling of bacterial transcriptomes?" BMC Genomics 13: 734.
Halvardson, J., A. Zaghlool and L. Feuk (2013). "Exome RNA sequencing reveals rare
and novel alternative transcripts." Nucleic Acids Res 41(1): e6.
Hao, B., X. Miao, Y. Li, X. Zhang, T. Sun, G. Liang, Y. Zhao, Y. Zhou, H. Wang, X.
Chen, L. Zhang, W. Tan, Q. Wei, D. Lin and F. He (2006). "A novel T-77C
polymorphism in DNA repair gene XRCC1 contributes to diminished promoter activity
and increased risk of non-small cell lung cancer." Oncogene 25(25): 3613-3620.
Harr, M. W., T. G. Graves, E. L. Crawford, K. A. Warner, C. A. Reed and J. C. Willey
(2005). "Variation in transcriptional regulation of cyclin dependent kinase inhibitor
p21waf1/cip1 among human bronchogenic carcinomas." Mol Cancer 4: 23.
Hattotuwa, K., E. A. Gamble, T. O’Shaughnessy, P. K. Jeffery and N. C. Barnes (2002).
"Safety of bronchoscopy, biopsy, and BAL in research patients with COPD." Chest 122(6):
1909-1912.
Hayashi, G., M. Hagihara and K. Nakatani (2008). "Genotyping by allele-specific L-
DNA-tagged PCR." J Biotechnol 135(2): 157-160.
Hayashi, S., J. Watanabe and K. Kawajiri (1991). "Genetic polymorphisms in the 5'-
flanking region change transcriptional regulation of the human cytochrome P450IIE1
gene." J Biochem 110(4): 559-565.
He, C., J. Holme and J. Anthony (2014). "SNP genotyping: the KASP assay." Methods
Mol Biol 1145: 75-86.
He, J., L. X. Qiu, M. Y. Wang, R. X. Hua, R. X. Zhang, H. P. Yu, Y. N. Wang, M. H.
Sun, X. Y. Zhou, Y. J. Yang, J. C. Wang, L. Jin, Q. Y. Wei and J. Li (2012).
"Polymorphisms in the XPG gene and risk of gastric cancer in Chinese populations."
Hum Genet 131(7): 1235-1244.
Heid, C. A., J. Stevens, K. J. Livak and P. M. Williams (1996). "Real time quantitative
PCR." Genome Res 6(10): 986-994.
Henley, W. N., K. E. Schuebel and D. A. Nielsen (1996). "Limitations imposed by
heteroduplex formation on quantitative RT-PCR." Biochem Biophys Res Commun
226(1): 113-117.
Henrichsen, C. N., N. Vinckenbosch, S. Zollner, E. Chaignat, S. Pradervand, F. Schutz,
M. Ruedi, H. Kaessmann and A. Reymond (2009). "Segmental copy number variation shapes tissue transcriptomes." Nat Genet 41(4): 424-429.
Higgins, G., K. M. Roper, I. J. Watson, F. H. Blackhall, W. N. Rom, H. I. Pass, J. F.
Ainscough and D. Coverley (2012). "Variant Ciz1 is a circulating biomarker for early-
stage lung cancer." Proc Natl Acad Sci U S A 109(45): E3128-3135.
Higgs, D. R., D. Vernimmen, J. Hughes and R. Gibbons (2007). "Using genomics to
study how chromatin influences gene expression." Annu Rev Genomics Hum Genet 8:
299-325.
Holland, P. M., R. D. Abramson, R. Watson and D. H. Gelfand (1991). "Detection of
specific polymerase chain reaction product by utilizing the 5'→3' exonuclease activity of
Thermus aquaticus DNA polymerase." Proc Natl Acad Sci U S A 88(16): 7276-7280.
Hoogendoorn, B., S. L. Coleman, C. A. Guy, K. Smith, T. Bowen, P. R. Buckland and M.
C. O'Donovan (2003). "Functional analysis of human promoter polymorphisms." Hum
Mol Genet 12(18): 2249-2254.
Horn, M., R. Baumann, J. A. Pereira, P. N. Sidiropoulos, C. Somandin, H. Welzl, C.
Stendel, T. Luhmann, C. Wessig, K. V. Toyka, J. B. Relvas, J. Senderek and U. Suter
(2012). "Myelin is dependent on the Charcot-Marie-Tooth Type 4H disease culprit protein FRABIN/FGD4 in Schwann cells." Brain 135(Pt 12): 3567-3583.
Howie, B. N., P. Donnelly and J. Marchini (2009). "A flexible and accurate genotype
imputation method for the next generation of genome-wide association studies." PLoS
Genet 5(6): e1000529.
Hu, J., Y. Mao, D. Dryer, K. White and the Canadian Cancer Registries Epidemiology Research Group (2002). "Risk factors for
lung cancer among Canadian women who have never smoked." Cancer detection and
prevention 26(2): 129-138.
Huang, R., M. Jaritz, P. Guenzl, I. Vlatkovic, A. Sommer, I. M. Tamir, H. Marks, T.
Klampfl, R. Kralovics, H. G. Stunnenberg, D. P. Barlow and F. M. Pauler (2011). "An
RNA-Seq strategy to detect the complete coding and non-coding transcriptome including
full-length imprinted macro ncRNAs." PLoS One 6(11): e27288.
Huggett, J. F., T. Novak, J. A. Garson, C. Green, S. D. Morris-Jones, R. F. Miller and A.
Zumla (2008). "Differential susceptibility of PCR reactions to inhibitors: an important and unrecognised phenomenon." BMC Res Notes 1: 70.
Humphrey, L. L., M. Deffebach, M. Pappas, C. Baumann, K. Artis, J. P. Mitchell, B.
Zakher, R. Fu and C. G. Slatore (2013). "Screening for lung cancer with low-dose
computed tomography: a systematic review to update the US Preventive services task
force recommendation." Ann Intern Med 159(6): 411-420.
Hung, R. J., J. D. McKay, V. Gaborieau, P. Boffetta, M. Hashibe, D. Zaridze, A.
Mukeria, N. Szeszenia-Dabrowska, J. Lissowska, P. Rudnai, E. Fabianova, D. Mates, V.
Bencko, L. Foretova, V. Janout, C. Chen, G. Goodman, J. K. Field, T. Liloglou, G.
Xinarianos, A. Cassidy, J. McLaughlin, G. Liu, S. Narod, H. E. Krokan, F. Skorpen, M.
B. Elvestad, K. Hveem, L. Vatten, J. Linseisen, F. Clavel-Chapelon, P. Vineis, H. B.
Bueno-de-Mesquita, E. Lund, C. Martinez, S. Bingham, T. Rasmuson, P. Hainaut, E.
Riboli, W. Ahrens, S. Benhamou, P. Lagiou, D. Trichopoulos, I. Holcatova, F. Merletti,
K. Kjaerheim, A. Agudo, G. Macfarlane, R. Talamini, L. Simonato, R. Lowry, D. I.
Conway, A. Znaor, C. Healy, D. Zelenika, A. Boland, M. Delepine, M. Foglio, D.
Lechner, F. Matsuda, H. Blanche, I. Gut, S. Heath, M. Lathrop and P. Brennan (2008).
"A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit
genes on 15q25." Nature 452(7187): 633-637.
Hutchinson, J. N., T. Raj, J. Fagerness, E. Stahl, F. T. Viloria, A. Gimelbrant, J. Seddon,
M. Daly, A. Chess and R. Plenge (2014). "Allele-specific methylation occurs at genetic
variants associated with complex disease." PLoS One 9(6): e98464.
Illumina "Sequencing Systems." http://www.illumina.com/systems/sequencing.html.
Illumina. (2016). "Kits for human genotyping applications." from
http://www.illumina.com/techniques/microarrays/human-genotyping/human-genotyping-arrays.html.
Irshad, S. and M. Maryum (2012). "Genetic Susceptibility and Risk Factors Associated with Familial Lung Cancer."
Jabara, C. B., C. D. Jones, J. Roach, J. A. Anderson and R. Swanstrom (2011). "Accurate
sampling and deep sequencing of the HIV-1 protease gene using a Primer ID." Proc Natl
Acad Sci U S A 108(50): 20166-20171.
Jaenisch, R. and A. Bird (2003). "Epigenetic regulation of gene expression: how the
genome integrates intrinsic and environmental signals." Nat Genet 33 Suppl: 245-254.
Jiang, L., D. Willner, P. Danoy, H. Xu and M. A. Brown (2013). "Comparison of the
performance of two commercial genome-wide association study genotyping platforms in
Han Chinese samples." G3 (Bethesda) 3(1): 23-29.
Jin, F., D. Mu, D. Chu, E. Fu, Y. Xie and T. Liu (2008). "Severe complications of
bronchoscopy." Respiration 76(4): 429-433.
Kadara, H. and I. I. Wistuba (2012). "Field cancerization in non-small cell lung cancer:
implications in disease pathogenesis." Proc Am Thorac Soc 9(2): 38-42.
Kaisho, T., H. Tsutsui, T. Tanaka, T. Tsujimura, K. Takeda, T. Kawai, N. Yoshida, K.
Nakanishi and S. Akira (1999). "Impairment of natural killer cytotoxic activity and
interferon gamma production in CCAAT/enhancer binding protein gamma-deficient
mice." J Exp Med 190(11): 1573-1582.
Kang, S., Y. Ma, C. Liu, C. Cao, Hanbateer, J. Qi, J. Li and X. Wu (2015). "Association
of XRCC1 gene polymorphisms with risk of non-small cell lung cancer." Int J Clin Exp
Pathol 8(4): 4171-4176.
Kannan, K., N. Amariglio, G. Rechavi and D. Givol (2000). "Profile of gene expression
regulated by induced p53: connection to the TGF-beta family." FEBS Lett 470(1): 77-82.
Kaper, F., S. Swamy, B. Klotzle, S. Munchel, J. Cottrell, M. Bibikova, H. Y. Chuang, S.
Kruglyak, M. Ronaghi, M. A. Eberle and J. B. Fan (2013). "Whole-genome haplotyping by dilution, amplification, and sequencing." Proc Natl Acad Sci U S A 110(14): 5552-
5557.
Keating, B. J., S. Tischfield, S. S. Murray, T. Bhangale, T. S. Price, J. T. Glessner, L.
Galver, J. C. Barrett, S. F. Grant, D. N. Farlow, H. R. Chandrupatla, M. Hansen, S.
Ajmal, G. J. Papanicolaou, Y. Guo, M. Li, S. Derohannessian, P. I. de Bakker, S. D.
Bailey, A. Montpetit, A. C. Edmondson, K. Taylor, X. Gai, S. S. Wang, M. Fornage, T.
Shaikh, L. Groop, M. Boehnke, A. S. Hall, A. T. Hattersley, E. Frackelton, N. Patterson,
C. W. Chiang, C. E. Kim, R. R. Fabsitz, W. Ouwehand, A. L. Price, P. Munroe, M.
Caulfield, T. Drake, E. Boerwinkle, D. Reich, A. S. Whitehead, T. P. Cappola, N. J.
Samani, A. J. Lusis, E. Schadt, J. G. Wilson, W. Koenig, M. I. McCarthy, S. Kathiresan,
S. B. Gabriel, H. Hakonarson, S. S. Anand, M. Reilly, J. C. Engert, D. A. Nickerson, D.
J. Rader, J. N. Hirschhorn and G. A. Fitzgerald (2008). "Concept, design and
implementation of a cardiovascular gene-centric 50 k SNP array for large-scale genomic association studies." PLoS One 3(10): e3583.
Keedy, V. L., S. Temin, M. R. Somerfield, M. B. Beasley, D. H. Johnson, L. M.
McShane, D. T. Milton, J. R. Strawn, H. A. Wakelee and G. Giaccone (2011). "American
Society of Clinical Oncology provisional clinical opinion: epidermal growth factor
receptor (EGFR) Mutation testing for patients with advanced non-small-cell lung cancer
considering first-line EGFR tyrosine kinase inhibitor therapy." J Clin Oncol 29(15):
2121-2127.
Kim, V., M. Oros, H. Durra, S. Kelsen, M. Aksoy, W. D. Cornwell, T. J. Rogers and G. J.
Criner (2015). "Chronic Bronchitis and Current Smoking Are Associated with More
Goblet Cells in Moderate to Severe COPD and Smokers without Airflow Obstruction."
PLoS One 10(2): e0116108.
Kinde, I., J. Wu, N. Papadopoulos, K. W. Kinzler and B. Vogelstein (2011). "Detection
and quantification of rare mutations with massively parallel sequencing." Proc Natl Acad
Sci U S A 108(23): 9530-9535.
Kirsten, H., H. Al-Hasani, L. Holdt, A. Gross, F. Beutner, K. Krohn, K. Horn, P. Ahnert,
R. Burkhardt, K. Reiche, J. Hackermuller, M. Loffler, D. Teupser, J. Thiery and M.
Scholz (2015). "Dissecting the genetics of the human transcriptome identifies novel trait-
related trans-eQTLs and corroborates the regulatory relevance of non-protein coding
loci." Hum Mol Genet 24(16): 4746-4763.
Kitzman, J. O., A. P. Mackenzie, A. Adey, J. B. Hiatt, R. P. Patwardhan, P. H. Sudmant,
S. B. Ng, C. Alkan, R. Qiu, E. E. Eichler and J. Shendure (2011). "Haplotype-resolved genome sequencing of a Gujarati Indian individual." Nat Biotechnol 29(1): 59-63.
Knight, J. (2012). "Resolving the variable genome and epigenome in human disease."
Journal of internal medicine 271(4): 379-391.
Knight, J. C. (2004). "Allele-specific gene expression uncovered." Trends Genet 20(3):
113-116.
Knight, J. C., B. J. Keating, K. A. Rockett and D. P. Kwiatkowski (2003). "In vivo
characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading." Nat Genet 33(4): 469-475.
Ko, C. W., S. Riffle, L. Michaels, C. Morris, J. Holub, J. A. Shapiro, M. A. Ciol, M. B.
Kimmey, L. C. Seeff and D. Lieberman (2010). "Serious complications within 30 days of
screening and surveillance colonoscopy are uncommon." Clin Gastroenterol Hepatol
8(2): 166-173.
Korpanty, G. J., D. M. Graham, M. D. Vincent and N. B. Leighl (2014). "Biomarkers
That Currently Affect Clinical Practice in Lung Cancer: EGFR, ALK, MET, ROS-1, and
KRAS." Front Oncol 4: 204.
Kovalchik, S. A., M. Tammemagi, C. D. Berg, N. E. Caporaso, T. L. Riley, M. Korch, G.
A. Silvestri, A. K. Chaturvedi and H. A. Katki (2013). "Targeting of low-dose CT
screening according to the risk of lung-cancer death." N Engl J Med 369(3): 245-254.
Kranaster, R., P. Ketzer and A. Marx (2008). "Mutant DNA polymerase for improved
detection of single-nucleotide variations in microarrayed primer extension."
Chembiochem 9(5): 694-697.
Kuleshov, V., D. Xie, R. Chen, D. Pushkarev, Z. Ma, T. Blauwkamp, M. Kertesz and M.
Snyder (2014). "Whole-genome haplotyping using long reads and statistical methods."
Nat Biotechnol 32(3): 261-266.
Kunkel, T. A. (2004). "DNA replication fidelity." J Biol Chem 279(17): 16895-16898.
Lam, T. H., M. Z. Tay, B. Wang, Z. Xiao and E. C. Ren (2015). "Intrahaplotypic Variants
Differentiate Complex Linkage Disequilibrium within Human MHC Haplotypes." Sci
Rep 5: 16972.
Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2."
Nat Methods 9(4): 357-359.
Lappalainen, T. (2015). "Functional genomics bridges the gap between quantitative genetics and molecular biology." Genome Res 25(10): 1427-1431.
Levin, J. Z., M. F. Berger, X. Adiconis, P. Rogov, A. Melnikov, T. Fennell, C. Nusbaum,
L. A. Garraway and A. Gnirke (2009). "Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts."
Genome Biol 10(10): R115.
Lewallen, S. and P. Courtright (1998). "Epidemiology in practice: case-control studies."
Community Eye Health 11(28): 57-58.
Lewis, C. M. and J. Knight (2012). "Introduction to Genetic Association Studies." Cold
Spring Harbor Protocols 2012(3): pdb.top068163.
Liew, M., R. Pryor, R. Palais, C. Meadows, M. Erali, E. Lyon and C. Wittwer (2004).
"Genotyping of single-nucleotide polymorphisms by high-resolution melting of small amplicons." Clin Chem 50(7): 1156-1164.
Litviakov, N. V., M. B. Freidin, A. E. Sazonov, M. V. Khalyuzova, M. A. Buldakov, M.
S. Karbyshev, E. Albakh, D. S. Isubakova, A. Gagarin, G. B.
Nekrasov, E. B. Mironova, A. Izosimov, R. M. Takhauov and A. Karpov
(2015). "Different patterns of allelic imbalance in sporadic tumors and tumors
associated with long-term exposure to gamma-radiation." Mutat Res Genet Toxicol
Environ Mutagen 794: 8-16.
Liu, C., F. Zhang, T. Li, M. Lu, L. Wang, W. Yue and D. Zhang (2012). "MirSNP, a
database of polymorphisms altering miRNA target sites, identifies miRNA-related SNPs in GWAS SNPs and eQTLs." BMC Genomics 13: 661.
Livak, K. J. and T. D. Schmittgen (2001). "Analysis of relative gene expression data
using real-time quantitative PCR and the 2^(-ΔΔCT) Method." Methods 25(4):
402-408.
Lo Tam Loi, A. T., S. J. M. Hoonhorst, L. Franciosi, R. Bischoff, R. F. Hoffmann, I.
Heijink, A. J. M. van Oosterhout, H. M. Boezen, W. Timens, D. S. Postma, J.-W.
Lammers, L. Koenderman and N. H. T. ten Hacken (2013). "Acute and chronic
inflammatory responses induced by smoking in individuals susceptible and non-
susceptible to development of COPD: from specific disease phenotyping towards novel
therapy. Protocol of a cross-sectional study." BMJ Open 3(2).
Lock, M. and G. Rodrigues (2007). "Computed tomographic screening for lung cancer."
Can Fam Physician 53(8): 1334-1336.
Lohmueller, K. E., C. L. Pearce, M. Pike, E. S. Lander and J. N. Hirschhorn (2003).
"Meta-analysis of genetic association studies supports a contribution of common variants
to susceptibility to common disease." Nat Genet 33(2): 177-182.
Looney, S. W. and J. Hagan (2015). Analysis of Biomarker Data: A Practical Guide.
Maier, T., M. Guell and L. Serrano (2009). "Correlation of mRNA and protein in
complex biological samples." FEBS Lett 583(24): 3966-3973.
Mamanova, L., R. M. Andrews, K. D. James, E. M. Sheridan, P. D. Ellis, C. F. Langford,
T. W. Ost, J. E. Collins and D. J. Turner (2010). "FRT-seq: amplification-free, strand-specific transcriptome sequencing." Nat Methods 7(2): 130-132.
Mannino, D. M., S. M. Aguayo, T. L. Petty and S. C. Redd (2003). "Low lung function
and incident lung cancer in the United States: data From the First National Health and
Nutrition Examination Survey follow-up." Arch Intern Med 163(12): 1475-1480.
Marchini, J., D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S.
Qin, H. M. Munro, G. R. Abecasis, P. Donnelly and the International HapMap Consortium (2006). "A
comparison of phasing algorithms for trios and unrelated individuals." Am J Hum Genet
78(3): 437-450.
Marinescu, V. D., I. S. Kohane and A. Riva (2005). "The MAPPER database: a multi-
genome catalog of putative transcription factor binding sites." Nucleic Acids Res
33(Database issue): D91-97.
Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens and Y. Gilad (2008). "RNA-seq:
an assessment of technical reproducibility and comparison with gene expression arrays."
Genome Res 18(9): 1509-1517.
Marshall, H. M., R. V. Bowman, I. A. Yang, K. M. Fong and C. D. Berg (2013).
"Screening for lung cancer with low-dose computed tomography: a review of current status." J Thorac Dis 5 Suppl 5: S524-539.
Matakidou, A., R. el Galta, E. L. Webb, M. F. Rudd, H. Bridle, T. Eisen and R. S.
Houlston (2007). "Genetic variation in the DNA repair genes is predictive of outcome in
lung cancer." Hum Mol Genet 16(19): 2333-2340.
Matera, I., M. Musso, P. Griseri, M. Rusmini, M. Di Duca, M. T. So, D. Mavilio, X.
Miao, P. H. Tam, R. Ravazzolo, I. Ceccherini and M. Garcia-Barcelo (2013). "Allele-specific expression at the RET locus in blood and gut tissue of individuals carrying risk alleles for Hirschsprung disease." Hum Mutat 34(5): 754-762.
Mathiaux, J., V. Le Morvan, M. Pulido, J. Jougon, H. Begueret and J. Robert (2011).
"Role of DNA repair gene polymorphisms in the efficiency of platinum-based adjuvant chemotherapy for non-small cell lung cancer." Mol Diagn Ther 15(3): 159-166.
Mattson, M. E., E. S. Pollack and J. W. Cullen (1987). "What are the odds that smoking will kill you?" Am J Public Health 77(4): 425-431.
Maurano, M. T., R. Humbert, E. Rynes, R. E. Thurman, E. Haugen, H. Wang, A. P.
Reynolds, R. Sandstrom, H. Qu, J. Brody, A. Shafer, F. Neri, K. Lee, T. Kutyavin, S.
Stehling-Sun, A. K. Johnson, T. K. Canfield, E. Giste, M. Diegel, D. Bates, R. S. Hansen,
S. Neph, P. J. Sabo, S. Heimfeld, A. Raubitschek, S. Ziegler, C. Cotsapas, N.
Sotoodehnia, I. Glass, S. R. Sunyaev, R. Kaul and J. A. Stamatoyannopoulos (2012).
"Systematic localization of common disease-associated variation in regulatory DNA."
Science 337(6099): 1190-1195.
Mayne, S. T., J. Buenconsejo and D. T. Janerich (1999). "Previous lung disease and risk of lung cancer among men and women nonsmokers." Am J Epidemiol 149(1): 13-20.
Mazutis, L., J. Gilbert, W. L. Ung, D. A. Weitz, A. D. Griffiths and J. A. Heyman (2013).
"Single-cell analysis and sorting using droplet-based microfluidics." Nature protocols
8(5): 870-891.
McCarthy, M. I., G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. Ioannidis and J. N. Hirschhorn (2008). "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nat Rev Genet 9(5): 356-369.
Mehan, M. R., S. A. Williams, J. M. Siegfried, W. L. Bigbee, J. L. Weissfeld, D. O.
Wilson, H. I. Pass, W. N. Rom, T. Muley, M. Meister, W. Franklin, Y. E. Miller, E. N.
Brody and R. M. Ostroff (2014). "Validation of a blood protein signature for non-small cell lung cancer." Clin Proteomics 11(1): 32.
Mehrabian, M., H. Allayee, J. Stockton, P. Y. Lum, T. A. Drake, L. W. Castellani, M.
Suh, C. Armour, S. Edwards, J. Lamb, A. J. Lusis and E. E. Schadt (2005). "Integrating genotypic and expression data in a segregating mouse population to identify 5-lipoxygenase as a susceptibility gene for obesity and bone traits." Nat Genet 37(11):
1224-1233.
Mei, R., P. C. Galipeau, C. Prass, A. Berno, G. Ghandour, N. Patil, R. K. Wolff, M. S.
Chee, B. J. Reid and D. J. Lockhart (2000). "Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays." Genome Res 10(8): 1126-1137.
Melnikov, A., A. Murugan, X. Zhang, T. Tesileanu, L. Wang, P. Rogov, S. Feizi, A.
Gnirke, C. G. Callan, Jr., J. B. Kinney, M. Kellis, E. S. Lander and T. S. Mikkelsen
(2012). "Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay." Nat Biotechnol 30(3): 271-277.
Mercer, T. R., D. J. Gerhardt, M. E. Dinger, J. Crawford, C. Trapnell, J. A. Jeddeloh, J. S.
Mattick and J. L. Rinn (2012). "Targeted RNA sequencing reveals the deep complexity of the human transcriptome." Nat Biotechnol 30(1): 99-104.
Miele, A. and J. Dekker (2009). "Mapping cis- and trans- chromatin interaction networks using chromosome conformation capture (3C)." Methods Mol Biol 464: 105-121.
Milbury, C. A., J. Li and G. M. Makrigiorgos (2009). "PCR-based methods for the
enrichment of minority alleles and mutations." Clinical chemistry 55(4): 632-640.
Mok, T. S. (2011). "Personalized medicine in lung cancer: what we need to know." Nat
Rev Clin Oncol 8(11): 661-668.
Monks, S. A., A. Leonardson, H. Zhu, P. Cundiff, P. Pietrusiak, S. Edwards, J. W.
Phillips, A. Sachs and E. E. Schadt (2004). "Genetic inheritance of gene expression in
human cell lines." Am J Hum Genet 75(6): 1094-1105.
Morley, M., C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens, R. S. Spielman and
V. G. Cheung (2004). "Genetic analysis of genome-wide variation in human gene expression." Nature 430(7001): 743-747.
Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer and B. Wold (2008). "Mapping
and quantifying mammalian transcriptomes by RNA-Seq." Nat Methods 5(7): 621-628.
Moyer, V. A. and U.S. Preventive Services Task Force (2014). "Screening for lung cancer: U.S.
Preventive Services Task Force recommendation statement." Ann Intern Med 160(5):
330-338.
Muir, W., S. Perumbakkam, A. Black-Pyrkosz, J. Dunn and H. Cheng (2014). Allele-
Specific Expression, a New Genomics Tool for Development of Value-Added SNP chips
and to Fine Map QTL. In. 10th World Congress of Genetics Applied to Livestock
Production. Vancouver, Canada.
Mullins, D. N., E. L. Crawford, S. A. Khuder, D. A. Hernandez, Y. Yoon and J. C.
Willey (2005). "CEBPG transcription factor correlates with antioxidant and DNA repair
genes in normal bronchial epithelial cells but not in individuals with bronchogenic
carcinoma." BMC Cancer 5: 141.
Musunuru, K., A. Strong, M. Frank-Kamenetsky, N. E. Lee, T. Ahfeldt, K. V. Sachs, X.
Li, H. Li, N. Kuperwasser, V. M. Ruda, J. P. Pirruccello, B. Muchmore, L. Prokunina-
Olsson, J. L. Hall, E. E. Schadt, C. R. Morales, S. Lund-Katz, M. C. Phillips, J. Wong,
W. Cantley, T. Racie, K. G. Ejebe, M. Orho-Melander, O. Melander, V. Koteliansky, K.
Fitzgerald, R. M. Krauss, C. A. Cowan, S. Kathiresan and D. J. Rader (2010). "From
noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus." Nature
466(7307): 714-719.
Myakishev, M. V., Y. Khripin, S. Hu and D. H. Hamer (2001). "High-throughput SNP
genotyping by allele-specific PCR with universal energy-transfer-labeled primers."
Genome Res 11(1): 163-169.
Nagalakshmi, U., K. Waern and M. Snyder (2010). "RNA-Seq: A Method for
Comprehensive Transcriptome Analysis." Current Protocols in Molecular Biology.
Nagalakshmi, U., Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein and M. Snyder
(2008). "The transcriptional landscape of the yeast genome defined by RNA sequencing."
Science 320(5881): 1344-1349.
National Lung Screening Trial Research Team, D. R. Aberle, A. M. Adams, C. D. Berg, W.
C. Black, J. D. Clapp, R. M. Fagerstrom, I. F. Gareen, C. Gatsonis, P. M. Marcus and J.
D. Sicks (2011). "Reduced lung-cancer mortality with low-dose computed tomographic screening." N Engl J Med 365(5): 395-409.
National Lung Screening Trial Research Team, T. R. Church, W. C. Black, D. R. Aberle, C.
D. Berg, K. L. Clingan, F. Duan, R. M. Fagerstrom, I. F. Gareen, D. S. Gierada, G. C.
Jones, I. Mahon, P. M. Marcus, J. D. Sicks, A. Jain and S. Baum (2013). "Results of
initial low-dose computed tomographic screening for lung cancer." N Engl J Med
368(21): 1980-1991.
Nevins, J. R. (1992). "E2F: a link between the Rb tumor suppressor protein and viral
oncoproteins." Science 258(5081): 424-429.
Nguyen, J. D., M. Lamontagne, C. Couture, M. Conti, P. D. Pare, D. D. Sin, J. C. Hogg,
D. Nickle, D. S. Postma, W. Timens, M. Laviolette and Y. Bosse (2014). "Susceptibility
loci for lung cancer are associated with mRNA levels of nearby genes in the lung."
Carcinogenesis 35(12): 2653-2659.
Nicoloso, M. S., H. Sun, R. Spizzo, H. Kim, P. Wickramasinghe, M. Shimizu, S. E.
Wojcik, J. Ferdin, T. Kunej, L. Xiao, S. Manoukian, G. Secreto, F. Ravagnani, X. Wang,
P. Radice, C. M. Croce, R. V. Davuluri and G. A. Calin (2010). "Single-nucleotide polymorphisms inside microRNA target sites influence tumor susceptibility." Cancer Res
70(7): 2789-2798.
Nikiforov, T. T., R. B. Rendle, P. Goelet, Y. H. Rogers, M. L. Kotewicz, S. Anderson, G.
L. Trainor and M. R. Knapp (1994). "Genetic Bit Analysis: a solid phase method for
typing single nucleotide polymorphisms." Nucleic Acids Res 22(20): 4167-4175.
Niu, R., Y. Wang, M. Zhu, Y. Wen, J. Sun, W. Shen, Y. Cheng, J. Zhang, G. Jin, H. Ma,
Z. Hu, H. Shen and J. Dai (2015). "Potentially Functional Polymorphisms in POU5F1
Gene Are Associated with the Risk of Lung Cancer in Han Chinese." Biomed Res Int
2015: 851320.
O'Donovan, A., D. Scherly, S. G. Clarkson and R. D. Wood (1994). "Isolation of active
recombinant XPG protein, a human DNA repair endonuclease." J Biol Chem 269(23):
15965-15968.
Obsteter, J., P. Dovc and T. Kunej (2015). "Genetic variability of microRNA regulome in
human." Mol Genet Genomic Med 3(1): 30-39.
Ozsolak, F. and P. M. Milos (2011). "RNA sequencing: advances, challenges and
opportunities." Nat Rev Genet 12(2): 87-98.
Pachmann, K., J. H. Clement, C.-P. Schneider, B. Willen, O. Camara, U. Pachmann and
K. Höffken (2005). "Standardized quantification of circulating peripheral tumor cells
from lung and breast cancer." Clinical Chemical Laboratory Medicine 43(6): 617-627.
Pai, A. A., J. K. Pritchard and Y. Gilad (2015). "The genetic and mechanistic basis for
variation in gene regulation." PLoS Genet 11(1): e1004857.
Pardini, B., A. Naccarati, P. Vodicka and R. Kumar (2012). "Gene expression variations:
potentialities of master regulator polymorphisms in colorectal cancer risk." Mutagenesis
27(2): 161-167.
Pastinen, T., B. Ge and T. J. Hudson (2006). "Influence of human genome polymorphism
on gene expression." Hum Mol Genet 15 Spec No 1: R9-16.
Pastinen, T. and T. J. Hudson (2004). "Cis-acting regulatory variation in the human genome." Science 306(5696): 647-650.
Pastinen, T., J. Partanen and A. C. Syvanen (1996). "Multiplex, fluorescent, solid-phase
minisequencing for efficient screening of DNA sequence variation." Clin Chem 42(9):
1391-1397.
Pastinen, T., M. Raitio, K. Lindroos, P. Tainola, L. Peltonen and A. C. Syvanen (2000).
"A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays." Genome Res 10(7): 1031-1042.
Pastinen, T., R. Sladek, S. Gurd, A. Sammak, B. Ge, P. Lepage, K. Lavergne, A.
Villeneuve, T. Gaudin, H. Brandstrom, A. Beck, A. Verner, J. Kingsley, E. Harmsen, D.
Labuda, K. Morgan, M. C. Vohl, A. K. Naumova, D. Sinnett and T. J. Hudson (2004). "A
survey of genetic and epigenetic variation affecting human gene expression." Physiol
Genomics 16(2): 184-193.
Paul, P. and J. Apgar (2005). "Single-molecule dilution and multiple displacement
amplification for molecular haplotyping." Biotechniques 38(4): 553-554, 556, 558-559.
Pecot, C. V., M. Li, X. J. Zhang, R. Rajanbabu, C. Calitri, A. Bungum, J. R. Jett, J. B.
Putnam, C. Callaway-Lane, S. Deppen, E. L. Grogan, D. P. Carbone, J. A. Worrell, K. G.
Moons, Y. Shyr and P. P. Massion (2012). "Added value of a serum proteomic signature
in the diagnostic evaluation of lung nodules." Cancer Epidemiol Biomarkers Prev 21(5):
786-792.
Peirson, S. N., J. N. Butler and R. G. Foster (2003). "Experimental validation of novel
and conventional approaches to quantitative real-time PCR data analysis." Nucleic Acids
Res 31(14): e73.
Pekin, D., Y. Skhiri, J.-C. Baret, D. Le Corre, L. Mazutis, C. B. Salem, F. Millot, A. El
Harrak, J. B. Hutchison and J. W. Larson (2011). "Quantitative and sensitive detection of
rare mutations using droplet-based microfluidics." Lab on a Chip 11(13): 2156-2166.
Penny, G. D., G. F. Kay, S. A. Sheardown, S. Rastan and N. Brockdorff (1996).
"Requirement for Xist in X chromosome inactivation." Nature 379(6561): 131-137.
Pers, T. H., J. M. Karjalainen, Y. Chan, H. J. Westra, A. R. Wood, J. Yang, J. C. Lui, S.
Vedantam, S. Gustafsson, T. Esko, T. Frayling, E. K. Speliotes, Genetic Investigation of
ANthropometric Traits Consortium, M. Boehnke, S. Raychaudhuri, R. S. Fehrmann, J. N. Hirschhorn and L.
Franke (2015). "Biological interpretation of genome-wide association studies using
predicted gene functions." Nat Commun 6: 5890.
Peters, B. A., B. G. Kermani, A. B. Sparks, O. Alferov, P. Hong, A. Alexeev, Y. Jiang, F.
Dahl, Y. T. Tang, J. Haas, K. Robasky, A. W. Zaranek, J. H. Lee, M. P. Ball, J. E.
Peterson, H. Perazich, G. Yeung, J. Liu, L. Chen, M. I. Kennemer, K. Pothuraju, K.
Konvicka, M. Tsoupko-Sitnikov, K. P. Pant, J. C. Ebert, G. B. Nilsen, J. Baccash, A. L.
Halpern, G. M. Church and R. Drmanac (2012). "Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells." Nature 487(7406): 190-195.
Pfaffl, M. W. (2001). "A new mathematical model for relative quantification in real-time
RT-PCR." Nucleic Acids Res 29(9): e45.
Pfaffl, M. W. (2004). Quantification strategies in real-time PCR. A-Z of quantitative
PCR. S. A. Bustin: 87-112.
Pfaffl, M. W. (2013). "Transcriptional biomarkers." Methods 59(1): 1-2.
Pickrell, J. K. (2014). "Joint analysis of functional genomic data and genome-wide association studies of 18 human traits." Am J Hum Genet 94(4): 559-573.
Pleasance, E. D., P. J. Stephens, S. O'Meara, D. J. McBride, A. Meynert, D. Jones, M. L.
Lin, D. Beare, K. W. Lau, C. Greenman, I. Varela, S. Nik-Zainal, H. R. Davies, G. R.
Ordonez, L. J. Mudie, C. Latimer, S. Edkins, L. Stebbings, L. Chen, M. Jia, C. Leroy, J.
Marshall, A. Menzies, A. Butler, J. W. Teague, J. Mangion, Y. A. Sun, S. F. McLaughlin,
H. E. Peckham, E. F. Tsung, G. L. Costa, C. C. Lee, J. D. Minna, A. Gazdar, E. Birney,
M. D. Rhodes, K. J. McKernan, M. R. Stratton, P. A. Futreal and P. J. Campbell (2010).
"A small-cell lung cancer genome with complex signatures of tobacco exposure." Nature
463(7278): 184-190.
Pompanon, F., A. Bonin, E. Bellemain and P. Taberlet (2005). "Genotyping errors: causes, consequences and solutions." Nat Rev Genet 6(11): 847-859.
Purdue, M. P., L. Gold, B. Jarvholm, M. C. Alavanja, M. H. Ward and R. Vermeulen
(2007). "Impaired lung function and lung cancer incidence in a cohort of Swedish construction workers." Thorax 62(1): 51-56.
Raeymaekers, L. (2000). "Basic principles of quantitative PCR." Mol Biotechnol 15(2):
115-122.
Raji, O. Y., S. W. Duffy, O. F. Agbaje, S. G. Baker, D. C. Christiani, A. Cassidy and J.
K. Field (2012). "Predictive accuracy of the Liverpool Lung Project risk model for stratifying patients for computed tomography screening for lung cancer: a case-control and cohort validation study." Ann Intern Med 157(4): 242-250.
Raphael, K. (1987). "Recall bias: a proposal for assessment and control." Int J Epidemiol
16(2): 167-170.
Raymond, C. K., S. Subramanian, M. Paddock, R. Qiu, C. Deodato, A. Palmieri, J.
Chang, T. Radke, E. Haugen, A. Kas, D. Waring, D. Bovee, R. Stacy, R. Kaul and M. V.
Olson (2005). "Targeted, haplotype-resolved resequencing of long segments of the human genome." Genomics 86(6): 759-766.
Reischl, U. and B. Kochanowski (1995). "Quantitative PCR. A survey of the present
technology." Mol Biotechnol 3(1): 55-71.
Rockman, M. V. and L. Kruglyak (2006). "Genetics of global gene expression." Nat Rev
Genet 7(11): 862-872.
Rockman, M. V. and G. A. Wray (2002). "Abundant raw material for cis-regulatory
evolution in humans." Mol Biol Evol 19(11): 1991-2004.
Romagnoli, M., I. Vachier, A. M. Vignola, P. Godard, J. Bousquet and P. Chanez (1999).
"Safety and cellular assessment of bronchial brushing in airway diseases." Respir Med
93(7): 461-466.
Rosenbloom, K. R., T. R. Dreszer, M. Pheasant, G. P. Barber, L. R. Meyer, A. Pohl, B. J.
Raney, T. Wang, A. S. Hinrichs, A. S. Zweig, P. A. Fujita, K. Learned, B. Rhead, K. E.
Smith, R. M. Kuhn, D. Karolchik, D. Haussler and W. J. Kent (2010). "ENCODE whole-genome data in the UCSC Genome Browser." Nucleic Acids Research 38: D620-D625.
Ruijter, J. M., M. W. Pfaffl, S. Zhao, A. N. Spiess, G. Boggy, J. Blom, R. G. Rutledge, D.
Sisti, A. Lievens and K. De Preter (2013). "Evaluation of qPCR curve analysis methods
for reliable biomarker discovery: bias, resolution, precision, and implications." Methods
59(1): 32-46.
Rutter, J. L., T. I. Mitchell, G. Buttice, J. Meyers, J. F. Gusella, L. J. Ozelius and C. E.
Brinckerhoff (1998). "A single nucleotide polymorphism in the matrix metalloproteinase-1 promoter creates an Ets binding site and augments transcription." Cancer Res 58(23):
5321-5325.
Ryan, B. M., A. I. Robles and C. C. Harris (2010). "Genetic variation in microRNA networks: the implications for cancer research." Nat Rev Cancer 10(6): 389-402.
Samson, D. J., J. Seidenfeld, K. Ziegler and N. Aronson (2004). "Chemotherapy sensitivity and resistance assays: a systematic review." J Clin Oncol 22(17): 3618-3630.
Schabath, M. B., X. Wu, Q. Wei, G. Li, J. Gu and M. R. Spitz (2006). "Combined effects of the p53 and p73 polymorphisms on lung cancer risk." Cancer Epidemiol Biomarkers
Prev 15(1): 158-161.
Schadt, E. E., S. A. Monks, T. A. Drake, A. J. Lusis, N. Che, V. Colinayo, T. G. Ruff, S.
B. Milligan, J. R. Lamb, G. Cavet, P. S. Linsley, M. Mao, R. B. Stoughton and S. H.
Friend (2003). "Genetics of gene expression surveyed in maize, mouse and man." Nature
422(6929): 297-302.
Schaid, D. J. (2006). "Power and sample size for testing associations of haplotypes with complex traits." Ann Hum Genet 70(Pt 1): 116-130.
Scheuermann, R. H. and S. R. Bauer (1993). "Polymerase chain reaction-based mRNA quantification using an internal standard: analysis of oncogene expression." Methods
Enzymol 218: 446-473.
Schmitt, M. W., S. R. Kennedy, J. J. Salk, E. J. Fox, J. B. Hiatt and L. A. Loeb (2012).
"Detection of ultra-rare mutations by next-generation sequencing." Proc Natl Acad Sci U
S A 109(36): 14508-14513.
Schork, A. J., W. K. Thompson, P. Pham, A. Torkamani, J. C. Roddey, P. F. Sullivan, J.
R. Kelsoe, M. C. O'Donovan, H. Furberg, Tobacco and Genetics Consortium, Bipolar Disorder
Psychiatric Genomics Consortium, Schizophrenia Psychiatric Genomics Consortium, N. J. Schork, O. A.
Andreassen and A. M. Dale (2013). "All SNPs are not created equal: genome-wide
association studies reveal a consistent pattern of enrichment among functionally
annotated SNPs." PLoS Genet 9(4): e1003449.
Schwartz, A. G. (2012). "Genetic epidemiology of cigarette smoke-induced lung
disease." Proc Am Thorac Soc 9(2): 22-26.
Schwartz, A. G. (2016). "Genetic Predisposition to Lung Cancer." CHEST Journal
125(5_suppl): 86S-89S.
Schwartz, A. G., G. M. Prysak, C. H. Bock and M. L. Cote (2006). "The molecular
epidemiology of lung cancer." Carcinogenesis 28(3): 507-518.
Sedgwick, P. (2015). "Bias in observational study designs: case-control studies." BMJ
350: h560.
Shen, M., S. I. Berndt, N. Rothman, D. M. Demarini, J. L. Mumford, X. He, M. R.
Bonner, L. Tian, M. Yeager, R. Welch, S. Chanock, T. Zheng, N. Caporaso and Q. Lan
(2005). "Polymorphisms in the DNA nucleotide excision repair genes and lung cancer
risk in Xuan Wei, China." Int J Cancer 116(5): 768-773.
Shen, W., Y. Tian, T. Ran and Z. Gao (2015). "Genotyping and quantification techniques
for single-nucleotide polymorphisms." TrAC Trends in Analytical Chemistry 69: 1-13.
Shi, M. M. (2001). "Enabling large-scale pharmacogenetic studies by high-throughput mutation detection and genotyping technologies." Clin Chem 47(2): 164-172.
Silvestri, G. A., A. Vachani, D. Whitney, M. Elashoff, K. Porta Smith, J. S. Ferguson, E.
Parsons, N. Mitra, J. Brody, M. E. Lenburg and A. Spira (2015). "A Bronchial Genomic
Classifier for the Diagnostic Evaluation of Lung Cancer." New England Journal of
Medicine 373(3): 243-251.
Simmonds, M. J. (2013). "GWAS in autoimmune thyroid disease: redefining our
understanding of pathogenesis." Nat Rev Endocrinol 9(5): 277-287.
Singer-Sam, J., J. M. LeBon, A. Dai and A. D. Riggs (1992). "A sensitive, quantitative
assay for measurement of allele-specific transcripts differing by a single nucleotide."
PCR Methods Appl 1(3): 160-163.
Skillrud, D. M., K. P. Offord and R. D. Miller (1986). "Higher risk of lung cancer in
chronic obstructive pulmonary disease. A prospective, matched, controlled study." Ann
Intern Med 105(4): 503-507.
Smith, R. A., D. Manassaram-Baptiste, D. Brooks, M. Doroshenk, S. Fedewa, D. Saslow,
O. W. Brawley and R. Wender (2015). "Cancer screening in the United States, 2015: a
review of current American cancer society guidelines and current issues in cancer
screening." CA Cancer J Clin 65(1): 30-54.
Society, A. C. (2015). "Cancer Facts & Figures 2015." Atlanta: American Cancer
Society.
Somers, J., L. A. Wilson, J. P. Kilday, E. Horvilleur, I. G. Cannell, T. A. Poyry, L. C.
Cobbold, A. Kondrashov, J. R. Knight, S. Puget, J. Grill, R. G. Grundy, M. Bushell and
A. E. Willis (2015). "A common polymorphism in the 5' UTR of ERCC5 creates an
upstream ORF that confers resistance to platinum-based chemotherapy." Genes Dev
29(18): 1891-1896.
Song, J. W. and K. C. Chung (2010). "Observational studies: cohort and case-control studies." Plast Reconstr Surg 126(6): 2234-2242.
Song, M. Y., H. E. Kim, S. Kim, I. H. Choi and J. K. Lee (2012). "SNP-based large-scale identification of allele-specific gene expression in human B cells." Gene 493(2): 211-218.
Spencer, D. H., M. Tyagi, F. Vallania, A. J. Bredemeyer, J. D. Pfeifer, R. D. Mitra and E.
J. Duncavage (2014). "Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data." J Mol
Diagn 16(1): 75-88.
Spira, A., J. Beane, V. Shah, G. Liu, F. Schembri, X. Yang, J. Palma and J. S. Brody
(2004). "Effects of cigarette smoke on the human airway epithelial cell transcriptome."
Proc Natl Acad Sci U S A 101(27): 10143-10148.
Spira, A., J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y.-M.
Dumas, P. Calner, P. Sebastiani, S. Sridhar, J. Beamis, C. Lamb, T. Anderson, N. Gerry,
J. Keane, M. E. Lenburg and J. S. Brody (2007). "Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer." Nat Med 13(3): 361-366.
Spitz, M. R., W. K. Hong, C. I. Amos, X. F. Wu, M. B. Schabath, Q. Dong, S. Shete and
C. J. Etzel (2007). "A risk model for prediction of lung cancer." Journal of the National
Cancer Institute 99(9): 715-726.
Spitz, M. R., Q. Wei, Q. Dong, C. I. Amos and X. Wu (2003). "Genetic susceptibility to
lung cancer: the role of DNA damage and repair." Cancer Epidemiol Biomarkers Prev
12(8): 689-698.
Steemers, F. J., W. Chang, G. Lee, D. L. Barker, R. Shen and K. L. Gunderson (2006).
"Whole-genome genotyping with the single-base extension assay." Nat Methods 3(1): 31-33.
Stephens, M. and P. Scheet (2005). "Accounting for decay of linkage disequilibrium in
haplotype inference and missing-data imputation." Am J Hum Genet 76(3): 449-462.
Stephens, M., N. J. Smith and P. Donnelly (2001). "A new statistical method for
haplotype reconstruction from population data." Am J Hum Genet 68(4): 978-989.
Stranger, B. E., M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch, R. Lyle, S.
Hunt, B. Kahl, S. E. Antonarakis, S. Tavare, P. Deloukas and E. T. Dermitzakis (2005).
"Genome-wide associations of gene expression variation in humans." PLoS Genet 1(6):
e78.
Straus, S. E., F. A. McAlister, D. L. Sackett and J. J. Deeks (2002). "Accuracy of history,
wheezing, and forced expiratory time in the diagnosis of chronic obstructive pulmonary
disease." J Gen Intern Med 17(9): 684-688.
Sun, X., F. Li, N. Sun, Q. Shukui, C. Baoan, F. Jifeng, C. Lu, L. Zuhong, C. Hongyan, C.
YuanDong, J. Jiazhong and Z. Yingfeng (2009). "Polymorphisms in XRCC1 and XPG and response to platinum-based chemotherapy in advanced non-small cell lung cancer
patients." Lung Cancer 65(2): 230-236.
Sundar, I. K., N. Mullapudi, H. Yao, S. D. Spivack and I. Rahman (2011). "Lung cancer
and its association with chronic obstructive pulmonary disease: update on nexus of
epigenetics." Curr Opin Pulm Med 17(4): 279-285.
Tammemagi, C. M., P. F. Pinsky, N. E. Caporaso, P. A. Kvale, W. G. Hocking, T. R.
Church, T. L. Riley, J. Commins, M. M. Oken, C. D. Berg and P. C. Prorok (2011).
"Lung Cancer Risk Prediction: Prostate, Lung, Colorectal and Ovarian Cancer Screening
Trial Models and Validation." Journal of the National Cancer Institute 103(13):
1058-1068.
Tang, F., C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B.
B. Tuch, A. Siddiqui, K. Lao and M. A. Surani (2009). "mRNA-Seq whole-transcriptome analysis of a single cell." Nat Methods 6(5): 377-382.
Tarazona, S., F. Garcia-Alcalde, J. Dopazo, A. Ferrer and A. Conesa (2011). "Differential
expression in RNA-seq: a matter of depth." Genome Res 21(12): 2213-2223.
Thorgeirsson, T. E., F. Geller, P. Sulem, T. Rafnar, A. Wiste, K. P. Magnusson, A.
Manolescu, G. Thorleifsson, H. Stefansson, A. Ingason, S. N. Stacey, J. T. Bergthorsson,
S. Thorlacius, J. Gudmundsson, T. Jonsson, M. Jakobsdottir, J. Saemundsdottir, O.
Olafsdottir, L. J. Gudmundsson, G. Bjornsdottir, K. Kristjansson, H. Skuladottir, H. J.
Isaksson, T. Gudbjartsson, G. T. Jones, T. Mueller, A. Gottsater, A. Flex, K. K. Aben, F.
de Vegt, P. F. Mulders, D. Isla, M. J. Vidal, L. Asin, B. Saez, L. Murillo, T. Blondal, H.
Kolbeinsson, J. G. Stefansson, I. Hansdottir, V. Runarsdottir, R. Pola, B. Lindblad, A. M.
van Rij, B. Dieplinger, M. Haltmayer, J. I. Mayordomo, L. A. Kiemeney, S. E.
Matthiasson, H. Oskarsson, T. Tyrfingsson, D. F. Gudbjartsson, J. R. Gulcher, S.
Jonsson, U. Thorsteinsdottir, A. Kong and K. Stefansson (2008). "A variant associated
with nicotine dependence, lung cancer and peripheral arterial disease." Nature 452(7187):
638-642.
Tichopad, A., R. Kitchen, I. Riedmaier, C. Becker, A. Stahlberg and M. Kubista (2009).
"Design and optimization of reverse-transcription quantitative PCR experiments." Clin
Chem 55(10): 1816-1823.
Tockman, M. S., N. R. Anthonisen, E. C. Wright and M. G. Donithan (1987). "Airways obstruction and the risk for lung cancer." Ann Intern Med 106(4): 512-518.
Trapnell, C., L. Pachter and S. L. Salzberg (2009). "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25(9): 1105-1111.
Tseden-Ish, M., Y. D. Choi, H. J. Cho, H. J. Ban, I. J. Oh, K. S. Kim, S. Y. Song, K. J.
Na, S. J. Ahn, S. Choi and Y. C. Kim (2012). "Disease-free survival of patients after surgical resection of non-small cell lung carcinoma and correlation with excision repair cross-complementation group 1 expression and genotype." Respirology 17(1): 127-133.
Tsukada, J., Y. Yoshida, Y. Kominato and P. E. Auron (2011). "The CCAAT/enhancer
(C/EBP) family of basic-leucine zipper (bZIP) transcription factors is a multifaceted highly-regulated system for gene regulation." Cytokine 54(1): 6-19.
Turner, E. H., S. B. Ng, D. A. Nickerson and J. Shendure (2009). "Methods for genomic partitioning." Annu Rev Genomics Hum Genet 10: 263-284.
UC Davis. "Mendel's Laws." BioWiki: The Dynamic Biology Hypertext, 2016, from http://biowiki.ucdavis.edu/Genetics/Unit_I%3A_Genes,_Nucleic_Acids,_Genomes_and_Chromosomes/1%3A_Fundamental_Properties_of_Genes/Mendel's_Laws.
Unger, M. (2006). "A pause, progress, and reassessment in lung cancer screening." N
Engl J Med 355(17): 1822-1824.
Vachani, A., H. I. Pass, W. N. Rom, D. E. Midthun, E. S. Edell, M. Laviolette, X. J. Li,
P. Y. Fong, S. W. Hunsucker, C. Hayward, P. J. Mazzone, D. K. Madtes, Y. E. Miller, M.
G. Walker, J. Shi, P. Kearney, K. C. Fang and P. P. Massion (2015). "Validation of a
multiprotein plasma classifier to identify benign lung nodules." J Thorac Oncol 10(4):
629-637.
van Klaveren, R. J., J. D. F. Habbema, J. H. Pedersen, H. J. de Koning, M. Oudkerk and
H. C. Hoogsteden (2001). "Lung cancer screening by low-dose spiral computed
tomography." Eur Respir J 18(5): 857-866.
Vandesompele, J., M. Kubista and M. W. Pfaffl (2009). Reference gene software for improved
normalization. Real-Time PCR: Current Technology and Applications. J. Logan, K.
Edwards and N. Saunders, Caister Academic Press.
Vansteenkiste, J., C. Dooms, C. Mascaux and K. Nackaerts (2012). "Screening and early
detection of lung cancer." Ann Oncol 23 Suppl 10: x320-327.
Vlems, F. A., A. Ladanyi, R. Gertler, R. Rosenberg, J. H. Diepstra, C. Roder, H.
Nekarda, B. Molnar, Z. Tulassay, G. N. van Muijen and I. Vogel (2003). "Reliability of
quantitative reverse-transcriptase-PCR-based detection of tumour cells in the blood between different laboratories using a standardised protocol." Eur J Cancer 39(3): 388-396.
Vogel, C. and E. M. Marcotte (2012). "Insights into the regulation of protein abundance
from proteomic and transcriptomic analyses." Nat Rev Genet 13(4): 227-232.
Wade-Martins, R., Y. Saeki and E. A. Chiocca (2003). "Infectious delivery of a 135-kb
LDLR genomic locus leads to regulated complementation of low-density lipoprotein receptor deficiency in human cells." Mol Ther 7(5 Pt 1): 604-612.
Wallace, L. U. a. R. B. (1991). "Allele-Specific Polymerase Chain Reaction."
METHODS: A Companion to Methods in Enzymology 2(1): 42-48.
Walser, T., X. Cui, J. Yanagawa, J. M. Lee, E. Heinrich, G. Lee, S. Sharma and S. M.
Dubinett (2008). "Smoking and lung cancer: the role of inflammation." Proc Am Thorac
Soc 5(8): 811-815.
Wang, A. M., M. V. Doyle and D. F. Mark (1989). "Quantitation of mRNA by the
polymerase chain reaction." Proc Natl Acad Sci U S A 86(24): 9717-9721.
Wang, J., J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G. Pierce, X.
Dong, A. Kundaje, Y. Cheng, O. J. Rando, E. Birney, R. M. Myers, W. S. Noble, M.
Snyder and Z. Weng (2012). "Sequence features and chromatin structure around the
genomic regions bound by 119 human transcription factors." Genome Res 22(9): 1798-1812.
Wang, T. W., R. C. Vermeulen, W. Hu, G. Liu, X. Xiao, Y. Alekseyev, J. Xu, B. Reiss,
K. Steiling, G. S. Downward, D. T. Silverman, F. Wei, G. Wu, J. Li, M. E. Lenburg, N.
Rothman, A. Spira and Q. Lan (2015). "Gene-expression profiling of buccal epithelium among non-smoking women exposed to household air pollution from smoky coal."
Carcinogenesis 36(12): 1494-1501.
Wang, Y., P. Broderick, E. Webb, X. Wu, J. Vijayakrishnan, A. Matakidou, M. Qureshi,
Q. Dong, X. Gu, W. V. Chen, M. R. Spitz, T. Eisen, C. I. Amos and R. S. Houlston
(2008). "Common 5p15.33 and 6p21.33 variants influence lung cancer risk." Nat Genet
40(12): 1407-1409.
Wang, Y., J. D. McKay, T. Rafnar, Z. Wang, M. N. Timofeeva, P. Broderick, X. Zong,
M. Laplana, Y. Wei, Y. Han, A. Lloyd, M. Delahaye-Sourdeix, D. Chubb, V. Gaborieau,
W. Wheeler, N. Chatterjee, G. Thorleifsson, P. Sulem, G. Liu, R. Kaaks, M. Henrion, B.
Kinnersley, M. Vallee, F. LeCalvez-Kelm, V. L. Stevens, S. M. Gapstur, W. V. Chen, D.
Zaridze, N. Szeszenia-Dabrowska, J. Lissowska, P. Rudnai, E. Fabianova, D. Mates, V.
Bencko, L. Foretova, V. Janout, H. E. Krokan, M. E. Gabrielsen, F. Skorpen, L. Vatten, I.
Njolstad, C. Chen, G. Goodman, S. Benhamou, T. Vooder, K. Valk, M. Nelis, A.
Metspalu, M. Lener, J. Lubinski, M. Johansson, P. Vineis, A. Agudo, F. Clavel-Chapelon,
H. B. Bueno-de-Mesquita, D. Trichopoulos, K. T. Khaw, M. Johansson, E.
Weiderpass, A. Tjonneland, E. Riboli, M. Lathrop, G. Scelo, D. Albanes, N. E. Caporaso,
Y. Ye, J. Gu, X. Wu, M. R. Spitz, H. Dienemann, A. Rosenberger, L. Su, A. Matakidou,
T. Eisen, K. Stefansson, A. Risch, S. J. Chanock, D. C. Christiani, R. J. Hung, P.
Brennan, M. T. Landi, R. S. Houlston and C. I. Amos (2014). "Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer." Nat Genet 46(7): 736-741.
Wang, Z., M. Gerstein and M. Snyder (2009). "RNA-Seq: a revolutionary tool for
transcriptomics." Nat Rev Genet 10(1): 57-63.
Wang, Z., B. Zhu, M. Zhang, H. Parikh, J. Jia, C. C. Chung, J. N. Sampson, J. W.
Hoskins, A. Hutchinson, L. Burdette, A. Ibrahim, C. Hautman, P. S. Raj, C. C. Abnet, A.
A. Adjei, A. Ahlbom, D. Albanes, N. E. Allen, C. B. Ambrosone, M. Aldrich, P. Amiano,
C. Amos, U. Andersson, G. Andriole, Jr., I. L. Andrulis, C. Arici, A. A. Arslan, M. A.
Austin, D. Baris, D. A. Barkauskas, B. A. Bassig, L. E. Beane Freeman, C. D. Berg, S. I.
Berndt, P. A. Bertazzi, R. B. Biritwum, A. Black, W. Blot, H. Boeing, P. Boffetta, K.
Bolton, M. C. Boutron-Ruault, P. M. Bracci, P. Brennan, L. A. Brinton, M. Brotzman, H.
B. Bueno-de-Mesquita, J. E. Buring, M. A. Butler, Q. Cai, G. Cancel-Tassin, F. Canzian,
G. Cao, N. E. Caporaso, A. Carrato, T. Carreon, A. Carta, G. C. Chang, I. S. Chang, J.
Chang-Claude, X. Che, C. J. Chen, C. Y. Chen, C. H. Chen, C. Chen, K. Y. Chen, Y. M.
Chen, A. P. Chokkalingam, L. W. Chu, F. Clavel-Chapelon, G. A. Colditz, J. S. Colt, D.
Conti, M. B. Cook, V. K. Cortessis, E. D. Crawford, O. Cussenot, F. G. Davis, I. De
Vivo, X. Deng, T. Ding, C. P. Dinney, A. L. Di Stefano, W. R. Diver, E. J. Duell, J. W.
Elena, J. H. Fan, H. S. Feigelson, M. Feychting, J. D. Figueroa, A. M. Flanagan, J. F.
Fraumeni, Jr., N. D. Freedman, B. L. Fridley, C. S. Fuchs, M. Gago-Dominguez, S.
Gallinger, Y. T. Gao, S. M. Gapstur, M. Garcia-Closas, R. Garcia-Closas, J. M. Gastier-Foster,
J. M. Gaziano, D. S. Gerhard, C. A. Giffen, G. G. Giles, E. M. Gillanders, E. L.
Giovannucci, M. Goggins, N. Gokgoz, A. M. Goldstein, C. Gonzalez, R. Gorlick, M. H.
Greene, M. Gross, H. B. Grossman, R. Grubb, 3rd, J. Gu, P. Guan, C. A. Haiman, G.
Hallmans, S. E. Hankinson, C. C. Harris, P. Hartge, C. Hattinger, R. B. Hayes, Q. He, L.
Helman, B. E. Henderson, R. Henriksson, J. Hoffman-Bolton, C. Hohensee, E. A. Holly,
Y. C. Hong, R. N. Hoover, H. D. Hosgood, 3rd, C. F. Hsiao, A. W. Hsing, C. A. Hsiung,
N. Hu, W. Hu, Z. Hu, M. S. Huang, D. J. Hunter, P. D. Inskip, H. Ito, E. J. Jacobs, K. B.
Jacobs, M. Jenab, B. T. Ji, C. Johansen, M. Johansson, A. Johnson, R. Kaaks, A. M.
Kamat, A. Kamineni, M. Karagas, C. Khanna, K. T. Khaw, C. Kim, I. S. Kim, J. H. Kim,
Y. H. Kim, Y. C. Kim, Y. T. Kim, C. H. Kang, Y. J. Jung, C. M. Kitahara, A. P. Klein, R.
Klein, M. Kogevinas, W. P. Koh, T. Kohno, L. N. Kolonel, C. Kooperberg, C. P. Kratz,
V. Krogh, H. Kunitoh, R. C. Kurtz, N. Kurucu, Q. Lan, M. Lathrop, C. C. Lau, F.
Lecanda, K. M. Lee, M. P. Lee, L. Le Marchand, S. P. Lerner, D. Li, L. M. Liao, W. Y.
Lim, D. Lin, J. Lin, S. Lindstrom, M. S. Linet, J. Lissowska, J. Liu, B. Ljungberg, J.
Lloreta, D. Lu, J. Ma, N. Malats, S. Mannisto, N. Marina, G. Mastrangelo, K. Matsuo, K.
A. McGlynn, R. McKean-Cowdin, L. H. McNeill, R. R. McWilliams, B. S. Melin, P. S.
Meltzer, J. E. Mensah, X. Miao, D. S. Michaud, A. M. Mondul, L. E. Moore, K. Muir, S.
Niwa, S. H. Olson, N. Orr, S. Panico, J. Y. Park, A. V. Patel, A. Patino-Garcia, S.
Pavanello, P. H. Peeters, B. Peplonska, U. Peters, G. M. Petersen, P. Picci, M. C. Pike, S.
Porru, J. Prescott, X. Pu, M. P. Purdue, Y. L. Qiao, P. Rajaraman, E. Riboli, H. A. Risch,
R. J. Rodabough, N. Rothman, A. M. Ruder, J. S. Ryu, M. Sanson, A. Schned, F. R.
Schumacher, A. G. Schwartz, K. L. Schwartz, M. Schwenn, K. Scotlandi, A. Seow, C.
Serra, M. Serra, H. D. Sesso, G. Severi, H. Shen, M. Shen, S. Shete, K. Shiraishi, X. O.
Shu, A. Siddiq, L. Sierrasesumaga, S. Sierri, A. D. Loon Sihoe, D. T. Silverman, M.
Simon, M. C. Southey, L. Spector, M. Spitz, M. Stampfer, P. Stattin, M. C. Stern, V. L.
Stevens, R. Z. Stolzenberg-Solomon, D. O. Stram, S. S. Strom, W. C. Su, M. Sund, S. W.
Sung, A. Swerdlow, W. Tan, H. Tanaka, W. Tang, Z. Z. Tang, A. Tardon, E. Tay, P. R.
Taylor, Y. Tettey, D. M. Thomas, R. Tirabosco, A. Tjonneland, G. S. Tobias, J. R. Toro,
R. C. Travis, D. Trichopoulos, R. Troisi, A. Truelove, Y. H. Tsai, M. A. Tucker, R.
Tumino, D. Van Den Berg, S. K. Van Den Eeden, R. Vermeulen, P. Vineis, K.
Visvanathan, U. Vogel, C. Wang, C. Wang, J. Wang, S. S. Wang, E. Weiderpass, S. J.
Weinstein, N. Wentzensen, W. Wheeler, E. White, J. K. Wiencke, A. Wolk, B. M.
Wolpin, M. P. Wong, M. Wrensch, C. Wu, T. Wu, X. Wu, Y. L. Wu, J. S. Wunder, Y. B.
Xiang, J. Xu, H. P. Yang, P. C. Yang, Y. Yatabe, Y. Ye, E. D. Yeboah, Z. Yin, C. Ying,
C. J. Yu, K. Yu, J. M. Yuan, K. A. Zanetti, A. Zeleniuch-Jacquotte, W. Zheng, B. Zhou,
L. Mirabello, S. A. Savage, P. Kraft, S. J. Chanock, M. Yeager, M. T. Landi, J. Shi, N.
Chatterjee and L. T. Amundadottir (2014). "Imputation and subset-based association analysis across different cancer types identifies multiple independent risk loci in the
TERT-CLPTM1L region on chromosome 5p15.33." Hum Mol Genet 23(24): 6616-6633.
Wasswa-Kintu, S., W. Q. Gan, S. F. P. Man, P. D. Pare and D. D. Sin (2005).
"Relationship between reduced forced expiratory volume in one second and the risk of lung cancer: a systematic review and meta-analysis." Thorax 60(7): 570-575.
Waszak, S. M., O. Delaneau, A. R. Gschwind, H. Kilpinen, S. K. Raghav, R. M.
Witwicki, A. Orioli, M. Wiederkehr, N. I. Panousis, A. Yurovsky, L. Romano-Palumbo,
A. Planchon, D. Bielser, I. Padioleau, G. Udin, S. Thurnheer, D. Hacker, N. Hernandez,
A. Reymond, B. Deplancke and E. T. Dermitzakis (2015). "Population Variation and
Genetic Control of Modular Chromatin Architecture in Humans." Cell 162(5): 1039-1050.
Westra, H. J., M. J. Peters, T. Esko, H. Yaghootkar, C. Schurmann, J. Kettunen, M. W.
Christiansen, B. P. Fairfax, K. Schramm, J. E. Powell, A. Zhernakova, D. V. Zhernakova,
J. H. Veldink, L. H. Van den Berg, J. Karjalainen, S. Withoff, A. G. Uitterlinden, A.
Hofman, F. Rivadeneira, P. A. t Hoen, E. Reinmaa, K. Fischer, M. Nelis, L. Milani, D.
Melzer, L. Ferrucci, A. B. Singleton, D. G. Hernandez, M. A. Nalls, G. Homuth, M.
Nauck, D. Radke, U. Volker, M. Perola, V. Salomaa, J. Brody, A. Suchy-Dicey, S. A.
Gharib, D. A. Enquobahrie, T. Lumley, G. W. Montgomery, S. Makino, H. Prokisch, C.
Herder, M. Roden, H. Grallert, T. Meitinger, K. Strauch, Y. Li, R. C. Jansen, P. M.
Visscher, J. C. Knight, B. M. Psaty, S. Ripatti, A. Teumer, T. M. Frayling, A. Metspalu,
J. B. van Meurs and L. Franke (2013). "Systematic identification of trans eQTLs as putative drivers of known disease associations." Nat Genet 45(10): 1238-1243.
Willey, J. C., E. Coy, C. Brolly, M. J. Utell, M. W. Frampton, J. Hammersley, W. G.
Thilly, D. Olson and K. Cairns (1996). "Xenobiotic metabolism enzyme gene expression in human bronchial epithelial and alveolar macrophage cells." Am J Respir Cell Mol Biol
14(3): 262-271.
Willey, J. C., E. L. Coy, M. W. Frampton, A. Torres, M. J. Apostolakos, G. Hoehn, W.
H. Schuermann, W. G. Thilly, D. E. Olson, J. R. Hammersley, C. L. Crespi and M. J.
Utell (1997). "Quantitative RT-PCR measurement of cytochromes p450 1A1, 1B1, and
2B7, microsomal epoxide hydrolase, and NADPH oxidoreductase expression in lung cells of smokers and nonsmokers." Am J Respir Cell Mol Biol 17(1): 114-124.
Willey, J. C., E. L. Crawford, C. A. Knight, K. A. Warner, C. R. Motten,
E. H. Peters, R. J. Zahorchak, T. G. Graves, D. A. Weaver,
J. R. Bergman, M. Vondrecek and R. C. Grafstrom (2004). Use of standardized mixtures of internal standards in quantitative RT-PCR to ensure quality control and develop a standardized gene expression database. A-Z of Quantitative PCR. S. A. Bustin:
545-576.
Williams, R. B., E. K. Chan, M. J. Cowley and P. F. Little (2007). "The influence of genetic variation on gene expression." Genome Res 17(12): 1707-1716.
Wistuba, I. I. (2007). "Genetics of preneoplasia: lessons from lung cancer." Curr Mol Med
7(1): 3-14.
Wittkopp, P. J., B. K. Haerum and A. G. Clark (2008). "Independent effects of cis- and
trans-regulatory variation on gene expression in Drosophila melanogaster." Genetics
178(3): 1831-1835.
Wittkopp, P. J. and G. Kalay (2012). "Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence." Nat Rev Genet 13(1): 59-69.
Wong, M. L. and J. F. Medrano (2005). "Real-time PCR for mRNA quantitation."
Biotechniques 39(1): 75-85.
Wu, D. Y., L. Ugozzoli, B. K. Pal and R. B. Wallace (1989). "Allele-specific enzymatic amplification of beta-globin genomic DNA for diagnosis of sickle cell anemia." Proc Natl
Acad Sci U S A 86(8): 2757-2760.
Wu, M. C., P. Kraft, M. P. Epstein, D. M. Taylor, S. J. Chanock, D. J. Hunter and X. Lin
(2010). "Powerful SNP-set analysis for case-control genome-wide association studies."
Am J Hum Genet 86(6): 929-942.
Xu, H., J. DiCarlo, R. V. Satya, Q. Peng and Y. Wang (2014). "Comparison of somatic mutation calling methods in amplicon and whole exome sequence data." BMC Genomics
15: 244.
Xu, Y., L. Berglund, R. Ramakrishnan, R. Mayeux, C. Ngai, S. Holleran, B. Tycko, T.
Leff and N. S. Shachter (1999). "A common Hpa I RFLP of apolipoprotein C-I increases
gene transcription and exhibits an ethnically distinct pattern of linkage disequilibrium
with the alleles of apolipoprotein E." J Lipid Res 40(1): 50-58.
Yan, H., W. Yuan, V. E. Velculescu, B. Vogelstein and K. W. Kinzler (2002). "Allelic
variation in human gene expression." Science 297(5584): 1143.
Yeo, J., E. L. Crawford, T. M. Blomquist, L. M. Stanoszek, R. E. Dannemiller, J. Zyrek,
L. E. De Las Casas, S. A. Khuder and J. C. Willey (2014). "A multiplex two-color real-time PCR method for quality-controlled molecular diagnostic testing of FFPE samples."
PLoS One 9(2): e89395.
Yoo, S. S., C. Jin, D. K. Jung, Y. Y. Choi, J. E. Choi, W. K. Lee, S. Y. Lee, J. Lee, S. I.
Cha, C. H. Kim, Y. Seok, E. Lee and J. Y. Park (2015). "Putative functional variants of
XRCC1 identified by RegulomeDB were not associated with lung cancer risk in a Korean
population." Cancer Genet 208(1-2): 19-24.
Young, R. P., R. J. Hopkins, T. Christmas, P. N. Black, P. Metcalf and G. D. Gamble
(2009). "COPD prevalence is increased in lung cancer, independent of age, sex and
smoking history." Eur Respir J 34(2): 380-386.
Yu, Z., Z. Li, N. Jolicoeur, L. Zhang, Y. Fortin, E. Wang, M. Wu and S. H. Shen (2007).
"Aberrant allele frequencies of the SNPs located in microRNA target sites are potentially
associated with human cancers." Nucleic Acids Res 35(13): 4535-4541.
Yvert, G., R. B. Brem, J. Whittle, J. M. Akey, E. Foss, E. N. Smith, R. Mackelprang and
L. Kruglyak (2003). "Trans-acting regulatory variation in Saccharomyces cerevisiae and
the role of transcription factors." Nat Genet 35(1): 57-64.
Zakharkin, S. O., K. Kim, T. Mehta, L. Chen, S. Barnes, K. E. Scheirer, R. S. Parrish, D.
B. Allison and G. P. Page (2005). "Sources of variation in Affymetrix microarray
experiments." BMC Bioinformatics 6: 214.
Zakharov, S., T. Y. Wong, T. Aung, E. N. Vithana, C. C. Khor, A. Salim and A.
Thalamuthu (2013). "Combined genotype and haplotype tests for region-based
association studies." BMC Genomics 14: 569.
Zeki, S. and R. C. Fitzgerald (2015). "The use of molecular markers in predicting
dysplasia and guiding treatment." Best Pract Res Clin Gastroenterol 29(1): 113-124.
Zhai, R., X. Yu, A. Shafer, J. C. Wain and D. C. Christiani (2014). "The impact of coexisting COPD on survival of patients with early-stage non-small cell lung cancer
undergoing surgical resection." Chest 145(2): 346-353.
Zhang, K., J. B. Li, Y. Gao, D. Egli, B. Xie, J. Deng, Z. Li, J. H. Lee, J. Aach, E. M.
Leproust, K. Eggan and G. M. Church (2009). "Digital RNA allelotyping reveals tissue-
specific and allele-specific gene expression in human." Nat Methods 6(8): 613-618.
Zhang, R., X. Li, G. Ramaswami, K. S. Smith, G. Turecki, S. B. Montgomery and J. B.
Li (2014). "Quantifying RNA allelic ratios by microfluidic multiplex PCR and
sequencing." Nat Methods 11(1): 51-54.
Zhang, R., W. Min and W. C. Sessa (1995). "Functional analysis of the human
endothelial nitric oxide synthase promoter. Sp1 and GATA factors are necessary for basal
transcription in endothelial cells." J Biol Chem 270(25): 15320-15326.
Zhang, T., J. Sun, M. Lv, L. Zhang, X. Wang, J. C. Ren and B. Wang (2013). "XPG is
predictive gene of clinical outcome in advanced non-small-cell lung cancer with platinum
drug therapy." Asian Pac J Cancer Prev 14(2): 701-705.
Zhu, M. L., T. Y. Shi, H. C. Hu, J. He, M. Wang, L. Jin, Y. J. Yang, J. C. Wang, M. H.
Sun, H. Chen, K. L. Zhao, Z. Zhang, H. Q. Chen, J. Q. Xiang and Q. Y. Wei (2012).
"Polymorphisms in the ERCC5 gene and risk of esophageal squamous cell carcinoma
(ESCC) in Eastern Chinese populations." PLoS One 7(7): e41500.
Ziegler, A., A. Koch, K. Krockenberger and A. Grosshennig (2012). "Personalized
medicine using DNA biomarkers: a review." Hum Genet 131(10): 1627-1638.
Zienolddiny, S., D. Campa, H. Lind, D. Ryberg, V. Skaug, L. Stangeland, D. H. Phillips,
F. Canzian and A. Haugen (2006). "Polymorphisms of DNA repair genes and risk of non-small cell lung cancer." Carcinogenesis 27(3): 560-567.