A Dissertation

entitled

Cis-acting Genetic Variants that Alter ERCC5 Regulation as a Prototype to Characterize

cis-regulation of Key Protective Genes in Normal Bronchial Epithelial Cells

by

Xiaolu Zhang

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Doctor of Philosophy Degree in

Biomedical Sciences

______Dr. James C. Willey, Committee Chair

______Dr. Bina Joe, Committee Member

______Dr. Keith Crist, Committee Member

______Dr. Ivana de la Serna, Committee Member

______Dr. Alexei Fedorov, Committee Member

______Dr. Patricia R. Komuniecki, Dean College of Graduate Studies

The University of Toledo

May 2016

Copyright 2016, Xiaolu Zhang

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

Cis-acting Genetic Variants that Alter ERCC5 Regulation as a Prototype to Characterize Cis-regulation of Key Protective Genes in Normal Bronchial Epithelial Cells

by

Xiaolu Zhang

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Biomedical Sciences

The University of Toledo

May 2016

Evidence has suggested that there is inter-individual variation in susceptibility to lung cancer. This variation may be due, in part, to inter-individual variation in dysregulation of key genes in DNA repair, antioxidant, and cell cycle pathways. The excision repair cross-complementation group 5 (ERCC5) gene plays an important role in nucleotide excision repair (NER), and dysregulation of ERCC5 is associated with increased lung cancer risk. The goal of this study was to conduct haplotype and diplotype analyses in normal bronchial epithelial cells (NBEC) to understand inter-individual variation in ERCC5 transcript regulation.

We determined genotypes at putative ERCC5 cis-regulatory single nucleotide polymorphism (SNP) sites rs751402 and rs2296147, and at transcribed SNPs rs1047768 and rs17655. Using a recently developed targeted sequencing method, ERCC5 allele-specific transcript abundance was assessed in NBEC RNA from 55 individuals heterozygous for rs1047768 and 21 subjects heterozygous for rs17655. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by Sanger sequencing. We assessed association of NBEC ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at putative ERCC5 promoter cis-regulatory SNPs rs751402 and rs2296147.

Genotype analysis revealed higher inter-individual variation in allelic ratios in cDNA samples relative to matched gDNA samples at both rs1047768 and rs17655 (p<0.0001 and p=0.0005, respectively). By haplotype analysis, mean expression was higher at the rs1047768 alleles syntenic with the rs2296147 T allele compared to the rs2296147 C allele (p=0.0030). Sequence analysis predicted that the T allele at SNP rs2296147 creates a TP53 binding site. Mean expression was higher at the rs17655 G allele (p<0.0001), which is syntenic with the A allele at a linked SNP, rs873601 (D’=0.95). The G allele at SNP rs873601 is predicted to create a miRNA binding site.

These data support the conclusion that the T allele at SNP rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to the TP53 transcription factor. Genotype at SNP rs17655 is also associated with variation in ERCC5 transcript abundance, likely due to an effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide a plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.


Dedications

To my husband, Xiaoming Fan, who always supports me but doesn’t let me neglect my mistakes. You know my weaknesses, so you encourage me when I am depressed. You are a true friend, so you point out my weaknesses and teach me a lesson. Your dedication and support have provided the foundation for my continuous personal improvement. I am forever grateful for the tremendous sacrifices you have made along the way.

To my mother, Zhibi Zhang, for your strong support of my study in a different country. Your sacrifices to nurture your daughter and make her intelligent serve as the philosophy I will follow with my children in the future.

Acknowledgements

I would like to express my gratitude to all those who have helped me complete this thesis. I am very grateful to Dr. James C. Willey, my major advisor, for his expert research insights, for his tremendous patience in guiding and teaching me, and for his great support throughout my doctoral studies, especially the time he spent fixing my poor writing. You helped me stay optimistic and not give up. Your encouragement will keep me moving forward in the future.

I am grateful to Erin L. Crawford, my lab manager, for her help with communication with others, study design, reagent ordering, and writing the first manuscript of my life. You are always happy to answer my questions.

I thank all my former and present colleagues, Tom Blomquist, Lauren Stanoszek, Jiyoun Yeo, Diego A. Morales, Rose T. Zolondek, and Daniel J. Craig, for all the great discussion and inspiration. I especially want to thank Tom for your kind courtesy in providing reagents, Jiyoun for your great advice, and Rose for your help.

I would like to thank my academic advisory committee members: Dr. Bina Joe,

Dr. Keith Crist, Dr. Ivana de la Serna, and Dr. Alexei Fedorov,


Table of Contents

Abstract ...... iii

Acknowledgements ...... vii

Table of Contents ...... viii

List of Tables ...... xii

List of Figures ...... xiii

List of Abbreviations ...... xiiii

1 Introduction ...... 1

2 Literature Review...... 9

2.1 Biomarker Development for Disease Risk ...... 9

2.1.1 Gene Transcript Expression ...... 9

2.1.1.1 Total Transcript Abundance ...... 9

2.1.1.2 Allele-specific Transcript Expression ...... 12

2.1.1.2.1 Dosage Effect ...... 13

2.1.1.2.2 Allelic Imbalance ...... 15

2.1.2 Transcript Regulation...... 16

2.1.2.1 Identify Trans-acting Variations ...... 17

2.1.2.2 Identify Cis-acting Variations ...... 19

2.1.3 Epidemiology Study...... 21

2.1.3.1 Case-control Study ...... 22

2.1.3.2 Cohort Study ...... 23

2.2 Identification of cis-acting Regulatory Variations ...... 24

2.2.1 Single Nucleotide Polymorphism Analysis ...... 24

2.2.1.1 Genotype Analysis ...... 24

2.2.1.2 Haplotype Analysis ...... 27

2.2.2 Empirical Approaches to Assess cis-acting Variations ...... 31

2.2.2.1 Assessing cis-acting Effects by Total Expression ...... 31

2.2.2.2 Assessing cis-acting Effects by ASE ...... 33

2.2.3 Experimental Approaches to Assess cis-acting Variations ...... 35

2.2.3.1 In vitro Approaches...... 35

2.2.3.2 In vivo Approaches ...... 38

2.3 Quality Controlled Molecular Diagnostic Tests Based on RNA ...... 40

2.3.1 Reverse Transcription-Polymerase Chain Reaction ...... 40

2.3.1.1 Relative Quantification ...... 41

2.3.1.2 Absolute Quantification ...... 42

2.3.1.2.1 Competitive PCR ...... 43

2.3.1.2.2 Multiplex Two-Color Fluorometric Real-time PCR with Quality Control ...... 45

2.3.2 RNA-Sequencing ...... 48

2.3.2.1 Whole Transcriptome RNA-seq ...... 48

2.3.2.2 Targeted RNA-seq ...... 50

2.3.2.2.1 Use of IS as Quality Control ...... 51

2.3.2.2.2 Use of Predicted CV as Quality Control ...... 52


2.4 Contributions...... 53

2.4.1 Manuscript I ...... 53

2.4.2 Manuscript II ...... 54

2.4.3 Manuscript III ...... 55

2.5 Future Study ...... 55

3 Haplotype and Diplotype Analyses of Variation in ERCC5 Transcription cis-regulation in Normal Bronchial Epithelial Cells ...... 57

3.1 Abstract ...... 58

3.2 Introduction ...... 59

3.3 Materials and Methods ...... 61

3.4 Results ...... 67

3.5 Discussion ...... 70

3.6 Disclosures ...... 75

3.7 Grants ...... 75

3.8 Table and Figure Legends ...... 76

3.9 Table and Figure ...... 80

3.10 Supplemental Table and Figure Legends ...... 87

3.11 Supplemental Table and Figure ...... 88

4 Lung Cancer Risk Test Trial: Study Design, Participant Baseline Characteristics, Bronchoscopy Safety, and Establishment of Biospecimen Repository ...... 91

4.1 Abstract ...... 92

4.2 Introduction ...... 94

4.3 Methods...... 97


4.4 Results ...... 104

4.5 Discussion ...... 110

4.6 Conclusions ...... 116

4.7 List of Abbreviations Used ...... 116

4.8 Competing Interests ...... 117

4.9 Author Contributions ...... 117

4.10 Acknowledgements ...... 117

4.11 Table and Figure Legends ...... 118

4.12 Table and Figure ...... 119

4.13 Supplemental Table and Figure Legends ...... 121

4.14 Supplemental Table and Figure ...... 121

5 Control for Stochastic Sampling Variation and Qualitative Sequencing Error in Next Generation Sequencing ...... 127

5.1 Abstract ...... 128

5.3 Methods...... 133

5.4 Results ...... 139

5.5 Discussion ...... 141

5.6 Table and Figure Legends ...... 145

5.7 Table and Figure ...... 148

5.8 Supplementary Table and Figure ...... 153

6 Conclusions and Summary ...... 154

References ...... 166


List of Tables

3.1 Summary of Haplotype Structures in ERCC5 Promoter Region...... 75

3.2 Summary of Diplotype Structures in ERCC5 Promoter Region ...... 75

S3.1 Demographic Characteristics of Enrolled 80 Subjects ...... 83

S3.2 Summary of Genotype and Diplotype for Heterozygotes at rs1047768 ...... 83

S3.3 ON-TARGET Plus SMARTpool siRNA Sequences ...... 85

4.1 LCRT Subject Characteristics...... 114

4.2 Chronic Obstructive Pulmonary Disease by PFT ...... 115

4.3 Adverse Events (AE) ...... 115

S4.1 Lung Cancer Risk Test Study Enrollment by Study Site ...... 116

S4.2 Work Types and Exposures ...... 117

S4.3 Medical History ...... 118

S4.4 Standard of Care (SOC) vs. Study Driven (SD) Bronchoscopies ...... 119

S4.5 Self-reported vs. Clinical COPD...... 120

S4.6 Transplant vs. Non-Transplant Subjects ...... 121

S5.1 Supplementary Tables ...... 148


List of Figures

3-1 Effects of CEBPG siRNAs on ERCC5 Transcript Abundance ...... 76

3-2 Schematic Overview of ERCC5 ASE Measurement ...... 77

3-3 Allelic Ratios Measured at rs1047768 and rs17655 ...... 78

3-4 Allelic Ratios Measured at rs1047768 Sorted by Various Diplotype ...... 79

3-5 Correlation of CDKN1A and ERCC5 Transcript Abundance ...... 80

3-6 Lung Cancer Risk Through Sub-Optimal Regulation of Protective Genes ...... 81

5-1 Overview of Specimen Preparation for Next-Generation Sequencing ...... 143

5-2 Performance of Models to Predict Observed Variance...... 144

5-3 Effects of Sequence Counts and Sample Molecule on Allelic Ratios ...... 145

5-4 Frequency Plot of Observed Sequencing Variation ...... 146

5-5 Performance of IS to Measure Frequency of Sequence Variation ...... 147

S5-1 Model Design ...... 148


List of Abbreviations

AI ...... Allelic Imbalance

ASE ...... Allele Specific Expression

Cq ...... Quantification Cycle

CV ...... Coefficient (of) Variation

CEBPG ...... CCAAT/Enhancer Binding Protein, Gamma

ERCC5 ...... Excision Repair Cross-Complementation Group 5

ESM ...... External Standards Mixture

FFPE ...... Formalin-Fixed, Paraffin-Embedded

GWAS ...... Genome Wide Association Study

IS ...... Internal Standard

ISM ...... Internal Standards Mixture

LCRT ...... Lung Cancer Risk Test

NBEC ...... Normal Bronchial Epithelial Cell

NGS ...... Next-Generation Sequencing

NT ...... Native Template

PCR ...... Polymerase Chain Reaction

RT ...... Reverse Transcription

RT-qPCR ...... Reverse Transcription Quantitative-PCR

SNP ...... Single Nucleotide Polymorphism


Chapter 1

Introduction

An estimated 158,040 Americans were expected to die from lung cancer in 2015, accounting for approximately 27 percent of all cancer deaths (American Cancer Society 2015). As the leading cancer killer of both men and women in the United States, lung cancer causes more deaths than the next three most common cancers combined (colon, breast and pancreatic). In contrast to traits that follow a simple Mendelian pattern, lung cancer risk is a complex genetic trait that commonly involves multiple genes in combination with environmental factors, and the same is true for chronic obstructive pulmonary disease (COPD) (Tockman, Anthonisen et al. 1987, Sundar, Mullapudi et al. 2011). Although smoking is the most common preventable cause of lung cancer and COPD, evidence supporting a genetic basis of risk for lung cancer and COPD has been emerging since the beginning of the twentieth century (Alberg and Samet 2003). The evidence for genetic causes is based on statistics of lung cancer cases among smokers and nonsmokers. Importantly, although smokers worldwide have a 20 times greater risk of developing lung cancer than nonsmokers, only 10-15% of smokers develop lung cancer in their lifetime, consistent with an interaction between smoking and genetic risk (Mattson, Pollack et al. 1987, Irshad and Maryum 2012). Conversely, 10-15% of all lung cancers occur among nonsmokers, and genetic predisposition among nonsmoking individuals with lung cancer is under active investigation (Hu, Mao et al. 2002, Wang, Vermeulen et al. 2015). Although a single highly penetrant gene for lung cancer has not yet been identified (Schwartz, Prysak et al. 2006, Schwartz 2016), transcript expression of multiple genes and genomic DNA mutation signatures were reported to differ between smokers with and without lung cancer (Spira, Beane et al. 2004, Pleasance, Stephens et al. 2010). All of this evidence strongly suggests that there is inter-individual variation in susceptibility to lung cancer and that this variation may result from genetic predisposition.

The five-year survival rate for people diagnosed with late-stage lung cancer is significantly lower than for those diagnosed in the early stages (American Cancer Society 2016). Therefore, it is important to identify lung cancer cases at an early stage, and there has been an increased effort to do so. For example, the National Lung Screening Trial (NLST), launched in 2002, revealed a 20% reduction in lung cancer mortality among individuals screened by low-dose spiral computed tomography (LDCT) compared to the group screened by standard chest X-ray (National Lung Screening Trial Research, Aberle et al. 2011). Since December 2013, annual LDCT screening has been recommended by the United States Preventive Services Task Force (USPSTF) for people at high lung cancer risk based on epidemiologic characteristics (i.e., age 55-80, ≥ 30 pack-years smoking history, and currently smoking or having quit within the past 15 years). NLST reported that 24.2% of LDCT screens over 3 rounds were positive. However, 96.4% of these were false-positive, and approximately 2.5% of positive test results required an additional invasive diagnostic procedure (e.g., bronchoscopy, needle biopsy, etc.). Furthermore, LDCT was associated with health risk, emotional problems, and financial cost to the patient (National Lung Screening Trial Research, Church et al. 2013). Consequently, USPSTF urged that more research is needed on the use of biomarkers to focus LDCT efforts on persons who are at highest risk for lung cancer (Moyer and Force 2014). A molecular diagnostic that further stratifies the individuals at highest risk for lung cancer within the epidemiologically defined high-risk group will enable more accurate selection of individuals who are most likely to develop lung cancer in their lifetime and reduce the risk and cost of annual LDCT screening.

Various studies have described, discovered, quantified and validated biomarkers at the transcribed mRNA level (Spira, Beane et al. 2004, Spira, Beane et al. 2007). To understand the role of inter-individual variation in genetic predisposition to lung cancer risk, our laboratory determined that key genes involved in DNA repair, antioxidant, and cell cycle pathways display altered regulation in normal bronchial epithelial cells (NBEC) of lung cancer subjects (Crawford, Khuder et al. 2000, Mullins, Crawford et al. 2005, Crawford, Blomquist et al. 2007) and identified a promising Lung Cancer Risk Test (LCRT) biomarker comprising transcript abundance measurement of fifteen genes in NBEC (Blomquist, Crawford et al. 2009). Based on the pattern observed in the prior case-control study, it is reasonable to hypothesize that the heritability of lung cancer may depend on inter-individual variation in DNA repair capacity, which is associated with sub-optimal expression of DNA repair genes and antioxidant genes in airway epithelium.

Sub-optimal expression of DNA repair and antioxidant genes can result from dysregulation of these genes, in part due to genetic variation. Genetic association studies can be conducted to test for correlation between disease status and genetic variation to identify candidate genes or genome regions that contribute to a specific disease (Lewis and Knight 2012). Numerous such studies of DNA repair and antioxidant genes have identified many risk loci associated with lung cancer (Shen, Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al. 2006, Gallegos Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Blomquist, Crawford et al. 2010, Tseden-Ish, Choi et al. 2012, Blomquist, Brown et al. 2013).

In addition to these genetic association studies, in which specific target loci were prioritized based on high prior likelihood of association with lung cancer risk, associations between multiple alleles and lung cancer risk have been tested using dense arrays of genetic markers that capture a substantial proportion of common variants in the human genome (McCarthy, Abecasis et al. 2008), an approach known as the genome-wide association study (GWAS). GWAS have discovered many genetic variants (i.e., single nucleotide polymorphisms, SNPs) that were significantly associated with common disease susceptibility in mouse and human genomes (Lohmueller, Pearce et al. 2003, Mehrabian, Allayee et al. 2005, Simmonds 2013). However, results from different GWAS have not been consistent.

One reason is that the effect size of each DNA variant associated with, and possibly mechanistically linked to, lung cancer risk is very small after adjustment for multiple comparisons. Genes associated with lung cancer risk are regulated by multiple loci; therefore, each locus contributes only modestly to the risk (Deutsch, Lyle et al. 2005, Wu, Kraft et al. 2010). In fact, additional evidence showed that region-based analysis (a combination of SNPs in a particular genomic region) may possess higher power than single SNP-based analysis (Zakharov, Wong et al. 2013). Together, these findings suggest a benefit of region-based analysis over SNP-based analysis and have led to more interest in the haplotype, the combination of marker alleles on a single chromosome.


A second reason for inconsistent GWAS results is that GWAS of susceptibility loci and lung cancer risk have been limited by the low and variable frequency of SNPs across populations and ethnic groups. Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants with lung cancer risk. Instead, identifying inherited variation in gene regulation as a phenotypic marker is a powerful intermediate step for determination of lung cancer risk. According to current results and our previous study, it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS aiming to determine the association of each individual SNP with risk. Analysis of regulation of key antioxidant and DNA repair genes in NBEC, followed by identification of cis-regulatory SNPs (cis-rSNP) associated with sub-optimal regulation, is a more practical and effective way to evaluate lung cancer risk.

The proximate phenotypic markers of hereditary risk comprised by the LCRT are key protective antioxidant, DNA repair, and cell cycle control genes that are sub-optimally regulated in normal bronchial epithelial cells (NBEC). The rationale for this approach is that sub-optimal NBEC regulation of a protective gene has a greater effect on risk than an individual SNP. This conclusion is based on results of previous studies in which we identified cis-rSNPs associated with sub-optimal regulation of genes comprised by the LCRT, including ERCC5 (Blomquist, Crawford et al. 2010) and CEBPG (Blomquist, Brown et al. 2013). For example, we identified two cis-rSNPs that independently contribute to regulation of ERCC5 transcript abundance. Thus, a proximate phenotype based on sub-optimal NBEC regulation of a protective gene enriches for cis-rSNPs that may contribute to risk. Accordingly, the central hypothesis of this study is that genetic variations (i.e., SNPs) at regulatory regions contribute to cis-regulation of ERCC5, one of the key genes in the LCRT. The central hypothesis was tested through two working hypotheses: 1) cis-rSNPs are associated with ERCC5 allele-specific expression (ASE); 2) the contributions of cis-rSNPs to ERCC5 regulation may be independent. The mechanistic role of cis-acting genetic variations that regulate transcription of a target gene can be assessed by measuring ASE because each allele serves as an internal control for the other (Pastinen and Hudson 2004, Pai, Pritchard et al. 2015), so trans-acting effects or environmental conditions that differentially influence gene expression among samples should not interfere. Only cis-acting changes in the relative expression of alleles yield reproducible differences between allelic abundances of transcripts. Previously, genotypes of putative cis-rSNPs responsible for regulation of some genes comprised by the LCRT, including excision repair cross-complementation group 5 (ERCC5) and CCAAT/enhancer binding protein gamma (CEBPG), have been associated with ASE (Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2010, Blomquist, Brown et al. 2013). This study tested the hypotheses by accomplishing two specific aims. Aim 1: determine haplotypes comprised by putative cis-rSNPs using allele-specific PCR followed by Sanger sequencing in normal epithelial cells. Aim 2: assess ASE in ERCC5 using a targeted NGS method with quality and stochastic controls. With the addition of more subjects to the LCRT trial over the past 5 years, and given ERCC5’s critical role in DNA repair, mechanistic understanding of the regional effects of heritable variations in cis-regulation of ERCC5 can be advanced by assessing haplotypes comprised by putative cis-rSNPs and ASE. In addition to cis-regulation, the trans-effect of CEBPG, a previously identified transcription factor for ERCC5, was determined in a lung cancer cell line.

The second part of this study was dedicated to development of quality control for molecular tests. Biomarkers are of increasing importance for personalized medicine, with applications at different stages of the disease process: not only enhancing early disease detection but also assessing clinical outcomes of a treatment, determining the most effective treatment for an individual, and monitoring response to treatment (Pfaffl 2013). Predictive biomarkers (a subset of biomarkers) are used to predict response to a treatment in terms of efficacy and safety. Chemoresistance and chemosensitivity assays specifically have been investigated in vitro and some have been approved by the FDA (Keedy, Temin et al. 2011, Mok 2011, Korpanty, Graham et al. 2014, Chamizo, Zazo et al. 2015). To augment immunohistochemistry (IHC), the most commonly used approach in clinical molecular diagnostics, and to enable the quantitation of predictive biomarkers, quality-controlled multiplex two-color fluorometric real-time PCR assays were developed for ten predictive markers that have shown clinical value for response to general or targeted chemotherapeutic agents. The key components of this method are internal standards (IS) and external standards (ES). The competitive IS molecule was designed with priming sites identical to, and a 4-6 bp internal difference from, each native target gene template (NT). This ensures identical thermodynamics and amplification efficiency for both template species as well as discrimination of IS from NT. The ES corrects for differences in fluorescence intensity between the two probes labeled with different dyes, which arise from variation in probe degradation or in software selection of Cq values on each PCR plate.
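The role of the IS and ES described above can be illustrated with a minimal sketch. It assumes the general competitive-PCR principle that, when NT and IS amplify with the same efficiency, the NT:IS ratio follows from their Cq difference, and it assumes the ES is applied as a per-plate Cq offset measured on a mixture with a known 1:1 NT:IS ratio. The function name, parameters, and numbers are hypothetical and are not taken from the assays described here.

```python
def nt_copies(cq_nt, cq_is, is_copies, es_delta_cq=0.0, efficiency=2.0):
    """Estimate native template (NT) copies from a competitive IS.

    cq_nt, cq_is : Cq values of the NT and IS probes in the same well.
    is_copies    : known number of IS copies spiked into the reaction.
    es_delta_cq  : (Cq_IS - Cq_NT) measured on an external standard assumed
                   to contain NT and IS at 1:1; it captures dye or probe
                   bias on that plate (assumed form of the correction).
    efficiency   : per-cycle amplification factor, assumed equal for NT
                   and IS because they share priming sites.
    """
    if is_copies <= 0:
        raise ValueError("is_copies must be positive")
    # Remove the plate-specific probe bias, then convert cycles to a ratio.
    corrected_delta = (cq_is - cq_nt) - es_delta_cq
    return is_copies * efficiency ** corrected_delta


# Hypothetical well: NT crosses threshold 2 cycles before 1000 copies of IS.
uncorrected = nt_copies(cq_nt=25.0, cq_is=27.0, is_copies=1000)
# Same well when the ES shows a 1-cycle bias between the two probes.
corrected = nt_copies(cq_nt=25.0, cq_is=27.0, is_copies=1000, es_delta_cq=1.0)
print(uncorrected, corrected)
```

Because the IS is measured in the same well as the NT, anything that slows amplification in that well affects both templates alike, which is why only their ratio enters the estimate.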


In addition to real-time PCR, we have implemented quality control in next generation sequencing (NGS) RNA-sequencing platforms. We previously developed a competitive multiplex-PCR amplicon-based library preparation for targeted RNA-sequencing on NGS platforms (Blomquist, Crawford et al. 2013). This method can control for sample overloading, signal saturation effects, and inter-assay and inter-sample variation in measurement. In addition, it enables control of stochastic sampling error when a low number of copies is present in a sample. Uncontrolled analytical variation due to stochastic sampling is potentially a major barrier that limits the application of NGS in the clinical setting. Hence, controlling for stochastic sampling error is important for mutation detection and differential gene expression measurement. The hypothesis that assay coefficient of variation (CV) due to stochastic sampling can be predicted based on a Poisson sampling-based mathematical equation was tested through cross-titration of cDNA samples from two lung cancer cell lines (Fu, Xu et al. 2014). The predicted CV may then be used to determine the confidence limits for each value acquired from NGS analysis. Therefore, false positive and false negative results can be reduced by minimizing variation due to stochastic sampling.
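The idea of predicting CV from Poisson sampling can be sketched as follows. This is a minimal illustration, assuming only the standard Poisson property that a count with mean N has standard deviation sqrt(N), plus a first-order approximation for the CV of a ratio of two independent counts. It is not the specific equation evaluated in this study, and the function names and copy numbers are hypothetical.

```python
import math


def poisson_cv(expected_copies):
    """CV of a Poisson-distributed count: sqrt(N) / N = 1 / sqrt(N)."""
    if expected_copies <= 0:
        raise ValueError("expected_copies must be positive")
    return 1.0 / math.sqrt(expected_copies)


def ratio_cv(copies_a, copies_b):
    """Approximate CV of the ratio A/B of two independent Poisson counts.

    First-order (delta-method) approximation: the squared CVs add.
    """
    return math.sqrt(poisson_cv(copies_a) ** 2 + poisson_cv(copies_b) ** 2)


# Hypothetical inputs: fewer sampled molecules give a larger predicted CV.
for copies in (10, 100, 10000):
    print(copies, round(poisson_cv(copies), 4))
```

Under these assumptions, a measurement based on about 100 sampled molecules cannot have a CV much below 10% regardless of sequencing depth, which is the kind of limit a predicted CV is meant to make explicit.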


Chapter 2

Literature Review

2.1. Biomarker development for disease risk

2.1.1. Gene Transcript Expression

Gene expression can be measured at the mRNA and protein levels. Although protein is widely known as the functional gene product, the abundance of mRNA, the intermediary between DNA and protein, correlates with protein expression level in complex biological samples under certain circumstances (Gry, Rimini et al. 2009, Maier, Guell et al. 2009, Vogel and Marcotte 2012, Evans 2015). mRNA, also known as transcript, is a direct reflection of gene expression. Thus, by determining the types and quantity of mRNA transcripts present in a cell, we can determine which genes are expressed, and at what level, in that cell at different stages of development and under different environmental conditions.

2.1.1.1. Total Transcript Abundance

Various studies have described, discovered, quantified and validated biomarkers measured at the mRNA level. In 2004, using DNA microarrays, Spira and colleagues described smoking-induced changes in gene expression in airway epithelial cells obtained during bronchoscopy from nonsmokers and from current and former smokers without lung cancer (Spira, Beane et al. 2004). They found that a subset of genes had consistently altered expression in former smokers and speculated that this may explain the risk these individuals have for developing lung cancer long after they have discontinued smoking. Later, using gene-expression profiles from Affymetrix HG-U133A microarrays, Spira and colleagues identified an 80-gene biomarker that distinguishes smokers with and without lung cancer in a training set of subjects. They then tested the biomarker on an independent test set and on an additional validation set. Their biomarker had ~90% sensitivity for stage 1 cancer across all subjects (Spira, Beane et al. 2007). Using recently developed transcriptome sequencing, the same group found that RNA-Seq data detected additional smoking- and cancer-related transcripts whose expression was either not interrogated by microarrays or not found to be significantly altered when using them (Beane, Vick et al. 2011, Wang, Vermeulen et al. 2015). These findings support the notion that gene transcript expression (i.e., transcript abundance) in normal airway epithelial cells can serve as a lung cancer biomarker.

Using quality-controlled standardized RT-PCR (StaRT-PCR), our laboratory has identified a set of key DNA repair, antioxidant and transcription factor genes that exhibited significant intergene total transcript abundance correlation in normal human bronchial epithelial cells (NBEC) among individuals without a lung cancer diagnosis (Mullins, Crawford et al. 2005). Conversely, in NBEC of individuals diagnosed with lung cancer, intergene total transcript abundance correlation was not observed. This difference between lung cancer cases and matched controls led to identification of a 14-gene transcript expression-based Lung Cancer Risk Test (LCRT). In addition to this pilot study, two additional case-control studies (first set: 25 lung cancer cases and 24 controls; second set: 18 cases and 22 controls) supported the association between the LCRT and prevalence of lung cancer (Blomquist, Crawford et al. 2009). The set of genes that separates lung cancer cases from non-lung cancer controls in this study has different characteristics from a set of genes with similar classification capabilities that was discovered through high-density microarray analysis by Spira and colleagues, as reviewed above (Spira, Beane et al. 2007). One difference is that 12 of the 14 genes reported in our findings are key antioxidant or DNA repair genes, whereas the remaining two are transcription factors expressed in normal airway epithelium (Blomquist, Crawford et al. 2009). In contrast, the set of genes reported by the Spira group comprises primarily signal transduction and small molecule transport genes. A second difference is that each of the genes comprising the multigene test reported by our group has increased dispersion among the lung cancer cases rather than altered mean levels. In contrast, each of the genes reported by Spira and colleagues has an altered central tendency of expression.

We speculate that the increased dispersion in total transcript expression of genes in NBEC reported in Blomquist et al. 2009 likely results from inherited characteristics at the germ cell level, and less likely from acquisition of genetic alterations in somatic cells in the airway epithelium, previously described as the field effect (Wistuba 2007, Walser, Cui et al. 2008). There is theoretical and empirical evidence supporting this inference. For example, the field effect has been observed in all smokers, not just those with lung cancer (Wistuba 2007, Kadara and Wistuba 2012). However, the observed increase in variation in transcript expression of genes comprising the LCRT separated the lung cancer group from the non-lung cancer group regardless of cigarette smoking. This observation indicates the existence of inter-individual variation in the field cancerization effect caused by smoking and led to the hypothesis that the basis for this variation is germ cell inheritance. This hypothesis is supported by accumulating evidence for germ cell inheritance of particular cis-element single nucleotide polymorphisms (SNP) that cause inter-individual variation in regulation of the genes comprised by the LCRT and that are associated with increased lung cancer risk (Blomquist, Crawford et al. 2010, Blomquist, Brown et al. 2013). Specifically, particular polymorphisms in the regulatory region of ERCC5 are associated with increased dispersion of transcript expression around its median expression value and with altered prevalence of lung cancer diagnosis (Blomquist, Crawford et al. 2010). In addition, a polymorphism (rs3213245, -77T>C) in the regulatory 5’ untranslated region (UTR) of X-ray repair cross-complementing 1 (XRCC1) was found to be significantly associated with altered XRCC1 expression and increased lung cancer risk (Hao, Miao et al. 2006). Both ERCC5 and XRCC1 are among the 14 genes comprised by the LCRT. Beyond these two genes, we hypothesize that germ cell inheritance of particular alleles at regulatory polymorphic sites may be associated with increased dispersion of transcript expression in the other genes comprised by this test as well.

2.1.1.2. Allele-Specific Transcript Expression

In addition to total transcript abundance, allele-specific expression (ASE) is another measure of gene transcript expression. While total transcript abundance measures the copies of transcripts derived from all alleles at a gene locus, ASE refers to the differential expression of each allele. In a diploid genome, ASE can be represented by the allelic ratio, the ratio of the amounts of products derived from the two alleles. Detecting ASE at the transcript level in an individual, accordingly, means quantifying the relative amount of transcript that originated from a particular allele at the gene locus. Hereafter, ASE refers to ASE at the transcript level unless otherwise specified.

Allele-specific differences in levels of gene transcript expression have classically been attributed to the associated cis-acting regulatory variation and to distinct epigenetic patterns of the two parental alleles (Knight 2004, Pastinen, Ge et al. 2006, Song, Kim et al. 2012). Therefore, ASE measurement is a powerful method for determining the effect of cis-polymorphic sites on transcript regulation compared to total transcript abundance, which does not control for inter-individual variation in trans-regulatory effects (de la Chapelle 2009, Matera, Musso et al. 2013).

2.1.1.2.1. Dosage Effect

Diploid organisms can only have two alleles for a given gene; however, multiple alleles may exist at the population level such that many combinations of two alleles are observed. Traits inherited in a simple Mendelian pattern are either dominant or recessive, which refers to complete dominance. Such a trait is produced by only one gene. The complete dominance of a wild-type phenotype over all other mutants often occurs as an effect of "dosage" of a specific gene product: the wild-type allele supplies the correct amount of gene product whereas the mutant alleles cannot. One mutant allele can also be dominant over all other phenotypes, including the wild type (UC Davis).

But this is not the case for many traits, including lung cancer. In other words, dominance is not always complete. Incomplete dominance is the expression of two contrasting alleles such that the individual displays an intermediate phenotype. The phenotype expression is dependent on the dosage of the gene products. Two copies of the gene result in full expression, while only one copy produces partial expression and thus, in turn, an intermediate phenotype. A variation on incomplete dominance is codominance, in which both alleles for the same characteristic are simultaneously expressed in the heterozygote (Boundless 2015). Lung cancer is a complex genetic trait in which a complex pattern of inheritance is common, involving multiple genes in combination with environmental factors.

In recent decades, an increasing number of reports have shown that various SNPs have a dosage effect on gene expression, lung cancer risk, survival time, and response to treatment (Shen, Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al. 2006, Gallegos Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Tseden-Ish, Choi et al. 2012, Litviakov, Freidin et al. 2015). Most association studies are between genotypes and expression levels of genes that play an important role in tumorigenesis. For example, a deletion on 14q32.2-33, a common mutation in non-small cell lung cancer (NSCLC) (44%), significantly reduced gene expression for HSP90, residing on 14q32, which, in turn, led to longer survival time in 32 NSCLC patients (Gallegos Ruiz, Floor et al. 2008). As previously published, the variant genotypes (GC/AT + AT/AT) of one locus in p73 versus the wild-type genotype (GC/GC) and the variant genotypes (WM + MM) of one locus in p53 versus the wild-type genotype (WW) were each associated with an increase in risk of borderline statistical significance after adjusting for age, gender, smoking status, and pack-years.

When the p73 and p53 variant alleles were combined and analyzed as a continuous variable, there was evidence of a gene-dosage effect, with a 13% increase in lung cancer risk for each additional variant allele (Schabath, Wu et al. 2006). Accordingly, one variant SNP may have only a modest independent effect on the phenotype of a multigenic disease such as lung cancer, given the borderline effects from a single-gene SNP and the fact that the combined effects from both the p73 and p53 variant alleles increased lung cancer risk by only 13%. Another example is a study of 53 unrelated HapMap CEU lymphoblastoid cell lines. In this study, Ge and colleagues detected that 5.8% of common SNPs on the Illumina 1M BeadChip were significantly associated with allelic ratios and cis regulation in individual transcripts. Specifically, the ASE linear regression results showed that the number of copies of the mutant allele at rs2732087 in XKR9 significantly correlates with ASE (Ge, Pokholok et al. 2009). More recently, a trend of association between SNP genotype and the expression of gene transcripts was also observed in a GWAS of non-tumor lung tissues from 420 patients undergoing lung cancer surgery (Nguyen, Lamontagne et al. 2014).
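The two dosage analyses described above can be sketched in a few lines. This is only an illustration under stated assumptions: the first function assumes a log-additive model in which each additional variant allele multiplies risk by the same factor, and the second fits an ordinary least-squares line of an expression measure on variant allele count. The function names and the toy values are hypothetical and are not data from the cited studies.

```python
def relative_risk_per_allele(per_allele_increase, n_variant_alleles):
    """Relative risk under an assumed log-additive (multiplicative) model."""
    return (1.0 + per_allele_increase) ** n_variant_alleles


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: variant allele copies (0, 1, 2) and an expression measure.
allele_copies = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]

slope, intercept = fit_line(allele_copies, expression)
print("change in expression per variant allele:", slope)
print("relative risk with two variant alleles:", relative_risk_per_allele(0.13, 2))
```

A non-zero slope in such a fit is what is meant by a dosage effect; whether the slope is statistically significant would require a formal test that this sketch omits.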

2.1.1.2.2. Allelic Imbalance

In contrast to dosage effect, the other form of relative allelic expression is called allelic imbalance (AI). The ratio of the abundance of each allele is expected to be ~1 in a diploid organism. This means that theoretically the mRNA transcribed from the maternal chromosome and that transcribed from the paternal chromosome will have roughly the same number of copies. However, this is not always the case. When the ratio of the expression levels is not 1 to 1, we call it “allelic imbalance (AI)”. In practice, the measured ratio also varies for experimental reasons, but this can be controlled for by using the source genomic DNA (gDNA) as a control (Pastinen, Ge et al. 2006).
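The gDNA control described above can be expressed as a simple normalization. The sketch below assumes that allele-specific signal (for example, read counts) is available for both cDNA and gDNA from the same heterozygous individual; the function name and the example counts are hypothetical.

```python
def normalized_allelic_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """Allelic ratio in cDNA divided by the allelic ratio in gDNA.

    In a heterozygote the true gDNA ratio is 1:1, so any deviation of the
    measured gDNA ratio from 1 is taken here to reflect assay bias.
    """
    if min(cdna_b, gdna_a, gdna_b) <= 0:
        raise ValueError("allele signals used as divisors must be positive")
    cdna_ratio = cdna_a / cdna_b
    gdna_ratio = gdna_a / gdna_b
    return cdna_ratio / gdna_ratio


# Hypothetical counts for alleles A and B in one heterozygous sample.
ratio = normalized_allelic_ratio(cdna_a=600, cdna_b=400, gdna_a=520, gdna_b=480)
print("gDNA-normalized allelic ratio:", ratio)
```

A normalized ratio near 1 is consistent with balanced expression, while a ratio that departs from 1 suggests allelic imbalance; how far it must depart to be called imbalance is a threshold choice not made here.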

There are a variety of reasons why the expression may vary between the alleles. One reason is epigenetic silencing of either the maternal or paternal allele, also known as genomic imprinting (Crowley, Zhabotynsky et al. 2015). If one allele is silenced completely, then there will be an extreme case of allelic imbalance. Other reasons include cis-acting genetic variations that alter regulation of just one allele through a change to promoter/enhancer regions (transcription factor binding sites) (Schork, Thompson et al. 2013, Gusev, Lee et al. 2014, Pickrell 2014, Albert and Kruglyak 2015), or through 3′ UTR mutations that affect mRNA stability or microRNA binding (Nicoloso, Sun et al. 2010, Obsteter, Dovc et al. 2015). Such variants may alter expression of one particular allele only slightly, resulting in a lesser degree of imbalance. Classically, loss of heterozygosity (LOH) is a common form of allelic imbalance. The detection of LOH has been used to identify somatic-cell genetic changes and genomic regions that harbor tumor suppressor genes, and to characterize tumor stages and progression (Mei, Galipeau et al. 2000, Zeki and Fitzgerald 2015).

2.1.2. Transcript Regulation

The mRNA expression level of a gene is typically determined by several input signals acting at the steps of transcript synthesis (initiation, elongation, and termination) and at post-transcriptional processes (5' capping, addition of the poly A tail, and splicing). Among the various factors that can affect these processes, regulatory sequence has received increasing attention in recent decades. Genetic variation plays a crucial role in disease susceptibility through regulating gene expression. Association studies in mouse and human genomes have discovered many significant genetic variants that influence susceptibility to common diseases (Lohmueller, Pearce et al. 2003, Mehrabian, Allayee et al. 2005, Cirulli and Goldstein 2010, Simmonds 2013). There are two major classes of genetic variants affecting gene expression across the genome: cis and trans. Cis-acting variants are close to the gene(s) that they regulate and affect transcript synthesis or stability in an allele-specific manner, whereas trans-acting variants are not close (usually on different chromosomes) and can affect both alleles of a gene (Williams, Chan et al. 2007, Pardini, Naccarati et al. 2012). Consequently, both cis- and trans-acting polymorphisms, as well as the non-additive interactions between the two, can contribute to the variation in gene expression (Rockman and Kruglyak 2006, Albert and Kruglyak 2015, Waszak, Delaneau et al. 2015). In addition to the genetic variants described above, epigenetic mechanisms (Jaenisch and Bird 2003), chromatin conformation (Higgs, Vernimmen et al. 2007), copy number variation (Cahan, Li et al. 2009, Henrichsen, Vinckenbosch et al. 2009) and microRNA (Obsteter, Dovc et al. 2015) all may affect the transcriptional regulation of a given gene.
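The positional distinction between cis- and trans-acting variants drawn above can be sketched as a simple rule. This is an illustration only: the 1 Mb window is an assumed cutoff that is not taken from this text, position alone does not establish mechanism, and the coordinates in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    chromosome: str
    position: int  # base-pair coordinate on the chromosome


def classify_variant(variant, gene_start, cis_window_bp=1_000_000):
    """Label a variant 'cis' or 'trans' relative to a gene by position only.

    cis_window_bp is an assumed cutoff; studies choose different values.
    """
    if variant.chromosome != gene_start.chromosome:
        return "trans"
    if abs(variant.position - gene_start.position) <= cis_window_bp:
        return "cis"
    return "trans"


# Hypothetical coordinates for one gene and three variants.
gene = Locus("chr13", 5_000_000)
print(classify_variant(Locus("chr13", 5_000_300), gene))
print(classify_variant(Locus("chr13", 9_000_000), gene))
print(classify_variant(Locus("chr2", 5_000_300), gene))
```

A purely positional label like this is a first pass; allele-specific measurement is what distinguishes a true cis effect from a nearby variant acting through a diffusible factor.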

2.1.2.1. Identify Trans-acting Variations

It can be very difficult to dissect trans-acting genetic variants from cis-acting genetic variants associated with disease risk. Current approaches to identify trans-acting variations associated with disease risk are based on genome-wide mapping of expression quantitative trait loci (eQTLs). If a trans-acting effect is mapped to a chromosomal locus, the underlying variant may be a coding variant or regulatory variant in a gene involved in the transcriptional control of the gene(s) that is (are) affected. Genome-wide mapping studies in yeast and mice showed that trans-acting loci broadly dispersed across regulation pathways are responsible for differences in gene expression (Brem, Yvert et al. 2002, Schadt, Monks et al. 2003, Yvert, Brem et al. 2003, Albert and Kruglyak 2015, Waszak, Delaneau et al. 2015).

As relatively complex functional studies are necessary to validate the suggested trans-acting variants underlying eQTLs, positional cloning clarified the biological function of putative trans-acting variations, and its combination with linkage analysis revealed ‘linkage hotspots’ that interact with multiple unlinked loci to influence gene expression (Schadt, Monks et al. 2003, Yvert, Brem et al. 2003, Albert and Kruglyak 2015, Pai, Pritchard et al. 2015). A decade ago, none of the suggested trans-acting variants underlying human eQTLs had been conclusively validated (Pastinen, Ge et al. 2006). This might have been because trans-acting variations were difficult to replicate and because their causal effects on expression are not trivial to establish. Therefore, using a “less-biased” experimental approach is crucial to characterize the effects of trans-acting variants on gene regulation (Costa, Aprile et al. 2013). Many databases have now become available that rank eQTLs based on functional data from a variety of Encyclopedia of DNA Elements (ENCODE) assays (Rosenbloom, Dreszer et al. 2010), for example, DNase I hypersensitive sites (DHSs), chromatin immunoprecipitation, predicted transcription factor binding, or reporter gene assays. RegulomeDB (Boyle, Hong et al. 2012), for example, is a good source that can help investigators narrow down the pool of candidate SNPs within an eQTL region (Gibson, Powell et al. 2015). An exciting outcome of identifying trans-acting variants is the potential for finding new drivers of gene-regulatory networks (Schadt, Monks et al. 2003, Fehrmann, Jansen et al. 2011, Westra, Peters et al. 2013, Gibson, Powell et al. 2015), which may lead to more clearly defined causal factors in diseases with a complex genetic basis such as lung cancer.

There is a lack of large-scale studies dedicated to trans-acting effects on gene expression in humans. The most important reason may be that trans-regulatory factors and cis-regulatory sequences always interact non-additively with each other, and such interaction is required in the gene expression process (Gibson and Weir 2005, Gibson, Powell et al. 2015). One example is epistatic interaction between cis- and trans-acting regulators, which is a common feature of the complex genetic architecture underlying quantitative phenotypes. In 2008, Wittkopp and colleagues compared ASE among different trans-regulatory genetic backgrounds in Drosophila melanogaster. Among eight genes analyzed, they identified five genes that were affected by trans-acting variation that altered total transcript levels and two genes that were affected by differences in cis-regulation. However, they did not characterize the direct epistatic interaction between cis- and trans-acting regulatory polymorphisms (Wittkopp, Haerum et al. 2008). Westra et al. recently identified and replicated trans eQTLs for 233 SNPs in peripheral blood samples from 5,311 individuals using eQTL meta-analysis. They further demonstrated that one trans eQTL which altered expression of the IFN-α pathway gene functioned through the cis-regulatory effects of that site on a transcription factor-encoding gene (Westra, Peters et al. 2013). Notably, going beyond univariate SNP-transcript associations, Kirsten and colleagues found that 18% of analyzed genes were trans-regulated in peripheral blood mononuclear cells of 2112 individuals (Kirsten, Al-Hasani et al. 2015).

2.1.2.2. Identify Cis-acting Variations

Cis-regulatory variations usually reside in the noncoding regulatory DNA sequences of the target gene (Davidson, McClay et al. 2003), which harbor regulatory elements such as promoters and enhancers; these may lie immediately upstream of the gene, but can also be found hundreds of kilobases away (Wittkopp and Kalay 2012). Displaying dosage effects or non-additive effects, independently of or interactively with trans-regulatory variations, cis-regulatory loci are fundamental to many processes, including physiological adaptation, generation of cell diversity, and morphological development (Beer and Tavazoie 2004).

Prior to more recent integrative analyses combining genotype data and gene expression profiles, the approaches for studying cis-acting variants were restricted to in vitro polymorphic reporter construct assays using established tissues or human cell lines. Most cis-acting variants validated through this approach resulted in less than 5-fold difference in gene expression (Rockman and Wray 2002). Moreover, most of these studies only assessed total expression of target genes. They did not control for inter-individual variation in trans-acting factors (Forsberg, Lyrenas et al. 2001), environmental factors (Gebhardt, Zanker et al. 1999) and epistatic interaction of cis- and trans-effects (Hayashi, Watanabe et al. 1991, Brophy, Hastings et al. 2001). Cis-acting variants manifest as allele-specific variation (Buckland 2004).

The more recent approaches to identify cis-acting variants are usually based on genome-wide mapping of eQTLs, and hence they are capable of evaluating cis-acting variants on a large scale (Morley, Molony et al. 2004, Cheung, Spielman et al. 2005, Zhang, Li et al. 2014, Fung, Holdsworth-Carson et al. 2015, Kirsten, Al-Hasani et al. 2015, Lappalainen 2015). In addition to offering high throughput, these approaches study genes in their native background instead of an artificial environment, exploring unknown cis-acting variations across the whole genome. Lastly, this approach integrates with linkage analysis and thus allows the detection of regulatory factors (cis-acting variants or markers in high linkage disequilibrium with them) that act over short and long ranges (Schadt, Monks et al. 2003, Cheung, Spielman et al. 2005, Albert and Kruglyak 2015, Lam, Tay et al. 2015, Lappalainen 2015, Pai, Pritchard et al. 2015). Promoter construct studies remain useful to validate cis-acting variations discovered by such hypothesis-free methods. The identified cis-acting regulatory polymorphisms can provide intuitive clues to novel regulatory motifs and transcription factors regulating a specific gene (Morley, Molony et al. 2004, Fung, Holdsworth-Carson et al. 2015, Kirsten, Al-Hasani et al. 2015).

2.1.3. Epidemiology Study

Lung cancer is associated with a low survival rate, in part, because it typically is at an advanced stage when first detected and treated (Ganti and Mulshine 2006). Studies to improve the post-diagnosis outcome of lung cancer through early detection using low-dose spiral computed tomography (LDCT) screening and surgical intervention are promising (Unger 2006, Lock and Rodrigues 2007, Smith, Manassaram-Baptiste et al. 2015). However, because as many as 90 million active or former smokers in the United States alone are candidates for screening according to the demographic criteria (i.e. age 55-80, > 30 pack-years smoking history and currently smoke or have quit within the past 15 years), the potential cost is very high and may be prohibitive (Ganti and Mulshine 2006, Burger, Kass et al. 2008). Additionally, LDCT screening studies completed thus far are associated with a high incidence of false-positive findings, which may lead to unnecessary follow-up diagnostic testing, including biopsies and surgical procedures, with associated risk and emotional and financial cost to the patient (Vansteenkiste, Dooms et al. 2012, Marshall, Bowman et al. 2013).

Consequently, there is an unmet need for biomarkers to focus LDCT efforts on persons who are at highest risk for lung cancer. A molecular diagnostic that further stratifies the individuals at highest risk for lung cancer within the epidemiologically defined high-risk group will enable more accurate selection of individuals who are most likely to develop lung cancer in their lifetime and reduce the risk and cost of annual LDCT screening. When we investigate the role of candidate genes or genetic variations that contribute to a specific trait/disease by testing for a correlation between trait/disease status and genetic variation, two kinds of observational study designs are most often used: the case-control study and the cohort study.

2.1.3.1. Case-control Study

Case-control studies are often used to identify factors that may contribute to a phenotype or disease by comparing subjects who have that trait/disease (the "cases") with subjects who do not have the trait/disease but are otherwise similar (the "controls"). Data about exposure to a risk factor or several risk factors are then collected retrospectively, typically by interview, abstraction from records, or survey (Song and Chung 2010, Basuli, Stevens et al. 2014). Case-control studies are comparatively quick and inexpensive, and require a relatively small sample size (Lewallen and Courtright 1998). Therefore, a case-control study is usually conducted before a cohort or an experimental study to identify the possible etiology of the disease.

Case-control studies are particularly appropriate when there is good evidence of an association between a certain exposure and the disease and when the disease is rare and the exposure is frequent among the cases. They are particularly appropriate for (1) investigating outbreaks, and (2) studying rare diseases or outcomes (Lewallen and Courtright 1998). The limitations of a case-control study include (1) selection bias; (2) recall bias; (3) inefficiency for rare exposures. Criteria for, or the definition of, cases must be well formulated and documented (Raphael 1987). If cases are misclassified (i.e., include false positives), the findings may be false. Controls should be selected from the same population that gives rise to the cases and independently of their exposure status (Sedgwick 2015). The specific relative measure of effect (rate ratio, risk ratio or odds ratio) that can be estimated from a case–control study depends on the type of sampling design used in the selection of the controls.
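As an illustration of one of these measures, the odds ratio can be computed from a 2×2 table of exposure by case status. The sketch below is a minimal example with hypothetical counts; the confidence interval uses the standard log-odds (Woolf) approximation, which is our choice for illustration and not a method specified in this text.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def odds_ratio_ci95(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Approximate 95% confidence interval from the standard error of ln(OR)."""
    log_or = math.log(
        odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    )
    se = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls
    )
    return math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)


# Hypothetical counts: 100 cases and 100 controls.
table = dict(exposed_cases=40, unexposed_cases=60,
             exposed_controls=20, unexposed_controls=80)
print("odds ratio:", odds_ratio(**table))
print("approximate 95% CI:", odds_ratio_ci95(**table))
```

An interval that excludes 1.0 would suggest an association between exposure and case status in such a table, subject to the biases listed above.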

2.1.3.2. Cohort Study

A cohort study, also called a longitudinal study or incidence study, is appropriate when there is good evidence of an association of the disease with a certain exposure, when exposure is rare and incidence of disease among the exposed is frequent, or when the time between exposure and disease is short (Song and Chung 2010). A cohort study can be prospective or retrospective. Prospective cohort studies begin with disease-free subjects, classify them as exposed or unexposed, record outcomes in both groups, and then compare outcomes using relative risk. In contrast, retrospective cohort studies identify exposed and unexposed groups after both exposure and disease have occurred.

In addition to examining rare exposures, because entrance into a cohort study begins with exposure status, investigators can monitor the occurrence of multiple diseases potentially caused by an exposure. Finally, cohort studies allow the direct measurement of the absolute risk of developing a disease after an exposure (Euser, Zoccali et al. 2009).
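The absolute and relative risk measures mentioned above follow directly from cohort counts. The following is a minimal sketch; the function names and the counts in the usage example are hypothetical and are not data from this study.

```python
def absolute_risk(n_diseased, n_total):
    """Proportion of a group that developed the disease during follow-up."""
    if n_total <= 0:
        raise ValueError("group size must be positive")
    return n_diseased / n_total


def relative_risk(exposed_diseased, exposed_total, unexposed_diseased, unexposed_total):
    """Absolute risk in the exposed group divided by that in the unexposed group."""
    return (absolute_risk(exposed_diseased, exposed_total)
            / absolute_risk(unexposed_diseased, unexposed_total))


# Hypothetical cohort: 1000 exposed and 1000 unexposed subjects.
print("absolute risk, exposed:", absolute_risk(30, 1000))
print("absolute risk, unexposed:", absolute_risk(10, 1000))
print("relative risk:", relative_risk(30, 1000, 10, 1000))
```

Unlike the odds ratio from a case-control design, both quantities here rest on knowing the size of each exposure group at entry, which is why they are natural outputs of a cohort design.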

The weaknesses of a cohort study compared to a case-control study include (1) exposure can change over time (e.g. aging, lifestyle (diet, smoking pattern), air pollution); (2) changes in method over time can affect disease identification; (3) a long follow-up period; (4) high cost; (5) subject selection bias. These issues are especially problematic when studying relatively rare diseases, for instance, lung cancer (Looney and Hagan 2015, Crawford 2016). Overall, we believe a prospective cohort study gives us greater power to detect and address the absolute risk of putative genetic variations considered as rare exposures.

2.2. Identification of cis-acting Variations

2.2.1. Single Nucleotide Polymorphism Analysis

As the most common genetic variations, numerous single nucleotide polymorphisms (SNP) have been shown to be associated with disease susceptibility such as lung cancer risk, abnormal expression levels of genes, and altered gene regulation (Shen, Berndt et al. 2005, Schabath, Wu et al. 2006, Zienolddiny, Campa et al. 2006, Gallegos Ruiz, Floor et al. 2008, Sun, Li et al. 2009, Blomquist, Crawford et al. 2010, Buch, Diergaarde et al. 2012, Tseden-Ish, Choi et al. 2012, Blomquist, Brown et al. 2013, Albert and Kruglyak 2015, Kang, Ma et al. 2015, Yoo, Jin et al. 2015). One can associate SNPs with these diverse phenotypes or eQTLs by the genotype of a single SNP or by a haplotype comprising multiple SNPs. Cis-acting variations can be detected through univariate SNP-based analysis (i.e., genotype analysis) and region-based analysis (i.e., haplotype analysis).

2.2.1.1. Genotype Analysis

As of the end of 2015, over 150 million SNPs have been validated in the human genome and deposited in a public database, compared to 12 million 8 years earlier. Genotyping of SNPs is a procedure that identifies the alleles present at a given polymorphic site. Over the past couple of decades, genotype analysis of candidate SNPs has become an important part of genetic association studies (Morley, Molony et al. 2004) to reveal disease risk loci. To date, high-throughput genotyping is possible with SNP arrays and whole-genome sequencing (Keating, Tischfield et al. 2008, Cirulli and Goldstein 2010, Global Lipids Genetics, Willer et al. 2013, Simmonds 2013). These recently developed methods enable genetic association studies with large-scale genotype data (Cirulli and Goldstein 2010, Simmonds 2013). Moreover, new integrative tools employ predicted gene functions to systematically prioritize the most likely causal genes at associated loci, highlight enriched pathways and identify tissues/cell types where genes from associated loci are highly expressed (Global Lipids Genetics, Willer et al. 2013, Pers, Karjalainen et al. 2015). SNP genotyping is a straightforward process, and the protocol relies on three major components: target amplification, allelic discrimination reactions and allele-specific product detection (Chen and Sullivan 2003, He, Holme et al. 2014). The mechanism used for allelic discrimination largely determines the specificity and accuracy of genotyping.

Primer extension can quantify ASE (i.e. SNuPE) based on radiolabel or fluorescence, as discussed in section 2.2.2.2, and can also identify the nucleotide at a polymorphic site by utilizing the 5’ to 3’ DNA synthesis activity of DNA polymerase and specific terminator nucleotides (ddNTP) complementary to the polymorphic site. The detection system for this can be mass spectrometry, microarray-based hybridization, fluorescence resonance energy transfer (FRET) or DNA sequencing (Nikiforov, Rendle et al. 1994, Pastinen, Partanen et al. 1996, Chen, Levine et al. 1999, Pastinen, Raitio et al. 2000, Shi 2001). High-throughput platforms using primer extension have also been emerging, for example, arrayed primer extension (Kranaster, Ketzer et al. 2008, Jiang, Willner et al. 2013). The accuracy of this method depends heavily on the error rate of DNA polymerases during DNA synthesis, which was reported as non-negligible (from 0.2% to more than 15% per locus) (Kunkel 2004, Pompanon, Bonin et al. 2005). In contrast, allele-specific PCR (AS-PCR) requires two allele-specific primers, each of which anneals to the target with its 3’-terminal base matching one of the two alleles of an SNP (Germer and Higuchi 1999, Myakishev, Khripin et al. 2001, Liew, Pryor et al. 2004, Hayashi, Hagihara et al. 2008). AS-PCR makes use of the difference in extension efficiency between primers with matched and mismatched 3’ bases (Wallace 1991, Shen, Tian et al. 2015).
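The AS-PCR primer layout described above, two primers identical except for the 3’-terminal base, can be sketched as follows. This is a simplified illustration: the function name, the toy sequence and the primer length are hypothetical, and real primer design would also weigh melting temperature, secondary structure and specificity, none of which is modeled here.

```python
def allele_specific_primers(upstream_sequence, alleles, primer_length=20):
    """Build one primer per allele; each ends with that allele at its 3' end.

    upstream_sequence: bases immediately 5' of the SNP on the primer strand.
    alleles: the alternative bases at the SNP site on the same strand.
    """
    shared_length = primer_length - 1
    if shared_length < 1 or len(upstream_sequence) < shared_length:
        raise ValueError("upstream sequence is too short for the primer length")
    shared_part = upstream_sequence[-shared_length:]
    return {allele: shared_part + allele for allele in alleles}


# Hypothetical 10-base flanking sequence and an A/G polymorphism.
primers = allele_specific_primers("ACGTACGTAC", ("A", "G"), primer_length=5)
for allele, primer in primers.items():
    print(allele, primer)
```

Each primer extends efficiently only from the template carrying its matching allele, which is the property the two separate reactions rely on.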

The TaqMan assay uses dually fluorescence-labeled probes that anneal to the two alleles at the SNP site. When a probe anneals to its matching allele, it forms a stable structure that leads to its degradation by the advancing DNA polymerase; when it anneals to the other allele, it forms a less stable structure, so that the probe is displaced from the template without being cleaved. Cleavage of a dually labeled probe changes the status of FRET between the two fluorophores, providing a mechanism for its detection (Biosystems). Because it requires PCR with SNP-specific probes and primers, it is well suited to a few markers but not to a large-scale study with a thousand markers. Although 6.2 million SNP genotyping assays are pre-designed and available off the shelf, probe design for rare variant SNPs remains empirical and requires substantial optimization.

The number of genotyped SNPs can be maximized by choosing tag SNPs, and the relations of tag SNPs to biological pathways can prioritize candidate SNPs for association studies (Global Lipids Genetics, Willer et al. 2013). Utilizing tag SNPs and single-base extension, whole-genome genotyping is currently available through high-density SNP arrays, for example, the Human-1 Genotype BeadChip (Steemers, Chang et al. 2006) and the Infinium HD BeadChip (Illumina). Besides Illumina SNP chips, Affymetrix SNP chips are also popular in GWAS research. The Affymetrix microarray technology relies on the differential hybridization of genomic DNA to 25-mer probes that match SNP alleles, while the Illumina Infinium technology uses hybridization followed by primer extension (Jiang, Willner et al. 2013). Although these SNP arrays use different chemistries, they have several aspects in common. Both rely on the biochemical principle that nucleotide bases bind to their complementary partners (specifically, A binds to T and C binds to G, in Watson–Crick base pairs). Both array protocols call for the hybridization of fragmented single-stranded DNA to arrays containing millions of unique nucleotide probe sequences. Each probe is designed to bind to a target SNP. Both are scalable high-throughput SNP genotyping assays that deliver robust high-quality genotyping data.

2.2.1.2. Haplotype Analysis

Although single-SNP analysis has proven to be useful in discovering many disease-associated loci, this strategy may be limited by a very stringent significance threshold, by epistatic effects and by poor reproducibility (Wu, Kraft et al. 2010). For some genes whose expression was highly heritable, no significant eQTLs were identified even though the power calculation based on sample size suggested ~90% power to detect such loci (Deutsch, Lyle et al. 2005). It is thus evident that in many cases, expression traits are regulated by multiple loci, each of which contributes only modestly to the trait. Recently, several GWAS found that multiple SNPs in the chr15q24-25 region were associated with smoking quantity (cigarettes per day, CPD) or lung cancer risk, while the genotypic odds ratio at each SNP was not significantly different from 1.0 when smoking quantity was controlled (Amos, Wu et al. 2008, Hung, McKay et al. 2008, Thorgeirsson, Geller et al. 2008, Wang, Broderick et al. 2008, Caporaso, Gu et al. 2009, Dai, Zhu et al. 2015, Niu, Wang et al. 2015). Specifically, when two GWAS data sets from the Cancer Genetic Markers of Susceptibility (CGEMS) project were analyzed, association testing for each SNP revealed that none of the SNPs achieved genome-wide significance (p < 10^-7), while multiple SNPs in the chr15q25.1 region spanning the nicotinic receptor genes CHRNA3 and CHRNA5 were associated with CPD (Caporaso, Gu et al. 2009). This suggests the benefit of region-based analysis over SNP-based analysis and has led to more interest in the haplotype, the combination of marker alleles on a single chromosome.
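The stringency of a per-SNP genome-wide threshold can be illustrated with a Bonferroni-style correction. The sketch below is an illustration only: this text does not state how the p < 10^-7 threshold was derived, and the family-wise error rate of 0.05 and the count of 500,000 tested SNPs are assumed values chosen so that the arithmetic lands on that order of magnitude. The SNP labels and p-values are hypothetical.

```python
def bonferroni_threshold(family_wise_alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate."""
    if n_tests <= 0:
        raise ValueError("number of tests must be positive")
    return family_wise_alpha / n_tests


def significant_snps(p_values, threshold):
    """Return the SNP labels whose p-value falls below the threshold."""
    return [snp for snp, p in p_values.items() if p < threshold]


# Assumed values for illustration: alpha = 0.05 and 500,000 tested SNPs.
threshold = bonferroni_threshold(0.05, 500_000)
print("per-SNP threshold:", threshold)

# Hypothetical p-values for three SNPs in one region.
p_values = {"snp_a": 5e-8, "snp_b": 3e-6, "snp_c": 8e-5}
print("SNPs passing the threshold:", significant_snps(p_values, threshold))
```

Several SNPs in one region can each fall short of such a threshold while jointly carrying signal, which is the motivation for the region-based analysis discussed here.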

The 1000 Genomes Project genotyped over 95% of SNPs that are in accessible genomic regions and that have an allele frequency of 1% or higher in 1092 individuals from different ethnic populations. It also provided haplotype information on these SNPs (Genomes Project, Abecasis et al. 2010). Haplotype analysis of SNPs, which was neglected by traditional genetic association studies, provided more information than single-SNP analysis of intramuscular fat percent (IMF) and discovered significantly associated genes (Barendse 2011). In addition, common heritable allelic imbalance phenotypes can be mapped in unrelated individuals to establish regulatory haplotypes (Knight, Keating et al. 2003, Musunuru, Strong et al. 2010, Aroucha, Carmo et al. 2016). For example, the Pastinen group demonstrated a strong regulatory haplotype in the human BTN3A2 locus, which spanned at least 15 kb flanking the gene (Pastinen, Sladek et al. 2004). A pioneering study by Musunuru and colleagues fine-mapped haplotypes associated with low-density lipoprotein cholesterol. A combination of systematic reporter assays and GWAS established that the minor allele at SNP rs12740374 was associated with ASE of SORT1 by creating a C/EBP transcription factor binding site (Musunuru, Strong et al. 2010). There is additionally theoretical and empirical evidence (Akey, Jin et al. 2001, Schaid 2006) that haplotype-based analysis may possess higher power than SNP-based analysis. For example, Aroucha and colleagues evaluated the association of SNPs in the TNF-α gene and SNPs in the IL-10 gene with hepatocellular carcinoma (HCC) and found that a low IL-10 production diplotype was associated with HCC risk (Aroucha, Carmo et al. 2016).

When haplotype, a more powerful discriminator, is used in an association study,

one faces a problem of phase inference, which reconstructs haplotypes for individuals.

Haplotype phase can be estimated using computational approaches or it can be generated

through laboratory-based experimental methods for phasing single individuals (Browning and Browning 2011). Haplotype phasing has been accomplished through several statistical algorithms to infer unknown haplotypes from population genotype data sets and family genotype data sets (Stephens, Smith et al. 2001, Browning and Browning

2007, Howie, Donnelly et al. 2009). Clark’s algorithm, developed in 1990, was the first published method for haplotype phase inference for multiple markers in unrelated individuals, but it is only suitable for tightly linked SNPs (Clark 1990). Following this, a variety of algorithms, such as expectation-maximization (EM) (Excoffier and Slatkin 1995) and hidden Markov models,

were developed to improve haplotype phasing. PHASE, developed by Stephens

and colleagues (Stephens, Smith et al. 2001, Stephens and Scheet 2005), was considered

the gold standard for population-based haplotype phasing algorithms (Marchini, Cutler et


al. 2006). It is suitable for up to 100 markers and up to several hundred individuals.

MACH and IMPUTE2 (Howie, Donnelly et al. 2009) have primarily been used for the imputation of nongenotyped variants but can also be used for haplotype phase inference for larger data sets than PHASE.
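
As an illustration of the EM idea cited above (Excoffier and Slatkin 1995), the following minimal Python sketch estimates haplotype frequencies for two biallelic SNPs from unphased genotypes. It is a simplified teaching example, not the implementation used by PHASE, MACH, or IMPUTE2. The function name em_haplotype_freqs, the 0/1/2 genotype coding, and the toy data in the usage note are assumptions made for illustration. Only double heterozygotes are phase-ambiguous, so the E-step splits each of them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """Estimate two-SNP haplotype frequencies by expectation-maximization.

    genotypes: list of (g1, g2), where g is the count (0, 1, or 2) of the
    allele coded 1 at each SNP (hypothetical coding chosen for this sketch).
    Returns a dict mapping haplotype (allele at SNP1, allele at SNP2) to
    its estimated frequency.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    n_chrom = 2 * len(genotypes)
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}  # genotype -> two alleles

    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10.
                p_cis = freqs[(0, 0)] * freqs[(1, 1)]
                p_trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = p_cis + p_trans
                w = p_cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # At most one heterozygous site: phase is unambiguous.
                a1, a2 = alleles[g1], alleles[g2]
                counts[(a1[0], a2[0])] += 1.0
                counts[(a1[1], a2[1])] += 1.0
        # M-step: new frequencies are expected counts per chromosome.
        new_freqs = {h: c / n_chrom for h, c in counts.items()}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in freqs)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs
```

For example, calling em_haplotype_freqs on four (0, 0) individuals, four (2, 2) individuals, and two (1, 1) individuals assigns the double heterozygotes almost entirely to the 00/11 configuration, because the 01 and 10 haplotypes are never observed unambiguously.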

By contrast, sequencing is an experimental phasing approach that automatically produces some information on phase. Experimental phasing is expensive and labor-intensive; nevertheless, it is more direct and accurate than computational phasing.

Dear and Cook have proposed concepts for extensively resolving local haplotypes (Dear

and Cook 1989) and Sauer and Olson developed clone-based, targeted haplotyping methods for long segments of the human genome (Burgtorf, Kepper et al. 2003,

Raymond, Subramanian et al. 2005). Since 2005, next generation sequencing (NGS) has become considerably less expensive than Sanger sequencing, but reads are shorter, up to 150 bp with paired-end sequencing (Illumina), providing less information for phasing.

In 2011, Kitzman et al. were the first to directly phase 94% of ascertained heterozygous SNPs experimentally into long haplotype blocks (N50 of 386 kilobase pairs (kbp)) by combining a fosmid library with NGS to determine the haplotype-resolved genome of a

South Asian individual (Kitzman, Mackenzie et al. 2011). Nevertheless, large-insert cloning is still technically challenging and not readily scalable. To address this limitation,

Paul and Apgar described a single molecule dilution followed by multiple strand displacement amplification for targeted haplotyping the human leukocyte antigen (HLA) genes (Paul and Apgar 2005). Peters and colleagues recently reported haplotyping from

10-20 human cells using long fragment read assembled by NGS without cloning (Peters,

Kermani et al. 2012). More recently, Kaper et al. (Kaper, Swamy et al. 2013)


implemented this technology into Illumina Genome Analyzer IIx and successfully phased

96% of heterozygous SNPs into 9,243 haplotype blocks with average size of 264 kb,

maximum size of 4.8 Mb and an N50 of 702 kb. Finally, Kuleshov et al. (Kuleshov, Xie

et al. 2014) applied a statistical algorithm to partially phased information contained in

long fragments assembled by NGS-generated short reads and phased 99% of SNPs into

0.2-1 Mb haplotype blocks to determine allele-specific methylation patterns in human genome. In general, improvements of sequencing technologies will enable researchers to assemble haplotypes from sequencing data with very high accuracy and it opens up the

opportunity to use high-quality haplotypes and genotypes in sequencing association

studies.
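
The N50 values quoted above summarize haplotype block contiguity. As a minimal sketch, assuming the usual definition (the block length L such that blocks of length L or longer together contain at least half of the total phased sequence), N50 can be computed as follows. The function name n50 and the example block lengths are hypothetical and are not taken from the cited studies.

```python
def n50(block_lengths):
    """Return the N50 of a collection of haplotype block lengths.

    Blocks are sorted from longest to shortest and accumulated until the
    running total reaches half of the summed length; the length of the
    block at which this happens is the N50.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one block length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical block lengths in kb, for illustration only.
example_blocks_kb = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_blocks_kb)
```

Because N50 is weighted by length, a few very long blocks can give a large N50 even when most blocks are short, which is why the studies above also report the number of blocks and the average and maximum block size.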

Our lab used competitive PCR to quantitate the absolute amount of ASE. By

comparing ASE within same sample from NBECs of cases and controls, interindividual

variation in trans-factors and environmental factors is minimized and population-based

studies can be performed to sort out the single cis-acting SNP or haplotypes within each

locus. Compared to genotype of an individual SNP, haplotype is a more powerful

discriminator between cases and controls in disease GWAS. Use of haplotypes in GWAS

reduces the number of tests to be carried out and is necessary for linkage analysis. With

haplotypes we can identify causal SNPs regulating genes nearby, understand gene

regulation mechanism, and conduct disease association studies as well as evolutionary

studies.

2.2.2. Empirical Approaches to Assess the Effect of cis-acting Variations on

Transcript Abundance

2.2.2.1.Assessing cis-acting Effects by Total Expression


To localize cis-acting determinants of eQTLs in humans, Cheung and colleagues combined microarray expression data with publicly available SNP genotype data and applied genome-wide mapping linkage analysis to identify master regulators of transcription (Morley, Molony et al. 2004, Cheung, Spielman et al. 2005, Albert and

Kruglyak 2015, Fung, Holdsworth-Carson et al. 2015, Lam, Tay et al. 2015, Lappalainen

2015, Pai, Pritchard et al. 2015). They found an approximately 1.44- to 8-fold difference in mean expression level of target genes, suggesting that the degree of differential total expression attributable to cis-effects varies considerably (Morley, Molony et al. 2004).

Similar platforms were used to identify the common genetic variations that explain total gene expression differences among different sets of individuals (Monks, Leonardson et al.

2004, Stranger, Forrest et al. 2005). More recently, the Grundberg group used Illumina

Human HT-12 V3 BeadChips to profile genome-wide RNA expression and reveal the contribution of low-frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility (Grundberg, Small et al. 2012).

These microarrays or bead arrays are relatively prevalent so quite a few analysis tools are available. Needless to say, common sources of error can be sorted out by carefully following standard quality control steps (Churchill 2002, Dumur, Nasim et al. 2004,

Zakharkin, Kim et al. 2005, Guindalini and Pellegrino 2016). Additionally, the high-density coverage of probes can assay the known regions of the transcriptome at once.

In contrast to the massively parallel platforms above, non-parallel methods, for example

TaqMan real-time PCR (Deutsch, Lyle et al. 2005), have also been applied to assess differences in total expression level of genes of interest affected by causal cis-acting variations. These methods test for correlation between SNP genotype and gene


expression level to infer the cis-acting effects. The restriction to local genes, nevertheless, limits the implementation for genome-wide study.

Using total expression of target genes to infer cis-acting variations, however, has poor reproducibility for the reasons discussed at the beginning of section 2.2.2, including interference from interindividual variation in trans-regulatory DNA sequences and epistatic mechanisms, copy number variation, and environmental factors. Besides these clearly seen perils and pitfalls, total expression data were found to be highly discordant between the same sample in different laboratories, between platforms, and even between replicates on the same platform (Pastinen, Ge et al. 2006). Even strong trans-acting signals are less likely to show reproducible expression of the target gene (r2-values are

lower) across studies compared with the cis-acting signals (Pastinen, Ge et al. 2006) and

the authors concluded that cell culture is a major source of variation and false-positive signals. These findings, along with other evidence (Hayashi, Watanabe et al. 1991,

Gebhardt, Zanker et al. 1999, Brophy, Hastings et al. 2001, Forsberg, Lyrenas et al. 2001,

Buckland 2004) support that an environmental confounder could create a spurious trans-

linkage or association, thus only the strongest cis-acting signals may seem replicable.

2.2.2.2.Assessing cis-acting Effects by Allele-Specific Expression

In addition to total expression, cis-regulatory effects can be alternatively

measured by allele-specific expression (ASE). ASE assays (Yan, Yuan et al. 2002,

Pastinen and Hudson 2004, Pastinen, Sladek et al. 2004, Song, Kim et al. 2012, Muir,

Perumbakkam et al. 2014) are optimal for detecting cis-acting effects, as each allele

serves as an internal control for the other, and trans-acting effects or environmental

conditions that differentially influence gene expression among samples should not


interfere. Only cis-acting changes in the relative expression of alleles yield reproducible differences between allelic abundances of transcripts. The cis-regulatory effects on ASE of a gene can be assessed by in vitro and in vivo methods.

Several different methods have been described to quantitate ASE in a number of genes. One of those methods is single nucleotide primer extension (SNuPE) where a transcribed polymorphism is used as a marker to distinguish between the mRNA products of the parental chromosomes, and radiolabeled or fluorescent nucleotides are used to distinguish the mRNA product of each allele (Singer-Sam, LeBon et al. 1992, Penny,

Kay et al. 1996, Bhatnagar, Zhu et al. 2014). Fluorescent dideoxy terminator–based method quantitated by RT-PCR revealed cis-acting inherited variations in gene expression by comparing the relative abundance of each allele of the same genes from a heterozygous individual in 96 CEPH samples (Yan, Yuan et al. 2002). A sophisticated method that enables measuring absolute ASE or AI is competitive PCR with a known number of artificial DNA standards spiked in reverse transcribed cDNAs (Apostolakos,

Schuermann et al. 1993). Ding and colleagues developed an approach combining competitive PCR and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry that is capable of relative and absolute quantification of gene expression with high sensitivity, and claimed high throughput and accuracy similar to other methods (Ding and Cantor 2003). Gel electrophoresis following competitive PCR accurately quantitated absolute ASE or allelic imbalance (ASE ratio, the ratio of transcripts derived from the respective alleles) with excellent linear dynamic range and >95% allelic specificity, thus enabling measurement of a 400-fold range of one allele to the other (i.e., allele A/B ratio from 1:20 to 20:1)

(Blomquist, Crawford et al. 2010).


Both gene copies come from the same sample and have been subject to the same

environmental influences, including genetic trans-acting factors, and to the same experimental variations, including mRNA degradation. This makes the ASE ratio or AI a powerful approach for determining the cis-acting effects on gene regulation (Yan, Yuan et al. 2002, Pastinen and Hudson 2004, Ge, Pokholok et al. 2009, Bell, Kane et al. 2013). As stated at the beginning of this section, in the absence of either cis-acting sequence variation or epigenetic effects affecting expression of the target mRNA, each chromosome should be equally expressed regardless of the absolute level of total transcript abundance. In samples that are heterozygous for a cis-acting regulatory variant or epigenetic modification, mRNA originating from one chromosome will be expressed at a higher level than that from its homologous chromosome, and this can be detected by changes in the ASE ratio or AI. The experimental variation in this ratio measured in cDNA reverse transcribed from an mRNA sample can be controlled by using matched gDNA as a control.
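
The gDNA control described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis pipeline used in this work: it assumes that the two alleles are present at 1:1 in gDNA from a heterozygote, so any departure of the measured gDNA ratio from 1 reflects assay bias that can be divided out of the cDNA ratio. The function names and the example signal values are hypothetical.

```python
import math


def normalized_allelic_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """Allele A / allele B ratio in cDNA, corrected by matched gDNA.

    Each argument is a positive signal (for example read counts or peak
    areas) for one allele in one template type.
    """
    if min(cdna_a, cdna_b, gdna_a, gdna_b) <= 0:
        raise ValueError("all allele signals must be positive")
    cdna_ratio = cdna_a / cdna_b
    gdna_ratio = gdna_a / gdna_b  # expected to be 1 without assay bias
    return cdna_ratio / gdna_ratio


def log2_allelic_imbalance(cdna_a, cdna_b, gdna_a, gdna_b):
    """Symmetric measure of imbalance: 0 means equal allelic expression."""
    return math.log2(normalized_allelic_ratio(cdna_a, cdna_b, gdna_a, gdna_b))


# Hypothetical signals: allele A is over-represented 3:1 in cDNA, while
# the assay itself favors allele A by 10% in gDNA.
example_ratio = normalized_allelic_ratio(300.0, 100.0, 110.0, 100.0)
```

Reporting the ratio on a log2 scale treats a 2:1 and a 1:2 imbalance as equal in magnitude and opposite in sign, which is convenient when the choice of which allele to call A is arbitrary.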

In spite of the obvious benefits stated above, the drawbacks of the allelic imbalance approach are the lack of validated high-throughput assays for human genes (Yan, Yuan et al. 2002, Knight 2012), the limitation to samples that are heterozygous (Ge, Pokholok et al. 2009), and the fact that non-heritable factors (such as epigenetic events) may influence allelic representation (Jaenisch and Bird 2003, Higgs, Vernimmen et al. 2007,

Hutchinson, Raj et al. 2014).

2.2.3. Experimental Approaches to Assess cis-acting Variations

2.2.3.1.In vitro Approaches

In vitro methods most commonly involve transient transfection of a synthetic reporter construct containing candidate regulatory polymorphisms into existing cell lines


and tissues, followed by assessment of transcriptional activity through the reporter assay. They are therefore

best suited for hypothesis-driven studies to test whether putative regulatory

polymorphisms affect gene expression. Rockman and colleagues surveyed more than 400

studies through 2001 in which each cis-regulatory variant has been studied in depth as the

best approach presently available to study the dynamics of functional cis-regulatory

variants (Rockman and Wray 2002). The authors claimed that the functional

consequences of a cis-regulatory polymorphism depend on cell type, culture condition,

the distribution of exogenous inducers, and covariation at other sites in the genome.

These variations have been explained in section 2.2.2.1. Covariation at other sites in the

genome refers to numerous factors, for example, the transcriptional environment within

the cells. Transfection of the homozygous G allele of a SNP at -1607 bp in the matrix

metalloproteinase (MMP)-1 resulted in a low level of MMP-1 transcription in normal

fibroblasts, but a higher level in melanoma cells (Rutter, Mitchell et al. 1998). This

indicates that combination of cis-acting sequences in the MMP-1 promoter and specific trans-acting factors can dramatically increase transcription. Maurano and colleagues have identified that cis-regulatory SNPs perturb transcription factor recognition sequences and these cis-regulatory SNPs were tissue-selectively enriched (Maurano, Humbert et al.

2012). Transient transfection studies thereby may be complicated by trans-acting influences on allelic expression.

Moreover, the experimental methods that most reporter construct studies used introduced additional variation. For example, a cis-regulatory variant may have no effect on transcription unless the necessary introns or downstream elements that physically interact with it to transduce the transcriptional output are present in the construct. The


spatial proximity of relevant DNA elements, irrespective of their relative genomic

position, has been demonstrated (Miele and Dekker 2009), although not very well

defined. Therefore, the putative promoter or upstream flanking regions most commonly

targeted by studies are often poorly characterized and do not represent the complete

promoter that is active in the cell line. In addition to the experimentally validated

promoter database, the initial choice of allele-specific constructs for transfection studies can be refined by deletion experiments (Zhang, Min et al. 1995, Xu, Berglund et al.

1999). More faithful reproduction of the natural gene regulation system, containing proximal promoter regions and enhancers, can be achieved by cloning whole human genes in bacterial vectors (Wade-Martins, Saeki et al. 2003). Thirteen out of 17 cloned promoters reliably showed a greater than 2-fold increase in transcription activity relative to the control across three human cell lines (Hoogendoorn, Coleman et al. 2003), suggesting that >30% of proximal promoters may harbor cis-acting variants. However, most published studies have used relatively small promoter constructs (<1 kb) or oligonucleotide subsets of the minimal promoter (Zhang, Min et al. 1995, Cooper, Trinklein et al. 2006).

In addition, recent knowledge and methods gained from lung cancer genome-wide association studies (GWAS) (Amos, Wu et al. 2008, Wang, McKay et al. 2014) and

Encyclopedia of DNA Elements (ENCODE) studies are valuable for identification of potential cis-regulatory variations (Djebali, Davis et al. 2012, Melnikov, Murugan et al.

2012). Massively parallel reporter assay (MPRA) was developed where one can

synthesize and clone DNA regulatory elements to generate a library of reporter

constructs. Tag expression is then assayed by high-throughput sequencing

(Melnikov, Murugan et al. 2012). Generally, transient transfection studies to identify cis-regulatory variants are often performed in preexisting human cell lines or tissues, which

is one source of variation relative to the human tissue of interest. Small trans-acting differences

resulting from other genetic variants (cis- and/or trans-) in the host would be another significant source of variation (Rutter, Mitchell et al. 1998).

2.2.3.2.In vivo Approaches

In vivo monitoring of relative allelic expression, first reported with the single-nucleotide primer extension (SNuPE) assay in 1992 (Singer-Sam, LeBon et al. 1992), is possible in tissues or cells of individuals heterozygous for a single regulatory variant or harboring specific haplotypes of well-circumscribed regions. The details of how SNuPE works have been discussed in section 2.1.2.2. This amplification-based in vivo method has several advantages: 1) quantifiable and sensitive allele-specific discrimination based on specific nucleotide selection by the DNA polymerase bound to the DNA template; 2) the capability of quantifying relative expression of two alleles in the same tissue sample.

The SNuPE assay described by Singer-Sam et al. was able to quantitatively measure changes in 0.1% of a transcript population, which was adequate for ASE study, and has been successfully applied to identify the master regulator Xist for X chromosome inactivation

(Penny, Kay et al. 1996). In addition to radiolabeled or fluorescently labeled primers, allele-specific oligonucleotide primers with a 3’-terminal mismatch have been shown to discriminate the mRNA products derived from a specific allele with less labor.

When the 3’-end nucleotide of one primer is perfectly matched with desired DNA template, the PCR product will extend normally. On the other hand, when the 3’-end nucleotide of the primer is mismatched, the PCR reaction will be blocked or proceed at a reduced efficiency (Wallace 1991). Without additional steps of probe hybridization,


ligation, or restriction enzyme cleavage, two allele-specific oligonucleotide primers

correctly determined the genotypes in 12 individuals (Wu, Ugozzoli et al. 1989, Milbury,

Li et al. 2009). However, spurious amplification products were found together with the

expected ones. Concentrations of PCR buffer with optimal Mg2+ and allele-specific

primers, the number of PCR cycles, annealing temperature, etc. need to be carefully

adjusted to optimize the specificity (Wallace 1991).

In the 2000s, Morley et al. from Cheung’s group, mentioned earlier, compared the expression of the two alleles of the same marker using allele-specific quantitative RT-PCR

(qRT-PCR) and confirmed the allelic differences in cis (mean fold difference = 1.6)

(Morley, Molony et al. 2004). This showed that qRT-PCR combined with allele-specific priming is accurate for detection of ASE. More recently, more precise and

sensitive quantification of alleles present in a sample has become possible using digital techniques

(Pekin, Skhiri et al. 2011, Mazutis, Gilbert et al. 2013), for example to genotype SNPs in circulating

tumor DNA samples (Diaz and Bardelli 2014).

In contrast to in vitro methods, which are often confounded by variation in cell

type, environmental factors, and trans-acting factors in unrelated individuals, in vivo

methods offer several advantages (Pastinen and Hudson 2004): (i) alleles are expressed in their

normal environment including genomic and chromatin context; (ii) comparison of alleles

is made within rather than between samples, maximizing the sensitivity of detecting cis-

acting effects by minimizing inter-individual variation in trans-factors; (iii) the

developmental and physiologic history of the tissue is unlikely to be perturbed by the

presence of two low- or two high-expressing alleles; and (iv) population-based studies

allow sampling of haplotype diversity within each locus.


2.3. Quality Controlled Molecular Diagnostic Tests Based on RNA

2.3.1. Reverse Transcription-Polymerase Chain Reaction

Reverse transcription-polymerase chain reaction (RT-PCR) is a powerful tool for the detection and quantification of mRNA. It has high sensitivity, good reproducibility, and a wide dynamic range of quantification. A cDNA reverse transcribed from mRNA is amplified by PCR. One can detect the amplified product at “end-point” or in “real-time”. End-point determinations analyze the reaction after it is completed, and real-time determinations monitor the reaction in the thermal cycler as it progresses (Freeman,

Walker et al. 1999, Pfaffl 2004).

End-point detection methods include ethidium bromide gel staining, radioactivity labeling, high performance liquid chromatography, Southern blotting, fluorescence labeling, or densitometry analysis (Ferre 1992, Reischl and Kochanowski 1995). Without

appropriate controls, this post-PCR step leads to high intra- and inter-assay variability and a

lower dynamic range than real-time PCR. This property is a drawback for quantitative

measurements because small differences in the multiplication factor lead to large

differences in the amount of product (Raeymaekers 2000, Pfaffl 2004).
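
To make the sensitivity of end-point measurements concrete, the following sketch uses the standard exponential model of PCR, in which the product after n cycles is the starting copy number multiplied by (1 + E) to the power n, where E is the per-cycle amplification efficiency (1.0 for perfect doubling). The model ignores the plateau phase, and the efficiencies and cycle number used in the example are assumed values chosen only for illustration.

```python
def endpoint_product(start_copies, efficiency, cycles):
    """Product copies after exponential amplification (no plateau)."""
    return start_copies * (1.0 + efficiency) ** cycles


def fold_difference(efficiency_a, efficiency_b, cycles):
    """Ratio of end-point products for two reactions that start from the
    same copy number but amplify with different efficiencies."""
    return ((1.0 + efficiency_a) / (1.0 + efficiency_b)) ** cycles


# Two tubes with identical starting template, 95% versus 90% efficiency,
# compared after 30 cycles (hypothetical values).
example_fold = fold_difference(0.95, 0.90, 30)
```

Under these assumptions, a difference of only five percentage points in efficiency produces a roughly two-fold difference in product after 30 cycles, even though both tubes started with the same number of template molecules.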

Heid et al. developed real time PCR measuring PCR product accumulation

through dual-labeled fluorogenic probes in which one fluorescent dye serves as a

reporter, FAM (6-carboxyfluorescein), and its emission is quenched by the second

fluorescent dye, TAMRA (6-carboxy-tetramethylrhodamine). Nuclease degradation of

the hybridization probe during each PCR cycle releases the quenching of the FAM

fluorescent emission (Holland, Abramson et al. 1991, Heid, Stevens et al. 1996). Real-

time PCR does not require post-PCR sample handling, preventing potential PCR product


carry-over contamination and resulting in much faster and higher throughput assays. This

method has an accurate dynamic range of 7 to 8 log orders of magnitude of starting target

molecule determination (Heid, Stevens et al. 1996, Wong and Medrano 2005). The

quantification cycle (Cq) is defined by software as the cycle when sample fluorescence

exceeds a chosen threshold above calculated background fluorescence. The Cq is

dependent on the starting template copy number, the efficiency of PCR amplification,

efficiency of cleavage or hybridization of the fluorogenic probe, and the sensitivity of

fluorescence detection. A Cq value is reported for each sample and can be translated into

a quantitative result by constructing a standard curve or comparing reference Cq values.

Generally, two quantification strategies are used in real-time RT-PCR, relative or

absolute quantification. Both methods need normalization to correct for sample-to-sample

variations in loading.
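
The translation of a Cq value into a quantity through a standard curve can be sketched as follows. This is a minimal example, assuming the usual linear relationship between Cq and the base-10 logarithm of starting copy number, fitted by ordinary least squares, with efficiency derived from the slope as 10^(-1/slope) - 1. The function names and any dilution series supplied to them are hypothetical, and the sketch does not reproduce the software of any particular instrument.

```python
import math


def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    if len(copies) != len(cq_values) or len(copies) < 2:
        raise ValueError("need at least two matched standards")
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Per-cycle efficiency; 1.0 corresponds to perfect doubling."""
    return 10 ** (-1.0 / slope) - 1.0


def copies_from_cq(cq, slope, intercept):
    """Interpolate the starting copy number of an unknown sample."""
    return 10 ** ((cq - intercept) / slope)
```

A slope close to -3.32 corresponds to an efficiency close to 1.0 (100%), because each ten-fold dilution then delays the Cq by log2(10), or about 3.32 cycles.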

2.3.1.1. Relative Quantification

In relative quantification, the relative mRNA levels of two or more genes are compared across samples. For example, the fold difference of a target gene versus a reference gene (housekeeping or endogenous control gene) in one sample is compared with that in another sample. A reference gene is assumed to have constant mRNA levels across the samples being compared. Therefore, the fold difference of target mRNA relative to one or more reference mRNAs can be reported. The original RNA concentration is not reported because the result is a ratio of the expression level of the target gene to that of the reference gene. This quantification method is adequate for investigating physiological changes by comparing control and treated samples (Pfaffl 2004).


Basically, there are two calculation methods of the relative quantification, with

efficiency correction (Pfaffl 2001) and without (Livak and Schmittgen 2001). For

efficiency correction, the slope of the standard curve is required. A significant statistical

bias can be introduced that results in misleading biological interpretation when the

expression levels of target and normalizer are too different or when the target gene is

expressed at very low levels, as the relationship between the two may not be linear in

these cases (Vandesompele 2009).
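
The two calculation methods can be written compactly. The sketch below follows the published forms of the 2^(-ΔΔCq) method (Livak and Schmittgen 2001) and the efficiency-corrected ratio (Pfaffl 2001) as they are generally stated; the function and parameter names are assumptions made for this illustration. In the efficiency-corrected version, the amplification factor is the fold increase per cycle (2.0 for perfect doubling), which in practice would be derived from the slope of each gene's standard curve.

```python
def livak_ratio(cq_target_treated, cq_ref_treated,
                cq_target_control, cq_ref_control):
    """Relative expression without efficiency correction: 2^(-ddCq).

    Assumes both target and reference amplify with perfect doubling.
    """
    delta_treated = cq_target_treated - cq_ref_treated
    delta_control = cq_target_control - cq_ref_control
    return 2.0 ** (-(delta_treated - delta_control))


def pfaffl_ratio(amp_target, amp_ref,
                 cq_target_treated, cq_ref_treated,
                 cq_target_control, cq_ref_control):
    """Relative expression with gene-specific amplification factors.

    amp_target and amp_ref are per-cycle fold increases (2.0 = 100%).
    """
    target_term = amp_target ** (cq_target_control - cq_target_treated)
    ref_term = amp_ref ** (cq_ref_control - cq_ref_treated)
    return target_term / ref_term
```

When both amplification factors equal 2.0 the two functions return the same value, so the efficiency-corrected form can be read as a generalization of the uncorrected one.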

Even if there are issues about reference gene, relative quantification is more

convenient than absolute quantification, because it requires less stringent controls.

However, when performing relative quantitation, the data (Cq) used for comparison are

arbitrary values and only applicable to the samples run within the same PCR. For

comparison of PCR results from two different experiments, it is necessary to include a

standard control in every plate or run (Wong and Medrano 2005).

2.3.1.2. Absolute Quantification

Absolute quantification reports the final result in copy numbers per total RNA

concentration, per genome, per cell, per gram of tissue, per ml blood, etc. Generally,

there are two approaches for absolute quantification, “standard curve” and “competitive

PCR”. The standard curve approach measures expression of a particular gene using serial

dilutions of known copy number or concentration of a selected sequence in a separate

PCR assay. Competitive PCR measures endogenous gene expression relative to known

numbers of synthetic RNA or DNA sequences placed in same PCR assay. In cases where

data to be compared are assayed on different days or in different laboratories, absolute


quantitation may be preferred because results are based on constant reference agents

(Peirson, Butler et al. 2003, Wong and Medrano 2005).

2.3.1.2.1. Competitive PCR

The importance and need for standardization and extended quality control studies

in RT-PCR were emphasized by researchers (Vlems, Ladanyi et al. 2003, Tichopad,

Kitchen et al. 2009, Ruijter, Pfaffl et al. 2013). The significant difference between our assays

(i.e., both two-color fluorometric real-time PCR and Star-Seq) and commonly used real-time PCR and NGS is the inclusion of a known number of synthetic internal

competitive template molecules for each gene in one mixture.

A competitive PCR approach for quantitation of mRNA or DNA was first

introduced by Gilliland et al. (Gilliland, Perrin et al. 1990). The general concept of competitive PCR consists of co-amplification of two different templates sharing the same primer recognition sequences in the same tube. This approach ensures identical thermodynamics and amplification efficiency for both template species. The PCR tube containing the target templates is spiked with a known quantity of synthetic competitive templates, and the ratio of their initial copy numbers remains constant throughout the amplification to endpoint. Their amplicons can be distinguished by the addition of a restriction enzyme site to the standard (Gilliland, Perrin et al. 1990), or by varying its size or sequence (Gililland, Tseng et al. 1992, Celi, Zenilman et al. 1993, Scheuermann and

Bauer 1993, Blomquist, Crawford et al. 2013, Yeo, Crawford et al. 2014).
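
The arithmetic behind competitive PCR follows directly from the constant-ratio property described above: because both templates amplify with the same efficiency, the ratio of their products equals the ratio of their starting copy numbers, and the unknown target quantity is the known competitor quantity multiplied by the measured product ratio. The sketch below is a minimal illustration of that reasoning, not the analysis code used in this work; the function names, copy numbers, efficiencies, and cycle number are hypothetical.

```python
def coamplify(target_copies, competitor_copies, efficiency, cycles):
    """Simulate co-amplification in one tube with a shared efficiency.

    Returns (target product, competitor product) after the given number
    of cycles, using the simple exponential model without a plateau.
    """
    factor = (1.0 + efficiency) ** cycles
    return target_copies * factor, competitor_copies * factor


def estimate_target_copies(target_signal, competitor_signal,
                           competitor_copies_added):
    """Starting target copies inferred from the end-point product ratio."""
    if competitor_signal <= 0:
        raise ValueError("competitor signal must be positive")
    return competitor_copies_added * target_signal / competitor_signal


# A well-performing tube and an inhibited tube (hypothetical values):
# both contain 5,000 target copies and 10,000 competitor copies.
good = coamplify(5000, 10000, efficiency=0.95, cycles=35)
inhibited = coamplify(5000, 10000, efficiency=0.60, cycles=35)
estimate_good = estimate_target_copies(good[0], good[1], 10000)
estimate_inhibited = estimate_target_copies(inhibited[0], inhibited[1], 10000)
```

In this simulation the two tubes yield very different amounts of product, yet both estimates recover the same starting target quantity, which is the sense in which the competitor acts as an internal control.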

The synthetic competitive RNA molecules could be spiked into the RNA samples before RT (Becker-Andre and Hahlbrock 1989, Vlems, Ladanyi et al. 2003, Pachmann,

Clement et al. 2005) or synthetic competitive DNA molecules that differ from the cDNA


of interest could be added to the endogenous cDNA following RT (Crawford, Peters et al.

2001). By including a known number of competitor RNA in RNA sample prior to RT, variable effects due to differences in conditions of the RT and the PCR amplification could be internally controlled (Wang, Doyle et al. 1989). However, pipetting errors may occur at two points: placing the same amount of RNA from each sample into its respective RT reaction and pipetting cDNA from each RT reaction into each PCR assay.

Further, once transcribed, competitor RNA may degrade during long-term storage, and heteroduplex formation between the nearly identical standard and target can result in variable sensitivity and accuracy (Henley, Schuebel et al. 1996, Bustin 2000).

When fluorescence is used to detect the endogenous and standard amounts (e.g.,

TaqMan probes, DNA intercalating dye), it is recommended that the competitive templates should be within 10-fold ratio of the target cDNA molecules because there are upper and lower limits of detection of fluorescence (Ferre 1992, Reischl and

Kochanowski 1995, Raeymaekers 2000). If an RNA competitor is used during RT and will be detected using a fluorescence method, it is not possible to know ahead of time the proper amount of competitive template for each gene to be placed into the RNA sample before RT. Therefore, several RT reactions per sample (i.e., a constant amount of the target of interest and varying amounts of competitor) will be needed. This is a very cumbersome process that depletes RNA stocks and limits analysis of the cDNA to the genes for which a competitive template was included in RT. When synthetic DNA competitive templates are spiked in to control for cDNA loaded into the PCR reaction, by contrast, one can run a series of PCR reactions with different competitor:target ratios, and the more abundant cDNA, rather than the limited source of RNA, will be consumed. Thus,


when fluorescence emission is used to detect the cDNA signal, Willey and colleagues

state that measurement of cDNA with synthetic DNA competitor is most practical

(Willey).

However, when non-fluorescent methods of detection (e.g., next generation

sequencing as discussed later) are used it is not necessary to keep the NT and IS within

close ratio (Blomquist, Crawford et al. 2013) and RNA or cDNA standards may be

employed.

Ultimately, the main advantage of competitive PCR is that the results are not

affected by tube to tube variations in amplification efficiency controlling for inter-sample variation in interfering substances (Bustin 2000).

2.3.1.2.2. Multiplex Two-Color Fluorometric Real-Time PCR with Quality Control

Chemoresistance and chemosensitivity assays have been investigated as in vitro

diagnostic tests intended to identify drugs which work most effectively for tumor in an

individual patient. Empiric chemotherapy is selecting a chemotherapy regimen based on

clinical trial evidence and on characteristics that determine whether a patient is likely to

benefit from certain drugs. In contrast, assay-directed chemotherapy, contributing to personalized medicine, is an alternative where tumor cells of the patient are treated with and without particular drug and the responses are analyzed (Samson, Seidenfeld et al.

2004). These molecular diagnostic tests can be applied to tissue after biopsy or blood cells

containing nuclei because genetic variability and the occurrence of specific

polymorphisms may affect susceptibility to tumor and the type of response to the therapy.

The most commonly used assay in pathology laboratories is immunohistochemistry (IHC),

which is not quantitative. Ziegler et al. have reviewed DNA predictive markers under


investigation (Ziegler, Koch et al. 2012). Research has shown that epidermal growth factor receptor (EGFR) mutation in patients with advanced non-small cell lung cancer

(NSCLC) was associated with resistance to first-line EGFR tyrosine kinase inhibitor therapy. Thus, a test for EGFR mutation for NSCLC patients has been approved by the

FDA (Keedy, Temin et al. 2011). An example of gene expression analysis as a predictive

biomarker is transcript level of the ERCC1, a gene encoding the key enzyme for DNA

repair, measured by quantitative RT-PCR. Other genes have been reported to affect chemo response (Mok 2011, Korpanty, Graham et al. 2014, Chamizo, Zazo et al. 2015).

Based on clinical evidence of values in predicting chemotherapy response in NSCLC

patients, we developed multiplex two-color fluorometric real-time PCR assays with

quality control for 10 predictive biomarkers: ERCC1, RRM1, MRP2 in response to

cisplatin; EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS in response to targeted chemotherapeutic agents.

Our two-color fluorometric PCR method is a competitive PCR for absolute quantification in real-time PCR format for tests comprising multiple gene analytes. It is designed for quantification of challenging mRNA samples, such as formalin-fixed

paraffin embedded (FFPE) samples, which have chemical modification, physical

fragmentation, and formalin-induced crosslinking among nucleic acids or between nucleic acid and

protein (Yeo, Crawford et al. 2014). In contrast to the other commercially

available tests, e.g. COBAS and Abbott methods for the HIV-1 test, two main differences

with our method are the application of internal standards mixture (ISM) and external

standards mixture (ESM) for the quality control. Both COBAS and Abbott assays are

developed for only one target gene (HIV-1), not for multiple genes (i.e., targets and

references). As at least one reference gene is needed for the loading control, a method for

measuring at least two genes (i.e., a target and a reference) is necessary for accurate

transcript abundance measurements.

The competitive IS molecule was designed with 4-6bp (two-color fluorometric

real-time PCR) and 6bp (Star-Seq) difference from each native target gene template

(NT). This alteration from NT could be distinguished by probe (two-color fluorometric

real-time PCR) or sequencing program (Star-Seq). After synthesis, the IS for each gene were combined in an ISM at a constant ratio, and the ISM was co-amplified with an unknown number of NT copies in a single PCR reaction, for direct quantification (two-color fluorometric real-time PCR) or library preparation for sequencing (Star-Seq). The ISM is thus a mixture of synthetic competitive internal standards (IS) with a known concentration of IS for each of the multiple genes represented (e.g., one or more reference genes and one or more target genes). In each PCR reaction, the ISM controls for inter-sample variation due to the presence of interfering substances and prevents false negative results. If the same ISM is used across PCR experiments and laboratories, it controls for analytical variation and thus enables comparison of gene expression measurements across multiple runs in different laboratories (Crawford, Peters et al. 2001, Crawford, Warner et al. 2002, Canales, Luo et al. 2006, Huggett, Novak et al. 2008).

ESM is a mixture containing a 1:1 ratio of synthetic native templates (NT) and IS templates for each gene of interest, with a constant ratio of each gene relative to the others. It controls for the difference in fluorescence intensity between two probes labeled with different dyes, which arises from variation in probe degradation or in software selection of

Cq values in each PCR plate. In every PCR plate, two concentrations of ESM per gene

tested are amplified in separate wells, and the Cq difference between NT and IS (ΔCq

= NT Cq – IS Cq) is obtained for each of the two ESM wells. The mean of the two ΔCq values

from the two respective ESM reactions is used to normalize for the fluorescence

difference between NT and IS probes and for the auto-selected Cq difference in an unknown sample.

The final reported value is NT copies of a target gene per million copies of a reference gene (e.g. ACTB

in this case).
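
The ESM correction described above can be illustrated with a minimal Python sketch. It assumes ideal amplification (a two-fold increase per cycle), which is an assumption of this sketch and is not stated in the text. The function names, Cq values and IS copy numbers are all hypothetical and serve only to show the arithmetic.

```python
def nt_copies(nt_cq, is_cq, esm_delta_cqs, is_copies, efficiency=2.0):
    """Estimate NT copies in a sample from NT and IS Cq values.

    esm_delta_cqs: ΔCq (NT Cq - IS Cq) from the two ESM wells, where NT and
    IS are present at 1:1, so any non-zero ΔCq reflects probe/threshold bias.
    efficiency: fold increase per cycle (2.0 is an assumed ideal value).
    """
    esm_bias = sum(esm_delta_cqs) / len(esm_delta_cqs)  # mean ESM ΔCq
    corrected = (nt_cq - is_cq) - esm_bias              # bias-corrected ΔCq
    # A lower Cq means more starting template, hence the negative exponent.
    return is_copies * efficiency ** (-corrected)


def per_million_reference(target_copies, reference_copies):
    """Express target NT copies per 10^6 copies of the reference gene."""
    return target_copies / reference_copies * 1e6


# Hypothetical values for one sample (target gene and ACTB reference)
target = nt_copies(nt_cq=27.5, is_cq=26.0, esm_delta_cqs=[0.4, 0.6], is_copies=6000)
reference = nt_copies(nt_cq=20.5, is_cq=21.0, esm_delta_cqs=[0.5, 0.5], is_copies=600000)
print(per_million_reference(target, reference))
```

Because NT and IS are amplified in the same well, the correction only needs the difference between their Cq values; the ESM wells supply the plate-specific offset that would otherwise be mistaken for a difference in template abundance.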

Crawford et al. demonstrated the two-step PCR with implementation of ISM for

fresh FNA samples in the capillary electrophoresis platform (Crawford, Warner et al.

2002) and Yeo et al. successfully adapted pre-amplification with ISM for FFPE samples to increase signal from background and amount of cDNA reverse transcribed from the limited amount of RNA (Yeo, Crawford et al. 2014).

2.3.2. RNA-Sequencing

RNA-sequencing (RNA-seq) is a recently developed deep sequencing technology that sequences the same region multiple times for simultaneous transcriptome profiling, mapping and quantifying transcriptomes of targeted regions of interest or whole transcriptomes (Wang, Gerstein et al. 2009). It also provides differential gene expression data in the form of aligned read counts (Nagalakshmi 2010). Since 2005, NGS has rapidly become the method of choice for

RNA-seq and allows for increased coverage of targets. The relative abundances of individual transcripts in a transcriptome can differ by several orders of magnitude. For the detection and quantification of low abundance transcripts with RNA-seq, the total number of reads per library can be increased (Haas, Chin et al. 2012).

2.3.2.1. Whole Transcriptome RNA-seq

Whole transcriptome RNA-sequencing (total RNA sequencing) captures a broad

range of gene expression levels (more than five orders of magnitude was estimated in

mouse (Mortazavi, Williams et al. 2008)) and enables the detection of novel transcripts in

both coding and noncoding RNA species. It gives researchers insight into

complex diseases by improving understanding of altered expression of genetic variants and of the

molecular mechanisms that regulate disease progression. In addition to transcript abundance

quantification, it provides detailed transcript structure information, including representation of

alternative transcripts (variety of splice isoforms and novel exons) (Halvardson, Zaghlool

et al. 2013), polymorphisms, and mutations, including translocations

(Marioni, Mason et al. 2008).

Microgram quantities of total RNA are required for whole transcriptome

sequencing. RNA quality is important for successful sequencing; hence, an

RNase-free environment to prevent RNA degradation and the absence of genomic DNA

contamination in RNA are essential. In addition, removal of ribosomal RNAs (rRNA) prior to

analysis optimizes the percentage of reads covering RNA species of interest because

rRNAs constitute 95-98% of total cellular RNA. The mRNA can be enriched either by

selection of polyadenylated (poly-A) RNAs or by depletion of rRNA (Chen and Duan

2011, Huang, Jaritz et al. 2011). For RNA-seq, however, doubly oligo(dT)-selected polyA+ RNA is preferred (Nagalakshmi, Wang et al. 2008, Nagalakshmi 2010). About

1µg of total enriched mRNA is subjected to cDNA synthesis by RT. Fragmentation can

be applied for either RNA (i.e., RNA hydrolysis or nebulization) or cDNA (i.e., DNase I

treatment or sonication). The fragmented cDNA is size-selected, blunt-ended and ligated

to platform-specific adaptors with or without barcode. Then the library preparation is

purified before sequencing (Nagalakshmi 2010). Following sequencing, two assembly methods are used to produce a transcriptome map from the resultant reads: de novo and

genome-guided. The de novo approach does not rely on the presence of a reference

genome to assemble the sequence reads (Grabherr, Haas et al. 2011). The genome-guided

approach is cheaper and easier, mapping reads with tools such as Bowtie

(Langmead and Salzberg 2012) and the TopHat splice junction mapper (Trapnell, Pachter et al.

2009) and retaining only reads that map uniquely to a reference genome.

Recently, whole transcriptome analysis at single cell resolution has been of growing

interest, especially for profiling rare or heterogeneous populations of cells (Tang,

Barbacioru et al. 2009). However, measurement of genes with low expression

can be compromised by stochastic amplification bias that results in the drop-out of

some RNA species and preferential amplification of others (Mamanova, Andrews et al.

2010, Ozsolak and Milos 2011).

For routine molecular diagnostic testing, however, the cost (the large number of total

sequencing reads required, because high abundance targets are redundantly sequenced in order to quantify

low abundance targets) and complexity (i.e., difficulties in post-sequencing computational

bioinformatics) of whole transcriptome RNA-sequencing data sets are barriers to use

of this method (Blomquist, Crawford et al. 2013).

2.3.2.2. Targeted RNA-seq

Traditionally, RNA sequencing for transcriptome analysis requires preparation

of a whole transcriptome library. In many cases, however, sequencing the whole transcriptome is unnecessary, or less important than

focusing sequencing efforts on a specific

fraction of the transcriptome. The recently developed targeted RNA sequencing approach is a

method for measuring transcripts of interest that yields quantitative or qualitative

information. Biotinylated oligonucleotide probes (baits) are used to capture cDNAs from

a library prepared for NGS (Levin, Berger et al. 2009, Mamanova, Andrews

et al. 2010, Mercer, Gerhardt et al. 2012, Halvardson, Zaghlool et al. 2013). It would be

more efficient than whole transcriptome RNA-seq for a diagnostic test in a clinical setting.

Targeted RNA sequencing allows for allele-specific expression measurement (Zhang, Li

et al. 2009) and differential expression analysis (Mercer, Gerhardt et al. 2012) as well as

discovery of novel fusion transcripts from chromosomal rearrangements (Levin, Berger

et al. 2009). Further, data analysis of targeted RNA-seq is significantly faster than whole

transcriptome RNA-sequencing because it takes less time for alignment. However, these

studies have underscored some challenges, for example, inter-library variations

introduced by library preparation (Levin, Berger et al. 2009, Mamanova, Andrews et al.

2010, Mercer, Gerhardt et al. 2012) and the high sequencing depth required to

reproducibly quantify low-abundance transcripts (Turner, Ng et al. 2009, Tarazona,

Garcia-Alcalde et al. 2011).

2.3.2.2.1. Use of Internal Standards as Quality Control for Library Preparation

We developed a competitive multiplex PCR-based amplicon sequencing library

preparation method for targeted RNA-seq (Blomquist, Crawford et al. 2013). This method targets only the sequences of interest and controls for inter-target variation in PCR amplification during library preparation by measuring each transcript NT relative to a known number of synthetic competitive template IS copies. Briefly, cDNA from an unknown sample (NT) was mixed with ISM and co-amplified with tailed target-specific primers (the 1st round of PCR allows multi-targeting), and then the PCR products

were subjected to PCR amplification with primers that add unique barcode sequences at the

ends (the 2nd round of PCR allows multi-sampling), followed by amplification with primers

that incorporate platform-specific sequences at the ends (the 3rd round of PCR allows sequencing on a specific NGS platform). The resulting individual libraries were mixed, gel purified, and then sent for sequencing on a specific NGS sequencer (e.g. Ion Torrent PGM™ or Illumina HiSeq or

MiSeq). The competitive multiplex-PCR amplicon library preparation method provides quality control for the reproducibility of RNA-seq library preparation. In addition, it reduces the over-sequencing of highly expressed transcripts relative to lowly expressed ones yet maintains the initial relative quantitative representation of targets, which, in turn, considerably lowers the cost of sequencing per target, per sample (Blomquist,

Crawford et al. 2013). A key advantage of competitive PCR is its insensitivity to the effect of saturation of the PCR (Reischl and Kochanowski 1995). This overcomes the limitations of the previously developed targeted RNA-seq approaches described above.
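
The principle of measuring each NT relative to a known number of IS copies can be sketched in a few lines of Python. Because NT and IS are co-amplified by the same primers, the ratio of their sequence counts is taken here as the ratio of their starting copies. The function name, gene entries, read counts and IS input copies below are hypothetical illustrations, not values from this work.

```python
def nt_copies_from_reads(nt_reads, is_reads, is_input_copies):
    """Competitive quantification: NT copies = (NT reads / IS reads) x IS copies."""
    if is_reads == 0:
        # Without IS reads the measurement is uninterpretable (failed assay)
        raise ValueError("no IS reads: target failed to amplify or be sequenced")
    return nt_reads / is_reads * is_input_copies


# Hypothetical read counts and IS inputs for two targets in one library
library = {
    "ERCC5": {"nt_reads": 1200, "is_reads": 2400, "is_input_copies": 1000},
    "ACTB": {"nt_reads": 5000, "is_reads": 1000, "is_input_copies": 100000},
}
copies = {gene: nt_copies_from_reads(**c) for gene, c in library.items()}
print(copies)
```

Note that the highly expressed gene in this toy example needs no more reads than the lowly expressed one, since each target is read against its own IS rather than against the rest of the library.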

2.3.2.2.2. Use of Predicted Coefficient of Variation as Quality Control for Stochastic

Error

One challenge that limits wider clinical diagnostic application of NGS is lack of appropriate quality control to accurately identify clinically actionable mutations in tumors

(Cibulskis, Lawrence et al. 2013, Spencer, Tyagi et al. 2014). Analytical variation due to stochastic sampling is important not only for mutation frequency determination but also for quantitative measurement of differential gene expression or ASE (Mortazavi,

Williams et al. 2008). Our RNA-seq library preparation method described above, which utilizes competitive IS (Blomquist, Crawford et al. 2013), enables control for sample overloading, signal saturation effects, and inter-assay and inter-sample variation in

measurement. However, analytical variation due to stochastic sampling, which arises when target analyte is

under-loaded into library preparation and/or when library product is under-loaded into the

sequencer (Fu, Xu et al. 2014), can still compromise the quality of NGS data and lead to

misinterpretation of biological meaning. Stochastic sampling error cannot be

controlled when few copies are present in a sample, and each PCR step that takes an aliquot of cDNA or

PCR product from the previous step can introduce stochastic sampling variation. Based

on Poisson sampling, a mathematical equation was developed to

predict assay coefficient of variation (CV). The formula is based on both the intact NT copies

loaded into the first PCR reaction of library preparation (input molarity) and the input of

resulting molecules from the library into the sequencer (i.e. sequence counts) (Blomquist,

Crawford et al. 2015). In the post-sequencing data analysis pipeline, the predicted CV is

used to determine the confidence limits for each value acquired from sequencing.

Any value that does not pass this confidence limit is eliminated; thus, stochastic

sampling errors are controlled. Because relative transcript abundance varies over six

orders of magnitude (Mortazavi, Williams et al. 2008, Blomquist, Crawford et al. 2013), the predicted CV helps prevent false positive results by limiting stochastic variation and helps ensure NGS data quality for transcript abundance quantification, representation of alternative transcripts (Halvardson, Zaghlool et al. 2013), and detection of polymorphisms and mutations.
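
To make the idea of a Poisson-based predicted CV concrete, the following Python sketch uses the textbook result that a Poisson count with mean n has CV = 1/sqrt(n), and combines the two sampling points (input copies and sequence counts) by adding their relative variances. This is an illustrative approximation under an independence assumption; it is not claimed to be the published equation of Blomquist, Crawford et al. (2015). The function names, the example measurements and the max_cv threshold are hypothetical.

```python
import math


def predicted_cv(input_copies, read_count):
    """Approximate CV from two independent Poisson sampling steps.

    For a Poisson count with mean n, CV = 1 / sqrt(n); independent relative
    variances add, giving sqrt(1/input_copies + 1/read_count).
    """
    if input_copies <= 0 or read_count <= 0:
        return math.inf
    return math.sqrt(1.0 / input_copies + 1.0 / read_count)


def passes_filter(input_copies, read_count, max_cv):
    """Keep a measurement only if its predicted CV is below max_cv."""
    return predicted_cv(input_copies, read_count) < max_cv


# Hypothetical measurements: (intact NT copies loaded, sequence counts)
for copies, reads in [(10000, 5000), (4, 2000), (500, 3)]:
    print(copies, reads, round(predicted_cv(copies, reads), 3),
          passes_filter(copies, reads, max_cv=0.25))
```

The second and third hypothetical measurements show why both sampling points matter: deep sequencing cannot rescue a value derived from only a few input molecules, and a well-loaded target is still unreliable if it yields only a few sequence counts.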

2.4. Contributions

2.4.1. Manuscript I

Manuscript I is entitled “Cis-acting variant sites that alter ERCC5 transcription regulation in normal bronchial epithelial cells.” This manuscript has been submitted to

PLOS One. This study was conducted to characterize cis-acting genetic variants responsible for inter-individual variation in ERCC5 transcript regulation in normal bronchial epithelial cells (NBEC). Genotypes at putative ERCC5 cis-regulatory single nucleotide polymorphic sites (SNP) rs751402 and rs2296147, and marker SNPs rs1047768 and rs17655, were determined for the 80 subjects currently enrolled. Using a recently developed targeted sequencing method, ERCC5 allele-specific transcript abundance was assessed in NBEC RNA from 55 individuals heterozygous for rs1047768 and 21 subjects heterozygous for rs17655. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by

Sanger sequencing. Association of NBEC ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at putative ERCC5 promoter cis-regulatory SNPs rs751402 and rs2296147 was assessed.

2.4.2. Manuscript II

Manuscript II is entitled “Lung cancer risk test trial: study design, participant baseline characteristics, bronchoscopy safety, and establishment of a biospecimen repository.” This manuscript has been accepted and published in BMC Pulmonary

Medicine. In an effort to assess the accuracy and safety of the LCRT, we initiated a multi-site prospective cohort trial. The purpose of this report is to describe 1) the LCRT trial study design and primary endpoint, 2) baseline characteristics of enrolled individuals including demographic and lung function data, and 3) secondary endpoints reached thus far, including a) analysis of safety for the bronchoscopic brush method used to obtain samples for LCRT testing, and b) establishment of a biospecimen repository containing

NBEC and peripheral blood samples collected from the LCRT cohort.

2.4.3. Manuscript III

Manuscript III is entitled “Control for stochastic sampling variation and qualitative sequencing error in next generation sequencing.” This manuscript has been accepted and published in Biomolecular Detection and Quantification. One challenge that limits wider clinical diagnostic application of NGS is the lack of appropriate quality control to accurately identify clinically actionable mutations in tumors. To address this challenge, we developed and tested two hypotheses. Hypothesis 1: Analytical variation in target analyte quantification is predicted by Poisson (i.e. stochastic) sampling effects at two key points: a) input of intact nucleic acid target molecules into the library preparation reaction, and b) input of amplicons from the library into the sequencer. Hypothesis 2:

Technically derived base substitution, insertion and deletion frequencies observed at each base position in each native target analyte are concordant with frequencies observed in competitive synthetic internal standards present in the same reaction. To test hypothesis

1, we derived equations using Monte Carlo simulation to predict assay coefficient of variation (CV) based on three working models: number of target molecules added to library preparation, number of target sequence read counts from sequencer, or both. A serial dilution of gDNAs from two cell lines with known allelic composition was used to test this hypothesis. To test hypothesis 2, we measured the frequency of base substitutions, insertions and deletions at each base position within amplicons from each of 30 native target analytes, then compared these frequencies to those at corresponding base positions within 30 respective synthetic competitive internal standard templates present in the same NGS library preparation reactions.

2.5. Future Study

Through funding in part from RC2 CA148572 and HL108016, we have collected normal bronchial epithelial cell samples from over 500 subjects with lung cancer and/or

COPD or demographically at risk for lung cancer or COPD. We will focus on discovery of cis-regulatory SNPs or haplotypes that affect inter-individual variation in the expression of genes with altered regulation in subjects with lung cancer or COPD.

Integration of pathway analysis and GWAS data would highlight genes that are most likely altered in cases versus controls and would therefore be a more effective approach. We have been authorized to access GWAS data sets to date, including three COPD GWAS data sets and two lung cancer GWAS data sets. For the multiple putative risk genes, we will assess genotype at putative cis-rSNPs in gDNA from over 500 subjects, and allele-specific expression and total expression in matched NBEC samples from these subjects, using the targeted competitive multiplex NGS method. We will identify heritable susceptibility cis-rSNPs or haplotypes that contribute to lung cancer and COPD risk. We will evaluate the function of identified putative cis-rSNPs that are more likely to alter transcription factor binding and regulation of respective genes in NBEC, using NBEC in optimized conditional reprogramming culture (CRC) conditions. We will use the Massively Parallel

Reporter Assay (MPRA) combined with targeted NGS to measure inter-allelic difference in transcriptional activity at putative enhancers centered on putative cis-rSNPs.

Chapter 3 Haplotype and diplotype analyses of variation in ERCC5 transcription cis-regulation in normal bronchial epithelial cells

Xiaolu Zhang1, Erin L. Crawford1, Thomas M. Blomquist2, Sadik A. Khuder3, Jiyoun

Yeo1, Albert M. Levin4, James C. Willey1,2*

Authors’ Affiliations:

1 Division of Pulmonary/Critical Care and Sleep Medicine, Department of Medicine,

University of Toledo Health Sciences Campus, Toledo, Ohio, United States of America

2 Department of Pathology, University of Toledo Health Sciences Campus, Toledo, Ohio,

United States of America

3 Departments of Medicine and Public Health and Homeland Security, University of

Toledo Health Science Campus, Toledo, Ohio, United States of America

4 Department of Public Health Sciences, Henry Ford Health System, Detroit, Michigan,

United States of America

* Corresponding author: James C. Willey, M.D., Mail Stop #1186, 3000 Arlington

Avenue, Toledo, OH 43614, Phone: 419-383-3541, Fax: 419-383-2801, Email:

[email protected]

(Modified from “Haplotype and diplotype analyses of variation in ERCC5 transcription cis-regulation in normal bronchial epithelial cells” manuscript submitted to Physiological

Genomics)

3.1 Abstract

Background: Excision repair cross-complementation group 5 (ERCC5) gene plays an important role in nucleotide excision repair and dysregulation of ERCC5 is associated with increased lung cancer risk. Haplotype and diplotype analyses were conducted in normal bronchial epithelial cells (NBEC) to better understand mechanisms responsible for inter-individual variation in transcript abundance regulation of ERCC5.

Methods: We determined genotypes at putative ERCC5 cis-regulatory SNPs (cis-rSNP) rs751402 and rs2296147, and marker SNPs rs1047768 and rs17655. ERCC5 allele-specific transcript abundance was assessed by a recently developed targeted sequencing method. Syntenic relationships among alleles at rs751402, rs2296147 and rs1047768 were assessed by allele-specific PCR followed by Sanger sequencing. We then assessed association of ERCC5 allele-specific expression at rs1047768 with haplotype and diplotype structure at cis-rSNPs rs751402 and rs2296147.

Results: Genotype analysis revealed significantly (p<0.005) higher inter-individual variation in allelic ratios in cDNA samples relative to matched gDNA samples at both rs1047768 and rs17655. By diplotype analysis, mean expression was higher at the rs1047768 alleles syntenic with rs2296147 T allele compared to rs2296147 C allele.

Further, mean expression was lower at rs17655 C allele which is syntenic with G allele at a linked SNP rs873601 (D’=0.95).

Conclusions: These data support the conclusions that in NBEC, the T allele at SNP rs2296147 up-regulates ERCC5, variation at rs751402 does not alter ERCC5 regulation, and the C allele at SNP rs17655 down-regulates ERCC5. Variation in ERCC5 transcript abundance associated with allelic variation at these SNPs could result in variation in NER function in NBEC and lung cancer risk.

3.2 Introduction

Excision repair cross-complementation group 5 (ERCC5) gene, also known as

Xeroderma Pigmentosum complementation group G (XPG) (O'Donovan, Scherly et al. 1994), plays an important role in nucleotide excision repair (NER). In addition,

ERCC5 is among a set of key antioxidant, DNA repair and cell cycle control genes identified by this laboratory to associate with lung cancer risk (Mullins, Crawford et al. 2005, Blomquist, Crawford et al. 2009). Further, variation in ERCC5 regulation is reported to be associated with treatment response and outcome in bronchogenic carcinoma as well as other cancers (Spitz, Wei et al. 2003, Zienolddiny, Campa et al.

2006, Mathiaux, Le Morvan et al. 2011, Zhang, Sun et al. 2013, Somers, Wilson et al.

2015).

Known transcription regulators of ERCC5 in normal bronchial epithelial cells

(NBEC) include CCAAT/enhancer binding protein gamma (CEBPG), E2F

Transcription Factor 1 (E2F1) and YY1 (a transcription factor belonging to the GLI-

Kruppel class of zinc finger proteins) (Mullins, Crawford et al. 2005, Crawford,

Blomquist et al. 2007). CEBPG is a truncated C/EBP isoform that lacks a transcription activation domain and therefore functions through heterodimerization with other C/EBP members (Tsukada, Yoshida et al. 2011). Knockout of CEBPG or

its binding partner CEBPA results in emphysema, a condition associated with lung cancer

risk (Kaisho, Tsutsui et al. 1999, Didon, Roos et al. 2010). E2F1 is a critical regulator of

cell cycle progression (Nevins 1992). A study in a series of 58 lung tumors of all histological types supported a pivotal role of E2F1 in tumorigenesis (Eymin, Gazzeri et

al. 2001).

The common single nucleotide polymorphic (SNP) sites rs751402 and rs2296147

reside in the ERCC5 5’ untranslated region (UTR); rs751402 within a known CEBPG

binding site based on chromatin immunoprecipitation studies (Wang, Zhuang et al. 2012)

and rs2296147 within an experimentally confirmed binding site for E2F1 and YY1

(Crawford, Blomquist et al. 2007). Variation at rs2296147 is predicted to alter binding of

the TP53 transcription factor (Marinescu, Kohane et al. 2005). Both rs751402 and

rs2296147 are associated with lung cancer risk in molecular epidemiologic studies (Shen,

Berndt et al. 2005, Zienolddiny, Campa et al. 2006).

In a previous study, based on genotype analysis we found that SNPs rs751402 and

rs2296147 were associated with inter-individual variation in allelic imbalance in ERCC5

expression in NBEC (Blomquist, Crawford et al. 2010). This observation suggests that

one or both of these SNPs affects ERCC5 cis-regulation. However, based on genotyping

analysis it was not possible to sort out with confidence the independent roles of rs751402

and rs2296147. Further, the patterns of allelic imbalance variation observed indicated

that one or more cis-regulatory SNPs in addition to rs751402 and rs2296147 also played

a role. One candidate is rs17655, a common polymorphic site in the ERCC5 3’ UTR

reported to be associated with lung cancer risk (Matakidou, el Galta et al. 2007). Genetic

variation in the 3’ UTR of a gene can play a role in cis-regulation by influencing

microRNA (miRNA) binding activity (Yu, Li et al. 2007, Nicoloso, Sun et al. 2010,

Ryan, Robles et al. 2010).

In an effort to better characterize cis-acting genetic variants responsible for inter-individual variation in ERCC5 transcript regulation, we used recently developed methods to assess in more detail the role of the previously studied 5’UTR sites rs751402 and rs2296147 (Mullins, Crawford et al. 2005, Blomquist, Crawford et al.

2009) and the additional 3’UTR site rs17655. Specifically, haplotype and diplotype structure of the ERCC5 promoter region containing rs751402 and rs2296147 was assessed by allele-specific polymerase chain reaction (PCR) followed by direct sequencing. Using PCR amplicon libraries prepared for next generation sequencing (NGS) according to a recently described method (Blomquist, Crawford et al. 2013), we determined allele-specific expression (ASE), measured as allelic ratio, at the marker site rs1047768 in the ERCC5 coding region close to the 5’UTR and at rs17655 in the coding region near the 3’UTR. We then evaluated the association of ASE with each rs751402-rs2296147-rs1047768 haplotype and diplotype.

3.3 Materials and Methods

3.3.1 Study subjects

Bronchoscopic brush biopsy samples of NBEC and matched peripheral blood samples were obtained as previously described (Mullins, Crawford et al. 2005) from

60 subjects without cancer and 20 subjects with lung cancer. Demographic characteristics of subjects in the current study are presented in Supplementary Table

S3.1. This study was conducted under University of Toledo Institutional Review

Board approved protocol #106894. All subjects in the current study were a subset of the

Lung Cancer Risk Test (LCRT) trial, a prospective cohort study (Crawford 2016).

Individuals were recruited at 13 locations in the United States and provided informed consent to participate. Each subject agreed to the banking of residual nucleic acids for use in future studies under University of Toledo Biomedical Institutional Review Board protocol #108538. Then all samples were de-identified and links to identifying information were kept at each site for subjects recruited at that site. Inclusion criteria required subjects to be at high demographic risk for lung cancer based on age (50-90 years) and smoking history (≥ 20 pack-years). Both current and former smokers were eligible. Subjects had to be without a diagnosis of lung cancer prior to or at enrollment.

Subjects were excluded if they were previously diagnosed or treated for lung cancer or had a high pretest likelihood of lung cancer, if they were positive for hepatitis B, C, HIV, or had active TB or if the physician deemed them to be medically inappropriate due to safety concerns. Also excluded were children, pregnant women, prisoners, mentally disabled, those that had received a double lung transplantation, radiation or chemotherapy of any kind within the last month and those scheduled to receive either radiation or chemotherapy.

3.3.2 DNA and RNA extraction

NBEC samples obtained at bronchoscopic brush biopsy were shipped to ResearchDX

(Irvine, CA), where RNA was extracted, treated with DNase I (Qiagen, Valencia, CA) in order to eliminate contaminating genomic DNA (gDNA), and frozen in aliquots. One aliquot of each frozen RNA sample was shipped to the University of Toledo, and tested for gDNA contamination with a pair of primers designed to span an intron-exon junction

in Secretoglobin, Family 1A, Member 1 gene (CC10 hereafter) and thereby amplify

only gDNA (Blomquist, Brown et al. 2013). Total RNA was reverse transcribed to

cDNA using Moloney Murine Leukemia Virus Reverse Transcriptase (M-MLV RT)

and oligo-dT primers as described previously (Mullins, Crawford et al. 2005).

Matched gDNA was extracted from whole blood using FlexiGene DNA Kit (Qiagen,

Valencia, CA) according to the manufacturer’s protocol.

3.3.3 Genotyping and allelotyping

Genotype at each polymorphic site was determined by TaqMan SNP genotyping assay (Applied Biosystems) according to the manufacturer’s protocol. Direct assessment of the syntenic relationship of alleles at rs751402, rs2296147 and rs1047768 in individuals heterozygous for rs1047768 was accomplished by allele-specific PCR amplification followed by Sanger sequencing (The University of

Michigan DNA Sequencing Core, Ann Arbor, MI) as described previously

(Blomquist, Crawford et al. 2010). An overview of polymorphic sites and primers relative to ERCC5 gene coordinates is depicted in Figure 3.2. The sequences and design of allele-specific primers were described previously (Blomquist, Crawford et al. 2010).

It was not possible to conduct analysis in 11 samples. In three samples, there was no amplification due to poor quality of gDNA and cDNA; four subjects were heterozygous at rs751402 but homozygous at rs2296147 and the spanned region between rs751402 and rs1047768 was too long to assess synteny by direct sequencing; four samples were not amplified with primers determining rs2296147-

rs1047768 synteny, likely due to variation in the transcription start site as previously reported (Blomquist, Crawford et al. 2010).

3.3.4 Measurement of ERCC5 allele-specific and total expression

ERCC5 ASE was measured using a modified version of previously described methods (Blomquist, Crawford et al. 2013). Briefly, a custom, multiplex competitive

PCR amplicon library was prepared for targeted NGS, then sequenced at the University of Michigan DNA Sequencing Core (Ann Arbor, MI) using the Illumina HiSeq 2000 platform. To prepare the library, cDNA was combined with a mixture containing a) primers spanning SNP rs1047768 in the ERCC5 5’UTR region, SNP rs17655 in the

3’UTR region, and the ACTB loading control gene, and b) a known number of internal standard molecules for each of these targets. Each primer was designed with a universal tail sequence (similar to that used for arrayed primer extension: APEX-2) not present in the human genome to allow for multi-template PCR addition of barcode and platform specific sequencing adapters. The internal standards mixture was prepared as described previously (Blomquist, Crawford et al. 2013).

Three sequential PCR amplifications were conducted to prepare the library prior to sequencing. In the first reaction each target sequence was amplified with 5µM APEX-tailed primers using an air thermal cycler (RapidCycler; Idaho Technology, Inc., Idaho

Falls, Idaho). PCR conditions were 95°C/3min (Taq DNA polymerase activation); 35 cycles of 94°C/5sec (denaturation), 58°C/10sec (annealing), 72°C/15 sec (extension).

Each product from the first PCR was purified using QIAquick PCR purification kit

(Qiagen, Valencia, CA) to remove residual primers and primer dimers then used as template for barcoding PCR. Each barcoding reaction was cycled in the air cycler under

the following conditions: 95°C/3min (Taq DNA polymerase activation); 15 cycles of

94°C/5sec (denaturation), 58°C/10sec (annealing), 72°C/15 sec (extension). The final

concentration of each forward and reverse barcoding primer was 1µM. The third PCR

for adding Illumina platform specific adaptors was conducted with the same PCR

conditions and primer concentrations as the barcoding PCR. Representative PCR products were checked for quality and quantity with Bioanalyzer 2100 (Agilent

Technologies, city, state) following each of the three amplifications. All products from the third step PCR were combined at equal volumes and then purified by

QIAquick PCR purification kit (Qiagen, Valencia, CA) to remove residual primers.

The concentration of purified products was checked by Bioanalyzer 2100 and then

sent for sequencing. ASE was presented as allelic ratio which was calculated as the

ratio of sequencing counts for each allele and filtered as described below.

3.3.5 Data processing pipeline

The University of Michigan Illumina Sequencing services provided raw sequencing data in FASTQ format. Practical Extraction and Reporting Language (PERL) scripts were used to combine Read 1 (forward) and Read 2 (reverse) sequence reads for each template sequenced. These “joined” reads were then de-multiplexed based on dual-index barcoding on each template, and the locus was identified based on the region representing the primer sequences. Intervening amplicon sequence that was “captured” between the primer sequences was aligned using custom alignment with Approximate String matching algorithm as previously described (Blomquist, Crawford et al. 2013, Blomquist, Crawford et al. 2015). These alignment calls then provided relative abundance in the form of sequence “counts” for each allele at each locus. The ratio of these allele specific sequence counts at each locus then represents the allele-specific expression ratio.
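
As a minimal illustration of the final step described above, the Python sketch below converts per-allele sequence counts at one marker SNP into an allelic ratio and its log2 value. The function name and the example counts are hypothetical; they are not taken from the PERL pipeline or the data of this study.

```python
import math


def allelic_ratio(counts, numerator, denominator):
    """Return (ratio, log2 ratio) of sequence counts for two alleles at one locus.

    `counts` maps allele letters to sequence counts. All names and numbers
    used here are illustrative assumptions, not values from the study.
    """
    num = counts[numerator]
    den = counts[denominator]
    if num <= 0 or den <= 0:
        raise ValueError("both alleles need a positive count")
    ratio = num / den
    return ratio, math.log2(ratio)


# Hypothetical counts for one heterozygote at rs1047768
cdna_counts = {"T": 1200, "C": 1000}
ratio, log2_ratio = allelic_ratio(cdna_counts, "T", "C")
print(f"T/C = {ratio:.2f}, log2(T/C) = {log2_ratio:.3f}")
```

A log2 value of 0 corresponds to equal abundance of the two alleles, which is why the matched gDNA control is expected to cluster near 0.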

3.3.6 Filtering for stochastic sampling error

To control for stochastic sampling error, we implemented a previously developed equation that identifies the minimum allowable input of target gene molecules into library preparation, obtained by measurement relative to a known number of input IS, and the minimum allowable number of amplicons from the library loaded into the sequencer, measured as sequencing counts (Blomquist, Crawford et al. 2015).

A filter for the coefficient of variation (CV) expected to result from stochastic sampling-determined analytical variation was applied to both allele-specific expression and total transcript abundance. Only values with stochastic sampling-dependent CV expected to be less than 1 were subjected to subsequent analysis. Total transcript abundance was presented as target gene NT molecules/10^6 ACTB molecules.
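
The normalization in the last sentence can be sketched as follows. This sketch assumes that native template (NT) copy number is estimated as the NT:IS sequence-count ratio multiplied by the known number of IS molecules added. It also uses a simple Poisson approximation (CV of about 1/sqrt(n)) as a stand-in for the published stochastic-sampling equation, which is not reproduced here. All function names and numbers are hypothetical.

```python
import math


def nt_molecules(nt_reads, is_reads, is_input_molecules):
    """Estimate native template (NT) copies from the NT:IS count ratio.

    Assumes NT and IS amplify with equal efficiency, so the count ratio
    reflects the input molecule ratio (an assumption of this sketch).
    """
    return nt_reads / is_reads * is_input_molecules


def per_million_actb(target_molecules, actb_molecules):
    """Express target abundance as molecules per 10^6 ACTB molecules."""
    return target_molecules / actb_molecules * 1e6


def poisson_cv(n):
    """Expected CV when sampling n independent items (Poisson approximation)."""
    return 1.0 / math.sqrt(n)


# Hypothetical values for one sample
ercc5 = nt_molecules(nt_reads=500, is_reads=1000, is_input_molecules=600)
actb = nt_molecules(nt_reads=4000, is_reads=1000, is_input_molecules=60000)
abundance = per_million_actb(ercc5, actb)
cv = poisson_cv(ercc5)
print(f"ERCC5: {abundance:.0f} molecules/10^6 ACTB, expected CV = {cv:.3f}")
```

Under this approximation the expected CV rises as the number of sampled molecules falls, which is the rationale for excluding measurements based on too few input molecules or too few sequencing counts.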

3.3.7 Small interfering RNA (siRNA) assay in cell culture

Human lung squamous carcinoma cell line H1703 was obtained from American Type Culture Collection and maintained at 37°C and 5% CO2 in RPMI 1640 medium supplemented with 10% fetal bovine serum (FBS). Dharmacon ON-TARGETplus SMARTpool siRNAs (Supplementary Table S3.3) for CEBPG (si05, si06, si07, and si08), negative control siRNA, and DharmFECT 1 reagent were purchased from Thermo Scientific (Waltham, MA). One day prior to transfection, cells were trypsin dissociated and seeded into 6-well plates at a predetermined density such that cells were 70-75% confluent at the time of transfection. Cells were transfected with DharmFECT 1 reagent and serum-free RPMI 1640 according to the manufacturer’s protocol. Cells were incubated with siRNA transfection reagent for 24 h and then the media were replaced with RPMI 1640 + 10% FBS. Cells were harvested after an additional 48 h of incubation. A replicate experiment was done.

3.3.8 Statistical analysis

Ratios of allele-specific expression values from individuals heterozygous at the marker SNP were log2-transformed prior to analysis. Variance in allelic ratio measured in cDNA samples was tested for difference from that in matched gDNA samples by F-test. Difference in mean allelic ratio between cDNA samples and matched gDNA samples was determined by Student’s t-test. One-way analysis of variance (ANOVA) was used to compare the mean of allelic ratios associated with genotype or diplotype. Pearson’s test was performed to assess inter-gene total transcript abundance bivariate correlation and Fisher Z-test was applied to assess inter-group (e.g. cancer vs non-cancer) difference in bivariate correlation. All statistical tests were two-sided with a statistical significance level of p < 0.05, using either the R statistical programming language (v 3.2.0) or SAS program (v 9.3). All graphs were plotted using GraphPad Prism 6.
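
For readers unfamiliar with the Fisher Z-test mentioned above, the Python sketch below shows the standard textbook form of the test for a difference between two independent Pearson correlation coefficients. It is not the R or SAS code used in this study, and the input values are hypothetical.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson r values.

    Each r is converted with the Fisher r-to-z transformation (atanh), and the
    difference is compared with a standard normal distribution.
    """
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p


# Hypothetical correlations observed in two independent groups
z, p = fisher_z_test(r1=0.60, n1=50, r2=0.10, n2=30)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Because the standard error depends only on the two group sizes, the smaller group dominates the precision of the comparison.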

3.4 Results

3.4.1 Effects of CEBPG siRNAs on ERCC5 transcript abundance

To validate the functional role of CEBPG in ERCC5 transcript regulation, we knocked down CEBPG transcript level and assessed the effect on ERCC5 transcript abundance. As shown in Figure 3.1, the transcript abundance of CEBPG in cells treated with CEBPG siRNA was significantly reduced by 94% relative to cells treated with control siRNA. At the same time, ERCC5 transcript abundance in the CEBPG siRNA group decreased 4-fold compared to the control group.

3.4.2 Inter-individual variation in allelic imbalance at rs1047768

Among the eighty newly enrolled subjects, thirty-three individuals were heterozygous

at the marker SNP rs1047768. For analysis of allelic imbalance at rs1047768, data from

these 33 subjects were combined with data from 22 previously enrolled subjects

heterozygous at rs1047768 (Blomquist, Crawford et al. 2010). Allelic ratios for the

cDNA sample and matched gDNA control of each heterozygote are summarized in

Supplementary Table S3.2 and presented graphically in Figure 3.3. Among 55 subjects

that were heterozygous for rs1047768 there was greater inter-individual variation in

rs1047768 allelic ratio in cDNA compared to matched gDNA controls (F-test, p<0.0001)

(Figure 3.3A).

3.4.3 Characterization of haplotype and diplotype structure in ERCC5 promoter

Of 55 individuals heterozygous at the marker SNP rs1047768, it was possible to determine the synteny among allelotypes at rs1047768 as well as promoter SNP sites rs751402 and rs2296147 in 44 samples. Six haplotypes of SNPs rs751402–rs2296147–rs1047768 (Table 3.1) and six unique diplotypes (haplotype pairs for an individual) were observed (Table 3.2). The most common haplotype structures were G–C–C (37%) and G–T–C (21%). The most common diplotypes were GTT/GCC observed in 17/44 individuals (39%) and ATT/GCC observed in 11/44 (25%) (Table 3.2).
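
The diplotype percentages quoted above follow directly from the counts in Table 3.2. The short Python sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# Diplotype counts as reported in Table 3.2
# (parental chromosome 1, parental chromosome 2) -> number of individuals
diplotype_counts = {
    ("A-T-T", "G-C-C"): 11,
    ("A-T-T", "A-T-C"): 1,
    ("G-C-T", "G-C-C"): 3,
    ("G-C-T", "G-T-C"): 3,
    ("G-T-T", "G-C-C"): 17,
    ("G-T-T", "G-T-C"): 9,
}

total = sum(diplotype_counts.values())
frequencies = {d: c / total for d, c in diplotype_counts.items()}

# List diplotypes from most to least common
for (chrom1, chrom2), freq in sorted(frequencies.items(), key=lambda kv: -kv[1]):
    print(f"{chrom1}/{chrom2}: {freq:.0%}")
```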

3.4.4 Association of ERCC5 promoter diplotype with rs1047768 allelic imbalance

There was a significant difference in the mean allelic transcript ratio at rs1047768

among the six diplotype groups (p=0.0030). The mean allelic transcript ratio among

subjects with GTT/GCC diplotype was 1.2-fold higher than that of subjects with diplotype GTT/GTC (p=0.0280) and 1.4-fold higher than that of subjects with

GCT/GTC (p=0.010). There was no difference in mean allelic ratio between diplotype GTT/GCC and ATT/GCC (Figure 3.4).

The rs1047768 allele (T or C) syntenic with rs2296147 T allele displayed higher transcript abundance than the allele syntenic with rs2296147 C allele, regardless of allele present at rs751402 (Figure 3.4). Consistent with this, the mean rs1047768 T/C allelic ratio in the combined group of subjects with either GTT/GCC or ATT/GCC was significantly higher (p<0.05) than that in subjects with GCT/GCC, GCT/GTC, or GTT/GTC (1.44-fold, 1.42-fold, and 1.22-fold, respectively) (Figure 3.4).

3.4.5 Inter-individual variation in allelic imbalance measured at rs17655 in ERCC5 3’UTR

Twenty-one of the 80 newly enrolled subjects were heterozygous at rs17655,

located in the ERCC5 transcript 3’ UTR. Among these subjects, inter-individual

variation in rs17655 G/C allelic ratio was significantly higher in cDNA compared to

matched gDNA (F-test, p=0.0005) (Figure 3.3B). In addition, mean G/C ratio was

30% higher in cDNA relative to matched gDNA (t-test, p<0.0001).
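
The 30% figure can be related to the log2-scale means given in the Figure 3.3 legend (M=0.25 for cDNA and M=-0.14 for gDNA). The Python sketch below back-transforms the difference between those means. It assumes that the reported percentage refers to this back-transformed ratio, which is a ratio of geometric means; that interpretation is an assumption of the sketch and is not stated in the text.

```python
# Mean log2(G/C) values taken from the Figure 3.3 legend
mean_log2_cdna = 0.25
mean_log2_gdna = -0.14

# A difference on the log2 scale corresponds to a fold difference on the ratio scale
fold = 2 ** (mean_log2_cdna - mean_log2_gdna)
percent_higher = (fold - 1) * 100
print(f"cDNA/gDNA fold difference = {fold:.2f} ({percent_higher:.0f}% higher)")
```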

3.4.6 Analysis of rs2296147 effect on transcription factor binding and rs17655 effect on miRNA binding

The T allele at rs2296147 is predicted to participate in formation of a TP53 transcription factor-binding site, and this binding site is predicted to be lost when the C allele is present (Marinescu, Kohane et al. 2005).

Although allelic variation at ERCC5 3’UTR SNP rs17655 is not predicted to affect miRNA binding, it is in high linkage disequilibrium (D’ = 0.95) with functional candidate

SNP rs873601 (Genomes Project, Abecasis et al. 2010), which is predicted to alter binding of multiple microRNAs (miRNAs) (Liu, Zhang et al. 2012, Zhu, Shi et al. 2012).

3.4.7 Bivariate analysis of ERCC5 and CDKN1A transcript abundance in non-cancer and cancer subjects

In order to investigate the potential role of TP53 functional activity in regulation of ERCC5, we measured CDKN1A transcript abundance as a surrogate marker for TP53 function (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al. 2005). Among the 80 subjects enrolled, there was significant bivariate correlation between ERCC5 and CDKN1A (also known as p21) among the 60 non-cancer subjects (r=0.65, p=0.0005), consistent with co-regulation by a common transcription factor. In contrast, the ERCC5 and CDKN1A correlation was significantly lower (Fisher Z-test, p=0.0002) among the 20 cancer subjects (r=-0.52, p=0.0475) (Figure 3.5). Additionally, inter-individual variation in CDKN1A transcript abundance was higher in non-cancer compared to cancer subjects (F-test, p=0.00053).

3.5 Discussion

Known sources of inherited risk for lung cancer include variation in cis-acting regulatory single nucleotide polymorphisms (cis-rSNPs) and/or key transcription factors that regulate antioxidant, DNA repair, and cell proliferation control gene pathways in normal bronchial epithelial cells (NBEC), the progenitor cells for lung cancer (Harr,

Graves et al. 2005, Mullins, Crawford et al. 2005, Blomquist, Crawford et al. 2010).

Data presented here advance mechanistic understanding regarding heritable variation

in cis-regulation of the key NER gene ERCC5 (Figure 3.3).

3.5.1 Contribution of 5’UTR SNPs to ERCC5 cis-regulation

Analysis of diplotype structure at rs751402-rs2296147-rs1047768 demonstrated that higher abundance of transcript from the rs1047768 marker site C-allele was associated with the T-allele at putative cis-regulatory SNP rs2296147 (Figure 3.4) and was not associated with variation at rs751402. Notably, the rs2296147 T-allele participates in formation of an in silico predicted TP53 transcription factor-binding site (Marinescu, Kohane et al. 2005) and that site is predicted to be lost when the C-allele is present. In previous studies, TP53 was shown to upregulate ERCC5 transcription (Kannan, Amariglio et al. 2000). Therefore, it is reasonable to hypothesize that TP53 upregulates ERCC5 transcription more effectively when the T allele is present at rs2296147. Because TP53 is regulated primarily at the post-translational level, TP53 transcription factor functional activity is often measured indirectly as transcript abundance of key target genes such as CDKN1A (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al. 2005). Indeed, we observed a significant correlation of ERCC5 and CDKN1A at the transcript level. As reported in Figure 3.5A, CDKN1A and ERCC5 total transcript abundance values were correlated in non-cancer subject NBEC samples, and this correlation was lost in NBEC from lung cancer subjects (Figure 3.5B). A greater sample size will be necessary to evaluate association of rs2296147 and the altered CDKN1A correlation with ERCC5 in lung cancer subjects. In contrast to strong evidence for the cis-regulatory role of rs2296147 in ERCC5 regulation, haplotype and diplotype data

do not support a similar role for rs751402. Haplotype-based analyses presented here distinguished the effects of rs751402 and rs2296147, which was not possible based on previously reported genotype-based analyses (Blomquist, Crawford et al. 2010).

3.5.2 Contribution of 3’UTR SNPs to ERCC5 cis-regulation

We observed an increased mean G/C allelic ratio at rs17655 in cDNA compared to

matched gDNA controls (Figure 3.3B) indicating that this SNP or a SNP in linkage

disequilibrium with it influences ERCC5 transcript levels in NBEC. As described in

Results section, it is likely that the functional SNP responsible for this observation is

rs873601 which is linked to rs17655 and is predicted to alter binding of multiple miRNAs

(He, Qiu et al. 2012, Liu, Zhang et al. 2012, Zhu, Shi et al. 2012). Specifically, the C

allele at SNP rs17655 is linked to G allele at rs873601 (D’ = 0.95), which is putatively

more responsive to multiple miRNAs that will increase the rate of degradation and lower

abundance of transcripts originating from rs17655 C allele. Data presented here support

the need for functional studies to determine whether rs873601 actually binds any miRNA

and, if so, to evaluate allele-specific degradation. Importantly, because rs2296147 and

rs873601 are not linked, we conclude that any rs873601 effect is independent from that

of rs2296147.

3.5.3 Interaction between 5’UTR and 3’UTR SNPs in ERCC5 cis-regulation

Although functional validation is needed, there is potentially higher TP53-mediated

ERCC5 transcription rate from rs2296147 T allele and higher miRNA mediated ERCC5 transcript degradation at rs873601 G allele. Thus, if each of these cis-regulatory sites acted alone without any contribution from the other (for example, in hypothetical

alternative transcripts) or any other cis-acting SNP, we would expect not only to observe mean T/C ratio at marker SNP rs1047768 and mean G/C ratio at SNP rs873601 (or linked SNP rs17655) to be >1, but also very little inter-individual variation around these mean ratios. However, we observed significant variation around the mean allelic ratio at each marker SNP. The likely explanation is that the predominant expressed ERCC5 transcripts incorporate both marker SNPs (rs1047768 and rs17655) and the effects resulting from genotype at each of the unlinked cis-regulatory sites (rs2296147 and rs873601) interact to determine the allelic ratio measured at each marker SNP.

3.5.4 Value of transcript abundance regulation as intermediate lung cancer risk marker

Consistent with a complex genetic mechanism of lung cancer risk, the effect size of each DNA variant associated with lung cancer risk is very small. Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants and lung cancer risk. The data presented here support the conclusion that inherited variation in gene regulation is a powerful intermediate phenotypic marker for lung cancer risk, as presented schematically in Figure 3.6. As we report here and previously (Blomquist, Crawford et al. 2010), it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS study aiming to determine association of each individual SNP with risk (Amos, Wu et al. 2008). Specifically, the association of a single genetic variant with transcription regulation (e.g. rs2296147 with ERCC5 regulation) or the association of inherited variation in transcript abundance regulation with lung cancer

risk (Blomquist, Crawford et al. 2009) may be assessed with hundreds of subjects. For example, starting with 161 subjects (Amos, Wu et al. 2008) we observed significant association of rs2296147 genotype with ERCC5 ASE (Figure 3.4), and with fewer than 100 subjects we observed significantly altered ERCC5 regulation with lung cancer (Mullins, Crawford et al. 2005) (Figure 3.5). In contrast, there was not a clear association of rs2296147 T allele dosage with lung cancer risk among the subjects enrolled for this study (data not shown).

These findings provide evidence to support, as a strategy to identify variants associated with lung cancer risk, analysis of NBEC regulation of key antioxidant, DNA repair, and cell cycle control genes, followed by identification of cis-regulatory variants associated with sub-optimal regulation.

Based on the findings in the current study, we conclude that the T allele at rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to TP53 transcription factor. Genotype at rs17655 also is associated with variation in ERCC5 transcript abundance, likely due to effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.

3.6 Disclosures

JCW has 5-10% equity interest in and serves as a consultant to Accugenomics, Inc.

Technology relevant to this manuscript was developed and patented by JCW, and is

licensed to Accugenomics. These relationships do not alter our adherence to all

BioMed Central policies on sharing data and materials.

3.7 Grants

Significant portions of this work were funded by National Institutes of Health, National Cancer Institute (RC2-CA147652 and IMAT R21-CA138397) and National Heart Lung and Blood Institute (RO1-HL108016), and the University of Toledo Medical Center George Isaac Cancer Research Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

3.8 Table and Figure Legends

Table 3.1 Summary of haplotype structures in ERCC5 promoter region.

Table 3.2 Summary of diplotype structures in ERCC5 promoter region.

Figure 3.1 Effects of CEBPG siRNAs on ERCC5 transcript abundance

Transcript abundance was normalized to the reference gene ACTB, presented here as target gene molecules/10^6 ACTB molecules. The average of transcript abundance measurements obtained from two replicates was calculated for CEBPG and ERCC5, respectively. The base-10 logarithm of the mean transcript abundance is plotted on the y axis. White rectangle, controls treated with control siRNA. Strip-filled rectangle, CEBPG knocked down by siRNA. The transcript abundance of CEBPG in cells treated with CEBPG siRNA was significantly reduced by 94% relative to cells treated with control siRNA. At the same time, ERCC5 transcript abundance in the CEBPG siRNA group decreased 4-fold compared to the control group.

Figure 3.2 Schematic overview of ERCC5 gene putative cis-regulatory polymorphic sites and orientation for allelic-specific expression measurement.

ERCC5 gene coordinate represents NCBI Gene NC_000013.11. All positions noted here are relative to the reported NCBI mRNA RefSeq TSS for ERCC5 NM_000123.3. Grey or white arrowheads indicate the direction of transcription relative to gene orientation. TSS, transcript start site. (A) The syntenic relationship of alleles at rs751402, rs2296147 and rs1047768 in individuals heterozygous for rs1047768 was assessed by allele-specific PCR amplification followed by direct sequencing. C allele Reverse Primer, a C-allele-specific primer used in combination with Forward Primer for specific amplification from the C allele at rs2296147 to determine synteny between alleles at rs2296147 and rs751402. T allele Forward Primer and C allele Forward Primer were used in combination with Reverse Primer in exon 2 for specific amplification from T allele or C allele, respectively, at rs2296147 to determine the synteny between alleles at rs2296147 and rs1047768. cDNA instead of gDNA was used for this amplification to avoid a large intron 1. The depicted cluster region of TSS represents highly variable ERCC5 transcription initiation sites as discussed previously (Blomquist, Crawford et al. 2010). (B) The allele-specific expression of ERCC5 was measured at two polymorphic sites, rs1047768 and rs17655.

Generally, native templates in cDNA sample were amplified with a known number of internal standards which contain identical priming sites and 6 nucleotides altered relative to native template. Barcodes that allowed for multiplexing and adapters specific for Illumina HiSeq platform were added by PCR. The products were quantified and purified then sent for sequencing on Illumina HiSeq. Asterisk indicates nucleotide alteration in internal standard (IS) relative to native template (NT).

Figure 3.3 Allelic ratios measured at rs1047768 and rs17655.

Base-2 logarithm transformation was applied to allelic ratios measured at two polymorphic sites, rs1047768 and rs17655, in cDNA and matched gDNA samples, and the transformed values were used for statistical tests. The dashed line at 0 is the reference line for an allelic ratio of 1. (A) Inter-individual variation in T/C allelic ratios measured at polymorphic site rs1047768 located in ERCC5 coding region exon 2 is significantly higher in cDNA samples relative to matched gDNA controls (F-test, p<0.0001). The mean log2(T/C ratio) in cDNA (M=0.11, SD=0.34) was significantly higher than that for matched gDNA (M=0.03, SD=0.11) according to t-test (p=0.0216). (B) Similarly to rs1047768, significantly higher inter-individual variation in allelic ratios of cDNA compared to matched gDNA (F-test, p=0.0005) was also observed at polymorphic site rs17655 located in ERCC5 exon 15 close to 3’UTR. The mean log2(G/C ratio) is significantly higher in cDNA (M=0.25, SD=0.35) than in gDNA (M=-0.14, SD=0.16) (t-test, p<0.0001). M, mean. SD, standard deviation.

Figure 3.4 Allelic ratios measured at rs1047768 sorted by various diplotype.

ANOVA was used to assess the difference in T/C allelic ratios among groups. All effects were statistically significant at the 0.05 significance level. Allelic ratios are shown for the six observed diplotypes at rs751402, rs2296147 and rs1047768 in ERCC5 5’UTR and exon 2. ANOVA revealed a significant difference in the mean allelic transcript ratio among the six diplotype groups (p=0.0030). The mean allelic transcript ratio among subjects with GTT/GCC diplotype (group 5) was 1.2-fold higher (p=0.0280) than that of subjects with GTT/GTC diplotype (group 6) and 1.4-fold higher (p=0.010) than that of subjects with GCT/GTC diplotype (group 4). There was no significant difference in mean allelic ratio between diplotype GTT/GCC and ATT/GCC (group 1) so they were combined and presented as NTT/GCC (group 7, N represents either G or A). The rs1047768 allele (T or C) syntenic with rs2296147 T allele displayed higher transcript abundance than the allele syntenic with rs2296147 C allele, regardless of allele present at rs751402. Consistent with this, the mean allelic ratio in group 7 was significantly higher (p<0.05) than that in groups 3, 4, and 6 (1.44-fold, 1.42-fold, and 1.22-fold, respectively).

Figure 3.5 Correlation of CDKN1A and ERCC5 transcript abundance in NBEC

samples. (A) Correlation of CDKN1A and ERCC5 in non-cancer subjects. Pearson’s

correlation coefficient was calculated, r=0.58, p=0.0014. (B) Correlation of CDKN1A and

ERCC5 in cancer subjects, r=-0.08, p=0.7781. The correlation coefficient was

significantly decreased in cancer subjects relative to non-cancer subjects (Fisher Z-test,

p=0.0002) indicating a significantly altered correlation in cancer subjects. NC, non-

cancer. CA, cancer. NBEC, normal bronchial epithelial cells.

Figure 3.6 Increased lung cancer risk through sub-optimal normal bronchial

epithelial (NBEC) regulation of protective genes.

This schematic indicates the putative genetic basis for hereditary increased lung cancer

risk in three individuals. SNPs that affect transcript abundance regulation are indicated numerically and as diamonds (trans-regulatory SNPs) or circles (cis-regulatory SNPs). As indicated, each individual is at increased risk due to sub-optimal regulation of a different combination of genes. Further, when the same gene is sub-optimally regulated in multiple individuals (e.g. Gene C in Individuals 1 and 3), a different set of SNPs may be responsible in each individual.

3.9 Table and Figure

Table 3.1

Haplotype
rs751402    rs2296147    rs1047768    Count    Frequency
G           C            C            112      37%
G           T            C            63       21%
G           T            T            55       18%
A           T            T            51       17%
G           C            T            12       4%
A           T            C            7        2%

Table 3.2

rs751402-rs2296147-rs1047768 Diplotype
Parental Chromosome 1    Parental Chromosome 2    Count    Frequency
A-T-T                    G-C-C                    11       25%
A-T-T                    A-T-C                    1        2%
G-C-T                    G-C-C                    3        7%
G-C-T                    G-T-C                    3        7%
G-T-T                    G-C-C                    17       39%
G-T-T                    G-T-C                    9        20%

Figure 3.1

Figure 3.2

Figure 3.3

Figure 3.4

Figure 3.5

Figure 3.6

3.10 Supplemental Table and Figure Legends

Table S3.1 Demographic characteristics of enrolled 80 subjects.

Note: a Two-sided Student's t-test to assess mean difference between non-cancers and cancers. b Two-sided Chi-square test to determine the difference of distributions between non-cancers and cancers. c Ethnicity information for three cancer subjects was missing.

Table S3.2 Summary of genotype, diplotype and allelic ratios for heterozygotes at rs1047768.

Table S3.3 ON-TARGETplus SMARTpool siRNA Sequences

3.11 Supplemental Table and Figure

Table S3.1

                           Non-Cancers    Cancers       p value
                           N=60           N=20
Age (years) [Mean ± SD]    64.7 ± 7.5     69.3 ± 9.2    0.0648^a
Gender                                                  0.1548^b
  Female                   28 (84.8%)     5 (15.2%)
  Male                     32 (71.1%)     13 (28.9%)
Ethnicity                                               0.5305^b
  Caucasian                53 (76.8%)     16 (23.2%)
  African American^c       7 (87.5%)      1 (12.5%)

Table S3.2

Genotype is given at each SNP. Diplotype is given as rs751402-rs2296147-rs1047768. T:C allelic ratio was measured at rs1047768.

Subject    Genotype                             Diplotype                     T:C allelic ratio
ID         rs751402    rs2296147    rs1047768   Parental      Parental        gDNA    cDNA
                                                Chromosome 1  Chromosome 2
1036       A/G         T/C          T/C         A-T-T         G-C-C           1.04    0.84
2007       G/G         T/C          T/C         G-T-T         G-C-C           1.05    1.22
3030       A/G         T/C          T/C         A-T-T         G-C-C           1.02    0.89
1016       G/G         T/C          T/C         G-T-C         G-C-T           0.99    0.93
1021       G/G         T/T          T/C         G-T-T         G-T-C           1.00    0.47
1060       G/G         T/C          T/C         G-T-T         G-C-C           1.04    1.01
1082       A/G         T/C          T/C         A-T-T         G-C-C           1.07    0.98
3002       G/G         T/C          T/C         G-T-T         G-C-C           1.04    1.07
3063       G/G         C/C          T/C         G-C-T         G-C-C           1.03    0.64
2021       G/G         T/T          T/C         G-T-T         G-T-C           1.12    1.12
3051       G/G         T/T          T/C         G-T-T         G-T-C           1.02    1.12
1084       A/G         T/C          T/C         A-T-T         G-C-C           1.01    1.17
2012       G/G         T/C          T/C         G-T-C         G-C-T           1.05    0.80
1080       A/A         T/T          T/C         A-T-T         A-T-C           0.96    1.03
1074       A/G         T/C          T/C         A-T-T         G-C-C           1.02    1.10
1067       A/G         T/C          T/C         A-T-T         G-C-C           1.09    0.94
2020       G/G         T/C          T/C         G-T-C         G-C-T           0.99    0.81
1079       G/G         T/C          T/C         G-T-T         G-C-C           1.01    0.84
2018       G/G         T/C          T/C         G-T-T         G-C-C           1.04    1.45
2035       G/G         T/T          T/C         G-T-T         G-T-C           1.02    0.88
J003       G/G         T/T          T/C         G-T-T         G-T-C           0.78    0.83
J004       G/G         C/C          T/C         G-C-T         G-C-C           1.06    0.98
J007       G/G         T/C          T/C         G-T-T         G-C-C           1.06    1.00
L005       A/G         T/C          T/C         A-T-T         G-C-C           0.99    1.14
L014       G/G         T/C          T/C         G-T-T         G-C-C           0.97    1.03
532        A/G         T/C          T/C         A-T-T         G-C-C           1.15    1.40
344        A/G         T/C          T/C         A-T-T         G-C-C           1.07    1.65
574        A/G         T/C          T/C         A-T-T         G-C-C           1.09    1.53
289        A/G         T/C          T/C         A-T-T         G-C-C           1.10    1.34
720        A/G         T/C          T/C         A-T-T         G-C-C           1.06    1.45
399        G/G         T/C          T/C         G-T-T         G-C-C           1.21    1.37
572        G/G         T/C          T/C         G-T-T         G-C-C           0.95    1.12
591        G/G         T/C          T/C         G-T-T         G-C-C           0.87    1.23
652        G/G         T/C          T/C         G-T-T         G-C-C           0.87    1.15
664        G/G         T/C          T/C         G-T-T         G-C-C           0.99    1.07
128        G/G         T/C          T/C         G-T-T         G-C-C           0.95    1.64
286        G/G         T/C          T/C         G-T-T         G-C-C           0.97    1.15
389        G/G         T/C          T/C         G-T-T         G-C-C           0.94    1.47
715        G/G         T/C          T/C         G-T-T         G-C-C           0.89    2.10
303        G/G         T/T          T/C         G-T-T         G-T-C           0.97    0.97
589        G/G         T/T          T/C         G-T-T         G-T-C           1.16    1.10
649        G/G         T/T          T/C         G-T-T         G-T-C           1.06    0.90
650        G/G         T/T          T/C         G-T-T         G-T-C           1.10    0.95
309        G/G         C/C          T/C         G-C-T         G-C-C           1.18    0.91

Table S3.3

Human CEBPG    Target Sequence
siRNA-05       UCGAAACAGUGACGAGUAU
siRNA-06       GAACGGAAUUAGUGUUAUC
siRNA-07       GGAAUUAAGUGUACUCAAA
siRNA-08       GACAGCAGAUGGCGACAAU

Chapter 4 Lung Cancer Risk Test Trial: Study Design, Participant Baseline Characteristics, Bronchoscopy Safety, and Establishment of Biospecimen Repository

E.L. Crawford1, A. Levin2, F. Safi1, M. Lu2, A. Baugh1, X. Zhang1, J. Yeo1, S.A. Khuder1, A.M. Boulos1, P. Nana-Sinkam3, P.P. Massion4, D.A. Arenberg5, D. Midthun6, P.J. Mazzone7, S.D. Nathan8, R. Wainz9, G. Silvestri10, J. Tita11, and J.C. Willey1,*

1Department of Pulmonary and Critical Care, The University of Toledo Medical Center – Toledo, OH, 2Department of Biostatistics, Henry Ford Hospital System, Detroit, MI, 3Ohio State University James Comprehensive Cancer Center and Solove Research Institute – Columbus, OH, 4Thoracic Program, Vanderbilt Ingram Cancer Center – Nashville, TN/US, 5University of Michigan – Ann Arbor, MI, 6Mayo Clinic – Rochester, MN/US, 7Cleveland Clinic – Cleveland, OH, 8Inova Fairfax Hospital – Falls Church, VA/US, 9The Toledo Hospital – Toledo, OH, 10Medical University of South Carolina – Charleston, SC/US, 11Mercy/St. Vincent’s Hospital, Toledo, OH.

* To whom correspondence should be addressed.

Published in BMC Pulmonary Medicine, January 22, 2016.

4.1 Abstract

Introduction. The Lung Cancer Risk Test (LCRT) trial is a prospective cohort study comparing lung cancer incidence among persons with a positive or negative value for the

LCRT, a 15 gene test measured in normal bronchial epithelial cells (NBEC). The purpose of this article is to describe the study design, primary endpoint, and safety; baseline characteristics of enrolled individuals; and establishment of a bio-specimen repository.

Methods. Eligible participants were aged 50-90 years, were current or former smokers with a cigarette smoking history of 20 pack-years or more, were free of lung cancer, and were willing to undergo bronchoscopic brush biopsy for NBEC sample collection. NBEC, peripheral blood samples, baseline CT, and medical and demographic data were collected from each subject.

Results. Over a two-year span (2010-2012), 403 subjects were enrolled at 12 sites. At baseline, 384 subjects remained in the study; mean age and smoking history were 62.9 years and 50.4 pack-years, respectively, with 34% current smokers. Obstructive lung disease (FEV1/FVC <0.7) was present in 157 (54%). No severe adverse events were associated with bronchoscopic brushing. An NBEC and matched peripheral blood bio-specimen repository was established.

Conclusions. The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for low-dose

Computed Tomography (LDCT) lung cancer screening. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population.

These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early detection of lung cancer and delay the onset of screening for individuals with results indicating low lung cancer risk. For these individuals, the small risk incurred by undergoing once-in-a-lifetime bronchoscopic sample collection for LCRT may be offset by a reduction in their CT-related risks. The LCRT biospecimen repository will enable additional studies of the genetic basis for COPD and/or lung cancer risk.

Trial Registration: The LCRT Study, NCT 01130285, was registered with

Clinicaltrials.gov on May 24, 2010.

Key words: lung cancer risk test, hereditary lung cancer risk, normal bronchial epithelial cells, lung cancer screening, bronchoscopy safety, bronchial brush safety

4.2 Introduction

Lung cancer claimed nearly 160,000 lives in 2014 in the United States alone (2014). Prevention efforts have reduced cigarette smoking prevalence from about 50% in 1960 to less than 20% today but, due to past and continued cigarette smoking and the lack of effective treatment for advanced disease, lung cancer kills more than the next three most deadly cancers (breast, colon, prostate) combined and is expected to do so for decades to come (2014). Because prognosis is related to stage, there has long been interest in detecting lung cancer at an early stage, when it is amenable to potentially curative treatment. Thus, it is notable that the US Preventive Services Task Force (USPSTF) now recommends lung cancer screening with low-dose CT (LDCT) for healthy individuals at high risk for lung cancer on the basis of evidence that it will detect the majority of lung cancers at an early stage and thereby reduce lung cancer mortality by 20% (National Lung Screening Trial Research, Aberle et al. 2011, Humphrey, Deffebach et al. 2013). However, screening is associated with adverse consequences, including identification of large numbers of nodules, most of which will be nonmalignant, and the complications, costs, and anxiety associated with diagnostic tests (Bach, Mirkin et al. 2012). These adverse consequences could be reduced by restricting screening eligibility to only those at greatest risk. Among the approximately 8 million subjects eligible for screening according to current criteria, which include smoking history ≥30 pack-years and age 55-80 years (Blomquist, Crawford et al. 2009, National Lung Screening Trial Research, Aberle et al. 2011), risk varies widely from less than 0.08% per year to over 1% per year (van Klaveren, Habbema et al. 2001, Bach, Kattan et al. 2003, Spitz, Hong et al. 2007, Cassidy, Myles et al. 2008, Field, Baldwin et al. 2011, Tammemagi, Pinsky et al. 2011, Raji, Duffy et al. 2012, Kovalchik, Tammemagi et al. 2013). As such, a large majority of screened individuals will not develop lung cancer in their lifetime, and the overall benefit of screening is reduced by the adverse events and large cost associated with screening subjects who will not benefit due to low risk. For these reasons, there is increasing interest in the development of a diagnostic molecular test for lung cancer risk that will more accurately stratify subjects for screening. It is expected that limiting screening to those with a positive risk test will reduce the high cost and side effects of screening programs.

Different approaches are currently in progress to develop a molecular diagnostic test for lung cancer risk in the group eligible for annual CT screening based on demographic criteria. These approaches may be divided into two broad categories, early diagnosis and hereditary risk.

The early diagnosis strategy is to detect lung cancers in early stage before symptoms occur so that they can be treated with high chance for cure. This category includes approaches to identify pre-clinical early lung cancer based on blood tests for circulating proteins, antibodies, and/or microRNA (Boeri, Verri et al. 2011, Higgins, Roper et al. 2012, Pecot, Li et al. 2012, Cazzoli, Buttitta et al. 2013, Daly, Rinewalt et al. 2013, Mehan, Williams et al. 2014, Birse, Lagier et al. 2015, Vachani, Pass et al. 2015), or gene expression tests measured in non-cancer bronchial or nasal airway epithelium that reflect presence of lung cancer due to a field effect (Spira, Beane et al. 2007, Gower, Steiling et al. 2011, Silvestri, Vachani et al. 2015). Because these tests are for early detection they will need to be repeated periodically. A positive test will inform a decision regarding more conservative or more rigorous assessment for presence of lung cancer, including chest CT and/or PET-CT, followed by biopsy. If the intended use is to serve as the primary screening method, an early diagnosis test will need to demonstrate non-inferiority relative to the screening test currently recommended by the USPSTF, annual low dose helical CT.

The hereditary risk test strategy is to identify individuals who have a genetic predisposition to lung cancer so that they can be prioritized for annual chest CT screening. Approaches to identify hereditary risk include a) genome wide association studies (GWAS) to discover DNA polymorphisms associated with lung cancer (Wang, McKay et al. 2014, Wang, Zhu et al. 2014) and b) studies to identify risk-associated proximate phenotypic markers (Blomquist, Crawford et al. 2009). The Lung Cancer Risk Test (LCRT) falls into this latter category. The LCRT is a 15-gene test measured in grossly normal bronchial epithelial cells (NBEC) obtained through bronchial brush biopsy (Blomquist, Crawford et al. 2009). The proximate phenotypic markers of hereditary risk included in the LCRT are key protective antioxidant, DNA repair, and cell cycle control genes that are sub-optimally regulated in NBEC. The rationale for this approach is that sub-optimal NBEC regulation of a protective gene has greater effect on risk than an individual single nucleotide polymorphism (SNP). This conclusion is based on results of previous studies in which we identified cis-regulatory SNPs associated with sub-optimal regulation of genes included in the LCRT, including ERCC5 (Blomquist, Crawford et al. 2010); [Zhang, submitted] and CEBPG (Blomquist, Brown et al. 2013). For example, we identified two cis-regulatory SNPs that independently contribute to regulation of ERCC5 transcript abundance (Blomquist, Crawford et al. 2009, Blomquist, Crawford et al. 2010); [Zhang, submitted]. Thus, a proximate phenotype based on sub-optimal NBEC regulation of a protective gene enriches for risk determining SNPs.

The clinical setting for LCRT biomarker intended use is individuals who are approaching annual CT screening eligibility according to USPSTF criteria (Humphrey, Deffebach et al. 2013). In order to have clinical utility it is important that the test be both accurate and safe to perform in this intended population. In an effort to assess the accuracy and safety of the LCRT we initiated a multi-site prospective cohort trial. The purpose of this report is to describe 1) the LCRT trial study design and primary endpoint, 2) baseline characteristics of enrolled individuals including demographic and lung function data, and 3) secondary endpoints reached thus far, including a) analysis of safety for the bronchoscopic brush method used to obtain samples for LCRT testing, and b) establishment of a biospecimen repository containing NBEC and peripheral blood samples collected from the LCRT cohort.

4.3 Methods

4.3.1 Study design.

This LCRT study (Clinicaltrials.gov, NCT 01130285) was conducted after approval by an institutional review board at each participating institution (University of Toledo Medical Center, Mayo Clinic, University of Michigan, The Toledo Hospital, Ohio State University, Vanderbilt University Medical Center/Tennessee Valley VA Medical Center, Henry Ford Health System, National Jewish Health, Medical University of South Carolina, Inova Fairfax Hospital, Cleveland Clinic Foundation and Mercy St. Vincent Medical Center; see Table S4.1) and under a Food and Drug Administration (FDA) approved Investigational Device Exemption (IDE G090273). The original design to assess the clinical utility of the LCRT biomarker was a prospective, blinded, nested case-control study. The original primary endpoint was prediction of risk for development of lung cancer with an odds ratio of at least 5.0. It was estimated that there would have been sufficient power to test this endpoint by enrolling approximately 800 subjects and following them for 3 years, resulting in identification of at least 15 prospective lung cancer cases. LCRT analysis would then be conducted in NBEC of the 15 cases and 120 matched controls. However, the study was revised to a prospective cohort design due to a) advances in technology that enable cost-effective measurement of LCRT in all subjects, and b) the greater power associated with this design. The new design and primary endpoints are described below.

The secondary endpoints and analyses were unchanged and include: 1) determination of study safety at day 30, 2) establishment and maintenance of a biospecimen repository of biological specimens derived from NBEC [RNA and cytology slides] and corresponding blood samples [peripheral blood leukocyte Buffy Coat and frozen plasma] from the subjects enrolled, 3) analysis of the predictive ability of LCRT positive for lung cancer including sensitivity, specificity, positive predictive value, and negative predictive value, 4) calculation of absolute risk of LCRT positive for lung cancer, and 5) measurement of the incidence of lung cancer in the study cohort every two years until the end of study. Additionally, we will explore the influence of demographic or clinical variables for lung cancer on the predictive ability of LCRT.
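The predictive-ability measures named in the secondary endpoints can all be derived from a 2x2 cross-tabulation of LCRT result against lung cancer outcome. The following is a minimal Python sketch of those definitions, not the study's analysis code; the class name, function names, and example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from cross-tabulating test result against outcome."""
    tp: int  # test positive, developed lung cancer
    fp: int  # test positive, did not develop lung cancer
    fn: int  # test negative, developed lung cancer
    tn: int  # test negative, did not develop lung cancer


def _ratio(num, den):
    """Return num/den, or None when the denominator is zero."""
    return num / den if den else None


def accuracy_metrics(t: TwoByTwo) -> dict:
    """Sensitivity, specificity, PPV, NPV and risk ratio for one table."""
    risk_pos = _ratio(t.tp, t.tp + t.fp)  # absolute risk if test positive (PPV)
    risk_neg = _ratio(t.fn, t.fn + t.tn)  # absolute risk if test negative
    return {
        "sensitivity": _ratio(t.tp, t.tp + t.fn),
        "specificity": _ratio(t.tn, t.tn + t.fp),
        "ppv": risk_pos,
        "npv": _ratio(t.tn, t.tn + t.fn),
        "risk_ratio": _ratio(risk_pos, risk_neg) if risk_pos is not None else None,
    }


# Hypothetical counts for illustration only; these are not study results.
example = TwoByTwo(tp=10, fp=90, fn=2, tn=198)
metrics = accuracy_metrics(example)
```

In this hypothetical table the positive predictive value is 10/100 and the risk among test-negative subjects is 2/200, so the risk ratio is about 10.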

4.3.2 Revised study design.

After development of a novel targeted NGS platform (Blomquist, Crawford et al. 2013), we implemented LCRT measurement on this platform. The higher throughput of the NGS method enables cost-effective analysis of samples from all 384 subjects and conversion to a prospective cohort study with greater power compared to the original nested case-control design. We plan to assess association of the LCRT value with development of lung cancer in this cohort through follow-up every one to two years for up to 20 years. We will estimate disease-free probabilities for different measured LCRT values at six and eight years of follow up. The primary endpoint will be the prediction of risk for development of lung cancer with a risk ratio of at least 5.0 and we expect to reach this endpoint at the six year follow-up.

Assuming a 20% rate of failure to re-contact (due to death or other factors), approximately 300 individuals from the cohort will be available for analysis. Based on the demographic characteristics of the LCRT cohort, the expected cumulative incidence at six years following enrollment (which will be reached for all subjects between 2016 and 2018) is >5%. Assuming a two-tailed test of significance and a type-1 error rate of 0.05, there will be >80% power to detect a risk ratio associated with LCRT positivity of >2.45, 1.82, 1.65, 1.57, 1.49, and 1.42 for cumulative incidence rates of lung cancer of 1%, 2%, 3%, 4%, 5%, and 6%, respectively, in the cohort at the six year follow-up. Thus, this proposed study is more than adequately powered to detect even modest LCRT effects at the next planned follow-up. In addition to the risk ratio associated with a positive LCRT, we will also calculate the concordance index of the test based on the estimated Cox proportional hazards model. The concordance index in the Cox model is the correlate to the area under the receiver operating characteristic curve for a logistic regression model. We will use it to measure LCRT biomarker accuracy in the full cohort analysis.
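To illustrate what the concordance index measures, the sketch below computes Harrell's C directly from follow-up times, event indicators, and a risk score. It is a simplified illustration written for this purpose, not the planned analysis: in practice the score would come from a fitted Cox model, and this version assumes that pairs with tied follow-up times can simply be skipped. The function name and the example data are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, scores):
    """Harrell's C for right-censored data.

    times  : follow-up time for each subject
    events : 1 if lung cancer was observed, 0 if the subject was censored
    scores : predicted risk (higher means the event is expected sooner)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only when the two times differ and the
        # shorter time ends in an observed event (simplifying assumption:
        # tied times are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0   # higher risk failed first
        elif scores[a] == scores[b]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data for illustration only.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    scores=[0.9, 0.7, 0.8, 0.1],
)
```

In this hypothetical example five pairs are comparable and four are ranked correctly, giving a concordance index of 0.8. A value of 0.5 corresponds to chance ranking and 1.0 to perfect ranking.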

4.3.3 Participants. To participate in the study, subjects had to be willing and able to provide and sign both written Informed Consent and Health Insurance Portability and Accountability Act Authorization (HIPAA) forms for this study, undergo bronchoscopy and phlebotomy procedures for the collection of biological specimens and follow up interviews and CT scans. Entry criteria required subjects to be at high demographic risk for lung cancer based on age 50-90 years, and a minimum of 20 pack-years of cigarette smoking history, but to have low likelihood for lung cancer at the time of bronchoscopy. Both current (defined as self-reported regular use of cigarettes) and former cigarette smokers were eligible. Consent included bronchial brush biopsy to obtain NBEC samples at time of either a) standard of care (SOC) bronchoscopy for a clinical indication for bronchoscopy, b) a study-driven (SD) bronchoscopy, or c) bronchoscopy done for another research study to which they had consented (also considered to be SD). Subjects had to be without a diagnosis of lung cancer prior to or at enrollment. Women with the potential for pregnancy had to have a negative result on a pregnancy test. Subjects were excluded if they were previously diagnosed or treated for lung cancer or had a high pretest likelihood of lung cancer, if they were positive for hepatitis B, C, HIV, or had active TB or if the physician deemed them to be medically inappropriate due to safety concerns. Also excluded were children, pregnant women, prisoners, mentally disabled, those that had received a double lung transplantation, radiation or chemotherapy of any kind within the last month and those scheduled to receive either radiation or chemotherapy.

4.3.4 Recruitment strategies. Twelve medical institutions participated in the LCRT study (Clinicaltrials.gov, NCT 01130285, Table S4.1).

Participants were recruited through physician referral as well as by advertisements in local newspapers, on institutional web sites and through Clinicaltrials.gov. The goal was to enroll a sample representative of the US population at high risk of lung cancer death based on demographic criteria.

4.3.5 Enrollment.

Subjects were considered enrolled in the LCRT study when they underwent the study procedure (bronchial brush biopsy with NBEC sample collection). All enrolled subjects had a CT of the chest performed within 3 months prior to study entry or a research driven CT scan within two weeks after study entry to rule out prevalent lung cancer. Study eligibility, including smoking history, was assessed through initial contact interview by a trained clinical coordinator at each site. The initial Contact Report Form (CRF) was designed to allow for computation of number of pack-years of cigarettes smoked as well as a detailed smoking history that included information on periods of smoking cessation and use of other forms of tobacco such as pipes and cigars. The CRF also contained questions on personal history of selected diseases, stroke, and diabetes, family history of lung cancer, occupational history (jobs and industries either previously demonstrated or thought to be associated with increased risk for lung disease or lung cancer), education, and marital status.
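The pack-year computation supported by the CRF can be sketched as follows. One pack-year corresponds to smoking one pack (20 cigarettes) per day for one year; periods of cessation contribute nothing because only the intervals of active smoking are summed. The record layout and field names below are hypothetical, since the CRF structure itself is not reproduced here.

```python
from dataclasses import dataclass

CIGARETTES_PER_PACK = 20


@dataclass
class SmokingPeriod:
    """One interval of active smoking (hypothetical record layout)."""
    start_age: float
    stop_age: float
    cigarettes_per_day: float


def pack_years(periods):
    """Sum pack-years over all intervals of active cigarette smoking."""
    total = 0.0
    for p in periods:
        years = max(0.0, p.stop_age - p.start_age)
        total += (p.cigarettes_per_day / CIGARETTES_PER_PACK) * years
    return total


# Hypothetical subject: smoked from age 18 to 30, quit for five years,
# then smoked 1.5 packs per day from age 35 to 55.
history = [
    SmokingPeriod(start_age=18, stop_age=30, cigarettes_per_day=20),
    SmokingPeriod(start_age=35, stop_age=55, cigarettes_per_day=30),
]
total_pack_years = pack_years(history)
```

For this hypothetical history the two intervals contribute 12 and 30 pack-years, for a total of 42.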

4.3.6 Sample collection.

Standardized sample collection kits were provided to each site. Kits contained supplies for the collection and labeling of biological samples including a disposable bronchial cytology brush (ConMed Corporation, Utica, NY ref.#149) for the collection of NBEC, a 10 ml K2-EDTA vacutainer tube (Becton, Dickinson and Company, Franklin Lakes, NJ ref.#366643) for the collection of whole blood and barcoded stickers. Following positioning of the bronchoscope, the cytology brush was inserted and NBEC were collected from a grossly normal region of either main stem bronchus. For SOC bronchoscopies, this occurred immediately after the diagnostic procedures and the opposite side or in a separate area from the lung region under clinical investigation. If the patient had received a lung transplant, the specimen was obtained from the recipient native mainstem bronchus. The brush was withdrawn, shaken into a tube of normal saline chilled on ice and re-inserted into the bronchoscope for collection of additional NBEC. This procedure was repeated a total of 5-10 times. After the last brushing, the cytology brush was shaken in the saline and then dabbed onto a glass slide to enable assessment by a pathologist. Immediately prior to or immediately following bronchoscopy, approximately 10 ml of whole blood was obtained using standard phlebotomy techniques into a K2-EDTA vacutainer tube. Blood and NBEC samples were transferred to the lab within 10 min for processing and stabilization, which was initiated within 1 hour post-collection.


4.3.7 Follow up.

Subjects enrolled into the study were followed at 30 days for adverse events (AE) and serious adverse events (SAE) possibly related to the study procedure and then every 3 months throughout the first two years following enrollment. A research driven CT was done at the one and two year anniversaries of enrollment if a standard of care CT was not done within three months of the anniversary. The next follow-up is planned for 2016 with another in 2018. At each follow-up subjects will receive medical record review and phone interview. Those who meet USPSTF guidelines will be encouraged to enter the closest CT screening program for early detection of lung cancer. Those who do not meet current reimbursement criteria for CT screening will receive a study driven chest CT.

4.3.8 Safety analysis: Adverse events and serious adverse events.

Subjects were monitored for all adverse events (AE) immediately following bronchoscopy until deemed medically stable and ready for discharge and again at 30 days after study enrollment by way of a phone call with the subject. Subjects were monitored for serious adverse events (SAE) for two years following enrollment.

Possible AEs included, but were not limited to, fatigue, muscle aches, bitter taste in mouth, dry or sore throat, hoarseness, fever [greater than 100°F for more than 24 hours], bronchospasm, arrhythmia, pneumothorax, hemoptysis, shortness of breath and infections. An SAE was defined as any serious effect on the health or safety or any life-threatening problem or death caused by, or associated with the study procedure if that effect, problem or death was not previously identified in the investigational plan or application. These included hospitalization [>24 hours], death, disability, or any event that required intervention to prevent damage.


AEs and SAEs were documented and classified in terms of severity [mild, moderate, severe], expectedness [expected or unexpected] and relatedness [unlikely, possibly, probably or unknown]. A medical monitor at the data coordinating center (Dr. Paul Kvale at Henry Ford Health System) worked closely with each site PI and ultimately was responsible for the final determination of SAE relatedness. Treatments or interventions and outcomes also were documented.

4.3.9 Statistical analysis.

Statistical significance was determined using an F-test of equality of variances followed by a Student’s t-test for comparison of groups on continuous variables and Chi square or Fisher exact test for categorical variables. Differences were considered significant if p < 0.05. Power analysis was conducted as described above in the Revised Study Design section.
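As an illustration of the categorical comparison, the sketch below computes a two-sided Fisher exact p-value for a 2x2 table using only the Python standard library. It is a minimal example of the test itself, not the software used in this study, and the counts in the example are hypothetical. The two-sided p-value is taken as the sum of the probabilities of all tables with the same margins that are no more probable than the observed table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)   # smallest feasible top-left count
    hi = min(row1, col1)       # largest feasible top-left count
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Hypothetical counts for illustration only; these are not study data.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
significant = p_value < 0.05
```

For this hypothetical table the p-value is 34/70 (about 0.49), which would not be considered significant at the 0.05 threshold used here.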

4.4 Results

Here we present the baseline characteristics of the enrolled LCRT cohort, and results for secondary endpoints that have been reached including safety analysis and establishment of the NBEC and peripheral blood sample biospecimen repository.

4.4.1 Enrollment.

Accrual for the LCRT study was completed in March 2012. We enrolled 403 subjects with demographic risk factors for lung cancer into a prospective multi-site, blinded LCRT study, performed bronchoscopy at enrollment, and collected NBEC and blood (buffy coat and plasma) samples from each subject (Figure 1). Of the 403 subjects enrolled, 288 were enrolled at the time of a standard of care (SOC) bronchoscopy done for diagnostic purposes and 115 were enrolled at time of a volunteer study driven (SD) bronchoscopy. Of the 288 SOC bronchoscopies, 64 were done to evaluate for lung cancer, 34 for monitoring following lung transplantation, and the remaining 190 for a variety of indications. Of the 403 subjects enrolled, 18 were removed from the study as screen failures due to diagnosis of prevalent lung cancer at enrolling bronchoscopy or subsequent tests and one subject withdrew from the study, leaving 384 subjects in the cohort. We conducted a descriptive analysis of baseline data for the 384 remaining subjects.

4.4.2 Demographic information.

Subject population characteristics are shown in Table 4.1. Among the 384 subjects, mean age was 62.9 ± 8.2 years and mean smoking history was 50.4 pack-years. Thirty-four percent were current smokers and approximately 10% were concomitant cigar and/or pipe smokers. The cohort included 213 males (55%) and 171 females (45%), 89% Caucasians, 10% African Americans, and 1% other. Sixty percent of subjects were married or living with a partner, 30% were widowed, and 10% were single. A majority (66%) were high school graduates, with or without some college short of a bachelor’s degree, 10% held a bachelor’s degree and 6% held an advanced degree. Reported income was less than $40,000 per year in 37% of subjects, although 31% of subjects (120 individuals) chose not to provide household income information. Forty-six percent were retired and 17% were disabled (Table 4.1). Work-related exposures were reported by 234 subjects (61%), with the highest percentages being asbestos (n=54, 14%), farming (n=41, 11%), chemicals or plastics (10%), welding (10%), foundry or steel milling (9%), and painting (9%) (Table S4.2). Each subject had a chest CT scan at the time of enrollment; 242 subjects (63%) had a clinically indicated (standard of care) CT scan within three months prior to enrollment and the remaining 142 (37%) had a research driven CT scan within 2 weeks of enrollment. Twelve percent of subjects were undergoing evaluation for lung cancer at time of enrollment and were negative for cancer (Table S4.3). Based on responses to the baseline questionnaire, self-reported prevalence of chronic obstructive pulmonary disease (COPD) was 41% (n = 156), chronic bronchitis 18% (n = 68), and emphysema 28% (n = 106) (Table S4.3). Because Pulmonary Function Test (PFT) data were available for most subjects, it was possible to compare self-reported COPD prevalence to test data (see below). Prevalence of other self-reported lung diseases was: interstitial lung disease 9% (n = 35), and sarcoidosis 3% (n = 10) (Table S4.3).

4.4.3 SOC vs SD bronchoscopy characteristics.

The intended population for the LCRT includes both subjects for whom diagnostic bronchoscopy is indicated who also will benefit from LCRT measurement and subjects who will have bronchoscopy only to obtain NBEC samples for LCRT measurement. Therefore, we compared baseline characteristics between the SOC and SD bronchoscopy subject groups, which represent each of these respective intended population categories. Of 384 subjects enrolled, bronchoscopy was SOC in 269 (70%) and SD in 115 (30%). There were no significant differences in pack years smoked (Table S4.4). SD subjects were slightly younger (mean age of 61.5 compared to 63.6, p = 0.021), more likely to be current smokers (55% vs. 25%, p < 0.001), and less likely to have COPD (41% vs. 60%, p = 0.002) (Table S4.4).

4.4.4 Lung cancer screening eligible sub-group.


The USPSTF age and smoking pack year eligibility criteria for lung cancer screening by annual low-dose helical chest CT are 55-80 years and a minimum of 30 pack-years, respectively. Among subjects enrolled into the LCRT study, 253/384 (65.9%) were eligible for annual screening at enrollment, according to these criteria. Seventy subjects did not meet the minimum age criterion at time of enrollment. By the 2016 follow up time point, 45 of these 70 will be eligible for screening and 69/70 will be eligible by the 2018 follow up.
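The age and pack-year criteria stated above can be expressed as a simple rule. The sketch below encodes only those two criteria (age 55-80 years and at least 30 pack-years) and ignores any other conditions of the screening recommendation; the function names are hypothetical. It also reproduces the reported proportion of subjects eligible at enrollment.

```python
def uspstf_eligible(age, pack_years, min_age=55, max_age=80, min_pack_years=30):
    """True if a subject meets the age and pack-year criteria in the text."""
    return min_age <= age <= max_age and pack_years >= min_pack_years


def years_until_age_eligible(age, min_age=55):
    """Years until a younger subject reaches the minimum screening age."""
    return max(0, min_age - age)


# Proportion eligible at enrollment, using the counts reported above.
eligible_at_enrollment = 253
cohort_size = 384
percent_eligible = round(100 * eligible_at_enrollment / cohort_size, 1)

# Hypothetical subjects for illustration only.
younger_subject_ok = uspstf_eligible(age=52, pack_years=40)   # too young
older_subject_ok = uspstf_eligible(age=62, pack_years=50)     # meets both
```

A subject who is below the minimum age at enrollment but meets the pack-year criterion becomes eligible once `years_until_age_eligible` has elapsed, which is why the eligible fraction of the cohort increases at later follow-up points.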

4.4.5 Chronic obstructive pulmonary disease.

We assessed COPD status in the enrolled cohort because COPD is an independent risk factor for lung cancer (Skillrud, Offord et al. 1986, Tockman, Anthonisen et al. 1987, Mayne, Buenconsejo et al. 1999, Mannino, Aguayo et al. 2003, Wasswa-Kintu, Gan et al. 2005, Purdue, Gold et al. 2007, Young, Hopkins et al. 2009, Schwartz 2012, de-Torres, Wilson et al. 2015). COPD was defined using GOLD criteria based on pulmonary function test (PFT) data. Demographic information relative to COPD status is displayed in Table 4.2. PFT information was available for 290 subjects. Fifty-four percent of these (157 subjects) had COPD based on PFT. Among the 157 subjects with COPD based on PFT, COPD severity was GOLD stage 2 or worse in more than 70% based on established criteria. Mean FEV1/FVC was 0.52 for the 157 subjects with COPD (all stages) compared to 0.78 for the 133 without COPD. Those with COPD were more likely to be male (62% vs. 38% female, p = 0.027) and have a higher mean pack year smoking history (56 vs. 45 for non-COPD, p < 0.001). No differences were noted in age, race or smoking status (current vs. former smokers) (Table 4.2).
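For readers unfamiliar with spirometric staging, the sketch below shows one common formulation of the GOLD criteria: airflow obstruction is defined by an FEV1/FVC ratio below 0.70, and severity is graded by FEV1 as a percentage of the predicted value. The thresholds in the code are an assumption based on the commonly cited GOLD spirometric grades, since the exact criteria applied in this study are not spelled out here, and the function name and example values are hypothetical.

```python
def gold_stage(fev1_fvc_ratio, fev1_percent_predicted):
    """Return 0 when there is no airflow obstruction, else GOLD stage 1-4.

    Assumed thresholds (commonly cited GOLD spirometric grades):
      obstruction : FEV1/FVC < 0.70
      stage 1     : FEV1 >= 80% of predicted
      stage 2     : 50% <= FEV1 < 80%
      stage 3     : 30% <= FEV1 < 50%
      stage 4     : FEV1 < 30%
    """
    if fev1_fvc_ratio >= 0.70:
        return 0
    if fev1_percent_predicted >= 80:
        return 1
    if fev1_percent_predicted >= 50:
        return 2
    if fev1_percent_predicted >= 30:
        return 3
    return 4


# Hypothetical spirometry values for illustration only.
stage_without_obstruction = gold_stage(0.78, 95)
stage_moderate = gold_stage(0.52, 60)

# Proportion of subjects with PFT data who had COPD, from the counts above.
percent_copd = round(100 * 157 / 290)
```

With the counts reported above, 157 of 290 subjects corresponds to 54%.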


Of the 157 subjects with COPD based on PFT criteria, clinical history of COPD based on self-report or chart review was available for 150. Overall, self-reported status matched the diagnosis by PFT in 67% (Table S4.5).

4.4.6 Lung transplant.

Nine percent (34 subjects) of our cohort had received a (single) lung transplant prior to enrollment. We evaluated differences between lung transplant and non-lung transplant subjects to determine if there were comparable demographic risk factors for lung cancer.

Age (62.9 vs. 63.9 years, p = 0.202), gender, race and smoking history (51.7 vs. 50.0 pack years, p = 0.681) were statistically similar, but 100% of transplant subjects were former smokers compared to only 63% of non-transplant subjects (p < 0.001). Prevalence of COPD was comparable, 62% vs. 53%, p = 0.648. Interstitial lung disease, however, was more prevalent among transplant subjects 29% vs. 7%, p < 0.001 (Table S4.6).

4.4.7 Adverse events.

Serious Adverse Events (SAEs) included any serious effects on the health or safety or any life-threatening problems or death caused by, or associated with the study procedures.

There were no SAEs attributable to this study for either the 241 SOC bronchoscopy subjects or 142 SD bronchoscopy subjects. Adverse Events (AEs) were collected immediately post-procedure and again at the 30 day follow up. AEs classified as possibly or probably attributable to the study were those associated with bronchoscopy and bronchial brush biopsy such as sore throat, hoarseness, cough, throat swelling, chest soreness, bleeding, fever, fatigue and upper respiratory infection. Since the SOC group received the bronchoscopy as part of their standard-of-care, study related AEs were those associated with the bronchial brushing only. There were no AEs classified as study related among the SOC group.

Among the SD group, there were 11 AEs noted in 9 subjects that were possibly (n = 9) or probably (n = 2) attributable to study procedures. AEs that were deemed unlikely to be related were documented in two additional subjects (Table 4.3). All AEs were classified as mild.

4.4.8 Establishment of NBEC and peripheral blood sample biospecimen repository.

Matched blood and NBEC were collected for 361/384 (94%) subjects and banked in multiple aliquots. Blood samples were processed at each site at the time of collection to generate 2 aliquots of buffy coat and 2-5 aliquots of plasma from each subject. These aliquots were frozen and stored at -80°C until shipment to the Early Detection Research Network (EDRN) Biorepository in Frederick, MD. One aliquot of buffy coat was transferred to the University of Toledo for analysis and the other remains in storage.

NBEC were stabilized at each site in RNAlater (Ambion, Austin, TX) and shipped along with matching slides to ResearchDx, Irvine, CA. RNA was extracted from NBEC within 24-48 hours of receipt, assessed for quality and quantity and stored in aliquots at -80°C. One NBEC RNA aliquot was shipped to the University of Toledo for analysis for those samples with a minimum yield of 1 microgram, and aliquots for each subject remain in storage at ResearchDx.


At the University of Toledo, genomic DNA (gDNA) was extracted from one aliquot of buffy coat derived from the blood sample from approximately 80% of subjects and 100% of these yielded gDNA of sufficient quality and quantity for proposed molecular studies.

The quality and quantity of NBEC RNA from approximately 40% of subjects have been assessed to date. RNA from each sample was treated with DNase I, tested via PCR to ensure removal of contaminating gDNA from the RNA and then reverse transcribed into cDNA. For 90% of subjects the cDNA generated from these purified NBEC RNA samples was PCR amplifiable and of sufficient quantity to perform LCRT testing.

Additional aliquots of RNA remain for the roughly 10% of samples that did not pass this quality control. Samples from over 120 subjects were used successfully in preliminary targeted next generation sequencing (NGS) RNA sequencing analysis studies.

4.4.9 Lung cancer incidence.

Two years following initiation of the study, 5 subjects (1.3%) without prevalent lung cancer developed bronchogenic carcinoma. Due to the blinded status of the LCRT study, no further details are available regarding these subjects.

4.5 Discussion

4.5.1 Enrolled cohort is representative of LCRT target population.

The target population of the LCRT biomarker is individuals who meet USPSTF eligibility criteria for annual low dose helical CT screening (Humphrey, Deffebach et al. 2013). The enrollment criteria for the LCRT study included both current and former smokers, individuals with and without concurrent pulmonary disease and/or respiratory exposures as well as both subjects undergoing medically recommended bronchoscopy (SOC group) and volunteers (SD group). At the time of enrollment into the LCRT study, most subjects (66%) met USPSTF age and smoking pack-year eligibility criteria (55-80 years of age, ≥ 30 pack-years). Additionally, most of those not eligible at enrollment will be eligible for screening by the 2016 follow up time point due to increased age, and this fraction is expected to further increase at the 2018 follow up. Therefore, this group is highly representative of the LCRT biomarker target population.

4.5.2 Feasibility to reach LCRT study endpoint based on cohort characteristics.

Based on demographic characteristics of the enrolled population (Table 4.1), we expect lung cancer incidence in the LCRT study to be similar to the 3.1% incidence over 3.9 years reported by Bach et al. (Bach, Jett et al. 2007) in a cohort with a mean age of 60.1 years and a smoking history of 52 pack-years. The five incident lung cancers observed two years after initiation of the study are consistent with this rate. Taking into account that some of the 384 study subjects will have died from causes other than bronchogenic carcinoma prior to these time points and that some will be lost to follow up, we estimated incident lung cancers in the cohort based on 300 subjects. As such, we expect to observe approximately 12 incident lung cancers by the 2016 follow up (mean time since enrollment approximately 5 years) and 17 by the 2018 follow up point (mean time since enrollment approximately 7 years), which will be more than sufficient to reach the proposed endpoint of a risk ratio of > 5.0.
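The expected case counts can be approximated with a short calculation. The sketch below assumes a constant annual incidence obtained by dividing the reference cumulative incidence (3.1%) by its follow-up time (3.9 years) and applies it to 300 evaluable subjects; this linear extrapolation is an assumption made for illustration, and the variable and function names are hypothetical.

```python
REFERENCE_CUMULATIVE_INCIDENCE = 0.031  # 3.1% reported by Bach et al. 2007
REFERENCE_FOLLOW_UP_YEARS = 3.9         # follow-up time of the reference cohort
EVALUABLE_SUBJECTS = 300                # cohort size after expected attrition


def expected_cases(years, n=EVALUABLE_SUBJECTS):
    """Expected lung cancer cases after `years`, assuming a constant rate."""
    annual_rate = REFERENCE_CUMULATIVE_INCIDENCE / REFERENCE_FOLLOW_UP_YEARS
    return n * annual_rate * years


cases_2016 = round(expected_cases(5))  # mean follow-up of about 5 years
cases_2018 = round(expected_cases(7))  # mean follow-up of about 7 years
```

Under these assumptions the annual rate is about 0.8%, which gives approximately 12 cases at 5 years and 17 cases at 7 years, in agreement with the estimates stated above.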


4.5.3 Feasibility of LCRT implementation (safety and acceptance by subjects).

The LCRT biomarker requires a one-time acquisition of NBEC through bronchial brush biopsy at the time of bronchoscopy. In addition to the LCRT study, the Department of Defense Lung Cancer Research Program and the NIH recently funded other large studies assessing utility of biomarkers measured in NBEC obtained at bronchoscopy intended to more accurately determine lung cancer risk and/or to enable early lung cancer diagnosis (Massion, Clinicaltrials.gov NCT01475500 CA152662 and CA102353; Spira, Clinicaltrials.gov NCT02504697 DECAMP-2 and CA164783-04; Dubinett, CA152751-05S2). Therefore, it is important to carefully evaluate the safety and comfort of this procedure, which will impact general acceptance by patients and clinicians. Based on published studies, bronchoscopy with or without biopsy is considered a safe procedure and it is used not only for medical purposes but also to conduct research (Willey, Coy et al. 1996, Willey, Coy et al. 1997, Romagnoli, Vachier et al. 1999, Crawford, Khuder et al. 2000, Eissa and Erzurum 2001, Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2009, Lo Tam Loi, Hoonhorst et al. 2013, Barnes, Saetta et al. 2014, Kim, Oros et al. 2015). Reported complication rates (also known as serious adverse event/SAE rates) for all bronchoscopy procedures range from 0.08-1.93% and mortality rates range from 0.004-0.045% (Jin, Mu et al. 2008, Facciolongo, Patelli et al. 2009, Adare, Afanasiev et al. 2012). One large Japanese study of almost 50,000 patients who underwent bronchoscopy with brush biopsy in either central or peripheral airways reported a complication (SAE) rate of 0.46%. This risk of complication is similar to the 0.28-0.32% complication (SAE) rate reported for colonoscopy (Ko, Riffle et al. 2010, Fisher, Maple et al. 2011), which is routinely used and repeated for colorectal cancer screening. Importantly, a bronchoscopy with brush biopsy limited to the central airways for collection of NBEC, the procedure used here, virtually eliminates risk for the primary complications reported to be associated with bronchoscopy, including pneumothorax or significant hemorrhage. Consistent with this, we observed no SAE associated with bronchoscopic brush biopsy in the subjects enrolled based on SD bronchoscopy.

It is particularly important to assess safety and comfort in the subjects meeting accepted criteria for lung cancer screening, a group that has increased prevalence for numerous comorbidities. Results from at least one previous report have suggested that research bronchoscopy and brush biopsy can be safely performed in subjects with heavy smoking history and those with obstructive lung disease (Romagnoli, Vachier et al. 1999). Previous guidelines have suggested that an FEV1 less than 60% is considered a contraindication to performing research-driven bronchoscopy. However, bronchoscopy in adults with stable asthma and COPD has been performed safely at lower values of FEV1 (Hattotuwa, Gamble et al. 2002). Pulmonary function test data were available for more than 75% of the subjects enrolled here (290 of 384 subjects). One hundred fifty-seven had clinical COPD, of whom more than 70% had GOLD stage 2 or worse (Table 4.2). Additionally, 9% of enrolled subjects had a history of interstitial lung disease, 9% were single-lung transplant recipients and a small percentage had other pulmonary disease (Table S4.3), and bronchoscopy was safely performed on all of them. Specifically, no complications (SAEs) were associated with bronchoscopic brushing and sample collection in either the standard of care (SOC) or the study driven (SD) group.


In summary, bronchoscopic brush of the central airways to collect NBEC for lung cancer risk analysis was safe and well-tolerated in this study of subjects demographically at risk for lung cancer, including those with significant co-morbid conditions. Because the AE rate was much lower than that reported for routinely used screening colonoscopy (Fisher, Maple et al. 2011), we expect that this procedure will be acceptable to patients and clinicians if the LCRT or other tests in development are validated to identify subjects with increased risk for lung cancer and/or early stage lung cancer.

4.5.4 COPD characteristics of LCRT Cohort.

The enrolled cohort had a high fraction of COPD based on PFT criteria. This is important because COPD is an independent risk factor for lung cancer (Skillrud, Offord et al. 1986, Tockman, Anthonisen et al. 1987, Mayne, Buenconsejo et al. 1999, Mannino, Aguayo et al. 2003, Wasswa-Kintu, Gan et al. 2005, Purdue, Gold et al. 2007, Young, Hopkins et al. 2009, Schwartz 2012, de-Torres, Wilson et al. 2015). Notably, using PFT data (FEV1/FVC <0.7) as the diagnostic criterion, one-third of individuals in this study misclassified their COPD status on the enrollment survey self-report. This is consistent with multiple reports of data acquisition through self-report leading to either misclassification or under-diagnosis of COPD (Barr, Herbstman et al. 2002, Straus, McAlister et al. 2002, Eisner, Trupin et al. 2005, Zhai, Yu et al. 2014, Aldrich, Munro et al. 2015). Some of this misclassification could be due to patients being told they have COPD on the basis of radiographic imaging while the PFT data do not meet criteria for COPD diagnosis. Additionally, a portion of the subjects here underwent PFT at the time of enrollment that revealed COPD for the first time because the subject had not been tested prior to enrollment in the study. Given the importance of accurate COPD diagnosis, we plan to obtain both chest CT and PFT data from each subject at each subsequent follow-up. We will then evaluate COPD based on CT (presence of emphysema and/or bronchial thickening) or PFT criteria alone, or in combination, as a risk factor for lung cancer.
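The PFT-based classification discussed above can be sketched in a few lines. The FEV1/FVC < 0.7 criterion is taken from the text; the stage cut-points are the standard GOLD spirometric grades and are an assumption here, since this section does not state which thresholds were applied. The function name and the input values are hypothetical.

```python
def classify_copd(fev1_fvc, fev1_pct_predicted):
    """Return 'No COPD' or a GOLD stage label from two spirometry values.

    fev1_fvc: FEV1/FVC ratio (for example 0.52).
    fev1_pct_predicted: FEV1 as percent of the predicted value.
    Stage thresholds are the standard GOLD grades (assumed, not from the text).
    """
    if fev1_fvc >= 0.70:           # diagnostic criterion stated in the text
        return "No COPD"
    if fev1_pct_predicted >= 80:   # assumed GOLD 1 (mild)
        return "COPD (stage 1)"
    if fev1_pct_predicted >= 50:   # assumed GOLD 2 (moderate)
        return "COPD (stage 2)"
    if fev1_pct_predicted >= 30:   # assumed GOLD 3 (severe)
        return "COPD (stage 3)"
    return "COPD (stage 4)"        # assumed GOLD 4 (very severe)


# Illustrative inputs only, not individual subject data.
for ratio, pct in [(0.78, 80), (0.55, 57), (0.39, 35), (0.27, 22)]:
    print(ratio, pct, classify_copd(ratio, pct))
```

A subject's self-reported status can then be compared against this label to flag misclassification of the kind described above.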

4.5.5 LCRT cohort and biospecimen repository as a resource for subsequent studies.

As presented here, the LCRT cohort is well characterized with respect to demographic characteristics. In addition, NBEC and matching blood samples were collected from each subject. Each subject had a baseline CT scan, and pulmonary function test (PFT) data are available for 76% of individuals. We plan to obtain repeat PFT and CT scans on all subjects at each subsequent follow-up. This information will enable longitudinal assessment of the rate of decline in pulmonary function by both physiologic and radiographic measures, and assessment for presence or absence of lung cancer. More than 90% of samples assessed so far passed the quality control (QC) criteria for sample quality and quantity required for reliable LCRT measurement. The NBEC and matching blood samples collected in this study are archived, and the majority of subjects have given consent for use of samples remaining after LCRT analysis for future IRB-approved studies. Currently, we are using NBEC gene expression data and genotyping data from matched peripheral blood cell gDNA to identify proximate phenotypic biomarkers for COPD risk and additional biomarkers for hereditary lung cancer risk. We are integrating these data with COPD genome wide association study (GWAS) data from the Lung Health Study and the COPDGene study, available online at NCBI dbGaP.

4.6 Conclusions

The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for LDCT lung cancer screening.

Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population. These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early detection of lung cancer and delay the onset of screening for individuals with results indicating low lung cancer risk. For these individuals, the small risk incurred by undergoing once in a lifetime bronchoscopic sample collection for LCRT may be offset by a reduction in their CT-related risks. The LCRT biospecimen repository will enable additional studies of genetic basis for COPD and/or lung cancer risk.

4.7 List of abbreviations used

AE, adverse event; COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 second; FVC, forced vital capacity; GWAS, genome wide association study; LCRT, Lung Cancer Risk Test; LDCT, low dose computed tomography; NBEC, normal bronchial epithelial cells; PFT, pulmonary function test; SAE, serious adverse event; SD, study driven; SNP, single nucleotide polymorphism; SOC, standard of care; USPSTF, United States Preventive Services Task Force.

4.8 Competing interests

JCW has equity interest in and serves as a consultant to Accugenomics, Inc., which licenses technology utilized here. JCW and ELC are inventors on U.S. and international patents related to the technology and biomarkers presented here.

4.9 Author contributions

ELC contributed to the design of the study, interpretation of results and drafting of the manuscript, and performed quality assessment of gDNA and RNA. AL, ML and SAK participated in data analysis and statistical interpretation and contributed to the study design. FS and AMB participated in the preparation of the manuscript. XZ, JY and AB participated in the interpretation of data. PN, PPM, DAA, DM, PJM, SDN, RW, GS and JT served as site directors for the LCRT study and participated in the study design and coordination. JCW conceived of the study, participated in its design and coordination and contributed to the preparation of the manuscript.

4.10 Acknowledgements

This work was funded by grants from the NIH National Cancer Institute CA148572 and National Heart Lung and Blood Institute HL108016, and the George Isaac Cancer Research Fund. The NIH did not participate in the study design, data interpretation or manuscript preparation. We thank Dr. Paul Kvale for his professional review and assessment of adverse and serious adverse events in relation to the study, Dr. James Jett for contributions in planning of the study, and Dr. Ali Musani for supporting enrollment at National Jewish Hospital.

4.11 Table and Figure Legends

Table 4.1 LCRT Subject Characteristics

Table 4.2 Chronic Obstructive Pulmonary Disease by PFT

Table 4.3 Adverse Events (AE)


4.12 Table and Figure

Table 4.1

Table 1. LCRT Subject Characteristics

Baseline characteristics                              n = 384

Age in years [mean (SD*)]                             62.9 (8.2)
Age in years [median]                                 62

Male                                                  213 (55%)
Female                                                171 (45%)

Caucasian                                             343 (89%)
African American                                      37 (10%)
Other or not reported                                 4 (1%)

Cigarette pack years [mean (SD)]                      50.4 (25.5)
Cigarette pack years [median]                         43
Age in years at smoking inception [mean (SD)]         16.1 (3.8)
Age in years at smoking inception [median]            16
Total years of smoking [mean (SD)]                    37.4 (10)
Total years of smoking [median]                       38
History of cigar use                                  35 (9%)
History of pipe use                                   29 (8%)

Married or living as married                          231 (60%)
Widowed                                               116 (30%)
Single                                                37 (10%)

Less than high school education                       52 (14%)
High school diploma or GED**                          118 (31%)
Associate degree or some college                      136 (35%)
Bachelor's degree                                     40 (10%)
Graduate degree                                       24 (6%)
Other or not reported                                 14 (4%)

Employed                                              109 (28%)
Unemployed                                            30 (8%)
Retired                                               175 (46%)
Disabled                                              64 (17%)
Other or not reported                                 6 (2%)

Income < $40,000/year                                 141 (37%)
Income > $40,000/year                                 123 (32%)
Other or not reported                                 120 (31%)

* SD = standard deviation
** GED = General Educational Development


Table 4.2

Table 2. Chronic Obstructive Pulmonary Disease by PFT

Classification         n     M / F*     Mean age   Race            Smoking status   Pack years   FEV1%$   FEV1/FVC+
                                        in years   C/AA/Other**    current/former   smoked#
No COPD                133   65 / 68    62         113 / 16 / 4    46 / 87          45           80       0.78
COPD (all)             157   97 / 60    63         145 / 12 / 0    50 / 107         56           58       0.52
COPD (stage 1)         45    32 / 12    62         41 / 4 / 0      22 / 23          54           76       0.59
COPD (stage 2)         77    48 / 29    64         73 / 4 / 0      22 / 55          57           57       0.55
COPD (stage 3)         26    15 / 11    64         23 / 3 / 0      5 / 21           53           35       0.39
COPD (stage 4)         7     2 / 5      63         7 / 0 / 0       1 / 6            65           22       0.27
COPD (stage unknown)   2     1 / 1      66         1 / 1 / 0       0 / 2            66           -        0.57
Unknown                94    50 / 44    63         85 / 9 / 0      35 / 60          50           -        -

* M = male, F = female
** C = Caucasian, AA = African-American, Other = other race or race not reported
# Pack years = packs of cigarettes smoked per day x years of smoking
$ FEV1% = forced expiratory volume in 1 second, percent of expected
+ FEV1/FVC = FEV1/Forced Vital Capacity

Table 4.3

Table 3. Adverse Events (AE)

Subject #   AE Description                                      Severity   Relatedness   Treatment Notes
1035        Felt poorly (like he had a fever) for 3 days        Mild       Possible      Did not seek treatment or notify study personnel until 30 day follow-up
1044        Upper respiratory tract infection                   Mild       Possible      Treated with antibiotics and steroids, infection resolved
1048        Hoarseness for 2 days                               Mild       Probable
1050        Bruising around eyes                                Mild       Unlikely
1057        Bleeding from ears post bronchoscopy                Mild       Possible
1057        Petechiae around eyes                               Mild       Possible
1059        Cough                                               Mild       Possible
1059        Difficulty swallowing                               Mild       Possible
1060        Felt soreness in lung                               Mild       Possible
1061        Dry scratchy area in throat, feels need to cough    Mild       Possible
1076        Slight cough                                        Mild       Possible
1077        Back of throat swollen                              Mild       Probable
1078        Cough                                               Mild       Unlikely


4.13 Supplemental Table and Figure Legends

Table S4.1 Lung Cancer Risk Test Study Enrollment by Study Site

Table S4.2 Work Types and Exposures

Table S4.3 Medical History

Table S4.4 Standard of Care (SOC) vs. Study Driven (SD) Bronchoscopies

Table S4.5 Self-reported vs. Clinical COPD

Table S4.6 Transplant vs. Non-Transplant Subjects

4.14 Supplemental Table and Figure

Table S4.1

Site                                                             Location          Subjects
University of Toledo Medical Center                              Toledo, OH        83
Mayo Clinic                                                      Rochester, MN     43
University of Michigan                                           Ann Arbor, MI     86
The Toledo Hospital                                              Toledo, OH        19
Ohio State University                                            Columbus, OH      29
Vanderbilt University Medical Center/Tennessee Valley VA         Nashville, TN     25
Henry Ford Health System                                         Detroit, MI       6
National Jewish Health                                           Denver, CO        4
Medical University of South Carolina                             Charleston, SC    20
Inova Fairfax Hospital                                           Falls Church,     14
Cleveland Clinic Foundation                                      Cleveland, OH     51
Mercy St. Vincent Medical Center                                 Toledo, OH        4
Total                                                                              384

* number in final cohort


Table S4.2

Total reported work exposures      n = 234/384 (61%)
Asbestos                           54 (14%)
Baking                             11 (3%)
Butchering/Meat Packing            13 (3%)
Chemicals/Plastics                 39 (10%)
Coal Mining                        4 (1%)
Cotton or Jute Processing          2 (<1%)
Farming                            41 (11%)
Fire Fighting                      8 (2%)
Flour, Feed or Grain Milling       7 (2%)
Foundry or Steel Milling           34 (9%)
Hard Rock Mining                   1 (<1%)
Painting                           33 (9%)
Sandblasting                       15 (4%)
Welding                            37 (10%)

Table S4.3

Total enrollment                                      n = 384
Standard of care bronchoscopy                         269 (70%)
Study driven (volunteer) bronchoscopy                 115 (30%)
Standard of care CT scan                              242 (63%)
Study driven CT scan                                  142 (37%)
Under investigation for lung cancer at enrollment     46 (12%)

Family history of lung cancer                         77 (20%)
Personal history of cancer                            72 (19%)

Personal history of COPD (self-reported)              156 (41%)
Personal history of chronic bronchitis                68 (18%)
Personal history of emphysema                         106 (28%)
Personal history of interstitial lung disease         35 (9%)
Personal history of sarcoidosis                       10 (3%)
Personal history of scleroderma                       1 (<1%)

Single lung transplant recipient                      34 (9%)

Table S4.4

Baseline characteristics              SOC           SD            p value**
Total Enrolled                        269           115
Age in years [mean (SD*)]             63.6 (8.4)    61.5 (7.5)    p = 0.021

Male                                  153 (57%)     60 (52%)
Female                                116 (43%)     55 (48%)

Caucasian                             243 (91%)     100 (88%)
African American                      23 (9%)       14 (12%)
Other or not reported                 3 (1%)        1 (<1%)

Cigarette pack years# [mean (SD)]     49.4 (24.2)   51.9 (22.8)
Current smoker                        67 (25%)      63 (55%)
Former smoker                         202 (75%)     52 (45%)      p < 0.001

COPD (all)$                           120 (60%)     37 (41%)      p = 0.002
Stage 1                               26 (13%)      19 (21%)
Stage 2                               62 (31%)      15 (16%)
Stage 3                               23 (12%)      2 (3%)
Stage 4                               7 (4%)        0 (0%)

* SD = standard deviation
# Pack years = packs of cigarettes smoked per day x years of smoking
$ staging info. unavailable for 2 subjects
** p value from Student's t-test reported if < 0.05

Table S4.5

PFT diagnostic criteria for COPD (all sites)

Self-reported          YES by PFT*   NO by PFT   No PFT data
YES                    92            32          32
NO                     58            94          51
Did not self-report    7             7           11

Accuracy: 67%

* PFT = Pulmonary Function Test
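The 67% accuracy in Table S4.5 appears to follow from the four cells in which a subject gave a YES or NO self-report and also had PFT data. The sketch below reproduces it under that assumption; the variable and function names are illustrative, and the table does not state how the accuracy was computed.

```python
# Counts copied from Table S4.5: subjects with a YES/NO self-report and PFT data.
self_yes_pft_yes = 92   # self-reported COPD, confirmed by PFT
self_yes_pft_no = 32    # self-reported COPD, not confirmed by PFT
self_no_pft_yes = 58    # denied COPD, but PFT meets the COPD criterion
self_no_pft_no = 94     # denied COPD, PFT agrees


def self_report_accuracy(yes_yes, yes_no, no_yes, no_no):
    """Fraction of subjects whose self-report agrees with the PFT result."""
    total = yes_yes + yes_no + no_yes + no_no
    return (yes_yes + no_no) / total


accuracy = self_report_accuracy(
    self_yes_pft_yes, self_yes_pft_no, self_no_pft_yes, self_no_pft_no
)
print(f"accuracy = {accuracy:.1%}, misclassified = {1 - accuracy:.1%}")
```

The complement of this value is roughly one-third, which matches the misclassification fraction quoted in section 4.5.4.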

Table S4.6

Baseline characteristics                          Transplant     Non-Transplant   p value**
Total Enrolled                                    34             350
Age in years [mean (SD*)]                         62.9 (8.5)     63.9 (3.7)

Male                                              22 (65%)       191 (55%)
Female                                            12 (35%)       159 (45%)

Caucasian                                         31 (91%)       312 (89%)
African American                                  3 (9%)         34 (10%)
Other or not reported                             0 (0%)         4 (1%)

Cigarette pack years# [mean (SD)]                 51.7 (27.2)    50 (23.5)
Current smoker                                    0 (0%)         130 (37%)
Former smoker                                     34 (100%)      220 (63%)        p < 0.001

COPD (all)$                                       21/34 (62%)    136/236 (53%)
Personal history of interstitial lung disease     10 (29%)       25 (7%)          p < 0.001

* SD = standard deviation
** p value from Student's t-test reported if < 0.05
# Pack years = packs of cigarettes smoked per day x years of smoking
$ in subjects for whom Pulmonary Function Test (PFT) data are available


Chapter 5 Control for stochastic sampling variation and qualitative sequencing error in next generation sequencing

Thomas Blomquist^a, Erin L. Crawford^b, Jiyoun Yeo^b, Xiaolu Zhang^b, James C. Willey^a,b,*

Authors’ Affiliations:

a Department of Pathology, University of Toledo Health Sciences Campus, Toledo, OH 43614

b Department of Medicine, University of Toledo Health Sciences Campus, Toledo, OH 43614

* Corresponding author: James C. Willey, M.D., Tel. 001 419 383-3455

Email: [email protected]

Published in Biomolecular Detection and Quantification, September 01, 2015.


5.1 Abstract

Background: Clinical implementation of Next-Generation Sequencing (NGS) is challenged by poor control of low sample input, library preparation biases and qualitative sequencing error. To address these challenges we developed and tested two hypotheses. Hypothesis 1: Analytical variation in target analyte quantification is predicted by Poisson (i.e. stochastic) sampling effects at two key points: a) input of intact nucleic acid target molecules into the library preparation reaction, and b) input of amplicons from the library into the sequencer. Hypothesis 2: Technically derived base substitution, insertion and deletion frequencies observed at each base position in each native target analyte are concordant with frequencies observed in competitive synthetic internal standards present in the same reaction. Methods: To test hypothesis 1, we derived equations using Monte Carlo simulation to predict assay coefficient of variation (CV) based on three working models: number of target molecules added to library preparation, number of target sequence read counts from sequencer, or both. These models were tested against NGS data from specimens with well characterized allelic ratios, molecule inputs and sequence counts that were prepared using a competitive multiplex-PCR amplicon-based NGS library preparation method comprising synthetic internal standards. To test hypothesis 2, we measured the frequency of base substitutions, insertions and deletions at each base position within amplicons from each of 30 native target analytes, then compared these frequencies to those at corresponding base positions within 30 respective synthetic competitive internal standard templates present in the same NGS library preparation reactions. Results: For hypothesis 1, the Monte Carlo model derived from both sequencing counts and molecule input measurements best predicted CV and explained 74% of observed assay variance. For hypothesis 2, observed frequency and type of sequence variation at each base position within each competitive internal standard was concordant with frequency and type of sequence variation seen in native templates (NTs) (R² = 0.93). Conclusion: Inclusion of synthetic competitive internal standard templates in targeted NGS library preparation controls for low target input into NGS library preparation, low target library product into sequencer, and errors generated during library preparation and sequencing. These controls enable accurate clinical diagnostic reporting of confidence limits and limit of detection for NGS measurement of copy number and for base substitution, insertion and deletion rates at each base position within each target analyte.


5.2 Introduction

Quantitative analysis of transcript abundance and/or sequence variant frequency are common applications of next generation sequencing (NGS) (Mortazavi, Williams et al. 2008, Spencer, Tyagi et al. 2014). One important diagnostic NGS application includes accurate identification of clinically actionable sequence variation in tumors and the estimation of tumor cell fraction with the actionable mutation (Cibulskis, Lawrence et al. 2013, Spencer, Tyagi et al. 2014). However, lack of appropriate quality control limits wider clinical diagnostic application of NGS in this context. For example, under-loading of target analyte into library preparation and/or library product into sequencer will result in analytical variation due to stochastic sampling (Fu, Xu et al. 2014). At the same time, over-loading of prepared library onto sequencer will result in re-sampling of library amplicons from the same target analyte molecule, and without proper controls will give false assurance of adequate sampling. Moreover, polymerase errors generated during library preparation and/or sequencing steps can confound accurate estimation of the true proportion of clinically actionable sequence mutations present (Schmitt, Kennedy et al. 2012, Fu, Xu et al. 2014).

Thus, for diagnostic NGS applications, it is important to control for several sources of analytical variation, including sample loading into library preparation, efficiency of target amplification in library preparation, loading of prepared NGS library onto a sequencing platform, and the combined polymerase error rates throughout library preparation and sequencing (Gargis, Kalman et al. 2012, Aziz, Zhao et al. 2015, Gargis, Kalman et al. 2015). Currently, the most prevalent practice is to rely on sequence count data alone to provide quality control for each potential source of analytical variation. For example, many recently developed programs seek to quantify the fractional representation of actionable tumor mutations, and enumeration of sequence reads is the only source of data for assay variance analysis (Schmitt, Kennedy et al. 2012, Cibulskis, Lawrence et al. 2013, Frampton, Fichtenholtz et al. 2013, Fu, Xu et al. 2014, Xu, DiCarlo et al. 2014). While these approaches address many issues, they provide false assurance regarding control for stochastic sampling variation due to low input of sample into the library preparation, and do not provide a frequency limit of detection for each type of base substitution, insertion and deletion at each base position in each target analyte (Fu, Xu et al. 2014, Spencer, Tyagi et al. 2014). Recent barcoding methods combined with bait-capture targeted sequencing provide better control for low sample input while, again, using only sequence count data to estimate analytical variance (Mortazavi, Williams et al. 2008, Casbon, Osborne et al. 2011, Jabara, Jones et al. 2011, Kinde, Wu et al. 2011, Schmitt, Kennedy et al. 2012, Fu, Xu et al. 2014). However, these methods do not provide a way to assess limit of detection for observed biological variation (Jabara, Jones et al. 2011), and the bait-capture method is associated with 100-1000-fold loss in signal (Fu, Xu et al. 2014).

Signal loss is a particular liability for analysis of small or degraded specimens, such as those routinely encountered in the clinical setting (Cibulskis, Lawrence et al. 2013). Furthermore, sequencing read counts are not always concordant with the number of molecules “captured” during library preparation, resulting in false negative results (Frampton, Fichtenholtz et al. 2013). In addition, it is less well recognized that if the number of target analyte molecules loaded into the library preparation is low, the analyte may be poorly quantified regardless of the number of analyte amplicons loaded into the sequencer, due to over-amplification of a stochastically sampled specimen. In order to address these challenges, we developed and tested two hypotheses.

Hypothesis 1: We hypothesized that analytical variation in target analyte quantification can be predicted by Poisson (i.e. stochastic) sampling effects at two primary points: a) input of intact nucleic acid target molecules loaded into the library preparation reaction, and b) input of derived amplicons from library preparation into the sequencer (i.e. sequence counts) (Figure 5.1). Using Monte Carlo simulation we derived equations to predict assay coefficient of variation (CV) based on three working models: number of target molecules added to library preparation, number of target amplicons in library added to sequencer (i.e., sequence read count), or both (Figure 5.1). We then tested these working models using cell lines with known allelic composition. Cell lines were mixed and prepared for NGS such that a broad range of limiting allelic molar proportions and/or sequence read counts were observed. Each target allele was measured relative to a known number of synthetic internal standard molecules using a competitive multiplex-PCR amplicon-based NGS library preparation method (Blomquist, Crawford et al. 2013).
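As a worked illustration of why Poisson sampling matters at low input, consider an idealized single sampling step in which the number of molecules (or reads) drawn, n, is Poisson distributed with mean λ. The symbols n and λ are introduced here for illustration and are not taken from the original; the equations the authors fitted are described in their supplementary method and are not reproduced.

```latex
\[
n \sim \mathrm{Poisson}(\lambda), \qquad
\mathrm{E}[n] = \lambda, \qquad
\mathrm{Var}[n] = \lambda, \qquad
\mathrm{CV} = \frac{\sqrt{\mathrm{Var}[n]}}{\mathrm{E}[n]} = \frac{1}{\sqrt{\lambda}}
\]
% Examples: \lambda = 10000 gives CV = 1%; \lambda = 100 gives CV = 10%;
% \lambda = 10 gives CV of approximately 32%.
```

Under this idealization the expected CV depends only on the mean count at the limiting step, which is why a specimen with few intact target molecules cannot be rescued by sequencing it more deeply.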

Hypothesis 2: The accuracy of frequency measurement of acquired mutations in specimens (e.g. circulating plasma DNA, tumors, etc.) is confounded by both sampling error (described above and tested in hypothesis 1), and nucleotide substitution, insertion and deletion errors encountered during both library preparation steps and sequencing (Cibulskis, Lawrence et al. 2013, Frampton, Fichtenholtz et al. 2013). This latter, technically derived, sequence variation may to some extent be systematic for certain types of sequence variations, but may also vary largely depending on local sequence context. We hypothesized that technically derived base substitution, insertion and deletion frequencies observed at each base position in each target analyte are concordant with frequencies observed in respective synthetic internal standards present in the same reaction. In order to characterize the contribution of technically derived nucleotide sequence error rate, we measured the frequency of base substitution, insertion and deletion errors in an NGS data set derived from 213 normal airway brushing derived cDNA specimens with both ample intact nucleic acid loading and sequence counts. Each normal airway brushing derived cDNA specimen was mixed with a known number of synthetic internal standard molecules for each target analyte prior to competitive multiplex PCR amplicon NGS library preparation to determine if the frequency of observed base substitutions, insertions and deletions in each native target was concordant with the frequency observed in each respective synthetic internal standard. If concordant, synthetic internal standards could provide control for both stochastic sampling in quantitative NGS, as well as control for technically derived sequencing error in qualitative NGS of low frequency alleles.

5.3 Methods

5.3.1 Sample Preparation

Hypothesis 1: To test the effect of stochastic sampling on variance in allelic frequency measurements, genomic DNA (gDNA) was extracted by FlexiGene DNA kit (Qiagen) and quantified by NanoDrop (ThermoScientific, Wilmington, DE) spectrophotometry for two cell lines (H23 [ATCC CRL-5800] and H520 [ATCC HTB-182]). The cell lines were previously characterized as homozygous for opposite alleles at four polymorphic sites (rs769217, rs1042522, rs735482 and rs2298881) (Blomquist, Crawford et al. 2013). Cross-mixtures of these two cell lines were performed so as to create a well characterized extreme limiting dilution of each of the four bi-allelic loci (see Mixing Design in Supplementary Table S5.3). These limiting dilutions of alleles were then loaded into the library preparation (see Methods: NGS Library Preparation), then limiting dilutions of NGS libraries were added to the Illumina HiSeq 2500 flow cell (see Methods: NGS Library Preparation).

Hypothesis 2: In order to characterize the base-specific substitution, insertion and deletion rates imparted by combined library preparation and sequencing error, we used 213 normal human bronchial epithelial cell (NBEC) cDNA specimens. These specimens were obtained as part of the ongoing Lung Cancer Risk Test (LCRT) study at the University of Toledo Medical Center (Blomquist, Crawford et al. 2009). Approval for specimen acquisition for this study was obtained from the institutional review board at the University of Toledo Medical Center. These samples were chosen based on several key features: 1) They represent a source of normal nucleic acid templates with presumably low, or absent, acquired somatic mutations. 2) They were previously confirmed to have high copy numbers of intact template for each native target, which minimized the chance that stochastic sampling of templates would confound assessment of combined library preparation and sequencing error on base-specific substitution, insertion and deletion rates. 3) Competitive synthetic internal standards for targets comprised by the LCRT were cloned into plasmids, and selected as pure clonal isolates, with Sanger sequencing confirmation of final sequence. This additional purification step was taken to eliminate any potential errors introduced by synthesis. We reason that these pure clonal competitive internal standards will have a frequency of technically acquired base substitutions, insertions and deletions that is similar to the native templates during the combined library preparation and sequencing steps.

5.3.2 Development of model to predict analytical variation due to stochastic sampling variation in NGS

Hypothesis 1: To test the hypothesis that analytical variation is dependent on both target analyte native template molecules added into the library preparation reaction and resultant amplicon molecules added to the sequencer, we developed three working models using Monte Carlo simulation and derived equations to predict expected assay coefficient of variation (CV) (Figure 5.1 and Supplementary Method – Model Generation). These three models and their equations were based on: target molecules in library added to sequencer (i.e. sequence read counts; Model 1), target native molecules added to library preparation (Model 2), or both (Model 3). The simulation is based, in part, on a model of biallelic genetic drift provided by Dr. Stephen P. DiFazio that can easily be simulated in Excel. We reasoned that population based founding effects that result in genetic drift of bi-allelic loci should operate statistically in the same way as stochastic sampling of a bi-allelic locus present in a test tube in the laboratory setting, and that the act of pipetting and sampling the specimen DNA is analogous to a founding effect seen in population genetics. We further reasoned that there were two primary founding (i.e. stochastic sampling) effects present in the lab test tube analogy: 1) initial pipetting of the specimen into the library preparation reaction, and 2) loading of the prepared library onto the sequencer and the number of sequencing counts enumerated for each target template (Figure 5.1 and Supplementary Method – Model Generation). The model was varied for both the number of input molecules and the number of sequence reads derived (Supplementary Method – Model Generation). This produced a rich data set, from which three equations were derived by best curve fit analysis (Supplementary Method – Model Generation). These derived equations were then tested against empirically derived data from cross-mixtures of cell lines to predict observed assay variance in targeted NGS (see Methods: Sample Preparation).
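The two founding effects described above can be illustrated with a small Monte Carlo simulation. This is a generic two-stage binomial sketch written for illustration, not the authors' Excel-based model or their fitted equations (which are in the supplementary method); the function names, parameter values and the closed-form check are all assumptions introduced here.

```python
import math
import random
import statistics


def binomial(n, p, rng):
    """Draw one Binomial(n, p) count by summing n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_cv(n_molecules, allele_fraction, n_reads, trials=3000, seed=1):
    """Monte Carlo estimate of the CV of a measured allele fraction.

    Stage 1: n_molecules are pipetted into library preparation; the number
             carrying the allele is Binomial(n_molecules, allele_fraction).
    Stage 2: n_reads are sequenced from the amplified library; the number of
             allele reads is Binomial(n_reads, fraction after stage 1).
    """
    rng = random.Random(seed)
    measured = []
    for _ in range(trials):
        allele_molecules = binomial(n_molecules, allele_fraction, rng)
        library_fraction = allele_molecules / n_molecules
        allele_reads = binomial(n_reads, library_fraction, rng)
        measured.append(allele_reads / n_reads)
    return statistics.stdev(measured) / statistics.mean(measured)


def predicted_cv(n_molecules, allele_fraction, n_reads):
    """Closed-form CV of the same two-stage binomial model."""
    p = allele_fraction
    sampling = 1 / n_molecules + 1 / n_reads - 1 / (n_molecules * n_reads)
    return math.sqrt((1 - p) / p * sampling)


sim = simulate_cv(n_molecules=200, allele_fraction=0.1, n_reads=500)
pred = predicted_cv(200, 0.1, 500)
print(f"simulated CV = {sim:.3f}, closed-form CV = {pred:.3f}")
```

Setting `n_reads` very large isolates the input-molecule effect (analogous to Model 2), setting `n_molecules` very large isolates the read-count effect (analogous to Model 1), and finite values of both correspond to the combined case.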

5.3.3 NGS Library Preparation: Targeted Competitive Multiplex-PCR

5.3.3.1 Cell line cross-mixture specimens. Each of four target analytes was PCR-amplified in samples derived from the cross-mixture of two cell lines (see Mixing Design in Supplementary Table S5.3) that had each been mixed with a known number of synthetic competitive internal standard (IS) molecules as previously described (Supplementary Table S5.1) (Blomquist, Crawford et al. 2013).

NBEC cDNA specimens. Each of 30 target analytes (two target assays for each of 15 genes) was PCR-amplified in the presence of a known number of respective synthetic competitive internal standard molecules as previously described (Supplementary Table S5.2) (Blomquist, Crawford et al. 2013). Prepared libraries were then sent for Illumina HiSeq 2500 sequencing service at the University of Michigan Genomics Core facility.

5.3.3.2 Internal standard mixture preparation. Each competitive internal standard (IS) was designed to contain six nucleotide differences from the target analyte native template (NT), which enabled reliable differentiation of each NT from its respective IS during post-sequencing data analysis (Supplementary Tables S5.1 and S5.2) (Blomquist, Crawford et al. 2013). For IS used in the analysis of cell line cross-mixture samples, following synthesis, each IS was PCR-amplified with specific primers to ensure full length product, isolated by gel electrophoresis, quantified using NanoDrop, and mixed with IS for other analytes at equivalent concentration to prepare an internal standard mixture (Blomquist, Crawford et al. 2013). For IS used in analysis of NBEC cDNA samples, IS were prepared by Accugenomics, Inc. (Wilmington, NC). Briefly, following synthesis, IS were cloned in bacteria and purified to ensure an accurate and uniform population of sequences for each competitive IS used (see Methods: Sample Preparation).

5.3.4 NGS data analysis

FASTQ data files from the University of Michigan Genomics core facility were processed as previously described (Blomquist, Crawford et al. 2013).

FASTQ files for hypothesis 2 in this study, pertaining to the LCRT reagents, were additionally processed using the Blast 2.2.26+ command line with a Practical Extraction and Reporting Language (PERL) wrapper to automate feeding of reference and query sequences to the Blast command line interface (reference sequences in Supplementary Table S5.2). This same PERL script then identified and stored, in a Hash of Hashes of Hashes data table configuration, the frequency of each Blast result for each template and for each type of base substitution, insertion or deletion identified across all reads (sequence error frequencies in Supplementary Tables S5.4 and S5.5). The PERL wrapper for Blast 2.2.26+, and the input parameters used for Blast to enumerate base substitution, insertion and deletion frequencies, are available upon request.
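The nested tally described above (template, then base position, then type of change) can be sketched in Python rather than PERL. This is a minimal illustration only, not the wrapper used in the study: the function names, the pre-aligned input format and the example template name are assumptions, and parsing of actual Blast output is omitted.

```python
from collections import defaultdict

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(ref_base, read_base):
    """Name the kind of difference between a reference base and a read base."""
    if ref_base == read_base:
        return "match"
    if read_base == "-":
        return "deletion"
    if ref_base == "-":
        return "insertion"
    pair = {ref_base, read_base}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def tally_variation(alignments):
    """Count differences as counts[template][position][change].

    `alignments` is an iterable of (template, aligned_reference, aligned_read)
    tuples in which gaps are written as "-" (a hypothetical input format).
    Positions are 1-based along the reference; an insertion is assigned to
    the reference position that precedes it.
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for template, ref, read in alignments:
        position = 0
        for ref_base, read_base in zip(ref, read):
            if ref_base != "-":
                position += 1
            if ref_base != read_base:
                counts[template][position][f"{ref_base} > {read_base}"] += 1
    return counts


# Hypothetical template name and reads, for illustration only.
example = [
    ("ERCC5_NT", "ACGT", "ATGT"),
    ("ERCC5_NT", "ACGT", "ATGT"),
    ("ERCC5_NT", "ACGT", "AC-T"),
]
counts = tally_variation(example)
```

Dividing each count by the number of reads covering that position would give a per-base frequency of the kind reported in the supplementary error tables.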

Because the goal of hypothesis 2 was to identify and characterize the base by base frequency of combined sequencing and library preparation errors, and not biological variation (which was tested in hypothesis 1), we surmised that the sequencing data could be aggregated into two larger pools of subjects (NT and IS Groups 1 and 2). This is feasible and beneficial for several reasons: 1) A combined data set of normal specimens with minimal biological sequence variation (Group 1 [115 NBEC specimen library preparations] and Group 2 [98 NBEC specimen library preparations], total 213 NBEC specimen library preparations), should provide adequate sampling of very rare technically derived base substitution, deletion and insertion events (1 in 1,000 to 100,000) across each specimen pool. 2) If these normal specimens do indeed have minimal biological variation in sequence, there should be a high degree of concordance in base substitution, insertion and deletion rates between the NTs and their respective competitive IS present in the same specimen (Supplementary Table S5.2). 3) By splitting the sequencing data into two pools, we can, in a surrogate way, assess the performance of external NT controls versus competitive synthetic IS controls, for accurately measuring technically derived base substitution, insertion and deletion frequencies.

All final NGS summary counts and absolute quantification of molecules (where appropriate) are provided in Supplementary Tables S5.3-S5.5.

5.4 Results

5.4.1 Controlling for stochastic sampling error in NGS

For the equation derived from both sequencing coverage and input molarity (Model 3; see Supplementary Method – Model Generation), expected coefficient of variation (CV) was very close to observed (average [observed CV/expected CV] = 1.01) and explained 74% of observed assay variance (Figure 5.2C). In contrast, observed CV was on average 13-fold, or 1.5-fold, higher than expected CV based on sequencing coverage (Model 1), or input molarity (Model 2), prediction models alone (Figure 5.2A,B). For each assay, when input of target allele copies into library preparation was low (median of 15 molecules; open triangles), assay variance for measured allelic ratio was much higher than with high molecule input (median of 3313 molecules; closed circles) (Figure 5.3A-D). Although there was an approximately 200-fold difference in median molecules loaded into library preparation for low and high loading conditions, sequence counts were high for both conditions (see Mixing Design and raw data in Supplementary Table S5.3). When only specimens with high molecule input (>500 molecules) were assessed, variance in measured allelic ratio followed a Poisson distribution (plotted boxes and dashed line) for target sequence counts (Figure 5.3E). Similarly, when only specimens with high sequence counts (>500 sequence counts) were assessed, variance in measured allelic ratio followed a Poisson distribution for target molecule input (Figure 5.3F). All data presented in this section are available in Supplementary Table 3.
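The two sampling steps discussed in this section (molecules loaded into library preparation, then sequence reads counted) can be illustrated with a small Monte Carlo simulation. The sketch below is not Model 1, 2 or 3 from the Supplementary Method; it assumes simple binomial sampling at each step, a true allele fraction of 0.5, and a hypothetical read depth of 1000, and all function names are illustrative.

```python
import random
import statistics


def simulate_allele_fraction(molecules, reads, p=0.5, rng=random):
    """One simulated measurement of the fraction of allele A.

    Step 1: sample `molecules` template molecules, each carrying allele A
            with probability p (loading into library preparation).
    Step 2: sample `reads` sequence reads from the amplified library, in
            which allele A is present at the fraction sampled in step 1.
    """
    sampled_a = sum(rng.random() < p for _ in range(molecules))
    fraction_in_library = sampled_a / molecules
    reads_a = sum(rng.random() < fraction_in_library for _ in range(reads))
    return reads_a / reads


def simulated_cv(molecules, reads, replicates=200, seed=0):
    """Coefficient of variation of the measured fraction across replicates."""
    rng = random.Random(seed)
    values = [
        simulate_allele_fraction(molecules, reads, rng=rng)
        for _ in range(replicates)
    ]
    return statistics.stdev(values) / statistics.mean(values)


# Median molecule inputs from the text; the read depth is an assumption.
for molecules in (15, 3313):
    print(molecules, round(simulated_cv(molecules, reads=1000), 3))
```

Under these assumptions, the low-input condition shows a much larger CV than the high-input condition even though the read depth is identical, which is the qualitative pattern described for Figure 5.3A-D.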

5.4.2 Controlling for qualitative sequencing error in NGS

Varying frequencies of base substitutions were observed for all nucleotides, and rare deletion events were detected for guanine and adenine bases (Figure 5.4). In general, most observed base substitution rates were lower than 1 in 100 for each base location. Adenine to guanine and cytosine to thymine base transitions (purine-purine or pyrimidine-pyrimidine) were the most common type of sequence variation observed, followed by base transversions (purine-pyrimidine or pyrimidine-purine) at approximately 10-fold lower frequency (Figure 5.4). Furthermore, the type of sequence base substitution and its average frequency were concordant between NT and IS for Group 1 (Figure 5.4). The coefficient of variation (CV) around the mean frequency of each type of base substitution was on average 0.28. This roughly translates to a standard deviation of 1.9-fold on either side of the population measurement mean for each type of sequence variation (2.8-fold detection limit with 95% confidence limits for detection of fold change). Data for Group 2 are available in Supplementary Table 5, and are nearly identical to those presented in Figure 5.4.

Bivariate plots of the frequency of technically derived sequence variation for NT and corresponding type of sequence variation for each base position in competitive IS for Groups 1 and 2 (see Methods: NGS data analysis) are presented in Figure 5.5A,C,E and G. Frequency of observed sequence variation in IS explained 93-94% of observed sequence variation in NT (Figure 5.5A, C). Importantly, the vast majority of deviation from the regression line is explainable by the minimum sequence counts observed for the technically derived sequence variation (Figure 5.5B, D). Concordance was slightly higher for NT-to-NT and IS-to-IS comparisons between Groups 1 and 2, with each explaining 96-97% of the frequency of base-specific sequence variation observed between the two groups (Figure 5.5E,G). Again, deviation from the regression line in Figure 5.5E, G was largely explainable by the minimum sequence counts observed for the rare technically derived sequence variation (Figure 5.5F, H).
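The "percent of variation explained" figures above correspond to a coefficient of determination (R²) between paired frequencies. A minimal sketch of that calculation follows. It assumes the regression is fitted on log10-transformed frequencies, which the text does not state, and the example frequencies are hypothetical values, not data from this study.

```python
import math


def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def log_frequency_r_squared(nt_freqs, is_freqs):
    """R² between NT and IS error frequencies on a log10 scale.

    Zero frequencies must be removed beforehand because log10(0) is undefined.
    """
    x = [math.log10(f) for f in is_freqs]
    y = [math.log10(f) for f in nt_freqs]
    return r_squared(x, y)


# Hypothetical per-position error frequencies, for illustration only.
is_freqs = [1e-3, 2e-4, 5e-5, 1e-5]
nt_freqs = [1.2e-3, 1.8e-4, 6e-5, 0.9e-5]
print(round(log_frequency_r_squared(nt_freqs, is_freqs), 3))
```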

5.5 Discussion

Next-Generation Sequencing (NGS) technologies have the potential to disrupt a large number of technologies presently used in clinical diagnostics. However, their implementation in the clinical setting is impeded by a complex specimen and data analysis process (Figure 5.1), which is compounded by an equally complex goal of analyzing large multi-target panels. Because of the profound clinical implications of treatment decisions based on NGS methods, these methods should be held to the same analytical performance standards applied to other methods used in the clinical chemistry laboratory. In an effort to achieve this goal we developed a competitive multiplex PCR-based amplicon library preparation method that utilizes competitive internal standards (also known as internal amplification controls) (Blomquist, Crawford et al. 2013). The method enables control for sample overloading, excessive amplification cycles, other signal saturation effects and technical biases that can lead to inter-assay and inter-specimen variation in signal measurement. Data also suggested that this method controls for sub-optimal loading of sample into library preparation, sub-optimal loading of library preparation into sequencer, and sequence errors generated during library preparation and sequencing. We decided to address these important challenges by formulating and experimentally testing Hypotheses 1 and 2.

Hypothesis 1 is supported by the data reported here. Specifically, the mathematical equation based on both NT loading into NGS library preparation and sequence read counts from the NGS instrument (Monte Carlo simulation Model 3) predicted observed assay coefficient of variation in four targeted NGS assays (Figures 5.2, 5.3 and Supplementary Methods – Figure S5.1 Model Design). While the predictive value of this equation remains to be confirmed across other types of NGS library preparation methods and sequencing platforms, generalizability is likely based on the similarity of biochemical reactions involved. Implementation of this equation may be particularly helpful when only one technical or biological replicate measurement is feasible (e.g., limited clinical specimen). In this context, the laboratory clinician would be asked to comment on the confidence in the measurement of target analyte, or frequency of a clinically actionable mutation present in a tumor specimen. Using this equation, the laboratory information system would be able to easily derive confidence intervals for reporting. As an example, this would simplify a decision regarding whether to direct treatment to an actionable mutation. Importantly, as is clear from Figure 5.3F, large analytical variation from stochastic sampling will be observed if an insufficient concentration of target molecules is sampled, regardless of the concentration of amplification products sampled for loading into sequencer. This is why it is important to use quality control thresholds that address each of these sources of variation.
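To make the idea of deriving confidence intervals from a predicted CV concrete, the sketch below combines two independent Poisson sampling steps. It is an illustration under stated assumptions, not the Model 3 equation from the Supplementary Methods: it assumes that each step contributes a relative variance of 1/n, that the two contributions add, and that a normal approximation (z = 1.96) is adequate. The function names and example numbers are hypothetical.

```python
import math


def predicted_cv(molecules, reads):
    """Expected CV when a count is subject to two Poisson sampling steps.

    For a Poisson count with mean n the CV is 1 / sqrt(n). When reads are
    drawn from a pool whose size is itself Poisson distributed, the
    relative variances add: CV^2 = 1/molecules + 1/reads.
    """
    return math.sqrt(1.0 / molecules + 1.0 / reads)


def confidence_limits(measured, molecules, reads, z=1.96):
    """Approximate confidence limits for a measured value (normal approximation)."""
    half_width = z * predicted_cv(molecules, reads) * measured
    return max(0.0, measured - half_width), measured + half_width


# Hypothetical example: a variant measured at 5% frequency, supported by
# 100 input molecules and 10,000 sequence reads carrying the variant.
low, high = confidence_limits(0.05, molecules=100, reads=10_000)
print(round(low, 4), round(high, 4))
```

In this example the interval is dominated by the 100 input molecules rather than by the 10,000 reads, which mirrors the point made above about Figure 5.3F: additional sequencing depth cannot compensate for insufficient target molecule input.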

Hypothesis 2 also is supported by data from these studies. Specifically, the frequency of technically derived sequence variation for each NT was largely explained by that observed in the respective IS template (Figure 5.5A,C). Furthermore, any deviation from the regression line observed (in Figures 5.5A,C) was largely explained by stochastic sampling of low sequence counts for the technically derived sequence variation (Figures 5.5B,D). Thus, with sufficient molecules loaded into the library preparation, and sequence counts obtained, the limit of detection of rare biological single nucleotide variations in native material can be easily determined using a competitive internal standard (Figure 5.5), and is more accurate than the 2.8-fold change limit of detection estimated by the type of sequence variation only (Figure 5.4). Importantly, base transitions were observed in approximately 10-fold excess compared to base transversion events (Figure 5.4). The error rates observed here are specific to the chosen combination of specimen preparation, sequencing and data analysis pipeline methods, and should not be blindly applied to other NGS pipelines.

In summary, we present data that synthetic internal standards, in the context of a targeted competitive PCR amplicon library preparation method (Blomquist, Crawford et al. 2013), control for both stochastic sampling in quantitative NGS and technically derived sequencing error in qualitative NGS detection of low frequency alleles. By applying quality-control parameters based on these experimentally validated models that predict key sources of NGS analytical variation, we can now accurately report confidence limits for NGS measurement of clinically important analytical targets, as well as provide an accurate limit of detection for observed base substitution, insertion and deletion rates at each base position within each native target. We are implementing quality control measures described here in analysis of promising diagnostic tests, including a lung cancer diagnostic test (Yeo, Crawford et al. 2014) and a lung cancer risk test (Blomquist, Crawford et al. 2013). Incorporation of these quality controls provides an analysis pathway consistent with previously reported College of American Pathologists (CAP) and Nex-StoCT guidelines for NGS diagnostics in the clinical setting (Gargis, Kalman et al. 2012, Aziz, Zhao et al. 2015, Gargis, Kalman et al. 2015).


5.6 Table and Figure Legends

Figure 5.1 Overview of specimen preparation for Next-Generation Sequencing.

(Portrait, double-column)

This schematic illustrates our hypothesis that two primary points of stochastic sampling error along the continuum of Next-Generation Sequencing (NGS) library preparation and sequencing can account for observed analytical variation in targeted PCR-based NGS assays.

Figure 5.2 Performance of Monte Carlo simulation models to predict observed assay variance. (Portrait, double-column)

Equations used to plot expected coefficient of variation (CV) are presented in Supplementary Methods – Model Design. Measured CV was obtained from 46 quadruplicate technical measurements; the 46 measurements of CV and calculated CV based on Models 1, 2 and 3 are available in Supplementary Table 3.

Figure 5.3 Independent effects of sequence counts and sample molecule loading on measured allelic ratios. (Landscape, full page)

A-D) Effect of low molecule input into library preparation on measured allelic ratio relative to expected. To eliminate the effect of low sequence counts, only values based on at least 500 sequence counts were included. Closed circles = high molecule input (median = 3313 molecules each replicate). Open triangles = low molecule input (median = 15 molecules each replicate). Each data point is a single technical replicate. E) Serially diluted PCR amplicon library samples from the undiluted 1:1 cell line mixture were loaded into sequencer. Effect of sequences counted (X-axis) on allelic ratio (Y-axis) for each target with high molecule input (>500 molecules in each replicate). Combined results from all four loci are presented. F) Undiluted PCR amplicon library samples from serially diluted 1:1 cell line mixture were loaded into sequencer. Effect of target molecule number (X-axis) on allelic ratio (Y-axis) for each target with high sequence count (>500 sequence read counts in each replicate for each target). Dashed line with open squares represents an expected frequency of error based on a Poisson distribution (Model 1 and Model 2). Mixing design of cell line DNA and titration of sequencing counts, and all measurements derived from these specimens are available as full and individual subset analysis tables in Supplementary Table 3.

Figure 5.4 Frequency plot of observed technically derived sequencing variation.

(Portrait, double column)

A,B) Type of base substitution is plotted on X-axis. For example, “C > T” represents a transition from a cytosine to thymine base, and “G > -” represents a deletion of a guanine. The first base listed is the expected consensus base at that position based on sequences listed in Supplementary Tables 1 and 2. Each base position, for each template, and the frequency of that type of sequence variation is plotted as an individual data point along the Y-axis. In this figure, only Group 1 data are presented. Means and standard deviation error bars are plotted for each type of sequence variation. Group 2 data plotted essentially identically and were moved as raw data to Supplementary Table 5.


Figure 5.5 Performance of competitive internal standards to measure frequency of technically derived sequence variation. (Portrait, full page)

A,C,E and G) Bivariate plots of measured sequence variation frequency, for each base position along the length of each native template (NT) and internal standard (IS) for groups 1 and 2 (see Methods: NGS data analysis). B,D,F and H) Plots representing fold-deviation of NT:IS ratio away from regression line in respective plots A,C,E and G. Sequence counts observed (minimum) on the X-axis is the number of sequence counts for the observed type of sequencing error, and not the total number of sequence counts for that assay. Dashed line with open squares represents an expected frequency of error based on a Poisson distribution (Model 1).


5.7 Table and Figure

Figure 5.1

Figure 5.2

Figure 5.3

Figure 5.4

Figure 5.5

5.8 Supplementary Table and Figure

Figure S5.1 Model Design

Available at http://www.sciencedirect.com/science/article/pii/S221475351530005X#MMCvFirst

Table S5.1 Supplementary Tables

Available at http://www.sciencedirect.com/science/article/pii/S221475351530005X#MMCvFirst


Chapter 6

Conclusions and Summary

In an effort to understand the role of interindividual variation in genetic predisposition to lung cancer risk, previously described work from this laboratory confirmed that there are interindividual variations in susceptibility to lung cancer, determined that DNA repair genes display altered regulation in normal bronchial epithelial cells (NBEC) of lung cancer subjects, and identified a promising Lung Cancer Risk Test (LCRT) biomarker comprising transcript abundance measurement of fifteen genes in NBEC (Crawford, Khuder et al. 2000, Mullins, Crawford et al. 2005, Crawford, Blomquist et al. 2007, Blomquist, Crawford et al. 2009).

Genetic variation plays a crucial role in susceptibility to complex genetic disease through regulation of gene expression. However, evidence indicates that in many cases, genes are regulated by multiple loci, each of which contributes only modestly to the trait (Deutsch, Lyle et al. 2005, Wu, Kraft et al. 2010). In a previous study, based on genotype analysis we found that SNPs rs751402 and rs2296147 were associated with inter-individual variation in allelic imbalance in ERCC5 expression in NBEC (Blomquist, Crawford et al. 2010). The goal of this study is to advance mechanistic understanding regarding heritable variation in cis-regulation of the key NER gene ERCC5, develop a qPCR-based molecular diagnostic test comprising transcript abundance values of predictive biomarkers for FFPE samples, and control analytical variation in NGS clinical assays.

Identification of cis-acting variant sites that alter ERCC5 transcription regulation in normal bronchial epithelial cells

Analysis of allelic ratios at marker site rs1047768 in association with diplotype structure at rs751402-rs2296147-rs1047768 demonstrates that the T allele at polymorphic site rs2296147 is associated with higher RNA transcript abundance relative to the C allele (Figure 3.3). Notably, the rs2296147 T allele participates in formation of an in silico predicted TP53 transcription factor binding site (Marinescu, Kohane et al. 2005), and that site is predicted to be lost when the C allele is present. In previous studies, TP53 upregulated ERCC5 transcription (Kannan, Amariglio et al. 2000). Therefore, it is reasonable to hypothesize that TP53 upregulates ERCC5 transcription more effectively when the T allele is present at rs2296147. Because TP53 is regulated primarily at the post-translational level, TP53 transcription factor functional activity is often measured indirectly as transcript abundance of key target genes such as CDKN1A (el-Deiry, Tokino et al. 1993, el-Deiry, Harper et al. 1994, Harr, Graves et al. 2005). As reported here, CDKN1A and ERCC5 total transcript abundance values were correlated in non-cancer subject NBEC samples (Figure 3.4A). This correlation is consistent with the hypothesis that TP53 is a transcription factor regulator of both CDKN1A and ERCC5 transcription in NBEC. In turn, it is reasonable to hypothesize that the significantly altered CDKN1A correlation with ERCC5 in cancer subjects (Figure 3.4B) is due in part to genotype at rs2296147. In contrast to strong evidence for the cis-regulatory role of rs2296147 in ERCC5 regulation, haplotype/diplotype data do not support a similar role for rs751402.

We observed an increased mean G/C allelic ratio at rs17655 in cDNA compared to matched gDNA controls (Figure 3.2B), indicating that this SNP or a SNP in linkage disequilibrium with it influences ERCC5 transcript levels in NBEC. As described in the Results section, it is likely that the functional SNP responsible for this observation is rs873601, which is linked to rs17655 and is predicted to alter binding of multiple miRNAs (Liu, Zhang et al. 2012, Zhu, Shi et al. 2012). Specifically, the C allele at SNP rs17655 is linked to the A allele at rs873601 (r² = 0.74), which is putatively more responsive to multiple miRNAs that will increase the rate of degradation and lower abundance of transcripts originating from the rs17655 C allele. Importantly, because rs2296147 and rs873601 are not linked, we conclude that the presumed rs873601 effect is independent from that of rs2296147.

The data presented here provide evidence for higher TP53-mediated ERCC5 transcription rate from the rs2296147 T allele and higher miRNA-mediated ERCC5 transcript degradation at the rs873601 G allele. Thus, if each of these cis-regulatory sites acted alone without any contribution from the other (for example, in hypothetical alternative transcripts) or any other cis-acting SNP, we would expect not only to observe mean T/C ratio at marker SNP rs1047768 and mean G/C ratio at SNP rs873601 (or linked SNP rs17655) to be greater than one, but also very little inter-individual variation around these mean ratios. However, we observed significant variation around the mean allelic ratio at each marker SNP. The likely explanation is that the predominant expressed ERCC5 transcripts incorporate both marker SNPs (rs1047768 and rs17655) and the effects resulting from genotype at each of the unlinked cis-regulatory sites (rs2296147 and rs873601) will interact to determine the allelic ratio measured at each marker SNP.

Consistent with a complex genetic mechanism of lung cancer risk, the effect size of each DNA variant associated with lung cancer risk is very small. Consequently, thousands of subjects are needed to directly assess the association of individual genetic variants and lung cancer risk. The data presented here support the conclusion that inherited variation in gene regulation is a powerful intermediate phenotypic marker for lung cancer risk, as presented schematically in Figure 3.5. As we report here and previously (Blomquist, Crawford et al. 2010), it is possible to assess this type of intermediate risk factor with far fewer patients than the thousands typically necessary for a GWAS study aiming to determine association of each individual SNP with risk (Amos, Wu et al. 2008). Specifically, the association of a single genetic variant with transcription regulation (e.g., rs2296147 with ERCC5 regulation) or the association of inherited variation in transcript abundance regulation with lung cancer risk (Blomquist, Crawford et al. 2009) may be assessed with hundreds of subjects (Blomquist, Crawford et al. 2009). For example, starting with 161 subjects (Amos, Wu et al. 2008) we observed significant association of rs2296147 genotype with ERCC5 ASE (Figure 3.3), and with fewer than 100 subjects we observed significantly altered ERCC5 regulation with lung cancer (Mullins, Crawford et al. 2005) (Figure 3.4). In contrast, there was not a clear association of rs2296147 T allele dosage with lung cancer risk among the subjects enrolled for this study (data not shown).

Based on the findings in the current study, we conclude that the T allele at rs2296147 is associated with higher ERCC5 transcript abundance, possibly through increased responsiveness to TP53 transcription factor. Genotype at rs17655 also is associated with variation in ERCC5 transcript abundance, likely due to effect on miRNA binding affinity at the linked SNP rs873601. These effects on ERCC5 transcription likely result in variation in nucleotide excision DNA repair function. These findings provide plausible explanation for the association of genotype at rs2296147 and rs17655 with lung cancer risk.

In addition to cis-regulation, the effect of CEBPG, a previously identified transcription factor for ERCC5, was determined in a lung cancer cell line. A CEBPG knock-down experiment in the H1703 lung cancer cell line confirmed the regulatory role of CEBPG in ERCC5 transcription, demonstrating an effect of trans-regulation on gene expression and supporting our conclusion that interindividual variation in lung cancer risk is attributable to interindividual variation in gene regulation machinery, both cis-acting and trans-acting.

Biospecimen repository of NBEC samples from bronchial brush biopsy/bronchoscopy

As described above, the Lung Cancer Risk Test (LCRT) trial is a prospective cohort study comparing lung cancer incidence among persons with a positive or negative value for the LCRT, a 15-gene test measured in normal bronchial epithelial cells (NBEC). NBEC were collected from the target population of the LCRT biomarker, namely individuals who meet USPSTF eligibility criteria for annual low dose helical CT screening (Humphrey, Deffebach et al. 2013). Lung cancer incidence in the LCRT study is expected to be similar to the 3.1% incidence over 3.9 years reported by Bach et al. (Bach, Jett et al. 2007). Bronchoscopy with or without biopsy is considered a safe procedure, and no complications (SAEs) were associated with bronchoscopic brushing and sample collection in either the standard of care (SOC) or study driven (SD) group. Pulmonary function test data were available for more than 75% of the subjects enrolled (290 of 384 subjects). One hundred fifty-seven had clinical COPD and more than 70% had GOLD stage 2 or worse (Table 4.2). Repeat PFT and CT scan are planned for all subjects at each subsequent follow-up. This information will enable longitudinal assessment of the rate of decline in pulmonary function by both physiologic and radiographic measures, and assessment of the presence or absence of lung cancer.

The demographic composition of the enrolled group is representative of the population for which the LCRT is intended. Specifically, based on baseline population characteristics we expect lung cancer incidence in this cohort to be similar to the 3.1% reported in prior studies and representative of the population eligible for LDCT lung cancer screening. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in this population. These findings support the feasibility of testing LCRT clinical utility in this prospective study. If validated, the LCRT has the potential to significantly narrow the population of individuals requiring annual low-dose helical CT screening for early detection of lung cancer and delay the onset of screening for individuals with results indicating low lung cancer risk. For these individuals, the small risk incurred by undergoing once-in-a-lifetime bronchoscopic sample collection for LCRT may be offset by a reduction in their CT-related risks. The LCRT biospecimen repository will enable additional studies of the genetic basis for COPD and/or lung cancer risk.

Development of Multiplex Two-Color Fluorometric RT-qPCR for Predicting Chemo Responses in FFPE Samples

Biomarkers are of increasing importance for personalized medicine, especially for assessing clinical outcomes of a treatment, determining the most effective treatment for an individual, or monitoring ongoing treatment. Predictive biomarkers, which are associated with response to a treatment in terms of efficacy and/or safety, are usually measured with companion molecular diagnostic tests. To augment the most commonly used approach, immunohistochemistry (IHC), and to enable quantitation of predictive biomarkers, efforts were made to develop quality-controlled multiplex two-color fluorometric real-time PCR assays for ten predictive markers that have shown clinical value in predicting response to general or targeted chemotherapeutic agents: ERCC1, RRM1 and MRP2 in response to cisplatin; and EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS in response to targeted chemotherapeutic agents.

Reagents for each of 10 predictive biomarkers, ERCC1, RRM1, MRP2, EGFR, ROS1, ALK1, FGFR 1 to 3, and TYMS, had acceptable linearity (R² > 0.99), signal-to-analyte response (slope 1.0 ± 0.05), lower detection threshold (< 10 molecules) and imprecision (CV < 20%).
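The linearity and signal-to-analyte criteria quoted above (R² > 0.99, slope 1.0 ± 0.05) can be checked with an ordinary least-squares fit. The sketch below assumes the fit is performed on log10(measured copies) against log10(input copies), which the text does not specify; the function names and dilution values are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares slope, intercept and R² for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)


def passes_linearity(input_copies, measured_copies,
                     slope_tolerance=0.05, min_r_squared=0.99):
    """Apply the slope and R² acceptance criteria on a log10-log10 scale."""
    x = [math.log10(v) for v in input_copies]
    y = [math.log10(v) for v in measured_copies]
    slope, _, r_squared = fit_line(x, y)
    return abs(slope - 1.0) <= slope_tolerance and r_squared > min_r_squared


# Hypothetical 10-fold dilution series, for illustration only.
input_copies = [10, 100, 1000, 10000]
measured_ok = [11, 98, 1020, 9900]
measured_compressed = [10, 50, 250, 1250]
print(passes_linearity(input_copies, measured_ok))
print(passes_linearity(input_copies, measured_compressed))
```

The second series is perfectly linear on the log scale but has a slope well below 1, so it fails the signal-to-analyte criterion even though its R² is high; the two criteria test different properties.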

Specificity of each probe was tested by including it in PCR assays containing the synthetic NT or IS serially diluted from 10^-11 M to 10^-15 M. For each probe, at each NT or IS dilution, the signal (Cq value: quantification cycle) observed with amplification in the presence of the non-homologous template was compared to the Cq value observed with amplification in the presence of the homologous template. Non-homologous (non-specific) binding was <1% for both NT and IS probes for all genes.

The key components of this method are internal standards (IS) and external standards (ES). The competitive IS molecule was designed with identical priming sites and a 4-6 bp internal difference from each native target gene template (NT). This ensures identical thermodynamics and amplification efficiency for both template species as well as discrimination of IS from NT. The ES corrects for differences in fluorescence intensity between the two probes, which are labeled with different dyes; such differences arise from variation in probe degradation or in software selection of Cq values on each PCR plate.
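The roles of the IS and ES can be illustrated with a simple calculation. The sketch below is an assumption-laden illustration, not the formula used in this work: it assumes perfect doubling per cycle (a factor of 2 per Cq unit), and it assumes the ES is a known 1:1 NT:IS mixture run on the same plate, so that any Cq difference observed in the ES reflects probe or threshold bias rather than abundance. All names and values are hypothetical.

```python
def nt_copies(is_copies, cq_nt, cq_is, es_cq_nt, es_cq_is):
    """Estimate NT copies in a sample from a competitive IS measurement.

    is_copies : known number of IS molecules added to the sample
    cq_nt     : Cq of the NT probe in the sample
    cq_is     : Cq of the IS probe in the sample
    es_cq_nt  : Cq of the NT probe in the external standard (1:1 NT:IS)
    es_cq_is  : Cq of the IS probe in the external standard (1:1 NT:IS)
    """
    sample_delta = cq_is - cq_nt      # positive when NT is more abundant
    es_delta = es_cq_is - es_cq_nt    # bias seen for a true 1:1 ratio
    corrected_delta = sample_delta - es_delta
    return is_copies * 2.0 ** corrected_delta


# NT crosses threshold one cycle earlier than IS; the ES shows no probe bias.
print(nt_copies(1000, cq_nt=24.0, cq_is=25.0, es_cq_nt=20.0, es_cq_is=20.0))
```

Because NT and IS share priming sites and are amplified in the same well, factors that change amplification efficiency affect both templates alike and largely cancel in the Cq difference.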

At each serial 10-fold dilution of ESM (10^-11 M NT/10^-11 M IS to 10^-17 M NT/10^-17 M IS), the average coefficient of variation (CV) for measurement of each of the four genes was < 10% for > 60 molecules input (10^-11 M NT/10^-11 M IS to 10^-16 M NT/10^-16 M IS).
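The correspondence between molar concentration and molecule number follows from Avogadro's number. The short calculation below assumes that 1 µL of each dilution is added to the reaction; that volume is not stated in the text and is inferred here only because it reproduces the figure of roughly 60 molecules at 10^-16 M.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_from_molarity(molar_concentration, volume_litres):
    """Number of molecules in a given volume of solution."""
    return molar_concentration * volume_litres * AVOGADRO


ASSUMED_VOLUME_L = 1e-6  # 1 microlitre; an assumption, see the note above

for exponent in range(-11, -18, -1):
    concentration = 10.0 ** exponent
    count = molecules_from_molarity(concentration, ASSUMED_VOLUME_L)
    print(f"10^{exponent} M -> {count:.3g} molecules")
```

Under this assumption the 10^-17 M dilution contains about 6 molecules, a level at which Poisson sampling alone would produce a CV of roughly 40%, consistent with the < 10% CV being reported only for the dilutions down to 10^-16 M.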

Development of Control for Stochastic Sampling Variation in Next Generation Sequencing

In addition to real-time PCR, quality control was implemented in next generation sequencing (NGS) RNA-sequencing platforms. We previously developed competitive multiplex-PCR amplicon library preparation for targeted RNA-sequencing on NGS platforms. One challenge in applying NGS in the clinical setting is analytical variation due to stochastic sampling. This is important for mutation detection and differential gene expression measurement. Although internal standards serve as competitive controls for sample overloading, signal saturation effects, and inter-assay and inter-sample variation in measurement, stochastic sampling error cannot be controlled when the sample contains few copies of the target.

As supported by the data reported in this study, the mathematical equation based on both NT loading into NGS library preparation and sequence read counts from the NGS instrument (Monte Carlo simulation Model 3) predicted observed assay coefficient of variation in four targeted NGS assays (Figures 5.2, 5.3 and Supplementary Methods 5.8 – Model Design). While the predictive value of this equation remains to be confirmed across other types of NGS library preparation methods and sequencing platforms, generalizability is likely based on the similarity of biochemical reactions involved. Implementation of this equation may be particularly helpful when only one technical or biological replicate measurement is feasible (e.g., limited clinical specimen). In this context, the laboratory clinician would be asked to comment on the confidence in the measurement of target analyte, or frequency of a clinically actionable mutation present in a tumor specimen. Using this equation, the laboratory information system would be able to easily derive confidence intervals for reporting. As an example, this would simplify a decision regarding whether to direct treatment to an actionable mutation. Importantly, as is clear from Figure 5.3F, large analytical variation from stochastic sampling will be observed if an insufficient concentration of target molecules is sampled, regardless of the concentration of amplification products sampled for loading into sequencer. This is why it is important to use quality control thresholds that address each of these sources of variation. Based on Poisson sampling, a mathematical equation was developed to predict assay coefficient of variation (CV). The predicted CV is then used to determine the confidence limits for each value acquired from sequencing. Therefore, false positive results can be eliminated by minimizing stochastic variation.

By applying quality-control parameters based on these experimentally validated models that predict key sources of NGS analytical variation, we can now accurately report confidence limits for NGS measurement of clinically important analytical targets, as well as provide an accurate limit of detection for observed base substitution, insertion and deletion rates at each base position within each native target. Incorporation of these quality controls provides an analysis pathway consistent with previously reported College of American Pathologists (CAP) and Nex-StoCT guidelines for NGS diagnostics in the clinical setting.

New Contributions from Chapters 3, 4 and 5

1. Haplotype- and diplotype-based analyses possess greater power than genotype-based analysis. When multiple loci regulate the expression of one gene, which is common in complex genetic diseases such as lung cancer, genotype-based analysis could not sort out with confidence the independent roles of rs751402 and rs2296147. Assessing the syntenic relationship of alleles at multiple candidate loci (i.e., haplotype structure) enables characterization of the independent and/or interactive roles of cis-regulatory SNPs (cis-rSNP).

2. Using allelic imbalance to assess cis-acting genetic variation controls for trans-acting effects or environmental conditions that differentially influence gene expression among samples. Association of allelic imbalance with haplotypes and diplotypes comprising putative cis-rSNPs allows identification of hereditary cis-rSNPs, as well as interactions among multiple cis-rSNPs, without interference from trans-acting effects.

3. The effect size of each genetic variant can be magnified and become detectable when the variant is associated with an intermediate risk factor, namely regulation of key genes that are associated with inherited disease susceptibility (e.g., lung cancer risk). This type of intermediate risk factor can be assessed with far fewer patients than the thousands typically necessary for a GWAS aiming to determine the association of each individual SNP with risk.

4. Collection of NBEC by bronchial brush biopsy/bronchoscopy was safe and well-tolerated in the LCRT recruited population. This biospecimen repository will enable additional studies of the genetic basis for COPD and/or lung cancer risk.

5. The findings of this study have the potential to significantly narrow the population of individuals requiring annual LDCT for early detection of lung cancer.

6. The developed two-color fluorometric real-time PCR demonstrated good analytical performance. Probe specificity was < 1% non-homologous binding, and primers detected < 10 molecules. Over 6 orders of magnitude at a 1:1 NT:IS ratio, and in dilutions of NT against constant IS (or vice versa) at ratios < 10, the R² value was > 0.99 and the slope was 1.0 ± 0.05. The average coefficient of variation (CV) for measurement of each gene was < 10% for > 60 molecules input.

7. ESM controlled for variation in the fluorescent labeling probes and in selection of the threshold. The unknown copy number of target NT was calculated by comparing the Cq values of NT and IS: [NT Cq - IS Cq] multiplied by input IS copies. Variations that can affect those Cq values, such as probe quality, probe activity, and the software selection of Cq, were controlled by the mean of the two ESM [NT Cq - IS Cq] values.

8. The developed two-color fluorometric real-time PCR augments the accuracy, specificity and sensitivity of commonly used clinical predictive biomarker tests, for example, immunohistochemistry (IHC) and FISH. In particular, it allows detection in FFPE human tissue, which is abundant in hospitals nationwide.

9. Inclusion of synthetic competitive internal standard templates in targeted NGS library preparation controls for low target input into NGS library preparation, low target library product into the sequencer, and errors generated during library preparation and sequencing.

10. A mathematical equation based on Poisson sampling was developed to predict the assay coefficient of variation (CV). The predicted CV is then used to determine the confidence limits for each value acquired from sequencing. False positive results can therefore be eliminated by minimizing stochastic variation.

11. Incorporation of these quality controls provides an analysis pathway consistent with previously reported College of American Pathologists (CAP) and Nex-StoCT guidelines for NGS diagnostics in the clinical setting.
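The allelic imbalance measure referred to in contribution 2 can be illustrated with a minimal Python sketch. It assumes allele-specific read counts at a heterozygous transcribed SNP are already available, and it tests departure from a 1:1 allelic ratio with an exact two-sided binomial test. The function names and the choice of test are illustrative assumptions and are not taken from the methods of this study.

```python
from math import comb


def allelic_ratio(count_a: int, count_b: int) -> float:
    """Ratio of reads carrying allele A to reads carrying allele B."""
    if count_b <= 0:
        raise ValueError("count_b must be positive")
    return count_a / count_b


def binomial_two_sided_p(count_a: int, count_b: int) -> float:
    """Exact two-sided binomial p-value against a null allelic fraction of 0.5.

    Because the null distribution is symmetric at 0.5, the two-sided p-value
    is twice the probability of the smaller tail, capped at 1.
    """
    n = count_a + count_b
    k = min(count_a, count_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts for one heterozygous subject.
    a_reads, b_reads = 620, 380
    print(allelic_ratio(a_reads, b_reads), binomial_two_sided_p(a_reads, b_reads))
```

Because both alleles are measured within the same sample, trans-acting and environmental effects act on both counts alike, so the within-sample ratio isolates the cis-acting component.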

References

"Database of Single Nucleotide Polymorphisms (dbSNP)." Bethesda (MD): National

Center for Biotechnology Information, National Library of Medicine.

(2014). Cancer Facts & Figures. American Cancer Society.

Adare, A., S. Afanasiev, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, H. Al-Ta'ani,

J. Alexander, A. Angerami, K. Aoki, N. Apadula, Y. Aramaki, H. Asano, E. C.

Aschenauer, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, B. Bannier,

K. N. Barish, B. Bassalleck, S. Bathe, V. Baublis, S. Baumgart, A. Bazilevsky, R.

Belmont, A. Berdnikov, Y. Berdnikov, X. Bing, D. S. Blau, K. Boyle, M. L. Brooks, H.

Buesching, V. Bumazhnov, S. Butsyk, S. Campbell, P. Castera, C. H. Chen, C. Y. Chi,

M. Chiu, I. J. Choi, J. B. Choi, S. Choi, R. K. Choudhury, P. Christiansen, T. Chujo, O.

Chvala, V. Cianciolo, Z. Citron, B. A. Cole, M. Connors, M. Csanad, T. Csorgo, S.

Dairaku, A. Datta, M. S. Daugherity, G. David, A. Denisov, A. Deshpande, E. J.

Desmond, K. V. Dharmawardane, O. Dietzsch, L. Ding, A. Dion, M. Donadelli, O.

Drapier, A. Drees, K. A. Drees, J. M. Durham, A. Durum, L. D'Orazio, S. Edwards, Y. V.

Efremenko, T. Engelmore, A. Enokizono, S. Esumi, K. O. Eyser, B. Fadem, D. E. Fields,

M. Finger, M. Finger, Jr., F. Fleuret, S. L. Fokin, J. E. Frantz, A. Franz, A. D. Frawley,

Y. Fukao, T. Fusayasu, K. Gainey, C. Gal, A. Garishvili, I. Garishvili, A. Glenn, X.

Gong, M. Gonin, Y. Goto, R. Granier de Cassagnac, N. Grau, S. V. Greene, M. Grosse

Perdekamp, T. Gunji, L. Guo, H. A. Gustafsson, T. Hachiya, J. S. Haggerty, K. I. Hahn,

H. Hamagaki, J. Hanks, K. Hashimoto, E. Haslum, R. Hayano, X. He, T. K. Hemmick, T.

Hester, J. C. Hill, R. S. Hollis, K. Homma, B. Hong, T. Horaguchi, Y. Hori, S. Huang, T.

Ichihara, H. Iinuma, Y. Ikeda, J. Imrek, M. Inaba, A. Iordanova, D. Isenhower, M. Issah,

A. Isupov, D. Ivanischev, B. V. Jacak, M. Javani, J. Jia, X. Jiang, B. M. Johnson, K. S.

Joo, D. Jouan, J. Kamin, S. Kaneti, B. H. Kang, J. H. Kang, J. S. Kang, J. Kapustinsky,

K. Karatsu, M. Kasai, D. Kawall, A. V. Kazantsev, T. Kempel, A. Khanzadeev, K. M.

Kijima, B. I. Kim, C. Kim, D. J. Kim, E. J. Kim, H. J. Kim, K. B. Kim, Y. J. Kim, Y. K.

Kim, E. Kinney, A. Kiss, E. Kistenev, J. Klatsky, D. Kleinjan, P. Kline, Y. Komatsu, B.

Komkov, J. Koster, D. Kotchetkov, D. Kotov, A. Kral, F. Krizek, G. J. Kunde, K. Kurita,

M. Kurosawa, Y. Kwon, G. S. Kyle, R. Lacey, Y. S. Lai, J. G. Lajoie, A. Lebedev, B.

Lee, D. M. Lee, J. Lee, K. B. Lee, K. S. Lee, S. H. Lee, S. R. Lee, M. J. Leitch, M. A.

Leite, M. Leitgab, B. Lewis, S. H. Lim, L. A. Linden Levy, A. Litvinenko, M. X. Liu, B.

Love, C. F. Maguire, Y. I. Makdisi, M. Makek, A. Malakhov, A. Manion, V. I. Manko,

E. Mannel, S. Masumoto, M. McCumber, P. L. McGaughey, D. McGlinchey, C.

McKinney, M. Mendoza, B. Meredith, Y. Miake, T. Mibe, A. C. Mignerey, A. Milov, D.

K. Mishra, J. T. Mitchell, Y. Miyachi, S. Miyasaka, A. K. Mohanty, H. J. Moon, D. P.

Morrison, S. Motschwiller, T. V. Moukhanova, T. Murakami, J. Murata, T. Nagae, S.

Nagamiya, J. L. Nagle, M. I. Nagy, I. Nakagawa, Y. Nakamiya, K. R. Nakamura, T.

Nakamura, K. Nakano, C. Nattrass, A. Nederlof, M. Nihashi, R. Nouicer, N. Novitzky,

A. S. Nyanin, E. O'Brien, C. A. Ogilvie, K. Okada, A. Oskarsson, M. Ouchida, K.

Ozawa, R. Pak, V. Pantuev, V. Papavassiliou, B. H. Park, I. H. Park, S. K. Park, S. F.

Pate, L. Patel, H. Pei, J. C. Peng, H. Pereira, V. Peresedov, D. Y. Peressounko, R. Petti,

C. Pinkenburg, R. P. Pisani, M. Proissl, M. L. Purschke, H. Qu, J. Rak, I. Ravinovich, K.

F. Read, R. Reynolds, V. Riabov, Y. Riabov, E. Richardson, D. Roach, G. Roche, S. D.

Rolnick, M. Rosati, P. Rukoyatkin, B. Sahlmueller, N. Saito, T. Sakaguchi, V.

Samsonov, M. Sano, M. Sarsour, S. Sawada, K. Sedgwick, R. Seidl, A. Sen, R. Seto, D.

Sharma, I. Shein, T. A. Shibata, K. Shigaki, M. Shimomura, K. Shoji, P. Shukla, A.

Sickles, C. L. Silva, D. Silvermyr, K. S. Sim, B. K. Singh, C. P. Singh, V. Singh, M.

Slunecka, R. A. Soltz, W. E. Sondheim, S. P. Sorensen, M. Soumya, I. V. Sourikova, P.

W. Stankus, E. Stenlund, M. Stepanov, A. Ster, S. P. Stoll, T. Sugitate, A. Sukhanov, J.

Sun, J. Sziklai, E. M. Takagui, A. Takahara, A. Taketani, Y. Tanaka, S. Taneja, K.

Tanida, M. J. Tannenbaum, S. Tarafdar, A. Taranenko, E. Tennant, H. Themann, T.

Todoroki, L. Tomasek, M. Tomasek, H. Torii, R. S. Towell, I. Tserruya, Y. Tsuchimoto,

T. Tsuji, C. Vale, H. W. van Hecke, M. Vargyas, E. Vazquez-Zambrano, A. Veicht, J.

Velkovska, R. Vertesi, M. Virius, A. Vossen, V. Vrba, E. Vznuzdaev, X. R. Wang, D.

Watanabe, K. Watanabe, Y. Watanabe, Y. S. Watanabe, F. Wei, R. Wei, S. N. White, D.

Winter, S. Wolin, C. L. Woody, M. Wysocki, Y. L. Yamaguchi, R. Yang, A. Yanovich,

J. Ying, S. Yokkaichi, Z. You, I. Younus, I. E. Yushmanov, W. A. Zajc, A. Zelenski and

L. Zolin (2012). "Evolution of π0 suppression in Au+Au collisions from √s_NN = 39 to 200 GeV." Phys Rev Lett 109(15): 152301.

Akey, J., L. Jin and M. Xiong (2001). "Haplotypes vs single marker linkage

disequilibrium tests: what do we gain?" Eur J Hum Genet 9(4): 291-300.

Alberg, A. J. and J. M. Samet (2003). "Epidemiology of lung cancer." Chest 123(1

Suppl): 21S-49S.

Albert, F. W. and L. Kruglyak (2015). "The role of regulatory variation in complex traits

and disease." Nat Rev Genet 16(4): 197-212.

Aldrich, M. C., H. M. Munro, M. Mumma, E. L. Grogan, P. P. Massion, T. S. Blackwell

and W. J. Blot (2015). "Chronic obstructive pulmonary disease and subsequent overall

and lung cancer mortality in low-income adults." PLoS One 10(3): e0121805.

American Cancer Society. (2016). "Non-small cell lung cancer survival rates by stage." from http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-survival-rates.

Amos, C. I., X. Wu, P. Broderick, I. P. Gorlov, J. Gu, T. Eisen, Q. Dong, Q. Zhang, X.

Gu, J. Vijayakrishnan, K. Sullivan, A. Matakidou, Y. Wang, G. Mills, K. Doheny, Y. Y.

Tsai, W. V. Chen, S. Shete, M. R. Spitz and R. S. Houlston (2008). "Genome-wide

association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1."

Nat Genet 40(5): 616-622.

Apostolakos, M. J., W. H. Schuermann, M. W. Frampton, M. J. Utell and J. C. Willey

(1993). "Measurement of gene expression by multiplex competitive polymerase chain

reaction." Anal Biochem 213(2): 277-284.

Aroucha, D. C., R. F. Carmo, L. R. Vasconcelos, R. E. Lima, T. F. Mendonca, L. E.

Arnez, M. S. Cavalcanti, M. T. Muniz, M. L. Aroucha, E. R. Siqueira, L. B. Pereira, P.

Moura, L. M. Pereira and M. R. Coelho (2016). "TNF-alpha and IL-10 polymorphisms

increase the risk to hepatocellular carcinoma in HCV infected individuals." J Med Virol.

Aziz, N., Q. Zhao, L. Bry, D. K. Driscoll, B. Funke, J. S. Gibson, W. W. Grody, M. R.

Hegde, G. A. Hoeltge, D. G. Leonard, J. D. Merker, R. Nagarajan, L. A. Palicki, R. S.

Robetorye, I. Schrijver, K. E. Weck and K. V. Voelkerding (2015). "College of American

Pathologists' laboratory standards for next-generation sequencing clinical tests." Arch

Pathol Lab Med 139(4): 481-493.

Bach, P. B., J. R. Jett, U. Pastorino, M. S. Tockman, S. J. Swensen and C. B. Begg

(2007). "Computed tomography screening and lung cancer outcomes." JAMA 297(9):

953-961.

Bach, P. B., M. W. Kattan, M. D. Thornquist, M. G. Kris, R. C. Tate, M. J. Barnett, L. J.

Hsieh and C. B. Begg (2003). "Variations in lung cancer risk among smokers." J Natl

Cancer Inst 95(6): 470-478.

Bach, P. B., J. N. Mirkin, T. K. Oliver, C. G. Azzoli, D. A. Berry, O. W. Brawley, T.

Byers, G. A. Colditz, M. K. Gould, J. R. Jett, A. L. Sabichi, R. Smith-Bindman, D. E.

Wood, A. Qaseem and F. C. Detterbeck (2012). "Benefits and harms of CT screening for lung cancer: a systematic review." JAMA 307(22): 2418-2429.

Barendse, W. (2011). "Haplotype analysis improved evidence for candidate genes for intramuscular fat percentage from a genome wide association study of cattle." PLoS One

6(12): e29601.

Barnes, N. C., M. Saetta and K. F. Rabe (2014). "Implementing lessons learned from previous bronchial biopsy trials in a new randomized controlled COPD biopsy trial with roflumilast." BMC Pulm Med 14: 9.

Barr, R. G., J. Herbstman, F. E. Speizer and C. A. Camargo, Jr. (2002). "Validation of self-reported chronic obstructive pulmonary disease in a cohort study of nurses." Am J

Epidemiol 155(10): 965-971.

Basuli, D., R. G. Stevens, F. M. Torti and S. V. Torti (2014). "Epidemiological associations between iron and cardiovascular disease and diabetes." Front Pharmacol 5:

117.

Beane, J., J. Vick, F. Schembri, C. Anderlind, A. Gower, J. Campbell, L. Luo, X. H.

Zhang, J. Xiao, Y. O. Alekseyev, S. Wang, S. Levy, P. P. Massion, M. Lenburg and A.

Spira (2011). "Characterizing the impact of smoking and lung cancer on the airway transcriptome using RNA-Seq." Cancer Prev Res (Phila) 4(6): 803-817.

Becker-Andre, M. and K. Hahlbrock (1989). "Absolute mRNA quantification using the polymerase chain reaction (PCR). A novel approach by a PCR aided transcript titration assay (PATTY)." Nucleic Acids Res 17(22): 9437-9446.

Beer, M. A. and S. Tavazoie (2004). "Predicting gene expression from sequence." Cell

117(2): 185-198.

Bell, G. D., N. C. Kane, L. H. Rieseberg and K. L. Adams (2013). "RNA-seq analysis of allele-specific expression, hybrid effects, and regulatory divergence in hybrids compared with their parents from natural populations." Genome Biology and Evolution 5(7): 1309-1323.

Bhatnagar, S., X. Zhu, J. Ou, L. Lin, L. Chamberlain, L. J. Zhu, N. Wajapeyee and M. R.

Green (2014). "Genetic and pharmacological reactivation of the mammalian inactive X chromosome." Proceedings of the National Academy of Sciences 111(35): 12591-12598.

Biosystems, A. "TaqMan® SNP Genotyping Assays." PRODUCT BULLETIN.

Birse, C. E., R. J. Lagier, W. FitzHugh, H. I. Pass, W. N. Rom, E. S. Edell, A. O.

Bungum, F. Maldonado, J. R. Jett, M. Mesri, E. Sult, E. Joseloff, A. Li, J. Heidbrink, G.

Dhariwal, C. Danis, J. L. Tomic, R. J. Bruce, P. A. Moore, T. He, M. E. Lewis and S. M.

Ruben (2015). "Blood-based lung cancer biomarkers identified through proteomic discovery in cancer tissues, cell lines and conditioned medium." Clin Proteomics 12(1):

18.

Blomquist, T., E. L. Crawford, D. Mullins, Y. Yoon, D. A. Hernandez, S. Khuder, P. L.

Ruppel, E. Peters, D. J. Oldfield, B. Austermiller, J. C. Anders and J. C. Willey (2009).

"Pattern of antioxidant and DNA repair gene expression in normal airway epithelium

associated with lung cancer diagnosis." Cancer Res 69(22): 8629-8635.

Blomquist, T., E. L. Crawford, J. Yeo, X. Zhang and J. C. Willey (2015). "Control for

stochastic sampling variation and qualitative sequencing error in next generation

sequencing." Biomolecular Detection and Quantification.

Blomquist, T. M., R. D. Brown, E. L. Crawford, I. de la Serna, K. Williams, Y. Yoon, D.

A. Hernandez and J. C. Willey (2013). "CEBPG Exhibits Allele-Specific Expression in

Human Bronchial Epithelial Cells." Gene Regul Syst Bio 7: 125-138.

Blomquist, T. M., E. L. Crawford, J. L. Lovett, J. Yeo, L. M. Stanoszek, A. Levin, J. Li,

M. Lu, L. Shi, K. Muldrew and J. C. Willey (2013). "Targeted RNA-Sequencing with

Competitive Multiplex-PCR Amplicon Libraries." PLoS One 8(11): e79120.

Blomquist, T. M., E. L. Crawford and J. C. Willey (2010). "Cis-acting genetic variation at an E2F1/YY1 response site and putative p53 site is associated with altered allele-specific expression of ERCC5 (XPG) transcript in normal human bronchial epithelium."

Carcinogenesis 31(7): 1242-1250.

Boeri, M., C. Verri, D. Conte, L. Roz, P. Modena, F. Facchinetti, E. Calabrò, C. M.

Croce, U. Pastorino and G. Sozzi (2011). "MicroRNA signatures in tissues and plasma

predict development and prognosis of computed tomography detected lung cancer."

Proceedings of the National Academy of Sciences 108(9): 3713-3718.

Boundless (2015). "Alternatives to Dominance and Recessiveness." Boundless Biology.

Boyle, A. P., E. L. Hong, M. Hariharan, Y. Cheng, M. A. Schaub, M. Kasowski, K. J.

Karczewski, J. Park, B. C. Hitz, S. Weng, J. M. Cherry and M. Snyder (2012).

"Annotation of functional variation in personal genomes using RegulomeDB." Genome

Res 22(9): 1790-1797.

Brem, R. B., G. Yvert, R. Clinton and L. Kruglyak (2002). "Genetic dissection of transcriptional regulation in budding yeast." Science 296(5568): 752-755.

Brophy, V. H., M. D. Hastings, J. B. Clendenning, R. J. Richter, G. P. Jarvik and C. E.

Furlong (2001). "Polymorphisms in the human paraoxonase (PON1) promoter."

Pharmacogenetics 11(1): 77-84.

Browning, S. R. and B. L. Browning (2007). "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering." Am J Hum Genet 81(5): 1084-1097.

Browning, S. R. and B. L. Browning (2011). "Haplotype phasing: existing methods and new developments." Nat Rev Genet 12(10): 703-714.

Buch, S. C., B. Diergaarde, T. Nukui, R. S. Day, J. M. Siegfried, M. Romkes and J. L.

Weissfeld (2012). "Genetic variability in DNA repair and cell cycle control pathway genes and risk of smoking-related lung cancer." Mol Carcinog 51 Suppl 1: E11-20.

Buckland, P. R. (2004). "Allele-specific gene expression differences in humans." Hum

Mol Genet 13 Spec No 2: R255-260.

Burger, I. M., N. E. Kass, J. H. Sunshine and S. S. Siegelman (2008). "The use of CT for

screening: a national survey of radiologists' activities and attitudes." Radiology 248(1):

160-168.

Burgtorf, C., P. Kepper, M. Hoehe, C. Schmitt, R. Reinhardt, H. Lehrach and S. Sauer

(2003). "Clone-based systematic haplotyping (CSH): a procedure for physical haplotyping of whole genomes." Genome Res 13(12): 2717-2724.

Bustin, S. A. (2000). "Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays." J Mol Endocrinol 25(2): 169-193.

Cahan, P., Y. Li, M. Izumi and T. A. Graubert (2009). "The impact of copy number

variation on local gene expression in mouse hematopoietic stem and progenitor cells."

Nat Genet 41(4): 430-437.

Canales, R. D., Y. Luo, J. C. Willey, B. Austermiller, C. C. Barbacioru, C. Boysen, K.

Hunkapiller, R. V. Jensen, C. R. Knight, K. Y. Lee, Y. Ma, B. Maqsodi, A. Papallo, E. H.

Peters, K. Poulter, P. L. Ruppel, R. R. Samaha, L. Shi, W. Yang, L. Zhang and F. M.

Goodsaid (2006). "Evaluation of DNA microarray results with quantitative gene

expression platforms." Nat Biotechnol 24(9): 1115-1122.

Caporaso, N., F. Gu, N. Chatterjee, J. Sheng-Chih, K. Yu, M. Yeager, C. Chen, K.

Jacobs, W. Wheeler, M. T. Landi, R. G. Ziegler, D. J. Hunter, S. Chanock, S. Hankinson,

P. Kraft and A. W. Bergen (2009). "Genome-wide and candidate gene association study

of cigarette smoking behaviors." PLoS One 4(2): e4653.

Casbon, J. A., R. J. Osborne, S. Brenner and C. P. Lichtenstein (2011). "A method for

counting PCR template molecules with application to next-generation sequencing."

Nucleic Acids Res 39(12): e81.

Cassidy, A., J. P. Myles, M. van Tongeren, R. D. Page, T. Liloglou, S. W. Duffy and J.

K. Field (2008). "The LLP risk model: an individual risk prediction model for lung

cancer." Br J Cancer 98(2): 270-276.

Cazzoli, R., F. Buttitta, M. Di Nicola, S. Malatesta, A. Marchetti, W. N. Rom and H. I.

Pass (2013). "microRNAs derived from circulating exosomes as noninvasive biomarkers

for screening and diagnosing lung cancer." J Thorac Oncol 8(9): 1156-1162.

Celi, F. S., M. E. Zenilman and A. R. Shuldiner (1993). "A rapid and versatile method to

synthesize internal standards for competitive PCR." Nucleic Acids Res 21(4): 1047.

Chamizo, C., S. Zazo, M. Domine, I. Cristobal, J. Garcia-Foncillas, F. Rojo and J.

Madoz-Gurpide (2015). "Thymidylate synthase expression as a predictive biomarker of pemetrexed sensitivity in advanced non-small cell lung cancer." BMC Pulm Med 15:

132.

Chen, X., L. Levine and P. Y. Kwok (1999). "Fluorescence polarization in homogeneous

nucleic acid analysis." Genome Res 9(5): 492-498.

Chen, X. and P. F. Sullivan (2003). "Single nucleotide polymorphism genotyping:

biochemistry, protocol, cost and throughput." Pharmacogenomics J 3(2): 77-96.

Chen, Z. and X. Duan (2011). "Ribosomal RNA depletion for massively parallel bacterial

RNA-sequencing applications." Methods Mol Biol 733: 93-103.

Cheung, V. G., R. S. Spielman, K. G. Ewens, T. M. Weber, M. Morley and J. T. Burdick

(2005). "Mapping determinants of human gene expression by regional and genome-wide

association." Nature 437(7063): 1365-1369.

Churchill, G. A. (2002). "Fundamentals of experimental design for cDNA microarrays."

Nat Genet 32 Suppl: 490-495.

Cibulskis, K., M. S. Lawrence, S. L. Carter, A. Sivachenko, D. Jaffe, C. Sougnez, S.

Gabriel, M. Meyerson, E. S. Lander and G. Getz (2013). "Sensitive detection of somatic

point mutations in impure and heterogeneous cancer samples." Nat Biotechnol 31(3):

213-219.

Cirulli, E. T. and D. B. Goldstein (2010). "Uncovering the roles of rare variants in

common disease through whole-genome sequencing." Nat Rev Genet 11(6): 415-425.

Clark, A. G. (1990). "Inference of haplotypes from PCR-amplified samples of diploid

populations." Mol Biol Evol 7(2): 111-122.

Cooper, S. J., N. D. Trinklein, E. D. Anton, L. Nguyen and R. M. Myers (2006).

"Comprehensive analysis of transcriptional promoter structure and function in 1% of the

human genome." Genome Res 16(1): 1-10.

Costa, V., M. Aprile, R. Esposito and A. Ciccodicola (2013). "RNA-Seq and human complex diseases: recent accomplishments and future perspectives." Eur J Hum Genet

21(2): 134-142.

Crawford, E. L., A. Levin, F. Safi, M. Lu, A. Baugh, X. Zhang, J. Yeo, S. A.

Khuder, A. M. Boulos, P. Nana-Sinkam, P. P. Massion, D. A. Arenberg, D. Midthun, P.

J. Mazzone, S. D. Nathan, R. Wainz, G. Silvestri, J. Tita and J. C. Willey (2016). "Lung

cancer risk test trial: study design, participant baseline characteristics, bronchoscopy

safety, and establishment of a biospecimen repository." BMC Pulmonary Medicine

16(16).

Crawford, E. L., T. Blomquist, D. N. Mullins, Y. Yoon, D. R. Hernandez, M. Al-Bagdhadi, J. Ruiz, J. Hammersley and J. C. Willey (2007). "CEBPG regulates ERCC5/XPG expression in human bronchial epithelial cells and this regulation is modified by E2F1/YY1 interactions." Carcinogenesis 28(12): 2552-2559.

Crawford, E. L., S. A. Khuder, S. J. Durham, M. Frampton, M. Utell, W. G. Thilly, D. A.

Weaver, W. J. Ferencak, C. A. Jennings, J. R. Hammersley, D. A. Olson and J. C. Willey

(2000). "Normal bronchial epithelial cell expression of glutathione transferase P1, glutathione transferase M3, and glutathione peroxidase is low in subjects with bronchogenic carcinoma." Cancer Res 60(6): 1609-1618.

Crawford, E. L., G. J. Peters, P. Noordhuis, M. G. Rots, M. Vondracek, R. C. Grafstrom,

K. Lieuallen, G. Lennon, R. J. Zahorchak, M. J. Georgeson, A. Wali, J. F. Lechner, P. S.

Fan, M. B. Kahaleh, S. A. Khuder, K. A. Warner, D. A. Weaver and J. C. Willey (2001).

"Reproducible gene expression measurement among multiple laboratories obtained in a blinded study using standardized RT (StaRT)-PCR." Mol Diagn 6(4): 217-225.

Crawford, E. L., K. A. Warner, S. A. Khuder, R. J. Zahorchak and J. C. Willey (2002).

"Multiplex standardized RT-PCR for expression analysis of many genes in small samples." Biochem Biophys Res Commun 293(1): 509-516.

Crowley, J. J., V. Zhabotynsky, W. Sun, S. Huang, I. K. Pakatci, Y. Kim, J. R. Wang, A.

P. Morgan, J. D. Calaway, D. L. Aylor, Z. Yun, T. A. Bell, R. J. Buus, M. E. Calaway, J.

P. Didion, T. J. Gooch, S. D. Hansen, N. N. Robinson, G. D. Shaw, J. S. Spence, C. R.

Quackenbush, C. J. Barrick, R. J. Nonneman, K. Kim, J. Xenakis, Y. Xie, W. Valdar, A.

B. Lenarcic, W. Wang, C. E. Welsh, C. P. Fu, Z. Zhang, J. Holt, Z. Guo, D. W.

Threadgill, L. M. Tarantino, D. R. Miller, F. Zou, L. McMillan, P. F. Sullivan and F.

Pardo-Manuel de Villena (2015). "Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive allelic imbalance." Nat Genet 47(4): 353-360.

Dai, J., M. Zhu, C. Wang, W. Shen, W. Zhou, J. Sun, J. Liu, G. Jin, H. Ma, Z. Hu, D. Lin

and H. Shen (2015). "Systematical analyses of variants in CTCF-binding sites identified a novel lung cancer susceptibility locus among Chinese population." Sci Rep 5: 7833.

Daly, S., D. Rinewalt, C. Fhied, S. Basu, B. Mahon, M. J. Liptay, E. Hong, G.

Chmielewski, M. A. Yoder, P. N. Shah, E. S. Edell, F. Maldonado, A. O. Bungum and J.

A. Borgia (2013). "Development and validation of a plasma biomarker panel for

discerning clinical significance of indeterminate pulmonary nodules." J Thorac Oncol

8(1): 31-36.

Davidson, E. H., D. R. McClay and L. Hood (2003). "Regulatory gene networks and the

properties of the developmental process." Proc Natl Acad Sci U S A 100(4): 1475-1480.

de-Torres, J. P., D. O. Wilson, P. Sanchez-Salcedo, J. L. Weissfeld, J. Berto, A. Campo,

A. B. Alcaide, M. Garcia-Granero, B. R. Celli and J. J. Zulueta (2015). "Lung cancer in patients with chronic obstructive pulmonary disease. Development and validation of the

COPD Lung Cancer Screening Score." Am J Respir Crit Care Med 191(3): 285-291.

de la Chapelle, A. (2009). "Genetic predisposition to human disease: allele-specific

expression and low-penetrance regulatory loci." Oncogene 28(38): 3345-3348.

Dear, P. H. and P. R. Cook (1989). "Happy mapping: a proposal for linkage mapping the

human genome." Nucleic Acids Res 17(17): 6795-6807.

Deutsch, S., R. Lyle, E. T. Dermitzakis, H. Attar, L. Subrahmanyan, C. Gehrig, L.

Parand, M. Gagnebin, J. Rougemont, C. V. Jongeneel and S. E. Antonarakis (2005).

"Gene expression variation and expression quantitative trait mapping of human

chromosome 21 genes." Hum Mol Genet 14(23): 3741-3749.

Diaz, L. A. and A. Bardelli (2014). "Liquid biopsies: genotyping circulating tumor

DNA." Journal of Clinical Oncology 32(6): 579-586.

Didon, L., A. B. Roos, G. P. Elmberger, F. J. Gonzalez and M. Nord (2010). "Lung-specific inactivation of CCAAT/enhancer binding protein alpha causes a pathological pattern characteristic of COPD." Eur Respir J 35(1): 186-197.

Ding, C. and C. R. Cantor (2003). "A high-throughput gene expression analysis technique

using competitive PCR and matrix-assisted laser desorption ionization time-of-flight

MS." Proc Natl Acad Sci U S A 100(6): 3059-3064.

Djebali, S., C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J.

Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C.

Zaleski, J. Rozowsky, M. Roder, F. Kokocinski, R. F. Abdelhamid, T. Alioto, I.

Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen,

J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E.

Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D.

Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B.

King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D.

Robyr, M. Sammeth, L. Schaffer, L. H. See, A. Shahab, J. Skancke, A. M. Suzuki, H.

Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, J. Wrobel, Y. Yu, X. Ruan, Y.

Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Antonarakis, G.

Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guigo and T. R. Gingeras

(2012). "Landscape of transcription in human cells." Nature 489(7414): 101-108.

Dumur, C. I., S. Nasim, A. M. Best, K. J. Archer, A. C. Ladd, V. R. Mas, D. S.

Wilkinson, C. T. Garrett and A. Ferreira-Gonzalez (2004). "Evaluation of quality-control criteria for microarray gene expression analysis." Clin Chem 50(11): 1994-2002.

Eisner, M. D., L. Trupin, P. P. Katz, E. H. Yelin, G. Earnest, J. Balmes and P. D. Blanc

(2005). "Development and validation of a survey-based COPD severity score." Chest

127(6): 1890-1897.

Eissa, N. T. and S. C. Erzurum (2001). "Flexible bronchoscopy in molecular biology."

Clin Chest Med 22(2): 343-353, ix.

el-Deiry, W. S., J. W. Harper, P. M. O'Connor, V. E. Velculescu, C. E. Canman, J.

Jackman, J. A. Pietenpol, M. Burrell, D. E. Hill, Y. Wang, K. G. Wiman, W. E. Mercer,

M. B. Kastan, K. W. Kohn, S. J. Elledge, K. W. Kinzler and B. Vogelstein (1994).

"WAF1/CIP1 is induced in p53-mediated G1 arrest and apoptosis." Cancer Res 54(5):

1169-1174.

el-Deiry, W. S., T. Tokino, V. E. Velculescu, D. B. Levy, R. Parsons, J. M. Trent, D. Lin,

W. E. Mercer, K. W. Kinzler and B. Vogelstein (1993). "WAF1, a potential mediator of

p53 tumor suppression." Cell 75(4): 817-825.

Euser, A. M., C. Zoccali, K. J. Jager and F. W. Dekker (2009). "Cohort studies:

prospective versus retrospective." Nephron Clin Pract 113(3): c214-217.

Evans, T. G. (2015). "Considerations for the use of transcriptomics in identifying the

'genes that matter' for environmental adaptation." J Exp Biol 218(Pt 12): 1925-1935.

Excoffier, L. and M. Slatkin (1995). "Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population." Mol Biol Evol 12(5): 921-927.

Eymin, B., S. Gazzeri, C. Brambilla and E. Brambilla (2001). "Distinct pattern of E2F1

expression in human lung tumours: E2F1 is upregulated in small cell lung carcinoma."

Oncogene 20(14): 1678-1687.

Facciolongo, N., M. Patelli, S. Gasparini, L. Lazzari Agli, M. Salio, C. Simonassi, B. Del

Prato and P. Zanoni (2009). "Incidence of complications in bronchoscopy. Multicentre

prospective study of 20,986 bronchoscopies." Monaldi Arch Chest Dis 71(1): 8-14.

Fehrmann, R. S., R. C. Jansen, J. H. Veldink, H. J. Westra, D. Arends, M. J. Bonder, J.

Fu, P. Deelen, H. J. Groen, A. Smolonska, R. K. Weersma, R. M. Hofstra, W. A.

Buurman, S. Rensen, M. G. Wolfs, M. Platteel, A. Zhernakova, C. C. Elbers, E. M.

Festen, G. Trynka, M. H. Hofker, C. G. Saris, R. A. Ophoff, L. H. van den Berg, D. A.

van Heel, C. Wijmenga, G. J. Te Meerman and L. Franke (2011). "Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA." PLoS Genet 7(8): e1002197.

Ferre, F. (1992). "Quantitative or semi-quantitative PCR: reality versus myth." PCR

Methods Appl 2(1): 1-9.

Field, J. K., D. Baldwin, K. Brain, A. Devaraj, T. Eisen, S. W. Duffy, D. M. Hansell, K.

Kerr, R. Page, M. Parmar, D. Weller, P. Williamson, D. Whynes and UKLS Team (2011).

"CT screening for lung cancer in the UK: position statement by UKLS investigators

following the NLST report." Thorax 66(8): 736-737.

Fisher, D. A., J. T. Maple, T. Ben-Menachem, B. D. Cash, G. A. Decker, D. S. Early, J.

A. Evans, R. D. Fanelli, N. Fukami, J. H. Hwang, R. Jain, T. L. Jue, K. M. Khan, P. M.

Malpas, R. N. Sharaf, A. K. Shergill and J. A. Dominitz (2011). "Complications of

colonoscopy." Gastrointest Endosc 74(4): 745-752.

Forsberg, L., L. Lyrenas, U. de Faire and R. Morgenstern (2001). "A common functional

C-T substitution polymorphism in the promoter region of the human catalase gene

influences transcription factor binding, reporter gene transcription and is correlated to

blood catalase levels." Free Radic Biol Med 30(5): 500-505.

Frampton, G. M., A. Fichtenholtz, G. A. Otto, K. Wang, S. R. Downing, J. He, M.

Schnall-Levin, J. White, E. M. Sanford, P. An, J. Sun, F. Juhn, K. Brennan, K. Iwanik, A.

Maillet, J. Buell, E. White, M. Zhao, S. Balasubramanian, S. Terzic, T. Richards, V.

Banning, L. Garcia, K. Mahoney, Z. Zwirko, A. Donahue, H. Beltran, J. M. Mosquera,

M. A. Rubin, S. Dogan, C. V. Hedvat, M. F. Berger, L. Pusztai, M. Lechner, C. Boshoff,

M. Jarosz, C. Vietz, A. Parker, V. A. Miller, J. S. Ross, J. Curran, M. T. Cronin, P. J.

Stephens, D. Lipson and R. Yelensky (2013). "Development and validation of a clinical

cancer genomic profiling test based on massively parallel DNA sequencing." Nat

Biotechnol 31(11): 1023-1031.

Freeman, W. M., S. J. Walker and K. E. Vrana (1999). "Quantitative RT-PCR: pitfalls

and potential." Biotechniques 26(1): 112-122, 124-125.

Fu, G. K., W. Xu, J. Wilhelmy, M. N. Mindrinos, R. W. Davis, W. Xiao and S. P. Fodor

(2014). "Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations." Proc Natl Acad Sci U S A 111(5):

1891-1896.

Fung, J. N., S. J. Holdsworth-Carson, Y. Sapkota, Z. Z. Zhao, L. Jones, J. E. Girling, P.

Paiva, M. Healey, D. R. Nyholt, P. A. Rogers and G. W. Montgomery (2015).

"Functional evaluation of genetic variants associated with endometriosis near GREB1."

Hum Reprod 30(5): 1263-1275.

Gallegos Ruiz, M. I., K. Floor, P. Roepman, J. A. Rodriguez, G. A. Meijer, W. J. Mooi,

E. Jassem, J. Niklinski, T. Muley, N. van Zandwijk, E. F. Smit, K. Beebe, L. Neckers, B.

Ylstra and G. Giaccone (2008). "Integration of gene dosage and gene expression in non-

small cell lung cancer, identification of HSP90 as potential target." PLoS One 3(3):

e0001722.

Ganti, A. K. and J. L. Mulshine (2006). "Lung cancer screening." Oncologist 11(5): 481-

487.

Gargis, A. S., L. Kalman, M. W. Berry, D. P. Bick, D. P. Dimmock, T. Hambuch, F. Lu,

E. Lyon, K. V. Voelkerding, B. A. Zehnbauer, R. Agarwala, S. F. Bennett, B. Chen, E. L.

Chin, J. G. Compton, S. Das, D. H. Farkas, M. J. Ferber, B. H. Funke, M. R. Furtado, L.

M. Ganova-Raeva, U. Geigenmuller, S. J. Gunselman, M. R. Hegde, P. L. Johnson, A.

Kasarskis, S. Kulkarni, T. Lenk, C. S. Liu, M. Manion, T. A. Manolio, E. R. Mardis, J. D.

Merker, M. S. Rajeevan, M. G. Reese, H. L. Rehm, B. B. Simen, J. M. Yeakley, J. M.

Zook and I. M. Lubin (2012). "Assuring the quality of next-generation sequencing in clinical laboratory practice." Nat Biotechnol 30(11): 1033-1036.

Gargis, A. S., L. Kalman, D. P. Bick, C. da Silva, D. P. Dimmock, B. H. Funke, S.

Gowrisankar, M. R. Hegde, S. Kulkarni, C. E. Mason, R. Nagarajan, K. V. Voelkerding,

E. A. Worthey, N. Aziz, J. Barnes, S. F. Bennett, H. Bisht, D. M. Church, Z. Dimitrova,

S. R. Gargis, N. Hafez, T. Hambuch, F. C. Hyland, R. A. Luna, D. MacCannell, T. Mann,

M. R. McCluskey, T. K. McDaniel, L. M. Ganova-Raeva, H. L. Rehm, J. Reid, D. S.

Campo, R. B. Resnick, P. G. Ridge, M. L. Salit, P. Skums, L. J. Wong, B. A. Zehnbauer,

J. M. Zook and I. M. Lubin (2015). "Good laboratory practice for clinical next-generation sequencing informatics pipelines." Nat Biotechnol 33(7): 689-693.

Ge, B., D. K. Pokholok, T. Kwan, E. Grundberg, L. Morcos, D. J. Verlaan, J. Le, V.

Koka, K. C. Lam, V. Gagne, J. Dias, R. Hoberman, A. Montpetit, M. M. Joly, E. J.

Harvey, D. Sinnett, P. Beaulieu, R. Hamon, A. Graziani, K. Dewar, E. Harmsen, J.

Majewski, H. H. Goring, A. K. Naumova, M. Blanchette, K. L. Gunderson and T.

Pastinen (2009). "Global patterns of cis variation in human cells revealed by high-density allelic expression analysis." Nat Genet 41(11): 1216-1222.

Gebhardt, F., K. S. Zanker and B. Brandt (1999). "Modulation of epidermal growth factor receptor gene transcription by a polymorphic dinucleotide repeat in intron 1." J Biol

Chem 274(19): 13176-13180.

1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M.

Durbin, R. A. Gibbs, M. E. Hurles and G. A. McVean (2010). "A map of human genome variation from population-scale sequencing." Nature 467(7319): 1061-1073.

Germer, S. and R. Higuchi (1999). "Single-tube genotyping without oligonucleotide probes." Genome Res 9(1): 72-78.

Gibson, G., J. E. Powell and U. M. Marigorta (2015). "Expression quantitative trait locus analysis for translational medicine." Genome Med 7(1): 60.

Gibson, G. and B. Weir (2005). "The quantitative genetics of transcription." Trends

Genet 21(11): 616-623.

Gililland, J. L., Y. C. Tseng, V. Troche, S. Lahiri and L. Wartofsky (1992). "Atrial natriuretic peptide receptors in human endometrial stromal cells." J Clin Endocrinol

Metab 75(2): 547-551.

Gilliland, G., S. Perrin, K. Blanchard and H. F. Bunn (1990). "Analysis of cytokine mRNA and DNA: detection and quantitation by competitive polymerase chain reaction."

Proc Natl Acad Sci U S A 87(7): 2725-2729.

Global Lipids Genetics Consortium, C. J. Willer, E. M. Schmidt, S. Sengupta, G. M. Peloso, S.

Gustafsson, S. Kanoni, A. Ganna, J. Chen, M. L. Buchkovich, S. Mora, J. S. Beckmann,

J. L. Bragg-Gresham, H. Y. Chang, A. Demirkan, H. M. Den Hertog, R. Do, L. A.

Donnelly, G. B. Ehret, T. Esko, M. F. Feitosa, T. Ferreira, K. , P. Fontanillas, R.

M. Fraser, D. F. Freitag, D. Gurdasani, K. Heikkila, E. Hypponen, A. Isaacs, A. U.

Jackson, A. Johansson, T. Johnson, M. Kaakinen, J. Kettunen, M. E. Kleber, X. Li, J.

Luan, L. P. Lyytikainen, P. K. Magnusson, M. Mangino, E. Mihailov, M. E. Montasser,

M. Muller-Nurasyid, I. M. Nolte, J. R. O'Connell, C. D. Palmer, M. Perola, A. K.

Petersen, S. Sanna, R. Saxena, S. K. Service, S. Shah, D. Shungin, C. Sidore, C. Song, R.

J. Strawbridge, I. Surakka, T. Tanaka, T. M. Teslovich, G. Thorleifsson, E. G. Van den

Herik, B. F. Voight, K. A. Volcik, L. L. Waite, A. Wong, Y. Wu, W. Zhang, D. Absher,

G. Asiki, I. Barroso, L. F. Been, J. L. Bolton, L. L. Bonnycastle, P. Brambilla, M. S.

Burnett, G. Cesana, M. Dimitriou, A. S. Doney, A. Doring, P. Elliott, S. E. Epstein, G. I.

Eyjolfsson, B. Gigante, M. O. Goodarzi, H. Grallert, M. L. Gravito, C. J. Groves, G.

Hallmans, A. L. Hartikainen, C. Hayward, D. Hernandez, A. A. Hicks, H. Holm, Y. J.

Hung, T. Illig, M. R. Jones, P. Kaleebu, J. J. Kastelein, K. T. Khaw, E. Kim, N. Klopp, P.

Komulainen, M. Kumari, C. Langenberg, T. Lehtimaki, S. Y. Lin, J. Lindstrom, R. J.

Loos, F. Mach, W. L. McArdle, C. Meisinger, B. D. Mitchell, G. Muller, R. Nagaraja, N.

Narisu, T. V. Nieminen, R. N. Nsubuga, I. Olafsson, K. K. Ong, A. Palotie, T.

Papamarkou, C. Pomilla, A. Pouta, D. J. Rader, M. P. Reilly, P. M. Ridker, F.

Rivadeneira, I. Rudan, A. Ruokonen, N. Samani, H. Scharnagl, J. Seeley, K. Silander, A.

Stancakova, K. Stirrups, A. J. Swift, L. Tiret, A. G. Uitterlinden, L. J. van Pelt, S.

Vedantam, N. Wainwright, C. Wijmenga, S. H. Wild, G. Willemsen, T. Wilsgaard, J. F.

Wilson, E. H. Young, J. H. Zhao, L. S. Adair, D. Arveiler, T. L. Assimes, S. Bandinelli,

F. Bennett, M. Bochud, B. O. Boehm, D. I. Boomsma, I. B. Borecki, S. R. Bornstein, P.

Bovet, M. Burnier, H. Campbell, A. Chakravarti, J. C. Chambers, Y. D. Chen, F. S.

Collins, R. S. Cooper, J. Danesh, G. Dedoussis, U. de Faire, A. B. Feranil, J. Ferrieres, L.

Ferrucci, N. B. Freimer, C. Gieger, L. C. Groop, V. Gudnason, U. Gyllensten, A.

Hamsten, T. B. Harris, A. Hingorani, J. N. Hirschhorn, A. Hofman, G. K. Hovingh, C. A.

Hsiung, S. E. Humphries, S. C. Hunt, K. Hveem, C. Iribarren, M. R. Jarvelin, A. Jula, M.

Kahonen, J. Kaprio, A. Kesaniemi, M. Kivimaki, J. S. Kooner, P. J. Koudstaal, R. M.

Krauss, D. Kuh, J. Kuusisto, K. O. Kyvik, M. Laakso, T. A. Lakka, L. Lind, C. M.

Lindgren, N. G. Martin, W. Marz, M. I. McCarthy, C. A. McKenzie, P. Meneton, A.

Metspalu, L. Moilanen, A. D. Morris, P. B. Munroe, I. Njolstad, N. L. Pedersen, C.

Power, P. P. Pramstaller, J. F. Price, B. M. Psaty, T. Quertermous, R. Rauramaa, D.

Saleheen, V. Salomaa, D. K. Sanghera, J. Saramies, P. E. Schwarz, W. H. Sheu, A. R.

Shuldiner, A. Siegbahn, T. D. Spector, K. Stefansson, D. P. Strachan, B. O. Tayo, E.

Tremoli, J. Tuomilehto, M. Uusitupa, C. M. van Duijn, P. Vollenweider, L. Wallentin, N.

J. Wareham, J. B. Whitfield, B. H. Wolffenbuttel, J. M. Ordovas, E. Boerwinkle, C. N.

Palmer, U. Thorsteinsdottir, D. I. Chasman, J. I. Rotter, P. W. Franks, S. Ripatti, L. A.

Cupples, M. S. Sandhu, S. S. Rich, M. Boehnke, P. Deloukas, S. Kathiresan, K. L.

Mohlke, E. Ingelsson and G. R. Abecasis (2013). "Discovery and refinement of loci associated with lipid levels." Nat Genet 45(11): 1274-1283.

Gower, A. C., K. Steiling, J. F. Brothers, 2nd, M. E. Lenburg and A. Spira (2011).

"Transcriptomic studies of the airway field of injury associated with smoking-related lung disease." Proc Am Thorac Soc 8(2): 173-179.

Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X.

Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A.

Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N.

Friedman and A. Regev (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nat Biotechnol 29(7): 644-652.

Grundberg, E., K. S. Small, A. K. Hedman, A. C. Nica, A. Buil, S. Keildson, J. T. Bell,

T. P. Yang, E. Meduri, A. Barrett, J. Nisbett, M. Sekowska, A. Wilk, S. Y. Shin, D.

Glass, M. Travers, J. L. Min, S. Ring, K. Ho, G. Thorleifsson, A. Kong, U.

Thorsteindottir, C. Ainali, A. S. Dimas, N. Hassanali, C. Ingle, D. Knowles, M.

Krestyaninova, C. E. Lowe, P. Di Meglio, S. B. Montgomery, L. Parts, S. Potter, G.

Surdulescu, L. Tsaprouni, S. Tsoka, V. Bataille, R. Durbin, F. O. Nestle, S. O'Rahilly, N.

Soranzo, C. M. Lindgren, K. T. Zondervan, K. R. Ahmadi, E. E. Schadt, K. Stefansson,

G. D. Smith, M. I. McCarthy, P. Deloukas, E. T. Dermitzakis, T. D. Spector and

Multiple Tissue Human Expression Resource (MuTHER) Consortium (2012). "Mapping cis- and trans-regulatory effects across multiple tissues in twins." Nat Genet 44(10): 1084-1089.

Gry, M., R. Rimini, S. Stromberg, A. Asplund, F. Ponten, M. Uhlen and P. Nilsson

(2009). "Correlations between RNA and protein expression profiles in 23 human cell

lines." BMC Genomics 10: 365.

Guindalini, C. and R. Pellegrino (2016). Gene Expression Studies Using Microarrays.

Rodent Model as Tools in Ethical Biomedical Research. L. M. Andersen and S. Tufik.

Cham, Springer International Publishing: 203-216.

Gusev, A., S. H. Lee, G. Trynka, H. Finucane, B. J. Vilhjalmsson, H. Xu, C. Zang, S.

Ripke, B. Bulik-Sullivan, E. Stahl, Schizophrenia Working Group of the Psychiatric

Genomics Consortium, SWE-SCZ Consortium, A. K. Kahler, C. M. Hultman, S. M. Purcell, S. A.

McCarroll, M. Daly, B. Pasaniuc, P. F. Sullivan, B. M. Neale, N. R. Wray, S.

Raychaudhuri, A. L. Price, Schizophrenia Working Group of the Psychiatric

Genomics Consortium and SWE-SCZ Consortium (2014). "Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases." Am J Hum Genet 95(5): 535-552.

Haas, B. J., M. Chin, C. Nusbaum, B. W. Birren and J. Livny (2012). "How deep is deep

enough for RNA-Seq profiling of bacterial transcriptomes?" BMC Genomics 13: 734.

Halvardson, J., A. Zaghlool and L. Feuk (2013). "Exome RNA sequencing reveals rare

and novel alternative transcripts." Nucleic Acids Res 41(1): e6.

Hao, B., X. Miao, Y. Li, X. Zhang, T. Sun, G. Liang, Y. Zhao, Y. Zhou, H. Wang, X.

Chen, L. Zhang, W. Tan, Q. Wei, D. Lin and F. He (2006). "A novel T-77C

polymorphism in DNA repair gene XRCC1 contributes to diminished promoter activity

and increased risk of non-small cell lung cancer." Oncogene 25(25): 3613-3620.

Harr, M. W., T. G. Graves, E. L. Crawford, K. A. Warner, C. A. Reed and J. C. Willey

(2005). "Variation in transcriptional regulation of cyclin dependent kinase inhibitor

p21waf1/cip1 among human bronchogenic carcinomas." Mol Cancer 4: 23.

Hattotuwa, K., E. A. Gamble, T. O’Shaughnessy, P. K. Jeffery and N. C. Barnes (2002).

"Safety of bronchoscopy, biopsy, and BAL in research patients with COPD." Chest 122(6):

1909-1912.

Hayashi, G., M. Hagihara and K. Nakatani (2008). "Genotyping by allele-specific L-

DNA-tagged PCR." J Biotechnol 135(2): 157-160.

Hayashi, S., J. Watanabe and K. Kawajiri (1991). "Genetic polymorphisms in the 5'-

flanking region change transcriptional regulation of the human cytochrome P450IIE1

gene." J Biochem 110(4): 559-565.

He, C., J. Holme and J. Anthony (2014). "SNP genotyping: the KASP assay." Methods

Mol Biol 1145: 75-86.

He, J., L. X. Qiu, M. Y. Wang, R. X. Hua, R. X. Zhang, H. P. Yu, Y. N. Wang, M. H.

Sun, X. Y. Zhou, Y. J. Yang, J. C. Wang, L. Jin, Q. Y. Wei and J. Li (2012).

"Polymorphisms in the XPG gene and risk of gastric cancer in Chinese populations."

Hum Genet 131(7): 1235-1244.

Heid, C. A., J. Stevens, K. J. Livak and P. M. Williams (1996). "Real time quantitative

PCR." Genome Res 6(10): 986-994.

Henley, W. N., K. E. Schuebel and D. A. Nielsen (1996). "Limitations imposed by

heteroduplex formation on quantitative RT-PCR." Biochem Biophys Res Commun

226(1): 113-117.

Henrichsen, C. N., N. Vinckenbosch, S. Zollner, E. Chaignat, S. Pradervand, F. Schutz,

M. Ruedi, H. Kaessmann and A. Reymond (2009). "Segmental copy number variation shapes tissue transcriptomes." Nat Genet 41(4): 424-429.

Higgins, G., K. M. Roper, I. J. Watson, F. H. Blackhall, W. N. Rom, H. I. Pass, J. F.

Ainscough and D. Coverley (2012). "Variant Ciz1 is a circulating biomarker for early-

stage lung cancer." Proc Natl Acad Sci U S A 109(45): E3128-3135.

Higgs, D. R., D. Vernimmen, J. Hughes and R. Gibbons (2007). "Using genomics to

study how chromatin influences gene expression." Annu Rev Genomics Hum Genet 8:

299-325.

Holland, P. M., R. D. Abramson, R. Watson and D. H. Gelfand (1991). "Detection of

specific polymerase chain reaction product by utilizing the 5'→3' exonuclease activity of

Thermus aquaticus DNA polymerase." Proc Natl Acad Sci U S A 88(16): 7276-7280.

Hoogendoorn, B., S. L. Coleman, C. A. Guy, K. Smith, T. Bowen, P. R. Buckland and M.

C. O'Donovan (2003). "Functional analysis of human promoter polymorphisms." Hum

Mol Genet 12(18): 2249-2254.

Horn, M., R. Baumann, J. A. Pereira, P. N. Sidiropoulos, C. Somandin, H. Welzl, C.

Stendel, T. Luhmann, C. Wessig, K. V. Toyka, J. B. Relvas, J. Senderek and U. Suter

(2012). "Myelin is dependent on the Charcot-Marie-Tooth Type 4H disease culprit protein FRABIN/FGD4 in Schwann cells." Brain 135(Pt 12): 3567-3583.

Howie, B. N., P. Donnelly and J. Marchini (2009). "A flexible and accurate genotype

imputation method for the next generation of genome-wide association studies." PLoS

Genet 5(6): e1000529.

Hu, J., Y. Mao, D. Dryer, K. White and Canadian Cancer Registries Epidemiology Research Group (2002). "Risk factors for

lung cancer among Canadian women who have never smoked." Cancer detection and

prevention 26(2): 129-138.

Huang, R., M. Jaritz, P. Guenzl, I. Vlatkovic, A. Sommer, I. M. Tamir, H. Marks, T.

Klampfl, R. Kralovics, H. G. Stunnenberg, D. P. Barlow and F. M. Pauler (2011). "An

RNA-Seq strategy to detect the complete coding and non-coding transcriptome including

full-length imprinted macro ncRNAs." PLoS One 6(11): e27288.

Huggett, J. F., T. Novak, J. A. Garson, C. Green, S. D. Morris-Jones, R. F. Miller and A.

Zumla (2008). "Differential susceptibility of PCR reactions to inhibitors: an important and unrecognised phenomenon." BMC Res Notes 1: 70.

Humphrey, L. L., M. Deffebach, M. Pappas, C. Baumann, K. Artis, J. P. Mitchell, B.

Zakher, R. Fu and C. G. Slatore (2013). "Screening for lung cancer with low-dose

computed tomography: a systematic review to update the U.S. Preventive Services Task

Force recommendation." Ann Intern Med 159(6): 411-420.

Hung, R. J., J. D. McKay, V. Gaborieau, P. Boffetta, M. Hashibe, D. Zaridze, A.

Mukeria, N. Szeszenia-Dabrowska, J. Lissowska, P. Rudnai, E. Fabianova, D. Mates, V.

Bencko, L. Foretova, V. Janout, C. Chen, G. Goodman, J. K. Field, T. Liloglou, G.

Xinarianos, A. Cassidy, J. McLaughlin, G. Liu, S. Narod, H. E. Krokan, F. Skorpen, M.

B. Elvestad, K. Hveem, L. Vatten, J. Linseisen, F. Clavel-Chapelon, P. Vineis, H. B.

Bueno-de-Mesquita, E. Lund, C. Martinez, S. Bingham, T. Rasmuson, P. Hainaut, E.

Riboli, W. Ahrens, S. Benhamou, P. Lagiou, D. Trichopoulos, I. Holcatova, F. Merletti,

K. Kjaerheim, A. Agudo, G. Macfarlane, R. Talamini, L. Simonato, R. Lowry, D. I.

Conway, A. Znaor, C. Healy, D. Zelenika, A. Boland, M. Delepine, M. Foglio, D.

Lechner, F. Matsuda, H. Blanche, I. Gut, S. Heath, M. Lathrop and P. Brennan (2008).

"A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit

genes on 15q25." Nature 452(7187): 633-637.

Hutchinson, J. N., T. Raj, J. Fagerness, E. Stahl, F. T. Viloria, A. Gimelbrant, J. Seddon,

M. Daly, A. Chess and R. Plenge (2014). "Allele-specific methylation occurs at genetic

variants associated with complex disease." PLoS One 9(6): e98464.

Illumina "Sequencing Systems." http://www.illumina.com/systems/sequencing.html.

Illumina. (2016). "Kits for human genotyping applications." from

http://www.illumina.com/techniques/microarrays/human-genotyping/human-genotyping-arrays.html.

Irshad, S. and M. Maryum (2012). "Genetic Susceptibility and Risk Factors Associated with Familial Lung Cancer."

Jabara, C. B., C. D. Jones, J. Roach, J. A. Anderson and R. Swanstrom (2011). "Accurate

sampling and deep sequencing of the HIV-1 protease gene using a Primer ID." Proc Natl

Acad Sci U S A 108(50): 20166-20171.

Jaenisch, R. and A. Bird (2003). "Epigenetic regulation of gene expression: how the

genome integrates intrinsic and environmental signals." Nat Genet 33 Suppl: 245-254.

Jiang, L., D. Willner, P. Danoy, H. Xu and M. A. Brown (2013). "Comparison of the

performance of two commercial genome-wide association study genotyping platforms in

Han Chinese samples." G3 (Bethesda) 3(1): 23-29.

Jin, F., D. Mu, D. Chu, E. Fu, Y. Xie and T. Liu (2008). "Severe complications of

bronchoscopy." Respiration 76(4): 429-433.

Kadara, H. and I. I. Wistuba (2012). "Field cancerization in non-small cell lung cancer:

implications in disease pathogenesis." Proc Am Thorac Soc 9(2): 38-42.

Kaisho, T., H. Tsutsui, T. Tanaka, T. Tsujimura, K. Takeda, T. Kawai, N. Yoshida, K.

Nakanishi and S. Akira (1999). "Impairment of natural killer cytotoxic activity and

interferon gamma production in CCAAT/enhancer binding protein gamma-deficient

mice." J Exp Med 190(11): 1573-1582.

Kang, S., Y. Ma, C. Liu, C. Cao, Hanbateer, J. Qi, J. Li and X. Wu (2015). "Association

of XRCC1 gene polymorphisms with risk of non-small cell lung cancer." Int J Clin Exp

Pathol 8(4): 4171-4176.

Kannan, K., N. Amariglio, G. Rechavi and D. Givol (2000). "Profile of gene expression

regulated by induced p53: connection to the TGF-beta family." FEBS Lett 470(1): 77-82.

Kaper, F., S. Swamy, B. Klotzle, S. Munchel, J. Cottrell, M. Bibikova, H. Y. Chuang, S.

Kruglyak, M. Ronaghi, M. A. Eberle and J. B. Fan (2013). "Whole-genome haplotyping by dilution, amplification, and sequencing." Proc Natl Acad Sci U S A 110(14): 5552-

5557.

Keating, B. J., S. Tischfield, S. S. Murray, T. Bhangale, T. S. Price, J. T. Glessner, L.

Galver, J. C. Barrett, S. F. Grant, D. N. Farlow, H. R. Chandrupatla, M. Hansen, S.

Ajmal, G. J. Papanicolaou, Y. Guo, M. Li, S. Derohannessian, P. I. de Bakker, S. D.

Bailey, A. Montpetit, A. C. Edmondson, K. Taylor, X. Gai, S. S. Wang, M. Fornage, T.

Shaikh, L. Groop, M. Boehnke, A. S. Hall, A. T. Hattersley, E. Frackelton, N. Patterson,

C. W. Chiang, C. E. Kim, R. R. Fabsitz, W. Ouwehand, A. L. Price, P. Munroe, M.

Caulfield, T. Drake, E. Boerwinkle, D. Reich, A. S. Whitehead, T. P. Cappola, N. J.

Samani, A. J. Lusis, E. Schadt, J. G. Wilson, W. Koenig, M. I. McCarthy, S. Kathiresan,

S. B. Gabriel, H. Hakonarson, S. S. Anand, M. Reilly, J. C. Engert, D. A. Nickerson, D.

J. Rader, J. N. Hirschhorn and G. A. Fitzgerald (2008). "Concept, design and

implementation of a cardiovascular gene-centric 50 k SNP array for large-scale genomic association studies." PLoS One 3(10): e3583.

Keedy, V. L., S. Temin, M. R. Somerfield, M. B. Beasley, D. H. Johnson, L. M.

McShane, D. T. Milton, J. R. Strawn, H. A. Wakelee and G. Giaccone (2011). "American

Society of Clinical Oncology provisional clinical opinion: epidermal growth factor

receptor (EGFR) Mutation testing for patients with advanced non-small-cell lung cancer

considering first-line EGFR tyrosine kinase inhibitor therapy." J Clin Oncol 29(15):

2121-2127.

Kim, V., M. Oros, H. Durra, S. Kelsen, M. Aksoy, W. D. Cornwell, T. J. Rogers and G. J.

Criner (2015). "Chronic Bronchitis and Current Smoking Are Associated with More

Goblet Cells in Moderate to Severe COPD and Smokers without Airflow Obstruction."

PLoS ONE 10(2): e0116108.

Kinde, I., J. Wu, N. Papadopoulos, K. W. Kinzler and B. Vogelstein (2011). "Detection

and quantification of rare mutations with massively parallel sequencing." Proc Natl Acad

Sci U S A 108(23): 9530-9535.

Kirsten, H., H. Al-Hasani, L. Holdt, A. Gross, F. Beutner, K. Krohn, K. Horn, P. Ahnert,

R. Burkhardt, K. Reiche, J. Hackermuller, M. Loffler, D. Teupser, J. Thiery and M.

Scholz (2015). "Dissecting the genetics of the human transcriptome identifies novel trait-

related trans-eQTLs and corroborates the regulatory relevance of non-protein coding

loci." Hum Mol Genet 24(16): 4746-4763.

Kitzman, J. O., A. P. Mackenzie, A. Adey, J. B. Hiatt, R. P. Patwardhan, P. H. Sudmant,

S. B. Ng, C. Alkan, R. Qiu, E. E. Eichler and J. Shendure (2011). "Haplotype-resolved genome sequencing of a Gujarati Indian individual." Nat Biotechnol 29(1): 59-63.

Knight, J. (2012). "Resolving the variable genome and epigenome in human disease."

Journal of internal medicine 271(4): 379-391.

Knight, J. C. (2004). "Allele-specific gene expression uncovered." Trends Genet 20(3):

113-116.

Knight, J. C., B. J. Keating, K. A. Rockett and D. P. Kwiatkowski (2003). "In vivo

characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading." Nat Genet 33(4): 469-475.

Ko, C. W., S. Riffle, L. Michaels, C. Morris, J. Holub, J. A. Shapiro, M. A. Ciol, M. B.

Kimmey, L. C. Seeff and D. Lieberman (2010). "Serious complications within 30 days of

screening and surveillance colonoscopy are uncommon." Clin Gastroenterol Hepatol

8(2): 166-173.

Korpanty, G. J., D. M. Graham, M. D. Vincent and N. B. Leighl (2014). "Biomarkers

That Currently Affect Clinical Practice in Lung Cancer: EGFR, ALK, MET, ROS-1, and

KRAS." Front Oncol 4: 204.

Kovalchik, S. A., M. Tammemagi, C. D. Berg, N. E. Caporaso, T. L. Riley, M. Korch, G.

A. Silvestri, A. K. Chaturvedi and H. A. Katki (2013). "Targeting of low-dose CT

screening according to the risk of lung-cancer death." N Engl J Med 369(3): 245-254.

Kranaster, R., P. Ketzer and A. Marx (2008). "Mutant DNA polymerase for improved

detection of single-nucleotide variations in microarrayed primer extension."

Chembiochem 9(5): 694-697.

Kuleshov, V., D. Xie, R. Chen, D. Pushkarev, Z. Ma, T. Blauwkamp, M. Kertesz and M.

Snyder (2014). "Whole-genome haplotyping using long reads and statistical methods."

Nat Biotechnol 32(3): 261-266.

Kunkel, T. A. (2004). "DNA replication fidelity." J Biol Chem 279(17): 16895-16898.

Lam, T. H., M. Z. Tay, B. Wang, Z. Xiao and E. C. Ren (2015). "Intrahaplotypic Variants

Differentiate Complex Linkage Disequilibrium within Human MHC Haplotypes." Sci

Rep 5: 16972.

Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2."

Nat Methods 9(4): 357-359.

Lappalainen, T. (2015). "Functional genomics bridges the gap between quantitative genetics and molecular biology." Genome Res 25(10): 1427-1431.

Levin, J. Z., M. F. Berger, X. Adiconis, P. Rogov, A. Melnikov, T. Fennell, C. Nusbaum,

L. A. Garraway and A. Gnirke (2009). "Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts."

Genome Biol 10(10): R115.

Lewallen, S. and P. Courtright (1998). "Epidemiology in practice: case-control studies."

Community Eye Health 11(28): 57-58.

Lewis, C. M. and J. Knight (2012). "Introduction to Genetic Association Studies." Cold

Spring Harbor Protocols 2012(3): pdb.top068163.

Liew, M., R. Pryor, R. Palais, C. Meadows, M. Erali, E. Lyon and C. Wittwer (2004).

"Genotyping of single-nucleotide polymorphisms by high-resolution melting of small amplicons." Clin Chem 50(7): 1156-1164.

Litviakov, N. V., M. B. Freidin, A. E. Sazonov, M. V. Khalyuzova, M. A. Buldakov, M.

S. Karbyshev, E. N. Albakh, D. S. Isubakova, A. A. Gagarin, G. B.

Nekrasov, E. B. Mironova, A. S. Izosimov, R. M. Takhauov and A. B. Karpov

(2015). "Different patterns of allelic imbalance in sporadic tumors and tumors

associated with long-term exposure to gamma-radiation." Mutat Res Genet Toxicol

Environ Mutagen 794: 8-16.

Liu, C., F. Zhang, T. Li, M. Lu, L. Wang, W. Yue and D. Zhang (2012). "MirSNP, a

database of polymorphisms altering miRNA target sites, identifies miRNA-related SNPs in GWAS SNPs and eQTLs." BMC Genomics 13: 661.

Livak, K. J. and T. D. Schmittgen (2001). "Analysis of relative gene expression data

using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method." Methods 25(4):

402-408.

Lo Tam Loi, A. T., S. J. M. Hoonhorst, L. Franciosi, R. Bischoff, R. F. Hoffmann, I.

Heijink, A. J. M. van Oosterhout, H. M. Boezen, W. Timens, D. S. Postma, J.-W.

Lammers, L. Koenderman and N. H. T. ten Hacken (2013). "Acute and chronic

inflammatory responses induced by smoking in individuals susceptible and non-

susceptible to development of COPD: from specific disease phenotyping towards novel

therapy. Protocol of a cross-sectional study." BMJ Open 3(2).

Lock, M. and G. Rodrigues (2007). "Computed tomographic screening for lung cancer."

Can Fam Physician 53(8): 1334-1336.

Lohmueller, K. E., C. L. Pearce, M. Pike, E. S. Lander and J. N. Hirschhorn (2003).

"Meta-analysis of genetic association studies supports a contribution of common variants

to susceptibility to common disease." Nat Genet 33(2): 177-182.

Looney, S. W. and J. Hagan (2015). Analysis of Biomarker Data: A Practical Guide.

Maier, T., M. Guell and L. Serrano (2009). "Correlation of mRNA and protein in

complex biological samples." FEBS Lett 583(24): 3966-3973.

Mamanova, L., R. M. Andrews, K. D. James, E. M. Sheridan, P. D. Ellis, C. F. Langford,

T. W. Ost, J. E. Collins and D. J. Turner (2010). "FRT-seq: amplification-free, strand-specific transcriptome sequencing." Nat Methods 7(2): 130-132.

Mannino, D. M., S. M. Aguayo, T. L. Petty and S. C. Redd (2003). "Low lung function

and incident lung cancer in the United States: data From the First National Health and

Nutrition Examination Survey follow-up." Arch Intern Med 163(12): 1475-1480.

Marchini, J., D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S.

Qin, H. M. Munro, G. R. Abecasis, P. Donnelly and International HapMap Consortium (2006). "A

comparison of phasing algorithms for trios and unrelated individuals." Am J Hum Genet

78(3): 437-450.

Marinescu, V. D., I. S. Kohane and A. Riva (2005). "The MAPPER database: a multi-

genome catalog of putative transcription factor binding sites." Nucleic Acids Res

33(Database issue): D91-97.

Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens and Y. Gilad (2008). "RNA-seq:

an assessment of technical reproducibility and comparison with gene expression arrays."

Genome Res 18(9): 1509-1517.

Marshall, H. M., R. V. Bowman, I. A. Yang, K. M. Fong and C. D. Berg (2013).

"Screening for lung cancer with low-dose computed tomography: a review of current status." J Thorac Dis 5 Suppl 5: S524-539.

Matakidou, A., R. el Galta, E. L. Webb, M. F. Rudd, H. Bridle, T. Eisen and R. S.

Houlston (2007). "Genetic variation in the DNA repair genes is predictive of outcome in

lung cancer." Hum Mol Genet 16(19): 2333-2340.

Matera, I., M. Musso, P. Griseri, M. Rusmini, M. Di Duca, M. T. So, D. Mavilio, X.

Miao, P. H. Tam, R. Ravazzolo, I. Ceccherini and M. Garcia-Barcelo (2013). "Allele-specific expression at the RET locus in blood and gut tissue of individuals carrying risk alleles for Hirschsprung disease." Hum Mutat 34(5): 754-762.

Mathiaux, J., V. Le Morvan, M. Pulido, J. Jougon, H. Begueret and J. Robert (2011).

"Role of DNA repair gene polymorphisms in the efficiency of platinum-based adjuvant chemotherapy for non-small cell lung cancer." Mol Diagn Ther 15(3): 159-166.

Mattson, M. E., E. S. Pollack and J. W. Cullen (1987). "What are the odds that smoking will kill you?" Am J Public Health 77(4): 425-431.

Maurano, M. T., R. Humbert, E. Rynes, R. E. Thurman, E. Haugen, H. Wang, A. P.

Reynolds, R. Sandstrom, H. Qu, J. Brody, A. Shafer, F. Neri, K. Lee, T. Kutyavin, S.

Stehling-Sun, A. K. Johnson, T. K. Canfield, E. Giste, M. Diegel, D. Bates, R. S. Hansen,

S. Neph, P. J. Sabo, S. Heimfeld, A. Raubitschek, S. Ziegler, C. Cotsapas, N.

Sotoodehnia, I. Glass, S. R. Sunyaev, R. Kaul and J. A. Stamatoyannopoulos (2012).

"Systematic localization of common disease-associated variation in regulatory DNA."

Science 337(6099): 1190-1195.

Mayne, S. T., J. Buenconsejo and D. T. Janerich (1999). "Previous lung disease and risk of lung cancer among men and women nonsmokers." Am J Epidemiol 149(1): 13-20.

Mazutis, L., J. Gilbert, W. L. Ung, D. A. Weitz, A. D. Griffiths and J. A. Heyman (2013).

"Single-cell analysis and sorting using droplet-based microfluidics." Nature protocols

8(5): 870-891.

McCarthy, M. I., G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. Ioannidis and J. N. Hirschhorn (2008). "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nat Rev Genet 9(5): 356-369.

Mehan, M. R., S. A. Williams, J. M. Siegfried, W. L. Bigbee, J. L. Weissfeld, D. O.

Wilson, H. I. Pass, W. N. Rom, T. Muley, M. Meister, W. Franklin, Y. E. Miller, E. N.

Brody and R. M. Ostroff (2014). "Validation of a blood protein signature for non-small cell lung cancer." Clin Proteomics 11(1): 32.

Mehrabian, M., H. Allayee, J. Stockton, P. Y. Lum, T. A. Drake, L. W. Castellani, M.

Suh, C. Armour, S. Edwards, J. Lamb, A. J. Lusis and E. E. Schadt (2005). "Integrating genotypic and expression data in a segregating mouse population to identify 5-lipoxygenase as a susceptibility gene for obesity and bone traits." Nat Genet 37(11):

1224-1233.

Mei, R., P. C. Galipeau, C. Prass, A. Berno, G. Ghandour, N. Patil, R. K. Wolff, M. S.

Chee, B. J. Reid and D. J. Lockhart (2000). "Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays." Genome Res 10(8): 1126-1137.

Melnikov, A., A. Murugan, X. Zhang, T. Tesileanu, L. Wang, P. Rogov, S. Feizi, A.

Gnirke, C. G. Callan, Jr., J. B. Kinney, M. Kellis, E. S. Lander and T. S. Mikkelsen

(2012). "Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay." Nat Biotechnol 30(3): 271-277.

Mercer, T. R., D. J. Gerhardt, M. E. Dinger, J. Crawford, C. Trapnell, J. A. Jeddeloh, J. S.

Mattick and J. L. Rinn (2012). "Targeted RNA sequencing reveals the deep complexity of the human transcriptome." Nat Biotechnol 30(1): 99-104.

Miele, A. and J. Dekker (2009). "Mapping cis- and trans- chromatin interaction networks using chromosome conformation capture (3C)." Methods Mol Biol 464: 105-121.

Milbury, C. A., J. Li and G. M. Makrigiorgos (2009). "PCR-based methods for the

enrichment of minority alleles and mutations." Clinical chemistry 55(4): 632-640.

Mok, T. S. (2011). "Personalized medicine in lung cancer: what we need to know." Nat

Rev Clin Oncol 8(11): 661-668.

Monks, S. A., A. Leonardson, H. Zhu, P. Cundiff, P. Pietrusiak, S. Edwards, J. W.

Phillips, A. Sachs and E. E. Schadt (2004). "Genetic inheritance of gene expression in

human cell lines." Am J Hum Genet 75(6): 1094-1105.

Morley, M., C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens, R. S. Spielman and

V. G. Cheung (2004). "Genetic analysis of genome-wide variation in human gene expression." Nature 430(7001): 743-747.

Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer and B. Wold (2008). "Mapping

and quantifying mammalian transcriptomes by RNA-Seq." Nat Methods 5(7): 621-628.

Moyer, V. A. and U.S. Preventive Services Task Force (2014). "Screening for lung cancer: U.S.

Preventive Services Task Force recommendation statement." Ann Intern Med 160(5):

330-338.

Muir, W., S. Perumbakkam, A. Black-Pyrkosz, J. Dunn and H. Cheng (2014). Allele-Specific Expression, a New Genomics Tool for Development of Value-Added SNP chips and to Fine Map QTL. In: 10th World Congress of Genetics Applied to Livestock Production. Vancouver, Canada.

Mullins, D. N., E. L. Crawford, S. A. Khuder, D. A. Hernandez, Y. Yoon and J. C.

Willey (2005). "CEBPG transcription factor correlates with antioxidant and DNA repair

genes in normal bronchial epithelial cells but not in individuals with bronchogenic

carcinoma." BMC Cancer 5: 141.

Musunuru, K., A. Strong, M. Frank-Kamenetsky, N. E. Lee, T. Ahfeldt, K. V. Sachs, X.

Li, H. Li, N. Kuperwasser, V. M. Ruda, J. P. Pirruccello, B. Muchmore, L. Prokunina-Olsson,

J. L. Hall, E. E. Schadt, C. R. Morales, S. Lund-Katz, M. C. Phillips, J. Wong,

W. Cantley, T. Racie, K. G. Ejebe, M. Orho-Melander, O. Melander, V. Koteliansky, K.

Fitzgerald, R. M. Krauss, C. A. Cowan, S. Kathiresan and D. J. Rader (2010). "From

noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus." Nature

466(7307): 714-719.

Myakishev, M. V., Y. Khripin, S. Hu and D. H. Hamer (2001). "High-throughput SNP

genotyping by allele-specific PCR with universal energy-transfer-labeled primers."

Genome Res 11(1): 163-169.

Nagalakshmi, U., K. Waern and M. Snyder (2010). RNA-Seq: A Method for Comprehensive Transcriptome Analysis. Current Protocols in Molecular Biology.

Nagalakshmi, U., Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein and M. Snyder

(2008). "The transcriptional landscape of the yeast genome defined by RNA sequencing."

Science 320(5881): 1344-1349.

National Lung Screening Trial Research Team, D. R. Aberle, A. M. Adams, C. D. Berg, W.

C. Black, J. D. Clapp, R. M. Fagerstrom, I. F. Gareen, C. Gatsonis, P. M. Marcus and J.

D. Sicks (2011). "Reduced lung-cancer mortality with low-dose computed tomographic screening." N Engl J Med 365(5): 395-409.

National Lung Screening Trial Research Team, T. R. Church, W. C. Black, D. R. Aberle, C.

D. Berg, K. L. Clingan, F. Duan, R. M. Fagerstrom, I. F. Gareen, D. S. Gierada, G. C.

Jones, I. Mahon, P. M. Marcus, J. D. Sicks, A. Jain and S. Baum (2013). "Results of

initial low-dose computed tomographic screening for lung cancer." N Engl J Med

368(21): 1980-1991.

Nevins, J. R. (1992). "E2F: a link between the Rb tumor suppressor protein and viral

oncoproteins." Science 258(5081): 424-429.

Nguyen, J. D., M. Lamontagne, C. Couture, M. Conti, P. D. Pare, D. D. Sin, J. C. Hogg,

D. Nickle, D. S. Postma, W. Timens, M. Laviolette and Y. Bosse (2014). "Susceptibility

loci for lung cancer are associated with mRNA levels of nearby genes in the lung."

Carcinogenesis 35(12): 2653-2659.

Nicoloso, M. S., H. Sun, R. Spizzo, H. Kim, P. Wickramasinghe, M. Shimizu, S. E.

Wojcik, J. Ferdin, T. Kunej, L. Xiao, S. Manoukian, G. Secreto, F. Ravagnani, X. Wang,

P. Radice, C. M. Croce, R. V. Davuluri and G. A. Calin (2010). "Single-nucleotide polymorphisms inside microRNA target sites influence tumor susceptibility." Cancer Res

70(7): 2789-2798.

Nikiforov, T. T., R. B. Rendle, P. Goelet, Y. H. Rogers, M. L. Kotewicz, S. Anderson, G.

L. Trainor and M. R. Knapp (1994). "Genetic Bit Analysis: a solid phase method for

typing single nucleotide polymorphisms." Nucleic Acids Res 22(20): 4167-4175.

Niu, R., Y. Wang, M. Zhu, Y. Wen, J. Sun, W. Shen, Y. Cheng, J. Zhang, G. Jin, H. Ma,

Z. Hu, H. Shen and J. Dai (2015). "Potentially Functional Polymorphisms in POU5F1

Gene Are Associated with the Risk of Lung Cancer in Han Chinese." Biomed Res Int

2015: 851320.

O'Donovan, A., D. Scherly, S. G. Clarkson and R. D. Wood (1994). "Isolation of active

recombinant XPG protein, a human DNA repair endonuclease." J Biol Chem 269(23):

15965-15968.

Obsteter, J., P. Dovc and T. Kunej (2015). "Genetic variability of microRNA regulome in

human." Mol Genet Genomic Med 3(1): 30-39.

Ozsolak, F. and P. M. Milos (2011). "RNA sequencing: advances, challenges and

opportunities." Nat Rev Genet 12(2): 87-98.

Pachmann, K., J. H. Clement, C.-P. Schneider, B. Willen, O. Camara, U. Pachmann and

K. Höffken (2005). "Standardized quantification of circulating peripheral tumor cells

from lung and breast cancer." Clinical Chemical Laboratory Medicine 43(6): 617-627.

Pai, A. A., J. K. Pritchard and Y. Gilad (2015). "The genetic and mechanistic basis for

variation in gene regulation." PLoS Genet 11(1): e1004857.

Pardini, B., A. Naccarati, P. Vodicka and R. Kumar (2012). "Gene expression variations:

potentialities of master regulator polymorphisms in colorectal cancer risk." Mutagenesis

27(2): 161-167.

Pastinen, T., B. Ge and T. J. Hudson (2006). "Influence of human genome polymorphism

on gene expression." Hum Mol Genet 15 Spec No 1: R9-16.

Pastinen, T. and T. J. Hudson (2004). "Cis-acting regulatory variation in the human genome." Science 306(5696): 647-650.

Pastinen, T., J. Partanen and A. C. Syvanen (1996). "Multiplex, fluorescent, solid-phase

minisequencing for efficient screening of DNA sequence variation." Clin Chem 42(9):

1391-1397.

Pastinen, T., M. Raitio, K. Lindroos, P. Tainola, L. Peltonen and A. C. Syvanen (2000).

"A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays." Genome Res 10(7): 1031-1042.

Pastinen, T., R. Sladek, S. Gurd, A. Sammak, B. Ge, P. Lepage, K. Lavergne, A.

Villeneuve, T. Gaudin, H. Brandstrom, A. Beck, A. Verner, J. Kingsley, E. Harmsen, D.

Labuda, K. Morgan, M. C. Vohl, A. K. Naumova, D. Sinnett and T. J. Hudson (2004). "A

survey of genetic and epigenetic variation affecting human gene expression." Physiol

Genomics 16(2): 184-193.

Paul, P. and J. Apgar (2005). "Single-molecule dilution and multiple displacement

amplification for molecular haplotyping." Biotechniques 38(4): 553-554, 556, 558-559.

Pecot, C. V., M. Li, X. J. Zhang, R. Rajanbabu, C. Calitri, A. Bungum, J. R. Jett, J. B.

Putnam, C. Callaway-Lane, S. Deppen, E. L. Grogan, D. P. Carbone, J. A. Worrell, K. G.

Moons, Y. Shyr and P. P. Massion (2012). "Added value of a serum proteomic signature

in the diagnostic evaluation of lung nodules." Cancer Epidemiol Biomarkers Prev 21(5):

786-792.

Peirson, S. N., J. N. Butler and R. G. Foster (2003). "Experimental validation of novel

and conventional approaches to quantitative real-time PCR data analysis." Nucleic Acids

Res 31(14): e73.

Pekin, D., Y. Skhiri, J.-C. Baret, D. Le Corre, L. Mazutis, C. B. Salem, F. Millot, A. El

Harrak, J. B. Hutchison and J. W. Larson (2011). "Quantitative and sensitive detection of

rare mutations using droplet-based microfluidics." Lab on a Chip 11(13): 2156-2166.

Penny, G. D., G. F. Kay, S. A. Sheardown, S. Rastan and N. Brockdorff (1996).

"Requirement for Xist in X chromosome inactivation." Nature 379(6561): 131-137.

Pers, T. H., J. M. Karjalainen, Y. Chan, H. J. Westra, A. R. Wood, J. Yang, J. C. Lui, S.

Vedantam, S. Gustafsson, T. Esko, T. Frayling, E. K. Speliotes, Genetic Investigation of

ANthropometric Traits (GIANT) Consortium, M. Boehnke, S. Raychaudhuri, R. S. Fehrmann, J. N. Hirschhorn and L.

Franke (2015). "Biological interpretation of genome-wide association studies using

predicted gene functions." Nat Commun 6: 5890.

Peters, B. A., B. G. Kermani, A. B. Sparks, O. Alferov, P. Hong, A. Alexeev, Y. Jiang, F.

Dahl, Y. T. Tang, J. Haas, K. Robasky, A. W. Zaranek, J. H. Lee, M. P. Ball, J. E.

Peterson, H. Perazich, G. Yeung, J. Liu, L. Chen, M. I. Kennemer, K. Pothuraju, K.

Konvicka, M. Tsoupko-Sitnikov, K. P. Pant, J. C. Ebert, G. B. Nilsen, J. Baccash, A. L.

Halpern, G. M. Church and R. Drmanac (2012). "Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells." Nature 487(7406): 190-195.

Pfaffl, M. W. (2001). "A new mathematical model for relative quantification in real-time

RT-PCR." Nucleic Acids Res 29(9): e45.

Pfaffl, M. W. (2004). Quantification strategies in real-time PCR. A-Z of quantitative

PCR. S. A. Bustin: 87-112.

Pfaffl, M. W. (2013). "Transcriptional biomarkers." Methods 59(1): 1-2.

Pickrell, J. K. (2014). "Joint analysis of functional genomic data and genome-wide association studies of 18 human traits." Am J Hum Genet 94(4): 559-573.

Pleasance, E. D., P. J. Stephens, S. O'Meara, D. J. McBride, A. Meynert, D. Jones, M. L.

Lin, D. Beare, K. W. Lau, C. Greenman, I. Varela, S. Nik-Zainal, H. R. Davies, G. R.

Ordonez, L. J. Mudie, C. Latimer, S. Edkins, L. Stebbings, L. Chen, M. Jia, C. Leroy, J.

Marshall, A. Menzies, A. Butler, J. W. Teague, J. Mangion, Y. A. Sun, S. F. McLaughlin,

H. E. Peckham, E. F. Tsung, G. L. Costa, C. C. Lee, J. D. Minna, A. Gazdar, E. Birney,

M. D. Rhodes, K. J. McKernan, M. R. Stratton, P. A. Futreal and P. J. Campbell (2010).

"A small-cell lung cancer genome with complex signatures of tobacco exposure." Nature

463(7278): 184-190.

Pompanon, F., A. Bonin, E. Bellemain and P. Taberlet (2005). "Genotyping errors: causes, consequences and solutions." Nat Rev Genet 6(11): 847-859.

Purdue, M. P., L. Gold, B. Jarvholm, M. C. Alavanja, M. H. Ward and R. Vermeulen

(2007). "Impaired lung function and lung cancer incidence in a cohort of Swedish construction workers." Thorax 62(1): 51-56.

Raeymaekers, L. (2000). "Basic principles of quantitative PCR." Mol Biotechnol 15(2):

115-122.

Raji, O. Y., S. W. Duffy, O. F. Agbaje, S. G. Baker, D. C. Christiani, A. Cassidy and J.

K. Field (2012). "Predictive accuracy of the Liverpool Lung Project risk model for stratifying patients for computed tomography screening for lung cancer: a case-control and cohort validation study." Ann Intern Med 157(4): 242-250.

Raphael, K. (1987). "Recall bias: a proposal for assessment and control." Int J Epidemiol

16(2): 167-170.

Raymond, C. K., S. Subramanian, M. Paddock, R. Qiu, C. Deodato, A. Palmieri, J.

Chang, T. Radke, E. Haugen, A. Kas, D. Waring, D. Bovee, R. Stacy, R. Kaul and M. V.

Olson (2005). "Targeted, haplotype-resolved resequencing of long segments of the human genome." Genomics 86(6): 759-766.

Reischl, U. and B. Kochanowski (1995). "Quantitative PCR. A survey of the present

technology." Mol Biotechnol 3(1): 55-71.

Rockman, M. V. and L. Kruglyak (2006). "Genetics of global gene expression." Nat Rev

Genet 7(11): 862-872.

Rockman, M. V. and G. A. Wray (2002). "Abundant raw material for cis-regulatory

evolution in humans." Mol Biol Evol 19(11): 1991-2004.

Romagnoli, M., I. Vachier, A. M. Vignola, P. Godard, J. Bousquet and P. Chanez (1999).

"Safety and cellular assessment of bronchial brushing in airway diseases." Respir Med

93(7): 461-466.

Rosenbloom, K. R., T. R. Dreszer, M. Pheasant, G. P. Barber, L. R. Meyer, A. Pohl, B. J.

Raney, T. Wang, A. S. Hinrichs, A. S. Zweig, P. A. Fujita, K. Learned, B. Rhead, K. E.

Smith, R. M. Kuhn, D. Karolchik, D. Haussler and W. J. Kent (2010). "ENCODE whole-genome

data in the UCSC Genome Browser." Nucleic Acids Research 38: D620-D625.

Ruijter, J. M., M. W. Pfaffl, S. Zhao, A. N. Spiess, G. Boggy, J. Blom, R. G. Rutledge, D.

Sisti, A. Lievens and K. De Preter (2013). "Evaluation of qPCR curve analysis methods

for reliable biomarker discovery: bias, resolution, precision, and implications." Methods

59(1): 32-46.

Rutter, J. L., T. I. Mitchell, G. Buttice, J. Meyers, J. F. Gusella, L. J. Ozelius and C. E.

Brinckerhoff (1998). "A single nucleotide polymorphism in the matrix metalloproteinase-1

promoter creates an Ets binding site and augments transcription." Cancer Res 58(23):

5321-5325.

Ryan, B. M., A. I. Robles and C. C. Harris (2010). "Genetic variation in microRNA networks: the implications for cancer research." Nat Rev Cancer 10(6): 389-402.

Samson, D. J., J. Seidenfeld, K. Ziegler and N. Aronson (2004). "Chemotherapy sensitivity and resistance assays: a systematic review." J Clin Oncol 22(17): 3618-3630.

Schabath, M. B., X. Wu, Q. Wei, G. Li, J. Gu and M. R. Spitz (2006). "Combined effects of the p53 and p73 polymorphisms on lung cancer risk." Cancer Epidemiol Biomarkers

Prev 15(1): 158-161.

Schadt, E. E., S. A. Monks, T. A. Drake, A. J. Lusis, N. Che, V. Colinayo, T. G. Ruff, S.

B. Milligan, J. R. Lamb, G. Cavet, P. S. Linsley, M. Mao, R. B. Stoughton and S. H.

Friend (2003). "Genetics of gene expression surveyed in maize, mouse and man." Nature

422(6929): 297-302.

Schaid, D. J. (2006). "Power and sample size for testing associations of haplotypes with complex traits." Ann Hum Genet 70(Pt 1): 116-130.

Scheuermann, R. H. and S. R. Bauer (1993). "Polymerase chain reaction-based mRNA quantification using an internal standard: analysis of oncogene expression." Methods

Enzymol 218: 446-473.

Schmitt, M. W., S. R. Kennedy, J. J. Salk, E. J. Fox, J. B. Hiatt and L. A. Loeb (2012).

"Detection of ultra-rare mutations by next-generation sequencing." Proc Natl Acad Sci U

S A 109(36): 14508-14513.

Schork, A. J., W. K. Thompson, P. Pham, A. Torkamani, J. C. Roddey, P. F. Sullivan, J.

R. Kelsoe, M. C. O'Donovan, H. Furberg, Tobacco and Genetics Consortium, Bipolar Disorder

Psychiatric Genomics Consortium, Schizophrenia Psychiatric Genomics Consortium, N. J. Schork, O. A.

Andreassen and A. M. Dale (2013). "All SNPs are not created equal: genome-wide

association studies reveal a consistent pattern of enrichment among functionally

annotated SNPs." PLoS Genet 9(4): e1003449.

Schwartz, A. G. (2012). "Genetic epidemiology of cigarette smoke-induced lung

disease." Proc Am Thorac Soc 9(2): 22-26.

Schwartz, A. G. (2016). "Genetic Predisposition to Lung Cancer." CHEST Journal

125(5_suppl): 86S-89S.

Schwartz, A. G., G. M. Prysak, C. H. Bock and M. L. Cote (2006). "The molecular

epidemiology of lung cancer." Carcinogenesis 28(3): 507-518.

Sedgwick, P. (2015). "Bias in observational study designs: case-control studies." BMJ

350: h560.

Shen, M., S. I. Berndt, N. Rothman, D. M. Demarini, J. L. Mumford, X. He, M. R.

Bonner, L. Tian, M. Yeager, R. Welch, S. Chanock, T. Zheng, N. Caporaso and Q. Lan

(2005). "Polymorphisms in the DNA nucleotide excision repair genes and lung cancer

risk in Xuan Wei, China." Int J Cancer 116(5): 768-773.

Shen, W., Y. Tian, T. Ran and Z. Gao (2015). "Genotyping and quantification techniques

for single-nucleotide polymorphisms." TrAC Trends in Analytical Chemistry 69: 1-13.

Shi, M. M. (2001). "Enabling large-scale pharmacogenetic studies by high-throughput mutation detection and genotyping technologies." Clin Chem 47(2): 164-172.

Silvestri, G. A., A. Vachani, D. Whitney, M. Elashoff, K. Porta Smith, J. S. Ferguson, E.

Parsons, N. Mitra, J. Brody, M. E. Lenburg and A. Spira (2015). "A Bronchial Genomic

Classifier for the Diagnostic Evaluation of Lung Cancer." New England Journal of

Medicine 373(3): 243-251.

Simmonds, M. J. (2013). "GWAS in autoimmune thyroid disease: redefining our

understanding of pathogenesis." Nat Rev Endocrinol 9(5): 277-287.

Singer-Sam, J., J. M. LeBon, A. Dai and A. D. Riggs (1992). "A sensitive, quantitative

assay for measurement of allele-specific transcripts differing by a single nucleotide."

PCR Methods Appl 1(3): 160-163.

Skillrud, D. M., K. P. Offord and R. D. Miller (1986). "Higher risk of lung cancer in

chronic obstructive pulmonary disease. A prospective, matched, controlled study." Ann

Intern Med 105(4): 503-507.

Smith, R. A., D. Manassaram-Baptiste, D. Brooks, M. Doroshenk, S. Fedewa, D. Saslow,

O. W. Brawley and R. Wender (2015). "Cancer screening in the United States, 2015: a

review of current American cancer society guidelines and current issues in cancer

screening." CA Cancer J Clin 65(1): 30-54.

American Cancer Society (2015). "Cancer Facts & Figures 2015." Atlanta: American Cancer Society.

Somers, J., L. A. Wilson, J. P. Kilday, E. Horvilleur, I. G. Cannell, T. A. Poyry, L. C.

Cobbold, A. Kondrashov, J. R. Knight, S. Puget, J. Grill, R. G. Grundy, M. Bushell and

A. E. Willis (2015). "A common polymorphism in the 5' UTR of ERCC5 creates an

upstream ORF that confers resistance to platinum-based chemotherapy." Genes Dev

29(18): 1891-1896.

Song, J. W. and K. C. Chung (2010). "Observational studies: cohort and case-control studies." Plast Reconstr Surg 126(6): 2234-2242.

Song, M. Y., H. E. Kim, S. Kim, I. H. Choi and J. K. Lee (2012). "SNP-based large-scale identification of allele-specific gene expression in human B cells." Gene 493(2): 211-218.

Spencer, D. H., M. Tyagi, F. Vallania, A. J. Bredemeyer, J. D. Pfeifer, R. D. Mitra and E.

J. Duncavage (2014). "Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data." J Mol

Diagn 16(1): 75-88.

Spira, A., J. Beane, V. Shah, G. Liu, F. Schembri, X. Yang, J. Palma and J. S. Brody

(2004). "Effects of cigarette smoke on the human airway epithelial cell transcriptome."

Proc Natl Acad Sci U S A 101(27): 10143-10148.

Spira, A., J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y.-M.

Dumas, P. Calner, P. Sebastiani, S. Sridhar, J. Beamis, C. Lamb, T. Anderson, N. Gerry,

J. Keane, M. E. Lenburg and J. S. Brody (2007). "Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer." Nat Med 13(3): 361-366.

Spitz, M. R., W. K. Hong, C. I. Amos, X. F. Wu, M. B. Schabath, Q. Dong, S. Shete and

C. J. Etzel (2007). "A risk model for prediction of lung cancer." Journal of the National

Cancer Institute 99(9): 715-726.

Spitz, M. R., Q. Wei, Q. Dong, C. I. Amos and X. Wu (2003). "Genetic susceptibility to

lung cancer: the role of DNA damage and repair." Cancer Epidemiol Biomarkers Prev

12(8): 689-698.

Steemers, F. J., W. Chang, G. Lee, D. L. Barker, R. Shen and K. L. Gunderson (2006).

"Whole-genome genotyping with the single-base extension assay." Nat Methods 3(1): 31-33.

Stephens, M. and P. Scheet (2005). "Accounting for decay of linkage disequilibrium in

haplotype inference and missing-data imputation." Am J Hum Genet 76(3): 449-462.

Stephens, M., N. J. Smith and P. Donnelly (2001). "A new statistical method for

haplotype reconstruction from population data." Am J Hum Genet 68(4): 978-989.

Stranger, B. E., M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch, R. Lyle, S.

Hunt, B. Kahl, S. E. Antonarakis, S. Tavare, P. Deloukas and E. T. Dermitzakis (2005).

"Genome-wide associations of gene expression variation in humans." PLoS Genet 1(6):

e78.

Straus, S. E., F. A. McAlister, D. L. Sackett and J. J. Deeks (2002). "Accuracy of history,

wheezing, and forced expiratory time in the diagnosis of chronic obstructive pulmonary

disease." J Gen Intern Med 17(9): 684-688.

Sun, X., F. Li, N. Sun, Q. Shukui, C. Baoan, F. Jifeng, C. Lu, L. Zuhong, C. Hongyan, C.

YuanDong, J. Jiazhong and Z. Yingfeng (2009). "Polymorphisms in XRCC1 and XPG and response to platinum-based chemotherapy in advanced non-small cell lung cancer

patients." Lung Cancer 65(2): 230-236.

Sundar, I. K., N. Mullapudi, H. Yao, S. D. Spivack and I. Rahman (2011). "Lung cancer

and its association with chronic obstructive pulmonary disease: update on nexus of

epigenetics." Curr Opin Pulm Med 17(4): 279-285.

Tammemagi, C. M., P. F. Pinsky, N. E. Caporaso, P. A. Kvale, W. G. Hocking, T. R.

Church, T. L. Riley, J. Commins, M. M. Oken, C. D. Berg and P. C. Prorok (2011).

"Lung Cancer Risk Prediction: Prostate, Lung, Colorectal and Ovarian Cancer Screening

Trial Models and Validation." Journal of the National Cancer Institute 103(13):

1058-1068.

Tang, F., C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B.

B. Tuch, A. Siddiqui, K. Lao and M. A. Surani (2009). "mRNA-Seq whole-transcriptome analysis of a single cell." Nat Methods 6(5): 377-382.

Tarazona, S., F. Garcia-Alcalde, J. Dopazo, A. Ferrer and A. Conesa (2011). "Differential

expression in RNA-seq: a matter of depth." Genome Res 21(12): 2213-2223.

Thorgeirsson, T. E., F. Geller, P. Sulem, T. Rafnar, A. Wiste, K. P. Magnusson, A.

Manolescu, G. Thorleifsson, H. Stefansson, A. Ingason, S. N. Stacey, J. T. Bergthorsson,

S. Thorlacius, J. Gudmundsson, T. Jonsson, M. Jakobsdottir, J. Saemundsdottir, O.

Olafsdottir, L. J. Gudmundsson, G. Bjornsdottir, K. Kristjansson, H. Skuladottir, H. J.

Isaksson, T. Gudbjartsson, G. T. Jones, T. Mueller, A. Gottsater, A. Flex, K. K. Aben, F.

de Vegt, P. F. Mulders, D. Isla, M. J. Vidal, L. Asin, B. Saez, L. Murillo, T. Blondal, H.

Kolbeinsson, J. G. Stefansson, I. Hansdottir, V. Runarsdottir, R. Pola, B. Lindblad, A. M.

van Rij, B. Dieplinger, M. Haltmayer, J. I. Mayordomo, L. A. Kiemeney, S. E.

Matthiasson, H. Oskarsson, T. Tyrfingsson, D. F. Gudbjartsson, J. R. Gulcher, S.

Jonsson, U. Thorsteinsdottir, A. Kong and K. Stefansson (2008). "A variant associated

with nicotine dependence, lung cancer and peripheral arterial disease." Nature 452(7187):

638-642.

Tichopad, A., R. Kitchen, I. Riedmaier, C. Becker, A. Stahlberg and M. Kubista (2009).

"Design and optimization of reverse-transcription quantitative PCR experiments." Clin

Chem 55(10): 1816-1823.

Tockman, M. S., N. R. Anthonisen, E. C. Wright and M. G. Donithan (1987). "Airways obstruction and the risk for lung cancer." Ann Intern Med 106(4): 512-518.

Trapnell, C., L. Pachter and S. L. Salzberg (2009). "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25(9): 1105-1111.

Tseden-Ish, M., Y. D. Choi, H. J. Cho, H. J. Ban, I. J. Oh, K. S. Kim, S. Y. Song, K. J.

Na, S. J. Ahn, S. Choi and Y. C. Kim (2012). "Disease-free survival of patients after surgical resection of non-small cell lung carcinoma and correlation with excision repair cross-complementation group 1 expression and genotype." Respirology 17(1): 127-133.

Tsukada, J., Y. Yoshida, Y. Kominato and P. E. Auron (2011). "The CCAAT/enhancer

(C/EBP) family of basic-leucine zipper (bZIP) transcription factors is a multifaceted highly-regulated system for gene regulation." Cytokine 54(1): 6-19.

Turner, E. H., S. B. Ng, D. A. Nickerson and J. Shendure (2009). "Methods for genomic partitioning." Annu Rev Genomics Hum Genet 10: 263-284.

UC Davis. "Mendel's Laws." BioWiki: The Dynamic Biology Hypertext, 2016, from http://biowiki.ucdavis.edu/Genetics/Unit_I%3A_Genes,_Nucleic_Acids,_Genomes_and_Chromosomes/1%3A_Fundamental_Properties_of_Genes/Mendel's_Laws.

Unger, M. (2006). "A pause, progress, and reassessment in lung cancer screening." N

Engl J Med 355(17): 1822-1824.

Vachani, A., H. I. Pass, W. N. Rom, D. E. Midthun, E. S. Edell, M. Laviolette, X. J. Li,

P. Y. Fong, S. W. Hunsucker, C. Hayward, P. J. Mazzone, D. K. Madtes, Y. E. Miller, M.

G. Walker, J. Shi, P. Kearney, K. C. Fang and P. P. Massion (2015). "Validation of a

multiprotein plasma classifier to identify benign lung nodules." J Thorac Oncol 10(4):

629-637.

van Klaveren, R. J., J. D. F. Habbema, J. H. Pedersen, H. J. de Koning, M. Oudkerk and

H. C. Hoogsteden (2001). "Lung cancer screening by low-dose spiral computed

tomography." Eur Respir J 18(5): 857-866.

Vandesompele, J., M. Kubista and M. W. Pfaffl (2009). Reference gene software for improved normalization. Real-Time PCR: Current Technology and Applications. J. Logan, K. Edwards and N. Saunders, Caister Academic Press.

Vansteenkiste, J., C. Dooms, C. Mascaux and K. Nackaerts (2012). "Screening and early

detection of lung cancer." Ann Oncol 23 Suppl 10: x320-327.

Vlems, F. A., A. Ladanyi, R. Gertler, R. Rosenberg, J. H. Diepstra, C. Roder, H.

Nekarda, B. Molnar, Z. Tulassay, G. N. van Muijen and I. Vogel (2003). "Reliability of

quantitative reverse-transcriptase-PCR-based detection of tumour cells in the blood between different laboratories using a standardised protocol." Eur J Cancer 39(3): 388-396.

Vogel, C. and E. M. Marcotte (2012). "Insights into the regulation of protein abundance

from proteomic and transcriptomic analyses." Nat Rev Genet 13(4): 227-232.

Wade-Martins, R., Y. Saeki and E. A. Chiocca (2003). "Infectious delivery of a 135-kb

LDLR genomic locus leads to regulated complementation of low-density lipoprotein receptor deficiency in human cells." Mol Ther 7(5 Pt 1): 604-612.

Ugozzoli, L. and R. B. Wallace (1991). "Allele-Specific Polymerase Chain Reaction."

METHODS: A Companion to Methods in Enzymology 2(1): 42-48.

Walser, T., X. Cui, J. Yanagawa, J. M. Lee, E. Heinrich, G. Lee, S. Sharma and S. M.

Dubinett (2008). "Smoking and lung cancer: the role of inflammation." Proc Am Thorac

Soc 5(8): 811-815.

Wang, A. M., M. V. Doyle and D. F. Mark (1989). "Quantitation of mRNA by the

polymerase chain reaction." Proc Natl Acad Sci U S A 86(24): 9717-9721.

Wang, J., J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G. Pierce, X.

Dong, A. Kundaje, Y. Cheng, O. J. Rando, E. Birney, R. M. Myers, W. S. Noble, M.

Snyder and Z. Weng (2012). "Sequence features and chromatin structure around the

genomic regions bound by 119 human transcription factors." Genome Res 22(9): 1798-1812.

Wang, T. W., R. C. Vermeulen, W. Hu, G. Liu, X. Xiao, Y. Alekseyev, J. Xu, B. Reiss,

K. Steiling, G. S. Downward, D. T. Silverman, F. Wei, G. Wu, J. Li, M. E. Lenburg, N.

Rothman, A. Spira and Q. Lan (2015). "Gene-expression profiling of buccal epithelium among non-smoking women exposed to household air pollution from smoky coal."

Carcinogenesis 36(12): 1494-1501.

Wang, Y., P. Broderick, E. Webb, X. Wu, J. Vijayakrishnan, A. Matakidou, M. Qureshi,

Q. Dong, X. Gu, W. V. Chen, M. R. Spitz, T. Eisen, C. I. Amos and R. S. Houlston

(2008). "Common 5p15.33 and 6p21.33 variants influence lung cancer risk." Nat Genet

40(12): 1407-1409.

Wang, Y., J. D. McKay, T. Rafnar, Z. Wang, M. N. Timofeeva, P. Broderick, X. Zong,

M. Laplana, Y. Wei, Y. Han, A. Lloyd, M. Delahaye-Sourdeix, D. Chubb, V. Gaborieau,

W. Wheeler, N. Chatterjee, G. Thorleifsson, P. Sulem, G. Liu, R. Kaaks, M. Henrion, B.

Kinnersley, M. Vallee, F. LeCalvez-Kelm, V. L. Stevens, S. M. Gapstur, W. V. Chen, D.

Zaridze, N. Szeszenia-Dabrowska, J. Lissowska, P. Rudnai, E. Fabianova, D. Mates, V.

Bencko, L. Foretova, V. Janout, H. E. Krokan, M. E. Gabrielsen, F. Skorpen, L. Vatten, I.

Njolstad, C. Chen, G. Goodman, S. Benhamou, T. Vooder, K. Valk, M. Nelis, A.

Metspalu, M. Lener, J. Lubinski, M. Johansson, P. Vineis, A. Agudo, F. Clavel-Chapelon,

H. B. Bueno-de-Mesquita, D. Trichopoulos, K. T. Khaw, M. Johansson, E.

Weiderpass, A. Tjonneland, E. Riboli, M. Lathrop, G. Scelo, D. Albanes, N. E. Caporaso,

Y. Ye, J. Gu, X. Wu, M. R. Spitz, H. Dienemann, A. Rosenberger, L. Su, A. Matakidou,

T. Eisen, K. Stefansson, A. Risch, S. J. Chanock, D. C. Christiani, R. J. Hung, P.

Brennan, M. T. Landi, R. S. Houlston and C. I. Amos (2014). "Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer." Nat Genet 46(7): 736-741.

Wang, Z., M. Gerstein and M. Snyder (2009). "RNA-Seq: a revolutionary tool for

transcriptomics." Nat Rev Genet 10(1): 57-63.

Wang, Z., B. Zhu, M. Zhang, H. Parikh, J. Jia, C. C. Chung, J. N. Sampson, J. W.

Hoskins, A. Hutchinson, L. Burdette, A. Ibrahim, C. Hautman, P. S. Raj, C. C. Abnet, A.

A. Adjei, A. Ahlbom, D. Albanes, N. E. Allen, C. B. Ambrosone, M. Aldrich, P. Amiano,

C. Amos, U. Andersson, G. Andriole, Jr., I. L. Andrulis, C. Arici, A. A. Arslan, M. A.

Austin, D. Baris, D. A. Barkauskas, B. A. Bassig, L. E. Beane Freeman, C. D. Berg, S. I.

Berndt, P. A. Bertazzi, R. B. Biritwum, A. Black, W. Blot, H. Boeing, P. Boffetta, K.

Bolton, M. C. Boutron-Ruault, P. M. Bracci, P. Brennan, L. A. Brinton, M. Brotzman, H.

B. Bueno-de-Mesquita, J. E. Buring, M. A. Butler, Q. Cai, G. Cancel-Tassin, F. Canzian,

G. Cao, N. E. Caporaso, A. Carrato, T. Carreon, A. Carta, G. C. Chang, I. S. Chang, J.

Chang-Claude, X. Che, C. J. Chen, C. Y. Chen, C. H. Chen, C. Chen, K. Y. Chen, Y. M.

Chen, A. P. Chokkalingam, L. W. Chu, F. Clavel-Chapelon, G. A. Colditz, J. S. Colt, D.

Conti, M. B. Cook, V. K. Cortessis, E. D. Crawford, O. Cussenot, F. G. Davis, I. De

Vivo, X. Deng, T. Ding, C. P. Dinney, A. L. Di Stefano, W. R. Diver, E. J. Duell, J. W.

Elena, J. H. Fan, H. S. Feigelson, M. Feychting, J. D. Figueroa, A. M. Flanagan, J. F.

Fraumeni, Jr., N. D. Freedman, B. L. Fridley, C. S. Fuchs, M. Gago-Dominguez, S.

Gallinger, Y. T. Gao, S. M. Gapstur, M. Garcia-Closas, R. Garcia-Closas, J. M. Gastier-Foster,

J. M. Gaziano, D. S. Gerhard, C. A. Giffen, G. G. Giles, E. M. Gillanders, E. L.

Giovannucci, M. Goggins, N. Gokgoz, A. M. Goldstein, C. Gonzalez, R. Gorlick, M. H.

Greene, M. Gross, H. B. Grossman, R. Grubb, 3rd, J. Gu, P. Guan, C. A. Haiman, G.

Hallmans, S. E. Hankinson, C. C. Harris, P. Hartge, C. Hattinger, R. B. Hayes, Q. He, L.

Helman, B. E. Henderson, R. Henriksson, J. Hoffman-Bolton, C. Hohensee, E. A. Holly,

Y. C. Hong, R. N. Hoover, H. D. Hosgood, 3rd, C. F. Hsiao, A. W. Hsing, C. A. Hsiung,

N. Hu, W. Hu, Z. Hu, M. S. Huang, D. J. Hunter, P. D. Inskip, H. Ito, E. J. Jacobs, K. B.

Jacobs, M. Jenab, B. T. Ji, C. Johansen, M. Johansson, A. Johnson, R. Kaaks, A. M.

Kamat, A. Kamineni, M. Karagas, C. Khanna, K. T. Khaw, C. Kim, I. S. Kim, J. H. Kim,

Y. H. Kim, Y. C. Kim, Y. T. Kim, C. H. Kang, Y. J. Jung, C. M. Kitahara, A. P. Klein, R.

Klein, M. Kogevinas, W. P. Koh, T. Kohno, L. N. Kolonel, C. Kooperberg, C. P. Kratz,

V. Krogh, H. Kunitoh, R. C. Kurtz, N. Kurucu, Q. Lan, M. Lathrop, C. C. Lau, F.

Lecanda, K. M. Lee, M. P. Lee, L. Le Marchand, S. P. Lerner, D. Li, L. M. Liao, W. Y.

Lim, D. Lin, J. Lin, S. Lindstrom, M. S. Linet, J. Lissowska, J. Liu, B. Ljungberg, J.

Lloreta, D. Lu, J. Ma, N. Malats, S. Mannisto, N. Marina, G. Mastrangelo, K. Matsuo, K.

A. McGlynn, R. McKean-Cowdin, L. H. McNeill, R. R. McWilliams, B. S. Melin, P. S.

Meltzer, J. E. Mensah, X. Miao, D. S. Michaud, A. M. Mondul, L. E. Moore, K. Muir, S.

Niwa, S. H. Olson, N. Orr, S. Panico, J. Y. Park, A. V. Patel, A. Patino-Garcia, S.

Pavanello, P. H. Peeters, B. Peplonska, U. Peters, G. M. Petersen, P. Picci, M. C. Pike, S.

Porru, J. Prescott, X. Pu, M. P. Purdue, Y. L. Qiao, P. Rajaraman, E. Riboli, H. A. Risch,

R. J. Rodabough, N. Rothman, A. M. Ruder, J. S. Ryu, M. Sanson, A. Schned, F. R.

Schumacher, A. G. Schwartz, K. L. Schwartz, M. Schwenn, K. Scotlandi, A. Seow, C.

Serra, M. Serra, H. D. Sesso, G. Severi, H. Shen, M. Shen, S. Shete, K. Shiraishi, X. O.

Shu, A. Siddiq, L. Sierrasesumaga, S. Sierri, A. D. Loon Sihoe, D. T. Silverman, M.

Simon, M. C. Southey, L. Spector, M. Spitz, M. Stampfer, P. Stattin, M. C. Stern, V. L.

Stevens, R. Z. Stolzenberg-Solomon, D. O. Stram, S. S. Strom, W. C. Su, M. Sund, S. W.

Sung, A. Swerdlow, W. Tan, H. Tanaka, W. Tang, Z. Z. Tang, A. Tardon, E. Tay, P. R.

Taylor, Y. Tettey, D. M. Thomas, R. Tirabosco, A. Tjonneland, G. S. Tobias, J. R. Toro,

R. C. Travis, D. Trichopoulos, R. Troisi, A. Truelove, Y. H. Tsai, M. A. Tucker, R.

Tumino, D. Van Den Berg, S. K. Van Den Eeden, R. Vermeulen, P. Vineis, K.

Visvanathan, U. Vogel, C. Wang, C. Wang, J. Wang, S. S. Wang, E. Weiderpass, S. J.

Weinstein, N. Wentzensen, W. Wheeler, E. White, J. K. Wiencke, A. Wolk, B. M.

Wolpin, M. P. Wong, M. Wrensch, C. Wu, T. Wu, X. Wu, Y. L. Wu, J. S. Wunder, Y. B.

Xiang, J. Xu, H. P. Yang, P. C. Yang, Y. Yatabe, Y. Ye, E. D. Yeboah, Z. Yin, C. Ying,

C. J. Yu, K. Yu, J. M. Yuan, K. A. Zanetti, A. Zeleniuch-Jacquotte, W. Zheng, B. Zhou,

L. Mirabello, S. A. Savage, P. Kraft, S. J. Chanock, M. Yeager, M. T. Landi, J. Shi, N.

Chatterjee and L. T. Amundadottir (2014). "Imputation and subset-based association analysis across different cancer types identifies multiple independent risk loci in the

TERT-CLPTM1L region on chromosome 5p15.33." Hum Mol Genet 23(24): 6616-6633.

Wasswa-Kintu, S., W. Q. Gan, S. F. P. Man, P. D. Pare and D. D. Sin (2005).

"Relationship between reduced forced expiratory volume in one second and the risk of lung cancer: a systematic review and meta-analysis." Thorax 60(7): 570-575.

Waszak, S. M., O. Delaneau, A. R. Gschwind, H. Kilpinen, S. K. Raghav, R. M.

Witwicki, A. Orioli, M. Wiederkehr, N. I. Panousis, A. Yurovsky, L. Romano-Palumbo,

A. Planchon, D. Bielser, I. Padioleau, G. Udin, S. Thurnheer, D. Hacker, N. Hernandez,

A. Reymond, B. Deplancke and E. T. Dermitzakis (2015). "Population Variation and

Genetic Control of Modular Chromatin Architecture in Humans." Cell 162(5): 1039-1050.

Westra, H. J., M. J. Peters, T. Esko, H. Yaghootkar, C. Schurmann, J. Kettunen, M. W.

Christiansen, B. P. Fairfax, K. Schramm, J. E. Powell, A. Zhernakova, D. V. Zhernakova,

J. H. Veldink, L. H. Van den Berg, J. Karjalainen, S. Withoff, A. G. Uitterlinden, A.

Hofman, F. Rivadeneira, P. A. 't Hoen, E. Reinmaa, K. Fischer, M. Nelis, L. Milani, D.

Melzer, L. Ferrucci, A. B. Singleton, D. G. Hernandez, M. A. Nalls, G. Homuth, M.

Nauck, D. Radke, U. Volker, M. Perola, V. Salomaa, J. Brody, A. Suchy-Dicey, S. A.

Gharib, D. A. Enquobahrie, T. Lumley, G. W. Montgomery, S. Makino, H. Prokisch, C.

Herder, M. Roden, H. Grallert, T. Meitinger, K. Strauch, Y. Li, R. C. Jansen, P. M.

Visscher, J. C. Knight, B. M. Psaty, S. Ripatti, A. Teumer, T. M. Frayling, A. Metspalu,

J. B. van Meurs and L. Franke (2013). "Systematic identification of trans eQTLs as putative drivers of known disease associations." Nat Genet 45(10): 1238-1243.

Willey, J. C., E. Coy, C. Brolly, M. J. Utell, M. W. Frampton, J. Hammersley, W. G.

Thilly, D. Olson and K. Cairns (1996). "Xenobiotic metabolism enzyme gene expression in human bronchial epithelial and alveolar macrophage cells." Am J Respir Cell Mol Biol

14(3): 262-271.

Willey, J. C., E. L. Coy, M. W. Frampton, A. Torres, M. J. Apostolakos, G. Hoehn, W.

H. Schuermann, W. G. Thilly, D. E. Olson, J. R. Hammersley, C. L. Crespi and M. J.

Utell (1997). "Quantitative RT-PCR measurement of cytochromes p450 1A1, 1B1, and

2B7, microsomal epoxide hydrolase, and NADPH oxidoreductase expression in lung cells of smokers and nonsmokers." Am J Respir Cell Mol Biol 17(1): 114-124.

Willey, J. C., E. L. Crawford, C. A. Knight, K. A. Warner, C. R. Motten, E. H. Peters, R. J. Zahorchak, T. G. Graves, D. A. Weaver, J. R. Bergman, M. Vondrecek and R. C. Grafstrom (2004). "Use of standardized mixtures of internal standards in quantitative RT-PCR to ensure quality control and develop a standardized gene expression database." A-Z of Quantitative PCR. S. A. Bustin: 545-576.

Williams, R. B., E. K. Chan, M. J. Cowley and P. F. Little (2007). "The influence of genetic variation on gene expression." Genome Res 17(12): 1707-1716.

Wistuba, I. I. (2007). "Genetics of preneoplasia: lessons from lung cancer." Curr Mol Med

7(1): 3-14.

Wittkopp, P. J., B. K. Haerum and A. G. Clark (2008). "Independent effects of cis- and

trans-regulatory variation on gene expression in Drosophila melanogaster." Genetics

178(3): 1831-1835.

Wittkopp, P. J. and G. Kalay (2012). "Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence." Nat Rev Genet 13(1): 59-69.

Wong, M. L. and J. F. Medrano (2005). "Real-time PCR for mRNA quantitation."

Biotechniques 39(1): 75-85.

Wu, D. Y., L. Ugozzoli, B. K. Pal and R. B. Wallace (1989). "Allele-specific enzymatic amplification of beta-globin genomic DNA for diagnosis of sickle cell anemia." Proc Natl

Acad Sci U S A 86(8): 2757-2760.

Wu, M. C., P. Kraft, M. P. Epstein, D. M. Taylor, S. J. Chanock, D. J. Hunter and X. Lin

(2010). "Powerful SNP-set analysis for case-control genome-wide association studies."

Am J Hum Genet 86(6): 929-942.

Xu, H., J. DiCarlo, R. V. Satya, Q. Peng and Y. Wang (2014). "Comparison of somatic mutation calling methods in amplicon and whole exome sequence data." BMC Genomics

15: 244.

Xu, Y., L. Berglund, R. Ramakrishnan, R. Mayeux, C. Ngai, S. Holleran, B. Tycko, T.

Leff and N. S. Shachter (1999). "A common Hpa I RFLP of apolipoprotein C-I increases

gene transcription and exhibits an ethnically distinct pattern of linkage disequilibrium

with the alleles of apolipoprotein E." J Lipid Res 40(1): 50-58.

Yan, H., W. Yuan, V. E. Velculescu, B. Vogelstein and K. W. Kinzler (2002). "Allelic

variation in human gene expression." Science 297(5584): 1143.

Yeo, J., E. L. Crawford, T. M. Blomquist, L. M. Stanoszek, R. E. Dannemiller, J. Zyrek,

L. E. De Las Casas, S. A. Khuder and J. C. Willey (2014). "A multiplex two-color real-time PCR method for quality-controlled molecular diagnostic testing of FFPE samples."

PLoS One 9(2): e89395.

Yoo, S. S., C. Jin, D. K. Jung, Y. Y. Choi, J. E. Choi, W. K. Lee, S. Y. Lee, J. Lee, S. I.

Cha, C. H. Kim, Y. Seok, E. Lee and J. Y. Park (2015). "Putative functional variants of

XRCC1 identified by RegulomeDB were not associated with lung cancer risk in a Korean

population." Cancer Genet 208(1-2): 19-24.

Young, R. P., R. J. Hopkins, T. Christmas, P. N. Black, P. Metcalf and G. D. Gamble

(2009). "COPD prevalence is increased in lung cancer, independent of age, sex and

smoking history." Eur Respir J 34(2): 380-386.

Yu, Z., Z. Li, N. Jolicoeur, L. Zhang, Y. Fortin, E. Wang, M. Wu and S. H. Shen (2007).

"Aberrant allele frequencies of the SNPs located in microRNA target sites are potentially

associated with human cancers." Nucleic Acids Res 35(13): 4535-4541.

Yvert, G., R. B. Brem, J. Whittle, J. M. Akey, E. Foss, E. N. Smith, R. Mackelprang and

L. Kruglyak (2003). "Trans-acting regulatory variation in Saccharomyces cerevisiae and

the role of transcription factors." Nat Genet 35(1): 57-64.

Zakharkin, S. O., K. Kim, T. Mehta, L. Chen, S. Barnes, K. E. Scheirer, R. S. Parrish, D.

B. Allison and G. P. Page (2005). "Sources of variation in Affymetrix microarray

experiments." BMC Bioinformatics 6: 214.

Zakharov, S., T. Y. Wong, T. Aung, E. N. Vithana, C. C. Khor, A. Salim and A.

Thalamuthu (2013). "Combined genotype and haplotype tests for region-based

association studies." BMC Genomics 14: 569.

Zeki, S. and R. C. Fitzgerald (2015). "The use of molecular markers in predicting

dysplasia and guiding treatment." Best Pract Res Clin Gastroenterol 29(1): 113-124.

Zhai, R., X. Yu, A. Shafer, J. C. Wain and D. C. Christiani (2014). "The impact of coexisting COPD on survival of patients with early-stage non-small cell lung cancer

undergoing surgical resection." Chest 145(2): 346-353.

Zhang, K., J. B. Li, Y. Gao, D. Egli, B. Xie, J. Deng, Z. Li, J. H. Lee, J. Aach, E. M.

Leproust, K. Eggan and G. M. Church (2009). "Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human." Nat Methods 6(8): 613-618.

Zhang, R., X. Li, G. Ramaswami, K. S. Smith, G. Turecki, S. B. Montgomery and J. B.

Li (2014). "Quantifying RNA allelic ratios by microfluidic multiplex PCR and

sequencing." Nat Methods 11(1): 51-54.

Zhang, R., W. Min and W. C. Sessa (1995). "Functional analysis of the human

endothelial nitric oxide synthase promoter. Sp1 and GATA factors are necessary for basal

transcription in endothelial cells." J Biol Chem 270(25): 15320-15326.

Zhang, T., J. Sun, M. Lv, L. Zhang, X. Wang, J. C. Ren and B. Wang (2013). "XPG is

predictive gene of clinical outcome in advanced non-small-cell lung cancer with platinum

drug therapy." Asian Pac J Cancer Prev 14(2): 701-705.

Zhu, M. L., T. Y. Shi, H. C. Hu, J. He, M. Wang, L. Jin, Y. J. Yang, J. C. Wang, M. H.

Sun, H. Chen, K. L. Zhao, Z. Zhang, H. Q. Chen, J. Q. Xiang and Q. Y. Wei (2012).

"Polymorphisms in the ERCC5 gene and risk of esophageal squamous cell carcinoma

(ESCC) in Eastern Chinese populations." PLoS One 7(7): e41500.

Ziegler, A., A. Koch, K. Krockenberger and A. Grosshennig (2012). "Personalized

medicine using DNA biomarkers: a review." Hum Genet 131(10): 1627-1638.

Zienolddiny, S., D. Campa, H. Lind, D. Ryberg, V. Skaug, L. Stangeland, D. H. Phillips,

F. Canzian and A. Haugen (2006). "Polymorphisms of DNA repair genes and risk of non-small cell lung cancer." Carcinogenesis 27(3): 560-567.
