Integrative Genomics Methods for Personalized Treatment of Non-Small-Cell Lung Cancer

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Michael Frederick Sharpnack, B.A.

Biomedical Sciences Graduate Program

The Ohio State University

2018

Dissertation Committee:

Kun Huang, PhD, Advisor

Jeffrey Parvin, MD, PhD

David P. Carbone, MD, PhD

Kai He, MD, PhD

Copyrighted by

Michael Sharpnack

2018

Abstract

Lung cancer is the most deadly form of cancer, responsible for over 1.6 million deaths annually1, the majority of which are due to non-small cell lung cancer, of which adenocarcinoma and squamous cell carcinoma are the major subtypes2. Standard chemotherapy produces responses in a small minority of patients3, and despite the tremendous growth of personalized therapies in the last decade, only a minority of patients benefit from these treatments in the North American setting3–6. A greater understanding of the biology of non-small cell lung cancer is desperately needed to develop novel targeted therapies and their accompanying biomarkers.

Understanding the function of cancer-associated genes requires the integration and analysis of multiple modalities of biological data. Cancer-associated genes can be activated or repressed by DNA somatic mutations, RNA , epigenetic changes, microRNA-mediated silencing, post-translational regulation, and other mechanisms7–9. To understand how tumors form and grow, we have to be able to measure DNA, RNA, proteins, metabolites, and lipids. Further, integrative and analytical methods, collectively termed integrative genomics, are necessary to leverage these data together.

Here, we leverage DNA mutations and copy number measurements, RNA transcriptomics, proteomics, and clinical data to discover regulatory relationships in tumors, develop prognostic biomarkers, and identify mediators of tumor mutation burden.

First, we focus on the RNA editing protein ADAR, and propose an immune-mediated function in lung adenocarcinoma. Second, we develop a method to integrate RNA and protein expression data to predict binary clinical variables, and test its ability to predict tumor recurrence in surgically resected lung adenocarcinoma samples. Finally, we define the relationship between tumor mutation burden and genome stability protein inactivation to better understand tumor immunogenicity in non-small cell lung cancer. Taken together, these approaches present a comprehensive methodology to utilize integrative genomic data for clinical applications in non-small cell lung cancer.

Dedication

This thesis is dedicated to my wife, Danielle, and my two children, Jaylen and Leora, with all of my love.

Acknowledgments

I would like to acknowledge my advisor, Kun Huang, for his unselfish and positive approach to mentorship. Thank you to the members of my committee, and my peers and advisors in the MSTP and BSGP programs at The Ohio State University for all of their support. In addition, I would like to thank members of the Thoracic Oncology lab and Butte lab at University of California San Francisco, for kindly accepting me into their groups. Finally, I would especially like to thank Parag Mallick and members of the Mallick lab at Stanford for their unwavering support, mentorship, and friendship.

To my wife and kids: you mean everything to me and I could not have done this without your kindness, love, and hard work. I would like to thank my parents, Jan and Doug, for their guidance and support. And my siblings and their spouses, Becky, Zane, James, and Vicki, for always being there for me. In addition, thank you to my friends for all of their support.

Vita

2007 ...... Loveland High School, Loveland, Ohio

2011 ...... B.A. Mathematics with Honors, New York University

2013-2014 ...... Graduate Fellow, Presidential Fellowship, The Ohio State University

2014-2017 ...... Graduate Fellow, Clinical and Translational Research Informatics Training Program, National Library of Medicine

2018 to present ...... Graduate Research Assistant, The Ohio State University

Publications

A. Yates, A. Webb, M. Sharpnack, H. Chamberlin, K. Huang, and R. Machiraju. Visualizing Multidimensional Data with Glyph SPLOMs. Computer Graphics Forum, 2014 33:3, 301-310.

M.F. Sharpnack, K. Huang. Detecting Cancer Pathway Crosstalk with Distance Correlation. American Medical Informatics Association Joint Summits 2015. Finalist for the Marco Ramoni Distinguished Paper Award.

M.F. Sharpnack, B. Chen, D. Aran, I. Kosti, D.D. Sharpnack, D.P. Carbone, P. Mallick, K. Huang. Global Transcriptome Analysis of RNA Abundance Regulation by ADAR in Lung Adenocarcinoma. 2018. EBioMedicine, Volume 27, 167-175.

Fields of Study

Major Field: Biomedical Sciences

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Biology of Integrative Genomics

1.2 Integrative Genomics in Non-Small Cell Lung Cancer

1.3 Correspondence between RNA and protein abundances

1.4 Therapeutic applications of integrative genomics in NSCLC

1.5 A-I RNA editing

1.6 Structure and Function of the ADAR1 protein

1.7 Role of ADAR as an Oncogene in Cancer

Chapter 2: Global Transcriptome Analysis of RNA Abundance Regulation by ADAR in Lung Adenocarcinoma

2.1 Introduction

2.2 Materials and methods

2.3 Results

2.4 Discussion

Chapter 3: Proteogenomic Analysis of Surgically Resected NSCLC

3.1 Introduction

3.2 Materials and Methods

3.3 Results

3.4 Discussion

Chapter 4: Investigation of the Association of Genome Stability Protein Inactivation and Tumor Mutation Burden

4.1 Introduction

4.2 Materials and Methods

4.3 Results

4.4 Discussion

Chapter 6: Conclusions & Future Directions

References

List of Tables

Table 3.1: Clinical attributes for the 51 lung adenocarcinoma patients.

Table 3.2: Integrative Biomarker of Recurrence Leave-One-Out Cross Validation Performance.

List of Figures

Figure 1.1: Integrative genomics for personalized cancer therapy.

Figure 1.2: Structure and function of the ADAR1 protein.

Figure 2.1: Pipeline to discover regulatory RNA editing sites.

Figure 2.2: Landscape of regulatory RNA editing sites in LUAD.

Figure 2.3: The APOL1 3’ UTR contains multiple regulatory editing sites controlled by ADAR.

Figure 2.4: Survival analysis of APOL1 RNA expression and APOL1 editing sites.

Figure 2.5: RNA-editing regulated genes in LUAD are enriched in and innate immune-related genes.

Figure 2.6: ADAR amplification is associated with decreased immune cell concentrations in lung adenocarcinoma.

Figure 3.1: -level mRNA-protein correlation in human non-small cell lung cancer.

Figure 3.2: Synergistic discovery of differentially regulated genes using matched RNA and protein abundances.

Figure 3.3: Timm50 is differentially correlated between recurrent and non-recurrent tumors.

Figure 3.4: Overview of integrative RNA-protein biomarker discovery pipeline.

Figure 4.1: Landscape of alteration of genome stability related pathways in non-small cell lung cancer.

Figure 4.2: Association between smoking and tumor mutation burden.

Figure 4.3: Enrichment of genomic stability related pathway inactivation in hypermutant tumors.

Figure 4.4: Predicting tumor mutation burden with age, genomic stability gene inactivation, and smoking status.

Chapter 1: Introduction

1.1 Biology of Integrative Genomics

The central dogma of biology states that DNA is transcribed into RNA, which is then translated into protein10. The past two decades have witnessed tremendous complications to this principle, wherein each class of molecule present in the cell represents a biological source of information that holds clues to the regulatory principles that govern the cell11. The goal of integrative genomics is to incorporate all biological sources of information to understand the cell (or organ, or organism) from a global systems approach12. Once the principles underlying cellular regulation are understood, the consequences of modifying molecules, e.g., via targeted therapies, can be understood.

Computational methods in integrative genomics reflect the analytical challenges of the data produced. Prior to the widespread use of next generation sequencing (NGS) and shotgun proteomics technologies, systems biology was primarily limited to data produced from low throughput techniques and microarrays13. In the last decade, however, vast amounts of NGS and proteomics data have been produced, including nearly a million RNA sequencing samples in the SRA database14 and nearly 5000 experiments in the ProteomeXchange database15.

As each type of molecule is partly responsible for cellular homeostasis, modifications to these molecules can be co-opted to contribute to tumor development and growth. In fact, nearly every measurable type of molecule has been implicated in cancer16. Some alterations can affect genes with consequences to entire systems. This thesis explores in detail one of these genes, ADAR, which is responsible for the majority of adenosine to inosine RNA editing within the cell. Amplifications of the ADAR gene, a frequent occurrence in human cancer, increase the amount of editing in thousands of RNA molecules, yet the results are conserved invasive and anti-apoptotic phenotypes17,18. Integrative genomics methods are well positioned to answer complex systems questions and to identify the exact mechanisms underlying the oncogenic behavior of genes such as ADAR.

Here, we will first investigate the role of ADAR in NSCLC as described above in Chapter 2. In Chapter 3, we will explore the correspondence between RNA and protein abundances, and show how these can be incorporated to predict clinically relevant variables in lung adenocarcinoma. Finally, in Chapter 4, we establish an association between predicted inactivation of genome stability genes and tumor mutation burden.

Thesis statement: Integrative genomics methods reveal novel tumor biology relating to inter-molecular regulation. In particular, we propose an immunoregulatory role for the epitranscriptome regulatory protein, ADAR1. We show that protein and RNA data each contain complementary but independent information that can be used to predict tumor prognosis. Finally, we elucidate the genetic components of tumor mutation burden in non-small cell lung cancer (NSCLC).

Figure 1.1: Integrative genomics for personalized cancer therapy. (A) Integrative genomic data from each step of the extended central dogma. (B) Example measurements for each step in 1.1A and their approximate feature space size. Here a feature is the unit of measurement for the corresponding datatype. For example, DNA measurements can be processed into mutations and copy number values.

1.2 Integrative Genomics in Non-Small Cell Lung Cancer

Lung cancer accounts for more deaths than any other cancer, the majority of which have non-small-cell histology1. Seminal publications in the last 10 years have identified patterns in somatic mutations underlying both squamous and adenoid histological subtypes of non-small-cell lung cancer7–9,19. Most importantly, The Cancer Genome Atlas project has provided comprehensive molecular data on over a thousand NSCLC patients, with roughly 500 patients from each of the lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) histologies7,8. An early goal of these and other studies was to identify genes that were mutated at frequencies higher than predicted by background mutation rates20. To this end, numerous novel oncogenes and tumor suppressors were identified or matched to new cancer types. One conclusion from these studies was that cancer-associated genes could regulate any step of the extended central dogma9.

Furthermore, somatic DNA mutations and deletions are only two ways that oncogenes and tumor suppressors can have altered function. Genes can be repressed or overexpressed by epigenetic changes21. Insufficient DNA mismatch repair functionality is a frequent finding in colon and rectal carcinomas; however, the majority of inactivations in this pathway are due to hypermethylation of the MLH1 gene, with a minority attributed to somatic mutation or deletion22.

Understanding the biological impact of these mutations frequently requires integration of multiple modalities of genomic data. For example, large regions of the genome are often deleted or amplified in tumors23. These regions can encompass hundreds of genes, and a key question is which of these gene alterations are beneficial to the tumor. Two approaches can quickly identify candidate driver genes prior to functional validation: 1) looking for amplification peaks or deletion troughs within the altered segment, and 2) integrating RNA and protein data to differentiate which of the genes are actually affected by the DNA changes.

In some cases, the functional nature of the cancer-associated gene requires integrated genomics methods to understand its role in cancer. Alterations in cancer-associated genes can have global effects across entire classes of molecules. For example, mutations in the chromatin-remodeling gene SMARCA4 (encoding the BRG1 protein) can alter the fidelity and openness of hundreds of DNA regions throughout the genome24. These changes can ultimately lead to the dysregulation of other cancer-associated genes to aid in tumorigenesis. The full characterization of the oncogenic and tumor-suppressive potentials of BRG1 will ultimately require epigenetic, genomic, transcriptomic, and proteomic data, as well as methods to integrate the data.

1.3 Correspondence between RNA and protein abundances

Although tremendous numbers of tumors have undergone proteomic or transcriptomic profiling, few studies have performed both. Results from studies that integrate both RNA and protein data are termed proteogenomic25. Although these studies nominally are attempting to measure the same phenotypes, there are advantages to measuring both RNAs and proteins. Proteogenomic pipelines have been developed that utilize the ability of RNA sequencing to detect alternative splicing events, fusion proteins, somatic mutations, and RNA edits in proteomics experiments by creating custom peptide libraries26. Without these custom peptide libraries, peptide quantification pipelines might miss the most important altered proteins, but it is unclear whether these methods confer a great deal of sensitivity for altered proteins27. Aside from creating custom peptide libraries, there is evidence that protein and RNA data can contain independent and important information28.

Although there are many genomic levels of regulation in the cell, the translation of proteins from RNA is perhaps one of the most tightly regulated processes, due in large part to the steep functional consequences and energetic requirements of protein synthesis. Protein-coding mRNA levels can fluctuate wildly while the corresponding protein abundance changes little or not at all29. As a consequence, RNA and protein measurements are interconnected but nevertheless contain independent information. Statistically, RNA and protein measurements produced by RNAseq and shotgun proteomics experiments, respectively, are generally imperfectly positively correlated28,30–32. The source of this imperfection is controversial: some studies have suggested that experimental noise is its primary driver and that transcription essentially explains protein abundances33, while others suggest that the imperfect correlation contains important biological information about post-transcriptional regulation34. Reflecting the complexity of biological systems, it is also possible that in some contexts and for some genes, transcription and translation are essentially continuous, while for others there is a regulatory layer separating the two processes.
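As an illustration of the gene-level comparison described above, the following minimal Python sketch computes a Spearman correlation between matched mRNA and protein abundances for each gene. The function names (ranks, spearman, gene_level_correlation), the dictionary layout, and the toy values are assumptions made for this example; the cited studies may use other correlation measures and preprocessing.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def gene_level_correlation(rna, protein):
    """Correlate mRNA and protein across samples for genes present in both.

    rna and protein map a gene name to a list of abundances, one per sample,
    with samples in the same order in both dictionaries (an assumption).
    """
    return {gene: spearman(rna[gene], protein[gene])
            for gene in rna if gene in protein}


if __name__ == "__main__":
    # Hypothetical toy data: two genes measured in five matched samples.
    toy_rna = {"GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
               "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0]}
    toy_protein = {"GENE_A": [10.0, 20.0, 25.0, 40.0, 80.0],
                   "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0]}
    print(gene_level_correlation(toy_rna, toy_protein))
```

A rank-based measure is used here only because it is insensitive to the different scales of RNAseq and proteomic quantification; whether it is appropriate for a given dataset depends on the normalization applied upstream.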

The abundance of mRNAs and proteins is determined by their relative synthesis and degradation rates35. Therefore, context dependent changes to the parameters in this equation may alter the correlation between mRNA and protein abundances. Biological observations can be from steady state, where RNA and protein levels have reached equilibrium, or from a perturbed state, where levels are changing in response to a stimulus. Whether cells are in a steady or perturbed state can affect the correspondence between RNA and protein values29.
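The relationship between synthesis, degradation, and abundance can be made concrete with a simple two-equation kinetic model. The sketch below assumes constant-rate transcription, first-order mRNA and protein degradation, and translation proportional to mRNA level. This is a common textbook simplification rather than a model taken from the cited work, and all parameter names and values are hypothetical.

```python
def steady_state(k_syn, k_deg_rna, k_tl, k_deg_prot):
    """Steady-state mRNA and protein levels for the assumed model:

        dR/dt = k_syn - k_deg_rna * R
        dP/dt = k_tl * R - k_deg_prot * P

    Setting both derivatives to zero gives R* = k_syn / k_deg_rna and
    P* = k_tl * R* / k_deg_prot.
    """
    rna = k_syn / k_deg_rna
    protein = k_tl * rna / k_deg_prot
    return rna, protein


def simulate(rna0, prot0, k_syn, k_deg_rna, k_tl, k_deg_prot, t_end, dt=0.01):
    """Integrate the model with a forward Euler scheme from (rna0, prot0)."""
    rna, prot = rna0, prot0
    for _ in range(int(round(t_end / dt))):
        d_rna = k_syn - k_deg_rna * rna
        d_prot = k_tl * rna - k_deg_prot * prot
        rna += dt * d_rna
        prot += dt * d_prot
    return rna, prot


if __name__ == "__main__":
    # Hypothetical rates: the protein is degraded ten times more slowly
    # than the mRNA, so it responds more slowly to a change in synthesis.
    r0, p0 = steady_state(2.0, 1.0, 10.0, 0.1)
    # Perturbed state: transcription doubles at t = 0.
    r_early, p_early = simulate(r0, p0, 4.0, 1.0, 10.0, 0.1, t_end=2.0)
    print("mRNA fold change shortly after stimulus:", r_early / r0)
    print("protein fold change shortly after stimulus:", p_early / p0)
```

Under these assumed rates the mRNA approaches its new level well before the protein does, which is one way a perturbed state can weaken the observed mRNA-protein correspondence even when the underlying regulation is unchanged.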

The largest sources of proteogenomic data in human cancer to date have been from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)36. This project produces proteomic data from tissue previously analyzed as a part of TCGA; therefore, these tumors have extensive associated molecular and clinical data, and they have been analyzed with precise protocols. There has not yet been a lung cancer CPTAC publication; however, CPTAC has published datasets from colon and rectal32, breast31, and ovarian cancers30. These studies have shown that mRNAs and proteins from the same gene are typically well correlated, and that those that are not tend to be enriched in ribosomal, oxidative phosphorylation related, and spliceosomal genes. They also observed that proteomic and phosphoproteomic data tend to be more specific than mRNA data in discovering functional copy number alteration (CNA) effects. Copy number levels are much more predictive of mRNA abundances than they are of protein or phosphoprotein abundances, such that many copy number to mRNA relationships may not actually affect protein content in the cell. They also discovered novel subtypes in these cancer types by performing integrative clustering with the added proteomic and phosphoproteomic data.

Only a handful of studies have produced proteogenomic data from lung cancer samples, and they all have very small sample sizes. An early study by Chen et al.37 using relatively low throughput RNA and proteomic measurement techniques found that mRNA and protein abundances were essentially uncorrelated in lung adenocarcinoma biopsy samples. Recent studies by Stewart et al.38 and Li et al.39 focused on finding prognostic signatures in the proteomic data that are corroborated by the mRNA data. Specifically, Stewart et al. found a proteomic prognostic signature of metabolic genes and Li et al. created a signature from MCT1 and GLUT1 protein and mRNA abundances. In contrast to Chen et al., these studies found moderate correlation between corresponding mRNA and protein abundances. Collectively, these studies all demonstrate that proteogenomic approaches can increase our understanding of tumor biology as well as identify novel treatment targets and prognostic signatures.

1.4 Therapeutic applications of integrative genomics in NSCLC

The power of integrative genomics methods is not limited to increasing our understanding of tumor biology. In the last decade, tyrosine kinase inhibition (TKI) and immune checkpoint inhibition (ICI) have both emerged as promising therapies in NSCLC. In lung adenocarcinoma, the FDA has approved anti-EGFR, anti-ALK, and anti-BRAF TKI treatments in various contexts. Unfortunately, while these treatments tend to increase progression free survival, they are largely unable to increase overall survival in comparison to standard of care chemotherapy40. This is due in large part to resistance mechanisms that may even exist before initiation of therapy41. While a portion of patients experience long-term disease control, a majority of tumors develop resistance to TKI therapy within a year of treatment initiation42.

The development of TKI has almost exclusively benefitted patients with LUAD histology, leaving LUSC patients with few targeted treatment options2. LUAD oncogenic mutations are frequently in the RAS/RAF/MEK/ERK pathways, of which EGFR is upstream6. These pathways are drivers of growth in many cell types, and are altered in LUAD much more frequently than in LUSC patients7,8. Because of the relative lack of identifiable oncogenic alterations in LUSC, other approaches are necessary to increase overall patient survival.

Similar to TKI therapies, ICI therapies have elicited anti-tumor responses in only a minority of NSCLC patients in preliminary clinical trials. Currently, there are two FDA-approved contexts for first-line ICI therapy in NSCLC: >50% PD-L1 immunohistochemical staining in stage IV adenocarcinomas, and confirmed mismatch repair pathway mutations in any solid tumor. These two biomarkers only begin to tell the story of the heterogeneous ways in which tumors can have primary sensitivity or resistance to ICIs. Although ICIs targeting numerous molecules have been developed in preclinical studies, the only FDA-approved ICIs in NSCLC target either PD-L1 or PD-1. PD-L1 can be present on tumor cells and binds to PD-1 on T-cells, inducing an exhausted T-cell phenotype that is permissive to tumor growth43.

PD-L1 immunohistochemistry is a logical biomarker for anti-PD-1/PD-L1 therapy, and it is effective but imperfect. The Keynote-024 clinical trial recorded an impressively high response rate of 44.8% to pembrolizumab (anti-PD-1 monoclonal antibody) monotherapy in >50% PD-L1 positive patients, but this means that a majority of patients who have a positive therapeutic biomarker and receive this therapy will not benefit from it3. Two clear goals emerge from trials such as this: 1) develop more sensitive and specific biomarkers and 2) discover drug combinations that increase the number of responsive patients. The complexities of the tumor-immune cell interaction necessitate the use of integrative genomics to solve both of these tasks.

It would be appropriate to ask why it is useful for tumors to increase expression of PD-L1, a molecule that induces T-cell exhaustion, in the first place. Several studies have demonstrated that immune surveillance of nascent tumors is a constant phenomenon, and that immune evasion may be a necessary adaptation for successful tumor growth44. There is a balance between a tumor’s ability to adapt to its surroundings by producing molecules not present in germline cells and its ease of detection by the host immune system. The best-studied example of this is the ability of somatic mutations to increase the immunogenicity of a tumor, also known as tumor neoantigen burden45. This occurs through fragmentation of mutated proteins into peptides and presentation of the mutation-containing peptides by human MHC-I46. Mutated peptide fragments can be recognized by immune cells and activate an anti-tumor response. Recently it was shown that tumors are enriched in mutations in peptides that are unable to be presented by a patient’s MHC-I (there are many different genotypes of MHC-I encoding HLA genes)47.

Neoantigen burden, the number of somatic mutations predicted to bind to a patient’s MHC-I alleles, is highly correlated to the overall tumor somatic mutation burden (TMB)48. Neoantigen burden and TMB are promising therapeutic biomarkers for ICI45,49–52. Carbone et al.49 showed that first line nivolumab is not superior to standard of care chemotherapy in patients with only >5% PD-L1 positive staining in stage IV or recurrent non-small cell lung cancer; however, an exploratory analysis demonstrated superior progression-free survival and response rates in nivolumab treated patients with high TMB. In addition, they further showed that PD-L1 immunohistochemistry and TMB were uncorrelated, and that patients that were both PD-L1 high and TMB high had the highest response rates to nivolumab. Patients that were PD-L1 low and TMB low had the poorest response rates to nivolumab, and patients with mixed PD-L1 and TMB status had intermediate response rates.

In addition to PD-L1 immunohistochemistry and TMB, an interferon gamma RNA signature showed promise in an initial study of melanomas treated with ICI, but the signature was not effective on two independent validation cohorts53,54. Interpreting these and the neoantigen/mutation burden studies together, it is clear that response to ICI is a multifactorial process that requires some combination of tumor immunogenicity, a “hot” immune environment, and presence of immunoinhibitory factors that can be overcome by therapy.

Immunogenicity is not only created by neoantigenic mutations. In addition to proteins, aberrant nucleic acids can set off innate immune responses similar to those present in viral infections55,56. Immunogenicity itself is likely a multifactorial process that involves sensing of DNA, RNA, and protein by immune cells and intracellular receptors. For example, it was recently shown that demethylating agents, such as azacitidine, can synergize with ICI therapy by inducing the transcription of immunogenic endogenous retroviruses that preferentially form double stranded RNA57,58. Taken together, these studies highlight the potential for combination therapies that both induce immunogenicity and dis-inhibit the immune system. Sagiv-Barfi et al.59 recently showed great success in preclinical models with a combination therapy of an immunogenic molecule and an immune stimulator. Post-transcriptional regulation is an important source of transcriptional diversity, and can induce RNA molecules to either activate or inhibit the immune system. Accordingly, we sought to discover the role of RNA editing in immunosuppression in cancer.

1.5 A-I RNA editing

After transcription from DNA, there are several ways in which an RNA sequence can be altered prior to the RNA molecule carrying out its function. A-I RNA editing is an important and ubiquitous modification in which adenosines are deaminated into inosines, which are read by the translational machinery as guanosines. A-I RNA editing is carried out by enzymes in the double-stranded RNA specific adenosine deaminase gene family, of which there are 3 members: ADAR, ADARB1, and ADAR3. Only ADAR and ADARB1 are enzymatically active, while there is evidence that ADAR3 functions by inhibiting the other two RNA editing enzymes60. ADAR is primarily responsible for editing in non-protein-coding regions, while ADARB1 is responsible for protein-coding editing events61.
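Because inosine is read as guanosine, A-I editing appears in sequencing data as an A-to-G mismatch, and a common summary is the fraction of reads at a site that carry G. The short Python sketch below illustrates this and shows how an edit can change the codon that is read. The function names, the read counts, and the two-entry codon table are illustrative assumptions, not part of any pipeline described in this thesis.

```python
def editing_level(a_reads, g_reads):
    """Fraction of reads at an adenosine site that are read as G.

    Returns None when the site has no A or G coverage.
    """
    total = a_reads + g_reads
    if total == 0:
        return None
    return g_reads / total


def apply_editing(rna, edited_positions):
    """Return the sequence as it would be read after A-I editing.

    Each 0-based position in edited_positions must hold an adenosine;
    the resulting inosine is written as G because it is read as guanosine.
    """
    bases = list(rna)
    for pos in edited_positions:
        if bases[pos] != "A":
            raise ValueError(f"position {pos} is {bases[pos]}, not A")
        bases[pos] = "G"
    return "".join(bases)


# Two entries from the standard genetic code, enough for this example only.
CODONS = {"AGC": "Ser", "GGC": "Gly"}

if __name__ == "__main__":
    # Hypothetical site covered by 30 unedited (A) and 10 edited (G) reads.
    print("editing level:", editing_level(30, 10))
    codon = "AGC"
    edited = apply_editing(codon, [0])
    print(codon, CODONS[codon], "->", edited, CODONS[edited])
```

The codon example shows a serine-to-glycine recoding of the kind an editing event inside a coding region can produce; most editing sites fall outside coding regions, where only the editing level is relevant.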

Figure 1.2: Structure and function of the ADAR1 protein. (A) ADAR1 protein domains in the two ADAR1 isoforms, p150 and p110. (B) Immune and apoptotic functions of the ADAR1 protein. (Left) MDA5 and MAVS bind to unedited dsRNA and activate IRF3/IRF7 to induce transcription of interferon stimulated genes (ISGs). This also induces ADAR1p150 expression which edits dsRNA and creates a negative feedback loop by inhibiting IRF3/IRF7 signaling. (Right) ADAR1p110 is phosphorylated by kinases activated by cellular stress, such as UV light. Phosphorylated ADAR1p110 translocates to the cytoplasm to inhibit STAU1 binding to anti-apoptotic genes, inhibiting apoptosis.

1.6 Structure and Function of the ADAR1 protein

Adenosine deaminase acting on RNA 1 (ADAR1) is encoded by the ADAR gene on 1q21 and is a double stranded RNA (dsRNA) binding protein with enzymatic activity62. ADAR encodes 2 Z-DNA binding domains, 3 dsRNA binding domains, and a catalytic deaminase domain. The Z-DNA binding domains are thought to localize ADAR1 to nascent transcribed RNAs, making RNA editing the first alteration to occur, before binding of other RNA binding proteins, such as splicing factors63. This is consistent with the observation of RNA editing sites within introns. ADAR was first discovered as an anti-viral gene, which binds to and modifies the genomes of dsRNA viruses and modulates the immune response to them64. Consistent with its antiviral function, ADAR is an interferon response gene, and stimulation with type I and type II interferons both upregulate ADAR expression65.

There are two expressed isoforms of ADAR, ADAR1p150 and ADAR1p110. Structurally, ADAR1p150 is longer and contains exons 1-15, while ADAR1p110 lacks exon 1 and a portion of exon 2. It is not clear if ADAR1p150 and ADAR1p110 preferentially edit differing RNA sites; however, there is a functional distinction between the two that has emerged in studies conducted in the last 5 years. ADAR1p150 is primarily expressed only in the presence of interferon, and can be both cytosolic and nuclear. ADAR1p110 is constitutively expressed and is predominantly nuclear66.

Complete knockout of both ADAR alleles (ADAR-KO) is embryonic lethal in mice66,67 and is not observed in humans; however, genetic insufficiency of ADAR has been identified as a cause of Aicardi-Goutières syndrome (AGS)68. AGS presents similarly to an in utero-acquired infection, causing an interferon α mediated inflammatory syndrome with symptoms primarily in the brain and skin. Mutations affecting each functional domain of ADAR have been observed in AGS.

Several studies showed that ADAR-KO is rescued by simultaneously knocking out MDA5, one of the dsRNA sensing proteins66,67,69. Ahmad et al.70 showed that ADAR mediated editing of dsRNAs prevents the dsRNA sensing protein MDA5 from inducing an interferon response. In the absence of ADAR, MDA5 forms a positive feedback loop with interferon β to create inflammation. Crucially, Ahmad et al. also demonstrated that the primary mediators of ADAR-KO induced inflammation are double stranded, complementary inverted repeats termed Alu repeats. Approximately 50% of the genome is composed of repeated sequences, one of the most common being Alu elements71. Alu elements tend to be repeated in reverse complementarity, such that they naturally form ~300 nucleotide long double stranded structures in various areas of RNA molecules. Alu elements can be located within protein coding genes or independently. MDA5 preferentially binds to Alu:Alu dsRNA; however, ADAR edits Alu:Alu dsRNA and inhibits its ability to activate downstream interferon signaling after MDA5 binding70. Together, we have a picture of a negative feedback loop whereby ADAR is upregulated by interferon, but ADAR also downregulates interferon.

In addition to its roles in innate immunity, which are typically attributed to the ADAR1p150 isoform, ADAR1p110 has been implicated as a regulator of apoptosis. Two recent studies shed light on the apoptosis-related functions of ADAR1p110. First, Yang et al.72 demonstrated that ADAR1p110 inhibits the expression of XIAP and MDM2, two anti-apoptotic genes, by preventing STAU1 binding. STAU1 is an RNA binding protein that facilitates translocation of XIAP and MDM2 RNA into the cytoplasm, such that ADAR1p110 inhibition of this process results in decreased XIAP and MDM2 protein translation. Overall, they observed a pro-apoptotic effect from ADAR overexpression. In contrast, Sakurai et al.73 found an anti-apoptotic effect from ADAR overexpression, mediated by inhibiting STAU1 induced degradation of edited mRNAs. Both studies point out that there may be context dependent apoptosis regulatory functions of ADAR. Specifically, the two studies interrogated different targets of ADAR and STAU1, used different in-vitro models, and used different treatments to induce apoptosis. All of these sources of variation could contribute to the seemingly contradictory results presented in these papers; nonetheless, ADAR appears to be implicated in some form of apoptosis regulation.

1.7 Role of ADAR as an Oncogene in Cancer

ADAR has become the focus of intense interest in cancer research due to the high frequency of amplification of its genomic locus, 1q21. ADAR is amplified or mutated in 10% of TCGA NSCLC patients, and ADAR RNA strongly correlates with ADAR gene copy number74. Accordingly, increases in ADAR expression and RNA editing have been observed in numerous human solid tumor types, including bladder, breast, colon, head and neck, liver, lung, and thyroid carcinomas18,75. Additionally, an oncogenic role for ADAR has been proposed in chronic myeloid leukemia76.

Functionally, ADAR inhibition has been shown to slow cell growth in a number of in-vitro contexts, including in LUAD17. Anadón et al.17 found that in the ADAR-amplified lung adenocarcinoma cell lines H1993 and H1395, ADAR inhibition via siRNA caused reduced cell viability and invasiveness. Further, they showed in an athymic mouse model that ADAR inhibition resulted in lower metastatic burden, whereas ectopic overexpression of ADAR in A549-xenograft mouse models resulted in increased metastatic burden. These results are consistent with those found by Goh et al., who discovered increased rates of recurrence in breast cancer patients with 1q21.3 amplification77.

Although it is clear that ADAR1 has an oncogenic role, it is not clear which of ADAR1’s functions is beneficial to tumor cells. As described above, ADAR1 edits millions of adenosines throughout the transcriptome, and it has not been determined which of these edits could be of value. Initial studies of RNA editing in cancer focused on editing events within protein coding regions that potentially alter protein sequences78,79. Amino acid altering RNA editing sites comprise <1% of the total number of editing sites, and very few of those sites are edited at high frequencies18. RNA editing-mediated recoding of AZIN1-S367G is linked to increased risk of hepatocellular carcinoma, and suppression of this editing site decreased tumor growth in a mouse model79. It is not known if the anti-apoptotic or immune suppressive functions of ADAR contribute to its oncogenic potential.

Chapter 2: Global Transcriptome Analysis of RNA Abundance Regulation by ADAR in Lung Adenocarcinoma

2.1 Introduction

The genomic locus containing the adenosine deaminase acting on RNA (ADAR) gene is amplified in 13% of LUAD, and ADAR has been functionally confirmed as an oncogene in lung17, breast80, stomach81, chronic myeloid leukemia76 and liver79 cancers.

Knockdown of ADAR in LUAD is associated with reduced in-vitro cell viability, and decreased metastatic potential in xenograft mouse models17. Therefore, therapies targeting ADAR-amplified tumors, together with a greater functional understanding of these tumors, have the potential to greatly increase the number of treatable cases of LUAD.

As covered in Chapter 1, adenosine-to-inosine (A-I) RNA editing is mediated predominantly by the ADAR gene and, to a lesser extent, the ADARB1 gene. This modification occurs at millions of sites across the transcriptome, mostly within Alu repeats82. Previous studies of the oncogenic role of ADAR in cancer have focused on the editing of coding sequences of tumor-associated proteins, such as AZIN179 and NEIL117.

Although this role has been confirmed in functional studies of hepatocellular carcinoma, less than 1% of known RNA editing sites reside in coding sequences18. The function of

RNA editing in non-protein coding regions is only known for a few edited genes. For example, it has been shown that ADAR-mediated RNA editing changes the accessibility to the HuR RNA binding protein (RBP) within the cathepsin S (CTSS) 3’ UTR, which enhances its mRNA’s stability in endothelial cells83. In addition, Zhang et al.84 showed that RNA editing of the MDM2 3’ UTR segment can abolish miR-200b mediated repression of the MDM2 mRNA. Recently, RNA editing of non-coding regions by ADAR has been implicated in immune regulation65–67,69 and apoptosis72,73, but it is not clear how these functions are related to ADAR’s oncogenic potential.

Here, we investigate the role of mRNA abundance regulation by RNA editing in human LUAD. We create a pipeline that inputs RNA gene abundances and RNA editing frequencies, and outputs potentially regulatory pairs of RNA editing sites and mRNA target genes. This pipeline is applied to TCGA LUAD RNA sequencing data to obtain a global picture of the RNA regulatory editome. We identify enrichment of both apoptosis and immunoregulatory genes potentially regulated by ADAR-mediated RNA editing.

ADAR is alternatively spliced into two separate isoforms: a short, nuclear, and constitutively expressed p110 isoform; and a long, cytoplasmic, and interferon inducible p150 isoform66. The p110 isoform is responsible for fine tuning apoptosis, at least in part via modulating accessibility to Staufen 1 binding sites in edited mRNAs72,73. The p150 isoform edits dsRNAs and prevents them from activating the MDA5-MAVS interferon response. Neither of these functions of ADAR has been linked to cancer, although it was seen by Fumagalli et al.80 that ADAR RNA expression is jointly explained by ADAR genomic copy number and STAT1 expression, a marker of interferon activity. We further establish that ADAR genomic copy number (CN) is negatively associated with immune and apoptosis pathways, as well as immune cell signatures, establishing potential oncogenic roles for ADAR in LUAD.

2.2 Materials and methods

Data

RNAseq abundances, copy number data, and editing frequencies were downloaded for lung adenocarcinoma (LUAD) matched normal and tumor samples.

RNAseq gene level expression (RNAseqV2 RSEM) and copy number data were downloaded from Broad Institute Firehose. GISTIC 2.0 copy number data were further processed to gene level as follows. If a gene is contained within a segment as defined in the GISTIC-processed data, the value of that segment is assigned to the gene. Editing frequency data were downloaded from synapse.org18. Editing frequency is the number of edited reads covering a given editing site divided by the total number of reads covering that editing site. Editing sites are annotated in the RADAR85 database, which collects and curates functionally validated A-I RNA editing sites.
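The two data-processing rules described above can be illustrated with a minimal Python sketch. This is not the code used in this study; the function names, the tuple layout of genes and segments, and the choice to return None for uncovered sites or unmatched genes are all assumptions made for illustration.

```python
def editing_frequency(edited_reads, total_reads):
    """Edited reads divided by all reads covering the site.

    Returns None when the site has no coverage (an assumption of this sketch).
    """
    if total_reads == 0:
        return None
    return edited_reads / total_reads


def gene_level_copy_number(gene, segments):
    """Assign to a gene the value of the segment that fully contains it.

    gene:     (chromosome, start, end)
    segments: iterable of (chromosome, start, end, value)
    Returns None if no segment contains the whole gene (hypothetical choice).
    """
    chrom, start, end = gene
    for seg_chrom, seg_start, seg_end, value in segments:
        if seg_chrom == chrom and seg_start <= start and end <= seg_end:
            return value
    return None


# Toy coordinates, not real genomic positions.
toy_segments = [("chr1", 0, 1000, 0.1), ("chr1", 1001, 5000, 0.8)]
toy_gene = ("chr1", 2000, 2500)
toy_cn = gene_level_copy_number(toy_gene, toy_segments)
toy_freq = editing_frequency(3, 12)
```

A gene that spans a segment boundary receives no value under this rule, which matches the containment criterion stated above.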

Identification of regulatory edits

First, RNA editing sites were matched to the genes containing them. Often, there are multiple RNA editing sites within each gene. Then, for each gene, the frequency of each RNA editing site was tested for association with the RNA abundance of its host gene using Spearman correlation. Significant edits were determined using R’s (v1.0.136) built-in test for the significance of a correlation and corrected for multiple hypothesis testing (Benjamini-Hochberg corrected q-value < 0.1).

All correlations performed in this study are Spearman correlations.
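As an illustration of the statistics named above, the following Python sketch computes a Spearman correlation (Pearson correlation of average ranks) and Benjamini-Hochberg q-values. The analysis in this study was performed in R; the function names here are hypothetical, and the p-values passed to the correction are assumed to come from a separate correlation test.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation; returns None if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

Under the threshold used in this study, an editing site would be called regulatory when its q-value is below 0.1.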

RNA binding protein and microRNA motif analysis and secondary structure visualization

Regulatory edits were grouped by gene and searched for continuously edited regions (CER). A CER was defined as greater than 5 sequential significant RNA regulatory editing sites, with no two consecutive RNA editing sites being separated by more than 100 base pairs. These CERs were then input as a bed file into the RBPmap tool86, which searches for RNA binding protein (RBP) motifs within genomic regions. We used the following parameters in our searches: high stringency, hg19 reference, all Human/Mouse motifs, and no conservation filter. MicroRNA motif enrichment was calculated using the TargetScan web tool87, and microRNAs with combined score < -0.3 were considered for further analysis. Secondary structure prediction was made with the Forna88 web app.
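The CER definition above can be sketched in Python as follows. This is an illustrative reading of the definition rather than the code used in this study: the function name is hypothetical, the input is assumed to be the genomic positions of the significant regulatory editing sites within one gene, and "greater than 5" is implemented as a minimum of 6 sites.

```python
def find_cers(positions, max_gap=100, min_sites=6):
    """Group significant editing-site positions into continuously edited regions.

    Consecutive sites more than max_gap base pairs apart start a new run.
    Runs with at least min_sites sites are returned as (start, end, n_sites).
    """
    cers = []
    run = []
    for pos in sorted(positions):
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_sites:
                cers.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    if len(run) >= min_sites:
        cers.append((run[0], run[-1], len(run)))
    return cers


# Toy positions: one run of six closely spaced sites, then two distant sites.
toy_cers = find_cers([100, 150, 220, 300, 390, 480, 1000, 1050])
```

Each returned tuple gives the start and end coordinate of a region, which could then be written as one line of a bed file.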

Pathway and Immune cell analysis

Pathway scores for tumor samples were calculated using single sample gene set enrichment analysis (ssGSEA)89. Immune exclusion and tumor purity scores were downloaded from Aran et al.90 and converted to immune infiltrate scores (1-SCORE).

Immune cell subset analysis was performed using the TIMER web app91.

Histological slide evaluation of immune cell subsets

Histological slides for TCGA LUAD tumors were visualized using the digital slide archive92. We selected a subset of tumors such that there was roughly equal representation of ADAR amplified and normal copy numbers. In total, 97 LUAD tumors, 45 of which had no evidence of ADAR amplification, and 52 of which had high ADAR copy number, were selected for analysis, and we conducted the experiment with no knowledge of the ADAR CN status of these tumors. We were able to visually estimate neutrophil, lymphocyte, and macrophage infiltrations, as well as necrosis, on a scale of 0-3.

2.3 Results

Pipeline for discovery of regulatory A-I RNA editing sites

We created a pipeline for discovery of regulatory RNA editing sites as follows.

We first matched each RNA editing site with its host RNA abundance (Figure 2.1A), then tested all RNA edits for their association with host RNA abundance (Figure 2.1B-C). All relevant code can be found at github.com/michaelsharpnack/RNA_edits.

Figure 2.1: Pipeline to discover regulatory RNA editing sites. Editing sites annotated in RADAR database are matched to their host gene (A). The regulatory potential of each editing site is discovered by testing the association between editing frequency and host RNA abundance (B). In this case, one editing site is significantly positively associated with RNA abundance (A1, maroon) and is classified as a potential regulatory editing site (C).

Landscape of regulatory editing sites in lung adenocarcinoma

Previous papers have focused on the potential for protein coding RNA editing sites to modify oncogene and tumor suppressor functions; however, the frequencies of many non-coding RNA editing sites are significantly increased in LUAD. In the TCGA LUAD cohort, 4115 non-protein-coding RNA editing sites were differentially edited (t-test, BH q-value < 0.1) while only 20 protein coding RNA editing sites were differentially edited. We therefore applied our regulatory RNA editing pipeline to the TCGA LUAD RNAseq dataset to discover alternative functions of the ADAR oncogene.

RNAseq data for 488 LUAD tumor and 57 matched normal samples were run through our regulatory RNA editing pipeline. These data include 54,362 frequently edited sites, and 52,276 RNA editing site-RNA abundance combinations tested for their regulatory potential. 5,468 (10%) of the edit-gene combinations were predicted to have regulatory potential across 1,413 genes (Figure 2.2A). The majority of the significant RNA editing sites had a positive association with RNA abundance (4,976 or 91%, Figure 2.2B). It is possible that the enrichment of positive RNA editing-mRNA abundance associations is due to a bias in detection, given that we would not be able to measure frequently edited transcripts that are degraded. Gene expression and gene length are both potential sources of bias in regulatory RNA editing site detection. The expression of each gene with detectable RNA editing sites is significantly, but very weakly, associated with the ability to detect regulatory RNA editing sites (ρ = 0.059, p = 5 × 10^-4). Gene length and the number of Alu elements within each gene are also not strongly associated with the number of detected regulatory RNA editing sites (ρ = -0.066, p = 1 × 10^-4 for total gene length; and ρ = 0.028, p = 0.10 for the number of Alu elements present in the gene body).

293 of the genes tested (8.5%) had more than 5 predicted regulatory RNA editing sites, and the 50 genes with the most predicted regulatory RNA editing sites are shown in Figure 2.2C. Regulatory relationships with ADAR have only been investigated for a small number of these genes, including CTSS 83 and MDM2 84. Remarkably, our method was able to discover these established regulatory relationships. CTSS is the best-studied example of gene regulation by ADAR, and CTSS is the top hit according to our algorithm.

In addition, we tested whether putative regulatory relationships are enriched among evolutionarily conserved RNA editing sites. 1664/11319 (14.7%) of the RNA editing sites conserved in rhesus or chimp were significant, compared with 4475/40957 (10.9%) of the non-conserved sites, indicating a weak but significant enrichment of regulatory RNA editing sites in evolutionarily conserved RNA editing sites (Fisher’s exact test, p < 2.2e-16).
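The counts reported above can be arranged as a 2x2 contingency table, which makes the size of the enrichment explicit. The Python sketch below only lays out the table and computes the two proportions and the sample odds ratio; it does not reproduce the Fisher’s exact test p-value, and the function name is hypothetical.

```python
def conservation_table(sig_conserved, n_conserved, sig_other, n_other):
    """Build the 2x2 table (rows: conserved, non-conserved; columns:
    significant, not significant) and summarize the enrichment."""
    table = [
        [sig_conserved, n_conserved - sig_conserved],
        [sig_other, n_other - sig_other],
    ]
    frac_conserved = sig_conserved / n_conserved
    frac_other = sig_other / n_other
    # Sample odds ratio: (a * d) / (b * c) for the table [[a, b], [c, d]].
    odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
    return table, frac_conserved, frac_other, odds_ratio


# Counts taken from the text above.
table, frac_c, frac_o, odds = conservation_table(1664, 11319, 4475, 40957)
```

The resulting odds ratio is roughly 1.4, consistent with the description of the enrichment as weak but significant. The table could also be passed to a routine such as `scipy.stats.fisher_exact` to obtain a p-value.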

We next sought to discover which regulatory relationships are present in both tumor and normal, and which are only present in tumor. Because there are far fewer normal samples than tumor samples, instead of comparing statistical significance of RNA editing sites between tumor and normal, we compared the Spearman correlations. For each gene, we calculated the Spearman correlation between the tumor and normal regulatory correlations of all the RNA editing sites within that gene. A high tumor-normal correlation indicates that RNA editing mediated regulation of a gene is not cancer specific. Of the 344 genes with at least 5 regulatory RNA editing sites in tumors, 54/344 (16%) had a positive correlation between tumor and normal regulatory RNA editing sites of ρ > 0.3, while 16 had a negative correlation of ρ < -0.3. For the remaining 274/344 (80%) genes, the regulatory edits appear to be cancer-specific, especially in the genes with the highest numbers of RNA editing sites (Figure 2.2E).

Figure 2.2: Landscape of regulatory RNA editing sites in LUAD. (A) Plot showing the distribution of RNA editing site-mRNA abundance correlation and the significance of the association. The red line denotes q-value < 0.1 significance threshold. Histogram of regulatory RNA editing sites’ association to RNA abundance (B). Genes are ranked by the number of predicted regulatory RNA editing sites within their RNA molecule, and the top 50 are shown in (C). Edits are grouped by their host gene and RNA editing site-RNA Spearman correlations are compared between tumor and normal (D). Genes with high tumor-normal regulatory RNA editing site correlation have regulatory RNA editing sites with similar regulatory potential in both tumor and normal. The tumor-normal edit correlation is plotted against the number of regulatory edits in each host RNA molecule (E).

The ADAR gene regulates the majority of RNA editing sites, while the ADARB1 gene regulates a relatively small portion61. Because ADAR and ADARB1 are upregulated and downregulated in LUAD, respectively, the majority of RNA editing sites are over-edited in tumors. There are 756 differentially edited regulatory RNA editing sites across 242 differentially expressed genes. Given that the majority of regulatory RNA editing sites are positively associated with target RNA expression, one would expect to see that over-edited genes also tend to be overexpressed. In fact, 558/756 (74%) of the regulatory RNA editing sites are over-edited while their host genes are overexpressed, and 93% of these RNA editing sites are positively associated with RNA abundances in tumors. Of the remaining RNA editing sites, 137/756 (18%) are over-edited with host genes that are underexpressed. Unsurprisingly, 41% of these editing sites are negatively associated with RNA abundances in tumors. Together, these results show that ADAR overexpression in tumors potentially controls cancer-specific differential expression of hundreds of genes in LUAD.

An alternative hypothesis to RNA-editing mediated mRNA abundance regulation is that mRNA abundances are merely associated with RNA editing frequencies. In particular, ADAR is upregulated by interferon signaling, which could also regulate genes that are edited by ADAR. To isolate the effects of ADAR on its target genes, we compared the results from ADAR knockdown experiments on human B cells from Wang et al.93 to our results in human LUAD. Interferon pathways were not significantly enriched among differentially expressed genes after ADAR knockdown in their study. Wang et al. found 105 genes with evidence of RNA editing that had both alterations in RNA editing frequency and mRNA abundance after ADAR knockdown via siRNA. 84/105 (80%) of these genes also displayed evidence of editing in the TCGA LUAD dataset, and 54/84 (64%) of the genes found to be regulated by RNA editing were also putative regulated genes found by our method. We performed a permutation test to assess the significance of this result by selecting 10^5 random subsets of 1,413 genes from the 3,412 genes with evidence of editing. In none of these iterations did a random gene set have a greater than 64% overlap with the set from Wang et al., corresponding to a probability of < 1 × 10^-5 that the overlap between our results and the results of Wang et al. occurred randomly. For example, APOL1, CFLAR, DAP3, EIF2AK2, and MAVS were all putative regulated genes found by both our method and Wang et al.
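The permutation test described above can be sketched in Python as follows. This is a simplified illustration rather than the code used in this study: the function name, the use of integer gene identifiers, the fixed random seed, and the choice to count random overlaps that are at least as large as the observed overlap are all assumptions of the sketch.

```python
import random


def overlap_permutation_test(universe, reference, n_selected,
                             observed_overlap, n_iter=1000, seed=0):
    """Fraction of random gene sets whose overlap with the reference set
    is at least the observed overlap.

    universe:         all genes with evidence of editing
    reference:        genes reported by the external knockdown study
    n_selected:       size of each random gene set
    observed_overlap: overlap seen with the real predicted gene set
    """
    rng = random.Random(seed)
    universe = list(universe)
    reference = set(reference)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(universe, n_selected)
        overlap = sum(1 for gene in sample if gene in reference)
        if overlap >= observed_overlap:
            hits += 1
    return hits / n_iter


# Toy call mirroring the set sizes in the text; gene identifiers are placeholders.
toy_universe = range(3412)
toy_reference = range(84)
toy_p = overlap_permutation_test(toy_universe, toy_reference,
                                 n_selected=1413, observed_overlap=54,
                                 n_iter=200)
```

When no random set reaches the observed overlap, the empirical estimate is zero and is better reported as an upper bound of one over the number of iterations.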

The APOL1 3’ UTR contains multiple cancer-specific regulatory editing sites controlled by ADAR

The CTSS and Apolipoprotein L1 (APOL1) genes had the largest number of predicted regulatory RNA editing sites in LUAD (Figure 2.2C). Since an ADAR-CTSS regulatory relationship has already been established, we investigated the RNA editing patterns within the 3’ UTR of the APOL1 gene. APOL1 functions are not well characterized; however, it circulates with the densest high density lipoprotein (HDL) subfraction, HDL394. APOL1 variants have been implicated in chronic kidney disease95, and they likely evolved due to their protective effects against T. brucei rhodesiense infections96. APOL1 is upregulated after interferon treatment or TLR3 stimulation via an IRF3-dependent pathway, and high APOL1 expression in HIV-infected patients with an interferon response contributes to chronic kidney disease97. Despite its role in innate immunity and status as an interferon-regulated gene, APOL1 has not yet been linked to ADAR-mediated regulation.

Figure 2.3: The APOL1 3’ UTR contains multiple regulatory editing sites controlled by ADAR. APOL1 3’ UTR contains several RNA editing sites which are edited at a very high frequency (A), and most are significantly positively correlated to APOL1 mRNA abundance (B). APOL1 editing frequencies (C) and RNA abundances (D) are both positively correlated with ADAR RNA abundance. Figure 2.3E shows for each patient the difference in chr22:36662382 editing frequency and APOL1 mRNA abundance between tumor and normal. APOL1 tends to be over-edited and overexpressed in the same tumors (Rho = 0.27, p = 0.056). *q < 0.1.

APOL1 editing sites are frequently edited: 21/74 (28%) sites have greater than 10% average editing frequency across all patients in LUAD (Figure 2.3A). 71/74 (96%) editing sites in the APOL1 3’ UTR that are consistently edited in LUAD have predicted regulatory roles, and editing of these sites is positively associated with APOL1 RNA abundance (Figure 2.3B). To test whether ADAR or ADARB1 controls the editing of APOL1, the editing frequencies of the predicted regulatory sites were correlated with the RNA abundances of ADAR and ADARB1, respectively. The editing frequencies of the majority of these sites, as well as APOL1 RNA abundance, were significantly correlated with ADAR RNA abundance, indicating that ADAR, not ADARB1, controls them (Figure 2.3C-D). In addition, Wang et al.93 found that after siRNA knockdown of ADAR, both APOL1 editing and gene expression decreased.

We next sought to discover if the regulation of APOL1 could be cancer related. APOL1 is overexpressed in LUAD compared to matched normal tissue samples (q < 0.1), and 26/74 (35%) of its RNA editing sites show increased evidence of editing in cancer. In tumors where APOL1 is over-edited, it also tends to be overexpressed (Figure 2.3E).

Together, this evidence suggests that ADAR overexpression and associated increased editing in LUAD potentially induces overexpression of APOL1.

Since our method provides nucleotide level regulatory hypotheses, we performed a secondary structure analysis on the edited regions of the APOL1 3’ UTR. We discovered that many of the regulatory RNA editing sites change the predicted secondary structure, and several of these regulatory RNA editing sites potentially affect RBP and microRNA motifs, including numerous predicted PTBP1 binding sites. PTBP1 is overexpressed in tumors and correlated with ADAR expression. However, the predictions of RNA secondary structure, and RBP and microRNA motifs are not perfect and require functional validation.

RNA editing of the APOL1 3’ UTR is associated with poor overall survival in lung adenocarcinoma

We next investigated the clinical importance of APOL1 in LUAD. We separated the tumors into APOL1 high and APOL1 low based on the mean APOL1 expression across all tumors and performed survival analysis (Figure 2.4A). High APOL1 expression is significantly associated with poor prognosis (Chi-Square test, p = 0.0066). In addition, high APOL1 expression is associated with poor survival using the survival meta-analysis tool PRECOG (precog.stanford.edu)98, which compiles the results of several gene expression datasets in lung adenocarcinoma (z = 3.37, p < 10^-3). To determine if this effect was due to RNA editing or other modes of APOL1 regulation, we then repeated this survival analysis across all 74 editing sites in the APOL1 3’ UTR. 30/74 (41%) of the RNA editing sites within the 3’ UTR of APOL1 are significantly associated with poor prognosis in LUAD (Chi-Square test, q-value < 0.1). The most significant edit (q = 0.020) was also the closest to the APOL1 coding region, located at chr22:36662161 (Figure 2.4B). Editing of this site is predicted to remove a kink in the secondary structure of the APOL1 3’ UTR. An alternative hypothesis is that overall editing rates are prognostic, and not just specific sites; however, Paz-Yaacov et al.75 investigated this hypothesis and did not report a significant association between global editing rates and overall survival in LUAD.

Figure 2.4: Survival analysis of APOL1 RNA expression and APOL1 editing sites.

Tumors are separated into low vs. high APOL1 expression based on mean expression (A).

Tumors are separated into low vs. high chr22:36662161 editing frequency (B). P-values are calculated from Chi-Square tests.

ADAR regulated genes come from apoptosis and innate immune related pathways.

We performed functional enrichment99,100 of the genes with five or more regulatory RNA editing sites in LUAD and found enrichment of apoptosis related genes (Figure 2.5A). In particular, we found that DFFA, CASP6, CASP8, CASP10, MDM2, and CFLAR all contain multiple predicted regulatory RNA editing sites (shown in parentheses). These genes are inhibitors of the FAS ligand induced apoptotic pathway and except for MDM2 84, have not yet been linked to ADAR-mediated regulation (Figure 2.5B).

In addition, we found enrichment of two pathways related to the innate immune functions of ADAR (significant by p-value but not after multiple hypothesis correction). Most importantly, EIF2AK2 (PKR), EIF2S3 (EIF2γ), DDX58 (RIG-I) and MAVS contain numerous regulatory edits (Figure 2.2). These proteins are all interferon stimulated, and act in concert to inhibit the replication of dsRNA viruses. PKR phosphorylates EIF2α and leads to inhibition of the EIF2 complex, of which EIF2γ is the core subunit101 (Figure 2.5C). Inhibition of the EIF2 complex halts translation and leads to responses102. Our results predict that ADAR positively regulates these genes via editing of their 3’ UTRs. In addition, Li et al.103 noted that ADAR knockdown in interferon stimulated human LUAD A549 cells prevented the expected upregulation of EIF2AK2, suggesting that ADAR is necessary for interferon-mediated EIF2AK2 expression. The authors were unable to explain this finding; however, our method predicts that ADAR upregulates EIF2AK2 via editing of its 3’ UTR103.

Interlinked with stress granule formation in the innate immune response to dsRNA is the MDA5-MAVS complex. MDA5 binding to unedited dsRNA molecules is the first step of the dsRNA antiviral interferon response104. The MAVS protein binds to MDA5, forms filaments on the mitochondrial membrane, and acts as an adaptor in this process105. MAVS contains both positive and negative regulatory RNA editing sites in LUAD. In fact, RNA editing sites within the MAVS 3’ UTR that have higher mean editing frequency tend to be more negatively associated with MAVS abundance (Rho = -0.57, p = 6.6 × 10^-11). This suggests that there is a more complex regulatory process governing MAVS RNA abundance. This pattern is present in other RNA molecules with multiple negative regulatory sites, such as LIMD1 and VHL.

It is important to note that the same bias that has caused enrichment in positive regulation by RNA editing in a majority of regulatory RNA editing sites may also bias the pathway enrichment analysis we performed. In addition, we performed an analysis to uncouple the upstream effects of interferon pathway expression on potential regulatory RNA editing sites. We split patients into four groups based on interferon pathway expression and RNA editing levels at each RNA editing site, and then compared the gene expression for genes with potential regulatory RNA editing sites between these groups. We tested 6,139 regulatory RNA editing sites for potential regulation by RNA editing independent of interferon pathway expression. 2,494 regulatory RNA editing sites were measured in a sufficient number of tumors (N > 200) for comparison, and we found evidence of widespread regulation of their host genes by RNA editing independent of interferon pathway expression. In addition, there are numerous genes whose expression is independently associated with both RNA editing levels and interferon levels. For example, EIF2AK2 is highly expressed in tumors with high 3’ UTR RNA editing and high interferon pathway expression; moderately expressed in tumors with either high 3’ UTR RNA editing or high interferon pathway expression; and lowly expressed in tumors with low 3’ UTR RNA editing and low interferon pathway expression. This result is consistent with the finding by Li et al.103 that both ADAR and interferon expression are necessary to induce EIF2AK2 expression.

Figure 2.5: RNA-editing regulated genes in LUAD are enriched in apoptosis and innate immune-related genes. Biocarta pathway enrichments of genes with 5 or more predicted regulatory RNA editing sites (A). Pathway diagram showing the genes that are enriched in the apoptosis (B) and innate immune-related pathways (C). Blue and red colored proteins indicate positive and negative regulation by ADAR-mediated RNA editing, respectively. The MAVS protein is colored in both blue and red because there are both positive and negative regulatory edits within its 3’ UTR.

ADAR copy number is anti-correlated with immune and apoptotic signatures in lung adenocarcinoma.

To investigate the role of ADAR in immune and apoptotic pathways we calculated the pathway enrichment in each tumor sample using single sample gene set enrichment analysis (ssGSEA)89 on the 50 MSigDB hallmark gene sets106. ADAR was initially identified as an oncogene due to its frequent amplification in several human cancers18. As such, we investigated the association between ADAR CN and pathway activity, and found that for many pathways, the association between ADAR CN and pathway activity is the opposite of the association between ADAR RNA abundance and pathway activity, in spite of a high ADAR CN-RNA correlation (ρ = 0.50). We found that ADAR CN is strongly negatively associated with apoptotic activity in LUAD (Figure 2.6A). ADAR RNA abundance was not correlated with the apoptosis pathway in LUAD. Similarly for immune related measures, ADAR RNA is positively correlated with the interferon alpha pathway but ADAR CN is negatively associated with it (Figure 2.6A). One possible explanation for this finding is that separate ADAR isoforms have opposing effects on immune and apoptotic pathways. The ADAR p150 isoform contains exon 1 while ADAR p110 does not; however, the exons are too highly correlated to distinguish between their expression. The lowest correlation between exon 1 and exons 2-15 expression is 0.77. Therefore, we cannot distinguish between the expression of ADAR p150 or ADAR p110 in this cohort.

Figure 2.6: ADAR amplification is associated with decreased immune cell concentrations in lung adenocarcinoma. Figure 2.6A shows the correlation between ADAR RNA and CN abundances and relevant pathway, tumor purity, and immune infiltrate signatures. The TIMER app was used to estimate immune cell subtype concentrations in LUAD tumors with known ADAR CN status (B). Tumors with ADAR CN gain or amplification had evidence of significantly fewer CD8+ T Cells, CD4+ T Cells, Macrophages, Neutrophils, and Dendritic Cells. Histological slides were scored for their lymphocyte concentrations on a 0-3 scale (C). Significance codes for (B): 0 ≤ *** ≤ 0.001 ≤ ** ≤ 0.01 ≤ * ≤ 0.05.

Given the finding that ADAR CN is negatively associated with signatures of apoptosis and innate immunity, we investigated the association between ADAR CN and immune infiltrates. We first downloaded tumor purity and immune infiltrate data from Aran et al.90 and then used the TIMER app91 to test for associations with specific immune cell types. We found that ADAR CN is significantly negatively associated with all markers of immune infiltrate and tumor purity used in Aran et al. (Figure 2.6A). Given the strength of this association, we searched the entire genome for genes whose copy numbers negatively correlate with immune infiltrate signatures in LUAD. We found that the genes most negatively associated with immune infiltrates reside on the 1q21 locus, and form a coherent amplicon that includes ADAR. We further discovered that tumors with ADAR genomic gains and amplifications have significantly lower predicted concentrations of CD8+ T cells, CD4+ T cells, macrophages, neutrophils, and dendritic cells (Figure 2.6B). It is possible that other genes located at the 1q21 locus mediate this effect, and the causal role of ADAR in immune exclusion requires functional validation.

To confirm this finding, we graded LUAD tumors based on their apparent infiltration of different immune cell types on a 0-3 scale in histological slides from TCGA92. We selected 97 LUAD tumors, 45 of which had no evidence of ADAR amplification, and 52 of which had high ADAR copy number, and conducted the experiment with no knowledge of the ADAR CN status of these tumors. We were able to visually estimate neutrophil, lymphocyte, and macrophage infiltrations, as well as necrosis. While no apparent pattern was observed for neutrophils, necrosis, and macrophages, there was a trend towards fewer lymphocytes in ADAR CN high tumors (t-test, p = 0.01) (Figure 2.6C).

The fundamental question is, does RNA abundance regulation by RNA editing account for the differences between ADAR RNA and copy number pathway enrichments?

We examined the independent effects of interferon pathway expression and ADAR copy number on editing levels by performing a differential editing analysis with one of the variables controlled. The differences in editing between copy number high vs. low and interferon high vs. low are extremely similar (ρ = 0.76). In other words, the same RNA editing sites are changed whether ADAR is upregulated by interferon or by copy number amplification. This is evidence against RNA abundance regulation by RNA editing being the driving force behind the intransitive relationship that we see between ADAR RNA, copy number, and interferon expression.

2.4 Discussion

Despite the prevalence of A-I RNA editing of non-coding regions within mRNAs, very little research has been done to elucidate the functions of these alterations in cancer.

To remedy this situation, we performed a comprehensive analysis of RNA abundance regulation by RNA editing. These results are a resource to better understand the regulation of thousands of genes, and a full list of genes with regulatory edits is provided in the supplementary data of the accompanying publication. Despite confirmation of some of these results in an ADAR knockdown experiment, further functional validation is necessary to confirm the causality of these regulatory relationships. This method can be run on any dataset of matched RNA abundance and RNA editing frequencies, and it provides nucleotide level information to better guide validation experiments. For example, our method could be used as a resource to understand the RNA regulatory landscape of normal tissues from previously published data61. A limitation of our method is a lesser ability to detect negative compared to positive regulatory relationships. Detection of negative relationships may require time-lapse functional experiments, where the degradation of heavily edited transcripts can be tracked in real time.

In addition, an integrative analysis of ADAR CN and pathway enrichments showed that despite the high correlation between ADAR CN and RNA, ADAR copy number and RNA tend to have opposite relationships with immune and apoptosis pathways. ADAR is an interferon-stimulated gene66, and the positive correlation between ADAR RNA and immune-regulated pathways could point towards ADAR being upregulated in a compensatory fashion by interferon stimulation. Sensing of aberrant DNA or RNA, such as by Toll-like receptors and MDA5, can cause this interferon stimulation. When ADAR is genomically amplified, these patterns could imply that unregulated increases in ADAR expression constitutively repress interferon in the tumor microenvironment and drive tumor immune evasion. It has been shown previously that poly(I:U) dsRNA is sufficient to decrease interferon responses107. In addition, this finding raises the possibility of a synergism between ADAR inhibition and immunotherapies. Three studies recently discovered that DNA methylation inhibitors induce an interferon response via expression of aberrant dsRNAs, and can even induce sensitivity to immune checkpoint therapies57,58,108. In a similar manner, ADAR inhibition could be a new addition to combination immunotherapies.


Chapter 3: Proteogenomic Analysis of Surgically Resected NSCLC

3.1 Introduction

Five-year survival of patients with surgically resected, early stage lung adenocarcinoma ranges from 50-70%109, and adjuvant chemotherapy improves this survival by only a small amount. An accurate prediction of the risk of tumor recurrence at the time of surgery could potentially spare patients the toxicity of adjuvant chemotherapy, and target other patients for increased therapy and surveillance. Many previous attempts have been made to predict recurrence and prognosticate outcomes after resection of lung adenocarcinomas; however, significant challenges to reproducibility and implementation have prevented the widespread use of these signatures in the clinic110. Most of the previous classifiers have evaluated empirically selected protein expression by immunohistochemistry or RNA expression patterns based on microarrays. To date, there has been no effort to compare and integrate high-content proteomic with transcriptomic approaches in carefully clinically annotated cases of this disease.

In this study, we present an integrative approach combining both transcriptomic and proteomic data. The central hypothesis of this study is that protein and mRNA measurements of lung adenocarcinoma tumors encompass independent information that can be leveraged to discover novel dysregulated genes and integrative clinical biomarkers.

There have been multiple proteogenomics studies in model systems, such as bacteria111, yeast112, and cell lines113. Initial studies in humans reiterated the poor correlation between mRNA and protein measurements37, highlighting the importance of regulation at the post-transcriptional level. Recent studies in cell lines have proposed that a greater amount of protein variation can be explained by transcription than previously thought28; however, a picture has emerged of bursts of mRNA transcription creating stable changes in protein expression in response to perturbation35. In surgically resected tumor samples, the cell states vary from perturbed to steady state, implying that mRNA-protein correlation may vary as well. For example, Wei et al.114 showed that RNA-protein correlation differs between aging and young humans and rhesus macaques. The discovery that mRNA-protein correlation is a phenotype that can be correlated with biological and clinical outcomes necessitates further studies with matched mRNA and protein measurements. Large datasets of matched RNAseq and proteomics results were published by Zhang et al.32 in colorectal and Mertins et al.31 in breast cancer samples of convenience; however, these studies were not designed to explore an integrative clinical biomarker. Recently, Zhang et al.30 published a proteogenomic dataset from high-grade serous ovarian cancer tumors, which can be separated into early and late survivors; however, there are no significant differences in mRNA and protein expression between the two groups.

Here, we investigate differential mRNA-protein correlation between recurrent and non-recurrent lung adenocarcinoma tumors. We then leverage this difference, in combination with differential mRNA and protein abundances, to predict lung adenocarcinoma recurrence with matched transcriptomic and proteomic data using a novel supervised classification algorithm.


3.2 Materials and Methods

RNAseq Data collection and Preprocessing

RNA from tumor samples resected at Vanderbilt and MD Anderson was extracted from fresh frozen tissue with Qiagen RNeasy mini kit, converted to a poly-A selected cDNA library, and paired-end sequenced on Illumina HiSeq 2000. Raw fastq files were filtered for adapters and low quality, and aligned to UCSC hg19 reference genome with TopHat2115 using default parameters. Read counts were generated with htseq-count116 using RefSeq gene definitions117. RNA from tumor samples resected at WashU was extracted from fresh frozen tissue, converted to a poly-A selected cDNA library with NuGen v2 kit, and paired-end sequenced on Illumina HiSeq 2000. Raw fastq files were filtered for adapters and low quality, and aligned to UCSC hg19 reference genome with STAR 2-pass method118,119. Read counts were generated with featureCounts120 using RefSeq gene definitions. Variants from both RNAseq datasets were extracted with samtools’ mpileup121.

The MD Anderson and Vanderbilt cohort tumor tissue preparation.

Formalin-fixed paraffin-embedded tissues of tumor resections collected at Vanderbilt and MD Anderson and at Washington University (WashU cohort) were used in protein extraction. The Vanderbilt and MD Anderson cohort tissue samples were deparaffinized using sub-x xylene (Surgipath, Richmond, IL) followed by rehydration in three ethanol washes as previously described122. Samples were homogenized in lysis buffer containing trifluoroethanol (TFE) and 100 mM ammonium bicarbonate at pH 8.0 using a Sonic Dismembrator model 100 (Fisher Scientific, Pittsburgh, PA) at 20 W for 20 s with 30 s intervals. The sonication step was repeated twice, and the samples were stored on ice between sonications. The concentration of the proteins in each lysate was measured using BCA protein assay (Thermo Fisher Pierce, Rockford, Illinois) using the manufacturer’s protocol. A total of 200 µg of lysate was reduced with 20 mM tris(2-carboxyethyl)phosphine (TCEP, Pierce, Rockford, IL) and 50 mM DTT (Sigma-Aldrich, St. Louis, MO) at 60 ˚C for 30 min followed by alkylation with 100 mM iodoacetamide (Sigma-Aldrich, St. Louis, MO) in the dark for 20 min at room temperature. The concentration of TFE was reduced to 10% of the total volume by diluting in 50 mM ammonium bicarbonate. The samples were digested with trypsin (Promega Corporation, Madison, WI) at a ratio of 1:50 (w:w) overnight at 37 ˚C followed by acidification with 0.5% TFA. Protein digests were frozen at -80 ˚C and lyophilized to dryness. The samples were re-suspended in HPLC-grade water with vortexing for 1 min and desalted using an Oasis HLB 96-well µElution plate (30 µm, 5 mg, Waters Corporation, Milford, MA) as previously described32.

WashU cohort tumor tissue preparation

The FFPE tumor tissues were deparaffinized in xylene followed by rehydration in ethanol as previously described123. The tumor tissues were homogenized in a modified lysis buffer containing 0.2% RapiGest (Waters Corporation, Milford, MA) in 50 mM ammonium bicarbonate. The lysates were incubated at 105 ˚C for 30 min and stored on ice for 5 min. The samples were sonicated using a Sonic Dismembrator model 100 (Fisher Scientific) at 20 W for 20 s with 30 s intervals. This sonication step was repeated twice, and the samples were incubated at 70 ˚C for 2 h. The protein concentration in each lysate was determined by BCA protein assay (Thermo Fisher Pierce, Rockford, Illinois) using the manufacturer’s protocol. A total of 100 µg of tissue proteins were reduced with 50 mM DTT at 60 ˚C for 30 min followed by alkylation with 100 mM iodoacetamide in the dark at room temperature for 20 min. The samples were digested with sequencing grade trypsin (Promega Corporation, Madison, WI) at a ratio of 1:50 (w:w) and 0.01% ProteaseMax surfactant (Promega Corporation, Madison, WI) at 37 ˚C for 3 h. The samples were acidified with 0.5% TFA and centrifuged at 14000 g for 15 min. The supernatant was collected and evaporated to dryness in a Speed-Vac concentrator (Thermo Scientific). The samples were stored at -80 ˚C until LC/LC-MS/MS analysis.

The Vanderbilt and MD Anderson cohort peptide fractionation by off-line high pH reverse-phase chromatography.

The samples (n = 44) were reconstituted in 400 µL of 1.0 M triethylammonium bicarbonate (TEAB) at pH 7.5 and injected into the chromatography system. Tryptic peptides were fractionated at high pH on a reverse-phase XBridge BEH C18 analytical column (250 mm x 4.6 mm, 130 Å, 5 µm) equipped with an XBridge BEH C18 Sentry guard cartridge. The separation was achieved at a flow rate of 0.5 µL/min in 10 mM TEAB and water at pH 7.5 (solvent A) and 100% acetonitrile (solvent B). A multi-step gradient with three linear gradients was used: 0-5% B in 10 min, 5-35% B in 60 min, 35-60% B in 15 min, and 70% B for 10 min before returning to the initial conditions. A total of 60 fractions were collected and recombined into 15 peptide fractions as previously described32. The samples were evaporated to dryness in a Speed-Vac concentrator and stored at -80 ˚C until LC-MS/MS runs.

The Vanderbilt and MD Anderson cohort LC-MS/MS analysis.


The protein digests were reconstituted in 50 µL of 2% acetonitrile and 0.1% formic acid. An Eksigent NanoLC 2D pump with an AS1 auto-sampler reverse-phase LC system was used for peptide fractionation. A total of 8 µg were injected and separated using 0.1% formic acid (solvent A) and 0.1% formic acid in acetonitrile (solvent B) in a packed capillary tip (Polymicro Technologies) containing Jupiter C18 resin (Phenomenex, 5 µm, 300 Å) in-line with a solid phase extraction column (packed with the same resin). The gradient was programmed to desalt the samples on the column for 15 min at 100% A prior to separation at a flow rate of 1.5 µL/min. The separation was achieved by changing the mobile phase composition from 100% A to 25% B in 50 min, then from 25% to 90% B in 65 min, and holding at 90% B for an extra 9 min. Peptides eluting from the column were ionized at 1.45 kV and analyzed with a Thermo Velos Pro dual-pressure linear ion trap mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) by data dependent acquisition. The top five MS/MS scans were acquired for every full MS scan for an m/z range from 400-2000. The method was used with an ion transfer tube temperature of 200 ˚C; S-lens RF 65%; dynamic exclusion with a repeat count of 1 and repeat duration of 1 s for an exclusion list size of 50 mass-to-charges; CID with normalized collision energy of 30%, q = 0.25, and activation time of 10 ms; the minimum intensity threshold was set to 1000 counts.

The WashU cohort LC/LC-MS/MS analysis.

For the analysis of the WashU cohort (17 samples), liquid chromatography coupled to tandem mass spectrometry was performed using a Waters nanoAcquity two-dimensional (2D) UHPLC system (Waters Corporation, Milford, MA) with two reverse-phase columns interfaced to a Thermo LTQ-Orbitrap Elite hybrid mass spectrometer (Thermo Fisher Scientific, Bremen, Germany). A total of 8 µg of protein digest reconstituted in 100 mM ammonium formate was injected using an Acquity UPLC autosampler (Waters Corporation, Milford, MA) and the peptides were fractionated online at high pH prior to analytical separation. The fractionation of peptides was achieved in the first reverse-phase column (Waters BEH C18, 130 Å, 1.7 µm, 300 µm, 100 mm) at pH 10.0 in buffer A1 (20 mM ammonium formate) by varying the amounts of solvent B1 (100% acetonitrile). The column was equilibrated at 3% B1 (v/v), which was increased to 4.7% (v/v) in 1 min, eluting the first fraction of peptides, and decreased back to 3% (v/v) B1 in the next 4 min. The column was held at 3% (v/v) B1 during separation at a steady flow rate of 2 µL/min. The percentage of solvent B1 (v/v) was stepped through 4.7%, 9.0%, 10.8%, 12.0%, 13.1%, 14.0%, 14.9%, 15.8%, 16.7%, 17.7%, 18.9%, 20.4%, 22.2%, 25.8%, and 65% over fifteen fractions. Each fraction eluted from the fractionating column was loaded onto a Waters Symmetry C18 trap column (100 Å, 5 µm, 180 µm x 20 mm) and desalted at a flow rate of 20 µL/min. The analytical separation was achieved in the second reverse-phase column (Waters HSS T3, C18, 100 Å, 1.8 µm, 75 µm x 150 mm) at pH 2.4, which was equilibrated to initial conditions: 95% (v/v) A2 (water with 0.1% formic acid) and 5% (v/v) B2 (acetonitrile with 0.1% formic acid). The subsequent separation was achieved by three linear gradients at 38 ˚C, where the % B2 was increased from 5%-9% in 3 min; 9%-30% over 44 min; 30%-40% over 5 min; and 40%-85% over 5 min at a flow rate of 0.5 µL/min. The column was held at 5% (v/v) B2 from 65-70 min before reaching initial conditions. The 2D LC was coupled to the LTQ-Orbitrap Elite via a Nanospray Flex ion source (Thermo Fisher Scientific, Bremen, Germany) containing a 30 µm inner-diameter stainless steel emitter (Thermo Fisher Scientific) with spray voltage between 1.7-1.8 kV. The Orbitrap mass spectrometer was operated in data dependent acquisition mode, where the top fifteen MS/MS scans were acquired for every full MS scan. The full MS scan was acquired in the Orbitrap MS-analyzer with resolution r = 120,000 at m/z 400 for every 10^7 charges acquired in the ion trap MS-analyzer. This acquisition was set to trigger MS/MS scans for the top fifteen most abundant m/z peaks after collision induced dissociation (CID) for an automated gain control (AGC) target value of 5000 charges. The method was programmed with an ion transfer tube temperature of 275 ˚C; S-lens RF 55%; dynamic exclusion with a repeat count of 1 and repeat duration of 15 s for an exclusion list size of 500 mass-to-charges; CID with normalized collision energy of 35%, q = 0.25, and activation time of 10 ms; the minimum intensity threshold was set to 6000 counts.

Data processing and protein identification.

For protein identification, MyriMatch version 2.1.111 was used with a customized RefSeq human database (version 54) and Pepitome version 1.0.42. The raw files generated in Xcalibur software (Thermo Fisher Scientific) for all fifteen fractions of each protein digest were used in the peptide identification. The MS/MS spectra were searched with fixed carbamidomethyl modification at cysteine, and variable acetylation at protein N-termini, oxidation of methionines, and deamidation at asparagine and glutamine (only for WashU cohort data). A maximum of two missed cleavages were allowed for every fully tryptic peptide (proline rule applied) with a minimum peptide length of six amino acids. The data were filtered in IDPicker software version 3.0.504. The proteins present in each sample were identified with a peptide false discovery rate (FDR) of 1% and a protein FDR of 4.45%. Protein groups were filtered to only include proteins with a minimum of two peptides and with spectra required per peptide. For proteogenomic analysis, protein groups identified in each sample were grouped based on the gene group and the respective number of spectral counts for each gene group per patient was recorded.

Normalization and Filtering.

Both proteomics and RNAseq datasets were normalized by dividing each patient column by the total number of counts in that column, and then multiplying by a million to get counts per million. We then filtered out any genes for which the median across more than half of the patients was zero. We only used features for which there were matching protein and RNA features from the same gene.
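As a minimal sketch of the normalization and filtering step, the following Python example scales each patient column to counts per million and then drops genes whose median across patients is zero. The dictionary layout, the function names, and the reading of the filter as "median across patients is zero" are assumptions made for illustration, not the original analysis code.

```python
from statistics import median


def counts_per_million(column):
    """Scale one patient's counts so that they sum to one million."""
    total = sum(column)  # assumes the column has at least one non-zero count
    return [c / total * 1_000_000 for c in column]


def normalize_and_filter(counts):
    """counts: dict mapping gene -> list of raw counts, one entry per patient."""
    genes = list(counts)
    n_patients = len(next(iter(counts.values())))
    cpm = {g: [0.0] * n_patients for g in genes}
    # Normalize column by column (one column per patient).
    for j in range(n_patients):
        column = counts_per_million([counts[g][j] for g in genes])
        for g, value in zip(genes, column):
            cpm[g][j] = value
    # Assumed filter: keep genes whose median CPM across patients is above zero.
    return {g: values for g, values in cpm.items() if median(values) > 0}


# Toy example with three genes and three patients (hypothetical values).
toy = {"GENE_A": [10, 20, 30], "GENE_B": [0, 0, 5], "GENE_C": [90, 80, 65]}
normalized = normalize_and_filter(toy)
```

In this toy input, GENE_B is zero in two of three patients, so its median is zero and it is removed before any RNA-protein matching.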

Differential gene expression and correlation.

We developed a novel method of differential gene expression by comparing the rank median expression of each group and dividing by the total number of genes to get a number between -1 and 1. This method is robust to outliers, simple, and non-parametric.
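One plausible reading of this rank-based score can be sketched as follows: genes are ranked by their median expression within each group, and the difference in ranks is divided by the number of genes. The exact ranking and tie-handling used in the study are not specified here, so the function names and the tie-breaking rule below are assumptions.

```python
from statistics import median


def rank_of_medians(expr, patients):
    """Rank genes (1 = lowest) by median expression over the given patient indices."""
    med = {g: median(values[i] for i in patients) for g, values in expr.items()}
    ordered = sorted(med, key=med.get)  # ties keep input order (an assumption)
    return {g: rank for rank, g in enumerate(ordered, start=1)}


def rank_median_difference(expr, group_a, group_b):
    """Difference in median-expression rank between two groups, scaled by gene count."""
    n_genes = len(expr)
    rank_a = rank_of_medians(expr, group_a)
    rank_b = rank_of_medians(expr, group_b)
    return {g: (rank_a[g] - rank_b[g]) / n_genes for g in expr}


# Toy example: patients 0-1 form one group, patients 2-3 the other.
expr = {"G1": [1, 1, 9, 9], "G2": [5, 5, 5, 5], "G3": [9, 9, 1, 1]}
scores = rank_median_difference(expr, group_a=[0, 1], group_b=[2, 3])
```

Because only ranks of medians enter the score, a single extreme sample cannot move a gene far, which is the robustness property described above.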

All differential correlation was computed as the absolute value of the difference between RNA-protein Spearman correlation values within each cohort. A cutoff for significance of 0.54 was used. We chose this cutoff by taking the value of correlation or anti-correlation necessary to achieve significance within a single cohort (Spearman ρ > 0.27, estimated p < 0.05) and multiplying by two, i.e. taking the minimum difference necessary between a significantly correlated and significantly anti-correlated RNA-protein pair.
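The differential correlation statistic can be illustrated with a small, dependency-free sketch. Spearman correlation is computed here as the Pearson correlation of average ranks; the helper names are hypothetical, and treating a difference at or above 0.54 as significant is an assumption about how the cutoff was applied.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


CUTOFF = 0.54  # 2 x 0.27, as described in the text


def differential_correlation(rna_r, prot_r, rna_nr, prot_nr, cutoff=CUTOFF):
    """Absolute difference in RNA-protein Spearman rho between two cohorts."""
    diff = abs(spearman(rna_r, prot_r) - spearman(rna_nr, prot_nr))
    return diff, diff >= cutoff


# Toy gene: anti-correlated in one cohort, correlated in the other.
diff, flagged = differential_correlation(
    rna_r=[1, 2, 3, 4], prot_r=[4, 3, 2, 1],
    rna_nr=[1, 2, 3, 4], prot_nr=[1, 2, 3, 4],
)
```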

Construction of the biomarker.


To create an integrated biomarker of tumor recurrence, we employ a model selection approach. For each gene, we find a set of models, or functions, that relate the RNA measurements to the protein measurements in the non-recurrent and recurrent cohorts. Formally, we define the functions as follows.

\[
\begin{aligned}
\mathrm{Protein} &\sim f_{R}(\mathrm{RNA}) + N(0, \sigma_{R}^{2}) \\
\mathrm{Protein} &\sim f_{NR}(\mathrm{RNA}) + N(0, \sigma_{NR}^{2})
\end{aligned}
\tag{1}
\]

Where $f_{R}$ and $f_{NR}$ are the recurrent and non-recurrent functions and $N(0, \sigma_{R}^{2})$ and $N(0, \sigma_{NR}^{2})$ are the normally distributed error terms of the models. After the models are generated on a training set, the likelihood that an expression measurement from a test sample came from a recurrent or non-recurrent patient is obtained by computing the probability density of the difference between the theoretical and test protein expression values for each model.
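To make the per-gene likelihood comparison concrete, the sketch below evaluates the normal density of the protein residual under each cohort's model and takes the log of their ratio. The two fitted functions and the error standard deviations are hypothetical placeholders; in the study they would come from the fitted models on the training set.

```python
import math


def normal_logpdf(x, sigma):
    """Log density of N(0, sigma^2) evaluated at x."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def gene_log_odds(rna, protein, f_r, sigma_r, f_nr, sigma_nr):
    """log[ p(protein | rna, recurrent) / p(protein | rna, non-recurrent) ]."""
    loglike_r = normal_logpdf(protein - f_r(rna), sigma_r)
    loglike_nr = normal_logpdf(protein - f_nr(rna), sigma_nr)
    return loglike_r - loglike_nr


# Hypothetical fitted models for a single gene (placeholders for f_R and f_NR).
def f_recurrent(rna):
    return 2.0 - 0.5 * rna


def f_non_recurrent(rna):
    return 0.5 + 1.0 * rna


lor = gene_log_odds(
    rna=2.0, protein=2.4,
    f_r=f_recurrent, sigma_r=0.5,
    f_nr=f_non_recurrent, sigma_nr=0.5,
)
```

Here the test protein value lies close to the non-recurrent prediction and far from the recurrent one, so the log odds ratio is negative. Working with log densities rather than densities avoids numerical underflow for large residuals.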

To learn the relationship between RNA and protein measurements for each gene, we use L1 trend filtering, which seeks to fit a piecewise linear function to the data. Trend filtering controls for over-fitting with a sparsity term, which is optimized using cross validation. We implemented trend filtering using the R package genlasso. Trend filtering seeks to optimize the following objective function.

\[
\frac{1}{2}\,\lVert \mathrm{Protein} - f(\mathrm{RNA}) \rVert_{2}^{2} + \lambda\,\lVert D\, f(\mathrm{RNA}) \rVert_{1}
\tag{2}
\]

Where λ ≥ 0 is the regularization parameter, and D is the second-order difference matrix defined in Kim et al.124. Trend filtering enforces a piecewise linear regression model, and the number of knots, or points at which the slope changes, is determined by cross-validation within the training set so that the complexity of the fit matches the noisiness of the data.
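The role of the penalty term can be illustrated without a solver. The sketch below only evaluates the trend filtering objective for two candidate fits and counts their knots; it does not perform the optimization, which in this study was done with the R package genlasso. It assumes samples are ordered by RNA abundance and evenly spaced, which is a simplification, and the data values are made up.

```python
def second_differences(fit):
    """Apply the second-order difference operator D (rows of the form [1, -2, 1])."""
    return [fit[i] - 2 * fit[i + 1] + fit[i + 2] for i in range(len(fit) - 2)]


def trend_filter_objective(protein, fit, lam):
    """(1/2) * ||protein - fit||_2^2 + lam * ||D fit||_1."""
    loss = 0.5 * sum((p - f) ** 2 for p, f in zip(protein, fit))
    penalty = lam * sum(abs(d) for d in second_differences(fit))
    return loss + penalty


def count_knots(fit, tol=1e-9):
    """Number of interior points at which the slope of the fit changes."""
    return sum(abs(d) > tol for d in second_differences(fit))


# Hypothetical protein values for six samples ordered by RNA abundance.
protein = [1.0, 2.1, 2.9, 4.2, 3.1, 1.9]
straight = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # single line, no knots
kinked = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0]    # one change in slope

small_lam = (trend_filter_objective(protein, kinked, 0.1),
             trend_filter_objective(protein, straight, 0.1))
large_lam = (trend_filter_objective(protein, kinked, 100.0),
             trend_filter_objective(protein, straight, 100.0))
```

With a small λ the kinked fit has the lower objective because it follows the data, while with a large λ the knot becomes too expensive and the straight line wins. Cross-validation over λ chooses between these regimes.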

We compute the overall probability of a patient being recurrent or non-recurrent using Bayes’ theorem with an uninformative prior and independent genes.

\[
\frac{\mathrm{Prob}(\mathrm{Recurrent} \mid P_{g,1}, R_{g,1}, P_{g,2}, R_{g,2}, \ldots, P_{g,M}, R_{g,M})}
     {\mathrm{Prob}(\mathrm{NonRecurrent} \mid P_{g,1}, R_{g,1}, P_{g,2}, R_{g,2}, \ldots, P_{g,M}, R_{g,M})}
=
\frac{\mathrm{Prob}(\mathrm{Recurrent}) \prod_{j=1}^{M} \mathrm{Prob}(P_{g,j}, R_{g,j} \mid \mathrm{Recurrent})}
     {\mathrm{Prob}(\mathrm{NonRecurrent}) \prod_{j=1}^{M} \mathrm{Prob}(P_{g,j}, R_{g,j} \mid \mathrm{NonRecurrent})}
\tag{3}
\]

Where $P_{g,j}$ and $R_{g,j}$ are the protein and RNA measurements for each gene in the signature for a given patient and M is the number of genes in the signature. To perform feature selection to find the final gene signature, we remove genes that are inaccurate on the training set based on the number of incorrectly predicted log odds ratios for each gene.
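Taking logarithms turns the product over genes into a sum of per-gene log odds ratios, which is how the final score is assembled. The sketch below assumes the per-gene log odds ratios have already been computed and uses made-up values; the function names and the decision threshold at zero are illustrative assumptions.

```python
import math


def combined_log_odds(gene_lors, prior_recurrent=0.5):
    """Sum per-gene log odds ratios, plus the log prior odds (zero for a 50/50 prior)."""
    prior_term = math.log(prior_recurrent / (1 - prior_recurrent))
    return prior_term + sum(gene_lors)


def posterior_recurrent(log_odds):
    """Convert log odds of recurrence into a posterior probability."""
    # For very large negative log odds math.exp can overflow; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical per-gene log odds ratios for one patient and a four-gene signature.
gene_lors = [0.8, -0.3, 1.1, -0.2]
total = combined_log_odds(gene_lors)
probability = posterior_recurrent(total)
call = "recurrent" if total > 0 else "non-recurrent"
```

The independence assumption is what allows the simple sum; correlated genes would contribute overlapping evidence and inflate the combined score.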

Functional Analysis of Dys-regulated genes.

The dys-regulated genes identified in this study were examined for enrichments of regulatory factors including RNA binding protein and microRNA binding sites. 5’ and 3’ UTR coordinates for all available transcripts were downloaded from the UCSC Table Browser for the human genome (hg38)125. UTR exon sequences were extracted for each transcript using the R package BSgenome.Hsapiens.UCSC.hg3837. Sequence motifs for 178 human RNA-binding protein (RBP) binding sites (101 RBPs) were collected from CISBP-RNA126. Each UTR sequence (length L) was scanned for each motif (length M) using a single nucleotide sliding window, providing L-M+1 scores. The maximum score for each transcript was selected as the motif representative score. The set of putative targets for each RBP motif across the whole genome was identified as the set of transcripts with representative scores >90% of the motif’s theoretical maximum. The set of targets was compared to the dys-regulated genes in order to identify the putative RBP dys-regulated targets. The background set of targets was identified as the targets associated with the global set of genes assayed (all genes for which RNA and protein data were available). A hypergeometric test was used to determine whether the dys-regulated genes were enriched as targets for each RBP motif.
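The scanning and enrichment steps can be sketched in a few lines. The motif below is a made-up three-position score matrix over an assumed A/C/G/U alphabet, and summing per-position scores is one simple scoring choice; the actual CISBP-RNA motifs and scoring scheme may differ. The hypergeometric tail is computed directly from binomial coefficients.

```python
from math import comb


def window_scores(sequence, motif):
    """Score every length-M window of the sequence, giving L - M + 1 scores."""
    m = len(motif)
    return [sum(motif[k][sequence[i + k]] for k in range(m))
            for i in range(len(sequence) - m + 1)]


def is_putative_target(sequence, motif, fraction=0.9):
    """True if the best window exceeds `fraction` of the motif's theoretical maximum."""
    theoretical_max = sum(max(column.values()) for column in motif)
    scores = window_scores(sequence, motif)
    return bool(scores) and max(scores) > fraction * theoretical_max


def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) when n_drawn genes are drawn from n_total, n_success of them targets."""
    upper = min(n_drawn, n_success)
    favorable = sum(comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
                    for i in range(k, upper + 1))
    return favorable / comb(n_total, n_drawn)


# Hypothetical motif preferring the sequence A-U-A.
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "U": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
]
hit = is_putative_target("GGAUAGC", motif)
miss = is_putative_target("GGGGGGC", motif)

# 10 background genes, 4 of them motif targets, 3 dys-regulated genes, 2 of them targets.
p_value = hypergeom_upper_tail(k=2, n_drawn=3, n_success=4, n_total=10)
```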

MicroRNA data were collected from the TargetScan website127. The human conserved microRNA family targets were downloaded from the database (214 microRNA families). This provided a list of genomic coordinates for the microRNA binding sites. Using the UCSC liftOver tool, the original hg19 binding site coordinates were converted into hg38 genomic coordinates. Overlap of these sites with transcribed regions provided the set of gene targets for each microRNA family. After identifying the set of dys-regulated microRNA targets, a hypergeometric test analogous to the RBP analysis was used to calculate the putative enrichment for each of the microRNA targets in the dys-regulated gene set. RBP and microRNA motifs with a Benjamini-Hochberg corrected p-value < 0.25 were considered significantly enriched.
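For reference, the Benjamini-Hochberg adjustment applied to the enrichment p-values can be written as a short function. The p-values below are invented for illustration; only the 0.25 threshold comes from the text.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjustment.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical enrichment p-values for four motifs.
pvals = [0.001, 0.04, 0.03, 0.6]
adjusted = benjamini_hochberg(pvals)
enriched = [i for i, q in enumerate(adjusted) if q < 0.25]
```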

3.3 Results

Proteogenomic Analysis of Surgically Resected Non-Small Cell Lung Cancer

We collected fresh frozen and formalin-fixed paraffin embedded (FFPE) specimens from 61 patients, half selected for rapid recurrence after surgery, and half selected for long-term (> 3 year) survival after surgical resection. 44 of these patients were recruited at Vanderbilt University and MD Anderson (tissue was processed at Vanderbilt for all samples), and 17 patients were recruited at Washington University in St. Louis. 10 of the tumors from WashU had squamous histological characteristics and were not considered for further biomarker analysis, but their data is included in the supplementary materials of this publication. The remaining patients with adenocarcinomas were matched for recurrence and adjuvant chemotherapy status (Table 3.1). RNAseq was performed on the fresh frozen tissues and tandem liquid chromatography mass spectrometry (LC-MS) was performed on the FFPE tissues. In total, 5,482 and 6,581 protein groups were identified in the Vanderbilt and WashU cohorts, respectively. 5,253 and 5,284 of these proteins were matched by gene symbol to their corresponding mRNA in the Vanderbilt and WashU cohorts, respectively. A total of 6,577 genes were measured in at least one study, and 3,960 genes were quantified in both studies.


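| Attribute | Recurrent (N=25) | Non-Recurrent (N=26) |
|---|---|---|
| Male Sex | 18 | 8 |
| Adjuvant Therapy | 9 | 12 |
| No Adjuvant Therapy | 16 | 14 |
| Stage Ia/b | 8/8 | 6/14 |
| Stage IIa/b | 3/4 | 2/1 |
| Stage IIIa/b | 0/1 | 2/1 |
| Collection Site: Vanderbilt | 17 | 18 |
| Collection Site: MD Anderson | 4 | 5 |
| Collection Site: WashU | 3 | 4 |

Table 3.1: Clinical attributes for the 51 lung adenocarcinoma patients.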


RNA-protein Correlation Is Dependent On Tumor Recurrence Status In Lung Adenocarcinoma

We observed high correlation of mRNA measurements across patients, as well as high correlation of protein measurements across patients, indicating that the data generated from each site are suitable to be combined for analysis. Median mRNA-protein Spearman correlations were ρ = 0.15 in the WashU cohort (4672 genes compared, Figure 3.1A) and ρ = 0.17 in the Vanderbilt cohort (4656 genes compared, Figure 3.1B).

We compared the pathway enrichments for low and high correlated genes and found similar trends to those found in the previous CPTAC studies. Interestingly, the mRNA splicing pathway, which is enriched for poor mRNA-protein correlation in colorectal, breast, and ovarian cancers, is enriched for high mRNA-protein correlation in lung adenocarcinoma. Aberrant splicing has recently been implicated in lung adenocarcinoma, and may contribute to the overall low mRNA-protein correlation seen in this study8.

Prior research has shown that the unexplained protein variability is not solely accounted for by technical noise, but also post-transcriptional regulation28. As such, we sought to discover genes whose mRNA-protein correlation was dependent on the clinical outcome. We matched the mRNA and protein data on the gene level and filtered by expression to obtain a set of 2286 paired RNA and protein measurements per lung adenocarcinoma patient (N = 51, see Materials and Methods for details). Globally, there is a significant difference between the mRNA-protein correlation of all genes in the recurrent group and in the non-recurrent group (p < 10^-16, Wilcoxon rank sum test, Figure 3.1C). Overall, the genes we investigated were more highly correlated in the non-recurrent tumors (Figure 3.1C).

Figure 3.1: Gene-level mRNA-protein correlation in human non-small cell lung cancer. mRNA-protein correlation in Vanderbilt (A) and WashU (B) datasets. (C) Histogram of mRNA-protein correlations within each cohort. The significance of the difference between recurrent and non-recurrent mRNA-protein correlations was determined by the Wilcoxon rank sum test.


Synergistic Detection of RNA and Protein Dysregulation

We investigated the gene-level differences in mRNA-protein correlation and abundances with Spearman correlation (Figure 3.2A) and a 2-dimensional differential expression method (Figure 3.2B). We show that the mRNA-protein correlation of individual genes can vary greatly between recurrent and non-recurrent tumors (Figure 3.2A). We hypothesized that mRNA-protein correlation itself may contain important information about the state of the cell. Poorly correlated mRNA and protein abundances may reflect post-transcriptional (splicing, microRNA, RNA localization, etc.) and post-translational (phosphorylation, ubiquitination, altered degradation, etc.) regulation. As such, differential correlation can be used to detect dysregulated genes in cancer, and necessitates the collection and analysis of large clinical cohorts with matched mRNA and protein data.

We found that genes can be differentially expressed independently at the mRNA and protein levels (Figure 3.2B). Indeed, there is little overlap between genes that are differentially expressed at the mRNA and protein levels, including differential correlation (Figure 3.2C; differential expression p-values are reported as uncorrected p-values produced by the R package npSeq128, see Materials and Methods). Were we to only use one data type, we would have found 66 differentially expressed proteins or 159 differentially expressed mRNAs; however, the inclusion of both allows us to generate 325 hypotheses of dysregulated genes. The numbers of differentially expressed proteins and mRNAs reported by npSeq are very low due to its stringency; however, we chose this non-parametric approach to minimize the chance of differential expression being driven by outliers. Outlier-driven differential expression is not as useful in biomarker development, because it does not capture the behavior of an entire cohort. In addition, we observed high intragroup variability relative to intergroup variability.

Figure 3.2: Synergistic discovery of differentially regulated genes using matched RNA and protein abundances. (A) RNA-protein correlations within recurrent and non-recurrent patient cohorts are shown in a scatterplot. Genes whose RNA-protein abundances are significantly correlated or anti-correlated (uncorrected p-value < 0.05) are shown in red. (B) RNA and protein differential expression is shown as the change in median rank abundance between non-recurrent and recurrent cohorts. Genes are differentially expressed at RNA and protein levels. (C) Overlap of genes differentially expressed at the protein and RNA levels, as well as genes that are differentially correlated. Zero genes displayed simultaneous differential expression at both levels and differential correlation. Please see Materials and Methods section for more information about how differential expression and correlation was computed.

We further investigated which genes were most differentially correlated. The most differentially correlated gene, Timm50, has highly correlated RNA-protein abundances among non-recurrent tumors but highly anticorrelated abundances among recurrent tumors (Figure 3.3). Timm50 encodes the protein, Tim50, that is involved in the mitochondrial apoptosis pathway, is upregulated by mutant p53129, and its loss induces apoptosis in breast cancer cells130. Timm50 is weakly differentially expressed at the RNA level, and not differentially expressed at the protein level, such that its discovery as a dysregulated gene in our patient cohort requires the use of both data types.

To examine whether aberrant post-transcriptional regulation contributed to the poor RNA-protein Spearman correlations in recurrent patients, we searched for enriched RNA-binding protein (RBP) and microRNA (miRNA) motifs within these genes. This analysis of 178 RBP and 214 miRNA family motifs identified no significant enrichment (FDR < 0.25) for post-transcriptional motifs within this gene set.

Figure 3.3: Timm50 is differentially correlated between recurrent and non-recurrent tumors. (A) Timm50 is weakly differentially expressed at the RNA level (p < 0.05), but not differentially expressed at the protein level. (B) Timm50 differential RNA-protein correlation between recurrent and non-recurrent tumors.

Integrating RNA and protein abundances for predicting tumor recurrence


We next sought to leverage the RNA and protein data by developing a novel, comprehensive methodology to generate integrative expression biomarkers (Figure 3.4). In brief, we separate patients into training and test cohorts, and then further separate the training cohort according to a binary clinical variable (Figure 3.4A). In this study, the variable is recurrence status. For each gene, we perform regression using a recently developed machine learning technique, L1 trend filtering124, to find a piecewise-linear relationship between RNA and protein abundances in each cohort (Figure 3.4A). Trend filtering produces a set of piecewise linear equations that seek to balance over- and under-fitting of the model. For instance, if the relationship is highly non-linear with a high signal to noise ratio, then the model will have many knots that closely follow the data. If the relationship is highly linear or has a low signal to noise ratio, then there will be no knots, and simple linear regression is performed. The test samples are then compared to the model, and an error is calculated that represents the difference between the model-predicted and test protein values, given the test RNA values (Figure 3.4B). Errors are then calculated for each training sample and used to learn parameters for a normal distribution independently for each cohort. P-values for the test errors are extracted from recurrent and non-recurrent distributions and combined to generate a log odds ratio (LOR) for each gene-patient combination (Figure 3.4C). These LOR values are then summed for all genes included in the signature to generate a final LOR that a tumor will recur or not. For more details on how genes are included in the final signatures, see Materials and Methods.

Because our method considers protein as a function of RNA, a gene that has differential RNA expression in the absence of differential protein expression would not be considered a useful biomarker. We remedy this situation by generating a separate LOR that an RNA measurement was taken from the recurrent or non-recurrent RNA abundance distribution (Figure 3.4D). The accuracies of the LORs generated by the integrative and RNA-alone methods are compared on the training set for each gene, and, using a simple objective function, the method decides whether to use each gene as an integrative or RNA biomarker.
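As a rough illustration of the RNA-alone score and the per-gene choice between the two biomarker types, consider the sketch below. The text does not specify the objective function, so the one used here (training accuracy of the LOR sign) is purely an assumption, as is the use of a normal density ratio for the RNA-alone LOR. All names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def rna_lor(train_rec, train_non, test_rna, floor=1e-300):
    """Log ratio of normal densities fitted to each cohort's RNA abundances.

    Positive values favor the recurrent cohort (assumed sign convention).
    """
    d_rec = NormalDist(mean(train_rec), stdev(train_rec))
    d_non = NormalDist(mean(train_non), stdev(train_non))
    return math.log(max(d_rec.pdf(test_rna), floor) / max(d_non.pdf(test_rna), floor))


def training_accuracy(lors, recurred):
    """Fraction of training samples whose LOR sign matches the recurrence label."""
    hits = sum((lor > 0) == label for lor, label in zip(lors, recurred))
    return hits / len(recurred)


def choose_mode(integrative_lors, rna_lors, recurred):
    """Hypothetical objective: keep whichever LOR type is more accurate in training."""
    acc_int = training_accuracy(integrative_lors, recurred)
    acc_rna = training_accuracy(rna_lors, recurred)
    return "integrative" if acc_int >= acc_rna else "rna"
```

Ties are resolved in favor of the integrative score here; that tie-breaking rule is also an arbitrary choice made for the example.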

Figure 3.4: Overview of integrative RNA-protein biomarker discovery pipeline. (A) Patients are divided into their clinical groups; here we use binary recurrence status to group the patients. Regression is then performed using trend filtering to find a relationship between RNA and protein abundances within each cohort. (B) This model is then used to test a separate test sample or samples. Given a test RNA abundance, the test error is calculated as the difference between the predicted protein value and the test protein value (arrows). (C) The test errors (in red) are then compared to the distributions of training errors in each cohort, and a log odds ratio is calculated (LOR1). (D) Since this integrative method does not detect differential RNA abundances in the absence of differential protein abundances, a second log odds ratio is calculated by comparing the test RNA abundances (in red) to the training RNA abundances in each cohort (LOR2).

Using a synthetic dataset, we show that our method is able to simultaneously utilize changes to protein concentrations, RNA concentrations, and RNA-protein correlations. Leave-one-out cross-validation results on our patient cohort are shown in Table 3.2. Our integrative method was able to correctly predict 36/51 (71%) patients’ recurrence status, including 20/26 (77%) non-recurrent patients and 16/25 (64%) recurrent patients. This is in contrast to results using protein and RNA expression separately, which collectively had an accuracy of ~50%. Interestingly, the majority of prediction errors for non-recurrent patients using our integrative approach (4/6, 67%) came from the misclassification of patients who received chemotherapy. This suggests that our method was able to find tumors that may have recurred without the intervention of adjuvant chemotherapy.

Datatypes Used   Feature Selection   Total (%)   Non-Recurrent (%)   Recurrent (%)
RNA              All                 24 (47)     12 (46)             12 (48)
RNA              FS                  29 (57)     13 (50)             16 (64)
Protein          All                 18 (35)     10 (38)              8 (32)
Protein          FS                  24 (47)     12 (46)             12 (48)
RNA+Protein      All                 19 (37)      9 (35)             10 (40)
RNA+Protein      FS                  25 (49)     11 (42)             14 (56)
Integrative      All                 22 (43)     10 (38)             12 (48)
Integrative      FS                  36 (71)     20 (77)             16 (64)

Table 3.2: Integrative Biomarker of Recurrence Leave-One-Out Cross Validation Performance. Entries are the number (%) of correctly classified patients; FS, feature selection.

To find genes that best predict patient recurrence status, we include feature selection by evaluating each gene’s performance on the training cohort. The result is a signature generated by each cross-validation test. We evaluated the biological significance of each gene included in a majority of signatures: Sumo1, Pcbd1, Psmc5, Arcn1, Ppa2, and Sri. Each of these genes was utilized as an integrative biomarker, not as an RNA biomarker. Sumo1 is covalently attached to target proteins in a process termed sumoylation. Sumoylation is involved in many cellular responses; most notably, sumoylation of DNA damage response proteins is necessary to repair DNA double-stranded breaks131. Pcbd1 is a dimerization cofactor of Hnf1a, which has been implicated in numerous cancers132,133. Psmc5 has proteasomal functions, has been used as a biomarker of radiosensitivity in a lung cancer H460 cell line134, and has been identified as a modifier of the Tgfb transcriptional program135. Arcn1 has been hypothesized to function in vesicle trafficking136, and in one study, Arcn1 RNA expression was predictive of survival in surgically resected lung cancer137. Ppa2 is a mitochondrial inorganic pyrophosphatase. Sri has been shown to be involved in multidrug resistance in cancer138,139, and protects against mitochondrial apoptosis140. Our integrative biomarker method selected biologically relevant genes to predict lung adenocarcinoma recurrence.

3.4 Discussion

In this study, we present a novel comprehensive characterization of 51 lung adenocarcinoma tumors with matched RNA and protein abundance analysis. We further show that the combined analysis of RNA and protein abundances can be used to define candidate biomarkers of recurrence risk for surgically resected lung adenocarcinomas.

Although several papers have used RNA data to inform the choice of protein biomarkers, our method is the first, to our knowledge, to integrate RNA and protein expression data into a single signature. In fact, our method can be more broadly implemented to perform supervised learning to predict a binary response variable using any two matched datasets.

There are several limitations of this study. First and foremost, RNA sequencing data were generated from fresh frozen tissues while proteomics data were generated from FFPE tissues. This is a possible explanation for the unusually low RNA-protein correlation. Second, although our integrative biomarker potentially improved upon RNA- or protein-based biomarkers for recurrence prediction in our dataset, the accuracy (71%) is too low to be of clinical utility. This result is possibly due to the high intra-group variability observed in our data. Third, these patients do not have matched DNA sequencing data, so a comprehensive catalogue of driver mutations is lacking. Fourth, the majority of patients with recurrent tumors were male (72%) and the majority of patients with non-recurrent tumors were female (69%), so sex may confound the signature. Ultimately, independent validation is necessary to demonstrate the robustness of our findings.

One interesting approach for a future study would be to find combinations of RNAs and proteins, not necessarily from the same gene, that are predictive of a clinical or biological outcome. It might be that the expression of one protein as a function of an entirely different RNA, which is possibly non-coding, could be an excellent biomarker. This possibility highlights the fact that our method contextualizes the protein expression within the landscape of RNA expression.

Chapter 4: Investigation of the Association of Genome Stability Protein Inactivation and Tumor Mutation Burden

4.1 Introduction

Recently, the advent of immune checkpoint inhibition therapy (ICI) has greatly diversified the therapeutic landscape against multiple cancer types, including non-small cell lung cancer (NSCLC). The finding that a minority of NSCLC patients benefits from ICI has motivated efforts to develop therapeutic biomarkers for ICI sensitivity141. Positive PD-L1 protein expression in greater than 50% of tumor cells as measured by immunohistochemistry is a successful biomarker in non-small cell lung cancer; however, roughly half of PD-L1 positive patients will not respond to ICI3. Several other promising biomarkers have emerged, including tumor mutation burden (TMB). Retrospective studies have indicated that high TMB is associated with response to ICI in NSCLC45,49,50, melanoma142,143, and colorectal cancer144,145. In colorectal cancer, high TMB is often associated with specific alterations to proteins responsible for maintaining genomic stability, a finding which recently led to the FDA approval of immunotherapy for microsatellite instability high or mismatch repair (MMR) deficient solid tumors144,145.

NSCLC tumors have high average TMB20, largely attributed to the carcinogenic effects of cigarette smoking146. Currently, little is known about the genetic determinants of TMB in NSCLC, and studies of ICI treatment in MMR deficient tumors did not include patients with NSCLC145. Hundreds of proteins are directly and indirectly involved in maintaining genomic stability. Although these genes have been functionally linked to genomic stability and the inhibition of tumor growth, a concrete link between each gene’s inactivation and increased somatic tumor mutation burden has yet to be established in NSCLC. Chae, et al. performed a pan-cancer analysis of mutations in DNA repair genes; however, they neither looked at specific types of mutations nor did they specifically focus on NSCLC tumors147.

Here, we investigate the role of alterations in genome stability related proteins (GSPs) in NSCLC. We first investigate the landscape of alteration of GSPs in NSCLC, and further test their association with TMB. We create a predictive model of TMB that accounts for tumors with multiple alterations in GSPs. In addition, we attempt to clarify the genetic factors controlling TMB independently of smoking history.

4.2 Materials and Methods

Data acquisition and processing

Level 3 clinical and genomic data from The Cancer Genome Atlas LUAD (n = 461) and LUSC (n = 465) were downloaded from gdac.broadinstitute.org and processed. Patients were selected for this study if TCGA provided matched RNA sequencing, copy number, clinical, and exome sequencing data. GISTIC 2.0 processed categorical copy number data were downloaded from cbioportal.org.

Classification of Genome Stability Related Proteins

Lists of genes involved in genome stability were downloaded from dnapittcrew.upmc.com and the Repairtoire database. After comparison of the two databases, gene functions were further validated with literature searches. The genes were classified as either directly or indirectly related to DNA repair, and placed accordingly into one of 16 functionally related pathways. Mutations in these genes were considered loss of function if they fell into the following classifications: nonsense mutations, or frameshift or in-frame deletions and insertions. In addition, we considered genes with deep deletions to be loss of function.

Enrichment of Genome Stability Related Proteins in TMB high tumors

Tumor mutation burden estimates were calculated by dividing the total number of mutations per patient in the TCGA *.MAF file by the mean exome length studied per patient. The exome length studied per patient is the number of bases in the whole exome sequencing experiment that were sequenced to an adequate depth for somatic mutation calling. A cutoff of 10 mutations per megabase separated hypermutant from non-hypermutant tumors, as in Campbell, et al.148
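The calculation above amounts to a simple ratio, sketched here in Python. The function names are hypothetical, and treating a tumor exactly at the cutoff as non-hypermutant is an assumption, since the text does not say whether the threshold is inclusive.

```python
def tumor_mutation_burden(n_mutations, callable_bases):
    """Mutations per megabase of adequately covered exome."""
    return n_mutations / (callable_bases / 1e6)


def is_hypermutant(tmb, cutoff=10.0):
    """Apply the 10 mutations/Mb cutoff (strict inequality is an assumption)."""
    return tmb > cutoff


# Illustrative numbers only: 450 mutations over 30 Mb of callable exome.
example_tmb = tumor_mutation_burden(450, 30_000_000)
```

For the illustrative numbers above, the estimate is 15 mutations per Mb, which would place the tumor in the hypermutant group.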

Prediction of Tumor Mutation Burden

We evaluate the predictive value of smoking history and pathway-level mutations for tumor mutation burden. The predicted loss-of-function mutations are flagged in the mutation matrix. If at least one gene within one of the 16 pathways includes a predicted loss-of-function mutation or deep copy number deletion, that pathway is considered to be inactive. This reduces the sparsity of the mutation matrix for further processing. The smoking status variable is converted to dummy variables such that each column of the matrix corresponds to one smoking status category (1-4). Age is log2 transformed and included in each of the combinations of features. The groups we study are age + smoking status + pathway mutation, age + smoking status, age + pathway mutation, and a randomized control group. In the randomized group, the indices of the samples are randomly sampled without replacement, and each sample is relocated to the index drawn for it. The random group provides a negative control for background signal that is picked up by the trained model.
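The feature construction described above can be sketched as follows. This is an illustrative reading of the text, not the code used in this study: the pathway-to-gene mapping shown in the example is a hypothetical two-pathway stand-in for the full 16-pathway list, and the function names are invented for the example.

```python
import math
import random


def pathway_inactivation(inactivated_genes, pathways):
    """1 if any gene in a pathway is inactivated in this tumor, else 0."""
    return [int(any(g in inactivated_genes for g in genes)) for genes in pathways.values()]


def smoking_dummies(status, levels=(1, 2, 3, 4)):
    """One indicator column per smoking status category."""
    return [int(status == level) for level in levels]


def feature_row(age, status, inactivated_genes, pathways):
    """log2(age), smoking indicators, then pathway inactivation flags."""
    return [math.log2(age)] + smoking_dummies(status) + pathway_inactivation(inactivated_genes, pathways)


def permuted_control(rows, seed=0):
    """Shuffle rows without replacement so they no longer line up with their TMB values."""
    rng = random.Random(seed)
    return rng.sample(rows, k=len(rows))


# Hypothetical pathway membership, for illustration only.
example_pathways = {"Fanconi anemia": ["FANCE", "FANCM"], "DNA polymerases": ["POLE", "REV3L"]}
example_row = feature_row(64, 2, {"REV3L"}, example_pathways)
```

Collapsing genes to pathways in this way trades resolution for density: a pathway column is non-zero whenever any member gene is hit, so far fewer columns are all zeros.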

Each of the four groups is separated into two groups for training (95%) and testing (5%). The large portion for training is due to the sparsity of the mutation matrix such that signal is more likely to be found in the training set. This procedure of training and testing was repeated for 1000 trials and used to generate distributions for each feature set’s model. These distributions were then plotted together to display the differences between the predictive accuracy and variance. This procedure was conducted for both the LUAD and LUSC datasets.
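A minimal sketch of the repeated train/test procedure follows. It assumes that predictive accuracy is summarized as the Pearson correlation between predicted and observed TMB on the held-out samples (the text reports ρ without naming the coefficient), and it takes the model as a user-supplied `fit` function; both names and the handling of constant vectors are choices made for this example.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant (a choice made here)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def repeated_holdout(rows, tmb, fit, n_trials=1000, test_fraction=0.05, seed=0):
    """Repeat a random 95%/5% split and collect test-set correlations.

    `fit(train_rows, train_tmb)` must return a function mapping one row to a prediction.
    """
    rng = random.Random(seed)
    n_test = max(2, round(len(rows) * test_fraction))  # need at least 2 points to correlate
    scores = []
    for _ in range(n_trials):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit([rows[i] for i in train], [tmb[i] for i in train])
        predicted = [predict(rows[i]) for i in test]
        scores.append(pearson(predicted, [tmb[i] for i in test]))
    return scores
```

Running this once per feature set, and once on the permuted control, would give the correlation distributions that are compared between feature sets.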

Data and Code availability

All code necessary to generate the results of this study is available at github.com/michaelsharpnack/GenomeStabilityProteins. Data is available at https://www.dropbox.com/sh/s8n9h1kafy9rd0m/AABfhgWknpyaE1ggTPh67iK7a?dl=0.

4.3 Results

Landscape of Alteration of genome stability related proteins in non-small cell lung cancer

Genes were classified based on their potential to maintain DNA sequence fidelity, resulting in a signature of 150 genomic stability related genes (see Materials and Methods for more details). These genes come from 17 pathways; the full list of genes and their associated pathways can be found in Supplementary Table 2. We investigated the frequency of alterations of these genes, dividing all mutations into those with predicted loss of function and those with other potential effects (see Materials and Methods for more details). In total, 242/461 (52%) and 300/465 (65%) of tumors had at least one inactivated GSP in LUAD and LUSC, respectively (Figure 4.1). A full distribution of numbers of inactivated genes and pathways is shown in Supplementary Figure 1. The total number of inactivating mutations is imperfectly correlated with the TMB in both tumor types (LUAD ρ = 0.32, LUSC ρ = 0.28). Although there are non-genomic predictors of TMB, such as smoking history, one possible explanation for the weak correlation between TMB and the number of inactivated GSPs is that inactivation of some GSPs does not contribute to increased TMB. We therefore sought to discover which GSP inactivation events are associated with increased TMB in NSCLC.

Figure 4.1: Landscape of alteration of genome stability related pathways in non-small cell lung cancer. Predicted loss-of-function mutations to genes in pathways are shown in dark blue, other mutations are shown in red, and deep deletions are shown in light blue.

Smokers have significantly higher tumor mutation burdens than non-smokers; however, in both LUAD and LUSC, there are smokers with low TMB and non-smokers with high TMB (Figure 4.2A-B). Smoking pack year history (SPY) is weakly positively correlated with TMB in LUAD (ρ = 0.20, p = 2.5 × 10^-4) and uncorrelated with TMB in LUSC (ρ = -0.026, p = 0.61) (Figure 4.2C-D). These results indicate that the primary effect of smoking on TMB is possibly the difference between smokers and non-smokers, rather than differences among smokers.

Figure 4.2: Association between smoking and tumor mutation burden. Histograms of tumor mutation burden (mutations per Mb) of non-smokers (red) and smokers (blue) in LUAD (A) and LUSC (B) tumors. Correlation between tumor mutation burden and smoking pack year history in LUAD (C) and LUSC (D).

Increased Tumor Mutation Burden is associated with inactivated genome stability related proteins

To test the association between GSP inactivation and TMB, we split tumors into hypermutant and non-hypermutant with a threshold of 10 mutations per Mb. We tested for enrichment of GSP inactivation within hypermutant tumors. In the full NSCLC cohort, POLE, REV3L, and FANCE inactivation were all significantly enriched in hypermutant tumors (Figure 4.3A). REV3L is also significant within the LUSC cohort alone. For example, FANCE is inactivated in 14/465 (3%) LUSC tumors, all but two of which are in the top 25% of tumors by TMB. Similarly, REV3L is inactivated in 26/465 (5.5%) LUSC tumors, 22 of which are in the top 25% of tumors by TMB. No GSP is significantly enriched in LUAD hypermutant tumors after multiple hypothesis correction (Figure 4.3C).
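The enrichment test can be illustrated with a short standard-library sketch of a one-sided Fisher's exact test followed by Benjamini-Hochberg correction. Whether the original analysis used a one-sided or two-sided test is not stated, so the one-sided (enrichment) form below is an assumption, and the function names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment.

    a: hypermutant and inactivated      b: hypermutant and intact
    c: non-hypermutant and inactivated  d: non-hypermutant and intact
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

In this setting, one p-value would be computed per gene (or per pathway) from its 2x2 table of inactivation status against hypermutant status, and the q-values computed across all genes tested in the same cohort.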

Figure 4.3: Enrichment of genomic stability related pathway inactivation in hypermutant tumors. Each gene (A) and pathway (B) is tested (Fisher’s exact test, Benjamini-Hochberg corrected q-value) for enrichment of its inactivation in hypermutant tumors in NSCLC. (C-D) Enrichment of genes and pathways in the LUAD and LUSC cohorts.

Many of the GSPs are inactivated at very low frequencies, which could contribute to the difficulty in detecting their association with increased TMB. To account for this, we tested each of the 16 DNA repair and genome stability related pathways for their enrichment in hypermutant NSCLC tumors. Inactivation of the DNA polymerase, base excision repair, conserved DNA damage response, mismatch excision repair, nucleotide excision repair, Fanconi anemia, and other suspected and known DNA repair genes pathways are all significantly enriched in hypermutant NSCLC tumors (Figure 4.3B). In addition, inactivation of conserved DNA damage response, Fanconi anemia, nucleotide excision repair, DNA polymerases, and other suspected and known DNA repair genes were all enriched in hypermutant LUAD tumors, while inactivation of the mismatch excision repair, nucleotide excision repair, and DNA polymerases pathways were enriched in hypermutant LUSC tumors (Figure 4.3D).

Deficits in the mismatch excision repair pathways can cause high levels of microsatellite instability in solid tumors (MSI-H). Although we noted an enrichment of mismatch excision repair pathway inactivation in NSCLC tumors, MSI-H NSCLC tumors are extremely rare, constituting less than 1% of the total population149. None of the 5 MSI-H tumors in the NSCLC cohort (out of the 851 NSCLC tumors with reported microsatellite stability status in Hause et al.) have inactivating mutations in mismatch repair pathway genes. It is likely that this pathway is inactivated epigenetically, rather than via somatic mutations, as is frequently the case in colon and rectal carcinomas.

Modeling tumor mutation burden with alterations to genome stability related proteins

A univariate analysis does not account for the fact that lung tumors frequently have more than one altered GSP. As such, we performed a multivariate analysis with a penalized linear regression (LASSO, see Materials and Methods). We performed 1000 iterations of 20-fold cross-validation LASSO regression on both the LUSC and LUAD datasets (Figure 4.4A-B). In both LUSC and LUAD, the test set correlations of the smoking, mutations, and smoking+mutations models are significantly higher than those of the random dataset (p < 2.2 × 10^-16). However, only in LUAD is the smoking+mutations model significantly better correlated than the smoking model (p = 0.0359). In contrast, in LUSC the next most predictive feature set after smoking+mutations is mutations, and there is no significant improvement from adding smoking to the feature set (p = 0.7119). Also of note, the variance of the correlation distributions is greater in LUSC than in LUAD. This could possibly be attributed to the mutations (a sparse matrix) being the most important feature set in LUSC. This is important because smoking, though predictive in LUSC (median ρ = 0.1101), is not nearly as powerful as in LUAD (median ρ = 0.4679). This shows that an alternative to using smoking in LUSC is to use pathway-level mutations (median ρ = 0.3303). Perhaps not surprisingly, the lifelong non-smoker and long-term cessation statuses had high predictive accuracy for TMB in LUAD. However, by adding mutations to either the LUSC or LUAD model, we see an improvement beyond a smoking-only model, with the most improvement seen in LUSC.
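For readers unfamiliar with LASSO, the following is a minimal coordinate-descent sketch of L1-penalized linear regression, minimizing (1/2n)·||y - b0 - Xb||² + λ·||b||₁. It is a generic textbook formulation written for illustration, not the implementation used in this study; the function names, the fixed number of sweeps, and the absence of feature standardization are all simplifying assumptions.

```python
def soft_threshold(value, lam):
    """Shrink a value toward zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso(X, y, lam, n_sweeps=200):
    """Coordinate-descent LASSO on centered data; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]  # residuals of the current fit
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
            beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta
```

The soft-thresholding step is what sets coefficients exactly to zero, which is why LASSO can be read as a feature selector: features whose coefficients survive across many cross-validation fits are the ones the model repeatedly relies on.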

Figure 4.4: Predicting tumor mutation burden with age, genomic stability gene inactivation, and smoking status. Correlation distributions from 1000 iterations of lasso regression cross-validation on the feature sets color-coded above. The y-axis shows the probability density function (PDF) and the x-axis shows the correlation associated with the PDF.

To discover which genes are most important in the multivariate analysis, we test to see which genes the LASSO procedure selects more frequently than expected by chance. In LUAD, 6 genes are selected in at least half of the cross-validation experiments, with p < 1 × 10^-323: TP53, FANCM, FANCE, RAD23L, REV3L, and FANCB. In addition, the LASSO procedure frequently selected smoking history as predictive of TMB in LUAD tumors. By contrast, only REV3L was selected in more than half of the cross-validation experiments in LUSC. In addition, MLH1, POLE, and FANCE were all selected with p < 1 × 10^-25 in the LUSC cohort.

4.4 Discussion

We studied the relationship between tumor mutation burden and the inactivation of genome stability related proteins in NSCLC. We found previously established mediators of TMB, such as POLE and MLH1, and proposed roles for genes such as REV3L and FANCE. A distinguishing factor in this study is that we specifically only considered mutations or copy number changes with probable deleterious effects. While this approach decreases the likelihood of discovering false positive associations, it also increases the probability of missing associations. Also, we investigate potential combinatorial effects of inactivation of multiple GSPs by modeling TMB. With this approach we are able to increase our power to discover associations between GSP inactivation and increased TMB.

This study identifies REV3L as a promising mediator of TMB. REV3L is the catalytic subunit of polymerase zeta, which is involved in the genotoxic repair mechanism, DNA translesion synthesis150. REV3L localizes to the mitochondria151 and has been implicated in modulating cisplatin resistance in several cancer types, including NSCLC152–154. REV3L has not been previously associated with high TMB in NSCLC, and further studies are necessary to confirm its causal role.

In addition, although Alexandrov, et al.146 performed a similar study of the association between TMB and smoking pack year history, we conclude that major differences in TMB are between non-smokers and smokers, not between smokers with a short pack year history vs. a long pack year history. One possible explanation for this observation is that tumor formation occurs relatively early in a patient’s smoking history. Once the tumor cell that forms the final dominant clones begins dividing, any subsequently acquired mutations will have exponentially decreasing variant allele fractions. Subclonal mutations with low variant allele fractions are difficult to distinguish from sequencing errors, and are less likely to elicit an anti-tumor immune response45, although they can harbor mutations that can confer resistance to targeted therapies41.

A limitation of this study is that it does not include results on a validation dataset. In addition to validation in other large-scale genomics clinical trials, in vitro functional validation could help distinguish the causal roles of proposed GSPs. It would be helpful to know the timing of these mutations, particularly in tumors with more than one clonal alteration to GSPs. Finally, tumor mutation burden has shown promise as a therapeutic biomarker for immune checkpoint inhibition treatment49, but ultimately it is the mutations that create potential neoantigens recognized by resident CD8+ cells that are able to generate anti-tumor immune responses.

Understanding the genes responsible for increased TMB in human NSCLC could pave the way for new biomarkers for immunotherapies, as well as a greater understanding of tumor evolution. This knowledge could be essential to decreasing the number of untreatable NSCLC tumors.

Chapter 6: Conclusions & Future Directions

Since the completion of the human genome project, thousands of genomes, methylomes, transcriptomes, and proteomes have been analyzed, many of which are from tumor samples. The biological knowledge gleaned from these data has been breathtaking; however, many advanced solid tumors remain incurable despite the development of targeted anticancer therapies. One of the greatest advances in non-small cell lung cancer treatment is the invention of anti-EGFR small molecule inhibitors; however, these therapies have not been shown to increase overall survival in patients with EGFR activating mutations, only to increase progression-free survival40. Hijacking tumor evolution and continually inhibiting dominant clones within tumors could conceivably achieve long-term tumor control. In these regimens, tumors are treated with a targeted therapy until tumor recurrence, and then re-analyzed to find a second targeted therapy. This process is continued until no targeted therapies are left. The rationale behind this approach is to corner tumors into degenerate evolutionary loops, where they are unable to adapt and escape therapies.

A second approach, which we advocate here, is to leverage integrative genomic data to produce novel combination therapies. By integrating DNA copy number, RNA expression, and RNA editing data, we discovered an association between ADAR copy number and lower immune cell signatures as well as lymphocyte counts. The immunosuppressive potential of ADAR makes it a logical target for combination therapies. ADAR was identified as a top target for combination therapy with ICI in an in vivo CRISPR screen155. Aside from combination with ICI, ADAR inhibition would be a logical addition to DNA demethylating therapies, given their joint ability to induce interferon via MDA5/MAVS activation57,58.

Another open question about ADAR is whether the interferon suppression in cancer is via endogenous or exogenous signaling. Our results are based on bulk tumor sequencing data, where the interferon expression is calculated as an RNA signature of interferon-stimulated genes. It is impossible to decipher which fraction of the interferon expression is from immune cells, and which fraction is from tumor cells. Single cell sequencing may be an effective solution to this problem. Additionally, other genes on the 1q21 amplicon may be drivers of tumor progression: S100 genes, several of which are located on 1q21, were implicated recently as drivers of breast cancer recurrence77.

In addition to biological discovery, here we created methods to integrate multimodality genomic data into biomarkers. In Chapter 3, we presented a framework for integrating two matched datasets, and applied it to transcriptomic and proteomic data to predict tumor recurrence. Here, matched data implies that for each feature in one datatype, there is a corresponding feature in the other datatype; i.e., for each gene used there is a protein and an RNA measurement available. The method itself is agnostic to the types of data presented, and could be adapted to accept non-matched data to predict clinical or biological phenotypes. For example, we could incorporate DNA copy number and proteomics data to discover the behavior of tumors that have functional versus non-functional amplifications or deletions. As the cost of high-throughput methods continues to drop, integrative genomics biomarkers may become both necessary and realistic.

Integrative genomics biomarkers may soon be utilized in NSCLC to predict who will benefit from ICI therapy. Recently, it was shown in a cohort of NSCLC patients with recurrent or stage IV disease, that patients with high TMB and >50% positive PD-L1 immunohistochemistry had the highest response rate to ICI49. These two biomarker modalities may be necessary to optimize ICI monotherapy, but currently there are hundreds of combination immunotherapies in clinical trials. Tumors can disrupt each sequence of events necessary to mount an effective immune response, and biomarkers of each of these processes could be informative to find the best responders. It is possible that a panel of immune and tumor markers may be necessary to select between differing combination immune and other therapies. Nonetheless, it is likely that TMB or the closely related neoantigen burden measures will be important predictors of ICI effectiveness.

Here, we investigated the genetic correlates of TMB by focusing on a list of functionally confirmed genome stability related genes. We identified novel possible mediators of TMB in REV3L and FANCE, and linked POLE and MLH1 deficiencies to NSCLC. Mismatch repair pathway deficiencies were recently approved as biomarkers of first-line ICI therapy in human solid tumors; however, NSCLC patients were not included in the pilot studies. We found that mismatch repair pathway deficiencies were in fact associated with increased TMB in NSCLC. Collectively, our study presents a roadmap of genes and pathways to investigate for their role in increased TMB in NSCLC.

Collectively, this dissertation aims to demonstrate that each type of biological molecule in the cell contains independent information about cellular function, and ultimately, human disease. As these techniques continue to decrease in cost, it is our hope that the creation of integrative genomics data from human cancer samples will become standard. Ultimately, this information will be necessary to understand and treat human cancer. These developments are particularly urgent for the thousands of patients whose tumors are currently not treatable by personalized therapies.

References

1. Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global Cancer Statistics, 2012. CA Cancer J Clin. 2015;65(2):87-108. doi:10.3322/caac.21262

2. Herbst RS, Morgensztern D, Boshoff C. The biology and management of non-small cell lung cancer. Nature. 2018;553(7689):446-454. doi:10.1038/nature25183

3. Reck M, Rodríguez-Abreu D, Robinson AG, et al. Pembrolizumab versus Chemotherapy for PD-L1–Positive Non–Small-Cell Lung Cancer. N Engl J Med. 2016;375(19):1823-1833. doi:10.1056/NEJMoa1606774

4. Mok TS, Wu Y, Thongprasert S, et al. Gefitinib or Carboplatin-Paclitaxel in Pulmonary Adenocarcinoma. N Engl J Med. 2009;361(10):947-957. doi:10.1056/NEJMoa1404595

5. Lovly CM, McDonald NT, Chen H, et al. Rationale for co-targeting IGF-1R and ALK in ALK fusion–positive lung cancer. Nat Med. 2014;20(9):1027-1034. doi:10.1038/nm.3667

6. Pao W, Chmielecki J. Rational, biologically based treatment of EGFR-mutant non-small-cell lung cancer. Nat Rev Cancer. 2010;10(11):760-774. doi:10.1038/nrc2947

7. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012;489(7417):519-525. doi:10.1038/nature11404

8. Collisson EA, Campbell JD, Brooks AN, et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511:543-550. doi:10.1038/nature13385

9. Imielinski M, Berger AH, Hammerman PS, et al. Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell. 2012;150(6):1107-1120. doi:10.1016/j.cell.2012.08.029

10. Crick F. Central dogma of molecular biology. Nature. 1970;227(5258):561-563. doi:10.1038/227561a0

11. Shapiro JA. Revisiting the central dogma in the 21st century. Ann N Y Acad Sci. 2009;1178:6-28. doi:10.1111/j.1749-6632.2009.04990.x

12. Giallourakis C, Henson C, Reich M, Xie X, Mootha VK. Disease Gene Discovery Through Integrative Genomics. Annu Rev Genomics Hum Genet. 2005;6(1):381-406. doi:10.1146/annurev.genom.6.080604.162234

13. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333-351. doi:10.1038/nrg.2016.49

14. Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011;39(SUPPL. 1):2010-2012. doi:10.1093/nar/gkq1019

15. Vizcaíno JA, Deutsch EW, Wang R, et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223-226. doi:10.1038/nbt.2839

16. Boroughs LK, Deberardinis RJ. Metabolic pathways promoting cancer cell survival and growth. Nat Cell Biol. 2015;17(4):351-359. doi:10.1038/ncb3124

17. Anadón C, Guil S, Simó-Riudalbas L, et al. Gene amplification-associated overexpression of the RNA editing enzyme ADAR1 enhances human lung tumorigenesis. Oncogene. 2015;(October):1-7. doi:10.1038/onc.2015.469

18. Han L, Diao L, Yu S, et al. The Genomic Landscape and Clinical Relevance of A-

to-I RNA Editing in Human Cancers. Cancer Cell. 2015;28(4):515-528.

doi:10.1016/j.ccell.2015.08.013

19. Ding L, Getz G, Wheeler D a, et al. Somatic mutations affect key pathways in lung

adenocarcinoma. Nature. 2008;455(7216):1069-1075. doi:10.1038/nature07423

20. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and

the search for new cancer-associated genes. Nature. 2013;499(7457):214-218.

doi:10.1038/nature12213

21. Flavahan WA, Gaskell E, Bernstein BE. Epigenetic plasticity and the hallmarks of

cancer. Science. 2017;357:1-8. doi:10.1126/science.aal2380

22. Herman JG, Umar A, Polyak K, et al. Incidence and functional consequences of

hMLH1 promoter hypermethylation in colorectal carcinoma. Proc Natl Acad Sci.

1998;95(June):6870-6875.

23. Zack TI, Schumacher SE, Carter SL, et al. Pan-cancer patterns of somatic copy

number alteration. Nat Genet. 2013;45(10):1134-1140. doi:10.1038/ng.2760

24. Wilson BG, Roberts CWM. SWI/SNF nucleosome remodellers and cancer. Nat

Rev Cancer. 2011;11(7):481-492. doi:10.1038/nrc3068

25. Nesvizhskii AI. Proteogenomics: concepts, applications and computational

strategies. Nat Methods. 2014;11(11):1114-1125. doi:10.1038/nmeth.3144

26. Wang X, Zhang B. customProDB: an R package to generate customized protein

databases from RNA-Seq data for proteomics search. Bioinformatics.

2013;29(24):3235-3237. doi:10.1093/bioinformatics/btt543

27. Ruggles KV, Tang Z, Wang X, et al. An analysis of the sensitivity of

proteogenomic mapping of somatic mutations and novel splicing events in cancer.

Mol Cell Proteomics. 2015:mcp.M115.056226. doi:10.1074/mcp.M115.056226

28. Li JJ, Biggin MD. Statistics requantitates the central dogma. Science.

2015;347(6226):1066-1067.

29. Battle A, Khan Z, Wang S, et al. Impact of regulatory variation from RNA to

protein. Science. 2015;(18515).

30. Zhang H, Liu T, Zhang Z, et al. Integrated Proteogenomic Characterization of

Human High-Grade Serous Ovarian Cancer. Cell. 2016;166:1-11.

doi:10.1016/j.cell.2016.05.069

31. Mertins P, Mani DR, Ruggles KV, et al. Proteogenomics connects somatic

mutations to signalling in breast cancer. Nature. 2016;534(7605):55-62.

doi:10.1038/nature18003

32. Zhang B, Wang J, Wang X, et al. Proteogenomic characterization of human colon

and rectal cancer. Nature. 2014;513:382-387. doi:10.1038/nature13438

33. Li JJ, Bickel PJ, Biggin MD. System wide analyses have underestimated protein

abundances and the importance of transcription in mammals. PeerJ. 2014;2:e270.

doi:10.7717/peerj.270

34. Liu Q, Zhang B. Integrative Omics Analysis Reveals Post-Transcriptionally

Enhanced Protective Host Response in Colorectal Cancers with Microsatellite

Instability. J Proteome Res. 2015. doi:10.1021/acs.jproteome.5b00847

35. Jovanovic M, Rooney MS, Mertins P, et al. Dynamic profiling of the protein life

cycle in response to pathogens. Science. 2015;347(6226).

doi:10.1126/science.1260793

36. Rudnick PA, Markey SP, Roth J, et al. A Description of the Clinical Proteomic

Tumor Analysis Consortium (CPTAC) Common Data Analysis Pipeline. J

Proteome Res. 2016;15(3):1023-1032. doi:10.1021/acs.jproteome.5b01091

37. Chen G, Gharib TG, Huang C-C, et al. Discordant protein and mRNA expression

in lung adenocarcinomas. Mol Cell Proteomics. 2002;1(4):304-313.

doi:10.1074/mcp.M200008-MCP200

38. Stewart PA, Parapatics K, Welsh EA, et al. A Pilot Proteogenomic Study with

Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung

Adenocarcinoma. PLoS One. 2015;10(11):e0142162.

doi:10.1371/journal.pone.0142162

39. Li L, Wei Y, To C, et al. Integrated Omic analysis of lung cancer reveals

metabolism proteome signatures with prognostic impact. Nat Commun.

2014;5(May):5469. doi:10.1038/ncomms6469

40. Maemondo M, Inoue A, Kobayashi K, et al. Gefitinib or chemotherapy for non-

small-cell lung cancer with mutated EGFR. N Engl J Med. 2010;362(25):2380-

2388. doi:10.1056/NEJMoa0909530

41. Hata AN, Niederst MJ, Archibald HL, et al. Tumor cells can follow distinct

evolutionary paths to become resistant to epidermal growth factor receptor

inhibition. Nat Med. 2016;(August 2015). doi:10.1038/nm.4040

42. Jackman D, Pao W, Riely GJ, et al. Clinical definition of acquired resistance to

epidermal growth factor receptor tyrosine kinase inhibitors in non-small-cell lung

cancer. J Clin Oncol. 2010;28(2):357-360. doi:10.1200/JCO.2009.24.7049

43. Sen DR. The epigenetic landscape of T cell exhaustion. Science. 2016:1-6.

44. Gajewski TF, Schreiber H, Fu Y. Innate and adaptive immune cells in the tumor

microenvironment. Nat Immunol. 2013;14(10). doi:10.1038/ni.2703

45. McGranahan N, Furness AJS, Rosenthal R, et al. Clonal neoantigens elicit T cell

immunoreactivity and sensitivity to immune checkpoint blockade. Science.

2016;351(6280):1463-1470.

46. Garcia-Lora A, Algarra I, Garrido F. MHC class I antigens, immune surveillance,

and tumor immune escape. J Cell Physiol. 2003;195(3):346-355.

doi:10.1002/jcp.10290

47. Marty R, Kaabinejadian S, Rossell D, et al. MHC-I Genotype Restricts the

Oncogenic Mutational Landscape. Cell. 2017:1-12. doi:10.1016/j.cell.2017.09.050

48. Campbell JD, Alexandrov A, Kim J, et al. Distinct patterns of somatic genome

alterations in lung adenocarcinomas and squamous cell carcinomas. Nat Genet.

2016;48(6):607-616. doi:10.1038/ng.3564

49. Carbone DP, Reck M, Paz-Ares L, et al. First-Line Nivolumab in Stage IV or

Recurrent Non–Small-Cell Lung Cancer. N Engl J Med. 2017;376(25):2415-2426.

doi:10.1056/NEJMoa1613493

50. Rizvi NA, Hellmann MD, Snyder A, et al. Mutational landscape determines

sensitivity to PD-1 blockade in non-small cell lung cancer. Science.

2015;348(6230):124-129.

51. Łuksza M, Riaz N, Makarov V, et al. A neoantigen fitness model predicts tumour

response to checkpoint blockade immunotherapy. Nature. 2017;551(7681):517-

520. doi:10.1038/nature24473

52. Hellmann M, Antonia S, Atmaca A, et al. Impact of Tumor Mutation Burden on the

Efficacy of Nivolumab + Ipilimumab in Small Cell Lung Cancer: An Exploratory

Analysis of Checkmate 032. In: World Conference on Lung Cancer; 2017.

53. Ayers M, Lunceford J, Nebozhyn M, et al. IFN-γ-related mRNA profile predicts

clinical response to PD-1 blockade. J Clin Invest. 2017;127(15):1-11.

doi:10.1172/JCI91190

54. Riaz N, Havel JJ, Makarov V, Horak CE, Weinhold N, Chan TA. Tumor and

Microenvironment Evolution during Immunotherapy with Nivolumab. Cell.

2017;171:934-949. doi:10.1016/j.cell.2017.09.028

55. Cai X, Chiu YH, Chen ZJ. The cGAS-cGAMP-STING pathway of cytosolic DNA

sensing and signaling. Mol Cell. 2014;54(2):289-296.

doi:10.1016/j.molcel.2014.03.040

56. Goubau D, Deddouche S, Reis e Sousa C. Cytosolic Sensing of Viruses. Immunity.

2013;38(5):855-869. doi:10.1016/j.immuni.2013.05.007

57. Roulois D, Yau HL, Singhania R, et al. DNA-Demethylating Agents Target

Colorectal Cancer Cells by Inducing Viral Mimicry by Endogenous Transcripts.

Cell. 2015;162(5):961-973. doi:10.1016/j.cell.2015.07.056

58. Chiappinelli KB, Strissel PL, Desrichard A, et al. Inhibiting DNA Methylation

Causes an Interferon Response in Cancer via dsRNA Including Endogenous

Retroviruses. Cell. 2015;162(5):974-986. doi:10.1016/j.cell.2015.07.011

59. Sagiv-Barfi I, Czerwinski DK, Levy S, et al. Eradication of spontaneous

malignancy by local immunotherapy. Sci Transl Med. 2018;10(31).

doi:10.1126/scitranslmed.aan4488

60. Chen CX, Cho DS, Wang Q, et al. A third member of the RNA-specific adenosine

deaminase gene family, ADAR3, contains both single- and double-stranded RNA

binding domains. RNA. 2000;6:755-767.

61. Tan MH, Li Q, Shanmugam R, et al. Dynamic landscape and regulation of RNA

editing in mammals. Nature. 2017;550(7675):249-254. doi:10.1038/nature24041

62. Melcher T, Maas S, Herb A, Sprengel R, Seeburg PH, Higuchi M. A mammalian

RNA editing enzyme. Nature. 1996;379(6564):460-464. doi:10.1038/379460a0

63. Schwartz T, Rould MA, Lowenhaupt K, Herbert A, Rich A. Crystal structure of

the Zalpha domain of the human editing enzyme ADAR1 bound to left-handed

Z-DNA. Science. 1999;284(5421):1841-1845.

doi:10.1126/science.284.5421.1841

64. Lei M, Liu Y, Samuel CE. Adenovirus VAI RNA antagonizes the RNA-editing

activity of the ADAR adenosine deaminase. Virology. 1998;245(2):188-196.

doi:10.1006/viro.1998.9162

65. George CX, Ramaswami G, Li JB, Samuel CE. Editing of cellular self RNAs by

adenosine deaminase ADAR1 suppresses innate immune stress responses. J Biol

Chem. 2016;291(12):6158-6168. doi:10.1074/jbc.M115.709014

66. Pestal K, Funk CC, Snyder JM, Price ND, Treuting PM, Stetson DB. Isoforms of

RNA-Editing Enzyme ADAR1 Independently Control Nucleic Acid Sensor

MDA5-Driven Autoimmunity and Multi-organ Development. Immunity.

2015;43(5):933-944. doi:10.1016/j.immuni.2015.11.001

67. Liddicoat BJ, Piskol R, Chalk AM, et al. RNA editing by ADAR1 prevents MDA5

sensing of endogenous dsRNA as nonself. Science. 2015;349(6252):1115-1120.

doi:10.1126/science.aac7049

68. Rice GI, Kasher PR, Forte GMA, et al. Mutations in ADAR1 cause Aicardi-

Goutières syndrome associated with a type I interferon signature. Nat Genet.

2012;44(11):1243-1248. doi:10.1038/ng.2414

69. Mannion NM, Greenwood SM, Young R, et al. The RNA-Editing Enzyme

ADAR1 Controls Innate Immune Responses to RNA. Cell Rep. 2014;9(4):1482-

1494. doi:10.1016/j.celrep.2014.10.041

70. Ahmad S, Mu X, Yang F, et al. Breaching Self-Tolerance to Alu Duplex RNA

Underlies MDA5-Mediated Inflammation. Cell. 2018;172(4):797-802.e13.

doi:10.1016/j.cell.2017.12.016

71. Ross JP, Rand KN, Molloy PL. Hypomethylation of repeated DNA sequences in

cancer. Epigenomics. 2010;2(2):245-269. doi:10.2217/epi.10.2

72. Yang C, Chen Y, Chang Y, et al. ADAR1-mediated 3′ UTR editing and

expression control of antiapoptosis genes fine-tunes cellular apoptosis response.

Cell Death Dis. 2017;8:e2833-13. doi:10.1038/cddis.2017.12

73. Sakurai M, Shiromoto Y, Ota H, et al. ADAR1 controls apoptosis of stressed cells

by inhibiting Staufen1-mediated mRNA decay. Nat Struct Mol Biol.

2017;24(6):534-543. doi:10.1038/nsmb.3403

74. Gao J, Aksoy BA, Dogrusoz U, et al. Integrative analysis of complex cancer

genomics and clinical profiles using the cBioPortal. Sci Signal.

2013;6(269):pl1. doi:10.1126/scisignal.2004088

75. Paz-Yaacov N, Bazak L, Buchumenski I, et al. Elevated RNA Editing Activity Is a

Major Contributor to Transcriptomic Diversity in Tumors. Cell Rep.

2015;13(2):267-276. doi:10.1016/j.celrep.2015.08.080

76. Jiang Q, Crews LA, Barrett CL, et al. ADAR1 promotes malignant progenitor

reprogramming in chronic myeloid leukemia. Proc Natl Acad Sci U S A.

2013;110(3):1041-1046. doi:10.1073/pnas.1213021110

77. Goh JY, Feng M, Wang W, et al. Chromosome 1q21.3 amplification is a trackable

biomarker and actionable target for breast cancer recurrence. Nat Med.

2017;23(11). doi:10.1038/nm.4405

78. Chan TH, Lin CH, Qi L, et al. A disrupted RNA editing balance mediated by

ADARs (Adenosine DeAminases that act on RNA) in human hepatocellular

carcinoma. Gut. 2014;63(5):832-843. doi:10.1136/gutjnl-2012-304037

79. Chen L, Li Y, Lin CH, et al. Recoding RNA editing of AZIN1 predisposes to

hepatocellular carcinoma. Nat Med. 2013;19(2):209-216. doi:10.1038/nm.3043

80. Fumagalli D, Gacquer D, Rothé F, et al. Principles Governing A-to-I RNA Editing

in the Breast Cancer Transcriptome. Cell Rep. 2015;13(2):277-289.

doi:10.1016/j.celrep.2015.09.032

81. Chan THM, Qamra A, Tan KT, et al. ADAR-mediated RNA editing predicts

progression and prognosis of Gastric Cancer. Gastroenterology. 2016;151(4):637-

650.e10. doi:10.1053/j.gastro.2016.06.043

82. Bahn JH, Ahn J, Lin X, et al. Genomic analysis of ADAR1 binding and its

involvement in multiple RNA processing pathways. Nat Commun. 2015;6:6355.

doi:10.1038/ncomms7355

83. Stellos K, Gatsiou A, Stamatelopoulos K, et al. Adenosine-to-inosine RNA editing

controls cathepsin S expression in atherosclerosis by enabling HuR-mediated post-

transcriptional regulation. Nat Med. 2016;22(10):1140-1150. doi:10.1038/nm.4172

84. Zhang L, Yang C-S, Varelas X, Monti S. Altered RNA editing in 3’ UTR perturbs

microRNA-mediated regulation of oncogenes and tumor-suppressors. Sci Rep.

2016;6(November 2015):23226. doi:10.1038/srep23226

85. Ramaswami G, Li JB. RADAR: A rigorously annotated database of A-to-I RNA

editing. Nucleic Acids Res. 2014;42(D1):109-113. doi:10.1093/nar/gkt996

86. Paz I, Kosti I, Ares M, Cline M, Mandel-Gutfreund Y. RBPmap: a web server for

mapping binding sites of RNA-binding proteins. Nucleic Acids Res. 2014;42:361-

367. doi:10.1093/nar/gku406

87. Grimson A, Farh KKH, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP.

MicroRNA Targeting Specificity in Mammals: Determinants beyond Seed Pairing.

Mol Cell. 2007;27(1):91-105. doi:10.1016/j.molcel.2007.06.017

88. Kerpedjiev P, Hammer S, Hofacker IL. Forna (force-directed RNA): Simple and

effective online RNA secondary structure diagrams. Bioinformatics.

2015;31(20):3377-3379. doi:10.1093/bioinformatics/btv372

89. Barbie DA, Tamayo P, Boehm JS, et al. Systematic RNA interference reveals that

oncogenic KRAS-driven cancers require TBK1. Nature.

2009;462(November):108-112. doi:10.1038/nature08460

90. Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat

Commun. 2015;6:8971. doi:10.1038/ncomms9971

91. Li B, Severson E, Pignon J-C, et al. Comprehensive analyses of

tumor immunity: implications for cancer immunotherapy. Genome Biol.

2016;17(174):1-16. doi:10.1186/s13059-016-1028-7

92. Gutman D, Cobb J, Somanna D, et al. Cancer Digital Slide Archive: an informatics

resource to support integrated in silico analysis of TCGA pathology data. J Am

Med Informatics Assoc. 2013;20(6):1091-1098. doi:10.1136/amiajnl-2012-001469

93. Wang IX, So E, Devlin JL, Zhao Y, Wu M, Cheung VG. ADAR Regulates RNA

Editing, Transcript Stability, and Gene Expression. Cell Rep. 2013;5(3):849-860.

doi:10.1016/j.celrep.2013.10.002

94. Monajemi H, Fontijn RD, Pannekoek H, Horrevoets AJG. The Apolipoprotein L

Gene Cluster Has Emerged Recently in Evolution and Is Expressed in Human

Vascular Tissue. Genomics. 2002;79(4):539-546. doi:10.1006/geno.2002.6729

95. Parsa A, Kao L, Xie D, et al. APOL1 risk variants, race, and progression of

chronic kidney disease. N Engl J Med. 2013;369(23):2183-2196.

doi:10.1056/NEJMoa1310345

96. Vanhamme L, Paturiaux-Hanocq F, Poelvoorde P, et al. Apolipoprotein L-I is the

trypanosome lytic factor of human serum. Nature. 2003;422(March):83-87.

doi:10.1038/nature01457.1.

97. Nichols B, Jog P, Lee JH, et al. Innate immunity pathways regulate the

nephropathy gene Apolipoprotein L1. Kidney Int. 2015;87(2):332-342.

doi:10.1038/ki.2014.270

98. Gentles AJ, Newman AM, Liu CL, et al. The prognostic landscape of genes and

infiltrating immune cells across human cancers. Nat Med. 2015;21(8):1-12.

doi:10.1038/nm.3909

99. Chen EY, Tan CM, Kou Y, et al. Enrichr: interactive and collaborative HTML5

gene list enrichment analysis tool. BMC Bioinformatics. 2013;14(128).

100. Kuleshov MV, Jones MR, Rouillard AD, et al. Enrichr: a comprehensive gene set

enrichment analysis web server 2016 update. Nucleic Acids Res.

2016;44(May):90-97. doi:10.1093/nar/gkw377

101. Carpenter S, Ricci EP, Mercier BC, Moore MJ, Fitzgerald KA. Post-transcriptional

regulation of gene expression in innate immunity. Nat Rev Immunol.

2014;14(6):361-376. doi:10.1038/nri3682

102. McCormick C, Khaperskyy DA. Translation inhibition and stress granules in the

antiviral immune response. Nat Rev Immunol. 2017. doi:10.1038/nri.2017.63

103. Li Y, Banerjee S, Goldstein SA, et al. Ribonuclease L mediates the cell-lethal

phenotype of double-stranded RNA editing enzyme ADAR1 deficiency in a

human cell line. Elife. 2017;6(e25687):1-18. doi:10.7554/eLife.25687

104. Wu B, Peisley A, Richards C, et al. Structural basis for dsRNA recognition,

filament formation, and antiviral signal activation by MDA5. Cell. 2013;152(1-

2):276-289. doi:10.1016/j.cell.2012.11.048

105. Hou F, Sun L, Zheng H, Skaug B, Jiang QX, Chen ZJ. MAVS forms functional

prion-like aggregates to activate and propagate antiviral innate immune response.

Cell. 2011;146(3):448-461. doi:10.1016/j.cell.2011.06.041

106. Liberzon A, Birger C, Thorvaldsdóttir H, et al. The Molecular Signatures Database.

Cell Syst. 2015;1(1):417-425. doi:10.1016/j.cels.2015.12.004

107. Vitali P, Scadden ADJ. Double-stranded RNAs containing multiple IU pairs are

sufficient to suppress interferon induction and apoptosis. Nat Struct Mol Biol.

2010;17(9):1043-1050. doi:10.1038/nsmb.1864

108. Wrangle J, Wang W, Koch A, et al. Alterations of immune response of non-small

cell lung cancer with Azacytidine. Oncotarget. 2013;4(11):2067-2079.

109. Winton T, Livingston R, Johnson D, et al. Vinorelbine plus cisplatin vs.

observation in resected non-small-cell lung cancer. N Engl J Med.

2005;352(25):2589-2597. doi:10.1056/NEJMoa043623

110. Zhu C-Q, Ding K, Strumpf D, et al. Prognostic and Predictive Gene Signature for

Adjuvant Chemotherapy in Resected Non-Small-Cell Lung Cancer. J Clin Oncol.

2010;28(29):4417-4424. doi:10.1200/JCO.2009.26.4325

111. Nie L, Wu G, Zhang W. Correlation between mRNA and protein abundance in

Desulfovibrio vulgaris: A multiple regression to identify sources of variations.

Biochem Biophys Res Commun. 2006;339(2):603-610.

doi:10.1016/j.bbrc.2005.11.055

112. Marguerat S, Schmidt A, Codlin S, Chen W, Aebersold R, Bähler J. Quantitative

analysis of fission yeast transcriptomes and proteomes in proliferating and

quiescent cells. Cell. 2012;151(3):671-683. doi:10.1016/j.cell.2012.09.019

113. Tian Q, Stepaniants SB, Mao M, et al. Integrated genomic and proteomic analyses

of gene expression in Mammalian cells. Mol Cell Proteomics. 2004;3(10):960-969.

doi:10.1074/mcp.M400055-MCP200

114. Wei Y-N, Hu H-Y, Xie G-C, et al. Transcript and protein expression decoupling

reveals RNA binding proteins and miRNAs as potential modulators of human

aging. Genome Biol. 2015;16(1):1-15. doi:10.1186/s13059-015-0608-2

115. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2:

accurate alignment of transcriptomes in the presence of insertions, deletions and

gene fusions. Genome Biol. 2013;14(4):R36. doi:10.1186/gb-2013-14-4-r36

116. Anders S, Pyl PT, Huber W. HTSeq-A Python framework to work with high-

throughput sequencing data. Bioinformatics. 2015;31(2):166-169.

doi:10.1093/bioinformatics/btu638

117. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a

curated non-redundant sequence database of genomes, transcripts and proteins.

Nucleic Acids Res. 2007;35(Database issue):D61-5. doi:10.1093/nar/gkl842

118. Engström PG, Steijger T, Sipos B, et al. Systematic evaluation of spliced

alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185-1191.

doi:10.1038/nmeth.2722

119. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq

aligner. Bioinformatics. 2013;29:15-21.

120. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program

for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-

930. doi:10.1093/bioinformatics/btt656

121. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and

SAMtools. Bioinformatics. 2009;25(16):2078-2079.

doi:10.1093/bioinformatics/btp352

122. Sprung RW, Brock JWC, Tanksley JP, et al. Equivalence of Protein Inventories

Obtained from Formalin-fixed Paraffin-embedded and Frozen Tissue in

Multidimensional Liquid Chromatography-Tandem Mass Spectrometry Shotgun

Proteomic Analysis. Mol Cell Proteomics. 2009;8(8):1988-1998.

doi:10.1074/mcp.M800518-MCP200

123. Scicchitano MS, Dalmas DA, Boyce RW, Thomas HC, Frazier KS. Protein

extraction of formalin-fixed, paraffin-embedded tissue enables robust proteomic

profiles by mass spectrometry. J Histochem Cytochem. 2009;57(9):849-860.

doi:10.1369/jhc.2009.953497

124. Kim S, Boyd S, Gorinevsky D. l1 Trend Filtering. SIAM Rev. 2009;51(2):339-360.

125. Karolchik D, Hinrichs AS, Furey TS, et al. The UCSC Table Browser data

retrieval tool. Nucleic Acids Res. 2004;32(Database issue):D493-6.

doi:10.1093/nar/gkh103

126. Ray D, Kazan H, Cook KB, et al. A compendium of RNA-binding motifs for

decoding gene regulation. Nature. 2013;499(7457):172-177.

doi:10.1038/nature12311

127. Agarwal V, Bell GW, Nam JW, Bartel DP. Predicting effective microRNA target

sites in mammalian mRNAs. Elife. 2015;4(AUGUST2015):1-38.

doi:10.7554/eLife.05005

128. Li J, Tibshirani R. Finding consistent patterns: A nonparametric approach for

identifying differential expression in RNA-Seq data. Stat Methods Med Res.

2011;22(5):519-536. doi:10.1177/0962280211428386

129. Sankala H, Vaughan C, Wang J, Deb S, Graves PR. Upregulation of the

mitochondrial transport protein, Tim50, by mutant p53 contributes to cell growth

and chemoresistance. Arch Biochem Biophys. 2011;512(1):52-60.

doi:10.1016/j.abb.2011.05.005

130. Gao H, Korn JM, Ferretti S, et al. High-throughput screening using patient-derived

tumor xenografts to predict clinical trial drug response. Nat Med.

2015;21(11):1318-1325. doi:10.1038/nm.3954

131. Eifler K, Vertegaal ACO. SUMOylation-Mediated Regulation of Cell Cycle

Progression and Cancer. Trends Biochem Sci. 2015;40(12):779-793.

doi:10.1016/j.tibs.2015.09.006

132. Johnen G, Kaufman S. Studies on the enzymatic and transcriptional activity of the

dimerization cofactor for hepatocyte nuclear factor 1. Proc Natl Acad Sci U S A.

1997;94(25):13469-13474.

133. Hoskins JW, Jia J, Flandez M, et al. Transcriptome analysis of pancreatic cancer

reveals a tumor suppressor function for HNF1A. Carcinogenesis.

2014;35(12):2670-2678. doi:10.1093/carcin/bgu193

134. Yim J, Shik H, Lee S, et al. Radiosensitizing effect of PSMC5, a 19S proteasome

ATPase, in H460 lung cancer cells. Biochem Biophys Res Commun.

2016;469(1):94-100. doi:10.1016/j.bbrc.2015.11.077

135. Choy L, Derynck R. The type II transforming growth factor (TGF)-β

receptor-interacting protein TRIP-1 acts as a modulator of the TGF-β response. J Biol

Chem. 1998;273(47):31455-31462. doi:10.1074/jbc.273.47.31455

136. Kobayashi H, Nishimura H, Matsumoto K, Yoshida M. Identification of the

determinants of 2-deoxyglucose sensitivity in cancer cells by shRNA library

screening. Biochem Biophys Res Commun. 2015;467(1):121-127.

doi:10.1016/j.bbrc.2015.09.106

137. Tomida S, Koshikawa K, Yatabe Y, et al. Gene expression-based, individualized

outcome prediction for surgically treated lung cancer patients. Oncogene.

2004;23(31):5360-5370. doi:10.1038/sj.onc.1207697

138. Qu Y, Yang Y, Liu B, Xiao W. Comparative proteomic profiling identified sorcin being

associated with gemcitabine resistance in non-small cell lung cancer. Med Oncol.

2010;27(4):1303-1308. doi:10.1007/s12032-009-9379-5

139. Maddalena F, Laudiero G, Piscazzi A, et al. Sorcin Induces a Drug-Resistant

Phenotype in Human Colorectal Cancer by Modulating Ca2+ Homeostasis.

Cancer Res. 2011;71(24):7659-7669. doi:10.1158/0008-5472.CAN-11-2172

140. Landriscina M, Laudiero G, Maddalena F, et al. Mitochondrial chaperone Trap1

and the calcium binding protein Sorcin interact and protect cells against apoptosis

induced by antiblastic agents. Cancer Res. 2010;70(16):6577-6586.

doi:10.1158/0008-5472.CAN-10-1256

141. Garon EB, Rizvi NA, Hui R, et al. Pembrolizumab for the Treatment of Non–

Small-Cell Lung Cancer. N Engl J Med. 2015;372(21):2018-2028.

doi:10.1056/NEJMoa1501824

142. Van Allen EM, Miao D, Schilling B, et al. Genomic correlates of response to

CTLA-4 blockade in metastatic melanoma. Science. 2015;350(6257):207-211.

doi:10.1126/science.aad0095

143. Snyder A, Makarov V, Merghoub T, et al. Genetic Basis for Clinical Response to

CTLA-4 Blockade in Melanoma. N Engl J Med. 2014;371(23):2189-2199.

doi:10.1056/NEJMoa1406498

144. Le DT, Uram JN, Wang H, et al. PD-1 Blockade in Tumors with Mismatch-Repair

Deficiency. N Engl J Med. 2015;372(26):2509-2520.

doi:10.1056/NEJMoa1500596

145. Le DT, Durham JN, Smith KN, et al. Mismatch repair deficiency predicts response

of solid tumors to PD-1 blockade. Science. 2017;357(6349):409-413.

doi:10.1126/science.aan6733

146. Alexandrov LB, Ju YS, Haase K, et al. Mutational signatures associated with

tobacco smoking in human cancer. Science. 2016;354(6312):618-622.

doi:10.1126/science.aag0299

147. Chae YK, Anker JF, Carneiro BA, Platanias C, Giles FJ. Genomic landscape of

DNA repair genes in cancer. Oncotarget. 2016;7(17).

148. Campbell BB, Light N, Fabrizio D, et al. Comprehensive Analysis of

Hypermutation in Human Cancer. Cell. 2017:1042-1056.

doi:10.1016/j.cell.2017.09.048

149. Hause RJ, Pritchard CC, Shendure J, Salipante SJ. Classification and

characterization of microsatellite instability across 18 cancer types. Nat Med.

2016;22(11):1342-1350. doi:10.1038/nm.4191

150. Suzuki T, Grúz P, Honma M, Adachi N, Nohmi T. Sensitivity of human cells

expressing low-fidelity or weak-catalytic-activity variants of DNA polymerase ζ to

genotoxic stresses. DNA Repair (Amst). 2016;45:34-43.

doi:10.1016/j.dnarep.2016.06.002

151. Singh B, Li X, Owens KM, Vanniarajan A, Liang P, Singh KK. Human REV3

DNA polymerase zeta localizes to mitochondria and protects the mitochondrial

genome. PLoS One. 2015;10(10):1-18. doi:10.1371/journal.pone.0140409

152. Wang W, Sheng W, Yu C, et al. REV3L modulates cisplatin sensitivity of non-

small cell lung cancer H1299 cells. Oncol Rep. 2015;34(3):1460-1468.

doi:10.3892/or.2015.4121

153. Huang KK, Jang KW, Kim S, et al. Exome sequencing reveals recurrent REV3L

mutations in cisplatin-resistant squamous cell carcinoma of head and neck. Sci

Rep. 2016;6(January):1-10. doi:10.1038/srep19552

154. Yang L, Shi T, Liu F, et al. REV3L, a promising target in regulating the

chemosensitivity of cervical cancer cells. PLoS One. 2015;10(3):1-18.

doi:10.1371/journal.pone.0120334

155. Manguso RT, Hans W, Zimmer MD, et al. In vivo CRISPR screening identifies

Ptpn2 as a cancer immunotherapy target. Nature. 2017;547(7664):413-418.

doi:10.1038/nature23270
