AN INVESTIGATION INTO THE NON-CODING GENOMIC LANDSCAPE AND

EFFECTS OF CHEMOTHERAPEUTICS IN PRE-TREATED ADVANCED CANCERS

by

Harwood Kwan

B.Sc., The University of British Columbia, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

March 2020

© Harwood Kwan, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

An investigation into the non-coding genomic landscape and effects of chemotherapeutics in pre-treated advanced cancers

submitted by Harwood Kwan in partial fulfillment of the requirements for the degree of Master of Science in Medical Genetics

Examining Committee:

Steven Jones, Professor, Medical Genetics, UBC
Supervisor

Inanc Birol, Professor, Medical Genetics, UBC
Supervisory Committee Member

Peter Stirling, Associate Professor, Medical Genetics, UBC
Supervisory Committee Member

Philipp Lange, Assistant Professor, Pathology and Laboratory Medicine, UBC
Additional Examiner

Abstract

Cancer is a disease which arises due to somatic alterations in the genome. However, most studies on cancer genetics only explore the impact that coding mutations have on the progression of the disease. Furthermore, many genomic inquiries on cancer only examine primary untreated tumours, which misses the impact of metastasis and treatment. Here we present a cohort of 638 advanced cancer patients with whole-genome, transcriptomic and clinical information. Through this cohort, we attempt to better characterize the non-coding region of metastatic cancers and to understand the mutational impact of chemotherapeutics. Using a positional clustering method, we identified 1,567 significant mutational hotspots in the genome. A total of 86 genes were identified as being affected by a hotspot in a regulatory region, including the TERT promoter, a region with well-known driver mutations. To characterize the biological function of the hotspots, we analyzed their impact on corresponding gene expression. We show increased expression of TERT and AP2A1 when their respective promoter regions are mutated, the latter being a novel association. Mutational clusters affecting non-coding RNAs were also examined for any functional impact, but no significant associations were seen. Large non-coding mutational events such as kataegis were seen in multiple cancer types and across all chromosomes. However, little recurrence was seen for kataegis. Additionally, using observed mutational frequencies, we attempt to identify any mutations that may be treatment-induced.

Examining the breast, lung, colon, pancreas and ovarian cohorts, we were able to extract known resistance mutations such as ESR1 mutations after aromatase inhibitor treatment and EGFR T790M mutations post anti-EGFR therapy. Further insights are required to confirm the expression change seen in the cohort. Additional studies to determine AP2A1’s role in cancer would help understand this correlation. Overall, our study shows the presence of important mutations in the non-coding space of metastatic cancers, and the power of whole genome sequencing. Furthermore, we demonstrate the need for similar datasets to identify mutations which correlate with resistance.

Lay Summary

Cancers are caused by changes in our genetic information that result in uncontrolled growth of our cells. Most of our information about cancers comes from early-stage cancers that have not undergone any treatment, limiting our understanding of how the disease progresses. Additionally, much is still unknown about how mutations in non-coding DNA regions can affect cancer progression. Here we present a study on the genomic landscape of advanced cancers, where we attempt to gain a better understanding of metastatic mutations and the effects of chemotherapies on our DNA. We show that the most prevalent mutations reside in the coding regions, but functionally relevant mutations can be seen in the non-coding regions as well. However, there does not appear to be a single mutation responsible for metastasis. We also show that, using clinical data, we can identify mutations that arise due to chemotherapeutics.

Preface

Under the supervision of Dr. Steven Jones, I, Harwood Kwan, designed the experiments and studies described within the thesis with the assistance of Dr. Erin Pleasance. This work was approved by and conducted under the University of British Columbia – British Columbia Cancer Agency Research Ethics Board (H12-00137, H14-00681), and approved by the institutional review board (IRB). The POG program is registered under clinical trial number NCT02155621.

Patients were referred to the POG program through their treating oncologist and enrolled into the program through a POG trained oncologist or study nurse. Sample collection was performed by the overseeing surgical oncologist. Dr. Andrew Mungall was responsible for the processing and library construction of the samples. Dr. Richard Moore oversaw the sequencing of the samples. Eric Chuah, Karen Mungall, Tina Wong and Reanne Bowlby supervised the alignment and variant calling of the samples. Each person named oversaw multiple individuals who contributed to some part of the overall pipeline.

A version of Chapter 2 is published in “Pleasance, E.D., Titmuss E., Williamson L., et al. (2020). Pan-cancer analysis of advanced patient tumors reveals interactions between therapy and genomic landscapes. Nature Cancer. In press.” I performed all the computational experiments described. Aside from the clustering algorithm, I wrote the remainder of the code for the positional clustering of mutations, including determination of statistical significance and expression analysis. I analyzed the results and selected candidate mutations for downstream analysis. Dr. Jahanshah Ashkani was responsible for the STAR alignment of RNA-seq used in expression analysis. I also called the variants in the TCGA samples and performed all the expression analysis in that dataset. I also wrote all the code pertaining to kataegis identification and performed all the computational analysis regarding kataegis events. The portion of the manuscript pertaining to this work was co-written by myself, Dr. Erin Pleasance, Dr. Laura Williamson and Emma Titmuss.

A version of Chapter 3 is also published in “Pleasance, E.D., Titmuss E., Williamson L., et al. (2020). Pan-cancer analysis of advanced patient tumors reveals interactions between therapy and genomic landscapes. Nature Cancer. In press.” I wrote and performed all computational experiments described. Biological relevance was examined by both Dr. Erin Pleasance and me. The portion of the manuscript pertaining to this work was co-written by myself, Dr. Erin Pleasance, Dr. Laura Williamson and Emma Titmuss.

Table of Contents

Abstract ...... iii

Lay Summary ...... v

Preface ...... vi

Table of Contents ...... viii

List of Tables ...... xii

List of Figures ...... xiii

List of Abbreviations ...... xv

Acknowledgements ...... xviii

Dedication ...... xx

Chapter 1: Introduction ...... 1

1.1 Research Aims ...... 1

1.2 Background ...... 1

1.2.1 Cancer is a Global Health and Economic Issue ...... 1

1.2.2 Genetics of Cancer ...... 2

History of Cancer Genetics ...... 2

Cancerous Mutations Arise from Multiple Sources ...... 3

DNA Repair Plays an Important Role in Cancer Progression ...... 5

Properties of Oncogenic Mutations ...... 6

Non-coding DNA Play Functional Roles in Cancer ...... 8

Kataegis ...... 14

Cancer Evolves Over Time ...... 15

1.2.3 Metastatic Cancer ...... 16

1.2.4 Whole Genome Sequencing in Cancer ...... 18

History of Sequencing ...... 18

Sequencing in Cancer ...... 19

Identification of Mutations ...... 20

Methods of Studying Recurrent Cancerous Mutations ...... 23

1.2.5 Personalized Oncogenomics ...... 24

Overview of Cancer Therapy ...... 24

Genetics in Cancer Therapy ...... 26

1.2.6 Summary ...... 28

Chapter 2: Non-coding Mutations in Pre-Treated Metastatic Cancer ...... 30

2.1 Introduction ...... 30

2.2 Methods ...... 30

2.2.1 Patient Samples ...... 30

2.2.2 Sample Collection, Preparation and Sequencing ...... 32

2.2.3 Alignment and Variant Calling ...... 32

2.2.4 Gene Expression Profiling ...... 33

2.2.5 Clustering of Recurrent Mutations ...... 33

2.2.6 Functional Validation of Non-Coding Regions ...... 36

2.2.7 Validation with an External Cohort ...... 37

2.2.8 Identification of Regions of Kataegis ...... 37

2.3 Results ...... 38

2.3.1 Filtering against normal baseline removes artifacts ...... 38

2.3.2 Positional clustering shows minimal recurrence for non-coding mutations ...... 39

2.3.3 AP2A1 promoter mutations correlate with higher expression ...... 49

2.3.4 Kataegis is a highly recurring event in metastatic cancer ...... 57

2.4 Discussion ...... 59

2.4.1 Summary of Results ...... 59

2.4.2 Limitations of the study ...... 65

2.4.3 Future directions ...... 66

Chapter 3: Treatment Associated Mutations ...... 69

3.1 Introduction ...... 69

3.2 Methods ...... 69

3.2.1 Patient Samples ...... 69

3.2.2 Sample Collection, Preparation and Sequencing ...... 69

3.2.3 Alignment and Variant Calling ...... 70

3.2.4 Gene Expression Profiling ...... 70

3.2.5 Copy-Number Calling ...... 70

3.2.6 Collection of Clinical information ...... 70

3.2.7 Association of Treatment and SNVs/Indels ...... 71

3.2.8 Association of Treatment and Copy Number Variations ...... 72

3.2.9 Analysis of Time on Therapy ...... 73

3.3 Results ...... 74

3.3.1 Large-scale search for therapy induced mutations detects real associations ...... 74

3.3.2 Copy number alterations may be influenced by treatment ...... 76

3.3.3 Resistance mutations are associated with longer exposure to therapy ...... 77

3.4 Discussion ...... 80

3.4.1 Summary of findings ...... 80

3.4.2 Limitations of the study ...... 81

3.4.3 Future directions ...... 82

Chapter 4: Conclusion ...... 83

Bibliography ...... 85

List of Tables

Table 1.1 Examples of commonly used variant callers ...... 22

Table 1.2 Examples of commonly used targeted therapies in cancer ...... 27

Table 2.1 Summary of clustering results ...... 40

Table 2.2 Top mutational clusters in metastatic cancers ...... 43

Table 2.3 Top ncRNA mutational clusters in metastatic cancers ...... 47

Table 2.4 Overview of AP2A1 mutations ...... 55

Table 2.5 Sasquatch DNase I footprinting prediction results ...... 55

List of Figures

Figure 1.1 Non protein coding DNA make up the majority of genetic information ...... 10

Figure 1.2 Non protein coding DNA make up the majority of the transcriptome ...... 12

Figure 1.3 Tumor heterogeneity provides variance for selection ...... 16

Figure 1.4 Metastasis can occur via multiple mechanisms ...... 18

Figure 2.1 Overview of tumor types in the cohort ...... 31

Figure 2.2 Positional clustering of mutations ...... 34

Figure 2.3 Calculation of local background mutation rate ...... 35

Figure 2.4 Filtering removes artifacts and unwanted mutations ...... 39

Figure 2.5 Chromosomal overview of cluster distribution ...... 41

Figure 2.6 Overview of mutation clusters affecting coding or regulatory regions ...... 42

Figure 2.7 Overview of mutation clusters affecting coding or regulatory regions (zoomed) ...... 45

Figure 2.8 TERT promoter mutations cause an increased expression ...... 50

Figure 2.9 AP2A1 promoter mutations induce an increased expression similar to amplification ...... 51

Figure 2.10 AP2A1 expressional difference is not a cohort effect ...... 52

Figure 2.11 AP2A1 mutations are located close to the transcription start site ...... 54

Figure 2.12 AP2A1 promoter mutations induce a similar increase in expression in TCGA ...... 56

Figure 2.13 Certain regions are more prone to kataegis ...... 57

Figure 2.14 Increased expression of APOBEC3B is a predictor of kataegis events ...... 58

Figure 2.15 Kataegis mutational burden correlates with APOBEC3B expression ...... 59

Figure 3.1 Certain mutations arise from exposure to therapy ...... 75

Figure 3.2 Non protein coding DNA make up the majority of genetic information ...... 77

Figure 3.3 Patients with ESR1 resistance mutations have longer aromatase inhibitor treatment times ...... 78

Figure 3.4 Patients with EGFR resistance mutations have longer EGFR inhibitor treatment times ...... 79

List of Abbreviations

ACC – Adrenocortical carcinoma

BCC – Basal cell carcinoma

BER – Base excision repair

BLCA – Bladder urothelial carcinoma

BRCA – Breast cancer

CERV – Cervical cancer

CHOL – Cholangiocarcinoma

CNS-PNS – Nervous system tumors

COLO – Colorectal cancer

COSMIC – Catalogue of Somatic Mutations in Cancer

ESCA – Esophageal cancer

ETC – Electron transport chain

HCC – Hepatocellular carcinoma

HNSC – Head and neck squamous cell carcinoma

ICGC – International Cancer Genome Consortium

Indel – Insertion/deletion

IR – Ionizing radiation

KDNY – Kidney cancer

lncRNA – Long non-coding RNA

LUNG – Lung cancer

LYMP – Blood and lymphoid cancer

miRNA – MicroRNA

MISC – Miscellaneous cancers

MMR – Mismatch repair

ncRNA – Non-coding RNA

NER – Nucleotide excision repair

OV – Ovarian cancer

PANC – Pancreatic cancer

PARPi – Poly (ADP-ribose) polymerase inhibitors

POG – Personalized oncogenomics

PRAD – Prostate adenocarcinoma

ROS – Reactive oxygen species

rRNA – Ribosomal RNA

SARC – Sarcoma

SECR – Tumor of secretory organs

SKCM – Melanoma

snoRNA – Small nucleolar RNA

snRNA – Small nuclear RNA

SNV – Single nucleotide variant

STAD – Stomach adenocarcinoma

SV – Structural variant

TARGET – Therapeutically Applicable Research to Generate Effective Treatments

TCGA – The Cancer Genome Atlas

TF – Transcription factor

THCA – Thyroid carcinoma

THYM – Thymoma

TKI – Tyrosine kinase inhibitor

TPM – Transcripts per million

tRNA – Transfer RNA

UCEC – Uterine Corpus Endometrial Carcinoma

UV – Ultraviolet

UVM – Uveal Melanoma

Acknowledgements

I would like to thank my supervisor Dr. Steven Jones for his continual support of my work and for constantly inspiring me to achieve more. You have given me more opportunities than imaginable, to pursue things I would never have been able to. I have enjoyed your mentorship and am forever grateful for all the things you have taught me. To Louise, thank you for taking care of all the administrative issues and making my life a little easier. Your work is truly appreciated.

I would like to thank my supervisory committee, Drs. Inanc Birol and Peter Stirling. Your insights and ideas have made this journey much easier and much more enjoyable. Our discussions during committee meetings always led to me learning something new or grasping something in a novel way.

I would like to thank all the members of the Jones lab and the people at the GSC. I am especially grateful to some individuals in particular, who have made this journey incredibly memorable. To Jake and Eric, thank you for taking the time to share your experiences with me and to answer any dumb questions I had. Your openness and guidance will always be cherished.

To my fellow trainees in the lab, Jasleen, Luka, Emre, Jenny, Jean-Michel, Vahid, and Michael, thank you for teaching me so much over the course of these two years. I always enjoyed hearing about the cool scientific ventures you were pursuing and was always in awe of the work you guys did. I am sure you will all succeed greatly in all that you pursue. I came in with little to no knowledge of bioinformatics and now I can proudly say I am a little less lost because of all of you. To Erin, thank you for all your support throughout my entire time in the lab. I am forever grateful that your door was always open to me, that you were always willing to listen to my crazy ideas and were willing to sift through all the thoughts in my head. Your impact on my work and my growth as a scientist and a critical thinker cannot be overstated. To Laura and Emma, thank you for all your hard work on the POG570 paper and for making my time at the GSC memorable. Although emails from you were often accompanied by more work, it was always enjoyable. To Zoltan, Jahanshah and Kieran, thank you for all your insights and assistance during my thesis work. Your suggestions and comments aided me greatly and led me to always think deeper into the reasoning and methodology behind my project.

I am grateful for the funding support I received during my graduate studies. These include a CIHR Canada Graduate Scholarship Master’s Award, and a Rotation Program Award from the Department of Medical Genetics. I would also like to acknowledge financial support for travel from the UBC Graduate Studies Travel Award.

I am eternally grateful to all the participants of the Personalized Oncogenomics Program who are the backbone of this study. I am also thankful for all the directors, clinicians, bioinformaticians, and project managers of the POG project, without whom, none of this work would be possible.

Finally, my warmest thanks to all my friends and family for their continual support through this entire process.

Dedication

To my parents Klip and Queenie,

who have supported me financially, physically and emotionally through and through.

Chapter 1: Introduction

1.1 Research Aims

The objectives of this thesis are to accurately characterize the non-coding genomic landscape of advanced metastatic tumors and to analyze the mutational impact of chemotherapeutics on tumors. We aim to identify areas of high mutational burden in the non-coding genome in order to establish mutations that may be significant in cancer and cancer metastasis. We also aim to identify mutations which may have arisen due to the presence of chemotherapeutics with the hope of determining novel resistance mutations, resulting in a better understanding and future prescription of chemotherapeutics.

To address the objectives, we began by identifying recurrently mutated regions of the genome in 638 metastatic cancers through a positional clustering method. The clusters were then filtered based on statistical significance and location in the genome, generating a shortened list of significant regions for downstream analysis. Lastly, we measured the occurrence rates of mutations in the presence of chemotherapeutics, relating the appearance of alterations to therapeutics.
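As an illustration only, the idea of positional clustering can be sketched as grouping mutations that lie close together on the same chromosome and keeping the groups that recur across patients. The sketch below is not the clustering algorithm used in this thesis, which is described in Chapter 2. The function names, the `max_gap` and `min_samples` parameters, and the toy coordinates are all assumptions made for the example, and no statistical test against a background mutation rate is included.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (chromosome, position, sample identifier); a hypothetical input format
Mutation = Tuple[str, int, str]


def _emit(chrom: str, members: List[Tuple[int, str]],
          min_samples: int, clusters: List[Dict]) -> None:
    """Record a candidate cluster if enough distinct samples contribute to it."""
    if not members:
        return
    samples = {sample for _, sample in members}
    if len(samples) >= min_samples:
        clusters.append({
            "chrom": chrom,
            "start": members[0][0],
            "end": members[-1][0],
            "n_samples": len(samples),
        })


def cluster_mutations(mutations: List[Mutation],
                      max_gap: int = 50,
                      min_samples: int = 3) -> List[Dict]:
    """Group mutations whose neighbouring positions are at most max_gap bases apart.

    max_gap and min_samples are illustrative thresholds, not values from the thesis.
    """
    by_chrom: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for chrom, pos, sample in mutations:
        by_chrom[chrom].append((pos, sample))

    clusters: List[Dict] = []
    for chrom in sorted(by_chrom):
        current: List[Tuple[int, str]] = []
        for pos, sample in sorted(by_chrom[chrom]):
            # A gap larger than max_gap closes the current group.
            if current and pos - current[-1][0] > max_gap:
                _emit(chrom, current, min_samples, clusters)
                current = []
            current.append((pos, sample))
        _emit(chrom, current, min_samples, clusters)
    return clusters


# Toy data with made-up coordinates and patient labels.
toy = [
    ("chr5", 1000, "P1"), ("chr5", 1020, "P2"), ("chr5", 1045, "P3"),
    ("chr5", 9000, "P4"),
    ("chr7", 500, "P1"), ("chr7", 510, "P1"),
]
print(cluster_mutations(toy))
# [{'chrom': 'chr5', 'start': 1000, 'end': 1045, 'n_samples': 3}]
```

Counting distinct samples rather than raw mutations is a deliberate choice in this sketch: two nearby mutations in a single patient (the toy `chr7` pair) are not treated as recurrence.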

1.2 Background

1.2.1 Cancer is a Global Health and Economic Issue

Over the past decade, global cancer incidence has risen substantially. From 2006 to 2016, cancer diagnoses increased by 28%.1 Currently, one in five men and one in six women are predicted to develop cancer during their lifetime, with the number of new cancer cases reaching 18 million in 2018.1,2 This number has been predicted to rise to 29.5 million by 2040.3 Most of this growth has been attributed to overall global population growth and ageing, as progress against infectious diseases and increased life expectancy have allowed for mechanisms of cancer to flourish.1,4 Furthermore, cancer is the second leading cause of death worldwide, exceeded only by heart disease.2,3 Internationally, cancer was responsible for an estimated 9.6 million deaths in 2018, approximately one in every six deaths.5 This number is predicted to rise to 16.5 million by 2040.5

The economic impact of cancer is enormous, particularly in low- and middle-income countries, where the cancer burden places heavy stress on an already weak health care system and adversely affects efforts towards sustainable development.6 Globally, this economic impact is estimated at approximately US$1.16 trillion.7 Altogether, this makes cancer one of the biggest contributors to the global disease burden. As such, efforts in cancer research to improve diagnosis, prognosis and treatment are greatly needed. This is the underlying basis and driving factor for the body of work that will be presented.

1.2.2 Genetics of Cancer

History of Cancer Genetics

The work of Mendel in 1866 showed that phenotypic traits were attributable to distinct units, which were capable of being passed on as units of heritability.8 Rediscovery of Mendel’s work in the 1900s prompted a resurgence in the field, enabled by the deepened understanding of cells and chromosomes. Work by Sutton and Hunt Morgan isolated the heritable units to the chromosome, and these units were eventually termed ‘genes’ by Johannsen.9 At a similar time, physicians began to observe that certain diseases were inherited according to Mendelian rules, laying the foundation for the idea that certain medical conditions could be ascribed to a genetic cause. This was further established by Ingram, who in 1956 found that a single alteration in the haemoglobin gene was responsible for the presentation of sickle cell anaemia.10 Together, this work began to form our understanding of how genetics may contribute to disease. One medical condition observed to have a genetic basis was cancer, where early studies into retinoblastoma showed the disease to be inherited in a Mendelian manner.11 Subsequent familial studies unveiled other hereditary cancers, while advances in cloning and sequencing allowed the first cancer genes to be discovered. However, it was not until recently, with the increasing affordability and throughput of sequencing, that we have started to see the magnitude of genomic contributions involved in tumorigenesis. While our understanding is still growing, this section will illustrate our current understanding of cancer genetics, highlighting challenges and unknowns.

Cancerous Mutations Arise from Multiple Sources

Early studies into cancer revealed that the process of carcinogenesis involved multiple steps. Mottram, Berenblum and Shubik showed in their experiments that tumors required an initiation step, achieved through the application of a carcinogen, before they could enter the promotional phase of carcinogenesis, in which the tumour, when treated with “irritants”, would display the uncontrollable growth characteristic of cancers.12 We now understand that this initiation process is not an attribute of any particular chemical, but rather the result of irreversible DNA damage that the carcinogen generates, whereas “irritants” were simply chemicals that promoted growth. DNA damage has since been recognized as the causal factor for cancer development. Various exogenous and endogenous factors have been shown to be capable of damaging DNA.13 While arising from different sources, endogenous and exogenous factors can produce similar types of damage, and both have been implicated in, and play large roles in, most cases of cancer.13 As such, a complete understanding of both sources of DNA damage is crucial in the prevention of cancer.

It has long been established that exogenous factors are capable of causing changes in our genetic information.14,15 Examples of such factors include radiation from x-rays and UV light, as well as chemical agents which are highly prevalent in tobacco smoke and the combustion of fossil fuels, such as alkylating agents, aromatic amines and polycyclic aromatic hydrocarbons.16

Radiation can damage the DNA directly, through the breakage of DNA strands, base lesions, base modifications or generation of DNA adducts.16 Radiation can also influence DNA indirectly, an example being the radiolysis of water by IR in x-rays, which generates a cluster of highly reactive hydroxyl radicals that can, in turn, react with DNA to form DNA adducts.17

Chemical agents act directly on the DNA structure, often attaching to bases to form DNA adducts.16 For example, alkylating agents act through the addition of an alkyl group onto a base ring nitrogen, which can subsequently force cleavage into an abasic site, induce a substitution mutation, or result in cross-linked DNA.18 Other environmental stresses including natural toxins, extreme heat or cold, hypoxia, and oxidative stress have also been shown to cause DNA damage in human cells.19–22 Furthermore, there has been increasing evidence that everyday products used in cosmetics, pharmaceuticals and food processing are capable of damaging DNA.23–27

Understandably, tissues exposed to the environment, such as skin and lung, are more prone to exogenous stresses, resulting in tumours that are often associated with a high mutational burden.

It was not until after the discovery of DNA that we began to understand endogenous sources of DNA damage. Aside from its contribution to sporadic cancers, this internal damage ultimately fuels the mechanism from which hereditary cancers develop. Damage from endogenous sources can arise from a wide range of processes and can be as simple as errors during the replication process that are not corrected, resulting in around 10-100 single base substitutions, insertions or deletions per genome per generation.28,29 Other sources of endogenous DNA damage include lesions due to the action of topoisomerase as well as base aberrations from spontaneous base deamination or abasic site generation.16 Byproducts of the electron transport chain (ETC) and other endogenous metabolic processes, such as ROS, are also capable of compromising the DNA base or backbone when present in excess, resulting in DNA adducts or strand breaks respectively.30 For example, hydroxyl radicals produced as a byproduct of the Fenton reaction can react with DNA bases through the attack of double bonds or sugar residues and the removal of hydrogen atoms from methyl groups.30

It is through exogenous and endogenous stresses that tumorigenesis is able to proceed. The mutations that arise create a wide phenotypic population amongst cells, upon which selective pressures in the cell can then act. The selection process favours cells that portray the hallmarks of cancer, eventually leading to cancerous growths that continually evolve.

DNA Repair Plays an Important Role in Cancer Progression

Estimates have shown that within a human somatic cell the genome is subject to approximately 70,000 potential mutational events per day.31 Hence robust DNA repair mechanisms are required to maintain genomic integrity from the constant barrage of stresses, and to avoid the initiation of tumorigenesis. Due to the importance of maintaining the genome, multiple molecular pathways have evolved over time to help preserve this information. Pathways that address base damage include base excision repair (BER), nucleotide excision repair (NER), mismatch repair (MMR), and interstrand cross-link repair (ICL). Pathways that address DNA breaks include single-strand break repair (SSBR) as well as double-strand break repair (DSBR), where the latter is achieved mainly through the homologous recombination (HR) and the non-homologous end-joining (NHEJ) pathways.32

A compromised repair pathway will lead to the increased accumulation of DNA lesions, resulting in an increased probability of tumorigenesis.32 The importance of these pathways is further exemplified by their causal relationship with hereditary cancers. Examples include hereditary breast cancer and Lynch syndrome, where the causative mutations lie in genes responsible for HR and MMR respectively.33,34 Conversely, certain DNA repair pathway aberrations have been exploited therapeutically, through the ability of drugs to act in a synthetic lethal manner within the cancer cells. An example of this is the use of poly (ADP-ribose) polymerase inhibitors (PARPi) in the treatment of BRCA1/2-mutated ovarian cancers, where the introduction of PARPi to a cell causes an increase in single-strand breaks, which are converted into irreparable double-strand breaks (DSBs) in BRCA1/2-deficient cancer cells.35,36 Together this illustrates the necessity of understanding the nature of DNA repair mutations.

Properties of Oncogenic Mutations

While mutations are constantly accumulating in the genome, not all mutations trigger tumorigenesis. On the contrary, the majority of mutations that occur are thought to be largely neutral and are fittingly named passenger mutations.37 The select few mutations capable of inducing or promoting tumorigenesis are known as cancer-driving mutations.37 A combination of mutational effect, gene region and cell type all play a role in governing whether a mutation is a passenger or a driver. Understanding the contribution of a mutation is necessary as the same mutation may have different functional consequences depending on the cellular environment, which subsequently affects how the mutation influences cancer progression and treatment.

All oncogenic genes contribute to the alteration of cell division but through different mechanisms.38 As per their functions, these genes are often classified into proto-oncogenes and tumour suppressors. Proto-oncogenes are genes that normally enhance cell proliferation, examples being genes involved in the initiation of cell division or genes that function to inhibit apoptosis.38 Mutational activation of a proto-oncogene into an oncogene allows the cell to exhibit enhanced proliferative properties through upregulation or modification of normal function. Activation can arise from mutations in the coding sequence resulting in greater protein activity or deregulation of the protein, as well as from mutations that change the relative abundance of the protein, such as increased expression or duplication of the gene.39 A translocation which moves a proto-oncogene to an area of constitutive expression, or one that results in a gene fusion, is also a common mechanism of activation for proto-oncogenes.39 Notable commonly mutated proto-oncogenes include HER2, MYC, CCND1 and the Ras family genes.38

The idea of tumour suppressors was first conceived by Knudson in his landmark paper on the mutational process in retinoblastoma, which eventually led to the establishment of the two-hit hypothesis, the idea that multiple hits are required to impair tumor suppressor function.40 Tumour suppressors function as their name suggests, contributing to pathways that prohibit the cell from expressing tumor-like properties.38 Genes involved in DNA repair pathways are also classified as tumour suppressors.38 Notable tumor suppressors include RB1, p53, INK4, and PTEN.38 The majority of hereditary cancers are caused by the inheritance of an impaired tumor suppressor allele rather than an oncogene, as the offspring are still viable after a single tumor suppressor mutation but have an increased chance of tumorigenesis.41

As there are many cellular functions that contribute to maintaining the normalcy of a cell, tumor suppressors are often classified further into caretaker, gatekeeper and landscaper genes.42 Caretaker genes do not directly regulate cellular proliferation, but rather encode products that work to maintain genomic and chromosomal integrity.43,44 In essence, inactivation of these genes is similar to subjecting the genome to mutagens: increased alteration of the genome occurs, which subsequently affects cellular proliferation. Products from these genes include factors in DNA repair and control pathways, as well as cell-cycle checkpoints.42,44 Similarly, landscaper genes do not control growth directly but maintain the surrounding microenvironment.42 Inactivation of landscaper genes allows for a stromal environment that is conducive to unregulated cell proliferation.42 Unlike caretakers and landscapers, gatekeeper genes actively monitor cell proliferation and death.43 As such, aberrations in these genes directly contribute to irregular cell growth regulation and differentiation.

Different cancers exhibit different mutation patterns and frequencies, varying across cell types due to numerous factors. Highly expressed genes have elevated mutation rates in comparison to more lowly expressed genes.45 Regions of late replication also have higher mutation rates when compared to early replicating regions.46 Along with exogenous sources that are acting on the cell, these processes allow for a unique spectrum of mutations for every tumor type.

Non-coding DNA Plays Functional Roles in Cancer

Non-coding DNA refers to any genetic sequence in an organism’s DNA which does not encode protein. Estimates of human genome composition suggest that 98% of genetic information is non-coding (Fig 1.1).47,48 With such a vast majority of the genome dedicated to non-coding functions, it is essential that we understand the biological relevance of these regions in any attempt to recognize the underlying genetics of a disease.47 As recently as the early 1970s, non-coding DNA was still seen as “junk DNA” by geneticists, sequences with no biological functionality.49 It was soon recognized that non-coding DNA was not simply a placeholder but had important roles in the cellular system.

[Figure 1.1: two bar charts. A: percentage of genome by sequence class (DNA transposons, introns, LINEs, LTR retrotransposons, miscellaneous heterochromatin, miscellaneous unique sequences, protein coding, segmental duplications, simple sequence repeats, SINEs). B: number of genes by gene type (antisense lncRNA genes, lncRNA genes, miRNA genes, other lncRNA genes, other small ncRNA genes, protein-coding genes, pseudogenes).]

Figure 1.1. Non-protein-coding DNA makes up the majority of genetic information. A. List of genomic regions and their respective percentage coverage of the genome. Protein-coding sequences make up around 2% of the genome, while the majority are non-coding sequences. B. Types of transcribed genes and their relative numbers. Protein-coding genes form the largest transcribed group but make up less than 50% of all genes. ncRNAs make up a significant portion of actively transcribed genes. Figure created by author with data from “Genome Size Evolution in Animals” by Gregory T. R.47 and “Non-coding RNA in neurodegeneration” by Salta E. et al.49

Some non-coding regions are transcribed into non-coding RNAs (ncRNAs), which are simply RNAs that do not undergo translation into protein molecules (Fig 1.1).50,51 Instead, they remain as RNA molecules, some folding into secondary RNA structures, which have enzymatic or regulatory functions.51 The most notable non-coding RNAs are ribosomal RNAs (rRNA) and transfer RNAs (tRNA).52 rRNAs account for the majority of RNA mass in the cell (Fig 1.2).52 rRNAs are synthesized in the nucleolus by RNA polymerase I, where they remain to associate with ribosomal proteins to form the two ribosomal subunits.53 Ribosomal proteins contain basic and aromatic residues which allow for binding with the negatively charged RNA.53,54 The primary structure of rRNA may differ across organisms, but all rRNAs form stem-loop base-pairing that creates three-dimensional structures conserved across species.53,55 tRNAs are transcribed by RNA polymerase III as pre-tRNAs in the nucleus, which subsequently undergo extensive modifications such as intron splicing and motif removal.56 The processed tRNA then acts as a physical linker between mRNA and amino acids, using an anti-codon sequence to present the proper amino acid to the protein synthesis machinery. Other non-coding RNAs include long non-coding RNAs (lncRNA), micro RNAs (miRNA), small interfering RNAs (siRNA), small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNAs) and piwi-interacting RNAs (piRNA).51,52

[Figure 1.2: bar chart of the number of RNA molecules (log-scaled axis) by RNA type: tRNA, rRNA, mRNA, miRNA, snRNA, lncRNA, snoRNA and circular RNA.]

Figure 1.2. Non-protein-coding RNA makes up the majority of the transcriptome. The figure illustrates the absolute quantities of different types of RNA in the cell. tRNA abundance is several-fold higher than that of other RNAs, with rRNAs being the next most abundant. Figure created by author with data from “Non-coding RNA: what is functional and what is junk?” by Palazzo A.F. and Lee E.S.51

As more research has been conducted into understanding these regions, researchers have been able to determine numerous biological functions for non-coding DNA. Aside from the structural involvement of rRNAs in ribosomes and the amino acid delivery role of tRNAs in translation, non-coding DNA was found to be an important transcriptional regulatory factor in the genome, acting both in cis and in trans.47 Cis-regulating elements include the proximal promoter and enhancer elements; trans-regulating elements include various non-coding RNAs, such as lncRNAs, miRNAs and siRNAs.52,57 Several lncRNAs, such as HOTAIR and XIST, have been shown to have major roles in epigenetic regulation.58,59 snoRNAs and lncRNAs also have functions in post-transcriptional regulation through splicing and translational repression respectively.51 A growing body of literature has noted associations between non-coding elements, specifically functional regulators, and congenital anomalies and complex diseases. This indicates that non-coding elements are also important in disease progression and prompts further investigation into what else may be impacted by non-coding genomic sequences.60,61

Somatic alterations of regulatory non-coding regions in cancers have been noted to promote tumour progression. Single nucleotide variations on ncRNAs or dysregulated ncRNAs have been described to affect cancer risk and progression.60 Examples can be seen in hepatocellular carcinoma, where lncRNAs such as lincRNA-p21, CCHE1 and GIHGCC have been seen to play a role in cancer progression.62–64 Mutations in promoter and enhancer regions have also been shown to play a significant role in cancer, disrupting normal expression levels. A prominent example can be seen in the TERT gene, where hotspot mutations in the promoter region generate novel transcription factor binding sites, leading to a substantial increase in telomerase reverse transcriptase expression.65,66 This allows the cell to divide indefinitely and become effectively immortal, as there would be little loss of genetic information.66 Despite this, much is still unknown of the non-coding landscape of cancers, mainly due to the lack of whole-genome datasets that contain this information. With sequencing technologies improving rapidly, these mysteries are slowly being uncovered.

Kataegis

Patterns are often found in the mutational landscape of tumors, mainly due to the mechanisms of exogenous and endogenous mutagenesis as well as deficiencies in repair pathways. One pattern that has been observed is kataegis, which is the hypermutation of small localized genomic regions, primarily of the C>T variety in the context of TpCpN trinucleotides.67 Computational modeling by Alexandrov et al. defined kataegis as six or more consecutive mutations with an average inter-mutation distance of ≤1 kb.67 Kataegis was first discovered in breast cancer, where Nik-Zainal et al. found kataegis in more than 50% of tumors studied.68 Kataegis has now been observed in a variety of tumors, including lung and ovarian cancers. The break-induced repair pathway and APOBEC activity have been implicated as the primary sources of kataegis, with studies showing that tumors with an APOBEC3B signature are often plagued with kataegis.69–71 Studies have shown that kataegis may serve as a marker of good prognosis, but further investigation is needed to fully understand the biological implications.72
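The working definition above (six or more consecutive mutations with a mean inter-mutation distance of at most 1 kb) can be illustrated with a short sketch. This is a simplified greedy scan written for illustration only; it is not the piecewise-constant fitting procedure used in the cited work, and the function name and example coordinates are hypothetical.

```python
from typing import List, Tuple

def find_kataegis(positions: List[int],
                  min_mutations: int = 6,
                  max_mean_distance: float = 1000.0) -> List[Tuple[int, int, int]]:
    """Return (start, end, n_mutations) for candidate foci on one chromosome.

    Simplified greedy scan: from each start, extend the run for as long as
    the mean inter-mutation distance stays at or below the threshold.
    """
    pos = sorted(positions)
    foci = []
    i = 0
    while i < len(pos):
        best_j = None
        # The shortest admissible run ends at index i + min_mutations - 1.
        for j in range(i + min_mutations - 1, len(pos)):
            mean_distance = (pos[j] - pos[i]) / (j - i)
            if mean_distance > max_mean_distance:
                break
            best_j = j
        if best_j is None:
            i += 1
        else:
            foci.append((pos[i], pos[best_j], best_j - i + 1))
            i = best_j + 1
    return foci

# Hypothetical SNV coordinates on a single chromosome.
example = [100, 300, 500, 700, 900, 1100, 50000, 120000]
foci = find_kataegis(example)
```

In this toy input the first six mutations are 200 bp apart on average and form one focus, while the two distant mutations are left out. A mean-distance criterion can still absorb an occasional large gap, which is one reason published methods use segmentation rather than a greedy scan.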

As kataegis usually takes place in non-coding regions and contributes to the shaping of the non-coding space, it is relevant to explore this process and its behaviors in any study that attempts to characterize the non-coding mutational space.

Cancer Evolves Over Time

A tumour does not stay stagnant over the course of disease. Rather, oncogenesis is dynamic, with the tumour accruing more and more mutations as it progresses. This generally leads to a heterogeneous tumor, a state where subpopulations of tumor cells with distinct genotypes and phenotypes coexist.73 Intra-tumor heterogeneity has also been suggested as a mechanism for tumor progression, where accumulation of different genetic alterations within a population of cells allows for selection of subclones that have a growth and survival advantage in the specific environment in which they reside, or that can undergo metastasis.74–77 This is of concern, particularly for therapy, as it provides the genetic underpinnings for resistance through selection of different therapeutic susceptibilities within the same tumor (Fig 1.3).75 An example would be in gastrointestinal stromal tumors, where treatment with imatinib or sunitinib leads to the development of secondary drug-resistant mutations.78–80 Furthermore, cooperation between different subclones has also been reported, where a subclonal population could enhance the proliferation of another through paracrine signaling.81 Overall, this illustrates the importance of understanding all mutations within the entirety of a tumor, even those that persist only in a minor tumor subpopulation, as they may influence growth of the entire tumor or be actively selected for during progression or therapy.

Figure 1.3 Tumor heterogeneity provides variance for selection. Progression of the tumor results in a genetically heterogeneous population (each cell color represents a different genetic composition). When therapy is applied, it acts as a selection mechanism, allowing only resistant clones to survive. Such a clone is then able to repopulate the tumor, leading to a resistance-driven relapse.

Tumour heterogeneity has also been implicated in the progression to metastasis. Deep sequencing analyses of tumours have traced the origin of metastases to distinct subpopulations of the primary. In some cases, complex genetic events were shared between multiple tumor sites and an isolated subclonal population of the primary, suggesting a single common progenitor initiating metastatic outgrowth.82 As such, understanding clonal heterogeneity and the genetic evolution in tumors is crucial in any attempt to study metastases.

1.2.3 Metastatic Cancer

With advances in therapies and a better understanding of cancer, treatment of primary cancers has greatly improved. However, progression towards metastases is often met with a gloomy prognosis. Metastasis is defined as the development of malignant growths away from the primary tumour. This may present as growths on a distant or neighbouring organ, or another part of the organ of origin. It is estimated that the process of metastasis is responsible for around 90% of all cancer deaths.83 Despite this, the process of metastasis is not fully understood, with questions surrounding the mechanisms and initiation of the process. Understanding metastasis is therefore a critical step toward improving cancer patient outcomes.

Early attempts at unraveling the genetic cause of metastasis focused on comparing genetic similarities and differences between primary and metastatic tumours.84–86 These studies were ultimately inconclusive, showing a wide variety of genetic divergence between metastases and primaries across tumor types.83 Metastases do not appear to have distinct progression patterns, as both linear (metastasis occurs late in the course of the disease) and parallel progressions (metastasis occurs early and progresses alongside the primary) have been observed.87 Recently, debate has arisen on whether metastasis follows the previously described monoclonal seeding model, as additional evidence has emerged for a polyclonal seeding model, where multiple primary tumor lineages can be found in the metastases (Fig 1.4).82 Hopefully, with the advent of novel sequencing technologies and their rapid improvement, we will be able to better tackle these questions and gather more information for improving the prognosis of metastasis through understanding at the genetic level.


Figure 1.4 Metastasis can occur via multiple mechanisms. Metastasis occurs when tumor cells from the primary infiltrate another organ and cause malignant growths. Two mechanisms have been described in the literature. The monoclonal seeding model, shown by the upper path, suggests that metastasis is initiated by a single clone and that the resulting growth stems from this single clone. The polyclonal seeding model, shown by the lower path, suggests that a cluster of varying clones all contribute to metastasis. This figure is derived from “Lung PNG Picture” (pngall.com/lungs-png/download/15109) used under CC BY-NC 4.0 (creativecommons.org/licenses/by-nc/4.0/) and from “Liver.svg” by Mikael Häggström, Public Domain.

1.2.4 Whole Genome Sequencing in Cancer

History of Sequencing

Our understanding of DNA began when Watson and Crick famously solved the three-dimensional double-helical structure.88 Initial attempts to sequence DNA were adapted from protein sequencing methods, which involved a large amount of analytical chemistry and fractionation, as well as restricting the target area to short stretches.89–92 The major breakthrough that forever altered DNA sequencing technologies came with Sanger’s development of the chain-termination technique.93 Improvements on the technique led to the development of the first automated DNA sequencing machines based on the chain-termination method, which were used in conjunction with PCR and recombinant technologies to sequence increasingly complex genomes.94–98 The development of pyrosequencing, which utilized luminescence instead of radio- or fluorescently labelled dNTPs, led the surge of second-generation DNA sequencing technologies, which focused on the parallelization of sequencing reactions in order to greatly increase the sequencing load of any single run.99 This ultimately led to the highly successful Illumina sequencing platform, which has been the basis of sequencing for countless experiments.100 Recent years have seen the development of technologies that sequence single DNA molecules, removing the need for the amplification required by previous technologies.101,102 These sequencing technologies are capable of producing extremely long reads, which is especially useful for genome assembly.103,104 Additionally, many of these technologies are able to detect modified bases, adding another layer of information generated by sequencing.105 With constant improvements in technologies resulting in greater efficiency and lower cost, large-scale sequencing has turned from a luxury into a common practice in scientific work.

Sequencing in Cancer

Many large-scale efforts have incorporated sequencing in their attempts to characterize cancer. Projects such as The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC), Catalogue of Somatic Mutations in Cancer (COSMIC), and Therapeutically Applicable Research to Generate Effective Treatments (TARGET) have taken advantage of sequencing to better understand the mutation processes that drive the disease.106–109 This has, in turn, facilitated improvements in diagnosis and treatment of certain genomic events, as well as providing the data needed for the development of bioinformatic tools that aid analysis.

There remain some caveats in the information generated from tumor sequencing. Firstly, the majority of the assembled data is focused on the genetics of the coding region of the genome.110 As such, there is a lack of genomic data that allows for in-depth analysis of the non-coding region of the genome. Furthermore, most sequenced tumors represent primary untreated disease and offer no information regarding metastasis or response to chemotherapeutics.110 Datasets that include clinical information as well as advanced tumours are therefore still required for a better understanding of the entire process of cancer and of therapeutics in a human system.

Identification of Mutations

One of the most crucial qualities of sequencing tumours, especially in the clinical setting, is the ability to identify genetic variants at high fidelity. However, not all variants seen in the sequence are real, as some may be artifacts derived from library preparation, sample enrichment, the sequencing process, as well as mapping/alignment issues.111 Therefore, any genomic inquiry requires the identification of real variants amidst all such noise, a step referred to as variant calling.111 Numerous variant callers have been developed, all employing different algorithms tuned for distinct purposes and uses.

Variants themselves can be grouped into different types: single nucleotide variants (SNVs), insertions and deletions (indels), and structural variants (SVs).111 Different approaches and algorithms are needed to detect variants of varying types, and as such, few callers are able to detect all three.111 Germline and somatic variant calling also have inherently different assumptions due to differences in allelic fraction for the mutations. Germline variants are expected to have allele frequencies of 50% or 100%, which allows for easy rejection of artifacts present at low frequency.112,113 Somatic calling is not as straightforward, as real variants may be present at low frequencies due to the purity of the sample or tumour clonality.112,113 Therefore, most somatic variant callers focus on algorithms that disambiguate low-frequency variants, using more sensitive statistical modeling and error correction.111 Examples of strategies include heuristic approaches, joint genotype analysis, joint allele frequencies, haplotype-based strategies as well as machine learning methods (Table 1.1). While most somatic variant callers use matched tumor-normal samples, some are able to proceed without the matched normal. The choice of variant caller then depends on the variant of interest, the available data, the source of the data (some callers perform better for certain sequencers), and the desired variant allele frequency.

Table 1.1 Examples of commonly used variant callers

Algorithm                   Variant Caller     Type of Variant Called
Heuristic threshold         qSNP114            SNV
                            RADIA115           SNV
                            Shimmer116         SNV, indel
                            VarDict117         SNV, indel, SV
                            VarScan2118        SNV, indel
Joint genotype analysis     JointSNVMix2119    SNV
                            SAMtools120        SNV, indel
                            Seurat121          SNV, indel, SV
                            SNVSniffer122      SNV, indel
                            SomaticSniper123   SNV
Allele frequency analysis   deepSNV124         SNV
                            LoFreq125          SNV, indel
                            MuTect126          SNV
                            Strelka127         SNV, indel
Haplotype analysis          FreeBayes128       SNV, indel
                            HapMuC129          SNV, indel
                            LocHap130          SNV, indel
                            Platypus131        SNV, indel, SV
Machine learning            BAYSIC132          SNV
                            MutationSeq133     SNV
                            SNooPer134         SNV, indel
                            SomaticSeq135      SNV

Note. Adapted from “A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data” by Chang Xu.103

Methods of Studying Recurrent Cancerous Mutations

After identification of variants, the fundamental challenge is to distinguish driving mutations from passenger mutations. Early studies which mainly dealt with hereditary cancers used classical genetics methods such as linkage and family studies to identify chromosomal regions or genes that were inherited along with the disease.136–138 The advent of large-scale sequencing efforts allowed for more precise identification of mutations and a greater quantity of mutations. This, however, leads to the issue of more potential candidates. Several computational methods have been developed to assist in the categorization of mutations.

A common approach to prioritizing somatic mutations that are likely to possess driving qualities is to determine genomic regions with high mutational frequency across individual tumours, with the understanding that a high recurrence rate may suggest higher fitness (more oncogenic).139 There are two approaches to identifying regions of high mutational burden, which are often used together. The first is to focus on DNA elements that are expected to have a biological function, which restricts the search space, allowing for more power as well as a greater chance of finding biologically functional variants.140 The second is to identify clusters of SNVs, known as hotspots, and then compare SNV frequencies within the hotspot against local or global mutation rates.139–141 This significantly amplifies power and can concentrate mutations that have similar effects.
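As a rough illustration of comparing hotspot frequency against a background rate, the sketch below computes a binomial upper-tail probability for seeing a given number of mutated tumours in a region. It assumes a single per-base background rate shared by all tumours, which is a simplification; the function names and the numeric rate are hypothetical and chosen only for illustration.

```python
from math import comb

def region_mutation_probability(rate_per_bp: float, length_bp: int) -> float:
    """Probability that one tumour carries at least one SNV in the region,
    assuming independent sites with a uniform per-base rate (an assumption)."""
    return 1.0 - (1.0 - rate_per_bp) ** length_bp

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), by direct summation.

    Direct summation is adequate for cohorts of a few hundred tumours;
    much larger n would need a log-space or library implementation.
    """
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical inputs: a 50 bp region, 638 tumours, 5 of them mutated.
p_region = region_mutation_probability(rate_per_bp=5e-6, length_bp=50)
p_value = binomial_tail(k=5, n=638, p=p_region)
```

A small tail probability suggests the region is mutated in more tumours than the assumed background would predict. Using a local rather than global rate changes only the value passed as `rate_per_bp`.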

Hotspot analyses allow for the identification of prime candidate cancer-driving SNVs, which can then be filtered by functionality. Mutations in coding regions can be distinguished through computational prediction of the resulting protein sequence and structure. Non-coding regions are not as straightforward, and various methods are employed to best characterize the region. If the region is associated with a known gene or is in a post-transcriptional regulatory region (an example being UTRs, which control stability), expression analysis may uncover mutations that affect the regulatory role of the region.141 Transcription factor (TF) motif prediction can also identify the functional impact of non-coding mutations. This prediction is accomplished through DNA specificities based on previous DNA-binding experiments, such as chromatin immunoprecipitation and protein binding microarrays, which identify regions of TF occupancy.142–144 The effect can then be calculated using position weight matrices, probabilistic representations, or other prediction algorithms.140,145,146 Apart from computational prediction, functional validation assays such as luciferase reporter assays, physical binding assays and knockdown assays also assist in validating the effects of potential drivers.140 However, due to the low throughput of such assays, they often cannot keep pace with the discovery of new candidate mutations.
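To make the position weight matrix idea concrete, the following sketch scores a reference and a mutated sequence against a small matrix and reports the change in log-odds score. The two-position motif, the uniform background of 0.25 and the function names are all hypothetical; real analyses would use curated matrices and scan every window overlapping the mutation.

```python
import math

def log_odds_score(pwm, sequence, background=0.25):
    """Sum over positions of log2(P(base at position) / background).

    `pwm` is a list with one dict per motif position, mapping each base to
    a probability (assumed non-zero, e.g. after adding pseudocounts).
    """
    return sum(math.log2(pwm[i][base] / background)
               for i, base in enumerate(sequence))

def mutation_effect(pwm, ref_seq, offset, alt_base):
    """Change in motif score caused by substituting alt_base at offset."""
    alt_seq = ref_seq[:offset] + alt_base + ref_seq[offset + 1:]
    return log_odds_score(pwm, alt_seq) - log_odds_score(pwm, ref_seq)

# Hypothetical two-position motif that prefers "AG".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
delta = mutation_effect(toy_pwm, ref_seq="AG", offset=1, alt_base="T")
```

A negative `delta` indicates that the substitution weakens the predicted binding site, and a positive value indicates that it strengthens or creates one.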

1.2.5 Personalized Oncogenomics

Overview of Cancer Therapy

The first effective line of cancer treatment was radiation therapy, with the first cancer case being cured exclusively by radiation occurring in 1898.147 The turn of the 20th century saw a rapid development of surgical techniques to treat cancer, with the first abdominoperineal resection performed by Miles in 1908, the first lobectomy by Davies in 1912 and the first radical hysterectomy by Wertheim in 1906.148–150 The use of chemotherapies for cancer began in the 1930s, and their development continued throughout the 1940s and 1950s, leading to the advent of important chemotherapies such as 5-fluorouracil, cyclophosphamide and methotrexate.151–153 By 1958, the first case of cancer cured solely by chemotherapy was reported.154 However, the basis for solid tumour treatment remained surgery and radiotherapy throughout the 1960s, which often led to uncontrolled development of micrometastases.155 This eventually led to the use of adjuvant chemotherapy in the late 1960s, which changed the concept of localized treatment.155 The late 1960s also saw the beginning of combination therapy, which resulted in higher cure rates.156–159 The advancements in molecular understanding of the cell and of the genome throughout the 1970s and 1980s enabled the development of novel chemotherapies with various mechanisms and better application of existing chemotherapies.155 Sequencing advancements sparked the beginning of targeted therapies, which focused on exploiting molecular targets in the tumour.

Nowadays, cancer is treated with a combination of chemotherapies, radiation therapy and surgery. Modern surgery has moved away from Halstedian techniques to less invasive procedures such as laparoscopy, videothoracoscopy, radiofrequency ablation and radiosurgery techniques.160–162 Technological advances in X-ray production and delivery, as well as improvement of computer-based treatment planning and our understanding of cancer have drastically improved the manner in which we apply radiation therapy, improving efficiency and minimizing detrimental effects.163,164 Examples include four-dimensional conformal radiotherapy which uses dynamic CT images to assist in targeting and radiogenic therapy which uses radiation to stimulate cytotoxic agents.163 The most recent pharmacological trend for chemotherapies has been the development of different targeted kinase inhibitors.165 Together this has contributed to raising curability rates and reducing the mortality of cancer.

Novel developments of cancer therapy have focused on moving towards personalized therapeutic regimens with the intention of treating each tumour uniquely. Promising ventures include immunotherapy, the application of targeted nanoparticles and gene therapy.155

Genetics in Cancer Therapy

Current targeted therapies focus on identifying phenotypic or genetic traits in the tumor that render it susceptible to certain therapies. Examples include EGFR gain-of-function mutations as a target for EGFR inhibitors or the use of aromatase inhibitors in estrogen receptor-positive breast cancers (Table 1.2).166,167

Table 1.2 Examples of commonly used targeted therapies in cancer

Cancer Type   Gene            Drug                         Ref
Colorectal    VEGF            Bevacizumab                  Ferrara et al.168
              VEGF            Ramucirumab                  Zhu et al.169
              EGFR            Cetuximab                    Baselga170
              EGFR            Panitumumab                  Messersmith et al.171
Breast        ERBB2           Trastuzumab                  Molina et al.172
              ERBB2           Pertuzumab                   Minckwitz et al.173
              ESR1            Lapatinib                    Denkert et al.174
              ESR1            Letrozole                    Reinert et al.175
              MTOR            Everolimus                   Saran et al.176
              CDK4/6          Abemaciclib                  Hamilton et al.177
              BRCA1/2         Olaparib                     Robson et al.178
              BRCA1/2         Talazoparib                  Litton et al.179
Lung          EGFR            Gefitinib                    Kobayashi et al.180
              EGFR            Afatinib                     Sequist et al.181
              EGFR            Erlotinib                    Shepherd et al.182
              EGFR (T790M)    Osimertinib                  Mok et al.183
              ALK             Crizotinib                   Shaw et al.184
              ALK             Ceritinib                    Shaw et al.185
              BRAF            Dabrafenib and trametinib    Hauschild et al.186
              ROS1            Crizotinib                   Shaw et al.187
Ovarian       BRCA1/2         Olaparib                     Ledermann et al.188

Unfortunately, any form of targeted therapy creates a selection force which naturally allows for the rise of a resistant population once the corresponding variant is introduced to a tumor cell. This can be seen in EGFR inhibitor treatments, where prolonged treatment gives rise to the T790M mutation in the EGFR gene, as well as in aromatase inhibitor treatments, where prolonged exposure results in ESR1 gain-of-function mutations that allow for constitutive ER activation or ER-independent growth signalling.175,189 Currently, identification of resistance mutations stems from sequencing patients that have relapsed as a result of a resistant clone and subsequent validation through functional studies in cell lines derived from the patient.190 The current lack of large datasets containing patient genomic data with clinical information renders this a slow process that only arises in specific case studies or experimentally designed studies. Large datasets that contain all this information would allow us to probe many therapies across multiple cancer types, allowing for more sensitive and rapid detection of resistance mutations. Understanding the mutations that arise from therapy and the manner in which they induce resistance will ultimately assist in the prescription of therapeutics, allowing for better prognosis.

1.2.6 Summary

Our understanding of cancer is continually growing, allowing for treatments of better efficacy and reduced harm. However, much remains unknown, particularly regarding metastatic cancers. As metastases account for the majority of cancer deaths, a better genetic understanding of this process may allow for better prognosis and treatment. As sequencing technologies have improved, we are also now better equipped to generate data sets that are able to tackle the questions that linger on non-coding regions of the DNA. For cancer, in particular, this includes identifying potential drivers in non-coding regions. Additionally, with the continual development of novel therapeutics, insight into resistance mechanisms is crucial to ensure the best course of treatment for the patient. Using our dataset of 638 patients with genomic, transcriptomic and clinical data, we attempt to give more insight into these areas and illustrate the benefits of this type of dataset.

Chapter 2: Non-coding Mutations in Pre-Treated Metastatic Cancer

2.1 Introduction

Currently, most of the information we have regarding cancer concerns the coding region of primary cancers. This is due to a lack of datasets containing whole-genome sequences as well as patients with advanced lesions. Here we present our inquiry into the non-coding space of 638 advanced cancer patients from a range of cancer types, in which we tackle three main objectives:

1. Investigate the mutational landscape of metastatic cancers and elicit any recurrent mutations, focusing on the functional non-coding region.

2. Investigate the functional impact of any recurrent non-coding mutation using differential expression and motif prediction.

3. Identify and describe how large-scale mutational processes such as kataegis affect our cohort.

2.2 Methods

2.2.1 Patient Samples

The cohort was assembled from patients who were enrolled in the POG program and gave informed consent between July 2012 and December 2018. Patients were selected based on biopsy availability, sufficient tumour content and complete unambiguous clinical data. This resulted in 638 patients with complete comprehensive clinical tumour profiles. The cohort was composed of 238 males and 400 females with an average age of 55.5 years. Selected patients’ tumors were from a wide assortment of sites. An overview of tumor subtypes and numbers can be seen in Figure 2.1.

[Figure 2.1: bar chart of the number of patients by tumor type: OV, ACC, BCC, UVM, HCC, MISC, BLCA, LYMP, ESCA, STAD, THCA, BRCA, COLO, LUNG, PANC, SARC, CHOL, SECR, UCEC, HNSC, CERV, KDNY, THYM, PRAD, SKCM, CNS-PNS.]

Figure 2.1 Overview of tumor types in the cohort. Patient samples were derived from a variety of tumor sites. The largest group was breast cancer, followed by colorectal and lung cancers.

2.2.2 Sample Collection, Preparation and Sequencing

Tumour specimens were collected using ultrasound- or CT-guided or needle core biopsies, endobronchial ultrasound biopsies, or tissue resection, and were immediately embedded in optimal cutting temperature compound for snap freezing on dry ice. When needed, liquid biopsies were performed, with the resulting specimen spun down and embedded in optimal cutting temperature compound. Sections were reviewed for optimal tumor content and cellularity. Selected sections were subjected to DNA and RNA extraction. Matched normal DNA was extracted from peripheral blood. Genomic libraries were constructed using PCR-free methods (E6875-6877B-GSC, New England Biolabs). Transcriptomic libraries were constructed through the BC Cancer Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Tumour genomic libraries were sequenced to a target depth of 80x coverage and normal blood genomic libraries were sequenced to a depth of 40x coverage on Illumina (San Diego, California) HiSeq 2500 using V3 or V4 chemistry and paired-end 125 base reads, or on HiSeq X using version 2.5 chemistry and paired-end 150 base reads. Transcriptomic libraries were sequenced to 150-200 million 75-base paired-end reads on Illumina HiSeq 2500, or on NextSeq 500 using version 2 chemistry. Detailed library construction protocols have been previously described.191

2.2.3 Alignment and Variant Calling

Sequenced reads were aligned to the human reference genome (hg19) using the Burrows-Wheeler Alignment tool (v0.5.7 for up to 125 bp reads and v0.7.6 for 150 bp reads).192 A matched normal variant calling approach was utilized to identify somatic alterations. SNVs were identified through a probabilistic joint variant calling approach utilizing SAMtools (v0.1.17), MutationSeq (v1.0.2 and v4.3.5) and Strelka (v1.0.6).120,127,133 Variants were selected using thresholds of MutationSeq probability >= 0.85 and Strelka QSS >= 15. Only selected variants were used for downstream analysis. Variants were annotated to genes using SnpEff (v3.2) with the Ensembl database (v69).193,194

2.2.4 Gene Expression Profiling

RNA-seq reads were aligned to the hg38 reference genome using STAR (version 2.5.2b) and expression was quantified using RSEM (version 1.3.0) as transcripts per million (TPM) using the approach described in the TOIL RNA-seq pipeline by Vivian et al.195–197 This was done to minimize any batch effects that may be present between our POG data sets and other publicly available RNA-seq samples, allowing for comparison. Furthermore, TPM has also been shown to be a better comparative measure between samples than other expression measurements.198 Genes were annotated based on Ensembl version 85.194
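For readers unfamiliar with the unit, the following Python sketch shows how a TPM value is derived from read counts and transcript lengths, together with the log2(TPM + 0.1) transform used on the expression axes of the figures in this chapter. It is a simplified illustration only: RSEM works with estimated counts and effective lengths, and the function names and toy numbers here are assumptions, not part of the pipeline described above.

```python
import math


def tpm(counts, lengths_bp):
    """Convert per-gene read counts to transcripts per million (TPM).

    Simplified: uses raw counts and annotated lengths, whereas RSEM uses
    estimated counts and effective lengths.
    """
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


def log2_tpm(value, pseudocount=0.1):
    """log2(TPM + 0.1), the transform shown on the expression axes."""
    return math.log2(value + pseudocount)


# Toy example with two hypothetical genes of equal per-base coverage.
values = tpm([10, 20], [1000, 2000])
print(values, [round(log2_tpm(v), 2) for v in values])
```

Because TPM values always sum to one million within a sample, they are comparable across samples in a way that raw counts are not.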

2.2.5 Clustering of Recurrent Mutations

A positional clustering method was utilized to uncover recurrent mutations. All somatic single nucleotide mutations within 50 bp of two other mutations were grouped together using the R package ClusteredMutations (v1.0.1) (Fig 2.2).199 This threshold was chosen to minimize the incorporation of singular mutations that border a cluster into that cluster.

Figure 2.2 Positional clustering of mutations. Mutations were clustered based on distance from neighbouring lesions. Mutations had to be within 50 bp of two other mutations to be clustered, as seen in cluster 1 and cluster 2. This is shown by the red mutation on the right of cluster 2, which is excluded. Patients can have multiple lesions within a cluster, as seen in cluster 1, but only clusters with 5 or more patients were considered for downstream analysis.
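As a rough illustration of the positional clustering step, the Python sketch below chains sorted mutation positions whose neighbouring gaps are at most 50 bp and keeps chains of three or more mutations. It is an approximation written for this text and not the ClusteredMutations implementation; the function name, the input format and the chaining rule are all assumptions.

```python
def cluster_positions(positions, max_gap=50, min_mutations=3):
    """Group positions into chains whose neighbouring gaps are <= max_gap.

    Hypothetical helper: approximates "within 50 bp of two other mutations"
    by keeping only chains with at least `min_mutations` mutations.
    Returns (start, end, mutation_count) tuples.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current chain.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_mutations:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_mutations:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Toy coordinates on a single chromosome (not real data).
example = [100, 120, 160, 400, 1000, 1010, 1049, 1060]
print(cluster_positions(example))
```

In this toy input the isolated mutation at position 400 is left out, mirroring the excluded mutation in Fig 2.2.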

To determine variants that may be significant to the progression of cancer, a mutational significance score was attributed to each of the clusters. The significance of each cluster was calculated using a binomial distribution as previously described by Weinhold et al.141 The equation is presented below, where n is the total number of patients in the cluster and p_i is the probability of mutation for the patient:

$$P(X \geq k) = 1 - P(X < k) = 1 - \sum_{j=0}^{k-1} \binom{n}{j} \, p_i^{\,j} \, (1 - p_i)^{\,n-j}$$

The probability of mutation was calculated using a local background mutation rate q_i and the length L_i of the cluster. The local background mutation rate was calculated as the average of all mutations within 10 kb upstream and downstream of the cluster divided by the total number of patients (Fig 2.3).

$$p_i \approx 1 - (1 - q_i)^{L_i}, \qquad q_i = \frac{m_i}{20000 \cdot 638}$$

Here m_i denotes the number of mutations found in the 20 kb window around the cluster.

Figure 2.3 Calculation of local background mutation rate. All mutations within 10 kb upstream and downstream of the cluster were identified and averaged for a per base mutation rate. This was then divided by total number of patients for a per patient per base mutation rate.
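The significance calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the function names are hypothetical, k is taken to be the number of mutated patients tested against, and the upper tail is summed directly (mathematically the same as one minus the lower tail) because subtracting from 1 loses precision for very small p-values.

```python
import math


def background_rate(n_window_mutations, window_bp=20000, n_patients=638):
    """Per-patient, per-base background rate from a +/-10 kb window."""
    return n_window_mutations / (window_bp * n_patients)


def cluster_probability(q, length_bp):
    """Probability that a patient has >= 1 mutation in a cluster of length_bp."""
    return 1.0 - (1.0 - q) ** length_bp


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for j in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1)
                    + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


# Toy example: 40 mutations in the window, a 60 bp cluster, 21 of 638 patients.
q = background_rate(40)
p = cluster_probability(q, 60)
print(q, p, binomial_upper_tail(21, 638, p))
```

The numbers in the toy example are illustrative only and are not taken from any cluster reported here.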

Clusters that did not reach significance (p < 0.05) were filtered out. To remove artifacts, germline events, and any mutations occurring in polymorphic locations, all mutations in the remaining clusters were filtered against the matched normals of the entire cohort, summarized at equivalent positions using SAMtools mpileup (v0.1.17).120 Any mutation present in a location that was mutated at an allele frequency higher than the tumor variant in at least 2% of normal samples was filtered out. The remaining mutations were re-clustered using the previously described methods and significance was recalculated for each cluster. Multiple test correction was performed using the false discovery rate method.200 An FDR of 5% was utilized.
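The normal-baseline rule can be expressed as a small predicate. The Python sketch below is one reading of that rule, with assumed names and input format (a tumour variant allele frequency and the list of allele frequencies observed at the same position across the cohort's normal samples); it is not the filtering script used in this work.

```python
def passes_normal_filter(tumour_vaf, normal_vafs, max_fraction=0.02):
    """Return True if the variant survives the normal-baseline filter.

    A variant is discarded when the position is mutated at an allele
    frequency higher than the tumour variant in at least `max_fraction`
    (2%) of normal samples. Inputs are hypothetical, for illustration.
    """
    if not normal_vafs:
        return True
    n_exceeding = sum(1 for vaf in normal_vafs if vaf > tumour_vaf)
    return n_exceeding / len(normal_vafs) < max_fraction


# Toy example with 100 normals: 2 exceeding normals (2%) triggers removal.
print(passes_normal_filter(0.30, [0.0] * 98 + [0.5] * 2))  # False
print(passes_normal_filter(0.30, [0.0] * 99 + [0.5]))      # True
```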

Significant clusters were considered to be clusters that had a q-value < 0.05 and a patient frequency of at least six patients (which was just under 1% of our cohort).
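To make the multiple-test correction concrete, the Python sketch below computes q-values with the Benjamini-Hochberg procedure and applies the two thresholds stated above (q < 0.05 and at least six patients). That the cited false discovery rate method is Benjamini-Hochberg is an assumption here, and the function names and toy inputs are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the order of the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def significant_clusters(pvalues, patient_counts, q_cutoff=0.05, min_patients=6):
    """Indices of clusters with q < q_cutoff and >= min_patients patients."""
    qvalues = benjamini_hochberg(pvalues)
    return [i for i, (q, n) in enumerate(zip(qvalues, patient_counts))
            if q < q_cutoff and n >= min_patients]


# Toy p-values and patient counts (not real clusters).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
print(significant_clusters([0.01, 0.04, 0.03, 0.5], [7, 9, 5, 12]))
```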

To identify potentially functional regulatory non-coding mutations, clusters fully residing in either a promoter region, an enhancer region or the 3’ and 5’ UTRs were selected for further analysis. Promoters were defined as 1.5 kb upstream to 500 bp downstream of the transcription start site as described by Ensembl (v69).194 Enhancers were defined using the GeneHancer database.201 Only enhancers that were categorized as “Double Elite” status (where multiple databases have identified the enhancer and functional assays have proved its validity) were considered for our analysis. UTRs were defined by Ensembl (v69).194 To determine any structural or functional effect on non-coding RNAs, a list of all lncRNA, miRNA, siRNA, snRNA, snoRNA and other miscellaneous small RNAs was generated based on Ensembl (v69) annotations.194 Clusters in intronic and intergenic regions residing in these non-coding genes were then filtered out. The top 10% of non-coding genes with mutational clusters were then examined in the literature for any possible function.
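The promoter definition and the "fully residing" criterion can be illustrated with a short strand-aware Python sketch. The function names and coordinates are hypothetical, and strand handling is an assumption about how "upstream" and "downstream" were applied.

```python
def promoter_interval(tss, strand, upstream=1500, downstream=500):
    """Promoter as 1.5 kb upstream to 500 bp downstream of the TSS.

    "Upstream" follows the gene's orientation, so the window is mirrored
    for genes on the minus strand (assumed convention).
    """
    if strand == "+":
        return (tss - upstream, tss + downstream)
    if strand == "-":
        return (tss - downstream, tss + upstream)
    raise ValueError("strand must be '+' or '-'")


def fully_within(cluster_start, cluster_end, region):
    """True if the whole cluster lies inside the region (inclusive bounds)."""
    region_start, region_end = region
    return region_start <= cluster_start and cluster_end <= region_end


# Toy example: a 60 bp cluster near a hypothetical plus-strand TSS at 10,000.
promoter = promoter_interval(10000, "+")
print(promoter, fully_within(9900, 9959, promoter))
```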

2.2.6 Functional Validation of Non-coding Regions

All genes with regulatory mutational clusters were analyzed for expression. Genes with no expression were filtered out. Remaining genes were subjected to expressional analysis using TPM data generated from RNA-seq as described above. P-values were calculated using an unpaired two-sample Wilcoxon test. Multiple testing correction was performed with the false discovery rate method.200 Transcription factor binding prediction was performed using Sasquatch (Fragmentation: DNase; kmer split: 7bp; Tissue: ENCODE_Duke_Fibrobl_merged; Normalization: propensity-based (fibroblast)).202
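For illustration, the unpaired two-sample Wilcoxon (rank-sum) comparison between mutated and unmutated patients can be sketched in plain Python. This is a simplified stand-in for a statistics package: it uses the normal approximation without tie or continuity corrections, so its p-values will differ slightly from exact implementations, and all names and expression values are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (U statistic of group_a, two-sided normal-approximation p)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Toy log2(TPM + 0.1) values for mutated and unmutated patients.
mutated = [6.1, 6.4, 6.8]
unmutated = [4.9, 5.2, 5.5, 5.0, 5.8]
print(rank_sum_test(mutated, unmutated))
```

The resulting p-values would then be corrected across all tested genes, as described above.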

2.2.7 Validation with an External Cohort

External validation of the expressional changes of the AP2A1 cluster was performed using data from the TCGA cohort. Patients in the SKCM, HNSC and LUNG cohorts were selected and analyzed for AP2A1 promoter mutations. A BAM slicing approach was utilized to isolate the region of interest to maximize variant calling potential. Variants were called using SAMtools (v0.1.17) and manually verified for legitimacy.120 Expressional analysis was performed using TPM data generated by TCGA.

2.2.8 Identification of Regions of Kataegis

This analysis was only performed on the 570 patients in the POG570 cohort. Inter-mutational distances were calculated for all mutations within a patient. Mutations were then grouped into regions of kataegis using the definition of six or more consecutive mutations with average inter-mutation distances of ≤1 kb as proposed by Alexandrov et al.67 Only regions with > 50% of mutational burden as C>T or C>G mutations were considered to be real kataegic events. All mutations were then filtered against the normal background as previously described to filter for potential artifacts. To analyze regions with high rates of kataegis, the genome was split into 10 kb bins and each kataegic event was placed into its corresponding bin. Kataegic events larger than 10 kb were placed into all residing bins. To evaluate the kataegis burden in a patient, we extracted all mutations within kataegis events from the patient and divided by all somatic mutations identified. Both an unpaired two-sample Wilcoxon test and a linear regression were used to determine correlations between APOBEC3B expression and kataegis.
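The kataegis definition above can be sketched in Python as follows. This is a simplified greedy segmentation written for illustration, not the procedure of Alexandrov et al. or the code used here: segments are extended while the mean inter-mutation distance stays at or below 1 kb, and the function names, substitution labels and bin indexing convention are assumptions.

```python
def find_kataegis(positions, min_mutations=6, max_mean_imd=1000):
    """Greedy sketch: runs of >= min_mutations with mean IMD <= max_mean_imd."""
    pos = sorted(positions)
    regions, i = [], 0
    while i < len(pos):
        j = i
        # Extend the run while its mean inter-mutation distance stays low.
        while j + 1 < len(pos) and (pos[j + 1] - pos[i]) / (j + 1 - i) <= max_mean_imd:
            j += 1
        if j - i + 1 >= min_mutations:
            regions.append((pos[i], pos[j]))
            i = j + 1
        else:
            i += 1
    return regions


def mostly_c_to_t_or_g(substitutions, min_fraction=0.5):
    """True if more than half of the substitutions are C>T or C>G."""
    hits = sum(1 for s in substitutions if s in ("C>T", "C>G"))
    return hits / len(substitutions) > min_fraction


def bins_for_event(start, end, bin_size=10000):
    """All 10 kb bin indices an event overlaps (0-based integer division)."""
    return list(range(start // bin_size, end // bin_size + 1))


def kataegis_burden(n_kataegis_mutations, n_total_mutations):
    """Percentage of a patient's somatic mutations that lie in kataegis events."""
    return 100.0 * n_kataegis_mutations / n_total_mutations


# Toy positions on one chromosome of one patient (not real data).
toy = [100, 200, 300, 400, 500, 600, 50000, 100000]
print(find_kataegis(toy), bins_for_event(9500, 21000), kataegis_burden(6, 8))
```

An event spanning more than one bin is counted in every bin it touches, matching the binning rule described above.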

2.3 Results

2.3.1 Filtering against normal baseline removes artifacts

To reduce artifacts, we filtered mutations against a normal baseline. Before any level of filtering was performed, 224,670 clusters of at least three mutations were found in the cancer genomes of 638 patients, consisting of 847,237 SNVs. Filtering against the normal baseline resulted in the removal of 69,019 variants, 15,762 total clusters, as well as 4,577 significant clusters. The majority of these filtered mutations were found in the intergenic space and the intronic space, and minimal effect was noticed on well-established and verified coding variants (Fig 2.4). This suggests that the filtering employed was proficient in eliminating artifactual variants due to library preparation, sequencing or alignment errors without discarding high-quality somatic variants.

[Figure 2.4: bar chart of removed mutations; axes: Region (SnpEff variant annotation) and Mutation count (log10)]

Figure 2.4 Filtering removes artifacts and unwanted mutations. All clustered mutations were filtered against a normal baseline. Figure shows mutations which were removed. Majority of mutations filtered were from intergenic and intronic regions. Few coding mutations were removed.

2.3.2 Positional clustering shows minimal recurrence for non-coding mutations

Positional clustering of filtered mutations and final significance calculations resulted in 1,567 significant clusters in 638 patients. Of these, 37 resided in coding regions, 77 resided in promoter or enhancer regions, five were in the coding region of predicted genes, two were in UTRs, 442 were intronic, 997 were in intergenic regions and nine clusters were attributed to multiple regulatory regions (examples being a promoter and a 5’ UTR or a promoter and the coding region of an unannotated gene) (Table 2.1).

Table 2.1 Summary of clustering results

| Region Type | Number | Patient number in most recurrent cluster | P-value of most significant cluster |
| --- | --- | --- | --- |
| Coding Regions | 37 | 99 | 6.73 x 10^-285 |
| Promoters | 86 | 21 | 4.60 x 10^-43 |
| Enhancers | 8 | 16 | 3.24 x 10^-32 |
| UTRs | 6 | 7 | 1.11 x 10^-15 |
| Intronic | 442 | 11 | 2.16 x 10^-32 |
| Intergenic | 997 | 12 | 6.29 x 10^-28 |
| Unannotated genes | 12 | 8 | 9.33 x 10^-24 |

Note: Clusters categorized in two regions were placed in both regions

In total, 97 genes were affected by a mutational hotspot in a promoter or enhancer, and 25 genes had a mutational hotspot in the protein coding region. Clusters were well-dispersed among chromosomes, with length of chromosome correlating with mutational burden (Fig 2.5). The majority of top scoring clusters reside in the coding space and are in genes that have been well studied and confirmed as key drivers in tumorigenesis (Fig 2.6). Table 2.2 shows the top 20 identified clusters. When examining lower scoring clusters, a wide variety of promoters are present (Fig 2.7). The highest recurring promoter mutation was found in TERT, a 60 bp cluster which houses two frequently mutated base pairs along with other sparse mutations.

Comparison of our clusters with other non-coding mutational studies of primary cancers did not yield any significant novel mutations or statistically significant differences in mutational rates.

[Figure 2.5: one panel per chromosome (1-22, X); x-axis: Genome Position; y-axis: Patient Count; legend: Coding Variant, 3'UTR, 5'UTR, Promoter, Enhancer, Intergenic Region, Intronic Region]

Figure 2.5 Chromosomal overview of cluster distribution. Clusters were plotted according to their position on the chromosome. Larger chromosomes generally had more mutational clusters whereas smaller chromosomes had decreased numbers. Highly recurrent clusters were spread across all chromosomes.

[Figure 2.6: scatter plot; x-axis: Hotspot Significance [-log2(P-value)]; y-axis: Number of Patients; legend: Coding Variant, 3'UTR, 5'UTR, Promoter, Enhancer]

Figure 2.6 Overview of mutation clusters affecting coding or regulatory regions. Mutations were clustered using a positional clustering method. Significance of each hotspot was calculated using a binomial distribution. An FDR of 5% was used for multiple test correction. Clusters were sorted into genomic elements and plotted. Clusters that are in multiple regulatory regions (Example: promoter of one gene, enhancer in another) are plotted for every gene they affect. The majority of highly significant mutational clusters are seen to reside in coding regions of known cancer genes.

Table 2.2 Top mutational clusters in metastatic cancers

| Chromosome | Start (bp) | End (bp) | Distance | Gene | Effect | Patients | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 25398255 | 25398285 | 31 | KRAS | Coding | 99 | 6.73 x 10^-285 |
| 17 | 7577018 | 7577176 | 159 | TP53 | Coding | 66 | 1.64 x 10^-81 |
| 3 | 178936072 | 178936137 | 66 | PIK3CA | Coding | 34 | 3.97 x 10^-70 |
| 3 | 178952074 | 178952117 | 44 | PIK3CA | Coding | 30 | 3.25 x 10^-67 |
| 17 | 7577498 | 7577610 | 113 | TP53 | Coding | 49 | 2.03 x 10^-59 |
| 10 | 115511590 | 115511593 | 4 | PLEKHS1 | Promoter | 16 | 4.60 x 10^-43 |
| 6 | 152419884 | 152419926 | 43 | ESR1 | Coding | 22 | 1.30 x 10^-42 |
| 17 | 7578176 | 7578291 | 116 | TP53 | Coding | 35 | 5.69 x 10^-36 |
| 2 | 77611361 | 77611361 | 1 | LRRTM4 | Intronic | 11 | 2.16 x 10^-32 |
| 6 | 142706206 | 142706236 | 31 | ADGRG6 | Enhancer | 16 | 3.24 x 10^-32 |
| 5 | 1295205 | 1295264 | 60 | TERT | Promoter | 21 | 5.27 x 10^-31 |
| 1 | 115256529 | 115256530 | 2 | NRAS | Coding | 10 | 1.95 x 10^-30 |
| 17 | 7578369 | 7578551 | 183 | TP53 | Coding | 35 | 1.75 x 10^-29 |
| 6 | 14065842 | 14065842 | 1 | N/A | Intergenic | 9 | 6.29 x 10^-28 |
| 7 | 140453136 | 140453193 | 58 | BRAF | Coding | 15 | 4.90 x 10^-26 |
| 6 | 161652663 | 161652664 | 2 | AGPAT4 | Intronic | 9 | 1.90 x 10^-25 |
| 13 | 81255050 | 81255050 | 1 | N/A | Intergenic | 9 | 8.48 x 10^-25 |
| 18 | 48591888 | 48591919 | 32 | SMAD4 | Coding | 13 | 1.01 x 10^-23 |
| 20 | 4004848 | 4004848 | 1 | FTLP3 | Promoter | 8 | 9.33 x 10^-23 |
| 23 | 21649148 | 21649149 | 2 | CNKSR2 | Intronic | 8 | 1.14 x 10^-21 |

[Figure 2.7: scatter plot; x-axis: Hotspot Significance [-log2(P-value)], range 0-150; y-axis: Number of Patients; legend: Coding Variant, 3'UTR, 5'UTR, Promoter, Enhancer]

Figure 2.7 Overview of mutation clusters affecting coding or regulatory regions (zoomed). Figure shows a closer examination of the group of clusters at the bottom right of Fig 2.6. Mutations were clustered using a positional clustering method. Significance of each hotspot was calculated using a binomial distribution. An FDR of 5% was used for multiple test correction. Clusters were sorted into genomic elements and plotted. Clusters that are in multiple regulatory regions (Example: promoter of one gene, enhancer in another) are plotted for every gene they affect. The majority of promoters are situated in this region of the graph.


Screening for clusters in ncRNA genes resulted in 295 clusters. All significant mutational clusters were found in lncRNAs or host genes for miRNAs. Table 2.3 shows the most frequently mutated clusters in ncRNA gene regions. Notably, almost no lncRNAs that have confirmed roles in cancer were seen to be heavily impacted by mutations. HOTAIR was one such lncRNA which seemingly did not accrue recurring mutations. We did identify two mutational clusters in the host gene of mir4500, an ncRNA which has a known function in cancer suppression. However, RNA-seq was unable to detect mir4500 levels and as such we were unable to determine the direct effect of mutations on mir4500. Examination of the expression of STAT3 and HMGA2, two mir4500 targets, showed no difference between affected and unaffected patients.

Table 2.3 Top ncRNA mutational clusters in metastatic cancers

| Chromosome | Start (bp) | End (bp) | Distance | Non-coding Gene | Type | Patients | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 188543492 | 188543542 | 51 | LINC02492 | lncRNA | 11 | 2.83 x 10^-14 |
| 8 | 34641882 | 34642048 | 167 | LINC01288 | lncRNA | 11 | 8.26 x 10^-7 |
| 16 | 8381292 | 8381333 | 42 | | lncRNA | 10 | 9.40 x 10^-13 |
| 6 | 86665386 | 86665476 | 91 | | lncRNA | 9 | 2.64 x 10^-8 |
| 8 | 54103492 | 54103595 | 104 | | lncRNA | 9 | 4.32 x 10^-8 |
| 5 | 12621402 | 12621554 | 153 | LINC01194 | lncRNA | 9 | 5.34 x 10^-6 |
| 14 | 29347391 | 29347392 | 2 | LINC02326 | lncRNA | 8 | 1.23 x 10^-18 |
| 8 | 123097297 | 123097405 | 109 | | lncRNA | 8 | 1.79 x 10^-8 |
| 5 | 4579099 | 4579143 | 45 | | lncRNA | 8 | 2.01 x 10^-8 |
| 1 | 192468455 | 192468585 | 131 | | lncRNA | 8 | 6.69 x 10^-8 |
| 15 | 25139618 | 25139691 | 74 | | lncRNA | 8 | 1.55 x 10^-7 |
| 17 | 52261794 | 52261897 | 104 | | lncRNA | 8 | 3.21 x 10^-7 |
| 3 | 21221782 | 21221880 | 99 | | lncRNA | 8 | 3.43 x 10^-7 |
| 3 | 20589315 | 20589446 | 132 | SGO1-AS1 | lncRNA | 8 | 6.88 x 10^-7 |
| 1 | 194840581 | 194840693 | 113 | | lncRNA | 8 | 1.26 x 10^-6 |
| 13 | 88097894 | 88098016 | 123 | MIR4500HG | lncRNA | 8 | 1.83 x 10^-6 |
| 2 | 22547897 | 22548028 | 132 | | lncRNA | 8 | 2.09 x 10^-6 |
| 1 | 187507370 | 187507483 | 114 | ERVMER61-1 | lncRNA | 8 | 3.10 x 10^-6 |
| 2 | 2646763 | 2646891 | 129 | | lncRNA | 8 | 3.29 x 10^-6 |
| 12 | 34358453 | 34358572 | 120 | | lncRNA | 8 | 5.64 x 10^-6 |

2.3.3 AP2A1 promoter mutations correlate with higher expression

Of the 97 genes with mutations in either a promoter and/or an enhancer, only 52 genes showed any expression. When examined through expressional analysis, only two clusters, within the TERT promoter and AP2A1 promoter, showed significant expressional differences between the mutational group and normal group after multiple test correction with the false discovery rate (p = 0.044, p = 0.0401) (Fig 2.8 & Fig 2.9). TERT promoter mutations were shown to increase expression, similar to what previous studies have noted.65,66,141,203 An increased expression pattern was also seen in patients with AP2A1 promoter mutations, an association which has not been previously described. AP2A1 expression was examined across all cohorts to determine if the increased expression observed was the result of a cohort effect. AP2A1 expression did not vary drastically across cohorts and all affected patients showed AP2A1 expression that was near the top of their respective cohorts (Fig 2.10).

[Figure 2.8: TERT expression by genotype; x-axis: No Mutation (n=617), TERT promoter Mutation (n=21); y-axis: Expression [log2(TPM + 0.1)]; annotated p-values: 0.013, 0.044]

Figure 2.8 TERT promoter mutations cause an increased expression. Patients were divided into two groups, those with mutations in the TERT promoter and those without. Expression of TERT was measured and compared between the two groups through an unpaired two-sample Wilcoxon test using TPM data. Higher expression is seen for patients with mutations in the TERT promoter. This change in expression is significant after multiple test correction using the false discovery rate.

[Figure 2.9: AP2A1 expression by genotype; x-axis: No Mutation (n=625), AP2A1 promoter Mutation (n=7), AP2A1 CNV Mutant (n=6); y-axis: Expression [log2(TPM + 0.1)]; annotated p-values: 0.00063, 0.0401]

Figure 2.9 AP2A1 promoter mutations induce an increased expression similar to amplification. Patients were divided into three groups, those with mutations in the AP2A1 promoter, those with an amplification of AP2A1 and those without any AP2A1 regulatory mutations. Expression of AP2A1 was calculated and compared between promoter mutations and no mutations through an unpaired two-samples Wilcoxon test using TPM data. A larger expression is seen for patients with mutations in the AP2A1 promoter when compared to those without mutation. This change in expression is significant after multiple test correction using the false discovery rate. This change in expression is also similar to the expressional effect induced by gene amplification.

[Figure 2.10: AP2A1 expression by cohort; x-axis: ACC (n=5), BCC (n=3), BLCA (n=1), BRCA (n=152), CERV (n=6), CHOL (n=16), CNS-PNS (n=19), COLO (n=92), ESCA (n=11), HCC (n=2), HNSC (n=10), KDNY (n=5), LUNG (n=70), LYMP (n=12), MISC (n=14), OV (n=33), PANC (n=63), PRAD (n=3), SARC (n=53), SECR (n=12), SKCM (n=16), STAD (n=11), THCA (n=4), THYM (n=4), UCEC (n=11), UVM (n=10); y-axis: Expression [log2(TPM + 0.1)]; legend: No Mutation, AP2A1 promoter Mutation, AP2A1 CNV Mutant]

Figure 2.10 AP2A1 expressional difference is not a cohort effect. Figure shows AP2A1 expression for all patients divided into cohorts. Red represents patients with promoter mutations, green for copy number amplifications. Examination of AP2A1 expressional levels across cohorts shows that all mutated patients are near the top of their cohort in AP2A1 expression and that affected cohorts do not have drastically increased expression.

The AP2A1 promoter cluster was found to be 63 base pairs long with two smaller distinct clusters (Fig 2.11). A total of 10 mutations were present in seven patients. The list of mutations within the cluster is shown in Table 2.4. Mutations were mostly C>T mutations. Affected patients came from a variety of cohorts, including melanomas, head and neck squamous cell carcinomas and small-cell lung carcinomas. Motif prediction using Sasquatch showed a chromatin disruption level similar to TERT and PLEKHS1 promoter mutations, which are frequently observed and well-established promoter mutations in cancers (Table 2.5).141,204

[Figure 2.11: lollipop plots of AP2A1 mutations; top panel: entire gene; bottom panel: promoter region; x-axis: Position (bp); y-axis: Number of Mutations]

Figure 2.11 AP2A1 mutations are located close to the transcription start site. An overview of the AP2A1 gene is presented. The top panel shows the entire gene, whereas the bottom panel is focused on the promoter. Mutation position and quantity are indicated by the lollipops. Red points indicate the clustered group of mutations. Mutations in the cluster can be seen to separate into two smaller clusters. Overall, the cluster is in close proximity to the transcription start site of AP2A1.

Table 2.4 Overview of AP2A1 mutations

| Patient | Position | Reference Base | Variant Base |
| --- | --- | --- | --- |
| A | 50270010 | C | T |
| B | 50270040 | C | T |
| C | 50270055 | C | T |
| D | 50270038 | C | T |
| | 50269983 | G | A |
| E | 50269984 | G | A |
| | 50270000 | G | A |
| F | 50269994 | G | A |
| G | 50270048 | C | T |

Table 2.5 Sasquatch DNase I footprinting prediction results

| Gene | Patients | Average total damage | Average change (%) |
| --- | --- | --- | --- |
| TERT | 21 | 0.85 | 73.8 |
| PLEKHS1 | 16 | 0.60 | 50.7 |
| AP2A1 | 7 | 0.42 | 61.8 |
| Random | N/A | 0.13 | 45.1 |

To validate the expression difference, we examined the SKCM, HNSC and LUNG cohorts in TCGA for mutations in the corresponding hotspot location. Only the SKCM cohort contained samples with coverage in the region of interest for downstream analysis. Of 425 SKCM patients, 112 patients contained enough information for variant calling and expression analysis. A similar trend was seen in the TCGA cohort, as patients with mutations in the AP2A1 promoter cluster were shown to have a higher AP2A1 expression level (Fig 2.12).

[Figure 2.12: AP2A1 expression in TCGA SKCM; x-axis: No Mutation (n=104), AP2A1 promoter Mutation (n=8); y-axis: Expression [log2(TPM + 0.1)]; annotated p-value: 0.055]

Figure 2.12 AP2A1 promoter mutations induce a similar increase in expression in TCGA. Patients from the SKCM cohort in TCGA were divided into two groups, those with mutations in the AP2A1 promoter and those without. Expression data was retrieved and expression of AP2A1 was compared between promoter mutations and no mutations through an unpaired two-sample Wilcoxon test using TPM data. Higher expression is seen for patients with mutations in the AP2A1 promoter when compared to those without mutation, but the difference did not reach significance (p = 0.055).

2.3.4 Kataegis is a highly recurring event in metastatic cancer

Examination of kataegis in the POG570 cohort showed a wide exposure across multiple cohorts. Overall, 62.3% of cases were found to contain at least one instance of kataegis, including sarcoma cases, in which this process has not been previously described. Other cohorts affected include breast, melanoma, lung, colon, ovarian and pancreatic cases. Kataegic events were seen in every chromosome, with certain areas having a slightly higher incidence of kataegis, although not statistically significant (Fig 2.13). The bin with the most patients was on chromosome 14, from 106,320,001 bp to 106,330,001 bp, where four patients were seen to have a kataegis event in this region. No kataegis events were seen to be recurring in the sex chromosomes.

Figure 2.13 Certain regions are more prone to kataegis. Kataegis events were placed into 10kb bins across the genome and plotted. Only bins with 2 or more kataegic events are shown. The darker the red, the higher incidence of recurrence. The highest recurrence was on chromosome 14, where 4 patients had a kataegis event in the same region.

The presence of kataegis was then compared to patient APOBEC3B expression. Consistent with previous literature describing the suggested role of APOBEC3B in kataegis, an association was seen between patient APOBEC3B expression level and the development of kataegis (p = 5.3 x 10^-10) (Fig 2.14).69–71 No linear trend was observed between kataegis burden and APOBEC3B when examining only those patients with identified kataegis events (Fig 2.15).

[Figure 2.14: APOBEC3B expression by kataegis status; x-axis: Unaffected Patient, Patient with Kataegis; y-axis: APOBEC3B Expression (Log2(TPM)); annotated p-value: 5.3 x 10^-10]

Figure 2.14 Increased expression of APOBEC3B is a predictor of kataegis events. Patients from the POG570 cohort were divided into those who have a kataegic event and those who do not. APOBEC3B expression was compared between the two groups. Increased expression is seen for patients with kataegis.

[Figure 2.15: scatter plot; x-axis: APOBEC3B Expression (Log2(TPM)); y-axis: Kataegis Mutation Burden (%)]

Figure 2.15 Kataegis mutational burden shows no linear correlation with APOBEC3B expression. Kataegis mutational burden was calculated for patients from the POG570 cohort. A linear regression was used to correlate APOBEC3B expression with kataegis burden. Only patients with kataegis events were examined. No linear trend was seen.

2.4 Discussion

2.4.1 Summary of Results

Due to the large number of mutations we were investigating, we decided to filter down the SNVs as much as possible, both for a manageable data set as well as to eliminate any artificial mutations that may have arisen during the process. Certain predicted issues we were hoping to eliminate were mutations in polymorphic regions, common sequencing artifacts and any artifacts that may have been produced during library preparation.140 The decision to remove any mutations in polymorphic regions, which are regions in the DNA with more than one allele occurring at a rate of at least 1%, was made with the understanding that the commonality of the polymorphism in the population discredits any mutation in that position from being an oncogenic mutation, as it would most likely not be inherited at polymorphic frequencies if the mutation was oncogenic. As described earlier, the matched normal DNA sequences of all POG cases were aggregated and used as a baseline for filtration, with a percentage of mutation among normals calculated for each base. Filtering was then accomplished by discarding mutations present in bases with a percentage of mutation higher than 2%. This would allow for the identification of both polymorphic mutations and regions where library preparation and alignment were prone to produce artifacts, as both tumor and normal samples underwent similar library construction and sequencing protocols. A little under 2% of mutations in clusters were removed through filtration against a normal baseline representative of all 638 POG cases, with the majority of filtered SNVs in non-coding regions. Closer inspection of filtered SNVs shows common SNPs, sequencing artifacts and variants that were called in poorly aligned areas. Together this suggests that our method of filtration is beneficial for isolating real variants when attempting to study somatic mutations in a whole-genome setting.

Mutational clustering unveiled a wide range of genomic regions that were seemingly recurrently mutated. Many of these clusters had been previously identified in primary cancers.141,205 While the majority of clusters were seen in intergenic regions, the clusters of highest significance and the greatest number of patients affected resided in the coding regions. This is consistent with other inquiries into non-coding regions, with known coding cancer driving mutations often presenting as the highest recurring mutations in a cohort, and low recurrence of mutations in the non-coding region.204,206,207 This result strengthens the idea that the majority of driving SNVs in cancers are located in coding regions rather than non-coding regions, and that while we cannot deny the effect of non-coding elements on tumorigenesis, they seem to play a much lesser role in comparison. For ncRNAs in particular, the proliferative effect of ncRNAs in cancer may be driven more by expressional differences than by functional modifications from mutations. This is shown in our cohort, where lncRNAs known to be involved in cancer and to drive cell proliferation were not heavily mutated. The lack of highly recurrent novel mutations in either the non-coding or coding regions also indicates that single nucleotide variations may not be the overarching driving force of metastases, that different tumor groups and cohorts have different biological needs for metastasis, and that, much like the initiation of tumorigenesis, many different alterations come together to form a mosaic genetic change that pushes this process.

One ncRNA that did stand out in our analysis of recurrent mutations was MIR4500HG, which is the host gene of mir4500, a miRNA which has shown tumour suppression qualities.208,209 Two individual clusters were found on parts of this host gene. Unfortunately, mir4500 was not detected at high enough signals for expression analysis and as such we were unable to determine the exact effect of the mutations, whether they affected miRNA synthesis or disturbed the targeting region. Literature suggests that mir4500 has multiple targets that are cancer type specific.208–210 However, an examination of the targets showed no increase of expression with mir4500 host gene mutation. This presumably suggests that the mutations may not affect function, but as there may be other unknown targets, an examination into the direct effects of mir4500 is warranted.

Most of the significant promoter mutations seen in our cohort have been previously described in other primary cancer cohorts, validating our overall detection methods.141,205 The highest recurring cluster was found in the TERT promoter, a frequently observed and well-studied mutational cluster. Mutations in the TERT promoter have been shown to introduce novel TF binding sites and increase the overall expression of TERT.65,66,203 A similar expression pattern was seen in our cohort, where mutations in the TERT promoter increased TERT expression (Fig 2.8). As such, our methods of conducting expression analysis can also be considered valid due to alignment with well-established data. Patients with mutations in the PLEKHS1 promoter, another frequently described cluster in the literature, showed no expressional change of PLEKHS1 when compared to control patients. At the moment, there is no consensus on how this promoter mutation behaves, as different studies have described different effects.141,205 One possibility may be that the expressional differences in PLEKHS1 are the result of different subtypes of cancer, as overexpression of PLEKHS1 has been seen in specific subtypes of bladder cancer and is a biomarker for progression.211 As our cohort lacks bladder cancer profiles, we are unable to tackle this question.

From expression analysis, we saw that AP2A1 was the only gene aside from TERT whose promoter mutations showed any biological effect. While the AP2A1 cluster we found had been previously identified, this is the first instance where a difference in expression has been detected between the mutant group and the control group. The expression in patients with promoter mutations is comparable to the expression in patients that have an amplification of the AP2A1 gene, suggesting a true biological signal (Fig 2.8). Motif prediction further supports a biological function, as the predicted disturbance of chromatin structure is similar to the effects of known promoter-altering mutations (Table 2.5). As Sasquatch simply measures and predicts DNase I footprinting, the exact change that the mutations cause within the promoter region is still unknown. TF motif prediction with other software yielded no consensus results, and the results were often saturated with possible alterations. As such, these predictions were not considered viable. Further investigation is necessary to elucidate the mechanism of action of this mutation. Validation in TCGA samples further supports that the difference in expression was not simply an artifact of our cohort but a real signal, and that this function is not isolated to metastatic cancers.

Despite all evidence pointing to an upregulation of AP2A1, many questions still linger, as the AP2A1 gene has no established role in cancer as of yet. Studies on AP2A1 have mainly noted a role in binding to clathrin cages in clathrin-mediated endocytosis as part of the AP2 complex.212,213 The AP2 complex has also been seen to associate with substrates of EGFR.213 Several abstracts have described a potential role for AP2A1 in the nuclear transport of EGFR protein.214 This is of interest because, when EGFR is localized to the nucleus, the EGFR protein functions as a TF for mitotic genes such as cyclin D1.215–217 An increased expression of AP2A1 protein would therefore theoretically result in increased nuclear EGFR levels, which would subsequently augment cell proliferation through upregulated mitotic pathways. Aside from the proliferative impact, the increase in AP2A1 expression and its potential role may also have implications for chemotherapeutics. Many cancers are currently treated with EGFR inhibitors as first-line therapies; examples include cetuximab in colorectal cancers and afatinib in non-small cell lung cancers.218,219 Localization of EGFR to the nucleus has been implicated as a potential resistance mechanism towards EGFR inhibitors and has been linked with a poor prognosis.215 As a result, overexpression of AP2A1 could also potentially allow for resistance towards anti-EGFR therapies, and it may be useful as a biomarker when prescribing anti-EGFR therapies.

Overall, the lack of highly recurrent and novel mutational clusters in the regulatory regions of metastatic cancers that show a functional impact suggests that mutations in regulatory regions may not be a large driving force for metastasis. This is interesting as many genes that have been described to be involved in metastasis are shown to induce metastasis through an aberrant expression.220 Therefore, the cause of the previously observed aberrant expression remains to be explored and may possibly be attributed to large SVs or other regulating mechanisms.

As kataegis heavily impacts the overall landscape of the non-coding space, we sought to better understand this process in metastases in order to generate the best characterization of the non-coding space. Our analysis of kataegis showed that this mutational event can arise in almost any cancer type. Here we report the first cases of kataegis in a clinical setting for sarcomas and confirm recent reports of kataegis in colorectal cancers.221 An overview of kataegis recurrence in the genome suggests that no specific region is routinely targeted in kataegis. This aligns with the proposed mechanism for kataegis, which implicates BIR and APOBEC3B activity; these act on break point repair and single-stranded DNA respectively, and neither is specific to a genomic location.69–71 Why some areas are more prone to kataegis is unknown, and this may be attributed to regions being more prone to being a break point. However, given that the highest number of recurrences seen was four patients, this itself may simply be an artifact and may not be of interest.

Further supporting the current state of literature surrounding kataegis, we showed that patients with higher APOBEC3B expression were more prone to developing kataegis, indicating a necessity for this pathway. However, the linear trend between APOBEC3B expression and degree of mutational burden from kataegis was relatively weak, indicating that the relationship between APOBEC3B expression and kataegis may be a complicated one and that other factors are implicated in the process.

2.4.2 Limitations of the study

Several limitations significantly constrained the degree of discovery achieved by our study. The major limitation was the sample size. While a cohort of 638 patients is seemingly a sizable amount, it is not comparable with the thousands of samples that populate other similar studies. Furthermore, when divided into cancer types this number is drastically reduced. Our largest cohort was the breast cancer cohort, which consisted of 153 patients. The next largest are colorectal cancers and lung type cancers, which were at 93 and 70 respectively. As such, we were limited to a pan-cancer search for statistical power reasons and any mutational cluster that was uncovered was either present in multiple cancer types or was highly recurrent in one of the larger cohorts. This can be seen in the coding mutation clusters, where only pan-cancer mutations such as TP53 mutations or highly recurrent mutations in large cohorts such as KRAS for colorectal and ESR1 for breast are observed.

Another limitation is due to the profile of patients in our cohort. All patients were selected from the POG project, which recruits advanced cancer patients who have a poor prognosis and prior exposure to therapy. Therefore, we are addressing a very specific population of metastatic cancers, and our findings may not extrapolate to other instances of metastasis.

Furthermore, as we only have access to the metastatic sample of the patient and do not know the genomic profile of the primary tumour, we cannot make any precise statements on which mutations arose over the course of treatment and metastasis, and which mutations are involved in metastasis. All associations and suggestions regarding the mutational landscape are simply observational and are descriptions of the sequencing information we have.

Addressing the limitations when studying transcribed non-coding genes, the manner in which RNA was sequenced and the data were processed effectively masked many smaller ncRNAs, such as miRNAs, that are present in the transcriptome from the generated expressional results. This was done to more accurately measure mRNA levels for greater clinical impact. As such, although we could identify mutations in miRNA host genes, such as MIR4500HG, we were unable to determine if there were any functional effects associated with the mutations.

Additionally, the lack of whole-genome sequencing datasets severely limits any attempt at validating results. All validation of mutational results was performed on the TCGA cohort, which is composed mainly of whole exome sequences as well as low coverage whole genome sequences. Therefore, in order to obtain any information on non-coding variants, manual manipulation of the genomic information was required along with manual verification of variants, as few to no variant callers would work at the minimal read depth in the excised region. As such, variants were not called as stringently as possible, and human error may have been introduced as well.

Moreover, many of the conclusions that have been proposed are driven by correlations that were uncovered through data-mining. While these results are valid, many require additional investigation through dedicated designed experiments for biological confirmation.

2.4.3 Future directions

Several follow-up experiments are needed to validate the results of our probe into the non-coding genome. First, functional assays are needed to confirm the expression change from AP2A1 promoter mutations. This may be accomplished through reporter gene assays using the normal and mutated promoter, or through expression analysis following CRISPR or site-directed mutagenesis of the promoter region in cultured melanoma cells. An additional experiment on the quantity of AP2A1 protein in the cell would be beneficial to confirm an increased accumulation of protein post mutation. Transcription factor binding and chromatin structure assays may be employed to confirm the results of the DNase I footprinting prediction and to determine the mechanism of increased expression, particularly whether and which transcription factors are actively binding. Furthermore, to establish a role for AP2A1 in cancer, more insight is needed into the possible function of AP2A1 in the nuclear localization of EGFR.

Recent studies into non-coding regions have also utilized conservation scores to facilitate identification of biologically relevant regions of the genome, or the importance of particular bases in the genome.222,223 This may be an interesting venture, particularly for intergenic and intronic variants where function is still unknown. This may also assist in the understanding of variants in ncRNAs, as key structural bases in the transcript would likely be conserved.

Our study also identified many clusters in intronic regions. One possible function of intronic variants is the introduction of aberrant splicing and splice isoforms. This may be an interesting venture as promoter and enhancer mutations did not explain the often-observed dysregulation of expression patterns seen in metastasis so this mechanism may be a possible explanation. Other biological mechanisms to consider for future studies include large scale SVs and copy number alterations.

Our results were also hampered by an inability to detect miRNA expression levels. It would be beneficial to explore the MIR4500HG mutations and other small ncRNA genes using miRNA-seq, which focuses on generating expression levels for smaller ncRNAs. This is of particular interest as mir4500 has already been established as having a role in cancer.

Repetition of our analysis in additional whole genome cancer datasets is another important step, in order to generate consensus on which recurrent mutations are real. However, this is limited by the quantity of available whole genome datasets. The widespread establishment of whole genome sequencing efforts to both understand and treat cancer is therefore necessary to further investigations into the non-coding region of cancer and its potential roles in metastasis. The ability to sequence cancers as they progress would be the ultimate goal, allowing detailed understanding of the sequential pattern of mutations and the exact mechanisms through which progression arises.

Chapter 3: Treatment Associated Mutations

3.1 Introduction

Tumours becoming resistant to chemotherapies is a common observation. In many cases, this is the result of genetic modification after exposure to therapy. This has been seen in ESR1 mutations in response to aromatase inhibitors which allow for estrogen-independent transcriptional activity, as well as in EGFR, where mutations inhibit targeted binding of TKIs.

The majority of resistance mutations have been detected through specific case studies, which limits the throughput of a comprehensive discovery of resistance mechanisms. A factor in this has been the lack of large datasets that contain clinical information, which has prohibited any large-scale investigations. Here, we attempt to identify resistance mutations using our cohort of 638 metastatic patients, who have been treated with chemotherapies and for whom we have detailed clinical information. We will be tackling two main objectives:

1. To formulate a method to detect resistance mutations

2. To survey the genome for any single nucleotide, small indel or copy number variation that may have resulted from therapy

3.2 Methods

3.2.1 Patient Sample

Please refer to section 2.2.1.

3.2.2 Sample Collection, Preparation and Sequencing

Please refer to section 2.2.2.

3.2.3 Alignment and Variant Calling

Please refer to section 2.2.3.

3.2.4 Gene Expression Profiling

Please refer to section 2.2.4.

3.2.5 Copy-Number Calling

Regions of copy number variation were identified using the Hidden Markov model-based approach in CNAseq (v0.0.6).224 Amplifications were defined as regions with a copy number call of five. Deletions were defined as regions with a copy number call of zero. Copy number variants were annotated to genes using SnpEff (v3.2) with the Ensembl database (v69).193,194 Copy number variants from TCGA were taken from the TCGA database.106 Amplifications in the TCGA cohort were defined as regions with a copy number call of five or more. Deletions in the TCGA cohort were defined as regions with a copy number call of zero.

3.2.6 Collection of Clinical information

Prescription information was collected from the British Columbia (BC) Cancer pharmacy database and reviewed for each patient between July 2012 and December 2018. Drug prescriptions relating to cancer treatments were selected. This database captures all cancer therapies funded by the province and therefore captures the vast majority of treatments delivered to patients; although rare, additional treatments were identified using patients’ charts where available. Other information retrieved includes biopsy date and progression dates. Drugs were classified into drug classes by mechanism of action.

3.2.7 Association of Treatment and SNVs/Indels

To investigate the association of treatment and alteration, we divided tumours into their respective analysis cohorts. This was done as certain drugs are cohort specific. Only the BRCA, COLO, LUNG, PANC, OV, and SARC cohorts were examined for statistical power reasons. Drugs were consolidated into drug classes for increased detection and statistical power. We then isolated drug classes that were prescribed in at least 10% of a particular cohort. Patients were considered to have been exposed to a therapy if the course of treatment began before tumor biopsy and if the course of treatment was greater than 28 days. 28 days was chosen as this was determined to be the length which clinicians would use to test for toxicity of the drug or any adverse effects for the patient.
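The exposure rule and the 10% prescription threshold described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in this work: the function names and the input layout are hypothetical, and the course length is assumed here to be measured from the first to the last day of the course.

```python
from datetime import date

MIN_COURSE_DAYS = 28  # threshold stated in the text


def is_exposed(course_start: date, course_end: date, biopsy: date) -> bool:
    """True if the course began before biopsy and lasted more than 28 days.

    Assumption: course length is counted from course start to course end.
    """
    duration_days = (course_end - course_start).days
    return course_start < biopsy and duration_days > MIN_COURSE_DAYS


def frequent_drug_classes(prescribed: dict, cohort_size: int,
                          min_fraction: float = 0.10) -> list:
    """Drug classes prescribed in at least `min_fraction` of a cohort.

    `prescribed` maps a drug class name to the set of patient ids
    (hypothetical layout) who received a drug of that class.
    """
    return sorted(
        drug_class
        for drug_class, patients in prescribed.items()
        if len(patients) / cohort_size >= min_fraction
    )
```

For example, a course running from 1 January to 1 March 2015 with a biopsy on 1 June 2015 would count as an exposure, whereas a 19-day course or a course started after the biopsy would not.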

Every gene with an SNV or indel in a treated patient was identified, and mutation ratios of that gene were examined for patients subjected to the specific drug class and patients who did not undergo that same drug class treatment. A p-value was calculated between the ratios using a Pearson’s chi-square test, where the expected value was the mutation ratio of the gene in question for patients who did not undergo the treatment in question and the observed value was the mutation ratio of the gene in question for patients who underwent the therapy in question.
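One possible reading of this test is a one-degree-of-freedom goodness-of-fit comparison, in which the mutated and unmutated counts among treated patients are compared against the counts expected under the untreated mutation ratio. The sketch below follows that reading; it is an assumption about the exact formulation, the function name is hypothetical, and the counts in the example are invented for illustration.

```python
import math


def chi_square_vs_untreated(mut_treated: int, n_treated: int,
                            mut_untreated: int, n_untreated: int):
    """Pearson chi-square statistic and p-value (1 degree of freedom).

    Observed: mutated / unmutated counts among treated patients.
    Expected: the same counts under the untreated mutation ratio.
    """
    expected_rate = mut_untreated / n_untreated
    if not 0 < expected_rate < 1:
        raise ValueError("untreated mutation ratio must lie strictly between 0 and 1")
    exp_mut = n_treated * expected_rate
    exp_wt = n_treated - exp_mut
    obs_wt = n_treated - mut_treated
    stat = (mut_treated - exp_mut) ** 2 / exp_mut + (obs_wt - exp_wt) ** 2 / exp_wt
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: 10 of 40 treated vs 10 of 100 untreated patients mutated.
stat, p_value = chi_square_vs_untreated(10, 40, 10, 100)
```

Note that the expected ratio cannot be zero under this formulation, so a gene that is never mutated in untreated patients would need separate handling.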

Only significant associations between treatment and gene were considered for downstream analysis. As EGFR is known to have both sensitive and resistant mutations, these were separated into two gene groups, sensitive and resistant, for downstream analysis. Mutation rates for each significant gene were extracted from the TCGA cohort and were compared against the mutation rates in our cohort. Only genes that were mutated at a higher rate in the treatment group compared to TCGA were used for downstream analysis. This was done so that we could filter out mutations in genes which likely arose in the primary tumor and are not of interest to our analysis. Genes were then further filtered by their functionality as defined by SnpEff annotation. Only variants that were clustered in a nine base pair region or resulted in a truncated protein were considered for downstream analysis. A nine base pair region was selected to best identify mutations affecting a similar activity site. The remaining genes were classified as having treatment-mutation associations. Known resistance mutations, namely ESR1 mutations with aromatase inhibitors in the BRCA cohort and EGFR mutations with EGFR inhibitors in the LUNG cohort, served as our controls. Biological relevance of associations was examined through literature.
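The nine base pair clustering criterion can be illustrated with a sliding window over sorted variant positions. This is a sketch under stated assumptions: a cluster is taken to mean at least two variants (across patients) whose positions span no more than nine bases, the function name is hypothetical, and the truncation criterion is not covered here.

```python
def clustered_positions(positions, window=9, min_variants=2):
    """Return the variant positions that fall within a `window`-bp span
    containing at least `min_variants` variants.

    `positions` holds genomic coordinates of variants in one gene, pooled
    across patients; repeated coordinates count as separate variants.
    """
    pos = sorted(positions)
    keep = set()
    left = 0
    for right in range(len(pos)):
        # Shrink the window until it spans at most `window` bases.
        while pos[right] - pos[left] > window - 1:
            left += 1
        if right - left + 1 >= min_variants:
            keep.update(pos[left:right + 1])
    return sorted(keep)
```

With invented coordinates 100, 104, 108, 150, 300 and 309, only the first three are retained: they span nine bases, whereas 300 and 309 span ten and 150 stands alone.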

A second method was also utilized, where binomial distributions of mutation rates were used rather than chi-square tests. Briefly, treatments and genes were selected as described above. A cohort mutation rate for each gene was generated by dividing the number of mutated patients in a cohort by the number of all patients in the cohort. A mutation rate for TCGA was generated in the same way from the TCGA patients with a mutation in the gene. Two p-values were then generated for each gene-treatment association using a binomial distribution, one using the cohort mutation rate and one using the TCGA mutation rate, each testing the likelihood of the number of patients mutated for a particular gene in a treatment cohort. The two p-values were then combined using Fisher’s method. Multiple testing correction was performed using the false discovery rate.
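The second method can be sketched as three small steps: a binomial tail probability, Fisher's method for two p-values, and a false discovery rate adjustment. The sketch makes several assumptions that are not stated in the text: the binomial test is taken as one-sided (probability of observing at least as many mutated patients), the Benjamini-Hochberg procedure is used for the false discovery rate, and all function names are hypothetical.

```python
import math


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p); one-sided test (assumption)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fisher_combine_two(p1: float, p2: float) -> float:
    """Fisher's method for two p-values.

    The statistic -2*(ln p1 + ln p2) follows a chi-square distribution with
    4 degrees of freedom, whose survival function has a closed form.
    """
    p1, p2 = max(p1, 1e-300), max(p2, 1e-300)  # guard against log(0)
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2) * (1 + x / 2)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a given gene-treatment pair, `binom_upper_tail` would be called twice, once with the cohort mutation rate and once with the TCGA mutation rate, and the two results passed to `fisher_combine_two` before adjustment across all pairs.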

3.2.8 Association of Treatment and Copy Number Variations

To investigate the association of treatment and copy number variations, we employed a similar approach as for the single nucleotide variations. Only patients in the POG570 cohort were examined. Briefly, tumors were separated into their respective analysis cohorts and only the BRCA, COLO, LUNG, PANC, OV, and SARC cohorts were selected for downstream analysis. We then isolated drug classes that were prescribed in at least 10% of a particular cohort. Every gene that had a copy number alteration in a treated patient was extracted, and the mutation rates were compared between patients who were subjected to treatment from a drug class and patients who did not undergo that same treatment. A p-value was attributed to every comparison using a Pearson’s chi-square test. Copy number variation rates for each gene were extracted from the TCGA cohort and were compared against those for each gene-treatment association. Only genes that were mutated at a higher rate in our cohort compared to TCGA were used for downstream analysis. Genes were then further filtered by expression differences. Expression was compared between patients with a copy number variation and treatment and patients with the same treatment but no copy number change, using an unpaired two-samples Wilcoxon test. Only amplifications showing increased expression and deletions showing decreased expression were considered for downstream analysis. Genes were then grouped according to their chromosome location (genes had to be on the same arm of the chromosome) and treatment to form locus-treatment associations. Significant loci were classified as treatment-CNV associations. A heuristic approach was then used to identify the candidate target gene for each locus. Briefly, genes that were altered in at least 90% as many patients as the most recurrent gene in the locus were selected. From these genes, the gene that had the greatest expression difference was selected as the candidate gene of that locus. Associations were examined in literature for any biological relevance.
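The candidate gene heuristic lends itself to a short sketch. The function name, the input layout and the gene labels below are hypothetical, and the "expression difference" is assumed to be a single signed number per gene (for instance a difference in median expression), since the text does not specify the measure.

```python
def pick_candidate_gene(locus_genes: dict, recurrence_fraction: float = 0.9) -> str:
    """Select the candidate gene for one locus.

    `locus_genes` maps gene name -> (patients_with_cnv, expression_difference).
    Genes altered in at least `recurrence_fraction` as many patients as the
    most recurrent gene are shortlisted; the shortlisted gene with the largest
    absolute expression difference is returned.
    """
    top_count = max(count for count, _ in locus_genes.values())
    shortlist = {
        gene: diff
        for gene, (count, diff) in locus_genes.items()
        if count >= recurrence_fraction * top_count
    }
    return max(shortlist, key=lambda gene: abs(shortlist[gene]))


# Invented locus: GENE_C has the largest difference but too few patients.
example_locus = {"GENE_A": (20, 0.5), "GENE_B": (19, 2.0), "GENE_C": (5, 9.0)}
candidate = pick_candidate_gene(example_locus)
```

In this invented locus the heuristic returns GENE_B, since GENE_C is excluded by the 90% recurrence cut-off despite its larger expression difference.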

3.2.9 Analysis of Time on Therapy

To determine how mutation correlated with time on therapy, treatment times were extracted from clinical data. Only treatments known to induce resistance mutations were examined. Patients that were treated with a particular therapy were separated into mutated and unmutated groups. Therapy times were then compared between the two groups using an unpaired two-samples Wilcoxon test.
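For illustration, the unpaired two-samples Wilcoxon test (equivalent to the Mann-Whitney U test) on treatment durations can be written with the standard library alone. The sketch below uses the normal approximation without continuity or tie correction of the variance, so its p-values will differ somewhat from exact implementations, particularly for very small groups; the function name and the durations in the example are invented.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns the U statistic for `x` and an approximate p-value. Tied values
    receive midranks; the variance is not corrected for ties.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Invented treatment durations in days for unmutated and mutated patients.
u_stat, p_value = rank_sum_test([10, 20, 30], [40, 50, 60, 70])
```

With groups as small as three patients, an exact test is preferable to this approximation.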

3.3 Results

3.3.1 Large-scale search for therapy induced mutations detects real associations

Using the treatment-mutation association method, we examined the BRCA, COLO, LUNG, OV, PANC and SARC cohorts. After filtering for fold change with TCGA and selection based on mutational effect, we were left with seven treatment-mutation associations, three of which were our controls: ESR1 and aromatase inhibitors in the BRCA cohort, EGFR (sensitive) and EGFR inhibitors in the LUNG cohort, as well as EGFR (resistant) and EGFR inhibitors in the LUNG cohort (Fig 3.1).

[Figure 3.1: scatter plot of treatment-mutation associations. Axes: Mutation Rate x Drug Association Score; Fold Change vs untreated primary (TCGA). Legend: Clustered, Truncated; BRCA, LUNG. Labelled points: EGFR (Resistance), EGFR (Sensitive), ESR1, CTNNB1, ARID1A, ANKRD12 and OR5H2, each annotated with a drug class.]

Figure 3.1 Certain mutations arise from exposure to therapy. Treatment-mutation associations were extracted from the cohort and examined for biological relevance. Only BRCA and LUNG cohorts contained any significant associations. All of our controls (ESR1, EGFR (Sensitive) & EGFR (Resistance)) were extracted. Square points represent genes with clustered mutations, whereas circle points are genes with truncation causing mutations.

Only the BRCA and LUNG cohorts were observed to have any associations. CTNNB1 mutation with EGFR inhibitor treatment was the only association besides the controls to have literature describing its relevance to cancer. Further examination of the ANKRD12, OR5H2 and ARID1A genes and alterations showed that these genes are frequently mutated in our cohort mainly due to alignment artifacts. Examination of transcripts also did not show mutational changes in the mRNA for these three genes. Expression analysis identified no change between patients with the mutation and those without, ruling out the possibility of nonsense-mediated RNA decay. Therefore, these treatment-mutation associations were not considered to be real.

A secondary approach using binomial distribution calculations was also used in an attempt to identify potential mutations arising from therapy. However, only our controls, ESR1 and EGFR resistance mutations, remained after multiple testing correction. A larger sample size is required to pursue an analysis through this method.

3.3.2 Copy number alterations may be influenced by treatment

Examination of copy number variations identified 21 significant treatment-CNV associations, including our control of amplifications of 8p11.23-p11.22, which contains the FGFR1 gene, when treated with aromatase inhibitors (Fig 3.2). We also identified an amplification in 17q12-q21.2 associated with HER2 inhibitors. This region contains ERBB2, which when amplified is targeted by HER2 inhibitors, explaining this association. Other associations are also seen, but candidate genes within the loci do not appear to have biological relevance to cancer nor to the mechanisms of the therapy. Notably, more amplifications are seen compared to deletions.

[Figure 3.2: scatter plot of treatment-CNV associations, each labelled by locus and drug class. Axes: CNV Rate x Drug Association Score; Expression [-log2(p-value)]. Legend: Amplification, Deletion; BRCA, COLO, OV.]

Figure 3.2 Therapy can induce copy number alterations. Treatment-CNV associations were extracted from the cohort and examined for biological relevance. Expression p-value and association score of each treatment-CNV association is represented by a candidate gene chosen through heuristic methods.

3.3.3 Resistance mutations are associated with longer exposure to therapy

To explore the relationship between time on therapy and resistance mutation generation, we examined the treatment times for patients who had known resistance mutations (ESR1, EGFR T790M) against those of patients who underwent the same treatment but did not acquire a mutation. We saw that patients with resistance mutations were generally on a therapy longer than patients who did not develop resistance. Patients with ESR1 resistance mutations were shown to have a prolonged exposure to aromatase inhibitors compared to unaffected patients (p = 0.0025) (Fig 3.3). Patients with EGFR resistance mutations were shown to have a longer time on therapy compared to patients with only sensitive EGFR mutations and patients with no mutations (note: all patients with resistant EGFR mutations had sensitivity mutations as well) (Fig. 3.4).

[Figure 3.3: Length of Treatment (Days) for the No Mutation and ESR1 Mutation groups (n=65 and n=17); p = 0.0025.]

Figure 3.3 Patients with ESR1 resistance mutations have longer aromatase treatment times. Aromatase inhibitor treatment times were analysed for patients with and without ESR1 resistance mutations. An increase in treatment time was seen in patients with ESR1 resistance mutations.

[Figure 3.4: Length of Treatment (Days) for the No Mutation, EGFR Sensitive Mutation and EGFR Resistance Mutation groups (n=3 each); p = 0.081.]

Figure 3.4 Patients with EGFR resistance mutations have longer EGFR inhibitor treatment times. EGFR inhibitor treatment times were analysed for patients with and without EGFR sensitivity and resistance mutations. An increase in treatment time was seen in patients with EGFR resistance mutations which was greater than that seen in EGFR sensitivity mutations.

3.4 Discussion

3.4.1 Summary of findings

Our aim for this work was to establish a workflow which would allow us to uncover resistance mutations and to subsequently use the method to discover novel resistance mutations. Both our methods were able to identify all known resistance mutations: ESR1 in the presence of aromatase inhibitors and EGFR in the presence of EGFR inhibitors. This suggests that both our methods, one using a chi-square test and the other using binomial distributions, are functioning as intended. However, one method appears to be too lenient whereas the other appears to be too stringent: the chi-square test allowed for mutations which were biologically non-functional, whereas the binomial model only allowed for the controls to be detected. Therefore, the choice of method would appear to depend on the attributes of the available data, such as cohort size, noise in variant calls and alignment, and the overall number of mutations.

One interesting treatment-mutation association was that of CTNNB1 mutations with EGFR inhibitors. CTNNB1 has been shown to drive tumorigenesis when mutated and has been shown to contribute to lung metastasis, particularly assisting in the resistance generated by the T790M EGFR mutation.225–227 Whether this association is due to a “tag along” with EGFR resistance mutations or it is an actual association itself remains to be seen. Overall this poses an interesting question for future studies.

Our methods were also able to identify our known control associations when examining treatment and copy number variations. However, due to the difference in copy number calling between our cohort and TCGA (due to the use of whole exome sequencing in TCGA), filtration of the results was difficult. Even after filtration with expression, a large number of genes remained. Grouping into loci lowered the number of targeted areas, but no good strategy exists for identifying the target gene in a specific locus. Therefore, additional insights and strategies are needed to decipher how treatment affects the copy number landscape.

Treatment times were shown to be associated with the development of resistance mutations. This aligns with the natural selection model of resistance generation, where the tumor under selective pressures will actively select for a resistant clone.77 Only tumours that are initially sensitive to the drug can gain resistance mutations. The longer the therapy, the more likely this process will be able to occur. This also illustrates that time on therapy prior to biopsy may not be a good indicator of drug sensitivity inducing mutations. As such, we are limited to other ways to search for gene-drug targets.

3.4.2 Limitations of the study

Our methods are highly limited by the relatively small size of our cohort. This can be seen through our binomial method of identifying resistance mutations, where only controls were seen, even after grouping therapies into drug classes. As such, we were not able to explore individual treatments, which may have different effects on the genome. Furthermore, our cohort consisted of only advanced patients with poor prognosis. While this may be beneficial for discovering as many resistance mutations as possible, we are also limiting our study to a very specific group of patients, and as such may be limiting what we can find.

We were also unable to apply multiple testing correction to the chi-square testing method due to a lack of power. As such, our results had many false positives, as shown through the non-functional mutations. Our survey was also not a closed experiment, meaning that we cannot affirm that any mutation was directly caused by a therapy. As such, external validation would be needed for any associations we made.

The use of treatment times also poses several issues. First, treatment times are subjective measurements established by the clinician. Although it would be logical that the clinician keeps a patient on a drug longer if it is beneficial and working and stops the drug course once there is progression, that is not always the case. As such, we must be careful in using this measurement for this study and any downstream studies.

3.4.3 Future directions

It would be beneficial to repeat our method in other cohorts with similar data types to see if similar associations are drawn. A bigger dataset would also mean more power to detect less frequent resistance mutations. More insight is also needed to better develop methods of detecting resistance in copy-number variations. Establishing better copy-number calling may be one area to begin with, as the performance of copy number calling tools has been shown to be quite poor.228

It would also be interesting to see if the CTNNB1 association with EGFR inhibitors is a real resistance mechanism or just an artifact due to the prevalence of these mutations. A better understanding would allow for better treatment options as well as insight for drug development.

Chapter 4: Conclusion

The overarching goal of this study was to better understand the non-coding genomic landscape of metastatic cancers and to explore the mutational effects of chemotherapies. We showed that the non-coding region, although highly mutated, may not be as important as coding regions when considering oncogenic mutations. Apart from the well-studied TERT promoter mutations, we were unable to find any other non-coding region with a similar single-nucleotide mutation pattern resulting in a large expression change. This aligns with the current literature, in which there is no consensus on non-coding somatic drivers other than the TERT promoter, with different cohorts proposing different driver candidates. This suggests that many non-coding driver mutations may be cohort specific, and the lack of consensus indicates that they play a lesser role in driving the progression of cancer. Different mutations producing the same effect (for example, disruption of a promoter site) may be another reason for this lack of consensus.

As we found no significant novel non-coding mutations in our cohort compared to primary cancers, our results suggest that no definitive non-coding mutations are responsible for metastasis. This is further supported by the lack of biological effects stemming from the mutations we identified, in both the non-coding RNA and regulatory regions. Through this process, however, we were able to identify a novel effect for a previously described mutation cluster in the AP2A1 promoter, one that may have oncogenic attributes in primary and metastatic cancers. Because the function of AP2A1 and the AP2 complex in cancers is relatively unknown, further work is needed to confirm these findings. This would include confirming the role of AP2A1 in cancer and verifying the biological effect of the AP2A1 promoter mutations.

In the search for mutations associated with chemotherapeutics, we proposed two methods that we showed can be used to identify resistance mutations. Employing these methods, we were able to generate a candidate gene, CTNNB1, for future studies. These methods are preliminary and require further fine-tuning, but they show that identifying resistance mutations through large-scale searches is possible and worth pursuing.

Our studies show the power of clinical data and whole genome sequencing in large cancer cohorts. With whole genome sequencing, we are able to detect far more mutations that may have functional impact, such as those in promoters, enhancers and ncRNA. If paired with whole genome sequencing of the primary tumour, many more conclusions could be drawn, including which mutations were involved in metastasis. Larger cohorts with clinical data would allow for more power in the search for resistance mutations. As such, our study is effectively an argument for the benefits of whole genome sequencing.

Overall, our study has improved the understanding of metastatic cancers. We have built a foundation on which other studies can arise, contributing to the ultimate goal of curing this deadly disease.

Bibliography

1. Fitzmaurice, C. et al. Global, Regional, and National Cancer Incidence, Mortality, Years

of Life Lost, Years Lived With Disability, and Disability-Adjusted Life-Years for 29

Cancer Groups, 1990 to 2016: A Systematic Analysis for the Global Burden of Disease

Study. JAMA Oncol. 4, 1553–1568 (2018).

2. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2019. CA. Cancer J. Clin. 7–34

(2019) doi:10.3322/caac.21551.

3. Torre, L., Siegel, R. L. & Jemal, A. Global Cancer Facts & Figures 3rd Edition. Am.

Cancer Soc. (2015) doi:10.1002/ijc.27711.

4. Fitzmaurice, C. et al. Global, Regional, and National Cancer Incidence, Mortality, Years

of Life Lost, Years Lived With Disability, and Disability-Adjusted Life-Years for 29

Cancer Groups, 1990 to 2017: A Systematic Analysis for the Global Burden of Disease

Study. JAMA Oncol. (2019) doi:10.1001/jamaoncol.2019.2996.

5. Torre, L., Siegel, R. L. & Jemal, A. Global Cancer Facts & Figures 4th Edition. Am.

Cancer Soc. (2018).

6. Fitzmaurice, C. et al. Global, Regional, and National Cancer Incidence, Mortality, Years

of Life Lost, Years Lived With Disability, and Disability-Adjusted Life-years for 32

Cancer Groups, 1990 to 2015. JAMA Oncol. 3, 524–548 (2017).

7. Stewart, B. W. & Wild, C. P. World cancer report 2014. World Heal. Organ. (2014)

doi:9283204298.

8. Mendel, G. & Bateson, W. Experiments in Plant Hybridization (EN transl.). J. R. Hortic.

Soc. 26, 3–47 (1901).

9. Johannsen, W. A. The Genotype conception of Heredity. Am. Nat. 45, 129–159 (1911).

10. Ingram, V. M. A specific chemical difference between the globins of normal human and

sickle-cell anæmia hæmoglobin. Nature 178, 792–4 (1956).

11. Dunn, J. M., Phillips, R. A., Becker, A. J. & Gallie, B. L. Identification of germline and

somatic mutations affecting the retinoblastoma gene. Science 241, 1797–1800

(1988).

12. Berenblum, I. & Shubik, P. The persistence of latent tumour cells induced in the mouse’s

skin by a single application of 9:10-dimethyl-1:2-benzanthracene. Br. J. Cancer 3, 384–

386 (1949).

13. Basu, A. K. DNA damage, mutagenesis and cancer. Int. J. Mol. Sci. 19, 970 (2018).

14. Friedberg, E. C. A brief history of the DNA repair field. Cell Res. 18, 3–7 (2008).

15. Davies, R. J. H. Ultraviolet Radiation Damage in DNA. Biochem. Soc. Trans. 23, 407–418

(1995).

16. Chatterjee, N. & Walker, G. C. Mechanisms of DNA damage, repair, and mutagenesis.

Environ. Mol. Mutagen. 58, 235–263 (2017).

17. Desouky, O., Ding, N. & Zhou, G. Targeted and non-targeted effects of ionizing radiation.

J. Radiat. Res. Appl. Sci. 8, 247–254 (2015).

18. Loechler, E. L. A Violation of the Swain-Scott Principle, and Not SN1 versus

SN2 Reaction Mechanisms, Explains Why Carcinogenic Alkylating Agents Can Form

Different Proportions of Adducts at Oxygen versus Nitrogen in DNA. Chem. Res. Toxicol.

7, 277–280 (1994).

19. Essigmann, J. M. et al. Structural identification of the major DNA adduct formed by

aflatoxin B1 in vitro. Proc. Natl. Acad. Sci. U. S. A. 74, 1870–1874 (1977).

20. Luoto, K. R., Kumareswaran, R. & Bristow, R. G. Tumor hypoxia as a driving force in

genetic instability. Genome Integr. 4, 5 (2013).

21. Gafter-Gvili, A. et al. Oxidative stress-induced DNA damage and repair in human

peripheral blood mononuclear cells: protective role of hemoglobin. PLoS One 8, e68341

(2013).

22. Kantidze, O. L., Velichko, A. K., Luzhin, A. V & Razin, S. V. Heat Stress-Induced DNA

Damage. Acta Naturae 8, 75–78 (2016).

23. Meeker, J. D., Calafat, A. M. & Hauser, R. Urinary bisphenol A concentrations in relation

to serum thyroid and reproductive hormone levels in men from an infertility clinic.

Environ. Sci. Technol. 44, 1458–1463 (2010).

24. Mamur, S., Yüzbaşıoğlu, D., Ünal, F. & Yılmaz, S. Does potassium sorbate induce

genotoxic or mutagenic effects in lymphocytes? Toxicol. Vitr. 24, 790–794 (2010).

25. Zengin, N., Yüzbaşıoğlu, D., Ünal, F., Yılmaz, S. & Aksoy, H. The evaluation of the

genotoxicity of two food preservatives: Sodium benzoate and potassium benzoate. Food

Chem. Toxicol. 49, 763–769 (2011).

26. Yilmaz, S., Ünal, F., Yüzbaşıoğlu, D. & Çelik, M. DNA damage in human lymphocytes

exposed to four food additives in vitro. Toxicol. Ind. Health 30, 926–937 (2012).

27. Pandir, D. DNA damage in human germ cell exposed to the some food additives in vitro.

Cytotechnology 68, 725–733 (2016).

28. Loeb, L. A. & Monnat, R. J. DNA polymerases and human disease. Nat. Rev. Genet. 9,

594–604 (2008).

29. Kunkel, T. A. DNA Replication Fidelity. J. Biol. Chem. 279, 16895–16898 (2004).

30. Henle, E. S. & Linn, S. Formation, prevention, and repair of DNA damage by

iron/hydrogen peroxide. J. Biol. Chem. 272, 19095–19098 (1997).

31. Lindahl, T. & Barnes, D. E. Repair of Endogenous DNA Damage. Cold Spring Harb.

Symp. Quant. Biol. 65, 127–134 (2000).

32. Hakem, R. DNA-damage repair; the good, the bad, and the ugly. EMBO J. 27, 589–605

(2008).

33. Cohen, S. A. & Leininger, A. The genetic basis of Lynch syndrome and its implications

for clinical practice and risk management. Appl. Clin. Genet. 7, 147–158 (2014).

34. Godet, I. & Gilkes, D. M. BRCA1 and BRCA2 mutations and treatment strategies for

breast cancer. Integr. cancer Sci. Ther. 4, (2017).

35. McCabe, N. et al. Deficiency in the Repair of DNA Damage by Homologous

Recombination and Sensitivity to Poly(ADP-Ribose) Polymerase Inhibition. Cancer Res.

66, 8109–8115 (2006).

36. Helleday, T. The underlying mechanism for the PARP and BRCA synthetic lethality:

Clearing up the misunderstandings. Mol. Oncol. 5, 387–393 (2011).

37. Pon, J. R. & Marra, M. A. Driver and Passenger Mutations in Cancer. Annu. Rev. Pathol.

Mech. Dis. 10, 25–50 (2015).

38. Lee, E. Y. H. P. & Muller, W. J. Oncogenes and tumor suppressor genes. Cold Spring

Harb. Perspect. Biol. 2, a003236 (2010).

39. Botezatu, A. et al. Mechanisms of Oncogene Activation. in New Aspects in Molecular and

Cellular Mechanisms of Human Carcinogenesis (2016). doi:10.5772/61249.

40. Knudson, A. G. Mutation and Cancer: Statistical Study of Retinoblastoma. Proc. Natl.

Acad. Sci. 68, 820–823 (1971).

41. Evans, H. J. & Prosser, J. Tumor-suppressor genes: cardinal factors in inherited

predisposition to human cancers. Environ. Health Perspect. 98, 25–37 (1992).

42. Michor, F., Iwasa, Y. & Nowak, M. A. Dynamics of cancer progression. Nat. Rev. Cancer

4, 197–205 (2004).

43. Deininger, P. Genetic instability in cancer: caretaker and gatekeeper genes. Ochsner J. 1,

206–209 (1999).

44. Kinzler, K. W. & Vogelstein, B. Gatekeepers and caretakers. Nature 386, 761–763 (1997).

45. Park, C., Qian, W. & Zhang, J. Genomic evidence for elevated mutation rates in highly

expressed genes. EMBO Rep. 13, 1123–1129 (2012).

46. Koren, A. et al. Differential relationship of DNA replication timing to different forms of

human mutation and variation. Am. J. Hum. Genet. 91, 1033–1040 (2012).

47. Gloss, B. S. & Dinger, M. E. Realizing the significance of noncoding functionality in

clinical genomics. Exp. Mol. Med. 50, 97 (2018).

48. Gregory, T. R. Genome Size Evolution in Animals. in The Evolution of the Genome (ed.

Gregory, T. R.) 3–87 (Academic Press, 2005).

doi:10.1016/B978-012301463-4/50003-6.

49. Orgel, L. E. & Crick, F. H. C. Selfish DNA: the ultimate parasite. Nature 284, 604–607

(1980).

50. Salta, E. & De Strooper, B. Noncoding RNAs in neurodegeneration. Nat. Rev. Neurosci.

18, 627–640 (2017).

51. Mattick, J. S. & Makunin, I. V. Non-coding RNA. Hum. Mol. Genet. 15, 17–29 (2006).

52. Palazzo, A. F. & Lee, E. S. Non-coding RNA: what is functional and what is junk?

Front. Genet. 6, 2 (2015).

53. Urlaub, H., Kruft, V., Bischof, O., Müller, E. C. & Wittmann-Liebold, B. Protein-rRNA

binding features and their structural and functional implications in ribosomes as

determined by cross-linking studies. EMBO J. 14, 4578–4588 (1995).

54. Decatur, W. A. & Fournier, M. J. rRNA modifications and ribosome function. Trends

Biochem. Sci. 27, 344–351 (2002).

55. Tsukuda, M., Kitahara, K. & Miyazaki, K. Comparative RNA function analysis reveals

high functional similarity between distantly related bacterial 16 S rRNAs. Sci. Rep. 7,

9993 (2017).

56. Sharp, S. J., Schaack, J., Cooley, L., Burke, D. J. & Söll, D. Structure and Transcription of

Eukaryotic tRNA Gene. Crit. Rev. Biochem. 19, 107–144 (1985).

57. Elkon, R. & Agami, R. Characterization of noncoding regulatory DNA in the human

genome. Nat. Biotechnol. 35, 732–746 (2017).

58. Cerase, A., Pintacuda, G., Tattermusch, A. & Avner, P. Xist localization and function:

new insights from multiple levels. Genome Biol. 16, 166 (2015).

59. Tang, Q. & Hann, S. S. HOTAIR: An Oncogenic Long Non-Coding RNA in Human

Cancer. Cell. Physiol. Biochem. 47, 893–913 (2018).

60. Yan, B. & Wang, Z. Long Noncoding RNA: Its Physiological and Pathological Roles.

DNA Cell Biol. 31, S-34-S-41 (2012).

61. Lorenzen, J. M. & Thum, T. Long noncoding RNAs in kidney and cardiovascular

diseases. Nat. Rev. Nephrol. 12, 360–373 (2016).

62. Jia, M. et al. lincRNA-p21 inhibits invasion and metastasis of hepatocellular carcinoma

through Notch signaling-induced epithelial–mesenchymal transition. Hepatol. Res. 46,

1137–1144 (2016).

63. Peng, W. & Fan, H. Long noncoding RNA CCHE1 indicates a poor prognosis of

hepatocellular carcinoma and promotes carcinogenesis via activation of the ERK/MAPK

pathway. Biomed. Pharmacother. 83, 450–455 (2016).

64. Sui, C. jun et al. Long noncoding RNA GIHCG promotes hepatocellular carcinoma

progression through epigenetically regulating miR-200b/a/429. J. Mol. Med. 94, 1281–

1296 (2016).

65. Huang, F. W. et al. Highly recurrent TERT promoter mutations in human melanoma.

Science 339, 957–959 (2013).

66. Bell, R. J. A. et al. Understanding TERT Promoter Mutations: A Common Path to

Immortality. Mol. Cancer Res. 14, 315–323 (2016).

67. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500,

415–421 (2013).

68. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell

149, 979–993 (2012).

69. Taylor, B. J. et al. DNA deaminases induce break-associated mutation showers with

implication of APOBEC3B and 3A in breast cancer kataegis. eLife 2, e00534

(2013).

70. Nikkilä, J. et al. Elevated APOBEC3B expression drives a kataegic-like mutation

signature and replication stress-related therapeutic vulnerabilities in p53-defective cells.

Br. J. Cancer 117, 113–123 (2017).

71. Seplyarskiy, V. B. et al. APOBEC-induced mutations in human cancers are strongly

enriched on the lagging DNA strand during replication. Genome Res. 26, 174–182

(2016).

72. D’Antonio, M., Tamayo, P., Mesirov, J. P. & Frazer, K. A. Kataegis Expression Signature

in Breast Cancer Is Associated with Late Onset, Better Prognosis, and Higher HER2

Levels. Cell Rep. 16, 672–683 (2016).

73. Heppner, G. H. Tumor Heterogeneity. Cancer Res. 44, 2259–2265 (1984).

74. Durrett, R., Foo, J., Leder, K., Mayberry, J. & Michor, F. Evolutionary dynamics of tumor

progression with random fitness values. Theor. Popul. Biol. 78, 54–66 (2010).

75. Durrett, R. & Moseley, S. Evolution of resistance and progression to disease during clonal

expansion of cancer. Theor. Popul. Biol. 77, 42–48 (2010).

76. Lai, L. A. et al. Increasing genomic instability during premalignant neoplastic progression

revealed through high resolution array-CGH. Genes Chromosom. Cancer 46, 532–542

(2007).

77. Arneth, B. Comparison of Burnet’s clonal selection theory with tumor cell-clone

development. Theranostics 8, 3392–3399 (2018).

78. Heinrich, M. C. et al. Primary and secondary kinase genotypes correlate with the

biological and clinical activity of sunitinib in imatinib-resistant gastrointestinal stromal

tumor. J. Clin. Oncol. 26, 5352–5359 (2008).

79. Gajiwala, K. S. et al. KIT kinase mutants show unique mechanisms of drug resistance to

imatinib and sunitinib in gastrointestinal stromal tumor patients. Proc. Natl. Acad. Sci. U.

S. A. 106, 1542–1547 (2009).

80. Gramza, A. W., Corless, C. L. & Heinrich, M. C. Resistance to Tyrosine Kinase Inhibitors

in Gastrointestinal Stromal Tumors. Clin. Cancer Res. 15, 7510–7518 (2009).

81. Chapman, A. et al. Heterogeneous tumor subpopulations cooperate to drive invasion. Cell

Rep. 8, 688–695 (2014).

82. Leung, M. L. et al. Single-cell DNA sequencing reveals a late-dissemination model in

metastatic colorectal cancer. Genome Res. 27, 1287–1299 (2017).

83. Lee, W.-C., Kopetz, S., Wistuba, I. I. & Zhang, J. Metastasis of cancer: when and how?

Ann. Oncol. 28, 2045–2047 (2017).

84. Yoshida, B. A., Sokoloff, M. M., Welch, D. R. & Rinker-Schaeffer, C. W. Metastasis-

Suppressor Genes: a Review and Perspective on an Emerging Field. JNCI J. Natl. Cancer

Inst. 92, 1717–1730 (2000).

85. Minn, A. J. et al. Genes that mediate breast cancer metastasis to lung. Nature 436, 518–

524 (2005).

86. Bos, P. D. et al. Genes that mediate breast cancer metastasis to the brain. Nature 459,

1005–1009 (2009).

87. Turajlic, S. & Swanton, C. Metastasis as an evolutionary process. Science 352, 169–175 (2016).

88. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids. Nature 171, 737–738

(1953).

89. Sanger, F., Donelson, J. E., Coulson, A. R., Kössel, H. & Fischer, D. Use of DNA

polymerase I primed by a synthetic oligonucleotide to determine a nucleotide sequence in

phage f1 DNA. Proc. Natl. Acad. Sci. 70, 1209–1213 (1973).

90. Padmanabhan, R., Padmanabhan, R. & Wu, R. Nucleotide sequence analysis of DNA: IX.

Use of oligonucleotides of defined sequence as primers in DNA sequence analysis.

Biochem. Biophys. Res. Commun. 48, 1295–1302 (1972).

91. Holley, R. W., Apgar, J., Merrill, S. H. & Zubkoff, P. L. Nucleotide and oligonucleotide

compositions of the alanine-, valine-, and tyrosine-acceptor “soluble” ribonucleic acids of

yeast. J. Am. Chem. Soc. 83, 4861–4862 (1961).

92. Padmanabhan, R., Jay, E. & Wu, R. Chemical synthesis of a primer and its use in the

sequence analysis of the lysozyme gene of bacteriophage T4. Proc. Natl. Acad. Sci. 71,

2510–2514 (1974).

93. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating

inhibitors. Proc. Natl. Acad. Sci. U. S. A. 74, 5463–5467 (1977).

94. Ansorge, W., Sproat, B., Stegemann, J., Schwager, C. & Zenke, M. Automated DNA

sequencing: ultrasensitive detection of fluorescent bands during electrophoresis. Nucleic

Acids Res. 15, 4593–4602 (1987).

95. Smith, L. M., Fung, S., Hunkapiller, M. W., Hunkapiller, T. J. & Hood, L. E. The

synthesis of oligonucleotides containing an aliphatic amino group at the 5’ terminus:

synthesis of fluorescent DNA primers for use in DNA sequence analysis. Nucleic Acids

Res. 13, 2399–2412 (1985).

96. Prober, J. M. et al. A system for rapid DNA sequencing with fluorescent

chain-terminating dideoxynucleotides. Science 238, 336–341 (1987).

97. Luckey, J. A. et al. High speed DNA sequencing by capillary electrophoresis. Nucleic

Acids Res. 18, 4417–4421 (1990).

98. Hunkapiller, T., Kaiser, R. J., Koop, B. F. & Hood, L. Large-scale and automated DNA

sequence determination. Science 254, 59–67 (1991).

99. Nyrén, P. & Lundin, A. Enzymatic method for continuous monitoring of inorganic

pyrophosphate synthesis. Anal. Biochem. 151, 504–509 (1985).

100. Voelkerding, K. V, Dames, S. A. & Durtschi, J. D. Next-generation sequencing: from

basic research to diagnostics. Clin. Chem. 55, 641–658 (2009).

101. Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P. & Barron, A. E.

Landscape of next-generation sequencing technologies. Anal. Chem. 83, 4327–4341

(2011).

102. Pareek, C. S., Smoczynski, R. & Tretyn, A. Sequencing technologies and genome

sequencing. J. Appl. Genet. 52, 413–435 (2011).

103. Schadt, E. E., Turner, S. & Kasarskis, A. A window into third-generation sequencing.

Hum. Mol. Genet. 19, R227–R240 (2010).

104. Van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation

sequencing technology. Trends Genet. 30, 418–426 (2014).

105. Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-

time sequencing. Nat. Methods 7, 461–465 (2010).

106. Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): an

immeasurable source of knowledge. Contemp. Oncol. (Poznan, Poland) 19, A68–A77

(2015).

107. Zhang, J. et al. International Cancer Genome Consortium Data Portal--a one-stop shop for

cancer genomics data. Database (Oxford). 2011, bar026 (2011).

108. Forbes, S. A. et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr.

Protoc. Hum. Genet. Chapter 10, Unit-10.11 (2008).

109. Auvil, J. G. Therapeutically Applicable Research to Generate Effective Treatments

(TARGET). AACR Annu. Meet. (2018).

110. Zhao, E. Y., Jones, M. & Jones, S. J. M. Whole-Genome Sequencing in Cancer. Cold

Spring Harb. Perspect. Med. 9, (2019).

111. Xu, C. A review of somatic single nucleotide variant calling algorithms for next-

generation sequencing data. Comput. Struct. Biotechnol. J. 16, 15–24 (2018).

112. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping

and population genetical parameter estimation from sequencing data. Bioinformatics 27,

2987–2993 (2011).

113. Xu, F. et al. A fast and accurate SNP detection algorithm for next-generation sequencing

data. Nat. Commun. 3, 1258 (2012).

114. Kassahn, K. S. et al. Somatic Point Mutation Calling in Low Cellularity Tumors. PLoS

One 8, e74380 (2013).

115. Radenbaugh, A. J. et al. RADIA: RNA and DNA Integrated Analysis for Somatic

Mutation Detection. PLoS One 9, e111516 (2014).

116. Hansen, N. F., Gartner, J. J., Mei, L., Samuels, Y. & Mullikin, J. C. Shimmer: detection of

genetic alterations in tumors using next-generation sequence data. Bioinformatics 29,

1498–1503 (2013).

117. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing

in cancer research. Nucleic Acids Res. 44, e108 (2016).

118. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery

in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

119. Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic

mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28,

907–913 (2012).

120. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25,

2078–2079 (2009).

121. Christoforides, A. et al. Identification of somatic mutations in cancer through Bayesian-

based analysis of sequenced genome pairs. BMC Genomics 14, 302 (2013).

122. Liu, Y., Loewer, M., Aluru, S. & Schmidt, B. SNVSniffer: an integrated caller for

germline and somatic single-nucleotide and indel mutations. BMC Syst. Biol. 10 Suppl 2,

47 (2016).

123. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole

genome sequencing data. Bioinformatics 28, 311–317 (2012).

124. Gerstung, M., Papaemmanuil, E. & Campbell, P. J. Subclonal variant calling with multiple

samples and prior knowledge. Bioinformatics 30, 1198–1204 (2014).

125. Wilm, A. et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for

uncovering cell-population heterogeneity from high-throughput sequencing datasets.

Nucleic Acids Res. 40, 11189–11201 (2012).

126. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and

heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

127. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced

tumor–normal sample pairs. Bioinformatics 28, 1811–1817 (2012).

128. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing.

arXiv Prepr. arXiv1207.3907 (2012).

129. Usuyama, N. et al. HapMuC: somatic mutation calling using heterozygous germ line

variants near candidate mutations. Bioinformatics 30, 3302–3309 (2014).

130. Sengupta, S. et al. Ultra-fast local-haplotype variant calling using paired-end DNA-

sequencing data reveals somatic mosaicism in tumor and normal blood samples. Nucleic

Acids Res. 44, e25 (2016).

131. Rimmer, A. et al. Integrating mapping-, assembly- and haplotype-based approaches for

calling variants in clinical sequencing applications. Nat. Genet. 46, 912–918 (2014).

132. Cantarel, B. L. et al. BAYSIC: a Bayesian method for combining sets of genome variants

with improved specificity and sensitivity. BMC Bioinformatics 15, 104 (2014).

133. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal

paired sequencing data. Bioinformatics 28, 167–175 (2012).

134. Spinella, J.-F. et al. SNooPer: a machine learning-based method for somatic variant

identification from low-pass next-generation sequencing. BMC Genomics 17, 912 (2016).

135. Fang, L. T. et al. An ensemble approach to accurately detect somatic mutations using

SomaticSeq. Genome Biol. 16, 197 (2015).

136. Xu, J. et al. Linkage and association studies of prostate cancer susceptibility: evidence for

linkage at 8p22-23. Am. J. Hum. Genet. 69, 341–350 (2001).

137. Eeles, R. A. et al. Linkage Analysis of Chromosome 1q Markers in 136 Prostate Cancer

Families. Am. J. Hum. Genet. 62, 653–658 (1998).

138. Easton, D. F., Bishop, D. T., Ford, D. & Crockford, G. P. Genetic linkage analysis in

familial breast and ovarian cancer: results from 214 families. The Breast Cancer Linkage

Consortium. Am. J. Hum. Genet. 52, 678–701 (1993).

139. Araya, C. L. et al. Identification of significantly mutated regions across cancer types

highlights a rich landscape of functional molecular alterations. Nat. Genet. 48, 117 (2016).

140. Gan, K. A., Carrasco Pro, S., Sewell, J. A. & Fuxman Bass, J. I. Identification of Single

Nucleotide Non-coding Driver Mutations in Cancer. Front. Genet. 9, 16 (2018).

141. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis of

noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160 (2014).

142. Noyes, M. B. et al. Analysis of homeodomain specificities allows the family-wide

prediction of preferred recognition sites. Cell 133, 1277–1289 (2008).

143. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).

144. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor

sequence specificity. Cell 158, 1431–1443 (2014).

145. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given

motif. Bioinformatics 27, 1017–1018 (2011).

146. Coetzee, S. G., Coetzee, G. A. & Hazelett, D. J. motifbreakR: an R/Bioconductor package

for predicting variant effects at transcription factor binding sites. Bioinformatics 31, 3847–

3849 (2015).

147. Grubbe, E. H. X-rays in the treatment of cancer and other malignant diseases. Med. Rec.

62, 692–695 (1902).

148. Miles, W. E. A method of performing abdominoperineal excision for carcinoma of the

rectum and of the terminal portion of the pelvic colon. Lancet 2, 1812–1815 (1909).

149. Davies, H. M. Recent advances in the surgery of the lung and pleura. Br. J. Surg. 1, 228–

258 (1913).

150. Naef, A. P. Hugh Morriston Davies: first dissection lobectomy in 1912. Ann. Thorac.

Surg. 56, 988–989 (1993).

151. Goodman, L. S. et al. Nitrogen mustard therapy: Use of methyl-bis (beta-chloroethyl)

amine hydrochloride and tris (beta-chloroethyl) amine hydrochloride for hodgkin’s

disease, lymphosarcoma, leukemia and certain allied and miscellaneous disorders. J. Am.

Med. Assoc. 132, 126–132 (1946).

152. Farber, S., Diamond, L. K., Mercer, R. D., Sylvester, R. F. & Wolff, J. A. Temporary

Remissions in Acute Leukemia in Children Produced by Folic Acid Antagonist,

4-Aminopteroyl-Glutamic Acid (Aminopterin). N. Engl. J. Med. 238, 787–793 (1948).

153. Heidelberger, C. et al. Fluorinated pyrimidines, a new class of tumour-inhibitory

compounds. Nature 179, 663 (1957).

154. Li, M. C., Hertz, R. & Bergenstal, D. M. Therapy of Choriocarcinoma and Related

Trophoblastic Tumors with Folic Acid and Purine Antagonists. N. Engl. J. Med. 259, 66–

74 (1958).

155. Arruebo, M. et al. Assessment of the evolution of cancer treatment therapies. Cancers

(Basel). 3, 3279–3330 (2011).

156. Devita, V. T., Serpick, A. A. & Carbone, P. P. Combination chemotherapy in the

treatment of advanced Hodgkin’s disease. Ann. Intern. Med. 73, 881–895 (1970).

157. Devita, V. T. et al. Advanced diffuse histiocytic lymphoma, a potentially curable disease:

results with combination chemotherapy. Lancet 305, 248–250 (1975).

158. Devita, V. T., Moxley, J. H., Brace, K. & Frei III, E. Intensive combination chemotherapy

and X-irradiation in the treatment of Hodgkin’s disease. in Proc Am Assoc Cancer Res

vol. 6 881–895 (1965).

159. Einhorn, L. H. & Donohue, J. Cis-diamminedichloroplatinum, vinblastine, and bleomycin

combination chemotherapy in disseminated testicular cancer. Ann. Intern. Med. 87, 293–

298 (1977).

160. Phillips, E. H. et al. Laparoscopic colectomy. Ann. Surg. 216, 703–707 (1992).

161. Sherwood, J. T. & Brock, M. V. Lung cancer: new surgical approaches. Respirology 12,

326–332 (2007).

162. Genden, E. M. et al. Evolution of the management of laryngeal cancer. Oral Oncol. 43,

431–439 (2007).

163. Verellen, D. et al. Innovations in image-guided radiotherapy. Nat. Rev. Cancer 7, 949–960 (2007).

164. Hall, E. J. Intensity-modulated radiation therapy, protons, and the risk of second cancers.

Int. J. Radiat. Oncol. Biol. Phys. 65, 1–7 (2006).

165. Krause, D. S. & Van Etten, R. A. Tyrosine Kinases as Targets for Cancer Therapy. N.

Engl. J. Med. 353, 172–187 (2005).

166. Ladanyi, M. & Pao, W. Lung adenocarcinoma: guiding EGFR-targeted therapy and

beyond. Mod. Pathol. 21, S16–S22 (2008).

167. Altundag, K. & Ibrahim, N. K. Aromatase Inhibitors in Breast Cancer: An Overview.

Oncol. 11, 553–562 (2006).

168. Ferrara, N., Hillan, K. J. & Novotny, W. Bevacizumab (Avastin), a humanized anti-VEGF

monoclonal antibody for cancer therapy. Biochem. Biophys. Res. Commun. 333, 328–335

(2005).

169. Zhu, A. X. et al. A phase II and biomarker study of ramucirumab, a human monoclonal

antibody targeting the VEGF receptor-2, as first-line monotherapy in patients with

advanced hepatocellular cancer. Clin. Cancer Res. 19, 6614–6623 (2013).

170. Baselga, J. The EGFR as a target for anticancer therapy-focus on cetuximab. Eur. J.

Cancer 37, 16–22 (2001).

171. Messersmith, W. A. & Hidalgo, M. Panitumumab, a Monoclonal Anti–Epidermal Growth

Factor Receptor Antibody in Colorectal Cancer: Another One or the One? Clin. Cancer

Res. 13, 4664–4666 (2007).

172. Molina, M. A. et al. Trastuzumab (Herceptin), a Humanized Anti-HER2 Receptor

Monoclonal Antibody, Inhibits Basal and Activated HER2 Ectodomain Cleavage in

Breast Cancer Cells. Cancer Res. 61, 4744–4749 (2001).

173. von Minckwitz, G. et al. Adjuvant Pertuzumab and Trastuzumab in Early HER2-Positive

Breast Cancer. N. Engl. J. Med. 377, 122–131 (2017).

174. Denkert, C. et al. HER2 and ESR1 mRNA expression levels and response to neoadjuvant

trastuzumab plus chemotherapy in patients with primary breast cancer. Breast Cancer Res.

15, R11 (2013).

175. Reinert, T., Saad, E. D., Barrios, C. H. & Bines, J. Clinical Implications of ESR1

Mutations in Hormone Receptor-Positive Advanced Breast Cancer. Front. Oncol. 7, 26

(2017).

176. Saran, U., Foti, M. & Dufour, J.-F. Cellular and molecular effects of the mTOR inhibitor

everolimus. Clin. Sci. 129, 895–914 (2015).

177. Hamilton, E. & Infante, J. R. Targeting CDK4/6 in patients with cancer. Cancer Treat.

Rev. 45, 129–138 (2016).

178. Robson, M. et al. Olaparib for metastatic breast cancer in patients with a germline BRCA

mutation. N. Engl. J. Med. 377, 523–533 (2017).

179. Litton, J. K. et al. Talazoparib in patients with advanced breast cancer and a germline

BRCA mutation. N. Engl. J. Med. 379, 753–763 (2018).

180. Kobayashi, S. et al. EGFR mutation and resistance of non–small-cell lung cancer to

gefitinib. N. Engl. J. Med. 352, 786–792 (2005).

181. Sequist, L. V et al. Phase III study of afatinib or cisplatin plus pemetrexed in patients with

metastatic lung adenocarcinoma with EGFR mutations. J. Clin. Oncol. 31, 3327–3334

(2013).

182. Shepherd, F. A. et al. Erlotinib in previously treated non–small-cell lung cancer. N. Engl.

J. Med. 353, 123–132 (2005).

183. Mok, T. S. et al. Osimertinib or platinum–pemetrexed in EGFR T790M–positive lung

cancer. N. Engl. J. Med. 376, 629–640 (2017).

184. Shaw, A. T. et al. Crizotinib versus chemotherapy in advanced ALK-positive lung cancer.

N. Engl. J. Med. 368, 2385–2394 (2013).

185. Shaw, A. T. et al. Ceritinib in ALK-rearranged non–small-cell lung cancer. N. Engl. J.

Med. 370, 1189–1197 (2014).

186. Hauschild, A. et al. Dabrafenib in BRAF-mutated metastatic melanoma: a multicentre,

open-label, phase 3 randomised controlled trial. Lancet 380, 358–365 (2012).

187. Shaw, A. T. et al. Crizotinib in ROS1-rearranged non–small-cell lung cancer. N. Engl. J.

Med. 371, 1963–1971 (2014).

188. Ledermann, J. et al. Olaparib maintenance therapy in platinum-sensitive relapsed ovarian

cancer. N. Engl. J. Med. 366, 1382–1392 (2012).

189. Sonoda, T. et al. EGFR T790M mutation after chemotherapy for small cell lung cancer

transformation of EGFR-positive non-small cell lung cancer. Respir. Med. case reports

24, 19–21 (2018).

190. Mansoori, B., Mohammadi, A., Davudian, S., Shirjang, S. & Baradaran, B. The Different

Mechanisms of Cancer Drug Resistance: A Brief Review. Adv. Pharm. Bull. 7, 339–348

(2017).

191. Bose, P. et al. Integrative genomic analysis of ghost cell odontogenic carcinoma. Oral

Oncol. 51, e71–e75 (2015).

192. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25, 1754–1760 (2009).

193. Cingolani, P. et al. A program for annotating and predicting the effects of single

nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster

strain w1118; iso-2; iso-3. Fly (Austin). 6, 80–92 (2012).

194. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).

195. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21

(2012).

196. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with

or without a reference genome. BMC Bioinformatics 12, 323 (2011).

197. Vivian, J. et al. Toil enables reproducible, open source, big biomedical data analyses. Nat.

Biotechnol. 35, 314–316 (2017).

198. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17,

13 (2016).

199. Lora, D. ClusteredMutations: Location and Visualization of Clustered Somatic Mutations.

R Packag. (2016).

200. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and

Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).

201. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes

in GeneCards. Database (Oxford). 2017, bax028 (2017).

202. Schwessinger, R. et al. Sasquatch: predicting the impact of regulatory SNPs on

transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res.

27, 1730–1742 (2017).

203. Vinagre, J. et al. Frequency of TERT promoter mutations in human cancers. Nat.

Commun. 4, 2185 (2013).

204. Juul, M. et al. Non-coding cancer driver candidates identified with a sample- and position-

specific model of the somatic mutation rate. Elife 6, e21778 (2017).

205. Rheinbay, E. et al. Discovery and characterization of coding and non-coding driver

mutations in more than 2,500 whole cancer genomes. bioRxiv 237313 (2017)

doi:10.1101/237313.

206. Gartner, J. J. et al. Whole-genome sequencing identifies a recurrent functional

synonymous mutation in melanoma. Proc. Natl. Acad. Sci. 110, 13481–13486 (2013).

207. Teng, H. et al. Identification of recurrent and novel mutations by whole-genome

sequencing of colorectal tumors from the Han population in Shanghai, eastern China. Mol.

Med. Rep. 18, 5361–5370 (2018).

208. Yu, F. Y. et al. MiR-4500 is epigenetically downregulated in colorectal cancer and

functions as a novel tumor suppressor by regulating HMGA2. Cancer Biol. Ther. 17,

1149–1157 (2016).

209. Zhang, L. et al. Down-Regulation of miR-4500 Promoted Non-Small Cell Lung Cancer

Growth. Cell. Physiol. Biochem. 34, 1166–1174 (2014).

210. Li, R., Teng, X., Zhu, H., Han, T. & Liu, Q. MiR-4500 Regulates PLXNC1 and Inhibits

Papillary Thyroid Cancer Progression. Horm. Cancer 1–11 (2019).

211. Pignot, G. et al. PLEKHS1: A new molecular marker predicting risk of progression of non-

muscle-invasive bladder cancer. Oncol. Lett. 18, 3471–3480 (2019).

212. Goodman, O. B. & Keen, J. H. The α Chain of the AP-2 Adaptor Is a Clathrin Binding

Subunit. J. Biol. Chem. 270, 23768–23773 (1995).

213. van Delft, S., Schumacher, C., Hage, W., Verkleij, A. J. & van Bergen en Henegouwen, P.

M. Association and colocalization of Eps15 with adaptor protein-2 and clathrin. J. Cell

Biol. 136, 811–821 (1997).

214. Chen, H.-G. & Hsu, S.-C. Abstract 3325: Role of AP2A1 in EGFR nuclear translocation

and transcriptional activation activity. Cancer Res. 74, 3325 (2014).

215. Pereira, N. B. et al. Nuclear localization of epidermal growth factor receptor (EGFR) in

ameloblastomas. Oncotarget 6, 9679–9685 (2015).

216. Lin, S.-Y. et al. Nuclear localization of EGF receptor and its potential new role as a

transcription factor. Nat. Cell Biol. 3, 802–808 (2001).

217. Brand, T. M. et al. Nuclear EGFR as a molecular target in cancer. Radiother. Oncol. 108,

370–377 (2013).

218. Troiani, T. et al. Therapeutic value of EGFR inhibition in CRC and NSCLC: 15 years of

clinical evidence. ESMO Open 1, e000088 (2016).

219. Holleman, M. S., van Tinteren, H., Groen, H. J., Al, M. J. & Uyl-de Groot, C. A. First-line

tyrosine kinase inhibitors in EGFR mutation-positive non-small-cell lung cancer: a

network meta-analysis. Onco. Targets. Ther. 12, 1413–1421 (2019).

220. Yan, J. & Huang, Q. Genomics screens for metastasis genes. Cancer Metastasis Rev. 31,

419–428 (2012).

221. Yang, L. et al. An enhanced genetic model of colorectal cancer progression history.

Genome Biol. 20, 168 (2019).

222. Hornshøj, H. et al. Pan-cancer screen for mutations in non-coding elements with

conservation and cancer specificity reveals correlations with expression and survival. npj

Genomic Med. 3, 1 (2018).

223. Zhou, L. & Zhao, F. Prioritization and functional assessment of noncoding variants

associated with complex diseases. Genome Med. 10, 53 (2018).

224. Jones, S. J. M. et al. Evolution of an adenocarcinoma in response to selection by targeted

kinase inhibitors. Genome Biol. 11, R82 (2010).

225. Gao, C. et al. Exon 3 mutations of CTNNB1 drive tumorigenesis: a review. Oncotarget 9,

5492–5508 (2017).

226. Nakayama, S. et al. β-catenin contributes to lung tumor development induced by EGFR

mutations. Cancer Res. 74, 5891–5902 (2014).

227. Paul, I., Bhattacharya, S., Chatterjee, A. & Ghosh, M. K. Current Understanding on EGFR

and Wnt/β-Catenin Signaling in Glioma and Their Possible Crosstalk. Genes Cancer 4,

427–446 (2013).

228. Zare, F., Dow, M., Monteleone, N., Hosny, A. & Nabavi, S. An evaluation of copy

number variation detection tools for cancer using whole exome sequencing data. BMC

Bioinformatics 18, 286 (2017).
