Stefanie Friedrich Computational Analysis of Tumour Heterogeneity

Computational Analysis of Tumour Heterogeneity Stefanie Friedrich

Stefanie Friedrich

ISBN 978-91-7797-943-2

Department of Biochemistry and Biophysics

Doctoral Thesis in Biochemistry towards Bioinformatics at Stockholm University, Sweden 2020 Computational Analysis of Tumour Heterogeneity Stefanie Friedrich Academic dissertation for the Degree of Doctor of Philosophy in Biochemistry towards Bioinformatics at Stockholm University to be publicly defended on Friday 20 March 2020 at 14.00 in Wangari, Widerströmska huset (KI), Tomtebodavägen 18.

Abstract Every tumour is unique and characterised by its genetic, epigenetic, phenotypic, and morphological signature. The diversity observed between and within tumours, and over time, is termed tumour heterogeneity. An increased heterogeneity within a tumour correlates with progression, higher resistance rates, and poorer outcome. Heterogeneity between tumours explains aspects of a treatment’s ineffectiveness. Depending on a tumour’s unique signature, common processes like unhindered cell proliferation, invasiveness, or treatment resistance characterise tumour progression. Studying tumour heterogeneity aims to understand cancer causes and evolution, and eventually to improve cancer treatment outcomes. This thesis presents application and development of computational methods to study tumour heterogeneity. Papers I and II concern the in-depth investigation of clinical tissue samples taken from patients. The findings range from spatial expansion of gene expression patterns based on high-resolution data to a gene expression signature of non- responding cancer cells revealed by spatio-temporal analysis. These cells underwent a transition from an epithelial to a mesenchymal pre-treatment. Papers III and IV present tools to detect fusion transcripts and copy number variations, respectively. Both tools, applicable to high-resolution data, enable the in-depth study of , which are the driving force behind tumour heterogeneity. The results in this thesis demonstrate how the beneficial combination of high-resolution data and computational methods leads to novel insights of tumour heterogeneity.

Keywords: tumour heterogeneity, human genome and gene expression analyses, pathway annotation, fusion transcript detection, copy number calling, and high-resolution data.

Stockholm 2020 http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-176074

ISBN 978-91-7797-943-2 ISBN 978-91-7797-944-9

Department of Biochemistry and Biophysics

Stockholm University, 106 91 Stockholm

COMPUTATIONAL ANALYSIS OF TUMOUR HETEROGENEITY

Stefanie Friedrich

Computational Analysis of Tumour Heterogeneity

Stefanie Friedrich ©Stefanie Friedrich, Stockholm University 2020

ISBN print 978-91-7797-943-2 ISBN PDF 978-91-7797-944-9

Front page: Painting by Osnat Tazdok (2019). Inside My Mind (#6158)

Printed in Sweden by Universitetsservice US-AB, Stockholm 2020 Abstract

Every tumour is unique and characterised by its genetic, epigenetic, phenotypic, and morphological signature. The diversity observed between and within tumours, and over time, is termed tumour heterogeneity. An increased heterogeneity within a tumour correlates with cancer progression, higher resistance rates, and poorer outcome. Heterogeneity between tumours explains aspects of a treatment’s ineffectiveness. Depending on a tumour’s unique signature, common processes like unhindered cell proliferation, invasiveness, or treatment resistance characterise tumour progression. Studying tumour heterogeneity aims to understand cancer causes and evolution, and eventually to improve cancer treatment outcomes. This thesis presents application and development of computational methods to study tumour heterogeneity. Papers I and II concern the in-depth investigation of clinical tissue samples taken from prostate cancer patients. The findings range from spatial expansion of gene expression patterns based on high-resolution data to a gene expression signature of non-responding cancer cells revealed by spatio-temporal analysis. These cells underwent a transition from an epithelial to a mesenchymal phenotype pre-treatment. Papers III and IV present tools to detect fusion transcripts and copy number variations, respectively. Both tools, applicable to high-resolution data, enable the in-depth study of mutations, which are the driving force behind tumour heterogeneity. The results in this thesis demonstrate how the beneficial combination of high-resolution data and computational methods leads to novel insights of tumour heterogeneity.

List of papers

The following papers and manuscripts, referred to in the text by their Roman numerals, are included in this thesis:

PAPER I: Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity Emelie Berglund*, Jonas Maaskola*, Niklas Schultz*, Stefanie Friedrich*, Maja Marklund, Joseph Bergenstråhle, Firas Tarish, Anna Tanoglidi, Sanja Vickovic, Ludvig Larsson, Fredrik Salmén, Christoph Ogris, Karolina Wallenborg, Jens Lagergren, Patrik Ståhl, Erik Sonnhammer, Thomas Helleday, and Joakim Lundeberg (2018) Nature communications, 9(1), 2419.

PAPER II: Spatio-temporal analysis of prostate tumours suggests the pre- existence of ADT-resistant expression clones Maja Marklund*, Niklas Schultz*, Stefanie Friedrich*, Emelie Berglund, Firas Tarish, Jonas Maaskola, Joseph Bergenstråhle, Yao Liu, Anna Tanoglidi, Patrik Ståhl, Thomas Helleday, Erik Sonnhammer, and Joakim Lundeberg (2020) manuscript.

PAPER III: Fusion transcript detection using spatial transcriptomics Stefanie Friedrich and Erik Sonnhammer (2020) BMC Medical Genomics ‒ under revision.

PAPER IV: MetaCNV ‒ a consensus approach to infer accurate copy numbers from low coverage data Stefanie Friedrich, Remus Barbulescu, Thomas Helleday, and Erik Sonnhammer (2020) BMC Medical Genomics ‒ under revision.

* Contributed equally Reprints were made with permission from the publisher

Contents

Introduction...... 1 Background...... 1 Problem statement...... 3 Limitations...... 5 Synopsis and structure of the thesis...... 6

Literature review...... 9 Cancer research...... 9 Cancer...... 11 Pathways in cancer...... 12 A primary tumour site...... 13 Tumour micro-environment...... 15 Tumour development...... 15 Tumour initiation...... 16 Tumour promotion and progression...... 17 Tumour heterogeneity...... 18 Models of tumour heterogeneity...... 19 Types of tumour heterogeneity...... 20 Spatial heterogeneity...... 21 Spatio-temporal heterogeneity...... 27 Investigation of tumour heterogeneity...... 29 Approaches...... 29 Technologies...... 29 Standard sequence-based methods: bulk sequencing...... 30 High-resolution sequence-based methods...... 30 Single-cell sequencing...... 31 Spatially resolved omics: in situ capturing methods...... 31 Computational methods...... 32 Alignment and data preprocessing...... 33 calling...... 34 Copy number calling...... 35 Fusion transcript detection...... 37 Differential gene expression analysis...... 37 Pathway annotation...... 39 Over-representation analysis...... 39 Systems approaches of pathway annotation...... 40 Biological pathway databases...... 42 Machine learning & data mining...... 42 Predictive methods...... 42 Clustering methods...... 43 Dimensionality reduction methods...... 44 Anomaly detection...... 45 Benchmarking...... 46 Predictions...... 46

Present investigations...... 49 Paper I: Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity...... 49 Paper II: Spatio-temporal analysis of prostate tumours suggests the pre-existence of ADT-resistant expression clones...... 51 Paper III: Fusion transcript detection using spatial transcriptomics...... 53 Paper IV: MetaCNV ‒ a consensus approach to infer accurate copy numbers from low coverage data...... 55

General discussion...... 57 Conclusions and recommendations...... 59

Summary in Swedish...... 61

Acknowledgements...... 63

References...... 66 Abbreviations

ADT Androgen Deprivation Therapy Akt AKT Serine/Threonine APC Adenomatous Polyposis Coli AXL AXL Receptor Tyrosine Kinase BCL2 BCL2 Apoptosis Regulator bp Base pair BWA Burrows-Wheeler alignment tool BWT Burrows-Wheeler transform CASP3 Caspase 3 CDK Cyclin Dependent Kinase cDNA Complementary DNA cis-SAGe cis-Splicing of Adjacent Genes CNV Copy Number Variation CRISP/CAS9 Cysteine Rich Secretory Protein/CRISPR-associated protein 9 CTGF Connective Tissue Growth Factor CyclinE2 Cyclin-dependent kinase 2 DNA Deoxyribonucleic Acid ECM Extracellular Matrix EGF Epidermal Growth Factor EGFR Epidermal Growth Factor Receptor ELK4 ETS Transcription Factor ELK4 EMT Epithelial-Mesenchymal Transition EPO Hormone erythropoietin ERG ETS Transcription Factor ERG ETS E26 transformation-specific FDR False Discovery Rate FISH Fluorescence In Situ Hybridization FOX Family of winged helix/forkhead FN False Negative FP False Positive FPKM Fragments Per Kilobase per Million mapped reads FWER Familywise Error Rate Gbp Giga base pairs

i GRCh38 Genome Reference Consortium human build 38 (length 3,099 Mbp) Gs Gleason score H&E Hematoxylin and Eosin HIF Hypoxia-Inducible Factor HPV Human papillomavirus ICAM Intercellular Adhesion Molecules INDEL INsertion or DELetion kbp Kilo base pairs KEGG Kyoto Encyclopedia of Genes and Genomes KRAS KRAS Proto-Oncogene, GTPase LCM Laser-Capture Microdissection MAE Mean Absolute Error MAPK Mitogen-Activated Protein Kinase Mbp Mega base pairs MCC Matthew’s Correlation Coefficient MEK Alias MAPK27, mitogen-activated protein kinase 7 miRNA microRNA MLRE Mean Log Ratio Error MMP Matrix Metallopeptidases mRNA Messenger RNA MSE Mean Squared Error mTOR Mechanistic Target of Rapamycin MWT Moderated Welch test MYC c-MYC (homologue of the Myelocytomatosis gene) OXPHOS Oxidative PCA Principal Component Analysis PCR Polymerase chain reaction PDCD Programmed Cell Death PI3K Phosphoinositide 3- PIN Prostatic Intraepithelial Neoplasia PPAR Peroxisome Proliferator-Activated Receptors pre-mRNA Precursor mRNA RAF Raf-1 Proto-Oncogene, Serine/Threonine Kinase Rb Retinoblastoma gene RNA Ribonucleic Acid RNAseq Sequenced RNA RPKM Reads Per Kilobase per Million mapped reads RTK Receptor Tyrosine Kinases scDNA Single-cell sequenced DNA SIRT1 Sirtuin 1 SLC30A8 Solute Carrier Family 30, member 8 SLC45A3 Solute Carrier Family 45, member 3 SMAD2-4 SMAD Family Members 2-4 ii SNV Single Nucleotide Variant STD Spatial Transcriptome Decomposition SV Structural Variant t-SNE t-distributed Stochastic Neighbor Embedding TCA Tricarboxylic Acid TGFbeta Transforming Growth Factor beta TIMP3 Tissue Inhibitor of Metalloproteinase 3 TMPRSS2 Transmembrane protease, serine 2 TN True Negative TNR True Negative rate TP True Positive TPM Transcripts Per Million TPR True Positive rate TP53 Tumour protein p53 (53 kilodalton molecular mass) TSP1 Thrombospondin uPAR Plasminogen activator, urokinase receptor VEGF Vascular Endothelial Growth Factor A WGS Whole Genome Sequencing WES Whole Exome Sequencing WNT Wingless Int-1 ZEB1/2 Zinc-finger E-box binding homeobox 1/2

iii

Introduction

“Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.” – Marie Curie, physicist and chemist.

This chapter introduces the research field this thesis is embedded in. First, the biological principles that govern organisms, especially humans, are briefly explained, and abnormalities leading to diseases, especially cancer, are briefly described. Second, the research problems this thesis addresses are introduced, and how this thesis copes with them is highlighted.

Background The code of life is written in two different languages, one is used for the storage of the genetic code and the other one for describing how gene products form and function. The genetic code is written in four nucleotides (As, Ts, Cs, and Gs), stringed together to form the DNA. The other language uses 20 amino acids to form and describe a functional protein. There are strict translation rules between nucleotides and amino acids. Structure and order is a success factor in evolution. The eukaryotic DNA is hierarchically organized into subunits, e.g. , genes, and exons within genes. In order to live, organisms synthesize the code of a gene, i.e. the sequential order of nucleotides, into a gene product or protein, i.e. a chain of amino acids. The synthesis of a functional gene product is performed in two steps: transcription and translation, summarized as gene expression (Figure 1). The rule describing the residue-by-residue unidirectional two-steps flow of information is called the ‘central dogma of molecular biology’ (Crick 1970).

1 Figure 1. Synthesis of a functional gene product (Adapted from Alzforum 2017). A The two-step process of gene expression. In , e.g. humans, the transcription (including post-transcriptional modifications, e.g. splicing) is performed in the nucleus where the DNA is stored. The translation into a protein happens in the cytoplasm. B The central dogma of molecular biology, as stated by Francis Crick (1970), is valid for all living organisms. In general (continuous line), information flows from DNA to RNA to protein unidirectionally. Special processes are exceptions or special cases (dotted line) and include, for example, reverse transcription (RNA → DNA) among retroviruses.

The human genome, stored in each cell, is a chain of ~3 billion nucleotide pairs; unwound the genome is ~200 cm long (Piovesan et al 2019). Although all cells in the human body harbour the same genetic code, there is much diversity regarding the extent to which genes are expressed. The expression level depends on the circumstances, e.g. concerning the cell type, tissue type, organ, or the temporary protein function to attain. Adaptability and variability are success factors in evolution. A change of environmental conditions might require adaptation of the synthesis of a gene product or the genetic code itself. For example, a duplicated gene within an organism can evolve into two specialized genes transcribed and translated into proteins with slightly different functions (Dewey 2019; Ohno 1970). There are many mutation types, which are mainly classified by the permanence or the process step they are occurring in: genetic, epigenetic, or transcription-induced. Many mutations are favourable. For example, a mutation in the lactase gene LCT enables the digestion of lactose (Schaffner and Sabeti 2008); a mutation in solute carrier family 30, member 8 (SLC30A8) reduces the risk of suffering from diabetes (Williams 2016). Reliability is a success factor in evolution. It is assumed that up to 1,000,000 mutations can occur in a single human cell per day; they are detected and repaired by subtly balanced and coordinated mechanisms

2 (Berk et al 1999; Li et al 2016; Lindahl 1993). Cells divide in order to replace damaged, old, or dead cells, or simply to maintain organisms and organ growth. Some cells are renewed weekly, some endure a lifetime. For example, cells of the small intestine epithelium in the gastrointestinal tract renew every 4-5 days (Clevers 2013). In the multi-step process of cell division, the DNA in the nucleus duplicates and the cytoplasm with organelles separates into two daughter cells. At each step, checkpoint genes ensure the integrity of the genetic code (Cooper and Hausman 2007). Mutations are either repaired or apoptosis is enforced. Mutations in the checkpoint genes (e.g. retinoblastoma protein, Rb, and tumour protein p53 (TP53)) can have severe consequences and can lead to diseases like cancer. Cancer is defined as a genetic disease, driven by the accumulation of mutations. These mutations often occur in genes controlling cell proliferation, e.g. cell cycle checkpoints, cell death, DNA damage response, or DNA repair. An anti-cancer treatment aims to remove these abnormal cancer cells by surgery, or to kill them by or radiation therapy, or to help the patient’s immune system to control or remove cancer cells by targeted immunotherapy. For example, the 5-year survival rate for melanoma patients using immunotherapy increased from 5 to 52% in recent years due to research (Larkin et al 2019). Nevertheless, a heterogeneous response to cancer treatment is observed. Many aspects influence the effectiveness of a therapy. Tumour heterogeneity, which describes the diversity within and among tumours, is one of them.

Problem statement Despite intense research, our knowledge of tumour heterogeneity and the underlying causes remains in an initial stage (McGranahan and Swanton 2017). This is due to different constraints. A major problem is the complex interplay of factors: different facets (e.g. genetic and epigenetic variation, gene expression levels) influence each other. For example, elevated gene expression levels due to high proliferation rates lead to transcription stress, followed by an increased number of misfolded and dysfunctional DNA repair proteins and thus to higher mutational burden, which in turn drives tumorigenesis. Further, the micro-environment but also the primary and secondary tumour site, the cancer type, and the patient as an individual shape tumour heterogeneity. Attempts have been made to solve this complexity with an analytic approach with a focus on single elements and their effects; influencing external factors are then neglected or eliminated. The disadvantage, however, is that a general model of tumour heterogeneity cannot be built because knowledge about interconnections, dependencies, context-specific phenomena, and temporal influences is lacking. A systems

3 approach, that considers interactions between elements within a context, is therefore needed to complement the analytic approach. The application of methods based on a systems approach allows, for example, the study of spatio-temporal gene expression levels under different conditions, the contribution to the activation of biological pathways under different conditions, and the influence of the heterogeneous stroma compartment on cancer evolution. This problem, including a systems approach, is addressed in papers I‒III. The complexity of the spatial tumour heterogeneity is further increased by the time dimension, leading to a spatio-temporal tumour heterogeneity. For example, it is still discussed, whether the process of cell migration and that of forming therapy resistance are temporarily interlinked processes (van Etten and Dehm 2016; Watson et al 2015). There is also a lack of knowledge on how gene expression within and surrounding a tumour clone changes over time and whether resistant tumour clones can be identified by their gene expression signature before the onset of treatment. This problem is addressed in paper II. Another key problem concerns the data used to investigate the facets of heterogeneity, and their dependencies. Data derived from bulk sequenced DNA and RNA comprising millions of cells (bulk samples) are commonly analysed using computational methods in an attempt to compensate for the mix of captured signals. The resulting information is used, for example, to derive differentially expressed genes for healthy and diseased conditions (bulk RNA sequenced), to deduce fusion transcripts occurrences and to link them to certain conditions (bulk DNA and bulk RNA sequenced), and to infer tumour evolution based on mutation frequencies (bulk DNA sequenced). However, the heterogeneous character of tumours has impeded the derivation of unambiguous results. Computational methods alone cannot resolve or overcome this problem. Novel technologies to capture signals on almost a single-cell level, and their application, are needed in order to obtain clear results. However, the application of new technologies also requires the improvement or redevelopment of computational methods and analysis strategies. This problem is addressed in papers I‒IV. Spatial transcriptomics (Ståhl et al 2016) is a technology that captures gene expression signals tissue-wide on almost a single-cell level while keeping the spatial information where the cells were located within the tissue sample. If this technology is applied to clinical tissue samples from cancer patients, insights into the transcriptomic signature of cancer foci within tissue samples is possible. However, the development and application of novel methods are needed to handle this extensive but also unscaled data, in respect to normalization, revealing hidden patterns, structures, and dependencies. This problem is addressed in papers I‒III. Recent advances in sequencing technologies have made it possible to sequence the whole genomes of a few cells (laser-capture microdissection,

4 LCM) and solely single cells (scDNA) located in clinical tissue samples. This also makes it possible to investigate co-occurring and mutually exclusive mutation in cancer cells (Alizadeh et al 2015), and to infer tumour evolution. However, calling mutations from sequenced genomes with low material input and resulting low coverage remains a challenge. This problem is addressed in paper IV. The development of the novel method STfusion, which is presented in paper III, was not motivated by problem solving. However, applying the method indirectly resolves certain difficulties: fusion transcript detection on a almost single-cell level, complementing the current approaches of fusion transcript detection, and localizing fusion transcripts within tissue samples and spatially relating them to diseased areas.

Limitations This thesis is about the development and application of computational methods to systematically analyse tumour heterogeneity. The investigations were limited in some aspects. One major limitation was time. The underlying work for the four papers, that form this thesis, as well as writing and publishing the papers, was performed within a time frame of 4 years. The study was limited regarding sample sizes. Papers I and II refer to four patients in total, and for each patient several tissue samples with tumour areas of different Gleason scores were available. However, the low number of patients reduces the generalizability of the results. Although papers III and IV are about methods, their applicability is verified on one line and five cancer cell lines, respectively. The study was limited by financial and economic aspects. The application of sequencing technologies is becoming more and more affordable, but is still a major expense. Further, novel technologies, such as spatial transcriptomics, are often cost-intensive because the processes are not fully mature. Parts of the study were limited by experimental validations. Papers III and IV present methods; the correctness of the methods was computationally verified using publicly available data. However, the tools’ results were not proven experimentally. These limitations also provide opportunities for further research.

5 Synopsis and structure of the thesis The aim of this thesis is the computational analysis of spatial and temporal tumour heterogeneity within and among tumours to improve our understanding of cancer occurrence, promotion and progression, and treatment resistance, and to contribute to the development of an effective cancer cure. To achieve this aim, an interdisciplinary approximation is required; it encompasses the application of models, algorithms, and techniques from bioinformatics, machine learning, network science, statistics, and mathematics, but also knowledge about the molecular biology of and cancer development, to reveal underlying patterns and processes. The thesis starts with an introductory chapter, including the statement of the research problems, a description of the structure of the thesis, and some limitations. This is followed by a comprehensive literature review. The first part is about cancer and tumour heterogeneity with a focus on models, types, and factors influencing tumour heterogeneity, emphasizing the scientific significance of studying tumour heterogeneity. The second part examines thoroughly the approaches, technologies, and computational methods commonly applied or recently developed to investigate spatial and temporal tumour heterogeneity. The subsequent chapter, ‘Present investigations’, contains the four publications this thesis comprises. Of these four papers, paper I has been published, paper II is a manuscript, and papers III and IV have been submitted for peer review and are under revision. The papers are introduced in more detail in the respective chapter. Briefly, for papers I and II, bioinformatics methods and techniques were developed and applied to experimentally produced data in order to investigate tumour heterogeneity in clinical tissue samples taken from cancer patients. With papers III and IV, tools to study mutations as a main driving factor of tumour heterogeneity on almost a single-cell level or for a few cells were provided (Figure 2).

6 Figure 2. Overview of present investigations comprising four papers.

The thesis finishes with an integrative discussion of the research findings obtained from the four papers. The contributions to the research field of tumour heterogeneity are highlighted and suggestions for future work provided.

7 8 Literature review

“Like dwarfs perched on the shoulders of giants we are lifted up and borne aloft to see more and farther than our predecessors.” – Bernard of Chartres (attributed), philosopher

This chapter reviews current knowledge about cancer, tumour heterogeneity, the role and impact of technologies and computational methods in its investigation, and how this led into this PhD project.

Cancer research The first notion of the disease in humans appeared in an Egypt papyrus around 2700 BC. Eight cases were described and the tumours were removed by cauterization. It was further stated that cancer is untreatable and fatal (Hajdu 2011). In the Hippocratic Corpus, a text collection of unknown authors around the time of Hippocrates (460‒ c. 370 BC), the terms onkos, a disease producing masses, and karkinos, ulcerating non-healing lumps, were mentioned. Galen of Pergamon (129‒216) bequeathed a 20,000-page-long legacy of research including a classification of tumours into general lumps, cancerous and non-cancerous tumours, and their symptoms (American Cancer Society 2018; Faguet 2015; Hajdu 2011). Gabriele Falloppio (1523‒1562) described the clinical differences between malignant and benign tumours; this differentiation is mainly followed today. The hypothesis that tumours arise locally and spread in the body to establish new tumour loci is credited to Le Dran (1685‒1770). Jean Astruc (1684‒1766) found that cancer areas are more acidic than normal tissues. The occurrence of multiple cancers in individuals and also among families was explained as a constitutional or hereditary cancer predisposition by Jacques Delpech (1772‒1835) and Gaspard Bayle (1774‒1816). Robert Remak (1815‒1865) concluded that cancer cells are normal cells that have undergone a transformation. This theory was extended by Louis Bard (1829‒1894), who stated that cancer does not develop into a mature differentiated state as normal cells do. A description of cancer stages and a prognostic assessment were provided by Pierre Broca (1824‒1880). The role of somatic mutations in cancer development was proposed first by Theodor Boveri (1862‒1915)

9 (Hajdu 2011). Paul Ehrlich, who was awarded the Nobel Prize in 1908, found that the of cancer cells increases with every new tumour generation. (Faguet 2015; Hajdu 2011) Warburg, who was nominated 47 times for the Nobel Prize and awarded it in 1931, discovered the switch in cancer cells from aerobic to anaerobic metabolism, known as the ‘Warburg effect’ (Cancer Research 2019; Warburg 1956; Weinhouse et al 1956). Anaerobic metabolism is solely based on glycolysis. Further, the level of oxygen in cancer cells decreases, which leads to elevated levels of hypoxia inducible factor (HIF) and the hormone erythropoietin (EPO). This relation was discovered by Gregg Semenza, Peter Ratcliffe and William Kaelin (Nobel Prize in 2019; NobelPrize.org 2019a). The link between bacteria or virus infections and cancer was suspected since the 17th century. Bernadino Ramazzini (1633‒1714) investigated empirically the correlation between a person’s health and occupation. He observed an absence of cervical cancer and an increase in among nuns compared to married women (Faguet 2015). Three centuries later, Harald zur Hausen, who was awarded the Nobel Prize in 2008, discovered that the human papilloma virus (HPV) causes cervical cancer (NobelPrize.org 2019b). HPV is sexually transmitted and causes around 90% of cervical cancer cases worldwide (Lowy and Schiller 2012). The increased number of breast cancers among childless women is assumed to be caused by the missing protective change in the hormone profile during pregnancy (Britt et al 2007). Recent decades have been characterized by the beneficial combination of natural sciences, technologies, and computer sciences, enabling advanced interdisciplinary cancer research. For example, next-generation sequencing technologies, single-cell sequencing approaches, and spatial transcriptomics, combined with computational methods to pre-process and analyse data, led to an in-depth view of the diversity and complex interrelations within and among tumours. The application of bioinformatics methods facilitated cancer research intensely. For example, read alignment methods and mutation calling enabled the detection of genetic abnormalities and instabilities, genomic relationships between primary and secondary tumour sites were explained with phylogenetic methods, artificial intelligence was used to predict cancer foci in medical images but also to predict drug response depending on genetic disposition; methods for recognizing patterns in large data sets were applied to identify gene expression signatures characterizing different cancer types; and systems biology methods applied to gene expression changes under different conditions revealed activated pathways, i.e. functional relations among genes. However, this is just the beginning. Genetic, epigenetic, transcriptomic, protein and metabolic data, including their regulatory connections, of single

10 cells within and surrounding cancers tracked over time will be available in the coming years. This means that both, an in-depth view of detailed interrelated processes involved in tumour evolution and a comprehensive analysis of general patterns behind tumour initiation, promotion, and progression are possible. Targeted manipulations of a genome using the CRISP/CAS9 method allow the identification of single driver mutations, their interplay, and their separation from passenger mutations. Computational methods will be able to extract hidden patterns, display complex disturbances and their effects over time, and predict tumour evolution and therapy outcomes. The resulting discoveries might finally, after 5,000 years of documented cancer research, provide us with an answer as to what the disease causes, how to cure it, or how to control it.

Cancer Currently, approximately 100 different cancer types are classified, usually according to the affected tissue and/or cell type and organ (National Cancer Institute 2019). The main categorization is into carcinoma, sarcoma, melanoma, lymphoma, and leukaemia. The latter does not form solid tumours; a tumour is an abnormal growth of tissue and can be benign, pre- malign, or malignant. Hanahan and Weinberg (2011) reduced the multifaceted processes observed in cancer to a set of eight traits, termed the ‘hallmarks of cancer’ (Figure 3).

11 Figure 3. The six hallmarks of cancer (Reprinted from Hanahan and Weinberg 2000). The hallmarks have been extended with ‘reprogramming of energy metabolism’ and ‘evading immune destruction’ by Hanahan and Weinberg (2011).

Pathways in cancer Modularity is an organizational strategy of biological systems aimed at maintaining flexibility by simultaneously saving energy (Clune et al 2013). Biological pathways reflect the modularity of functionally related genes, thus pathways contain higher-order functional information (Kanehisa 2019). For example, the metabolic pathway ‘glycolysis’ describes biochemical processes involved in degrading a glucose molecule into two pyruvate molecules to release energy under hypoxia (Zheng 2012); the pathway encompasses the functionally related genes. Cancer cells rely more on glycolysis, normal cells rather on oxidative phosphorylation (OXPHOS), which is the oxygen-dependent energy-releasing pathway. This switch of energy release is termed the ‘Warburg effect’ (Lopez-Lazaro 2008). Upregulation of gene expression activity of genes encompassed by the glycolysis pathway indicates an activation of the glycolysis process and the energy switch. Thus, pathways are useful for understanding the higher-level complexity of cellular processes and organism behaviour by reducing them to a pathway of interactions (Kanehisa et al 2004). In the same way that interactions of molecules form a pathway, pathways are interlinked and form a network. In cancer, many metabolic, gene regulatory, and signalling pathways (Figure 4) are disturbed by up- or downregulated gene expression and/or by mutations in the encompassing genes.

12 Figure 4. Pathways involved in epithelial-mesenchymal transition (Adapted from GeneTex 2019). The transition process promotes invasion and in tumours (GeneTex 2019).

Pathways that recur to be activated in cancer are related to (categories according to the Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa 2019): • Metabolism, e.g. citrate cycle (TCA cycle), steroid biosynthesis. • Genetic information processing, e.g. protein processing in endoplasmic reticulum, DNA replication, base excision repair, homologous recombination. • Environmental information processing, e.g. HIF-1 signalling, MAPK signalling, TGF-beta signalling, PI3K-Akt signalling, ECM-receptor interaction. • Cellular processes, e.g. phagosome, lysosome, cell cycle, apoptosis, p53 signalling, focal adhesion, regulation of actin cytoskeleton. • Immune system, e.g. toll-like receptor signalling, B cell receptor signalling • Endocrine system, e.g. PPAR signalling, estrogen signalling

A primary tumour site A cancer cell does not exist alone. It is surrounded by other cancer cells with which it interacts (Hanahan and Weinberg 2011). They mould a heterogeneous collection of multiple cells of different genotypic and phenotypic profiles. For example, a cancer clone of the size of 0.5 cm contains ~500,000 cells, one of the size of 1 cm already ~4 Mio cancer cells (Naugler 2010).

13 The invasiveness and the ability to metastasize (Figures 3 and 5) give cancer cells their dangerous potential (Erpenbeck and Schön 2010; Liotta and Kohn 2003). The terms describe the mechanism of migration into neighbouring tissue areas and, if entering the blood or lymphatic system, the travel to other parts of the body.

Figure 5. Schema of a primary carcinoma (Adapted from Prajapati and Lambert 2016). The cancer cells are heterogeneous concerning, for example, their genetic, phenotypic, and epigenetic profile. Cancer cells are surrounded by the micro- environment comprising the extracellular matrix (ECM), vessels, locally resident cells (e.g. smooth muscle cells, fibroblasts), and wandering cells (e.g. immune cells, macrophages).

In Figure 5, the carcinoma cell at the bottom has started leaving the cell population and migrating into neighbouring tissue compartments. Since cancer cells tend to evade cell death, there is a high probability that they are able to establish a new cancer cell population at a secondary site; they metastasize (Nguyen et al 2009). Cancer cells migrate as single cells or a collection of cells (Nguyen et al 2009).

14 Tumour micro-environment

Hanahan and Weinberg (2011) stated, in order to understand the biology of tumours, that the focus needs to be extended from solely the cancer cells to the micro-environment. There is evidence that the tumour micro- environment is decisive regarding the success or failure of a treatment. However, there is still an immense knowledge gap (Aiello and Kang 2019; Balkwill et al 2012; Seoane and De Mattos-Arruda 2014). For example, the cancer cells of a carcinoma are surrounded by the stroma compartment, which harbours blood and lymphatic vessels, the extracellular matrix (ECM), and different types of locally resident and wandering cells. Examples of locally resident cells are smooth muscle cells and fibroblasts. Wandering cells include macrophages and immune cells; the latter can invade the tumour area. These cancer cells are in permanent communication with the surrounding cells; everything a cancer cell needs must be transported through the surrounding stroma compartment.

Tumour development Tumour development and intra-tumour heterogeneity are closely intertwined. Intra-tumour heterogeneity increases as a tumour evolves. Tracking spatial intra-tumour heterogeneity over time enables the study of tumour lineage, the reconstruction of tumour initiation, and the identification of the driving forces. The development of cancer takes place in a multi-step process, divided into cancer initiation, cancer promotion, and cancer progression (Figure 6).

Figure 6. Stages of tumour development (Adapted from Barcellos-Hoff et al 2013).

15 Tumour initiation The initiating disturbance leading to an abnormal behaviour and cancer is yet not known. Different theories try to explain the circumstances under which cell regulation is perturbed and unhindered cell proliferation starts to develop. The most widely accepted is the theory of somatic mutations as the origin of malignant transformation originating from Theodor Boveri (Wunderlich 2002) and further developed by Peter Nowell (1976) as the theory of clonal evolution. It is hypothesized that the initiating mutation in one or more tumour suppressor genes and/or proto-oncogenes causes an altered cell cycle and opens the door for cancer. For example, a first mutation that inactivates a protein responsible for apoptosis makes affected cells immortal, and a second mutation activates oncogenes for this immortal defective cell, which advances the cancer growth (Knudson 1985). It is assumed that only the interplay of multiple promoting factors leads to cancer (Blot and Tarone 2015; Hanahan and Weinberg 2011; Pickup et al 2014). Nowell (1976) assumed that six to seven mutations are necessary for the development of a malignant tumour. However, there is also criticism concerning mutations as initial disturbance. For example, retroviral insertions of mutagens in mice showed that 2,000 genes have to be mutated to contribute to cancer development (Touw and Erkeland 2007). In this theory of somatic mutation, any cell can transform from healthy to malignant. In the stem cell theory developed by Bonnet and Dick (1997) it is assumed that cancer cells arise from cancer stem cells that show characteristics of stem cells like self-renewal and differentiation potential. Cancer cells evolve from cancer stem cells and are assumed to be resistant to the majority of modern cancer treatments. The two theories of clonal evolution and evolution are assumed to complement each other. Further, stem cells might contribute to a faster tumour evolution (Naugler 2010; Schiavone et al 2019). Warburg, the name giver of the Warburg effect, suggested a mitochondrial dysfunction leading to a switch towards anaerobic energy generation by respiration and by fermentation (Brand 2010). Mitochondria organelle transplantation experiments seem to prove his hypothesis: if replacing the nuclei of healthy and cancer cells, the cancer cells with normal mammary mitochondria stop proliferating and the healthy with cytoplasm and mitochondria from the cancer cell developed cancer. This was confirmed in 78 cases and by two studies (Elliott et al 2012; Seyfried 2015). Bussey et al (2017) suggested that cells start running an ancestral programme if threatened, known as the ‘atavism theory’ essentially based on an idea proposed by Theodor Boveri in 1914. The observed damage in cancer cells is a response in the form of a primitive defence mechanism to a damaging environment. Using phylostratigraphy, a method for estimating

16 the age of a gene, oncogenes cluster in age around the onset of multicellularity (Domazet-Loso and Tautz 2010; Trigos et al 2017). Further, Trigos et al (2017) showed that over expressed genes were older than under expressed ones in cancer. This pattern is even true if applied to gene networks: gene networks formed at the age of unicellularity are enriched in cancer. They also found a correlation between the stage of cancer and the age of the more highly expressed genes: the later the stage of disease progression, the older the genes that are more highly expressed.

Tumour promotion and progression Tumour promotion follows tumour initiation and is characterized by a clonal growth of the cancer cell population. In order to increase the population size, cancer cells need to downregulate tumour suppressor genes and upregulate oncogenes (Marks et al 2007). In this phase, the tumour is still in situ (Figure 6). During tumour progression, which is the last stage, a cancer becomes more aggressive (continuously expands) and additionally invades the host system. Invasion is assumed to happen step wise, although each step is not necessarily accompanied by a following one (Nguyen et al 2009). Tumour progression starts with local invasion into neighbouring tissue areas, followed by entering and travelling through blood and lymphatic vessels (intravasation). After leaving the vessel, invading cancer cells can enter distant sites (extravasation). If they have settled, the cells can start forming smaller clusters (micro metastasis) and finally larger tumours (colonization; Hanahan and Weinberg 2011). The metastatic spread can occur in different ways: as single cells, as monoclonal collections of cells, or as polyclonal seeding. It is further assumed that invasion happens in waves (Hanahan and Weinberg 2011; van Etten and Dehm 2016). The process of tissue invasion and metastasis is not completely understood, but it involves cell adhesion between cancer cells, and between a cancer cell and the ECM (Hanahan and Weinberg 2011). Another interesting fact is that cancer cells that originated from a primary tumour were found by genome sequencing in the blood of the patient three years post prostate resection (Hong et al 2015). Circulating cancer cells have been found in the blood samples of many cancer patients. This suggests that the cells of an invasive and motile tumour can enter the circulatory system early in tumour development and infiltrate distant organs (Massagué and Obenauf 2016). One developmental programme that is highly involved in invasion and metastasis is epithelial-mesenchymal transition. Carcinoma cells with an epithelial phenotype gain additionally a mesenchymal phenotype and lose the epithelial gene expression profile. The two processes, gain and loss, are assumed to be linked but can be performed independently

17 (Aiello and Kang 2019; paper II), which results in a wider range of different cell states: epithelial, hybrids and complete mesenchymal (Aiello and Kang 2019).

Tumour heterogeneity Diversity in general, and cellular heterogeneity in particular, is a success factor in the evolution of biological systems (Marzluff and Dial 1991; NBII 2011). Variety is one of several consequences of adaptability due to changed environmental conditions. However, the variety in cancer cells is not a result of specialization within the cancer cell population; system parts in specialized populations depend on each other and only survive if the overall system does. Cancer cell populations are groups of generalized single cancer cells. Species populations that are characterized as generalists show a wider mutational diversity within the population than specialized species populations (Bolnick et al 2007), resulting in greater competition among the group members (Bolnick et al 2003). Generalized populations prefer a heterogeneous environment and have an evolutionary advantage if the environmental conditions change, are unstable or are threatening. Further, they evolve through a negative-frequency selection: the fitness of a genotype or phenotype decreases if it becomes more common (Kassen 2002). Generalist populations of birds, for example, tend to be more tolerant to viruses and bacteria (Aguirre et al 2002). Some species can develop both specialized and generalized populations; a change in environmental conditions is decisive. The trigger for a species population to become invasive comes from decreased resource levels (Tilman 2004). Invasive species are fast-growing, they have dispersal ability and phenotypic plasticity, and are tolerant of environmental conditions and the consumption of nutrients (Kolar and Lodge 2001). Before invasion, invasive species need to already have a survival advantage at low population densities compared to the surrounding populations, can efficiently use resources/nutrients, and are able to manipulate the environment to support growth. These species- specific traits, behaviours, and processes have also been observed spatially and temporarily among cancer cells. Tumour heterogeneity is a wide-ranging term describing diverse aspects of biological and chemical processes within and among cancer cells and their environment. To be more specific, tumour heterogeneity describes the spatial and temporal variety of cancer cells within a tumour or between tumours concerning their genetic, morphologic, transcriptomic, epigenetic, and metabolic signature, their motile and proliferative behaviour, their metastatic potential, and the micro-environment the cancer cells are embedded in (Marusyk and Polyak 2010). Emerging evidence exists indicating that the

18 extent of intra-tumour heterogeneity correlates with tumour progression, and not the mutational burden of a cancer clone (Gerlinger et al 2012; Giraudeau et al 2019; Little et al 2019). Some cancer types are more prone to develop intra-tumour heterogeneity, e.g. prostate cancer (Gay et al 2016), breast cancer (Shipitsin et al 2007), and leukaemia (Campbell et al 2008). Tumour heterogeneity is an eminent success factor in establishing an effective and efficient cancer treatment (Stanta and Bonin 2018). For example, Gerlinger et al (2012) compared gene expression-based prognostic signatures derived from spatially distinct regions of a single cancer. Different regions of the same tumour harboured either good- or bad- prognosis signatures. This indicates that a single biopsy cannot sufficiently represent the overall tumour profile because a single feature does not have the predictive prognostic capability (Wei et al 2017). Insights into mutually exclusive and co-occurring molecular variation signatures would improve the predictive character of features and found in a single biopsy.

Models of tumour heterogeneity Models of tumour heterogeneity not only describe the diversity at a specific time point within a tumour, they also infer the development from the initiation to the last stage of cancer progression. Models of tumour heterogeneity are models of tumour development. There are two models of heterogeneity: the theory of cancer stem cells and the theory of clonal evolution. The latter describes two kinds of expansion: linear and branched expansion. The theory of clonal evolution, first proposed by Nowell (1976), is interlinked with the theory of somatic mutation as the origin of malignant transformation, which was stated by Theodor Boveri. As described before, in his theory of somatic mutation as the origin of cancer, it is hypothesized that the transformation from normal to malignant cancer cells takes place in several dependent steps of inactivating tumour suppressors and activating oncogenes. Accumulating mutations in tumour suppressor genes, oncogenes, and gene repair genes lead to genetic instability and drive the cell to gain more mutations while growing. During tumorigenesis, sub populations with favouring mutations arise, leading to intra-tumour heterogeneity. The emerging sub population that is amenable to evolution leads to a subclonal structure; those clones with the highest advantage become dominant. Isaiah Fidler (1978) suggested that the subpopulations explain the diverse metastatic potential of cancer cells. Linear models of expansion describe a sequentially ordered occurrence of somatic mutations in driver genes resulting in a subclonal structure. Whereas the branched evolutionary tumour growth happens through a splitting mechanism (Swanton 2012), resembling a tree, in each generation a higher genomic instability occurs.

19 The cancer stem cell theory hypothesis states that different cancer stem cells cause intra-tumour heterogeneity. The genetic differences in cancer stem cells can in turn follow a clonal evolution. Cancer stem cells have been identified in, for example, breast (Al-Hajj et al 2003) and prostate (Sampayo and Bissell 2019) cancer.

Types of tumour heterogeneity The facets of tumour heterogeneity are structured into inter- and intra- tumour heterogeneity (Figure 7). Inter-tumour heterogeneity describes the diversity among tumour sites, either within or among patients. For example, prostate cancer is often multifocal: at different places within the same organ and temporarily close, different tumour foci arise. It has been proven that they have mutationally different origins (Cheng et al 1998). The study of differences and similarities among tumour foci and circumstances like micro-environmental conditions aims to explain tumour initiation and evolution and the heterogeneous response to cancer treatment. Intra-tumour heterogeneity can be studied from different perspectives, e.g. from a genetic, epigenetic, metabolic, or genetic-transcriptomic perspective. However, each separate perspective limits the model because a tumour is a system of heterogeneous and interconnected cells with all these facets.

Figure 7. Types of tumour heterogeneity and examples (Adapted from Frame et al 2017).

The interplay and network of genes and proteins play a crucial role in understanding the molecular mechanism of cancer development (Wu et al 2012). A systems approach encompassing the spatial and temporal aspects of tumour heterogeneity improves the model quality. In the following, the

20 different facets and levels are first examined separately and then interrelated to satisfy both approaches.

Spatial heterogeneity Spatial heterogeneity is shaped by the molecular diversity (e.g. morphologic, genetic, epigenetic, and gene expression) of cells within and among tumours.

Morphological heterogeneity The morphological heterogeneity in cancer is often used by pathologists to grade and to assign a prognostic evaluation (Swanton 2012). For example, in prostate cancer the Gleason grading system is applied to describe the histopathological patterns appearing in tissue samples (Epstein 2010). The higher the grade, the more aggressive the cancer is and the worse is the prognosis; although significant differences among patients of the same grade have been observed and benchmark extensions suggested (Berglund et al 2018a; Pierorazio et al 2013).

Genetic heterogeneity The spectrum of genetic mutations, i.e. variations of the DNA sequence with focus on genes in general but also in cancer is immense. To facilitate handling, genetic mutations are structured by the length of their affected DNA sequence, and their effect, in structural and sequential variations. Structural variations affect longer regions of the DNA and change somewhat the DNA structure (e.g. gene order), for example, translocations, inversions, copy number variations (CNV), and single and double-strand breaks ‒ even whole chromosomes can be amplified ‒ whereas sequential variations are shorter and change somewhat the sequential order of the DNA sequence, examples include single nucleotide variations (SNVs) and short insertions and deletions (shortened to INDELs).

Germline and somatic mutations Mutations are commonly differentiated into germline and somatic mutations. Germline mutations are defined as found in germ cells and can be passed on to offspring. Such mutations in tumour suppressor or proto-oncogenes of germ cells can make an individual vulnerable to developing cancer. They are usually identified as germline if found in both a tumour sample and a matched normal sample such as blood. Somatic mutations are acquired after conception and during lifetime. They can occur in any cell except in germ cells. The fact that somatic mutations drive cancer development is one of the hallmarks of cancer (Hanahan and Weinberg 2011). The diversity of somatic mutations and environmental conditions contributes to intra-tumour and

21 inter-tumour heterogeneity. Studying somatic mutations improves the understanding of tumour progression and treatment possibilities. In recent years in the study of epigenetic characteristics, which are the factors that determine how DNA is read and expressed, it has been discussed whether environmentally induced can be passed on to future offspring and persist for several generations (Lind and Spagopoulou 2018). An example of an inherited epigenetic alteration is the Dutch population experiencing years of starvation during World War II: their offspring born thereafter have an increased risk of being smaller and developing glucose intolerance compared to offspring born before World War II (Lumey 1992; Lumey and Stein 2009). This form of inheritance does not follow the germline-somatic inherited mutation idea.

Clonal structure The occurrence and frequency of somatic genetic mutations can be used to infer a subclonal structure of tumours (Figure 8). Subclonal composition means that each subclone contains a different set of driver and passenger mutations. With an increasing number of subclones, the diversity of driver mutations increases. Higher subclonal diversity is associated with the risk of cancer progression (Alkhazraji et al 2019; Merlo et al 2010). Mutational burden is different from genetic intra-tumour heterogeneity. The total number of genetic mutations of a tumour containing only one clone can be high but with mutations equally distributed among all cancer cells. A tumour with a subclonal structure might yield a lower mutational burden (total number of mutations) but each subclone comprises a different set of driver and passenger mutations. The latter is associated with cancer progression. A higher mutational burden, in contrast, is linked to a better immunotherapy response in several cancers and a better outcome for patients with curative surgery and subsequent chemotherapy (Lee et al 2019). This evolutionary process is termed negative ‘frequency-dependent selection’: the subclone with the smallest frequency is the fittest (Kassen 2002).

22 Figure 8. Tumour heterogeneity (Adapted from Fischer et al 2014).

Mutation frequencies and tend to be higher as the disease progresses and even higher in metastasis than at the primary site (Hanahan and Weinberg 2011; van Etten and Dehm 2016). That indicates that the progression of the disease correlates with tumour genetic heterogeneity. It further suggests a clonal evolution and subclonal selection following a Darwinian selection mechanism (van Etten and Dehm 2016). Clonal structure has been observed in almost all cancers, including acute myeloid leukaemia, breast cancer, melanoma (van Etten and Dehm 2016), and prostate cancer (Gundem et al 2015).

Driver and passenger mutations Spatial heterogeneity is shaped by mutational diversity which is divided into driver and passenger mutations. According to the theory of somatic mutations, the initial disturbance is caused by a cell harbouring a mutation with a growth advantage, called a driver mutation. The descending cells form the tumour clone. During proliferation, additional driver and passenger mutations are acquired, and these form the subclonal structure. Passenger mutations do not increase the fitness of a clone, but driver mutations do.

Mutation types The mutation spectrum in cancer is wide. Major roles are played by CNVs, SNVs, and fusion transcripts. Paper III concerns the detection of fusion transcripts, while paper IV is about CNVs; these types of mutations are more comprehensively introduced in this thesis.

Copy number variations (CNVs) are an important component of genetic variation. It is estimated that up to 12% of the human genome is affected by a CNV. The term describes the number of copies that exist for a stretch of

23 the genome. CNVs comprises both amplification and deletion of a DNA segment. CNVs can occur on one (heterozygous) or both (homozygous) alleles of a diploid as human. Like all genetic mutations, CNVs can be inherited or somatically acquired. Most copy number changes tend to occur later in cancer evolution, and deletions precede amplifications (Li et al 2020). CNVs are repaired by homologous recombination or non-homologous DNA end-joining. If a homologous recombination mechanism repairs a copy number alteration but the sister is mutated, this mutation is then copied to the copy number repaired segment (Hastings et al 2009; Helleday et al 2014). The relation between CNVs and gene expression is not fully understood. A study by Blackburn et al (2015) found that 10% of the analysed CNVs were associated with an altered gene expression of a gene near the CNV segment. Such genes with altered gene expression were related to the immune system.

Epigenetic heterogeneity Epigenetic markers manipulate the transcription of certain genes but do not change the actual sequence of DNA nucleotides. The epigenome is significantly influenced by environmental factors like diet, toxins, and hormones (Bernstein et al 2007; Misteli 2007). DNA methylation is currently the most widely studied epigenetic change. In human tumour cells, gene-specific hypermethylation will result in the repression of transcription. Concurrently, global hypomethylation, which is associated with increased chromosomal instability, is observed (Lehmann 2010; Sheaffer et al 2016). In paper III, a relation between fusion transcript occurrence and an epigenetic mechanism is suggested. It is assumed that epigenetic alterations in cancer cells are enough to alter pathways linked to genomic instability. Epigenetic changes, like methylation of the promoter region and/or the first exons of tumour suppressor genes or oncogenes have similar effects as genetic mutations have. Moreover, both genetic and epigenetic alterations follow a clonal evolution (Stanta and Bonin 2018).

Fusion transcripts Fusion transcripts play a major role in cancer (Figure 9); they are linked to tumour progression but are also used as a and treatment target. A fusion transcript is a chimera of sequence fragments of different genes. If the chimera was caused by a genetic effect, e.g. a deleted sequence between the parental genes, it is termed gene fusion. For example, ETS gene fusions (e.g TMPRSS2-ERG) occur in 50% of all prostate cancers (Adamo and Ladomery 2016). Eighty per cent of all known gene fusions, i.e. caused by a genetic

24 mutation, have been found in leukaemia, lymphomas, and sarcomas. In carcinomas, only 10% of the known gene fusions can be detected. The reason behind this imbalance is not yet known (Kumar-Sinha et al 2012).

Figure 9. Recurrence Number of fusion events (Adapted from Picco et al 2019). Shown are the number of fusion transcripts per cancer cell line, sorted by cancer type (Picco et al 2019).

Fusion transcripts can also be transcription-induced. For those, no convincing genetic mutation has been identified so far; the abnormal mechanism behind their occurrence is unknown. Transcription-induced fusion transcripts are further divided into cis- and trans-splicing. An example of a chimera-caused cis-splicing of adjacent genes, i.e. cis-SAGe, is SLC45A3-ELK4, which was detected first in prostate cancer (Rickman et al 2009). cis-SAGe do not exclusively occur in neoplasia; they have also been detected in normal tissues adjacent to cancer. This might indicate that cis- SAGe are a very early event in cancer evolution due to an abnormal transcription. cis-SAGe have also been suggested as a normal mechanism for adapting transcription and thus extending the flexibility of gene expression (Li et al 2018; Qin et al 2016). The occurrence of cis-SAGe is still not a fully understood phenomenon.

Gene expression variation The transcriptomic and functional diversity within and among tumours does not follow a clonal evolution (Stanta and Bonin 2018). For example, gene expression is significantly different in the centre and at the external border of a tumour (Berglund et al 2018a; Stanta and Bonin 2018). The micro- environment influences the cancer cell phenotype: a heterogeneous environment interacts with a heterogeneous cancer cell population (Stanta and Bonin 2018). The cancer cells on the periphery expressed genes that are

25 functionally related to cell motility (e.g. regulation of actin cytoskeleton, focal adhesion), whereas those cells in the centre transcribed genes related to transcription stress (e.g. protein processing in endoplasmic reticulum, lysosome; Berglund et al 2018a). Moreover, a gene or gene family can be involved in many different processes. For example, the members of the family of winged helix/forkhead (FOX) transcription factors are involved in embryogenesis, longevity, tumorigenesis, cell fate determination, and phenotypic plasticity; their activities regulate cell cycle, cell differentiation, and determine cell types (Kaestner et al 2000). Their overexpression can be favourable and harmful, depending on e.g. the cancer type, the mutation they are involved in, and their protein localisation within the cell. Gene expression variance within cancer populations is not limited to the same cell type. Cancer cells are characterized by phenotypic plasticity: they can undergo a transition from an epithelial to a mesenchymal phenotype. These different cell states are related to cancer progression and treatment resistance and have been the subject of many recent research studies (e.g. paper II).

Systems view on tumour heterogeneity Studying tumour heterogeneity from a pathway perspective enables the combination of genetic and epigenetic variations (following a clonal evolution) and differential expression variations (not following a clonal evolution). Further, data are incomplete; shifting the results to a pathway level might offer results that are more reliable. There is much observed diversity in tumours; a systems view might explain the common processes and underlying dysregulations (Valencia and Hidalgo 2012). Analysing 5,272 mutated genes and 40 tumour types, Baudot et al (2010) found that a minority of these genes were mutated among many cancer types; they are known as the ‘usual suspect’ oncogenes, tumour suppressors, and DNA repair genes. In contrast, 73% of the mutated genes were detected in only one tumour type. However, analysing all these mutated genes on a pathway level, they clustered into specific cellular processes. Pathways such as focal adhesion, adherens junctions and cell adhesion, cell cycle, ErbB, MAPK, ras, and mTOR signalling were activated in at least four tumour types. These pathways also reveal the weak points of a cancer and provide targets for cancer treatment. Kandoth et al (2013) studied SNVs and INDELs in genes across 12 tumour types provided by The Cancer Genome Atlas (TCGA). Of these, 127 genes were significantly mutated and could be classified into 20 categories of cellular processes: transcription factors/regulators, histone modifiers, genome integrity, MAPK, PI3K, and RTK signalling. RTKs are a family of receptors containing, for example, EGF, PDGF, VEGF, and AXL receptors.

26 Spatio-temporal heterogeneity is a dynamic process: a cancer clone permanently changes its molecular signature as a result of selective pressure but also of neutral evolution (McGranahan and Swanton 2017; Williams et al 2016). Especially interesting is the rise of subclones resistant to cancer therapies (Cun et al 2018). Here, it is immensely important to know at what time point of tumour evolution resistant subclones emerge, how the therapy-resistant subclones evolve, and what co-occurring or mutually exclusive conditions exist. It is also interesting to see how a therapy affects cancer cell populations and the micro-environment as regards mutations, cell states, and cell motility, and how this influences resistance to treatment. The fact that resistant genetic subclonal populations, i.e. characterized by a unique genetic signature, can already be present pretreatment and can expand during chemotherapy has been shown in high-grade serous ovarian cancer (Schwarz et al 2015), in acute myeloid leukaemia (Roche-Lestienne et al 2002; Skaggs et al 2006), in colorectal cancer (Frydrych et al 2019), and in non-small-cell lung carcinoma (Caswell and Swanton 2017). In contrast, a switch of genotype and phenotype dominance was discussed in a study conducted by the Cancer Genome Atlas Research Network (2015). A total of 333 primary prostate carcinomas in the TCGA project were researched, with seven mutually exclusive subtypes being found, each related to a mutation in one of seven genes. The subtypes explained 75% of the 333 carcinomas. Castration-resistant cancer clones, however, showed only androgen receptor-related mutations and/or gene expression aberration (van Etten and Dehm 2016). This indicates that expression level variation plays a major role during cancer progression, invasion, and therapy resistance (Figure 10).

27 Figure 10. Temporal pathway activation model of carcinogenesis in colorectal cancer, including miRNAs regulation (Adapted from Slaby et al 2009).

To summarize the current knowledge and the emerging trends regarding spatio-temporal tumour heterogeneity: during tumour promotion, genetic alterations drive clonal evolution and foster tumour growth. Later in the progression phase, when the cancer growth exceeds resource levels and invasion is initiated, cancer cells tailor their phenotype from an epithelial to a mesenchymal cell state in order to support single and/or collective cell motility. The phenotypic diversity is facilitated by the micro-environment. In this late phase of cancer progression, gaining additional favourable genetic mutations is less important, as the cancer cells are already highly superior to their neighbouring normal cells. Immortal tumour cells that have migrated from the primary tumour site circulate in the organism and contribute to monoclonal or polyclonal transfer between metastases or between metastasis and the primary tumour.

28 Investigation of tumour heterogeneity Although the previous chapters were dominated by biological aspects of tumours and tumour heterogeneity, it can be assumed that these findings were obtained with the support of novel technologies and innovative computational methods. Moreover, the insights and resulting improvements of cancer treatments emphasize the important role of computational analysis methods and strategies, and the synergy effects of an interdisciplinary approach. Some of these computational methods, and those that were utilized in the papers I‒IV, are introduced in this chapter.

Approaches Tumour heterogeneity is approached in two ways; analytically and systemically. With an analytic approach the focus is on single elements and their effects. Single elements are, for example, a certain mutation such as gene duplications, or within a certain gene, like the tumour suppressor gene TP53. With a systematic approach, however, the complex phenomenon is considered as a whole to build simplified models by shifting the problem to a higher level. Interactions, changes and the context of phenomena are in focus. Examples are gene expression and gene function under changes of conditions and time to identify therapeutic targets, and the influence of stroma on cancer evolution and treatment resistance (Berglund et al 2018a; Hanauer et al 2007). The two approaches, analytic and systemic, complement each other, and both are needed in order to understand spatial and temporal diversity as a main hindrance to cancer treatment. To support the different approaches, many computational methods and software tools have been developed, and these are introduced in the following chapters.

Technologies To obtain genome and transcriptome data, several techniques have emerged, e.g. sequencing, blotting, and nucleic acid hybridization. The latter is a process in which a complementary DNA (cDNA) or RNA single strand attaches to a single strand of DNA or an RNA by forming hydrogen bonds between the complementary strands nucleic bases. Hybridization is, for example, the core principle behind microarrays, and used for many laboratory methods, for example, in situ hybridization and southern and northern blotting. Hybridization and blotting are targeted, i.e. the gene targets need to be known beforehand. For example, with fluorescence in situ hybridization (FISH), up to a few hundred genes (probes) can be

29 investigated (Jensen 2014), although FISH variants with probe sets for ~10,000 genes have been developed recently (Burgess 2019; Eng et al 2019). The term ‘sequencing’ refers to the reading of a DNA, RNA, or peptide sequence and is usually untargeted. The reading is followed by a comparison of the extracted sequence to other sequences, e.g. in the form of a sequence alignment to a reference genome or transcriptome, a comparison of sequences from different experiments, or a database search, e.g. in Pfam (El- Gebali et al 2019). The techniques can also be combined, for example, in situ hybridization- based single-cell RNA sequencing approaches were developed (Ke ta al 2013). In the following, sequence-based methods used for transcriptome- or genome-wide analysis are briefly introduced; they are fundamental for the papers I‒IV in this thesis.

Standard sequence-based methods: bulk sequencing Since the first sequencing of the amino acid chain of insulin by Sanger in 1955 (Nobel prizes in 1958 and 1980), pioneering concepts and methods have conquered the sciences for life. For example, the idea of blasting an entire genome into short pieces to read them in parallel, termed shotgun sequencing, increased the speed of genome sequencing dramatically (Gajewski 2016; Venter 2001). The next generation of sequencing technology, i.e. high-throughput, sequences millions of short reads (≤ 500 bp) leading to high genome or transcriptome coverage. The third generation of sequencing is under development and is aimed at single molecule sequencing without the amplification step during sequencing library preparation, and using long reads of up to 8 kbp in length (Bleidorn 2016). Polymerase chain reaction (PCR) allows sequence segments to be copied and these can then be sequenced. Bulk sequencing captures the average signal of larger cell populations. Both transcriptome-wide and genome-wide signals, but also targeted signals (whole exome sequencing) of cell populations, can be obtained. The advantage of bulk sequencing approaches is the higher material input as a source for signal capture than for single-cell sequencing. The signal, however, arises from heterogeneous cells. If, for example, the tumour purity is less than 100%, i.e. the tumour area is smaller than the tissue sample, the extent to which the signal was received from the tumour or the neighbouring area cannot be determined.

High-resolution sequence-based methods High-resolution sequence-based methods enables one to zoom in to tissue samples and to extract information from single cells or cell subpopulations, either with or without kept spatial information.

30 The higher resolution leads, however, to lower material input for sequencing. Technologies like PCR copy DNA sequences and increase the sequencing depth and thus the depth of coverage. However, PCR prefers small and GC-poor molecules (Minikel 2012). This leads to unevenly distributed copies of unique molecules and fewer unique molecules with increasing sequencing depth. This PCR amplification bias can be reduced using unique molecule identifiers (UMIs; Miner et al 2004; Sena et al 2018).

Single-cell sequencing Sequencing the molecular information of a single cell enables the analysis of cell-specific genotype and phenotype information. In particular, capturing multiple molecular types (multi-omics) from the same cell, e.g. the transcriptome and the proteome or the transcriptome and the genome (Dey et al 2015), reveals intracellular processes and their interrelations (Macaulay et al 2017). Single-cells have to be isolated first from the tissue sample using, for example, droplet-based methods as DropIn (Klein et al 2015) and Drop- seq (Macosko et al 2015) apply. With single-cell sequencing the spatial information is lost.

Spatially resolved omics: in situ capturing methods Various methods have been developed to investigate the heterogeneity of individual cells across the entire genome and transcriptome while at the same time maintaining the spatial context of the cells. For example, Geo-seq is a combination of single-cell sequencing and LCM (Chen et al 2017), Slide-seq uses a DNA-barcoded bead array to detect mRNA from tissue sections (Rodriques et al 2019), spatial transcriptomics uses arrays with spatial barcodes to capture mRNA in situ, and LCM. The latter two are briefly introduced in this thesis. Spatial transcriptomics plays a major role in the papers I‒III. Genomic data obtained using LCM, often resulting in low coverage, can be analysed using MetaCNV (paper IV).

Laser-capture microdissection LCM means using a laser to cut out a small area of a tissue sample containing a few cells or only a single cell. Purposive areas, e.g. within or adjacent to tumours, can be selected. The advantages of LCM are its speed, precision, versatility, and reliability (Datta et al 2015). It is also beneficial that cut-out cell populations preserve their morphological features within the environmental tissue. Compared to bulk sequenced tissue samples, the number of cells and thus the tissue heterogeneity are immensely reduced. The microdissected tissues can be used for genome, transcriptome, and proteome analysis (Datta et al 2015).

31 Spatial transcriptomics Instead of capturing, extracting, and analysing mRNA molecules jointly from a tissue biopsy, a cell subpopulation, or a single cell, mRNA molecules are captured by adding spatial barcodes, which identifies their spatial location within the tissue sample. The advantage is that all genes expressed in the tissue sample are captured along with their location, whereas in single- cell sequencing, even with a high number of sequenced cells, only the mRNA molecules from the chosen cell are captured and the spatial information is lost. Using spatial transcriptomics, tissue- and transcriptome- wide and on almost single-cell resolution the gene expression levels are captured; the resolution depends on the tissue and cell type but comprises 10‒200 cells per barcoded spot (Moncada et al 2018). Applying this technology involves several steps. First, the sliced tissue sample (~10 μm) is put on a glass slide, i.e. a microarray with multiple spots each containing ~200 million DNA capture probes and the spot’s barcode (Ståhl et al 2016). Second, the tissue samples are H&E stained and imaged. Third, the tissue is permeabilized to capture the mRNA molecules, and cDNA synthesis is performed. When the tissue is removed from the glass side, the remaining probes including the barcodes, are captured for sequencing. The transcriptome of each tissue domain retrieved from a spot can then be visualized on the imaged sample (Berglund et al 2018b; Ståhl et al 2016). This technology is fundamental for papers I and II and is used to analyse clinical tissue samples. The results of paper I are also used in paper III.

Computational methods The richness of the data produced with standard and in particular high- resolution technologies is a mercy for scientists. To analyse the sheer volume of data and reduce it to simple benchmarks, methods from bioinformatics, machine learning, network science, mathematics, and statistics can be used. A common pipeline of computational analysis starts with the data output from an experiment, e.g. the sequenced reads, which are then pre-processed, i.e. aligned and annotated. The preprocessed data can then be analysed, e.g. calling mutations with a following results analysis or comparison of gene expression levels for different experiments. Finally, the results are interpreted and set in a context.

32 Alignment and data preprocessing The sequenced reads are pieced together by either mapping them to a reference genome, e.g. the human reference genome GRCh38, or de novo (Behjati and Tarpey 2013). Before mapping, reads can be filtered by quality requirements, e.g. removing bases of low quality. Most popular are the aligners Bowtie and Burrows-Wheeler alignment tool (BWA; Li and Durbin 2009) for sequenced genomes, Star (Dobin et al 2013) and Tophat2 (Kim et al 2013) for sequenced transcriptomes, while for transcriptomes, splice sites need to be considered. To align millions of reads to a reference genome quickly and correctly, different search algorithms can be applied, e.g. suffix trees, suffix arrays, and hash tables. BWA, Bowtie, and SOAP (Li et al 2009) implemented an indexing algorithm based on a suffix array created by Michael Burrows and David Wheeler, the Burrows- Wheeler transform (BWT; Burrows and Wheeler 1994). The algorithm creates first so many copies of the reference genome sequence as the reference genome long is and adds each sequence copy as an additional row, however with rotating letters as the start position. Each new row starts therefore with the next letter in the sequence. The sequence copies are then lexicographically sorted and the last column of letters of these sequences is used as an index to estimate where a read originates. With this index, a fast search is facilitated, where a read maps within the reference genome. Bowtie2 (Langmead and Salzberg 2012) added the FM index (Ferragina and Manzini 2000), which is based on BWT. Further, to consider mismatches, Bowtie2 divides the reads into parts and tries to map these to the reference genome; if hit, the alignment is extended and scored. The best scoring reads with the mismatches are kept. These steps are done for each read and its reverse complement (Langmead and Salzberg 2012). Artificially duplicated DNA fragments during PCR should be removed from the alignment because they influence mutation calling. GATK best practice recommends additionally for RNAseq splitting reads containing Ns into multiple supplementary alignments and hard clip mismatching overhangs (DePristo et al 2011). The resulting genome alignments can be used, for example for mutation calling. Annotating alignments means adding knowledge about functional elements to the alignment, e.g. gene positions, promoter regions, splicing variants, or polyadenylation sites. The annotation step plays a bigger role in the alignment of a sequenced transcriptome. The number of reads aligned to a specific position mirrors the expression level of a functional unit, e.g. a gene or exon. The inferred gene expression levels per unit can be used to perform, for example differential gene expression analysis. Gene expression abundance is measured in log-transformed expression fold change, reads per kilobase per million mapped reads (RPKM), fragments per kilobase per

33 million mapped reads (FPKM), and transcripts per million (TPM). scRNA often requires additional steps, e.g. normalisation and filtering of mitochondrial transcripts.

Mutation calling Subpopulations of the cells within a cancer clone are distinguished by their variety of genomic, epigenetic and phenotypic features. The variety of, for example, genetic mutations can be inferred using computational methods. Common genetic mutations called from DNA sequences are SNVs, INDELs, CNVs, inversions, translocations, and gene fusions. A genetic mutation does not have to entail a change in the phenotype, e.g. if the non-mutated allele is expressed or the mutated gene is not expressed. A comparison of mutations found in DNA and RNAseq data can disclose this. Fusion transcripts, which can have a genetic cause or are transcription-induced, are usually searched for in RNAseq data. Depending on the mutation type, different search algorithms emerged. Algorithms specialized in finding sequential variants (SNVs, INDELs) look for non-reference bases. During post-processing, probabilistic models are applied to remove false positives. Calling structural variants, in contrast, is usually done by looking for changes in the read-alignment flow, e.g. breakpoints in coverage, read pairs with each mate aligned to different reference genome segments, and insert-size changes of read pairs (Xu 2018). Mutation frequencies of SNVs and INDELs are commonly used to infer a subclonal composition. Driver mutations that most commonly occur in a tumour also tend to arise the earliest (Gerstung et al 2020). However, the subclonal composition is dynamic and can change over time. Only if mutation frequencies from different time points are present can a spatio- temporal intra-tumour heterogeneity be inferred (Turajlic et al 2015). Knowledge about structural variants is also required to infer a clonal structure. For example, SNV frequencies need to be corrected for the impact of CNVs to convert allele frequencies into subclone frequencies (Turajlic et al 2015). However, this makes mutation calling error-prone because each prior inference step risks the introduction of errors, which are propagated through the analysis (Turajlic et al 2015). Another problem emerges from rare cell populations and tumour purities. For example, an alignment of sequenced reads of 150 bp in length to the reference genome GRCh38 (~3 Gbp) and a coverage of 30x, i.e. on average each base pair is covered by 30 reads, contains 600 Mio reads. If only 1% of these were misaligned, that is, aligned to a wrong reference sequence segment, and only half of them were misaligned with one mismatch, the final alignment would harbour three Mio false SNVs. If ~10,000 true SNVs are assumed (Stoler et al 1999; Tomlinson et al 2002), the share of TP among all called mutations is 0.3% (precision = 0.003; false discovery rate

34 FDR = 0.997). Most FPs can thus be identified by their low frequencies. A mismatch in one read out of 30 appropriates an allele frequency of 0.033 at the mismatch position. However, true SNVs can also be present at very low frequencies. In order to study clonal evolution and, for example, the rise of resistant subclones, information about low-frequency mutations is needed. This problem can be circumvented with an increased depth of coverage. (Spencer et al 2014) compared the results of four mutation callers at different coverages for SNVs with 5, 10, 25, and 50% allele frequency. The best caller achieved good results for SNVs with a 5% frequency on a depth of coverage of 400x. To detect subclonal populations at a 0.05 frequency in a tissue sample. (Petrackova et al 2019) recommend a minimum depth of coverage of 288x and at a 0.02 frequency a minimum depth of coverage of 2,918x (1% false positive risk). An alternative strategy to identify the mutation of subpopulations is calling mutations from high-resolution data, e.g. LCMs or single cells. A lower material input usually leads to less coverage. With PCR, the sequencing depth can be increased. However, this DNA amplification also introduces errors exponentially (e.g. sequence errors, unevenly distributed copies of unique molecules), which are propagated to the downstream analysis (Brodin et al 2013; Potapov and Ong 2017). Thus, there is an extensive need for mutation callers applicable for low coverage data. This problem is addressed in paper IV with MetaCNV, a caller that infers reliable CNVs from low coverage data. In paper IV, it was shown that a depth of coverage of 1x sequenced single-cell DNA had already led to reliable results.

Copy number calling A copy number variant (CNV) is a form of structural variant describing that a larger segment within a genome is either amplified or deleted. CNVs can disturb normal gene expression, e.g. a deleted tumour suppressor gene can lead to dysfunctional processes during cell division. Different tools for structural variant calling emerged. For example, callers can infer structural variants in general or only CNVs; some callers work with single sample, some require a matched normal sample (e.g. blood) to distinguish germline from somatic mutations. Callers that are specialized in single cell sequenced genomes usually require multiple single cells and infer copy numbers for each single-cell relative to the pooled cells. An exception is MetaCNV (paper IV); the tool can be applied to alignments of one single cell, as well as of LCM, and of bulk sequenced data. Tools for calling CNVs computationally apply one of these four approaches, each having advantages and disadvantages. The four approaches are based on (i) read coverage, (ii) paired-end mapping, (iii) split-reads, and (iv) hybrids (Figure 11; Mills et al 2011; Zhao et al 2013).

35 Figure 11. The read coverage (i.e. read-depth), paired-end (i.e. read-pair), and split- read approach (Adapted from Pirooznia et al 2015). Hybrids combine at least two of the approaches (Pirooznia et al 2015).

Callers based on a read coverage approach look for a change in coverage, i.e. a breakpoint, which indicates a change in the copy number. Different algorithms to improve the results are implemented, e.g. GC-content corrections. A high depth of coverage is sought to achieve good results (Pirooznia et al 2015; Zhao et al 2013). Callers based on a pair-end mapping approach use correctly aligned paired-end reads and, depending on the change of the distance between the read pairs, a deletion or amplification is inferred. The advantage of this approach is that it is relatively independent of the coverage, and thus applicable for low-coverage data. A disadvantage is that only correctly aligned paired-end reads can be used, which decreases the amount of usable reads and the depth of coverage. (Pirooznia et al 2015; Zhao et al 2013) The split-reads approach uses only discordantly aligned reads. An attempt is made to realign a not-aligned by splitting the read into parts. Aligned read parts indicate a structural variant. The advantage of this approach is the high level of accuracy of the breakpoint positions. Due to the reduced read length when splitting the read, the probability of a misalignment increases (Pirooznia et al 2015; Zhao et al 2013).

36 Hybrids combine the different approaches. MetaCNV is an example of a hybrid. In paper IV, it was hypothesized that read overage approaches perform better for calling deletions, and paired-end mapping approaches perform better for calling amplifications. With this hybrid solution, reliable CNVs can be called even from low- coverage data, e.g. alignments from single-cell sequenced genomes (Zhao et al 2013).

Fusion transcript detection Fusion transcripts are commonly detected in bulk RNAseq data by identifying reads that span the fusion point, or mates of a read-pair, with each aligned to one parental gene. Due to sequencing and alignment errors, the computational methods using sequenced reads contain an elevated number of false positives. Fusion callers try to reduce these false positives with improved search and evaluation algorithms. However, from RNAseq data a frequency of occurrence cannot be inferred, unless the chimera was caused by a genetic abnormality and the mutation frequency was calculated from the sequenced genome. An alternative approach is the use of high-resolution data, e.g. spatial transcriptomics data. Paper IV describes a novel approach, STfusion, to detect a fusion transcript using spatial transcriptomics. The fusion transcript can be detected on almost a single-cell level. Further, frequencies can be estimated from the number of tissue domains harbouring the fusion transcript compared to the total number of tissue domains annotated as diseased. STfusion is an approach that extends existing calling methods to detect fusion transcripts. This proposed method uses a lacking poly(A) tail at the 5´ gene of the parental genes to indicate a fusion transcript. The 3´ gene of the parental genes is polyadenylated. Poly(A) tails are, together with the 5´ cap, post-translational modifications; both are attached to the messenger RNA in order to facilitate translation into a protein.

Differential gene expression analysis A powerful tool for investigating tumour phenotypic heterogeneity is the calculation of significantly differentially expressed genes between biological conditions followed by a pathway annotation. The conditions can, for example, be normal epithelial and tumour cells, the centre and the periphery of a tumour, and tumour biopsies before and after treatment. Even comparisons of the same condition (e.g. normal epithelial) to several other conditions (e.g. cells of different Gleason scores), or comparisons of several conditions simultaneously, can be performed. Although phenotype and genotype are strongly interlinked, only a relatively small number of all genes are expressed in each cell. The gene expression and expression level depend on the circumstances, e.g. disease state, cell type, and surrounding cells. Cells harbouring the same mutational

37 genetic signature as shared by a subpopulation within a cancer clone do not have to express the same genes and at the same level. Even cells belonging to different cancer cell subpopulations can present similar gene expression patterns. Hence, the analysis of differentially expressed genes provides insight into the different molecular functions, disease states, and disease progression. The challenge in differential gene expression analysis is to identify significant changes explaining the difference specific for the conditions which is commonly done with a statistical hypothesis test. In papers I–III, the parametric methods Welch’s t-test (Welch 1947) and the moderated Welch test (MWT; Demissie et al 2008) were applied to identify significantly differentially expressed genes. When using a parametric test, it is assumed that each value is mapped into a particular distribution, such as Poisson, negative binomial, or normal distribution (Costa-Silva et al 2017). The assumption can be confirmed with a hypothesis test, e.g. the Kolmogorov-Smirnov test (Kolmogorov 1933), which compares the sampled data set against a synthetic data set that follows this tested distribution. With non-parametric methods, however, it is assumed that the distribution cannot be inferred from the values, and that the information about the data can increase with its volume (Costa-Silva et al 2017). Examples of non- parametric methods are Wilcoxon-Mann-Whitney (also termed the Wilcoxon rank sum test; Wilcoxon 1945), and the Kruskal-Wallis test (Kruskal and Wallis 1952). In hypothesis testing, a difference between the two data sets is inferred if the null hypothesis cannot be held (Everitt 2006). The null hypothesis states that there is no difference between the data sets and is provisionally assumed until a mistaken rejection is unlikely. The significance is measured with a test statistic and a probability value (p-value), under the null hypothesis, that this test statistic was observed. The alternative hypothesis is favoured if the p-value, i.e. the probability of a mistake, is less than the significance level threshold alpha (Triola 2014). The Welch’s t-test is a type of Student’s t-test and tests the hypothesis that two data sets with normal distribution but unequal variances and/or unequal data set sizes have an equal mean (Welch 1947). In gene expression, unequal variances should be assumed because positive and negative correlation of expression levels and variance was observed (Komurov and Ram 2010). Further, a Welch’s t-test is more robust than a Student’s t-test and maintains FPs better for unequal variance and different data set sizes but has similar power to a Student’s t-test if variances and data set sizes are equal (Ruxton 2006). The MWT is an adaption of Welch’s t-test; it is applicable to data sets with unequal variances but also to data sets containing few variables. In data sets with balanced (e.g. n1 = n2 = 3) or unbalanced (e.g. n1= 3 and n2= 9) small samples, variance inequalities cannot be detected (Demissie et al 2008).

38 In hypothesis testing, the p-value measures the probability for a test statistic under the null hypothesis if one test was performed. If multiple tests, e.g. one test for each of the expressed genes, were performed, the probability of observing the result needs to be corrected for the number of performed tests. This is commonly done with the Benjamini-Hochberg procedure which controls the FDR (Benjamini and Hochberg 1995), or familywise error rate (FWER; Bonferroni, 1936).

Pathway annotation Pathway annotation means identifying altered biological pathways under a certain condition; this type of analysis helps to reduce the complexity of information and to add explanatory power to the analysis (Khatri et al 2012). Pathway activation is commonly indicated by relating the genes encompassed by the biological pathway to the genes of a query gene set. The query gene set can be, for example, a result of a differential gene expression analysis, a list of mutated genes from mutation calling, or any other group of genes from an experiment whose higher-order biology information is desired. There are different approaches to defining and measuring the relation between the query gene set and the biological pathway. It is important to consider the different assumptions they are built on, in order to infer reliable results.

Over-representation analysis The simplest way to identify pathway activation is from the overlapping genes, i.e. genes that are both in the query gene set and part of the biological pathway, called ‘over-representation analysis’. The basic assumption is that there is independence among the query genes but also among the pathways (Khatri et al 2012). Whether the genes, which are in the pathway gene set, are significantly overrepresented can be assessed e.g. using a hypergeometric test or Fisher’s exact test. The latter compares expected to observed overlap involving a 2 x 2 contingency table; for each test a test statistic and a p-value is returned (Fisher 1922; Fisher 1992). The significance of the overlapping genes is calculated by relating it to the null hypotheses: that it is equally likely that the query genes are in the pathway gene set as that they are not (Khatri et al 2012). The null hypothesis implies that the correlation among the query genes, which are part of a pathway, is similar as the one among the query genes outside the pathway (Khatri et al 2012). For example, the expression levels of significantly differentially expressed genes of an experiment might be correlated. However, whether the occurrence of mutated genes is correlated or not has not been proven yet. A highly cited instance of the over-representation analysis method is the database for annotation, visualization and integrated discovery (DAVID; Huang et al 2007). The significance of over-representation is measured by

39 using a modified Fisher’s exact test called the EASE score: from the number of overlapping genes one gene is removed (Hosack et al 2003; Huang et al 2007). Overlaps of one gene or few genes are penalized because these genes can be false positives. DAVID has however been shown to miss a high number of activated pathways (Ogris et al 2016; Ogris et al 2017). Methods using a network-crosstalk approach, e.g. BinoX (Ogris et al 2017), achieve a much higher sensitivity (TPR) than DAVID.

Systems approaches of pathway annotation More sophisticated methods apply a systems approach to infer pathway activation: they consider not only the genes but also the connections between them, and the global gene network they are embedded in. This approach led to the development of a new generation of pathway annotation algorithms.

Functional Class Scoring Methods using functional class scoring evaluate the significance of a relation to a pathway using the joint level statistic over all genes in a pathway, termed pathway-level statistic. Examples of a pathway-level statistic are the Kolmogorov-Smirnov statistic, the Wilcoxon rank sum, and the mean of the gene-level statistic (Khatri et al 2012). The main difference to the over- representation method is that all the tested genes and their gene expression level of an experiment are used; a cut-off to filter significant genes beforehand is not needed. To extract significant pathway activations, the pathway-level statistic is compared to one of two null hypotheses: (i) it is equally likely that the query genes are in the pathway gene set as that they are not, and (ii) that the query genes in the pathway are not differentially expressed (Khatri et al 2012). The former is also applied in over- representation analysis. The latter considers the correlation structure among the tested genes. A disadvantage of this method is that it is limited to data with a score assigned to all genes, such as differential gene expression analysis.

Network-based analysis A broader applicable approach is the use of genome-wide functional association networks such as FunCoup (Ogris et al 2018) or STRING (Szklarczyk et al 2019) as additional input to the query gene set and the biological pathways. The network can be used to extend the query gene set with neighbouring genes, and the extended gene set can be used for over- representation analysis, as in FunCoup, STRING, or GeneMANIA (Montojo et al 2014).

40 Network-crosstalk analysis The functional association networks can also serve as a blueprint or prior information in order to interrelate query genes and pathway genes; the interrelation is based on a biologically meaningful genome-wide network (Figure 12). The shared links between them serve as an indicator for a relation. Different ways to evaluate a relation as significant have evolved: EnrichNet (Glaab et al 2012), uses network distances, CrosstalkZ (McCormack et al 2013) applies the z-score, NEArender (Jeggari and Alexeyenko 2017) applies Pearson's chi-squared test, BinoX tests against a binomial null distribution, and NEAT (Signorelli et al 2016) tests against a hypergeometric null distribution. A major advantage of these network- crosstalk analysis methods, and BinoX in particular, is its increased sensitivity and decreased false positive rate. The BinoX algorithm, which is implemented in the public PathwAX web server (Ogris et al 2016), was applied to the large number of data-driven experiments for papers I and II to infer activated pathways. This algorithm was also used for paper III to infer altered pathways related to the occurrence or absence of fusion transcripts on almost a single-cell level.

Figure 12. Pathway annotation based on network-crosstalk (Adapted from McCormack et al 2013). A The two pathways, p1 and p2, are compared to the query gene set s by their shared links (inter-crosstalk). The significance of the depleted or enriched inter-crosstalk is calculated using a random network shown in B. B Relative to the random network, the inter-crosstalk between p1 and s is depleted and the inter-crosstalk between p2 and s is enriched. (McCormack et al 2013)

41 Biological pathway databases For pathway annotation, different alternative sets of functional associations and biological pathways exist. Most popular are the databases provided by Gene Ontology (The Gene Ontology Consortium 2019), KEGG (Kanehisa 2019), and Reactome (Jassal 2011). The Gene Ontology (GO) databases associates gene products to GO terms belonging to the domains cellular components, biological processes, and molecular functions. The GO terms topology is a directed acyclic graph; the direction implies a hierarchy among the terms and the acyclicality enables linking genes to zero or more terms. KEGG is a collection of several databases, including one database for biological pathways where the pathways are grouped into seven sections. For human, KEGG lists 337 pathways. Reactome’s curated and peer-reviewed pathways are structured in a GO-like hierarchy (Guala et al 2019). For human, Reactome comprises 2,244 pathways (Jassal 2011). Biological pathways but also GO terms and domains are interconnected.

Machine learning & data mining There is an immense need for computational techniques that deal with the enormous amount of data produced by the aforementioned technologies and methods. Such techniques can be found in the fields of machine learning, and data and text mining. For example, pattern recognition techniques can uncover latent patterns and structures. Moreover, based on the sheer amount of data, models can be developed to predict future outcomes. This is helpful in two ways: the model can reveal relations and weights of explanatory factors, and the predicted outcome reveals cause-effect connections. In paper IV, CNVs were predicted rule-based. In papers I and II, the dimensional reduction methods principal component analysis (PCA; Pearson 1901), t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton 2011), and the factor analysis method spatial transcriptome decomposition (STD; Maaskola et al 2018) were applied to reveal hidden structures. Cluster analysis, a method used to identify groups of similar objects, was utilized in papers I and II. Anomaly detection supported the localization of fusion transcripts introduced in paper III, and the annotation of transcriptomic factors in papers I and II.

Predictive methods The application areas for predictive methods are immense and diverse. They were used, for example, to predict the outcome of cancer treatment plans (Cruz and Wishart 2007; Michiels et al 2005), to diagnose and prognosticate cancer (Kourou et al 2015; Wang et al 2019), and to predict cancer driver genes (Collier et al 2019). There are plenty of methods that can be used for predicting, e.g. artificial neural networks, support vector machines, and decision trees. For a prediction, first, a model is built, and second, the model

42 is applied to unseen data to make a prediction. Commonly, a data set containing instances with known outcomes is needed to create the model (supervised learning), which harbours several obstacles. Overfitting is one of them; this occurs if a model is overtrained and thus specialized in the training data set. Overfitted models can be identified by, for example, their greater complexity. A prediction model based on, for example, simple rules inhibits overfitting. Predictive methods, mainly based on ensemble learning, were also applied to call mutations (Huang et al 2019a; Wood et al 2018). However, in order to create a reliable model, a training data set with known mutations is needed that contradicts the occurrence of somatic mutations. Thus, these methods are not portable to new data sets. One countermeasure is to develop a model pre-trained (Huang et al 2019a), or independently of a training data set. The latter was realized in paper IV with the application of general rules as prediction model. Rules are written in IF/THEN format. The single rule is simple; however, the combination of rules creates a more complex system, i.e. the prediction model.

Clustering methods With clustering, instances of a data set are divided into clusters aiming for a large distance among the clusters. Clustering methods can be structured in many ways; one is by cluster-membership: to what extent can an instance be a member of (i) only one (partitioning methods), (ii) none or one (density- based methods), or (iii) at least one cluster (fuzzy clustering). In hierarchical clustering, (iv), the instances and clusters form a hierarchy, and both can be a member of more than one cluster. Hierarchical clustering can be used to identify a structure among the instances. It further does not require a preset number of clusters like many other clustering algorithms do (e.g. k-means). Using bootstrapping, confidence scores are assigned to nodes mirroring the conformity of data and node. Hierarchical clustering can be applied to, for example, sequential mutation (SNVs, INDELs) frequencies corrected for structural variants (e.g. CNVs) to reconstruct a clonal structure of the subpopulations. If mutation frequency data are available for several time points, spatio-temporal heterogeneity can be inferred as shown in Figure 8 (Gundem et al 2015). Hierarchical clustering applied to the gene expression profiles of healthy and diseased cells in tumour samples mirrors their relationship, and reveals similarities and dissimilarities between them. This was utilized in papers I and II. Hierarchical clustering is performed in two steps: first, dissimilarities (equal to 1 ‒ similarities) between instances (vectors) are calculated based on a defined distance measure; second, the clusters are linked (merged or split) using a criterion. The result is commonly presented as a tree. Hierarchical clustering can be performed top-down or bottom-up.

43 Appropriate distance measure and linkage criteria are crucial for a reliable and robust result and depend on the data (Jaskowiak et al 2014). For example, it should be considered whether larger distances between data points gain less or more weight. Distance measures include the ‘traditional’ ones, e.g. Euclidean distance and Manhattan distance, and those based on correlation, e.g. Kendall distance and Pearson distance. The linkage criterion decides which instances are joined first (or split first for top-bottom approaches). Examples are average linkage and ward linkage (Ward 1963). The latter chooses the two instances that, if merged, reduce the variance among all instances less.

Dimensionality reduction methods The complexity and dependency of genotype and phenotype stored in biological data impede pattern recognition and computational analysis. The complexity of the molecular data is caused by, e.g., interrelated genes, related and redundant features, latent processes behind the data, noise, and inaccurate data collection. Methods that overcome this problem and reduce the data complexity can be found under the topic dimensionality reduction. They are applied, for example, to visualize data (t-SNE), to reduce the number of features characterizing the data (feature selection), or to identify hidden patterns such as molecular subtypes (PCA, factor analysis, linear discriminant analysis, neural networks). (Huang et al 2019b; Xu et al 2018) PCA, t-SNE, and a form of factor analysis that was especially developed for spatial transcriptomics data (termed STD) were utilized in papers I and II.

Factor analysis Using factor analysis, an additional layer containing information about a hidden structure is added to the data. The latent structure is described by underlying factors that steer the observed outputs in samples. Factor analysis can be used exploratively to find the best mathematical solution, or confirmatically to test which model confirms a hypothesis (Borkenau and Ostendorf 1990; Lewis 2017). In papers I and II, the explorative form was utilized. The final model links the factors to the observations (e.g. gene expression levels) as factor loading (e.g. gene score) and to the instances (e.g. samples) as weights (e.g. factor activities). Factor loadings and weights are calculated iteratively, starting with estimated parameters, e.g. maximum-likelihood estimation (Borkenau and Ostendorf 1990). STD differs slightly from factor analysis. The latter requires normally distributed data, and STD assumes a negative binomial distribution of the gene expression (Maaskola et al 2018). The explorative factors need to be annotated or interpreted manually. In papers I and II, the annotation of transcriptomic factors was supported by

44 hierarchical clustering, outlier gene detection using z-scores, and pathway annotation. Another disadvantage is that the number of factors needs to be set beforehand (Maaskola et al 2018). The most informative model contains the lowest number of factors explaining the data.

Principal component analysis PCA extracts latent components resulting from linear correlated features that were transformed into a set of data in a new coordinate system (Pearson 1901). For nonlinear problems, PCA cannot be applied; however, transforming the data into a higher dimensional space using kernels can circumvent the non-linearity (Schölkopf et al 1998). PCA is an iterative process of finding components (a vector, similar to linear regression) that describe the data. The first component is a vector, which describes all data best (which is defined by minimal variance around the vector). The second vector describes the variance in the data, which could not be explained by the first component etc. Finally, the components are ordered decreasingly by the extent to which data can be explained by a component. (Fodor 2002) Factor analysis and PCA are related. Both reduce dimensionality and reveal latent constructs. However, factor analysis is aimed at explaining the correlation among observations with a factor, whereas in PCA, the variance in the data is sought to be explained completely by the components. Moreover, factor analysis comprises several methods, whereas PCA names a specific method. Other methods used to discover latent structures include probabilistic topic models, semantic analysis, and latent Dirichlet allocation; these are mainly used in natural language processing and text mining,

T-distributed stochastic neighbor embedding t-SNE is a further development of SNE; however, it considers local and global structures within data and with increasing speed. Using t-SNE, data of higher dimensionality are reduced to a two-dimensional space. The method does not assume linear dependencies as PCA does. In contrast, t-SNE uses the similarity (conditional probability) between data points. The method creates a probability distribution using a normal distribution and a t- distribution to mirror the high-dimension probability distribution in the two- dimensional space. Optimizing the cost is performed by gradient descent.

Anomaly detection An anomaly can be any instance that deviates significantly from the model that the majority of instances follow (Zimek and Schubert 2017). An anomaly can be caused by, e.g., an inaccurate or improper data collection or an abnormal behaviour of an instance. The former mentioned is commonly removed from the data set because it disturbs further data processing or

45 decreases accuracy. In contrast, anomalies caused by abnormal behaviour contain information. z-score The z-score (eq 1), also referred to as the standard score, calculates for an instance the deviation from the standard. The z-score mirrors the deviation (distance) between instance value and standard. Applying the z-score requires normally distributed instances; the deviation is set to the standard deviation σ. x   z  i (1) i 

In papers I and II, the z-score was used to identify genes that have a higher score within a transcriptomic factor than in the remaining transcriptomic factors. This outlier gene characterizes a transcriptomic factor because the gene plays a more important role for this factor than for the other factors. The outlier genes of a factor were submitted for a pathway annotation, which revealed biological processes in which a factor is involved. c-score In paper III, a novel detector score for anomalies, the c-score (eq 2 and 3), was introduced to find mutated transcriptional processes. For gene pairs, the c-score locates abnormal transcription levels among otherwise normal transcription levels within a tissue sample, and indicates fusion transcripts.

5´gene 3´gene R5´  ; R3´  (2) 5´genes 3´genes

if R5´  R3´ c - score  R5´ (3)

otherwise c - score  R3´

Benchmarking

Predictions To evaluate the quality of a prediction, as performed in paper IV for the performance of MetaCNV and competitor callers, benchmarks from machine learning, i.e. used for classification and regression models, can be applied. The benchmark compares true and predicted values. In classification models, benchmarking is based on the confusion matrix, which divides the two-class classification into four cases: TP, TN, FP, and FN. A prominent measurer is Matthew’s correlation coefficient MCC (eq 4; Matthews 1975), which is used in paper IV:

46 TP  TN  FP  FN MCC  (4) (TP  FP)(TP  FN)(TN  FP)(TN  FN)

Other examples are sensitivity (also termed recall or TPR), which gives the ratio of correctly classified (TP) and total positive objects (TP + FN), specificity (also termed TNR), which gives the ratio of objects correctly rejected (TN) in the totality of actually negative objects (TN + FP), and FDR (ratio of incorrectly rejected (FN) objects to rejected objects (FN + TN)). To calculate measures for classifications with more than two classes, the values can be micro-averaged, e.g. TP = TPclass 1 + TPclass 2 + … + TPclass n , or macro-averaged, e.g. MCC = MCCclass 1 + MCCclass 2 + … + MCCclass n. In regression models, the difference between true and predicted value per instance (the residue) is calculated. This error is then averaged over all instances (MAE) or averaged over the squared errors of all instances (mean squared error, MSE). For the prediction of copy numbers in paper IV, a novel measure, the mean log ratio error (MLRE; eq 5), was introduced.

   1 N  x 1  MLRE   LR ; LR  abs ln  i  (5) N i i   xˆ 1 j  1   i 

This measure punishes outliers less, in contrast to MSE for example, where the error for each instance is squared before averaging. Another novel measure introduced with paper IV is the variance of residuals (eq 6), which calculates the variance of a predicted value to each true value separately before averaging. This measure is informative if distinct true values have to be predicted, and the differences need to be reflected in the predicted values.

1 M  1 N   2   z  z2  (6) Re s M   N  i  j1  j1 

47 48 Present investigations

Paper I: Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity

Summary and purpose The aim to identify transcriptomic diversity of prostate cancer arises from the insight that for different cancer types a disparate optimal treatment, and thus the chance of recovery, exists. One step towards this diversity map is to investigate cancer heterogeneity for one human with prostate cancer in depth to reveal co-expressed genes among subclones and among tumour clones of different disease states. This is done by comparing and analysing spatial transcriptomics data and transcriptomic profiles of different disease states within tissue samples. The challenge lies in the complexity of biological tissue samples, cell types, genes, mutation types, gene expression levels etc.

Author contributions The project was a team effort. Stefanie Friedrich, during her PhD candidature, was a shared primary author. In particular, her contributions comprise the differential gene expression analysis, outlier detection using z-scores followed by pathway annotation, the hierarchical clustering of factors, and the DNA analysis.

Novelty There is a lack of knowledge about the transcriptomic intra-tumour and inter-tumour diversity in prostate cancer, which is a multifocal disease with competing and organ-wide almost simultaneously occurring clones of different origins. The expression levels throughout an intact tissue sample were measured and compared with annotated morphological changes. We found that the area with active gene expression profile annotated as a cancer profile extended across the areas annotated morphologically as cancer. Further, the spatial dimension of the transcriptomic variation has not been studied in prostate cancer before. We show for the first time the gene expression diversity and activated pathways within a tumour clone annotated

49 in a clinical tissue sample. We identified the pathway protein processing in endoplasmic reticulum as being highly active in a tumour, and even more activated in the cancer centre than on its periphery.

50 Paper II: Spatio-temporal analysis of prostate tumours suggests the pre-existence of ADT- resistant expression clones

Summary and purpose Paper II builds on the findings in paper I and extends the investigation of tumour heterogeneity by adding the temporal dimension. The aim was to investigate changes in tissue-wide gene expression comprising cancer areas of different Gleason scores pre- and post-treatment to reveal the molecular processes leading to castration-resistant prostate cancer. In order to identify common processes among tumours of different Gleason scores within and among patients, the number of patients was increased to three. Similarly to paper I, spatial transcriptomics was applied to analyse clinical tissue samples from prostate cancer patients. Using spatial transcriptome decomposition, distinct expression profiles of different tissue samples and regions within the tissue samples were compared. The methods developed for paper I were also applied in paper II (e.g. outlier detection using z-scores and pathway annotation), and further developed, e.g. hierarchical clustering of transcriptomic factors.

Author contributions This project was a team effort. Stefanie Friedrich, during her PhD candidature, was a shared primary author. In particular, her contributions comprise the differential gene expression analysis, outlier detection using z-scores followed by pathway annotation, the hierarchical clustering of factors, and the findings that non-responding cells are located at the invasion front in the periphery and underwent an epithelial-mesenchymal transition, that the AR activity in the nucleus of non-responding cells is reduced before Androgen deprivation therapy (ADT) compared to responding cancer cells, and that ADT increases cell migration.

Novelty Distinct gene expression profiles matching resistant cell clusters were identified pre-treatment. These cells underwent a transformation from a solely epithelial phenotype to a hybrid of epithelial and mesenchymal phenotype before treatment when still part of the tumour clone. The transformed cancer cells exhibit lower androgen receptor expression levels before treatment and still express AR post-treatment. Further, the non- responsive cancer cells were already showing increased cell motility eight

51 weeks post-treatment. Our findings include a gene expression and pathway signature of non-responding cancer cells in prostate cancer.

52 Paper III: Fusion transcript detection using spatial transcriptomics

Summary and purpose Fusion transcripts, which contribute to cancer development and are involved in treatment resistance, were commonly studied in samples lacking high spatial resolution, e.g. bulk sequenced samples. In this paper, a novel method is proposed that uses the occurrence or presence of the poly(A) tail to indicate a fusion transcript. By applying this method to high-resolution data with spatial information, e.g. spatial transcriptomics, fusion transcripts can be localized. The results can be used to study the occurrence of fusion transcripts in healthy and diseased areas within tissue samples. STfusion further extends existing approaches.

Author contributions This project was a team effort. Stefanie Friedrich, during her PhD candidature, was the primary author.

Novelty We present a novel method for detecting fusion transcripts. The method is relatively independent from the depth of coverage and can thus be applied to high-resolution data with tendential lower coverage. Further, we present the novel c-score, which measures anomalies in transcription levels and indicates the higher probability of fusion transcript occurrence in clinical tissue samples on almost a single-cell level by using spatial transcriptomics data. This method, STfusion, and the use of the c-score enable the analysis of differential gene expression in areas with and without fusion transcript occurrence. We perform such an analysis and find that the fusion transcript occurrence correlates with the activation of pathways related to cell motility (regulation of actin cytoskeleton and focal adhesion) and treatment resistance (e.g PI3K-Akt and MAPK signalling). Finally, a non-canonical DNA structure, the i-Motif, was identified at the fusion point, suggesting an epigenetic mechanism causing the fusion transcript.

53 54 Paper IV: MetaCNV ‒ a consensus approach to infer accurate copy numbers from low coverage data

Summary and purpose This paper introduced MetaCNV, a tool for calling copy number variations from low-coverage data, e.g. resulting from high-resolution data. Low- coverage data are challenging because of the weak signal; this requires sensitive callers. MetaCNV is a hybrid of existing approaches. The MetaCNV model predicts based on rules copy numbers, thus it combines the strengths and bypasses the weaknesses of current approaches. Reliable copy numbers are, for example, needed to infer the clonal structure of cancer cell populations.

Author contributions This project was a team effort. Stefanie Friedrich, during her PhD candidature, was the primary author.

Context and novelty Although there are plenty of copy number callers, the demands placed on them are constantly changing. The recent advantages of sequencing technologies for producing whole genome data for single cells or a few cells require mutation callers that are able to handle the resulting low-coverage alignments. Predictive models applied to mutation calling require training data. We circumvent this problem by building the predictive model based on rules; this also avoids overfitting to the training data set. The prediction rules are based on the observation that different approaches are more appropriate for deletions or amplifications. We also provide a pipeline to evaluate copy number callers on cancer cell lines that is objective, easy to perform, transparent, and replicable. Further, we introduce two novel measures to benchmark a prediction based on a regression model, namely the mean log ratio error and the variance of residuals. The first measure does punish outliers less than other popular error measures (e.g. MSE). The latter calculates the variance depending on the true value. MetaCNV’s predictions were verified on single-cell sequenced genomes with an iteratively increased depth of coverage from 1x to 6x, but also on

55 simulated low-coverage and real high-coverage data. The MetaCNV predictive performance is robust and outputs reliable copy numbers.

56 General discussion

“A ship in port is safe, but that is not what ships are for. Sail out to sea and do new things.” – Grace Hopper, computer scientist

The series of four studies that this thesis comprises focus on the computational analysis of tumour heterogeneity. Papers I and II are about the investigation of tumour heterogeneity in clinical tissue samples of cancer patients. Papers III and IV provide novel tools to identify mutations also applicable to high-resolution data. Paper I aimed to investigate in-depth the spatial tumour heterogeneity of three tumour foci annotated in 12 samples taken from a prostate cancer patient. Additional diseased areas like prostatic intraepithelial neoplasia (PIN) and were identified within the tissue samples. For the first time, spatial gene expression and the underlying gene expression profiles of different disease states within and among tumours were revealed. These comprehensive data opened up new opportunities to study tumour heterogeneity on the expression level within a tumour. The three tumour foci with different Gleason scores (Gs 6 and 7) but also of different sizes were compared concerning their expression level in the centre and the periphery. All three were characterised by pathways related to high transcription stress in the centre and cell motility in the periphery. Except for the largest tumour clone, here the pathways linked to cell motility were activated in both centre and periphery. These results suggest that tumour gene expression heterogeneity is dynamic. The study of CNVs revealed that genes with amplified or deleted regions are expressed in smaller areas within the tissue sections creating an additional layer of heterogeneity. The copy number analysis further showed that deletions and amplifications were spread locally, and that tumour foci arise from independent origins, each with a unique mutation signature. The latter is inline with previous studies on multifocal prostate cancer. After an initial response to a treatment, cancer recurrence is currently one of the biggest challenges. In paper II, non-responsiveness to Androgen deprivation therapy among tumour clones was investigated. A major role in this study played the diversity in gene expression and pathway activation within and among tumours of different Gleason scores and the spatio-

57 temporal evolution of treatment-resistant cancer cells under a course of treatment. It was shown that non-responding cancer cells existed already pre- treatment and that they can be identified by their expression profile and androgen receptor activity in the cancer cells’ nucleus. The non-responding cancer cells underwent an epithelial-mesenchymal transition before the onset of treatment. Using the proposed gene signature of non-responding cancer cells in prostate cancer opens up the possibility of testing for potential non- responsiveness pre-treatment but also of developing improved treatment strategies. Papers I, II, and III also highlight the importance of high-resolution data concerning both the transcriptome and the genome. High-resolution information is the key to understanding the processes within and among tumours. For example, it was shown that a tumour has a different spectrum of gene expression levels in the centre and the periphery. Further, non- responding cancer cells tend to be at the invasion front on the periphery of the tumour. Using STfusion, a transcription-induced fusion transcript, cis- SAGe SLC45A3-ELK4 was located on either the periphery or at the centre of a tumour. The occurrence of cis-SAGe SLC45A3-ELK4 was discussed to be caused by a non-canonical DNA structure categorized as an epigenetic mechanism. These results emphasize again the use of a systems approach to gain a more complex understanding of the involved abnormal processes behind tumour evolution. The link between mutations and cancer development is not questioned. However, there is a lack of computational methods for identifying reliable mutations even at smaller frequencies in sequenced samples and in high- resolution data comprising only a few or single cells. Cancer treatment decisions and treatment resistance are related to the mutational signature of cancer cells, which emphasize the importance of reliable mutation calling. With detailed information about the mutational signatures, tumour heterogeneity can be investigated thoroughly but also structured by, e.g. mutually exclusive and co-occurring mutations. With MetaCNV and STfusion, two of the four key contributions of this thesis, mutation detection methods are provided that can infer reliable copy numbers and fusion transcripts, respectively, from high-resolution data. With the proposed methods, information about tumour heterogeneity can be derived. Finally, a standardized way to evaluate mutation callers, comprising data sets and benchmarks, would facilitate the choice of an appropriate mutation caller. In paper IV, a suggestion was made to evaluate copy number callers on experimentally confirmed copy numbers in cancer cell lines with different coverages. Further, two novel error measures for regression problems, namely MLRE and variance of residuals, were provided. MLRE punishes outliers less than common measure (e.g. MSE). An evaluation pipeline, as suggested in paper IV, is transparent, easy to perform, and replicable.

58 Conclusions and recommendations The challenges, but also the opportunities ahead, are enormous. For example, reliable and accurate determination of spatial and temporal variations in clinical tissue samples would reveal effects on underlying processes. However, such essential similarities and context-specific differences can only be investigated in larger sample sizes harbouring diverse instances. In paper I, several tissue samples from one prostate cancer patient were studied comprehensively. In paper II, the number of prostate cancer patients was increased to three and the time dimension was added. The quality of the spatio-temporal pattern of tumour heterogeneity would improve if future studies comprise the data from several time points taken from more patients. In papers I and II, the diseased areas (cancer clones with different Gleason scores, PIN areas, inflammation) were investigated thoroughly. The tumour micro-environment was also considered in these analyses. However, a comprehensive study of the tumour micro-environment and its influences on tumour evolution, especially on tumour progression, would extend our understanding of cancer development. The tumour micro-environment is more diverse, which prevents current computational methods from identifying the underlying processes. The identification of mutational and gene expression signatures enables the development of predictive biomarkers. Further, categorizing tumour heterogeneity enables conclusion to be drawn about its extent from incomplete data, e.g. needle biopsies that cover only a small section of a tumour area. One way to facilitate the identification of mutational signatures is to extend the idea behind MetaCNV, i.e. combining callers with a consensus approach, to other types of variants and the inclusion of more callers. Further, MetaCNV can be extended to other organisms; currently it is only applicable for human. Finally, a systems approach to reveal novel knowledge of spatio-temporal tumour heterogeneity comprising multi-omics data and a wide range of alteration types is emphasized (Baudot et al 2010; Luo et al 2009). For example, extending pathway annotation for spatial multi-omics data (e.g. mutated and dysregulated genes) enables the identification of hub processes and their dependencies in tumour development. Further, adding the temporal aspect into the analysis facilitates the study of the phenotype change (phenotypic plasticity), which is observed in the later stages of tumour evolution. Thirdly, a systems view should incorporate the tumour micro- environment.

59 60 Summary in Swedish

Cancer är en komplex sjukdom, där kombinationer av avvikelser leder till utveckling av en odödlig cancercell. Efter århundraden av cancerforskning finns det fortfarande inget botemedel. Detta beror bland annat på att varje tumör är unik. Även om sjukdomen har samma namn döljer den en sjukdom som är unik för varje drabbad individ. För att göra saken värre, finns det även inom en tumör olika celler som uppträder olika. Tumörens tillstånd och dess sammansättning ändras också när tumören utvecklas. Denna skillnad mellan tumörer, inom tumörer men också över tid kallas tumörheterogenitet. För att utveckla lämpliga botemedel och terapier som tar hänsyn till tumörernas egenart men också patienternas individualitet, hjälper det att förstå orsakerna och sammanhanget för denna tumörheterogenitet. Datorstödda metoder och algoritmer hjälper till att kartlägga, analysera och förstå sjukdomens komplexitet med det övergripande målet att hjälpa patienter. Detta arbete innehåller fyra publikationer om ämnet tumörheterogenitet. I två av dem analyseras tumörheterogenitet hos prostatacancerpatienter i detalj. Data producerades med en ny teknik, Spatial Transcriptomics, med vilken genuttrycksintensitet för färre celler i ett vävnadsprov kan bestämmas jämfört med konventionella teknologier. Resultaten visar att variationen i genuttryck men också omfattningen av de drabbade områdena var mycket större man tidigare trott. Dessutom kunde vi visa att vissa cancerceller förändrades i så omfattande utsträckning att de inte svarade på behandling, vilket ledde till terapiresistens. Vi kunde också visa att den tillämpade kemiska terapin ledde till spridning av cancerceller i kroppen och förmodligen främjar utvecklingen av metastaser. De andra två publikationerna var för sig avser programvara som kan användas för att identifiera mutationer i vävnadsprover. Mutationer anses vara drivkraften bakom cancerutvecklingen. Den möjliga upplösningen av sekvenseringsdata har ökat avsevärt de senaste åren. Man kan nu till och med sekvensera DNA från få eller till och med enskilda celler. Detta ökar vår kunskap om tumörheterogenitet betydligt, men innebär också nya utmaningar för datorstödda metoder och algoritmer. Å ena sidan är mindre information tillgänglig per cellkluster för att få en pålitlig signal, å andra sidan är variansen hos signalerna mycket större. Den publicerade programvaran erbjuder lösningar som identifierar en pålitlig signal i högupplösta data.

61 Detta arbete visar också att komplexa problem som cancer endast kan lösas tvärvetenskapligt, det vill säga tillsammans, ur olika perspektiv och från olika tillvägagångssätt.

62 Acknowledgements

“In nature, nothing exists alone.” – Rachel Carson, biologist

In the beginning it was just an idea that inspired and motivated me to master the immense work day after day and despite the missteps. Now it is almost done, and I am filled with joy and gratitude to see how lucky I was to have had so much support and benevolence.

My deepest gratitude and appreciation go to my PhD supervisor, Professor Erik Sonnhammer. You gave me the opportunity to work on these exciting projects and the chance to prove myself. I thank you for your support and guidance, and the trust and freedom that you have given me to develop my research skills. I also thank my co-supervisor Professor Thomas Helleday. I thank you for supporting me and placing great trust in my ability to master projects and additional collaborations. Professor Pia Ädelroth, Professor Peter Brzezinski, and Professor Stefan Nordlund, thank you for your helpful advice, insightful discussions, and for accompanying me during my PhD study. I thank the members of the DIRINTH project group: Professors Thomas Helleday, Erik Sonnhammer, Joakim Lundeberg, Charles Swanton, Eva Grönroos, Dr Firas Al-Ubaidi, the project leader Dr Niklas Schultz, Dr Jonas Maaskola, Emelie Berglund, and Maja Maklund. We have done pioneering research work and I am very proud to have been a part of it. Our project work plays a major and fundamental role in this thesis. I thank you all for the inspiring and constructive cooperation. Dr Ulrika Warpman Berglund and Ingrid Almlöf, I am grateful for the insightful collaboration work. My thanks also goes to all current and former colleagues. We worked together for some years and shared the ups and downs of a PhD study. I am extremely grateful for the uncountable scientific and non-scientific discussions, laughter, and encouragement. I would especially like to thank Emma, Thomas, and Dimitri, as well as my former colleagues Andreas, Christoph, Daniel, and Mateusz. Many thanks too to my fellow sunseeker Sana for the countless exciting talks and your support. Thank you, Daniel J

63 and Lucie D., for your beautiful humour and kindness. I would like to thank the students I supervised for their enthusiasm and eagerness: Remus, Shuhan, Fuxi, Bo, Kyle, Patrick, and Alvin.

A deep gratitude goes to my dear family, who always supported me, stood by me, and did not leave me, even though I lived far away from home and visits were few and far between. You always welcomed me warmly and gave me a good time so that I could devote myself to my work when back in Stockholm. Special thanks to my dear father Ulrich. I am eternally grateful for your support, infinite encouragement, and for teaching me to think outside the box. A special thanks also goes to Christian and Petra. I am extremely grateful for the empathy, positive attitude, and hospitality that I received from you. I also thank my dear brother Dieter, Anke, Maria and Josephine. I am forever grateful to my dear and deeply missed grandparents. Many thanks also to Sybille, Klaus, and Janina; thank you for always welcoming me warmly and supporting me. Thank you very much, Carmen, Stefan, Phyllip, Lina and Lenny, for the caring hours at home. My gratitude to Bärbel, Rainer, and Michael. You always treated me as part of your family and helped me whenever I needed it. Erik D., I am extremely grateful for your caring motivation during the final weeks of my study. A thanks also goes to Guenther, Sven, Hannes and Ulrike. I really appreciate our friendship. Thank you very much, Dr Helga and Karl Schuck. Your professional and many years of experience in psychologically preparing for competitions has helped me a lot in the last years. I thank Professor Walter Knöchel for the kind advice and support during my study.

The swimming club, which I have been attending since my arrival in Sweden, plays a very important role in my Swedish life. The people at the club are my family here in Sweden, and helped me to stay cheery and healthy especially over the last four years. Special thanks to Peter, Magnus, Gittan, and Julia. Peter, you supported me enormously and saved my life a couple of times. Thank you very much for your kindness and selfless help. Magnus, we shared many many swimming sessions and countless inspiring conversations. Gittan, we have spent a lot of time together and I am very grateful to be your friend. Thank you and Lasse too for helping me with the Swedish translations. Coach Julia, thank you for challenging me every week beyond my limits. I would also like to thank Philippa, Birgitta and Hakan, Dominika, Johanna, Lyaila, Max and Robert, Nada, Mikael, Therese, Tong, Sofia, Chantel, Delenn, Anna, Oxana, Pavel, Masha & Sergey, Shannon, Arianna, Solvejg, and Michaela. Countless times I was questioning, but on my way home from our swimming session I used to feel that everything would be fine as long as I continued meeting you at the pool.

64 I would particularly like to thank Anette and Louise. It was only with your help that I was able to gain a foothold in Sweden, and to study. Anette, thank you very much for giving me the chance to become a ‘researcher’. Louise, I shall never forget our ‘fikas runt hörnet’ and the inspiring talks.

Osnat Tazdok, I am grateful for the beautiful painting.

65 References

Adamo P, Ladomery MR (2016) The oncogene ERG: a key factor in prostate cancer. Oncogene 35:403–414. doi: 10.1038/onc.2015.109 Aguirre A, Ostfeld RS, Tabor GM, House C, Pearl MC (2002) Conservation Medicine: Ecological Health in Practice. https://books.google.se/books/about/Conservation_Medicine.html?id=N2bRCw AAQBAJ&redir_esc=y. Accessed 27 Nov 2019 Aiello NM, Kang Y (2019) Context-dependent EMT programs in cancer metastasis. J Exp Med 216:1016–1026. doi: 10.1084/jem.20181827 Al-Hajj M, Wicha MS, Benito-Hernandez A, Morrison SJ, Clarke MF (2003) Prospective identification of tumorigenic breast cancer cells. Proc Natl Acad Sci USA 100:3983–3988. doi: 10.1073/pnas.0530291100 Alizadeh AA, Aranda V, Bardelli A, Blanpain C, Bock C, Borowski C, Caldas C, Califano A, Doherty M, Elsner M, Esteller M, Fitzgerald R, Korbel JO, Lichter P, Mason CE, Navin N, Pe’er D, Polyak K, Roberts CWM, Siu L, Zucman- Rossi J (2015) Toward understanding and exploiting tumor heterogeneity. Nat Med 21:846–853. doi: 10.1038/nm.3915 Alkhazraji A, Elgamal M, Ang SH, Shivarov (2019) All cancer hallmarks lead to diversity. Int J Clin Exp Med 12:132–157. Alzforum (2017) Gene Expression Map of Human Body Gives Value to Variants. https://www.alzforum.org/news/research-news/gene-expression-map-human- body-gives-value-variants. Accessed 16 Nov 2019 American Cancer Society ACS (2018) Early History of Cancer . In: Early History of Cancer. https://www.cancer.org/cancer/cancer-basics/history-of-cancer/what- is-cancer.html. Accessed 17 Nov 2019 Balkwill FR, Capasso M, Hagemann T (2012) The at a glance. J Cell Sci 125:5591–5596. doi: 10.1242/jcs.116392 Barcellos-Hoff MH, Lyden D, Wang TC (2013) The evolution of the cancer niche during multistage carcinogenesis. Nat Rev Cancer 13:511–518. doi: 10.1038/nrc3536 Baudot A, de la Torre V, Valencia A (2010) Mutated genes, pathways and processes in tumours. EMBO Rep 11:805–810. doi: 10.1038/embor.2010.133 Behjati S, Tarpey PS (2013) What is next generation sequencing? Arch Dis Child Educ Pract Ed 98:236–238. doi: 10.1136/archdischild-2013-304340 Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57:289–300. doi: 10.1111/j.2517- 6161.1995.tb02031.x Berglund E, Maaskola J, Schultz N, Friedrich S, Marklund M, Bergenstråhle J, Tarish F, Tanoglidi A, Vickovic S, Larsson L, Salmén F, Ogris C, Wallenborg K, Lagergren J, Ståhl P, Sonnhammer E, Helleday T, Lundeberg J (2018a)

66 Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity. Nat Commun 9:2419. doi: 10.1038/s41467-018-04724-5 Berglund E, Maaskola J, Schultz N, Friedrich S, Marklund M, Bergenstråhle J, Tarish F, Tanoglidi A, Vickovic S, Larsson L, Salmén F, Ogris C, Wallenborg K, Lagergren J, Ståhl P, Sonnhammer E, Helleday T, Lundeberg J (2018b) Spatial Maps of Prostate Cancer Transcriptomes Reveal an Unexplored Landscape of Heterogeneity. Supplementary information. Nat Commun 9:2419. doi: 10.1038/s41467-018-04724-5 Berk A, Zipursky SL, Matsudaira P, David B, Darnell J (1999) Molecular Cell Biology, 4th edn. 36. Bernstein BE, Meissner A, Lander ES (2007) The mammalian epigenome. Cell 128:669–681. doi: 10.1016/j.cell.2007.01.033 Blackburn A, Almeida M, Dean A, Curran JE, Johnson MP, Moses EK, Abraham LJ, Carless MA, Dyer TD, Kumar S, Almasy L, Mahaney MC, Comuzzie A, Williams-Blangero S, Blangero J, Lehman DM, Göring HHH (2015) Effects of copy number variable regions on local gene expression in white blood cells of Mexican Americans. Eur J Hum Genet 23:1229–1235. doi: 10.1038/ejhg.2014.280 Bleidorn C (2016) Third generation sequencing: technology and its potential impact on evolutionary biodiversity research. Systematics and Biodiversity 14:1–8. doi: 10.1080/14772000.2015.1099575 Blot WJ, Tarone RE (2015) Doll and Peto’s quantitative estimates of cancer risks: holding generally true for 35 years. J Natl Cancer Inst. doi: 10.1093/jnci/djv044 Bolnick DI, Svanbäck R, Araújo MS, Persson L (2007) Comparative support for the niche variation hypothesis that more generalized populations also are more heterogeneous. Proc Natl Acad Sci USA 104:10075–10079. doi: 10.1073/pnas.0703743104 Bolnick DI, Svanbäck R, Fordyce JA, Yang LH, Davis JM, Hulsey CD, Forister ML (2003) The ecology of individuals: incidence and implications of individual specialization. Am Nat 161:1–28. doi: 10.1086/343878 Bonferroni, C (1936) Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8:3–62. Bonnet D, Dick JE (1997) Human acute myeloid is organized as a hierarchy that originates from a primitive hematopoietic cell. Nat Med 3:730– 737. doi: 10.1038/nm0797-730 Borkenau P, Ostendorf F (1990) Comparing exploratory and confirmatory factor analysis: A study on the 5-factor model of personality. Pers Individ Dif 11:515– 524. doi: 10.1016/0191-8869(90)90065-Y Brand RA (2010) Biographical sketch: Otto Heinrich Warburg, PhD, MD. Clin Orthop Relat Res 468:2831–2832. doi: 10.1007/s11999-010-1533-z Britt K, Ashworth A, Smalley M (2007) Pregnancy and the risk of breast cancer. Endocr Relat Cancer 14:907–933. doi: 10.1677/ERC-07-0137 Brodin J, Mild M, Hedskog C, Sherwood E, Leitner T, Andersson B, Albert J (2013) PCR-induced transitions are the major source of error in cleaned ultra-deep pyrosequencing data. PLoS ONE 8:e70388. doi: 10.1371/journal.pone.0070388 Burgess DJ (2019) Spatial transcriptomics coming of age. Nat Rev Genet 20:317. doi: 10.1038/s41576-019-0129-z

67 Burrows M, Wheeler DJ (1994) A Block-sorting Lossless Data Compression Algorithm. Bussey KJ, Cisneros LH, Lineweaver CH, Davies PCW (2017) Ancestral gene regulatory networks drive cancer. Proc Natl Acad Sci USA 114:6160–6162. doi: 10.1073/pnas.1706990114 Campbell PJ, Pleasance ED, Stephens PJ, Dicks E, Rance R, Goodhead I, Follows GA, Green AR, Futreal PA, Stratton MR (2008) Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci USA 105:13081–13086. doi: 10.1073/pnas.0801523105 Cancer Genome Atlas Research Network (2015) The molecular taxonomy of primary prostate cancer. Cell 163:1011–1025. doi: 10.1016/j.cell.2015.10.025 Cancer Research (2019) 75th Anniversary Celebration - Nobel Laureates. https://cancerres.aacrjournals.org/site/misc/75th_anniversary/CR75_nobel_laure ates.html. Accessed 17 Nov 2019 Caswell DR, Swanton C (2017) The role of tumour heterogeneity and clonal cooperativity in metastasis, immune evasion and clinical outcome. BMC Med 15:133. doi: 10.1186/s12916-017-0900-y Cheng L, Song SY, Pretlow TG, Abdul-Karim FW, Kung HJ, Dawson DV, Park WS, Moon YW, Tsai ML, Linehan WM, Emmert-Buck MR, Liotta LA, Zhuang Z (1998) Evidence of independent origin of multiple tumors from patients with prostate cancer. J Natl Cancer Inst 90:233–237. doi: 10.1093/jnci/90.3.233 Chen J, Suo S, Tam PP, Han J-DJ, Peng G, Jing N (2017) Spatial transcriptomic analysis of cryosectioned tissue samples with Geo-seq. Nat Protoc 12:566–580. doi: 10.1038/nprot.2017.003 Clevers H (2013) The intestinal crypt, a prototype stem cell compartment. Cell 154:274–284. doi: 10.1016/j.cell.2013.07.004 Clune J, Mouret J-B, Lipson H (2013) The evolutionary origins of modularity. Proc Biol Sci 280:20122863. doi: 10.1098/rspb.2012.2863 Collier O, Stoven V, Vert J-P (2019) LOTUS: A single- and multitask machine learning algorithm for the prediction of cancer driver genes. PLoS Comput Biol 15:e1007381. doi: 10.1371/journal.pcbi.1007381 Cooper GM, Hausman RE (2007) The Cell: A Molecular Approach, Fourth Edition, 4th edn. 745. Costa-Silva J, Domingues D, Lopes FM (2017) RNA-Seq differential expression analysis: An extended review and a software tool. PLoS ONE 12:e0190152. doi: 10.1371/journal.pone.0190152 Crick F (1970) Central dogma of molecular biology. Nature 227:561–563. doi: 10.1038/227561a0 Crosetto N, Bienko M, van Oudenaarden A (2015) Spatially resolved transcriptomics and beyond. Nat Rev Genet 16:57–66. doi: 10.1038/nrg3832 Cruz JA, Wishart DS (2007) Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2:59–77. doi: 10.1177/117693510600200030 Cun Y, Yang T-P, Achter V, Lang U, Peifer M (2018) Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust. Nat Protoc 13:1488–1501. doi: 10.1038/nprot.2018.033 Datta S, Malhotra L, Dickerson R, Chaffee S, Sen CK, Roy S (2015) Laser capture microdissection: Big data from small samples. Histol Histopathol 30:1255–1269. doi: 10.14670/HH-11-622

68 Demissie M, Mascialino B, Calza S, Pawitan Y (2008) Unequal group variances in microarray data analyses. Bioinformatics 24:1168–1174. doi: 10.1093/bioinformatics/btn100 DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498. doi: 10.1038/ng.806 Dewey CN (2019) Whole-Genome Alignment. Methods Mol Biol 1910:121–147. doi: 10.1007/978-1-4939-9074-0_4 Dey SS, Kester L, Spanjaard B, Bienko M, van Oudenaarden A (2015) Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol 33:285– 289. doi: 10.1038/nbt.3129 Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21. doi: 10.1093/bioinformatics/bts635 Domazet-Loso T, Tautz D (2010) Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol 8:66. doi: 10.1186/1741-7007-8-66 El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427–D432. doi: 10.1093/nar/gky995 Elliott RL, Jiang XP, Head JF (2012) Mitochondria organelle transplantation: introduction of normal epithelial mitochondria into human cancer cells inhibits proliferation and increases drug sensitivity. Breast Cancer Res Treat 136:347– 354. doi: 10.1007/s10549-012-2283-2 Eng C-HL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, Yun J, Cronin C, Karp C, Yuan G-C, Cai L (2019) Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature 568:235–239. doi: 10.1038/s41586-019-1049- y Epstein JI (2010) An update of the Gleason grading system. J Urol 183:433–440. doi: 10.1016/j.juro.2009.10.046 Erpenbeck L, Schön MP (2010) Deadly allies: the fatal interplay between platelets and metastasizing cancer cells. Blood 115:3427–3436. doi: 10.1182/blood- 2009-10-247296 Everitt B (2006) The Cambridge dictionary of statistics. Cambridge University Press Faguet GB (2015) A brief history of cancer: age-old milestones underlying our current knowledge database. Int J Cancer 136:2022–2036. doi: 10.1002/ijc.29134 Ferragina P, Manzini G (2000) Opportunistic data structures with applications. Proceedings 41st Annual Symposium on Foundations of Computer Science. IEEE Comput. Soc, pp 390–398 Fidler IJ (1978) Tumor heterogeneity and the biology of cancer invasion and metastasis. Cancer Res 38:2651–2660. Fischer A, Vázquez-García I, Illingworth CJR, Mustonen V (2014) High-definition reconstruction of clonal composition in cancer. Cell Rep 7:1740–1752. doi: 10.1016/j.celrep.2014.04.055

69 Fisher RA (1992) Statistical methods for research workers. In: Kotz S, Johnson NL (eds) Breakthroughs in Statistics. Springer, New York, pp 66–70 Fisher RA (1922) On the Interpretation of χ 2 from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85:87. doi: 10.2307/2340521 Fodor IK (2002) A survey of dimension reduction techniques. doi: 10.2172/15002155 Frame FM, Noble AR, Klein S, Walker HF, Suman R, Kasprowicz R, Mann VM, Simms MS, Maitland NJ (2017) Tumor heterogeneity and therapy resistance - implications for future treatments of prostate cancer. JCMT 3:302. doi: 10.20517/2394-4722.2017.34 Frydrych LM, Ulintz P, Bankhead A, Sifuentes C, Greenson J, Maguire L, Irwin R, Fearon ER, Hardiman KM (2019) Rectal cancer sub-clones respond differentially to neoadjuvant therapy. Neoplasia 21:1051–1062. doi: 10.1016/j.neo.2019.08.004 Gajewski M (2016) Everything you really need to know about DNA sequencing. In: Cancer Research UK. https://scienceblog.cancerresearchuk.org/2016/04/25/everything-you-really- need-to-know-about-dna-sequencing/. Accessed 5 Dec 2019 Gay L, Baker A-M, Graham TA (2016) Tumour Cell Heterogeneity. [version 1; peer review: 5 approved]. F1000Res. doi: 10.12688/f1000research.7210.1 GeneTex (2019) Epithelial-Mesenchymal Transition. In: Epithelial-Mesenchymal Transition. https://www.genetex.com/Research/Overview/cancer/Epithelial- Mesenchymal_Transition. Accessed 20 Jan 2020 Gerlinger M, Rowan AJ, Horswell S, Math M, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, Varela I, Phillimore B, Begum S, McDonald NQ, Butler A, Jones D, Raine K, Latimer C, Santos CR, Swanton C (2012) Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 366:883–892. doi: 10.1056/NEJMoa1113205 Gerstung M, Jolly C, Leshchiner I, Dentro SC, Gonzalez S, Rosebrock D, Mitchell TJ, Rubanova Y, Anur P, Yu K, Tarabichi M, Deshwar A, Wintersinger J, Kleinheinz K, Vázquez-García I, Haase K, Jerman L, Sengupta S, Macintyre G, Malikic S, PCAWG Consortium (2020) The evolutionary history of 2,658 cancers. Nature 578:122–128. doi: 10.1038/s41586-019-1907-7 Giraudeau M, Sepp T, Ujvari B, Renaud F, Tasiemski A, Roche B, Capp J-P, Thomas F (2019) Differences in mutational processes and intra-tumour heterogeneity between organs: The local selective filter hypothesis. Evol Med Public Health 2019:139–146. doi: 10.1093/emph/eoz017 Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A (2012) EnrichNet: network-based gene set enrichment analysis. Bioinformatics 28:i451–i457. doi: 10.1093/bioinformatics/bts389 Guala D, Ogris C, Müller N, Sonnhammer ELL (2019) Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinformatics. doi: 10.1093/bib/bbz064 Gundem G, Van Loo P, Kremeyer B, Alexandrov LB, Tubio JMC, Papaemmanuil E, Brewer DS, Kallio HML, Högnäs G, Annala M, Kivinummi K, Goody V, Latimer C, O’Meara S, Dawson KJ, Isaacs W, Emmert-Buck MR, Nykter M,

70 Foster C, Kote-Jarai Z, Bova GS (2015) The evolutionary history of lethal metastatic prostate cancer. Nature 520:353–357. doi: 10.1038/nature14347 Hajdu SI (2011) A note from history: landmarks in history of cancer, part 1. Cancer 117:1097–1102. doi: 10.1002/cncr.25553 Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144:646–674. doi: 10.1016/j.cell.2011.02.013 Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100:57–70. doi: 10.1016/s0092-8674(00)81683-9 Hanauer D, Rhodes D, Sinha-Kumar C, Chinnaiyan A (2007) Bioinformatics Approaches in the Study of Cancer. Curr Mol Med 7:133–141. doi: 10.2174/156652407779940431 Hastings PJ, Lupski JR, Rosenberg SM, Ira G (2009) Mechanisms of change in gene copy number. Nat Rev Genet 10:551–564. doi: 10.1038/nrg2593 Helleday T, Eshtad S, Nik-Zainal S (2014) Mechanisms underlying mutational signatures in human cancers. Nat Rev Genet 15:585–598. doi: 10.1038/nrg3729 Hong MKH, Macintyre G, Wedge DC, Van Loo P, Patel K, Lunke S, Alexandrov LB, Sloggett C, Cmero M, Marass F, Tsui D, Mangiola S, Lonie A, Naeem H, Sapre N, Phal PM, Kurganovs N, Chin X, Kerger M, Warren AY, Hovens CM (2015) Tracking the origins and drivers of subclonal metastatic expansion in prostate cancer. Nat Commun 6:6605. doi: 10.1038/ncomms7605 Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA (2003) Identifying biological themes within lists of genes with EASE. Genome Biol 4:R70. doi: 10.1186/gb-2003-4-10-r70 Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8:R183. doi: 10.1186/gb-2007-8-9-r183 Huang W, Guo YA, Muthukumar K, Baruah P, Chang MM, Jacobsen Skanderup A (2019a) SMuRF: portable and accurate ensemble prediction of somatic mutations. Bioinformatics 35:3157–3159. doi: 10.1093/bioinformatics/btz018 Huang X, Wu L, Ye Y (2019b) A review on dimensionality reduction techniques. Int J Patt Recogn Artif Intell 33:1950017. doi: 10.1142/S0218001419500174 Jaskowiak PA, Campello RJGB, Costa IG (2014) On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics 15 Suppl 2:S2. doi: 10.1186/1471-2105-15-S2-S2 Jassal B (2011) Pathway annotation and analysis with Reactome: the solute carrier class of membrane transporters. Hum Genomics 5:310–315. doi: 10.1186/1479- 7364-5-4-310 Jeggari A, Alexeyenko A (2017) NEArender: an R package for functional interpretation of “omics” data via network enrichment analysis. BMC Bioinformatics 18:118. doi: 10.1186/s12859-017-1534-y Jensen E (2014) Technical review: In situ hybridization. Anat Rec (Hoboken) 297:1349–1353. doi: 10.1002/ar.22944 Kaestner KH, Knöchel W, Martinez DE (2000) Unified nomenclature for the winged helix/forkhead transcription factors. Genes Dev 14:142–146. Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q, McMichael JF, Wyczalkowski MA, Leiserson MDM, Miller CA, Welch JS, Walter MJ, Wendl MC, Ley TJ, Wilson RK, Raphael BJ, Ding L (2013)

71 Mutational landscape and significance across 12 major cancer types. Nature 502:333–339. doi: 10.1038/nature12634 Kanehisa M (2019) Toward understanding the origin and evolution of cellular organisms. Protein Sci 28:1947–1951. doi: 10.1002/pro.3715 Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32:D277-80. doi: 10.1093/nar/gkh063 Kassen R (2002) The experimental evolution of specialists, generalists, and the maintenance of diversity. J Evol Biol 15:173–190. doi: 10.1046/j.1420- 9101.2002.00377.x Ke R, Mignardi M, Pacureanu A, Svedlund J, Botling J, Wählby C, Nilsson M (2013) In situ sequencing for RNA analysis in preserved tissue and cells. Nat Methods 10:857–860. doi: 10.1038/nmeth.2563 Khatri P, Sirota M, Butte AJ (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8:e1002375. doi: 10.1371/journal.pcbi.1002375 Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36. doi: 10.1186/gb-2013-14-4-r36 Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161:1187–1201. doi: 10.1016/j.cell.2015.04.044 Knudson AG (1985) Hereditary cancer, oncogenes, and antioncogenes. Cancer Res 45:1437–1443. Kolar CS, Lodge DM (2001) Progress in invasion biology: predicting invaders. Trends Ecol Evol (Amst) 16:199–204. doi: 10.1016/s0169-5347(01)02101-2 Kolmogorov A (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn Inst Italiano Attuari 4:83–91. Komurov K, Ram PT (2010) Patterns of human gene expression variance show strong associations with signaling network hierarchy. BMC Syst Biol 4:154. doi: 10.1186/1752-0509-4-154 Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI (2015) Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 13:8–17. doi: 10.1016/j.csbj.2014.11.005 Kruskal WH, Wallis WA (1952) Use of Ranks in One-Criterion Variance Analysis. J Am Stat Assoc 47:583–621. doi: 10.1080/01621459.1952.10483441 Kumar-Sinha C, Kalyana-Sundaram S, Chinnaiyan AM (2012) SLC45A3-ELK4 chimera in prostate cancer: spotlight on cis-splicing. Cancer Discov 2:582–585. doi: 10.1158/2159-8290.CD-12-0212 Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. doi: 10.1038/nmeth.1923 Larkin J, Chiarion-Sileni V, Gonzalez R, Grob J-J, Rutkowski P, Lao CD, Cowey CL, Schadendorf D, Wagstaff J, Dummer R, Ferrucci PF, Smylie M, Hogg D, Hill A, Márquez-Rodas I, Haanen J, Guidoboni M, Maio M, Schöffski P, Carlino MS, Wolchok JD (2019) Five-Year Survival with Combined Nivolumab and Ipilimumab in Advanced Melanoma. N Engl J Med 381:1535–1546. doi: 10.1056/NEJMoa1910836

72 Lee D-W, Han S-W, Bae JM, Jang H, Han H, Kim H, Bang D, Jeong S-Y, Park KJ, Kang GH, Kim T-Y (2019) Tumor Mutation Burden and Prognosis in Patients with Colorectal Cancer Treated with Adjuvant Fluoropyrimidine and Oxaliplatin. Clin Cancer Res 25:6141–6147. doi: 10.1158/1078-0432.CCR-19- 1105 Lehmann U (2010) [DNA methylation. From basic research to routine diagnostics]. Pathologe 31 Suppl 2:274–279. doi: 10.1007/s00292-010-1300-7 Lewis TF (2017) Evidence regarding the internal structure: confirmatory factor analysis. Measurement and Evaluation in Counseling and Development 50:239– 247. doi: 10.1080/07481756.2017.1336929 Lindahl T (1993) Instability and decay of the primary structure of DNA. Nature 362:709–715. doi: 10.1038/362709a0 Lind MI, Spagopoulou F (2018) Evolutionary consequences of epigenetic inheritance. Heredity 121:205–209. doi: 10.1038/s41437-018-0113-y Liotta LA, Kohn EC (2003) Cancer’s deadly signature. Nat Genet 33:10–11. doi: 10.1038/ng0103-10 Little P, Lin D-Y, Sun W (2019) Associating somatic mutations to clinical outcomes: a pan-cancer study of survival time. Genome Med 11:37. doi: 10.1186/s13073- 019-0643-9 Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324 Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–1967. doi: 10.1093/bioinformatics/btp336 Li Y, Roberts ND, Wala JA, Shapira O, Schumacher SE, Kumar K, Khurana E, Waszak S, Korbel JO, Haber JE, Imielinski M, Weischenfeldt J, Beroukhim R, Campbell PJ (2020) Patterns of somatic structural variation in human cancer genomes. Nature 578:112–121. doi: 10.1038/s41586-019-1913-9 Li Z, Pearlman AH, Hsieh P (2016) DNA mismatch repair and the DNA damage response. DNA Repair (Amst) 38:94–101. doi: 10.1016/j.dnarep.2015.11.019 Li Z, Qin F, Li H (2018) Chimeric RNAs and their implications in cancer. Curr Opin Genet Dev 48:36–43. doi: 10.1016/j.gde.2017.10.002 Lopez-Lazaro M (2008) The warburg effect: why and how do cancer cells activate glycolysis in the presence of oxygen? Anticancer Agents Med Chem 8:305–312. doi: 10.2174/187152008783961932 Lowy DR, Schiller JT (2012) Reducing HPV-associated cancer globally. Cancer Prev Res (Phila Pa) 5:18–23. doi: 10.1158/1940-6207.CAPR-11-0542 Lumey LH (1992) Decreased birthweights in infants after maternal in utero exposure to the Dutch famine of 1944-1945. Paediatr Perinat Epidemiol 6:240–253. doi: 10.1111/j.1365-3016.1992.tb00764.x Lumey LH, Stein AD (2009) Transgenerational effects of prenatal exposure to the Dutch famine. BJOG 116:868; author reply 868. doi: 10.1111/j.1471- 0528.2009.02110.x Luo J, Solimini NL, Elledge SJ (2009) Principles of cancer therapy: oncogene and non-oncogene addiction. Cell 136:823–837. doi: 10.1016/j.cell.2009.02.024 Maaskola J, Bergenstråhle L, Jurek A, Fernández Navarro J, Lagergren J, Lundeberg J (2018) Charting tissue expression anatomy by spatial transcriptome decomposition. BioRxiv. doi: 10.1101/362624

73 Macaulay IC, Ponting CP, Voet T (2017) Single-Cell Multiomics: Multiple Measurements from Single Cells. Trends Genet 33:155–168. doi: 10.1016/j.tig.2016.12.003 Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA (2015) Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161:1202–1214. doi: 10.1016/j.cell.2015.05.002 Marks F, Fürstenberger G, Müller-Decker K (2007) Tumor promotion as a target of cancer prevention. Recent Results Cancer Res 174:37–47. doi: 10.1007/978-3- 540-37696-5_3 Marusyk A, Polyak K (2010) Tumor heterogeneity: causes and consequences. Biochim Biophys Acta 1805:105–117. doi: 10.1016/j.bbcan.2009.11.002 Marzluff JM, Dial KP (1991) Life history correlates of taxonomic diversity. Ecology 72:428–439. doi: 10.2307/2937185 Massagué J, Obenauf AC (2016) Metastatic colonization by circulating tumour cells. Nature 529:298–306. doi: 10.1038/nature17038 Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–451. doi: 10.1016/0005- 2795(75)90109-9 McCormack T, Frings O, Alexeyenko A, Sonnhammer ELL (2013) Statistical assessment of crosstalk enrichment between gene groups in biological networks. PLoS ONE 8:e54945. doi: 10.1371/journal.pone.0054945 McGranahan N, Swanton C (2017) Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168:613–628. doi: 10.1016/j.cell.2017.01.018 Merlo LMF, Shah NA, Li X, Blount PL, Vaughan TL, Reid BJ, Maley CC (2010) A comprehensive survey of clonal diversity measures in Barrett’s esophagus as biomarkers of progression to esophageal adenocarcinoma. Cancer Prev Res (Phila Pa) 3:1388–1397. doi: 10.1158/1940-6207.CAPR-10-0108 Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365:488–492. doi: 10.1016/S0140-6736(05)17866-0 Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, 1000 Genomes Project (2011) Mapping copy number variation by population- scale genome sequencing. Nature 470:59–65. doi: 10.1038/nature09708 Miner BE, Stöger RJ, Burden AF, Laird CD, Hansen RS (2004) Molecular barcodes detect redundancy and contamination in hairpin-bisulfite PCR. Nucleic Acids Res 32:e135. doi: 10.1093/nar/gnh132 Minikel EV (2012) How PCR duplicates arise in next-generation sequencing. In: CureFFI.org. https://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in- next-generation-sequencing/. Accessed 8 Dec 2019 Misteli T (2007) Beyond the sequence: cellular organization of genome function. Cell 128:787–800. doi: 10.1016/j.cell.2007.01.028 Moncada R, Chiodin M, Devlin JC, Baron M, Hajdu CH, Simeone D, Yanai I (2018) Building a tumor atlas: integrating single-cell RNA-Seq data with spatial transcriptomics in pancreatic ductal adenocarcinoma. BioRxiv. doi: 10.1101/254375

74 Montojo J, Zuberi K, Rodriguez H, Bader GD, Morris Q (2014) GeneMANIA: Fast gene network construction and function prediction for Cytoscape. [version 1; peer review: 2 approved]. F1000Res 3:153. doi: 10.12688/f1000research.4572.1 National Cancer Institute (2019) A to Z List of Cancer Types . https://www.cancer.gov/types. Accessed 17 Nov 2019 Naugler CT (2010) Population of cancer cell clones: possible implications of cancer stem cells. Theor Biol Med Model 7:42. doi: 10.1186/1742-4682-7-42 NBII (2011) Introduction to Genetic Diversity. In: Introduction to Genetic Diversity. https://web.archive.org/web/20110225072641/http://www.nbii.gov/portal/server. pt?open=512&objID=405&PageID=0&cached=true&mode=2&userID=2. Accessed 16 Nov 2019 Nguyen DX, Bos PD, Massagué J (2009) Metastasis: from dissemination to organ- specific colonization. Nat Rev Cancer 9:274–284. doi: 10.1038/nrc2622 NobelPrize.org (2019a) Prize announcement: The Nobel Prize in Physiology or Medicine 2019 . https://www.nobelprize.org/prizes/medicine/2019/advanced- information/. Accessed 17 Nov 2019 NobelPrize.org (2019b) Harald zur Hausen - Facts. https://www.nobelprize.org/prizes/medicine/2008/hausen/facts/. Accessed 17 Nov 2019 Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194:23– 28. doi: 10.1126/science.959840 Ogris C, Guala D, Helleday T, Sonnhammer ELL (2017) A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation. Nucleic Acids Res 45:e8. doi: 10.1093/nar/gkw849 Ogris C, Guala D, Sonnhammer ELL (2018) FunCoup 4: new species, data, and visualization. Nucleic Acids Res 46:D601–D607. doi: 10.1093/nar/gkx1138 Ogris C, Helleday T, Sonnhammer ELL (2016) PathwAX: a web server for network crosstalk based pathway annotation. Nucleic Acids Res 44:W105-9. doi: 10.1093/nar/gkw356 Ohno S (1970) Evolution by Gene Duplication. doi: 10.1007/978-3-642-86659-3 Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6 2:559–572. doi: 10.1080/14786440109462720 Petrackova A, Vasinek M, Sedlarikova L, Dyskova T, Schneiderova P, Novosad T, Papajik T, Kriegova E (2019) Standardization of sequencing coverage depth in NGS: recommendation for detection of clonal and subclonal mutations in cancer diagnostics. Front Oncol 9:851. doi: 10.3389/fonc.2019.00851 Picco G, Chen ED, Alonso LG, Behan FM, Gonçalves E, Bignell G, Matchan A, Fu B, Banerjee R, Anderson E, Butler A, Benes CH, McDermott U, Dow D, Iorio F, Stronach E, Yang F, Yusa K, Saez-Rodriguez J, Garnett MJ (2019) Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat Commun 10:2198. doi: 10.1038/s41467-019- 09940-1 Pickup MW, Mouw JK, Weaver VM (2014) The extracellular matrix modulates the hallmarks of cancer. EMBO Rep 15:1243–1253. doi: 10.15252/embr.201439246 Pierorazio PM, Walsh PC, Partin AW, Epstein JI (2013) Prognostic Gleason grade grouping: data based on the modified Gleason scoring system. BJU Int 111:753–760. doi: 10.1111/j.1464-410X.2012.11611.x

75 Piovesan A, Pelleri MC, Antonaros F, Strippoli P, Caracausi M, Vitale L (2019) On the length, weight and GC content of the human genome. BMC Res Notes 12:106. doi: 10.1186/s13104-019-4137-z Pirooznia M, Goes FS, Zandi PP (2015) Whole-genome CNV analysis: advances in computational approaches. Front Genet 6:138. doi: 10.3389/fgene.2015.00138 Potapov V, Ong JL (2017) Examining sources of error in PCR by single-molecule sequencing. PLoS ONE 12:e0169774. doi: 10.1371/journal.pone.0169774 Prajapati P, Lambert DW (2016) Cancer-associated fibroblasts - Not-so-innocent bystanders in metastasis to bone? J Bone Oncol 5:128–131. doi: 10.1016/j.jbo.2016.03.008 Qin F, Song Z, Chang M, Song Y, Frierson H, Li H (2016) Recurrent cis-SAGe chimeric RNA, D2HGDH-GAL3ST2, in prostate cancer. Cancer Lett 380:39–46. doi: 10.1016/j.canlet.2016.06.013 Rickman DS, Pflueger D, Moss B, VanDoren VE, Chen CX, de la Taille A, Kuefer R, Tewari AK, Setlur SR, Demichelis F, Rubin MA (2009) SLC45A3-ELK4 is a novel and frequent erythroblast transformation-specific fusion transcript in prostate cancer. Cancer Res 69:2734–2738. doi: 10.1158/0008-5472.CAN-08- 4926 Roche-Lestienne C, Soenen-Cornu V, Grardel-Duflos N, Laï J-L, Philippe N, Facon T, Fenaux P, Preudhomme C (2002) Several types of mutations of the Abl gene can be found in chronic myeloid leukemia patients resistant to STI571, and they can pre-exist to the onset of treatment. Blood 100:1014–1018. doi: 10.1182/blood.v100.3.1014 Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ (2019) Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363:1463–1467. doi: 10.1126/science.aaw1219 Ruxton GD (2006) The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology 17:688–690. doi: 10.1093/beheco/ark016 Sampayo RG, Bissell MJ (2019) Cancer stem cells in breast and prostate: Fact or fiction? Adv Cancer Res 144:315–341. doi: 10.1016/bs.acr.2019.03.010 Schaffner SF, Sabeti PC (2008) Evolutionary Adaptation in the Human Lineage. Nature Education 1(1):14: Schiavone K, Garnier D, Heymann M-F, Heymann D (2019) The heterogeneity of osteosarcoma: the role played by cancer stem cells. Adv Exp Med Biol 1139:187–200. doi: 10.1007/978-3-030-14366-4_11 Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319. doi: 10.1162/089976698300017467 Schwarz RF, Ng CKY, Cooke SL, Newman S, Temple J, Piskorz AM, Gale D, Sayal K, Murtaza M, Baldwin PJ, Rosenfeld N, Earl HM, Sala E, Jimenez- Linan M, Parkinson CA, Markowetz F, Brenton JD (2015) Spatial and temporal heterogeneity in high-grade serous ovarian cancer: a phylogenetic analysis. PLoS Med 12:e1001789. doi: 10.1371/journal.pmed.1001789 Sena JA, Galotto G, Devitt NP, Connick MC, Jacobi JL, Umale PE, Vidali L, Bell CJ (2018) Unique Molecular Identifiers reveal a novel sequencing artefact with implications for RNA-Seq based gene expression analysis. Sci Rep 8:13121. doi: 10.1038/s41598-018-31064-7

76 Seoane J, De Mattos-Arruda L (2014) The challenge of intratumour heterogeneity in precision medicine. J Intern Med 276:41–51. doi: 10.1111/joim.12240 Seyfried TN (2015) Cancer as a mitochondrial metabolic disease. Front Cell Dev Biol 3:43. doi: 10.3389/fcell.2015.00043 Sheaffer KL, Elliott EN, Kaestner KH (2016) DNA hypomethylation contributes to genomic instability and intestinal cancer initiation. Cancer Prev Res (Phila Pa) 9:534–546. doi: 10.1158/1940-6207.CAPR-15-0349 Shipitsin M, Campbell LL, Argani P, Weremowicz S, Bloushtain-Qimron N, Yao J, Nikolskaya T, Serebryiskaya T, Beroukhim R, Hu M, Halushka MK, Sukumar S, Parker LM, Anderson KS, Harris LN, Garber JE, Richardson AL, Schnitt SJ, Nikolsky Y, Gelman RS, Polyak K (2007) Molecular definition of breast tumor heterogeneity. Cancer Cell 11:259–273. doi: 10.1016/j.ccr.2007.01.013 Signorelli M, Vinciotti V, Wit EC (2016) NEAT: an efficient network enrichment analysis test. BMC Bioinformatics 17:352. doi: 10.1186/s12859-016-1203-6 Skaggs BJ, Gorre ME, Ryvkin A, Burgess MR, Xie Y, Han Y, Komisopoulou E, Brown LM, Loo JA, Landaw EM, Sawyers CL, Graeber TG (2006) Phosphorylation of the ATP-binding loop directs oncogenicity of drug-resistant BCR-ABL mutants. Proc Natl Acad Sci USA 103:19466–19471. doi: 10.1073/pnas.0609239103 Slaby O, Svoboda M, Michalek J, Vyzula R (2009) MicroRNAs in colorectal cancer: translation of molecular biology into clinical application. Mol Cancer 8:102. doi: 10.1186/1476-4598-8-102 Spencer DH, Tyagi M, Vallania F, Bredemeyer AJ, Pfeifer JD, Mitra RD, Duncavage EJ (2014) Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 16:75–88. doi: 10.1016/j.jmoldx.2013.09.003 Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, Giacomello S, Asp M, Westholm JO, Huss M, Mollbrink A, Linnarsson S, Codeluppi S, Borg Å, Pontén F, Costea PI, Sahlén P, Mulder J, Bergmann O, Lundeberg J, Frisén J (2016) Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353:78–82. doi: 10.1126/science.aaf2403 Stanta G, Bonin S (2018) Overview on Clinical Relevance of Intra-Tumor Heterogeneity. Front Med (Lausanne) 5:85. doi: 10.3389/fmed.2018.00085 Stoler DL, Chen N, Basik M, Kahlenberg MS, Rodriguez-Bigas MA, Petrelli NJ, Anderson GR (1999) The onset and extent of genomic instability in sporadic colorectal tumor progression. Proc Natl Acad Sci USA 96:15121–15126. doi: 10.1073/pnas.96.26.15121 Swanton C (2012) Intratumor heterogeneity: evolution through space and time. Cancer Res 72:4875–4882. doi: 10.1158/0008-5472.CAN-12-2217 Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ, Mering C von (2019) STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 47:D607–D613. doi: 10.1093/nar/gky1131 The Gene Ontology Consortium (2019) The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res 47:D330–D338. doi: 10.1093/nar/gky1055

77 Tilman D (2004) Niche tradeoffs, neutrality, and community structure: a stochastic theory of resource competition, invasion, and community assembly. Proc Natl Acad Sci USA 101:10854–10861. doi: 10.1073/pnas.0403458101 Tomlinson I, Sasieni P, Bodmer W (2002) How many mutations in a cancer? Am J Pathol 160:755–758. doi: 10.1016/S0002-9440(10)64896-1 Touw IP, Erkeland SJ (2007) Retroviral insertion mutagenesis in mice as a comparative oncogenomics tool to identify disease genes in human leukemia. Mol Ther 15:13–19. doi: 10.1038/sj.mt.6300040 Trigos AS, Pearson RB, Papenfuss AT, Goode DL (2017) Altered interactions between unicellular and multicellular genes drive hallmarks of transformation in a diverse range of solid tumors. Proc Natl Acad Sci USA 114:6406–6411. doi: 10.1073/pnas.1617743114 Triola MF (2014) Elementary Statistics. Pearson Turajlic S, McGranahan N, Swanton C (2015) Inferring mutational timing and reconstructing tumour evolutionary histories. Biochim Biophys Acta 1855:264– 275. doi: 10.1016/j.bbcan.2015.03.005 Tzadok O (2019) Inside My Mind (#6158). Valencia A, Hidalgo M (2012) The cancer genome challenge and the importance of analytical pipelines. van der Maaten L, Hinton G (2011) Visualizing Data using t-SNE. J. Mach. Learn. Res. van Etten JL, Dehm SM (2016) Clonal origin and spread of metastatic prostate cancer. Endocr Relat Cancer 23:R207-17. doi: 10.1530/ERC-16-0049 Venter JC (2001) Shotgunning the human genome: A personal view. Encyclopedia of life sciences. doi: 10.1038/npg.els.0005850 Wang C, Guo J, Zhao N, Liu Y, Liu X, Liu G, Guo M (2019) A cancer survival prediction method based on graph convolutional network. IEEE Trans Nanobioscience. doi: 10.1109/TNB.2019.2936398 Warburg O (1956) On the Origin of Cancer Cells. Science 123:309–314. doi: 10.1126/science.123.3191.309 Ward JH (1963) Hierarchical Grouping to Optimize an Objective Function. J Am Stat Assoc 58:236–244. doi: 10.1080/01621459.1963.10500845 Watson PA, Arora VK, Sawyers CL (2015) Emerging mechanisms of resistance to androgen receptor inhibitors in prostate cancer. Nat Rev Cancer 15:701–711. doi: 10.1038/nrc4016 Weinhouse S, Warburg O, Burk D, Schade AL (1956) On respiratory impairment in cancer cells. Science 124:269–270. Wei L, Wang J, Lampert E, Schlanger S, DePriest AD, Hu Q, Gomez EC, Murakam M, Glenn ST, Conroy J, Morrison C, Azabdaftari G, Mohler JL, Liu S, Heemers HV (2017) Intratumoral and intertumoral genomic heterogeneity of multifocal localized prostate cancer impacts molecular classifications and genomic prognosticators. Eur Urol 71:183–192. doi: 10.1016/j.eururo.2016.07.008 Welch BL (1947) The generalization of ‘student’s’ problem when several different population varlances are involved. Biometrika 34:28–35. doi: 10.1093/biomet/34.1-2.28 Wilcoxon F (1945) Individual Comparisons by Ranking Methods. Biometrics Bulletin 1:80. doi: 10.2307/3001968

78 Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A (2016) Identification of neutral tumor evolution across cancer types. Nat Genet 48:238–244. doi: 10.1038/ng.3489 Williams SCP (2016) Genetic mutations you want. Proc Natl Acad Sci USA 113:2554–2557. doi: 10.1073/pnas.1601663113 Wood DE, White JR, Georgiadis A, Van Emburgh B, Parpart-Li S, Mitchell J, Anagnostou V, Niknafs N, Karchin R, Papp E, McCord C, LoVerso P, Riley D, Diaz LA, Jones S, Sausen M, Velculescu VE, Angiuoli SV (2018) A machine learning approach for somatic mutation discovery. Sci Transl Med. doi: 10.1126/scitranslmed.aar7939 Wunderlich V (2002) JMM-past and present. Chromosomes and cancer: Theodor Boveri’s predictions 100 years later. J Mol Med 80:545–548. doi: 10.1007/s00109-002-0374-y Wu D, Rice CM, Wang X (2012) Cancer bioinformatics: a new approach to systems clinical medicine. BMC Bioinformatics 13:71. doi: 10.1186/1471-2105-13-71 Xu C (2018) A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput Struct Biotechnol J 16:15–24. doi: 10.1016/j.csbj.2018.01.003 Xu X, Liang T, Zhu J, Zheng D, Sun T (2018) Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing 328:5–15. doi: 10.1016/j.neucom.2018.02.100 Zhao M, Wang Q, Wang Q, Jia P, Zhao Z (2013) Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 14 Suppl 11:S1. doi: 10.1186/1471-2105-14-S11-S1 Zheng J (2012) Energy metabolism of cancer: Glycolysis versus oxidative phosphorylation (Review). Oncol Lett 4:1151–1157. doi: 10.3892/ol.2012.928 Zimek A, Schubert E (2017) Outlier Detection. In: Liu L, Özsu MT (eds) Encyclopedia of database systems. Springer, New York, pp 1–5

79