Ploidetect Enables Pan-Cancer Analysis of the Causes and Impacts of Chromosomal Instability

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.06.455329; this version posted August 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Culibrk et al. | August 6, 2021 | 1–18

Luka Culibrk1,2, Jasleen K. Grewal1,2, Erin D. Pleasance1, Laura Williamson1, Karen Mungall1, Janessa Laskin3, Marco A. Marra1,4, and Steven J.M. Jones1,4

1 Canada’s Michael Smith Genome Sciences Center at BC Cancer, Vancouver, British Columbia, Canada
2 Bioinformatics training program, University of British Columbia, Vancouver, British Columbia, Canada
3 Department of Medical Oncology, BC Cancer, Vancouver, British Columbia, Canada
4 Department of Medical Genetics, Faculty of Medicine, Vancouver, British Columbia, Canada

Cancers routinely exhibit chromosomal instability, resulting in the accumulation of changes in the abundance of genomic material, known as copy number variants (CNVs). Unfortunately, the detection of these variants in cancer genomes is difficult. We developed Ploidetect, a software package that effectively identifies CNVs within whole-genome sequenced tumors. Ploidetect was more sensitive to CNVs in cancer-related genes within advanced, pre-treated metastatic cancers than other tools, while also segmenting the most contiguously. Chromosomal instability, as measured by segment contiguity, was associated with several biological and clinical variables, including tumor mutation burden, tumor type, duration of therapy and immune microenvironment, highlighting the relevance of measuring CNV across the cancer genome. Investigation of gene mutations in samples revealed associations between chromosomal instability and the mutation status of several genes, including ROCK2 and AC074391.1. Leveraging our heightened ability to detect CNVs, we identified 282 genes which were recurrently homozygously deleted in metastatic tumors. Further analysis of one recurrently deleted gene, MACROD2, identified a putative fragile tumor suppressor locus associated with response to chromosomal instability and chemotherapeutic agents. Our results outline the multifaceted impacts of CNVs in cancer by providing evidence of their involvement in tumorigenic behaviors and their utility as biomarkers for biological processes. We propose that increasingly accurate determination of CNVs is critical for their productive study in cancer, and our work demonstrates advances made possible by progress in this regard.

Correspondence: [email protected]

Introduction

Copy number variation (CNV) is a major class of somatic mutation often observed in the context of cancer. CNV has been shown to impact the majority of cancer genomes (1) and is associated with poor clinical outcomes (2). Homozygous deletions are CNVs which result in the complete loss of genomic loci, and recurrent homozygous deletions are known to be enriched for tumor suppressor genes (3). Homologous recombination deficiency (HRD), a cause of CNV, has been positively associated with platinum therapy response (4), and other phenotypes such as an increased prevalence of tandem duplications have been described with relevance to biological processes (5, 6). Overall, chromosomal instability (CIN) driving the accumulation of CNVs is associated with poor patient outcome in multiple cancer types (7–9). Although CNV is a highly pervasive mechanism by which tumors mutate, these variants are considerably more difficult to detect accurately than other types of mutations, and consequently they may represent an under-explored facet of tumor biology.

While small mutations can be determined through base changes embedded within aligned sequence reads, CNVs are variations in DNA quantity and are typically determined through measurement of relative DNA abundance by way of sequence read depth (10, 11). Due to random noise and sequence bias, read depth data is unreliable at the granular level and therefore must be noise corrected. Segmentation, the process of clustering genomic loci into contiguous regions of constant copy number, is used for CNV analysis because the aggregation of numerous sequential read depth observations aids in overcoming the inherent noise in the data. A number of segmentation methods have been developed to tackle this issue in the context of both next-generation sequencing data and array CGH (10, 12, 13). Although CNVs are arguably comparable in importance with small variants in cancer, tools for somatic CNV detection tend to make errors considerably more often than small variant callers (14). While highly sensitive CNV detection for identifying tumor suppressor deletions and oncogene amplifications is desirable, it is vital to also ensure that CNV segmentation is precise. Highly fragmented segmentation results preclude the use of metrics such as CNV burden as phenotypic features related to CIN in cancer biology, because false positive breakpoints cause an inflated estimate of the magnitude of CIN. Inherently noisy data impedes the ability to separate signal from noise, necessarily limiting the degree of positive, publishable findings that can be generated from CNV data (15).

Clinical genomics is a rising area of interest in cancer research. Studies that evaluate the utility of a personalized approach to identifying potential cancer therapeutics have been reported in recent years (16, 17). Studies that broadly attempt to genomically characterize large cohorts of tumor biopsies are also becoming increasingly commonplace (18, 19). As the cost of sequencing continues to decrease, large-scale sequencing studies of tumor biopsies will become more commonplace, and highly precise and accurate tools to interrogate these genomes will be required for their effective interpretation.

Here we present Ploidetect, a tool that performs copy number variation analysis of cancer genomes. Ploidetect contributes to the field of somatic CNV detection by using a novel segmentation approach to better segment genomes compared to contemporary methods. We demonstrate its performance on a cohort of 725 metastatic tumor biopsies from patients with treatment-resistant cancers sequenced as part of the Personalized OncoGenomic (POG) program at BC Cancer (20). Further, we find that by using Ploidetect’s CNV estimates we can reconstruct existing knowledge of the CIN phenotype in cancer, discover novelties regarding CIN, and elucidate putative gene associations with CIN. As a unique advantage of Ploidetect, parameter tuning for tool usage is entirely data-driven. Ploidetect utilizes a two-step process of joint purity-ploidy estimation followed by CNV calling using a novel segmentation procedure. In a cohort of 725 real-world metastatic tumor biopsies, Ploidetect outperforms existing methods for detection of CNVs, while also limiting oversegmentation. Using Ploidetect’s CNV determination, we found that CIN correlates with a number of genomic and clinical features in cancer, including site of origin, immune infiltration, and response to chemotherapy. Further analysis identified known and putative gene associations with CIN. Lastly, a permutation analysis of recurrently deleted regions (RDRs) within the genome enabled prioritization of candidate fragile tumor suppressor genes. This work represents our contribution to improving CNV detection in paired tumor-normal genomic analyses and the ability of our method to enable CNV biomarker research and discovery.

Benchmarking. We compared Ploidetect with three other CNV tools, namely Sequenza (11), BIC-seq2 (10) and PURPLE (23). We used the latest version of each tool at the time of analysis, corresponding to Sequenza version 3.0.0, BIC-seq2 version 0.7.2, and AMBER/COBALT/PURPLE versions 3.5/1.11/2.51, respectively. Our test set was 725 cases of the Personalized Oncogenomics (POG) cohort (17), including recent cases sequenced after the publication of Pleasance et al. Purity and ploidy results were obtained from Ploidetect, Sequenza and PURPLE. BIC-seq2 included a program, purityEM, which appeared to perform tumor purity and ploidy estimation, but we were unable to run it successfully, due in part to a lack of documentation. We instead used Ploidetect’s tumor purity and ploidy estimates to assign copy number to the BIC-seq2 segments. Each tool was run with its default settings in a Snakemake (24) workflow. Homozygous deletions were defined as a copy number below 0.25, and amplifications were defined as a copy number ≥ ploidy + 3.

Detection of significantly deleted regions. We performed a permutation analysis by shuffling the midpoint of all homozygous deletions on each chromosome 10^7 times. For each interval, we recorded the number of permutations in which shuffled deletions were more numerous than actual deletions. P-values were obtained by dividing this count by the total number of permutations. The p-values were adjusted for multiple testing using the Benjamini-Hochberg
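
The permutation analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "intervals" are fixed-width bins along a chromosome and that shuffling a midpoint means redrawing it uniformly along the chromosome; the function names, bin size, and toy data are all hypothetical. The p-value is the fraction of permutations in which the shuffled count in a bin exceeds the observed count, as stated in the text, and the adjustment step is the standard Benjamini-Hochberg step-up correction.

```python
import random
from typing import List, Sequence


def bin_counts(midpoints: Sequence[int], chrom_length: int, bin_size: int) -> List[int]:
    """Count deletion midpoints in each fixed-width bin (bins are an assumption)."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in midpoints:
        counts[pos // bin_size] += 1
    return counts


def permutation_p_values(midpoints: Sequence[int], chrom_length: int,
                         bin_size: int, n_perm: int, seed: int = 0) -> List[float]:
    """Per-bin fraction of permutations where shuffled deletions outnumber observed ones."""
    rng = random.Random(seed)
    observed = bin_counts(midpoints, chrom_length, bin_size)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        # Assumption: "shuffling" redraws every midpoint uniformly on the chromosome.
        shuffled = [rng.randrange(chrom_length) for _ in midpoints]
        permuted = bin_counts(shuffled, chrom_length, bin_size)
        for i, (perm_count, obs_count) in enumerate(zip(permuted, observed)):
            if perm_count > obs_count:
                exceed[i] += 1
    return [count / n_perm for count in exceed]


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap of 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (hypothetical): eight deletion midpoints cluster in the bin 500-599.
demo_midpoints = [510, 520, 530, 540, 550, 560, 570, 580, 12, 905]
demo_p = permutation_p_values(demo_midpoints, chrom_length=1000, bin_size=100, n_perm=2000)
demo_q = benjamini_hochberg(demo_p)
for start, q in zip(range(0, 1000, 100), demo_q):
    print(f"bin {start}-{start + 99}: adjusted p = {q:.3f}")
```

In this toy example the clustered bin receives the smallest p-value, since random placements rarely put more midpoints there than were observed. The sketch uses far fewer permutations than the analysis in the text; the attainable p-value resolution is 1 divided by the number of permutations.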
