Genetics: Early Online, published on December 26, 2017 as 10.1534/genetics.117.300552

The hidden genomic and transcriptomic plasticity of giant marker chromosomes in cancer

Gemma Macchia (1)*, Marco Severgnini (2), Stefania Purgato (3), Doron Tolomeo (1), Hilen Casciaro (1), Ingrid Cifola (2), Alberto L’Abbate (1), Anna Loverro (1), Orazio Palumbo (4), Massimo Carella (4), Laurence Bianchini (5), Giovanni Perini (3), Gianluca De Bellis (2), Fredrik Mertens (6), Mariano Rocchi (1), Clelia Tiziana Storlazzi (1)#.

(1) Department of Biology, University of Bari “Aldo Moro”, Bari, Italy;

(2) Institute for Biomedical Technologies (ITB), CNR, Segrate, Italy;

(3) Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy;

(4) Laboratorio di Genetica Medica, IRCCS Casa Sollievo della Sofferenza, San Giovanni Rotondo, Italy;

(5) Laboratory of solid tumor genetics, Université Côte d'Azur, CNRS, IRCAN, Nice, France;

(6) Department of Clinical Genetics, University and Regional Laboratories, Lund University, Lund, Sweden.

EMBL-EBI Array Express database: E-MTAB-5625

NCBI Short Read Archive: PRJNA378952

GenBank repository: KY966261-KY966313 and KY966314-KY966332


Running Title: Neocentromeres and chimeric transcripts in cancer

Copyright 2017.

Keywords: neocentromere, fusion transcript, WDLPS, LSC, amplification

* Corresponding author:

Gemma Macchia, Department of Biology, University of Bari, Via Orabona no. 4, 70125 Bari (Italy)

Email: [email protected]

Tel No: +39 0805443582

Fax: +39 0805443386

ABSTRACT

Genome amplification in the form of rings or giant rod-shaped marker chromosomes (RGMs) is a common genetic alteration in soft tissue tumours. The mitotic stability of these structures is often rescued by perfectly functioning analphoid neocentromeres, which therefore significantly contribute to cancer progression. Here, we disentangled the genomic architecture of many neocentromeres stabilizing marker chromosomes in well-differentiated liposarcoma and lung sarcomatoid carcinoma samples. In cells carrying heavily rearranged RGMs, these structures were assembled as patchworks of multiple short amplified sequences, disclosing an extremely high level of complexity and definitely ruling out the existence of regions prone to neocentromere seeding. Moreover, by studying two well-differentiated liposarcoma samples derived from the onset and the recurrence of the same tumour, we documented an expansion of the neocentromeric domain that occurred during tumour progression, which reflects a strong selective pressure acting toward the improvement of neocentromeric functionality in cancer. In lung sarcomatoid carcinoma cells, extensive “centromere sliding” phenomena, giving rise to multiple, closely mapping neocentromeric epialleles on separate co-existing markers, occur, likely due to the instability of neocentromeres arising in cancer cells. Finally, by investigating the transcriptional activity of neocentromeres, we came across a burst of chimeric transcripts, generated both by extremely complex genomic rearrangements and by cis/trans-splicing events. Post-transcriptional editing events have been reported to expand and variegate the genetic repertoire of higher eukaryotes, so they might have a determining role in cancer. The increased incidence of fusion transcripts might act as a driving force for the genomic amplification process, together with the increased transcription of oncogenes.

INTRODUCTION

Genome amplification is a frequent genetic alteration in cancer, with variable cytogenetic manifestations including double minutes, homogeneously staining regions and/or ring and giant rod-shaped marker chromosomes (RGM) (MATSUI et al. 2013; L'ABBATE et al. 2014; NORD et al. 2014). While double minutes and homogeneously staining regions have been described in a variety of cancer types (MATSUI et al. 2013), RGMs are particularly common in soft tissue tumours, notably in well-differentiated liposarcomas (WDLPS), and have been shown to contain amplified sequences from several chromosomes (NORD et al. 2014). During tumour progression, the ring chromosomes are frequently broken and resealed, or transformed into rod-shaped markers by capturing the telomeres from other chromosomes (NORD et al. 2014). This instability results in a highly complex internal structure of these markers, as well as in extensive heterogeneity with respect to size and number per cell (GARSED et al. 2014; NORD et al. 2014). RGMs frequently lack functional centromeric alphoid sequences and their mitotic stability is rescued by the emergence of perfectly functioning analphoid neocentromeres, which might indirectly contribute to cancer progression (MACCHIA et al. 2015). Nonetheless, there are few studies addressing neocentromeres in cancer, probably because most of the technologies employed to study tumour genotypes are unable to unveil them. The occurrence of neocentromeres in cancer, therefore, could be more frequent than reported. Similarly, very little is known about the impact of neocentromeres on transcription, although centromeric satellite regions have been reported to produce non-coding transcripts actively involved in centromere assembly (CHAN et al. 2012; ROSIC et al. 2014; QUENET AND DALAL 2015; MCNULTY et al. 2017). Also, genes within neocentromeres are still actively transcribed (AMOR AND CHOO 2002; WONG et al. 2006). In line with these notions, the occurrence of neocentromeres in colon cancer cell lines was reported to correlate with large DNase I hypersensitive sites, which are usually sites of active transcription or high nucleosome turnover (ATHWAL et al. 2015). By combining chromatin immunoprecipitation (IP) deep sequencing (ChIP-seq), whole genome sequencing (WGS), immuno-fluorescence in situ hybridisation (immuno-FISH), whole transcriptome sequencing (total RNA-seq) and other molecular analyses, we investigated in detail the genomic architecture of neocentromeres arising on RGMs, as well as their contribution to transcription, in the lung sarcomatoid carcinoma (LSC) cell line 04T036 and in the three liposarcoma cell lines 93T449, 94T778 and 95T1000. Overall, our study uncovered the complex organization of neocentromeres in cancer and shed light on the extraordinarily high genomic and transcriptomic plasticity associated with RGMs in solid tumours.

MATERIALS AND METHODS

Tumour cell lines

Four tumour cell lines (04T036, 93T449, 94T778, and 95T1000), kindly provided by the Centre Hospitalier Universitaire de Nice (France), were included in the study. 04T036 was established from the LSC of a 50-year-old man. Cytogenetic and multicolour FISH analyses showed a near-triploid karyotype with numerous structural aberrations and four to six small RGMs containing chromosome 9 amplified sequences, and two RGMs containing chromosome 3 amplified sequences (ITALIANO et al. 2006). The 93T449 and 94T778 cell lines were obtained from a primary retroperitoneal WDLPS at onset and at relapse, respectively. These commercial cell lines showed complex karyotypes with multiple RGMs at G-banding and multicolour FISH analysis, and a clear difference in the overall chromosome arrangement between them (SIRVENT et al. 2000; GARSED et al. 2014). The 95T1000 cell line was generated from a WDLPS relapse; SKY analysis revealed a hypertriploid karyotype with multiple chromosomal structural abnormalities (PEDEUTOUR et al. 2012). All cells retained a giant marker chromosome, previously identified in the primary cell cultures. This giant chromosome contained high-level amplification of chromosomal regions deriving from 10p and 12q and lacked alpha-satellite DNA (PEDEUTOUR et al. 2012).

SNP array data

All cell lines were analysed by the Affymetrix Genome Wide Human SNP Array 6.0 platform (Affymetrix, Santa Clara, CA, USA), as described (STORLAZZI et al. 2010).

Whole Genome Sequencing

WGS was carried out to disentangle the genomic architecture of RGMs holding neocentromeres. Library preparations were performed using the TruSeq DNA Nano 350 bp protocol (Illumina, San Diego, CA, USA). The sequencing data were acquired using the Illumina Xten at the New York Genome Center (NYGC; New York, USA), in a paired-end 150-cycle run (mean coverage 40× per sample). Reads were aligned to the human reference genome (GRCh37/hg19) using BWA-MEM (v. 0.7.12) [http://bio-bwa.sourceforge.net/, (LI AND DURBIN 2009)] and PCR duplicates were removed using Picard (v. 1.119) (http://picard.sourceforge.net/). Candidate structural variations (SVs) were identified using Delly (v. 0.5.9) and Crest (v. 1.0) with default parameters (WANG et al. 2011; RAUSCH et al. 2012). Copy number analysis was performed using BIC-seq 0.7alpha (XI et al. 2011). Genomic intervals showing a log2 copyRatio > 0.5 were considered amplified, and those showing a log2 copyRatio > 2.5 were considered highly amplified.
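The two copy-number thresholds can be illustrated with a minimal Python sketch. It only restates the cut-offs given above (log2 copyRatio > 0.5 and > 2.5); the function name, the label strings and the example values are hypothetical and are not part of the BIC-seq output format.

```python
# Thresholds taken from the text; everything else is illustrative.
AMPLIFIED_LOG2 = 0.5
HIGHLY_AMPLIFIED_LOG2 = 2.5


def classify_interval(log2_copy_ratio: float) -> str:
    """Label a genomic interval from its log2 copy ratio (strict inequalities)."""
    if log2_copy_ratio > HIGHLY_AMPLIFIED_LOG2:
        return "highly amplified"
    if log2_copy_ratio > AMPLIFIED_LOG2:
        return "amplified"
    return "not amplified"


if __name__ == "__main__":
    # Hypothetical log2 ratios, for demonstration only.
    for ratio in (0.2, 1.3, 3.1):
        print(ratio, classify_interval(ratio))
```

Because both inequalities are strict, an interval with a log2 ratio of exactly 0.5 or 2.5 falls into the lower category in this sketch.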

ChIP-sequencing

To determine the internal structure of the neocentromeres, native ChIP-seq was performed as described (WADE et al. 2009). Immunoprecipitation was run using a polyclonal antibody against CENP-A (TRAZZI et al. 2009). Both input and IP DNA fragments were purified and processed using the TruSeq ChIP Library Preparation Kit (Illumina) and sequenced on the Illumina HiSeq 2500 at the IGA Technology Services facility (Udine, Italy) (single-end 100-cycle run, 140M reads/sample). Raw reads were aligned to the human reference genome (GRCh37/hg19) using BWA-MEM (v. 0.7.10). CENP-A enriched regions corresponding to putative neocentromeres were identified using the CNV-seq tool, merging all overlapping intervals (XIE AND TAMMI 2009). Selected regions were then filtered to exclude alphoid sequences, weak enrichments and regions with read “spikes” piling up in a single position. Next, putative neocentromeric fragments were ranked according to their Overall Evaluation Criteria (OEC) score (SEVERGNINI et al. 2006). We then screened the first 40 hits by inspecting the ChIP-seq alignment data with the Integrative Genomics Viewer (IGV) software (THORVALDSDOTTIR et al. 2013) to exclude false positives, and validated the selected intervals by immuno-FISH as described below. Using this approach, multiple CENP-A enriched peaks were detected in all samples. In order to define the neocentromere internal structure, a local re-mapping of the IP reads was performed against the identified putative neocentromeric intervals (custom reference) by Blast (ALTSCHUL et al. 1990) or Blat (KENT 2002), searching for reads spanning the junctions. A list of reads splitting between two neocentromeric regions was created, and the OEC score was then used to prioritize the putative junctions. Full details are provided in Methods S1.
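The step of merging overlapping enriched intervals can be sketched generically in Python. This is not the CNV-seq implementation; it is a minimal illustration that assumes intervals are given as hypothetical (chromosome, start, end) tuples and that intervals sharing a boundary coordinate count as overlapping.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end); an illustrative layout


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals that overlap (or touch) on the same chromosome."""
    merged: List[Interval] = []
    # Sorting tuples orders by chromosome name, then start, then end.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of opening a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    # Hypothetical coordinates, for demonstration only.
    example = [("chr9", 100, 200), ("chr9", 150, 300), ("chr3", 10, 20), ("chr9", 400, 500)]
    print(merge_overlapping(example))
```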

Immuno-FISH assays on elongated chromosomes

Immuno-FISH on elongated chromosomes was performed to validate the putative structure of neocentromeres, as previously described (EARNSHAW et al. 1989). As recommended by Beh et al. (2016), we used a rabbit anti-CENP-C polyclonal antibody and a goat anti-rabbit FITC-conjugated antibody to label functional centromeres. CENP-A and CENP-C were reported to localize exclusively to active centromeres, as part of the constitutive centromere-associated network (KLARE et al. 2015; SHONO et al. 2015; BEH et al. 2016). Alphoid and BAC probes spanning candidate neocentromeric regions were selected and labelled as reported (TROMBETTA et al. 2012; MACCHIA et al. 2015) (Table S1). Subsequent FISH experiments were conducted to verify the co-localization between CENP-C signals and the tested neocentromeric intervals as described (STORLAZZI et al. 2006).

PCR and qPCR assays

PCR and Sanger sequencing were performed to validate genomic SVs as described (STORLAZZI et al. 2010). Primer sequences are available upon request. To test differential ChIP-enrichments between the two related samples 93T449 and 94T778, qPCR experiments were performed starting from 10 ng of IP (target) and Input DNA (negative control) with the ready-to-use hot start reaction mix for SYBR Green I-based real-time PCR assays using the LightCycler® 96 System, according to the manufacturer’s protocols (Roche, Basel, Switzerland) (primers listed in Table S7). As positive control, we included the top ChIP-enriched region shared by the two cell lines, defined by immuno-FISH as the neocentromeric core. The results were analysed with a custom approach based on the ΔΔCt method (LIVAK AND SCHMITTGEN 2001). In detail, the ΔCt was first calculated for each neocentromeric peak as the difference in Ct between the IP and the Input. Then, the ΔΔCt was obtained using the positive control as calibrator. Finally, we compared the results for 93T449 and 94T778 by looking at the 2^(-ΔΔCt) fold ratios between each region and the corresponding calibrator.
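The calculation described above can be written as a short Python sketch: ΔCt = Ct(IP) - Ct(Input) for each region, ΔΔCt = ΔCt(region) - ΔCt(calibrator), and fold ratio = 2^(-ΔΔCt). The function names and the Ct values in the example are hypothetical; they are not measurements from this study.

```python
def delta_ct(ct_ip: float, ct_input: float) -> float:
    """Delta-Ct of one region: Ct of the IP sample minus Ct of the Input."""
    return ct_ip - ct_input


def fold_ratio(ct_ip: float, ct_input: float,
               calibrator_ct_ip: float, calibrator_ct_input: float) -> float:
    """2^(-ddCt) of a region relative to the calibrator (positive control)."""
    ddct = delta_ct(ct_ip, ct_input) - delta_ct(calibrator_ct_ip, calibrator_ct_input)
    return 2.0 ** (-ddct)


if __name__ == "__main__":
    # Hypothetical Ct values: region (IP 24, Input 26), calibrator (IP 22, Input 26).
    # Delta-Ct region = -2, Delta-Ct calibrator = -4, so ddCt = 2 and the ratio is 0.25.
    print(fold_ratio(24.0, 26.0, 22.0, 26.0))
```

A ratio below 1 indicates a region less enriched than the calibrator, which is the pattern expected for the weakly enriched fragments when compared with the neocentromeric core.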

RNA-seq library preparation and analysis

We investigated the transcriptional activity of neocentromeres by total RNA-seq. Total RNA was extracted using the RNeasy Mini Kit (QIAGEN, Hilden, Germany), checked for integrity on the 2100 Bioanalyzer instrument (Agilent Technologies, Santa Clara, CA, USA) and stored at -80°C until use. Libraries were prepared using the TruSeq Stranded Total RNA Library Prep Kit (Illumina), and sequenced on the HiSeq 2000 at the IGA Technology Services facility (Udine, Italy) (paired-end 100-cycle run, 70M reads/sample). After running FastQC for quality control (www.bioinformatics.babraham.ac.uk/projects/fastqc/), paired-end reads were mapped to the human reference genome (GRCh37/hg19) using the STAR aligner (v. 2.4) (DOBIN et al. 2013), with Gencode (v. 19) (HARROW et al. 2012) as gene transcript model and Samtools (v. 0.1.19) (LI et al. 2009) for duplicate removal. Alignment tracks were inspected in IGV to identify transcriptional activity at the neocentromeric sites.

Chimeric transcripts, RNA-seq and WGS data integration

Looking at the RNA-seq alignment tracks in IGV, we found multiple truncated transcripts, likely derived from gene fusion events. To identify these chimeras, we analysed RNA-seq data using ChimeraScan with default parameters (IYER et al. 2011). The identified fusion transcripts were then mapped at SV positions with a custom approach combining WGS and RNA-seq data (Methods S1). RT-PCR and Sanger sequencing validations were performed on the transcripts as previously described (STORLAZZI et al. 2010), after filtering out overlapping, adjacent and read-through transcripts and all chimeras with a score <15. For chimeras not supported by any SV, we considered for validation only those with split reads. All chimeras detected in 93T449 or 94T778 were tested in both cell lines.
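The selection rules above can be summarized in a minimal Python sketch. The record layout, field names and category labels are hypothetical and do not reproduce the ChimeraScan output format; only the rules themselves (exclude overlapping, adjacent and read-through transcripts, exclude scores below 15, and require split reads when no SV supports the chimera) come from the text.

```python
from dataclasses import dataclass

MIN_SCORE = 15  # chimeras with a score < 15 are discarded (from the text)
EXCLUDED_KINDS = {"overlapping", "adjacent", "read_through"}  # hypothetical labels


@dataclass
class Chimera:
    name: str                # e.g. "geneA/geneB" (placeholder)
    score: float             # caller-reported score
    kind: str                # hypothetical category label
    has_supporting_sv: bool  # True if a WGS structural variant supports the fusion
    has_split_reads: bool    # True if reads span the fusion junction


def keep_for_validation(chimera: Chimera) -> bool:
    """Return True if the chimera passes the filters described in the text."""
    if chimera.kind in EXCLUDED_KINDS:
        return False
    if chimera.score < MIN_SCORE:
        return False
    # Without a supporting SV, split-read evidence is required.
    if not chimera.has_supporting_sv and not chimera.has_split_reads:
        return False
    return True


if __name__ == "__main__":
    candidates = [
        Chimera("geneA/geneB", 40, "fusion", True, False),
        Chimera("geneC/geneD", 20, "read_through", True, True),
        Chimera("geneE/geneF", 30, "fusion", False, False),
    ]
    print([c.name for c in candidates if keep_for_validation(c)])
```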

Data Access

SNP array data are available at the EMBL-EBI Array Express database (https://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-5625. WGS, RNA-seq, anti-CENP-A and input ChIP-seq data are available at the NCBI Short Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) with accession number PRJNA378952. All validated SV sequences were submitted to the GenBank repository (http://www.ncbi.nlm.nih.gov/genbank/), under accession numbers KY966261-KY966313 and KY966314-KY966332 for chimeric transcripts and genomic fusions, respectively.

RESULTS

Seven distinct neocentromeres coexist in the cell line 04T036

The LSC cell line 04T036 was already described as carrying multiple markers stabilized by neocentromeres (ITALIANO et al. 2006). In order to study these neocentromeres, we first investigated the genomic context in which they occurred. Both SNP-array and WGS copy number variant analyses identified weak amplification of the 3q26.1-q29 and 9p24.3-p23 regions (Table S1). FISH with BAC probes mapping to these regions confirmed the presence of a variable number of RGMs per cell, looking like isochromosomes: two derived from chromosome 3, and two to five from chromosome 9 (ITALIANO et al. 2006). We mapped the fusion junctions of these isochromosomes by WGS (Figure 1), and validated the results by PCR and Sanger sequencing. Neocentromeres arising on the multiple copies of RGMs described by Italiano et al. (2006) were detected by looking at anti-CENP-C signals not co-localizing with alphoid probes at immuno-FISH on metaphase spreads (Figure 1A). To define the internal organization of these neocentromeres, we performed ChIP-seq with the anti-CENP-A antibody. This analysis disclosed seven separate peaks of enrichment spanning the amplified regions of the RGMs of chromosomes 3 (peaks 1 and 2) and 9 (peaks 3-7), ranging from 52 to 136 Kb in size (Figure 1, Table S2). The bell-shaped coverage profile of each candidate region, visualized in IGV by looking at the ChIP-seq alignment to the reference genome, as well as both ChIP-seq and WGS data (Table S3a and 3b), suggested that each of them was specific for a different marker chromosome (Figure 1B, C). This hypothesis was verified by performing simultaneous immuno-FISH experiments on elongated chromosomes, with anti-CENP-C antibodies pinpointing centromeres, and different combinations of BAC probes spanning the ChIP-enriched peaks. The results showed the co-localization of the neocentromeric CENP-C signal with a different BAC probe on separate markers, confirming the existence of distinct neocentromeric epialleles arising on the multiple copies of RGMs (Figure 1).

Complex neocentromeres undergoing structural evolution in WDLPS cell lines

The combination of SNP-array and WGS approaches confirmed the already described highly complex internal structure of the marker chromosomes of the WDLPS cell lines (ITALIANO et al. 2009; GARSED et al. 2014) (Table S1, 3a, 3b). In 95T1000, despite the internal arrangement of the multiple detected markers varying within and among cells, their amplified content from chromosomes 10 and 12 remained highly conserved (Figure S1, Table S1). The presence of neocentromeres arising on RGMs was confirmed by looking at anti-CENP-C signals not co-localizing with alphoid probes by immuno-FISH on metaphase spreads (Figure 2B). The anti-CENP-A ChIP-seq analysis, performed to characterize these neocentromeres, revealed four separate peaks of enrichment at 10p12.1, 10p12.33, 12q13.3, and 12q14.1, spanning 74, 9, 43 and 40 Kb in size, respectively. The inspection of the ChIP-seq alignment data by IGV disclosed a coverage drop at one or both ends of each peak (Figure 2A), suggesting that the multiple non-continuous, non-collinear enriched sequences might be juxtaposed by SVs to form a single neocentromere. Our custom ChIP-seq analysis (Methods S1) confirmed this hypothesis, revealing specific SVs joining the fragments (Figure 2A, Table S4), all validated by PCR and Sanger sequencing. Immuno-FISH experiments with anti-CENP-C antibodies and BAC probes spanning the core ChIP-enriched regions confirmed the occurrence of a single neocentromere stabilizing all the observed marker chromosomes (Figure 2C).

In 93T449 (primary tumour) and 94T778 (recurrence), we found 17 chromosomes involved in the amplification (Figure S2). By comparing the copy number profiles of the two related cell lines, we disclosed a perfect conservation of the overall RGM amplified content, with few differences in the copy number state of the amplicons (Figure S3 and Table S1). In metaphase spreads, a higher number of RGMs per cell was found in the recurrence than in the onset tumour (average number of 2.63 vs 1.14). Moreover, we detected 16 SVs specific for 93T449, and six for 94T778 (Table S5). In both cell lines, each observed RGM was mitotically stabilized by a neocentromere (Figure 4A, B), as confirmed by looking at anti-CENP-C signals not co-localizing with alphoid probes by immuno-FISH on metaphase spreads (Figure 2B).

The CENP-A ChIP-seq analysis disclosed four strongly enriched regions shared by both cell lines, as well as five additional weakly enriched fragments, specific for 94T778, clearly suggesting an enlargement of the CENP-A centromeric domain in the recurrence of the tumour. As for 95T1000, the neocentromeric seeding occurred on a patchwork of amplified sequences (Figure 3). By juxtaposing each enrichment peak according to the identified SVs, we inferred two separate neocentromeric contiguous sequences (NEO1 and NEO2), sharing a 5 Kb region at chr1: 188.377-188.382 Mb. Since no SV connecting these two contigs could be detected, we hypothesized the occurrence of two distinct neocentromeres arising at the tumour onset (Figure 3). To validate this hypothesis, we performed immuno-FISH experiments on elongated chromosomes with the anti-CENP-C antibody and BAC probes spanning the core regions of the two inferred neocentromeres. The results disclosed a mutually exclusive co-localization of the anti-CENP-C signals with either of the probes on each marker chromosome, confirming two separate neocentromeres occurring within the same cell in both samples (Figure 4). Finally, qPCR assays performed on the CENP-A IP and input DNA with primer pairs specific for each fragment of NEO1 and NEO2 confirmed the differential enrichment ratio of shared vs specific (94T778) ChIP-enriched sites, proving the size increase of both neocentromeres from the primary tumour to the recurrence (Figure S4). More specifically, NEO1 increased from 84 to 147 Kb and NEO2 from 68 to 121 Kb.

From neocentromeric transcription to the burst of chimeric transcripts

We investigated the transcriptional activity of neocentromeres by integrating RNA-seq, WGS and ChIP-seq data. In line with previous studies (AMOR AND CHOO 2002; WONG et al. 2006), we found that gene transcriptional activity was not hampered by the presence of neocentromeres. Surprisingly, we also found two chimeric transcripts with one of the two partners mapping within the neocentromere domains of 93T449 and 94T778 (i.e., APP/HMCN1 and LOC100507250/HMCN1). Investigating their origin, we found that the SVs supporting these chimeras were not involved in the assembly of the neocentromeres; therefore, they likely derived from additional rearrangements of the same amplified region, which did not acquire a centromeric function. Although this analysis did not reveal any aberrant transcript specific for neocentromeres, it shed light on the multiple chimeric transcripts originating through the amplification process manifested in the RGMs. We detected thousands of putative chimeras, and validated the most abundant ones by RT-PCR and Sanger sequencing (Table 1, Table S6). According to the RefSeq gene function annotations, most of the chimeric partners were reported as cancer-associated genes (Table 1). To investigate the origin of these chimeras, we built a custom pipeline to integrate WGS and RNA-seq data (for details see Methods S1), and demonstrated that some transcripts were supported by perfectly matching SVs (Class I), while others showed a much more complex origin. Class II transcripts, indeed, were assembled by means of several non-contiguous, non-collinear genomic fragments interposed between partner genes, which were actively transcribed but subsequently spliced out from the mature mRNA (Figure S5 and S6). In contrast, Class III chimeras, lacking any supporting SV, could possibly originate from post-transcriptional events (Table 1).
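The three classes can be paraphrased as a simple decision rule, sketched below in Python. This is only a reading of the definitions given above, not the custom pipeline of Methods S1; the single input (the number of genomic fragments interposed between the partner genes, or None when no supporting SV exists) is a hypothetical simplification.

```python
from typing import Optional


def chimera_class(interposed_fragments: Optional[int]) -> str:
    """Assign a class from the (hypothetical) count of interposed genomic fragments.

    None -> no supporting SV was found (Class III);
    0    -> a perfectly matching SV joins the partner genes (Class I);
    >= 1 -> non-contiguous fragments lie between the partners (Class II).
    """
    if interposed_fragments is None:
        return "Class III"
    if interposed_fragments < 0:
        raise ValueError("fragment count cannot be negative")
    if interposed_fragments == 0:
        return "Class I"
    return "Class II"


if __name__ == "__main__":
    for value in (0, 3, None):
        print(value, chimera_class(value))
```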

DISCUSSION

In the present study, we unveiled the extremely complex molecular architecture of neocentromeres mitotically stabilizing RGMs in WDLPS and LSC cell lines, describing their structure down to the nucleotide level, and disclosed the occurrence of a burst of chimeric transcripts associated with genomic amplification. Both these findings shed light on the extraordinary genomic and transcriptomic plasticity of RGMs, likely playing a role in cancer evolution.

The enhanced complexity of neocentromeres arising at RGMs

The genomic architecture of neocentromeres in LSC and WDLPS cancer types showed distinct features. Each of the multiple RGMs found in the 04T036 LSC cell line had a single neocentromere, embedded in a continuous, non-rearranged sequence. Four out of the five neocentromeres mitotically stabilizing the multiple copies of RGM9 mapped close to each other (within 1 Mb), while the fifth mapped ~10 Mb apart. Instead, the two neocentromeres stabilizing the RGM3 mapped ~5 Mb apart from each other. Very likely, only a single neocentromeric-seeding event occurred on each of the two ancestral types of RGM, which subsequently multiplied following mitotic errors. The different positions of the neocentromeres arising on the multiple copies of RGM3 and RGM9 were likely due to extensive “sliding” processes along the chromosomes during tumour evolution, which led to functional “epialleles”. This phenomenon was recently discovered by Purgato et al. (2015) while studying the satellite-free centromere of horse chromosome 11 (likely an evolutionary neocentromere). That centromere was sliding within a ~500 Kb genomic segment, a smaller region compared to those described here in 04T036. When a neocentromere arises in natural populations, the meiotic process can be affected by large-distance sliding events giving rise to epialleles. Indeed, in heterozygous individuals, crossing-over events involving regions delimited by neocentromeric epialleles would generate acentric/dicentric chromosomes, resulting in reduced fitness. However, the mitotic process is not affected by distant neocentromeric epialleles, so, in cancer, it seems very likely that neocentromere fluctuations might involve several Mb, especially on RGMs. Moreover, the alternative hypothesis calling for the simultaneous and independent seeding of seven neocentromeres on separate acentric RGMs seems highly unlikely.

Unlike what was observed in the LSC cell line, the neocentromeres of all the studied WDLPS cell lines showed a very complex structure, consisting of a patchwork of multiple short sequences, amplified from several chromosomes. The structural complexity of neocentromeres in WDLPS strongly indicates that they arose secondarily to the amplification process, very likely as a consequence of the massive recruitment of CENP-A on highly rearranged RGMs to enable double-strand-break repair (ZEITLIN et al. 2009). Moreover, by comparing the 93T449 and 94T778 cell lines (primary and relapse of the same tumour, respectively), we disclosed additional interesting features. The perfect match between the amplified RGM content of the two related cell lines indicates that these structures arose as early events in tumour evolution and were maintained under a strong selective pressure. The higher number of SVs specific for the onset cell line (93T449) also suggested that the recurrence was established from a minor sub-clone of the primary tumour. Moreover, the neocentromeres of 94T778 were 50-60 Kb larger than those of 93T449, suggesting an evolution of these structures during tumour progression. As the size of centromeres tends to be uniform within species, and the expansion of the neocentromeric size has been suggested to constitute a key factor in the survival of neocentric chromosomes (WANG et al. 2014), we speculate that the size increase of neocentromeres during tumorigenesis might be correlated with acquired mitotic stability of the RGMs harbouring them. As for 04T036, the presence of two separate neocentromeres, even sharing a 5 Kb domain in the 94T778 cell line, might be the result of centromeric sliding phenomena.

Literature on human neocentromeres arising in acentric chromosomal fragments indicates a variable functional efficiency of these structures, which often leads to mosaicism for the marker (MARSHALL et al. 2008). In cancer, an incomplete functional efficiency of neocentromeres, causing non-disjunction, might lead to the segregation of multiple copies of RGMs to daughter cells, potentially increasing their selective advantage. This consideration could apply to the finding of a higher number of RGMs in a relapse (94T778) than in a primary tumour (93T449). It might also be conjectured that the instability of the neocentromeres in the primary tumour led to the selection of a higher number of RGMs per cell, subsequently stabilized by a neocentromere expansion in the recurrence. Combined, our results shed light on the extraordinary evolutionary plasticity of cancer-associated neocentromeres, and provide strong support for the strictly epigenetic nature of the neocentromeric seeding process.

Transcriptomic plasticity: the hidden side of the complex rearrangements in RGMs

Centromeric transcription has been reported as actively participating in kinetochore assembly and function (MULLER AND ALMOUZNI 2017). Our RNA-seq analyses, however, did not reveal any unusual transcriptional activity specific for neocentromeres, but disclosed the occurrence of several chimeric transcripts derived from the whole RGM, which we considered worthwhile to investigate, as very few previous studies focused on this topic. The combined analysis of genomic and transcriptomic sequencing data allowed us to detect chimeric transcripts originating from extremely complex rearrangements, such as those resulting in the fusion of three partner genes (among Class I transcripts), or those assembled by means of multiple interposed non-contiguous, non-collinear genomic fragments (Class II). To the best of our knowledge, such complex fusions have not been described before in cancer. The molecular mechanism underlying these genomic chimeras might resemble the “exon-shuffling” process creating functional genetic novelties and multi-domain proteins over natural evolution (FRANCA et al. 2012). Indeed, a similar but accelerated mechanism might enrich the genetic repertoire of cancer cells, likely providing a selective advantage. The splicing machinery also seems to be actively involved in shaping chimeras and creating variability in cancer. In assembling Class II chimeras, for instance, it actively removes the interposed fragments from the mature mRNA. Here, we also report multiple examples of transcription-induced chimeras (Class III), although our analysis likely just uncovered the tip of the iceberg (LU et al. 2016).

The burst of chimeric transcripts documented in our study brings the focus onto the extreme transcriptome plasticity of highly rearranged RGMs, which might be crucial for cancer initiation and progression. Indeed, the shuffling of coding sequences might strongly contribute to the proliferative success of the cancer cells by enriching and variegating the genetic substrate for selection to act on. We propose that the increased incidence of gene fusion events, caused by the extremely rearranged nature of RGMs, might act as a driving force for genome amplification, together with the well-established increased expression of oncogenes (TAYLOR et al. 2011; PANAGOPOULOS et al. 2014).

ACKNOWLEDGEMENTS

This work was supported by the Italian Association on Cancer Research (AIRC) (AIRC Investigator Grant 15413) to CTS, PRIN (Progetti di Interesse Nazionale) and EPIGEN (CNR) to MR, Fondazione Cassa di Risparmio di Puglia and APQ Ricerca Regione Puglia (FutureInResearch) to GM.

DISCLOSURE/CONFLICT OF INTEREST

The authors have no duality of interest to declare.

Supplementary information is available

REFERENCES

Altschul, S. F., W. Gish, W. Miller, E. W. Myers and D. J. Lipman, 1990 Basic local alignment search tool. J Mol Biol 215: 403-410.
Amor, D. J., and K. H. Choo, 2002 Neocentromeres: role in human disease, evolution, and centromere study. Am J Hum Genet 71: 695-714.
Athwal, R. K., M. P. Walkiewicz, S. Baek, S. Fu, M. Bui et al., 2015 CENP-A nucleosomes localize to transcription factor hotspots and subtelomeric sites in human cancer cells. Epigenetics Chromatin 8: 2.
Beh, T. T., R. N. MacKinnon and P. Kalitsis, 2016 Active centromere and chromosome identification in fixed cell lines. Mol Cytogenet 9: 28.
Chan, F. L., O. J. Marshall, R. Saffery, B. W. Kim, E. Earle et al., 2012 Active transcription and essential role of RNA polymerase II at the centromere during mitosis. Proc Natl Acad Sci U S A 109: 1979-1984.
Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski et al., 2013 STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29: 15-21.
Earnshaw, W. C., H. Ratrie, 3rd and G. Stetten, 1989 Visualization of centromere proteins CENP-B and CENP-C on a stable dicentric chromosome in cytological spreads. Chromosoma 98: 1-12.
Franca, G. S., D. V. Cancherini and S. J. de Souza, 2012 Evolutionary history of exon shuffling. Genetica 140: 249-257.
Garsed, D. W., O. J. Marshall, V. D. Corbin, A. Hsu, L. Di Stefano et al., 2014 The architecture and evolution of cancer neochromosomes. Cancer Cell 26: 653-667.
Harrow, J., A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans et al., 2012 GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22: 1760-1774.
Italiano, A., R. Attias, A. Aurias, G. Perot, F. Burel-Vandenbos et al., 2006 Molecular cytogenetic characterization of a metastatic lung sarcomatoid carcinoma: 9p23 neocentromere and 9p23-p24 amplification including JAK2 and JMJD2C. Cancer Genet Cytogenet 167: 122-130.
Italiano, A., G. Maire, N. Sirvent, P. A. Nuin, F. Keslair et al., 2009 Variability of origin for the neocentromeric sequences in analphoid supernumerary marker chromosomes of well-differentiated liposarcomas. Cancer Lett 273: 323-330.
Iyer, M. K., A. M. Chinnaiyan and C. A. Maher, 2011 ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics 27: 2903-2904.
Kent, W. J., 2002 BLAT--the BLAST-like alignment tool. Genome Res 12: 656-664.
Klare, K., J. R. Weir, F. Basilico, T. Zimniak, L. Massimiliano et al., 2015 CENP-C is a blueprint for constitutive centromere-associated network assembly within human kinetochores. J Cell Biol 210: 11-22.
L'Abbate, A., G. Macchia, P. D'Addabbo, A. Lonoce, D. Tolomeo et al., 2014 Genomic organization and evolution of double minutes/homogeneously staining regions with MYC amplification in human cancer. Nucleic Acids Res 42: 9131-9145.
Li, H., and R. Durbin, 2009 Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754-1760.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009 The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.
Livak, K. J., and T. D. Schmittgen, 2001 Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods 25: 402-408.
Lu, G., J. Wu, G. Zhao, Z. Wang, W. Chen et al., 2016 Abundant and broad expression of transcription-induced chimeras and protein products in mammalian genomes. Biochem Biophys Res Commun 470: 759-765.
Macchia, G., K. H. Nord, M. Zoli, S. Purgato, P. D'Addabbo et al., 2015 Ring chromosomes, breakpoint clusters, and neocentromeres in sarcomas. Genes Chromosomes Cancer 54: 156-167.
Marshall, O. J., A. C. Chueh, L. H. Wong and K. H. Choo, 2008 Neocentromeres: new insights into centromere structure, disease development, and karyotype evolution. Am J Hum Genet 82: 261-282.
Matsui, A., T. Ihara, H. Suda, H. Mikami and K. Semba, 2013 Gene amplification: mechanisms and involvement in cancer. Biomol Concepts 4: 567-582.
McNulty, S. M., L. L. Sullivan and B. A. Sullivan, 2017 Human Centromeres Produce Chromosome-Specific and Array-Specific Alpha Satellite Transcripts that Are Complexed with CENP-A and CENP-C. Dev Cell 42: 226-240 e226.
Muller, S., and G. Almouzni, 2017 Chromatin dynamics during the cell cycle at centromeres. Nat Rev Genet.
Nord, K. H., G. Macchia, J. Tayebwa, J. Nilsson, F. Vult von Steyern et al., 2014 Integrative genome and transcriptome analyses reveal two distinct types of ring chromosome in soft tissue sarcomas. Hum Mol Genet 23: 878-888.
Panagopoulos, I., B. Bjerkehagen, L. Gorunova, J. M. Berner, K. Boye et al., 2014 Several fusion genes identified by whole transcriptome sequencing in a spindle cell sarcoma with rearrangements of chromosome arm 12q and MDM2 amplification. Int J Oncol 45: 1829-1836.
Pedeutour, F., G. Maire, A. Pierron, D. M. Thomas, D. W. Garsed et al., 2012 A newly characterized human well-differentiated liposarcoma cell line contains amplifications of the 12q12-21 and 10p11-14 regions. Virchows Arch 461: 67-78.
Purgato, S., E. Belloni, F. M. Piras, M. Zoli, C. Badiale et al., 2015 Erratum to: Centromere sliding on a mammalian chromosome. Chromosoma 124: 289.
Quenet, D., and Y. Dalal, 2015 Correction: a long non-coding RNA is required for targeting centromeric protein A to the human centromere. Elife 4.
Rausch, T., T. Zichner, A. Schlattl, A. M. Stutz, V. Benes et al., 2012 DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28: i333-i339.
Rosic, S., F. Kohler and S. Erhardt, 2014 Repetitive centromeric satellite RNA is essential for kinetochore formation and cell division. J Cell Biol 207: 335-349.
Severgnini, M., L. Pattini, C. Consolandi, E. Rizzi, C. Battaglia et al., 2006 Application of the Taguchi method to the analysis of the deposition step in microarray production. IEEE Trans Nanobioscience 5: 164-172.
Shono, N., J. Ohzeki, K. Otake, N. M. Martins, T. Nagase et al., 2015 CENP-C and CENP-I are key connecting factors for kinetochore and CENP-A assembly. J Cell Sci 128: 4572-4587.
Sirvent, N., A. Forus, W. Lescaut, F. Burel, S. Benzaken et al., 2000 Characterization of centromere alterations in liposarcomas. Genes Chromosomes Cancer 29: 117-129.
Storlazzi, C. T., T. Fioretos, C. Surace, A. Lonoce, A. Mastrorilli et al., 2006 MYC-containing double minutes in hematologic malignancies: evidence in favor of the episome model and exclusion of MYC as the target gene. Hum Mol Genet 15: 933-942.
Storlazzi, C. T., A. Lonoce, M. C. Guastadisegni, D. Trombetta, P. D'Addabbo et al., 2010 Gene amplification as double minutes or homogeneously staining regions in solid tumors: origin and structure. Genome Res 20: 1198-1206.
Taylor, B. S., P. L. DeCarolis, C. V. Angeles, F. Brenet, N. Schultz et al., 2011 Frequent alterations and epigenetic silencing of differentiation pathway genes in structurally rearranged liposarcomas. Cancer Discov 1: 587-597.
Thorvaldsdottir, H., J. T. Robinson and J. P. Mesirov, 2013 Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14: 178-192.
Trazzi, S., G. Perini, R. Bernardoni, M. Zoli, J. C. Reese et al., 2009 The C-terminal domain of CENP-C displays multiple and critical functions for mammalian centromere formation. PLoS One 4: e5832.
Trombetta, D., G. Macchia, N. Mandahl, K. H. Nord and F. Mertens, 2012 Molecular genetic characterization of the 11q13 breakpoint in a desmoplastic fibroma of bone. Cancer Genet 205: 410-413.
Wade, C. M., E. Giulotto, S. Sigurdsson, M. Zoli, S. Gnerre et al., 2009 Genome sequence, comparative analysis, and population genetics of the domestic horse. Science 326: 865-867.
Wang, J., C. G. Mullighan, J. Easton, S. Roberts, S. L. Heatley et al., 2011 CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods 8: 652-654.
Wang, K., Y. Wu, W. Zhang, R. K. Dawe and J. Jiang, 2014 Maize centromeres expand and adopt a uniform size in the genetic background of oat. Genome Res 24: 107-116.
Wong, N. C., L. H. Wong, J. M. Quach, P. Canham, J. M. Craig et al., 2006 Permissive transcriptional activity at the centromere through pockets of DNA hypomethylation. PLoS Genet 2: e17.
Xi, R., A. G. Hadjipanayis, L. J. Luquette, T. M. Kim, E. Lee et al., 2011 Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci U S A 108: E1128-1136.
Xie, C., and M. T. Tammi, 2009 CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics 10: 80.
Zeitlin, S. G., N. M. Baker, B. R. Chapados, E. Soutoglou, J. Y. Wang et al., 2009 Double-strand DNA breaks recruit the centromeric histone CENP-A. Proc Natl Acad Sci U S A 106: 15762-15767.

FIGURES AND TABLE LEGENDS

Figure 1: The genomic architecture of the RGMs and neocentromeres in the 04T036 cell line. (A) Metaphase spread showing immuno-FISH results on the 04T036 cell line; probe names and signals are consistently coloured; white arrows pinpoint neocentromeres on RGMs, detectable by CENP-C signals (green) not co-localizing with the alphoid probe (red); BAC probes spanning the amplified regions are also hybridized (violet and blue) on the RGMs. (B, C) Schematic representation of the overall structures of RGM9 (B) and RGM3 (C) in 04T036; thin blue lines and breakpoint positions indicate the SV connections assembling the isochromosomes. IGV partial auto-scaled plots of the CENP-A ChIP-enriched regions are shown with the corresponding input panels (negative controls); the Y- and X-axes represent read count averaged in windows (value shown at the left of each box) and reference human genome (GRCh37/hg19) coordinates (Kb positions shown at the top), respectively; BAC probes spanning each neocentromeric peak are reported on the right. (D) Immuno-FISH results are shown on the right; probe names and signals are consistently coloured; CENP-C signals (green) mapping at RGMs pinpoint neocentromeres (as shown in A). Left and middle boxes of the immuno-FISH panels show a two-way merge between each probe and CENP-C signals; right boxes show partial metaphases with three-way merged signals; in each box, pale blue and yellow arrows indicate neocentromeres carrying blue/green and red/green co-localizing signals, respectively. The results obtained with this combination of different neocentromeric probes confirmed the presence of multiple neocentromeric alleles [represented as red lines on the RGM ideograms (B, C)].

Figure 2: The genomic architecture of the neocentromeres in the 95T1000 cell line. (A) IGV partial plots of the CENP-A ChIP-enriched regions aligned to the human reference (GRCh37/hg19), juxtaposed in proper order and orientation to assemble the inferred neocentromeric structure of 95T1000; the top panel shows the normalized enrichment profile of each peak (IP versus input; full details in Supplementary Methods); the middle and lower panels show the IP and input raw enrichment profiles; the Y- and X-axes represent read count averaged in windows (value shown at the left of each box) and reference genome coordinates (Kb positions shown at the top), respectively; we fixed the scale of the Y-axis to the value of the peak with the highest read count average on IGV. BAC probes chosen for the immuno-FISH experiments are reported below each corresponding neocentromeric peak. The lower panel is a schematic reconstruction of the assembled neocentromeric contig; each neocentromeric fragment is represented as an arrow and aligned to the upper panel; thin blue lines represent SV connections, and flags indicate breakpoint positions. Immuno-FISH results with anti-CENP-C antibodies (green) co-hybridized with the alphoid probe (B) and with BAC probes spanning the neocentromeric core (C) are shown; probe names and signals are consistently coloured; CENP-C signals (green), not co-localizing with the alphoid probe (red), pinpoint neocentromeres mapping at RGMs in (B), while white arrowheads indicate neocentromeres carrying merged blue/red/green signals in (C).

Figure 3: The genomic architecture of NEO1 and NEO2 in the 93T449 and 94T778 cell lines. At the top of each box, IGV partial plots of the CENP-A ChIP-enriched regions of NEO1 (A) and NEO2 (B) are shown, juxtaposed in proper order and orientation to assemble the neocentromeric domain of both the 93T449 and 94T778 cell lines. The top panels of both (A) and (B) boxes show the normalized enrichment profile of the neocentromeric peaks for each cell line (IP versus input; full details in Supplementary Methods); the middle and lower panels show the IP and input raw enrichment profiles; the Y- and X-axes represent read count averaged in windows (the relative values are shown on the left of each box) and reference genome coordinates (Kb positions shown at the top), respectively; we fixed the scale of the Y-axis to the value of the peak with the highest read count average on IGV. BAC probes chosen for the immuno-FISH experiments are reported below the corresponding neocentromeric peak. At the bottom of each box, a schematic reconstruction of the assembled neocentromeric contig is represented; each neocentromeric fragment is shown as an arrow and aligned to the upper panel; thin blue lines represent SV connections, and flags indicate breakpoint positions. A clear differential enrichment between the neocentromeric tails is shown when comparing 93T449 and 94T778.

Figure 4: Metaphase spreads showing immuno-FISH results on the 93T449 (A, C) and 94T778 (B, D) cell lines; probe names and signals are consistently coloured; CENP-C signals (green), not co-localizing with alphoid probes (red), pinpoint neocentromeres on RGMs (green arrows), while probes RP11-203K8 and RP11-30I11 are specific for NEO1 and NEO2 cores, respectively. (C, D) Left and middle boxes show two-way merged images of each probe and CENP-C signals; boxes to the right show the three-way merged signals on partial metaphase spreads; pale blue and yellow arrows indicate neocentromeres carrying blue/green and red/green merged signals, respectively. The results obtained with this combination of different neocentromeric probes demonstrate the presence of two neocentromeric alleles in both cell lines.

Table 1: List of all the validated chimeras detected by ChimeraScan; Class I, II and III refer to chimeras derived from canonical, complex and transcription-induced fusions, respectively. Gene name styles are consistent with their main function (bold = cancer genes, underlined = ubiquitination, Italic = RNA-splicing and surveillance, Italic = long non-coding, and bold = trafficking and transport). OF = out of frame, IF = in frame, NA = not analysed. * = chimera whose supporting SV, containing a SINE at the breakpoint site, was detected by long-PCR and Sanger sequencing.

[Figure 1 image. Panels A-D; IP and input tracks for Peaks 1-7 on RGM3 and RGM9; probes shown: ALPHOID, CENPC, RP11-144F4, RP11-668E6, RP11-58M14, RP11-142K8, RP11-258B1, RP11-351P15, RP11-822C9.]

[Figure 2 image. Panels A-C; 95T1000; NORMALIZED, IP and INPUT tracks over chr12 and chr10 segments; probes shown: ALPHOID, CENPC, RP11-791A6, RP11-979M22, RP11-344K11.]

[Figure 3 image. Panel A: NORMALIZED, IP and INPUT tracks for 94T778 and 93T449 over chr1, chr12 and chr6 segments; probe RP11-203K8. Panel B: NORMALIZED, IP and INPUT tracks for 94T778 and 93T449 over chr1 and chr12 segments; probe RP11-30I11.]

[Figure 4 image. Panels A, C: 93T449; panels B, D: 94T778; probes shown: ALPHOID, CENPC, RP11-203K8, RP11-30I11.]

[Table 1. Columns: cell line (94T778, 93T449, 95T1000, 04T036), CHIMERA, SCORE, CLASS, FRAME, IN SILICO TRANSLATION.]