Supplementary Materials for
Frequent alterations and epigenetic silencing of differentiation pathway genes in structurally rearranged liposarcomas

Barry S. Taylor, Penelope L. DeCarolis, Christina V. Angeles, Fabienne Brenet, Nikolaus Schultz, Cristina R. Antonescu, Joseph M. Scandura, Chris Sander, Agnes J. Viale, Nicholas D. Socci, Samuel Singer

Supplementary Methods

Alignment. All reads were aligned to the reference human genome (NCBI build 36.1, hg18). Mate-paired and methylation sequence reads were aligned with the ABI Bioscope seeded extension mapper (ver. 1.2). Exome reads were aligned either with Bioscope or with BWA (1) to allow gaps for small indel detection, as described below. RNA sequencing reads were aligned with the Bioscope whole-transcriptome pipeline. For all experiments on each of the four samples, both fragment and mate-pair reads mapping to the reference genome with sufficient quality were converted to SAM format (2) for subsequent analyses and for visualization in the Integrative Genomics Viewer (3).

DNA copy number from whole-genome sequence. Copy-number alterations were assessed in the whole-genome sequence with the SegSeq algorithm (4) (w=400, a=100, b=10) using only the forward reads of mate pairs that mapped uniquely to the genome in each of the tumor/normal pairs. Duplicate reads aligned to unique genome positions were not excluded. Previous simulations with fragment libraries of 50 bp reads indicate that ~2.55 Gb (~82.8% of the genome) is mappable assuming an edit distance of two. This value was therefore used as the alignable portion of the genome for this analysis; it is likely a conservative estimate for empirical data aligned with progressive mapping (see alignment details in Methods, main text). We adapted RAE (5), an algorithm originally developed to detect copy number alterations from array data (aCGH or SNP arrays), to the analysis of whole-genome sequencing data (Fig. S9).
The two samples were individually parameterized, and predicted alterations were identified with the adapted multi-component model as low-level gain (A0 ≥ 0.9), high-level amplification (A0 ≥ 0.9 and A1 > 0.25), heterozygous loss (D0 ≥ 0.9), and homozygous deletion (D0 ≥ 0.9 and D1 ≥ 0.9). We additionally co-hybridized source tumor DNA from each sample to Agilent 244K array comparative genomic hybridization (aCGH) microarrays with a pool of reference normal DNA according to the manufacturer’s instructions (Agilent Technologies, Wilmington, DE). Raw data were obtained and normalized as previously described (6). Probe-level data were segmented with Circular Binary Segmentation and analyzed with the original implementation of RAE, both as previously described (5, 7). Unsupervised hierarchical clustering of copy number alterations in the larger set of DLPS used for sample selection was performed as previously described (8).

Structural rearrangement detection. The non-redundant mate pairs (that is, excluding duplicate mate pairs) aligned to the reference genome for each sample were first classified into groups based on the alignment position (strand and orientation) and the distance separating paired reads. This distance was based on the empirical distribution of the insert sizes estimated from the alignment of each sample. A putative rearrangement, focusing here on intra- and inter-chromosomal rearrangements and associated aberrations, was defined as an event supported by a cluster of multiple atypically paired reads in a tumor sample that are lacking from its corresponding matched normal. We excluded both singleton non-overlapping atypical mates and clusters of overlapping atypical mates where the mate chromosome and position were inconsistent.
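As a minimal sketch of the grouping step described here, the Python below classifies mate pairs by chromosome, strand orientation, and separating distance. The source states only that the distance criterion came from each sample's empirical insert-size distribution; the median ± k·MAD bounds, the `expected_same_strand` flag, the `MatePair` record, and the category labels are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class MatePair:
    """Hypothetical record for one aligned mate pair."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def insert_bounds(distances, k=3.0):
    """Assumed rule: accept distances within median +/- k * MAD."""
    med = median(distances)
    mad = median(abs(d - med) for d in distances)
    return med - k * mad, med + k * mad


def classify(pair, low, high, expected_same_strand=True):
    """Label a mate pair as concordant or as one kind of atypical pair."""
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    same_strand = pair.strand1 == pair.strand2
    # The expected orientation depends on the library protocol (assumption).
    if same_strand != expected_same_strand:
        return "intra-chromosomal, inversion"
    distance = abs(pair.pos2 - pair.pos1)
    if distance < low or distance > high:
        return "intra-chromosomal, atypical distance"
    return "concordant"
```

In this sketch the bounds would be estimated once per sample from its aligned pairs and then applied to every pair; only pairs not labeled `"concordant"` would go on to clustering.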
All remaining structurally atypical mates (inter-chromosomal or intra-chromosomal, indicating inversion or not) were processed with the GASV algorithm (9), pairing matched normal and tumor data to filter mates indicative of a breakpoint in the normal sample and therefore to retain only somatic rearrangements. Candidate rearrangements were excluded if either breakpoint (a) overlapped a previously characterized structural variant in normal populations (as described in ref. (6)) or in previous individual genome sequencing (10-17), (b) appeared within 1 Mb of a sequence gap, or (c) was supported by fewer than 5 atypical reads. The most common repetitive elements adjacent to rearrangement breakpoints were Alu, L1, L2, and low-complexity sequence (data not shown). The copy number at each breakpoint was taken to be the extreme copy number segmentation value (calculated as described above) that overlapped the breakpoint or adjacent sequence (5’ and 3’ sequence equal in length to the breakpoint). Rearrangements were encoded as mixed-type complex (Table S2) if their origin was either ambiguous or combinatorial, or if major and minor annotated alteration types both exceeded 25% of the supporting reads. The approach described here is fairly stringent and is designed to minimize the number of false positive events.

To assess the accuracy of rearrangement discovery, we performed validation with array-based copy number data generated on the same samples, an approach similar to that of the 1000 Genomes Consortium (18). Here, we focused on rearrangements associated with a copy number alteration (CNA; 97.6% of total) and tested for the presence of the breakpoints in the segmentation of aCGH data. This is predicated on the fact that CNAs arise from double-stranded DNA breaks and should therefore harbor an associated rearrangement (a cluster of structurally atypical mates at its breakpoints) from mate-paired sequencing.
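The three exclusion criteria (a) to (c) listed above for candidate rearrangements can be sketched as a single predicate. This is a minimal illustration, assuming breakpoints and annotations are held as simple `(chrom, start, end)` tuples with closed coordinates; the function and variable names are hypothetical, and the linear scans stand in for whatever interval index a real pipeline would use.

```python
GAP_MARGIN = 1_000_000   # criterion (b): within 1 Mb of a sequence gap
MIN_SUPPORT = 5          # criterion (c): fewer than 5 atypical reads


def overlaps(region, intervals, margin=0):
    """True if region comes within `margin` bases of any interval on its chromosome."""
    chrom, start, end = region
    return any(
        c == chrom and start <= e + margin and end >= s - margin
        for c, s, e in intervals
    )


def passes_filters(bp1, bp2, n_atypical_reads, known_svs, gaps):
    """Return False if the candidate meets any of the exclusion criteria."""
    if n_atypical_reads < MIN_SUPPORT:               # criterion (c)
        return False
    for bp in (bp1, bp2):                            # "either breakpoint"
        if overlaps(bp, known_svs):                  # criterion (a)
            return False
        if overlaps(bp, gaps, margin=GAP_MARGIN):    # criterion (b)
            return False
    return True
```

Checking read support first is only an ordering choice for the sketch; the three criteria are independent, so any order gives the same result.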
Nevertheless, the resolution of the low-pass whole-genome sequencing data is much greater than that of the accompanying aCGH platform (near-base resolution versus 235,829 probes separated by a median of ~9 kb of intervening genome sequence), so we limited this analysis to those rearrangements whose associated segment of CNA (sequence-based, see above) spanned 3 or more probes of the corresponding array design. The rationale for this criterion is that a CNA smaller than can reasonably be detected with fixed-resolution aCGH would lack sufficient data for validation. Additionally, for rearrangement breakpoints that fell in a gap of aCGH segmentation (a region of breakpoint ambiguity between two adjacent probes that mark the end and start of the 5’ and 3’ adjacent segments, respectively), the rearrangement was assigned to the adjacent segment of extreme copy number. We considered a rearrangement partially or completely concordant if one or both breakpoints, respectively, agreed with the breakpoints called from aCGH and were associated with the same CNA. Otherwise, the event was considered discordant and a likely false positive. In total, 91.3% of rearrangements had complete concordance (both breakpoints) with array-based data, which corresponds to an estimated FDR of 8.7%. For balanced rearrangements (~2.4% of those detected here), we assume the false positive and negative rates were higher because of the low depth-of-coverage of the genome sequencing.

To determine if rearrangement breakpoints were over-represented in genic regions, we compared the number of observed genic breakpoints to a distribution of random rearrangements. We generated a set of random rearrangement breakpoints conditioned on both the size distribution of the observed breakpoints and their chromosome of origin. Because nearly all of the somatic rearrangements were associated with a CNA, this distribution is likely to reflect the overall distribution of rearrangements.
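A minimal sketch of this randomization follows: each observed rearrangement keeps its chromosome and span but is placed at a uniformly random position, and the number of placements with at least one breakpoint inside a gene footprint is counted. The sketch is restricted to intra-chromosomal events, and the tuple layout, function names, fixed seed, and add-one form of the empirical p-value are assumptions made for illustration; the source does not specify them.

```python
import random


def in_gene(chrom, pos, genes):
    """True if pos lies inside any gene footprint given as (chrom, start, end)."""
    return any(c == chrom and s <= pos <= e for c, s, e in genes)


def genic_count(rearrangements, genes):
    """Count rearrangements with one or both breakpoints inside a gene."""
    return sum(
        in_gene(chrom, a, genes) or in_gene(chrom, b, genes)
        for chrom, a, b in rearrangements
    )


def randomize(rearrangements, chrom_sizes, rng):
    """Re-place each event on its own chromosome, preserving its span."""
    shuffled = []
    for chrom, a, b in rearrangements:
        span = abs(b - a)
        start = rng.randint(1, chrom_sizes[chrom] - span)
        shuffled.append((chrom, start, start + span))
    return shuffled


def empirical_p(observed, null_counts):
    """Add-one empirical p-value for enrichment (assumed convention)."""
    hits = sum(count >= observed for count in null_counts)
    return (hits + 1) / (len(null_counts) + 1)


def enrichment_test(rearrangements, genes, chrom_sizes, n_perm=10_000, seed=0):
    """Compare the observed genic count with counts from randomized placements."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    observed = genic_count(rearrangements, genes)
    null = [
        genic_count(randomize(rearrangements, chrom_sizes, rng), genes)
        for _ in range(n_perm)
    ]
    return observed, empirical_p(observed, null)
```

The add-one correction keeps the estimate from ever being exactly zero with a finite number of permutations; a plain proportion of permuted counts at or above the observed count would be an equally reasonable reading of the text.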
In total, we performed 10,000 permutations, each producing an expected count of rearrangements in which one or both randomized breakpoints fall within the genomic footprint of a gene. We calculated an empirical p-value for the enrichment of breakpoints in genes by comparing these counts to the observed number.

Mutation detection. Single-nucleotide variants were determined from exome and RNA sequencing reads in regions of sufficient coverage. We first excluded all degenerate reads from further analysis, defined here as any read from either experiment whose chromosome, start position, strand, and color-space sequence matched another aligned read. For exome and RNA data, base quality recalibration, variant detection, and variant annotation were performed with the GATK framework (19, 20). Specifically, after base quality recalibration for color-space reads, variant detection in exome data was performed with the UnifiedGenotyper. For high-coverage exome experiments, variants were excluded if their variant quality was <30, their genotype quality was <5, or if they were associated with either homopolymer runs or excessive strand bias. Novel variants, those not previously identified in either dbSNP ver. 130 (excluding overlap with COSMIC ver. 48) or 1000 Genomes (18), were required to be derived from base-space reads not duplicated from non-duplicate color-space reads, not to reside exclusively in higher-error base positions (positions 38-50), and to show evidence of the variant allele in reads mapping to both strands. Candidate somatic mutations were those with a variant genotype in the tumor and a reference genotype in the normal sample, with minimum coverage of 10 and 6 reads, respectively. Additionally, we required that the tumor variant frequency was ≥10% and that each variant was detected in 4 or more tumor reads.

Our pipeline for small insertion and deletion (indel) detection was as follows. Gapped realignment of exome sequencing reads was performed with BWA.
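The candidate somatic mutation criteria stated above for single-nucleotide variants translate directly into a small predicate. The following is a sketch only: the `SiteCounts` record and the function name are hypothetical, and the genotype calls are assumed to have been made upstream and passed in as booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    """Hypothetical per-site summary for a tumor/normal pair."""
    tumor_depth: int            # reads covering the site in the tumor
    tumor_variant_reads: int    # tumor reads carrying the variant allele
    normal_depth: int           # reads covering the site in the normal
    tumor_is_variant: bool      # tumor genotype call is non-reference
    normal_is_reference: bool   # normal genotype call is reference


def is_candidate_somatic(site):
    """Apply the genotype, coverage, read-count, and frequency thresholds from the text."""
    if not (site.tumor_is_variant and site.normal_is_reference):
        return False
    if site.tumor_depth < 10 or site.normal_depth < 6:
        return False
    if site.tumor_variant_reads < 4:
        return False
    # Tumor variant frequency of at least 10%.
    return site.tumor_variant_reads / site.tumor_depth >= 0.10
```

For example, a site with 40 tumor reads, 8 of them variant, and 12 normal reads would pass, while the same site with only 5 normal reads would not.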
The alignment output was sorted and duplicate