Carcinoma Discovery of Recurrent Structural Variants in Nasopharyngeal
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from genome.cshlp.org on February 28, 2014 - Published by Cold Spring Harbor Laboratory Press Discovery of recurrent structural variants in nasopharyngeal carcinoma Anton Valouev, Ziming Weng, Robert T. Sweeney, et al. Genome Res. 2014 24: 300-309 originally published online November 8, 2013 Access the most recent version at doi:10.1101/gr.156224.113 Supplemental http://genome.cshlp.org/content/suppl/2013/11/21/gr.156224.113.DC1.html Material References This article cites 44 articles, 17 of which can be accessed free at: http://genome.cshlp.org/content/24/2/300.full.html#ref-list-1 Creative This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the Commons first six months after the full-issue publication date (see License http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported), as described at http://creativecommons.org/licenses/by-nc/3.0/. Email Alerting Receive free email alerts when new articles cite this article - sign up in the box at the Service top right corner of the article or click here. To subscribe to Genome Research go to: http://genome.cshlp.org/subscriptions © 2014 Valouev et al.; Published by Cold Spring Harbor Laboratory Press Downloaded from genome.cshlp.org on February 28, 2014 - Published by Cold Spring Harbor Laboratory Press Method Discovery of recurrent structural variants in nasopharyngeal carcinoma Anton Valouev,1,5,6 Ziming Weng,2,3,5 Robert T. Sweeney,2 Sushama Varma,2 Quynh-Thu Le,4 Christina Kong,2 Arend Sidow,2,3,6 and Robert B. West2,6 1Division of Bioinformatics, Department of Preventive Medicine, University of Southern California Keck School of Medicine, Los Angeles, California 90087, USA; 2Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA; 3Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA; 4Department of Radiation Oncology, Stanford University School of Medicine, Stanford, California 94305, USA We present the discovery of genes recurrently involved in structural variation in nasopharyngeal carcinoma (NPC) and the identification of a novel type of somatic structural variant. We identified the variants with high complexity mate-pair libraries and a novel computational algorithm specifically designed for tumor-normal comparisons, SMASH. SMASH combines signals from split reads and mate-pair discordance to detect somatic structural variants. We demonstrate a >90% validation rate and a breakpoint reconstruction accuracy of 3 bp by Sanger sequencing. Our approach identified three in-frame gene fusions (YAP1-MAML2, PTPLB-RSRC1, and SP3-PTK2) that had strong levels of expression in corre- sponding NPC tissues. We found two cases of a novel type of structural variant, which we call ‘‘coupled inversion,’’ one of which produced the YAP1-MAML2 fusion. To investigate whether the identified fusion genes are recurrent, we performed fluorescent in situ hybridization (FISH) to screen 196 independent NPC cases. We observed recurrent rearrangements of MAML2 (three cases), PTK2 (six cases), and SP3 (two cases), corresponding to a combined rate of structural variation re- currence of 6% among tested NPC tissues. [Supplemental material is available for this article.] Nasopharyngeal carcinoma (NPC) is a malignant neoplasm of the herently capture genomic structure in that discordantly aligning head and neck originating in the epithelial lining of the naso- mate-pair reads occur at sites of genomic rearrangements, exposing pharynx. It has a high incidence among the native people of the underlying lesions (Supplemental Fig. S1a). Second, large-insert American Arctic and Greenland and in southern Asia (Yu and Yuan mate-pair libraries deliver deep physical coverage of the genome 2002). NPC is strongly linked to consumption of Cantonese salted (100–10003), reliably revealing somatic structural variants even in fish (Ning et al. 1990) and infection with Epstein-Barr virus (EBV) specimens with low tumor content (Supplemental Fig. S1a,b). (Raab-Traub 2002), which is almost invariably present within the Third, variant-supporting mate-pair reads from large inserts may cancer cells and is thought to promote oncogenic transformation align up to several kilobases away from a breakpoint, beyond the (zur Hausen et al. 1970). A challenging feature of NPC genome repeats that often catalyze structural variants (Supplemental Fig. sequencing is that significant lymphocyte infiltration (e.g., 80% of S1c). For these reasons, large insert mate-pair libraries have been cells in a sample) (Jayasurya et al. 2000) is common, requiring used extensively for de novo assembly of genomes and for iden- special laboratory and bioinformatic approaches not necessary for tification of inherited structural variants (International Human higher-purity tumors (Mardis et al. 2009). Genome Sequencing Consortium 2001). In principle, they should Currently, cancer genomes are analyzed by reading short se- also be well-suited for analysis of ‘‘difficult’’ low-tumor purity quences (usually 100 bases) from the ends of library DNA inserts cancer tissues such as NPCs. In a recent study, paired-end fosmid 200–500 bp in length (Meyerson et al. 2010). For technical reasons, sequencing libraries with nearly 40 kb inserts were adapted to it is difficult to obtain deep genome coverage with inserts ex- Illumina sequencing and applied to the K562 cell line (Williams ceeding 600 bp using this approach. Much larger inserts can be et al. 2012). Due to low library complexity, of 33.9 million se- produced by circularizing large DNA fragments of up to 10 kb quenced read pairs, only about 7 million were unique (corresponding (Fullwood et al. 2009; Hillmer et al. 2011), and subsequent iso- to about 0.53 genome sequence coverage), but nonetheless facili- lation of a short fragment that contains both ends (mate pairs). tated structural variation detection. Large-insert and fosmid mate-pair libraries offer several attractive Mate-pair techniques have not yet been applied to produce features that make them well-suited for analysis of structural var- truly deep sequencing data sets of tumor-normal samples (303 or iation (Raphael et al. 2003; International Human Genome Se- greater sequence coverage), presumably due to the difficulty of quencing Consortium 2004; Tuzun et al. 2005; Kidd et al. 2008; retaining sufficient library complexity to support deep sequencing. Hampton et al. 2011; Williams et al. 2012). First, mate pairs in- We have improved the efficiency of library preparation by com- bining two existing protocols (Supplemental Fig. S2), and so were 5These authors contributed equally to this work. 6Corresponding authors E-mail [email protected] Ó 2014 Valouev et al. This article is distributed exclusively by Cold Spring E-mail [email protected] Harbor Laboratory Press for the first six months after the full-issue publication E-mail [email protected] date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is Article published online before print. Article, supplemental material, and pub- available under a Creative Commons License (Attribution-NonCommercial 3.0 lication date are at http://www.genome.org/cgi/doi/10.1101/gr.156224.113. Unported), as described at http://creativecommons.org/licenses/by-nc/3.0/. 300 Genome Research 24:300–309 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/14; www.genome.org www.genome.org Downloaded from genome.cshlp.org on February 28, 2014 - Published by Cold Spring Harbor Laboratory Press Structural variant discovery in NPC able to generate 3.5-kb insert libraries with sufficient genomic cells in NPC-5989 were abnormal, but that tumor content in NPC- complexity to enable deep sequencing of two NPC genomes. To 5421 was only ;20%. take full advantage of unique features offered by large-insert li- Consistent with the involvement of Epstein-Barr Virus (EBV) braries, such as the large footprints of breakpoint-spanning inserts in NPC etiology and previous estimates of 1–30 viral genome (Supplemental Fig. S1c) and the correlation between the two ends copies per cell (Nanbo et al. 2007; Liu et al. 2011), we detected large of alignment coordinates of breakpoint-spanning inserts, we amounts of EBV genomic sequence in NPC-5989 but not in its also developed a novel somatic structural variant caller. SMASH matched normal sample (1.86 million versus 521 read pairs). We (Somatic Mutation Analysis by Sequencing and Homology de- estimate that there were about 108 EBV genome copies per cell in tection) is specifically designed to accurately map somatic struc- the tumor sample (Supplemental Material). We found that a sig- tural lesions, including deletions, duplications, translocations, and nificant portion of the EBV genome existed in a circular form (7736 large duplicative insertions via direct comparison of tumor and read pairs supported circularization of the EBV genome spanning normal data sets. both ends of the genome at coordinates 169 kb and 0 kb). However, Structural variation methods, such as GASV (Sindi et al. we could not identify any breakpoints joining human and EBV 2009), SegSeq (Chiang et al. 2009), DELLY (Rausch et al. 2012b), regions, indicating that the virus was likely not integrated into the HYDRA (Quinlan et al. 2010), AGE (Abyzov and Gerstein 2011), host genome. and others (Lee et al. 2008; for review, see Snyder et al. 2010; Alkan To systematically identify somatic rearrangements