REVIEWS

APPLICATIONS OF NEXT-GENERATION SEQUENCING

Cancer genome-sequencing study design

Jill C. Mwenifumbo and Marco A. Marra

Abstract | Discoveries from cancer genome sequencing have the potential to translate into advances in cancer prevention, diagnostics, prognostics, treatment and basic biology. Given the diversity of downstream applications, cancer genome-sequencing studies need to be designed to best fulfil specific aims. Knowledge of second-generation cancer genome-sequencing study design also facilitates assessment of the validity and importance of the rapidly growing number of published studies. In this Review, we focus on the practical application of second-generation sequencing technology (also known as next-generation sequencing) to cancer and discuss how aspects of study design and methodological considerations (such as the size and composition of the discovery cohort) can be tailored to serve specific research aims.

Genome Sciences Centre, BC Cancer Research Centre, 675 West 10th Avenue, Vancouver, British Columbia V5Z 1L3, Canada. Correspondence to M.A.M. e‑mail: [email protected] doi:10.1038/nrg3445

Driver mutations: Somatic mutations that have a role in creating, controlling and/or directing some aspect of the cancer phenotype.

Kataegis: From the Greek meaning ‘thunderstorm’, this refers to clusters of somatic single-nucleotide variants that often colocalize with somatic structural variants.

Chromothripsis: From the Greek meaning ‘chromosome shattering’, this refers to a single event of genome shattering and reassembly that results in complex somatic structural variations characterized by oscillating copy number and tens to hundreds of rearrangements that localize to one or a few chromosomes.

Cancer pathogenesis is rooted in inherited genetic variation and acquired somatic mutation; accordingly, genomics is integral to cancer research (for a review, see REF. 1). In 2008, the first cancer genome was sequenced using second-generation technology (also known as next-generation sequencing)2. Four years later, approximately 800 genomes from at least 25 different cancer types have been sequenced. Consider that only 20 years ago, sequencing one human genome took an international collaboration more than 10 years and cost US$3.8 billion3–5. Today, accurate and rapid genome sequencing costs only a few thousand dollars. With this advancement in technology comes considerable capacity to increase basic cancer biology knowledge and the opportunity to advance cancer prevention, diagnostics, prognostics and treatment.

Despite the diversity of questions to be addressed using cancer genome-sequencing studies (that is, studies that have used second-generation technology to sequence at least one cancer genome), only a limited number of specific aims have been investigated to date; many remain to be explored. For instance, cancer prevention is an important area that could greatly benefit from well-designed second-generation cancer genome-sequencing studies. Family-based and case–control study designs will be integral to uncovering inherited polymorphisms that predispose individuals to cancer. Knowing cancer predisposition can benefit patients, health practitioners and the health-care system if it results in a change of lifestyle behaviours or medical intervention that reduces cancer risk, morbidity or mortality. The aim of this article is not to review results of cancer genome-sequencing studies but to focus on their archetypal specific aims, methodological requisites and study designs. Throughout this Review, the benefits and limitations of approaches, technologies and interpretation will also be discussed.

Specific aims

Thus far, most cancer genome-sequencing studies have had one or more of four specific aims: discovering driver mutations; identifying somatic mutational signatures; characterizing clonal evolution; and advancing personalized medicine (FIG. 1; Supplementary information S1 (table)). First, determining which somatic mutations are likely to contribute to the cancer phenotype is the most common aim of cancer genome-sequencing studies. Discovering driver mutations leads to improved understanding of basic cancer biology and consequently treatment discovery and development. Take the gene enhancer of zeste 2 (EZH2), for example: second-generation sequencing resulted in the discovery of somatic mutations in EZH2 in lymphoma at a clinically significant frequency6, spurring functional characterization7 and leading to a promising treatment8. Second, identifying somatic mutational signatures has also led to gains in understanding basic cancer biology. For the first time, researchers can uncover global signatures of the mutation processes and DNA repair mechanisms that contribute to the catalogue of somatic mutations in a cancer type. Indeed, the signatures of two newly discovered mutational phenomena, kataegis9 and chromothripsis10,

NATURE REVIEWS | GENETICS VOLUME 14 | MAY 2013 | 321

© 2013 Macmillan Publishers Limited. All rights reserved

[Figure 1: flow diagram. Box labels: single-patient study; discovery cohort; ultra-deep sequencing; multi-sample sequencing; multi-region sequencing; personalized medicine; mutational signatures; clonal evolution; driver mutations; recurrent mutation; pathway analysis; clinically actionable mutation; SV detection (<30-fold); SNV, indel and SV detection (≥30-fold); genome; exome; multi-omics; interaction omics; validation; validation or extension cohort; no validation or extension cohort; generalizability; clinical importance. Each box cites an example publication.]

Figure 1 | Cancer genome second-generation-sequencing study designs. This flow diagram is intended to highlight the diversity of second-generation cancer genome-sequencing study designs. Study design choices are in yellow, green and blue boxes. Choosing a path along these boxes, which are connected by arrows, represents a possible cancer genome-sequencing study design. The yellow boxes highlight that single-patient studies are well-suited for personalized medicine, whereas the blue boxes highlight that discovery cohorts are well-suited for discovering driver mutations. Dark grey boxes represent choices for analyses or methods specific to the box that they are connected to. Specifically, clonal evolution can be examined through ultra-deep, multi-sample or multi-region sequencing. Discovery cohorts can be multi-omics studies that combine genome, exome or transcriptome sequencing. Genome sequencing can be either <30‑fold or ≥30‑fold redundant coverage for focused or comprehensive somatic mutation detection, respectively. Validation and extension cohorts can confirm the findings from discovery cohorts or can explore generalizability or clinical importance. Secondary aims are in light grey boxes. Peer-reviewed publications that may serve as models for a particular study design feature or study aim are noted in boxes. SNV, single-nucleotide variant; SV, structural variant.

have been discovered thanks to cancer genome-sequencing studies. Third, characterizing clonal evolution is important, particularly when considering cancer treatment, and this characterization can be achieved at the nucleotide level using sequencing. For example, major subclones may share a somatic mutation (or mutations) that confers intrinsic drug resistance, whereas minor subclones and de novo mutations may evolve to confer acquired resistance (as has been documented for the emergence of mutated v-Ki-Ras2 Kirsten rat sarcoma oncogene (KRAS) and acquired resistance to epidermal growth factor receptor (EGFR)-targeted therapy11).

Finally, advancing personalized medicine is a clear application of using second-generation technology to sequence cancer genomes. The goal of personalized medicine is to reduce toxicity and to improve efficacy through selecting the correct treatment for the correct patient at the correct dose and time. Medulloblastoma


could be a model cancer type to further the personalized medicine paradigm as it is a heterogeneous cancer type with respect to both overall survival and molecular signatures; moreover, aggressive treatment reduces mortality at the cost of substantial morbidity12. Identifying the patients who would be best served by an aggressive treatment regime has great potential to improve the quality of life of medulloblastoma survivors.

The specific aims discussed here are not exhaustive; many more remain to be explored with second-generation sequencing of cancer genomes. For instance, de novo germline variants associated with childhood cancers might be discovered through trio studies that include the proband offspring and both parents. What is more, these specific aims are not mutually exclusive. For example, through characterizing clonal evolution, researchers might find driver mutations.

Methodological requisites

Second-generation cancer genome-sequencing studies have a generally accepted set of working methods (for a Review, see REF. 13), including but not limited to full sequencing of the matched normal genome (that is, the patient’s non-cancerous genome), at least 30‑fold redundant sequence coverage for the detection of single-nucleotide variation, and verification resequencing to confirm the somatic status of acquired mutations.

Matched normal genome. Subtracting the genetic variation of a non-cancerous ‘normal’ genome from its cancerous counterpart allows the identification of somatic mutations. As a source of the normal genome, studies of haematological cancer types often use skin biopsies (for example, see REFS 2,14,15), whereas studies of solid tumours frequently use peripheral blood mononuclear cells (for example, see REFS 16–18). Both of these options may have contamination with circulating tumour DNA or cells. Surgical margins and proximal lymph nodes can also serve as a source of normal DNA (for example, see REFS 19–21). Their collection is the least invasive for the patient if surgical resection is a part of the treatment plan. However, it should be noted that, despite normal appearance, surrounding tissue may contain residual disease cells, early tumour-initiating somatic mutations and/or an altered transcriptome or epigenome. Regardless of the source of the matched normal genome, bioinformatics analyses should allow for low levels of contamination.

The average person inherits 3 to 4 million single-nucleotide polymorphisms (SNPs; for example, see REFS 22–26). Compared to these millions of inherited polymorphisms, there are relatively few (specifically, thousands to tens of thousands) candidate somatic single-nucleotide variants (SNVs) in the cancer genomes of adults (for example, see REFS 27–30). To have confidence that these candidates are real somatic SNVs, researchers must have identified the large majority of inherited SNPs in the matched normal genome (BOX 1). Metrics that help to assess the quality of the SNP calling in non-cancerous genomes include: the proportion of the SNPs that overlap with those found in the US National Center for Biotechnology Information SNP database, which is a public repository of simple genetic variations31; the transition to transversion ratio (~2.1 for the whole genome); and concordance with matched SNP genotyping arrays. SNP arrays can be used to estimate a false-negative rate; however, the assumption is that array calls are the gold standard.

SNVs are not the only kind of somatic mutation, but they are the most abundant. The somatic status of structural variants, such as copy number variants (CNVs), copy-neutral regions of loss of heterozygosity (cnLOH), inversions and translocations, is also determined by comparison with the non-cancerous genome.

Redundant sequence coverage: The total number of bases sequenced divided by the total number of bases in the haploid genome.

B allele frequencies: Frequencies equal to B / (A + B), where A is the count for the reference nucleotide at an inherited single-nucleotide polymorphism (SNP) position, and B is the count for the alternate nucleotide at that same SNP position.

Redundant sequence coverage for studying single-nucleotide variants. Cancer genome-sequencing studies typically produce on the order of 90 Gb of aligned sequence to achieve 30‑fold redundant coverage of the 3 Gb haploid human genome, which is generally accepted as sufficient to detect inherited SNPs reliably25 in diploid genomes. Estimating the optimal redundant coverage for normal genome sequencing is challenging owing to rapid changes of variables that influence the detection of SNPs, such as library construction methods, sequencing chemistry, sequencing read length, read alignment algorithms and bioinformatics tools for variant identification. It has been suggested, however, that on the order of 50‑fold coverage is required to detect inherited genotypes confidently32.

Although cancer originates from a common progenitor, it evolves through clonal expansion, somatic mutation and selection1,33, which means that cancer cells from the same patient do not share all somatic mutations. Moreover, general characteristics of cancer, such as aneuploidy, non-cancerous cell contamination and extensive unbalanced structural variation, can add to the variability in mutant allele frequencies (that is, mutant sequencing read count / (mutant + reference sequencing read count)). Both of these facts mean that, unlike inherited SNPs, which exist at B allele frequencies of 0% (in the case of a homozygous reference), 50% (in the case of a heterozygous variant) or 100% (in the case of a homozygous variant), acquired SNVs may exist at a continuous range of mutant allele frequencies. Consequently, the current standard of 30‑fold redundant coverage is likely to be insufficient to mitigate false-negative calls of SNVs with low mutant allele frequencies. In fact, to detect somatic mutations in minor subclones that have mutant allele frequencies as low as 1 to 2%, substantially higher sequence coverage (for example, 400- to 500‑fold) of the cancer genome would be required.

In an effort to minimize false-negative errors, the International Cancer Genome Consortium guidelines suggest that the tumour cell content of a sample is at least 60 to 80% viable cells where possible. Some cancer genome-sequencing studies attempt to reduce non-cancerous cell contamination through microdissection34, cell sorting35, creation of low passage cell lines36 and xenografts37. Using one of these techniques is likely to be required for cancer types that have a high stromal content, such as pancreatic cancer, or that have substantial normal cellular content, such as haematological


cancer types. Alternatively, increasing redundant coverage can help to compensate for low tumour purity, and in some cases, this is the most straightforward way to do so.

Paired-end reads: Sequencing reads from each end of the same DNA molecule. Knowing the sequence of both reads and the length of the DNA molecule improves mapping to a reference sequence, de novo assembly and detecting structural variations.

Paired-end reads for detecting structural variants. Second-generation cancer genome sequencing using paired-end reads with greater than 30‑fold redundant coverage allows detailed characterization and simultaneous detection of SNVs, indels and structural variants (Supplementary information S2 (table)). Sequencing cases to less than 30‑fold redundancy still allows a study to detect structural variants. Such studies can: characterize the architecture of structural variants at single-nucleotide resolution (for example, see REFS 38,39); describe the distribution of different types of structural variant in the cancer of an individual patient (for example, see REFS 39,40); examine the patterns of these variants across cancers from different patients (for example, see REFS 10,41); explore the evolution of structural variants (for example, see REFS 36,38); and discover chimeric genes (for example, see REFS 42,43).
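The relationship between redundant coverage, tumour purity and the chance of detecting a low-frequency SNV can be illustrated with a minimal sketch. It assumes that mutant reads follow a binomial distribution, that there is no sequencing error, and that a caller needs at least three mutant reads to report a variant; the function names and the three-read threshold are hypothetical choices for illustration and are not taken from any published pipeline.

```python
from math import comb


def fold_coverage(aligned_bases: float, haploid_genome_size: float = 3e9) -> float:
    """Redundant coverage: bases sequenced divided by haploid genome size."""
    return aligned_bases / haploid_genome_size


def expected_maf(purity: float, clonal_fraction: float = 1.0) -> float:
    """Expected mutant allele frequency of a heterozygous SNV in a diploid region.

    purity is the fraction of cells in the sample that are cancerous;
    clonal_fraction is the fraction of cancer cells that carry the mutation.
    """
    return 0.5 * purity * clonal_fraction


def detection_probability(depth: int, maf: float, min_alt_reads: int = 3) -> float:
    """P(at least min_alt_reads mutant reads) with mutant reads ~ Binomial(depth, maf).

    min_alt_reads = 3 is an illustrative threshold, not a standard.
    """
    p_too_few = sum(
        comb(depth, k) * maf**k * (1.0 - maf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    print(f"90 Gb aligned -> {fold_coverage(90e9):.0f}-fold coverage")
    for depth in (30, 100, 500):
        p = detection_probability(depth, maf=0.01)
        print(f"depth {depth:>3}, MAF 1%: P(detect) = {p:.3f}")
```

Under these assumptions, a variant at a 1% mutant allele frequency is almost never seen three times at 30‑fold coverage but usually is at 500‑fold, which is consistent with the coverage range quoted above for minor subclones.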

Box 1 | Single-nucleotide alteration calling and filtering

Somatic mutation calling has several sources of false positives and false negatives, including technical bias, sequencing error, alignment artefacts, normal DNA contaminated with cancer DNA, tumour heterogeneity, copy number variants (CNVs) and copy-neutral regions of loss of heterozygosity (cnLOH). Single-nucleotide variant (SNV)- and indel-calling methods have features that attempt to minimize false-positive calls, and further filtering can also help to reduce these errors. A small selection of the many tools that can be used to detect SNVs and indels is listed in the table below.

Technical bias
During the standard library construction process, PCR introduces duplicate reads, and strand bias and GC bias can be compounded. Filtering out duplicate reads and SNV calls that are mainly supported by one strand can ameliorate PCR artefact bias. GC bias is a source of uneven redundant coverage across the genome; stretches of high AT and high GC content tend to be under-represented. Because inherited polymorphisms vastly outnumber acquired somatic variants, poor confidence in calls at these under-represented loci means that inherited single-nucleotide polymorphisms (SNPs) and indels are more likely to be mistaken for acquired SNVs and indels. Filtering by minimum read depth at a position and/or by a minimum number of reads that support variant calls helps to reduce this type of false positive. Annotating acquired SNVs and indels that are found in dbSNP or the 1000 Genomes repository is another way to address this issue. Of note, the current build of dbSNP contains somatic mutations.

Sequencing error
Second-generation sequencing has a higher rate of base error than first-generation sequencing. Errors are more likely to occur at the 3′ ends of reads from Illumina platform sequencing and at homopolymeric sequences in the case of Roche 454 Life Sciences sequencing. Requiring a minimum base quality and/or consensus sequence quality also helps to reduce uncertainty at poorly sequenced loci.

Alignment artefacts
Alignment tools produce artefacts in regions of low mappability, such as low copy and simple tandem repeats. Filtering by mapping quality may help. Alternatively, SNV and indel calls in low mappability regions can be filtered out. Misalignment of reads around indels also occurs; most tools locally realign or assemble reads around indels, but filtering SNVs by proximity to gapped alignments is also an option.

Tool | Statistic | Multiple samples | Filtering | Indels | URL
Samtools: mpileup, bcftools | Bayesian genotype likelihood model | Called independently | varFilter | Yes | http://samtools.sourceforge.net/mpileup.shtml
Genome Analysis Toolkit: UnifiedGenotyper | Bayesian genotype likelihood model | Called independently | Variant quality score recalibrator | Yes | http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html; http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html
Genome Analysis Toolkit: HaplotypeCaller | Local de novo assembly of haplotypes and affine gap penalty pair hidden Markov model (HMM) likelihood function | Called independently | Variant quality score recalibrator | Yes | http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html; http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html
SomaticSniper | Bayesian genotype likelihood model | Tumour–normal pairs called jointly | Built in | Yes | http://gmt.genome.wustl.edu/somatic-sniper/current
VarScan2 | Heuristic Fisher’s exact test | Called independently or as tumour–normal pairs | somaticFilter | Yes | http://varscan.sourceforge.net
Strelka | Bayesian continuous allele frequencies for both tumour and normal samples | Tumour–normal pairs called jointly | Post-call filtration | Yes | ftp://[email protected]
JointSNVMix | Probabilistic graphical models | Tumour–normal pairs called jointly | MutationSeq | Not specified | http://code.google.com/p/joint-snv-mix; http://compbio.bccrc.ca/software/mutationseq

Affine gap penalty is a penalty that reduces sequence alignment score on the basis of the existence and length of gaps due to indels.
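The filters described in Box 1 can be sketched as a simple post-call check. This is an illustration only: the record fields, the function name and every threshold below are hypothetical, and none of them is the default of any tool in the table. Duplicate reads are assumed to have been removed upstream, and dbSNP membership is carried as an annotation rather than used as a filter, because dbSNP contains somatic mutations.

```python
from dataclasses import dataclass


@dataclass
class CandidateCall:
    """Hypothetical summary of one candidate somatic SNV in the tumour sample."""
    chrom: str
    pos: int
    depth: int                   # total reads covering the position
    alt_forward: int             # variant-supporting reads on the forward strand
    alt_reverse: int             # variant-supporting reads on the reverse strand
    mean_base_quality: float     # mean Phred base quality of variant bases
    mean_mapping_quality: float  # mean mapping quality of variant reads
    in_dbsnp: bool               # annotation only; not used to reject a call


def passes_filters(
    call: CandidateCall,
    min_depth: int = 10,
    min_alt_reads: int = 4,
    max_strand_fraction: float = 0.9,
    min_base_quality: float = 20.0,
    min_mapping_quality: float = 30.0,
) -> bool:
    """Return True if the call survives all illustrative filters."""
    alt_reads = call.alt_forward + call.alt_reverse
    # Technical bias: require minimum depth and minimum supporting reads.
    if call.depth < min_depth or alt_reads < min_alt_reads or alt_reads == 0:
        return False
    # Technical bias: reject calls supported almost entirely by one strand.
    if max(call.alt_forward, call.alt_reverse) / alt_reads > max_strand_fraction:
        return False
    # Sequencing error: require a minimum base quality.
    if call.mean_base_quality < min_base_quality:
        return False
    # Alignment artefacts: require a minimum mapping quality.
    if call.mean_mapping_quality < min_mapping_quality:
        return False
    return True
```

In practice each threshold would be tuned to the platform, coverage and caller in use, and filtered calls would still go on to verification resequencing.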


Structural variants can be predicted from genome sequence on the basis of unexpected mapping distance and orientation of paired-end reads44, split reads (for a review, see REF. 45) and/or de novo assembly of sequencing reads using bioinformatics tools such as Abyss46. Paired-end reads from short DNA library fragments, on the order of hundreds of base pairs, provide the ability to detect smaller size intrachromosomal rearrangements (for reviews, see REFS 47,48). By contrast, paired-end reads from large DNA library fragments, on the order of thousands of base pairs, provide the ability to detect rearrangements in complex DNA regions, such as repetitive and duplicated sequences, and require less sequencing than short fragment libraries to achieve comparable physical coverage40,49. Determining the differential redundant coverage across segments of the cancer genome compared with its matched normal genome is a way to detect CNVs specifically42; this method does not depend on paired-end reads. However, paired-end reads improve the upstream process of aligning reads to the reference genome. In samples with a high degree of clonal heterogeneity or low tumour purity, the detection of CNVs can be confounded; however, bioinformatics adjustments for heterogeneity can help to improve detection50.

Structural variants predicted from genome sequencing may result in gene amplification, deletion, disruption or rearrangement. Transcriptome sequencing, however, can be used to identify how structural variants in the genome alter transcription of affected genes. Notably, it can be used to identify transcribed chimeric genes. Chimeric proteins are ideal macromolecules for molecularly targeted drug therapy, which aims to inhibit a specific mutated protein, thus maximizing efficacy while minimizing toxicity. The prototypical model of successful molecular-targeted drug therapy is inhibition of the oncogenic chimaera BCR–ABL by imatinib51.

Chimeric genes: A combination of segments of two or more genes that forms a new gene.

Split reads: Sequencing reads that align to non-contiguous spans of the reference sequence owing to somatic structural variation.

Mass-spectrometric genotyping: A method that generates locus-specific amplicons followed by primer extension that incorporates mass-modified dideoxynucleotides at the single-nucleotide polymorphism position. A mass spectrometer then measures the differential mass of the products.

Verification resequencing. Verification resequencing is the use of a different technology to minimize false-positive calls of somatic mutations that occur owing to technology-specific systematic errors, such as library construction artefacts, sequencing errors and biases and alignment inaccuracy. The objective of verification resequencing is to confirm that the candidate somatic mutation is not present in the normal genome and is present in the cancer genome. It is important for studies to report a verification rate (and thus an estimate of the false-positive rate), but given that there may be thousands of candidate somatic mutations per patient, sometimes it is not practical to confirm them all; most somatic mutations in a patient’s cancer genome are unique, and verification resequencing can be costly, both in terms of money and in terms of the quantity of nucleic acid required. Strategies for determining a false-positive rate while reducing the number of individual verifications carried out include verifying: the mutations that are most likely to have an effect on the structure, function or expression level of a protein (such as nonsynonymous SNVs; for example, see REFS 2,16); a random sample of the somatic mutations (for example, see REFS 37,52); or somatic mutations that are deemed of interest (for example, see REFS 53,54). The International Cancer Genome Consortium has proposed that, on the basis of extrapolation from the calculated verification rate, at least 95% of somatic mutants listed in the catalogue for each sample should be real. A margin of error of 5% requires that a minimum of 384 somatic mutants should be verified.

A common form of verification is PCR amplification of the locus containing the candidate SNV, indel or structural variant breakpoint, followed by Sanger sequencing (for example, see REFS 2,42). Other methods of verification include mass-spectrometric genotyping (for example, see REFS 21,55) and targeted capture followed by sequencing using a different second-generation platform (for example, see REFS 15,18). Of note, Sanger sequencing and mass-spectrometric genotyping are limited by their inability to detect mutant alleles that are found at a low frequency (for example, see REFS 28,55). Verification of amplifications or deletions can be done through assessing concordance of the CNVs called from sequence data versus those detected with array-comparative genomic hybridization (for example, see REFS 34,39) or SNP arrays (for example, see REFS 14,18); however, the limits of detection with array technology make breakpoint sequencing preferable for smaller CNVs (for example, see REFS 42,56).

Study types

Today, with second-generation technology, the cost of sequencing a human cancer genome can be as little as US$5,000, and it continues to drop. Adequately powered studies have now become economically feasible, but they are not inexpensive. Several study designs have been developed to maximize impact while minimizing cost (FIG. 1; Supplementary information S3 (table)).

Single-patient studies. Single-patient studies are hypothesis-generating and have the potential to inform clinical practice, but they do not allow for the generalization of findings. Researchers can theorize as to which somatic mutations from the whole catalogue are important to the pathogenesis of cancer in an individual patient. These theories may be based on the literature (for example, see REFS 2,15) or on phylogenetic evolutionary analysis55; consequently, they are limited in novelty or the strength of the conclusion, respectively. Single-patient second-generation cancer genome-sequencing studies are well-positioned to describe the somatic mutational signature of a cancer type (for example, see REFS 20,57) and to explore clonal evolution (for example, see REFS 16,18). They are best positioned to be integral components of personalized genomic medicine (for example, see REFS 17,58), the objective of which is to inform physician decision-making with respect to treatment.

Genome discovery cohort. The discovery cohort is a group of the same type, or subtype, of cancer that is subject to second-generation sequencing. Discovery cohort studies have the potential to detect recurrent somatic mutations of genes and pathways (that is,


Box 2 | Detecting recurrent mutations

In each cancer, there is a collection of somatic mutations, some of which create, control and/or direct the cancer phenotype. There are several statistical methods to identify the somatic mutations that are likely to be contributing to cancer pathogenesis. One way to determine which somatic mutations are probably driver mutations is with bioinformatics tools that find genes that are somatically mutated more often than would be expected by chance (or those that have a higher mutation rate than the background mutation rate) in a cohort. The background mutation rate can vary by the base context (for example, there is an increased rate of C to T mutations in the context CpG), among different regions of the genome (for example, exons versus introns), across a cohort of a cancer type or subtype (for example, hypermutated versus non-hypermutated colorectal cancers) or among individuals. Owing to their length and/or nucleotide composition, some genes may have inherently high mutation rates; thus, these factors may be adjusted for. Different methods calculate the background mutation rate differently. In addition to considering the mutation rate, the transition to transversion ratio and the nonsynonymous to synonymous mutation ratio can be used as proxy indicators of selection.

Example | Type of study | Type (or types) of cancer | Method to detect statistically significant recurrence | Method to detect statistically significant evidence of selection | Type (or types) of somatic mutations | Number of significantly recurrently mutated genes | Validation study (cohort size/genes assessed)
Chapman (2011)65 | Multi-ome; 23 WGS; 16 WES | Multiple myeloma | MutSig | Nonsynonymous to synonymous mutations ratio | Non-silent protein coding; non-coding regions with high regulatory potential | 10; 18 | 161/BRAF, IRF4
Berger (2012)97 | 25 WGS | Melanoma | MutSig | – | | 11 | 107/PREX2
Ellis (2012)27 | Multi-ome; 46 WGS; 31 WES | Breast cancer | MuSiC | – | Tier 1 mutations | 18 | –
Fujimoto (2012)30 | 27 WGS | Hepatocellular carcinoma | In‑house | – | Protein-altering point mutations | 15 | 120/ARID1A
Jones (2012)76 | Multi-ome; 39 WGS; 21 WES; 65 custom capture | Medulloblastoma | MutSig | – | | 8 | –
Morin (2011)62 | Multi-ome; 14 WGS or WES; 117 WTS | Non-Hodgkin’s lymphomas | – | In‑house, based on REF. 105 | Nonsynonymous point mutation or nonsense mutations of 109 recurrently mutated genes | 26 | 261/MEF2B; 89/MLL2
Shah (2012)60 | Multi-ome; 15 WGS; 54 WES; 80 WTS | Breast cancer | – | In‑house | | 6 | 159/29 genes

An empty cell means ‘not specified’, and a dash (–) means ‘not done’. ARID1A, AT-rich interactive domain 1A; BRAF, v‑raf murine sarcoma viral oncogene B1; IRF4, interferon regulatory factor 4; MEF2B, myocyte enhancer factor 2B; MLL2, myeloid/lymphoid or mixed-lineage leukaemia 2; PREX2, phosphatidylinositol‑3,4,5‑ trisphosphate-dependent Rac exchange factor 2; WES, whole-exome sequencing; WGS, whole-genome sequencing; WTS, whole-transcriptome sequencing.
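The idea in Box 2 of finding genes that are mutated more often than expected by chance can be sketched with a one-sided binomial test. This is a deliberately simplified model and is not the algorithm used by MutSig, MuSiC or any other tool in the table: it assumes a single, uniform per-base background mutation rate (a hypothetical value is used in the usage note) and ignores base context, genomic region and per-patient variation, all of which Box 2 notes can matter.

```python
from math import comb


def prob_gene_mutated(background_rate: float, gene_length: int) -> float:
    """P(a gene carries at least one somatic mutation in one tumour) under the
    assumed uniform per-base background rate."""
    return 1.0 - (1.0 - background_rate) ** gene_length


def recurrence_p_value(
    n_patients: int, n_mutated: int, background_rate: float, gene_length: int
) -> float:
    """One-sided binomial p-value: the probability of seeing the gene mutated in
    at least n_mutated of n_patients tumours by background mutation alone."""
    p = prob_gene_mutated(background_rate, gene_length)
    return sum(
        comb(n_patients, k) * p**k * (1.0 - p) ** (n_patients - k)
        for k in range(n_mutated, n_patients + 1)
    )
```

For example, with an assumed background rate of 1e-6 mutations per base, `recurrence_p_value(100, 5, 1e-6, 1500)` is very small, whereas the same count in a 100 kb gene, `recurrence_p_value(100, 5, 1e-6, 100_000)`, is unremarkable. This is why gene length is one of the factors that methods adjust for.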

mutations found in more than one patient’s cancer; BOX 2). Recurrence in cancers from different patients is good evidence that a mutation might be involved in cancer pathogenesis, but it is not definitive evidence because phenomena such as linkage disequilibrium with a pathogenic gene deletion can result in recurrent somatic deletion of an adjacent gene (or genes) that is not involved in pathogenesis59. Second-generation cancer genome sequencing of discovery cohorts can also allow for the identification of somatic mutational signatures or patterns of clonal evolution that may characterize a cancer type or subtype (for example, see REFS 41,52,60,61).

The actual power of the discovery cohort is in its potential for unbiased discovery and novel hypothesis generation. The statistical power of the discovery cohort is a function of the number of patients, inter-tumour genomic heterogeneity and the frequency of the event of interest. Genome discovery cohort studies have often been small (containing between 2 and 97 cases) and thus of limited statistical power (Supplementary information S3 (table)). If the aim of a study is to catalogue the large majority of recurrently somatically mutated genes found in a cancer type, or subtype, then the International Cancer Genome Consortium recommends that approximately 100 matched tumour–normal pairs in a discovery cohort and 400 in a validation cohort (see below) are required to reliably detect genes that are somatically mutated in 3% of cases. Of note, this two-tiered design requires that all somatically mutated genes identified in the discovery cohort are assessed in the validation cohort. If a somatically mutated locus or gene recurs at a fairly high frequency, a large discovery cohort is not necessarily required (for example, see REFS 62,63), but a highly sensitive survey of the mutational landscape of a cancer type cannot be achieved with smaller discovery cohorts.
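As a rough illustration of why cohort size matters, suppose (hypothetically) that a gene is mutated independently in a fraction f of cases, that every mutation present is detected, and that a gene counts as recurrent when it is seen in at least two of n tumours. These assumptions are ours and are not the consortium's own power calculation. The probability of observing recurrence is then:

```latex
\[
P(X \ge 2) \;=\; 1 - (1-f)^{n} - n f (1-f)^{n-1}
\]
\[
n = 100,\ f = 0.03:\quad P(X \ge 2) \approx 1 - 0.048 - 0.147 \approx 0.81
\]
\[
n = 20,\ f = 0.03:\quad P(X \ge 2) \approx 1 - 0.544 - 0.336 \approx 0.12
\]
```

Under this simple model, a gene mutated in 3% of cases would usually recur in a cohort of 100 but would usually be missed as recurrent in a cohort of 20, before any allowance for imperfect detection or background mutation.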

326 | MAY 2013 | VOLUME 14 www.nature.com/reviews/genetics

© 2013 Macmillan Publishers Limited. All rights reserved REVIEWS

[Figure 2 image not reproduced. Outcome labels in the figure: normal transcription; greater transcription; gene silencing; altered transcript isoform; chimeric transcript; gain-of-function variant; loss-of-function variant; no transcription. Mutation labels in the figure: amplification; promoter hypermethylation; promoter hypomethylation; splice site SNV; nonsynonymous SNV; translocation; deletion.]

Figure 2 | The integration of transcriptome and epigenome with whole-genome sequencing. Integration analyses can indicate whether the somatic mutation of a gene results in a pathogenic increase, decrease or change of function. The dark blue tracts represent intergenic regions, the overlaid light blue rectangles represent genic regions, and the light blue rectangles with poly(A)s represent mRNA. The green circles represent cytosine methylation. The red and purple bars represent a translated nonsynonymous single-nucleotide variant (SNV). The yellow bar represents a splice site mutation. The black line represents the altered exon joining. The green tract represents an interchromosomal translocation.

Multi-ome discovery cohort. Multi-ome discovery cohorts use second-generation technology to sequence an assortment of genomes, exomes (for example, see REFS 64,65) and/or transcriptomes (for example, see REFS 60,62) in a group of cancers of the same type or subtype. Specifically, researchers will typically sequence the genome of a smaller number of samples and the exomes or transcriptomes of a larger number of samples; more than one ‘omic’ measurement per sample is not necessary. The advantages and disadvantages of exome and transcriptome sequencing, and consequently the multi-ome discovery cohort, are further discussed below.

Second-generation technology is widely used to sequence cancer exomes, and this approach has been the source of exciting findings in various cancer types (for example, see REFS 66–69). Here we focus on how exome sequencing can be coupled with genome sequencing. Certain versions of today’s exome technology can capture up to 70 Mb of exons, non-coding RNAs and non-coding regions with high regulatory potential. The exome can also be customized to capture specific regions of interest. Theoretically, sequencing the exome does not improve the detection of any specific type of mutation over genome sequencing; it can detect coding SNVs, indels and structural variants (REFS 70,71). Practically, the substantially lower cost of sequencing per sample allows researchers to focus on a subset of the genome that is manageable in terms of cost and makes greater redundant coverage possible, increasing the sensitivity of detection of low mutation allele frequency somatic coding mutations (for example, see REFS 57,72). However, exome sequencing can miss somatic coding mutations in areas where sequence coverage is poor or where targeted capture probes need improvement (REF. 57).

Second-generation sequencing of the transcriptome (using RNA sequencing (RNA-seq); REFS 73,74) allows for the discovery of SNVs (for example, see REF. 62), indels, chimeric transcripts (for example, see REF. 75), novel transcripts, alternative splicing (for example, see REF. 60), allelic imbalances (for example, see REF. 76) and differentially expressed transcripts (for example, see REF. 77) (FIG. 2). It should be noted that SNVs that decrease transcript levels by negatively affecting transcription or transcript stability will not be detected in the transcriptome. Also, small RNAs are not captured in most RNA-seq libraries and require their own library preparation (REF. 78). Compared with array-based technologies, RNA-seq has many advantages because it is digital technology (for a Review, see REF. 79). For example, RNA-seq has an improved ability to compare transcription levels across different genes, samples, experiments, time points and platforms. Moreover, it has a greater dynamic range and increased sensitivity depending on the depth of sequencing. Transcriptome sequencing can be limited, however, by the difficulty of finding a matched ‘normal’ tissue. For example, differential expression analysis comparing cancerous tissues to cell-of-origin material would not be possible if the cell type of origin were unobtainable or unknown.

The advantages of multi-ome discovery cohorts are twofold. First, because the quantity of sequence required for coverage of exomes and transcriptomes is substantially less than for genomes, the multi-ome discovery cohorts minimize the expenditure of dollars, time and resources required to discover recurrent and potentially high-impact somatic mutations. The key words are ‘recurrent’ and ‘high-impact’. The lower cost of exome and transcriptome sequencing makes possible larger sample sizes, which have a greater power to detect statistically significant recurrent somatic mutations; high-impact refers to somatic mutations that are most likely to affect protein function and/or expression. At the same time, researchers can identify frequently recurrent structural variants, SNVs or indels in the few genome sequences that cannot be detected through the exome or transcriptome sequencing. The second advantage of multi-ome discovery cohorts (or of any study in which more than one ome is measured) is integrative analysis (that is, integration omics), which explores how the different types of somatic mutation discovered with the different types of omes converge on a mutated locus, gene or pathway (FIG. 2). This is important, as there is no single second-generation sequencing omics measure capable of detecting all the types of abnormalities implicated in cancer pathogenesis. Landmark studies from The Cancer Genome Atlas Research Network use integrative analyses to consider cancer molecular aetiology from as many perspectives as possible (REFS 43,77). Bioinformatics tools such as PathScan (REF. 80) and PARADIGM (REF. 81) facilitate integrative analyses of multi-ome cohorts.

On a separate but related note, exploring the interaction between omes (that is, interaction omics) is an emerging field in cancer genome sequencing. Technically, the task is straightforward and involves making more than one omic measurement on the same sample and then analysing the global interactions between, for example, genomes and transcriptomes or epigenomes. It considers how somatic mutation or deregulation in one ome affects another. Practically, exploring omic interactions is constrained by several factors (for a Review, see REF. 82). First, there are only a few cancer genome studies that have used second-generation sequencing to make more than one omic measurement per sample (Supplementary information S1 (table)). Second, there is a dearth of bioinformatics tools to carry out interaction analyses for second-generation sequence data. This may be, in part, due to the fact that there are a multitude of different specific questions that a researcher may ask when exploring interaction omics. For instance, in exploring the interaction between the genome and transcriptome, researchers have determined the proportion of DNA SNVs that are detectable in the transcriptome (for example, see REFS 62,76,83), discovered cancer-specific RNA-editing events that recode the amino acid sequence (for example, see REFS 16,84), identified splice site mutations that affect transcript length and structure (for example, see REFS 60,62) and determined the association of DNA somatic mutations with transcript levels (for example, see REFS 16,17,60). Few cancer genome-sequencing studies have explored the interaction between the multifaceted epigenome and the somatic mutations of the genome or dysregulation of the transcriptome, even though aberration of the epigenome is a feature of cancer pathogenesis (REF. 85). Chromatin immunoprecipitation followed by second-generation sequencing of the captured DNA fragments (ChIP–seq) is a well-established technique used to survey genome-wide interactions between modified histones and DNA (REFS 82,86). In addition, second-generation-sequencing-based methods that determine the methylation state of cytosines, and associated bioinformatics tools, are emerging (REFS 82,87).

Glossary terms for this section:
Multi-ome discovery cohort: A cohort of cancer genomes, exomes and/or transcriptomes; more than one omic measurement per sample is not necessary.
Allelic imbalances: Unequal transcript levels of the alleles of a gene.
Integration omics: Examining how somatic mutation or deregulation of a genome, transcriptome and/or epigenome converge on a pathway, process or gene; more than one omic measurement per sample is not necessary. For example, gene inactivation through single-nucleotide variants or epigenomic silencing.
Interaction omics: Examining how the somatic mutation or deregulation of the genome, transcriptome and/or epigenome affect one another; more than one omic measurement per sample is ideal. For example, somatic copy number variants can have effects on transcript levels.
Custom capture: Hybridization or amplification of selected regions of the genome to specifically capture loci for second-generation sequencing.
Ultra-deep second-generation resequencing: Greater than 100-fold redundant sequence coverage of a targeted selection of somatic mutations.

Validation and extension cohort. A two-tiered study design includes hypothesis generation using a discovery cohort (that is, single-patient, genome cohort or multi-ome cohort) followed by hypothesis testing using a targeted approach in a larger validation cohort. Targeted methods for exploring somatic mutations of interest are varied (see Supplementary information S3 (table)). They include: genotyping for specific nucleotide changes; sequencing of single exons, the complete coding region of genes or whole genes; and targeting thousands of genes using custom capture followed by ultra-deep second-generation resequencing. When validating patterns of structural variation across samples, often the extension cohort stage consists of analysis of publicly available whole-genome array data sets (for example, see REFS 29,83,88).

The primary aim of a validation cohort is to test the reproducibility of the discovery cohort findings (for example, see REFS 28,89). Some cancer genome-sequencing studies have secondary aims, such as surveying generalizability (for example, see REF. 63) or assessing clinical importance (for example, see REF. 90) of the somatically mutated genes and pathways of interest (FIG. 1; Supplementary information S1 (table)); this goes beyond validating findings and extends them. If a validation or extension cohort is composed of different cancer types, then researchers may determine the extent to which findings from a particular cancer type can be generalized to any of the hundreds of other histological types of cancer. Specifically, on achieving primary aims, researchers may survey generalizability in a cohort of similar cancer subtypes (for example, see REFS 62,83,91), among cancers with a uniting feature (for example, paediatric cancers; REF. 63), between cancers with contrasting features (for example, brainstem versus non-brainstem cancers; REF. 63) or among various common types of cancer (for example, see REF. 38). Somatically mutated genes or pathways that are common to a wide variety of cancer types probably underlie the hallmark phenotypes of cancer pathogenesis (for a review of cancer hallmarks, see REF. 92). However, somatic mutations that are common to a specific cancer subtype may be important to distinctive phenotypic features, optimal treatment or might serve as molecular subtype biomarkers (for example,


Table 1 | Examples of clinically relevant actionable mutations

Clinically relevant or actionable mutated genes and/or pathways | Treatment | Cancer types | Refs
RET | Sunitinib (multi-targeted receptor tyrosine kinase inhibitor), sorafenib (multi-targeted receptor tyrosine kinase and RAF kinases inhibitor) | Adenocarcinoma, lung adenocarcinoma | 17,84
BRAF | Vemurafenib (small-molecule BRAF V600E kinase inhibitor) | Melanoma, hypermutated colorectal cancer, multiple myeloma, malignant melanoma, triple-negative breast cancer | 19,43,60,65,97
KRAS | Cetuximab (EGFR inhibitor) | Colorectal cancer, metastatic pancreatic cancer, cancer of the ampulla of Vater, lung cancer, early T cell precursor acute lymphoblastic leukaemia | 21,28,36,40,43,58,83
EGFR, ERBB2, ERBB3 signalling | Gefitinib (EGFR inhibitor), erlotinib (EGFR tyrosine kinase inhibitor), cetuximab, lapatinib (ERBB1 and ERBB2 receptors tyrosine kinase inhibitor) | ER-positive breast cancer, hepatocellular carcinoma, lung cancer, colorectal cancer, breast cancer, lobular breast cancer | 16,21,27,30,43,52,60,61,72
EML4–ALK | Crizotinib (ALK and ROS1 inhibitor) | Neuroblastoma | 90
PML–RARA | All trans-retinoic acid therapy | Acute myeloid leukaemia, acute promyelocytic leukaemia | 53,100
LRRK | Candidate treatment: bortezomib | Multiple myeloma, metastatic acral melanomas | 34,65

ALK, anaplastic lymphoma receptor tyrosine kinase; BRAF, v-Raf murine sarcoma viral oncogene B1; EGFR, epidermal growth factor receptor; EML4, echinoderm microtubule-associated protein-like 4; ER, oestrogen receptor; ERBB2, v-Erb-b2 erythroblastic leukaemia viral oncogene 2; KRAS, v-Ki-Ras2 Kirsten rat sarcoma viral oncogene; LRRK, leucine-rich repeat kinase 2; PML, promyelocytic leukaemia; RARA, retinoic acid receptor alpha; RET, ret proto-oncogene; ROS1, c-Ros oncogene 1.

see REFS 89,90). If validation or extension cohort cases are linked to clinical data, they can be used to explore clinical importance, which can be defined by the value added to the clinical management of cancer patients (TABLE 1). For instance, in neuroblastoma, somatic mutation of alpha thalassaemia/mental retardation syndrome X-linked (ATRX) associates with age, and age is a prognostic marker of survival (REF. 89). Another example is that chromothripsis associates with poor survival in neuroblastoma (REF. 90) and acute myeloid leukaemia (REF. 93).

Somatic mutations can also predict the response to, or the need for, standard treatment (for example, see REFS 27,64). Before development of somatic mutations as clinically important prognostic or diagnostic biomarkers, researchers will need to consider and to discuss more than whether a finding is statistically significant; clinical significance (that is, the strength of correlations and/or effect size) and technical reliability (that is, sensitivity and specificity) are elemental to clinical importance. Some of the recent studies using large discovery cohorts report statistically significant findings and consequently have no validation cohort (for example, see REFS 90,94). As the sample size of discovery cohorts increases, independent replication studies, rather than validation cohorts, may become the norm.

Investigating specific aims

Discovering driver mutations. The first clue that a somatic mutation may be important to cancer pathogenesis is if it affects a known cancer gene or a member of a pathway implicated in cancer. Here the interpretation of cancer genome-sequencing findings is based on current knowledge and is not dependent on statistical analyses; consequently, it does not require a large sample size. A disadvantage of this method is that the current number of cancer genes (approximately 487 according to the Sanger Institute Cancer Gene Census) is probably an underestimation of the true number.

Cases in which similar cancer phenotypes in different patients are found to have recurrent mutations of genes or pathways suggest convergent molecular evolution and thus provide compelling evidence for biological relevance and selection (REFS 95,96). Bioinformatics tools such as MutSig and MuSiC find the statistically significant recurrent somatically mutated genes (REFS 30,97) (BOX 2). Perhaps the most common method for discovering driver mutations is by identifying recurrent somatic mutations that are predicted to have translational consequences (for example, missense, nonsense, splice-site SNVs or coding indels). In addition, signatures of negative selection on somatic mutations in promoter regions (REF. 21), decreased expression of genes with somatic mutations in untranslated regions or introns (REF. 52) and significantly recurrent mutations found in non-coding regions (for example, see REFS 30,60,65) suggest that non-coding mutations may also provide insight into the pathogenesis of cancer. Synonymous somatic mutations also deserve attention as they can potentially modulate transcript levels, protein structure and splicing (for example, see REFS 98,99). Given the multiple mutational mechanisms that inactivate, activate, moderate or change the function of genes, researchers should ideally assess whether structural variants or transcriptional or epigenetic deregulation also affect recurrently mutated genes.

After recurrence has been established, patterns can also indicate the functional effect that a putative driver mutation might have. For instance, mutation hotspots (REF. 72) (that is, a single DNA locus in which somatic mutations recur) and somatic mutations that cluster in a particular domain of a gene (for example, see REFS 57,62) suggest an increase of function or a cancer-promoting change of function (REF. 65). Conversely, recurrent somatic mutations that are distributed along a gene are suggestive of a candidate tumour suppressor (REF. 62): a gene that allows for cancer growth when deactivated or deleted. There are also bioinformatics means to detect signatures of selection in recurrently mutated genes, which can reveal candidate driver genes (BOX 2 and Supplementary information S1 (table)). One caveat is the assumption that synonymous mutations are evolutionarily neutral.

Pathway analysis is another method used to identify driver genes, and it is particularly important in cancer types with heterogeneous mutational landscapes, where few recurrently mutated genes are found (Supplementary information S1 (table)). The somatic mutations that are important to cancer pathogenesis in individual patients may differ, but different somatic mutations can create similar dysfunction of a biological pathway (for example, see REFS 30,90). On a related note, several somatic mutations may work together to deregulate the same pathway in one patient (for example, see REFS 17,21). In both these scenarios, pathway analysis may uncover driver mutations of interest for further evaluation, bearing in mind the important caveats that not all genes affected by somatic mutations are included in pathway analysis and that understanding of the function of gene products in relation to each other is incomplete. Also, determining mutations that are mutually exclusive can help to define subtypes (for example, see REF. 76) and/or reveal the different somatic mutations that create similar oncogenic dysfunction (for example, see REFS 89,97). A bioinformatics tool called mutation relation test can be used to assess statistically mutually exclusive relationships (REF. 80).

Characterizing clonal evolution. There are three main second-generation genome-sequencing study designs that can demonstrate clonality and the molecular evolution of cancer (Supplementary information S1 (table)). These designs, which are not mutually exclusive, use ultra-deep resequencing (for example, see REFS 2,60,100), multi-region sequencing (for example, see REFS 55,101,102) and/or sequential multi-sample sequencing (for example, see REFS 100,103,104). First, ultra-deep resequencing (that is, >100-fold redundant coverage) of select somatic mutations allows researchers to assess somatic mutant allele frequencies more accurately and to detect those that are at a low frequency (for example, see REFS 16,18,61). Clustering analyses based on mutant allele frequencies can reveal the number of subclones or the intra-tumour heterogeneity (for example, see REFS 16,18,60) and can be used to construct a phylogenetic tree to show the inferred evolutionary relationships among subclones (for example, see REFS 61,100,103). An advantage of this method is that it requires only one sample to be sequenced. Second, multi-region sequencing can also detect intra-tumour clonal heterogeneity and phylogeny in solid tumours without the need for determining somatic mutation allele frequencies (REF. 55). Third, sequential multi-sample sequencing is based on observing a change in somatic mutation allele frequencies over time in related populations of cancer cells; in addition to the primary cancer, the recurrent cancer (for example, see REFS 17,38,54) and/or secondary metastasis (for example, see REFS 34,40,101) are also sequenced. Somatic mutations that change in frequency or that are unique to the subsequent cancer may be important to disease progression and/or acquired drug resistance (in cancer types that have been exposed to the selective pressure of treatment). One of the major hurdles in multi-sample evolutionary examination is the ethics of sequential sampling in which there is no intrinsic benefit for a patient. For haematological cancer, this is less of an issue, as frequent blood draws are used to monitor progression or remission. In most solid tumours, however, invasive biopsies or resection of drug-resistant recurrent cancer or metastasis is not a widespread practice.

Glossary terms for this section:
Multi-region sequencing: Sequencing of distinct regions of the same solid tumour; this allows for the examination of intra-tumour heterogeneity and clonal evolution.
Clinically actionable drug targets: Biological molecules or processes that can be targeted by an existing or experimental drug.

Future perspectives

Personalized medicine is an important area that will greatly benefit from innovative cancer genome-sequencing study designs. Currently, second-generation sequencing of cancer genomes with the intent of guiding therapy decisions is in its infancy, but promise has been shown in single-patient studies. Among the first cancer genome-sequencing studies to inform treatment was work on a rare subtype of treatment-resistant metastatic adenocarcinoma (REF. 17). On the basis of an integrative analysis of genome and transcriptome data, researchers developed a hypothesis for the mechanism driving the cancer. Subsequent therapeutic intervention coincided with partial remission of the metastatic disease, which ultimately progressed. Further sequencing showed that the metastasis had undergone extensive evolution, particularly in the pathway that was targeted by tyrosine kinase inhibitors, and had become drug-resistant.

Moving from single patients to cohort-based personalized medicine research trials has great potential. In fact, recent studies have found that ~20% of triple-negative breast cancers (REF. 16) and more than 60% of lung cancers (REF. 77) have potential clinically actionable drug targets. However, cohort-based personalized medicine research trials will have numerous challenges. If a randomized trial tests the safety and efficacy of cancer genome-sequencing guided treatment decisions, then it might require many different drugs to be used in the treatment arm; this becomes complicated in terms of who funds the trial. If a randomized trial tests the safety and efficacy of a novel treatment and if, in order to meet the inclusion criteria, patients are required to have a specific somatically mutated gene, pathway or mutational signature, then a large number of potential participants will probably need to be screened. Furthermore, ethical concerns require that personalized medicine approaches be tested in end-stage patients for whom the standard care has failed. Consequently, this paradigm will initially be tested in the patients with the greatest need and the most challenging cases. Despite challenges, second-generation cancer sequencing is rapidly moving towards use in a clinical capacity.
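The allele-frequency clustering described under 'Characterizing clonal evolution' can be made concrete with a small sketch. It computes a mutant allele frequency for each somatic mutation from ultra-deep read counts and groups mutations whose frequencies lie close together, each group being a candidate subclone. The gap-based grouping, the `max_gap` threshold, the function names and the read counts are all illustrative assumptions; published studies use dedicated statistical models and also correct for tumour purity and copy number, which this sketch ignores.

```python
def mutant_allele_frequency(mutant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a locus that carry the somatic mutation."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mutant_reads / total_reads


def group_by_frequency(frequencies, max_gap: float = 0.05):
    """Group sorted allele frequencies into clusters.

    A new cluster starts whenever the gap to the previous (sorted) value
    exceeds `max_gap`. This is a deliberately simple stand-in for the
    clustering methods used in real analyses.
    """
    clusters = []
    for freq in sorted(frequencies):
        if clusters and freq - clusters[-1][-1] <= max_gap:
            clusters[-1].append(freq)
        else:
            clusters.append([freq])
    return clusters


# Hypothetical (mutant reads, total reads) pairs at roughly 1,000-fold coverage.
read_counts = [(480, 1000), (510, 1000), (455, 1000),
               (210, 1000), (190, 1000),
               (52, 1000), (48, 1000)]

frequencies = [mutant_allele_frequency(m, t) for m, t in read_counts]
clusters = group_by_frequency(frequencies)

if __name__ == "__main__":
    for i, cluster in enumerate(clusters, start=1):
        print(f"cluster {i}: n={len(cluster)}, "
              f"mean frequency={sum(cluster) / len(cluster):.3f}")
```

In this toy data set the high-frequency group would be consistent with mutations present in most cancer cells, whereas the low-frequency groups would be candidates for subclonal mutations. The deep coverage matters because, at low coverage, sampling noise in the read counts would blur the groups together.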


1. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
2. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008).
   This was the first study to use second-generation technology to sequence a cancer genome. It established cancer genome sequencing as an unbiased method for discovering candidate driver mutations.
3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
4. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
5. Battelle Technology Partnership Practice. Economic impact of the Human Genome Project: how a $3.8 billion investment drove $796 billion in economic impact, created 310,000 jobs, and launched the genomic revolution. battelle.org [online], http://www.battelle.org/docs/default-document-library/economic_impact_of_the_human_genome_project.pdf?sfvrsn=2 (2011).
6. Morin, R. D. et al. Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nature Genet. 42, 181–185 (2010).
7. Sneeringer, C. J. et al. Coordinated activities of wild-type plus mutant EZH2 drive tumor-associated hypertrimethylation of lysine 27 on histone H3 (H3K27) in human B-cell lymphomas. Proc. Natl Acad. Sci. USA 107, 20980–20985 (2010).
8. McCabe, M. T. et al. EZH2 inhibition as a therapeutic strategy for lymphoma with EZH2-activating mutations. Nature 492, 108–112 (2012).
9. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).
10. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27–40 (2011).
11. Misale, S. et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 486, 532–536 (2012).
12. Northcott, P. A. et al. Medulloblastomics: the end of the beginning. Nature Rev. Cancer 12, 818–834 (2012).
13. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nature Rev. Genet. 11, 685–696 (2010).
14. Mardis, E. R. et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med. 361, 1058–1066 (2009).
15. Link, D. C. Identification of a novel TP53 cancer susceptibility mutation through whole-genome sequencing of a patient with therapy-related AML. JAMA 305, 1568 (2011).
16. Shah, S. P. et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 461, 809–813 (2009).
   This study used ultra-deep resequencing to characterize clonal evolution and showed that variable somatic mutation allele frequencies can reflect different subclones. Moreover, considerable evolution can occur over time.
17. Jones, S. J. et al. Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome Biol. 11, R82 (2010).
   This work incorporated second-generation sequencing into the personalized medicine framework. Specifically, the intent of the case study was to inform physician decision making with respect to treatment of a rare cancer.
18. Ding, L. et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature 464, 999–1005 (2010).
19. Pleasance, E. D. et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196 (2009).
20. Pleasance, E. D. et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 463, 184–190 (2009).
   This study highlighted that the distribution and composition of somatic mutations across a genome is not uniform. It showed that through examining the mutational signatures, researchers can gain insight into the mechanisms and processes that may have given rise to the mutations.
21. Lee, W. et al. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature 465, 473–477 (2010).
22. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
23. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
24. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008).
25. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
   An accurate consensus sequence was built with second-generation technology from >30-fold redundant coverage of 35 bp paired-end reads.
26. Pelak, K. et al. The characterization of twenty sequenced human genomes. PLoS Genet. 6, e1001111 (2010).
27. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012).
28. Bass, A. J. et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nature Genet. 43, 964–968 (2011).
29. Berger, M. F. et al. The genomic complexity of primary human prostate cancer. Nature 470, 214–220 (2011).
30. Fujimoto, A. et al. Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators. Nature Genet. 44, 760–764 (2012).
31. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
32. Ajay, S. S., Parker, S. C. J., Ozel Abaan, H., Fuentes Fajardo, K. V. & Margulies, E. H. Accurate and comprehensive sequencing of personal genomes. Genome Res. 21, 1498–1505 (2011).
33. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).
34. Turajlic, S. et al. Whole genome sequencing of matched primary and metastatic acral melanomas. Genome Res. 22, 196–207 (2011).
35. Puente, X. S. et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475, 101–105 (2011).
36. Campbell, P. J. et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature 467, 1109–1113 (2010).
37. Peña-Llopis, S. et al. BAP1 loss defines a new class of renal cell carcinoma. Nature Genet. 44, 751–759 (2012).
38. Ng, C. K. et al. The role of tandem duplicator phenotype in tumour evolution in high-grade serous ovarian cancer. J. Pathol. 226, 703–712 (2012).
39. Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).
40. Kloosterman, W. P. et al. Chromothripsis is a common mechanism driving genomic rearrangements in primary and metastatic colorectal cancer. Genome Biol. 12, R103 (2011).
41. McBride, D. J. et al. Tandem duplication of chromosomal segments is common in ovarian and breast cancer genomes. J. Pathol. 227, 446–455 (2012).
42. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008).
   By investigating the paired-end sequencing reads that did not align to the reference genome as expected with respect to each other, the authors were able to demonstrate a high-throughput and high-resolution bioinformatics method to characterize structural variation.
43. Muzny, D. M. et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
   With 97 colorectal cancer genomes sequenced to low-to-moderate redundant coverage, this discovery cohort is the largest to date.
44. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007).
45. Onishi-Seebacher, M. & Korbel, J. O. Challenges in studying genomic structural variant formation mechanisms: the short-read dilemma and beyond. BioEssays 33, 840–850 (2011).
46. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19,
47. Fullwood, M. J., Wei, C.-L., Liu, E. T. & Ruan, Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 19, 521–532 (2009).
48. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, S13–S20 (2009).
49. Hillmer, A. M. et al. Comprehensive long-span paired-end-tag mapping reveals characteristic patterns of structural variations in epithelial cancer genomes. Genome Res. 21, 665–675 (2011).
50. Boeva, V. et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics 28, 423–425 (2012).
51. Druker, B. J. et al. Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia. N. Engl. J. Med. 355, 2408–2417 (2006).
52. Lee, E. et al. Landscape of somatic retrotransposition in human cancers. Science 337, 967–971 (2012).
53. Welch, J. S. Use of whole-genome sequencing to diagnose a cryptic fusion oncogene. JAMA 305, 1577 (2011).
54. Weiss, G. J. et al. Paired tumor and normal whole genome sequencing of metastatic olfactory neuroblastoma. PLoS ONE 7, e37029 (2012).
55. Tao, Y. et al. Rapid growth of a hepatocellular carcinoma and the driving mutations revealed by cell-population genetic analysis of whole-genome data. Proc. Natl Acad. Sci. USA 108, 12042–12047 (2011).
56. Bueno, R. et al. Second generation sequencing of the mesothelioma tumor genome. PLoS ONE 5, e10612 (2010).
57. Totoki, Y. et al. High-resolution characterization of a hepatocellular carcinoma genome. Nature Genet. 43, 464–469 (2011).
58. Demeure, M. J. et al. Cancer of the ampulla of Vater: analysis of the whole genome sequence exposes a potential therapeutic vulnerability. Genome Med. 4, 56 (2012).
59. Muller, F. L. et al. Passenger deletions generate therapeutic vulnerabilities in cancer. Nature 488, 337–342 (2012).
60. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395–399 (2012).
61. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
   This paper demonstrates the utility of characterizing the somatic mutational signature with the discovery of kataegis.
62. Morin, R. D. et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 476, 298–303 (2011).
63. Wu, G. et al. Somatic histone H3 alterations in pediatric diffuse intrinsic pontine gliomas and non-brainstem glioblastomas. Nature Genet. 44, 251–253 (2012).
64. Wang, L. et al. SF3B1 and other novel cancer genes in chronic lymphocytic leukemia. N. Engl. J. Med. 365, 2497–2506 (2011).
65. Chapman, M. A. et al. Initial genome sequencing and analysis of multiple myeloma. Nature 471, 467–472 (2011).
66. Harbour, J. W. et al. Frequent mutation of BAP1 in metastasizing uveal melanomas. Science 330, 1410–1413 (2010).
   This study discovered a gene that was somatically mutated in an impressive number of metastasizing tumours using second-generation sequencing of exomes. This study highlights that there are novel and valuable candidate therapeutic targets that are yet to be discovered.
67. Yoshida, K. et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature 478, 64–69 (2011).
68. Schwartzentruber, J. et al. Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma. Nature 482, 226–231 (2012).
69. Pugh, T. J. et al. Medulloblastoma exome sequencing uncovers subtype-specific somatic mutations. Nature 488, 106–110 (2012).
70. Sathirapongsasuti, J. F. et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 27, 2648–2654 (2011).
71. Karakoc, E. et al. Detection of structural variants and indels within exome data. Nature Methods 9,
1117–1123 (2009). 176–178 (2012).

NATURE REVIEWS | GENETICS VOLUME 14 | MAY 2013 | 331

© 2013 Macmillan Publishers Limited. All rights reserved

72. Banerji, S. et al. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature 486, 405–409 (2012).
73. Ruan, Y. et al. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using paired-end ditags (PETs). Genome Res. 17, 828–838 (2007).
74. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).
75. Roberts, K. G. et al. Genetic alterations activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia. Cancer Cell 22, 153–166 (2012).
76. Jones, D. T. W. et al. Dissecting the genomic complexity underlying medulloblastoma. Nature 488, 100–105 (2012).
77. Hammerman, P. S. et al. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).
78. Morin, R. D. et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 18, 610–621 (2008).
79. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).
80. Dees, N. D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 22, 1589–1598 (2012).
81. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).
82. Hawkins, R. D., Hon, G. C. & Ren, B. Next-generation genomics: an integrative approach. Nature Rev. Genet. 11, 476–486 (2010).
83. Zhang, J. et al. The genetic basis of early T‑cell precursor acute lymphoblastic leukaemia. Nature 481, 157–163 (2012).
84. Ju, Y. S. et al. A transforming KIF5B and RET gene fusion in lung adenocarcinoma revealed from whole-genome and transcriptome sequencing. Genome Res. 22, 436–445 (2011).
85. Esteller, M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nature Rev. Genet. 8, 286–298 (2007).
86. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007).
87. Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009).
88. Zhang, J. et al. A novel retinoblastoma therapy from genomic and epigenetic analyses. Nature 481, 329–334 (2012).
89. Cheung, N.‑K. V. et al. Association of age at diagnosis and genetic mutations in patients with neuroblastoma. JAMA 307, 1062–1071 (2012).
90. Molenaar, J. J. et al. Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483, 589–593 (2012).
    This is one of the largest discovery cohorts to date. Researchers sequenced the genomes of 87 tumour–normal pairs to at least 30‑fold redundant coverage.
91. Collins, C. C. et al. Next generation sequencing of prostate cancer from a patient identifies a deficiency of methylthioadenosine phosphorylase, an exploitable tumor target. Mol. Cancer Ther. 11, 775–783 (2012).
92. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).
93. Rausch, T. et al. Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell 148, 59–71 (2012).
94. Sung, W.‑K. et al. Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nature Genet. 44, 765–769 (2012).
95. Klein, G. Lymphoma development in mice and humans: diversity of initiation is followed by convergent cytogenetic evolution. Proc. Natl Acad. Sci. USA 76, 2442–2446 (1979).
96. Castoe, T. A., De Koning, A. P. J. & Pollock, D. D. Adaptive molecular convergence: molecular evolution versus molecular phylogenetics. Commun. Integr. Biol. 3, 67–69 (2010).
97. Berger, M. F. et al. Melanoma genome sequencing reveals frequent PREX2 mutations. Nature 485, 502–506 (2012).
98. Kimchi-Sarfaty, C. et al. A ‘silent’ polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525–528 (2007).
99. Pagani, F., Raponi, M. & Baralle, F. E. Synonymous mutations in CFTR exon 12 affect splicing and are not neutral in evolution. Proc. Natl Acad. Sci. USA 102, 6368–6372 (2005).
100. Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–510 (2012).
101. Wu, C. et al. Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer. J. Pathol. 227, 53–61 (2012).
102. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).
103. Walter, M. J. et al. Clonal architecture of secondary acute myeloid leukemia. N. Engl. J. Med. 366, 1090–1098 (2012).
104. Jones, S. et al. Comparative lesion sequencing provides insights into tumor evolution. Proc. Natl Acad. Sci. USA 105, 4283–4288 (2008).
105. Greenman, C., Wooster, R., Futreal, P. A., Stratton, M. R. & Easton, D. F. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173, 2187–2198 (2006).

Acknowledgements
J.C.M. thanks the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research for their support. M.A.M. is the University of British Columbia, Canada Research Chair in Genome Science.

Competing interests statement
The authors declare no competing financial interests.

FURTHER INFORMATION
1000 Genomes Project: http://www.1000genomes.org
COSMIC: Cancer Gene Census: http://cancer.sanger.ac.uk/cancergenome/projects/census
dbSNP — NCBI: http://www.ncbi.nlm.nih.gov/snp
E.6 Quality Standards of Samples — International Cancer Genome Consortium: http://icgc.org/icgc/goals-structure-policies-guidelines/e6-quality-standards-of-samples
E.7 Study Design and Statistical Issues — International Cancer Genome Consortium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e7-study-design-and-statistical-issues
E.8 Genome Analyses — International Cancer Genome Consortium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e8-genome-analyses
Genome MuSiC: http://gmt.genome.wustl.edu/genome-music/current
MutSig: http://www.broadinstitute.org/cancer/cga/mutsig
Nature Reviews Genetics Series on Applications of next-generation sequencing: http://www.nature.com/nrg/series/nextgeneration/index.html
Nature Reviews Genetics Series on Study designs: http://www.nature.com/nrg/series/studydesigns/index.html
WHO — International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3): http://www.who.int/classifications/icd/adaptations/oncology/en

SUPPLEMENTARY INFORMATION
See online article: S1 (table) | S2 (table) | S3 (table)

