Structural variation: the genome’s hidden architecture
Monya Baker
TECHNOLOGY FEATURE
NATURE METHODS | VOL.9 NO.2 | FEBRUARY 2012
© 2012 Nature America, Inc. All rights reserved.

Uncovering variants
Calling for more algorithms
Box 1: Getting a bigger picture

Next-generation sequencing is uncovering more variants than ever before, but it also faces limitations.

The Austrian monk Gregor Mendel may have founded the science of genetics, but his ideas now limit genomic studies, according to Jim Lupski, a molecular geneticist at Baylor College of Medicine in Texas. “Mendelian thinking is to genetics as Newtonian thinking is to physics. We saw a whole new world when Einstein came along.”

According to Mendelian principles, individuals inherit exactly two copies of each gene, one from each parent. Genes on sex chromosomes have long been recognized as exceptions, but genetic deletions and duplications also break the rules, and in ways that are much harder to track.

In the early 1990s, Lupski and colleagues found that the hereditary neuropathy Charcot-Marie-Tooth disease was associated not with a flawed version of a particular genetic region but instead with the presence of an extra copy of the normal version. “We all believed Mendel for many, many years, but when you have a duplication, you are triallelic rather than biallelic,” he says. Experts were so unwilling to accept the idea of a disease caused by a gene-dosage effect that both Science and Nature declined even to send his paper out for review, he recalls. When Lupski did publish, in Cell, his lab used four independent methods to substantiate his conclusions within a skeptical scientific community¹.

Despite additional findings like this, and the sequencing of the human genome, studies are still built on Mendelian assumptions, says Lupski. The concept of paired homologous chromosomes is taught in high school biology; the term ‘paralogous’, which refers to clusters of identical or near-identical sequences at different chromosomal locations within the same genome, is much more obscure. Genome-wide association studies, which find correlations between a disease population and specific genetic variation, tend to take diploidy as the default. Even microarrays designed to assess copy number generally assume that most individuals carry exactly two copies of any particular region, which can throw off some calculations.

With the exceptions of very large variants that can be caught under a microscope, finding copy-number variations has been difficult, says Evan Eichler, who studies genome sciences at the University of Washington in Seattle. “Every technology developed in genomics up to about 2007 is really biased toward typing unique tags in the genome,” he says.

And copy-number variation is just one type of a class of previously overlooked differences now collectively known as structural variation. This catch-all category includes insertions, duplications, deletions, inversions, recurring mobile elements, and other rearrangements, now usually defined as those covering 50 or more base pairs (Fig. 1). (The number is arbitrary; earlier definitions set the number at 1,000 base pairs until sequencing technologies capable of detecting smaller variants drove it down.)

There is a growing recognition that structural variation is pervasive and important. Since the publication of seminal papers²,³ in 2004, the numbers of references to structural variation in the scientific literature and entries in the curated archive Database of Genomic Variants have soared (Fig. 2). It is now recognized that, in terms of the number of nucleotides, structural variation accounts for more differences between human genomes than the more extensively studied single-nucleotide differences. A 2010 study estimated that such “non-SNP variation” totaled about 50 megabases per human genome⁴.

Conditions including autism, schizophrenia and Crohn’s disease have all been associated with structural variation. And uncovering structural variation will be essential for understanding heterogeneity within tumors, says Jan Korbel, who studies structural variation at the European Molecular Biology Laboratory in Heidelberg. “There will be mechanisms that lead to cancer that no one would have thought of a year ago,” he predicts. Last year, researchers at the Broad Institute of Harvard and MIT in Cambridge, Massachusetts, sequenced the entire genomes of both normal and cancerous tissue taken from seven men with prostate cancer: point mutations were relatively infrequent, whereas chromosome rearrangements were much more common⁵.

Researchers have only recently begun mining short-read sequencing data for structural variation. “Next-generation technologies are opening up a new zone of discovery,” says Eichler, adding that sequencing still produces many artifacts that can send researchers on wild goose chases, and it overlooks many variants. “We pat ourselves on the back when we find new structural variants, but we’re missing 50% of the variants out there because of the limitation of our methods and our technology,” he says.

[Photo caption: Next-generation sequencing is revealing new variation, but it won’t be able to find everything, says Evan Eichler at the University of Washington, Seattle.]

Figure 1 | Structural variation occurs in all forms and sizes. Genome structural variation encompasses polymorphic rearrangements 50 base pairs to hundreds of kilobases in size and affects about 0.5% of the genome of a given individual. (Image: Jan Korbel, European Molecular Biology Laboratory)
Human chromosome segments shown in the figure:
Reference: A B C
Deletion: A C
Insertion: A E B C
Inversion: A C B
Tandem duplication: A A B C
Dispersed duplication: A B A C
Copy-number variant: A A B C A

Uncovering variants

The extent of copy-number variation was first demonstrated using microarrays. These can analyze the greatest numbers of samples at the lowest cost, essential for achieving the population sizes necessary to associate rare variants with disease. One study that tied schizophrenia risk to duplications in a neuropeptide gene began with a genome-wide hunt for copy-number variations in 802 patients and 742 controls⁶. Such sample sizes would not have been feasible with sequencing, and nor would the follow-up analysis of 114 ‘regions of interest’ assessed in an additional 14,177 subjects.

However, the information that arrays can reveal about structural variation is limited. Arrays for comparative genomic hybridization and SNP arrays are both used in structural variation studies, but they can detect only sequences that match the oligonucleotide probes used to make them, and these probes are usually biased against ‘difficult’, highly repetitive regions. Custom-made sets of arrays with tens of millions of probes may find variants as small as 500 bases, but the use of such arrays is not feasible for huge numbers of samples. For the most commonly used arrays, the limit of detection is usually much larger, generally on the order of 5 kilobases, and greater for highly repetitive sequences. And although arrays can detect that a sample has more or fewer copies of a region compared to a reference genome, they generally cannot determine an absolute number.

Another advantage that sequencing offers is in the ability to find ‘breakpoints’, the sequence boundaries where a structural variant begins and ends, says Korbel. Without knowing the breakpoints, it is hard to track a variant across a population and even more difficult to understand the functional impact a variant might have or the mechanisms that produced it.

But sequencing still misses considerable variation. In next-generation sequencing, genomic DNA is shredded into fragments of manageable size. These fragments are partially sequenced as ‘reads’, usually around 80–150 bp long. […] apparent, revealing single-nucleotide variations. Finding structural variation is more complex; analysis must make sense of partially aligned reads and sort out repetitive sequences.

In the past few years, scientists have produced an alphabet soup of algorithms to hunt out structural variation from sequencing data. These typically use one of a handful of strategies (Fig. 3). ‘Read depth’ refers to the number of sequenced fragments mapped to a particular part of the genome and can indicate how many copies of a region are present. In ‘split reads’, a single sequenced fragment maps to two parts of a reference genome that are far away from each other. That means that those pieces are next to each other in the donor genome, a situation that may indicate a deletion, insertion or inversion.

More complex but arguably more powerful are paired reads, which work as a kind of molecular calipers. First, the genome is precisely divided into molecules of known size: these might be 500-bp fragments, 3-kb fragments and 40-kb plasmids. Rather than sequencing the entire stretch of DNA, a task impossible for current high-throughput sequencing machines, reads are taken only from the ends. If these appear too close together (for example, the ends of a 3-kb fragment map to sequences on the reference genome that are 2 kb apart), then the newly sequenced genome may harbor an insertion. If the ends appear too far apart, the genome may have a deletion.

None of these measures reveals everything about a variant. Read depth can indicate how many copies are present, but not where the copies occur, for example. And the shorter the read is, the less likely it will be to reveal breakpoints, says Michael Brudno, a computer scientist at the University of Toronto:

[Photo caption: More structural variation is being uncovered than ever before, says Jan Korbel at the European Molecular Biology Laboratory in Heidelberg.]

[Figure 2, panel a: PubMed citations by year, 2003–2010 (legend entries: DGV citations, structural variation, copy-number variation). Panel b: DGV entries by year, 2004–2010 (legend entries: copy-number variation, inversion). Image: Stephen Scherer, The Hospital for Sick Children.]
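As a rough illustration of two of the strategies described in this section, the Python sketch below turns the read-depth and paired-read ideas into simple arithmetic. It is not taken from any published algorithm: the function names, the two-copy baseline and the tolerance value are all hypothetical, chosen only to mirror the 3-kb fragment example in the text. Real tools would typically also model the spread of fragment sizes and the quality of each alignment.

```python
# Illustrative sketch only. The function names, the two-copy baseline and the
# tolerance are assumptions made for this example, not any tool's real API.

def estimate_copy_number(region_depth, baseline_depth, baseline_copies=2):
    """Read depth: scale the baseline copy count by relative coverage.

    Assumes coverage grows linearly with the number of copies present.
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return round(baseline_copies * region_depth / baseline_depth)


def classify_read_pair(fragment_size, mapped_distance, tolerance):
    """Paired reads: compare the known fragment size (in bp) with the
    distance between the two end reads on the reference genome."""
    if mapped_distance < fragment_size - tolerance:
        # Ends map closer together than expected: the sequenced genome
        # may carry extra sequence between them.
        return "possible insertion"
    if mapped_distance > fragment_size + tolerance:
        # Ends map farther apart than expected: the sequenced genome
        # may be missing sequence between them.
        return "possible deletion"
    return "concordant"


if __name__ == "__main__":
    # A region covered 1.5 times as deeply as a two-copy baseline.
    print(estimate_copy_number(45.0, 30.0))        # 3
    # The example from the text: a 3-kb fragment whose ends map 2 kb apart.
    print(classify_read_pair(3000, 2000, 300))     # possible insertion
    print(classify_read_pair(3000, 4500, 300))     # possible deletion
```

The sketch also shows the limitation noted in the text: the depth calculation returns a copy count but says nothing about where the extra copies sit, and the pair classification flags a discrepancy without locating the breakpoints.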