technology feature

Uncovering variants 134 Calling for more algorithms 135 Box 1: Getting a bigger picture 136

Structural variation: the ’s hidden architecture Monya Baker

Next-generation sequencing is uncovering more variants than ever before, but it also faces limitations.

The Austrian monk Gregor Mendel may tend to take dip- number of nucleotides, structural varia- have founded the science of genetics, but his loidy as the default. tion accounts for more differences between ideas now limit genomic studies, according to Even microarrays human than the more extensively Jim Lupski, a molecular geneticist at Baylor designed to assess studied single-nucleotide differences. A 2010 College of Medicine in Texas. “Mendelian copy number gen- study estimated that such “non-SNP varia- thinking is to genetics as Newtonian think- erally assume that tion” totaled about 50 megabases per human ing is to physics. We saw a whole new world most individuals genome4. when Einstein came along.” carry exactly two Conditions including , schizophre- According to Mendelian principles, indi- copies of any partic- nia and Crohn’s disease have all been associ- viduals inherit exactly two copies of each ular region, which ated with structural variation. And uncov- gene—one from each parent. Genes on sex Next-generation can throw off some ering structural variation will be essential have long been recognized as sequencing is revealing calculations. for understanding heterogeneity within exceptions, but genetic deletions and duplica- new variation, but With the excep- tumors, says Jan Korbel, who studies struc- tions also break the rules, and in ways that are it won’t be able to tions of very large tural variation at the European Molecular find everything, much harder to track. says Evan Eichler variants that can Biology Laboratory in Heidelberg. “There In the early 1990s, Lupski and colleagues at the University of be caught under a will be mechanisms that lead to cancer found that the hereditary neuropathy Washington, Seattle. microscope, find- that no one would have thought of a year Charcot-Marie-Tooth disease was associ- ing copy-number ago,” he predicts. Last year, researchers at ated not with a flawed version of a particu- variations has been difficult, says Evan the Broad Institute of Harvard and MIT in

© 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature lar genetic region but instead with the pres- Eichler, who studies genome sciences at the Cambridge, Massachusetts, sequenced the ence of an extra copy of the normal version. University of Washington in Seattle. “Every entire genomes of both normal and cancer- “We all believed Mendel for many, many technology developed in genomics up to years, but when you have a duplication, you about 2007 is really biased toward typing npg Human are triallelic rather than biallelic,” he says. unique tags in the genome,” he says. Experts were so unwilling to accept the idea And copy-number variation is just one of a disease caused by a gene-dosage effect type of a class of previously overlooked dif- that both Science and Nature declined even ferences now collectively known as structural Reference A B C to send his paper out for review, he recalls. variation. This catch-all category includes When Lupski did publish, in Cell, his lab used insertions, duplications, deletions, inver- A C four independent methods to substantiate his sions, recurring mobile elements, and other conclusions within a skeptical scientific com- rearrangements, now usually defined as A E B C munity1. those covering 50 or more base pairs (Fig. 1). Inversion A C B Despite additional findings like this, and (The number is arbitrary; earlier defini- the sequencing of the human genome, stud- tions set the number at 1,000 base pairs until Tandem A A B C ies are still built on Mendelian assumptions, sequencing technologies capable of detecting duplication says Lupski. The concept of paired homolo- smaller variants drove it down.) Dispersed A B A C gous chromosomes is taught in high school There is a growing recognition that struc- duplication biology; the term ‘paralogous’, which refers tural variation is pervasive and important. Copy-number A A B C A Laboratory Biology Molecular European Korbel, Jan to clusters of identical or near-identical Since the publication of seminal papers2,3 variant sequences at different chromosomal loca- in 2004, the numbers of references to struc- Figure 1 | Structural variation occurs in all forms tions within the same genome, is much more tural variation in the scientific literature and sizes. Genome structural variation encompasses obscure. Genome-wide association studies, and entries in the curated archive Database polymorphic rearrangements 50 base pairs to which find correlations between a disease of Genomic Variants have soared (Fig. 2). hundreds of kilobases in size and affects about population and specific , It is now recognized that, in terms of the 0.5% of the genome of a given individual.

nature methods | VOL.9 NO.2 | FEBRUARY 2012 | 133 technology feature

ous tissue taken from seven men with pros- rare variants with disease. One study that apparent, revealing tate cancer: point were relatively tied risk to duplications in a single-nucleotide infrequent, whereas chromosome rearrange- neuropeptide gene began with a genome- variations. Finding ments were much more common5. wide hunt for copy-number variations in structural variation Researchers have only recently begun 802 patients and 742 controls6. Such sample is more complex; mining short-read sequencing data for sizes would not have been feasible with analysis must make structural variation. “Next-generation tech- sequencing, and nor would the follow-up sense of partially nologies are opening up a new zone of dis- analysis of 114 ‘regions of interest’ assessed aligned reads and covery,” says Eichler, adding that sequencing in an additional 14,177 subjects. sort out repetitive still produces many artifacts that can send However, the information that arrays can More structural sequences. researchers on wild goose chases, and it reveal about structural variation is limited. variation is being In the past few overlooks many variants. “We pat ourselves Arrays for comparative genomic hybridiza- uncovered than years, scientists on the back when we find new structural tion and SNP arrays are both used in struc- ever before, says have produced an Jan Korbel at the variants, but we’re missing 50% of the vari- tural variation studies, but they can detect European Molecular alphabet soup of ants out there because of the limitation of only sequences that match the oligonucle- Biology Laboratory in algorithms to hunt our methods and our technology,” he says. otide probes used to make them, and these Heidelberg. out structural varia- probes are usually biased against ‘difficult’, tion from sequenc- Uncovering variants highly repetitive regions. Custom-made sets ing data. These typically use one of a hand- The extent of copy-number variation was of arrays with tens of millions of probes may ful of strategies (Fig. 3). ‘Read depth’ refers first demonstrated using microarrays. These find variants as small as 500 bases, but the to the number of sequenced fragments can analyze the greatest numbers of samples use of such arrays is not feasible for huge mapped to a particular part of the genome at the lowest cost, essential for achieving numbers of samples. For the most com- and can indicate how many copies of a the population sizes necessary to associate monly used arrays, the limit of detection region are present. In ‘split reads’, a single is usually much larger, gener- sequenced fragment maps to two parts of a PubMed citations ally on the order of 5 kilobases, the reference genome that are far away from and greater for highly repeti- each other. That means that those pieces are 500 DGV citations tive sequences. And although next to each other in the donor genome, a Structural variation arrays can detect that a sample situation that may indicate a deletion, inser- 400 Copy-number variation has more or fewer copies of a tion or inversion. region compared to a reference More complex but arguably more power- 300 genome, they generally cannot ful are paired reads, which work as a kind determine an absolute number. of molecular calipers. First, the genome is Another advantage that precisely divided into molecules of known 200 sequencing offers is in the abil- size: these might be 500-bp fragments, 3-kb

© 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature ity to find ‘breakpoints’, the fragments and 40-kb plasmids. Rather than 100 sequence boundaries where a sequencing the entire stretch of DNA, a task structural variant begins and impossible for current high-throughput 0 ends, says Korbel. Without sequencing machines, reads are taken only npg 2009200820072006200520042003 2010 knowing the breakpoints, it is from the ends. If these appear too close Year hard to track a variant across a together (for example, the ends of a 3-kb b DGV entries population and even more dif- fragment map to sequences on the reference ficult to understand the func- genome that are 2 kb apart), then the newly 500,000 Copy-number variation tional impact a variant might sequenced genome may harbor an insertion. Inversion have or the mechanisms that If the ends appear too far apart, the genome 400,000 produced it. may have a deletion. But sequencing still misses None of these measures reveals everything 300,000 considerable variation. In about a variant. Read depth can indicate how next-generation sequencing, many copies are present, but not where the 200,000 genomic DNA is shredded into copies occur, for example. And the shorter fragments of manageable size. the read is, the less likely it will be to reveal 100,000 These fragments are partially breakpoints, says Michael Brudno, a com- Stephen Scherer, The Hospital for Sick Children Sick for Hospital The Scherer, Stephen sequenced as ‘reads’, usually puter scientist at the University of Toronto: 0 2004 2005 2006 2007 2008 2009 2010 around 80–150 bp long. “Once the break is in a repeat and the short Year Reads are then aligned to read cannot span that repeat, you don’t know the reference genome to check where the breakpoint really is.” In general, Figure 2 | Entries for structural variation are increasing in the scientific literature (a) and in the Database of Genomic Variants for differences. When reads the more repetitive a region is, the harder it is (DGV; b), which posts curated data from peer-reviewed studies match precisely to unique to analyze, says Brudno. “If a read matches in on human samples. It draws from two other databases (DGVa and spots, alterations in a hand- two separate places in the genome, you don’t dbVAR) that accept open submissions for data. ful of nucleotides are readily know which is right.”

134 | VOL.9 NO.2 | FEBRUARY 2012 | nature methods technology feature

Another strategy uses aberrations from Read pairs

the reference genome to identify loci where No structural Mobile element Tandem structural variants might be, and then assem- variation Deletion insertion duplication bles reads just for that area. This does elimi- Sample nate some biases introduced by the reference Reference genome, says Ira Hall, a molecular geneti- cist at the University of Virginia School of Read depth Medicine, but it is still far from perfect. In Sample particular, he says, assembly approaches tend reads not to deal well with heterozygosity, when one variant occurs on only one of a pair of Duplication Deletion Reference homologous chromosomes. Because individual algorithms tend to Split reads Assembly specialize in finding variants of particular sizes and types, researchers often use several algorithms together. Almost exactly a year Novel sequence ago, the described an Deletion insertion Reference Reference Laboratory Biology Molecular European Korbel, Jan effort to find structural variations using next- generation sequencing data from185 human Figure 3 | Several analytic techniques are used to find structural variation. genomes7. It used 19 algorithms to identify over 22,000 deletions plus 6,000 other vari- use all variants from all calling algorithms— Calling for more algorithms ants such as insertions or duplications. instead it evaluated each algorithm’s rates of Charles Lee, a cytogeneticist at Brigham and Combining algorithms effectively does not false positives experimentally and included Women’s Hospital in Boston and among mean pooling results indiscriminately, says variants identified by less specific algorithms the first researchers to associate gene dupli- Korbel, one of the leaders of the study. The only if they were verified with microarrays cations with disease, says that one of the 1000 Genomes Project, for example, did not or PCR. take-home lessons of the study is just how © 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature npg

nature methods | VOL.9 NO.2 | FEBRUARY 2012 | 135 technology feature

far the hunt to BOX 1 Getting a bigger picture catalog structural variation has to go. Short-read sequencing technologies are “Next-generation uncovering a wealth of new structural variants, but 150-base-pair reads can go sequencing is a new only so far to build up an accurate picture of area for everybody,” the six billion base pairs in a diploid human he says. “There’s a genome. Larger-scale analyses can show lot of work that has Structural variation is how many genes are present and in what to happen.” With more difficult to assess orientation, as well as which versions occur very high-quality data (40× to 60× than single-nucleotide together on the same chromosome. variation, says Charles One approach to such analyses is coverage), he esti- Lee at Brigham and clever sample preparation followed by mates, sequencing Women’s Hospital. sequencing. Independent technologies data can find about from the laboratories of Jay Shendure at 80% of known deletions but less than 20% of

the University of Washington in Seattle Wisconsin of Schwartz, University David duplications. As for inversions and so-called and Stephen Quake at Stanford University Patterns in single stretched-out DNA ‘copy-neutral variation’, he says, the fraction in California may soon help to sort that molecules extending for hundreds of detected so far is still insignificant. out. Quake has a microfluidics system that kilobases can identify structural variants The algorithms for calling structural physically separates all the chromosomes across the genome. variants from sequencing are just not accu- of a single cell; DNA on each chromosome rate enough for results to be believed with- can then be amplified in isolation9. (So far published work has only shown genotyping out additional confirmation, says Stephen in SNPs, but Quake is working toward additional applications.) Shendure fragmented a Scherer at the University of Toronto, who has genome into a very picky type of bacterial plasmid that accepts only about 40 kilobases also linked duplications with disease. “If you at a time. These could then be grown up in pools and sequenced, and the resultant data want complete structural variation data now, could be built up into 400-kilobase regions, each derived from a single chromosome10. you need to do whole-genome sequencing Other approaches get structural information from single molecules by directly and also run microarrays.” analyzing extremely long stretches of DNA. One such approach is called optical mapping. Algorithms to find structural variants can This starts with a tube of long DNA molecules that have been isolated from a biological be hard to evaluate individually. Because sample. These are stretched out onto a positively charged glass surface to which the they generally target particular classes of negatively charged DNA molecules adhere firmly. The surface is then dipped into a variants, doing rigorous comparisons is dif- solution of restriction enzymes; cleaved DNA recoils, leaving gaps that can be seen under ficult. “There hasn’t been to my knowledge a microscope. An entire human genome can be laid out on four patches, each about 1 really good benchmarks of many of these centimeter square, says David C. Schwartz at the University of Wisconsin–Madison, who algorithms—of course everyone’s algorithm developed the technology. Next, the entire sample is meticulously photographed at

© 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature does the best when you read the paper high resolution, essentially creating a data set of a million single-molecule restriction [describing it],” says Ben Raphael, a com- maps. Sophisticated software assembles the fragments and then identifies inversions, puter scientist at Brown University in Rhode translocations and duplications. “We know the size and position of each of those 325,000 Island. Academic institutions and sequenc- npg base-pair fragments. We can tell you which have gotten bigger, which are smaller and ing services companies, such as Illumina and which have been rearranged,” explains Schwartz, who has performed optical mapping of Complete Genomics, are intent on improving bacterial genomes and several plant and animal genomes, including human11. However, their abilities to identify structural variation. the system is not something other researchers could easily set up on their own, he says. In October 2011, OpGen Inc., a company Schwartz founded, announced the launch of So far, though, offerings are not yet mature. GenomeBuilder, a commercial system to use optical mapping for analyzing structural “The most powerful algorithms right now are variation in large genomes. all open source,” says Korbel. Meanwhile, Schwartz, who is no longer affiliated with OpGen, is working on a new But scientists are learning how to make system called Nanocoding with the goal of having it be something that dedicated more sophisticated algorithms, in particular scientists can build in their own labs. With this technology, DNA is introduced into by combining multiple kinds of analysis—for nanochannels that precisely control how much the DNA is stretched and so how many example, considering paired ends in the con- nucleotides there are per given length. The DNA is treated with ‘nicking’ restriction text of split reads or read depth. “The types enzymes, which cut only one strand of DNA; the nicked strands can then be labeled with of variants that can be discovered by each dye and imaged. The system makes it possible to explore variation from about 2 kilobases method are different, so bringing all these up to entire chromosomes, says Schwartz. (For people who don’t want to build their own together is the most promising,” explains system, a company unaffiliated with Schwartz, BioNano Genomics, offers a product that Brudno. Another emerging strategy is to works using similar principles.) write programs designed to analyze reads Ultimately, Schwartz says, studies of long DNA molecules can have even better that support multiple variants. In repetitive resolution if combined with sequencing technologies performing alongside long, regions, explains Raphael, reads can have immobilized DNA strands. The reads produced are still short, he admits, but researchers many good alignments. Instead of making a will know where they came from. call for each variant individually, alternative variants will be considered simultaneously.

136 | VOL.9 NO.2 | FEBRUARY 2012 | nature methods technology feature

“We want to push the limits of getting harder methods are ready to outstrip mapping tech- (Box 1). Most needed, though, are bet- and harder variants,” he says. niques. In the latest study, about a quarter ter ways to simultaneously analyze the full Even relying on a reference is a limitation. of the genomes was inaccessible to assem- range of genetic differences that can occur If read lengths were extended to 1,000 base bly, and the inaccessible regions are prob- in a single individual. “The strength will be pairs, that would still not be long enough to ably particularly rich in structural variation, when we can go in and integrate across all uniquely map regions characterized by very says Brudno. “Until we have better reads the variation, irrespective of class, type, vari- long repeated elements. What’s more, the with much better quality, you want to use ant and frequency,” Eichler says. Structural current reference genome, itself a compila- the billions of dollars invested in the whole variation is not a nascent field, he reflects. It’s tion of several individuals, is incomplete, human genome,” he says. (In fact, the inter- just another “type of variation that has been particularly around the centromeres and play between read length, accuracy and cost under-studied. I hope in ten years, people telomeres, regions known to be both highly is a topic of continuing discussion among will just be studying the full spectrum of repetitive and highly variable between indi- experts, particularly as sequencing platforms genetic variation.” viduals. If a genetic rearrangement involves continue to improve.) But those improve- DNA missing from the reference genome, ments are coming, along with better analysis Monya Baker is technology editor for there is no way to map it. techniques. “Everyone in the field agrees that Nature and Nature Methods The obvious solution is to do away with the we should be assembling genomes,” says Hall. ([email protected]). reference genome and piece together each “What there is controversy about is how soon 1. Lupski, J.R. et al. Cell 66, 219–232 (1991). newly sequenced genome from scratch. The it will be before we can do that in a meaning- 2. Iafrate, A.J. et al. Nat. Genet. 36, 949–951 (2004). sequencing company BGI recently described ful way.” 3. Sebat, J. et al. Science 305, 525–528 (2004). a de novo assembly of two human genomes Ultimately, the goal of finding structural 4. Pang, A.W. et al. Genome Biol. 11, R52 (2010). 5. Berger, M.F. et al. Nature 470, 214–220 (2011). and reported over a quarter-million vari- variation will not be achieved by any exist- 6. Vacic, V. et al. Nature 471, 499–503 (2011). ants ranging from 1 to 23 kilobases, ing technology, says Eichler, but through 7. Mills, R.E. et al. Nature 470, 59–65 (2011). mostly insertions and deletions8. However, technological shifts to additional methods. 8. Li, Y. et al. Nat. Biotechnol. 29, 723–730 (2011). 9. Fan, H. C. et al. Nat. Biotechnol. 29, 51–57 (2011). the analysis was designed to detect homo- Researchers need better ways to scan for 10. Kitzman, J.O. et al. Nat. Biotechnol. 29, 59–63 (2011). zygous variation, and researchers are not large-scale variation and to tell which vari- 11. Teague, B. et al. Proc. Natl. Acad. Sci. USA 107, convinced that current de novo assembly ants occur together on the same chromosome 10848–10853 (2010). © 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature npg

nature methods | VOL.9 NO.2 | FEBRUARY 2012 | 137