Sequence Assembly and Alignment Tech Guide
Total Page:16
File Type:pdf, Size:1020Kb
www.roche-applied-science.com Genome Sequencer FLX System Longer sequencing reads mean more applications. In 2005, the Genome Sequencer 20 System was launched ■ Read length: 100 bases ■ 20 million bases in less than 5 hours In 2007, the Genome Sequencer FLX System was launched ■ Read length: 250 to 300 bases ■ 100 million bases in less than 8 hours Available in 2008, the Genome Sequencer FLX with improved chemistries ■ Read length: >400 bases ■ 1 billion bases in less than 24 hours More applications lead to more publications Sequencing-by-Synthesis: Using an enzymatically Proven performance with an expanding list of applications and coupled reaction, light is generated when individual more than 80 peer-reviewed publications. nucleotides are incorporated. Hundreds of thousands of Visit www.genome-sequencing.com to learn more. individual DNA fragments are sequenced in parallel. Roche Diagnostics Roche Applied Science For life science research only. Not for use in diagnostic procedures. Indianapolis, Indiana 454 and GENOME SEQUENCER are trademarks of 454 Life Sciences Corporation, Branford, CT, USA. © 2007 Roche Diagnostics. All rights reserved. Table of contents Letter from the Editor . .5 Index of Experts . .5 Q1: How do you decide whether to take a global, local, or hybrid approach to alignment? . 6 Q2: What scoring function do you use? Do you allow gaps? . 9 Q3: How do you choose which assembly algorithm to use? . 12 Q4: What approach do you use to deal with very short reads? . 14 Q5: How do you ensure high-quality assemblies? What's your process for detecting errors? . 15 Q6: How do you close the gaps? At what point is a genome considered "finished"? . 17 List of Resources . 21 Evolving? Don’t change jobs without us. E-mail your updated address information to [email protected]. Please include the subscriber number appearing directly above your name on the address label. GenomeWeb Intelligence Network D N A S T A R ArrayStar® shines at straightforward If your studies involve gene expression, ArrayStar is the only microarray microarray software that helps you do so much analysis so easily. Why shuffle between a confusing assortment of complex, analysis expensive, hard-to-learn software when you can quickly and easily perform the analyses you want with a single user-friendly program? And ArrayStar is as straightforward to own as it is to use. When you purchase ArrayStar, the perpetual license makes it yours forever. Your research benefits from these features! • Compatible with Affymetrix and NimbleGen normalized data files • Import wizard supports .txt files • Conducts a wide range of analyses and visualizations to evaluate expression patterns • Identifies genes with similar expression patterns • Clusters images to identify groups of co-regulated genes • Plots gene expression levels to A selection of genes highlighted within simultaneous view relationships with Line Graphs multiple views of ArrayStar and Thumbnails • Displays expression levels of many genes across multiple experiments with a Heat Map Affymetrix is a trademark of Affymetrix, Inc. NimbleGen is a trademark of NimbleGen Systems, Inc. To receive a 30-day fully-functional free trial version visit us at www.dnastar.com/arraystar For more information, or to order ArrayStar today, please call toll-free in the US, 1.866.511.5090 or in the UK, 0.808.234.1643 DNASTAR, Inc • Madison, WI, USA • www.dnastar.com Toll free: 866-511-5090 • 608-258-7420 • [email protected] www.dnastar.com/arraystar Letter from the editor At long last, you’ve got your sequence choices and decisions, Genome Technology has data. But now you have to figure out rounded up experts in the sequence alignment and what it means. To make heads or tails assembly fields. These experienced veterans give their of all that information, you might want advice on how to approach sequence alignment, to align and then assemble your choose an assembly algorithm, and when to declare genome of choice. Gone, though, are yourself "done!" with putting that genome's pieces the days of manually post-processing together. Without further ado, but with much thanks the snippets of sequence. The flurry of activity as more to the following contributors, here are their thoughts and more genomes are sequenced (Dog! Honey bee! on the questions facing researchers in sequence Shark!) has brought more advanced and accurate ways alignment and assembly. of aligning and assembling whatever your favorite — Ciara Curtin genome might be. To help you slog through all the Index of experts Genome Technology would like to thank the following contributors for taking the time to respond to the questions in this tech guide. Stephen Altschul Sudhir Kumar Senior Investigator Professor National Center for Biotechnology Arizona State University Information Elliott Margulies Serafim Batzoglou Investigator Assistant Professor National Human Genome Research Stanford University Institute Gustavo Glusman Darren Platt Senior Research Scientist Informatics Department Head Systems Biology Institute DOE Joint Genome Institute Xiaoqiu Huang Mihai Pop Associate Professor Assistant Professor Iowa State University University of Maryland Genome Technology Sequence Assembly and Alignment 5 How do you decide whether to Questiontake a global, local, or hybrid approach to sequence alignment? If you know beforehand that two or more sequences problems such as alignment of orthologous proteins are globally related, then it is almost always or overlap of reads in DNA sequencing projects. advantageous to use a global alignment algorithm to Local alignment is best used for searching a database compare them. This will cut down on noise from for a sequence of interest. Hybrid approaches are random local similarities, and will force the needed when aligning large genomic regions alignment of related sequence regions that a local between different species. algorithm might leave unaligned. However, if it is not — Serafim Batzoglou known whether the sequences being compared are globally related, or indeed whether they are related Most sequence assembly tools start by identifying at all, then a local alignment algorithm is to be overlapping sequence reads and aligning them. For preferred. Because most database searches are example, Phrap uses cross match as its alignment exploratory in this sense, almost all popular sequence engine. Arachne implements its own fast heuristic to database search programs, identify overlapping reads, by such as FASTA and Blast, indexing subsequences employ local alignment “Almost all popular sequence (words); the optimal align- methods. ment is obtained using A few years ago, Dr. Yi- database search programs, such dynamic programming in a Kuo Yu described what he as FASTA and Blast, employ local process conceptually related called a "hybrid" alignment to the FASTA algorithm. In method, which was a alignment methods.” the context of sequence combination of the Hidden assembly, therefore, the — Stephen Altschul Markov Model approach to choice of alignment algor- local sequence alignment ithm is out of the user's (which considers all possible hands, having already been paths through a path graph) and the Smith- addressed by the assembly tool developer. Waterman algorithm (which seeks only the optimal — Gustavo Glusman local alignment). This hybrid approach, which is a local alignment method, has significant advantages Start with a local alignment program named SIM. If from a statistical perspective, and perhaps the alignment covers most of the sequences, use a advantages from the perspective of sensitivity as global alignment program named GAP. If the well. However, there is so far no widely available sequences contain similar regions separated by software implementing this approach. different regions, use a pair-wise alignment program — Stephen Altschul named GAP3 or a multiple alignment program named MAP2. The global alignment model is appropriate in specific — Xiaoqiu Huang 6 Sequence Assembly and Alignment Genome Technology In global alignments, sequences are aligned The overarching answer is usually: do the right thing. beginning to end, with gaps inserted whenever If you expect the sequence comes from a gene, then needed. This procedure is appropriate for protein it should align in full over the coding area that you (amino) sequences of individual genes, ribosomal want put it in. So that typically means global RNA sequences, and short stretches of the genomic alignment. In assembly, that's what you want. You sequence. For genomic DNA, it is always necessary to want every read to align along the genome. In account for medium and large-scale rearrangements practice, you might put low-quality sequence off the in addition to large sequence insertions and ends and make sure it actually aligns properly. deletions, which necessitates the building of local — Darren Platt alignments. For aligning DNA sequences of genome segments coding for In general I prefer to use a proteins (protein-coding local alignment procedure, genes), one will often need followed by post-processing to take a hybrid approach in in case I need to place which the first step is to “The overarching answer is additional constraints on the align the translated protein usually: do the right thing.” alignment. For example, sequences by taking a global when mapping a sequenc- approach, which is followed — Darren Platt ing read to a reference by the adjustment of exon genome, I try to ensure the DNA sequences to reflect alignment spans the entire the protein alignment, and read (i.e. a global alignment then the use of local from the point of view of procedure for aligning homologous intron the read). Requiring a global alignment for the reads, sequences. however, would miss polymorphisms between the — Sudhir Kumar DNA being sequenced and the available reference. — Mihai Pop I typically align genomic sequences, so a hybrid approach works best for me. A global alignment alone doesn't work when the parts of the genome you're trying to align are in different segments and all shuffled up, so you need to first pre-process the data to figure out which sequences should be aligning in the first place.