Comparative genome assembly

Mihai Pop, Adam Phillippy, Arthur L. Delcher and Steven L. Salzberg

Date received (in revised form): 21st June 2004

Abstract
One of the most complex and computationally intensive tasks of genome sequence analysis is genome assembly. Even today, few centres have the resources, in both software and hardware, to assemble a genome from the thousands or millions of individual sequences generated in a whole-genome shotgun sequencing project. With the rapid growth in the number of sequenced genomes has come an increase in the number of organisms for which two or more closely related species have been sequenced. This has created the possibility of building a comparative genome assembly algorithm, which can assemble a newly sequenced genome by mapping it onto a reference genome. We describe here a novel algorithm for comparative genome assembly that can accurately assemble a typical bacterial genome in less than four minutes on a standard desktop computer. The software is available as part of the open-source AMOS project.

Mihai Pop is a Bioinformatics Scientist at the Institute for Genomic Research. He earned his PhD in Computer Science from Johns Hopkins University in 2000. His research interests include genome assembly and comparative genomics.

Adam Phillippy is a Bioinformatics Software Engineer at The Institute for Genomic Research. His research interests include comparative genomics, whole genome alignment and sequence assembly.

Arthur L. Delcher is a Bioinformatics Scientist at The Institute for Genomic Research and Professor Emeritus of Computer Science at Loyola College in Maryland.
He earned his PhD in Computer Science from Johns Hopkins University in 1989 and was a member of the Informatics Research group at Celera Genomics that developed the human genome assembly program.

Steven L. Salzberg, PhD is the Senior Director of Bioinformatics at TIGR and is a Research Professor of Computer Science and Biology at Johns Hopkins University.

Keywords: assembly, shotgun sequencing, comparative genomics, open source

Mihai Pop, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
Tel: +1 301 795 7000
Fax: +1 301 838 0208
E-mail: [email protected]

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 3. 237–248. SEPTEMBER 2004

INTRODUCTION
Most large-scale genome sequencing projects today employ the whole-genome shotgun (WGS) sequencing strategy, in which a genome is shattered into numerous small fragments, and the fragments are then sequenced from both ends. The resulting sequences, ranging from 650 to 850 base pairs (bp) in length using the latest sequencing technology, must then be assembled to reconstruct the chromosomes of the target organism. As genome sequencing has become more efficient, the number of organisms sequenced by this strategy has rapidly increased. In many cases, little is known about a genome prior to sequencing; sometimes even the genome size is only a rough estimate. Genome assemblers have been relied on not only to reconstruct the genome, but also to help answer such basic questions as how many chromosomes an organism has.

In addition to sequencing new and completely unknown species, the scientific community has recognised in recent years the value of sequencing two or more very similar species (or strains) from the same genus. Perhaps the most prominent example of this is in biodefence, where multiple strains of Bacillus anthracis are currently being sequenced in order to create precise genotyping information that can be used for forensic purposes [1]. Sequencing projects for several major human pathogens, including Mycobacterium tuberculosis, Streptococcus pneumoniae and Staphylococcus aureus (to name but a few), are targeting multiple strains of these bacteria in order to understand virulence, drug resistance and other phenotypic differences between strains. On a larger scale, the mouse, rat and chimpanzee genomes are all being sequenced and mapped to the human genome to better understand human biology, and multiple Drosophila species are being sequenced and mapped onto one another.

DNA sequence assembly algorithms have generally followed an algorithmic strategy known as overlap-layout-consensus [2]. Although many distinct systems have been developed – among them the popular program phrap [3] – all begin by comparing all the shotgun sequences ('reads') to one another and computing which ones overlap each other (the overlap step). To avoid the quadratic time requirement of a brute-force approach, most assemblers use some form of a hashing strategy (or similar indexing technique) based on short oligomers in order to identify those reads that are likely to overlap, and then run a more costly alignment algorithm on any pairs of reads that have a significant intersection in the hash table. Following the overlap step, the layout step positions the reads precisely with respect to one another, producing a multiple alignment of all reads. This multi-alignment is then used to produce the consensus, ie the final DNA sequence. Some algorithms [4–6] also include an additional step in which contigs are joined together using the linking information from the paired-end reads, creating larger structures known as scaffolds. This scaffolding step can also be run as a separate program after the initial assembly [7].

In the AMOS Comparative Assembler (AMOS-Cmp), we take a fundamentally different approach: the overlap step is skipped entirely. Instead, reads are aligned to the reference genome using a modified version of the MUMmer algorithm [8]. This strategy can be called alignment-layout-consensus, because it produces a layout directly from the alignment. In fact, AMOS goes a step beyond many conventional assembly algorithms, because information is used from the paired-end reads in the alignment step. As described below, this helps in disambiguating the placement of reads that have matches to more than one place in the reference genome.

METHODS
All assembly algorithms attempt to recover the information lost through the WGS shearing process; in particular, they try to identify the original placement of the shotgun sequence fragments along the genome. This problem encompasses most of the complexity of shotgun sequence assembly, while the remaining tasks, such as computing the precise multiple alignment of the reads or identifying polymorphisms, are somewhat easier. Our comparative assembler uses the complete sequence of a closely related organism to determine the relative placement of the reads. Thus the overlap and layout stages of a typical assembler are replaced with a module performing the alignment to a reference genome. The remaining assembly stages – consensus generation and scaffolding – remain the same as for an overlap-layout-consensus algorithm. Throughout the remainder of this paper, the genome being sequenced is referred to as the target genome, the goal being to obtain an assembly of this genome using a reference genome as a template.

The differences between the target genome and the reference genome, coupled with the presence of repeats in the data, create the biggest challenges for a comparative assembler. Even when the two genomes belong to the same species, the differences between them can be significant. For example, Figure 1 shows the alignment between two strains of Streptococcus agalactiae [9,10]. It is immediately clear that the strains differ by multiple insertions and deletions (indels, identified by breaks in the diagonal); furthermore, a cluster of repeats can be seen in the first 500 kilobases (Kbp) of the two genomes.

[Figure 1: Alignment between two strains of Streptococcus agalactiae. The lines represent near-exact matches. The positions of the matches on the two genomes are shown along the axes (horizontal axis: S. agalactiae 2603V/R; vertical axis: S. agalactiae NEM316).]

The algorithm employed by AMOS-Cmp to overcome these challenges has five major steps.

• Read alignment. Each shotgun read is aligned to the reference genome using MUMmer [8,11,12]. Repetitive sequences and polymorphisms between the target and the reference cause some reads to align in a non-contiguous fashion. A modified version of the Longest Increasing Subsequence (LIS) algorithm [13] is used in order to generate chains of mutually consistent matches between each read and the reference. In addition to the longest consistent chain, a set of near-optimal chains is also computed in order to identify reads anchored in repeats. Those reads that are ambiguously placed in the genome (one or more chains are within 2 per cent identity from the best placement) are classified as repetitive and resolved later (in some cases) by using mate-pair information.

• Repeat resolution. For each read that cannot be unambiguously placed along the reference genome, a three-stage process is employed to disambiguate its placement. First, it is checked to see if the paired-end sequence (the 'mate') is uniquely anchored in the genome. If it is, the read is placed in the location that satisfies the constraints imposed by the mate-pair information. Second, if a read and its mate are both ambiguously placed, an attempt is [...] random. To illustrate how well these steps work, the original shotgun reads from the S. agalactiae 2603 [9] WGS project were aligned to the final, finished chromosome. Out of a total of 26,099 reads, 25,310 were uniquely anchored, 314 were placed with the help of a uniquely anchored mate, 22 were placed by mate-pair constraints, and the remaining 442 had to be placed in a randomly chosen copy of a repeat.

• Layout refinement. Indels and rearrangements between the target and the reference genomes complicate the mapping of the reads to the reference.
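
As a rough illustration of the read alignment step, the sketch below chains exact matches between one read and the reference using a simple quadratic, weighted variant of the LIS recurrence. It is not the AMOS-Cmp code: the `Match` tuple and the functions `best_chain` and `is_ambiguous` are hypothetical names, chains are scored by total matched length (an assumption), and the 2 per cent rule is interpreted here as a gap of at most two percentage points of identity between the best and the second-best placement.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    read_pos: int  # start of the exact match in the read
    ref_pos: int   # start of the exact match in the reference
    length: int    # length of the exact match


def best_chain(matches: List[Match]) -> Tuple[int, List[Match]]:
    """Return the heaviest chain of matches that advance in both sequences.

    Hypothetical helper: scores a chain by its total matched length.
    """
    if not matches:
        return 0, []
    ordered = sorted(matches, key=lambda m: (m.read_pos, m.ref_pos))
    score = [m.length for m in ordered]   # best chain score ending at each match
    prev = [-1] * len(ordered)            # back-pointers for chain recovery
    for j, cur in enumerate(ordered):
        for i in range(j):
            before = ordered[i]
            # Two matches are consistent if the earlier one ends before the
            # later one starts, in the read and in the reference.
            consistent = (before.read_pos + before.length <= cur.read_pos
                          and before.ref_pos + before.length <= cur.ref_pos)
            if consistent and score[i] + cur.length > score[j]:
                score[j] = score[i] + cur.length
                prev[j] = i
    end = max(range(len(ordered)), key=score.__getitem__)
    chain: List[Match] = []
    k = end
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return score[end], chain


def is_ambiguous(identities: List[float], max_gap: float = 2.0) -> bool:
    """True if a second placement lies within `max_gap` identity points of the best.

    Assumption: `identities` holds one per cent identity value per candidate placement.
    """
    if len(identities) < 2:
        return False
    ranked = sorted(identities, reverse=True)
    return ranked[0] - ranked[1] <= max_gap
```

In this sketch a read whose placements have identities of 98.5 and 97.0 per cent would be flagged as repetitive, while one with 99.0 and 90.0 per cent would be treated as uniquely anchored.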

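The first stage of repeat resolution, placing an ambiguous read by means of a uniquely anchored mate, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `place_with_mate` and its parameters are hypothetical, positions are plain linear coordinates (orientation and circular chromosomes are ignored), and the library constraint is reduced to an allowed distance range. The intermediate stage for pairs in which both reads are ambiguous is not reproduced, because its description is incomplete in the text above; the final fallback picks a candidate at random.

```python
import random
from typing import List, Optional


def place_with_mate(candidates: List[int],
                    mate_pos: Optional[int],
                    min_dist: int,
                    max_dist: int,
                    rng: Optional[random.Random] = None) -> Optional[int]:
    """Choose a reference position for a read that matches several places.

    candidates: possible reference positions of the ambiguous read.
    mate_pos:   position of the mate if it is uniquely anchored, else None.
    min_dist, max_dist: assumed allowed separation between read and mate.
    """
    if not candidates:
        return None
    pool = candidates
    if mate_pos is not None:
        # Keep only the placements compatible with the mate-pair distance.
        compatible = [c for c in candidates
                      if min_dist <= abs(c - mate_pos) <= max_dist]
        if compatible:
            pool = compatible
    if len(pool) == 1:
        return pool[0]
    # Fallback: no unique answer, so pick one copy at random.
    rng = rng or random.Random()
    return rng.choice(pool)


# Example: three repeat copies, mate anchored 2,500 bp from the second one.
chosen = place_with_mate([1000, 50000, 120000], 52500, 2000, 3000)
```

Passing a seeded `random.Random` as `rng` makes the random fallback reproducible, which is convenient when comparing assemblies across runs.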