PART C

BIOLOGICAL FEATURE EXTRACTION

CHAPTER 10

ALGORITHMS AND DATA STRUCTURES FOR NEXT-GENERATION SEQUENCES

FRANCESCO VEZZI,1,2 GIUSEPPE LANCIA,1 and ALBERTO POLICRITI1,2
1Department of Mathematics and Informatics and 2Institute of Applied Genomics, University of Udine, Udine, Italy

The first genome was sequenced in 1975 [87], and since that first success sequencing technologies have improved significantly, with a strong acceleration in the last few years. Today these technologies allow us to read (huge amounts of) contiguous DNA stretches and are the key to reconstructing the genome sequence of a new species or of an individual within a population, or to studying the expression levels of single cell lines. Even though a number of different applications use sequencing data today, the "highest" sequencing goal is always the reconstruction of the complete genome sequence.

The success in determining the first human genome sequence has encouraged many groups to tackle the problem of reconstructing the codebook of other species, including microbial, mammalian, and plant genomes. Despite such efforts in sequencing new organisms, most species in the biosphere have not been sequenced yet. There are many reasons for this, but the two main causes are the cost of a sequencing project and the difficulty of building a reliable assembly.

Until a few years ago, Sanger sequencing was the only established technology available. This method has been used to produce many complete genomes of microbes, vertebrates (e.g., human [96]), and plants (e.g., grapevine [37]). Roughly speaking, in order to sequence an organism, it is necessary to extract the DNA, break it into small fragments, and read their ends. As a final result one obtains a set of sequences, usually named reads, that may be assembled in order to reconstruct the original genome sequence or searched within a database of an already reconstructed genome.

Reads are randomly sampled along the DNA sequence, so in order to be sure that each base in the genome is present in at least one read we have to oversample the genome. Given a set of reads, the coverage is the sum of all read lengths divided by the genome length: if this ratio is C, we say that the genome has been sequenced with depth of coverage C, or C times (C×).

One of the first and most important (practical) algorithmic insights in genome assembly (see [96]) was the observation that using reads coming from the two ends of a single sequence, named the paired reads of an insert* of known estimated length, greatly simplifies the downstream assembly process.

*The name "insert" is used because the sequence providing the reads is inserted into a bacterial genome to be reproduced in a sufficient number of copies.

Recently, new sequencing methods have emerged [59]. In particular, the commercially available technologies include pyrosequencing (454 [1]), sequencing by synthesis (Illumina [3]), and sequencing by ligation (SOLiD [2]). Compared to the traditional Sanger method, these technologies function with significantly lower production costs and much higher throughput.
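As a concrete illustration of the depth-of-coverage notion introduced above, the following minimal Python sketch computes C from a list of read lengths and the genome length; the function name and the example numbers are illustrative choices of ours, not taken from the chapter.

```python
def depth_of_coverage(read_lengths, genome_length):
    """Coverage C = (sum of all read lengths) / (genome length)."""
    return sum(read_lengths) / genome_length

# Example: three million reads of 100 bp over a 10-Mbp genome give 30x coverage.
print(depth_of_coverage([100] * 3_000_000, 10_000_000))  # -> 30.0, i.e., 30x
```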
These technological advances have significantly reduced the cost of several applications having sequencing or resequencing as an intermediate step.

The computationally significant aspect of these new technologies is that the reads produced are much shorter than traditional Sanger reads. At present, the Illumina HiSeq 2000, the latest Illumina instrument available on the market, is able to produce reads of length 150 bp and generates more than 200 billion bases of output per run; the SOLiD 4 System produces paired reads of length 50 bp; while the Roche 454 GS FLX Titanium has the lowest throughput but is able to produce single reads of length 400 bp and paired reads of length 200 bp. Other technologies are now approaching the market (Polonator, Helicos BioSciences, Pacific BioSciences, and Oxford Nanopore Technologies [66]), promising higher throughput and lower costs.

At the beginning of the next-generation sequencing (NGS) era, as a consequence of the extremely short lengths of both reads and inserts, NGS data were used mainly in (several) resequencing projects [9, 23, 42, 102]. A resequencing project is based on the availability of a reference sequence (usually a fairly complete genome sequence) against which short sequences can be aligned using a short-read aligner [48, 52, 71, 81]. Resequencing projects allow the reconstruction of the genetic information in similar organisms and the identification of differences among individuals of the same species. The most important such differences are single-nucleotide polymorphisms (SNPs) [53, 90], copy number variations (CNVs) [17, 18, 32], and insertion/deletion events (indels) [62].

Despite the short length of reads, and encouraged by technology improvements, many groups have started to use NGS data to reconstruct new genomes from scratch. De novo assembly is in general a difficult problem, made even more difficult not only by short read lengths [69] but also by the difficulty of obtaining reliable sequencing-error and read-distribution models. Many tools have been proposed (see, e.g., Velvet [104], ALLPATHS [56], and ABySS [92], to mention just a few of the available ones), but the results achievable to date are far from those of the Sanger-era assemblers (e.g., PCAP [34]).

The unbridled spread of second-generation sequencing machines has been accompanied by a (natural) effort toward producing computational instruments capable of analyzing the large amounts of newly available data. The aim of this chapter is to present and (comparatively) discuss the data structures that have been proposed in the context of NGS data processing. In particular, we will concentrate our attention on two widely studied areas: data structures for alignment and de novo assembly.

The chapter is divided into two main sections. In the first we will classify algorithms and data structures specifically designed for the alignment of short nucleotide sequences produced by NGS instruments against a database. We will propose a division into categories and describe some of the most successful tools proposed so far. In the second part we will deal with de novo assembly. De novo assembly is a computationally challenging problem with NP-complete versions easily obtainable from its real-world definition.
In this part we will classify the different de novo strategies and describe the available tools. Moreover, we will focus our attention on the limits of the currently most used tools, discussing when such limits can be traced back to the data structures employed and when, instead, they are a direct consequence of the kind of data processed.

10.1 ALIGNERS

One of the main applications of string matching is computational biology. A DNA sequence can be seen as a string over the alphabet Σ = {A, C, G, T}. Given a reference genome sequence, we are interested in searching for (aligning) different sequences (reads) of various lengths. When aligning such reads against another DNA sequence, we must consider both errors due to the sequencer and intrinsic differences due to the variability between individuals of the same species. For these reasons, all programs aligning reads against a reference sequence must deal (at least) with mismatches [5, 41].

As a general rule, tools used to align Sanger reads (see [5]) are not suitable (that is, not efficient enough) to align next-generation sequencer output, essentially because of the sheer amount of data to handle. (The advent of next-generation sequencers moved the bottleneck from data production to data analysis.) Therefore, in order to keep pace with data production, new algorithms and data structures have been proposed in recent years.

String matching can be divided into two main areas: exact string matching and approximate string matching. When doing approximate string matching, we need to employ a distance metric between strings. The most commonly used metrics are the edit distance (or Levenshtein distance) [47] and the Hamming distance [29]. Approximate string matching at distance k under the edit metric is called the k-difference problem, while under the Hamming metric it is called the k-mismatch problem. In many practical applications like short-sequence alignment, we are interested in finding the best occurrence of the pattern with at most k mismatches; we will refer to this as the best-k-difference/mismatch problem. Recently, a flurry of papers presenting new indexing algorithms to solve this problem has appeared [46, 50, 51]. While all aligners allow the user to specify constraints on the Hamming distance, only some of them also support the edit distance.

All aligners designed for NGS use some form of index to speed up the search phase. Aligners usually build an index over the text, but solutions that index only the reads, or both the reads and the text, are available. According to [49], we can cluster existing alignment algorithms into two main classes: algorithms based on hash tables and algorithms based on suffix-based data structures. A third category is formed by algorithms based on merge sorting but, to the best of our knowledge, the only available solution belonging to this category is [57].
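To make the hash-table class concrete, here is a minimal Python sketch of the seed-and-verify idea underlying many such aligners: the k-mers of the reference are stored in a hash table, exact k-mer hits seed candidate positions, and each candidate is verified under the Hamming metric, yielding a toy solution of the best-k-mismatch problem. The function names and the non-overlapping-seed policy are our own illustrative choices, not taken from any of the tools cited above; production aligners add spaced seeds, compressed indexes, and gapped (edit-distance) verification.

```python
from collections import defaultdict

def build_kmer_index(text, k):
    """Hash-table index: map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def hamming(a, b, limit):
    """Hamming distance between equal-length strings, stopping early above 'limit'."""
    d = 0
    for x, y in zip(a, b):
        if x != y:
            d += 1
            if d > limit:
                return d
    return d

def best_k_mismatch(read, text, index, k, max_mismatches):
    """Seed-and-verify: exact k-mer seeds locate candidate positions,
    which are then verified under the Hamming metric; the best hit wins."""
    best = None  # (mismatches, position)
    for s in range(0, len(read) - k + 1, k):           # non-overlapping seeds
        for hit in index.get(read[s:s + k], ()):
            pos = hit - s                               # implied read start in text
            if pos < 0 or pos + len(read) > len(text):
                continue
            d = hamming(read, text[pos:pos + len(read)], max_mismatches)
            if d <= max_mismatches and (best is None or d < best[0]):
                best = (d, pos)
    return best

reference = "ACGTACGTTGACCAGTACGT"
idx = build_kmer_index(reference, k=4)
print(best_k_mismatch("TGACCTGT", reference, idx, k=4, max_mismatches=2))  # -> (1, 8)
```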
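For the second class, based on suffix structures, the following sketch (again purely illustrative, assuming a plain, uncompressed suffix array; real tools typically rely on compressed indexes such as the FM-index) shows how sorting all suffixes of the reference lets every exact occurrence of a pattern be found by binary search, since the occurrences form a contiguous interval of the suffix array.

```python
def build_suffix_array(text):
    """Naive construction (O(n^2 log n)), enough to illustrate the idea:
    sort the starting positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(pattern, text, sa):
    """All exact occurrences of 'pattern' correspond to a contiguous
    interval of the suffix array, located by two binary searches."""
    m = len(pattern)

    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first, last = bound(strict=False), bound(strict=True)
    return sorted(sa[first:last])

reference = "ACGTACGTTGACCAGTACGT"
sa = build_suffix_array(reference)
print(find_all("ACGT", reference, sa))  # -> [0, 4, 16]
```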