SPAdes: a New Genome Assembler for Single-Cell Sequencing

Algorithmic Biology Lab, St. Petersburg Academic University (SPbAU)

A. Bankevich, S. Nurk, D. Antipov, A.A. Gurevich, M. Dvorkin, A. Korobeynikov, A.S. Kulikov, V.M. Lesin, S.I. Nikolenko, S. Pham, A.D. Prjibelski, A.V. Pyshkin, A.V. Sirotkin, N. Vyahhi, G. Tesler, M.A. Alekseyev, P.A. Pevzner

August 27, 2012

Outline

1 Problem setting and preparing data
    Assembly: problem and pipeline
    Error correction: BayesHammer
2 The SPAdes approach
    De Bruijn graphs, mate-pairs, and simplification
    Paired de Bruijn graphs and repeats

Single-cell sequencing

Recent years have seen the advent of single-cell sequencing as a way to sequence genomes that we previously couldn’t. It turns out that many bacteria (“dark matter of life”) cannot be sequenced by standard techniques, most often because they cannot be cloned millions of times to get large DNA samples needed for regular sequencing. This is usually due to the fact that these bacteria come in metagenomic samples (ocean samples, microbiomes of larger organisms etc.) and cannot be cultivated alone. For now, metagenomic analysis can yield more or less only individual genes. Single-cell sequencing can get reasonable sequencing coverage by amplifying a single cell (more on that later).
Important examples:
    [Woyke et al., 2009, 2010]: sequencing marine cells from metagenomic ocean samples;
    [Dalerba et al., 2011]: studying tumor heterogeneity;
    [Islam et al., 2011]: characterizing the single-cell transcriptome;
    [Xu et al., 2012; Hou et al., 2012]: single-cell sequencing of tumor cells;
    [Yoon et al., 2011]: sequencing bacteria from the human microbiome;
    [Chitsaz et al., 2011]: assembling single-cell sequencing data from marine samples.

So how does single-cell sequencing work and why do we need new assemblers for it?

The assembly problem

First, let us go back to the general assembly problem. We have a long DNA sequence but cannot read it directly. Instead, we can perform sequencing, getting many random small pieces of the DNA (reads). For popular sequencing technologies, reads are usually L ≈ 35-400 bp long. The assembly problem is to reconstruct the original sequence from the set of reads.

In many sequencing methods, we read ≈ L nucleotides from both ends of a longer DNA fragment. This leads to paired-end reads (read pairs): two reads for which we know the distance between them [Chaisson et al., 2009].
[Figure: paired-end reads. A left read and a right read placed on the genome; labeled quantities: read length (for each read), gap, distance, insert size.]

MDA technology

Recent single-cell sequencing projects use the MDA (multiple displacement amplification) technology [Dean et al., 2001; 2002].

However, MDA has a number of problems that seriously complicate assembly. First and foremost, coverage is non-uniform. This means that a coverage threshold (often used in conventional assemblers) would throw a lot of the genome away. The insert size distribution is also not nearly as nice. Finally, there are simply more errors, including lots of chimeric connections.

Computational problems of MDA:
    1 highly non-uniform coverage;
    2 noisier, more complicated insert size distribution;
    3 more chimeric connections (sometimes with large coverage);
    4 more frequent errors.

First single-cell assembler: E+V-SC (Euler+Velvet-Single-Cell assembler) [Chitsaz et al., 2011]; it captured 25% more genes than “regular” assemblers on single-cell data.

Meet SPAdes, a new assembler specifically designed for the single-cell case.
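As a rough illustration of why a fixed coverage threshold is harmful under non-uniform coverage, the sketch below compares two toy coverage profiles. The numbers are made up for illustration only and are not taken from any dataset; `fraction_discarded` is a hypothetical helper, not part of any assembler.

```python
def fraction_discarded(coverage, threshold):
    """Fraction of positions whose coverage falls below the threshold."""
    below = sum(1 for c in coverage if c < threshold)
    return below / len(coverage)


# Toy per-position coverage values (invented for illustration only).
uniform = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48]        # roughly even coverage
non_uniform = [1200, 3, 800, 1, 15, 2500, 0, 7, 400, 2]   # highly uneven coverage

threshold = 10
print(fraction_discarded(uniform, threshold))      # 0.0
print(fraction_discarded(non_uniform, threshold))  # 0.5
```

With even coverage the threshold removes nothing, while with the uneven profile the same threshold discards half of the positions even though the average coverage is far higher.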
Assembly pipeline

BayesHammer

Basic idea: break reads into k-mers and study the set of k-mers.

    ACGTGTGATGCATGATCG
    ACGTGTGATGC
     CGTGTGATGCA
      GTGTGATGCAT
       TGTGATGCATG
        GTGATGCATGA
         TGATGCATGAT
          GATGCATGATC
           ATGCATGATCG

k-mers are the basic building block in modern assemblers (de Bruijn graph). However, de Bruijn graphs are not directly applicable because of errors in reads.

The first step in assembly is to fix as many errors as we can. If we know what k-mers are correct and how to correct others, it’s easy to correct reads.

    original read:   ACGTGTGATGCATGATCG
    corrected read:  ACGTGAGATGCATGATCG

    k-mer in the read    k-mer after correction
    ACGTGTGATGC          ACGTGTGATGC
    CGTGTGATGCA          CGTGAGATGCA    (correction)
    GTGTGATGCAT          GTGAGATGCAT    (correction)
    TGTGATGCATG          TGTGATGCATG
    GTGATGCATGA          GTGATGCATGA
    TGATGCATGAT          AGATGCATGAT    (correction)
    GATGCATGATC          GATGCATGATC
    ATGCATGATCG          ATGCATGATCG

    [Figure labels: original read, correction, solid k-mers, corrected read.]

In regular (multi-cell) error correction, we can look at how many copies of a k-mer there are and assume that rare k-mers represent errors. In single-cell datasets, this idea fails due to non-uniform coverage.

[Figure: plot with a logarithmic vertical axis (10^0 to 10^4) and a horizontal axis from 0 to 4,000 KBases.]

Basic idea: a k-mer is covered many times (even if non-uniformly).
Thus, if we look at the set of similar k-mers we can find out what to do because nucleotides of the central k-mer will be better covered than others. Problem: there may be several centers in such a cluster.

    ATGTGTGATGC
    ATGTGGGATGC
    ACGTGGGATGC
    ACGTGGGATGC
    ACGTGTGATGC
    ACGTGGGATGC
    ACGTGTGATGC
    ACGTGTGATGC
    ATGTGTGATGC
    ATGTGTGATGC
    ACGTGAGATGC
    ACGTGTGATAC
    ACGTGTGATGC
    ACGTGTGATGC
    ACGTGGGATGC

In Hammer, Medvedev et al. (2011) constructed and roughly clustered the Hamming graph of k-mers; BayesHammer uses probabilistic subclustering to get multiple centers in a cluster.

    Reads      k-mers
    ACGTGTG    ACGTG  CGTGT  GTGTG
    ACATGTG    ACATG  CATGT  ATGTG
    ACCTGTC    ACCTG  CCTGT  CTGTC

    [Figure: the Hamming graph whose vertices are the nine k-mers above.]

As a result, BayesHammer corrects single-cell datasets much better than other tools.

De Bruijn graphs

When there are relatively few k-mers left, we can begin global processing (the actual assembly). Basic idea: de Bruijn graph [Idury & Waterman, 1995; Pevzner et al., 2001].
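A minimal Python sketch of the de Bruijn graph idea, reusing the three toy reads from the Hamming graph figure with k = 5. It assumes the common convention in which vertices are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. The function names are illustrative and this is not SPAdes code; edge multiplicities and reverse complements are ignored.

```python
from collections import defaultdict


def kmers(read, k):
    """All length-k substrings of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer adds one edge: prefix (first k-1 letters) -> suffix
    (last k-1 letters). Using sets drops edge multiplicities.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads taken from the Hamming graph figure.
reads = ["ACGTGTG", "ACATGTG", "ACCTGTC"]
graph = de_bruijn(reads, k=5)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy graph the paths of the first two reads converge at the vertex TGTG, while the third read ends at TGTC.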
