How Do Biologists Assemble Genomes? Graph Algorithms

Phillip Compeau and Pavel Pevzner

Bioinformatics Algorithms: An Active Learning Approach

©2015 by Compeau and Pevzner. All rights reserved.

The Newspaper Problem

Outline

• What Is Genome Sequencing?
• The String Reconstruction Problem
• String Reconstruction as a Walk in the Overlap Graph
• Another Graph for String Reconstruction
• The Seven Bridges of Königsberg
• Euler’s Theorem
• De Bruijn Graphs Face Harsh Realities of Assembly

What’s inside the cell nucleus?

Contents of the Nucleus

• The nucleus contains chromosomes.

• Humans have 23 pairs of chromosomes (one in each pair comes from each parent).

• But what are chromosomes made of?

DNA: The Building Block of Life

• One more zoom, and we reach the molecular level.

• Early 1950s: Researchers start uncovering properties of the chromosomal substance now called “deoxyribonucleic acid” (DNA).

• 1953: Watson and Crick publish the “double helix” structure of DNA.

Molecular Structure of DNA

[Figures: DNA’s double helix; DNA’s molecular structure.]

• Nucleotide: Half of one “rung” of DNA.

• Four choices for the base of a nucleotide:
1. Adenine (A)
2. Cytosine (C)
3. Guanine (G), which bonds to C
4. Thymine (T), which bonds to A

Why is DNA Important?

Central Dogma of Molecular Biology: DNA is transcribed into RNA, which is then translated into protein (chain of amino acids).

Courtesy: Rachel Raynes

Genome: A Long DNA “Book”

• Genome: The nucleotide sequence read down one side of an organism’s chromosomal DNA.

…CCGTAGTCGCATGGAACAGTATACGAGACAGTACAGATACGATACGATACGATCATTAACCGAGAGTACCAGATTCCAGATCATAC
TTACGCTTAGCTACGGACGTACGATACCCAGATTACGATCCATATAGATATAACCGGTGTGTCTTGCTAATACGTAACGGGGTGCCT
TCGATAGGTCAGAATACCAGATCTCTCGATCTTCTTACAGATACTACGATCCCCAGATACTACCCCTACTGACCCATCGTACGGGTA
CTACTACGGATATGATACCGATGTAGAGGGATCCATATATCCCGAGACGTCTCGCGCATAAGATCATCGTCTAGATACACGTACGTA
CTAGACTAGCGTATGCCTCTTATGATCGTCCCGATCGAGTCGCGTGCTCAGAAAAGCTACGATACGATACCCGATACTAGACCATAG…

• A human genome has about 3 billion nucleotides.

• Biologists want to be able to read this book. This is what it means to sequence a genome.

We Share 99.9% of Our Genomes

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA TCGACTACAGATAAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA TCGACTACAGATGAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Species vs. Individual Sequencing

Species Sequencing: What is the “consensus” genome of an entire species?

Individual Sequencing: What makes an individual unique within their species?

Why Sequence a Species’s Genome?

[Figure 1 from Hug et al., 2016 (Nature Microbiology, DOI: 10.1038/NMICROBIOL.2016.48; © 2016 Macmillan Publishers Limited): a current view of the tree of life, encompassing the total diversity represented by sequenced genomes. The tree includes 92 named bacterial phyla, 26 archaeal phyla and all five of the Eukaryotic supergroups.]

Why Sequence an Individual’s Genome?

Personalized Medicine: Tailoring medical treatment to the individual based on their genetics.

2010: First person whose life was saved due to genome sequencing.

Brief History of Genome Sequencing

1977: Gilbert and Sanger develop sequencing techniques independently.

1980: They share the Nobel Prize.

The resulting sequencing methods cost $1 per nucleotide.

Pictured: Walter Gilbert, Frederick Sanger

1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.

Francis Collins

1997: Craig Venter founds Celera Genomics, a private firm with the same goal.

Craig Venter

2000: A draft of the human genome is simultaneously completed by the Human Genome Consortium (public) and Celera Genomics (private).

2000s: The race is on to sequence other mammalian genomes.

2008: US passes the Genetic Information Nondiscrimination Act.

2013: UK declares public funding to sequence 100,000 human genomes.

2015: Illumina reduces the cost of sequencing an individual human genome to $1,000.

The Future of Genomics

What Makes Genome Sequencing Hard?

Sequencing machines can only read short pieces of DNA (~250 nucleotides long), called reads.

General Idea of Genome Assembly

1. Start with multiple identical copies of a genome.
2. Shatter the genome into reads.
3. Sequence the reads: AGAATATCA, TGAGAATAT, GAGAATATC.
4. Assemble the genome using overlapping reads:

TGAGAATAT
 GAGAATATC
  AGAATATCA
...TGAGAATATCA...

STOP and Think: What does this remind you of?

Complications in Genome Assembly

1. DNA is double-stranded (and may consist of multiple chromosomes).

2. Reads have imperfect coverage of the underlying genome.

3. Sequencing machines are error-prone.

Assumptions for Genome Assembly

1. DNA is single-stranded (and consists of a single chromosome).

2. Reads have perfect coverage of the underlying genome: every k-mer in the genome is present.

3. Sequencing machines are error-free.

k-mer Composition

The k-mer composition of a string Text, denoted Composition_k(Text), is the collection of all k-mer substrings of Text (including repeats).

Composition_3(TATGGGGTGC) = {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}

String Reconstruction Problem

String Reconstruction Problem: Reconstruct a string from its k-mer composition.
• Input: An integer k and a collection Patterns of k-mers.
• Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

Exercise Break: Reconstruct a string having the 3-mer composition {AAT, ATG, GTT, TAA, TGT}. What algorithm did you use?

HOW DO WE ASSEMBLE GENOMES?

Solving the String Composition Problem is a straightforward exercise, but in order to model genome assembly, we need to solve its inverse problem.
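As a minimal sketch of the String Composition Problem mentioned above, the following Python function lists every k-mer substring of a string, repeats included. The function name `composition` and the choice to return the k-mers in lexicographic order are illustrative assumptions, not part of the original slides.

```python
def composition(text: str, k: int) -> list[str]:
    """Return all k-mer substrings of text (repeats kept), in lexicographic order."""
    # A string of length n has n - k + 1 k-mers, one starting at each position.
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


# The 3-mer composition of the example string from the slides.
print(composition("TATGGGGTGC", 3))
```

Sorting discards the order in which the k-mers occur in the genome, which is the information an assembler has to recover.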

String Reconstruction Problem: Reconstruct a string from its k-mer composition.

Input: An integer k and a collection Patterns of k-mers. Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

Before we ask you to solve the String Reconstruction Problem, let’s consider the following example of a 3-mer composition:

AAT ATG GTT TAA TGT

The most natural way to solve the String Reconstruction Problem is to mimic the solution of the Newspaper Problem and “connect” a pair of k-mers if they overlap in k – 1 symbols. For the above example, it is easy to see that the string should start with TAA because there is no 3-mer ending in TA. This implies that the next 3-mer in the string should start with AA. There is only one 3-mer satisfying this condition, AAT:

TAA AAT

In turn, AAT can only be extended by ATG, which can only be extended by TGT, and so on, leading us to reconstruct TAATGTT:

TAA AAT ATG TGT GTT TAATGTT
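The extension procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `greedy_reconstruct` is hypothetical, and the sketch deliberately gives up (raises `ValueError`) as soon as the next k-mer is not uniquely determined, rather than trying to search among alternatives.

```python
def greedy_reconstruct(patterns: list[str]) -> str:
    """Extend a string one k-mer at a time, but only while the choice is forced."""
    k = len(patterns[0])
    remaining = list(patterns)
    suffixes = {p[1:] for p in patterns}
    # The first k-mer is one whose (k-1)-prefix is not the suffix of any k-mer.
    starts = [p for p in remaining if p[:k - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting k-mer")
    current = starts[0]
    remaining.remove(current)
    text = current
    while remaining:
        # Unused k-mers whose prefix equals the suffix of the current k-mer.
        candidates = [p for p in remaining if p[:-1] == current[1:]]
        if len(set(candidates)) != 1:
            raise ValueError("extension is ambiguous or impossible")
        current = candidates[0]
        remaining.remove(current)
        text += current[-1]
    return text


print(greedy_reconstruct(["AAT", "ATG", "GTT", "TAA", "TGT"]))
```

On the five 3-mers above every step is forced, so the sketch returns TAATGTT.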

It looks like we are finished with the String Reconstruction Problem and can let you move on to the next chapter. To be sure, let’s consider another 3-mer composition:

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Exercise Break: Reconstruct a string having the 3-mer composition {AAT, ATG, ATG, ATG, CAT, CCA, GAT, GCC, GGA, GGG, GTT, TAA, TGC, TGG, TGT}.

Repeats Complicate Assembly

What made the previous exercise tricky was the presence of ATG three times in the composition.

Alu sequence: ~300-nucleotide sequence that occurs over a million times in the human genome.

Courtesy: Dan Gilbert

Solution to Previous Exercise

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TAATGCCATGGGATGTT

STOP and Think: Is this the only solution?

Genome Path

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TAATGCCATGGGATGTT

From a Genome Path to a Genome
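Going from a genome path to a genome can be sketched as below: take the first k-mer whole, then append the last symbol of every later k-mer. The function name `spell_genome_path` is an illustrative assumption; the sketch also checks that consecutive k-mers really overlap in k – 1 symbols.

```python
def spell_genome_path(path: list[str]) -> str:
    """Spell the string given by a sequence of k-mers that overlap in k - 1 symbols."""
    for left, right in zip(path, path[1:]):
        # Suffix of the left k-mer must equal the prefix of the right k-mer.
        if left[1:] != right[:-1]:
            raise ValueError(f"{left} and {right} do not overlap in k - 1 symbols")
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


genome_path = ["TAA", "AAT", "ATG", "TGC", "GCC", "CCA", "CAT", "ATG",
               "TGG", "GGG", "GGA", "GAT", "ATG", "TGT", "GTT"]
print(spell_genome_path(genome_path))
```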

STOP and Think: Could you reconstruct this genome path if you only knew the 3-mer composition?


• Prefix: First k – 1 letters in a string.
• Suffix: Last k – 1 letters in a string.

Overlap Graph: Form a node for each read in Patterns, then connect x to y if Suffix(x) = Prefix(y).
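The overlap graph definition can be sketched in Python as an adjacency list. Several choices here are assumptions made for illustration: the function name `overlap_graph`, the use of list positions as node identifiers (so that the three copies of ATG remain three separate nodes), and the decision to skip self-loops such as GGG to GGG, which cannot be used by a path that visits each node once.

```python
def overlap_graph(patterns: list[str]) -> dict[int, list[int]]:
    """Map each read (by its index in patterns) to the reads it can be followed by."""
    adjacency: dict[int, list[int]] = {i: [] for i in range(len(patterns))}
    for i, x in enumerate(patterns):
        for j, y in enumerate(patterns):
            # Connect x to y when Suffix(x) = Prefix(y); self-loops are skipped.
            if i != j and x[1:] == y[:-1]:
                adjacency[i].append(j)
    return adjacency


reads = ["AAT", "ATG", "ATG", "ATG", "CAT", "CCA", "GAT", "GCC",
         "GGA", "GGG", "GTT", "TAA", "TGC", "TGG", "TGT"]
for node, successors in overlap_graph(reads).items():
    print(reads[node], "->", [reads[j] for j in successors])
```

Comparing every pair of reads takes quadratic time, which is acceptable for a toy example but not for real read sets.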

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

STOP and Think: What is the issue with this method?

We don’t know the order of k-mers in advance, so we need to order them lexicographically.

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

STOP and Think: What does this remind you of?

Hamiltonian Path Problem: Construct a Hamiltonian path in a graph.
• Input: A directed network.
• Output: A path visiting every node in the graph exactly once (if such a path exists).

... and this problem has no known fast solution!

de Bruijn’s Question

k-Universal Binary String: Contains every binary k-mer exactly once.
• Example: 0001110100 is 3-universal, as it contains 000, 001, 011, 111, 110, 101, 010, 100.

000 001 010 011 100 101 110 111
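Whether a given binary string is k-universal can be checked directly from the definition. The sketch below is illustrative only; the function name `is_k_universal` is an assumption, and the check treats the string as linear (not circular), matching the 3-universal example above.

```python
from itertools import product


def is_k_universal(text: str, k: int) -> bool:
    """Return True if text contains every binary k-mer exactly once."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    all_binary_kmers = {"".join(bits) for bits in product("01", repeat=k)}
    # Same number of k-mers and the same set means each occurs exactly once.
    return len(kmers) == len(all_binary_kmers) and set(kmers) == all_binary_kmers


print(is_k_universal("0001110100", 3))
```

Checking a candidate is easy; the harder question is how to construct such a string for an arbitrary k.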

Nicolaas de Bruijn

Can we draw a better network that doesn’t fall in the trap of the Hamiltonian Cycle Problem?

Reimagining the Overlap Graph

Edges (3-mers): TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Nodes (2-mers): TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

de Bruijn Graph: Produced by “gluing” nodes with the same label.

[Figure: the de Bruijn graph obtained by gluing identically labeled nodes, with nodes AA, AT, CA, CC, GA, GC, GG, GT, TA, TG, TT and one edge per 3-mer.]

STOP and Think: Do you see the genome? How did you find it?

[Figure: the same de Bruijn graph with its edges numbered 1 to 15 in the order TAA, AAT, ATG, TGC, GCC, CCA, CAT, ATG, TGG, GGG, GGA, GAT, ATG, TGT, GTT.]

STOP and Think: What is special about this path?

A Different Computational Problem

Eulerian Path Problem: Construct an Eulerian path in a graph.
• Input: A directed network.
• Output: A path visiting every edge in the graph exactly once (if such a path exists).

STOP and Think: Can we construct the de Bruijn graph of Text knowing only its k-mer composition?

Constructing de Bruijn graphs from k-mer composition

de Bruijn Graph of Reads

Constructing the de Bruijn graph by gluing identically labeled nodes will help us later when we generalize the notion of de Bruijn graph for other applications. We will now describe another useful way to construct de Bruijn graphs without gluing.

Given a collection of k-mers Patterns, the nodes of DEBRUIJN_k(Patterns) are simply all unique (k – 1)-mers occurring as a prefix or suffix of k-mers in Patterns. For example, say we are given the following collection of 3-mers (the set of reads):

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Then the set of eleven unique 2-mers occurring as a prefix or suffix in this collection is as follows:

AA AT CA CC GA GC GG GT TA TG TT

For every k-mer in Patterns, we connect its prefix node to its suffix node by a directed edge in order to produce DEBRUIJN(Patterns). You can verify that this process produces the same de Bruijn graph that we have been working with (Figure 3.16).

FIGURE 3.16 The de Bruijn graph above is the same as the graph in Figure 3.14, although it has been drawn differently.

De Bruijn Graph from k-mers Problem: Construct the de Bruijn graph of a collection of k-mers.
Input: A collection of k-mers Patterns.
Output: The de Bruijn graph DEBRUIJN(Patterns).
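The construction without gluing can be sketched in Python as follows: every k-mer contributes one directed edge from its prefix node to its suffix node. The function name `de_bruijn` and the adjacency-list representation (a dictionary mapping each (k – 1)-mer to a list of successors, with repeated entries for repeated k-mers) are illustrative assumptions.

```python
def de_bruijn(patterns: list[str]) -> dict[str, list[str]]:
    """Build the de Bruijn graph of a collection of k-mers as an adjacency list."""
    graph: dict[str, list[str]] = {}
    for kmer in patterns:
        prefix, suffix = kmer[:-1], kmer[1:]
        # One edge per k-mer, so a k-mer occurring three times yields three edges.
        graph.setdefault(prefix, []).append(suffix)
        # Make sure nodes with no outgoing edges still appear in the graph.
        graph.setdefault(suffix, [])
    return graph


reads = ["AAT", "ATG", "ATG", "ATG", "CAT", "CCA", "GAT", "GCC",
         "GGA", "GGG", "GTT", "TAA", "TGC", "TGG", "TGT"]
for node, successors in sorted(de_bruijn(reads).items()):
    print(node, "->", successors)
```

For the fifteen 3-mers above this yields the eleven 2-mer nodes listed earlier and fifteen edges, one per read.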

The Graph is the Same!

Key Question

We know that the Hamiltonian Path Problem has no known fast solution, but what about the Eulerian Path Problem?