Graph Algorithms in Bioinformatics

Bioinformatics: Issues and Algorithms CSE 308-408 • Fall 2007 • Lecture 13

CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13

Outline

• Introduction to graph theory
• Eulerian & Hamiltonian Problems
• Benzer experiment and interval graphs
• DNA sequencing
• The Shortest Superstring & Traveling Salesman Problems
• Sequencing by hybridization

http://www.bioalgorithms.info

The Bridge Obsession Problem

Find a tour crossing every bridge just once. – Leonhard Euler, 1735

Bridges of Königsberg

Eulerian Cycle Problem

• Find a cycle that visits every edge exactly once.
• An algorithm for the solution runs in linear time.

More complicated Königsberg

Mapping problems to graphs

• Arthur Cayley studied chemical structures of hydrocarbons in the mid-1800s.
• He used trees (acyclic connected graphs) to enumerate structural isomers.

Hamiltonian Cycle Problem

• Find a cycle that visits every vertex exactly once.
• NP-complete (i.e., a very hard problem to solve).

Game invented by Sir William Hamilton in 1857

Beginning of graph theory in biology

Benzer’s work:
• Developed deletion mapping.
• “Proved” linearity of gene.
• Demonstrated internal structure of gene.

Seymour Benzer (1950's)

Viruses attack bacteria

• Normally bacteriophage T4 kills bacteria.
• However, if T4 is mutated (e.g., an important gene is deleted), it is disabled and loses the ability to kill bacteria.
• Suppose a bacterium is infected with two different mutants, each of which is disabled. Would the bacterium still survive?
• Amazingly, a pair of disabled viruses can kill a bacterium even though each of them is disabled on its own.
• How can this be explained?

Benzer’s experiment

• Idea: infect bacteria with pairs of mutant T4 bacteriophage (virus).
• Each T4 mutant has an unknown interval deleted from its genome.
• If two intervals overlap: T4 pair is missing part of its genome and is disabled – bacteria survive.
• If two intervals do not overlap: T4 pair has its entire genome and is enabled – bacteria die.

Benzer’s experiment

Complementation between pairs of mutant T4 bacteriophages

Benzer’s experiment and interval graphs

• Construct an interval graph: each T4 mutant is a vertex; place an edge between mutant pairs where bacteria survived (i.e., deleted intervals in the pair of mutants overlap).
• Interval graph structure reveals whether DNA is linear or branched.

Case for linear genes
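A minimal sketch of the interval graph construction, assuming for illustration that the deleted intervals are known. In Benzer's experiment they are not; only the survive/die outcome for each pair is observed. The mutant names and coordinates below are hypothetical.

```python
from itertools import combinations


def intervals_overlap(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def interval_graph(deletions):
    """Return the edge set: pairs of mutants whose deleted intervals overlap.

    An edge corresponds to "bacteria survive" for that pair of mutants.
    """
    return {
        (m1, m2)
        for m1, m2 in combinations(sorted(deletions), 2)
        if intervals_overlap(deletions[m1], deletions[m2])
    }


# Hypothetical deletions on a linear genome (coordinates are made up).
deletions = {"m1": (0, 4), "m2": (3, 8), "m3": (7, 10), "m4": (12, 15)}
edges = interval_graph(deletions)
print(sorted(edges))
```

Going the other way (deciding whether an observed graph can be realized by intervals on a line) is the question the experiment actually poses; this sketch only shows the forward direction.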

Benzer’s experiment and interval graphs

Case for branched genes

Interval graphs: a comparison

Linear genome Branched genome

DNA sequencing: history

Sanger method (1977): labeled ddNTPs* terminate DNA copying at random points.

Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C).

Both methods generate labeled fragments of varying lengths that are further electrophoresed.

* Dideoxynucleotides, or ddNTPs, are nucleotides lacking 3'-hydroxyl (-OH) group on their deoxyribose sugar. Note that the sugar normally referred to as deoxyribose lacks a 2'-OH, while the dideoxyribose lacks hydroxyl groups on both its 2' and 3' carbon. The lack of this hydroxyl group means that, after being added by a DNA polymerase to a growing nucleotide chain, no further nucleotides can be added, as no phosphodiester bond can be created. Thus, these molecules form the basis of the dideoxy chain-termination method of DNA sequencing, which was developed by Frederick Sanger in 1977. (From http://encyclopedia.thefreedictionary.com/ddNTP.)

Sanger Method: generating a read

• Start at primer (restriction site).
• Grow DNA chain.
• Include ddNTPs.
• Stops reaction at all possible points.
• Separate products by length, using gel electrophoresis.

DNA sequencing

• Shear DNA into millions of small fragments.
• Read 500 – 700 nucleotides at a time from the small fragments (Sanger method).

Fragment assembly

• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence.
• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem.

The Shortest Superstring Problem. Given a set of strings, find the shortest string containing all of them.

Input: Strings s1, s2, ..., sn.

Output: A string s that contains all strings s1, s2, ..., sn as substrings, such that the length of s is minimized.

• Problem is NP-complete (efficient solution unlikely).
• This formulation does not take into account sequencing errors!

Shortest Superstring Problem: an example

Note: this is an approximation – an attractive mathematical model for a real-world biological problem.

Reducing SSP to TSP

Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.

aaaggcatcaaatctaaaggcatcaaa
aaaggcatcaaatctaaaggcatcaaa

What is overlap(si, sj) for these strings? 3?

aaaggcatcaaatctaaaggcatcaaa
aaaggcatcaaatctaaaggcatcaaa
aaaggcatcaaatctaaaggcatcaaa

What is overlap(si, sj) for these strings? No, 12!
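The definition of overlap above can be sketched in Python. One assumption not stated on the slide: only proper overlaps are counted (strictly shorter than both strings), so that comparing a string with a shifted copy of itself gives 12 rather than the full length 27.

```python
def overlap(si, sj):
    """Length of the longest prefix of sj that matches a suffix of si.

    Assumption: only proper overlaps (shorter than both strings) count.
    """
    for k in range(min(len(si), len(sj)) - 1, 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


s = "aaaggcatcaaatctaaaggcatcaaa"
print(overlap(s, s))  # the short match "aaa" is not the longest one
```

Trying candidate lengths from longest to shortest is what avoids the trap on the slide: the first match found is the longest.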

Reducing SSP to TSP

Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.

• Build a graph with n vertices representing strings s1, s2, ..., sn.
• Insert edges of length overlap(si, sj) between vertices si and sj.
• Find the longest path that visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP-complete.


SSP to TSP: an Example

Say that S = {ATC, CCA, CAG, TCC, AGT}

[Figure: the SSP view (the five strings aligned into the superstring ATCCAGT) next to the TSP view (a graph on ATC, CCA, CAG, TCC, AGT whose edge weights overlap(si, sj) range from 0 to 2); both yield ATCCAGT.]
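The example can be checked with a brute-force sketch of the reduction: every ordering of the strings is a candidate path through all vertices, and the ordering with the largest total overlap gives the shortest merged string. This is an illustration under stated assumptions (proper overlaps only, no string contained in another), and it takes factorial time, so it is only usable on toy inputs.

```python
from itertools import permutations


def overlap(si, sj):
    """Longest proper prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)) - 1, 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Try every vertex order and keep the shortest merged string."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt not already covered by prev.
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


S = ["ATC", "CCA", "CAG", "TCC", "AGT"]
print(shortest_superstring(S))
```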

Sequencing by Hybridization (SBH): history

• 1988: SBH suggested as an alternative sequencing method. Nobody believed it would ever work.
• 1991: Light directed polymer synthesis developed by Steve Fodor and colleagues.
• 1994: Affymetrix develops first 64-kb DNA microarray.

[Figure captions: First microarray prototype (1989). First commercial DNA microarray prototype w/16,000 features (1994). 500,000 features per chip (2002).]

How SBH works

• Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array.
• Apply a solution containing fluorescently labeled DNA fragment to the array.
• The DNA fragment hybridizes with only those probes that are complementary to substrings of length l of the fragment.
• Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l-mer composition of the target DNA fragment.
• Apply a combinatorial algorithm (next) to reconstruct the sequence of the target DNA fragment from the l-mer composition.

Hybridization on DNA Array


l-mer composition

Spectrum(s, l) is an unordered multiset of all possible (n – l + 1) l-mers in a string s of length n. The order of individual elements in Spectrum(s, l) does not matter.

For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):

{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}

We usually choose the lexicographically-ordered representation as the canonical one.
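As a small illustration, Spectrum(s, l) can be computed by sliding a window of length l over s. The function name is an assumption; returning a sorted list is one way to obtain the lexicographically ordered canonical representation while keeping repeated l-mers.

```python
def spectrum(s, l):
    """All (n - l + 1) l-mers of s, as a lexicographically sorted list."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))


print(spectrum("TATGGTGC", 3))
```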

Different sequences ⇒ different spectrums

Note that different sequences may have the same spectrum.

Spectrum(GTATCT, 2) = {AT, CT, GT, TA, TC}
Spectrum(GTCTAT, 2) = {AT, CT, GT, TA, TC}

The Sequencing by Hybridization (SBH) Problem. Reconstruct a string from its l-mer composition.

Input: Set S representing all l-mers from an unknown string s.
Output: String s such that Spectrum(s, l) = S.

SBH: Hamiltonian Path approach

S = {ATG AGG TGC TCC GTC GGT GCA CAG}

Corresponding graph H:

ATG AGG TGC TCC GTC GGT GCA CAG

ATGCAGGTCC

Path visits every vertex once!
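A hedged sketch of the Hamiltonian path approach: each l-mer is a vertex, there is an edge from u to v when the last l-1 letters of u equal the first l-1 letters of v, and a backtracking search looks for a path through every vertex. The function name is hypothetical, S is assumed to contain no repeated l-mers, and the search is exponential in the worst case.

```python
def hamiltonian_path_sbh(lmers):
    """Return a string whose l-mers are exactly `lmers`, or None."""
    lmers = set(lmers)

    def extend(path, remaining):
        if not remaining:
            return path
        for nxt in sorted(remaining):
            # Edge in H: suffix of the current vertex equals prefix of nxt.
            if path[-1][1:] == nxt[:-1]:
                result = extend(path + [nxt], remaining - {nxt})
                if result:
                    return result
        return None

    for start in sorted(lmers):
        path = extend([start], lmers - {start})
        if path:
            # Spell the path: first vertex, then the last letter of each next one.
            return path[0] + "".join(v[-1] for v in path[1:])
    return None


S = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
print(hamiltonian_path_sbh(S))
```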

SBH: Hamiltonian Path approach

A more complicated graph:

S = {ATG TGG TGC GTG GGC GCA GCG CGT}

[Figure: graph H]

SBH: Hamiltonian Path approach

S = {ATG TGG TGC GTG GGC GCA GCG CGT}

Path 1 in H: ATGCGTGGCA

Path 2 in H: ATGGCGTGCA

SBH: Eulerian Path approach

S = {ATG TGG TGC GTG GGC GCA GCG CGT}

• Vertices correspond to (l–1)-mers: {AT, TG, GC, GG, GT, CA, CG}
• Edges correspond to l-mers from S.

[Figure: graph on vertices AT, TG, GC, GG, GT, CA, CG with one edge per l-mer]

Path visits every edge once!

SBH: Eulerian Path approach

S = {ATG TGG TGC GTG GGC GCA GCG CGT} corresponds to two different paths:

AT → TG → GG → GC → CG → GT → TG → GC → CA, spelling ATGGCGTGCA
AT → TG → GC → CG → GT → TG → GG → GC → CA, spelling ATGCGTGGCA

I.e., two acceptable solutions to this SBH problem.
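The Eulerian path approach can be sketched as follows: build the graph whose vertices are (l-1)-mers and whose edges are the l-mers of S, then trace a path that uses every edge once. The stack-based traversal is a standard way to find an Eulerian path and is a choice made for this sketch; the function name is hypothetical and the graph is assumed to be connected.

```python
from collections import defaultdict


def eulerian_path_sbh(lmers):
    """Return one string whose l-mer spectrum is `lmers`, or None."""
    graph = defaultdict(list)
    out_minus_in = defaultdict(int)
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]  # edge u -> v labelled by the l-mer
        graph[u].append(v)
        out_minus_in[u] += 1
        out_minus_in[v] -= 1

    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, d in out_minus_in.items() if d == 1), next(iter(graph)))

    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: vertex is finished
    path.reverse()

    if len(path) != len(lmers) + 1:
        return None  # some edge was unreachable
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(eulerian_path_sbh(S))  # one of the two reconstructions listed above
```

Which of the two solutions is returned depends on the order in which edges are stored; the spectrum alone does not distinguish them.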

Euler Theorem

A graph is balanced if for every vertex, the number of incoming edges equals the number of outgoing edges: in(v) = out(v)

Theorem: a connected graph is Eulerian (i.e., has an Euler Cycle) if and only if each of its vertices is balanced.

Euler Theorem: Proof

Two cases to consider:

(1) Eulerian ⇒ balanced
For every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v) = out(v)

(2) Balanced ⇒ Eulerian
???

Algorithm for constructing an Eulerian Cycle

(a) Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead-end is reached. Since every vertex is balanced, the dead-end must be the starting point, i.e., vertex v.

Algorithm for constructing an Eulerian Cycle

(b) If the cycle from Step (a) is not an Eulerian cycle, it must contain a vertex w which has untraversed edges. Perform Step (a) again, using vertex w as the starting point. Once again, we will end up at the starting vertex w.

Algorithm for constructing an Eulerian Cycle

(c) Combine the cycles from Step (a) and Step (b) into a single cycle and iterate Step (b).
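Steps (a) to (c) can be sketched directly. The graph representation (a dict from each vertex to a list of successors) and the function name are assumptions, and the input is assumed to be connected and balanced. This version rescans the current cycle to find w, so it is simpler than, and not as fast as, a linear-time implementation.

```python
def eulerian_cycle(graph):
    """Return a list of vertices tracing every edge once, ending where it starts."""
    unused = {v: list(ws) for v, ws in graph.items()}
    cycle = [next(iter(unused))]  # arbitrary start vertex v
    while any(unused.values()):
        # Step (b): find a vertex w on the current cycle with untraversed edges.
        i = next(k for k, v in enumerate(cycle) if unused.get(v))
        # Step (a): walk unused edges from w until a dead end, which is w itself.
        sub = [cycle[i]]
        while unused.get(sub[-1]):
            sub.append(unused[sub[-1]].pop())
        # Step (c): splice the new cycle into the old one at w.
        cycle[i:i + 1] = sub
    return cycle


# Small balanced example; the first walk gets stuck at v before using w -> x.
graph = {"v": ["w"], "w": ["x", "v"], "x": ["w"]}
print(eulerian_cycle(graph))
```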

Euler Theorem: extension

Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced.

I.e., one vertex has an extra outgoing edge, while another corresponding vertex has an extra incoming edge.
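Both degree conditions can be checked by counting, as in the sketch below. It assumes the graph is connected and given as a list of directed edges; the function names and return labels are hypothetical.

```python
from collections import Counter


def degree_imbalance(edges):
    """Map each vertex to out(v) - in(v)."""
    diff = Counter()
    for u, v in edges:
        diff[u] += 1  # outgoing edge at u
        diff[v] -= 1  # incoming edge at v
    return diff


def euler_classification(edges):
    """Classify a connected directed graph by its degree imbalances."""
    nonzero = sorted(d for d in degree_imbalance(edges).values() if d != 0)
    if not nonzero:
        return "cycle"  # every vertex balanced
    if nonzero == [-1, 1]:
        return "path"  # exactly two semi-balanced vertices
    return "neither"


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sbh_edges = [(lmer[:-1], lmer[1:]) for lmer in S]
print(euler_classification(sbh_edges))
```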

Some Difficulties with SBH

• Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and those with one or two mismatches.
• Array Size: effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.
• Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future.
• Practicality (again): although impractical, SBH still spearheaded techniques for expression and SNP analysis.

Traditional DNA sequencing

[Figure: DNA is sheared ("Shake") into DNA fragments; each fragment is combined with a vector (circular genome: bacterium, plasmid) at a known location (restriction site).]

Different types of vectors

VECTOR                                   Size of insert (bp)
Plasmid                                  2,000 - 10,000
Cosmid                                   40,000
BAC (Bacterial Artificial Chromosome)    70,000 - 300,000
YAC (Yeast Artificial Chromosome)        > 300,000 (not used much recently)

Electrophoresis diagrams

Challenging to read sometimes

• Filtering
• Correcting for length compression
• Smoothing
• Method for deciding nucleotides

Shotgun sequencing

Genomic segment

Cut many times at random (shotgun)

Get one or two reads from each segment

~500 bp ~500 bp

Fragment assembly

Reads

• Cover region with ~7-fold redundancy.
• Overlap reads, extend to reconstruct original genomic region.

Read coverage

Length of genomic segment: L
Number of reads: n
Length of each read: l
Coverage: C = (nl) / L

How much coverage is enough?

Lander-Waterman model: Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides.
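The coverage formula and the Lander-Waterman figure can be checked numerically. The sketch below uses a simplified form of the model in which the expected number of gaps is about n · e^(-C), ignoring the minimum overlap needed to detect that two reads join. The 500 bp read length is taken from the approximate read size quoted earlier; it is an assumption, as are the function names.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """C = (n * l) / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads, read_len, genome_len):
    """Simplified Lander-Waterman estimate: n * exp(-C)."""
    return n_reads * math.exp(-coverage(n_reads, read_len, genome_len))


L = 1_000_000  # genomic segment length
l = 500        # assumed read length
n = 20_000     # number of reads, chosen so that C = 10

print(coverage(n, l, L))       # 10.0
print(expected_gaps(n, l, L))  # a little under one gap per 1,000,000 nucleotides
```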

Challenges in fragment assembly

Repeats: a major problem for fragment assembly. More than 50% of the human genome consists of repeats:
• over 1 million Alu repeats (about 300 bp),
• about 200,000 LINE repeats (1000 bp and longer).

Repeat Repeat Repeat

Green and blue fragments are interchangeable when assembling repetitive DNA

Bad things happen when you confuse repeated regions ...

Correct assembly (we don't know this starting off): Repeat Repeat Repeat

Potential mis-assembly (seems just as plausible): Repeat Repeat Repeat

Triazzle: a fun example

The puzzle looks simple

BUT there are repeats!!!

Repeats make it very difficult.

Try it – only $7.99 at www.triazzle.com

Repeat types

• Low-Complexity DNA (e.g., ATATATATACATA…)
• Microsatellite repeats: (a1…ak)^N where k ~ 3-6 (e.g., CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons:
  - SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10^6 copies)
  - LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies)
  - LTR retroposons: Long Terminal Repeats (~700 bp at each end)
• Gene Families: genes duplicate & then diverge
• Segmental duplications: ~very long, very similar copies

Overlap-Layout-Consensus

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..

Overlap

• Find best match between suffix of one read and prefix of another.
• Due to sequencing errors, need to use dynamic programming to find optimal overlap alignment.
• Apply filtration method to filter out pairs of fragments that do not share a significantly long common substring.
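A minimal sketch of an overlap alignment by dynamic programming: the alignment may start anywhere in read a (free leading gap), must start at the beginning of read b, and must run to the end of a. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not parameters from the lecture.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a suffix of a with a prefix of b.

    Returns (score, j), where j is the length of the prefix of b used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][0] = 0: the alignment may begin at any position of a for free.
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The suffix of a must reach the end of a; the prefix of b may stop anywhere.
    best_j = max(range(cols), key=lambda j: score[len(a)][j])
    return score[len(a)][best_j], best_j


print(overlap_alignment("AAACGTACGT", "CGTACGTTTT"))
```

With a sequencing error in the overlapping region the score drops but the overlap is still found, which is the reason the slide calls for dynamic programming rather than exact matching.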

Overlapping reads

• Sort all k-mers in reads (k ~ 24).
• Find pairs of reads sharing a k-mer.
• Extend to full alignment – throw away if not >95% similar.

TACA TAGATTACACAGATTACT GA
|| |||||||||||||||||| ||
TAGT TAGATTACACAGATTACTAGA
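The first two steps can be sketched with a k-mer index. Note the substitution: the slide sorts all k-mers, while this sketch uses a hash table keyed by k-mer, which finds the same pairs. The function name is hypothetical, and a small k is used so the toy reads below share a k-mer.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs (i, j), i < j, of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


reads = ["TAGATTACACAGATTACTGA", "TTACACAGATTATTGA", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=8))
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold.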

Overlapping reads and repeats

• A k-mer that appears N times initiates N^2 comparisons.
• For an Alu that appears 10^6 times, this means 10^12 comparisons, which is obviously too much.
• Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10).
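The filtering rule can be sketched by counting k-mers and flagging those above the threshold t × Coverage. The function name and the toy reads are hypothetical.

```python
from collections import Counter


def frequent_kmers(reads, k, coverage, t=10):
    """Return the k-mers that occur more than t * coverage times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c > t * coverage}


# Toy input: "ACGT" plays the role of a repeat-derived k-mer.
reads = ["ACGT"] * 5 + ["TTTT"]
print(frequent_kmers(reads, k=4, coverage=1, t=2))
```

Dropping these k-mers before pairing reads avoids the quadratic blow-up, at the cost of missing overlaps that lie entirely inside a repeat.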

Finding overlapping reads

Create local multiple alignments from the overlapping reads:

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGATAG
TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGATAG
TTACACAGATTATTGA

Finding overlapping reads

• Correct errors using multiple alignment

[Figure: multiple alignment of overlapping reads annotated with per-base quality scores (e.g., C: 20, C: 35, C: 40, T: 30, C: 0, A: 15, A: 25, A: 40, A: 0).]

• Score alignments
• Accept alignments with good scores

Layout

• Repeats are a major challenge.
• Do two aligned fragments really overlap, or are they from two copies of a repeat?
• Solution: repeat masking – hide repeats!!!
• Masking results in a high rate of misassembly (up to 20%).
• Misassembly means a lot more work at the finishing step.

Merge reads into contigs

repeat region

Merge reads up to potential repeat boundaries

Repeats, errors, and contig lengths

• Repeats shorter than read length are okay.
• Repeats with more base pair differences than sequencing error rate are okay.
• To make a smaller portion of the genome appear repetitive:
  - try to increase effective read length,
  - work to decrease sequencing error rate.

Role of error correction

Discards ~90% of single-letter sequencing errors.

Lowers error rate:
⇒ decreases effective repeat content,
⇒ increases contig length.

Merge reads into contigs

Repeat region

• Ignore non-maximal reads. • Merge only maximal reads into contigs.

Merge reads into contigs

Repeat boundary??? Sequencing error b a

• Ignore “hanging” reads, when detecting repeat boundaries.

Merge reads into contigs

?????

Unambiguous

• Insert non-maximal reads whenever unambiguous.

Link contigs into supercontigs

Normal density

Too dense: Overcollapsed?

Inconsistent links: Overcollapsed?

Link contigs into supercontigs

Find all links between unique contigs.

Connect contigs incrementally, if ≥ 2 links.

Link contigs into supercontigs

Fill gaps in supercontigs with paths of overcollapsed contigs.

Wrap-up

Readings for next time:
• IBP 9.1-9.5 (hash tables, repeat-finding, exact pattern matching, keyword trees, suffix trees).

Remember:
• Come to class having done the readings.
• Check Blackboard regularly for updates.
