Bioinformatics

Graph Algorithms in Bioinformatics Bioinformatics: Issues and Algorithms CSE 308-408 • Fall 2007 • Lecture 13 CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 1 - Outline • Introduction to graph theory • Eulerian & Hamiltonian Cycle Problems • Benzer experiment and interval graphs • DNA sequencing • The Shortest Superstring & Traveling Salesman Problems • Sequencing by hybridization http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 2 - The Bridge Obsession Problem Find a tour crossing every bridge just once. – Leonhard Euler, 1735 Bridges of Königsberg http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 3 - Eulerian Cycle Problem • Find a cycle that visits every edge exactly once. • Algorithm for solution has linear time complexity. More complicated Königsberg http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 4 - Mapping problems to graphs • Arthur Cayley studied chemical structures of hydrocarbons in mid-1800's. • He used trees (acyclic connected graphs) to enumerate structural isomers. http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 5 - Hamiltonian Cycle Problem • Find a cycle that visits every vertex exactly once. • NP-complete (i.e., a very hard problem to solve). Game invented by Sir William Hamilton in 1857 http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 6 - Beginning of graph theory in biology Benzer’s work: • Developed deletion mapping. • “Proved” linearity of gene. • Demonstrated internal structure of gene. Seymour Benzer (1950's) http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 7 - Viruses attack bacteria • Normally bacteriophage T4 kills bacteria. • However if T4 is mutated (e.g., an important gene is deleted), it gets disabled and loses ability to kill bacteria. • Suppose bacteria is infected with two different mutants, each of which is disabled. Would the bacteria still survive? • Amazingly, a pair of disable viruses can kill a bacteria even if each of them is disabled. • How can this be explained? http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 8 - Benzer’s experiment • Idea: infect bacteria with pairs of mutant T4 bacteriophage (virus). • Each T4 mutant has an unknown interval deleted from its genome. • If two intervals overlap: T4 pair is missing part of its genome and is disabled – bacteria survive. • If two intervals do not overlap: T4 pair has its entire genome and is enabled – bacteria die. http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 9 - Benzer’s experiment Complementation between pairs of mutant T4 bacteriophages http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 10 - Benzer’s experiment and interval graphs • Construct interval graph: each T4 mutant is a vertex; place edge between mutant pairs where bacteria survived (i.e., deleted intervals in pair of mutants overlap). • Interval graph structure reveals whether DNA is linear or branched DNA. Case for linear genes http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 11 - Benzer’s experiment and interval graphs Case for branched genes http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 12 - Interval graphs: a comparison Linear genome Branched genome http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 13 - DNA sequencing: history Sanger method (1977): Gilbert method (1977): labeled ddNTPs* chemical method to terminate DNA copying at cleave DNA at specific random points. points (G, G+A, T+C, C). Both methods generate labeled fragments of varying lengths that are further electrophoresed. * Dideoxynucleotides, or ddNTPs, are nucleotides lacking 3'-hydroxyl (-OH) group on their deoxyribose sugar. Note that the sugar normally referred to as deoxyribose lacks a 2'-OH, while the dideoxyribose lacks hydroxyl groups on both its 2' and 3' carbon. The lack of this hydroxyl group means that, after being added by a DNA polymerase to a growing nucleotide chain, no further nucleotides can be added, as no phosphodiester bond can be created. Thus, these molecules form the basis of the dideoxy chain- termination method of DNA sequencing, which was developed by Frederick Sanger in 1977. (From http://encyclopedia.thefreedictionary.com/ddNTP.) http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 14 - Sanger Method: generating a read • Start at primer (restriction site). • Grow DNA chain. • Include ddNTPs. • Stops reaction at all possible points. • Separate products by length, using gel electrophoresis. http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 15 - DNA sequencing • Shear DNA into millions of small fragments. • Read 500 – 700 nucleotides at a time from the small fragments (Sanger method). http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 16 - Fragment assembly • Computational challenge: assemble individual short fragments (reads) into single genomic sequence. • Until late 1990's, the shotgun fragment assembly of human genome was viewed as intractable problem. The Shortest Superstring Problem. Given set of strings, find shortest string containing all of them. Input: Strings s1, s2, ..., sn. Output: A string s that contains all strings s1, s2, ..., sn as substrings, such that length of s is minimized. • Problem is NP-complete (efficient solution unlikely). • This formulation does not take into account sequencing errors! CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 17 - Shortest Superstring Problem: an example Note: this is an approximation – an atttactive mathematical model for a real-world biological problem. http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 18 - Reducing SSP to TSP Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa What is overlap(si, sj) for these strings? 3? aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa What is overlap(si, sj) for these strings? No, 12! http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 19 - Reducing SSP to TSP Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa • Build graph with n vertices representing strings s1, s2, ..., sn. • Insert edges of length overlap overlap(si, sj) between vertices si and sj. • Find longest path which visits every vertex exactly once. This is Traveling Salesman Problem (TSP), which is also NP- complete. CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 20 - Reducing SSP to TSP http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 21 - SSP to TSP: an Example Say that S = {ATC, CCA, CAG, TCC, AGT} SSP TSP ATC AGT 0 2 CCA 1 1 AGT 1 CCA ATC 1 ATCCAGT 2 2 2 TCC CAG 1 TCC CAG ATCCAGT http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 22 - Sequencing by Hybridization (SBH): history • 1988: SBH suggested as an First microarray an alternative sequencing prototype (1989) method. Nobody believed it will ever work. First commercial • 1991: Light directed polymer DNA microarray prototype w/16,000 synthesis developed by Steve features (1994) Fodor and colleagues. • 1994: Affymetrix develops 500,000 features first 64-kb DNA microarray. per chip (2002) http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 23 - How SBH works • Attach all possible DNA probes of length l to flat surface, each probe at distinct and known location. This set of probes is called the DNA array. • Apply a solution containing fluorescently labeled DNA fragment to array. • The DNA fragment hybridizes with only those probes that are complementary to substrings of length l of fragment. • Using a spectroscopic detector, determine which probes hybridize to DNA fragment to obtain l-mer composition of target DNA fragment. • Apply combinatorial algorithm (next) to reconstruct sequence of target DNA fragment from l-mer composition. http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 24 - Hybridization on DNA Array http://www.bioalgorithms.info CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 13 - 25 - l-mer composition Spectrum(s, l) is an unordered multiset of all possible (n – l + 1) l-mers in a string s of length n. The order of individual elements

Bioinformatics

1. Directed Graphs Or Quivers What Is Category Theory? • Graph Theory on Steroids • Comic Book Mathematics • Abstract Nonsense • the Secret Dictionary

Tutorial 13 Travelling Salesman Problem1

Proper Interval Graphs and the Guard Problem 1

Classes of Graphs with Restricted Interval Models Andrzej Proskurowski, Jan Arne Telle

Graph Theory

On Computing Longest Paths in Small Graph Classes

An Application of Graph Theory in Markov Chains Reliability Analysis

Complexity and Exact Algorithms for Vertex Multicut in Interval and Bounded Treewidth Graphs 1

Lecture 9. Basic Problems of Graph Theory

582670 Algorithms for Bioinformatics

Graph Algorithms in Bioinformatics an Introduction to Bioinformatics Algorithms Outline

On Hamiltonian Line-Graphsq