Restriction Mapping Bioinformatics Algorithms Part 2 František Mráz, KSVI
Bioinformatics Algorithms
Physical Mapping: Restriction Mapping
Bioinformatics Algorithms, part 2
František Mráz, KSVI
Based on slides from http://bix.ucsd.edu/bioalgorithms/slides.php and other sources

Molecular Scissors: Restriction Enzymes
• HindII, the first restriction enzyme, was discovered accidentally in 1970 while studying how the bacterium Haemophilus influenzae takes up DNA from the virus
• It recognizes and cuts DNA at the sequences:
  • GTGCAC
  • GTTAAC
(Figure from Molecular Cell Biology, 4th edition)

Recognition Sites of Restriction Enzymes
(Figure from Molecular Cell Biology, 4th edition)

Restriction Maps
• A restriction map shows the positions of restriction sites in a DNA sequence
• If the DNA sequence is known, constructing the restriction map is a trivial exercise
• In the early days of molecular biology, DNA sequences were often unknown
• Biologists had to solve the problem of constructing restriction maps without knowing the DNA sequences

Measuring Length of Restriction Fragments
• Restriction enzymes break DNA into restriction fragments.
• Gel electrophoresis is a process for separating DNA by size and measuring the sizes of restriction fragments
• Visualization: autoradiography or fluorescence
(Figure: gel with the direction of DNA movement indicated)

Physical Map, Restriction Mapping Problem
• Definition: Let S be a DNA sequence. A physical map consists of a set M of markers and a function p: M → N that assigns each marker in M a position in S. N denotes the set of nonnegative integers.
• For a set X of points on the line, let δX = { |x1 - x2| : x1, x2 ∈ X } denote the multiset of all pairwise distances between points in X, called the partial digest. In the restriction mapping problem, a subset E ⊆ δX (of experimentally obtained fragment lengths) is given and the task is to reconstruct X from E.
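
The definition of δX above can be made concrete with a small sketch. It assumes that only pairs of distinct points are counted and that the multiset is represented as a `Counter`; the function names `delta` and `is_consistent` and the sample positions are illustrative choices, not part of the slides.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances between distinct points, as a Counter."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


def is_consistent(observed, points):
    """True if the observed lengths E form a sub-multiset of delta(points)."""
    full = delta(points)
    return all(full[length] >= count
               for length, count in Counter(observed).items())


X = [0, 2, 4, 7, 10]  # illustrative site positions, not measured data
print(sorted(delta(X).elements()))      # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(is_consistent([2, 3, 3, 10], X))  # True: every length occurs often enough
print(is_consistent([2, 2, 2], X))      # False: length 2 occurs only twice
```

Checking E ⊆ δX as a multiset inclusion matters because repeated fragment lengths must be matched with their multiplicities, not merely by value.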

Full Restriction Digest: Multiple Solutions
• Reconstruct the order of the fragments from the sizes of the fragments {3, 5, 5, 9}
• The same fragment sizes admit alternative orderings (figure)
• Reconstruction from the full restriction digest alone is impossible.

Three different problems
• One (full) digest is not enough
• Use 2 restriction enzymes
• Use 1 restriction enzyme, but differently
1. The double digest problem (DDP)
2. The partial digest problem (PDP)
3. The simplified partial digest problem (SPDP)

Double Digest Mapping
• Use two restriction enzymes, A and B, and three full digests:
  • ΔA: a complete digest of S using A,
  • ΔB: a complete digest of S using B, and
  • ΔAB: a complete digest of S using both A and B.
• Computationally, the Double Digest problem is more complex than the Partial Digest problem

Double Digest: Example
(Figure)
Without the information about X (i.e. ΔAB), it is impossible to solve the double digest problem, as the diagram illustrates.

Double Digest Problem
Input:
  ΔA: fragment lengths from the complete digest with enzyme A.
  ΔB: fragment lengths from the complete digest with enzyme B.
  ΔAB: fragment lengths from the complete digest with both A and B.
Output:
  A: locations of the cuts in the restriction map for enzyme A.
  B: locations of the cuts in the restriction map for enzyme B.

Double Digest: Multiple Solutions
(Figure)

Double digest
• The decision problem of the DDP is NP-complete.
• All algorithms have problems with more than 10 restriction sites for each enzyme.
• A solution may not be unique, and the number of solutions grows exponentially.
• DDP is a favourite mapping method since the experiments are easy to conduct.

DDP is NP-complete
1) DDP is in NP (easy).
2) Given a (multi-)set of integers X = {x1, …, xn}, the Set Partitioning Problem (SPP) is to determine whether we can partition X into two subsets X1 and X2 such that
   $\sum_{x \in X_1} x = \sum_{x \in X_2} x$.
   This problem is known to be NP-complete.

DDP is NP-complete
• Let X be the input of the SPP, assuming that the sum of all elements of X is even. Then set
  • ΔA = X,
  • ΔB = {K/2, K/2} with $K = \sum_{i=1}^{n} x_i$, and
  • ΔAB = ΔA.
• If this DDP instance has a solution, then the single cut of enzyme B at K/2 must coincide with a cut of enzyme A (otherwise ΔAB would differ from ΔA). Hence there exist an integer n0 and an ordering of the indices j1, j2, …, jn with
   $\sum_{i=1}^{n_0} x_{j_i} = \sum_{i=n_0+1}^{n} x_{j_i}$
  because of the choice of ΔB and ΔAB. Thus a solution for the SPP exists.
• Thus SPP is a DDP in which one of the two enzymes produces only two fragments of equal length.

Partial Restriction Digest
• The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to prevent it from being cut at all restriction sites.
• This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts.
• This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence.
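
To illustrate why a partial digest carries more information than a full digest, the following sketch simulates both for two orderings of the fragments {3, 5, 5, 9}. The two orderings are hypothetical, chosen only for illustration, and the helper names `cut_positions` and `partial_digest` are assumptions of this sketch rather than anything defined in the slides.

```python
from itertools import accumulate, combinations


def cut_positions(fragments):
    """Turn consecutive fragment lengths into cut positions, starting at 0."""
    return [0, *accumulate(fragments)]


def partial_digest(fragments):
    """Sorted lengths of the fragments between every pair of cuts."""
    sites = cut_positions(fragments)
    return sorted(b - a for a, b in combinations(sites, 2))


order_1 = [5, 9, 5, 3]  # hypothetical order of the fragments {3, 5, 5, 9}
order_2 = [3, 5, 5, 9]  # another hypothetical order of the same fragments

# The full digests are identical, so they cannot tell the orders apart.
print(sorted(order_1) == sorted(order_2))  # True

# The partial digests differ.
print(partial_digest(order_1))  # [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
print(partial_digest(order_2))  # [3, 5, 5, 8, 9, 10, 13, 14, 19, 22]
```

A partial digest still cannot distinguish a map from its mirror image: reversing `order_1` yields the same multiset of lengths.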

Multiset of Restriction Fragments
• We assume that the multiplicity of a fragment can be detected, i.e., the number of restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence intensity for a double fragment as for a single fragment)
Multiset: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}

Partial Digest Fundamentals
X: the set of n integers representing the locations of all cuts in the restriction map, including the start and end
n: the total number of cuts
δX: the multiset of integers representing the lengths of each of the fragments produced from a partial digest

One More Partial Digest Example

  X    0    2    4    7   10
  0         2    4    7   10
  2              2    5    8
  4                   3    6
  7                        3
 10

Representation of δX = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} as a two-dimensional table, with the elements of X = {0, 2, 4, 7, 10} along both the top and the left side. The element at (i, j) in the table is xj - xi for 1 ≤ i < j ≤ n.

Partial Digest Problem: Formulation
• Goal: Given all pairwise distances between points on a line, reconstruct the positions of those points.
• Input: The multiset of pairwise distances L, containing n(n - 1)/2 integers.
• Output: A set X of n integers such that δX = L.

Partial Digest: Multiple Solutions
• It is not always possible to uniquely reconstruct a set X based only on δX.
• For example, the sets X = {0, 2, 5} and (X + 10) = {10, 12, 15} both produce δX = {2, 3, 5} as their partial digest.
• The sets {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12} present a less trivial example of non-uniqueness.
They both digest into: {1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}

Homometric Sets

       0    1    2    5    7    9   12
  0         1    2    5    7    9   12
  1              1    4    6    8   11
  2                   3    5    7   10
  5                        2    4    7
  7                             2    5
  9                                  3
 12

       0    1    5    7    8   10   12
  0         1    5    7    8   10   12
  1              4    6    7    9   11
  5                   2    3    5    7
  7                        1    3    5
  8                             2    4
 10                                  2
 12

Partial Digest: Brute Force
1. Find the restriction fragment of maximum length M. M is the length of the DNA sequence.
2. For every possible set X = {0, x2, …, xn-1, M}, compute the corresponding δX.
3. If δX is equal to the experimental partial digest L, then X is the correct restriction map.

BruteForcePDP
BruteForcePDP(L, n):
  M ← maximum element in L
  for every set of n - 2 integers 0 < x2 < … < xn-1 < M
    X ← {0, x2, …, xn-1, M}
    form δX from X
    if δX = L
      return X
  output "no solution"
• BruteForcePDP takes O(M^(n-2)) time since it must examine all possible sets of positions.
• One way to improve the algorithm is to limit the values of xi to only those values which occur in L.

AnotherBruteForcePDP
AnotherBruteForcePDP(L, n):
  M ← maximum element in L
  for every set of n - 2 integers 0 < x2 < … < xn-1 < M from L
    X ← {0, x2, …, xn-1, M}
    form δX from X
    if δX = L
      return X
  output "no solution"
• It is more efficient, but still slow
• If L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast
• Fewer sets are examined, but the runtime is still exponential: O(n^(2n-4))

Branch and Bound Algorithm for PDP
1. Begin with X = {0}
2. Remove the largest element in L and place it in X
3. See if the element fits on the right or left side of the restriction map
4. When it fits, find the other lengths it creates and remove those from L
5. Go back to step 2 until L is empty
WRONG ALGORITHM

Defining D(y, X)
• Before describing PartialDigest, first define D(y, X) as the multiset of all distances between point y and all other points in the set X:
  D(y, X) = {|y - x1|, |y - x2|, …, |y - xn|} for X = {x1, x2, …, xn}

PartialDigest Algorithm
• S.