Physical Mapping by STS Hybridization:
Algorithmic Strategies and the Challenge of Software Evaluation

David S. Greenberg                      Sorin Istrail
Sandia National Laboratories            Sandia National Laboratories
Albuquerque, NM                         Albuquerque, NM
dsgreen@cs.sandia.gov                   scistra@cs.sandia.gov

March

To appear in the Journal of Computational Biology, summer

Abstract

An important tool in the analysis of genomic sequences is the physical map. In this paper we examine the construction of physical maps from hybridization data between STS (sequence tag site) probes and clones of genomic fragments. An algorithmic theory of the mapping process, a proposed performance evaluation procedure, and several new algorithmic strategies for mapping are given. A unifying theme for these developments is the idea of a conservative extension: an algorithm, measure of algorithm quality, or description of a physical map is a conservative extension if it is a generalization, for data with errors, of a corresponding concept in the error-free case.

In our algorithmic theory we show that the nature of hybridization experiments imposes inherent limitations on the mapping information recorded in the experimental data. We prove that only certain types of mapping information can be reliably calculated by any algorithm. A test generator is then presented, along with quantitative measures for determining how much of the possible information is being computed by a given algorithm. Weaknesses and strengths of these measures are discussed.

Each of the new algorithms presented in this paper is based on combinatorial optimizations. Despite the fact that all the optimizations are NP-complete, we have developed algorithmic tools for the design of competitive approximation algorithms. We apply our performance evaluation program to our algorithms and obtain solid evidence that the algorithms are capable of retrieving high-level, reliable mapping information.



This work was supported in part by the U.S. Department of Energy under contract DEACDP.

Introduction

A central question for the Human Genome Program is how to bridge the gap between the size of DNA fragments which can be directly sequenced and the size of the human genome. Researchers continue to look for ways of extending the size of fragments which can be directly sequenced, and ways of picking out important pieces to sequence. However, at present and in the near term, large-scale sequencing projects depend on some process which involves dividing a large segment of DNA into overlapping pieces, analyzing the smaller pieces separately, and determining the order of the pieces so as to combine information about the pieces into information about the whole.

It is this last step, reordering fragments, which is the focus of this paper. Many methods have been proposed for reordering fragments, ranging from tools for helping experts reorder the fragments by hand to automatic programs employing maximum likelihood analysis, combinatorial optimizations, or a variety of heuristics. These tools have been aimed at data which includes STS probe-clone interactions, FISH data, radiation hybridization data, genetic map orderings, break point positions, etc.

Typically a given tool or algorithm is evaluated by applying it to actual data and displaying the resulting map. In this paper we attempt to provide a framework for a more rigorous analysis of mapping algorithms. We aim to answer the following questions:

- For a given type of data, what type of information is it reasonable to expect from an algorithm?

- For a given algorithm, how well does it work on different types of data?

- Are there good algorithms for particular types of data, and if so, what are they?

Since these questions are clearly broad and difficult to answer, we have begun by concentrating on a particular type of data, hybridizations of sequence-tag-site (STS) probes with clones of genomic fragments, and a particular class of algorithms, those based on combinatorial optimizations. We pay especial attention to data in which there are errors of various types: clones which contain multiple unrelated fragments (i.e., chimeras), clones with internal deletions, and both false-positive and false-negative errors in the hybridization data.

In this first study we were able to show:

- The information available in a hybridization matrix of even error-free experiments is limited. The amount of available information is directly related to the information in the PQ-tree of the matrix. (Some terms here, such as PQ-tree, may be unfamiliar to readers new to the subject; they are defined in later sections.)

- The presence of errors further degrades the information available. Furthermore, the errors may obscure redundancies and singularities which could easily have been filtered in the error-free case.

- Despite these limitations on the information available in hybridization matrices, it is possible, even for data containing errors, to determine relevant facts about the target genome. In particular, we identify local properties which can be determined, such as the fact that certain probe pairs are adjacent.

- Given the correct ordering for a matrix, it is possible to divide the adjacent probe pairs into a weak and a strong set, where the weak set cannot be reliably determined by any algorithm and the strong set is potentially determinable. The definitions of these sets are formal and precise for error-free data, and heuristic for data containing errors.

- The percentage of strong adjacent pairs identified by an algorithm is a reasonable measure of algorithm success and can be used to compare algorithms. However, this measure is local in nature, and more global measures are still needed.

- We formulate a noise model which allows us to unify the seemingly dissimilar challenges of finding the correct order of probes and of dealing with the errors in the hybridization matrix. We show that combinatorial optimization functions can be used to search for solutions which require the postulation of as few errors as possible. We believe that the success of these functions is due to their being conservative extensions of the error-free case and their having monotonicity with respect to common error types.

The figure below gives a table of contents for this paper.

The remainder of the paper is organized as follows. We first present the basic biology required for the rest of the paper. We then rigorously define the physical mapping problem and show that the information about a genomic target which can be reliably inferred from an error-free hybridization matrix is related to the PQ-tree of the matrix. Next we discuss possible sources of errors in the hybridization matrix and extend our theory to include the errors. We then describe how to generate synthetic data for testing mapping algorithms and show experimentally that hybridization matrices will tend to have ambiguities which algorithms cannot resolve. Following that, we describe how combinatorial optimizations can be used to search for good maps and present several algorithms based on the optimizations. We next describe a pilot experiment in which we examine the performance of our algorithms on varying amounts of errors. Finally, we contrast our work with other studies in the field and describe the large amount of work which still remains to be done.

A biology primer

Readers who are already familiar with the process of physical mapping can skip this section. Readers who desire more details than are present in this section are encouraged to read Brown's or Nelson and Brownstein's books.

The essence of the physical mapping process is as follows. The experiment begins with a sample of target DNA; recall that DNA is a linear sequence of base pairs (A, C, G, and T). Pure samples of the target DNA are cut at specific points, and then each fragment of DNA is inserted into a circular DNA molecule, called a vector, to produce a recombinant DNA molecule. The DNA fragment incorporated into the vector is called an insert. The vector

Contents

Introduction
A biology primer
    Experimental errors
A theory of physical mapping
    Error-free hybridization experiments
    Mapping Information
    Algorithmically retrievable mapping information
        Probe restrictions
        Consecutive-ones matrices
        Booth and Lueker's PQ-tree algorithm
        The true map is almost never algorithmically retrievable
        PQ-trees encode information in common to all maps
Experimental errors
    Extending the model to data with errors
    NP-completeness barriers
    Reasonable goals
Evaluating Algorithms on Hybridization Data
    Generator
        Mimicking real data
        Exploring a range of data
    The Quality of Data
        Connected Components
        Redundant probes
        Weak adjacencies
        Summary of data quality
Combinatorial Optimizations as Models of Error-correction
    Genomic reconstruction through parsimonious explanation
    Fragment-counting objective functions
    Other objective functions
    Objective functions
Algorithms
    The Human-greedy algorithm: Minimizing
    The clone-cover algorithm
    The Fiedler-vector spectral algorithm: Minimizing
    The cycle-basis algorithm: Minimizing
    The OPT algorithm
Experimental Design
    Measures of success
    Parameters to vary
    Timing
Experimental Analysis of Algorithms
    General observations
    Spectral
    Cycle Basis Algorithm
    Clone Cover
    Human
    Random
    Comparison
Related work
Summary and Future Directions

Figure: Table of contents.

transports the DNA fragment into a host cell. Within the host cell the vector multiplies, producing numerous identical copies of itself, and therefore of the DNA fragment it carries. When the host divides, copies of the recombinant DNA molecule are passed to the progeny and further vector replication takes place. After a number of cell divisions, a colony, or clone, of identical host cells is produced. Each cell in the colony contains one or more copies of the recombinant DNA molecules. The DNA fragment is now said to be cloned.

It is possible to construct probes of various types. In general, a probe is a piece of DNA which is complementary to some section of the target DNA. Probes may be short random sequences, copies of the ends of clone fragments, genetic markers, or practically any previously identified piece of DNA. One particularly useful type of probe is the sequence-tag-site (STS) probe. An STS is constructed to match a single site on the target DNA.

Given a set of clones and a set of probes, it is in principle possible to determine which probes hybridize to which clones. A probe should hybridize (i.e., stick under experimental circumstances) to a clone if and only if its sequence is complementary to some subsequence of the clone. Thus a record of all hybridizations tells us which probes are contained within which clones.

Experimental errors

Unfortunately, the simple process described above neglects many important shortcomings of actual experiments. Typically these shortcomings are described as errors in the hybridization data, although some of them are, strictly speaking, merely deviations from the simple model.

Chimerism. One important type of experimental error, chimerism, results from the cloning process itself. A chimera is a clone which contains two unrelated fragments of DNA. The formation of chimeric clones occurs with all vector types (phages, cosmids, YACs, and megaYACs) and can occur both in the adoption of the fragment into the vector and in the subsequent clonal replication. Furthermore, the proportion of chimeric clones tends to increase as the size of the DNA insert increases. Since the presence of chimeric clones greatly complicates the mapping process, great effort has gone into reducing their occurrence and/or detecting the chimeric clones.

There are several methods used for chimerism reduction. They are all characterized by a tradeoff between labor-intensive procedures and accuracy. One method uses physical mapping of the ends of the YAC insert. It involves the isolation of the ends of the YAC insert, which requires a considerable amount of experimental effort. Although very accurate in detection, the method is impractical for the detection of every chimeric clone. Another method uses fluorescence in situ hybridization (FISH). It is also a very laborious method which requires sophisticated techniques not available in all laboratories. Also, it has sensitivity limits and fails to detect short noncontiguous sequences. Another difficulty that the methods must face is that of being able to distinguish chimeric clones from clones subject to cotransformation events, i.e., the introduction and stable independent maintenance of two or more YACs in the same cell.

Despite the methods developed for chimerism detection, the percentage of chimerism in genomic libraries continues to be high. When the YAC technology was developed, the inventors predicted that the technology would suffer from some chimerism in the clones of a YAC library. In the clone libraries that were used in the creation of the first high-resolution maps, however, chimerism was discovered to occur at much higher percentages than predicted, both in early chromosome maps and in the Y chromosome map; substantial frequencies of chimerism have also been estimated for the two most widely used human YAC libraries. The recent advance in the discovery of megaYACs seems to confirm the expectation that the larger the YAC is, the more chimerism is introduced into the library.

In the face of all these experimental difficulties, and the fact that chimerism is intimately connected to the basic operations of recombinant technology, computational support for mapping chimeric clones is vital for the creation of reliable physical maps of the chromosomes.

Deletions. Deletions are another way in which gaps between fragments can occur. Even when a single fragment is inserted in a vector, a piece of the fragment may be deleted in the replication stage. The result is that the clone represents two fragments which are not contiguous in the genome, i.e., a chimera. Some deletions appear to be random, and some are the result of uncloneable genomic regions in regular or megaYACs. It has even been conjectured that there may be an uncloneable region of the human genome on the average once every few million bases. The megaYACs in particular contain a lot of deletions; it seems that they are internally scrambled. The splicing mechanism of yeast, which is part of the yeast's system of DNA repair, seems responsible for putting together pieces of non-yeast DNA, i.e., chimeras and deletions. Other factors responsible for these errors are repetitive sequences and fragility of sequences. Indeed, some sequences of the human genome containing large numbers of repeated sequences tend to be uncloneable, as they seem to trigger the yeast's repair system. Other sequences seem too fragile to be inserted without breaking them into pieces.

Hybridization errors. The process of hybridization of probes to clones is complex both biologically and experimentally. Inevitably there are certain percentages of both false positive (incorrect recording that a probe hybridizes to a clone) and false negative (the opposite error) hybridization results.

One reason for false hybridization results is that a probe may not hybridize strongly to its location. Clones of different sizes and relative positions to the probe site may have conformations which make hybridization more or less likely. This sort of problem leads mostly to false negatives.

Another reason for false hybridization results is that while a probe may have a unique exact matching site on the target genome, there may be other sites that are pretty good matches. Under some hybridization conditions this may yield a false positive.

The number of probe-clone pairs which must be checked can be very large. Therefore it is common to pool clones and check for hybridization. Often the pooling techniques do not precisely define which clone in a pool was the cause of a hybridization. In this case, either false positives or false negatives may arise. In addition, the pooling techniques may introduce correlations between false positives and/or false negatives.

The reading of a gel to determine hybridization is not simple. Sometimes the human or machine reader simply makes a mistake. Since the volume of data is already high, it is not common to do redundant checks for accuracy. Again, this can lead to either false positives or false negatives.
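On synthetic data, the combined effect of these error sources is often modeled by independent entry flips. A minimal sketch of such a noise model (the function, rates, and matrix below are illustrative assumptions, not the paper's generator):

```python
import random

def add_hybridization_noise(matrix, fp_rate, fn_rate, rng=random.Random(0)):
    """Return a noisy copy of a 0/1 hybridization matrix: each recorded
    hybridization (1) is lost with probability fn_rate (false negative),
    and each non-hybridization (0) is spuriously recorded with
    probability fp_rate (false positive)."""
    return [
        [(0 if rng.random() < fn_rate else 1) if a == 1
         else (1 if rng.random() < fp_rate else 0)
         for a in row]
        for row in matrix
    ]

clean = [[1, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 1]]
noisy = add_hybridization_noise(clean, fp_rate=0.05, fn_rate=0.10)
```

Correlated errors, such as those introduced by pooling, would require a richer model; independent flips capture only the marginal error rates.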

Other errors. Although we will concentrate on resolving the problems due to chimeric clones and hybridization errors, there are several other types of errors common in physical mapping.

Repeated occurrences of probes. The knowledge that each probe sticks to a unique location in the genome greatly increases the information yielded by a clone-probe match. STSs usually have a unique (or almost unique) occurrence in the genome, and therefore an STS actually defines a position in the genome. Other types of probes, however, may have a large number of occurrences, and therefore the information they provide is much weaker. An STS is usually difficult to find, while generating small probes is usually routine. Although it is computationally harder to deal with probes with repeated occurrences, it is important to use the information they can provide, due to the simplicity of the experiments involving them.

Errors in restriction fragment fingerprinting. Restriction maps are difficult to construct for entire genomes because the sites for the most suitable enzymes are distributed nonrandomly and are sometimes blocked by the action of methylation systems; restriction maps also fail to address the need of most map users for ready access to the cloned DNA. Errors specific to restriction fragment maps include measurement errors in the lengths of fragments, missing fragments, and extra fragments (due to a cut not occurring at some sites).

Gaps are regions of the genome not represented in a map. Typically there are many gaps in a map, and for statistical reasons complete closure of a map for a large genome is practically impossible. The problem is not only due to the random sampling of the genome libraries; some libraries do not contain all the sequences present in the genome. It is estimated that maps for the human genome are likely to contain a large number of gaps.

A theory of physical mapping

Despite the many experimental and theoretical studies of physical mapping, the very notion of a physical map is still poorly understood. In this section we present a formal mathematical model of the physical mapping problem using hybridization data. We use this model to investigate the algorithmic tractability of the problem. In particular, we quantify the amount of information inherent in hybridization data. This will allow us to evaluate the performance of mapping algorithms in a fair and rigorous way.

We start by examining hybridization data corresponding to perfect experiments. That is, we assume in this section that the hybridization matrix is error free: all probe-clone overlaps are faithfully identified, and no clones are chimeric. In the next section we will expand the model to include errors. The evaluation of these error-free data sets will show that there is a nontrivial structure to the information which we can hope to algorithmically retrieve. An understanding of this structure will be critical to the analysis of noisy data, that is, data in which hybridization errors and cloning errors occur.

In particular, we will show that many properties of the target DNA are not retrievable from a hybridization matrix alone. In fact, we will show that often there are several possible orders of probes along the target DNA which could have yielded the hybridization matrix. In these cases it is not reasonable to expect an algorithm to always determine the true order of the probes. Instead, we show that an algorithm of Booth and Lueker can be used to encode information which is true of all orderings which could have led to the hybridization matrix, and thus surely is true of the actual order.

Error-free hybridization experiments

The basic components of a physical mapping experiment are a target genomic region, a set of clones, a set of probes, and a hybridization matrix.

The goal of a mapping effort is to produce a map of some target genomic region. Let G be the genomic region for which we desire a genomic map. For the purposes of mapping, the region G is completely determined by its sequence of base pairs. We represent this sequence as an interval I(G) over the real line.

Experimentally, the target genomic region is studied through a set of clones. Each clone consists of one or more fragments of the target genome. Each fragment is naturally associated with the subinterval of I(G) corresponding to its piece of G. When a clone is a single fragment we call it a regular clone, and when it consists of at least two fragments we call it chimeric. Throughout this section it will be assumed that all clones are regular.

Information about the clones is gathered through hybridization with probes. An STS probe is designed to define a single location on the genome. Although an STS has measurable extent on the genome, for our purposes it is reasonable to consider it to mark a point location. Thus each probe is associated with a point in the interval I(G).

Genomic Placements. Using the correspondence between components and the interval I(G), we can now formally define the input to a mapping experiment. The experiment consists of a set of clones C = {C_1, C_2, ..., C_m} and a set of probes P = {P_1, P_2, ..., P_n}. Each regular clone C_i is defined by the location of its endpoints b_i and e_i on the interval I(G). Chimeric clones are defined by several pairs of endpoints. Each probe P_j is defined by its location p_j on I(G). We call the set of clones, plus the set of probes, plus the positions of all endpoints and probes, the genomic placement of the experiment. We represent the placement as (G, C, P, pos), where pos is the function which maps fragment endpoints and probes to their positions on I(G). Ultimately, the goal of a physical map is to find out as much as possible about the genomic placement.

Fingerprints and Hybridization matrices. The wet-lab portion of a physical mapping experiment takes genetic material, which can be described by a genomic placement, and returns experimental results. Just as a genomic placement can be used to describe any input to a hybridization experiment, we can formalize the possible outputs of an experiment. For a given set of clones, the fingerprint of a position in I(G) is defined to be the set of clones that contain that position. We will see that the output can be described as a set of fingerprints.

An actual experiment records the hybridizations of some probes with some clones. The results of all the hybridization tests can be encapsulated in a single hybridization matrix. The hybridization matrix A is defined by A(C_i, P_j) = 1 if the interval of C_i is determined experimentally to hybridize to probe P_j, and 0 otherwise. The recording of uncertainty about the hybridization result is possible by using values between 0 and 1, but we leave this extension for future work. In practice the hybridization is rarely tested directly but rather through a pooling scheme, which may introduce biases. In this section we assume that the testing is somehow done perfectly.
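The definition above can be made concrete: given clone intervals and probe point positions on I(G), the error-free hybridization matrix simply records containment. A minimal sketch (function and variable names are ours, for illustration):

```python
def hybridization_matrix(clones, probes):
    """clones: list of (begin, end) intervals on I(G); probes: list of point
    positions. Returns A with A[i][j] = 1 iff probe j lies inside clone i."""
    return [[1 if b <= p <= e else 0 for p in probes] for (b, e) in clones]

# Three regular clones and three probes on the unit interval.
clones = [(0.0, 0.9), (0.2, 0.6), (0.5, 0.8)]
probes = [0.1, 0.55, 0.7]
A = hybridization_matrix(clones, probes)
# The first clone spans all three probe positions, so its row is [1, 1, 1].
```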

It should be noted that the hybridization matrix does not contain any direct information about the actual positions of the endpoints of the clones or of the probes. Instead, each probe's hybridizations correspond to the fingerprint of the probe's position. Let us call a set of positions complete if their collection of fingerprints contains all possible fingerprints for the given clone arrangement; that is, the fingerprint of any position not in the set would be the same as some fingerprint in the set. If the set of probe positions is complete, then no additional fingerprints can be determined. Having multiple probes with the same fingerprint may give added reliability for data containing errors, and gives some evidence of the relative size of fragments; we will see that, in a certain sense, only the fact that there exists at least one probe with a given fingerprint is useful to physical mapping.

A hybridization event can equally well be described as a probe hybridizing a clone or a clone hybridizing a probe. This duality between clones and probes carries over to fingerprints. The fingerprints described above might be called probe fingerprints (i.e., for P_j the set pf(P_j) = {C_i | A(C_i, P_j) = 1}), while a clone fingerprint (i.e., for C_i the set cf(C_i) = {P_j | A(C_i, P_j) = 1}) could be defined analogously.
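Both kinds of fingerprint are directly computable from the matrix. A small sketch (the function names are ours; `pf` and `cf` follow the text):

```python
def probe_fingerprint(A, j):
    """pf(P_j): the set of clone indices i with A[i][j] = 1."""
    return {i for i, row in enumerate(A) if row[j] == 1}

def clone_fingerprint(A, i):
    """cf(C_i): the set of probe indices j with A[i][j] = 1."""
    return {j for j, a in enumerate(A[i]) if a == 1}

A = [[1, 1, 0],
     [0, 1, 1]]
assert probe_fingerprint(A, 1) == {0, 1}   # probe 1 is in both clones
assert clone_fingerprint(A, 0) == {0, 1}   # clone 0 contains probes 0 and 1
```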

The fact that the output of a hybridization experiment is restricted to fingerprint information will have a major impact on the possible maps which can be algorithmically determined. In particular, we will show that the exact sizes of the clones and the exact distances between probes are not determinable.

Mapping Information

The basic question of this paper is: What information about the genomic region can be determined from a given set of experimental results? We can make this question somewhat more formal by asking: What information about the genomic placement can be determined from a hybridization matrix? We still, however, lack a formal definition of information. In this section we start with some examples of possible types of information, and then show that it is only possible to compute some types of information from the hybridization matrix.

Some simple facts about the genomic placement are readily apparent from the hybridization matrix. The positive information that a clone contains a probe, that is, that the probe's position is within the clone's interval, is directly contained in the matrix. If two clones contain the same probe, and the probe occurs uniquely on the genome, then it can be deduced that the clones overlap. The negative information that a clone does not contain a probe can be used to refine the overlaps deduced from positive information.

The following two examples illustrate the interplay of positive and negative information.

Example. Suppose that clone C_1 contains two probes, P_1 and P_2, but no others, and clone C_2 contains P_1, P_2, and P_3. The fact that both clones share at least one probe shows that the clones overlap; the fact that clone C_2 contains an additional probe shows that on at least one side its interval extends beyond the interval of C_1.

Example. Let us now consider three regular clones C_1, C_2, C_3 and three probes P_1, P_2, P_3. Suppose that C_1 contains all three probes, C_2 contains P_1 and P_2, and C_3 contains only P_1. Then the positive information tells us that every pair of clones overlaps. This is enough to tell us that the clones mutually overlap, but is not enough to tell their order. The negative information, however, allows us to determine that the order of the positions, within the placement, of the probes and begin points of the clones must be b_1, p_3, b_2, p_2, b_3, p_1, or its reverse. All the end points must come after p_1, but their relative order is not defined.
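The interval reasoning used in these examples can be checked mechanically: for regular (single-interval) clones, a candidate probe order is consistent with the hybridization matrix only if each clone's probes occupy consecutive positions in that order (this is the consecutive-ones property discussed later). A small sketch using the containments of the second example (names are ours):

```python
def order_consistent(order, clone_contents):
    """True iff, under the given probe order, every clone's probe set
    occupies a consecutive block of positions -- a necessary condition
    for regular (single-interval) clones."""
    index = {p: k for k, p in enumerate(order)}
    for content in clone_contents:
        positions = sorted(index[p] for p in content)
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

# C1 contains P1, P2, P3; C2 contains P1, P2; C3 contains only P1.
clones = [{"P1", "P2", "P3"}, {"P1", "P2"}, {"P1"}]
assert order_consistent(("P3", "P2", "P1"), clones)       # the order derived above
assert not order_consistent(("P1", "P3", "P2"), clones)   # splits C2's probes
```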

Intuitively, probes yield information by witnessing regions in which clones overlap, or regions in which clones do not overlap. We have already noted that there are only a fixed number of possible probe fingerprints. Each fingerprint witnesses a region of overlap. It is possible, however, when one fragment is completely contained within another, for distinct regions to share the same fingerprint.

Mapping properties. In order to define mapping information more formally, we define a mapping property. A mapping property distinguishes between two different genomic placements. Formally, a mapping property divides the universe of possible genomic placements into two sets: those having the property and those not having the property. For example, one possible property is "There is a probe at position p in G." Another is "Probe j is between the position of the beginning of clone i and the beginning of clone i'."

Definition. A mapping property is a function from the set of all genomic placements to the set {0, 1}. If the function is 1 on a placement, then the placement is said to have the property; otherwise it does not have the property.

The information contained in a map can be defined as a collection of mapping properties about the genomic placement. Several natural levels of information can easily be defined. We will show that, though all are natural to define, it is not possible to determine some properties from the hybridization matrices alone.

Based on the definition of a genomic placement, the maximum amount of information which could possibly be in the matrix is the positions of the endpoints of each clone and the positions of the probes. We define Complete Mapping Information as knowing the beginning and end position of every clone and the position of every probe in G.

Complete mapping information allows us to determine the size of each clone and the distance between two consecutive probes. However, the hybridization matrix does not directly encode sizes, so we may have to settle for relative positions rather than absolute positions. The maximum amount of relative position information is a total order on the endpoints of clones and the positions of probes. We define Complete Order Information as knowing the order of the clone endpoints and probe positions along G.

For ease of exposition we ignore here the possibility that the positions of two components coincide. It is straightforward to extend the ordering to allow sets of equal components within the ordering. Typically we will not be able to distinguish a total order from its reverse order, but we will not repeatedly point this out.

As will become clear later, the relative order of some components may not be available. We might hope that a total order of the probes, or a total order of the clone endpoints, would be possible. Thus we define True Probe Order Information as knowing the order of the probe positions along G. Similarly, True Clone Order Information is knowing the order of the clone begin and end points along G.

In fact, as we will show below, it is rarely possible to determine all the information at these total ordering levels. Instead, it will be possible to quantify exactly what information can be determined from the matrix.

Algorithmically retrievable mapping information

Having defined the input and output of an experiment, and the information one might hope to retrieve from an experiment, we are now ready to show that some information is irretrievable. Intuitively, a property is irretrievable from a hybridization matrix if, based on the hybridization matrix alone, one cannot be sure whether the property holds for the input genomic placement or not. If information is irretrievable, then no algorithm, regardless of computational power, can determine the information with certainty. Formally:

Definition. A genomic placement is consistent with a hybridization matrix if the matrix rows can be placed in one-to-one correspondence with the placement's clones, and the matrix columns can be placed in one-to-one correspondence with the placement's probes, so that an entry in the matrix is a one if and only if the corresponding probe is contained in the corresponding clone.

Definition. A mapping property for a hybridization matrix is algorithmically irretrievable if there exist two genomic placements consistent with the matrix such that for one the mapping property is true and for the other the mapping property is false.

Prob e restrictions

As a first step toward determining the retrievable information from a hybridization matrix, we define the probe restriction of a placement by a matrix. The intuitive idea is that if a matrix does not contain a complete set of fingerprints, then there is missing information. Furthermore, the fingerprints only witness regions of overlap, and not the exact endpoints of fragments.

Since a hybridization matrix may not contain a complete set of fingerprints for its input genomic placement, we define a canonical placement for which the matrix is complete.

Definition: Let E = (G, C, P, pos) be a genomic placement. The probe restriction of E is E|_P = (G, C', P, pos'), where C' consists of one clone C'_i for each clone C_i in C such that at least one probe is in C_i, the value of pos'(P_j) = pos(P_j), and the values of pos'(b'_i) and pos'(e'_i) equal the positions of the first and last probe contained in the corresponding clone C_i.

Thus the probe restriction is derived from a placement by shrinking each clone, so that C'_i is the interval from the position of the first probe contained in C_i to the position of the last probe in C_i.
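For concreteness, the shrinking operation can be sketched as follows. This is our own illustration of the definition, with clones as closed intervals and probes as points; the function name is ours.

```python
def probe_restriction(clones, probes):
    """Shrink each clone to the interval spanned by the first and last
    probe it contains; clones containing no probe are dropped.
    clones: list of (begin, end) intervals; probes: list of positions.
    Returns the list of shrunken (new_begin, new_end) intervals."""
    restricted = []
    for (b, e) in clones:
        inside = [p for p in probes if b <= p <= e]
        if inside:  # keep only clones hit by at least one probe
            restricted.append((min(inside), max(inside)))
    return restricted
```

For example, a clone (0, 10) probed at 1, 5, and 7 shrinks to (1, 7), a clone (4, 6) shrinks to the degenerate interval (5, 5), and a clone (8, 9) containing no probe disappears.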

Lemma: For any genomic placement E = (G, C, P, pos), the set of probes P is complete for the set of clones in E|_P.

Theorem: Any mapping property which is true of a genomic placement but not true of its probe restriction is algorithmically irretrievable.

Proof: Any hybridization matrix which is consistent for the genomic placement is also consistent for its probe restriction.

Unless a hybridization matrix happens to contain probes which coincide with every endpoint of a clone, then for at least one clone the positions of the clone's endpoints are different in the genomic placement and its probe restriction. Even if the hybridization matrix does contain all such probes, a genomic placement in which the positions are perturbed slightly will yield the same matrix. Thus, unless the probes occur at every possible position, thereby eliminating the possibility of perturbation, the exact position of clone endpoints or probe positions is irretrievable. Excluding this highly improbable situation, we have the following corollary.

Corollary: The exact-position level of mapping information is not algorithmically achievable.

Since this theorem and its corollary preclude any property which depends on exact positions of probes and clones from being retrievable, we define a restricted class of properties which depend only on probe order.

Consecutive ones matrices

Since information about the exact position of clones and probes is irretrievable, we turn to information about relative position. The importance of probe restrictions in defining the available information points toward looking at the relative order of probes. If we determine the relative order of the probes, then we can retrieve the relative order of the clones in the probe restriction, which is as much as we can expect. In the probe restriction, in some sense, a clone is its ordered set of probes.

In order to examine probe orders we introduce the following notation. Let Perm(n) be the set of permutations of the set {1, ..., n}. A permutation π in Perm(n) is a probe order, and we denote it by π = P_{π(1)} P_{π(2)} ... P_{π(n)}. Finally, let A^π be the matrix resulting from permuting the n columns of A according to the permutation π.

Definition: A probe order π is consistent iff, for every clone C_i of C, the probes in the clone fingerprint of C_i occur consecutively in π.

Definition: The true order π₀ of the genomic placement (G, C, P, pos) is the order of the probes on G.

Certainly, for error-free matrices, π₀ is consistent, but there need not be a unique consistent probe ordering. In extreme cases, requiring consistency may not eliminate any permutations. For example, if the number of clones is equal to the number of probes, and clone i hybridizes probe j if and only if i = j, then any probe ordering is consistent. At the other extreme, it is possible to construct a matrix for which only one order (and its reverse) is consistent. In simulations such as those described in the section on data quality, an m × n hybridization matrix of regular clones both has a number of consistent probe orders which is exponential in n and has exponentially many non-consistent orders.

Despite the lack of specificity of the requirement that a probe order be consistent, it is nonetheless an important concept for mapping. In particular, it is often interesting to know whether any consistent orders exist. In the literature such matrices occur in many contexts and are defined as follows.

Definition: A matrix A has the consecutive ones property (C1P for short) if there exists a permutation π in Perm(n) such that in the column-ordered matrix A^π every row has all the ones occurring consecutively. Such a permutation is called a C1P-permutation for A.

Thus a matrix is C1P iff there exists a consistent order for it.
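On small examples, C1P-permutations can be enumerated directly from the definition. The following brute-force sketch is ours and is exponential in the number of columns, for intuition only; the efficient construction is the PQ-tree algorithm discussed below.

```python
from itertools import permutations

def has_consecutive_ones(row):
    """True iff the ones in the row occur in one contiguous block."""
    ones = [j for j, v in enumerate(row) if v == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def c1p_permutations(matrix):
    """All column orders pi making every row of A^pi have its ones
    consecutive, found by exhaustive search over Perm(n)."""
    n = len(matrix[0])
    result = []
    for perm in permutations(range(n)):
        permuted = [[row[j] for j in perm] for row in matrix]
        if all(has_consecutive_ones(r) for r in permuted):
            result.append(perm)
    return result
```

For instance, the matrix with rows 1 0 1 and 0 1 1 is not in consecutive-ones form as given, but the column order (0, 2, 1) makes both rows consecutive, so the matrix is C1P.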

Although the matrix A^{π₀} is C1P when all the clones are regular and there are no hybridization errors, it is not clear whether one can make stronger statements about it. As we will see, in some ways knowing that the true order is consistent is all that one can know. Therefore we define the basic notion of a map as follows.

Definition: Let (G, C, P, pos) be a genomic placement and A its hybridization matrix. A map for the genomic placement is a column-ordered matrix A^π, where π is a consistent probe order. The true map of the genomic placement is A^{π₀}, where π₀ is the true probe order.

Booth and Lueker's PQ-tree algorithm

Since the true order is a consistent order, it is desirable to determine all possible consistent orders of a hybridization matrix. Fortunately, the problem of determining all C1P orders is well understood. In fact, a linear-time algorithm due to Booth and Lueker constructs the set of all such permutations in a concise way: the so-called PQ-trees.

Since the PQ-trees will be important to the remainder of this section, we give a sketch of the theory of PQ-trees here. Readers interested in a more detailed discussion are encouraged to read Booth and Lueker's paper. The PQ-tree is a method of encoding a set of permutations. It consists of three types of objects: leaf-nodes, P-nodes, and Q-nodes. The leaf-nodes contain the basic elements being permuted, in our case the columns of the matrix. We label the leaf nodes {1, ..., n}.

The allowable orderings of the leaf nodes are constrained by their inclusion in P- and Q-nodes. All the leaf nodes in a Q-node are kept in a fixed order. That is to say, in any permutation in the set encoded by a PQ-tree, the leaf nodes within a Q-node appear in the same order, but in either possible direction (e.g., 1 2 3 or 3 2 1). The leaf nodes in a P-node, on the other hand, have no specified order. That is to say, in any permutation in the set encoded by a PQ-tree, the leaf nodes within a P-node can occur in any order (e.g., 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, or 3 2 1).

Thus, if all the leaf nodes were in a single Q-node, then the set of permutations encoded would be just the order within the Q-node and its reverse, while if all the leaf nodes were in a single P-node, then the set of permutations encoded would be all permutations of the leaf nodes.

A PQ-tree allows P- and Q-nodes to be placed within other P- and Q-nodes. Thus each constituent of a node can either be a leaf or a set of possible orderings of a subset of the leaves. For example, the set of permutations which keep the sets {1, 2}, {3, 4}, and {5, 6} together, but allow any ordering within sets or among sets (i.e., 2 1 3 4 6 5 but not 1 3 2 4 5 6), can be represented as a P-node containing three P-nodes, each of which contains the leaves in one set.
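The nesting just described can be made concrete with a toy encoding of PQ-trees. This is our own minimal representation for illustration, not Booth and Lueker's data structure: a leaf is an integer, a ('P', children) node allows any child order, and a ('Q', children) node allows only the given order or its reverse.

```python
from itertools import permutations

def frontiers(node):
    """Enumerate the leaf orders (frontiers) encoded by a toy PQ-tree."""
    if isinstance(node, int):  # a leaf contributes just itself
        yield (node,)
        return
    kind, children = node
    # P-node: any child order; Q-node: the given order or its reverse.
    orders = permutations(children) if kind == 'P' else [children, children[::-1]]
    seen = set()
    for order in orders:
        for combo in _concat_frontiers(order):
            if combo not in seen:
                seen.add(combo)
                yield combo

def _concat_frontiers(children):
    """All concatenations of one frontier per child, in the given order."""
    if not children:
        yield ()
        return
    for head in frontiers(children[0]):
        for tail in _concat_frontiers(children[1:]):
            yield head + tail
```

A single Q-node over three leaves encodes exactly two frontiers (the order and its reverse); a single P-node over three leaves encodes all six; and a P-node over three two-leaf P-nodes encodes 3! x 2 x 2 x 2 = 48 frontiers, matching the grouped-sets example above.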

The surprising property of PQ-trees is that each PQ-tree corresponds to all possible consistent orderings of a matrix (like our hybridization matrices), and that each matrix has a corresponding tree. Thus, by using the Booth and Lueker algorithm, it is possible to encode all possible consistent maps in one PQ-tree.

Knowing all possible consistent maps allows us to be certain that we have the true map in our set, but the number of consistent maps may be very large. The PQ-tree, however, does more than just list all consistent maps: it tells us a great deal about the structure of these maps. Intuitively, it tells us two types of information about the maps. On the one hand, it encodes consensus information, that is, parts of the ordering which are true of all consistent maps, through the leaves contained directly in Q-nodes. On the other hand, it encodes restricted variability through the grouping supplied by the P- and Q-nodes.

Because the PQ-tree can nest P-nodes and Q-nodes, there is actually a continuum from strong consensus information to weak variability information. At the strongest end are leaves within Q-nodes. The order of these leaves is known to be the same in all consistent maps, and therefore in the true map. Next strongest are leaves within P-nodes. The leaves are known to occur together, with no intervening leaves, in all consistent maps. Although their local order is unknown, the global fact that they occur together is guaranteed to be true of the true map.

As one works one's way up from the leaf nodes, the information encoded refers to larger and larger portions of the genome (when the PQ-trees refer to hybridization matrices). Thus a high-level Q-node might represent the fact that several segments occur in a fixed order, although the order within the segments is not fixed. A high-level P-node gives only the weaker information that there are segments, but nothing about their order.

Thus, given the PQ-tree for a hybridization matrix, the uncertainty about the true order is due to the P-nodes. If no P-nodes exist, then the tree collapses to a single Q-node. On the other hand, if any P-nodes exist, then the complete true map is not retrievable.

Lemma (Unique consistent map): A C1P matrix has exactly two C1P-permutations, a permutation and its reverse, iff its PQ-tree consists of exactly one non-leaf node, which is a Q-node.

Lemma (P-node information about the true map is irretrievable): It is algorithmically irretrievable to find the order of the children of a P-node in the PQ-tree of a hybridization matrix.

Proof: By definition, there are two orderings of the children which result in consistent orderings for the matrix, and each of these orderings corresponds to a different probe-restricted genomic placement.

The true map is almost never algorithmically retrievable

If the hybridization matrix (the input of the mapping problem) were to assure that the map is unique up to reversal, then the task of mapping would be clear: find the map. In this case the map must be the true one. The next theorem relates the question of finding the true map to the question of finding a unique consistent order of a matrix.

Theorem: Consider a genomic placement and its hybridization matrix A. There is an algorithm which takes as input A and computes the true map if and only if the matrix has a unique consistent order up to reversal.

The theory of when a matrix has a unique C1P ordering will be presented elsewhere, but the conditions necessary are quite strict. We do not know of any biologically relevant model of matrix generation in which the resulting matrices are likely to meet the conditions for unique C1P ordering. In other words, the hybridization matrix will almost never have a unique C1P ordering. This means that it will almost always be beyond the capability of any algorithm, irrespective of the amount of computing time used, to compute the true map.

Given that the true map cannot be computed, the natural question then is: what is a fair goal for a mapping algorithm? We know that the true map must be C1P, but there may be exponentially many C1P maps to consider. Since computing the one true map seems to be problematic, we instead ask what information about the true map is retrievable. For example, although we cannot know the complete order corresponding to the true map, we might be able to determine a partial order which is true of all C1P maps.

PQ-trees encode information common to all maps

We are now looking for information which is true of all C1P maps, and therefore true of the true map. Fortunately, the PQ-tree gives us exactly what we want.

Theorem: A mapping property is retrievable from a given hybridization matrix if and only if either the property is true for all genomic placements which are consistent with a permutation encoded by the PQ-tree of the hybridization matrix, or it is false for all such genomic placements.

Proof (if): In order for the property to be irretrievable, there would have to be two genomic placements consistent with the matrix, one for which the property was true and one for which the property was false. However, by construction, the PQ-tree encodes all consistent permutations, so a genomic placement which is consistent with the matrix must be consistent with some permutation encoded by the PQ-tree. By assumption, such placements are either all true or all false for the property, and thus the property is retrievable.

(only if): By the definition, a mapping property is irretrievable if it is true of one consistent placement and false of another.

We have already seen that the order of the children of a P-node is irretrievable. The theorem above implies that the order of the children of a Q-node meets our definition of retrievable. Similarly, other structural properties of the PQ-tree, such as that certain probes are in the same subtree, are retrievable. It seems, in fact, that the only useful retrievable properties are those which describe the structure of the PQ-tree.

Although we have been forced to progressively reduce our goal concerning the information about the genomic placement, all has not been lost. We have shown that it is possible to retrieve information about the genomic placement which is encoded in the PQ-tree, and in some sense this is the most information which we can expect to retrieve.

Unfortunately, when we look at mapping in the presence of errors, such an algorithm will not exist. In the next sections we examine in more detail the sort of information present in the PQ-tree. Since we do not know how to create PQ-trees, or any similar structure, when the data contains errors, it will be our goal to describe as much information encoded by the PQ-tree as possible in terms which are independent of the PQ-tree.

As noted above, the PQ-tree contains a complex amount of mapping information common to all maps. The recursive structure of the tree induces a hierarchy from local information (pairwise probe adjacencies) to global information (components which must be together).

Local resolution: adjacent pairs of probes

The finest level of resolution concerning the ordering of the probes is pairwise adjacencies. Let us consider a revealing case: mapping instances with unique (up to reversal) maps. What can we say about the pairwise adjacencies of probes in the unique map? In a trivial way, they are the same in all maps.

Definition: Let us call a pair of probes (P_i, P_j) a strong adjacency if P_i and P_j are adjacent in every map, i.e., every consistent order. An adjacency of a map is weak if it is not strong.

In this terminology, an instance of the mapping problem has a unique solution if and only if all adjacencies are strong. What are the strong adjacencies when the mapping problem has multiple solutions? The answer is the set of adjacencies given by the Q-nodes which have only leaves for children.

Clearly, for any map, an adjacency of two probes is strong (or fixed) if they are adjacent in all maps. On the other hand, an adjacency of two probes in map A^{π₁} is weak when there exists another map A^{π₂} such that in this map the two probes are not adjacent.

As we consider probe orders up to reversal, this distinction, adjacent or separated, constitutes a complete analysis of pairwise properties. This is exactly the mapping information captured by the Q-nodes which contain only leaves. For weak adjacencies we can find out from the PQ-tree all their degrees of freedom. For example, one can determine all probes which are ever adjacent, in any map, to a given probe.
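On small instances, strong adjacencies can be found by brute force directly from the definition, by intersecting the adjacency sets of all consistent orders. The sketch below is our own illustration and is exponential in the number of probes; the PQ-tree yields the same information efficiently.

```python
from itertools import permutations

def consistent_orders(matrix):
    """Yield all column orders in which every row's ones are
    consecutive (the C1P-permutations), by exhaustive search."""
    n = len(matrix[0])
    for perm in permutations(range(n)):
        ok = True
        for row in matrix:
            ones = [pos for pos, j in enumerate(perm) if row[j] == 1]
            if ones and ones[-1] - ones[0] + 1 != len(ones):
                ok = False
                break
        if ok:
            yield perm

def strong_adjacencies(matrix):
    """Unordered probe pairs adjacent in every consistent order."""
    pairs = None
    for perm in consistent_orders(matrix):
        adj = {frozenset((perm[k], perm[k + 1])) for k in range(len(perm) - 1)}
        pairs = adj if pairs is None else pairs & adj
    return pairs or set()
```

For the matrix with rows 1 1 0 and 0 1 1, only the order 0 1 2 and its reverse are consistent, so both adjacencies are strong.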

Global resolution: components in the clone-cover graph

At the other extreme from strongly adjacent probe pairs, there are probe pairs for which there is no information relating the probes within the pair. Ultimately, all information concerning the relative position of probes derives from having clones which cover more than one probe. Two probes that are adjacent on G are likely to be covered by clones, i.e., both would hybridize to a number of clones. The more clones that cover them, the stronger is the linkage between them. On the other hand, probes that are far away on the genome will be unlikely to be covered by a common clone.

The structure of this coverage is recorded in the following graph.

Definition: The clone-cover graph of A is given as follows: CC = (V, E), where V = P and E = {(P_i, P_j, w) | P_i, P_j ∈ P, w = the number of clones containing both P_i and P_j}.
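As an illustration (our code, not the paper's), the weighted edges of the clone-cover graph and its connected components can be computed directly from the matrix:

```python
from itertools import combinations

def clone_cover_graph(matrix):
    """Positive-weight edges of CC(A): for each probe pair (i, j),
    the number of clones (rows) containing both probes."""
    n = len(matrix[0])
    weight = {}
    for i, j in combinations(range(n), 2):
        w = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
        if w > 0:
            weight[(i, j)] = w
    return weight

def components(matrix):
    """Connected components of CC(A), ignoring zero-weight edges,
    via a small union-find over the probes."""
    n = len(matrix[0])
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j) in clone_cover_graph(matrix):
        parent[find(i)] = find(j)
    groups = {}
    for p in range(n):
        groups.setdefault(find(p), set()).add(p)
    return sorted(map(frozenset, groups.values()), key=min)
```

Two clones covering disjoint probe pairs, for example rows 1 1 0 0 and 0 0 1 1, produce two unlinked components {0, 1} and {2, 3}.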

Clearly, the connected components of the graph CC (disregarding edges of weight 0) correspond to disjoint and unlinked regions on G. In the PQ-tree terminology, they are exactly the children of a P-node at the root of the PQ-tree. By the definition of a P-node, any ordering of its children defines a valid C1P-permutation. It is then a corollary of the theorem above that the true order of the components of CC cannot be computed. That is, pairs of probes from distinct components have no information relating them.

On the other hand, we do know something about probe pairs within a component: they should not have probes from outside the component between them.

Since the clone-cover graph removes explicit information about clones and retains only counts of their coverage, it is interesting to ask how much information is lost. It turns out that the clone-cover graph still contains enough information to determine whether the matrix is C1P or not.

Lemma: The mapping information contained in the clone-cover graph CC(A) is sufficient for determining whether A is C1P.

Matrix singularities

There are certain properties of a matrix which guarantee that the PQ-tree will lose information. Two examples are non-hybridizing probes and redundant probes.

It is possible that a probe does not hybridize any clones. Its corresponding column in the matrix will be all zeroes. The probe will thus be a singleton component in the clone-cover graph, and all we know about it is that it does not fit inside any other component. Clearly, we can just discard these probes and lose nothing. However, when the data contains errors, it may be difficult to identify these probes, since errors may cause them to not correspond to columns which are all zeroes.

Another possibility is that two or more probes hybridize exactly the same clones. In this case they will be inside a leaf P-node, since any ordering yields the same matrix. The probes may be adjacent on the genome or non-adjacent. In the adjacent case, no clone happened to begin or end between them. Perhaps they are very close together, or perhaps the intervening region is uncloneable. In the non-adjacent case, it may be that all clones which start between them also end between them. This seems to be a less likely possibility, but definitely can occur.

As with non-hybridizing probes, it is tempting to fix the matrix by removing all but one copy of each redundant set. However, it may again be difficult to do so when errors make probes which hybridize the same clones appear different.

Experimental errors

In the previous section we showed that the theory of PQ-trees captures the information available in a hybridization matrix from an ideal experiment. In the ideal experiment, every clone corresponded to a single contiguous region on the target DNA, every probe corresponded to a unique location on the target DNA, and every overlap of clone and probe was correctly witnessed by a hybridization experiment and faithfully recorded in the hybridization matrix.

Unfortunately, molecular biology, like all experimental sciences, does not produce perfect data. As was described earlier, the actual experiments are much more complicated than the simple model used in the last section. A variety of errors, derived from the lack of desired precision of the experiments, some of which are inherent to the current technology, make the resulting data less than ideal. The determination of which probes stick to which clones will yield both false positives and false negatives. Some sections of the target DNA will tend to shatter into tiny pieces which are lost, while other sections will contain no probe sequences. Some clones will correspond to multiple regions on the target DNA.

In this section we will formalize the effect of errors on hybridization matrices and describe what an algorithm might be expected to do to counter their effects. Unlike the error-free case, we will not be able to point to an algorithm which captures the information in the hybridization matrix. Instead, we discuss what sorts of information are likely to be retrievable and give some possible strategies for finding information.

Extending the model to data with errors

Prior to the inclusion of errors, the major pillar of our theory was the fact that if the columns of the hybridization matrix were ordered in the same order as the probes in the genomic placement, then the matrix was C1P. Unfortunately, all our types of errors invalidate this guarantee. Thus we are led to ask again: What does a map look like?

One possibility would be that enough were known about the errors that we could at least have a statistical model of what the true ordering of the hybridization matrix would look like. Unfortunately, very little is known about the statistical nature of the errors. It is possible to place some bounds on the error behavior, but no hard and fast rules exist. For example, one might assume that all or most clones will have one or two fragments, or assume that no clone will have more than a few percent false positives, but these would only be guesses. Therefore we present a theory in which the nature of the errors is abstracted into a general cost function.

We begin our formalism by returning to our basic assumption: our goal is to retrieve as much information about the genomic placement as possible from the hybridization matrix. We know from the error-free case that at most we can determine the true probe ordering along the genomic placement. From the probe ordering we were able to infer from the matrix the probe-restricted placement, which gave us an ordering of the clones.

When the data contains errors, the probe ordering does not suffice to tell us the clone order. For example, if a clone's row in the true ordering is 1 1 0 1, then we could have a single fragment hybridizing probes 1 and 2 only (that is, the clone without errors would read 1 1 0 0), or a single fragment hybridizing probes 1 through 4 (that is, the clone without errors would read 1 1 1 1), or two fragments, one hybridizing probes 1 and 2 only and the other hybridizing only probe 4, or many other possibilities. The three possibilities mentioned correspond to the least amount of just false positives, just false negatives, or just chimeric clones which could explain the row's value.

Thus we extend our definition of a map from the error-free case to require both a permutation of the columns of the matrix and an explanation of the errors which could lead to each row's value.

Definition: An explanation φ of a permutation π of a hybridization matrix is a function which maps each row of the matrix to a set of fragments in a probe-restricted placement.

For example, the explanation of a C1P matrix might map each row to the single fragment which begins at the first probe in the row's block of ones and continues to the last probe. For non-C1P matrices, the explanation may involve marking some entries of the matrix as being incorrect. In the error-free case, we defined a consistent π as meaning that each row corresponded to a single continuous fragment, or, equivalently, that each row contained a single contiguous block of ones. We extend the notion of consistency to data with errors by allowing the explanation to fix hybridization errors and/or split chimeric clones.

Definition: The probe order-explanation pair (π, φ) is consistent if, for every clone C_i of C, the correspondence created by φ causes all and only probes within fragments to have value one in the matrix.

Since there are many possible explanations of any matrix row, we need some way of determining which are best. We therefore define the concept of a noise model. A noise model encodes what is known (or thought to be known) about the possible errors in the data by specifying a set of explanations which undo the errors and a cost function which rates the likelihood of each explanation being correct. The cost function need not be a formal likelihood function, but should as faithfully as possible reflect what is known about the errors.

Definition: A noise model N consists of a set of possible explanations and a function which takes each explanation φ on a fixed permutation π of a hybridization matrix A to a cost f(φ, A^π) ∈ R.

Just as we can extend the notion of consistency to data containing errors by including an explanation of the errors, we can extend the notion of map to include not just an ordering but also an explanation drawn from the noise model.

Definition: Let (G, C, P, pos) be a genomic placement, A its hybridization matrix, and N a noise model. A map for the genomic placement is a column-ordered matrix A^π and an explanation φ such that (π, φ) is consistent for A. The true map of the genomic placement is A^{π₀}, where π₀ is the true probe order and the explanation describes the errors that actually occurred.

As in the error-free case, we are unlikely to be able to determine whether a particular map is the true map. In fact, we cannot at this point rule out any probe ordering, because there may always exist some explanation which is consistent with the ordering for the matrix.

We define three simple noise models.

Definition: The false positive only (FPO) noise model allows explanations of the form "the entries of the matrix in set S are changed from 1 to 0," where S is any subset of matrix entries whose value is 1 and the resulting matrix is C1P. The mapping to fragments then proceeds as in the C1P case. The cost of an explanation is the size of S.

Definition: The false negative only (FNO) noise model allows explanations of the form "the entries of the matrix in set S are changed from 0 to 1," where S is any subset of matrix entries whose value is 0 and the resulting matrix is C1P. The mapping to fragments then proceeds as in the C1P case. The cost of an explanation is the size of S.

Definition: The chimera only (CO) noise model allows explanations of the form "the entries which are 1s in each row are partitioned into at most two groups such that each group occurs consecutively." The mapping to fragments then proceeds as in the C1P case, except that rows having two groups are mapped to two fragments. The cost of an explanation is the number of rows having two groups.
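For a fixed permutation, the cheapest explanation under each of these three models can be computed row by row: under FPO one keeps the longest run of ones and flips the rest to 0; under FNO one fills the zeroes between the first and last one; under CO one counts the rows whose ones form two blocks (rows with more than two blocks have no CO explanation). The following sketch is our illustration of these cost functions; finding the best permutation remains the hard part.

```python
def row_costs(row):
    """Cheapest explanation cost of one (already permuted) row under
    each model: FPO flips 1->0, FNO flips 0->1, CO splits the row into
    at most two consecutive groups (None means infeasible)."""
    blocks, start = [], None
    for j, v in enumerate(list(row) + [0]):  # sentinel closes a trailing run
        if v == 1 and start is None:
            start = j
        elif v == 0 and start is not None:
            blocks.append((start, j - 1))
            start = None
    if not blocks:
        return {'FPO': 0, 'FNO': 0, 'CO': 0}
    ones = sum(row)
    fpo = ones - max(e - s + 1 for s, e in blocks)
    fno = blocks[-1][1] - blocks[0][0] + 1 - ones
    co = len(blocks) - 1 if len(blocks) <= 2 else None
    return {'FPO': fpo, 'FNO': fno, 'CO': co}

def matrix_cost(matrix, perm, model):
    """Total cheapest-explanation cost of A^perm under one model,
    or None if some row is infeasible under that model."""
    total = 0
    for row in matrix:
        c = row_costs([row[j] for j in perm])[model]
        if c is None:
            return None
        total += c
    return total
```

A row reading 1 1 0 1 costs 1 under each model (drop one 1, fill one 0, or declare one chimeric split), while reordering its columns to produce 1 1 1 0 reduces every cost to 0.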

These simple noise models correspond to our three major error types. Each has the desirable property that the cost of the null explanation is always minimal, and that if more errors are required by the explanation, then the cost is higher. We call noise models with this property monotonic, and we will see later that monotonicity is a helpful property in algorithm design.

There are many possible variants. The cost of changing two adjacent zeroes to ones might be less than the cost of changing non-adjacent zeroes, thereby approximating the effects of deletions. The cost of changing two adjacent ones to zeroes might be higher than the cost of changing two distant ones, thereby penalizing the use of false positives to remove true chimera. The chimeric noise model might allow an unlimited number of groups per row; in this case the cost might be defined as the maximum number of groups per row, or as the total number of groups.

We are not arguing for a particular noise model, but merely providing a framework in which to define them. For example, a noise model which is designed to match one type of error might also be useful when other error types occur but the one type is dominant. The simple noise models also provide intuition for algorithm design.

Some noise models are very forgiving, in that any permutation admits some explanation.

Lemma: Under the FPO or FNO noise model, for any hybridization matrix and any permutation, there always exists a consistent explanation for the permutation on the matrix.

Proof: Under FPO, all ones could be eliminated, or, more reasonably, all ones except those in a single block on each row. Under FNO, all zeroes could be eliminated, or, more reasonably, all zeroes between the first and last one in each row.

In order to limit the power of forgiving noise models, and in general to reduce the number of candidate permutation-explanation pairs, we make the parsimony hypothesis.

Definition: The maximum parsimony set (MPS) for a hybridization matrix is the set of permutation-explanation pairs with the lowest cost.

Unlike the error-free case, in which we could limit our search to C1P maps and be certain that the true map was in the search set, limiting our search to the MPS does not guarantee that the true map is in the search set. We will thus often want to consider all permutation-explanation pairs with close to minimal cost, since we are then more likely to include the true map.

NP-completeness barriers

It has been shown elsewhere (though not in these terms) that finding even one member of an MPS for some noise models is NP-complete. Readers not familiar with NP-completeness may want to look at Garey and Johnson's standard reference. Intuitively, a problem being NP-complete means that computer scientists agree that no efficient algorithm exists which solves the problem exactly. For example, the CO model allows one to search for an ordering of the matrix with at most two blocks of ones per row, and finding such an ordering is known to be NP-complete. The FNO noise model is closely related to the NP-complete problems of bandwidth and of envelope.

In fact, we conjecture that it will be NP-complete to find a single member of the MPS for any noise model corresponding to real experiments.

Reasonable goals

Since the NP-completeness results imply that we cannot efficiently find a best permutation-explanation pair for a given hybridization matrix, we are in a similar position to not being able to distinguish between C1P matrices in the error-free case. The difference is that now we have no algorithm, like the PQ-tree algorithm, to capture all the possibilities.

Thus we aim for a subset of what the PQ-trees told us. In particular, we have looked at how many strong adjacencies we can find. We might hope that there will be regions of the genome that are so well covered by clones that, even in the presence of errors, there will remain strong adjacencies. We also ask what the connected components in the CC graph are. False negatives tend to increase the number of connected components, while false positives and chimera tend to decrease them. We can also look for components that are more strongly connected. We might demand that there be no single edge whose removal would split the component into two components. This would remove some connectivity due to errors, while having less effect on well-connected true components.
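The last idea, discarding connectivity that hangs on a single edge, amounts to finding bridge edges in the clone-cover graph. The following naive sketch is ours (a linear-time bridge-finding algorithm exists, but removing each edge in turn is clearer for small graphs):

```python
def cc_edges(matrix):
    """Positive-weight edges of the clone-cover graph of A."""
    n = len(matrix[0])
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if any(row[i] == 1 and row[j] == 1 for row in matrix)]

def bridges(matrix):
    """Edges whose removal splits a component of the clone-cover
    graph: candidates for connectivity created by errors alone."""
    edges = cc_edges(matrix)
    def reach(src, skip):
        # Depth-first search over all edges except the skipped one.
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for (a, b) in edges:
                if (a, b) == skip:
                    continue
                if a == u and b not in seen:
                    seen.add(b); stack.append(b)
                if b == u and a not in seen:
                    seen.add(a); stack.append(a)
        return seen
    return [(a, b) for (a, b) in edges if b not in reach(a, (a, b))]
```

In a path-shaped graph every edge is a bridge, while adding a clone that closes a triangle leaves no bridges, which is the kind of redundancy a well-covered true component provides.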

Evaluating Algorithms on Hybridization Data

It is a postulate of algorithmic work on physical maps that the ultimate test of an algorithm is its success on real genomic data. However, our limited understanding of real data, and the paucity of existing data for which we know the correct map, require that algorithms be evaluated on synthetic data. In work which will be reported elsewhere, we are working with the Los Alamos Center for Human Genome Studies and with the Whitehead Institute to evaluate our algorithms on real data, but the purpose of this paper is to explore the evaluation of algorithms on synthetic data.

Our evaluation consists of two parts: using data designed to match the observed characteristics of a particular data set from Los Alamos, and using data chosen to cover a range of possible data characteristics. Both sets of data are produced using a matrix generator based on a simple model of probe and clone placement and of error occurrences. The simplicity of the model makes it relatively easy to control the characteristics of the generated matrices. On the other hand, the simplicity undoubtedly obscures some important details of actual matrices. One of the goals of our experiments has been to evaluate the generator. More will be said about this in later sections. Here we simply describe the generator used.

Generator

Input

The hybridization matrix generator used in this work has the following essential inputs:

- Number of probes and clones. Each matrix will have one column for each probe and one row per clone. Error-free clones consist of a single fragment, which is determined to hybridize a consecutive set of probes (see below for more detail).

- Error rates: chimerism rate, false negative rate, and false positive rate. Each clone has an independent probability of being chimeric and thereby consisting of two fragments rather than one. Each position in the matrix corresponding to a probe which hits a clone has an independent probability of being a false negative and thus being recorded as a zero instead of as a one. Each position in the matrix corresponding to a probe which does not hit a clone has an independent probability of being a false positive and thus being recorded as a one instead of as a zero.

- The range of clone sizes. Each clone has an independent probability of being a small clone or a large clone. Size ranges are specified for both small and large clones.

Description of the generator algorithm

The generator begins by choosing positions for the probes. The target genetic data is abstracted as an interval of the real line. Probe positions are chosen independently and uniformly at random in this interval; this leads to a Poisson distribution of their locations. The locations are sorted by position to produce the correct order of the probes.

Once the probes' positions on the target genome have been chosen, each clone is generated and a corresponding row of the matrix is created. For each clone, an initial fragment is chosen by randomly choosing a size range according to the input probability and then choosing a size uniformly at random in that range. A start position on the target genome is then chosen uniformly at random from those positions at which a fragment of this size could occur (i.e., not too close to the end). The fragment thus corresponds to a subinterval of the interval corresponding to the target genome. The start positions are Poisson distributed while the end positions are not. Comparison with a generator which chooses both the start and end positions of the clones according to Poisson distributions would be an interesting follow-on experiment; the current approach, however, allows rapid generation and a significant amount of control over clone sizes.

Once the start and end position of a fragment have been determined, the identity of those probes whose positions fall in the fragment's interval can easily be determined. The size of the clone determines the expected number of probes which hit the clone. If the clone is determined to be chimeric, then a second fragment is generated in the same manner as the first. Since the two fragments may overlap, the clone may or may not correspond to two intervals. In the row of the matrix created for this clone, each column corresponding to a probe which is not contained in the clone's intervals is initially set to 0; if a coin tossed with the input false positive probability indicates a false positive, then the matrix entry is changed to 1. Similarly, each column corresponding to a probe which is contained in the clone's intervals is initially set to 1; if a coin tossed with the input false negative probability indicates a false negative, then the matrix entry is changed to 0.
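As an illustrative sketch, the generation procedure just described might look as follows in code. All names, default rates, and size ranges here are our own assumptions for illustration, not the values used in the study:

```python
import random

def generate_matrix(num_probes, num_clones, genome_len=1.0,
                    p_chimeric=0.1, p_false_neg=0.1, p_false_pos=0.02,
                    p_small=0.5, small_range=(0.01, 0.03),
                    large_range=(0.05, 0.10), seed=0):
    """Sketch of the hybridization-matrix generator described above."""
    rng = random.Random(seed)
    # Probes: independent uniform positions on the interval, then sorted
    # to give the correct probe order.
    probes = sorted(rng.uniform(0, genome_len) for _ in range(num_probes))

    def make_fragment():
        # Pick a size range (small or large), a size within it, and a
        # start position chosen so the fragment fits on the interval.
        lo, hi = small_range if rng.random() < p_small else large_range
        size = rng.uniform(lo, hi)
        start = rng.uniform(0, genome_len - size)
        return (start, start + size)

    matrix = []
    for _ in range(num_clones):
        intervals = [make_fragment()]
        if rng.random() < p_chimeric:      # chimeric clone: second fragment
            intervals.append(make_fragment())
        row = []
        for p in probes:
            hit = any(a <= p <= b for a, b in intervals)
            if hit:    # true one, possibly flipped by a false negative
                row.append(0 if rng.random() < p_false_neg else 1)
            else:      # true zero, possibly flipped by a false positive
                row.append(1 if rng.random() < p_false_pos else 0)
        matrix.append(row)
    return probes, matrix
```

In the actual generator the rows and columns are then randomly permuted before output, so that algorithms cannot exploit the generation order.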

Discussion

There are many aspects of this generator which we believe could be made more detailed, as well as many questions we would like to answer concerning its biases.

Probe generation. Choosing positions based on double-precision real numbers is of a finer grain than the actual genome; does this matter? Actual probes are often not random strings of DNA but have some genetic meaning. Is there some way to include this knowledge in the generator? In particular, some probes are known to be the PCR ends of clone fragments. What would be the effect of explicitly defining some probes to be fragment ends? It is also known that some regions of DNA are uncloneable. Should this be represented by gaps in between probes, or by modification to clone generation?

Clone generation. The actual biological process of producing clones is quite complex. Can a better model be produced by generating base-pair sequences and then simulating the fragmenting of copies of the target DNA and the uptake by vectors? Is there some intermediate model which captures more of the biology than our current model?

Error generation. How should chimeric clones be treated? Should they be allowed to have more than two pieces? Should the individual pieces have a size distribution identical to non-chimeric pieces? In an earlier generator we made the size of the combined chimera be distributed equivalently to the size of non-chimera; this led to one or both pieces being very small. On the other hand, the current approach leads to chimera being on average larger than non-chimera, and thus providing more, though potentially misleading, information.

Should two false positives or false negatives be correlated? Deletions might be modeled as correlated false negatives; however, for clones which hit at most three probes, which is true of most of our clones, a correlated false negative simply produces a smaller clone. Should the pooling strategies used for hybridization experiments be included, in order to identify correlations introduced by the pooling process? Should false positives be more common near clones than far from clones?

Should the rate of false positives depend on the size of the matrix? A fixed rate of false positives corresponds to twice as many false positives per row when the number of probes doubles. If the ratio of ones due to false positives to ones due to true hybridizations is too high, can any algorithm succeed?

Extensions. Some experiments yield hybridization data which is neither positive nor negative but is instead an estimate of the likelihood of hybridization. Can the matrix be given fractional entries to represent this data? It is difficult to produce probes which correspond to unique locations on the target DNA. Can each probe be assigned a set of positions, in order to represent this type of data?

Technical details

In order to ensure that any algorithm running on the generated matrix does not make use of the known generation order of the probes or clones, both the rows and columns of the matrix are randomly permuted before being output. We believe that these permutations are critical to fair evaluations of algorithms.

Mimicking real data

Our first use of our generator was to attempt to produce matrices which were as similar as possible to real data. Norman Doggett's group at Los Alamos (LANL) was kind enough to share some early hybridization data with us, consisting of STS probes, clones, and their recorded clone-probe hybridizations. Recently they have increased the number of probes and clones and, by using breakpoints and other non-hybridization data, have published a physical map of their data. We were initially interested in how much information was available in their early data, so we have not used any of the later data.

Given a hybridization matrix, it is not immediately obvious what characteristics to attempt to match. We decided that, as a first approximation, we should attempt to match the distribution of clone sizes: we measured the average number of probes hit by a clone in the LANL data and endeavored to produce matrices with the same average number of ones per row. In the LANL data every clone hit at least one probe (presumably clones hitting no probes had been removed before the data was given to us), and no clone hit more than a small number of probes. More than one third of the clones hit only a single probe, and thus provided no mapping information; more than three quarters of the clones hit at most three probes.

In order to simulate matrices with these characteristics, we made a fixed fraction of our clones small and the remaining clones large. In trials with no simulated errors, the average number of ones produced per row matched the desired value, though the per-row maximum was considerably higher and a few rows had no ones at all. Clearly the probes were randomly bunched, causing some clones which expected a few hits to miss all probes, and others to hit many more. Although not exactly matching the distribution of clone sizes in the LANL data, we considered it to be a good match.

However, the LANL data definitely contained some amount of chimerism and hybridization errors. In another project we are attempting to see whether we can use our algorithms to estimate the amount of each error, but for this study we merely tried several possibilities. Each time we increased the chimerism or false positive rate, we had to reduce the average size of clone fragments in order to maintain the same distribution of the number of ones per row. Similarly, we had to increase the fragment size when we increased the false negative rate.

In all we looked at eight cases: no errors; two rates of chimerism only; two rates of false negatives only; two rates of false positives only; and a combination of chimerism, false negatives, and false positives. In order to be able to explore many trials in each of these cases, we scaled down the number of probes.

Exploring a range of data

The LANL data is only one example of what data might look like. We wanted to be able to discuss a wide range of data, so that we would cover other existing data and also be able to suggest directions in which to push the data in order to achieve better results.

In this first study we chose to investigate each error type in isolation. We also did not want to have too many clones which gave no information, so we slightly increased the average clone size, to an average of five probe hybridizations per clone. We tried several rates each of chimerism, of false negatives, and of false positives. These rates were intended to include values at which all algorithms would be expected to do well and values at which all would be expected to fail. In retrospect, the false-positive range should have included some much smaller values.

In order to allow a large number of trials, we performed all our experiments using 50 probes. We were quite interested in the effect of coverage on algorithmic success, so we tried each case with a range of clone counts, from 50 up to 500 clones.

The Quality of Data

In the preceding sections we saw that the amount of information available in a hybridization matrix can be limited, and discussed some of the reasons for its limitations. In the error-free case we were able to link the information available to the theory of PQ-trees, but when the data contains errors we could merely say that things will be worse. Thus, as our first experiments, we measured as best we could the amount of information available. We looked at the number of connected components in the clone cover graph, at the number of redundant probes, at the number of non-hybridizing probes, and at the number of weak adjacencies.

Connected Comp onents

For any hybridization matrix we have defined the clone cover graph. Recall that the clone cover graph is formed by associating a vertex with each probe (i.e., column of the matrix) and placing an edge between two vertices for each clone which hits both probes (i.e., a row with a one in the two corresponding columns). Multiple edges between vertices are collapsed into a single edge with weight equal to the multiplicity. One can compute the connected components, that is, sets of vertices which are connected by some path of nonzero-weight edges in the graph.
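Both the clone cover graph and its connected components are straightforward to compute from the matrix; a minimal sketch (function names ours):

```python
from collections import defaultdict

def clone_cover_graph(matrix):
    """Build the weighted clone cover graph of a 0/1 hybridization matrix.

    Vertices are probe (column) indices; each clone (row) contributes an
    edge between every pair of probes it hits, and parallel edges are
    collapsed into one edge whose weight is the multiplicity."""
    weight = defaultdict(int)
    for row in matrix:
        hits = [j for j, v in enumerate(row) if v == 1]
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                weight[(hits[a], hits[b])] += 1
    return weight

def connected_components(n_probes, weight):
    """Union-find over nonzero-weight edges; returns the components."""
    parent = list(range(n_probes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (u, v), w in weight.items():
        if w > 0:
            parent[find(u)] = find(v)
    comps = defaultdict(list)
    for p in range(n_probes):
        comps[find(p)].append(p)
    return list(comps.values())
```

A probe hitting no clones appears here as a singleton component, matching the discussion below.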

In a very strong sense, there is no information available about the relation of probes in different components. Consider two orderings of the probes in which the probes within components are kept in the same order while the order of entire components differs: there is no way to prefer one such order over another. Thus, if there are many components, then any algorithm will fail to reliably retrieve the correct complete order of the probes (an algorithm may occasionally guess the order correctly, but it cannot be sure of getting it correct).

There are many ways in which multiple components can occur. One simple way is if a probe hits no clones; it is then in a component of its own. In hindsight, these probes should have been eliminated from the matrix, but it is still instructive to see how often they occur. In the trials of matrices constructed to match the LANL data, when no errors occurred a small number of probes hit no clones in each trial. Thus these types of probes are common enough to cause some trouble, but not overwhelming. When matrices were created assuming different types of errors, the number of such probes went up slightly for false negatives, down slightly for chimera, and was reduced to essentially zero with even a modest rate of false positives. The interesting point to note is that although false positives can greatly reduce the apparent number of non-hybridizing probes, the number of probes which actually hit no clones is not reduced. (See the accompanying figures.)

Lesson: If there is a significant amount of false positives, then it will be difficult to screen out non-hybridizing probes.

Of course, the number of connected components can be increased by other means besides non-hybridizing probes. If there is a large gap between probes, then it is possible that no clone bridges the gap. For the LANL-like data there were several components (not including the non-hybridizing probes) when no errors were simulated. However, even a modest rate of false positives sharply reduced the number of components; simulating chimerism also greatly reduced the number of components, while simulating false negatives increased it. For the moderate mixture-of-errors case the number of components was likewise reduced. (See the accompanying figures.) Again, note that the number of actual components was not reduced, only the number of apparent components.

Lesson: If there is a significant amount of chimera or false positives, then it will be difficult to identify connected components.

Redundant probes

There is another simple way in which matrices can lack ordering information: two probes can hybridize identical sets of clones. If more than one probe hybridizes exactly the same set of clones, then there is no information about the relative order of these probes; the same matrix would result from any two total orders which differed only in the order of these probes.

Redundant probes occur naturally whenever there is no clone which starts or ends in the interval between two successive probes. In real data this may be because two probes are actually very close together on the target DNA, or by chance if there just are not very many clones. Both these situations are faithfully modeled by our generator.

In some sense the problem of redundant probes is less severe than that of multiple components, because all of the equivalent orders are quite similar; in fact, each equivalent probe order leads to the same clone order. However, in evaluating the success of an algorithm it is important to know whether an incorrect ordering is due to equal probes or not.

[Table: minimum, maximum, and average over the trials of the number of non-hybridizing probes, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Non-hybridizing probes for LANL-like data

[Plot: number of all-zero columns vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, and matrices with false positives.]

Figure: Non-hybridizing probes for varied data

[Table: minimum, maximum, and average over the trials of the number of connected components, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Connected components for LANL-like data

[Plot: number of connected components vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Number of connected components for varied data

It is of course relatively easy to screen input matrices for redundant columns and remove all but one of each type. However, as in the case of multiple components, the presence of errors may obscure the fact that in the error-free matrix there are redundant columns.
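Screening for redundant probes does not require knowing the correct order: identical columns can simply be grouped. A sketch (names ours):

```python
def remove_redundant_probes(matrix):
    """Collapse duplicate columns (probes hybridizing identical clone sets).

    Returns the screened matrix and, for each kept column, the list of
    original column indices it represents."""
    n_cols = len(matrix[0])
    groups = {}
    for j in range(n_cols):
        col = tuple(row[j] for row in matrix)   # column as hashable key
        groups.setdefault(col, []).append(j)
    kept = sorted(groups.values(), key=lambda g: g[0])
    screened = [[row[g[0]] for g in kept] for row in matrix]
    return screened, kept
```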

In the LANL-like data, when no errors are assumed, a small fraction of the probes are identical to the probe before them in the ordering. We used the fact that we know the correct ordering to simplify the search for equal probes, but the columns could be checked for duplicates without knowing the order. In fact, our check of adjacent probes only may undercount the number of duplicate columns. (See the accompanying figures.)

As with multiple components, the presence of errors tended to screen the occurrence of redundant probes. Chimerism only slightly decreased the number of apparent redundancies, presumably because there are more fragments with a chance to begin or end between probes. Hybridization errors, both false positives and false negatives, had a much greater effect, since even two probes which map to the exact same position on the target genome could appear different due to a hybridization error.

Lesson: If there are hybridization errors or, to a lesser extent, chimera, then it will be difficult to identify redundant probes.

Weak adjacencies

The existence of multiple connected components and of redundant probes results in information-theoretic constraints on the ability of an algorithm to solve the exact version of the mapping problem, i.e., to reconstruct the exact sequence of probes along the target DNA. We believe that there are additional sources of weakness in the data; for example, the masked components and redundancies mentioned in the earlier sections are weaknesses not witnessed directly by multiple components or redundant probes.

As mentioned earlier, for C1P matrices there exist multiple different orderings which are C1P. Although some of the possible orderings are due to rearranging connected components and/or scrambling redundant columns, there are other reorderings which correspond to flipping non-redundant groups of columns. More formally, any C1P ordering can be transformed into any other C1P ordering by a series of translocations in which each intermediate ordering is also C1P. A translocation corresponds to taking a section of the probe order and reversing it in place: the translocation (i, j), with 1 <= i <= j <= n, converts the n-element permutation

p_1 ... p_{i-1} p_i ... p_j p_{j+1} ... p_n

to the permutation

p_1 ... p_{i-1} p_j ... p_i p_{j+1} ... p_n.

Using a PQ-tree representation, it is thus possible to identify sequences of probes which are a subsequence of all C1P orderings of a matrix. As earlier, we refer to the adjacent probe pairs within these sequences as strong adjacencies. In the correct ordering of a C1P hybridization matrix, only these pairs are completely defined by the information in the matrix; any other adjacencies in the correct order are weak, that is, it is not possible for an algorithm to always determine them. It should be noted that there may be partial information about these adjacencies. For example, if there are only two C1P orders (not counting reversal of the entire sequence) which differ by a single flip, then the two adjacencies on either side of the flip will be weak; however, an algorithm might be able to identify the two orders as the only possible orders.

In our LANL-like data without errors (thus with C1P matrices), on average over half the adjacencies were weak. Since the chance of identifying a weak adjacency correctly based on

[Table: minimum, maximum, and average over the trials of the number of redundant probes, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Number of redundant probes for LANL-like data

[Plot: number of redundant columns vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Number of redundant probes for varied data

the hybridization matrix alone is limited, any algorithm is expected to misidentify a significant fraction of the correct adjacencies.

Lesson: If there are weak adjacencies, then judging an algorithm on whether or not it retrieves the entire correct ordering is futile.

While it is possible to exactly define a weak adjacency for a C1P matrix (any adjacency which is at one end of a flip which preserves C1P-ness), it is less clear how to define weak adjacencies for non-C1P matrices. We are exploring the following definition, but believe that it will require refinement as we learn more about these matrices. The intuition is that if the permutation after a translocation has no more fragments than the original permutation, then most noise models would allow an explanation which was no more costly than the explanation of the original permutation. Although this assumption cannot be rigorously proven, it does seem reasonable, and the data seems to confirm its usefulness.

Definition: Given an n-column hybridization matrix, the adjacency (i, i+1), for 1 <= i <= n-1, is weak iff there exists a translocation (i+1, j) or (j, i) such that, for every row of the matrix, the number of blocks of ones after the translocation is no greater than the number before the translocation.
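The definition can be checked directly, if inefficiently, by trying every translocation that separates the two columns of the adjacency and counting blocks of ones per row. A sketch with 0-based column indices (names ours):

```python
def count_blocks(row):
    """Number of maximal blocks of consecutive ones in a 0/1 row."""
    blocks, prev = 0, 0
    for v in row:
        if v == 1 and prev == 0:
            blocks += 1
        prev = v
    return blocks

def translocate(row, a, b):
    """Reverse columns a..b (inclusive, 0-based) of a row."""
    return row[:a] + row[a:b + 1][::-1] + row[b + 1:]

def is_weak_adjacency(matrix, i):
    """Adjacency between columns i and i+1 is weak if some translocation
    separating them leaves every row with no more blocks of ones."""
    n = len(matrix[0])
    # Translocations starting at column i+1 or ending at column i.
    candidates = [(i + 1, j) for j in range(i + 1, n)] + \
                 [(j, i) for j in range(i + 1)]
    for a, b in candidates:
        if a == b:
            continue  # reversing a single column changes nothing
        if all(count_blocks(translocate(row, a, b)) <= count_blocks(row)
               for row in matrix):
            return True
    return False
```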

This definition has two clear weaknesses. First, it is possible that, despite the fact that the number of fragments in each row has not increased, the noise model is such that the cost of the explanation has increased greatly. In this case an adjacency we are considering to be weak should be considered strong: it occurs in all close-to-minimal maps. Our current understanding of noise models makes this case seem unlikely, but it is nonetheless possible.

The second weakness is the opposite in nature: an adjacency which is weak may be identified as strong. It is possible, for example, that there exists a permutation which breaks an adjacency in which the number of fragments in one row goes up while the number in all other rows goes down; the maximum number of fragments per row could even go down. Yet our simple procedure would not discover this fact. Thus, even if the noise model has a cost function which is monotonic in the number of fragments, we may miss weak adjacencies. Missing weak adjacencies is, however, a conservative property in our experiments: we will expect algorithms to correctly identify strong adjacencies, so our algorithms will be penalized if we miss weak adjacencies.

Despite the above caveats about the notion of weak adjacencies, we have found it to be a useful concept. One interesting aspect of our definition of weak adjacencies is that, unlike the number of multiple components and the number of redundant probes, it does not seem to correlate with error rate (see the accompanying figures): chimerism and false negatives seem to have little effect, and false positives sometimes increase and sometimes decrease the number. Ideally, of course, none of the errors would affect it at all, and thus we continue to look for better definitions.

Summary of data quality

As can be seen in the accompanying figures, increasing the number of clones is an effective means of improving the data quality, although it is of course experimentally expensive.

[Table: minimum, maximum, and average over the trials of the number of weak adjacencies, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Number of weak adjacencies for LANL-like data

[Plot: number of weak adjacencies vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Weak adjacencies for varied data

One of our hopes is that, by having algorithms which are effective on the strong parts of the data and which can identify the weak sections, effort can be placed into improving the coverage of just the weak sections.

Combinatorial Optimizations as Models of Error Correction

Genomic reconstruction through parsimonious explanation

In the previous sections we have defined the physical mapping problem in terms of two processes: permutation (i.e., probe reordering) and noise removal (i.e., error repair). One could, however, consider the process to consist of a single noise-removal step, where disorder is just another type of noise. Suppose there existed some objective function which could be applied to ordered matrices such that the function had value zero for the true map and some value greater than zero for any other order of the hybridization matrix. In this case one could attempt to find the ordering of the matrix which minimized the function.

Unfortunately, such a measure can exist only if one knows the genomic placement a priori. As we have seen, it is often the case that several orderings of the matrix could be the correct order, depending on the actual genomic placement and the errors that occurred. The function could take on its unique minimum value only if it magically knew the genomic placement and the errors which occurred. However, it is possible that such a function can be approximated. In this section we discuss several functions which can be argued to approximate the magic measure, and in the following section we describe algorithms which attempt to minimize these functions.

Since the nature of the errors is poorly understood, and they are stochastic in nature, we might not even be able to recognize the magic function if we found it. We can, however, choose a candidate function based on some sound principles and then attempt to validate the function through empirical studies. As many errors occur at one time, measures that correlate with one or many errors are of interest. In the final analysis, however, one would hope to have measures that correlate with the structure of the errors as a whole.

Given such a measure defined for maps, one can search for permutations for which the value of the measure on the resulting map is minimal or close to minimal. Insisting on exact solutions is not justified biologically, owing to the type of data and its evolutionary nature. Moreover, the focus on approximations is mandatory: it is imposed by the computational intractability of computing exact solutions. The insistence on the relevance of close-to-optimal solutions reflects the hypothesis of parsimonious explanations: it is natural to search for minimal explanations, i.e., those based on the smallest amount of change.

Fragment-counting objective functions

Let us consider first measures based on fragment counting. For every permutation, the resulting map of A represents a fragmentation of the clones into fragments, based on the blocks of consecutive ones in each row. One can measure various parameters of this fragmentation: the total number of fragments, the maximum number of fragments in a clone, and the number of clones that are broken (i.e., split). Clearly all these functions have two nice properties. First, they have minimal value when the matrix is C1P.

[Plot: number of weak adjacencies, redundant probes, connected components, and non-hybridizing probes vs. number of clones (50 to 500), for C1P matrices with 50 probes and an average of 5 ones per row.]

Figure: Symptoms of weak data for C1P matrices

Second, they are nondecreasing as the clone splitting increases (i.e., as the chimerism rate increases). Deletions, false positives, and false negatives can either increase or decrease the functions, but will typically increase the measure; decreases come only in the special cases when deletions or false negatives completely remove a fragment, or when false positives join two fragments. When the error rates get very high, these special cases become more common, but in those cases reconstruction will be extremely difficult by any means. Intuitively, maps that have values close to minimal in these measures provide good approximations of parsimonious explanations.
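For illustration, the three fragment-counting measures can be computed for a candidate probe permutation as follows (descriptive names stand in for the symbols used in the text):

```python
def count_blocks(row):
    """Number of maximal blocks of consecutive ones in a 0/1 row."""
    blocks, prev = 0, 0
    for v in row:
        if v == 1 and prev == 0:
            blocks += 1
        prev = v
    return blocks

def fragment_measures(matrix, perm):
    """Fragment-counting measures of a map: apply the probe permutation
    `perm` to the columns, then count blocks of ones in each clone (row).

    Returns (total fragments, max fragments in a clone, broken clones)."""
    per_row = [count_blocks([row[j] for j in perm]) for row in matrix]
    total = sum(per_row)
    maximum = max(per_row)
    broken = sum(1 for b in per_row if b > 1)  # clones split into >1 piece
    return total, maximum, broken
```

On a C1P matrix in its correct order, every row has one block, so the three measures take their minimal values (number of clones, 1, and 0).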

Information known about genomic libraries supports the use of these objective functions. For example, it has been observed that chimeric clones rarely have more than two inserts; therefore maps in which the maximum number of fragments per clone is very small would be consistent with the observed data. If the rate of chimerism is known, maps with the fraction of broken clones close to the chimeric rate, or with the total number of fragments close to the number of clones plus the number of chimeric clones, would provide a good fit to the extra knowledge about the library.

One disadvantage of these fragment-counting functions is that finding the optimum value, on general matrices and on some sparse hybridization-like matrices, has been shown to be NP-complete. However, NP-completeness of the optimum need not be a barrier to us, since we already know that we must look for properties of all approximate solutions rather than for a single optimum solution.

Other objective functions

Total interfragment gap length

One limitation of the measures based on fragment counting is that they are insensitive to the sizes of the fragments and the sizes of the gaps between fragments. The distinction between gap sizes is essential for distinguishing chimerism from deletions: chimeric clones are likely to have big gaps between their fragments, while deletions introduce small gaps into the clones. In this respect false negatives and deletions could be treated together, and data sets exist where they are the dominant errors. The natural idea of keeping the gaps small led us to consider as a measure the total interfragment gap length. Let us first define the span of a map to be the sum, over all rows, of the number of columns from the first to the last one in the row (inclusive), and the density of the matrix to be the total number of ones in the matrix. The total interfragment gap length of a map is then its span minus the density of the matrix.

Minimizing the gap length (or, equivalently, the span) attempts to keep the gaps as small as possible. The function resembles an objective function considered in numerical analysis for square matrices, in connection with the storage of sparse matrices. Although our matrices are not square, the adjacency matrix of the clone cover graph is a closely related square matrix that can be used for the connection.

Minimizing the gap length is obviously inappropriate when large gaps are present; therefore, in that case, dealing with chimerism takes algorithmic precedence.
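A sketch of the gap-length measure, computed per the definition above as span minus density (names ours):

```python
def total_gap_length(matrix, perm):
    """Total interfragment gap length of a map: for each clone (row),
    the number of zero columns strictly between its first and last one,
    summed over rows. Equals the total span minus the matrix density."""
    total = 0
    for row in matrix:
        r = [row[j] for j in perm]            # apply the probe permutation
        ones = [j for j, v in enumerate(r) if v == 1]
        if ones:
            span = ones[-1] - ones[0] + 1     # first-to-last one, inclusive
            total += span - len(ones)         # span minus this row's ones
    return total
```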

Objective functions

There are two properties that it would be desirable for all the objective functions to satisfy:

Conservative Extension of C1P: the optimization function takes on its best (typically minimal) value for C1P matrices.

Monotonicity: the optimization function maps matrices corresponding to higher error rates to worse (typically higher) values.

Each of the measures above (total fragments, maximum fragments per clone, broken clones, and gap length) is indeed a conservative extension of C1P; thus, when these functions are applied to error-free data, the optimal solutions will be C1P. As discussed above, each of these measures is likely to be monotonic. The ways in which these functions break monotonicity for hybridization errors is a potential area for continued research: do there exist simple functions which are monotonic for hybridization errors, and are they efficiently approximable?

Algorithms

The Huffman-greedy algorithm: Minimizing σ

Converting to TSP on the Hamming vector graph

The connection between minimizing the number of blocks of ones in a matrix and solving the Traveling Salesperson Problem on the Hamming distance graph of the matrix has been observed by many authors. In earlier work we abstracted this connection to what we called the vector Traveling Salesperson Problem (vTSP). We repeat the definition of vTSP here for convenience.

Definition: An instance of the vector-TSP is an n-vertex, vector-labeled, complete undirected graph G = (V, E, cost_v) and a function f: N^m -> N, where cost_v is a function from edges to m-vectors over {0, 1}.

The sparseness of a vTSP graph is s(G) = max over e in E of the number of ones in the vector cost_v(e).

The vector-cost of a tour in G is the componentwise sum of the costs of the edges in the tour. The f-cost of a tour is f applied to the vector-cost.

The vTSP f-optimization problem takes as input a vTSP instance I = (G, f) and returns a tour in G of minimal f-cost.

The conversion from the problem of minimizing σ on a matrix to an instance of vTSP is as follows:

1. Create a vertex corresponding to each column of the matrix.

2. Label each edge (v1, v2) with the vector formed by taking the componentwise exclusive-or of the columns corresponding to v1 and v2. The resulting vector has a one in each entry corresponding to a row in which the two columns differ, and a zero in all other entries.

3. Let the reduction function f be the function which sums the entries of a vector. Thus f applied to an edge gives the number of rows in which the two columns differ, and f applied to the vector-cost of a tour gives the number of times a block of ones began or ended in the tour.

The tour found by minimizing the vTSP instance allows blocks of ones to wrap around the matrix, that is, to start near the right side of the matrix and continue at the left side. To avoid this technical problem, an additional column of all 0s can be added to the matrix and the tour opened up to a path from one neighbor of this column to the other. With this change, the cost of the tour is exactly twice σ.
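The doubling relation at the end of the conversion can be checked directly. The sketch below (our own helper names, assuming 0/1 matrices given as lists of rows) shows that, with an all-zero sentinel column appended, the Hamming cost of the cyclic tour through the columns in their given order equals twice the number of blocks of ones: each block contributes one Hamming transition where it starts and one where it ends.

```python
def block_count(matrix):
    """Total number of maximal runs of consecutive 1s over all rows."""
    total = 0
    for row in matrix:
        prev = 0
        for v in row:
            if v == 1 and prev == 0:
                total += 1
            prev = v
    return total

def tour_cost_with_sentinel(matrix):
    """Sum of Hamming distances between consecutive columns along the
    cyclic identity tour, after appending an all-zero sentinel column."""
    cols = list(zip(*matrix))
    cols.append((0,) * len(matrix))
    cost = 0
    for k in range(len(cols)):
        a, b = cols[k], cols[(k + 1) % len(cols)]
        cost += sum(x != y for x, y in zip(a, b))
    return cost
```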

Because addition is distributive, it is easy to see that using vector sum as the reduction function makes the vTSP equivalent to the standard scalar TSP. Therefore, algorithms to optimize σ can be based on the classic approximation algorithms for TSP: twice around a minimum spanning tree, and the Christofides maximum matching improvement. These algorithms have the advantage of giving provable guarantees on performance. For example, using Christofides' algorithm one is guaranteed that the ordering of the matrix produced has a value of σ which is no worse than 3/2 times the optimal value of σ over all orderings; the simpler algorithm yields a guarantee of 2 times optimal.

However, preliminary experiments and the experience of other researchers on other TSP instances show that these algorithms do not do much better than their guarantees, and that other algorithms give better solutions in practice. Therefore we coded two variants of a greedy algorithm for minimizing TSP. Similar algorithms were used extensively in prior work, with the addition of a 2-OPT phase at the end.

Greedy σ I. Our first algorithm works as follows:

1. Convert the matrix to a TSP graph.

2. Starting with the first vertex, add the closest vertex not already on the tour to the tour. Ties are broken randomly.

3. Continue adding the closest vertex (to the last vertex added) which is not already on the tour, until all vertices are used. Again, ties are broken randomly.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ. A tour corresponds to n possible matrix orders, depending on which vertex is made the left-hand column.

Although the reliance on randomness seems a disadvantage, since the quality of the tour can depend highly on the random choices, we found that by using multiple random orders we could get both improved orders and information about the reliability of parts of the order.
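The nearest-neighbour phase of Greedy σ I can be sketched as follows. This is a hypothetical minimal implementation (our own function names, seeded tie-breaking for reproducibility), not the paper's actual code.

```python
import random

def hamming(a, b):
    """Scalar Hamming distance between two equal-length 0/1 columns."""
    return sum(x != y for x, y in zip(a, b))

def greedy_tour(columns, seed=0):
    """Nearest-neighbour heuristic on the Hamming-distance graph of the
    columns, starting from column 0; ties are broken randomly."""
    rng = random.Random(seed)
    unvisited = set(range(1, len(columns)))
    tour = [0]
    while unvisited:
        last = columns[tour[-1]]
        best = min(hamming(columns[j], last) for j in unvisited)
        ties = [j for j in unvisited if hamming(columns[j], last) == best]
        nxt = rng.choice(ties)
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

Running the heuristic from several random starting orders, as the text suggests, gives both candidate orders and a crude confidence signal for the parts of the order that recur.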

Greedy σ II (Huffman). As a check against the dependence on randomness in the first algorithm, we implemented a second algorithm patterned after the algorithm for creating Huffman codes. Although this algorithm still uses randomness to break ties, it is much less dependent on a start vertex. This algorithm works as follows:

1. Convert the matrix to a TSP graph.

2. Pick the nearest two vertices and make them adjacent on the tour. Ties are broken randomly.

3. Continue choosing the nearest pairs of vertices which are not yet adjacent to two other vertices, and make them adjacent on the tour, until all vertices have two neighbors on the tour.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ. A tour corresponds to n possible matrices, depending on which vertex is made the left-hand column.

The clone-cover algorithm

Although the reduction of σ optimization to TSP preserves the optimum value, it is not clear that an approximation algorithm for TSP will fairly explore the region of near-optimal solutions. In particular, the use of Hamming distance produces a penalty for two adjacent columns having different values in a row, but only indirectly rewards two adjacent columns which both have 1s in a row. We therefore created an algorithm which explicitly attempts to keep together columns which both have 1s in the same rows.

Rather than creating the Hamming distance graph, we create the clone cover graph described above. Similar ideas appear implicitly in the work of D. R. Fulkerson and O. A. Gross and of Lander and Waterman.

Heuristically, it appears that one can minimize σ by maximizing the cost of a tour in the clone cover graph. If edges on the tour have high values, then many 1s are placed adjacent to each other and there will be fewer overall blocks of ones. Maximizing the cost of a tour is unfortunately at least as difficult as minimizing the tour, and therefore we once again resort to greedy heuristics.

Our third algorithm thus works as follows:

1. Convert the matrix to a clone cover graph.

2. Starting with a random vertex, add the vertex whose weight to this vertex is greatest and which is not already on the tour. Ties are broken randomly.

3. Continue adding the vertex, not already on the tour, whose weight to the last vertex added is greatest, until all vertices are used. Ties are broken randomly.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ.
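The clone cover graph used in step 1 can be built directly from the hybridization matrix. A minimal sketch under our own naming, where the weight of an edge between two probes (columns) is the number of clones (rows) hybridizing to both, i.e. the off-diagonal part of A^T A discussed in the envelope section below:

```python
def clone_cover_weights(matrix):
    """Edge weights of the clone cover graph: w[i][j] counts the rows in
    which both column i and column j carry a 1 (off-diagonal of A^T A)."""
    n = len(matrix[0])
    w = [[0] * n for _ in range(n)]
    for row in matrix:
        ones = [j for j, v in enumerate(row) if v == 1]
        for a in range(len(ones)):
            for b in range(a + 1, len(ones)):
                w[ones[a]][ones[b]] += 1
                w[ones[b]][ones[a]] += 1
    return w
```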

The Fiedler-vector spectral algorithm: Minimizing the envelope

Concentrating on the number of blocks of ones in the matrix by optimizing σ does not capture the fact that two blocks separated by a single 0 could occur in experimental matrices due to several factors (e.g., a false negative hybridization result or a deletion) which are highly unlikely to create two blocks separated by many 0s. Therefore we looked at methods which attempt to group all the 1s in each row together, but do not demand that they form a single block. Formally, a row is scored by the number of columns from the first occurring 1 to the last occurring 1. A row with two blocks separated by a single 0 thus scores barely worse than a solid block under this metric (whose worst value for a row is n), whereas it would get a score of 2 under the σ metric (whose best value for a row is 1).

This envelope measure was inspired by the sparseness of the hybridization matrices. It seemed reasonable to attempt to keep the area containing ones as small as possible. Keeping this area small tends to minimize the number of false negatives and deletions in a unified way. For symmetric matrices this measure is the so-called envelope used in numerical analysis for handling sparse matrices. Minimizing the envelope is seen as a way to minimize the storage for sparse matrices, which in turn speeds up computation. Although envelope minimization is NP-hard, various heuristic methods have been developed that often work well in numerical analysis applications.

Unfortunately, hybridization matrices are not symmetric, so standard envelope minimization techniques cannot be directly applied. We can, however, form a symmetric matrix by multiplying the hybridization matrix by its transpose. The resulting matrix has a nonzero structure which corresponds precisely to the edges in the clone cover graph.

Spectral methods

When the connection of our measure with the envelope was made, a new algorithmic avenue came to light, namely the use of spectral methods.

Spectral methods are fundamental in numerical analysis. Fiedler developed the mathematics that was ultimately used to design powerful heuristics for arranging sparse matrices in such a way that all the entries are as close to the diagonal as possible. The first algorithmic uses of Fiedler's ideas for ordering problems were due to Juvan and Mohar. More recent applications include work by Pothen and Simon and by Atkins, Boman, and Hendrickson.

We employed a heuristic for optimizing the envelope which is based on the work of our colleagues at Sandia, who developed the code for an independent project called Chaco, which partitions computational meshes for placement on parallel machines. When it became apparent that their code might be useful to us, they very kindly made modifications to allow its inclusion in our software suite. In fact, they became interested in the theoretical properties of spectral methods for this application and have proved several lovely theorems about it.

A full description of their code is beyond the scope of this paper and can be found in the references, but we give here a sketch of its use for mapping.

The Fiedler-vector spectral algorithm

1. Given an m x n matrix A, construct the clone cover graph CC(A). Let G_A be the adjacency matrix of the graph CC(A).

2. Construct the Laplacian L_A of G_A, i.e., L_A = D_A - G_A, where D_A = (d_ij) is a diagonal matrix with d_ii equal to the sum over j of G_A(i, j), and d_ij = 0 for i != j.

3. Let v_F be the n-dimensional Fiedler vector of L_A, defined to be the eigenvector of L_A corresponding to the second smallest eigenvalue of L_A. (v_F has only real entries.)

4. Sort the components of v_F. The sorting produces a permutation π_F of the set {1, ..., n}.

5. Output the map π_F(A).
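The five steps above amount to a few lines of linear algebra. The following is a minimal illustration using NumPy, not the Chaco code itself; it takes the clone cover graph to be the off-diagonal of A^T A, per the envelope discussion above.

```python
import numpy as np

def spectral_order(A):
    """Order probes (columns) by sorting the Fiedler vector of the
    Laplacian of the clone cover graph, as in the Fiedler-vector
    spectral algorithm sketched in the text."""
    A = np.asarray(A, dtype=float)
    G = A.T @ A                      # co-occurrence counts between columns
    np.fill_diagonal(G, 0.0)        # keep only the clone-cover edges
    L = np.diag(G.sum(axis=1)) - G  # graph Laplacian L = D - G
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]            # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)
```

Because an eigenvector is determined only up to sign, the output order may come back reversed; both directions describe the same map.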

In recent work, Atkins, Boman, and Hendrickson showed that a generalization of this algorithm has the remarkable property of solving the consecutive ones problem. That is, on C1P matrices the ABH algorithm finds the C1P permutations. The relevance of this beautiful theorem in the biological context is as follows. This algorithm offers an extra guarantee: it is provably correct on error-free instances. As the error-free set of inputs is quite complex, this is a highly nontrivial result, and the first of this kind to our knowledge.

The cycle-basis algorithm: Minimizing the maximum number of fragments per row

Both our algorithms for optimizing σ and those for optimizing the envelope consider measures which are sums over all the rows of the matrix. For large matrices, this means that one or two very badly ordered rows can be masked by lots of well-ordered rows. In some cases this may be inevitable, but since we expect really bad clones to be filtered from the input, we wanted to be able to look at measures which attempted to optimize the maximum behavior of any row, rather than the average or sum behavior. The measure considered here does exactly this.

Preliminaries

We start by considering a class of fragment-counting objective functions for which the problem of finding optimal tours can be approximately reduced to the problem of computing spanning trees, both problems in vector-TSP graphs. Let us say that a function f: N^m -> N is monotone if for every a, a' in N^m with a <= a' we have f(a) <= f(a'). We say that f is subadditive if for every a, a' in N^m we have f(a + a') <= f(a) + f(a').

Lemma: Let A be an m x n matrix and f: N^m -> N a monotone and subadditive function. Then the optimal f-cost of a tour of the vector-TSP instance (G_A, f) is at most twice the optimal f-cost of a spanning tree of G_A.

Proof: The proof is a straightforward extension of the proof of the twice-around-a-tree heuristic for TSP.

As the maximum-fragment-count measure is monotone and subadditive, we will concentrate now on computing an optimal f-cost spanning tree. Again, the problem turns out to be NP-complete.

The algorithm

We present an algorithm for computing a spanning tree whose cost is close to optimal. The algorithm is based on local search. Informally, one starts with a spanning tree T and searches for better spanning trees in the neighborhood of T. If one such tree is found, one continues searching for improvements in its neighborhood, and so on. When a tree T is found such that no better trees exist in its neighborhood, then T is a locally optimal tree. We can show that T is close to the globally optimal spanning tree T_OPT. To complete the informal presentation of the algorithm, we have to describe the neighborhood of a spanning tree and the criterion for choosing a better spanning tree. The neighborhood is the so-called cycle basis. For a spanning tree T and for every non-tree edge e, one can obtain a graph with exactly one cycle by adding the edge e to T. Then one can remove edges of T from the cycle. We collect only the resulting spanning trees (some edge removals might disconnect the graph). When we do this operation for all edges not in T, we obtain a collection of spanning trees that is the cycle-basis neighborhood of T. As the objective value of a spanning tree is slow to report changes, we use a more change-sensitive criterion for choosing better trees: the lexicographic order of the sorted vector-cost of the tree.

The cycle-basis algorithm

1. Given an m x n matrix A, construct G_A, the n-node vTSP graph of A. The nodes of G_A correspond to the columns of A (or to the probes). G_A is a complete graph with the edges labeled by m-dimensional 0/1 vectors. If an edge corresponds to columns C_j and C_k, then its label is the Hamming vector of C_j and C_k.

2. Construct a spanning tree T of G_A.

3. Repeat until no improvement: add to the current tree T a non-tree edge. If an edge of T from the unique cycle formed by the added edge can be removed such that the resulting graph is a tree T' and T' has a better value than T, then we have improvement and we set T := T'.

4. Let T be the resulting spanning tree.

5. Let π be the permutation obtained by opening (anywhere) the tour obtained by the standard walk twice around the tree T.

6. Output π(A).

The performance of the algorithm is given next.

Theorem: Given an m x n matrix A, the cycle-basis algorithm outputs a permutation π' such that the maximum fragment count of π'(A) is at most an O(log m) factor worse than that of π_OPT(A), where π_OPT is the optimal permutation for A. The algorithm has worst-case running time O(n log m).

The 2-OPT algorithm

It is common, when using approximation algorithms for NP-hard combinatorial optimizations, to use some sort of local search to fine-tune the solution. One common local search method for permutation-based algorithms is the 2-OPT algorithm. When performing a 2-OPT, two permutations are considered to be neighbors if they differ by a single translocation. As we saw in the preceding sections, the translocation is a natural operation for our algorithms.

Formally, the 2-OPT algorithm takes as input an m x n matrix A, a permutation π = (i_1, ..., i_n), and an optimization function f. Recall that a translocation flips sections of the permutation. A segment of π is given by two elements i_j, i_k such that i_j occurs before i_k in π. A translocation of the segment (i_j, i_k) in π is a permutation obtained from π by reversing the sequence of elements of π between, and including, i_j and i_k. Given a permutation of size n, there are n(n-1)/2 translocations; this set includes the original permutation.

The 2-OPT algorithm

1. Given an m x n matrix A and a permutation π in Perm(n):

2. Until no improvement is possible: consider in turn all the translocations of π, and among them choose the permutation π' for which f(π'(A)) is minimal. If f(π'(A)) < f(π(A)), then there is improvement and we set π := π'.

In our experiments we used σ as our optimization function.

It is interesting to note that, unlike the cycle-basis algorithm, the local search of 2-OPT yields no guarantees, either on running time or on solution quality. The local optimum can be arbitrarily worse than the global optimum, and it can take time exponential in n to converge to a local optimum. For this reason it is desirable to use 2-OPT only after a good starting solution has been found. If the starting solution is provably close to the optimum solution, and a bound is placed on the number of steps of 2-OPT to be performed, then the combination of an algorithm with 2-OPT will yield a solution which is provably close to the optimum and will not take unbounded time.
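A naive version of this local search can be sketched as follows. This is a hedged illustration, not the paper's implementation; the explicit step bound reflects the caveat above that unguarded 2-OPT can take exponential time to converge.

```python
def two_opt(perm, cost, max_steps=1000):
    """Hill-climbing over translocations (segment reversals): repeatedly
    take the best reversal that lowers cost(perm); stop at a local
    optimum or after max_steps iterations."""
    perm = list(perm)
    for _ in range(max_steps):
        best, best_perm = cost(perm), None
        n = len(perm)
        for j in range(n - 1):
            for k in range(j + 1, n):
                # reverse the segment perm[j..k], keeping the rest fixed
                cand = perm[:j] + perm[j:k + 1][::-1] + perm[k + 1:]
                c = cost(cand)
                if c < best:
                    best, best_perm = c, cand
        if best_perm is None:      # no translocation improves: local optimum
            return perm
        perm = best_perm
    return perm
```

Here `cost` stands for any of the objective functions discussed above (in the experiments, σ evaluated on the reordered matrix).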

Experimental Design

We have argued, both theoretically and experimentally, that there are severe limitations on the ability of an algorithm to retrieve, from a hybridization matrix alone, the correct genomic order of the probes and/or clones along the genome. Nonetheless, we have designed several algorithms and put them to the test. Although many algorithms have been proposed, and even implemented, for variants of the physical mapping problem (see the related-work section below), there have been few or no published attempts to give a rigorous evaluation of these algorithms across many situations. Typically, the algorithm is run on several specific large real data sets and the results are compared to the best previous algorithm or to the published map of the data generators.

It is one of the goals of this paper to establish the beginning of a dialogue within the community concerning how to evaluate mapping algorithms. We have given some theoretical background, but the success or failure of algorithms will always rest on their practical use. We believe that understanding the practical use of an algorithm depends on answering the following questions:

1. What are some good quantitative measures of the goodness of a map?

2. How can fair comparisons of different algorithms be made? In particular, what should the output of an algorithm be?

3. When an algorithm performs poorly, why did it perform poorly?

4. What are the requirements for a useful generator of synthetic examples?

5. How many random trials are necessary to produce reliable information about an algorithm?

6. What parameters are worth adjusting in studying an algorithm?

Measures of success

We do not yet have a totally satisfactory answer to the first two questions: how to measure the goodness of a map. We have, however, explored several measures. The first measure is a natural outgrowth of our theoretical discussion of the information available in a hybridization matrix. We ask how many of the adjacencies in the true order are identified by the algorithm. In retrospect, we now believe that we should have attempted to make our algorithms list only those adjacencies for which the algorithm has derived good evidence that the adjacency is in the true map. In this way the algorithm would be able to try to identify weak adjacencies as well as list strong adjacencies. However, in this study we created algorithms which, like most previous algorithms, attempted to produce a total order of the probes.

Thus our first measure is the percent of true map adjacencies which are adjacent in the algorithm's order.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the total adjacency cost is (1/n) times the sum over i from 1 to n-1 of δ_i, where δ_i = 1 if π(i) and π(i+1) are not adjacent in π_A, and δ_i = 0 otherwise.

We have argued that the algorithm should only be responsible for determining strong adjacencies. Thus our second measure counts only strong adjacencies which are missed. Of course, for data containing errors we must have some definition of a strong adjacency, such as the one defined earlier.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the strong adjacency cost is (1/n) times the sum over i from 1 to n-1 of δ_i, where δ_i = 1 if (π(i), π(i+1)) is a strong adjacency and π(i) and π(i+1) are not adjacent in π_A, and δ_i = 0 otherwise.
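The total adjacency cost can be computed in a few lines. The sketch below uses our own naming and, following the definition above, normalizes the count of missed adjacencies by n.

```python
def total_adjacency_cost(true_order, found_order):
    """Fraction (over n) of true-order adjacencies that are NOT adjacent
    in the computed order; 0 on a perfect reconstruction or its mirror."""
    pos = {p: i for i, p in enumerate(found_order)}
    n = len(true_order)
    missed = sum(1 for a, b in zip(true_order, true_order[1:])
                 if abs(pos[a] - pos[b]) != 1)
    return missed / n
```

The strong adjacency cost is the same computation restricted to pairs flagged as strong adjacencies.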

We are currently experimenting with some new measures. The number of adjacencies does not seem to completely capture how far from correct a solution is. For example, if two adjacent probes in the true order are flipped, then two adjacencies will be incorrect; however, the adjacency cost will be the same if some large section of the genome is transposed in the solution. We therefore look at the average distance by which probes adjacent in the true order are separated in the calculated order. A variant of this measure has been mentioned in the literature.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the total distance cost is

(1/n) times the sum over i from 1 to n-1 of | pos_A(π(i)) - pos_A(π(i+1)) |,

where pos_A(p) denotes the position of probe p in the ordering π_A.
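The definition translates directly into code; a minimal sketch under our own naming:

```python
def total_distance_cost(true_order, found_order):
    """Average separation, in the computed order, of probes that are
    adjacent in the true order; unlike the adjacency count, a transposed
    block is penalized more heavily than a local flip."""
    pos = {p: i for i, p in enumerate(found_order)}
    n = len(true_order)
    return sum(abs(pos[a] - pos[b])
               for a, b in zip(true_order, true_order[1:])) / n
```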

Of course, a variant of total distance can be defined in which only strong adjacencies are counted. In preliminary studies the total distance measure seems to give us a better understanding of the quality of an algorithm, but we are not yet comfortable enough with our understanding of it to report results.

Parameters to vary

Perhaps the hardest part of an experimental study is deciding what to look at. Having settled on adjacency cost as our primary measure of success, we still had to decide what parameters to vary and over what ranges.

It was clear that we wanted to vary the amount of chimerism, false negatives, and false positives. We chose for this first experiment to vary each independently, so as to limit the number of possibilities and to make the results more clearly relatable to the input. We chose a substantial range of chimerism rates, since the literature seemed to imply that a fair amount of chimerism was present in real data. We also expected from our earlier studies that algorithms could perform well even at these high levels of chimerism. Choosing a false negative rate was more difficult. Many pooling techniques are biased toward producing false negatives in order to reduce the amount of false positives. In fact, some groups believe false positives to be so costly that they apply preprocessing techniques to remove false positives from the matrices, at the cost of increasing the number of false negatives. The fear of false positives was borne out by our experience. Although we intended to use the same range as for false negatives, we found that some of the algorithms took inordinate amounts of time to process the higher false positive rates, and then gave very poor solutions.

It should be noted that the false positive and false negative rates are not equivalent measures, since the false negatives can only occur at the few percent of matrix entries which record hybridizations, while the false positives can occur anywhere else. In fact, we recommend that the false positive rate in future experiments be based on the expected number of false positives per row: the rate needed to produce approximately one false positive per row falls as the number of probes grows.

It was also clear that we wanted to vary the number of clones in the experiment. The term coverage has many definitions in the literature, but always connotes the relative number of clones to probes. Defining coverage to be the ratio of the number of clones to the number of probes, we looked at a range of coverages. The lowest coverage tested is almost certainly too low, while the highest is well beyond the level of effort likely to be made by any actual mapping team.

It was less clear what to do about the relative size of clones. Preliminary studies showed that minor variations did not seem to make much difference. However, we believe that the role of clone size deserves further study in the future. Even having decided not to explicitly vary clone size, we were left with a dilemma of how to account for the fact that differing amounts of errors led to differing apparent clone sizes, even if the generated clone size was kept constant. When no effort was made to counterbalance the error effect, we found that increased chimerism led to improved algorithm success. This counterintuitive result was due to the fact that the chimeric clones contained additional information in the form of additional fragments. We thus decided to modify the input clone sizes so that the expected density of ones in the matrix was unchanged. This of course meant that with high chimera and false positive rates the size of clones decreased.

Each of the approaches, constant clone size or constant ones density, has its shortcomings. We arbitrarily settled on constant ones density, but expect to report results on constant clone size in the future.

Timing

No experimental study would be complete without some measure of the time cost of the algorithms used. We ran all our experiments on a single dedicated Sparc workstation, so times should be comparable. However, the effort put into optimizing the various algorithms varied greatly. The spectral algorithms were part of a separate project for graph bisection and have been carefully optimized. The Huffman and Clone Cover algorithms are relatively simple and, though little effort was put into optimization, are probably fairly efficient. The Cycle Basis algorithm is quite complex and probably could be sped up considerably. The rudimentary 2-OPT algorithm was coded in a naive manner, and there are known to be considerably more efficient implementations. We hope to improve its efficiency considerably when we extend it to consider additional optimization criteria.

Given all these caveats, the timing graphs which follow in the next section should be used mostly as a gauge of the amount of 2-OPT necessary, and as some indication of how the current implementations scale as the coverage and amount of error increase.

Experimental Analysis of Algorithms

General observations

All of the algorithms exhibit good trending behavior. That is, when more coverage is given, the algorithms find more correct adjacencies. Furthermore, when the amount of error is increased, the algorithms tend to do worse. One exception is that the use of 2-OPT seems to benefit from increased chimerism. We are not sure why this is so, and continue to examine the detailed data in order to look for reasons.

We present our data in summary graphs. Each graph has cost (total adjacency cost, strong adjacency cost, or time) on the y-axis. The x-axis is more complicated: it represents both coverage and amount of error present. Each coverage is assigned an interval on the x-axis, and within that interval the amount of error is allowed to grow. Thus, for chimerism, the leftmost point in each interval corresponds to that coverage with no chimerism, and the rightmost point to that coverage at the maximum chimerism rate. Formally, the value on the x-axis is the coverage (i.e., the number of clones divided by the number of probes) plus a constant times the percent chimerism.

Similarly, for false negatives and false positives, the value on the x-axis is the coverage plus a constant times the log base two of the percent of false entries. No particular meaning is given to the spacing of the error values; they are merely chosen to spread out the points.

For each algorithm there are nine graphs. The first three examine the effects of pure chimerism. The first graph looks at total adjacency cost and strong adjacency cost for the algorithm alone, while the second graph looks at the same costs for the solution of the algorithm followed by local search using 2-OPT. The third graph shows running time for the algorithm and running time for the 2-OPT portion alone. The next three graphs give analogous results for pure false positives, while the last three give results for pure false negatives.

The algorithms examined are: the spectral algorithm, which attempts to minimize the envelope of the hybridization matrix; the cycle basis algorithm, which attempts to minimize the maximum number of fragments in any row of the hybridization matrix; the Huffman TSP algorithm, which attempts to minimize the total number of fragments in all rows of the hybridization matrix; the clone cover algorithm, which attempts to maximize the amount of adjacent overlap; and a random algorithm, which chooses a random permutation.

For each experimental point (coverage and error rate), random matrices were generated and all algorithms were run on the same matrices. Detailed data from the runs will be made available on the World Wide Web at the time of publication of this paper. The URL can be found through the home page at http://www.cs.sandia.gov/~dsgreen/main.html

Spectral

The corresponding figures show the results of the experiments using the spectral algorithm. It is interesting to note that increased coverage does not seem to significantly help the algorithm in the face of chimerism or false positives. In contrast, increased coverage does seem to allow the 2-OPT algorithm to find a better local optimum when applied to the spectral algorithm's solution. The fact that the spectral solutions with higher coverage do not appear better by the adjacency measure, yet are better starting points for 2-OPT, is evidence that the adjacency measure is not a perfect measure.

Increased coverage does improve the spectral algorithm's performance on false negatives. It is difficult to tell whether this effect is mostly due to the increased number of strong adjacencies or to increased information about all adjacencies. The spectral algorithm is so effective on false negatives that the 2-OPT algorithm can make little improvement, especially at the lower rates tested.

Cycle Basis Algorithm

The corresponding figures show the results of the experiments using the Cycle Basis algorithm. Increased coverage seems to help the Cycle Basis algorithm most in correcting chimerism.

Clone Cover

The corresponding figures show the results of the experiments using the clone cover algorithm. Increased coverage seemed to help the clone cover algorithm in all cases. We were surprised at the robustness of the clone cover algorithm, since it does not explicitly optimize any parameter which we can directly relate to map goodness. However, the notion of strongly linked probes is common in the biology literature, and our experiments seem to justify its use.

Huffman

The corresponding figures show the results of the experiments using the Huffman algorithm for greedy TSP minimization. As has been reported by us in a previous paper and by other authors, the greedy minimization of the Hamming distance TSP is quite effective. We do not give results for the Greedy Heuristic algorithm mentioned above, since the Huffman algorithm was marginally more effective and is less dependent on a good initial order. It was clear to us that it is very important to test greedy algorithms with a random initial ordering; otherwise the greedy algorithms tend to conserve the input order and look artificially good when the data is presented in the correct order.

Random

The corresponding figures show the results of the experiments using the 2-OPT algorithm on a random permutation. The results with just the random permutation are predictably bad. The results for the strong adjacencies can be used to track the number of strong adjacencies: when there are many weak adjacencies, the random result can only be wrong on the remaining strong adjacencies, and as the number of strong adjacencies increases, there are more adjacencies on which the random result can fail.

The 2-OPT algorithm by itself is fairly effective. It is, however, very expensive. Even taking the relative efficiency of the implementations into account, it is clearly a major benefit to give 2-OPT a good starting position. In addition, the dependency of 2-OPT on error rate is particularly bad. It seems that higher error rates yield more gentle landscapes, thereby allowing 2-OPT to make many more changes before reaching a local minimum.

Comparison

In order to compare the various algorithms it is really necessary to examine each matrix

in turn We are building the machinary with which to do this but for now must settle on

some summary data In the two tables b elow we show the total and strong adjacency cost

for each algorithm on a few extreme examples coverage or and error rate min or

max

In both tables a star has been placed next to the minimum value for each case. Atkins et al. have recently modified the spectral algorithm to always find a CP ordering if one exists, so it is not surprising that even their early spectral algorithm gives the best results for the CP case. It is apparent that there are many weak adjacencies in the lower coverage case and many fewer in the higher coverage case. Spectral is also very effective for false negatives. Although the total adjacency measures are not particularly encouraging, the strong adjacency numbers and the graphs above show that for moderate levels of false negatives most of the information inherent in the matrix can be retrieved.

For chimerism there are several effective algorithms, with both Huffman and spectral yielding slightly better starting points for 2-OPT. The fact that all algorithms miss only a small fraction of the strong adjacencies, even at the highest chimerism rate and lowest coverage tested, is quite encouraging. One might hope that with a combination of algorithms, reliable information about the strong adjacencies can be retrieved.

The case for false positives is less rosy. Clone cover is surprisingly effective, but it is disturbing that, even at the moderate levels of false positives tested, more coverage was actually a detriment to finding strong adjacencies. This appears to be the result of there being more strong positives, but in some sense the new strong positives are less strong and therefore harder to recover.

Related work

The literature contains extensive studies on physical mapping, and a variety of algorithmic approaches have been proposed. Various software packages are available that offer a varied range of computational support for mapping. It is not our intent in this work to compare and contrast the various approaches, but merely to provide a framework in which this can be done in the future. Each of the efforts described below has had its own goals and measures of success.

The mapping software developed by H. Lehrach's group contains a variety of algorithms, including ones dealing specifically with chimerism, a rarity in the literature. They make use of the TSP analogy and apply simulated annealing. As an output from their program they look for subsets of the clones which span the probes; thus they are able to ignore some ill-conditioned clones. They also have a dechimaerising algorithm, which uses ideas similar to our clone cover graph to remove clones which are apparently chimeric. They have applied their algorithms to real data from S. pombe. For this organism it was possible to have a library with coverage equivalent to many genomes. This extremely high coverage allowed them to aim for a completely correct map. They found that they sometimes achieved the correct order for some of the chromosomes and were able to manually correct the others.

the others They note that the minimum of their optimization function did not necessary Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 30 Spectral Alg Spectral Alg with 2-OPT 25

20

15

10 Time to find solution (secs)

5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 8 Spectral Alg 7 Spectral Alg with 2-OPT

6

5

4

3

Time to find solution (secs) 2

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 40 Spectral Alg 35 Spectral Alg with 2-OPT

30

25

20

15

Time to find solution (secs) 10

5

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 3 Clone Cover Clone Cover with 2-OPT 2.5

2

1.5

1 Time to find solution (secs)

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 11 Clone Cover 10 Clone Cover with 2-OPT 9 8 7 6 5 4 3 Time to find solution (secs) 2 1 0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 7 Clone Cover Clone Cover with 2-OPT 6

5

4

3

2 Time to find solution (secs)

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 3 Huffman Hamming TSP Alg Huffman Hamming TSP Alg with 2-OPT 2.5

2

1.5

1 Time to find solution (secs)

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 4.5 Huffman Hamming TSP Alg 4 Huffman Hamming TSP Alg with 2-OPT

3.5

3

2.5

2

1.5

Time to find solution (secs) 1

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 7 Huffman Hamming TSP Alg Huffman Hamming TSP Alg with 2-OPT 6

5

4

3

2 Time to find solution (secs)

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 60000 Random order Random order with 2-OPT 50000

40000

30000

20000 Time to find solution (secs)

10000

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 140000 Random order Random order with 2-OPT 120000

100000

80000

60000

40000 Time to find solution (secs)

20000

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Random order: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Random order: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Random order: 0, 1, 2, 4 pct. false pos., constant density 90000 Random order 80000 Random order with 2-OPT

70000

60000

50000

40000

30000

Time to find solution (secs) 20000

10000

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure

CP

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Chimerism only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False p ositives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False negatives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Figure Total adjacency costfor various coverageerror rates

CP

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Chimerism only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False p ositives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False negatives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Figure Strong adjacency cost for various coverageerror rates

correspond to the correct order.

Another prominent software package based on simulated annealing is the package developed by Cuticchia, Arnold, and Timberlake. A simulated annealing heuristic is used to construct the Hamming distance TSP tour. They look at several measures of success, including the size of the largest contig, the number of contigs, and the error within a contig. The emphasis on separate contigs is a version of our concept of weak adjacencies; the measure of error within a contig is related to our notion of average probe distance. They applied their algorithm to simulated data which included errors (which we called uncloneable regions) and false positives. They also applied their algorithm to the genome of A. nidulans.

The Berkeley software library developed by R. Karp's group also applies simulated annealing to the TSP graph. In addition, they propose several heuristics for reducing the number of false positives in the data, dealing with repeated probes, and pooling schemes. They also include analysis of data in which probes are built from the ends of clones.

Zhang et al. look at data which contains clone-clone hybridization data rather than clone-probe data. They use a classic algorithm for finding a diameter of a graph in order to estimate an ordering of the clones. In order to deal with false hybridization they apply several heuristics. For the library of cosmids and YACs for the human chromosome to which they applied their techniques, there was three- to five-times coverage. They do not give any quantitative measure of how successful their algorithms were.

Summary and Future Directions

Although we feel that we have moved forward a great deal from our earlier work in physical mapping, we are sure that much remains to be done. At a minimum, there remain open questions in the creation of generators of synthetic data, the creation of algorithms based on combinatorial optimizations, the consideration of algorithms based on other techniques such as maximum likelihood, the inclusion of data of other types besides hybridizations, the modeling of errors and inclusion of other error types, the definition of more detailed measures of algorithm success, and the construction of more detailed experiments.

Synthetic Data. We discussed the question of generators in some depth in Section , but a few points are worth reiterating. No one is likely to produce a generator which actually captures all of the intricacies of the experimental mapping process. The key is to attempt to include facets of the process which are likely to have large effects on the evaluation of algorithms. Toward this goal we have looked at the distribution of probe positions, the size of clone fragments, and the amount of simple errors. We believe that it will be important to further examine the role of the distribution of fragment sizes, the correlation of various error types, and the correlation of errors to fragment position.
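A toy version of such a generator, assuming only the simplest model discussed above (clones as fixed-length intervals placed uniformly over evenly spaced probes, with independent bit-flip errors), might look like the following; the function name, parameters, and model details are all illustrative assumptions, not the generator used in our experiments.

```python
import random

def synthetic_hybridization_matrix(n_probes, n_clones, clone_len,
                                   fn_rate=0.0, fp_rate=0.0, seed=0):
    """Toy generator: probes sit at positions 0..n_probes-1; each clone is a
    random interval of clone_len consecutive probes. A probe-clone entry is 1
    iff the probe falls inside the clone, and each bit is then flipped
    independently at the given false negative / false positive rates."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_clones):
        start = rng.randrange(n_probes - clone_len + 1)
        row = []
        for p in range(n_probes):
            hit = start <= p < start + clone_len
            if hit and rng.random() < fn_rate:
                hit = False                      # drop a true hybridization
            elif not hit and rng.random() < fp_rate:
                hit = True                       # spurious hybridization
            row.append(int(hit))
        matrix.append(row)
    return matrix
```

With zero error rates every row has the consecutive ones property, so a generator of this shape gives a controlled baseline before fragment-size distributions or correlated errors are layered on.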

Algorithms. In our discussion of algorithms in Section we noted that it is desirable to find optimization functions which are conservative extensions of the CP problem and which are monotonic in the error rate. We continue to look for better such functions. In addition, it is clear to us that the use of multiple algorithms, to increase the reliability of the results and to allow the handling of multiple error types, will be crucial. We are exploring ways to combine the outputs of multiple algorithms and to use them in iterative schemes which progressively improve the solution.

The use of the 2-OPT refinement opens up many questions. We used 2-OPT to locally search for solutions with the lowest value of one particular optimization function. However, we could equally well have used any other, or a combination, of our optimization functions. It would be interesting to see whether there is an advantage to having the 2-OPT use a different function, or the same function, as the primary algorithm.

We also expect that new research avenues will develop based on combinatorial optimizations of multiple objective functions. Multiple objective optimization is an active area of research in combinatorial optimization, and we expect that the efforts to provide computational support for mapping will reinforce the need for new algorithmic tools and methods.

New data and error types. There will always be new experimental techniques with corresponding new types of information and errors. It is our hope that some of the mechanisms developed in this paper will be applicable to the resulting mapping problems. One of the greatest challenges to our approach is to allow the biologist to include side information. For example, if it is known from a breakpoint analysis that two sets of probes are separate from each other, it would be nice to allow the algorithm to make use of this information.

Experimental Analysis. We learned many lessons from performing the experiments of Section . An important lesson was that summary statistics are not always enough. It is sometimes important to be able to directly compare the solutions of all algorithms on a single input instance. It would be nice to have additional tools for comparing how close two permutations are to each other. These tools might also provide us with additional candidates for measuring relative algorithm success.
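One standard candidate for such a permutation-comparison tool is the Kendall tau distance, the number of pairs ordered differently by two permutations. A minimal sketch (our illustration, not part of the software described in this paper):

```python
def kendall_tau_distance(p, q):
    """Number of element pairs ordered differently by permutations p and q
    (0 for identical orderings; n*(n-1)/2 for an exact reversal)."""
    pos = {elem: i for i, elem in enumerate(q)}   # position of each element in q
    mapped = [pos[elem] for elem in p]            # p expressed in q's coordinates
    # count inversions in the mapped sequence
    return sum(1 for i in range(len(mapped))
                 for j in range(i + 1, len(mapped))
                 if mapped[i] > mapped[j])
```

Since a physical map and its reversal are equivalent, in practice one would take the minimum of the distance to a reference permutation and to its reverse.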

A second lesson we learned is that it is very difficult to hold everything constant except one factor. For example, varying the error rates changed the density of the resulting matrix. It is important to keep a record of all parameters, even ones not being studied, in order to be able to look for later correlations.

We were not particularly careful about implementation efficiency. This sometimes meant we had to limit the size of our examples and/or the number of trials examined. In both cases this meant that we were not able to reach as high a level of confidence in our output numbers as we would have liked. We computed standard deviations, minimums, and maximums as well as averages. The averages seem to be a reliable measure, but we would like to make the study more rigorous. In addition, the use of run time as a measure of algorithm efficiency is dubious. We would like to re-examine the use of 2-OPT in order to count the number of optimization steps. We suspect that it will be interesting to count not only the total number of steps until a local optimum is found, but also the number of steps until a solution within a threshold of the local optimum is found.

Theory. Our theory of physical mapping in the presence of errors is still in its early stages. We hope to extend the theory to uncover intrinsic limitations on distinguishing noise from signal in physical mapping. In particular, we currently use only very simple noise models. Determining how to properly weight the effects of different error sources will be challenging.

Maximum likelihood approaches implicitly include a noise model. We hope that by making the noise model explicit we can determine when enough information is available to make maximum likelihood effective.

Integration into mapping projects. We have presented the problem of physical mapping as occurring after the biological experiments are completed. This is probably not the most efficient use of mapping algorithms. It would be preferable if the mapping software became integrated into the mapping process. For example, the software should be able to help the biologists know which sections of the data need work. The hope is that the integrated approach will lead to improved pathways to successful maps. The question of how to make this integration happen will undoubtedly be one of the major issues for mapping software in the next several years.

Acknowledgments

It is a pleasure to thank Eric Lander for suggesting this research area to us, and for his technical contributions and support. Ernie Brickell's and Fred Howes' support has been crucial to this effort. Bruce Hendrickson, James Park, Cindy Phillips, and Mike Sipser provided helpful reviews, comments, and valuable technical contributions. Norman Doggett and Manny Knill shared with us both their chromosome data and their ideas about physical mapping. We would also like to thank Leslie Goldberg, Jim Orlin, David Torney, and Mike Waterman for fruitful discussions.

Jon Atkins and Paul Goldberg contributed proofs that certain optimization functions are NP-hard. Bruce Hendrickson and Rob Leland not only gave us access to their code for envelope minimization from their Chaco software, but made modifications to it in order to interface it with our codes. Cathy McGeoch, Henry Shapiro, and Bernard Moret gave us many helpful tips on experimental analysis.

References

F. Alizadeh, R. Karp, L. Newberg, and D. K. Weiser. Physical mapping of chromosomes: a combinatorial problem in molecular biology. In Proceedings of the th Annual ACM-SIAM Symposium on Discrete Algorithms.

J. Atkins. Testing whether genomic matrices have chimeric number two is NP-complete. Manuscript, Sandia Labs.

J. Atkins, E. Boman, and B. Hendrickson. A spectral algorithm for the seriation problem. Technical Report SAND, Sandia National Laboratories, Albuquerque, NM.

K. Booth and G. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J. Comput. Syst. Sci.

T. A. Brown. Gene Cloning. Chapman and Hall, second edition.

I. Chumakov et al. Continuum of overlapping clones spanning the entire human chromosome q. Nature.

D. Cohen, I. Chumakov, and J. Weissenbach. A first-generation physical map of the human genome. Nature.

A. Cuticchia, J. Arnold, and W. Timberlake. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics.

A. Cuticchia, J. Arnold, and W. Timberlake. ODS (ordering DNA sequences): a physical mapping algorithm based on simulated annealing. CABIOS.

A. Cuticchia, J. Arnold, and W. E. Timberlake. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics.

M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Math. J.

M. Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Math. J.

S. Foote, D. Vollrath, A. Hilton, and D. C. Page. The human Y chromosome: overlapping DNA clones spanning the euchromatic region. Science.

D. Fulkerson and O. Gross. Incidence matrices and interval graphs. Pacific J. of Mathematics.

M. Garey and D. Johnson. Computers and Intractability. Freeman and Co.

P. Goldberg. The generalized consecutive ones property is NP-complete. Technical report, Sandia National Laboratories, December; also included in "Four strikes against physical mapping of DNA" by Goldberg, Golumbic, Kaplan, and Shamir, Tel Aviv University Tech. Report.

E. Green, H. Reithman, J. Dutchik, and M. V. Olson. Detection and characterization of chimeric yeast artificial-chromosome clones. Genomics.

D. Greenberg and S. Istrail. The chimeric mapping problem: algorithmic strategies and performance evaluation on synthetic genomic data. Computers & Chem.

D. Greenberg and C. Phillips. In preparation.

A. Grigoriev, R. Mott, and H. Lehrach. An algorithm to detect chimeric clones and random noise in genomic mapping. Genomics.

B. Hendrickson and R. Leland. The Chaco user's guide. Technical Report SAND, Sandia National Laboratories, Albuquerque, NM, October.

M. Juvan and B. Mohar. Optimal linear labelings and eigenvalues of graphs. Disc. Appl. Math.

E. Knill. Lower bounds for identifying subset members with subset queries. In Proceedings of the th Annual ACM-SIAM Symposium on Discrete Algorithms.

L. T. Kou. Polynomial complete consecutive information retrieval problems. SIAM J. Comput.

E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics.

P. Little. Mapping the way ahead. Nature.

R. Mott, A. Grigoriev, E. Maier, J. Hoheisel, and H. Lehrach. Algorithms and software tools for ordering clone libraries: application to the mapping of the genome of Schizosaccharomyces pombe. Nucleic Acids Research.

D. Nelson and B. H. Brownstein. YAC Libraries: A User's Guide. W. H. Freeman and Co.

B. Parlett and D. Scott. The Lanczos algorithm with selective orthogonalization. Math. Comp.

D. Torney. Mapping using unique sequences. J. Mol. Biol.

D. Vollrath, S. Foote, A. Hilton, L. Brown, P. Beer-Romero, J. Bogan, and D. C. Page. The human Y chromosome: an interval map based on naturally occurring deletions. Science.