Physical Mapping by STS Hybridization:
Algorithmic Strategies and the Challenge of Software Evaluation

David S. Greenberg                      Sorin Istrail
Sandia National Laboratories            Sandia National Laboratories
Albuquerque, NM                         Albuquerque, NM
dsgreen@cs.sandia.gov                   scistra@cs.sandia.gov

March

To appear in the Journal of Computational Biology, summer

Abstract

An important tool in the analysis of genomic sequences is the physical map. In this paper we examine the construction of physical maps from hybridization data between STS (sequence tag site) probes and clones of genomic fragments. An algorithmic theory of the mapping process, a proposed performance evaluation procedure, and several new algorithmic strategies for mapping are given. A unifying theme for these developments is the idea of a conservative extension: an algorithm, measure of algorithm quality, or description of a physical map is a conservative extension if it is a generalization, for data with errors, of a corresponding concept in the error-free case.

In our algorithmic theory we show that the nature of hybridization experiments imposes inherent limitations on the mapping information recorded in the experimental data. We prove that only certain types of mapping information can be reliably calculated by any algorithm. A test generator is then presented, along with quantitative measures for determining how much of the possible information is being computed by a given algorithm. Weaknesses and strengths of these measures are discussed.

Each of the new algorithms presented in this paper is based on combinatorial optimizations. Despite the fact that all the optimizations are NP-complete, we have developed algorithmic tools for the design of competitive approximation algorithms. We apply our performance evaluation program to our algorithms and obtain solid evidence that the algorithms are capable of retrieving high-level, reliable mapping information.



This work was supported in part by the U.S. Department of Energy under contract DEACDP.

Introduction

A central question for the Human Genome Program is how to bridge the gap between the size of DNA fragments which can be directly sequenced and the size of the human genome. Researchers continue to look for ways of extending the size of fragments which can be directly sequenced, and ways of picking out important pieces to sequence. However, at present and in the near term, large-scale sequencing projects depend on some process which involves dividing a large segment of DNA into overlapping pieces, analyzing the smaller pieces separately, and determining the order of the pieces so as to combine information about the pieces into information about the whole.

It is this last step, reordering fragments, which is the focus of this paper. Many methods have been proposed for reordering fragments, ranging from tools for helping experts reorder the fragments by hand to automatic programs employing maximum likelihood analysis, combinatorial optimizations, or a variety of heuristics. These tools have been aimed at data which includes STS probe-clone interactions, FISH data, radiation hybridization data, genetic map orderings, break point positions, etc.

Typically a given tool or algorithm is evaluated by applying it to actual data and displaying the resulting map. In this paper we attempt to provide a framework for a more rigorous analysis of mapping algorithms. We aim to answer the following questions:

- For a given type of data, what type of information is it reasonable to expect from an algorithm?

- For a given algorithm, how well does it work on different types of data?

- Are there good algorithms for particular types of data, and if so, what are they?

Since these questions are clearly broad and difficult to answer, we have begun by concentrating on a particular type of data, hybridizations of sequence-tag-site (STS) probes with clones of genomic fragments, and a particular class of algorithms, those based on combinatorial optimizations. We pay especial attention to data in which there are errors of various types: clones which contain multiple unrelated fragments (i.e., chimeras), clones with internal deletions, and both false-positive and false-negative errors in the hybridization data.

In this first study we were able to show:

- The information available in a hybridization matrix of even error-free experiments is limited. The amount of available information is directly related to the information in the PQ-tree of the matrix. (Some terms here, such as PQ-tree, may be unfamiliar to readers new to the subject; they are defined in later sections.)

- The presence of errors further degrades the information available. Furthermore, the errors may obscure redundancies and singularities which could easily have been filtered in the error-free case.

- Despite these limitations on the information available in hybridization matrices, it is possible, even for data containing errors, to determine relevant facts about the target genome. In particular, we identify local properties which can be determined, such as the fact that certain probe pairs are adjacent.

- Given the correct ordering for a matrix, it is possible to divide the adjacent probe pairs into a weak and a strong set, where the weak set cannot be reliably determined by any algorithm and the strong set is potentially determinable. The definitions of these sets are formal and precise for error-free data, and heuristic for data containing errors.

- The percentage of strong adjacent pairs identified by an algorithm is a reasonable measure of algorithm success and can be used to compare algorithms. However, this measure is local in nature, and more global measures are still needed.

- We formulate a noise model which allows us to unify the seemingly dissimilar challenges of finding the correct order of probes and of dealing with the errors in the hybridization matrix. We show that combinatorial optimization functions can be used to search for solutions which require the postulation of as few errors as possible. We believe that the success of these functions is due to their being conservative extensions of the error-free case and their having monotonicity with respect to common error types.

The figure below gives a table of contents for this paper.

The remainder of the paper is organized as follows. We first present the basic biology required for the rest of the paper. We then rigorously define the physical mapping problem and show that the information about a genomic target which can be reliably inferred from an error-free hybridization matrix is related to the PQ-tree of the matrix. Next we discuss possible sources of errors in the hybridization matrix and extend our theory to include the errors. We then describe how to generate synthetic data for testing mapping algorithms and show experimentally that hybridization matrices will tend to have ambiguities which algorithms cannot resolve. Following that, we describe how combinatorial optimizations can be used to search for good maps and present several algorithms based on the optimizations. We next describe a pilot experiment in which we examine the performance of our algorithms on varying amounts of errors. Finally, we contrast our work with other studies in the field and describe the large amount of work which still remains to be done.

A biology primer

Readers who are already familiar with the process of physical mapping can skip this section. Readers who desire more details than are present in this section are encouraged to read Brown's or Nelson and Brownstein's books.

The essence of the physical mapping process is as follows. The experiment begins with a sample of target DNA; recall that DNA is a linear sequence of base pairs (A, C, G, and T). Pure samples of the target DNA are cut at specific points, and then each fragment of DNA is inserted into a circular DNA molecule, called a vector, to produce a recombinant DNA molecule. The DNA fragment incorporated into the vector is called an insert. The vector

Contents

Introduction
A biology primer
    Experimental errors
A theory of physical mapping
    Error-free hybridization experiments
    Mapping Information
    Algorithmically retrievable mapping information
        Probe restrictions
        Consecutive-ones matrices
        Booth and Lueker's PQ-tree algorithm
        The true map is almost never algorithmically retrievable
        PQ-trees encode information in common to all maps
Experimental errors
    Extending the model to data with errors
    NP-completeness barriers
    Reasonable goals
Evaluating Algorithms on Hybridization Data
    Generator
        Mimicking real data
        Exploring a range of data
    The Quality of Data
        Connected Components
        Redundant probes
        Weak adjacencies
        Summary of data quality
Combinatorial Optimizations as Models of Error-correction
    Genomic reconstruction through parsimonious explanation
    Fragment-counting objective functions
    Other objective functions
    Objective functions
Algorithms
    The Human-greedy algorithm: Minimizing
    The clone-cover algorithm
    The Fiedler-vector spectral algorithm: Minimizing
    The cycle-basis algorithm: Minimizing
    The OPT algorithm
Experimental Design
    Measures of success
    Parameters to vary
    Timing
Experimental Analysis of Algorithms
    General observations
    Spectral
    Cycle Basis Algorithm
    Clone Cover
    Human
    Random
    Comparison
Related work
Summary and Future Directions

Figure: Table of contents.

transports the DNA fragment into a host cell. Within the host cell the vector multiplies, producing numerous identical copies of itself, and therefore of the DNA fragment it carries. When the host divides, copies of the recombinant DNA molecule are passed to the progeny and further vector replication takes place. After a number of cell divisions, a colony, or clone, of identical host cells is produced. Each cell in the colony contains one or more copies of the recombinant DNA molecules. The DNA fragment is now said to be cloned.

It is possible to construct probes of various types. In general, a probe is a piece of DNA which is complementary to some section of the target DNA. Probes may be short random sequences, copies of the ends of clone fragments, genetic markers, or practically any previously identified piece of DNA. One particularly useful type of probe is the sequence-tag-site (STS) probe. An STS is constructed to match a single site on the target DNA.

Given a set of clones and a set of probes, it is in principle possible to determine which probes hybridize to which clones. A probe should hybridize (i.e., stick under experimental circumstances) to a clone if and only if its sequence is complementary to some subsequence of the clone. Thus a record of all hybridizations tells us which probes are contained within which clones.

Experimental errors

Unfortunately, the simple process described above neglects many important shortcomings of actual experiments. Typically these shortcomings are described as errors in the hybridization data, although some of them are, strictly speaking, merely deviations from the simple model.

Chimerism. One important type of experimental error, chimerism, results from the cloning process itself. A chimera is a clone which contains two unrelated fragments of DNA. The formation of chimeric clones occurs with all vector types (phages, cosmids, YACs, and megaYACs) and can occur both in the adoption of the fragment into the vector and in the subsequent clonal replication. Furthermore, the proportion of chimeric clones tends to increase as the size of the DNA insert increases. Since the presence of chimeric clones greatly complicates the mapping process, great effort has gone into reducing their occurrence and/or detecting the chimeric clones.

There are several methods used for chimerism reduction. They are all characterized by a tradeoff between labor-intensive procedures and accuracy. One method uses physical mapping of the ends of the YAC insert. It involves the isolation of the ends of the YAC insert, which requires a considerable amount of experimental effort. Although very accurate in detection, the method is impractical for the detection of every chimeric clone. Another method uses fluorescence in situ hybridization (FISH). It is also a very laborious method which requires sophisticated techniques not available in all laboratories. Also, it has sensitivity limits and fails to detect short noncontiguous sequences. Another difficulty that the methods must face is that of being able to distinguish chimeric clones from clones subject to cotransformation events, i.e., the introduction and stable independent maintenance of two or more YACs in the same cell.

Despite the methods developed for chimerism detection, the percentage of chimerism in genomic libraries continues to be high. When the YAC technology was developed, the inventors predicted that the technology would suffer from some chimerism in the clones of a YAC library. In the clone libraries that were used in the creation of the first high-resolution maps, however, chimerism was discovered to occur at much higher percentages than predicted, both in early chromosome maps and in the Y chromosome map; substantial frequencies of chimerism have also been estimated for the two most widely used human YAC libraries. The recent advance in the discovery of megaYACs seems to confirm the expectation that the larger the YAC is, the more chimerism is introduced into the library.

In the face of all these experimental difficulties, and the fact that chimerism is intimately connected to the basic operations of recombinant technology, computational support for mapping chimeric clones is vital for the creation of reliable physical maps of the chromosomes.

Deletions. Deletions are another way in which gaps between fragments can occur. Even when a single fragment is inserted in a vector, a piece of the fragment may be deleted in the replication stage. The result is that the clone represents two fragments which are not contiguous in the genome, i.e., a chimera. Some deletions appear to be random, and some are the result of uncloneable genomic regions in regular or megaYACs. It has even been conjectured that there may be an uncloneable region of the human genome on the average once every few million bases. The megaYACs in particular contain a lot of deletions; it seems that they are internally scrambled. The splicing mechanism of yeast, which is part of the yeast's system of DNA repair, seems responsible for putting together pieces of non-yeast DNA, i.e., chimeras and deletions. Other factors responsible for these errors are repetitive sequences and fragility of sequences. Indeed, some sequences of the human genome containing large numbers of repeated sequences tend to be uncloneable, as they seem to trigger the yeast's repair system. Other sequences seem too fragile to be inserted without breaking them into pieces.

Hybridization errors. The process of hybridization of probes to clones is complex both biologically and experimentally. Inevitably there are certain percentages of both false positive (incorrect recording that a probe hybridizes to a clone) and false negative (the opposite error) hybridization results.

One reason for false hybridization results is that a probe may not hybridize strongly to its location. Clones of different sizes and relative positions to the probe site may have conformations which make hybridization more or less likely. This sort of problem leads mostly to false negatives.

Another reason for false hybridization results is that while a probe may have a unique exact matching site on the target genome, there may be other sites that are pretty good matches. Under some hybridization conditions this may yield a false positive.

The number of probe-clone pairs which must be checked can be very large. Therefore it is common to pool clones and check for hybridization. Often the pooling techniques do not precisely define which clone in a pool was the cause of a hybridization. In this case, either false positives or false negatives may arise. In addition, the pooling techniques may introduce correlations between false positives and/or false negatives.

The reading of a gel to determine hybridization is not simple. Sometimes the human or machine reader simply makes a mistake. Since the volume of data is already high, it is not common to do redundant checks for accuracy. Again, this can lead to either false positives or false negatives.
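On synthetic data, the combined effect of these error sources is often modeled by independent entry flips. A minimal sketch of such a noise model (the function, rates, and matrix below are illustrative assumptions, not the paper's generator):

```python
import random

def add_hybridization_noise(matrix, fp_rate, fn_rate, rng=random.Random(0)):
    """Return a noisy copy of a 0/1 hybridization matrix: each recorded
    hybridization (1) is lost with probability fn_rate (false negative),
    and each non-hybridization (0) is spuriously recorded with
    probability fp_rate (false positive)."""
    return [
        [(0 if rng.random() < fn_rate else 1) if a == 1
         else (1 if rng.random() < fp_rate else 0)
         for a in row]
        for row in matrix
    ]

clean = [[1, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 1]]
noisy = add_hybridization_noise(clean, fp_rate=0.05, fn_rate=0.10)
```

Correlated errors, such as those introduced by pooling, would require a richer model; independent flips capture only the marginal error rates.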

Other errors. Although we will concentrate on resolving the problems due to chimeric clones and hybridization errors, there are several other types of errors common in physical mapping.

Repeated occurrences of probes. The knowledge that each probe sticks to a unique location in the genome greatly increases the information yielded by a clone-probe match. STSs usually have a unique (or almost unique) occurrence in the genome, and therefore an STS actually defines a position in the genome. Other types of probes, however, may have a large number of occurrences, and therefore the information they provide is much weaker. An STS is usually difficult to find, while generating small probes is usually routine. Although it is computationally harder to deal with probes with repeated occurrences, it is important to use the information they can provide, due to the simplicity of the experiments involving them.

Errors in restriction fragment fingerprinting. Restriction maps are difficult to construct for entire genomes because the sites for the most suitable enzymes are distributed nonrandomly and are sometimes blocked by the action of methylation systems; restriction maps also fail to address the need of most map users for ready access to the cloned DNA. Errors specific to restriction fragment maps include measurement errors in the lengths of fragments, missing fragments, and extra fragments (due to a cut not occurring at some sites).

Gaps are regions of the genome not represented in a map. Typically there are many gaps in a map, and for statistical reasons complete closure of a map for a large genome is practically impossible. The problem is not only due to the random sampling of the genome libraries; some libraries do not contain all the sequences present in the genome. It is estimated that maps for the human genome are likely to contain a large number of gaps.

A theory of physical mapping

Despite the many experimental and theoretical studies of physical mapping, the very notion of a physical map is still poorly understood. In this section we present a formal mathematical model of the physical mapping problem using hybridization data. We use this model to investigate the algorithmic tractability of the problem. In particular, we quantify the amount of information inherent in hybridization data. This will allow us to evaluate the performance of mapping algorithms in a fair and rigorous way.

We start by examining hybridization data corresponding to perfect experiments. That is, we assume in this section that the hybridization matrix is error free: all probe-clone overlaps are faithfully identified, and no clones are chimeric. In the next section we will expand the model to include errors. The evaluation of these error-free data sets will show that there is a nontrivial structure to the information which we can hope to algorithmically retrieve. An understanding of this structure will be critical to the analysis of noisy data, that is, data in which hybridization errors and cloning errors occur.

In particular, we will show that many properties of the target DNA are not retrievable from a hybridization matrix alone. In fact, we will show that often there are several possible orders of probes along the target DNA which could have yielded the hybridization matrix. In these cases it is not reasonable to expect an algorithm to always determine the true order of the probes. Instead, we show that an algorithm of Booth and Lueker can be used to encode information which is true of all orderings which could have led to the hybridization matrix, and thus surely is true of the actual order.

Error-free hybridization experiments

The basic components of a physical mapping experiment are a target genomic region, a set of clones, a set of probes, and a hybridization matrix.

The goal of a mapping effort is to produce a map of some target genomic region. Let G be the genomic region for which we desire a genomic map. For the purposes of mapping, the region G is completely determined by its sequence of base pairs. We represent this sequence as an interval I(G) over the real line.

Experimentally, the target genomic region is studied through a set of clones. Each clone consists of one or more fragments of the target genome. Each fragment is naturally associated with the subinterval of I(G) corresponding to its piece of G. When a clone is a single fragment we call it a regular clone, and when it consists of at least two fragments we call it chimeric. Throughout this section it will be assumed that all clones are regular.

Information about the clones is gathered through hybridization with probes. An STS probe is designed to define a single location on the genome. Although an STS has measurable extent on the genome, for our purposes it is reasonable to consider it to mark a point location. Thus each probe is associated with a point in the interval I(G).

Genomic Placements. Using the correspondence between components and the interval I(G), we can now formally define the input to a mapping experiment. The experiment consists of a set of clones C = {C_1, C_2, ..., C_m} and a set of probes P = {P_1, P_2, ..., P_n}. Each regular clone C_i is defined by the location of its endpoints b_i and e_i on the interval I(G). Chimeric clones are defined by several pairs of endpoints. Each probe P_j is defined by its location p_j on I(G). We call the set of clones, plus the set of probes, plus the positions of all endpoints and probes, the genomic placement of the experiment. We represent the placement as (G, C, P, pos), where pos is the function which maps fragment endpoints and probes to their positions on I(G). Ultimately, the goal of a physical map is to find out as much as possible about the genomic placement.

Fingerprints and Hybridization matrices. The wet-lab portion of a physical mapping experiment takes genetic material, which can be described by a genomic placement, and returns experimental results. Just as a genomic placement can be used to describe any input to a hybridization experiment, we can formalize the possible outputs of an experiment. For a given set of clones, the fingerprint of a position in I(G) is defined to be the set of clones that contain that position. We will see that the output can be described as a set of fingerprints.

An actual experiment records the hybridizations of some probes with some clones. The results of all the hybridization tests can be encapsulated in a single hybridization matrix. The hybridization matrix A is defined by A(C_i, P_j) = 1 if the interval of C_i is determined experimentally to hybridize to probe P_j, and 0 otherwise. The recording of uncertainty about the hybridization result is possible by using values between 0 and 1, but we leave this extension for future work. In practice the hybridization is rarely tested directly but rather through a pooling scheme, which may introduce biases. In this section we assume that the testing is somehow done perfectly.
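The definition above can be made concrete: given clone intervals and probe point positions on I(G), the error-free hybridization matrix simply records containment. A minimal sketch (function and variable names are ours, for illustration):

```python
def hybridization_matrix(clones, probes):
    """clones: list of (begin, end) intervals on I(G); probes: list of point
    positions. Returns A with A[i][j] = 1 iff probe j lies inside clone i."""
    return [[1 if b <= p <= e else 0 for p in probes] for (b, e) in clones]

# Three regular clones and three probes on the unit interval.
clones = [(0.0, 0.9), (0.2, 0.6), (0.5, 0.8)]
probes = [0.1, 0.55, 0.7]
A = hybridization_matrix(clones, probes)
# The first clone spans all three probe positions, so its row is [1, 1, 1].
```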

It should be noted that the hybridization matrix does not contain any direct information about the actual positions of the endpoints of the clones or of the probes. Instead, each probe's hybridizations correspond to the fingerprint of the probe's position. Let us call a set of positions complete if their collection of fingerprints contains all possible fingerprints for the given clone arrangement; that is, the fingerprint of any position not in the set would be the same as some fingerprint in the set. If the set of probe positions is complete, then no additional fingerprints can be determined. Having multiple probes with the same fingerprint may give added reliability for data containing errors, and gives some evidence of the relative size of fragments; we will see that, in a certain sense, only the fact that there exists at least one probe with a given fingerprint is useful to physical mapping.

A hybridization event can equally well be described as a probe hybridizing a clone or a clone hybridizing a probe. This duality between clones and probes carries over to fingerprints. The fingerprints described above might be called probe fingerprints (i.e., for P_j the set pf(P_j) = {C_i | A(C_i, P_j) = 1}), while a clone fingerprint (i.e., for C_i the set cf(C_i) = {P_j | A(C_i, P_j) = 1}) could be defined analogously.
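Both kinds of fingerprint are directly computable from the matrix. A small sketch (the function names are ours; `pf` and `cf` follow the text):

```python
def probe_fingerprint(A, j):
    """pf(P_j): the set of clone indices i with A[i][j] = 1."""
    return {i for i, row in enumerate(A) if row[j] == 1}

def clone_fingerprint(A, i):
    """cf(C_i): the set of probe indices j with A[i][j] = 1."""
    return {j for j, a in enumerate(A[i]) if a == 1}

A = [[1, 1, 0],
     [0, 1, 1]]
assert probe_fingerprint(A, 1) == {0, 1}   # probe 1 is in both clones
assert clone_fingerprint(A, 0) == {0, 1}   # clone 0 contains probes 0 and 1
```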

The fact that the output of a hybridization experiment is restricted to fingerprint information will have a major impact on the possible maps which can be algorithmically determined. In particular, we will show that the exact sizes of the clones and the exact distances between probes are not determinable.

Mapping Information

The basic question of this paper is: What information about the genomic region can be determined from a given set of experimental results? We can make this question somewhat more formal by asking: What information about the genomic placement can be determined from a hybridization matrix? We still, however, lack a formal definition of information. In this section we start with some examples of possible types of information, and then show that it is only possible to compute some types of information from the hybridization matrix.

Some simple facts about the genomic placement are readily apparent from the hybridization matrix. The positive information that a clone contains a probe, that is, that the probe's position is within the clone's interval, is directly contained in the matrix. If two clones contain the same probe, and the probe occurs uniquely on the genome, then it can be deduced that the clones overlap. The negative information that a clone does not contain a probe can be used to refine the overlaps deduced from positive information.

The following two examples illustrate the interplay of positive and negative information.

Example. Suppose that clone C_1 contains two probes, P_1 and P_2, but no others, and clone C_2 contains P_1, P_2, and P_3. The fact that both clones share at least one probe shows that the clones overlap; the fact that clone C_2 contains an additional probe shows that on at least one side its interval extends beyond the interval of C_1.

Example. Let us now consider three regular clones C_1, C_2, C_3 and three probes P_1, P_2, P_3. Suppose that C_1 contains all three probes, C_2 contains P_1 and P_2, and C_3 contains only P_1. Then the positive information tells us that every pair of clones overlaps. This is enough to tell us that the clones mutually overlap, but is not enough to tell their order. The negative information, however, allows us to determine that the order of the positions, within the placement, of the probes and begin points of the clones must be b_1, p_3, b_2, p_2, b_3, p_1, or its reverse. All the end points must come after p_1, but their relative order is not defined.
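The interval reasoning used in these examples can be checked mechanically: for regular (single-interval) clones, a candidate probe order is consistent with the hybridization matrix only if each clone's probes occupy consecutive positions in that order (this is the consecutive-ones property discussed later). A small sketch using the containments of the second example (names are ours):

```python
def order_consistent(order, clone_contents):
    """True iff, under the given probe order, every clone's probe set
    occupies a consecutive block of positions -- a necessary condition
    for regular (single-interval) clones."""
    index = {p: k for k, p in enumerate(order)}
    for content in clone_contents:
        positions = sorted(index[p] for p in content)
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

# C1 contains P1, P2, P3; C2 contains P1, P2; C3 contains only P1.
clones = [{"P1", "P2", "P3"}, {"P1", "P2"}, {"P1"}]
assert order_consistent(("P3", "P2", "P1"), clones)       # the order derived above
assert not order_consistent(("P1", "P3", "P2"), clones)   # splits C2's probes
```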

Intuitively, probes yield information by witnessing regions in which clones overlap, or regions in which clones do not overlap. We have already noted that there are only a fixed number of possible probe fingerprints. Each fingerprint witnesses a region of overlap. It is possible, however, when one fragment is completely contained within another, for distinct regions to share the same fingerprint.

Mapping properties. In order to define mapping information more formally, we define a mapping property. A mapping property distinguishes between two different genomic placements. Formally, a mapping property divides the universe of possible genomic placements into two sets: those having the property and those not having the property. For example, one possible property is "There is a probe at position p in G." Another is "Probe j is between the position of the beginning of clone i and the beginning of clone i'."

Definition. A mapping property is a function from the set of all genomic placements to the set {0, 1}. If the function is 1 on a placement, then the placement is said to have the property; otherwise it does not have the property.

The information contained in a map can be defined as a collection of mapping properties about the genomic placement. Several natural levels of information can easily be defined. We will show that, though all are natural to define, it is not possible to determine some properties from the hybridization matrices alone.

Based on the definition of a genomic placement, the maximum amount of information which could possibly be in the matrix is the positions of the endpoints of each clone and the positions of the probes. We define Complete Mapping Information as knowing the beginning and end position of every clone and the position of every probe in G.

Complete mapping information allows us to determine the size of each clone and the distance between two consecutive probes. However, the hybridization matrix does not directly encode sizes, so we may have to settle for relative positions rather than absolute positions. The maximum amount of relative position information is a total order on the endpoints of clones and the positions of probes. We define Complete Order Information as knowing the order of the clone endpoints and probe positions along G.

For ease of exposition we ignore here the possibility that the positions of two components coincide. It is straightforward to extend the ordering to allow sets of equal components within the ordering. Typically we will not be able to distinguish a total order from its reverse order, but we will not repeatedly point this out.

As will become clear later, the relative order of some components may not be available. We might hope that a total order of the probes, or a total order of the clone endpoints, would be possible. Thus we define True Probe Order Information as knowing the order of the probe positions along G. Similarly, True Clone Order Information is knowing the order of the clone begin and end points along G.

In fact, as we will show below, it is rarely possible to determine all the information at these total ordering levels. Instead, it will be possible to quantify exactly what information can be determined from the matrix.

Algorithmically retrievable mapping information

Having defined the input and output of an experiment, and the information one might hope to retrieve from an experiment, we are now ready to show that some information is irretrievable. Intuitively, a property is irretrievable from a hybridization matrix if, based on the hybridization matrix alone, one cannot be sure whether the property holds for the input genomic placement or not. If information is irretrievable, then no algorithm, regardless of computational power, can determine the information with certainty. Formally:

Definition. A genomic placement is consistent with a hybridization matrix if the matrix rows can be placed in one-to-one correspondence with the placement's clones, and the matrix columns can be placed in one-to-one correspondence with the placement's probes, so that an entry in the matrix is a one if and only if the corresponding probe is contained in the corresponding clone.

Definition. A mapping property for a hybridization matrix is algorithmically irretrievable if there exist two genomic placements consistent with the matrix such that for one the mapping property is true and for the other the mapping property is false.

Prob e restrictions

As a first step toward determining the retrievable information from a hybridization matrix, we define the probe restriction of a placement by a matrix. The intuitive idea is that if a matrix does not contain a complete set of fingerprints, then there is missing information. Furthermore, the fingerprints only witness regions of overlap, and not the exact endpoints of fragments.

Since a hybridization matrix may not contain a complete set of fingerprints for its input genomic placement, we define a canonical placement for which the matrix is complete.

Definition: Let E = (G, C, P, pos) be a genomic placement. The probe restriction of E is E|_P = (G, C', P, pos'), where C' consists of one clone C'_i for each clone C_i in C such that at least one probe is in C_i, the value of pos'(P_j) = pos(P_j), and the values of pos'(b'_i) and pos'(e'_i) equal the positions of the first and last probe contained in the corresponding clone C_i.

Thus the probe restriction is derived from a placement by shrinking each clone, so that C'_i is the interval from the position of the first probe contained in C_i to the position of the last probe in C_i.
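For concreteness, the shrinking operation can be sketched as follows. This is our own illustration of the definition, with clones as closed intervals and probes as points; the function name is ours.

```python
def probe_restriction(clones, probes):
    """Shrink each clone to the interval spanned by the first and last
    probe it contains; clones containing no probe are dropped.
    clones: list of (begin, end) intervals; probes: list of positions.
    Returns the list of shrunken (new_begin, new_end) intervals."""
    restricted = []
    for (b, e) in clones:
        inside = [p for p in probes if b <= p <= e]
        if inside:  # keep only clones hit by at least one probe
            restricted.append((min(inside), max(inside)))
    return restricted
```

For example, a clone (0, 10) probed at 1, 5, and 7 shrinks to (1, 7), a clone (4, 6) shrinks to the degenerate interval (5, 5), and a clone (8, 9) containing no probe disappears.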

Lemma: For any genomic placement E = (G, C, P, pos), the set of probes P is complete for the set of clones in E|_P.

Theorem: Any mapping property which is true of a genomic placement but not true of its probe restriction is algorithmically irretrievable.

Proof: Any hybridization matrix which is consistent for the genomic placement is also consistent for its probe restriction.

Unless a hybridization matrix happens to contain probes which coincide with every endpoint of a clone, then for at least one clone the positions of the clone's endpoints are different in the genomic placement and its probe restriction. Even if the hybridization matrix does contain all such probes, a genomic placement in which the positions are perturbed slightly will yield the same matrix. Thus, unless the probes occur at every possible position, thereby eliminating the possibility of perturbation, the exact position of clone endpoints or probe positions is irretrievable. Excluding this highly improbable situation, we have the following corollary.

Corollary: The exact-position level of mapping information is not algorithmically achievable.

Since this theorem and its corollary preclude any property which depends on exact positions of probes and clones from being retrievable, we define a restricted class of properties which depend only on probe order.

Consecutive ones matrices

Since information about the exact position of clones and probes is irretrievable, we turn to information about relative position. The importance of probe restrictions in defining the available information points toward looking at the relative order of probes. If we determine the relative order of the probes, then we can retrieve the relative order of the clones in the probe restriction, which is as much as we can expect. In the probe restriction, in some sense, a clone is its ordered set of probes.

In order to examine probe orders we introduce the following notation. Let Perm(n) be the set of permutations of the set {1, ..., n}. A permutation π in Perm(n) is a probe order, and we denote it by π = P_{π(1)} P_{π(2)} ... P_{π(n)}. Finally, let A^π be the matrix resulting from permuting the n columns of A according to the permutation π.

Definition: A probe order π is consistent iff, for every clone C_i of C, the probes in the clone fingerprint of C_i occur consecutively in π.

Definition: The true order π₀ of the genomic placement (G, C, P, pos) is the order of the probes on G.

Certainly, for error-free matrices, π₀ is consistent, but there need not be a unique consistent probe ordering. In extreme cases, requiring consistency may not eliminate any permutations. For example, if the number of clones is equal to the number of probes, and clone i hybridizes probe j if and only if i = j, then any probe ordering is consistent. At the other extreme, it is possible to construct a matrix for which only one order (and its reverse) is consistent. In simulations such as those described in the section on data quality, an m × n hybridization matrix of regular clones both has a number of consistent probe orders which is exponential in n and has exponentially many non-consistent orders.

Despite the lack of specificity of the requirement that a probe order be consistent, it is nonetheless an important concept for mapping. In particular, it is often interesting to know whether any consistent orders exist. In the literature such matrices occur in many contexts and are defined as follows.

Definition: A matrix A has the consecutive ones property (C1P for short) if there exists a permutation π in Perm(n) such that in the column-ordered matrix A^π every row has all the ones occurring consecutively. Such a permutation is called a C1P-permutation for A.

Thus a matrix is C1P iff there exists a consistent order for it.
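On small examples, C1P-permutations can be enumerated directly from the definition. The following brute-force sketch is ours and is exponential in the number of columns, for intuition only; the efficient construction is the PQ-tree algorithm discussed below.

```python
from itertools import permutations

def has_consecutive_ones(row):
    """True iff the ones in the row occur in one contiguous block."""
    ones = [j for j, v in enumerate(row) if v == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def c1p_permutations(matrix):
    """All column orders pi making every row of A^pi have its ones
    consecutive, found by exhaustive search over Perm(n)."""
    n = len(matrix[0])
    result = []
    for perm in permutations(range(n)):
        permuted = [[row[j] for j in perm] for row in matrix]
        if all(has_consecutive_ones(r) for r in permuted):
            result.append(perm)
    return result
```

For instance, the matrix with rows 1 0 1 and 0 1 1 is not in consecutive-ones form as given, but the column order (0, 2, 1) makes both rows consecutive, so the matrix is C1P.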

Although the matrix A^{π₀} is C1P when all the clones are regular and there are no hybridization errors, it is not clear whether one can make stronger statements about it. As we will see, in some ways knowing that the true order is consistent is all that one can know. Therefore we define the basic notion of a map as follows.

Definition: Let (G, C, P, pos) be a genomic placement and A its hybridization matrix. A map for the genomic placement is a column-ordered matrix A^π, where π is a consistent probe order. The true map of the genomic placement is A^{π₀}, where π₀ is the true probe order.

Booth and Lueker's PQ-tree algorithm

Since the true order is a consistent order, it is desirable to determine all possible consistent orders of a hybridization matrix. Fortunately, the problem of determining all C1P orders is well understood. In fact, a linear-time algorithm due to Booth and Lueker constructs the set of all such permutations in a concise way: the so-called PQ-trees.

Since the PQ-trees will be important to the remainder of this section, we give a sketch of the theory of PQ-trees here. Readers interested in a more detailed discussion are encouraged to read Booth and Lueker's paper. The PQ-tree is a method of encoding a set of permutations. It consists of three types of objects: leaf-nodes, P-nodes, and Q-nodes. The leaf-nodes contain the basic elements being permuted, in our case the columns of the matrix. We label the leaf nodes {1, ..., n}.

The allowable orderings of the leaf nodes are constrained by their inclusion in P- and Q-nodes. All the leaf nodes in a Q-node are kept in a fixed order. That is to say, in any permutation in the set encoded by a PQ-tree, the leaf nodes within a Q-node appear in the same order, but in either possible direction (e.g., 1 2 3 or 3 2 1). The leaf nodes in a P-node, on the other hand, have no specified order. That is to say, in any permutation in the set encoded by a PQ-tree, the leaf nodes within a P-node can occur in any order (e.g., 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, or 3 2 1).

Thus, if all the leaf nodes were in a single Q-node, then the set of permutations encoded would be just the order within the Q-node and its reverse, while if all the leaf nodes were in a single P-node, then the set of permutations encoded would be all permutations of the leaf nodes.

A PQ-tree allows P- and Q-nodes to be placed within other P- and Q-nodes. Thus each constituent of a node can either be a leaf or a set of possible orderings of a subset of the leaves. For example, the set of permutations which keep the sets {1, 2}, {3, 4}, and {5, 6} together, but allow any ordering within sets or among sets (i.e., 2 1 3 4 6 5 but not 1 3 2 4 5 6), can be represented as a P-node containing three P-nodes, each of which contains the leaves in one set.
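The nesting just described can be made concrete with a toy encoding of PQ-trees. This is our own minimal representation for illustration, not Booth and Lueker's data structure: a leaf is an integer, a ('P', children) node allows any child order, and a ('Q', children) node allows only the given order or its reverse.

```python
from itertools import permutations

def frontiers(node):
    """Enumerate the leaf orders (frontiers) encoded by a toy PQ-tree."""
    if isinstance(node, int):  # a leaf contributes just itself
        yield (node,)
        return
    kind, children = node
    # P-node: any child order; Q-node: the given order or its reverse.
    orders = permutations(children) if kind == 'P' else [children, children[::-1]]
    seen = set()
    for order in orders:
        for combo in _concat_frontiers(order):
            if combo not in seen:
                seen.add(combo)
                yield combo

def _concat_frontiers(children):
    """All concatenations of one frontier per child, in the given order."""
    if not children:
        yield ()
        return
    for head in frontiers(children[0]):
        for tail in _concat_frontiers(children[1:]):
            yield head + tail
```

A single Q-node over three leaves encodes exactly two frontiers (the order and its reverse); a single P-node over three leaves encodes all six; and a P-node over three two-leaf P-nodes encodes 3! x 2 x 2 x 2 = 48 frontiers, matching the grouped-sets example above.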

The surprising property of PQ-trees is that each PQ-tree corresponds to all possible consistent orderings of a matrix (like our hybridization matrices), and that each matrix has a corresponding tree. Thus, by using the Booth and Lueker algorithm, it is possible to encode all possible consistent maps in one PQ-tree.

Knowing all possible consistent maps allows us to be certain that we have the true map in our set, but the number of consistent maps may be very large. The PQ-tree, however, does more than just list all consistent maps: it tells us a great deal about the structure of these maps. Intuitively, it tells us two types of information about the maps. On the one hand, it encodes consensus information, that is, parts of the ordering which are true of all consistent maps, through the leaves contained directly in Q-nodes. On the other hand, it encodes restricted variability through the grouping supplied by the P- and Q-nodes.

Because the PQ-tree can nest P-nodes and Q-nodes, there is actually a continuum from strong consensus information to weak variability information. At the strongest end are leaves within Q-nodes. The order of these leaves is known to be the same in all consistent maps, and therefore in the true map. Next strongest are leaves within P-nodes. The leaves are known to occur together, with no intervening leaves, in all consistent maps. Although their local order is unknown, the global fact that they occur together is guaranteed to be true of the true map.

As one works one's way up from the leaf nodes, the information encoded refers to larger and larger portions of the genome (when the PQ-trees refer to hybridization matrices). Thus a high-level Q-node might represent the fact that several segments occur in a fixed order, although the order within the segments is not fixed. A high-level P-node gives only the weaker information that there are segments, but nothing about their order.

Thus, given the PQ-tree for a hybridization matrix, the uncertainty about the true order is due to the P-nodes. If no P-nodes exist, then the tree collapses to a single Q-node. On the other hand, if any P-nodes exist, then the complete true map is not retrievable.

Lemma (Unique consistent map): A C1P matrix has exactly two C1P-permutations, a permutation and its reverse, iff its PQ-tree consists of exactly one non-leaf node, which is a Q-node.

Lemma (P-node information about the true map is irretrievable): It is algorithmically irretrievable to find the order of the children of a P-node in the PQ-tree of a hybridization matrix.

Proof: By definition, there are two orderings of the children which result in consistent orderings for the matrix, and each of these orderings corresponds to a different probe-restricted genomic placement.

The true map is almost never algorithmically retrievable

If the hybridization matrix (the input of the mapping problem) were to assure that the map is unique up to reversal, then the task of mapping would be clear: find the map. In this case the map must be the true one. The next theorem relates the question of finding the true map to the question of finding a unique consistent order of a matrix.

Theorem: Consider a genomic placement and its hybridization matrix A. There is an algorithm which takes as input A and computes the true map if and only if the matrix has a unique consistent order up to reversal.

The theory of when a matrix has a unique C1P ordering will be presented elsewhere, but the conditions necessary are quite strict. We do not know of any biologically relevant model of matrix generation in which the resulting matrices are likely to meet the conditions for unique C1P ordering. In other words, the hybridization matrix will almost never have a unique C1P ordering. This means that it will almost always be beyond the capability of any algorithm, irrespective of the amount of computing time used, to compute the true map.

Given that the true map cannot be computed, the natural question then is: what is a fair goal for a mapping algorithm? We know that the true map must be C1P, but there may be exponentially many C1P maps to consider. Since computing the one true map seems to be problematic, we instead ask what information about the true map is retrievable. For example, although we cannot know the complete order corresponding to the true map, we might be able to determine a partial order which is true of all C1P maps.

PQ-trees encode information common to all maps

We are now looking for information which is true of all C1P maps, and therefore true of the true map. Fortunately, the PQ-tree gives us exactly what we want.

Theorem: A mapping property is retrievable from a given hybridization matrix if and only if either the property is true for all genomic placements which are consistent with a permutation encoded by the PQ-tree of the hybridization matrix, or it is false for all such genomic placements.

Proof (if): In order for the property to be irretrievable, there would have to be two genomic placements consistent with the matrix, one for which the property was true and one for which the property was false. However, by construction, the PQ-tree encodes all consistent permutations, so a genomic placement which is consistent with the matrix must be consistent with some permutation encoded by the PQ-tree. By assumption, such placements are either all true or all false for the property, and thus the property is retrievable.

(only if): By the definition, a mapping property is irretrievable if it is true of one consistent placement and false of another.

We have already seen that the order of the children of a P-node is irretrievable. The theorem above implies that the order of the children of a Q-node meets our definition of retrievable. Similarly, other structural properties of the PQ-tree, such as that certain probes are in the same subtree, are retrievable. It seems, in fact, that the only useful retrievable properties are those which describe the structure of the PQ-tree.

Although we have been forced to progressively reduce our goal concerning the information about the genomic placement, all has not been lost. We have shown that it is possible to retrieve information about the genomic placement which is encoded in the PQ-tree, and in some sense this is the most information which we can expect to retrieve.

Unfortunately, when we look at mapping in the presence of errors, such an algorithm will not exist. In the next sections we examine in more detail the sort of information present in the PQ-tree. Since we do not know how to create PQ-trees, or any similar structure, when the data contains errors, it will be our goal to describe as much information encoded by the PQ-tree as possible in terms which are independent of the PQ-tree.

As noted above, the PQ-tree contains a complex amount of mapping information common to all maps. The recursive structure of the tree induces a hierarchy from local information (pairwise probe adjacencies) to global information (components which must be together).

Local resolution: adjacent pairs of probes

The finest level of resolution concerning the ordering of the probes is pairwise adjacencies. Let us consider a revealing case: mapping instances with unique (up to reversal) maps. What can we say about the pairwise adjacencies of probes in the unique map? In a trivial way, they are the same in all maps.

Definition: Let us call a pair of probes (P_i, P_j) a strong adjacency if P_i and P_j are adjacent in every map, i.e., every consistent order. An adjacency of a map is weak if it is not strong.

In this terminology, an instance of the mapping problem has a unique solution if and only if all adjacencies are strong. What are the strong adjacencies when the mapping problem has multiple solutions? The answer is the set of adjacencies given by the Q-nodes which have only leaves for children.

Clearly, for any map, an adjacency of two probes is strong (or fixed) if they are adjacent in all maps. On the other hand, an adjacency of two probes in map A^{π₁} is weak when there exists another map A^{π₂} such that in this map the two probes are not adjacent.

As we consider probe orders up to reversal, this distinction, adjacent or separated, constitutes a complete analysis of pairwise properties. This is exactly the mapping information captured by the Q-nodes which contain only leaves. For weak adjacencies we can find out from the PQ-tree all their degrees of freedom. For example, one can determine all probes which are ever adjacent, in any map, to a given probe.
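On small instances, strong adjacencies can be found by brute force directly from the definition, by intersecting the adjacency sets of all consistent orders. The sketch below is our own illustration and is exponential in the number of probes; the PQ-tree yields the same information efficiently.

```python
from itertools import permutations

def consistent_orders(matrix):
    """Yield all column orders in which every row's ones are
    consecutive (the C1P-permutations), by exhaustive search."""
    n = len(matrix[0])
    for perm in permutations(range(n)):
        ok = True
        for row in matrix:
            ones = [pos for pos, j in enumerate(perm) if row[j] == 1]
            if ones and ones[-1] - ones[0] + 1 != len(ones):
                ok = False
                break
        if ok:
            yield perm

def strong_adjacencies(matrix):
    """Unordered probe pairs adjacent in every consistent order."""
    pairs = None
    for perm in consistent_orders(matrix):
        adj = {frozenset((perm[k], perm[k + 1])) for k in range(len(perm) - 1)}
        pairs = adj if pairs is None else pairs & adj
    return pairs or set()
```

For the matrix with rows 1 1 0 and 0 1 1, only the order 0 1 2 and its reverse are consistent, so both adjacencies are strong.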

Global resolution: components in the clone-cover graph

At the other extreme from strongly adjacent probe pairs, there are probe pairs for which there is no information relating the probes within the pair. Ultimately, all information concerning the relative position of probes derives from having clones which cover more than one probe. Two probes that are adjacent on G are likely to be covered by clones, i.e., both would hybridize to a number of clones. The more clones that cover them, the stronger is the linkage between them. On the other hand, probes that are far away on the genome will be unlikely to be covered by a common clone.

The structure of this coverage is recorded in the following graph.

Definition: The clone-cover graph of A is given as follows: CC = (V, E), where V = P and E = {(P_i, P_j, w) | P_i, P_j ∈ P, w = the number of clones containing both P_i and P_j}.
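As an illustration (our code, not the paper's), the weighted edges of the clone-cover graph and its connected components can be computed directly from the matrix:

```python
from itertools import combinations

def clone_cover_graph(matrix):
    """Positive-weight edges of CC(A): for each probe pair (i, j),
    the number of clones (rows) containing both probes."""
    n = len(matrix[0])
    weight = {}
    for i, j in combinations(range(n), 2):
        w = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
        if w > 0:
            weight[(i, j)] = w
    return weight

def components(matrix):
    """Connected components of CC(A), ignoring zero-weight edges,
    via a small union-find over the probes."""
    n = len(matrix[0])
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j) in clone_cover_graph(matrix):
        parent[find(i)] = find(j)
    groups = {}
    for p in range(n):
        groups.setdefault(find(p), set()).add(p)
    return sorted(map(frozenset, groups.values()), key=min)
```

Two clones covering disjoint probe pairs, for example rows 1 1 0 0 and 0 0 1 1, produce two unlinked components {0, 1} and {2, 3}.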

Clearly, the connected components of the graph CC (disregarding edges of weight 0) correspond to disjoint and unlinked regions on G. In the PQ-tree terminology, they are exactly the children of a P-node at the root of the PQ-tree. By the definition of a P-node, any ordering of its children defines a valid C1P-permutation. It is then a corollary of the theorem above that the true order of the components of CC cannot be computed. That is, pairs of probes from distinct components have no information relating them.

On the other hand, we do know something about probe pairs within a component: they should not have probes from outside the component between them.

Since the clone-cover graph removes explicit information about clones and retains only counts of their coverage, it is interesting to ask how much information is lost. It turns out that the clone-cover graph still contains enough information to determine whether the matrix is C1P or not.

Lemma: The mapping information contained in the clone-cover graph CC(A) is sufficient for determining whether A is C1P.

Matrix singularities

There are certain properties of a matrix which guarantee that the PQ-tree will lose information. Two examples are non-hybridizing probes and redundant probes.

It is possible that a probe does not hybridize any clones. Its corresponding column in the matrix will be all zeroes. The probe will thus be a singleton component in the clone-cover graph, and all we know about it is that it does not fit inside any other component. Clearly, we can just discard these probes and lose nothing. However, when the data contains errors, it may be difficult to identify these probes, since errors may cause them to not correspond to columns which are all zeroes.

Another possibility is that two or more probes hybridize exactly the same clones. In this case they will be inside a leaf P-node, since any ordering yields the same matrix. The probes may be adjacent on the genome or non-adjacent. In the adjacent case, no clone happened to begin or end between them. Perhaps they are very close together, or perhaps the intervening region is uncloneable. In the non-adjacent case, it may be that all clones which start between them also end between them. This seems to be a less likely possibility, but definitely can occur.

As with non-hybridizing probes, it is tempting to fix the matrix by removing all but one copy of each redundant set. However, it may again be difficult to do so when errors make probes which hybridize the same clones appear different.

Experimental errors

In the previous section we showed that the theory of PQ-trees captures the information available in a hybridization matrix from an ideal experiment. In the ideal experiment, every clone corresponded to a single contiguous region on the target DNA, every probe corresponded to a unique location on the target DNA, and every overlap of clone and probe was correctly witnessed by a hybridization experiment and faithfully recorded in the hybridization matrix.

Unfortunately, molecular biology, like all experimental sciences, does not produce perfect data. As was described earlier, the actual experiments are much more complicated than the simple model used in the last section. A variety of errors, derived from the lack of desired precision of the experiments, some of which are inherent to the current technology, make the resulting data less than ideal. The determination of which probes stick to which clones will yield both false positives and false negatives. Some sections of the target DNA will tend to shatter into tiny pieces which are lost, while other sections will contain no probe sequences. Some clones will correspond to multiple regions on the target DNA.

In this section we will formalize the effect of errors on hybridization matrices and describe what an algorithm might be expected to do to counter their effects. Unlike the error-free case, we will not be able to point to an algorithm which captures the information in the hybridization matrix. Instead, we discuss what sorts of information are likely to be retrievable and give some possible strategies for finding information.

Extending the model to data with errors

Prior to the inclusion of errors, the major pillar of our theory was the fact that if the columns of the hybridization matrix were ordered in the same order as the probes in the genomic placement, then the matrix was C1P. Unfortunately, all our types of errors invalidate this guarantee. Thus we are led to ask again: What does a map look like?

One possibility would be that enough were known about the errors that we could at least have a statistical model of what the true ordering of the hybridization matrix would look like. Unfortunately, very little is known about the statistical nature of the errors. It is possible to place some bounds on the error behavior, but no hard and fast rules exist. For example, one might assume that all or most clones will have one or two fragments, or assume that no clone will have more than a few percent false positives, but these would only be guesses. Therefore we present a theory in which the nature of the errors is abstracted into a general cost function.

We begin our formalism by returning to our basic assumption: our goal is to retrieve as much information about the genomic placement as possible from the hybridization matrix. We know from the error-free case that at most we can determine the true probe ordering along the genomic placement. From the probe ordering we were able to infer from the matrix the probe-restricted placement, which gave us an ordering of the clones.

When the data contains errors, the probe ordering does not suffice to tell us the clone order. For example, if a clone's row in the true ordering is 1 1 0 1, then we could have a single fragment hybridizing probes 1 and 2 only (that is, the clone without errors would read 1 1 0 0), or a single fragment hybridizing probes 1 through 4 (that is, the clone without errors would read 1 1 1 1), or two fragments, one hybridizing probes 1 and 2 only and the other hybridizing only probe 4, or many other possibilities. The three possibilities mentioned correspond to the least amount of just false positives, just false negatives, or just chimeric clones which could explain the row's value.

Thus we extend our definition of a map from the error-free case to require both a permutation of the columns of the matrix and an explanation of the errors which could lead to each row's value.

Definition: An explanation φ of a permutation π of a hybridization matrix is a function which maps each row of the matrix to a set of fragments in a probe-restricted placement.

For example, the explanation of a C1P matrix might map each row to the single fragment which begins at the first probe in the row's block of ones and continues to the last probe. For non-C1P matrices, the explanation may involve marking some entries of the matrix as being incorrect. In the error-free case, we defined a consistent π as meaning that each row corresponded to a single continuous fragment, or, equivalently, that each row contained a single contiguous block of ones. We extend the notion of consistency to data with errors by allowing the explanation to fix hybridization errors and/or split chimeric clones.

Definition: The probe order-explanation pair (π, φ) is consistent if, for every clone C_i of C, the correspondence created by φ causes all and only probes within fragments to have value one in the matrix.

Since there are many possible explanations of any matrix row, we need some way of determining which are best. We therefore define the concept of a noise model. A noise model encodes what is known (or thought to be known) about the possible errors in the data by specifying a set of explanations which undo the errors and a cost function which rates the likelihood of each explanation being correct. The cost function need not be a formal likelihood function, but should as faithfully as possible reflect what is known about the errors.

Definition: A noise model N consists of a set of possible explanations and a function which takes each explanation φ on a fixed permutation π of a hybridization matrix A to a cost f(φ, A^π) ∈ R.

Just as we can extend the notion of consistency to data containing errors by including an explanation of the errors, we can extend the notion of map to include not just an ordering but also an explanation drawn from the noise model.

Definition: Let (G, C, P, pos) be a genomic placement, A its hybridization matrix, and N a noise model. A map for the genomic placement is a column-ordered matrix A^π and an explanation φ such that (π, φ) is consistent for A. The true map of the genomic placement is A^{π₀}, where π₀ is the true probe order and the explanation describes the errors that actually occurred.

As in the error-free case, we are unlikely to be able to determine whether a particular map is the true map. In fact, we cannot at this point rule out any probe ordering, because there may always exist some explanation which is consistent with the ordering for the matrix.

We define three simple noise models.

Definition: The false positive only (FPO) noise model allows explanations of the form "the entries of the matrix in set S are changed from 1 to 0," where S is any subset of matrix entries whose value is 1 and the resulting matrix is C1P. The mapping to fragments then proceeds as in the C1P case. The cost of an explanation is the size of S.

Definition: The false negative only (FNO) noise model allows explanations of the form "the entries of the matrix in set S are changed from 0 to 1," where S is any subset of matrix entries whose value is 0 and the resulting matrix is C1P. The mapping to fragments then proceeds as in the C1P case. The cost of an explanation is the size of S.

Definition: The chimera only (CO) noise model allows explanations of the form "the entries which are 1s in each row are partitioned into at most two groups such that each group occurs consecutively." The mapping to fragments then proceeds as in the C1P case, except that rows having two groups are mapped to two fragments. The cost of an explanation is the number of rows having two groups.
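For a fixed permutation, the cheapest explanation under each of these three models can be computed row by row: under FPO one keeps the longest run of ones and flips the rest to 0; under FNO one fills the zeroes between the first and last one; under CO one counts the rows whose ones form two blocks (rows with more than two blocks have no CO explanation). The following sketch is our illustration of these cost functions; finding the best permutation remains the hard part.

```python
def row_costs(row):
    """Cheapest explanation cost of one (already permuted) row under
    each model: FPO flips 1->0, FNO flips 0->1, CO splits the row into
    at most two consecutive groups (None means infeasible)."""
    blocks, start = [], None
    for j, v in enumerate(list(row) + [0]):  # sentinel closes a trailing run
        if v == 1 and start is None:
            start = j
        elif v == 0 and start is not None:
            blocks.append((start, j - 1))
            start = None
    if not blocks:
        return {'FPO': 0, 'FNO': 0, 'CO': 0}
    ones = sum(row)
    fpo = ones - max(e - s + 1 for s, e in blocks)
    fno = blocks[-1][1] - blocks[0][0] + 1 - ones
    co = len(blocks) - 1 if len(blocks) <= 2 else None
    return {'FPO': fpo, 'FNO': fno, 'CO': co}

def matrix_cost(matrix, perm, model):
    """Total cheapest-explanation cost of A^perm under one model,
    or None if some row is infeasible under that model."""
    total = 0
    for row in matrix:
        c = row_costs([row[j] for j in perm])[model]
        if c is None:
            return None
        total += c
    return total
```

A row reading 1 1 0 1 costs 1 under each model (drop one 1, fill one 0, or declare one chimeric split), while reordering its columns to produce 1 1 1 0 reduces every cost to 0.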

These simple noise models correspond to our three major error types. Each has the desirable property that the cost of the null explanation is always minimal, and that if more errors are required by the explanation, then the cost is higher. We call noise models with this property monotonic, and we will see later that monotonicity is a helpful property in algorithm design.

There are many possible variants. The cost of changing two adjacent zeroes to ones might be less than the cost of changing non-adjacent zeroes, thereby approximating the effects of deletions. The cost of changing two adjacent ones to zeroes might be higher than the cost of changing two distant ones, thereby penalizing the use of false positives to remove true chimera. The chimeric noise model might allow an unlimited number of groups per row; in this case the cost might be defined as the maximum number of groups per row, or as the total number of groups.

We are not arguing for a particular noise model, but merely providing a framework in which to define them. For example, a noise model which is designed to match one type of error might also be useful when other error types occur but the one type is dominant. The simple noise models also provide intuition for algorithm design.

Some noise models are very forgiving, in that any permutation admits some explanation.

Lemma: Under the FPO or FNO noise model, for any hybridization matrix and any permutation, there always exists a consistent explanation for the permutation on the matrix.

Proof: Under FPO, all ones could be eliminated, or, more reasonably, all ones except those in a single block on each row. Under FNO, all zeroes could be eliminated, or, more reasonably, all zeroes between the first and last one in each row.

In order to limit the power of forgiving noise models, and in general to reduce the number of candidate permutation-explanation pairs, we make the parsimony hypothesis.

Definition: The maximum parsimony set (MPS) for a hybridization matrix is the set of permutation-explanation pairs with the lowest cost.

Unlike the error-free case, in which we could limit our search to C1P maps and be certain that the true map was in the search set, limiting our search to the MPS does not guarantee that the true map is in the search set. We will thus often want to consider all permutation-explanation pairs with close to minimal cost, since we are then more likely to include the true map.

NP-completeness barriers

It has been shown elsewhere (though not in these terms) that finding even one member of an MPS for some noise models is NP-complete. Readers not familiar with NP-completeness may want to look at Garey and Johnson's standard reference. Intuitively, a problem being NP-complete means that computer scientists agree that no efficient algorithm exists which solves the problem exactly. For example, the CO model allows one to search for an ordering of the matrix with at most two blocks of ones per row, and finding such an ordering is known to be NP-complete. The FNO noise model is closely related to the NP-complete problems of bandwidth and of envelope.

In fact, we conjecture that it will be NP-complete to find a single member of the MPS for any noise model corresponding to real experiments.

Reasonable goals

Since the NP-completeness results imply that we cannot efficiently find a best permutation-explanation pair for a given hybridization matrix, we are in a similar position to not being able to distinguish between C1P matrices in the error-free case. The difference is that now we have no algorithm, like the PQ-tree algorithm, to capture all the possibilities.

Thus we aim for a subset of what the PQ-trees told us. In particular, we have looked at how many strong adjacencies we can find. We might hope that there will be regions of the genome that are so well covered by clones that, even in the presence of errors, there will remain strong adjacencies. We also ask what the connected components in the CC graph are. False negatives tend to increase the number of connected components, while false positives and chimera tend to decrease them. We can also look for components that are more strongly connected. We might demand that there be no single edge whose removal would split the component into two components. This would remove some connectivity due to errors, while having less effect on well-connected true components.
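The last idea, discarding connectivity that hangs on a single edge, amounts to finding bridge edges in the clone-cover graph. The following naive sketch is ours (a linear-time bridge-finding algorithm exists, but removing each edge in turn is clearer for small graphs):

```python
def cc_edges(matrix):
    """Positive-weight edges of the clone-cover graph of A."""
    n = len(matrix[0])
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if any(row[i] == 1 and row[j] == 1 for row in matrix)]

def bridges(matrix):
    """Edges whose removal splits a component of the clone-cover
    graph: candidates for connectivity created by errors alone."""
    edges = cc_edges(matrix)
    def reach(src, skip):
        # Depth-first search over all edges except the skipped one.
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for (a, b) in edges:
                if (a, b) == skip:
                    continue
                if a == u and b not in seen:
                    seen.add(b); stack.append(b)
                if b == u and a not in seen:
                    seen.add(a); stack.append(a)
        return seen
    return [(a, b) for (a, b) in edges if b not in reach(a, (a, b))]
```

In a path-shaped graph every edge is a bridge, while adding a clone that closes a triangle leaves no bridges, which is the kind of redundancy a well-covered true component provides.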

Evaluating Algorithms on Hybridization Data

It is a postulate of algorithmic work on physical maps that the ultimate test of an algorithm is its success on real genomic data. However, our limited understanding of real data, and the paucity of existing data for which we know the correct map, require that algorithms be evaluated on synthetic data. In work which will be reported elsewhere, we are working with the Los Alamos Center for Human Genome Studies and with the Whitehead Institute to evaluate our algorithms on real data, but the purpose of this paper is to explore the evaluation of algorithms on synthetic data.

Our evaluation consists of two parts: using data designed to match the observed characteristics of a particular data set from Los Alamos, and using data chosen to cover a range of possible data characteristics. Both sets of data are produced using a matrix generator based on a simple model of probe and clone placement and of error occurrences. The simplicity of the model makes it relatively easy to control the characteristics of the generated matrices. On the other hand, the simplicity undoubtedly obscures some important details of actual matrices. One of the goals of our experiments has been to evaluate the generator. More will be said about this in later sections. Here we simply describe the generator used.

Generator

Input

The hybridization matrix generator used in this work has the following essential inputs:

- Number of probes and clones. Each matrix will have one column for each probe and one row per clone. Error-free clones consist of a single fragment, which is determined to hybridize a consecutive set of probes (see below for more detail).

- Error rates: chimerism rate, false negative rate, and false positive rate. Each clone has an independent probability of being chimeric and thereby consisting of two fragments rather than one. Each position in the matrix corresponding to a probe which hits a clone has an independent probability of being a false negative and thus being recorded as a zero instead of as a one. Each position in the matrix corresponding to a probe which does not hit a clone has an independent probability of being a false positive and thus being recorded as a one instead of as a zero.

- The range of clone sizes. Each clone has an independent probability of being a small clone or a large clone. Size ranges are specified for both small and large clones.

Description of the generator algorithm

The generator begins by choosing positions for the probes. The target genetic data is abstracted as an interval of the real line. Probe positions are chosen independently and uniformly at random in this interval; this leads to a Poisson distribution of their locations. The locations are sorted by position to produce the correct order of the probes.

Once the probes' positions on the target genome have been chosen, each clone is generated and a corresponding row of the matrix is created. For each clone, an initial fragment is chosen by randomly choosing a size range according to the input probability and then choosing a size uniformly at random in that range. A start position on the target genome is then chosen uniformly at random from those positions at which a fragment of this size could occur (i.e., not too close to the end). The fragment thus corresponds to a subinterval of the interval corresponding to the target genome. The start positions are Poisson distributed while the end positions are not. Comparison with a generator which chooses both the start and end positions of the clones according to Poisson distributions would be an interesting follow-on experiment; the current approach, however, allows rapid generation and a significant amount of control over clone sizes.

Once the start and end position of a fragment have been determined, the identity of those probes whose positions fall in the fragment's interval can easily be determined. The size of the clone determines the expected number of probes which hit the clone. If the clone is determined to be chimeric, then a second fragment is generated in the same manner as the first. Since the two fragments may overlap, the clone may or may not correspond to two intervals. In the row of the matrix created for this clone, each column corresponding to a probe which is not contained in the clone's intervals is initially set to 0; if a coin tossed with the input false positive probability indicates a false positive, then the matrix entry is changed to 1. Similarly, each column corresponding to a probe which is contained in the clone's intervals is initially set to 1; if a coin tossed with the input false negative probability indicates a false negative, then the matrix entry is changed to 0.
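As an illustrative sketch, the generation procedure just described might look as follows in code. All names, default rates, and size ranges here are our own assumptions for illustration, not the values used in the study:

```python
import random

def generate_matrix(num_probes, num_clones, genome_len=1.0,
                    p_chimeric=0.1, p_false_neg=0.1, p_false_pos=0.02,
                    p_small=0.5, small_range=(0.01, 0.03),
                    large_range=(0.05, 0.10), seed=0):
    """Sketch of the hybridization-matrix generator described above."""
    rng = random.Random(seed)
    # Probes: independent uniform positions on the interval, then sorted
    # to give the correct probe order.
    probes = sorted(rng.uniform(0, genome_len) for _ in range(num_probes))

    def make_fragment():
        # Pick a size range (small or large), a size within it, and a
        # start position chosen so the fragment fits on the interval.
        lo, hi = small_range if rng.random() < p_small else large_range
        size = rng.uniform(lo, hi)
        start = rng.uniform(0, genome_len - size)
        return (start, start + size)

    matrix = []
    for _ in range(num_clones):
        intervals = [make_fragment()]
        if rng.random() < p_chimeric:      # chimeric clone: second fragment
            intervals.append(make_fragment())
        row = []
        for p in probes:
            hit = any(a <= p <= b for a, b in intervals)
            if hit:    # true one, possibly flipped by a false negative
                row.append(0 if rng.random() < p_false_neg else 1)
            else:      # true zero, possibly flipped by a false positive
                row.append(1 if rng.random() < p_false_pos else 0)
        matrix.append(row)
    return probes, matrix
```

In the actual generator the rows and columns are then randomly permuted before output, so that algorithms cannot exploit the generation order.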

Discussion

There are many aspects of this generator which we believe could be made more detailed, as well as many questions we would like to answer concerning its biases.

Probe generation. Choosing positions based on double-precision real numbers is of a finer grain than the actual genome; does this matter? Actual probes are often not random strings of DNA but have some genetic meaning. Is there some way to include this knowledge in the generator? In particular, some probes are known to be the PCR ends of clone fragments. What would be the effect of explicitly defining some probes to be fragment ends? It is also known that some regions of DNA are uncloneable. Should this be represented by gaps in between probes, or by modification to clone generation?

Clone generation. The actual biological process of producing clones is quite complex. Can a better model be produced by generating base-pair sequences and then simulating the fragmenting of copies of the target DNA and the uptake by vectors? Is there some intermediate model which captures more of the biology than our current model?

Error generation. How should chimeric clones be treated? Should they be allowed to have more than two pieces? Should the individual pieces have a size distribution identical to non-chimeric pieces? In an earlier generator we made the size of the combined chimera be distributed equivalently to the size of non-chimera; this led to one or both pieces being very small. On the other hand, the current approach leads to chimera being on average larger than non-chimera, and thus providing more, though potentially misleading, information.

Should two false positives or false negatives be correlated? Deletions might be modeled as correlated false negatives; however, for clones which hit at most three probes, which is true of most of our clones, a correlated false negative simply produces a smaller clone. Should the pooling strategies used for hybridization experiments be included, in order to identify correlations introduced by the pooling process? Should false positives be more common near clones than far from clones?

Should the rate of false positives depend on the size of the matrix? A fixed rate of false positives corresponds to twice as many false positives per row when the number of probes doubles. If the ratio of ones due to false positives to ones due to true hybridizations is too high, can any algorithm succeed?

Extensions. Some experiments yield hybridization data which is neither positive nor negative but is instead an estimate of the likelihood of hybridization. Can the matrix be given fractional entries to represent this data? It is difficult to produce probes which correspond to unique locations on the target DNA. Can each probe be assigned a set of positions, in order to represent this type of data?

Technical details

In order to ensure that any algorithm running on the generated matrix does not make use of the known generation order of the probes or clones, both the rows and columns of the matrix are randomly permuted before being output. We believe that these permutations are critical to fair evaluations of algorithms.

Mimicking real data

Our first use of our generator was to attempt to produce matrices which were as similar as possible to real data. Norman Doggett's group at Los Alamos (LANL) was kind enough to share some early hybridization data with us, consisting of STS probes, clones, and their recorded clone-probe hybridizations. Recently they have increased the number of probes and clones and, by using breakpoints and other non-hybridization data, have published a physical map of their data. We were initially interested in how much information was available in their early data, so we have not used any of the later data.

Given a hybridization matrix, it is not immediately obvious what characteristics to attempt to match. We decided that, as a first approximation, we should attempt to match the distribution of clone sizes: we measured the average number of probes hit by a clone in the LANL data and endeavored to produce matrices with the same average number of ones per row. In the LANL data every clone hit at least one probe (presumably clones hitting no probes had been removed before the data was given to us), and no clone hit more than a small number of probes. More than one third of the clones hit only a single probe, and thus provided no mapping information; more than three quarters of the clones hit at most three probes.

In order to simulate matrices with these characteristics, we made a fixed fraction of our clones small and the remaining clones large. In trials with no simulated errors, the average number of ones produced per row matched the desired value, though the per-row maximum was considerably higher and a few rows had no ones at all. Clearly the probes were randomly bunched, causing some clones which expected a few hits to miss all probes, and others to hit many more. Although not exactly matching the distribution of clone sizes in the LANL data, we considered it to be a good match.

However, the LANL data definitely contained some amount of chimerism and hybridization errors. In another project we are attempting to see whether we can use our algorithms to estimate the amount of each error, but for this study we merely tried several possibilities. Each time we increased the chimerism or false positive rate, we had to reduce the average size of clone fragments in order to maintain the same distribution of the number of ones per row. Similarly, we had to increase the fragment size when we increased the false negative rate.

In all we looked at eight cases: no errors; two rates of chimerism only; two rates of false negatives only; two rates of false positives only; and a combination of chimerism, false negatives, and false positives. In order to be able to explore many trials in each of these cases, we scaled down the number of probes.

Exploring a range of data

The LANL data is only one example of what data might look like. We wanted to be able to discuss a wide range of data, so that we would cover other existing data and also be able to suggest directions in which to push the data in order to achieve better results.

In this first study we chose to investigate each error type in isolation. We also did not want to have too many clones which gave no information, so we slightly increased the average clone size, to an average of five probe hybridizations per clone. We tried several rates each of chimerism, of false negatives, and of false positives. These rates were intended to include values at which all algorithms would be expected to do well and values at which all would be expected to fail. In retrospect, the false-positive range should have included some much smaller values.

In order to allow a large number of trials, we performed all our experiments using 50 probes. We were quite interested in the effect of coverage on algorithmic success, so we tried each case with a range of clone counts, from 50 up to 500 clones.

The Quality of Data

In the preceding sections we saw that the amount of information available in a hybridization matrix can be limited, and discussed some of the reasons for its limitations. In the error-free case we were able to link the information available to the theory of PQ-trees, but when the data contains errors we could merely say that things will be worse. Thus, as our first experiments, we measured as best we could the amount of information available. We looked at the number of connected components in the clone cover graph, at the number of redundant probes, at the number of non-hybridizing probes, and at the number of weak adjacencies.

Connected Comp onents

For any hybridization matrix we have defined the clone cover graph. Recall that the clone cover graph is formed by associating a vertex with each probe (i.e., column of the matrix) and placing an edge between two vertices for each clone which hits both probes (i.e., a row with a one in the two corresponding columns). Multiple edges between vertices are collapsed into a single edge with weight equal to the multiplicity. One can compute the connected components, that is, sets of vertices which are connected by some path of nonzero-weight edges in the graph.
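Both the clone cover graph and its connected components are straightforward to compute from the matrix; a minimal sketch (function names ours):

```python
from collections import defaultdict

def clone_cover_graph(matrix):
    """Build the weighted clone cover graph of a 0/1 hybridization matrix.

    Vertices are probe (column) indices; each clone (row) contributes an
    edge between every pair of probes it hits, and parallel edges are
    collapsed into one edge whose weight is the multiplicity."""
    weight = defaultdict(int)
    for row in matrix:
        hits = [j for j, v in enumerate(row) if v == 1]
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                weight[(hits[a], hits[b])] += 1
    return weight

def connected_components(n_probes, weight):
    """Union-find over nonzero-weight edges; returns the components."""
    parent = list(range(n_probes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (u, v), w in weight.items():
        if w > 0:
            parent[find(u)] = find(v)
    comps = defaultdict(list)
    for p in range(n_probes):
        comps[find(p)].append(p)
    return list(comps.values())
```

A probe hitting no clones appears here as a singleton component, matching the discussion below.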

In a very strong sense, there is no information available about the relation of probes in different components. Consider two orderings of the probes in which the probes within components are kept in the same order while the order of entire components differs: there is no way to prefer one such order over another. Thus, if there are many components, then any algorithm will fail to reliably retrieve the correct complete order of the probes (an algorithm may occasionally guess the order correctly, but it cannot be sure of getting it correct).

There are many ways in which multiple components can occur. One simple way is if a probe hits no clones; it is then in a component of its own. In hindsight, these probes should have been eliminated from the matrix, but it is still instructive to see how often they occur. In the trials of matrices constructed to match the LANL data, when no errors occurred a small number of probes hit no clones in each trial. Thus these types of probes are common enough to cause some trouble, but not overwhelming. When matrices were created assuming different types of errors, the number of such probes went up slightly for false negatives, down slightly for chimera, and was reduced to essentially zero with even a modest rate of false positives. The interesting point to note is that although false positives can greatly reduce the apparent number of non-hybridizing probes, the number of probes which actually hit no clones is not reduced. (See the accompanying figures.)

Lesson: If there is a significant amount of false positives, then it will be difficult to screen out non-hybridizing probes.

Of course, the number of connected components can be increased by other means besides non-hybridizing probes. If there is a large gap between probes, then it is possible that no clone bridges the gap. For the LANL-like data there were several components (not including the non-hybridizing probes) when no errors were simulated. However, even a modest rate of false positives sharply reduced the number of components; simulating chimerism also greatly reduced the number of components, while simulating false negatives increased it. For the moderate mixture-of-errors case the number of components was likewise reduced. (See the accompanying figures.) Again, note that the number of actual components was not reduced, only the number of apparent components.

Lesson: If there is a significant amount of chimera or false positives, then it will be difficult to identify connected components.

Redundant probes

There is another simple way in which matrices can lack ordering information: two probes can hybridize identical sets of clones. If more than one probe hybridizes exactly the same set of clones, then there is no information about the relative order of these probes; the same matrix would result from any two total orders which differed only in the order of these probes.

Redundant probes occur naturally whenever there is no clone which starts or ends in the interval between two successive probes. In real data this may be because two probes are actually very close together on the target DNA, or by chance if there just are not very many clones. Both these situations are faithfully modeled by our generator.

In some sense the problem of redundant probes is less severe than that of multiple components, because all of the equivalent orders are quite similar; in fact, each equivalent probe order leads to the same clone order. However, in evaluating the success of an algorithm it is important to know whether an incorrect ordering is due to equal probes or not.

[Table: minimum, maximum, and average over the trials of the number of non-hybridizing probes, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Non-hybridizing probes for LANL-like data

[Plot: number of all-zero columns vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, and matrices with false positives.]

Figure: Non-hybridizing probes for varied data

[Table: minimum, maximum, and average over the trials of the number of connected components, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Connected components for LANL-like data

[Plot: number of connected components vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Number of connected components for varied data

It is of course relatively easy to screen input matrices for redundant columns and remove all but one of each type. However, as in the case of multiple components, the presence of errors may obscure the fact that in the error-free matrix there are redundant columns.
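Screening for redundant probes does not require knowing the correct order: identical columns can simply be grouped. A sketch (names ours):

```python
def remove_redundant_probes(matrix):
    """Collapse duplicate columns (probes hybridizing identical clone sets).

    Returns the screened matrix and, for each kept column, the list of
    original column indices it represents."""
    n_cols = len(matrix[0])
    groups = {}
    for j in range(n_cols):
        col = tuple(row[j] for row in matrix)   # column as hashable key
        groups.setdefault(col, []).append(j)
    kept = sorted(groups.values(), key=lambda g: g[0])
    screened = [[row[g[0]] for g in kept] for row in matrix]
    return screened, kept
```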

In the LANL-like data, when no errors are assumed, a small fraction of the probes are identical to the probe before them in the ordering. We used the fact that we know the correct ordering to simplify the search for equal probes, but the columns could be checked for duplicates without knowing the order. In fact, our check of adjacent probes only may undercount the number of duplicate columns. (See the accompanying figures.)

As with multiple components, the presence of errors tended to screen the occurrence of redundant probes. Chimerism only slightly decreased the number of apparent redundancies, presumably because there are more fragments with a chance to begin or end between probes. Hybridization errors, both false positives and false negatives, had a much greater effect, since even two probes which map to the exact same position on the target genome could appear different due to a hybridization error.

Lesson: If there are hybridization errors or, to a lesser extent, chimera, then it will be difficult to identify redundant probes.

Weak adjacencies

The existence of multiple connected components and of redundant probes results in information-theoretic constraints on the ability of an algorithm to solve the exact version of the mapping problem, i.e., to reconstruct the exact sequence of probes along the target DNA. We believe that there are additional sources of weakness in the data; for example, the masked components and redundancies mentioned in the earlier sections are weaknesses not witnessed directly by multiple components or redundant probes.

As mentioned earlier, for C1P matrices there exist multiple different orderings which are C1P. Although some of the possible orderings are due to rearranging connected components and/or scrambling redundant columns, there are other reorderings which correspond to flipping non-redundant groups of columns. More formally, any C1P ordering can be transformed into any other C1P ordering by a series of translocations in which each intermediate ordering is also C1P. A translocation corresponds to taking a section of the probe order and reversing it in place: the translocation (i, j), with 1 <= i <= j <= n, converts the n-element permutation

p_1 ... p_{i-1} p_i ... p_j p_{j+1} ... p_n

to the permutation

p_1 ... p_{i-1} p_j ... p_i p_{j+1} ... p_n.

Using a PQ-tree representation, it is thus possible to identify sequences of probes which are a subsequence of all C1P orderings of a matrix. As earlier, we refer to the adjacent probe pairs within these sequences as strong adjacencies. In the correct ordering of a C1P hybridization matrix, only these pairs are completely defined by the information in the matrix; any other adjacencies in the correct order are weak, that is, it is not possible for an algorithm to always determine them. It should be noted that there may be partial information about these adjacencies. For example, if there are only two C1P orders (not counting reversal of the entire sequence) which differ by a single flip, then the two adjacencies on either side of the flip will be weak; however, an algorithm might be able to identify the two orders as the only possible orders.

In our LANL-like data without errors (thus with C1P matrices), on average over half the adjacencies were weak. Since the chance of identifying a weak adjacency correctly based on

[Table: minimum, maximum, and average over the trials of the number of redundant probes, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Number of redundant probes for LANL-like data

[Plot: number of redundant columns vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Number of redundant probes for varied data

the hybridization matrix alone is limited, any algorithm is expected to misidentify a significant fraction of the correct adjacencies.

Lesson: If there are weak adjacencies, then judging an algorithm on whether or not it retrieves the entire correct ordering is futile.

While it is possible to exactly define a weak adjacency for a C1P matrix (any adjacency which is at one end of a flip which preserves C1P-ness), it is less clear how to define weak adjacencies for non-C1P matrices. We are exploring the following definition, but believe that it will require refinement as we learn more about these matrices. The intuition is that if the permutation after a translocation has no more fragments than the original permutation, then most noise models would allow an explanation which was no more costly than the explanation of the original permutation. Although this assumption cannot be rigorously proven, it does seem reasonable, and the data seems to confirm its usefulness.

Definition: Given an n-column hybridization matrix, the adjacency (i, i+1), for 1 <= i <= n-1, is weak iff there exists a translocation (i+1, j) or (j, i) such that, for every row of the matrix, the number of blocks of ones after the translocation is no greater than the number before the translocation.
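The definition can be checked directly, if inefficiently, by trying every translocation that separates the two columns of the adjacency and counting blocks of ones per row. A sketch with 0-based column indices (names ours):

```python
def count_blocks(row):
    """Number of maximal blocks of consecutive ones in a 0/1 row."""
    blocks, prev = 0, 0
    for v in row:
        if v == 1 and prev == 0:
            blocks += 1
        prev = v
    return blocks

def translocate(row, a, b):
    """Reverse columns a..b (inclusive, 0-based) of a row."""
    return row[:a] + row[a:b + 1][::-1] + row[b + 1:]

def is_weak_adjacency(matrix, i):
    """Adjacency between columns i and i+1 is weak if some translocation
    separating them leaves every row with no more blocks of ones."""
    n = len(matrix[0])
    # Translocations starting at column i+1 or ending at column i.
    candidates = [(i + 1, j) for j in range(i + 1, n)] + \
                 [(j, i) for j in range(i + 1)]
    for a, b in candidates:
        if a == b:
            continue  # reversing a single column changes nothing
        if all(count_blocks(translocate(row, a, b)) <= count_blocks(row)
               for row in matrix):
            return True
    return False
```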

This definition has two clear weaknesses. First, it is possible that, despite the fact that the number of fragments in each row has not increased, the noise model is such that the cost of the explanation has increased greatly. In this case an adjacency we are considering to be weak should be considered strong: it occurs in all close-to-minimal maps. Our current understanding of noise models makes this case seem unlikely, but it is nonetheless possible.

The second weakness is the opposite in nature: an adjacency which is weak may be identified as strong. It is possible, for example, that there exists a permutation which breaks an adjacency in which the number of fragments in one row goes up while the number in all other rows goes down; the maximum number of fragments per row could even go down. Yet our simple procedure would not discover this fact. Thus, even if the noise model has a cost function which is monotonic in the number of fragments, we may miss weak adjacencies. Missing weak adjacencies is, however, a conservative property in our experiments: we will expect algorithms to correctly identify strong adjacencies, so our algorithms will be penalized if we miss weak adjacencies.

Despite the above caveats about the notion of weak adjacencies, we have found it to be a useful concept. One interesting aspect of our definition of weak adjacencies is that, unlike the number of multiple components and the number of redundant probes, it does not seem to correlate with error rate (see the accompanying figures): chimerism and false negatives seem to have little effect, and false positives sometimes increase and sometimes decrease the number. Ideally, of course, none of the errors would affect it at all, and thus we continue to look for better definitions.

Summary of data quality

As can be seen in the accompanying figures, increasing the number of clones is an effective means of improving the data quality, although it is of course experimentally expensive.

[Table: minimum, maximum, and average over the trials of the number of weak adjacencies, for each error setting: none; two chimerism rates; two false-negative rates; two false-positive rates; and combined chimerism, false negatives, and false positives. Numeric entries not recovered.]

Figure: Number of weak adjacencies for LANL-like data

[Plot: number of weak adjacencies vs. number of clones (50 to 500), at various error levels for 50 probes with an average of 5 ones per row; series for C1P matrices, chimeric matrices, matrices with false positives, and matrices with false negatives.]

Figure: Weak adjacencies for varied data

One of our hopes is that, by having algorithms which are effective on the strong parts of the data and which can identify the weak sections, effort can be placed into improving the coverage of just the weak sections.

Combinatorial Optimizations as Models of Error Correction

Genomic reconstruction through parsimonious explanation

In the previous sections we have defined the physical mapping problem in terms of two processes: permutation (i.e., probe reordering) and noise removal (i.e., error repair). One could, however, consider the process to consist of a single noise-removal step, where disorder is just another type of noise. Suppose there existed some objective function which could be applied to ordered matrices such that the function had value zero for the true map and some value greater than zero for any other order of the hybridization matrix. In this case one could attempt to find the ordering of the matrix which minimized the function.

Unfortunately, such a measure can exist only if one knows the genomic placement a priori. As we have seen, it is often the case that several orderings of the matrix could be the correct order, depending on the actual genomic placement and the errors that occurred. The function could take on its unique minimum value only if it magically knew the genomic placement and the errors which occurred. However, it is possible that such a function can be approximated. In this section we discuss several functions which can be argued to approximate the magic measure, and in the following section we describe algorithms which attempt to minimize these functions.

Since the nature of the errors is poorly understood, and they are stochastic in nature, we might not even be able to recognize the magic function if we found it. We can, however, choose a candidate function based on some sound principles and then attempt to validate the function through empirical studies. As many errors occur at one time, measures that correlate with one or many errors are of interest. In the final analysis, however, one would hope to have measures that correlate with the structure of the errors as a whole.

Given such a measure defined for maps, one can search for permutations for which the value of the measure on the resulting map is minimal or close to minimal. Insisting on exact solutions is not justified biologically, owing to the type of data and its evolutionary nature. Moreover, the focus on approximations is mandatory: it is imposed by the computational intractability of computing exact solutions. The insistence on the relevance of close-to-optimal solutions reflects the hypothesis of parsimonious explanations: it is natural to search for minimal explanations, i.e., those based on the smallest amount of change.

Fragment-counting objective functions

Let us consider first measures based on fragment counting. For every permutation, the resulting map of A represents a fragmentation of the clones into fragments, based on the blocks of consecutive ones in each row. One can measure various parameters of this fragmentation: the total number of fragments, the maximum number of fragments in a clone, and the number of clones that are broken (i.e., split). Clearly all these functions have two nice properties. First, they have minimal value when the matrix is C1P.

[Plot: number of weak adjacencies, redundant probes, connected components, and non-hybridizing probes vs. number of clones (50 to 500), for C1P matrices with 50 probes and an average of 5 ones per row.]

Figure: Symptoms of weak data for C1P matrices

Second, they are nondecreasing as the clone splitting increases (i.e., as the chimerism rate increases). Deletions, false positives, and false negatives can either increase or decrease the functions, but will typically increase the measure; decreases come only in the special cases when deletions or false negatives completely remove a fragment, or when false positives join two fragments. When the error rates get very high, these special cases become more common, but in those cases reconstruction will be extremely difficult by any means. Intuitively, maps that have values close to minimal in these measures provide good approximations of parsimonious explanations.
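For illustration, the three fragment-counting measures can be computed for a candidate probe permutation as follows (descriptive names stand in for the symbols used in the text):

```python
def count_blocks(row):
    """Number of maximal blocks of consecutive ones in a 0/1 row."""
    blocks, prev = 0, 0
    for v in row:
        if v == 1 and prev == 0:
            blocks += 1
        prev = v
    return blocks

def fragment_measures(matrix, perm):
    """Fragment-counting measures of a map: apply the probe permutation
    `perm` to the columns, then count blocks of ones in each clone (row).

    Returns (total fragments, max fragments in a clone, broken clones)."""
    per_row = [count_blocks([row[j] for j in perm]) for row in matrix]
    total = sum(per_row)
    maximum = max(per_row)
    broken = sum(1 for b in per_row if b > 1)  # clones split into >1 piece
    return total, maximum, broken
```

On a C1P matrix in its correct order, every row has one block, so the three measures take their minimal values (number of clones, 1, and 0).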

Information known about genomic libraries supports the use of these objective functions. For example, it has been observed that chimeric clones rarely have more than two inserts; therefore maps in which the maximum number of fragments per clone is very small would be consistent with the observed data. If the rate of chimerism is known, maps with the fraction of broken clones close to the chimeric rate, or with the total number of fragments close to the number of clones plus the number of chimeric clones, would provide a good fit to the extra knowledge about the library.

One disadvantage of these fragment-counting functions is that finding the optimum value, on general matrices and on some sparse hybridization-like matrices, has been shown to be NP-complete. However, NP-completeness of the optimum need not be a barrier to us, since we already know that we must look for properties of all approximate solutions rather than for a single optimum solution.

Other objective functions

Total interfragment gap length

One limitation of the measures based on fragment counting is that they are insensitive to the sizes of the fragments and the sizes of the gaps between fragments. The distinction between gap sizes is essential for distinguishing chimerism from deletions: chimeric clones are likely to have big gaps between their fragments, while deletions introduce small gaps into the clones. In this respect false negatives and deletions could be treated together, and data sets exist where they are the dominant errors. The natural idea of keeping the gaps small led us to consider as a measure the total interfragment gap length. Let us first define the span of a map to be the sum, over all rows, of the number of columns from the first to the last one in the row (inclusive), and the density of the matrix to be the total number of ones in the matrix. The total interfragment gap length of a map is then its span minus the density of the matrix.

Minimizing the gap length (or, equivalently, the span) attempts to keep the gaps as small as possible. The function resembles an objective function considered in numerical analysis for square matrices, in connection with the storage of sparse matrices. Although our matrices are not square, the adjacency matrix of the clone cover graph is a closely related square matrix that can be used for the connection.

Minimizing the gap length is obviously inappropriate when large gaps are present; therefore, in that case, dealing with chimerism takes algorithmic precedence.
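A sketch of the gap-length measure, computed per the definition above as span minus density (names ours):

```python
def total_gap_length(matrix, perm):
    """Total interfragment gap length of a map: for each clone (row),
    the number of zero columns strictly between its first and last one,
    summed over rows. Equals the total span minus the matrix density."""
    total = 0
    for row in matrix:
        r = [row[j] for j in perm]            # apply the probe permutation
        ones = [j for j, v in enumerate(r) if v == 1]
        if ones:
            span = ones[-1] - ones[0] + 1     # first-to-last one, inclusive
            total += span - len(ones)         # span minus this row's ones
    return total
```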

Objective functions

There are two properties that it would be desirable for all the objective functions to satisfy:

Conservative Extension of C1P: the optimization function takes on its best (typically minimal) value for C1P matrices.

Monotonicity: the optimization function maps matrices corresponding to higher error rates to worse (typically higher) values.

Each of the measures above (total fragments, maximum fragments per clone, broken clones, and gap length) is indeed a conservative extension of C1P; thus, when these functions are applied to error-free data, the optimal solutions will be C1P. As discussed above, each of these measures is likely to be monotonic. The ways in which these functions break monotonicity for hybridization errors is a potential area for continued research: do there exist simple functions which are monotonic for hybridization errors, and are they efficiently approximable?

Algorithms

The Huffman-greedy algorithm: Minimizing σ

Converting to TSP on the Hamming vector graph

The connection between minimizing the number of blocks of ones in a matrix and solving the Traveling Salesperson Problem on the Hamming distance graph of the matrix has been observed by many authors. In earlier work we abstracted this connection to what we called the vector Traveling Salesperson Problem (vTSP). We repeat the definition of vTSP here for convenience.

Definition: An instance of the vector-TSP is an n-vertex, vector-labeled, complete undirected graph G = (V, E, cost_v) and a function f: N^m -> N, where cost_v is a function from edges to m-vectors over {0, 1}.

The sparseness of a vTSP graph is s(G) = max over e in E of the number of ones in the vector cost_v(e).

The vector-cost of a tour in G is the componentwise sum of the costs of the edges in the tour. The f-cost of a tour is f applied to the vector-cost.

The vTSP f-optimization problem takes as input a vTSP instance I = (G, f) and returns a tour in G of minimal f-cost.

The conversion from the problem of minimizing σ on a matrix to an instance of vTSP is as follows:

1. Create a vertex corresponding to each column of the matrix.

2. Label each edge (v1, v2) with the vector formed by taking the componentwise exclusive-or of the columns corresponding to v1 and v2. The resulting vector has a one in each entry corresponding to a row in which the two columns differ, and a zero in all other entries.

3. Let the reduction function f be the function which sums the entries of a vector. Thus f applied to an edge gives the number of rows in which the two columns differ, and f applied to the vector-cost of a tour gives the number of times a block of ones began or ended in the tour.

The tour found by minimizing the vTSP instance allows blocks of ones to wrap around the matrix, that is, to start near the right side of the matrix and continue at the left side. To avoid this technical problem, an additional column of all 0s can be added to the matrix and the tour opened up to a path from one neighbor of this column to the other. With this change, the cost of the tour is exactly twice σ.
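The doubling relation at the end of the conversion can be checked directly. The sketch below (our own helper names, assuming 0/1 matrices given as lists of rows) shows that, with an all-zero sentinel column appended, the Hamming cost of the cyclic tour through the columns in their given order equals twice the number of blocks of ones: each block contributes one Hamming transition where it starts and one where it ends.

```python
def block_count(matrix):
    """Total number of maximal runs of consecutive 1s over all rows."""
    total = 0
    for row in matrix:
        prev = 0
        for v in row:
            if v == 1 and prev == 0:
                total += 1
            prev = v
    return total

def tour_cost_with_sentinel(matrix):
    """Sum of Hamming distances between consecutive columns along the
    cyclic identity tour, after appending an all-zero sentinel column."""
    cols = list(zip(*matrix))
    cols.append((0,) * len(matrix))
    cost = 0
    for k in range(len(cols)):
        a, b = cols[k], cols[(k + 1) % len(cols)]
        cost += sum(x != y for x, y in zip(a, b))
    return cost
```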

Because addition is distributive, it is easy to see that using vector sum as the reduction function makes the vTSP equivalent to the standard scalar TSP. Therefore, algorithms to optimize σ can be based on the classic approximation algorithms for TSP: twice around a minimum spanning tree, and the Christofides maximum matching improvement. These algorithms have the advantage of giving provable guarantees on performance. For example, using Christofides' algorithm one is guaranteed that the ordering of the matrix produced has a value of σ which is no worse than 3/2 times the optimal value of σ over all orderings; the simpler algorithm yields a guarantee of 2 times optimal.

However, preliminary experiments and the experience of other researchers on other TSP instances show that these algorithms do not do much better than their guarantees, and that other algorithms give better solutions in practice. Therefore we coded two variants of a greedy algorithm for minimizing TSP. Similar algorithms were used extensively in prior work, with the addition of a 2-OPT phase at the end.

Greedy σ I. Our first algorithm works as follows:

1. Convert the matrix to a TSP graph.

2. Starting with the first vertex, add the closest vertex not already on the tour to the tour. Ties are broken randomly.

3. Continue adding the closest vertex (to the last vertex added) which is not already on the tour, until all vertices are used. Again, ties are broken randomly.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ. A tour corresponds to n possible matrix orders, depending on which vertex is made the left-hand column.

Although the reliance on randomness seems a disadvantage, since the quality of the tour can depend highly on the random choices, we found that by using multiple random orders we could get both improved orders and information about the reliability of parts of the order.
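The nearest-neighbour phase of Greedy σ I can be sketched as follows. This is a hypothetical minimal implementation (our own function names, seeded tie-breaking for reproducibility), not the paper's actual code.

```python
import random

def hamming(a, b):
    """Scalar Hamming distance between two equal-length 0/1 columns."""
    return sum(x != y for x, y in zip(a, b))

def greedy_tour(columns, seed=0):
    """Nearest-neighbour heuristic on the Hamming-distance graph of the
    columns, starting from column 0; ties are broken randomly."""
    rng = random.Random(seed)
    unvisited = set(range(1, len(columns)))
    tour = [0]
    while unvisited:
        last = columns[tour[-1]]
        best = min(hamming(columns[j], last) for j in unvisited)
        ties = [j for j in unvisited if hamming(columns[j], last) == best]
        nxt = rng.choice(ties)
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

Running the heuristic from several random starting orders, as the text suggests, gives both candidate orders and a crude confidence signal for the parts of the order that recur.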

Greedy σ II (Huffman). As a check against the dependence on randomness in the first algorithm, we implemented a second algorithm patterned after the algorithm for creating Huffman codes. Although this algorithm still uses randomness to break ties, it is much less dependent on a start vertex. This algorithm works as follows:

1. Convert the matrix to a TSP graph.

2. Pick the nearest two vertices and make them adjacent on the tour. Ties are broken randomly.

3. Continue choosing the nearest pairs of vertices which are not yet adjacent to two other vertices, and make them adjacent on the tour, until all vertices have two neighbors on the tour.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ. A tour corresponds to n possible matrices, depending on which vertex is made the left-hand column.

The clone-cover algorithm

Although the reduction of σ optimization to TSP preserves the optimum value, it is not clear that an approximation algorithm for TSP will fairly explore the region of near-optimal solutions. In particular, the use of Hamming distance produces a penalty for two adjacent columns having different values in a row, but only indirectly rewards two adjacent columns which both have 1s in a row. We therefore created an algorithm which explicitly attempts to keep together columns which both have 1s in the same rows.

Rather than creating the Hamming distance graph, we create the clone cover graph described above. Similar ideas appear implicitly in the work of D. R. Fulkerson and O. A. Gross and of Lander and Waterman.

Heuristically, it appears that one can minimize σ by maximizing the cost of a tour in the clone cover graph. If edges on the tour have high values, then many 1s are placed adjacent to each other and there will be fewer overall blocks of ones. Maximizing the cost of a tour is unfortunately at least as difficult as minimizing the tour, and therefore we once again resort to greedy heuristics.

Our third algorithm thus works as follows:

1. Convert the matrix to a clone cover graph.

2. Starting with a random vertex, add the vertex whose weight to this vertex is greatest and which is not already on the tour. Ties are broken randomly.

3. Continue adding the vertex, not already on the tour, whose weight to the last vertex added is greatest, until all vertices are used. Ties are broken randomly.

4. Convert the tour to a matrix order by choosing the starting position which gives the best value of σ.
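The clone cover graph used in step 1 can be built directly from the hybridization matrix. A minimal sketch under our own naming, where the weight of an edge between two probes (columns) is the number of clones (rows) hybridizing to both, i.e. the off-diagonal part of A^T A discussed in the envelope section below:

```python
def clone_cover_weights(matrix):
    """Edge weights of the clone cover graph: w[i][j] counts the rows in
    which both column i and column j carry a 1 (off-diagonal of A^T A)."""
    n = len(matrix[0])
    w = [[0] * n for _ in range(n)]
    for row in matrix:
        ones = [j for j, v in enumerate(row) if v == 1]
        for a in range(len(ones)):
            for b in range(a + 1, len(ones)):
                w[ones[a]][ones[b]] += 1
                w[ones[b]][ones[a]] += 1
    return w
```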

The Fiedler-vector spectral algorithm: Minimizing the envelope

Concentrating on the number of blocks of ones in the matrix by optimizing σ does not capture the fact that two blocks separated by a single 0 could occur in experimental matrices due to several factors (e.g., a false negative hybridization result or a deletion) which are highly unlikely to create two blocks separated by many 0s. Therefore we looked at methods which attempt to group all the 1s in each row together, but do not demand that they form a single block. Formally, a row is scored by the number of columns from the first occurring 1 to the last occurring 1. A row with two blocks separated by a single 0 thus scores barely worse than a solid block under this metric (whose worst value for a row is n), whereas it would get a score of 2 under the σ metric (whose best value for a row is 1).

This envelope measure was inspired by the sparseness of the hybridization matrices. It seemed reasonable to attempt to keep the area containing ones as small as possible. Keeping this area small tends to minimize the number of false negatives and deletions in a unified way. For symmetric matrices this measure is the so-called envelope used in numerical analysis for handling sparse matrices. Minimizing the envelope is seen as a way to minimize the storage for sparse matrices, which in turn speeds up computation. Although envelope minimization is NP-hard, various heuristic methods have been developed that often work well in numerical analysis applications.

Unfortunately, hybridization matrices are not symmetric, so standard envelope minimization techniques cannot be directly applied. We can, however, form a symmetric matrix by multiplying the hybridization matrix by its transpose. The resulting matrix has a nonzero structure which corresponds precisely to the edges in the clone cover graph.

Spectral methods

When the connection of our measure with the envelope was made, a new algorithmic avenue came to light, namely the use of spectral methods.

Spectral methods are fundamental in numerical analysis. Fiedler developed the mathematics that was ultimately used to design powerful heuristics for arranging sparse matrices in such a way that all the entries are as close to the diagonal as possible. The first algorithmic uses of Fiedler's ideas for ordering problems were due to Juvan and Mohar. More recent applications include work by Pothen and Simon and by Atkins, Boman, and Hendrickson.

We employed a heuristic for optimizing the envelope which is based on the work of our colleagues at Sandia, who developed the code for an independent project called Chaco, which partitions computational meshes for placement on parallel machines. When it became apparent that their code might be useful to us, they very kindly made modifications to allow its inclusion in our software suite. In fact, they became interested in the theoretical properties of spectral methods for this application and have proved several lovely theorems about it.

A full description of their code is beyond the scope of this paper and can be found in the references, but we give here a sketch of its use for mapping.

The Fiedler-vector spectral algorithm

1. Given an m x n matrix A, construct the clone cover graph CC(A). Let G_A be the adjacency matrix of the graph CC(A).

2. Construct the Laplacian L_A of G_A, i.e., L_A = D_A - G_A, where D_A = (d_ij) is a diagonal matrix with d_ii equal to the sum over j of G_A(i, j), and d_ij = 0 for i != j.

3. Let v_F be the n-dimensional Fiedler vector of L_A, defined to be the eigenvector of L_A corresponding to the second smallest eigenvalue of L_A. (v_F has only real entries.)

4. Sort the components of v_F. The sorting produces a permutation π_F of the set {1, ..., n}.

5. Output the map π_F(A).
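The five steps above amount to a few lines of linear algebra. The following is a minimal illustration using NumPy, not the Chaco code itself; it takes the clone cover graph to be the off-diagonal of A^T A, per the envelope discussion above.

```python
import numpy as np

def spectral_order(A):
    """Order probes (columns) by sorting the Fiedler vector of the
    Laplacian of the clone cover graph, as in the Fiedler-vector
    spectral algorithm sketched in the text."""
    A = np.asarray(A, dtype=float)
    G = A.T @ A                      # co-occurrence counts between columns
    np.fill_diagonal(G, 0.0)        # keep only the clone-cover edges
    L = np.diag(G.sum(axis=1)) - G  # graph Laplacian L = D - G
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]            # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)
```

Because an eigenvector is determined only up to sign, the output order may come back reversed; both directions describe the same map.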

In recent work, Atkins, Boman, and Hendrickson showed that a generalization of this algorithm has the remarkable property of solving the consecutive ones problem. That is, on C1P matrices the ABH algorithm finds the C1P permutations. The relevance of this beautiful theorem in the biological context is as follows. This algorithm offers an extra guarantee: it is provably correct on error-free instances. As the error-free set of inputs is quite complex, this is a highly nontrivial result, and the first of this kind to our knowledge.

The cycle-basis algorithm: Minimizing the maximum number of fragments per row

Both our algorithms for optimizing σ and those for optimizing the envelope consider measures which are sums over all the rows of the matrix. For large matrices, this means that one or two very badly ordered rows can be masked by lots of well-ordered rows. In some cases this may be inevitable, but since we expect really bad clones to be filtered from the input, we wanted to be able to look at measures which attempted to optimize the maximum behavior of any row, rather than the average or sum behavior. The measure considered here does exactly this.

Preliminaries

We start by considering a class of fragment-counting objective functions for which the problem of finding optimal tours can be approximately reduced to the problem of computing spanning trees, both problems in vector-TSP graphs. Let us say that a function f: N^m -> N is monotone if for every a, a' in N^m with a <= a' we have f(a) <= f(a'). We say that f is subadditive if for every a, a' in N^m we have f(a + a') <= f(a) + f(a').

Lemma: Let A be an m x n matrix and f: N^m -> N a monotone and subadditive function. Then the optimal f-cost of a tour of the vector-TSP instance (G_A, f) is at most twice the optimal f-cost of a spanning tree of G_A.

Proof: The proof is a straightforward extension of the proof of the twice-around-a-tree heuristic for TSP.

As the maximum-fragment-count measure is monotone and subadditive, we will concentrate now on computing an optimal f-cost spanning tree. Again, the problem turns out to be NP-complete.

The algorithm

We present an algorithm for computing a spanning tree whose cost is close to optimal. The algorithm is based on local search. Informally, one starts with a spanning tree T and searches for better spanning trees in the neighborhood of T. If one such tree is found, one continues searching for improvements in its neighborhood, and so on. When a tree T is found such that no better trees exist in its neighborhood, then T is a locally optimal tree. We can show that T is close to the globally optimal spanning tree T_OPT. To complete the informal presentation of the algorithm, we have to describe the neighborhood of a spanning tree and the criterion for choosing a better spanning tree. The neighborhood is the so-called cycle basis. For a spanning tree T and for every non-tree edge e, one can obtain a graph with exactly one cycle by adding the edge e to T. Then one can remove edges of T from the cycle. We collect only the resulting spanning trees (some edge removals might disconnect the graph). When we do this operation for all edges not in T, we obtain a collection of spanning trees that is the cycle-basis neighborhood of T. As the objective value of a spanning tree is slow to report changes, we use a more change-sensitive criterion for choosing better trees: the lexicographic order of the sorted vector-cost of the tree.

The cycle-basis algorithm

1. Given an m x n matrix A, construct G_A, the n-node vTSP graph of A. The nodes of G_A correspond to the columns of A (or to the probes). G_A is a complete graph with the edges labeled by m-dimensional 0/1 vectors. If an edge corresponds to columns C_j and C_k, then its label is the Hamming vector of C_j and C_k.

2. Construct a spanning tree T of G_A.

3. Repeat until no improvement: add to the current tree T a non-tree edge. If an edge of T from the unique cycle formed by the added edge can be removed such that the resulting graph is a tree T' and T' has a better value than T, then we have improvement and we set T := T'.

4. Let T be the resulting spanning tree.

5. Let π be the permutation obtained by opening (anywhere) the tour obtained by the standard walk twice around the tree T.

6. Output π(A).

The performance of the algorithm is given next.

Theorem: Given an m x n matrix A, the cycle-basis algorithm outputs a permutation π' such that the maximum fragment count of π'(A) is at most an O(log m) factor worse than that of π_OPT(A), where π_OPT is the optimal permutation for A. The algorithm has worst-case running time O(n log m).

The 2-OPT algorithm

It is common, when using approximation algorithms for NP-hard combinatorial optimizations, to use some sort of local search to fine-tune the solution. One common local search method for permutation-based algorithms is the 2-OPT algorithm. When performing a 2-OPT, two permutations are considered to be neighbors if they differ by a single translocation. As we saw in the preceding sections, the translocation is a natural operation for our algorithms.

Formally, the 2-OPT algorithm takes as input an m x n matrix A, a permutation π = (i_1, ..., i_n), and an optimization function f. Recall that a translocation flips sections of the permutation. A segment of π is given by two elements i_j, i_k such that i_j occurs before i_k in π. A translocation of the segment (i_j, i_k) in π is a permutation obtained from π by reversing the sequence of elements of π between, and including, i_j and i_k. Given a permutation of size n, there are n(n-1)/2 translocations; this set includes the original permutation.

The 2-OPT algorithm

1. Given an m x n matrix A and a permutation π in Perm(n):

2. Until no improvement is possible: consider in turn all the translocations of π, and among them choose the permutation π' for which f(π'(A)) is minimal. If f(π'(A)) < f(π(A)), then there is improvement and we set π := π'.

In our experiments we used σ as our optimization function.

It is interesting to note that, unlike the cycle-basis algorithm, the local search of 2-OPT yields no guarantees, either on running time or on solution quality. The local optimum can be arbitrarily worse than the global optimum, and it can take time exponential in n to converge to a local optimum. For this reason it is desirable to use 2-OPT only after a good starting solution has been found. If the starting solution is provably close to the optimum solution, and a bound is placed on the number of steps of 2-OPT to be performed, then the combination of an algorithm with 2-OPT will yield a solution which is provably close to the optimum and will not take unbounded time.
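A naive version of this local search can be sketched as follows. This is a hedged illustration, not the paper's implementation; the explicit step bound reflects the caveat above that unguarded 2-OPT can take exponential time to converge.

```python
def two_opt(perm, cost, max_steps=1000):
    """Hill-climbing over translocations (segment reversals): repeatedly
    take the best reversal that lowers cost(perm); stop at a local
    optimum or after max_steps iterations."""
    perm = list(perm)
    for _ in range(max_steps):
        best, best_perm = cost(perm), None
        n = len(perm)
        for j in range(n - 1):
            for k in range(j + 1, n):
                # reverse the segment perm[j..k], keeping the rest fixed
                cand = perm[:j] + perm[j:k + 1][::-1] + perm[k + 1:]
                c = cost(cand)
                if c < best:
                    best, best_perm = c, cand
        if best_perm is None:      # no translocation improves: local optimum
            return perm
        perm = best_perm
    return perm
```

Here `cost` stands for any of the objective functions discussed above (in the experiments, σ evaluated on the reordered matrix).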

Experimental Design

We have argued, both theoretically and experimentally, that there are severe limitations on the ability of an algorithm to retrieve, from a hybridization matrix alone, the correct genomic order of the probes and/or clones along the genome. Nonetheless, we have designed several algorithms and put them to the test. Although many algorithms have been proposed, and even implemented, for variants of the physical mapping problem (see the related-work section below), there have been few or no published attempts to give a rigorous evaluation of these algorithms across many situations. Typically, the algorithm is run on several specific large real data sets and the results are compared to the best previous algorithm or to the published map of the data generators.

It is one of the goals of this paper to establish the beginning of a dialogue within the community concerning how to evaluate mapping algorithms. We have given some theoretical background, but the success or failure of algorithms will always rest on their practical use. We believe that understanding the practical use of an algorithm depends on answering the following questions:

1. What are some good quantitative measures of the goodness of a map?

2. How can fair comparisons of different algorithms be made? In particular, what should the output of an algorithm be?

3. When an algorithm performs poorly, why did it perform poorly?

4. What are the requirements for a useful generator of synthetic examples?

5. How many random trials are necessary to produce reliable information about an algorithm?

6. What parameters are worth adjusting in studying an algorithm?

Measures of success

We do not yet have a totally satisfactory answer to the first two questions: how to measure the goodness of a map. We have, however, explored several measures. The first measure is a natural outgrowth of our theoretical discussion of the information available in a hybridization matrix. We ask how many of the adjacencies in the true order are identified by the algorithm. In retrospect, we now believe that we should have attempted to make our algorithms list only those adjacencies for which the algorithm has derived good evidence that the adjacency is in the true map. In this way the algorithm would be able to try to identify weak adjacencies as well as list strong adjacencies. However, in this study we created algorithms which, like most previous algorithms, attempted to produce a total order of the probes.

Thus our first measure is the percent of true map adjacencies which are adjacent in the algorithm's order.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the total adjacency cost is (1/n) times the sum over i from 1 to n-1 of δ_i, where δ_i = 1 if π(i) and π(i+1) are not adjacent in π_A, and δ_i = 0 otherwise.

We have argued that the algorithm should only be responsible for determining strong adjacencies. Thus our second measure counts only strong adjacencies which are missed. Of course, for data containing errors we must have some definition of a strong adjacency, such as the one defined earlier.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the strong adjacency cost is (1/n) times the sum over i from 1 to n-1 of δ_i, where δ_i = 1 if (π(i), π(i+1)) is a strong adjacency and π(i) and π(i+1) are not adjacent in π_A, and δ_i = 0 otherwise.
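The total adjacency cost can be computed in a few lines. The sketch below uses our own naming and, following the definition above, normalizes the count of missed adjacencies by n.

```python
def total_adjacency_cost(true_order, found_order):
    """Fraction (over n) of true-order adjacencies that are NOT adjacent
    in the computed order; 0 on a perfect reconstruction or its mirror."""
    pos = {p: i for i, p in enumerate(found_order)}
    n = len(true_order)
    missed = sum(1 for a, b in zip(true_order, true_order[1:])
                 if abs(pos[a] - pos[b]) != 1)
    return missed / n
```

The strong adjacency cost is the same computation restricted to pairs flagged as strong adjacencies.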

We are currently experimenting with some new measures. The number of adjacencies does not seem to completely capture how far from correct a solution is. For example, if two adjacent probes in the true order are flipped, then two adjacencies will be incorrect; however, the adjacency cost will be the same if some large section of the genome is transposed in the solution. We therefore look at the average distance by which probes adjacent in the true order are separated in the calculated order. A variant of this measure has been mentioned in the literature.

Definition: Given an algorithm A which produces an ordering π_A on an n-probe hybridization matrix with true order π, the total distance cost is

(1/n) times the sum over i from 1 to n-1 of | pos_A(π(i)) - pos_A(π(i+1)) |,

where pos_A(p) denotes the position of probe p in the ordering π_A.
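The definition translates directly into code; a minimal sketch under our own naming:

```python
def total_distance_cost(true_order, found_order):
    """Average separation, in the computed order, of probes that are
    adjacent in the true order; unlike the adjacency count, a transposed
    block is penalized more heavily than a local flip."""
    pos = {p: i for i, p in enumerate(found_order)}
    n = len(true_order)
    return sum(abs(pos[a] - pos[b])
               for a, b in zip(true_order, true_order[1:])) / n
```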

Of course, a variant of total distance can be defined in which only strong adjacencies are counted. In preliminary studies the total distance measure seems to give us a better understanding of the quality of an algorithm, but we are not yet comfortable enough with our understanding of it to report results.

Parameters to vary

Perhaps the hardest part of an experimental study is deciding what to look at. Having settled on adjacency cost as our primary measure of success, we still had to decide what parameters to vary and over what ranges.

It was clear that we wanted to vary the amount of chimerism, false negatives, and false positives. We chose for this first experiment to vary each independently, so as to limit the number of possibilities and to make the results more clearly relatable to the input. We chose a substantial range of chimerism rates, since the literature seemed to imply that a fair amount of chimerism was present in real data. We also expected from our earlier studies that algorithms could perform well even at these high levels of chimerism. Choosing a false negative rate was more difficult. Many pooling techniques are biased toward producing false negatives in order to reduce the amount of false positives. In fact, some groups believe false positives to be so costly that they apply preprocessing techniques to remove false positives from the matrices, at the cost of increasing the number of false negatives. The fear of false positives was borne out by our experience. Although we intended to use the same range as for false negatives, we found that some of the algorithms took inordinate amounts of time to process the higher false positive rates, and then gave very poor solutions.

It should be noted that the false positive and false negative rates are not equivalent measures, since the false negatives can only occur at the few percent of matrix entries which record hybridizations, while the false positives can occur anywhere else. In fact, we recommend that the false positive rate in future experiments be based on the expected number of false positives per row: the rate needed to produce approximately one false positive per row falls as the number of probes grows.

It was also clear that we wanted to vary the number of clones in the experiment. The term coverage has many definitions in the literature, but always connotes the relative number of clones to probes. Defining coverage to be the ratio of the number of clones to the number of probes, we looked at a range of coverages. The lowest coverage tested is almost certainly too low, while the highest is well beyond the level of effort likely to be made by any actual mapping team.

It was less clear what to do about the relative size of clones. Preliminary studies showed that minor variations did not seem to make much difference. However, we believe that the role of clone size deserves further study in the future. Even having decided not to explicitly vary clone size, we were left with a dilemma of how to account for the fact that differing amounts of errors led to differing apparent clone sizes, even if the generated clone size was kept constant. When no effort was made to counterbalance the error effect, we found that increased chimerism led to improved algorithm success. This counterintuitive result was due to the fact that the chimeric clones contained additional information in the form of additional fragments. We thus decided to modify the input clone sizes so that the expected density of ones in the matrix was unchanged. This of course meant that with high chimera and false positive rates the size of clones decreased.

Each of the approaches, constant clone size or constant ones density, has its shortcomings. We arbitrarily settled on constant ones density, but expect to report results on constant clone size in the future.

Timing

No experimental study would be complete without some measure of the time cost of the algorithms used. We ran all our experiments on a single dedicated Sparc workstation, so times should be comparable. However, the effort put into optimizing the various algorithms varied greatly. The spectral algorithms were part of a separate project for graph bisection and have been carefully optimized. The Huffman and Clone Cover algorithms are relatively simple and, though little effort was put into optimization, are probably fairly efficient. The Cycle Basis algorithm is quite complex and probably could be sped up considerably. The rudimentary 2-OPT algorithm was coded in a naive manner, and there are known to be considerably more efficient implementations. We hope to improve its efficiency considerably when we extend it to consider additional optimization criteria.

Given all these caveats, the timing graphs which follow in the next section should be used mostly as a gauge of the amount of 2-OPT necessary, and as some indication of how the current implementations scale as the coverage and amount of error increase.

Experimental Analysis of Algorithms

General observations

All of the algorithms exhibit good trending behavior. That is, when more coverage is given, the algorithms find more correct adjacencies. Furthermore, when the amount of error is increased, the algorithms tend to do worse. One exception is that the use of 2-OPT seems to benefit from increased chimerism. We are not sure why this is so, and continue to examine the detailed data in order to look for reasons.

We present our data in summary graphs. Each graph has cost (total adjacency cost, strong adjacency cost, or time) on the y-axis. The x-axis is more complicated: it represents both coverage and amount of error present. Each coverage is assigned an interval on the x-axis, and within that interval the amount of error is allowed to grow. Thus, for chimerism, the leftmost point in each interval corresponds to that coverage with no chimerism, and the rightmost point to that coverage at the maximum chimerism rate. Formally, the value on the x-axis is the coverage (i.e., the number of clones divided by the number of probes) plus a constant times the percent chimerism.

Similarly, for false negatives and false positives, the value on the x-axis is the coverage plus a constant times the log base two of the percent of false entries. No particular meaning is given to the spacing of the error values; they are merely chosen to spread out the points.

For each algorithm there are nine graphs. The first three examine the effects of pure chimerism. The first graph looks at total adjacency cost and strong adjacency cost for the algorithm alone, while the second graph looks at the same costs for the solution of the algorithm followed by local search using 2-OPT. The third graph shows running time for the algorithm and running time for the 2-OPT portion alone. The next three graphs give analogous results for pure false positives, while the last three give results for pure false negatives.

The algorithms examined are: the spectral algorithm, which attempts to minimize the envelope of the hybridization matrix; the cycle basis algorithm, which attempts to minimize the maximum number of fragments in any row of the hybridization matrix; the Huffman TSP algorithm, which attempts to minimize the total number of fragments in all rows of the hybridization matrix; the clone cover algorithm, which attempts to maximize the amount of adjacent overlap; and a random algorithm, which chooses a random permutation.

For each experimental point (coverage and error rate), random matrices were generated and all algorithms were run on the same matrices. Detailed data from the runs will be made available on the World Wide Web at the time of publication of this paper. The URL can be found through the home page at http://www.cs.sandia.gov/~dsgreen/main.html

Spectral

The corresponding figures show the results of the experiments using the spectral algorithm. It is interesting to note that increased coverage does not seem to significantly help the algorithm in the face of chimerism or false positives. In contrast, increased coverage does seem to allow the 2-OPT algorithm to find a better local optimum when applied to the spectral algorithm's solution. The fact that the spectral solutions with higher coverage do not appear better by the adjacency measure, yet are better starting points for 2-OPT, is evidence that the adjacency measure is not a perfect measure.

Increased coverage does improve the spectral algorithm's performance on false negatives. It is difficult to tell whether this effect is mostly due to the increased number of strong adjacencies or to increased information about all adjacencies. The spectral algorithm is so effective on false negatives that the 2-OPT algorithm can make little improvement, especially at the lower rates tested.

Cycle Basis Algorithm

The corresponding figures show the results of the experiments using the Cycle Basis algorithm. Increased coverage seems to help the Cycle Basis algorithm most in correcting chimerism.

Clone Cover

The corresponding figures show the results of the experiments using the clone cover algorithm. Increased coverage seemed to help the clone cover algorithm in all cases. We were surprised at the robustness of the clone cover algorithm, since it does not explicitly optimize any parameter which we can directly relate to map goodness. However, the notion of strongly linked probes is common in the biology literature, and our experiments seem to justify its use.

Huffman

The corresponding figures show the results of the experiments using the Huffman algorithm for greedy TSP minimization. As has been reported by us in a previous paper and by other authors, the greedy minimization of the Hamming distance TSP is quite effective. We do not give results for the Greedy Heuristic algorithm mentioned above, since the Huffman algorithm was marginally more effective and is less dependent on a good initial order. It was clear to us that it is very important to test greedy algorithms with a random initial ordering; otherwise the greedy algorithms tend to conserve the input order and look artificially good when the data is presented in the correct order.

Random

The corresponding figures show the results of the experiments using the 2-OPT algorithm on a random permutation. The results with just the random permutation are predictably bad. The results for the strong adjacencies can be used to track the number of strong adjacencies: when there are many weak adjacencies, the random result can only be wrong on the remaining strong adjacencies, and as the number of strong adjacencies increases, there are more adjacencies on which the random result can fail.

The 2-OPT algorithm by itself is fairly effective. It is, however, very expensive. Even taking the relative efficiency of the implementations into account, it is clearly a major benefit to give 2-OPT a good starting position. In addition, the dependency of 2-OPT on error rate is particularly bad. It seems that higher error rates yield more gentle landscapes, thereby allowing 2-OPT to make many more changes before reaching a local minimum.

Comparison

In order to compare the various algorithms it is really necessary to examine each matrix

in turn We are building the machinary with which to do this but for now must settle on

some summary data In the two tables b elow we show the total and strong adjacency cost

for each algorithm on a few extreme examples coverage or and error rate min or

max

In both tables a star has been placed next to the minimum value for each case. Atkins et al. have recently modified the spectral algorithm to always find a CP ordering if one exists, so it is not surprising that even their early spectral algorithm gives the best results for the CP case. It is apparent that there are many weak adjacencies in the lower coverage case and many fewer in the higher coverage case. Spectral is also very effective for false negatives. Although the total adjacency measures are not particularly encouraging, the strong adjacency numbers and the graphs above show that for moderate levels of false negatives most of the information inherent in the matrix can be retrieved.

For chimerism there are several effective algorithms, with both Huffman and spectral yielding slightly better starting points for 2-OPT. The fact that all algorithms miss only a small fraction of the strong adjacencies, even at the highest chimerism rate and lowest coverage tested, is quite encouraging. One might hope that with a combination of algorithms, reliable information about the strong adjacencies can be retrieved.

The case for false positives is less rosy. Clone cover is surprisingly effective, but it is disturbing that, even at the moderate levels of false positives tested, more coverage was actually a detriment to finding strong adjacencies. This appears to be the result of there being more strong positives, but in some sense the new strong positives are less strong and therefore harder to recover.

Related work

The literature contains extensive studies on physical mapping, and a variety of algorithmic approaches have been proposed. Various software packages are available that offer a varied range of computational support for mapping. It is not our intent in this work to compare and contrast the various approaches, but merely to provide a framework in which this can be done in the future. Each of the efforts described below has had its own goals and measures of success.

The mapping software developed by H. Lehrach's group contains a variety of algorithms, including ones dealing specifically with chimerism, a rarity in the literature. They make use of the TSP analogy and apply simulated annealing. As an output from their program they look for subsets of the clones which span the probes; thus they are able to ignore some ill-conditioned clones. They also have a dechimaerising algorithm, which uses ideas similar to our clone cover graph to remove clones which are apparently chimeric. They have applied their algorithms to real data from S. pombe. For this organism it was possible to have a library with coverage equivalent to many genomes. This extremely high coverage allowed them to aim for a completely correct map. They found that they sometimes achieved the correct order for some of the chromosomes and were able to manually correct the others.

the others They note that the minimum of their optimization function did not necessary Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Spectral Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 30 Spectral Alg Spectral Alg with 2-OPT 25

20

15

10 Time to find solution (secs)

5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Spectral Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 8 Spectral Alg 7 Spectral Alg with 2-OPT

6

5

4

3

Time to find solution (secs) 2

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Spectral Algorithm: 0, 1, 2, 4 pct. false pos., constant density 40 Spectral Alg 35 Spectral Alg with 2-OPT

30

25

20

15

Time to find solution (secs) 10

5

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

BCA Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

BCA Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

BCA Algorithm: 0, 1, 2, 4 pct. false pos., constant density 80 BCA Alg 70 BCA Alg with 2-OPT

60

50

40

30

Time to find solution (secs) 20

10

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Clone Cover: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 3 Clone Cover Clone Cover with 2-OPT 2.5

2

1.5

1 Time to find solution (secs)

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Clone Cover: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 11 Clone Cover 10 Clone Cover with 2-OPT 9 8 7 6 5 4 3 Time to find solution (secs) 2 1 0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Clone Cover: 0, 1, 2, 4 pct. false pos., constant density 7 Clone Cover Clone Cover with 2-OPT 6

5

4

3

2 Time to find solution (secs)

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Huffman Hamming TSP Algorithm: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 3 Huffman Hamming TSP Alg Huffman Hamming TSP Alg with 2-OPT 2.5

2

1.5

1 Time to find solution (secs)

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 4.5 Huffman Hamming TSP Alg 4 Huffman Hamming TSP Alg with 2-OPT

3.5

3

2.5

2

1.5

Time to find solution (secs) 1

0.5

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Huffman Hamming TSP Algorithm: 0, 1, 2, 4 pct. false pos., constant density 7 Huffman Hamming TSP Alg Huffman Hamming TSP Alg with 2-OPT 6

5

4

3

2 Time to find solution (secs)

1

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate Figure

Random order: 0, 10, 20, 30, 40, 50 pct. chimerism, constant density 60000 Random order Random order with 2-OPT 50000

40000

30000

20000 Time to find solution (secs)

10000

0 2 4 6 8 10

Number of clones as multiple of number of probes + chimerism rate

Figure Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate Figure

Random order: 0, 1, 2, 4, 8, 16, 32 pct. false neg., constant density 140000 Random order Random order with 2-OPT 120000

100000

80000

60000

40000 Time to find solution (secs)

20000

0 2 4 6 8 10

Number of clones as multiple of number of probes + false negative rate

Figure Random order: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs Strong pairs 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Random order: 0, 1, 2, 4 pct. false pos., constant density 100

All pairs with 2-OPT Strong pairs with 2-OPT 80

60

40

20

0 Percent of adjacencies in initial matrix not alg. solution 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate Figure

Random order: 0, 1, 2, 4 pct. false pos., constant density 90000 Random order 80000 Random order with 2-OPT

70000

60000

50000

40000

30000

Time to find solution (secs) 20000

10000

0 2 4 6 8 10

Number of clones as multiple of number of probes + false positive rate

Figure

CP

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Chimerism only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False p ositives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False negatives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Figure Total adjacency costfor various coverageerror rates

CP

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Chimerism only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False p ositives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

False negatives only

Sp ectral

w OPT

Cycle Basis

w OPT

Clone Cover

w OPT

Human

w OPT

Random

w OPT

Figure Strong adjacency cost for various coverageerror rates

correspond to the correct order.

Another prominent software package based on simulated annealing is the package developed by Cuticchia, Arnold, and Timberlake. A simulated annealing heuristic is used to construct the Hamming distance TSP tour. They look at several measures of success, including the size of the largest contig, the number of contigs, and the error within a contig. The emphasis on separate contigs is a version of our concept of weak adjacencies; the measure of error within a contig is related to our notion of average probe distance. They applied their algorithm to simulated data which included errors (which we called uncloneable regions) and false positives. They also applied their algorithm to the genome of A. nidulans.

The Berkeley software library developed by R. Karp's group also applies simulated annealing to the TSP graph. In addition, they propose several heuristics for reducing the number of false positives in the data, dealing with repeated probes, and pooling schemes. They also include analysis of data in which probes are built from the ends of clones.

Zhang et al. look at data which contains clone-clone hybridization data rather than clone-probe data. They use a classic algorithm for finding a diameter of a graph in order to estimate an ordering of the clones. In order to deal with false hybridization they apply several heuristics. For the library of cosmids and YACs for the human chromosome to which they applied their techniques, there was three- to five-times coverage. They do not give any quantitative measure of how successful their algorithms were.

Summary and Future Directions

Although we feel that we have moved forward a great deal from our earlier work in physical mapping, we are sure that much remains to be done. At a minimum, there remain open questions in the creation of generators of synthetic data, the creation of algorithms based on combinatorial optimizations, the consideration of algorithms based on other techniques such as maximum likelihood, the inclusion of data of other types besides hybridizations, the modeling of errors and inclusion of other error types, the definition of more detailed measures of algorithm success, and the construction of more detailed experiments.

Synthetic Data. We discussed the question of generators in some depth in Section , but a few points are worth reiterating. No one is likely to produce a generator which actually captures all of the intricacies of the experimental mapping process. The key is to attempt to include facets of the process which are likely to have large effects on the evaluation of algorithms. Toward this goal we have looked at the distribution of probe positions, the size of clone fragments, and the amount of simple errors. We believe that it will be important to further examine the role of the distribution of fragment sizes, the correlation of various error types, and the correlation of errors to fragment position.
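A toy version of such a generator, assuming only the simplest model discussed above (clones as fixed-length intervals placed uniformly over evenly spaced probes, with independent bit-flip errors), might look like the following; the function name, parameters, and model details are all illustrative assumptions, not the generator used in our experiments.

```python
import random

def synthetic_hybridization_matrix(n_probes, n_clones, clone_len,
                                   fn_rate=0.0, fp_rate=0.0, seed=0):
    """Toy generator: probes sit at positions 0..n_probes-1; each clone is a
    random interval of clone_len consecutive probes. A probe-clone entry is 1
    iff the probe falls inside the clone, and each bit is then flipped
    independently at the given false negative / false positive rates."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_clones):
        start = rng.randrange(n_probes - clone_len + 1)
        row = []
        for p in range(n_probes):
            hit = start <= p < start + clone_len
            if hit and rng.random() < fn_rate:
                hit = False                      # drop a true hybridization
            elif not hit and rng.random() < fp_rate:
                hit = True                       # spurious hybridization
            row.append(int(hit))
        matrix.append(row)
    return matrix
```

With zero error rates every row has the consecutive ones property, so a generator of this shape gives a controlled baseline before fragment-size distributions or correlated errors are layered on.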

Algorithms. In our discussion of algorithms in Section we noted that it is desirable to find optimization functions which are conservative extensions of the CP problem and which are monotonic in the error rate. We continue to look for better such functions. In addition, it is clear to us that the use of multiple algorithms, to increase the reliability of the results and to allow the handling of multiple error types, will be crucial. We are exploring ways to combine the outputs of multiple algorithms and to use them in iterative schemes which progressively improve the solution.

The use of the 2-OPT refinement opens up many questions. We used 2-OPT to locally search for solutions with the lowest value of one particular optimization function. However, we could equally well have used any other, or a combination, of our optimization functions. It would be interesting to see whether there is an advantage to having the 2-OPT use a different function, or the same function, as the primary algorithm.

We also expect that new research avenues will develop based on combinatorial optimizations of multiple objective functions. Multiple objective optimization is an active area of research in combinatorial optimization, and we expect that the efforts to provide computational support for mapping will reinforce the need for new algorithmic tools and methods.

New data and error types. There will always be new experimental techniques with corresponding new types of information and errors. It is our hope that some of the mechanisms developed in this paper will be applicable to the resulting mapping problems. One of the greatest challenges to our approach is to allow the biologist to include side information. For example, if it is known from a breakpoint analysis that two sets of probes are separate from each other, it would be nice to allow the algorithm to make use of this information.

Experimental Analysis. We learned many lessons from performing the experiments of Section . An important lesson was that summary statistics are not always enough. It is sometimes important to be able to directly compare the solutions of all algorithms on a single input instance. It would be nice to have additional tools for comparing how close two permutations are to each other. These tools might also provide us with additional candidates for measuring relative algorithm success.
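One standard candidate for such a permutation-comparison tool is the Kendall tau distance, the number of pairs ordered differently by two permutations. A minimal sketch (our illustration, not part of the software described in this paper):

```python
def kendall_tau_distance(p, q):
    """Number of element pairs ordered differently by permutations p and q
    (0 for identical orderings; n*(n-1)/2 for an exact reversal)."""
    pos = {elem: i for i, elem in enumerate(q)}   # position of each element in q
    mapped = [pos[elem] for elem in p]            # p expressed in q's coordinates
    # count inversions in the mapped sequence
    return sum(1 for i in range(len(mapped))
                 for j in range(i + 1, len(mapped))
                 if mapped[i] > mapped[j])
```

Since a physical map and its reversal are equivalent, in practice one would take the minimum of the distance to a reference permutation and to its reverse.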

A second lesson we learned is that it is very difficult to hold everything constant except one factor. For example, varying the error rates changed the density of the resulting matrix. It is important to keep a record of all parameters, even ones not being studied, in order to be able to look for later correlations.

We were not particularly careful about implementation efficiency. This sometimes meant we had to limit the size of our examples and/or the number of trials examined. In both cases this meant that we were not able to reach as high a level of confidence in our output numbers as we would have liked. We computed standard deviations, minimums, and maximums as well as averages. The averages seem to be a reliable measure, but we would like to make the study more rigorous. In addition, the use of run time as a measure of algorithm efficiency is dubious. We would like to re-examine the use of 2-OPT in order to count the number of optimization steps. We suspect that it will be interesting to count not only the total number of steps until a local optimum is found, but also the number of steps until a solution within a threshold of the local optimum is found.

Theory. Our theory of physical mapping in the presence of errors is still in its early stages. We hope to extend the theory to uncover intrinsic limitations on distinguishing noise from signal in physical mapping. In particular, we currently use only very simple noise models. Determining how to properly weight the effects of different error sources will be challenging.

Maximum likelihood approaches implicitly include a noise model. We hope that by making the noise model explicit we can determine when enough information is available to make maximum likelihood effective.

Integration into mapping projects. We have presented the problem of physical mapping as occurring after the biological experiments are completed. This is probably not the most efficient use of mapping algorithms. It would be preferable if the mapping software became integrated into the mapping process. For example, the software should be able to help the biologists know which sections of the data need work. The hope is that the integrated approach will lead to improved pathways to successful maps. The question of how to make this integration happen will undoubtedly be one of the major issues for mapping software in the next several years.

Acknowledgments

It is a pleasure to thank Eric Lander for suggesting this research area to us, and for his technical contributions and support. Ernie Brickell's and Fred Howes' support has been crucial to this effort. Bruce Hendrickson, James Park, Cindy Phillips, and Mike Sipser provided helpful reviews, comments, and valuable technical contributions. Norman Doggett and Manny Knill shared with us both their chromosome data and their ideas about physical mapping. We would also like to thank Leslie Goldberg, Jim Orlin, David Torney, and Mike Waterman for fruitful discussions.

Jon Atkins and Paul Goldberg contributed proofs that certain optimization functions are NP-hard. Bruce Hendrickson and Rob Leland not only gave us access to their code for envelope minimization from their Chaco software, but made modifications to it in order to interface it with our codes. Cathy McGeoch, Henry Shapiro, and Bernard Moret gave us many helpful tips on experimental analysis.

References

F. Alizadeh, R. Karp, L. Newberg, and D. K. Weiser. Physical mapping of chromosomes: a combinatorial problem in molecular biology. In Proceedings of the th Annual ACM-SIAM Symposium on Discrete Algorithms.

J. Atkins. Testing whether genomic matrices have chimeric number two is NP-complete. Manuscript, Sandia Labs.

J. Atkins, E. Boman, and B. Hendrickson. A spectral algorithm for the seriation problem. Technical Report SAND, Sandia National Laboratories, Albuquerque, NM.

K. Booth and G. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J. Comput. Syst. Sci.

T. A. Brown. Gene Cloning. Chapman and Hall, second edition.

I. Chumakov et al. Continuum of overlapping clones spanning the entire human chromosome q. Nature.

D. Cohen, I. Chumakov, and J. Weissenbach. A first-generation physical map of the human genome. Nature.

A. Cuticchia, J. Arnold, and W. Timberlake. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics.

A. Cuticchia, J. Arnold, and W. Timberlake. ODS (ordering DNA sequences): a physical mapping algorithm based on simulated annealing. CABIOS.

A. Cuticchia, J. Arnold, and W. E. Timberlake. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics.

M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Math. J.

M. Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Math. J.

S. Foote, D. Vollrath, A. Hilton, and D. C. Page. The human Y chromosome: overlapping DNA clones spanning the euchromatic region. Science.

D. Fulkerson and O. Gross. Incidence matrices and interval graphs. Pacific J. of Mathematics.

M. Garey and D. Johnson. Computers and Intractability. Freeman and Co.

P. Goldberg. The generalized consecutive ones property is NP-complete. Technical report, Sandia National Laboratories, December; also included in "Four strikes against physical mapping of DNA" by Goldberg, Golumbic, Kaplan, and Shamir, Tel Aviv University Tech. Report.

E. Green, H. Reithman, J. Dutchik, and M. V. Olson. Detection and characterization of chimeric yeast artificial-chromosome clones. Genomics.

D. Greenberg and S. Istrail. The chimeric mapping problem: algorithmic strategies and performance evaluation on synthetic genomic data. Computers & Chem.

D. Greenberg and C. Phillips. In preparation.

A. Grigoriev, R. Mott, and H. Lehrach. An algorithm to detect chimeric clones and random noise in genomic mapping. Genomics.

B. Hendrickson and R. Leland. The Chaco user's guide. Technical Report SAND, Sandia National Laboratories, Albuquerque, NM, October.

M. Juvan and B. Mohar. Optimal linear labelings and eigenvalues of graphs. Disc. Appl. Math.

E. Knill. Lower bounds for identifying subset members with subset queries. In Proceedings of the th Annual ACM-SIAM Symposium on Discrete Algorithms.

L. T. Kou. Polynomial complete consecutive information retrieval problems. SIAM J. Comput.

E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics.

P. Little. Mapping the way ahead. Nature.

R. Mott, A. Grigoriev, E. Maier, J. Hoheisel, and H. Lehrach. Algorithms and software tools for ordering clone libraries: application to the mapping of the genome of Schizosaccharomyces pombe. Nucleic Acids Research.

D. Nelson and B. H. Brownstein. YAC Libraries: A User's Guide. W. H. Freeman and Co.

B. Parlett and D. Scott. The Lanczos algorithm with selective orthogonalization. Math. Comp.

D. Torney. Mapping using unique sequences. J. Mol. Biol.

D. Vollrath, S. Foote, A. Hilton, L. Brown, P. Beer-Romero, J. Bogan, and D. C. Page. The human Y chromosome: an interval map based on naturally occurring deletions. Science.