
Reconstructing the Duplication History of a Tandem Repeat

Gary Benson* and Lan Dong†

Department of Biomathematical Sciences

The Mount Sinai School of Medicine

New York, NY 10029-6574

* Partially supported by NSF grant CCR-9623532 and a 1997 grant from the German Academic Exchange Service (DAAD).

† Partially supported by NSF grant CCR-9623532.

Abstract

One of the less well understood mutational transformations that act upon DNA is tandem duplication. In this process, a stretch of DNA is duplicated to produce two or more adjacent copies, resulting in a tandem repeat. Over time, the copies undergo additional mutations so that typically, multiple approximate tandem copies are present. An interesting feature of tandem repeats is that the duplicated copies are preserved together, making it possible to do "phylogenetic analysis" on a single sequence. This involves using the pattern of mutations among the copies to determine a minimal or a most likely history for the repeat. A history tries to describe the interwoven pattern of substitutions, indels, and duplication events in such a way as to minimize the number of identical mutations that arise independently. Because the copies are adjacent and ordered, the history problem cannot be solved by standard phylogeny algorithms. In this paper, we introduce several versions of the tandem repeat history problem, develop algorithmic solutions and evaluate their performance. We also develop ways to visualize important features of a history with the goal of discovering properties of the duplication mechanism.

Keywords: tandem repeats, phylogeny algorithms

1 Introduction

One of the less well understood mutational processes for DNA molecules is tandem duplication in which a stretch of DNA is transformed into two or more adjacent copies. The following illustrates a tandem duplication in which the single occurrence of triplet CGG is transformed into three identical, adjacent copies.

...T CGG A... → ...T CGG CGG CGG A...

The result of a tandem duplication is termed a tandem repeat. Over time, individual copies within a tandem repeat may undergo additional, uncoordinated mutations including new tandem duplications so that typically, multiple approximate tandem copies are present.

Examination of a tandem repeat often suggests that the sequence was produced by a series of tandem duplications interspersed with point mutations. The real biological sequence shown in Fig. 1 is a typical example. It consists of 16 copies of an 8 nucleotide pattern. Copies are numbered and spaces inserted between the copies for clarity. A consensus for these 16 copies is AAACTTAG. Asterisks (*) flag differences between the copies and the consensus.

Careful observation reveals that G is periodically substituted for A. Such substitutions are unlikely to occur independently. It is more likely that a single common ancestor pattern is responsible for the A to G substitutions through duplication. Perhaps an 8 character unit, say AAACTTAG, was first duplicated and then mutated to AGACTTAG and then the two copies were duplicated as a single 16 character unit AAACTTAGAGACTTAG. When the second through thirteenth copies are viewed in this way, 5 of the A to G substitutions are accounted for. Further observation suggests that the two starred Ts may have been the result of another duplication.

Tandem repeats are different from other types of duplicated sequences because the child copies of duplication are adjacent on the same sequence. This difference leads to complications in determining the parent copy of duplication. See Fig. 2.

Boundaries. It is not always possible to distinguish the boundaries of a duplicated pattern. Consider the two examples below in which a duplication changes three identical copies of ABCD into four identical copies. Although the boundaries of the duplicated patterns (bracketed) differ, the results are the same.

ABCD [ABCD] ABCD → ABCD ABCD ABCD ABCD

ABCD AB[CD AB]CD → ABCD ABCD ABCD ABCD
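The boundary ambiguity can be checked mechanically. The following is a minimal Python sketch, not code from the paper: the helper name `tandem_duplicate` and its 0-based `start` argument are assumptions made for illustration. It duplicates a chosen window in place, reproduces the CGG example, and confirms that two different windows over three copies of ABCD give the same four-copy result.

```python
def tandem_duplicate(seq: str, start: int, length: int, copies: int = 2) -> str:
    """Replace seq[start:start+length] with `copies` adjacent copies of itself.

    `start` is 0-based; `copies` is the total number of copies present
    after the duplication (2 means one extra copy is inserted).
    """
    if copies < 2:
        raise ValueError("a tandem duplication yields at least two copies")
    if start < 0 or length <= 0 or start + length > len(seq):
        raise ValueError("window must lie inside the sequence")
    unit = seq[start:start + length]
    return seq[:start] + unit * copies + seq[start + length:]


# The CGG example: one occurrence becomes three adjacent copies.
print(tandem_duplicate("TCGGA", 1, 3, copies=3))  # TCGGCGGCGGA

# Boundary ambiguity: different windows, identical result.
repeat = "ABCD" * 3
whole_copy = tandem_duplicate(repeat, 4, 4)  # window ABCD
shifted = tandem_duplicate(repeat, 6, 4)     # window CDAB
print(whole_copy == shifted)                 # True
```

Because both windows have the same length and the repeat is exactly periodic, the outcome carries no trace of where the window started, which is why boundaries cannot be recovered from identical copies alone.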

                *  *                 *                 *
AAACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AAACTTAG AGACTTAG AAACTTAG
1        2        3        4        5        6        7        8

 *              *  *                 *        *   *               *
AGACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
9        10       11       12       13       14       15       16


                *
AAACTTAG AAACTTATAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG
1        2       3        4       5        6       7        8       9

       *                           *   *               *
AAACTTATAGACTTAG AAACTTAGAGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
10      11       12      13       14       15       16

Figure 1: Top: Periodic nucleotide substitutions in a tandem repeat suggest a common ancestor. Bottom: Five of the A to G substitutions may be accounted for by a single A to G substitution followed by duplication.
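The consensus and the flagged differences of Fig. 1 can be recomputed directly from the copies. The sketch below is illustrative only: the function names `consensus` and `differences` are hypothetical, and copy 16 (nine characters, carrying an inserted G) is set aside because this simple column-majority vote assumes equal-length copies.

```python
from collections import Counter

# The 16 copies shown in Fig. 1, in left-to-right order.
COPIES = [
    "AAACTTAG", "AAACTTAT", "AGACTTAG", "AAACTTAG",
    "AGACTTAG", "AAACTTAG", "AGACTTAG", "AAACTTAG",
    "AGACTTAG", "AAACTTAT", "AGACTTAG", "AAACTTAG",
    "AGACTTAG", "AGACTCAG", "AAACTTAG", "AAAGCTTAG",
]


def consensus(copies):
    """Column-wise majority vote over equal-length copies."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


def differences(copy, cons):
    """0-based positions where a copy disagrees with the consensus."""
    return [i for i, (a, b) in enumerate(zip(copy, cons)) if a != b]


aligned = [c for c in COPIES if len(c) == 8]  # drops copy 16 (insertion)
cons = consensus(aligned)
print(cons)  # AAACTTAG

# List every equal-length copy that differs from the consensus.
for number, copy in enumerate(COPIES, start=1):
    if len(copy) == 8 and differences(copy, cons):
        print(number, copy, differences(copy, cons))
```

The periodic hits at position 1 (the A to G substitution) in the odd-numbered copies are what suggest a 16 character duplication unit.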

Mutations add information. In the next example, the second copy of ABCD has been mutated to AXCY. Now, different duplication boundaries (brackets mark the duplicated unit) give different results.

ABCD [AXCY] ABCD → ABCD AXCY AXCY ABCD

ABCD AX[CY AB]CD → ABCD AXCY ABCY ABCD

Note that the boundaries are still not completely determined in the latter two cases. The pattern in both could be shifted one character to the right and give the same results. We present a way to display this uncertainty in Section 6.

Duplication size. The size of the duplication unit can be any multiple of the basic pattern size. In the example below, four copies of a pattern of size 4 are changed into six copies by duplicating the middle 8 characters. Again, mutations in the original copies can help distinguish the size of the duplication unit from other possibilities.

ABCD [AXCY ABCZ] ABCD → ABCD AXCY ABCZ AXCY ABCZ ABCD

Several mechanisms have been proposed for the production of tandem repeats, including replication slippage and unequal crossing over (Wells 1996; Levinson & Gutman 1987; Schlotterer & Tautz 1992; Okumura, Kiyama, & Oishi 1987; Smith 1976). Biological studies (Strand et al. 1993; Weitzmann, Woodford, & Usdin 1997) have already provided support for one or the other of the mechanisms. Mathematical modeling has suggested mechanistic characteristics. For example in (Di Rienzo et al. 1994) accurate modeling of copy number variation at a polymorphic dinucleotide repeat locus has been obtained with a two-phase model which assumes predominantly single copy changes with rare multi-copy changes. In (Bell & Torney 1993) comparison of estimated rates of [...] and observed rates of [...] mutation leads to the conclusion that slipped strand mispairing is the major cause of length polymorphism in [...]. In (Charlesworth, Sniegowski, & Stephan 1994), modeling and simulation suggest that very low recombination rates (unequal crossing over) can result in very large copy number and higher order repeats.

Many unresolved questions can be asked about the mechanism of tandem duplication, among them: (1) Is the boundary of the duplication unit unique, is it confined to a few locations or is it seemingly unrestricted? (2) Is the duplication unit size unique, does it vary in a small range or is it unrestricted? Does pattern size affect the variability of duplication unit size? (3) Does duplication occur preferentially at one end or the other of the repeat or preferentially on the leading or lagging strand during replication (Kunst & Warren 1994; Kang et al. 1995; Eichler et al. 1995)?

Answers to these questions may suggest the presence of conformational structures, either within or adjacent to the tandem repeat (Jeffreys et al. 1994), which trigger duplication or may indicate that different mechanisms act on patterns of different sizes. An extensive analysis of the histories of many tandem repeats can provide data to support one or the other of the theoretical models and may reveal new mechanistic features not already anticipated. Additionally, comparison of related tandem repeats in different sequences could resolve important questions regarding evolution or mutation over short time scales. Such a capability would open up new opportunities to address questions of evolution and ancestry, including the study of human migration, rapid evolution of bacterial diseases, and the cascade of mutations that lead to cancer. With these purposes in mind, we have begun the development of algorithms to
reconstruct tandem repeat histories.

The remainder of this paper is organized as follows. Section 2 contains definitions and descriptions of the problems we investigate. Section 3 describes our greedy algorithms for the history problem. In Section 4 we develop upper and lower bounds on a restricted version of the history problem. In Section 5 we report the performance of the algorithms on simulated sequences. Finally, in Section 6 we give graphical presentations of our analysis of real biological sequences. The Appendix contains additional details on one of our algorithms.

Figure 2: A tandem repeat history. Ancestral pattern sequence is at the top. Bottom sequence contains 9 descendant copies of the pattern. Dotted lines mark the boundaries of copies involved in a duplication. Parent copy is above, child copies in bold below. Note that (1) the boundaries of a parent need not coincide with the putative boundaries of the pattern, (2) a parent's length can be a multiple of the length of a single pattern, and (3) child copies can interact to form a parent in subsequent duplications.

2 Definitions and Problem Descriptions

For the purposes of the problems described below, we assume that a tandem repeat sequence consists of $n$ approximate copies of a basic pattern of length $k$. We are given a multiple alignment, $M$, of the copies. $M$ has $n$ rows and $k$ columns and the $i$th row in $M$ contains the $i$th copy (left-to-right) in the tandem repeat. We let $M_{i,j}$ represent the $i$th row and $j$th column of $M$. Each $M_{i,j}$ contains one of the alphabet symbols $\{A, C, G, T, -\}$ where $-$ indicates a gap in the alignment. We use the notation

$$M_{i,j} \cdots M_{i',j'}, \quad 1 \le i \le i' \le n; \; 1 \le j \le j' \le k$$

to represent a substring of characters in the multiple alignment starting at position $(i, j)$, ending at position $(i', j')$, and wrapping around at the right edge of the multiple alignment if necessary.

Definition 1. A pattern is some string of nucleotides. A tandem duplication is a mutation that replaces one copy of a pattern with two or more adjacent, identical copies. A contraction is an algorithmic operation in which two or more adjacent, equal length substrings (the contraction copies) of a string are replaced by a single substring (the merged copy). A contraction can be thought of as the opposite of a tandem duplication. A binary contraction replaces two contraction copies with a merged copy. A many-to-one contraction replaces two or more contraction copies with a merged copy. A contraction copy is some substring of the multiple alignment $M$ with length a multiple of $k$. For the purposes of contraction, each position in $M$ is treated as a character set which is some subset of the alphabet $\{A, C, G, T, -\}$. An ambiguous character set is a character set which is not a singleton set, e.g., $\{A, G\}$ is ambiguous but $\{A\}$ is not. The original multiple alignment $M$ contains no ambiguous character sets, but a merged copy may contain ambiguous character sets. When a contraction is applied to a multiple alignment $M$, a new, shorter multiple alignment $M'$ is produced.

In this paper, we consider the following problems.

• Tandem repeat history problem (TRHIST). Given a multiple alignment $M$ of the copies of a tandem repeat, a cost function for contractions, and a rule for producing merged copies, find a least cost series of contractions which reduce $M$ to a single merged copy.

• TRHIST, fixed boundary, fixed duplication size. Size and boundaries of contraction copies are fixed and remain the same across all contractions. Without loss of generality, the size is $k$ and the left boundary is column 1 of $M$.

• TRHIST, single column, fixed duplication size. The history problem on a single column of $M$. Boundary is necessarily fixed and size of contraction copies is fixed (without loss of generality) at a single character.

3 Greedy algorithms for TRHIST.

Rule for producing merged copies. If contraction copies are not identical, the merged copy will contain ambiguous characters, represented by ambiguous character sets. This ambiguity may be resolved by some later contraction. Our rule is that the character set at position $i$ in a merged copy is the intersection of the character sets at position $i$ in the contraction copies if the intersection is non-empty. Otherwise, it is the union of the character sets. This is analogous to the method used by Sankoff (Sankoff 1975; Sankoff & Rousseau 1975).

The cost of a contraction. We let the cost function for contractions equal the number of changes that must be made in the contraction copies to make them identical. This is an edit distance type cost function where
substitutions and indels have equal cost. The cost of a contraction equals the number of character sets that are formed in the merged copy by the union operation. To see why, note that in the case where the intersection is not empty, there is a character which makes the contraction copies identical at that position. In the union case though, there is no character that both contraction copies share and therefore, at least one of the copies must be changed at that position with a cost of one.

The operation of binary contraction is illustrated in Fig. 3. Many-to-one contraction works similarly. On the left in the first contraction, is a multiple alignment with $k = 5$ and $n = 7$. Two contraction copies, of length $2k$, are marked by < and >. On the right is the new alignment with the merged copy. The contraction copies and merged copy are shown separately below the alignments. Braces indicate ambiguous character sets. The contraction cost is 4. In the second contraction, the contraction copies have size $1k$. In the merged copy, two ambiguous character sets are eliminated and one set grows larger. The contraction cost is 1.

First contraction:

     1  2  3  4  5               1      2      3  4  5
  1  A  C  T  T  A            1  A      C      T  T  A
  2  A  C  -  T  A            2  A      C      -  T  A
  3  G  G <T  T  A            3  G      G      T  T  {A/C}
  4  G  A  T  T  A    =>      4  {A/G}  {A/C}  T  T  A
  5  G  A><T  T  C            5  G      {A/C}  T  T  A
  6  A  C  T  T  A
  7  G  C> T  T  A

  M_{3,3} ... M_{5,2} = T T A G A T T A G A
  M_{5,3} ... M_{7,2} = T T C A C T T A G C
  merged copy         = T T {A/C} {A/G} {A/C} T T A G {A/C}

Second contraction:

     1      2      3  4  5                   1  2        3  4  5
  1  A      C      T  T  A                1  A  C        T  T  A
  2  A      C      - <T  A                2  A  C        -  T  A
  3  G      G      T><T  {A/C}    =>      3  G  {A/C/G}  T  T  A
  4  {A/G}  {A/C}  T> T  A                4  G  {A/C}    T  T  A
  5  G      {A/C}  T  T  A

  M'_{2,4} ... M'_{3,3} = T A G G T
  M'_{3,4} ... M'_{4,3} = T {A/C} {A/G} {A/C} T
  merged copy           = T A G {A/C/G} T

Figure 3: An example of two binary contractions.

A binary contractions algorithm. Our first greedy algorithm, GREEDY-TRHIST, locally minimizes the binary contraction cost. Each contraction removes two contraction copies from a multiple alignment and replaces them with a merged copy to form a smaller multiple alignment. At each stage, the algorithm chooses the contraction with minimum contraction cost ratio defined as the contraction cost divided by $\Delta_c$ which is the reduction in size of the tandem repeat. Here $\Delta_c$ equals the size of one contraction copy. Ties are broken arbitrarily except that larger $\Delta_c$ is chosen over smaller $\Delta_c$.

We have implemented GREEDY-TRHIST as an $O(kn^3)$ algorithm. Note that the problem size is $kn$. At each stage, the cost for every possible contraction (size $1 \cdot k$ to size $(n/2) \cdot k$), starting at every position is determined with a character to character comparison. This takes time $O(kn^2)$. There are at most $n - 1$ contraction stages. Notice that it is possible to leave out of the calculation any columns that contain only a single letter. The number of such columns increases as the algorithm proceeds.

A many-to-one contractions algorithm. Our second greedy algorithm GREEDY-MANY-TRHIST locally minimizes the many-to-one contraction cost. Each contraction removes $k \ge 2$ contraction copies from a multiple alignment and replaces them with a single merged copy. At each stage, the algorithm chooses the contraction with minimum contraction cost ratio. Here, $\Delta_c$ is $k - 1$ times the size of a contraction copy. Ties are broken as in GREEDY-TRHIST.
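The merge rule and the cost function can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: alignment positions are modelled as `frozenset` objects, and the names `charset`, `merge_copies`, `contraction_cost_ratio` and `show` are assumptions introduced here. The two calls use the contraction copies listed in Fig. 3.

```python
def charset(symbols: str) -> frozenset:
    """Character set for one alignment position, e.g. charset("AC")."""
    return frozenset(symbols)


def merge_copies(left, right):
    """Merge two equal-length contraction copies position by position.

    Rule from the text: use the intersection of the two character sets
    when it is non-empty, otherwise their union. The cost is the number
    of positions that needed a union.
    """
    if len(left) != len(right):
        raise ValueError("contraction copies must have equal length")
    merged, cost = [], 0
    for a, b in zip(left, right):
        common = a & b
        if common:
            merged.append(common)
        else:
            merged.append(a | b)
            cost += 1
    return merged, cost


def contraction_cost_ratio(cost: int, copy_size: int) -> float:
    """Binary contraction cost divided by the reduction in repeat size."""
    return cost / copy_size


def show(copy) -> str:
    """Render singleton sets as a letter and ambiguous sets in braces."""
    return " ".join(
        min(s) if len(s) == 1 else "{" + "/".join(sorted(s)) + "}" for s in copy
    )


# First contraction of Fig. 3 (copies of length 2k = 10).
first_left = [charset(c) for c in "TTAGATTAGA"]
first_right = [charset(c) for c in "TTCACTTAGC"]
merged1, cost1 = merge_copies(first_left, first_right)
print(show(merged1), cost1, contraction_cost_ratio(cost1, 10))

# Second contraction of Fig. 3 (copies of length k = 5).
second_left = [charset(c) for c in "TAGGT"]
second_right = [charset(s) for s in ("T", "AC", "AG", "AC", "T")]
merged2, cost2 = merge_copies(second_left, second_right)
print(show(merged2), cost2, contraction_cost_ratio(cost2, 5))
```

Under these assumptions the second contraction has the lower cost ratio (1/5 against 4/10), which is the quantity a greedy stage would compare across all candidate contractions.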

[Figure 4 panels: OPT TREE (left), SR (right); leaves numbered 1 to 7]

Figure 4: An optimal duplication tree and the cycle SR produced by shortcutting an inorder traversal of the tree.
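The cycle SR of Fig. 4 visits the leaves in the order 1, 2, ..., n and returns to leaf 1, so its cost depends only on distances between rows of the alignment. As a hedged illustration, the sketch below computes that cost and the cost of the ordered path that remains after deleting the most expensive edge of the cycle. The function names are hypothetical, the four rows are toy data, and Hamming distance over gap-free, unambiguous rows stands in for the number of differences between aligned copies.

```python
def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cycle_edges(rows):
    """Edge costs of the cycle 1, 2, ..., n, 1 over the rows (needs n >= 3)."""
    n = len(rows)
    return [hamming(rows[i], rows[(i + 1) % n]) for i in range(n)]


def cost_cycle(rows) -> int:
    """Total cost of the leaf-only cycle."""
    return sum(cycle_edges(rows))


def cost_without_worst_edge(rows) -> int:
    """Cost of the tree left after removing the most costly cycle edge."""
    edges = cycle_edges(rows)
    return sum(edges) - max(edges)


rows = ["AAACTTAG", "AGACTTAG", "AGACTCAG", "AAACTTAT"]  # toy data
print(cycle_edges(rows))              # [1, 1, 3, 1]
print(cost_cycle(rows))               # 6
print(cost_without_worst_edge(rows))  # 3
```

Half of the cycle cost (3 in this toy case) is the kind of simple lower bound that such a cycle yields, since the cycle costs at most twice the optimal tree.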

GREEDY-MANY-TRHIST is implemented as an $O(kn^3 \log n)$ algorithm. Within a single column, for contraction size $i \cdot k$, $i = 1, 2, \ldots, n/2$, there are $O(n^2/i)$ costs to be determined, each in constant time using earlier cost calculations. This leads to $O(n^2 \log n)$ cost calculations. For $k$ columns and a maximum of $n - 1$ contraction stages, the total is $O(kn^3 \log n)$. We do not report further on GREEDY-MANY-TRHIST in this paper.

Exploring the tree of solutions. The space of all possible history solutions for a tandem repeat can be explored as a tree of solutions in which we are seeking the minimal solution. The GREEDY algorithms follow only a single branch of this tree at each node (i.e., only a single contraction is selected). In order to improve the chance of finding an optimal solution, we do a limited exploration of the tree of solutions. At each contraction stage, we generate a list of minimal cost or near minimal contraction choices (there are often several minimal cost choices) and using depth first search we explore each choice in turn.

Exploration of the solution tree provides a secondary benefit. It allows us to identify those features (boundaries/duplication sizes/duplication positions) that are strongly supported by the collection of minimal or near minimal histories.

4 Upper and lower bounds on the cost of a restricted problem.

It is difficult to evaluate the GREEDY-TRHIST algorithms' ability to find minimum cost solutions to a history problem because the minimum answer is not known. In order to test the method, we have used simulated data and a more restricted problem in which the boundaries and duplication sizes are fixed ahead of time. A greedy solution for the fixed boundary, fixed duplication size problem can be obtained with the algorithm GREEDY-TRHIST, in time $O(n^2 k)$, by restricting the chosen contractions to those with left boundary in column 1 of $M$ and contraction copy size $= 1k$. We call this algorithm GREEDY-TRHIST-RESTRICTED.

Even with a restricted version of the history problem, we still do not know the minimum answer. Below, we develop several upper and lower bounds with which we compare the performance of the GREEDY algorithm.

4.1 Upper bounds

In contrast to the general problem, the duplication history of the restricted problem is always a tree. As with other Steiner tree problems which obey the triangle inequality, the fixed boundary, fixed duplication size problem can be bounded to within 2 times optimal (Kou, Markowsky, & Berman 1981). Unlike those other problems, a minimum spanning tree is not required. A minimum spanning tree will usually improve the $2 \cdot OPT$ solution, but because the leaves of the tree are ordered, a special type of minimum spanning tree, the ordered minimum spanning tree, is required. Due to the left-to-right ordering of the pattern copies imposed by the tandem repeat sequence, any other Steiner tree approximation algorithm which depends in its proof or implementation on unordered trees does not apply. An example is the 11/6 Steiner tree approximation of Zelikovsky (Zelikovsky 1993) which assumes that edges can be removed and added on spanning trees whose leaves are unordered.

Definition 2. A duplication tree is a rooted, leaf and edge ordered tree. A depth-first traversal of the tree which follows the edge order at each node visits the leaves in order (Fig. 4, left). An ordered spanning tree is a spanning tree on an ordered set of nodes with the following property. With the nodes numbered in order, for any two edges $(i_1, j_1)$ and $(i_2, j_2)$ with $i_1 < j_1$ and $i_2 < j_2$, we have $(i_1 - i_2)(i_1 - j_2)(j_1 - i_2)(j_1 - j_2) \ge 0$. Alternately, an ordered spanning tree can be drawn on an ordered set of nodes arranged in a linear fashion, with every edge occupying the same half plane (the half planes established by the line through the nodes) and with no edges crossing (Fig. 5, left). Each tree is built on a multiple alignment with $k$ columns. The $i$th leaf (duplication tree) or $i$th node (ordered spanning tree) is labeled with the $i$th row of $M$. In a duplication tree, the internal nodes are labeled with ancestor sequences also of length $k$. Edge cost in both types of tree is the
number of differences between the aligned substrings labeling the ends of the edge. The cost of a tree is the sum of its edge costs.

[Figure 5 panels: minimum ordered spanning tree (left); the two recursion cases on nodes i, i+h, i+j-1 and i, i+h, i+h+1, i+j-1 (right)]

Figure 5: An ordered spanning tree, left. The recursion for minimal cost for an interval of length $j$ starting at node $i$ has two cases.

Lemma 1. Every ordered spanning tree can be transformed into a duplication tree of equal cost.

Proof: (Sketch): To convert an ordered spanning tree to a duplication tree, we create a leaf for each original node in the spanning tree. We also create a root and internal nodes each of which has the same label as one of the leaves. New edges created either have a cost of zero because they connect nodes with the same label or they mimic the edges in the spanning tree. The ordering of the edges at each node preserves the ordering of the leaves.

A $2 \cdot OPT$ approximation. An optimal duplication tree $P^*$ for $M$ will look something like the tree shown in Fig. 4. Each leaf is one of the rows of $M$. Intermediate nodes are labeled with ancestral sequences. An inorder traversal of $P^*$ starting at the root produces a cycle $R$ of nodes in which each internal node appears twice and each leaf node appears once. Because $R$ has two edges for every edge in $P^*$, $cost(R) = 2 \cdot cost(P^*)$. Eliminating all the internal nodes from $R$ by short-cutting between leaf nodes produces a simple cycle $SR$ containing only leaf nodes. It is important to note that no matter what the original form of $P^*$, the graph $SR$ always has the same form (shown in Fig. 4). The cost of $SR$ depends only on the distance between the leaf nodes, that is, the rows in $M$. The triangle inequality guarantees that $cost(SR) \le cost(R)$. Following the nodes in $SR$ from leaf 1 produces the sequence $1, 2, 3, \ldots, n, 1$. By removing the most costly edge, an ordered spanning tree $TR$ is produced, giving the inequality

$$cost(TR) \le cost(SR) \le cost(R) = 2 \cdot cost(P^*).$$

$TR$ can be easily transformed into a duplication tree $TR^*$ of equal cost (Lemma 1). Thus $TR^*$ is a solution with cost no greater than $2 \cdot OPT$.

The minimum ordered spanning tree. $TR$ is not necessarily a minimum ordered spanning tree. A true minimum tree $MT$ can be computed using dynamic programming in $O(n^2 k + n^3)$ time. We compute the minimum ordered spanning tree for all intervals of size $2, 3, \ldots, n$ where an interval of length $j$ starting at node $i$ contains the $i, \ldots, i + j - 1$ rows in $M$. For an interval of length 2, the cost of the edge is the distance $d(i, i+1)$ between the copies. For an interval of length $j > 2$ starting at node $i$, there are two possible cases for the minimal cost (Fig. 5). In one case, a node $i + h$ splits the interval, with all nodes on the left side of the split connected to all nodes on the right side through node $i + h$. In the other case, a node $i + h$ splits the interval with all nodes on the left, including node $i + h$, connected to all nodes on the right through an edge between nodes $i$ and $i + j - 1$. The recurrence for the cost is

$$DSP(i, j) = \min \begin{cases} \min\limits_{1 \le h \le j-2} \{ DSP(i, h+1) + DSP(i+h, j-h) \} \\ \min\limits_{0 \le h \le j-2} \{ d(i, i+j-1) + DSP(i, h+1) + DSP(i+h+1, j-h-1) \} \end{cases}$$

The minimum ordered spanning tree $MT$ can be converted into a duplication tree $MT^*$ of equal cost (Lemma 1) and this tree, while no worse than $TR^*$, is usually better (see section 5). As we also report in section 5, GREEDY-TRHIST-RESTRICTED produces a much better solution than either $TR^*$ or $MT^*$.

4.2 Lower bounds

Our crudest lower bound is character differences, $\sum_j (C_j - 1)$, where $C_j$ is the number of different characters in the column $j$ of $M$. This bound implies that every pair of identical characters in a column can be merged at zero cost. Better bounds are possible for the fixed boundary, fixed duplication size problem. First, from the $2 \cdot OPT$ solution, $cost(SR)$ is easy to compute, so we have a simple lower bound of

$$cost(P^*) \ge cost(SR)/2.$$

Next, observe that in the restricted problem, the duplication tree for each column of the multiple alignment $M$ is identical. A single column algorithm, when applied to each column separately, provides a lower bound, referred to as independent columns. The minimum cost for a single column of $n$ characters can be found in $O(n^3)$ time by dynamic programming. The algorithm is given in the Appendix. A better lower bound can be obtained by computing the optimal cost for every
subset of columns of size $r$ and then greedily choosing subsets whose joint column costs most exceed their independent costs. We have used this method, referred to as column pairs, with $r = 2$. The algorithm for a single column can be generalized to sets of $r$ columns in time $O(5^r n^3)$.

Length=60, Duplications=10

Cost                   p = .02         p = .03         p = .04
Differences          MEAN    STD     MEAN    STD     MEAN    STD
TR* - MT*             6.8    3.6      7.6    4.0      8.2    4.0
MT* - GREEDY         16.9    4.4     24.7    4.7     31.2    5.2
GREEDY - Col Pair     1.0    1.3      1.3    1.4      1.3    1.5
Generating Cost      53.3    6.4     79.6    9.1    105.7   10.7

Table 1: Cost differences between the three solution methods, TR*, MT*, and GREEDY-TRHIST-RESTRICTED and the best lower bound, column pairs. Generating cost is the number of character changes during "mutation" in the simulation. p is the probability of character mutation between duplications. GREEDY-TRHIST-RESTRICTED surpasses MT* by about 30% (relative to the generating cost) and is very close to the lower bound.

5 Simulation results

In our simulation tests, we show that the GREEDY algorithms perform very close to our best lower bounds and that GREEDY-TRHIST-RESTRICTED is much better than the algorithms based on duplication trees. Note that for the results presented here we did not explore the solution tree as described in section 3.

Each simulation sequence started with a single randomly generated string of length $k$ ($k = 60$ for the results presented here; $k = 12, 25$ not shown, but similar). For the first duplication, the entire string was duplicated. In all subsequent duplications, a substring of length $k$ was chosen and duplicated. Every duplication, except the first, was preceded by "mutation" in which every character in the sequence could change to another character with probability $p$. Three values of $p$ were used, .02, .03, and .04. We chose these mutation rates to bracket (1) the contraction cost for a real 60bp pattern found in the human T [...] locus sequence (not shown) and (2) the estimated average mutation rate between contractions observed in several real examples we analyzed, including the 60bp pattern and the 135bp patterns shown in Fig. 7. The number of character changes during mutation for one sequence is the generating cost. A history for each simulated sequence was constructed and the cost calculated. In the figures, costs are normalized by subtracting the generating cost.

TRHIST fixed boundary, fixed duplication size. For this problem, the left boundary of the duplicated substring was always at the first position of a copy (corresponding to column 1 of the multiple alignment). The best lower bound for this problem is column pairs. Of the three solution methods (Section 4), GREEDY-TRHIST-RESTRICTED is superior to the other two, surpassing MT*, the next best method, by about 30% (relative to the generating cost) and giving solutions which are very close to the lower bound (Table 1, Fig. 6, left).

Unrestricted TRHIST problem. For this problem, the left boundary of the duplicated substring was unrestricted, i.e. it could have occurred in any column of the multiple alignment. GREEDY-TRHIST follows the same pattern of performance as GREEDY-TRHIST-RESTRICTED. The only lower bound that applies here is character differences, which is not as accurate as column pairs is for the restricted problem (Fig. 6, right).

[Figure 6 plots: cost minus generating cost against mutation rate (2%, 3%, 4%); pattern width 60, 10 generations. Left series: GREEDY-TRHIST-RESTRICTED, col pair, MT*, TR*. Right series: GREEDY-TRHIST, char diff.]

Figure 6: Comparison of lower bound and solutions for TRHIST problems. Results are from 250 simulated sequences at each of three mutation rates. Scores were normalized by subtracting the generating cost. Left, fixed boundary, fixed duplication size problem. Col pair is best lower bound on solution. GREEDY-TRHIST-RESTRICTED solutions are very close to the lower bound and are much better than TR* or MT*. See Table 1. Right, unrestricted TRHIST problem. Note that the char diff lower bound is less accurate than that used on left. Generating costs for both graphs are similar.

6 Data visualization

Recall from the discussion in the introduction that there can be uncertainty in the boundaries of the duplicated pattern. Fig. 7, top, is a graphical display of this uncertainty. The circles represent contractions produced by GREEDY-TRHIST on two distinct tandem repeats of a 135bp pattern containing 18 copies (left) and 14 copies (right) from chromosome 1 in yeast. Each ring represents one contraction. The shaded arc shows the possible left boundaries (columns in the multiple alignment $M$) of the contraction copies. The rows of the contraction copies are indicated in the bottom of the figure (see below). Reading clockwise from column 1 to column 135, any left boundary chosen from the shaded arc will give the same sequence after contraction. For example, in the left circle, second ring from the outside, the left boundary can be any of columns 88 to 103. Gray level signifies contraction cost.

Fig. 7, bottom, is a graphical display of the location of contractions in the histories. Numbers on the right indicate size of the contraction copy (1 = 1k, 2 = 2k). The rightmost possible boundaries for each contraction are shown. Notice that the history on the right is nearly identical to the history on the left except for the four contractions marked with an asterisk (*) and some minor reordering of the contractions. Indeed, the rightmost 12 copies in the sequence on the left are nearly identical to the leftmost 12 copies in the sequence on the right. The sequence on the right has somehow lost an older part of the tandem repeat. Both histories show a bias in the position of the duplication boundaries. We are currently analyzing simulations and other histories to determine if these biases are significant.

References

Bell, G., and Torney, D. 1993. Repetitive DNA sequences: some considerations for simple sequence repeats. Computers Chem. 17:185-190.

Charlesworth, B.; Sniegowski, P.; and Stephan, W. 1994. The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371:215-220.

Di Rienzo, A.; Peterson, A.; Garza, J.; Valdes, A.; Slatkin, M.; and Freimer, N. 1994. Mutational processes of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91:3166-3170.

Eichler, E.; Hammond, H.; Macpherson, J.; Ward, P.; and Nelson, D. 1995. Population survey of the human FMR1 CGG repeat substructure suggests biased polarity for the loss of AGG interruptions. Human Molecular Genetics 4:2199-2208.

Fitch, W. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406-416.

Jeffreys, A.; Tamaki, K.; MacLeod, A.; Monckton, D.; Neil, D.; and Armour, J. 1994. Complex gene conversion events in germline mutation at human minisatellites. Nature Genetics 6:136-145.

Kang, S.; Jaworski, A.; Ohshima, K.; and Wells, R. 1995. Expansion and deletion of CTG repeats from human disease genes are determined by the direction of replication in E. coli. Nature Genetics 10:213-218.

Kou, L.; Markowsky, G.; and Berman, L. 1981. A fast algorithm for the Steiner problem in graphs. Acta Inform. 15:141-145.

Kunst, C., and Warren, S. 1994. Cryptic and polar variation of the fragile X repeat could result in predisposing normal alleles. Cell 77:853-861.

Levinson, G., and Gutman, G. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Molec. Biol. Evol. 4:203-221.

Okumura, K.; Kiyama, R.; and Oishi, M. 1987. Sequence analyses of extrachromosomal Sau3A and related family DNA: analysis of recombination in the excision event. Nucleic Acids Res. 15:7477-7489.

Sankoff, D., and Rousseau, P. 1975. Locating the vertices of a Steiner tree in an arbitrary metric space. Mathematical Programming 9:240-246.

Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42.

[Figure 7 graphic: panels a and b, one ring diagram with triangular contraction display per 135 bp pattern.]

Figure 7: Graphical representation of duplication histories: (a) 18 copies of a 135 bp pattern and (b) 14 copies of a 135 bp pattern, both from yeast chromosome 1. Each ring represents one contraction. The horizontal line in each circle marks the first and last columns in the multiple alignment. Shaded arcs indicate the possible left boundary of the duplication unit. Gray level signifies contraction cost. The triangular display shows the position of contractions. Rightmost possible contraction boundaries are shown. Numbers on the right indicate the size of the contraction copy (1 = 1k, 2 = 2k).

Schlotterer, C., and Tautz, D. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211–215.

Smith, G. 1976. Evolution of repeated DNA sequences by unequal crossover. Science 191:528–535.

Strand, M.; Prolla, T.; Liskay, R.; and Petes, T. 1993. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365:274–276.

Weitzmann, M.; Woodford, K.; and Usdin, K. 1997. DNA secondary structures and the evolution of hypervariable tandem arrays. J. of Biological Chemistry 272:9517–9523.

Wells, R. 1996. Molecular basis of genetic instability of triplet repeats. J. of Biological Chemistry 271:2875–2878.

Zelikovsky, A. 1993. An 11/6-approximation algorithm for the network Steiner problem. Algorithmica 9:463–470.

7 Appendix

Single column minimum algorithm. We are given a column of characters from the alphabet $\Sigma = \{A, C, G, T, -\}$ on which to construct a duplication history tree. If a history tree were given, and we wanted to compute its cost, we would use a method first described by Fitch (Fitch 1971) and later proven correct by Sankoff (Sankoff 1975; Sankoff & Rousseau 1975). The method uses a cost vector $C_i$ of size $|\Sigma|$ assigned to each node $x_i$ of the tree. Each entry $C_i[\sigma]$, $\sigma \in \Sigma$, represents the minimum cost of the edges in the subtree rooted at $x_i$, including the edge to the parent of $x_i$, when the parent is labeled with the character $\sigma$. The root can be assumed to have a hypothetical parent. The recurrence for $C_i$ is as follows:

$$
C_i[\sigma] =
\begin{cases}
\begin{cases}
0 & \sigma = \sigma_i \\
1 & \sigma \neq \sigma_i
\end{cases}
& \text{when leaf } x_i \text{ is labeled } \sigma_i, \\[8pt]
\min\limits_{\tau \in \Sigma} \bigl( C_{j_1}[\tau] + \cdots + C_{j_k}[\tau] + d(\tau, \sigma) \bigr)
& \text{when } x_i \text{ has children } x_{j_1}, \ldots, x_{j_k}.
\end{cases}
$$

Here $d(\tau, \sigma)$ is the cost of the edge between a node labeled $\tau$ and its parent labeled $\sigma$.
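The recurrence can be illustrated on a small fixed tree. The following is a minimal sketch, not the authors' code: it assumes a unit edge cost (0 when parent and child labels agree, 1 otherwise), represents a leaf as a one-character string and an internal node as a tuple of children, and uses the hypothetical names `cost_vector` and `tree_cost`.

```python
ALPHABET = "ACGT-"  # four nucleotides plus a gap character (assumed symbol)


def cost_vector(node):
    """Return {parent_label: minimum subtree cost, including the edge to the parent}."""
    if isinstance(node, str):
        # Leaf case: the edge to the parent costs 0 on a match, 1 otherwise.
        return {s: 0 if s == node else 1 for s in ALPHABET}
    child_vectors = [cost_vector(child) for child in node]
    # total[t]: summed child costs when this node itself is labeled t.
    total = {t: sum(v[t] for v in child_vectors) for t in ALPHABET}
    # Add the (assumed unit) cost of the edge to a parent labeled s.
    return {
        s: min(total[t] + (0 if t == s else 1) for t in ALPHABET)
        for s in ALPHABET
    }


def tree_cost(tree):
    """Cost of the tree: minimum over labels of the root's hypothetical parent."""
    return min(cost_vector(tree).values())


if __name__ == "__main__":
    print(tree_cost((("A", "A"), ("G", "G"))))  # grouping identical copies
    print(tree_cost((("A", "G"), ("A", "G"))))  # grouping mixed copies
```

Grouping the identical characters first needs a single substitution, while pairing A with G twice needs two, which is the kind of difference the history tree is meant to capture.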

Single column algorithm

Data structures:
V[substring_size, substring_start, letter] = cost of subtree when root is letter (cost vectors).
T[partition, letter] = cost of subtree defined by partition when root is letter.

input Column[1...n]
for l = 1, ..., n                              // initialize leaf cost vectors
    for σ ∈ {A, C, G, T, -}
        if σ == Column[l]  V[1, l, σ] = 0;
        else               V[1, l, σ] = 1;

// compute minimum cost vectors
for k = 2, ..., n                              // size of substring
    for l = 1, ..., n - k + 1                  // start of substring
        for d = 1, ..., k - 1                  // partition
            // get cost vector for each partition position
            for σ ∈ {A, C, G, T, -}            // characters
                T[d, σ] = V[d, l, σ] + V[k - d, l + d, σ];
            m = min over σ ∈ {A, C, G, T, -} of T[d, σ];
            for σ ∈ {A, C, G, T, -}
                if T[d, σ] ≠ m  T[d, σ] = m + 1;
        // get overall cost vector
        for σ ∈ {A, C, G, T, -}
            V[k, l, σ] = min over d = 1, ..., k - 1 of T[d, σ];

// get minimum cost for optimal tree
m = min over σ ∈ {A, C, G, T, -} of V[n, 1, σ];
return m;                                      // return minimum from root

Figure 8: Algorithm for optimal solution for TRHIST, single column, fixed duplication size, binary contractions.
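The pseudocode of Fig. 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: positions are 0-based, a dictionary keyed by (size, start) stands in for the array V, substitution cost is 1, and the function name `single_column_cost` is hypothetical.

```python
ALPHABET = "ACGT-"  # four nucleotides plus a gap character (assumed symbol)


def single_column_cost(column):
    """Minimum cost over all ordered binary trees on one column of characters."""
    n = len(column)
    V = {}  # V[(k, l)][s]: cost for the substring of size k starting at l, parent label s
    for l, ch in enumerate(column):
        # Leaf cost vectors.
        V[(1, l)] = {s: 0 if s == ch else 1 for s in ALPHABET}
    for k in range(2, n + 1):            # size of substring
        for l in range(n - k + 1):       # start of substring (0-based)
            best = {s: float("inf") for s in ALPHABET}
            for d in range(1, k):        # partition: left part has d characters
                T = {s: V[(d, l)][s] + V[(k - d, l + d)][s] for s in ALPHABET}
                m = min(T.values())
                for s in ALPHABET:
                    # Either the subtree root matches the parent (T[s]) or it
                    # takes the cheapest label and pays one substitution (m + 1).
                    best[s] = min(best[s], T[s], m + 1)
            V[(k, l)] = best
    return min(V[(n, 0)].values())       # minimum over the hypothetical parent's label


if __name__ == "__main__":
    for col in ("AAAA", "AAGG", "AGAG"):
        print(col, single_column_cost(col))
```

Taking `min(T[s], m + 1)` is equivalent to the pseudocode's "if T[d, σ] ≠ m then T[d, σ] = m + 1", because costs are integers and any non-minimal entry is at least m + 1.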

Although the Fitch and Sankoff papers assumed that the tree is given, the method can also be used to determine the minimum cost at each node of the tree as the tree is being constructed from the bottom up.

Fig. 8 is pseudocode for an $O(n^3)$ algorithm for finding the cost of the optimal tree with binary contractions. It builds the tree from the bottom up, working with substrings of the column of size $k = 2, \ldots, n$. For each substring, an optimal binary tree connecting the characters in the substring is determined. As above, each substring is assigned a cost vector with the costs of the leaves fixed. Cost vectors are stored in an array V indexed by the substring length and starting position. When $k = 2$ the cost vector for adjacent characters is determined by the method described above. When $k \geq 3$, the optimal tree for the substring will have a left and a right subtree. There are $k - 1$ possible locations in the substring for a partition between the subtrees and each is tried in turn. One cost vector per partition position is determined and a final cost vector for the substring is obtained by finding for each character its minimum over all the fixed partition costs. Note that this implies that each character may have its own minimizing tree.

Extension to r columns. The preceding algorithm can be generalized for $r$ columns as follows. Instead of finding a single column vector for each substring, we find an $r$-dimensional array, one dimension for each column. The size of this array is $5^r$. The values in the array represent the combined costs of the same tree on each of the $r$ substrings for every letter combination at the $r$ roots. Time complexity is $O(5^r n^3)$, which is exponential in the number of columns.
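The r-column extension can be sketched by replacing each cost vector with a table keyed by r-tuples of letters. This is a minimal sketch under stated assumptions, not the authors' implementation: each copy is a string of length r, substitution cost is 1 per column, and the names `relax` and `multi_column_cost` are hypothetical. Handling the edge to the parent one column at a time is a choice made here for brevity, and the sketch does not try to match the stated time bound exactly.

```python
from itertools import product

ALPHABET = "ACGT-"  # four nucleotides plus a gap character (assumed symbol)


def relax(table, r):
    """Add the cost of the edge to the parent, one column (axis) at a time."""
    for axis in range(r):
        new = {}
        for key in table:
            # Cheapest entry among tuples that differ from key only at this axis.
            line_min = min(
                table[key[:axis] + (c,) + key[axis + 1:]] for c in ALPHABET
            )
            new[key] = min(table[key], line_min + 1)
        table = new
    return table


def multi_column_cost(copies):
    """Minimum total cost of one ordered binary tree shared by all r columns."""
    n, r = len(copies), len(copies[0])
    keys = list(product(ALPHABET, repeat=r))  # the 5**r letter combinations
    V = {}
    for l, copy in enumerate(copies):
        # Leaf tables: one substitution per column that differs from the parent.
        V[(1, l)] = {key: sum(a != b for a, b in zip(key, copy)) for key in keys}
    for k in range(2, n + 1):            # number of copies in the substring
        for l in range(n - k + 1):       # first copy of the substring (0-based)
            best = {key: float("inf") for key in keys}
            for d in range(1, k):        # partition position
                T = {key: V[(d, l)][key] + V[(k - d, l + d)][key] for key in keys}
                T = relax(T, r)
                best = {key: min(best[key], T[key]) for key in keys}
            V[(k, l)] = best
    return min(V[(n, 0)].values())


if __name__ == "__main__":
    print(multi_column_cost(["AA", "AA", "GG", "GG"]))
```

With r = 1 the table reduces to the single-column cost vector, so the two sketches should agree on one-character copies.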