Reconstructing the Duplication History of a Tandem Repeat
Gary Benson* and Lan Dong†
Department of Biomathematical Sciences
The Mount Sinai School of Medicine
New York, NY 10029-6574
Abstract

One of the less well understood mutational transformations that act upon DNA is tandem duplication. In this process, a stretch of DNA is duplicated to produce two or more adjacent copies, resulting in a tandem repeat. Over time, the copies undergo additional mutations so that typically, multiple approximate tandem copies are present. An interesting feature of tandem repeats is that the duplicated copies are preserved together, making it possible to do "phylogenetic analysis" on a single sequence. This involves using the pattern of mutations among the copies to determine a minimal or a most likely history for the repeat. A history tries to describe the interwoven pattern of substitutions, indels, and duplication events in such a way as to minimize the number of identical mutations that arise independently. Because the copies are adjacent and ordered, the history problem cannot be solved by standard phylogeny algorithms. In this paper, we introduce several versions of the tandem repeat history problem, develop algorithmic solutions and evaluate their performance. We also develop ways to visualize important features of a history with the goal of discovering properties of the duplication mechanism.

Keywords: tandem repeats, phylogeny algorithms

1 Introduction

One of the less well understood mutational processes for DNA molecules is tandem duplication in which a stretch of DNA is transformed into two or more adjacent copies. The following illustrates a tandem duplication in which the single occurrence of triplet CGG is transformed into three identical, adjacent copies.

...T CGG A... → ...T CGG CGG CGG A...

The result of a tandem duplication is termed a tandem repeat. Over time, individual copies within a tandem repeat may undergo additional, uncoordinated mutations including new tandem duplications so that typically, multiple approximate tandem copies are present.

Examination of a tandem repeat often suggests that the sequence was produced by a series of tandem duplications interspersed with point mutations. The real biological sequence shown in Fig. 1 is a typical example. It consists of 16 copies of an 8 nucleotide pattern. Copies are numbered and spaces inserted between the copies for clarity. A consensus for these 16 copies is AAACTTAG. Asterisks (*) flag differences between the copies and the consensus.

Careful observation reveals that G is periodically substituted for A. Such substitutions are unlikely to occur independently. It is more likely that a single common ancestor pattern is responsible for the A to G substitutions through duplication. Perhaps an 8 character unit, say AAACTTAG, was first duplicated and then mutated to AGACTTAG and then the two copies were duplicated as a single 16 character unit AAACTTAGAGACTTAG. When the second through thirteenth copies are viewed in this way, 5 of the A to G substitutions are accounted for. Further observation suggests that the two starred Ts may have been the result of another duplication.

Tandem repeats are different from other types of duplicated sequences because the child copies of duplication are adjacent on the same sequence. This difference leads to complications in determining the parent copy of duplication. See Fig. 2.

Boundaries. It is not always possible to distinguish the boundaries of a duplicated pattern. Consider the two examples below in which a duplication changes three identical copies of ABCD into four identical copies. Although the boundaries of the duplicated patterns (underlined) differ, the results are the same.
ABCD ABCD ABCD → ABCD ABCD ABCD ABCD
ABCD ABCD ABCD → ABCD ABCD ABCD ABCD

* Partially supported by NSF grant CCR-9623532 and a 1997 grant from the German Academic Exchange Service (DAAD).
† Partially supported by NSF grant CCR-9623532.
* * * *
AAACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AAACTTAG AGACTTAG AAACTTAG
1 2 3 4 5 6 7 8
* * * * * * *
AGACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
9 10 11 12 13 14 15 16
------
*
AAACTTAG AAACTTATAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG
1 2 3 4 5 6 7 8 9
* * * *
AAACTTATAGACTTAG AAACTTAGAGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
10 11 12 13 14 15 16
Figure 1: Top: Periodic nucleotide substitutions in a tandem repeat suggest a common ancestor. Bottom: Five of the A to G substitutions may be accounted for by a single A to G substitution followed by duplication.
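The tandem duplication operation described in the introduction can be illustrated with a minimal sketch. The function name `tandem_duplicate` and its parameters are hypothetical, chosen only for this illustration; they are not part of the paper. The sketch reproduces the CGG triplet example and shows that two different boundaries of the duplicated unit in three identical copies of ABCD give the same result.

```python
def tandem_duplicate(seq: str, start: int, length: int, copies: int = 2) -> str:
    """Replace seq[start:start+length] by `copies` adjacent copies of itself.

    `start` is a 0-based offset. All names here are illustrative assumptions.
    """
    if copies < 2:
        raise ValueError("a tandem duplication produces at least two copies")
    if start < 0 or length <= 0 or start + length > len(seq):
        raise ValueError("duplicated unit must lie inside the sequence")
    unit = seq[start:start + length]
    return seq[:start] + unit * copies + seq[start + length:]


if __name__ == "__main__":
    # The single triplet CGG becomes three adjacent copies.
    print(tandem_duplicate("TCGGA", 1, 3, copies=3))

    # Three identical copies of ABCD: duplicating the unit ABCD (offset 4)
    # or the shifted unit CDAB (offset 6) yields the same four copies.
    repeat = "ABCD" * 3
    print(tandem_duplicate(repeat, 4, 4) == tandem_duplicate(repeat, 6, 4))
```

Applied to copies that are no longer identical, the same calls give boundary-dependent results, which is why mutations carry information about the duplication unit.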
Mutations add information. In the next example, the second copy of ABCD has been mutated to AXCY. Now, different duplication boundaries give different results.

ABCD AXCY ABCD → ABCD AXCY AXCY ABCD
ABCD AXCY ABCD → ABCD AXCY ABCY ABCD

Note that the boundaries are still not completely determined in the latter two cases. The pattern in both could be shifted one character to the right and give the same results. We present a way to display this uncertainty in Section 6.

Duplication size. The size of the duplication unit can be any multiple of the basic pattern size. In the example below, four copies of a pattern of size 4 are changed into six copies by duplicating the middle 8 characters. Again, mutations in the original copies can help distinguish the size of the duplication unit from other possibilities.

ABCD AXCY ABCZ ABCD → ABCD AXCY ABCZ AXCY ABCZ ABCD

Several mechanisms have been proposed for the production of tandem repeats, including replication slippage and unequal crossing over (Wells 1996; Levinson & Gutman 1987; Schlotterer & Tautz 1992; Okumura, Kiyama, & Oishi 1987; Smith 1976). Biological studies (Strand et al. 1993; Weitzmann, Woodford, & Usdin 1997) have already provided support for one or the other of the mechanisms. Mathematical modeling has suggested mechanistic characteristics. For example, in (Di Rienzo et al. 1994) accurate modeling of copy number variation at a polymorphic dinucleotide repeat locus has been obtained with a two-phase model which assumes predominantly single copy changes with rare multi-copy changes. In (Bell & Torney 1993) comparison of estimated rates of unequal crossing over and observed rates of microsatellite mutation leads to the conclusion that slipped strand mispairing is the major cause of length polymorphism in microsatellites. In (Charlesworth, Sniegowski, & Stephan 1994), modeling and simulation suggest that, at very low recombination rates, unequal crossing over can result in very large copy number and higher order repeats.

Many unresolved questions can be asked about the mechanism of tandem duplication, among them: (1) Is the boundary of the duplication unit unique, is it confined to a few locations or is it seemingly unrestricted? (2) Is the duplication unit size unique, does it vary in a small range or is it unrestricted? Does pattern size affect the variability of duplication unit size? (3) Does duplication occur preferentially at one end or the other of the repeat or preferentially on the leading or lagging strand during replication (Kunst & Warren 1994; Kang et al. 1995; Eichler et al. 1995)?

Answers to these questions may suggest the presence of conformational structures, either within or adjacent to the tandem repeat (Jeffreys et al. 1994), which trigger duplication or may indicate that different mechanisms act on patterns of different sizes. An extensive analysis of the histories of many tandem repeats can provide data to support one or the other of the theoretical models and may reveal new mechanistic features not already anticipated. Additionally, comparison of related tandem repeats in different sequences could resolve important questions regarding evolution or mutation over short time scales. Such a capability would open up new opportunities to address questions of evolution and ancestry, including the study of human migration, rapid evolution of bacterial diseases, and the cascade of mutations that lead to cancer. With these purposes in mind, we have begun the development of algorithms to
reconstruct tandem repeat histories.

Figure 2: A tandem repeat history. Ancestral pattern sequence is at the top. Bottom sequence contains 9 descendant copies of the pattern. Dotted lines mark the boundaries of copies involved in a duplication. Parent copy is above, child copies (in bold) below. Note that (1) the boundaries of a parent need not coincide with the putative boundaries of the pattern, (2) a parent's length can be a multiple of the length of a single pattern, and (3) child copies can interact to form a parent in subsequent duplications.

The remainder of this paper is organized as follows. Section 2 contains definitions and descriptions of the problems we investigate. Section 3 describes our greedy algorithms for the history problem. In Section 4 we develop upper and lower bounds on a restricted version of the history problem. In Section 5 we report the performance of the algorithms on simulated sequences. Finally, in Section 6 we give graphical presentations of our analysis of real biological sequences. The Appendix contains additional details on one of our algorithms.

2 Definitions and Problem Descriptions

For the purposes of the problems described below, we assume that a tandem repeat sequence consists of $n$ approximate copies of a basic pattern of length $k$. We are given a multiple alignment, $M$, of the copies. $M$ has $n$ rows and $k$ columns and the $i$th row in $M$ contains the $i$th copy (left-to-right) in the tandem repeat. We let $M_{i,j}$ represent the $i$th row and $j$th column of $M$. Each $M_{i,j}$ contains one of the alphabet symbols {A, C, G, T, -} where - indicates a gap in the alignment. We use the notation

$$M_{i,j} \ldots M_{i',j'}, \quad 1 \le i \le i' \le n, \quad 1 \le j, j' \le k$$

to represent a substring of characters in the multiple alignment starting at position $(i, j)$, ending at position $(i', j')$, and wrapping around at the right edge of the multiple alignment if necessary.

Definition 1. A pattern is some string of nucleotides. A tandem duplication is a mutation that replaces one copy of a pattern with two or more adjacent, identical copies. A contraction is an algorithmic operation in which two or more adjacent, equal length substrings (the contraction copies) of a string are replaced by a single substring (the merged copy). A contraction can be thought of as the opposite of a tandem duplication. A binary contraction replaces two contraction copies with a merged copy. A many-to-one contraction replaces two or more contraction copies with a merged copy. A contraction copy is some substring of the multiple alignment $M$ with length a multiple of $k$. For the purposes of contraction, each position in $M$ is treated as a character set which is some subset of the alphabet {A, C, G, T, -}. An ambiguous character set is a character set which is not a singleton set, e.g., {A, G} is ambiguous but {A} is not. The original multiple alignment $M$ contains no ambiguous character sets, but a merged copy may contain ambiguous character sets. When a contraction is applied to a multiple alignment $M$, a new, shorter multiple alignment $M'$ is produced.

In this paper, we consider the following problems.

Tandem repeat history problem (TRHIST). Given a multiple alignment $M$ of the copies of a tandem repeat, a cost function for contractions, and a rule for producing merged copies, find a least cost series of contractions which reduce $M$ to a single merged copy.

TRHIST, fixed boundary, fixed duplication size. Size and boundaries of contraction copies are fixed and remain the same across all contractions. Without loss of generality, the size is $k$ and the left boundary is column 1 of $M$.

TRHIST, single column, fixed duplication size. The history problem on a single column of $M$. Boundary is necessarily fixed and size of contraction copies is fixed (without loss of generality) at a single character.

3 Greedy algorithms for TRHIST.

Rule for producing merged copies. If contraction copies are not identical, the merged copy will contain ambiguous characters, represented by ambiguous character sets. This ambiguity may be resolved by some later contraction. Our rule is that the character set at position $i$ in a merged copy is the intersection of the character sets at position $i$ in the contraction copies if the intersection is non-empty. Otherwise, it is the union of the character sets. This is analogous to the method used by Sankoff (Sankoff 1975; Sankoff & Rousseau 1975).

First contraction:

     1 2 3 4 5                  1     2     3 4 5
1    A C T T A             1    A     C     T T A
2    A C - T A             2    A     C     - T A
3    G G < T T A           3    G     G     T T {A/C}
4    G A T T A       =     4    {A/G} {A/C} T T A
5    G A >< T T C          5    G     {A/C} T T A
6    A C T T A
7    G C > T T A

$M_{3,3} \ldots M_{5,2}$ = T T A G A T T A G A
$M_{5,3} \ldots M_{7,2}$ = T T C A C T T A G C
merged copy = T T {A/C} {A/G} {A/C} T T A G {A/C}

Second contraction:

     1     2     3 4 5                 1 2       3 4 5
1    A     C     T T A            1    A C       T T A
2    A     C     - < T A          2    A C       - T A
3    G     G     T >< T {A/C}  =  3    G {A/C/G} T T A
4    {A/G} {A/C} T > T A          4    G {A/C}   T T A
5    G     {A/C} T T A

$M'_{2,4} \ldots M'_{3,3}$ = T A G G T
$M'_{3,4} \ldots M'_{4,3}$ = T {A/C} {A/G} {A/C} T
merged copy = T A G {A/C/G} T

Figure 3: An example of two binary contractions.

The cost of a contraction. We let the cost function for contractions equal the number of changes that must be made in the contraction copies to make them identical. This is an edit distance type cost function where
substitutions and indels have equal cost. The cost of a contraction equals the number of character sets that are formed in the merged copy by the union operation. To see why, note that in the case where the intersection is not empty, there is a character which makes the contraction copies identical at that position. In the union case though, there is no character that both contraction copies share and therefore, at least one of the copies must be changed at that position with a cost of one.

The operation of binary contraction is illustrated in Fig. 3. Many-to-one contraction works similarly. On the left in the first contraction is a multiple alignment with $k = 5$ and $n = 7$. Two contraction copies, of length $2k$, are marked by < and >. On the right is the new alignment with the merged copy. The contraction copies and merged copy are shown separately below the alignments. Braces indicate ambiguous character sets. The contraction cost is 4. In the second contraction, the contraction copies have size $1k$. In the merged copy, two ambiguous character sets are eliminated and one set grows larger. The contraction cost is 1.

A binary contractions algorithm. Our first greedy algorithm, GREEDY-TRHIST, locally minimizes the binary contraction cost. Each contraction removes two contraction copies from a multiple alignment and replaces them with a merged copy to form a smaller multiple alignment. At each stage, the algorithm chooses the contraction with minimum contraction cost ratio, defined as the contraction cost divided by $\delta_c$, which is the reduction in size of the tandem repeat. Here $\delta_c$ equals the size of one contraction copy. Ties are broken arbitrarily except that larger $\delta_c$ is chosen over smaller $\delta_c$.

We have implemented GREEDY-TRHIST as an $O(kn^3)$ algorithm. Note that the problem size is $kn$. At each stage, the cost for every possible contraction (size $1 \cdot k$ to size $\lfloor n/2 \rfloor \cdot k$, starting at every position) is determined with a character to character comparison. This takes time $O(kn^2)$. There are at most $n - 1$ contraction stages. Notice that it is possible to leave out of the calculation any columns that contain only a single letter. The number of such columns increases as the algorithm proceeds.

A many-to-one contractions algorithm. Our second greedy algorithm, GREEDY-MANY-TRHIST, locally minimizes the many-to-one contraction cost. Each contraction removes $k \ge 2$ contraction copies from a multiple alignment and replaces them with a single merged copy. At each stage, the algorithm chooses the contraction with minimum contraction cost ratio. Here, $\delta_c$ is $k - 1$ times the size of a contraction copy. Ties are broken as in GREEDY-TRHIST.
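The merge rule and the contraction cost described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: character sets are modeled as Python `frozenset` objects, the helper names (`to_sets`, `contract`, `show`, `greedy_fixed_boundary`) are hypothetical, and the greedy loop is a simplification that only considers contractions of whole adjacent rows (the fixed boundary, fixed duplication size variant), with ties broken by taking the first minimum.

```python
from typing import FrozenSet, List, Sequence, Tuple

CharSet = FrozenSet[str]


def to_sets(copy: str) -> List[CharSet]:
    """Turn an unambiguous copy such as "TTAGA" into singleton character sets."""
    return [frozenset(ch) for ch in copy]


def contract(a: Sequence[CharSet], b: Sequence[CharSet]) -> Tuple[List[CharSet], int]:
    """Binary contraction: intersection if non-empty, otherwise union (cost 1)."""
    if len(a) != len(b):
        raise ValueError("contraction copies must have equal length")
    merged: List[CharSet] = []
    cost = 0
    for sa, sb in zip(a, b):
        common = sa & sb
        if common:
            merged.append(common)      # a shared character exists, no change needed
        else:
            merged.append(sa | sb)     # no shared character, one copy must change
            cost += 1
    return merged, cost


def show(copy: Sequence[CharSet]) -> str:
    """Render character sets in the brace notation of Fig. 3."""
    return " ".join(
        next(iter(s)) if len(s) == 1 else "{" + "/".join(sorted(s)) + "}"
        for s in copy
    )


def greedy_fixed_boundary(rows: Sequence[str]) -> int:
    """Greedy history cost when only whole adjacent rows may be contracted."""
    current = [to_sets(r) for r in rows]
    total = 0
    while len(current) > 1:
        # Pick the adjacent pair with the smallest contraction cost.
        best = min(range(len(current) - 1),
                   key=lambda i: contract(current[i], current[i + 1])[1])
        merged, cost = contract(current[best], current[best + 1])
        current[best:best + 2] = [merged]
        total += cost
    return total


if __name__ == "__main__":
    merged, cost = contract(to_sets("TTAGATTAGA"), to_sets("TTCACTTAGC"))
    print(show(merged), cost)
```

Run on the two contraction copies of the first contraction in Fig. 3, the sketch yields the merged copy T T {A/C} {A/G} {A/C} T T A G {A/C} with cost 4, in agreement with the figure.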
Figure 4: An optimal duplication tree (OPT TREE) and the cycle SR produced by shortcutting an inorder traversal of the tree.
GREEDY-MANY-TRHIST is implemented as an $O(kn^3 \log n)$ algorithm. Within a single column, for contraction size $i \cdot k$, $i = 1, 2, \ldots, n/2$, there are $O(n^2/i)$ costs to be determined, each in constant time using earlier cost calculations. This leads to $O(n^2 \log n)$ cost calculations. For $k$ columns and a maximum of $n - 1$ contraction stages, the total is $O(kn^3 \log n)$. We do not report further on GREEDY-MANY-TRHIST in this paper.

Exploring the tree of solutions. The space of all possible history solutions for a tandem repeat can be explored as a tree of solutions in which we are seeking the minimal solution. The GREEDY algorithms follow only a single branch of this tree at each node (i.e., only a single contraction is selected). In order to improve the chance of finding an optimal solution, we do a limited exploration of the tree of solutions. At each contraction stage, we generate a list of minimal cost or near minimal contraction choices (there are often several minimal cost choices) and using depth first search we explore each choice in turn.

Exploration of the solution tree provides a secondary benefit. It allows us to identify those features (boundaries/duplication sizes/duplication positions) that are strongly supported by the collection of minimal or near minimal histories.

4 Upper and lower bounds on the cost of a restricted problem.

It is difficult to evaluate the GREEDY-TRHIST algorithms' ability to find minimum cost solutions to a history problem because the minimum answer is not known. In order to test the method, we have used simulated data and a more restricted problem in which the boundaries and duplication sizes are fixed ahead of time. A greedy solution for the fixed boundary, fixed duplication size problem can be obtained with the algorithm GREEDY-TRHIST, in time $O(n^2 k)$, by restricting the chosen contractions to those with left boundary in column 1 of $M$ and contraction copy size $= 1k$. We also call this algorithm GREEDY-TRHIST-RESTRICTED.

Even with a restricted version of the history problem, we still do not know the minimum answer. Below, we develop several upper and lower bounds with which we compare the performance of the GREEDY algorithm.

4.1 Upper bounds

In contrast to the general problem, the duplication history of the restricted problem is always a tree. As with other Steiner tree problems which obey the triangle inequality, the fixed boundary, fixed duplication size problem can be bounded to within 2 times optimal (Kou, Markowsky, & Berman 1981). Unlike those other problems, a minimum spanning tree is not required. A minimum spanning tree will usually improve the $2 \cdot OPT$ solution, but because the leaves of the tree are ordered, a special type of minimum spanning tree, the ordered minimum spanning tree, is required. Due to the left-to-right ordering of the pattern copies imposed by the tandem repeat sequence, any other Steiner tree approximation algorithm which depends in its proof or implementation on unordered trees does not apply. An example is the 11/6 Steiner tree approximation of Zelikovsky (Zelikovsky 1993) which assumes that edges can be removed and added on spanning trees whose leaves are unordered.

Definition 2. A duplication tree is a rooted, leaf and edge ordered tree. A depth-first traversal of the tree which follows the edge order at each node visits the leaves in order (Fig. 4, left). An ordered spanning tree is a spanning tree on an ordered set of nodes with the following property. With the nodes numbered in order, for any two edges $(i_1, j_1)$ and $(i_2, j_2)$, $i_1 < j_1$, $i_2 < j_2$, we have $(i_1 - i_2)(i_1 - j_2)(j_1 - i_2)(j_1 - j_2) \ge 0$. Alternately, an ordered spanning tree can be drawn on an ordered set of nodes arranged in a linear fashion, with every edge occupying the same half plane (the half planes established by the line through the nodes) and with no edges crossing (Fig. 5, left). Each tree is built on a multiple alignment with $k$ columns. The $i$th leaf (duplication tree) or $i$th node (ordered spanning tree) is labeled with the $i$th row of $M$. In a duplication tree, the internal nodes are labeled with ancestor sequences also of length $k$. Edge cost in both types of tree is the number of differences between the aligned substrings labeling the ends of the edge. The cost of a tree is the sum of its edge costs.

Figure 5: An ordered spanning tree, left. The recursion for minimal cost for an interval of length $j$ starting at node $i$ has two cases.

Lemma 1. Every ordered spanning tree can be transformed into a duplication tree of equal cost.

Proof (Sketch): To convert an ordered spanning tree to a duplication tree, we create a leaf for each original node in the spanning tree. We also create a root and internal nodes each of which has the same label as one of the leaves. New edges created either have a cost of zero because they connect nodes with the same label or they mimic the edges in the spanning tree. The ordering of the edges at each node preserves the ordering of the leaves.

A $2 \cdot OPT$ approximation. An optimal duplication tree $P$ for $M$ will look something like the tree shown in Fig. 4. Each leaf is one of the rows of $M$. Intermediate nodes are labeled with ancestral sequences. An inorder traversal of $P$ starting at the root produces a cycle $R$ of nodes in which each internal node appears twice and each leaf node appears once. Because $R$ has two edges for every edge in $P$, $cost(R) = 2 \cdot cost(P)$. Eliminating all the internal nodes from $R$ by short-cutting between leaf nodes produces a simple cycle $SR$ containing only leaf nodes. It is important to note that no matter what the original form of $P$, the graph $SR$ always has the same form (shown in Fig. 4). The cost of $SR$ depends only on the distance between the leaf nodes, that is, the rows in $M$. The triangle inequality guarantees that $cost(SR) \le cost(R)$. Following the nodes in $SR$ from leaf 1 produces the sequence $1, 2, 3, \ldots, n, 1$. By removing the most costly edge, an ordered spanning tree $TR$ is produced, giving the inequality

$$cost(TR) \le cost(SR) \le cost(R) = 2 \cdot cost(P).$$

$TR$ can be easily transformed into a duplication tree $TR^*$ of equal cost (Lemma 1). Thus $TR^*$ is a solution with cost no greater than $2 \cdot OPT$.

The minimum ordered spanning tree. $TR$ is not necessarily a minimum ordered spanning tree. A true minimum tree $MT$ can be computed using dynamic programming in $O(n^2 k + n^3)$ time. We compute the minimum ordered spanning tree for all intervals of size $2, 3, \ldots, n$ where an interval of length $j$ starting at node $i$ contains the $i, \ldots, i + j - 1$ rows in $M$. For an interval of length 2, the cost of the edge is the distance $d(i, i+1)$ between the copies. For an interval of length $j > 2$ starting at node $i$, there are two possible cases for the minimal cost (Fig. 5). In one case, a node $i + h$ splits the interval, with all nodes on the left side of the split connected to all nodes on the right side through node $i + h$. In the other case, a node $i + h$ splits the interval with all nodes on the left, including node $i + h$, connected to all nodes on the right through an edge between nodes $i$ and $i + j - 1$. The recurrence for the cost (taking $DSP(i, 1) = 0$ for a single node) is

$$
DSP(i, j) = \min
\begin{cases}
\min_{1 \le h \le j-2} \{ DSP(i, h+1) + DSP(i+h, j-h) \} \\
\min_{0 \le h \le j-2} \{ d(i, i+j-1) + DSP(i, h+1) + DSP(i+h+1, j-h-1) \}
\end{cases}
$$

The minimum ordered spanning tree $MT$ can be converted into a duplication tree $MT^*$ of equal cost (Lemma 1) and this tree, while no worse than $TR^*$, is usually better (see Section 5). As we also report in Section 5, GREEDY-TRHIST-RESTRICTED produces a much better solution than either $TR^*$ or $MT^*$.

4.2 Lower bounds

Our crudest lower bound is character differences, $\sum_j (C_j - 1)$, where $C_j$ is the number of different characters in column $j$ of $M$. This bound implies that every pair of identical characters in a column can be merged at zero cost. Better bounds are possible for the fixed boundary, fixed duplication size problem. First, from the $2 \cdot OPT$ solution, $cost(SR)$ is easy to compute, so we have a simple lower bound of

$$cost(P) \ge cost(SR)/2.$$

Next, observe that in the restricted problem, the duplication tree for each column of the multiple alignment $M$ is identical. A single column algorithm, when applied to each column separately, provides a lower bound, referred to as independent columns. The minimum cost for a single column of $n$ characters can be found in $O(n^3)$ time by dynamic programming. The algorithm is given in the Appendix. A better lower bound can be obtained by computing the optimal cost for every subset of columns of size $r$ and then greedily choosing subsets whose joint column costs most exceed their independent costs. We have used this method, referred to as column pairs, with $r = 2$. The algorithm for a single column can be generalized to sets of $r$ columns in time $O(5^r n^3)$.

Length=60, Duplications=10

Cost Differences      p = .02        p = .03        p = .04
                      MEAN   STD     MEAN   STD     MEAN    STD
TR* - MT*             6.8    3.6     7.6    4.0     8.2     4.0
MT* - GREEDY          16.9   4.4     24.7   4.7     31.2    5.2
GREEDY - ColPair      1.0    1.3     1.3    1.4     1.3     1.5
Generating Cost       53.3   6.4     79.6   9.1     105.7   10.7

Table 1: Cost differences between the three solution methods, $TR^*$, $MT^*$, and GREEDY-TRHIST-RESTRICTED and the best lower bound, column pairs. Generating cost is the number of character changes during "mutation" in the simulation. $p$ is the probability of character mutation between duplications. GREEDY-TRHIST-RESTRICTED surpasses $MT^*$ by about 30% relative to the generating cost and is very close to the lower bound.

5 Simulation results

In our simulation tests, we show that the GREEDY algorithms perform very close to our best lower bounds and that GREEDY-TRHIST-RESTRICTED is much better than the algorithms based on duplication trees. Note that for the results presented here we did not explore the solution tree as described in Section 3.

Each simulation sequence started with a single randomly generated string of length $k$ ($k = 60$ for the results presented here; $k = 12, 25$ not shown, but similar). For the first duplication, the entire string was duplicated. In all subsequent duplications, a substring of length $k$ was chosen and duplicated. Every duplication, except the first, was preceded by "mutation" in which every character in the sequence could change to another character with probability $p$. Three values of $p$ were used, .02, .03, and .04. We chose these mutation rates to bracket (1) the contraction cost for a real 60bp pattern found in the human T cell locus sequence (not shown) and (2) the estimated average mutation rate between contractions observed in several real examples we analyzed, including the 60bp pattern and the 135bp patterns shown in Fig. 7. The number of character changes during mutation for one sequence is the generating cost. A history for each simulated sequence was constructed and the cost calculated. In the figures, costs are normalized by subtracting the generating cost.

TRHIST fixed boundary, fixed duplication size. For this problem, the left boundary of the duplicated substring was always at the first position of a copy (corresponding to column 1 of the multiple alignment). The best lower bound for this problem is column pairs. Of the three solution methods (Section 4), GREEDY-TRHIST-RESTRICTED is superior to the other two, surpassing $MT^*$, the next best method, by about 30% relative to the generating cost and giving solutions which are very close to the lower bound (Table 1, Fig. 6, left).

Unrestricted TRHIST problem. For this problem, the left boundary of the duplicated substring was unrestricted, i.e. it could have occurred in any column of the multiple alignment. GREEDY-TRHIST follows the same pattern of performance as GREEDY-TRHIST-RESTRICTED. The only lower bound that applies here is character differences, which is not as accurate as column pairs is for the restricted problem (Fig. 6, right).

[Figure 6 plots: cost minus generating cost versus mutation rate (2%, 3%, 4%); pattern width 60, 10 generations. Left panel: GREEDY-TRHIST-RESTRICTED, col pair, MT*, TR*. Right panel: GREEDY-TRHIST, char diff.]

Figure 6: Comparison of lower bound and solutions for TRHIST problems. Results are from 250 simulated sequences at each of three mutation rates. Scores were normalized by subtracting the generating cost. Left, fixed boundary, fixed duplication size problem. Col pair is best lower bound on solution. GREEDY-TRHIST-RESTRICTED solutions are very close to the lower bound and are much better than $TR^*$ or $MT^*$. See Table 1. Right, unrestricted TRHIST problem. Note that the right lower bound is less accurate than that used on left. Generating costs for both graphs are similar.

6 Data visualization

Recall from the discussion in the introduction that there can be uncertainty in the boundaries of the duplicated pattern. Fig. 7, top, is a graphical display of this uncertainty. The circles represent contractions produced by GREEDY-TRHIST on two distinct tandem repeats of a 135bp pattern containing 18 copies (left) and 14 copies (right) from chromosome 1 in yeast. Each ring represents one contraction. The shaded arc shows the possible left boundaries (columns in the multiple alignment $M$) of the contraction copies. The rows of the contraction copies are indicated in the bottom of the figure (see below). Reading clockwise from column 1 to column 135, any left boundary chosen from the shaded arc will give the same sequence after contraction. For example, in the left circle, second ring from the outside, the left boundary can be any of columns 88 to 103. Gray level signifies contraction cost.

Fig. 7, bottom, is a graphical display of the location of contractions in the histories. Numbers on the right indicate size of the contraction copy (1 = 1k, 2 = 2k). The rightmost possible boundaries for each contraction are shown. Notice that the history on the right is nearly identical to the history on the left except for the four contractions marked with an asterisk (*) and some minor reordering of the contractions. Indeed, the rightmost 12 copies in the sequence on the left are nearly identical to the leftmost 12 copies in the sequence on the right. The sequence on the right has somehow lost an older part of the tandem repeat. Both histories show a bias in the position of the duplication boundaries. We are currently analyzing simulations and other histories to determine if these biases are significant.

Figure 7: Graphical representation of duplication histories, (a) 18 copies of a 135bp pattern and, (b) 14 copies of a 135bp pattern, both from yeast chromosome 1. Each ring represents one contraction. Horizontal line in circle marks the first and last columns in multiple alignment. Shaded arcs indicate possible left boundary of duplication unit. Gray level signifies contraction cost. Triangular display shows position of contractions. Rightmost possible contraction boundaries are shown. Numbers on right indicate size of contraction copy (1 = 1k, 2 = 2k).

References

Bell, G., and Torney, D. 1993. Repetitive DNA sequences: some considerations for simple sequence repeats. Computers Chem. 17:185-190.
Charlesworth, B.; Sniegowski, P.; and Stephan, W. 1994. The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371:215-220.
Di Rienzo, A.; Peterson, A.; Garza, J.; Valdes, A.; Slatkin, M.; and Freimer, N. 1994. Mutational processes of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91:3166-3170.
Eichler, E.; Hammond, H.; Macpherson, J.; Ward, P.; and Nelson, D. 1995. Population survey of the human FMR1 CGG repeat substructure suggests biased polarity for the loss of AGG interruptions. Human Molecular Genetics 4:2199-2208.
Fitch, W. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406-416.
Jeffreys, A.; Tamaki, K.; MacLeod, A.; Monckton, D.; Neil, D.; and Armour, J. 1994. Complex gene conversion events in germline mutation at human minisatellites. Nature Genetics 6:136-145.
Kang, S.; Jaworski, A.; Ohshima, K.; and Wells, R. 1995. Expansion and deletion of CTG repeats from human disease genes are determined by the direction of replication in E. coli. Nature Genetics 10:213-218.
Kou, L.; Markowsky, G.; and Berman, L. 1981. A fast algorithm for the Steiner problem in graphs. Acta Inform. 15:141-145.
Kunst, C., and Warren, S. 1994. Cryptic and polar variation of the fragile X repeat could result in predisposing normal alleles. Cell 77:853-861.
Levinson, G., and Gutman, G. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Molec. Biol. Evol. 4:203-221.
Okumura, K.; Kiyama, R.; and Oishi, M. 1987. Sequence analyses of extrachromosomal Sau3A and related family DNA: analysis of recombination in the excision event. Nucleic Acids Res. 15:7477-7489.
Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42.
Sankoff, D., and Rousseau, P. 1975. Locating the vertices of a Steiner tree in an arbitrary metric space. Mathematical Programming 9:240-246.
Schlotterer, C., and Tautz, D. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211-215.
Smith, G. 1976. Evolution of repeated DNA sequences by unequal crossover. Science 191:528-535.
Strand, M.; Prolla, T.; Liskay, R.; and Petes, T. 1993. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365:274-276.
Weitzmann, M.; Woodford, K.; and Usdin, K. 1997. DNA secondary structures and the evolution of hypervariable tandem arrays. J. of Biological Chemistry 272:9517-9523.
Wells, R. 1996. Molecular basis of genetic instability of triplet repeats. J. of Biological Chemistry 271:2875-2878.
Zelikovsky, A. 1993. An 11/6-approximation algorithm for the network Steiner problem. Algorithmica 9:463-470.

7 Appendix

Single column minimum algorithm. We are given a column of characters from the alphabet $\Sigma$ = {A, C, G, T, -} on which to construct a duplication history tree. If a history tree were given, and we wanted to compute its cost, we would use a method first described by Fitch (Fitch 1971) and later proven correct by Sankoff (Sankoff 1975; Sankoff & Rousseau 1975). The method uses a cost vector $C_i$ of size $|\Sigma|$ assigned to each node $x_i$ of the tree. Each entry $C_i[\sigma]$, $\sigma \in \Sigma$, represents the minimum cost of the edges in the subtree rooted at $x_i$ including the edge to the parent of $x_i$ when the parent is labeled with the character $\sigma$. The root can be assumed to have a hypothetical parent. The recurrence for $C_i$ is as follows:

$$
C_i[\sigma] =
\begin{cases}
\begin{cases} 0 & \sigma = \sigma_i \\ 1 & \sigma \ne \sigma_i \end{cases} & \text{when leaf } x_i \text{ is labeled } \sigma_i, \\[4pt]
\min_{\sigma' \in \Sigma} \{ C_{j_1}[\sigma'] + \cdots + C_{j_k}[\sigma'] + d(\sigma, \sigma') \} & \text{when } x_i \text{ has children } x_{j_1}, \ldots, x_{j_k}.
\end{cases}
$$

Although the Fitch and Sankoff papers assumed that ... that each character may have its own minimizing tree.

Single column algorithm

Data structures:
    V[substring size, substring start, letter] = cost of subtree when root is letter (cost vectors).
    T[partition, letter] = cost of subtree defined by partition when root is letter.

input(Column[1...n])
for l = 1, ..., n                                  (initialize leaf cost vectors)
    for σ ∈ {A, C, G, T, -}
        if σ == Column[l]  V[1, l, σ] = 0;
        else               V[1, l, σ] = 1;
(compute minimum cost vectors)
for k = 2, ..., n                                  (size of substring)
    for l = 1, ..., n - k + 1                      (start of substring)
        for d = 1, ..., k - 1                      (partition)
            (get cost vector for each partition position)
            for σ ∈ {A, C, G, T, -}                (characters)
                T[d, σ] = V[d, l, σ] + V[k - d, l + d, σ];
            m = min over σ of T[d, σ];
            for σ ∈ {A, C, G, T, -}
                if T[d, σ] ≠ m  T[d, σ] = m + 1;
        (get overall cost vector)
        for σ ∈ {A, C, G, T, -}
            V[k, l, σ] = min over d = 1, ..., k - 1 of T[d, σ];
(get minimum cost for optimal tree)
m = min over σ of V[n, 1, σ];
return(m);                                         (return minimum from root)

Figure 8: Algorithm for optimal solution for TRHIST, single column, fixed duplication size, binary contractions.
the tree is given, the metho d can also b e used to deter- Extension to r columns. The preceding algorithm mine the minimum cost at each no de of the tree as the can b e generalized for r columns as follows. Instead of tree is b eing constructed from the b ottom up. nding a single column vector for each substring, we 3 Fig. 8 is pseudo co de for a O n algorithm for nding instead nd an r -dimensional array, one dimension for r the cost of the optimal tree with binary contractions. each column. The size of this arrayis5 . The values in It builds the tree from the b ottom up, working with the array represent the combined costs of the same tree substrings of the column of size k =2:::n.For each on each of the r substrings for every letter combination r 3 substring, an optimal binary tree connecting the char- at the r ro ots. Time complexityis O 5 n whichis acters in the substring is determined. As ab ove, each exp onential in the numb er of columns. substring is assigned a cost vector with the costs of the leaves xed. Cost vectors are stored in an array V in- dexed by the substring length and starting p osition. When k = 2 the cost vector for adjacentcharacters is determined by the metho d describ ed ab ove. When k 3, the optimal tree for the substring will have a left and a right subtree. There are k 1 p ossible lo cations in the substring for a partition b etween the subtrees and each is tried in turn. One cost vector p er partition p osition is determined and a nal cost vector for the substring is obtained by nding for eachcharacter its minimum over all the xed partition costs. Note that this implies 10
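The single-column algorithm of Fig. 8 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `single_column_min_cost`, the constant `ALPHABET`, the use of `-` as the fifth (gap) symbol, and the 0-based indexing are all assumptions made for the example. The update `min(T[s], m + 1)` is equivalent to the figure's rule "if T[d, σ] ≠ m then T[d, σ] = m + 1" because all costs are integers.

```python
ALPHABET = "ACGT-"  # assumed 5-letter alphabet: four bases plus a gap symbol


def single_column_min_cost(column: str) -> int:
    """Minimum cost of a binary duplication-history tree for one column."""
    n = len(column)
    if n == 0:
        raise ValueError("column must be non-empty")
    if any(ch not in ALPHABET for ch in column):
        raise ValueError("column contains a character outside ALPHABET")

    # V[(k, l)][s]: cost of the best subtree over column[l:l+k], including
    # the edge to its parent, when the parent is labeled s (l is 0-based).
    V = {}
    for l, ch in enumerate(column):
        V[(1, l)] = {s: 0 if s == ch else 1 for s in ALPHABET}

    for k in range(2, n + 1):           # substring size
        for l in range(n - k + 1):      # substring start
            best = {s: float("inf") for s in ALPHABET}
            for d in range(1, k):       # partition: left part has size d
                left, right = V[(d, l)], V[(k - d, l + d)]
                T = {s: left[s] + right[s] for s in ALPHABET}
                m = min(T.values())
                for s in ALPHABET:
                    # Either the node keeps the parent's letter (cost T[s]),
                    # or it takes the cheapest letter and pays one
                    # substitution on the edge to the parent (cost m + 1).
                    best[s] = min(best[s], T[s], m + 1)
            V[(k, l)] = best

    # The root's hypothetical parent may carry any letter; take the cheapest.
    return min(V[(n, 0)].values())
```

For example, `single_column_min_cost("AAGG")` evaluates to 1, while `single_column_min_cost("AGAG")` evaluates to 2: because subtrees must cover contiguous copies, the alternating letters cannot be grouped with a single substitution. The dictionary keyed by `(k, l)` mirrors the array V of the figure; a dense array would be the natural choice for long columns.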