Reconstructing the Duplication History of a Tandem Repeat
Gary Benson and Lan Dong
Department of Biomathematical Sciences
The Mount Sinai School of Medicine
New York, NY 10029-6574

* Partially supported by NSF grant CCR-9623532 and a 1997 grant from the German Academic Exchange Service (DAAD).
† Partially supported by NSF grant CCR-9623532.

Abstract

One of the less well understood mutational transformations that act upon DNA is tandem duplication. In this process, a stretch of DNA is duplicated to produce two or more adjacent copies, resulting in a tandem repeat. Over time, the copies undergo additional mutations so that typically, multiple approximate tandem copies are present. An interesting feature of tandem repeats is that the duplicated copies are preserved together, making it possible to do "phylogenetic analysis" on a single sequence. This involves using the pattern of mutations among the copies to determine a minimal or a most likely history for the repeat. A history tries to describe the interwoven pattern of substitutions, indels, and duplication events in such a way as to minimize the number of identical mutations that arise independently. Because the copies are adjacent and ordered, the history problem cannot be solved by standard phylogeny algorithms. In this paper, we introduce several versions of the tandem repeat history problem, develop algorithmic solutions and evaluate their performance. We also develop ways to visualize important features of a history with the goal of discovering properties of the duplication mechanism.

Keywords: tandem repeats, phylogeny algorithms

1 Introduction

One of the less well understood mutational processes for DNA molecules is tandem duplication, in which a stretch of DNA is transformed into two or more adjacent copies. The following illustrates a tandem duplication in which the single occurrence of triplet CGG is transformed into three identical, adjacent copies.

    ...T CGG A...  →  ...T CGG CGG CGG A...

The result of a tandem duplication is termed a tandem repeat. Over time, individual copies within a tandem repeat may undergo additional, uncoordinated mutations, including new tandem duplications, so that typically, multiple approximate tandem copies are present.

Examination of a tandem repeat often suggests that the sequence was produced by a series of tandem duplications interspersed with point mutations. The real biological sequence shown in Fig. 1 is a typical example. It consists of 16 copies of an 8 nucleotide pattern. Copies are numbered and spaces inserted between the copies for clarity. A consensus for these 16 copies is AAACTTAG. Asterisks (*) flag differences between the copies and the consensus.

Careful observation reveals that G is periodically substituted for A. Such substitutions are unlikely to occur independently. It is more likely that a single common ancestor pattern is responsible for the A to G substitutions through duplication. Perhaps an 8 character unit, say AAACTTAG, was first duplicated and then mutated to AGACTTAG, and then the two copies were duplicated as a single 16 character unit AAACTTAGAGACTTAG. When the second through thirteenth copies are viewed in this way, 5 of the A to G substitutions are accounted for. Further observation suggests that the two starred Ts may have been the result of another duplication.

                *  *                 *                 *
AAACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AAACTTAG AGACTTAG AAACTTAG
1        2        3        4        5        6        7        8
 *              *  *                 *        *   *               *
AGACTTAG AAACTTAT AGACTTAG AAACTTAG AGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
9        10       11       12       13       14       15       16
--------------------------------------------------------------------------------
                *
AAACTTAG AAACTTATAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG AAACTTAGAGACTTAG
1        2       3        4       5        6       7        8       9
       *                           *   *               *
AAACTTATAGACTTAG AAACTTAGAGACTTAG AGACTCAG AAACTTAG AAAGCTTAG
10      11       12      13       14       15       16

Figure 1: Top: Periodic nucleotide substitutions in a tandem repeat suggest a common ancestor. Bottom: Five of the A to G substitutions may be accounted for by a single A to G substitution followed by duplication.

Tandem repeats are different from other types of duplicated sequences because the child copies of duplication are adjacent on the same sequence. This difference leads to complications in determining the parent copy of duplication. See Fig. 2.

Boundaries. It is not always possible to distinguish the boundaries of a duplicated pattern. Consider the two examples below in which a duplication changes three identical copies of ABCD into four identical copies. Although the boundaries of the duplicated patterns (underlined) differ, the results are the same.

    ABCD ABCD ABCD  →  ABCD ABCD ABCD ABCD
    ABCD ABCD ABCD  →  ABCD ABCD ABCD ABCD

Mutations add information. In the next example, the second copy of ABCD has been mutated to AXCY. Now, different duplication boundaries give different results.

    ABCD AXCY ABCD  →  ABCD AXCY AXCY ABCD
    ABCD AXCY ABCD  →  ABCD AXCY ABCY ABCD

Note that the boundaries are still not completely determined in the latter two cases. The pattern in both could be shifted one character to the right and give the same results. We present a way to display this uncertainty in Section 6.

Duplication size. The size of the duplication unit can be any multiple of the basic pattern size. In the example below, four copies of a pattern of size 4 are changed into six copies by duplicating the middle 8 characters. Again, mutations in the original copies can help distinguish the size of the duplication unit from other possibilities.

    ABCD AXCY ABCZ ABCD  →  ABCD AXCY ABCZ AXCY ABCZ ABCD

Several mechanisms have been proposed for the production of tandem repeats, including replication slippage and unequal crossing over (Wells 1996; Levinson & Gutman 1987; Schlotterer & Tautz 1992; Okumura, Kiyama, & Oishi 1987; Smith 1976). Biological studies (Strand et al. 1993; Weitzmann, Woodford, & Usdin 1997) have already provided support for one or the other of the mechanisms. Mathematical modeling has suggested mechanistic characteristics. For example, in (Di Rienzo et al. 1994) accurate modeling of copy number variation at a polymorphic dinucleotide repeat locus has been obtained with a two-phase model which assumes predominantly single copy changes with rare multi-copy changes. In (Bell & Torney 1993) comparison of estimated rates of unequal crossing over and observed rates of microsatellite mutation leads to the conclusion that slipped strand mispairing is the major cause of length polymorphism in microsatellites. In (Charlesworth, Sniegowski, & Stephan 1994), modeling and simulation suggest that very low recombination rates (unequal crossing over) can result in very large copy number and higher order repeats.

Many unresolved questions can be asked about the mechanism of tandem duplication, among them: 1) Is the boundary of the duplication unit unique, is it confined to a few locations or is it seemingly unrestricted? 2) Is the duplication unit size unique, does it vary in a small range or is it unrestricted? Does pattern size affect the variability of duplication unit size? 3) Does duplication occur preferentially at one end or the other of the repeat or preferentially on the leading or lagging strand during replication (Kunst & Warren 1994; Kang et al. 1995; Eichler et al. 1995)?

Answers to these questions may suggest the presence of conformational structures, either within or adjacent to the tandem repeat (Jeffreys et al. 1994), which trigger duplication, or may indicate that different mechanisms act on patterns of different sizes. An extensive analysis of the histories of many tandem repeats can provide data to support one or the other of the theoretical models and may reveal new mechanistic features not already anticipated. Additionally, comparison of related tandem repeats in different sequences could resolve important questions regarding evolution or mutation over short time scales. Such a capability would open up new opportunities to address questions of evolution and ancestry, including the study of human migration, rapid evolution of bacterial diseases, and the cascade of mutations that lead to cancer. With these purposes in mind, we have begun the development of algorithms to [...]

[...] copy of a pattern with two or more adjacent, identical copies. A contraction is an algorithmic operation in which two or more adjacent, equal length substrings (the contraction copies) of a string are replaced by a single substring (the merged copy). A contraction can be thought of as the opposite of a tandem duplication. A binary contraction replaces two contraction copies with a merged copy. A many-to-one contraction replaces two or more contraction copies with a merged copy. A contraction copy is some substring of the multiple alignment M with length a multiple of k. For the purposes of contraction, each position in M is treated as a character set which is some subset of the alphabet {A, C, G, T, -}. An ambiguous character set is a character set which is not a singleton set, e.g., {A, G} is ambiguous but {A} is not. The original multiple alignment M contains no [...]

Figure 2: A tandem repeat history. Ancestral pattern sequence is at the top.
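The duplication and contraction operations described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the 0-based `start` and `unit_len` parameters, and in particular the rule used to build the merged copy (keep the intersection of the two character sets when it is non-empty, otherwise their union, as in Fitch-style parsimony) are not taken from the text, which defines a contraction without fixing here how the merged copy is computed.

```python
from typing import FrozenSet, List

CharSet = FrozenSet[str]  # one alignment position, e.g. frozenset({"A", "G"})


def tandem_duplicate(seq: str, start: int, unit_len: int, copies: int = 2) -> str:
    """Replace seq[start:start+unit_len] by `copies` adjacent copies of itself."""
    if copies < 2:
        raise ValueError("a tandem duplication yields at least two copies")
    if start < 0 or unit_len <= 0 or start + unit_len > len(seq):
        raise ValueError("duplication unit must lie inside the sequence")
    unit = seq[start:start + unit_len]
    return seq[:start] + unit * copies + seq[start + unit_len:]


def to_char_sets(seq: str) -> List[CharSet]:
    """Turn a plain sequence into a list of singleton character sets."""
    return [frozenset(ch) for ch in seq]


def is_ambiguous(char_set: CharSet) -> bool:
    """A character set is ambiguous when it is not a singleton."""
    return len(char_set) != 1


def merge_position(a: CharSet, b: CharSet) -> CharSet:
    """ASSUMED merge rule (Fitch-style): intersection if non-empty, else union."""
    common = a & b
    return common if common else a | b


def binary_contract(sets: List[CharSet], start: int, unit_len: int) -> List[CharSet]:
    """Replace two adjacent contraction copies of length unit_len by one merged copy."""
    end = start + 2 * unit_len
    if start < 0 or unit_len <= 0 or end > len(sets):
        raise ValueError("both contraction copies must lie inside the sequence")
    left = sets[start:start + unit_len]
    right = sets[start + unit_len:end]
    merged = [merge_position(a, b) for a, b in zip(left, right)]
    return sets[:start] + merged + sets[end:]


if __name__ == "__main__":
    # The CGG illustration from the introduction: one copy becomes three.
    print(tandem_duplicate("TCGGA", 1, 3, copies=3))  # TCGGCGGCGGA

    # Contracting the 16 character unit AAACTTAG AGACTTAG into one 8 character copy.
    merged = binary_contract(to_char_sets("AAACTTAGAGACTTAG"), 0, 8)
    print(["".join(sorted(s)) for s in merged])
```

With these definitions, duplicating offsets 4 and 6 of ABCDABCDABCD (unit length 4) gives the same string, while the same two boundaries applied to ABCDAXCYABCD give different strings, which mirrors the point that mutations add information about duplication boundaries. The contraction example leaves an ambiguous set {A, G} at the second position of the merged copy.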