The Nadeau-Taylor Chromosomal Breakage Model Revisited

Transforming Men into Mice: the Nadeau-Taylor Chromosomal Breakage Model Revisited

Pavel Pevzner
Department of Computer Science, University of California, San Diego, La Jolla, CA 92093-0114
[email protected]

Glenn Tesler
Department of Computer Science, University of California, San Diego, La Jolla, CA 92093-0114
[email protected]

ABSTRACT

Although analysis of genome rearrangements was pioneered by Dobzhansky and Sturtevant 65 years ago, we still know very little about the rearrangement events that produced the existing varieties of genomic architectures. The genomic sequences of human and mouse provide evidence for a larger number of rearrangements than previously thought and shed some light on previously unknown features of mammalian evolution. In particular, they reveal extensive re-use of breakpoints from the same relatively short regions. Our analysis implies the existence of a large number of very short “hidden” synteny blocks that were invisible in comparative mapping data and were not taken into account in previous studies of chromosome evolution. These blocks are defined by closely located breakpoints and are often hard to detect. Our result is in conflict with the widely accepted random breakage model of chromosomal evolution. We suggest a new “fragile breakage” model of chromosome evolution that postulates that breakpoints are chosen from relatively short fragile regions that have much higher propensity for rearrangements than the rest of the genome.

Categories and Subject Descriptors

J.3 [Life and Medical Sciences]: Biology and Genetics

General Terms

Algorithms, Theory

Keywords

breakpoint re-use, evolution, genome rearrangements

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RECOMB’03, April 10–13, 2003, Berlin, Germany.
Copyright 2003 ACM 1-58113-635-8/03/0004 ...$5.00.

1. INTRODUCTION

Analysis of genome rearrangements in molecular evolution was pioneered by Dobzhansky and Sturtevant, 1938 [10], who introduced the notion of a breakpoint (disruption of gene order) and published a milestone paper with a rearrangement scenario for the species D. pseudoobscura and D. miranda. Every genome rearrangement study involves solving a combinatorial puzzle to find a series of genome rearrangements to transform one genome into another. The problem of finding the minimum number of reversals to transform one unichromosomal genome into another is called the reversal distance problem.

Kececioglu and Sankoff, 1993 [17] were the first to recognize the importance of combinatorial dependencies between different breakpoints and to come up with an approximation algorithm for the reversal distance problem. Hannenhalli and Pevzner, 1995, 1999 [14] developed a polynomial algorithm for the reversal distance problem and further extended it to the genomic distance problem, i.e., finding a most parsimonious scenario for multichromosomal genomes under reversals, translocations, fusions, and fissions of chromosomes [13, 43]. Later on, the Hannenhalli-Pevzner algorithm was further optimized and extended for other applications in [5, 16, 11, 1, 4].

Even before Sankoff and colleagues introduced the combinatorial approach to rearrangement studies, Nadeau and Taylor pioneered a statistical approach. In a landmark paper, Nadeau and Taylor, 1984 [29] introduced the notion of conserved segments (segments with preserved gene orders) and estimated that there are roughly 180 conserved segments in human and mouse. In the same paper they provided convincing arguments in favor of the random breakage model of genomic evolution postulated by Ohno, 1973 [30]. The model assumes a random (i.e., uniform and independent) distribution of chromosome rearrangement breakpoints and is supported by the observation that the lengths of synteny blocks shared by human and mouse are well fitted by the predicted distribution imposed by the random breakage model.

Since the model was first introduced in [30], it has been analyzed by Nadeau and others [29, 27, 28, 39], and has become widely accepted. It was further supported by studies of significantly larger datasets that confirmed that newly discovered synteny blocks still fit the predicted exponential distribution very well [26, 9, 20, 24, 33]. These studies, with progressively increasing levels of resolution, transformed the random breakage model into the de facto theory of chromosome evolution.

The arguments in favor of the random breakage model usually proceed as follows. One first constructs the distribution of lengths of conserved segments and fits the resulting histogram with the theoretical distribution predicted by the random breakage model. An important implication of this model is that the segment lengths approximate an exponential distribution with density function $f(x) = \frac{1}{L} e^{-x/L}$, where L is the average length of all segments. Technically, Nadeau and Taylor, 1984 [29] did not have information about all segments since most of them were still undiscovered in 1984. However, they were able to estimate L (and therefore the number of still undiscovered segments) from the small set of already discovered segments. The relatively small departure from an exponential distribution was attributed to missing information about some conserved segments. Of course, there was always a danger that newly discovered segments would shift this estimate and even deviate from the exponential distribution predicted by the model. However, this did not happen in the past, and the random breakage model was reinforced in a number of influential studies in the last decade. As a result, the Nadeau-Taylor predictions are viewed as among the most significant results in “...the history and development of the mouse as a research tool” (Pennisi, 2000 [31]).

There is a conceptual difference between the Nadeau and Taylor, 1984 [29] statistical approach to studies of chromosomal history (that is not concerned with the details of rearrangement history) and the combinatorial approach that attempts to infer the rearrangement scenario. Sankoff, 1999 [34] raised the problem of integrating these approaches, which had not been done before. In this paper we attempt to combine the statistical and combinatorial approaches and demonstrate that the combined analysis reveals evidence against the random breakage model.

The draft human and mouse sequences reveal many previously undiscovered synteny blocks and put the random breakage model to a new test. In particular, they reveal 281 synteny blocks shared by human and mouse of size at least 1 Mb (Pevzner and Tesler, 2003 [32]). Although the number of synteny blocks is higher than the Nadeau-Taylor predictions, the lengths of the blocks still fit the exponential distribution (Fig. 1a), another argument in favor of the random breakage model. However, a different type of evidence derived from genome rearrangement studies reveals an unexpectedly large number of closely located breakpoints that cannot be explained by the random breakage model. This analysis implies that in addition to the segments shown in Fig. 1a, there are another 190 “short” synteny blocks, typically below 1 Mb in length. These blocks were never discovered in the comparative mapping studies, and moreover, most of them are hard to find even with available human and mouse sequences. The existence of these blocks immediately implies that an exponential distribution is not a good fit to reality (Fig. 1b). In other words, rearrangement analysis of the human and mouse genomes reveals clumps of closely located breakpoints that cannot be explained by the random breakage model.

[Figure 1 panels: (a) Lengths of synteny blocks (w/o hidden blocks); (b) Lengths of synteny blocks (with hidden blocks). Horizontal axis: block length (Mb); vertical axis: frequency.]

Figure 1: (a) Histogram of synteny block lengths in human for 281 synteny blocks of length at least 1 Mb, fitted by an exponential distribution with mean block length L = 9.6 Mb. (b) The same histogram superimposed with the 190 “hidden” synteny blocks revealed by genome rearrangement analysis, under the assumption that all “hidden” blocks are short, i.e., less than 1 Mb in length.

The surprisingly large number of breakpoint clumps is an argument in favor of a different model of chromosome evolution that we call the fragile breakage model. This model postulates that the breakpoints mainly occur within relatively short fragile regions (hot spots of rearrangements). The existence of some fragile regions at the population level was supported by previous studies of cancer and infertility [8, 36], but the extent of this phenomenon in molecular evolution became clear only after the human and mouse DNA sequences became available. If one assumes the fragile regions are uniformly distributed through the genome, then the fragile and random breakage models lead to identical estimates for the number of long segments (e.g., segments longer than 1–2 Mb). In some sense, the random breakage model can be viewed as an excellent null hypothesis for a certain level of resolution and genome heterogeneities. However, the random breakage and fragile breakage models generate very different predictions when it comes to short segments that were below the granularity level of previous comparative mapping studies.

2. SYNTENY BLOCKS

DNA sequences provide evidence that the human and mouse genomes are significantly more rearranged than previously thought. Moreover, they indicate that a large proportion of previously identified conserved segments are not really conserved since there is evidence of multiple micro-rearrangements (Mural et al., 2002 [24]).
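
The clash between the exponential fit and the 190 hidden blocks described in the introduction can be made concrete with a back-of-the-envelope calculation. The Python sketch below is an illustration only, not the authors' analysis. It assumes that the fitted mean L = 9.6 Mb from Figure 1 describes the whole block-length distribution (short blocks included), and the function names are hypothetical. Under that assumption it estimates how many blocks shorter than 1 Mb the random breakage model would predict alongside the 281 observed blocks of at least 1 Mb.

```python
import math


def exp_density(x, mean_len):
    """Density f(x) = (1/L) * exp(-x/L) of an exponential with mean L."""
    return math.exp(-x / mean_len) / mean_len


def fraction_shorter_than(cutoff, mean_len):
    """P(X < cutoff) for an exponential block length X with mean L."""
    return 1.0 - math.exp(-cutoff / mean_len)


def expected_short_blocks(n_long, cutoff, mean_len):
    """Expected number of blocks below `cutoff`, given `n_long` blocks
    at or above it, if all lengths follow one exponential distribution."""
    p_short = fraction_shorter_than(cutoff, mean_len)
    # n_long = N * (1 - p_short), so the short blocks number N * p_short.
    return n_long * p_short / (1.0 - p_short)


# Values quoted in the text; applying L to all blocks is an assumption.
MEAN_LEN_MB = 9.6   # fitted mean block length (Figure 1a)
N_LONG = 281        # synteny blocks of length at least 1 Mb
CUTOFF_MB = 1.0     # resolution threshold used in the text
N_HIDDEN = 190      # hidden short blocks implied by rearrangement analysis

expected = expected_short_blocks(N_LONG, CUTOFF_MB, MEAN_LEN_MB)
print(f"short blocks expected under random breakage: {expected:.1f}")
print(f"hidden short blocks reported in the text:    {N_HIDDEN}")
```

Under these assumptions the exponential model leaves room for only about 31 blocks below 1 Mb, several times fewer than the 190 hidden blocks, which is the qualitative discrepancy visible in Figure 1b.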
