Transforming Men into Mice: the Nadeau-Taylor Chromosomal Breakage Model Revisited

Pavel Pevzner and Glenn Tesler
Department of Computer Science, University of California, San Diego, La Jolla, CA 92093-0114

ABSTRACT
Although analysis of genome rearrangements was pioneered by Dobzhansky and Sturtevant 65 years ago, we still know very little about the rearrangement events that produced the existing varieties of genomic architectures. The genomic sequences of human and mouse provide evidence for a larger number of rearrangements than previously thought and shed some light on previously unknown features of mammalian evolution. In particular, they reveal extensive re-use of breakpoints from the same relatively short regions. Our analysis implies the existence of a large number of very short "hidden" synteny blocks that were invisible in comparative mapping data and were not taken into account in previous studies of chromosome evolution. These blocks are defined by closely located breakpoints and are often hard to detect. Our result is in conflict with the widely accepted random breakage model of chromosomal evolution. We suggest a new "fragile breakage" model of chromosome evolution that postulates that breakpoints are chosen from relatively short fragile regions that have much higher propensity for rearrangements than the rest of the genome.

Categories and Subject Descriptors
J.3 [Life and Medical Sciences]: Biology and Genetics

General Terms
Algorithms, Theory

Keywords
breakpoint re-use, evolution, genome rearrangements

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RECOMB'03, April 10–13, 2003, Berlin, Germany. Copyright 2003 ACM 1-58113-635-8/03/0004 ...$5.00.

1. INTRODUCTION
Analysis of genome rearrangements in molecular evolution was pioneered by Dobzhansky and Sturtevant, 1938 [10], who introduced the notion of a breakpoint (disruption of gene order) and published a milestone paper with a rearrangement scenario for the species D. pseudoobscura and D. miranda. Every genome rearrangement study involves solving a combinatorial puzzle to find a series of genome rearrangements to transform one genome into another. The problem of finding the minimum number of reversals to transform one unichromosomal genome into another is called the reversal distance problem.

Kececioglu and Sankoff, 1993 [17] were the first to recognize the importance of combinatorial dependencies between different breakpoints and to come up with an approximation algorithm for the reversal distance problem. Hannenhalli and Pevzner, 1995, 1999 [14] developed a polynomial algorithm for the reversal distance problem and further extended it to the genomic distance problem, i.e., finding a most parsimonious scenario for multichromosomal genomes under reversals, translocations, fusions, and fissions of chromosomes [13, 43]. Later on, the Hannenhalli-Pevzner algorithm was further optimized and extended for other applications in [5, 16, 11, 1, 4].

Even before Sankoff and colleagues introduced the combinatorial approach to rearrangement studies, Nadeau and Taylor pioneered a statistical approach. In a landmark paper, Nadeau and Taylor, 1984 [29] introduced the notion of conserved segments (segments with preserved gene orders) and estimated that there are roughly 180 conserved segments in human and mouse. In the same paper they provided convincing arguments in favor of the random breakage model of genomic evolution postulated by Ohno, 1973 [30]. The model assumes a random (i.e., uniform and independent) distribution of chromosome rearrangement breakpoints and is supported by the observation that the lengths of synteny blocks shared by human and mouse are well fitted by the predicted distribution imposed by the random breakage model. Since the model was first introduced in [30], it has been analyzed by Nadeau and others [29, 27, 28, 39], and has become widely accepted. It was further supported by studies of significantly larger datasets that confirmed that newly discovered synteny blocks still fit the predicted exponential distribution very well [26, 9, 20, 24, 33]. These studies, with progressively increasing levels of resolution, transformed the random breakage model into the de facto theory of chromosome evolution.

The arguments in favor of the random breakage model usually proceed as follows. One first constructs the distribution of lengths of conserved segments and fits the resulting histogram with the theoretical distribution predicted by the
random breakage model. An important implication of this model is that the segment lengths approximate an exponential distribution with density function $f(x) = \frac{1}{L} e^{-x/L}$, where $L$ is the average length of all segments. Technically, Nadeau and Taylor, 1984 [29] did not have information about all segments since most of them were still undiscovered in 1984. However, they were able to estimate $L$ (and therefore the number of still undiscovered segments) from the small set of already discovered segments. The relatively small departure from an exponential distribution was attributed to missing information about some conserved segments. Of course, there was always a danger that newly discovered segments would shift this estimate and even deviate from the exponential distribution predicted by the model (footnote 1). However, this did not happen in the past, and the random breakage model was reinforced in a number of influential studies in the last decade. As a result, the Nadeau-Taylor predictions are viewed as among the most significant results in "...the history and development of the mouse as a research tool" (Pennisi, 2000 [31]).

Footnote 1: Sankoff et al., 1997, 2000 [38, 37] analyzed the accuracy of the random breakage model and further reinforced it despite some deviations, particularly for short conserved segments. These deviations are often attributed to mapping errors, statistical noise or frequent short inversions.

There is a conceptual difference between the Nadeau and Taylor, 1984 [29] statistical approach to studies of chromosomal history (that is not concerned with the details of rearrangement history) and the combinatorial approach that attempts to infer the rearrangement scenario. Sankoff, 1999 [34] raised the problem of integrating these approaches, which had not been done before. In this paper we attempt to combine the statistical and combinatorial approaches and demonstrate that the combined analysis reveals evidence against the random breakage model.

The draft human and mouse sequences reveal many previously undiscovered synteny blocks and put the random breakage model to a new test. In particular, they reveal 281 synteny blocks shared by human and mouse of size at least 1 Mb (Pevzner and Tesler, 2003 [32]). Although the number of synteny blocks is higher than the Nadeau-Taylor predictions, the lengths of the blocks still fit the exponential distribution (Fig. 1a), another argument in favor of the random breakage model. However, a different type of evidence derived from genome rearrangement studies reveals an unexpectedly large number of closely located breakpoints that cannot be explained by the random breakage model. This analysis implies that in addition to the segments shown in Fig. 1a, there are another 190 "short" synteny blocks, typically below 1 Mb in length. These blocks were never discovered in the comparative mapping studies, and moreover, most of them are hard to find even with available human and mouse sequences (footnote 2). The existence of these blocks immediately implies that an exponential distribution is not a good fit to reality (Fig. 1b). In other words, rearrangement analysis of the human and mouse genomes reveals clumps of closely located breakpoints that cannot be explained by the random breakage model.

Footnote 2: If the breakpoints are located very close to each other (e.g., within a few nucleotides or even at the same position), the corresponding very short blocks may be undetectable by alignment analysis. Moreover, some short blocks may be deleted in the course of evolution. However, rearrangement analysis confirms the existence of such breakpoints, even in the absence of statistically significant sequence alignments.

[Figure 1 panels: (a) Lengths of synteny blocks (w/o hidden blocks); (b) Lengths of synteny blocks (with hidden blocks). Horizontal axis: Block length (Mb); vertical axis: Frequency.]

Figure 1: (a) Histogram of synteny block lengths in human for 281 synteny blocks of length at least 1 Mb, fitted by an exponential distribution with mean block length $L = 9.6$ Mb. (b) The same histogram superimposed with the 190 "hidden" synteny blocks revealed by genome rearrangement analysis, under the assumption that all "hidden" blocks are short, i.e., less than 1 Mb in length.

The surprisingly large number of breakpoint clumps is an argument in favor of a different model of chromosome evolution that we call the fragile breakage model. This model postulates that the breakpoints mainly occur within relatively short fragile regions (hot spots of rearrangements). The existence of some fragile regions at the population level was supported by previous studies of cancer and infertility [8, 36], but the extent of this phenomenon in molecular evolution became clear only after the human and mouse DNA sequences became available. If one assumes the fragile regions are uniformly distributed through the genome then the fragile and random breakage models lead to identical estimates for the number of long segments (e.g., segments longer than 1–2 Mb). In some sense, the random breakage model can be viewed as an excellent null hypothesis for a certain level of resolution and genome heterogeneities. However, the random breakage and fragile breakage models generate very different predictions when it comes to short segments that were below the granularity level of previous comparative mapping studies.

2. SYNTENY BLOCKS
DNA sequences provide evidence that the human and mouse genomes are significantly more rearranged than previously thought. Moreover, they indicate that a large proportion of previously identified conserved segments are not really conserved since there is evidence of multiple micro-rearrangements (Mural et al., 2002 [24]). These micro-rearrangements were not visible in the comparative genetic maps used for defining ≈ 180 conserved segments in the past. We study synteny blocks (segments that can be converted into conserved segments by micro-rearrangements) instead of conserved segments. The synteny blocks do not necessarily represent areas of continuous similarity between genomes. Instead, they consist of short regions of similarity that may be interrupted by non-similar regions and gaps.

Human and mouse genomes share 281 synteny blocks of size at least 1 Mb (Pevzner and Tesler, 2003 [32]). There is evidence of at least 3170 micro-rearrangements (reversals) within the synteny blocks (though many may be artifacts of
incorrect assemblies). There is a large variation in the rate of micro-rearrangements along the genomes: 41 out of 281 synteny blocks do not show any evidence of micro-rearrangements, while 10 synteny blocks are extremely rearranged (40 or more rearrangements within a block).

Given two genomic sequences, how can one construct synteny blocks? False ortholog assignments and micro-rearrangements make it non-trivial to find the analogs of synteny blocks (conserved gene clusters) even in shorter bacterial genomes. In addition, human-mouse sequence similarities in non-coding regions [19, 45] may further complicate ortholog assignments and make it difficult to apply the methods developed for bacterial genomes to the construction of human-mouse synteny blocks.

Sankoff et al., 1997 [35] were the first to develop an algorithm for synteny block generation. However, their approach was mainly intended for comparative mapping data and ignored some aspects of sequencing data. Pevzner and Tesler, 2003 [32], described a different approach that is geared toward genomic sequences. To construct the human-mouse synteny blocks, we start with bidirectional best local similarities (also called anchors) between human and mouse genomic sequences [41, 24]. A number of software tools have recently become available to generate such anchors for entire mammalian genomes [22, 40, 21, 18]. We assume that a set of non-overlapping anchors is given and the goal is to construct the synteny blocks based on these anchors.

We concatenate human and mouse chromosomes to form a single coordinate system. Anchors are viewed as diagonals in 2-D and each point on these diagonals is described by its coordinates $(h, m)$ in human and mouse. We define the distances between points in the resulting genomic dot-plot as follows. The distance between two points $(h_1, m_1)$ and $(h_2, m_2)$ from the same chromosome pair (the same rectangle) is the Manhattan distance $|h_2 - h_1| + |m_2 - m_1|$. The distance between points from different chromosome pairs is defined as infinity. The distance between two anchors is defined as the distance between their closest ends.

Some anchors may look like isolated points (or "small clusters") in a genomic dot-plot, while synteny blocks will be formed from clusters consisting of a larger number of points. Fig. 2a presents the genomic dot-plot for anchors from the X chromosomes. They form 15 clusters (Fig. 2b). Fig. 2c presents rectified clusters that ignore the details of the internal anchor arrangements in the clusters and represent every cluster as a diagonal. These rectified clusters are further combined into diagonals that correspond to 11 synteny blocks (Fig. 2d). Fig. 2e is a symbolic representation of synteny blocks as units of the same size, used in the construction of the breakpoint graph.

Pevzner and Tesler, 2003 [32], described the GRIMM-Synteny algorithm for synteny block generation from a collection of anchors. The algorithm uses the maximum gap size $G$ and minimum cluster size $C$ as parameters:

GRIMM-Synteny algorithm:
(1) Form an anchor graph whose vertices are the anchors.
(2) Connect vertices in the anchor graph by an edge if the distance between them is smaller than $G$.
(3) Define clusters as connected components in the graph.
(4) Delete "small" clusters (length ≤ $C$).
(5) Determine the cluster order and signs for each genome.
(6) Output the strips in the resulting cluster order as synteny blocks.

[Figure 3 panels: two scenarios, each transforming the mouse order 1 −7 6 −10 9 −8 2 −11 −3 5 4 into the human order 1 2 3 4 5 6 7 8 9 10 11.]

Figure 3: Two different most parsimonious rearrangement scenarios for human and mouse X chromosomes. Breakpoint region uses are shown as short vertical lines, and re-uses are shown as double lines. Both scenarios have 8 breakpoint region uses and 3 breakpoint region re-uses.

We define the span of a cluster in human (mouse) as the interval between its minimum and maximum coordinates in human (mouse). Note that although different clusters are not supposed to overlap in 2-D, they often overlap in 1-D (i.e., their span intervals may overlap in human or mouse). Therefore, defining the cluster order for intermingled clusters should be done with caution. We compute the center of mass of all anchors forming the cluster and order clusters in human by the coordinates of their centers of mass. We assign the clusters numbers according to their order on the human genome. This lets us read off a cluster order in the mouse genome in terms of these labels.

Signs (orientations) of the resulting clusters are usually well-defined but for some highly rearranged clusters are not obvious. The algorithm for sign assignments in GRIMM-Synteny and the theorem justifying it are in Appendix A.

Since long gaps may break a single synteny block into a few clusters, we combine such clusters into strips that form synteny blocks (footnote 3). GRIMM-Synteny finds 319 clusters in the human genome longer than 1 Mb (for $G = C = 1$ Mb) and a number of smaller clusters. These clusters are further combined into 281 synteny blocks.

Footnote 3: A strip is a sequence of consecutive signed clusters $i_1, \ldots, i_n$ in one genome that either appear consecutively in the same way or in reverse as $-i_n, \ldots, -i_1$ in the other genome.
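Steps (1)–(4) of the GRIMM-Synteny algorithm described in this section can be illustrated with a small sketch. It is not the GRIMM-Synteny implementation: the anchor representation (a dictionary with a chromosome `pair` and two `ends`), the function names, the toy coordinates, and the choice of the human span as the cluster "length" are all assumptions made for illustration. Steps (5) and (6) (ordering, signs, strips) are not covered.

```python
from collections import defaultdict
from itertools import combinations

def manhattan(p, q):
    """Manhattan distance between two dot-plot points (h, m)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def anchor_distance(a, b):
    """Distance between the closest ends of two anchors.

    Anchors from different chromosome pairs are infinitely far apart.
    """
    if a["pair"] != b["pair"]:
        return float("inf")
    return min(manhattan(p, q) for p in a["ends"] for q in b["ends"])

def cluster_anchors(anchors, G, C):
    """Steps (1)-(4): connected components of the anchor graph, small ones removed.

    Returns clusters as sorted lists of anchor indices.
    ASSUMPTION: cluster "length" is taken to be its span in human coordinates.
    """
    parent = list(range(len(anchors)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step (2): join anchors whose distance is smaller than G.
    for i, j in combinations(range(len(anchors)), 2):
        if anchor_distance(anchors[i], anchors[j]) < G:
            parent[find(i)] = find(j)

    # Step (3): clusters are the connected components.
    groups = defaultdict(list)
    for i in range(len(anchors)):
        groups[find(i)].append(i)

    # Step (4): delete clusters of length <= C.
    kept = []
    for members in groups.values():
        human = [p[0] for i in members for p in anchors[i]["ends"]]
        if max(human) - min(human) > C:
            kept.append(sorted(members))
    return sorted(kept)

# Hypothetical toy anchors; coordinates are arbitrary units, not real data.
toy_anchors = [
    {"pair": ("X", "X"), "ends": ((0, 0), (100, 100))},
    {"pair": ("X", "X"), "ends": ((150, 160), (400, 410))},
    {"pair": ("X", "X"), "ends": ((5000, 9000), (5020, 8980))},  # isolated, short
    {"pair": ("1", "4"), "ends": ((420, 430), (900, 910))},      # other chromosome pair
]
print(cluster_anchors(toy_anchors, G=200, C=300))
```

The pairwise loop is quadratic in the number of anchors, which is fine for a toy example but would need a spatial index for whole-genome anchor sets.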


[Figure 2, panels (a)–(i): dot-plots of Human Chromosome X against Mouse Chromosome X. Axes are in Mb in the coordinate panels and are labeled with the synteny blocks (start, 1, ..., 11, end) in the symbolic panels. Panel contents are described in the caption.]

Figure 2: X chromosome: from local similarities, to synteny blocks, to breakpoint graph, to rearrangement scenario. (a) Dot-plot of anchors. Small anchors are enlarged for visibility. (b) Clusters of anchors. (c) Rectified clusters. (d) Synteny blocks. (e) Synteny blocks (symbolic representation as genome rearrangement units). (f) Construction of the breakpoint graph from synteny blocks: "human" (solid) path. Note that additional "start" and "end" blocks are added. (g) Construction of the breakpoint graph from synteny blocks: "mouse" (dotted) path. (h) Superposition of human and mouse paths produces the 2-D breakpoint graph superimposed on the synteny blocks. (i) Remove the synteny blocks to obtain the 2-D breakpoint graph. The graph has four cycles.


Rearrangements of anchors in a synteny block are called micro-rearrangements. Rearrangements of the order and orientation of synteny blocks are macro-rearrangements.

3. REARRANGEMENT SCENARIOS
The signed permutation describing synteny block order on the mouse X chromosome is

(−4, −5, 3, 11, −2, 8, −9, 10, −6, 7, −1).

For our goals, we use

(1, −7, 6, −10, 9, −8, 2, −11, −3, 5, 4)

(a "flip" of the entire chromosome) and transform this permutation into the "identity" permutation (1, ..., 11) representing the human X chromosome by 7 reversals (Fig. 3). Moreover, there are 177 micro-rearrangements within the X chromosome that were beyond the resolution of previous comparative mapping studies (some may be artifacts of assembly errors). A previous analysis of comparative maps of X chromosomes revealed 8 syntenic blocks and postulated a most parsimonious rearrangement scenario with 6 reversals (Bafna and Pevzner, 1995 [3]).

Fig. 2f presents the genomic dot-plot (with added "start" and "end" elements) and the "human" path (shown as solid edges) traversing the synteny blocks in human order. Similarly, Fig. 2g presents the same genomic dot-plot and the "mouse" path (shown as dotted edges) traversing the synteny blocks in mouse order. The 2-dimensional breakpoint graph is obtained by superimposing these solid and dotted paths (Fig. 2h) and deleting the synteny blocks (Fig. 2i). The breakpoint graph introduced by Bafna and Pevzner, 1993 [2] is the key tool in studies of genome rearrangements. Our "2-dimensional" representation of breakpoint graphs is new and different from the representation used in previous studies (which used the "1-dimensional" projections of this graph shown along the axis in Fig. 2f–i). The 1-dimensional representation was well-suited to developing the theory, but is quite messy when used to display real data for large genomes. Also, the vertex order is determined by choosing one of the genomes, and therefore the resulting graph (and certain aspects of the theory) are asymmetric with regards to the two genomes.

We used a fast implementation of the Hannenhalli-Pevzner algorithm (Tesler, 2002 [44]) to analyze the human-mouse rearrangement scenario (available via the GRIMM web server [42]). Although the algorithm finds a most parsimonious scenario, the real scenario is not necessarily a most parsimonious one, and the order of rearrangement events within a most parsimonious scenario often remains uncertain. Availability of three or more mammalian genomes could remedy some of these limitations and provide a means to infer the gene order in the mammalian ancestor [6, 23, 25].

If there is no breakpoint re-use then the reversal distance is exactly half the number of breakpoints and the real evolutionary scenario is a most parsimonious one. However, the estimate of reversal distance as the half of the number of breakpoints is inaccurate since it assumes that the breakpoints are not re-used in evolution. Fig. 3 presents two different most parsimonious scenarios that transform the order of the 11 synteny blocks on the mouse X chromosome into the order on the human X chromosome. Although the scenarios are very different, both have 3 breakpoint re-uses. One can prove that any rearrangement scenario based on these 11 blocks has at least 3 re-uses of breakpoint regions (although we cannot unambiguously infer where these breakpoint re-uses occurred) (footnote 4). This indicates that there are at least 3 more "hidden" synteny blocks in addition to 11 "large" synteny blocks shown in Fig. 3. Some of these may be detected by lowering the threshold for synteny block detection, but others may escape such detection. A generalization of this result for multichromosomal genomes reveals at least 190 breakpoint region re-uses over the whole genome (footnote 5).

Footnote 4: Our analysis assumes reversals are the only rearrangement events for the X chromosome and does not take into account events, such as transpositions, believed to be less frequent.

Footnote 5: We emphasize that by re-using breakpoints we do not mean multiple use of exactly the same genomic position as an endpoint of rearrangements, but rather the fact that between synteny blocks, there are regions that host endpoints for multiple rearrangements. Our estimate of 190 breakpoint re-uses means that there were 190 rearrangements that re-used the same breakpoint region between synteny blocks. The multiple endpoints within a breakpoint region split it into smaller regions, within which may lie "hidden" synteny blocks. Knowledge of these blocks would simplify the cycles of the breakpoint graph (see the discussion of "(g, b)-splits" in [14]) and narrow down the possible rearrangement events.

4. BREAKPOINT RE-USE
The breakpoint graph provides evidence for breakpoint re-use and allows one to estimate the number of breakpoint re-use events.

In the sorting by reversals problem, an adjacency is a pair of blocks $i, j$ that appear consecutively in both genomes, in the form $i, j$ or $-j, -i$. Blocks corresponding to an adjacency form a 2-cycle in the breakpoint graph. A pair of consecutive blocks (a black edge in the breakpoint graph) is a breakpoint if it is part of a long cycle (a cycle of length at least 4). Let $C_4$ be the total number of long cycles, and $L_4$ be the total of their lengths. The number of breakpoints is $L_4/2$. Cycles of length at least 6 are called composite cycles, and require at least one breakpoint re-use during sorting. Let $C_6$ and $L_6$ be the total number and total length of composite cycles.

In the sorting by reversals problem, the distance is

$$d = n + 1 - c + h + f = \mathrm{br} - C_4 + h + f$$

where $c$ is the number of cycles, $\mathrm{br}$ is the number of breakpoints, and $h$, $f$ are the "hurdles" and "fortress" parameters defined in [14]. The number of breakpoint re-uses is given by the following simple theorem:

Theorem 1. In the sorting by reversals problem, if all reversals are delimited by pairs of breakpoints (footnote 6), the number of breakpoint re-uses in any most parsimonious reversal scenario is $2d - \mathrm{br}$. This is a lower bound for all non-optimal reversal scenarios.

Footnote 6: When there are hurdles, it is sometimes possible to have a most parsimonious rearrangement scenario that uses black edges that are not identified as breakpoints. This effectively increases $\mathrm{br}$ and decreases the number of re-uses accordingly.

Proof. Each inversion has 2 breakpoint uses, so every most parsimonious rearrangement scenario has $2d$ breakpoint uses altogether. Every breakpoint must be used at least once, so the number of breakpoint re-uses is $2d - \mathrm{br}$. An arbitrary scenario on $D \ge d$ steps has $2D - \mathrm{br}$ breakpoint re-uses, so $2d - \mathrm{br}$ is a lower bound.


Theorem 1 implies that

$$2d - \mathrm{br} = \mathrm{br} - 2C_4 + 2(h + f) = \tfrac{1}{2} L_4 - 2C_4 + 2(h + f).$$

Each 4-cycle contributes $\tfrac{1}{2}(4) - 2(1) = 0$ to $\tfrac{1}{2} L_4 - 2C_4$; thus $2d - \mathrm{br} = \tfrac{1}{2} L_6 - 2C_6 + 2(h + f)$.

A similar theorem holds for multichromosomal genomes. However, it is not as simple as Theorem 1 because it requires a careful definition of the notion of breakpoints in multichromosomal genomes.

To define breakpoints for multichromosomal genomes $\Pi$ and $\Gamma$, it is useful to prepend a symbol "$C$" and append a symbol "$-C$" to each chromosome of each genome.

(B1) An internal breakpoint of $(\Pi, \Gamma)$ is a pair of adjacent blocks $(i, j)$ in $\Pi$ that do not appear consecutively as $(i, j)$ or $(-j, -i)$ in $\Gamma$.

(B2) The first kind of external breakpoint of $(\Pi, \Gamma)$ is a pair $(C, i)$ or $(-i, -C)$ in $\Pi$ that does not appear in either form, $(C, i)$ or $(-i, -C)$, in $\Gamma$. In other words, a chromosome of $\Pi$ starts with $i$ or ends with $-i$, but $i$ does not have either of these properties in $\Gamma$.

(B3) Let $\nu_\Pi$, $\nu_\Gamma$ be the number of null chromosomes in each genome. Normally, one of these is zero, but that isn't necessary. If $\nu_\Pi > \nu_\Gamma$, designate $\nu_\Pi - \nu_\Gamma$ of the nulls in $\Pi$ as external breakpoints (of the second kind).

Let $i(\Pi, \Gamma)$ be the number of internal breakpoints and $e(\Pi, \Gamma)$ be the number of external breakpoints (of both kinds). We define the total number of breakpoints as $\mathrm{br}(\Pi, \Gamma) = i(\Pi, \Gamma) + e(\Pi, \Gamma)$. For example, consider genomes

Π = (1 2 3 −4 | 5 −6 7 8 | −9 10 11 12)
Γ = (1 2 3 −4 5 −6 7 8 −9 | 12 10 11)

with $N_\Pi = 3$ and $N_\Gamma = 2$ chromosomes. The internal breakpoints of $(\Pi, \Gamma)$ are (−9, 10) and (11, 12), while the external breakpoints are (−4, −C), (C, 5), (8, −C), (C, −9) and (12, −C). The internal breakpoints of $(\Gamma, \Pi)$ are (−4, 5), (8, −9), and (12, 10), while the external breakpoints are (−9, −C), (C, 12), (11, −C) and (C, −C). There are 7 breakpoints in both cases. These breakpoints split the chromosomes into 5 strips: [1 2 3 −4], [5 −6 7 8], [−9], [10 11], [12]. More generally, we have the following result:

Theorem 2. The number of strips is $i(\Pi, \Gamma) + N_\Pi = i(\Gamma, \Pi) + N_\Gamma$.

The breakpoint graph $G(\Pi, \Gamma)$ for multichromosomal genomes $\Pi$ and $\Gamma$ is more complex than the breakpoint graphs for unichromosomal genomes. The multichromosomal genome rearrangement algorithm [13, 43] transforms $(\Pi, \Gamma)$ into a pair of permutations $(\pi, \gamma)$ (called optimal capped concatenates of $\Pi$ and $\Gamma$) on which the sorting by reversals algorithm [14] may be applied.

Theorem 3. Form optimal capped concatenates $\pi$, $\gamma$ of $\Pi$ and $\Gamma$ by the algorithm in Tesler, 2002 [43]. The total number of breakpoints in the graph $G(\pi, \gamma)$ is

$$i(\Pi, \Gamma) + e(\Pi, \Gamma) = i(\Gamma, \Pi) + e(\Gamma, \Pi) \qquad (1)$$

or this quantity plus two. If all rearrangement events are delimited by pairs of breakpoints, the number of breakpoint re-uses in any most parsimonious rearrangement scenario between $\Pi$ and $\Gamma$ is

$$2d(\Pi, \Gamma) - \mathrm{br}(\Pi, \Gamma). \qquad (2)$$

(This is a lower bound in a non-optimal scenario.)

Proof. The number of breakpoints (1) in $G(\pi, \gamma)$ is computed by counting how many black edges are in cycles of length greater than 2. Paths and cycles of length longer than 2, and ΓΓ-paths of length 2, will always end up in cycles of length greater than 2 in $G(\pi, \gamma)$. The 2-cycles in $G(\pi, \gamma)$ arise from the following features in $G(\Pi, \Gamma)$:

(a) All 2-cycles of $G(\Pi, \Gamma)$. A pair of adjacent blocks $(i, j)$ in $\Pi$ results in a 2-cycle in $G(\Pi, \Gamma)$ iff it does not form an internal breakpoint (B1).

(b) All bare edges that form ΠΓ-paths (because in $G(\pi, \gamma)$, all such bare edges are closed into a 2-cycle). A block $i$ that starts (or $-i$ that ends) a chromosome in $\Pi$ results in a bare edge ΠΓ-path iff it does not form an external breakpoint (B2).

(c) $\nu_\Gamma$ of the bare edges that form ΠΠ-paths, provided $\nu_\Gamma \le \nu_\Pi$. There will be $\nu_\Gamma$ bare edges that are ΠΠ-paths that are subsequently closed into 2-cycles and hence do not form breakpoints. The remaining $\nu_\Pi - \nu_\Gamma$ bare edges that are ΠΠ-paths will be joined with other paths into larger cycles to form breakpoints (B3).

(d) All tails (except in the case of "bad bonds"). "Tails" between chromosomes are not physical breakpoints. The algorithm [43] usually forces these to be 2-cycles, except in the pathological case of a "bad bond," where two tails form a 4-cycle, thus creating two additional breakpoints in $G(\pi, \gamma)$ and increasing Eq. (1) by two. This is a mathematical artifact that does not correspond to breakpoints in physical regions of the chromosomes.

Formula (2) is derived similarly to Theorem 1.

Our analysis of the human and mouse genomes reveals a distance $d = 245$, and 258 internal breakpoints and 42 external breakpoints for (human, mouse) (or 261 and 39 for (mouse, human)). Thus, the breakpoint graph provides evidence of $2(245) - (258 + 42) = 190$ breakpoint re-uses in the course of human-mouse evolution.

5. BREAKPOINT RE-USE AND THE RANDOM BREAKAGE MODEL
Our key insight is that although many short synteny blocks are still undiscovered (some will probably never be discovered), the number of such blocks can be reliably estimated with genome rearrangement analysis. Each such short block creates an impression of breakpoint re-use since the breakpoints flanking short synteny blocks are hard to separate. Another important observation is that most breakpoint regions are short (under 1 Mb) with very few exceptions. Since "hidden" synteny blocks should fit within the breakpoint regions, these blocks are also short. The genome rearrangement analysis implies that there is a very large number $N_r$ (at least 190) of breakpoint re-uses (and therefore, short synteny blocks), thus adding an extra bar to the empirical distribution of block length (Fig. 1b).

The average size of breakpoint regions (excluding chromosome ends) is only 668 Kb in human and 458 Kb in mouse, and each contains on average 1.9 breakpoints (rather than a single breakpoint as was implicitly assumed in previous studies). The overall size of the breakpoint regions is 172.5 Mb in human (5.7% of the genome length) and 119 Mb in mouse (4.7% of the genome length). Intuitively, the random breakage model contradicts the fact that 5.7% of the genome is populated by such a large number
of closely located breakpoints. The 190 breakpoint re-uses revealed by rearrangement analysis and the 258 breakpoint regions revealed by GRIMM-Synteny imply an estimate of $n = N_b - N_c + N_r = 281 - 23 + 190 = 448$ for the overall number of breakpoints (see footnote 7), where $N_b$ is the number of synteny blocks and $N_c$ is the number of chromosomes. We assume that these breakpoints are located in breakpoint regions and ignore chromosome ends. One can estimate the expected number of clumps (pairs of consecutive points that are within a “small” distance $w$ from each other) in the positions of $n$ uniformly distributed points in the interval $[0, 1]$ (see footnote 8) as $(n-1)(1-(1-w)^n)$ [12]. If we are interested in the number of clumps of $n$ breakpoints within a distance of 0.668 Mb in the genome of total length $G = 2{,}983$ Mb, then $w = \frac{0.668}{2{,}983}$ and the expected number of clumps is $\approx 43$. This is in sharp contrast with the estimate of $N_r = 190$ breakpoint re-uses, a strong argument against the random breakage model (see footnote 9).

If positions of $n$ breakpoints in the genome are given by random variables $u_i$ in $[0, 1]$, the segment sizes are $y_i = u_i - u_{i-1}$. For the following analysis, $n = N_b - 1 = 280$, since we ignore the chromosome ends (see above) and, similarly to Nadeau and Taylor, 1984 [29], do not consider short blocks (below 1 Mb). We also discard any other genomic material not observed to be within the synteny blocks. Following Churchill et al., 1990 [7], we test the random breakage model with the Kolmogorov-Smirnov test, which measures the largest difference between the empirical and uniform distribution functions:

$$D_n = \max\left( \max_{1 \le i \le n} \left(\frac{i}{n} - u_i\right),\ \max_{1 \le i \le n} \left(u_i - \frac{i-1}{n}\right) \right).$$

We computed $D_{280} = 0.085$, which is close to the estimate of 0.095 computed by Sankoff et al., 1997 [38] based on comparative mapping data for only 130 blocks. We analyzed the Kolmogorov-Smirnov statistics with transformed data and $\chi^2$ statistics (data are not shown). We also found that there is a reasonably good fit between the largest synteny block of length 79.6 Mb and the expected maximum fragment length $L(\gamma + \ln(n+1)) = 59.8$ Mb, where $L$ is the mean fragment length and $\gamma = 0.5772$ is Euler’s constant.

Ideally, these inferences should be based on complete data about segment lengths. However, the information about short segments may be hard to obtain even with available draft human and mouse sequences. Churchill et al., 1990 [7] model the missing data by assuming that if two breakpoint sites are within locking distance $a$ then the conserved segment remains undetected. However, even this more flexible model is unable to explain the very large number $N_r = 190$ of short unobserved segments that are revealed by our genome rearrangement studies. The question therefore arises as to whether there exists a different model of chromosomal rearrangements that (i) explains the fit between the distribution of long synteny blocks and the truncated exponential density function observed by Nadeau and Taylor, 1984, and (ii) explains a large number of short blocks that the Nadeau-Taylor statistics failed to explain. Below we describe a natural fragile breakage model that explains both good fit of long blocks and a large number of short blocks.

6. FRAGILE BREAKAGE MODEL

In the fragile breakage model, the genome consists of (short) fragile and (long) solid regions with different propensities to breakpoints. For expository purposes we assume that the probability of a breakpoint in a fragile region follows the Poisson process, while the probability of a breakpoint in a solid region is zero (extreme case). The overall size of fragile regions may be very small, e.g., 5% of the genome, in sharp contrast to the random breakage model. However, if fragile regions are distributed randomly in the genome, both the random breakage and the fragile breakage models predict the same distribution of long synteny blocks! This may be the reason for the prophetic predictive power of the random breakage model in the past. However, the random breakage model does not perform well in a test with the sequencing data, while the fragile breakage model easily explains the large number of short blocks.

In addition, the fragile breakage model allows one to estimate a lower bound on the number of still unobserved fragile regions that may be revealed by sequencing efforts in other mammalian species. Assume there are $m$ fragile regions in the genome and $n = N_b - N_c + N_r = 448$ random breakages, of which $N_b - N_c = 258$ are observed. The expected number of observed fragile regions (i.e., fragile regions broken by breakages) is $m\left(1 - \left(1 - \frac{1}{m}\right)^n\right)$. Solving $m\left(1 - \left(1 - \frac{1}{m}\right)^{448}\right) = 258$ gives $m \approx 364$. The number of still undiscovered fragile regions is at least $m - (N_b - N_c) \approx 106$, most of which probably reside within existing synteny blocks.

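The random breakage checks of the preceding section (expected number of clumps, the Kolmogorov-Smirnov statistic $D_n$, and the expected maximum fragment length) can be sketched in the same way. The function names are hypothetical, and no breakpoint coordinates are given in the paper, so the Kolmogorov-Smirnov function is shown only on toy input.

```python
import math

EULER_GAMMA = 0.5772  # value of Euler's constant quoted in the text


def expected_clumps(n: int, w: float) -> float:
    """Expected number of consecutive pairs closer than w among n uniform points in [0, 1]."""
    return (n - 1) * (1.0 - (1.0 - w) ** n)


def ks_statistic(points) -> float:
    """D_n: largest gap between the empirical and the uniform distribution functions."""
    u = sorted(points)
    n = len(u)
    d_plus = max(i / n - ui for i, ui in enumerate(u, start=1))
    d_minus = max(ui - (i - 1) / n for i, ui in enumerate(u, start=1))
    return max(d_plus, d_minus)


def expected_max_fragment(mean_len: float, n: int) -> float:
    """Expected longest fragment, L * (gamma + ln(n + 1)), for n breakpoints."""
    return mean_len * (EULER_GAMMA + math.log(n + 1))


# Clump estimate with the figures from the text: 448 breakpoints,
# window 0.668 Mb, genome length 2,983 Mb.
clumps = expected_clumps(448, 0.668 / 2983)
print(round(clumps))  # about 43, far below the 190 observed re-uses

# Toy input only: two evenly placed points.
print(ks_statistic([0.25, 0.75]))
```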
[Footnote 7] A more accurate estimate for the number of breakpoints would use the number of chromosomes in the common ancestor of human and mouse, which remains unknown. See [6, 25] for the analysis of chromosomal organization in the mammalian ancestor.

[Footnote 8] Similarly to Nadeau and Taylor, 1984 [29] we represent the genome as a single interval. We also estimate the number of breakpoints in breakpoint regions as $n = N_b - N_c + N_r$ although some of these breakpoints may fall at the ends of chromosomes and, in this case, should not be counted. In addition, the exact borders of the syntenic blocks are not well defined, and may extend into the breakpoint regions. However, these issues only slightly affect our analysis.

[Footnote 9] This estimate uses the average breakpoint region length rather than the distribution of breakpoint region lengths. The estimate needs to be revisited in the (unlikely) case that the chromosome ends and a small number of long breakpoint regions account for almost all breakpoint re-use events.

7. CONCLUSIONS

The visionary insights of Nadeau and Taylor, 1984 [29] and prophetic accuracy of their estimates survived many comparative mapping studies and (at a certain level of granularity) still remain in good fit with available human and mouse genomic sequences. The random breakage model proved to be an extremely valuable evolutionary theory, particularly when contrasted against “random gene scrambling” and other models that were considered in the early 1980s. However, the available human and mouse genomic sequences dramatically increased the level of resolution at which we can analyze the genomes and, for the first time, allow us to accurately estimate the extent of rearrangement events. This analysis revealed a previously unknown phenomenon of “breakpoint re-use” in mammalian evolution and demonstrated that, with a new level of granularity, the random breakage model is unable to explain the very large number


of breakpoint clumps. Therefore, a new more accurate null hypothesis is needed for comparative studies of many mammalian genomes that are about to be sequenced.

[Figure 4 image: two histograms, (a) “Lengths of breakpoint regions in human” and (b) “Lengths of breakpoint regions in mouse”; x-axis: Region length (Kb), 0 to 2500+; y-axis: Frequency.]

Figure 4: (a) Histogram of breakpoint region lengths in the human genome. Most breakpoint regions are very short, with 109 out of 258 regions being shorter than 100 Kb. However, there is a small number of long breakpoint regions: 17 regions are 1–2.5 Mb, and 15 are longer than 2.5 Mb (shown by a single bar at the right end). Chromosome ends can also host breakpoints, but are not counted as breakpoint regions. (b) Histogram of breakpoint region lengths in the mouse genome. 123 out of 261 regions are shorter than 100 Kb. 21 regions are 1–2.5 Mb, and 10 are longer than 2.5 Mb.

8. ACKNOWLEDGEMENTS

We are grateful to , Guillaume Bourque, Michael Kamal, Eric Lander, Kerstin Lindblad-Toh, Bill Murphy, David Sankoff, and Jade Vinson for many suggestions.

Appendix A: Signs of synteny blocks

For highly rearranged blocks, deciding whether a synteny block in human has the same or reverse order in mouse is a non-trivial problem. For example, if $(1, 2, 3, 4)$ is the order of anchors in a human synteny block and $(-4, -3, -2, -1)$ is the order of anchors in the corresponding mouse synteny block then these blocks are reversed (i.e., have different signs). Deciding whether $(1, 2, 3, 4)$ and $(-3, -4, 2, 1)$ are reversed or not is a more difficult problem. We have implemented a few algorithms addressing this problem and found that the following approach works the best.

Suppose a cluster in human consists of anchors $1, \ldots, m$ arranged in increasing order of position. Let $\sigma$ be the signed permutation of $1, \ldots, m$ giving the relative ordering of these same anchors in mouse. We will determine the sign $\delta$ of this cluster in mouse.

The signed permutation $\sigma = (a_1, \ldots, a_m)$ is separable if $(a_1, \ldots, a_r)$ is a signed permutation of $(1, \ldots, r)$ for some $r = 1, \ldots, m-1$. For a signed permutation $\sigma = (a_1, \ldots, a_m)$ we denote $-\sigma = (-a_m, \ldots, -a_1)$. We now determine the sign $\delta = \pm 1$ of this cluster in mouse as follows:

(S1) If $m = 1$, set $\delta = \pm 1$ so that $\sigma = (\delta)$.
(S2) For $m > 1$, if $\sigma$ is separable, set $\delta = 1$.
(S3) For $m > 1$, if $-\sigma$ is separable, set $\delta = -1$. Note that for $m > 1$, $\sigma$ and $-\sigma$ cannot both be separable.
(S4) Otherwise, there is no clear choice of sign. Choose $\delta = 1$. (Alternatives include deleting the cluster, or using the algorithm of [15] to choose a sign.)

We consider when two anchors from the same chromosome pair can be joined by an edge at threshold $G$, depending on their lengths, orientations, and gaps between them in each genome. When two anchors are on the same chromosome pair, we defined the distance between them as the Manhattan distance between their closest ends. We now introduce an equivalent formula that is of use both computationally and analytically. Represent the first anchor as the line segment from $(x_1^+, y_1^+)$ to $(x_1^-, y_1^-)$, and the second anchor as the line segment from $(x_2^+, y_2^+)$ to $(x_2^-, y_2^-)$. We break up the distance between the anchors as a sum of three quantities, shown in Fig. 5: an interior gap measuring the closest terminals of these anchors within each genome, without regard to signs, and then a correction term for the best terminal of each anchor. Specifically, let $(X_1, X_2)$ be the closest values of $x_1^\pm$ and $x_2^\pm$, and similarly for $(Y_1, Y_2)$. Define the interior gap between the anchors as

$$g = |X_2 - X_1| + |Y_2 - Y_1|.$$

The penalty for each choice ($+$ or $-$) of terminal in the first anchor is

$$p_1^\pm = |X_1 - x_1^\pm| + |Y_1 - y_1^\pm|,$$

and $p_2^\pm$ is defined similarly for the second anchor. Then the distance between the anchors is

$$D = g + \min(p_1^+, p_1^-) + \min(p_2^+, p_2^-).$$

[Figure 5 image: two panels, (a) and (b), each showing two anchors as line segments with terminals $(x_i^\pm, y_i^\pm)$, the interior gap $g$, and the penalties $p_1^\pm$, $p_2^\pm$.]

Figure 5: Anchor distance in terms of interior gap ($g$) and penalties ($p_1^\pm$, $p_2^\pm$) for each choice of terminal. Closest terminals shown by bold path. (a) Anchors are in configuration $1, -2$ (case (C2)). (b) Anchors are in configuration $-1, -2$ (case (C4)).

An anchor is long if its span on each genome has length at least $G$, and is short otherwise.

For two genomes (“human” and “mouse”), consider anchors $A$, $B$ that lie on the same chromosome in each genome, and are in order $\cdots, A, \cdots, B, \cdots$ in human. If they are in order on mouse as $\cdots, A, \cdots, -B, \cdots$, we call it configuration $A, -B$. There are 8 possible configurations on mouse:

(C1) Configuration $A, B$ or $-B, -A$: We have $p_1^- = p_2^+ = 0$ and $D(A, B) = g$, so there is no restriction on the lengths of anchors $A$, $B$ in either genome. There is an edge iff the interior gap is less than $G$.
(C2) Configuration $A, -B$ or $B, -A$: Here we are potentially clustering anchors whose orientation is inconsistent. We have $p_1^- = 0$ and $p_2^-, p_2^+ > 0$. Specifically, $p_2^+ = |y_2^+ - y_2^-|$ and $p_2^- = |x_2^+ - x_2^-|$, one less than the spans of marker $B$ on the two axes. So $D = g + \min(p_2^+, p_2^-)$. Since an edge is added only if $D < G$, anchor $B$ must be short.


(C3) Configuration $-A, B$ or $-B, A$: Similar to (C2), but $A$ is short.
(C4) Configuration $-A, -B$ or $B, A$: Here we are potentially clustering anchors with inconsistent order. It is similar to (C2) and (C3), but both $A$ and $B$ are short.

Theorem 4. Suppose a cluster contains at least one long anchor. Let the order of its anchors in human be $\tau_1, i_1, \tau_2, i_2, \ldots, \tau_r, i_r, \tau_{r+1}$, where $i_1, i_2, \ldots, i_r$ are long anchors and each $\tau_j$ is a sequence of zero or more short anchors.
Then in mouse, these anchors are in the order $\tau'_1, i_1, \tau'_2, i_2, \ldots, \tau'_r, i_r, \tau'_{r+1}$ or $-\tau'_{r+1}, -i_r, -\tau'_r, \ldots, -i_1, -\tau'_1$, where each $\tau'_j$ is a signed permutation of $\tau_j$. In the former case the cluster sign is positive in mouse, and in the latter it is negative.

Proof. The cluster signs follow from cases (S1)–(S3) of the definition of cluster signs.

Assume without loss of generality that anchor $|i_1|$ has orientation $i_1$ in both genomes (see footnote 10). (If it has orientation $-i_1$ in mouse, flip the whole chromosome in the mouse for the purposes of this proof.)

Suppose in human, $A$ is any anchor occurring to the left of $|i_1|$ and $B$ is any anchor occurring to its right. Then $D(A, B)$ is at least the span of $|i_1|$ in human, which is at least $G$ (because $|i_1|$ is long), so there is no edge connecting $A$ and $B$. Thus, all paths from anchors left of $|i_1|$ to anchors right of $|i_1|$ must pass through $|i_1|$ (i.e., it is an articulation point). The anchors of $\tau_1$ are all connected to $|i_1|$ through paths that go through other anchors of $\tau_1$ and terminate at the terminal $+i_1$. Anchors in $\tau_2, i_2, \ldots, i_r, \tau_{r+1}$ are connected to $|i_1|$ via paths that go through these anchors and then terminate at the terminal $-i_1$.

Similarly, $|i_2|, \ldots, |i_r|$ are articulation points, and anchors left of $|i_j|$ in human are connected to $|i_j|$ via paths terminating at $+i_j$, while anchors right of $|i_j|$ in human are connected to $|i_j|$ via paths terminating at $-i_j$. Thus, this cluster on mouse has the order $\tau'_1, i_1, \tau'_2, \pm i_2, \tau'_3, \pm i_3, \ldots, \pm i_r, \tau'_{r+1}$, where $\tau'_j$ is a signed permutation of $\tau_j$.

We will show $\pm i_2$ is $+i_2$. Let $|i_1|, |a_1|, |a_2|, \ldots, |a_m|, |i_2|$ be a path in the anchor graph that does not reuse any anchors. By the analysis above, $|a_1|, \ldots, |a_m|$ are short anchors and are between $|i_1|$ and $|i_2|$ in both genomes. If mouse has an edge with configuration $i_1, -i_2$ or $\pm a_j, -i_2$ then cases (C2) or (C4) imply $|i_2|$ must be short, contradicting the assumption that it is long.

Similarly, the signs on all other $\pm i_j$ are $+i_j$ in mouse.

Note that with the parameters of our analysis ($G = 1$ Mb), all anchors were short (under 10 Kb); however, we may replace anchors by strips of anchors in the above theorem, thus increasing the actual lengths. (We also did this as a preprocessing step to cut down the size of the input.)

Our rules (S1–S3) allow for a possibility such as $\sigma = (2, 1, -3, 5, 4)$, which is not covered by the above theorem. It is separable but has a point 3 that is inverted rather than fixed. Anchor 3 must be a short anchor. There are also nonseparable permutations, such as $\sigma = (2, 4, 1, 3)$, that are not covered by (S1–S3). By the above theorem, these must consist solely of short anchors.

Note that the GRIMM-Synteny algorithm, the separability condition for determining signs, and Theorem 4 go through to multiple genomes, on replacing coordinates $(x, y)$ by higher dimension coordinates. One genome is chosen as a reference genome (as human was here), and the signs of a cluster on the other genomes are determined relative to it by considering their anchor order as a signed permutation of the anchors in the reference genome. We did a version of this (with the algorithm adapted to comparative mapping data rather than sequencing data) in Murphy et al., 2003 [25].

[Footnote 10] Although the edges in the anchor graph are drawn to connect the closest terminals of the anchors, each anchor is a single vertex in the anchor graph; its terminals are not considered to be two separate vertices. Thus, in working with a signed anchor order, we emphasize that the anchor is a single vertex by using absolute values when appropriate.

9. REFERENCES

[1] D. Bader, B. Moret, and M. Yan. A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. Journal of , 8:483–491, 2001.
[2] V. Bafna and P. Pevzner. Genome rearrangements and sorting by reversals. In Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, pages 148–157, 1993 (Full version has appeared in SIAM J. Computing, 25:272–289, 1996).
[3] V. Bafna and P. Pevzner. Sorting by reversals: genome rearrangements in plant organelles and evolutionary history of X chromosome. Molecular Biology and Evolution, 12:239–246, 1995.
[4] A. Bergeron. A very elementary presentation of the Hannenhalli-Pevzner theory. In Proceedings 12th Annual Symposium on Combinatorial Pattern Matching, volume 2089 of Lecture Notes in Computer Science, pages 106–117, Jerusalem, Israel, 2001.
[5] P. Berman and S. Hannenhalli. Fast sorting by reversal. In Combinatorial Pattern Matching. 7th Annual Symposium, volume 1075 of Lecture Notes in Computer Science, pages 168–185, New York, 1996. Springer.
[6] G. Bourque and P. Pevzner. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Research, 12:9748–9753, 2002.
[7] G. Churchill, D. Daniels, and M. Waterman. The distribution of restriction enzyme sites in Escherichia coli. Nucleic Acids Research, 18:589–597, 1990.
[8] O. Cohen, C. Cans, M. Cuillel, J. Gilardi, H. Roth, M. Mermet, P. Jalbert, and J. Demongeot. Cartographic study: breakpoints in 1574 families carrying human reciprocal translocations. Human Genetics, 97:659–657, 1996.
[9] R. DeBry and M. Seldin. Human/mouse homology relationships. Genomics, 33:337–351, 1996.
[10] T. Dobzhansky and A. Sturtevant. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics, 23:28–64, 1938.
[11] N. El-Mabrouk. Sorting signed permutations by reversals and insertions/deletions of contiguous segments. Journal of Discrete Algorithms, 1:–, 2000.
[12] J. Glaz, J. Naus, and S. Wallenstein. Scan Statistics. Springer, Berlin, 2001.
[13] S. Hannenhalli and P. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, pages 581–592, Milwaukee, Wisconsin, 1995.


[14] S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, pages 178–189, 1995 (full version appeared in Journal of ACM, 46:1–27, 1999).
[15] S. Hannenhalli and P. Pevzner. To cut... or not to cut (applications of comparative physical maps in molecular evolution). In Proceedings of the 7th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 304–313, 1996.
[16] H. Kaplan, R. Shamir, and R. Tarjan. Faster and simpler algorithm for sorting signed permutations by reversals. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 344–351, New York, 1997. ACM.
[17] J. Kececioglu and D. Sankoff. Exact and approximation algorithms for the inversion distance between two permutations. Algorithmica, 13:180–210, 1995.
[18] W. J. Kent. BLAT–the BLAST-like alignment tool. Genome Research, 12:656–664, 2002.
[19] B. Koop and L. Hood. Striking sequence similarity over almost 100 kilobases of human and mouse T-cell receptor DNA. Nature Genetics, 7:48–53, 1994.
[20] E. Lander et al. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.
[21] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. , 18:440–445, 2002.
[22] C. Mayor, M. Brudno, J. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter, and I. Dubchak. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16:1046–1047, 2000.
[23] B. Moret, A. Siepel, J. Tang, and T. Liu. Inversion medians outperform breakpoint medians in phylogeny reconstruction from gene-order data. In Proceedings of the 2nd International Workshop on Algorithms in Bioinformatics (WABI’02), Rome (Lecture Notes in Computer Science 2452), pages 521–563, 2002.
[24] R. Mural et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science, 296:1661–1671, 2002.
[25] W. J. Murphy, G. Bourque, G. Tesler, P. A. Pevzner, and S. J. O’Brien. Analysis of mammalian genome rearrangements using multispecies comparative maps. (submitted).
[26] N. Copeland et al. A genetic linkage map of the mouse: Current applications and future prospects. Science, 262:57–66, 1993.
[27] J. Nadeau and D. Sankoff. Counting on comparative maps. Trends Genet., 14:495–501, 1998.
[28] J. Nadeau and D. Sankoff. The lengths of undiscovered conserved segments in comparative maps. Mamm Genome., 9:491–495, 1998.
[29] J. Nadeau and B. Taylor. Lengths of chromosomal segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences USA, 81:814–818, 1984.
[30] S. Ohno. Ancient linkage groups and frozen accidents. Nature, 244:259–262, 1973.
[31] E. Pennisi. A mouse chronology. Science, 288:248–257, 2000.
[32] P. Pevzner and G. Tesler. Genome rearrangements in mammalian evolution: Lessons from human and mouse genomic sequences. Genome Research, 13:13–26, 2003.
[33] R. Waterston et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520–562, 2002.
[34] D. Sankoff. Comparative mapping and genome rearrangement. In J. Dekkers, S. Lamont, and M. Rotschild, editors, From Jay L. Lush to Genomics: Visions for animal breeding and genetics, pages 124–134, 1999.
[35] D. Sankoff and M. Blanchette. The median problem for breakpoints in . In Computing and Combinatorics, Proceedings of COCOON ‘97, Lecture Notes in Computer Science, pages 251–263, New York, 1997. Springer Verlag.
[36] D. Sankoff, M. Deneault, P. Turbis, and C. Allen. Chromosomal distribution of breakpoints in cancer, infertility, and evolution. Theoretical Population Biology, 61:497–501, 2002.
[37] D. Sankoff, M. Parent, and D. Bryant. Accuracy and robustness of analyses based on numbers of genes in observed segments. In D. Sankoff and J. Nadeau, editors, Comparative Genomics - Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, pages 299–305. Kluwer, 2000.
[38] D. Sankoff, M. Parent, I. Marchand, and V. Ferretti. On the Nadeau-Taylor theory of conserved chromosome segments. In Eighth Annual Symposium on Combinatorial Pattern Matching, volume 1264 of Lecture Notes in Computer Science, pages 262–274, Aarhus, Denmark, June 1997. Springer-Verlag.
[39] D. Schoen. Comparative genomics, marker density and statistical analysis of chromosome rearrangements. Genetics, 154:943–952, 2000.
[40] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. PipMaker – a web server for aligning two genomic DNA sequences. Genome Research, 10:577–586, 2000.
[41] R. Tatusov, E. Koonin, and D. Lipman. A genomic perspective on protein families. Science, 278:631–637, 1997.
[42] G. Tesler. GRIMM, 2001. http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM.
[43] G. Tesler. Efficient algorithms for multichromosomal genome rearrangements. J. Comp. Sys. Sci., 65(3):587–609, 2002.
[44] G. Tesler. GRIMM: genome rearrangements web server. Bioinformatics, 18:492–493, 2002.
[45] J. Thomas, T. Summers, S. Lee-Lin, V. Maduro, J. Idol, S. Mastrian, J. R. JF, D. Jamison, and E. Green. Comparative genome mapping in the sequence-based era: early experience with human chromosome 7. Genome Research, 10:624–633, 2000.
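Supplementary sketch for Appendix A. The sign rules (S1)–(S4) can be written out in a few lines. This is an illustrative reading of the rules rather than the authors' implementation; the function names are hypothetical, and rule (S4) is implemented with the default choice δ = 1 only, without the alternatives mentioned in the appendix.

```python
def is_separable(sigma) -> bool:
    """True if some proper prefix (a_1..a_r), 1 <= r <= m-1, is a signed permutation of (1..r).

    Assumes sigma is a signed permutation of 1..m, so the prefix of length r
    covers {1..r} exactly when its largest absolute value equals r.
    """
    largest = 0
    for r, a in enumerate(sigma[:-1], start=1):
        largest = max(largest, abs(a))
        if largest == r:
            return True
    return False


def negate(sigma):
    """-sigma = (-a_m, ..., -a_1)."""
    return tuple(-a for a in reversed(sigma))


def cluster_sign(sigma) -> int:
    """Sign delta of a cluster in mouse, following rules (S1)-(S4)."""
    sigma = tuple(sigma)
    if len(sigma) == 1:               # (S1)
        return 1 if sigma[0] > 0 else -1
    if is_separable(sigma):           # (S2)
        return 1
    if is_separable(negate(sigma)):   # (S3)
        return -1
    return 1                          # (S4): no clear choice, default to +1


# Examples taken from the appendix text.
print(cluster_sign((-4, -3, -2, -1)))   # reversed block
print(cluster_sign((-3, -4, 2, 1)))     # the "more difficult" example
print(cluster_sign((2, 1, -3, 5, 4)))   # separable
print(cluster_sign((2, 4, 1, 3)))       # nonseparable, falls through to (S4)
```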

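Supplementary sketch for the anchor distance of Appendix A. The code below computes $D = g + \min(p_1^+, p_1^-) + \min(p_2^+, p_2^-)$ and compares it with the Manhattan distance between closest ends on two toy examples. The anchor representation (a dictionary keyed by terminal sign) and the coordinates are assumptions chosen for illustration, and the examples use anchors whose spans do not overlap on either axis.

```python
def closest_values(vals1, vals2):
    """Pair (v1, v2), one value from each anchor, with the smallest |v1 - v2|."""
    return min(((a, b) for a in vals1 for b in vals2),
               key=lambda ab: abs(ab[0] - ab[1]))


def anchor_distance(a1, a2):
    """D = g + min(p1+, p1-) + min(p2+, p2-) for anchors given as {'+': (x, y), '-': (x, y)}."""
    X1, X2 = closest_values([a1['+'][0], a1['-'][0]], [a2['+'][0], a2['-'][0]])
    Y1, Y2 = closest_values([a1['+'][1], a1['-'][1]], [a2['+'][1], a2['-'][1]])
    g = abs(X2 - X1) + abs(Y2 - Y1)                       # interior gap
    p1 = [abs(X1 - a1[s][0]) + abs(Y1 - a1[s][1]) for s in '+-']
    p2 = [abs(X2 - a2[s][0]) + abs(Y2 - a2[s][1]) for s in '+-']
    return g + min(p1) + min(p2)


def manhattan_closest_ends(a1, a2):
    """Manhattan distance between the closest pair of terminals."""
    return min(abs(x1 - x2) + abs(y1 - y2)
               for (x1, y1) in a1.values()
               for (x2, y2) in a2.values())


# Toy coordinates (x = human, y = mouse); not data from the paper.
A = {'+': (0, 0), '-': (10, 10)}
B_same = {'+': (15, 20), '-': (25, 30)}      # configuration A, B   (case C1)
B_flipped = {'+': (15, 30), '-': (25, 20)}   # configuration A, -B  (case C2)

print(anchor_distance(A, B_same), manhattan_closest_ends(A, B_same))
print(anchor_distance(A, B_flipped), manhattan_closest_ends(A, B_flipped))
```

In the second example the distance exceeds the interior gap by the smaller span of the flipped anchor, which is the penalty term that forces that anchor to be short in case (C2).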