Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Early vertebrate whole genome duplications were predated by a period of intense genome rearrangement

Andrew L. Hufton1, Detlef Groth1,2, Martin Vingron1, Hans Lehrach1, Albert J. Poustka1, Georgia Panopoulou1*

1. Max Planck for Molecular Genetics, Ihnestr. 73, 12169 Berlin, Germany. 2. Potsdam University, Bioinformatics Group, c/o Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, D-14476 Potsdam-Golm, Germany

* Corresponding author: Max-Planck Institut für Molekulare Genetik, Ihnestrasse 73, D- 14195 Berlin Germany. email: [email protected], Tel: +49-30-84131235, Fax: +49- 30-84131128

Running title: Early vertebrate genome duplications and rearrangements

Keywords: synteny, amphioxus, genome duplications, rearrangement rate, genome instability Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Abstract

Researchers, supported by data from polyploid plants, have suggested that whole genome duplication (WGD) may induce genomic instability and rearrangement, an idea which could have important implications for vertebrate evolution. Benefiting from the newly released amphioxus genome sequence (Branchiostoma floridae), an invertebrate which researchers have hoped is representative of the ancestral genome, we have used proximity conservation to estimate rates of genome rearrangement throughout vertebrates and some of their invertebrate ancestors. We find that, while amphioxus remains the best single source of invertebrate information about the early chordate genome, its genome structure is not particularly well conserved and it cannot be considered a fossilization of the vertebrate pre- duplication genome. In agreement with previous reports, we identify two WGD events in early vertebrates and another in teleost fish. However, we find that the early vertebrate WGD events were not followed by increased rates of genome rearrangement. Indeed, we measure massive genome rearrangement prior to these WGD events. We propose that the vertebrate WGD events may have been symptoms of a pre-existing predisposition toward genomic structural change.

Supplementary Data in the form of 2 tables (Tables S1, S2) and 2 figures (Figs. S1, S2) are included.

2 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Introduction Researchers have proposed that whole genome duplication (WGD) events may increase the genomic rearrangement rate, possibly directly accelerating evolution itself (Otto 2007). This idea has received some support from estimates of genome rearrangement in plant polyploids (Pontes et al. 2004; Song et al. 1995), and around a WGD event in teleost fish (Semon and Wolfe 2007). In the later case, Semon and Wolfe measured an increased rate of rearrangement at the crown of the tetrapod-teleost lineages, but, lacking an appropriate root species, they could not distinguish whether this increased rearrangement rate occurred around the teleost WGD or in the early branch of tetrapods. In addition, comparisons between and mouse genomes have revealed that most breakpoints occur close to clusters of tandem gene duplications and large segmental duplications, suggesting that local duplication can promote rearrangement (Armengol et al. 2005). While these data are suggestive, the relation of WGD events to genome rearrangement remains largely untested in animals. Vertebrate genomes show evidence of widespread compared to invertebrate genomes, leading Ohno to propose the existence of two rounds of WGD during early vertebrate evolution, now known as the 2R hypothesis (Ohno 1970). This has been a hotly debated topic—in large part because early phylogeny-based approaches could not distinguish WGD from other gene duplication models (Gibson and Spring 2000; Hughes et al. 2001). However, analysis of conserved gene order, or synteny, within complete vertebrate genome sequences has provided an increasing body of evidence supporting the 2R hypothesis (Dehal and Boore 2005; McLysaght et al. 2002; Panopoulou et al. 2003; Vandepoele et al. 2004). These studies led Kasahara to declare that “there is now incontrovertible evidence supporting the 2R hypothesis” (Kasahara 2007). Fascination with this hypothesis has endured because of the dramatic consequences such events could have had for the evolution of vertebrates. It is clear that WGD events provide a quick and easy way to produce vast numbers of duplicate , creating a genetic reservoir from which innovations can arise. However, it is not clear whether these WGD events also sparked widespread genome rearrangement. The cephalochordate amphioxus (Branchiostoma floridae) genome may be a key source of information about the evolution of the early vertebrate genome. Cephalochordates, together with the tunicates, are the closest living relatives of vertebrates (Delsuc et al. 2006), and both taxa separated from the vertebrate lineage prior to the widespread gene duplications.

3 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

In contrast to tunicates, which are exceptionally diverged in many regards, genome sequence and FISH mapping of selected amphioxus regions indicates that amphioxus shares some genomic features with vertebrates (Abi-Rached et al. 2002; Castro and Holland 2003; Hughes and Friedman 2005). Hence, it is thought to be the best preserved pre-duplication genome. Its genome sequence has recently been completed, and analysis published by the amphioxus genome project has provided what appears to be the most compelling support in favor of the 2R hypothesis to date (Putnam et al. 2008). With this evidence before us, we are presented with a unique opportunity to reconstruct the evolutionary history of the early vertebrate genome. However, even with the amphioxus genome present, estimating genome rearrangement rates in the ancient vertebrate lineages is not trivial. Algorithms exist that can reconstruct the most likely series of rearrangement events between two closely related species, however, these methods are only effective for the cases where genomes are fully assembled, and breakpoints are not heavily saturated (Nadeau and Sankoff 1998; Nadeau and Taylor 1984; Pevzner and Tesler 2003). Over longer evolutionary times, or when genomes of interest have only scaffold-level assemblies, exact reconstructions of rearrangement events is impossible. In response, authors have derived more flexible approaches to inferring rates of genome rearrangement. Semon and Wolfe used a parsimony based model to score gene proximity conservation, however their model required prior knowledge of the locations of WGD events, and the authors warned that it may produce abnormally high rearrangement estimates for genomes with scaffold-level assemblies (Semon and Wolfe 2007). Smith and Voss used a association based score originally suggested by Housworth and Postlethwait to measure rates of synteny loss through vertebrates, however once again this measure is only appropriate for fully assembled genomes, and only estimates the amount of interchromosomal rearrangement (Housworth and Postlethwait 2002; Smith and Voss 2006). No method has been shown to reliably estimate the amount of rearrangement between genomes in scaffold- level assemblies, such as the current amphioxus genome assembly. Here we have completed a survey of the conservation of gene synteny throughout vertebrates and their invertebrate ancestors, using a proximate gene pair method derived from our previous work (Panopoulou et al. 2003). Clustering of syntenic gene segments clearly illustrates the traces of the vertebrate WGD events. Moreover, we use this synteny data to develop a new, simple metric for syntenic conservation, and thereby estimate rates of synteny loss throughout the vertebrate lineages, and among some of their invertebrate ancestors. Our

4 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. results reveal that amphioxus genomic structure is not exceptionally conserved. Moreover, we find that the vertebrate WGD events were not followed by increased rearrangement rates; the teleost WGD occurred during a period of relatively low synteny loss, and the early vertebrate WGD events appear to have been preceded by a spike in synteny loss, and then followed by a low rate of synteny loss.

Results

Identifying conserved gene synteny among vertebrates and their ancestors

Syntenic gene pairs are the smallest detectable unit of syntenic conservation, and as such, we reasoned that they may provide a suitable foundation for estimating genomic synteny conservation even when studying heavily fragmented genome assemblies. Since the amount of synteny conservation between two species is reduced by genomic rearrangement, genomic rearrangement rates can be inferred from synteny data. To this end, we have adapted our previous gene pair synteny method to detect synteny conservation between pairs of genomes (Fig. 1, Methods). Briefly, for each comparison between two genomes we first group genes into orthologous families, and then identify cases where genes from a pair of gene families are observed in close proximity, which we call a ‘family combination’ (Fig. 1). Family combinations which have proximate gene pairs in more than one location, either across the two genomes of interest or within a single genome are assumed to have evidence of syntenic conservation. Genes are defined to be in close proximity if they have no more than ten intervening genes, a threshold which previous simulations have shown is appropriate for identifying true syntenic gene relationships (Panopoulou et al. 2003). These syntenic family combinations can then be assembled into segments of syntenic genes (Fig. 1, Methods). While amphioxus shares the most family combinations with chicken (2,644), the family combinations shared with the (1,888) assemble into more syntenic segments that cover slightly more of the amphioxus genome (Table 1). The majority of these 807 amphioxus-human syntenic segments include one pair of syntenic genes: 52.6% include 2 genes, 17% include 3 genes, and 14.4% include 4 genes. In total, 490.96 Mb of the human genome is covered by gene pairs with synteny conservation in amphioxus (Fig. S1). The equivalent regions in amphioxus extend over 103.02 Mb (Table 1). Figure S2 shows the 20

5 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. amphioxus scaffolds with the most syntenic gene pairs and their association with the human . A small set of family combinations are conserved across multiple genomes, possibly identifying cases where strong purifying selection has protected a gene cluster from rearrangement throughout . We identified 167 family combinations that are conserved between amphioxus, human, zebrafish, fugu and chicken (Table S2). These widely conserved genes include the large Hox and syntenic clusters, as well as several ancient tandemly duplicated genes that have preserved their linkage during evolution, including: I) Smad2/3-Smad6/7, II) Gamma-aminobutyric-acid receptor (GABRB3-GABRA5), III) Neuronal voltage-gated calcium channel (CACNG5/7-CACNG4/8), IV) Six1/2 and Six4/5 are proximate in amphioxus, human and fish, but not in chicken (Kawakami et al. 2000).

Groups of related syntenic segments support the 2R hypothesis

By merging our syntenic gene pairs into groups of related syntenic segments we receive a clear illustration of the genomic duplications events that have occurred during vertebrate evolution. For this analysis, we grouped together syntenic segments that have undergone duplication since an organism’s divergence from the second species used in the synteny comparison (Fig. 1, Methods). By plotting the chromosome coverage of these ‘synteny groups’, versus their size, we receive a simple illustration of genomic duplication histories (Fig. 2). Vertebrate synteny groups, since divergence from amphioxus, are spread over several chromosomes, reflecting the well-known widespread gene duplication in the vertebrate lineage (Fig. 2A-C). Within the human or chicken genomes, the largest of these synteny groups are present on four chromosomes (Fig. 2 A and B), consistent with two WGD events. In zebrafish, the synteny groups spread over a greater number of chromosomes, compatible with an additional WGD in the fish lineage (Fig. 2C). As a control, artificial duplication of the human synteny groups produces a similar spread (Fig. 2D). Furthermore, zebrafish synteny groups built by comparison to the human genome peak at two chromosomes, supporting a single WGD event in the teleost lineage (Fig. 2E). However, many synteny groups are present on three chromosomes, possibly indicating additional duplication mechanisms at work. Conversely, human synteny groups, since divergence from zebrafish, peak sharply at one chromosome, the expected distribution when no WGD has occurred (Fig. 2F).

6 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Overall, these findings support the existence of two rounds of WGD in early vertebrates (commonly known as the 2R hypothesis), and one additional WGD in teleost fish, in agreement with other recent findings (reviewed in Kasahara 2007). In the lineages that are hypothesized to have experienced WGD, as the synteny group size decreases, the number of chromosomes also decreases, indicating that the majority of the observed duplication in these genomes were generated by a mechanism that retains synteny, such as WGD (Fig. 2A-D). If the duplications were generated solely by many local tandem duplications, as suggested by Hughes and Friedman (Friedman and Hughes 2001), we would expect to see the opposite trend: local duplications will be synteny disrupting, so regions duplicated by this mechanism will retain the least synteny. Indeed, this pattern is observed in the human lineage since its divergence from fish, where local segmental duplication has occurred at an increased rate (Bailey and Eichler 2006) (Fig. 2F). These results support data from the amphioxus genome project, which found striking 4-fold synteny conservation between the amphioxus and vertebrate genomes (Putnam et al. 2008). Together, these results seem to confirm the emerging consensus in support of the 2R WGD hypothesis.

Syntenic genes maintained in duplicate after WGD are enriched for specific functional classes Others reports have indicated that genes retained in duplicate after WGD events often show enrichment for particular functional classes. To test whether the syntenic gene segments that are maintained in duplicate after WGD events show similar functional biases, we used term enrichment to compare two sets of human genes—those that show syntenic conservation with amphioxus, and those genes that show duplicate in-genome synteny that was likely to have been created by the early vertebrate WGD events (Methods). The genes in the amphioxus-human syntenic pairs tend to function in metabolism and cellular physiological processes, while the anciently duplicated in-genome human syntenic genes are enriched for a variety of terms related to development, morphogenesis, activity, signalling and regulation (Table 2), indicating that different evolutionary forces may be selecting for synteny conservation and duplicate retention. Previous studies in Arabidopsis and within several vertebrate genomes, have similarly observed that genes retained after WGD are enriched for signalling, transcription regulation, and development indicating that this may be a general consequence of WGD events in plants and animals (Blanc and Wolfe 2004; Blomme et al. 2006; Brunet et al. 2006; Maere et al. 2005).

7 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Rates of synteny loss did not increase after the vertebrate WGD events With the existence of the 2R WGD events seemingly well-established, we assessed how these WGD events may have affected rates of synteny loss in vertebrates. We quantified the amount of synteny conservation between two species by calculating the number of shared syntenic gene pairs between the species, and then dividing this value by the number of gene pairs which could be shared if the genomes were ideally arranged (Fig. 3). Because both of these values are reduced by genome fragmentation and gene family loss, in general this syntenic pair measure is largely independent of genome assembly quality. Additional data filters are employed to help correct for structural differences in the genomes (Methods). To show that our metric of synteny conservation is indeed robust to genome fragmentation we simulated increasing fragmentation of the human genome, and measured the amount of synteny conservation with the amphioxus, fugu, or chicken genomes (Fig. 4A). In each simulation the increasing fragmentation of the human genome is quantified by measuring the artificial assembly’s ‘G50’ value—the gene number such that 50% of the assembled genome lies in scaffolds containing at least G50 genes. The measured conserved synteny values are highly consistent down to G50 values around 10-20, after which the variability increases slightly. In general, this indicates that our measures should be robust for our genomes of interest—the amphioxus genome assembly has a G50 value of 90, and the fugu and anemone assemblies both have G50 values of 52. Estimates within the sea urchin lineage should be regarded with the most suspicion since the current sea urchin genome assembly has a G50 value of only three. However even at this level of fragmentation our simulated estimates are reasonably reliable. Synteny conservation was measured between all pairwise combinations of sea anemone, sea urchin, amphioxus, human, mouse, chicken, fugu, and zebrafish. This species set was chosen to provide representative species on both sides of the 2R WGD events, while maintaining a set of species where all shared some clear aspects of synteny conservation. Each measure is bootstrapped giving both a 95% confidence estimate on the measure and showing that the measures are robust to significant incompleteness in the genomes (Table S1 and Methods). Not surprisingly, these measures show an exponential trend when plotted against evolutionary divergence time (Fig. 4B). Over increasing evolutionary distances, the number of remaining syntenic pairs decreases, and therefore the probability that a new rearrangement event disrupts an existing syntenic pair also decreases, creating an exponential decay process. As such, we converted our shared pairs proportions to a linear measure of

8 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

‘syntenic distance’ by taking the negative natural logarithm. The largest syntenic distance among these organisms was observed between sea anemone and zebrafish: 4.976 (4.735- 5.330). Using randomly shuffled sea anemone and zebrafish genomes, syntenic distances always exceeded 6.368 (from 50 iterations), indicating that even in our most extreme comparison the amount of conserved synteny is well above that expected by chance (p = 3.2 x 10-31). These syntenic distance measures can then be fit onto the known species tree using an additive tree model, and converted to synteny loss rates by dividing by the evolutionary time in each branch (Table 3 and Fig. 4C). Because the tree model used to estimate the branch based estimates of synteny loss could produce skewed results if individual organisms have erroneous data, such as bad gene orthology mapping or inconsistent estimates of divergence age, we tested the robustness of our estimates by eliminating each organism from our data, one at a time, and recalculating synteny loss for every branch possible (i.e. a leave-one-out analysis). Estimated branch lengths do appear to be robust; the range of values received is reported in Table 3. To estimate synteny loss rates around the early vertebrate WGD events, we reconstructed a set of syntenic gene pairs that were likely to have existed in the pre-2R ancestral genome, and used this information to measure the syntenic distance between amphioxus and the 2R WGD events (Methods). We then reallocated the synteny loss in branch n2-n3 before and after the 2R events (Table 3 and Fig. 4C). Despite the simplicity of our method, we find that our synteny loss rates are generally consistent with published rearrangement rate estimates generated from more complicated models, indicating that gene-pair based synteny loss is a reasonable estimator of genomic rearrangement rates (Table 3). We agree with previous reports that mice have experienced more synteny loss than or chicks in their terminal lineages (Burt et al. 1999; Hillier 2004), and that the rate of rearrangement between the tetrapod radiation and the mammalian radiation (nodes n4-n5) exceeded that in the human and chick lineages. Moreover, our estimates agree with Semon and Wolfe that fish have experienced more synteny loss in their terminal branches than tetrapods (Semon and Wolfe 2007). In the tested invertebrate lineages—amphioxus, sea urchin, and sea anemone—we find that all three have experienced a similar moderate rate of synteny loss (Fig. 4C). Interestingly, our estimates indicate that urchin has nearly as much conserved synteny with vertebrates as amphioxus, something which has not been previously appreciated in the literature, perhaps because of the extreme fragmentation of the current urchin genome

9 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. assembly. While it appears that amphioxus currently is the best single source of invertebrate information about the early chordate genome, our data indicates that its genome structure is not particularly strongly conserved, and therefore it cannot be assumed to be uniquely representative of the ancestral chordate genome. Intriguingly, the vertebrate 2R WGD were followed by a period of relatively low synteny loss (0.07, Fig. 4C). In fact, we estimate that the rate of synteny loss preceding the 2R WGD events was more than eight times higher than the rate afterwards. Indeed, the pre- 2R WGD period had the highest synteny loss rate observed in our analysis (0.61, Fig. 4C). Some of this synteny loss will have occurred between the two WGD events, however this period is believed to have been relatively brief, and as such is unlikely to fully account for the observed synteny loss spike (Furlong and Holland 2002; Gibson and Spring 2000; Vandepoele et al. 2004). Overall, these results indicate that the 2R WGD events did not increase the rate of synteny loss, and may have occurred during a pre-existing period of intense genome rearrangement. We measure relatively low rates of synteny loss around the teleost WGD (0.12, Fig. 4C). While this result seems to disagree with Semon and Wolfe, these authors lacked a suitable invertebrate outgroup, and as such could not separate the rates of rearrangement in the early tetrapod lineage (n3-n4) from those in the early fish lineage (n3-n6) (Semon and Wolfe 2007). Our estimates show that the rate of synteny loss in the early tetrapod lineage was more than twice as high as the rate in the early fish lineage (0.25 vs. 0.12). With this new information, it appears that the rearrangement rate around the teleost WGD event was lower than in most neighbouring branches. In an attempt to determine the synteny loss rate before and after the teleost WGD event, we employed the same method of ancestral reconstruction used for the 2R WGD events (Methods). Unfortunately, the error in our estimates exceeded the total amount of syntenic loss within this branch. While we note that the synteny loss appears higher after the teleost WGD, the large overlapping confidence intervals on these estimates make it impossible to draw any reliable conclusions (n3-WGD synteny loss = 0.06, 95% confidence = 0-0.15; WGD-n6 synteny loss = 0.38, 95% confidence = 0-0.71).

Discussion

Our data indicate that the early vertebrate whole genome duplication events (2R WGD) did not spark an increase in genomic rearrangement, but were in fact preceded by intense genome

10 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. rearrangement. These findings contrast with previous studies in plant polyploids (Pontes et al. 2004; Song et al. 1995), indicating that there is not a simple cause and effect relationship between WGD events and genome rearrangement. In fact, in vertebrates the opposite may be true: WGD events may be symptoms of existing genome instability. Naturally, these conclusions are based on our assumptions regarding the existence and timing of the 2R WGD events. A body of evidence from multiple genomes now seems to provide compelling evidence in support of the existence of the 2R WGD events (Dehal and Boore 2005; McLysaght et al. 2002; Panopoulou et al. 2003; Putnam et al. 2008; Vandepoele et al. 2004). We have assumed that these events were relatively closely spaced and centered on the divergence of the lamprey/hagfish lineage from jaw-vertebrates, as indicated by the most recent published reports (Nakatani et al. 2007; Putnam et al. 2008). Assuming an older age, closer to the cephalochordate divergence, merely exaggerates our conclusions, creating a more intense rearrangement spike prior to the WGD events. In the other direction, if we move the 2R WGD events as recent as the divergence of cartilaginous fish (525 mya)—a conservative lower bound for their age—there is still more rearrangement prior to the 2R WGD than after (0.40 vs 0.24). While information from additional chordate genomes will continue to refine these conclusions, it appears quite clear that there was not an increase in synteny loss after the 2R WGD events. The early vertebrate diversification appears to have been a hot spot of genome structural change. In addition to the two WGD events and the intense prior genome rearrangement, current evidence suggests that the lamprey and hagfish genomes, which diverged from jawed vertebrates around the time of the 2R WGD, may have undergone additional WGD events (Fried et al. 2003; Stadler et al. 2004). This indicates that these lineages may have also possessed a preposition toward structural genome change. From this data it is impossible to determine the cause of this evolutionary hot spot. However, because this time period coincides with an amazing phylogenetic diversification, it is tempting to speculate that there may have been selective pressures favouring structural genome change, possibly as a way of creating functional diversity or sparking speciation. Nonetheless, it is also possible that evolutionarily neutral processes, such as environmental changes or genetic mutations led to a general increase in genomic instability. Future research will be needed to resolve these issues, and genome sequence from organisms closer to the WGD events, such as lamprey and hagfish, could provide new insights.

11 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Our estimates of the genomic rearrangement rates around the teleost WGD event are somewhat less clear. A previous report indicated an increased rate of synteny loss in the early tetrapod and fish lineages (n3-n4 + n3-n6) (Semon and Wolfe 2007). By including invertebrate genomes in our analysis, we are able to divide the synteny loss in these branches, revealing that this increased rate is due to intense rearrangement in the early tetrapod lineage. Hence, the teleost WGD event appears to have occurred during a period of relatively low genome rearrangement (Fig. 4C). Nonetheless, we were unable to reliably determine the synteny loss rate before and after the teleost WGD event, so it remains possible that synteny loss did increase after the teleost WGD event. However, Nakatani et al. observed that the teleost WGD event was preceded by a period of intense chromosome fusions, possibly suggesting that the teleost WGD event also followed on the heels of prior genome instability (Nakatani et al. 2007). Whole genome duplications have been associated with genome rearrangement within another setting—cancer oncogenesis (reviewed in Ganem et al. 2007). Genome instability, characterized by frequent aneuploidy and genome rearrangement, is a common feature of cancerous cells, and is believed to play a key role in creating oncogenic genetic changes. Studies have associated tetraploidy with early stage cancer, however these abnormal tetraploidy events are generally associated with dysfunction in genes like p53 or Rb, key regulators of genomic integrity (Galipeau et al. 1996; Olaharski et al. 2006). Indeed, within p53-null cells, artificially-induced genome duplication can trigger genome instability and oncogenesis (Fujiwara et al. 2005). Hence, while tetraploidy can play a role in oncogenesis, it generally appears to first require changes in the genes that regulate genome stability. In support of this notion, many normal differentiating human cells undergo endoreplication— DNA replication without cell division—showing that polyploidy does not induce genome instability in normal cellular contexts (Ravid et al. 2002). While cancer oncogenesis and phylogenetic diversification occur over very different time scales, we may see a general theme emerging—WGD events alone are not sufficient to trigger genome instability or rearrangement, but instead often appear to associate with prior events that decrease overall genome stability.

12 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Methods

Common names are used for the following organisms: human (Homo sapiens), mouse (Mus musculus), chicken (Gallus gallus), zebrafish (Danio rerio), fugu (Takifugu rubripes), amphioxus (Branchiostoma floridae), sea urchin (Strongylocentrotus purpuratus), sea anemone (Nematostella vectensis), sea squirt (Ciona intestinalis).

Orthology detection Gene lists and genomic locations were extracted from Ensembl release 42 for all the vertebrate organisms, JGI v1.0 for amphioxus (B. floridae) and anemone (N. vectensis), and Spur v2.1 with Gnomon gene predictions for sea urchin (S. purpuratus). Orthology was assigned using the blast-based method Inparanoid, which creates groups of genes between two-species that are likely to be related to a single gene in their common ancestor (Remm et al. 2001).

Identifying syntenic gene pairs When searching for syntenic gene pairs between two genomes, we first grouped genes into orthologous families, and then identified cases where genes from a pair of gene families (a ‘family combination’) were observed in close proximity in more than one location (Fig. 1). Genes were defined to be in close proximity if they had no more than ten intervening genes. When defining the intervening genes between potential syntenic gene pairs, only coding genes were counted, and genes were treated as one-dimensional objects located at the genes’ predicted start sites. Family combinations which have proximate gene pairs in more than one location, either across the two genomes of interest or within a single genome are assumed to have evidence of syntenic conservation. Family combinations that only have in- genome synteny are required to have gene pairs on at least two chromosomes, helping to remove groups created by recent local duplication. Synteny groups were created by exhaustively merging all syntenic family combinations that had gene pairs which shared a gene. Family combinations with in-genome synteny within the target-genome and/or cross-genome synteny with the reference-genome were used. For these analyses we eliminated all vertebrate gene families with 95 or more genes, to prevent these gene families from forming spurious linkages. This threshold eliminated only two classes of genes: olfactory receptor genes which have seen dramatic expansion in tetrapods, and zebrafish LINE transposable elements.

13 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

When compared to our 2003 results, we found that cross-genome amphioxus-human synteny reveals as set of syntenic regions which are largely different from the syntenic regions we previously identified by human in-genome synteny. Hence, by combining both sources of information we greatly improve our ability to detect conserved synteny. In the present analysis, in-genome human synteny defines syntenic segments that cover 652.8 Mb of the human genome, while amphioxus-human synteny defines segments that cover 496.7 Mb (Fig. S1). These two sets overlap by only 62.1 Mb, defined by 84 syntenic segments containing 44 different family combinations. Fig. S2 shows a Cohen-Friendly association plot summarizing the synteny association between human chromosomes and amphioxus scaffolds (Cohen 1980; Friendly 1992).

Gene Ontology analysis The DAVID Bioinformatics Resource 2007 was used to look for enriched GO terms in different classes of human genes with ancient synteny (Table 2) (Dennis et al. 2003). The human-amphioxus gene list includes all human genes contained within proximate gene-pairs present in both the human and amphioxus genomes. The human ancient in-genome syntenic gene list includes all human genes contained within in-genome syntenic pairs where phylogeny based evidence indicates that duplication happened prior to the fish-tetrapod divergence (described in more detail in the last Methods section). The background population was the union of both gene lists—representing the set of all human genes for which we have evidence of ancient syntenic conservation. P-values reported are multiple-test corrected according to the Benjamini-Hochberg method implemented in DAVID. We report all enriched terms from levels 1 and 2 of the molecular function and biological process ontologies with corrected p-values less than 0.01.

Measuring rates of synteny loss Cross-genome syntenic family combinations between each pair of genomes were identified as previously described, with two additional filters that help correct for structural biases in the genomes. First, gene duplications can mask rearrangement events, creating biased estimates of synteny, especially when measuring across whole genome duplication events. Therefore, orthology groups are required to have only a single gene within vertebrate genomes. We do not similarly filter orthology groups in amphioxus, and other invertebrates, since these organisms’ current genome builds contain mixtures of haplotypes, which create the appearance of far more gene duplicates than truly exist. Genuine gene duplication in these

14 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. invertebrate lineages may mask some rearrangement, so we consider our calculated rates of synteny loss in these lineages to be minimum estimates. Second, highly fragmented genome assemblies may contain a disproportionately high number of gene pairs that are immediately adjacent, and these adjacent pairs are more likely to be conserved through evolution than pairs with several intervening genes. Hence, we exclude all chromosomes or scaffolds with less than 10 genes prior to synteny calculations. To calculate synteny conservation, we count the number of family combinations with proximate genes in at least one location in each genome (total proximate family combinations), and then count the number of family combinations with proximate genes in both genomes (shared family combinations). The shared synteny proportion is calculated as the number of ‘shared family combinations’ divided by the smaller of the ‘total proximate family combinations’ from the two genomes. Each measure is bootstrapped giving both a 95% confidence estimate on the measure and showing that the measures are robust to significant incompleteness in the genomes (Table S1). In each iteration, 50% of the family combinations in the smaller genome are removed randomly, and a new proportion is calculated. Confidence intervals shown are each based on 1000 iterations. The shared synteny proportion is the product of an exponential decay process which can be described by the following formula, where Nt is the observed number of shared family combinations, N0 is the maximum number that could be shared, λ is the rate of synteny loss, and t is the evolutionary divergence time.

-λt Nt / N0 = e

Hence, we can calculate a time linear ‘syntenic distance’ (λt), by taking the negative natural logarithm of our syntenic shared pair values. These syntenic distance measures are then used to calculate the amount of synteny loss on each branch of the species tree using the least-squares-based Fitch-Margoliash method implemented by PHYLIP (Fitch and Margoliash 1967; Retief 2000). This method has the advantage of not relying on ancestral reconstructions at each node, and has a long history of use in distance-based phylogenetics. Branch estimates of synteny loss can then be converted to synteny loss rates (λ) by dividing by the amount of evolutionary time in each branch. All synteny loss rates presented in the paper have been multiplied by 100, to improve readability. Divergence times were based on the best nuclear gene estimates in the current literature (Blair

15 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. and Hedges 2005; Blair et al. 2005; Hedges et al. 2004; Hurley et al. 2007). The 2R WGD events were assumed to be centered on the divergence of the lamprey lineage from jaw- vertebrates (652 mya), as indicated in other reports (Nakatani et al. 2007; Putnam et al. 2008). C. intestinalis was not included in our tree-based rate estimates because of concerns about the exact placement of tunicates on the species tree (addressed in Putnam et al. 2008), however its genome is clearly exceptionally rearranged, confirming previous observations (Ikuta et al. 2004). When compared to amphioxus it has a syntenic distance of 4.83, exceeding the syntenic distance measured between amphioxus and anemone, despite the fact that their divergence is many hundreds of millions years older. Synteny conservation was measured between all pairwise combinations of sea anemone, sea urchin, amphioxus, human, mouse, chicken, fugu, and zebrafish. This species set was chosen to provide representatives on both sides of the 2R WGD events, which also satisfied two simple criteria: 1) the species must have publicly available genome sequence, and 2) the species must show aspects of clear synteny conservation with vertebrates. In phyla where the amount of conserved synteny approaches the amount expected by chance, our syntenic distance metric begins to saturate, leading to increasingly large 95% confidence intervals. This concern led us to specifically exclude from our analysis nematodes, insects, and tunicates. All of the syntenic distance estimates presented here defined gene proximity using the 10 gene interval shown to be appropriate in our previous work (Panopoulou et al. 2003), however synteny loss calculations using intervals of 5 and 15 genes produced highly similar results (data not shown). In both cases the estimated branch lengths had a Pearson correlation exceeding 0.99 when compared to the 10-gene interval values reported here. We verified the robustness of our branch estimates with a leave-one-analysis (see Results and Table 3), however some concern was raised that this analysis might be skewed by the greater number of vertebrates relative to invertebrates. In response to this concern, we made a series of phylogenetically-balanced trees that included the three available invertebrates and a selection of three vertebrate genomes. For these vertebrate genomes, we tested all combinations that included: 1) at least one tetrapod and one fish, 2) did not include both humans and mice, since these genomes quite closely related (7 total possibilities). Among these trees, synteny loss rate varied from 0.366-0.392 for branch n2-n3, 0.159-0.166 for Bf- n2, 0.182-0.191 for Sp-n1, and 0.142-0.143 for Nv-n1. These intervals are highly similar to those already reported in Table 3 for the leave-one-analysis reported, and exactly the same for

16 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. the sea anemone lineage (0.367-0.392, 0.158-0.169, 0.184-0.192, 0.142-0.143, respectively). These values indicate that the branch estimates are not skewed by organism distribution, and validate the leave-one-out analysis.

Measuring syntenic distance between amphioxus and the 2R WGD events To identify a set of syntenic gene pairs which were present in the vertebrate genome during the 2R WGD events, we began by identifying family combinations in human, fugu, or zebrafish that have conserved in-genome synteny, i.e. proximate gene pairs have been preserved in multiple places in the same genome. We then built maximum likelihood phylogenetic trees for all the gene families in these family combinations, using the amphioxus orthologs to root the trees (Frickey and Lupas 2004; Schmidt et al. 2002), and subsequently select family combinations where the genes within the in-genome gene pairs were generated by duplication prior to the tetrapod-teleost split, and declare these ancient vertebrate pairs. If we assume that the majority of gene duplication occurring prior to the tetrapod-teleost split was generated by the vertebrate 2R WGD events—an assumption supported by our synteny group analysis (Fig. 2)—then we can infer that the majority of these ancient vertebrate gene pairs will have been present in the vertebrate genome during 2R WGD events. This process identifies 419 family combinations in the human genome, 113 in fugu, and 70 in zebrafish. Together, this makes 513 qualifying ancient vertebrate pairs, of which only 28 are conserved in amphioxus. The syntenic distance between the amphioxus genome and these pre-2R ancient pairs was estimated to be 2.91 (2.17-3.38). This syntenic distance is based on a relatively small set of inferred syntenic family combinations, and naturally we were wary that such sets may not produce accurate estimates of syntenic distance. To assure that this type of comparison is valid, we also tested ancestral sets that were inferred to be present at the tetrapod-teleost divergence (node n3), and at the zebrafish-fugu divergence (node n6), and then compared these estimates to the values generated from our previous whole genome comparisons. For the tetrapod-teleost divergence we selected the family combinations common to the human genome and at least one fish genome, a set of 9161 pairs, and compared this set to the amphioxus genome, measuring a syntenic distance of 2.96 (2.88-3.06), close to the distance of 3.02 estimated by our additive tree model (Table 3: Bf-n2 + n2-n3). To obtain a set of ancestral family combinations of similar size to our 2R WGD set, we randomly selected sets of 500 family combinations from

17 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. these 9161 tetrapod-teleost pairs. In 100 trails, the mean syntenic distance was 2.98, and the estimated 95% confidence intervals contained 3.02 exactly 95% of the time. Similar results were obtained from the ancestral set inferred to exist at the divergence of zebrafish and fugu. For 100 trails of 500 random family combinations, the mean distance to amphioxus was 3.48, and the confidence intervals contained the previous estimate of 3.32 (Table 3: Bf-n2 + n2-n3 + n3-n6) in 91% of the cases. Hence, syntenic distances calculated from ancestral sets are reliable, and the confidence intervals appear to generally account for the increase in uncertainty. We used similar ancestral reconstruction to try to subdivide the synteny loss before and after the teleost specific WGD event. The human, fugu, zebrafish, and amphioxus phylogenetic trees were used to identify syntenic gene-pairs that duplicated prior to the teleost divergence (n6), but after the tetrapod-fish divergence (n3). This analysis identified 135 family combinations that were likely to be present immediately prior to the teleost WGD event. The syntenic distance between these pairs and the human genome is 0.75 (0.59-0.93). The 95% confidence interval for this measure is narrower than the interval calculated for the 2R WGD reconstruction (0.34 vs 1.21), but nonetheless, because the total amount of synteny loss around the teleost WGD event is so low (Table 3: n3-n6 = 0.30), the confidence intervals for the resulted synteny loss rates before and after the teleost WGD event are large and overlapping (n3-WGD, synteny loss rate = 0.06, 95% confidence = 0-0.15; WGD-n6, synteny loss rate = 0.38, 95% confidence = 0-0.71). Hence, the relatively low amount of synteny loss in this branch prevents us from determining the synteny loss before and after the teleost WGD with any reliability.

Acknowledgements We thank N. Putnam from the JGI amphioxus genome project for sharing their unpublished data with us, and S. Haas, P. Polak, and A. Borchers for helpful comments. This work was supported by the Max-Planck Society (Max-Planck Gesellschaft zur Forderung der Wissenschaften e.v.).

18 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Figure Legends

Figure 1. Identifying and grouping pairs of syntenic genes. Syntenic gene pairs were identified in three steps. First, genes from the two genomes are grouped into orthologous families (A, B, and C). Next, both genomes are searched for syntenic ‘family combinations’—pairs of genes from two ortholog families that are found in close proximity in more than one genomic location. Genes are in close proximity if they have no more than ten intervening genes (shown here as red x’s). In this example, three syntenic family combinations are identified (A-B, A-C, and B-C). The A-B combination is present in both amphioxus and humans, while the A-C and B-C combinations are only present in humans. These syntenic gene pairs can then be merged to generate segments of syntenic genes (A1-B1, A2-B2-C1, and A3-B3-C2). In the human genome these syntenic segments form a ‘synteny group’ which contains three gene families and is present on two chromosomes.

Figure 2. Synteny group distributions reveal vertebrate duplication events. The size of each syntenic group, measured by the number of gene families in the group, is plotted against the number of chromosomes over which the group is spread. The bubble sizes are proportional to the number syntenic groups at each point. (A-C) Vertebrate synteny groups built by comparison to amphioxus: (A) human, (B) chicken, and (C) zebrafish. In (A) and (B), the largest groups, with the most conserved synteny, are present on four chromosomes, and a steady reduction in chromosome coverage is seen as the group size decreases. (C) The zebrafish groups are spread out past four chromosomes. (D) As a control, the chromosome coverage of the human groups shown in (A) were doubled, creating a simulation of a new WGD event on top of the early chordate duplications. This plot shows a similar chromosome spread to (C), and post-WGD gene loss in zebrafish could account for the sparser plot. (E) Zebrafish synteny groups, built by comparison to the human genome, show that the largest synteny groups cover 2-3 chromosomes. (F) Human synteny groups, built by comparison to the zebrafish genome, show a strong peak at 1 chromosome, as expected in the absence of WGD events. Arrowheads within the plots indicate the bubbles that contain the Hox clusters. In comparisons between amphioxus and vertebrates (A-D), the Hox genes form a single a synteny group, while in comparisons between fish and tetrapods (E-F), they subdivide into

19 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al. four separate groups, indicating that the cluster duplicated twice within the early vertebrate lineage.

Figure 3. Estimating the amount of synteny conservation between two genomes. This figure illustrates the method we use to calculate the ‘syntenic distance’ between pairs of genomes. (A) Two genomes, X and Y, share eight orthologous genes, present on three genome fragments in the genome X, and two in genome Y. (B) These orthologous genes can be decomposed into proximate gene pairs (see Methods and Figure 1). (C) From these gene pairs, we can calculate the shared synteny proportion and convert this proportion into a time- linear distance measure by taking the negative natural logarithm. This is a highly simplified case—for analyses of real genomes, genome fragments are required to have at least 10 genes.

Figure 4. Rates of synteny loss throughout vertebrates and their ancestors. (A) Estimates of conserved synteny are robust to genome fragmentation. The human genome was artificially fragmented into scaffolds of random lengths according to a Pareto distribution where k satisfies the equation, scaffold size = 1/U1/k, and U is a random number between 0 and 1. These fragmented human genomes were then compared to amphioxus (Bf), chicken (Gg), or zebrafish (Dr). As k increases, the G50 size of the fragmented human genome decreases (dashed lines), but the syntenic shared pair metric remains relatively consistent (solid lines). G50 is the gene number such that 50% of the assembled genome lies in scaffolds containing at least G50 genes. (B) Conserved synteny was measured between all pairwise combinations of human, fugu, zebrafish, chicken, mouse, amphioxus, sea urchin, and sea anemone and then plotted relative to the divergence age of the comparison. The values are well fit by an exponential curve. (C) Syntenic distances were apportioned to the known species tree and then divided by the estimated evolutionary time in each branch to obtain rates of synteny loss. Internal nodes are labeled n1-6. The highest rates of loss are observed in the period after the vertebrate divergence from amphioxus but before the early vertebrate WGD events (n2-2R WGD), and in the terminal zebrafish lineage (n6-Dr). Species abbreviations are human (Hs), mouse (Mm), chicken (Gg), fugu (Tr), zebrafish (Dr), amphioxus (Bf), sea urchin (Sp), sea anemone (Nv).

20 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Supplementary Figure Legends

Supplementary Figure S1. Coverage of the human genome by conserved syntenic segments. Blue segments have conserved synteny with amphioxus; red segments have conserved synteny within the human genome. Black lines show the locations of all the human genes that have an amphioxus ortholog. Centromeres are shown as yellow ovals.

Supplementary Figure S2. Cohen-Friendly association plot between the human and amphioxus genomes. The width of each block indicates the number of shared syntenic family combinations between the chromosome and scaffold. The 20 amphioxus scaffolds with the most syntenic gene pairs are shown. The block height indicates the significance of the association, based on a normal distribution (Pearson residuals). Blue bars denote more shared synteny than expected, and red bars denote less than expected.

21 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Tables

Table 1. Synteny conservation between amphioxus and other organisms Organisms Orthologous Syntenic Syntenic Amphioxus 2nd genome compared gene family segments genome covered covered by families combina- by syntenic gene syntenic gene tions pairs (Mb) pairs (Mb)

Bf-Hs 7576 1,188 807 103 (20.3%) 497 (16.1%) Bf-Mm 7443 1132 791 96 (18.8%) 337 (12.6%) Bf-Gg 6832 2,624 747 100 (19.6%) 222 (20.2%) Bf-Dr 6304 519 535 51 (10.0%) 117 (8.1%) Bf-Tr 6501 563 575 58 (11.3%) 28 (7.1%) Bf-Ci 4955 154 173 17 (3.3%) 6 (3.6%) Bf-Sp 8460 936 738 67 (13.2%) 49 (6.1%) Bf-Nv 8041 733 565 63 (12.3%) 22 (4.7%) Species abbreviations: human (Hs), mouse (Mm), chicken (Gg), fugu (Tr), zebrafish (Dr), amphioxus (Bf), sea urchin (Sp), sea squirt (Ci), and sea anemone (Nv).

22 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Table 2. Gene Ontology term enrichment of syntenic genes

Category Term Count Percent P-value human-amphioxus syntenic genes BP_2 metabolism 960 46.1 2.1E-16 BP_2 cellular physiological process 1192 57.2 3.0E-07 BP_1 physiological process 1253 60.2 3.5E-07 MF_2 structural constituent of ribosome 37 1.8 8.5E-04 MF_2 nucleic acid binding 443 21.3 5.5E-04 BP_2 response to endogenous stimulus 44 2.1 4.8E-04 MF_1 catalytic activity 599 28.8 9.0E-03 human ancient in-genome syntenic genes MF_1 signal transducer activity 237 20.1 8.8E-20 BP_2 cell communication 329 27.8 1.5E-17 BP_1 development 252 21.3 1.2E-15 BP_2 organ development 85 7.2 1.7E-08 MF_2 receptor activity 141 11.9 7.7E-08 MF_1 binding 799 67.6 3.6E-07 MF_2 channel or pore class transporter activity 58 4.9 8.6E-07 BP_2 morphogenesis 92 7.8 5.6E-06 MF_1 transcription regulator activity 177 15 1.1E-05 MF_2 ion transporter activity 81 6.9 2.3E-05 BP_2 cell differentiation 70 5.9 4.3E-05 BP_2 system development 80 6.8 4.5E-05 MF_2 protein binding 408 34.5 5.3E-05 MF_2 transcription factor activity 142 12 1.5E-04 MF_2 receptor binding 57 4.8 1.7E-04 MF_2 GTPase regulator activity 52 4.4 2.1E-04 BP_2 organismal physiological process 109 9.2 2.9E-04 MF_1 enzyme regulator activity 75 6.3 6.8E-04 BP_1 regulation of biological process 331 28 1.0E-03 BP_2 positive regulation of biological process 66 5.6 1.7E-03 BP_2 cell adhesion 54 4.6 1.8E-03 MF_2 ion binding 295 25 1.8E-03 BP_2 regulation of physiological process 305 25.8 3.3E-03 BP_2 regulation of cellular process 311 26.3 3.7E-03 MF_2 lipid binding 39 3.3 3.7E-03 Enriched terms are shown from the first two levels of the biological process (BP) and molecular function (MF) GO ontologies.

23 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Table 3. Estimates of genomic rearrangement in vertebrates and their ancestors. Evolutionary Time Syntenic Synteny previous published estimates Branch (million years) Distance Loss Rate Burt Hillier Semon 1700 (1555- Nv-n1 1845) 2.41 0.142 (0.142-0.143) Sp-n1 896 (832-1022) 1.67 0.187 (0.184-0.192) n1-n2 5 (0-131) 0.00 0 (0-0) Bf-n2 891 (810-1067) 1.45 0.163 (0.158-0.169) n2-n3 415 (334-591) 1.58 0.380 (0.367-0.392) Gg-n4 326 (311-354) 0.04 0.011 (0.008-0.014) <0.09* 0.11 0.046 n4-n5 206 (191-234) 0.20 0.095 (0.094-0.100) <0.09* 0.23 0.059 Hs-n5 120 (100-140) 0.06 0.048 (0.034-0.049) 0.48 0.07 0.022 Mm-n5 120 (100-140) 0.09 0.078 (0.077-0.092) 0.95 0.18 0.078 n3-n4 150 (116-168) 0.38 0.251 (0.203-0.284) 0.096* n3-n6 253 (219-271) 0.30 0.117 (0.106-0.126) 0.096* Tr-n6 223 (181-265) 0.62 0.277 (0.273-0.280) 0.083 Dr-n6 223 (181-265) 0.79 0.354 (0.351-0.358) 0.181 n2-2R WGD 239 (149-286) 1.46 0.611 (0.486-0.808) 2R WGD- n3 176 (129-266) 0.12 0.066 (0-0.235)

Branches are from the species tree shown in Figure 4. Species abbreviations: human (Hs), mouse (Mm), chicken (Gg), fugu (Tr), zebrafish (Dr), amphioxus (Bf), sea urchin (Sp) and sea anemone (Nv). The intervals reported with the synteny loss rates represent that range of values observed for each branch in our leave-one-analysis. Around the 2R WGD events, synteny loss rate intervals are derived from the resampling based 95% confidence intervals on the syntenic distance between amphioxus and the 2R WGD events. Previous published estimates were renormalized by the evolutionary divergence ages in column 2 for consistency. *authors lacked an outgroup, so rates are an average across both branches.

24 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Supplementary Tables

Supplementary Table S1. Pairwise syntenic conservation and syntenic distances.

Hs Mm Gg Tr Dr Bf Sp Nv 0.859 0.748 0.222 0.181 0.025 0.021 0.010 (0.857- (0.745- (0.217- (0.176- (0.023- (0.016- (0.008- Hs 0.861) 0.751) 0.227) 0.185) 0.027) 0.025) 0.011) 0.151 0.724 0.204 0.167 0.027 0.018 0.010 (0.149- (0.721- (0.199- (0.163- (0.025- (0.014- (0.008- Mm 0.154) 0.727) 0.209) 0.172) 0.030) 0.021) 0.011) 0.291 0.323 0.266 0.222 0.033 0.025 0.013 (0.287- (0.318- (0.259- (0.215- (0.030- (0.021- (0.011- Gg 0.295) 0.327) 0.273) 0.228) 0.036) 0.029) 0.015) 1.505 1.589 1.325 0.245 0.018 0.013 0.007 (1.483- (1.567- (1.299- (0.241- (0.015- (0.009- (0.006- Tr 1.528) 1.612) 1.351) 0.249) 0.021) 0.017) 0.009) 1.712 1.787 1.506 1.407 0.016 0.016 0.007 (1.686- (1.761- (1.478- (1.391- (0.013- (0.011- (0.005- Dr 1.740) 1.816) 1.536) 1.424) 0.018) 0.021) 0.009) 3.688 3.609 3.415 4.022 4.156 0.047 0.020 (3.600- (3.520- (3.338- (3.883- (3.991- (0.043- (0.018- Bf 3.774) 3.609) 3.504) 4.177) 4.348) 0.050) 0.021) 3.878 4.040 3.676 4.326 4.113 3.055 0.016 (3.695- (3.853- (3.525- (4.061- (3.849- (2.990- (0.014- Sp 4.108) 4.257) 3.865) 4.657) 4.500) 3.136) 0.018) 4.625 4.644 4.327 4.896 4.976 3.926 4.133 (4.467- (4.493- (4.176- (4.680- (4.735- (3.858- (4.009- Nv 5.183) 4.834) 4.499) 5.183) 5.330) 4.001) 4.287)

Cells above diagonal: proportion of shared syntenic pairs; cells below diagonal: syntenic distance. Parentheses contain 95% confidence intervals estimated by resampling. Human (Hs), mouse (Mm), chicken (Gg), fugu (Tr), zebrafish (Dr), amphioxus (Bf), sea urchin (Sp) and sea anemone (Nv).

25 Hufton et al.

Table S2. Human syntenic gene pairs from family combinations that occur in the amphioxus, human, chicken, zebrafish and fugu genomes. Downloaded from (Ensembl version NCBI36.42). Human id GeneAlias GeneDescription Human id GeneAlias GeneDescription ENSG00000156650 MYST4 "Histone acetyltransferase MYST4 ENSG00000156110 ADK "Adenosine kinase (AK) (Adenosine 5'-phosphotransferase). [Source:Uniprot/SWISSPROT;Acc:Q8WYB5]" [Source:Uniprot/SWISSPROT;Acc:P55263]"

ENSG00000165370 GPR101 "Probable G-protein coupled receptor 101. ENSG00000147274 RBMX "Heterogeneous nuclear ribonucleoprotein G (hnRNP G) genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:Q96P66]" [Source:Uniprot/SWISSPROT;Acc:P38159]" ENSG00000196935 SRGAP1 "SLIT-ROBO Rho GTPase-activating protein 1 ENSG00000118600 TMEM5 "Transmembrane protein 5. [Source:Uniprot/SWISSPROT;Acc:Q7Z6B7]" [Source:Uniprot/SWISSPROT;Acc:Q9Y2B1]" ENSG00000103546 SLC6A2 "Sodium-dependent noradrenaline transporter ENSG00000087253 AYTL1 "Acyltransferase-like 1 (EC 2.3.1.-). [Source:Uniprot/SWISSPROT;Acc:P23975]" [Source:Uniprot/SWISSPROT;Acc:Q7L5N7]" ENSG00000198682 PAPSS2 "Bifunctional 3'-phosphoadenosine 5'-phosphosulfate ENSG00000107789 MINPP1 "Multiple inositol polyphosphate phosphatase 1 precursor synthetase 2 [Source:Uniprot/SWISSPROT;Acc:O95340]" [Source:Uniprot/SWISSPROT;Acc:Q9UNW1]" onSeptember26,2021-Publishedby ENSG00000185697 MYBL1 "Myb-related protein A (A-Myb). ENSG00000180828 BHLHB5 "basic helix-loop-helix domain containing, class B, 5 [Source:Uniprot/SWISSPROT;Acc:P10243]" [Source:RefSeq_peptide;Acc:NP_689627]" ENSG00000186862 PDZD7 "PDZ domain-containing protein 7. ENSG00000107816 LZTS2 "Leucine zipper putative tumor suppressor 2 (Protein [Source:Uniprot/SWISSPROT;Acc:Q9H5P4]" LAPSER1). [Source:Uniprot/SWISSPROT;Acc:Q9BRK4]" ENSG00000180806 HOXC9 "Homeobox protein Hox-C9 (Hox-3B). ENSG00000123364 HOXC13 "Homeobox protein Hox-C13 (Hox-3G). [Source:Uniprot/SWISSPROT;Acc:P31274]" [Source:Uniprot/SWISSPROT;Acc:P31276]" ENSG00000128709 HOXD9 "Homeobox protein Hox-D9 (Hox-4C) (Hox- ENSG00000128714 HOXD13 "Homeobox protein Hox-D13 (Hox-4I). 5.2).[Source:Uniprot/SWISSPROT;Acc:P28356]" [Source:Uniprot/SWISSPROT;Acc:P35453]" ENSG00000175354 PTPN2 "Tyrosine-protein phosphatase non-receptor type 2 (EC ENSG00000128789 TNFSF5IP1 "tumor necrosis factor superfamily, member 5-induced 3.1.3.48) (T-cell protein-tyrosine phosphatase) (TCPTP). protein 1 [Source:RefSeq_peptide;Acc:NP_064617]" [Source:Uniprot/SWISSPROT;Acc:P17706]" ENSG00000150275 PCDH15 "Protocadherin-15 precursor. ENSG00000198964 TMEM23 "Phosphatidylcholine:ceramide cholinephosphotransferase 1 [Source:Uniprot/SWISSPROT;Acc:Q96QU1]" [Source:Uniprot/SWISSPROT;Acc:Q86VZ5]" ENSG00000121879 PIK3CA "Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic ENSG00000172667 ZMAT3 "p53 target protein isoform 1 subunit alpha isoform [Source:RefSeq_peptide;Acc:NP_071915]"

[Source:Uniprot/SWISSPROT;Acc:P42336]" Cold SpringHarborLaboratoryPress ENSG00000204052 C6orf154 "Uncharacterized protein C6orf154. ENSG00000137221 TJAP1 "Tight junction-associated protein 1 (Tight junction protein [Source:Uniprot/SWISSPROT;Acc:Q5JTD7]" 4) [Source:Uniprot/SWISSPROT;Acc:Q5JTD0]" ENSG00000107537 PHYH "Phytanoyl-CoA dioxygenase, peroxisomal precursor ENSG00000065328 MCM10 "minichromosome maintenance protein 10 isoform 2 [Source:Uniprot/SWISSPROT;Acc:O14832]" [Source:RefSeq_peptide;Acc:NP_060988]" ENSG00000142408 CACNG8 "Voltage-dependent calcium channel gamma-8 subunit ENSG00000105605 CACNG7 "Voltage-dependent calcium channel gamma-7 subunit [Source:Uniprot/SWISSPROT;Acc:Q8WXS5]" [Source:Uniprot/SWISSPROT;Acc:P62955]" ENSG00000075461 CACNG4 "Voltage-dependent calcium channel gamma-4 subunit ENSG00000075429 CACNG5 "Voltage-dependent calcium channel gamma-5 subunit [Source:Uniprot/SWISSPROT;Acc:Q9UBN1]" [Source:Uniprot/SWISSPROT;Acc:Q9UF02]" ENSG00000147050 UTX "Ubiquitously transcribed X chromosome tetratricopeptide ENSG00000069509 FUNDC1 "FUN14 domain containing 1 repeat protein [Source:Uniprot/SWISSPROT;Acc:O15550]" [Source:RefSeq_peptide;Acc:NP_776155]" ENSG00000197912 SPG7 "Paraplegin (EC 3.4.24.-) (Spastic paraplegia protein 7). ENSG00000167522 ANKRD11 "Ankyrin repeat domain-containing protein 11 (Ankyrin [Source:Uniprot/SWISSPROT;Acc:Q9UQ90]" repeat-containing cofactor 1). [Source:Uniprot/SWISSPROT;Acc:Q6UB99]" ENSG00000144381 HSPD1 "60 kDa heat shock protein, mitochondrial precursor (Hsp60) ENSG00000115520 COQ10B "Protein COQ10 B, mitochondrial precursor.

(60 kDa chaperonin) (CPN60) (Heat shock protein 60) (HSP- [Source:Uniprot/SWISSPROT;Acc:Q9H8M1]" 60) (Mitochondrial matrix protein P1) (P60 lymphocyte

26 Hufton et al.

protein) (HuCHA60). [Source:Uniprot/SWISSPROT;Acc:P10809]" Downloaded from ENSG00000174989 FBXW8 "F-box/WD repeat protein 8 (F-box and WD-40 domain protein ENSG00000111412 C12orf49 "CDNA: FLJ21415 fis, clone COL04030 ( 8) (F-box only protein 29). open reading frame 49). [Source:Uniprot/SWISSPROT;Acc:Q8N3Y1]" [Source:Uniprot/SPTREMBL;Acc:Q9H741]" ENSG00000154316 TDH "L-threonine dehydrogenase (TDH) on ENSG00000171044 XKR6 "XK-related protein 6. [Source:RefSeq_dna;Acc:NR_001578]" [Source:Uniprot/SWISSPROT;Acc:Q5GH73]" ENSG00000165861 ZFYVE1 "Zinc finger FYVE domain-containing protein 1 ENSG00000119599 WDR21A "WD repeat protein 21A.

[Source:Uniprot/SWISSPROT;Acc:Q9HBF4]" [Source:Uniprot/SWISSPROT;Acc:Q8WV16]" genome.cshlp.org ENSG00000163462 TRIM46 "Tripartite motif-containing protein 46 ENSG00000169242 EFNA1 "Ephrin-A1 precursor [Source:Uniprot/SWISSPROT;Acc:Q7Z4K8]" [Source:Uniprot/SWISSPROT;Acc:P20827]" ENSG00000163462 TRIM46 "Tripartite motif-containing protein 46 ENSG00000143590 EFNA3 "Ephrin-A3 precursor (EPH-related receptor tyrosine kinase [Source:Uniprot/SWISSPROT;Acc:Q7Z4K8]" ligand 3) [Source:Uniprot/SWISSPROT;Acc:P52797]" ENSG00000163462 TRIM46 "Tripartite motif-containing protein 46 ENSG00000143620 EFNA4 "Ephrin-A4 precursor (EPH-related receptor tyrosine kinase [Source:Uniprot/SWISSPROT;Acc:Q7Z4K8]" ligand 4) (LERK-4).

[Source:Uniprot/SWISSPROT;Acc:P52798]" onSeptember26,2021-Publishedby ENSG00000112210 RAB23 "Ras-related protein Rab-23. ENSG00000112208 BAG2 "BAG family molecular chaperone regulator 2 (BCL2- [Source:Uniprot/SWISSPROT;Acc:Q9ULC3]" associated athanogene 2) (BAG-2). [Source:Uniprot/SWISSPROT;Acc:O95816]" ENSG00000133065 SLC41A1 "solute carrier family 41 member 1 ENSG00000174514 MFSD4 "major facilitator superfamily domain containing 4 [Source:RefSeq_peptide;Acc:NP_776253]" [Source:RefSeq_peptide;Acc:NP_857595]" ENSG00000078295 ADCY2 "Adenylate cyclase type 2 ENSG00000037474 NSUN2 "NOL1/NOP2/Sun domain family 2 protein [Source:Uniprot/SWISSPROT;Acc:Q08462]" [Source:RefSeq_peptide;Acc:NP_060225]" ENSG00000148719 DNAJB12 "DnaJ homolog subfamily B member 12. ENSG00000168209 DDIT4 "RTP801 [Source:RefSeq_peptide;Acc:NP_061931]" [Source:Uniprot/SWISSPROT;Acc:Q9NXW2]" ENSG00000134851 TMEM165 "Transmembrane protein 165 (Transmembrane protein ENSG00000180613 GSH2_HUMAN "Homeobox protein GSH-2. TPARL) [Source:Uniprot/SWISSPROT;Acc:Q9HC07]" [Source:Uniprot/SWISSPROT;Acc:Q9BZM3]" ENSG00000111231 ATPBD1C "ATP binding domain 1 family, member C ENSG00000111229 ARPC3 "Actin-related protein 2/3 complex subunit 3 (ARP2/3 [Source:RefSeq_peptide;Acc:NP_057385]" complex 21 kDa subunit) (p21-ARC). [Source:Uniprot/SWISSPROT;Acc:O15145]" ENSG00000176658 MYO1D "Myosin Id. [Source:Uniprot/SWISSPROT;Acc:O94832]" ENSG00000176749 CDK5R1 "Cyclin-dependent kinase 5 activator 1 precursor (CDK5

activator 1) [Source:Uniprot/SWISSPROT;Acc:Q15078]" Cold SpringHarborLaboratoryPress ENSG00000135119 TMEM118 "transmembrane protein 118 (TMEM118), mRNA ENSG00000111412 C12orf49 "CDNA: FLJ21415 fis, clone COL04030 (Chromosome 12 [Source:RefSeq_dna;Acc:NM_032814]" open reading frame 49). [Source:Uniprot/SPTREMBL;Acc:Q9H741]" ENSG00000146872 TLK2 "Serine/threonine-protein kinase tousled-like 2 (EC 2.7.11.1) ENSG00000087995 METTL2A "Methyltransferase-like protein 2 (EC 2.1.1.-). [Source:Uniprot/SWISSPROT;Acc:Q86UE8]" [Source:Uniprot/SWISSPROT;Acc:Q96IZ6]" ENSG00000007168 PAFAH1B1 "Platelet-activating factor acetylhydrolase IB subunit alpha ENSG00000127804 METT10D "methyltransferase 10 domain containing [Source:Uniprot/SWISSPROT;Acc:P43034]" [Source:RefSeq_peptide;Acc:NP_076991]" ENSG00000065883 CDC2L5 "Cell division cycle 2-like protein kinase 5 (EC 2.7.11.22) ENSG00000006451 RALA "Ras-related protein Ral-A. [Source:Uniprot/SWISSPROT;Acc:Q14004]" [Source:Uniprot/SWISSPROT;Acc:P11233]" ENSG00000064933 PMS1 "PMS1 protein homolog 1 (DNA mismatch repair protein ENSG00000128699 ORMDL1 "ORM1-like protein 1 (Adoplin-1). PMS1). [Source:Uniprot/SWISSPROT;Acc:P54277]" [Source:Uniprot/SWISSPROT;Acc:Q9P0S3]" ENSG00000165661 QSCN6L1 "Sulfhydryl oxidase 2 precursor (EC 1.8.3.2) (Quiescin Q6-like ENSG00000107187 LHX3 "LIM/homeobox protein Lhx3. protein 1) (Neuroblastoma-derived sulfhydryl oxidase). [Source:Uniprot/SWISSPROT;Acc:Q9UBR4]" [Source:Uniprot/SWISSPROT;Acc:Q6ZRP7]"

ENSG00000064393 HIPK2 "Homeodomain-interacting protein kinase 2 (EC 2.7.11.1) ENSG00000122778 Q9H0M3_HUMA (hHIPk2). [Source:Uniprot/SWISSPROT;Acc:Q9H2X6]" N

27 Hufton et al.

ENSG00000086619 ERO1LB "ERO1-like protein beta precursor (EC 1.8.4.-) (ERO1-Lbeta) ENSG00000077585 GPR137B "Integral membrane protein GPR137B (Transmembrane 7 [Source:Uniprot/SWISSPROT;Acc:Q86YB8]" superfamily member 1). Downloaded from [Source:Uniprot/SWISSPROT;Acc:O60478]" ENSG00000197930 ERO1L "ERO1-like protein alpha precursor (EC 1.8.4.-) (ERO1- ENSG00000180998 GPR137C Lalpha) [Source:Uniprot/SWISSPROT;Acc:Q96HE7]" ENSG00000183558 HIST2H2AA3 " type 2-A (H2A.2). ENSG00000203814 HIST2H2BF " type 2-F. [Source:Uniprot/SWISSPROT;Acc:Q6FI13]" [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" ENSG00000203812 H2A2A_HUMA "Histone H2A type 2-A (H2A.2). ENSG00000203814 HIST2H2BF "Histone H2B type 2-F.

N [Source:Uniprot/SWISSPROT;Acc:Q6FI13]" [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" genome.cshlp.org ENSG00000184260 HIST2H2AC "Histone H2A type 2-C (H2A-GL101) (H2A/r). ENSG00000184678 HIST2H2BE "Histone H2B type 2-E (H2B.q) (H2B/q) (H2B-GL105). [Source:Uniprot/SWISSPROT;Acc:Q16777]" [Source:Uniprot/SWISSPROT;Acc:Q16778]" ENSG00000184260 HIST2H2AC "Histone H2A type 2-C (H2A-GL101) (H2A/r). ENSG00000203814 HIST2H2BF "Histone H2B type 2-F. [Source:Uniprot/SWISSPROT;Acc:Q16777]" [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" ENSG00000184270 HIST2H2AB "Histone H2A type 2-B. ENSG00000184678 HIST2H2BE "Histone H2B type 2-E (H2B.q) (H2B/q) (H2B-GL105). [Source:Uniprot/SWISSPROT;Acc:Q8IUE6]" [Source:Uniprot/SWISSPROT;Acc:Q16778]"

ENSG00000184270 HIST2H2AB "Histone H2A type 2-B. ENSG00000203814 HIST2H2BF "Histone H2B type 2-F. onSeptember26,2021-Publishedby [Source:Uniprot/SWISSPROT;Acc:Q8IUE6]" [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" ENSG00000137259 HIST1H2AB "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000146047 HIST1H2BA "Histone H2B type 1-A (Histone H2B, testis) (Testis- [Source:Uniprot/SWISSPROT;Acc:P28001]" specific histone H2B). [Source:Uniprot/SWISSPROT;Acc:Q96A08]" ENSG00000180573 HIST1H2AC "Histone H2A type 1-C. ENSG00000180596 HIST1H2BC "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:Q93077]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000180573 HIST1H2AC "Histone H2A type 1-C. ENSG00000196226 HIST1H2BB "Histone H2B type 1-B (H2B.f) (H2B/f) (H2B.1). [Source:Uniprot/SWISSPROT;Acc:Q93077]" [Source:Uniprot/SWISSPROT;Acc:P33778]" ENSG00000124642 HIST1H2AD "Histone H2A type 1-D (H2A.3). ENSG00000197697 HIST1H2BE "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:P20671]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000124642 HIST1H2AD "Histone H2A type 1-D (H2A.3). ENSG00000158373 HIST1H2BD "Histone H2B type 1-D (H2B.b) (H2B/b) (H2B.1 B) (HIRA- [Source:Uniprot/SWISSPROT;Acc:P20671]" interacting protein 2). [Source:Uniprot/SWISSPROT;Acc:P58876]"

ENSG00000124642 HIST1H2AD "Histone H2A type 1-D (H2A.3). ENSG00000180596 HIST1H2BC "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) Cold SpringHarborLaboratoryPress [Source:Uniprot/SWISSPROT;Acc:P20671]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000187990 HIST1H2BG "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:P28001]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000197846 HIST1H2BF "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:P28001]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000197697 HIST1H2BE "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:P28001]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000158373 HIST1H2BD "Histone H2B type 1-D (H2B.b) (H2B/b) (H2B.1 B) (HIRA- [Source:Uniprot/SWISSPROT;Acc:P28001]" interacting protein 2). [Source:Uniprot/SWISSPROT;Acc:P58876]"

ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000180596 HIST1H2BC "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) [Source:Uniprot/SWISSPROT;Acc:P28001]" (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l).

28 Hufton et al.

[Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000196787 HIST1H2AG "Histone H2A type 1 (H2A.1). ENSG00000124635 HIST1H2BJ "Histone H2B type 1-J (H2B.r) (H2B/r) (H2B.1). Downloaded from [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:P06899]" ENSG00000184825 HIST1H2AH "Histone H2A type 1-H. ENSG00000197903 HIST1H2BK "Histone H2B type 1-K (H2B K) (HIRA-interacting protein [Source:Uniprot/SWISSPROT;Acc:Q96KK5]" 1). [Source:Uniprot/SWISSPROT;Acc:O60814]" ENSG00000184825 HIST1H2AH "Histone H2A type 1-H. ENSG00000124635 HIST1H2BJ "Histone H2B type 1-J (H2B.r) (H2B/r) (H2B.1). [Source:Uniprot/SWISSPROT;Acc:Q96KK5]" [Source:Uniprot/SWISSPROT;Acc:P06899]" ENSG00000196747 HIST1H3H ".1 (H3/a) (H3/b) (H3/c) (H3/d) (H3/f) (H3/h) ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c).

(H3/i) (H3/j) (H3/k) (H3/l). [Source:Uniprot/SWISSPROT;Acc:Q99880]" genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:P68431]" ENSG00000182611 HIST1H2AJ "Histone H2A type 1-J. ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). [Source:Uniprot/SWISSPROT;Acc:Q99878]" [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000184348 HIST1H2AK "Histone H2A type 1 (H2A.1). ENSG00000196501 HIST1H2BN "Histone H2B type 1-N (H2B.d) (H2B/d). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99877]" ENSG00000184348 HIST1H2AK "Histone H2A type 1 (H2A.1). ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e).

[Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99879]" onSeptember26,2021-Publishedby ENSG00000184348 HIST1H2AK "Histone H2A type 1 (H2A.1). ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000198374 HIST1H2AL "Histone H2A type 1 (H2A.1). ENSG00000196501 HIST1H2BN "Histone H2B type 1-N (H2B.d) (H2B/d). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99877]" ENSG00000198374 HIST1H2AL "Histone H2A type 1 (H2A.1). ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99879]" ENSG00000198374 HIST1H2AL "Histone H2A type 1 (H2A.1). ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000196866 HIST1H2AM "Histone H2A type 1 (H2A.1). ENSG00000196501 HIST1H2BN "Histone H2B type 1-N (H2B.d) (H2B/d). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99877]" ENSG00000196866 HIST1H2AM "Histone H2A type 1 (H2A.1). ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e). [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" [Source:Uniprot/SWISSPROT;Acc:Q99879]" ENSG00000100577 GSTZ1 "Maleylacetoacetate isomerase (EC 5.2.1.2) (MAAI) ENSG00000165553 NGB "Neuroglobin. [Source:Uniprot/SWISSPROT;Acc:O43708]" [Source:Uniprot/SWISSPROT;Acc:Q9NPG2]" ENSG00000204116 CHIC1 "Cysteine-rich hydrophobic domain 1 protein (Brain X-linked ENSG00000131264 CDX4 "Homeobox protein CDX-4 (Caudal-type homeobox protein

protein). [Source:Uniprot/SWISSPROT;Acc:Q5VXU3]" 4). [Source:Uniprot/SWISSPROT;Acc:O14627]" Cold SpringHarborLaboratoryPress ENSG00000166123 GPT2 "Alanine aminotransferase 2 ENSG00000155330 NP_001001436.1 "similar to RIKEN cDNA 4921524J17 (LOC388272), [Source:Uniprot/SWISSPROT;Acc:Q8TD30]" mRNA [Source:RefSeq_dna;Acc:NM_001001436]" ENSG00000119711 ALDH6A1 "Methylmalonate-semialdehyde dehydrogenase [acylating], ENSG00000187097 ENTPD5 "Ectonucleoside triphosphate diphosphohydrolase 5 mitochondrial precursor (EC 1.2.1.27) precursor [Source:Uniprot/SWISSPROT;Acc:O75356]" [Source:Uniprot/SWISSPROT;Acc:Q02252]" ENSG00000129083 COPB1 "Coatomer subunit beta (Beta-coat protein) (Beta-COP). ENSG00000133818 RRAS2 "Ras-related protein R-Ras2 (Ras-like protein TC21) [Source:Uniprot/SWISSPROT;Acc:P53618]" (Teratocarcinoma ). [Source:Uniprot/SWISSPROT;Acc:P62070]" ENSG00000171823 FBXL14 "F-box/LRR-repeat protein 14 (F-box and leucine-rich repeat ENSG00000082805 ERC1 "ELKS/RAB6-interacting/CAST family member 1 (RAB6- protein 14). [Source:Uniprot/SWISSPROT;Acc:Q8N1E6]" interacting protein 2) (ERC protein 1). [Source:Uniprot/SWISSPROT;Acc:Q8IUD2]" ENSG00000136492 BRIP1 "Fanconi anemia group J protein ENSG00000121068 TBX2 "T-box transcription factor TBX2 (T-box protein 2). [Source:Uniprot/SWISSPROT;Acc:Q9BX63]" [Source:Uniprot/SWISSPROT;Acc:Q13207]" ENSG00000101349 PAK7 "Serine/threonine-protein kinase PAK 7 ENSG00000125869 CT103_HUMAN "Uncharacterized protein C20orf103 precursor.

[Source:Uniprot/SWISSPROT;Acc:Q9P286]" [Source:Uniprot/SWISSPROT;Acc:Q9UJQ1]" ENSG00000153922 CHD1 "Chromodomain-helicase-DNA-binding protein 1 (EC 3.6.1.-) ENSG00000174136 RGMB "RGM domain family member B precursor.

29 Hufton et al.

(ATP- dependent helicase CHD1) (CHD-1). [Source:Uniprot/SWISSPROT;Acc:Q6NW40]" [Source:Uniprot/SWISSPROT;Acc:O14646]" Downloaded from ENSG00000106245 BUD31 "Protein BUD31 homolog (Protein G10 homolog) (EDG-2). ENSG00000106244 PDAP1 "28 kDa heat- and acid-stable phosphoprotein [Source:Uniprot/SWISSPROT;Acc:P41223]" [Source:Uniprot/SWISSPROT;Acc:Q13442]" ENSG00000012983 MAP4K5 "Mitogen-activated protein kinase kinase kinase kinase 5 ENSG00000100490 CDKL1 "Cyclin-dependent kinase-like 1 (EC 2.7.11.22) (KHS). [Source:Uniprot/SWISSPROT;Acc:Q9Y4K4]" [Source:Uniprot/SWISSPROT;Acc:Q00532]" ENSG00000023318 TXNDC4 "Thioredoxin domain-containing protein 4 precursor ENSG00000136874 STX17 "Syntaxin-17 (KAT01814). [Source:Uniprot/SWISSPROT;Acc:Q9BS26]" [Source:Uniprot/SWISSPROT;Acc:P56962]"

ENSG00000155287 SLC25A28 "Mitoferrin-2 (Mitochondrial iron transporter 2) ENSG00000155265 C10orf132 "Uncharacterized protein C10orf132. genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:Q96A46]" [Source:Uniprot/SWISSPROT;Acc:Q2TAP0]" ENSG00000068028 RASSF1 "Ras association domain-containing protein 1. ENSG00000114383 TUSC2 "Tumor suppressor candidate 2 (PDGFA-associated protein [Source:Uniprot/SWISSPROT;Acc:Q9NS23]" 2) (Fus-1 protein) (Fusion 1 protein). [Source:Uniprot/SWISSPROT;Acc:O75896]" ENSG00000106789 CORO2A "Coronin-2A (WD repeat-containing protein 2) (IR10). ENSG00000136938 ANP32B "Acidic leucine-rich nuclear phosphoprotein 32 family [Source:Uniprot/SWISSPROT;Acc:Q92828]" member B (PHAPI2 protein) (Silver-stainable protein

SSP29) (Acidic protein rich in leucines). onSeptember26,2021-Publishedby [Source:Uniprot/SWISSPROT;Acc:Q92688]" ENSG00000164347 GFM2 "Elongation factor G 2, mitochondrial precursor (mEF-G 2) ENSG00000049860 HEXB "Beta-hexosaminidase beta chain precursor (Elongation factor G2). [Source:Uniprot/SWISSPROT;Acc:P07686]" [Source:Uniprot/SWISSPROT;Acc:Q969S9]" ENSG00000183941 H4_HUMAN ". [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000203814 HIST2H2BF "Histone H2B type 2-F. [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" ENSG00000182217 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000203814 HIST2H2BF "Histone H2B type 2-F. [Source:Uniprot/SWISSPROT;Acc:Q5QNW6]" ENSG00000196176 HIST1H4A "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000146047 HIST1H2BA "Histone H2B type 1-A (Histone H2B, testis) (Testis- specific histone H2B). [Source:Uniprot/SWISSPROT;Acc:Q96A08]" ENSG00000124529 HIST1H4B "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000146047 HIST1H2BA "Histone H2B type 1-A (Histone H2B, testis) (Testis- specific histone H2B). [Source:Uniprot/SWISSPROT;Acc:Q96A08]" ENSG00000197061 HIST1H4C "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196226 HIST1H2BB "Histone H2B type 1-B (H2B.f) (H2B/f) (H2B.1).

[Source:Uniprot/SWISSPROT;Acc:P33778]" Cold SpringHarborLaboratoryPress ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197697 HIST1H2BE "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000158373 HIST1H2BD "Histone H2B type 1-D (H2B.b) (H2B/b) (H2B.1 B) (HIRA- interacting protein 2). [Source:Uniprot/SWISSPROT;Acc:P58876]" ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000180596 HIST1H2BC "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196226 HIST1H2BB "Histone H2B type 1-B (H2B.f) (H2B/f) (H2B.1). [Source:Uniprot/SWISSPROT;Acc:P33778]" ENSG00000198518 HIST1H4E "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197846 HIST1H2BF "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]"

ENSG00000198518 HIST1H4E "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197697 HIST1H2BE "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l).

30 Hufton et al.

[Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000198518 HIST1H4E "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000158373 HIST1H2BD "Histone H2B type 1-D (H2B.b) (H2B/b) (H2B.1 B) (HIRA- Downloaded from interacting protein 2). [Source:Uniprot/SWISSPROT;Acc:P58876]" ENSG00000198518 HIST1H4E "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000180596 HIST1H2BC "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000198327 HIST1H4F "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000187990 HIST1H2BG "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A)

(H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000198327 HIST1H4F "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197846 HIST1H2BF "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000198327 HIST1H4F "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197697 HIST1H2BE "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" onSeptember26,2021-Publishedby ENSG00000198327 HIST1H4F "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000158373 HIST1H2BD "Histone H2B type 1-D (H2B.b) (H2B/b) (H2B.1 B) (HIRA- interacting protein 2). [Source:Uniprot/SWISSPROT;Acc:P58876]" ENSG00000158406 HIST1H4H "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000168242 HIST1H2BI "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000158406 HIST1H4H "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000197459 HIST1H2BH "Histone H2B type 1-H (H2B.j) (H2B/j). [Source:Uniprot/SWISSPROT;Acc:Q93079]" ENSG00000158406 HIST1H4H "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000187990 HIST1H2BG "Histone H2B type 1-C/E/F/G/I (H2B.a/g/h/k/l) (H2B.1 A) (H2B/a) (H2B/g) (H2B/h) (H2B/k) (H2B/l). [Source:Uniprot/SWISSPROT;Acc:P62807]" ENSG00000198339 HIST1H4I "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000124635 HIST1H2BJ "Histone H2B type 1-J (H2B.r) (H2B/r) (H2B.1). [Source:Uniprot/SWISSPROT;Acc:P06899]" ENSG00000197238 HIST1H4J "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e). [Source:Uniprot/SWISSPROT;Acc:Q99879]"

ENSG00000197238 HIST1H4J "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). Cold SpringHarborLaboratoryPress [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000197914 HIST1H4K "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e). [Source:Uniprot/SWISSPROT;Acc:Q99879]" ENSG00000197914 HIST1H4K "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000198558 HIST1H4L "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196501 HIST1H2BN "Histone H2B type 1-N (H2B.d) (H2B/d). [Source:Uniprot/SWISSPROT;Acc:Q99877]" ENSG00000198558 HIST1H4L "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000196374 HIST1H2BM "Histone H2B type 1-M (H2B.e) (H2B/e). [Source:Uniprot/SWISSPROT;Acc:Q99879]" ENSG00000198558 HIST1H4L "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" ENSG00000185130 HIST1H2BL "Histone H2B type 1-L (H2B.c) (H2B/c). [Source:Uniprot/SWISSPROT;Acc:Q99880]" ENSG00000104129 DNAJC17 "DnaJ homolog subfamily C member 17. ENSG00000137880 GCHFR "GTP cyclohydrolase 1 feedback regulatory protein [Source:Uniprot/SWISSPROT;Acc:Q9NVM6]" [Source:Uniprot/SWISSPROT;Acc:P30047]" ENSG00000130348 QRSL1 "glutaminyl-tRNA synthase (glutamine-hydrolyzing)-like 1 ENSG00000130347 RTN4IP1 "Reticulon-4-interacting protein 1, mitochondrial precursor

[Source:RefSeq_peptide;Acc:NP_060762]" (NOGO- interacting mitochondrial protein). [Source:Uniprot/SWISSPROT;Acc:Q8WWV3]"

31 Hufton et al.

ENSG00000168283 BMI1 "Polycomb group RING finger protein 4 (Polycomb complex ENSG00000148444 COMMD3 "COMM domain-containing protein 3 (Bup protein) (PIL protein BMI-1) (RING finger protein 51). protein). [Source:Uniprot/SWISSPROT;Acc:Q9UBI1]" Downloaded from [Source:Uniprot/SWISSPROT;Acc:P35226]" ENSG00000174989 FBXW8 "F-box/WD repeat protein 8 (F-box and WD-40 domain protein ENSG00000135119 TMEM118 "transmembrane protein 118 (TMEM118), mRNA 8) (F-box only protein 29). [Source:RefSeq_dna;Acc:NM_032814]" [Source:Uniprot/SWISSPROT;Acc:Q8N3Y1]" ENSG00000137573 SULF1 "Extracellular sulfatase Sulf-1 precursor (EC 3.1.6.-) (HSulf-1). ENSG00000165084 C8orf34 "C8orf34 protein. [Source:Uniprot/SWISSPROT;Acc:Q8IWU6]" [Source:Uniprot/SPTREMBL;Acc:Q49A92]"

ENSG00000161681 SHANK1 "SH3 and multiple ankyrin repeat domains protein 1 (Shank1) ENSG00000131409 LRRC4B "Leucine-rich repeat-containing protein 4B precursor. genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:Q9Y566]" [Source:Uniprot/SWISSPROT;Acc:Q9NT99]" ENSG00000115760 BIRC6 "Baculoviral IAP repeat-containing protein 6 (Ubiquitin- ENSG00000119820 YIPF4 "Protein YIPF4 (YIP1 family member 4). conjugating BIR-domain enzyme apollon). [Source:Uniprot/SWISSPROT;Acc:Q9BSR8]" [Source:Uniprot/SWISSPROT;Acc:Q9NR09]" ENSG00000155792 DEPDC6 "DEP domain containing 6 ENSG00000136982 NP_076999.1 "defective in sister chromatid cohesion homolog 1 [Source:RefSeq_peptide;Acc:NP_073620]" [Source:RefSeq_peptide;Acc:NP_076999]"

ENSG00000104731 KLHDC4 "Kelch domain-containing protein 4. ENSG00000154118 JPH3 "Junctophilin-3 (Junctophilin type 3) (JP-3). onSeptember26,2021-Publishedby [Source:Uniprot/SWISSPROT;Acc:Q8TBB5]" [Source:Uniprot/SWISSPROT;Acc:Q8WXH2]" ENSG00000174748 RPL15 "60S ribosomal protein L15. ENSG00000170142 UBE2E1 "Ubiquitin-conjugating enzyme E2 E1 [Source:Uniprot/SWISSPROT;Acc:P61313]" [Source:Uniprot/SWISSPROT;Acc:P51965]" ENSG00000174748 RPL15 "60S ribosomal protein L15. ENSG00000182247 UBE2E2 "Ubiquitin-conjugating enzyme E2 E2 (EC 6.3.2.19) [Source:Uniprot/SWISSPROT;Acc:P61313]" (Ubiquitin-protein ligase E2) (Ubiquitin carrier protein E2) (UbcH8). [Source:Uniprot/SWISSPROT;Acc:Q96LR5]" ENSG00000150995 ITPR1 "Inositol 1,4,5-trisphosphate receptor type 1 ENSG00000144455 SUMF1 "Sulfatase-modifying factor 1 precursor (C-alpha- [Source:Uniprot/SWISSPROT;Acc:Q14643]" formyglycine- generating enzyme 1). [Source:Uniprot/SWISSPROT;Acc:Q8NBK3]" ENSG00000134321 RSAD2 "radical S-adenosyl methionine domain containing 2 ENSG00000134326 NP_997198.2 "thymidylate kinase family LPS-inducible member [Source:RefSeq_peptide;Acc:NP_542388]" [Source:RefSeq_peptide;Acc:NP_997198]" ENSG00000113719 ERGIC1 "Endoplasmic reticulum-Golgi intermediate compartment ENSG00000120129 DUSP1 "Dual specificity protein 1 [Source:Uniprot/SWISSPROT;Acc:Q969X5]" [Source:Uniprot/SWISSPROT;Acc:P28562]" ENSG00000105220 GPI "Glucose-6-phosphate isomerase ENSG00000166398 KIAA0355 "KIAA0355 (KIAA0355), mRNA [Source:Uniprot/SWISSPROT;Acc:P06744]" [Source:RefSeq_dna;Acc:NM_014686]"

ENSG00000166949 SMAD3 "Mothers against decapentaplegic homolog 3 (SMAD 3) ENSG00000137834 SMAD6 "Mothers against decapentaplegic homolog 6 (SMAD 6) Cold SpringHarborLaboratoryPress (Mothers against DPP homolog 3) (Mad3) (hMAD-3) (JV15-2) (Mothers against DPP homolog 6) (Smad6) (hSMAD6). (hSMAD3). [Source:Uniprot/SWISSPROT;Acc:P84022]" [Source:Uniprot/SWISSPROT;Acc:O43541]" ENSG00000148655 C10orf11 ENSG00000165655 ZNF503 "zinc finger protein 503 [Source:RefSeq_peptide;Acc:NP_116161]" ENSG00000115109 EPB41L5 "Band 4.1-like protein 5. ENSG00000144120 TMEM177 "transmembrane protein 177 (TMEM177), mRNA [Source:Uniprot/SWISSPROT;Acc:Q9HCM4]" [Source:RefSeq_dna;Acc:NM_030577]" ENSG00000148842 CNNM2 "cyclin M2 isoform 2 ENSG00000166272 C10orf26 "Uncharacterized protein C10orf26. [Source:RefSeq_peptide;Acc:NP_951058]" [Source:Uniprot/SWISSPROT;Acc:Q9NX94]" ENSG00000065457 ADAT1 "adenosine deaminase, tRNA-specific 1 ENSG00000034713 GABARAPL2 "Gamma-aminobutyric acid receptor-associated protein-like [Source:RefSeq_peptide;Acc:NP_036223]" 2 [Source:Uniprot/SWISSPROT;Acc:P60520]" ENSG00000014919 COX15 "Cytochrome c oxidase assembly protein COX15 homolog. ENSG00000119929 CUTC "Copper homeostasis protein cutC homolog. [Source:Uniprot/SWISSPROT;Acc:Q7KZN9]" [Source:Uniprot/SWISSPROT;Acc:Q9NTM9]" ENSG00000138696 BMPR1B "Bone morphogenetic protein receptor type IB precursor (EC ENSG00000163110 PDLIM5 "PDZ and LIM domain protein 5 (Enigma homolog) 2.7.11.30) (CDw293 antigen). (Enigma-like PDZ and LIM domains protein).

[Source:Uniprot/SWISSPROT;Acc:O00238]" [Source:Uniprot/SWISSPROT;Acc:Q96HC4]" ENSG00000107779 BMPR1A "Bone morphogenetic protein receptor type IA precursor ENSG00000122367 LDB3 "LIM domain-binding protein 3 (Z-band alternatively spliced

32 Hufton et al.

[Source:Uniprot/SWISSPROT;Acc:P36894]" PDZ-motif protein) (Protein cypher). [Source:Uniprot/SWISSPROT;Acc:O75112]" Downloaded from ENSG00000153310 FAM49B "Protein FAM49B (L1). ENSG00000136997 MYC "Myc proto-oncogene protein (c-Myc) (Transcription factor [Source:Uniprot/SWISSPROT;Acc:Q9NUQ9]" p64). [Source:Uniprot/SWISSPROT;Acc:P01106]" ENSG00000197872 FAM49A "Protein FAM49A. ENSG00000134323 MYCN "N-myc proto-oncogene protein. [Source:Uniprot/SWISSPROT;Acc:Q9H0Q0]" [Source:Uniprot/SWISSPROT;Acc:P04198]" ENSG00000142655 PEX14 "Peroxisomal membrane protein PEX14 (Peroxin-14) ENSG00000160049 DFFA "DNA fragmentation factor subunit alpha (Peroxisomal membrane anchor protein PEX14) (PTS1 [Source:Uniprot/SWISSPROT;Acc:O00273]"

receptor docking protein). genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:O75381]" ENSG00000196358 NTNG2 "Netrin G2 precursor (Laminet-2). ENSG00000160563 CRSP8 "CRSP complex subunit 8 [Source:Uniprot/SWISSPROT;Acc:Q96CW9]" [Source:Uniprot/SWISSPROT;Acc:Q6P2C8]" ENSG00000154764 WNT7A "Protein Wnt-7a precursor. ENSG00000163520 FBLN2 "Fibulin-2 precursor. [Source:Uniprot/SWISSPROT;Acc:O00755]" [Source:Uniprot/SWISSPROT;Acc:P98095]" ENSG00000188064 WNT7B "Protein Wnt-7b precursor. ENSG00000077942 FBLN1 "Fibulin-1 precursor.

[Source:Uniprot/SWISSPROT;Acc:P56706]" [Source:Uniprot/SWISSPROT;Acc:P23142]" onSeptember26,2021-Publishedby ENSG00000198791 CNOT7 "CCR4-NOT transcription complex subunit 7 ENSG00000104219 ZDHHC2 "Palmitoyltransferase ZDHHC2 [Source:Uniprot/SWISSPROT;Acc:Q9UIV1]" [Source:Uniprot/SWISSPROT;Acc:Q9UIJ5]" ENSG00000106038 EVX1 "Homeobox even-skipped homolog protein 1 (EVX-1). ENSG00000106031 HOXA13 "Homeobox protein Hox-A13 (Hox-1J). [Source:Uniprot/SWISSPROT;Acc:P49640]" [Source:Uniprot/SWISSPROT;Acc:P31271]" ENSG00000162676 GFI1 "Zinc finger protein Gfi-1 (Growth factor independence 1). ENSG00000122484 C1orf82 "UPF0385 protein C1orf82. [Source:Uniprot/SWISSPROT;Acc:Q99684]" [Source:Uniprot/SWISSPROT;Acc:Q8IXW5]" ENSG00000163288 GABRB1 "Gamma-aminobutyric-acid receptor subunit beta-1 precursor ENSG00000109158 GABRA4 "Gamma-aminobutyric-acid receptor subunit alpha-4 (GABA(A) receptor subunit beta-1). precursor (GABA(A) receptor subunit alpha-4). [Source:Uniprot/SWISSPROT;Acc:P18505]" [Source:Uniprot/SWISSPROT;Acc:P48169]" ENSG00000163288 GABRB1 "Gamma-aminobutyric-acid receptor subunit beta-1 precursor ENSG00000151834 GABRA2 "Gamma-aminobutyric-acid receptor subunit alpha-2 (GABA(A) receptor subunit beta-1). precursor (GABA(A) receptor subunit alpha-2). [Source:Uniprot/SWISSPROT;Acc:P18505]" [Source:Uniprot/SWISSPROT;Acc:P47869]" ENSG00000166206 GABRB3 "Gamma-aminobutyric-acid receptor subunit beta-3 precursor ENSG00000186297 GABRA5 "Gamma-aminobutyric-acid receptor subunit alpha-5 (GABA(A) receptor subunit beta-3). precursor (GABA(A) receptor subunit alpha-5). [Source:Uniprot/SWISSPROT;Acc:P28472]" [Source:Uniprot/SWISSPROT;Acc:P31644]"

ENSG00000106415 GLCCI1 "Glucocorticoid-induced transcript 1 protein. ENSG00000106399 RPA3 "Replication protein A 14 kDa subunit (RP-A) (RF-A) Cold SpringHarborLaboratoryPress [Source:Uniprot/SWISSPROT;Acc:Q86VQ1]" (Replication factor-A protein 3) (p14). [Source:Uniprot/SWISSPROT;Acc:P35244]" ENSG00000092931 NP_077287.1 "CDNA FLJ20226 fis, clone COLF5064 ENSG00000161547 SFRS2 "Splicing factor, arginine/serine-rich 2 [Source:Uniprot/SPTREMBL;Acc:Q9NXI5]" [Source:Uniprot/SWISSPROT;Acc:Q01130]" ENSG00000183558 HIST2H2AA3 "Histone H2A type 2-A (H2A.2). ENSG00000183941 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q6FI13]" ENSG00000203812 H2A2A_HUMA "Histone H2A type 2-A (H2A.2). ENSG00000183941 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" N [Source:Uniprot/SWISSPROT;Acc:Q6FI13]" ENSG00000184260 HIST2H2AC "Histone H2A type 2-C (H2A-GL101) (H2A/r). ENSG00000182217 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q16777]" ENSG00000184260 HIST2H2AC "Histone H2A type 2-C (H2A-GL101) (H2A/r). ENSG00000183941 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q16777]" ENSG00000184270 HIST2H2AB "Histone H2A type 2-B. ENSG00000182217 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q8IUE6]"

ENSG00000184270 HIST2H2AB "Histone H2A type 2-B. ENSG00000183941 H4_HUMAN "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q8IUE6]"

33 Hufton et al.

ENSG00000111332 H2AFJ "H2A histone family, member J isoform 2 ENSG00000197837 HIST4H4 "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:RefSeq_peptide;Acc:NP_808760]" Downloaded from ENSG00000137259 HIST1H2AB "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000124529 HIST1H4B "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P28001]" ENSG00000137259 HIST1H2AB "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000196176 HIST1H4A "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P28001]" ENSG00000180573 HIST1H2AC "Histone H2A type 1-C. ENSG00000197061 HIST1H4C "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q93077]"

ENSG00000180573 HIST1H2AC "Histone H2A type 1-C. ENSG00000124529 HIST1H4B "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:Q93077]" ENSG00000180573 HIST1H2AC "Histone H2A type 1-C. ENSG00000196176 HIST1H4A "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q93077]" ENSG00000124642 HIST1H2AD "Histone H2A type 1-D (H2A.3). ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P20671]" ENSG00000124642 HIST1H2AD "Histone H2A type 1-D (H2A.3). ENSG00000197061 HIST1H4C "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]"

[Source:Uniprot/SWISSPROT;Acc:P20671]" onSeptember26,2021-Publishedby ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000198518 HIST1H4E "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P28001]" ENSG00000168274 HIST1H2AE "Histone H2A type 1-E (H2A.2) (H2A/m). ENSG00000188987 HIST1H4D "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P28001]" ENSG00000184825 HIST1H2AH "Histone H2A type 1-H. ENSG00000198339 HIST1H4I "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:Q96KK5]" ENSG00000184348 HIST1H2AK "Histone H2A type 1 (H2A.1). ENSG00000197914 HIST1H4K "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000184348 HIST1H2AK "Histone H2A type 1 (H2A.1). ENSG00000197238 HIST1H4J "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000198374 HIST1H2AL "Histone H2A type 1 (H2A.1). ENSG00000197914 HIST1H4K "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000198374 HIST1H2AL "Histone H2A type 1 (H2A.1). ENSG00000197238 HIST1H4J "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000196866 HIST1H2AM "Histone H2A type 1 (H2A.1). ENSG00000198558 HIST1H4L "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]"

[Source:Uniprot/SWISSPROT;Acc:P0C0S8]" Cold SpringHarborLaboratoryPress ENSG00000196866 HIST1H2AM "Histone H2A type 1 (H2A.1). ENSG00000197914 HIST1H4K "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000196866 HIST1H2AM "Histone H2A type 1 (H2A.1). ENSG00000197238 HIST1H4J "Histone H4. [Source:Uniprot/SWISSPROT;Acc:P62805]" [Source:Uniprot/SWISSPROT;Acc:P0C0S8]" ENSG00000126214 KNS2 "Kinesin light chain 1 (KLC 1). ENSG00000166165 CKB "Creatine kinase B-type (EC 2.7.3.2) (Creatine kinase B [Source:Uniprot/SWISSPROT;Acc:Q07866]" chain) (B-CK). [Source:Uniprot/SWISSPROT;Acc:P12277]" ENSG00000148339 SLC25A25 "solute carrier family 25, member 25 isoform c ENSG00000167103 PIP5KL1 "phosphatidylinositol-4-phosphate 5-kinase-like 1 [Source:RefSeq_peptide;Acc:NP_001006643]" [Source:RefSeq_peptide;Acc:NP_775763]" ENSG00000005471 ABCB4 "Multidrug resistance protein 3 (EC 3.6.3.44) ENSG00000005469 CROT "Peroxisomal carnitine O-octanoyltransferase (EC 2.3.1.137) [Source:Uniprot/SWISSPROT;Acc:P21439]" (COT). [Source:Uniprot/SWISSPROT;Acc:Q9UKG9]" ENSG00000085563 ABCB1 "Multidrug resistance protein 1 (EC 3.6.3.44) ENSG00000005469 CROT "Peroxisomal carnitine O-octanoyltransferase (EC 2.3.1.137) [Source:Uniprot/SWISSPROT;Acc:P08183]" (COT). [Source:Uniprot/SWISSPROT;Acc:Q9UKG9]" ENSG00000162378 ZYG11B "zyg-11 homolog B [Source:RefSeq_peptide;Acc:NP_078922]" ENSG00000162377 C1orf163 "CDNA FLJ12439 fis, clone NT2RM1000127.

[Source:Uniprot/SPTREMBL;Acc:Q9H9Z9]" ENSG00000203995 ZYG11A "ZYG-11A early embryogenesis protein. ENSG00000162377 C1orf163 "CDNA FLJ12439 fis, clone NT2RM1000127.

34 Hufton et al.

[Source:Uniprot/SPTREMBL;Acc:Q6WRX3]" [Source:Uniprot/SPTREMBL;Acc:Q9H9Z9]" ENSG00000092969 TGFB2 "Transforming growth factor beta-2 precursor (TGF-beta-2) ENSG00000067533 KIAA0507 "CGI-115 protein (CGI-115), mRNA Downloaded from [Source:Uniprot/SWISSPROT;Acc:P61812]" [Source:RefSeq_dna;Acc:NM_016052]" ENSG00000107651 SEC23IP "SEC23-interacting protein (p125). ENSG00000197771 C10orf119 "CJ119_HUMAN Isoform 2 of Q9BTE3 - Homo sapiens [Source:Uniprot/SWISSPROT;Acc:Q9Y6Y8]" (Human) [Source:Uniprot/Varsplic;Acc:Q9BTE3-2]" ENSG00000009765 IYD "Iodotyrosine dehalogenase 1 precursor (EC 1.-.-.-) (IYD-1). ENSG00000198729 PPP1R14C "Protein phosphatase 1 regulatory subunit 14C (PKC- [Source:Uniprot/SWISSPROT;Acc:Q6PHW0]" potentiated PP1 inhibitory protein) [Source:Uniprot/SWISSPROT;Acc:Q8TAE6]"

ENSG00000205726 ITSN1 "Intersectin-1 (SH3 domain-containing protein 1A) (SH3P17). ENSG00000205758 CRYZL1 "Quinone oxidoreductase-like 1 (EC 1.-.-.-) (QOH-1) (Zeta- genome.cshlp.org [Source:Uniprot/SWISSPROT;Acc:Q15811]" crystallin homolog) (4P11). [Source:Uniprot/SWISSPROT;Acc:O95825]"

onSeptember26,2021-Publishedby Cold SpringHarborLaboratoryPress

35 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

References

Abi-Rached, L., A. Gilles, T. Shiina, P. Pontarotti, and H. Inoko. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat Genet 31: 100-105. Armengol, L., T. Marques-Bonet, J. Cheung, R. Khaja, J.R. Gonzalez, S.W. Scherer, A. Navarro, and X. Estivill. 2005. Murine segmental duplications are hot spots for chromosome and gene evolution. Genomics 86: 692-700. Bailey, J.A. and E.E. Eichler. 2006. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 7: 552-564. Blair, J.E. and S.B. Hedges. 2005. Molecular phylogeny and divergence times of deuterostome animals. Mol Biol Evol 22: 2275-2284. Blair, J.E., P. Shah, and S.B. Hedges. 2005. Evolutionary sequence analysis of complete genomes. BMC Bioinformatics 6: 53. Blanc, G. and K.H. Wolfe. 2004. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16: 1679-1691. Blomme, T., K. Vandepoele, S. De Bodt, C. Simillion, S. Maere, and Y. Van de Peer. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 7: R43. Brunet, F.G., H.R. Crollius, M. Paris, J.M. Aury, P. Gibert, O. Jaillon, V. Laudet, and M. Robinson-Rechavi. 2006. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol 23: 1808-1816. Burt, D.W., C. Bruley, I.C. Dunn, C.T. Jones, A. Ramage, A.S. Law, D.R. Morrice, I.R. Paton, J. Smith, D. Windsor, A. Sazanov, R. Fries, and D. Waddington. 1999. The dynamics of chromosome evolution in birds and mammals. Nature 402: 411-413. Castro, L.F. and P.W. Holland. 2003. Chromosomal mapping of ANTP class homeobox genes in amphioxus: piecing together ancestral genomes. Evol Dev 5: 459-465. Cohen, A. 1980. On the graphical display of the significant components in a two-way contingency table. Communications in Statistics-Theory and Methods A9: 1025-1041. Dehal, P. and J.L. Boore. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 3: e314. Delsuc, F., H. Brinkmann, D. Chourrout, and H. Philippe. 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 439: 965-968. Dennis, G., Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, and R.A. Lempicki. 2003. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4: P3. Fitch, W.M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155: 279- 284. Frickey, T. and A.N. Lupas. 2004. PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res 32: 5231-5238. Fried, C., S.J. Prohaska, and P.F. Stadler. 2003. Independent Hox-cluster duplications in lampreys. J Exp Zoolog B Mol Dev Evol 299: 18-25. Friedman, R. and A.L. Hughes. 2001. Pattern and timing of gene duplication in animal genomes. Genome Res 11: 1842-1847. Friendly, M. 1992. Graphical methods for categorical data. SAS User Group International Conference Proceedings 17: 190-200.

36 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Fujiwara, T., M. Bandi, M. Nitta, E.V. Ivanova, R.T. Bronson, and D. Pellman. 2005. Cytokinesis failure generating tetraploids promotes tumorigenesis in p53-null cells. Nature 437: 1043-1047. Furlong, R.F. and P.W. Holland. 2002. Were vertebrates octoploid? Philos Trans R Soc Lond B Biol Sci 357: 531-544. Galipeau, P.C., D.S. Cowan, C.A. Sanchez, M.T. Barrett, M.J. Emond, D.S. Levine, P.S. Rabinovitch, and B.J. Reid. 1996. 17p (p53) allelic losses, 4N (G2/tetraploid) populations, and progression to aneuploidy in Barrett's esophagus. Proc Natl Acad Sci U S A 93: 7081-7084. Ganem, N.J., Z. Storchova, and D. Pellman. 2007. Tetraploidy, aneuploidy and cancer. Curr Opin Genet Dev 17: 157-162. Gibson, T.J. and J. Spring. 2000. Evidence in favour of ancient octaploidy in the vertebrate genome. Biochem Soc Trans 28: 259-264. Hedges, S.B., J.E. Blair, M.L. Venturi, and J.L. Shoe. 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol Biol 4: 2. Hillier, L., et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695-716. Housworth, E.A. and J. Postlethwait. 2002. Measures of synteny conservation between species pairs. Genetics 162: 441-448. Hughes, A.L., J. da Silva, and R. Friedman. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11: 771-780. Hughes, A.L. and R. Friedman. 2005. Loss of ancestral genes in the genomic evolution of Ciona intestinalis. Evol Dev 7: 196-200. Hurley, I.A., R.L. Mueller, K.A. Dunn, E.J. Schmidt, M. Friedman, R.K. Ho, V.E. Prince, Z. Yang, M.G. Thomas, and M.I. Coates. 2007. A new time-scale for ray-finned fish evolution. Proc Biol Sci 274: 489-498. Ikuta, T., N. Yoshida, N. Satoh, and H. Saiga. 2004. Ciona intestinalis Hox gene cluster: Its dispersed structure and residual colinear expression in development. Proc Natl Acad Sci U S A 101: 15118-15123. Kasahara, M. 2007. The 2R hypothesis: an update. Curr Opin Immunol 19: 547-552. Kawakami, K., S. Sato, H. Ozaki, and K. Ikeda. 2000. Six family genes--structure and function as transcription factors and their roles in development. Bioessays 22: 616- 626. Maere, S., S. De Bodt, J. Raes, T. Casneuf, M. Van Montagu, M. Kuiper, and Y. Van de Peer. 2005. Modeling gene and genome duplications in . Proc Natl Acad Sci U S A 102: 5454-5459. McLysaght, A., K. Hokamp, and K.H. Wolfe. 2002. Extensive genomic duplication during early chordate evolution. Nat Genet 31: 200-204. Nadeau, J.H. and D. Sankoff. 1998. Counting on comparative maps. Trends Genet 14: 495- 501. Nadeau, J.H. and B.A. Taylor. 1984. Lengths of chromosomal segments conserved since divergence of man and mouse. Proc Natl Acad Sci U S A 81: 814-818. Nakatani, Y., H. Takeda, Y. Kohara, and S. Morishita. 2007. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res 17: 1254-1265. Ohno, S. 1970. Evolution by Gene Duplication Springer-Verlag, Berlin, New York. Olaharski, A.J., R. Sotelo, G. Solorza-Luna, M.E. Gonsebatt, P. Guzman, A. Mohar, and D.A. Eastmond. 2006. Tetraploidy and chromosomal instability are early events during cervical carcinogenesis. Carcinogenesis 27: 337-343.

37 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Hufton et al.

Otto, S.P. 2007. The evolutionary consequences of polyploidy. Cell 131: 452-462. Panopoulou, G., S. Hennig, D. Groth, A. Krause, A.J. Poustka, R. Herwig, M. Vingron, and H. Lehrach. 2003. New evidence for genome-wide duplications at the origin of vertebrates using an amphioxus gene set and completed animal genomes. Genome Res 13: 1056-1066. Pevzner, P. and G. Tesler. 2003. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci U S A 100: 7672-7677. Pontes, O., N. Neves, M. Silva, M.S. Lewis, A. Madlung, L. Comai, W. Viegas, and C.S. Pikaard. 2004. Chromosomal rearrangements are a rapid response to formation of the allotetraploid Arabidopsis suecica genome. Proc Natl Acad Sci U S A 101: 18240-18245. Putnam, N.H., T. Butts, D.E. Ferrier, R.F. Furlong, U. Hellsten, T. Kawashima, M. Robinson- Rechavi, E. Shoguchi, A. Terry, J.K. Yu, E.L. Benito-Gutierrez, I. Dubchak, J. Garcia- Fernandez, J.J. Gibson-Brown, I.V. Grigoriev, A.C. Horton, P.J. de Jong, J. Jurka, V.V. Kapitonov, Y. Kohara, Y. Kuroki, E. Lindquist, S. Lucas, K. Osoegawa, L.A. Pennacchio, A.A. Salamov, Y. Satou, T. Sauka-Spengler, J. Schmutz, I.T. Shin, A. Toyoda, M. Bronner-Fraser, A. Fujiyama, L.Z. Holland, P.W. Holland, N. Satoh, and D.S. Rokhsar. 2008. The amphioxus genome and the evolution of the chordate karyotype. Nature 453: 1064-1071. Ravid, K., J. Lu, J.M. Zimmet, and M.R. Jones. 2002. Roads to polyploidy: the megakaryocyte example. J Cell Physiol 190: 7-20. Remm, M., C.E. Storm, and E.L. Sonnhammer. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314: 1041-1052. Retief, J.D. 2000. Phylogenetic analysis using PHYLIP. Methods Mol Biol 132: 243-258. Schmidt, H.A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18: 502-504. Semon, M. and K.H. Wolfe. 2007. Rearrangement rate following the whole-genome duplication in teleosts. Mol Biol Evol 24: 860-867. Smith, J.J. and S.R. Voss. 2006. Gene order data from a model amphibian (Ambystoma): new perspectives on vertebrate genome structure and evolution. BMC Genomics 7: 219. Song, K., P. Lu, K. Tang, and T.C. Osborn. 1995. Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proc Natl Acad Sci U S A 92: 7719-7723. Stadler, P.F., C. Fried, S.J. Prohaska, W.J. Bailey, B.Y. Misof, F.H. Ruddle, and G.P. Wagner. 2004. Evidence for independent Hox gene duplications in the hagfish lineage: a PCR-based gene inventory of Eptatretus stoutii. Mol Phylogenet Evol 32: 686-694. Vandepoele, K., W. De Vos, J.S. Taylor, A. Meyer, and Y. Van de Peer. 2004. Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc Natl Acad Sci U S A 101: 1638- 1643.

38 Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Fig. 1. Identifying and grouping pairs of syntenic genes. Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Fig. 2. Synteny group distributions reveal vertebrate duplication events. Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Fig. 3. Estimating the amount of synteny conservation between two genomes. Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Fig. 4. Rates of synteny loss throughout vertebrates and their ancestors. Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press

Early vertebrate whole genome duplications were predated by a period of intense genome rearrangement

Andrew L Hufton, Detlef Groth, Martin Vingron, et al.

Genome Res. published online July 14, 2008 Access the most recent version at doi:10.1101/gr.080119.108

Supplemental http://genome.cshlp.org/content/suppl/2008/10/02/gr.080119.108.DC1 Material

P

Accepted Peer-reviewed and accepted for publication but not copyedited or typeset; accepted Manuscript manuscript is likely to differ from the final, published version.

License

Email Alerting Receive free email alerts when new articles cite this article - sign up in the box at the Service top right corner of the article or click here.

Advance online articles have been peer reviewed and accepted for publication but have not yet appeared in the paper journal (edited, typeset versions may be posted when available prior to final publication). Advance online articles are citable and establish publication priority; they are indexed by PubMed from initial publication. Citations to Advance online articles must include the digital object identifier (DOIs) and date of initial publication.

To subscribe to Genome Research go to: https://genome.cshlp.org/subscriptions

Copyright © 2008, Cold Spring Harbor Laboratory Press