Elephant Shark Sequence Reveals Unique Insights Into the Evolutionary History of Vertebrate Genes: a Comparative Analysis of the Protocadherin Cluster

Wei-Ping Yu†‡, Vikneswari Rajasegaran†, Kenneth Yew†, Wai-lin Loh†, Boon-Hui Tay§, Chris T. Amemiya¶, Sydney Brenner‡§, and Byrappa Venkatesh‡§

†Gene Regulation Laboratory, National Neuroscience Institute, 11 Jalan Tan Tock Seng, Singapore 308433; §Institute of Molecular and Cell Biology, Agency for Science, Technology, and Research, Biopolis, Singapore 138673; and ¶Benaroya Research Institute at Virginia Mason, Seattle, WA 98101

Contributed by Sydney Brenner, January 14, 2008 (sent for review November 7, 2007)

Cartilaginous fishes are the oldest living phylogenetic group of jawed vertebrates. Here, we demonstrate the value of cartilaginous fish sequences in reconstructing the evolutionary history of vertebrate genomes by sequencing the protocadherin cluster in the relatively small genome (910 Mb) of the elephant shark (Callorhinchus milii). Human and coelacanth contain a single protocadherin cluster with 53 and 49 genes, respectively, that are organized in three subclusters, Pcdhα, Pcdhβ, and Pcdhγ, whereas the duplicated protocadherin clusters in fugu and zebrafish contain >77 and 107 genes, respectively, that are organized in Pcdhα and Pcdhγ subclusters. By contrast, the elephant shark contains a single protocadherin cluster with 47 genes organized in four subclusters (Pcdhδ, Pcdhε, Pcdhμ, and Pcdhν). By comparison with elephant shark sequences, we discovered a Pcdhδ subcluster in teleost fishes, coelacanth, Xenopus, and chicken. Our results suggest that the protocadherin cluster in the ancestral jawed vertebrate contained more subclusters than modern vertebrates, and the evolution of the protocadherin cluster is characterized by lineage-specific differential loss of entire subclusters of genes. In contrast to teleost fish and mammalian protocadherin genes that have undergone gene conversion events, elephant shark protocadherin genes have experienced very little gene conversion. The syntenic block of genes in the elephant shark protocadherin locus is well conserved in human but disrupted in fugu. Thus, the elephant shark genome appears to be less prone to rearrangements compared with teleost fish genomes. The small and “stable” genome of the elephant shark is a valuable reference for understanding the evolution of vertebrate genomes.

ancestral jawed vertebrate | clustered protocadherins | cartilaginous fish | Callorhinchus milii | gene conversion

Author contributions: W.-P.Y., S.B., and B.V. designed research; W.-P.Y., V.R., K.Y., W.-l.L., and B.-H.T. performed research; C.T.A. contributed new reagents/analytic tools; W.-P.Y., S.B., and B.V. analyzed data; and W.-P.Y. and B.V. wrote the paper.

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. EF693945, EU267079, and EU267080).

‡To whom correspondence may be addressed. E-mail: weiping_[email protected], backhill@hotmail.co.uk, or [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0800398105/DC1.

© 2008 by The National Academy of Sciences of the USA

Reconstructing the evolutionary history of vertebrate genomes, and in particular the human genome, is a major goal of vertebrate comparative genomics. Efforts are underway to reconstruct the evolutionary history of the human genome at the nucleotide level and at the level of large-scale chromosomal rearrangements (1) with the objective of predicting ancestral genomes with high accuracy. The accuracy of the reconstruction of ancestral genomes is greatly influenced by the choice and number of genomes compared. Although a comparison of genomes from all major vertebrate groups is desirable, it is also important to compare genomes from the most distant branches of the phylogenetic tree. Here, we show how the genome sequence from the oldest living phylogenetic group of jawed vertebrates, the cartilaginous fishes, can dramatically change our inferences of the state of the ancestral vertebrate genome.

The protocadherin gene cluster is one of the most evolutionarily dynamic loci in vertebrates. It is composed of tandem arrays of paralogous genes that are highly susceptible to lineage-specific gene losses, tandem duplications, and gene conversion (2–7). These clustered genes are distinguished from nonclustered protocadherin genes by possessing a large coding exon that codes for the entire extracellular domain, the transmembrane domain, and a small portion of the cytoplasmic domain, whereas the extracellular domain of the nonclustered protocadherins is usually encoded by multiple exons (8). Protocadherin cluster proteins are cadherin-like cell adhesion molecules that are highly enriched on synaptic membranes (9). These proteins are believed to be a vertebrate innovation that accompanied the emergence of the neural tube and the elaborate central nervous system and have been hypothesized to provide a molecular code for specifying the enormous diversity in neuronal connectivity (3, 9, 10). No clustered protocadherin gene has been identified in invertebrate genomes (11–13).

The protocadherin gene cluster has been characterized in several bony vertebrates (Osteichthyes), including human, coelacanth, and teleost fishes. Human and coelacanth protocadherin cluster genes are organized in three tandem subclusters, designated as Pcdhα, Pcdhβ, and Pcdhγ (3, 4). Each subcluster contains multiple “variable” exons (≈2.4 kb each) that are transcribed from independent promoters. Each variable exon codes for an extracellular domain comprising six repeats of a calcium-binding ectodomain (EC1–6), a transmembrane domain, and a part of the cytoplasmic domain. In addition to variable exons, Pcdhα and -γ subclusters contain three constant exons at the 3′ end of the respective subcluster, to which each variable exon is independently spliced. The constant region exons code for the major part of the cytoplasmic domain that is shared by genes in each subcluster. Human and coelacanth protocadherin clusters contain a total of 53 and 49 genes, respectively. Because of the “fish-specific” whole-genome duplication, teleost fishes such as fugu and zebrafish contain two unlinked protocadherin clusters, designated Pcdh1 and Pcdh2. Although the fugu Pcdh1 cluster is highly degenerate, Pcdh2 clusters in both fishes have undergone lineage-specific repeated tandem duplications and gene losses, giving rise to >107 genes in zebrafish (2, 5, 7) and >77 genes in fugu (6). However, these teleost protocadherin genes are orthologous to only Pcdhα and -γ subcluster genes in human and coelacanth. Another interesting feature of teleost protocadherin cluster genes is that they have experienced extensive regional gene conversion events resulting in nucleotide identity of >99% in the 3′ region of variable exons among paralogs (5, 6). Based on the organization, gene content, and phylogenetic relationships of protocadherin

www.pnas.org/cgi/doi/10.1073/pnas.0800398105 PNAS | March 11, 2008 | vol. 105 | no. 10 | 3819–3824

[Figure 1: map of the elephant shark protocadherin cluster showing the Pcdhδ, Pcdhε, Pcdhμ, and Pcdhν subclusters (variable exons δ1, ε1–ε21, ψ1, μ1–μ8, and ν1–ν17; constant exons δCX1–3, μCX1–3, νCX1–3, and νCX2b) above BAC clones 49C8, 166L19, and 176N9.]

Fig. 1. Genomic organization of the elephant shark protocadherin cluster. Constant exons at the end of each subcluster are shown in red. Variable exons in the same paralog subgroup or intersubcluster subgroup are shown in the same color. Only a single pseudogene (ψ1) was identified. νCX2b is an alternatively spliced “constant” exon. The positions and identification numbers of the BAC clones sequenced are shown below.

clusters in bony vertebrates, the ancestral vertebrate protocadherin cluster is predicted to contain either two (Pcdhα and -γ) or three subclusters (Pcdhα, -β, and -γ) that have subsequently undergone lineage-specific gene losses and gene duplications, including the whole-locus duplication in the teleost fish lineage (6).

The living jawed vertebrates (gnathostomes) fall into two major monophyletic groups, the Chondrichthyes (cartilaginous fishes that include elasmobranchs and holocephalians) and the Osteichthyes (bony vertebrates, which include ray-finned fishes, coelacanths, lungfishes, and tetrapods) that shared a common ancestor ≈450 million years ago. The protocadherin cluster has been characterized in several bony vertebrates. Here, we report the sequencing and characterization of the protocadherin cluster from a holocephalian cartilaginous fish, the elephant shark (Callorhinchus milii). The elephant shark possesses a relatively small genome (≈910 Mb) among the cartilaginous fishes and is therefore attractive for genome sequencing and comparative analysis (14). Our study shows that the elephant shark protocadherin cluster contains 47 variable exons arrayed in four subclusters. Interestingly, none of the four subclusters shows a direct orthology relationship to the α, β, or γ subcluster in bony vertebrates. Thus, the protocadherin cluster in the common ancestor of jawed vertebrates appears to have contained more subclusters of genes than modern vertebrates.

Results

Elephant Shark Protocadherin Cluster Genes Organized into Four Subclusters. Screening of an elephant shark BAC library identified 10 positive clones that belong to a single locus. Three representative overlapping BACs were sequenced to obtain ≈409 kb of contiguous sequence spanning the protocadherin cluster. Based on sequence homology and GENSCAN prediction (see Methods), a total of 47 variable exons were identified in this cluster (Fig. 1). In addition, we identified several nonprotocadherin genes flanking the protocadherin cluster [supporting information (SI) Appendix, SI Fig. 5], indicating that the elephant shark protocadherin cluster sequence is complete. The overall gene content of the elephant shark protocadherin cluster (47 variable exons arrayed in four subclusters) is comparable to that in coelacanth (49 variable exons in three subclusters) (4) and human (53 variable exons in three subclusters) (3, 15). However, its gene content is considerably lower than that in fugu (>77 variable exons) (6) and zebrafish (>107 variable exons) (2, 4, 5, 7).

The mammalian and coelacanth protocadherin clusters consist of three subclusters, the Pcdhα, -β, and -γ. The teleost clusters, however, lack the β subcluster. To determine the subcluster organization of the elephant shark protocadherin cluster, we first searched for constant exons by homology search and GENSCAN prediction and identified three sets of constant exons. We then determined the splicing pattern of the variable exons with the constant exons by RT-PCR and 3′RACE. These analyses indicated that the genes are organized into four subclusters. The first subcluster consists of a single variable exon and three constant exons, whereas the second subcluster is comprised of only variable exons (21 in all), each of which is transcribed independently as a single-exon gene similar to the coelacanth and mammalian β subcluster genes. The third and fourth subclusters consist of 8 and 17 variable exons, respectively, and three constant exons each. In addition, the fourth subcluster contains an alternative constant exon between the second and third constant exons. The splice sites of the constant exons of the three subclusters are shown in SI Appendix, SI Fig. 6. We designated the four subclusters as the Pcdhδ, Pcdhε, Pcdhμ, and Pcdhν subclusters, respectively (Fig. 1).

To determine the evolutionary relationships between individual protocadherin (Pcdh) genes in the elephant shark, we aligned the amino acid sequences of all variable exons using ClustalX and generated a phylogenetic tree using the Neighbor-joining method. Sequences of only the first three ectodomains (EC1–3) were used, because the C-terminal ectodomains (EC4–6) have been shown to be susceptible to extensive gene conversion in teleost fishes and human, resulting in homogenization of sequences among paralogs (5, 6). The phylogenetic tree shows that most of the elephant shark Pcdh genes segregate into “paralog groups,” each consisting of paralogs that belong to the same subcluster (CmPcdhε2–21; CmPcdhμ1–8; CmPcdhν2–3, and CmPcdhν4–17) (SI Appendix, SI Fig. 7). The only exceptions are the CmPcdhδ1, CmPcdhε1, and CmPcdhν1 genes from different subclusters that form a monophyletic group. Such a relationship between genes from different subclusters indicates they are “intersubcluster paralogs” that shared a common ancestor in the primordial protocadherin cluster. Similar intersubcluster paralogs have been identified in the human protocadherin cluster, in which the Pcdhα genes, c1 and c2, were found to be more closely related to Pcdhγ genes, c4 and c5, than to other genes in the α subcluster (3). The presence of such paralogs indicates that these ancient genes have been highly conserved as single paralogous genes despite the recurrent expansion and loss of genes in their vicinity. The evolutionary conservation of such ancient paralogs between subclusters suggests they may play a fundamental role in protocadherin functions. In support of this hypothesis, it has been shown that, whereas the expression of all other protocadherin genes in the mammalian α subcluster is monoallelic and cell-selective, the c1 and c2 genes are expressed biallelically and in every neuron (16, 17). It is thus possible that the highly conserved intersubcluster paralogs CmPcdhδ1, CmPcdhε1, and CmPcdhν1 in the elephant shark carry out a fundamental role similar to that of “c” genes in mammals. The absence of such a paralog in the Pcdhμ subcluster is likely due to a subcluster-specific gene degeneration.

Elephant Shark Protocadherin Subclusters Are Not Orthologous to Those in Mammals, Coelacanth, or Teleost Fishes. To examine the phylogenetic relationships of elephant shark subclusters with those of human, coelacanth, and teleost fishes, we aligned the amino acid sequences of the constant region exons of the CmPcdhδ, -μ, and -ν genes from elephant shark, and Pcdhα and

-γ genes from human, mouse, coelacanth, fugu, and zebrafish and constructed a phylogenetic tree using the Neighbor-joining method. We also included sequences of the constant region of Pcdhδ from fugu, zebrafish, coelacanth, Xenopus tropicalis, and chicken that we have identified in the present study (described below). The phylogenetic tree (Fig. 2) indicates that the Pcdhδ sequences from elephant shark, the two teleost fishes, coelacanth, and Xenopus constitute a monophyletic group distinct from other sequences, indicating this is a separate subcluster from the α and γ subclusters. This phylogenetic tree also shows that elephant shark Pcdhμ and -ν subclusters may not be direct orthologs of Pcdhα and -γ subclusters in bony vertebrates (Fig. 2). These elephant shark subclusters are either the orthologs of the α and γ subclusters that have undergone an unusually accelerated rate of evolution involving extensive nucleotide substitutions or simply distinct subclusters that lack orthologs in bony vertebrates. Given there is no evidence that elephant shark genes have been evolving at a rapid rate, and that, in particular, the neighboring Pcdhδ gene shows a normal evolutionary rate, it is more probable that the Pcdhμ and -ν subclusters are distinct protocadherin subclusters that lack orthologs in bony vertebrates.

Fig. 2. Phylogenetic analysis of protocadherin constant region exons. The tree was generated from alignment of protein sequences of the protocadherin constant regions by the Neighbor-joining method based on sequence distance matrix. The tree is unrooted. Bootstrap values (%) of only the major nodes are shown. Cm, Callorhinchus milii; Hs, Homo sapiens; Mm, Mus musculus; Lm, Latimeria menadoensis; Dr, Danio rerio; Fr, Fugu rubripes; Xt, X. tropicalis; Gg, Gallus gallus.

Because the elephant shark Pcdhε subcluster lacks the constant region, its phylogenetic relationship with other protocadherin subclusters can be inferred only by analyzing the variable exons. To this end, we generated a phylogenetic tree using the alignment of the N-terminal ectodomain sequences (EC1–3) of all variable exons from elephant shark, fugu, and coelacanth. This phylogenetic tree of variable exons suggests that the Pcdhε subcluster is not related to the bony vertebrate β subcluster, which also lacks a constant region, but is an ancient subcluster that has undergone repeated expansion within the elephant shark lineage (SI Appendix, SI Fig. 8). In addition, this analysis showed that the first genes in the fugu Pcdh1 (FrPcdh1α1) and Pcdh2 clusters (FrPcdh2α1) and the coelacanth protocadherin cluster (LmPcdhα1) are more closely related to elephant shark CmPcdhδ1, -μ1, and -ν1 genes than to other α genes in their respective subclusters (SI Appendix, SI Fig. 8). This suggests that the fugu Pcdh1α1 and Pcdh2α1 genes and coelacanth Pcdhα1 gene may not be α genes but belong to a different subcluster.

A Pcdhδ Subcluster in Teleost Fishes, Coelacanth, Xenopus, and Chicken. To verify this, we searched for constant exons downstream of these variable exons by GENSCAN prediction and TBLASTN by using elephant shark Pcdhδ constant region sequence as a query. We uncovered three new constant exons each in the fugu Pcdh1 and Pcdh2 clusters and in the coelacanth protocadherin cluster. Because the zebrafish Pcdh1α1 gene is orthologous to the fugu Pcdh1α1 gene (6), we extended our study to the zebrafish Pcdh1 locus and identified a similar set of constant exons immediately downstream of the zebrafish Pcdh1α1 gene. RT-PCR analyses with RNA from fugu and zebrafish (coelacanth was not included due to unavailability of RNA) confirmed that the first variable exons in the fugu Pcdh1 and Pcdh2 clusters and in the zebrafish Pcdh1 cluster are spliced to the respective newly identified three constant exons and not to the constant region exons in their respective α subclusters.

The absence of an ortholog for the coelacanth Pcdhα1 gene (redesignated as Pcdhδ1) in the human and mouse protocadherin clusters (2, 3, 18, 19) indicates that the Pcdhδ subcluster is lost in mammals. To determine more precisely when this subcluster was lost, we searched the X. tropicalis and chicken genome assemblies by TBLASTN using the elephant shark Pcdhδ constant region sequence as a query. This search led to the identification of a single-variable exon plus three constant exon subclusters at the 5′ end of protocadherin clusters in Xenopus (scaffold_177, length 2.1 Mb) and chicken (chromosome 13). Phylogenetic analysis of the amino acid sequences of the variable exon (SI Appendix, SI Fig. 8) and the constant region exons (Fig. 2) of this cluster from Xenopus and chicken together with their corresponding sequences in fugu, zebrafish, coelacanth, and elephant shark indicated that these genes are indeed orthologs

Table 1. Synonymous substitutions per codon in the ectodomain region of elephant shark and fugu Pcdh paralog subgroups

Subgroups         dS EC1–6†   dS EChigh‡   dS EClow§   dS EChigh/dS EClow
CmPcdhε2–21       0.211       0.281        0.157       1.8
CmPcdhμ1–8        0.647       1.073        0.534       2.0
CmPcdhν4–17       0.210       0.338        0.146       2.3
FrPcdh2α3–7       0.104       0.517        0.00515     100.4
FrPcdh2α9–25      0.228       1.022        0.00522     195.8
FrPcdh2α26–36     0.327       1.331        0.00104     1,280
FrPcdh2γ1–17      0.067       0.240        0.00302     79.5
FrPcdh2γ19–31     0.227       1.308        0.00555     235.7
FrPcdh2γ33–36     0.297       8.667        0.00310     2,796

†Average dS for each branch in the gene tree of individual subgroups was calculated based on the alignment of paralogs in the subgroup.
‡Average dS per branch calculated based on alignment of the most-divergent ectodomain in each subgroup.
§Average dS per branch calculated based on alignment of the least-divergent ectodomain in each subgroup.
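
The last column of Table 1 is the quotient of the two columns before it. As a minimal sketch (Python; the names TABLE_1, high_low_ratio, and recompute_ratios are illustrative and are not part of the original analysis, which estimated dS with CODEML), the reported ratios can be recomputed from the tabulated dS values:

```python
# Values transcribed from Table 1:
# (subgroup, dS of most-divergent ectodomain, dS of least-divergent ectodomain, reported ratio)
TABLE_1 = [
    ("CmPcdhε2-21",   0.281, 0.157,   1.8),
    ("CmPcdhμ1-8",    1.073, 0.534,   2.0),
    ("CmPcdhν4-17",   0.338, 0.146,   2.3),
    ("FrPcdh2α3-7",   0.517, 0.00515, 100.4),
    ("FrPcdh2α9-25",  1.022, 0.00522, 195.8),
    ("FrPcdh2α26-36", 1.331, 0.00104, 1280.0),
    ("FrPcdh2γ1-17",  0.240, 0.00302, 79.5),
    ("FrPcdh2γ19-31", 1.308, 0.00555, 235.7),
    ("FrPcdh2γ33-36", 8.667, 0.00310, 2796.0),
]


def high_low_ratio(ds_high: float, ds_low: float) -> float:
    """Ratio of dS in the most-divergent to the least-divergent ectodomain."""
    return ds_high / ds_low


def recompute_ratios(table=TABLE_1):
    """Map each subgroup name to its recomputed dS EChigh / dS EClow ratio."""
    return {name: high_low_ratio(hi, lo) for name, hi, lo, _reported in table}


if __name__ == "__main__":
    for name, hi, lo, reported in TABLE_1:
        print(f"{name:14s} recomputed={high_low_ratio(hi, lo):9.1f} reported={reported}")
```

Recomputed this way, the elephant shark (Cm) subgroups all stay below 3, whereas every fugu (Fr) subgroup exceeds 75, which is the contrast the table is meant to show.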

of the elephant shark Pcdhδ subcluster. Thus, we conclude that Pcdhδ is an ancient subcluster that has been conserved in teleost fishes, coelacanth, Xenopus, and chicken protocadherin clusters but lost in the human and mouse protocadherin clusters.

Elephant Shark Protocadherin Cluster Genes Have Experienced Very Little Gene Conversion. The tandem arrays of paralogous protocadherin genes in fugu and zebrafish have experienced repeated gene conversion events, resulting in extensive regional sequence homogenization among paralogs in the same subcluster (5, 6). Gene conversion in protocadherin clusters is usually restricted to the C-terminal ectodomains from EC4 to EC6 and the cytoplasmic fragment encoded by the variable exons, whereas the EC2 and EC3 ectodomain regions seldom undergo gene conversion, and thus are hypothesized to provide the majority of diversifying signals for protocadherin cluster genes. Protocadherin cluster genes in human have also undergone gene conversions although to a lesser extent compared with teleost fishes, but those in coelacanth have experienced few gene conversions (4). To investigate whether the elephant shark protocadherin cluster genes have undergone gene conversion, we estimated the total number of synonymous substitutions per codon (dS) for each of the elephant shark protocadherin paralog subgroups (CmPcdhε2–21, CmPcdhμ1–8, CmPcdhν4–17). The synonymous substitution rate reflects the frequency of gene conversion events, because purifying selection for protein function does not act on synonymous sites. Because gene conversion in protocadherin clusters is usually restricted to the C-terminal ectodomains, we assessed the synonymous substitution rates separately for each ectodomain. The overall ratio of synonymous substitution rates between the most- and the least-divergent ectodomains among different paralog subgroups of the elephant shark protocadherin genes ranges from 1.8 to 2.3 (Table 1). This indicates a uniform rate of synonymous-site substitutions among different ectodomains in the elephant shark paralogs. In contrast, the ratio between the most- and the least-divergent ectodomains among different subgroups in fugu ranges from 79 to 2,796 (Table 1). Such a high ratio indicates that different domains have experienced different rates of substitution, most likely due to gene conversion in some ectodomains. These results suggest that none of the ectodomains in the elephant shark protocadherin cluster genes have undergone gene conversion and sequence homogenization. Thus, like the coelacanth genome, the elephant shark genome appears to be less susceptible to recombination-mediated rearrangements.

Fig. 3. Comparison of the promoter elements of protocadherin genes. The core promoter elements of elephant shark Pcdhε (A), Pcdhμ (B), and Pcdhν (C); human Pcdhα (D), Pcdhβ (H), and Pcdhγ (F); coelacanth Pcdhα (E) and Pcdhγ (G); and fugu Pcdhα (I) and Pcdhγ (J) were identified by MEME (http://meme.sdsc.edu/meme/intro.html) and displayed as WebLogo plots.

Evolution of Core Promoter Elements of Protocadherin Cluster Genes. Previous studies have identified a 15-bp core sequence element conserved in the promoter sequences of all mammalian and coelacanth protocadherin genes and in zebrafish Pcdh1γ and Pcdh2γ genes but divergent in zebrafish Pcdh1α and Pcdh2α genes (5, 18, 20). This core element includes a “CGCT” motif implicated in promoter function (20). We searched the promoter regions (480 bp upstream of the start codon) of the elephant shark genes using the MEME algorithm to look for conserved motifs in genes among different subclusters. Our results show that the elephant shark Pcdhε and Pcdhμ subcluster genes share a highly conserved 27-bp promoter element (Fig. 3 A and B). The presence of this element in genes from two different subclusters and the high level of conservation indicate that this is an ancient promoter element that is under purifying selection. The promoter sequences of the elephant shark Pcdhν genes, however, are divergent from this element and also from each other, except for a short CAAT-box-like motif in the 3′ region (Fig. 3C). These genes, located at the 3′ end of the elephant shark protocadherin cluster, are likely to be regulated differently from those of Pcdhε and Pcdhμ genes. The first 15 bp of the 27-bp element in the elephant shark Pcdhε and Pcdhμ genes are well conserved in the Pcdhα, -β, and -γ genes in human and coelacanth (Fig. 3 D–H) and in Pcdhγ genes in fugu (Fig. 3J). The 3′ region of this element, however, is divergent in bony vertebrates, except for the CAAT-box-like motif that is partially conserved in the Pcdhα and Pcdhγ genes in human and coelacanth (Fig. 3 D–G). The substitutions accumulated in the 3′ region of this promoter element in bony vertebrates may have resulted in the acquisition of new expression patterns and contributed to adaptive diversity of protocadherin cluster genes. The promoter elements of fugu Pcdhα subcluster genes are quite divergent from the promoter elements in all other protocadherin clusters. Even the “CGCT” motif is

[Figure 4: schematic of the protocadherin clusters of human, coelacanth, fugu (Pcdh1 and Pcdh2), elephant shark, and the inferred ancestor; the ancestor row is labeled Pcdhδ, Pcdhε, Pcdhμ, Pcdhν, Pcdhα, Pcdhβ, and Pcdhγ.]

Fig. 4. A model to explain the evolutionary history of the protocadherin cluster in jawed vertebrates. Variable exons are indicated by boxes and constant exons by red bars. Intersubcluster paralogs or orthologs between species are shown in the same color. White boxes represent protocadherin genes or paralog subgroups with no apparent intersubcluster paralogs or orthologs. Encircled “D” denotes whole-genome duplication in the ray-finned fish lineage.
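
The model in Fig. 4 amounts to set subtraction: an ancestral cluster with seven subclusters, from which each lineage has lost whole subclusters. The sketch below (Python) encodes only the losses proposed in the Discussion; the names ANCESTRAL, LOSSES, and surviving_subclusters are hypothetical and serve purely as an illustration, not as part of the authors' analysis.

```python
# Seven subclusters proposed for the last common ancestor of jawed vertebrates.
ANCESTRAL = frozenset({"alpha", "beta", "gamma", "delta", "epsilon", "mu", "nu"})

# Loss proposed for the common ancestor of bony vertebrates.
BONY_VERTEBRATE_LOSS = frozenset({"epsilon", "mu", "nu"})

# Successive subcluster losses along each lineage, oldest first.
LOSSES = {
    "elephant shark": [frozenset({"alpha", "beta", "gamma"})],
    "coelacanth": [BONY_VERTEBRATE_LOSS],
    # beta is proposed to have been lost before the fish-specific genome duplication
    "teleost fishes": [BONY_VERTEBRATE_LOSS, frozenset({"beta"})],
    "mammals": [BONY_VERTEBRATE_LOSS, frozenset({"delta"})],
}


def surviving_subclusters(lineage: str) -> set:
    """Apply the proposed losses for one lineage to the ancestral cluster."""
    remaining = set(ANCESTRAL)
    for lost in LOSSES[lineage]:
        remaining -= lost
    return remaining


if __name__ == "__main__":
    for lineage in LOSSES:
        print(lineage, sorted(surviving_subclusters(lineage)))
```

Under these assumptions the sketch returns the subcluster sets described in the text: δ, ε, μ, and ν for elephant shark; α, β, γ, and δ for coelacanth; α, γ, and δ for teleost fishes; and α, β, and γ for mammals.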

rather weakly conserved in the promoters of these genes (Fig. 3I). Thus, these fugu genes appear to have undergone further adaptive radiation in the teleost lineage.

Syntenic Block in the Elephant Shark Protocadherin Locus Is Conserved in Mammals but Rearranged in Fugu. The elephant shark protocadherin locus contains a block of syntenic genes (Hars, Dnd1, Zmat2, the protocadherin cluster, and Diaph1) spanning ≈409 kb that is conserved in the human protocadherin locus. However, the human locus is considerably larger (≈950 kb) than the elephant shark locus and contains an extra gene, HARS2, which is a duplicated copy of HARS (SI Appendix, SI Fig. 5). The content and order of genes in the orthologous mouse locus are identical to those in human (data not shown). In contrast to the well conserved synteny between the elephant shark and mammalian protocadherin loci, the regions flanking the duplicate protocadherin clusters of fugu have experienced extensive rearrangements (SI Appendix, SI Fig. 5). Comparisons of ≈1.4× whole-genome shotgun sequences of elephant shark derived from paired end sequences of fosmid clones with the human and zebrafish genome assemblies have previously indicated that the level of conserved synteny between the elephant shark and human genomes is higher than that between the elephant shark and zebrafish genomes (21). The highly conserved synteny of genes at the protocadherin locus in the elephant shark and mammals and its disruption in fugu provide further support to this observation and are consistent with the notion that teleost fish genomes have experienced a higher rate of chromosomal rearrangements relative to elephant shark and mammals. Reconstruction of ancestral vertebrate karyotypes based on comparisons of teleost fish genomes (medaka, zebrafish, and Tetraodon) and the human genome has suggested that several major chromosomal rearrangements occurred in the fish lineage within a short period of ≈50 million years after the “fish-specific” whole-genome duplication (22). The whole-genome duplication may have triggered an accelerated rate of rearrangements by facilitating recombination between paralogous chromosomal segments.

Discussion

The living jawed vertebrates comprise two major lineages, the cartilaginous fishes and bony vertebrates. In this study, we have characterized the protocadherin cluster from a cartilaginous fish, the elephant shark. The protocadherin cluster in elephant shark contains 47 genes organized into four subclusters, the Pcdhδ, -ε, -μ, and -ν. Surprisingly, phylogenetic analyses indicated that these subclusters are not direct orthologs of the Pcdhα, -β, or -γ subcluster previously identified in bony vertebrates. We have shown that Pcdhδ is an ancient subcluster that is conserved in teleost fishes, coelacanth, Xenopus, and chicken but lost in mammals. Thus, our study suggests that the protocadherin cluster in the ancestral jawed vertebrate contained more subclusters than previously proposed based on characterization of the protocadherin cluster in bony vertebrates. We propose that the protocadherin cluster in the last common ancestor of jawed vertebrates contained seven subclusters (Pcdhα, -β, -γ, -δ, -ε, -μ, and -ν) with multiple variable exons (Fig. 4). After the divergence of the cartilaginous fish and bony vertebrate lineages, this ancestral cluster has experienced differential loss of entire subclusters of genes in different vertebrate lineages. Although the Pcdhα, -β, and -γ subclusters were lost in the elephant shark lineage, the common ancestor of bony vertebrates lost the Pcdhε, -μ, and -ν subclusters. Of the remaining four subclusters (Pcdhδ, -α, -β, and -γ) in bony vertebrates, the Pcdhβ subcluster was lost in the ray-finned fish lineage, most likely before the “fish-specific” whole-genome duplication event. Another subcluster, Pcdhδ, was lost more recently in the mammalian lineage (Fig. 4). Besides the entire subclusters, the constant region exons of protocadherin genes have also been targeted for lineage-specific losses. The constant exons of the Pcdhε and -β subclusters were lost independently in the elephant shark lineage and in a common ancestor of lobe-finned fishes, respectively (Fig. 4). Thus, the characterization of the protocadherin cluster in the elephant shark has demonstrated the importance of cartilaginous fish genome sequence in reconstructing the evolutionary history of vertebrate genomes and in inferring the organization of the ancestral jawed vertebrate genome. Because they are the most ancient phylogenetic group of jawed vertebrates, cartilaginous fishes not only provide a better insight into the now-extinct ancestral jawed vertebrate genome but also serve as an outgroup that helps in identifying lineage-specific adaptive changes in different lineages of bony vertebrates.

The elephant shark genome was proposed as a model cartilaginous fish genome because of its relatively small size (14). Subsequently, comparative analysis of ≈1.4× coverage elephant shark sequences with the whole-genome assemblies of human,

Yu et al. PNAS | March 11, 2008 | vol. 105 | no. 10 | 3823

fugu, and zebrafish suggested that noncoding sequences in elephant shark are evolving more slowly than in teleost fishes and that the elephant shark genome has experienced fewer chromosomal rearrangements compared with teleost fish genomes (21, 23). The characterization of the protocadherin locus in elephant shark has provided further support to the hypothesis that the elephant shark genome is more stable compared with teleost fish genomes. The conserved synteny of genes at the protocadherin locus in elephant shark and human, the lower turnover of protocadherin genes in the elephant shark protocadherin cluster, and a lack of evidence for gene conversion between paralogous protocadherin genes in elephant shark suggest that the elephant shark genome is less prone to recombination-mediated rearrangements and tandem gene duplications compared with teleost fish genomes. Although we did not measure the relative substitution rates of the promoter sequences of protocadherin genes, the relatively higher levels of conservation of the 15-bp core promoter element in the elephant shark compared with that in fugu and coelacanth indicate that the regulatory region of elephant shark protocadherin genes is evolving more slowly than that in fugu and coelacanth. The relatively small and slowly evolving genome of the elephant shark, therefore, is an important "reference" for understanding the evolution of vertebrate genomes.

Methods

Sequencing and Assembly of BACs. To identify probes for the elephant shark protocadherin cluster, we first performed a TBLASTN search of the whole-genome shotgun sequences of elephant shark (GenBank accession nos. CW854842–CW882785) using the amino acid sequences of human protocadherins as queries. We identified two sequences (GenBank accession nos. CW868635 and CW882727) that showed high similarity to human protocadherin. The shotgun clones of these sequences were used to probe the IMCB_Eshark BAC library (cloned in pCCBAC-EcoRI; unpublished data). In total, 10 positive clones were identified, from among which three representative overlapping clones (49C8, 166L19, and 176N9) were sequenced completely by the standard shotgun sequencing method and assembled using SeqMan (Lasergene). This sequence has been submitted to GenBank (accession no. EF693954).

Sequence Annotation. The variable and constant exons of elephant shark protocadherin genes and exons of the nonprotocadherin genes flanking the protocadherin cluster were identified and annotated based on homology (TBLASTN and BLASTX) (www.ncbi.nlm.nih.gov) and GENSCAN prediction (24). Human orthologs of the elephant shark nonprotocadherin genes were identified by BLAT search of the human genome at the University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu). The genomic sequences of human and mouse protocadherin clusters were retrieved from the UCSC Genome Browser, whereas the coelacanth protocadherin cluster sequence was assembled from BAC sequences in GenBank (accession nos. AC150238, AC150284, and AC150308–AC150310) (4). The genomic sequences of fugu protocadherin clusters were also obtained from GenBank (accession nos. DQ986917–DQ986918) (6). Protocadherin sequences of X. tropicalis and chicken were identified by BLAT search of their genome assemblies (Ver. 4.1 and 2.1, respectively; http://genome.ucsc.edu) using the protein sequence of the elephant shark Pcdhδ constant region as query. Gaps in the intronic regions of Xenopus and chicken sequences were filled by PCR amplification of the genomic DNA. The complete genomic sequences of the Xenopus and chicken Pcdhδ subclusters have been submitted to GenBank (accession nos. EU267079 and EU267080).

RT-PCR and 3′RACE Analyses. Total RNA was extracted from the elephant shark, fugu, and zebrafish tissues by the TRIzol method (Invitrogen) and reverse-transcribed using the SMART rapid amplification of cDNA ends (RACE) cDNA Amplification kit (Clontech). The PCR was performed by an initial denaturing step of 95°C for 2 min, followed by 35 cycles of 95°C for 30 sec, 55°C for 30 sec, and 72°C for 1–3 min. 3′RACE analysis was performed according to the manufacturer's instructions (Clontech). The RACE products were sequenced completely.

Phylogenetic Analyses. The amino acid sequences of EC1–3 domains of protocadherin cluster genes from various species were aligned by ClustalX (25). Phylogenetic trees were constructed by the Neighbor-joining method based on sequence distance matrix and displayed using NJplot (www-igbmc.u-strasbg.fr/BioInfo). The robustness of the tree was determined by bootstrap analysis of 1,000 replicate sample sequences.

Synonymous Substitution Analyses. The synonymous substitution rates were estimated using the CODEML program in the PAML package with default parameters (26). The nucleotide sequence alignments were generated by the RevTrans program (27), using amino acid sequence alignments (generated by ClustalX) as templates. The synonymous substitution rates were calculated as average dS for each branch in the gene tree of individual subgroups.

ACKNOWLEDGMENTS. The research work in the W.-P.Y. laboratory is supported by the Biomedical Research Council (BMRC); the National Medical Research Council (NMRC); and the SingHealth Foundation Funds, Singapore; and the work in B.V.'s laboratory is supported by the Agency for Science, Technology and Research (A*STAR), Singapore. B.V. is adjunct staff of the Department of Pediatrics, Yong Loo Lin School of Medicine, National University of Singapore.
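The bootstrap step described under Phylogenetic Analyses can be illustrated with a minimal sketch. It is not the ClustalX implementation: it assumes a simple uncorrected p-distance, and the function names and the toy alignment are invented for illustration. Each replicate resamples alignment columns with replacement and recomputes the distance matrix. A tree would then be built from each matrix (for example by neighbor-joining, not shown here), and the support for a clade is the fraction of replicate trees that contain it.

```python
import random


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances keyed by (name, name) tuples."""
    names = sorted(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_matrices(alignment, replicates=1000, seed=0):
    """One distance matrix per bootstrap replicate."""
    rng = random.Random(seed)
    return [distance_matrix(bootstrap_replicate(alignment, rng))
            for _ in range(replicates)]


# Hypothetical toy alignment, not data from the paper.
toy = {"geneA": "MKVLA-TG", "geneB": "MKILA-TG", "geneC": "MRILSQTG"}
matrices = bootstrap_matrices(toy, replicates=1000)
```

Resampling whole columns keeps each replicate the same length as the original alignment, so the replicate distances stay comparable with the original ones.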
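The idea behind using an amino acid alignment as a template for a nucleotide alignment, as described under Synonymous Substitution Analyses, can be sketched in a few lines. This is an illustrative reimplementation of the general technique, not the RevTrans code. The function name `thread_codons` and the example sequences are hypothetical, and the sketch assumes that each coding sequence starts in frame and encodes its protein codon by codon (it does not verify the translation).

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes the next codon of the coding sequence; each
    protein gap becomes a three-nucleotide gap. Nucleotides beyond the
    last residue (for example a stop codon) are ignored.
    """
    residues = sum(ch != gap for ch in aligned_protein)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Hypothetical two-sequence example, not data from the paper.
protein_alignment = {"s1": "MK-L", "s2": "MKTL"}
coding = {"s1": "ATGAAACTG", "s2": "ATGAAGACCCTG"}
codon_alignment = {name: thread_codons(protein_alignment[name], coding[name])
                   for name in protein_alignment}
print(codon_alignment["s1"])  # ATGAAA---CTG
```

Because gaps are inserted only in multiples of three, the reading frame is preserved, which is what codon-based programs such as CODEML require of their input alignments.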

1. Ma J, et al. (2006) Reconstructing contiguous regions of an ancestral genome. Genome Res 16:1557–1565.
2. Wu Q (2005) Comparative genomics and diversifying selection of the clustered vertebrate protocadherin genes. Genetics 169:2179–2188.
3. Wu Q, Maniatis T (1999) A striking organization of a large family of human neural cadherin-like cell adhesion genes. Cell 97:779–790.
4. Noonan JP, et al. (2004) Coelacanth genome sequence reveals the evolutionary history of vertebrate genes. Genome Res 14:2397–2405.
5. Noonan JP, Grimwood J, Schmutz J, Dickson M, Myers RM (2004) Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res 14:354–366.
6. Yu WP, Yew K, Rajasegaran V, Venkatesh B (2007) Sequencing and comparative analysis of fugu protocadherin clusters reveal diversity of protocadherin genes among teleosts. BMC Evol Biol 7:49.
7. Tada MN, et al. (2004) Genomic organization and transcripts of the zebrafish Protocadherin genes. Gene 340:197–211.
8. Morishita H, Yagi T (2007) Protocadherin family: diversity, structure, and function. Curr Opin Cell Biol 19:584–592.
9. Kohmura N, et al. (1998) Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20:1137–1151.
10. Shapiro L, Colman DR (1999) The diversity of cadherins and implications for a synaptic adhesive code in the CNS. Neuron 23:427–430.
11. Hill E, Broadbent ID, Chothia C, Pettitt J (2001) Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J Mol Biol 305:1011–1024.
12. Sasakura Y, et al. (2003) A genomewide survey of developmentally relevant genes in Ciona intestinalis. X. Genes for cell junctions and extracellular matrix. Dev Genes Evol 213:303–313.
13. Whittaker CA, et al. (2006) The echinoderm adhesome. Dev Biol 300:252–266.
14. Venkatesh B, Tay A, Dandona N, Patil JG, Brenner S (2005) A compact cartilaginous fish model genome. Curr Biol 15:R82–R83.
15. Vanhalst K, Kools P, Vanden Eynde E, van Roy F (2001) The human and murine protocadherin-beta one-exon gene families show high evolutionary conservation, despite the difference in gene number. FEBS Lett 495:120–125.
16. Ribich S, Tasic B, Maniatis T (2006) Identification of long-range regulatory elements in the protocadherin-alpha gene cluster. Proc Natl Acad Sci USA 103:19719–19724.
17. Kaneko R, et al. (2006) Allelic gene regulation of Pcdh-alpha and Pcdh-gamma clusters involving both monoallelic and biallelic expression in single Purkinje cells. J Biol Chem 281:30551–30560.
18. Wu Q, et al. (2001) Comparative DNA sequence analysis of mouse and human protocadherin gene clusters. Genome Res 11:389–404.
19. Zou C, Huang W, Ying G, Wu Q (2007) Sequence analysis and expression mapping of the rat clustered protocadherin gene repertoires. Neuroscience 144:579–603.
20. Tasic B, et al. (2002) Promoter choice determines splice site selection in protocadherin alpha and gamma pre-mRNA splicing. Mol Cell 10:21–33.
21. Venkatesh B, et al. (2007) Survey sequencing and comparative analysis of the elephant shark (Callorhinchus milii) genome. PLoS Biol 5:e101.
22. Kasahara M, et al. (2007) The medaka draft genome and insights into vertebrate genome evolution. Nature 447:714–719.
23. Venkatesh B, et al. (2006) Ancient noncoding elements conserved in the human genome. Science 314:1892.
24. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94.
25. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25:4876–4882.
26. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556.
27. Wernersson R, Pedersen AG (2003) RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31:3537–3539.

3824 | www.pnas.org/cgi/doi/10.1073/pnas.0800398105 Yu et al.