<<

doi:10.1016/j.jmb.2008.04.037 J. Mol. Biol. (2008) 379, 912–928

Available online at www.sciencedirect.com

The 185/333 Family Is a Rapidly Diversifying Host-Defense in the Purple Sea Urchin Strongylocentrotus purpuratus

K. M. Buckley1, S. Munshaw2, T. B. Kepler2⁎ and L. C. Smith1⁎

1Department of Biological Sciences, George Washington University, Washington, DC 20052, USA
2Center for Computational Immunology, Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, USA

Received 31 December 2007; received in revised form 11 April 2008; accepted 15 April 2008
Available online 22 April 2008

The genome of the purple sea urchin contains numerous large gene families with putative immunological functions. One gene family, known as 185/333, is characterized by extraordinary molecular diversity resulting from single nucleotide polymorphisms and the presence or the absence of 27 large blocks of sequences known as elements. The mosaic composition of elements, known as element patterns, that is present within the members of this gene family is encoded entirely in the second of two exons. Many of the elements correspond to one of six types of repeats that are present throughout the genes. The sequence diversity and variation in element patterns led us to investigate the evolution of the 185/333 gene family. The work presented here suggests that the element patterns are the result of both recombination and duplication and/or deletion of intragenic repeats. Each element is composed of a limited number of similar but distinct sequences, and their distribution among the 185/333 genes suggests frequent recombination within this gene family. Phylogenetic analyses of five 185/333 elements and two regions of the intron were performed using two tests: incongruence length difference and incongruence permutation. Results indicated that each pair of sequence segments was incongruent, suggesting that recombination occurs frequently along the length of the genes, including both the intron and the second exon, and that recombination is not restricted to intact elements. Paradoxically, the high level of similarity among the elements indicated that the 185/333 genes appear to be the result of a recent diversification. These results add to the growing body of evidence suggesting that invertebrate immune systems are not simple and static, but are dynamic and highly complex, and may employ group-specific mechanisms for diversification.

© 2008 Elsevier Ltd. All rights reserved.

Keywords: echinoderm; molecular evolution; innate immunology; repeats; invertebrate

Edited by J. Karn

Introduction

Complex, molecularly diverse immune responses have been traditionally believed to be limited to higher vertebrates. However, recent evidence has demonstrated that invertebrate immune responses may also be highly diversified1 and are often encoded within large gene families.2–11 The recently sequenced genome of the purple sea urchin Strongylocentrotus purpuratus12 contains a number of large immune-related gene families, many with considerably more members than their vertebrate homologues.13 These gene families encode toll-like receptors (TLRs), NACHT domain and leucine-rich repeat (NLR) proteins, scavenger receptor cysteine-rich domains, and C-type lectins. The 185/333 gene family is another example of a large diverse gene family that is putatively involved in the sea urchin immune response.14 Results from quantitative polymerase chain reaction (qPCR) analysis of genomic DNA suggest that the 185/333 gene family consists of between 80 and 120 alleles.15 The genes are closely linked, are flanked by dinucleotide and trinucleotide repeats,14 and are highly expressed in response to immunological challenge with either whole bacteria,16,17 lipopolysaccharide,16,18 β-1,3-glucan, double-stranded RNA,18 or peptidoglycan (D.A. Raftos, unpublished data). Sea urchins seem to be able to discriminate among these pathogen signatures through as yet unknown mechanisms and express unique suites of 185/333 genes in response to challenge.18,19 The 185/333 transcripts constitute 6.45% of a cDNA library constructed from bacterially activated coelomocytes, as opposed to 0.086% in a nonactivated library, a 75-fold increase.16

Although the function of the 185/333 proteins remains unknown, they localize to the cell surface of a subset of the coelomocytes (immune cells) of the sea urchin and may be involved in the formation of syncytia to immobilize invading pathogens.20 Analysis of 185/333 protein expression by two-dimensional Western blot analysis suggests that distinct suites of proteins are expressed in response to lipopolysaccharide compared to peptidoglycan, and that individual sea urchins can express >200 unique proteins (D.A. Raftos, unpublished data). These protein data therefore support the previous observations of transcript and gene diversity14,15,18 and emphasize the putative role of the 185/333 gene family in the S. purpuratus immune response.

In addition to the striking increase in expression following immune challenge, the 185/333 sequences are intriguing. Originally discovered as 60% of the expressed sequence tags (ESTs) from an arrayed cDNA library screened with a subtracted probe to identify sequences that were upregulated in response to immune challenge, the 185/333 transcripts are homologous to only two previously isolated S. purpuratus ESTs after which the genes are named (DD185, GenBank accession no. AF228877; EST333, GenBank accession no. R62011).16 Alignment of the 185/333 mRNAs requires the insertion of large gaps, creating blocks of similar sequences known as elements.15,16 The elements are variably present or absent in different mRNAs, which have been used to define specific element patterns. Analysis of 185/333 genes indicates that the variation in transcript element patterns is likely the result of variations in element patterns encoded by many genes, rather than the result of extensive alternative splicing among a few genes.14,15 The genes have two exons. The first encodes a hydrophobic leader, and the second encodes the remainder of the open reading frame, including the variable element patterns. Intron sequences are similar in length but exhibit sequence variability and can be phylogenetically classified into five types (α–ε).14 A variety of imperfect repeats within the coding regions have been used to align the 185/333 sequences and to define elements based on the locations of gaps within the alignment, as well as the edges of the repeats.14

The 185/333 sequence diversity is extremely high. From 16 S. purpuratus individuals, 872 185/333 sequences (183 genes and 689 transcripts) have been analyzed, of which 475 are unique, encoding 323 proteins with 37 different element patterns.14,15,18 Sequence diversity is the result of variation in element patterns, as well as point mutations and small indels. No identical gene sequences are shared among different animals, indicating that the nucleotide diversity occurs not only within the 185/333 gene family of individual sea urchins but also within the S. purpuratus population.

An initial analysis of 290 ESTs containing the 5′ untranslated region (5′ UTR) and leader sequence revealed that identical 5′ UTRs were associated with different leader sequences and vice versa.16 This observation was further supported by a more thorough analysis of 185/333 genes. From three individual sea urchins, 121 unique genes were identified from 171 clones, even though each of the 27 elements had a small number of related but distinct element sequences (average = 11 unique sequences/element).14 Unique genes were composed of different combinations of the element sequences in a patchwork structure. This mosaic nature of the gene sequences, in addition to the diversifying pressure on the 185/333 genes that results from their predicted immunological role,16 prompted further investigation of the evolution of this unique gene family.

The results presented here suggest that the highly diversified 185/333 gene family is subject to frequent recombination, gene duplication, and gene deletion. Because gaps introduced into sequence alignments to define the element patterns complicate phylogenetic analysis of full-length gene sequences, the evolution of this gene family was analyzed from the perspective of the repeats within the genes. Phylogenetic analysis suggests that the repeats have arisen as a result of intragenic repeat duplication and/or deletion, recombination, and point mutations. Short simple repeats located within the larger repeats may facilitate gene diversification through unknown mechanisms.21 Incongruent phylogenetic histories of a variety of elements and analysis of the distribution of element sequences across the genes suggest that the genes undergo frequent recombination, which is likely to be a mechanism for generating diversity within the gene family. Within this framework of gene diversity, however, there is a paradox of remarkably conserved element sequences, suggesting that the divergence from the last common ancestral gene for the extant 185/333 sequences occurred relatively recently. The 185/333 gene family therefore provides an intriguing addition to the growing body of evidence suggesting that invertebrate immune systems are far more complex than previously believed.

*Corresponding authors. E-mail addresses: [email protected]; [email protected].

Abbreviations used: TLR, toll-like receptor; LRR, leucine rich repeat; qPCR, quantitative polymerase chain reaction; EST, expressed sequence tag; 5′ UTR, 5′ untranslated region; BAC, bacterial artificial chromosome; MP, maximum parsimony; ML, maximum likelihood; ILD, incongruence length difference; IP, incongruence permutation; TCR, T-cell receptor; FREP, fibrinogen-related protein; TNT, Tree Analysis Using New Technology.

0022-2836/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.

Results

Two different alignments for the 185/333 gene sequences have been described previously,14,15 and the analyses described here use the repeat-based alignment because it facilitates analysis of the repeats. The repeat-based alignment divides the 185/333 gene sequences into 27 elements (Er1–27) using a combination of gaps and the repeat edges.14 The variable presence or absence of these elements has been used to define element patterns (representative patterns are shown in Fig. 1a). Previous analyses of 185/333 transcripts15,16 and genes14 identified five types of imperfect repeats. A sixth type of repeat, which corresponds to elements Er10a and Er18, was identified in this analysis (Fig. 1a, pattern rA1γ). Type 1 repeats were oriented tandemly in elements Er2–5, whereas repeat types 2–6 were located in elements Er10–26 in mixed interspersed groups (Fig. 1a; Table 1). A dot plot of gene 2-034 (Supplemental File 1), which contains all repeats, illustrates the pattern of tandem repeats present in the 5′ region of the gene as compared to the interspersed repeats present in the 3′ region (Fig. 1b). Neither inverted repeats nor repeats within the intron were identified. Genes have variable numbers of each type of repeat,14 which were numbered with respect to their relative positions (Fig. 1a; see Table 2 for definitions).

The diversity of the 185/333 genes, which was the result of variations in both the element patterns and the nucleotide sequence, led us to investigate the molecular evolution of this gene family and,

Fig. 1. Element patterns and repeats within the 185/333 genes. (a) 185/333 gene structure and element patterns. The first exon, which encodes the leader (L), intron (int.; Greek letter indicates intron type), and pattern of 27 elements in the second exon are shown to scale.14 The elements are numbered (shown at the top) and are designated as Er, as in the text. Each of the repeat types is indicated by a different shading within the element boxes. The numbers for repeat types 1–6 are shown at the bottom. Elements shown in gray do not contain repeats. The patterns of type 1 repeats are identified in the column labeled “type 1.” The categories of repeat patterns in the 3′ end of the gene are shown in the column labeled “3′” (see Table 2 for detailed definitions of element types and gene patterns). (b) Gene 2-034 contains all repeats. A dot plot analysis of gene 2-034 (GenBank accession number EF607716) was performed using a window size of 10 and a mismatch limit of 2. The panel on the left is the plot of the full-length 2-034 sequence. The panels in the middle and on the right are smaller sections of the dot plot and are matched based on the dashed lines surrounding boxes within the left panel.
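The dot plot in panel (b) compares every window of a sequence with every other window of the same sequence. A minimal Python sketch of that idea follows. It assumes that a "mismatch limit of 2" means a pair of 10-nt windows is plotted when the windows differ at no more than two positions, which the caption does not spell out. The function name dot_plot and the toy sequence are illustrative only and are not taken from the paper or from gene 2-034.

```python
def dot_plot(seq_a, seq_b, window=10, max_mismatch=2):
    """Return (i, j) pairs where the windows starting at seq_a[i] and seq_b[j]
    differ at no more than max_mismatch positions."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count mismatching positions between the two windows.
            mismatches = sum(
                a != b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if mismatches <= max_mismatch:
                hits.append((i, j))
    return hits


# Hypothetical toy sequence: a 12-nt unit repeated twice in tandem,
# followed by an unrelated flank. Not real 185/333 sequence.
unit = "ACGTTGCATGCA"
toy = unit + unit + "GGGGGGGGGGGGGGG"

hits = dot_plot(toy, toy)
# Off-diagonal hits mark repeated regions; tandem repeats show up as
# lines parallel to the main diagonal, offset by the repeat length.
off_diagonal = [(i, j) for (i, j) in hits if i != j]
```

In a self-comparison the main diagonal is always filled, so only the off-diagonal hits are informative. Here the tandem copies produce hits such as (0, 12), offset by the 12-nt unit length.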

Table 1. The different types of 185/333 repeats vary in length, number, and diversity

Repeat type            Frequency  Number of unique repeats  Length (nt)  Variable positions  Diversity^a  Repeats per gene  Elements (Er)
1                      292        52                        75–76        52                  0.2241       2–4               2–5
2                      194        13                        21           11                  0.1371       1–4               11, 13, 19, 23
3                      187        7                         23           7                   0.0782       1–2               14, 24
4 (complete)^b         126        21                        90           18                  0.0652       1–2               12, 26
4 (including partial)  211        17                        90           83                  0.0739       1–3               12, 20–23, 26
4a^c                   126        8                         27           7                   0.0785       1–2               12, 20, 26
4b^c                   191        11                        27           12                  0.1104       1–3               12, 21, 26
4c^c                   191        6                         29           7                   0.0535       1–3               12, 22, 26
4d^c                   211        1                         7            0                   0.0000       1–3               12, 23, 26
5                      122        10                        34           7                   0.0583       1–2               15, 25
6                      7          3                         46–49        5                   0.1029       1–2               10, 18

a Diversity scores14,15,22 of all of the aligned repeats; gaps were excluded from the diversity calculations.
b Includes only the complete type 4 repeats, R4.1 and R4.3 (Fig. 1a; Table 2).
c Includes all of the type 4 subrepeats (e.g., R4.1a, R4.2a, and R4.3a; Fig. 1a; Table 2).

perhaps, to identify the source of the diversity. Due to the correlation between elements and repeats, we theorized that the repeats themselves may facilitate gene diversification.14 To eliminate the problem of gaps in phylogenetic analyses and to obtain phylogenetically useful information regarding the repeats, including their variant sequences, their positions within the second exon, and with respect to each other, the repeats were analyzed as individual repeat types.

Type 1 repeats

Phylogenetic analysis of the repeats was first performed on the type 1 repeats because of their variability and relatively simple organization. Four tandem type 1 repeats (designated as R1.1–R1.4; see Table 2) were located in elements Er2–5 (Fig. 1a; Table 1). Type 1 repeats had 52 variable positions (of 75 or 76 nt) and nearly twice the diversity of the other types of repeats (Table 1). In total, 292 type 1

Table 2. Terminology for repeats and genes

Term                                Example        Definition
Repeats                             RX.Y           Type X repeat in the Yth alignment position
                                    R1.1           Type 1 repeat in the first alignment position
                                    R2.1           Type 2 repeat in the first alignment position
                                    R1.Y_a         Type 1 repeats in the Yth alignment position from a gene with "a" type 1 repeats
                                    R1.1_4         Type 1 repeat in the first position from a gene with four type 1 repeats
                                    R1.1_2/3       Type 1 repeat in the first position from a gene with either two or three type 1 repeats
                                    R4.2a          Type 4 repeat in the second position, partial subrepeat type a
                                    R2/3           Concatenation of repeats 2 and 3
Type 1 repeat patterns^a            R1_1/2/3/4     Gene with all four type 1 repeats
                                    R1_1/4         Gene with type 1 repeats in positions 1 and 4
                                    R1_1/4^α       Gene with type 1 repeats in positions 1 and 4, plus an α-type intron
                                    R1_2/4         Gene with two type 1 repeats in the second and fourth positions
                                    R1_1/2/4       Gene with three type 1 repeats in the first, second, and fourth positions
                                    R1_1/2/–/4     Gene with three type 1 repeats; structure is the same as genes with four type 1 repeats, but with the third repeat deleted
Type 2 to type 6 element patterns   I–VII          See Fig. 1 for examples of each of the element patterns
Elements: repeat-based elements     Er1–27         27 elements defined by the repeat-based alignment14
                                    Er8_1          Repeat-based element 8, first unique sequence
Genes: element patterns^b           rA1γ           Repeat (r)-based pattern alignment, element pattern A1, intron type γ
                                    rE2/3/7/δ      Repeat (r)-based pattern alignment, element pattern was combined from E2, E3, and E7 from the cDNA-based alignment, intron type δ

a Gene structure based on the type 1 repeats.
b Pattern of all the elements for an entire gene (Fig. 1a; see Refs. 14,15,18).

repeats were isolated from 121 unique genes.
The 52 unique repeat sequences were aligned and employed in phylogenetic analysis using maximum parsimony (MP), maximum likelihood (ML), and Bayesian methods.

Individual genes contained between two and four type 1 repeats. Genes were initially aligned14 such that those with two type 1 repeats all had repeats R1.1 and R1.4, and genes with three type 1 repeats also had R1.2 (Fig. 1a; Table 2). Results of the phylogenetic analyses indicated that there were two important factors regarding each type 1 repeat: (i) its position within the gene (R1.1–R1.4), and (ii) the total number of type 1 repeats present in the gene from which each individual repeat was obtained (denoted with a subscript, written here after an underscore; Fig. 2; Table 2). To incorporate these factors into the nomenclature for each repeat, they were named such that an R1.1_2 repeat was located in the first position and was obtained from a gene with two type 1 repeats (Fig. 1a, rD5ε; Table 2). The arrangement, length, and diversity of type 1 repeats resulted in an informative phylogenetic analysis.

Results from phylogenetic analyses showed three major clades and seven categories of type 1 repeats (R1.4_2/3, R1.4_4, R1.3_4, R1.2_4, R1.2_3, R1.1_4, and R1.1_2/3; Fig. 2; Table 2). The R1.4_2/3 repeat category was composed of R1.4 repeats from genes with either two or three type 1 repeats. The R1.2_3 repeat category also included the R1.4 repeat from gene 2-017 and the R1.1 repeat from gene 10-035. Both of these genes had been previously aligned such that they contained repeats R1.1 and R1.4,14 but based on these results, these two type 1 repeats were realigned to the R1.2 position (Fig. 1a, element patterns rC5α and rE3α). All seven repeat categories were supported by Bayesian, MP, and ML analyses (data not shown). The only major difference among the tree topologies from the three different tree-building methods was that the ML analysis supported a monophyletic R1.1_4 clade, rather than the paraphyletic designation shown in Fig. 2. Overall, with the exception of the repeats from genes with four type 1 repeats, the primary factor determining the grouping of the repeats was their position in the gene, with little resolution between repeats from genes with two or three type 1 repeats. Repeats from genes with four type 1 repeats clustered together, with R1.1_4 and R1.4_4 most closely related to each other, and with R1.2_4 closely related to R1.3_4. Understanding the relationships among each of the repeat types allowed for a theoretical reconstruction of the evolutionary histories of each of the seven categories of type 1 repeats.
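The two factors behind the repeat names (alignment position and the number of type 1 repeats in the source gene) can be expressed as a small lookup. The Python sketch below is one reading of the seven categories listed above, not a procedure from the paper. It writes the subscript after an underscore, and the function name type1_category is hypothetical. It does not handle exceptions noted in the text, such as gene 4-1528, whose three repeats cluster with those from four-repeat genes.

```python
def type1_category(position, repeats_in_gene):
    """Map a type 1 repeat to one of the seven categories named in the text.

    position:        alignment position of the repeat (1-4)
    repeats_in_gene: number of type 1 repeats in the source gene (2-4)
    """
    if position not in (1, 2, 3, 4) or repeats_in_gene not in (2, 3, 4):
        raise ValueError("position must be 1-4 and repeats_in_gene 2-4")
    if repeats_in_gene == 4:
        # Repeats from four-repeat genes form their own categories.
        return f"R1.{position}_4"
    if position in (1, 4):
        # Positions 1 and 4 are not resolved between two- and three-repeat genes.
        return f"R1.{position}_2/3"
    if position == 2:
        # Includes the realigned repeats from two-repeat genes (e.g., gene 2-017).
        return "R1.2_3"
    raise ValueError("position 3 is only described for genes with four repeats")


# Enumerate every combination the text describes.
categories = set()
for pos in (1, 2, 3, 4):
    for n in (2, 3, 4):
        try:
            categories.add(type1_category(pos, n))
        except ValueError:
            pass
```

Enumerating all valid inputs recovers the seven category names and no others, which is a quick consistency check on the naming scheme.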

Fig. 2. Seven classifications of type 1 repeats. The tree was derived from Bayesian analysis of the 52 unique type 1 repeat sequences. The seven classifications of type 1 repeats are listed to the right. Specific nodes on the tree are indicated by Roman numerals I–III. The R1.4 repeats from genes with two or three type 1 repeats formed a strongly supported clade, designated as the R1.4_2/3 category. The rest of the repeats formed clade I with two paraphyletic R1.4_4 repeats at the base. The remainder of the type 1 repeats, nested within clade I, formed clade II, with the R1.1_4 sequences as a paraphyletic group at the base. The R1.1_2 and R1.1_3 repeats were mixed in clade III, defining the R1.1_2/3 category. The R1.2_3 repeats formed a strongly supported group within clade III. Gene 2-017 (element pattern rE3α; Fig. 1a; Supplemental File 1) was originally characterized with two type 1 repeats: R1.1 and R1.4.14 However, because the R1.4 repeat clustered with the R1.2_3 repeats in clade III, this particular repeat was realigned to the R1.2 position, making gene 2-017 an R1_1/2 gene with repeats R1.1 and R1.2. Clade III also included R1.2_4 subclades and R1.3_4 repeats. Due to the short length of the alignment, support for the nodes was low, and the trees were largely unresolved. Relative Bremer support values derived from MP-based analysis are shown to the left and above the branches. Posterior probabilities derived from Bayesian analysis are shown to the right and above the branches. The values below the branches indicate bootstrap values based on 2000 replications of ML analysis. Dashes indicate that the bootstrap values were not significant or that the node was not recovered using the specified tree-building method.

Proposed molecular evolution of the type 1 repeats

Closer examination of the consensus sequences of the seven type 1 repeat categories indicated that some of the repeats might be the result of recombination events, in addition to point mutations (Fig. 3a). The repeats were therefore subjected to further phylogenetic analysis that optimized the relationships among the repeats, while also identifying recombination.11,23 Results from this analysis indicated that the most likely evolution of the type 1 repeats was from three ancestral sequences (Fig. 3a). The inferred consensus sequences for these ancestors are shown in Fig. 3a. Incorporating the recombination events into the phylogenetic analysis of the type 1 repeats provided greater insight into their evolution and allowed for the construction of putative evolutionary histories for each of the seven type 1 repeat categories in terms of the three hypothetical ancestors.

Four of the seven repeat categories (R1.1_2/3, R1.4_2/3, R1.2_4, and R1.3_4) were most similar to the hypothetical ancestors. The differences among and within these repeat categories were likely the result of point mutations (Fig. 3a). Because the R1.2_4 and R1.3_4 repeats evolved from ancestor B, but shared a polymorphism unique to these repeat categories (an A rather than a G in position 46; Fig. 3a), these repeats were most likely the result of a duplication event following their divergence from ancestor B. In contrast, the R1.2_3, R1.1_4, and R1.4_4 repeats appeared to be the result of two different recombination events among the ancestral sequences. The first 20 nt of the consensus sequence from the R1.2_3 repeats were most similar to the R1.4_2/3 repeats, which were derived from ancestor C. However, 55 nt at the 3′ end of the R1.2_3 repeats were nearly identical with the R1.1_2/3 repeats derived from ancestor A (Fig. 3a). Thus, the most likely origin of the R1.2_3 repeats was a recombination event between the A and the C hypothetical ancestral sequences in which the 5′ end of the C sequence and the 3′ end of the A sequence were linked (Fig. 3a). In contrast, the R1.1_4 and R1.4_4 repeats were most likely the result of a recombination between the 5′ sequence of ancestor C and the 3′ sequence of ancestor B. Therefore, the type 1 repeats from the 185/333 genes appear to be the result of duplication and point mutations, in addition to recombination among the different repeat categories.

Analysis of 185/333 genes based on type 1 repeats

To understand the evolution of the 5′ end of the 185/333 genes, the type 1 repeat categories were used to analyze 185/333 genes. This, in turn, provided insight into the evolution of elements Er2–5 and the 5′ end of the 185/333 genes. The seven categories of type 1 repeats were used to define seven varieties of genes, which were designated as R1_X, with the X indicating the locations in which the type 1 repeats were positioned (Fig. 1a). For example, an R1_1/4 gene had two type 1 repeats in positions R1.1 and R1.4 (Fig. 1a; Table 2). Of the 75 genes with two type 1 repeats, 62 were R1_1/4 genes with R1.1 and R1.4 repeats (Fig. 1a; Table 2). There were also genes with two type 1 repeats, of which the first repeat clustered within the R1.2_3 clade (Fig. 2), defining the R1_1/2 repeat pattern. In addition, there were R1_2/4-type genes, including four rC3α genes and one r01α gene (Fig. 1a), which had R1.1 repeats most similar to the R1.2_3 clade (Fig. 2).

Classification of genes with three or four type 1 repeats was more straightforward than that for genes with two repeats. Genes with four repeats were designated as R1_1/2/3/4 (rA1γ and rG2γ; Fig. 1a; Table 2). There were 42 genes with three type 1 repeats, and all but one of these were missing repeat R1.3 and therefore were designated as gene type R1_1/2/4. The most commonly identified gene element pattern, rD1/6/7α,14 was an R1_1/2/4 gene. The remaining gene with three type 1 repeats (gene 4-1528; Supplemental File 1) was unusual in that, although it contained repeats R1.1, R1.2, and R1.4, those repeats consistently clustered within the corresponding clades from the R1_1/2/3/4 genes rather than those from the R1_1/2/4 genes (Fig. 2), resembling an R1_1/2/3/4 gene that was missing repeat R1.3. Therefore, this gene was separated from the R1_1/2/4 genes into its own group, designating it as R1_1/2/–/4 (Table 2).

Analysis of the type 1 repeat patterns and intron sequences among the genes revealed a general correlation between repeat pattern and intron type. The α intron was found in 54 genes,14 of which 41 had three type 1 repeats. In contrast, most of the R1_1/4 genes had introns β, δ, or ε. The two genes with variant type 1 repeat patterns (R1_1/2 and R1_2/4) also had the α intron. Seven genes with the R1_1/4 type 1 repeat pattern (five with element pattern rB3α, one with element pattern rC5α, and one with element pattern rB7α; Fig. 1a; Supplemental File 1) were unusual in that they also had the α intron and were defined with an additional type 1 repeat pattern (R1_1/4^α; Table 2). Of the six element patterns that formed the variant genes that had two type 1 repeats, three had other intron types. For example, exon element pattern rB3 was found with introns α and β, with the majority containing intron β (10 of 15 genes). The minority pattern (5 of 15 genes), rB3α, was categorized as an R1_1/4^α-type gene.14 The genes with two type 1 repeats were therefore divided into four groups (R1_1/4, R1_1/4^α, R1_1/2, and R1_2/4), three of which were considered variant because they were observed in a minority of the genes. The R1_1/2/3/4 genes all had the γ intron, as did the R1_1/2/–/4 clone (element pattern E10γ; Fig. 1a). Because there was no correlation between intron type and type 1 repeat pattern in these genes, this hinted at recombination between the intron and the second exon, and was subsequently used in an analysis of gene evolution as a function of the type 1 repeat patterns.

Hypothetical evolution of the 5′ end of the 185/333 genes

To understand the evolution of the gene family as a whole, we used the patterns of type 1 repeats in the

Fig. 3. Evidence for repeat recombination. (a) Analysis of the extant and inferred ancestral type 1 repeat consensus sequences indicates that some of the repeats are the result of recombination. Each of the consensus sequences from the seven repeat categories is shown, in addition to the sequence of the theoretical ancestral sequences A–C (A=blue; B=pink; C=yellow). Nucleotides identical with the consensus are indicated with a dot. Dark lines at the bottom denote regions of simple repeats. (b) The type 1 repeats evolved from three ancestors. The hypothetical single ancestor (Anc) is shown at the center of the network. Computational analysis of the unique type 1 repeat sequences that optimize phylogenetic trees by allowing recombination between branches11,23 indicates that the optimal evolution of the repeats utilizes three ancestral sequences, which were the hypothetical sources of the extant repeat sequences. Those ancestral sequences (labeled A–C) are shown as pink, yellow, or blue circles (as in a). Extant repeat sequences are shown as dark blue ovals. Sequences that resulted from recombination events (A+C and C+B) are indicated by dual-colored circles. Because the inferred sequences of ancestors A and B were nearly identical (differing by 3 nt of 76 nt), it is likely that A and B shared a common ancestor (labeled “AB,” gray circle) more recently than their divergence from the type 1 repeat ancestor (Anc). In contrast, the type C ancestral sequence was significantly different from both ancestors A and B (18 nt different from A, and 15 nt different from B), indicating that it most likely evolved directly from the original ancestral repeat. (c) The type 1 repeat patterns evolved through duplication, deletion, and recombination. The hypothetical Anc has a single type 1 repeat, and the putative sequence is shown in (a). A duplication of Anc generated a 185/333 gene with two type 1 repeats (types AB and C). Divergence of the AB repeat among the 185/333 genes yielded two varieties of type 1 repeats (AC and BC), which are virtually indistinguishable in the extant sequences. Recombination and unequal crossover between the AC and BC genes resulted in a C+B recombination event, which produced the R1.2-type recombinant repeat, and a 185/333 gene with three type 1 repeats (R1_1/2/4). Deletion of either the R1.1_3 or the R1.4_3 repeats led to the R1_2/4 or R1_1/2 genes, respectively. The R1_1/2/3/4 genes, which contain four type 1 repeats, arose through an unknown mechanism that may have involved 185/333 gene sequences that either are not present in the animals sampled or are now extinct from the population. The R1.2_4 and R1.3_4 repeats arose from the B ancestor, as opposed to the R1.1_4 and R1.4_4 repeats, which appear to be the result of a recombination between A and C ancestral repeats. The R1_1/2/–/4-type gene is most likely the result of a deletion of an R1.3_4 repeat from an R1_1/2/3/4-type gene.
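The reasoning in panel (a), where the 5′ portion of a repeat matches one ancestor and the 3′ portion matches another, can be illustrated with a simple breakpoint scan. The Python sketch below is not the method used in the paper, which relied on the computational analysis cited as refs. 11 and 23. It is a minimal illustration, assuming pre-aligned sequences of equal length and a single crossover. The sequences and the names hamming and best_breakpoint are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def best_breakpoint(query, five_prime_parent, three_prime_parent):
    """Scan every single-crossover position k and return (k, mismatches) for
    the k that minimizes mismatches when query[:k] is compared with one
    parent and query[k:] with the other."""
    best = None
    for k in range(1, len(query)):
        cost = (hamming(query[:k], five_prime_parent[:k])
                + hamming(query[k:], three_prime_parent[k:]))
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


# Hypothetical toy "ancestors" (not the inferred A-C sequences from Fig. 3a).
parent_a = "ACGTACGTACGTACGTACGT"
parent_c = "TTGGCCAATTGGCCAATTGG"

# A chimera with the 5' end of C and the 3' end of A, crossover after 8 nt.
chimera = parent_c[:8] + parent_a[8:]

breakpoint_pos, mismatches = best_breakpoint(chimera, parent_c, parent_a)
```

A sequence that fits neither parent well on its own but fits a 5′/3′ combination with few mismatches is the pattern the caption describes for the recombinant repeat categories.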

185/333 genes. This analysis employed the putative evolution of the type 1 repeats (Fig. 3b) plus the pattern of type 1 repeats within 185/333 genes (Fig. 1a). The seven varieties of genes defined by the type 1 repeat patterns each had theoretical evolutionary histories with respect to the evolution of the repeats (Fig. 3c). The R11/4 genes had the most straightforward putative history such that the two type 1 repeats may have evolved as a result of a duplication of the ancestral type 1 repeat, which subsequently diverged into the R1.12 and R1.42 repeats. Because the R1.23 repeats appeared to have resulted from recombination between the C and the B ancestral repeats (Fig. 3b), genes with three type 1 repeats were likely generated through unequal crossover and recombination between hypothetical ancestors of the forms A–C and B–C. This recombination would have generated a gene containing a hybrid R1.23 (a mix of ancestors C and B) flanked by A- and C-type repeats (Fig. 3c, R11/2/4). Similarly, variant genes with two type 1 repeats (R11/2 and R12/4) likely arose as a deletion of a single repeat (R1.1 or R1.4) from an R11/2/4 gene (Fig. 3c).

The evolution of the genes with four type 1 repeats was difficult to infer (Fig. 3c). The theoretical ancestral sequence for the R1.24 and R1.34 repeats was B (Fig. 3b), suggesting that the pair resulted from a duplication. The R1.14 and R1.44 repeats appeared to be recombinants of the A and the C ancestors (Fig. 3b) and were not found in any other sequences. The R11/2/3/4 genes were therefore generated through an unknown mechanism independent of the extant R11/2/4 genes that may have utilized repeats from genes that were either no longer in the gene pool or not present within the sampled population. The R11/2/–/4 gene was most likely the result of a deletion of an R1.34 repeat from an R11/2/3/4 gene. Thus, from a minimal number of ancestral sequences, it was possible to reconstruct theoretical evolutionary histories for the 5′ end of most of the 185/333 genes that incorporate point mutations, recombination, duplications, and deletions of the type 1 repeats.

Repeat types 2–6

In contrast to the tandem type 1 repeats located in the 5′ region of the 185/333 genes, the 3′ end of the genes was composed of complex patterns of five types of interspersed repeats.14 Repeat types 2–6 were shorter and notably less diverse than the type 1 repeats, as measured by both diversity and number of unique sequences (Table 1). Type 2 repeats were present in individual genes between one and four times; repeat types 3, 5, and 6 were present only once or twice (Table 1). A single type 4 repeat (repeat R4.3) was identified in element Er26 of all genes (Fig. 1a; Table 2). Three genes contained a second type 4 repeat (repeat R4.1 in element Er12). The third type 4 repeat, R4.2, was complete in some genes (rA1γ and rG2γ; Fig. 1a) and present as partial subrepeats (designated as R4.2a, R4.2b, R4.2c, and R4.2d) in other genes (e.g., rB3α and rD5ε; Fig. 1a). Although R4.2a–c correlated with element sequences Er20–22, R4.2d did not correspond to an element and instead was the first 7 nt of element Er23.14 Three genes contained a full-length R4.2 repeat, while the remaining 118 genes had R4.2 subrepeats (45 genes had R4.2b and R4.2c, 20 genes only had R4.2b, 20 genes only had R4.2c, and 33 only had R4.2d). Overall, repeat types 2–6 were distinct from type 1 repeats not only in their orientation and pattern but also in sequence conservation and length.

Repeat types 2–6 were generally too short and conserved to provide useful phylogenetic information. However, type 2 repeats were consistently found in tandem with type 3 repeats, which facilitated phylogenetic analysis on this pair of repeats (R2/3; Table 2). When concatenated, these repeats were 44 nt long and provided 13 informative characters. Genes had either one or two R2/3 repeat pairs (R2.1/3.1 and/or R2.4/3.2; Fig. 1a). In total, 187 R2/3 pairs yielded 18 unique sequences (Fig. 4a). For genes that contained both pairs of tandem repeats, the R2.1/3.1 pair consistently formed a clade separate from the R2.4/3.2 pair (Fig. 4b). In all but two genes that contained a single pair of R2/3 repeats, this pair was monophyletic within the R2.1/3.1 clade.

Patterns of the repeats at the 3′ end of the genes

Although the sequences of the individual type 2 to type 6 repeats were more conserved than the type 1 repeats, the variable patterns of the 3′ repeats were remarkably complex. The majority of the genes lacked many of the 13 possible repeats between elements Er10 and Er26. The presence/absence of repeat types 2–6 defined seven patterns (I–VII), which were more coarsely divided into two groups: pattern I, which contained the full suite of 3′ repeats, and patterns II–VII, which were lacking some of these repeats (Fig. 1a).

Evolution of the complete 3′ repeat structure

Three 185/333 genes contained all 13 of the 3′ repeats (four type 2 repeats, two type 3 repeats, three type 4 repeats, two type 5 repeats, and two type 6 repeats; elements Er10a/g–26) and were designated as pattern I (Fig. 1a). Type 6 repeats, located in elements Er10a, Er10g, and Er18, were present only in pattern I genes. The 3′ repeats were arranged as two tandem blocks of a series of six consecutive repeats: a type 6 repeat followed by a type 2 repeat, a type 4 repeat, a second type 2 repeat, a type 3 repeat, and a type 5 repeat, or 6–2–4–2–3–5, followed by a single type 4 repeat: (6–2–4–2–3–5)₂–4 (Fig. 1a). The type I 3′ repeat pattern was most likely the result of a duplication of this block, followed by the insertion of an additional type 4 repeat (R4.3).

Partial 3′ repeat patterns

In total, 118 genes had partial 3′ repeat patterns. The presence/absence of repeats, in addition to the phylogenetic analysis of the R2/3 pairs, defined six

Fig. 4. The unique type 2 and type 3 tandem repeats form two clades. (a) Alignment of the 18 unique R2/3 sequences. An alignment of the 18 unique tandem R2/3 sequences was generated from a total of 187 R2/3 repeat pairs from 121 unique 185/333 genes. Subscripts indicate the location within the alignment of the repeat pair from genes with two R2/3 pairs (1=R2.1/3.1; 2=R2.4/3.2; see Fig. 1a). Repeats from genes that had a single pair of repeats do not have subscripts. (b) The R2/3 tandem repeats segregate into two clades. An ML tree was constructed from the alignment of the 18 unique type 2 and type 3 tandem repeats shown in (a). The numbers present on the branches indicate the bootstrap values obtained using 1000 replications. Nodes with bootstrap support of <50% were collapsed. The R2.1/3.1 and R2.4/3.2 clades are indicated. Because the sequences were short, bootstrap support for the distinction between the two major clades (R2.1/3.1 and R2.4/3.2) was low, with the clades defined by a single character (G versus A in position 21; indicated by the box in (a)). When all 187 repeat pairs were used in the analysis, identical sequences consistently formed 18 clades in which each of the sets of identical sequences formed single clades, generating the same tree topology shown. The type 2/3 repeats from genes 2-015 and 10-038 (Supplemental File 1) were more similar to the R2.4/3.2 repeats. Therefore, although the 2-015 gene was originally designated as an rE2/3/7δ gene,14 the R2/3 repeats in elements Er11 and Er14 were realigned to elements Er23 and Er24, and defined a new element pattern rE11δ (Fig. 1a).

partial 3′ repeat patterns (patterns II–VII; Fig. 1a). Patterns II and III contained both the R2.1/3.1 and the R2.4/3.2 repeat pairs, while pattern V contained only the R2.1/3.1 pair, and patterns VI and VII contained only the R2.4/3.2 pair. All of the genes with a partial suite of 3′ repeats contained a single type 4 repeat (R4.3), in addition to either R4.2b, R4.2c, or both. Repeat R4.2b was found in patterns II–IV, while patterns III, IV, and VI also contained R4.2c. In addition to the repeats, there was a short region (14 nt, element Er17) in those genes with both R2/3 repeat pairs, located between R3.2 and R4.2, that was dissimilar to any of the defined repeats or any other 185/333 sequence. All of the partial 3′ repeat patterns, with the exception of pattern IV, contained a single type 5 repeat, R5.2 (Fig. 1a). Due to the complexity of the patterns, it was difficult to draw any conclusions about their evolution. However, given the large number of genes sequenced, it was surprising that there were only seven types of repeat patterns within the 3′ end of the 185/333 genes.

Simple repeats

In addition to the six major repeat types, the 185/333 genes contained a variety of simple short repeats. Simple repeats have been shown to increase genomic instability24 and recombination frequency,25 and have been implicated as mediators of the assembly of the lamprey-based variable lymphocyte receptor genes.26 The type 1 repeats in the 185/333 genes contained three full and two partial 11 nt repeats (see the dot plots in Fig. 1b and the consensus sequence in Fig. 3a). These simple repeats were not identified elsewhere within the 185/333 sequence and encoded two glycine residues, which are enriched at the 5′ end of the sequence.15,16 The 3′ end of the genes is histidine-rich, with several stretches of sequence encoding tandem histidines.15,16 The repeated histidine codons constituted multiple simple repeats within the larger type 5 and type 6 repeats. Therefore, the presence of simple repeats within the 185/333 genes may hint at a diversification mechanism for this gene family.

185/333 gene recombination

One of the more striking results from this analysis of the 185/333 gene family was the lack of correlation between the type 1 repeat patterns and the 3′ repeat patterns. Genes with element patterns rD5ε and r01/8δ both had R11/4 type 1 repeat patterns, but had very different 3′ repeat patterns (III and V, respectively; Fig. 1a). In contrast, genes with element patterns rD5ε and rD1/6/7α all had type III 3′ repeat patterns but different type 1 repeat patterns (R11/4 and R11/2/4, respectively). This agreed with our previous analysis of 290 EST sequences, which indicated that transcripts with identical leader sequences could have different 5′ UTR sequences and vice versa.16 Given these results and the relatively low number of unique sequences for each element as compared to the large number of distinct gene sequences, the distribution of specific element sequences across the family of 185/333 genes was analyzed.

The arrangement of individual element sequences (average of 11 unique sequences per element14) was analyzed from 53 unique 185/333 genes isolated from a single animal (Fig. 5). The 27 elements had varying numbers of distinct sequences, ranging from 1 (Er12) to 28 (Er6 and Er27; Fig. 5), with a total of 297 unique element sequences. Specific element sequences were defined with a subscript (e.g., Er8₁; Table 2). Each specific sequence from each element was analyzed to identify the specific sequence of the element to which it was adjacent. For some elements, a given sequence was adjacent to single unique sequences on either side; however, there were examples of unique element sequences that were adjacent to up to 12 different sequences on either side (Fig. 5). Examination of the distribution of unique element sequences across the set of unique genes indicated that the pattern was complex and warranted a more detailed analysis.

The associations between adjacent elements for all of the genes from a single animal are illustrated in Fig. 5, in which the circles represent individual element sequences and the connecting lines indicate elements that are adjacent in at least one 185/333 gene. Two examples of putative recombinant genes are shown (Fig. 5). Gene 2-061 appeared to be the

Fig. 5. The 185/333 elements are shared among genes. Element sequences from the repeat-based alignment within the 185/333 gene family of animal 214 are shown as circles. Each of the columns of circles represents a set of unique sequences for a given element from the second exon of the 185/333 genes. Element (Er) numbers are shown at the bottom; “L” indicates the leader sequence encoded by the first exon, and “Int” indicates the intron sequence. Lines that connect the circles indicate that the two distinct element sequences are adjacent to one another in one or more genes. The figure is the result of a manually generated display of adjacent element sequences. An example of numbering for individual element sequences is given for Er8, while the numbers for the other individual element sequences are not shown. Er8 has three unique sequences, which are shown as three circles labeled from top to bottom with their corresponding numbers (Er8₁–₃).14 Our analysis of Er8 showed that it was found in 46 genes and was never present in genes with Er7. Er8₁, which was found in 21 unique genes, was adjacent to nine unique Er6 sequences and a single Er9 sequence. Er8₂ (located in 20 unique genes) was adjacent to 12 unique Er6 sequences and 1 Er9 sequence. Er8₃ was found in five genes, and was adjacent to four Er6 sequences and two Er9 sequences. Two putative recombinant sequences and their source genes are indicated by the colored lines and circles. Yellow lines and circles indicate the element sequences present in gene 2-028 (GenBank accession number EF607711). Element sequences from gene 2-012 (GenBank accession number EF607697) are in light blue. The putative recombinant sequence generated from these two genes, 2-061 (GenBank accession number EF607733), is indicated in green. As an example of putative recombination in element Er6, gene 2-061 shares element sequences L through Er5 with gene 2-028, and element sequences Er7 through Er27 with gene 2-012. A second example of recombination, between genes 2-063 (red; GenBank accession number EF607734) and 2-034 (dark blue; GenBank accession number EF607716), is gene 2-015 (purple; GenBank accession number EF607699), which is identical in sequence to gene 2-063 from L–Er10, has a theoretical recombination event in Er23, and has elements Er23–27 that are identical with gene 2-034.

product of the 5′ end of gene 2-028 and the 3′ end of gene 2-012, with recombination occurring in element Er6 (Fig. 5, green line). Likewise, recombination was also observed between genes 2-063 and 2-034 to generate gene 2-015. However, for these genes, the putative recombination site was in element Er23 (Fig. 5, purple line), suggesting that recombination can occur at any point along the gene sequences. When the 121 unique genes from three individual sea urchins were analyzed together, a similarly complex distribution of element sequences was observed (Supplemental File 2), suggesting that this phenomenon was not limited to the 185/333 gene repertoire of a single animal, nor to just a few genes. At least 12 of the 121 185/333 genes could have been the product of recombination between two or more other extant genes. Given that the 185/333 gene family has not been sampled exhaustively and that it is not possible to distinguish recombinant from nonrecombinant sequences with confidence, this estimate is likely low. Furthermore, the original genes involved in the recombinations may change over time such that they may not be recognizable as the sources of the recombination events. This analysis revealed a complex mosaic pattern of element sequences among the 185/333 genes and led us to employ statistical approaches to analyze incongruence in the evolutionary histories of the different elements.

To detect recombination among the members of the 185/333 gene family, phylogenetic incongruence among elements was assessed using two methods: the MP-based incongruence length difference (ILD) test27 and the incongruence permutation (IP) test, which uses ML-based tree-building methods. If a tree constructed using a character matrix (alignment) containing two concatenated elements is significantly less optimal than the sum of trees built from the individual elements, this can be interpreted as evidence that the elements have incongruent phylogenetic histories, a possible result of gene recombination. The ILD test utilizes the MP optimality criterion of tree length to assess trees. The IP test uses ML trees and compares the two-tree model versus the one-tree model of evolution by assessing the difference (Δ) in the likelihoods between the concatenated tree and the sum of the two individual trees. The two tests differ in their tree-building methods and measures of optimality, but rely on the same principles and produce largely similar results.

Seven of the 185/333 elements that were found in all genes14 were subjected to incongruence analysis. These regions included the leader, two segments from the intron (the first 139 nt of the intron, designated as intron a, and the last 93 nt, defined as intron b), and elements Er1, Er6, Er26, and Er27. Furthermore, because Er27 was 192 nt long, it was analyzed as two 96-nt fragments (Er27a and Er27b). For all pairs of elements, except Er1 and intron b, the null hypothesis was rejected (p<0.01; a representative histogram of the Δ values is shown in Fig. 6a), indicating that the evolutionary history of each element was significantly incongruent from all of the others (Table 3; Supplemental File 3). For elements Er1 and intron b, the analysis methods produced opposite results in which the ILD test indicated that the elements were congruent, while the IP test detected incongruence. Incongruence was also observed when elements defined by the cDNA-based alignment14,15 were analyzed, indicating that recombination is not necessarily restricted to the edges of the elements (data not shown). Results from the control sets showed that the T-cell receptor (TCR) gene segments

Fig. 6. Elements Er1 and Er27 are significantly incongruent. Histograms of log likelihood ratio Δ values from the ML-based ILD test for (a) 185/333 elements Er1 and Er27, (b) TCR V and J segments, and (c) two segments from the S. purpuratus histone H3 genes. Each test was performed using 100 permutations. The observed values for the nonpermuted segments and significance are indicated. Positive and negative controls were used to assess the accuracy of both tests. A set of human TCR genes served as a positive control because the gene segments undergo recombination to generate a large repertoire of specificities that can recognize a broad array of peptides.28 The TCR sequences exhibited significant levels of incongruence (p<0.01) between the V and J gene segments in both the ILD test and the IP test (see also Table 3). A set of unique S. purpuratus histone H3 genes was employed as a negative control. Histone sequences are largely conserved within and across species, and are believed to evolve through a birth-and-death process with a strong purifying selection that is not influenced by frequent recombination.29 The histone gene sequences were divided into two segments (designated as H3.1 and H3.2) that were similar in size to an average 185/333 element. Results of both the IP test and the ILD test for the H3 gene segments indicated that the evolutionary histories of the segments were congruent (p>0.05), supporting a lack of recombination among H3 genes (Table 3).
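The permutation logic shared by the ILD and IP tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PERL implementation used in the study: the alignment columns of two segments are pooled, randomly redistributed into two new segments of the original sizes, and the incongruence statistic is recomputed for each permuted pair. The statistic used here (`four_gamete_conflicts`) is a hypothetical stand-in chosen so that the example runs without a tree-building program; in the actual tests the statistic is the difference in tree length (ILD) or in likelihood (Δ, IP) between the concatenated tree and the sum of the individual trees. All function names are illustrative.

```python
import random
from itertools import product


def four_gamete_conflicts(cols_a, cols_b):
    """Stand-in statistic (NOT tree length or likelihood): count pairs of
    columns, one from each segment, that show all four two-site combinations."""
    return sum(1 for a, b in product(cols_a, cols_b) if len(set(zip(a, b))) == 4)


def permute_partition(cols_a, cols_b, rng):
    """Pool the columns and redistribute them into two segments of the original sizes."""
    pooled = list(cols_a) + list(cols_b)
    rng.shuffle(pooled)
    return pooled[:len(cols_a)], pooled[len(cols_a):]


def permutation_test(cols_a, cols_b, statistic, n_perm=100, seed=0):
    """Return the observed statistic, the permuted values, and a one-sided p-value."""
    rng = random.Random(seed)
    observed = statistic(cols_a, cols_b)
    null = [statistic(*permute_partition(cols_a, cols_b, rng)) for _ in range(n_perm)]
    # Add-one correction so that the p-value is never exactly zero.
    p_value = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, null, p_value


# Each column is a string holding one character per sequence (four sequences here).
segment_a = ["0011"] * 3  # columns that support one grouping of the sequences
segment_b = ["0101"] * 3  # columns that support a conflicting grouping
observed, null, p_value = permutation_test(segment_a, segment_b, four_gamete_conflicts)
print(observed, round(p_value, 3))
```

A small p-value means that the observed pair of segments conflicts more strongly than most random redistributions of the same columns, which is the pattern interpreted as incongruence in the histograms of Fig. 6.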

Table 3. The 185/333 elements have incongruent phylogenetic histories

        L      Int a  Int b  Er1    Er6    Er6a   Er6b   Er26   Er27   Er27a  Er27b  H3.1   H3.2   TCR V  TCR J
L              ***    **     ***    **     –      –      ***    ***    –      –      –      –      –      –
Int a   **            ***    **     ***    –      –      ***    ***    –      –      –      –      –      –
Int b   **     **            ns     ***    –      –      ***    ***    –      –      –      –      –      –
Er1     **     **     *             ***    –      –      ***    ***    –      –      –      –      –      –
Er6     *      **     **     **                          ***    ***    –      –      –      –      –      –
Er6a    –      –      –      –      –             *      –      –      –      –      –      –      –      –
Er6b    –      –      –      –      –      *             –      –      –      –      –      –      –      –
Er26    **     **     **     **     **     –      –             ***    –      –      –      –      –      –
Er27    **     **     **     **     **     –      –      *             –      –      –      –      –      –
Er27a   –      –      –      –      –      –      –      –      –             *      –      –      –      –
Er27b   –      –      –      –      –      –      –      –      –      **            –      –      –      –
H3.1    –      –      –      –      –      –      –      –      –      –      –             ns     –      –
H3.2    –      –      –      –      –      –      –      –      –      –      –      ns            –      –
TCR V   –      –      –      –      –      –      –      –      –      –      –      –      –             ***
TCR J   –      –      –      –      –      –      –      –      –      –      –      –      –      **

ILD results are shown above the diagonal, whereas IP results are shown below the diagonal. (–) indicates that the test was not performed. ns=not significant (p>0.05). Int, intron; Er, repeat-based element; Er27a and Er27b, two halves of element 27; H3, histone H3; TCR V, TCR V region; TCR J, TCR J region.
* p<0.05. ** p<0.01. *** p<0.001.

recombined while the S. purpuratus histone H3 sequences did not (Fig. 6b and c; Table 3). The results from the ILD and IP tests for the 185/333 elements agreed with the unusual element distribution shown in Fig. 5 and supported the notion of frequent recombination among the 185/333 genes. It was noteworthy that recombination may occur throughout the sequences and that no recombination hot spots were identified.

The 185/333 genes are paradoxically both diverse and conserved

The 185/333 genes have been described as extraordinarily diverse with respect to variations in element patterns and numbers of unique sequences.14 Yet, despite this diversity, the genes were actually surprisingly similar. This paradox was demonstrated through a pairwise analysis of all 121 unique genes using pairwise gap deletion (i.e., every position in which one of the sequences contained a gap was deleted), which indicated that the genes, when analyzed as a set, were >88% identical. But because gap deletion can lead to the loss of valuable information, pairwise percent differences among the sequences were calculated using only five elements present in all genes: Er1, Er6, Er26, Er27, plus the leader. The average pairwise differences for these elements ranged from 2.76% to 6.52% (Table 4). The maximum pairwise difference was 13% in element Er1, in which 12 of 92 nucleotide positions differed between the Er1 sequence variants. This result supported previous observations that the number of unique sequences per element is low and that the high number of unique genes was the result of the mosaic distribution of the element sequences among the genes.14

Table 4. The 185/333 element sequences are conserved

Element  Length (nt)  Average % difference  Maximum age (Myr)a  Minimum age (Myr)b  Maximum difference (%)
Leader   55           2.76                  4.25                2.76                11
Er1      92           5.47                  8.42                5.47                13
Er6      147          3.78                  5.81                3.78                12
Er26     90           3.61                  5.55                3.61                8.9
Er27     138          2.83                  4.35                2.83                8.7

a Using the rate of 0.65% Myr⁻¹.
b Using the rate of 1% Myr⁻¹.

Size of the 185/333 gene family

Analysis of the 185/333 genes using qPCR estimated that the number of 185/333 alleles was between 80 and 120.15 As an alternative approach, the percentages of unique genes obtained from the total number of genes cloned from three animals14 were used to estimate the numbers of alleles per diploid genome. Characterization of 171 genes from three individuals yielded 53 unique genes of 87 total genes from one animal, 38 unique genes of 50 total genes from a second animal, and 30 unique genes out of 34 total genes from a third animal. Using these data points, the ML estimate of the total number of distinct alleles was 118, with a 95% confidence interval for the number of distinct 185/333 alleles ranging from 91 to 142. This model assumed that each animal did not have identical alleles within the 185/333 family and consequently, the result may be an underestimate. Although the diversity of the 185/333 genes suggested that it is unlikely that identical 185/333 sequences are present in a single genome, this possibility cannot be eliminated. Thus, when combining the previous estimate of gene copy numbers by qPCR15 with this statistical analysis, the 185/333 gene family may likely be composed of 100–120 alleles or 50–60 loci.
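The arithmetic behind the figures in Table 4 can be checked with a short Python sketch. The age calculation below is an inference from the tabulated numbers rather than a statement of the procedure used in the study: the listed ages match the average percent difference divided by the substitution rate given in the table footnotes, to within rounding. The function and variable names are illustrative only.

```python
def percent_difference(n_differences, n_sites):
    """Pairwise difference between two aligned sequences, as a percentage."""
    return 100.0 * n_differences / n_sites


def age_myr(percent_diff, rate_percent_per_myr):
    """Assumed relation: age (Myr) = percent difference / rate (% per Myr)."""
    return percent_diff / rate_percent_per_myr


# Average pairwise differences (%) taken from Table 4.
average_difference = {"Leader": 2.76, "Er1": 5.47, "Er6": 3.78, "Er26": 3.61, "Er27": 2.83}
SLOW_RATE = 0.65  # % per Myr, footnote a (gives the maximum age)
FAST_RATE = 1.0   # % per Myr, footnote b (gives the minimum age)

for element, diff in average_difference.items():
    oldest = age_myr(diff, SLOW_RATE)
    youngest = age_myr(diff, FAST_RATE)
    print(f"{element}: {youngest:.2f} to {oldest:.2f} Myr")

# Maximum difference in Er1: 12 of 92 positions differ.
er1_max = percent_difference(12, 92)
print(f"Er1 maximum difference: {er1_max:.1f}%")
```

Under this reading, the slower rate sets the upper bound on the age of each element and the faster rate sets the lower bound, which is why the minimum age column equals the average percent difference.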

Discussion

The size and diversity of the 185/333 gene family provide an interesting system for studying the complexity of the innate immune system in S. purpuratus. The data presented here suggest that the diversity of the 185/333 gene family may, in part, be the result of frequent recombination events, in addition to point mutations. The type 1 repeats, located at the 5′ end of the genes, appear to have originated as a result of recombination, duplication, and deletion. Incongruent phylogenetic histories and analysis of the distribution of specific element sequences across genes also suggest that the genes undergo frequent recombination. Repeats within the genes and other types of repeats that are located on both sides of each gene in the short intergenic regions14 may facilitate recombination by promoting genomic instability24,25 and may be an important mechanism for the diversification of this gene family.

Gene family size

Genes that undergo frequent recombination are typically members of large gene families. One of the best studied examples is the R genes in higher plant genomes that are clustered into numerous families, in addition to many that are isolated and distributed throughout the genome.30–32 In the freshwater snail Biomphalaria glabrata, there are a large number of genes that encode fibrinogen-related proteins (FREPs) and are present in 13 families.7 Significant problems in assembling the genomic region of the sea urchin in which the 185/333 genes are located resulted in the identification of only six genes. Consequently, we have estimated the numbers of 185/333 alleles using two different methods: a statistical estimate (∼118 alleles) and qPCR analysis of genomic DNA (∼100 alleles).15 In addition, bacterial artificial chromosome (BAC) library screens have isolated 48 clones from a small-insert (∼50–80 kB) library (6.25× genome coverage) and 73 clones from a large-insert (∼140 kB) BAC library (25× coverage). Given the numbers of positive BACs, the average insert size, and the genome coverage for each library, the estimated size of the haploid genomic region in which the 185/333 genes are located is 200–250 kB (unpublished results). If the 185/333 genes are closely linked and if each gene, plus a flanking intergenic region, spans ∼5 kB, this suggests that there are about 80–100 alleles. Although the estimate from the BACs may be low due to putative unlinked genes, it is noteworthy that these three estimates of allele copy number are similar. Given that immune response genes are commonly organized as gene clusters, the predicted structure of the 185/333 gene family provides further evidence that the 185/333 proteins are involved in the immune response to pathogens.

Gene clusters

Clusters of genes involved in the immune response have been noted in a number of species and are thought to promote genomic instability in a variety of ways that facilitate gene diversification. The R genes in higher plants have been studied extensively with regard to genomic instability.8,33 The organization of clustered R genes that have shared sequences promotes diversification through a variety of mechanisms.8,33 The 185/333 genes also appear to be clustered; however, it is not clear whether they are present in a few large clusters or are scattered in small clusters, as has been described for R genes. 185/333 gene clustering is based on PCR amplification of the intergenic region between these genes14 that is supported by whole genomes and by all 185/333-positive BACs isolated from two libraries (unpublished data). This strongly implies that many of these genes are present in tight clusters of two or more copies, although it does not rule out the presence of isolated 185/333 genes.

Preliminary data presented here suggest that there may be two 185/333 gene clusters, based on the differences between genes with four type 1 repeats versus genes with fewer type 1 repeats. The origin of genes with four type 1 repeats, R11/2/3/4, cannot be deduced from our data and appears to have evolved independently of those with three type 1 repeats. There are very few identical sequences of type 1 repeats that are shared in both R11/2/3/4 genes and in genes with fewer type 1 repeats. Furthermore, the R11/2/3/4 genes only have the γ-type intron, while the other genes have all intron types. The R11/2/3/4 genes may be clustered and physically separated from the genes with two or three type 1 repeats, and this may reduce the frequency with which they recombine. Conversely, it is possible that the R11/2/3/4 genes may be mixed within one or more clusters of 185/333 genes with fewer than four type 1 repeats, but have fundamental differences in element pattern, repeat sequences within or flanking the genes, and/or nucleotide sequences that preclude frequent recombination with the other genes.

Mechanisms of diversification

Diversification of R genes in plants can be the result of the sequence that is exchanged between clustered paralogous alleles in the process of gene conversion, which is thought to generate new alleles with novel pathogen-recognition specificities.30,34,35 Furthermore, gene duplications and/or deletions can also generate new or variant R genes. On a larger scale, groups of genes may be translocated to remote sites by duplications/deletions of varying sizes of chromosomal segments.10,32,36 Significant variations in the numbers of tandemly linked R genes may be the result of unequal crossovers or meiotic mispairing.37 Not only may blocks of sequence recombine and alter the recognition specificity of R genes, but the process of diversification can also have “bystander” effects that alter the copy number and positions of fragments of non-R genes that flank R gene clusters. As a result of these mechanisms to diversify R genes, their copy numbers within clusters can vary among cultivars.38
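The BAC-based allele estimate quoted under "Gene family size" can be reproduced with a back-of-the-envelope calculation. The Python sketch below assumes, as the text does, that the genes are closely linked and that each gene plus flanking intergenic region spans about 5 kB; the doubling from a haploid gene count to a diploid allele count is an assumption made here to match the reported 80–100 alleles. The function name and parameters are illustrative.

```python
def alleles_from_region(region_kb, gene_span_kb=5.0, ploidy=2):
    """Estimate allele number from the size of the haploid region (assumed model)."""
    genes_per_haploid_genome = region_kb / gene_span_kb
    return genes_per_haploid_genome * ploidy


low = alleles_from_region(200)   # lower bound of the estimated region size
high = alleles_from_region(250)  # upper bound of the estimated region size
print(f"{low:.0f} to {high:.0f} alleles")
```

Under these assumptions, a 200–250 kB region holds 40–50 genes per haploid genome, which corresponds to the 80–100 alleles given in the text and sits close to the statistical (∼118) and qPCR (∼100) estimates.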

On a smaller scale, recombination may be the the coding regions, there are repeats that surround major mechanism for R gene diversification,33 and a each gene in the intergenic region.14 We have not large portion of R genes may be the result of at least investigated these repeats in detail, but speculate one recombination event30 producing new variants that they may promote duplication and/or deletion that may appear in populations within a single ge- of intact genes or blocks of tandem genes, as has neration.35,39 The location at which recombination been noted for R genes. The intergenic repeats plus occurs depends on the family. Recombination hot the 185/333 genes themselves, which may act as large spots have been identified in Xa21, a rice R gene repeats, may facilitate meiotic mispairing, resulting in family that exchanges promoters among para- variations in the numbers of 185/333 genes in indi- logues.40 On the other hand, recombination in the viduals. Based on the multiple types of repeats that Dm3 family in lettuce occurs in the coding region.41 we have identified in this gene cluster,14–16 the level It seems that the sequence requirements for these of genomic instability and the rate of recombination different levels of DNA exchanges are the repeats may be more extreme than that observed for the R within the R gene clusters.34 The highest levels of gene and FREP families. This would predict that the have been noted in the leucine-rich 185/333 gene family is the product of numerous repeats (LRRs) in some R genes, which may be ongoing and recent recombination events and that facilitated by LRRs themselves.42 It is noteworthy the extant members of the gene family are relatively that the TLR gene family in S. purpuratus, which has young. 
>200 members, with many clustered in tandem arrays,19 also shows increased sequence diversity in the LRRs.12 In both R proteins and TLRs, the LRRs are thought to be the site of pathogen recognition and would be under the greatest pathogen pressure for diversification.

Gene recombination is becoming increasingly evident in immune systems in organisms other than higher vertebrates. Yet, only one other large immune response gene family in an invertebrate has been analyzed for recombination as a mechanism for diversification. The FREP genes are expressed in freshwater snails in response to trematode parasites2,5,7,11,43 and are believed to diversify through point mutation and somatic recombination using a limited set of source genes.11 Different suites of FREP genes are present in parent snails compared to offspring, yet both parents and offspring have identical source genes, implying somatic recombination of germline DNA in all tissues. The FREP and R genes therefore represent examples of germline-encoded immune diversity in medium to large families that function in organisms surviving on innate immunity.

We propose that the 185/333 system is another large gene family that employs recombination as a mechanism for diversification. Recombination is evident both between and within elements, suggesting that these events are not limited to the element borders, but instead may occur throughout the entire gene sequence. No recombination hot spots are evident, which is likely due to the presence of multiple repeats throughout the sequence. Frequent gene recombination is strongly implied from the sequences of the type 1 repeats, as well as from the intact genes. Given that the 185/333 gene family has not been exhaustively sampled and because it is not possible to distinguish recombinant from nonrecombinant sequences with confidence, our estimate of 12 of 121 recombinant sequences is likely to be low. A confounding factor in this analysis is that the apparent rate of recombination may be so high that the original genes involved in any given recombination event may undergo additional recombinations over time so that they become unrecognizable as sources of recombination. In addition to the repeats within

This is in agreement with molecular clock analysis indicating that the 185/333 elements are not >10.8 million years old, about the same age as the S. purpuratus species.44 A high recombination rate is also consistent with genome blots showing different and complex banding patterns in different sea urchins15 and that 185/333 genes are not shared among animals.14 The 185/333 gene family appears to undergo a higher level of diversification than much of the rest of the genome (excluding other immune response gene clusters such as the TLR family12,13,19). This is supported by our diversity analysis of the 185/333 genes, which have a score of 0.30 compared to the diversity score of 0.15 for the sea urchin histone H3 gene family.

The 185/333 gene family represents an intriguing addition to what is currently known about the complexity of invertebrate immune systems. The diversity is based on both variations in nucleotide sequence and mosaic combinations of elements into distinct element patterns,14 generating a diverse repertoire of transcripts15,16,18 and proteins20 (D.A. Raftos, unpublished data) in response to immune challenge. This diversity may reflect diversification pressure placed on S. purpuratus by the microbes present within their marine environment.45 Marine microbial rRNA sequences isolated from Eastern Pacific seawater suggest that there are 2 × 10^6 bacteria/ml and 5 × 10^5 archaea/ml.46 Given this level of constant pathogen exposure, it is only reasonable to expect that any organism living in this environment would survive based on a complex immune system that incorporates mechanisms to keep pace with the swift evolutionary variations in microbial pathogens.

Materials and Methods

Sequences

The set of 185/333 gene sequences employed in this study has GenBank accession numbers EF607618–EF607793. Details of their isolation and preliminary analysis can be found in Buckley and Smith.14 Nonunique clones were assumed to be duplicates of the same gene and were not included in the current analysis.
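The exclusion of nonunique clones described above can be illustrated with a short sketch. This is a minimal Python illustration, not the script used in the study: the function name `unique_sequences` and the example clone records are hypothetical, and it assumes that two clones count as duplicates when their sequences are identical after upper-casing.

```python
from typing import Iterable, List, Tuple


def unique_sequences(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first clone seen for each distinct sequence.

    Assumption: clones are duplicates when their sequences are identical
    after upper-casing; the criterion used in the study may differ.
    """
    first_seen = {}  # upper-cased sequence -> name of the first clone carrying it
    for name, seq in records:
        key = seq.upper()
        if key not in first_seen:
            first_seen[key] = name
    # dicts preserve insertion order, so the output follows the input order
    return [(name, seq) for seq, name in first_seen.items()]


# Hypothetical clones: the second record duplicates the first.
clones = [("clone_a", "ATGGCA"), ("clone_b", "atggca"), ("clone_c", "ATGGCT")]
print(unique_sequences(clones))
```

Keeping the first occurrence is an arbitrary choice here; any representative of a duplicate set would serve equally well for the downstream analyses.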

Phylogenetic analyses

Sequences were aligned using the repeat-based alignment.14 Bioedit47 was used to edit alignments and generate dot plots. PERL scripts were used to remove unique sequences, to calculate entropy scores,22 and to calculate diversity characteristics of the repeats.14 Trees were viewed and edited using TreeView.48 To improve MP analysis, which treats gaps as missing data, the single gap in the type 1 repeat sequences was coded as an additional binary (presence/absence) character to recover the lost information49 with the program GapCoder.50 Estimates of the numbers of 185/333 alleles were calculated using previously employed statistical methods.51

Tree building and support

Parsimony analyses were performed using Tree Analysis Using New Technology (TNT)52 with a standard heuristic search strategy consisting of 200 Wagner addition trees with subtree pruning and regrafting and tree bisection and reconnection branch swapping. ML trees were generated using the dnaml program in the PHYLIP 3.66 package.53 MODELTEST54 was used to select the models of sequence evolution that best fit the data using the Akaike information criterion. Bayesian analyses were executed in MRBAYES;55 commands for MRBAYES are found in Supplemental File 4. Three independent chains of 4,000,000 generations, saving one tree every 1000 generations, were performed. Plots of the negative log likelihood against the number of generations were used to identify the generations where the variation in the posterior probabilities began to asymptote. Trees obtained below this limit were discarded as burn-in. Posterior probabilities were calculated as the 51% majority-rule consensus of the remaining trees. Bootstrap analyses were performed using 2000 replications in the computer programs TNT52 and PHYLIP.53 Relative Bremer support values56 were calculated using 50,000 suboptimal trees in the computer program TNT.52

Measures of element congruence

Two approaches to measuring phylogenetic congruence among the elements (thereby measuring putative gene recombination between those elements) were performed: the MP-based ILD test27 and a similar method rooted in ML, designated as the IP test. The ILD test was performed in WinClada57 using 1000 replications. The IP test analyzed incongruence between two elements (or any sequence segments) by estimating the evolutionary histories for both elements individually versus concatenated elements. The two-tree model versus the one-tree model was used to determine whether the two elements share a common evolution. A sequence that results from a crossover between two or more ancestors can be best explained not by a single tree but by a set of correlated trees over the sequence.58 The log of the likelihood ratio $\Delta = \log L_A + \log L_B - \log L_{A+B}$ was used to compare the two nested models (one-tree model versus two-tree model). The Δ values between the concatenated tree and the sum of the two individual trees increase as the difference between the evolutionary histories of the two segments increases. Ideally, the expected distribution of Δ values would be known under the null hypothesis. However, because the models are too complex for this distribution to be derived, a distribution was generated using permuted matrices created through character resampling. Therefore, the distribution of Δ was calculated under the null hypothesis (that a single tree optimally represents the evolution of both elements because there was no recombination between the two segments). Characters were resampled by randomly distributing them between two new alignments of the same dimensions as the original segments for 100 permutations using a PERL script. A value of Δ was generated for each of these permuted pairs. A set of human TCR genes served as positive controls;28 25 unique sea urchin early histone H3 genes served as negative controls.12,59

Protein Data Bank accession numbers

Accession numbers and coordinates of the specific sequences used are found in Supplemental File 5.

Acknowledgements

The authors would like to thank Dr. David Raftos for providing helpful improvements to the manuscript; Drs. Diana Lipscomb, Marc Allard, and Fernando Alvarez for providing valuable assistance with phylogenetic analysis; and Dr. Heinrich Schulenberg for bringing simple repeats to our attention. This research was supported by funding from the National Science Foundation (MCB-0424235) to L.C.S.; travel funding from the Department of Biological Sciences and the Columbian College of Arts and Sciences, George Washington University; and a Weintraub Fellowship from the George Washington University to K.M.B.

Supplementary Data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmb.2008.04.037

References

1. Flajnik, M. F. & Du Pasquier, L. (2004). Evolution of innate and adaptive immunity: can we draw a line? Trends Immunol. 25, 640–644.
2. Adema, C. M., Hertel, L. A., Miller, R. D. & Loker, E. S. (1997). A family of fibrinogen-related proteins that precipitates parasite-derived molecules is produced by an invertebrate after infection. Proc. Natl Acad. Sci. USA, 94, 8691–8696.
3. Cannon, J. P., Haire, R. N. & Litman, G. W. (2002). Identification of diversified genes that contain immunoglobulin-like variable regions in a protochordate. Nat. Immunol. 3, 1200–1207.
4. Destoumieux, D., Bulet, P., Loew, D., Van Dorsselaer, A., Rodriguez, J. & Bachere, E. (1997). Penaeidins, a new family of antimicrobial peptides isolated from the shrimp Penaeus vannamei (Decapoda). J. Biol. Chem. 272, 28398–28406.
5. Hertel, L. A., Adema, C. M. & Loker, E. S. (2005). Differential expression of FREP genes in two strains of Biomphalaria glabrata following exposure to the digenetic trematodes Schistosoma mansoni and Echinostoma paraensei. Dev. Comp. Immunol. 29, 295–303.

6. Jones, J. D. & Dangl, J. L. (2006). The plant immune system. Nature, 444, 323–329.
7. Leonard, P. M., Adema, C. M., Zhang, S. M. & Loker, E. S. (2001). Structure of two FREP genes that combine IgSF and fibrinogen domains, with comments on diversity of the FREP gene family in the snail Biomphalaria glabrata. Gene, 269, 155–165.
8. McDowell, J. M. & Simon, S. A. (2008). Molecular diversity at the plant–pathogen interface. Dev. Comp. Immunol. 32, 736–744.
9. O'Leary, N. A. & Gross, P. S. (1999). Genomic structure and transcriptional regulation of the penaeidin gene family from Litopenaeus vannamei. Gene, 371, 75–83.
10. Parniske, M. & Jones, J. D. (1999). Recombination between diverged clusters of the tomato Cf-9 plant disease resistance gene family. Proc. Natl Acad. Sci. USA, 96, 5850–5855.
11. Zhang, S. M., Adema, C. M., Kepler, T. B. & Loker, E. S. (2004). Diversification of Ig superfamily genes in an invertebrate. Science, 305, 251–254.
12. Sodergren, E., Weinstock, G. M., Davidson, E. H., Cameron, R. A., Gibbs, R. A., Angerer, R. C. et al. (2006). The genome of the sea urchin Strongylocentrotus purpuratus. Science, 314, 941–952.
13. Hibino, T., Coll, M. L., Messier, C., Majeske, A. C., Terwilliger, D. P., Buckley, K. M. et al. (2006). The immune gene repertoire encoded in the purple sea urchin genome. Dev. Biol. 300, 349–365.
14. Buckley, K. M. & Smith, L. C. (2007). Extraordinary diversity among members of the large gene family, 185/333, from the purple sea urchin, Strongylocentrotus purpuratus. BMC Mol. Biol. 8, 68.
15. Terwilliger, D. P., Buckley, K. M., Mehta, D., Moorjani, P. G. & Smith, L. C. (2006). Unexpected diversity displayed in cDNAs expressed by the immune cells of the purple sea urchin, Strongylocentrotus purpuratus. Physiol. Genomics, 26, 134–144.
16. Nair, S. V., Del Valle, H., Gross, P. S., Terwilliger, D. P. & Smith, L. C. (2005). Macroarray analysis of coelomocyte gene expression in response to LPS in the sea urchin. Identification of unexpected immune diversity in an invertebrate. Physiol. Genomics, 22, 33–47.
17. Rast, J. P., Pancer, Z. & Davidson, E. H. (2000). New approaches towards an understanding of deuterostome immunity. Curr. Top. Microbiol. Immunol. 248, 3–16.
18. Terwilliger, D. P., Buckley, K. M., Brockton, V., Ritter, N. J. & Smith, L. C. (2007). Distinctive expression patterns of 185/333 genes in the purple sea urchin, Strongylocentrotus purpuratus: an unexpectedly diverse family of transcripts in response to LPS, beta-1,3-glucan, and dsRNA. BMC Mol. Biol. 8, 16.
19. Rast, J. P., Smith, L. C., Loza-Coll, M., Hibino, T. & Litman, G. W. (2006). Genomic insights into the immune system of the sea urchin. Science, 314, 952–956.
20. Brockton, V., Henson, J. H., Raftos, D. A., Majeske, A. J., Kim, Y. O. & Smith, L. C. (2008). Localization and diversity of 185/333 proteins from the purple sea urchin - unexpected protein-size range and protein expression in a new coelomocyte type. J. Cell Sci. 121, 339–348.
21. Hancock, J. M. & Simon, M. (2005). Simple sequence repeats in proteins and their significance for network evolution. Gene, 345, 113–118.
22. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis, Probability Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
23. Munshaw, S. & Kepler, T. B. (2008). An information-theoretic method for the treatment of plural ancestry in phylogenetics. Mol. Biol. Evol. In Press. doi:10.1093/molbev/msn066.
24. Rogaev, E. I. (1990). Simple human DNA-repeats associated with genomic hypervariability, flanking the genomic and similar to retroviral sites. Nucleic Acids Res. 18, 1879–1885.
25. Majewski, J. & Ott, J. (2000). GT repeats are associated with recombination on human chromosome 22. Genome Res. 10, 1108–1114.
26. Nagawa, F., Kishishita, N., Shimizu, K., Hirose, S., Miyoshi, M., Nezu, J. et al. (2007). Antigen-receptor genes of the agnathan lamprey are assembled by a process involving copy choice. Nat. Immunol. 8, 206–213.
27. Mickevich, M. F. & Farris, J. S. (1981). The implications of congruence in Menidia. Syst. Zool. 30, 351–370.
28. Siu, G., Clark, S. P., Yoshikai, Y., Malissen, M., Yanagi, Y., Strauss, E. et al. (1984). The human T cell antigen receptor is encoded by variable, diversity, and joining gene segments that rearrange to generate a complete V gene. Cell, 37, 393–401.
29. Rooney, A. P., Piontkivska, H. & Nei, M. (2002). Molecular evolution of the nontandemly repeated genes of the histone 3 multigene family. Mol. Biol. Evol. 19, 68–75.
30. Bakker, E. G., Toomajian, C., Kreitman, M. & Bergelson, J. (2006). A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell, 18, 1803–1818.
31. Meyers, B. C., Kozik, A., Griego, A., Kuang, H. & Michelmore, R. W. (2003). Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell, 15, 809–834.
32. Richly, E., Kurth, J. & Leister, D. (2002). Mode of amplification and reorganization of resistance genes during recent Arabidopsis thaliana evolution. Mol. Biol. Evol. 19, 76–94.
33. Michelmore, R. W. & Meyers, B. C. (1998). Clusters of resistance genes in plants evolve by divergent selection and a birth-and-death process. Genome Res. 8, 1113–1130.
34. Hammond-Kosack, K. E. & Jones, J. D. (1997). Plant disease resistance genes. Annu. Rev. Plant Physiol. Plant Mol. Biol. 48, 575–607.
35. Hulbert, S. H., Webb, C. A., Smith, S. M. & Sun, Q. (2001). Resistance gene complexes: evolution and utilization. Annu. Rev. Phytopathol. 39, 285–312.
36. Leister, D. (2004). Tandem and segmental gene duplication and recombination in the evolution of plant disease resistance gene. Trends Genet. 20, 116–122.
37. Meyers, B. C., Kaushik, S. & Nandety, R. S. (2005). Evolving disease resistance genes. Curr. Opin. Plant. Biol. 8, 129–134.
38. Smith, S. M., Pryor, A. J. & Hulbert, S. H. (2004). Allelic and haplotypic diversity at the rp1 rust resistance locus of maize. , 167, 1939–1947.
39. Ellis, J., Lawrence, G., Ayliffe, M., Anderson, P., Collins, N., Finnegan, J. et al. (1997). Advances in the molecular genetic analysis of the flax–flax rust interaction. Annu. Rev. Phytopathol. 35, 271–291.
40. Song, W. Y., Pi, L. Y., Wang, G. L., Gardner, J., Holsten, T. & Ronald, P. C. (1997). Evolution of the rice Xa21 disease resistance gene family. Plant Cell, 9, 1279–1287.
41. Meyers, B. C., Chin, D. B., Shen, K. A., Sivaramakrishnan, S., Lavelle, D. O., Zhang, Z. & Michelmore, R. W. (1998). The major resistance gene cluster in lettuce is highly duplicated and spans several megabases. Plant Cell, 10, 1817–1832.
42. Mondragon-Palomino, M., Meyers, B. C., Michelmore, R. W. & Gaut, B. S. (2002). Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana. Genome Res. 12, 1305–1315.
43. Zhang, S. M. & Loker, E. S. (2004). Representation of an immune responsive gene family encoding fibrinogen-related proteins in the freshwater mollusc Biomphalaria glabrata, an intermediate host for Schistosoma mansoni. Gene, 341, 255–266.
44. Lee, Y. H. (2003). Molecular phylogenies and divergence times of sea urchin species of Strongylocentrotidae, Echinoida. Mol. Biol. Evol. 20, 1211–1221.
45. Smith, L. C. (2005). Host responses to bacteria: innate immunity in invertebrates. In Advances in Molecular and Cellular Microbiology (McFall-Ngai, M. J., Henderson, B. & Ruby, E. G., eds), vol. 10, pp. 293–320, Cambridge University Press.
46. Massana, R., Guillou, L., Diez, B. & Pedros-Alio, C. (2002). Unveiling the organisms behind novel eukaryotic ribosomal DNA sequences from the ocean. Appl. Environ. Microbiol. 68, 4554–4558.
47. Hall, T. A. (1999). BioEdit: a user friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp. Ser. 41, 95–98.
48. Page, R. D. (1996). TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12, 357–358.
49. Simmons, M. P. & Ochoterena, H. (2000). Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49, 369–381.
50. Young, N. D. & Healy, J. (2003). GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinf. 4, 6.
51. Barth, R. K., Kim, B. S., Lan, N. C., Hunkapiller, T., Sobieck, N., Winoto, A. et al. (1985). The murine T-cell receptor uses a limited repertoire of expressed V gene segments. Nature, 316, 517–523.
52. Goloboff, P., Farris, J. & Nixon, K. C. (2003). TNT: Tree Analysis Using New Technology. Program and documentation available from the authors and at www.zmuc.dk/publnic/phylogeny.
53. Felsenstein, J. (2005). PHYLIP (Phylogeny Inference Package) Version 3.6. Department of Genome Sciences, University of Washington, Seattle, WA; distributed by the author.
54. Posada, D. & Crandall, K. A. (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics, 14, 817–818.
55. Huelsenbeck, J. P. & Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17, 754–755.
56. Bremer, K. (1988). The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution, 42, 795–803.
57. Nixon, K. C. (1999–2002). WinClada Ver. 1.00.08. Nixon K.C., Ithaca, NY, USA.
58. Hudson, R. R. (1983). Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23, 183–201.
59. Marzluff, W. F., Sakallah, S. & Kelkar, H. (2006). The sea urchin histone gene complement. Dev. Biol. 300, 308–320.

Supplemental File 1: Characteristics of the unique 185/333 genes used in this analysis

Sequence  Genbank Accession #  Element Pattern  Intron Pattern  Number of repeats (1 2 3 4 5 6)  Type 1 Pattern  Type 2-6 Pattern
2-002 EF607673 rD1/6/7  3 2 2 1 + bc 1 0  III
2-005 EF607674 rD1/6/7  3 2 2 1 + bc 1 0  III
2-006 EF607675 rF1/2  2 2 2 1 + b 1 0  II
2-007 EF607676 rD1/6/7  3 2 2 1 + bc 1 0  III
2-008 EF607677 rD1/6/7  3 2 2 1 + bc 1 0  III
2-009 EF607678 rF1/2  2 2 2 1 + b 1 0  II
2-010 EF607679 rF1/2  2 2 2 1 + b 1 0  II
2-011 EF607688 rE2/3/7  2 1 1 1 1 0  V
2-012 EF607697 rE2/3/7  2 1 1 1 1 0  V
2-013 EF607698 rD1/6/7  3 2 2 1 + bc 1 0  III
2-015 EF607699 rE2/3/7  2 1 1 1 1 0  V
2-017 EF607700 rE3  2 1 1 1 1 0  V
2-018 EF607701 rD1/6/7  3 2 2 1 + bc 1 0  III
2-019 EF607702 rE2/3/7  2 1 1 1 1 0  V
2-020 EF607703 rE2/3/7  2 1 1 1 1 0  V
2-021 EF607704 rD1/6/7  3 2 2 1 + bc 1 0  III
2-022 EF607705 rD1/6/7  3 2 2 1 + bc 1 0  III
2-023 EF607706 rD1/6/7  3 2 2 1 + bc 1 0  III
2-024 EF607707 rD1/6/7  3 2 2 1 + bc 1 0  III
2-025 EF607708 rE2/3/7  2 1 1 1 1 0  V
2-026 EF607709 rD1/6/7  3 2 2 1 + bc 1 0  III
2-027 EF607710 rD1/6/7  3 2 2 1 + bc 1 0  III
2-028 EF607711 rD1/6/7  3 2 2 1 + bc 1 0  III
2-029 EF607712 rE2/3/7  2 1 1 1 1 0  V
2-031 EF607713 rE2/3/7  2 1 1 1 1 0  V
2-032 EF607714 rD1/6/7  3 2 2 1 + bc 1 0  III
2-033 EF607715 rE2/3/7  2 1 1 1 1 0  V
2-034 EF607716 rA1  4 4 2 3 2 2  I
2-035 EF607717 rE2/3/7  2 1 1 1 1 0  V
2-036 EF607718 rA1  4 4 2 3 2 2  I
2-037 EF607719 rE2/3/7  2 1 1 1 1 0  V
2-038 EF607720 rD1/6/7  3 2 2 1 + bc 1 0  III
2-039 EF607721 rD1/6/7  3 2 2 1 + bc 1 0  III
2-040 EF607722 rF1/2  2 2 2 1 + b 1 0  II
2-041 EF607723 r01/2  2 1 1 1 + c 1 0  IV
2-047 EF607724 rE2/3/7  2 1 1 1 1 0  V
2-048 EF607725 rD1/6/7  3 2 2 1 + bc 1 0  III
2-049 EF607726 r01/2  2 1 1 1 + c 1 0  IV
2-050 EF607727 r01/2  2 1 1 1 + c 1 0  IV
2-052 EF607728 rD1/6/7  3 2 2 1 + bc 1 0  III
2-055 EF607729 rB3  2 2 2 1 + b 1 0  II
2-056 EF607730 rD1/6/7  3 2 2 1 + bc 1 0  III
2-057 EF607731 rD1/6/7  3 2 2 1 + bc 1 0  III
2-059 EF607732 r01/2  2 1 1 1 + c 1 0  IV
2-061 EF607733 rE6  3 1 1 1 1 0  V
2-063 EF607734 rE2/3/7  2 1 1 1 1 0  V
2-064 EF607735 rD1/6/7  3 2 2 1 + bc 1 0  III
2-065 EF607736 r01/2  2 1 1 1 + c 1 0  IV
2-066 EF607737 rF1/2  2 2 2 1 + b 1 0  II
2-067 EF607738 rE2/3/7  2 1 1 1 1 0  V
2-069 EF607739 r01/2  2 1 1 1 + c 1 0  IV
2-070 EF607740 rD1/6/7  3 2 2 1 + bc 1 0  III
2-071 EF607741 rE2/3/7  2 1 1 1 1 0  V
2-073 EF607742 rE2/3/7  2 1 1 1 1 0  V
2-076 EF607743 rD1/6/7  3 2 2 1 + bc 1 0  III
2-077 EF607744 rD1/6/7  3 2 2 1 + bc 1 0  III
2-079 EF607745 rD1/6/7  3 2 2 1 + bc 1 0  III
2-080 EF607746 rD1/6/7  3 2 2 1 + bc 1 0  III
2-081 EF607747 rF1/2  2 2 2 1 + b 1 0  II
2-082 EF607748 rD1/6/7  3 2 2 1 + bc 1 0  III
2-083 EF607749 r01/2  2 1 1 1 + c 1 0  IV
2-084 EF607750 r01/2  2 1 1 1 + c 1 0  IV
2-089 EF607751 rE2/3/7  2 1 1 1 1 0  V
2-090 EF607752 rE2/3/7  2 1 1 1 1 0  V
2-091 EF607753 rE2/3/7  2 1 1 1 1 0  V
2-093 EF607754 rF1/2  2 2 2 1 + b 1 0  II
2-094 EF607755 rE2/3/7  2 1 1 1 1 0  V
2-095 EF607756 rE2/3/7  2 1 1 1 1 0  V
2-096 EF607757 rF1/2  2 2 2 1 + b 1 0  II
2-098 EF607758 rD1/6/7  3 2 2 1 + bc 1 0  III
2-099 EF607759 rE2/3/7  2 1 1 1 1 0  V
2-101 EF607680 rD1/6/7  3 2 2 1 + bc 1 0  III
2-102 EF607681 rD1/6/7  3 2 2 1 + bc 1 0  III
2-103 EF607682 rF1/2  2 2 2 1 + b 1 0  II
2-104 EF607683 rB3  2 2 2 1 + b 1 0  II
2-105 EF607684 rE2/3/7  2 1 1 1 1 0  V
2-106 EF607685 rD1/6/7  3 2 2 1 + bc 1 0  III
2-107 EF607686 rE2/3/7  2 1 1 1 1 0  V
2-109 EF607687 rD1/6/7  3 2 2 1 + bc 1 0  III
2-110 EF607689 rF1/2  2 2 2 1 + b 1 0  II
2-111 EF607690 rD1/6/7  3 2 2 1 + bc 1 0  III
2-112 EF607691 rE2/3/7  2 1 1 1 1 0  V
2-114 EF607692 rE2/3/7  2 1 1 1 1 0  V
2-115 EF607693 rD1/6/7  3 2 2 1 + bc 1 0  III
2-116 EF607694 r01/2  2 1 1 1 + c 1 0  IV
2-118 EF607695 rE2/3/7  2 1 1 1 1 0  V
2-119 EF607696 rE2/3/7  2 1 1 1 1 0  V
4-11501 EF607760 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11503 EF607761 r01/2  2 1 1 1 + c 1 0  IV
4-11508 EF607762 r01/2  2 1 1 1 + c 1 0  IV
4-11511 EF607763 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11515 EF607764 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11516 EF607765 rC3  2 2 2 1 + bc 1 0  III
4-11517 EF607766 r01/2  2 1 1 1 + c 1 0  IV
4-11520 EF607767 rE2/3/7  2 1 1 1 1 0  V
4-11521 EF607768 rB3  2 2 2 1 + b 1 0  II
4-11523 EF607769 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11528 EF607770 rE10  3 1 1 1 1 0  V
4-11530 EF607771 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11531 EF607772 rE2/3/7  2 1 1 1 1 0  V
4-11532 EF607773 rE2/3/7  2 1 1 1 1 0  V
4-11533 EF607774 rE2/3/7  2 1 1 1 1 0  V
4-11534 EF607775 rB3  2 2 2 1 + b 1 0  II
4-11535 EF607776 r01/2  2 1 1 1 + c 1 0  IV
4-11536 EF607777 rC5  2 1 1 1 + bc 0 0  IV
4-11538 EF607778 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11542 EF607779 rE2/3/7  2 1 1 1 1 0  V
4-11543 EF607780 rD1/6/7  3 2 2 1 + bc 1 0  III
4-11544 EF607781 rC3  2 2 2 1 + bc 1 0  III
4-11545 EF607782 rC2  3 2 2 1 + bc 1 0  III
4-11547 EF607783 r01/2  2 1 1 1 + c 1 0  IV
4-11548 EF607784 rB3  2 2 2 1 + b 1 0  II
4-11550 EF607785 rE2/3/7  2 1 1 1 1 0  V
4-12410 EF607786 rD1/6/7  3 2 2 1 + bc 1 0  III
4-12412 EF607787 rD1/6/7  3 2 2 1 + bc 1 0  III
4-12415 EF607788 rB3  2 2 2 1 + b 1 0  II
4-12449 EF607789 rD1/6/7  3 2 2 1 + bc 1 0  III
10-001 EF607621 rB3  2 2 2 1 + b 1 0  II
10-002 EF607622 rB3  2 2 2 1 + b 1 0  II
10-003 EF607623 r01/2  2 1 1 1 + c 1 0  IV
10-004 EF607624 rD5  2 2 2 1 + bc 1 0  III
10-005 EF607625 rD5  2 2 2 1 + bc 1 0  III
10-006 EF607626 rD1/6/7  3 2 2 1 + bc 1 0  III
10-007 EF607627 rB6  2 1 1 1 1 0  V
10-008 EF607628 rE2/3/7  2 1 1 1 1 0  V
10-010 EF607629 rG2  4 4 2 3 2 2  I
10-011 EF607630 rB3  2 2 2 1 + b 1 0  II
10-012 EF607631 rB3  2 2 2 1 + b 1 0  II
10-013 EF607632 rA6  4 4 2 3 2 2  I
10-014 EF607633 rD1/6/7  3 2 2 1 + bc 1 0  III
10-015 EF607634 rD1/6/7  3 2 2 1 + bc 1 0  III
10-017 EF607635 rB3  2 2 2 1 + b 1 0  II
10-018 EF607636 rB6  2 1 1 1 1 0  V
10-019 EF607637 rB2  3 2 2 1 + b 1 0  II
10-020 EF607638 rE2/3/7  2 1 1 1 1 0  V
10-021 EF607639 rD1/6/7  3 2 2 1 + bc 1 0  III
10-022 EF607640 rD1/6/7  3 2 2 1 + bc 1 0  III
10-024 EF607641 rD1/6/7  3 2 2 1 + bc 1 0  III
10-025 EF607642 r01  2 1 1 1 + c 1 0  IV
10-026 EF607643 rE2/3/7  2 1 1 1 1 0  V
10-027 EF607644 rE2/3/7  2 1 1 1 1 0  V
10-028 EF607645 r01  2 1 1 1 + c 1 0  IV
10-029 EF607646 r01/2  2 1 1 1 + c 1 0  IV
10-030 EF607647 r01/2  2 1 1 1 + c 1 0  IV
10-031 EF607648 rB3  2 2 2 1 + b 1 0  II
10-032 EF607649 r01/2  2 1 1 1 + c 1 0  IV
10-034 EF607650 r01/2  2 1 1 1 + c 1 0  IV
10-035 EF607651 rC3  2 2 2 1 + bc 1 0  III
10-037 EF607652 r01/2  2 1 1 1 + c 1 0  IV
10-038 EF607653 r01/2  2 1 1 1 + c 1 0  IV
10-039 EF607654 rB3  2 2 2 1 + b 1 0  II
10-040 EF607655 rB3  2 2 2 1 + b 1 0  II
10-042 EF607656 rB3  2 2 2 1 + b 1 0  II
10-043 EF607657 rB3  2 2 2 1 + b 1 0  II
10-045 EF607658 rD1/6/7  3 2 2 1 + bc 1 0  III
10-046 EF607659 r01/2  2 1 1 1 + c 1 0  IV
10-048 EF607660 r01/2  2 1 1 1 + c 1 0  IV
10-049 EF607661 rD1/6/7  3 2 2 1 + bc 1 0  III
10-050 EF607662 rC3  2 2 2 1 + bc 1 0  III
10-051 EF607663 rB3  2 2 2 1 + b 1 0  II
10-052 EF607664 r01  2 1 1 1 + c 1 0  IV
10-053 EF607665 r01/2  2 1 1 1 + c 1 0  IV
10-054 EF607666 rB7  2 1 1 1 1 0  V
10-055 EF607667 rD1/6/7  3 2 2 1 + bc 1 0  III
10-056 EF607668 rD1/6/7  3 2 2 1 + bc 1 0  III
10-057 EF607669 rD1/6/7  3 2 2 1 + bc 1 0  III
10-058 EF607670 r01/2  2 1 1 1 + c 1 0  IV

Supplemental File 2: Distribution of element sequences among genes from three S. purpuratus individuals. Element sequences from the repeat-based alignment within the 185/333 gene family of animal 2 (Buckley and Smith 2007) are shown as circles. Each of the columns of circles represents a set of unique sequences for a given element from the second exon of the 185/333 genes. Element (Er) numbers are shown at the bottom; “L” indicates the leader sequence encoded by the first exon and “Int” indicates the intron sequence. Lines that connect the circles indicate that the two distinct element sequences are adjacent to one another in one or more genes. The colors of the circles indicate the animal from which the element sequence was isolated (red = animal 10; blue = animal 2; yellow = animal 4; green = animals 2 and 4; purple = animals 2 and 10; orange = animals 4 and 10; gray = animals 2, 4, and 10; Buckley and Smith 2007).

[Figure: columns of circles labeled, left to right, L, Er1 to Er27, and Int.]

Supplemental File 3: Maximum likelihood trees generated from elements Er1 (A) and Er27 (B). Sequences are colored according to their clades in A. The two elements are phylogenetically incongruent, as determined by both the ILD and MLILR tests. Furthermore, many of the clades present in the Er1 tree are disrupted in the Er27 tree. For example, the four sequences that comprise the monophyletic light green clade (10-52, 10-50, 10-35, and 4-1523) are scattered throughout the Er27 tree. This is further evidence for recombination among the 185/333 genes.

[Figure: maximum likelihood trees, panel A (Er1) and panel B (Er27); tips are labeled with sequence names.]

Supplemental File 4: Commands for MRBAYES

Begin mrbayes;
    outgroup 10_18_D;
    charset DNA=1-76;
    charset GAP=77-77;
    partition part1 = 2: DNA, GAP;
    set partition=part1;
    prset ratepr=variable;
    [partition 1: nucleotide characters, two substitution types, gamma-distributed rates]
    lset applyto=(1) nst=2 rates=gamma;
    prset statefreqpr=fixed(equal) tratiopr=beta(1,1) pinvarpr=uniform(0,1);
    unlink shape=(all) pinvar=(all) statefreq=(all) revmat=(all);
    [partition 2: the single coded gap character, gamma-distributed rates]
    lset applyto=(2) rates=gamma;
    set autoclose=yes;
    mcmcp ngen=4000000 printfreq=1000 samplefreq=1000 nchains=4 temp=0.15 savebrlens=yes filename=Uniquerepeats_GAPCODER;
    mcmc;
end;
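The burn-in and majority-rule steps described in Materials and Methods, which are applied to the trees sampled by the commands above, can be illustrated with a small sketch. It is not the procedure MRBAYES itself runs: each sampled tree is represented here simply as a collection of clades (sets of tip names), which is an assumption made for illustration, and the function name `clade_support`, the tip names, and the burn-in value are hypothetical. The 0.51 threshold mirrors the 51% majority rule stated in the text.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

Clade = FrozenSet[str]  # a clade is identified by the set of its tip names


def clade_support(
    trees: List[Iterable[Clade]], burn_in: int, min_support: float = 0.5
) -> Dict[Clade, float]:
    """Return the frequency of each clade among post-burn-in trees.

    Only clades whose frequency is at least `min_support` are kept,
    which corresponds to a majority-rule consensus threshold.
    """
    kept = trees[burn_in:]  # discard the first `burn_in` sampled trees
    if not kept:
        raise ValueError("burn-in removes every sampled tree")
    counts: Counter = Counter()
    for tree in kept:
        counts.update(set(tree))  # count each clade once per tree
    n = len(kept)
    return {clade: k / n for clade, k in counts.items() if k / n >= min_support}


# Hypothetical sample of four trees, each reduced to one clade for brevity.
sampled = [
    [frozenset({"s1", "s2"})],
    [frozenset({"s1", "s2"})],
    [frozenset({"s1", "s3"})],
    [frozenset({"s1", "s2"})],
]
support = clade_support(sampled, burn_in=1, min_support=0.51)
```

In practice the burn-in point would be read from the likelihood plot, as the text describes, rather than fixed in advance.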

Supplemental File 5: Controls used for the recombination analysis. The identifiers for the histone sequences are SPU numbers; these sequences are available at http://annotation.hgsc.bcm.tmc.edu/Urchin/. Identifiers for the TCR sequences are GenBank accession numbers.

Histones

Histone Sequence   Segment A Start   Segment A Stop   Segment B Start   Segment B Stop
SPU_000307         1                 169              170               237
SPU_000464         1                 169              170               237
SPU_001021         1                 169              170               237
SPU_001709         1                 169              170               237
SPU_001711         1                 169              170               237
SPU_002772         1                 169              170               237
SPU_010740         1                 169              170               237
SPU_003828         1                 169              170               237
SPU_016047         1                 169              170               237
SPU_005019         1                 169              170               237
SPU_015741         1                 169              170               237
SPU_007820         1                 169              170               237
SPU_024346         1                 169              170               237
SPU_008993         1                 169              170               237
SPU_018569         1                 169              170               237
SPU_011743         1                 169              170               237
SPU_027750         1                 169              170               237
SPU_025804         1                 169              170               237
SPU_015871         1                 169              170               237
SPU_007835         1                 169              170               237
SPU_014711         1                 169              170               237
SPU_019208         1                 169              170               237
SPU_026462         1                 169              170               237
SPU_020103         1                 169              170               237
SPU_018560         1                 169              170               237

T-cell Receptors

TCR Sequence   V-Region Start   V-Region Stop   J-Region Start   J-Region Stop
AF009786       23               213             220              269
AF430647       62               249             253              302
AF430650       63               250             263              309
AF430654       63               250             254              303
AF430656       63               250             260              309
AF430661       62               249             259              308
AF430670       63               250             257              306
AF430671       63               249             256              306
AF430673       63               249             253              300
AF430677       63               250             257              306
AF430678       63               250             257              306
AF430681       63               250             257              305
AF430686       18               205             218              267
AJ841712       29               216             223              272
AJ841713       26               213             219              261
AJ841720       35               222             244              290
AJ841725       14               203             210              257
AJ841726       26               213             223              272
AJ841733       23               213             223              270
AJ841734       74               258             260              278
AJ841737       25               212             216              265
AJ841738       14               204             217              258
K02885         32               223             236              284
L10118         35               224             228              272
L14854         11               203             210              257
M12883         41               230             236              282
M13550         41               230             236              282
M13850         48               235             238              282
S82067         14               201             208              257
X01417         32               223             236              284
X04926         38               231             241              290
X04928         35               222             226              275
X04931         50               239             243              287
X04933         47               237             241              290
X04935         35               221             226              275
Y15203         35               222             234              281
Z23044         50               238             245              293
Z47369         58               245             255              304
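The character-resampling scheme behind the IP test (Materials and Methods), for which the segment pairs listed above serve as controls, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PERL script used in the study. The scoring function `loglik` is a placeholder that the caller must supply (for example, a wrapper around an ML program that returns the log likelihood of the best tree for an alignment); all function names are hypothetical. The one-sided p-value, taken as the fraction of permuted Δ values at least as large as the observed Δ, is also an assumption, since the text does not state how significance was computed.

```python
import random
from typing import Callable, List, Sequence, Tuple

Alignment = List[str]  # one aligned sequence per taxon, all of equal length
LogLik = Callable[[Alignment], float]  # caller-supplied ML tree score (assumed)


def columns(aln: Alignment) -> List[Tuple[str, ...]]:
    """Return the alignment as a list of columns (characters)."""
    return list(zip(*aln))


def from_columns(cols: Sequence[Tuple[str, ...]], n_taxa: int) -> Alignment:
    """Rebuild one sequence per taxon from a list of columns."""
    return ["".join(col[i] for col in cols) for i in range(n_taxa)]


def delta(seg_a: Alignment, seg_b: Alignment, loglik: LogLik) -> float:
    """Delta = log L_A + log L_B - log L_(A+B) for two segments."""
    concatenated = [a + b for a, b in zip(seg_a, seg_b)]
    return loglik(seg_a) + loglik(seg_b) - loglik(concatenated)


def permute_segments(
    seg_a: Alignment, seg_b: Alignment, rng: random.Random
) -> Tuple[Alignment, Alignment]:
    """Pool all characters and redistribute them at random into two
    alignments with the same dimensions as the original segments."""
    pooled = columns(seg_a) + columns(seg_b)
    rng.shuffle(pooled)
    n_a, n_taxa = len(seg_a[0]), len(seg_a)
    return from_columns(pooled[:n_a], n_taxa), from_columns(pooled[n_a:], n_taxa)


def ip_test(
    seg_a: Alignment, seg_b: Alignment, loglik: LogLik,
    n_perm: int = 100, seed: int = 0,
) -> Tuple[float, List[float], float]:
    """Return the observed Delta, the permuted Deltas, and a one-sided p-value."""
    rng = random.Random(seed)
    observed = delta(seg_a, seg_b, loglik)
    null = []
    for _ in range(n_perm):
        perm_a, perm_b = permute_segments(seg_a, seg_b, rng)
        null.append(delta(perm_a, perm_b, loglik))
    p_value = sum(d >= observed for d in null) / n_perm
    return observed, null, p_value
```

Because every permuted pair mixes characters from both segments, the permuted Δ values approximate what would be expected if a single tree described both segments; an observed Δ in the upper tail of that distribution points to incongruence.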