Genome Evolution and the Evolution of Exon-Shuffling

Gene 238 (1999) 103–114 www.elsevier.com/locate/gene

Review Genome evolution and the evolution of exon-shuﬄing — a review k La´szlo´ Patthy * Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary Received 4 February 1999; received in revised form 5 May 1999; accepted 1 June 1999

Abstract

Recent studies on the genomes of protists, plants, fungi and animals confirm that the increase in genome size and gene number in different eukaryotic lineages is paralleled by a general decrease in genome compactness and an increase in the number and size of introns. It may thus be predicted that exon-shuffling has become increasingly significant with the evolution of larger, less compact genomes. To test the validity of this prediction, we have analyzed the evolutionary distribution of modular proteins that have clearly evolved by intronic recombination. The results of this analysis indicate that modular multidomain proteins produced by exon-shuffling are restricted in their evolutionary distribution. Although such proteins are present in all major groups of metazoa from sponges to chordates, there is practically no evidence for the presence of related modular proteins in other groups of eukaryotes. The biological significance of this difference in the composition of the proteomes of animals, fungi, plants and protists is best appreciated when these modular proteins are classified with respect to their biological function. The majority of these proteins can be assigned to functional categories that are inextricably linked to multicellularity of animals, and are of absolute importance in permitting animals to function in an integrated fashion: constituents of the extracellular matrix, proteases involved in tissue remodelling processes, various proteins of body fluids, membrane-associated proteins mediating cell–cell and cell–matrix interactions, membrane associated receptor proteins regulating cell–cell communications, etc. Although some basic types of modular proteins seem to be shared by all major groups of metazoa, there are also groups of modular proteins that appear to be restricted to certain evolutionary lineages. In summary, the results suggest that exon-shuffling acquired major significance at the time of metazoan radiation. It is interesting to note that the rise of exon-shuffling coincides with a spectacular burst of evolutionary creativity: the Big Bang of metazoan radiation. It seems probable that modular protein evolution by exon-shuffling has contributed significantly to this accelerated evolution of metazoa, since it facilitated the rapid construction of multidomain extracellular and cell surface proteins that are indispensable for multicellularity. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Genome compactness; Introns; Modular proteins; Modules; Metazoan evolution

1. Introduction explained by two types of hypotheses. The ‘introns early’ hypotheses assumed that introns and RNA splicing are Shortly after the discovery of split genes, it was the relics of the RNA world and the ‘genes in pieces’ realized that the existence of introns may have dramatic organization of the eukaryotic genome is the original, consequences on protein evolution (Gilbert, 1978). It ancestral form (Darnell, 1978; Doolittle, 1978; Darnell was pointed out that recombination within introns could and Doolittle, 1986; Gilbert, 1986). According to this assort exons independently, and middle repetitious view, eukaryotes retained introns and the genetic plastic- sequences in introns may create hotspots for recombina- ity of the primitive ancestors of all cells. On the other tion to shuffle the exonic sequences. hand, bacteria gained increased efficiency by eliminating The presence of introns in most eukaryotic protein- their introns. Supporters of the introns-early hypotheses coding genes and their absence from prokaryotes was assume that the introns of all protein-coding genes reflect the assembly of these genes from pieces; that k exons do indeed correspond to building blocks Presented at the Symposium on Evolutionary Genomics, Puntarenas, Costa Rica, 11–15 January 1999. (a-helices, b-sheets, etc.) from which all the genes were * Tel.: +36-1-4665-633. fax: +36-1-4665-465. assembled by intronic recombination (Gilbert and E-mail address: [email protected] (L. Patthy) Glynias, 1993).

In contrast with this, the ‘introns late’ theories suggest exon-shuffling became significant at the time of the that the prokaryotic genes resemble the ancestral ones appearance of the first multicellular animals, and that and that the introns were inserted later in genes of the rise of exon-shuffling could in fact contribute to the eukaryotes (Crick, 1979; Orgel and Crick,1980; Cavalier- explosive nature of metazoan radiation. Smith, 1985; Cech, 1985; Sharp, 1985). In fact, it is now obvious that the exon–intron structure of eukaryotic protein-coding genes is not static: introns are continually 2. Results and discussion inserted into (as well as removed from) genes. The actual mechanisms of insertion, propagation of some self- 2.1. Introns and evolution of genome compactness splicing introns have been analyzed in detail and the mechanisms responsible for the insertion of spliceosomal In the past few years, the complete sequences of the introns are also becoming clear (Dujon, 1989; genomes of several Eubacteria (Escherichia coli, Bacillus Lambowitz and Belfort, 1989; Perlman and Butow, subtilis, Haemophilus influenzae, Borrelia burgdorferi, 1989; Morl and Schmelzer, 1990; Belfort, 1991, 1993; Mycoplasma pneumoniae, Mycoplasma genitalium, etc.), Lambowitz, 1993; Mueller et al., 1993; Grivell, 1994; Archaea (Methanococcus jannaschii, Archaeoglobus ful- Patthy, 1995). gidus, etc.), a unicellular eukaryote (Saccharomyces cere- Since introns themselves are subject to evolution, it visiae), and a multicellular animal (Caenorhabditis is clear that exon-shuffling has been evolving parallel elegans) have been determined, and significant progress with the evolution of introns. We have argued previously has also been made on the genome of a protozoan that the introns suitable for exon-shuffling appeared at parasite (Plasmodium falciparum), the plant, Arabidopsis a relatively late stage of evolution; therefore, exon- thaliana, the fruitfly, Drosophila melanogaster and vari- shuffling could not play a major role in the construction ous vertebrates (e.g., the fish Fugu rubripes, mouse, of ancient proteins (Patthy, 1987, 1991a,b). The self- human). As a result of these studies it has become clear splicing introns of the RNA world that could be present that there are clear correlations between genome size, at the time the first proteins were formed are practically genome compactness and the proportion of the genome unsuitable for exon-shuffling by intronic recombination: that is occupied by introns and repetitive elements. such self-splicing introns encode an essential function, At one extreme we find the small, compact genomes therefore their sequence is not tolerant to intronic recom- of Eubacteria and Archaea (with limitingly high coding bination (Patthy, 1987, 1991a,b, 1994). Exon-shuffling density) which are practically devoid of introns and could become significant only with the appearance of contain very little repetitive sequences. Vertebrate spliceosomal introns: these introns play a negligible role genomes represent the other extreme: they have large in their own excision, therefore intronic recombination genomes with a drastically reduced coding-density, is less likely to produce recombinant introns that are their genomes are rich in repetitive sequences and their deficient in splicing. Furthermore, the nonessential parts genes are characterized by a high intron to exon ratio. of spliceosomal introns could accommodate large seg- Studies on the complete genomes of H. influenzae ments of middle repetitious sequences, further increasing (Fleischmann et al., 1995), M. genitalium (Fraser et al., the chances of intronic recombination. Since spliceoso- 1995), M. pneumoniae (Himmelreich et al., 1996, 1997), mal introns evolved relatively recently from group II B. subtilis(Kunst et al., 1997), E. coli(Blattner et al., self-splicing introns (Cech, 1986; Jacquier, 1990; 1997), Borrelia burgdorferi (Fraser et al. 1997) have Cavalier-Smith, 1991; Copertino and Hallick, 1993; revealed that the genomes of these eubacteria are very Saldanha et al., 1993; Sharp, 1994) and are restricted in compact. The major part of their genomes (about 90%) their evolutionary distribution (Cavalier-Smith, 1991; is dedicated to protein-coding genes (830–976 genes per Palmer and Logsdon, 1991; Logsdon, 1991) exon- Mbp) and they do not contain large quantities of shuffling could play a major role only in the construction intergenic DNA, introns or repetitive elements, reflecting of ‘younger’ proteins (Patthy, 1987; 1991a,b, 1994, the fact that there is strong selective pressure to eliminate 1995, 1996). nongenic DNA. In the present review I wish to emphasize that the A limitingly high coding density also seems to hold significance of exon-shuffling increased parallel with the for Archaea, genes covering about 90% of their compact evolution of less compact genomes. The basis of this genomes. For example, in the case of the genome of M. correlation is that the number and size of introns and jannaschii (Bult et al., 1996) coding density is similar to the proportion of repetitive sequences in introns that of eubacteria: 1022 genes per Mbp. Similarly, in increases parallel with the decrease of genome compact- the genome of another Archaeon, A. fulgidus ( Klenk ness, therefore the chances of exon-shuffling by intronic et al., 1997) there are 1107 genes per Mbp, genes recombination also increase. Analysis of the evolution- covering 92.2% of the genome. ary distribution of proteins that were clearly assembled The genome of the unicellular eukaryote, S. cerevisiae from modules by intronic recombination suggests that is significantly less compact than those of eubacterial or L. Patthy / Gene 238 (1999) 103–114 105 archaeal prokaryotes: open reading frames occupy ‘only’ human genome, which is estimated to have only about 68% of the yeast genome (Bassett et al., 1996; Clayton 20 genes per Mbp. The compactness of the C. elegans et al., 1997; Dujon, 1997; Oliver, 1997). Thus, compared genome (relative to vertebrate genomes) is due to the with eubacterial or Archaeal genomes, where ORFs fact that introns and intergenic distances are significantly occupy about 90% of their genomes, yeast seems to be shorter. More than half of the C. elegans introns are under weaker selective pressure to reduce genome size. shorter than 60 bp, the most common length being only (Whereas in Archaea and Eubacteria there are about 48 bp. This tendency may be best illustrated by the fact 830–1100 genes per Mbp, in the case of yeast this that the size of C. elegans genes is usually much smaller number is only 446.) Nevertheless, introns of yeast than that of homologous vertebrate genes and they protein-coding genes are still few (233), and short: less usually contain fewer introns. For example, the nema- than 4% of the genes contain introns, and introns tode a1(IV ) and a2(IV ) collagen genes are about 9 kb account for less than 1% of the entire genome. Compared long with only 11 and 19 introns, whereas the homolo- with higher eukaryotes, the yeast genome is compact gous human type a2(IV), a5(IV ) and a6(IV ) collagen due to the short size of intergenic regions, that introns genes are 100–200 kb long with 46, 50 and 44 introns are few and small, and that repetitive sequences are (Sibley et al., 1993; Oohashi et al., 1995), the C. elegans infrequent. osteonectin gene spans 3.6 kb and has five introns, The genome of the protozoan parasite, Plasmodium whereas the mammalian homolog spans 26 kb and has falciparum is 30 Mb, about twice as large as that of nine introns (McVey et al., 1988; Schwarzbauer and yeast. Analysis of chromosome 2 of P. falciparum Spencer, 1993). As a reflection of the compactness of revealed that it contains 0.95 Mbp and encodes 209 the C.elegans genome, tandem repeats account for only protein-coding genes (Gardner et al., 1998). The esti- 2.7% of the genome, and inverted repeats account mated gene density is thus less than half of that found for 3.6%. in the case of yeast: 221 genes per Mbp. As compared The genome of D. melanogaster is about 170 Mbp with the yeast genome, the decrease in genome compact- and contains about 12 000 protein-coding genes (Rubin, ness is also reflected in a significantly increased propor- 1998). Although the estimated gene density (71 genes tion of introns: 43% of the protein-coding genes contain per Mbp) is significantly lower than in yeast, the genome at least one intron. is more compact than most vertebrate genomes. The D. The 100–120 Mbp genome of A. thaliana, a small melanogaster genes are rather compact, 50% of the crucifer weed, is significantly less compact than the yeast introns are <100 bp. In general, D. melanogaster genes genome. Analysis of a 1.9 Mb of contiguous sequence tend to have shorter and fewer introns than their verte- of the A. thalianagenome has shown that there are about brate homologs. For example, the fly laminin A gene is 175 genes per Mbp (The EU Arabidopsis Genome 14 kb long with 14 introns, whereas the human homolog Project, 1998) as compared with 446 gene per Mbp for occupies more than 260 kb and has 63 introns (Kusche- yeast. This decrease in genome compactness is also Gullberg et al., 1992; Zhang et al., 1996). reflected in the fact that the majority of Arabidopsis Recent studies on the genome of the sea squirt Ciona genes contain introns. A. thaliana introns are usually intestinalis seem to confirm the hypothesis that simple small (<200 bp), their average size is 146 bp (Hebsgaard chordates had a gene number very similar to inverte- et al., 1996; The EU Arabidopsis Genome Project, 1998). brates (Simmen et al., 1998). For a genome of about Consistent with decreased genome compactness, repeat 162 Mbp C. intestinalis has only about 15 500 genes, a classes already account for a significant portion (about number very close to the 12 000–19 000 genes of D. 19%) of genomic sequence. melanogaster and C. elegans. The gene density of this Sequencing of the 97 Mbp C. elegans genome — invertebrate chordate (97 genes per Mbp) is about half containing approx. 19 000 genes — has essentially been of that in C. elegans; only about 13% of the genome completed, permitting the recognition of some basic codes for protein (Simmen et al., 1998). characteristics of the genome of this simple, multicellular The inverse relation between genome compactness animal (Blumenthal and Spieth, 1996; Blaxter, 1998; and the proportion of intronic and repetitive sequence Chalfie, 1998; Chervitz et al., 1998; Ruvkun and Hobert, may also be illustrated by comparison of the genome of 1998; The C. elegans Sequencing Consortium, 1998). the pufferfish, F. rubripes with those of other vertebrates. This genome is also less compact than that of yeast: The Fugu genome is only 400 Mbp, about 7.5 times whereas in yeast there are 446 genes per Mbp, in the smaller than that of human, although these organisms case of C. elegans this value is only 196. The majority have about the same number of genes: the F. rubripes of C. elegans protein-coding genes have introns and genome is thus 7.5 times more compact than the human most have multiple introns. Coding sequence accounts genome (Elgar et al., 1996). Fugu genes usually have for 27% of the genome and introns (on average five per the same exon–intron organization as their mammalian gene) account for another 26%. The C. elegans genome homologs but, despite this conservation, the size of the is significantly more compact than the large (3000 Mbp) Fugu introns is usually drastically shorter than in the 106 L. Patthy / Gene 238 (1999) 103–114 mammalian homologs, the majority of introns being greater compactness thus have fewer and shorter introns between 60 and 150 bp. An illustrative example is the than genes located in the least compact genomic regions. Huntington’s disease gene, which is ‘only’ 22 kb in Fugu These observations thus suggest that the evolutionary as compared with the 180 kb human gene (i.e. an forces controlling genome compactness have a direct eightfold difference in size). Importantly, the exon– influence on the frequency and size of introns, as well intron organization of this gene is identical to that of as the relative abundance of repetitive elements in the human homolog: reduction occurred at the expense genomes. It may thus be predicted that the evolution of of noncoding regions and introns. The gene structure of less compact genomes was paralleled by an increase in complement component C9 of the pufferfish also the potential for intronic recombination and exon- illustrates these features of the Fugu genome (Yeo et al., shuffling. To test this prediction we have analyzed the 1997). The 11 exons of the Fugu C9 gene span 2.9 kb evolutionary distribution of protein-coding genes con- of genomic DNA, whereas the 11 exons of human C9 structed by exon-shuffling via intronic recombination. span 90 kb, representing a 30-fold difference in size at the expense of intron size. The compactness of the Fugu 2.2. Identification of cases of exons-shuffling genome (relative to the human genome) is due to the fact that it contains significantly less nongenic DNA. In order to establish a case of modular protein There are no abundant classes of dispersed repeats in evolution by exon-shuffling, one has to show (1) that Fugu and all forms of repetitive DNA (including two unquestionably homologous modules (i.e., modules telomeric repeats, ribosomal RNAs) constitute less than derived from a common ancestor) are present in other- 10% of the Fugu genome. The average gene density in wise nonhomologous protein environments and (2) that F. rubripes is quite similar to that in C. elegans: #150 the transposition of the module was mediated by intronic genes per Mbp (in human this value is 20). recombination. In this review I will discuss only cases In summary, the evidence from genome projects where both these criteria are satisfied. suggests that prokaryotes are characterized by small It must be emphasized that frequently it is not a compact genomes with little space for intergenic DNA, trivial task to prove homology of modules or to prove introns or repetitive sequences. In eukaryotes multiple a role of introns in module-shuffling. Introns-early ver- replication origins have permitted the evolution of sions of the exon-shuffling theory that assume that all larger, less compact genomes which can accommodate primordial proteins were assembled by intronic recombi- increasing amounts of intergenic regions, introns and nation from short exons that encoded various secondary repetitive sequences. The inverse relationship between structural elements (Dorit et al., 1990) face the formida- genome compactness and intron number and intron size ble task of proving that in two otherwise unrelated of protein-coding genes is valid not only for entire proteins two structurally similar a-helices, membrane genomes, but seems to hold even for different isochores spanning domains or b-sheets, are similar due to of a given genome. In vertebrate genomes, genes are not common ancestry rather than due to convergence. evenly distributed in different isochores: for example, Obviously, such basic structural units were invented the most GC-rich H3 isochore of the human genome independently several times during evolution (Patthy, contains about 28% of the genes, although H3 accounts 1991a). for only 3–5% of the genome. Conversely, only 34% of Even if criterion (1) is satisfied, we cannot take it for the human genes are found in the GC-poor L1+L2, granted that exon-shuffling was responsible for module- although they represent 62% of the genome shuffling. Although exon-shuffling by intronic recombi- (Mouchiroud et al., 1991; Bernardi, 1993). In other nation is by far the most powerful mechanism of modu- words, GC-poor isochores appear to be gene deserts as lar protein evolution, this does not mean that it is the compared to the most GC-rich isochores: the gene only way to exchange domains among protein-coding density is four to eight times higher in H3 than in genes (Patthy, 1996). Some recent examples show how H1+H2 and about 10–20 times higher than in L1+L2 modular proteins of bacteria may be constructed without (Mouchiroud et al., 1991; Bernardi, 1993). Significantly the assistance of introns. For example, a modular protein for our present discussion, protein-coding genes of of Peptostreptococcus magnus has been shown to be the GC-poor isochores contain more introns per kb of product of a recent intergenic recombination of two coding sequence than those in GC-rich isochores, and different types of streptococcal surface proteins (de the intervening sequences are, on average, three times Chateau and Bjorck, 1994). The recent transfer of a longer for genes located in L1+L2 than for those found fragment of a prokaryotic gene to another shows that in H3 isochores (Duret et al., 1995). The ratio of introns are not absolutely essential for exchange of gene intervening sequence to coding sequences is thus strik- fragments. These studies have also shown that gene ingly different for L1+L2 and H3: genes in H3 are on rearrangements by exonic recombination may be facili- average 2.5 times more compact than those in L1+L2 tated by the presence of special recombinogenic DNA (Duret et al., 1995). The genes of genomic regions of sequences in intermodule linker regions, and that antibi- L. Patthy / Gene 238 (1999) 103–114 107

Fig. 1. Schematic representation of the structures of the genes of some modular proteins assembled from the class 1-1 C-type lectin-module (LN), epidermal growth factor-module (G), complement B-type module (B), immunoglobulin-module (Ig), link protein-module (LK), follistatin-module (FS) and Kunitz-type inhibitor module (INH). The numbers indicate the position and phase class of the introns. Phase 1 introns found at the boundaries of modules are highlighted in red. The black boxes indicate signal peptide domains, vertical black bars represent transmembrane domains. The boxes designating the diﬀerent domains are drawn to scale.

otics may provide the selective pressure for the creation of advantageous chimaeras (de Chateau and Bjorck, 1996). To provide unambiguous evidence for exon-shuffling, one has to show that shuffling of the module was mediated by flanking introns that belonged to the same phase, i.e., the module was ‘symmetrical’ in accordance with the phase-compatibility-rules of exon-shuffling (Patthy, 1987). Many protein-coding genes produced by exon-shuffling could be recognized by a striking correlation between the exon-structure of the genes and the domain-organization of proteins (Patthy, 1985, 1987, 1991b). As a consequence of the phase-compatibility rule, the introns of these modular proteins also show a marked intron-phase bias: e.g., in the genes of modular proteins produced by exon-shuffling of class 1-1 mod-

Fig. 2. Comparison of the exon–intron structures of genes of some modular vertebrate proteins with those of invertebrate homologs. (A) Comparison of the structure of the Drosophila and human tolloid genes. (B) Comparison of the structure of the genes of the Caenorhabditis and human netrin receptors. The numbers indicate the position and phase class of the introns. Phase 1 introns found at the boundaries of modules are highlighted in red. The black boxes indicate signal peptide domains, vertical black bars represent transmembrane domains. The boxes designating the different domains are drawn to scale. 108 L. Patthy / Gene 238 (1999) 103–114 ules, the introns that participated in the assembly process to each other and inserted into new locations by recom- are all phase 1 (Patthy, 1987, 1991a). The genes of bination in phase 1 introns, then we may rightfully selectins (e.g., granule membrane protein 140 of plate- assume that they also arose by exon shuffling, even lets, endothelial-leukocyte adhesion molecule 1, lympho- though the ‘original’ introns are already missing from cyte homing receptor), interleukin-2 receptor, factor their genes. XIIIb subunit, cartilage link protein, follistatin, lipopro- In this respect, the tolloid genes provide illustrative tein-associated coagulation inhibitor provide typical examples. The tolloid gene of D. melanogaster encodes examples of such a correlation: their C-type lectin-, a modular astacin-type metalloprotease containing five growth factor-, complement B-type modules, class 1-1 complement C1r type (CUB) modules and two immunoglobulin-, link-, follistatin modules, Kunitz-type class 1-1 EGF-like (G) modules (Fig. 2). The 3.4 kb inhibitor modules are encoded by discrete class 1-1 gene of Drosophila is rather compact, contains only six exons, i.e., all the intermodule introns are phase 1 short (49–142 bp) introns (Childs and O’Connor, 1994); (Fig. 1.). The fact that the structure of the gene of a only two of the module-boundaries are marked by phase modular protein conforms to these rules may actually 1 introns, and four of the ‘expected’ intermodule phase ffl be used as evidence that it has evolved by exon-shu ing 1 introns are missing (Fig. 2.). The gene (6 kb) of a (Patthy, 1988). related protease (BP10) from sea urchin blastula with There are many cases where the correlation between adifferent set of six introns also failed to show a clear modular structure of a multidomain protein (consisting correlation between modular structure of the protein of class 1-1 modules) and exon–intron structure of its and exon–intron structure of the gene, therefore it was gene is less perfect, since some of the ‘expected’ introns suggested that ‘‘it is unlikely that exon-shuffling had a are missing from the module boundaries (Patthy, 1994; role in the evolution of the astacin/EGF/CUB subfamily 1995; 1996). The most plausible explanation for the of proteases’’ (Lhomond et al.,1996). Studies on the weaker correlation between exon–intron structure of the genomic organization of the mammalian tolloid gene gene and domain structure of the protein is that the suggest that this conclusion is unwarranted. The gene original genomic organization has been obscured by of human Bone Morphogenetic Protein 1 (BMP1) removal and insertion of introns (Patthy, 1994). As a encodes a protein with a domain structure identical to result of continual intron insertion and intron removal, that of the Drosophila Tolloid protein, but its gene the correlation between modular protein structure and structure is strikingly different from that of the gene structure may get weaker and weaker, and may Drosophila homolog (Takahara et al., 1995). Consistent eventually lead to an exon–intron organization that has with the differences in genome compactness of fly and little or nothing to do with the one that existed at the human, the human tolloid gene is more than tenfold time of the formation of the gene. Since erosion of the larger (46 kb) since it contains a greater number and original genomic structure progresses with time, it is not significantly longer introns than the fly homolog. surprising that old genes usually show less perfect corre- Importantly, all boundaries separating the class 1-1 lation with protein structure than recently assembled ones. This point may be illustrated by the exon–intron CUB and G modules are marked by the ‘expected’ phase structures of laminin genes. Laminins are among the 1 introns (Fig. 2.). It thus appears that, in this respect, oldest modular proteins unique to metazoa, inasmuch the human tolloid gene is more similar to the original ff as they are already present in Hydrozoa (Sarras et al., gene structure than the fly homolog. The di erence in 1994), and their domain organization was practically gene structure may be explained by assuming that in unchanged during subsequent evolution. Consistent with the Drosophila lineage (characterized by a more compact the great age of these genes, many of the original phase genome) there was greater selective pressure to eliminate 1 introns are already missing from the boundaries of introns than in the chordate lineage (characterized by a the class 1-1 laminin B-type modules of Drosophila and less compact genome). vertebrate laminin B1 and B2 chain genes (Vuolteenaho There are other cases which suggest that the gene- et al., 1990; Chi et al., 1991; Kallunki et al., 1991; Gow structures of chordate homologs are more likely to have et al., 1993). preserved the original introns. As another example we It must be pointed out that the absence of the may mention the case of human and C. elegans netrin expected introns from some module boundaries of pro- receptor genes. The gene of the human netrin receptor teins is frequently interpreted as evidence that these DCC (Deleted in Colorectal Cancer) encodes a trans- proteins did not evolve by exon shuffling. Such a conclu- membrane protein with an extracellular part containing sion is unjustified. If a modular protein is composed of four class 1-1 immunoglobulin (Ig) modules and six various class 1-1 modules (EGF-like modules, comple- class 1-1 fibronectin type III (FN3) modules (Fig. 2.). ment B-type modules, C-type lectin modules, laminin The human DCC gene contains 28 introns and spans B-type modules, CUB modules, etc.) that have clearly approx. 1400 kbp: it is the largest tumor suppressor been shown in most other cases to be duplicated, joined gene identified to date (Cho et al., 1994). Significantly, L. Patthy / Gene 238 (1999) 103–114 109 with one exception, the phase 1 introns have been that proteins composed of class 1-1 modules familiar preserved at the boundaries separating the class 1-1 Ig from vertebrates have been found in sponges, hydrozoa, and FN3 modules (Fig. 2). In contrast with this, in the nematodes, molluscs, arthropods, and echinoderms, etc., case of the C.elegans homolog, 6 of the 11 expected indicates that this machinery of exon-shuffling was avail- phase 1 introns are missing from the module boundaries able before the divergence of these metazoan phyla. (Fig. 2.). This difference in genomic organization may There can be no doubt that the mechanism of the be also explained by assuming that in the Caenorhabditis construction of these modular proteins was the same as lineage (characterized by a more compact genome) there that of vertebrate genes, since there are several cases of was greater selective pressure to eliminate introns than invertebrate genes where the class 1-1 modules are in the chordate lineage (characterized by a less compact flanked by the original phase 1 introns (Patthy, 1994, genome). Importantly, the gene of the C. elegans homo- 1995). It is noteworthy that modular proteins assembled log of DCC/netrin receptor is dramatically more com- from class 1-1 modules have already been found in pact (4.7 kb) than the human homolog (1400 kb), and sponges and hydrozoa, although only a tiny fraction of it has fewer and shorter introns (Chan et al., 1996). their genes have been sequenced so far. In contrast with In summary, since the genomic organization of a this, there is practically no evidence for related modular gene does not necessarily reflect the structure that existed proteins in plants or fungi. In addition to metazoa, a at the time of its formation, in many cases the original few modular proteins homologous with class 1-1 mod- structure can be reconstructed only through a complex ules were found in some animal viruses and parasitic analysis of the evolutionary history of the constituent protozoa (Patthy, 1994, 1995). However, it seems likely domains. So far, our analyses have identified more than that these viruses and protists acquired the modular five dozen class 1-1 module-types that have been used proteins from their multicellular hosts by horizontal ffl to build clan 1 modular proteins by exon-shu ing. As gene transfer; these proteins assist them in the invasion has been discussed previously, there are much fewer process. Thus the surface protein of P. falciparum con- class 0-0 or class 2-2 modules (Patthy, 1994, 1995). taining four EGF-domains facilitates infection by bind- According to the modularization hypothesis, the expla- ing to receptors on mosquito epithelial cells, the nation for this is that the formation of class 1-1 modules thrombospondin-homolog proteins of malarial parasites is strongly favoured (Patthy, 1994, 1995, 1999). assist in their entry into hepatocytes by mediating their adherence to these cells, the vaccinia virus protein containing four complement B-type modules could 2.3. Evolutionary distribution of proteins produced by counteract host immune defences by interfering with the exon-shuffling complement cascade. The evolutionary distribution of modular proteins that have clearly evolved by exon- Analysis of the evolutionary distribution of modular shuffling is thus consistent with the suggestion that exon- proteins produced by exon-shuffling may permit a clear shuffling became significant at the time of metazoan definition of the time of their formation, and thus may provide an insight into changes in the significance of radiation, parallel with the spread of large spliceoso- exon-shuffling. Thanks to various genome projects, mal introns. sufficient amount of sequence information has already Our analysis has also shown that the vast majority accumulated on plants, protists, fungi and diverse of metazoan modular proteins produced by exon- ffl groups of animals to perform a meaningful analysis. shu ing is extracellular or they are extracellular parts Using the collection of modules that were clearly of membrane-associated proteins (Patthy, 1995). A spread by exon-shuffling, we have searched databases to major group of these modular proteins consists of define the evolutionary distribution of modular proteins various constituents of the extracellular matrix (lami- that were constructed from these modules. In the present nins, modular collagens, fibronectin, etc.), modular work, modular proteins produced by exon-shuffling were metalloproteinases involved in the remodelling of the defined as those which contain at least one of the extracellular matrix (e.g., type IV collagenases, bone modules spread by exon-shuffling and this module is morphogenetic proteins). Another large group of modu- joined to at least another protein-domain. Note that lar proteins contains a variety of membrane associated according to this definition, a protein consisting of a or transmembrane proteins with extracellular parts con- single module and a membrane anchoring segment structed from modules: receptor tyrosine kinases, recep- and/or signal peptide is not counted as a modular tor tyrosine phosphatases, proteins involved in cell–cell protein, but is considered to represent the protomodule or cell–matrix interactions, complement receptors, LDL- stage of the modularization pathway (Patthy, 1994, receptor, cytokine receptors, etc. A third major group 1995, 1999). of modular extracellular proteins comprises various The search for modular proteins has brought many plasma proteins: the modular proteases of the blood examples from all major groups of metazoa. The fact coagulation, fibrinolytic and complement cascades, the 110 L. Patthy / Gene 238 (1999) 103–114 non-protease factors regulating blood coagulation and the C. elegans proteins found in current databases were complement activation, immunoglobulins, etc. predicted with a rather high error rate.) In addition to these extracellular or membrane-associ- In principle, differences in the recurrence of different ated modular proteins, there are some intracellular modules-types may reflect differences in the time of the modular proteins that have also evolved by exon- appearance of the module (those protomodules that shuffling in the metazoan lineage (Patthy, 1995). For were formed earlier had more time to spread by exon- example, there is clear evidence that myomesin/skelemin shuffling), or may reflect differences in the chances of of the contractile apparatus has been constructed by the survival of the newly formed modular protein-coding exon-shuffling from class 1-1 immunoglobulin and gene. Survival of the newly formed gene requires that fibronectin type III modules. In the gene of this protein, the protein it encodes should be able to fold efficiently the start and end of each immunoglobulin and fibronec- into a stable multidomain protein, therefore folding- tin type III domain are still defined by phase 1 introns efficiency of the protein-module is a critical aspect of its (Steiner et al., 1999). evolutionary success. The significance of this require- There seem to be striking differences in the distribu- ment may be illustrated by the fact that the majority of tion of modular proteins in different metazoan lineages. modules used in the construction of multidomain pro- Although modular receptor tyrosine kinases have teins display remarkable folding autonomy: the isolated already been found in most metazoan lineages, including domains can fold very efficiently (Patthy, 1993). sponges (Scha¨cke et al., 1994) and there is molecular Obviously, folding autonomy of domains in multido- evidence for the presence of laminins from most meta- main proteins is of utmost importance since this mini- zoan phyla including Hydrozoa (Sarras et al., 1994), mizes the influence of neighboring domains and ensures there are modular proteins that appear to be restricted stability in the extracellular environment. Folding auton- to certain evolutionary lineages. For example, the rich omy can ensure that folding of the domain is not variety of modular proteases of blood coagulation, deranged when inserted into a novel protein environ- fibrinolysis, complement activation and other extracellu- ment. It thus seems likely that the most widely used lar proteolytic cascades (Fig. 3.) have probably been modules have been selected for folding autonomy and formed in the chordate lineage, since proteases with a fold stability. similar domain organization are missing from non- It is interesting to note in this respect that the modules chordate genomes, including the completely sequenced used most frequently in the construction of modular genome of C. elegans. Conversely, most basic constitu- proteins are not random representatives of the protein ents of the extracellular matrix, the majority of proteins universe. Analysis of protein structures of the involved in cell–cell or cell–matrix interactions are pre- Brookhaven Protein Databank (Martin et al., 1998) sent in vertebrates as well as in arthropods and nema- shows that about half of the non-homologous protein todes, suggesting that they were formed prior to the families belong to the ab class, 25% belong to the mainly divergence of these lineages. It is worth pointing out a class, about another 25% is mainly b (Fig. 5A). that the genes of modular proteins that appear to be Structural classification of the class 1-1 modules shows restricted to the chordate lineage provide the most a significantly different distribution: more than 60% convincing examples of a correlation between domain belong to the mainly b class and only about 10% are organization of a modular protein and exon–intron mainly a (Fig. 5B). If the differences in recurrence structure of their gene. It seems likely that this is due (Fig. 4) of modules is also taken into account, there is to the fact that these genes are relatively young and that a further significant shift in favour of modules belonging in the chordate lineage there was no significant pressure to the mainly b class (Fig. 5C), suggesting that structural to eliminate their introns. features (folding autonomy and stability) play a signifi- It seems possible that many of the modular proteins cant role in controlling the spread of module-types. The found so far only in the C. elegans genome have been fact that the structural distribution of disulphide-bonded formed in the nematode lineage. Even though we cannot extracellular domains are in general shifted towards the exclude the possibility that similar modular proteins will mainly b proteins (Martin et al., 1998) indicates that be found in other genomes, it seems to be clear that such proteins are more stable in the extracellular envi- there are significant differences in the population of ronment than other fold classes. It is noteworthy in this modular proteins of vertebrates vs. C. elegans. This respect that as a reflection of their adaptation to the point may be illustrated by the fact that different class oxidative milieu of the extracytoplasmic space, many of 1-1 modules are not used with equal probability in these the class 1-1 modules are rich in disulfide bonds. two groups (Fig. 4.). Although in both groups the EGF- In summary, the clearcut examples of proteins assem- like module is found in the highest percentage of modu- bled by exon-shuffling from modules are restricted to lar proteins, there are also striking differences in the metazoa, with practically no counterparts in prokary- recurrence of class 1-1 modules (Fig. 4). (The results of otes, plants, protists or fungi. This observation is consis- this analysis must be treated with some caution, since tent with our earlier suggestion that the exon-shuffling L. Patthy / Gene 238 (1999) 103–114 111 erent ff g the di he trypsin-family. The SERPRO box vitamin K-dependent calcium-binding sembled: kringle-module ( K), epidermal C1s module (CUB), von Willebrand type A module (vWA), LDL-receptor module (LDL), meprin- / receptor module (FRZ). The black boxes indicate signal peptide domains, vertical black bars represent transmembrane domains. The boxes designatin frizzled domains are drawn to scale. module (MAM), represents the trypsin-homolog serinegrowth protease factor-module (G), domain, themodule finger-module (C), colored (F), boxes contact fibronectin represent factor type the module II (CF), various module class complement (FN2), 1-1 B-type module preactivation modules (B), peptide from module which complement (PAP), these C1r proteases scavenger receptor were module as (SC), Fig. 3. Architecture of some modular proteases of the blood coagulation, fibrinolysis, complement activation cascades and other modular members of t 112 L. Patthy / Gene 238 (1999) 103–114

2.4. Signiﬁcance of exon-shuﬄing in metazoan evolution

It must be emphasized that most modular proteins produced by exon-shuffling are associated with, and are absolutely essential for, multicellularity of metazoa. For example, the appearance of the constituents of extracellular matrix is inextricably linked to the appearance of the first multicellular animals. The constituents of the extracellular matrix, membrane-associated proteins mediating cell–cell and cell–matrix interactions, receptor-proteins regulating cell–cell communications (such as receptor tyrosine kinases, receptor tyrosine phosphatases) are of absolute importance in permitting cells to function in an integrated fashion. The fact that the overwhelming majority of the constituents of the extracellular matrix, cell adhesion proteins, receptor proteins were constructed from modules underlines the extreme importance of exon-shuffling in metazoan evolution. As outlined above, this powerful evolutionary mechanism has become significant relatively late during evolution at the time of metazoan Fig. 4. Recurrence of the most frequently used class 1-1 modules in a database of vertebrate and C. elegans modular proteins. In this data- radiation, parallel with a decrease in genome compact- base, modular proteins with the same domain organization were ness and parallel with the evolution of spliceosomal counted as a single entry irrespective of the number of representatives introns. It is interesting in this respect that the rise of from different species or from the same species. Recurrence is defined exon-shuffling seems to coincide with a spectacular burst as the percentage of modular proteins in which the given domain is of evolutionary creativity at the time of metazoan radia- present, irrespective of the number of repeats of that module. ffl Abbreviation of modules: G, epidermal growth-factor module; Ig, tion. It seems probable that the rise of exon-shu ing immunoglobulin module; FN3, fibronect type III module; B, comple- has contributed significantly to this accelerated evolution ment B-type module; LN, C-type lectin-module; TSP, thrombospondin of metazoa. module; LDL, LDL-receptor module; vWC, von Willebrand type C In summary, it seems that the evolution of less module; K, kringle-module; vWA, von Willebrand type A module; SC, compact genomes, the evolution and spread of spliceoso- scavenger receptor module; LMG, globular module of laminin A; LMB, EGF-like module of laminin B; CUB, complement C1r/C1s mal introns, the creation of mobile modules have module; INH, Kunitz-type inhibitor module; CAD, cadherin module. reached a critical point sometime before the Cambrian period, and this led to a dramatic increase in the ffi machinery (large spliceosomal pre-mRNA introns, pro- e ciency of modular protein evolution by exon- ffl ffi ffl tomodules) appeared relatively late during evolution shu ing. Increased e ciency of exon-shu ing was criti- (Patthy, 1994) parallel with the appearance of less cal for the rapid creation of diverse multidomain pro- compact genomes. The importance of the accumulation teins that are essential for multicellularity of metazoa. of a critical mass of module-types is illustrated by the fact that most of the modular proteins were assembled from some five-dozen class 1-1 modules. References

Bassett Jr., D.E., Basrai, M.A., Connelly, C., Hyland, K.M., Kita- gawa, K., et al., 1996. Exploiting the complete yeast genome sequence. Curr. Opin. Genet. Dev. 6, 763–766. Belfort, M., 1991. Self-splicing introns in prokaryotes: migrant fossils? Cell 11, 9–11. Belfort, M., 1993. An expanding universe of introns. Science 262, 1009–1010. Bernardi, G., 1993. The isochore organization of the human genome and its evolutionary history — a review. Gene 135, 57–66. Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, Fig. 5. (A) Structural classification of non-homologous protein fami- V., et al., 1997. The complete genome sequence of Escherichia coli lies in the PDB (Martin et al., 1998). (B) Structural classification of K-12. Science 277, 1453–1462. non-homologous class 1-1 modules. (C) Recurrence of class 1-1 mod- Blaxter, M., 1998. Caenorhabditis elegans is a nematode. Science 282, ules belonging to the three different structural classes in modular pro- 2041–2046. teins of vertebrates. The colors define the class: red, mainly a; green, Blumenthal, T., Spieth, J., 1996. Gene structure and organization in mainly b; yellow, mixed ab. Caenorhabditis elegans. Curr. Opin. Genet. Dev. 6, 692–698. L. Patthy / Gene 238 (1999) 103–114 113

Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., et al., Fraser, C.M., Casjens, S., Huang, W.M., Sutton, G.G., Clayton, R., 1996. Complete genome sequence of the methanogenic Archaeon, et al., 1997. Genomic sequence of a Lyme disease spirochaete, Bor- Methanococcus jannaschii. Science 273, 1058–1073. relia burgdorferi. Nature 390, 580–586. Cavalier-Smith, T., 1985. Selfish DNA and the origin of introns. Gardner, M.J., Tettelin, H., Carucci, D.J., Cummings, L.M., Aravind, Nature 315, 283–284. L., et al., 1998. Chromosome 2 sequence of the human malaria Cavalier-Smith, T., 1991. Intron-phylogeny: a new hypothesis. Trends parasite Plasmodium falciparum. Science 282, 1126–1132. Genet. 7, 145–148. Gilbert, W., 1978. Why genes in pieces? Nature 271, 501. Cech, T.R., 1985. Self-splicing RNA: implications for evolution. Int. Gilbert, W., 1986. The RNA world. Nature 319, 618. Rev. Cytol. 93, 3–22. Gilbert, W., Glynias, M., 1993. On the ancient nature of introns. Gene Cech, T.R., 1986. The generality of self-splicing RNA: relationship to 135, 137–144. nuclear mRNA splicing. Cell 44, 207–210. Gow, C.H., Chang, H.Y., Lih, C.J., Chang, T.W., Hui, C.F., 1993. Chalfie, M., 1998. The worm revealed. Nature 396, 620–621. Analysis of the Drosophila gene for the laminin B1 chain. DNA Chan, S.S.Y., Zheng, H., Su, M.W., Wilk, R., Killeen, M.T., Cell Biol. 12, 573–587. Hedgecock, E.M., Culotti, J.G., 1996. UNC-40, a C.elegans homo- Grivell, L.A., 1994. Invasive introns. Curr. Biol. 4, 161–164. log of DCC (deleted in colorectal cancer), is required in motile cells Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouze´, responding to UNC-6 netrin cues. Cell 87, 187–195. P., et al., 1996. Splice site prediction in Arabidopsis thaliana pre- de Chateau, M., Bjorck, L., 1994. Protein PAB, a mosaic albumin- mRNA by combining local and global sequence information. binding bacterial protein representing the first contemporary exam- Nucleic Acids Res. 24, 3439–3452. ffl ple of module shu ing. J. Biol. Chem. 269, 12147–12151. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B.-C., et al., de Chateau, M., Bjorck, L., 1996. Identification of interdomain 1996. Complete sequence analysis of.the genome of the bacterium sequences promoting the intronless evolution of a bacterial protein Mycoplasma pneumoniae. Nucleic Acis Res. 24, 4420–4449. family. Proc. Natl. Acad. Sci., USA 93, 8490–8495. Himmelreich, R., Plagens, H., Hilbert, H., Reiner, B., Hermann, R., Chervitz, S.A., Aravind, L., Sherlock, G., Ball, C.A., Koonin, E.V., 1997. Comparative analysis of the genomes of the bacteria Myco- Dwight, S.S., Harris, M.A., Dolinski, K., Mohr, S., Smith, T., plasma pneumoniae and Mycoplasma genitalium. Nucleic Acids Res. Weng, S., Cherry, J.M., Botstein, D., 1998. Comparison of the 25, 701–712. complete protein sets of worm and yeast: orthology and divergence. Jacquier, A., 1990. Self-splicing group II and nuclear pre-mRNA Science 282, 2022–2028. introns: how similar are they? Trends Biochem. Sci. 15, 351–354. Chi, H.C., Juminaga, D., Wang, S.Y., Wang, S.Y., Hui, C.F., 1991. Kallunki, T., Ikonen, J., Chow, L.T., Kallunki, P., Tryggvason, K., Structure of the Drosophila gene for the Laminin B2 chain. DNA 1991. Structure of the human laminin B2 chain gene reveals extens- Cell. Biol. 10, 451–466. ive divergence from the laminin B1 chain gene. J. Biol. Chem. Childs, S.R., O’Connor, M.B., 1994. Two domains of the tolloid pro- 266, 221–228. tein contribute to its unusual genetic interaction with decapen- Klenk, H.P., Clayton, R.A., Tomb, J.F., White, O., Nelson, K.E., taplegic. Dev. Biol. 162, 209–220. et al., 1997. The complete genome sequence of the hyperthermophi- Cho, K.R., Oliner, J.D., Simons, J.W., Hedrick, L., Fearon, L., lic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature Fearon, E.R., Presinger, A.C., Hedge, P., Silverman, G.A., 390, 364–370. Vogelstein, B., 1994. The DCC gene: structural analysis and mut- Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G., ations in colorectal carcinomas. Genomics 19, 525–531. et al., 1997. The complete genome sequence of the Gram-positive Clayton, R.A., White, O., Ketchum, K.A., Venter, J.C., 1997. The first bacterium Bacillus subtilis. Nature 390, 248–255. genome from the third domain of life. Nature 387, 459–462. Kusche-Gullberg, M., Garrison, K., MacKrell, A.J., Fessler, L.I., Fes- Copertino, D.W., Hallick, R.B., 1993. Group II and group III introns of twintrons: potential relationships with nuclear pre-mRNA sler, J.H., 1992. Laminin A chain: expression during Drosophila introns. Trends Biochem. Sci. 18, 467–471. development and genomic sequence. EMBO J. 11, 4519–4527. Crick, F., 1979. Split genes and RNA splicing. Science 204, 264–271. Lambowitz, AM., Belfort, M., 1989. Infectious introns. Cell 56, Darnell, J.E., 1978. Implications of RNA.RNA splicing in evolution 323–326. of eukaryotic cells. Science 202, 1257–1260. Lambowitz, AM., 1993. Introns as mobile genetic elements. Annu. Darnell, J.E., Doolittle, W.F., 1986. Speculations on the early course Rev. Biochem. 62, 587–622. of evolution. Proc. Natl. Acad. Sci. USA 83, 1271–1275. Lhomond, G., Chiglione, C., Lepage, T., Gache, C., 1996. Structure Doolittle, W.F., 1978. Genes in pieces: were they ever together? Nature of the gene encoding the sea urchin blastula protease 10 (BP10), 2+ 272, 581–582. a member of the astacin family of Zn metalloproteases. Eur. Dorit, R.L., Schoenbacher, L., Gilbert, W., 1990. How big is the uni- J. Biochem. 238, 744–751. verse of exons? Science 250, 1342–1382. Logsdon Jr., J.M., 1991. The recent origins of spliceosomal introns Dujon, B., 1989. Group I introns as mobile genetic elements: facts and revisited. Curr. Opin. Genet. Dev 8, 637–648. mechanistic speculations. Gene 82, 91–114. Martin, A.C.R., Orengo, C.A., Hutchinson, E.G., Jones, S., Karmint- Dujon, B., 1997. The yeast genome project: what did we learn? Trend zou, M., Laskowski, R.A., Mitchell, J.B.O., Taroni, C., Thornton, Genet. 12, 263–269. J.M., 1998. Protein folds and functions. Structure 6, 875–884. Duret, L., Mouchiroud, D., Gautier, C., 1995. Statistical analysis of McVey, J.H., Nomura, S., Kelly, P., Mason, I.J., Hogan, B.L.M., 1988. vertebrate sequences reveals that long genes are scarce in GC-rich Characterization of the mouse SPARC/Osteonectin gene. J. Biol. isochores. J. Mol. Evol. 40, 308–317. Chem. 263, 11111–11116. Elgar, G., Sandford, R., Aparicio, S., Macrae, A., Venkatesh, B., et al., Morl, M., Schmelzer, C., 1990. Integration of group II intron bl1 into 1996. Small is beautiful: comparative genomics with the pufferfish a foreign RNA by reversal of the self-splicing reaction in vitro. Cell (Fugu rubripes). Trends Genet. 12, 145–150. 60, 629–636. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, Mouchiroud, D., D’Onofrio, G., Assani, B., Macaya, G., Gautier, C., E.F., 1995. Whole-genome random sequencing and assembly of Bernardi, G., 1991. The distribution of genes in the human genome. Haemophilus influenzae Rd. Science 269, 496–512. Gene 100, 181–187. Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Mueller, M.W., Allmaier, M., Eskes, R., Schweyen, R.J., 1993. Trans- et al., 1995. The minimal gene complement of Mycoplasma geni- position of group II intron aI1 in yeast and invasion of mithochon- talium. Science 270, 397–403. drial genes at new locations. Nature 366, 174–176. 114 L. Patthy / Gene 238 (1999) 103–114

Oliver, S.G., 1997. From DNA sequence to biological function. Nature Sarras Jr., M.P., Yan, L., Grens, A., Zhang, X., Agbas, A., Huff, J.K., 379, 597–600. St. John, P.L., Abrahamson, D.R., 1994. Cloning and biological Oohashi, T., Ueki, Y., Sugimoto, M., Ninomiya, Y., 1995. Isolation function of laminin in Hydra vulgaris. Dev. Biol. 164, 312–324. and structure of the COL4A6 gene encoding the human a6(IV ) Scha¨cke, H., Rinkevich, B., Gamulin, V., Mu¨ller, I.M., Mu¨ller, collagen chain and comparison with other type IV collagen genes. W.E.G., 1994. Immunoglobulin-like domain is present in the extra- J. Biol. Chem. 270, 26863–26867. cellular part of the receptor tyrosine kinase from the marine sponge. Orgel, L.E., Crick, F.H.C., 1980. Selfish DNA: the ultimate parasite. J. Mol. Recogn. 7, 272–276. Nature 284, 604–607. Schwarzbauer, J.E., Spencer, C.S., 1993. The Caenorhabditis elegans Palmer, J.D., Logsdon Jr., J.M., 1991. The recent origins of introns. homologue of the extracellular calcium binding protein SPARC/ Curr. Opin. Genet. Dev. 1, 470–477. Osteonectin affects nematode body morphology and mobility. Mol. Patthy, L., 1985. Evolution of the proteases of blood coagulation and Biol. Cell. 4, 941–952. fibrinolysis by assembly from modules. Cell 41, 657–663. Sharp, P.A., 1985. On the origin of RNA splicing and introns. Cell Patthy, L., 1987. Intron-dependent evolution: preferred types of exons 42, 397–400. and introns. FEBS Lett. 214, 1–7. Sharp, P.A., 1994. Split genes and RNA splicing. Cell 77, 805–815. Patthy, L., 1988. Detecting distant homologies of mosaic proteins. Sibley, M.H., Johnson, J.J., Mello, C.C., Kramer, J.M., 1993. Genetic Analysis of the sequences of thrombomodulin, thrombospondin, identification, sequence and alternative splicing of the Caenorhab- complement components C9, C8a and C8b, vitronectin and plasma ditis elegans alpha 2(IV ) collagen gene. J. Cell Biol. 123, 255–264. cell membrane glycoprotein PC-1. J. Mol. Biol. 202, 689–696. Simmen, M.W., Leitgeb, S., Clark, V.H., Jones, S.J.M., Bird, A., 1998. Patthy, L., 1991a. Exons — original building blocks of proteins? BioEs- Gene number in an invertebrate chordate, Ciona intestinalis. Proc. say 13, 187–192. Natl. Acad. Sci. USA 95, 4437–4440. Patthy, L., 1991b. Modular exchange principles in proteins. Curr. Steiner, F., Weber, K., Furst, D.O., 1999. M band proteins myomesin Opin. Struct. Biol. 1, 351–361. and skelemin are encoded by the same gene: analysis of its organiza- Patthy, L., 1993. Modular design of proteases of coagulation, fibrinol- tion and expression. Genomics 56, 78–89. ysis and complement activation: implications for protein engineer- Takahara, K., Lee, S., Wood, S., Greenspan, D.S., 1995. Structural ing and structure function studies. Methods Enzymol. 222, 10–21. organization and genetic localization of the human bone morpho- Patthy, L., 1994. Exons and introns. Curr. Opin. Struct. Biol. 4, genetic protein 1/mamalian tolloid gene. Genomics 29, 9–15. 383–392. The C. elegans Sequencing Consortium, 1998. Genome sequence of Patthy, L., 1995. Protein Evolution by Exon-Shuffling. Molecular Biol- the nematode C. elegans: a platform for investigating biology. Sci- ogy Intelligence Unit, R.G. Landes Company/Springer, New York. ence 282, 2012–2018. Patthy, L., 1996. Exon shuffling and other ways of module exchange. The EU Arabidopsis Genome Project, 1998. Analysis of 1.9 Mb of Matrix Biol. 15, 301–310. contiguous sequence from chromosome 4 of Arabidopsis thaliana. Patthy, L., 1999. Protein Evolution. Blackwell Science, Oxford. Nature 391, 485–488. Perlman, P.S., Butow, R.A., 1989. Mobile introns and intron-encoded Vuolteenaho, R., Chow, L.T., Tryggvason, K., 1990. Structure of the proteins. Science 246, 1106–1109. human laminin B1 chain gene. J. Biol. Chem. 265, 15611–15616. Rubin, G.M., 1998. The Drosophila Genome Project: a progress report. Yeo, G.S.H., Elgar, G., Sandford, R., Brenner, S., 1997. Cloning and Trends Genet. 14, 340–343. sequencing of complement component C9 and its linkage to DOC-2 Ruvkun, G., Hobert, O., 1998. The taxonomy of developmental con- in the pufferfish Fugu rubripes. Gene 200, 203–211. trol in Caenorhabditis elegans. Science 282, 2033–2041. Zhang, X., Vuolteenanho, R., Tryggvason, K., 1996. Structure of the Saldanha, R., Mohr, G., Belfort, M., Lambowitz, A.M., 1993. Group human laminin a2-chain gene (LAMA2) which is affected in con- I and group II introns. FASEB J. 7, 15–24. genital muscular dystrophy. J. Biol. Chem. 271, 27664–27669.