
Nucleomorph genome of Hemiselmis andersenii reveals complete intron loss and compaction as a driver of protein structure and function

Christopher E. Lane*, Krystal van den Heuvel*, Catherine Kozera†, Bruce A. Curtis†, Byron J. Parsons†‡, Sharen Bowman†§, and John M. Archibald*¶

*Canadian Institute for Advanced Research, Integrated Microbial Biodiversity Program, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada B3H 1X5; §Department of Process Engineering and Applied Science, Dalhousie University, Halifax, NS, Canada B3H 3J5; and †Atlantic Genome Centre, Halifax, NS, Canada B3J 1S5

Edited by Jeffrey D. Palmer, Indiana University, Bloomington, IN, and approved October 29, 2007 (received for review August 6, 2007)

Nucleomorphs are the remnant nuclei of algal endosymbionts that took up residence inside a nonphotosynthetic eukaryotic host. The nucleomorphs of cryptophytes and chlorarachniophytes are derived from red and green algal endosymbionts, respectively, and represent a stunning example of convergent evolution: their genomes have independently been reduced and compacted to <1 megabase pair (Mbp) in size (the smallest nuclear genomes known) and to a similar three-chromosome architecture. The molecular processes underlying genome reduction and compaction in eukaryotes are largely unknown, as is the impact of reduction/compaction on protein structure and function. Here, we present the complete 0.572-Mbp nucleomorph genome of the cryptophyte Hemiselmis andersenii and show that it is completely devoid of spliceosomal introns and genes for splicing RNAs, a case of complete intron loss in a nuclear genome. Comparison of H. andersenii proteins to those encoded in the slightly smaller (0.551-Mbp) nucleomorph genome of another cryptophyte, Guillardia theta, and to their homologs in the unicellular red alga Cyanidioschyzon merolae reveals that (i) cryptophyte nucleomorph genomes encode proteins that are significantly smaller than those in their free-living algal ancestors, and (ii) the smaller, more compact G. theta nucleomorph genome encodes significantly smaller proteins than that of H. andersenii. These results indicate that genome compaction can eliminate both coding and noncoding DNA and, consequently, drive the evolution of protein structure and function. Nucleomorph proteins have the potential to reveal the minimal functional units required for basic eukaryotic cellular processes.

endosymbiosis | genome evolution | genome reduction

Author contributions: J.M.A. designed research; C.E.L., K.v.d.H., C.K., B.A.C., B.J.P., and S.B. performed research; C.K., B.A.C., B.J.P., and S.B. contributed new reagents/analytic tools; C.E.L., K.v.d.H., B.A.C., and J.M.A. analyzed data; and C.E.L. and J.M.A. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. CP000881, CP000882, and CP000883).
‡Present address: Ocean Nutrition Canada, 101 Research Drive, Dartmouth, NS, Canada B2Y 4T6.
¶To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0707419104/DC1.
© 2007 by The National Academy of Sciences of the USA
19908–19913 | PNAS | December 11, 2007 | vol. 104 | no. 50 | www.pnas.org/cgi/doi/10.1073/pnas.0707419104

Nuclear genome size in eukaryotes varies ≈200,000-fold (1). Toward the lower end of this spectrum are the reduced genomes of eukaryotes that have become symbionts or intracellular pathogens, such as apicomplexans (e.g., Plasmodium, the causative agent of malaria) and microsporidian parasites (e.g., Encephalitozoon, an opportunistic pathogen of AIDS patients). The nuclear genomes of these organisms are smaller and more compact than those of their free-living relatives and contain little in the way of repetitive DNA (2). Far and away the most extreme examples of eukaryotic genome reduction are the “nucleomorph” genomes of cryptophytes and chlorarachniophytes. Nucleomorphs are the relic nuclei of algal endosymbionts that became permanent inhabitants of nonphotosynthetic eukaryotic host cells (3–5). Through the combined effects of genome compaction and intracellular gene transfer, the nucleomorph genomes of cryptophytes and chlorarachniophytes have shrunk to a fraction of the size of the algal nuclear genomes from which they are derived and, thus, represent a fascinating system for studying the process of genome evolution.

The first nucleomorph genome to be sequenced was the 551-kilobase pair (kbp) genome of the model cryptophyte, Guillardia theta (6). The G. theta genome contains 513 genes, primarily with “housekeeping” functions such as transcription, translation, and protein folding/degradation (6). Recently, the nucleomorph genome of the chlorarachniophyte alga Bigelowiella natans was completely sequenced and, at 373 kbp (7), is even smaller than that of G. theta. Like G. theta, the B. natans nucleomorph genome is largely composed of genes whose function is to perform core eukaryotic cellular processes and to maintain the expression of a small number of essential genes/proteins involved in photosynthesis (3, 5, 7). A striking similarity between the G. theta and B. natans nucleomorph genomes is that both are composed of three chromosomes, each with subtelomeric ribosomal DNA (rDNA) cistrons (3, 5). This is intriguing, considering the independent evolutionary history of these organisms: the algal endosymbiont that gave rise to the cryptophyte nucleomorph and plastid (chloroplast) is derived from an ancestor of modern-day red algae, whereas in chlorarachniophytes, the endosymbiont was a green alga (reviewed in refs. 8 and 9). The observed similarities in basic karyotype and genome structure between the two nucleomorphs are, thus, the result of convergent evolution, the biological significance of which is unknown (3). Importantly, the gene content of the G. theta nucleomorph genome is very different from that of B. natans, in particular in the complement of genes for plastid-targeted proteins, emphasizing the independent evolutionary trajectories taken by the two genomes since their enslavement.

Beyond the cryptophyte G. theta and the chlorarachniophyte B. natans, very little is known about nucleomorph genome diversity within members of each lineage. Preliminary karyotype diversity studies have revealed considerable size variation, with estimated nucleomorph genome sizes ranging from ≈450 to 845 kbp in cryptophytes and ≈330 to 610 kbp in chlorarachniophytes (4, 10–14). The presence of three chromosomes is thus far a universal feature of nucleomorph genomes (3, 12, 14), as is the existence of subtelomeric rDNA repeats. An interesting exception was recently discovered within members of the cryptophyte genus Hemiselmis, where only three of the six nucleomorph chromosome ends contain intact repeats, the other three containing only the 5S rDNA locus (11). To better understand the sequence and structural diversity of nucleomorph genomes and, more generally, the causes and consequences of genome reduction and compaction in eukaryotes, we have completely sequenced the nucleomorph genome of a newly described species, Hemiselmis andersenii (15). Detailed comparison of the H. andersenii genome to that of G. theta (6) provides the first glimpse into the tempo and mode of nucleomorph genome evolution and highlights the significant impact of genome compaction on gene and protein structure.

Results and Discussion

Chromosome and Genome Structure. H. andersenii CCMP644 nucleomorph DNA was isolated by using cesium chloride-Hoechst dye density gradient centrifugation, cloned, and shotgun sequenced to ≈9× coverage. After the use of long-range PCR to link contigs and fill remaining gaps, three chromosome-sized contigs (207.5, 184.7, and 179.6 kbp) were produced, in agreement with genome size estimates based on pulsed-field gel electrophoresis (11). The complete H. andersenii genome is 571,872 bp in size (Figs. 1 and 2a) with an overall G+C content of 25.2% (24.7% in single-copy regions, 39.0% in rDNA repeats). The H. andersenii nucleomorph chromosome ends are highly unusual: telomeres are composed of a never-before-seen (GA₁₇)₄₋₇ repeat, in contrast to those in G. theta (([AG]₇AAG₆A)₁₁). Consistent with previous observations (11), intact subtelomeric rDNA operons are present only on both ends of chromosome I and one end of chromosome III (5S rDNA exists in isolation on chromosome II and one end of chromosome III; Fig. 1).

Loss of Introns and Splicing Machinery. The G. theta nucleomorph genome possesses 17 small (42- to 52-bp) spliceosomal introns with standard GT/AG boundaries, primarily in ribosomal protein genes and invariably located at their 5′ ends (6). With the exception of orf183 and orf263, all of the G. theta intron-containing genes have homologs in H. andersenii. Unexpectedly, none of these contain introns, nor do any of the other predicted genes (Fig. 1). Introns are widely considered to be a universal feature of nuclear genomes (16) and are removed by the spliceosome, a large, evolutionarily conserved ribonucleoprotein complex consisting of five small nuclear (sn) RNAs and >50 proteins (17, 18). Although intron density varies greatly, even the most reduced and compacted nuclear genomes examined thus far retain at least a few introns. For example, the genomes of the parasites Giardia lamblia (19) and Encephalitozoon cuniculi (20) possess 4 and 13 introns, respectively, and encode snRNAs and dozens of core spliceosomal protein components necessary for their removal (19–21).

To gain further insight into the significance of intron loss in the H. andersenii nucleomorph genome, we performed a detailed analysis of 51 G. theta and/or H. andersenii nucleomorph genes with predicted roles in RNA metabolism [supporting information (SI) Fig. 4]. Nineteen of 51 genes have clear functions in ribosome biogenesis (e.g., cbf5, nop56), 17 of which are present in both genomes (U3 snoRNP and brx1 are missing in G. theta). Both genomes encode an mRNA capping enzyme (mce), two polyadenylate-binding proteins (pab1,2), and several DExD/H box RNA helicases (e.g., has1, dbp4), which participate in a wide range of RNA-related processes (22). In stark contrast, whereas the G. theta genome encodes 13 proteins with known or predicted spliceosomal functions, most notably two U5 snRNP subunits and the large, highly conserved and spliceosome-specific protein prp8, all but four of these are absent in H. andersenii (SI Fig. 4). The remaining four proteins are highly divergent snrpD and snrpD2 homologs with weak similarity to two of the seven snRNP-associated protein genes in G. theta; cdc28, a DExD/H box helicase whose yeast counterpart (prp2) functions in spliceosome activation (23); and snu13, a protein that functions both in the spliceosome and as part of the rRNA processing machinery (24). Significantly, we were also unable to detect H. andersenii genes for any of the five spliceosome-specific snRNAs (U1, U2, U4, U5, and U6; SI Fig. 4), all of which are found in G. theta (6). Collectively, these results provide strong evidence for the hypothesis of complete loss of introns and splicing in the H. andersenii nucleomorph. Nevertheless, it is formally possible that the missing splicing factors in H. andersenii are, in fact, nucleus-encoded and imported to the organelle posttranslationally, as must be the case for many nucleomorph and plastid proteins in cryptophytes (3), although it is not clear what their present functions would be. The G. theta nuclear genome is being completely sequenced (www.jgi.doe.gov/sequencing/why/CSP2007/guillardia.html), and it will be possible to assemble a complete “parts list” for the nucleomorph spliceosome in this organism. Assuming that there is indeed no spliceosome in the H. andersenii nucleomorph, comparing and contrasting the suite of nucleomorph-localized proteins involved in RNA metabolism in G. theta and H. andersenii should provide key insight into eukaryotic nuclear proteins whose functions are restricted to splicing and those that are multifunctional.

Genome Synteny and Recombination. A comparison of gene order between the H. andersenii and G. theta nucleomorph genomes reveals an exceptional degree of synteny. Ninety-four percent of homologous genes (see below) reside within syntenic blocks (Fig. 2b), with a relatively small number of intra- and interchromosomal recombinations and inversions having scrambled the two genomes since they diverged from one another. For example, a significant fraction of H. andersenii chromosome I corresponds to G. theta chromosome III, whereas chromosomes II and III of H. andersenii share large blocks of synteny with G. theta chromosome II (Fig. 2b). Several blocks of synteny are as large as 30 kbp in size and most differ only in organism-specific ORF content (Fig. 1; below). In some cases, such as one end of chromosome I, large portions of the chromosome share gene content with a portion of a G. theta chromosome, but these regions are broken into syntenic blocks that have been inverted since the common ancestor of the two genomes.

Compared with prokaryotic and organellar genomes (25–27), gene order in nuclear genomes is typically only conserved between closely related species (28–30). An interesting exception occurs in microsporidian parasites, where a recent genomic investigation (31) revealed that their reduced and compacted genomes are unexpectedly stable relative to the fungal genomes from which they evolved, presumably because of a decrease in recombination frequency. Our data suggest that the extreme reduction and compaction that has occurred during cryptophyte evolution has led to an even greater degree of genomic stability in nucleomorphs, on par with that seen in reduced prokaryotic and organellar genomes. Nonhomologous recombination events are likely to disrupt coding sequences in gene-dense nucleomorph genomes, reducing the rate of viable genomic rearrangements and resulting in the retention of large blocks of synteny observed between distantly related cryptophytes. The amount of time since H. andersenii and G. theta diverged from a common ancestor is not known, but molecular phylogenies reveal that they are not closely related (12, 32), their nucleus- and nucleomorph-encoded rDNAs being only ≈90% and ≈80% identical, respectively.

The nucleomorph chromosomes of both Guillardia theta and Bigelowiella natans encode substantial subtelomeric repeats, characterized by the presence of rDNA cistrons (6, 7). These repeats are presumably undergoing recombination/conversion at rates high enough to maintain nearly identical sequence. Interestingly, differences in gene content exist at the most internal portions of the repeats in both genomes. The B. natans genome encodes a complete copy of the heat shock protein gene dnaK

[Figure 1 graphic: gene maps of chromosome I (207,524 bp), chromosome II (184,755 bp), and chromosome III (179,593 bp), each capped by (GA₁₇)₄₋₇ telomeres; scale bar, 5 kb. Legend categories: translation; transcription; protein folding, assembly and degradation; mitotic apparatus; DNA metabolism, cell cycle and signaling; RNA metabolism; plastid-targeted; miscellaneous; genes not present in the G. theta nucleomorph; ORF with conserved homolog in G. theta; ORF with no homolog; ORF with no homolog, but in conserved position; syntenic with G. theta chromosome I, II, or III.]

Fig. 1. H. andersenii nucleomorph genome. The genome is 571,872 bp in size with three chromosomes, shown artificially broken at their midpoint. Colors correspond to predicted functional categories, and shaded bars indicate regions of synteny with the nucleomorph genome of G. theta. Unidentified ORFs in H. andersenii with names followed by brackets including “Gt” correspond to demonstrably homologous unidentified ORFs in G. theta (pink) or ORFs with no obvious sequence similarity residing in a syntenic genomic position (black). Genes mapped on the left side of each chromosome are transcribed bottom-to-top and those on the right, top-to-bottom.
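The genome-level tallies quoted in this paper can be cross-checked with simple arithmetic. The sketch below sums the three chromosome lengths labelled in Fig. 1 and compares the result with the reported genome size. It also adds up the protein-gene categories given in the analysis of Hemiselmis and Guillardia ORFs. The variable names and the short category labels are ours, chosen for illustration, and treating the five quoted categories as a complete partition of the 472 predicted protein genes is an assumption that the check itself tests.

```python
# Chromosome lengths in base pairs, as labelled in Fig. 1.
CHROMOSOME_LENGTHS_BP = {"I": 207_524, "II": 184_755, "III": 179_593}

# Protein-gene counts quoted in the ORF analysis. The labels are
# illustrative shorthand, not the authors' terminology.
ORF_CATEGORIES = {
    "in G. theta, with homologs in nuclear genomes": 254,
    "eukaryotic homologs, not found in G. theta": 18,
    "homologous to G. theta ORFs only": 60,
    "no detectable homolog anywhere": 110,
    "no detectable homolog, syntenic position": 30,
}

REPORTED_GENOME_SIZE_BP = 571_872
REPORTED_PROTEIN_GENES = 472


def fractions(counts):
    """Return each entry's share of the total as a float between 0 and 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


genome_size_bp = sum(CHROMOSOME_LENGTHS_BP.values())
protein_genes = sum(ORF_CATEGORIES.values())
chromosome_share = fractions(CHROMOSOME_LENGTHS_BP)
category_share = fractions(ORF_CATEGORIES)

print(f"genome size: {genome_size_bp:,} bp ({genome_size_bp / 1e6:.3f} Mbp)")
print(f"matches reported size: {genome_size_bp == REPORTED_GENOME_SIZE_BP}")
for name, share in chromosome_share.items():
    print(f"  chromosome {name}: {share:.1%} of the genome")

print(f"protein genes: {protein_genes}")
print(f"matches reported count: {protein_genes == REPORTED_PROTEIN_GENES}")
for name, share in category_share.items():
    print(f"  {name}: {share:.1%}")
```

Both totals agree with the reported values, which suggests that the five categories account for every predicted protein gene. The rounded category shares also agree with the percentages given in the text.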

on the internal side of one chromosome II repeat and truncated pseudogene copies in the same location on the other five chromosome ends (7). In G. theta, the repeats encode more proteins and are more variable in content. The ubc4 gene resides within five of the six G. theta repeats, and tfIID exists on three of the chromosome ends. However, the region outside of ubc4 is essentially identical on all ends (6). A similar pattern has been shown to exist in two other cryptophytes, Hanusia phi and Proteomonas sulcata (12).

In the case of the cryptophyte genus Hemiselmis, exploratory Southern blot hybridizations (11) revealed that all members of this genus appear to lack the majority of the rDNA cistron on chromosome II. The complete nucleomorph genome presented here confirms this for H. andersenii and demonstrates that this situation also exists on one end of chromosome III (Fig. 1). The immediately subtelomeric 5S rDNA is the only remnant of the rDNA cistron on all three of these ends. Interestingly, rpl9 occurs on both ends of chromosome II and is part of a large syntenic block shared with G. theta on one of these ends. This suggests that the original rDNA cistron was replaced by a portion of the chromosome normally located internal to the repeat. The rpl9 locus was presumably then propagated to the opposite end of the chromosome. Whether chromosome II or III was the first to lose most of the rDNA cistron from one of its ends is unclear.

Subtelomeric rDNA cistrons appear to be widespread in cryptophyte and chlorarachniophyte nucleomorph genomes (6, 12) and, curiously, are also found in the reduced nuclear genomes of the microsporidian E. cuniculi (20) and the diplomonad G. lamblia (33). It is possible that the elevated G+C content of the rDNA loci in such genomes serves a role in the maintenance of chromosome ends in genomes with a high average A+T content. However, the lack of full rDNA cistrons on half of the H. andersenii chromosome ends, combined with their extraordinarily A+T-rich telomeric sequences, suggests that this is unlikely. More plausibly, the presence of subtelomeric rDNA cistrons is related to the high rates of recombination and gene conversion that are typical for chromosome ends, paradoxically, in a region of the genome often associated with moderate levels of gene expression. The consequences of having half of the “typical” nucleomorph complement of rDNA cistrons in Hemiselmis are not immediately obvious, but the different complement of multicopy protein genes associated with the chromosome ends in nucleomorph genomes examined thus far suggests that such genes are likely the beneficiaries of serendipity rather than selection.

[Figure 2 graphic: panels a–d. Panel b shows the three H. andersenii chromosomes (I, 207,524 bp; II, 184,755 bp; III, 179,593 bp; scale bar, 5 kb); panels c and d show gene-by-gene comparisons of G. theta (Gt) and H. andersenii (Ha) regions (scale bars, 1 kb).]

Fig. 2. A high degree of synteny between the H. andersenii (Ha) and G. theta (Gt) nucleomorph genomes. (a) Pulsed-field gel electrophoresis showing the relative sizes of the three nucleomorph chromosomes in the two species. (b) Schematic of the three H. andersenii chromosomes with syntenic blocks highlighted in color: black, multicopy DNA; red, syntenic with G. theta chromosome I; green, syntenic with G. theta II; blue, syntenic with G. theta III. “c” and “d” indicate portions of the genome shown in greater detail in the lower portion of the figure. (c and d) Comparisons of two selected syntenic regions shown on chromosome I and III, respectively, in b. Genes in gray are shared between the two genomes, whereas those in red are not. Yellow ORFs show no detectable homology to each other or any known gene/protein, despite occupying the same position in each genome.

Analysis of Hemiselmis and Guillardia ORFs. The H. andersenii nucleomorph genome possesses 472 predicted protein genes, compared with 465 in G. theta (SI Fig. 4). We used two interrelated criteria to infer gene homology between the two genomes: standard DNA/protein sequence similarity and consideration of synteny. Based on sequence similarity alone, >50% (254 of 472) of the H. andersenii genes are present in G. theta and have identifiable homologs in canonical nuclear genomes. Eighteen H. andersenii genes with clear eukaryotic homologs are not found in G. theta, whereas 16 “conserved” G. theta genes are absent in H. andersenii (Fig. 1 and SI Fig. 4). Intriguingly, both nucleomorphs harbor an identical set of 30 genes predicted to encode plastid-targeted proteins. The retention of essential genes for photosynthesis (and the machinery to express them) in the nucleomorph has been touted as the raison d'être of these enigmatic organelles (6). Clearly the complement of cryptophyte nucleomorph-encoded plastid protein genes was established very early in the evolution of this lineage, before the divergence of G. theta and H. andersenii. The functional significance of this observation, in terms of the migration of nucleomorph genes to the host nucleus, is unknown.

The remaining H. andersenii ORFs can be classified as follows: (i) ORFs that are demonstrably homologous between H. andersenii and G. theta but that show no obvious similarity to genes in other genomes (60 of 472 = 13%), (ii) ORFs showing no homology to any gene in G. theta or elsewhere (110 of 472 = 23%), and (iii) H. andersenii ORFs with no detectable homology to known genes but that reside within regions of syntenic conservation (30 of 472 = 6.4%). Remarkably, pairwise comparisons of the H. andersenii and G. theta ORFs in the third category (Fig. 2 c and d) reveal that they are almost always similar in size and, despite sharing no obvious amino acid sequence similarity, have similar pI values and numbers of predicted transmembrane helices (if present). Given that proteins encoded in nucleomorph genomes are often divergent in sequence (3, 34–36), it appears likely that these genes share a common origin but have rapidly diverged in sequence. Together with the H. andersenii-specific unidentified ORFs, the “syntenic ORFs” encode predicted proteins with highly biased amino acid compositions, much more so than ORFs with demonstrable homologs in other eukaryotes and/or in both nucleomorph genomes. Remarkably, the proportion of phenylalanine and asparagine residues (which are encoded by A+T-rich codons) in many of the H. andersenii- and G. theta-specific proteins exceeds 25%. Despite their unusual composition, these proteins are very likely bona fide: 83 are >150 aa in length, 45 are >300 aa long, and 8 H. andersenii ORFs with no sequence homology to known proteins are >800 aa in length. It would appear that through the combined effects of increased mutation rate and/or reduced selective constraint, a significant fraction of cryptophyte nucleomorph-encoded proteins are evolving extraordinarily quickly, yet for unknown reasons, are retained in both H. andersenii and G. theta.

Protein Size Reduction. We next sought to test whether the process of genome reduction/compaction has influenced the size of nucleomorph-encoded proteins as well as their composition. We compared the sizes of 198 proteins found in both H. andersenii and G. theta with their homologs in the red alga C. merolae (37) and the land plant Arabidopsis thaliana (38) (the genome of B. natans (7) could not be analyzed because of the small number of genes its green-algal-derived nucleomorph shares with cryptophytes). Ninety-two percent of these proteins were smaller in nucleomorphs than in both C. merolae and A. thaliana (Fig. 3a and SI Table 1). All pairwise size comparisons were significant when paired t test and binomial test statistics (P < 0.0005) were used. No functional bias was observed: the trend was apparent in proteins involved in a wide range of cellular processes, including protein folding and degradation, transcription, translation, and RNA metabolism.

[Figure 3 graphic: (a) bar chart of average protein size (aa) for Gt, Ha, Cm, and At; value labels recovered from the graphic: 330.4, 348.3, 397.5, and 418.9. (b) Bar chart of average spacer size (bp) for Gt and Ha; value labels recovered from the graphic: 52.1 and 96.9.]

Fig. 3. Impact of genome reduction and compaction on gene density and gene/protein size. (a) Histogram showing average protein sizes for a set of 198 homologous proteins in the H. andersenii and G. theta nucleomorph genomes and the nuclear genomes of the red alga C. merolae and the land plant A. thaliana. All pairwise comparisons are significant when paired t and binomial test statistics (P < 0.0005) are used. (b) Histograms showing average sizes of 164 homologous intergenic spacers in the H. andersenii and G. theta nucleomorph genomes.

To determine where nucleomorph protein shortening had occurred, we examined 50 protein sequence alignments assembled to include homologs from diverse eukaryotes. Although the amino and carboxyl termini were almost always shorter than those of their homologs in algae and other eukaryotes (SI Fig. 5 a and b), numerous internal deletions were also apparent (SI Fig. 5 c–e). Deletions were often localized to regions of the proteins that were variable in length, presumably corresponding to surface loops in protein structure. However, in many cases, the cryptophyte nucleomorph-encoded proteins were >100 aa shorter than their homologs in other eukaryotes, suggesting that entire protein domains have been removed. A striking example is a transcription factor involved in the regulation of heat shock protein gene expression: in H. andersenii and G. theta (6, 34) the HSF protein is 236 and 185 aa long, respectively, compared with 467 in C. merolae, 476 in A. thaliana (SI Table 1), and 833 in the yeast Saccharomyces cerevisiae. Although the amino-terminal DNA-binding domain remains intact, the transactivation domain at the carboxyl terminus has been deleted (data not shown), suggesting a fundamentally different mode of action for this transcription factor in the cryptophyte nucleomorph. Another example is the largest subunit of RNA polymerase II (RPB1). The C-terminal domain (CTD) of RPB1 in most eukaryotes contains an evolutionarily conserved, tandemly arrayed heptapeptide repeat that serves as a platform for interactions with a variety of proteins involved in transcription (39). The nucleomorph-encoded RPB1 proteins are >300 aa shorter than those in C. merolae and A. thaliana (SI Table 1) and completely lack a CTD (C. merolae contains a CTD with atypical repeats). A host of other transcription-related proteins are shorter as well (e.g., RPA1, RPA2, RPC1). The 76-aa ubiquitin monomer, which is typically encoded as part of a polyubiquitin tract, has been lost in G. theta but is retained in the H. andersenii nucleomorph genome as a single stand-alone ORF, as in the reduced genomes of G. lamblia (40) and E. cuniculi (20).

Unexpectedly, not only are the cryptophyte nucleomorph-encoded proteins shorter than their homologs in other eukaryotes, but the sizes of H. andersenii and G. theta proteins also differ significantly from one another. Eighty-one percent of 290 comparable homologs in the 0.572-Mbp H. andersenii genome are larger than their counterparts in G. theta (Fig. 3a and SI Table 1), whose genome is smaller (0.551 Mbp) and more compact: comparison of homologous gene spacers reveals a mean intergenic distance of 52 bp in the G. theta genome versus 97 bp in H. andersenii (Fig. 3b). The difference in both protein and intergenic spacer size is significant at P < 0.0005 when both binomial and t test statistics are used. An interesting comparison of the 2.9-Mbp genome of the microsporidian E. cuniculi to the 12-Mbp S. cerevisiae genome revealed that 85% of its proteins were smaller than their homologs in yeast (20). Based on the assumption that in eukaryotes large proteins facilitate complex regulatory networks (41), it was suggested (20) that this discrepancy reflects a decreased requirement for protein–protein interactions in a highly simplified intracellular parasite with fewer proteins and a simplified “interactome.” In the case of nucleomorphs, the significantly different sizes of proteins in H. andersenii and G. theta, whose genomes encode approximately the same number of proteins (and whose endosymbiont compartments presumably import approximately the same number of nucleus-encoded proteins), suggest that genome compaction can play a direct role in the process of protein shortening, beyond simply providing the mechanism for the eventual elimination of genes (or parts of genes) that are no longer essential. We hypothesize that a deletion bias accounts for the smaller, more compact nucleomorph genome of G. theta as well as its smaller proteins. This can be tested by comparing the nucleomorph genomes of very closely related cryptophytes and, when present, pseudogenes, as has been done to demonstrate the existence of a deletion bias in species of Drosophila (42, 43), once these data become available.

Methods

DNA Isolation, Genome Sequencing, and Genome Annotation. By using density gradient-purified DNA as starting material (11), nucleomorph DNA from H. andersenii CCMP644 was nebulized and electrophoretically separated on a 1% agarose gel, cloned into pUC19 vector, and shotgun sequenced to ≈9× coverage by using ET terminator chemistry (GE Healthcare) and MegaBace capillary DNA sequencers. Assembly and editing of the ≈16,500 end reads was performed by using Staden (44) and resulted in 28 nonoverlapping contigs. Contigs were mapped to specific chromosomes by using Southern blot hybridization, and the remaining gaps were filled by using long-range PCR. PCR products were cloned and sequenced as described (11). ORFs >40 aa in size were identified in Artemis (45) and examined for their coding potential by using BLASTX (46). tRNA genes were identified

by using tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/). Ribosomal RNA and snRNA genes were identified by using BLASTN and by comparison to the G. theta nucleomorph genome. The H. andersenii nucleomorph genome sequence has been deposited in GenBank under the following accession numbers: CP000881, CP000882, and CP000883.

Identification of Syntenic Regions. Portions of the genome were considered syntenic if they shared identifiable ORFs in the same order and orientation as G. theta. ORFs showing no detectible homology to known ORFs were not considered interruptions of a syntenic block, nor were tRNAs or genes present in only one of the genomes. Only blocks that included three or more conserved genes (or ORFs shared between the nucleomorphs) were described as syntenic. Genes encoding rRNAs were not included.

Statistical Analysis. Protein length and intergenic spacer size were compared between the two genomes by using two test statistics. A paired t test was implemented by using the following equation: t = mean(length difference)/[s/sqrt(n)], where s is the square root of the sample variance and n is the number of pairs. Infinite degrees of freedom were used when calculating P values. A binomial test was also used with the following formula: P(Z ≥ [phat − 1/2]/(0.5/sqrt(n))), where Z has a normal distribution, and phat is the proportion of proteins where the length is shorter in G. theta than in H. andersenii (or shorter in the nucleomorph representative, in cases where a nucleomorph genome was being compared with a nuclear genome).

We thank O. Zhaxybayeva, F. Doolittle, A. Roger, E. Bapteste, H. Philippe, and M. Gray for comments and discussion; A. Fong for DNA sequencing; and E. Susko for assistance with statistical analyses. This work was supported by Genome Atlantic and a Natural Sciences and Engineering Research Council (Canada) Discovery grant (to J.M.A.). J.M.A. is a Scholar of the Canadian Institute for Advanced Research, Program in Integrated Microbial Biodiversity.
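As an illustration of the two test statistics described under Statistical Analysis, the following minimal Python sketch computes the paired t statistic and the normal-approximation binomial statistic from two lists of paired lengths. The function names and the example lengths are hypothetical and are not taken from SI Table 1. The use of a one-sided upper-tail P value for the t statistic is also an assumption, because the text states only that infinite degrees of freedom (i.e., a normal reference distribution) were used.

```python
import math
from statistics import mean, stdev


def normal_upper_tail(z):
    """Return P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def paired_t_statistic(lengths_a, lengths_b):
    """t = mean(d) / (s / sqrt(n)), where d = b - a for each pair and
    s is the square root of the sample variance of d."""
    diffs = [b - a for a, b in zip(lengths_a, lengths_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def binomial_z_test(lengths_a, lengths_b):
    """Normal approximation to the binomial test.

    phat is the proportion of pairs in which the first genome's
    protein (or spacer) is shorter than the second genome's.
    Returns (phat, z, P(Z >= z)).
    """
    n = len(lengths_a)
    phat = sum(1 for a, b in zip(lengths_a, lengths_b) if a < b) / n
    z = (phat - 0.5) / (0.5 / math.sqrt(n))
    return phat, z, normal_upper_tail(z)


# Hypothetical paired protein lengths (aa), invented for illustration only.
gt_lengths = [185, 120, 300, 95, 410]   # stand-in for G. theta
ha_lengths = [236, 150, 310, 90, 450]   # stand-in for H. andersenii

t_stat = paired_t_statistic(gt_lengths, ha_lengths)
# Infinite degrees of freedom: refer t to the standard normal (one-sided, assumed).
t_p_value = normal_upper_tail(t_stat)
phat, z_stat, z_p_value = binomial_z_test(gt_lengths, ha_lengths)
```

With real data, the two input lists would hold the lengths of homologous proteins (or intergenic spacers) in the same order for the two genomes being compared.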

1. Gregory TR (2001) Biol Rev 76:65–101.
2. Keeling PJ, Slamovits CH (2005) Curr Opin Gen Dev 15:601–608.
3. Archibald JM (2007) BioEssays 29:392–402.
4. Eschbach S, Hofmann CJ, Maier UG, Sitte P, Hansmann P (1991) Nucleic Acids Res 19:1779–1781.
5. Gilson PR, McFadden GI (2002) Genetica 115:13–28.
6. Douglas SE, Zauner S, Fraunholz M, Beaton M, Penny S, Deng L, Wu X, Reith M, Cavalier-Smith T, Maier U-G (2001) Nature 410:1091–1096.
7. Gilson PR, Su V, Slamovits CH, Reith ME, Keeling PJ, McFadden GI (2006) Proc Natl Acad Sci USA 103:9566–9571.
8. Archibald JM (2005) Int Union Biochem Mol Biol Life 57:539–547.
9. Palmer JD (2003) J Phycol 39:4–11.
10. Gilson PR, McFadden GI (1999) Phycol Res 47:7–19.
11. Lane CE, Archibald JM (2006) J Eukaryot Microbiol 53:515–521.
12. Lane CE, Khan H, MacKinnon M, Fong A, Theophilou S, Archibald JM (2006) Mol Biol Evol 23:856–865.
13. Rensing SA, Goddemeier M, Hofmann CJ, Maier UG (1994) Curr Genet 26:451–455.
14. Silver T, Koike S, Yabuki A, Kofuji R, Archibald JM, Ishida K-I (2007) J Eukaryot Microbiol 54:403–410.
15. Lane CE, Archibald JM (2008) J Phycol, in press.
16. Martin W, Koonin EV (2006) Nature 440:41–45.
17. Collins CA, Guthrie C (2000) Nat Struct Biol 7:850–854.
18. Nilsen TW (2003) BioEssays 25:1147–1149.
19. Morrison HG, McArthur AG, Gillin FD, Aley SB, Adam RD, Olsen GJ, Best AA, Cande WZ, Chen F, Cipriano MJ, et al. (2007) Science 317:1921–1926.
20. Katinka MD, Duprat S, Cornillot E, Metenier G, Thomarat F, Prensier G, Barbe V, Peyretaillade E, Brottier P, Wincker P, et al. (2001) Nature 414:450–453.
21. Nixon JE, Wang A, Morrison HG, McArthur AG, Sogin ML, Loftus BJ, Samuelson J (2002) Proc Natl Acad Sci USA 99:3701–3705.
22. Linder P (2006) Nucleic Acids Res 34:4168–4180.
23. Edwalds-Gilbert G, Kim DH, Silverman E, Lin RJ (2004) RNA 10:210–220.
24. Watkins NJ, Segault V, Charpentier B, Nottrott S, Fabrizio P, Bachi A, Wilm M, Rosbash M, Branlant C, Luhrmann R (2000) Cell 103:457–466.
25. Saccone C, Gissi C, Reyes A, Larizza A, Sbisa E, Pesole G (2002) Gene 286:3–12.
26. Suyama M, Bork P (2001) Trends Genet 17:10–13.
27. Tamas I, Klasson L, Canback B, Naslund AK, Eriksson AS, Wernegreen JJ, Sandstrom JP, Moran NA, Andersson SG (2002) Science 296:2376–2379.
28. Huynen MA, Bork P (1998) Proc Natl Acad Sci USA 95:5849–5856.
29. Ranz JM, Casals F, Ruiz A (2001) Genome Res 11:230–239.
30. Sharakhov IV, Serazin AC, Grushko OG, Dana A, Lobo N, Hillenmeyer ME, Westerman R, Romero-Severson J, Costantini C, Sagnon N, et al. (2002) Science 298:182–185.
31. Slamovits CH, Fast NM, Law JS, Keeling PJ (2004) Curr Biol 14:891–896.
32. Deane JA, Strachan IM, Saunders GW, Hill DRA, McFadden GI (2002) J Phycol 38:1236–1244.
33. Le Blancq SM, Adam RD (1998) Mol Biochem Parasitol 97:199–208.
34. Archibald JM, Cavalier-Smith T, Maier U, Douglas S (2001) J Mol Evol 52:490–501.
35. Brinkmann H, van der Giezen M, Zhou Y, Poncelin de Raucourt G, Philippe H (2005) Syst Biol 54:743–757.
36. Keeling PJ, Deane JA, Hink-Schauer C, Douglas SE, Maier UG, McFadden GI (1999) Mol Biol Evol 16:1308–1313.
37. Matsuzaki M, Misumi O, Shin IT, Maruyama S, Takahara M, Miyagishima SY, Mori T, Nishida K, Yagisawa F, Nishida K, et al. (2004) Nature 428:653–657.
38. The Arabidopsis Genome Initiative (2000) Nature 408:796–815.
39. Stiller JW, Hall BD (2002) Proc Natl Acad Sci USA 99:6091–6096.
40. Krebber H, Wostmann C, Bakker-Grunwald T (1994) FEBS Lett 343:234–236.
41. Zhang J (2000) Trends Genet 16:107–109.
42. Petrov DA, Lozovskaya ER, Hartl DL (1996) Nature 384:346–349.
43. Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL (2000) Science 287:1060–1062.
44. Bonfield JK, Smith KF, Staden R (1995) Nucleic Acids Res 23:4992–4999.
45. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B (2000) Bioinformatics 16:944–945.
46. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Nucleic Acids Res 25:3389–3402.

Lane et al. PNAS | December 11, 2007 | vol. 104 | no. 50 | 19913