Quick viewing(Text Mode)

Foamy-Like Endogenous Retroviruses Are Abundant and Extensive in Teleosts

Foamy-Like Endogenous Retroviruses Are Abundant and Extensive in Teleosts

Foamy-like Endogenous Are Abundant and Extensive In

Item Type text; Electronic Thesis

Authors Ruboyianes, Ryan

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Download date 05/10/2021 23:26:24

Link to Item http://hdl.handle.net/10150/594955

FOAMY-LIKE ENDOGENOUS RETROVIRUSES ARE ABUNDANT AND EXTENSIVE IN TELEOSTS

by

RYAN RUBOYIANES

______Copyright © Ryan Ruboyianes 2015

A Thesis Submitted to the Faculty of the

DEPARTMENT OF ECOLOGY & EVOLUTIONARY BIOLOGY

In Partial Fulfillment of the Requirements

For the Degree of

MASTER OF SCIENCE

In the Graduate College

THE UNIVERSITY OF ARIZONA

2015

STATEMENT BY AUTHOR

The thesis titled Foamy-like Endogenous Retroviruses Are Abundant and Extensive in Teleosts prepared by Ryan Ruboyianes has been submitted in partial fulfillment of requirements for a master’s degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this thesis are allowable without special permission, provided that an accurate acknowledgement of the source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

SIGNED: Ryan Ruboyianes

APPROVAL BY THESIS DIRECTOR

This thesis has been approved on the date shown below:

______Dr. Michael Worobey Date Professor of Ecology & Evolutionary Biology

2 ACKNOWLEDGEMENTS

I would like to thank, from the depths of my heart, Tom Watts, for his technical and moral guidance towards completing this stage of my education.

I thank my research advisor, Dr. Michael Worobey, and thesis committee members, Dr. Michael Barker and Dr. John Wiens, for patience and timely advice. Dr. Worobey, thank you especially for the necessary and sometimes difficult lessons in self-sufficiency.

I again thank Dr. Barker, and Dr. Mark Beilstein, for the friendly encouragement that came with their good advice.

I thank Dr. Guan-Zhu Han and Brendan Larsen, my lab-mates and comrades, for immensely helpful suggestions for improving my work.

I thank Kelsey Yule and Sarah Richman for the many helpful discussions.

I thank my father for reminding me that I am not stupid.

I gratefully thank all of the people who graciously provided samples for my experiments; Andrew Bentley, Dr. Kevin Fitzsimmons and Dawn M. Cloutier Wyles were especially helpful.

Mom, thank you for literally everything else.

TABLE OF CONTENTS

LIST OF FIGURES ...... 5 LIST OF TABLES ...... 5 ABSTRACT ...... 6 INTRODUCTION ...... 7 RESULTS AND DISCUSSION ...... 9 Foamy-like ERVs in ...... 9 Foamy Sister Clade Comprised of Fish ERVs ...... 13 Fish Foamy-like ERVs More Complex Than FVs ...... 15 Genetic organization ...... 15 Fish ERV foamy-like features ...... 20 Estimated Age of ERV Integration ...... 23 Virus-Host Cophylogenetic History ...... 27 De Novo Sequencing of FV-like Sequences in Diverse Teleosts ...... 29 MATERIALS AND METHODS ...... 33 screening ...... 33 De novo fish ERV sequencing ...... 33 Phylogenetic analysis ...... 35 Quasi-consensus construction and annotation ...... 36 ERV age estimation ...... 36 CONCLUSIONS ...... 37 REFERENCES ...... 40 APPENDIX A – SEQUENCES IN PHYLOGENETIC ANALYSES ...... 47 APPENDIX B – SAMPLED FISH ...... 50

4 LIST OF FIGURES

Figure 1. Retrovirus Pol phylogeny. Representative taxa from all classified genera of exogenous and are included with all known fish foamy-like endogenous retroviruses in a Bayesian phylogenetic tree construct...... 14 Figure 2. Fish ERV genetic features. Detailed genetic features of prototype foamy virus and three high-quality fish ERVs. Foamy-like ERV features are annotated...... 16 Figure 3. Fish ERV genetic organization. Coarse to-scale comparison of intact fish ERVs with annotated genes...... 19 Figure 4. Foamy clade host-virus incongruence. A cophylogeny and tanglegram designed to test apparent incongruences between foamy-like fish ERV host and virus phylogenies. Host and virus trees are compared for only highly intact fish ERVs. Crossed lines show that even recently integrated ERVs demonstrate a history of host switching...... 28 Figure 5. Phylogeny of 27 foamy-like fish ERVs and distant RVs. Bayesian reconstruction of phylogenetic history for all fish EFVs recovered in this study, including the those previously described...... 32

LIST OF TABLES

Table 1. EFV-harboring fish genome contigs. Included are the WGS contig from each actinopterygian host with the highest sequence homology to exogenous foamy ...... 10 Table 2. Estimates of ERV integration age. Time since integration for fish ERVs with intact LTR. Ages calculated with LTR divergence and LTR-flank divergence method...... 25 Table 3. Hosts with EFV-like fragments from de novo sequencing. Detected by PCR assay of hosts with no GenBank genome representation. Included are novel EFV fragments with homology to exogenous foamy viruses...... 30

5

ABSTRACT

Spumaretrovirus, among retrovirus clades, has an extensive accumulation of evidence for an ancient origin. Recent discoveries indicate that the Spumaretrovirus ancestor could have been the first retrovirus to appear during the evolution of vertebrates. If they indeed appeared in ancient marine environments hundreds of millions of years ago, we should expect significant undiscovered diversity of foamy-like endogenous retroviruses in fish genomes. I report the discovery of these elements in 23 novel hosts. These viruses have very large genomes compared to all other retroviruses, possess an unprecedented array of accessory genes, and form a robust reciprocally monophyletic sister clade with sarcopterygian host foamy viruses, with class III mammal endogenous retroviruses being the immediate sister group to both clades. I estimated that some of these viruses integrated recently into host genomes, and exogenous descendants of these viruses may be extant.

6 INTRODUCTION

Foamy viruses (FV) constitute one of seven extant genera of Retroviridae (1). The lone of the subfamily, FVs are complex retroviruses distinguished from primarily because of disparate replication strategies (2). FV infection is persistent and nonpathogenic in vivo, but causes characteristic fluid-filled syncytial in vitro with a “foamy” appearance (3,4). Known infectious FVs are thus far confined to mammalian hosts (5), including felines (6), horses (7), cows (8), bats (9), and wide variety of (10–15) harboring species- specific FVs. Among FVs, host-virus coevolution appears to have been the norm for the last 30 million years (16). Despite attempts to link a cryptic FV infection to common diseases (3), a circulating human-specific FV has never been described though (SFV) zoonoses do occur (17,18). FVs, in common with all retroviruses (RV), reverse-transcribe their positive-sense RNA genome into dsDNA, then integrate into the genome of an infected . The integrated virus then exploits the host’s cellular machinery to generate required for virion assembly and egress (19). Endogenous retroviruses (ERV) are the result of infection and integration into germline cells and subsequent parent-offspring vertical transmission as genomic DNA (20). After endogenization, RVs evolve at the host neutral evolutionary rate and can remain detectable tens of millions of years after integration as viral “fossils” (21). Newly discovered ERVs are traditionally classified by phylogenetic proximity to exogenous RVs; class I ERVs are related to isolates, class II are most related to , and class III to Spumaretrovirus, or foamy viruses. These divisions are slowly becoming obsolete as RV genera are resolved into clades by accumulated data. ERVs may represent extinct RVs that no longer exist in exogenous form. For example, though human-specific FVs are unknown, the human genome is rife with the relics of ancestral infection by FV-like class III elements (22), some of which are widely retained among other primates (23). The discovery of endogenous foamy viruses (EFV) in two-toed sloth (24), galago (5), and Cape golden mole (5,25) genomes provide strong evidence for at least 100 million years of near- stable host-virus coevolution for Spumaretrovirinae. A coelacanth EFV suggests the possibility of a marine FV origin 400 million years ago (Ma) (26). The presence of foamy-like ERVs in

7 zebrafish (27), platyfish, and cod (28) lend credence to this hypothesis, though the possibility remains that these elements represent distinct and possibly extinct lineages that branched very early in the evolution of retroviruses (5). Whether these fish ERVs represent “true” EFVs or early diverging foamy-like class III retroelements, their existence is congruent with other lines of evidence that FVs are the most ancient genus of extant RVs. Uniquely nonexistent pathogenicity in living hosts has been cited as possible evidence for ancient origins (29), along with a replication strategy that resembles orthoretroviruses in some respects, but pararetroviruses in others (30). Whether FVs are truly ancient or merely a highly successful RV lineage with an incredible host range, we should expect a significant undiscovered diversity of foamy-like ERVs in host clades that diverged early in the evolution of vertebrates. Ray-finned are the most species-rich clade of living vertebrates, with at least 26,000 taxa (31), and have likely been so since shortly after the divergence of Osteichthyes into sarcopterygian and actinopterygian lineages 430 Ma (32,33). I report here the discovery of 23 novel EFV-like elements in teleost genomes by genome database and PCR screens. Nine of these ERVs were intact enough for genetic annotation.

8 RESULTS AND DISCUSSION

Foamy-like ERVs in Fish Genomes

To explore EFV-like diversity in fish, I screened all 52 available actinopterygian whole genome shotgun (WGS) databases in GenBank (GB) with exogenous and endogenous FV Pol amino acid (aa) queries. My search revealed 17 teleost fish harboring FV-like pol sequences (Table 1). Three of these ERVs have been described previously. The platyfish genome (Xiphophorus maculatus) was reported to contain two near-intact FV-like insertions (28). My search uncovered only one insert encoding nearly a full proviral genome, though it lacks long terminal repeats (LTR), designated XmERV. A second large insert, also lacking LTR, is missing 32% its length. These and the 15 other fragments are highly similar with minimum 80% nucleotide (nt) identity, indicating their origin as a single viral infection of some platyfish ancestor. From the zebrafish genome (Danio rerio), in addition to the EFV-like element designated DrFV-1 (27), I recovered two additional full length and highly intact zebrafish ERV sharing 96% global nucleotide identity along the aligned length excluding the 5’ LTR. I designate these DrFV-2 and DrFV-3 following the naming precedent set by Llorens et al. (27). A fourth copy with major deletions in the region is also present. It shares a 5’ flanking sequence with DrFV-3 and represents a duplicated copy. The putative EFV fragment detected in the cod genome (Gadus morhua) contained only the (RT) region of pol (28). However, I detected a second FV-like genomic contig. A conserved domain search (34) revealed the presence of RNase H (RH), and (Int) (E-value = 5.67e-17, 6.36e-23, respectively). Visually inspecting the conceptual translation of this sequence revealed a retroviral protease (Pro) motif within an open reading frame (ORF) lacking RT. This same contig also encoded two transmembrane (TM) motifs among broken ORFs within 3.5 kilobases (kb) downstream, evidencing its origin as an infectious RV. Concatenating then translating these sequences produced a 1069 aa with one premature stop codon that shares 66% aa identity when aligned with either complete platyfish ERV Pol sequence. Among novel FV-like insertions in fish genomes, eight were characterized by a pol gene alone, as neither canonical gag, env, nor potential accessory ORFs corresponding to bel or tas

9 Table 1. EFV-harboring fish genome contigs. Host species higher taxonomic levels per Betancur-R et al. 2013 (33), except (*) incertae sedis member of Ovalentaria assigned by aforementioned. Common names per FishBase classification (35). (a) Previously described in (27). (b) Previously described in (28). ERV components marked (t) are truncated. Abbreviations: EFV, ; PFV, Prototype foamy virus; SFVcpz, Simian foamy virus – chimpanzee; SFVorg, Simian foamy virus – orangutan, SFVmac, Simian foamy virus – ; SFVspm, Simian foamy virus – spider .

Order BLASTx match Common name Contig (acc) % ID E-value ERV components Species (acc)

Cypriniformes Danio rerioa Zebrafish CAAK05053864.1 SFVspm 29 6E-83 LTR-gag-pol-env-LTR (ABV59399) Pimephales promelas Fathead minnow JNCD01029789.1 SFVcpz 30 5E-87 tLTR-gag-pol-env-LTR (AKM21185) Salmoniformes Oncorhynchus mykiss Rainbow trout CCAF010042932.1 EFV 30 5E-36 pol (NP_054716) Gadus morhuab Cod CAEA01131311.1 PFV 34 9E-71 pol (AAA66556) Pleuronectiformes Cynoglossus semilaevis Tongue sole AGRG01061695.1 SFVspm 33 2E-64 pol (ABV59399) Amphilophus citrinellus Midas cichlid CCOE01001074.1 EFV 28 7E-100 LTR-gag-pol-env-LTR (NP_054716) Nothobranchius furzeri Turquoise killifish JNBZ01063262.1 EFV 31 5E-69 pol (NP_054716) Fundulus heteroclitus Mummichog JXMV01054445.1 SFVmac 33 1E-33 pol (AFA44809) b Xiphophorus maculatus Platyfish AGAJ01041163.1 SFVorg 29 6E-92 tLTR-gag-pol-env (CAD67562) Poecilia formosa Guppy AYCK01027102.1 SFVmac 29 2E-92 LTR-gag-polt-env-LTR (AFA44809) Poecilia reticulata Amazon molly AZHG01028727.1 SFVspm 29 4E-86 LTR-gag-pol-env-LTR (ABV59399) Anoplopoma fimbria Sablefish AWGY01041462 SFVcpz 37 4E-58 pol (AFX98084) Larimichthys crocea Yellow croaker JRPU01012463.1 SFVspm 26 3E-81 LTR-gag-pol-envt (ABV59399) Periophthalmus magnuspinnatus Korean mudskipper JACL01052273.1 SFVspm 30 4E-74 pol (ABV59399) Sebastes rubrivinctus Flag rockfish AUPQ01030678.1 SFVspm 32 3E-62 pol (ABV59399) Sebastes nigrocinctus Tiger rockfish AUPR01019601.1 SFVspm 31 1E-60 pol (ABV59399) Formerly Perciformes Stegastes partitus* Bicolor damselfish JMKM01039980.1 SFVspm 33 2E-72 tLTR-gag-polt-env-LTR (ABV59399)

10 were detected among WGS contigs identified by my FV Pol screen. Rainbow trout (Oncorhynchus mykiss), Korean mudskipper (Periophthalmus magnuspinnatus), tongue sole (Cynoglossus semilaevis), flag rockfish (Sebastes rubrivinctus), tiger rockfish (Sebastes nigrocinctus), and sablefish (Anoplopoma fimbria) genomes each contained a single fragment with detectable homology with FV pol. None of these hosts genomes contained a complete hypothetical ORF, but I was able to extract >1700 nt of sequence homologous to pol. The pol fragments in flag rockfish and tiger rockfish are very likely orthologous, as they are of nearly identical length (1750 nt) and share 96% nt identity. These species arose late in the diversification of Sebastes, sharing a common ancestor 3-4 Ma (36). Turquoise killifish (Nothobranchius furzeri) FV-like sequences were also characterized by pol sequence alone, but there are numerous copies of complete FV-like pol genes in this host genome. Again, each is likely the product of a single viral insertion (37), as all detectable pol sequences are highly similar (>80% ID) to paralogous counterparts. I was nonetheless unable to detect other ERV features. Whether this is due to the short length of contigs containing pol, the longest being 6,307 nt, or their general absence, is unknown. Similarly, the mummichog genome (Fundulus heteroclitus) has four nonfunctional copies that encode Pol after conservative frameshift corrections, but are riddled with premature stop codons. Other than a TM motif detected in a fragmented reading frame downstream of pol in one contig, there are no other discernable ERV features. Five further species host more complete viral genomic insertions. The yellow croaker genome (Larimichthys crocea) contains numerous ERV integrations, some with an intact LTR. These viral genomes were generally degraded with interruption by secondary insertions of unrelated mobile elements. The most intact fragment featured both 5’ and 3’ LTR, a gag-like ORF, pol, and a partial env gene encoding a TM motif (LcERV). The majority of EFV-like fragments in this genome contain a ~2 kb deletion in the putative gag gene, evidence that these fragments originated from a single infectious integration. Aligning these fragments allowed me to construct a quasi-consensus, anchored by pol, containing full-length gag and env genes. Likewise, I was able to annotate the proviral genome of an LTR-complete ERV harbored by bicolor damselfish (Stegastes partitus, SpERV). Numerous smaller fragments facilitated this, many with a largely intact gag-pol-env or some derivation, but no other LTR bearing ERVs with FV homology. The single intact genome was highly degraded with numerous stop codons and

11 frameshifts. The final quasi-consensus genome, with ORFs recovered from post-endogenization mutations, was very large compared to known RV genome size diversity. The same held true for LcERV. The Midas cichlid genome (Amphilophus citrinellus) contains four LTR-complete EFV- like elements (AcERV-1 through AcERV-4) in addition to five fragments lacking either the 5’ or 3’ LTR. None of these nine shared flanking host genomic sequences, thus are likely to be true integrations and not the result of segmental duplication. Each shares ~85% global nt identity with other A. citrinellus EFV-like elements, therefore insertions may not have occurred contemporaneously. The four nearly complete ERV genomes are of similar total length (~17,000 nt), and genomic organization is highly similar to that reported for the platyfish consensus EFV (28). Two congeneric species, guppy (Poecilia reticulata) and Amazon molly (Poecilia formosa), each contain two full ERV inserts. In Amazon mollies, both LTR are truncated in one insert, the other, curiously, has inverted LTR. In guppies, one viral genome appears complete, but its putative accessory genes are also found upstream of the 5’ LTR, indicating a recombination event centered on that LTR. The other complete insert has a truncated 3’ LTR. Both are also nearly the same length as the platyfish EFV, not including LTR. The EFVs found in either species genome do not appear to be orthologous. I was unable to detect shared host genomic flanking sequences for any of the complete or fragmented inserts. I performed a clustering analysis of all pol fragments; all inserts and duplicated fragments displayed greater within-species similarity than between-species similarity. This is perhaps surprising given estimates of the relatively recent divergence of these species 25 Ma (38). I detected two FV-like insertions in the fathead minnow genome (Pimephales promelas), the nearest relative to zebrafish among these results. The first of these viruses, which I designate PpERV-1 for future analysis, lacks LTR but otherwise encodes nonfunctional RV core and accessory genes. PpERV-1 shares 77% global identity with several smaller fragments in the genome, which may be the product of a single infection. However, the low number of duplicated fragments, along with very large gaps in the genome assembly, renders this impossible to determine. Interestingly though, the second intact is highly divergent, yet remarkably intact, as it retains partial LTRs and encodes full albeit nonfunctional RV core genes. Designated

12 PpERV-2, It shares only 40% aa identity with PpERV-1. Among the novel fish ERVs I detected, it is the only potential example of infection by multiple distantly related foamy-like viruses. Finally, my screen also captured a potential FV-related pol fragment in both the Nile tilapia (Oreochromis niloticus) and the giant mudskipper (Periophthalmodon schlosseri) genome. A blastx search demonstrated their probable RV origin, with exogenous FVs being the best hits (Nile tilapia, e-value = 4e-36; giant mudskipper, e-value = 2e-19.) Both were short and contained numerous indels, with the Nile tilapia insert missing most of its RT motifs. I was not able to phylogenetically assign either species ERV to an RV clade with confidence, therefore I excluded these from further analysis and did not include them in the total number of novel fish ERVs.

Foamy Virus Sister Clade Comprised of Fish ERVs

I aligned the Pol sequences of newly discovered ERVs with those previously described and a representative set of exogenous and endogenous retroviruses from all major clades (App. A). I eliminated ambiguously aligned positions and constructed a Bayesian phylogeny of 693 aa aligned positions (Fig. 1). Fish FV-like elements group robustly (posterior probability = 1.0) as the sister group to all other foamy viruses. Likewise, all major clades of exogenous and endogenous RVs group robustly in branching orders that reflect known relationships. I was unable to discern certain finer relationships among fish ERVs with precision, indicated by several nodes among FV-like fish ERVs with low posterior probability. On a coarse scale, fish ERVs to an extent do group into recognizable fish host clades despite incongruities with known host divergence patterns. ERVs from the four Fundulidae (mummichog, platyfish, guppy, and Amazon molly) form a clade that branches in an order matching the host species. The two species ERVs (zebrafish and fathead minnow) are also grouped. Cod and rainbow trout ERVs both branch before the large percomorph ERV crown group in a pattern that resembles the branching relationships of the hosts. These bifurcations may reflect the true evolutionary history of these viruses, implying a complex history of host-switching deep in the history of fish. However, the quality of sequences I extracted from fish genomes is highly variable. ERVs from three host species in particular: P. magnuspinnatus, N. furzeri, and A. fimbria, are

13

Figure 1. Retrovirus Pol phylogeny. Seven genera of exogenous and endogenous retroviruses in colored blocks; Snakehead retrovirus, genus yet unclassified, is uncolored; foamy-like teleost endogenous retroviruses represented by salmon-colored branches. Class I ERVs are grouped with and class II with for simplicity. Tree construct is a Bayesian inference of 692 aligned Pol protein positions and rooted at its midpoint for illustrative purposes. Nodes with posterior probability <1.0 are labeled. RV taxa and source citations are available in Appendix A.

14 percomorph species closely related (in relative terms) to the fish group circumscribing guppies and yellow croaker, for example. In addition, they are at present geographically and ecologically disparate. Turquoise killifish occupy temporary freshwater pools in Zimbabwe and Mozambique, growing from egg to adult in a mere month (39). Sablefish are benthic deep-sea dwellers with a trans-Pacific distribution (40), while mudskippers occupy mudflats in Korea and mainland China (41). Yet ERVs from these fish occupy a basal position within FV-like fish ERV. This may be attributable to phylogenetic signal degradation over millions of years of accumulated neutral mutations since integration. Critically, the novel fish ERV clade shares a closer phylogenetic relationship with mammal FVs than class III ERVs in mammals, including HERV-L and MuERV-L (Fig. 1). This was the least expected result of this study.

Fish Foamy-like ERVs More Complex Than FVs

Genetic organization. Alignment of all sequences obtained from genomic BLAST searches revealed eight intact or nearly intact ERVs. To determine whether the molecular characteristics of these viruses resemble FV biology, I annotated exemplars from host species with the most intact ERVs: zebrafish, fathead minnow, and Midas cichlid, and compared them to the prototype foamy virus (PFV), GB accession number (acc): Y07725.1 (Fig. 2). Raw intact zebrafish (DrFV-2, acc: CAAK05053864) and fathead minnow (PpERV-1, acc: JNCD01029789) ERVs were annotated with no modification. With the exception of a frameshift mutation in the gag gene, DrFV-3 is extremely similar to DrFV-2. Intact ORFs correspond to RV core genes and several putative accessory genes, along with intact LTR. DrFV gag and env genes possess no detectable homology with any other known RV, FV or otherwise, as determined by BLAST searches and CD-HIT (42). The same is true for PpERV-1. Gag genes were annotated as such based on their position as the lone ORF upstream of pol. Env genes I annotated as such because they encode putative TM-helix motifs, which were not found in any other ORF between these two proviral genomes. DrFV-2 and PpERV-1 encode two or three accessory genes downstream from env, much like mammalian FVs universally encode bel1/tas transactivator and bel2/bet regulatory proteins in the same region. These genes are indispensible for FVs because tas regulates replication (43,44), while bet is probably responsible for preventing cellular superinfection by multiple FVs,

15

Figure 2. Fish ERV genetic features. Comparison of provirus features. SFVcpz/hu (acc: Y07725.1) is the prototype virus for the FV clade; DrFV-2, PpERV-1, and Midas cichlid ERV are exemplars from this study selected for genome quality. Open boxes represent ORFs. Premature stop codons marked with asterisk (*). Reading frames represented by vertical orientation. Downward arrow represents furin cleavage site; arrows surmounted by ‘S’ are signal peptidase cleavage sites; p3-cleavage site marked by ‘p3’ arrow; dotted line represents ORF boundary in LTR. PPT:LTR boundary ‘gggtg’ motifs are marked as such. Gag genes are green, pol are blue, and env are red. Black boxes are unknown accessory genes unless labeled. Boxes and genome length are to scale (kb). Abbreviations: GQR, glycine-glutamine- rich box; GR, glycine-arginine rich box; IP, internal promoter; Int, integrase; PBS, primer binding site; Pro, protease; RH, ribonuclease H; RT, reverse transcriptase; TM, trans-membrane motif.

16 thereby maintaining low in vivo pathogenicity (2,29,45). Whether the zebrafish 3’ accessory genes maintain similar functions is unknown. Neither share functional domains with FV counterparts. Both viruses also contain an ORF, 1335 and 1263 nt, respectively, between pol and env coding sequences. This feature is highly unusual. FVs do not contain this feature, nor do known class III ERVs (46). This feature is also unique among infectious fish retroviruses (47– 49). The closest comparison to be made is among , which often encode a viral infectivity factor (Vif) or transactivator (Tat) in this region (50). AcERV-1, the Midas cichlid ERV selected for this comparison retains intact LTR and pol. The ERV regions flanking pol contain premature stop codons and frameshifting indels, as determined by alignment with homologous inserts in the same genome. Therefore I constructed a quasi-consensus anchored with the pol gene. There is the possibility that this consensus does not reflect the genetic organization of the virus pre-integration. Though unable to assemble mature virions and infect naïve cells, env-defective are still potentially mobile elements, much like classic retrotransposons. For Midas cichlid FV-like inserts, consensus of intact proviral genomes may only reveal the broken organization of the genome between accumulating nonsense mutations and retrotransposition throughout the host genome. To test whether AcERV-1, 2, 3, and 4 represent insertions from independent infections, I calculated pairwise ratio of non-synonymous to synonymous substitutions (dN/dS) in both the pol and env between all four proviruses. It stands to reason that a virus replicating by a series of subsequent germline infections would retain a signature of purifying selection (dN/dS < 1) in both genes, whereas a replication-competent, non-infectious provirus would accumulate functionally deleterious mutations in its env gene. In the latter scenario, viral env evolves under a neutral substitution regime (dN/dS ~ 1). Pairwise pol (dN/dS = 0.086 – 0.120) and pairwise env comparisons (dN/dS = 0.149 – 0.228) were consistent with a history of purifying selection. A codon-based Z-test rejected the hypothesis of neutral evolution for both genes (pol, P = 0; env, P = 0). A similar test of DrFV-2 and DrFV-3 pol and env coding regions demonstrated similar results (pol, dN/dS = 0.136; env, dN/dS = 0.123) and I was likewise able to reject a hypothesis of neutral evolution (P = 0). Compared to DrFV and PpERV, the AcERV quasi-consensus is still more complex. Rather than two putative accessory genes downstream of env, this virus encodes three, all of similar size. This third ORF is not unique; among FVs, PFV encodes three ORFs downstream of

17 env (51)(Fig. 2). AcERV also appears to encode two accessory genes between gag and env. From 5’ to 3’, the first is of similar size to the corresponding DrFV gene, with which it shares weak homology. The accessory genes of neither the AcERV quasi-consensus nor AcERV raw proviruses share any detectable homology with known retroviral sequences. Based on homology with either DrFV or AcERV, I annotated probable coding sequences in seven more ERVs with intact proviral genomes, including the two divergent fathead minnow FV-like inserts. For Amazon molly (PrERV) and bicolor damselfish ERVs, numerous nonsense or frameshift mutations in apparent coding sequences necessitated construction of a quasi- consensus like for AcERV. I also constructed a quasi-consensus for the yellow croaker ERV, as it contains secondarily inserted mobile elements. Raw sequences were annotated for platyfish, guppy (PfERV), and fathead minnow ERV-2. Among these fish ERVs, increased complexity is a common characteristic (Fig. 3). Proviral genetic organization among these ERVs can be sorted into two types. Comprising the first type, zebrafish and fathead minnow, both Cypriniformes, contain ERVs of similar length and organization. At 11,967 and 11,995 nt, DrFV-2 and DrFV-3 are large but still within the spectrum of RV genome size. PpERV-2 is of similar size at 10,345 nt. It lacks LTR, but presumably they are about the same size as DrFV LTR, so the hypothetical provirus is ~12,000 nt in length. I estimate PpERV-1 is larger still at 14,122 after taking into account the truncated 5’ LTR. ERVs from each species possess a comparable set of accessory genes. The second type, those most like AcERV, all share a pattern of multiple homologous accessory genes which contribute to a very large RV genome size >17,000 nt. These ERVs genomes are larger than any previously described RV genome, and very likely more complex in terms of cell interactions. Because an exogenous counterpart to any of these viruses has not yet been described, the function of these numerous large accessory genes is unknown. A conserved domain search produced no matches with known retroviruses or otherwise. However, I was unable to detect TM motifs in accessory genes among any of these ERVs, possibly indicating that the products of these genes are not directly involved in virus budding and maturation. Gag genes are very large in these fish ERVs, ranging 3342 – 3411 nt for ERVs like AcERV. For comparison, FV gag ranges from 1413 nt for coelacanth endogenous FV (CoeEFV)(26) and 1545 nt for feline FV (6) to 1947 nt for PFV (52). Based on the observed pattern, these very large and accordingly complex RVs could have arisen at some stage within

18

Figure 3. Fish ERV genetic organization. Raw or quasi-consensus representation of the nine most complete fish ERVs. Bold lines represent ORFs. Gag genes are green, pol are blue, and env are red. Accessory genes of unknown function are black. Premature stop codons are marked (*), as are frameshift mutations (fs), and truncated features (t). ERVs are represented to scale (kb).

19 percomorph evolution. This relationship may be transient, as there are no examples of complete FV-like ERVs outside of Percomorphaceae and Cyprinodontiformes. In addition, it appears that PpERV-1 straddles the gap between FV-length ERVs and very long fish ERVs.

Fish ERV foamy-like features. Shared genetic architecture and functional motifs between these fish ERVs and mammalian FVs would be evidence for a FV common ancestor at the junction of these clades. Though not necessarily a phylogenetically conserved trait (53), nearly all class III ERVs use the 3’ end of a partially unwound tRNA to prime reverse transcription at an 18 nt primer binding site (PBS) at the 5’ end of RV genomic RNA, and just downstream of the U5 region of the subsequent proviral LTR (overview of reverse transcription in Coffin et al. (54)). Exceptions to this strategy include the recently discovered galago prosimian FV (5), with a PBS corresponding to asparagine tRNA, and HERV-S, which utilizes serine tRNA (55). The majority of fish ERVs in this study with intact genomes (Fig. 3) utilize tRNALys. PBS sequences correspond exactly, except for XmERV, SpERV, AcERV-1, and AcERV-4, which each contain a single mismatch for vertebrate tRNALys. The remainders do not maintain an FV-like PBS. Zebrafish, both DrFV-2 and DrFV-3, utilize a threonine tRNA PBS, while guppy and Amazon molly ERVs use asparagine tRNA. FVs have two transcription promoters, one shared by all RVs in the 5’LTR, and an internal transcription promoter (IP) located in env (56). The IP encourages production of Tas, which acts in trans to further increase the activity of the IP. After an equilibrium level of Tas has been translated, it begins to act on the 5’ LTR promoter (2). I was able to detect a strong IP (TATAAATA) toward the 3’ terminus of env in both DrFV-2 and DrFV-3 (Fig. 3), and near the midpoint of env for PpERV-1 and PpERV-2. Notably, each of these proviruses contain up to three more TATA-boxes, but none of them are located near the 3’ end of env. Because of these IP signatures are generally >1000 bp upstream of putative accessory genes, it is unclear whether they have a functional cis-regulatory effect. AcERV and like fish ERVs do not possess a detectable env IP. Curiously, most of them do include a strong TATA-box (TATATAAG) in the second ORF after pol. Again, whether this IP is functional is entirely unknown. All fish ERVs detected in this study share a GGGTG nt motif on the border between the polypurine tract (PPT) and 3’ LTR (Fig. 2) where these features exist. This is a “peculiar” feature also shared by all FVs (4), and this structure may be responsible for the unique process of FV integration (57). During Orthoretrovirinae integration, Int cleaves a terminal dinucleotide from

20 both extremities of the provirus precursor after reverse transcription. This process leaves sticky proviral ends in preparation for an Int-catalyzed breaking and reforming of phosphodiester bonds at a targeted site in the host genome (58). In FVs, only the right end of the proviral dsDNA is cleaved of its terminal dinucleotide (57). The reasons for are this unclear, but it renders explaining FV integration with the standard RV model problematic. The reigning hypothesis models FV dsDNA as a 2-LTR circle that must be cleaved at the palindromic site joining the LTR ends, rather than a blunt-ended linear molecule (59). The strict conservation of the GGGTG PPT:LTR motif points to its probable role in the aforementioned FV pre-integration process. This feature alone suggests a FV last common ancestor (LCA) of these fish RVs and FVs in mammals. A side effect of proviral DNA integration is a target site duplication (TSD). After engaging host genomic DNA (gDNA), RV Int cleaves both strands at positions four to six nt apart, then reforms phosophodiester between the ragged gDNA ends and proviral dsDNA stands. Host enzymes repair the resulting single-stranded gaps at either end of the conjoined molecule. This process leaves a 4 – 6 nt direct repeat abutting both ends of the newly integrated ERV (58). Because RV Int mediates this process, the precise number of duplicated nucleotides tends to be conserved among RV genera. Both FVs and Gammaretrovirus form a 4 nt TSD with little exception (60). Among fish ERVs in this study, with one exception (AcERV-3) all complete FV- like integrations also have detectable 4 nt TSDs. Fish foamy-like ERV env products generally do not share typical FV function. During FV assembly, TM domains at both Env termini penetrate the (ER), directed there by a signal peptide on the N-terminus. Furin-like cellular proteases then cleave Env into three products: a short leader peptide (LP), a surface unit (SU), and trans-membrane unit, requiring at minimum an RXXR recognition site (61) in two places (Fig. 2: PFV). The LP unit possesses a WXXW motif, conserved in all known FVs but coelacanth endogenous FV (CoeEFV) (62) that interacts with Gag to assemble viral particles. Orthretroviral Env is directed to the ER by the N-terminal LP signal peptide, but the LP is quickly cleaved by signal peptidase (SPase) and usually degrades quickly (2,63). The rest of Env is cleaved at a single furin-like site into only two products: SU and TM. I was able to detect a putative SU-TM cleavage site in DrFV-2, PpERV-1 and AcERV (Fig. 2), along with SpERV. A likely N-terminal signal peptide is found in all intact fish ERV Env except LcERV and

21 SpERV. Whether these sites are missing because of alternate Env processing or because of sequence degradation is again unknown. I favor the latter explanation because the lack of furin- like cleavage sites in XmERV, PfERV, PrERV, and LcERV is difficult to explain otherwise when this mechanic of Env glycoprotein processing is ubiquitous among infectious RVs. Tripartite Env processing is not a likely feature of fish foamy-like ERVs. One intact provirus however, does contain a second furin-like cleavage site along with an N-terminal SPase site: PpERV-1. This 718 aa Env protein features an SPase cleavage site at aa position (pos.) 48: VLG|LI, abutting a TM helix at pos. 30-52; a strong furin-like cleavage site at pos. 209: RKAR|SL, and pos. 420: RLKR|VG, and a C-terminal TM helix at pos. 647-669 (Fig. 2). Though lacking a Gag-interacting WXXW motif, PpERV-1 Env is hypothetically processed in a very FV-like fashion. Another Gag trait conserved among infectious FVs and certain EFVs (5), including PpERV-1 (Fig. 2), is the p3 cleavage site: a viral Pro cleavage site motif VX|XV (62) very near the C-terminus. In PpERV-1, this motif is VG|RV, -34 aa from the C-terminus. Though not cleaved in all virions (2), removal of this short terminal peptide is necessary for both replication and infectivity (64,65). Where Orthoretrovirinae Gag are typically cleaved completely into matrix, , and nucleocapsid domains by Pro (66), Spumaretrovirinae are not (67). Secondary Pro cleavage sites are found in Gag, but efficiency at these sites is low, and the large unprocessed Gag protein (minus the p3 cleavage) is not cleaved again until the virus docks with a host cell. The p3 cleavage site was not found in any other Fish ERVs, though potential secondary Pro cleavage sites were evident. Absent from complete fish ERVs was a cytoplasmic targeting retention signal (CTRS). It is thought that FVs make use of a CTRS to localize viral Gag proteins to a specific part of the host cell for capsid formation (68) where most Orthoretrovirinae possess a plasma membrane targeting signal in the MA domain of Gag (69). Where virus particle formation occurs for these fish ERVs is unclear. All exogenous FVs carry a CTRS, however it is absent in CoeEFV (62), begging the question of this feature’s evolutionary origin. The CTRS motif may have developed late in FV evolution among early mammals, its conserved motif could be less functionally constrained than believed, or it could have been obfuscated by accumulated mutations in CoeEFV.

22 Also conspicuously absent from Gag in FV-like Fish ERVs are Cys-His boxes. Orthoretroviral species encode at least one zinc finger domain toward the C-terminus of Gag. 2+ Bearing the motif CX2CX4HX4C (27), the cysteine-histidine residues grip a zinc ion (Zn ) that allows Gag to interact with a number of molecules, not the least important of which are packaged viral genomic RNA and host chromatin (70). Variously having a role in replication, integration, and viral budding, Cys-His boxes are indispensable to the retroviral life cycle and maintenance of infectivity (71,72). FV Gag alternatively contains glycine-arginine rich motifs that fulfill the much same function as Cys-His zinc fingers (67). Termed GR boxes, one of these domains is critical for viral packaging of Pol, which complexes with Gag to package viral RNA (73). Another GR box serves as a nuclear localization signal, chaperoning viral DNA toward host DNA toward potential integration sites (74). While these fish ERVs are generally enriched for glycine/arginine toward the C-terminus of Gag, there are no clearly definable GR boxes. Rather, the presence of at least one glycine-glutamine-arginine (GQR) domain is conserved; DrFV-2: RRGRQQR at -101 aa from C-terminus; PpERV-1: GDRQRDQ -88 aa from C-terminus; AcERV: GQRTQQ -90 aa and QGQQR at -73 aa from C-terminus. Given the absence of a Cys- His boxes, and the well known DNA/RNA binding affinity of glutamine (75,76), I hypothesize that FV-like Fish RVs utilize a GQR box, fulfilling the function of conventional GR boxes. This notion is reminiscent of FFV, which carries the aspartic protease catalytic motif DTQA (6), a unique variant of the typical D(T/S)GA active site motifs thought to be immutable for a properly functioning RV Pro domain.

Estimated Age of ERV Integration

Conventionally, three methods are used to estimate the approximate age of ERV insertion into a host genome. First, the pairwise divergence between LTR of an intact ERV can be coupled with an estimated per generation host mutation rate (77). Because the LTR of a provirus are identical upon integration, due to the mechanics of reverse transcription (54), any changes must have occurred since. Age estimates are calculated with the formula � = �/2�, where t is time to most recent common ancestor (TMRCA), d is divergence, and µ is the host mutation rate. Estimates from this method are predicated on the notion that integrated LTR are under neutral selection pressure and indeed evolve at the corresponding rate of the host genome. Recombination between ERVs is a confounding factor for these estimates, often leading to

23 significant overestimates of time since integration when chimeric proviruses are formed. Gene conversion on the other hand has a homogenizing effect, leading to artificially recent estimates of integration (78). Recombination was apparent in a subset of fish ERVs. For example, DrFV-3 shares 96% nt identity with DrFV-2 from gag to 3’ LTR, but only 85% identity between 5’ LTR. DrFV-3 is clearly recombinant, as it shares a 5’ flanking sequence with a fourth fragmented copy. The most parsimonious explanation for this observation is replacement of LTR by ectopic recombination between similar elements at different loci. The second method estimates the coalescent age of paralogous fragments that arose through segmental duplication, identified by a common sequence flanking an LTR region. This method is useful when ERVs are highly fragmented and LTR-complete proviruses are lacking. The obvious limitation of this method is that the direct age of an ERV integration cannot be estimated, only the time since gene duplication. It is also possible to establish a cross-sectional estimate of ERV age by comparing the divergence of a homologous LTR in multiple individuals (79). The nature of this study precluded this third method. Six fish ERVs proved amenable to age estimation (Table 2). Midas cichlid mutation rate, measured empirically with breeding experiments, has been calculated at 6.6e-8/site/generation (80), with a generation time of approximately one year (81). A mutation rate has been estimated for Amazon mollies in a similar fashion at 4.89e-8/sites/year (82). I took the Amazon molly rate for the guppy rate, as they are closely related. I took the Amazon molly rate for yellow croaker and bicolor damselfish too, as no mutation rate estimates exist for these species, and the Amazon molly rate is slower and more conserved than for Midas cichlid. For zebrafish, the rate of synonymous substitution has been estimated at 4.3e-9/site/year based on comparisons of conserved non-coding regions (83). Assuming the LTR of ERV and immediate flanking region are evolving neutrally, I want to include non-synonymous substitutions in the rate estimate. Therefore I accelerated this rate estimate by an order of magnitude, to 4.3e-8/site/year, bringing it in line with Midas cichlid and Amazon molly mutation rates. DrFV-2 is the highest quality ERV discovered in this study. Its LTR sequences share 100% identity and there are no flanking sequences that would allow an age estimate to be

24 Table 2. Estimates of ERV integration age. Divergence calculated with K80 nucleotide substitution model (84). Mutation rates drawn from previous studies (80,82,83). DrFV divergence was calculated from multiple sequence alignment using BEAST (85). Where more than one estimation was possible, these are presented as upper and lower estimates of minimum age. Abbreviations: pos, nucleotide positions; TMRCA, time to most recent common ancestor; y, years.

Method Divergence: d Mutation rate: µ Estimated age: t ERV Flank TMRCA (4661 pos) 0.017 4.3E-8 284,000 y DrFV

LTR divergence (1502 pos) 0.002 15,151 y AcERV-1 LTR divergence (1472 pos) 0.003 22,727 y AcERV-2 LTR divergence (988 pos) 0.011 6.6E-8 83,333 y AcERV-3 LTR divergence (1493 pos) 0.011 83,333 y AcERV-4 LTR divergence (665 pos) 0.003 4.89E-8 30,674 y PfERV-1 LTR divergence (1707 pos) 0.015 153,374 y PfERV-2

LTR divergence (1690 pos) 0.005 4.89E-8 PrERV-1 Flank TMRCA (2995 pos) 0.016 51,124 y PrERV-2 163,599 y

LTR divergence (1368 pos) 0.007 4.89E-8 SpERV 71,574 y

Flank TMRCA (2313 pos) 0.003 4.89E-8 LcERV LTR divergence (1561 pos) 0.010 30,674 y 102,249 y

25 formed. Accordingly, I was unable use it to estimate time lapsed since integration. Barring LTR homogenization mediated by gene conversion, this virus infected zebrafish very recently. DrFV- 3 has clearly undergone ectopic recombination with a related sequence, so LTR divergence is not likely to give an accurate estimate. I was able to detect six proviral fragments, several of them solo LTR, that share a long 5’ flanking sequence with DrFV-3. I used these to estimate the minimum time since TMRCA. That DrFV-3 integrated ~284,000 years ago (TMRCA: 2.84E5, 95% HPD: 2.47e5 – 3.24e5), while DrFV-2 likely integrated much more recently, is an interesting observation. Similarly, PpERV-1 and PpERV-2 are highly divergent and either represent superinfection by distantly related foamy-like ERVs, or asynchronous infection and integration. Only PpERV-1 is nearly LTR-complete, for which estimating the insertion is impossible due to identical LTR and no shared flanks. The four complete AcERV integrations retain LTR, so I was able to date them with LTR- divergence. With this method I estimated their ages to range from 15,000 to 83,000 years since integration. None of them appear to be recombinant. This disparity between integration ages has two possible explanations; either LTR at different loci evolve at different rates, or four integration events happened over a span of 60,000 years. I favor the latter explanation, if only because FV superinfection is virally regulated and probably rare (86). Further caveats are required to explain these estimates, as many of them are actually likely underestimated. PfERV is represented in the Amazon molly genome by two sequences; PfERV-1 (Fig. 3) has short LTR, probably having been truncated on either end. PfERV-2 is the product of homologous recombination, as its 3’ coding sequences are also found upstream of the 5’ LTR. Between the two age estimates, I favor the upper minimum of ~150,000 years calculated with flank divergence. The true insertion is likely older, as only ectopic recombination between ERVs is likely to increase the divergence between LTR (78). With Amazon mollies, PrEFV-1 has an odd inverted 5’ LTR. Low-level divergence from the 3’ LTR might be an indication that this happened through recombination, but this is unclear. Regardless, PrEFV-2 shares a 3’ flank elsewhere in the genome. Again I favor the upper minimum age: ~160,000 years since the segmental duplication that produced these shared flanks. The similarity between these Poecilia ERV estimates is probably coincidental, as there is no evidence that they are orthologous and these species likely diverged well before these minimum integration dates (38).

26 SpERV is also recombinant, having 3’ coding region upstream of its 5’ LTR, like PfEFV- 1. Despite this, I estimated its minimum age through LTR-divergence. The estimate of ~71,000 years is probably less than the time since integration. A single LTR-intact ERV in the yellow croaker genome represents LcERV. It also shares a 5’ LTR and upstream flank with a fragment elsewhere in the genome. The lower age estimate is probably accurate, but only reflects years passed since the segmental duplication that created the fragment of LcERV. The upper age is closer to the true age of insertion and is a more accurate minimum age estimate for LcERV. Of course the RVs that left all of these endogenous relics are much older than implied by these estimates. LTR-divergence and flank-divergence can only estimate how long ago the fish ERV integrated itself into the host genome (77).

Virus-Host Cophylogenetic History

FV host switching is thought to be rare among mammals. Reasons why are unclear, but this claim is based on the observation that host-virus coevolution appears generally stable because of congruent host-virus phylogenies (5), incomplete sampling notwithstanding. While this observation is valid, it is predicated on the alignment of intact ERVs with a strong phylogenetic signal among a wider set of homologous aa residues than possible in this study (5). To determine whether foamy-like Fish ERVs display a mammal FV-like tendency to codiverge that is obfuscated by lack of data, I selected nine fish EFVs with the best viral genomes (Table 1), including the eight I annotated. Mummichog ERV (FhERV) was also included in this set because it retains putative Env TM signatures. Except for FhERV, these ERVs have detectable LTR and nearly intact pol ORFs. I then aligned the complete Pol protein sequences from these ERVs with Pol from 13 exogenous and endogenous FVs and constructed a Bayesian phylogeny, which I compared to a host tree constructed from phylogenies inferred in previous studies (Fig. 4) (33,38,87,88). This phylogeny recovered the cross species incongruency between the SFVs of spider monkeys and squirrel monkeys. I was unable to resolve the node at the base of the mammal FV subtree, where previous studies indicate sloth endogenous foamy virus (SloEFV) should have branched earlier (5,89). Based on this Pol phylogeny, cross-species transmission, and possibly interorder transmission, has occurred during the infectious history of these fish viruses.

27

Figure 4. Foamy clade host-virus incongruence. Left tree is a Bayesian phylogeny of 1032 aligned Pol protein positions belonging to well-described exogenous and endogenous foamy viruses, and foamy-like Pol genes from this study – except XmEFV, which was described in (28). Scale: substitutions/site. Right tree is a manually constructed phylogeny of host divergence times inferred in previous studies: mammal divergence times from (87), Cyprinodontoidei and Poecilidae divergence from (90), the base of Poecilia from (38), and all other fish nodes from (33). Scale: Ma. Lines link virus to host, red crossed lines represent phylogenetically incongruent relationships. Virus tree nodes with posterior probability <1.0 are labeled.

28 Considering the paucity of host phylogenetic clustering, and likewise the paucity of host geographic clustering, within the fish EFV phylogenetic subtree (Fig. 1), it is likely that this phenomenon has been a common occurrence in the evolutionary history of fish EFVs. Cross- species transmission and establishment is probably much more frequent than in mammalian FVs. Notably, while the virus tree is midpoint rooted, the Sarcopterygian node is shallower than should be expected given the depth of the corresponding node on the host tree, and given the hypothesis of an ancient marine origin of FVs. I was unable to discern whether this is due to ancient host switching between lobe-finned and ray-finned fishes, or because I reached the outer limit of phylogenetic inference possible with the functionally constrained pol gene (91). Unfortunately, I was unable to discern positional homology between other sarcopterygian and actinopterygian FV-like genes – including putative gag and env genes.

De Novo Sequencing of FV-like Sequences in Diverse Teleosts

I aligned the fish ERV Pol protein sequences recovered in this study with CoeEFV and several mammalian FVs and searched for aa motifs specific to these viruses. I designed a pair of highly degenerate PCR primers to amplify a 450-500 nt region of FV RT, spanning from the conserved fish RT DNA-binding QY(P/K) motif to the YIDD(I/V) active site motif. With these primers I screened 91 preserved ray-finned fish tissues gifted from various institutions. Sampled taxa span the Actinopterygian tree of life, as I attempted to sample from every described order of fish and from multiple species within the especially species-rich clades such as ostariophysians and percomorphs. I detected FV-like endogenous pol in nine of these fish (Table 3). These species are more phylogenetically disparate than the host species identified during in silico parsing of genomes. Four of these species belong to the Ovalentaria crown group within the percomorph “bush”: houndfish (: Tylosus crocodilus), Bennet’s flying fish (Beloniformes: Cheilopogon pinnatibarbatus), and doubleline clingfish (Blenniformes (33): Lepadichtys lineatus) belong to orders not with no GB representation, while freshwater angelfish (Cichliformes: Pterophyllum scalare) is a South American cichlid related to the Midas cichlid. The deepwater serrano (Serranus aequidens) is a Perciformes like the yellow croaker. The flying gurnard (Dactylopterus volitans) is a Sygnathiformes, more closely

29 Table 3. Hosts with EFV-like fragments from de novo sequencing. Listed are the best hits for a representative cloned sequence successfully amplified from genomic DNA for each species. Orders after Betancur-r et al. (33). Common names from FishBase (35). Tissue source abbreviations: KU, University of Kansas; MSB, Museum of Southwest Biology; SIO, Scripps Institute of Oceanography. Blastx hit abbreviations: SFVmac, simian foamy virus – macaque; SFVspm, simian foamy virus – spider monkey.

Order BLASTx hit Common name Source catalog no. % ID E-value Species (acc)

Anguilliformes SFVspm 27 9E-15 Gymnothorax miliaris Goldentail moray KU 145 (ABV59399)

Cypriniformes SFVmac 47 1E-19 Hybognathus placitus Plains minnow MSB 57953 (AFA44809)

Gadiformes EFV 35 2E-21 Albatrossia pectoralis Giant grenadier KU 2298 (NP_054716)

Sygnathiformes SFVspm 35 3E-25 Dactylopterus volitans Flying gurnard KU 237 (ABV59399)

Cichliformes SFVspm Freshwater 39 7E-27 Pterophyllum scalare KU 2846 (ADE05995) angelfish

Blenniformes SFVspm Doubleline 38 3E-29 lineatus KU 7020 (ABV59399) clingfish

Beloniformes SFVspm Bennet’s 36 4E-21 Cheilopogon pinnabarbatus KU 2780 (ABV59399) flying fish

SFVspm Tylosurus crocodilus Houndfish KU 5842 (ADE05995) 40 1E-28

Perciformes Serranus aequidens SFVspm 38 2E-27 Deepwater serrano SIO 08-90 (ABV59399)

30 related to , such as tuna and mackerel, than to the other percomorph lineages identified in this study. ERVs in this species suggest that foamy-like viruses might be widespread across Percomorphaceae, a massive grouping of fish representing approximately half of all ray-finned fish (92). The plains minnow (Hybognathus placitus) belongs to Cypriniformes, in the same family as zebrafish and fathead minnow. Giant grenadier (Albatrossia pectoralis) belongs to Gadiformes along with cod. Last, the goldentail moray (Gymnothorax miliaris) is an Anguilliformes of the basal-teleost lineage. These species share a more distant LCA than the host species identified by genomic screens. With the detection of EFV-like sequences in goldentail moray, the host range of these viruses includes all teleosts. I sequenced 16 clones from each species. With the exception of minor variation attributable to sequencing and cloning error, sequences from most species were monoclonal. G. miliaris likely possesses at least two divergent copies of an FV-like viral integration, and P. scalare likely possesses numerous variable copies of the same integration. Interestingly, this variability was similar to the variability of pol copies I detected in the genome of Midas cichlid, which is another South American cichlid. I was unable to amplify FV-like sequences in an African cichlid I tested, golden mbuna (Melanochromis auratus). Likewise, four African cichlid genomes in WGS databases also lacked detectable foamy-like ERVs. A phylogeny that includes representatives of these sequences and a smaller set of distantly related RVs show that these ERVs also belong to an FV sister clade (Fig. 5). Both goldentail moray ERVs are found at the root of this tree, in the same position of the host species among teleosts in the fish tree of life (33). The fish ERV clade at large is again only loosely grouped into recognizable fish clades, indicating a history of frequent cross-species transmission. This is not surprising if it is presumed that virus transmission is more likely in an aquatic environment than a terrestrial environment. Very little is actually known about how animal viruses spread in aquatic ecosystems (93), but there have been documented cases of a virus rapidly spreading among multiple geographically distant species in a short amount of time among fish hosts (94).

31

Figure 5. Phylogeny of 27 foamy-like fish ERVs and distant RVs. Tree is a Bayesian phylogeny of 702 aa positions, including gap positions introduced with short fragments sequenced in this study. RV clades are marked with colored bars to right. SnRV, genus unassigned, is marked with a gray bar. Asterisks (*) mark ERVs discovered and sequenced by PCR screen in this study. Bullets (•) mark taxa from WGS databases also PCR amplified and sequenced as positive control. Tree is midpoint rooted for visual clarity. Nodes with posterior probability <1.0 are marked.

32 MATERIALS AND METHODS

Genome screening. To identify novel fish EFVs, I queried all sequences from 52 Actinopterygian species available in GB WGS databases using the tblastn algorithm and a set of well-described exogenous and endogenous FV Pol amino acid sequences (prototype foamy virus, CAA68999; equine foamy virus, AF201902; , CAA70075; coelacanth EFV (26); sloth endogenous foamy virus (24).) I then recursively screened each species, querying top hits with blastx against the GB non-redundant protein database. Contigs that matched best with FVs, and with little ambiguity, were downloaded and screened for ORFs with UGENE (95) and by aligning to exogenous mammalian FVs. Making conservative frame shift corrections where necessary, I extracted likely Pol protein translations for phylogenetic evaluation. Pol coding DNA sequences (cds) ~3000 nt in length with N-terminal Pro D(T/S)GA, downstream RT

YXDD(I/V), and Int DDX35E motifs and were favored. Conserved motifs were detected with CD-search (34). WGS sequence contigs that grouped robustly with exogenous and endogenous FVs in a phylogenetic context were investigated further. I screened each species WGS database with the corresponding previously-identified contig as a blastn query to identify all contigs with >2000 kb of pol coding sequence at >70% nt identity, lower than the common arbitrary cut-off for ERV families (37). I recursively used all-against-all BLAST searches within these groups of contigs to identify inserts with intact LTR and paralogs. Finally, I screened each genome with intraspecific translated Pol using tBLASTn to search for divergent inserts, then screened each genome with interspecific Pol sequences to identify potential orthologs. CD-HIT (42) and bootstrap neighbor- joining tree searches in ClustalX (96) were used to redundantly search for orthologous ERVs between closely related species.

De novo fish ERV sequencing. Based on a Pol alignment of fish ERVs from WGS contigs and mammalian FVs, I was able to identify amino acid motifs generally conserved between FVs but not among other RV genera. Blocks of aligned positions were identified using a custom Perl script, with which I parsed a ClustalX q-score file (97) with an 18-nt sliding window. I identified the FV-like fish RT DNA-binding motif QYP(L/I)N and the RT active site motif YIDD(I/V)F as suitable primer binding sites ~450 nt apart and therefore sensitive to ERV locus fragmentation. I designed a highly degenerate pair of PCR primers: FishPol472F 5’-CARTAYCSIHTNAA and

33 FishPol919R 5’-GWRIANRTCRTCDATRTA, where I is inosine. Numbers in the primer name refer to the position the annealing site in DrFV-2 Pol. Ethanol-suspended tissues were gifted from the University of Kansas Ichthyology Collection and the Scripps Institute of Oceanography Marine Vertebrate Collection. Frozen whole fish were gifted from the University of New Mexico Museum of Southwestern Biology. Several freshly expired fish were also obtained from a local pet store. A single O. mykiss specimen was gifted from Dawn’s Rainbow Trout Farm of Sedona, Arizona. Dr. Kevin Fitzsimmons and the University of Arizona Environmental Research Lab provided a single O. niloticus specimen. The Xiphophorus Genetic Stock Center at Southwest Texas State University provided an ethanol-suspended X. maculatus specimen. See Appendix 2 for the full list of specimen samples and sources. I extracted genomic DNA from fish muscle tissue using a DNeasy & Tissue kit (Qiagen) and quantified DNA concentration with a Qubit 2.0 fluorometer (Invitrogen). I assayed DNA quality by amplifying a ~1500 nt fragment of RAG1 with PCR primers from previous studies (98,99), modifying them when necessary. PCR amplification of RV sequences was performed on 5-20 ng template DNA in a 25 µl reaction mixture containing ThermoPol Buffer at 1X concentration (New England BioLabs), 2.0 mM MgSO4, 150 ng each of forward and reverse primer, 0.2 mM dNTPs, and 1 unit Taq DNA Polymerase (New England BioLabs). Reactions proceeded in a stepdown thermal cycler protocol with the following steps: 95°C – 3 m, (95°C – 30 s, 64°C – 20 s, 68°C – 45 s) x 4, (95°C – 30 s, 60°C – 20 s, 68°C – 45 s) x 4, (95°C – 30 s, 56°C – 20 s, 68°C – 45 s) x 4, (95°C – 30 s, 52°C – 20 s, 68°C – 45 s) x 30, 68°C – 5 m. PCR products were visualized on a 2% agarose gel. PCR bands between 400 and 500 nt were loaded into a specially prepared PCR-clean 2% agarose gel, then excised and purified with a Zymoclean Gel DNA Recovery Kit (Zymo Research). Recovered DNA was cloned with a pGEM-T Easy Vector System (Promega) and JM109 competent cells (Promega). I harvested colonies; amplified vector inserts with m13 PCR primers, and then sequenced them at the University of Arizona Genetics Core. Before screening fish genomic DNA for foamy-like ERVs, several species served as positive controls for my PCR primers: D. rerio, F. heteroclitus, O. mykiss, P. reticulata, and X. maculatus. Amplification of ERV sequences was successful in each of these species.

34 Phylogenetic analysis. For the large phylogeny of representative retroviruses (Fig. 1), I retrieved Pol cds from GenBank and aligned them with Fish ERVs using ClustalX (96). Ambiguously aligned positions were removed with trimAl using the gappyout option (100), producing an alignment of 693 positions. I used MrBayes v3.2.5 for phylogenetic reconstruction, selecting a mixed-model amino acid model prior (101). My MCMC chains ran for five million generations, sampling trees every 100 generations, and discarding the first 25% as burn-in. I checked for convergence and adequate estimated sample size of parameters using Tracer (102). For the virus half of the codivergence tanglegram (Fig. 4), I aligned the Pol sequences of 13 Sarcopterygian FVs with 10 fish FV-like Pol sequences, including the two divergent fathead minnow ERVs, using ClustalX (97). For phylogenetic inference, I used ProtTest 3 (103,104) to select RtRev+I+G+F as the most likely substitution model for this alignment (∆AIC ≈ 12 compared to next best model). I fixed these parameters and ran MrBayes v3.2.5 (101) for three million generations, sampling every 100 generations, and discarding the first 25% as burn-in. For the host half, I manually constructed a tree using TreeGraph 2 (105). Mammal branch lengths and bifurcations were taken from Bininda-Emonds et al. (87). For the earliest node in Cyprinodontoidei (F. heteroclitus(X. maculatus(P. formosa, P. reticulata))) I used the divergence of Xenodexia ctenolepsis as a proxy, and the North-South vicariance of Central American Poecilid species as a proxy for the next node in that subtree (90). Because P. formosa is a unisexual hybrid daughter species of P. mexicanus (88), I used the LCA of that species and P. reticulata at the base of the Poecilia (38) as the P. formosa-P. reticulata split. All other fish branch lengths, including coelacanth (Latimeria chalumnae) are from Betancur-r et al. (33). The virus-host cophylogeny was created with TreeMap 3 (106,107) and illustrated manually. The smaller RV Pol phylogeny (Fig. 5) was created with a reduced set of representative RVs, aligned with ClustalX (96). I aligned bacterial clone sequences with fish ERVs from GB databases to determine the most likely translation reading frame. I included the sequenced clone with the highest blast score for each species, except G. miliaris for which I included two. Ambiguously aligned and gap-ridden positions were eliminated with trimAl (100), producing an alignment with 702 aa positions, including gap positions in short ERV fragments sequenced in this study. Phylogenetic reconstruction was done in MrBayes v3.2.5 (101), setting a mixed model amino acid prior, running for three million generations, sampling every 100, and discarding the first 25% as burn-in.

35

Quasi-consensus construction and annotation. Where possible, ERVs were annotated with little modification. ERVs with secondary integrations or large deletions were built into a quasi- consensus. Each quasi-consensus, is an intact cds (typically Pol) used to anchor an alignment of flanking proviral genomic sequences. For the example discussed in-text, A. citrinellus ERV, the most intact provirus-bearing contig was AcERV-1 (acc: CCOE01001074). Despite having a complete putative Pol ORF, both flanking regions contained premature stop codons and two small poly-N gap assembly artifacts. Using the region from the PBS to the Pol start codon, and the Pol stop codon to the 3’ PPT, as blastn queries, I aligned ERVs and fragments with ≥80% query coverage using the ClustalX algorithm (96) in UGENE (95). I then took the strict majority consensus (including gaps) of these alignments create AcERV quasi-consensus. I repeated this procedure as necessary for ERVs in other hosts. ORF identities were called based on conserved functional motifs for pol, position in genome for gag, and presence of TM-helix motifs predicted by TMHMM Server v2.0 (108). ERVs were scanned for conserved FV-like features with a custom Perl script or the regular expression search feature in UGENE (95).

ERV age estimation. For ERV integrations dated with LTR-divergence, I aligned LTR with the ClustalW (96) algorithm and calculated the Kimura (K80) corrected divergence using MEGA 5.2.2 (109). I then applied the formula � = �/2�, where t is time, d is divergence, and µ is the host mutation rate. Host mutation rates were derived from several sources (81–83). The zebrafish mutation rate I increased by a factor of 10, as the synonymous site substitution rate 4.39e-9 is theoretically too slow. For fish species without direct estimates of neutral substitution rates in nuclear DNA, I used the P. formosa rate: 4.89e-8 (82). For flank-divergence estimates of ERV age, I searched for LTR-flank paralogs within WGS databases for each fish ERV with an intact LTR. For ERVs sharing a flank with one other sequence, I performed steps identical to LTR-divergence calculations. For DrFV-3, I detected six contigs sharing a 5’ flank. After alignment, I selected Hasegawa-Kishino-Yano (HKY) as the best-fitting substitution model using jmodelTest 2 (104,110). I then conducted a two million generation Bayesian MCMC analysis of TMRCA with BEAST v1.8.2 (85), specifying 4.3e- 8/site/year as the rate prior, a strict clock, and Yule model of speciation. I analyzed outputs using Tracer to ensure it reached convergence (102).

36 CONCLUSIONS

The number of ERVs discovered in this study makes for an interesting observation. The discovery of these endogenous foamy-like RVs in fish hosts means that there are now more described teleost EFVs than there are mammal EFVs (5,9,24,111). Endogenous foamy-like RVs in fish are not a novel discovery for paleovirology (27,28), and though undiscovered fish ERV diversity was predicted, it was not expected that 34% of species with WGS data (18 of 52 spp.) would contain evidence of past foamy-like RV infection. Compare this fraction to the 5% of sequenced mammal genomes known to contain EFVs (4 of 113 spp.). The fraction of mammal genomes with EFVs is also smaller than fraction of EFV-like sequences discovered through de novo PCR screens in this study: 9% (9 of 91 spp.). It is readily apparent that teleost fish have been frequent hosts to foamy-like RV infection. There is a strong indication that these viruses are still circulating in some infectious form, though there is no real expectation that exogenous descendants of ERVs remain among hosts. In contrast to other early-diverging ERV lineages recently discovered (112–115), the six fish ERV integrations that could be dated in this study are all very young, even when compared to the relatively recent integration of fish endogenous in zebrafish and other species (116). The oldest, zebrafish DrFV-3, is estimated to have integrated ~284,000 years ago (Table 2). DrFV-2 in the same species is the most recent integration and has identical LTR sequences, indicating host infection over a considerable period of time. The same may be true for fathead minnow ERVs, of which PpERV-1 probably integrated very recently and PpERV-2 is highly divergent. One explanation for this is contemporaneous integration of two EFV types into the fathead minnow genome; the other is a long but not necessarily continuous period of infection. Midas cichlid ERV is very young, is present in many copies with estimated integrations ~15,000- 83,000 years ago, and closely related to other South American freshwater fish foamy-like ERVs (Fig. 5). That distantly related ERVs are so young and probably accompanied by exogenous counterparts for so long points to the likelihood of present day persistence. Fish foamy-like ERVs possess the largest genomes of any known retrovirus to date. Most of this size increase is caused by a much larger than normal gag gene, and the acquisition of several accessory genes. The functions of these genes are unknown but it can be assumed that replication pathways for these viruses are more complex than mammal FVs. Aside from the massive increase in genomic coding potential, these fish ERVs are foamy-like in their

37 organization. Long LTR, independently transcribed gag and pol, four nt TSD, and the presence of internal transcription promoters are all shared by FVs, although not exclusively (2,66). I was able to detect discrete features unique to FV biology among known RVs. The tripartite Env processing regime found in PpERV-1 is foamy-like, as is the putative p3 cleavage site next to the C-terminus of Gag also in PpERV-1. These features have presumably been obscured by post-endogenous neutral substitutions in other annotated fish ERV genomes. Most fish ERVs examined in this study use tRNALys to prime reverse transcription, and those that do not still mostly use known FV alternatives. All fish ERV examined in this study possess the GGGTG PPT:LTR junction motif that characterizes FV asymmetrical provirus processing (4). Likewise is the absence of Cys-His zinc finger Gag motifs unique to FVs (67) and similarly absent in foamy-like fish ERVs with an intact gag ORF. I was unable to detect the CTRS signature (68) used by FVs to localize virus assembly in the cytoplasm. Nor was I able to locate a WXXW Gag-Env interaction motif used by FVs to direct virus budding to the ER (62). Conservation of the latter would necessitate conservation of the Gag substrate it interacts with. Gag is the fastest evolving core gene in known exogenous FVs, followed by Env (30), therefore it is difficult to treat absence of evidence as evidence of absence for this supposedly requisite feature FV replication. A CTRS functional domain is found in all exogenous FVs, but not exclusively. RV cytoplasmic targeting and encapsidation was first noted in Betaretrovirus mouse mammary tumor virus and Mason-Pfizer monkey virus (117). Its active site consists of a single arginine, an abundant residue in fish EFVs. Though necessary for FV-like replication, the sequence conservation among even closely related mammal FVs is suspect (62). Despite my inability to characterize certain apparently indispensible FV-like domains among these fish ERVs, the specific motifs and features that were detected, when taken together, point to a basic replication strategy conserved between these fish ERVs and nearest phylogenetic relations – CoeEFV and mammal FVs. Evidence for an ancient origin of these viruses is more elusive. With no clear pattern of host-virus codivergence and no clear evidence of geographic clustering, it is difficult to generate hypotheses that better explain the proposed ancient marine origin proposed for foamy viruses (26). Accepting that fish foamy-like RVs are widely distributed across teleost hosts, if the LCA of these fish bore the LCA of these viruses, then I can surmise their emergence in the Permian

38 Period 283 Ma at minimum (33). This estimate is rather easy to accept given the deep divergence between the mammal-coelacanth FV clade and the fish ERV clade (Fig. 1). This widespread diversity of foamy-like ERVs in fish hosts is consistent with a hypothetical ancient marine origin of FVs (26). The infectious progenitors of these ERVs were clearly capable of invading phylogenetically disparate hosts. They form a deeply diverging monophyletic sister clade with known FVs, and are more closely related to such than known class III FV-like ERVs in mammals (Fig. 1). Further sampling of fish taxa, and the eventual discovery of an extant strain, will elucidate the question of FV origins. These results however, are nonetheless concordant with the idea that FVs arose in the ocean hundreds of millions of years ago.

39 REFERENCES

1. Linial ML. Foamy viruses are unconventional retroviruses. J Virol. 1999;73(3):1747–55. 2. Rethwilm A. Molecular biology of foamy viruses. Med Microbiol Immunol. 2010;199(3):197–207. 3. Meiering CD, Linial ML. Historical perspective of foamy virus epidemiology and infection. Clin Microbiol Rev. 2001;14(1):165–76. 4. Delelis O, Lehmann-Che J, Saïb A. Foamy viruses - A world apart. Curr Opin Microbiol. 2004;7(4):400–6. 5. Katzourakis A, Aiewsakun P, Jia H, Wolfe ND, LeBreton M, Yoder AD, et al. Discovery of prosimian and afrotherian foamy viruses and potential cross species transmissions amidst stable and ancient mammalian co-evolution. Retrovirology. 2014;11(1):61. 6. Winkler I, Bodem J, Haas L, Zemba M, Delius H, Flower R, et al. Characterization of the genome of feline foamy virus and its proteins shows distinct features different from those of primate spumaviruses. J Virol. 1997;71(9):6727–41. 7. Tobaly-Tapiero J, Bittoun P, Neves M, Guillemin MC, Lecellier CH, Puvion-Dutilleul F, et al. Isolation and characterization of an equine foamy virus. J Virol. 2000;74(9):4064–73. 8. Renshaw RW, Casey JW. Transcriptional mapping of the 3’ end of the bovine syncytial virus genome. J Virol. 1994;68(2):1021–8. 9. Wu Z, Ren X, Yang L, Hu Y, Yang J, He G, et al. Virome analysis for identification of novel mammalian viruses in bat species from Chinese provinces. J Virol. 2012;86(20):10999–1012. 10. Kupiec JJ, Kay A, Hayat M, Ravier R, Peries J, Galibert F. Sequence analysis of the simian foamy virus type 1 genome. Gene. 1991;101:185–94. 11. Renne R, Friedl E, Schweizer M, Fleps U, Turek R, Neumann-Haefelin D. Genomic organization and expression of simian foamy virus type 3 (SFV-3). Virology. 1992 Feb;186(2):597–608. 12. Herchenröder O, Renne R, Loncar D, Cobb EK, Murthy KK, Schneider J, et al. Isolation, cloning, and sequencing of simian foamy viruses from chimpanzees (SFVcpz): high homology to (HFV). Virology. 1994 Jun;201(2):187–99. 13. Thümer L, Rethwilm A, Holmes EC, Bodem J. The complete nucleotide sequence of a New World simian foamy virus. Virology. 2007;369(1):191–7. 14. Bieniasz PD, Rethwilm A, Pitman R, Daniel MD, Chrystie I, McClure MO. A comparative study of higher primate foamy viruses, including a new virus from a . Virology. 1995 Feb 20;207(1):217–28. 15. Pacheco B, Finzi A, McGee-Estrada K, Sodroski J. Species-specific inhibition of foamy viruses from south american monkeys by new world monkey TRIM5 proteins. J Virol. 2010;84(8):4095–9. 16. Cong M, Kuiken C, Bhullar V, Beer BE, Vallet D, Gautier-hion A, et al. Ancient co-speciation of simian foamy viruses and primates. Nature. 2005;434(March):376–80. 17. Switzer WM, Bhullar V, Shanmugam V, Cong M, Parekh B, Lerche NW, et al. Frequent simian foamy virus infection in persons occupationally exposed to nonhuman primates. J Virol. 2004;78(6):2780–9. 18. Betsem E, Rua R, Tortevoye P, Froment A, Gessain A. Frequent and recent human acquisition of simian foamy viruses through ’ bites in central . PLoS Pathog. 2011 Oct;7(10):e1002306.

40 19. Coffin JM, Hughes SH, Varmus HE, editors. The Place of Retroviruses in Biology. Retroviruses. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1997. 20. Weiss RA. The discovery of endogenous retroviruses. Retrovirology. 2006 Jan;3(67). 21. Patel MR, Emerman M, Malik HS. Paleovirology - ghosts and gifts of viruses past. Curr Opin Virol. 2011 Oct;1(4):304–9. 22. Bénit L, Lallemand JB, Casella JF, Philippe H, Heidmann T. ERV-L elements: a family of endogenous retrovirus-like elements active throughout the evolution of mammals. J Virol. 1999;73(4):3301–8. 23. Greenwood AD, Stengel A, Erfle V, Seifarth W, Leib-Mösch C. The distribution of pol containing human endogenous retroviruses in non-human primates. Virology. 2005;334(2):203–13. 24. Katzourakis A, Gifford RJ, Tristem M, Gilbert MTP, Pybus OG. Macroevolution of complex retroviruses. Science. 2009 Sep 18;325(5947):1512. 25. Han GZ, Worobey M. Endogenous viral sequences from the Cape golden mole (Chrysochloris asiatica) reveal the presence of foamy viruses in all major placental mammal clades. PLoS One. 2014;9(5):3–6. 26. Han G-Z, Worobey M. An endogenous foamy-like viral element in the coelacanth genome. PLoS Pathog. 2012 Jun;8(6):e1002790. 27. Llorens C, Muñoz-Pomer A, Bernad L, Botella H, Moya A. Network dynamics of eukaryotic LTR retroelements beyond phylogenetic trees. Biol Direct. 2009;4(1):41. 28. Schartl M, Walter RB, Shen Y, Garcia T, Catchen J, Amores A, et al. The genome of the platyfish, Xiphophorus maculatus, provides insights into evolutionary adaptation and several complex traits. Nat Genet. Nature Publishing Group; 2013;45(5):567–72. 29. Linial ML. Why aren’t foamy viruses pathogenic? Trends Microbiol. 2000;8(00):284–9. 30. Rethwilm A, Bodem J. Evolution of foamy viruses: the most ancient of all retroviruses. Viruses. 2013 Oct;5(10):2349–74. 31. Helfman GS, Collette BB, Facey DE, Bowen BW. The diversity of fishes: biology, evolution, and ecology. Atlantic. 2009. 736 p. 32. Benton MJ. The quality of the fossil record of vertebrates. In: Donovan SK, Paul CRC, editors. The adequacy of the fossil record. New York: Wiley; 1998. p. 269–303. 33. Betancur-R R, Broughton RE. The Tree of Life and a new classification of bony fishes. Plos Curr Tree Life. 2013;1–54. 34. Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004;32(Web Server issue):W327–31. 35. Froese R, Pauly D. FishBase [Internet]. 2015. Available from: www.fishbase.org 36. Hyde JR, Vetter RD. The origin, evolution, and diversification of rockfishes of the genus Sebastes (Cuvier). Mol Phylogenet Evol. 2007;44(2):790–811. 37. Jern P, Sperber GO, Blomberg J. Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and . Retrovirology. 2005 Jan;2:50. 38. Meredith RW, Pires MN, Reznick DN, Springer MS. Molecular phylogenetic relationships and the evolution

41 of the placenta in Poecilia (Micropoecilia) (Poeciliidae: Cyprinodontiformes). Mol Phylogenet Evol. Elsevier Inc.; 2010;55(2):631–9. 39. Harel I, Benayoun BA, Machado B, Singh PP, Hu C-K, Pech MF, et al. A platform for rapid exploration of aging and diseases in a naturally short-lived vertebrate. Cell. 2015;160(5):1013–26. 40. Orlov a, Biryukov I. First report of sablefish in spawning condition off the coast of Kamchatka and the Kuril Islands. ICES J Mar Sci. 2005;62(5):1016–20. 41. Baeck GW, Takita T, Yoon YH. Lifestyle of Korean mudskipper Periophthalmus magnuspinnatus with reference to a congeneric species Periophthalmus modestus. Ichthyol Res. 2008;55:43–52. 42. Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2. 43. Omoto S, Brisibe EA, Okuyama H, Fujii YR. Feline foamy virus Tas protein is a DNA-binding transactivator. J Gen Virol. 2004;85(Pt 10):2931–5. 44. Rethwilm A. Regulation of foamy virus gene expression. Curr Top Microbiol Immunol. 1995;193:1–24. 45. Rethwilm A. The replication strategy of foamy viruses. Curr Top Microbiol Immunol. 2003;277:1–26. 46. Ribet D, Louvet-Vallee S, Harper F, de Parseval N, Dewannieux M, Heidmann O, et al. Murine endogenous retrovirus MuERV-L is the progenitor of the “orphan” epsilon viruslike particles of the early mouse embryo. J Virol. 2008;82(3):1622–5. 47. Hart D, Frerichs GN, Rambaut A, Onions DE. Complete nucleotide sequence and transcriptional analysis of snakehead fish retrovirus. J Virol. 1996;70(6):3606–16. 48. Paul T a, Quackenbush SL, Sutton C, Casey RN, Bowser PR, Casey JW. Identification and characterization of an exogenous retrovirus from atlantic salmon swim bladder sarcomas. J Virol. 2006;80(6):2941–8. 49. Rovnak J, Quackenbush SL. Walleye dermal sarcoma virus: molecular biology and oncogenesis. Viruses. 2010 Sep;2(9):1984–99. 50. Tang H, Kuhen KL, Wong-Staal F. Replication and Regulation. Annu Rev Genet. Annual Reviews; 1999 Dec 1;33(1):133–70. 51. A. S. Non-primate foamy viruses. Curr Top Microbiol Immunol. 2003;277:197–211. 52. Maurer B, Bannert H, Darai G, Flügel RM. Analysis of the primary structure of the long terminal repeat and the gag and pol genes of the human spumaretrovirus. J Virol. 1988;62(5):1590–7. 53. Blomberg J, Benachenhou F, Blikstad V, Sperber G, Mayer J. Classification and nomenclature of endogenous retroviral sequences (ERVs). Gene. Elsevier B.V.; 2009;448(2):115–23. 54. Coffin JM, Hughes SH, Varmus HE, editors. Overview of reverse transcription. Retroviruses. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1997. 55. Yi J-M, Kim T-H, Huh J-W, Park KS, Jang SB, Kim H-M, et al. Human endogenous retroviral elements belonging to the HERV-S family from human tissues, cancer cells, and primates: expression, structure, phylogeny and evolution. Gene. 2004;342(2):283–92. 56. Campbell M, Renshaw-Gegg L, Renne R, Luciw P a. Characterization of the internal promoter of simian foamy viruses. J Virol. 1994;68(8):4811–20.

42 57. Juretzek T, Holm T, Gärtner K, Lindemann D, Herchenröder O, Rammling M, et al. Foamy virus integration. 2004;78(5):2472–7. 58. Brown PO. Integration. In: Coffin JM, Hughes SH, Varmus HE, editors. Retroviruses. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1997. 59. Delelis O, Petit C, Leh H, Mbemba G, Mouscadet J-F, Sonigo P. A novel function for spumaretrovirus integrase: an early requirement for integrase-mediated cleavage of 2 LTR circles. Retrovirology. 2005;2:31. 60. Serrao E, Ballandras-Colas A, Cherepanov P, Maertens GN, Engelman AN. Key determinants of target DNA recognition by retroviral intasomes. Retrovirology. 2015;12(1):39. 61. Nakayama K. Furin: a mammalian subtilisin/Kex2p-like endoprotease involved in processing of a wide variety of precursor proteins. Biochem J. 1997;327 ( Pt 3:625–35. 62. Kehl T, Tan J, Materniak M. Non-simian foamy viruses: molecular virology, tropism and prevalence and zoonotic/interspecies transmission. Viruses. 2013;5(9):2169–209. 63. Coffin JM, Hughes SH, Varmus HE, editors. Synthesis and organization of Env glycoproteins. Retroviruses. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1997. 64. Enssle J, Fischer N, Moebes A, Mauer B, Smola U, Enssle RG, et al. Carboxy-terminal cleavage of the human foamy virus Gag precursor molecule is an essential step in the viral life cycle. J Virol. 1997;71(10):7312–7. 65. Zemba M, Wilk T, Rutten T, Wagner A, Flügel RM, Löchelt M. The carboxy-terminal p3Gag domain of the human foamy virus Gag precursor is required for efficient virus infectivity. Virology. 1998;247(1):7–13. 66. Coffin JM, Hughes SH, Varmus HE, editors. Synthesis, assembly, and processing of viral proteins. Retroviruses. Cold Spring Harbor; 1997. 67. Müllers E. The foamy virus gag proteins: What makes them different? Viruses. 2013;5(4):1023–41. 68. Eastman SW, Linial ML. Identification of a conserved residue of foamy virus Gag required for intracellular capsid assembly. J Virol. 2001;75(15):6857–64. 69. Krausslich HG, Welker R. Intracellular transport of retroviral capsid components. Curr Top Microbiol Immunol. 1996;214:25–63. 70. Thomas J a, Gagliardi TD, Alvord WG, Lubomirski M, Bosche WJ, Gorelick RJ. Human immunodeficiency virus type 1 nucleocapsid zinc-finger mutations cause defects in reverse transcription and integration. Virology. 2006;353(1):41–51. 71. Gorelick RJ, Gagliardi TD, Bosche WJ, Wiltrout T a, Coren L V, Chabot DJ, et al. Strict conservation of the retroviral nucleocapsid protein zinc finger is strongly influenced by its role in viral infection processes: characterization of HIV-1 particles containing mutant nucleocapsid zinc-coordinating sequences. Virology. 1999;256:92–104. 72. Lee E-G, Linial ML. Deletion of a Cys-His motif from the nucleocapsid domain reveals late domain mutant-like budding defects. Virology. 2006;347(1):226–33. 73. Lee E-G, Linial ML. The C terminus of foamy retrovirus Gag contains determinants for encapsidation of Pol protein into virions. J Virol. 2008;82(21):10803–10.

43 74. Tobaly-Tapiero J, Bittoun P, Lehmann-Che J, Delelis O, Giron M Lou, de Thé H, et al. Chromatin tethering of incoming foamy virus by the structural Gag protein. Traffic. 2008;9(2):1717–27. 75. Tan R, Frankel a D. A novel glutamine-RNA interaction identified by screening libraries in mammalian cells. Proc Natl Acad Sci U S A. 1998;95(8):4247–52. 76. Luscombe NM, Laskowski R a, Thornton JM. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res. 2001;29(13):2860–74. 77. Johnson WE, Coffin JM. Constructing primate phylogenies from ancient retrovirus sequences. Proc Natl Acad Sci U S A. 1999;96(18):10254–60. 78. Hughes JF, Coffin JM. Human endogenous retroviral elements as indicators of ectopic recombination events in the primate genome. Genetics. 2005;171(3):1183–94. 79. Jha AR, Pillai SK, York V a., Sharp ER, Storm EC, Wachter DJ, et al. Cross-sectional dating of novel haplotypes of HERV-K 113 and HERV-K 115 indicate these proviruses originated in Africa before Homo sapiens. Mol Biol Evol. 2009;26(11):2617–26. 80. Recknagel H, Elmer KR, Meyer A. A hybrid genetic linkage map of two ecologically and morphologically divergent Midas cichlid fishes (Amphilophus spp.) obtained by massively parallel DNA sequencing (ddRADSeq). G3 (Bethesda). 2013;3(1):65–74. 81. Kratochwil CF, Sefton MM, Meyer A. Embryonic and larval development in the Midas cichlid fish species flock (Amphilophus spp.): a new evo-devo model for the investigation of adaptive novelties and species differences. BMC Dev Biol. 2015;15(12):1–15. 82. Fraser BA, Künstner A, Reznick DN, Dreyer C, Weigel D. Population genomics of natural and experimental populations of guppies (Poecilia reticulata). Mol Ecol. 2015;24(2):389–408. 83. Fu B, Chen M, Zou M, Long M, He S. The rapid generation of chimerical genes expanding protein diversity in zebrafish. BMC Genomics. BioMed Central Ltd; 2010;11(1):657. 84. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16(2):111–20. 85. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. 86. Nethe M, Berkhout B, van der Kuyl AC. Retroviral superinfection resistance. Retrovirology. 2005;2:52. 87. Bininda-Emonds ORP, Cardillo M, Jones KE, MacPhee RDE, Beck RMD, Grenyer R, et al. The delayed rise of present-day mammals. Nature. 2007;446(7135):507–12. 88. Hrbek T, Seckinger J, Meyer A, Meredith RW, Pires MN, Reznick DN, et al. The origin and evolution of a unisexual hybrid: Poecilia formosa. Mol Phylogenet Evol. Elsevier Inc.; 2007;55(3):2901–9. 89. Katzourakis A, Gifford RJ. Endogenous viral elements in animal genomes. PLoS Genet. 2010 Nov;6(11):e1001191. 90. Hrbek T, Seckinger J, Meyer A. A phylogenetic and biogeographic perspective on the evolution of poeciliid fishes. Mol Phylogenet Evol. 2007;43(3):986–98. 91. Zanotto PM, Gibbs MJ, Gould E a, Holmes EC. A reevaluation of the higher taxonomy of viruses based on

44 RNA polymerases. J Virol. 1996;70(9):6083–96. 92. Alfaro ME, Santini F, Brock C, Alamillo H, Dornburg A, Rabosky DL, et al. Nine exceptional radiations plus high turnover explain species diversity in jawed vertebrates. Proc Natl Acad Sci U S A. 2009;106(32):13410–4. 93. Suttle CA. Marine viruses — major players in the global ecosystem. Nat Rev Microbiol. 2007;5(10):801– 12. 94. Einer-Jensen K. Evolution of the fish rhabdovirus viral haemorrhagic septicaemia virus. J Gen Virol. 2004;85(5):1167–79. 95. Okonechnikov K, Golosova O, Fursov M. Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics. 2012;28(8):1166–7. 96. Larkin MA, Blackshields G, Brown NP, Chenna R, Mcgettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8. 97. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Jan;7(539):539. 98. López JA, Chen W-J, Ortí G. Esociform Phylogeny. Copeia. 2015;2004(3):449–64. 99. Li C, Ortí G. Molecular phylogeny of () inferred from nuclear and mitochondrial DNA sequences. Mol Phylogenet Evol. 2007;44:386–98. 100. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. 101. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Hohna S, et al. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539– 42. 102. Rambaut A, Suchard M, Xie D, Drummond A. Tracer v1.6. 2014. 103. Darriba D, Taboada GL, Doallo R, Posada D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011;27(8):1164–5. 104. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52(5):696–704. 105. Stöver BC, Müller KF. TreeGraph 2: Combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics. 2010;11(7). 106. Charleston M. TreeMap 3 [Internet]. Available from: http://sydney.edu.au/engineering/it/~mcharles/software/treemap/treemap3.html 107. Charleston MA, Robertson DL. Preferential host switching by primate lentiviruses can account for phylogenetic similarity with the primate phylogeny. Syst Biol. 2002;51(3):528–35. 108. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol. 2001;305(3):567–80. 109. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutioanry distance, and maximum parsimony methods. Mol Biol

45 Evol. 2011;28(10):2731–9. 110. Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods. Nature Publishing Group; 2012;9(8):772–772. 111. Han G-Z, Worobey M. An endogenous foamy virus in the aye-aye (Daubentonia madagascariensis). J Virol. 2012 Jul;86(14):7696–8. 112. Cui J, Tachedjian G, Tachedjian M, Holmes EC, Zhang S, Wang L-F. Identification of diverse groups of endogenous gammaretroviruses in mega- and microbats. J Gen Virol. 2012 Sep;93(Pt 9):2037–45. 113. Tarlinton RE, Barfoot HKR, Allen CE, Brown K, Gifford RJ, Emes RD. Characterisation of a group of endogenous gammaretroviruses in the canine genome. Vet J. Elsevier Ltd; 2013;196(1):28–33. 114. Wang L, Yin Q, He G, Rossiter SJ, Holmes EC, Cui J. Ancient invasion of an extinct gammaretrovirus in cetaceans. Virology. 2013 Mar 29. 115. Han G-Z. Extensive retroviral diversity in shark. Retrovirology. 2015;12(1):10–3. 116. Basta H a, Cleveland SB, Clinton R a, Dimitrov AG, McClure M a. Evolution of teleost fish retroviruses: characterization of new retroviruses with cellular genes. J Virol. 2009 Oct;83(19):10152–62. 117. Choi G, Park S, Choi B, Hong S, Lee J, Hunter E, et al. Identification of a cytoplasmic targeting/retention signal in a retroviral Gag polyprotein. J Virol. 1999;73(7):5431–7.

46 APPENDIX A – RETROVIRUS SEQUENCES IN PHYLOGENETIC ANALYSES

Exogenous and endogenous retrovirus amino acid Pol sequences used for phylogenetic inference in this study. Primary literature is cited where a GenBank accession number is not available. RhERV is a Rhesus macaque ERV type D reference sequence from Dr. Robert J. Gifford’s retrovirus reference sequence library available at http://bioinformatics.cvr.ac.uk/paleovirology/index.html.

Abbreviation Full Name Citation/Accession No. ALV Avian leukosis virus M37980.1 BabERV Baboon endogenous retrovirus AHZZ01047987.1 BFV NC_001831.1 BIV Bovine immunodeficiency virus NP_040563.1 BLV NC_001414.1 CAEV Caprine arthritis encephalitis virus NC_001463.1 CoeEFV Coelacanth endogenous foamy virus Han and Worobey, 2012 (1) DIAV Duck infectious anemia virus KF313137.1 EFV Equine foamy virus AF201902.1 EqERV-b1 Equine endogenous retrovirus beta van der Kuyl, 2011 (2) EIAV virus NC_001450.1 ERV-3 (Human) endogenous retrovirus 3 Jern et al., 2005 (3) FFV Feline foamy virus KC292054.1 FIV Feline immunodeficiency virus NC_001482.1 FLV M18247.1 FriendMLV Friend NC_001362.1 GgALV Avian leukosis virus HQ900844.1 HERV-E Human endogenous retrovirus E Jern et al., 2005 (3) HERV-Fc1 Human endogenous retrovirus Fc1 Jern et al., 2005 (3) HERV-H Human endogenous retrovirus H – RTVLH2 Jern et al., 2005 (3) HERV-K Human endogenous retrovirus K AB047209.1 HERV-L Human endogenous retrovirus L Jern et al., 2005 (3) HERV-S Human endogenous retrovirus S Jern et al., 2005 (3) HIV-1 Human immunodeficiency virus NC_001802.1 HIV-2 Human immunodeficiency virus NC_001722.1 HTLV-1 Human t-cell lymphotropic virus AF033817.1 HTLV-3 Human t-cell lymphotropic virus EU649782.1

47 Jaagsiekte Jaagsiekte sheep retrovirus NC_001494.1 Jembrana Jembrana disease virus Q82851.1 KoRV AF151794.2 Mason-Pfizer Mason-Pfizer monkey virus NC_001550 MELV Mustelidae endogenous lentivirus Han and Worobey, 2012b (4) MgLPDV Turkey lymphoproliferative disease virus U09568.1 MMTV Mouse mammary tumor virus NC_001503.1 MuERV-L Murine endogenous retrovirus L Y12713.1 OoERV Killer whale endogenous retrovirus GQ222416.1 PERV-A Porcine endogenous retrovirus A EU789636.1 PFV Prototype foamy virus Y07725.1 PyERV Python molurus endogenous retrovirus AAN77283.1 RELIK Rabbit endogenous lentivirus type K Katzourakis et al., 2007 (5) RhERV Macaque endogenous retrovirus type D See description RSV-PragueC – Prague C J02342.1 SFVagm African green monkey simian foamy virus NC_010820.1 SFVcpz Chimpanzee simian foamy virus NC_001364.1 SFVgor Gorilla simian foamy virus HM245790.1 SFVmac Macaque simian foamy virus NC_010819.1 SFVmar Common marmoset simian foamy virus ADE06000.1 SFVora Orangutan simian foamy virus ADN65591.1 SFVspm Spider monkey simian foamy virus ABV59399.1 SFVsqu Squirrel monkey simian foamy virus ADE05995 SIVagm African green monkey simian NC_001549.1 immunodeficiency virus SIVcpz Chimpanzee simian immunodeficiency virus AF115393.1 SloEFV Sloth endogenous foamy virus Katzourakis et al., 2009 (6) SMRV-H Squirrel monkey retrovirus NC_001514.1 SnRV Snakehead retrovirus U26458.1 SRV-1 Simian retrovirus M11841.1 SSSV Atlantic salmon swim bladder sarcoma virus DQ174103.1 STLV-1Bab Baboon simian t-cell lymphotropic virus JX987040.1 STLV-2Bon Bonobo simian t-cell lymphotropic virus NC_001815.1 Visna-Maedi Visna/Maedi virus (VMV) NC_001452.1

48 WDSV Walleye dermal sarcoma virus AF033822.1 WEHV-1 Walleye epidermal hyperplasia virus AF133051.1 WEHV-2 Walleye epidermal hyperplasia virus AF133052.1 Xen-1 Xenopus laevis endogenous retrovirus NC_010955.1

1. Han G-Z, Worobey M. An endogenous foamy-like viral element in the coelacanth genome. PLoS Pathog. 2012 Jun;8(6):e1002790. 2. van der Kuyl AC. Characterization of a full-length endogenous beta-retrovirus, EqERV-beta1, in the genome of the horse (Equus caballus). Viruses. 2011;3(6):620–8. 3. Jern P, Sperber GO, Blomberg J. Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and taxonomy. Retrovirology. 2005 Jan;2:50. 4. Han G-Z, Worobey M. Endogenous lentiviral elements in the weasel family (Mustelidae). Mol Biol Evol. 2012 Oct;29(10):2905–8. 5. Katzourakis A, Tristem M, Pybus OG, Gifford RJ. Discovery and analysis of the first endogenous lentivirus. Proc Natl Acad Sci U S A. 2007;104(15):6261–5. 6. Katzourakis A, Gifford RJ, Tristem M, Gilbert MTP, Pybus OG. Macroevolution of complex retroviruses. Science. 2009 Sep 18;325(5947):1512.

49 APPENDIX B – SAMPLED FISH SPECIES

Fish species tissues screened for endogenous foamy-like elements. KU is the University of Kansas Ichthyology Collection. MSB is the Museum of Southwestern Biology at the University of New Mexico. SIO is the Scripps Institute of Oceanography Marine Vertebrate Collection. Crawford is Dr. Douglas L. Crawford, Professor of Marine Biology & Ecology at the University of Miami. Fitzsimmons is Dr. Kevin M. Fitzsimmons, Professor of Soil, Water, and Environmental Science at the University of Arizona. Wyles is Dawn M. Cloutier Wyles of Dawn’s Rainbow Trout Farm in Sedona, AZ, USA. XGSC is the Xiphophorus Genetic Stock Center at Southwest Texas State University. Private donation denotes a local pet store that provided naturally expired specimens.

Source/ Order Species Catalog No. Polydon spathula KU 9075 Albuliformes pacifica SIO 10-122 Alepocephalus tenebrosus SIO 09-332-L Amia calva KU 10298 Trichogaster lalius Private donation Anguilliformes Gymnothorax miliaris KU 145 Hoplunnis macrurus KU 5098 Myrichthys maculosus KU 6977 Rhynchoconger flavus KU 5096 Argentina silas SIO 99-78 Melanotaenia splendida KU 1670 Synodus synodus KU 169 Batrachoidiformes Sanopus greenfieldorum KU 239 Beloniformes Cheilopogon pinnatibarbatus KU 2780 Tylosurus crocodilus KU 5842 Rondeletia bicolor KU 5920 Scopelogadus beanii KU 5321 Lepadichthys lineatus KU 7020 Carangiomorphariae: incertae sedis Sphyraena barracuda KU 8989 Cichliformes Melanochromis auratus Private donation Oreochromus Niloticus Fitzsimmons Pterophyllum scalare KU 2846 Clupeiformes Anchoa hepsetus KU 9990 Brevoortia tyrannus KU 1499

50 Cypriniformes Carpiodes carpio KU 8793 Catostomis commersoni MSB 49697 Cobitis lutheri KU 8123 Danio rerio Private donation Cyprinodontiformes Fundulus grandis KU 1626 Fundulus heteroclitus Crawford Fundulus zebra (F. heteroclitus) MSB 57941 Poecilia orri KU 5845 Poecilia reticulata Private donation Xiphophorus maculatus XGSC Esox lucius KU 8753 Gadiformes Albatrossia pectoralis KU 2298 Gadus macrocephalus KU 3250 Gadus morhua KU 2937 Lota lota KU 3772 Periopthalmus kalolo KU 4460 Rhinogobiops nicholsi SIO 04-44-L Chanos chanos KU 1603 Apteronotus albifrons KU 2528 Holocentriformes Holocentrus rufus KU 5782 Kurtiformes retrosella SIO 01-39-L Lampridiformes Lophotus lacepede KU 6557 Lepisosteiformes Lepisosteus platostomus KU 8247 Lophiiformes Lophius americanus KU 4970 Melanocetus johnsonii KU 5265 Mugiliformes Liza macrolepis KU 6748 Mytophum punctatum KU 3802 Myctophiformes Diaphus dumerilii KU 1577 Notacanthus chemnitzii KU 5937 Cataetyx rubirostris KU 2262 Lepophidium profundorum KU 2934 Barbantus curvifrons KU 8141 Barbus gananensis KU 6047 Osmerus mordax KU 8278

51 Pantodon buchholzi KU 2543 Ovalentariae: incertae sedis Abudefduf sexfasciatus KU 7008 Perciformes Anoplopoma fimbria KU 9422 Calamus brachysomus SIO 04-113 Gasterosteus aculeatus KU 5985 Perca flavescens KU 2567 Pterosis antennata KU 5471 Sebastes rubrivinctus KU 9180 Serranus aequidens SIO 08-90 Percomorpharia: incertae sedis Terapon jarbua KU 6755 Ambylopsis rosae KU 10293 Pleuronectiformes Paralichthys dentatus KU 1446 Paralichthys woolmani SIO 08-88 Solea solea KU 1846 Polypteriformes Polypterus sp. KU 8219 Saccopharyngiformes Eurypharynx pelecanoides KU 7473 Salmoniformes Oncorhynchus mykiss Wyles Oncorhynchus tshawytscha KU 3227 Prosopium williamsoni KU 8862 Scombriformes Euthynnus lineatus SIO 09-133 Siluriformes Corydoras metae KU 2519 Ictalurus punctatus KU 9142 Siluranodon auritus KU 1401 Argyropeleceus aculeatus KU 8329 Chauliodus sloani KU 5365 Sygnathiformes Dactylopterus volitans KU 237 Syngnathus fuscus KU 5987 Synbranchus marmoratus KU 10406 Mola mola KU 5409 assasi KU 8846 Takifugu rubripes KU 3490 Xenolepidichthys dalgleishi KU 8348 Zenopsis conchifer KU 3277

52