Broad-scale phylogenomics reveals insights into retroviral origin and gammaretrovirus-host

Ling-Shan Yu

A dissertation submitted for the degree of Doctor of Philosophy at Imperial College London

December 2015


Abstract

The Retroviridae is a family of single-stranded positive-sense RNA viruses united by a unique mechanism of replication. Numerous studies have demonstrated the host diversity and host–retrovirus evolutionary history of the Retroviridae. However, in the past it has been difficult to gain a deeper understanding owing to the lack of sufficient host genomic data. Recent advances in whole-genome sequencing and bioinformatics technologies have enabled the collection of high-quality genomic data. Broad-scale in silico screening of vertebrate genomes provides numerous opportunities to analyse retroviral origin and evaluate the risks and limitations of horizontal transmission between different host species. In Chapters 2 and 3, I expand our current understanding of retroviral diversity in lower vertebrates and identify the host range boundary of the Retroviridae. I report the discovery of a basal retrovirus within the genome of the lamprey (Petromyzon marinus). No retroviruses were identified within other basal taxa, such as hagfishes, molluscs and sponges. This suggests that members of the Retroviridae are restricted to the lamprey and other phylogenetically higher vertebrates, and that the host range boundary of this virus family has potentially been identified. In addition, this study identified extensive retroviral diversity in the basal vertebrates. The phylogenetic results show that at least three independent invasions have occurred in cartilaginous fish and the coelacanth. In Chapter 4, I investigate the gammaretroviral diversity and evolutionary history of mammalian genomes by combining data on viral hosts and viral sequences. The study provides insights into retrovirus–host evolutionary history. Six horizontal transmission hotspots have been identified, and rodents are suggested to be the major retroviral reservoir of type II gammaretroviruses. In addition, by mapping host species onto viral phylogenies, it is shown that cross-species horizontal transmissions of gammaretroviruses are frequent between closely related species.


Declaration of Originality

I declare that the research presented in this thesis is my own original work, except for the PCR-derived ERVs from Joanne Martin et al., 2006 (unpublished data), which are included in the phylogenetic analyses in Chapter IV. Any additional sources of information have been duly cited in the reference list.

Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence.

Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.


Acknowledgements

Foremost, I would like to thank my supervisor, Dr. Michael Tristem, for his continuous support of my MSc and PhD studies, and for his patience and immense knowledge. I could not have imagined having a better mentor for my study. My sincere thanks also go to my dear friend and labmate, Dr. Adam Lee, for his encouragement, comments and music, and for all the coffee and fun we had in the last three years.

Last but not least, I would like to thank my parents for giving birth to me in the first place and for supporting all my studies in the UK. Their love and support have enabled me to learn and grow throughout my years at Imperial College London. My special thanks go to Jason Lin, for coming over to the UK for me, for supporting my life in general, and for including me in his life.


Table of Contents

Title Page……1
Abstract……2
Declaration of Originality……4
Copyright Declaration……4
Acknowledgements……5

Contents

Chapter I - Introduction
1.1 Overview and general introduction……9
1.2 Retroviral genome……14
1.3 Retroviral life cycle……20
1.4 Retroviral diversity……31
1.5 Retroviruses and host genome……37
1.6 Evolutionary studies of retroviruses and their hosts……40
1.7 Co-option of retroviral genes……47

Chapter II - A basal retrovirus in an ancient vertebrate lineage
2.1 Introduction……48
2.2 Materials and Methods……54
2.3 Results……59
2.3.1 Retroviral distribution in lower vertebrates……59
2.3.2 PmRV, a retrovirus from the Petromyzon marinus genome……67
2.4 Discussion
2.4.1 ERV host range and diversity……83
2.4.2 Genomic organisation analysis of PmRV……88
2.4.3 Putative function for lamprey ERV……94
2.5 Conclusion……97

Chapter III retroviruses
3.1 Introduction……98
3.2 Materials and Methods……100
3.3 Results……101
3.4 Discussion……105
3.5 Conclusion……107

Chapter IV - Biogeographic and horizontal transmission history of mammalian gammaretroviruses
4.1 Introduction……108
4.2 Materials and Methods……112
4.3 Results……118
4.3.1 Detection and characterization of mammalian gammaretroviruses……118
4.3.2 Gammaretrovirus frequency in mammals……134
4.3.3 Rodents are significant vectors of interorder viral transmission events within mammals……134
4.3.4 Transmission frequency varies according to the genetic distance of donor and recipient……138
4.4 Discussion and Conclusion……160
4.4.1 Biogeography and horizontal transmission hotspots of gammaretroviruses……160
4.4.2 Horizontal transmission dynamics of gammaretroviruses……162
4.4.3 Rodents have more type I gammaretroviruses than other mammals……166
4.4.4 A model of type I mammalian gammaretrovirus evolution……167

Chapter V - Conclusion and Future developments……169

Appendix……174

References……176

Figure
Figure 1.1.1……11
Figure 1.1.2……13
Figure 1.2.1……16
Figure 1.3.1……21
Figure 1.3.2……23
Figure 1.3.3……25
Figure 1.3.4……26
Figure 1.3.5……29
Figure 1.4.1……32
Figure 1.4.2……33
Figure 2.1……50
Figure 2.2……58
Figure 2.3……61
Figure 2.4……67
Figure 2.5……69
Figure 2.6……77
Figure 2.7……80
Figure 2.8……81
Figure 2.9……87
Figure 2.10……88
Figure 2.11……90
Figure 2.12……94
Figure 3.1……104
Figure 4.1……114
Figure 4.2……117
Figure 4.3……119
Figure 4.4……125
Figure 4.5……137
Figure 4.6……139
Figure 4.7……144
Figure 4.8……150
Figure 4.9……151
Figure 4.10……152
Figure 4.11……153
Figure 4.12……154
Figure 4.13……155
Figure 4.14……159
Figure 4.15……163
Figure 4.16……164
Figure 4.17……168

Table
Table 1.6.1……44
Table 2.1……55
Table 2.2……90
Table 2.3……96
Table 3.1……103
Table 4.1……118
Table 4.2……135
Table 4.3……158


Chapter I

Introduction

1.1 Overview and general introduction

The Retroviridae is a family of single-stranded positive-sense RNA viruses responsible for many medically important diseases, including immunodeficiencies, sarcomas and leukaemias. Retroviral integration usually occurs in somatic cells. Occasionally, integration occurs in germline cells, resulting in the vertical transmission of retroviruses from parents to offspring (Vogt, 1997). Retroviruses that have integrated into the host germline are termed endogenous retroviruses (ERVs). ERVs may retain the ability to replicate for millions of years and can increase in copy number via reinfection or retrotransposition into different locations within the host genome (Sverdlov, 1998). This may result in distinct retrovirus families appearing in host genomes, each originating from a single horizontal interspecies transmission (Tristem, 2000). ERVs can become inactive through recombinational deletion and random mutations introduced during host DNA replication (Hughes and Coffin, 2004; Stoye, 2001).

At the time of integration, a provirus carries two identical long terminal repeats (LTRs), one at each end (Johnson and Coffin, 1999). Thus, the time of retroviral integration can be estimated from the sequence divergence between the two LTRs, since the divergence should be proportional to the length of time the provirus has been subjected to background host mutation. However, the accuracy of this method can be confounded by gene conversion and recombination.
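The relationship between LTR divergence and integration time can be illustrated with a minimal Python sketch. It assumes that the two LTRs have already been aligned, that both accumulate substitutions independently at the same neutral rate r (so that the divergence K between them equals 2rT), and that a Jukes-Cantor correction is adequate. None of these choices is taken from the analyses in this thesis, and the function names and the rate in the usage note are hypothetical.

```python
import math


def p_distance(ltr5: str, ltr3: str) -> float:
    """Proportion of differing sites between two aligned LTRs (gap columns are skipped)."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue  # ignore alignment gaps
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance (assumed substitution model)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def integration_age(ltr5: str, ltr3: str, rate_per_site_per_year: float) -> float:
    """Estimate years since integration as T = K / (2r).

    The factor of 2 reflects the assumption that both LTRs diverge
    independently from the identical state they had at integration.
    """
    k = jukes_cantor(p_distance(ltr5, ltr3))
    return k / (2.0 * rate_per_site_per_year)
```

For example, with an assumed rate of 2e-9 substitutions per site per year, two LTRs differing at 2 of 100 aligned sites would give an estimate of roughly 5 million years. Gene conversion between the LTRs would bias such an estimate downwards, which is the confounding effect noted above.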

Retroviruses can exist in exogenous or endogenous form, and some, such as the mouse mammary tumour virus (MMTV) and Jaagsiekte sheep retrovirus (JSRV), exist in both (Sarkar et al., 2004; Golovkina et al., 1994; York et al., 1992). ERVs have been isolated from most vertebrates, and the most basal vertebrate in which retroviruses have been identified is the lemon shark (Negaprion brevirostris) (Herniou et al., 1998). Disease-causing retroviruses have been discovered in a range of vertebrates (Vogt, 1997). The first two disease-causing retroviruses discovered, the avian leukaemia virus (ALV) and Rous sarcoma virus (RSV), belong to the alpharetrovirus genus (Rous, 1911; Rous, 1910; Ellerman and Bang, 1909; Ellerman and Bang, 1908). In 1936, the first mammalian retrovirus, MMTV, was discovered in mice (Bittner, 1936) and was found to transmit murine breast cancer to host offspring (Cohen et al., 1979). During the 1980s, retroviruses received considerable attention when HIV-1 and HIV-2 were discovered (Zhu et al., 1998). These viruses can have fatal consequences for their hosts, resulting in acquired immunodeficiency syndrome (AIDS).

The term “retroviruses” is currently applied to two different but overlapping sets of retroelements. First, a retrovirus is a member of the Retroviridae, in the sense that this family is monophyletic with respect to other retroelements, such as retrotransposons, short interspersed nuclear elements and long interspersed nuclear elements. Second, a retrovirus refers to an infectious retroelement, including members of the Retroviridae and Ty3/gypsy retrotransposons which are capable of horizontal transmission. To clarify, I use the former definition in this thesis. Figure 1.1.1 shows the generalised genomic organisation of a number of retroelement groups, and Figure 1.1.2 illustrates the phylogenetic relationships between major groups of retroelements on the basis of their reverse transcriptase (RT) domains.


LTR retroelements are distinguishable from other retroelements by being flanked by paired LTRs, which are formed during reverse transcription. LTR retrotransposons typically encode two genes, gag and pol, which encode structural polyproteins and the enzymes for reverse transcription and integration, respectively. One of the main differences between LTR retrotransposons and retroviruses is that retroviruses carry an env gene, which allows a virus to infect another cell. However, some LTR retrotransposons encode a third open reading frame which functions similarly to the env gene, and thus they can be infectious like retroviruses. The best-characterised examples of env-containing retroelements are gypsy (Pelisson et al., 2002; Terzian et al., 2001) and ZAM (Leblanc et al., 1997). Gypsy is in fact an infectious retroelement, partly owing to the capture of an env-like gene from a baculovirus (Pearson and Rohrmann, 2002; Malik et al., 2000; Song et al., 1994). However, in contrast to retroviruses, LTR retrotransposons do not appear to be naturally infectious and are infectious only under certain laboratory conditions.

Figure 1.1.1 Schematic genomic compositions of four different types of LTR retroelements. P: internal promoters, PBS: primer binding site, PPT: polypurine tract.


The evolutionary origins of the retroviral env gene remain unclear, and several possibilities concerning how env-like genes were captured by invertebrate retroelements have been suggested. For example, the env of the retrotransposon Cer shares homology with a phleboviral fusion protein, and Tas appears to have captured the gB glycoprotein of an ancestral herpesvirus (Malik et al., 2000). In addition to acquisition from another infectious agent, an alternative possible origin of the env gene is that an ancestral vertebrate LTR retrotransposon acquired it from its host. Several lines of evidence have shown that host genes can be co-opted by retroviruses for their own advantage. For instance, the src gene of RSV is derived from the host genome (Czernilofsky et al., 1983, 1980; Schwartz et al., 1983; Shoji et al., 1981; Takeya et al., 1983, 1981). Another possibility is that the fusion of different genes led to the formation of the env gene.

Although retroviruses and LTR retrotransposons form two distinct clades, their host distributions are partially overlapping, suggesting that different taxa have different degrees of vulnerability to these two groups of elements. For example, members of the gypsy and copia retroelements are found in vertebrates (except for mammals), fungi, plants and insects, whereas retroviruses are present in almost all vertebrates, including mammals, but not in plants, insects and fungi.


Figure 1.1.2. Relationship between major groups of retroelements (Simplified from Xiong and Eickbush, 1990).


1.2 Retroviral genome

Retroviral genomes consist of linear, single-stranded, positive-sense RNA. In the integrated proviral form, they comprise two LTRs, each 300–1,200 nucleotides in length, separated by 5–10 kb of sequence encoding the retroviral gag, pol and env genes. Vertebrate retroviruses can be broadly divided into simple and complex retroviruses. The main difference is that while simple retroviruses have a basic LTR-gag-pol-env-LTR genomic structure, complex retroviruses encode additional accessory genes that provide greater control of retroviral gene expression (Vogt, 1997). A schematic of the retroviral genomic structure is shown in Figure 1.2.1.

Structural genes

gag

The gag protein is the precursor to the internal structural proteins of retroviruses and is critical to retrovirus assembly. The gag gene encodes three essential proteins which form the structural components of the virus core: the matrix (MA), capsid (CA) and nucleocapsid (NC) proteins. In some viruses, such as RSV, additional cleavage products (p2, p10 and SP) are also released from the gag polyprotein precursor. These structural components are released by the viral protease once the molecules have been assembled into a particle at the plasma membrane (Wills and Craven, 1991; Dickson et al., 1984).

The MA, CA and NC proteins appear to serve similar functions in all retroviruses and are organised in the same order from the amino terminus to the carboxyl terminus, (NH2)-MA-X-CA-NC-Y-(COOH) (Wills and Craven, 1991), where X and Y indicate segments that may be cleaved into one or several proteins or may be absent altogether. The MA sequence at the amino terminus of the gag protein is thought to contain the membrane targeting and binding domain, which directs gag to the site of budding. The NC protein usually contains one or more cysteine-histidine motifs (i.e. zinc fingers) that are involved in the packaging of viral RNA. The exact structural function of CA in the mature viral particle is unknown; however, it contains the most highly conserved region within gag, termed the major homology region (MHR) (Mammano et al., 1994; Wills and Craven, 1991; Patarca and Haseltine, 1985), a stretch of about 20 amino acid residues that is similar across retroviruses. The human and simian spumaretroviruses are the only retroviruses that do not contain the MHR (Renne et al., 1992; Maurer et al., 1988). The second conserved feature of gag polyproteins is found in the C-terminus of NC as a Cys-X2-Cys-X4-His-X4-Cys (CCHC) motif, which may be present in one to three copies depending on the viral species (Llorens et al., 2009; Summer et al., 1992; Green and Berg, 1989; Berg 1986; Covey, 1986). The CCHC motif is involved in virion assembly, RNA packaging, reverse transcription and integration (Buckman et al., 2003). In all retroviruses except spumaretroviruses, one or two CCHC motifs can be found in the NC region (Henderson et al., 1981).
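As an illustration of how the Cys-X2-Cys-X4-His-X4-Cys spacing can be searched for in a translated gag sequence, the following Python sketch expresses the motif as a regular expression. It is not the screening method used in this thesis; the function name is hypothetical and the peptide in the usage note is a toy sequence chosen only to satisfy the spacing.

```python
import re

# Cys-X2-Cys-X4-His-X4-Cys, where X is any residue (14 residues in total).
# The lookahead lets overlapping candidate motifs be reported as well.
CCHC_PATTERN = re.compile(r"(?=(C.{2}C.{4}H.{4}C))")


def find_cchc_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched residues) for each CCHC motif in a protein sequence."""
    return [(m.start(), m.group(1)) for m in CCHC_PATTERN.finditer(protein.upper())]
```

Calling `find_cchc_motifs("MKTCFNCGKEGHIAKNCAAA")` on this toy peptide reports a single motif starting at position 3. A pattern of this kind only checks spacing, so hits in real sequences would still need to be confirmed by alignment to known NC domains.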


[Figure 1.2.1 labels. Panel (a): host genomic DNA; LTR (U3, R, U5); PBS; gag (MA, CA, NC); pro (PR); pol (RT, IN); env (SU, TM); PPT; LTR (U3, R, U5); accessory genes. Panel (b): surface glycoprotein (SU); transmembrane protein (TM); protease (PR); matrix (MA); viral RNA; lipid bilayer; reverse transcriptase (RT); capsid (CA); integrase (IN); nucleocapsid (NC).]

Figure 1.2.1 (a) Genetic organisation of a generalised provirus. The proviral DNA is inserted into the host genome, with the gag-pro-pol-env genes flanked by long terminal repeats (LTRs). Sequences in the LTR (U3-R-U5) are important for transcription. The sequences for gag, pol and env are invariably located in the same order (gag-pol-env) in all retroviruses. Sequences that are essential for replication and gene expression are shown in the approximate locations where they are typically found. Primer binding site (PBS); Matrix protein (MA); Capsid (CA); Nucleocapsid protein (NC); Reverse transcriptase (RT); Integrase (IN); Surface (SU); Transmembrane (TM) component; Polypurine tract (PPT). (b) Schematic representation of a typical retrovirus virion, illustrating the position of structural and enzymatic proteins (Figure 1.2.1 (b) is modified from http://what-when-how.com/molecular-biology/retroviruses-part-1-molecular-biology/)

pol

The pol gene is located downstream of the gag gene and encodes three critical enzymes for reverse transcription and integration: RT, integrase (IN) and RNaseH. RT is capable of catalysing DNA synthesis from a single strand of RNA or DNA (Scolnick et al., 1971; Baltimore, 1970), and this reverse transcription process requires a primer 12–18 bases in length, usually provided by the 3ʹ end of a host tRNA (Eickbush, 1994). IN is responsible for removing two bases from the LTR end and inserting a linear double-stranded DNA copy of the retroviral genome into the host genome. Three subdomains are present within IN: (1) an N-terminal subdomain, which displays a conserved zinc-binding (HHCC) motif (Lodi et al., 1995); (2) the central subdomain, which contains the catalytic core and can be identified by the presence of a conserved DDE motif (Polard and Chandler, 1995; Khan et al., 1991); and (3) a less conserved C-terminal subdomain. Protease, encoded by the pro gene, is responsible for cleaving the gag and pol polyproteins; pro is usually incorporated into pol, although in some retroviruses it is encoded as a separate gene between gag and pol.

Pro and pol initially form part of a polyprotein which is linked to the gag polyprotein. These polyproteins are all trafficked to the cell surface membrane and, just before budding, the viral protease cleaves them into their mature forms (Oroszlan and Luftig, 1990).

env

The envelope (env) genes of retroviruses usually encode two subunits, the surface (SU) and transmembrane (TM) proteins. The SU protein contains antigenic sites and mediates viral adsorption by binding specific cell surface receptors. The TM protein is the integral env subunit that mediates virus entry by triggering fusion of the viral and host cell membranes. Little is known about the similarity among retroviral env proteins at the primary sequence level; however, a conserved polybasic motif, K/R-X-K/R-R, is common to all retroviruses. This motif is the consensus cleavage site recognised by a host cellular endopeptidase, which cleaves the env precursor polyprotein into the SU and TM peptides. If uncleaved env polyproteins are expressed on the viral surface, they bind the host cell receptors but fail to induce membrane fusion. Therefore, the cleavage step is critical to the infectivity of retroviruses (Coffin et al., 1997). The TM sequence has two hydrophobic stretches: one at or near the N terminus, constituting the fusion peptide (fp), and a second, the transmembrane region, which anchors TM to the viral membrane. The ectodomain of env includes two heptad repeat regions, hr1 and hr2, which play an important role in the dynamic arrangement of the trimer during fusion and form a highly conserved coiled-coil structure found in many viral fusion proteins (Singh et al., 1999). The ectodomain region of some retroviruses also includes a region called the immunosuppressive domain (ISD). The ISD is a stretch of 20 amino acids and is relatively conserved in the env region (Bénit and Heidmann, 2001).
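The K/R-X-K/R-R consensus can be located in an env precursor sequence with a short Python sketch such as the one below. It assumes that cleavage takes place immediately after the final arginine of the motif and that the first candidate site is the SU/TM boundary; both are simplifying assumptions for illustration, and the function names and the toy peptide in the usage note are hypothetical.

```python
import re

# K/R-X-K/R-R: a basic residue, any residue, a basic residue, then arginine.
# The lookahead allows overlapping candidate sites to be reported.
CLEAVAGE_SITE = re.compile(r"(?=([KR].[KR]R))")


def candidate_cleavage_sites(env: str) -> list[int]:
    """Return 0-based start positions of every K/R-X-K/R-R motif in an env sequence."""
    return [m.start() for m in CLEAVAGE_SITE.finditer(env.upper())]


def split_su_tm(env: str):
    """Split env after the first candidate motif (assumed cut site).

    Returns (SU-like fragment, TM-like fragment), or None if no motif is present.
    """
    sites = candidate_cleavage_sites(env)
    if not sites:
        return None
    cut = sites[0] + 4  # the motif is four residues long; cut after its last R
    return env[:cut], env[cut:]
```

For the toy peptide `"MGSTLLAAVNRAKRAVGIGALFLG"`, the motif RAKR is found at position 10 and the sequence is split into `"MGSTLLAAVNRAKR"` and `"AVGIGALFLG"`. In real env sequences several candidate motifs may occur, so the biologically relevant site would need to be chosen using additional evidence such as the position of the fusion peptide.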

Accessory genes

Accessory genes can be characteristic of a viral genus, characteristic of a clade within a genus or found only in certain cases. They are located at various positions downstream of the pol gene and commonly overlap the env gene and U3, perhaps reflecting the evolutionary pressure imposed by the limit on the size of an RNA molecule that can be packaged into a virion (Coffin, 1997). In the simplest examples, such as the bel gene of HFV and the nef gene of HIV-1, accessory genes are located only downstream of the env gene and sometimes extend into the U3 region of the 3ʹ LTR. In some more complicated viral genomes, the accessory genes comprise two exons, such as tax and rex in HTLV: one is upstream of, or overlaps the upstream region of, the env gene and the other is downstream of, or overlaps the downstream region of, the env gene. Some accessory genes have exons upstream of and within the env gene, but utilise different open reading frames. Since the first accessory gene was discovered in HTLV (Seiki et al., 1986), various accessory genes that facilitate viral replication have been found in retroviruses from different viral genera. Although accessory genes are dispensable under certain circumstances, in some cells their products are as essential as the structural and regulatory proteins. This is exemplified by the Vpx and Vpr proteins in HIV replication (Fujita et al., 2010; Malim and Emerman, 2008).

Long terminal repeats (LTRs)

Retroviral protein-coding genes are flanked by two identical LTRs. LTRs contain regulatory elements for proviral integration, transcription and retroviral mRNA processing. The common structure of retroviral LTRs has been investigated using hidden Markov models (Benachenhou et al., 2009). LTRs can be divided into two unique regions (U3 and U5) and a repeated (R) region located between them. R and U5 are generally more conserved than U3, and this may be because U3 has to adapt to varying tissue environments. The most conserved features of LTRs are the short inverted repeat motifs at their ends, which start with TG and end with CA, together with one to three AT-rich regions that provide the LTRs with one or two TATA boxes and a polyadenylation signal (AATAAA motif) (Benachenhou et al., 2013).
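The conserved LTR features described above lend themselves to a simple check, sketched here in Python. This is only an illustration of the TG...CA termini and the AATAAA signal; it is not the hidden Markov model approach of Benachenhou et al., and the function name and the toy sequence in the usage note are hypothetical.

```python
def ltr_features(ltr: str) -> dict:
    """Report the terminal dinucleotides and polyadenylation signals of a putative LTR."""
    seq = ltr.upper()
    signal = "AATAAA"
    return {
        # Short inverted repeats at LTR ends begin with TG and end with CA.
        "starts_with_TG": seq.startswith("TG"),
        "ends_with_CA": seq.endswith("CA"),
        # 0-based start positions of every AATAAA polyadenylation signal.
        "polyA_signal_positions": [
            i for i in range(len(seq) - len(signal) + 1)
            if seq[i:i + len(signal)] == signal
        ],
    }
```

Applied to the toy sequence `"TGTAGCTATAAAGGCCAATAAAGCTTCA"`, the function reports TG and CA termini and one AATAAA signal at position 16. Real LTRs are several hundred nucleotides long and degraded ERV copies often lack one or more of these features, so a check like this is best treated as a first filter.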


1.3 Retroviral life cycle

The life cycle of an ERV starts with an initial germline colonisation event by an exogenous retrovirus (the founder retrovirus), followed by amplification of ERVs via retrotransposition or reinfection. Each independent colonisation event defines a new ERV lineage (Tristem, 2000). An integrated ERV may also produce replication-competent exogenous retroviruses, which can reinfect the host genome (Figure 1.3.1).

The life cycle of retroviruses can be divided into two phases: the early phase covers infection, penetration and integration of the viral cDNA into the cell genome, whereas the late phase begins with the expression of viral genes and ends with the release and maturation of progeny virions (Figure 1.3.2). The initial step of the retroviral life cycle is the adsorption of retroviral particles onto the surface of the target cell, although it remains unclear whether binding occurs through specific interactions. Evidence has shown that attachment usually involves molecules which are distinct from the viral receptor responsible for the entry process (Sharma et al., 2000). For example, although HIV entry into target cells involves CD4 and a coreceptor, early attachment of virions to various other cell surface molecules (Ugolini et al., 1999), including heparan sulfate proteoglycan (Mondor et al., 1998), LFA-1 (Fortin et al., 1998) and nucleolin (Nisole et al., 1999), has been observed.


Figure 1.3.1 The life cycle of an ERV lineage, showing the colonisation, amplification, fixation, loss, inactivation and XRV release stages (modified from Gifford and Tristem, 2003).

Following the initial binding step, retroviruses use cell surface proteins as specific receptors to enter their target host cells. Next, the viral core enters the cytoplasm, where the viral RNA is reverse transcribed by the virus-encoded enzyme RT (Figure 1.3.2). Immediately after its release into the cytoplasm, the viral core undergoes a process known as uncoating. The core partially disassembles, allowing RT to access the RNA genome as a template to synthesise a complementary DNA strand. Reverse transcription is primed by a host tRNA bound to the primer binding site of the retroviral RNA and produces double-stranded DNA, which is integrated into the host genome to form a provirus. Reverse transcriptase also has RNaseH activity, enabling digestion of the viral RNA genome strand.

The retroviral life cycle requires the integration of viral DNA into the host cell genome to form a “provirus”. Thus, the reverse-transcribed DNA, as part of a pre-integration complex, must enter the nucleus. For most retroviruses, pre-integration complexes can only enter the nucleus during mitosis, when the nuclear membrane breaks down (Lewis and Emerman, 1994; Roe et al., 1993). Integration occurs in two catalytic steps, referred to as end processing and joining. End processing involves cleavage of the terminal two bases from the 3ʹ ends of the viral cDNA and can occur within the cytoplasm. During the joining reaction, the nucleophilic oxygen of the 3ʹ hydroxyl group attacks the target DNA (Engelman et al., 1991; Vink et al., 1991). The proviral ends are cleaved at the bases TG on the 5ʹ strand and CA on the 3ʹ strand; this is highly conserved among retroviruses and retrotransposons (Katz and Skalka, 1994). Repair of the damage caused to the host DNA by integration has been suggested to be mediated by cellular enzymes (Hindmarsh and Leis, 1999). Integration is accompanied by the generation of short direct repeats of the target site (4–6 bp, with the length determined by the virus) flanking the provirus (Craigie et al., 1990) (Figure 1.3.4).
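The origin of the short direct repeats can be sketched in Python as follows. The sketch assumes that the target DNA is cut with a stagger of 4–6 bp and that gap repair copies the staggered bases on both sides of the insertion, which is the usual explanation for target site duplication. Only one DNA strand is modelled, and the function name and toy sequences are hypothetical.

```python
def integrate(target: str, provirus: str, position: int, stagger: int = 4) -> str:
    """Insert a provirus into target DNA and duplicate the staggered target bases.

    `position` is the 0-based start of the staggered cut in the target and
    `stagger` is its length in bp (assumed to be fixed for a given virus).
    """
    if stagger < 1:
        raise ValueError("stagger must be at least 1 bp")
    if not 0 <= position <= len(target) - stagger:
        raise ValueError("cut site lies outside the target sequence")
    duplicated = target[position:position + stagger]
    left = target[:position]
    right = target[position + stagger:]
    # After gap repair the same target bases flank the provirus on both sides.
    return left + duplicated + provirus + duplicated + right
```

For instance, `integrate("AAACCGGTTT", "TGACGTCA", position=3, stagger=4)` returns `"AAACCGGTGACGTCACCGGTTT"`, in which the toy provirus (beginning TG and ending CA) is flanked by two copies of CCGG. Identical flanking repeats of this kind are one way to recognise an intact integration event in a host genome.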

Integration is not sequence specific, and thus retroviruses may be able to insert anywhere in the host genome (Katz and Skalka, 1994; Withers-Ward et al., 1994); however, some evidence indicates that integration does not occur randomly. Murine leukaemia virus (MLV) prefers to integrate near the start of transcription units (distributed evenly upstream and downstream), whereas HIV-1 prefers to integrate anywhere except upstream of the transcription start site (Barr et al., 2006; Barr et al., 2005; Ciuffi et al., 2006; Ciuffi et al., 2005; Lewinski et al., 2006; Lewinski et al., 2005; Mitchell et al. 2004; Wu et al., 2003; Schröder et al. 2002). The difference between the integration profiles suggests that there may be fundamental mechanistic differences influencing the site preferences of MLV and HIV.

[Figure 1.3.2 labels: viral entry; exit.]

Figure 1.3.2 Retrovirus life cycle. An infectious retrovirus attaches and penetrates a host cell membrane, then reverse transcribes its RNA genome, and integrates a DNA copy into the host genome. The viral products are translated using host cell machinery, followed by assembly at the cell surface. A new infectious virus buds off from the host cell (modified from Stoye, 2012).


[Figure 1.3.3 step labels: initiation of (-) strand DNA synthesis; the 5ʹ end of the viral RNA genome is degraded by the RNaseH activity of RT as the (-) strand DNA is synthesised; first template exchange; the RNA genome continues to be degraded as (-) strand DNA is synthesised; (+) strand DNA synthesis begins, primed by the ppt RNA; (+) strand DNA synthesis; the pbs is copied twice (once from the RNA genome, once from the tRNA primer); RNaseH endonuclease activity of RT removes both primer RNAs; DNA ends are connected by annealing at complementary PBS sequences; second template exchange is facilitated by annealing of PBS sequences.]

Figure 1.3.3 Retroviral reverse transcription (modified from http://www.twiv.tv/reverse-transcription/)

[Figure 1.3.4 labels: U3-R-U5 (both LTRs); OH; target DNA binding; gap repair.]

Figure 1.3.4 Retroviral DNA integration (modified from Coffin, 1997)

After integrating into the host genome, proviruses are expressed like cellular genes until they acquire inactivating stop codons or frameshift mutations. An integrated provirus is transcribed by host cell RNA polymerase II, the enzyme which synthesises cellular mRNAs and some small nuclear RNAs. Transcription factor binding sites present in the LTRs of the provirus are involved in recruiting the RNA polymerase. Following transcription of the provirus, retroviral RNA transcripts undergo cap addition at the 5ʹ end, cleavage and polyadenylation at the 3ʹ end, and splicing. All retroviruses synthesise at least two forms of mRNA: unspliced and spliced. Unspliced full-length mRNA gives rise to the gag and pol products, which are synthesised as either the gag polyprotein or the gag-pol precursor. Spliced mRNAs encode genes located downstream of the 5ʹ portion of the genomic RNA, mainly the env protein.

Simple retroviruses splice a subset of the genomic RNA into a transcript which encodes the env protein. The splicing activity of simple retroviruses may be downregulated by control elements or via inefficient use of splice sites (McNally et al., 1991; Katz and Skalka, 1990). In contrast, complex retroviruses produce both singly and multiply spliced transcripts that encode not only the env protein but also regulatory and accessory proteins specific to these retroviruses. Some complex retroviruses regulate splicing through the interaction of host factors with the products of accessory genes (Bakker et al., 2001; Maury, 1998; Cullen, 1992). For example, HIV uses multiple alternative 5ʹ and 3ʹ splice sites to generate over 40 different mRNA species, including multiply spliced (1.8 kb) and singly spliced (4 kb) RNAs. It is also worth noting that the proportion of spliced and unspliced mRNAs must be regulated to ensure that sufficient unspliced RNA reaches the cell cytoplasm for the translation of gag and pol (Katz et al., 1998).


For retroviral gene expression, both spliced and unspliced RNAs must be transported to the cytoplasm. In the cytoplasm, unspliced viral mRNA serves two different roles. Unspliced genome-length RNA either acts as genomic RNA, which is packaged into the assembling virion, or serves as an mRNA template that is translated by the host protein synthesis machinery. The gag gene is translated into a single polyprotein, which is subsequently cleaved to yield the structural components of the viral core, including the MA, CA and NC proteins. The termination of gag can be bypassed so that translation continues downstream, resulting in a gag-pro-pol fusion protein (Opperman et al., 1997; Swanstrom and Wills, 1997). Bypass of the termination codon occurs through one of two mechanisms (Figure 1.3.5). The first mechanism is frameshifting, which is used by most retroviruses. This involves a ribosome slipping backward by one nucleotide (−1 frameshift) during gag translation, thereby bypassing the termination codon (Jacks et al., 1988). The second mechanism is read-through (termination) suppression, whereby the gag termination codon is occasionally misread (Yoshinaka et al., 1985a, 1985b). For example, one gag-pro-pol precursor is synthesised for every 10–20 gag polyproteins during MLV replication (Jamjoom et al., 1977). Gln tRNA can misread UAG as CAG about once in every 20 times. The efficiency of the suppression of termination is low, typically about 5%, and therefore the gag protein outnumbers the gag-pol polyprotein by about 20-fold (Coffin, 1997).
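As a rough arithmetic check on the figures above, the following Python sketch converts a bypass efficiency into the expected ratio of gag to gag-pro-pol products. It assumes, for illustration only, that every ribosome either terminates at the gag stop codon or bypasses it with a fixed probability; the function name is hypothetical.

```python
def gag_to_gagpol_ratio(bypass_efficiency: float) -> float:
    """Expected number of gag polyproteins made per gag-pro-pol fusion protein.

    Assumes each translation event independently bypasses the gag stop codon
    with probability `bypass_efficiency` (a simplifying assumption).
    """
    if not 0.0 < bypass_efficiency < 1.0:
        raise ValueError("bypass_efficiency must lie strictly between 0 and 1")
    return (1.0 - bypass_efficiency) / bypass_efficiency


# Efficiencies of 5% and 10% bracket the 10-20 range quoted for MLV.
for efficiency in (0.05, 0.10):
    ratio = gag_to_gagpol_ratio(efficiency)
    print(f"bypass efficiency {efficiency:.0%}: about {ratio:.0f} gag per gag-pro-pol")
```

Under this assumption, an efficiency of 5% gives a ratio of approximately 19, which is consistent with the roughly 20-fold excess of gag quoted above.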


Figure 1.3.5 (a) MLV readthrough suppression, including the proposed pseudoknot downstream from the amber termination codon. (b) Frameshift suppression in the synthesis of gag-pro-pol. (Pictures from http://www.slideshare.net/TnHoLm/retroviruses-and-hiv)

The env glycoprotein is synthesised from a spliced form of viral genomic RNA. The mature protein products of the env gene comprise the SU and TM proteins. The SU protein is responsible for the receptor binding function, and thus determines the tropism, whereas the TM protein mediates virus entry by triggering virus-host cell membrane fusion. Translation and extensive modification of the env glycoprotein occur within the cellular rough endoplasmic reticulum, and subsequently, cleavage by a cellular protease occurs in the Golgi apparatus.


Once the gag, gag-pro-pol and env polyproteins have been synthesised, they come together with two copies of viral RNA and tRNA primers at the site of assembly, where they form viral particles. The precise location of assembly varies among retroviruses, although it usually takes place either within the cytoplasm or at the plasma membrane. For type C retroviruses, which include the alpharetroviruses, gammaretroviruses and lentiviruses, assembly occurs at the plasma membrane. However, for type B and type D retroviruses, assembly takes place within the cytoplasm (Coffin, 1997). Eventually, the viral core buds through the cell membrane and detaches from the host cell, and the region of host cell membrane that wraps the core forms the retroviral envelope. At the time of budding, proteolysis converts the immature virion into the mature infectious form (Oroszlan and Luftig, 1990). The mature virion can then infect other host cells.


1.4 Retroviral diversity

The phylogenetic relationship of retroviruses is usually reconstructed from the viral pol gene, which is the most conserved gene across members of the Retroviridae family. Retroviruses are subdivided into seven genera by the International Committee on Taxonomy of Viruses (ICTV) (Van Regenmortel et al., 2000): alpharetroviruses (avian type C retroviruses), betaretroviruses (mammalian type B and type D retroviruses), gammaretroviruses (mammalian type C), epsilonretroviruses, deltaretroviruses, lentiviruses and spumaretroviruses (spumaviruses). On the basis of sequence similarity to infectious exogenous retroviruses, retroviruses can also be loosely classified into three phyletic classes: Class I (gammaretroviruses and epsilonretroviruses), Class II (lentiviruses, deltaretroviruses, alpharetroviruses and betaretroviruses) and Class III (spumaretroviruses). This classification depends on the length of the target site duplication: 4 bp for Class I, 6 bp for Class II and 5 bp for Class III. However, neither of these classifications accommodates all exogenous and endogenous retroviruses. For instance, newly discovered endogenous lentiviruses fall outside these three classes (Hayward, 2014; Hayward, 2013; Gifford et al., 2008; Katzourakis et al., 2007). Moreover, a basal clade, composed of viruses from turtles, alligators and frogs, falls outside of all retroviral genera (Hayward et al., 2014). The distribution of retroviral sequences in vertebrate hosts is shown in Figure 1.4.1.
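The three-class scheme described above can be written down as a small lookup table. The Python sketch below encodes only the genus assignments and target site duplication lengths stated in the text; the names `PHYLETIC_CLASSES` and `classify_genus` are hypothetical, and lineages that fall outside the scheme simply return `None`.

```python
# Phyletic classes as described in the text; genus names are in lower case.
PHYLETIC_CLASSES = {
    "Class I": {"genera": ("gammaretrovirus", "epsilonretrovirus"), "tsd_bp": 4},
    "Class II": {
        "genera": ("lentivirus", "deltaretrovirus", "alpharetrovirus", "betaretrovirus"),
        "tsd_bp": 6,
    },
    "Class III": {"genera": ("spumaretrovirus",), "tsd_bp": 5},
}


def classify_genus(genus):
    """Return (class name, target site duplication length in bp), or None."""
    key = genus.strip().lower()
    for class_name, info in PHYLETIC_CLASSES.items():
        if key in info["genera"]:
            return class_name, info["tsd_bp"]
    return None  # not covered by the three-class scheme


print(classify_genus("Gammaretrovirus"))
print(classify_genus("basal reptile and amphibian clade"))
```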

Alpharetroviruses include many avian retroviruses. Their viral particles exhibit type C morphology and thus alpharetroviruses are also termed C-type elements. Viral assembly and budding both occur at the plasma membrane. Alpharetroviruses have been discovered in both exogenous and endogenous forms. Nevertheless, their host range is restricted to chickens and some other birds. Alpharetroviruses are simple viruses with a simple genomic composition; the viral internal region is 6.8 to 9 kb in length, and the LTRs are approximately 0.3 kb, which is relatively short compared with those of other genera. Representative species include RSV and ALV, and RSV is known to be an oncogene-containing virus. The viruses in this genus are associated with malignancies and osteopetrosis.

Figure 1.4.1. Distribution of retroviral sequences within vertebrates and other Metazoa. Endogenous retroviral sequences are identified by +, whereas – indicates that no retroviral sequences have yet been identified. The arrow indicates the hypothetical host range boundary, which is immediately below the class of sharks (modified from Herniou et al., 1998).

Phylogenetic analyses reveal a close relationship between alpharetroviruses and betaretroviruses. Betaretroviruses can exist in both exogenous and endogenous forms, for example, MMTV and JSRV. In addition, betaretroviruses infect a wide range of species including primates, ruminants, rodents and marsupials. The viruses in this genus are associated with immunodeficiencies and pulmonary cancer in mammals. Betaretroviruses are considered simple viruses, despite the fact that MMTV possesses an accessory gene (sag) encoding a super-antigen. The betaretroviral core assembles in the cytoplasm before migrating to the plasma membrane. The internal region is about 7.5 to 9 kb in length and the LTRs vary between 0.3 and 1 kb.

Figure 1.4.2 Phylogenetic tree of retroviruses based on reverse transcriptase protein sequences. (Jern et al., 2005)


Deltaretroviruses have only been identified in exogenous forms, including human T-cell lymphotropic virus 1 and 2 (HTLV1 and HTLV2) and bovine leukaemia virus (BLV). The length of the internal region is 8 to 9 kb and the LTRs are about 0.7 kb in length. Deltaretroviruses are considered complex viruses, and contain at least two regulatory proteins, Tax and Rex (Coffin, 1997).

Gammaretroviruses are widely distributed among mammalian species in both exogenous and endogenous forms. Many viruses in this genus are infectious, causing malignancies, immunosuppression and neurological disorders. The gammaretroviral virions exhibit a type C morphology, and capsid assembly and budding occur at the plasma membrane. The representative species of this genus is MLV. In addition to mammals, gammaretroviruses have also been identified in birds, for example reticuloendotheliosis virus (REV) and spleen necrosis virus (SNV), and reptiles (Poulet et al., 1994).

Epsilonretroviruses have only been identified in fish species, implying they have a restricted host range. Species in this genus include walleye dermal sarcoma virus (WDSV) and walleye epidermal hyperplasia virus (WEHV I and II). Epsilonretroviruses exhibit a degree of relatedness to gammaretroviruses based on phylogenetic analyses of the pol genes. Epsilonretroviruses, unlike gammaretroviruses, encode a specific set of accessory genes (Llorens et al., 2011). The internal region of the viruses is 8 to 9 kb in length and the LTRs are approximately 0.5 kb in length. Among all the retroviruses, the WDSV genome is the largest, at approximately 12 kb. In addition, epsilonretroviruses are the only known retroviruses that encode an accessory gene product with homology to cellular genes (LaPierre et al., 1999; LaPierre et al., 1998).

Spumaviruses include the human foamy virus (HFV), simian foamy virus (SFV) and bovine syncytial virus (BSV). Spumaviruses have complex genomes, and contain the accessory genes tas and bet downstream of the env gene. The internal region is approximately 13 kb in length and the LTRs are approximately 1.7 kb in length. These viruses have been widely identified in mammalian species in exogenous forms; however, no pathogenicity has been described for spumaviruses. Endogenous spumaretroviruses have been discovered in sloths (Katzourakis et al., 2009). In addition, evidence has shown that spumaviruses have coevolved with their hosts for more than 100 million years.

Lentiviruses are complex retroviruses with an internal region 8 to 9 kb in length and LTRs 0.2 to 1 kb in length. Members of this genus encode at least two accessory genes, Tat and Rev, in addition to the structural genes. Primate lentiviruses encode additional accessory genes such as Nef, Vpr, Vpu, Vpx and Vif. Lentiviruses are mainly exogenous viruses found in mammals. Recent studies have found endogenous forms in primates (Hron et al., 2014), carnivores (Cui and Holmes, 2012; Han and Worobey, 2012b), lemurs (Gifford et al., 2008) and rabbits (Katzourakis, 2007).

Several lines of evidence have shown a close relationship between retroviruses and LTR retrotransposons. The pol genes of Ty3/gypsy elements and retroviruses, especially the RT domains, have been shown to be very similar (Xiong and Eickbush, 1990). Taken together with evidence from phylogenetic analyses, this similarity supports the well-accepted view that retroviruses evolved from Ty3/gypsy LTR retrotransposons. Ty3/gypsy LTR retrotransposons are widely distributed across plants, animals and fungi. Ty3/gypsy LTR retroelements usually have a variable genome size between 4 and 15 kb. The main difference between Ty3/gypsy elements and retroviruses has long been considered to be that retroviruses have an additional open reading frame coding for the env gene, which is necessary for viral transfer between cells. However, several lineages belonging to Ty3/gypsy and other retrotransposons have been shown to also contain env-like genes (Eickbush and Malik, 2002). For example, one gypsy element found in Drosophila melanogaster is capable of horizontally transferring itself from a Drosophila species to other hosts (Terzian et al., 2001; Alberola and de Frutos, 1996; Kim et al., 1994; Song et al., 1994; Mizrokhi and Mazo, 1991).


1.5 Retroviruses and host genome

The eukaryotic genome is a complicated and dynamic structure. About 42% of the human genome is composed of retrotransposon sequences, whereas only ~3% consists of protein-coding sequences. In fact, since their initial discovery in 1956 by Barbara McClintock in maize DNA (McClintock, 1956), transposable elements have been identified in the genomes of almost all eukaryotic organisms. For example, they constitute more than 22% of the Drosophila genome (Kapitonov and Jurka, 2003), 50% of the maize genome (Wessler, 1998) and 42% of human DNA (Lander, 2001). They were initially considered to be “junk” DNA or genetic parasites, but are now suggested to be functional genetic elements that can alter gene expression, provide new genetic material and promote genome evolution (Böhne et al., 2008; Goodier and Kazazian, 2008; Han and Boeke, 2005; Beauregard et al., 2001; Kidwell and Lisch, 2001).

ERVs can mediate changes in the genome of their host through insertion. The insertion of a retrovirus within or near a host gene can disrupt the normal function of that gene. In addition, retroviral LTRs can be involved in regulating nearby genes, and have been incorporated into the normal regulation of mammalian genes, most frequently as promoters, enhancers or polyadenylation signals (Cohen et al., 2009). Whole-genome analysis reveals that about 25% of all human promoters contain retroelements in their sequences, suggesting that retrotransposons serve as alternative promoters (Van de Lagemaat et al., 2003). ERVs can also act as transcriptional enhancers for cellular genes. In humans, all genes for salivary amylase contain a full-length insertion of HERV-E upstream of their transcription start site (Emi et al., 1988; Samuelson et al., 1988). It has been hypothesised that the ERV insertion activated a cryptic promoter that drives the transcription of amylase within the salivary glands.

ERVs can also mediate changes in the genome of their host through recombination. Recombination is a powerful evolutionary factor that produces genetic diversity by using already existing biological information. Recombination can occur in several ways: (1) Homologous recombination occurs between the two LTRs of an integrated provirus, removing the internal coding regions and leaving a solo LTR at the locus (Hughes and Coffin, 2004; Stoye, 2001; Johnson and Coffin, 1999). Solo LTRs are found at least 10 times more frequently than their related ancestral proviruses in mammalian genomes (Benachenhou et al., 2009). Recent studies have shown that solo LTR formation through recombination occurs more readily soon after integration than after mutations have accumulated in the proviral LTRs (Belshaw et al., 2007). In addition, the persistence of proviruses relies on the recombination rate and tolerance within the host genome (Katzourakis et al., 2007). (2) Homologous recombination between two proviruses in the same orientation present on the same chromosome results in the loss of viral and host sequences between recombination locations. (3) Recombination between 5ʹ and 3ʹ LTRs of a provirus on sister chromatids results in a tandem provirus. (4) Gene conversion leads to the non-reciprocal exchange of sequences without proviral loss.
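The first recombination outcome listed above, solo LTR formation, can be illustrated with a toy string model. The Python sketch below treats a locus as flanking DNA plus a provirus with two identical LTRs, and collapses everything between the LTRs; the sequences and the function name are invented for illustration, and real LTR pairs diverge over time rather than remaining identical.

```python
def form_solo_ltr(locus: str, ltr: str) -> str:
    """Collapse a provirus (LTR-internal-LTR) into a single solo LTR.

    Models homologous recombination between the first two exact copies of
    `ltr` in `locus`; the locus is returned unchanged if fewer than two
    copies are present.
    """
    first = locus.find(ltr)
    if first == -1:
        return locus
    second = locus.find(ltr, first + len(ltr))
    if second == -1:
        return locus
    # Keep one LTR copy and drop the internal region plus the second copy.
    return locus[: first + len(ltr)] + locus[second + len(ltr):]


# Toy sequences (lower case = host flank, upper case = provirus).
ltr = "TGTAGT"
provirus_locus = "aaaa" + ltr + "GAGPOLENV" + ltr + "cccc"
print(form_solo_ltr(provirus_locus, ltr))  # host flanks with one LTR between them
```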

Recombination can lead to deleterious, advantageous or neutral gene rearrangement in a host genome (Boeke and Stoye, 2002). For example, human diseases caused by recombination between retroelements have been reported: L1 recombination-associated deletion (L1RAD) events cause human diseases such as glycogen storage disease (Burwinkel and Kilimann, 1998), Alport syndrome-diffuse leiomyomatosis (Segal et al., 1999) and Ellis-van Creveld syndrome (Temtamy et al., 2008), and recombination between HERV-I elements causes complete germ cell aplasia (Kamp et al., 2000). However, apart from deleterious effects, recombination between retroelements can also be advantageous. Proposed advantages include removing deleterious mutations and insertions from the genome, increasing the variability within a population and accelerating the spread of beneficial mutations (Michod and Levin, 1988).


1.6 Evolutionary studies of retroviruses and their hosts

The distribution and diversity of ERVs have been shaped by the interaction between ERVs and their hosts. This can be observed by comparing retroviral phylogenies with those of their respective hosts. If endogenous retroviruses coevolve with their hosts, then the retroviral phylogenetic tree should mirror that of the host lineage. This idea was first suggested at the beginning of the 20th century in studies by Kellogg (1913) and Fahrenholz (1913), who proposed the hypothesis, known as Fahrenholz’s rule, that a parasite phylogeny mirrors that of its host (Fahrenholz, 1913). A similar idea was proposed by Szidat (1940), namely that primitive hosts harbour primitive parasites.

As DNA sequencing was impossible at the time of these proposals, the phylogeny of the parasites was used to build the phylogeny of hosts and vice versa. This led to two phylogenies that tended to be congruent, giving rise to the widespread belief that co-speciation was common. It was not until the late 1980s that robust host and parasite phylogenies were built independently and used to test co-speciation more rigorously (Hafner and Nadler, 1988).
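The idea that a parasite tree should mirror the host tree can be made concrete with a minimal congruence check. The Python sketch below represents rooted trees as nested tuples, relabels each virus tip with its host, and asks whether both trees contain exactly the same clades. It assumes one virus per host and strict topological identity, which is a much cruder criterion than the statistical tests used in formal co-speciation studies; all tree and taxon names are hypothetical.

```python
def leaves(tree):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the tip sets of all internal nodes of a rooted tree."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clades(child)
    return found


def relabel(tree, virus_to_host):
    """Replace every virus tip with the name of its host."""
    if isinstance(tree, str):
        return virus_to_host[tree]
    return tuple(relabel(child, virus_to_host) for child in tree)


def is_congruent(host_tree, virus_tree, virus_to_host):
    """True if the virus tree mirrors the host tree clade for clade."""
    return clades(host_tree) == clades(relabel(virus_tree, virus_to_host))


# Hypothetical example with three hosts and one virus per host.
hosts = (("human", "chimpanzee"), "macaque")
links = {"V_hum": "human", "V_chi": "chimpanzee", "V_mac": "macaque"}
print(is_congruent(hosts, (("V_hum", "V_chi"), "V_mac"), links))  # mirrors hosts
print(is_congruent(hosts, (("V_hum", "V_mac"), "V_chi"), links))  # incongruent
```

In this toy setting, an incongruent result is the kind of pattern that would prompt a closer look for host switching, although incongruence alone does not identify the event responsible.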

ERVs could be used as informative markers for studies of host-virus coevolution, especially with regard to viral horizontal transmission. Horizontal transmission can be indicated by incongruence between host and virus trees (Cui et al., 2012; Dimcheff et al., 2000; Martin et al., 1999). Host life history traits may be linked to raised levels of retroviral horizontal transmission. Studies have shown that congruence between spumavirus and primate phylogenies is apparent (Broussard et al., 1997; Schweizer and Neumann-Haefelin, 1995; Bieniasz et al., 1995). Host-virus codivergence has also been suggested in lentiviruses (Beer et al., 1999). However, the correspondence between lentivirus and primate phylogenies is partially due to preferential host switching (Charleston and Robertson, 2002), from primates to humans (Chen et al., 1997; Gao et al., 1999). This is consistent with the extraordinarily high evolutionary rate of lentiviruses (between 10⁻² and 10⁻³ substitutions per site per year), which leads to variable genomes. In contrast, spumaviruses show genomic stability within their hosts, with a rate of about 3 × 10⁻⁴ substitutions per site per year (Schweizer et al., 1999). Spumaviruses have thus evolved a life strategy based on the suppression of replication, whereas lentiviruses have evolved high replication and mutation rates in order to escape host immune responses. Horizontal transmission could also be linked to virus traits, such as the receptor specificity of the env protein. Occasional recombination between different viruses could lead to the exchange of env genes, and such events may be associated with extended or altered host ranges (Bénit et al., 2001; Van der Kuyl et al., 1999).

Transmission between distantly related hosts (e.g. cross-class transmission) occurs rarely, and only two cases have been identified. Reticuloendotheliosis viruses (REV) arose via horizontal transmission from mammals to birds (Martin et al., 1999; Gao et al., 1999), while koala retrovirus (KoRV) is the result of horizontal transmission from rodents or other mammals to koalas (Fiebig et al., 2006). In contrast, cross-species transmission between related hosts (e.g. within the same family) apparently occurs more frequently (Martin, 1999); furthermore, it has been suggested that horizontal transmission between related species is a major mode of exogenous RNA virus evolution (Kitchen et al., 2011).


Retroviruses have long been known to be capable of infecting new hosts by cross-species transmission and causing diseases. The best-known examples are the human immunodeficiency viruses (HIV-1 and HIV-2), which are the products of such cross-species transmissions from non-human primates to humans (Gao et al., 1999; Gao et al., 1994). There are abundant examples of naturally occurring trans-species transmissions (Table 1.6.1). Some have resulted in emerging fatal diseases, such as AIDS. Interestingly, all known simian immunodeficiency viruses (SIV) are apathogenic in their natural hosts, but transmission into humans can induce AIDS. In contrast, SFVs are widely distributed in non-human primates, but none has yet led to the development of disease. The concordance between the phylogeny of SFVs and their host species indicates a long-standing coexistence (Schweizer and Neumann-Haefelin, 1995).

KoRV is an example of recent cross-species transmission and endogenisation (Tarlinton et al., 2006; Tarlinton et al., 2005; Hanger et al., 2000). Like many other retroviruses, KoRV induces myeloid leukaemia, lymphomas and immunodeficiency diseases in koalas. Several lines of evidence have shown that KoRV is closely related to MLV and gibbon ape leukaemia virus (GaLV) (Hanger et al., 2000). This raises an interesting question: since gibbons and koalas live on different continents, intermediate vectors must have been involved in the viral transmission. It has been suggested that GaLV originated in Southeast Asian mice (Mus caroli) (Lieber, 1975). However, a recent study disagreed with this point because a retrovirus derived from laboratory mice shows an even closer phylogenetic relationship to KoRV than do those of Southeast Asian mice (Fiebig, 2006). The transmission route has not yet been determined, and cross-species transmissions between phylogenetically distant species (e.g. between classes) remain rare.

The ecological and evolutionary relationships between retroviruses and their hosts remain largely unknown. In fact, the exact pattern and timing of individual horizontal transmissions is unlikely to be determined because both ERV and host information are patchy. However, detailed taxon sampling may provide more information related to ERV distribution in terms of their host and ecological niches, and this may allow some general direction of horizontal transmission to be inferred. The most widely accepted co-phylogeny method is the event-based method. The first event-based method was Brooks’ parsimony analysis (Brooks, 1981). This method was widely used in the 1980s, but received heavy criticism because of its requirement for numerous a posteriori interpretations (Page, 1994a, 1994b). In 1990, Page developed reconciliation analysis, which considers parasites as an evolutionary lineage rather than a character state. This method attempts to reconcile host and parasite phylogenies by maximising the number of co-speciation events and minimising the number of host switching events. However, this method can overestimate co-speciation events because it assumes that co-speciation is more likely than host switching or other events. In addition, it interprets congruence as evidence for co-speciation, which is not necessarily the case. Recent reconciliation analyses consider additional possible evolutionary events, such as co-speciation, host switching, duplication and extinction.

Event-based methods find the most parsimonious scenario by minimising the total cost. The most popular cost-based methods include TreeMap 3 (Charleston and Robertson, 2002), TARZAN (Merkle and Middendorf, 2005) and Jane 4 (Conow et al., 2010).
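The cost-minimisation principle behind event-based methods can be sketched in a few lines of Python. The example below scores candidate reconciliations by summing event counts weighted by per-event costs and returns the cheapest one. The cost values, event counts and scenario names are illustrative assumptions rather than the defaults of TreeMap, TARZAN or Jane, and real tools search the space of reconciliations instead of comparing a hand-written list.

```python
# Illustrative per-event costs (assumed values, not taken from any tool).
EVENT_COSTS = {"cospeciation": 0, "duplication": 1, "host_switch": 2, "loss": 1}


def total_cost(event_counts, costs=EVENT_COSTS):
    """Sum of (number of events x cost per event) over all event types."""
    unknown = set(event_counts) - set(costs)
    if unknown:
        raise ValueError(f"no cost defined for: {sorted(unknown)}")
    return sum(costs[event] * count for event, count in event_counts.items())


def most_parsimonious(scenarios, costs=EVENT_COSTS):
    """Return the name of the scenario with the lowest total cost."""
    return min(scenarios, key=lambda name: total_cost(scenarios[name], costs))


# Two hypothetical explanations of the same pair of host and virus trees.
scenarios = {
    "one host switch": {"cospeciation": 4, "host_switch": 1},
    "duplication and losses": {"cospeciation": 5, "duplication": 1, "loss": 3},
}
for name, counts in scenarios.items():
    print(name, total_cost(counts))
print("preferred:", most_parsimonious(scenarios))
```

Because the preferred scenario depends directly on the chosen costs, it is worth repeating such an analysis under several cost schemes before drawing conclusions about host switching.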


Table 1.6.1 Natural cross-species transmission of lentiviruses and spumaviruses

Original virus | Original host species | Pathogenicity in original host | New host species | Virus in new host | Pathogenicity in new host | Reference
SIVcpz | Chimpanzee | Apathogenic | Human | HIV-1 | AIDS | Gao et al., 1999
SIVsm | Sooty mangabey | Apathogenic | Human | HIV-2 | AIDS | Gao et al., 1994
SIVsm | Sooty mangabey | Apathogenic | Rhesus monkey | SIVmac | AIDS |
STLV-1 | Mandrill, chimpanzee, baboon, rhesus monkey | T-cell leukaemia, immunodeficiency | Human | HTLV-1 | T-cell leukaemia | Koralnik et al., 1994; Voevodin et al., 1997
STLV-1 | Chimpanzee | T-cell leukaemia | African green monkey | STLV-1 | T-cell leukaemia | Courgnaud et al., 2004
SFV | African green monkey, baboon | Apathogenic | Human | SFV | Apathogenic | Heneine et al., 1998
SRLV (CaEV, MVV) | Small ruminants | Pathogenic | Small ruminants | SRLV | Pathogenic | Shah et al., 2004a; Shah et al., 2004b

Natural cross-species transmission of gammaretroviruses

Original virus | Original host species | Pathogenicity in original host | New host species | Virus in new host | Pathogenicity in new host | Reference
PriERV | Non-human primates | ? | Cat | RD-114 | | van der Kuyl et al., 1999; Benveniste and Todaro, 1974
Rodent virus (?) | ? | | Koala; gibbon ape | KoRV; GaLV | Lymphoma, immunodeficiency, myeloid leukaemia | Hanger et al., 2000; Delassus et al., 1989
KoRV | Koala | Pathogenic | Rat | | Tumour | Fiebig et al., 2006
MoMuLV | Mouse | T-cell leukaemia | Rhesus monkeys | | Apathogenic |
Mammalian virus | | Pathogenic | Duck | SNV | Immunodeficiency | Kewalramani et al., 1992

1.7 Co-option of retroviral genes

Most retroviral insertions into host genomes are likely either to be deleterious, such that the insertions are never passed on, or to have negligible consequences for host biology, in which case they are expected to fade away via the accumulation of mutations. However, some retroviral genes have been surprisingly preserved against mutational inactivation, and some could possibly be co-opted to play a role in normal host development. The best-known class of domesticated genes is the syncytin genes, which were captured from retroviral env genes, such as that of the defective human retrovirus HERV-W (Blond et al., 2000). The viral env gene encodes two essential proteins: the SU protein serves to bind the virion to the host cell and the TM protein serves to mediate the fusion of the virion with the host cell membrane during entry. Some envelope proteins also contain a peptide motif, the immunosuppressive domain (ISD), within the TM domain, and this has potent immunosuppressive activities which suppress the production of cytokines and cell-mediated immunity (Haraguchi et al., 1997). These activities (mediating cell-cell fusion and immune system suppression) are essential to the survival of the developing fetus in many mammalian species. In primates (syncytin-2) and muroids (syncytin-B), in vivo assays based on the inhibition of tumour rejection by the mouse immune system have suggested that the syncytin genes display immunosuppressive activity (Mangeney et al., 2007). Syncytin domestication from a retroviral env gene has been shown to have occurred independently at least seven times during the evolution of mammalian species (Vernochet et al., 2011; Heidmann et al., 2009; Du Pressoir et al., 2005; Renard et al., 2005; Blond et al., 2000; Mi et al., 2000). The oldest known syncytin gene is the Syncytin-Car1 gene, which was domesticated at least 60 million years ago, before the radiation of Carnivora (Cornelis et al., 2012).


Chapter II

A basal retrovirus in an ancient vertebrate lineage

2.1 Introduction

Endogenous retroviruses (ERVs) constitute a large proportion of vertebrate genomes. Approximately 8% and 10% of the human and mouse genomes, respectively, are composed of traces of past infection by retroelements (Waterson et al., 2002; Lander et al., 2001). Many have integrated into host genomes where they are capable of remaining in the same locus for millions of years (Xiong and Eickbush, 1990; Bowerman, 1989). During this time, they will have undergone significant mutational changes and no longer code for infectious viruses. While most ERVs are nonpathogenic, their infectious counterparts, exogenous retroviruses, are usually horizontally transmitted and cause neurological disease, malignancies and immunodeficiencies (Coffin, 1992). The International Committee on Taxonomy of Viruses (ICTV) classifies retroviruses into seven main genera: Alpharetrovirus, Betaretrovirus, Deltaretrovirus, Epsilonretrovirus, Gammaretrovirus, Lentivirus and Spumavirus. However, most retroviruses within these genera have been isolated from mammals (Hayward et al., 2014), with the remainder found in birds (Lai et al., 2011; Gifford et al., 2005; Johnson and Heneine, 2001; Poulet et al., 1994; Wain-Hobson, 1994; Gak et al., 1991; Sambrook and Russell, 1989; Maeda, 1985), fishes (Schartl et al., 2013; Han and Worobey, 2012a; Llorens et al., 2009; Shen and Steiner, 2004; LaPierre et al., 1999; Holzschu et al., 1995; Martineau et al., 1992; Martineau et al., 1991), reptiles (Martin et al., 1999; Tristem et al., 1996; Maeda, 1985) and amphibians (Martin et al., 1999; Tristem et al., 1996). This uneven distribution among host species raises questions regarding the host range and the origin of the Retroviridae (Figure 2.1).

Figure 2.1 Distribution of retroviral sequences within vertebrates and other chordates. Retroviral sequences identified by previous studies are indicated by +, whereas – represents screened taxa from which retroviral sequences are yet to be identified.

To infect a new host, a virus must be able to efficiently infect suitable cells of new hosts, and this process can be restricted at many different levels. As discussed in Chapter 1, it is the product of the retroviral env gene (TM and SU) that allows retroviruses to attach to target cells and penetrate host cell membranes. The unique three-dimensional structure of the SU domain defines the receptor to which a virus binds, and therefore determines the host range and tissue specificity.

Successful host expansion has to be followed by successful viral genome replication and gene expression. Thus, some corresponding genomic changes in the virus are essential to overcoming these barriers. However, most viruses are poorly adapted after transfer to a new host, so the greater the rate of genetic variation, the more likely it is that a virus will adapt to a new host (Menéndez-Arias, 2009). The error-prone replication and short generation times of retroviruses, which are RNA viruses, allow them to infect new hosts more efficiently. Although certain retroviruses are not easily transmitted between distantly related species, there have been plenty of opportunities and time for them to colonise other vertebrate classes if this were possible.

Cartilaginous fishes (Chondrichthyes) are the most basal chordates in which retroviral sequences have been identified (Han et al., 2015; Herniou et al., 1998); however, the basal position of these retroviruses remains questionable. In 1998, Herniou et al. performed a broad-scale polymerase chain reaction (PCR) screening for the presence of ERVs within genomes from across 18 vertebrate orders, and suggested that the retrovirus found in lemon sharks (Negaprion brevirostris) was not at or close to the basal position of all other retroviruses. In addition, this retrovirus clustered with ERVs from a broad range of vertebrate hosts, including those identified in humans, birds and amphibians, suggesting a cross-species origin instead of a single origin. In the subsequent decade, most studies focused on retroviral investigations in higher vertebrates and the relevant diseases.


In recent years, because of improved sequencing technologies, a broad-scale phylogenomics study has provided the overall retrovirus distribution in representative species for each vertebrate class, supporting previous findings that alpharetroviruses have a host range restricted to avians, betaretroviruses are confined to higher vertebrates, and gammaretroviruses are the most widespread retroviral genus (Hayward et al., 2014; Hayward et al., 2013). In addition, a study found that the sea lamprey has not escaped retroviral activity (Hayward et al., 2014). In 2015, Han revealed the extensive retroviral diversity in a basal vertebrate, the elephant shark (Callorhinchus milii), and suggested that at least three independent retroviral infection events had occurred in the elephant shark over the last 50 million years. However, the possibility that these ERVs resulted from cross-species transmission events from other fish species cannot be excluded, leaving the question of their origin unresolved. Although the origin of vertebrate retroviruses remains unknown, it is well accepted that they evolved from Ty3/gypsy retrotransposons because of their close phylogenetic relationship and identical gag-pol genome structure (Malik and Eickbush, 2001; Llorens and Martin, 2001; Malik and Eickbush, 1999; Xiong and Eickbush, 1990). This view is supported by phylogenetic analysis of pol polyprotein domains such as RT, RNaseH and INT. A recent study reconstructed the phylogenetic tree based on both gag and pol polyproteins. It shows that the three classes of retroviruses (Classes I, II and III) were derived from three different Ty3/gypsy retrotransposons and that these independently acquired their env genes (Llorens et al., 2008).

A more detailed investigation of basal retroviruses to fill the gap in retroviral evolutionary history is now feasible. Advances in whole-genome sequencing and bioinformatics technologies have facilitated in silico screening of ERVs from vast genomic datasets. Accordingly, it is now possible to examine broad-scale patterns of retrovirus evolution by utilising phylogenetics to analyse homologous retroviral sequences across a wide range of host taxa. In this investigation, a conserved region of the pol gene was used to search basal vertebrate and invertebrate genomes for ERVs in order to build phylogenetic trees. The construction of a comprehensive phylogenetic tree from both endogenous and exogenous retroviruses from their diverse hosts provides a very important information source for the study of retrovirus-host evolution as a whole. Here, I aim to improve the retroviral phylogenetic tree by providing the missing link between retroviruses and their proposed ancestor, the Ty3/gypsy retrotransposons, in order to address the issue of the basal vertebrate host range boundary of the Retroviridae; furthermore, I investigate whether more retroviral genera remain to be discovered.

2.2 Materials and Methods

Identification of ERVs within the sequence database

Eukaryotic genomes (including the phyla Porifera, Cnidaria, Echinodermata and Mollusca, and basal chordates) were obtained from the EMBL/GenBank/DDBJ database (Table 2.1). Each database was screened using a tBLASTn search with part of the reverse transcriptase (RT) protein domain from representative retroviruses of different genera. These representative retroviruses included mouse mammary tumour virus (MMTV), gibbon ape leukaemia virus (GaLV), simian foamy virus type 1 (SFV1), walleye dermal sarcoma virus (WDSV) and snakehead retrovirus (SnRV). This method was expected to recover the largest possible number of retroviral sequences within host genomes.
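
The hit-collection step described above can be sketched as follows. This is only an illustration and not the pipeline used in this study: it assumes that the tBLASTn search was run with tabular output (-outfmt 6), and the function name, the record type and the E-value cut-off are hypothetical choices made for the example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str      # RT probe, e.g. the MMTV RT domain
    subject: str    # genomic contig or scaffold
    identity: float
    start: int      # subject coordinates, start <= end
    end: int
    evalue: float


def parse_tabular_hits(text: str, max_evalue: float = 1e-5) -> list[Hit]:
    """Parse BLAST tabular output (-outfmt 6) and keep hits below a cut-off.

    The default cut-off of 1e-5 is an illustrative assumption, not a value
    taken from this study.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        # Standard 12 columns: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        evalue = float(f[10])
        if evalue <= max_evalue:
            s1, s2 = int(f[8]), int(f[9])
            hits.append(Hit(f[0], f[1], float(f[2]), min(s1, s2), max(s1, s2), evalue))
    return hits
```

Subject coordinates are normalised so that start is never greater than end, because tBLASTn reports reverse-strand hits with the two values swapped.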

The retroviral-like sequences recovered by tBLASTn searches (Altschul et al., 1990) were then recorded and examined for RT motifs to ensure that each retroviral sequence contained the conserved RT domain, which enabled construction of the protein alignment. Sequences that did not contain the RT domain were excluded from the subsequent analysis.
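
A motif check of this kind can be sketched as below. The text does not state which RT motifs were examined, so the example assumes the YXDD motif of the RT active site (a motif discussed later in this chapter) as the only criterion; the function names are hypothetical.

```python
import re

# Y-X-D-D motif of reverse transcriptase. Using it as the sole screening
# criterion is an assumption made for this sketch.
YXDD = re.compile(r"Y[A-Z]DD")


def has_rt_motif(protein: str) -> bool:
    """Return True if the translated sequence contains a YXDD motif."""
    return YXDD.search(protein.upper()) is not None


def filter_rt_candidates(candidates: dict[str, str]) -> dict[str, str]:
    """Keep only the candidates whose translation carries the motif."""
    return {name: seq for name, seq in candidates.items() if has_rt_motif(seq)}
```

In practice a single four-residue motif is a weak filter, which is why the sequences were also compared against an alignment of known RT domains.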

Alignment

All newly discovered retroviral sequences were translated into amino acids and aligned to previously characterised retroviruses and retrotransposons using MUSCLE (Edgar, 2004). The alignment of the amino acid sequences from the previously described retroviruses was used as a template to identify the most likely RT region when the remaining sequences were aligned.

Further manual adjustments were made only where they were supported by alignments of the raw nucleotide sequences. Regions in which homology could not be clearly identified were excluded from the alignment.

Table 2.1 Retroviruses identified from available sequence data. Endogenous retroviruses identified by tBLASTn screening are indicated by +, while – represents screened species from which retroviruses are yet to be identified.

Host genome                         Species                        Retroviruses identified

Phylum Chordata
  Lobe-finned fish
    Coelacanth                      Latimeria chalumnae            +
  Bony fish
    Japanese eel                    Anguilla japonica              +
  Cartilaginous fish
    Australian ghost shark          Callorhinchus milii            +
    Little skate                    Leucoraja erinacea             +
  Jawless fish
    Sea lamprey                     Petromyzon marinus             +
    Inshore hagfish                 Eptatretus burgeri             –
  Lancelet
    Florida lancelet                Branchiostoma floridae         –
  Tunicate
    Sea squirt                      Oikopleura dioica              –
    Golden star tunicate            Botryllus schlosseri           –
    Pacific transparent sea squirt  Ciona savignyi                 –
    Vase tunicate                   Ciona intestinalis             –

Phylum Echinodermata
  Sea urchin
    Purple sea urchin               Strongylocentrotus purpuratus  –
    Green sea urchin                Lytechinus variegatus          –

Phylum Mollusca
    Freshwater snail                Biomphalaria glabrata          –
    Freshwater snail                Bithynia siamensis             –
    Sea slug/Sea hare               Aplysia californica            –
    Sea snail                       Lottia gigantea                –
    Pacific oyster                  Crassostrea gigas              –
    Mussel                          Mytilus galloprovincialis      –
    Hawaiian Bobtail Squid          Euprymna scolopes              –
    Atlantic surf clam              Spisula solidissima            –
    Marine mollusk                  Chaetopleura apiculata         –
                                    Chaetoderma nitidulum          –

Phylum Cnidaria
    Warty comb jelly                Mnemiopsis leidyi              –

Phylum Porifera
  Sponge
                                    Suberites domuncula            –
                                    Amphimedon queenslandica       –

Phylogenetic analyses

Phylogenetic analyses were performed using both Bayesian Markov Chain Monte Carlo (MCMC) inference and the neighbour-joining (NJ) approach, based on the RT alignment mentioned above. The best-fitting amino acid substitution model was determined using MEGA5.0 (Tamura et al., 2013). Bayesian phylogenetic reconstruction was performed using MrBayes 3.2.2 (Ronquist et al., 2012), with two runs of 1,000,000 generations, sampling of posterior trees every 100 generations and the rtREV amino acid substitution model. The first 25% of the posterior trees were discarded as burn-in. An NJ analysis was also performed on the RT alignment. The bootstrap consensus tree was inferred from 1,000 replicates, and the evolutionary distances were computed using the JTT matrix-based method. The rate variation among sites was modelled with a gamma distribution (shape parameter = 1). Positions containing gaps and missing data were treated by pairwise deletion.
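
The sampling settings above translate into a number of retained posterior trees per run. The following short calculation makes the arithmetic explicit; it ignores the tree of the initial state that MrBayes also writes, and the function name is hypothetical.

```python
def posterior_sample_counts(generations: int, sample_every: int,
                            burnin_fraction: float) -> tuple[int, int]:
    """Return (trees sampled, trees kept after burn-in) for one MCMC run."""
    sampled = generations // sample_every
    discarded = int(sampled * burnin_fraction)
    return sampled, sampled - discarded


# Settings stated in the text: 1,000,000 generations, sampling every 100
# generations, first 25% discarded.
sampled, kept = posterior_sample_counts(1_000_000, 100, 0.25)
print(sampled, kept)  # 10000 7500
```

Under these assumptions each of the two runs contributes 7,500 trees to the consensus.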

Characterisation of the lamprey retrovirus

The open reading frame (ORF) finder (http://www.ncbi.nlm.nih.gov/projects/gorf/) was applied to each retroviral sequence to estimate protein coding regions. ORFs longer than 50 amino acids were used as tBLASTn queries to identify homologous genes in other species and related genes within the lamprey itself. To be considered homologous, the protein (ORF) in lamprey had to cover at least 50% of the length of, and share at least 25% identity with, the proteins in the database.
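
The two filters in this paragraph can be illustrated with a minimal sketch. It is not the NCBI ORF finder: it scans only the three forward frames, requires an ATG start and uses the standard stop codons. The coverage threshold is interpreted here as aligned length divided by ORF length, which is an assumption; all names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_aa: int = 50) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for forward-strand ORFs longer than min_aa.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 > min_aa:  # residues, stop excluded
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def is_homologous(orf_aa_len: int, aligned_aa_len: int,
                  percent_identity: float) -> bool:
    """Apply the thresholds from the text: >= 50% coverage, >= 25% identity."""
    return aligned_aa_len / orf_aa_len >= 0.5 and percent_identity >= 25.0
```

A complete search would also scan the reverse complement, since a provirus can be integrated in either orientation.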

LTR detection and full-length ERV characterisation

A dot plot is a visual representation of the similarities between two sequences placed on the x and y axes, using a calculated score for each pair of positions. Here, the consensus sequence was placed on both axes and compared with itself. Such a self-comparison always produces one main diagonal. In addition, if the consensus sequence follows the typical retroviral organisation (LTR-gag-pol-env-LTR), two lines parallel to the main diagonal are produced by the LTR pair within the sequence. With this approach, however, the position and length of the LTR can only be roughly identified.
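
The idea behind the self dot plot can be sketched in a few lines. The example below replaces the scored comparison with exact k-mer matching, which is a simplification that only works for well-preserved repeats; the word size and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def self_dotplot_offsets(seq: str, k: int = 12) -> Counter:
    """Count exact k-mer matches of a sequence against itself, per diagonal.

    Offset 0 is the main diagonal. A strongly populated positive offset
    corresponds to a line parallel to it, as expected for a pair of LTRs.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    offsets = Counter()
    for plist in positions.values():
        for a in plist:
            for b in plist:
                if b >= a:
                    offsets[b - a] += 1
    return offsets


def estimate_ltr(seq: str, k: int = 12):
    """Return (offset, approximate repeat length) of the strongest parallel line."""
    off = {d: n for d, n in self_dotplot_offsets(seq, k).items() if d > 0}
    if not off:
        return None
    d, n = max(off.items(), key=lambda item: item[1])
    # n consecutive k-mer matches span n + k - 1 bases.
    return d, n + k - 1
```

Because the plot is symmetric, only the upper parallel line is counted. In degraded elements such as those described below, exact matches become sparse and the estimate is correspondingly rough.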

However, most ERVs in this study were found to be highly degraded owing to their ancient integration, which has left their LTRs highly mutated and difficult to recognise. Thus, it was necessary to consider further retroviral characteristics while performing ERV mining. A platform-independent Java program package, RetroTector (Sperber et al., 2007), was applied to facilitate retroviral mining within basal vertebrate genomes. RetroTector contains three basic modules which characterise retroviruses in more detail: (1) detecting candidate LTRs; (2) detecting conserved retroviral motifs fulfilling the distance constraints; and (3) attempting to reconstruct the original retroviral sequence. An example of a RetroTector result is shown in Figure 2.2.

Figure 2.2 Example of a RetroTector result from a run of the chimpanzee genome (Sperber et al., 2007). Red bars represent stop codons and blue bars show start codons. Triple lines represent putative protein-encoding sequences. Green bars show putative asparagine glycosylation sites. “/” marks a splice donor, “\” a splice acceptor, “S” a possible frameshift and “8” a pseudoknot sequence, which is also a possible frameshift site. The viral components shown are the LTRs (5LT and 3LT), primer binding site (PBS), gag (motifs named CA and NC), pro (PR), pol (motifs named RT, RH and IN), env (motifs named SU and TM) and polypurine tract (PPT).

2.3 Results

2.3.1 Retroviral distribution in lower vertebrates

Complete, low-coverage and trace archive genomic sequences from a total of 26 species (Table 2.1) were screened using tBLASTn, with query sequences derived from the pol proteins of representative retroviruses. Most newly recognised retroviruses were recovered from Chondrichthyes, including the representative cartilaginous species investigated. About 14 lamprey retroviruses (PmRVs) were found to form a sister group of spumaviruses and the SnRV, while retroviral-like elements found in lancelet and lower chordates were phylogenetically closer to Ty3/gypsy retrotransposons. This suggested that the lamprey may be the most basal vertebrate from which a retroviral sequence can be identified.

Conserved regions of the retroviral pol coding domain were used to construct an alignment, and Ty3/gypsy LTR retrotransposons were used as the outgroup for the phylogenetic tree (Figures 2.3 and 2.4). Phylogenetic reconstruction grouped all novel ERVs and reference retroelements into two major types of retroelements, the Retroviridae and the Ty3/gypsy retrotransposons. Newly found retroviruses derived from lamprey (PmRV), cartilaginous fish (Australian ghost shark and little skate) and the Japanese eel clustered with currently recognised retroviruses. The phylogenetic analysis clustered PmRVs with SnRV and spumaviruses. The clade support for the PmRV lineage is 99% by Bayesian analysis and 88% by the NJ method. Since the NJ analysis is based on a distance matrix, its lower resolution was possibly due to the extremely small separation between these taxa (tips) compared with the overall diversity. Nevertheless, the grouping recovered by both methods was identical.

Retroviral-like elements identified from lower chordates (lancelets, tunicates) to sponges were clustered with Ty3/gypsy LTR retrotransposons. Retrotransposons recovered from tunicate (Oikopleura dioica) and comb jelly (Mnemiopsis leidyi) formed a monophyletic group which was phylogenetically closest to PmRVs (Bayesian posterior probability = 68.31%; NJ bootstrap value < 50%). This finding is consistent with that of a previous study which suggested that Oikopleura ERV-like elements (Tor) might represent an unknown retroviral genus basal to the Retroviridae, according to their genomic organisation and phylogenetic position. The overall diversity of ERV sequences within the chordate species is much wider than that of the other major group, the Ty3/gypsy retrotransposons, indicating that additional Ty3/gypsy retrotransposon genera remain to be discovered.

To resolve the deeper phylogenetic relationships, a phylogenetic analysis was conducted with additional taxa previously suggested to be phylogenetically close to retroviruses, including Ty3/gypsy LTR retrotransposons and DIR elements. The phylogenetic relationships of DIRs, Ty3/gypsy retrotransposons and retroviruses are consistent with previous reports of DIRs appearing as a sister group of Ty3/gypsy retrotransposons, and of Ty3/gypsy retrotransposons representing the closest relatives of vertebrate retroviruses (Ericka et al., 2004).

Again, the phylogenetic analysis grouped PmRV into a strongly supported monophyletic group and positioned it in a basal position with respect to all previously identified retroviruses (NJ bootstrap value = 100%; Bayesian posterior probability = 100%) (Fig. 2.3). This result provided strong evidence that PmRVs are a basal or primitive retroviral lineage and suggested a possible marine origin for Retroviridae.

Figure 2.3 Phylogenetic tree consisting of novel ERVs as well as previously identified retroviruses and retrotransposons. The labelled groups are Retroviruses, Lamprey ERVs and Ty3/gypsy retrotransposons. Enlarged portions of this phylogenetic tree are shown on the following pages. Bootstrap values shown are based on 1,000,000 replicates.

2.3.2 PmRV, a retrovirus from the Petromyzon marinus genome

Through an iterative process of screening lamprey genomes and comparing viral-related sequences to representative homologous sequences using RetroTector, several contigs that together contained the 5ʹ and 3ʹ long terminal repeat (LTR) sequences, gag and pol coding domains and a trace of the env gene of a retrovirus were extracted for subsequent genomic reconstruction. In total, 12 nearly complete proviruses were extracted from currently available lamprey genomic sequences, including the Petromyzon marinus and Lethenteron japonicum genomes.

Figure 2.4 Consensus PmRV genome. The schematic below the scale (position in kb) shows the locations of ORFs and genomic features within the consensus proviral genomic sequence. LTR, long terminal repeat; PBS, primer binding site; MA, matrix; CA, capsid; NC, nucleocapsid; PR, protease; RT, reverse transcriptase; RH, RNaseH; IN, integrase; SU, surface glycoprotein; TM, transmembrane domain.

To construct the consensus genome of PmRV, multiple alignments of the regions extending 6000 bp upstream and downstream of each tBLASTn hit were performed. The sequences were aligned using MUSCLE and manually adjusted to identify the ORFs. Figure 2.4 outlines the genomic structure of this new retrovirus; it is ~12.5 kb in length and contains gag, pol and env traces flanked by LTRs. Figure 2.5 shows the DNA sequence, protein domains and frameshift locations of PmRV. Detailed characteristics of the PmRV genomic structure are discussed in the following sections.
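
Extending each hit by a fixed flank is simple coordinate arithmetic, sketched below. The clipping at contig ends and the 1-based inclusive coordinate convention are assumptions of this example rather than details given in the text, and the function name is hypothetical.

```python
def flank_window(hit_start: int, hit_end: int, contig_length: int,
                 flank: int = 6000) -> tuple[int, int]:
    """Extend a hit by `flank` bp on each side, clipped to the contig.

    Coordinates are 1-based and inclusive. Hits reported on the reverse
    strand (start > end) are normalised first.
    """
    lo, hi = sorted((hit_start, hit_end))
    return max(1, lo - flank), min(contig_length, hi + flank)
```

For a hit at 10,000 to 10,500 on a 50,000 bp contig, the extracted window runs from 4,000 to 16,500. On short contigs the window is truncated, which is one reason why several contigs had to be combined to reconstruct the consensus.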

Figure 2.5 Complete nucleotide sequence and deduced amino acid sequence of PmRV genome. The predicted positions of the coding domains are based on sequence homology. CA, capsid; MA, matrix; MHR, major homology region; NC, nucleocapsid; PR, protease; RT, reverse transcriptase; env trace, envelope transmembrane.

Putative PmRV LTRs

The continuous matching of the 5ʹ upstream regions made the actual start point of U3 difficult to determine through sequence alignment alone. This continuous matching may result from segmental duplication of the lamprey genome in this region. RetroTector provided some predictions of the PmRV LTRs, in which the putative 5ʹ and 3ʹ LTRs begin with a unique start site, TT, and end with the generalised motif, CA. A typical proviral structure contains identical 5ʹ and 3ʹ LTRs of equal length. However, the putative 5ʹ and 3ʹ LTRs of PmRV are 95% identical, indicating that this PmRV copy is partially degraded. The difference in size between the two LTRs (the 3ʹ LTR is 501 bp and the 5ʹ LTR is 505 bp) appears to be due to deletions at the beginning of the 3ʹ LTR and an insertion or substitution at its end, which result in a difference of 4 bp between the two LTRs.
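
A pairwise identity such as the 95% quoted above can be computed from two aligned sequences as sketched below. How gapped columns were treated in this study is not stated; the example assumes that columns with a gap in either sequence are skipped, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical positions over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

Counting gaps as mismatches instead would give a slightly lower value for the two PmRV LTRs, since they differ in length by 4 bp.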

To further investigate the PmRV LTRs, the predicted 5ʹ LTR was used as a probe to perform BLASTn searches against lamprey genomic data. A total of 23 hits were obtained and aligned, and the result showed that the consensus matching sequence is about 1765 bp, which is more than triple the predicted length (505 bp). However, it is difficult to verify the length of the LTR by sequence alignment alone, since all lamprey ERVs found in this study were highly degraded. Nevertheless, various LTR promoter elements were found upstream of the retroviral gag gene, including CAAT, GC and TATA boxes and a polyadenylation signal (AATAAA), thus providing some evidence of PmRV LTR characteristics.

In retroviruses, DNA synthesis is primed by a specific tRNA, and thus all retroelements contain a primer binding site (PBS) with Watson-Crick complementarity to the primer tRNA. A corresponding PBS region in PmRV could not be identified by searching a total of 2341 lamprey-specific tRNA acceptor stems, which contain the sequence complementary to the PBS. RetroTector, however, predicted that the primer binding site is a perfect complement to tRNAPro (5ʹ-TGGGGGCTCGTCCGGGAT-3ʹ), which is one of the most widely used primers of reverse transcription among mammalian C-type viruses and HTLV/BLV. The possibility that this is a false positive result owing to the limited tRNA database in RetroTector cannot be excluded.
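
The complementarity test underlying a PBS search can be sketched as follows. The example checks whether a candidate PBS is the exact reverse complement of the 3ʹ end of a tRNA; requiring a perfect match and the function names are assumptions of this sketch, not a description of RetroTector.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def pbs_matches_trna(pbs: str, trna_3prime: str) -> bool:
    """True if the PBS is a perfect complement of the 3' end of the tRNA.

    The tRNA may be given as RNA or DNA; U is converted to T.
    """
    target = reverse_complement(pbs)
    return trna_3prime.upper().replace("U", "T").endswith(target)


# PBS predicted for PmRV, as quoted in the text.
pbs = "TGGGGGCTCGTCCGGGAT"
print(reverse_complement(pbs))  # ATCCCGGACGAGCCCCCA
```

The printed sequence is the 18-nucleotide 3ʹ end that a priming tRNA would need to carry. Real searches usually tolerate one or two mismatches, because the PBS of an ancient provirus is itself likely to be mutated.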

Gag zinc finger and major homology region

Downstream of its 5ʹ LTR, PmRV exhibits three ORFs. ORF1 contains the gag gene, encoding 585 amino acids and starting with a methionine (ATG) codon. Based on RetroTector, a 66 amino acid region starting at position 2155 bp was orthologous to the retroviral capsid gene (CA), and a short protease (PR) motif was predicted downstream of this region.

Most retroviruses and Ty3/gypsy retrotransposons have NC zinc fingers present within their gag gene. Two zinc fingers are detectable in the alpharetroviruses, betaretroviruses, epsilonretroviruses, lentiviruses and some gammaretroviruses (HERV-H group) but none in spumaviruses. Like spumaviruses, no zinc fingers were detected in PmRV in its gag region; therefore, it is likely that both PmRV and spuma-like retroviruses have lost their zinc fingers.

Another structural trait in gag is the major homology region (MHR), which plays a critical role in the assembly of the mature capsid shell and is required for infectivity. As in other retroviruses, PmRV contains a conserved MHR (ISRDLFDELQTALQRKNETL), which is located from 3385 bp to 3445 bp within the gag domain.

Retrovirus-like pol ORF2

The second open reading frame, ORF2, starts at position 3935 in the +3 frame and encodes a 1170 amino acid retrovirus-like polymerase. The PmRV polymerase includes four enzymatic domains (protease, reverse transcriptase, RNaseH and integrase), showing the characteristic domain order and structure of the polyproteins of both retroviruses and Ty3/gypsy LTR retrotransposons. The pol ORF encodes a large polyprotein and is possibly expressed by translational suppression of the gag TAA termination codon. A secondary structure was identified downstream of the stop codon: a stem loop (23–43 bp downstream of the TAA termination codon), which might be required for readthrough of the gag termination codon of PmRV (Figure 2.6).

Figure 2.6 Secondary structure (red square) identified downstream of the gag stop codon.

The nucleotide sequence similarity between PmRV and other retroviral pol sequences was compared. The PmRV RT domain shows little similarity to those of SnRV (26.3%), WDSV (25.2%) and spumaviruses (25.7%). As with the RT domain, the IN domain is closest in identity to those of spumaviruses (22.4%), WDSV (22.9%) and SnRV (21.2%). Likewise, RNaseH shows some similarity to those of spumaviruses (19.6%) and SnRV (13.4%). The GPY/F motif at the C terminus of IN is preserved as GPH in PmRV, while spumaviruses have lost the GPY/F motif, substituting it with (K/G)PT, and Ty3/gypsy retrotransposons showed no GPY/F motif within their IN region. In addition, PmRV pol did not show any significant sequence similarity to a dUTPase or to any other viral or nonviral protein.

Env encoded by ORF3 of PmRV

This region starts from position 7445 bp and forms the third discrete ORF; tBLASTn searches with this sequence did not reveal any significant matches. Like that of all other retroviruses, PmRV env consists of two subunits: SU (the surface domain) and TM (the transmembrane domain). SU has maximum exposure to the host immune system; thus, it is under strong adaptive pressure and is poorly conserved. In contrast, TM is mostly shielded from the immune response by SU and is highly conserved because of its essential function of fusing the viral and host cell membranes during viral entry. This essential function is reflected in the highly conserved domain organisation of TM. Given its importance, it is not surprising that the TM region is sufficiently conserved to infer retroviral evolutionary history. Therefore, the following analyses were mainly focused on the retroviral TM region.

Towards the N terminus of PmRV env, from 7964 bp to 7976 bp, a conserved RPKR motif matching the SU/TM cleavage site consensus (RXR/KR) was identified. This is the cleavage site between SU and TM and marks the beginning of the TM portion of the sequence. This polybasic motif has been identified in different types of retroviral TMs, even in the most divergent spuma-type TM; the only exception is the foamy-like CoeEFV. The cleavage site and fusion peptide are then followed by a region containing an α-helical secondary structure located from 7610 bp to 7684 bp, corresponding to a putative hydrophobic motif region of env predicted by RetroTector, and by a coiled-coil region from 8114 bp to 8254 bp towards the C-terminal end. This is one of the predominant features of the ectodomain, which consists of heptad repeats (HPPHCPC). The ectodomain sequence of some retroviruses also includes a region known as the immunosuppressive domain (ISD). This is a stretch of 20 amino acids and has been reported to have important immunosuppressive properties, most probably essential for virus dissemination within the host. Neither PmRV nor spumaviruses have a recognised ISD region, while alpharetroviruses, gammaretroviruses, deltaretroviruses and some lentiviruses (such as HIV-1) contain this domain. Another putative functional env domain, CX6CC, which lies immediately downstream of the ISD, could not be recognised in either PmRV or spumaviruses, while being highly conserved in alpharetroviruses, gammaretroviruses, deltaretroviruses, betaretroviruses and lentiviruses (Figures 2.7 and 2.8).
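
The two sequence motifs named in this section can be written as regular expressions, as in the sketch below. The patterns are a literal reading of the consensus notation used in the text (RXR/KR and CX6CC); they are not the motif models used by RetroTector, and the function name is hypothetical.

```python
import re

# Literal translations of the consensus notation used in the text.
CLEAVAGE_SITE = re.compile(r"R.[KR]R")  # R-X-(R/K)-R, SU/TM cleavage motif
CX6CC = re.compile(r"C.{6}CC")          # cysteine motif downstream of the ISD


def scan_env(protein: str) -> dict[str, list[tuple[int, str]]]:
    """Return 0-based positions and matched text for each env motif."""
    protein = protein.upper()
    return {
        "cleavage_site": [(m.start(), m.group())
                          for m in CLEAVAGE_SITE.finditer(protein)],
        "CX6CC": [(m.start(), m.group()) for m in CX6CC.finditer(protein)],
    }
```

The RPKR motif reported for PmRV matches the first pattern. Such short patterns also occur by chance, so a match is informative only when it falls at the expected position relative to the other TM features.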

Figure 2.7 TM protein sequence alignment of ERVs and PmRV. The SU/TM cleavage site is indicated by a red arrow, heptad repeats within the coiled coil are indicated by a purple arrow and the conserved transmembrane domain is indicated by a green arrow.

Figure 2.8 Sequence alignments of retroviral fusion subunits. Sequence alignments are separated into covalent type (alpha-, gamma-, delta-), noncovalent type (beta-, lenti-) and unclassified

2.4 Discussion

As a large number of eukaryotes have been sequenced, it has become clear that the genomes of all vertebrates contain retroelements from multiple distinct lineages. However, retroelements are not evenly distributed across vertebrate orders. Some retroelements are widely dispersed in vertebrates, while others show more patchy distributions. The Ty3/gypsy retrotransposons, which are considered to be the closest sister group to retroviruses, are widespread in almost all vertebrate orders. Another group of LTR retroelements, the Ty1/copia LTR retrotransposons, has a more patchy distribution and has been identified mainly in amphibian and fish species. Among all the distribution patterns of LTR retroelements, that of retroviruses is the most notable, since they induce a variety of diseases in humans and animals.

However, little attention has been focused on retroviral distribution within basal vertebrates.

Since the first endogenous retrovirus was discovered in the late 1960s, retroviruses have been identified in over eight vertebrate classes (Hayward et al., 2013), including mammals, avians, reptilians, amphibians, chondrichthyes, perciformes, actinopterygii and, more recently, coelacanths. Despite this widespread distribution, most were recognised within mammalian and avian hosts. The Retroviridae has therefore long been considered to have a host range restricted to vertebrates, although its precise host range and origin remain ambiguous.

Advances in genome sequencing technologies have provided unprecedented opportunities for in silico screening of ERVs from diverse host genomes. In 2012, Han and Worobey reported an endogenous foamy-like viral element in the coelacanth genome identified using BLAST, providing evidence of the ancient marine origin of retroviruses. More recently, a genome-wide screening for ERVs in the elephant shark genome revealed extensive ERV diversity within sharks (Han et al., 2015). All these findings imply that retroviruses may well be ubiquitous in these basal vertebrates and have a marine origin. However, based on the RT phylogenetic tree, none of the known marine ERVs is particularly basal to all other retroviruses, leaving the basal retroviral species unknown.

To improve our understanding of the diversity and origin of retroviruses, in this study I performed a comprehensive screening of eukaryotic genomes from the coelacanth (Class Sarcopterygii) to the sponge (Class Demospongiae), focusing especially on an ancient lineage of vertebrates, the jawless fishes. The screening was conducted by running BLAST searches against all available published chordate genomes with reference retroviral pol amino acid sequences. Ty3/gypsy retrotransposons, as the ancestral and sister retroelements of retroviruses, were used as the outgroup. Although the choice of outgroup can sometimes be controversial, the genomic organisation and phylogenetic analyses of the retroviral pol gene have provided substantial evidence for rooting the retroviruses with gypsy retrotransposons.

The two key findings of this study are the determination of the host range boundary of the Retroviridae and the discovery of the basal retrovirus, PmRV. The following sections present and discuss these two findings.

2.4.1 ERV host range and diversity

ERV diversity and abundance in lower vertebrates decrease sharply from sharks to lampreys, and no ERV could be identified from hagfishes to sponges. The underlying tree topology recovered for novel and known retroviruses is fundamentally congruent with the previously published phylogenetic tree. Although abundant novel basal retroviruses have been found, the result remains consistent with the hypothesis that spumaviruses represent a basal clade within the current retroviral grouping; however, their basal position is now being challenged. The phylogenetic analysis of the novel PmRVs suggests that an additional ancient ERV clade lies more basal than spumaviruses within the Retroviridae. This is one of the most interesting findings of my study. Lampreys may be the most basal vertebrates that could be infected by retroviruses and may thus define the host range boundary of the Retroviridae.

Lampreys, as representatives of an ancient lineage of vertebrates, diverged from our lineage ~500 million years ago, providing us with a good study organism to investigate retroviral origin. If retroviruses emerged around that time, then lamprey ERVs today are likely to have been eroded by mutations. Indeed, only 12 fragmented retroviral sequences were found within two published lamprey genomes (Petromyzon marinus and Lethenteron camtschaticum). Although they might no longer be infectious, these highly mutated and fragmented lamprey ERVs reflect ancient and deeper host-retrovirus relationships and add further insight into retroviral origin. In this study, the hagfish (Eptatretus burgeri), the other remnant of Agnatha, shows no evidence of past retroviral infections. It is possible that poor genomic sequence information for the hagfish limited the ability to mine ERVs, or that the true “original” host became extinct in the ancient past; without its genomic data, it is difficult to reveal host switching events. The lamprey ERV discovered here represents the most basal retrovirus identified so far and supports the hypothesis of a marine origin of retroviruses.

ERV abundance and diversity are considerably lower within the lamprey genome, suggesting constraints on the ability of the retrovirus to diversify and colonise lampreys. One possible explanation is that PmRV represents a more recent phylogenetic group which has not yet had sufficient time to proliferate and colonise a wide diversity of host species. Yet this hypothesis is unlikely to be true. First, no complete or near-complete PmRV genomes were found in this study, and PmRV LTRs were shown to be highly mutated. Second, a recent study discovered the coelacanth endogenous foamy-like virus (CoeEFV), which can be dated back to more than 407 million years ago (Han et al., 2012a). According to the pol-based phylogenetic analysis in this study, PmRV may have emerged even earlier than CoeEFV because the PmRV lineage branches at a position basal to CoeEFV and all other retroviruses. Therefore, PmRV should have had sufficient time to proliferate and colonise a diverse host population, had it been able to do so.

Another plausible explanation is related to the major change to the host immune system during the early evolution of vertebrates (Figure 2.9). Just like all pathogens, retroviruses have to seek out a niche that is favourable for their replication. Being enveloped, retroviruses are fragile and do not survive well outside an infected cell; decay of infectivity is rapid even in tissue culture (Sattentau, 2010). Therefore, finding and infecting new host cells with minimum delay is important for their optimal survival and dissemination. The most primitive adaptive immune system in vertebrates is the variable lymphocyte receptor (VLR) immune system, which is established in the lamprey and arose about 480 million years ago (Li et al., 2013). The lamprey VLR system consists of two major immune cell types that are similar to T and B lymphocytes (Boehm et al., 2012; Herrin and Cooper, 2010; Pancer et al., 2004). The diversification of the progenitor T and B cells offered an extraordinary opportunity by providing immune cells as novel cell niches for pathogens (Figure 2.10). If the adaptive immune cells are the primary cell niches in which basal retroviruses replicate, it is reasonable to postulate that retroviruses could only successfully proliferate after the emergence of adaptive immune cells. CoeEFV, the oldest spumavirus found so far, was estimated to have emerged around 407 million years ago, which is later than the first establishment of the adaptive immune system in vertebrates (480 million years ago). Therefore, it is reasonable to speculate that the adaptive immune system evolved shortly after the early vertebrate divergence and may be an important factor in determining the host range boundary.

Figure 2.9 Overview of the evolution of the immune system in deuterostomes. The molecules restricted to jawed and jawless vertebrates are indicated in blue and green, respectively. Molecules that emerged at the stage of invertebrates are in pink. 1R and 2R refer to the two whole-genome duplication events in the vertebrate lineage. The table on the right shows the presence of retroviruses in the respective host species. ERVs identified by in silico screening are indicated by “+”, whereas “–” represents a taxon from which ERVs are yet to be identified. The phylogenetic tree is modified from Flajnik & Kasahara (2009).

Figure 2.10 Retroviral infections of immune cells. Retroviral infections identified are indicated by solid squares, whereas hollow squares represent cells in which retroviruses have not yet been identified.

2.4.2 Genomic organisation analysis of PmRV

The discovery of PmRV unequivocally demonstrates that retroviruses are capable of invading jawless vertebrate genomes. However, the numerous in-frame stop codons, frameshift mutations and low copy number of PmRV reflect its ancient origin and loss of infectivity. The structural traits of PmRV and other retroelements are compared and summarised in Table 2.2. PmRV exhibits several predominant genomic signatures, such as a conserved MHR in gag, GPY/F and YXDD motifs in pol and a conserved SU/TM cleavage motif, RPKR. Together with spumaviruses and the SnRV clade, PmRV lacks zinc finger motifs in the gag gene, implying that zinc finger motifs may have been lost in basal retroviruses and that basal retroviruses may have a different mechanism for recognising the specific RNA sequences needed for viral packaging. In addition to the lost zinc finger motifs, these retroviruses differ significantly from the other six retroviral genera in their genome organisation and LTR length. As suggested, the closest retrovirus sister group, the Ty3/gypsy retrotransposons, shows genomic similarity to PmRV in its gag and pol genes by sharing conserved GPY/F and YM/VDD motifs in the pol gene and an MHR in some gypsy retrotransposon gag regions. Taken together, these findings suggest that PmRV may be the most basal retrovirus (or a transitional form between Ty3/gypsy retrotransposons and retroviruses), since it shares genomic characteristics with both spumaviruses and Ty3/gypsy retrotransposons. The genomic characteristics of the retroviruses, PmRV and Ty3/gypsy retrotransposons are compared and summarised in Table 2.2 and Figure 2.11.

Although most viral internal genes have been clearly characterised, the LTR sequences remain unclear. The possibility that the putative LTRs are a false positive result of RetroTector cannot be ruled out; instead of being viral LTRs, these two long repeats could be random repeat regions located both upstream and downstream of PmRV. This difficulty in characterising PmRV is mainly due to the intrinsic characteristics of the lamprey genome, such as its high content of repetitive components and GC bases and, more importantly, the absence of broad-scale sequence information from closely related species, which made the assembly of the lamprey genome more uncertain (Smith et al., 2013). Well-characterised lamprey and hagfish genomes are needed to serve as a sound basis for further investigation.


Table 2.2 Summary of the genomic characteristics of the retroelements, including the number of NC zinc fingers and the presence of MHR, dUTPase, C-terminal (G-patch), pol (GPY/F) and YXDD motifs. The cleavage sites in env are examined in retroviruses but not in the gypsy and copia retrotransposons.

Figure 2.11 Boxplots of LTR lengths in the different groups. The average LTR length of spumaviruses is significantly longer than that of other retroviruses.


Phylogenetic analysis of retroviral elements has relied essentially on the retroviral pol gene, owing to its greater conservation compared to other genomic regions. Although phylogenetic analysis based on pol has its advantages, a disadvantage of using this highly conserved region of the retroviral genome is that many fine distinctions between retroviral families can be blurred. In addition, just as organism chromosomes can have complex independent histories, recombination events result in different evolutionary histories of different parts of the retroviral genome.

Therefore, the pol gene reflects only part of the evolutionary history giving rise to any given retroviral genome. In contrast to the pol gene, the env gene has long been considered a highly divergent sequence. Env consists of two subunits: SU, the surface domain, and TM, the transmembrane domain. SU has the greatest interaction with the host immune system and includes the receptor-binding domain; therefore, it is under heavy adaptive pressure and is less conserved.

In contrast, TM is a well-conserved domain and its main function involves viral entry through membrane fusion. This shared conserved domain and common features allow an evolutionary relationship to be inferred. Indeed, among all these genomic features, the main difference between LTR retrotransposons and retroviruses is the presence of a functional envelope (env) gene in retroviruses, which is absent or non-functional in LTR retrotransposons. The emergence of an infectious retrovirus may originate from de novo acquisition or capture of env genes by retrotransposons. One of the best-studied examples of such a capture is the Ty3/gypsy retrotransposon of Drosophila melanogaster. This is an infectious insect retrotransposon partly owing to the capture of an env-like gene from baculovirus (double-stranded DNA insect virus)

(Song et al., 1994). More evidence of env capture by retrotransposons has been suggested. For instance, the env of the nematode (Caenorhabditis elegans) retrotransposon Cer shares homologies with a phleboviral fusion protein (Malik et al., 2000), Tas, a retrotransposon of


Ascaris lumbricoides, appears to have captured the gB glycoprotein gene of an ancestral herpesvirus (Pearson and Rohrmann, 2002; Malik et al., 2000), and the env of TED, a retrotransposon of Trichoplusia ni, is homologous to the baculovirus F gene (Ozers and Friesen, 1996). According to the above examples, infection by large DNA viruses such as baculovirus and herpesvirus appears to provide conditions favourable to the sporadic capture of env by retrotransposons. Therefore, retrovirus and retrotransposon env genes should reflect their different origins.

Given the divergence among different TMs, together with the key role that ERVs play in determining host range, it is not surprising to find differences in the species distribution of retroviruses for each TM type. For example, beta-type TM (including betaretroviruses and lentiviruses) has not been found in any host species outside the mammalian class. In contrast, gamma-type TM has been found in ERV sequences from five classes of vertebrates: mammals, reptiles, amphibians, fish and birds. Interestingly, the analysis of spumavirus TMs revealed a distinct env organisation and high average pairwise identity at the amino acid level. Not only is this organisation distinct from other TM types, it can also diverge significantly between some spumaviruses. For instance, the CoeEFV env is highly divergent from that of other spumaviruses and, like the PmRV env, contains several indels (insertions) in its ISD region. It is also worth noting that, at the sequence level, the PmRV env is so divergent that basic local alignment search tool (BLAST) searches do not return any type of env sequence, despite the use of various parameters and retroviral datasets. This result is not unexpected; the same lack of positive hits occurs when blasting against the major TM types (beta-type and gamma-type). Owing to their distinct TM organisations, the TMs of PmRV and spumaviruses may represent two additional TM types. In fact, the only common feature found between spumaviruses and PmRV is two heptad regions which play a critical role in the dynamic

rearrangement of the trimer during the fusion process (Chambers et al., 1990). Taken together, the distinct env organisations found in different retroviral genera may reflect their unique env origins and also explain their distribution in present-day host species.

ERV integration time is generally inferred directly from host timescales or on the basis of a neutral rate of evolution. However, PmRV integration time could not be estimated in this study.

The lack of a well-characterised lamprey genome hindered the recovery of the LTRs of this lamprey retrovirus. Without complete LTRs, a neutral mutation rate for ancient vertebrates and a detailed Agnatha evolutionary timescale, it is unlikely that the actual insertion time of

PmRV can be estimated. Compared to the CoeEFV, which is likely to have co-diverged with its vertebrate hosts at least 407 million years ago, PmRV should have a longer evolutionary relationship with lamprey, since PmRV branches off at a more basal position than CoeEFV in the phylogenetic analysis.


2.4.3 Putative function for lamprey ERV

It is interesting to speculate on the role of virus-derived genes in their respective host genomes. By blasting against the Petromyzon marinus genomic database, I found that PmRV pol showed some similarity to a lamprey-specific variable lymphocyte receptor (VLR) gene (Boehm et al., 2012; Herrin and Cooper, 2010; Pancer et al., 2004) at the protein level (AY577941: E = 8e-32, query cover = 61%;

AY577942.1: E = 1e-32, query cover = 46%). The multiple alignment of RT shown in Figure 2.12 reveals the sequence similarity between the partial VLR and the viral RT. The two highly conserved motifs in the

RT region, XPXG and YXDD, can be clearly identified in VLR as LPMG and YLDD, respectively. Most notably, LPMG and YLDD are exact matches to those found in a newly identified lamprey gypsy retrotransposon RT region.

Figure 2.12 Alignment of the RT protein. Amino acid sequences from different retroelements (including retroviruses, PmRV and lamprey Ty3/gypsy retrotransposons) and the translated VLR gene are aligned to highlight the conserved residues. Black boxes indicate invariant amino acids, and progressively lighter shades of grey indicate decreasing amino acid similarity.


This provides evidence that the VLR gene in lamprey may contain an endogenous retroviral sequence. Detailed examination is needed to confirm this relationship; however, the finding is not entirely unexpected. Although viruses and hosts generally compete by continuously counter-adapting their molecular arsenals, ERV-derived immunity can occasionally provide an evolutionary shortcut. For example, the recombination-activating gene RAG-1 encodes a key enzyme required for the generation of the highly diversified antigen receptor repertoire. Based on core sequence homology, it is assumed that the RAG-1 protein evolved from DNA transposons that belong to the Transib superfamily (Agrawal et al., 1998). In addition, viral receptors (JAM and CTX) have IgG-like surface domains (Du Pasquier et al., 2004), and one of these viral receptors (CTX) is phylogenetically basal to T cell receptors, suggesting a virus-derived origin of the T cell receptor (Villarreal, 2009). Furthermore, Fv1 is a gene that appears to derive from a retroviral group-specific antigen gene (gag) and has been present in the mouse genome for at least 10 million years. Fv1 is believed to be a restriction factor that protects mice against infection by MuLV, providing evidence of ERV-derived immunity (Best et al., 1996). Taken together, the putative ERV-derived sequence found in the lamprey VLR gene raises interesting questions about both the function and the origin of the VLR gene. Viral gene capture and its influence on shaping the primitive adaptive immune system are worthy of further investigation. More examples of ERV-derived immune-related genes are summarised in

Table 2.3.


Table 2.3 Examples of confirmed and prospective ERV-derived immune-related genes.

ERV-derived immunity | Viral origin | Host genome | Functions | Reference
Fv1 | Retroviral gag | Mouse | Capsid interaction | (Best et al., 1996)
Fv4 | Retroviral env | Mouse | Receptor interference | (Taylor et al., 2001)
APOBEC3 | Retroviral regulatory element | | Up regulation of viral inhibition | (Sanville et al., 2010)
Endogenous ALV | Avian leukosis virus env | Chicken | Receptor interference and immune tolerization | (Taylor et al., 2001)
Endogenous JSRV | Betaretroviral gag | Sheep | Heteromultimerization, receptor interference | (Mura et al., 2004)
Endogenous FeLV | Retroviral env | Cat | Receptor interference | (McDougall et al., 1994)
Endogenous MMTV | Mouse mammary tumour virus sag | Mouse | T cell deletion | (Young et al., 2012; Acha-Orbea and MacDonald, 1995)
CGIN1 | Retroviral RNase | Mammalian | Unconfirmed ubiquitination mechanism | (Marco and Marín, 2009)
IAPV | Picornavirus | | Unknown | (Lusso, 2006)
IRIS | Env | Fruit fly | Unknown | (Malik and Henikoff, 2005)
Rmcf1,2 | Env | Mouse | Receptor interference | (Wu et al., 2005; Jung et al., 2002)
EBLN | Bornavirus-like | Mammalian | Unknown | (Horie et al., 2010)
NIRVs | Ebolavirus-like VP35 | Bats | Unknown | (Taylor et al., 2011)
RAG-1 | DNA transposon | Human | Generation of mature B and T lymphocytes | (Agrawal et al., 1998)


2.5 Conclusion

This study determines the host range boundary of the Retroviridae. The lamprey is the most basal vertebrate in which retroviruses can be identified. The presence of retroviruses in vertebrate hosts coincides with the existence of B and T cells (or B- and T-like cells), implying that adaptive immune cells may be primary niches for retroviruses. Although the LTR region of the full-length PmRV remains unclear, the genomic characteristics and phylogenetic analyses support its basal position in the Retroviridae. I also identified a VLR-like region in the viral pol gene, whose evolutionary relationship with PmRV must be further investigated.


Chapter III

Evolution of Fish Retroviruses

3.1 Introduction

Retroviruses have been identified in a wide range of vertebrates. However, only a limited number of retroviruses have been identified in fish species. To date, only three fish retroviruses of the genus epsilonretrovirus have been conclusively identified: walleye dermal sarcoma virus (WDSV) (Martineau et al., 1992; Walker, 1969) and walleye epidermal hyperplasia virus types 1 and 2 (WEHV-1 and WEHV-2) (LaPierre et al., 1999), together with a tentative member, perch hyperplasia virus (PHV). The members of this genus have relatively complex genomes, ranging in size from 11.7 kb to 12.8 kb. In addition to the three major viral genes (gag, pol and env), there are two additional open reading frames (ORFs), one located immediately downstream of env and one located upstream of gag. The accessory gene product ORFA, which is present in all three walleye retroviruses, appears to be a viral homologue of cyclin D, which functions to induce cell proliferation, enabling viral replication (LaPierre et al., 1998).

Walleye retroviruses use the primer tRNAHis, while the snakehead retrovirus (SnRV) uses tRNAArg. All members of this genus are exogenous retroviruses and may be harmful to host health. For example, WDSV is etiologically associated with a multifocal skin tumour in walleye

(Stizostedion vitreum), while WEHV-1 and WEHV-2 cause discrete epidermal hyperplasia in walleye. Atlantic salmon swim bladder sarcoma virus (SSSV), an unclassified retrovirus, is associated with leiomyosarcomas in the swim bladders of Atlantic salmon and has a very high proviral copy number per cell (greater than 30).


Phylogenetic analyses based on the viral RT region show that walleye viruses and PHV cluster together and have significantly diverged from SnRV. These analyses also show that most other fish retroviruses, such as SSSV and an endogenous retrovirus from D. rerio, fall between the gammaretrovirus and epsilonretrovirus genera (Paul et al., 2006). Still others, such as SnRV, cannot be assigned to (and are not phylogenetically close to) any of the present retroviral genera.

The genomic organisation, sequence homology and transcription profile of SnRV differ from those of all other fish retroviruses. For example, SnRV is unique among all known retroviruses in having a tRNAArg1,2 primer binding site, and its coding regions are highly divergent from those of known retroviruses (Hart, 1996). Unlike most fish retroviruses, which cause damage to their hosts, SnRV has not been associated with any fish disease so far.

Improved sequencing technologies have facilitated the in silico screening of host genomic sequences. This type of screening has identified several retroviruses in marine vertebrates, such as a foamy-like endogenous retrovirus in coelacanths (Latimeria chalumnae) (Han and Worobey, 2012a), a gamma-like virus in dolphins (Tursiops truncatus) (LaMere et al., 2009), an epsilon-like virus in killer whales (Orcinus orca) (LaMere et al., 2009), and gamma- and foamy-like retroviruses in sharks (Callorhinchus milii) (Han, 2015). Studies on marine retroviruses indicate that diverse retroviruses have been associated with marine vertebrates for millions of years (Han and Worobey, 2012a). Following on from Chapter II, in this chapter I aim to investigate retroviral diversity in marine species on the basis of newly released genomic data, focusing mainly on basal vertebrates, including cartilaginous fishes (sharks and skates) and a living fossil, the coelacanth.


3.2 Materials and Methods

Retrovirus mining in marine species

Using tBLASTn, I performed genomic mining of four newly released, publicly available fish genomes: the coelacanth (Latimeria chalumnae), Japanese eel (Anguilla japonica), Australian ghost shark (Callorhinchus milii) and little skate (Leucoraja erinacea). (Data retrieval is detailed in the Materials and Methods section of Chapter II.)

Phylogenetic analyses

Two phylogenetic analyses were performed to determine the phylogenetic relationships and evolutionary history of novel ERVs, with the betaretrovirus MMTV as the outgroup. Reference retroviruses were downloaded from NCBI and aligned using Geneious 6 (Kearse et al., 2012). Both Bayesian Markov Chain Monte Carlo (MCMC) inference and the neighbour-joining (NJ) approach were applied based on the viral RT region. MEGA5.0 was used to determine the best-fitting amino acid substitution model. (Details of the models applied are described in the Materials and Methods section of Chapter II.)

Estimation of ERV invasion time

The estimated genetic distance (d) between the two LTRs and the neutral evolutionary rate of the host (u) are applied using the following formula:

\[ t = \frac{d}{2u} \]

(t is the predicted invasion time, u is the neutral evolutionary rate of the host and d indicates the genetic divergence between the two LTRs) (Kimura, 1980).


The average neutral rate of sharks is estimated to be 2.2 × 10^-9 substitutions per site per year (Han, 2015).
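The dating calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in this study: it assumes that d is a Kimura two-parameter distance computed from a pre-aligned pair of LTR sequences, and the function names (k2p_distance, invasion_time) are hypothetical.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTR sequences.

    Assumption: the sequences are already aligned to equal length; columns
    containing gaps or ambiguous bases are skipped.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitional differences
    q = transversions / compared    # proportion of transversional differences
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def invasion_time(d, u):
    """Predicted invasion time t = d / (2u), in years.

    d: divergence between the two LTRs (substitutions per site)
    u: neutral evolutionary rate of the host (substitutions per site per year)
    """
    return d / (2 * u)


# Illustrative values only: an LTR pair with d = 0.044 and the shark rate above.
t = invasion_time(0.044, 2.2e-9)
```

The factor of two reflects the fact that both LTRs are identical at integration and accumulate substitutions independently afterwards.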

3.3 Results

Phylogenetic analyses recovered the ERV diversity in basal fish species (Figure 3.1). All novel fish ERVs were found to cluster between gammaretroviruses and epsilonretroviruses. Most viruses were found with fragmented viral genomes containing numerous stop codons, frameshifts and highly degraded LTRs, showing that they are no longer active. Only two full-length ERVs (containing 5ʹ and 3ʹ LTRs and gag-pol-env genes) were found in the Australian ghost shark (NW006890094, NW006890327) and one in the little skate (contig643253). The genomic compositions of these four full-length ERVs are shown in Figures 3.3 to 3.5 and are summarised in Table 3.1.

The classification of novel fish ERVs was consistent with the current classification, in which most fish retroviruses are found within the epsilonretrovirus genus or between the epsilonretrovirus and gammaretrovirus genera. However, the clade credibility of fish ERVs with both these genera (gamma and epsilon) is low (bootstrap value < 50%), suggesting that although these novel ERVs fall between gammaretrovirus and epsilonretrovirus, they do not belong to either of them.

Most newly identified coelacanth ERVs form a monophyletic group, except for one which clusters with shark ERVs and one at the basal position to all foamy viruses. There are significant accumulations of stop codons and frameshifts throughout the coelacanth ERV genomes as well as a lack of gag or env genes. One coelacanth ERV was predicted to contain gag-pol-env and both 5ʹ and 3ʹ LTRs (89.9% similarity); however, it also contained a string of hundreds of “n” symbols, indicating that a region of unspecified length had not been sequenced and thus preventing verification as a full-length element.

A coelacanth foamy-like virus was found at the basal position to all foamy viruses, representing an ancient endogenous relative of extant foamy viruses (NJ bootstrap value = 100%; Bayesian posterior probability = 100%). This element has been previously reported as the coelacanth endogenous foamy-like virus (CoeEFV), which had likely co-diverged with its vertebrate host over 407 million years ago (Han and Worobey, 2012a). Taken together, retroviruses have independently integrated into the coelacanth genome at least three times. The diverse coelacanth ERVs found in this study suggest that infectious retroviruses circulated within their hosts more than 407 million years ago.

The extensive diversity of shark ERVs was revealed in this study. Phylogenetic analysis showed that shark ERVs form three distinct lineages. In lineage I, shark ERVs clustered with retroviruses isolated from the snakehead fish (Ophicephalus striatus), while in lineages II and III, shark

ERVs clustered with epsilonretroviruses isolated from the walleye (Sander vitreus) and zebrafish

(Danio rerio). The three shark ERV lineages are only distantly related to each other, suggesting that at least three independent invasions have occurred. Full-length shark ERVs found in this clade have nearly identical LTR pairs (99.7% similarity). This suggests that these shark ERVs result from recent integration events and may still have exogenous counterparts. The extraordinary diversity in the coelacanth and cartilaginous fish implies that retroviruses may have circulated within ancient fish lineages through horizontal transmission over millions of years. In contrast to the diverse coelacanth and shark ERVs, ERVs in the little skate formed only one monophyletic clade. All members of this clade were highly degraded and had accumulated multiple intragenic stop codons and frameshifts, and no paired LTRs were found in any of the skate ERV genomes.

Table 3.1 Genomic features of full-length shark ERVs found in different ERV lineages.

Element | Genome length | LTR length | LTR (% identity) | PBS | PPT | TSR
Lineage 1 | | | | | |
Australian ghost shark (NW006890075) | 8238 | 496 | 89% | tRNAPro | yes | yes
Australian ghost shark (scaffold 324) | 10086 | 336 | 97% | tRNAMet | yes | yes
Lineage 2 | | | | | |
Little skate (contig643253) | 9917 | ? | ? | tRNAPro | yes | no
Lineage 3 | | | | | |
Australian ghost shark (NW006890055) | 9275 | 510 | 96.50% | tRNASer | ? | yes
Australian ghost shark (NW006890327) | 9041 | 611 | 99.70% | tRNASer | ? | no


Figure 3.1 NJ phylogenetic tree of fish ERVs (bootstrap values are shown at the nodes; the coelacanth ERV lineages are highlighted in red squares, whereas the cartilaginous fish ERV lineages are highlighted in blue squares).


3.4 Discussion

Fish retrovirus mining was performed with tBLASTn, and viral RT was used as the indicator of potentially new retroviruses. RetroTector was used to identify the presence of viral genes and LTRs. Analysis of cartilaginous fish and the coelacanth revealed 28 new retroviruses. This result significantly increases the number of retroviruses known in basal and ancient vertebrates. There are representatives from more than one species of fish in each major fish virus clade, indicating that exogenous retroviruses have infected and endogenous retroviruses have reinfected fish hosts multiple times. Consistent with the ICTV classification, some newly found fish ERVs appear to cluster with epsilonretroviruses or to fall between the gammaretrovirus and epsilonretrovirus genera, while still others cannot be classified into current retroviral genera. This demonstrates the extensive retroviral diversity found within sharks and coelacanths.

Shark ERVs form three distantly related lineages, indicating that at least three independent integration events have occurred. The Australian ghost shark (Callorhinchus milii) has accumulated multiple copies of various retroviruses, even though it has the smallest known cartilaginous fish genome (Venkatesh et al., 2014). The neutral evolutionary rate of the Australian ghost shark is not available, but it is known that the neutral evolutionary rate of sharks is approximately an order of magnitude lower than that of mammals. Therefore, an integration time estimated using the average mammalian neutral evolutionary rate may be more recent than the true integration time (t = d/2u, where u represents the neutral evolutionary rate). Nevertheless, these results suggest that these retroviruses started infecting the shark genome more than 250 million years ago, indicating their ancient origins. In contrast, one full-length shark ERV is estimated to have integrated 6.8 million years ago. This suggests that retroviruses have

continually infected the shark genome in the recent past, and it is possible that their exogenous counterparts are still capable of circulating within shark populations. Previous studies have suggested that shark ERVs may have an ancient marine origin, but the possibility that shark ERVs originated from cross-species transmission cannot be excluded. Based on the findings in Chapter II and the analyses of their LTR divergence, I suggest that shark ERVs are more likely to be the result of cross-species transmissions than to represent the marine origin of all retroviruses.

Most newly found coelacanth ERVs clustered into a monophyletic clade which represents a sister group to shark ERVs and SnRV. Only one coelacanth ERV clustered with spumaretroviruses, and this finding is consistent with a previous report that an endogenous foamy virus-like element (CoeEFV) is placed at the basal position to all known foamy viruses (Han and Worobey, 2012a). The phylogenetic results reveal that at least two ERV lineages are present in the coelacanth genome, but they may no longer be infectious. As a representative of the oldest living lineage of Sarcopterygii, the coelacanth genome serves as a useful reference genome for understanding the origin and evolution of reptilian and mammalian genomes. Novel coelacanth ERVs found in this study may facilitate evolutionary studies by serving as evolutionary markers. Apart from CoeEFV, two additional coelacanth ERVs were found clustering with SnRV and some cartilaginous fish ERVs. This finding suggests that retroviruses may be ubiquitous in this ancient fish lineage.


3.5 Conclusion

Fish ERVs formed distinct monophyletic clades, implying that interclass transmissions have occurred only rarely in the ancient past. Although few horizontal transmissions have occurred, the retroviral diversity in basal vertebrate species is high. At least three viral lineages found in both cartilaginous fishes (skates and sharks) and coelacanth genomes indicate that at least three independent retroviral invasions have occurred in these two species. Consistent with previous studies, no fish ERVs were clustered with Class II retroviruses (alpharetroviruses, betaretroviruses, deltaretroviruses and lentiviruses), suggesting that Class II retroviruses may have host ranges restricted to only mammals and birds.


Chapter IV

Biogeographic and horizontal transmission history of mammalian gammaretroviruses

4.1 Introduction

Retroviruses belong to the transposable elements and encode the enzyme reverse transcriptase, enabling synthesis of a DNA copy from their RNA genome. The viral DNA copy can subsequently be inserted into the genome of a host organism, and germline integration events lead to the endogenous retrovirus being passed on vertically over long periods of time. Retroviruses have been discovered in a wide range of vertebrate hosts. Endogenous retrovirus (ERV) traces comprise approximately 8% of mammalian genomes (Waterston et al., 2002; Lander et al., 2001). However, most of these ERVs have become defective over time owing to frameshift or nonsense mutations introduced during replication of the host genome or via recombinational deletion of the internal region of the virus (Boeke and Stoye, 2002; Stoye, 2001).

Gammaretroviruses have been identified in several vertebrate classes (Hayward, 2013; Van

Regenmortel, 2000; Martin et al., 1999). Within mammalian taxa, there appear to be two distinct gammaretrovirus lineages. The first lineage, termed type I gammaretroviruses, contains numerous exogenous and endogenous retroviruses, including FeLV (cat), MuLV

(mouse), GALV (primate), KoEV (koala), OERV (sheep), PERV (pig), KWERV (whale),


and REV (bird) (LaMere et al., 2009; Tarlinton et al., 2006; Klymiuk et al., 2002; Klymiuk et al., 2003; Delassus et al., 1989; Stewart et al., 1986; Wilhelmsen and Temin, 1984; Stephens et al., 1983; Shinnick et al., 1981) as well as the human endogenous retrovirus family, type T (HERV-T) (Hanger et al., 2000; Martin et al., 1999; Werner, 1990). The second mammalian lineage (type II) is less widespread and contains only ERVs, such as the HERV-E family

(Tristem et al., 1996; Repaske et al., 1985). Several conserved motifs in type I gammaretroviruses are absent in HERV-E elements, and the env gene region of HERV-E elements shows little homology with known exogenous retroviruses (Tristem et al., 1996;

Repaske et al., 1985). A number of HERV-E sequences occupying the same genomic position have been identified in the genomes of apes and Old World monkeys. Because the last common ancestor of apes and Old World monkeys is estimated to have lived 27–36 million years ago

(Purvis and Webster, 1999), the mammalian gammaretroviruses are at least this old.

The age and diversity of endogenous gammaretrovirus sequences vary enormously among different mammalian species (Klymiuk et al., 2003; Waterston et al., 2002; Lander et al., 2001).

Within human genomic sequences, there appear to be only two groups of gammaretroviruses

(HERV-T and HERV-E). Both these groups integrated into the genome of an ancestral primate over 25 million years ago and are no longer capable of replicating (Lander et al., 2001; Tristem,

2000). In contrast to human genomic sequences, mouse genomic sequences contain numerous distinct gammaretrovirus families, many of which contain intact or nearly intact members, implying recent activity (Waterston et al., 2002). This is consistent with the large number of exogenous and active endogenous gammaretroviruses known to be currently circulating in some murine populations (Evans et al., 2003; Boeke and Stoye, 2002; Bonham et al., 1997). In addition, increasing evidence has shown that bats are important reservoirs of emerging infections. Their distinctive ecological features, such as their ability to fly and their long lifespan, make them ideal vectors for horizontal transmission. Data from a recent study have shown that a bat gammaretrovirus, RfRV, is basal to all mammalian gammaretroviruses, indicating the possibility that mammalian gammaretroviruses may have originated from bats

(Cui et al., 2012). Moreover, considering the fact that bats are the second most species-rich order of mammals, the coevolution of gammaretroviruses and bat hosts is worth investigating.

Species-to-species transmission events have long been found to arise within the Retroviridae family as a whole and also within gammaretroviruses (Sharp et al., 2005; Yohn, 2005; Mang et al., 2000; Martin et al., 1999; van der Kuyl, 1995; Benveniste, 1975; Benveniste and Todaro,

1974). Phylogenetic analyses have shown that host taxonomic relatedness can influence the rate of productive cross-species transmission. For example, intraclass transmission occurs much more frequently than interclass transmission (Gifford et al., 2005;

Martin et al., 1999; Herniou et al., 1998). So far, only a few horizontal transmission events between distantly related species have been confirmed including that of the highly pathogenic spleen necrosis virus (SNV), which is the result of a transmission event from mammals to birds

(Martin et al., 1999). Other examples are KoEV and rodent viruses, which are the result of horizontal transmission between marsupials and placental mammals (Hayward et al., 2013;

Simmons et al., 2012). Furthermore, it appears that the frequency of transmission in this lineage is higher than that observed in both type II mammalian gammaretroviruses and gammaretroviruses harboured by birds, reptiles and amphibians (Martin et al., 2003; Martin et al.,

1999). However, the frequency of interspecies transmission events within mammalian gammaretroviruses is unclear, as is whether the transmission rate is influenced by the distance between donor and recipient hosts. Furthermore, it must be clarified whether a mammalian reservoir can be maintained in a rodent host via frequent rodent-to-rodent transmission with occasional transmission to other mammalian orders.

Here, I apply a bioinformatics method to recover novel gammaretroviruses in 69 mammalian orders and perform phylogenetic analysis on the virally encoded pol gene. The worldwide frequency and distribution of type I gammaretroviruses within their hosts and their horizontal transmission history are assessed by both ancestral state reconstructions and reconciliation analyses. Coupled with geographical data, I find that there are several distinct geographical areas which appear to be hotspots for present-day horizontal transmissions. Furthermore, I estimate the effect of host taxonomic distance on horizontal transmission frequency and show that transmission rates decline significantly with increasing distance between donor and recipient hosts.


4.2 Materials and Methods

In silico mining of gammaretroviral sequences

The pol gene of the gibbon ape leukaemia virus (GaLV) is used as a representative probe to search for gammaretroviral sequences in mammalian genomes by using the program tBLASTn.

The recovered database sequences were added to an alignment in a descending order of similarity until they clustered with another probe sequence in preliminary phylogenetic analysis. To decrease the computational time, novel viral sequences with a nucleotide similarity greater than 90% were chosen only once as representative novel sequences for subsequent phylogenetic analyses.
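The representative-selection step described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the procedure actually used: the sequences are assumed to be pre-aligned to equal length, similarity is taken as simple pairwise identity over ungapped columns, and a greedy pass keeps a sequence only if it is no more than 90% identical to every representative already kept. The function names and toy sequences are hypothetical.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical bases over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def pick_representatives(aligned, threshold=0.90):
    """Greedily keep one representative per group of near-identical sequences.

    aligned: dict mapping sequence name -> aligned nucleotide sequence.
    A sequence is kept only if its identity to every representative chosen
    so far does not exceed `threshold`.
    """
    representatives = {}
    for name, seq in aligned.items():
        if all(pairwise_identity(seq, rep) <= threshold
               for rep in representatives.values()):
            representatives[name] = seq
    return list(representatives)


# Toy alignment: erv2 is identical to erv1, erv3 differs at 3 of 10 sites.
toy = {
    "erv1": "ACGTACGTAC",
    "erv2": "ACGTACGTAC",
    "erv3": "ACGTTTTTAC",
}
kept = pick_representatives(toy)
```

Because the pass is greedy, the outcome depends on input order; sorting the hits by descending similarity to the probe first would keep the best-matching member of each group.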

Alignment and phylogenetic analyses

A 716 bp DNA alignment of type I mammalian MLVs, including 178 novel PCR-derived ERVs (Martin, unpublished data), was constructed using the program Geneious 6.1.6 (Kearse et al., 2012), which allows the simultaneous comparison of sequences at both the nucleotide and amino acid levels. This alignment included the conserved viral RT region, while some small variable regions were excluded. Apart from the variable regions, there were a few instances where amino acid insertions had to be included to align a group of sequences. Type II mammalian gammaretroviruses were also included as an outgroup.

Bayesian phylogenetic reconstruction was performed using MrBayes 3.2.2 (Ronquist and Huelsenbeck, 2003), with two runs of 1,000,000 generations and sampling of posterior trees every 100 generations. The phylogenetic model was selected using MEGA 6.0 (Tamura et al., 2013), and the GTR model of nucleotide substitution with gamma-distributed rate heterogeneity was applied.

Host-parasite association test

I used Bayesian tip-association significance testing (BaTS) (Parker et al., 2008) to examine whether (1) the main lineages of type I gammaretroviruses tended to be associated with a particular mammalian host and whether (2) the viruses with an intact pol gene tended to cluster together. For the second test, taxa were assigned to one of three categories, dependent on the number of inactivating mutations within the amplified sequences (intact: no stop codon in pol; nearly intact: one stop codon in pol; not intact: more than one stop codon in pol). This method examines tip-character association by estimating two parameters: the association index (AI) (Wang et al., 2001) and the Fitch parsimony score (PS) (Figure 4.1). If the trait under investigation is associated with the phylogeny, then the observed PS value should be inversely related to the strength of the tip-character association.

The PS statistic for a given trait is in the range 1 ≤ PS ≤ n, where n represents the number of tips in the phylogeny. A low PS value indicates strong tip-character associations, while a high PS value represents little association between a tip and its respective character. Another parameter, AI, is a sum over all the internal nodes in the phylogeny, where k represents the number of internal nodes. In the following equation, for each internal node i, f_i is defined as the frequency of the most common trait value among the tips subtended by node i, and m_i is the number of tips subtended by node i.

$$AI = \sum_{i=1}^{k} \frac{1 - f_i}{2^{m_i - 1}}$$

Therefore, low AI values represent a strong phylogeny-trait association.
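As a rough illustration of the two statistics, the sketch below computes a Fitch parsimony score and an association index on a small rooted binary tree written as nested tuples. It assumes the standard form of the index attributed to Wang et al. (2001), AI = sum over internal nodes of (1 - f_i) / 2^(m_i - 1); the tree encoding and the function names are hypothetical and are not part of BaTS, which additionally averages over a posterior sample of trees and compares against a null distribution.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple, Union

# A tip is a name; an internal node is a tuple of exactly two subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def tips(node: Tree) -> List[str]:
    """Names of all tips subtended by a node."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def fitch_score(tree: Tree, traits: Dict[str, str]) -> int:
    """Minimum number of trait changes on the tree (Fitch parsimony)."""
    changes = 0

    def state_set(node: Tree) -> Set[str]:
        nonlocal changes
        if isinstance(node, str):
            return {traits[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # no shared state: one change is required at this node
        return left | right

    state_set(tree)
    return changes


def association_index(tree: Tree, traits: Dict[str, str]) -> float:
    """Sum over internal nodes of (1 - f_i) / 2**(m_i - 1) (assumed form)."""
    total = 0.0

    def visit(node: Tree) -> None:
        nonlocal total
        if isinstance(node, str):
            return
        states = [traits[t] for t in tips(node)]
        m = len(states)
        f = Counter(states).most_common(1)[0][1] / m
        total += (1 - f) / 2 ** (m - 1)
        for child in node:
            visit(child)

    visit(tree)
    return total
```

On a four-tip toy tree, grouping tips by host gives a lower PS and a lower AI than interspersing the hosts, which matches the interpretation given above.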


A new statistic which was used in Salemi et al. (2005) is also investigated. Intuitively, stronger phylogeny-trait associations should yield larger monophyletic clades (MCs) whose tips share the same trait. This is quantified by the MC size statistic for an investigated trait value x. In the following equation, m_i represents the number of tips subtended by node i, N_x is the number of tips with trait value x, and I_i is an indicator function that equals 1 if all tips subtended by node i have trait value x and 0 otherwise.

$$MC(x) = \max_{i} \left( m_i I_i \right), \quad 1 \le MC \le N_x$$
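A minimal sketch of the MC statistic follows, assuming it is the size of the largest clade whose tips all carry trait value x, with single tips counted as clades of size one. The nested-tuple tree encoding and the function name are hypothetical.

```python
from typing import Dict, Tuple, Union

# A tip is a name; an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def max_monophyletic_clade(tree: Tree, traits: Dict[str, str], x: str) -> int:
    """Size of the largest clade in which every tip has trait value x."""
    best = 0

    def visit(node: Tree) -> Tuple[int, bool]:
        # Returns (number of tips below node, whether all of them have trait x).
        nonlocal best
        if isinstance(node, str):
            size, pure = 1, traits[node] == x
        else:
            results = [visit(child) for child in node]
            size = sum(s for s, _ in results)
            pure = all(p for _, p in results)
        if pure:
            best = max(best, size)
        return size, pure

    visit(tree)
    return best
```

For a trait that is absent from the tree the function returns 0, which falls outside the range 1 ≤ MC ≤ N_x because N_x is then zero.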

To test whether the viral sequences with an intact pol region are significantly clustered together, I assigned all tips to one of three states (intact, nearly intact or not intact) depending on the number of inactivating mutations within the viral sequences. For the host-retrovirus associations, the tips were categorised into several states according to their host classification system. Both these analyses were based on the obtained type I mammalian gammaretrovirus tree.

Figure 4.1 (a) Strong association (b) Maximally interspersed (c) Intermediate situation (modified from Parker et al., 2008)


Estimating the number of gammaretroviral lineages per species

The number of viruses from each species that appeared to be independently derived in the phylogenies was counted. A virus was regarded as independent if it was phylogenetically separated by a virus found within a distantly related host (i.e. subfamily level or higher). Sister viruses from closely related hosts may represent vertical transmission from their common ancestor, while sister viruses recovered from two distinct hosts within separate mammalian subfamilies are likely to be independently derived. This method was only used for estimating general trends in gammaretrovirus abundance and diversity in different orders of mammals.

Estimating host switches

Interorder transmission

Putative host switches were reconstructed under parsimony and maximum likelihood by using Mesquite 3.0 (Maddison and Maddison, 2008). The phylogenetic tree was constructed from type I gammaretroviruses. In both approaches, host associations were simplified into an unordered character matrix. Two general models can be used for likelihood reconstruction: the Markov k-state 1 parameter model (MK1) and the asymmetrical Markov k-state 2 parameter model (AsymmMK). The major difference between these two models is that the 2 parameter model allows forward and backward rates to be different. The 2 parameter model can be useful if the probability of gaining (e.g. host switches) vs. losing an interaction is known from prior experience. However, in a global-scale analysis, it is unlikely that universal transmission/loss rates apply to diverse virus-host associations. Thus, here I applied both models with their default settings by using Mesquite.
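To make the difference between the two likelihood models concrete, the sketch below writes out their transition probabilities along a branch of length t. It assumes the usual continuous-time Markov formulation: a single rate between any pair of distinct states for the 1 parameter model, and separate forward and backward rates for a two-state character in the 2 parameter model. The function names are hypothetical and this is not Mesquite code.

```python
import math
from typing import List


def mk1_transition_matrix(k: int, rate: float, t: float) -> List[List[float]]:
    """P(t) for k states when every change between distinct states has the same rate."""
    decay = math.exp(-k * rate * t)
    same = 1.0 / k + (k - 1) / k * decay
    diff = 1.0 / k - decay / k
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def asymm_two_state_matrix(forward: float, backward: float, t: float) -> List[List[float]]:
    """P(t) for two states with different forward (0->1) and backward (1->0) rates."""
    total = forward + backward
    decay = math.exp(-total * t)
    p01 = forward / total * (1.0 - decay)
    p10 = backward / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]
```

When the forward and backward rates are equal, the two-state asymmetrical matrix reduces to the 1 parameter matrix with k = 2, which is why the simpler model is a special case of the other.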

Intrafamily transmission

For co-phylogenetic analysis, host phylogenies were constructed from previous studies (Blanga-Kanfi et al., 2009; Mao et al., 2008; Eick et al., 2005). Viral fragments obtained from species which lack data regarding their host relationships were omitted. Jane 4.0 (Conow et al., 2010), a parsimony-based software tool, was used for the co-phylogenetic reconstruction with the following predefined event cost values: co-speciation = 0, duplication = 1, host switch = 2, loss = 1 and failure to diverge = 1. Jane 4.0 uses an event-based approach that assumes the following constraint: the co-speciation cost value is strictly less than the duplication, host switch and loss cost values (Figure 4.2). This constraint was preferred as Fahrenholz (1913) stated that ‘parasite phylogeny should mimic host phylogeny’. In practice, it is difficult to determine what these event costs should really be since every biological system is different. Here, I used the general event costs which are applied as defaults in most co-phylogenetic analyses: co-speciation is assumed to have a low cost (0 or a small positive value), duplication equals 1, host switch equals 2, loss equals 1 and failure to diverge equals 1. Notably, the event costs are unitless in the sense that only their relative values matter.
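The role of the event costs can be illustrated with a short sketch that scores a reconciliation as the cost-weighted sum of its event counts. The dictionary keys, the function name and the example counts are hypothetical; the default costs are the general values listed above, and this is not Jane's own implementation.

```python
from typing import Dict

# General event costs as listed in the text (assumed here as defaults).
DEFAULT_COSTS: Dict[str, float] = {
    "cospeciation": 0,
    "duplication": 1,
    "host_switch": 2,
    "loss": 1,
    "failure_to_diverge": 1,
}


def reconciliation_cost(events: Dict[str, int], costs: Dict[str, float] = DEFAULT_COSTS) -> float:
    """Total cost of one reconciliation: sum of (event count x event cost)."""
    return sum(costs[name] * count for name, count in events.items())


# Two hypothetical reconciliations of the same pair of trees.
mostly_cospeciation = {"cospeciation": 8, "duplication": 3, "host_switch": 2, "loss": 1}
mostly_switching = {"cospeciation": 5, "duplication": 2, "host_switch": 5, "loss": 0}
```

Multiplying every cost by the same positive constant changes the totals but not which reconciliation is cheaper, which is the sense in which only the relative values of the costs matter.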


Figure 4.2 Tanglegram and co-phylogeny mapping (a) A simple tanglegram (b) Four possible reconstructions of the parasite tree (green) into the host tree (blue) for the tanglegram.

Despite advances in the efficiency of reconciliation under explicit event costs, little is understood about the relationship between event costs and the resulting maximum parsimony reconciliations. Therefore, I applied a recently developed co-phylogenetic tool, Xscape (Libeskind-Hadas et al., 2014), to provide the reconciliation results obtained from a range of costs instead of from explicit costs. The costscape tool was used to find Pareto-optimal solutions and to visualise the host switch or loss costs over the range 0.2–5. Empirical P-values were computed using the sigscape tool to test the null hypothesis that the host and parasite trees are similar owing to chance.
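The idea behind an empirical P-value from randomisation can be sketched as follows: the reconciliation cost of the observed data is compared with the costs obtained from randomised data, and the P-value is the proportion of randomisations that do at least as well. The add-one correction used here is a common convention and is an assumption; whether sigscape or Jane apply it is not stated in the text. The function name is hypothetical.

```python
from typing import Sequence


def empirical_p_value(observed_cost: float, null_costs: Sequence[float]) -> float:
    """Proportion of randomised trials whose cost is no higher than the observed cost.

    One is added to numerator and denominator so that the estimate is never zero.
    """
    at_least_as_good = sum(cost <= observed_cost for cost in null_costs)
    return (at_least_as_good + 1) / (len(null_costs) + 1)
```

A small P-value means that randomised host-parasite associations rarely reconcile as cheaply as the observed ones.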


4.3 Results

4.3.1 Detection and characterisation of mammalian gammaretroviruses

Novel type I gamma ERV pol fragments were amplified from 69 mammalian orders, and about 65% of ERVs were derived from rodent hosts. A Bayesian phylogenetic tree was constructed from 620 novel type I gamma ERVs and previously identified gammaretroviruses (Figures 4.3 and 4.4).

Over time, replication-competent endogenous retroviruses accumulate mutations as a result of host DNA replication, and therefore, the number of inactivating mutations can be used as an indicator of the approximate time since their integration. Viruses with intact or nearly intact sequences were mainly identified from African primates, bats and South American rodents. To estimate the distribution of potentially active gammaretroviruses across the phylogeny, I examined the number of in-frame stop codons and frameshift mutations within the retrieved viral sequences (Table 4.1). No evidence was obtained that viruses with similar numbers of inactivating mutations within the pol region were clustered together (P > 0.05).

Table 4.1 BaTS analyses of viruses and their respective pol ORF state (intact: no stop codon in pol, nearly intact: with one stop codon in pol, not intact: with more than one stop codon in pol).


Figure 4.3 Ancestral state reconstruction based on the maximum likelihood model (Enlarged portions of the phylogenetic tree are shown on the following pages).


Figure 4.4 Ancestral state reconstruction based on the parsimony model (Enlarged portions of the phylogenetic tree are shown on the following pages).


(1) First major group

The viruses with intact or nearly intact sequences were largely confined to two major groups and one minor group within the phylogeny. The first major group comprises GaLV, KoRV, BaEV, exogenous FeLV, MuLV and a variety of ERVs derived from African, South-East Asian and Australian rodent taxa. Within this group was a monophyletic clade consisting only of African and South-East Asian primate gammaretroviruses. Elements found in this clade have an open reading frame in the pol region, implying that these ERVs are a young subgroup and may remain retrotransposition-competent.

Notably, the viruses derived from the South-East Asian rhesus macaque (Macaca mulatta) and from African primates, including grivet (Chlorocebus aethiops) and green monkey (Chlorocebus sabaeus), showed 95.2–96.8% similarity at the protein level. This provides evidence of interfamily transfer between African and South-East Asian primates. Another interesting finding here is that the gammaretrovirus found in wild cats is also recovered within the primate clade, suggesting that interorder horizontal transmission has occurred between these two distantly related mammalian orders. Horizontal transmission between two divergent orders seemed to be relatively frequent. Besides the transmission between primates and wild cats, a clade mixing viruses derived from vesper bats (Myotis davidii), Beatrix’s bats (Glauconycteris beatrix) and grey mouse lemurs (Microcebus murinus) suggested frequent interorder transmissions between bats and primates in Africa.

Although most cases observed in horizontal transmissions of type I gammaretroviruses involved rodents as the virus donors, bats, a well-known virus reservoir, were also responsible for occasional transmissions of gammaretroviruses among mammalian hosts. The smaller group with intact or nearly intact pol regions recovered from genomes of bats, lemurs and tree shrews suggested cross-species transmission among diverse mammalian orders in this region. The ancestral state reconstruction analysis suggested that bats are the main retroviral donors in this clade (maximum likelihood proportion = 77%).

(2) Second major group

Apart from Africa, a high frequency of horizontal transmission can also be observed among species from North and South America. The second large group was composed of a spectacular mixture of host species which come from North and South America. This finding suggests that hosts have been moving between these two continents, followed by horizontal transmissions.

Most transmissions involved small mammalian species, including rodents from South America and pikas (Lagomorpha) and American marten (Carnivora) from North America. The ancestral state reconstruction analysis revealed that rodents represented the most plausible source of interorder horizontal transmissions in this clade (maximum likelihood proportion = 99%; with 39 horizontal transmission events). In addition, a clade mixing viruses derived from a North American bat and a South American rodent suggested frequent horizontal transmissions between the two well-known viral reservoirs, rodents and bats.

Interestingly, a clade composed of gammaretroviruses derived from marine mammals (Atlantic white-sided dolphin, bottlenose dolphin and Risso’s dolphin) was found clustered with viruses from African primates (lemurs, gorillas, baboons, lesser hedgehog tenrecs) and North American carnivores (cougars, raccoons and hedgehogs). The proportional likelihood values showed that the gammaretroviruses found in the marine mammals are likely to be the result of viral cross-species transmission from carnivores (39%), primates (22%) or rodents (15%). More viral sequences are needed to gain a better understanding of the origin of the gammaretroviruses in the marine mammals.

(3) Third group

Within the type I gammaretrovirus phylogeny, there appears to be a strong boundary for the common presence of rodent-derived viruses. That is, most clades contained numerous viruses from rodents, which were recovered from multiple rodent species. In contrast, the most basal clade in the phylogeny only contained three rodent-derived viral sequences, two of which (RV Major’s tufted-tailed rat, RV large-headed rice rat) are phylogenetically separated from each other by viruses present within bats. This uneven distribution of rodent-derived viruses suggests that there has been a switch in host preference during the evolution of gammaretroviruses. The basal clade contained a more heterogeneous mixture of host species, including retroviruses from sloths, bats, hedgehogs and South American primates. Within this group, there are three major lineages that are largely confined to particular host species, indicating fewer interorder horizontal transmissions among these species. In addition, most viruses in this group have accumulated numerous in-frame stop codons and frameshift mutations, indicating that the viruses in this group are replication-incompetent. The first lineage in this group consisted of viruses from a mixture of bat hosts from South-East Asia, North America and Africa, indicating that this gammaretrovirus may have been inherited from a common ancestor or once circulated among species but has now lost the capability for infection or replication. The hedgehog gammaretroviruses comprise the second lineage, which contained viruses isolated from European hedgehogs and hedgehogs endemic to Africa. The third lineage was mainly composed of infectious gammaretroviruses derived from South American primates. Moreover, a distinct monophyletic clade consisted of exogenous SNV and gammaretroviruses recovered from the short-beaked echidna (Tachyglossus aculeatus), showing their clear distinction from other gammaretroviruses in this basal clade.

4.3.2 Gammaretrovirus frequency in mammals

The 69 mammalian genomes screened in this study, together with a more detailed examination of host genome sequences, allow some general trends in gammaretrovirus load and distribution within different mammals to be described here.

The gammaretroviral sequences from non-rodent taxa were less dispersed in the phylogenetic tree (Figures 4.3 and 4.4). The number of phylogenetically distinct type I gammaretrovirus lineages within rodents proved to be substantially higher than that within other host species. Mice and other rodents exhibit at least seven phylogenetically distinct type I gammaretrovirus lineages, whereas bats exhibit five lineages and dogs, cats, rabbits and dolphins each exhibit only one.

Although bats also exhibit high numbers of lineages, the number of bat gammaretroviruses is substantially lower than that of rodents, and half of the bat viral pol genes were interrupted by numerous stop codons, indicating that they are replication-incompetent. Taken together, this evidence suggests that rodents may represent a significant type I gammaretrovirus reservoir, presumably capable of occasional horizontal transmission to other mammalian species.

4.3.3 Rodents are significant vectors of interorder viral transmission events within mammals

The viruses derived from non-rodent taxa appear relatively dispersed, often clustering as sister clades to rodent viruses. To determine to what extent rodents are involved in interorder transmission events within mammals, the host-retrovirus associations were tested based on the association index (AI) and Fitch parsimony score (PS), and host switch numbers and directions were estimated by ancestral state reconstructions using Mesquite.

Type I gammaretroviruses displayed a significant phylogenetic structure in the correlation of the viruses and their probable reservoir hosts (AI and PS: P < 0.001). Although there is significant support for the clustering of viruses with the same reservoir hosts across the phylogeny, only three broadly defined reservoir hosts (rodents, primates, cetaceans; MC: P < 0.05) fall into this category. This suggests that there is substantial host switching within type I gammaretroviruses.

The lack of significance of host-parasite association (P > 0.05) for some viruses was suggested to be the product of host jumping. Cetaceans, such as dolphins, share similar habitats and produced strong signals of the reservoir host structure in the tree (P = 0.009) (Table 4.2).

Table 4.2 Statistical analysis of virus-host associations. An asterisk (*) marks P values < 0.05.

Taxonomic group    Observed mean        Null mean              Significance
AI                 4.43 (4.16-4.66)     9.49 (8.43-10.56)      0*
PS                 33.95 (34.0-34.0)    49.53 (47.57-51.61)    0*
primates           4 (4-4)              1.67 (1.0-3.0)         0.009*
carnivora          2 (2-2)              1.32 (1.0-2.0)         0.29
rodentia           13 (13.0-13.0)       4.11 (3.0-7.0)         0.009*
lagomorpha         2 (2.0-2.0)          1.01 (1.0-1.0)         0.02*
artiodactyla       2 (2.0-2.0)          1.04 (1.0-1.0)         0.04*
chiroptera         2 (2.0-2.0)          1.15 (1.0-2.0)         0.13
cetacea            3 (3.0-3.0)          1.02 (1.0-1.0)         0.009*
scandentia         1 (1.0-1.0)          1.01 (1.0-1.0)         1
didelphimorphia    2.0 (2.0-2.0)        1.01 (1.0-1.0)         0.009*
diprotodontia      1 (1.0-1.0)          1 (1.0-1.0)            1
erinaceomorpha     1 (1.0-1.0)          1.02 (1.0-1.0)         1
hyracoidea         1 (1.0-1.0)          1 (1.0-1.0)            1


Having established that host-virus association was significant in the type I gammaretroviruses, I next investigated the number and direction of host switches between different taxa in different mammalian orders. Ancestral state reconstruction analyses were applied to calculate the number of transmission events between the investigated subfamily and other mammalian hosts by using Mesquite.

The parsimonious and maximum likelihood models (MK1 and AsymmMk) were applied to reconstruct the ancestral states along the investigated type I gammaretrovirus tree. Before this analysis, type II gammaretroviruses (outgroup) were removed from the tree because they would have to be mapped to their respective hosts, biasing host state reconstruction. The results are summarised in Figure 4.5.

In summary, these analyses determined that host switches from rodents to other host categories have occurred more often than switches originating from any other host category. Of these, 55% of horizontal transmissions involved a rodent as the donor species and a non-rodent as the recipient, with the recipient being carnivores (30%), primates (22%) and other mammalian orders. Apart from rodents, about 39% of horizontal transmissions involved a bat as the donor species, with the recipient mainly being rodents (27%) and non-rodent species from 11 different mammalian orders (73%). Taken together, rodents appeared to be involved in more than 64% of interorder transmission events and were donors rather than recipients in approximately 55% of these cases.


Figure 4.5 Summary of average host switches modelled by parsimony ancestral state reconstruction using Mesquite.


4.3.4 Transmission frequency varies according to the genetic distance of donor and recipient.

Based on tip association and ancestral state analyses as well as the substantially higher number and diversity of type I gammaretroviruses found in rodents, I have established that rodents are the major reservoir and bats the second major reservoir in interorder transmission events of type I gammaretroviruses within mammals. I next investigated to what extent the overall frequency of transmission varied with the taxonomic distance between the donor and potential recipients. To this end, I calculated the number of viral transmission events occurring at the following taxonomic levels: (1) Intrafamily transmission: between different host genera within the subfamily Murinae and within the order Chiroptera; (2) Interfamily transmission: to and from members of the subfamily Murinae/order Chiroptera and other (non-Murinae/non-Chiroptera) mammalian hosts.

Two types of analyses were used to estimate the number of viral horizontal transmission events. First, I used Mesquite to calculate the number of transmission events between the subfamily Murinae/order Chiroptera and other mammalian hosts. Rodent type I gammaretroviruses harboured by different genera within the subfamily Murinae were categorised into 15 categories on the basis of their host family/subfamily classifications, and the remaining non-Murinae viruses were categorised into one category. This analysis indicated that 15 horizontal transmission events have occurred between Murinae and non-Murinae hosts.

For bats, the derived viruses were classified into five categories on the basis of their host family classifications, and the remaining non-Chiroptera viruses were grouped into one category. The analysis reveals that two horizontal transmissions have occurred between Chiropteran and non-Chiropteran hosts (Figures 4.6 and 4.7).


Figure 4.6 Intrafamily transmission of rodent type-I gammaretroviruses


Figure 4.7 Intrafamily transmission of bat type-I gammaretroviruses


Although ancestral state reconstruction analysis can estimate the transmission events between two distantly related hosts, it could be problematic when estimating transmission between two closely related hosts. This is because sister viruses found in sister genera within the same host family/subfamily may well be the result of vertical transmission from their common ancestors. For this reason, the second method, which reconciles two phylogenetic trees, the host tree and the parasite tree, was used to assess the level of horizontal transmission between different genera within a host family/subfamily.

Viruses obtained from rodents and bats were categorised into several groups on the basis of the host genus from which they were derived and were then compared with actual host phylogenetic trees by reconciliation analysis. This method was used to assess the level of horizontal transmission between different genera within each mammalian order. The first approach used Jane 4.0. Based on its default setting (co-speciation = 0, duplication = 1, host switch = 2, loss = 1 and failure to diverge = 1), I calculated the number of each type of host-parasite evolutionary event. Co-phylogenetic signals which were not significant (P > 0.05) were excluded from subsequent transmission frequency calculations. The results of Jane 4.0 are summarised in Figures 4.8–4.13.


Figure 4.8.a Co-phylogeny of host and rodent-derived viruses (lineage I). Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.8.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

Figure 4.9.a Co-phylogeny of host and rodent-derived viruses. Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.9.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

Figure 4.10.a Co-phylogeny of host and rodent-derived viruses (lineage II). Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.10.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

Figure 4.11.a Co-phylogeny of host and rodent-derived viruses (lineage III). Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.11.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

Figure 4.12.a Co-phylogeny of host and bat-derived viruses (lineage I). Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.12.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

Figure 4.13.a Co-phylogeny of host and bat-derived viruses (lineage II). Hollow circles at the nodes indicate co-speciation events; solid circles, duplications; arrows, host switch events; and dotted lines, sorting events (losses).

Figure 4.13.b Cost histogram. The horizontal axis represents the cost of the sample and the vertical axis represents the number of samples with the corresponding cost. The red dashed line shows the best solution found for the original dataset and the blue histogram shows the costs for the 50 random trials.

However, the assumption of fixed event costs could be problematic since every ecosystem may exhibit its own host-parasite associations. To evaluate the possible impact of different cost weights, I explored a wide range of cost parameters in order to investigate the effect of the host switching penalty on the global cost. The result showed that increasing the cost of host switching events had a significant overall impact on total cost. For example, by increasing the host switching cost penalty from 2 to 6, the predicted number of host switching events can decrease from 24 to 8. Therefore, it is important to perform the reconciliation analyses over a cost range and to obtain a plausible result within that range.

Rather than using explicit event costs, the costscape tool reported the number of reconciliation events in all maximum parsimony solutions over a cost range (duplication normalised to 1, co-speciation fixed to 0, host switches and loss ranging from 0.1 to 5). All maximum parsimony solutions which are Pareto-optimal were reported in the form (c, d, s, l), which indicates the number of co-speciations, duplications, host switches and losses, respectively. The results are summarised in Table 4.3.
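The notion of Pareto-optimality over event counts can be sketched with a generic dominance filter. The sketch assumes that each candidate reconciliation is summarised by a vector of counts for the events that carry a cost, here written as (duplications, host switches, losses), and that a vector is discarded when another vector is no worse in every component and strictly better in at least one. The function names and the example vectors are hypothetical; this is not the costscape algorithm itself, which works on the trees rather than on a given list of vectors.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is no larger than b everywhere and strictly smaller somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(vectors: Sequence[Vector]) -> List[Vector]:
    """Event count vectors that are not dominated by any other vector."""
    unique = list(dict.fromkeys(vectors))  # drop duplicates, keep input order
    return [v for v in unique if not any(dominates(w, v) for w in unique)]
```

Every vector on the front is the cheapest solution for at least some trade-off between the counted events, which is why a range of costs can be summarised by a small set of vectors.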

The reconciliation yielded no major conflicts between Jane 4.0 (parsimony and maximum likelihood methods) and Xscape (Pareto-optimal solutions). Most Pareto-optimal event solutions found by costscape were identical or nearly identical to those found by Jane 4.0. For example, for one of the rodent-derived viral lineages, one Pareto-optimal solution found by Xscape is (8, 11, 7, 2), while (8, 11, 7, 3) was found by Jane 4.0. Furthermore, Xscape found several solutions which meet the Pareto-optimal criteria, such as (6, 11, 9, 0), (8, 10, 7, 2) and (8, 9, 9, 2) (Figure 4.14; Table 4.3). However, these solutions were only optimal at extreme values for costs of host switch or loss events, and therefore, they were unlikely to be true. This gave a good indication that the reconciliation results estimated by Jane 4.0 were good enough to represent those obtained from a range of event costs. Furthermore, to assess whether the reconciliations were similar owing to chance, a permutation test employing 1000 trials was conducted using the sigscape tool in Xscape. The event count vectors found by both costscape and Jane 4.0 fell in the green region, which represents significance at the 0.01 level (Figure 4.14). Thus, the null hypothesis that co-speciation events could have occurred by chance could be rejected.


To account for the different sample sizes and evolutionary time of each phylogenetic tree, I divided the number of co-speciation events and host switch events by the number of parasites and the estimated divergence time in each association (Table 4.3). Since the denominator (parasite number × divergence time) is much greater than the numerator (number of cospeciation or host switch events), the ratio is not sensitive to slight differences between the resultant event numbers estimated by Jane 4.0 and Xscape. The resulting ratios were used to compare across the different host-parasite associations. Notably, this only gave a rough estimation of the transmission frequencies among different hosts, since the transmission events may well be underestimated.
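The normalisation described above amounts to dividing an event count by the product of the number of parasites and the divergence time. A minimal sketch follows; the function name is hypothetical and the numbers in the usage note are illustrative only, not values from this study.

```python
def events_per_parasite_per_myr(n_events: int, n_parasites: int, divergence_time_myr: float) -> float:
    """Event count divided by (number of parasites x divergence time in million years)."""
    if n_parasites <= 0 or divergence_time_myr <= 0:
        raise ValueError("parasite number and divergence time must be positive")
    return n_events / (n_parasites * divergence_time_myr)
```

For instance, 6 host switches among 30 parasites over 100 million years would give 0.002 switches per parasite per million years; changing the event count by one shifts this ratio only in the fourth decimal place.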

The result showed that type I rodent gammaretroviruses and host associations have the greatest number of host switches per parasite per million years (0.002, 0.003, 0.002), while primate gammaretroviruses and primate hosts have the lowest host switch ratio (0.0002). Rodent associations also have the greatest number of co-speciations per parasite per million years, especially in lineage 1 (0.007), which is about nine times greater than that of the bat gammaretrovirus-derived lineage (0.0008).


Table 4.3 Reconciliation results with Jane 4 and Xscape. Jane reconciles the parasite and host trees based on pre-set event costs (co-speciation = 0, duplication = 1, duplication and host switch = 2, loss = 1, failure to diverge = 1); the transmission frequencies were estimated based on significant reconstructions. Xscape finds solutions that are Pareto-optimal over a cost range (each event cost ranged from 0.1 to 5); only significant reconciliations are shown in this table. The transmission frequencies were calculated based on the number of significant host switches found by Jane 4.

Rodent

Lineage 1 (67.52 mya)
  Jane 4, host switch #
    Intergenus (Family: Murinae)
      murini->hydromyini    1 (74%)
      murini->rattus        1 (100%)
    Interfamily
      Cricetidae->Muridae   1 (<1%)
      Muridae->Cricetidae   1 (<1%)
  Xscape, (s,d,t,l) and host switch #
      (7,2,4,5)    4
      (4,2,7,0)    7
  Estimated transmission frequency
      Intergenus   0.002

Lineage 2 (112.89 mya)
  Jane 4, host switch #
    Intergenus (Family: Murinae)
      mus->rattus           1 (100%)
      mus->rattus           1 (54%)
      mastomys->mus         1 (54%)
      mus->lemniscomys      1 (99%)
      melomys->notomys      1 (100%)
      apodemus->((pseudomys,notomys),(conilurus,melomys))    1 (100%)
      apodemus->mus         1 (<1%)
      stochomys->(mus,mastomys)    1 (100%)
      ((lemniscomys,arvicanthis),stochomys)->(leopoldamys,(berylmys,rattus))    1 (100%)
      apodemus->?           1 (<1%)
    Interfamily
      Murinae->Cricetidae   1 (100%)
  Xscape, (s,d,t,l) and host switch #
      (8,14,10,5)  10
      (4,15,13,0)  13
      (8,16,8,7)   8
      (6,16,10,3)  10
  Estimated transmission frequency
      Intergenus   0.003
      Interfamily  0.0003

Lineage 3 (133.09 mya)
  Jane 4, host switch #
    Intergenus (Family: Murinae)
      praomys->mastomys     1 (100%)
      conilurus->rattus     1 (100%)
      conilurus->apodemus   1 (100%)
      conilurus->myomys     1 (100%)
      mus->(mastomys,myomys)    1 (100%)
      rhabdomys->(lemniscomys,arvicanthis)    1 (88%)
  Xscape, (s,d,t,l) and host switch #
      (8,6,12,2)   12
      (8,11,7,2)   7
      (6,11,9,0)   9
  Estimated transmission frequency
      Intergenus   0.002

Bat

Lineage 1 (555.15 mya)
  Jane 4, host switch #
    Intergenus
      Family: Pteropodidae
        rousettus->pteropus       2 (100%)
      Family: Vespertilionidae
        scotophilus->myotis       1 (65%)
      Family: Molossidae
        mormopterus->chaerophon   1 (100%)
    Interfamily
      Molossidae->Pteropodidae    1 (65%)
      (Vespertilionidae,Molossidae)->Pteropodidae    1 (32%)
  Xscape, (s,d,t,l) and host switch #
      (7,3,6,4)    6
      (8,4,4,7)    4
      (8,5,3,9)    3
  Estimated transmission frequency
      Intergenus   0.0004
      Interfamily  0.00001

Figure 4.14 Pareto regions computed by the costscape (left) and sigscape (right) tools in Xscape for hosts and rodent-derived type I gammaretroviruses. Costscape partitions the cost space into regions, where each region allows the same set of maximum parsimony reconciliations. The displayed event counts comprise the numbers of speciations, duplications, transfers and losses, and the ‘count’ field gives the number of distinct reconciliations in each region. Sigscape reports which parts of the cost space achieve significance at the 0.01 level (green), fall between 0.01 and 0.05 (yellow) or fail to reach significance at the 0.05 level (red).
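The way event costs select between alternative reconciliations can be illustrated with a short calculation. The sketch below scores the two (s,d,t,l) vectors listed by Xscape for rodent lineage 1 under the cost scheme given in the Table 4.3 caption, and then under a second, assumed scheme. The function names are illustrative, the mapping of the "host switch and duplication" cost onto the transfer count t is an assumption, and the code is not the algorithm used by Jane 4 or Xscape.

```python
from typing import Dict, List, Tuple

EventCounts = Tuple[int, int, int, int]  # (s, d, t, l) as listed in Table 4.3


def reconciliation_cost(counts: EventCounts, costs: Dict[str, float]) -> float:
    """Total cost = sum over event types of (event count x event cost)."""
    s, d, t, losses = counts
    return (s * costs["cospeciation"] + d * costs["duplication"]
            + t * costs["host_switch"] + losses * costs["loss"])


def cheapest(solutions: List[EventCounts], costs: Dict[str, float]) -> EventCounts:
    """Return the event-count vector with the lowest total cost."""
    return min(solutions, key=lambda c: reconciliation_cost(c, costs))


# Event-count vectors reported by Xscape for rodent lineage 1 in Table 4.3.
lineage1 = [(7, 2, 4, 5), (4, 2, 7, 0)]

# Cost scheme from the Table 4.3 caption. Failure to diverge is omitted
# because the (s, d, t, l) vectors do not count that event.
jane_costs = {"cospeciation": 0, "duplication": 1, "host_switch": 2, "loss": 1}

for counts in lineage1:
    print(counts, reconciliation_cost(counts, jane_costs))
print("cheapest under caption costs:", cheapest(lineage1, jane_costs))

# Assumed alternative scheme: cheaper host switches, more expensive losses.
alt_costs = {"cospeciation": 0, "duplication": 1, "host_switch": 1, "loss": 2}
print("cheapest under alternative costs:", cheapest(lineage1, alt_costs))
```

Under the caption costs the vector with fewer host switches is cheaper, whereas the assumed alternative scheme favours the vector with no losses. This is the sense in which different regions of the cost space correspond to different sets of optimal reconciliations.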

4.4 Discussion and Conclusion

4.4.1 Biogeography and horizontal transmission hotspots of gammaretroviruses

I have demonstrated that gammaretroviruses with an intact pol region cluster unevenly across the type I gammaretroviral phylogeny. There are six notable groups of such viruses, which supports the existence of potential recent horizontal transmission hotspots of type I gammaretroviruses in the wild. These hotspots are best defined as follows:

(1) in African primates

(2) among South American rodents with occasional transmission to North American mammalian orders (such as Lagomorpha and Carnivora)

(3) among African bats with occasional transmission to African primates and carnivores

(4) in North American rodents and bats

(5) between African and South-East Asian hedgehogs

(6) among South-East Asian rodents with occasional transmission to non-rodent hosts

The presence of exogenous viruses (GALV, MuLV, FeLV), endogenous viruses (koala retrovirus, KoRV) and infectious and recently infectious endogenous retroviruses (BaEV, MDEV) within these groups supports this proposal. In addition, the novel viruses uncovered in these groups provide further evidence that these geographic regions may well be considered hotspots for emerging infectious gammaretrovirus diseases. Apart from these viruses, most others in the type I gammaretrovirus phylogeny appear highly defective, suggesting that their integration happened a long time ago; thus, it is unlikely that there is any ongoing horizontal transmission between gammaretroviruses found across the remaining regions of the phylogeny.

An interesting finding is the highly supported clade of retroviral sequences found in cetaceans; molecular dating sets the invasion time of the cetacean clade at 10–19 million years ago (Wang et al., 2013). Given the low level of sequence divergence within the cetacean clade, it is reasonable to speculate that these gammaretroviruses share a relatively recent common ancestor. However, the exact intermediate vector remains unclear. The ancestral state reconstruction suggests that the intermediate vectors for the cetacean clade may be rodents, carnivores or primates. This result is consistent with the suggestion that cetaceans may have been infected by rodent retroviruses (Hayward et al., 2013). Nevertheless, the possibility of carnivores or primates being the vector for transmission cannot be ruled out, since they both show high sequence similarity in their respective pol regions to those of cetaceans. Dating analyses of related viruses are needed for deeper evolutionary analysis.

Koala retrovirus (KoRV) is an exogenous retrovirus widespread in wild koalas (Simmons et al., 2012; Hanger et al., 2000). The full sequence of KoRV reveals a strikingly high genetic similarity with gibbon ape leukaemia virus (GALV). The fact that gibbons and koalas live on different continents suggests that zoonosis may have involved an intermediate vector. A recent study has challenged the suggestion that South-East Asian mice are the source of KoRV. The study shows that a virus derived from a laboratory mouse is phylogenetically closer to KoRV than previously identified Asian mice retroviral sequences (Hayward et al., 2013). It is worth noting that KoRV is suggested to be pathogenic, which complicates tracing the potential transmission route of KoRV. In fact, although 12 Australian rodent- and bat-derived gammaretroviruses were included in my study, none of these viruses clustered close to KoRV. For a more complete understanding of potential vectors in spreading gammaretroviruses to koalas, it is necessary to examine more indigenous Australian bats and rodents for associated exogenous retroviruses.


4.4.2 Horizontal transmission dynamics of gammaretroviruses

Although a number of zoonotic infections between different host taxa have been reported, it remains unclear how often interspecies transmission events occur or whether their frequency is influenced by the evolutionary distance between host taxa. The large numbers of gammaretroviruses found in this study, together with their inferred horizontal transmission events, allow some general trends of horizontal transmission to be inferred. It is clear that the frequency of interspecies transmission events decreases with increasing phylogenetic distance between donor and recipient hosts. Little evidence has been found for the most extreme cases, e.g. interclass transmission events or transmission between two hosts. In fact, only one confirmed interclass transmission event has occurred during the evolution of gammaretroviruses: reticuloendotheliosis viruses (REV), duck spleen necrosis virus (SNV) and chicken syncytial virus (CSV) are of mammalian origin and their presence in the avian genome is confirmed to be the result of horizontal transmission (Martin et al., 1999). In 2005, Gifford et al. suggested that interclass transmission has rarely occurred during the evolution of class II retroviruses. Therefore, this may be a general trend in the Retroviridae family.

Consistent with previous studies, it is clear that transmission between different orders or families occurs much less frequently than transmission between genera within the same host family (Gifford et al., 2005; Martin et al., 1999; Herniou et al., 1998). While this effect seems apparent, the reason remains largely unknown.
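The size of this effect can be gauged from the estimated transmission frequencies in Table 4.3. The sketch below computes the fold difference between intergenus and interfamily frequencies for the two lineages in which both values are reported; the assignment of each value to the intergenus or interfamily category follows my reading of the table layout and should be treated as an assumption.

```python
# Estimated transmission frequencies as read from Table 4.3
# (host switches per parasite per million years).
frequencies = {
    "rodent lineage 2": {"intergenus": 0.003, "interfamily": 0.0003},
    "bat lineage 1": {"intergenus": 0.0004, "interfamily": 0.00001},
}


def fold_difference(intergenus: float, interfamily: float) -> float:
    """How many times more frequent intergenus switches are than interfamily ones."""
    return intergenus / interfamily


for lineage, f in frequencies.items():
    fold = fold_difference(f["intergenus"], f["interfamily"])
    print(f"{lineage}: intergenus is about {fold:.0f}x interfamily")
```

On these values, intergenus transmission is roughly one order of magnitude more frequent than interfamily transmission in both host groups.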

Switching to a new host can have a profound effect on virus evolution. In fact, viruses are not likely to gain the ability to spread efficiently within a new host that was not previously exposed or susceptible because of the host range boundary. To infect a new host, a virus must be able to efficiently infect the appropriate cells of the new hosts. This process can be restricted at several levels, including receptor binding, host cell entry, trafficking within the cell, genome replication and gene expression (Figure 4.15).


Figure 4.15 Steps involved in the emergence of host switching viruses, the transfer of viruses into new host populations. Occasionally, a virus gains the ability to successfully spread into new hosts, and under the right circumstances, a virus will emerge and start a new epidemic. (Adapted from Antia et al., 2003)

Therefore, switching to a new host can have a profound effect on virus evolution owing to the required corresponding changes in the virus to cross the host barriers (Figure 4.16). An innate antiviral response can be another significant barrier to infection. For example, PERVs can infect human cells in vitro, yet the transmission of these viruses to humans is restricted (Rother et al., 1995). This is partially owing to host immunity through complement-mediated inactivation of gammaretroviruses which is initiated by the binding of antibodies to the carbohydrate epitope expressed on the retroviral envelope (Fujita et al., 2003).

Thus, it is reasonable to speculate that the decrease in transmission frequency is due to a greater difficulty experienced by viruses trying to establish successful infections in distantly related species.


Figure 4.16 Steps involved in the emergence of host switching viruses. The host and viral processes that can be involved in the transfer and adaptation process (Woolhouse et al., 2005)

Apart from the intrinsic restriction of a new host, environmental and demographic barriers are also critical to virus interchange. Contact between a virus donor and recipient is a precondition for virus transfer, and some host-switching events are likely to be prevented because of limited contact between viruses and potential new hosts. Therefore, related host species with similar ecological niches may have a higher chance of interaction, which may result in viral transmission. For instance, multiple strains of SIVcpz obtained from sample sites separated by large rivers displayed phylogeographic clustering (Keele et al., 2006). However, species with the ability to fly may not be limited by geographic barriers. In this study, bats were found to be the second largest viral reservoir, with 15 horizontal transmission events identified in total, including transmission from bats to rodents, carnivores and primates. Bats are unique among mammals in their ability to fly. For example, species of Myotis may travel 200 to 400 miles from their winter hibernation sites, and Mexican free-tailed bats migrate at least 800 miles between their summer habitats in Texas and New Mexico and their overwintering sites in Mexico. Indeed, bat retroviruses were not constrained by geographic barriers, because the viruses retrieved from South-East Asian species (Glauconycteris beatrix) clustered with those retrieved from African species (Myotis brandtii). A previous study has also shown that bat viruses are not restricted by geographic barriers: viruses derived from Australian bats were found to cluster with those identified from horseshoe bats from China (Cui et al., 2012). This may explain why bats were estimated to transfer type I gammaretroviruses among 12 different mammalian orders, even though some of the recipient species are distributed far from the bat reservoirs.


4.4.3 Rodents have more type I gammaretroviruses than other mammals

The analysis of the rodent (mouse and rat) sequences revealed abundant type I mammalian gammaretrovirus lineages, with at least seven independently derived lineages. Many of these lineages contained retroviruses that may still be active and infectious (Tipper et al., 2005; Tomonaga and Coffin, 1999; Wolgamot et al., 1998). This appears to be in contrast to non-rodent taxa. Human and dog genomes contain only one type I gammaretrovirus clade, and the chimpanzee genome likely contains three (Jern et al., 2006). Although multiple gammaretrovirus-like lineages in sheep and pigs were identified in previous studies (Klymiuk et al., 2003; Patience et al., 2001), only one group in sheep and one in pigs appeared to be independently derived type I gammaretrovirus lineages within the phylogenies.

These results suggest that certain rodent lineages now contain a far higher load of active type I gammaretroviruses than other taxa. The reason for this may be a large number of rodent-to-rodent horizontal transmissions, which have continuously maintained a high level of active and infectious gammaretrovirus lineages in Rodentia. The following are possible explanations for this phenomenon.

First, rodents are highly susceptible to climatic and ecological change, resulting in variable population numbers (Howard and Fletcher, 2012). This makes it easier (or presents more chances) for them to be in contact with various species in different environmental niches. Second, rodents have short generation times and large blood sizes (Han et al., 2015; Luis et al., 2013), which enable them to tolerate active endogenous retroviruses compared to hosts with longer generation times and smaller blood sizes. This in turn considerably increases the risk of host exposure to the viruses they may carry, as well as stimulating the virus to undergo mutational adaptations to the changing environment.


4.4.4 A model of type I mammalian gammaretrovirus evolution

I suggest that a pool of active type I mammalian viruses within rodents acts as the primary reservoir and that bats act as the secondary reservoir for the infection of mammalian hosts. The data show that rodents have a large number of type I gammaretroviruses, and there are a large number of horizontal transmissions from rodents to other mammalian taxa. Horizontal transmission rates between genera within Murinae are higher than those of other possible virus reservoirs, such as bats. Large numbers of rodent-to-rodent horizontal transmission events have continually maintained a high level of active and infectious gammaretrovirus lineages in Rodentia. This is because genus-to-genus transmission occurs more frequently than interorder or interfamily transmission, since smaller intrinsic host barriers need to be overcome. This result is consistent with the transmission of the gammaretroviruses BaEV and PtERV within primates (Jern et al., 2006; Yohn et al., 2005; van der Kuyl et al., 1995; Mang et al., 2000).

In contrast, non-rodent hosts exhibit fewer gammaretroviral lineages than rodents and most are defective. For example, most members in the bat group have interrupted reading frames of the pol gene, suggesting that they have been integrated into their host genome for a long period of time and lack present infectious activity. This is consistent with the finding that suggested that bat-derived gammaretroviruses may represent the most ancient lineage of the gammaretroviral family (Cui et al., 2012). Therefore, it is likely that active cycles of infection occur only transiently in non-rodent host families or orders. All these factors ensure that large numbers of active and infectious type I gammaretroviruses cycle among rodents and other mammalian species.


Figure 4.17 Proposed type I gammaretrovirus horizontal transmission model. Most ongoing horizontal transmission events between different mammalian orders involve rodents as the likely donor species, and non-rodent hosts are continually being infected by new rodent-derived viruses. Horizontal transmissions between two non-rodent or non-bat orders occur only rarely.


Chapter V

Conclusion and Future Developments

Retroviruses are among the fastest evolving biological entities on the planet. Their overall rates of nucleotide substitution per site per year fall in the range from 10⁻² to 10⁻⁵ (Sanjuán, 2012; Hanada et al., 2004; Jenkins et al., 2002), which is approximately a million times faster than the rate of their hosts (Kumar and Subramanian, 2002). This characteristic allows retroviruses to escape host immunity and rapidly adapt to new hosts after cross-species transmissions. With the advent of genomic sequencing techniques, it is now possible to assess the evolutionary history of retroviruses and their respective hosts in greater detail.

In this thesis, the studies presented in Chapters 2, 3 and 4 investigated the evolutionary history of retroviruses and their respective hosts. Chapters 2 and 3 provide evidence of the most basal retroviruses identified so far, and demonstrate the extensive retroviral diversity within basal fish lineages (including cartilaginous fish and the coelacanth). The findings answer long-standing controversial questions regarding the existence of a host range boundary for Retroviridae.

Previous studies suggest that cartilaginous fish, such as sharks, may be the most basal vertebrate from which a retrovirus can be identified. However, the results of the phylogenetic analyses do not support this theory. The analyses showed that ERVs in sharks resulted from horizontal transmission (Herniou et al., 1998). In this study, ERVs in lampreys branched off at the root of all retroviruses and their genomic characteristics show some similarity to retroviruses and Ty3/gypsy retrotransposons, and thus, their transitional evolutionary position was confirmed.

Although the host range boundary may be established, the reason why such a boundary exists remains unclear. The host range of Retroviridae is not restricted by competitive exclusion, since many different types of retroviruses and LTR retrotransposons have been identified from an individual lower vertebrate. Interestingly, there is one major change in the host immune system during the early evolution of vertebrates, and the distribution of ERVs suggests an association with the presence of B and T cell populations. Lymphocyte-like cells first appeared in the lamprey in the form of VLRA, VLRB and VLRC cells (Kasamatsu et al., 2010; Tasumi et al., 2009; Rogozin et al., 2007; Pancer et al., 2004), and these immune cells may represent the primary niches for retroviral invasions. One of the major challenges in this study was that the evidence of infected cells for some retroviruses is patchy, especially for SnRV and epsilon retroviruses. This is mainly because some of these retroviruses have been considered apathogenic until now, and therefore, infection processes of these viruses have rarely been investigated. Nevertheless, so far, all available evidence suggests that immune cells are one of the major or even the first cell types that retroviruses initially infect in hosts, and thus, the assumption is supported.

Another challenge in this study was characterising the genomic organisation of PmRV, which may be due to the following reasons. The lamprey genome assembly is problematic because it is incomplete; the genome is compact with highly repetitive regions, and genomic data from closely related species are unavailable as an assembly reference. In addition, the scaffolds and contigs of the lamprey very often contain strings of “n”, which indicate unspecified bases at those positions. Combined with numerous stop codons, frameshifts and the divergent genomic organisation of PmRV, this makes it unlikely that PmRV can be characterised by any present retrovirus characterisation tool (such as LTRharvest (Ellinghaus et al., 2008), LTR_FINDER (Xu and Wang, 2007) or RetroTector (Sperber et al., 2007)) alone. Major challenges for these tools are adjusting for distance constraints and the diversity of the motifs included. Some ancient retroelements are relatively incomplete and very divergent from known retroviral structures, possessing few recognised conserved motifs, such as PmRV identified in this study. All these factors limit the ability of the tools to identify ERVs, especially where close relatives have not been identified and included in the database for reference. In addition, none of these tools can detect and annotate the promoter and enhancer of retroviral LTRs. However, detecting the promoter and enhancer of LTRs, and excluding false-positive results in which two long nonsense repetitive elements are located upstream and downstream of the viral-related genes, can improve the accuracy of identifying full-length ERVs. Since it is known that a large proportion of vertebrate genomes is composed of repetitive regions (Waterson et al., 2002; Lander et al., 2001), it is worth distinguishing nonsense repetitive regions from retroviral LTRs while mining for full-length retroviruses. Lastly, the phylogenetic positions and corresponding genomic structures of the novel ERVs presented in Chapters 2 and 3 suggest that it is impossible to merge these basal and fish ERVs into the seven genera, because most novel ERVs do not cluster with known retroviruses. As the availability of molecular data for species genomes increases, more diverse retroviruses will be discovered. Therefore, an advanced and efficient classification system for Retroviridae is worth developing.
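The strings of “n” noted above for lamprey scaffolds can be located before a candidate ERV region is interpreted. The following is a minimal sketch, assuming the scaffold is available as a plain string; the function names, the minimum run length and the toy sequence are illustrative assumptions and do not correspond to any step of the screening pipeline used in this thesis.

```python
import re
from typing import List, Tuple


def find_n_runs(sequence: str, min_length: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) coordinates of runs of N or n."""
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


def overlaps_gap(start: int, end: int, gaps: List[Tuple[int, int]]) -> bool:
    """True if the half-open interval [start, end) overlaps any gap."""
    return any(start < g_end and g_start < end for g_start, g_end in gaps)


# Toy scaffold invented for illustration (not lamprey sequence).
scaffold = "ATGCCGTA" + "N" * 10 + "GGCTTAACG" + "n" * 5 + "TTGACC"

# Report only gaps of at least five unspecified bases.
gaps = find_n_runs(scaffold, min_length=5)
print(gaps)

# Check whether a hypothetical candidate region spans an unspecified stretch.
print(overlaps_gap(5, 12, gaps))
```

Flagging candidate regions that overlap such gaps is one simple way to separate apparent truncations caused by assembly quality from genuine deletions in the provirus.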

ERVs have proved helpful in expanding the understanding of virus–host interactions. Previous studies have shown that extant simian foamy viruses have a long stable history of cospeciation with their hosts for >30 million years (Broussard et al., 1997; Bieniasz et al., 1995; Schweizer and Neumann-Haefelin, 1995), while the origin of this cospeciation history could be further pushed back to the origin of eutherians more than 100 million years ago (Katzourakis et al., 2009). The study presented in Chapter 4 shows that in contrast to the evolutionary history of simian foamy viruses, the evolutionary history of gammaretroviruses shows abundant cross-species transmission events between rodents and other mammalian species. By combining gammaretroviral and host phylogenetic information with biogeographical information, this study suggested six horizontal transmission hotspots. The best definitions of these hotspots are (1) in African primates; (2) among South American rodents with occasional transmission to North American mammalian orders; (3) among African bats with occasional transmission to African primates and carnivores; (4) in North American rodents and bats; (5) between African and South-East Asian hedgehogs; and (6) among South-East Asian rodents with occasional transmission to non-rodent hosts. Since gammaretroviruses have been identified to be associated with infectious diseases, these identified hotspots may well be potential regions for the outbreak of emerging diseases.

In addition to the identified transmission hotspots, the substantially higher number of ERVs found in rodents and the result of the ancestral state reconstructions suggested that rodents are the major reservoir of type I gammaretroviruses. Consistent with previous findings, most cross-species transmission events identified in this study were between closely related species, suggesting that the phylogenetic distance between a donor and a recipient is an important determinant for a successful cross-species transmission. Future investigation of gammaretroviruses could focus on whether key host ecological traits (e.g. the mating system) and biological traits (e.g. the lifespan and blood volume) play important roles in retroviral transmissions. This may provide more details as to why rodents are the major gammaretroviral donor.

Although the recent large-scale phylogenetic analyses of ERVs across a wide range of vertebrates implied that cross-species transmissions between distantly related species (i.e. from different orders and families) may be more common than previously suggested, the evidence for these events remains limited. In fact, investigations in greater detail are needed before confirming cross-species events occurring between two distantly related species. For example, KoRV is very closely related to GALV and was previously suggested to be acquired from Asian rodents. However, this hypothesis has been challenged by another study which found that an ERV in a laboratory mouse is more closely related to KoRV than to that in Asian rodents (Hayward et al., 2013). This reveals that the donor-recipient relationship of two distantly related species can be controversial, particularly when the evidence is limited. That is, a relationship between a donor and a recipient can be established only after a comprehensive investigation of their related retroviruses and determination of a transmission route. Since ease of travel can facilitate transmission between two geographically distant species, future investigations should not only recover ERVs from their related species (e.g. for KoRV, investigating endemic species in Australia) but should also determine possible intermediate vectors (e.g. mice, other rodents and potential vectors carried by humans or other animals).

Overall, this investigation into ERVs using the available vertebrate genomic data in combination with bioinformatics tools and phylogenetic analyses has provided important insights into retroviral evolution. The next step will be recovering more ancient ERVs from multiple species using quality-improved genomic data and integrating host ecological and biological traits into the analyses. This should further enhance our understanding of retroviral evolution, transmission dynamics and emerging infectious diseases.


Appendix

Appendix 1. The protease of representative retroviruses used for genome screening and phylogenetic construction

Name    Accession Number

Alpharetrovirus
Avian leukemia virus    YP_004222728.1
Rous sarcoma virus    NP_056886.1

Betaretrovirus
Mouse mammary tumour virus    NP_056880.1
Jaagsiekte sheep retrovirus    NP_041186.1
Bovine endogenous retrovirus    ABM73646.1
Simian retrovirus 1    AAA47732.1
Squirrel monkey retrovirus    AAA66453.1

Lentivirus
HIV 1    ABK51636.1
HIV 2    NP_663784.1
Simian immunodeficiency virus    NP_687035.1
Bovine immunodeficiency virus    AAA91271.1
Equine infectious virus    AAB59862.1
Ovine lentivirus    AAA66812.1

Deltaretrovirus
Bovine leukemia virus    NP_056895.1
Human T-cell leukemia virus type 1    NP_057860.1
Human T-cell leukemia virus type 2    NP_041003.2
Human T-cell leukemia virus type 3    AAZ77658.1
Human T-cell leukemia virus type 4    ABR68001.1
Simian T-cell leukemia virus 1    AAU34010.1

Epsilonretrovirus
Walleye dermal sarcoma virus    NP_045937.1
Walleye epidermal hyperplasia virus 1    AAD30048.1
Walleye epidermal hyperplasia virus 2    AAD30054.1
Atlantic salmon swim bladder    ABA54982.1
Zebra fish endogenous retrovirus    AAM34208.1

Gammaretrovirus
Reticuloendotheliosis virus    YP_223871.1
Gibbon ape leukemia virus    NP_056790.1
Feline leukemia virus    NP_047255.1
Porcine endogenous retrovirus    AAT77167.1
Rhinolophus ferrumequinum retrovirus    AFA52559.1
Orcinus orca endogenous retrovirus    GQ222416.1
Baboon endogenous virus    BAA89659.1
Moloney murine leukemia virus    AAC82568.1

Spumavirus
African green monkey simian foamy virus    YP_001956722.2
Bovine foamy virus    NP_044929.1
Equine foamy virus    NP_054716.1
Feline foamy virus    CAA70075.1
HERV-L
Simian foamy virus 1    NP_056803.1
Simian foamy virus 3    YP_001956722.2
SFVgor    HM245790.1
SFVcpz    NC_001364.1
SFVmar    ADE06000.1
SFVspi    ABV59399.1
Squirrel monkey foamy virus    ADE05995.1

Unclassified
Snakehead retrovirus    AAC54861.1
Sphenodon punctatus endogenous retrovirus    CAA59408.1
Dendrobates ventrimaculatus virus 1    X95795.1
Dendrobates ventrimaculatus virus 2    X95796.1
Dendrobates ventrimaculatus virus 3    X95797.1


References

Agrawal, A., Eastman, Q. M. & Schatz, D. G. (1998) Transposition mediated by RAG1 and RAG2 and its implications for the evolution of the immune system. Nature. 394(6695), 744–751. Available from: 10.1038/29457.

Alberola, T. M. & de Frutos, R. (1996) Molecular structure of a gypsy element of Drosophila subobscura (Gypsyds) constituting a degenerate form of insect retroviruses. Nucleic Acids Research. 24(5), 914–923. Available from: 10.1093/nar/24.5.914.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology. 215(3), 403–410. Available from: 10.1006/jmbi.1990.9999.

Antia, R., Regoes, R. R., Koella, J. C. & Bergstrom, C. T. (2003) The role of evolution in the emergence of infectious diseases. Nature. 426(6967), 658–661. Available from: 10.1038/nature02104.

Bakker, A., Van de Loo, F., Joosten, L., Bennink, M., Arntz, O., Dmitriev, I., Kashentsera, E., Curiel, D. & Van den Berg, W. (2001) A tropism-modified adenoviral vector increased the effectiveness of gene therapy for arthritis. Gene Therapy. 8(23), 1785–1793. Available from: 10.1038/sj.gt.3301612.

Baltimore, D. (1970) RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature. 226, 1209–1211. Available from: 10.1038/2261209a0.

Barr, S. D., Ciuffi, A., Leipzig, J., Shinn, P., Ecker, J. R. & Bushman, F. D. (2006) HIV Integration Site Selection: Targeting in Macrophages and the Effects of Different Routes of Viral Entry. Molecular Therapy. 14(2), 218–225. Available from: 10.1016/j.ymthe.2006.03.012.

Barr, S. D., Leipzig, J., Shinn, P., Ecker, J. R. & Bushman, F. D. (2005) Integration targeting by avian sarcoma-leukosis virus and human immunodeficiency virus in the chicken genome. Journal of Virology. 79(18), 12035–12044. Available from: 10.1128/jvi.79.18.12035-12044.2005.

Beauregard, M., Lévesque, J. & Bourgouin, P. (2001) Neural correlates of conscious self-regulation of emotion. The Journal of Neuroscience : The Official Journal of the Society for Neuroscience. 21(18), RC165. Available from: http://www.jneurosci.org/content/21/18/RC165.full.pdf.

Beer, B. E., Bailes, E., Goeken, R., Dapolito, G., Coulibaly, C., Norley, S. G., Kurth, R., Gautire, J., Gautier-Hion, A., Vallet, D., Sharp, P. M. & Hirsch, V. M. (1999) Simian immunodeficiency virus (SIV) from sun-tailed monkeys (Cercopithecus solatus): evidence for host-dependent evolution of SIV within the C. lhoesti superspecies. Journal of Virology. 73(9), 7734–7744. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC104300/.

Belshaw, R., Watson, J., Katzourakis, A., Howe, A., Woolven-Allen, J., Burt, A. & Tristem, M. (2007) Rate of recombinational deletion among human endogenous retroviruses. Journal of Virology. 81(17), 9437–9442. Available from: 10.1128/jvi.02216-06.

Benachenhou, F., Jern, P., Oja, M., Sperber, G., Blikstad, V., Somervuo, P., Kaski, S. & Blomberg, J. (2009) Evolutionary conservation of orthoretroviral long terminal repeats (LTRs) and ab initio detection of single LTRs in genomic data. PLoS ONE. 4(4), e5179. Available from: 10.1371/journal.pone.0005179.

Benachenhou, F., Sperber, G. O., Bongcam-Rudloff, E., Andersson, G., Boeke, J. D. & Blomberg, J. (2013) Conserved structure and inferred evolutionary history of long terminal repeats (LTRs). Mobile DNA. 4(1), 5. Available from: 10.1186/1759-8753-4-5.

Bénit, L., Dessen, P. & Heidmann, T. (2001) Identification, phylogeny, and evolution of retroviral elements based on their envelope genes. Journal of Virology. 75(23), 11709–11719. Available from: 10.1128/jvi.75.23.11709-11719.2001.

Benveniste, R. E., Sherr, C. J. & Todaro, G. J. (1975) Evolution of type C viral genes: origin of feline leukemia virus. Science. 190(4217), 886–888. Available from: 10.1126/science.52892.

Benveniste, R. E. & Todaro, G. J. (1974) Evolution of C-type viral genes: inheritance of exogenously acquired viral genes. Nature. 252(5483), 456–459. Available from: 10.1038/252456a0.

Benveniste, R. E. & Todaro, G. J. (1975) Evolution of type C viral genes: preservation of ancestral murine type C viral sequences in pig cellular DNA. Proceedings of the National Academy of Sciences of the United States of America. 72(10), 4090–4094. Available from: 10.1073/pnas.72.10.4090.

Berg, J. (1986) Potential metal-binding domains in nucleic acid binding proteins. Science. 232(4749), 485–487. Available from: 10.1126/science.2421409.

Best, S., Le Tissier, P., Towers, G. & Stoye, J. P. (1996) Positional cloning of the mouse retrovirus restriction gene Fv1. Nature. 382(6594), 826–829. Available from: http://www.ncbi.nlm.nih.gov/pubmed/8752279.

Bieniasz, P. D., Rethwilm, A., Pitman, R., Daniel, M. D., Chrystie, I. & McClure, M. O. (1995) A comparative study of higher primate foamy viruses, including a new virus from a gorilla. Virology. 207(1), 217–228. Available from: 10.1006/viro.1995.1068.

Bittner, J. J. (1936) Some possible effects of nursing on the mammary gland tumor incidence in mice. Science. 84(2172), 162. Available from: 10.1126/science.84.2172.162.

Blanga-Kanfi, S., Miranda, H., Penn, O., Pupko, T., DeBry, R. W. & Huchon, D. (2009) Rodent phylogeny revised: analysis of six nuclear genes from all major rodent clades. BMC Evolutionary Biology. 9, 71. Available from: 10.1186/1471-2148-9-71.

Blond, J., Lavillette, D., Cheynet, V., Bouton, O., Oriol, G., Chapel-Fernandes, S., Mandrand, B., Mallet, F. & Cosset, F. (2000) An envelope glycoprotein of the human endogenous retrovirus HERV-W is expressed in the human placenta and fuses cells expressing the type D mammalian retrovirus receptor. Journal of Virology. 74(7), 3321–3329. Available from: 10.1128/jvi.74.7.3321-3329.2000.

Boehm, T., McCurley, N., Sutoh, Y., Schorpp, M., Kasahara, M. & Cooper, M. D. (2012) VLR-based adaptive immunity. Annual Review of Immunology. 30, 203–220. Available from: 10.1146/annurev-immunol-020711-075038.

Boeke, J. D. & Stoye, J. P. (2002) Retrotransposons, endogenous retroviruses and the evolution of retroviruses. In: Coffin, J. M., Hughes, S. H. & Varmus, H. E. (eds.) Retroviruses. United States, Cold Spring Harbor Laboratory Press, U.S., 343–435. Available from: http://www.ncbi.nlm.nih.gov/books/NBK19376/.

Böhne, A., Brunet, F., Galiana-Arnoux, D., Schultheis, C. & Volff, J. N. (2008) Transposable elements as drivers of genomic and biological diversity in vertebrates. Chromosome Research. 16(1), 203–215. Available from: 10.1007/s10577-007-1202-6.

Bonham, L., Wolgamot, G. & Miller, A. D. (1997) Molecular cloning of Mus dunni endogenous virus: an unusual retrovirus in a new murine viral interference group with a wide host range. Journal of Virology. 71(6), 4663–4670. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC191688/.

Bowerman, B., Brown, P. O., Bishop, J. M. & Varmus, H. E. (1989) A nucleoprotein complex mediates the integration of retroviral DNA. Genes & Development. 3 (4), 469–478. Available from: 10.1101/gad.3.4.469.

Brooks, D. R. (1981) Hennig’s Parasitological Method: A Proposed Solution. Systematic Zoology. 30(3), 229–249. Available from: 10.2307/2413247.

Broussard, S. R., Comuzzie, A. G., Leighton, K. L., Leland, M. M., Whitehead, E. M. & Allan, J. S. (1997) Characterization of new simian foamy viruses from African nonhuman primates. Virology. 237(2), 349–359. Available from: 10.1006/viro.1997.8797.

Buckman, J. S., Bosche, W. J. & Gorelick, R. J. (2003) Human immunodeficiency virus type 1 nucleocapsid Zn2+ fingers are required for efficient reverse transcription, initial integration processes, and protection of newly synthesized viral DNA. Journal of Virology. 77(2), 1469–1480. Available from: 10.1128/JVI.77.2.1469.

Burwinkel, B. & Kilimann, M. W. (1998) Unequal homologous recombination between LINE-1 elements as a mutational mechanism in human genetic disease. Journal of Molecular Biology. 277(3), 513–517. Available from: 10.1006/jmbi.1998.1641.

Chambers, P., Pringle, C. R. & Easton, A. J. (1990) Heptad repeat sequences are located adjacent to hydrophobic regions in several types of virus fusion glycoproteins. The Journal of General Virology. 71 (12), 3075–3080. Available from: 10.1099/0022-1317-71-12-3075.

Charleston, M. A. & Robertson, D. L. (2002) Preferential host switching by primate lentiviruses can account for phylogenetic similarity with the primate phylogeny. Systematic Biology. 51(3), 528–535. Available from: 10.1080/10635150290069940.

Chen, Z., Luckay, A., Sodora, D. L., Telfer, P., Reed, P., Gettie, A., Kanu, J. M., Sadek, R. F., Yee, J., Ho, D. D., Zhang, L. & Marx, P. A. (1997) Human immunodeficiency virus type 2 (HIV-2) seroprevalence and characterization of a distinct HIV-2 genetic subtype from the natural range of simian immunodeficiency virus-infected sooty mangabeys. Journal of Virology. 71(5), 3953–3960. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9094672.

Ciuffi, A., Llano, M., Poeschla, E., Hoffmann, C., Leipzig, J., Shinn, P., Ecker, J. R. & Bushman, F. (2005) A role for LEDGF/p75 in targeting HIV DNA integration. Nature Medicine. 11(12), 1287–1289. Available from: 10.1038/nm1329.

Ciuffi, A., Mitchell, R. S., Hoffmann, C., Leipzig, J., Shinn, P., Ecker, J. R. & Bushman, F. D. (2006) Integration site selection by HIV-based vectors in dividing and growth-arrested IMR-90 lung fibroblasts. Molecular Therapy: The Journal of the American Society of Gene Therapy. 13(2), 366–373. Available from: 10.1016/j.ymthe.2005.10.009.

Coffin, J. M. (1992) Retroviral DNA integration. Developments in Biological Standardization. 76(4), 141–151. Available from: 10.1016/S0092-8674(85)80092-1.

Coffin, J. M., Hughes, S. H. & Varmus, H. E. (eds.) (1997) Retroviruses. United States, Cold Spring Harbor Laboratory Press, U.S. Available from: http://www.ncbi.nlm.nih.gov/books/NBK19376/.

Cohen, C. J., Lock, W. M. & Mager, D. L. (2009) Endogenous retroviral LTRs as promoters for human genes: A critical assessment. Gene. 448(2), 105–114. Available from: 10.1016/j.gene.2009.06.020.

Cohen, J. C., Majors, J. E. & Varmus, H. E. (1979) Organization of mouse mammary tumor virus-specific DNA endogenous to BALB/c mice. Journal of Virology. 32(2), 483–496. Available from: http://jvi.asm.org/content/32/2/483.full.pdf.

Conow, C., Fielder, D., Ovadia, Y. & Libeskind-Hadas, R. (2010) Jane: a new tool for the cophylogeny reconstruction problem. Algorithms for Molecular Biology. 5(1), 16. Available from: 10.1186/1748-7188-5-16.

Cornelis, G., Heidmann, O., Bernard-Stoecklin, S., Reynaud, K., Veron, G., Mulot, B., Dupressoir, A. & Heidmann, T. (2012) Ancestral capture of syncytin-Car1, a fusogenic endogenous retroviral envelope gene involved in placentation and conserved in Carnivora. Proceedings of the National Academy of Sciences. 109(7), E432–E441. Available from: 10.1073/pnas.1115346109.

Courgnaud, V., Van Dooren, S., Liegeois, F., Pourrut, X., Abela, B., Loul, S., Mpoudi-Ngole, E., Vandamme, A., Delaporte, E. & Peeters, M. (2004) Simian T-cell leukemia virus (STLV) infection in wild primate populations in Cameroon: evidence for dual STLV type 1 and type 3 infection in agile mangabeys (Cercocebus agilis). Journal of Virology. 78(9), 4700–4709. Available from: 10.1128/JVI.78.9.4700-4709.2004.

Covey, S. N. (1986) Amino acid sequence homology in gag region of reverse transcribing elements and the coat protein gene of cauliflower mosaic virus. Nucleic Acids Research. 14(2), 623–633. Available from: 10.1093/nar/gkn942.

Craigie, R., Fujiwara, T. & Bushman, F. (1990) The IN protein of Moloney murine leukemia virus processes the viral DNA ends and accomplishes their integration in vitro. Cell. 62(4), 829–837. Available from: 10.1016/0092-8674(90)90126-Y.

Cui, J. & Holmes, E. C. (2012) Endogenous Lentiviruses in the Ferret Genome. Journal of Virology. 86(6), 3383–3385. Available from: 10.1128/JVI.06652-11.

Cui, J., Tachedjian, M., Wang, L., Tachedjian, G., Wang, L. F. & Zhang, S. (2012) Discovery of retroviral homologs in bats: implications for the origin of mammalian gammaretroviruses. Journal of Virology. 86(8), 4288–4293. Available from: 10.1128/jvi.06624-11.

Cullen, B. R. (1992) Mechanism of action of regulatory proteins encoded by complex retroviruses. Microbiological Reviews. 56(3), 375–394. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1406488.

Czernilofsky, A., Levinson, A. & Varmus, H. (1980) Nucleotide sequence of an avian sarcoma virus oncogene (src) and proposed amino acid sequence for gene product. Nature. 287(5779), 198–203. Available from: 10.1038/287198a0.

Czernilofsky, A. P., Levinson, A. D., Varmus, H. E., Bishop, J. M., Tischer, E. & Goodman, H. (1983) Corrections to the nucleotide sequence of the src gene of Rous sarcoma virus. Nature. 301 (5902), 736–738. Available from: 10.1038/301736b0.

Delassus, S., Sonigo, P. & Wain-Hobson, S. (1989) Genetic organization of gibbon ape leukemia virus. Virology. 173(1), 205–213. Available from: 10.1016/0042-6822(89)90236-5.

Dickson, C., Eisenman, R., Fan, H., Hunter, E. & Teich, N. (1984) Protein biosynthesis and assembly. In: Weiss, R., Teich, N., Varmus, H. & Coffin, J. (eds.) RNA Tumor Viruses. 2nd edition. Cold Spring Harbor Laboratory, London, 513–648.

Dimcheff, D. E., Drovetski, S. V., Krishnan, M. & Mindell, D. P. (2000) Cospeciation and horizontal transmission of avian sarcoma and leukosis virus gag genes in galliform birds. Journal of Virology. 74(9), 3984–3995. Available from: 10.1128/JVI.74.9.3984-3995.2000.

Du Pasquier, L., Zucchetti, I. & De Santis, R. (2004) Immunoglobulin superfamily receptors in protochordates: before RAG time. Immunological Reviews. 198, 233–248. Available from: 10.1111/j.0105-2896.2004.00122.x.

Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 32(5), 1792–1797. Available from: 10.1093/nar/gkh340.

Eick, G. N., Jacobs, D. S. & Matthee, C. A. (2005) A nuclear DNA phylogenetic perspective on the evolution of echolocation and historical biogeography of extant bats (Chiroptera). Molecular Biology and Evolution. 22(9), 1869–1886. Available from: 10.1093/molbev/msi180.

Eickbush, T. H. (1994) Origin and evolutionary relationships of retroelements. In: Morse, S. S. (ed.) The evolutionary biology of viruses. New York, Raven Press, 121–157.

Eickbush, T. H. & Furano, A. V. (2002) Fruit flies and humans respond differently to retrotransposons. Current Opinion in Genetics & Development. 12(6), 669–674. Available from: 10.1016/S0959-437X(02)00359-3.

Eickbush, T. H. & Malik, H. S. (2002) Origin and evolution of retrotransposons. In: Craig, N. L., Craigie, R., Gellert, M. & Lambowitz, A. M. (eds.) Mobile DNA II. Washington, DC, ASM Press, 1111–1144.

Ellermann, V. & Bang, O. (1908) Experimentelle Leukämie bei Hühnern. Zentralblatt der Bakteriologie. 46, 595–609.

Ellermann, V. & Bang, O. (1909) Experimentelle Leukämie bei Hühnern. II. Zeitschrift für Hygiene und Infektionskrankheiten. 63(1), 231–272. Available from: 10.1007/bf02227892.

Ellinghaus, D., Kurtz, S. & Willhoeft, U. (2008) LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9, 18. Available from: 10.1186/1471-2105-9-18.

Emi, M., Horii, A., Tomita, N., Nishide, T., Ogawa, M., Mori, T. & Matsubara, K. (1988) Overlapping two genes in human DNA: a salivary amylase gene overlaps with a gamma-actin pseudogene that carries an integrated human endogenous retroviral DNA. Gene. 62(2), 229–235. Available from: 10.1016/0378-1119(88)90561-6.

Engelman, A., Mizuuchi, K. & Craigie, R. (1991) HIV-1 DNA integration: mechanism of viral DNA cleavage and DNA strand transfer. Cell. 67(6), 1211–1221. Available from: 10.1016/0092-8674(91)90297-C.

Evans, L. H., Lavignon, M., Taylor, M. & Alamgir, A. S. (2003) Antigenic subclasses of polytropic murine leukemia virus (MLV) isolates reflect three distinct groups of endogenous polytropic MLV-related sequences in NFS/N mice. Journal of Virology. 77(19), 10327–10338. Available from: 10.1128/JVI.77.19.10327-10338.2003.

Fiebig, U., Hartmann, M. G., Bannert, N., Kurth, R. & Denner, J. (2006) Transspecies transmission of the endogenous koala retrovirus. Journal of Virology. 80(11), 5651–5654. Available from: 10.1128/JVI.02597-05.

Flajnik, M. F. & Kasahara, M. (2009) Origin and evolution of the adaptive immune system: Genetic events and selective pressures. Nature Reviews Genetics. 11 (1), 47–59. Available from: 10.1038/nrg2703.

Fortin, J. F., Cantin, R. & Tremblay, M. J. (1998) T cells expressing activated LFA-1 are more susceptible to infection with human immunodeficiency virus type 1 particles bearing host-encoded ICAM-1. Journal of Virology. 72(3), 2105–2112. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9499066.

Fujita, M., Otsuka, M., Nomaguchi, M. & Adachi, A. (2010) Multifaceted activity of HIV Vpr/Vpx proteins: The current view of their virological functions. Reviews in Medical Virology. 20(2), 68–76. Available from: 10.1002/rmv.636.

Fujita, F. (2003) Inactivation of porcine endogenous retrovirus by human serum as a function of complement activated through the classical pathway. Hepatology Research. 26, 106–113. Available from: 10.1016/S1386-6346(03)00087-1.

Gak, E., Yaniv, A., Sherman, L., Ianconescu, M., Tronick, S. R. & Gazit, A. (1991) Lymphoproliferative disease virus of turkeys: sequence analysis and transcriptional activity of the long terminal repeat. Gene. 99(2), 157–162. Available from: http://www.ncbi.nlm.nih.gov/pubmed/2022329.

Gao, F., Bailes, E., Robertson, D. L., Chen, Y., Rodenburg, C. M., Michael, S. F., Cummins, L. B., Arthur, L. O., Peeters, M., Shaw, G. M., Sharp, P. M. & Hahn, B. H. (1999) Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature. 397(6718), 436–441. Available from: 10.1038/17130.

Gao, F., Yue, L., Robertson, D. L., Hill, S. C., Hui, H., Biggar, R. J., Neequaye, A. E., Whelan, T. M., Ho, D. D. & Shaw, G. M. (1994) Genetic diversity of human immunodeficiency virus type 2: evidence for distinct sequence subtypes with differences in virus biology. Journal of Virology. 68(11), 7433–7447. Available from: http://jvi.asm.org/content/68/11/7433.long.

Gao, F., Yue, L., White, A. T., Pappas, P. G., Barchue, J., Hanson, A. P., Greene, B. M., Sharp, P. M., Shaw, G. M. & Hahn, B. H. (1992) Human infection by genetically diverse SIVSM-related HIV-2 in west Africa. Nature. 358(6386), 495–499. Available from: 10.1038/358495a0.

Gifford, R., Kabat, P., Martin, J., Lynch, C. & Tristem, M. (2005) Evolution and distribution of class II-related endogenous retroviruses. Journal of Virology. 79(10), 6478–6486. Available from: 10.1128/JVI.79.10.6478.

Gifford, R. J., Katzourakis, A., Tristem, M., Pybus, O. G., Winters, M. & Shafer, R. W. (2008) A transitional endogenous lentivirus from the genome of a basal primate and implications for lentivirus evolution. Proceedings of the National Academy of Sciences. 105(51), 20362–20367. Available from: 10.1073/pnas.0807873105.

Golovkina, T. V., Jaffe, A. B. & Ross, S. R. (1994) Coexpression of exogenous and endogenous mouse mammary tumor virus RNA in vivo results in viral recombination and broadens the virus host range. Journal of Virology. 68(8), 5019–5026. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC236444/.

Goodier, J. L. & Kazazian, H. H. (2008) Retrotransposons Revisited: The Restraint and Rehabilitation of Parasites. Cell. 135(1), 23–35. Available from: 10.1016/j.cell.2008.09.022.

Green, L. M. & Berg, J. M. (1989) A retroviral Cys-Xaa2-Cys-Xaa4-His-Xaa4-Cys peptide binds metal ions: Spectroscopic studies and a proposed three-dimensional structure. Proceedings of the National Academy of Sciences. 86(11), 4047–4051. Available from: 10.1073/pnas.86.11.4047.

Hafner, M. S. & Nadler, S. A. (1988) Phylogenetic trees support the coevolution of parasites and their hosts. Nature. 332(6161), 258–259. Available from: 10.1038/332258a0.

Han, G. Z. (2015) Extensive retroviral diversity in shark. Retrovirology. 12 (1). Available from: 10.1186/s12977-015-0158-4.

Han, B. A., Schmidt, J. P., Bowden, S. E. & Drake, J. M. (2015) Rodent reservoirs of future zoonotic diseases. Proceedings of the National Academy of Sciences. 112(22), 201501598. Available from: 10.1073/pnas.1501598112.

Han, G. Z. & Worobey, M. (2012a) An endogenous foamy-like viral element in the coelacanth genome. PLoS Pathogens. 8(6), 1–7. Available from: 10.1371/journal.ppat.1002790.

Han, G. Z. & Worobey, M. (2012b) Endogenous lentiviral elements in the weasel family (Mustelidae). Molecular Biology and Evolution. 29(520), 2905–2908. Available from: 10.1093/molbev/mss126.

Han, J. S. & Boeke, J. D. (2005) LINE-1 retrotransposons: Modulators of quantity and quality of mammalian gene expression? BioEssays. 27(8), 775–784. Available from: 10.1002/bies.20257.

Hanada, K., Suzuki, Y. & Gojobori, T. (2004) A large variation in the rates of synonymous substitution for RNA viruses and its relationship to a diversity of viral infection and transmission modes. Molecular Biology and Evolution. 21(6), 1074–1080. Available from: 10.1093/molbev/msh109.

Hanger, J. J., Bromham, L. D., McKee, J. J., O’Brien, T. M. & Robinson, W. F. (2000) The nucleotide sequence of koala (Phascolarctos cinereus) retrovirus: a novel type C endogenous virus related to Gibbon ape leukemia virus. Journal of Virology. 74(9), 4264–4272. Available from: 10.1128/JVI.74.9.4264-4272.2000.

Haraguchi, S., Good, R. A., Cianciolo, G. J., Engelman, R. W. & Day, N. K. (1997) Immunosuppressive retroviral peptides: immunopathological implications for immunosuppressive influences of retroviral infections. Journal of Leukocyte Biology. 61(6), 654–666. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9201256.

Hart, D., Frerichs, G. N., Rambaut, A. & Onions, D. E. (1996) Complete nucleotide sequence and transcriptional analysis of snakehead fish retrovirus. Journal of Virology. 70, 3606–3616.

Hayward, A., Grabherr, M. & Jern, P. (2013) Broad-scale phylogenomics provides insights into retrovirus-host evolution. Proceedings of the National Academy of Sciences of the United States of America. 110(50), 20146–20151. Available from: 10.1073/pnas.1315419110.

Hayward, A., Cornwallis, C. K. & Jern, P. (2014) Pan-vertebrate comparative genomics unmasks retrovirus macroevolution. Proceedings of the National Academy of Sciences. 112 (2), 464–469. Available from: 10.1073/pnas.1414980112.

Heidmann, O., Vernochet, C., Dupressoir, A. & Heidmann, T. (2009) Identification of an endogenous retroviral envelope gene with fusogenic activity and placenta-specific expression in the rabbit: A new ‘syncytin’ in a third order of mammals. Retrovirology. 6 (1), 107. Available from: 10.1186/1742-4690-6-107.

Henderson, L. E., Copeland, T. D., Sowder, R. C., Smythers, G. W. & Oroszlan, S. (1981) Primary structure of the low molecular weight nucleic acid-binding proteins of murine leukemia viruses. The Journal of Biological Chemistry. 256(16), 8400–8406. Available from: http://www.ncbi.nlm.nih.gov/pubmed/6267042.

Heneine, W., Switzer, W. M., Sandstrom, P., Brown, J., Vedapuri, S., Schable, C. A., Khan, A. S., Lerche, N. W., Schweizer, M., Neumann-Haefelin, Chapman, L. E. & Folks, T. M. (1998) Identification of a human population infected with simian foamy viruses. Nature Medicine. 4(4), 403–407. Available from: 10.1038/nm0498-403.

Herniou, E., Martin, J., Miller, K., Cook, J., Wilkinson, M. & Tristem, M. (1998) Retroviral diversity and distribution in vertebrates. Journal of Virology. 72(7), 5955–5966. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC110400/.

Herrin, B. R. & Cooper, M. D. (2010) Alternative adaptive immunity in jawless vertebrates. Journal of Immunology. 185(3), 1367–1374. Available from: 10.4049/jimmunol.0903128.

Hindmarsh, P. & Leis, J. (1999) Retroviral DNA integration. Microbiology and Molecular Biology Reviews. 63(4), 836–843. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC98978/.

Holzschu, D. L., Martineau, D., Fodor, S. K., Vogt, V. M., Bowser, P. R. & Casey, J. W. (1995) Nucleotide sequence and protein analysis of a complex piscine retrovirus, walleye dermal sarcoma virus. Journal of Virology. 69(9), 5320–5331. Available from: http://www.ncbi.nlm.nih.gov/pubmed/7636975.

Howard, C. R. & Fletcher, N. F. (2012) Emerging virus diseases: can we ever expect the unexpected? Emerging Microbes & Infections. 1(12), e46. Available from: 10.1038/emi.2012.47.

Hron, T., Fábryová, H., Pačes, J. & Elleder, D. (2014) Endogenous lentivirus in Malayan colugo (Galeopterus variegatus), a close relative of primates. Retrovirology. 11(1), 84. Available from: 10.1186/s12977-014-0084-x.

Hughes, J. F. & Coffin, J. M. (2004) Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution. Proceedings of the National Academy of Sciences of the United States of America. 101(6), 1668–1672. Available from: 10.1073/pnas.0307885100.

Hughes, J. F. & Coffin, J. M. (2005) Human endogenous retroviral elements as indicators of ectopic recombination events in the primate genome. Genetics. 171(3), 1183–1194. Available from: 10.1534/genetics.105.043976.

Jacks, T., Madhani, H. D., Masiarz, F. R. & Varmus, H. E. (1988) Signals for ribosomal frameshifting in the Rous sarcoma virus gag-pol region. Cell. 55, 447–458. Available from: 10.1016/0092-8674(88)90031-1.

Jamjoom, G. A., Naso, R. B. & Arlinghaus, R. B. (1977) Further characterization of intracellular precursor polyproteins of Rauscher leukemia virus. Virology. 78(1), 11–34. Available from: 10.1016/0042-6822(77)90075-7.

Jenkins, G. M., Rambaut, A., Pybus, O. G. & Holmes, E. C. (2002) Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. Journal of Molecular Evolution. 54(2), 156–165. Available from: 10.1007/s00239-001-0064-3.

Jern, P., Sperber, G. O. & Blomberg, J. (2005) Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and taxonomy. Retrovirology. 2, 50. Available from: 10.1186/1742-4690-2-50.

Jern, P., Sperber, G. O. & Blomberg, J. (2006) Divergent patterns of recent retroviral integrations in the human and chimpanzee genomes: probable transmissions between other primates and chimpanzees. Journal of Virology. 80(3), 1367–1375. Available from: 10.1128/JVI.80.3.1367-1375.2006.

Johnson, J. A. & Heneine, W. (2001) Characterization of endogenous avian leukosis viruses in chicken embryonic fibroblast substrates used in production of measles and mumps vaccines. Journal of Virology. 75(8), 3605–3612. Available from: 10.1128/JVI.75.8.3605-3612.2001.

Johnson, W. E. & Coffin, J. M. (1999) Constructing primate phylogenies from ancient retrovirus sequences. Proceedings of the National Academy of Sciences of the United States of America. 96(18), 10254–10260. Available from: 10.1073/pnas.96.18.10254.

Kamp, C., Hirschmann, P., Voss, H., Huellen, K. & Vogt, P. H. (2000) Two long homologous retroviral sequence blocks in proximal Yq11 cause AZFa microdeletions as a result of intrachromosomal recombination events. Human Molecular Genetics. 9(17), 2563–2572. Available from: 10.1093/hmg/9.17.2563.

Kapitonov, V. V. & Jurka, J. (2003) Molecular paleontology of transposable elements in the Drosophila melanogaster genome. Proceedings of the National Academy of Sciences of the United States of America. 100(11), 6569–6574. Available from: 10.1073/pnas.0732024100.

Kasamatsu, J., Sutoh, Y., Fugo, K., Otsuka, N., Iwabuchi, K. & Kasahara, M. (2010) Identification of a third variable lymphocyte receptor in the lamprey. Proceedings of the National Academy of Sciences of the United States of America. 107(32), 14304–14308. Available from: 10.1073/pnas.1001910107.

Katz, R. A. & Skalka, A. M. (1994) The retroviral enzymes. Annual Review of Biochemistry. 63, 133–173. Available from: 10.1146/annurev.biochem.63.1.133.

Katz, R. A., Gravuer, K. & Skalka, A. M. (1998) A preferred target DNA structure for retroviral integrase in vitro. Journal of Biological Chemistry. 273(37), 24190–24195. Available from: 10.1074/jbc.273.37.24190.

Katz, R. A. & Skalka, A. M. (1990) Generation of diversity in retroviruses. Annual Review of Genetics. 24, 409–445. Available from: 10.1146/annurev.ge.24.120190.002205.

Katzourakis, A., Gifford, R. J., Tristem, M., Gilbert, M. T. P. & Pybus, O. G. (2009) Macroevolution of complex retroviruses. Science. 325(5947), 1512. Available from: 10.1186/1742-4690-6-S2-O1.

Katzourakis, A., Pereira, V. & Tristem, M. (2007) Effects of recombination rate on human endogenous retrovirus fixation and persistence. Journal of Virology. 81 (19), 10712–10717. Available from: 10.1128/jvi.00410-07.

Katzourakis, A., Tristem, M., Pybus, O. G. & Gifford, R. J. (2007) Discovery and analysis of the first endogenous lentivirus. Proceedings of the National Academy of Sciences. 104(15), 6261–6265. Available from: 10.1073/pnas.0700471104.

Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., Buxton, S., Cooper, A., Markowitz, S., Duran, C., Thierer, T., Ashton, B., Meintjes, P. & Drummond, A. (2012) Geneious basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 28(12), 1647–1649. Available from: 10.1093/bioinformatics/bts199.

Keele, B. F., Van Heuverswyn, F., Li, Y., Bailes, E., Takehisa, J., Santiago, M. L., Bibollet-Ruche, F., Chen, Y., Wain, L. V., Liegeois, F., Loul, S., Ngole, E. M., Bienvenue, Y., Delaporte, E., Brookfield, J. F. Y., Sharp, P. M., Shaw, G. M., Peeters, M. & Hahn, B. H. (2006) Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science. 313(5786), 523–526. Available from: 10.1126/science.1126531.

Kewalramani, V. N., Panganiban, A. T. & Emerman, M. (1992) Spleen necrosis virus, an avian immunosuppressive retrovirus, shares a receptor with the type D simian retroviruses. Journal of Virology. 66(5), 3026–3031. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC241062/.

Khan, E., Mack, J. P. G., Katz, R. A., Kulkosky, J. & Skalka, A. M. (1991) Retroviral integrase domains: DNA binding and the recognition of LTR sequences. Nucleic Acids Research. 19(6), 1358. Available from: 10.1093/nar/19.6.1358.

Kidwell, M. G. & Lisch, D. R. (2001) Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution: International Journal of Organic Evolution. 55(1), 1–24. Available from: 10.1554/0014-3820(2001)055.

Kim, A., Terzian, C., Santamaria, P., Pélisson, A., Prud’homme, N. & Bucheton, A. (1994) Retroviruses in invertebrates: the gypsy retrotransposon is apparently an infectious retrovirus of Drosophila melanogaster. Proceedings of the National Academy of Sciences of the United States of America. 91(4), 1285–1289. Available from: 10.1073/pnas.91.4.1285.

Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution. 16 (2), 111–120. Available from: 10.1007/bf01731581.

Kitchen, A., Shackelton, L. A. & Holmes, E. C. (2011) Family level phylogenies reveal modes of macroevolution in RNA viruses. Proceedings of the National Academy of Sciences of the United States of America. 108(1), 238–243. Available from: 10.1073/pnas.1011090108.

Klymiuk, N., Müller, M., Brem, G. & Aigner, B. (2002) Characterization of porcine endogenous retrovirus pro-pol nucleotide sequences. Journal of Virology. 76(22), 11738–11743. Available from: 10.1128/jvi.76.22.11738-11743.2002.

Klymiuk, N., Müller, M., Brem, G. & Aigner, B. (2003) Characterization of endogenous retroviruses in sheep. Journal of Virology. 77(20), 11268–11273. Available from: 10.1128/jvi.77.20.11268-11273.2003.

Koralnik, I. J., Boeri, E., Saxinger, W. C., Monico, A. L., Fullen, J., Gessain, A., Gau, H. G., Gallo, R. C., Markham, P. & Kalyanaraman, V. (1994) Phylogenetic associations of human and simian T-cell leukemia/lymphotropic virus type I strains: evidence for interspecies transmission. Journal of Virology. 68(4), 2693–2707. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC236747/.

Kumar, S. & Subramanian, S. (2002) Mutation rates in mammalian genomes. Proceedings of the National Academy of Sciences of the United States of America. 99(2), 803–808. Available from: 10.1073/pnas.022629899.

Lai, H., Zhang, H., Ning, Z., Chen, R., Zhang, W., Qing, A., Xin, C., Yu, K., Cao, W. & Liao, M. (2011) Isolation and characterization of emerging subgroup J avian leukosis virus associated with hemangioma in egg-type chickens. Veterinary Microbiology. 151(3-4), 275–283. Available from: 10.1016/j.vetmic.2011.03.037.

LaMere, S. A., St Leger, J. A., Schrenzel, M. D., Anthony, S. J., Rideout, B. A. & Salomon, D. R. (2009) Molecular characterization of a novel gammaretrovirus in killer whales (Orcinus orca). Journal of Virology. 83(24), 12956–12967. Available from: 10.1128/JVI.01354-09.

Lander, E. S. & International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature. 409(6822), 860–921. Available from: 10.1038/35057062.

LaPierre, L. A., Casey, J. W. & Holzschu, D. L. (1998) Walleye retroviruses associated with skin tumors and hyperplasias encode cyclin D homologs. Journal of Virology. 72(11), 8765–8771. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9765420.

LaPierre, L. A., Holzschu, D. L., Bowser, P. R. & Casey, J. W. (1999) Sequence and transcriptional analyses of the fish retroviruses walleye epidermal hyperplasia virus types 1 and 2: evidence for a gene duplication. Journal of Virology. 73(11), 9393–9403. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC112974/.

Leblanc, P., Desset, S., Dastugue, B. & Vaury, C. (1997) Invertebrate retroviruses: ZAM a new candidate in D. melanogaster. The EMBO Journal. 16(24), 7521–7531. Available from: 10.1093/emboj/16.24.7521.

Lerat, E. & Capy, P. (1999) Retrotransposons and retroviruses: analysis of the envelope gene. Molecular Biology and Evolution. 16(9), 1198–1207. Available from: 10.1093/oxfordjournals.molbev.a026210.

Lewinski, M. K., Bisgrove, D., Shinn, P., Chen, H., Hoffmann, C., Hannenhalli, S., Verdin, E., Berry, C. C., Ecker, J. R. & Bushman, F. D. (2005) Genome-wide analysis of chromosomal features repressing human immunodeficiency virus transcription. Journal of Virology. 79(11), 6610–6619. Available from: 10.1128/JVI.79.11.6610-6619.2005.

Lewinski, M. K., Yamashita, M., Emerman, M., Ciuffi, A., Marshall, H., Crawford, G., Collins, F., Shinn, P., Leipzig, J., Hannenhalli, S., Berry, C. C., Ecker, J. R. & Bushman, F. D. (2006) Retroviral DNA integration: viral and cellular determinants of target-site selection. PLoS Pathogens. 2(6), e60. Available from: 10.1371/journal.ppat.0020060.

Lewis, P. F. & Emerman, M. (1994) Passage through mitosis is required for oncoretroviruses but not for the human immunodeficiency virus. Journal of Virology. 68(1), 510–516. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC236313/.

Li, J., Das, S., Herrin, B. R., Hirano, M. & Cooper, M. D. (2013) Definition of a third VLR gene in hagfish. Proceedings of the National Academy of Sciences of the United States of America. 110(37), 15013–15018. Available from: 10.1073/pnas.1314540110.

Libeskind-Hadas, R., Wu, Y. C., Bansal, M. S. & Kellis, M. (2014) Pareto-optimal phylogenetic tree reconciliation. Bioinformatics. 30(12), 87–95. Available from: 10.1093/bioinformatics/btu289.

Lieber, M. M., Sherr, C. J., Todaro, G. J., Benveniste, R. E., Callahan, R. & Coon, H. G. (1975) Isolation from the asian mouse Mus caroli of an endogenous type C virus related to infectious primate type C viruses. Proceedings of the National Academy of Sciences of the United States of America. 72(6), 2315–2319. Available from: 10.1073/pnas.72.6.2315.

Llorens, C., Fares, M. A. & Moya, A. (2008) Relationships of gag-pol diversity between Ty3/Gypsy and Retroviridae LTR retroelements and the three kings hypothesis. BMC Evolutionary Biology. 8, 276. Available from: 10.1186/1471-2148-8-276.

Llorens, C., Futami, R., Covelli, L., Domínguez-Escribá, L., Viu, J. M., Tamarit, D., Aguilar-Rodríguez, J., Vicente-Ripolles, M., Fuster, G., Bernet, G. P., Maumus, F., Munoz-Pomer, A., Sempere, J. M., Latorre, A. & Moya, A. (2011) The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Research. 39(Database issue), D70–D74. Available from: 10.1093/nar/gkq1061.

Llorens, C. & Marin, I. (2001) A mammalian gene evolved from the integrase domain of an LTR retrotransposon. Molecular Biology and Evolution. 18(8), 1597–1600. Available from: 10.1093/oxfordjournals.molbev.a003947.

Llorens, C., Muñoz-Pomer, A., Bernad, L., Botella, H. & Moya, A. (2009) Network dynamics of eukaryotic LTR retroelements beyond phylogenetic trees. Biology Direct. 4, 41. Available from: 10.1186/1745-6150-4-41.

Lodi, P. J., Ernst, J. A., Kuszewski, J., Hickman, A. B., Engelman, A., Craigie, R., Clore, G. M. & Gronenborn, A. M. (1995) Solution structure of the DNA binding domain of HIV-1 integrase. Biochemistry. 34(31), 9826–9833. Available from: http://www.ncbi.nlm.nih.gov/pubmed/15895093.

Luis, A. D., Hayman, D. T. S., O’Shea, T. J., Cryan, P. M., Gilbert, A. T., Pulliam, J. R. C., Mills, J. N., Timonin, M. E., Willis, C. K. R., Cunningham, A. A., Fooks, A. R., Rupprecht, C. E., Wood, J. L. N. & Webb, C. T. (2013) A comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special? Proceedings of the Royal Society B: Biological Sciences. 280(1756), 20122753. Available from: 10.1098/rspb.2012.2753.

Maddison, W. P. & Maddison, D. R. (2008). Mesquite: A modular system for evolutionary analysis. Evolution, 62, 1103–1118. Available from: http://mesquiteproject.org.

Maeda, N. (1985). Nucleotide sequence of the haptoglobin and haptoglobin-related gene pair. Journal of Biological Chemistry, 260(11), 6698–6709. Available from: http://www.ncbi.nlm.nih.gov/pubmed/2987228.

Malik, H. S. & Eickbush, T. H. (1999). Modular evolution of the integrase domain in the Ty3/Gypsy class of LTR retrotransposons. Journal of Virology, 73(6), 5186–5190. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC112568/.

Malik, H. S. & Eickbush, T. H. (2001). Phylogenetic analysis of ribonuclease H domains suggests a late, chimeric origin of LTR retrotransposable elements and retroviruses. Genome Research. 11(7), 1187–1197. Available from: 10.1101/gr.185101.

Malik, H. S., Henikoff, S. & Eickbush, T. H. (2000). Poised for contagion: Evolutionary origins of the infectious abilities of invertebrate retroviruses. Genome Research. 10(9), 1307–1318. Available from: 10.1101/gr.145000.

Malim, M. H. & Emerman, M. (2008). HIV-1 accessory proteins —ensuring viral survival in a hostile environment. Cell Host & Microbe. 3(6), 388–398. Available from: 10.1016/j.chom.2008.04.008.

Mammano, F., Ohagen, A., Höglund, S. & Göttlinger, H. G. (1994). Role of the major homology region of human immunodeficiency virus type 1 in virion morphogenesis. Journal of Virology. 68(8), 4927–4936. Available from: http://www.ncbi.nlm.nih.gov/pubmed/8035491.

Mang, R., Maas, J., Van Der Kuyl, A. C. & Goudsmit, J. (2000). Papio cynocephalus endogenous retrovirus among old world monkeys: evidence for coevolution and ancient cross-species transmissions. Journal of Virology, 74(3), 1578–1586. Available from: 10.1128/jvi.74.3.1578-1586.2000.

Mangeney, M., Renard, M., Schlecht-Louf, G., Bouallaga, I., Heidmann, O., Letzelter, C., Richaud, A., Ducos, B. & Heidmann, T. (2007) Placental syncytins: Genetic disjunction between the fusogenic and immunosuppressive activity of retroviral envelope proteins. Proceedings of the National Academy of Sciences. 104 (51), 20534–20539. Available from: 10.1073/pnas.0707873105.

Mao, X., Nie, W., Wang, J., Su, W., Feng, Q., Wang, Y., Dobigny, G. & Yang, F. (2008). Comparative cytogenetics of bats (Chiroptera): the prevalence of Robertsonian translocations limits the power of chromosomal characters in resolving interfamily phylogenetic relationships. Chromosome Research : An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology. 16(1), 155–170. Available from: 10.1007/s10577-007-1206-2.

Martin, J., Herniou, E., Cook, J., O’Neill, R. W. & Tristem, M. (1999). Interclass transmission and phyletic host tracking in murine leukemia virus-related retroviruses. Journal of Virology. 73(3), 2442–2449. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9971829.

Martin, J., Kabat, P. & Tristem, M. (2003). Cospeciation and horizontal transmission rates in the murine leukaemia-related retroviruses. In: Page, R. D. M.(ed.), Tangled Trees: Phylogeny, Cospeciation and Coevolution, United States, University of Chicago Press, 174–194. Available from: http://press.uchicago.edu/ucp/books/book/chicago/T/bo3634552.html.

Martineau, D., Bowser, P. R., Renshaw, R. R. & Casey, J. W. (1992). Molecular characterization of a unique retrovirus associated with a fish tumor. Journal of Virology. 66(1), 596–599. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1727503.

Martineau, D., Renshaw, R., Williams, J., Casey, J. & Bowser, P. (1991). A large unintegrated retrovirus DNA species present in a dermal tumor of walleye Stizostedion vitreum. Diseases of Aquatic Organisms. 10, 153–158. Available from: 10.3354/dao010153.

Maurer, B., Bannert, H., Darai, G. & Flugel, R. M. (1988). Analysis of the primary structure of the long terminal repeat and the gag and pol genes of the human spumaretrovirus. Journal of Virology. 62(5), 1590–1597. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC253186/.

Maury, W. (1998). Regulation of equine infectious anemia virus expression. Journal of Biomedical Science. 5(1), 11–23. Available from: 10.1007/bf02253351.

McClintock, B. (1956). Controlling Elements and the Gene. Cold Spring Harbor Symposia on Quantitative Biology. 21, 197–216. Available from: 10.1101/SQB.1956.021.01.017.

McNally, M. T., Gontarek, R. R. & Beemon, K. (1991). Characterization of Rous sarcoma virus intronic sequences that negatively regulate splicing. Virology. 185(1), 99–108. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1656608.

Menéndez-Arias, L. (2009). Mutation rates and intrinsic fidelity of retroviral reverse transcriptases. Viruses. 1(3), 1137–1165. Available from: 10.3390/v1031137.

Merkle, D. & Middendorf, M. (2005). Reconstruction of the cophylogenetic history of related phylogenetic trees with divergence timing information. Theory in Biosciences. 123(4), 277–299. Available from: 10.1016/j.thbio.2005.01.003.

Mi, S., Lee, X., Li, X., Veldman, G. M., Finnerty, H., Racie, L., LaVallie, E., Tang, X. Y., Edouard, P., Howes, S., Keith, J. C. Jr. & McCoy, J. M. (2000). Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature. 403 (6771), 785–789. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10693809.

Levin, B. R. (1988). The evolution of sex in bacteria. In: Michod, R. E. & Levin, B. R. (eds.) The Evolution of Sex: An Examination of Current Ideas. Sunderland, MA, Sinauer, 194–211.

Mitchell, R. S., Beitzel, B. F., Schroder, A. R. W., Shinn, P., Chen, H., Berry, C. C., Ecker, J. R. & Bushman, F. D. (2004). Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biology. 2(8), E234. Available from: 10.1371/journal.pbio.0020234.

Mizrokhi, L. J., & Mazo, A. M. (1991). Cloning and analysis of the mobile element gypsy from D. virilis. Nucleic Acids Research. 19(4), 913–916. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1708127.

Mondor, I., Ugolini, S. & Sattentau, Q. J. (1998). Human immunodeficiency virus type 1 attachment to HeLa CD4 cells is CD4 independent and gp120 dependent and requires cell surface heparans. Journal of Virology. 72(5), 3623–3634. Available from: http://www.ncbi.nlm.nih.gov/pubmed/9557643.

Nisole, S., Krust, B., Callebaut, C., Guichard, G., Muller, S., Briand, J. P. & Hovanessian, A. G. (1999). The anti-HIV pseudopeptide HB-19 forms a complex with the cell-surface-expressed nucleolin independent of heparan sulfate proteoglycans. The Journal of Biological Chemistry. 274(39), 27875–27884. Available from: 10.1074/jbc.274.39.27875.

Oppermann, H., Bishop, J. M., Varmus, H. E. & Levintow, L. (1977). A joint product of the genes gag and pol of avian sarcoma virus: a possible precursor of reverse transcriptase. Cell. 12(4), 993–1005. Available from: 10.1016/0092-8674(77)90164-7.

Oroszlan, S. & Luftig, R. B. (1990). Retroviral proteinases. Current Topics in Microbiology and Immunology. 157, 153–185. Available from: http://www.ncbi.nlm.nih.gov/pubmed/2203608?dopt=Abstract.

Ozers, M. S. & Friesen, P. D. (1996). The Env-like open reading frame of the baculovirus-integrated retrotransposon TED encodes a retrovirus-like envelope protein. Virology. 226(2), 252–259. Available from: 10.1006/viro.1996.0653.

Page, R. D. M. (1994a). Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology. 43(1), 58–77. Available from: 10.2307/2413581.

Page, R. D. M. (1994b). Parallel phylogenies: reconstructing the history of host-parasite assemblages. Cladistics. 10(2), 155–173. Available from: 10.1006/clad.1994.1010.

Pancer, Z., Amemiya, C. T., Ehrhardt, G. R. A., Ceitlin, J., Gartland, G. L. & Cooper, M. D. (2004). Somatic diversification of variable lymphocyte receptors in the agnathan sea lamprey. Nature. 430(6996), 174–180. Available from: 10.1038/nature02740.

Pancer, Z., Saha, N. R., Kasamatsu, J., Suzuki, T., Amemiya, C. T., Kasahara, M. & Cooper, M. D. (2005). Variable lymphocyte receptors in hagfish. Proceedings of the National Academy of Sciences of the United States of America. 102(26), 9224–9229. Available from: 10.1073/pnas.0503792102.

Parker, J., Rambaut, A. & Pybus, O. G. (2008). Correlating viral phenotypes with phylogeny: Accounting for phylogenetic uncertainty. Infection, Genetics and Evolution. 8(3), 239–246. Available from: 10.1016/j.meegid.2007.08.001.

Patarca, R. & Haseltine, W. A. (1985). A major retroviral core protein related to EPA and TIMP. Nature. 318 (6044), 390–390. Available from: 10.1038/318390a0.

Patience, C., Switzer, W. M., Takeuchi, Y., Griffiths, D. J., Goward, M. E., Heneine, W., Stoye, J.P. & Weiss, R. A. (2001). Multiple groups of novel retroviral genomes in pigs and related species. Journal of Virology, 75(6), 2771–2775. Available from: 10.1128/JVI.75.6.2771-2775.2001.

Paul, T. A., Quackenbush, S. L., Sutton, C., Casey, R. N., Bowser, P. R. & Casey, J. W. (2006) Identification and characterization of an exogenous retrovirus from Atlantic salmon swim bladder sarcomas. Journal of Virology. 80 (6), 2941–2948. Available from: 10.1128/jvi.80.6.2941-2948.2006.

Pearson, M. N. & Rohrmann, G. F. (2002). Transfer, Incorporation, and Substitution of Envelope Fusion Proteins among Members of the Baculoviridae, Orthomyxoviridae, and Metaviridae (Insect Retrovirus) Families. Journal of Virology. 76(11), 5301–5304. Available from: 10.1128/JVI.76.11.5301-5304.2002.

Pelisson, A., Mejlumian, L., Robert, V., Terzian, C. & Bucheton, A. (2002). Drosophila germline invasion by the endogenous retrovirus gypsy: Involvement of the viral env gene. Insect Biochemistry and Molecular Biology. 32(10), 1249–1256. Available from: 10.1016/S0965-1748(02)00088-7.

Polard, P. & Chandler, M. (1995). Bacterial transposases and retroviral integrases. Molecular Microbiology. 15(1), 13–23. Available from: 10.1111/j.1365-2958.1995.tb02217.x.

Poulet, F. M., Bowser, P. R. & Casey, J. W. (1994). Retroviruses of fish, reptiles and molluscs, The Retroviridae, New York, Plenum Press, 1–38.

Purvis, A., & Webster, A. J. (1999). Phylogenetically independent comparisons and primate phylogeny. In: Lee, P. C. (ed.) Comparative Primate Socioecology. Cambridge, Cambridge University Press, 44–70. Available from: http://ebooks.cambridge.org/chapter.jsf?bid=CBO9780511542466&cid=CBO9780511542466A010.

Renard, M., Varela, P. F., Letzelter, C., Duquerroy, S., Rey, F. A. & Heidmann, T. (2005) Crystal structure of a pivotal domain of human syncytin-2, a 40 million years old endogenous retrovirus fusogenic envelope gene captured by primates. Journal of Molecular Biology. 352 (5), 1029–1034. Available from: 10.1016/j.jmb.2005.07.058.

Renne, R., Friedl, E., Schweizer, M., Fleps, U., Turek, R. & Neumann-Haefelin, D. (1992). Genomic organization and expression of simian foamy virus type 3 (SFV-3). Virology. 186(2), 597–608. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1310187.

Repaske, R., Steele, P. E., O’Neill, R. R., Rabson, A. B. & Martin, M. A. (1985). Nucleotide sequence of a full-length human endogenous retroviral segment. Journal of Virology. 54(3), 764–772. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC254863/.

Roe, T., Reynolds, T. C., Yu, G. & Brown, P. O. (1993). Integration of murine leukemia virus DNA depends on mitosis. The EMBO Journal. 12(5), 2099–2108. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC413431/.

Rogozin, I. B., Iyer, L. M., Liang, L., Glazko, G. V., Liston, V. G., Pavlov, Y. I., Aravind, L. & Pancer, Z. (2007). Evolution and diversification of lamprey antigen receptors: evidence for involvement of an AID-APOBEC family cytosine deaminase. Nature Immunology. 8(6), 647–656. Available from: 10.1038/ni1463.

Ronquist, F. & Huelsenbeck, J. P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19(12), 1572–1574. Available from: 10.1093/bioinformatics/btg180.

Ronquist, F., Teslenko, M., Van Der Mark, P., Ayres, D. L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M. A. & Huelsenbeck, J. P. (2012). MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology. 61(3), 539–542. Available from: 10.1093/sysbio/sys029.

Rother, R. P., Fodor, W. L., Springhorn, J. P., Birks, C. W., Setter, E., Sandrin, M. S., Squinto, S. P. & Rollins, S. A. (1995). A novel mechanism of retrovirus inactivation in human serum mediated by anti-alpha-galactosyl natural antibody. The Journal of Experimental Medicine. 182(5), 1345–1355. Available from: 10.1084/jem.182.5.1345.

Rous, P. (1910). A transmissible avian neoplasm. (sarcoma of the common fowl.). The Journal of Experimental Medicine. 12(5), 696–705. Available from: 10.1084/jem.150.4.729.

Rous, P. (1911). A sarcoma of the fowl transmissible by an agent separable from the tumor cells. The Journal of Experimental Medicine. 13(4), 397–411. Available from: 10.1097/00000441-191108000-00079.

Salemi, M., Lamers, S. L., Yu, S., de Oliveira, T., Fitch, W. M. & McGrath, M. S. (2005) Phylodynamic analysis of human immunodeficiency virus type 1 in distinct brain compartments provides a model for the neuropathogenesis of AIDS. Journal of Virology. 79 (17), 11343–11352. Available from: 10.1128/jvi.79.17.11343-11352.2005.

Sambrook, J. & Russell, D. W. (2000) Molecular cloning: A laboratory manual. 3rd ed. United States, Cold Spring Harbor Laboratory Press.

Samuelson, L. C., Wiebauer, K., Gumucio, D. L. & Meisler, M. H. (1988). Expression of the human amylase genes: Recent origin of a salivary amylase promoter from an actin pseudogene. Nucleic Acids Research. 16(17), 8261–8276. Available from: 10.1093/nar/16.17.8261.

Sanjuán, R. (2012). From molecular genetics to phylodynamics: Evolutionary relevance of mutation rates across viruses. PLoS Pathogens. 8(5), e1002685. Available from: 10.1371/journal.ppat.1002685.

Sarkar, N. H., Golovkina, T. & Uz-Zaman, T. (2004). RIII/Sa mice with a high incidence of mammary tumors express two exogenous strains and one potential endogenous strain of mouse mammary tumor virus. Journal of Virology. 78(2), 1055–1062. Available from: 10.1128/JVI.01991-07.

Sattentau, Q. J. (2010). Cell-to-cell spread of retroviruses. Viruses. 2, 1306–1321. Available from: 10.3390/v2061306.

Schartl, M., Walter, R. B., Shen, Y., Garcia, T., Catchen, J., Amores, A., Braasch, I., Chalopin, D., Volff, J.-N., Lesch, K.-P., Bisazza, A., Minx, P., Hillier, L., Wilson, R.K., Fuerstenberg, S., Boore, J., Searle, S., Postlethwait, J.H. & Warren, W. C. (2013). The genome of the platyfish, Xiphophorus maculatus, provides insights into evolutionary adaptation and several complex traits. Nature Genetics. 45(5), 567–572. Available from: 10.1038/ng.2604.

Schröder, A. R. W., Shinn, P., Chen, H., Berry, C., Ecker, J. R. & Bushman, F. (2002) HIV-1 integration in the human genome favors active genes and local hotspots. Cell. 110 (4), 521–529. Available from: 10.1016/s0092-8674(02)00864-4.

Schwartz, D. E., Tizard, R. & Gilbert, W. (1983). Nucleotide sequence of Rous sarcoma virus. Cell. 32, 853–869. Available from: 10.1016/0092-8674(83)90071-5.

Schweizer, M. & Neumann-Haefelin, D. (1995). Phylogenetic analysis of primate foamy viruses by comparison of pol sequences. Virology. 207(2), 577–582. Available from: 10.1006/viro.1995.1120.

Schweizer, M., Schleer, H., Pietrek, M., Liegibel, J., Falcone, V. & Neumann-Haefelin, D. (1999). Genetic stability of foamy viruses: long-term study in an African green monkey population. Journal of Virology. 73(11), 9256–9265. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10516034.

Scolnick, E. M., Aaronson, S. A., Todaro, G. J. & Parks, W. P. (1971) RNA dependent DNA polymerase activity in mammalian cells. Nature. 229 (5283), 318–321. Available from: 10.1038/229318a0.

Segal, Y., Peissel, B., Renieri, A., de Marchi, M., Ballabio, A., Pei, Y. & Zhou, J. (1999). LINE-1 elements at the sites of molecular rearrangements in Alport syndrome-diffuse leiomyomatosis. American Journal of Human Genetics. 64(1), 62–69. Available from: 10.1086/302213.

Seiki, M., Inoue, J., Takeda, T. & Yoshida, M. (1986). Direct evidence that p40x of human T-cell leukemia virus type I is a trans-acting transcriptional activator. The EMBO Journal. 5(3), 561–565. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166799/.

Seperack, P. K., Strobel, M. C., Corrow, D. J., Jenkins, N. A. & Copeland, N. G. (1988). Somatic and germ-line reverse mutation rates of the retrovirus-induced dilute coat-color mutation of DBA mice. Proceedings of the National Academy of Sciences of the United States of America. 85(1), 189–192. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC279509/.

Shah, C., Böni, J., Huder, J. B., Vogt, H. R., Mühlherr, J., Zanoni, R., Miserez, R., Lutz, H. & Schüpbach, J. (2004a). Phylogenetic analysis and reclassification of caprine and ovine lentiviruses based on 104 new isolates: Evidence for regular sheep-to-goat transmission and worldwide propagation through livestock trade. Virology. 319, 12–26. Available from: 10.1016/j.virol.2003.09.047.

Shah, C., Huder, J. B., Böni, J., Schönmann, M., Mühlherr, J., Lutz, H. & Schüpbach, J. (2004b). Direct evidence for natural transmission of small-ruminant lentiviruses of subtype A4 from goats to sheep and vice versa. Journal of Virology. 78(14), 7518–7522. Available from: 10.1128/JVI.78.14.7518-7522.2004.

Sharma, S., Miyanohara, A. & Friedmann, T. (2000). Separable mechanisms of attachment and cell uptake during retrovirus infection. Journal of Virology. 74(22), 10790–10795. Available from: 10.1128/JVI.74.22.10790-10795.2000.

Sharp, P., Shaw, G. & Hahn, B. (2005). Simian Immunodeficiency Virus Infection of Chimpanzees. Journal of Virology. 79(7), 3891–3902. Available from: 10.1128/JVI.79.7.3891.

Shen, C. & Steiner, L. A. (2004). Genome structure and thymic expression of an endogenous retrovirus in zebrafish. Journal of Virology. 78(2), 899–911. Available from: 10.1128/JVI.78.2.899.

Shinnick, T., Lerner, R. & Sutcliffe, J. G. (1981). Nucleotide Sequence of Moloney Murine Leukemia Virus. Nature. 293, 543–548. Available from: 10.1038/293543a0.

Shoji, S., Parmelee, D. C., Wade, R. D., Kumar, S., Ericsson, L. H., Walsh, K. A., Neurath, H., Long, G.L., Demaille, J.G., Fischer, E.H. & Titani, K. (1981). Complete amino acid sequence of the catalytic subunit of bovine cardiac muscle cyclic AMP-dependent protein kinase. Proceedings of the National Academy of Sciences of the United States of America. 78(2), 848–851. Available from: 10.1073/pnas.78.2.848.

Simmons, G., Young, P., Hanger, J., Jones, K., Clarke, D., Mckee, J. & Meers, J. (2012). Prevalence of koala retrovirus in geographically diverse populations in Australia. Australian Veterinary Journal. 90(10), 404–409. Available from: 10.1111/j.1751-0813.2012.00964.x.

Singh, M., Berger, B. & Kim, P. S. (1999). LearnCoil-VMF: computational evidence for coiled-coil-like motifs in many viral membrane-fusion proteins. Journal of Molecular Biology. 290(5), 1031–1041. Available from: 10.1006/jmbi.1999.2796.

Smith, J. J., Kuraku, S., Holt, C., Sauka-Spengler, T., Jiang, N., Campbell, M. S., Yandell, M. D., Manousaki, T., Meyer, A., Bloom, O. E., Morgan, J. R., Buxbaum, J. D., Sachidanandam, R., Sims, C., Garruss, A. S., Cook, M., Krumlauf, R., Wiedemann, L. M., Sower, S. A., Decatur, W. A., Hall, J. A., Amemiya, C. T., Saha, N. R., Buckley, K. M., Rast, J. P., Das, S., Hirano, M., McCurley, N., Guo, P., Rohner, N., Tabin, C. J., Piccinelli, P., Elgar, G., Ruffier, M., Aken, B. L., Searle, S. M. J., Muffato, M., Pignatelli, M., Herrero, J., Jones, M., Brown, C. T., Chung-Davidson, Y. W., Nanlohy, K. G., Libants, S. V., Yeh, C. Y., McCauley, D. W., Langeland, J. A., Pancer, Z., Fritzsch, B., de Jong, P. J., Zhu, B., Fulton, L. L., Theising, B., Flicek, P., Bronner, M. E., Warren, W. C., Clifton, S. W., Wilson, R. K. & Li, W. (2013). Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics. 45(4), 415–21, 421e1–2. Available from: 10.1038/ng.2568.

Song, S. U., Gerasimova, T., Kurkulos, M., Boeke, J. D. & Corces, V. G. (1994). An env-like protein encoded by a Drosophila retroelement: Evidence that gypsy is an infectious retrovirus. Genes and Development. 8(17), 2046–2057. Available from: 10.1101/gad.8.17.2046.

Sperber, G. O., Airola, T., Jern, P. & Blomberg, J. (2007). Automated recognition of retroviral sequences in genomic data - RetroTector©. Nucleic Acids Research. 35(15), 4964–4976. Available from: 10.1093/nar/gkm515.

Stephens, R. M., Rice, N. R., Hiebsch, R. R., Bose, H. R. & Gilden, R. V. (1983). Nucleotide sequence of v-rel: the oncogene of reticuloendotheliosis virus. Proceedings of the National Academy of Sciences of the United States of America. 80(20), 6229–6233. Available from: 10.1073/pnas.80.20.6229.

Stewart, M., Warnock, M., Wheeler, A., Wilkie, N., Mullins, J., Onions, D. & Neil, J. (1986). Nucleotide sequences of a feline leukemia virus subgroup A envelope gene and long terminal repeat and evidence for the recombinational origin of subgroup B viruses. Journal of Virology. 58(3), 825–834. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC252989/.

Stoye, J. P. (2001). Endogenous retroviruses: Still active after all these years? Current Biology. 11(22), 914–916. Available from: 10.1016/s0960-9822(01)00553-x.

Stoye, J. P. (2012). Studies of endogenous retroviruses reveal a continuing evolutionary saga. Nature Reviews Microbiology. 10(6), 395–406. Available from: 10.1038/nrmicro2783.

Summers, M. F., Henderson, L. E., Chance, M. R., South, T. L., Blake, P. R., Perez-Alvarado, G., Bess, J. W., Sowder, R. C., Arthur, L. O., Sagi, I. & Hare, D. R. (1992). Nucleocapsid zinc fingers detected in retroviruses: EXAFS studies of intact viruses and the solution-state structure of the nucleocapsid protein from HIV-1. Protein Science. 1 (5), 563–574. Available from: 10.1002/pro.5560010502.

Sun, C., Skaletsky, H., Rozen, S., Gromoll, J., Nieschlag, E., Oates, R. & Page, D. C. (2000) Deletion of azoospermia factor a (AZFa) region of human Y chromosome caused by recombination between HERV15 proviruses. Human Molecular Genetics. 9 (15), 2291–2296.

Sverdlov, E. D. (1998). Perpetually mobile footprints of ancient infections in human genome. FEBS Letters. 428 (1-2), 1–6. Available from: 10.1016/s0014-5793(98)00478-5.

Sverdlov, E. D. (2000). Retroviruses and primate evolution. BioEssays. 22(2), 161–171. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10655035.

Swanstrom, R. & Wills, J. W. (1997). Synthesis, Assembly, and Processing of Viral Proteins. In: Coffin, J. M., Hughes, S. H. & Varmus, H. E. (eds.) Retroviruses. United States, Cold Spring Harbor Laboratory Press, 263–334. Available from: http://www.ncbi.nlm.nih.gov/books/NBK19376/.

Takeya, T. & Hanafusa, H. (1983) Structure and sequence of the cellular gene homologous to the RSV src gene and the mechanism for generating the transforming virus. Cell. 32 (3), 881–890. Available from: 10.1016/0092-8674(83)90073-9.

Takeya, T., Hanafusa, H., Junghans, R. P., Ju, G. & Skalka, A. M. (1981) Comparison between the viral transforming gene (src) of recovered avian sarcoma virus and its cellular homolog. Molecular and Cellular Biology. 1 (11), 1024–1037. Available from: 10.1128/mcb.1.11.1024.

Taylor, D. J., Dittmar, K., Ballinger, M. J. & Bruenn, J. A. (2011) Evolutionary maintenance of filovirus-like genes in bat genomes. BMC Evolutionary Biology. 11 (1), 336. Available from: 10.1186/1471-2148-11-336.

Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. (2013). MEGA6: Molecular evolutionary genetics analysis version 6.0. Molecular Biology and Evolution. 30(12), 2725–2729. Available from: 10.1093/molbev/mst197.

Tarlinton, R. (2005) Real-time reverse transcriptase PCR for the endogenous koala retrovirus reveals an association between plasma viral load and neoplastic disease in koalas. Journal of General Virology. 86 (3), 783–787. Available from: 10.1099/vir.0.80547-0.

Tarlinton, R. E., Meers, J. & Young, P. R. (2006). Retroviral invasion of the koala genome. Nature, 442(7098), 79–81. Available from: 10.1038/nature04841.

Tasumi, S., Velikovsky, C. A., Xu, G., Gai, S. A., Wittrup, K. D., Flajnik, M. F., Mariuzza, R.A. & Pancer, Z. (2009). High-affinity lamprey VLRA and VLRB monoclonal antibodies. Proceedings of the National Academy of Sciences of the United States of America. 106(31), 12891–12896. Available from: 10.1073/pnas.0904443106.

Temtamy, S. A., Aglan, M. S., Valencia, M., Cocchi, G., Pacheco, M., Ashour, A. M., Amr, K. S., Helmy, S. M. H., El-Gammal, M. A., Wright, M., Lapunzina, P., Goodship, J. A. & Ruiz-Perez, V. L. (2008) Long interspersed nuclear element-1 (LINE1)-mediated deletion of EVC, EVC2, C4orf6, and STK32B in Ellis–van Creveld syndrome with borderline intelligence. Human Mutation. 29 (7), 931–938. Available from: 10.1002/humu.20778.

Terzian, C., Pélisson, A. & Bucheton, A. (2001). Evolution and phylogeny of insect endogenous retroviruses. BMC Evolutionary Biology. 1, 3. Available from: 10.1186/1471-2148-1-3.

Tipper, C. H., Bencsics, C. E. & Coffin, J. M. (2005). Characterization of hortulanus endogenous murine leukemia virus, an endogenous provirus that encodes an infectious murine leukemia virus of a novel subgroup. Journal of Virology. 79(13), 8316–8329. Available from: 10.1128/JVI.79.13.8316-8329.2005.

Tomonaga, K. & Coffin, J. M. (1999). Structures of endogenous nonecotropic murine leukemia virus (MLV) long terminal repeats in wild mice: implication for evolution of MLVs. Journal of Virology. 73(5), 4327–4340. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC104214/.

Tristem, M. (2000). Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. Journal of Virology. 74(8), 3715–3730. Available from: 10.1128/JVI.74.8.3715-3730.2000.

Tristem, M., Herniou, E., Summers, K. & Cook, J. (1996). Three retroviral sequences in amphibians are distinct from those in mammals and birds. Journal of Virology. 70(7), 4864–4870. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC190434/.

Tristem, M., Kabat, P., Lieberman, L., Linde, S., Karpas, A. & Hill, F. (1996). Characterization of a novel murine leukemia virus-related subgroup within mammals. Journal of Virology. 70(11), 8241–8246. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC190910/.

Ugolini, S., Mondor, I. & Sattentau, Q. J. (1999). HIV-1 attachment: another look. Trends in Microbiology. 7(4), 144–149. Available from: 10.1016/S0966-842X(99)01474-2.

Van de Lagemaat, L. N., Landry, J. R., Mager, D. L. & Medstrand, P. (2003). Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends in Genetics. 19(10), 530–536. Available from: 10.1016/j.tig.2003.08.004.

Van der Kuyl, A. C., Dekker, J. T. & Goudsmit, J. (1999). Discovery of a new endogenous type C retrovirus (FcEV) in cats: evidence for RD-114 being an FcEVGag-Pol/baboon endogenous virus BaEVEnv recombinant. Journal of Virology. 73(10), 7994–8002. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC112814/.

Van der Kuyl, A. C., Dekker, J. T. & Goudsmit, J. (1995). Distribution of baboon endogenous virus among species of African monkeys suggests multiple ancient cross-species transmissions in shared habitats. Journal of Virology. 69(12), 7877–7887. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC189732/.

Van Regenmortel, M. H. V., Fauquet, C. M., Bishop, D. H. L., Carstens, E. B., Estes, M. K., Lemon, S. M., Maniloff, J., Mayo, M. A., McGeoch, D. J., Pringle, C. R. & Wickner, R. B. (eds) (2000). Virus Taxonomy. Seventh Report of the International Committee on Taxonomy of Viruses. San Diego, Academic Press.

Venkatesh, B., Lee, A. P., Ravi, V., Maurya, A. K., Lian, M. M., Swann, J. B., Ohta, Y., Flajnik, M.F., Sutoh, Y., Kasahara, M., Hoon, S., Gangu, V., Roy, S.W., Irimia, M., Korzh, V., Kondrychyn, I., Lim, Z.W., Tay, B.-H., Tohari, S., Kong, K.W., Ho, S., Lorente-Galdos, B., Quilez, J., Marques-Bonet, T., Raney, B.J., Ingham, P.W., Tay, A., Hillier, L.W., Minx, P., Boehm, T., Wilson, R.K., Brenner, S. & Warren, W. C. (2014). Elephant shark genome provides unique insights into gnathostome evolution. Nature. 505(7482), 174–179. Available from: 10.1038/nature12826.

Vernochet, C., Heidmann, O., Dupressoir, A., Cornelis, G., Dessen, P., Catzeflis, F. & Heidmann, T. (2011) A syncytin-like endogenous retrovirus envelope gene of the guinea pig specifically expressed in the placenta junctional zone and conserved in Caviomorpha. Placenta. 32 (11), 885–892. Available from: 10.1016/j.placenta.2011.08.006.

Vink, C., Van Gent, D. C., Elgersma, Y. & Plasterk, R. H. (1991). Human immunodeficiency virus integrase protein requires a subterminal position of its viral DNA recognition sequence for efficient cleavage. Journal of Virology. 65(9), 4636–4644. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC248918/.

Voevodin, A., Samilchuk, E., Schätzl, H., Boeri, E. & Franchini, G. (1996). Interspecies transmission of macaque simian T-cell leukemia/lymphoma virus type 1 in baboons resulted in an outbreak of malignant lymphoma. Journal of Virology. 70(3), 1633–1639. Available from: http://www.ncbi.nlm.nih.gov/pubmed/8627684.

Vogt, P. K. (1997). Historical introduction to the general properties of retroviruses. In: Coffin, J. M., Hughes, S. H. & Varmus, H. E. (eds.) Retroviruses. United States, Cold Spring Harbor Laboratory Press, 1–25. Available from: http://www.ncbi.nlm.nih.gov/books/NBK19376/.

Wain-Hobson, S. (1994). Is antigenic variation of HIV important for AIDS, and what might be expected in the future. In: Morse, S. S. (ed.), The evolutionary biology of viruses. New York, Lippincott Williams and Wilkins, 185–209.

Walker, R. (1969). Virus associated with epidermal hyperplasia in fish. National Cancer Institute Monograph. 31, 195–207. Available from: http://www.ncbi.nlm.nih.gov/pubmed/5393702.

Wang, T. H., Donaldson, Y. K., Brettle, R. P., Bell, J. E. & Simmonds, P. (2001) Identification of shared populations of human immunodeficiency virus type 1 infecting microglia and tissue macrophages outside the central nervous system. Journal of Virology. 75 (23), 11686–11699. Available from: 10.1128/jvi.75.23.11686-11699.2001.

Wang, L., Yin, Q., He, G., Rossiter, S. J., Holmes, E. C. & Cui, J. (2013). Ancient invasion of an extinct gammaretrovirus in cetaceans. Virology. 441(1), 66–69. Available from: 10.1016/j.virol.2013.03.006.

Waterston, R. H. & Mouse Genome Sequencing Consortium (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915), 520–562. Available from: 10.1038/nature01262.

Werner, T., Brack-Werner, R., Leib-Mosch, C., Backhaus, H., Erfle, V. & Hehlmann, R. (1990). S71 is a phylogenetically distinct human endogenous retroviral element with structural and sequence homology to simian sarcoma virus (SSV). Virology. 174(1), 225–238. Available from: 10.1016/0042-6822(90)90071-x.

Wessler, S. R. (1998). Transposable elements associated with normal plant genes. Physiologia Plantarum. 103(4), 581–586. Available from: 10.1034/j.1399-3054.1998.1030418.x.

Wilhelmsen, K. C. & Temin, H. M. (1984). Structure and dimorphism of c-rel (turkey), the cellular homolog to the oncogene of reticuloendotheliosis virus strain T. Journal of Virology. 49(2), 521–529. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC255493/.

Wills, J. W. & Craven, R. C. (1991). Form, function, and use of retroviral Gag protein. AIDS. 5(6), 639–654. Available from: 10.1097/00002030-199106000-00002.

Withers-Ward, E. S., Kitamura, Y., Barnes, J. P. & Coffin, J. M. (1994). Distribution of targets for avian retrovirus DNA integration in vivo. Genes and Development. 8(12), 1473–1487. Available from: 10.1101/gad.8.12.1473.

Wolgamot, G., Bonham, L. & Miller, A. D. (1998). Sequence analysis of Mus dunni endogenous virus reveals a hybrid VL30/gibbon ape leukemia virus-like structure and a distinct envelope. Journal of Virology. 72(9), 7459–7466. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC109979/.

Woolhouse, M. E. J., Haydon, D. T. & Antia, R. (2005). Emerging pathogens: The epidemiology and evolution of species jumps. Trends in Ecology and Evolution. 20(5), 238–244. Available from: 10.1016/j.tree.2005.02.009.

Wu, X., Li, Y., Crise, B. & Burgess, S. M. (2003). Transcription start regions in the human genome are favored targets for MLV integration. Science. 300(5626), 1749–1751. Available from: 10.1126/science.1083413.

Xiong, Y. & Eickbush, T. H. (1990). Origin and evolution of retroelements based upon their reverse transcriptase sequences. The EMBO Journal. 9(10), 3353–3362. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1698615.

Xu, Z. & Wang, H. (2007). LTR-FINDER: An efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research. 35(Web Server), W265–W268. Available from: 10.1093/nar/gkm286.

Yohn, C. T., Jiang, Z., McGrath, S. D., Hayden, K. E., Khaitovich, P., Johnson, M. E., Eichler, M. Y., McPherson, J. D., Zhao, S., Pääbo, S. & Eichler, E. E. (2005). Lineage-specific expansions of retroviral insertions within the genomes of African great apes but not humans and orangutans. PLoS Biology. 3(4), e110. Available from: 10.1371/journal.pbio.0030110.

York, D. F., Vigne, R., Verwoerd, D. W. & Querat, G. (1992). Nucleotide sequence of the jaagsiekte retrovirus, an exogenous and endogenous type D and B retrovirus of sheep and goats. Journal of Virology. 66(8), 4930–4939. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC241337/.

Yoshinaka, Y., Katoh, I., Copeland, T. D. & Oroszlan, S. (1985a). Murine leukemia virus protease is encoded by the gag-pol gene and is synthesized through suppression of an amber termination codon. Proceedings of the National Academy of Sciences of the United States of America. 82(6), 1618–1622. Available from: 10.1073/pnas.82.6.1618.

Yoshinaka, Y., Katoh, I., Copeland, T. D. & Oroszlan, S. (1985b). Translational readthrough of an amber termination codon during synthesis of feline leukemia virus protease. Journal of Virology. 55(3), 870–873. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC255078/.

Zhu, T., Korber, B. T., Nahmias, A. J., Hooper, E., Sharp, P. M. & Ho, D. D. (1998). An African HIV-1 sequence from 1959 and implications for the origin of the epidemic. Nature. 391(6667), 594–597. Available from: 10.1038/35400.
