
The Pennsylvania State University The Graduate School Intercollege Graduate Program in Molecular, Cellular, and Integrative Biosciences

MOLECULAR AND EVOLUTIONARY ANALYSES OF

TRANSCRIPTIONALLY ACTIVE ENDOGENOUS

RETROVIRUSES IN MULE DEER

A Dissertation in Molecular, Cellular, and Integrative Biosciences by Theodora Alexis Kaiser

© 2018 Theodora Alexis Kaiser

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2018

The dissertation of Theodora Alexis Kaiser was reviewed and approved∗ by the following:

Mary Poss Professor of Biology and Veterinary and Biomedical Sciences Dissertation Advisor Chair of Committee

Cooduvalli S. Shashikant Associate Professor of Molecular and Developmental Biology Co-Director, Bioinformatics and Genomics Graduate Program

Michael Axtell Professor of Biology

Le Bao Associate Professor of Statistics

Melissa Rolls Associate Professor of Biochemistry and Molecular Biology Chair, Molecular, Cellular, and Integrative Biosciences Graduate Program

∗Signatures are on file in the Graduate School.

Abstract

Endogenous retroviruses (ERVs) are genetic elements originally acquired by infection of a retrovirus in a germ cell, but are subsequently inherited from parent to offspring. ERVs contribute to essential host processes, but can also negatively impact gene expression and are often silenced. Many species contain insertionally polymorphic ERVs, which are variably present among individuals. Thus, ERV insertional polymorphism could result in diversity within a population or species, and ERV expression in this context has been largely unexplored. We investigated transcriptionally active Cervid Endogenous Retroviruses (CrERV) in mule deer, which contain insertionally polymorphic ERVs of different lineages that have been acquired sequentially throughout species evolution, and asked if transcriptional activity was due to proximity to genes, recent integration into the genome, or a functional impact on gene expression. We first evaluated sequence and insertion diversity of CrERV in a Montana mule deer population. Next, we evaluated CrERV expression in this population, and identified four transcriptionally active CrERV. These CrERV were close to genes, but we showed that CrERV can also be silenced close to an expressed host gene. Unexpectedly, transcriptionally active CrERV were phylogenetically older, except one CrERV that has recently expanded within the genome. Transcriptionally active CrERV were widespread in the population, present in the provirus/solo-LTR configuration in all animals, and some CrERV impacted host gene splicing. Finally, we demonstrated that CrERV expression levels differed among populations, and there was higher CrERV expression in animals from a prion disease-endemic region. We also showed that CrERV expression levels correlated with gene expression levels among populations, further supporting an effect of some CrERV on proximal gene expression. In conclusion, we showed that transcriptionally active CrERV were close to genes, and some impact proximal host gene expression. All transcriptionally active CrERV have been maintained in the population in the provirus/solo-LTR configuration. These results suggest that the maintenance of transcriptionally active CrERV involves more than the co-option of CrERV LTRs as regulatory elements for host genes. We propose that the CrERV RNA is beneficial to the host, and suggest future investigations of CrERV transcripts as long non-coding RNAs.

Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1 Introduction
  1.1 Retroviruses
    1.1.1 Retroviral Genome
    1.1.2 Reverse Transcription and Integration
  1.2 Endogenous Retroviruses
    1.2.1 Establishment in Host Genomes
    1.2.2 ERV Distribution Within the Genome
    1.2.3 ERV Classification and Distribution
    1.2.4 ERV Expansion in Host Genome
    1.2.5 Other Retrotransposons in Host Genomes
  1.3 Impacts of ERVs
    1.3.1 Genomic Structural Variations
    1.3.2 Insertional Mutagenesis
    1.3.3 Viral Proteins
    1.3.4 Disease Associations
  1.4 Host Control of ERVs
    1.4.1 Epigenetic Regulation
    1.4.2 Small RNAs
    1.4.3 KAP1 and KRAB-ZFPs
    1.4.4 Restriction Factors
    1.4.5 TLR Signaling
  1.5 Functional Contributions of ERVs
    1.5.1 ERV LTRs as Regulatory Elements for Host Genes
    1.5.2 ERV RNAs as long non-coding RNAs (lncRNAs)
    1.5.3 ERV Proteins
  1.6 Transcriptional Activation of ERVs
    1.6.1 Position in Genome
    1.6.2 Activation by Environmental Stimuli
  1.7 ERV Identification in Genomes
    1.7.1 Previous Methods
    1.7.2 Problems Identifying ERVs from Short-read Sequence Data
    1.7.3 Existing Solutions
  1.8 Mule Deer and Cervid Endogenous Retrovirus (CrERV)
    1.8.1 Mule Deer
    1.8.2 Identification of Novel , CrERV
  1.9 Dissertation Objectives

Chapter 2 Characterizing polymorphic CrERV in the Montana mule deer population
  2.1 Introduction
  2.2 Materials and Methods
  2.3 Results
  2.4 Discussion

Chapter 3 Evolutionary implications of endogenous retrovirus expression on host genome evolution
  3.1 Introduction
  3.2 Materials and Methods
  3.3 Results
  3.4 Discussion

Chapter 4 Comparison of transcriptionally active CrERV in mule deer from Montana and Wyoming
  4.1 Introduction
  4.2 Materials and Methods
  4.3 Results
  4.4 Discussion

Chapter 5 Discussion and Future Directions

Bibliography

List of Figures

1.1 Retrovirus genome structure.
1.2 Effects of ERVs on genome function.
1.3 Summary of ERV silencing mechanisms.

2.1 CrERV integrations per animal in Montana.
2.2 CrERV prevalence among animals in Montana.
2.3 The number of singletons varies with number of Total CrERV integrations.
2.4 PCA and map displaying geographical location based on latitude and longitude of kill-location of Montana (MT), Oregon (OR), and Wyoming (WY) animals.
2.5 Phylogeny of representative full-length CrERV from M273.
2.6 Distribution of CrERV identified in M273 throughout the Montana population.
2.7 Proportion of WY/OR animals that have each M273 CrERV.
2.8 Proportion of WY/OR animals that have each CrERV from M273 found in 6 or less Montana animals.
2.9 Prevalence in Montana of CrERV that are not found in M273.

3.1 Methylation patterns of CrERV loci.
3.2 Schematic of approach to amplify CrERV spliced env transcripts.
3.3 CrERV loci in M273 that produce spliced env transcripts.
3.4 RNAseq Read coverage of transcriptionally active CrERV in M273.
3.5 Lineage-specific CrERV quantification.
3.6 Results of PCR for individual CrERV loci.
3.7 Orientation of transcriptionally active CrERV with respect to genes.
3.8 Schematic of KXD1 transcript isoforms.
3.9 KXD1 gene expression levels between animals with and without S26536.
3.10 Schematic of SIRT6 transcript isoforms.
3.11 Alignment of S386 LTRs and LTR-driven SIRT6 transcript.
3.12 SIRT6 gene expression levels between animals with and without S386.
3.13 ISY1 transcript structure.
3.14 Presence of S3442n integration alters FBXO42 Exon 1 usage patterns.
3.15 Splicing patterns of S3442 and FBXO42.

4.1 PCA of shared CrERV.
4.2 Map showing geographic locations of animals from Montana, Oregon, and Wyoming based on latitude and longitude of kill-location.
4.3 Distribution of CrERV per Animal in Montana and Wyoming.
4.4 CrERV prevalence among animals in Montana and Wyoming.
4.5 The number of singletons varies with number of total CrERV.
4.6 The proportion of Montana vs Wyoming animals that contain each M273 CrERV.
4.7 Lineage-specific CrERV expression in Montana and Wyoming animals.
4.8 Overall CrERV expression levels differ between Montana and Wyoming.
4.9 Overall CrERV expression levels differ between Montana and two Wyoming populations.
4.10 Lineage-specific CrERV expression levels differ between Montana and Wyoming.
4.11 Lineage-specific CrERV expression levels differ between Montana and two Wyoming populations.
4.12 KXD1 gene expression levels between Wyoming animals with and without S26536.
4.13 Total KXD1 and canonical KXD1 gene expression between populations.
4.14 LTR-driven KXD1 gene expression between populations.
4.15 SIRT6 gene expression levels between Wyoming animals with and without S386.
4.16 LTR-driven SIRT6 gene expression between populations.
4.17 Total SIRT6 gene expression between populations.
4.18 Presence of S3442n integration alters FBXO42 Exon 1 usage patterns in Wyoming.
4.19 FBXO42 Exon 1B-exon 2 expression between populations.
4.20 FBXO42 Exon 1C-exon 2 expression between populations.

List of Tables

3.1 Primers used for spliced env amplification and cloning.
3.2 Primers used for CrERV locus-specific PCR.
3.3 Primers used for CrERV qPCR.
3.4 Primers used for Gene qPCR.
3.5 Summary of Bisulfite Analysis of 50 LTRs in M273.
3.6 Expression of genes that have a CrERV integration within 10kb.

4.1 Summary of FBXO42-S3442 PCR results.

Acknowledgments

I would like to first thank my advisor, Mary Poss, for accepting me into her lab and teaching me to become the scientist that I am today. Aside from endless support during experiments, analyses, and writing, she encouraged me to explore non-academic career options that better aligned with my skills and interests. I am forever grateful for her kindness, encouragement, and appreciation of good dark chocolate. I also thank my committee members, Cooduvalli Shashikant, Mike Axtell, and Le Bao, for many helpful discussions during my comprehensive exams and committee meetings that have improved this body of work tremendously. Thank you also to the many members of the Poss Lab that provided not only computational and scientific help, but also friendship over the past 6 years. A special thank you to Dr. Lei Yang, Dr. Raunaq Malhotra, and Dr. Yang Liu, for invaluable help with bioinformatics and statistics for my research. I am also grateful to the Poss Lab undergraduates, Brian Huylebroeck, Emily Sproul, Lan Nguyen, Haley Krouse, Morgan Ferrell, Sam Frable, Kaleb Bogale, and Steph Williams, who not only helped with day-to-day experiments but also helped me become a better teacher and mentor. I would also like to acknowledge and thank the CBIOS program for providing financial support to attend conferences and complete my research, and for the additional training in science communication and outreach. I am also grateful to have had the opportunity to complete the ECoS IP/Tech Transfer Internship, which introduced me to what I hope will be my future career. Thank you to Becky Johnson and Merisa Nisic, for sitting next to me during orientation and becoming two of my best (coffee-loving) friends. Thank you also to my colleagues in MSC for keeping me company and helping me stay sane during long days and nights in the lab. Last, but certainly not least, I would like to thank my entire family, especially my mom, Helen, my brother, Tim, and my husband, Sean. 
Thank you for picking me up every time I fell down, for helping me celebrate my wins, and for supporting me every step of the way through my graduate school journey. I could not have done this without you.

Chapter 1 | Introduction

Endogenous retroviruses (ERVs) are genetic elements that were originally acquired by infection of a retrovirus in a germ cell. The retrovirus integrated into the germ cell genome, as per the normal retrovirus life cycle, and is now inherited from parent to offspring like any other host gene. ERVs may contribute to variation within host genomes, either via the ERVs themselves or through recombination and other chromosomal rearrangements. Additionally, ERVs may contribute to the regulation of host gene expression via their LTRs or via retroviral RNA or proteins. Many of these processes have been discovered in species with ancient ERV integrations that are largely fixed in host populations. Genome sequencing, however, has revealed that many organisms have varying degrees of ERV insertional polymorphism. Given that ERVs have the potential to impact host processes, this polymorphism could result in genomic and phenotypic diversity among individuals in a population or species. It is currently unclear whether ERVs impact host gene expression and other essential host processes during host colonization, or whether these effects evolve over time. For this reason, it is important to expand studies of transcriptionally active ERVs to non-model organisms with more recently acquired ERVs to investigate their impact on the host genome.

1.1 Retroviruses

Retroviruses are a large and diverse family of single-stranded positive-sense RNA viruses that are found in all vertebrate hosts. Retrovirus particles contain a dimer of positive-sense single-stranded RNA, enclosed first in a capsid and then in a lipid bilayer envelope [1,2].

1.1.1 Retroviral Genome

The retrovirus RNA genome (Figure 1.1, a) is typically 7-11 kb in size and consists of the coding regions for the genes gag, pro, pol, and env, flanked by Long Terminal Repeats (LTRs), which are non-coding and important for viral replication. Prior to integration into the host genome, the 5' LTR and the 3' LTR are not identical. The 5' LTR consists of the repeat region (R) and a unique 5' region (U5). The 3' LTR consists of the unique 3' region (U3) and the R region, which is used during reverse transcription to generate an integrated provirus with identical LTRs. In addition to containing the retroviral promoter, TATA-box, and GC/GT-box, retroviral LTRs also contain transcription factor binding sites and the polyadenylation signal necessary for retrovirus transcription and replication. Immediately downstream of the U5 region in the 5' LTR is the primer binding site (PBS), which is complementary to the 3' sequence of a host tRNA used as a primer for reverse transcription and which differs among retroviruses [3]. The group-specific antigen (gag) gene encodes the Gag polyprotein, which is proteolytically processed into the retroviral matrix (MA), capsid (CA), and nucleocapsid (NC) proteins. The viral polymerase (pol) gene encodes the reverse transcriptase (RT) and integrase (IN) enzymes, which are responsible for the reverse transcription of the viral RNA genome into double-stranded DNA and the subsequent integration of the retroviral DNA into the host cell's genome, a necessary step for retrovirus replication. The pro gene encodes the viral protease, which proteolytically processes the proteins encoded by the gag, pro, pol, and env genes. The envelope (env) gene encodes the surface (SU) glycoprotein and the transmembrane (TM) protein. During retrovirus infection, the SU binds to a specific cellular receptor on the surface of the target cell, while the TM triggers the fusion of the viral membrane with the plasma membrane of the target cell, allowing the viral nucleocapsid to be released into the cytoplasm. The envelope protein is translated from a spliced transcript, typically produced from a splice donor (SD) at the 5' end of the genome and a splice acceptor (SA) just prior to the env gene (Figure 1.1, b). Some retroviruses also carry accessory genes, which can regulate and coordinate viral gene expression, and are found either between pol and env, just downstream of env, or overlapping portions of env [1].
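As a minimal illustration of how the LTRs of the RNA genome relate to those of the provirus, the following Python sketch rearranges the region labels used above. The function name provirus_layout and the list representation are assumptions made for illustration only, the placement of the PPT immediately upstream of U3 follows Figure 1.1, and the sketch reproduces the outcome of reverse transcription rather than its mechanism.

```python
# Regions of the genomic RNA in 5'->3' order (labels follow the text and
# Figure 1.1; the list itself is an illustrative assumption).
RNA_GENOME = ["R", "U5", "PBS", "gag", "pro", "pol", "env", "PPT", "U3", "R"]


def provirus_layout(rna_regions):
    """Return the region order of the integrated provirus (hypothetical helper).

    Reverse transcription duplicates U3 at the 5' end and U5 at the 3' end,
    so both LTRs of the provirus read U3-R-U5.
    """
    if rna_regions[:2] != ["R", "U5"] or rna_regions[-2:] != ["U3", "R"]:
        raise ValueError("expected an RNA genome of the form R-U5 ... U3-R")
    internal = rna_regions[2:-2]   # everything between the two RNA termini
    ltr = ["U3", "R", "U5"]        # complete LTR as found in proviral DNA
    return ltr + internal + ltr


layout = provirus_layout(RNA_GENOME)
print("-".join(layout))
```

Under these assumptions the script prints U3-R-U5-PBS-gag-pro-pol-env-PPT-U3-R-U5, which matches the provirus layout in Figure 1.1, c.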

[Figure 1.1 panels: (a) Viral RNA transcript; (b) Viral spliced env transcript; (c) Integrated DNA provirus, flanked by 4-6 bp target site duplications (TSDs); (d) Solo-LTR conversion.]

Figure 1.1: Retrovirus genome structure.

(a) A generalized retroviral RNA transcript. (b) The spliced env mRNA is generated from the primary transcript using the virus splice donor (SD) and splice acceptor (SA) sites. (c) An integrated DNA provirus. (d) Endogenous retrovirus solo LTR conversion.

1.1.2 Reverse Transcription and Integration

As part of the normal retrovirus life cycle, the RNA genome undergoes reverse transcription into double-stranded DNA after entry into the host cell cytoplasm. Briefly, the reverse transcription process begins with annealing of the tRNA to the PBS on the viral RNA. Minus-strand DNA synthesis proceeds until the 5' end of the RNA is reached, which generates minus-strand strong-stop DNA (-sssDNA), a DNA sequence about 100-150 bp in length. First strand transfer, mediated by the R sequence in the LTR, then causes the -sssDNA to anneal to the 3' end of the viral RNA. Minus-strand DNA synthesis then resumes, followed by digestion of the template strand by the RNase H activity of the retroviral RT. The short polypurine tract (PPT) of the viral RNA is resistant to RNase H degradation and is used to prime plus-strand DNA synthesis, forming a DNA called plus-strand strong-stop DNA (+sssDNA). RNase H then removes the primer tRNA, which allows for annealing of the complementary PBS segments in the +sssDNA and minus-strand DNA. The plus strand of DNA can then serve as the template to complete minus-strand synthesis, and vice versa [1].

After the reverse transcription process is completed, the double-stranded, blunt-ended linear viral DNA molecule remains in the cytoplasm, and the integrase enzyme cleaves the 3' termini of the viral DNA to eliminate the terminal 2-3 bases from each 3' end. The viral DNA then enters the nucleus, and the integrase enzyme cleaves the host DNA on opposite strands at positions staggered by 4-6 bases, allowing for integration of the viral DNA. DNA synthesis then extends from the host DNA 3'-OH groups that flank the host-viral DNA junctions, forming a target site duplication (TSD) of the 4-6 bases on either end of the viral DNA. Following a ligation step, the retrovirus DNA is integrated into the host genome and is known as a provirus (Figure 1.1, c) [1].
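The formation of the target site duplication can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than a model of integrase biochemistry: each sequence is treated as a single string, the function name integrate, the example sequences, and the cut position are hypothetical, and the 3' end processing of the viral DNA is not modeled.

```python
def integrate(host, provirus, cut_site, stagger=4):
    """Insert `provirus` into `host` and return (integrated_sequence, tsd).

    `cut_site` is the index of the first host base between the two staggered
    cuts; `stagger` is the number of bases separating the cuts (4-6 here).
    Hypothetical helper for illustration only.
    """
    if not 4 <= stagger <= 6:
        raise ValueError("stagger is expected to be 4-6 bases")
    if cut_site < 0 or cut_site + stagger > len(host):
        raise ValueError("staggered cut falls outside the host sequence")
    # Host bases between the staggered cuts are single-stranded on each side
    # of the insert; gap repair copies them, so they end up on both flanks.
    tsd = host[cut_site:cut_site + stagger]
    integrated = host[:cut_site + stagger] + provirus + host[cut_site:]
    return integrated, tsd


host_dna = "AAACCGGTTT"        # made-up host sequence (upper case)
viral_dna = "tgtagcatca"       # made-up viral sequence (lower case)
integrated, tsd = integrate(host_dna, viral_dna, cut_site=3, stagger=4)
print(integrated, tsd)
```

With these example inputs the viral DNA is flanked on both sides by the same four host bases (CCGG), and the locus grows by the length of the viral DNA plus the length of the TSD.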

1.2 Endogenous Retroviruses

Endogenous retroviruses (ERVs) are the remnants of ancient retroviral infections and are one of many types of transposable elements (TEs) in the genome. TEs are pieces of DNA first described by Barbara McClintock as 'jumping genes' in maize [4]. By definition, TEs are genetic elements that change position in the genome, and they make up more than half of the human and murine genomes (with comparable percentages in many other vertebrates). Recent evidence has shown that TEs play crucial roles in genome evolution and gene expression of the host. Unlike other TEs, ERVs can be acquired by infection events, either by a novel infectious retrovirus [5] or by reinfection by viruses originating from other host cells [6]. Additionally, only ERVs have LTRs, which naturally contain regulatory elements that could be co-opted by the host as gene regulators [7].

1.2.1 Establishment in Host Genomes

When a retrovirus infects a germ cell and integrates into its genome, the provirus becomes part of the germline, is inherited like any other gene, and is then known as an endogenous retrovirus. ERVs make up about 8% of the human genome [8] and 10% of the mouse genome [9]. After integration into the genome, the full-length retrovirus is known as a provirus (Figure 1.1, c). Many ERVs remain present in contemporary genomes as proviruses, with viral DNA sequence between two LTRs. The majority of ERV sequences in the human genome, however, are present as solo-LTRs (Figure 1.1, d). Solo-LTR formation occurs when homologous recombination between two LTRs (of the same provirus or of different virus loci) removes the proviral DNA and one LTR, leaving behind a solo-LTR at that locus [10]. Estimates suggest that about 95% of ERV insertions in humans are represented by solo-LTRs [11].
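Solo-LTR conversion within a single provirus can be illustrated with a short Python sketch. The function name to_solo_ltr and all sequences are hypothetical, the two LTRs are assumed to be identical strings, and recombination is reduced to a simple deletion of the internal sequence together with one LTR copy.

```python
def to_solo_ltr(locus, ltr):
    """Collapse a provirus flanked by two identical LTRs into a solo-LTR.

    Hypothetical helper: returns the locus unchanged when it does not contain
    two non-overlapping copies of `ltr`.
    """
    first = locus.find(ltr)     # start of the 5' LTR
    last = locus.rfind(ltr)     # start of the 3' LTR
    if first == -1 or last < first + len(ltr):
        return locus            # fewer than two LTR copies: nothing to recombine
    # Recombination between the LTRs removes the internal proviral DNA and
    # one LTR copy, leaving a single LTR between the original flanks.
    return locus[:first] + ltr + locus[last + len(ltr):]


ltr_seq = "ttgacc"              # made-up LTR sequence
internal = "atggcgtaa"          # stands in for gag-pro-pol-env
provirus_locus = "AAACCGG" + ltr_seq + internal + ltr_seq + "CCGGTTT"
solo_locus = to_solo_ltr(provirus_locus, ltr_seq)
print(solo_locus)
```

In this example the host flanks, including the target site duplication (CCGG), are retained on either side of the remaining LTR, as drawn in Figure 1.1, d.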

1.2.2 ERV Distribution Within the Genome

Within the genome, ERV integrations are less frequent within and near genes [12] and in introns compared to intergenic regions [13]. There is also a shift in ERV distributions with age of the element: younger integrations are found at higher densities in regions of high GC content, with a shift to regions of lower GC content with increasing age [12]. In contrast, exogenous retroviruses favor initial integration close to genes, for example in transcriptionally active regions of the genome, near transcription start sites, or in distal enhancer regions [14]. This contrast suggests that selection may have shaped the distribution of ERVs in contemporary genomes, with deleterious ERVs within or near genes being lost from the population.

1.2.3 ERV Classification and Distribution

Although ERVs have been discovered in all vertebrate genomes sequenced to date, a number of ERV groups are restricted to certain species. Retroviruses are classified into seven genera based on the phylogeny of the retroviral polymerase (RT) gene: Alpharetrovirus, Betaretrovirus, Deltaretrovirus, Gammaretrovirus, Epsilonretrovirus, Spumavirus, and Lentivirus. The Alpharetrovirus-related ERVs are restricted to the Gallus genus of birds. The Betaretrovirus-related ERVs, including Mouse Mammary Tumor Virus (MMTV) and Jaagsiekte sheep retrovirus (JSRV), are found in mammals. The Gammaretrovirus-related ERVs are found in multiple vertebrate classes, and include the well-characterized murine leukaemia virus (MuLV), gibbon ape leukaemia virus (GaLV), feline leukaemia virus (FeLV), and the porcine endogenous retrovirus (PERV) [15]. Epsilonretrovirus-related ERVs are found in fish, but have also been discovered in mammals such as horse [16] and across primates [17]. Recent studies have identified endogenous counterparts of the Deltaretroviruses [18], Lentiviruses [19–21], and Spumaviruses [22, 23], although these ERVs appear to be rare.

1.2.4 ERV Expansion in Host Genome

In addition to the initial retrovirus integration event, there are several ways for ERV copy number to increase in the genome. Retrotransposition is a process in which an ERV provirus is transcribed into an RNA intermediate, and the RNA copy is reverse transcribed and inserted into another location in the genome. This can occur either in cis, where the virus encodes its own proteins for mobilization, or in trans, where another endogenous or infectious virus in the same cell provides these proteins. Reinfection, on the other hand, requires that the ERV provirus produce a virus particle that leaves the cell and re-infects the same cell (or another cell in the organism). Reinfection requires an intact env gene, whereas retrotransposition does not [6]. Lastly, an ERV could increase in copy number if it is present in a host DNA segment that has undergone copy-number expansion, which does not require that the ERV be transcriptionally active [24].

1.2.5 Other Retrotransposons in Host Genomes

Other retrotransposons present in host genomes include the Long Interspersed Nuclear Elements (LINEs), Short Interspersed Nuclear Elements (SINEs), and LTR retrotransposons. Like ERVs, these TEs change position in the genome via an RNA intermediate that is reverse transcribed into DNA before insertion into a new location in the genome.

Long Interspersed Nuclear Elements (LINEs)

The most prominent LINEs in mammals are the L1 elements, which comprise about 17% of the human genome and are the only known human retrotransposon that is actively mobilizing in the genome [25]. L1s have been active in the mammalian genome since the marsupial-eutherian split, and are acquired by descent, with new copies acquired by retrotransposition within a germline cell [26]. L1 elements are especially abundant in AT-rich, gene-poor regions of the genome with low rates of recombination, due to both an insertion bias and natural selection [27]. L1 elements are about 6 kb in length and consist of two non-overlapping open reading frames (ORFs) that encode proteins important for retrotransposition, flanked by untranslated regions (UTRs). An internal RNA polymerase II promoter in the 5' UTR directs transcription of the element in both transcriptional directions. L1 elements have been continuously amplifying over the last 170 million years, becoming the most abundant family of autonomously replicating elements in mammalian genomes [28]. Although the human genome contains an estimated 516,000 L1 elements [8], only 80-100 of these elements are still capable of retrotransposition [29]. L1 retrotransposition has shaped genome evolution [30]; however, about 0.07% of L1 mobilization in humans is known to cause disease [31,32]. Because of this, L1 expression is tightly regulated in both germline cells and somatic cells by DNA methylation, RNA interference (RNAi), and RNA and DNA binding proteins [33].

Short Interspersed Nuclear Elements (SINEs)

SINEs are short repetitive elements that are less than 600 bp in length [34, 35]. SINEs are non-autonomous: they are transcribed by RNA polymerase III and use LINE reverse transcriptase to replicate and spread throughout the genome [36–38]. After transcription by RNA polymerase III, the SINE transcript binds to the LINE reverse transcriptase, and the SINE RNA is converted to DNA and integrated in a new genomic location. SINE expression and retrotransposition are suppressed by an APOBEC3-mediated system, SINE DNA methylation, and various methods of LINE repression. Although SINEs are found in all mammalian genomes sequenced to date, each species has unique SINEs that contribute to the genome sequence [39]. The most common SINE in the human genome is the Alu element, which comprises approximately 11% of the human genome and consists of both ancient and polymorphic families [40]. In contrast, cattle have at least three major families of SINEs: Bov-A, t-RNA-Glu, and MIR [41]. SINE elements comprise 25% of the cow genome [42]. In cow, SINEs appear to be spatially correlated with gene density and GC content based on age: ancient repeats, such as L2 and MIR, are positively correlated with gene density and GC repeats, whereas the younger, ruminant-specific repeats, such as BovB, Bov-tA, BOV-A2, and ART2A, are negatively correlated with gene density and GC content [42].

LTR retrotransposons

Broadly, the LTR retrotransposons contain direct Long Terminal Repeats, or LTRs, and encode gag and pol genes [43]. The gag gene encodes structural proteins, whereas the pol gene encodes several enzymes, including a protease, reverse transcriptase, and integrase [43]. Two retrotransposon families found in all eukaryotes, the Ty1/copia and the Ty3/gypsy, have been described in detail, and differ in the organization of their pol genes [43]. LTR retrotransposons mobilize in the genome by first transcribing the genomic copy, reverse transcribing the transcript in the cytoplasm of the cell, and finally re-integrating into the host genome, typically at specific genomic sites [30]. The LTR retrotransposons have contributed many genes to the human genome, mostly as genes derived from gag, such as the Mart, Pnma, and SCAN gene families, although several genes have also been found to derive from LTR retrotransposon integrase and protease [44].

1.3 Impacts of ERVs

1.3.1 Genomic Structural Variations

The presence of ERV integrations in the genome has caused large-scale deletions, duplications, and chromosome instability during human genome evolution [45]. Comparative genomics studies have demonstrated the involvement of ERVs in ectopic recombination events, deletions, inversions, and duplications [46–49]. Many of these ERV-mediated events have contributed to species-specific genomic variations.

1.3.2 Insertional Mutagenesis

ERV insertions are responsible for 10%-12% of spontaneous mutations of genes in mice, occurring both as de novo germ line mutations and as insertional mutagens in somatic cells, particularly during oncogenesis (Figure 1.2, a). In germ cells, most ERV insertion mutations are due to ERV integrations into introns, which can disrupt gene expression by altering splicing or polyadenylation. In addition, some mouse ERV elements are capable of driving ectopic gene expression, resulting in a mutant phenotype [50]. Unlike mice, humans have no catalogued mutant alleles that are due to ERV insertions.

[Figure 1.2 panels: (a) ERV integration ablates gene expression; (b) ERV LTR acts as promoter for gene; (c) ERV alters gene regulatory network; (d) ERV impacts host transcript splicing; (e) ERV transcript acts as lncRNA; (f) ERV transcripts are translated into proteins.]

Figure 1.2: Effects of ERVs on genome function.

1.3.3 Viral Proteins

Expression of ERVs can result in the production of viral proteins that can negatively affect the host. For example, overexpression of the Rec protein of HERV-K is associated with the formation of germ cell tumors in humans by de-repressing oncogenic transcription factors [51]. ERV proteins could also activate antiviral defense mechanisms of the innate immune system. For example, the surface subunit of the envelope protein of a HERV associated with Multiple Sclerosis interacts with Toll-like receptor (TLR) 4 to stimulate production of pro-inflammatory cytokines and activation of other immune cells [52].

1.3.4 Disease Associations

ERV RNA and protein expression has also associated with disease states in humans and in other animals. Several studies have shown that immune responses against ERV antigens may be a driving force of autoimmunity, rather than a consequence. This is most often studied in context of systemic lupus erythromatosus (SLE) [53, 54]. HERV envelope protein expression is also upregulated in multiple sclerosis lesions [55]. The Rec protein produced by HERV-K (HML2) has also been shown to interfere with germ cell development and cause carcinoma-like lesions when

9 introduced into mice [56]. A HERV K protein was also found to mediate fusion of melanoma cells, which could promote disease progression [57]. HERV RNA expression is associated with monocyte activation in inflammatory brain diseases, such as MS or HIV infection [58]. Expression of ERVs is associated with certain cancers in both humans and mouse models. In Hodgkin lymphomas, aberrant expression of the CSF1R proto-oncogene appears to be driven by THE1B, an LTR from a human ERV family [59, 60]. A normally silent ERV LOR1a is also activated in Hodgkin lymphoma and drives expression of IRF5, a process termed ‘onco-exaptation’ by the study authors [61]. An LTR driven FABP7 transcript in diffuse large B-cell lymphoma also produces a novel protein that plays a role in cell proliferation [62]. Additionally, ERV RNA expression is often upregulated in certain cancers, such as HERV-K in melanomas and germ-cell carcinomas and HERV-E in renal cell carcinomas [63, 64]. It is possible, however, that ERV de-repression is solely a consequence of the altered epigenetic landscape in tumor cells [65]. ERVs have also been associated with the pathogenesis of prion disease in both humans and experimental infections in animal models. HERV RNA was found to be upregulated in the cerebrospinal fluid (CSF) of patients with sporadic Creutzfeldt- Jakob disease, a human prion disease, compared to normal controls and control patients with other neurological disease [66]. ERV expression patterns have also been shown to be altered in a murine cell culture model of prion infection [67]. In a macaque experimental prion infection, ERV RNA expression was upregulated and downregulated in brain tissue in an ERV class specific manner [68]. Long non-coding RNAs (lncRNAs), which are frequently enriched in ERV sequence, are also often dysregulated in cancers [69, 70]). 
A lncRNA known as EVADR was recently identified in colorectal tumors and was determined to be driven by a MER48 ERV LTR promoter, an element that is conserved among Old World monkeys and apes [71]. Although no specific function within the cancer cell has yet been attributed to EVADR, the cell specificity and evolutionary conservation of the locus suggest that the LTR may play an important role in primate biology. Additionally, RNAs known as very long intergenic non-coding RNAs, or vlincRNAs, have also been identified as functionally important in many cancer cells. Tissue-specific vlincRNA promoters are frequently found within ERV sequences, and ERV-driven expression of vlincRNAs is correlated with the degree of malignant transformation [72].

1.4 Host Control of ERVs

1.4.1 Epigenetic Regulation

Epigenetics, the study of heritable changes in gene expression not due to changes in genome sequence, includes modifications of both DNA and histones with the purpose of regulating gene expression. In the context of ERV silencing, there are three key mediators of epigenetic control: chromatin formation, DNA methylation, and histone modifications [73].

Heterochromatin

Many ERV proviruses, as well as other TEs, are located in heterochromatin, where they are effectively silenced because the genome is tightly packed in these regions [74]. A novel HERV-K, named K111, was discovered in the human centromere [75]. Normally silenced, this novel ERV was identified after HIV infection, when the HIV-1 Tat protein induced a loss of heterochromatin in these regions [75]. Kangaroo Endogenous Retrovirus (KERV) was also found to have undergone expansion in the heterochromatic centromere regions of the genome [76].

CpG Methylation

In somatic cells, ERVs are typically silenced via CpG methylation (Figure 1.3). DNA methylation, the process of adding a methyl group to the DNA molecule, occurs at the 5 position of the pyrimidine ring of cytosine and forms 5-methylcytosine. In mammals, DNA methylation occurs at CpG sites, where a cytosine is followed by a guanine nucleotide in the 5′ to 3′ direction on the DNA and is separated from it by a phosphate group. In plants, by contrast, there is also evidence of CHG and CHH methylation, where H is either T, A, or C. Typically, the cytosines on both strands are methylated; however, when only one strand is methylated, the sequence is said to be hemimethylated. The process of DNA methylation is catalyzed by the DNA methyltransferases (DNMTs), of which there are at least 5 enzymes that perform unique functions in the cell [77]. Most of the CpG dinucleotides in mammalian genomes are methylated. CpG islands, however, are sites of high GC percentage with low methylation that are typically found near active gene promoters and enhancer elements. Studies of genes with methylated CpG island promoters have shown that methylation at the promoter is associated with transcriptional silencing, whereas promoters that are unmethylated are transcriptionally active. Methylation of CpGs in the gene body, however, is not associated with repression [78].

Figure 1.3: Summary of ERV silencing mechanisms.

In somatic cells, ERVs are silenced at the transcriptional level by epigenetic mechanisms such as histone modifications and CpG methylation. Histone modifications are also the primary means of ERV silencing in germline cells. Epigenetic silencing complexes are directed to ERVs via binding of KRAB-ZFPs/KAP1. ERVs in some organisms are also silenced at the post-transcriptional level by small RNAs. Retroviral restriction via APOBEC proteins is also used to silence ERVs, and many ERVs are hypermutated (G-to-A mutations). Tetherin can also block the release of ERV particles from the cell.

TEs, including ERVs, are typically methylated in the genomes of mammalian somatic cells [79]. Evidence suggests that CpG methylation is used to silence the promoters of ERVs to prevent transcription. CpG methylation in the 5′ LTR was inversely correlated with HERV-K (HML2) transcription in vitro; however, CpG methylation levels cannot fully explain the transcriptional activity of this virus in certain cell types [80]. Methylation of the 5′ LTR was also shown to control transcription of Porcine Endogenous Retrovirus (PERV) in pig tissues [81]. HERV-E LTRs that are active in the placenta are unmethylated there, but are heavily methylated in blood cells, where they are transcriptionally silent [82]. In mice deficient in the DNA methyltransferase Dnmt1, IAP endogenous retrovirus levels are elevated 50-100 fold [83].
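The sequence contexts of cytosine methylation described in this section (CpG in mammals, with additional CHG and CHH contexts in plants) can be illustrated with a short sketch. The function name `cytosine_contexts` and the example sequence are hypothetical and chosen only for illustration. The code scans a single strand and does not model methylation status itself, only the context in which each cytosine occurs.

```python
def cytosine_contexts(seq):
    """Classify each cytosine on the given strand as CG, CHG, or CHH.

    H stands for A, C, or T. Positions are 0-based. Cytosines too close
    to the 3' end to be classified are skipped.
    """
    seq = seq.upper()
    contexts = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        downstream = seq[i + 1 : i + 3]  # up to two bases 3' of the cytosine
        if len(downstream) >= 1 and downstream[0] == "G":
            contexts.append((i, "CG"))
        elif len(downstream) == 2 and downstream[0] in "ACT":
            if downstream[1] == "G":
                contexts.append((i, "CHG"))
            elif downstream[1] in "ACT":
                contexts.append((i, "CHH"))
    return contexts


# Hypothetical example sequence containing one cytosine in each context.
example = "ACGTCAGCTTC"
print(cytosine_contexts(example))
```

For a double-stranded analysis, the same function could be applied to the reverse complement; a CpG site is palindromic, which is why both strands can carry the methyl mark at the same site.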

Histone Modifications

Histone modifications are another epigenetic mechanism important for ERV transcriptional silencing (Figure 1.3), particularly during development, when DNA methylation levels are low [84]. The basic units of chromatin are nucleosomes, which are composed of a central heterotetramer of histones H3 and H4, flanked by two heterodimers of histones H2A and H2B [85]. Covalent modifications of histones that are involved in transcriptional regulation include acetylation, phosphorylation, methylation, and ubiquitination, particularly on histone H3 [86]. Histone phosphorylation is important for transcriptional induction of certain genes in mammalian cells [86]. Histone ubiquitination is a larger covalent modification that is important for both gene activity and silencing [87]. There is, however, no evidence in the literature that histone phosphorylation or ubiquitination is involved in ERV silencing.

Histone acetylation is performed by histone acetyltransferases (HATs) and removed by histone deacetylases (HDACs). In general, histone acetylation destabilizes chromatin and favors transcriptional activation, whereas histone deacetylation leads to transcriptional repression [87]. Acetylation of lysine 27 on histone 3 (H3K27Ac) in particular is associated with transcriptionally active ERVs [88–91]. Deacetylation of lysine residues on histone 3 and histone 4 tails, on the other hand, suppresses ERV transcription. Histone deacetylation is also linked to both DNA methylation and histone methylation, as the deacetylase complexes interact with the mCpG-binding protein MeCP2 and the histone demethylase LSD1, which demethylates the active marks H3K4me1 and H3K4me2, silencing the gene [92].

Histone methylation occurs on side chains of lysines, which can be mono-, di-, or tri-methylated, and arginines, which can be mono-methylated or symmetrically or asymmetrically di-methylated [87].
Histone arginine methylation is associated with gene activation, and involves the histone methyltransferases in the CARM1/PRMT1 family [86]. Histone lysine methylation is important in both transcriptional repression and activation. The most important histone methylation marks are tri-methylation of lysine 9 on histone 3 (H3K9me3), which is associated with gene silencing, and tri-methylation of lysine 4 on histone 3 (H3K4me3), which is associated with transcriptional activity [93]. SETDB1 is the histone lysine methyltransferase responsible for the H3K9me3 modification, and is recruited to ERVs by binding of KRAB-ZFP and KAP1 [93]. Each histone methyltransferase is critical at a distinct time during development, as indicated by gene disruption studies in mice [92].

Initial silencing of proviruses in embryonic cells by histone modifications was demonstrated using newly integrating murine leukemia virus (MLV). Provirus silencing occurred within 2 days, while LTR methylation did not occur until 8-14 days post transduction [92]. In addition, knockout studies show that SETDB1 is required in embryos at days E3.5-E5.5, which is when ERVs become inactivated, and that SETDB1 knockout embryonic cells have upregulated ERV transcription [92]. Dnmt knockout mESCs display no increase in ERV expression until differentiation [94], also suggesting that ERV silencing in ESCs is DNA methylation-independent. Other histone marks associated with ERVs and overlapping with H3K9 methylation include H4K20me3 and H3K27 methylation [95].

During development, H3K9me3 is the key mechanism to silence ERVs, whereas in differentiated cells, DNA methylation takes over [95]. Recent evidence has suggested that histone methylation is important for ERV silencing in mouse B-lymphocytes [96] and mouse neural progenitor cells [97], implying that DNA methylation is not solely responsible for ERV repression in differentiated cells. Histone methylation and CpG methylation are linked, however, and H3K9 and H3K27 methylation, both associated with transcriptional silencing, are positively correlated with DNA methylation [92]. DNA and histone methylation together inhibit transcription by leading to the assembly of compact chromatin and attracting further gene silencing complexes [92].

1.4.2 Small RNAs

In plants [98] and Drosophila [99], small RNAs, such as piwi-interacting RNAs (piRNAs), are used to silence transposable elements (Figure 1.2). Piwi-like proteins known as Miwi, Mili, and Miwi2 were identified in the mouse germ line [100], and studies have shown that ERV transcripts in mice are elevated in Mili mutants [101] and that loss of Mili and Miwi2 leads to TE activation and impaired CpG methylation of TEs, including ERVs [102–104]. Dicer-dependent small interfering RNAs (siRNAs) also play a role in TE silencing during mouse development. Dicer knockout mouse ES cells have elevated levels of ERVs and L1 [105], as do Dicer knockout mouse oocytes [101,106]. A recent study has also identified tRNA-derived small RNAs that inhibit two active ERV families in mouse stem cells by targeting the highly conserved primer binding site and preventing reverse transcription [107].

1.4.3 KAP1 and KRAB-ZFPs

Krüppel-associated box domain-zinc finger proteins (KRAB-ZFPs) are DNA-binding proteins that target specific sequences, including some ERVs, using zinc finger motifs. KRAB-ZFPs recruit KRAB-associated protein 1 (KAP1), also known as TRIM28, which acts as a scaffold for a histone-modifying silencing complex [92]. KRAB-ZFP809, which binds to a 17 bp sequence within the proline tRNA primer binding site of endogenous murine leukemia virus (MuLV), is the only KRAB-ZFP with biochemical and genetic evidence supporting its role in ERV silencing, although other KRAB-ZFPs have been implicated in ERV silencing in both humans and mice [108]. KAP1 knockouts are embryonic lethal [109]. KAP1 plays a role in de novo DNA methylation during development, and ERVs that are activated in KAP1 conditional knockout murine ES cells are not activated by KAP1 disruption in murine embryonic fibroblasts [110]. KAP1-mediated histone modifications were also determined to silence ERVs in neural progenitor cells, in contrast with other somatic cells, which use DNA methylation for ERV silencing [97]. Additional studies have also identified a KRAB-ZFP that controls ERV expression in mouse hepatocytes and embryonic fibroblasts and modulates expression of nearby host genes in differentiated tissues both in cell culture and in vivo [90].

1.4.4 Restriction Factors

Restriction factors are components of the innate immune system that act to restrict retrovirus replication (Figure 1.2). APOBEC3 is in a family of cytosine deaminases that have antiviral effects against numerous retroviruses. In an ex vivo experiment using cloned mouse and HERVK capable of forming infectious particles, it was demonstrated that APOBEC3 proteins were capable of ERV restriction [111]. There is also evidence of past APOBEC3 activity on endogenous copies of ERVs in mice and humans, based on the number of G-to-A mutations characteristic of cytosine deaminases [111]. Tetherin is a restriction factor that prevents retroviral particle release from the surface of infected cells [112] and has been shown to block the release of Porcine Endogenous Retrovirus (PERV) from porcine and human cells [113].
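The G-to-A mutation signature mentioned above can be tallied from a pairwise alignment with a few lines of code. The following is a minimal sketch, not the method used in the cited study: the function names and the toy sequences are hypothetical, the two sequences are assumed to be already aligned to equal length, and dinucleotide context, which published hypermutation analyses typically take into account, is ignored.

```python
from collections import Counter


def substitution_counts(reference, copy):
    """Count each substitution type between two aligned, equal-length sequences.

    Keys are (reference_base, observed_base) tuples. Gaps and ambiguous
    bases are ignored.
    """
    if len(reference) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, obs_base in zip(reference.upper(), copy.upper()):
        if ref_base in "ACGT" and obs_base in "ACGT" and ref_base != obs_base:
            counts[(ref_base, obs_base)] += 1
    return counts


def g_to_a_fraction(reference, copy):
    """Fraction of all substitutions that are G-to-A (0.0 if there are none)."""
    counts = substitution_counts(reference, copy)
    total = sum(counts.values())
    return counts[("G", "A")] / total if total else 0.0


# Hypothetical toy alignment: two G-to-A changes and one C-to-T change.
consensus = "GGATGGCC"
erv_copy = "AGATAGCT"
print(substitution_counts(consensus, erv_copy))
print(g_to_a_fraction(consensus, erv_copy))
```

A G-to-A fraction well above that of other substitution types, relative to a family consensus, would be consistent with past cytosine deaminase activity, although a formal analysis would need an appropriate null model.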

1.4.5 TLR Signaling

ERVs have been shown to interact with TLRs and activate the innate immune system of the host. ERV RNA, DNA, and protein can be detected by various TLRs in the cell [114]. Experiments with knockout mice deficient in TLR3, TLR7, and TLR9 have demonstrated that TLRs are also essential for the control of ERVs. ERV sequences were upregulated in TLR7-deficient mice, and additional loss of TLR3 and TLR9 resulted in ERV-induced T cell lymphoblastic leukemia tumors [115].

1.5 Functional Contributions of ERVs

Originally considered ‘junk DNA,’ ERVs are now widely accepted to contribute to essential host functions and to act as powerful facilitators of genome evolution. Transposable elements, such as ERVs, have been the subject of the ‘TE-Thrust Hypothesis,’ in which the mobile element acts to shape and reformat the genome via transposition or by facilitating DNA recombination [116]. In addition, ERVs can provide regulatory elements and RNAs, and even useful proteins, that can be exapted by the host and may contribute to lineage-specific traits.

1.5.1 ERV LTRs as Regulatory Elements for Host Genes

The ERV LTR contains a promoter, transcription factor binding sites, polyadenylation signals, and other regulatory elements necessary for provirus transcription and replication (Figure 1.2, b-d). When an ERV is integrated close to a gene, and in some instances at a distance [117], these regulatory elements can be used for host gene regulation as well. There are potentially three types of ERV LTRs in the host genome: 5′ LTRs, 3′ LTRs, and solo LTRs. All three types have been demonstrated to have promoter activity for host genes, although 5′ LTRs are typically more promoter-active than 3′ LTRs or solo LTRs [118].

Alternative Promoters

In humans, an expressed sequence tag (EST) search for sequences with ERV LTRs determined that two human genes, apoC-I and EBR, had alternative LTR promoters. In human liver, a solo LTR in the sense orientation with respect to the apoC-I gene drives 15% of apoC-I transcription, and in vitro studies using baboons, which lack the solo LTR element, demonstrated that presence of the LTR increased apoC-I promoter activity [119]. The same study also identified that the 5′ LTR of an ERV in the sense orientation with respect to the EBR gene drives 25-30% of EBR transcription in the placenta [119].

Tissue & Cell specific Promoters

Tissue-specific transcriptome data have been used to identify ERV LTRs that act as tissue- or cell-specific promoters for host genes. A transcriptome analysis of 18 human tissues identified the tissue-specific promoter activity of 62 different classes of human LTRs. This study identified several significant LTR-tissue associations, some of which had not been previously documented. Evidence suggests that many of these LTRs act as enhancers, particularly since many of the effects are not associated with increased tissue-specific expression of particular LTRs themselves [120]. A HERV-E LTR inserted into the first intron of the human gene Mid1 also acts as a promoter specific to placenta and an embryonic kidney cell line. In this case, the LTR-driven Mid1 transcript encodes an alternative 5′ UTR of this gene, since the coding sequence of Mid1 begins after the HERV-E insertion. The LTR-driven Mid1 transcripts account for 25% of total Mid1 transcription in the placenta and 22% of total Mid1 transcription in fetal kidney [121]. Similarly, a HERV-L solo LTR promoter drives 74% of B3Gal-T5 expression in the colon. The HERV-L solo LTR is located 3′ of the native promoter, and the LTR-driven B3Gal-T5 transcript is alternatively spliced relative to the canonical B3Gal-T5 transcript, resulting in an altered 5′ UTR [122]. ERVs can also act as cell- and stage-specific promoters necessary for proper embryonic cell development. At the early stages of mouse embryogenesis, MERV-L is highly expressed but strictly regulated, and MERV-L integrations are a source of regulatory elements that regulate stage-specific expression of cellular genes [123].

Bidirectional Promoters

ERV LTR promoter activity occurs primarily in the sense orientation, as this is the direction in which transcription occurs for proviral replication. There are, however, documented cases both in cell culture and in vivo of ERV LTRs acting as bidirectional promoters. A HERV-K solo LTR was determined to have bidirectional promoter activity in a human teratocarcinoma cell line [124], suggesting that solo LTRs may be able to promote transcription of downstream as well as upstream genes. An analysis in cancer cells of transcripts driven by a solo LTR with bidirectional promoter activity determined that antisense transcripts bound transcription factors crucial for cellular proliferation. This suggests a function of antisense RNAs driven by solo LTR bidirectional promoters in inhibiting the growth of cancer cells [125].

In the human genome, an ERV1 LTR acts as a bidirectional promoter for two genes, DSCR4 and DSCR8. The transcription start sites of these genes are 130 bp apart in a head-to-head orientation, and the ERV1 LTR promotes sense transcription of DSCR4 and acts as a promoter for DSCR8 in the antisense direction, although the ERV1 LTR promoter is more active in the sense direction than in the antisense direction. In this case, the ERV1 LTR is a solo LTR integrated into an older ERV-L solo LTR, which contains strong negative regulatory elements [126]. Similarly, a HERV-H LTR located in the 5′ flanking region of the GSDML gene in the antisense orientation has also been demonstrated to drive expression of this gene. The antisense transcription start site was determined to be in the U3 portion of the LTR [127]. In SL3-3 MLV-induced tumors in mice, retroviruses inserted in the opposite transcriptional orientation within gene introns and immediately 5′ of genes were shown to initiate gene transcription from two regions in U3 of the 5′ LTR on the negative strand of DNA [128], suggesting bidirectional promoter activity of some proviral 5′ LTRs as well.

Transcription Factor Binding Sites (TFBS)

Integrated ERV proviruses and solo LTRs also contain transcription factor binding sites that can be used to coordinate host gene regulatory networks. One study identified p53 binding sites in the LTR10 and MER61 families that work to coordinate new lineage-specific subnetworks of p53 regulation, since genes with conserved p53 functions are absent from the list of genes close to and regulated by p53 ERVs [129].

More recently, researchers used CRISPR/Cas9 deletion to show that ERVs have dispersed interferon-induced enhancers that regulate essential immune functions, specifically activation of the AIM2 inflammasome. These ERVs contain IRF and STAT1 binding sites important for regulation of these immune system genes [130]. ERV1 elements were also shown to contribute to human-specific gene regulation in embryonic stem cells, and serve as binding sites for the ESC-specific transcription factors OCT4 and NANOG. Interestingly, the presence of these Repeat-Associated Binding Sites (RABS) likely explains the species-specific differences in OCT4 and NANOG binding profiles between humans and mice [131]. A similar study that included additional non-embryonic human cell lines showed that additional transcription factor binding sites are associated with species-specific ERVs, and in turn these RABS are also associated with regulated genes [132]. The results of these studies demonstrate the importance of ERVs in facilitating functional gene regulatory evolution.

Polyadenylation Signals

The ERV LTR also contains a polyadenylation signal for proper retrovirus replication that has been exapted for use as a genic polyadenylation signal in humans. A screen of the human EST database identified HHLA2 and HHLA3 as human genes polyadenylated by an LTR. In humans, there is no non-LTR polyadenylation of HHLA2 and HHLA3; the baboon genome, however, lacks this LTR and uses a different polyadenylation signal for these genes [133]. Additionally, a human protein tyrosine phosphatase 1 gene (PTP1) and other not yet characterized genes are polyadenylated using a HERVK-T47D LTR [134].

Intronic ERVs

Intronic ERVs can also act to modulate gene expression. HERV LTRs located in the antisense orientation within introns of SLC4A8 and IFT172 act as promoters for RNAs complementary to the gene exons. These antisense transcripts can decrease the mRNA level of the corresponding gene, contributing to human-specific gene regulation [135]. ERVs are also integrated into introns of the DRB genes in both transcriptional directions. These intronic ERVs act to generate diversity of the Major Histocompatibility Complex Class II region in primates by promoting gene duplications and deletions [136].

1.5.2 ERV RNAs as long non-coding RNAs (lncRNAs)

Endogenous retrovirus-derived RNA has also been exapted by the host for essential functions (Figure 1.2, e). In human ESCs, ERV RNAs are important to maintain stem cell identity. When HERV-H RNA expression is knocked down using shRNAs, hESCs take on a fibroblast-like appearance and pluripotency markers are downregulated. In addition, HERV-H RNA expression was found to be required for induced pluripotent stem cell (iPSC) reprogramming. Further analysis revealed that the HERV-H RNA itself was acting as a lncRNA scaffold to recruit chromatin modifiers and transcription factors to promote enhancer activity of LTR7 [137]. A similar requirement for an ERV-associated lncRNA in embryogenesis was also identified in mice [138]. ERVs produce transcripts throughout embryonic development in a stage-specific pattern of expression, likely dependent on the ERV LTR identity [139].

ERV-derived lncRNAs are also important in human erythropoiesis. In erythroid progenitor cells, ERV-9-derived lncRNAs act in cis to regulate transcription of key erythroid genes and act in trans to regulate transcription of target genes on other chromosomes. The lncRNAs act as scaffolds to stabilize the ERV-9 LTR enhancer-Pol II complex. Global depletion of ERV-9 lncRNAs was also found to inhibit ex vivo erythropoiesis, suggesting that they are involved in coordinating the transcriptional network of erythropoiesis [140].

In general, ERVs are overrepresented in lncRNAs from human and mouse cell lines, particularly in lncRNA exons and upstream flanking regions. The enrichment of ERVs with indicators of active chromatin and regulatory proteins in the proximal upstream regions of lncRNAs suggests that ERVs play a role in lncRNA regulation, in addition to contributing to the lncRNA sequence [141].
The RIDL (Repeat Insertion Domains of LncRNAs) hypothesis suggests that ERVs and other transposable elements are enriched in lncRNAs because they provide functional protein and DNA binding properties to these lncRNAs [142]. This is supported by the function of ERV-derived lncRNAs as scaffolds to regulate gene expression in embryonic cells [137] and erythroid cells [140].

1.5.3 ERV Proteins

One of the best known examples of transcriptionally active ERV exaptation is the case of syncytin in placental mammals (Figure 1.2, f). Syncytin-1 and syncytin-2 are proteins that are essential for the proper formation of the placenta that have been conserved in primates for over 25 million years and are transcribed from the envelope gene of two HERVs, HERV-W and HERV-FRD, respectively. Within the syncytin genes, genomic indicators of purifying selection and low levels of polymorphism within the human population suggest these genes play an essential physiological role [143]. Syncytin-1 and syncytin-2 are expressed specifically in the placenta. In the placenta, cell-cell fusion is essential for metabolic exchange between mother and developing fetus. When this cell-cell fusion occurs in neighboring cells, the resulting multinucleated cell is known as a syncytium. In the placenta, this is called the syncytiotrophoblast, which is in direct contact with maternal blood and is responsible for nutrient and gas exchange between mother and developing fetus, among other tasks. It has also been hypothesized that syncytins play a role in maternal immune tolerance towards the feto-placental unit expressing paternal antigens, owing to the canonical immunosuppressive activity of the retroviral env ISD [144]. Syncytins are not unique to primates, and seven distinct syncytin genes have been discovered in mice, Leporidae, carnivores, and other mammals, such as ruminants, whose syncytin genes were recently discovered. Interestingly, ruminants have a different placental structure that may be due to the specific properties of syncytin-Rum1 [143]. The capture of syncytin genes in diverse species has occurred independently from different endogenous retroviruses through convergent evolution [144].

1.6 Transcriptional Activation of ERVs

1.6.1 Position in Genome

ERV expression has been documented in normal human [145] and mouse tissues and cells [50], and in non-model organisms [16,146,147]. Some evidence suggests transcriptional activation of ERVs could be due to position in the genome close to genes or in open chromatin. In humans, it was determined that a larger proportion of HERV-K LTRs in gene-rich regions were transcriptionally active compared to LTRs in gene-poor regions [118,148]. Additionally, in mice, 57% of ETn (a type of ERV) insertions were unmethylated when located within 1.5 kb of a TSS, but less than 10% of ETn insertions more than 1.5 kb from a gene were unmethylated [149], suggesting that location close to genes is indicative of an unmethylated state. In addition, a study of transcriptionally active ERVs in the horse genome found that 25.8% of expressed loci were located either internal to or within 10 kb of an identified gene, which is higher than the proportion of all loci in the dataset (8.5%) located within this same genomic distance [16].

Because DNA methylation has been shown to spread from TE copies to nearby genes, particularly in plants [150,151], it was hypothesized that ERVs close to genes are unmethylated to avoid spreading of methylation and silencing marks to nearby genes. Indeed, in mouse embryonic stem cell lines, it was shown that heterochromatin can spread from silenced ERV copies into nearby genes [152]. In human neural progenitor cells, knockdown of TRIM28 results in increased expression of ERVs and nearby protein-coding genes, suggesting that ERV silencing mediates local heterochromatin spreading into the surrounding genome [153]. A study in plants also demonstrated that older, methylated TEs are farther from genes, and that methylated TEs are associated with reduced neighboring gene expression. The authors presented a similar model in which host silencing of TEs near genes has negative impacts on nearby gene expression, thus resulting in preferential loss of methylated TEs from gene-rich chromosomal regions [154]. In contrast, DNA methylation spreading into nearby genes was not seen in mouse cells, potentially due to CTCF and H3K4me3 in the gene-ERV boundary acting as insulators [149].
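The proportions quoted above can be compared as simple fold enrichments. The sketch below is only a worked calculation on the percentages reported in the text; the function name `fold_enrichment` is hypothetical, and because the underlying locus counts are not given here, no significance test is attempted.

```python
def fold_enrichment(observed_pct, background_pct):
    """Ratio of an observed percentage to a background percentage."""
    if background_pct <= 0:
        raise ValueError("background percentage must be positive")
    return observed_pct / background_pct


# Horse: 25.8% of expressed ERV loci vs. 8.5% of all loci lie in or
# within 10 kb of a gene.
horse = fold_enrichment(25.8, 8.5)

# Mouse ETn: 57% unmethylated within 1.5 kb of a TSS vs. less than 10%
# farther away, so 10% gives a lower bound on the ratio.
etn_lower_bound = fold_enrichment(57.0, 10.0)

print(f"horse expressed loci near genes: {horse:.2f}-fold")
print(f"ETn unmethylated near TSS: at least {etn_lower_bound:.1f}-fold")
```

With the raw counts of expressed and total loci, the same comparison could be tested formally, for example with a contingency-table test.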

1.6.2 Activation by Environmental Stimuli

ERVs are also expressed following activation by environmental stimuli, including infectious viruses. HERV-K (HML-2) transcripts are highly elevated in the plasma of HIV-1-infected patients, including several integrations that are human-specific. One integration, named K111, is found in centromeric areas of the human genome, and is activated only in HIV-1-infected patients [155]. Further analyses revealed that expression of HERV-K (HML-2) in HIV-1 patients is mediated by the HIV-1 Tat protein, which acts with transcription factors NF-κB and NF-AT to activate the HERV-K (HML-2) promoter [156]. There are also documented examples of expression and mobilization of ERVs after infection with infectious retroviruses in mice [157], cats [158], and birds [159], due to recombination between the endogenous and exogenous retroviruses. Infection with Epstein-Barr Virus (EBV) also transactivates HERV-K18 in B lymphocytes by interacting with the cellular receptor CD21 [160]. Additionally, herpes simplex virus (HSV)-1 infection stimulates HERV-K expression due to interaction with the HSV-1 protein ICP0 and binding of transcription factor AP-1 on the HERV-K LTR [161].

Ultraviolet C irradiation induces expression of the HERV-K accessory proteins rec or np9 in normal human melanocytes, but decreases or has no effect on HERV-K rec and np9 expression in melanoma cell lines [162]. Additionally, ultraviolet B irradiation activates expression of HERV pol sequences in primary keratinocyte cells and a keratinocyte cell line. Because expression of these HERVs has been reported in autoimmune patients, the activation of HERVs by ultraviolet B may contribute to the pathogenesis of lupus and other autoimmune diseases [163].

Inflammation may also play a role in activation of ERV expression. Transcription of HERV-R was increased in vascular endothelial cells by tumor necrosis factor-α, interleukin-1α, and interleukin-1β, and was downregulated by interferon-γ, suggesting that HERV-R expression may be regulated at sites of inflammation in vessels [164]. This is not surprising, as HERV-K LTRs frequently contain binding sites for inflammatory transcription factors and have been demonstrated to be regulated by factors and hormones associated with the establishment of chronic inflammatory disease [165].

1.7 ERV Identification in Genomes

1.7.1 Previous Methods

ERVs were first identified in the 1960s, with the simultaneous discovery of endogenous avian leucosis virus (ALV) in birds and the murine leukemia virus (MLV) and mouse mammary tumor virus (MMTV) in mice [166]. Early methods to sample ERV diversity in genomes used hybridization and probes that bound to retrovirus-specific sequences, such as the PBS. This hybridization technique was also used to identify ERV sequences in genomic libraries. Sequencing positive clones allowed for identification of the sequence of both ancient and contemporary ERVs, although this method was time consuming. Degenerate PCR was also used to sample the ERV diversity in a genome, providing data suitable for phylogenetic analysis, but not the complete sequence of the ERV [15]. In addition, PCR-based screening does not identify flanking sequence, so it cannot be used for studies of novel integrations or disease associations [167].

With the advent of whole genome sequencing, it became possible to mine genome data using bioinformatics tools, such as BLAST [168] and BLAT [169] searches for retroviral proteins, methods to identify LTRs, such as LTR_STRUC [170] and LTR_FINDER [171], and methods that identify conserved motifs, such as RetroTector [172]. Genomes can also be annotated with RepeatMasker [173] to identify known repetitive elements, such as ERVs and other TEs.

1.7.2 Problems Identifying ERVs from Short-read Sequence Data

Although next-generation sequencing (NGS) allows for a greater depth of coverage in sequencing projects, the tradeoff is that read lengths are much shorter. NGS reads are on average 150 bp, much shorter than the length of an ERV integration. In a de novo assembly of an organism’s genome, repeats that are longer than the read length will create gaps in the assembly. This is a particular problem for younger ERVs, as their copies are nearly identical to one another and therefore difficult to place in the genome [174]. For this reason, many existing genomes are fragmented [175] and ERVs and other TEs are not assembled [176]. In addition, recent genome studies have shown that many species have insertionally polymorphic ERV integrations that are not shared by all individuals in a population [177–187], which may contribute to phenotypic variation within a species. To study insertionally polymorphic ERVs, it is necessary to develop an approach that is not affected by the inability to assemble repetitive elements in the genome.
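The placement problem described above can be shown with a toy example. The sketch below uses a hypothetical 59 bp "genome" carrying two identical copies of a 16 bp repeat, standing in for two young ERV copies; the sequences, lengths, and the helper name `find_all` are illustrative assumptions, and exact string matching stands in for a real read aligner.

```python
def find_all(genome, read):
    """Return every 0-based start position at which read occurs exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: two identical repeat copies in unique flanks.
repeat = "ACGTTGCAAGGCTTAC"
genome = "TTTTGACCA" + repeat + "GGGATCTCA" + repeat + "CCATAGTGA"

# A read that lies entirely inside the repeat matches both copies,
# so it cannot be assigned to one locus.
internal_read = repeat[2:12]
print(find_all(genome, internal_read))

# A read that spans the junction between unique flank and repeat
# matches only one locus.
junction_read = genome[5:15]
print(find_all(genome, junction_read))
```

Only reads or read pairs that reach into unique flanking sequence anchor a repeat copy to a specific locus, which is why reads much shorter than the repeat leave it unplaced.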

1.7.3 Existing Solutions

A number of computational tools have been developed to assist in assembly of repeats from short-read NGS data. New de novo assemblers include overlap-based and de Bruijn graph assemblers, which both create graphs from the read data and use these graphs to reconstruct the genome. Repetitive sequences will still introduce structural assembly errors if the assembler is unable to reconstruct the correct path through the de Bruijn graphs [174]. Another method to fill in repeat-induced gaps in genome assemblies is to generate reliable long-insert mate-pair read libraries. Mate-pair libraries are generated by fragmenting the genome to the desired size, typically 1 kb to 3 kb but as large as 25 kb, and sequencing the ends of each fragment. This technique can provide much larger distance information than paired-end sequencing, which typically only allows for insert sizes of less than 500 bp [188]. A recent study comparing large-insert mate-pair libraries (20 or 25 kb), medium-insert mate-pair libraries (5, 8, or 15 kb), and short-insert mate-pair libraries (3 kb) determined that medium-sized mate-pair libraries were efficient at bridging repeats and scaffolding contigs in the rat genome. For the 5 kb and 8 kb mate-pair libraries, approximately half of the repeats in the rat genome with a matched size could be spanned by at least one mate-pair, which is more efficient than using paired-end reads or 3 kb mate-pair libraries [189]. Newer technology that can produce longer sequencing reads can also be used to assemble repetitive elements in genomes. A recent study [190] used PacBio sequencing, which can produce reads up to 70 kb, to assemble ERVs from the genome of a single koala. Limitations of long-read sequencing technologies include the higher error rate, which does not allow for the detection of low-level variants within a population and increases the difficulty of recombination and phylogenetic analyses, and a higher cost than traditional short-read sequence technologies.

Transposon junction assays are one method to determine shared mobile elements among individuals in a population. Briefly, these assays use TE-specific and degenerate primers to amplify the integration site of an insertion in several individuals. TE-host junctions are then mapped onto the genome to evaluate shared and polymorphic integrations. This approach has been used to identify polymorphic ERVs in mice [117, 191], transposons in microbes [192], and mobile elements in human genomes [193, 194]. When no reference genome is available, clustering identical TE-host junction reads from multiple individuals allows for identification of shared TE integrations, as well as those that are present in only one sample [195].

1.8 Mule Deer and Cervid Endogenous Retrovirus (CrERV)

1.8.1 Mule Deer

Mule deer (Odocoileus hemionus) are a species of deer indigenous to western North America. Mule deer belong to the family Cervidae, which includes the muntjac, elk, moose, reindeer, and other deer species, including the black-tailed deer, a subspecies of mule deer. Hybridization between mule deer and black-tailed deer has been reported; however, hybrids between mule deer and white-tailed deer, the sister taxon of mule deer, are rare both in the wild and in captivity. Traditional population genetics approaches indicate that there is a low degree of population structure and high connectivity in mule deer [196,197], as expected for a highly mobile species. Population divergence that follows subspecies designations, however, is evident, with strong genetic differentiation between mule deer and black-tailed deer [198,199].

1.8.2 Identification of Novel Gammaretrovirus, CrERV

Cervid endogenous retrovirus (CrERV) was first identified in a meta-transcriptomic screen of healthy mule deer lymph nodes. A phylogenetic analysis of the viral transcripts determined that this was a gammaretrovirus [146], and Southern blot analysis of mule deer genomic DNA suggested that this gammaretrovirus was endogenous [184]. Further characterization revealed that CrERV integrations were insertionally polymorphic, and few were fixed in all animals studied [184]. Analysis of white-tailed deer genomic DNA revealed that none of the CrERV integrations identified in mule deer were present in white-tailed deer, suggesting that CrERV entered the genome of mule deer after the split between mule deer and white-tailed deer about 1.1 MYA [200]. Related CrERV were identified, however, in the genomes of white-tailed deer and elk [184].

Genomic Structure of CrERV

A single CrERV integration was chosen for further analysis and characterization. CrERVγ-in7 is a full-length provirus that is 9,082 bp and contains complete open reading frames for the gammaretrovirus genes gag, pro, pol, and env. The LTRs of CrERVγ-in7 are 449 bp in length and identical, with a TATA box located at position 280 and a poly(A) signal at position 367, along with multiple predicted TFBS and a conserved XhoI site in R. Downstream of the 5′ LTR is a tRNA-proline primer binding site and a 1,027 bp long 5′ leader region containing 3 direct repeats. The predicted splice donor is located at position 830 and the predicted splice acceptor is located at position 6591. CrERVγ-in7 contains many typical features of retroviruses, such as a conserved Cys-His box motif in the nucleocapsid region, a gag amber stop codon, a conserved YXDD reverse transcriptase motif, the conserved catalytic motif of integrase (D-D-35X-E), and a conserved CETTG motif at position 7113, which is associated with proper function of the envelope protein during virus-induced cell fusion in exogenous retroviruses [201]. Southern blot analysis revealed that all animals contain viruses at the full-length size of CrERVγ-in7, as well as viruses that are about 8.5 kb and 6.5 kb in length [184].

Phylogeny and Colonization Dynamics

Phylogenetically, CrERVγ is most closely related to endogenous gammaretroviruses from sheep and pig, specifically OERVγ1A, OERVγ1B, OERVγ1C, and PERVγ in the pro/pol region. Quantification of the gag and env genes suggested that there are 50-150 CrERVγ copies per haploid genome, and that there is significant variation in copy numbers of these genes amongst different animals. Quantification of the LTR indicated that there were 2-3 fold higher numbers than the CrERV genes, suggesting that there were also solo-LTRs present in the mule deer genomes. Illumina RNA sequencing also confirmed that there was transcription of the CrERVγ provirus genome in lymph node [184]. Further analysis of a subset of 14 CrERV integrations, named CrERVγ-in1 to -in14, revealed that there were at least 4 independent epizootic events that led to endogenization, and that the time to most recent common ancestor for all CrERVs was 0.74 MYA. The most recent CrERV lineage integrated into the genome over the last 25,000 years to the present. CrERV integration prevalence amongst animals varied, and a hierarchical clustering based on CrERV prevalence data was able to spatially cluster mule deer populations with better resolution than microsatellite allele frequencies [185].

Isolation of an infectious CrERV

Coculture of black-tailed deer primary kidney cells and a human rhabdomyosarcoma cell line resulted in induction of CrERV particles that are xenotropic and can infect at least two human cell lines. Sequencing revealed that this induced CrERV was most similar to CrERVγ-in8, and clustered with the insertionally polymorphic and evolutionarily young group of previously characterized CrERV [202]. This suggests that the youngest CrERV integrations have the highest capacity to form infectious progeny.

1.9 Dissertation Objectives

The work presented in this dissertation focuses on transcriptionally active ERV in a non-model organism with established ERV insertional polymorphism, to evaluate whether this polymorphism results in differences among individuals within or between populations. We use mule deer for this study because this outbred species has sustained multiple independent endogenization events from infectious retroviruses, resulting in insertional polymorphism of CrERV from sequentially acquired lineages over the last 700,000 years. These CrERV are unique to mule deer and are not fixed in the population. Chapter 2 establishes the extent of CrERV diversity both within the Montana mule deer population and within a single Montana animal, and further supports that this animal is a typical Montana mule deer with respect to CrERV. In Chapter 3, these data are used to determine the identity of transcriptionally active CrERV loci in a representative animal and other individuals in the Montana population. The various impacts of these CrERV on mule deer gene expression are also evaluated. Chapter 4 contains an analysis of the CrERV integrations and transcriptionally active CrERV in two different mule deer populations. Chapter 5 provides an overall discussion of the work and indicates directions for future work.

Chapter 2 | Characterizing polymorphic CrERV in the Montana mule deer population

2.1 Introduction

Genome sequencing projects have revealed that many species have insertionally polymorphic ERVs. Insertionally polymorphic ERV loci are not fixed in the population, and are thus not shared by all individuals. Polymorphic ERVs may be present in the genome in three allelic states: the pre-integration site, a full-length ERV, or a solo-LTR. Insertionally polymorphic ERVs have been documented in mice [182], particularly between mouse strains [181]; mule deer [184,185]; cats [179,180]; sheep, within and between breeds [183]; pigs [177]; and koala [186]. Recent evidence suggests that the youngest human ERV, HERV-K (HML-2), is also insertionally polymorphic among human populations [187]. Although long ignored, insertionally polymorphic ERVs may be important for many types of ERV-host studies. With respect to disease, insertionally polymorphic ERVs could be novel genetic risk factors. Additionally, insertionally polymorphic ERVs are usually recent integrations that are likely to have retained functional coding sequence. ERVs with functional LTRs can modulate host gene expression via promoter or enhancer activity, and ERVs that can produce viral proteins can suppress or stimulate the immune response, potentially leading to host disease [203]. Insertionally polymorphic ERVs can also be used to study ERV impact on genes, either by providing a ‘no-ERV’ control to compare gene expression levels or by allowing for allelic expression quantification if the ERV is present in only one allele [149]. Insertionally polymorphic ERVs can also be used to study epigenetic impacts of ERVs on nearby host genes by providing natural examples of pre-integration sites for comparison [152]. Lastly, polymorphic ERVs have been used as genetic markers to study the history of sheep domestication [183] and mule deer population structure [185].

Large-scale studies of polymorphic ERV diversity within a species are difficult, however, due to problems in genomic placement of ERVs using short-read sequence data. ERVs are nearly identical repetitive elements that typically do not assemble in de novo genome sequencing and are difficult to place in the genome [174]. For this reason, it is difficult to simply sequence several individual genomes to evaluate ERV diversity and population frequency of ERV integrations. To identify insertionally polymorphic ERVs, it is necessary to use an approach to evaluate ERV integrations across a population that is not affected by the inability to assemble repetitive elements in the genome. Transposon junction assays are one method to determine shared mobile elements among individuals in a population. These assays use TE-specific or degenerate primers to amplify the integration site of an insertion in several individuals. TE-host junctions are then mapped to a host genome to evaluate TE integrations in each individual. This approach has been used to identify polymorphic ERVs in mice [117, 191], transposons in microbes [192], and mobile elements in human genomes [193,194]. A limitation of these approaches, however, is the uncertainty involved in assigning the presence of an ERV in an individual, particularly when assignment is based on read count data, and the uncertainty of mapping to repeats and other poorly assembled genome regions.

Difficulties also arise when sampling ERV diversity in a single genome. Before genomics, ERVs were identified in genomes using hybridization techniques, such as Southern blot [204,205], and degenerate PCR [15], which provided data suitable for phylogenetic analysis but not the genomic location of the ERV. Data mining approaches were used after whole-genome sequencing projects began. Bioinformatic tools such as BLAST [168] and BLAT [169] were used to search for retroviral proteins, and methods to identify LTRs and conserved motifs, such as LTR_STRUC [170], LTR_FINDER [171], and RetroTector [172], have been used to identify ERV sequences from genome databases. These approaches have been used to identify ERVs in the genome sequences of domestic animals such as pig, dog, sheep, cat, cattle, and chicken [206]. Despite the plethora of tools developed to mine genome data for ERV sequences, many genomes, particularly of non-model organisms, are still fragmented [175] and missing repeat content [176]. This is because the short reads typical of next-generation sequencing, usually 150 bp, fail to assemble repeats that are longer than the read length [174]. Methods to overcome these limitations include the use of overlap-based assemblers and de Bruijn graph assemblers [174], the use of long-insert mate-pair sequencing [189], and the use of newer technologies that can produce longer sequencing reads. Technologies such as PacBio can produce reads up to 70 kb, which can span an ERV (typically 9 kb) together with its flanking sequence, facilitating ERV assembly; this approach was recently taken to assemble ERVs from the genome of a single koala [190]. Limitations of long-read sequencing technologies include the higher error rate, which does not allow for the detection of low-level variants within a population and increases the difficulty of recombination and phylogenetic analyses, and a higher cost than traditional short-read sequence technologies.

Although ERV presence alone can impact the host genome via structural variation, transcriptionally active ERVs can also have functional impacts on the host by modulating gene expression or protein production. Most studies of transcriptionally active ERV do not try to find the genomic location of the ERV locus producing transcripts, and instead focus on quantifying transcription of a particular ERV lineage. Identification of ERVs in genome sequence has allowed researchers to localize transcriptionally active ERV in the genome, typically using annotation of EST datasets available through public repositories [207], RT-PCR and cloning-based methods [208], and RNA sequencing and mapping to ERVs previously identified in the genome [16,153,209]. When ERVs are essentially identical, however, not only is placement in the genome difficult, but multiple mapping of RNA sequencing reads does not allow for accurate identification of the transcriptionally active ERV locus, so additional data are required to determine their genomic location.

Previous data have suggested that CrERV are insertionally polymorphic [184,185]. In addition, CrERV are recently integrated into the mule deer genome and are nearly identical, and they did not assemble in a de novo genome assembly of mule deer. We also have evidence that CrERV are transcriptionally active [146,184]; however, we did not know the phylogenetic diversity or genomic location of transcriptionally active CrERV. The primary goal of Chapter 2 was to develop methods to evaluate the population frequency of CrERV integrations in the Montana population of mule deer, as well as to evaluate and compare the diversity of CrERV within a single Montana individual to establish this individual as representative of the Montana population.

2.2 Materials and Methods

Mule deer samples

Mule deer retropharyngeal lymph nodes were obtained from legally hunted animals brought to hunter check stations in Montana and Wyoming. All animals were identified by morphology as belonging to mule deer (Odocoileus hemionus hemionus). Genomic DNAs were prepared from RNAlater (Ambion) preserved tissues using phenol-chloroform extraction (described in [184,210]).

NGS libraries of CrERV integration sites

Next generation sequencing libraries of CrERV integration sites were prepared using a method adapted from previous mobile element junction fragment analyses [191,193,194,211]. Briefly, genomic DNA was digested with dsDNA Fragmentase (NEB) to 250-1000 bp. DNA fragments were then end-repaired and modified to create 3′ A overhangs and 5′ phosphorylation to allow for ligation of double-stranded linkers. The DNA linkers were designed with features to prevent linker-to-linker amplification of DNA fragments lacking the target ERV sequence, including a 3′ amino modification in the linker oligonucleotide, a single-stranded region in the top linker that matches the linker-specific primer, and a large difference in melting temperatures between the PCR primers to take advantage of the suppression PCR effect. The sequence of the linker top strand is 5′-GTGGCGGCCAGTATTCGTAGGAGGGCGCGTAGCATAGAAC*G*T (* denotes phosphorothioate bonds, which prevent degradation of the linker end), and the sequence of the bottom strand is 5′-p-CGTTCTATGCTAC-N (p denotes the 5′ phosphate that enables ligation of the linker; N indicates the 3′ amino modification); both were synthesized by Integrated DNA Technologies. The linker was added in 20-40X molar excess relative to the amount of genomic DNA fragments and ligated to the DNA fragments using Quick Ligase (NEB). The DNA was then purified using a PCR Purification Column (Qiagen) to remove unligated free linkers. Approximately 70-150 ng of DNA with ligated linkers was then used as template in the PCR amplification to enrich for virus-host junction sequences. PCR mixtures contained the following: 1.5 units of Ex Taq DNA polymerase (Takara), Ex Taq reaction buffer (Takara), 0.2 mM dNTPs, and 400 nM primers. The linker-specific primer was identical for all samples and the sequence is 5′-GCGGCCAGTATTCGTAGGA-3′. An LTR-specific primer was used for one set of PCRs to enrich for all CrERV-host fragments (5′-AATGACCCCTGCTTATGTTTGA-3′) and an env-specific primer was used for another set of PCRs to enrich for CrERV-host fragments of viruses that contain coding sequence (5′-GAGGACAGCTCCTTGGTTTG-3′). Cycling conditions for this PCR were: 95°C for 3 min initial denaturation, followed by 32 cycles of 95°C for 30 s, 59°C for 30 s, and 72°C for 30 s, and a final extension of 72°C for 5 min. Each PCR set was individually cleaned over a PCR purification column (Qiagen) and size-selected using AMPure beads (Agencourt).

Approximately 150 ng of the CrERV-enriched DNA fragments were then used as a template for the Barcode PCR, which allowed for distinguishing between samples during processing of the sequence data. PCR mixtures contained the following: 1.5 units of Ex Taq DNA polymerase (Takara), Ex Taq reaction buffer (Takara), 0.2 mM dNTPs, and 400 nM primers. The linker-specific primer was identical for all samples and contained the P1 adaptor sequence required for emulsion PCR and Ion Torrent amplicon sequencing. The sequence of the linker-specific primer is 5′-CCTATCCCCTGTGTGCCTTGGCAGTCTCAGGCGGCCAGTATTCGTAGG-3′, where the 5′ part is the P1 adaptor and the remaining 3′ end sequence is complementary to the linker. The LTR-specific primer contains the A adaptor sequence required for emulsion PCR and Ion Torrent amplicon sequencing. For each sample, the primer contained a unique library-specific index or ‘barcode’. The sequence of the LTR-specific primer is 5′-CCATCTCATCCCTGCGTGTCTCCGACTCAGxxxxxxxxTCCTTCTTGCGTTTTGCATTGTCTC, where the 5′ part is the A adaptor, x denotes the position of the barcode sequence, and the remaining 3′ end sequence is complementary to the CrERV LTR. The barcodes were designed based on the Roche Multiplex Identifier (MID) sequences. The length of the barcodes was 5 or 8 nucleotides, designed so that any two barcodes differ at two or more nucleotide positions. The barcoded primers were used for 454 sequencing and Ion Torrent sequencing. Cycling conditions were: 95°C for 2 min initial denaturation, followed by 28 cycles of 95°C for 15 s, 60°C for 25 s, and 72°C for 30 s, and then a final extension of 72°C for 5 min. The PCR products were then size-selected by gel electrophoresis using 1% agarose. The region corresponding to approximately 350-400 bp was excised and purified from gel slices using the QIAquick gel extraction kit (Qiagen). The concentration was measured on a QuantiFluor-ST fluorimeter (Promega). The size profile and concentration were determined on a Bioanalyzer 2100 chip (Agilent). At this point all the barcoded libraries were pooled and processed for sequencing on the Ion Personal Genome Machine (Life Technologies) using the Ion 318 chip.
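The barcode design rule stated above (any two barcodes differ at two or more nucleotide positions) can be checked with a pairwise Hamming distance. The following is a minimal sketch, not the procedure used to design the actual barcodes; the function names and the example barcodes are hypothetical and chosen only for illustration, and barcodes are assumed to be compared within a set of equal length.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance over all pairs of barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 8-nt barcodes, invented for illustration only.
example_barcodes = ["ACGAGTGC", "ACGCTCGA", "AGACGCAC", "AGCACTGT"]

# The design rule holds if every pair differs at two or more positions.
barcode_set_ok = min_pairwise_distance(example_barcodes) >= 2
```

A minimum distance of two means that a single sequencing substitution in a barcode cannot turn it into another valid barcode.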

Junction Fragment cluster and data analysis

Reads obtained from Ion Torrent sequencing were processed and clustered as described previously [195,210]. Briefly, reads were clustered in two rounds using a previously described clustering pipeline, and inter-cluster distances were computed to check that each cluster represented a single CrERV integration site. A two-component mixture model was also developed to address the uncertainty in assigning CrERV status using only read-count data. The mixture model allowed us to assign a probability of presence to each CrERV within each individual.
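To illustrate how a two-component mixture model can turn read counts into a probability of presence, the sketch below fits a mixture of two Poisson distributions by expectation-maximization. This is an assumption-laden illustration and not the published model [210]: the Poisson components, the starting values, the function names, and the read counts are all hypothetical.

```python
import math


def poisson_logpmf(k: int, lam: float) -> float:
    """Log probability of observing k reads under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_two_poisson(counts, lam_absent=0.5, lam_present=20.0, weight=0.5, n_iter=200):
    """EM for a low-rate ('absent', background) and a high-rate ('present') component.

    Returns the per-observation probability of the 'present' component
    and the fitted parameters. Starting values are arbitrary assumptions.
    """
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each count comes from 'present'.
        resp = []
        for k in counts:
            lp = math.log(weight) + poisson_logpmf(k, lam_present)
            la = math.log(1.0 - weight) + poisson_logpmf(k, lam_absent)
            m = max(lp, la)  # subtract the maximum for numerical stability
            resp.append(math.exp(lp - m) / (math.exp(lp - m) + math.exp(la - m)))
        # M-step: update the mixing weight and the two Poisson means.
        total = sum(resp)
        weight = min(max(total / len(counts), 1e-6), 1.0 - 1e-6)
        lam_present = max(sum(r * k for r, k in zip(resp, counts)) / max(total, 1e-9), 1e-6)
        lam_absent = max(
            sum((1.0 - r) * k for r, k in zip(resp, counts)) / max(len(counts) - total, 1e-9),
            1e-6,
        )
    return resp, lam_absent, lam_present, weight


# Hypothetical read counts for one integration site across 13 animals.
counts = [0, 1, 0, 2, 1, 0, 35, 42, 28, 51, 0, 1, 38]
resp, lam_absent, lam_present, weight = fit_two_poisson(counts)
```

In this toy example the animals with dozens of reads receive a probability of presence close to 1, and the animals with zero to two reads receive a probability close to 0.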

PCR amplification and sequencing of full length CrERV integrations

To amplify CrERV integrations that required sequencing, PCR primers were designed to the genomic region just outside a CrERV integration. PCR was performed using Promega GoTaq Long PCR Master Mix and 0.4 µM of each primer, with M273 genomic DNA used as a template. Thermocycling conditions in a BioRad T100 Thermal Cycler were 95°C for 3 min, then 36 cycles of 95°C for 30 s, 62-65°C for 30 s, and 72°C for 8 min, with a final extension of 72°C for 10 min. The PCR was analyzed by running on a 1% agarose gel with the NEB kb ladder. PCR bands corresponding to full-length CrERV were gel isolated using a gel isolation kit (Qiagen). The PCR products were then Sanger sequenced using virus-specific primers to cover regions with missing data. Several PCR products were first fragmented to approximately 500 bp and sequenced using Illumina MiSeq.

Assembly of CrERV using Mate Pair sequences and Sanger sequence

CrERV sequences were assembled using long-insert mate-pair reads and Sanger sequence as described previously [212]. Briefly, we used anchored mates that mapped into the genomic region surrounding a CrERV integration to extract unmapped mates that were associated with that CrERV. Unmapped mate-pair reads were assembled using a CrERV reference, and regions of missing or uncertain data were PCR amplified and Sanger sequenced. Using SeqMan Pro (Lasergene), CrERV mate pair sequences and Sanger sequence were assembled to recreate the viral sequence at each genomic location and manually checked and adjusted.
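The anchored-mate step described above can be sketched as a simple filter over mate pairs. This is a hypothetical illustration rather than the published pipeline [212]: the `Read` record, the function name, the scaffold name, and the window size are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    name: str
    contig: Optional[str]  # None when the read did not map to the assembly
    pos: Optional[int]     # mapping position, None when unmapped
    seq: str


def unmapped_mates_near(pairs, contig, site, window):
    """Collect unmapped mates whose partner (the anchor) maps near an integration site.

    `window` is an assumed search distance in bp around `site`.
    """
    selected = []
    for a, b in pairs:
        # Either read of a pair may serve as the anchor.
        for anchor, mate in ((a, b), (b, a)):
            anchored = (
                anchor.contig == contig
                and anchor.pos is not None
                and abs(anchor.pos - site) <= window
            )
            if anchored and mate.contig is None:
                selected.append(mate)
    return selected


# Hypothetical mate pairs around a site at position 10,000 on 'scaffold_12'.
pairs = [
    (Read("p1/1", "scaffold_12", 10500, "ACGT"), Read("p1/2", None, None, "TTGACC")),
    (Read("p2/1", None, None, "GGCATA"), Read("p2/2", "scaffold_12", 9200, "CCGT")),
    (Read("p3/1", "scaffold_12", 50000, "ACGA"), Read("p3/2", None, None, "AAAAAA")),
    (Read("p4/1", "scaffold_12", 10100, "ACGG"), Read("p4/2", "scaffold_12", 13100, "TTTT")),
]
candidates = unmapped_mates_near(pairs, "scaffold_12", site=10000, window=3000)
```

The reads returned here would be the input for assembly against a CrERV reference; pairs that are anchored too far away, or whose mates both map to the host genome, are left out.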

2.3 Results

Number of CrERV in Montana Mule Deer

A previous study of 13 random CrERV integrations in 257 mule deer from various populations demonstrated that CrERV were insertionally polymorphic but some CrERV were shared amongst animals and populations [185]. We wanted to expand this study to address the number of shared CrERV and frequency of specific CrERV across the Montana population, as well as determine the total number of CrERV in each animal. Using the clustered junction fragments, we identified 1327 CrERV integration sites for a set of 22 animals from Montana [210]. We previously used a mixture model to estimate the uncertainty in ERV status and assigned a probability for each ERV in each individual in the Montana population [210]. For the following analyses, CrERV with a probability of greater than 0.95 were considered present in an animal. The number of CrERV per Montana animal ranged from 135 to 343 integrations, with an average of 263 CrERV. The median number of CrERV per animal in Montana is 267 (Figure 2.1). Because we are doing more detailed analyses of the CrERV in a single animal, M273, we also wanted to confirm that M273 contains a similar number of CrERV as other animals from Montana. There are 266 CrERV in M273 according to this analysis, which is close to the median number of integrations in Montana and suggests that M273 has a similar number of CrERV as other individuals in the Montana population.
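The presence call and per-animal summary described above can be sketched as follows. The 0.95 cutoff comes from the text; the data layout, the function name, and the toy probabilities are hypothetical.

```python
from statistics import mean, median

THRESHOLD = 0.95  # cutoff stated in the text: probability greater than 0.95


def crerv_per_animal(prob):
    """prob maps animal ID -> {integration ID: probability of presence}."""
    return {
        animal: sum(p > THRESHOLD for p in sites.values())
        for animal, sites in prob.items()
    }


# Hypothetical probabilities for three animals and four integration sites.
toy = {
    "A1": {"s1": 0.99, "s2": 0.10, "s3": 0.97, "s4": 0.96},
    "A2": {"s1": 0.98, "s2": 0.99, "s3": 0.50, "s4": 0.02},
    "A3": {"s1": 1.00, "s2": 0.01, "s3": 0.03, "s4": 0.95},
}
per_animal = crerv_per_animal(toy)
mean_count = mean(per_animal.values())
median_count = median(per_animal.values())
```

Note that a probability of exactly 0.95 is not counted as present, because the stated rule is strictly greater than 0.95.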

[Figure 2.1 plot: histogram titled "Number of CrERV per Montana Animal". x axis: CrERV per animal; y axis: Frequency. Marked values: mean (263 CrERV), median (267 CrERV), M273 (266 CrERV).]

Figure 2.1: CrERV integrations per animal in Montana.

The y axis represents the number of Montana individuals with a given number of CrERV integrations. The mean (green line) and median (red line) are indicated on the graph. The number of CrERV in M273 (266 CrERV integrations) is indicated on the graph as a dashed blue line.

Distribution of CrERV in Montana mule deer

We also used the CrERV probabilities to evaluate the number of animals in which a particular CrERV can be found. There were 27 CrERV integrations that were shared by all 22 animals from the Montana dataset; however, only 5% of CrERV are found in 16 or more animals in Montana. The median number of Montana animals per CrERV integration was 2, and 32% of CrERV are found in only 1 animal; we refer to these as singletons (Figure 2.2). These data support that CrERV are insertionally polymorphic. Expanding this analysis, we used a linear regression model to determine if there is a relationship between the total number of CrERV and the number of singletons in an individual (Figure 2.3). The model demonstrates that the number of singletons increases with the total number of CrERV. M273, which has 266 CrERV integrations, has 23 singletons and was not an outlier according to the regression analysis. In contrast, M267, which has 70 singletons and 273 total CrERV, was calculated to be an outlier.
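The singleton count and the regression of singletons on total CrERV can be sketched as below. The text does not state how outliers were called, so the rule used here (absolute residual greater than 2 residual standard errors) is an assumption, as are the function names and all toy values.

```python
import math


def singletons_per_animal(present):
    """present maps animal ID -> set of integration IDs called present."""
    prevalence = {}
    for sites in present.values():
        for s in sites:
            prevalence[s] = prevalence.get(s, 0) + 1
    # A singleton is an integration found in exactly one animal.
    return {a: sum(prevalence[s] == 1 for s in sites) for a, sites in present.items()}


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_outliers(x, y, cutoff=2.0):
    """Indices whose residual exceeds `cutoff` residual standard errors (assumed rule)."""
    slope, intercept = linear_fit(x, y)
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (len(x) - 2))
    return [i for i, r in enumerate(resid) if s > 0 and abs(r) / s > cutoff]


# Hypothetical presence calls for three animals.
toy_present = {"A1": {"s1", "s2", "s3"}, "A2": {"s1", "s4"}, "A3": {"s1", "s2"}}
toy_singletons = singletons_per_animal(toy_present)

# Hypothetical totals and singleton counts for eight animals; index 5 is unusually high.
totals = [150, 200, 250, 300, 350, 270, 260, 310]
singles = [10, 15, 20, 25, 30, 70, 21, 26]
outliers = flag_outliers(totals, singles)
```

In the toy data only the animal at index 5 is flagged, which mirrors the situation described for M267, where the singleton count is much higher than expected for the total number of integrations.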

[Figure 2.2 plot: histogram titled "Number of Montana Animals per CrERV". x axis: Animals per CrERV (of 22 animals); y axis: Frequency. Marked percentiles: 32nd (1 animal), 80th (7 animals), 85th (9 animals), 90th (11 animals), 95th (16 animals).]

Figure 2.2: CrERV prevalence among animals in Montana.

The distribution of all CrERV integrations across Montana animals was plotted as a frequency histogram. The y axis represents the number of CrERV integrations present in a given number of Montana animals. Various percentiles are indicated on the graph.

[Figure 2.3 plot: scatter plot titled "Number of Singletons varies with Number of Total CrERV integrations". x axis: Total CrERV integrations; y axis: Number of Singletons. Legend: Other MT animals; M273.]

Figure 2.3: The number of singletons varies with the total number of CrERV integrations.

The regression line is shown as a solid black line. M273 is highlighted in blue, and other Montana animals are in red.

M273 clusters with other Montana animals based on shared CrERV integrations

Given the nature of ERV inheritance, animals that share a CrERV integration site are related. We have shown previously that principal component analysis (PCA) of the matrix of the probability of a given animal containing a given CrERV can separate animals into distinct geographic populations based on shared CrERV [210,213]. M273 was collected from southern Montana, near the Wyoming border (Figure 2.4). The PCA of CrERV probabilities in the Montana population is also shown, with M273 highlighted in blue (Figure 2.4), which shows that M273 is not an outlier according to CrERV shared with other Montana animals.
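The PCA of the animal-by-CrERV probability matrix can be sketched with a power iteration for the first principal component. This is a minimal standard-library illustration, assuming column-centered data; a numerical library would normally be used, and the function name and toy matrix are hypothetical.

```python
import math


def first_pc(matrix, n_iter=200):
    """Scores of each row (animal) on PC1 and the fraction of variance PC1 explains."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in matrix]  # center columns
    v = [float(j + 1) for j in range(m)]  # arbitrary non-uniform starting vector
    for _ in range(n_iter):
        # One power-iteration step on X^T X, computed as X^T (X v).
        scores = [sum(r[j] * v[j] for j in range(m)) for r in x]
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            raise ValueError("matrix has no variance")
        v = [c / norm for c in w]
    scores = [sum(r[j] * v[j] for j in range(m)) for r in x]
    total = sum(c * c for r in x for c in r)
    return scores, sum(s * s for s in scores) / total


# Hypothetical probabilities: rows are animals, columns are CrERV integrations.
# The first three animals share one set of integrations, the last three another.
toy = [
    [0.99, 0.98, 0.02, 0.01],
    [0.97, 0.99, 0.05, 0.02],
    [0.98, 0.96, 0.01, 0.03],
    [0.02, 0.03, 0.99, 0.97],
    [0.01, 0.05, 0.98, 0.99],
    [0.03, 0.02, 0.97, 0.98],
]
pc1_scores, pc1_fraction = first_pc(toy)
```

Animals that share integrations receive PC1 scores of the same sign, so the two toy groups separate along the first component. The sign of a principal component is arbitrary, which is why only relative positions are interpreted.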

[Figure 2.4 plots: PCA panel (top; axes PC1 and PC2, rotated and scaled to longitude and latitude) and map panel (bottom). Point labels: 1 = other MT mule deer, 2 = M273, 3 = OR BT deer, 4 = OR mule deer, 5 = WY mule deer.]

Figure 2.4: PCA and map displaying geographical location based on latitude and longitude of kill-location of Montana (MT), Oregon (OR), and Wyoming (WY) animals.

The first two principal component scores (top) were rotated and scaled to make the locations comparable with the latitude and longitude coordinates of animal kill-location (bottom). M273, indicated in both the PCA and map as a blue 2, separates with other Montana animals in the PCA based on shared CrERV integrations, and not with mule deer from other populations. PC1 and PC2 account for 8.03% and 4.7% of the variation, respectively.

Distribution of M273 CrERV in the Montana population is consistent with virus age estimates

We next performed a more detailed analysis of CrERV in a single animal, M273. The genome of this individual was sequenced using paired-end and long-insert mate-pair sequencing [212]. Using the long-insert mate-pair data and additional Sanger sequencing, we were able to reconstruct 164 CrERV with coding sequence and 46 solo-LTRs, and identify the genomic locations of these CrERV. Each full-length CrERV could be assigned to a phylogenetic group, broadly classified as CrERVγ-in1, CrERVγ-in12, CrERVγ-in3, CrERVγ-in6, and CrERVγ-in7, which were previously identified [185], and CrERVγ-in15, a newly identified CrERV lineage (Figure 2.5). Of the CrERV assembled in M273, 161 with phylogenetic information or solo-LTR status can be accurately matched to CrERV-host junction fragments in the CrERV frequency table of integrations in the 22 Montana animals. We combined these two datasets to evaluate CrERV phylogenetic lineage distribution in the Montana population (Figure 2.6). There were no CrERVγ-in6 or CrERVγ-in7 CrERV identified in M273 that were found in all 22 Montana animals, and these two distributions are concentrated at low numbers of animals, suggesting that these groups of viruses are not widespread in the population and are likely younger viruses. Several CrERVγ-in3 and CrERVγ-in12 CrERV identified in M273 were found in all 22 Montana animals, but this distribution is flatter, suggesting that many of these viruses are well-distributed throughout the Montana population. In contrast, the CrERVγ-in1 CrERV and solo-LTRs identified in M273 are widespread throughout the population, and most of these integrations are found in many animals throughout Montana, suggesting that the CrERVγ-in1 CrERV are older lineages shared by most animals in the population. These data are consistent with coalescent estimates of virus ages.

M273 CrERV integrations are representative of the Montana population

Additionally, we evaluated the 266 CrERV in M273 to determine if these CrERV are more representative of mule deer from Montana or mule deer from other populations (WY/OR). For each CrERV in M273, we compared the proportion of MT mule deer (22 total animals) to the proportion of WY/OR mule deer (41 total animals) that have that CrERV. Of the 266 CrERV identified in M273, 166 CrERV are found in a higher proportion of MT mule deer than WY/OR mule deer, suggesting that the majority of CrERV identified in M273 are more representative of the Montana population (Figure 2.7). A test of equal proportions indicates that only 5 CrERV are found in a significantly higher proportion of WY/OR animals, while 11 CrERV are found in a significantly higher proportion of MT animals. Although the distribution of an ERV in the population may be affected by non-evolution-related events such as geographic barriers and population bottlenecks, ERVs that are found in fewer animals in a population are likely to be recent insertions. Thus, we performed a similar analysis on M273 CrERV that are shared amongst 6 or fewer Montana animals, since these most likely represent more recent CrERV integrations. Of the 97 M273 CrERV that are found in 6 or fewer Montana animals, 46 CrERV are found in a higher proportion of WY/OR mule deer than MT mule deer. This suggests that the majority of younger M273 CrERV are also more representative of the Montana population (Figure 2.8). We also evaluated the distribution of CrERV that are missing from M273 in the other Montana animals (Figure 2.9). Of the 1456 CrERV that are not found in M273, 55% are found in only one other Montana animal, and less than 5% of these CrERV are found in more than 8 Montana animals. This suggests that M273 is not missing any CrERV that are found in all other animals from Montana, further supporting that CrERV distribution in M273 is typical of a Montana mule deer.

Figure 2.5: Phylogeny of representative full-length CrERV from M273.

CrERV are colored according to phylogenetic affiliation. Red color indicates CrERVγ-in7, blue color indicates CrERVγ-in3, orange color indicates CrERVγ-in15, purple color indicates CrERVγ-in6, green color indicates CrERVγ-in12, and black color indicates CrERVγ-in1. This figure is adapted from a figure published in [212], which was generated using a general time reversible nucleotide substitution model, a calibrated Yule tree prior, and a lognormal relaxed clock.

[Figure 2.6 plot: density plot titled "Density plot of Distribution of M273 CrERV across MT population". x axis: Number of MT Animals; y axis: density. Legend (Phyl.f): in1, in12, in3, in6, in7, solo.]

Figure 2.6: Distribution of CrERV identified in M273 throughout the Montana population.

CrERV were separated into phylogenetic groups or solo-LTRs, and plotted as separate distributions. The y axis indicates the number of CrERV that are found in a given number of Montana animals, using kernel smoothing to plot values.

[Figure 2.7 plot: Distribution of M273 CrERV in WY/OR. x axis: Cluster; y axis: WY/OR proportion relative to MT proportion; legend: WY/OR proportion higher than MT, WY/OR proportion lower than MT.]

Figure 2.7: Proportion of WY/OR animals that have each M273 CrERV.

For each CrERV in M273, the proportion of MT animals (22 total animals) and WY/OR animals (41 total animals) in which that CrERV was found was calculated. To calculate WY/OR proportion relative to MT proportion, the MT proportion was subtracted from the WY/OR proportion for each CrERV integration. Negative values of WY/OR proportion, displayed in the graph as blue bars, indicate that a CrERV is found in more MT animals. Positive values of WY/OR proportion, displayed in the graph as red bars, indicate that a CrERV is found in more WY/OR animals.
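The calculation described in this caption, together with the test of equal proportions mentioned in the text, can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the pooled two-sample z-test is an assumed form of the "test of equal proportions" (the exact test variant and any continuity correction are not stated), the function names are hypothetical, and the example counts are invented for illustration only.

```python
import math

N_MT = 22    # Montana animals (from the text)
N_WYOR = 41  # Wyoming/Oregon animals (from the text)


def relative_proportion(mt_count, wyor_count, n_mt=N_MT, n_wyor=N_WYOR):
    """WY/OR proportion minus MT proportion for one CrERV integration.

    Negative values mean the CrERV is found in a higher proportion of
    MT animals; positive values mean a higher proportion of WY/OR animals.
    """
    return wyor_count / n_wyor - mt_count / n_mt


def two_proportion_z_test(mt_count, wyor_count, n_mt=N_MT, n_wyor=N_WYOR):
    """Two-sided pooled z-test of equal proportions (assumed test form).

    Returns (z, p_value); z is negative when the MT proportion is higher.
    """
    pooled = (mt_count + wyor_count) / (n_mt + n_wyor)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_mt + 1 / n_wyor))
    if se == 0:
        # CrERV present in all animals or in none: proportions are identical.
        return 0.0, 1.0
    z = (wyor_count / n_wyor - mt_count / n_mt) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (MT animals, WY/OR animals) for illustration only.
example_counts = {"locus_A": (20, 10), "locus_B": (3, 30)}
for name, (mt, wyor) in example_counts.items():
    diff = relative_proportion(mt, wyor)
    z, p = two_proportion_z_test(mt, wyor)
    print(name, round(diff, 3), round(z, 2), round(p, 4))
```

With 266 loci tested, a correction for multiple comparisons would normally be considered; whether one was applied is not stated here.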

[Figure 2.8 plot: Distribution of M273 CrERV that are found in less than 6 MT animals. x axis: Proportion relative to Montana; y axis: Cluster; legend: MT proportion, WY/OR proportion.]

Figure 2.8: Proportion of WY/OR animals that have each CrERV from M273 found in 6 or fewer Montana animals.

For each CrERV in M273, the proportion of MT animals (22 total animals) and WY/OR animals (41 total animals) that contain a CrERV was calculated. To calculate WY/OR proportion relative to MT proportion, the MT proportion was subtracted from the WY/OR proportion for each CrERV integration. The relative WY/OR proportions are represented on the dot plot as blue dots. Negative values of WY/OR proportion indicate that a CrERV is found in more MT animals. Positive values of WY/OR proportion indicate that a CrERV is found in more WY/OR animals. The MT proportion for each CrERV integration is plotted at zero on the above dot plot, represented by red dots.

[Figure 2.9 plot: Frequency of CrERV missing from M273 in MT population. x axis: Animals per CrERV (of 22 animals); y axis: Frequency; marked percentiles: 55th (1 animal), 80th (3 animals), 85th (4 animals), 90th (5.5 animals), 95th (8 animals).]

Figure 2.9: Prevalence in Montana of CrERV that are not found in M273.

The y axis represents the number of CrERV found in a given number of Montana animals. Various percentiles of the distribution are indicated on the graph.

2.4 Discussion

Using CrERV-host junction fragment libraries, we were able to investigate the distribution of insertionally polymorphic ERVs in a population of mule deer. We determined that the mean number of CrERV in a Montana mule deer is 263 CrERV and M273, our representative animal, has a similar number of CrERV as other Montana animals. CrERV are insertionally polymorphic, and 32% of all CrERV are found in only one animal, termed singletons. M273 also has a similar number of singletons relative to total CrERV as other animals from the Montana population. Additionally, when investigating shared CrERV, M273 clusters with animals from Montana, and not animals from other populations. Together, these data support that in terms of number of CrERV, M273 was a typical Montana mule deer. Using de novo genome assembly and computational techniques for CrERV reconstruction, we determined the sequence and phylogenetic affiliation of CrERV in M273. Combining these two datasets, we were able to evaluate the distribution of phylogenetic groups of M273 CrERV in the Montana population. These distributions were consistent with coalescent estimates of virus ages. The majority of M273 CrERV are found in a higher proportion of Montana animals than animals from other populations. Additionally, M273 is not missing any CrERV widespread in Montana, further supporting that M273 is a typical Montana mule deer.

According to this analysis, there are 135 to 343 CrERV per Montana animal, which is consistent with previous estimates of 50-150 CrERV copies per haploid genome [184]. The CrERV-host junction fragment library analysis supports previous evidence [184, 185] that CrERV are insertionally polymorphic, as 53% of all CrERV are found in 2 or fewer animals in the population (Figure 2.2). Insertionally polymorphic ERV may be due to evolutionarily younger virus age [187] and/or active retrotransposition within the host [182, 214]. We identified 27 CrERV that are shared amongst all animals in the Montana population and are fixed. Analysis of the full dataset reveals that 24 of these CrERV are found in all animals in Montana and two additional mule deer populations, suggesting that these CrERV are widespread throughout mule deer. Although we could only determine the phylogenetic affiliation of a subset of these, widespread CrERV are primarily related to the in1 phylogenetic group or are solo-LTRs. The CrERVγ-in1 group is the oldest phylogenetic group, so we expect these to be widespread in the population. Using CrERV-host junction fragment analysis, we can identify shared CrERV integration sites, but we are unable to differentiate between a solo-LTR and the 3′ LTR of a virus with coding sequence at a given locus. For this reason, CrERV integrations that have been identified as solo-LTRs in M273 have the potential to be full-length CrERV in other animals; however, we do not know with confidence the phylogenetic group of M273 solo-LTRs. 
None of the fixed CrERV belonged to the CrERVγ-in6 or CrERVγ-in7 phylogenetic groups, which were determined using coalescent estimates to be more recent integrations and are thus not expected to be fixed in the population. Overall, CrERV belonging to older phylogenetic groups, such as CrERVγ-in1 and CrERVγ-in3, are widespread in Montana. CrERV belonging to the younger phylogenetic groups, such as CrERVγ-in6 and CrERVγ-in7, are not well-distributed throughout Montana individuals. In general, the CrERV found in M273 are more characteristic of the Montana population than other populations of mule deer in the dataset. As expected, M273 clusters with other Montana animals in a PCA when considering shared CrERV. Additionally, the majority of CrERV in M273 are found in a larger proportion of Montana animals than Wyoming/Oregon animals. This is expected for older CrERV that have been in the Montana population longer. Unexpectedly, most CrERV that are not widespread in Montana are also not widespread in other populations. Despite this, there are some CrERV from M273 that are found in a much higher proportion of the Wyoming/Oregon population than the Montana population. Because these CrERV are widespread in other mule deer populations, they are likely not younger CrERV but may represent CrERV that have recently entered the Montana population due to mule deer migration from other states. In conclusion, CrERV-host junction fragment sequencing can be used to establish the diversity of shared ERV in a population of mule deer without individual genome sequencing of each animal. In addition, long-insert mate-pair sequencing can be used to reconstruct CrERV integrations in a de novo assembly of an individual mule deer genome. Together, we used these data to establish that M273, our representative mule deer, is typical of a Montana animal. M273 has a similar number of total CrERV and singletons as other mule deer in the Montana population, M273 shares similar CrERV with other Montana animals, and the CrERV in M273 are more characteristic of the Montana population than surrounding populations. In subsequent chapters of this dissertation, M273 will be used as a representative animal to determine transcriptionally active CrERV and potential impacts on mule deer gene expression.

Chapter 3 | Evolutionary implications of endogenous retrovirus expression on host genome evolution

3.1 Introduction

Endogenous retroviruses (ERVs) are genetic elements acquired by infection of a germline cell with an infectious retrovirus. Following integration into the host genome, the provirus, containing the viral genes and LTR regulatory elements, is inherited from parent to offspring. Once integrated, ERV copy numbers can increase within the genome via retrotransposition to new sites [215] or reinfection if the ERV can produce a replication-competent virus [216]. ERVs have the potential to contribute to host genome structural variations, particularly through recombination. Recombination events can also lead to formation of solo-LTRs, where the virus genes are removed and a single LTR remains in the host genome [10]. Solo-LTRs and ERV proviruses can also have functional consequences on host gene expression. ERV LTRs have contributed regulatory sequences to host genes [118,123,126,129,130,137,217], and ERV RNAs and proteins contribute to essential host processes [137,144]. ERV transcription, however, can also lead to insertional mutagenesis and negatively impact gene expression [50]; thus, ERVs are silenced by several host mechanisms, including DNA methylation and histone modifications [80,218]. Data suggest that ERV silencing occurs soon after the initial integration event [219]; thus, any functional impact of ERVs on host processes evolves over time after mutations have altered the ERV sequence and regulatory activity [220,221]. Increasing evidence suggests that ERV expression may be linked to their location in the genome. Transcriptionally active ERVs are often enriched near genes [118,148], despite an overall exclusion of ERV from genes and nearby regions [12]. Unlike in plants where DNA methylation has been demonstrated to spread from transposable element copies into nearby genes [150,151], there are few examples of methylation spreading from mammalian ERVs into adjacent genes [152]. Alternatively, active marks can spread from genes into nearby ERV copies [149], and infectious proviruses integrated into expressed CpG island gene promoters are also transcriptionally active [222], suggesting that the methylation or chromatin status of the ERV integration site may affect ERV expression. ERV de-repression typically results in increased expression of proximal genes [96,97], further supporting that ERVs and genes are epigenetically co-regulated. In contrast to ERVs that integrated prior to a speciation event and are in all individuals in a contemporary species, insertionally polymorphic ERVs are present at varying frequency among strains [181,182], breeds [183], and subspecies [177], and within populations [185, 187]. Insertional polymorphism is typically attributed to ongoing ERV retrotransposition or recent ERV integration. Given the potential impacts of ERVs on the host genome, ERV insertional polymorphism could result in phenotypic variation among individuals or populations. We previously identified CrERV transcripts in mule deer lymph nodes [146,184], but did not know the location of the transcriptionally active CrERV. We have also demonstrated extensive insertional polymorphism of CrERV within and between mule deer populations [184,185,210]. 
A detailed genome assembly and reconstruction of CrERV sequences within a representative mule deer [212] indicate that there are multiple CrERV lineages as a result of distinct epizootics over the past 200,000 years [185, 212]. This provides a unique system containing numerous recently integrated ERVs with diverse sequences that are variably present in the population, allowing us to investigate transcriptionally active ERVs at multiple evolutionary time points. The goals of this study were to determine the genomic location of transcriptionally active CrERV and test if CrERV expression was a byproduct of genomic location. We identified four transcriptionally active CrERV and determined that although these CrERV are close to host genes, ERV proximity to genes alone does not dictate CrERV transcriptional activity. Some transcriptionally active CrERV are phylogenetically older, but one CrERV is evolutionarily young and provides a window on transcriptionally active ERV effects closer to the time of integration. Our data indicate that transcriptionally active CrERV are widespread in the population and maintained in the provirus/solo-LTR configuration in all animals.

3.2 Materials and Methods

Animal Tissues and Nucleic Acid Extraction

Mule deer retropharyngeal lymph nodes were obtained from legally hunted animals brought to hunter check stations in Montana. Tissues were stored in RNAlater (Ambion). Total RNA was extracted from mule deer lymph nodes using Trizol following the manufacturer’s protocol. Briefly, 30 mg of RNAlater-preserved lymph node tissue was minced with a scalpel and placed into 1 mL of Trizol reagent for 1 hour, with vortexing every 10 minutes to promote tissue breakage. The tissue was then centrifuged at 12,000 × g at 4°C for 10 minutes to pellet DNA and insoluble tissue. RNA extraction then proceeded following the manufacturer’s instructions, substituting 100 µL 1-Bromo-3-chloropropane during phase separation. RNA was quantified using a Quantifluor fluorimeter and RNA measuring kit. A volume of RNA containing 6 µg total nucleic acid was treated with 2 µL TurboDNase for 60 minutes according to the manufacturer’s instructions. Total RNA was quantified again using the Quantifluor fluorimeter and RNA kit, and removal of DNA was verified by a CrERV-specific PCR (primers = in6887dF: 5′-GGGAACATGGTGGCCCRTTTTGAC-3′ and in7109dR: 5′-GTCCCGGTRGTTTCACATCCC-3′).

Genomic DNA was extracted from RNAlater-preserved mule deer lymph nodes using the QIAamp DNA Micro Kit following the manufacturer’s protocol for tissue samples. Briefly, a 10 mg piece of lymph node was added to 180 µL Buffer ATL and equilibrated to room temperature. After adding 20 µL of proteinase K (NEB), the tissue was placed in a 56°C water bath for overnight tissue lysis. Next, 200 µL Buffer AL and 200 µL 100% ethanol were added and the mixture was pulse vortexed and spun through a QIAamp MinElute column. The column-bound DNA was then washed with 500 µL Buffer AW1 followed by 500 µL Buffer AW2. The membrane was then centrifuged at max speed for 3 min to dry. DNA was eluted in 100 µL Buffer AE and quantified using the Quantifluor dsDNA system (Promega).

Bisulfite Sequencing Analysis

A previously published analysis of CrERV integration sites, based on targeted NGS sequencing [210], was adapted to allow the assessment of CrERV LTR methylation. The bisulfite sequencing strategy was adapted from previous analyses of mobile element methylation [223, 224]. Briefly, mule deer genomic DNA was digested with DNA fragmentase, end-repaired, and ligated to oligonucleotide adapters. The reaction was then treated with sodium bisulfite, which converted unmethylated cytosines to uracils, while methylated cytosines were unaffected. The next step was PCR amplification, with one primer targeting the adaptor sequence and the other primer compatible with the bisulfite-converted CrERV 5′ LTR sequence. The sequence of this primer was 5′-CCATCTCATCCCTGCGTGTCTCCGACTCAGxxxCAAAAAAAAAATTTATTACTAACTC-3′, where the underlined part is the A-adaptor, TCAG is the key sequence for signal calibration, x denotes the position of the barcode sequence, and the remaining 3′ end sequence is complementary to a conserved region in the CrERV LTR after in silico bisulfite conversion. Untreated genomic DNA was also subjected to the same analysis. The primer used for untreated genomic DNA was 5′-CCATCTCATCCCTGCGTGTCTCCGACTCAGxxxxxCCAAGAGACAATGCAAAACGCAAG-3′, where the underlined part is the A-adaptor, TCAG is the key sequence for signal calibration, x denotes the position of the barcode sequence, and the remaining 3′ end sequence is complementary to a conserved region in the CrERV LTR. The PCR amplicons were then subjected to 454 sequencing. The resulting set of sequences contained almost the entire 5′ LTR and the adjacent sequences of the flanking deer genomic DNA of the integration site. Comparison of the bisulfite-treated and untreated sequences allows the determination of LTR methylation levels for each CrERV integration individually. Reads that contained only CrERV sequence were removed from the analysis. 
The remaining reads were mapped to the mule deer scaffolds using BLAST [168] and manually checked to ensure accurate mapping.
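The logic of comparing bisulfite-treated reads with the untreated sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used here: it assumes a read already aligned to the reference LTR on the same strand with no gaps, it treats uracil as T (as it appears after PCR), and all function names are hypothetical.

```python
def cpg_positions(reference):
    """Return 0-based indices of the C in every CpG dinucleotide."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def bisulfite_convert(reference, methylated=()):
    """In silico conversion: every C becomes T unless its index is
    listed as methylated (methylated cytosines are protected)."""
    protected = set(methylated)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(reference.upper())
    )


def call_cpg_methylation(reference, treated_read):
    """Call each CpG of the reference in one bisulfite-treated read.

    Returns {position: True (C retained, methylated),
                       False (read as T, unmethylated),
                       None (any other base, not callable)}.
    """
    if len(reference) != len(treated_read):
        raise ValueError("sketch assumes a gap-free, full-length alignment")
    read = treated_read.upper()
    calls = {}
    for pos in cpg_positions(reference):
        if read[pos] == "C":
            calls[pos] = True
        elif read[pos] == "T":
            calls[pos] = False
        else:
            calls[pos] = None
    return calls


def read_is_fully_unmethylated(reference, treated_read):
    """True when the read has at least one CpG and all are unmethylated."""
    calls = call_cpg_methylation(reference, treated_read)
    return bool(calls) and all(state is False for state in calls.values())


# Toy reference with CpGs at positions 1 and 5 (not a CrERV sequence).
toy_reference = "ACGTTCGA"
toy_read = bisulfite_convert(toy_reference, methylated=[1])
print(call_cpg_methylation(toy_reference, toy_read))
```

Counting the reads at a locus for which `read_is_fully_unmethylated` holds gives values in the style of "unmethylated reads out of total reads" per integration.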

Table 3.1: Primers used for spliced env amplification and cloning.

Primer       Sequence 5′-3′
in7_797F     ATCCCGCGGCAGTTGACC
in7_8535R    GCCTTAGCACCATAATCTGGATAGTATT
I7_796F      GATCCCGCGGCAGTTGACCA
M8257R       AGAGCTGCGCAGACCCCACCTT

cDNA synthesis

Total RNA was reverse transcribed into cDNA using the AffinityScript Multiple Temperature cDNA synthesis kit, following the manufacturer’s instructions for either Random Primers or a gene-specific primer. Briefly, 1 µg of total RNA was incubated with either 300 ng Random Primers or 100 ng of a CrERV-specific primer (I7_8555R: 5′-GCCGGTATTGTTGCCTTAGC-3′) and RNase-free water for 5 minutes at 65°C before cooling at room temperature for 5 minutes. Next, 10X AffinityScript Buffer, 100 mM dNTPs, RNase Block, and AffinityScript Reverse Transcriptase were added following the manufacturer’s protocol.

Transcript amplification, cloning, and Sanger sequencing

Spliced envelope transcript sequences were amplified from cDNA generated with the CrERV-specific primer (I7_8555R) using Takara ExTaq polymerase following the manufacturer’s instructions. Primers are listed in Table 3.1. PCR-amplified sequences were then cloned into the TA cloning vector (Invitrogen) and processed using the Qiagen Miniprep Kit. Transcripts were then sequenced using Sanger sequencing at the Penn State Genomics Core Facility and analyzed using the LaserGene suite by DNAstar. To account for sequencing errors in the transcripts, unique sequences within each transcript were identified. Sub-sequences of fixed length (k-mers) were extracted from the unique regions of each transcript and queried against the whole genome sequencing data of M273 to confirm the presence of the virus sequence in the mule deer genome.

PCR for individual CrERV loci

To determine the allelic state of the CrERV integration in select animals, PCR primers (Table 3.2) were designed to the genomic region flanking a CrERV integration, which amplified the pre-integration site if the CrERV is absent. PCR was performed on mule deer genomic DNA using Promega GoTaq Long PCR Master Mix and 0.4 µM of each primer. Thermocycling conditions in a BioRad T100 Thermal Cycler were 95°C for 3 min, then 36 cycles of 95°C for 30 s, 62-65°C for 30 s, 72°C, and a final extension of 72°C for 10 min. PCR products were analyzed on a 1% agarose gel with the NEB kb ladder. PCR bands that corresponded to the solo-LTR and the pre-integration site were gel isolated and Sanger sequenced for confirmation.

Table 3.2: Primers used for CrERV locus-specific PCR.

Primer      Sequence 5′-3′               CrERV
S386F       GGACTTGTCTTGCCGAGTGA         S386
S386R       TTCACCGACATCCTCACTGC         S386
S2220aF     AGGAGGCCAGTGTGAAGGAGTAAT     S2220
S2220aR     AAGCACAGACCCGAAGTAAACGAC     S2220
S26536F     TTGCAAAAGCTGAACTCCGC         S26536
S26536R     TGGATGCCACCATAGGGAGA         S26536
S1645aF     GACTTATGTAGCGGGCCTTCCTGA     S3442
S1645aR     CCTGTCGGGCCTTCTCTAATACCA     S3442

Computational analyses to assess genomic duplication

To assess whether transcriptionally active CrERVs are located in duplicated genomic regions, we conducted a search using mapping-based and k-mer-based approaches. Using the 30x paired-end whole genome sequencing data reported previously [212], we mapped the reads with the default settings of bwa mem [225]. We calculated the read coverage depth of the beta-actin (ACTB) gene and of the 20 kb of host genome [212] flanking each candidate transcribing CrERV using the default parameters of bedtools [226] genomecov. The depth of the regions flanking candidate CrERVs was visualized in IGV [227] to identify continuous regions with a read coverage depth multiple times that of ACTB. The k-mer frequency table was generated using dsk [228] with a k-mer size of 60 and a frequency cutoff of 3. Frequencies of 60-mers corresponding to the ACTB gene and to the host-flanking regions of candidate CrERVs were compared to identify potential duplication around the candidate CrERV integration sites.
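The k-mer comparison can be sketched as follows. This is a minimal illustration of the idea rather than a reimplementation of dsk: it counts k-mers on one strand only (no reverse-complement handling), it summarizes each region by the median frequency of its k-mers (a choice made for this sketch), and it interprets the frequency cutoff as a minimum count. All function names are hypothetical, and the toy reads use k = 4 instead of 60.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_table(reads, k, min_count=1):
    """Count k-mers across reads, keeping those seen at least min_count times."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def median_kmer_frequency(region, table, k):
    """Median read-derived frequency of the k-mers found in a region."""
    freqs = [table.get(kmer, 0) for kmer in kmers(region, k)]
    return median(freqs) if freqs else 0


def duplication_ratio(flank, control, table, k):
    """Frequency of flank k-mers relative to a single-copy control region.

    A ratio near 1 is consistent with a single-copy flank; a ratio of
    roughly 2 or more would suggest the flank lies in a duplicated region.
    """
    control_freq = median_kmer_frequency(control, table, k)
    if control_freq == 0:
        raise ValueError("control region has no k-mer support in the table")
    return median_kmer_frequency(flank, table, k) / control_freq


# Toy data: the "flank" is sequenced twice as deeply as the "control".
toy_reads = ["ACGTAC"] * 4 + ["TTGCAA"] * 2
toy_table = build_kmer_table(toy_reads, k=4)
print(duplication_ratio("ACGTAC", "TTGCAA", toy_table, k=4))
```

In the analysis described above, the control role is played by ACTB and the flank by the host sequence around each candidate CrERV.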

Table 3.3: Primers used for CrERV qPCR.

Primer           Sequence 5′-3′               Gene Target
MDactin1052cF    CATGGCGGGGGTGTTGAAGGTCT      Actin
MDactin1594cR    GCGTGTGGCCCCCGAGGAG          Actin
QMD_G6pdhF       TGACCTATGGCAACCGATACAA       G6pdh
QMD_G6pdhR       CCGCAAAAGACATCCAGGAT         G6pdh
QMDin1_1665F     AGAGGAACGAGATAAGGCACGAAC     CrERVγ-in1
QMDin1_1809R     CTGGTGGGGGTTGGTCTGGAG        CrERVγ-in1
QMDin12_2563F    TTAATGAATGTGGGGGAGTTG        CrERVγ-in12
QMDin12_2680R    GAGAGCATRTTTGGGTGTTCG        CrERVγ-in12
QMD7076F         TGGGGGAGTTGATTCTTTTTATTG     CrERVγ-in3
QMD7173R         ACGGTGGTTGGGATCTGACTTTAC     CrERVγ-in3
QMD6644F         CAGAGGAACGAGATAAGGCAC        Total env
QMD6732R         CCATCAGCAAGAGAACTAGGAG       Total env
QMD798F          TCCCGCGGCAGTTGACCA           Spliced env
QMD6668R         TGAGCATTCGCCTTCCATTCTG       Spliced env
QMD2401F         CTGCGACCTAACTGGGATTT         Total gag
QMD2519R         GCTCTTGCCTCACCAGATTTA        Total gag

Quantitative PCR of CrERV

CrERV expression levels in Random Primer cDNA were measured by real-time quantitative PCR (qPCR) using the iQ SYBR Green Supermix Reagent (Bio-Rad) and a Bio-Rad iQ5 Real-Time PCR detector system (Bio-Rad). Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 40 cycles of 10 seconds at 95°C, 10 seconds at the primer-specific annealing temperature (58°C or 60°C), and 20 seconds at 72°C. Primers are listed in Table 3.3. A melt curve analysis was performed to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. CrERV expression data were normalized against beta-actin and G6pdh as housekeeping genes.
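Absolute quantification from a plasmid standard curve follows a standard calculation, sketched below. This is an illustration of the general approach, not the procedure internal to the Bio-Rad software: the function names are hypothetical, the dilution series and Cq values are invented, and combining the two housekeeping genes by their geometric mean is an assumption, since the text does not state how beta-actin and G6pdh were combined.

```python
import math


def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_cq(cq, slope, intercept):
    """Invert the standard curve to estimate starting copy number."""
    return 10 ** ((cq - intercept) / slope)


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1 / slope) - 1


def normalize(target_copies, reference_copies):
    """Divide target copies by the geometric mean of the reference genes
    (assumed normalization scheme)."""
    log_mean = sum(math.log(c) for c in reference_copies) / len(reference_copies)
    return target_copies / math.exp(log_mean)


# Invented 10-fold plasmid dilution series and Cq values, for illustration.
standard_copies = [1e2, 1e3, 1e4, 1e5]
standard_cq = [30.0, 26.68, 23.36, 20.04]
slope, intercept = fit_standard_curve(standard_copies, standard_cq)

crerv = copies_from_cq(24.5, slope, intercept)
actin = copies_from_cq(18.2, slope, intercept)   # a real assay would use
g6pdh = copies_from_cq(22.9, slope, intercept)   # one curve per amplicon
print(round(amplification_efficiency(slope), 3), normalize(crerv, [actin, g6pdh]))
```

Each primer pair would normally have its own standard curve; a single curve is reused above only to keep the sketch short.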

Mule deer gene transcript screening and quantification

Samples were screened for gene transcript splicing differences using PCR. Briefly, cow transcripts of each gene of interest were mapped to the mule deer genome and primers were designed to evaluate gene transcript variation in animals with and without the CrERV. Gene transcripts were quantified from Random Primer cDNA by qPCR using the iQ SYBR Green Supermix Reagent (BioRad) and a Bio-Rad iQ5 Real-Time PCR detector system. Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 40 cycles of 10 seconds at 95°C, 15 seconds at the primer-specific annealing temperature (53°C for LTR-driven KXD1, 60°C-62°C for others), and 20 seconds at 72°C. Primers are listed in Table 3.4. A melt curve analysis was performed to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. Gene expression data were normalized against beta-actin expression. To quantify FBXO42, gene-specific cDNA was made using a primer in exon 4 of FBXO42 (E4_7659R: 5′-TGCAGCCTCCAAACACATACATAG-3′) and was purified using the Agencourt RNAClean XP system (Beckman Coulter) following the manufacturer’s instructions for single-stranded DNA purification. Purified cDNA was eluted in 30 µL water and quantified using the Quantifluor ssDNA system (Promega). FBXO42 expression was quantified from purified cDNA using the iQ SYBR Green Supermix Reagent (BioRad) and a Bio-Rad iQ5 Real-Time PCR detector system. Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 40 cycles of 10 seconds at 95°C, 15 seconds at the primer-specific annealing temperature (52°C for Total FBXO42, 58°C for Exon 1A, 61.5°C for Exon 1B, and 54°C for Exon 1C), and 20 seconds at 72°C. A melt curve analysis was performed to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. FBXO42 exon 1 expression was normalized to total FBXO42 (exon 2-exon 3) expression within each animal.

Table 3.4: Primers used for Gene qPCR

Primer              Sequence 5′-3′                Gene Target
MDSIRT6_679F        GAGGAGTTGGAGAGGAAGGTGTGG      Total SIRT6 (Exon 2)
MDSIRT6_809Rq       CCCCGCTCCTCCATCGTCC           Total SIRT6 (Exon 3)
MDSIRT6_LTR348Fq    CATGGCGTTATGTTTGCTCTCC        LTR-driven SIRT6 (LTR)
MDSIRT6_525Rq       CCGGAAATAGGGTGGGACGAG         LTR-driven SIRT6 (UTR)
MDKXD1_739F         GCAGCGGAGGAGGAGGAAGAG         Total KXD1 (Exon 2)
MDKXD1_878R         GTTGAGCAGCATCTCGTTGGTCT       Total KXD1 (Exon 3)
MDKXD1_LTR2Fq       TTCCTACGTTCCGTTTGTTCTTC       LTR-driven KXD1 (LTR)
MDKXD1_70Rq         GAAAATCATGCCTTTTTGTAGTCTG     LTR-driven KXD1 (UTR)
MDKXD1_719Fq        CAAGGCCGCCAGATGTGC            Canonical KXD1 (Exon 1)
MDKXD1_790Rq        ATGCTCAGGATGCGGCTACAGAA       Canonical KXD1 (Exon 2)
E2for9742q          TACCGGCTTATTAAAGGTGTAGCC      Total FBXO42 (Exon 2)
E3rev9822q          GGGATAAGGGTATGTCCGACTC        Total FBXO42 (Exon 3)
X3for538q           GACGCGATTTGTCTAGAGGTT         FBXO42 Exon 1A-Exon 2 (for)
E2rev9516q          GTCATCTTCACTGTCGGAGGAG        FBXO42 Exon 1A-Exon 2 (rev)
E1for9211q          AGCTGTGAGGAGTCCCGAGT          FBXO42 Exon 1B-Exon 2 (for)
E2rev9519q          CTGTCATCTTCACTGTCGGAG         FBXO42 Exon 1B-Exon 2 (rev)
X2for9459q          CAGACAACAGCGCAGTAACC          FBXO42 Exon 1C-Exon 2 (for)
E2rev9519q          CTGTCATCTTCACTGTCGGAG         FBXO42 Exon 1C-Exon 2 (rev)

3′ Rapid amplification of cDNA ends (RACE)

3′-RACE analysis of FBXO42 was performed following a modified protocol [229]. Briefly, 200 ng of ribosomal RNA-depleted RNA was reverse transcribed using Qtotal_TM primer (5′-ACGCTGACGCAGAGTGACGAGGACTCGGTCGCTGACTTTTTTTTTTTTTTTTT-3′) and AffinityScript Multiple Temperature RT (Agilent). First-round PCR was performed using Qouter2_TM primer (5′-ACGCTGACGCAGAGTGACG-3′) and gene-specific primer (GSP) 1 (5′-TGGCGATCGCAAACGGACTA-3′). Second-round PCR was performed using Qinner5b_TM (5′-GAGGACTCGGTCGCTGAC-3′) and GSP 2 (5′-AGCTCCGGGCTGGGTGTAG-3′). All PCRs were performed using Phusion HotStart Flex Polymerase (NEB). PCR products were visualized on a 1% gel and gel-extracted using the Qiaquick Gel Extraction kit (Qiagen), cloned into the pMiniT vector (NEB), and Sanger sequenced.

RNAseq Library Preparation

Illumina RNASeq libraries were prepared from RNA from animal M273 using the TruSeq Stranded mRNA LT Kit with RiboMinus Human/Mouse Module (Invitrogen), which removes ribosomal RNA from the sample. Following the manufacturer’s protocol, about 5.5 µg DNase-treated total RNA was incubated with 8 µL of RiboMinus Probe and 300 µL B5 Hybridization Buffer at 71°C for 5 min, then cooled in a 37°C water bath for 30 min. After the hybridized sample cooled, the sample was transferred to 200 µL prepared RiboMinus beads and incubated at 37°C for 15 min. Ribosomal RNA-depleted RNA was then ethanol precipitated using 3M Sodium Acetate to increase the concentration for library preparation. RNA quality before and after rRNA depletion was determined using BioAnalyzer (Agilent) peak analysis. Libraries were created by the Penn State Genomics Core Facility using the TruSeq Stranded mRNA kit without poly A selection. The RNA sample was sequenced on a single HiSeq Rapid Run using 150 nt single read sequencing.

RNASeq Analysis

Duplicate reads in the RNA-seq data were removed using the default settings of FastUniq [230]. Then, adapters and low-quality ends of RNA-seq reads were trimmed using Trimmomatic [231] with the parameters ‘ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’, where ‘TruSeq3-SE.fa’ contains the sequence of standard TruSeq3 adapters. The processed reads were mapped using both tophat [232] and bwa [233] with the default settings.

Statistical Analyses

Fisher’s exact tests, performed using R statistical software, were used to evaluate CrERV distribution in the genome. Mann-Whitney U tests and Kruskal-Wallis tests with post-hoc Dunn’s multiple comparison tests were performed using R statistical software to analyze qPCR data.
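For readers unfamiliar with the rank-based comparison applied to the qPCR data, the Mann-Whitney U statistic can be sketched as follows. This is a minimal illustration, not the R routine used in the analysis: it uses the normal approximation without tie or continuity correction, so its p-values will differ from exact small-sample values, and the function names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values starting at 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = tied_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) for two independent samples.

    U is the smaller of the two U statistics; p uses the normal
    approximation without tie or continuity correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Invented normalized expression values for two groups of animals.
with_crerv = [0.82, 1.10, 0.95, 1.30]
without_crerv = [0.40, 0.55, 0.61, 0.38]
print(mann_whitney_u(with_crerv, without_crerv))
```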

3.3 Results

Identification of transcriptionally active CrERV loci in M273

We first determined which loci contained a CrERV that had the potential to be transcriptionally active by assessing the CpG methylation status of all LTR in our representative mule deer (M273) and one additional Montana animal. We identified seven 50 LTRs with at least 2 reads that were unmethylated at all CpG

57 Table 3.5: Summary of Bisulfite Analysis of 50 LTRs in M273

CrERV with at least two unmethylated reads that were identified in bisulfite analysis of two Montana animals. Values in parentheses under ‘Methylation Status’ indicate number of unmethylated reads out of total reads that map to that locus. Virus Phylogenetic Group Distance to Gene (bp) Closest Gene Methylation Status S386 In3 216 SIRT6 Unmethylated (11/11) S2220 In12 399 ISY1 Unmethylated (9/9) S26536 In1 357 KXD1 Unmethylated (7/7) S9905 In1 8544 ZMYND15 Partial Unmethylated (3/18) S7094 solo-LTR 112753 TRIM44 Partial Unmethylated (2/11) S10945 In12 122041 CG25C Partial Unmethylated (2/14) S3442 in12 intronic FBXO42 Partial Unmethylated (2/9)

sites and common in both animals. We assigned these 50LTRs to CrERV loci in the mule deer genome assembly (Table 3.5). Three CrERV LTRs were completely unmethylated in the M273 bisulfite dataset, and all reads that mapped to those loci were unmethylated at all CpG sites in the LTR. Six CrERV LTR were partially unmethylated, with bimodal methylation patterns likely due to heterogeneity of the lymph node tissue sample (Figure 3.1). Eight of the candidate 50LTRs with evidence of being unmethylated were assigned to proviral CrERV and one was assigned to a solo-LTR. Other CrERV LTRs with unmethylated reads were identified in only a single animal and were partially unmethylated, thus were not included in this analysis. We assessed transcriptional activity by sequencing CrERV transcripts in M273. We focused on identifying env transcripts because they are produced from a spliced transcript, which allowed us to avoid genomic DNA contamination (Figure 3.2). We identified transcripts phylogenetically related to the CrERVγ-in3 group, CrERVγ- in1 group, and CrERVγ-in12 group in M273. There was higher diversity in the CrERVγ-in12 transcripts, and two distinct transcript sequences were identified. Although many clones were recovered, there was little diversity in the CrERVγ- in3 and CrERVγ-in1 related transcripts, suggesting a single CrERV provirus was producing these transcripts. We searched our dataset of reconstructed CrERV in M273 [212] and matched the transcripts to S386, S26536, S2220 and S3442 (Figure 3.3), which are four of the eight CrERV that were identified in the bisulfite analysis and include all three of the loci with completely unmethylated LTR (Table 3.5). RNAseq data confirmed expression of these CrERV transcripts, with higher read coverage of the env gene (Figure 3.4). There was no evidence of transcription of the other unmethylated candidates identified in the bisulfite analysis using the same

[Figure 3.1 panels: lollipop diagrams not reproduced; LTR sequences follow]

a. Fully Unmethylated – S386, S2220, S26536

TGTAGAATGTAGGGAGAGCAAACATAACGCCATGACAAAAGGCAGAAGCAGGAAGACC
AGGCCCTCACCCTGGCAACTAAGGCGACTATCAGGTCACCAAGACACAACCAGAGACTA
TCAGGCCACCCAGATACAGCCAATCAGAACTTGTTTACGGAAAGATCCCGCGCGCGCGA
ATGTCTGACCAATGAAAAGACCCCCGCGAACATGTAACCAATCCGCTTCGCTAAATGACC
CCTGCTTATGTTTGAAATTGGATGTTCTATAAATATGGGTAGAAAACCGGGCTCGGGGC
TCTCAGCCTGTGCGCCACTGCGTTGGACACAGCGGGGGCCCTAGCTCGAGCTAGCAATA
AACTTCCTTCTTGCGTTTTGCATTGTCTCTTGGTAGCTTTCTCTCTTCCCGCTCGGGGATTC
GGACATCGGGCATAACA

b. Partially Unmethylated – S3442, S9905, S7094, S10945

TGTAGAATGTAGGGAGAGCAAACATAACGCCATGACAAGAGGCAGAAGCAGGAAGAC
CAGGCCCTCACCCTGGCAACTAGGGCGACTATCAAGCCACCCAGACACAACCAGTCAGA
GACTATCAGGCCACTCAGATACAGCCAATCAGAACTTGTTTACGGAAAAAACCCGCGCG
CGCGAATATCTGACCAATGAAAGGACCCCCGCAAGCATGTAACCAATCCGCTTCGCTAA
GTGACCCCTGCTTATGTTTGAAATTAAATGTCCTATAACTATGAGTAGAAAAACCGGGCT
CGGGGCTCTCAGCCTGTGCGCCACTGCGTTGGACACAGCGGGGGCCCTAGCTCGAGCTA
GCAATAAACTTTCTTCTTGCGTTTTGCATTGTCTCTTGGTAGCTTTGTCTCTTCCCGCTCGG
GGATTCGGACATTGGGCATAACA

Figure 3.1: Methylation patterns of CrERV loci.

(a) Lollipop diagram that depicts the methylation pattern of CrERV with completely unmethylated 5′ LTRs. Circles indicate CpG dinucleotides; empty circles indicate an unmethylated CpG and filled circles indicate a methylated CpG. The CrERV LTR (S386) sequence is depicted below the lollipop diagram, and CpG dinucleotides are underlined. (b) Lollipop diagram and CrERV LTR sequence (S3442) that depict the methylation pattern of CrERV with partially unmethylated 5′ LTRs.

mapping criteria.

[Figure 3.2 schematic labels: primers 1F, 2R, 3R; cDNA primer; 5′ LTR; 3′ LTR; SD = splice donor; SA = splice acceptor]

Figure 3.2: Schematic of approach to amplify CrERV spliced env transcripts.

The CrERV provirus is depicted in red. The splice donor is marked by a green triangle and the splice acceptor is marked by a yellow arrow. The cDNA primer location is indicated. Locations of the primers used to amplify spliced env transcripts are indicated by arrows marked 1F, 2R, and 3R. The resulting spliced env transcript is depicted below the provirus.

[Figure 3.3 tree tip labels: S22897, S29930, S11112, S111665, S3780, S3657, S10499, S5212, S7050, S10113, S29996, S18448, S386, S3597, S5800, S16113, S6404, S21951, S27860, S30429, S18517, S25446, S11782, S3749, S12116, S21484, S5890, S9907, S2220, S3442, S2374, S2970, S15154, S12562, S26536; time axis: 700 to 0 KYA]

Figure 3.3: CrERV loci in M273 that produce spliced env transcripts.

Coalescent tree of select CrERV integrations from M273. CrERV are colored according to phylogenetic affiliation. CrERV that were identified to produce spliced env transcripts are marked by a yellow star.

Similar groups of CrERV are transcribed in other Montana mule deer

We next determined if other mule deer from Montana had a similar set of transcriptionally active CrERV families as identified in M273. We cloned and sequenced CrERV spliced envelopes from nine additional Montana animals to assess transcript diversity. All nine animals had transcripts that were nearly identical to S386 (pairwise distance = 0.000803). In contrast, there was higher diversity among CrERVγ-in12 spliced env transcripts (pairwise distance = 0.011787), although most (6/10) animals produced the S2220 transcript identified in M273. The S3442 spliced envelope was isolated from M273 only. There was little diversity among

[Figure 3.4 panels: RNAseq read coverage plots for S386, S2220, S3442, and S26536]

Figure 3.4: RNAseq read coverage of transcriptionally active CrERV in M273.

RNAseq reads were mapped to S386, S2220, S3442, and S26536 virus genes (5′ and 3′ LTRs were removed) at 1.0 length fraction and 0.99 similarity fraction using CLC Genomics Workbench. The approximate location of the env gene of each CrERV is position 6000 on the above graphs. Graph bounds were fixed at a maximum of 100 reads for each mapping for ease of comparison. Colors correspond to the minimum, mean, and maximum observed values.

the CrERVγ-in1 transcripts identified from four animals in the analysis (pairwise distance = 0.006774). A transcript from one additional CrERV lineage was identified from one animal, and as we had no evidence for an unmethylated 5′ LTR from this CrERV lineage, it was not further investigated. No transcripts matching the sequence of other CrERV from M273 with partially unmethylated 5′ LTRs (Table 3.5) were identified in any other Montana animal.
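The pairwise distances reported for the spliced env transcripts summarize how similar the cloned sequences are within a lineage. The following is a minimal sketch of such a calculation, assuming an uncorrected p-distance (the proportion of differing sites) averaged over all sequence pairs; the distance model actually used is not stated here and may differ. The toy sequences and the function names `p_distance` and `mean_pairwise_distance` are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def mean_pairwise_distance(seqs) -> float:
    """Average p-distance over every unordered pair of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned clones: two identical, one with a single substitution.
toy_clones = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
print(round(mean_pairwise_distance(toy_clones), 4))
```

A value close to zero, as reported for the S386-like transcripts, would be consistent with a single locus producing the transcripts, whereas larger values point to several related loci.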

After determining that the same or related CrERV were transcriptionally active in multiple Montana animals, we compared transcription levels of each CrERV in the 10 animals by qPCR. Although lineage-specific expression levels varied among animals, expression of CrERVγ-in3 and CrERVγ-in1 was highest in all animals, with comparatively lower expression of CrERVγ-in12 (Figure 3.5). Analysis of qPCR melt curves suggests that there was little diversity in CrERVγ-in3 and CrERVγ-in1 transcript sequence across different animals, which is consistent with our sequence data. In contrast, qPCR melt curves showed multiple peaks for CrERVγ-in12 sequences, which is also consistent with the spliced envelope sequencing results and supports the presence of multiple CrERVγ-in12-like transcriptionally active CrERV in Montana animals.

Our data support that S386, or a closely related CrERV, is highly expressed in all Montana animals. We evaluated the distribution of this CrERV in the population [210] and determined that all Montana animals in our data set have a CrERV at this locus, with the exception of two animals (M170 and M267). We confirmed that a full-length virus capable of producing the transcript was present at this site using CrERV locus-specific PCR, which showed that S386 is heterozygous for the provirus/solo-LTR configuration in all Montana animals evaluated. Two mule deer (M170 and M267) and one white-tailed deer (negative control) did not contain the virus integration, and only a pre-integration site was amplified (Figure 3.6). Similar analyses of S26536, S2220, and S3442 indicate that these transcriptionally active CrERV from M273 were also found in all Montana animals as provirus/solo-LTR heterozygous integrations. CrERV locus-specific PCR indicated that M170 and M267 only contained pre-integration sites at all four loci.
We considered that the appearance of a band for both provirus and solo-LTR in the amplicon generated using primers flanking the integration site could be due to a duplication of the genomic region surrounding the virus integration sites. We mapped all genome sequence reads back to our assembly, and there was no increase in read depth at any of the host flanking regions (data not shown). We also searched a k-mer library generated from all sequence data and determined that none of the k-mers derived from 2 kb of the host flanking regions occurred at a frequency that differed from single copy genes.
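The k-mer check described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the k-mer size, the toy reads, and the function names `count_kmers` and `median_kmer_depth` are assumptions, and a real analysis would count k-mers (typically on both strands) across the full read set.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer across a collection of sequence reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def median_kmer_depth(region, kmer_counts, k):
    """Median read-set frequency of the k-mers that make up a genomic region."""
    kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
    return median(kmer_counts[kmer] for kmer in kmers)


# Hypothetical toy data: a single-copy reference gene and a CrERV flanking
# region, each covered by three identical reads.
K = 4
single_copy_gene = "ACGTTGCA"
flank = "GGATCCAT"
reads = [single_copy_gene] * 3 + [flank] * 3

counts = count_kmers(reads, K)
ratio = median_kmer_depth(flank, counts, K) / median_kmer_depth(single_copy_gene, counts, K)
print(ratio)  # a ratio near 1 is what a non-duplicated flank would give
```

Under these assumptions, a duplicated flanking region would show roughly twice the k-mer depth of the single-copy reference, which is the signal the analysis above did not observe.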

[Figure 3.5 plot: Lineage-specific CrERV Expression; x-axis: Animal (M167, M253, M257, M261, M268, M273, M350, M358, M364, M376); y-axis: Normalized Copy Number (0.0 to 0.3); legend: in1, in3, in12]

Figure 3.5: Lineage-specific CrERV quantification.

CrERV copy numbers were quantified using qRT-PCR and normalized to number of copies of actin and G6pdh housekeeping genes. CrERVγ-in1 expression levels are indicated in red, CrERVγ-in3 expression levels are indicated in green, and CrERVγ-in12 expression levels are indicated in blue.

Proximity to expressed host genes does not dictate CrERV transcription

Unmethylated and transcriptionally active ERV LTRs are typically enriched in gene-rich regions of the genome in humans and mice [118, 149]. Additionally, ERVs and proximal cellular genes are often epigenetically coregulated [96, 97]. We tested if CrERV with unmethylated LTRs were closer to mule deer genes than all CrERV in M273. The results of a Fisher’s exact test showed that the proportion of unmethylated CrERV within 10 kb of a gene is higher than the proportion of all CrERV within 10 kb of a gene (p-value = 0.0041). We considered that this may

[Figure 3.6 gel and schematic labels: lanes M170 and M273; size markers 10 kb, 1.5 kb, 1 kb; products: CrERV provirus, CrERV solo-LTR, pre-integration site]

Figure 3.6: Results of PCR for individual CrERV loci.

(Top) Gel image that depicts S2220 locus PCR on M170 genomic DNA, which has only a 1 kb pre-integration site, and M273 genomic DNA, which has a 10 kb CrERV provirus and 1.5 kb CrERV solo-LTR. The same primer set was used for both PCRs. (Bottom) Schematic of PCR approach. Small arrows indicate primers that correspond to the unique host flanking regions of each CrERV locus. PCRs that detect CrERV proviruses are marked with black circles, PCRs that detect CrERV solo-LTRs are marked with grey circles, and PCRs that detect the empty pre-integration site are marked with white circles.

indicate that CrERV expression is a byproduct of genomic location. We were able to determine the 5′ LTR methylation status of 13 CrERV within 10 kb of a host gene. Only three CrERV in this dataset were unmethylated. RNAseq data show that all proximal genes, except two genes that are primarily expressed in testis and endometrium, are expressed in mule deer lymph nodes (Table 3.6). This indicates that although CrERV with unmethylated 5′ LTRs are close to genes, a CrERV can also be methylated in close proximity to an expressed gene.
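The enrichment comparison in this section relies on Fisher's exact test applied to a 2 × 2 table (unmethylated versus other CrERV, within versus beyond 10 kb of a gene). The sketch below shows how such a two-sided p-value can be computed from the hypergeometric distribution using only the Python standard library. The counts passed in the example call are hypothetical placeholders, because the underlying counts are not given in this section, and the function name `fisher_exact_two_sided` is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are no
    more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts (NOT from the study):
# rows = unmethylated CrERV, all other CrERV; columns = within 10 kb, beyond 10 kb
p_value = fisher_exact_two_sided(5, 2, 30, 120)
print(f"p = {p_value:.4g}")
```

A small p-value in such a table would indicate that the unmethylated CrERV are closer to genes more often than expected by chance, which is the pattern reported above.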

Table 3.6: Expression of genes that have a CrERV integration within 10 kb.

Values in parentheses under ‘Methylation Status’ indicate number of unmethylated reads out of total reads that map to that CrERV locus. Gene expression was assessed by mapping RNAseq reads to each gene transcript. CEMIP and ZMYND15 were not expressed in mule deer lymph nodes, and are primarily expressed in endometrium and testis, respectively.

Virus    Phylogeny  Methylation Status           Distance to Gene (bp)  Gene      Gene Expressed?
S3442    In12       Partial Unmethylated (2/9)   185                    SZRD1     Yes
S386     In3        Unmethylated (11/11)         216                    SIRT6     Yes
S26536   In1        Unmethylated (7/7)           357                    KXD1      Yes
S2220    In12       Unmethylated (9/9)           399                    ISY1      Yes
S3597    In3        Methylated (0/3)             1410                   ALG8      Yes
S15140   In1        Methylated (0/8)             1559                   SIRPA     Yes
S1131    In15       Methylated (0/3)             2173                   CIB1      Yes
S10078   In6        Methylated (1/13)            2727                   DTX4      Yes
S18448   In3        Methylated (0/2)             3079                   CEMIP     No
S17735   In7        Methylated (0/3)             5734                   SPSB2     Yes
S1860    In6        Methylated (0/1)             7088                   TAC1      Yes
S12116   In6        Methylated (0/1)             7280                   CTNNA3    Yes
S9905    In1        Partial Unmethylated (3/18)  8544                   ZMYND15   No
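The methylation labels in Table 3.6 can be reproduced from the read counts with a simple rule. The sketch below is inferred from the table and its caption (at least two unmethylated reads are required before a locus is called unmethylated); the function name `classify_ltr` and the handling of edge cases, such as a locus covered by a single unmethylated read, are assumptions rather than the criteria used in the study.

```python
def classify_ltr(unmethylated_reads: int, total_reads: int, min_unmethylated: int = 2) -> str:
    """Label a 5' LTR from bisulfite read counts.

    Assumed rule: fewer than `min_unmethylated` unmethylated reads is called
    Methylated; otherwise the locus is Unmethylated if every read is
    unmethylated, and Partial Unmethylated if only some are.
    """
    if total_reads <= 0 or not 0 <= unmethylated_reads <= total_reads:
        raise ValueError("read counts are inconsistent")
    if unmethylated_reads < min_unmethylated:
        return "Methylated"
    if unmethylated_reads == total_reads:
        return "Unmethylated"
    return "Partial Unmethylated"


# Read counts taken from Table 3.6: (unmethylated reads, total reads)
loci = {
    "S386": (11, 11),
    "S3442": (2, 9),
    "S9905": (3, 18),
    "S10078": (1, 13),
    "S1860": (0, 1),
}
for name, (unmeth, total) in loci.items():
    print(name, classify_ltr(unmeth, total))
```

With the threshold of two reads, S10078 (1/13) is labeled Methylated, which matches its entry in the table.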

S26536 impacts splicing of gene transcript but does not impact gene expression levels

S26536 is estimated to be the oldest CrERV integration [212], is present in the provirus/solo-LTR configuration in all animals evaluated, and data support expression of this CrERV in multiple Montana animals. The S26536 virus and solo-LTR are integrated into the genome in the reverse orientation 357 bp from the predicted KXD1 exon 1 (Figure 3.7, a). The mule deer KXD1 transcript is homologous to cow KXD1 based on RNAseq data (Figure 3.8). We confirmed this transcript structure by amplifying a full-length KXD1 transcript from a mule deer that does not contain the CrERV (M170). No KXD1 transcript splice variants were amplified in any animals or identified from the RNAseq reads. The orientation of S26536 with respect to KXD1 suggested that the CrERV LTR could act as a bidirectional promoter for the host gene. We identified a transcript that initiates in the S26536 LTR and includes 233 bp of the host sequence, then splices from a predicted splice donor to KXD1 exon 2 and includes the remaining

[Figure 3.7 schematic labels: (a) Transcriptionally active CrERV with fully unmethylated 5′ LTRs: S26536 virus and Kxd1; S386 virus and Sirt6; S2220 virus and Isy1; each virus shown with its 3′ LTR and 5′ LTR. (b) Transcriptionally active CrERV with partially unmethylated 5′ LTR: S3442n virus (5′ LTR, 3′ LTR) within Fbxo42, shown with Exon 1A, Exon 1B, Exon 1C, and Exons 2-10]

Figure 3.7: Orientation of transcriptionally active CrERV with respect to genes.

(a) Transcriptionally active CrERV with fully unmethylated 5′ LTRs are integrated in the reverse orientation with respect to the closest gene. Virus name and gene name are indicated in the figure. Black arrows indicate the direction of transcription. (b) Transcriptionally active CrERV with partially unmethylated 5′ LTR is integrated in the first intron of a gene in the sense orientation with respect to the gene. Three gene transcript isoforms with alternative exon 1 sites are depicted. Black arrows indicate the direction of transcription of each exon 1.

KXD1 exons (Figure 3.8). This alters the 5′ UTR and omits the annotated exon 1 that is present in the canonical transcript but leaves the coding region of the gene, which begins in exon 2, unaffected. There were no regulatory features identified in the new UTR using a web-based server to identify functional RNA motifs [234]. We confirmed that the new UTR is only found in LTR-driven transcripts by amplifying UTR to KXD1 exon 5 from gene-specific cDNA. Transcripts containing UTR to KXD1 exon 5 were amplified from an animal containing S26536, but not from an animal lacking the CrERV. RNAseq reads that overlapped the LTR-driven UTR-KXD1 exon 2 junction were also identified.

[Figure 3.8 schematic labels: (a) Canonical KXD1 transcript with primers 3F, 3R, 2F, 2R; (b) S26536-driven KXD1 transcript with primers 1F, 1R; segments UTR, LTR, KXD1]

Figure 3.8: Schematic of KXD1 transcript isoforms.

(a) Canonical KXD1 transcript isoform, which contains exon 1. (b) S26536-driven KXD1 transcript isoform, which is alternatively spliced and does not contain KXD1 exon 1. Small arrows indicate primer locations for qPCR. Primers 1F and 1R are used for LTR-driven KXD1 quantification and primers 2F and 2R are used for canonical KXD1 exon 1 quantification. Primers 3F and 3R are in exons 2 and 3, which are conserved between the two KXD1 transcript isoforms and are used to quantify total KXD1 expression.

Because we identified a novel LTR-driven KXD1 transcript isoform, we determined if the presence of S26536 alters the amount of KXD1 expression. There was no significant difference in either total KXD1 (exon 2-exon 3) expression (Mann-Whitney U p-value = 0.083) or canonical KXD1 (exon 1-exon 2) expression (Mann-Whitney U p-value = 0.117) between animals with or without the CrERV (Figure 3.9). Additionally, LTR-driven KXD1 (LTR-UTR) expression is low, ranging between 1% and 10% of total KXD1 expression across Montana animals, suggesting that the majority of total KXD1 expression in each animal is driven by the canonical promoter.
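The expression comparisons in this chapter use the Mann-Whitney U test. As a minimal sketch of how the test works on small groups, the code below computes the U statistic and an exact two-sided p-value by enumerating every way of assigning the pooled values to the two groups. It uses only the standard library; the expression values are hypothetical, the function name `mann_whitney_u_exact` is an assumption, and statistical packages may use a normal approximation or a different tie correction instead.

```python
from itertools import combinations


def mann_whitney_u_exact(x, y):
    """Return (U statistic for x, exact two-sided p-value).

    The p-value is the fraction of all group assignments whose U is at least
    as far from its expected value (n*m/2) as the observed U. Intended for
    small samples only, since every assignment is enumerated.
    """
    def u_stat(xs, ys):
        # Number of (x, y) pairs with x > y; ties count as one half
        return sum((a > b) + 0.5 * (a == b) for a in xs for b in ys)

    n, m = len(x), len(y)
    pooled = list(x) + list(y)
    center = n * m / 2
    u_obs = u_stat(x, y)
    extreme = total = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        if abs(u_stat(xs, ys) - center) >= abs(u_obs - center) - 1e-12:
            extreme += 1
    return u_obs, extreme / total


# Hypothetical normalized copy numbers (NOT data from the study)
without_crerv = [0.029, 0.035]
with_crerv = [0.031, 0.042, 0.038, 0.045, 0.036, 0.040]
u, p = mann_whitney_u_exact(without_crerv, with_crerv)
print(f"U = {u}, exact two-sided p = {p:.3f}")
```

Because the exact p-value depends only on the ranks and the group sizes, the number of observations in each group sets a floor on how small it can be.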

S386 does not alter host gene expression levels or splicing patterns

S386 is estimated to be a more recent CrERV integration, but is also retained in the provirus/solo-LTR configuration and is expressed in all animals evaluated. The S386 virus and solo-LTR are integrated into the genome in the reverse orientation

[Figure 3.9 plot: KXD1 Gene Expression Levels; panels: Total KXD1 (p = 0.083), Canonical KXD1 (p = 0.17); x-axis: S26536 Status (No, Yes); y-axis: Normalized Copy Number (0.00 to 0.06)]

Figure 3.9: KXD1 gene expression levels between animals with and without S26536.

Copy numbers of Total KXD1 and KXD1 exon 1 were normalized to actin copy number. Expression levels were compared between animals with the S26536 integration (blue boxplot) and animals without the S26536 integration (pink boxplot) using the Mann-Whitney U test, and p-values of each comparison are indicated at the top of the graph.

216 bp from the predicted SIRT6 exon 1 (Figure 3.7, a), thus we also investigated the potential bidirectional promoter activity of this CrERV LTR. We identified a mule deer SIRT6 transcript with 8 exons based on RNAseq data and confirmed this transcript structure by amplifying the full-length SIRT6 transcript in M170, which does not contain S386. No alternatively spliced SIRT6 transcripts were amplified from any animal or identified using RNAseq data.

We amplified a SIRT6 transcript that initiates from the LTR and includes all SIRT6 exons (Figure 3.10). The LTR-containing SIRT6 transcript is a read-through transcript from the LTR to SIRT6 exon 1 and extends the gene 5′ UTR, but has the same exon structure as the SIRT6 transcript amplified from M170. The S386 proviral 3′ LTR and solo-LTR have a CpG dinucleotide insertion at position 165 relative to the 5′ LTR. The sequence of the LTR-driven SIRT6 transcript indicates it is initiated by the 5′ LTR of S386 (Figure 3.11). The LTR-driven SIRT6 transcript reads through to the canonical SIRT6 open reading frame (ORF) and does not alter the ORF or translation start codon. We did not identify any reads from the RNAseq analysis that mapped between the S386 LTR and SIRT6 exon 1, suggesting this transcript is rare. There was also no significant difference in total SIRT6 expression (Mann-Whitney U p-value = 0.5645) between animals with or without the CrERV (Figure 3.12).

[Figure 3.10 schematic labels: (a) Canonical SIRT6 transcript with primers 2F, 2R; (b) S386-driven SIRT6 transcript with primers 1F, 1R; segments LTR, SIRT6]

Figure 3.10: Schematic of SIRT6 transcript isoforms.

(a) Canonical SIRT6 transcript isoform. (b) S386-driven SIRT6 transcript isoform, which is a read-through transcript from the LTR to SIRT6 exon 1. Small arrows indicate primer locations for qPCR. Primers 1F and 1R are used for LTR-driven SIRT6 quantification. Primers 2F and 2R are in exons 2 and 3, which are conserved between the two SIRT6 transcript isoforms and are used to quantify total SIRT6 expression.

Figure 3.11: Alignment of S386 LTRs and LTR-driven SIRT6 transcript.

The S386 5′ LTR (S386_5pLTR), the S386 3′ LTR (S386_3pLTR), S386 solo-LTR (S386_solo) and LTR portion of the LTR-driven SIRT6 transcript (LTRSIRT6) were aligned. The LTR-driven SIRT6 transcript is missing a ‘CG’ dinucleotide, similarly to the S386 5′ LTR, indicating that this transcript is initiated from the S386 provirus 5′ LTR.

Similarly, the S2220 virus is integrated into the genome in the reverse orientation 399 bp from the first predicted ISY1 exon (Figure 3.7, a). We identified a full length mule deer ISY1 transcript with 11 exons in an animal with S2220 and an animal without the virus, and no ISY1 splice variants were detected in either animal (Figure 3.13). We amplified a truncated LTR-driven ISY1 transcript but there was no evidence that the LTR-driven transcript was full-length. Amplification of truncated LTR-driven ISY1 was inconsistent, so quantification was not attempted.

Intronic S3442 alters gene promoter usage and splicing patterns

We identified S3442 as a partially unmethylated CrERV that was producing spliced env transcripts in M273. No other animals in the analysis contained S3442 spliced env transcripts; however, the CrERV was present in the provirus/solo-LTR configuration in all animals evaluated. Unlike the other transcriptionally active CrERV, S3442 is in the sense orientation and in the first intron of FBXO42 (Figure 3.7, b). We identified CrERV-host chimeric RNAseq reads that mapped to FBXO42 exon 1 and spliced into the S3442 env or gag at predicted splice acceptor sites. A database search indicated that there are multiple FBXO42 transcript isoforms from deer, sheep, and cow that differ in exon 1 but share coding exons (exons 2-10). We identified three transcripts that splice to the conserved FBXO42 exon 2 and mapped these to three locations in the mule deer genome, which we refer to as Exon 1A, Exon 1B, and Exon 1C (Figure 3.7, b). Exon 1A is 2.1 kb upstream of S3442, whereas Exon 1B, Exon 1C, and FBXO42 exons 2-10 are downstream of the CrERV. We confirmed full-length FBXO42 expression containing each exon 1 by amplifying FBXO42 transcripts in an animal positive for S3442 and an animal

[Figure 3.12 plot: Total SIRT6 Gene Expression Levels (p = 0.56); x-axis: S386 Status (No, Yes); y-axis: Normalized Copy Number (0.000 to 0.003)]

Figure 3.12: SIRT6 gene expression levels between animals with and without S386.

Copy numbers of Total SIRT6 were normalized to actin copy number. Expression levels were compared between animals with the S386 integration (blue boxplot) and animals without the S386 integration (pink boxplot) using the Mann-Whitney U test, and the p-value is indicated at the top of the graph.

lacking the virus (M170). Given the exon structure of FBXO42 and presence of CrERV-host chimeric reads, we investigated FBXO42 expression in the presence and absence of the virus. We quantified FBXO42 expression in two animals lacking the CrERV and six animals with S3442. One animal (M376) was omitted from the analysis due to lack of measurable FBXO42 transcript expression. We found no statistical difference in Exon 1B-Exon 2 expression between animals with and without S3442. Animals without the CrERV had significantly more Exon 1A-Exon 2 expression than animals

[Figure 3.13 schematic label: ISY1]

Figure 3.13: ISY1 transcript structure.

ISY1 gene exon structure was determined using RNAseq read mapping. The mule deer ISY1 gene contains 11 exons with homology to the cow ISY1 gene. A diagram of the transcript structure is shown.

with S3442 (Mann-Whitney U p-value = 6.804 × 10⁻⁶). In contrast, animals with S3442 had significantly more Exon 1C-Exon 2 expression than animals without the CrERV (Mann-Whitney U p-value = 1.004 × 10⁻⁵), suggesting that presence of S3442 alters exon 1 usage during FBXO42 expression (Figure 3.14).

Intronic ERVs can result in early termination of the gene transcript through the use of polyadenylation signals in the virus [50]. We used 3′ RACE to determine if FBXO42 transcripts that spliced into S3442 terminate at the virus polyadenylation signal in the S3442 3′ LTR. We identified a polyadenylated transcript that begins with FBXO42 Exon 1A, splices to the S3442 env at a predicted splice acceptor, and reads through the viral env before terminating at the virus polyadenylation signal (Figure 3.15).

We evaluated S3442 exonization within FBXO42 (Figure 3.15) and amplified an FBXO42 transcript from M273 that contains 423 bp of the viral envelope. Predicted viral splice acceptor and donor sites were identified in this transcript. Additional experiments suggested this transcript is full length (FBXO42 exon 10) and polyadenylated. We also amplified a gag-containing FBXO42 transcript that extends to at least FBXO42 exon 4. Computational analyses suggest that the addition of S3442 env or gag sequence does not alter the FBXO42 open reading frame (ORF), which begins in exon 2. We also amplified S3442 env-containing FBXO42 transcripts in two additional Montana animals, M358 and M376. These data indicate that S3442 has a quantitative impact on FBXO42 exon 1 usage and a qualitative impact on FBXO42 transcript form.

[Figure 3.14 plot: FBXO42 Alternative Exon 1 Expression; panels: Exon 1B (p = 0.97), Exon 1A (p = 0.00014), Exon 1C (p = 7.8e-05); x-axis: S3442n Status (No, Yes); y-axis: Normalized Copy Number (0.00 to 0.25)]

Figure 3.14: Presence of S3442n integration alters FBXO42 Exon 1 usage patterns.

Copy numbers of FBXO42 Exon 1B-exon 2, Exon 1A-exon 2, and Exon 1C-exon 3 were normalized to total FBXO42 copy number. Expression of each alternative exon 1 was compared between animals with the S3442 integration (blue boxplot) and animals without the S3442 integration (pink boxplot) using the Mann-Whitney U test, and p-values of each comparison are indicated at the top of the graph.

3.4 Discussion

In this study, we investigated transcriptionally active ERV in an organism with extensive insertional polymorphism among individuals. We have identified four CrERV that are transcriptionally active in mule deer lymph nodes, and provide data that supports expression of these CrERV throughout a mule deer population. We provide strong evidence that proximity to genes is not the only factor that

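[Figure 3.15 schematic labels: S3442 provirus (5′ LTR, gag, env, pA, 3′ LTR) between FBXO42 Exon 1A and Exon 2; SD = splice donor; SA = splice acceptor; transcripts labeled by isolation method: RNAseq (multiple chimeric transcripts); 3′ RACE (polyadenylated, AAAAAA); PCR; PCR env to exon 10; PCR gag to exon 2; S3442n spliced env clone; canonical FBXO42 Exon 1A – exon 2 (RNAseq + PCR)]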

Figure 3.15: Splicing patterns of S3442 and FBXO42.

All chimeric transcripts isolated are included in the schematic. The method used to isolate each transcript is indicated. Splice donors (SD) within S3442 are indicated by a green triangle, and splice acceptors (SA) are marked using a yellow arrow. Genomic provirus and provirus portions of transcripts are indicated in red. FBXO42 host gene and transcripts are indicated in blue. The virus polyadenylation signal is marked (pA).

dictates CrERV transcriptional activity. We show that transcriptionally active CrERV are heterozygous, and indicate that the provirus has been maintained in the majority of animals in the population. We use animals that lack the CrERV to show that transcriptionally active CrERV can have bidirectional promoter activity and affect host gene splicing, but presence of the virus does not have a major impact on host gene expression levels.

We first asked if CrERV transcriptional activity was a byproduct of location close to genes, which required knowledge of the CrERV locus producing transcripts rather than identification of transcriptionally active CrERV lineages. We sequenced CrERV transcripts to identify candidate transcriptionally active proviruses. CrERV have colonized and expanded within the mule deer genome over the last 200,000 years [185], resulting in many genomic copies of viruses that can share up to 99% identity. Thus, we also assessed CrERV 5′ LTR methylation status and identified completely and partially unmethylated CrERV LTRs, which have been previously associated with ERV expression [80, 235, 236]. Combining these approaches allowed us to determine that candidate CrERV had an LTR consistent with transcriptional activity, and provides an advantage over using RNAseq only, which does not allow for accurate identification when ERV sequences are greater than 96% identical [16].

The four transcriptionally active CrERV identified in M273 were close to genes. Expressed ERV are often near genes [16, 118, 148], and it has been suggested that ERV expression may be a byproduct of genomic location to avoid local heterochromatin spreading into nearby genes [152, 153]. If CrERV were transcriptionally active as a byproduct of genomic location, we would expect there would be no methylated CrERV close to expressed genes. Our data indicate this is not the case, and proximity to genes is not the only factor responsible for CrERV transcriptional activity. Alternatively, CrERV orientation with respect to genes, integration into a CpG island, or other features may dictate CrERV expression.

Transcriptionally active CrERV belong to older phylogenetic lineages. We expected that transcriptionally active CrERV would be related to the youngest CrERV lineage, CrERVγ-in7. The youngest CrERVγ-in7 family originates from an ancestor that integrated into the genome within the last 20,000 years [212]. This family is the only phylogenetic group enriched near genes and has been involved in CrERV recombination events [212], which may be RNA mediated and suggest transcriptional activity of these ERV in the germline [157, 237]. Further, a full-length infectious CrERV from this CrERVγ-in7 group was recently isolated from mule deer cells co-cultured with human cells [202]. We found no evidence of transcriptional activity of recently integrated and insertionally polymorphic CrERV lineages in lymph nodes, and instead demonstrate that transcriptionally active CrERV are older.
S26536, a CrERVγ-in1 virus, is the oldest CrERV and is fixed in all mule deer [184, 185]. S2220 and S3442 are also members of an older CrERV lineage, CrERVγ-in12, and both CrERV are widespread in the population. This may be consistent with the hypothesis that older ERV, after mutations have reduced their mutagenic potential, acquire regulatory potential and are released from silencing [220, 238]. S386, however, is younger and has been involved in recent germline retrotransposition events [212]. Additionally, all transcriptionally active CrERV are present in the provirus/solo-LTR configuration throughout the population. ERV heterozygosity typically indicates recent germline acquisition and is associated with insertional polymorphism [158, 239–242]. A heterozygous ERV may indicate balancing selection on the ERV to maintain the provirus. Alternatively, ERV heterozygosity could be due to a hitchhiking effect with a nearby gene under selection. The widespread heterozygous CrERV state suggests that there may be a dosage effect, and two copies of either the provirus or solo-LTR may be deleterious. This is similar to the ERV integration within the KIT gene in cats, where the solo-LTR allele results in a more extreme phenotype than the full-length ERV allele [243].

S386, S26536, and S2220 were integrated into the genome in the reverse transcriptional orientation with respect to the closest gene. Proviruses that contribute enhancers to host genes are often found in this configuration, such as the HERV-E amylase gene enhancer [244]. An advantage to our system is that several animals lack these CrERV integrations, allowing us to investigate the impact of S386 and S26536 on proximal gene expression. We did not find a positive or negative quantitative impact on gene expression levels, indicating that presence of the CrERV neither impairs nor enhances gene expression in lymph nodes. LTRs can also have bidirectional promoter activity [124, 126, 245]. We demonstrate the ability of S26536 to act as a bidirectional promoter and drive expression of a KXD1 transcript isoform with a novel 5′ UTR in the mule deer lymph node. We were unable to determine any motif differences between the 5′ UTR of the LTR-driven KXD1 transcript and the canonical KXD1 transcript to suggest a functional impact of the alternative transcript isoform; however, 5′ UTRs are important for control of translation efficiency [246]. Our data suggest a minimal impact of transcriptionally active CrERV as alternative promoters for host genes. Many ERV-associated alternative promoters are solo-LTRs [126, 247].
Only one CrERV solo-LTR was identified as partially unmethylated and all unmethylated CrERV included a provirus and a solo-LTR, which may also indicate that co-option of CrERV LTRs as host gene promoters is not a feature of CrERV transcriptional activity.

We observed aberrant FBXO42 transcript processing caused by intronic S3442, including premature polyadenylation within the CrERV, exonization of CrERV sequence, and an impact on FBXO42 exon 1 usage patterns. S3442 is integrated in the sense orientation within FBXO42, an unusual configuration given a strong antisense bias among intronic ERVs [13, 248]. It is interesting to consider that FBXO42 transcripts that terminate in the CrERV indicate host-driven S3442 spliced env expression, which would not be identified using our spliced env transcript amplification strategy. If this process occurs in all animals, then the S3442 spliced env would be ‘transcribed’ regardless of the methylation status of the 5′ LTR. Thus the host could avoid the negative consequences of increased ERV expression, such as retrotransposition and epigenetic effects on host genes, while maintaining expression of S3442 env transcripts. Because there were mule deer lacking S3442, we were able to evaluate FBXO42 expression in the presence and absence of the virus and determined that exon 1 usage differed between animals with and without S3442. S3442 may impact FBXO42 regulation since alternative first exons are often associated with tissue or cell context-dependent gene expression [249, 250]. Considering the effects presented on FBXO42 expression and widespread S3442 heterozygosity, it is unusual to maintain this intronic sense ERV provirus, further suggesting a beneficial impact of either the S3442 transcript or alternatively spliced gene transcripts.

S386 is distinct among the transcriptionally active CrERV, and its more recent integration into the genome provides information on the effects of ERV expression closer to the time of initial integration. Data support that expression levels of S386 are high in all animals evaluated and that S386 is widespread in the Montana population. We previously determined that S386 has expanded within the genome [212], suggesting ongoing transcription of this CrERV provirus in recent evolutionary time. Genomic location may indicate that S386 is transcribed because it is close to an expressed host gene. Alternatively, the S386 provirus has been maintained in all animals, the population frequency is high given it is a younger CrERV, and there is no functional consequence of S386 on proximal host gene transcript isoform or expression levels. Given data that indicate maintenance of CrERV proviruses that produce transcripts, it is possible that CrERV RNA and other CrERV-host transcripts act as lncRNAs.
ERVs are major contributors to the transcription of mammalian lncRNAs [141, 251] and ERV transcripts act as lncRNA scaffolds [137]. Although the function of CrERV transcripts was not evaluated in this study, it is interesting to consider their role in gene expression regulation as lncRNAs, which often function as cis-regulators of adjacent host genes [140,252–254]. This would be novel for recently integrated ERVs that display some degree of insertional polymorphism, since ERVs co-opted as lncRNAs are often conserved among multiple species [141]. Future work will determine the stability of CrERV transcripts and evaluate potential protein binding partners to assess the lncRNA activity of CrERV RNA.

In conclusion, this study supports that transcriptionally active CrERV are not a byproduct of genomic location close to genes or recent integration into the genome. Rather, we demonstrate that transcriptionally active CrERV integrated into the genome at different evolutionary time points and provide evidence of ongoing transcriptional activity. Population frequencies of transcriptionally active CrERV are high, even for a CrERV that is relatively young, and all expressed CrERV proviruses have been maintained in the mule deer population. With this research, we gained insights into the extent to which ERV acquisition and transcription affect short-term and long-term host genome evolution.

Chapter 4 | Comparison of transcriptionally active CrERV in mule deer from Montana and Wyoming

Sections of the data presented in Chapter 4 will be integrated with data on genome structural variations associated with CrERV for a manuscript that is currently in preparation. Some of the data presented here will appear in the supplement for this manuscript.

4.1 Introduction

Endogenous retroviruses (ERVs) that entered the genome of ancestral species are fixed in the genomes of present-day species. For example, the provirus HERV-K110 is present in humans, chimpanzees, bonobos, and gorillas but not the orangutan, indicating that the provirus integrated before humans, chimpanzees, bonobos, and gorillas diverged from their common ancestor [255]. In contrast, insertionally polymorphic ERVs are variably present within individuals due to ongoing colonization or active ERV retrotransposition. For example, at least 36 HERV-K proviruses are human-specific and insertionally polymorphic and not present in the human reference genome [187]. In other organisms, insertionally polymorphic ERVs have been documented among mouse strains [181,182], pig subspecies [177], and sheep breeds [183]. Interestingly, recent studies have also demonstrated ERV insertional polymorphism among populations within the same species [185,187].

Given the impacts of ERVs on host genomes, insertionally polymorphic ERVs may be associated with phenotypic variation among individuals within a species. For example, in humans, several unfixed ERVs are near or within genes. Although these ERV-gene pairs were not investigated further, these insertionally polymorphic ERVs may be associated with phenotypic effects in only some populations [187]. Other polymorphic HERVs also exhibit varying levels of expression among individuals [208], implying that there is variation in ERV contribution to the human transcriptome. Insertionally polymorphic ERVs have been shown to impact gene expression in other species. Two different studies of ERVs polymorphic among mouse strains that have integrated into gene introns demonstrate that these ERVs can disrupt gene expression [117,182]. ERV polymorphisms in mice have been associated with changes in gene expression across mouse strains [256], suggesting that insertionally polymorphic ERVs may contribute to these strain-specific phenotypic differences. We recently discovered an ERV in mule deer, named Cervid Endogenous Retrovirus (CrERV), that is insertionally polymorphic among mule deer [184] but absent from white-tailed deer, which shared a common ancestor with mule deer about 1 million years ago [200]. Mule deer are a young, phenotypically diverse species [198]; however, there is little population structure among mule deer based on microsatellites or mitochondrial DNA [196–199]. We have previously shown using a dataset of 14 CrERV that CrERV insertional polymorphism can be used to detect geographic clustering of related deer [185]. We further expanded this analysis to include a total of 1722 CrERV among 77 animals from three different states (Montana, Wyoming, and Oregon). A principal component analysis (PCA) of shared CrERV indicates that mule deer can be separated by state of origin [213], with a particular distinction between the Montana/Oregon and Wyoming animals. 
This suggests that animals from Montana and Wyoming differ in CrERV content. A previous analysis (Chapter 3) indicated that transcriptionally active CrERV are widespread in the Montana population and some CrERV have varying impacts on host gene expression. Given the differences in CrERV content between the Montana and Wyoming populations, it was unknown if animals from these populations have similar transcriptionally active CrERV or if CrERV expression varied depending on mule deer population of origin. Transcriptionally active CrERV are of particular interest in the context of chronic wasting disease (CWD), a naturally occurring prion disease of cervids [257,258]. CWD is endemic to certain portions of Wyoming [259], but few Montana mule deer have been found to be CWD positive. Due to geographic proximity and the migratory behavior of mule deer, it is unlikely that the absence of CWD in Montana is due to a lack of prion exposure. Instead, there may be a mule deer host genetic component that facilitates disease establishment. Although traditional genetic approaches have suggested that there are no genetic differences between animals in the CWD-free Montana and CWD-endemic Wyoming populations [196–199], we have shown that animals from these regions differ in CrERV colonization history [213]. Retroviral RNA affects prion conversion and pathogenesis [260, 261], and transcripts of some ERVs are upregulated during prion infection [66], suggesting that transcriptionally active CrERV have the potential to be involved in CWD. The results of Chapter 4 will establish differences between the Montana and Wyoming populations with respect to CrERV content, transcriptionally active CrERV, and the potential impact of CrERV transcription on nearby gene expression, and may be important for future studies regarding CrERV and CWD.

4.2 Materials and Methods

Some Chapter 4 methods were included in other chapters and are copied verbatim below. ‘Junction fragment cluster and data analysis’ was included in Chapter 2. ‘Animal Tissues and Nucleic Acid Extraction,’ ‘cDNA synthesis,’ ‘Transcript amplification, cloning, and Sanger sequencing,’ ‘PCR for individual CrERV loci,’ ‘Quantitative PCR of CrERV,’ and ‘Mule deer gene transcript screening and quantification’ were included in Chapter 3.

Illumina libraries of CrERV integration sites

Next generation sequencing libraries of CrERV integration sites were prepared using a method adapted from previous mobile element junction fragment analyses [191,193,194,211]. Briefly, genomic DNA was digested with dsDNA fragmentase (NEB) to 250-1000 bp. DNA fragments were then end-repaired and modified to create 3′ A overhangs and 5′ phosphorylation to allow for ligation of double-stranded linkers. The DNA linkers were designed with features to prevent linker-to-linker amplification of DNA fragments lacking the target ERV sequence, including a 3′ amino modification in the linker oligonucleotide, a single-stranded region in the top linker that matches the linker-specific primer, and a large difference in melting temperatures between PCR primers used to take advantage of the suppression PCR effect. The sequence of the linker top strand is 5′-GTGGCGGCCAGTATTCGTAGGAGGGCGCGTAGCATAGAAC*G*T (* denotes phosphorothioate bonds, which prevent degradation of the linker end), and the sequence of the bottom strand is 5′-p-CGTTCTATGCTAC-N (p denotes the 5′ phosphate that enables ligation of the linker; N indicates the 3′ amino modification); both were synthesized by Integrated DNA Technologies. The linker was added in 20-40X molar excess relative to the amount of genomic DNA fragments and ligated to the DNA fragments using Quick Ligase (NEB). The DNA was then purified using a PCR Purification Column (Qiagen) to remove unligated free linkers. Approximately 70-150 ng of DNA with ligated linkers was then used as template in the PCR amplification to enrich for virus-host junction sequences. PCR mixtures contained the following: 1.5 units of Ex Taq DNA polymerase (Takara), ExTaq reaction buffer (Takara), 0.2 mM dNTPs, and 400 nM primers. The linker-specific primer was identical for all samples and the sequence is 5′-GCGGCCAGTATTCGTAGGA-3′. An LTR-specific primer was used for one set of PCRs to enrich for all CrERV-host fragments (5′-AATGACCCCTGCTTATGTTTGA-3′) and an env-specific primer was used for another set of PCRs to enrich for CrERV-host fragments of viruses that contain coding sequence (5′-GAGGACAGCTCCTTGGTTTG-3′). Cycling conditions for this PCR were: 95°C for a 3 minute initial denaturation, followed by 32 cycles of 95°C for 30 s, 59°C for 30 s, 72°C for 30 s, and a final extension of 72°C for 5 min. Each PCR set was individually cleaned over a PCR purification column (Qiagen) and size-selected using AmpureBeads (Agencourt). 
Approximately 150 ng of the CrERV-enriched DNA fragments were then used as a template for a PCR with degenerate primers, which generated the diversity needed for accurate clustering after Illumina sequencing. PCR mixtures contained the following: 1 unit of Phusion DNA polymerase (NEB), Phusion HF reaction buffer (NEB), 0.2 mM dNTPs and 500 nM primers. The sequences of the primers used were: 5′-RYRYRCTTGCGTTTTGCATTGTCTCT-3′ and 5′-RYRYRGCGGCCAGTATTCGTAGGAG-3′, where R stands for A or G and Y stands for C or T. Cycling conditions were: 98°C for a 30 s initial denaturation, followed by 24 cycles of 98°C for 25 s, 57°C for 25 s, 72°C for 30 s, and then a final extension of 72°C for 5 min. The PCR products were then size-selected by gel electrophoresis using 1% agarose. The region corresponding to approximately 350-450 bp was excised and purified from gel slices using the QIAquick gel extraction kit (Qiagen). The concentration was measured on a QuantiFluor-ST fluorimeter (Promega). Approximately 100 ng of each sample was used as the input DNA for the TruSeq Nano DNA Protocol (Illumina) following the manufacturer’s instructions beginning at the step ‘Clean Up Fragmented DNA.’ Briefly, 50 µL DNA was cleaned up using 80 µL Sample Purification beads and eluted in 60 µL Resuspension Buffer. The DNA was then end-repaired using End Repair Mix 2 and further size selected using Sample Purification beads following the protocol for a 350 bp insert size. The 3′ ends of the DNA were then adenylated using A-Tailing Mix and indexing adapters from the TruSeq Nano DNA LT Sample Prep Kit were ligated to the DNA samples using Ligation Mix 2. The reaction was inactivated using Stop Ligation Buffer and cleaned up using Sample Purification beads. DNA fragments with adapter molecules on both ends were selectively enriched by PCR using the Enhanced PCR Mix and PCR Primer Cocktail. Cycling conditions for this PCR were: 95°C for a 3 minute initial denaturation, followed by 8 cycles of 98°C for 20 s, 60°C for 15 s, 72°C for 30 s, and a final extension of 72°C for 5 min. The enriched DNA fragments were then cleaned up using sample purification beads, eluted in 30 µL Resuspension Buffer and quantified using a QuantiFluor-ST fluorimeter (Promega). Assuming an average length of 470 bp, the final libraries for each sample were then normalized to 25 nM and pooled for sequencing on an Illumina MiSeq.
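The normalization step above depends on converting a fluorimeter reading (mass concentration) into molarity. The following is a minimal sketch of that arithmetic, assuming the common approximation of 660 g/mol per base pair of double-stranded DNA; the reading of 20 ng/µL, the 20 µL final volume, and the function names are hypothetical and are not taken from the protocol.

```python
# Average molar mass of one double-stranded DNA base pair (g/mol).
# This is a standard approximation, not a value stated in the protocol.
BP_MOLAR_MASS = 660.0


def dsdna_nanomolar(conc_ng_per_ul, mean_length_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (BP_MOLAR_MASS * mean_length_bp)


def dilution_volumes(conc_ng_per_ul, mean_length_bp, target_nm, final_volume_ul):
    """Return (library_ul, buffer_ul) needed to reach target_nm in final_volume_ul."""
    stock_nm = dsdna_nanomolar(conc_ng_per_ul, mean_length_bp)
    if stock_nm < target_nm:
        raise ValueError("library is more dilute than the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


# Hypothetical reading of 20 ng/uL for a library with a 470 bp average length.
stock = dsdna_nanomolar(20.0, 470.0)
lib_ul, buf_ul = dilution_volumes(20.0, 470.0, target_nm=25.0, final_volume_ul=20.0)
print(f"stock = {stock:.1f} nM; mix {lib_ul:.2f} uL library with {buf_ul:.2f} uL buffer")
```

Because the conversion is inversely proportional to fragment length, the assumed average length (470 bp here) directly scales every computed molarity.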

Junction Fragment cluster and data analysis

Reads obtained from Ion Torrent sequencing were processed and clustered as described previously [195,210]. Briefly, reads were clustered in two rounds using a previously described clustering pipeline, and inter-cluster distances were computed to check that each cluster represented a single CrERV integration site. A two-component mixture model was also developed to address the uncertainty in assigning CrERV status using only read-count data. The mixture model allowed us to assign a probability of presence to each CrERV within an individual.

Animal Tissues and Nucleic Acid Extraction

Mule deer retropharyngeal lymph nodes were obtained from legally hunted animals brought to hunter check stations in Montana. Tissues were stored in RNAlater (Ambion). Total RNA was extracted from mule deer lymph nodes using Trizol following the manufacturer’s protocol. Briefly, 30 mg of RNAlater preserved lymph node tissue was minced with a scalpel and placed into 1 mL of Trizol reagent for 1 hour, with vortexing every 10 minutes to promote tissue breakage. The tissue was then centrifuged at 12,000 g at 4°C for 10 minutes to pellet DNA and insoluble tissue. RNA extraction then proceeded following the manufacturer’s instructions, substituting 100 µL 1-Bromo-3-chloropropane during phase separation. RNA was quantified using a Quantifluor fluorimeter and RNA measuring kit. A volume of RNA containing 6 µg total nucleic acid was treated with 2 µL TurboDNase for 60 minutes according to the manufacturer’s instructions. Total RNA was quantified again using the Quantifluor fluorimeter and RNA kit and removal of DNA was verified by a CrERV-specific PCR (primers = in6887dF: 5′-GGGAACATGGTGGCCCRTTTTGAC-3′ and in7109dR: 5′-GTCCCGGTRGTTTCACATCCC-3′). Genomic DNA was extracted from RNAlater preserved mule deer lymph nodes using the QIAamp DNA Micro Kit following the manufacturer’s protocol for tissue samples. Briefly, a 10 mg piece of lymph node was added to 180 µL Buffer ATL and equilibrated to room temperature. After adding 20 µL of proteinase K (NEB), the tissue was placed in a 56°C water bath for overnight tissue lysis. Next, 200 µL Buffer AL and 200 µL 100% ethanol were added and the mixture was pulse vortexed and spun through a QIAamp MinElute column. The column bound DNA was then washed with 500 µL Buffer AW1 followed by 500 µL Buffer AW2. The membrane was then centrifuged at max speed for 3 min to dry. DNA was eluted in 100 µL Buffer AE and quantified using the Quantifluor dsDNA system (Promega).

cDNA synthesis

Total RNA was made into cDNA using the AffinityScript Multiple Temperature cDNA synthesis kit, following the manufacturer’s instructions for either Random Primers or a gene-specific primer. Briefly, 1 µg of total RNA was incubated with either 300 ng Random Primers or 100 ng of a CrERV-specific primer (I7_8555R: 5′-GCCGGTATTGTTGCCTTAGC-3′) and RNase-free water for 5 minutes at 65°C before cooling at room temperature for 5 minutes. Next, 10X AffinityScript Buffer, 100 mM dNTPs, RNase Block, and AffinityScript Reverse Transcriptase were added following the manufacturer’s protocols.

Transcript amplification, cloning, and Sanger sequencing

Spliced envelope transcript sequences were amplified from cDNA generated with the CrERV-specific primer (I7_8555R) using Takara ExTaq polymerase following the manufacturer’s instructions. Primers are listed in Table 3.1. PCR amplified sequences were then cloned into the TA cloning vector (Invitrogen) and processed using the Qiagen Miniprep Kit. Transcripts were then sequenced using Sanger sequencing at the Penn State Genomics Core Facility and analyzed using the LaserGene suite by DNAstar.

PCR for individual CrERV loci

To determine allelic state of the CrERV integration in select animals, PCR primers (Table 3.2) were designed to the genomic region flanking a CrERV integration, which amplified the pre-integration site if the CrERV is absent. PCR was performed using Promega GoTaq Long PCR Master Mix and 0.4 µM of each primer. Thermocycling conditions in a BioRad T100 Thermal Cycler were 95°C for 3 min, then 36 cycles of 95°C for 30 s, 62-65°C for 30 s, 72°C, and a final extension of 72°C for 10 min. The PCR was analyzed by running on a 1% agarose gel with the NEB kb ladder. PCR bands that corresponded to solo-LTR and pre-integration site were gel isolated and Sanger sequenced for confirmation.

Quantitative PCR of CrERV

CrERV expression levels in Random Primer cDNA were measured by real-time quantitative PCR (qPCR) using the iQ SYBR Green Supermix Reagent (Bio-Rad) and a Bio-Rad iQ5 Real-Time PCR detector system (Bio-Rad). Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 10 seconds at 95°C, 10 seconds at the primer-specific annealing temperature (58°C or 60°C), and 20 seconds at 72°C for 40 cycles. Primers are listed in Table 3.3. A melt curve analysis was performed to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. CrERV expression data were normalized against beta-actin and G6pdh as housekeeping genes.
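The absolute quantification described above can be sketched as follows. This is an illustrative reimplementation, not the Bio-Rad software's algorithm: it assumes a linear fit of threshold cycle (Ct) against log10 plasmid copy number, and it assumes that normalization to the two housekeeping genes uses their geometric mean, which the text does not specify. All function names and numbers are hypothetical.

```python
import math


def fit_standard_curve(copies, cts):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10 ** ((ct - intercept) / slope)


def normalized_copy_number(target_copies, reference_copies):
    """Divide target copies by the geometric mean of the reference genes (assumed scheme)."""
    log_mean = sum(math.log(c) for c in reference_copies) / len(reference_copies)
    return target_copies / math.exp(log_mean)


# Hypothetical plasmid dilution series and Ct readings.
slope, intercept = fit_standard_curve(
    [1e6, 1e5, 1e4, 1e3, 1e2], [18.1, 21.4, 24.8, 28.1, 31.4]
)
efficiency = 10 ** (-1 / slope) - 1  # amplification efficiency implied by the slope
crerv = copies_from_ct(26.0, slope, intercept)
print(f"slope={slope:.2f}, efficiency={efficiency:.2f}")
print(normalized_copy_number(crerv, [5.0e4, 2.0e4]))  # hypothetical actin and G6pdh copies
```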

Mule deer gene transcript screening and quantification

Samples were screened for gene transcript splicing differences using PCR. Briefly, cow transcripts of each gene of interest were mapped to the mule deer genome and primers were designed to evaluate gene transcript variation in animals with and without the CrERV. Gene transcripts were quantified from Random Primer cDNA by qPCR using the iQ SYBR Green Supermix Reagent (BioRad) and a Bio-Rad iQ5 Real-Time PCR detector system. Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 10 seconds at 95°C, 15 seconds at the primer-specific annealing temperature (53°C for LTR-driven KXD1, 60-62°C for others), and 20 seconds at 72°C for 40 cycles. Primers are listed in Table 3.4. A melt curve analysis was performed in order to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. Gene expression data were normalized against beta-actin expression. To quantify FBXO42, gene-specific cDNA was made using a primer in exon 4 of FBXO42 (E4_7659R: 5′-TGCAGCCTCCAAACACATACATAG-3′) and was purified using the Agencourt RNAClean XP system (Beckman Coulter) following the manufacturer’s instructions for single-stranded DNA purification. Purified cDNA was eluted in 30 µL water and quantified using the Quantifluor ssDNA system (Promega). FBXO42 expression was quantified from purified cDNA using the iQ SYBR Green Supermix Reagent (BioRad) and Bio-Rad iQ5 Real-Time PCR detector system. Data were analyzed using Bio-Rad iQ5 Optical System Software V. The conditions for all qPCR reactions were as follows: 3 minutes at 95°C followed by 10 seconds at 95°C, 15 seconds at the primer-specific annealing temperature (52°C for Total FBXO42, 58°C for Exon 1A, 61.5°C for Exon 1B, and 54°C for Exon 1C), and 20 seconds at 72°C for 40 cycles. A melt curve analysis was performed in order to verify the amplification of a single product. Absolute values of each qPCR reaction were determined by the Bio-Rad iQ5 Optical System Software using a standard curve created with plasmids of known copy number. FBXO42 exon 1 expression was normalized to total FBXO42 (exon 2-exon 3) expression within each animal.

Statistical Analyses

Statistical analyses were performed using the R statistical software. Mann-Whitney U tests were used to evaluate differences between Montana and Wyoming populations. Differences were considered statistically significant at p-value < 0.05.
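For readers unfamiliar with the test, the Mann-Whitney U statistic can be sketched as below. The analyses in this chapter were run in R; this Python version is only an illustration, and it assumes a normal approximation without tie or continuity correction, so its p-values can differ slightly from those of an exact test on small samples. The expression values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # assumes no ties
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical normalized expression values for two populations.
montana = [0.11, 0.14, 0.09, 0.17, 0.12]
wyoming = [0.21, 0.19, 0.25, 0.16, 0.30]
u, p = mann_whitney_u(montana, wyoming)
print(f"U = {u}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```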

4.3 Results

MT and WY animals can be separated based on shared CrERV integrations

We previously [213] generated CrERV-host junction fragment libraries for animals from Montana, Oregon, and Wyoming populations and assigned each virus occurrence within an animal a probability using a mixture model. A similar analysis was performed on a different set of Wyoming animals using the Illumina platform to sequence CrERV-host junction fragments. We performed principal components analysis (PCA) to visualize the relationships among the animals based on shared CrERV integrations (Figure 4.1). Because more CrERV-host junction fragments were identified when sequencing with Illumina, we required that all CrERV included in the PCA were found in at least three animals sequenced using the Ion Torrent platform to avoid bias. The PCA indicates that Wyoming animals, regardless of sequencing method, separate from Montana and Oregon animals. We used the latitude and longitude coordinates of animal kill-location to depict the geographic distribution of animals (Figure 4.2). The map shows that animals from Montana were sampled from different regions of the western portion of the state. Animals from Wyoming can be roughly separated into two populations from the Northwest (NW) portion of the state and the Southeast (SE) portion of the state. Animals from both Wyoming populations were included in the Ion Torrent and Illumina sequencing datasets. Animals from both Wyoming populations were also used in the transcript studies described in this Chapter. Animals used for transcript analyses were not included in next generation sequencing libraries of CrERV integration sites. Although no animals positive for CWD were included in the transcript studies, the animals from the SE Wyoming population are from a CWD-endemic area.

[Figure 4.1 (scatter plot): x-axis, PC1; y-axis, PC2. Legend: 1 = MT Mule Deer; 2 = OR BT Deer; 3 = OR Mule Deer; 4 = WY Mule Deer; 5 = WY Illumina.]

Figure 4.1: PCA of shared CrERV.

The first two principal component scores were rotated and scaled to make the locations comparable with the latitude and longitude coordinates of animal kill-location. In the figure, MT stands for Montana, OR stands for Oregon, BT stands for blacktail deer, and WY stands for Wyoming. Animals depicted using numbers 1-4 were included in CrERV-host junction fragment libraries sequenced using Ion Torrent. Wyoming animals depicted using number 5 were included in CrERV-host junction fragment libraries sequenced using Illumina. PC1 and PC2 account for 9.5% and 5.1% of the variation, respectively.

Animals in Wyoming contain more CrERV integrations than animals in Montana

We previously used a mixture model to assign a probability for each CrERV in each individual in the Montana population [210] and determined the number of

[Figure 4.2 (map). Legend: 1 = MT Mule Deer; 2 = MT Mule Deer, transcript studies; 3 = OR BT Deer; 4 = OR Mule Deer; 5 = WY Mule Deer, Ion Torrent; 6 = WY Mule Deer, Illumina; 7 = WY Mule Deer, transcript studies.]

Figure 4.2: Map showing geographic locations of animals from Montana, Oregon, and Wyoming based on latitude and longitude of kill-location.

In the map, MT stands for Montana, OR stands for Oregon, WY stands for Wyoming, and BT stands for black-tailed. The Montana animals, Oregon black-tailed and mule deer, and Wyoming animals indicated by a blue 5 were included in next generation sequencing libraries of CrERV integration sites sequenced using Ion Torrent sequencing as described in [210, 213]. The Wyoming animals indicated by an orange 6 in the map were included in next generation sequencing libraries of CrERV integration sites sequenced using Illumina. The Wyoming animals indicated by a green 7 were used for transcript analyses in Chapter 4 but were not included in libraries of CrERV integration sites.

CrERV integrations per Montana animal. We used a similar dataset of CrERV integration sites to evaluate CrERV distributions in Wyoming animals. For the following analyses, CrERV with a probability of being present greater than 0.95 were included. The number of CrERV per Wyoming animal ranged from 261 to 737 integrations, with an average of 333 CrERV. The median number of CrERV in Wyoming animals (307 CrERV) is higher than the median number of CrERV in Montana animals (267 CrERV). Additionally, the distribution of CrERV per animal in Wyoming is shifted to the right compared to Montana (Figure 4.3). These data support that animals in Wyoming contain more CrERV than animals in Montana.
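The counting step described above can be sketched as a simple threshold on the mixture-model probabilities. The data layout (a mapping from animal to per-locus presence probability), the animal and locus names, and the values are hypothetical; only the 0.95 cutoff comes from the text.

```python
from statistics import mean, median

PRESENCE_THRESHOLD = 0.95  # cutoff stated in the text


def crerv_counts(prob_by_animal, threshold=PRESENCE_THRESHOLD):
    """Count loci per animal whose presence probability exceeds the threshold."""
    return {
        animal: sum(1 for p in probs.values() if p > threshold)
        for animal, probs in prob_by_animal.items()
    }


# Hypothetical probabilities: animal -> {locus: P(CrERV present)}
probabilities = {
    "WY_a": {"locus1": 0.99, "locus2": 0.97, "locus3": 0.40},
    "WY_b": {"locus1": 0.98, "locus2": 0.10, "locus3": 0.05},
    "WY_c": {"locus1": 1.00, "locus2": 0.96, "locus3": 0.99},
}
counts = crerv_counts(probabilities)
print(counts, "median:", median(counts.values()), "mean:", mean(counts.values()))
```

Reporting the median alongside the mean is useful here because a single animal with many integrations (such as the one with 737) pulls the mean upward.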

[Figure 4.3 (histogram): "Number of CrERV per Animal in MT and WY". x-axis, CrERV per Animal; y-axis, Frequency. Legend: MT, WY; median lines for MT and WY.]

Figure 4.3: Distribution of CrERV per Animal in Montana and Wyoming.

The y axis represents the number of individuals with a given number of CrERV. Pink bars indicate Montana animals, and blue bars indicate Wyoming animals. The median number of CrERV per Montana animal (dashed red line) and Wyoming animal (dashed blue line) are indicated on the graph.

Distribution of CrERV in Wyoming animals is similar to distribution in Montana

Using the same data, we determined that there were 32 CrERV that were shared by all 34 Wyoming animals, which is more than the number of CrERV shared among Montana animals (27 CrERV). Additionally, only 20% of CrERV are found in only one Wyoming animal, which we refer to as singletons. The distributions of CrERV in Montana and Wyoming are similarly shaped (Figure 4.4), indicating that CrERV are insertionally polymorphic in both populations.
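Fixed loci and singletons, as defined above, follow directly from how many animals carry each locus. The sketch below assumes the presence calls are already available as one set of locus identifiers per animal; the names and data are hypothetical.

```python
def locus_prevalence(present_by_animal):
    """Map each locus to the number of animals that carry it."""
    prevalence = {}
    for loci in present_by_animal.values():
        for locus in loci:
            prevalence[locus] = prevalence.get(locus, 0) + 1
    return prevalence


def summarize(present_by_animal):
    """Report loci shared by all animals, loci found in one animal, and the singleton fraction."""
    n_animals = len(present_by_animal)
    prevalence = locus_prevalence(present_by_animal)
    fixed = sorted(l for l, n in prevalence.items() if n == n_animals)
    singletons = sorted(l for l, n in prevalence.items() if n == 1)
    return {
        "fixed": fixed,
        "singletons": singletons,
        "singleton_fraction": len(singletons) / len(prevalence),
    }


# Hypothetical presence calls: animal -> set of loci called present.
calls = {
    "WY_a": {"locus1", "locus2", "locus4"},
    "WY_b": {"locus1", "locus3"},
    "WY_c": {"locus1", "locus2", "locus5"},
}
print(summarize(calls))
```

Dividing each prevalence count by the number of animals gives the per-locus proportion plotted on the x-axis of Figure 4.4.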

[Figure 4.4 (histogram): "Animals per CrERV in MT and WY". x-axis, Proportion of Total Animals per CrERV; y-axis, Frequency. Legend: MT, WY.]

Figure 4.4: CrERV prevalence among animals in Montana and Wyoming.

The distribution of all CrERV across Montana and Wyoming animals was plotted as a frequency histogram. The y axis represents the number of CrERV present in a given proportion of total animals in a population. There are 22 total animals in Montana, and 34 total animals in Wyoming. Pink bars indicate Montana animals, and blue bars indicate Wyoming animals.

Wyoming animals have fewer singletons than Montana animals

We previously determined that the number of singletons in each Montana animal increases with total number of CrERV (Figure 2.3). Using a linear regression model, we show that this is also true in Wyoming (Figure 4.5). The slopes of the regression lines from Montana and Wyoming, however, indicate that there are more singletons per animal in Montana than in Wyoming, despite the higher number of total CrERV per Wyoming animal. This, along with the finding that 32 CrERV are fixed in the Wyoming population, indicates that there are more CrERV shared among Wyoming animals.

[Figure 4.5 (scatter plot with regression lines): "Number of Singletons varies with Number of Total CrERV integrations (MT & WY)". x-axis, Total CrERV integrations; y-axis, Number of Singletons. Legend: Montana, Wyoming.]

Figure 4.5: The number of singletons varies with number of total CrERV.

Montana animals are indicated by filled circles, and Wyoming animals are indicated by empty circles. The regression line for Montana is indicated by a solid black line, and the regression line for Wyoming is indicated by a dashed line.

M273 CrERV phylogenetically related to CrERVγ-in1 are overrepresented in Wyoming animals

The data thus far indicate that Wyoming animals have more CrERV and that CrERV are also insertionally polymorphic in the Wyoming population of mule deer, although some CrERV are clearly shared between Montana and Wyoming animals. We have previously determined that M273 is a typical Montana mule deer in terms of number and distribution of CrERV (Chapter 2). We also have genome data and CrERV sequences from this animal; thus, we chose to evaluate the distribution of M273 CrERV in the Wyoming population. Of the 266 CrERV in M273, we have evaluated the phylogenetic affiliation or solo-LTR status for 160 of the CrERV. We calculated the proportion of Montana mule deer (of 22 total animals) and the proportion of Wyoming mule deer (of 34 total animals) that contain each M273 CrERV (Figure 4.6). There are 16 M273 CrERV that are absent from Wyoming, and the majority (13/16 CrERV) are phylogenetically related to CrERVγ-in6 and CrERVγ-in7, the youngest phylogenetic groups. A test of equal proportions indicates that 140 M273 CrERV are found in the same proportion of Montana and Wyoming animals. There are 15 M273 CrERV that are found in a higher proportion of Wyoming animals, whereas only 4 CrERV are found in a higher proportion of Montana animals. These 15 CrERV belong to various phylogenetic groups, and 5 of these CrERV are phylogenetically related to CrERVγ-in1. A Fisher’s exact test indicates that this proportion is higher than the proportion of CrERVγ-in1 viruses in the dataset (p-value=0.0308), suggesting that CrERVγ-in1 viruses may be overrepresented in Wyoming animals. Together, these data indicate that the majority of the CrERV integrations identified in M273, a typical Montana mule deer, are also found in Wyoming animals, and many of these CrERV integrations, particularly CrERVγ-in1, are more widespread in the Wyoming population.
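As an illustration of the enrichment test used above, the following sketch computes a two-sided Fisher's exact p-value for a 2x2 table from the hypergeometric distribution. Only the 5 CrERVγ-in1 viruses among the 15 Wyoming-enriched CrERV come from the text; the remaining counts in the example table are hypothetical placeholders, so the printed p-value is not expected to reproduce the reported value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(k) for k in range(lo, hi + 1)) if p <= observed * (1 + 1e-9)
    )


# Rows: Wyoming-enriched CrERV vs. all other M273 CrERV.
# Columns: CrERV-gamma-in1 vs. any other group.
# The second row (20, 125) is a hypothetical placeholder.
p_value = fisher_exact_two_sided(5, 10, 20, 125)
print(f"p = {p_value:.4f}")
```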

Similar phylogenetic groups of CrERV are transcribed in both MT and WY animals

In Chapter 3, we established that similar CrERV lineages are transcribed in M273 and other animals in Montana. Wyoming animals contain more and different CrERV than the Montana population, but share some loci with M273. Thus, we wanted to evaluate if similar CrERV families are expressed in Montana and Wyoming animals, or if Wyoming has unique transcriptionally active CrERV lineages. We cloned and sequenced CrERV spliced env transcripts from 11 Wyoming animals. Of note, these Wyoming animals were not included in CrERV-host junction fragment analyses. Given that Wyoming animals cluster together in a PCA based on their CrERV profile, we assume that the Wyoming animals used for the transcript analyses are likely to have similar CrERV as other animals from Wyoming. Transcripts that were phylogenetically related to CrERVγ-in1 and CrERVγ-in3

[Figure 4.6 (heat map): "Distribution of M273 CrERV in MT and WY". Panels: in1, in12, in3, in6, in7, solo-LTR. x-axis, State (MT Proportion, WY Proportion); y-axis, Cluster; color scale, Proportion Value from 0.00 to 1.00.]

Figure 4.6: The proportion of Montana vs Wyoming animals that contain each M273 CrERV.

The y-axis represents each of the 160 CrERV in M273 for which we have phylogenetic affiliation. CrERV are separated by phylogenetic group or solo-LTR status. Darker color indicates that a CrERV is found in a higher proportion of animals. White color indicates that a CrERV is absent or found in a low proportion of animals in a population.

were cloned from all Wyoming animals in the analysis, and spliced env transcripts phylogenetically related to CrERVγ-in12 were identified in a subset of animals. Analysis of CrERVγ-in3 related spliced env transcripts from Montana indicated little diversity among transcripts from this group (pairwise distance=0.000803). This is also consistent when considering CrERVγ-in3 transcripts from Wyoming (pairwise distance=0.000862). The same transcript is shared amongst all animals in Montana and Wyoming and sequence analysis suggests this transcript is most similar to S386, a previously identified transcriptionally active CrERV in Montana. There was more diversity among Wyoming CrERVγ-in1 transcripts (pairwise distance=0.00495), and more transcripts related to this phylogenetic group were cloned from Wyoming animals. This is comparable to the diversity of Montana CrERVγ-in1 transcripts (pairwise distance=0.006774). Three Wyoming animals contained a CrERVγ-in12 transcript shared with the Montana animals, and sequence analysis suggests this transcript is most similar to S2220, a previously identified transcriptionally active CrERV in Montana. There were more CrERVγ-in12 transcripts cloned from Montana animals (pairwise distance=0.011787). CrERVγ-in12 transcripts were found in only five Wyoming animals and diversity of this group was lower (pairwise distance=0.00286). We did not identify any additional groups of transcriptionally active CrERV in Wyoming, suggesting that similar CrERV phylogenetic lineages are expressed in both populations despite differences in CrERV content.
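The diversity values quoted above are mean pairwise distances among aligned transcript sequences. The text does not state which distance model was used, so the sketch below assumes the simplest one, the uncorrected p-distance (proportion of differing sites), and ignores alignment gaps; the sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_pairwise_distance(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned transcript fragments.
transcripts = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC", "ACG-ACGTAC"]
print(mean_pairwise_distance(transcripts))
```

A model-corrected distance would give slightly larger values than the p-distance, although the difference is small at the low divergence levels reported here.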

Lineage-specific expression patterns are consistent across Montana and Wyoming animals

In Montana animals, CrERV expression levels within each animal varied based on phylogenetic group, and all animals had higher CrERVγ-in3 and CrERVγ-in1 expression relative to CrERVγ-in12 expression levels. Because similar phylogenetic groups of CrERV are transcriptionally active in both Montana and Wyoming, we evaluated lineage-specific CrERV expression levels in Wyoming animals to determine if this expression pattern was similar in both populations. We quantified expression levels of CrERVγ-in1, CrERVγ-in3, and CrERVγ-in12 in each Montana and Wyoming animal in the analysis (Figure 4.7). In all Montana and Wyoming animals, there is more expression of CrERVγ-in1 and CrERVγ-in3 relative to CrERVγ-in12, suggesting this expression pattern is consistent across both populations. We also note that in a subset of Wyoming animals from NW Wyoming (WY988, W989, W993, and W995), there is lower CrERVγ-in3 expression than in other animals in this population.

[Figure 4.7 plot: "Lineage-specific CrERV Expression". Panels: Montana, Wyoming. x-axis: Animal (M167, M253, M257, M261, M268, M273, M350, M358, M364, M376, WY291, WY329, WY910, WY988, WY989, WY993, WY995, WY1331, WY1553). y-axis: Normalized Copy Number. Legend: in1, in3, in12.]

Figure 4.7: Lineage-specific CrERV expression in Montana and Wyoming animals.

The y-axis represents copy number normalized to actin and G6pdh expression in each animal. Each boxplot represents 6 replicates.
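The caption states that copy numbers were normalized to actin and G6pdh, but not how the two reference genes were combined. As an illustration only, the sketch below assumes division by the geometric mean of the two reference copy numbers, a common convention when more than one reference gene is used. The function name and the example values are hypothetical.

```python
from math import sqrt


def normalize_copy_number(target: float, actin: float, g6pdh: float) -> float:
    """Normalize a target copy number to two reference genes.

    Assumption (not stated in the source): the reference value is the
    geometric mean of the actin and G6pdh copy numbers from the same sample.
    """
    if actin <= 0 or g6pdh <= 0:
        raise ValueError("reference copy numbers must be positive")
    return target / sqrt(actin * g6pdh)


# Hypothetical replicate: 50 target copies, 400 actin copies, 100 G6pdh copies.
print(normalize_copy_number(50.0, 400.0, 100.0))
```

Using a geometric rather than arithmetic mean keeps a highly abundant reference gene from dominating the denominator.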

Overall CrERV expression levels differ between Montana and Wyoming populations

The data suggest that CrERV lineage expression patterns within each individual animal were similar between the Montana and Wyoming populations. We next asked if CrERV expression was equal at the population level between Montana and Wyoming. To evaluate this, we quantified overall CrERV expression levels in our cohort of 11 Wyoming animals. The data indicate that there is significantly more total env and total gag expression in Wyoming animals (p-value=0.0043 and p-value=0.021) but more expression of spliced env transcripts in Montana animals (p-value=0.027) (Figure 4.8). Because there are two distinct populations of Wyoming mule deer included in the transcript analysis, we also determined if CrERV expression differed across the Montana, NW Wyoming, and CWD-endemic SE Wyoming populations. The data indicate that there is significantly more total env expression in animals from SE Wyoming than in animals from either Montana (p-value=2.5×10⁻⁵) or NW Wyoming (p-value=0.012) (Figure 4.9). There is also more total gag expression in animals from SE Wyoming than in either Montana or NW Wyoming (p-value=9.9×10⁻⁸ and p-value=1.5×10⁻⁶) (Figure 4.9). Separating the Wyoming populations, however, shows that both Montana and SE Wyoming populations have higher spliced env expression than NW Wyoming (p-value=2.6×10⁻⁸ and p-value=4.4×10⁻⁷) (Figure 4.9). This suggests that overall CrERV expression levels also differ between the two populations within Wyoming.

Lineage-specific CrERV expression levels differ between Montana and Wyoming populations

Analysis of lineage-specific CrERV expression also indicated differences between Montana and Wyoming. There was higher expression of CrERVγ-in1 (p-value=1.312×10⁻⁶) and CrERVγ-in12 (p-value=0.03797) in Wyoming (Figure 4.10). In contrast, CrERVγ-in3 expression levels did not differ between the two populations (p-value=0.187) (Figure 4.10). Consistent with the previous data (Figure 4.10), there was more CrERVγ-in1 expression in NW and SE Wyoming animals than in Montana animals (p-value=1.9×10⁻⁶ and p-value=0.00013) but no difference in CrERVγ-in1 expression levels within Wyoming (Figure 4.11). For CrERVγ-in3 expression levels, there were significant differences between all three populations, with highest CrERVγ-in3 expression in SE Wyoming and lowest CrERVγ-in3 expression in NW Wyoming (Figure 4.11). There was no difference in CrERVγ-in12 expression levels between Montana and NW Wyoming populations, but there was significantly higher CrERVγ-in12 expression in SE Wyoming compared to Montana animals (p-value=0.022) (Figure 4.11). These data indicate that considering Wyoming populations separately allows us to more accurately determine population-level CrERV expression differences.

[Figure 4.8 plot: "Overall CrERV Expression Levels". Panels: Total Env (p = 0.0043), Total Gag (p = 0.021), Spliced Env (p = 0.027). x-axis: State (Montana, Wyoming). y-axis: Normalized Copy Number.]

Figure 4.8: Overall CrERV expression levels differ between Montana and Wyoming.

The y-axis represents copy number normalized to actin and G6pdh expression in each animal. Each boxplot represents all replicates from all animals from each state. P-values were calculated using the Mann-Whitney U test.
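The Mann-Whitney U test named in the caption compares two groups by rank rather than by raw value. The following is a minimal standard-library sketch of a two-sided test using the normal approximation with tie correction and an optional continuity correction. It is not the analysis code used in this work (the software is not named here), the function name is hypothetical, and the example values are invented.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u(x, y, continuity=True):
    """Return (U for sample x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted(list(x) + list(y))

    # Assign mid-ranks so that tied values share the average of their ranks.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_x, 1.0  # all observations identical

    deviation = abs(u_x - n1 * n2 / 2)
    if continuity:
        deviation = max(deviation - 0.5, 0.0)
    z = deviation / sqrt(variance)
    return u_x, erfc(z / sqrt(2))  # two-sided tail probability


# Invented normalized copy numbers for two hypothetical groups.
u, p = mann_whitney_u([0.11, 0.14, 0.12, 0.10], [0.21, 0.25, 0.19, 0.30])
print(u, p)
```

For very small groups an exact test is usually preferable to the normal approximation sketched here.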

Transcriptionally active CrERV from M273 are widespread and heterozygous throughout Montana and Wyoming

We previously identified four transcriptionally active CrERV in a representative Montana animal, M273, using bisulfite sequencing and spliced env transcript sequencing. Using CrERV locus-specific PCR, we determined these CrERV are widespread and heterozygous in the Montana population. Because the data suggest that the same CrERV are transcriptionally active in Wyoming, we asked if these CrERV were also widespread and heterozygous in Wyoming. We determined that the nine Wyoming animals included in the CrERV transcript analysis were heterozygous for S386, S2220, S26536, and S3442, and all viruses were present in the genome as one copy of the provirus and one copy of the solo-LTR. We confirmed the absence of these CrERV from one Wyoming white-tailed deer (W999), which contained only a pre-integration site for these CrERV loci.

[Figure 4.9 plot: "Overall CrERV Expression Levels". Panels: Total Env, Total Gag, Spliced Env. x-axis: Geographic Location (MT, NW, SE). y-axis: Normalized Copy Number. Pairwise p-values are shown above each comparison.]

Figure 4.9: Overall CrERV expression levels differ between Montana and two Wyoming populations.

The y-axis represents copy number normalized to actin and G6pdh expression in each animal. Each boxplot represents all replicates from all animals from each state. P-values were calculated using the Mann-Whitney U test. In the figure, MT stands for Montana, NW stands for Wyoming Northwest, and SE stands for Wyoming Southeast.

[Figure 4.10 plot: "Lineage-specific CrERV Expression Levels". Panels: in1 (p = 1.3e-06), in3 (p = 0.19), in12 (p = 0.038). x-axis: State (Montana, Wyoming). y-axis: Normalized Copy Number.]

Figure 4.10: Lineage-specific CrERV expression levels differ between Montana and Wyoming.

The y-axis represents copy number normalized to actin and G6pdh expression in each animal. The y-axis scale is different for each lineage to better visualize differences between the groups. Each boxplot represents all replicates from all animals from each state. P-values were calculated using the Mann-Whitney U test.

Population-specific differences in regulation of KXD1 expression by S26536 integration

Given that transcriptionally active CrERV from M273 are also found in all Wyoming mule deer in the analysis, we extended our evaluation of CrERV LTRs as alternative promoters for proximal genes to the Wyoming population. S26536 LTR-driven KXD1 transcripts that excluded exon 1 were previously identified in Montana animals containing S26536. RNAseq and PCR indicate that the exon structure of the canonical KXD1 transcript is the same in Montana and Wyoming animals. We confirmed the presence of a S26536 LTR-driven KXD1 transcript in a Wyoming animal that had the same sequence as the transcript identified in Montana, suggesting this is common in both populations. We previously determined that the presence of S26536 in Montana animals does not alter the expression levels of total KXD1 or canonical KXD1 (contains exon 1). A similar analysis comparing total KXD1 expression and canonical KXD1 expression in Wyoming animals indicates no difference between animals that contain the virus (6 mule deer) and animals that do not contain S26536 (1 white-tailed deer) (Figure 4.12).

[Figure 4.11 plot: "Lineage-specific CrERV Expression Levels". Panels: in1, in3, in12. x-axis: Geographic Location (MT, NW, SE). y-axis: Normalized Copy Number. Pairwise p-values are shown above each comparison.]

Figure 4.11: Lineage-specific CrERV expression levels differ between Montana and two Wyoming populations.

The y-axis represents copy number normalized to actin and G6pdh expression in each animal. Each boxplot represents all replicates from all animals from each state. P-values were calculated using the Mann-Whitney U test. In the figure, MT stands for Montana, NW stands for Wyoming Northwest, and SE stands for Wyoming Southeast.

[Figure 4.12 plot: "KXD1 Gene Expression Levels in WY only". Panels: Total KXD1 (p = 0.68), Canonical KXD1 (p = 0.63). x-axis: S26536 Status (No, Yes). y-axis: Normalized Copy Number.]

Figure 4.12: KXD1 gene expression levels between Wyoming animals with and without S26536.

Copy numbers of Total KXD1 and KXD1 exon 1 were normalized to actin copy number. Expression levels were compared between animals with the S26536 integration (blue boxplot) and animals without the S26536 integration (pink boxplot) using the Mann-Whitney U test, and p-values of each comparison are indicated at the top of the graph.

Additionally, there was no difference in either total KXD1 expression or canonical KXD1 exon 1 expression between Montana and Wyoming animals, indicating no major population differences in expression of this gene (Figure 4.13, a). Separating the Wyoming animals into the NW and SE populations, however, shows that there is significantly less total KXD1 expression in animals from NW Wyoming than in either Montana animals (p-value=0.014) or SE Wyoming animals (p-value=0.044) (Figure 4.13, b). In contrast, there was no difference in canonical KXD1 expression between the three populations. There is significantly more expression of LTR-driven KXD1, however, in Wyoming animals, regardless of Wyoming population (Figure 4.14). This suggests there may be population-specific differences in regulation of KXD1 transcription by the S26536 LTR.

[Figure 4.13 plot. (A) "KXD1 Gene Expression between States"; panels: Total KXD1 (p = 0.087), Canonical KXD1 (p = 0.72); x-axis: State (Montana, Wyoming). (B) "KXD1 Gene Expression between populations"; panels: Total KXD1, Canonical KXD1; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized Copy Number.]

Figure 4.13: Total KXD1 and canonical KXD1 gene expression between populations.

(A) KXD1 gene expression level differences between Montana and Wyoming. (B) KXD1 gene expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Expression levels were compared between populations using the Mann-Whitney U test. P-values are reported.

[Figure 4.14 plot. (A) "LTR-driven KXD1 expression between States" (p = 2.3e-05); x-axis: State (Montana, Wyoming). (B) "LTR-driven KXD1 expression between populations"; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized LTR-KXD1 Expression.]

Figure 4.14: LTR-driven KXD1 gene expression between populations.

(A) LTR-KXD1 gene expression level differences between Montana and Wyoming. (B) LTR-KXD1 gene expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Data shown include all replicates from all animals that contain the S26536 integration. P-values were calculated using the Mann-Whitney U test with continuity correction.

No population-specific differences in impact of S386 on gene expression

S386 LTR-driven read-through SIRT6 transcripts were previously identified in Montana animals, although the presence of S386 did not alter expression of SIRT6 in this population. We confirmed the presence of a read-through S386 LTR-driven SIRT6 transcript in a Wyoming animal with the same sequence as the transcript identified in Montana, suggesting this is common in both populations. As in Montana, there was no quantitative impact of S386 on total SIRT6 expression in Wyoming animals (Figure 4.15).

[Figure 4.15 plot: "SIRT6 Gene Expression Levels in WY only" (p = 0.095). x-axis: S386 Status (No, Yes). y-axis: Normalized Copy Number.]

Figure 4.15: SIRT6 gene expression levels between Wyoming animals with and without S386.

Copy numbers of Total SIRT6 were normalized to actin copy number. Expression levels were compared between animals with the S386 integration (blue boxplot) and animals without the S386 integration (pink boxplot) using the Mann-Whitney U test, and the p-value is indicated at the top of the graph.

S386 LTR-driven SIRT6 expression levels did not differ between the Montana and Wyoming populations, although there was significantly more expression of LTR-SIRT6 in Montana animals compared to animals from NW Wyoming (p-value=0.031) (Figure 4.16). There is more total SIRT6 expression in Montana animals than Wyoming animals (Figure 4.17, a). Separating the Wyoming populations shows that both the Montana population (p-value=0.00033) and SE Wyoming populations (p-value=0.019) have significantly more total SIRT6 expression than NW Wyoming animals (Figure 4.17, b).

[Figure 4.16 plot. (A) "LTR-driven SIRT6 expression between States" (p = 0.15); x-axis: State (Montana, Wyoming). (B) "LTR-driven SIRT6 expression between populations"; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized LTR-SIRT6 Expression.]

Figure 4.16: LTR-driven SIRT6 gene expression between populations.

(A) LTR-SIRT6 gene expression level differences between Montana and Wyoming. (B) LTR-SIRT6 gene expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Data shown include all replicates from all animals that contain the S386 integration. P-values were calculated using the Mann-Whitney U test with continuity correction.

FBXO42 Exon 1 expression patterns and splicing differ between populations

We previously determined that the presence of S3442 in the first intron of an FBXO42 transcript isoform affects gene splicing and alters patterns of FBXO42 exon 1 expression in Montana animals. Given the presence of the heterozygous S3442 CrERV in all Wyoming animals in our study, we performed a similar analysis in Wyoming animals. Unlike the pattern seen in Montana animals, there was more exon 1B-exon 2 expression in Wyoming animals that contain S3442 (Figure 4.18).

[Figure 4.17 plot. (A) "Total SIRT6 expression between States" (p = 0.0023); x-axis: State (Montana, Wyoming). (B) "Total SIRT6 Gene expression between populations"; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized Total SIRT6 Expression.]

Figure 4.17: Total SIRT6 gene expression between populations.

(A) SIRT6 gene expression level differences between Montana and Wyoming. (B) SIRT6 gene expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Expression levels were compared between populations using the Mann-Whitney U test. P-values are reported.

Additionally, no Wyoming mule deer containing S3442 had any measurable exon 1A-exon 2 expression. In some Montana animals positive for S3442, however, there is expression of exon 1A-exon 2, suggesting population-level differences in expression of FBXO42 transcripts containing exon 1A. As in Montana, there was more exon 1C-exon 2 expression in animals with S3442.

Because there was evidence of differences in FBXO42 expression between Montana and Wyoming animals, we chose to evaluate population-specific differences in FBXO42 exon 1B and exon 1C expression. There was more FBXO42 exon 1B expression in Montana animals (Figure 4.19, a) than Wyoming animals overall.

[Figure 4.18 plot: "FBXO42 Alternative Exon 1 Expression in WY". Panels: Exon 1B, Exon 1A, Exon 1C. x-axis: S3442n Status (No, Yes). y-axis: Normalized Copy Number. P-values are shown at the top of the graph.]

Figure 4.18: Presence of S3442n integration alters FBXO42 Exon 1 usage patterns in Wyoming.

Copy numbers of FBXO42 Exon 1B-exon 2, Exon 1A-exon 2, and Exon 1C-exon 3 were normalized to total FBXO42 copy number. Expression of each alternative exon 1 was compared between animals with the S3442 integration (blue boxplot) and animals without the S3442 integration (pink boxplot) using the Mann-Whitney U test, and p-values of each comparison are indicated at the top of the graph.

Separation of the two Wyoming populations, however, reveals that there is significantly more exon 1B expression in Montana animals (p-value=9.5×10⁻⁶) and SE Wyoming animals (p-value=6.7×10⁻⁵) than NW Wyoming animals (Figure 4.19, b). Levels of exon 1C-exon 2 expression were similar between Montana and Wyoming animals (Figure 4.20); however, there was more exon 1C expression in Montana (p-value=0.021) and SE Wyoming (p-value=0.018) than NW Wyoming animals. Taken together, these data suggest that there may be population-specific differences in FBXO42 expression patterns involving exon 1B and exon 1C, with lower expression in NW Wyoming animals.

[Figure 4.19 plot. (A) "Exon 1B expression between States" (p = 0.0025); x-axis: State (Montana, Wyoming). (B) "Exon 1B expression between populations"; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized Exon 1B-Exon 2 Expression.]

Figure 4.19: FBXO42 Exon 1B-exon 2 expression between populations.

(A) Exon 1B expression level differences between Montana and Wyoming. (B) Exon 1B expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Expression levels were compared between populations using the Mann-Whitney U test. P-values are reported.

Data from Montana animals also demonstrate that the intronic S3442 CrERV affects FBXO42 expression by contributing splice sites that result in early termination and CrERV exonization within the FBXO42 transcript. RNAseq data from a Wyoming animal (W1331) indicated that similar FBXO42-S3442 chimeric transcripts were present in this animal. We determined that in at least four Wyoming animals (W291, W329, W995, and W1331), there was evidence of FBXO42 Exon 1A splicing into the S3442 env gene (Table 4.1). We also identified a transcript that contains the S3442 env splice to FBXO42 exon 2 in one Wyoming animal (W291), suggesting that exonization of S3442 may not be widespread in the population or is rare in lymph nodes. A similar analysis in Montana animals suggested that all animals with Exon 1A to env splice also have env-exon 2 splice, indicating exonization of S3442 within the FBXO42 transcript. These data support differences in regulation of FBXO42 splicing between the Montana and Wyoming mule deer populations.

[Figure 4.20 plot. (A) "Exon 1C expression between States" (p = 0.76); x-axis: State (Montana, Wyoming). (B) "Exon 1C expression between populations"; x-axis: Geographic Location (MT, NW, SE); pairwise p-values are shown above each comparison. y-axis: Normalized Exon 1C-Exon 2 Expression.]

Figure 4.20: FBXO42 Exon 1C-exon 2 expression between populations.

(A) Exon 1C expression level differences between Montana and Wyoming. (B) Exon 1C expression levels between Montana, Northwest Wyoming, and Southeast Wyoming populations. Expression levels were compared between populations using the Mann-Whitney U test. P-values are reported.

Table 4.1: Summary of FBXO42-S3442 PCR results. PCRs were used to evaluate splicing from FBXO42 Exon 1A to S3442 env and splicing from S3442 env to FBXO42 Exon 2. All PCRs were performed as nested PCRs to account for rare transcripts. Animals without S3442 were used as a negative control and are indicated by an asterisk (*).

State     Animal   Exon 1A-env   env-Exon 2
Montana   M170*    -             -
Montana   M257     -             -
Montana   M273     +             +
Montana   M358     +             +
Montana   M376     +             +
Wyoming   W291     +             +
Wyoming   W329     +             -
Wyoming   W988     -             -
Wyoming   W993     -             -
Wyoming   W995     +             -
Wyoming   W999*    -             -
Wyoming   W1331    +             -
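The contrast drawn from Table 4.1 can be checked by tallying the PCR calls per state. The sketch below transcribes the table rows and counts, among S3442-positive animals only, how many show the Exon 1A-env splice and how many show both splices. The tuple layout, field names, and function name are hypothetical conveniences, not part of the original analysis.

```python
# Rows transcribed from Table 4.1:
# (state, animal, carries_S3442, exon1A_env_detected, env_exon2_detected)
# Animals marked with an asterisk in the table lack S3442 (negative controls).
TABLE_4_1 = [
    ("Montana", "M170", False, False, False),
    ("Montana", "M257", True, False, False),
    ("Montana", "M273", True, True, True),
    ("Montana", "M358", True, True, True),
    ("Montana", "M376", True, True, True),
    ("Wyoming", "W291", True, True, True),
    ("Wyoming", "W329", True, True, False),
    ("Wyoming", "W988", True, False, False),
    ("Wyoming", "W993", True, False, False),
    ("Wyoming", "W995", True, True, False),
    ("Wyoming", "W999", False, False, False),
    ("Wyoming", "W1331", True, True, False),
]


def summarize(rows):
    """Count splice detections per state among S3442-positive animals."""
    summary = {}
    for state, _animal, carrier, exon1a_env, env_exon2 in rows:
        if not carrier:
            continue  # negative controls are excluded from the tally
        counts = summary.setdefault(
            state, {"carriers": 0, "exon1A_env": 0, "both_splices": 0}
        )
        counts["carriers"] += 1
        counts["exon1A_env"] += int(exon1a_env)
        counts["both_splices"] += int(exon1a_env and env_exon2)
    return summary


for state, counts in summarize(TABLE_4_1).items():
    print(state, counts)
```

In this tally every Montana animal with the Exon 1A-env splice also shows the env-Exon 2 splice, whereas only one of the four such Wyoming animals does, which is the pattern described in the text.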

4.4 Discussion

We determined that the Montana and Wyoming populations differ in number of CrERV integrations and in CrERV expression levels. Animals in Wyoming contain more CrERV, have higher overall CrERV expression, and higher expression of two CrERV lineages compared to animals in Montana. We also determined that there is significant variation in CrERV expression within Wyoming when separating Wyoming animals into two distinct geographic regions. There were higher CrERV expression levels in SE Wyoming, where CWD is endemic. High CrERV expression in SE Wyoming drives the differences in CrERV expression between Montana and Wyoming, suggesting that NW Wyoming animals are more similar to Montana animals based on levels of CrERV expression.

We also found evidence that differential CrERV expression may contribute to expression differences of proximal genes among the populations. CrERVγ-in1 expression levels are higher in Wyoming than Montana, regardless of the geographic region of origin within Wyoming. This was also consistent with LTR-driven KXD1 expression levels, which were higher in both Wyoming populations than in Montana. These data suggest that LTR-driven KXD1 transcript expression levels and CrERVγ-in1 expression levels are correlated. This also further supports that S26536 is transcriptionally active in both Montana and Wyoming animals, as this CrERV is proximal to KXD1 in the mule deer genome. Additionally, any biological impact of LTR-driven KXD1 would be more prevalent in Wyoming animals given the higher expression of this transcript, which may contribute to phenotypic variation between the two populations. Surprisingly, there was lower expression of total KXD1 in NW Wyoming, despite the increased levels of LTR-driven KXD1 and CrERVγ-in1 in this population. We did not see a difference in KXD1 gene expression levels between animals that have S26536 and animals that lack the CrERV in either Montana or Wyoming. These data indicate that there is an impact of LTR-driven KXD1 expression on total KXD1 gene expression that is dependent on CrERV expression level, rather than presence/absence of the CrERV. We consider that CrERV may fine-tune gene expression rather than switching gene expression on or off, as has been suggested for ERVs in neural progenitor cells [153].

We observed differential expression of CrERVγ-in3 among the three populations in the study. CrERVγ-in3 expression levels were highest in SE Wyoming, intermediate in Montana, and lowest in NW Wyoming. Unlike S26536 and KXD1, however, this does not correlate to highest levels of LTR-driven SIRT6 or total SIRT6 expression in SE Wyoming. Instead, we see equal SIRT6 expression levels between the Montana and SE Wyoming populations. This further supports that there is less of an effect of S386 expression on proximal gene expression regulation.

All mule deer evaluated in Wyoming contain S3442 in the provirus/solo-LTR configuration in the first intron of an FBXO42 transcript isoform.
We demonstrate that there is a differential effect of this CrERV on FBXO42 exon 1 usage between the Montana and Wyoming populations. In contrast to results in Montana animals, there was no measurable expression of FBXO42 transcripts containing exon 1A in any Wyoming animals that were positive for S3442. One explanation for this is that RNA quality may be lower in Wyoming animals, which would affect transcript quantification [262]. There was equal expression of FBXO42 exon 1B-exon 2 and FBXO42 exon 1C-exon 2 between Montana and SE Wyoming, however, which does not support the explanation that the observed differences in FBXO42 exon 1A-exon 2 expression are due to lower-quality RNA from Wyoming animals. Alternatively, there may be higher levels of FBXO42 exon 1A transcripts that terminate at the S3442 polyadenylation signal in Wyoming animals. This is consistent with the differences in amplification of FBXO42-S3442 chimeric transcripts between Montana and Wyoming (Table 4.1), although we did not specifically evaluate premature termination of FBXO42 in any Wyoming animals. Differences in FBXO42 Exon 1A expression are correlated with increased expression of CrERVγ-in12 in Wyoming animals, which further suggests population-specific differences in FBXO42 gene expression regulation. Lack of exon 1A-containing FBXO42 transcripts may result in phenotypic differences between the Montana and Wyoming populations, since alternative first exons are often associated with tissue or cell context-dependent gene expression [249,250].

We observed higher expression of full length CrERV in SE Wyoming compared to the Montana and NW Wyoming populations. Interestingly, we observed higher expression of spliced env in SE Wyoming than in NW Wyoming. Given that SE Wyoming is a CWD-endemic region, these data may suggest that increased or differential CrERV envelope expression is important in the context of disease. Distinct envelope sequences have been associated with pathogenicity of retroviruses [263]. A retrovirus isolated from wild mice also causes a non-inflammatory spongiform neurodegenerative disease [264], and the neurovirulence determinant was mapped to the retroviral env [265]. Retrovirus infection also strongly enhances infectivity of scrapie, a prion disease, in cell culture [266]. Expression of specific HERV families was upregulated in patients with sporadic Creutzfeldt-Jakob disease, a human prion disease, compared to expression levels in normal controls or patients with other neurological disease [66].
Although the impact of CrERV expression on CWD pathogenesis is beyond the scope of this dissertation, it is interesting to consider the implications of increased CrERV envelope expression in CWD-endemic areas compared to nearby geographic regions.

Animals in the Wyoming population contain more CrERV integrations than animals in the Montana population. This may be due to a novel Wyoming-specific CrERV lineage that has not been identified in our representative Montana animal [212] or expansion of a known CrERV lineage in the Wyoming population. Although we did not test these alternatives in Chapter 4, we hypothesize that there is expansion of CrERVγ-in1 in Wyoming animals. This CrERV lineage has participated in germline recombination events in M273, indicating expression in the mule deer germline. Additionally, data presented in Chapter 4 indicate that CrERVγ-in1 expression levels are higher in both Wyoming populations compared to Montana animals. A detailed analysis of CrERV sequence reconstructions in a Wyoming animal will need to be done to address this question.

In conclusion, there are population-specific differences in CrERV content and expression levels among mule deer populations. CrERV expression levels also differ between two distinct Wyoming populations, one of which is from a CWD-endemic area. The increased number of CrERV integrations and higher CrERV expression levels in Wyoming warrant a more detailed analysis of transcriptionally active CrERV in this population. Bisulfite analysis and reconstruction of CrERV sequences in a Wyoming animal would enable the identification of Wyoming-specific transcriptionally active CrERV. This may reveal additional host genes with CrERV-mediated differences in gene expression between Montana and Wyoming that contribute to population-specific phenotypic differences. None of the Wyoming animals included in the transcript analyses were CWD-positive, which is a major limitation of this study with respect to CrERV involvement in CWD. Establishing differences in transcriptionally active CrERV between CWD-positive and CWD-negative animals from the same Wyoming population is crucial to evaluate association between CrERV and CWD in mule deer.

Chapter 5 | Discussion and Future Directions

ERVs were long considered to be ‘junk DNA,’ but their regulatory contributions are now widely accepted. ERVs have rewired transcriptional and regulatory networks [129, 131], have been co-opted for essential host functions [137, 143, 144], and have played a role in human and primate evolution [45,47]. ERVs, however, can also negatively impact gene expression [50], and they are often silenced by DNA methylation and histone modifications [80,218]. Alternatively, ERVs may retain transcriptional activity due to their genomic location, as a byproduct of proximity to host genes [149, 152, 222]. Another, and least explored, alternative is that transcriptional activity is retained because it confers a selective advantage. ERVs have been found in the genomes of all vertebrates sequenced to date, and many species have insertionally polymorphic ERVs [177, 181–183, 185, 187]. Given the impacts of ERVs on the host genome and gene expression, insertionally polymorphic ERVs could result in genomic and phenotypic diversity among individuals in a population or species. Few studies have explored the genomic and evolutionary impacts of transcriptionally active ERVs in the context of ERV insertional polymorphism.

Mule deer are an outbred species that have sustained multiple infection events from retroviruses since speciation. This has resulted in multiple CrERV lineages, each with a different sequence and potentially different properties, in the contemporary mule deer genome. CrERV insertional polymorphism has been well established [184, 185, 210, 213] and is extensive, with thousands of polymorphic ERV integration sites across multiple mule deer populations. We observed CrERV transcripts in mule deer lymph nodes [146,184]; however, it was unknown which CrERV were transcriptionally active. Because CrERV are insertionally polymorphic and evolutionarily young, transcriptional activity may be due to proximity to genes or recent integration into the genome or, alternatively, a functional impact on gene expression. The goals of this dissertation were to identify transcriptionally active CrERV, evaluate if CrERV expression was a byproduct of genomic location, and compare CrERV transcription between two distinct mule deer populations.

ERVs that have recently colonized a host are nearly identical and are difficult to place during genome assembly [174]. Thus, we reconstructed CrERV from whole-genome sequence data of a representative Montana mule deer (Chapter 2). These data, along with the mule deer genome assembly, were used to identify transcriptionally active CrERV candidates (Chapter 3). We asked if CrERV were transcriptionally active due to proximity to genes, requiring identification of the genomic location of the CrERV proviruses producing transcripts. Others have used RNAseq [16] or specific nucleotide differences to map cDNAs to individual proviruses [208]. These approaches fail, however, when genomic proviruses are highly (>96%) identical and may be confounded by RT-PCR errors or SNPs. We mapped transcripts to reconstructed CrERV and identified the genomic location of unmethylated 5′ LTRs to accurately identify CrERV proviruses that were producing transcripts in M273. We also show that junction fragment libraries can be used to explore CrERV diversity and sharing among mule deer populations, and demonstrate that the extent of CrERV sharing among individuals is low. We also used these data to establish that our representative animal (M273), for which we have the genome assembled and CrERV sequences reconstructed, is a typical Montana mule deer (Chapter 2).
We then used junction fragment libraries to explore transcriptionally active CrERV distribution across the population (Chapters 3 and 4) and determined that transcriptionally active CrERV were widespread in multiple mule deer populations.

We next identified transcriptionally active CrERV in a subset of mule deer from the Montana population (Chapter 3). We established that transcriptionally active CrERV were close to genes. It has been suggested that ERVs that integrate close to genes are unmethylated and transcriptionally active because they cannot be silenced due to genomic location [152], but over time insulators may allow silenced ERV and proximal expressed genes to be in equilibrium [149]. We showed that recently integrated CrERV that are close to genes can be silenced and the gene can be expressed. We expected that transcriptionally active CrERV would be from the youngest, insertionally polymorphic group based on data from other organisms with recently colonized ERVs [158, 267–269] and from the sequence of a replication-competent CrERV [202]. We determined that transcriptionally active CrERV in mule deer lymph nodes belonged to phylogenetically older CrERV lineages and were widespread in the population. Some data support that ERV regulatory activity evolves over time, given that older TEs are more likely to act as enhancers [238] and that transcription factor binding site distribution in TEs is dependent on the age of the element [220]. We show that relatively young ERVs, particularly S386, are transcriptionally active and that species-specific ERVs, such as S26536, can act as host gene regulatory elements.

The data supporting widespread expression of S386 in the mule deer population are particularly interesting, as S386 is a more recent integration that allows us to investigate the effects of ERV integration closer to the time of initial integration into the host genome. S386 has recently expanded within the genome [212], which indicates germline expression of S386 that produced a functional transcript that was able to integrate into the genome. Relative to other transcribed CrERV, S386 is highly expressed in mule deer lymph nodes; however, we demonstrate that the bidirectional promoter activity of the S386 LTR does not impact proximal host gene transcript splicing, nor does its presence alter host gene expression levels (Chapter 3). This suggests that any benefit of CrERV transcriptional activity may involve more than the co-option of the LTR as a regulatory element for host genes. This is similar to the suggestion that reduced HERV-H solo-LTR formation supports that HERV-H RNA is important to the host [270]. Indeed, HERV-H lncRNA is essential to maintain human embryonic stem cell identity [137].
Given that all four transcriptionally active CrERV are found in the provirus/solo-LTR configuration in all animals evaluated, we propose that the CrERV RNA or provirus is beneficial to the host in a way that the solo-LTR is not and may act as a lncRNA. CrERV have previously been used to show population structure and history of wild mule deer populations [185], and CrERV distributions demonstrate that mule deer from Wyoming are more distantly related to mule deer from Oregon and Montana [213]. Thus, we extended our analyses of CrERV distribution (Chapter 2) and transcriptionally active CrERV (Chapter 3) to the Wyoming population (Chapter 4). This was of particular interest in the context of chronic wasting disease (CWD). The animals used for transcript studies in Chapter 4 were from two distinct Wyoming populations: NW Wyoming and SE Wyoming, which is a CWD-endemic area. We demonstrated that animals from Wyoming have higher CrERV expression levels than animals from Montana. Interestingly, there is higher CrERV expression in animals from the CWD-endemic region in SE Wyoming. This may be important with respect to disease, given the associations of ERV and retroviral RNA with prion disease [66, 260, 261]. Population level analyses also revealed that proximal gene expression levels vary with CrERV expression levels for S26536 and S3442, which were previously shown to alter splicing of proximal gene transcripts (Chapter 3). Population level differences in expression levels of S386, however, did not correlate with SIRT6 expression levels, further supporting that there is less of an effect of S386 expression on proximal gene expression regulation. Few studies have examined transcriptionally active ERV differences among populations. The frequencies of some human ERVs vary by population, and many insertionally polymorphic HERV-K viruses are within or near genes [187]. The biological effects of these ERVs have not been investigated, but insertionally polymorphic ERVs can impact gene expression and phenotypic variation among mouse strains [182, 256]. Nonetheless, changes in gene transcription via the regulatory impact of ERVs can drive phenotypic evolution and contribute to species diversity [271]. This further supports that investigating differences in ERV transcription and impact on gene expression within populations is important to understand species evolution. Future work could investigate CrERV RNA as lncRNA in two contexts. CrERV RNA may be beneficial to the host, and may have an impact on host gene expression regulation or other host processes.
Data from Chapter 3 support that the full length transcriptionally active proviral allele has been maintained in the genome despite evidence of germline expansion and some effects on gene splicing and expression levels in lymph nodes. ERV mobilization and intronic splicing effects can be mutagenic [50], suggesting that it may be more beneficial for the host if CrERV were silenced or present in the solo-LTR configuration. Proviral allele maintenance suggests that CrERV RNA has provided a benefit to the host, perhaps acting as a lncRNA that regulates gene expression. The next steps would evaluate CrERV RNA degradation kinetics in the cell and determine if CrERV RNA has any protein or DNA binding partners. ERVs also contribute functional proteins to the host [143, 144]. Data in Chapter 4 indicate differential expression of spliced env transcripts, which may be translated into viral protein. Thus, we could also investigate CrERV protein production, which may have been co-opted by the host. Additionally, CrERV RNA may be involved in CWD pathogenesis. Data from Chapter 4 indicate that there is altered CrERV expression in animals from a CWD-endemic area and that CrERV expression levels can affect gene expression levels. Others have suggested a role of retroviral RNA in prion disease pathogenesis [260, 261]. The next steps could determine if CrERV contribute to CWD pathogenesis by investigating if CrERV expression levels differ between CWD-positive and CWD-negative animals and if this impacts expression of proximal genes important in the context of prion disease. Additionally, the potential role of CrERV RNA or protein in prion conversion could also be evaluated. These studies would increase our understanding of ERV contributions to host genome function and their role in disease.

Bibliography

[1] Rosenberg, N. and P. Jolicoeur (1997) “Retroviral Pathogenesis,” in Retroviruses (J. M. Coffin, S. Hughes, and H. Varmus, eds.), Cold Spring Harbor Press, pp. 475–585. URL http://www.ncbi.nlm.nih.gov/pubmed/21433341

[2] Hayward, A., C. K. Cornwallis, and P. Jern (2015) “Pan-vertebrate comparative genomics unmasks retrovirus macroevolution,” Proceedings of the National Academy of Sciences, 112(2), pp. 464–469, arXiv:1408.1149. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1414980112

[3] Jern, P. and J. M. Coffin (2008) “Effects of retroviruses on host genome function.” Annual review of genetics, 42, pp. 709–32. URL http://www.ncbi.nlm.nih.gov/pubmed/18694346

[4] McClintock, B. (1950) “The origin and behavior of mutable loci in maize,” Proceedings of the National Academy of Sciences, 36(6), pp. 344–355. URL http://www.pnas.org/cgi/doi/10.1073/pnas.36.6.344

[5] Shimode, S., S. Nakagawa, and T. Miyazawa (2015) “Multiple invasions of an infectious retrovirus in cat genomes,” Scientific Reports, 5, pp. 1–10.

[6] Belshaw, R., V. Pereira, A. Katzourakis, G. Talbot, J. Pac, A. Burt, and M. Tristem (2004) “Long-term reinfection of the human genome by endogenous retroviruses,” PNAS, 101(Track II), pp. 4894–4899.

[7] Rebollo, R., M. T. Romanish, and D. L. Mager (2012) “Transposable elements: an abundant and natural source of regulatory sequences for host genes.” Annual review of genetics, 46, pp. 21–42. URL http://www.ncbi.nlm.nih.gov/pubmed/22905872

[8] Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, C. Raymond, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, “Initial sequencing and analysis of the human genome,” Nature, (6822), pp. 860–921, 11237011.

[9] Consortium, M. G. S. “Initial sequencing and comparative analysis of the mouse genome,” Nature, p. 520.

[10] Hughes, J. F. and J. M. Coffin “Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution.” Proceedings of the National Academy of Sciences of the United States of America, (6), pp. 1668–72.

[11] Katzourakis, A., V. Pereira, and M. Tristem (2007) “Effects of recombination rate on human endogenous retrovirus fixation and persistence.” Journal of virology, 81(19), pp. 10712–7.

[12] Medstrand, P., L. N. V. D. Lagemaat, and D. L. Mager (2002) “Retroelement Distributions in the Human Genome: Variations Associated With Age and Proximity to Genes,” Genome Research, pp. 1483–1495.

[13] van de Lagemaat, L. N., P. Medstrand, and D. L. Mager (2006) “Multiple effects govern endogenous retrovirus survival patterns in human gene introns.” Genome biology, 7(9), p. R86. URL http://genomebiology.com/2006/7/9/R86

[14] Demeulemeester, J., J. De Rijck, R. Gijsbers, and Z. Debyser (2015) “Retroviral integration: Site matters,” BioEssays, pp. n/a–n/a. URL http://doi.wiley.com/10.1002/bies.201500051

[15] Gifford, R. and M. Tristem (2003) “The Evolution, Distribution and Diversity of Endogenous Retroviruses,” Virus Genes, 26(3), pp. 291–315.

[16] Brown, K., J. Moreton, S. Malla, A. A. Aboobaker, R. D. Emes, and R. E. Tarlinton (2012) “Characterisation of retroviruses in the horse genome and their transcriptional activity via transcriptome sequencing.” Virology, 433(1), pp. 55–63. URL http://www.ncbi.nlm.nih.gov/pubmed/22868041

[17] Brown, K., R. D. Emes, and R. E. Tarlinton (2014) “Multiple groups of endogenous epsilon-like endogenous retroviruses conserved across primates.” Journal of virology, (August). URL http://www.ncbi.nlm.nih.gov/pubmed/25142585

[18] Farkašová, H., T. Hron, J. Pačes, P. Hulva, P. Benda, R. J. Gifford, and D. Elleder (2017) “Discovery of an endogenous Deltaretrovirus in the genome of long-fingered bats (Chiroptera: Miniopteridae),” Proceedings of the National Academy of Sciences, 114(12), pp. 3145–3150. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1621224114

[19] Katzourakis, A., M. Tristem, O. G. Pybus, and R. J. Gifford (2007) “Discovery and analysis of the first endogenous lentivirus,” Proceedings of the National Academy of Sciences, 104(15), pp. 6261–6265. URL http://www.pnas.org/cgi/doi/10.1073/pnas.0700471104

[20] Cui, J. and E. C. Holmes (2012) “Endogenous Lentiviruses in the Ferret Genome,” Journal of Virology, 86(6), pp. 3383–3385. URL http://jvi.asm.org/cgi/doi/10.1128/JVI.06652-11

[21] Hron, T., H. Fábryová, J. Pačes, and D. Elleder “Endogenous lentivirus in Malayan colugo (Galeopterus variegatus), a close relative of primates,” Retrovirology, (1), p. 84.

[22] Katzourakis, A., P. Aiewsakun, H. Jia, N. D. Wolfe, M. LeBreton, A. D. Yoder, and W. M. Switzer (2014) “Discovery of prosimian and afrotherian foamy viruses and potential cross species transmissions amidst stable and ancient mammalian co-evolution,” Retrovirology, 11(1), pp. 1–17.

[23] Ruboyianes, R. and M. Worobey (2016) “Foamy-like endogenous retroviruses are extensive and abundant in teleosts,” Virus Evolution, 2(2), p. vew032.

[24] Johnson, W. E. “Endogenous Retroviruses in the Genomics Era,” Annual Review of Virology, (1), pp. 135–159.

[25] Esnault, C., J. Maestre, and T. Heidmann (2000) “Human LINE retrotransposons generate processed pseudogenes.” Nature genetics, 24(4), pp. 363–7. URL http://www.ncbi.nlm.nih.gov/pubmed/10742098

[26] Babushok, D. V. and H. H. Kazazian (2007) “Progress in understanding the biology of the human mutagen LINE-1,” Human Mutation, 28(6), pp. 527–539.

[27] Graham, T. and S. Boissinot (2006) “The genomic distribution of L1 elements: The role of insertion bias and natural selection,” Journal of Biomedicine and Biotechnology, 2006, pp. 1–5.

[28] Khan, H., A. Smit, and S. Boissinot “Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates.” Genome research, (1), pp. 78–87.

[29] Brouha, B., J. Schustak, R. M. Badge, S. Lutz-Prigge, A. H. Farley, J. V. Moran, and H. H. Kazazian (2003) “Hot L1s account for the bulk of retrotransposition in the human population,” Proceedings of the National Academy of Sciences, 100(9), pp. 5280–5285. URL http://www.pnas.org/cgi/doi/10.1073/pnas.0831042100

[30] Kazazian, H. H. (2004) “Mobile elements: drivers of genome evolution.” Science (New York, N.Y.), 303(5664), pp. 1626–32. URL http://www.ncbi.nlm.nih.gov/pubmed/15016989

[31] Ostertag, E. M. and H. H. Kazazian Jr. “Biology of Mammalian L1 Retrotransposons,” Annual Review of Genetics, pp. 501–538.

[32] Hancks, D. C. and H. H. Kazazian “Roles for retrotransposon insertions in human disease,” Mobile DNA, (1), p. 9.

[33] Bodak, M., J. Yu, and C. Ciaudo (2014) “Regulation of LINE-1 in mammals,” Biomolecular Concepts, 5(5), pp. 409–428.

[34] Ohshima, K. and N. Okada (2005) “SINEs and LINEs: symbionts of eukaryotic genomes with a common tail,” Cytogenetic and Genome Research, 110(1-4), pp. 475–490. URL https://www.karger.com/DOI/10.1159/000084981

[35] Deragon, J. M. and X. Zhang (2006) “Short interspersed elements (SINEs) in plants: Origin, classification, and use as phylogenetic markers,” Systematic Biology, 55(6), pp. 949–956.

[36] Jurka, J. (1997) “Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons,” Proceedings of the National Academy of Sciences, 94(5), pp. 1872–1877. URL http://www.pnas.org/cgi/doi/10.1073/pnas.94.5.1872

[37] Kajikawa, M. and N. Okada (2002) “LINEs mobilize SINEs in the eel through a shared 3′ sequence,” Cell, 111(3), pp. 433–444.

[38] Dewannieux, M., C. Esnault, and T. Heidmann (2003) “LINE-mediated retrotransposition of marked Alu sequences,” Nature Genetics, 35(1), pp. 41–48.

[39] Jurka, J., V. V. Kapitonov, O. Kohany, and M. V. Jurka “Repetitive sequences in complex genomes: structure and evolution.” Annual review of genomics and human genetics, pp. 241–59.

[40] Richardson, S. R., A. J. Doucet, H. C. Kopera, J. B. Moldovan, J. L. Garcia-Perez, and J. V. Moran (2015) “The Influence of LINE-1 and SINE Retrotransposons on Mammalian Genomes,” Microbiology Spectrum, 3(2), pp. 1–26.

[41] Almeida, L. M., I. T. Silva, W. A. Silva Jr., J. P. Castro, P. K. Riggs, C. M. Carareto, and M. E. J. Amaral (2007) “The contribution of transposable elements to Bos taurus gene structure,” Gene, 390(1-2), pp. 180–189. URL http://linkinghub.elsevier.com/retrieve/pii/S0378111906006664

[42] Adelson, D. L., J. M. Raison, and R. C. Edgar (2009) “Characterization and distribution of retrotransposons and simple sequence repeats in the bovine genome,” Proceedings of the National Academy of Sciences of the United States of America, 106(31), pp. 12855–12860.

[43] Havecker, E. R., X. Gao, and D. F. Voytas “The diversity of LTR retrotransposons.” Genome biology, (6), p. 225.

[44] Naville, M., I. A. Warren, Z. Haftek-Terreau, D. Chalopin, F. Brunet, P. Levin, D. Galiana, and J. N. Volff (2016) “Not so bad after all: Retroviruses and long terminal repeat retrotransposons as a source of new genes in vertebrates,” Clinical Microbiology and Infection, 22(4), pp. 312–323. URL http://dx.doi.org/10.1016/j.cmi.2016.02.001

[45] Hughes, J. F. and J. M. Coffin (2001) “Evidence for genomic rearrangements mediated by human endogenous retroviruses during primate evolution.” Nature genetics, 29(4), pp. 487–9. URL http://www.ncbi.nlm.nih.gov/pubmed/11704760

[46] ——— “Human endogenous retroviral elements as indicators of ectopic recombination events in the primate genome.” Genetics, (3), pp. 1183–94.

[47] Polavarapu, N., N. J. Bowen, and J. F. McDonald (2006) “Identification, characterization and comparative genomics of chimpanzee endogenous retroviruses,” Genome Biology, 7(6).

[48] Shin, W., J. Lee, S.-Y. Son, K. Ahn, H.-S. Kim, and K. Han “Human-Specific HERV-K Insertion Causes Genomic Variations in the Human Genome.” PloS one, (4), p. e60605.

[49] Mun, S., J. Lee, Y.-J. Kim, H.-S. Kim, and K. Han (2014) “Chimpanzee-specific endogenous retrovirus generates genomic variations in the chimpanzee genome.” PloS one, 9(7), p. e101195. URL http://www.ncbi.nlm.nih.gov/pubmed/24987855

[50] Maksakova, I. A., M. T. Romanish, L. Gagnier, C. A. Dunn, L. N. van de Lagemaat, and D. L. Mager “Retroviral elements and their hosts: insertional mutagenesis in the mouse germ line.” PLoS genetics, (1), p. e2.

[51] Kaufmann, S., M. Sauter, M. Schmitt, B. Baumert, B. Best, A. Boese, K. Roemer, and N. Mueller-Lantzsch (2010) “Human endogenous retrovirus protein Rec interacts with the testicular zinc-finger protein and androgen receptor,” Journal of General Virology, 91(6), pp. 1494–1502.

[52] Rolland, A., E. Jouvin-Marche, C. Viret, M. Faure, H. Perron, and P. N. Marche (2006) “The envelope protein of a human endogenous retrovirus-W family activates innate immunity through CD14/TLR4 and promotes Th1-like responses.” The Journal of Immunology, 176, pp. 7636–7644.

[53] Baudino, L., K. Yoshinobu, N. Morito, M.-L. Santiago-Raber, and S. Izui (2010) “Role of endogenous retroviruses in murine SLE,” Autoimmunity Reviews, 10(1), pp. 27–34.

[54] Perl, A., D. Fernandez, T. Telarico, and P. E. Phillips (2010) “Endogenous retroviral pathogenesis in lupus,” Current opinion in Rheumatology, 22(5), pp. 483–492, 15334406.

[55] Antony, J. M., G. van Marle, W. Opii, D. A. Butterfield, F. Mallet, V. W. Yong, J. L. Wallace, R. M. Deacon, K. Warren, and C. Power (2004) “Human endogenous retrovirus glycoprotein–mediated induction of redox reactants causes oligodendrocyte death and demyelination,” Nature Neuroscience, 7, p. 1088. URL http://dx.doi.org/10.1038/nn1319

[56] Galli, U. M., M. Sauter, B. Lecher, S. Maurer, H. Herbst, K. Roemer, and N. Mueller-Lantzsch “Human endogenous retrovirus rec interferes with germ cell development in mice and may cause carcinoma in situ, the predecessor lesion of germ cell tumors,” Oncogene, p. 3223.

[57] Huang, G., Z. Li, X. Wan, Y. Wang, and J. Dong (2013) “Human endogenous retroviral K element encodes fusogenic activity in melanoma cells,” J Carcinog, 12, p. 5. URL http://www.ncbi.nlm.nih.gov/pubmed/23599687

[58] Johnston, J. B., C. Silva, J. Holden, K. G. Warren, A. W. Clark, and C. Power (2001) “Monocyte activation and differentiation augment human endogenous retrovirus expression: Implications for inflammatory brain diseases,” Annals of Neurology, 50(4), pp. 434–442. URL http://doi.wiley.com/10.1002/ana.1131

[59] Lamprecht, B., K. Walter, S. Kreher, R. Kumar, M. Hummel, D. Lenze, K. Köchert, M. A. Bouhlel, J. Richter, E. Soler, R. Stadhouders, K. Jöhrens, K. D. Wurster, D. F. Callen, M. F. Harte, M. Giefing, R. Barlow, H. Stein, I. Anagnostopoulos, M. Janz, P. N. Cockerill, R. Siebert, B. Dörken, C. Bonifer, and S. Mathas (2010) “Derepression of an endogenous long terminal repeat activates the CSF1R proto-oncogene in human lymphoma,” Nature Medicine, 16(5), pp. 571–579. URL http://www.nature.com/doifinder/10.1038/nm.2129

[60] Lamprecht, B., C. Bonifer, and S. Mathas (2010) “Repeat-element driven activation of proto-oncogenes in human malignancies,” Cell Cycle, 9(21), pp. 4276–4281.

[61] Babaian, A., M. T. Romanish, L. Gagnier, L. Y. Kuo, M. M. Karimi, C. Steidl, and D. L. Mager (2015) “Onco-exaptation of an endogenous retroviral LTR drives IRF5 expression in Hodgkin lymphoma,” Oncogene, (February), pp. 1–5. URL http://www.nature.com/doifinder/10.1038/onc.2015.308

[62] Lock, F. E., R. Rebollo, K. Miceli-Royer, L. Gagnier, S. Kuah, A. Babaian, M. Sistiaga-Poveda, C. B. Lai, O. Nemirovsky, I. Serrano, C. Steidl, M. M. Karimi, and D. L. Mager (2014) “Distinct isoform of FABP7 revealed by screening for retroelement-activated genes in diffuse large B-cell lymphoma.” Proceedings of the National Academy of Sciences of the United States of America, 111(34), pp. E3534–43. URL http://www.ncbi.nlm.nih.gov/pubmed/25114248

[63] Cegolon, L., C. Salata, E. Weiderpass, P. Vineis, G. Palù, and G. Mastrangelo “Human endogenous retroviruses and cancer prevention: evidence and prospects.” BMC cancer, p. 4.

[64] Katoh, I. and S.-i. Kurata “Association of Endogenous Retroviruses and Long Terminal Repeats with Human Disorders,” Frontiers in Oncology, (September), pp. 1–8.

[65] Szpakowski, S., X. Sun, J. M. Lage, A. Dyer, J. Rubinstein, D. Kowalski, C. Sasaki, J. Costa, and P. M. Lizardi (2009) “Loss of epigenetic silencing in tumors preferentially affects primate-specific retroelements,” Gene, 448(2), pp. 151–167.

[66] Jeong, B.-H., Y.-J. Lee, R. I. Carp, and Y.-S. Kim (2010) “The prevalence of human endogenous retroviruses in cerebrospinal fluids from patients with sporadic Creutzfeldt-Jakob disease.” Journal of clinical virology : the official publication of the Pan American Society for Clinical Virology, 47(2), pp. 136–42. URL http://www.ncbi.nlm.nih.gov/pubmed/20005155

[67] Stengel, A., C. Bach, I. Vorberg, O. Frank, S. Gilch, G. Lutzny, W. Seifarth, V. Erfle, E. Maas, H. Schätzl, C. Leib-Mösch, and A. D. Greenwood (2006) “Prion infection influences murine endogenous retrovirus expression in neuronal cells.” Biochemical and biophysical research communications, 343(3), pp. 825–31. URL http://www.ncbi.nlm.nih.gov/pubmed/16564028

[68] Greenwood, A. D., M. Vincendeau, A.-C. Schmädicke, J. Montag, W. Seifarth, and D. Motzkus “Bovine spongiform encephalopathy infection alters endogenous retrovirus expression in distinct brain regions of cynomolgus macaques (Macaca fascicularis).” Molecular neurodegeneration, (1), p. 44.

[69] Gibb, E. A., C. J. Brown, and W. L. Lam (2011) “The functional role of long non-coding RNA in human carcinomas.” Molecular cancer, 10(1), p. 38. URL http://www.molecular-cancer.com/content/10/1/38

[70] Gutschner, T., S. Diederichs, and K. Rna (2012) “A long non-coding RNA point of view,” RNA Biology, 9(June), pp. 703–719.

[71] Gibb, E. A., R. L. Warren, G. W. Wilson, S. D. Brown, G. Robertson, G. B. Morin, and R. A. Holt (2015) “Activation of an endogenous retrovirus-associated long non-coding RNA in human adenocarcinoma,” Genome Medicine, 7. URL http://genomemedicine.com/content/7/1/22

[72] St Laurent, G., D. Shtokalo, B. Dong, M. R. Tackett, X. Fan, S. Lazorthes, E. Nicolas, N. Sang, T. J. Triche, T. A. McCaffrey, W. Xiao, and P. Kapranov (2013) “VlincRNAs controlled by retroviral elements are a hallmark of pluripotency and cancer.” Genome biology, 14(7), p. R73. URL http://genomebiology.com/2013/14/7/R73

[73] Brookes, E. and Y. Shi (2014) Diverse Epigenetic Mechanisms of Human Disease, vol. 48.

[74] Slotkin, R. K. and R. Martienssen (2007) “Transposable elements and the epigenetic regulation of the genome,” Nature Reviews Genetics, 8(4), pp. 272–285. URL http://www.nature.com/doifinder/10.1038/nrg2072

[75] Contreras-Galindo, R., M. H. Kaplan, S. He, A. C. Contreras-Galindo, M. J. Gonzalez-Hernandez, F. Kappes, D. Dube, S. M. Chan, D. Robinson, F. Meng, M. Dai, S. D. Gitlin, A. M. Chinnaiyan, G. S. Omenn, and D. M. Markovitz (2013) “HIV Infection Reveals Wide-Spread Expansion of Novel Centromeric Human Endogenous Retroviruses.” Genome research. URL http://www.ncbi.nlm.nih.gov/pubmed/23657884

[76] Ferreri, G. C., J. D. Brown, C. Obergfell, N. Jue, C. E. Finn, M. J. O’Neill, and R. J. O’Neill (2011) “Recent Amplification of the Kangaroo Endogenous Retrovirus, KERV, Limited to the Centromere,” Journal of Virology, 85(10), pp. 4761–4771. URL http://jvi.asm.org/cgi/doi/10.1128/JVI.01604-10

[77] Jin, B., Y. Li, and K. D. Robertson (2011) “DNA methylation: Superior or subordinate in the epigenetic hierarchy?” Genes and Cancer, 2(6), pp. 607–617.

[78] Jones, P. A. (2012) “Functions of DNA methylation: islands, start sites, gene bodies and beyond,” Nature Reviews Genetics, 13, p. 484. URL http://dx.doi.org/10.1038/nrg3230

[79] Yoder, J. A., C. P. Walsh, and T. H. Bestor (1997) “Cytosine methylation and the ecology of intragenomic parasites,” Trends in Genetics, 13(8), pp. 335–340.

[80] Lavie, L., M. Kitova, E. Maldener, J. Mayer, and E. Meese (2005) “CpG Methylation Directly Regulates Transcriptional Activity of the Human Endogenous Retrovirus Family HERV-K(HML-2),” Journal of virology.

[81] Matousková, M., P. Vesely, P. Daniel, G. Mattiuzzo, R. Hector, L. Scobie, Y. Takeuchi, and J. Hejnar (2013) “The Role of DNA Methylation in Expression and Transmission of Porcine Endogenous Retrovirus.” Journal of virology, (August). URL http://www.ncbi.nlm.nih.gov/pubmed/23986605

[82] Reiss, D., Y. Zhang, and D. L. Mager “Widely variable endogenous retroviral methylation levels in human placenta.” Nucleic acids research, (14), pp. 4743–54.

[83] Walsh, C. P., J. R. Chaillet, and T. H. Bestor (1998) “Transcription of IAP endogenous retroviruses is constrained by cytosine methylation.” Nature genetics, 20(2), pp. 116–7. URL http://www.ncbi.nlm.nih.gov/pubmed/9771701

[84] Hemberger, M., W. Dean, and W. Reik (2009) “Epigenetic dynamics of stem cells and cell lineage commitment: digging Waddington’s canal,” Nature Reviews Molecular Cell Biology, 10, p. 526. URL http://dx.doi.org/10.1038/nrm2727

[85] Peterson, C. L. and M.-A. Laniel (2004) “Histones and histone modifications,” Current Biology, 14(14), pp. R546–R551. URL http://linkinghub.elsevier.com/retrieve/pii/S0960982204004853

[86] Berger, S. L. (2002) “Histone modifications in transcriptional regulation,” Current Opinion in Genetics & Development, 12(2), pp. 142–148.

[87] Bannister, A. J. and T. Kouzarides (2011) “Regulation of chromatin by histone modifications,” Cell Research, 21(3), pp. 381–395, NIHMS150003. URL http://www.nature.com/doifinder/10.1038/cr.2011.22

[88] Chuong, E. B., M. A. Rumi, M. J. Soares, and J. C. Baker (2013) “Endogenous retroviruses function as species-specific enhancer elements in the placenta,” Nature Genetics, 45(3), pp. 325–329.

[89] Friedli, M., P. Turelli, A. Kapopoulou, B. Rauwel, N. Castro-Díaz, H. M. Rowe, G. Ecco, C. Unzu, E. Planet, A. Lombardo, B. Mangeat, B. E. Wildhaber, L. Naldini, and D. Trono (2014) “Loss of transcriptional control over endogenous retroelements during reprogramming to pluripotency.” Genome research. URL http://www.ncbi.nlm.nih.gov/pubmed/24879558

[90] Ecco, G., M. Cassano, A. Kauzlaric, J. Duc, A. Coluccio, M. Imbeault, H. M. Rowe, P. Turelli, and D. Trono (2016) “Transposable elements and their KRAB-ZFP controllers regulate gene expression in adult tissues,” Developmental Cell, 36(6), pp. 611–623.

[91] Thompson, P., T. Macfarlan, and M. Lorincz (2016) “Long Terminal Repeats: From Parasitic Elements to Building Blocks of the Transcriptional Regulatory Repertoire,” Molecular Cell, 62(5), pp. 766–776. URL http://linkinghub.elsevier.com/retrieve/pii/S1097276516300120

[92] Rowe, H. M. and D. Trono (2011) “Dynamic control of endogenous retroviruses during development.” Virology, 411(2), pp. 273–87. URL http://www.ncbi.nlm.nih.gov/pubmed/21251689

[93] Hurst, T. P. and G. Magiorkinis (2017) “Epigenetic control of human endogenous retrovirus expression: Focus on regulation of long-terminal repeats (LTRs),” Viruses, 9(6), pp. 1–13.

[94] Hutnick, L. K., X. Huang, T. C. Loo, Z. Ma, and G. Fan (2010) “Repression of retrotransposal elements in mouse embryonic stem cells is primarily mediated by a DNA methylation-independent mechanism,” Journal of Biological Chemistry, 285(27), pp. 21082–21091.

[95] Mikkelsen, T. S., M. Ku, D. B. Jaffe, B. Issac, E. Lieberman, G. Giannoukos, P. Alvarez, W. Brockman, T.-K. Kim, R. P. Koche, W. Lee, E. Mendenhall, A. O’Donovan, A. Presser, C. Russ, X. Xie, A. Meissner, M. Wernig, R. Jaenisch, C. Nusbaum, E. S. Lander, and B. E. Bernstein “Genome-wide maps of chromatin state in pluripotent and lineage-committed cells,” Nature, p. 553.

[96] Collins, P. L., K. E. Kyle, T. Egawa, Y. Shinkai, and E. M. Oltz (2015) “The histone methyltransferase SETDB1 represses endogenous and exogenous retroviruses in B lymphocytes,” Proceedings of the National Academy of Sciences, 112(27), pp. 8367–8372. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1422187112

[97] Fasching, L., A. Kapopoulou, R. Sachdeva, R. Petri, M. Jönsson, C. Männe, P. Turelli, P. Jern, F. Cammas, D. Trono, and J. Jakobsson (2014) “TRIM28 Represses Transcription of Endogenous Retroviruses in Neural Progenitor Cells,” Cell Reports, pp. 1–9. URL http://linkinghub.elsevier.com/retrieve/pii/S2211124714010158

[98] Law, J. A. and S. E. Jacobsen (2011) “Establishing, maintaining and modifying DNA methylation patterns in plants and animals,” Nat Rev Genet., 11(3), pp. 204–220. URL http://dx.doi.org/10.1038/nrg2719

[99] Aravin, A. A., N. M. Naumova, A. V. Tulin, V. V. Vagin, Y. M. Rozovsky, and V. A. Gvozdev (2001) “Double-stranded RNA-mediated silencing of genomic tandem repeats and transposable elements in the D. melanogaster germline,” Current Biology, 11(13), pp. 1017–1027.

[100] Girard, A., R. Sachidanandam, G. J. Hannon, and M. A. Carmell “A germline-specific class of small RNAs binds mammalian Piwi proteins,” Nature, p. 199.

[101] Watanabe, T., Y. Totoki, A. Toyoda, M. Kaneda, S. Kuramochi-Miyagawa, Y. Obata, H. Chiba, Y. Kohara, T. Kono, T. Nakano, M. A. Surani, Y. Sakaki, and H. Sasaki.

[102] Aravin, A. A., R. Sachidanandam, A. Girard, K. Fejes-Toth, and G. J. Hannon (2007) “Developmentally Regulated piRNA Clusters Implicate MILI in Transposon Control,” Science, 316(5825), pp. 744–747. URL http://science.sciencemag.org/content/316/5825/744.abstract

[103] Carmell, M. A., A. Girard, H. J. van de Kant, D. Bourc’his, T. H. Bestor, D. G. de Rooij, and G. J. Hannon (2007) “MIWI2 Is Essential for Spermatogenesis and Repression of Transposons in the Mouse Male Germline,” Developmental Cell, 12(4), pp. 503–514.

[104] Aravin, A. A., R. Sachidanandam, D. Bourc’his, C. Schaefer, D. Pezic, K. F. Toth, T. Bestor, and G. J. Hannon (2008) “A piRNA Pathway Primed by Individual Transposons Is Linked to De Novo DNA Methylation in Mice,” Molecular Cell, 31(6), pp. 785–799.

[105] Svoboda, P., P. Stein, M. Anger, E. Bernstein, G. J. Hannon, and R. M. Schultz (2004) “RNAi and expression of retrotransposons MuERV-L and IAP in preimplantation mouse embryos,” Developmental Biology, 269(1), pp. 276–285.

[106] Tam, O. H., A. A. Aravin, P. Stein, A. Girard, E. P. Murchison, S. Cheloufi, E. Hodges, M. Anger, R. Sachidanandam, R. M. Schultz, and G. J. Hannon “Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes,” Nature, p. 534.

[107] Schorn, A. J., M. J. Gutbrod, C. Leblanc, and R. Martienssen (2017) “LTR-Retrotransposon Control by tRNA-Derived Small RNAs,” Cell, 170(1), pp. 61–71.e11. URL http://dx.doi.org/10.1016/j.cell.2017.06.013

[108] Wolf, G. and T. S. Macfarlan (2015) “Revealing the Complexity of Retroviral Repression,” Cell, 163(1), pp. 30–32. URL http://dx.doi.org/10.1016/j.cell.2015.09.014

[109] Cammas, F., M. Mark, P. Dollé, A. Dierich, P. Chambon, and R. Losson “Mice lacking the transcriptional corepressor TIF1beta are defective in early postimplantation development.” Development (Cambridge, England), (13), pp. 2955–63.

[110] Rowe, H. M., J. Jakobsson, D. Mesnard, J. Rougemont, S. Reynard, T. Aktas, P. V. Maillard, H. Layard-Liesching, S. Verp, J. Marquis, F. Spitz, D. B. Constam, and D. Trono “KAP1 controls endogenous retroviruses in embryonic stem cells,” Nature, p. 237.

[111] Esnault, C., S. Priet, D. Ribet, O. Heidmann, and T. Heidmann (2008) “Restriction by APOBEC3 proteins of endogenous retroviruses with an extracellular life cycle: Ex vivo effects and in vivo "traces" on the murine IAPE and human HERV-K elements,” Retrovirology, 5, pp. 1–11.

[112] Wolf, D. and S. P. Goff (2008) “Host Restriction Factors Blocking Retroviral Replication,” Annual Review of Genetics, 42(1), pp. 143–163. URL https://doi.org/10.1146/annurev.genet.42.110807.091704

[113] Mattiuzzo, G., S. Ivol, and Y. Takeuchi (2010) “Regulation of porcine endogenous retrovirus release by porcine and human tetherins.” Journal of virology, 84(5), pp. 2618–2622.

[114] Magiorkinis, G. and T. P. Hurst “Activation of the innate immune response by endogenous retroviruses,” Journal of General Virology, (6), pp. 1207–1218.

[115] Yu, P., W. Lübben, H. Slomka, J. Gebler, M. Konert, C. Cai, L. Neubrandt, O. P. da Costa, S. Paul, S. Dehnert, K. Döhne, M. Thanisch, S. Storsberg, L. Wiegand, A. Kaufmann, M. Nain, L. Quintanilla-Martinez, S. Bettio, B. Schnierle, L. Kolesnikova, S. Becker, M. Schnare, and S. Bauer (2012) “Nucleic acid-sensing Toll-like receptors are essential for the control of endogenous retrovirus viremia and ERV-induced tumors.” Immunity, 37(5), pp. 867–79. URL http://www.ncbi.nlm.nih.gov/pubmed/23142781

[116] Oliver, K. R. and W. K. Greene (2011) “Mobile DNA and the TE-Thrust hypothesis: Supporting evidence from the primates,” Mobile DNA, 2(1), pp. 1–17.

[117] Li, J., K. Akagi, Y. Hu, A. L. Trivett, C. J. W. Hlynialuk, D. A. Swing, N. Volfovsky, T. C. Morgan, Y. Golubeva, R. M. Stephens, D. E. Smith, and D. E. Symer (2012) “Mouse endogenous retroviruses can trigger premature transcriptional termination at a distance,” Genome research, pp. 870–884.

[118] Buzdin, A., E. Kovalskaya-Alexandrova, E. Gogvadze, and E. Sverdlov “At least 50% of human-specific HERV-K (HML-2) long terminal repeats serve in vivo as active promoters for host nonrepetitive DNA transcription.” Journal of virology, (21), pp. 10752–62.

[119] Medstrand, P., J. R. Landry, and D. L. Mager (2001) “Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans.” The Journal of biological chemistry, 276(3), pp. 1896–903. URL http://www.ncbi.nlm.nih.gov/pubmed/11054415

[120] Pavlicev, M., K. Hiratsuka, K. A. Swaggart, C. Dunn, and L. Muglia (2015) “Detecting endogenous retrovirus-driven tissue-specific gene transcription,” Genome Biology and Evolution, 7(4), pp. 1082–1097.

[121] Landry, J.-R., A. Rouhi, P. Medstrand, and D. L. Mager (2002) “The Opitz syndrome gene Mid1 is transcribed from a human endogenous retroviral promoter.” Molecular biology and evolution, 19(11), pp. 1934–1942.

[122] Dunn, C. A., L. N. Van De Lagemaat, G. J. Baillie, and D. L. Mager (2005) “Endogenous retrovirus long terminal repeats as ready-to-use mobile promoters: The case of primate B3GAL-T5,” Gene, 364(1-2), pp. 2–12.

[123] Schlesinger, S. and S. P. Goff (2015) “Retroviral Transcriptional Regulation and Embryonic Stem Cells: War and Peace,” Molecular and Cellular Biology, 35(5), pp. 770–777. URL http://mcb.asm.org/lookup/doi/10.1128/MCB.01293-14

[124] Domansky, A. N., E. P. Kopantzev, E. V. Snezhkov, Y. B. Lebedev, C. Leib-Mosch, and E. D. Sverdlov (2000) “Solitary HERV-K LTRs possess bi-directional promoter activity and contain a negative regulatory element in the U5 region,” FEBS Letters, 472(2-3), pp. 191–195. URL http://linkinghub.elsevier.com/retrieve/pii/S0014579300014605

[125] Xu, L., A. G. Elkahloun, F. Candotti, A. Grajkowski, S. L. Beaucage, E. F. Petricoin, V. Calvert, H. Juhl, F. Mills, K. Mason, N. Shastri, J. Chik, C. Xu, and A. S. Rosenberg “A novel function of RNAs arising from the long terminal repeat of human endogenous retrovirus 9 in cell cycle arrest.” Journal of virology, (1), pp. 25–36.

[126] Dunn, C. A., M. T. Romanish, L. E. Gutierrez, L. N. van de Lagemaat, and D. L. Mager (2006) “Transcription of two human genes from a bidirectional endogenous retrovirus promoter.” Gene, 366(2), pp. 335–42. URL http://www.ncbi.nlm.nih.gov/pubmed/16288839

[127] Huh, J. W., D. S. Kim, D. W. Kang, H. S. Ha, K. Ahn, Y. N. Noh, D. S. Min, K. T. Chang, and H. S. Kim (2008) “Transcriptional regulation of GSDML gene by antisense-oriented HERV-H LTR element,” Archives of Virology, 153(6), pp. 1201–1205.

[128] Rasmussen, M. H., B. Ballarín-González, J. Liu, L. B. Lassen, A. Füchtbauer, E.-M. Füchtbauer, A. L. Nielsen, and F. S. Pedersen (2010) “Antisense transcription in gammaretroviruses as a mechanism of insertional activation of host genes.” Journal of virology, 84(8), pp. 3780–8. URL http://www.ncbi.nlm.nih.gov/pubmed/20130045

[129] Wang, T., J. Zeng, C. B. Lowe, R. G. Sellers, S. R. Salama, M. Yang, S. M. Burgess, R. K. Brachmann, and D. Haussler (2007) “Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53,” Proceedings of the National Academy of Sciences, 104(47), pp. 18613–18618.

[130] Chuong, E. B., N. C. Elde, and C. Feschotte (2016) “Regulatory evolution of innate immunity through co-option of endogenous retroviruses,” Science, 351(6277), pp. 1083–1087. URL http://www.sciencemag.org/cgi/doi/10.1126/science.aad5497

[131] Kunarso, G., N.-Y. Chia, J. Jeyakani, C. Hwang, X. Lu, Y.-S. Chan, H.-H. Ng, and G. Bourque (2010) “Transposable elements have rewired the core regulatory network of human embryonic stem cells.” Nature genetics, 42(7), pp. 631–4. URL http://www.ncbi.nlm.nih.gov/pubmed/20526341

[132] Bourque, G., B. Leong, V. B. Vega, X. Chen, Y. L. Lee, K. G. Srinivasan, J.-L. Chew, Y. Ruan, C.-L. Wei, H. H. Ng, and E. T. Liu “Evolution of the mammalian transcription factor binding repertoire via transposable elements.” Genome research, (11), pp. 1752–62.

[133] Mager, D. L., D. G. Hunter, M. Schertzer, and J. D. Freeman (1999) “Endogenous retroviruses provide the primary polyadenylation signal for two new human genes (HHLA2 and HHLA3).” Genomics, 59(3), pp. 255–263. URL http://www.ncbi.nlm.nih.gov/pubmed/10444326

[134] Baust, C., W. Seifarth, H. Germaier, R. Hehlmann, and C. Leib-Mösch (2000) “HERV-K-T47D-Related long terminal repeats mediate polyadenylation of cellular transcripts.” Genomics, 66(1), pp. 98–103. URL http://www.ncbi.nlm.nih.gov/pubmed/10843810

[135] Gogvadze, E., E. Stukacheva, A. Buzdin, and E. Sverdlov (2009) “Human-Specific Modulation of Transcriptional Activity Provided by Endogenous Retroviral Insertions,” Journal of Virology, 83(12), pp. 6098–6105. URL http://jvi.asm.org/cgi/doi/10.1128/JVI.00123-09

[136] Doxiadis, G. G. M., N. de Groot, and R. E. Bontrop (2008) “Impact of Endogenous Intronic Retroviruses on Major Histocompatibility Complex Class II Diversity and Stability,” Journal of Virology, 82(13), pp. 6667–6677. URL http://jvi.asm.org/cgi/doi/10.1128/JVI.00097-08

[137] Lu, X., F. Sachs, L. Ramsay, P.-É. Jacques, J. Göke, G. Bourque, and H.-H. Ng (2014) “The retrovirus HERVH is a long noncoding RNA required for human embryonic stem cell identity,” Nature Structural & Molecular Biology, 21(4), pp. 423–425. URL http://www.nature.com/doifinder/10.1038/nsmb.2799

[138] Wang, J., X. Li, L. Wang, J. Li, Y. Zhao, G. Bou, Y. Li, G. Jiao, X. Shen, R. Wei, S. Liu, B. Xie, L. Lei, W. Li, Q. Zhou, and Z. Liu (2016) “A novel long intergenic noncoding RNA indispensable for the cleavage of mouse two-cell embryos,” EMBO Rep, 17(10), pp. 1452–1470.

[139] Göke, J., X. Lu, Y. S. Chan, H. H. Ng, L. H. Ly, F. Sachs, and I. Szczerbinska (2015) “Dynamic transcription of distinct classes of endogenous retroviral elements marks specific populations of early human embryonic cells,” Cell Stem Cell, 16(2), pp. 135–141.

[140] Hu, T., W. Pi, X. Zhu, M. Yu, H. Ha, H. Shi, J. H. Choi, and D. Tuan (2017) “Long non-coding RNAs transcribed by ERV-9 LTR retrotransposon act in cis to modulate long-range LTR enhancer function,” Nucleic Acids Research, 45(8), pp. 4479–4492.

[141] Kapusta, A., Z. Kronenberg, V. J. Lynch, X. Zhuo, L. Ramsay, G. Bourque, M. Yandell, and C. Feschotte “Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs.” PLoS genetics, (4), p. e1003470.

[142] Johnson, R. and R. Guigó (2014) “The RIDL hypothesis: transposable elements as functional domains of long noncoding RNAs,” RNA, 2014.

[143] Lavialle, C., G. Cornelis, A. Dupressoir, C. Esnault, O. Heidmann, C. Vernochet, and T. Heidmann (2013) “Paleovirology of ’syncytins’, retroviral env genes exapted for a role in placentation,” Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1626), p. 20120507.

[144] Dupressoir, A., C. Lavialle, and T. Heidmann (2012) “From ancestral infectious retroviruses to bona fide cellular genes: Role of the captured syncytins in placentation,” Placenta, 33(9), pp. 663–671. URL http://dx.doi.org/10.1016/j.placenta.2012.05.005

[145] Seifarth, W., O. Frank, U. Zeilfelder, A. D. Greenwood, R. Hehlmann, C. Leib-Mösch, and B. Spiess (2005) “Comprehensive Analysis of Human Endogenous Retrovirus Transcriptional Activity in Human Tissues with a Retrovirus-Specific Microarray,” Journal of virology, 79(1).

[146] Wittekindt, N. E., A. Padhi, S. C. Schuster, J. Qi, F. Q. Zhao, L. P. Tomsho, L. R. Kasson, M. Packard, P. Cross, and M. Poss (2010) “Nodeomics: Pathogen Detection in Vertebrate Lymph Nodes Using Meta-Transcriptomics,” Plos One, 5(10).

[147] Bittmann, I., D. Mihica, R. Plesker, and J. Denner (2012) “Expression of porcine endogenous retroviruses (PERV) in different organs of a pig,” Virology, 433(2), pp. 329–336. URL http://dx.doi.org/10.1016/j.virol.2012.08.030

[148] Pérot, P., N. Mugnier, C. Montgiraud, J. Gimenez, M. Jaillard, B. Bonnaud, and F. Mallet (2012) “Microarray-based sketches of the HERV transcriptome landscape,” PLoS ONE, 7(6).

[149] Rebollo, R., K. Miceli-Royer, Y. Zhang, S. Farivar, L. Gagnier, and D. L. Mager (2012) “Epigenetic interplay between mouse endogenous retroviruses and host genes,” Genome Biology, 13(10), p. R89. URL http://genomebiology.com/2012/13/10/R89

[150] Kinoshita, Y., H. Saze, T. Kinoshita, A. Miura, W. J. J. Soppe, M. Koornneef, and T. Kakutani (2007) “Control of FWA gene silencing in Arabidopsis thaliana by SINE-related direct repeats,” Plant Journal, 49(1), pp. 38–45.

[151] Martin, A., C. Troadec, A. Boualem, M. Rajab, R. Fernandez, H. Morin, M. Pitrat, C. Dogimont, and A. Bendahmane “A transposon-induced epigenetic change leads to sex determination in melon,” Nature, p. 1135.

[152] Rebollo, R., M. M. Karimi, M. Bilenky, L. Gagnier, K. Miceli-Royer, Y. Zhang, P. Goyal, T. M. Keane, S. Jones, M. Hirst, M. C. Lorincz, and D. L. Mager “Retrotransposon-induced heterochromatin spreading in the mouse revealed by insertional polymorphisms.” PLoS genetics, (9), p. e1002301.

[153] Brattås, P. L., M. E. Jönsson, L. Fasching, J. Nelander Wahlestedt, M. Shahsavani, R. Falk, A. Falk, P. Jern, M. Parmar, and J. Jakobsson (2017) “TRIM28 Controls a Gene Regulatory Network Based on Endogenous Retroviruses in Human Neural Progenitor Cells,” Cell Reports, 18(1), pp. 1–11.

[154] Hollister, J. D. and B. S. Gaut “Epigenetic silencing of transposable elements: a trade-off between reduced transposition and deleterious effects on neighboring gene expression.” Genome research, (8), pp. 1419–28.

[155] Contreras-Galindo, R., M. H. Kaplan, A. C. Contreras-Galindo, M. J. Gonzalez-Hernandez, I. Ferlenghi, F. Giusti, E. Lorenzo, S. D. Gitlin, M. H. Dosik, Y. Yamamura, and D. M. Markovitz (2012) “Characterization of Human Endogenous Retroviral Elements in the Blood of HIV-1-Infected Individuals,” Journal of Virology, 86(1), pp. 262–276. URL http://jvi.asm.org/cgi/doi/10.1128/JVI.00602-11

[156] Gonzalez-Hernandez, M. J., J. D. Cavalcoli, M. A. Sartor, R. Contreras-Galindo, F. Meng, M. Dai, D. Dube, A. K. Saha, S. D. Gitlin, G. S. Omenn, M. H. Kaplan, and D. M. Markovitz “Regulation of the human endogenous retrovirus K (HML-2) transcriptome by the HIV-1 Tat protein.” Journal of virology, (16), pp. 8924–35.

[157] Evans, L. H., A. S. M. Alamgir, N. Owens, N. Weber, K. Virtaneva, K. Barbian, A. Babar, F. Malik, and K. Rosenke “Mobilization of endogenous retroviruses in mice after infection with an exogenous retrovirus.” Journal of virology, (6), pp. 2429–35.

[158] Anai, Y., H. Ochi, S. Watanabe, S. Nakagawa, M. Kawamura, T. Gojobori, and K. Nishigaki “Infectious endogenous retroviruses in cats and emergence of recombinant viruses.” Journal of virology, (16), pp. 8634–44.

[159] Bai, J., L. N. Payne, and M. A. Skinner “HPRS-103 (exogenous avian leukosis virus, subgroup J) has an env gene related to those of endogenous elements EAV-0 and E51 and an E element found previously only in sarcoma viruses.” Journal of virology, (2), pp. 779–84.

[160] Hsiao, F. C., M. Lin, A. Tai, G. Chen, and B. T. Huber (2006) “Cutting edge: Epstein-Barr virus transactivates the HERV-K18 superantigen by docking to the human complement receptor 2 (CD21) on primary B cells.” Journal of immunology (Baltimore, Md. : 1950), 177(4), pp. 2056–2060.

[161] Kwun, H. J., H. J. Han, W. J. Lee, H. S. Kim, and K. L. Jang “Transactivation of the human endogenous retrovirus K long terminal repeat by herpes simplex virus type 1 immediate early protein 0,” Virus research, (1-2), pp. 93–100.

[162] Reiche, J., G. Pauli, and H. Ellerbrok “Differential expression of human endogenous retrovirus K transcripts in primary human melanocytes and melanoma cell lines after UV irradiation,” Melanoma Research, (5).

[163] Hohenadl, C., H. Germaier, M. Walchner, M. Hagenhofer, M. Herrmann, M. Stürzl, P. Kind, R. Hehlmann, V. Erfle, and C. Leib-Mösch (1999) “Transcriptional Activation of Endogenous Retroviral Sequences in Human Epidermal Keratinocytes by UVB Irradiation,” Journal of Investigative Dermatology, 113(4), pp. 587–594. URL http://dx.doi.org/10.1046/j.1523-1747.1999.00728.x

[164] Katsumata, K., H. Ikeda, M. Sato, A. Ishizu, Y. Kawarada, H. Kato, A. Wakisaka, T. Koike, and T. Yoshiki (1999) “Cytokine Regulation of env Gene Expression of Human Endogenous Retrovirus-R in Human Vascular Endothelial Cells,” Clinical Immunology, 93(1), pp. 75–80.

[165] Manghera, M. and R. N. Douville “Endogenous retrovirus-K promoter: a landing strip for inflammatory transcription factors?” Retrovirology, p. 16.

[166] Weiss, R. A. “The discovery of endogenous retroviruses,” Retrovirology, (1), p. 67.

[167] Stoye, J. P. (2001) “Endogenous retroviruses: still active after all these years?” Current biology : CB, 11(22), pp. R914–6. URL http://www.ncbi.nlm.nih.gov/pubmed/11719237

[168] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman (1990) “Basic local alignment search tool,” Journal of Molecular Biology, 215(3), pp. 403–410.

[169] Kent, W. J. (2002) “BLAT – The BLAST-Like Alignment Tool,” Genome Research, 12, pp. 656–664.

[170] McCarthy, E. M. and J. F. McDonald (2003) “LTR STRUC: A novel search and identification program for LTR retrotransposons,” Bioinformatics, 19(3), pp. 362–367.

[171] Xu, Z. and H. Wang (2007) “LTR-FINDER: An efficient tool for the prediction of full-length LTR retrotransposons,” Nucleic Acids Research, 35(SUPPL.2), pp. 265–268.

[172] Sperber, G. O., T. Airola, P. Jern, and J. Blomberg (2007) “Automated recognition of retroviral sequences in genomic data - RetroTector©,” Nucleic Acids Research, 35(15), pp. 4964–4976.

[173] Smit, A., R. Hubley, and P. Green (2013) “RepeatMasker Open-4.0.” URL http://www.repeatmasker.org

[174] Treangen, T. J. and S. L. Salzberg (2013) “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nat Rev Genet., 13(1), pp. 36–46.

[175] Ye, L., L. D. W. Hillier, P. Minx, N. Thane, D. P. Locke, J. C. Martin, L. Chen, M. Mitreva, J. R. Miller, K. V. Haub, D. J. Dooling, E. R. Mardis, R. K. Wilson, G. M. Weinstock, and W. C. Warren (2011) “A vertebrate case study of the quality of assemblies derived from next-generation sequences,” Genome Biology, 12(3).

[176] Phillippy, A. M., M. C. Schatz, and M. Pop (2008) “Genome assembly forensics: Finding the elusive mis-assembly,” Genome Biology, 9(3).

[177] Niebert, M. and R. R. Tönjes (2003) “Analyses of prevalence and polymorphisms of six replication-competent and chromosomally assigned porcine endogenous retroviruses in individual pigs and pig subspecies,” Virology, 313(2), pp. 427–434.

[178] Macfarlane, C. and P. Simmonds (2004) “Allelic variation of HERV-K(HML-2) endogenous retroviral elements in human populations,” Journal of Molecular Evolution, 59(5), pp. 642–656.

[179] Roca, A. L., J. Pecon-Slattery, and S. J. O’Brien (2004) “Genomically Intact Endogenous Feline Leukemia Viruses of Recent Origin,” Journal of Virology, 78(8), pp. 4370–4375.

[180] Roca, A. L., W. G. Nash, J. C. Menninger, W. J. Murphy, and S. J. O’Brien (2005) “Insertional Polymorphisms of Endogenous Feline Leukemia Viruses,” Journal of Virology, 79(7), pp. 3979–3986.

[181] Stocking, C. and C. A. Kozak (2008) “Murine endogenous retroviruses,” Cell Mol Life Sci, 65(21), pp. 3383–3398.

[182] Zhang, Y., I. A. Maksakova, L. Gagnier, L. N. van de Lagemaat, and D. L. Mager “Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements.” PLoS genetics, (2), p. e1000007.

[183] Chessa, B., F. Pereira, et al. (2009) “Revealing the History of Sheep Domestication Using Retrovirus Integrations,” Science, 324(5926), pp. 532–536.

[184] Elleder, D., O. Kim, A. Padhi, J. G. Bankert, I. Simeonov, S. C. Schuster, N. E. Wittekindt, S. Motameny, and M. Poss “Polymorphic integrations of an endogenous gammaretrovirus in the mule deer genome.” Journal of virology, (5), pp. 2787–96.

[185] Kamath, P. L., D. Elleder, L. Bao, P. C. Cross, J. H. Powell, and M. Poss (2013) “The Population History of Endogenous Retroviruses in Mule Deer (Odocoileus hemionus),” Journal of Heredity, pp. 1–15.

[186] Ishida, Y., K. Zhao, A. D. Greenwood, and A. L. Roca (2015) “Proliferation of Endogenous Retroviruses in the Early Stages of a Host Germ Line Invasion,” Molecular Biology and Evolution, 32(1), pp. 109–120.

[187] Wildschutte, J. H., Z. H. Williams, M. Montesion, R. P. Subramanian, J. M. Kidd, and J. M. Coffin (2016) “Discovery of unfixed endogenous retrovirus insertions in diverse human populations,” Proceedings of the National Academy of Sciences, p. 201602336. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1602336113

[188] Ajay, S. S., S. C. Parker, H. O. Abaan, K. V. Fuentes Fajardo, and E. H. Margulies (2011) “Accurate and comprehensive sequencing of personal genomes,” Genome Research, 21(9), pp. 1498–1505.

[189] van Heesch, S., W. P. Kloosterman, N. Lansu, F. P. Ruzius, E. Levandowsky, C. C. Lee, S. Zhou, S. Goldstein, D. C. Schwartz, T. T. Harkins, V. Guryev, and E. Cuppen (2013) “Improving mammalian genome scaffolding using large insert mate-pair next-generation sequencing,” BMC Genomics, 14(1), pp. 1–11.

[190] Hobbs, M., A. King, R. Salinas, Z. Chen, K. Tsangaras, A. D. Greenwood, R. N. Johnson, K. Belov, M. R. Wilkins, and P. Timms (2017) “Long-read genome sequence assembly provides insight into ongoing retroviral invasion of the koala germline,” Scientific Reports, 7(1), p. 15838. URL http://www.nature.com/articles/s41598-017-16171-1

[191] Ray, A., R. Rahbari, and R. M. Badge (2011) “IAP Display: A Simple Method to Identify Mouse Strain Specific IAP Insertions,” Molecular Biotechnology, 47(3), pp. 243–252.

[192] van Opijnen, T. and A. Camilli (2013) “Transposon insertion sequencing: a new tool for systems-level analysis of microorganisms,” Nature Reviews Microbiology, 11(7), pp. 435–442. URL http://www.nature.com/doifinder/10.1038/nrmicro3033

[193] Iskow, R. C., M. T. McCabe, R. E. Mills, S. Torene, W. S. Pittard, A. F. Neuwald, E. G. Van Meir, P. M. Vertino, and S. E. Devine “Natural mutagenesis of human genomes by endogenous retrotransposons.” Cell, (7), pp. 1253–61.

[194] Witherspoon, D. J., J. Xing, Y. Zhang, W. S. Watkins, M. A. Batzer, and L. B. Jorde “Mobile element scanning (ME-Scan) by targeted high-throughput sequencing.” BMC genomics, p. 410.

[195] Malhotra, R., D. Elleder, L. Bao, D. Hunter, M. Poss, and R. Acharya (2016) “A pipeline for identifying integration sites of mobile elements in the genome using next-generation sequencing,” Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016.

[196] Cullingham, C. I., S. M. Nakada, E. H. Merrill, T. K. Bollinger, M. J. Pybus, and D. W. Coltman (2011) “Multiscale population genetic analysis of mule deer (Odocoileus hemionus hemionus) in western Canada sheds new light on the spread of chronic wasting disease,” Canadian Journal of Zoology-Revue Canadienne De Zoologie, 89(2), pp. 134–147.

[197] Powell, J. H., S. T. Kalinowski, M. D. Higgs, M. R. Ebinger, N. V. Vu, and P. C. Cross (2013) “Microsatellites indicate minimal barriers to mule deer, Odocoileus hemionus dispersal across Montana, USA,” Wildlife Biology, 19(1), pp. 102–110. URL http://www.bioone.org/doi/abs/10.2981/11-081

[198] Latch, E. K., J. R. Heffelfinger, J. A. Fike, and O. E. Rhodes (2009) “Species-wide phylogeography of North American mule deer (Odocoileus hemionus): cryptic glacial refugia and postglacial recolonization.” Molecular ecology, 18(8), pp. 1730–45. URL http://www.ncbi.nlm.nih.gov/pubmed/19302464

[199] Latch, E., D. Reding, J. Heffelfinger, C. Alcalá-Galván, and O. Rhodes (2014) “Range-wide analysis of genetic structure in a widespread, highly mobile species (Odocoileus hemionus) reveals the importance of historical biogeography,” Molecular Ecology, 23(13), pp. 3171–3190. URL https://doi.org/10.1111/mec.12803

[200] Hedges, S. B., J. Dudley, and S. Kumar (2006) “TimeTree: a public knowledge-base of divergence times among organisms.” Bioinformatics (Oxford, England), 22(23), pp. 2971–2. URL http://www.ncbi.nlm.nih.gov/pubmed/17021158

[201] Oliveira, N. M., H. Satija, I. A. Kouwenhoven, and M. V. Eiden (2007) “Changes in viral protein function that accompany retroviral endogenization,” Proceedings of the National Academy of Sciences, 104(44), pp. 17506–17511. URL http://www.pnas.org/cgi/doi/10.1073/pnas.0704313104

[202] Fábryová, H., T. Hron, H. Kabíčková, M. Poss, and D. Elleder (2015) “Induction and characterization of a replication competent cervid endogenous gammaretrovirus (CrERV) from mule deer cells,” Virology, 485, pp. 96–103.

[203] Moyes, D., D. J. Griffiths, and P. J. Venables (2007) “Insertional polymorphisms: a new lease of life for endogenous retroviruses in human disease.” Trends in genetics : TIG, 23(7), pp. 326–33. URL http://www.ncbi.nlm.nih.gov/pubmed/17524519

[204] Hecht, S. J., K. E. Stedman, J. O. Carlson, and J. C. DeMartini “Distribution of endogenous type B and type D sheep retrovirus sequences in ungulates and other mammals.” Proceedings of the National Academy of Sciences of the United States of America, (8), pp. 3297–302.

[205] Herniou, E., J. Martin, K. Miller, J. Cook, M. Wilkinson, and M. Tristem (1998) “Retroviral diversity and distribution in vertebrates.” Journal of virology, 72(7), pp. 5955–5966.

[206] Garcia-Etxebarria, K., M. Sistiaga-Poveda, and B. M. Jugo “Endogenous retroviruses in domestic animals.” Current genomics, (4), pp. 256–65.

[207] Oja, M., J. Peltonen, J. Blomberg, and S. Kaski (2007) “Methods for estimating human endogenous retrovirus activities from EST databases.” BMC bioinformatics, 8 Suppl 2(May 2014).

[208] Flockerzi, A., A. Ruggieri, O. Frank, M. Sauter, E. Maldener, B. Kopper, B. Wullich, W. Seifarth, N. Müller-Lantzsch, C. Leib-Mösch, E. Meese, and J. Mayer (2008) “Expression patterns of transcribed human endogenous retrovirus HERV-K(HML-2) loci in human tissues and the need for a HERV Transcriptome Project,” BMC Genomics, 9(1), p. 354.

[209] Schmitt, K., C. Richter, C. Backes, E. Meese, K. Ruprecht, and J. Mayer (2013) “Comprehensive analysis of human endogenous retrovirus group HERV-W locus transcription in multiple sclerosis brain lesions by high-throughput amplicon sequencing.” Journal of virology, 87(24), pp. 13837–52. URL http://www.ncbi.nlm.nih.gov/pubmed/24109235

[210] Bao, L., D. Elleder, R. Malhotra, M. Degiorgio, T. Maravegias, L. Horvath, L. Carrel, C. Gillin, T. Hron, H. Fábryová, D. R. Hunter, and M. Poss (2014) “Computational and Statistical Analyses of Insertional Polymorphic Endogenous Retroviruses in a Non-Model Organism,” Computation, (August), pp. 221–245.

[211] Ciuffi, A., K. Ronen, T. Brady, N. Malani, G. Wang, C. C. Berry, and F. D. Bushman “Methods for integration site distribution analyses in animal cell genomes.” Methods (San Diego, Calif.), (4), pp. 261–8.

[212] Yang, L., R. Chikhi, J. Rong, T. Kaiser, R. Malhotra, P. Medvedev, and M. Poss “The Draft Mule Deer Genome Reveals the Recent Continuous Colonization History, Polymorphism and Genomic Distribution of an Endogenous Gammaretrovirus.” in preparation.

[213] Hunter, D. R., L. Bao, and M. Poss (2017) “Assignment of endogenous retrovirus integration sites using a mixture model,” The Annals of Applied Statistics, 11(2), pp. 751–770. URL http://projecteuclid.org/euclid.aoas/1500537722

[214] Wang, Y., F. Liska, C. Gosele, and S. Lucie (2010) “A novel active endogenous retrovirus family contributes to genome variability in rat inbred strains,” pp. 19–27.

[215] Magiorkinis, G., R. J. Gifford, A. Katzourakis, J. De Ranter, and R. Belshaw “Env-less endogenous retroviruses are genomic superspreaders.” Proceedings of the National Academy of Sciences of the United States of America, (19), pp. 7385–90.

[216] Belshaw, R., A. Katzourakis, J. Pačes, A. Burt, and M. Tristem (2005) “High copy number in human endogenous retrovirus families is associated with copying mechanisms in addition to reinfection,” Molecular Biology and Evolution, 22(4), pp. 814–817.

[217] Ting, C. N., M. P. Rosenberg, C. M. Snow, L. C. Samuelson, and M. H. Meisler (1992) “Endogenous retroviral sequences are required for tissue-specific expression of a human salivary amylase gene.” Genes & development, 6(8), pp. 1457–65. URL http://www.ncbi.nlm.nih.gov/pubmed/1379564

[218] Leung, D. C. and M. C. Lorincz (2012) “Silencing of endogenous retroviruses: when and why do histone marks predominate?” Trends in biochemical sciences, 37(4), pp. 127–33. URL http://www.ncbi.nlm.nih.gov/pubmed/22178137

[219] Schlesinger, S. and S. P. Goff “Silencing of proviruses in embryonic cells: efficiency, stability and chromatin modifications.” EMBO reports, (1), pp. 73–9.

[220] Garazha, A., A. Ivanova, M. Suntsova, G. Malakhova, S. Roumiantsev, A. Zhavoronkov, and A. Buzdin (2015) “New bioinformatic tool for quick identification of functionally relevant endogenous retroviral inserts in human genome,” Cell Cycle, 14(9), pp. 1476–1484.

[221] Chuong, E. B., N. C. Elde, and C. Feschotte (2016) “Regulatory activities of transposable elements: from conflicts to benefits,” Nature Reviews Genetics. URL http://www.nature.com/doifinder/10.1038/nrg.2016.139

[222] Hejnar, J., P. Hajkova, J. Plachy, D. Elleder, V. Stepanets, and J. Svoboda (2001) “CpG island protects Rous sarcoma virus-derived vectors integrated into nonpermissive cells from DNA methylation and transcriptional suppression,” Proceedings of the National Academy of Sciences, 98(2), pp. 565–569.

[223] Bakshi, A. and J. Kim (2014) “Retrotransposon-based profiling of mammalian epigenomes: DNA methylation of IAP LTRs in embryonic stem, somatic and cancer cells,” Genomics, 104(6), pp. 538–544, NIHMS150003. URL http://dx.doi.org/10.1016/j.ygeno.2014.09.009

[224] Xie, H., M. Wang, M. d. F. Bonaldo, C. Smith, V. Rajaram, S. Goldman, T. Tomita, and M. B. Soares (2009) “High-throughput sequence-based epigenomic analysis of Alu repeats in human cerebellum,” Nucleic Acids Research, 37(13), pp. 4331–4340.

[225] Li, H. (2013) “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM,” arXiv preprint, pp. 1–3, 1303.3997. URL http://arxiv.org/abs/1303.3997

[226] Quinlan, A. R. and I. M. Hall (2010) “BEDTools: A flexible suite of utilities for comparing genomic features,” Bioinformatics, 26(6), pp. 841–842.

[227] Robinson, J. T., H. Thorvaldsdóttir, W. Winckler, M. Guttman, E. S. Lander, G. Getz, and J. P. Mesirov (2011) “Integrative Genome Viewer,” Nature Biotechnology, 29(1), pp. 24–6.

[228] Rizk, G., D. Lavenier, and R. Chikhi (2013) “DSK: K-mer counting with very low memory usage,” Bioinformatics, 29(5), pp. 652–653.

[229] Scotto-Lavino, E., G. Du, and M. A. Frohman “3′ End cDNA amplification using classic RACE,” Nature Protocols, p. 2742.

[230] Xu, H., X. Luo, J. Qian, X. Pang, J. Song, G. Qian, J. Chen, and S. Chen (2012) “FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads,” PLoS ONE, 7(12), pp. 1–6.

[231] Bolger, A. M., M. Lohse, and B. Usadel (2014) “Trimmomatic: A flexible trimmer for Illumina sequence data,” Bioinformatics, 30(15), pp. 2114–2120.

[232] Trapnell, C., L. Pachter, and S. L. Salzberg (2009) “TopHat: Discovering splice junctions with RNA-Seq,” Bioinformatics, 25(9), pp. 1105–1111, 9605103.

[233] Li, H. and R. Durbin (2009) “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, 25(14), pp. 1754–1760, 1303.3997.

[234] Chang, T. H., H. Y. Huang, J. B. K. Hsu, S. L. Weng, J. T. Horng, and H. D. Huang (2013) “An enhanced computational platform for investigating the roles of regulatory RNA and for identifying functional RNA motifs,” BMC Bioinformatics, 14(Suppl 2), p. S4. URL http://www.biomedcentral.com/1471-2105/14/S2/S4

[235] Matoušková, M., J. Blažková, P. Pajer, A. Pavlíček, and J. Hejnar (2006) “CpG methylation suppresses transcriptional activity of human syncytin-1 in non-placental tissues,” Experimental Cell Research, 312(7), pp. 1011–1020.

[236] Gimenez, J., C. Montgiraud, G. Oriol, J. P. Pichon, K. Ruel, V. Tsatsaris, P. Gerbaud, J. L. Frendo, D. Evain-Brion, and F. Mallet (2009) “Comparative methylation of ERVWE1/syncytin-1 and other human endogenous retrovirus LTRs in placenta tissues,” DNA Research, 16(4), pp. 195–211.

[237] Stuhlmann, H. and P. Berg (1992) “Homologous recombination of co-packaged retrovirus RNAs during reverse transcription.” Journal of virology, 66(4), pp. 2378–88. URL http://jvi.asm.org/content/66/4/2378.abstract

[238] Simonti, C. N., M. Pavlicev, and J. A. Capra (2017) “Transposable Element Exaptation into Regulatory Regions Is Rare, Influenced by Evolutionary Age, and Subject to Pleiotropic Constraints,” Molecular biology and evolution, 34(11), pp. 2856–2869.

[239] Macfarlane, C. M. and R. M. Badge “Genome-wide amplification of proviral sequences reveals new polymorphic HERV-K(HML-2) proviruses in humans and chimpanzees that are absent from genome assemblies.” Retrovirology, (1), p. 35.

[240] Belshaw, R., A. L. A. Dawson, J. Woolven-Allen, J. Redding, A. Burt, and M. Tristem (2005) “Genomewide Screening Reveals High Levels of Insertional Polymorphism in the Human Endogenous Retrovirus Family HERV-K (HML2): Implications for Present-Day Activity,” Journal of virology, 79(19), pp. 12507–12514.

[241] Marchi, E., A. Kanapin, G. Magiorkinis, and R. Belshaw (2014) “Unfixed endogenous retroviral insertions in the human population.” Journal of virology, 148(June). URL http://www.ncbi.nlm.nih.gov/pubmed/24920817

[242] Moyes, D. L., A. Martin, S. Sawcer, N. Temperton, J. Worthing- ton, D. J. Griffiths, and P. J. Venables (2005) “The distribution of the endogenous retroviruses HERV-K113 and HERV-K115 in health and disease,” Genomics, 86(3), pp. 337–341.

[243] David, V. A., M. Menotti-Raymond, A. C. Wallace, M. Roelke, J. Kehler, R. Leighty, E. Eizirik, S. S. Hannah, G. Nelson, A. A. Schäffer, C. J. Connelly, S. J. O’Brien, and D. K. Ryugo (2014) “Endogenous Retrovirus Insertion in the KIT Oncogene Determines White spotting in Domestic Cats,” G3, 4(10), pp. 1881–1891. URL http://g3journal.org/lookup/doi/10.1534/g3.114.013425

[244] Samuelson, L. C., K. Wiebauer, C. M. Snow, and M. H. Meisler (1990) “Retroviral and Pseudogene Insertion Sites Reveal the Lineage of Human Salivary and Pancreatic Amylase Genes from a Single Gene during Primate Evolution.”

[245] Sin, H. S., J. W. Huh, D. S. Kim, D. W. Kang, D. S. Min, T. H. Kim, H. S. Ha, H. H. Kim, S. Y. Lee, and H. S. Kim (2006) “Transcriptional control of the HERV-H LTR element of the GSDML gene in human tissues and cancer cells,” Archives of Virology, 151(10), pp. 1985–1994.

[246] Mignone, F., C. Gissi, S. Liuni, and G. Pesole (2002) “Untranslated regions of mRNAs.” Genome biology, 3(3).

[247] Medstrand, P., J. R. Landry, and D. L. Mager (2001) “Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans,” Journal of Biological Chemistry, 276(3), pp. 1896–1903.

[248] Subramanian, R. P., J. H. Wildschutte, C. Russo, and J. M. Coffin “Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses,” Retrovirology, (1), p. 90.

[249] Anderson, C. L., M. A. Zundel, and R. Werner (2005) “Variable promoter usage and alternative splicing in five mouse connexin genes,” Genomics, 85(2), pp. 238–244.

[250] Bockmühl, Y., C. A. Murgatroyd, A. Kuczynska, I. M. Adcock, O. F. X. Almeida, and D. Spengler (2011) “Differential Regulation and Function of 5′-Untranslated GR-Exon 1 Transcripts,” Molecular Endocrinology, 25(7), pp. 1100–1110. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5417247/

[251] Kelley, D. and J. Rinn (2012) “Transposable elements reveal a stem cell-specific class of long noncoding RNAs,” Genome Biology, 13(11), p. R107. URL http://genomebiology.com/2012/13/11/R107

[252] Kornienko, A. E., P. M. Guenzl, D. P. Barlow, and F. M. Pauler “Gene regulation by the act of long non-coding RNA transcription.” BMC biology, (1), p. 59.

[253] Vance, K. W. and C. P. Ponting (2014) “Transcriptional regulatory functions of nuclear long noncoding RNAs,” Trends in Genetics, 30(8), pp. 348–355. URL http://dx.doi.org/10.1016/j.tig.2014.06.001

[254] Engreitz, J. M., J. E. Haines, E. M. Perez, G. Munson, J. Chen, M. Kane, P. E. McDonel, M. Guttman, and E. S. Lander “Local regulation of gene expression by lncRNA promoters, transcription and splicing,” Nature, p. 452.

[255] Barbulescu, M., G. Turner, M. I. Seaman, A. S. Deinard, K. K. Kidd, and J. Lenz (1999) “Many human endogenous retrovirus K (HERV-K) proviruses are unique to humans,” Current Biology, 9(16), pp. 861–868.

[256] Nellåker, C., T. M. Keane, B. Yalcin, K. Wong, A. Agam, T. G. Belgard, J. Flint, D. J. Adams, W. N. Frankel, and C. P. Ponting “The genomic landscape shaped by selection on transposable elements across 18 mouse strains.” Genome biology, (6), p. R45.

[257] Miller, M. W., E. S. Williams, N. T. Hobbs, and L. L. Wolfe “Environmental sources of prion transmission in mule deer.” Emerging infectious diseases, (6), pp. 1003–6.

[258] Williams, E. S. (2005) “Chronic Wasting Disease,” Vet Pathol, 42, pp. 530–549.

[259] Williams, E. and M. Miller (2002) “Chronic wasting disease in deer and elk in North America,” Revue Scientifique et Technique de l’OIE, 21(2), pp. 305–316.

[260] Leblanc, P., D. Baas, and J.-L. Darlix (2004) “Analysis of the interactions between HIV-1 and the cellular prion protein in a human cell line.” Journal of molecular biology, 337(4), pp. 1035–51. URL http://www.ncbi.nlm.nih.gov/pubmed/15033368

[261] Stanton, J. B., D. P. Knowles, K. I. O’Rourke, L. M. Herrmann-Hoesing, B. A. Mathison, and T. V. Baszler “Small-ruminant lentivirus enhances PrPSc accumulation in cultured sheep microglial cells.” Journal of virology, (20), pp. 9839–47.

[262] Fleige, S. and M. W. Pfaffl (2006) “RNA integrity and the effect on the real-time qRT-PCR performance,” Molecular Aspects of Medicine, 27(2-3), pp. 126–139.

[263] Poss, M. L., J. I. Mullins, and E. A. Hoover “Posttranslational modifications distinguish the envelope glycoprotein of the immunodeficiency disease-inducing retrovirus.” Journal of virology, (1), pp. 189–95.

[264] Gardner, M., B. Henderson, J. Officer, R. Rongey, J. Parker, C. Oliver, J. Estes, and R. Huebner (1973) “A Spontaneous Lower Motor Neuron Disease Apparently Caused by Indigenous Type-C RNA Virus in Wild Mice,” Journal of the National Cancer Institute, 51(4), pp. 1243–1254.

[265] DesGroseillers, L., M. Barrette, and P. Jolicoeur (1984) “Physical mapping of the paralysis-inducing determinant of a wild mouse ecotropic neurotropic retrovirus.” Journal of virology, 52(2), pp. 356–363.

[266] Leblanc, P., S. Alais, I. Porto-Carreiro, S. Lehmann, J. Grassi, G. Raposo, and J. L. Darlix (2006) “Retrovirus infection strongly enhances scrapie infectivity release in cell culture,” EMBO Journal, 25(12), pp. 2674–2685.

[267] Kuse, K., J. Ito, A. Miyake, J. Kawasaki, S. Watanabe, I. Makundi, M. H. Ngo, T. Otoi, and K. Nishigaki (2016) “Existence of Two Distinct Infectious Endogenous Retroviruses in Domestic Cats and Their Different Strategies for Adaptation to Transcriptional Regulation,” Journal of Virology, 90(20), pp. 9029–9045.

[268] Arnaud, F., M. Caporale, M. Varela, R. Biek, B. Chessa, A. Alberti, M. Golder, M. Mura, Y.-P. Zhang, L. Yu, F. Pereira, J. C. Demartini, K. Leymaster, T. E. Spencer, and M. Palmarini “A paradigm for virus-host coevolution: sequential counter-adaptations between endogenous and exogenous retroviruses.” PLoS pathogens, (11), p. e170.

[269] Kozak, C. (2014) “Origins of the Endogenous and Infectious Laboratory Mouse Gammaretroviruses,” Viruses, 7(1), pp. 1–26. URL http://www.mdpi.com/1999-4915/7/1/1/

[270] Gemmell, P., J. Hein, and A. Katzourakis (2016) “Phylogenetic Analysis Reveals That ERVs "Die Young" but HERV-H Is Unusually Conserved,” PLOS Computational Biology, 12(6), p. e1004964. URL http://dx.plos.org/10.1371/journal.pcbi.1004964

[271] Wray, G. A. (2007) “The evolutionary significance of cis-regulatory mutations,” Nature Reviews Genetics, 8, p. 206. URL http://dx.doi.org/10.1038/nrg2063

Vita

Theodora Alexis Kaiser

Education

2012: BS in Biochemistry, Rowan University, Glassboro, NJ

2018: PhD in Molecular, Cellular, and Integrative Biosciences (MCIBS), Pennsylvania State University, University Park, PA

Publications

Bao, L., Elleder, D., Malhotra, R., DeGiorgio, M., Maravegias, T., Horvath, L., Carrel, L., Gillin, C., Hron, T., Fábryová, H., Hunter, D.R., & Poss, M. (2014). Computational and statistical analyses of insertional polymorphic endogenous retroviruses in a non-model organism. Computation 2(4), 221-245.

Yang, L., Chikhi, R., Rong, J., Kaiser, T., Malhotra, R., Medvedev, P. & Poss, M. The Draft Mule Deer Genome Reveals the Recent Continuous Colonization History, Polymorphism and Genomic Distribution of an Endogenous Gammaretrovirus. (in preparation).

Kaiser, T., Malhotra, R., Yang, L., Bogale, K., Elleder, D. & Poss, M. Evolutionary implications of endogenous retrovirus expression on host genome evolution. (in preparation).

Kaiser, T., Yang, L., Malhotra, R., Li, W., Williams, S. & Poss, M. Comparative analysis of endogenous retrovirus transcription and contribution to structural variation between two mule deer populations. (in preparation).