
Gorillas have been infected with the HERV-K (HML-2) endogenous retrovirus much more recently than humans and chimpanzees

Joseph R. Holloway (a,b,1), Zachary H. Williams (a,b,1), Michael M. Freeman (a,b), Uriel Bulow (a,b), and John M. Coffin (a,b,2)

(a) Department of Molecular Biology and Microbiology, Tufts University, Boston, MA 02111; and (b) Sackler School of Graduate Biomedical Sciences, Tufts University, Boston, MA 02111

Contributed by John M. Coffin, November 25, 2018 (sent for review August 17, 2018; reviewed by Robert J. Gifford, Jack Lenz, and Jonathan P. Stoye)

Human endogenous retrovirus-K (HERV-K) human mouse mammary tumor virus-like 2 (HML-2) is the most recently active endogenous retrovirus group in humans, and the only group with human-specific proviruses. HML-2 expression is associated with cancer and other diseases, but extensive searches have failed to reveal any replication-competent proviruses in humans. However, HML-2 proviruses are found throughout the catarrhine primates, and it is possible that they continue to infect some species today. To investigate this possibility, we searched for gorilla-specific HML-2 elements using both in silico data mining and targeted deep-sequencing approaches. We identified 150 gorilla-specific integrations, including 31 2-LTR proviruses. Many of these proviruses have identical LTRs, and are insertionally polymorphic, consistent with very recent integration. One identified provirus has full-length ORFs for all genes, and thus could potentially be replication-competent. We suggest that gorillas may still harbor infectious HML-2 virus and could serve as a model for understanding retrovirus evolution and pathogenesis in humans.

endogenous retroviruses | host–virus evolution | mining

Significance

Human endogenous retrovirus-K (HERV-K) human mouse mammary tumor virus-like 2 (HML-2) is the most recently active endogenous retrovirus group in humans. Their proviruses are also found within the genomes of all apes and Old World monkeys; however, no HML-2 provirus is known to be naturally infectious. Although these proviruses seem to be functionally extinct in both humans and chimpanzees, less is known about the profile and activity of HML-2 proviruses in gorillas. Our work here has identified gorilla-specific HML-2 elements that have characteristics consistent with very recent activity, and raises the possibility that gorillas may still contain infectious HML-2 virus. Thus, gorillas could serve as a model for how HML-2 functioned as a virus in humans, as well as shed light on its role in pathogenesis and host–virus evolution.

Author contributions: J.R.H., Z.H.W., and J.M.C. designed research; J.R.H., Z.H.W., M.M.F., and U.B. performed research; J.R.H., Z.H.W., M.M.F., and J.M.C. analyzed data; and J.R.H., Z.H.W., and J.M.C. wrote the paper.

Reviewers: R.J.G., MRC-University of Glasgow Centre for Virus Research; J.L., Albert Einstein College of Medicine; and J.P.S., Francis Crick Institute.

Conflict of interest statement: J.M.C., J.P.S., and R.J.G. are coauthors on a 2018 subcommittee report. This did not involve any active collaboration.

Published under the PNAS license.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. MH678754–MH678803 and MH684412–MH684461).

(1) J.R.H. and Z.H.W. contributed equally to this work.

(2) To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1814203116/-/DCSupplemental.

Endogenous retroviruses (ERVs) are sequences found in the genomes of all vertebrates that were originally derived from exogenous retroviruses (1–3). These sequences are the result of retroviral infection and integration of the provirus into the genome of germ-line cells, and provide a record of past retroviral infections. Once integrated, such proviruses are permanent residents of the host and will be present in all cells of progeny derived from the infected germ-line cell (4, 5). Most ERV sequences have numerous mutations that render them noninfectious. Additionally, homologous recombination can occur between the 5′ and 3′ LTRs of a provirus after integration, leading to the loss of internal coding sequence and producing a solo LTR. About 90% of ERV integrations have been reduced to solo LTRs. However, replication-competent ERVs have been found in a number of species, and recombination between defective ERVs can also lead to the production of infectious virus (6, 7).

Human endogenous retroviruses (HERVs) constitute ∼8% of the genome (8), with ∼30 HERV groups represented (5, 9). The groups are currently named for the specific tRNA used for priming reverse transcription (10, 11), with the HERV-K group further divided into 11 subtypes that reflect their similarity to the infectious mouse mammary tumor virus (MMTV) (4, 12, 13). The human MMTV-like 2 (HML-2) subtype (hereafter called “HML-2”) is particularly interesting for a variety of reasons. In addition to having members that possess a number of full-length ORFs, it contains the youngest known HERV sequences, and is the only one known to have human-specific integrations (14). These proviruses are further classified into three subtypes based on LTR phylogeny: LTR5A, LTR5B, and LTR5Hs. LTR5B contains the oldest insertions, and the LTR5A and LTR5Hs clades branch separately out of this group (14). LTR5Hs includes the most recently integrated sequences and is the only group that has human-specific integrations, whereas LTR5A and 5B appear to have ceased activity in the hominoid lineage before the split of humans from gorillas. The youngest known HML-2 provirus may have integrated in humans as recently as 100,000 y ago, suggesting that this group was still active after the evolution of anatomically modern humans (15–17). Additionally, more than half of the known human-specific HML-2 proviruses are insertionally polymorphic, with some insertions present in fewer than 5% of individuals (12, 18–20).

Despite evidence of evolutionarily recent activity (as well as occasional reports to the contrary), attempts by multiple laboratories have failed to find any unambiguous evidence for ongoing HML-2 replication in humans. We previously reported our attempts to identify rare, recent HML-2 integrations in short-read sequence data from over 2,500 individuals in the 1000 Genomes Project (21). Although we were able to identify and characterize rare HML-2 integrations, including one with full-length ORFs for all genes, none of the proviruses appeared to be derived from recent activity (22). Although no infectious provirus is known, the high numbers of relatively intact, insertionally polymorphic HML-2 proviruses in humans have led researchers to investigate this group for links to disease (8, 23). Like most ERVs, HML-2 proviruses are usually transcriptionally silenced in healthy tissues, but transcription of specific proviruses has
been observed in a number of disease states (24). Despite the abundance of research, no causative role has been proven for any HML-2 provirus in any disease. Attempts to prove such a role are hampered by insufficient knowledge of basic HML-2 biology, and of how HML-2 functions as a virus. Consensus HML-2 proviruses have been shown to be weakly infectious in vitro; however, it is unclear how well these experiments recapitulate how HML-2 is replicated in vivo (25, 26).

Although many HML-2 insertions are human-specific, there are a number of older HML-2 integrations at identical sites in all apes and Old World monkeys, dating their integration to as much as 35 million y ago, soon after the split with New World monkeys (14, 17, 27). Although these viruses may have become functionally extinct in humans, it is possible that active forms could currently exist in other primates, perhaps even in our closest relatives, chimpanzees and gorillas. Given the presence of shared HML-2 proviruses between these species, we thought it worthwhile to examine the possibility that chimpanzees and/or gorillas might have been subject to more recent, and possibly ongoing, infection and reintegration with HML-2 virus. Although chimpanzee-specific proviruses have been reported (28), the chimpanzee genome contains relatively low numbers of chimpanzee-specific HML-2 integrations in comparison with humans, suggesting even less recent activity in this species than in humans. In contrast, very little has been published on HML-2 proviruses in other hominoids, including gorillas. In this study, we apply to the gorilla genome two methodologies we have developed for the identification of proviral integrations in the human genome (22, 29). One method utilizes publicly available whole-genome sequence data published by the Great Ape Genome Project (GAGP) (30), while a second method generates sequence libraries constructed to target, enrich, and identify HML-2 proviral integration junctions from genomic DNA.

Using these methods, we have identified and sequenced 22 gorilla-specific 2-LTR proviruses not found in the reference gorilla genome. We identified an additional nine gorilla-specific 2-LTR proviruses present in the gorGor5 reference genome, and estimated the population frequency, time of integration, and coding capacity of all proviruses identified. Most of the proviruses identified are insertionally polymorphic and appear to have integrated very recently. Consistent with their recent origin, many contain one or more full-length ORFs. One of these proviruses has full-length ORFs for all retroviral genes, and could also potentially be replication-competent. Overall, the gorilla HML-2 proviruses comprise by far the most recently active HERV-related virus group in any known hominoid species and the most likely to represent still-active transmitted viruses.

Results

Discovery of HERV-K (HML-2) Insertions. In this study, we used two separate methods to identify gorilla-specific HML-2 insertions not present in the current reference assembly of the Gorilla sp. genome. The first approach used searches of publicly available whole-genome sequence data from gorillas, using a data-mining technique that we previously applied to human samples from the 1000 Genomes Project (22). Though gorilla genomes have not been sequenced at anywhere near the scale of the 1000 Genomes Project, a smaller-scale study, the Great Ape Genome Project, finished in 2013, included high-coverage Illumina sequencing of 79 great apes from all six great ape species. We downloaded all available gorilla sequence data from this project as unaligned FASTQ files. In total, we obtained sequence from 30 individuals from 3 subspecies, including 26 Western lowland (Gorilla gorilla gorilla), 1 Cross River (Gorilla gorilla diehli), and 3 Eastern lowland (Gorilla beringei graueri). Analysis began with an alignment step to the current gorGor5 long-read gorilla genome assembly (31). We retrieved unmapped reads from each aligned gorilla genome and searched reads with sequence matching the (human) HML-2 5′ or 3′ LTR edges, filtering out any reads with junctions that matched known HML-2 proviruses in humans or gorillas and any reads with <10 bp of sequence flanking the insertion. From the 30 gorilla genomes screened, we identified 2,057 putative nonreference insertion sites, of which 184 had reads corresponding to both the 5′ and 3′ junctions (SI Appendix, Table S1). We focused our downstream analyses on this group of high-confidence, two-sided hits, though it is likely that many of the hits with reads corresponding to only 5′ or 3′ flanks are genuine. Of the two-sided 184 hits, 130 were found in a single subspecies, with 117 unique to Western lowland gorillas and 13 unique to the Eastern lowland gorillas. Thirty-two hits were found in the single Cross River sample; however, all of them were shared within the Western lowland subspecies. Of the 54 proviruses shared between two or more subspecies, 22 were found only in the Western and Eastern lowland gorillas, 18 only in Western and Cross River, and 14 sites were found in all 3 subspecies. As we show below, this extent of polymorphism far exceeds that seen in humans for the same provirus group.

In addition to identifying nonreference HML-2 insertions, we took advantage of the recently released long-read gorilla genome assembly (gorGor5) to identify further gorilla-specific insertions (32). Our previous attempts to identify such insertions from earlier gorilla assemblies were hindered by very unreliable assembly of HML-2 elements, with artifactual “pseudolog” proviruses placed at the sites of known human-specific HML-2 insertions, as well as gaps in the assembly at sites of genuine gorilla-specific insertions. We were hopeful that the long reads used in the gorGor5 assembly would improve the assembly of large insertions such as ERVs.

To identify gorilla-specific HML-2 proviruses from this assembly, we downloaded all LTR5Hs and LTR5 coordinates annotated by RepeatMasker (33) in the gorGor5 assembly available on the UCSC Genome Browser (32). We then cross-referenced these sites against the same data from the human, chimpanzee, and genomes, filtering out any sites where at least one of those three genomes had an annotated HML-2 insertion within 1 kb. “Full-length” (or “2-LTR,” as distinct from solo LTRs) proviruses within this dataset were identified by filtering for sites with HML-2 internal genic regions, classified as “HERVK-int” by RepeatMasker. We identified nine gorilla-specific 2-LTR full-length or nearly full-length proviruses present in the gorGor5 reference (Table 1, source: gorGor5), as well as 99 gorilla-specific solo LTRs (SI Appendix, Table S2).

Our second approach to discovery of nonreference gorilla HML-2 elements adapted a linker-mediated PCR sequence library preparation protocol first designed to specifically target and enrich human T-lymphotropic virus (HTLV) and HIV proviral integrations (34) to similarly identify HML-2 integrations in gorilla genomic DNA. We began our library preparation by attaching a linker to sheared DNA fragments to permit the specific PCR enrichment of fragments containing HML-2 integrations. This PCR amplification step was designed to target the 5′ LTR region and the 5′ flanking genomic DNA of an HML-2 provirus using primers complementary to the linker and the untranslated leader sequence between the 5′ LTR and the gag gene. LTR-specific primers were not used in the first round to avoid amplifying solo LTRs. A second round of PCR amplification allowed for further enrichment of the 5′ flanking genomic DNA using an HML-2 primer designed to anneal near the end of the 5′ LTR, again in combination with a nested linker-specific primer. Library construction was conducted for each of three available Western lowland gorilla DNA samples. Libraries were sequenced on the Illumina MiSeq platform, aligned to the gorGor5 genome, and filtered to remove known HML-2 insertions. The resulting read data yielded 24 highly targeted loci not present in gorGor5. All 24 loci were investigated further, producing 18 previously unknown HML-2 integrations (Table 1, source: library).
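The read-level filtering used in the data-mining approach (discarding junction reads with <10 bp of flanking sequence or matching known proviruses, then keeping sites supported at both the 5′ and 3′ junctions) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the `JunctionRead` record, the `site` key format, and the choice to pool reads across samples when deciding whether a site is two-sided are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_FLANK_BP = 10  # reads with <10 bp of flanking sequence are discarded (threshold from the text)


@dataclass(frozen=True)
class JunctionRead:
    """Hypothetical record for one read spanning an LTR edge."""
    sample: str    # sample identifier (illustrative)
    site: str      # insertion-site key, e.g. "contig:position" (illustrative format)
    edge: str      # "5" or "3": which LTR edge the read spans
    flank_bp: int  # length of host sequence flanking the LTR edge


def two_sided_sites(reads, known_sites=frozenset()):
    """Return {site: set of samples} for sites with both 5' and 3' junction reads."""
    edges = defaultdict(set)
    samples = defaultdict(set)
    for read in reads:
        # Drop short-flank reads and junctions at already-known proviruses.
        if read.flank_bp < MIN_FLANK_BP or read.site in known_sites:
            continue
        edges[read.site].add(read.edge)
        samples[read.site].add(read.sample)
    # Assumption: support is pooled across samples when testing for two-sidedness.
    return {site: samples[site] for site, seen in edges.items() if seen == {"5", "3"}}


reads = [
    JunctionRead("g1", "c1:100", "5", 25),
    JunctionRead("g2", "c1:100", "3", 30),
    JunctionRead("g1", "c2:500", "5", 40),  # one-sided only
    JunctionRead("g1", "c3:9", "5", 5),     # flank too short
    JunctionRead("g1", "c3:9", "3", 50),
    JunctionRead("g1", "c4:7", "5", 20),    # known site, filtered out
    JunctionRead("g1", "c4:7", "3", 20),
]
hits = two_sided_sites(reads, known_sites={"c4:7"})
```

Keeping the set of supporting samples per site is what allows the later per-subspecies and per-individual tallies to be derived from the same structure.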

Table 1. Gorilla-specific HML-2 proviruses

Name    Coordinates (gorGor5)     Allele            ORF                  Age, MY  Source
1.140   CYUI01015057v1:132050     prov, pre         gag                  2.54*    Library
1.136   CYUI01015057v1:4140833    prov, pre                              0.96     Library
2A.32   CYUI01015158v1:2091742    prov, pre         gag                  0.64*    Library
2A.33   CYUI01015158v1:2931574    prov, pre         gag, pro, pol        0.47     Library
3.80    CYUI01015020v1:2986790    prov, pre         gag                  <0.3     Library
3.110   CYUI01014906v1:15912756   prov                                   NA       gorGor5
3.143   CYUI01014939v1:11491901   prov, pre         gag, env             <0.3     GAGP
3.188   CYUI01015043v1:1542506    prov              gag                  NA       gorGor5
4.100   CYUI01015141v1:3983164    prov, pre         gag, pro, pol        <0.3     Library
4.139   CYUI01015140v1:2825207    prov, pre         gag, pro, pol        0.32     GAGP
5.55    CYUI01015464v1:2722027    prov, pre         gag, pro, pol        0.32     Library
6.29    CYUI01015250v1:4896903    prov                                   2.25*    GAGP, gorGor5
6.59    CYUI01015849v1:88674      prov                                   NA       gorGor5
8.36    CYUI01014974v1:14320368   prov, pre         gag, pro, env        0.63     Library
8.46    CYUI01015266v1:1416776    prov, pre         gag, pro, env        <0.3     Library
9.14    CYUI01015156v1:4751052    prov, pre, solo   gag, pro, pol        0.32     GAGP
9.27    CYUI01014957v1:16607851   prov, pre         gag, pro             <0.3     Library
9.32    CYUI01014957v1:11182647   prov, pre         gag, pro, pol, env   <0.3     Library
9.76    CYUI01015269v1:5989888    prov, pre         gag, pro, pol        <0.3     Library
10.90   CYUI01015287v1:1891774    prov, pre         gag, env             <0.3     GAGP
10.135  CYUI01014981v1:11954484   prov, pre         gag, pro, env        <0.3     Library
11.72   CYUI01015190v1:6023137    prov, pre                              <0.3     Library
12.6    CYUI01015151v1:6327071    prov, pre         env                  NA       Library
12.26   CYUI01015117v1:5171464    prov, pre         gag, pro, pol        0.63     Library
12.30   CYUI01015117v1:1528455    prov, pre         gag, pro, pol        <0.3     Library
17.76   CYUI01015420v1:1207415    prov              gag                  0.63     gorGor5
19.9    CYUI01015260v1:5498638    prov                                   1.27*    gorGor5
19.11   CYUI01015260v1:2798010    prov, pre         gag, pro, pol        <0.3     Library
19.23   CYUI01015110v1:8706168    prov                                   1.93     gorGor5
22.8    CYUI01015765v1:799709     prov              env                  2.51*    gorGor5
X.83    CYUI01014915v1:18600100   prov                                   NA       gorGor5

The name of each provirus corresponds to its chromosome and megabase location in the gorGor4 genome build. The coordinates in the gorGor5 assembly are also provided. Known alleles for each locus are noted as provirus (prov), preintegration site (pre), or solo LTR (solo). Intact ORFs, age estimates, and the dataset each provirus was found in are also listed. A blank ORF entry indicates that no intact ORF is listed.
*Age estimated from a single LTR instead of the standard 5′–3′ LTR comparison method. NA, not applicable.
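The chromosome.megabase naming used in Table 1 can be illustrated with a short sketch. The function name `provirus_name` and the handling of a `chr` prefix are assumptions made for this example; only the naming rule itself and the chr9:32500000 case come from the text, and the other positions shown are invented to demonstrate the format.

```python
def provirus_name(chrom: str, position: int) -> str:
    """Build a 'chromosome.megabase' name from a gorGor4 chromosome label and position.

    Hypothetical helper: strips an optional 'chr' prefix and uses the
    whole-megabase interval containing the position.
    """
    label = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return f"{label}.{position // 1_000_000}"


# Example from the text: a provirus at chr9:32500000 is named 9.32.
example = provirus_name("chr9", 32_500_000)

# Illustrative positions only (not taken from the paper).
others = [provirus_name("chr2A", 33_100_000), provirus_name("X", 83_400_000)]
```

Because the megabase part is an integer rather than a decimal fraction, names such as 3.80 and 10.90 keep their trailing zero and are distinct from 3.8 and 10.9.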

Nomenclature for HML-2 Insertions. To distinguish among the thousand or so HML-2 proviruses (and solo LTRs) in the human genome, we have adapted the convention of naming specific proviruses by the chromosome band in which they are located (14, 35). In the absence of cytogenetic mapping of the gorilla genome, we adapted a related nomenclature convention. The chromosomal location of each newly found integration site was identified in the gorGor4 genome build (as the contigs in the gorGor5 assembly have not yet been assigned to chromosomes) and the provirus name was based on the chromosome number and megabase interval of each insertion, following the form chromosome.megabase (e.g., a provirus at chr9:32500000 would be named 9.32). Neighboring integrations within the same megabase interval will be given an additional letter corresponding to the order of discovery, though none was identified in this study.

PCR Validation and Frequency Estimations. The 24 nonreference insertions identified from the targeted library and 66 of the hits from the GAGP sample mining were validated with allele-specific PCR on genomic DNA from three Western lowland gorillas. Primers targeting the 5′ and 3′ genomic flanking DNA of each insertion were designed and used along with previously designed HML-2–specific primers (19, 22) to amplify both the insertion and the virus–host DNA genomic junction; 18/24 targeted library hits and 24/66 GAGP hits were confirmed in at least one sample. Four 2-LTR proviruses and 20 solo LTRs were identified from the GAGP, while all of the hits from the targeted library were 2-LTR proviruses, as expected. In total, we have confirmed 22 gorilla-specific 2-LTR proviruses and 20 solo LTRs not present in the gorGor5 reference genome, in addition to 9 gorilla-specific 2-LTR proviruses and 99 solo LTRs from the gorGor5 assembly, for a total of 150 gorilla-specific HML-2 insertions.

To estimate the frequency of the 2-LTR proviruses, we used the same allele-specific PCR assay on a larger panel of 10 Western lowland gorilla DNA samples, with the addition of primer sets for the gorilla-specific 2-LTR proviruses from the gorGor5 reference genome. The majority of the proviruses screened were insertionally polymorphic, with the exception of two proviruses identified in the reference genome which were homozygous for the insertion in all 10 samples screened. Many of the proviruses were present at quite low frequency, with over half present in less than 50% of the samples and nine found in only one sample (Fig. 1A and SI Appendix, Table S3).

In addition to PCR screening, we estimated the frequency of the 184 high-confidence hits across the 30 gorilla samples in the Great Ape Genome Project. There was a wide range of frequencies, with one hit being found in 100% of the samples, but the vast majority were found in less than 50% of the samples (Fig. 1B). Interestingly, 30 hits were found in only 1 of the 30 gorillas screened. Although it is likely that screening a larger sample would identify additional gorillas carrying some of these rare insertions, it is possible that some of them are unique to a single individual. Due to variation in sequence coverage and the low sensitivity of our in silico mining approach, these estimates are inherently noisy,

Fig. 1. Integration frequencies of 2-LTR proviruses and solo LTRs in gorillas. (A) PCR-validated frequencies for each 2-LTR integration across a panel of 10 individual gorilla DNA samples. (B) In silico estimated frequencies of the 184 high-confidence integration sites of 2-LTR proviruses and solo LTRs mined from the GAGP data across the 30 individual gorillas used for the study. A maximum of 30 distinct proviruses were found in a single gorilla, with the majority of integrations represented in less than 50% of the gorillas in the study. (C) Comparison of the PCR and in silico frequencies for the four novel 2-LTR proviruses identified from the GAGP dataset. The similarity of the frequencies seen for each integration validates the in silico estimation method.
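The in silico frequencies summarized in Fig. 1B amount to the percentage of screened samples in which a given insertion was detected. A minimal sketch of that bookkeeping is shown below; the function names and the example counts are illustrative assumptions, and the calculation is a simple presence/absence percentage rather than a genotype-based allele frequency.

```python
from collections import Counter


def carrier_frequency(samples_with_hit: int, samples_screened: int) -> float:
    """Percentage of screened samples in which an insertion was detected."""
    if samples_screened <= 0:
        raise ValueError("samples_screened must be positive")
    return 100.0 * samples_with_hit / samples_screened


def frequency_histogram(hit_counts, samples_screened):
    """Count loci per frequency value (rounded to 0.1%)."""
    return Counter(
        round(carrier_frequency(n, samples_screened), 1) for n in hit_counts
    )


# An insertion seen in 1 of 30 gorillas corresponds to about 3.3%.
single_carrier = round(carrier_frequency(1, 30), 1)

# Illustrative counts only: four loci seen in 1, 1, 6, and 30 of 30 samples.
histogram = frequency_histogram([1, 1, 6, 30], 30)
```

With 30 samples the possible values are multiples of 1/30, which is why a frequency axis for such data steps in increments of roughly 3.3%.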

and probably underestimate the true frequencies; however, for the four proviruses with both PCR and in silico frequency estimates, the frequencies match quite well (Fig. 1C).

Structure and Coding Capacity of Gorilla-Specific Proviruses. Using the same combination of primers used for genotyping, each nonreference insertion was amplified and sequenced. This analysis provided structural information about each insertion, such as putative ORFs, point mutations, large-scale deletions, and target-site duplications (which are characteristic of retroviral integration). Twenty-four of the 31 2-LTR proviruses discovered had a full-length ORF for at least one viral gene (Table 1), with one of the proviruses, 9.32, displaying full-length ORFs for all four retroviral genes without the presence of any apparent inactivating mutations. Testing of this provirus for its ability to encode infectious virus is ongoing. Fig. 2 shows the overall structure of each identified provirus, excluding solo LTRs. Full-length ORFs, large deletions, small insertions, and inactivating mutations are indicated in the schematic images. All full-length HML-2 proviruses in hominoids can be divided into two structural categories: type 1 proviruses containing a characteristic 292-bp deletion in env, and type 2, with the 292-bp region intact. As in humans (36, 37), approximately half of the proviruses identified are type 1 and half are type 2. Interestingly, nine of the gorilla proviruses shared another (1,492-bp) deletion (not seen in humans), which is found in both type 1 and type 2 proviruses, and which removes a large fraction of pro and pol, presumably rendering them completely incapable of infection without the aid of a replication-competent helper virus. Also, two proviruses share a third, 1,372-bp, deletion in env (Fig. 2). Like the type 1 deletion, both of these novel shared deletions are frameshifting, although the significance of this observation is unclear.

Phylogenetic Analysis. We generated a neighbor-joining phylogeny of gorilla-specific insertions, using the LTR sequences of all 2-LTR proviruses identified (Fig. 3). For comparison, we also included the sequences of all of the 2-LTR proviruses known from humans and a small selection of chimpanzee-specific LTRs extracted from the chimpanzee reference genome (panTro3) (32).

As expected, most of the species-specific LTRs clustered in separate clades, with the human and chimp insertions clustering as sister clades and the gorilla-specific insertions forming an outgroup to chimpanzees and humans. Strikingly, the gorilla-specific clade had noticeably shorter branch lengths than the human- or chimpanzee-specific clades, indicating much less sequence divergence in the gorilla-specific proviruses; indeed, two subclades include multiple sequences with identical LTRs. This low level of divergence suggested to us that many of these insertions were very young, and that the HML-2s have been active in gorillas much more recently than in humans or chimpanzees.

To investigate this possibility more thoroughly, we first used the 5′–3′ LTR comparison method to date those proviruses with two LTRs (13, 14). The mean age of the gorilla-specific proviruses was ∼380,000 y, with individual proviruses ranging in age from 1.93 million y (MY) to <300,000 y. Twelve proviruses had identical LTRs and thus we could only estimate that they integrated at most ∼300,000 y ago, the average time for a single-nucleotide difference to arise between proviral LTRs. As these proviruses could have integrated any time between 300,000 y ago and the present, we used the midpoint of 150,000 y for these proviruses for calculating the mean provirus age. Although 5′–3′ LTR divergence is the preferred method to date individual proviruses, the limited number of 2-LTR proviruses in the genome constrains the usefulness of this method for estimating the actual level of HERV activity over time. As in humans, however, there are about 10 times as many HML-2 solo LTRs (as well as truncated proviruses with only a single LTR) as there are 2-LTR proviruses in the gorilla genome, and we wanted to use this abundance of sequence information to get a better sense of the levels of HML-2 activity over time in gorillas. To estimate the insertion times of solo LTRs, we modified our 5′–3′ LTR comparison method to use the divergence between each LTR and the nearest node on a neighbor-joining phylogeny. We previously had used the divergence from a consensus LTR sequence as a clock (14); however, this method assumes that all LTRs are

Fig. 2. Novel HML-2 proviruses identified in gorillas. Schematic representations of type 1 (B) and 2 (A) proviruses, or an unidentifiable type (C). Gorilla-specific proviruses (excluding solo LTRs) are divided by type, with the common type 1 292-bp deletion at the pol–env junction shown for (B) type 1 and the full-length viral genes for (A) type 2. Lighter-shaded colors for a given viral gene indicate disruption of the ORF. The provirus at 9.32 displays full-length ORFs for all four retroviral genes, while six of the type 2 and three of the type 1 proviruses contain a shared 1,492-nt deletion between pro and pol (†). In addition, another deletion shared among multiple proviruses is marked by ‡, and there is one shared nonsense mutation marked by ♢.
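The 5′–3′ LTR comparison used for dating can be sketched in a few lines. The sketch assumes that the two LTRs are already aligned to equal length, that every mismatched column (including gaps) counts as one difference, and that age scales linearly with the number of differences at roughly 300,000 y per difference; the linear scaling and the function names are assumptions for illustration, while the ∼300,000 y figure and the 150,000 y midpoint for identical LTRs come from the text.

```python
YEARS_PER_DIFFERENCE = 300_000  # approx. time for one difference to arise between proviral LTRs (from the text)


def count_differences(ltr5: str, ltr3: str) -> int:
    """Number of mismatched columns between two pre-aligned LTR sequences."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(ltr5.upper(), ltr3.upper()))


def provirus_age_years(ltr5: str, ltr3: str) -> float:
    """Rough integration age from 5'-3' LTR divergence (linear approximation)."""
    n_diff = count_differences(ltr5, ltr3)
    if n_diff == 0:
        # Identical LTRs: integration happened between 300,000 y ago and the
        # present, so the midpoint is used when averaging ages.
        return YEARS_PER_DIFFERENCE / 2
    return n_diff * YEARS_PER_DIFFERENCE


# Toy sequences, far shorter than a real LTR.
identical = provirus_age_years("ACGTACGT", "ACGTACGT")
one_diff = provirus_age_years("ACGTACGT", "ACGTACGA")
```

The method works because both LTRs of a provirus are copied from the same template at integration and are therefore identical at that moment, so any differences between them have accumulated in the host genome since.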

identical in sequence at the time of integration and ignores differences accumulating during viral replication, leading to a bias toward older ages. Using the nearest node instead of a consensus should significantly reduce this bias, although it will still somewhat overestimate the time since integration due to mutations potentially accumulating during replication as a virus.

Using the same methods as we used to identify species-specific insertions in the gorilla reference genome, we downloaded and determined the species specificity of HML-2 LTR5Hs solo LTRs in the latest human and chimpanzee reference genomes (hg38 and panTro5, respectively); to this list we added the sequences of all known nonreference human-specific LTR5Hs LTRs. For each species, we made a neighbor-joining tree of all of the LTR sequences from that species and estimated the age of each LTR based on the branch distance to the nearest node on the tree (SI Appendix, Fig. S2). As previously, proviruses with identical 5′ and 3′ LTRs were assigned an age of 150,000 y. Similarly, solo LTRs with no sequence differences from their nearest neighbor were assigned an age of 300,000 y.

To test the accuracy of this method compared with 2-LTR estimates, we used the same approach with the 5′ and 3′ LTRs of 2-LTR proviruses, using the distance to the nearest LTR not belonging to the same provirus. The ages for proviruses calculated in this manner were indeed greater than the ages for the same proviruses calculated by 2-LTR comparison, though the difference was only statistically significant for nonhuman-specific proviruses (Fig. 4A). The ages calculated with this method did correspond to the expected age ranges based on species specificity and the 2-LTR method can produce young biased ages due to gene conversion (38, 39), so it seemed likely to us that the age estimates are roughly accurate (Fig. 4B). We applied these dating methods to our full dataset of human-, gorilla-, and chimpanzee-specific solo LTRs and proviruses, as well as those shared between these species (Fig. 4C). The chimpanzee-specific insertions were significantly older than the human-specific and gorilla-specific ones, with no insertions younger than 2 million y, and most insertions much older (Kolmogorov–Smirnov, P < 0.0001). As expected, the gorilla-specific proviruses are significantly younger than the human-specific proviruses (Fig. 4D); when we looked only at the ages of LTRs belonging to the clade containing most of the youngest insertions, they had a mean age of ∼0.9 MY, ∼1 MY younger than the clade containing the youngest human-specific insertions (Kolmogorov–Smirnov, P < 0.0001).

Discussion

HML-2 proviruses have been found across the catarrhine primates, with integration events beginning ∼35 million y ago. Though some research has been published on their distribution in nonhuman primates, they remain relatively unstudied. The present work aimed to identify and characterize novel HERV-K (HML-2) elements in the gorilla genome to provide a more complete view of the group, assess their activity during the evolution of closely related species, and investigate their potential for current infectious activity. Previously available gorilla genome builds had very unreliable assembly of HML-2 elements, with artifactual pseudolog proviruses placed at the sites of

Fig. 3. Phylogenetic relationship of species-specific HML-2 LTRs among related primates. (A) Simplified phylogeny showing the relationships between the LTR5A (cyan), LTR5B (red), and LTR5Hs HML-2 groups, with some subclades represented as triangles. Gorilla- (magenta), chimp- (green), and human- (blue) specific subclades within the LTR5Hs group are also shown. (B) Neighbor-joining tree with 1,000 bootstrap replicates, using the LTR sequences of all species-specific 2-LTR proviruses from gorillas (magenta) and humans (blue), plus shared LTR5Hs proviruses and a selection of chimp-specific HML-2 LTRs for reference (green). Bootstrap values for each node are indicated by colored circles. (C) A detailed view of the gorilla-specific clade highlighting its shorter branch lengths compared with the chimpanzee and human clades, indicating much less sequence divergence among the gorilla-specific viruses. Some subclades within the gorilla-specific group contain multiple proviruses with identical LTR sequences, indicated by the presence of zero-length branches. Type 1 proviruses are indicated by *, and proviruses containing the 1,492-bp deletion are indicated by †.
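The idea of dating an LTR from its closest relative, rather than from a consensus sequence, can be illustrated with a simplified stand-in. The sketch below does not build a neighbor-joining tree or measure branch distance to the nearest node; instead it finds, for each LTR, the most similar LTR from a different locus by raw mismatch count, which mirrors the validation step in which the nearest LTR not belonging to the same provirus is used. The function names, the input layout, and the toy sequences are assumptions for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Mismatch count between two pre-aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_other_locus(ltrs):
    """Map each LTR name to (nearest LTR name, differences), ignoring its own locus.

    ltrs: {name: (locus, aligned_sequence)} -- hypothetical input layout.
    Ties are broken by name so the result is deterministic.
    """
    nearest = {}
    for name, (locus, seq) in ltrs.items():
        candidates = [
            (hamming(seq, other_seq), other_name)
            for other_name, (other_locus, other_seq) in ltrs.items()
            if other_locus != locus  # skip the partner LTR of the same provirus
        ]
        if candidates:
            distance, other = min(candidates)
            nearest[name] = (other, distance)
    return nearest


# Toy data: locus A is a 2-LTR provirus; B and C are solo LTRs.
ltrs = {
    "A_5": ("A", "ACGTAC"),
    "A_3": ("A", "ACGTAC"),
    "B": ("B", "ACGTAT"),
    "C": ("C", "TCGTAT"),
}
nearest = nearest_other_locus(ltrs)
```

Excluding the partner LTR matters because the two LTRs of one provirus are identical at integration, so comparing them to each other would reproduce the 5′–3′ estimate instead of testing the single-LTR approach.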

known human-specific HML-2 insertions, as well as gaps in the assembly at sites of genuine gorilla-specific insertions. The current gorilla genome reference, although improved by long-read sequencing technology, still has some assembly errors at HML-2 insertions, and does not capture the full diversity of gorilla-specific insertions, as many are present at very low frequency.

Though the gorilla sequences cluster separately from human and chimpanzee HML-2s, they are still extremely similar on a sequence level. Compared with the HERV-Kcon consensus human-specific HML-2 provirus (25), a consensus sequence of the gorilla-specific proviruses identified here is >98% identical at both the nucleotide and the amino acid levels. By contrast, HIV-1 isolates of the same subtype can differ by as much as 30% at the amino acid level from one individual to the next (40). Known functional elements are all highly conserved, and it seems likely that human and gorilla HML-2s share most of their biology, though some level of host-specific adaptation cannot be ruled out at this time.

In total, we have identified 150 gorilla-specific HML-2 insertions from screening 34 individuals, including 31 full-length, or nearly full length, 2-LTR proviruses. Forty-two of these insertions are not present in the gorGor5 reference assembly, and at least 47 are insertionally polymorphic among the few samples we have studied. For comparison, ∼150 human-specific solo and 2-LTR HML-2 insertions have been identified in total, after much more extensive screening, of which ∼50 are insertionally polymorphic, based on similar data mining on data from about 2,500 individuals in the 1000 Genomes Project. Our current list of gorilla proviruses is likely to be a significant underestimate of the total. We have identified a large number of additional putative insertions from the GAGP data that we have not yet confirmed experimentally, including 159 sites with sequence reads matching both the 5′ and

6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1814203116 Holloway et al.

[Fig. 4 graphic not reproduced. Panels A–D are dot plots with x axis "age (MY)", spanning 0–40 in A–C and 0–5 in D. Group labels recoverable from the graphic: 5′–3′ vs. single LTR; solo LTR vs. provirus; human specific, human-chimp specific, shared; gor spec, hum spec, chimp spec, gor shared, hum shared, chimp shared, gor young, hum young.]

Fig. 4. Age distribution of HML-2 insertions in humans, chimpanzees, and gorillas. Neighbor-joining trees of LTR5Hs LTRs from gorillas, humans, and chimpanzees were generated and ages were calculated as indicated. Each dot represents the age of one insertion. (A) Ages of human proviruses as calculated by 5′–3′ LTR comparison compared with the same provirus ages calculated using the solo LTR method. ns, not significant. (B) Comparison of ages of solo LTRs and proviruses in humans. “Human-specific” loci are found only in humans, “human-chimp”–specific are found only in humans and chimps, and “shared” are also found in gorillas. The dashed red and blue lines, respectively, mark the estimated times of divergence of chimpanzees and gorillas from humans. (C) Ages of solo LTRs and proviruses from gorillas, chimps, and humans. Human-specific and chimp-specific groups include orthologs found in both chimps and humans, but not gorillas; “gor young” and “hum young” are monophyletic clades containing the most recently integrated gorilla-specific and human-specific insertions, respectively, as labeled in SI Appendix, Fig. S2. P values were calculated with the Kolmogorov–Smirnov test. ns, not significant at P = 0.05. (D) Same as C, except the timescale is expanded to show only the most recent 5 million y.
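The Kolmogorov–Smirnov comparison named in the legend can be illustrated with a minimal sketch. It computes only the two-sample test statistic (the largest gap between two empirical cumulative distributions), not a P value, and it is not the authors' analysis code. The function name `ks_statistic` and the age values in the usage lines are hypothetical, invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical insertion ages in million y, NOT data from this study.
group_one_ages = [0.15, 0.15, 0.3, 0.6, 1.2]
group_two_ages = [0.3, 0.9, 1.5, 2.4, 3.1]
print(ks_statistic(group_one_ages, group_two_ages))
```

A statistic near 0 means the two age distributions overlap closely; a value of 1 means they do not overlap at all. Converting the statistic to a P value requires the sample sizes and is left to a statistics library.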

3′ LTR edges, and over 1,800 sites with reads matching only one LTR edge. It is likely that many of the single-edge hits are artifacts; however, we have confirmed some of these hits by PCR, and this list likely includes many truncated proviruses. We have higher confidence in the two-sided hits, as they all have the expected 4- to 7-bp target-site duplication characteristic of HML-2 integrations, and most of them were found in more than one individual. It is also likely that many of the library hits with amplicon numbers below our cutoff for further analysis also represented real proviruses. We had a limited number of gorilla DNA samples to test, and were unable to obtain DNA from the gorillas sequenced by the GAGP; it is likely that PCR screening of additional individuals would validate many of the unconfirmed or low-frequency hits. Additionally, the samples used for the GAGP were skewed heavily toward Western lowland gorillas, accounting for 26 of the 30 samples, which limits the interpretation and analysis of integration patterns across all gorillas.

Phylogenetic analysis of the LTR sequences from the gorilla-specific insertions showed that almost all of them clustered in a single clade, separate from and basal to the clades of known human- and chimpanzee-specific sequences. The phylogeny supports a single shared origin of the three groups of species-specific HML-2s in the common ancestor of gorillas, humans, and chimpanzees and subsequent codivergence of each group with its host. The branch lengths of the gorilla-specific clade are noticeably shorter than those in the human-specific clade, and molecular clock analysis based on LTR divergence confirms that the gorilla-specific clade is significantly younger than the human-specific clade. The estimated mean time since integration of the gorilla-specific proviruses is ∼3 MY, compared with ∼3.7 MY for human-specific integrations. The age difference between 2-LTR proviruses alone is even more striking, with a mean age of ∼380,000 y for gorillas and ∼950,000 y for humans. A noticeable feature of the gorilla clade is the presence of a large number of proviruses with identical 5′ and 3′ LTRs, many of which are also identical across multiple proviruses. While a few proviruses with identical 5′ and 3′ LTRs are known in humans, they do not cluster closely together phylogenetically, in contrast to the low-divergence gorilla sequences, which form a well-supported single clade, suggesting a much more recent common evolutionary origin.

These age estimates are limited in their resolution by the lack of sequence divergence in the gorilla sequences. For proviruses with identical LTRs, we can only give a maximum age limit of ∼300,000 y, and it is possible some of them integrated much more recently. To account for this uncertainty, we assigned all such proviruses an estimated age of 150,000 y. It should also be noted that the age estimates in this study use a mutation rate based on a chimpanzee–human divergence time of ∼6 MY. The true divergence time of chimpanzees and humans is still a matter of some debate (30, 41, 42), and if the split took place significantly earlier or later than 6 MY, these age estimates would have to be adjusted accordingly. Although the solo LTR age estimates are in rough agreement with the 2-LTR estimates, they are significantly older, which is likely in part due to sequence differences accumulated during replication of the virus before integration. However, it is known that the 2-LTR method can also produce ages that are biased young due to inter-LTR gene conversion leading to elimination of sequence differences; indeed, several nonhuman-specific proviruses were excluded from this analysis due to evidence of such gene conversion leading to anomalously young age estimates. With a few exceptions, the age estimates produced by both methods correspond fairly well to the expected age ranges based on species specificity, with species-specific insertions giving dates younger than the species divergence times, and shared insertions dating older. Notably, the ages of insertions found in both humans and chimpanzees, but not in gorillas, are

clustered around ∼6 million y ago, matching what we would expect if they integrated in the relatively short period between the split of gorillas from humans and the split of chimpanzees from humans.

Recent HML-2 replication in gorillas is also supported by the large numbers of insertionally polymorphic proviruses. We used two separate approaches to estimate the population frequency of the gorilla-specific proviruses. Of the 31 2-LTR proviruses identified, all but 2 were shown to be insertionally polymorphic by PCR, with allele frequencies in the 10 samples screened ranging from 5 to 90%. Nine proviruses were only found in a single gorilla DNA sample. A similar overall pattern was seen with the GAGP data. In this dataset, we were limited to determining the presence or absence of each insertion, rather than genotypes. As with the PCR screening, a large fraction of the hits were present in only one individual; 30 of the 184 two-sided hits were present in only 1/30 gorillas screened. While it is likely that this analysis underestimates the true frequency of these insertions, it is possible that some of them may represent de novo integrations unique to a single individual. Future studies of parent–child trios could test this idea conclusively. One hundred and thirty of the insertions were only found in one of the gorilla species investigated, and thus may have integrated after the split between Western and Eastern gorillas ∼200,000 y ago (30).

Infection by HML-2 proviruses also requires the presence of functional ORFs for gag, pro, pol, and env. Our investigation into the structure of these novel gorilla-specific proviruses reveals that 24 have retained at least one full-length ORF. One provirus retained full-length ORFs for all retroviral genes, and could potentially be replication-competent. The proviruses can also be categorized on the basis of common deleted regions. Type 1 proviruses share a common 292-bp deletion at the pol–env junction that inactivates the two genes. Those proviruses that retain this region are called type 2. The proviruses identified in this study are approximately evenly divided between these two types, as are human proviruses (14). Nine of the proviruses identified share another deleted region located around the pro–pol junction that also likely renders them incapable of infection or retrotransposition without a helper provirus to provide the missing function. Like the type 1 proviruses, these proviruses do not appear to have diverged in sequence from the wild-type genomes, and are not focused in a single subclade, but rather are distributed across proviruses comprising the most recent node and thus presumably are replicated by copackaging and recombination with replication-competent HML-2 viruses and with one another. It is curious that these defective genomes seem to be maintained in the viral population, despite the absence of obvious benefit to the virus. The oldest type 1 proviruses predate the human–orangutan split (14), and have persisted in both the human- and gorilla-specific HML-2 lineages until at least the last few hundred thousand years, suggesting that they have been maintained for more than 10 million y of viral evolution.

Taken together, the high copy number, high levels of polymorphism, low sequence divergence, and presence of well-preserved proviruses strongly suggest that HML-2 has been active much more recently in gorillas than in humans, and could potentially still be infecting gorillas today.

Materials and Methods

Great Ape Genome Project Data. Gorilla whole-genome sequence data were obtained from Great Ape Genome Project samples (43), including 30 individuals. Three subspecies were sampled: 26 Western lowland samples (G. gorilla gorilla), 1 Cross River (G. gorilla diehli), and 3 Eastern lowland (G. beringei graueri). GAGP data were downloaded from the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) website, accession no. SRP018689, in unaligned FASTQ format. The gorilla reference genomes gorGor4 and gorGor5 were downloaded from the UCSC Genome Browser site (32).

Gorilla DNA Samples. Five samples were originally from the Coriell Institute, and their Coriell ID numbers are provided, along with names and studbook numbers where available. The five samples from the Hahn laboratory do not have Coriell IDs but their names and studbook numbers are provided. Michael Jensen-Seaman (Duquesne University, Pittsburgh, PA) provided samples PR00301 (“Shango,” studbook 1123), PR00622 (“Chipua,” studbook 1419), and PR00671 (“Billy,” studbook 1148). All gorillas were captive at birth. Fred Gage (Salk Institute, La Jolla, CA) provided samples PR00053 and PR00075. These were fibroblast cell lines originally held by Coriell. Beatrice Hahn (University of Pennsylvania, Philadelphia, PA) provided the samples “Amare” (studbook T1201), “Azizi” (studbook 1750), “Bahati” (studbook 1142), “Bana” (studbook 1370), and “Susie” (studbook T1193). All gorillas were captive at birth and DNA originated from blood samples.

DNA Library Construction and Sequencing. Whole-genome amplified total genomic DNA samples from three separate captive individuals, PR00301 (Shango), PR00622 (Chipua), and PR00671 (Billy), were used to create Illumina sequence libraries. To target HML-2 proviruses, we employed a specific PCR enrichment protocol adapted from that employed by Maldarelli et al. (29) to identify HIV-1 integration sites. DNA was subjected to sonication using a Covaris (M220) sonicator, which allows for shearing of DNA to ∼1,500 bp. Although shearing is a random process, provirus-derived fragments that happen to retain the 5′ LTR flanked by host DNA on the 5′ side and the gag leader untranslated region (UTR) sequence on the 3′ end of the LTR can be specifically amplified using nested linker-mediated PCR. The procedure uses an asymmetric linker added to the sheared fragments to mediate the first round of PCR amplification. To avoid amplifying solo LTRs, primers targeting the linker in combination with an HML-2–specific primer (5′-CGTCGACTTGTCCTCAATGACCACGCT-3′) targeting the UTR were used in a first round of PCR to enrich the sample with full-length HML-2 proviral integration sites. A second round of PCR amplification allowed for further enrichment of the 5′ flanking genomic DNA using an HML-2 primer designed to anneal five bases from the edge of the 5′ LTR (5′-CTGATCTCTCTTGCTTTTCC-3′) while also adding barcode information. Samples were mixed, purified using magnetic beads (Omega Bio-Tek), and run on the Illumina MiSeq platform to produce sequence read data in unaligned FASTQ format.

Gorilla Sequence Alignments. All FASTQ files were trimmed to remove low-quality sequence, short reads (<50 bp), and unpaired reads using Trimmomatic (44). Trimmed sequences were aligned to the gorGor5 build of the gorilla reference genome using Bowtie 2 (45). Default settings for Bowtie 2 were used with the exception of the “very-fast” option. Alignments were output in SAM format.

HML-2 Discovery. To identify proviruses in the GAGP samples, unmapped reads were retrieved from SAM files using SAMtools (46) and searched for sequence that precisely matched the HML-2 LTR edges using custom scripts. For the 5′ LTR edge, the following sequences and their reverse complements were searched for: TGTGGGGAAAAGCAAGAGA, TGTGGGGAAAAGAAAGAGA, and TGTGGGGAGAAGCAAGAGA. For the 3′ LTR edge, the following sequences and their reverse complements were searched: GGGGCAACCCACCCCTACA, GGGGCAACCCACCCCTTCA, and GGGGCAAGCCATCCCTTCA. Sequences with <10 bp of non-LTR flanking sequence were excluded. Reads matching known HML-2 junctions present in the human (hg19) or gorilla (gorGor5) reference genomes were removed. The LTR portion of the remaining reads was removed, and the remaining flanking sequences were aligned to the reference gorilla genome using the UCSC BLAST-Like Alignment Tool (BLAT) (47). The highest-scoring BLAT hit for each read was taken, and hits with overlapping coordinates were merged using BEDTools (48) to provide an initial list of putative integration sites. This list was further narrowed to a list of high-confidence hits based on the presence of reads derived from both the 5′ and 3′ LTR edges, and the presence of a 4- to 7-bp overlap between the 5′ and 3′ flanks corresponding to the target-site duplication.

The SAM files generated from our targeted-sequencing libraries were further processed using a custom pipeline to filter out duplicate and low-quality reads. Coordinates for all annotated HML-2 LTR5Hs and internal sequence (HERVK-int) insertions were downloaded from the RepeatMasker track of the gorilla (gorGor5 build) reference genomes on the UCSC Genome Browser. These loci were used to separate our read data into lists of targeted known and unknown sites using BEDTools. Reads corresponding to the same site were merged to yield a final BED file containing all putative novel HML-2 integration sites for a single sample.

Validation and Sequencing of HML-2 Insertions. All integration sites of interest were validated with allele-specific PCR using 100 ng of whole-genome

amplified gorilla genomic DNA. Primers corresponding to the 5′ and 3′ flanking genomic DNA for a site were designed and used to detect either the empty site or solo LTR alleles. A separate PCR was run to infer a 2-LTR allele using a primer situated in the HML-2 5′ UTR paired with a flanking primer. Capillary sequencing was performed on at least one positive sample. 2-LTR provirus alleles were amplified in overlapping fragments from a single sample and sequenced to ≥3×, and a consensus was then constructed with the read traces from each site. Reconstructed novel loci were named by finding the corresponding site of the DNA flanking the 5′ end of a provirus via the BLAT tool to the previous gorGor4 gorilla genome build, which is organized by chromosome.

Identification of HML-2 LTR5Hs Insertions. Our list of nonreference integrations was supplemented with known HML-2 insertions. To identify human-, chimpanzee-, and gorilla-specific insertions from their respective genome assemblies, as well as insertions specific to the human–chimpanzee clade and insertions with orthologs in all three species, coordinates of all annotated HML-2 LTR5Hs and LTR5 insertions were downloaded from the RepeatMasker track of the human (hg38 build), gorilla (gorGor5 build), chimpanzee (panTro5 build), and orangutan (ponAbe2 build) reference genomes on the UCSC Genome Browser. Sequences <900 bp were excluded. Solo LTRs and 2-LTR proviruses were separated by the presence of internal proviral sequence annotated as HERVK-int by RepeatMasker. Species-specific integrations were identified using BEDTools to filter out sites present in at least one of the other three species; coordinates were converted from one genome to another as necessary using the UCSC Genome Browser’s liftOver genome conversion tool. The same approach was used to identify loci present in both humans and chimpanzees, but not gorillas. Lastly, the remaining sites from the hg38, gorGor5, and panTro5 genomes were filtered with BEDTools to remove any sites not present in all three genomes to generate a set of shared orthologous insertions.

Molecular Clock Age Estimation. The 5′–3′ LTR divergence was used to estimate the time since insertion for 2-LTR proviruses, normalized to a neutral mutation rate of 0.34% per million y, with lower and upper bounds of 0.24 to 0.45% per million y, based on previously described methods (13, 14). Insertion times for solo LTRs and truncated proviruses with a single LTR were estimated based on the divergence between each sequence and its nearest neighboring sequence on a phylogenetic tree. Separate neighbor-joining trees of LTR5Hs LTRs found in gorillas, humans, and chimpanzees were created. For 2-LTR proviruses, the 5′-to-3′ LTR sequence distances were divided by a sequence divergence rate of 0.34% per million y, as previously calculated from the divergence of orthologous proviruses in humans and chimpanzees. Insertion times of integrations with a single LTR were calculated similarly, using their distance from the nearest node in the phylogeny as a proxy for 5′–3′ LTR divergence. As this distance is only half the divergence to the nearest neighboring LTR, we multiplied each distance by 2 before applying the same aging algorithm we use for 5′–3′ LTR comparisons.

The minimum age that can be estimated using this method is constrained by the average time required for one LTR mutation to occur (300,000 and 600,000 y for 2-LTR proviruses and solo LTRs, respectively). Thus, 2-LTR proviruses and solo LTRs with zero sequence divergence were assigned ages of 150,000 and 300,000 y, respectively, as they could have integrated any time between the present and their maximum possible age. 2-LTR proviruses with discrepant LTR phylogenies due to gene conversion or segmental duplications were excluded from the age analysis. Additionally, some sequences with unusually long branch lengths were excluded, as they appeared to be the result of sequencing artifacts, or in one case the result of APOBEC3-mediated hypermutation.

ORF Identification. Full-length and nearly full length ORFs were identified for all 2-LTR proviruses using the NCBI ORF Finder tool (49). The presence or absence of gag, pro, pol, env, and rec ORFs was determined for each provirus, using sequences from the HERV-Kcon consensus genome as a reference (25). Genes were classified as full-length if at least 95% of the ORF remained free of nonsense or frameshift mutations. Peptide sequences of predicted proteins were queried against the UniProtKB/SwissProt protein database using BLASTP to confirm their identity as HERV-K Gag, Pro, Pol, Env (50).

Quantification and Statistical Analysis. Evolutionary analysis was conducted by aligning sequences using MUSCLE (51). Neighbor-joining trees were generated using MEGA7 (52). The Kimura 2 parameter model was used for branch-length estimation, with an α of 2.5 and deletions treated pairwise (53). Support for trees was assessed using 1,000 bootstrap replicates. P values for provirus age distributions were calculated with the Kolmogorov–Smirnov test, as it seemed likely that the shape of the underlying frequency distributions in different species would be different, which might cause problems with the Mann–Whitney test, which assumes all samples have the same frequency distribution shape.

Data Availability. Proviruses and solo LTRs: GenBank accession nos. MH678754 to MH678803. Flanking genomic DNA: GenBank accession nos. MH684412 to MH684461.

ACKNOWLEDGMENTS. We thank Beatrice Hahn (UPenn), Fred Gage (Salk Institute), and Michael Jensen-Seaman (Duquesne University) for generously providing us with gorilla sample DNA, and Ravi Subramanian and Julia Wildschutte for helpful discussion and assistance in acquiring gorilla samples. This work was made possible by Research Grant R35CA200421 from the NIH.
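The arithmetic described under Molecular Clock Age Estimation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' aging algorithm: the function and constant names are hypothetical, distances are assumed to be supplied as percent divergence, and only the values quoted in the text (0.34% per million y, bounds of 0.24 to 0.45%, and assigned ages of 150,000 and 300,000 y at zero divergence) are used.

```python
# Values quoted in the text; all names below are hypothetical.
NEUTRAL_RATE = 0.34            # % sequence divergence per million y
RATE_BOUNDS = (0.24, 0.45)     # lower and upper rate bounds, % per million y
ZERO_DIVERGENCE_AGE_MY = {"2ltr": 0.15, "solo": 0.30}  # assigned ages, million y


def estimate_age_my(distance_pct, kind="2ltr", rate=NEUTRAL_RATE):
    """Estimate time since insertion in million y.

    For kind="2ltr", distance_pct is the 5'-3' LTR divergence.
    For kind="solo", distance_pct is the distance to the nearest node in
    the tree, which is doubled before the rate is applied.
    """
    if kind not in ZERO_DIVERGENCE_AGE_MY:
        raise ValueError("kind must be '2ltr' or 'solo'")
    if distance_pct == 0:
        # No informative differences: use the assigned age from the text.
        return ZERO_DIVERGENCE_AGE_MY[kind]
    divergence_pct = 2 * distance_pct if kind == "solo" else distance_pct
    return divergence_pct / rate


def age_range_my(distance_pct, kind="2ltr"):
    """Return (youngest, oldest) ages implied by the quoted rate bounds."""
    slow_rate, fast_rate = RATE_BOUNDS
    return (estimate_age_my(distance_pct, kind, fast_rate),
            estimate_age_my(distance_pct, kind, slow_rate))


print(estimate_age_my(0.34))               # 2-LTR provirus with 0.34% divergence
print(estimate_age_my(0.17, kind="solo"))  # solo LTR, node distance 0.17%
print(age_range_my(0.34))
```

Under these assumptions a faster rate yields a younger age, so the upper rate bound gives the lower age bound and vice versa.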

1. Bannert N, Kurth R (2004) Retroelements and the human genome: New perspectives on an old relation. Proc Natl Acad Sci USA 101:14572–14579.
2. Magiorkinis G, Blanco-Melo D, Belshaw R (2015) The decline of human endogenous retroviruses: Extinction and survival. Retrovirology 12:8.
3. Jern P, Coffin JM (2008) Effects of retroviruses on host genome function. Annu Rev Genet 42:709–732.
4. Nelson PN, et al. (2003) Demystified. Human endogenous retroviruses. Mol Pathol 56:11–18.
5. Belshaw R, Katzourakis A, Paces J, Burt A, Tristem M (2005) High copy number in human endogenous retrovirus families is associated with copying mechanisms in addition to reinfection. Mol Biol Evol 22:814–817.
6. Löwer R, Löwer J, Kurth R (1996) The viruses in all of us: Characteristics and biological significance of human endogenous retrovirus sequences. Proc Natl Acad Sci USA 93:5177–5184.
7. Young GR, et al. (2012) Resurrection of endogenous retroviruses in antibody-deficient mice. Nature 491:774–778.
8. Hohn O, Hanke K, Bannert N (2013) HERV-K(HML-2), the best preserved family of HERVs: Endogenization, expression, and implications in health and disease. Front Oncol 3:246.
9. Ruprecht K, et al. (2008) Human endogenous retrovirus family HERV-K(HML-2) RNA transcripts are selectively packaged into retroviral particles produced by the human germ cell tumor line Tera-1 and originate mainly from a provirus on chromosome 22q11.21. J Virol 82:10008–10016.
10. Blomberg J, Benachenhou F, Blikstad V, Sperber G, Mayer J (2009) Classification and nomenclature of endogenous retroviral sequences (ERVs): Problems and recommendations. Gene 448:115–123.
11. Belshaw R, et al. (2004) Long-term reinfection of the human genome by endogenous retroviruses. Proc Natl Acad Sci USA 101:4894–4899.
12. Turner G, et al. (2001) Insertional polymorphisms of full-length endogenous retroviruses in humans. Curr Biol 11:1531–1535.
13. Hughes JF, Coffin JM (2004) Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: Implications for human and viral evolution. Proc Natl Acad Sci USA 101:1668–1672.
14. Subramanian RP, Wildschutte JH, Russo C, Coffin JM (2011) Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses. Retrovirology 8:90.
15. Jha AR, et al. (2011) Human endogenous retrovirus K106 (HERV-K106) was infectious after the emergence of anatomically modern humans. PLoS One 6:e20234.
16. Mayer J, et al. (1999) An almost-intact human endogenous retrovirus K on human chromosome 7. Nat Genet 21:257–258.
17. Stoye JP (2012) Studies of endogenous retroviruses reveal a continuing evolutionary saga. Nat Rev Microbiol 10:395–406.
18. Belshaw R, et al. (2005) Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): Implications for present-day activity. J Virol 79:12507–12514.
19. Barbulescu M, et al. (1999) Many human endogenous retrovirus K (HERV-K) proviruses are unique to humans. Curr Biol 9:861–868.
20. Wildschutte JH, Ram D, Subramanian R, Stevens VL, Coffin JM (2014) The distribution of insertionally polymorphic endogenous retroviruses in breast cancer patients and cancer-free controls. Retrovirology 11:62.
21. 1000 Genomes Project Consortium; Auton A, et al. (2015) A global reference for human genetic variation. Nature 526:68–74.
22. Wildschutte JH, et al. (2016) Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc Natl Acad Sci USA 113:E2326–E2334.
23. Magiorkinis G, Belshaw R, Katzourakis A (2013) ‘There and back again’: Revisiting the pathophysiological roles of human endogenous retroviruses in the post-genomic era. Philos Trans R Soc Lond B Biol Sci 368:20120504.
24. Bhardwaj N, Montesion M, Roy F, Coffin JM (2015) Differential expression of HERV-K (HML-2) proviruses in cells and virions of the teratocarcinoma cell line Tera-1. Viruses 7:939–968.
25. Lee YN, Bieniasz PD (2007) Reconstitution of an infectious human endogenous retrovirus. PLoS Pathog 3:e10.
26. Dewannieux M, et al. (2006) Identification of an infectious progenitor for the multiple-copy HERV-K human endogenous retroelements. Genome Res 16:1548–1556.
27. Bannert N, Kurth R (2006) The evolutionary dynamics of human endogenous retroviral families. Annu Rev Genomics Hum Genet 7:149–173.
28. Macfarlane CM, Badge RM (2015) Genome-wide amplification of proviral sequences reveals new polymorphic HERV-K(HML-2) proviruses in humans and chimpanzees that are absent from genome assemblies. Retrovirology 12:35.
29. Maldarelli F, et al. (2014) HIV latency. Specific HIV integration sites are linked to clonal expansion and persistence of infected cells. Science 345:179–183.
30. Prado-Martinez J, et al. (2013) Great ape genetic diversity and population history. Nature 499:471–475.
31. Gordon D, et al. (2016) Long-read sequence assembly of the gorilla genome. Science 352:aae0344.
32. Karolchik D, Hinrichs AS, Kent WJ (2011) The UCSC Genome Browser. Curr Protoc Hum Genet Chapter 18:Unit 18.6.
33. Smit AFA, Hubley R, Green P (2013–2015) RepeatMasker open-4.0. Available at www.repeatmasker.org. Accessed December 27, 2018.
34. Berry CC, et al. (2012) Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics 28:755–762.
35. Hughes JF, Coffin JM (2001) Evidence for genomic rearrangements mediated by human endogenous retroviruses during primate evolution. Nat Genet 29:487–489.
36. Löwer R, et al. (1993) Identification of human endogenous retroviruses with complex mRNA expression and particle formation. Proc Natl Acad Sci USA 90:4480–4484.
37. Ono M, Yasunaga T, Miyata T, Ushikubo H (1986) Nucleotide sequence of human endogenous retrovirus genome related to the mouse mammary tumor virus genome. J Virol 60:589–598.
38. Johnson WE, Coffin JM (1999) Constructing primate phylogenies from ancient retrovirus sequences. Proc Natl Acad Sci USA 96:10254–10260.
39. Kijima TE, Innan H (2010) On the estimation of the insertion time of LTR retrotransposable elements. Mol Biol Evol 27:896–904.
40. Hemelaar J (2012) The origin and diversity of the HIV-1 pandemic. Trends Mol Med 18:182–192.
41. Langergraber KE, et al. (2012) Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proc Natl Acad Sci USA 109:15716–15721.
42. Moorjani P, Amorim CE, Arndt PF, Przeworski M (2016) Variation in the molecular clock of primates. Proc Natl Acad Sci USA 113:10607–10612.
43. Prado-Martinez J, et al. (2013) The genome sequencing of an albino Western lowland gorilla reveals inbreeding in the wild. BMC Genomics 14:363.
44. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.
45. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359.
46. Li H, et al.; 1000 Genome Project Data Processing Subgroup (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.
47. Kent WJ (2002) BLAT—The BLAST-like alignment tool. Genome Res 12:656–664.
48. Quinlan AR, Hall IM (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
49. Wheeler DL, et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res 31:28–33.
50. Altschul SF, et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101–5109.
51. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
52. Kumar S, Nei M, Dudley J, Tamura K (2008) MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform 9:299–306.
53. Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120.
