Identification of an overprinting gene in Merkel cell polyomavirus provides evolutionary insight into the birth of viral genes

Joseph J. Cartera,b,1,2, Matthew D. Daughertyc,1, Xiaojie Qia, Anjali Bheda-Malgea,3, Gregory C. Wipfa, Kristin Robinsona, Ann Romana, Harmit S. Malikc,d, and Denise A. Gallowaya,b,2

Divisions of aHuman Biology, bPublic Health Sciences, and cBasic Sciences and dHoward Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98109

Edited by Peter M. Howley, Harvard Medical School, Boston, MA, and approved June 17, 2013 (received for review February 24, 2013) Many use overprinting (alternate reading frame utiliza- mammals and birds (7, 8). Polyomaviruses leverage alternative tion) as a means to increase protein diversity in genomes severely splicing of the early region (ER) of the genome to generate pro- constrained by size. However, the evolutionary steps that facili- tein diversity, including the large and small T (LT and ST, tate the de novo generation of a novel protein within an ancestral respectively) and the middle T (MT) of murine poly- have remained poorly characterized. Here, we describe the omavirus (MPyV), which is generated by a novel splicing event and identification of an overprinting gene, expressed from an Alter- overprinting of the second exon of LT. Some polyomaviruses can nate frame of the Large T Open reading frame (ALTO) in the early drive tumorigenicity, and gene products from the ER, especially region of Merkel cell polyomavirus (MCPyV), the causative agent SV40 LT and MPyV MT, have been extraordinarily useful models of most Merkel cell carcinomas. ALTO is expressed during, but not to study the viral and host processes required for cellular trans- required for, replication of the MCPyV genome. Phylogenetic anal- formation (9–11). More recently, a new oncogenic polyomavirus, ysis reveals that ALTO is evolutionarily related to the middle T Merkel cell polyomavirus (MCPyV), was discovered in Merkel cell antigen of despite almost no sequence sim- carcinoma (MCC), a rare but aggressive form of , ilarity. ALTO/MT arose de novo by overprinting of the second exon providing the first established case of a human cancer stemming of T antigen in the common ancestor of a large clade of mamma- from a polyomavirus (12, 13). lian polyomaviruses. Taking advantage of the low evolutionary In comparison with other human polyomaviruses, the MCPyV divergence and diverse sampling of polyomaviruses, we propose LT protein contains an expanded, highly divergent region of evolutionary transitions that likely gave birth to this protein. We ∼200 amino acids encompassing two conserved short linear suggest that two highly constrained regions of the large T antigen motifs (Fig. 1A and SI Appendix, Fig. S1). Given the precedent of ORF provided a start codon and C-terminal hydrophobic motif nec- protein diversity generated from the ER of polyomaviruses, we essary for cellular localization of ALTO. These two key features, investigated the implications of this highly divergent MCPyV LT together with stochastic erasure of intervening stop codons, region. We identified an alternate T antigen ORF, hereafter resulted in a unique protein-coding capacity that has been pre- referred to as ALTO, overprinting within this region in MCPyV served ever since its birth. Our study not only reveals a previously LT (Fig. 1A) in the +1 frame relative to the second exon of LT. fi unde ned protein encoded by several polyomaviruses including ALTO is expressed during, but is not required for, viral genome MCPyV, but also provides insight into de novo protein evolution. replication. Interestingly, despite almost no sequence similarity, ALTO most probably shares a common evolutionary origin with gene evolution | synonymous substitution | disordered motifs the other known case of overprinting in polyomaviruses, the second exon of MT from MPyV. Indeed, we find that preserva- he birth of new genes has fascinated biologists for decades. tion of the overprinting ALTO/MT ORF defines a large mono- TAlthough the steps required to generate a new gene by gene phyletic clade of mammalian polyomaviruses. By comparing LT duplication or gene rearrangement have been characterized, less genes in a phylogenetic context among polyomaviruses within is known about the birth of new genes de novo. One particularly and outside this clade, we propose an evolutionary model for the intriguing mechanism of de novo gene birth is via “overprinting,” generation of a previously undiscovered viral gene de novo by in which a novel overprinting gene is encoded as an alternate ORF overprinting. within an ancestral “overprinted” gene (1). Overprinting results in two unrelated functional proteins encoded as overlapping ORFs Results within the same DNA sequence. However, the origins of such A Previously Uncharacterized Overprinting ORF in MCPyV. We noted a complex evolutionary solution have remained elusive. that the expanded, highly divergent region of MCPyV LT could Viruses appear to be especially adept at this form of evolu- encode a previously undefined protein, which we named ALTO. tionary innovation. This frequent use of overprinting is likely the ALTO is the result of an overprinting ORF that is +1 frameshifted result of the severe constraints imposed on viral genome size, making gene innovation more likely to occur as overprinting rather than within a noncoding region (2). Due to the numerous Author contributions: J.J.C., M.D.D., H.S.M., and D.A.G. designed research; J.J.C., M.D.D., examples of overprinting in single-stranded RNA viruses, a great X.Q., A.B.-M., G.C.W., and K.R. performed research; A.B.-M. and G.C.W. contributed new deal of research has focused in particular on this class of viruses reagents/analytic tools; J.J.C., M.D.D., X.Q., A.R., H.S.M., and D.A.G. analyzed data; and (3–6). However, small DNA viruses, such as adenoviruses, pap- J.J.C., M.D.D., H.S.M., and D.A.G. wrote the paper. illomaviruses, and polyomaviruses, have a similar requirement to The authors declare no conflict of interest. maximize the coding capacity of their genomes. In this study, we This article is a PNAS Direct Submission. have taken advantage of our identification of an overprinting gene 1J.J.C. and M.D.D. contributed equally to this work. born in the ancestor of a large clade of polyomaviruses to in- 2To whom correspondence should be addressed. E-mail: [email protected]. vestigate the steps that allowed generation of this unique protein. 3Present address: Institute for Systems Biology, Seattle, WA 98109. ∼ Polyomaviruses are nonenveloped viruses containing an 5-kb This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. circular double-stranded DNA genome that infect a wide range of 1073/pnas.1303526110/-/DCSupplemental.

12744–12749 | PNAS | July 30, 2013 | vol. 110 | no. 31 www.pnas.org/cgi/doi/10.1073/pnas.1303526110 Downloaded by guest on September 25, 2021 Region of low are being evolutionarily preserved and protected from pre- A 100 similarity between MCPyV LT and sumably deleterious nonsynonymous changes. We investigated other human LTs whether the region of LT that overlaps ALTO displays decreased 50 synonymous site variation. We first compared divergence at % identity synonymous sites (dS) in LT genes in sequenced isolates of 0 900 MCPyV and found that the dS within the ALTO-encoding re- Amino acid = MCPyV LT DNAj j YGS/T LxCxE OBDD ATPase/HelicaseAT gion is 2.5 times lower than the rest of the gene (dS 0.048 for the region of overlap vs. 0.12 for nonoverlap). We found a sim- ALTO ilar pattern when we compared LT genes between MCPyV and B its closely related polyomaviruses (ChimpPyV1a, ChimpPyV2a, 196 429 861 3078 Transcript 1 Large T antigen and GorillaPyV), which also have the capacity to encode ALTO. (LT) In all pairwise comparisons, synonymous divergence is decreased 196 754 Transcript 2 Small T antigen directly overlapping the ALTO-encoding region (Fig. 1D), con- (ST) Transcript 3 sistent with both reading frames being maintained throughout

196 429 861 1622 2778 3078 viral evolution. Interestingly, the ALTO coding sequence was Transcript 4 57kD T antigen (57kT) intact in all MCPyV sequences deposited in GenBank from healthy individuals (n = 31) and from nonlesional skin from 880 1624 Frame 1 Alternative T antigen MCC patients (n = 3) but was predicted to have truncating or Frame 2 open reading frame SI Appendix Frame 3 880 1622 27782785 (ALTO) frame shift mutations in 19 of 46 MCC tumors ( , Table S1). These same mutations have also been previously C shown to truncate LT (14). 880 MGPLNSKNGGDQEDSASGRHTNMGPIHTGPTQDPESLPPMHPGEPPVEAHHPTARALPLGMGPSQRPRLQ TPSPEDPIYLPNTMRNPPHPLDPVAERRPPIQEENPAHPMEPVYLEILPERMAPGRISSAMNHFPPLSLP ALTO Is Expressed in Cells During Viral DNA Replication. To validate RPLRSLRSPPPQEARPGSPRLPLPRRPRHLSLQMRNTDPPPSPPRRPLLHSQESENLGGPEALQALLVQQ that ALTO is a bona fide MCPyV expressed protein, we trans- VLQALHQSQKRTEKLLFLLIFLLIFLIILAMLYIVIKQ(QI) fected an intact circular MCPyV genome (SI Appendix, Fig. S3) 1624 2785 into human HEK293 cells and monitored protein expression and

D 1 Region of viral DNA replication. Under these conditions, the MCPyV ge- decreased dS nome has been shown to replicate its DNA for several days (19, 20). Lysates from transfected cells were analyzed by immunoblot 0.5 using immune rabbit serum that had been generated against MCPyV-GorillaPyV1 a mixture of two ALTO peptides (Fig.1C). We detected ALTO Synonymous MCPyV-ChimpPyV1a A

site divergence (dS) MCPyV-ChimpPyV2a at the expected size (27.6 kDa; Fig. 2 ) in cells transfected with 0 0-100 100-200 200-300 300-400 400-500 500-600 600-700 700-800 wild-type DNA. Importantly, the ALTO protein was not detec- Amino acid window ted using a MCPyV genome bearing a point mutation that MCPyV LT DNAj j YGS/T LxCxE OBDD ATPase/HelicaseAT

eliminated the predicted ALTO initiation codon without af- EVOLUTION ALTO fecting the LT encoded protein sequence. These results con- firmed that ALTO is a bona fide protein expressed during Fig. 1. ALTO is a unique protein encoded by the MCPyV ER. (A) Sliding MCPyV replication and validated that ALTO translation ini- window plot (10-aa window size) of amino acid identity between MCPyV tiates at position 880 in the genome. and eight other human polyomaviruses. The arrows depict the LT and ALTO To measure the effect that ALTO had on DNA replication, we ORFs; gray boxes indicate functional domains (14) (DNAj, activators of DNAk harvested low molecular weight DNA 2 d posttransfection and chaperones; OBD, origin binding domain; YGS/T and LxCxE, conserved linear motifs). (B) Four transcripts from the early region (ER) are present in MCPyV digested it with DpnI and BamHI. DpnI cleaves bacterially positive tumors (14). ORFs encoding ST, LT, 57 kT, and ALTO are shown as derived (methylated) input DNA multiple times, but not rectangles with colors indicating reading frames. Start sites, termination HEK293-derived (unmethylated) replicated DNA, allowing us to sites, and splice sites are shown (accession no. HM355825). (C) ALTO protein distinguish the two DNA species (SI Appendix, Fig. S3). We sequence is shown. Underlined peptides were used to generate immune detected newly replicated MCPyV genome using ethidium bro- rabbit sera used in Fig. 2. Blue and red boxes indicate the C-terminal basic mide-stained gels and Southern blots. The MCPyV genome and hydrophobic motifs, respectively. (D) Sliding window plot (100-aa win- migrated at ∼5 kb following cleavage with BamHI to linearize dow size) of the synonymous site divergence (dS) between the indicated LT the genome (Fig. 2 B and C). Interestingly, loss of ALTO ortholog pair. Below is the schematic of LT and ALTO ORFs as in A. resulted in no measurable difference in viral DNA replication in HEK293 cells (Fig. 2 B and C). Thus, we conclude that ALTO relative to the second exon of LT (Fig. 1A and SI Appendix, does not play an essential role in MCPyV genome replication in Fig. S1). All four previously mapped T antigen mRNAs (14) tissue culture but may play an accessory role in the viral life cycle, as has been observed for many overprinting genes (4). have the capacity to encode ALTO (Fig. 1B). Based on conser- We observed that ALTO localized to foci in the cytoplasm in vation of the AUG codon at genome position 880 among closely cells transfected with the MCPyV ER (Fig. 2D). No staining was related polyomaviruses (SI Appendix, Fig. S2), we predicted that observed in cells transfected with vector only, or in cells treated ALTO encodes a 248- or 250-residue protein depending on the SI Appendix ′ B C with preimmune rabbit serum ( , Fig. S4). An ALTO 3 splice variant (Fig. 1 and ). The only recognizable features variant lacking an intact C-terminal hydrophobic motif (residues of the predicted ALTO protein are proline-rich segments and 1–234) appeared to be distributed diffusely throughout the cy- a C terminus containing a basic motif just upstream of a hydro- toplasm. Thus, deletion of the hydrophobic C terminus alters the C phobic motif (Fig. 1 ). subcellular localization of ALTO. Overprinted ORFs often display decreased divergence at synonymous DNA sites, because a synonymous site change in Overprinting ORF Defines a Clade of Polyomaviruses. ALTO has not the ancestral, overprinted ORF (i.e., LT) will likely result in been previously described in polyomaviruses, despite extensive a nonsynonymous change in the overprinting ORF (i.e., ALTO). investigation of prototype members such as SV40, BKPyV, and In fact, decreased synonymous divergence is a means to identify JCPyV. We therefore wished to investigate the evolutionary origin new overprinting ORFs (3, 15–18) and indicates that both ORFs of ALTO, especially in light of the fact that all polyomaviruses

Carter et al. PNAS | July 30, 2013 | vol. 110 | no. 31 | 12745 Downloaded by guest on September 25, 2021 A B C as well as other polyomaviruses from primates, raccoons, rodents, kD ControlMCPyV880koER kB ControlMCPyV880ko ControlMCPyV880ko bats, and one other human PyV, Trichodysplasia spinulosa-asso- 50 37 ciated polyomavirus (TSPyV), all share the property of potentially Replicated 25 12 encoding an overprinting ALTO/MT-like gene (Fig. 3A). More- 20 5 3 15 2 over, for all viruses with an intact ORF, the predicted proteins ALTO encode a hydrophobic C terminus similar to both ALTO and MT 1 Input (Fig. 3B). In contrast, polyomaviruses outside this lineage, in- GADPH cluding the closely related human HPyV9 and Monkey B-Lym- D DAPI ALTO LT Merged photropic papovavirus (LPV), contain several stop codons in this frame and therefore cannot encode an ALTO/MT-like protein (Fig. 3 A and C). We can thus delineate this clade of poly-

Vector omaviruses, which we suggest calling Almipolyomaviruses (for ALTO or middle T containing polyomaviruses), by virtue of their conserved utilization of this alternate ORF (Fig. 3 and SI Appendix, Fig. S3A). Taking advantage of the homology be- ER tween the helicase domains of LT from polyomaviruses and E1 from papillomaviruses, we propose a root for the LT tree that further supports a single monophyletic clade of Almipolyoma- viruses (SI Appendix, Fig. S6). We conclude that ALTO/MT was ER born de novo specifically in the common ancestor of Almipo- (ALTO 1-234) (ALTO lyomaviruses, via the utilization of an alternate overprinting ORF Fig. 2. ALTO is expressed from the MCPyV early region during viral genome in the second exon of LT to encode a unique protein (Fig. 3). replication. (A) Cells (HEK293) were transfected with religated wild-type Following its birth, the overprinting ALTO/MT protein has MCPyVw156 DNA (MCPyV) or MCPyV genome harboring a point mutation been preserved in all members of the Almipolyomaviruses. In disrupting the ATG start codon of ALTO at nt 880 (880ko). An equal amount addition to a lack of stop codons, we observed additional evi- of an unrelated plasmid was used as a negative control. Immunoblots for dence for selection acting to preserve this ORF in several branches ALTO and GAPDH were performed on protein lysates collected 48 h after of the Almipolyomavirus clade. As described above, divergence at transfection or with a lysate from cells transfected with an ER expression synonymous sites is depressed specifically in the LT region that plasmid (using 50-fold less lysate). (B) Low molecular weight DNA was iso- overlaps ALTO in MCPyV and related hominoid PyVs (Fig. 1D). lated from transfected cells and digested with BamHI and DpnI. Digestion products were separated on an agarose gel and visualized by staining with We observed a similar pattern when we compared the other hu- ethidium bromide or (C) Southern blotting using a radioactive MCPyV DNA man Almipolyomavirus, TSPyV, to its closely related primate PyV probe. The input control plasmid contained 350 bp of the MCPyV genome (OrangutanPyVBo). In contrast, a pairwise comparison between overlapping the region used for hybridization. Input and replicated MCPyV two non-Almipolyomaviruses, HPyV9 and LPV, showed no such DNAs are indicated (SI Appendix, Fig. S3). (D) Cells (U2OS) were transfected depression of synonymous divergence in the region that would with expression plasmids containing the MCPyV ER, ER with a C-terminally correspond to ALTO (SI Appendix, Table S3). Such data indicate truncated ALTO (ALTO 1–234), or an empty vector control. After 48 h, the that, following overprinting of ALTO/MT onto LT, selection has cells were fixed and stained for ALTO (green), MCPyV LT (red), and DNA acted to preserve it in the Almipolyomaviruses. (blue). Individual and merged images are shown. (Scale bar: 30 μm.) Discussion Polyomaviruses have evolved a variety of strategies to maximize encode the overprinted (original) LT gene. Initial similarity ∼ searches [using tblastn (21)] indicated that ALTO was exclusively use of their genomes that are limited in size to 5 kb. One strategy is evident in the use of alternate translation start sites in the late present in MCPyV and three other hominoid polyomaviruses region. In contrast, all previous early region polyomavirus (ChimpPyV1, ChimpPyV2, and GorillaPyV1). However, we also proteins were believed to initiate at the same start codon in the noted that, like ALTO, the second exon of MPyV MT is the first exon of LT but were capable of producing multiple products result of overprinting onto the LT gene (22). We thus considered via alternative splicing of a primary transcript (8). Here, we have whether ALTO and the second exon of MT are evolutionarily identified a unique protein (ALTO) that is translated from an related, representing a single gene birth. internal start site in an alternate ORF of the second exon of LT We found striking parallels between ALTO and MT despite fi fi and expressed during MCPyV viral genome replication. Using the nding little signi cant sequence similarity between the two unique position and frame of ALTO in the MCPyV genome, we proteins. Both ALTO and MT are produced as a result of + found that ALTO was a member of a family of proteins that overprinting of LT second exon in the same 1 frame, and both included distantly related ALTO homologs and the second exon contain a basic motif just upstream of a hydrophobic C terminus, of MT from MPyV. Although these proteins have little recog- which is immediately followed by a stop codon (Figs. 1 and 3). nizable sequence identity, they share several key features in- These C-terminal motifs are located in almost exactly the same cluding a termination codon at the same location in the genome, ’ region in both MCPyV and MPyV genomes, overlapping LT s a hydrophobic C terminus flanked by one or more basic amino origin binding domain (OBD) that is highly conserved among acids, and a relatively high proportion of prolines. polyomaviruses (Fig. 1A). These similarities raised the possibility The ALTO/MT proteins we have identified could encode that ALTO and MT derived from a common ancestor. novel functions by which Almipolyomaviruses can manipulate To delineate the evolutionary origins of ALTO and MT, we the host. Indeed, other genes encoded by the early region are performed a comprehensive survey of all polyomaviruses to as- critical for cellular transformation and tumorigenicity (9–11). sess whether they could encode a putative overprinting ORF, For instance, SV40 uses large and small T antigens (LT and ST), which terminated in the same region as ALTO and MT. Phy- produced by alternative splicing of the same gene, to inactivate logenies based on LT alone (Fig. 3A and SI Appendix, Fig. S5A) tumor suppressors. Binding partners of SV40 LT first led to the or the entire genome (SI Appendix, Fig. S5B) were similar and identification of the tumor suppressor protein p53 (26, 27) and consistent with recently published phylogenetic analyses of LT revealed mechanisms to disrupt the pRb and p130 pathways (10). genes from polyomaviruses (23–25). Strikingly, we found that The main transforming activity of MPyV is encoded by the one lineage of polyomaviruses that includes MCPyV and MPyV, middle T antigen (MT), which splices the first exon corresponding

12746 | www.pnas.org/cgi/doi/10.1073/pnas.1303526110 Carter et al. Downloaded by guest on September 25, 2021 Overlap with LT OBD domain WUPyV * A 20 amino acids KIPyV BKPyV * * JCPyV SV40 HPyV10 * * HPyV6 HPyV7 HPyV9 * LPV * and LT ancestor

TSPyV * Avipolyomaviruses Exon 2 of MPyV MT OrangutanPyVBo MPyV * * HamsterPyV BatPyVB1130 RacoonPyV BatPyVR104 * Birth of ChimpPyV1a *alternate ORF GorillaPyV1 * * ChimpPyV2a *

Almipolyomaviruses MCPyV * ALTO protein of MCPyV Conserved stop codon B Hydrophobicity C Hydrophobicity TSPyV HPyV9 MPyV LPV HamsterPyV HPyV10 BatPyVB1130 HPyV6 RacoonPyV HPyV7 BatPyVR104 Start of LT OBD domain ChimpPyV1a MCPyV Start of LT OBD domain

Fig. 3. A single clade of polyomaviruses has a predicted alternate ORF. (A) Genomic sequences from several polyomaviruses (accession nos. in SI Appendix, Table S2) were translated as a +1 frameshift from the coding region of LT exon 2. Predicted stop codons are shown in red. MCPyV ALTO is shown in blue, and exon 2 of MT from MPyV is in gray. Due to length differences and lack of similarity between exon 2 sequences, translations are aligned by the conserved region of the genome overlapping the OBD. The phylogeny shown here is based on an amino acid alignment of LT (SI Appendix, Fig. S5A) but is consistent with a larger phylogeny of polyomaviruses based on the entire genomic sequence (SI Appendix, Fig. S5B). Asterisks indicate >75% bootstrap support in both phylogenies. (B) The C terminus of the indicated alternate ORFs is shown aligned to their conserved hydrophobic region. Conservation data and hydro- phobicity plots were generated using Geneious software (40). Stop codons are shown as red asterisks. (C) Alignment of the non-Almipolyomaviruses as in B.

to ST to the overprinting frame in exon 2. Studies of MT in LT proteins (SI Appendix, Fig. S7A) that derive from an exten-

MPyV have led to discoveries of phosphotyrosine kinases (PTK) sion of the linker region between the start of exon 2 and the EVOLUTION and phosphoinositide 3-kinase (PI3K) signaling (9, 22). Although conserved OBD (SI Appendix, Fig. S7B). Placed in the context of ALTO does not appear to be essential for viral genome repli- the phylogenetic tree, there is an obvious increase in the size of cation, we find that it is mutated in many cancer tissues and that elucidation of its function may uncover new roles for ALTO in MCPyV manipulation of human cell signaling and perhaps in Ancestral LT gene LT frame cancer development. Potential for +1 frame +1 frame *** JCPyV, KiPyV start codon and The origin of de novo genes has been of great interest and has hydrophobic motif Expansion of LT protein principally been studied in eukaryotes where studies have shown already present that novel genes can arise by stochastic removal of frameshift or stop codons from previously noncoding regions or long noncoding *** LPV, HPyV9 RNA (28, 29). RNA viruses, because of their genome size limi- Innovation of ALTO/MT Removal of stop codons tations, are especially proficient at generating novel genes by overprinting onto an already existing gene using a different reading * Ancestral Almipolyomavirus frame (3). However, the steps required to generate an intact, Subfunctionalization Create MT through splicing fi of ALTO/MT functional overprinting ORF have been dif cult to discern. This Expand LT/ALTO Expand LT/MT difficulty is both due to the high mutation rates of RNA viruses and fi because it is often not possible to nd closely related viruses that * MCPyV * MPyV branched just before the origin of overprinting. The relatively low evolutionary divergence of polyomaviruses, their broad sampling, Fig. 4. A model for polyomavirus de novo gene birth by overprinting. ORFs and our phylogenetic tracing of the birth of ALTO/MT proteins to corresponding to the LT and the ALTO/MT alternate frame are shown as in a single, monophyletic lineage of Almipolyomaviruses therefore Fig. 1. Dark blue denotes the region of ALTO/MT that encodes the hydro- phobic domain and overlaps the evolutionarily conserved OBD of LT. Red provide an unprecedented opportunity to dissect the de novo asterisks denote stop codons. Each schematic represents a proposed step evolutionary origins of an overprinting gene. along the evolutionary pathway that led to the current repertoire of LT, By comparing the ALTO/MT reading frames of Almipolyo- ALTO, and MT ORFs. An extant viral example of each schematic is also given, maviruses with the lack of equivalent ORFs in non-Almipolyo- with the exception of the inferred ancestor to all Almipolyomaviruses, which maviruses, we investigated what preexisting constraints and novel gave birth to all current ALTO/MT-containing viruses. Although MCPyV innovations may have led to overprinting of ALTO/MT onto LT. encodes ALTO wholly within exon 2, it remains to be experimentally de- Our ability to study LT and ALTO status along several points of termined whether other Almipolyomaviruses besides MPyV and HamsterPyV have splice variants that encode an MT-like protein. If all basally branching the well-resolved polyomavirus phylogeny allows us to propose Almipolyomaviruses were found to encode an MT protein, we would revise a series of steps that could have plausibly given rise to ALTO/ this model to suggest that that the initial ALTO/MT innovation was actually MT (Fig. 4). First, we observed that, compared with non-Almi- MT-like and later became ALTO-like due to the use of the downstream polyomaviruses, Almipolyomaviruses generally possess longer methionine (conserved due to the LT YGS/T motif).

Carter et al. PNAS | July 30, 2013 | vol. 110 | no. 31 | 12747 Downloaded by guest on September 25, 2021 this region that exactly coincides with the region of overprinting splicing patterns (Fig. 4). In fact, there is very little sequence (SI Appendix, Fig. S7B). Interestingly, this increase in size is al- relatedness among predicted ORFs in Almipolyomaviruses out- ready evident in the clade of viruses most closely related to side of the hydrophobic C terminus and an immediately adjacent Almipolyomaviruses (represented by HPyV9 and LPV), whose patch of polar residues (Fig. 3B). However, conserved features LT region is the same size as the most basally branching Almi- are evident in ALTO comparisons among closely related Almi- polyomaviruses (represented by TSPyV and OrangutanPyVBo). polyomaviruses (SI Appendix, Alignment S2) and reveal that, in These results suggest that expansion of LT likely preceded and the region of overlap, neither ALTO nor LT appears to have may have been necessary to accommodate the new ORF al- consistently higher sequence conservation, making it unclear though alternate models are plausible (Fig. 4). Although the LT which reading frame has been more highly constrained during expansion could be the result of neutral drift, it may have been evolution (SI Appendix, Table S4). Interestingly, unlike some selected for to facilitate additional protein–protein interactions. structured overprinted/overprinted pairs, including the recently Consistent with this hypothesis, the expanded region in MCPyV discovered influenza A PA-X protein and the PA protein it LT is required for interactions with hVam6p, a cytoplasmic overprints (15), neither ALTO nor the overprinted region of LT protein involved in lysosomal processing (30). is predicted to contain substantial protein structure (SI Appendix, A second feature that likely facilitated the birth of ALTO/MT Fig. S1). Together, these findings are consistent with studies was the presence of two highly conserved regions of LT that showing that overprinting genes of viruses frequently encode could encode important functional elements in the ALTO reading rapidly evolving, disordered proteins (3, 4). However, we hy- frame. First, the experimentally determined start codon of ALTO pothesize that ALTO/MT proteins likely perform related func- (Fig. 2) overlaps exactly with a conserved YGS/T motif located tions despite this low sequence and structural conservation, near the Rb-binding motif of LT (SI Appendix,Fig.S8). Al- similar to many examples of disordered proteins that can ac- though the functional significance of the YGS/T motif is not complish very similar biological tasks (34). Consistent with this understood, our observation that it is highly conserved supports possibility of similar function, the second exons of MT proteins structural evidence for its importance for proper folding of the from MPyV and HamsterPyV are less than 20% identical, yet DNAj domain (31). Interestingly, this potential to encode a start both MTs use short linear sequences, such as phosphatidylino- codon in the beginning of exon 2 is conserved across every - sitol-3 kinase motifs, whose location is not conserved between yomavirus containing the YGS/T motif, including many non- the two proteins but retention of these motifs has been shown to Almipolyomaviruses, suggesting that the start codon of ALTO be important for the functions of both proteins (22). We have also was in place before ALTO was born. We observe a similar identified putative short linear motifs among closely related ALTO phenomenon with the highly conserved LT OBD, which we hy- proteins of MCPyV-related viruses (SI Appendix, Alignment S2). pothesize resulted in the potential to encode a hydrophobic C Beginning with our identification of a unique protein, ALTO, terminus in the alternate frame (ALTO) immediately upon birth overprinting the LT gene in MCPyV, we have found that the of ALTO/MT. Again, the precursor to this alternate frame hy- presence of an overprinting ORF defines a single clade of poly- drophobic domain can also be seen in non-Almipolyomaviruses omaviruses. In addition to MCPyV and another human , even before the birth of ALTO although, in most cases, this TSPyV, the Almipolyomavirus clade also includes MPyV, in which domain contains stop codons (Fig. 3C). An intact hydrophobic the overprinting sequence encodes the second exon of the MT domain is the most conserved feature of the ALTO/MT proteins . Because of the strict conservation of LT in all poly- that we identified, and we have now shown that this domain is omaviruses, we propose a model for the origin of ALTO/MT, as well required for the subcellular distribution of ALTO (Fig. 2D), as the subsequent lineage-specific subfunctionalization that oc- similar to its role in MT localization and function (32). As a re- curred after this protein was born. Thus, our studies not only reveal sult, we expect that this hydrophobic motif and likely the cellular a previously undefined protein expressed from the early region in targeting that it confers are critical for ALTO/MT and may have a broad range of polyomaviruses of high biomedical importance, but immediately provided the newly born ALTO/MT with func- also provide insight into the evolutionary steps required to achieve tionality. Intriguingly, in an independent case of overprinting in protein novelty in genomes so constrained by size. parvoviruses, the single most recognizable feature of the over- printing SAT protein is its transmembrane segment that is re- Materials and Methods quired for correct localization (33). Cell Culture. Osteosarcoma (U2OS) and adenovirus-transformed human em- Thus, preexisting constraints on the LT ORF predisposed the bryonic kidney cell lines (HEK293) were cultured in DMEM supplemented with alternate frame with these two key features, which bookended the 10% (vol/vol) FBS and penicillin–streptomycin (Life Technologies). All cell ALTO overprinting ORF even before its birth. We posit that lines were maintained at 37 °C in 5% CO2. stochastic erasure of intervening stop codons then allowed the Plasmids and Cloning. For expression of MCPyV proteins in tissue culture, the generation of an intact overprinting ORF (Fig. 4). Interestingly, fi because there can be no selective pressure for stop codon removal MCPyVw156 early region DNA was ampli ed by PCR and cloned into the pQCXIP plasmid (Clontech) at BamHI/EcoRI sites using the In-Fusion cloning kit before the generation of an intact ALTO, erasure of stop codons (Clontech). The ALTO start site mutation (880ko) and truncation mutation is likely both the rate-limiting step in terms of de novo gene birth, (ALTO 1–234) were generated by site-directed mutagenesis. Details of as well as the step that is driven almost entirely by stochasticity. cloning and mutagenesis, including primer sequences, are described in SI We propose that the combination of these three events—LT gene Appendix, Table 1). expansion, the fact that the highly conserved YGS/T motif and OBD had the potential to encode the N-terminal start codon and Generation of Anti-ALTO Antibodies. Antibodies to ALTO were generated C-terminal hydrophobic domain in the alternate (ALTO) frame, using a combination of two synthetic peptides (C-GMGPSQRPRLQTPSPED and and stochastic erasure of stop codons—allowed the LT gene to be C-DPVAERRPPIQEENPAH) (ProSci), which were used to immunize two rabbits. overprinted by ALTO/MT. Importantly, the retention of an intact Both rabbits had equivalent responses, and serum from the final bleed was ALTO/MT ORF in all Almipolyomaviruses and the decrease in used at a dilution of 1:10,000. This study was reviewed and approved by the Institutional Animal Care and Use Committee (IACUC) at Fred Hutchinson synonymous divergence in the LT region that overlaps ALTO Cancer Research Center prior to implementation. indicate that this alternate ORF is functionally conserved in Almipolyomaviruses, and likely has been since its birth. MCPyV Replication Assay. Cells (106; HEK293) were transfected with 1 μgof After the ALTO/MT-like ORF invention, there was further religated MCPyV genomic DNA, either MCPyVw156 or its ALTOko mutant, or innovation of this ORF, including very rapid divergence of pri- an equal amount of pUC.M plasmid DNA as negative control. Cells were mary sequence, additional gene expansion, and/or changes in harvested 48 h posttransfection, and low molecular weight DNA was

12748 | www.pnas.org/cgi/doi/10.1073/pnas.1303526110 Carter et al. Downloaded by guest on September 25, 2021 extracted by the modified Hirt extraction method (20). One hundred Immunofluorescence Staining. Transfected U2OS cells were washed with PBS, nanograms of DNA was digested, visualized by staining with ethidium fixed with 4% paraformaldehyde for 10 min, permeabilized with 0.05% + bromide, and transferred onto Hybond N nylon membrane (GE Healthcare Triton X-100, washed three times with PBS, and blocked with 5% normal Biosciences) for Southern blot analysis (35). In an MCPyV PCR using primers goat serum. Primary antibodies were either immune or preimmune rabbit 751F and 1110R (SI Appendix, Table S5), 32P-dATP was incorporated in the sera and Cm2B4 (anti-LT; Santa Cruz Biotechnologies). Images were collected product to prepare the probe specific for MCPyV DNA. The blot was exposed using a Deltavision confocal microscope (Applied Precision), and the fluo- to BioMax MR autoradiography film (Kodak). rescent images were created using softWoRx 5.0 (Applied Precision). Transient Transfections. For transfections using expression plasmids, cells were Phylogenetic Analysis. Publically available polyomavirus sequences (SI Ap- grown in six-well dishes for immunoblots or on poly-L-Lysine–coated cover- slips (Sigma Chemical) for immunofluorescence. The total amount of DNA in pendix, Table S2) were used for all analyses. LT protein sequences and all transfections was 750 ng per well and was kept constant using empty complete viral genomes were aligned using MUSCLE (36), and maximum vector DNA. All transfections used Fugene HD (Roche Diagnostics) following likelihood phylogenetic trees were constructed using PhyML (37) imple- the manufacturer’s protocol. mented at www.phylogeny.fr (38). Trees were visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/). dS calculations on available MCPyV Immunoblotting. Cells were harvested 48 h posttransfection, and proteins isolates (SI Appendix,TableS1) with intact LT and ALTO genes and sliding fl were extracted as described by Neumann et al. (19). Brie y, cells were window calculations were performed using codeml (39). resuspended in lysis buffer (50 mM Tris, pH 8.0, 150 mM NaCl, 1% Nonidet P-40, 0.5% Na-Deoxycholate, 5 mM EDTA, 0.1% SDS, and proteinase in- ACKNOWLEDGMENTS. We appreciate helpful discussions and comments hibitor mixture; Roche), and concentrations were determined using the DC from members of the D.A.G., H.S.M., A. Dusty Miller, and Paul Nghiem μ – protein assay (Bio-Rad). Proteins (50 g) were separated on 8 16% SDS/PAGE laboratories and Christopher Buck (National Cancer Institute). We ac- (Life Technologies) gels and transferred to Immobilon-P membrane (EMD knowledge assistance from Julio Vasquez and Elizabeth Jensen (Fred Millipore). Membranes were blocked in 2.5% nonfat dry milk–PBS solution Hutchinson Cancer Research Center Shared Resources). This work was sup- and incubated at 4 °C for 1 h with primary antibody [rabbit serum or GAPDH ported by a Cancer Research Institute Irvington Fellowship (to M.D.D.), mAb (EMD Millipore)], followed with horseradish peroxidase-conjugated a National Science Foundation CAREER award (to H.S.M.), National Institutes secondary antibody. Proteins were detected with Lumi-Light (Roche) and of Heath Grants CA064795 and CA042792 (to D.A.G.). H.S.M. is an Early Career exposed to Kodak XAR-5 film (Fisher Scientific). Scientist of the Howard Hughes Medical Institute.

1. Keese PK, Gibbs A (1992) Origins of genes: “Big bang” or continuous creation? Proc 21. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein Natl Acad Sci USA 89(20):9489–9493. database search programs. Nucleic Acids Res 25(17):3389–3402. 2. Chirico N, Vianelli A, Belshaw R (2010) Why genes overlap in viruses. Proc Biol Sci 22. Fluck MM, Schaffhausen BS (2009) Lessons in signaling and tumorigenesis from pol- 277(1701):3809–3817. yomavirus middle T antigen. Microbiol Mol Biol Rev 73(3):542–563. 3. Sabath N, Wagner A, Karlin D (2012) Evolution of viral proteins originated de novo by 23. Fagrouch Z, et al. (2012) Novel polyomaviruses in South American bats and their re- – overprinting. Mol Biol Evol 29(12):3767 3780. lationship to other members of the family . J Gen Virol 93(Pt 12): 4. Rancurel C, Khosravi M, Dunker AK, Romero PR, Karlin D (2009) Overlapping genes 2652–2657. produce proteins with unusual sequence properties and offer insight into de novo 24. Lim ES, et al. (2013) Discovery of STL polyomavirus, a polyomavirus of ancestral re- protein creation. J Virol 83(20):10719–10736. combinant origin that encodes a unique T antigen by alternative splicing. Virology 5. Firth AE, Atkins JF (2010) Candidates in , Seadornaviruses, Cytorhabdoviruses 436(2):295–303. and for +1 frame overlapping genes accessed by leaky scanning. Virol J 25. Tao Y, et al. (2013) Discovery of diverse polyomaviruses in bats and the evolutionary EVOLUTION 7:17. history of the Polyomaviridae. J Gen Virol 94(Pt 4):738–748. 6. Pavesi A, De Iaco B, Granero MI, Porati A (1997) On the informational content of 26. Lane DP, Crawford LV (1979) T antigen is bound to a host protein in SV40-trans- overlapping genes in prokaryotic and eukaryotic viruses. J Mol Evol 44(6):625–631. – 7. Schowalter RM, Pastrana DV, Pumphrey KA, Moyer AL, Buck CB (2010) Merkel cell formed cells. Nature 278(5701):261 263. fi polyomavirus and two previously unknown polyomaviruses are chronically shed from 27. Reich NC, Levine AJ (1982) Speci c interaction of the SV40 T antigen-cellular p53 – human skin. Cell Host Microbe 7(6):509–515. protein complex with SV40 DNA. Virology 117(1):286 290. 8. Cole CN, Conzen SD (2001) Fields Virology, Polyomaviridae: The Viruses and Their 28. Kaessmann H (2010) Origins, evolution, and phenotypic impact of new genes. Ge- Replication, ed Knipe DM,HowleyPM (Lippincott Williams & Wilkens, Philadelphia), nome Res 20(10):1313–1326. pp 2141–2174. 29. Xie C, et al. (2012) Hominoid-specific de novo protein-coding genes originating from 9. Cheng J, DeCaprio JA, Fluck MM, Schaffhausen BS (2009) Cellular transformation by long non-coding RNAs. PLoS Genet 8(9):e1002942. Simian Virus 40 and Murine Polyoma Virus T antigens. Semin Cancer Biol 19(4):218–228. 30. Liu X, et al. (2011) Merkel cell polyomavirus large T antigen disrupts lysosome clus- 10. DeCaprio JA (2009) How the Rb tumor suppressor structure and function was revealed tering by translocating human Vam6p from the cytoplasm to the nucleus. J Biol Chem by the study of Adenovirus and SV40. Virology 384(2):274–284. 286(19):17079–17090. 11. Ahuja D, Sáenz-Robles MT, Pipas JM (2005) SV40 large T antigen targets multiple 31. Kim HY, Ahn BY, Cho Y (2001) Structural basis for the inactivation of retinoblastoma – cellular pathways to elicit cellular transformation. Oncogene 24(52):7729 7745. tumor suppressor by SV40 large T antigen. EMBO J 20(1-2):295–304. 12. Feng H, Shuda M, Chang Y, Moore PS (2008) Clonal integration of a polyomavirus in 32. Zhou AY, et al. (2011) Polyomavirus middle T-antigen is a transmembrane protein – human Merkel cell carcinoma. Science 319(5866):1096 1100. that binds signaling proteins in discrete subcellular membrane sites. J Virol 85(7): 13. Becker JC, et al. (2009) MC polyomavirus is frequently present in Merkel cell carci- 3046–3054. noma of European patients. J Invest Dermatol 129(1):248–250. 33. Zádori Z, Szelei J, Tijssen P (2005) SAT: A late NS protein of porcine parvovirus. J Virol 14. Shuda M, et al. (2008) T antigen mutations are a human tumor-specific signature for 79(20):13129–13138. Merkel cell polyomavirus. Proc Natl Acad Sci USA 105(42):16272–16277. 34. Dunker AK, Silman I, Uversky VN, Sussman JL (2008) Function and structure of in- 15. Jagger BW, et al. (2012) An overlapping protein-coding region in influenza A virus herently disordered proteins. Curr Opin Struct Biol 18(6):756–764. segment 3 modulates the host response. Science 337(6091):199–204. 35. Southern E (2006) Southern blotting. Nat Protoc 1(2):518–525. 16. Lin MF, et al. (2011) Locating protein-coding sequences under selection for additional, 36. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high overlapping functions in 29 mammalian genomes. Genome Res 21(11):1916–1928. – 17. Firth AE, Brown CM (2006) Detecting overlapping coding sequences in virus genomes. throughput. Nucleic Acids Res 32(5):1792 1797. BMC Bioinformatics 7:75. 37. Guindon S, et al. (2010) New algorithms and methods to estimate maximum-likeli- – 18. Sabath N, Landan G, Graur D (2008) A method for the simultaneous estimation of hood phylogenies: Assessing the performance of PhyML 3.0. Syst Biol 59(3):307 321. selection intensities in overlapping genes. PLoS ONE 3(12):e3996. 38. Dereeper A, et al. (2008) Phylogeny.fr: Robust phylogenetic analysis for the non- 19. Neumann F, et al. (2011) Replication, gene expression and particle production by specialist. Nucleic Acids Res 36(Web Server issue):W465-9. a consensus Merkel Cell Polyomavirus (MCPyV) genome. PLoS ONE 6(12):e29112. 39. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 20. Schowalter RM, Pastrana DV, Buck CB (2011) Glycosaminoglycans and sialylated gly- 24(8):1586–1591. cans sequentially facilitate Merkel cell polyomavirus infectious entry. PLoS Pathog 40. Drummond AJ, et al. (2010) . Geneious version 5.0 created by Biomatters. Available at 7(7):e1002161. www.geneious.com.

Carter et al. PNAS | July 30, 2013 | vol. 110 | no. 31 | 12749 Downloaded by guest on September 25, 2021