<<

Diversification of AID/APOBEC-like deaminases in PNAS PLUS metazoa: multiplicity of clades and widespread roles in immunity

Arunkumar Krishnana, Lakshminarayan M. Iyera, Stephen J. Hollandb, Thomas Boehmb, and L. Aravinda,1

aNational Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894; and bDepartment of Developmental Immunology, Max Planck Institute of Immunobiology and Epigenetics, 79108 Freiburg, Germany

Edited by Anjana Rao, La Jolla Institute and University of California San Diego, La Jolla, CA, and approved February 23, 2018 (received for review November 30, 2017) AID/APOBEC deaminases (AADs) convert cytidine to in viruses and retrotransposons through hypermutation of DNA during single-stranded nucleic acids. They are involved in numerous muta- reverse transcription (APOBEC3s) (21). genic processes, including those underpinning vertebrate innate and The deaminase superfamily displays a conserved β-sheet with adaptive immunity. Using a multipronged sequence analysis strategy, five β-strands arranged in 2-1-3-4-5 order interleaved with three we uncover several AADs across metazoa, dictyosteliida, and algae, α-helices forming an α/β-fold (the deaminase fold) (22), which it including multiple previously unreported vertebrate clades, and shares with JAB/RadC, AICAR transformylase, formate de- versions from urochordates, nematodes, echinoderms, arthropods, hydrogenase accessory subunit (FdhD), and Tm1506 superfamilies lophotrochozoans, cnidarians, and porifera. Evolutionary analysis sug- of . The consists of two zinc (Zn)-chelating gests a fundamental division of AADs early in metazoan evolution into motifs, respectively typified by the signatures HxE/CxE/DxE at secreted deaminases (SNADs) and classical AADs, followed by diver- the end of helix 2 and CxnC (where x is any amino acid and n is ≥2) sification into several clades driven by rapid-sequence evolution, located in loop 5 and the beginning of helix 3 (Fig. 1). The de- loss, lineage-specific expansions, and lateral transfer to various algae. aminase superfamily contains two major divisions. In the so-called Most vertebrate AADs, including AID and APOBECs1–3, diversified in helix-4 division, which includes the Tad2/TadA, ADAR, and AID/ the vertebrates, whereas the APOBEC4-like clade has a deeper origin APOBEC-like deaminases (AADs), the helix 4 precedes the ter- INFLAMMATION IMMUNOLOGY AND in metazoa. Positional entropy analysis suggests that several AAD minal strand, resulting in strand 5 being parallel to the rest of the clades are diversifying rapidly, especially in the positions predicted sheet (Fig. 1). In the C-terminal hairpin division, containing families to interact with the nucleic acid target motif, and with potential viral such as the CDD-like, blasticidin-S–like, and DYW, strands 4 and inhibitors. Further, several AADs have evolved neomorphic metal- 5 immediately follow each other, forming a β-hairpin (22). binding inserts, especially within loops predicted to interact with the The best-known AADs are APOBEC1, AID, APOBEC2, target nucleic acid. We also observe polymorphisms, driven by alter- APOBEC3s (3A–3D; 3F–3H), and APOBEC4 (13). AID plays a native splicing, gene loss, and possibly intergenic recombination be- critical role in gnathostome (jawed vertebrate) adaptive immunity: tween paralogs. We propose that biological conflicts of AADs with viruses and genomic retroelements are drivers of rapid AAD evolution, Significance suggesting a widespread presence of mutagenesis-based immune- “ defense systems. Deaminases like AID represent versions institution- Mutagenic AID/APOBEC deaminases (AADs) are central to ” alized from the broader array of AADs pitted in such arms races for processes such as generation of antibody diversity and antiviral mutagenesis of self-DNA, and similar recruitment might have inde- defense in vertebrates. Their presence and role outside verte- pendently occurred elsewhere in metazoa. brates are poorly characterized. We report the discovery of several AADs, including some that are secreted, across diverse DNA/RNA editing | immunity | arms race | biological conflicts | metazoan, dictyosteliid, and algal lineages. They appear to retroelements have emerged from an early transfer of an AAD from bacterial toxin systems, followed by extensive diversification into multi- he deaminase superfamily encompasses zinc-dependent en- ple eukaryotic clades, showing dramatic structural innovation, Tzymes catalyzing the of bases in free rapid divergence, gene loss, polymorphism, and lineage-specific and nucleic acids across diverse biological contexts (1–3). These expansions. We uncover evidence for their divergence in arms- include that are (i) primarily involved in the salvage of race scenarios with viruses and genomic retroelements and bases in free nucleotides, such as the cytidine deaminases (CDD/ show that AAD-based nucleic acid mutagenesis as a basis of CDA) (4), deoxycytidylate monophosphate deaminases (dCMP) immune defense is widespread across metazoa, slime molds, (5), and deaminase (GuaD) (6); (ii) engaged in the bio- and algae. synthesis of modified -derived compounds (e.g., blasticidin- Author contributions: A.K., L.M.I., S.J.H., T.B., and L.A. designed research; A.K., L.M.I., and S deaminase) (7) and the RibD deaminase domain (8); and (iii)active L.A. performed research; S.J.H., T.B., and L.A. contributed new reagents/analytic tools; in in situ modifications of bases in nucleic acids, such as tRNA A.K., L.M.I., and L.A. analyzed data; and A.K., L.M.I., and L.A. wrote the paper. adenosine deaminases (Tad2/TadA and Tad1) (9), RNA-specific The authors declare no conflict of interest. (ADAR) (10), RNA-editing cytidine deami- This article is a PNAS Direct Submission. nases such as DYW (11, 12), and members of the AID (activation- This open access article is distributed under Creative Commons Attribution-NonCommercial- induced deaminase)/APOBEC family, which modify both DNA and NoDerivatives License 4.0 (CC BY-NC-ND). RNA (13, 14). This last group of deaminases is implicated in the Data deposition: The sequence reported in this paper has been deposited in the Sequence Read Archive at the NCBI (accession nos. SAMN08013505, SAMN08013506,and generation of degenerate codons for decoding during translation SAMN08013507). – (Tad2/TadA) (9, 15); stabilization of codon anti-codon interactions 1To whom correspondence should be addressed. Email: [email protected]. (Tad1) (16, 17); editing mRNAs, siRNAs, and miRNA precursors; This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. inactivation of RNA viruses by hypermutation (ADAR) (10, 18, 19); 1073/pnas.1720897115/-/DCSupplemental. diversification of antibodies (AID) (20); and defense against retro-

www.pnas.org/cgi/doi/10.1073/pnas.1720897115 PNAS Latest Articles | 1of10 Downloaded by guest on September 28, 2021 A helix-1 helix-2 FINAL NP_065712.1_AID_Hsap 4LLMNR R K F LYQ FKN 9 T Y L C YVV FGYLR N K 3 H V E L L F L R Y I S D WD 6 NP_663745.1_APOBEC3_Hsap 11 H LMDPH I FTSN FNN 6 T Y L C YEV RGFLH N Q 11H A E L R F L D L VPS L Q6 NP_001635.2_APOBEC1_Hsap 16 R RIEPWE FDVF C L L YEI N T 2 H V E VNF I K K F T S E R7 NP_006780.1_APOBEC2_Hsap 45 E RLPAN F F KFQ FRN 9 T F L C YVV RGYLE D E 4 H A E EAFFN T ILPAF5 XP_018105362.1_NAD1_Xlae 17 K LVSF E D FNGN FNN 6 T L A C FN WGFAY N N 4 H A E Q I I L Q E LYT F L10 GFIP01022257.1_NAD1_Pbuc 35 K QIPE D I F Y Q E FNN 6 T L L F FGLGQ WGYTF N R 3 H A E I R L LGFI K G F L10 H3B7Z9_LATCH_NAD1_Lcha 3KQLSK E E F EAE FNN 6 T L L C FSL Q Q WGYAH N N 4 H A E I L V L R E I E K Y L10 GECV01074881.1_NAD2_Mfis 23 I QMTPS I F ISH YAT 6 T H L C YEV N T 3 H A E EVF IGE RFR D E3 ABO15149.1_PmCDA1_Pmar 11 E KLDIYT F KKQ FFN 5 S HRC YVL WGYAV N K 9 H A E I FSI R K V E E Y L6 AIY70103.1_LcCDA1l_Lcam 7KKLPL N T F LFE FNN C YIF WGYAT N K 17H T E E L L L D E M T R HV 7 ABO15150.1_PmCDA2_Pmar 22RVAFL R C FAAPSQK 3 T V I L FYV VNY N K 5 H A E V L L L S A V R AAL15 AAGJ05003720.1_Spur 11 G EEIL N I FVKS GLH V C L VTF IKQ-- N R 3 H A E RDI I E F L R FNI3 GAGS01028218.1_Bsia 14 N LQLF T D ITNAGVH 7 T S A V N G3H A E IAL I E Q L T S F F9 AAGJ05065417.1_Spur 1 - - D I L K N F IHS GLH 13 T V V C LIV N I4H V E Q I L I E Y I N S K L5 GDJY01027842.1_Lana 134 A T D I M D T F ERS GPH 7 T DSY CVF RSFIF N D 4 H A E E Q A V N F LAK N I5 AKZP01034351.1_Pmin 10 I QEDI E T F IKS A V LL S T 3 H A E MNA VPS IFE L L3 GBYC01056953.1_Aele 22KNEIVR S F LVT SMY 9 T V V V CEI KVYR- N D 3 H A E I K L IGYL K D DS 7 APKA01034104.1_Bgla 19TTDDF D M F WRT HLM11 T I L Y WN RREG- T P2H AAAAVAR D LMT Q V8 NP_982279.1_APOBEC4_Hsap 3PIYEEYLANHGTIV11C SNCPYH ----32T F YE Q KGHAS S C 5 H P E S M LFE MNGYLD 9 XP_001622116.1_Nvec 59 V LGNK K E F C GAFYH 7CLDKQ S----C12T A V A LVK N C 5 H A E E F F LMD I D C Q L18 XP_005783540.1_Ehux 77 L LGTTAQ F C S S FYH 12CLSVAAGGPPC 3T V I C VARL Q LYI 4N C 9 H A E Q F L L Q D E D L RS33 GAWS02034135.1_Pame 12HSYFF R R M WNE Q S C VV Q K 4 H A E V K A L Q H V Q E K E6 KDR09461.1_Znev 46IVSTSK D I WTA C IT EISY- S H 4 H A E I K V L R N I K A R E7 XP_002637598.1_Cbri 148CTNGAS K FSVPKHV10 APOBEC4 S SGL IVT RGDFY 2S K 3 H V E E Q L VAAIYD L I7 GFAV01000705.1_SNAD4_Htub 217 E RELQ VVHIAC AED 2 T H L LYR-- S A8H A E M E F I R S R K Q T- GFAV01006376.1_SNAD4_Htub 14 L MKLNE ATKIACTT 3 Cnidaria-algae Zn T I L L LVL FYY-- N T 3 H A E MVM I D N Q S S D K1 XP_005947721.1_SNAD3_Hbur 236PSVNAN T F FDS LKG 8 chelating inserts Q Y A L AIF FEV D Y4H T E P L LCQ E V E I I L7 XP_014060899.1_SNAD3_Ssal 60GDDIND V F LDD SIN 19 Q HSW AVL YYP N Y4H S E D L I I K Q T Q E L L8 XP_007659302.1_SNAD1_Oana 68 P IDEGVLF RIIENL12 Q Y A VGIRLG15LPQMKKQLHDLGGL4SLVAARVQG N I5H A E W R L L S H A N N N R13 XP_004917434.1_SNAD1_Xtro 25 T EEQLLR ATNYIRQ 8 Q Y A Y IAV FT19NADDLMTTIEQHRI4Q IVAASFLR 1N N 4 H A E A R L L N GEGE AP13 XP_016091766.1_SNAD1_Sgra 52 T LDKI T K F FHQ NYE 2 Q Y A V AINVP22VNVKNIIRADEGPV1EGKELIAAG 1Q K 5 H S E F L LMN PPD N SP10 GEML01133863.1_SNAD2_Carg 26PSYEAE S TTGR VTT FSL AVSLP13DSGEEVRRTILNCD VYTGRRVVA 1T V48H A E Y RTL Q K L N T LA 6 GBFM01021829.1_SNAD2_Mflo 80 D QNKLAGIVTE ILK FSL AVSIP14DTGDQVWQRILNRC1VYTSQ R MVA 1T V44H A E Y RTL Q N F N T LV 6 WP_016970941.1_Ptol 16 E A P Y IFD N EYT ANF 5 T V L I AKI EA 5 Q PFGKFF-- N D 3 H A E ANF L Q A L R ANE7 consensus/60% .b..bppFb.ph.p ...... ohhhh.hb...... b.h.. Np HAEhbhlpplppbb strand-3 helix-5 Y R V T W FTS W S P C Y D C A RHV A D F L RGN PNL S LRI FTA R L Y EPEGLR R LHR 2VQIAIM T F K D YFYC W N T F12 Y R V T W FIS W S P C F S 2 C A G E VRAF L Q E N THV R LRI F A A R I Y LYKEAL Q M LR 3AQV S I M T Y D E F K H C W D T F10 C S I T W FLS W S P C W E C S QAIRE F LSR H PGV T LVI Y V A R L F Q Q N RQGL RDL 4V T I Q I M R A S E YYHC W R N F16 Y N V T W YVS S S P C AA C A D R I I K T LSK T KNL R LLI L V G R L F E I QAAL K KL 4CKL R I M K P Q D F E Y V W Q N F15 Y K L N L YTS Y S P C F S C C KDL C S F LDT Y 1NQ V S MHL K I A K L Y Q R G LQM L R EN 1AD I T I M D L E D Y K E C FYLF15 Y K LIA YLS Y S P C E T C C P Q L L E F VTT V KVC Q LEI R F S K L Y W K N MSR LHQA 1IQV K V M K K E D F E Y C Y H L F15 Q R V T L YVT C S P C N R C C T K I L E F F Q R F Q R F D MDI K I S K I Y Lophotrochozoa-Echinodermata Q D L KQLG--- V S L K V M D S S D F K E C F D L F15 C T VVW YIS W S P C D H C MQL L L S T F LPA 1PQ V R LHIVF A K V Y K NNIRAL QE 2VQI R V M N L N D F R W C W D Y Y6 F T I N W YSS W S P C A D C A E K I L E W YNQ E 3NGHT LKI W ACK L Y R N Q IGL WNLR 3V G L N V MVSE H Y Q C CRK I F13 F C V E W FTS W S P C H R C S G L L L R W L R D I 2GRHR LRV W F S R I Y RAGLRR L R RA 1VQLGVM D G R L H D Y C A H V L12 C T L H C YST Y S P C R D C V E Y I Q E FGAS - TGV R VVIHCCR L Y E GVLRSLSR 2RD F R L M3D AIALLLGGR L A15 R S V K I FCN F S P C S K 5 C N S IRKIKEDLK 7GERS LTFVF A H L Y NI 3SC REKG-C FESGPNR--H 2PW E NARSLNNL 4TS I R S F E MAD WVAL I E L L16 SII K I YIT F S P C LY C C E K L I T F FDT F 5FTQ MEII F A AL - Y NI 3SC EIRGLC SPDE--K--H12SLWTLHQARDK14L A VGLD 1PLT D Y Q I F V E RD 10 K S I T I YSN Y S S C T Q C S G K L Q VLKE E - ILDLQ IKFSAL SY IR 2SC TVS--C G C YT------AD P KSFTM LR 1L N A R P F Q K D D W D N L I D L L16 Q K I R F YLN W S P C A R C S S LFL N FSRII 3KLSIDVEI VFR 1 L Y EV 3SC KGLP-C D C AV-RKPEH -VQ K LHAL W RC 2V T V R T FIEE D W K T L V Q L L15 K E I D M YIN Y S P C GY C A N Q I A T M L Q D N P Q VPVRI K C AY L Y K E VAKRNGLT 7TT L E T I H G D D WITV I Q R I11 I K I K V YQN Y S P C S D C A K E I M Q Y M E E E 5REV E MTITF A N F Y LGLIGL Y E QP 2ELQLLG VHVK W K T F L N MS 11 I S V T I VQN Y S P C L D C A D E L L K T I Q LA 6LDM S ISFVALK 5SW L Q LNS33VRL T T F T P D T W K F L Y T F L22 R H IIL YSN N S P C N E 4 C I S KMY N F L I T Y PGI T LSI Y F S Q L Y S A WNR EALR 8V V L S P I S GGIW H S V L H S F14 W Q I T M YLT M Q P C HL12C C EVMI K AKE K L 1DNV E IVI KPTH 5WY E K GVRK L F KT 2IEL E C M K EGD W K Y L L Q Y A14 H R LLL YLT Y Q P C H H 15C T S L L CAF V R E - VLALLRREGVA7DW WARGAPPFGA AV AGER S RLD AFVSTALV10 M C I D L YLS Y S P C A R C A D F I L Q FSKLR PGC R INL F F S C L F R EGLR R LNA 3I T L R V F T G R E W C I L A H AT 13

RIV T L FLS Y S P C A N C A N F I I E FSR T R P Q C T VYI R FTC L F R DGLR R LNA 3I S LGVF insertSNAD3-specifc T VYE W R R L A E AT 10 H E I Q I FVS K S P C F H 18C A K L LGLLLSK V 4KKV D VKMTVKF L Y K Q G ILCM L QA 1IKV E P LLMK D W C A I M D WS 14 K T I K --IKN S P C A D C C D A L M D Y Y K D C Q MKP-EI Q F G R I Y E T E QFR KTVK L C Q N G F N L E V W E L FNN Y M13 NVV H I YIRN M P C NV C S D K L I E F F EPV E Q KP-TI Y I A K Q W K R H ETTQGLK 1LIKN G F K L R G W R E YKH SG 45 E S ILV YTLN S P C L N C M S V LSK E A Y S W 3YGI S TTVGFTQ F W F Q N ITCSYTN 2DP K N V F 88LLR D WLVL V N D C3 W N VYL FTLN S P C LA 5C MLNL M R K A F E W 3HGV R TSIGYRK C W F K DVSYAQ FE18PG N N L V109FEQ C WGDMVQ DK 3 R C LIF YTFR S P C K S 1 C L N ETHN SNIIN 7LVQ R GAFVFDR V Y ELVRAWSNLG 1I P TYHC YTK S C VYC S R D- G C VLF YTLN S P C TG 1C A R IGGE YNILD 8INNK ALV YND- V Y D TVWDAW KE 4I P FYRC FSR N C YRC FIN- G C VVF YTLN S P C I N 1 C L S GNY K ITGGL 6KGI K -AFAFKN I W T Q ELREKL KVIA 2I P LYRC K S K K C T L C G E PG 2 DLLLF YVLA S P C D K 1 C T S ETSH RSILN 6NWV K YAVVF S D V F Q E L RDSL M RL12NN IFRC 2 N PVQC T S C S S N- DLLLF YVLA S P C D K 1 C A S ESSR WNILN 6 Q WKN YAVVF S N V F E QDLRGAIE 8SN IFRC 2 D VVQC T S C S S G- Q E IVISIN N S P C R Q C C V LFKQ Y M E E I 4PNTT LVI ETA N I Y Q ACLRAL--- SRLPGI R V E L W N V A LVNA10 bplphY.sh SPC.p Cssblhpahbp. ..hph.lbhup lYb...... bp.b.bbbb. lslp.h p.pcap.hhp.h C C helix-6 H/C N GLHE N SVRLS173AID | | | B 3’ C H Zn | | | CC/H C W D G L D E H SQA185APOBEC3 | | | 5’ LWMMLYALEL182APOBEC1 | | | +2 H Zn C DIQE N FLYYE216APOBEC2 | | | L1 -2 C H C | | | C Y S YGN A R DMD188NAD1 | | | C H MAEHS NAILQ205NAD1 | | | Zn | | | Zn Zn K T K QLNAVFL168NAD1 | | | +1 N F E P W S KLDRN171NAD2 | | | C | | | -1 C L E KTLK R AEK184PmCDA1 | | | H1 Zn LVPW H I N VPR199LcCDA1-l| | | H55 Zn

| Core vertebrate AADs | | Y T E TNVVE PLV202PmCDA2 | | | N I R FGE EYGSR157Echinodermata | | N c D D V I Q T C AAA225Lophotrochozoa | | C | | L3 Zn L7 L R YGD EYGVW140Echinodermata | | H L5 DFEQLM T EQE318Lophotrochozoa | | N L9 | | E C ILKQIL E EHS160Echinodermata | | 234 C V C L R K A D SRK206Cnidaria | | 5 Lophotrochozoa | Classical AADs | C CTD RLEE D RLMQL253Lophotrochozoa | | TGRALAD RHN225APOBEC4 | | | Echinodermata-Zn C R R KTE D E KIG284Cnidaria | | | H6 Cnidaria-algae Zn |----- | | APOBEC4 Zn AMPL S C E PCA379Haptophyta | | | chelating W N I YWQGELD179Arthropoda APOBEC4-Cnidaria/Algae clade | | chelating | | chelating R D A W D S K WKI214Arthropoda | | D N H H L D KAVAQ343Nematoda | | 1 SP - T D E H I N KVK360SNAD4 \\ | \ SNAD4 | | H2 H4 Zn Zn N E L E D LLSDI194SNAD4/ | | H3 N C NVIK Q I T KDF485SNAD3 | | C \ SNAD3 | | C LLRER LIENF365SNAD3/ | | AID/APOBEC -like deaminases (AADs) C Q N T W N R C TQD273SNAD1\ | | L2 N N N PNN P C YYN230SNAD1 | | C SNAD1 | | L4 L6 L8 C T E INT A C LTD249SNAD1/ | | C C C C N Q VSPY C YSD265SNAD2 | | \ SNAD2 | | C T Q VTPY C YSE324SNAD2/ | | Q E LAR I N KSL176Bacteria // AADs Secreted | . p .bp b p ... | Common AAD model SNAD1/2 SNAD3

Fig. 1. Characterization of the AAD family. (A) MSAs of representatives of AAD clades labeled with accessions, clade names, and species abbrevi- ations (full species names in SI Appendix, Fig. S1). Zn-chelating active site residues are highlighted in black, and known substrate-interacting residues are shown in red. The SNAD1/2-specific helix between strands 1 and 2 and neomorphic strands are shown in gray rectangles. (B) Topology diagram depicting the conserved core common to all AADs, the nucleic acid substrate, and key residues. (C) Topologies of clades showing Zn-chelating features. Homologous Zn- chelating residues between APOBEC4 and the Cnidaria-algae clade are highlighted in purple, green, and blue, whereas other Zn-chelating residues are colored gray. (D) Topologies of SNADs1–3: inserts embedded within the core domain are in blue, while N- and C-terminal extensions are in gray.

cytidine deamination by AID at the genomic Ig loci in mature B cells causes somatic hypermutation and/or gene conversion, which drives triggers base excision and mismatch repair mechanisms (20, 23–27). antibody diversification, for instance as part of affinity maturation This process is central to class-switch recombination (24). Addition- during an immune response (28, 29). The AADs expressed in agna- ally, at least in some jawed vertebrates, AID-catalyzed deamination than (jawless vertebrate) lymphocytes (PmCDA1 and PmCDA2) are

2of10 | www.pnas.org/cgi/doi/10.1073/pnas.1720897115 Krishnan et al. Downloaded by guest on September 28, 2021 implicated in gene conversion-based diversification of their antigen Fig. 1A and SI Appendix,Figs.S3–S20),whichwethenusedto PNAS PLUS receptors (20), the variable lymphocyte receptors (VLRs), which are identify shared sequence features andtocomputephylogenetictrees structurally unrelated to antibodies (review ref. 30). APOBEC3 de- and phyletic relationships. The overall phylogenetic tree strongly aminases act as restriction factors in the innate response to , established the monophyly of a core group of gnathostome AADs, herpesviruses, parvoviruses, papillomaviruses, hepatitis B virus, and namely AID, APOBEC3, APOBEC2, APOBEC1, and the re- various retroelements (31–33). APOBEC1 deaminates cytosine both in lated lamprey CDAs (discussed in detail in the companion pa- RNA(34)andssDNA(35)andhasrolesinbothmRNAeditingand per). Additionally, we identified two distinct clades of AADs that ssDNA mutagenesis as part of the defense against retroviruses and we designate NAD1 and NAD2 (Novel AID/APOBEC-like genomic retrotransposons (36, 37). APOBEC2 and APOBEC4 remain Deaminases 1 and 2; Figs. 1A and 2A). These, too, group with poorly understood in terms of their molecular functions and the above AADs, specifically with the other gnathostome AADs. substrate specificity. We refer to this entire assemblage of deaminases as the core Interestingly, several deaminase clades catalyzing organellar RNA- vertebrate AAD clade. Our analysis further clarified the evolu- editing and DNA mutation, including the AADs, are sporadically tionary trajectories and relationships for both the previously distributed across the tree of life. Our previous analysis revealed that known AADs and the newly detected members (Fig. 2A). First, one of the main factors for this pattern is the diversification of most APOBEC1 was retrieved in taxa such as birds, reptiles, am- deaminase families as toxic effectors in bacterial toxin systems, fol- phibians, and lungfish, suggesting that its origin predated the lowed by multiple independent lateral transfers to eukaryotes (22). tetrapod-lungfish divergence (SI Appendix, Figs. S2 and S4). Whereas the evolutionary history of AADs in vertebrates has been Second, APOBEC2 was retrieved from chondrichthyans (sharks), extensively examined (22, 38), our prior discovery of AADs (22) actinopterygians (ray-finned fishes), and sarcopterygians (lung- outside vertebrates, and sporadically in other eukaryotic lineages and fish and coelacanth), thereby indicating its emergence before the bacteria, hinted at the possible presence of as-yet-undiscovered di- divergence of gnathostomes (SI Appendix, Figs. S2 and S5). vergent homologs. In this study, a multipronged sequence analysis Third, multiple paralogous copies or splice forms of APOBEC1 approach uncovered multiple clades of AADs from across metazoa, were observed in some turtles and of AID in certain actino- dictyosteliida, and algae. This allowed us to develop a comprehensive pterygians and amphibians (SI Appendix, Figs. S2–S4). Fourth, evolutionary picture of AAD diversification. We present evidence that APOBEC3 appears to have emerged from AID in eutherian biological conflicts with viruses and genomic retroelements are the mammals, followed by paralog expansions. primary selective force behind the evolution of multiple in- The remaining AADs, from invertebrates such as urochordates dependent lineage-specific expansions (LSEs) of these deaminases. (albeit predicted to be catalytically inactive), echinoderms, arthro- Furthermore, in some cases, the genomic organization might favor pods, lophotrochozoans, and cnidarians, and APOBEC4 form INFLAMMATION

polymorphism via recombination between paralogs, duplication, multiple distinct clades that are out-groups to the above core ver- IMMUNOLOGY AND and gene loss. Our study suggests that the self-DNA–mutating tebrate AAD clade (Fig. 2A). We refer to the clade that unites all deaminases, like AID and the cyclostome deaminases, emerged via these AADs with core vertebrate AADs as the classical AAD clade “institutionalization” of AADs that were originally deployed in (Fig. 2A). Of these, APOBEC4 is recovered from tetrapods, biological conflict with selfish genomic elements. sarcopterygians, and agnathans, but is frequently lost across actinopterygians (Fig. 2A and SI Appendix,Figs.S2andS13). Based Results and Discussion on conserved sequence features and a long, distinctive, shared metal- Identification and Classification of Members. In an earlier analysis, binding insert (Fig. 1C, Right panels, and SI Appendix,Fig.S14), we we discovered AADs in several bacteria and eukaryotic taxa other unified APOBEC4 with a specific group of AADs found in cnidar- than vertebrates (22). This sporadic distribution, coupled with rapid ians and sporadically in phylogenetically distant algal lineages. This sequence divergence, hinted at the possible presence of additional cnidaria-algae-APOBEC4 clade appears to be the outermost clade of AAD versions that had evaded detection. Taking advantage of the the classical AADs. Its phyletic pattern suggests an origin early in several new genome sequences that were published since our last metazoa followed by transfer to photosynthetic eukaryotes possibly study, we developed a strategy of transitive and iterative sequence- via the route of the algal-cnidarian symbioses. Another, remarkable, profile searches against the National Center for Biotechnology large monophyletic clade of AADs from diverse vertebrates and Information (NCBI) nonredundant (nr) and Ensembl proteome sponges contains members predicted to possess an N-terminal signal databases and Transcriptome Shotgun Assembly (TSA) and Whole- peptide (Figs. 1 and 2A and SI Appendix,Figs.S15–S18). This feature Genome-Shotgun (WGS) contig databases using various known suggests that these AADs might be secreted; accordingly, we named AADs as queries (see Materials and Methods). Profile searches were these the Secreted Novel AID/APOBEC-like Deaminases (SNADs). run using PSI-BLAST and JACKHMMER and translating searches This clade formed a sister group to the classical AADs. Overall, this with TBLASTN. To isolate AADs from other deaminases, the newly pattern suggests that there was a split between the classical AADs detected sequences were analyzed by profile-profile comparisons and the SNADs at the base of metazoan evolution (Fig. 2A). We using the HHpred program, by examination of bidirectional best hits also detected secreted AADs in the dictyosteliid slime molds Dic- and by use of phylogenetic methods, and finally assessed for sequence tyostelium fasciculatum, Tieghemostelium lacteum,andAcytostelium, and structural synapomorphies. Thus, we established a comprehen- which might either group with the SNADs or form a distinct basal sive collection of AADs from vertebrates, urochordates, echino- clade of eukaryote-specific AADs (Fig. 2A and SI Appendix, Fig. derms, lophotrochozoans, arthropods, cnidarians, poriferans, S19) We describe below the clades in greater detail (see SI Ap- dictyosteliids, and algae, and added homologs from species within pendix for obtaining the complete set of sequences). taxa that were previously known to possess AADs (Fig. 1A; SI Ap- pendix,Fig.S1; c.f., phyletic distribution in SI Appendix,Fig.S2). Members of Classical AAD Clade. Of the AADs, NAD1 and NAD2 We identified potential AADs in a flatworm (platyhelminthes), clades belong to the core vertebrate AID/APOBEC clade but are Microphallus sp., and its snail host, Potamopyrgus antipodarum;a not particularly close to any one of the known lineages (Fig. 2A). Of green algae, Elliptochloris marina; an endosymbiont of sea anem- these, NAD1 is found in ray-finned fishes, the coelacanth, am- one, Anthopleura elegantissima; and numerous dinoflagellates that phibians, lizards, and marsupials, whereas NAD2 is restricted to are symbionts of corals. In these instances, the complete sequence amphibians (SI Appendix,Figs.S2,S7,andS8). NAD1 is typi- identities between the host and symbiont/pathogen sequences, cally found in a single copy per genome and has been independently coupled with the results from phylogenetic tree analysis, show that the lost in eutherian mammals and archosaurs. Among the classical Microphallus, Elliptochloris, and the dinoflagellate AAD sequences are invertebrate AADs, despite the patchy phyletic pattern suggestive likely contaminations from their host genome sequences; hence, these of extensive gene loss, we delineated multiple well-defined clades, sequences were removed before further analysis. namely the following (Figs. 1 and 2A): (i) a clade found in cnidarians, To understand the evolution of the AADs in light of the newly lophotrochozoans, echinoderms, and tunicates with a deaminase detected members, we built multiple sequence alignments (MSAs; domain most closely related to the core vertebrate AADs

Krishnan et al. PNAS Latest Articles | 3of10 Downloaded by guest on September 28, 2021 A B Classical DYW-like NAD2PmCDA1-likeAPOBEC4 Alveolata-DYW-like-LSEs Bsia 6

TadA/Tad2 Features Cjac 5 Pmin 12 i 3 Aele 7 r Lana 2 APOBEC3 Ct Bacteria-AADs ri 3 Ct Ctri 3 Secreted Dictyosteliid AID Bsia 3 SNAD4-AADs SP Core vertebrate Spur 13 APOBEC2 Lana 4 SNAD3-AADs Str5--insert--Hel5 Past 2 Fig. 2. Phylogenetic relationships among AADs. (A) SNAD1-AADs SP, Str6 and Str7 Higher order relationships of deaminases with a

SNADs Cgig 18 SNAD2-AADs Secreted AADs SP, Str6 and Str7 Mter 3 focus on AADs. Clade-specific features are on the Cnidaria/algae L1-L3 Zn ( C ---- H -C-C- C ) APOBEC1 Past 8 Right. Zn-chelating residues are shown within pa- Mter 13 APOBEC4 L1-L3 Zn ( C - H ------C- C ) Nvec 5 rentheses and homologous residues shared by Arthropoda-AADs Htub 2 NAD1 Htub 3 members of the cnidaria-algae-APOBEC4 clade Nematoda-AADs PmCDA2 HtubH 3 are colored. (B) Phylogenetic tree illustrating LSEs of Lop-Ech-Zn2 L1-L7 Zn (H-C-C/H-C/H) HtubAque 2 4 tub 2 Aque 3 Htub 3 Lop-Ech-Cni Mammalia AADs in various metazoan lineages. Clades entirely Tber 8 Amphibia bur 2 SNAD4 NAD2-AADs H Ca rg 4 composed of monospecific representatives are col- Hbur 5 Car Onil 2 Onil 2 NAD1-AADs Onil 2 nat 7 Ccar 6 Msal 6 g 4 P C lapsed and labeled with species abbreviations and SNAD3 car 5 Msal 3 arg 2 APOBEC2 A C H lim 6 Vertebrata Hbur 2

bur 3

H APOBEC1 Classical AADs Aganatha+Vert. Ccar 4 number of sequences in the LSE. Nodes supported by

ub

Salp 10 APOBEC3 Aganatha SNAD1 2 r > > AID/APOBEC -lke deaminases (AADs bootstrap values 80% and 90% are marked with AID Eukaryota SNAD2 0.4 Secreted PmCDA1-like Porifera green-outlined and red-outlined circles, respectively Amoebozoa PmCDA1 Bacteria+Eukaryota Lophotrochozoa Cnidaria Echinodermata (also see SI Appendix, Fig. S21). Hel, helix; L, loop; SP, PmCDA2 vertebrate AADs Core Bacteria signal peptide; Str, strand.

(SI Appendix,Fig.S9); (ii) a clade found in lophotrochozoans and These unique structural features of the SNADs, combined echinoderms with a metal-chelating insert involving elements from with LSEs and differential loss of paralogous , indicate the extended loops 1 and 7 (SI Appendix,Fig.S10); and (iii)lineage- rapid and extensive divergence from the ancestral AAD arche- specific clades containing the versions from nematodes and arthro- type. The SNADs along with the AADs we detected in dictyos- pods, respectively (Fig. 2A and SI Appendix, Figs. S11 and S12). teliids are unique among the eukaryotic deaminases in being The invertebrate classical AADs are dominated by LSEs, secreted. The dictyosteliid AADs possess signal peptides like ranging from 2 to 22 paralogs per taxon (Fig. 2B and SI Ap- SNAD1, SNAD2, and SNAD4 and likewise show independent pendix, Figs. S2 and S21). These numbers can widely differ within lineage-specific expansions (6–7 copies) in D. fasciculatum and a single phylum: for example, among molluscs, the Japanese oyster T. lacteum (SI Appendix, Fig. S19). They are distinguished by a Crassostrea gigas shows a large expansion of 22 distinct AADs, distinct C-terminal domain with four cysteines that are likely to whereas Biomphalaria glabrata appears to possess only a single form disulfide bonds (SI Appendix, Fig. S19). However, their representative. In general, several species of lophotrochozoa and extreme divergence prevents us from establishing a specific re- echinodermata display LSEs of classical AADs, whereas those lationship between them and the metazoan SNADs. Together, arthropods that possess AADs typicallyencodeasingleorafew these secreted versions resemble the ancestral AADs from bac- copies. Among cnidarians with AADs, approximately one-half terial polymorphic toxins (22, 39), which were the likely precursors contain only a single copy, while the other show LSEs. The of eukaryotic AADs. Another functional analogy is offered by the prevalence of LSEs among invertebrate lineages parallels the nucleic-acid–targeting CR (Crinkler-RHS–type) effectors, which APOBEC3 paralog expansion in mammals. are secreted by various eukaryotes and delivered into recipient cells (40). The SNADs and dictyosteliid AADs might be similarly de- The Secreted AADs. The secreted AADs from metazoa can be livered into virus-infected cells or cells of extracellular parasites/ divided into four major subclades, SNADs1–4. SNAD1, SNAD2. pathogens. In vertebrates, the SNADs show a striking pattern of and SNAD3 clades are specific to vertebrates, whereas SNAD4 is gene loss in lineages that reliably maintain a constant high body found only in sponges (Fig. 2 A and B and SI Appendix,Figs. temperature, namely birds, marsupials, and placental mammals. By S2 and S15–S18). In vertebrates, SNAD1 shows the widest phyletic contrast, they are present in the basal members of these lineages spread, being present in amphibians, reptiles, the monotreme that are either poikilothermic or have lower body temperatures. mammal the platypus, the coelacanth, and actinopterygians, Hence, they might be deployed against pathogens specifically af- whereas SNAD2 and SNAD3 are only found in actinopterygians. fecting organisms with lower body temperatures (41). Hence, in vertebrates, SNAD1 is likely the ancestral clade that goes back to at least the stem euteleostomian, while SNAD2 and Cytosine in Single-Stranded Nucleic Acids Is the Likely Substrate SNAD3 were derived from it as actinopterygian-specific expan- Across the AAD Clade. We used functional data for the core ver- sions. SNADs show repeated LSEs in certain actinopterygians and tebrate AADs and the more distantly related tRNA-modifying turtles, while SNAD4 members are found as LSEs even in indi- TadA as models to understand the structure-function relationships vidual sponge species (Fig. 2B and SI Appendix,Figs.S2andS21). that are conserved across the AADs. Across the deaminase su- The SNADs have accreted several unique structural features perfamily, nucleic acid and nucleotide substrates are similarly po- around the core deaminase fold: SNAD1 and SNAD2 share a sitioned with the target base inserted into a binding pocket lined by + helical insert between strands 1 and 2 (Fig. 1D and SI Appendix, helices 2 and 3 in proximity to the catalytic Zn2 ion (Figs. 1B and Figs. S15 and S16). In place of the usual helices 5 and 6 of the 3A). In both TadA [ (PDB): 2b3j] and deaminase core domain, they contain a neomorphic strand 6 and APOBEC3A (PDB: 5sww), the target base is flipped out into the 7, which we predict to form a β-hairpin packing with the core active site, away from the rest of the single-stranded nucleic acid strand 5 (Fig. 1 A and D and SI Appendix, Figs. S15 and S16). backbone that is positioned in a U-shaped conformation (42, 43) Furthermore, SNAD1 and SNAD2 possess a unique cluster of (Fig. 3A). In both cases, the base 5′ to the target base (i.e., −1po- four cysteine residues with the first cysteine at the end of strand 5, sition) is also flipped out, whereas the 3′ base (+1 position) of TadA a CxxC motif in the middle of strand 6, and the last cysteine in the also shows an altered conformation. These flanking bases might middle of strand 7, which might form disulfide linkages to stabilize make contacts with the binding pocket and residues from loop 1 the secreted protein (Fig. 1 A and D and SI Appendix, Figs. S15 and (between helix 1 and strand 1), loop 3 (between strand 2 and S16). SNAD3 lacks the above insert, has an N-terminal helical ex- helix 2), loop 5 (between strand 3 and helix 3; in APOBEC3A), tension in lieu of the signal peptide, and a distinct helical insert in the end of strand 4, loop 7 (between strand 4 and helix 4), and the the loop-9 region (Fig. 1D and SI Appendix,Fig.S17). Additionally, terminal helix (helix 5; in TadA). these clades show unique lineage-specific sequence synapomorphies Side chains of the following residues make specific contacts (SI Appendix,TableS1). with the target base (Fig. 3A): (i) the active site histidine (PDB:

4of10 | www.pnas.org/cgi/doi/10.1073/pnas.1720897115 Krishnan et al. Downloaded by guest on September 28, 2021 5sww: H70; PDB: 2b3j: H53) establishes π-π interactions; (ii)an biological conflicts (i.e., immunity), it does not directly act on viral PNAS PLUS asparagine (N57 in APOBEC3A; N42 in TadA) at the beginning nucleic acids. Instead, it mutates the DNA of the host itself as part of loop 3 makes polar contacts and might further interact with the of the conserved antibody-diversification process in gnathostomes. phosphate backbone of the target nucleotide (as in APOBEC3A); Thus, AID seems to be under purifying rather than positive se- and (iii) a tyrosine (Y130, in APOBEC3A) in loop 7 projects into lection compared with the APOBEC3 and APOBEC1 proteins. the active site pocket and hydrogen-bonds with the backbone These observations serve as comparative yardsticks for the phosphate and also makes a π-π stacking interaction with the contrasting selective forces that have acted on different clades target nucleotide cytosine, as previously predicted (43). Analysis descending from the ancestral member of the AAD superfamily. of the variability of these residues suggests that, while the histi- Although there has been no direct experimental demonstration dine is absolutely conserved in active AADs, the asparagine in of the biochemical roles of APOBEC4 and APOBEC2, we find loop 3 might be replaced by other small residues such as serine, that, like AID, they show low global sequence entropy. Mutational aspartate, or proline, and in some cases, a histidine residue, data support a role for the deaminase active site of APOBEC2 in which can still play equivalent roles in contacting the cytosine. zebrafish retina (46) and mouse muscle (47) development. Hence, The tyrosine in loop 7 might be replaced by phenylalanine or these AADs are likely involved in a conserved nucleic acid-editing tryptophan, hence retaining the aromatic character at this posi- function, possibly during development. With respect to the iden- tion. This configuration is a unique and strongly conserved fea- tified clades, we found that three groups of AADs show high ture across the AADs (Figs. 1A and 3 A and E). The aromatic global sequence entropy comparable to or even higher than residue in loop 7 is missing in TadA, with its place taken by a APOBEC1 and APOBEC3: smaller residue (e.g., D104 in TadA; PDB: 2b3j). As we previously predicted (22), an aromatic residue limits the size of the active site, i) The SNADs: Here, high variability is present across the such that only a pyrimidine base can fit into the pocket. By con- sequence with the greatest diversity within the loops impli- trast, small residues at this position and other substrate-contacting cated in substrate binding and specificity (Fig. 3 C–E and SI positions in the loop 7 of TadA result in a larger substrate-binding Appendix, Fig. S22). pocket accommodating an adenine (42). Additional residues in ii) The NAD1 clade found in several vertebrates. loops 1, 3, 5, and 7 and strand 4 contact the target base mainly via iii) Most invertebrate classical AAD clades; these include clades polar interactions with the polypeptide backbone, or interact with exhibiting LSEs (average entropy values range between 2.0 the sugar-phosphate backbone of the nucleic acid (Fig. 3A). One of and 2.5) and those from low-copy-number lineages (e.g., these is the position at the end of strand 4, which is a strongly arthropods; average entropy = 2.4). This pattern is indica-

conserved basic residue (R128; PDB: 5ssw) in the core vertebrate tive of a function(s) in antiviral response, as is known for INFLAMMATION AADs, among others. It makes a polar contact with the backbone APOBEC1 and APOBEC3, with their divergence resulting from IMMUNOLOGY AND phosphate and possibly a cation-π interaction with the target base. a comparable arms-race scenario in these biological conflicts. However, substitutions at these positions are unlikely to drastically alter the size of the substrate pocket and the specificity for the target base. Based on these considerations, we propose that most Diversification of Regions Recognizing the Target Sequence in AADs. AADs target a cytosine for deamination. We then investigated local variations in positional entropy at sites Inspection of the TadA structure reveals two residues: E45 (PDB: potentially associated with substrate interaction, aided by pub- 2b3j) in loop 3 and D104 (PDB: 2b3j) in loop 7 that hydrogen-bond lished cocrystal structures of deaminases with their substrates. to the 2′ hydroxyl group of the ribose sugar. Indeed, a recent mu- Comparison of the substrate-bound TadA and APOBEC3A tagenesis study has shown that alteration of D104 (PDB: 2b3j) to structures (42, 43) shows that they make extensive contacts asparagine increases efficiency of adenine deamination in DNA extending up to two bases each on the 5′ and 3′ sides of the substrates with respect to WT TadA (44). Hence, loops 3 and 7 might target base, implying a 5-nucleotide-long recognition sequence. be involved in RNA vs. DNA discrimination. However, residues at These ancillary contacting residues emerge from the same loops these two positions, although often enriched in polar residues, are not (1, 3, 5, 7) that also contact the target base (Fig. 3A). A con- universally conserved even across the TadAs (22), let alone the siderable diversity of interactions is observed between the poly- AADs. Further, in AADs, the equivalent of D104 is instead involved peptide side chains or backbone with the bases and the sugar- in pyrimidine selectivity as opposed to nucleic acid backbone in- phosphate backbone of the nucleic acid. Of these interactions in teraction. This indicates that there are unlikely to be any widely APOBEC3A (PDB: 5sww), the side chains of D131 and Y132 in conserved specificity determinants for the type of nucleic acid, of- loop 7 and W98 in loop 5 make direct polar or van der Waals fering an explanation as to why APOBEC1 (36) or TadA (45) targets contacts with the −1 base of the target DNA (43). Similarly, side both DNA and RNA substrates. In addition to clade-specific deter- chains of H29 and K30 residues in loop 1 of APOBEC3A respec- minants, extraneous factors, such as the tissue of expression, in- tively make π-π stacking and cation-π interactions with the +1 base, teraction with other proteins, and subcellular localization, might play whereas R28 in loop 1 makes a cation-π contact with the −2base. a key role in determining the type of nucleic acid that is modified. Characterized AADs show different target sequence specificities, often with some degeneracy. AID prefers a WRC (W: A/T, R: A/ Evidence for Differential Action of Selective Forces Across the AAD G) motif, whereas APOBEC3A prefers thymine at the −1and Clade. We used global and local position-specific Shannon en- cytosine at the −2 positions (43, 48). Keeping with their lower tropy analysis to measure sequence variability and infer the pos- average global entropy, the entropy of substrate-binding positions sibility of purifying or positive selection on different AAD clades. is close to zero in AID, APOBEC2, and APOBEC4 (Fig. 3E). In We combined this information with published structures of AADs contrast, APOBEC1, APOBEC3, most invertebrate classical AAD in complex with their substrates to interpret the observed varia- clades, and the SNADs show high sequence entropy at these po- tions (Fig. 3 A–D and SI Appendix,Fig.S22). Global entropy sitions (Fig. 3 D and E and SI Appendix,Fig.S22). Of note is the values showed that AID, APOBEC2, and APOBEC4 are slowly frequently observed change in residue character at these positions evolving, with AID being the slowest evolving of the three. In within the same clade. These include polarity shift (polar vs. hy- contrast, APOBEC3s, while originally derived from the more drophobic), charge inversion (positive vs. negative), or gain/loss of ancient AID (22), are evolving fast. High global sequence entropy charge (charge vs. neutral) (Fig. 3E), highlighting the potential can be associated with a role in an arms race with parasites (Fig. 3 positive selection on these proteins. C and D and SI Appendix,Fig.S22): APOBEC3 paralogs and Substrate-binding loops might also display dramatic length APOBEC1, which are known to be deployed against viruses and variations (Fig. 4). We consistently observed a tendency for retroelements, show high sequence variability, in turn an indicator convergent emergence of long inserts in different loops stabilized by of positive selection arising from constantly evolving parasite re- metal chelating residues or disulfide bonds that are defining fea- sistance against them. In contrast, although AID is also involved in tures of particular clades (Fig. 1 C and D and SI Appendix,TableS1),

Krishnan et al. PNAS Latest Articles | 5of10 Downloaded by guest on September 28, 2021 ABELoop-7 +2 -2 1) Y130 5.04 3.80 4.51 4.71 5.04 4.40 3.55 4.05 5.04 3.16 5.36 3.10 4.13 3.76 4.95 3.99 4.24 +1 D311 Y 2.52 1.90 F 2.26 2.35 2.52 2.20 1.77 2.03 2.52 1.58 2.68 1.55 Y 2.06 W 1.88 F 2.47 W 1.99 W 2.12 -1 Y Y Y Y F G Y Y H V N N S F H W F T Y N Y R E F H C L Y S H F H P S F E R N S Y Q C N 0 Y 0 H 0 0 C 0 Y 0 H 0 D S 0 0 Y 0 T 0 F 0 Y 0 C 0 0 D 0 S H L F 0 H 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 L255 F258 2) D131

1.17 Zn F290 4.56 2.06 2.84 2.03 2.31 3.87 1.29 1.77 4.72 1.15 5.45 3.05 1.29 2.89 2.12 1.81 Y269 K K K D I262 W Y F K R D H H294 M K R S264 Surface-1 Y A S 2.28 1.03 Y 1.42 1.02 1.15 1.94 0.65 0.88 2.36 0.58 2.72 1.52 0.59 Q 0.64 1.44 1.06 0.91 G R D K N N Q H K H Hel-2 R Y K R R N S H F Y K S M F W E E E M G T Q V E H P Q T Q G N R T Q N P F Q R Y T K R Q T V V Hel-3 S G Hel-4 L M G T V E D I R T N K C L H S H P S Q C G H S S L V R E W W P N F K L L I L F C F L H 0 0 Q 0 D 0 0 0 0 D 0 0 0 0 Q N H F T 0 F I 0 P 0 D 0 D 0 F 0 E 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 AID Lop-Ech-Cni C 4 4 3) Y132 3 3 6.04 3.89 6.28 1.84 2.28 3.87 0.93 2.92 4.66 2.09 3.23 0.97 1.16 2.98 2.19 1.24 2.11 2 K 2 H P I V P L entropy 1 entropy 1 H E Positional Positional Positional Positional T F L 1.06 0.58 Y 3.02 1.95 3.14 0.92 1.14 1.94 0.47 1.46 2.33 1.04 S 1.61 0.48 L 1.49 1.09 0.62 P G 0 0 L S D I D Y V R G P D P P APOBEC3 Arthropoda N H K H D F G I L A E V A T Y F T F S Y 4 4 K L Q M S I R K V M E K T A S A N I T V W H Q R N K L L E L Q I K Q R P C M W P R I N S C K W L T V E W T G S N D P T F N 0 D 0 G S 0 H 0 S 0 A S K 0 T F M E 0 0 M 0 0 0 Q 0 0 R I 0 0 C 0 R 0 3 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2 2 entropy 1 entropy 1 Loop-1 Positional Positional Positional Positional 0 0 1) T31

4.2 4.66 2.90 SNAD3 Lop-Ech-Zn 4.96 2.27 4.17 3.05 4.21 1.72 2.07 3.64 3.87 3.85 2.37 4.36 3.64 1.51 4 4 T 3 3 T M T 2.10 2.33 1.45 2.48 1.14 2.09 1.53 2.10 0.86 T 1.03 1.82 1.94 1.93 1.18 2.18 1.82 0.76 C T S 2 2 C D V A Y T T V A A C D N P entropy 1 entropy 1 S V A S K S V N Positional Positional Positional Positional G Y L K R P I Q A S A N Y T L H Q G C M S S K H L L N 0 Q V T F Q W 0 I V 0 D T P 0 0 0 T 0 E G 0 0 N 0 D 0 T 0 0 A 0 H 0 E 0 E 0 E 0 1.00 1.001.00 1.00 0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.001.000 1.00 1.00 4 D 2) K30 3.78 2.51 4.16 1.35 2.69 4.21 0.74 R 1.04 3.55 1.90 4.74 1.37 0.99 1.28 N 2.82 1.45 D 1.25 T K E V W R G S 3 T K T L D T N R 1.89 1.25 2.08 0.67 E 1.35 K 2.10 N 0.37 0.52 1.78 H 0.95 2.37 0.69 K 0.50 0.64 1.41 K 0.72 E 0.62 D K G E N S N T E Q R V A F W P S R G R V D H S A A Q T T T W E M P S Q R V N N S R H V R G V D Q M M Q Q L L M A S I I L K Q A G C M Y Q G K K E Y Q V L D H H 0 N 0 N Y 0 N D 0 G 0 D 0 I 0 K 0 L 0 0 C 0 P 0 N 0 R 0 A 0 0 L 0 D E 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3) H29 4.21 1.91 5.04 2.51 3.00 5.45 1.08 N 0.90 V 3.16 1.56 2.84 1.75 1.75 1.34 1.65 1.81 1.44

Positional entropy Positional 1 R H K K L R D F K R Y K N E V 2.10 0.96 2.52 1.25 1.50 Y 2.72 0.54 0.45 1.58 0.78 1.42 0.87 0.88 0.67 T 0.83 0.90 0.72 N E T V F P H N A R S S M R R S Q Y K K K I S D Q K K Q R H I S Y T W I R N G 0 H Q T S N T F H N D C Q D T E H K E Q E A L Q S Y R V P P A I P G S F N Y H Q N G K D L G N H S H W D D N C N F R G S W S S F T 0 N 0 R 0 H 0 Y 0 0 0 A 0 E 0 E 0 D 0 0 A 0 0 0 I D A A 0 F 0 E 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

ria

AID AID -Cni

NAD1 NAD2 NAD1 NAD2

SNAD1 SNAD2 SNAD3 SNAD4 SNAD1 SNAD2 SNAD3 SNAD4

Bacte Bacteria

APOBEC1 APOBEC2 APOBEC3 thropoda APOBEC4 APOBEC1 APOBEC2 APOBEC3 APOBEC4

Nematoda Nematoda

Lop-Ech-Zn Lop-Ech-Zn Ar Arthropoda Core vertebrate Lop-Ech-Cni Core vertebrate Lop-Ech Classical Cnidaria-algae Secreted Classical Cnidaria-algae Secreted AADS AADS

Fig. 3. (A) Illustration of human APOBEC3A bound to ssDNA substrate (PDB ID code 5ssw) showing key APOBEC3A residues involved in Zn chelation and in- teractions with the target and neighboring bases (labeled −2, −1, +1, and +2). (B) Surface view of APOBEC3F (PDB: 4j4j) depicting Vif1-binding residues obtained from various studies. Buried residues: red; solvent exposed: blue. (C) Positional entropy values for various AAD clades with mean entropy values (blue horizontal lines) and secondary structures on top. Loop-1 and -7 residues are shown in E: blue dots. Refer to SI Appendix,Fig.S22, for details. (D) Boxplots comparing global entropy values. Mean entropy values of >2, <2, and <1.5 are colored deep-brick, blue, and green, respectively. (E) Sequence logos of key substrate-binding residues in loops 7 and 1. Loop-7 residues are equivalents of APOBEC3A (PDB: 5ssw) Y130, D131, and Y132. Loop-1 residues are equivalents of H29, K30, andT31.

as follows: (i) a previously undelineated metal-chelating insert in feature of this clade; (iii) an insert stabilized by a metal-chelating APOBEC4 supported by two cysteines and a histidine from loop histidine from loop 1 and three further Cys/His residues from loop 1andacysteinefromloop3;(ii) a metal-binding insert within loop 7 in the lophotrochozoan-echinoderm clade; and (iv) multiple 1 of several cnidarian and algal sequences, which is a conserved disulfide bond-stabilized loops or extensions in vertebrate SNAD

6of10 | www.pnas.org/cgi/doi/10.1073/pnas.1720897115 Krishnan et al. Downloaded by guest on September 28, 2021 and dictyosteliid AAD clades. Together, these structures join the adaptations that extend or modulate the role of the deaminase PNAS PLUS previously reported weak Zn-binding dyad of histidines in some activity in biological conflicts with viruses and parasites. members of the APOBEC3 clade (49). Despite differences in the chelating residues and locations of these clade-specific inserts, ex- Divergence Arising from Potential Selective Pressures Imposed by amination of their inferred positions suggests that they are proximal Viral Inhibitors. The HIV-1 Vif protein acts as a counter de- to the protein-nucleic acid interface. fense against APOBEC3s by binding to either of their tandem Comparisons across AADs show that of the four substrate- (APOBEC3G, APOBEC3F, and APOBEC3D) or solo (APOBEC3C/ binding loops, loop 3 shows the greatest variations in lengths 3H) deaminase domains; this triggers ubiquitin-mediated degra- across different clades, followed in order by loops 1, 7, and 5 dation by recruiting an EloB/C-CUL5-Rbx2 E3 ubiquitin- (Fig. 4). Clades with low global sequence entropy (e.g., AID, complex (21). Mutagenesis suggests that Vif contacts APOBEC3s APOBEC2, and APOBEC4) also show little variability in loop either through residues in helices 2, 3, and 4 and loop 4 (APOBEC lengths. It has been previously reported that the length of loop 3C, APOBEC 3F), or through residues at the C-terminal part 3 of APOBEC3 varies considerably (50); our dataset catalogs a of loop 7 (APOBEC3H) (for review see ref. 52). A consensus range between 6 and 20 residues, with loop lengths of 6, 9, or emerging from several studies (52) implicates 13 residues com- 13 being most represented. Several other clades with high global monly across APOBEC3s (L255, F258, C259, I262, S264, Y269, entropy, such as the SNADs, the dictyosteliid secreted AADs, and E289, F290, H294, D311, T312, D313, and E324 in APOBEC3F; multiple invertebrate classical AADs, also show much loop-length PDB: 4j4j). However, several of these residues are not solvent- variation. For example, the SNAD2 loop 3 varies between 3 and exposed, suggesting that they are only indirectly involved in 64 residues, with ≥42 sequences possessing a loop longer than transmitting conformational changes upon Vif binding. We 10 residues and 29 sequences with lengths >15 residues (Fig. 4). asked whether the entropy of solvent-accessible positions in the Similarly, classical AAD clades—namely, (i) the cnidaria-algae above set of residues might reveal any general principles for clade; (ii) the clade encompassing cnidarian, lophotrochozoan, interactions of AADs and viral inhibitors. Five of the above and echinoderm representatives; and the SNAD4 clade—display 13 positions, two in helix 2 (PDB: 4j4j; C259, E289), one in helix length variations in loops 1 and 7. The arthropod clade shows 4(PDB: 4j4j; E324), and two in loop 7 (PDB: 4j4j; T312, D313), length variations in loops 1 and 3. Loop 5, which hosts the cys- are both solvent-exposed and high in entropy (2.6–3.5; Fig. 3B). + teines chelating the catalytic Zn2 , is usually short in most clades Investigation of neighboring solvent-exposed residues provided but shows inserts between the two cysteine residues in the additional high-entropy positions in loop 4 and helix 3 (PDB: cnidaria-algae, nematode, and SNAD3 clades. 4j4j; P265, A292, R293). Collectively, these positions locate to Thus, both low entropy values at target-interacting positions two distinct surfaces of the APOBEC3 structure (Fig. 3B): sur- INFLAMMATION

and low loop-length variability suggest that AID, APOBEC2, and face 1 includes residues at the C terminus of helix 2, the middle IMMUNOLOGY AND APOBEC4 retain relatively conserved target specificities (Figs. 3E of loop 4, and the C terminus of helix 3; surface 2 is made up of and 4). In contrast, most other AADs show concomitant variability residues from the C terminus of loop 7 and the N terminus of in the shared target-recognition residues and inserts in the loops helix 5. Only surface 2 shows a small overlap with the DNA- predicted to be close to the bound substrate (Figs. 3E and 4). binding interface (Fig. 3B). The additional parts of surface Hence, it is possible that both variability at positions targeting the 2 are equivalent to the dimeric interface of TadA (51) or the substrate and variable lengths of adjacent loops underlie oligomeric interface of APOBEC2 (53), respectively. The overlap the evolution of new target sequence specificities; in this way, the of the with a dimerization/oligomerization surface enzymes would counter emerging resistance of viral or parasite suggests that, in addition to recruiting a ubiquitin ligase complex, targets to deamination as a result of mutations in the target se- Vif might also interfere with oligomerization of APOBEC3s. quences. A second, but not mutually exclusive, possibility is that Analysis of these regions in other AADs shows that, except for loop-length variation is associated with altered stability or evolu- AID, APOBEC2, APOBEC4, NAD2, and the nematode AADs, tion of additional interfaces for protein oligomerization. For in- the remainder show high positional entropy. Such a changing stance, loop 3 of TadA and APOBEC3, respectively, is involved in profile of exposed residues within and between clades suggests dimerization and oligomerization. Oligomerization of APOBEC3 that most of the rapidly diversifying AADs are probably targeted (e.g., APOBEC3G) has been shown to be essential for restriction by viral inhibitory factors analogous to Vif. of HIV-1 by facilitating its binding to the viral template strand to block the reverse transcriptase from catalyzing DNA elongation Tandem Duplications, Gene Loss, Alternative Splicing, and Potential (49, 51). Thus, loop-length variations might also reflect accessory Intergenic Recombination in AADs. As previously reported for the

Loop-1 Loop-3 Loop-5 Loop-7

Bacteria | | | | | | ||| | | | SNAD4 | | | | | | | | | | SNAD3 | | | | | | | | | | | | SNAD2 | | | | | | | | | | | | | | | | | | | | | | SNAD1 Secreted Arthropoda | | | | | | | | | | APOBEC4 | | | | ||| | | | | | | | | | | | | | | |

Cnidaria−algae AADs Lop−Ech−Zn | | | | | | | | | | | | Lop−Ech−Cni | | | | | | | | | | NAD2 | | | | Classical NAD1 | ||| | | | | APOBEC3 | | | | | | | | | | | ||| | |

APOBEC2 Core APOBEC1 ||| | | | | | Vertebrate AID | | | ||| 0 0 5 5 0 0 5 5 30 60 15 20 25 20 40 50 45 50 15 20 25 20 25 30 10 15 10 10 10 15 35

Fig. 4. Length diversity of substrate-binding loops in AADs. AAD clades are listed on the y axis. For each clade, the range of the loop lengths in amino acids (x axis) is shown. The sequences corresponding to a loop of a particular length are shown as a circle centered at that length, with radius scaled by number of sequences normalized for a clade. The coloring scheme is a rainbow spectrum ranging from red (small number of sequences) to violet (large number of sequences), with the vertical blue lines representing the median loop length.

Krishnan et al. PNAS Latest Articles | 7of10 Downloaded by guest on September 28, 2021 APOBEC3s (32), we noted that at least a subset of the paralo- Major Evolutionary Trends in AADs and Their Functional Implications. gous AADs form an LSE cluster in the same genomic region. This study greatly extends the phyletic spread of the AADs in This was observed in several distant species, such as the metazoa (Fig. 5). Entropy analyses indicate high sequence vari- dictyosteliids D. fasciculatum and T. lacteum, the brachiopod ability of regions and residues involved in target motif de- Lingula anatina, sea urchins, the cnidarian Nematostella vectensis, termination, multimerization, or where potential inhibitory factors and lampreys (see companion paper). Like the intraspecific poly- analogous to Vif may bind. This is observed for several AAD morphism of APOBEC3s with respect to paralog number (54), we clades (e.g., APOBEC1, APOBEC3), almost all invertebrate found that the number of lamprey CDA1-like paralogs varies, clades (except for the nematode AADs), the SNADs, and dic- and that they can be entirely lost even in individuals of the same tyosteliid AADs, suggesting that these genes are under positive species (see companion paper). To explore if such interindivid- selection. The evolutionary trends among the AAD clades can be ual variations might be more generally observed, we performed subsumed under the following four categories (Fig. 5): whole-genome shotgun sequencing of the sea urchin Strongylocentrotus i) Clades that typically contain one well-conserved copy per purpuratus, Lingula, and the oyster Crassostrea gigas,andas- sembled AAD sequences from these reads. We then compared species. These are likely involved in conserved functions like these sequences with those from the individuals whose se- AID; in turn, this suggests that comparable conserved editing quence is deposited in the GenBank database. In each case, we roles are likely performed by APOBEC2 and APOBEC4. observed that, while the complement of AAD genes encoded by ii) Fast-evolving clades characterized by numerous paralogous the genome was comparable, there were individual-specific genes within each species. Here, further variability via tandem differences (SI Appendix,Fig.S23), with some paralogous duplications and gene loss, alternative splicing, or intergenic copies found only in one individual or the other (SI Appendix, recombination might also be at work, e.g., the APOBEC3s, Fig. S23). In Nematostella, we could identify an AAD pseudo- SNADs, dictyosteliid AADs, several invertebrate AADs, and gene in the tandem gene array (SI Appendix,Fig.S23), in- CDA1-like genes in lampreys. dicating that paralog number variation could proceed via such iii) Rapidly evolving clades typically present in relatively low pseudogenization events coupled with duplications in tandem copy number, such as those from the insect lineages poly- gene arrays. neoptera, mecoptera, and zygentoma. In certain metazoans, paralogous copies from a given species iv) Repeated emergence of catalytically inactive versions of revealed the presence of a substantial segment of completely AADs. This is observed in some members of the APOBEC3 identical sequence that might be shared by one set of AAD clade, AADs from Ciona intestinalis, and lophotrochozoan- paralogs, followed by a nonidentical segment distinguishing these echinoderm metal-binding clade AADs in the sea urchin paralogs. However, this nonidentical segment was found to be Evechinus chloroticus. As proposed for inactive APOBEC3 do- completely identical with the corresponding segment in another mains (31, 32, 52), these inactive copies might function either as set of paralogs from the same organism. For example, the eight decoys to bind viral inhibitors or merely bind invasive nucleic SNAD3 paralogs obtained from the assembled transcriptome of acids without mutating them. the emerald rockcod (Trematomus bernacchii)(SI Appendix, Fig. S24) suggested that they all could be visualized as being consti- We interpret LSEs and rapid sequence and/or structural tuted from different combinations of a small set of distinct seg- diversification, along with the frequent gene losses (Fig. 5), as ments: three N-terminal segments (nt1, nt2, nt3), two central signs of an ongoing arms race with pathogenic nucleic acids. segments (c1, c2), a single globally conserved segment (s1), and By analogy to known AADs, we suggest that DNA viruses, two distinct C-terminal segments (ct1, ct2) (SI Appendix,Fig.S24). retroviruses, and transposons are also targets for some of the A similar mosaic sequence pattern was also observed in the four AAD clades discovered herein. Many vertebrate AADs are Nematostella paralogs of the cnidaria-algae-APOBEC4 clade, in expressed in lymphocytes, and this correlates with the re- nine Anthopleura AADs, and in APOBEC1 paralogs from the markable lymphotropy of several retroviruses and DNA turtle Malaclemys terrapin (SI Appendix,Fig.S24), which we re- viruses (56, 57). Based on this observation, we propose that the covered from transcriptome assemblies (55). This pattern is un- biological conflict between AADs and viruses took shape in the likely to emerge from a simple divergence of paralogous copies. In lymphocyte or its invertebrate immunocyte counterpart. This the case of Trematomus SNAD3 and the Anthopleura AADs, the proposal is attractive, because it illuminates some intriguing as- genomes are unavailable; hence, the exact basis of this mosaic pects of the evolution of animal immunity. In a first step, viruses pattern in paralogs obtained from the assembled transcriptome is might have evolved to specifically infect immunocytes or lym- unclear. In Malaclemys APOBEC1, all of the mosaicism was re- phocytes, for this provides them with a host cell of astounding stricted to the C-terminal region of the paralogs. In the lamprey proliferative capacity. This also offered the viruses an opportunity AADs, we detected a complex pattern of alternative splicing, to interfere with host innate or adaptive immune responses that giving rise to proteins with C-terminal diversity (see companion relied on these cell types. In response, the host might have evolved paper), suggesting that a similar process could be active in gen- mechanisms to deploy AADs in immune cells to neutralize lym- erating the APOBEC1 isoforms in Malaclemys.InNematostella, photropic viruses by mutagenic inactivation. Once this co- when we compared the structure of the AAD paralogs deduced evolutionary process was set in motion, it possibly provided the from the mRNA sequences in the assembled transcriptome with stepping stone for a second phase, during which “institutionalized” the corresponding genes in the genome, we found that, barring mutator variant(s) of the AADs, such as AID in gnathostomes and one paralog, the remainder did not have identical genomic independently a subset of the CDAs in lampreys (20), were counterparts. Rather, they appeared to be mosaic, with different recruited. These facilitated the emergence of a fundamentally new segments corresponding to different genes (SI Appendix,Fig.S24). aspect of adaptive immunity, namely somatic diversification of Given that most of the coding frames of these genes derive from a antigen-receptor genes by direct mutagenesis or by triggering single exon, barring a conflation in transcriptome assembly, this DNA recombination. mosaic pattern of mRNAs possibly results from recombination In snails such as Biomphalaria, somatic diversification of between the different genomic copies. Since the transcriptomes the polymorphic plasma lectins, the fibrinogen-related proteins were derived from adult somatic tissues (55), this would imply that (FREPs), is proposed to be a component of an anticipatory im- such a recombination between the paralogous genes in the ge- mune system (58–60). Analogous to vertebrate immunoglobulins, nome occurs in somatic tissues to generate recombinant versions. FREPs recognize antigenic variations in snail parasites such as Alternatively, it might result from ongoing recombination between trematodes and exhibit signatures of somatic mutations (60). We the genomic copies in different individuals, given that the indi- identified a subclade of AADs within the lophotrochozoan- viduals from which the transcriptome and the genome sequences echinoderm-metal-binding clade, which is specifically conserved were derived are not the same. across gastropods, including Biomphalaria (APKA01034104.1)

8of10 | www.pnas.org/cgi/doi/10.1073/pnas.1720897115 Krishnan et al. Downloaded by guest on September 28, 2021 Various aspects of deaminase diversification described above PNAS PLUS APOBEC3 Eutherian mammals might also play out more widely across other distantly related NAD2 Platypus deaminases. In this study, we noticed a similar LSE of the DYW APOBEC1 Reptiles clade of deaminases in dinoflagellates of diverse lineages, in- SNAD1/NAD1 Amphibians SNAD2/3 cluding Gonyaulacales, Suessiales, Syndiniales, and Noctilucales AID/APOBEC2 Coelacanth (SI Appendix, Fig. S25). Genomes of individual species might contain anywhere from 1 to 39 paralogous DYW deaminases. PmCDA1-like &PmCDA2 Based on the widespread presence of RNA editing in chloro- Jawless vertebrates plasts of Symbiodinium minutum (66), it is possible that the Tunicates Cephalochordates DYW deaminases are involved in this process, perhaps in a gene- Lop-Ech-Zn Hemichordates specific way as proposed for the plant chloroplasts. However, it is Echinoderms also possible that the expansions reflect an adaptation to the Lop-Ech-Cni Hexapods conflict with viruses or transposons, similar to what we had Crustaceans APOBEC4 Nematodes earlier proposed for comparable expansions of another clade of Metazoans Cnidaria-algae Lophotrochozoans deaminases in fungi (22). Cnidarians In conclusion, this report provides the basis for further in- SNADs/AADs Placozoans vestigations of mutagenic deaminases in as yet poorly studied Ctenophores SNAD4 Poriferans biological contexts. We suggest that some of these enzymes may Fungi also serve as reagents for biotechnological purposes. Amoebozoans (Dictyosteliids) Materials and Methods LECA Archaeplastids/plants Sequence Searches. Sequence searches were conducted using the PSI-BLAST Chromalveolates (67) and JACKHMMER (68) algorithms against the nonredundant (nr) data- Excavates base at the NCBI and all ENSEMBL database proteomes. Transcriptome Bacteria Shotgun Assembly (TSA) and WGS contig databases were queried using Bacterial toxin systems containing deaminases TBLASTN. Open reading frames (ORFs) were predicted using the getorf Fig. 5. Evolutionary reconstruction and origins of various AAD families. The program from EMBOSS 6.5.7 software package (69). Profile-Profile searches provenance/distribution of AAD clades is superimposed on a simplified were run using the HHpred (70) program either against the PDB or Pfam eukaryotic tree. Presence of AADs (green), LSEs of AADs (red circles with databases. MSAs were built using the KALIGN, MAFFT, and GISMO programs INFLAMMATION

outer glow), and absence/potential loss of AADs (yellow markers/markers (71). Secondary structures were predicted with the JPred program (72). IMMUNOLOGY AND with cross). Tunicate inactive versions (gray circle); dotted lines indicate al- ternative lateral transfer scenarios from bacteria to stems of metazoa WGS Libraries. Individual C. gigas (procured from the Isle of Sylt, Germany) and dictyosteliids or to stems of ophisthokonta-amoebozoa with loss in and L. anatina (caught in Nha Trang Bay, Vietnam) and S. purpuratus fungi (“?”). (caught off the Californian coast) specimens were used to extract genomic DNA (total body for C. gigas and L. anatina; coelomocytes for S. purpuratus). Libraries were made from a minimum of 1 μg of total genomic DNA and × and Aplysia californica (GBBE01059801.1). These AADs are were sequenced at 2 250 bp paired end to a depth of 150 million reads using HiSeq2500 Rapid Mode (Illumina). Adapter sequences were trimmed distinguished from the rest of the clade by a loss of the conserved off the read ends, and read files were filtered through quality-control checks glutamate in the HxE motif but possess a compensatory con- and formatted into nucleotide databases (SI Appendix, SI Methods). Sequences served E after the second C of the CxnC motif, which might 2+ were deposited in the Sequence Read Archive at the NCBI with the following similarly chelate the catalytic Zn (SI Appendix, Fig. S10). Like accession numbers: SAMN08013505 (C. gigas), SAMN08013506 (S. purpuratus), AID in gnathostomes, a single representative of this group is and SAMN08013507 (L. anatina). conserved across diverse gastropods irrespective of whether the organisms possess other LSEs of AADs. We propose that these Phylogenetic Trees. Phylogenetic relationships were derived using an ap- molluscan AADs might play a role in the mutagenic variability of proximate maximum likelihood (ML) method as implemented in the FastTree FREPs and that certain other invertebrate classical AADs program (73) and the full ML method implemented in MEGA7 (74). To in- identified herein possibly also perform such mutator roles. crease the accuracy of topology in FastTree, we increased the number of The discovery of SNADs hints at the possibility of unexplored rounds of minimum-evolution subtree-prune-regraft (SPR) moves to 4 (-spr 4) contexts for AAD function. The prediction that they are secreted as well as utilized the options -mlacc and -slownni to make the ML nearest suggests that SNADs function either in the endoplasmic re- neighbor interchanges (NNIs) more exhaustive. ticulum or in the extracellular space. Several RNA and DNA viruses replicate in specialized viral factories arising from the Entropy Analysis. Position-wise Shannon entropy (H) was computed using a endoplasmic reticulum (61); moreover, several other viruses custom script written in the R language using the equation exploit the vesicular trafficking system during uptake and re- XM lease. Thus, SNADs might specialize in targeting viruses in ves- = − H Pi log2Pi icles as proposed for certain APOBEC3s (62). As noted above, i=1 vertebrate SNADs are confined to animals with poikilothermy or low body temperature. Hence, they might be secreted to specifi- where M is the number of amino acid types and P is the fraction of residues cally target pathogens affecting such organisms (e.g., certain fungi) of amino acid type i. The Shannon entropy for any given position in the MSA (63). They might also be deployed as mutators directed against ranges from 0 (absolutely conserved one amino acid at that position) to 4.32 other extracellular parasites or even pathogenic cells (e.g., trans- (all 20 amino acid residues equally represented at that position). missible tumors) (64). Interestingly, in the slime mold T. lacteum, we identified an ∼16.6-kb potential immunity-related locus with Data Visualization. Structures were visualized and compared using the PyMOL six tandem genes coding for secreted proteins (GenBank acces- program (https://pymol.org/2/). R language scripts were used for analysis of sion no. LODT01000051.1). In addition to three of the six secreted data and generation of entropy and loop-length divergence plots. AADs from this organism, the locus also encodes two paralogous MAC-Perforin domain proteins (also showing gene expansions in ACKNOWLEDGMENTS. We thank Christian Buschbaum, Elena Temereva, and Robert A. Cameron for help with obtaining animal specimens. A.K., slime molds) and a heme peroxidase, which has previously been L.M.I., and L.A. are supported by the intramural funds of the National implicated in antibacterial defense in slime molds (65). This sup- Library of Medicine, NIH. S.J.H. and T.B. are supported by the Max Planck ports the idea that secreted AADs might play a role in immunity Society and the European Research Council under the European Union’s more widely in eukaryotes. Seventh Framework Programme (FP7/2007-2013; ERC Grant Agreement 323126).

Krishnan et al. PNAS Latest Articles | 9of10 Downloaded by guest on September 28, 2021 1. Smith HC, Gott JM, Hanson MR (1997) A guide to RNA editing. RNA 3:1105–1123. 39. Zhang D, de Souza RF, Anantharaman V, Iyer LM, Aravind L (2012) Polymorphic toxin 2. Gott JM, Emeson RB (2000) Functions and mechanisms of RNA editing. Annu Rev systems: Comprehensive characterization of trafficking modes, processing, mecha- Genet 34:499–531. nisms of action, immunity and ecology using comparative genomics. Biol Direct 7:18. 3. Hamilton CE, Papavasiliou FN, Rosenberg BR (2010) Diverse functions for DNA and 40. Zhang D, Burroughs AM, Vidal ND, Iyer LM, Aravind L (2016) Transposons to toxins: RNA editing in the immune system. RNA Biol 7:220–228. The provenance, architecture and diversification of a widespread class of eukaryotic 4. Betts L, Xiang S, Short SA, Wolfenden R, Carter CW, Jr (1994) . The effectors. Nucleic Acids Res 44:3513–3533. 2.3 A crystal structure of an : Transition-state analog complex. J Mol Biol 235: 41. Bergman A, Casadevall A (2010) Mammalian endothermy optimally restricts fungi – 635 656. and metabolic costs. MBio 1:e00212-10. 5. Almog R, Maley F, Maley GF, Maccoll R, Van Roey P (2004) Three-dimensional struc- 42. Losey HC, Ruthenburg AJ, Verdine GL (2006) Crystal structure of Staphylococcus au- ture of the R115E mutant of T4-bacteriophage 2′-deoxycytidylate deaminase. reus tRNA adenosine deaminase TadA in complex with RNA. Nat Struct Mol Biol 13: Biochemistry 43:13715–13723. 153–159. 6. Liaw SH, Chang YJ, Lai CT, Chang HC, Chang GG (2004) Crystal structure of Bacillus 43. Shi K, et al. (2017) Structural basis for targeted DNA cytosine deamination and mu- subtilis : The first domain-swapped structure in the cytidine de- tagenesis by APOBEC3A and APOBEC3B. Nat Struct Mol Biol 24:131–139. aminase superfamily. J Biol Chem 279:35479–35485. 44. Gaudelli NM, et al. (2017) Programmable base editing of A•TtoG•C in genomic DNA 7. Kumasaka T, et al. (2007) Crystal structures of blasticidin S deaminase (BSD): Impli- without DNA cleavage. Nature 551:464–471. cations for dynamic properties of catalytic zinc. J Biol Chem 282:37103–37111. 45. Rubio MA, et al. (2007) An adenosine-to-inosine tRNA-editing enzyme that can per- 8. Stenmark P, Moche M, Gurmu D, Nordlund P (2007) The crystal structure of the bi- – functional deaminase/reductase RibD of the riboflavin biosynthetic pathway in form C-to-U deamination of DNA. Proc Natl Acad Sci USA 104:7821 7826. Escherichia coli: Implications for the reductive mechanism. J Mol Biol 373:48–64. 46. Powell C, Cornblath E, Goldman D (2015) Zinc-binding domain-dependent, deaminase- 9. Wolf J, Gerber AP, Keller W (2002) tadA, an essential tRNA-specific adenosine de- independent actions of apolipoprotein B mRNA-editing enzyme, catalytic polypeptide 2 aminase from Escherichia coli. EMBO J 21:3841–3851. (Apobec2), mediate its effect on zebrafish retina regeneration. JBiolChem290:6007. 10. Nishikura K (2010) Functions and regulation of RNA editing by ADAR deaminases. 47. Sato Y, et al. (2010) Deficiency in APOBEC2 leads to a shift in muscle fiber type, di- Annu Rev Biochem 79:321–349. minished body mass, and myopathy. J Biol Chem 285:7111–7118. 11. Salone V, et al. (2007) A hypothesis on the identification of the editing enzyme in 48. Pham P, Bransteitter R, Petruska J, Goodman MF (2003) Processive AID-catalysed cy- plant organelles. FEBS Lett 581:4132–4138. tosine deamination on single-stranded DNA simulates somatic hypermutation. 12. Shikanai T (2015) RNA editing in plants: Machinery and flexibility of site recognition. Nature 424:103–107. Biochim Biophys Acta 1847:779–785. 49. Shandilya SM, et al. (2010) Crystal structure of the APOBEC3G catalytic domain reveals 13. Conticello SG (2008) The AID/APOBEC family of nucleic acid mutators. Genome Biol 9: potential oligomerization interfaces. Structure 18:28–38. 229. 50. Marx A, Galilee M, Alian A (2015) Zinc enhancement of cytidine deaminase activity 14. Salter JD, Bennett RP, Smith HC (2016) The APOBEC : United by struc- highlights a potential allosteric role of loop-3 in regulating APOBEC3 enzymes. Sci ture, divergent in function. Trends Biochem Sci 41:578–594. Rep 5:18191. 15. Luo M, Schramm VL (2008) Transition state structure of E. coli tRNA-specific adenosine 51. Kuratani M, et al. (2005) Crystal structure of tRNA adenosine deaminase (TadA) from – deaminase. J Am Chem Soc 130:2649 2655. Aquifex aeolicus. J Biol Chem 280:16002–16008. 16. Gerber A, Grosjean H, Melcher T, Keller W (1998) Tad1p, a yeast tRNA-specific 52. Aydin H, Taylor MW, Lee JE (2014) Structure-guided analysis of the human APOBEC3- adenosine deaminase, is related to the mammalian pre-mRNA editing enzymes HIV restrictome. Structure 22:668–684. – ADAR1 and ADAR2. EMBO J 17:4780 4789. 53. Prochnow C, Bransteitter R, Klein MG, Goodman MF, Chen XS (2007) The APOBEC- 17. Maas S, Gerber AP, Rich A (1999) Identification and characterization of a human 2 crystal structure and functional implications for the deaminase AID. Nature 445: tRNA-specific adenosine deaminase related to the ADAR family of pre-mRNA editing 447–451. enzymes. Proc Natl Acad Sci USA 96:8895–8900. 54. Münk C, Willemsen A, Bravo IG (2012) An ancient history of gene duplications, fusions 18. Nishikura K (2016) A-to-I editing of coding and non-coding RNAs by ADARs. Nat Rev and losses in the evolution of APOBEC3 mutators in mammals. BMC Evol Biol 12:71. Mol Cell Biol 17:83–96. 55. Babonis LS, Martindale MQ, Ryan JF (2016) Do novel genes drive morphological 19. Bass BL, Weintraub H (1988) An unwinding activity that covalently modifies its novelty? An investigation of the nematosomes in the sea anemone Nematostella double-stranded RNA substrate. Cell 55:1089–1098. 20. Rogozin IB, et al. (2007) Evolution and diversification of lamprey antigen receptors: vectensis. BMC Evol Biol 16:114. Evidence for involvement of an AID-APOBEC family . Nat Immunol 8: 56. Refsland EW, et al. (2010) Quantitative profiling of the full APOBEC3 mRNA reper- 647–656. toire in lymphocytes and tissues: Implications for HIV-1 restriction. Nucleic Acids Res – 21. Harris RS, Liddament MT (2004) Retroviral restriction by APOBEC proteins. Nat Rev 38:4274 4284. Immunol 4:868–877. 57. Greeve J, et al. (2003) Expression of activation-induced cytidine deaminase in human 22. Iyer LM, Zhang D, Rogozin IB, Aravind L (2011) Evolution of the deaminase fold and B-cell non-Hodgkin lymphomas. Blood 101:3574–3580. multiple origins of eukaryotic editing and mutagenic nucleic acid deaminases from 58. Adema CM (2015) Fibrinogen-related proteins (FREPs) in mollusks. Results Probl Cell bacterial toxin systems. Nucleic Acids Res 39:9473–9497. Differ 57:111–129. 23. Muramatsu M, et al. (2000) Class switch recombination and hypermutation require 59. Léonard PM, Adema CM, Zhang SM, Loker ES (2001) Structure of two FREP genes that activation-induced cytidine deaminase (AID), a potential RNA editing enzyme. Cell combine IgSF and fibrinogen domains, with comments on diversity of the FREP gene 102:553–563. family in the snail Biomphalaria glabrata. Gene 269:155–165. 24. Kinoshita K, Honjo T (2001) Linking class-switch recombination with somatic hyper- 60. Hanington PC, et al. (2010) Role for a somatically diversified lectin in resistance of an mutation. Nat Rev Mol Cell Biol 2:493–503. invertebrate to parasite infection. Proc Natl Acad Sci USA 107:21087–21092. 25. Moris A, Murray S, Cardinaud S (2014) AID and APOBECs span the gap between in- 61. den Boon JA, Diaz A, Ahlquist P (2010) Cytoplasmic viral replication complexes. Cell nate and adaptive immunity. Front Microbiol 5:534. Host Microbe 8:77–85. 26. Litman GW, Rast JP, Fugmann SD (2010) The origins of vertebrate adaptive immunity. 62. Khatua AK, Taylor HE, Hildreth JE, Popik W (2009) Exosomes packaging APOBEC3G – Nat Rev Immunol 10:543 553. confer human immunodeficiency virus resistance to recipient cells. J Virol 83:512–521. 27. Boehm T, Swann JB (2014) Origin and evolution of adaptive immunity. Annu Rev 63. de Hoog GS, et al. (2011) Waterborne Exophiala species causing disease in cold- – Anim Biosci 2:259 283. blooded animals. Persoonia 27:46–72. 28. Harris RS, Sale JE, Petersen-Mahrt SK, Neuberger MS (2002) AID is essential for im- 64. Metzger MJ, et al. (2016) Widespread transmission of independent cancer lineages – munoglobulin V gene conversion in a cultured B cell line. Curr Biol 12:435 438. within multiple bivalve species. Nature 534:705–709. 29. Bastianello G, Arakawa H (2017) A double-strand break can trigger immunoglobulin 65. Nicolussi A, et al. (2018) Secreted heme peroxidase from Dictyostelium discoideum: gene conversion. Nucleic Acids Res 45:231–243. Insights into catalysis, structure, and biological role. J Biol Chem 293:1330–1345. 30. Boehm T, et al. (2012) VLR-based adaptive immunity. Annu Rev Immunol 30:203–220. 66. Mungpakdee S, et al. (2014) Massive gene transfer and extensive RNA editing of a 31. Stavrou S, Ross SR (2015) APOBEC3 proteins in viral immunity. J Immunol 195: symbiotic dinoflagellate plastid genome. Genome Biol Evol 6:1408–1422. 4565–4570. 67. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein 32. Refsland EW, Harris RS (2013) The APOBEC3 family of retroelement restriction factors. database search programs. Nucleic Acids Res 25:3389–3402. Curr Top Microbiol Immunol 371:1–27. 68. Eddy SR (2009) A new generation of homology search tools based on probabilistic 33. Duggal NK, Emerman M (2012) Evolutionary conflicts between viruses and restriction inference. Genome Inform 23:205–211. factors shape immunity. Nat Rev Immunol 12:687–695. 69. Rice P, Longden I, Bleasby A (2000) EMBOSS: The European molecular biology open 34. Fossat N, et al. (2014) C to U RNA editing mediated by APOBEC1 requires RNA-binding – protein RBM47. EMBO Rep 15:903–910. software suite. Trends Genet 16:276 277. 35. Petersen-Mahrt SK, Neuberger MS (2003) In vitro deamination of cytosine to uracil in 70. Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein ho- – single-stranded DNA by apolipoprotein B editing complex catalytic subunit 1 mology detection and structure prediction. Nucleic Acids Res 33:W244 W248. (APOBEC1). J Biol Chem 278:19583–19586. 71. Neuwald AF, Altschul SF (2016) Bayesian top-down protein sequence alignment with 36. Ikeda T, et al. (2017) Opossum APOBEC1 is a DNA mutator with and ret- inferred position-specific gap penalties. PLOS Comput Biol 12:e1004936. roelement restriction activity. Sci Rep 7:46719. 72. Cole C, Barber JD, Barton GJ (2008) The Jpred 3 secondary structure prediction server. 37. Ikeda T, et al. (2008) The antiretroviral potency of APOBEC1 deaminase from small Nucleic Acids Res 36:W197–W201. animal species. Nucleic Acids Res 36:6859–6871. 73. Price MN, Dehal PS, Arkin AP (2010) FastTree 2—Approximately maximum-likelihood 38. Conticello SG, Thomas CJ, Petersen-Mahrt SK, Neuberger MS (2005) Evolution of the trees for large alignments. PLoS One 5:e9490. AID/APOBEC family of polynucleotide (deoxy)cytidine deaminases. Mol Biol Evol 22: 74. Kumar S, Stecher G, Tamura K (2016) MEGA7: Molecular evolutionary genetics 367–377. analysis version 7.0 for bigger datasets. Mol Biol Evol 33:1870–1874.

10 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1720897115 Krishnan et al. Downloaded by guest on September 28, 2021