
Supporting information for

Characterization of heliorhodopsins detected via functional metagenomics in freshwater Actinobacteria, Chloroflexi and Archaea

Ariel Chazan, Andrey Rozenberg, Kentaro Mannen, Takashi Nagata, Ran Tahan, Shir Yaish, Shirley Larom, Keiichi Inoue, Oded Béjà, and Alina Pushkarev

Corresponding author: Alina Pushkarev

Email: ​[email protected]

This PDF file includes:

Figures S1 to S11

Other supporting information includes:

Dataset S1 (flat GenBank format). Annotated sequences of the reported fosmids and of the proteorhodopsin gene used as a positive control in Figure 2B.

Dataset S2 (Excel spreadsheet). Protein sequences and metadata for prokaryotic HeR and DTE proton pump genes collected for the analysis of gene neighbors.

Dataset S3 (Excel spreadsheet). Summary statistics and results of Fisher’s exact test for the different Pfam protein families and domains that deviate significantly between groups of rhodopsins (heliorhodopsins vs. DTE proton pumps) or between the vicinities of rhodopsin genes and background genomic locations.

Dataset S4 (flat hmm file). Protein profile used to collect DTE proton pumps, built from proteorhodopsin and xanthorhodopsin sequences in UniRef90. Note that the matches were required to have the conserved DTE motif in TM helix C and a lysine in TM helix G.

[Figure S1, map panels: Hula; Ein Afek. North is marked (N) on each panel.]

Figure S1. Sampling sites. Colored dots represent the different sampling sites: pink – Hula A (33°06'22.4"N 35°36'09.4"E), red – Hula B (33°06'24.4"N 35°36'04.7"E), and yellow – Ein Afek (32°50'44.15"N 35°06'49.04"E). Images were taken from Google Earth Pro 7.3.3 (December 20, 2014), northern Israel, at the positions given above and an eye altitude of 762 m.

[Figure S2, tree panels. Panel 1: Bacteria, Actinobacteria: the Aquiluna/Rhodoluna clade (includes clones HULAa3G5 and HULAa2F4). Panel 2: Bacteria, Actinobacteria: Nanopelagicales (includes clones EINA62G7, EINA20F1, HULAa50H9, HULAb132A11 and 48C12). Legend: heliorhodopsin genes, group A and group B. Scale bars: 0.1.]

Figure S2. Phylogenetic affinities of six of the actinobacterial clones isolated in this study. Estimated phylogenetic positions of the actinobacterial clones, including the previously reported clone 48C12. For named branches present in GTDB, the corresponding taxa are given above the branches. The three numbers below the branches indicate the bootstrap branch support from the GTDB reference tree, the effective number of genes in the species inference and the local posterior probability (note that posterior probabilities are not informative when the effective number of genes is low). The tips are labelled with NCBI assembly accessions (in black) or clone names (in red). For the environmental clones, the two numbers after the colon indicate the number of genes taken for gene phylogenies (this number could decrease further due to filtering) and the total number of genes on the contig.

Dots indicate the presence of HeR genes from the two phylogenetic groups.
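The label conventions described in this caption can be read programmatically. The following is a minimal Python sketch, not part of the original analysis: the type and function names are hypothetical, and it assumes that branch labels have the form bootstrap/genes/posterior (for example 100/112/1) with a dash standing for a missing bootstrap value, and that clone tips have the form name: used/total (for example HULAa3G5: 17/33).

```python
import re
from typing import NamedTuple, Optional


class BranchSupport(NamedTuple):
    bootstrap: Optional[float]  # GTDB reference-tree bootstrap; None when shown as "-"
    effective_genes: int        # effective number of genes in the species inference
    posterior: float            # local posterior probability


class CloneTip(NamedTuple):
    name: str
    genes_used: int   # genes taken for gene phylogenies
    genes_total: int  # total number of genes on the contig


def parse_branch_label(label: str) -> BranchSupport:
    """Parse a 'bootstrap/genes/posterior' label such as '100/112/1' or '-/9/1'."""
    bootstrap, genes, posterior = label.strip().split("/")
    return BranchSupport(
        bootstrap=None if bootstrap == "-" else float(bootstrap),
        effective_genes=int(genes),
        posterior=float(posterior),
    )


def parse_clone_tip(label: str) -> CloneTip:
    """Parse a clone tip label such as 'HULAa3G5: 17/33'."""
    match = re.fullmatch(r"\s*(\S+):\s*(\d+)/(\d+)\s*", label)
    if match is None:
        raise ValueError(f"not a clone tip label: {label!r}")
    return CloneTip(match.group(1), int(match.group(2)), int(match.group(3)))
```

Keeping the effective number of genes next to the posterior probability makes it easy to flag branches where, as the caption warns, the posterior is not informative.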

[Figure S3, tree panels. Panel 1: Archaea, Thermoplasmatota: class E2 (includes the metagenomic contig LSSD01000066). Panel 2: Archaea, Thermoplasmatota: class Thermoplasmata (includes clone HULAa36F11). Legend: heliorhodopsin genes, group B. Scale bars: 0.1.]

Figure S3. Phylogenetic affinities of the archaeal clone isolated in this study. For comparison, the same analysis was performed for the metagenomic contig that contained TaHeR (GenBank accession number LSSD01000066). See Supplementary Figure S2 for further details.

[Figure S4, tree panel: Bacteria, Chloroflexi: the clade uniting the orders Dehalococcoidales, SZUA-161, UBA2777 and GIF9 (includes clone HULAa30F3). Legend: heliorhodopsin genes, group A and group B. Scale bar: 0.1.]

Figure S4. Phylogenetic affinities of the chloroflexal clone isolated in this study. See Supplementary Figure S2 for details.

[Figure S5, alignment in two blocks. Residue numbering follows HeR-48C12 (positions 1-121 and 122-256), with transmembrane helices TM1-TM7 and the connecting extracellular (ECL) and intracellular (ICL) loops annotated above the alignment. Rows: 48C12, HULAb132A11, HULAa458S, HULAa32G3, HULAa3G5, HULAa2F4, HULAa30F3, TaHeR, BcHeR, EINA20F1, EINA62G7, HULAa50H9 and HULAa36F11, marked as Group A or Group B. Below each block: divergence logo, y-axis "Divergence", 0.0-0.4.]

Figure S5. Sequence alignment of HeRs from the environmental clones and previously characterized prokaryotic HeRs. Part of a bigger alignment of prokaryotic HeRs from two phylogenetic groups (see Figure 1B), representing HeRs from the environmental clones from Ha’Hula and Ein Afek and three HeRs characterized previously. The structural annotation above the alignment corresponds to the structure of HeR-48C12 (PDB: 6SU3). Below the alignment, a differential weblogo shows the Jensen-Shannon divergence per position between phylogenetic groups A and B, based on the alignment of 60%-identity clusters, with dashed positions representing alignment gaps. For positions with a divergence above 0.25, >50%-consensus residues are shown. Red arrows indicate the two highly conserved His residues that are mutated in HeR HULAa30F3.
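The per-position Jensen-Shannon divergence mentioned in the caption can be illustrated with a minimal Python sketch. It assumes base-2 logarithms, unweighted residue frequencies and gap characters ignored within each group. The function names and the toy sequences are hypothetical, and the scaling used for the published logo may differ.

```python
import math
from collections import Counter


def column_distribution(column):
    """Relative residue frequencies in one alignment column, ignoring gaps ('-')."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    total = len(residues)
    return {aa: n / total for aa, n in Counter(residues).items()}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs (range 0 to 1) between two distributions given as dicts."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture distribution
        return sum(pa * math.log2(pa / mix[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def per_position_divergence(group_a, group_b):
    """Divergence for every column; None where either group has only gaps."""
    result = []
    for i in range(len(group_a[0])):
        p = column_distribution([seq[i] for seq in group_a])
        q = column_distribution([seq[i] for seq in group_b])
        result.append(jensen_shannon_divergence(p, q) if p and q else None)
    return result


# Toy aligned sequences (hypothetical): column 1 is identical, column 2 differs fully.
divergence = per_position_divergence(["HG", "HG"], ["HA", "H-"])
```

With this definition a column that is identical in both groups gives 0 and a column with no shared residues gives 1, so a threshold such as the 0.25 used in the caption picks out positions where the two groups differ in residue usage.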


[Figure S6, plot: ΔOD versus wavelength (nm), 300-800 nm, for bleaching times of 0, 2, 4, 8, 16, 32 and 64 min; labelled extrema at 365 nm and 559 nm.]

Figure S6. Biophysical characterization of HeR HULAa30F3. Difference UV-vis absorption spectra of HeR HULAa30F3, calculated between the spectra measured after and before bleaching of the protein with hydroxylamine. The inset shows an E. coli cell pellet expressing HeR HULAa30F3 in the presence of all-trans retinal.

[Figure S7, panels A (HULAa30F3), B (HULAa50H9), C (HULAa36F11) and D (HULAa45C8S). Upper plots: ΔOD versus wavelength (nm), 400-700 nm, at the indicated delay times, with the O, M and dark-state bands labelled. Lower plots: ΔOD versus time (s), 10^-5 to 10^1 s, at the wavelengths assigned to K and O, to M, and to L and the dark state.]

Figure S7. Transient absorption changes of HeRs. Transient absorption spectra (upper) and time evolutions of the transient absorption change at specific wavelengths (lower) of HeR HULAa30F3 (A), HULAa50H9 (B), HULAa36F11 (C), and HULAa45C8S (D). The accumulation of the K/O and M intermediates and the bleaching of the initial state were monitored at the wavelengths indicated with orange (λ = 611-624 nm), blue (λ = 400-406 nm) and green (λ = 520-550 nm) solid lines, respectively. Yellow lines indicate the best-fit curves of a multi-exponential function. For the transient spectra measured at t > 70 ms for HULAa36F11 and t > 1 s for HULAa45C8S, the jitter between laser pulse illumination and detector exposure could not be controlled precisely because of an instrumental limitation (see Materials and Methods), so a notch filter had to be placed in front of the detector to avoid saturation by the scattered laser pulse. Hence, the absorption change could not be measured at λ = 523-540 nm.
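For orientation, a multi-exponential function of the kind commonly used for such kinetic fits has the general form below. The symbols, the number of components n and the constant offset are assumptions made for illustration; the exact model used for the yellow best-fit curves is not specified in this section.

```latex
\Delta\mathrm{OD}(t) = A_0 + \sum_{i=1}^{n} A_i \exp\left(-\frac{t}{\tau_i}\right)
```

Here each A_i is the amplitude and each τ_i the time constant of one kinetic component at the monitored wavelength, and A_0 is a possible residual offset.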


[Figure S8, panel A (flowchart; software in parentheses).
Neighbor analysis: data from NCBI WGS, NCBI Assembly and environmental clones → search for rhodopsin-containing segments (hmmsearch, custom scripts) → all rhodopsin-containing segments → rhodopsin-containing assemblies with ≥1000 genes → predict genes (GeneMarkS --prok) → re-identify rhodopsin genes (hmmsearch, custom scripts) → extract genes within ±10-gene windows (custom scripts) → rhodopsins with ±10-gene neighbors → cluster rhodopsins +2 neighbors (CD-HIT, 80% identity) → search for protein families (pfam_scan.pl) → Pfam matches → count matches (custom scripts) → compare between rhodopsin types and against background (custom scripts).
HeR phylogeny: Uniprot HeRs (PF18761) → cluster HeR proteins (CD-HIT, 80% identity) → alignment (mafft --localpair, manual curation) → trimming (trimal -gt 0.9) → remove outliers (SequenceBouncer) → phylogeny reconstruction (iqtree, 20 runs) → separation of long-branching lineages (ggtree/itol) → (mostly) prokaryotic HeR clusters with Lys → second round of clustering (CD-HIT, 60% identity) → alignment (mafft --localpair) → trimming (trimal -gt 0.9) → phylogeny reconstruction (iqtree, 20 runs) → binning HeR clusters into branches A and B.

Panel B (diagram): potential for co-expression (possible / less likely); proportion of rhodopsin genes with neighbors from the protein family; gene positions occupied by the protein family members near and away from rhodopsin.

Panel C (contingency tables). Taxa × HeR phylogenetic group: heliorhodopsins vs. DTE proton pumps, number of rhodopsin genes with and without matching neighbors; the best case for DTE pumps is picked among counts pooled for all taxa, the highest proportion of matches among taxa and the highest absolute number of matches among taxa. Taxa: DTE proton pumps vs. heliorhodopsins, number of rhodopsin genes with and without matching neighbors; the case for HeRs uses counts pooled for all taxa and phylogenetic groups. Taxa × rhodopsin types/groups: near vs. away from rhodopsin, total number of gene positions that match or do not match the protein profile.]

Figure S8. Pipeline used for phylogenetic classification of HeRs and for the analysis of neighbors. A. Schematic of the different steps in the pipeline, with the color code indicating input data (lilac), computational steps (red) with the corresponding software (blue) and derived data (green). B. Parameters estimated for each of the protein families found in the vicinity of rhodopsin genes: incidence of the gene orientations, proportion of HeR genes with the corresponding neighbors, and proportion of gene positions occupied by the matching genes among rhodopsin neighbors and away from them (for long assemblies). C. Contingency tables created for the formal statistical tests for three types of comparisons: HeRs for each grouping compared to the best case built for the DTE proton pumps; DTE proton pumps for each of the taxa against pooled counts for HeR neighbors; and gene positions occupied by the corresponding neighbors against the background (gene positions away from the rhodopsin genes).
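A single 2×2 comparison of the kind described in panel C can be sketched with a pure-Python two-sided Fisher's exact test. This is an illustration only: the function name and all counts are hypothetical, and the software, the sidedness of the test and any multiple-testing correction actually used are not stated in this section.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts: rhodopsin genes with / without a neighbor from one Pfam family.
her_with, her_without = 30, 70   # heliorhodopsins
dte_with, dte_without = 5, 95    # DTE proton pumps
p = fisher_exact_two_sided(her_with, her_without, dte_with, dte_without)
```

The same function applies to the near-versus-away comparison by placing matching and non-matching gene positions in the two columns.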


[Figure S9, diagram. Columns: chosen taxa (with NCBI taxonomy IDs), search space, all collected rhodopsin genes, non-redundant rhodopsin genes. Taxa and search space:
Archaea (2157): 5718+1
Actinobacteria (201174): 28,930+11
Chloroflexi (200795): 2584+1
Firmicutes s.l. (Firmicutes s.s., 1239; Tenericutes, 544448): 129,562
Patescibacteria s.l. (Patescibacteria s.s., 1783273; Saccharibacteria, 95818; Dojkabacteria, 74243; WWE3, 422282): 3640
Thermocalda (Thermotogae, 200918; Dictyoglomi, 68297; Caldiserica/Cryosericota, 2498710): 265
Legend: outlier/unidentified HeRs, group A HeRs, group B HeRs, DTE proton pumps. Gene counts are shown per category, with assembly counts in parentheses.]

Figure S9. Taxa used for the analysis of rhodopsin neighbors. Summary statistics for the six taxa chosen for the analysis of HeR neighbors. The taxa were chosen to represent a sufficient number of phylogenetically close and morphologically similar organisms and to cover the vast majority of HeR-harboring prokaryotes, based on the selection of HeR proteins in Uniprot. The “search space” indicates the number of analyzed NCBI WGS and Assembly records assigned to the corresponding NCBI taxonomy accessions (the number after “+” indicates the number of HeR-coding environmental clones). The name “Patescibacteria” is used here in the broader sense to include all “candidate phyla”, of which the four biggest groups of HeR-harboring assemblies were analyzed. Thermocalda unites three phyla or classes of diderms lacking lipopolysaccharides in the outer membranes following.

The total number of analyzed assemblies, the raw number of rhodopsin genes with neighbors identified (according to the filtering criteria; see Materials & Methods) and the number of non-redundant rhodopsin genes (i.e. those with distinct triads: the rhodopsin and its two immediate neighbors, see Supplementary Figure S8) are provided. The numbers of assemblies for each category are given in parentheses (where they differ from the number of genes, the discrepancy arises from assemblies containing multiple HeR genes). HeRs are subdivided into the phylogenetic groups A and B (see Figure 1 and Supplementary Figure S5).
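The notion of non-redundant rhodopsin genes, defined above by distinct triads, can be sketched as a simple de-duplication step. This is a minimal Python illustration under stated assumptions: each gene record carries hypothetical cluster identifiers for the rhodopsin and its two immediate neighbors, the first record of each triad is kept, and strand orientation is not normalized. The record layout and the function name are not taken from the original scripts.

```python
def non_redundant_triads(records):
    """Keep the first record for each distinct (upstream, rhodopsin, downstream) triad."""
    seen = set()
    kept = []
    for record in records:
        triad = (record["upstream"], record["rhodopsin"], record["downstream"])
        if triad not in seen:
            seen.add(triad)
            kept.append(record)
    return kept


# Hypothetical records; the values stand for sequence-cluster identifiers.
records = [
    {"gene": "g1", "upstream": "c10", "rhodopsin": "r1", "downstream": "c11"},
    {"gene": "g2", "upstream": "c10", "rhodopsin": "r1", "downstream": "c11"},  # same triad as g1
    {"gene": "g3", "upstream": "c10", "rhodopsin": "r1", "downstream": "c12"},
]
unique_records = non_redundant_triads(records)
```

De-duplicating on the whole triad rather than on the rhodopsin alone keeps near-identical rhodopsins that sit in different genomic contexts.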

[Figure S10, gene maps (scale bar: 1 kb). Loci and labelled genes:
Cryobacterium sp. MDT1-3 [Bacteria: Actinobacteria: Microbacteriaceae], SOFP01000010.1: 30,295-37,540(-): TetR, DUF2256+3253, HeR, BLH, CrtY, DLH, Photolyase, PhrB
Mesotoga infera VNs100 [Bacteria: (Thermocalda:) Thermotogae: Kosmotogaceae], LS974202.1: 2,884,087-2,892,744(-): DUF1848, UvrD, HeR, PncA, MetI, MetN, MetQ
Anaerolineae bacterium UBA12294 (MAG) [Bacteria: Chloroflexi: Anaerolineae], DNDO01000049.1: 1-11,009(-): TspO, CrtB, CrtI, MerR+B12-bd, HeR, UCP010219, DUF1295, DUF4345, Photolyase, CYP, CrtY
Ca. Saccharibacteria bacterium UBA1949 (MAG) [Bacteria: Patescibacteria: Saccharimonadia: Saccharimonadaceae], DDER01000010.1: 17,267-23,057: RrnaAD, Beta helix, HeR, DUF2177, DUF1295
Acholeplasmataceae bacterium DLM2.Bin120 (MAG) [Bacteria: Firmicutes: Acholeplasmataceae (mollicutes)], RECG01000016.1: 16,058-25,112: DUF2177, Dak2+DegV, DUF1295, FeoB, FeoA, Photolyase, HeR, Tocopherol cycl
Lactococcus garvieae UNIUD074 [Bacteria: Firmicutes: Streptococcaceae], AFHF01000009.1: 71,199-66,766(-): DUF1295, Dak2+DegV, DUF2177, HeR
Methanobacterium subterraneum A8p [Archaea: Methanobacteriota: Methanobacteriaceae], CP017768.1: 40,173-48,351(-): Ala-tRNA ligase, HcyBio, NTPase 1, SOUL, HeR, α/β hydrolase
Color legend: Heliorhodopsin A; Heliorhodopsin B; oxidative stress; DUF2177, DegV and DUF1295; transcription regulators; translation/amino acid metabolism; biosynthesis of retinal; light-dependent proteins; transport proteins.]

Figure S10. Examples of HeR gene neighbors in various prokaryotes. Representative DNA fragments containing HeR genes from different prokaryotic groups, with the neighboring genes classified into functional categories. Note that some gene products are classified in two categories and receive a gradient color. The examples include genome and metagenome assemblies of: Cryobacterium sp. MDT1-3, with a rare case of a HeR gene next to blh; Mesotoga infera, as a representative of unusual diderms; Anaerolineae bacterium UBA12294, a metagenomic assembly in which the HeR gene is surrounded by several genes of the beta-carotene biosynthetic pathway but not blh, as well as a putative cobalamin-binding (“B12-bd”) transcription regulator; Acholeplasmataceae bacterium DLM2.Bin120, a mollicute with a tocopherol cyclase gene typical for this group, as well as the triad DUF2177, DegV and DUF1295; Lactococcus garvieae, a lactic acid bacterium with the triad DUF2177, DegV and DUF1295; and Methanobacterium subterraneum A8p, a methanogenic archaeon with a gene coding for a SOUL-containing heme-binding protein and multiple genes involved in translation/amino acid metabolism.

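[Figure S11, plot: bar chart of gene co-occurrence counts (y-axis 0-400) for combinations of HeR, DUF2177, degV, DUF1295, blh and classical rhodopsins (Classical rhod.), with a second axis giving the number of genomes (0-750). Bar labels, in plotted order: 373, 150, 73, 58, 47, 43, 43, 34, 33, 29, 28, 28, 14, 13, 12, 12, 11, 10, 10, 7, 6, 4, 4, 3, 3, 2, 2, 2, 2.]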

Figure S11. Genome-wide co-occurrences of protein families characteristic of rhodopsin neighbors. Co-occurrence of HeRs (PF18761) and other rhodopsin types (PF01036) with four chosen protein families frequently co-locating with them. The search covered 18,708 high-quality non-redundant prokaryotic assemblies among GTDB representative genomes. Note that the protein families DUF2177, DegV and DUF1295 alone have a much wider distribution than the rhodopsin genes; therefore, only gene combinations involving rhodopsins are shown.
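The counting behind such a co-occurrence summary can be sketched as follows. This minimal Python example assumes a hypothetical presence/absence mapping from genome name to the protein families detected in it, and it keeps only combinations that include at least one rhodopsin family, as stated in the caption. The data, the names and the function are illustrative and not the original workflow.

```python
from collections import Counter


def count_rhodopsin_combinations(genome_families, tracked, rhodopsins):
    """Count genomes per combination of tracked families, keeping only
    combinations that contain at least one rhodopsin family."""
    counts = Counter()
    for families in genome_families.values():
        combination = frozenset(families) & tracked
        if combination & rhodopsins:
            counts[combination] += 1
    return counts


# Hypothetical input: genome name -> protein families detected in that genome.
example = {
    "genome_1": {"HeR", "DUF2177", "degV", "DUF1295"},
    "genome_2": {"HeR", "DUF2177", "degV", "DUF1295"},
    "genome_3": {"Classical rhod.", "blh"},
    "genome_4": {"DUF1295"},  # no rhodopsin, so it is not counted
}
TRACKED = frozenset({"HeR", "Classical rhod.", "DUF2177", "degV", "DUF1295", "blh"})
RHODOPSINS = frozenset({"HeR", "Classical rhod."})

combination_counts = count_rhodopsin_combinations(example, TRACKED, RHODOPSINS)
for combination, n in combination_counts.most_common():
    print(sorted(combination), n)
```

Each genome contributes to exactly one combination, so the counts are mutually exclusive and can be plotted directly as bars.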