Supplementary Information

Mechanisms Driving Genome Reduction of a Novel Roseobacter Lineage Showing Vitamin B12 Auxotrophy

Xiaoyuan Feng, Xiao Chu, Yang Qian, Michael W. Henson, V. Celeste Lanclos, Fang Qin, Yanlin Zhao, J. Cameron Thrash, Haiwei Luo

This PDF file includes:

Text 1. Supplementary methods
Text 2. Supplementary results
Figures S1 to S5
References

Text 1. Supplementary methods
1.1 Sampling, bacterial cultivation and genome sequencing
1.2 Genome assembly and annotation
1.3 Phylogenomic analysis
1.4 Phylogenetic analysis based on 16S rRNA genes
1.5 Genome content analysis
1.6 Statistical analysis for genomic features
1.7 Recruitment analysis using public metagenomic sequencing data
1.8 dR/dC ratio calculation
1.9 Vitamin B12 assay
Text 2. Supplementary results
2.1 Biased codon and amino acid usage

Text 1. Supplementary methods

1.1 Sampling, bacterial cultivation and genome sequencing

HKCC strains were isolated from ambient seawater samples from both the coral Platygyra acuta ecosystem (lat. N22.52°, long. E114.3173°) and the brown alga Sargassum hemiphyllum ecosystem, collected in 2017 by scuba diving near Hong Kong, China. The samples were stored in 50 mL microcentrifuge tubes at 4°C. Marine basal medium (MBM) was modified with 1 mM dimethylsulfoniopropionate (DMSP) as the sole carbon source. A 100-fold serial dilution of seawater was prepared, and a 100 μL aliquot of the diluted sample was spread on the MBM-DMSP medium. The isolation plates were incubated at room temperature. Colonies were selected after one week and repeatedly re-streaked on 2216E marine agar (BD Difco, USA) to collect biomass for DNA extraction. Genomic DNA was extracted using the E.Z.N.A. Bacterial DNA Kit and sent to Qingdao Huada Gene Biotechnology Co., Ltd for library preparation and genome sequencing on the BGISEQ-500 platform (PE100) following the standard protocol (1). In addition, biomass of HKCCA1288 was sent to Qingdao Huada Gene Biotechnology Co., Ltd for library preparation and genome sequencing on the PacBio Sequel platform following a previous study (2) to obtain a complete and closed genome.

FZCC strains were isolated from the coastal water of Pingtan Island (lat. N25.43°, long. E119.78°) in May 2017 using the high-throughput dilution-to-extinction cultivation (HTC) method (3–5). The seawater sample was diluted with seawater-based media (4, 5) and dispensed into 24-well polystyrene microplates (Corning Incorporated) at a final inoculation density of 3 cells/well. The plates were incubated at 20°C in the dark for four weeks, and cells were counted by flow cytometry after staining with SYBR Green I. Wells with at least 10⁵ cells/mL were defined as positive cultures. Cells from 500 μL of each positive culture were collected by centrifugation (10,000 rpm, 30 min) and identified by 16S rRNA gene sequence identity as described previously (5). FZCC strains used for genome sequencing were grown in polycarbonate flasks (50 mL culture volume) to a cell density of >10⁶ cells/mL and collected by centrifugation. Genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA, USA) and sequenced on an Illumina HiSeq 2500 (PE150) at the Hubbard Center for Genome Studies, University of New Hampshire.

LSUCC strains were isolated as previously described (6, 7), with the exception of LSUCC1028, for which we used a modified JW1 medium (6) containing no carbon sources except fulvic acids (Santa Cruz Biotechnology, Inc, TX, USA; CAS 479-66-3) at 1 mg/L (“JW1FA”). LSUCC1028 was isolated from surface water collected from Terrebonne Bay, LA (lat. N29.1872°, long. W90.62822°) in July 2015. Briefly, the collected water was diluted in JW1FA and inoculated into a 96-well Teflon plate at a density of 2 cells/well. The plate was incubated in the dark at 24°C for two weeks and counted by flow cytometry. Each well was transferred into a new plate containing fresh medium and counted after one week. This was repeated once more for a total of three plates. Any wells that remained positive (10⁴ cells/mL) after three consecutive plates were subjected to three rounds of serial dilution to ensure culture purity. Cultures were then grown in polycarbonate flasks (50 mL culture volume) to a cell density of 10⁶ cells/mL and identified by 16S rRNA gene sequence identity as previously described (6, 7). Genomic DNA was extracted from LSUCC0031 with the DNeasy kit (Qiagen), from LSUCC0246 and LSUCC0387 by phenol-chloroform extraction, and from LSUCC1028 with the PowerWater kit (Mo Bio). Genomic DNA was sequenced on an Illumina HiSeq 2500 (PE150) at the Hubbard Center for Genome Studies, University of New Hampshire.

1.2 Genome assembly and annotation

The raw Illumina sequencing reads were quality trimmed with Trimmomatic v0.36 (8) using the options ‘SLIDINGWINDOW:4:15 MAXINFO:40:0.9 MINLEN:40’ and assembled using SPAdes v3.10.1 (9) with the ‘--careful’ option. Only contigs with length >2,000 bp and sequencing depth >5x were retained.
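The contig retention rule above can be illustrated with a minimal sketch. It assumes SPAdes-style contig names of the form `NODE_1_length_52341_cov_37.21` and uses the coverage value in the name as a stand-in for sequencing depth; the function names are hypothetical, and the study's own filtering script may have worked differently.

```python
import re

# SPAdes names contigs like "NODE_1_length_52341_cov_37.21"; this pattern
# reads the length and coverage fields from such a header.
HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def keep_contig(header, min_length=2000, min_depth=5.0):
    """Return True if the contig is longer than min_length bp and its
    coverage (used here as a proxy for depth) is above min_depth."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError(f"unrecognized contig header: {header}")
    length = int(match.group(1))
    depth = float(match.group(2))
    return length > min_length and depth > min_depth


def filter_contigs(records):
    """records: iterable of (header, sequence) pairs; yields the pairs that pass."""
    for header, sequence in records:
        if keep_contig(header):
            yield header, sequence
```

Both thresholds are strict inequalities, matching the ">2,000 bp" and ">5x" wording of the text.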

The quality of the PacBio reads for HKCCA1288 was checked using FastQC v0.11.4 (10). The genome of this strain was assembled by combining the Illumina short reads and the PacBio long reads using Unicycler v0.4.6 (11) with default parameters.

Genome completeness, contamination, and strain heterogeneity (Table S1) were calculated using CheckM v1.0.7 (12). Three marker genes (PF05958, PF06723 and PF07991) were excluded from the calculation because they were either absent or present in multiple copies in the closed genome of HKCCA1288. The average nucleotide identity (ANI) between genomes was calculated using fastANI v1.3 (13) with default parameters.

Protein-coding genes were predicted with the Prokka annotation pipeline v1.12 (14). Protein sequences were annotated using the online RAST (15) and KEGG (16) servers. They were further searched against the COG (17), Pfam (18), TIGRFAM (19) and CDD (20) databases, all of which were downloaded in February 2020. Proteins involved in amino acid biosynthesis were inferred using the online tool GapMind (21).

1.3 Phylogenomic analysis

To place the CHUG lineage in the phylogeny of the Roseobacter group, we performed a phylogenomic analysis based on 120 bacterial marker genes (22). Eight CHUG genomes, one newly sequenced reference genome (LSUCC0031), and 78 additional reference genomes from a previous study (23) were used for phylogenomic tree construction. Each marker gene was aligned at the amino acid level using MAFFT v7.222 (24) and trimmed using trimAl v1.4.rev15 (25) with the ‘-resoverlap 0.55 -seqoverlap 60’ options. The trimmed alignments were concatenated using a custom script (https://github.com/luolab-cuhk/CHUG-genome-reduction-project) into a super-alignment with 456, 904 sites. The maximum likelihood (ML) phylogenomic tree was built using IQ-TREE v1.6.2 (26), with ModelFinder (27) assigning the best substitution model, and 1,000 ultrafast bootstrap replicates were sampled to assess the robustness of the phylogeny (28). The phylogeny was visualized using iTOL (29).
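The concatenation step can be sketched as follows. This is not the authors' custom script; it is a hypothetical illustration that assumes each trimmed marker alignment is held as a dictionary from genome name to aligned sequence, and that genomes lacking a marker are padded with gaps.

```python
def concatenate_alignments(alignments, taxa):
    """Join per-marker alignments into one super-alignment.

    alignments: list of dicts mapping taxon -> aligned sequence (one dict per marker).
    taxa: list of all taxon names expected in the output.
    Taxa missing from a marker are padded with gaps so all rows stay equal in length.
    """
    rows = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}
```

The number of sites in the super-alignment is then simply the sum of the widths of the trimmed marker alignments.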

1.4 Phylogenetic analysis based on 16S rRNA genes

To demonstrate that the CHUG lineage represents a distinct phylogenetic branch within the Roseobacter group, we constructed an ML phylogeny based on 16S rRNA gene sequences of both cultured and uncultivated roseobacters, following the same method described above. The 16S rRNA genes of cultured roseobacters were extracted from the 87 genomes mentioned above, and the sequences of uncultivated roseobacters were retrieved from a previous study (30). To search broadly for uncultivated 16S rRNA gene sequences closely related to the cultured CHUG members, we built a preliminary 16S rRNA gene tree (data not shown) with all Roseobacter sequences from the SILVA database (31) and those from the CHUG genomes, and identified three SILVA sequences closely related to the CHUG isolates. These three sequences (JN119120, JQ197701 and GQ342302) were also included in the IQ-TREE phylogenetic analysis mentioned above.

1.5 Genome content analysis

To identify the genomic content shared between CHUG and the previously reported pelagic Roseobacter cluster (PRC) (32), we constructed a dendrogram based on the presence and absence patterns of orthologous gene families. Orthologous gene families were identified using OrthoFinder v2.2.1 (33) with the ‘-S diamond -M msa’ options. The binary presence/absence matrix of orthologous gene families was used for dendrogram construction with IQ-TREE v1.6.2 (26), and 1,000 ultrafast bootstrap replicates were sampled to assess the robustness of the dendrogram (28). The dendrogram was visualized using iTOL (29).
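A minimal sketch of how a binary presence/absence matrix could be derived from per-family gene counts is given below. The input layout (a dictionary from family name to per-genome copy numbers) and the function name are assumptions made for illustration, not the format used in the study.

```python
def presence_absence(gene_counts, genomes):
    """Convert gene copy numbers into binary presence/absence strings.

    gene_counts: dict mapping family name -> dict mapping genome -> copy number.
    genomes: list of genome names to include.
    Returns a dict mapping genome -> string of '0'/'1' characters, one per
    family, with families ordered alphabetically by name.
    """
    families = sorted(gene_counts)
    matrix = {}
    for genome in genomes:
        matrix[genome] = "".join(
            "1" if gene_counts[family].get(genome, 0) > 0 else "0"
            for family in families
        )
    return matrix
```

Each genome then contributes one row of binary characters, so families with several copies and families with a single copy are treated identically.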

To help understand the evolutionary process giving rise to the CHUG lineage, we reconstructed the genome content of the ancestral nodes related to CHUG, its sister lineage and their outgroups, and inferred gene gain and loss events using BadiRate v1.35 (34) with the ‘-anc -bmodel FR -rmodel BDI -ep CSP’ options. A pruned subtree of the phylogeny inferred from the 120 bacterial marker genes (Fig. 1A) and a table of gene counts for each orthologous gene family predicted by OrthoFinder (33) were used as inputs. The ancestral genome sizes were estimated from the number of orthologous gene families using a linear regression model (R² = 0.97, p < 0.01) in R.
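The regression-based size estimate can be illustrated with an ordinary least-squares sketch. The analysis itself was run in R; the Python below is only a hypothetical equivalent, and the numbers in it are made-up toy values, not data from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only: family counts and genome sizes (Mbp)
# of extant genomes, then a prediction for an ancestral family count.
families = [2000, 3000, 4000]
genome_mbp = [2.5, 3.5, 4.5]
slope, intercept = fit_line(families, genome_mbp)
ancestral_size_mbp = slope * 3500 + intercept
```

The fit is trained on extant genomes, where both quantities are known, and then applied to the family counts reconstructed for ancestral nodes.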

1.6 Statistical analysis for genomic features

The assembled genome size, gene number, coding density and GC content were obtained using CheckM v1.0.7 (12), and the estimated genome size was then adjusted as (assembled genome size)/(completeness + contamination) (35). Pseudogenes were predicted following our recent study (36). The number of carbon atoms per amino-acid-residue side chain (C-ARSC), the number of nitrogen atoms per amino-acid-residue side chain (N-ARSC), amino acid usage, and codon usage were calculated using custom scripts (https://github.com/luolab-cuhk/CHUG-genome-reduction-project). The number of orthologous families and the mean number of genes per orthologous family were calculated from the gene families clustered by OrthoFinder v2.2.1 (33). Phylogenetic ANOVA was performed to compare these genomic features between Roseobacter lineages using the ‘phylANOVA’ function in the ‘phytools’ R package, which controls for the effect of phylogeny on these metrics (37).

Next, we identified genes that were either enriched or depleted in CHUG and other PRC members compared to other roseobacters. Briefly, the phylogenetic signal of functional genes (coxL, pdo, sox) and traits (light utilization) was checked with the ‘phylosig’ function of the ‘phytools’ R package (37). A strong phylogenetic signal was identified in the distribution of pdo, sox and the light utilization trait (lambda > 0.99, p < 0.001 for each), so the subsequent tests of whether they were associated with a particular category (PRC or non-PRC) were controlled for evolutionary history using the ‘binaryPGLMM’ function in the ‘ape’ R package (38). In contrast, there was no phylogenetic signal for the coxL gene (lambda < 0.01), so a χ² test was used to test whether it was associated with a particular category (PRC or non-PRC).
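The genome size adjustment and the C-ARSC and N-ARSC metrics described in this section can be sketched as below. This is not the authors' custom script. It assumes that completeness and contamination are supplied as percentages, as CheckM reports them, and that ARSC is averaged over the standard residues of a protein sequence; the side-chain atom counts follow the panel layout of Figs. S2 and S3.

```python
# Carbon atoms per amino-acid-residue side chain (one-letter codes).
C_ARSC = {
    "G": 0, "A": 1, "S": 1, "C": 1, "T": 2, "D": 2, "N": 2,
    "P": 3, "M": 3, "V": 3, "E": 3, "Q": 3,
    "I": 4, "L": 4, "K": 4, "H": 4, "R": 4,
    "F": 7, "Y": 7, "W": 9,
}
# Nitrogen atoms per side chain; residues not listed have none.
N_ARSC = {"R": 3, "H": 2, "K": 1, "N": 1, "Q": 1, "W": 1}


def estimated_genome_size(assembled_bp, completeness_pct, contamination_pct):
    """Assembled size divided by (completeness + contamination), with both
    CheckM values given in percent (an assumption of this sketch)."""
    return assembled_bp / ((completeness_pct + contamination_pct) / 100.0)


def mean_arsc(protein, table):
    """Average number of side-chain atoms per residue for one protein,
    ignoring characters that are not standard amino acids."""
    residues = [aa for aa in protein.upper() if aa in C_ARSC]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return sum(table.get(aa, 0) for aa in residues) / len(residues)
```

For example, `mean_arsc("GAR", N_ARSC)` averages 0, 0 and 3 nitrogen atoms over three residues, giving 1.0.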

1.7 Recruitment analysis using public metagenomic sequencing data

To assess the global occurrence and activity of CHUG members, the Tara Oceans metagenomic and metatranscriptomic sequencing data (39–41) were downloaded and mapped to the genomes. Two additional metagenomic datasets, sampled from the Red Sea (42) and Kwangyang Bay (43), were also included because an increased abundance of CHUG members was identified in a preliminary analysis. These public sequencing data were quality trimmed with Trimmomatic v0.36 (8) using the options ‘SLIDINGWINDOW:4:15 MAXINFO:40:0.9 MINLEN:40’ and subsequently mapped to the 89 Roseobacter genomes using Bowtie 2 v2.3.2 (44) with the parameter ‘--very-sensitive-local’ for a quick screening. Mapped reads were extracted using SAMtools v1.4.1 (45) and searched more precisely against the 89 genomes using BLASTN (46) with the settings ‘-evalue 1e-5 -perc_identity 95 -qcov_hsp_perc 80’. Thus, only mapped reads that shared >95% identity and >80% coverage with a reference genome were kept for further calculations. The relative abundances of CHUG and other PRC members were expressed as Reads Per Kilobase per Million mapped reads (RPKM) and compared using the Wilcoxon test in the ‘ggplot2’ R package. The correlation analysis was performed using the ‘rcorr’ function in the ‘Hmisc’ R package (47), and the significance level was adjusted using the stringent Bonferroni correction. These analyses were not performed for each CHUG genome individually because these genomes are closely related (95.4 ± 2.4% ANI) and some reads mapped equally well to multiple genomes.
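The read filtering and RPKM normalization described above can be sketched as follows. The function names are hypothetical, and the choice of `total_reads` as the denominator (all reads in the sample versus only mapped reads) is an assumption that the text does not settle.

```python
def passes_filter(percent_identity, query_coverage,
                  min_identity=95.0, min_coverage=80.0):
    """Keep a BLASTN hit whose identity and query coverage (both in percent)
    exceed the thresholds quoted in the text."""
    return percent_identity > min_identity and query_coverage > min_coverage


def rpkm(recruited_reads, genome_length_bp, total_reads):
    """Reads Per Kilobase per Million reads.

    recruited_reads / (genome length in kb * total reads in millions),
    which simplifies to recruited_reads * 1e9 / (length_bp * total_reads).
    """
    return recruited_reads * 1e9 / (genome_length_bp * total_reads)
```

As a worked case, 500 recruited reads on a 2,000,000 bp genome in a sample of 10,000,000 reads give 500 / (2,000 kb × 10 million) = 0.025 RPKM.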

1.8 dR/dC ratio calculation

We calculated the ratio of radical nonsynonymous nucleotide substitutions per radical nonsynonymous site (dR) to conservative nonsynonymous nucleotide substitutions per conservative nonsynonymous site (dC) following our previous protocol (48). Briefly, orthologous gene families from CHUG members and from their sister group were each compared to those from the outgroup. The 20 amino acids were categorized into three groups based on charge or into six groups based on volume and polarity (48). For each categorization, a nonsynonymous nucleotide substitution was considered ‘conservative’ if it led to a within-group amino acid replacement and ‘radical’ if it led to a between-group replacement. The dR/dC ratio was calculated using three methods: GC correction based on codon frequency, GC correction based on amino acid composition, and the traditional uncorrected method (49), all of which are implemented in RCCalculator (48).
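The conservative/radical distinction can be made concrete with a small sketch that uses the two groupings listed in Fig. S5B and S5C. It only labels a single amino acid replacement; it does not count sites or apply GC corrections, so it is not a substitute for RCCalculator. The dictionary keys and function names are hypothetical.

```python
# Groupings taken from Fig. S5B (charge) and Fig. S5C (volume and polarity),
# written with one-letter amino acid codes.
CHARGE = {
    "neutral": "ANCQGILMFPSTWYV",
    "positive": "RHK",
    "negative": "DE",
}
VOLUME_POLARITY = {
    "special": "C",
    "neutral_small": "AGPST",
    "nonpolar_small": "ILMV",
    "nonpolar_large": "FWY",
    "polar_small": "DENQ",
    "polar_large": "HKR",
}


def group_lookup(scheme):
    """Invert a scheme into a dict mapping amino acid -> group name."""
    return {aa: name for name, members in scheme.items() for aa in members}


def classify_replacement(aa_from, aa_to, scheme):
    """Label a replacement as 'conservative' (within group) or
    'radical' (between groups); identical residues are 'synonymous'."""
    if aa_from == aa_to:
        return "synonymous"
    lookup = group_lookup(scheme)
    return "conservative" if lookup[aa_from] == lookup[aa_to] else "radical"
```

The same replacement can fall into different classes under the two schemes: lysine to histidine, for instance, stays within the positive group by charge and within the polar and relatively large group by volume and polarity, whereas isoleucine to phenylalanine is conservative by charge but radical by volume and polarity.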

1.9 Vitamin B12 assay

To validate the vitamin B12 auxotrophy of CHUG members, a growth assay was performed with HKCCA1288 as the experimental CHUG strain and the model roseobacter strain Ruegeria pomeroyi DSS-3 (50) as the positive control. The starter culture was prepared in 2216 marine broth (Difco) prior to the growth experiments. In the vitamin B12 assay, the defined marine ammonium mineral salts (MAMS) medium was supplemented with ribose (30 mM, Sigma) as the sole carbon source and with SL-10 trace metals solution (1 mL/L) (51). A vitamin mixture with or without vitamin B12 was added as described previously (52). Strains were cultivated in MAMS medium for 96 h, and samples were collected every 12 h for cell counting. The collected cells were stained with SYBR Green I dye (Life Technologies, USA) for 15 min and counted using a flow cytometer (Guava EasyCyte Plus, MA, USA) equipped with a fluorescence detector. All experiments in this section were performed in triplicate.

Text 2. Supplementary results

2.1 Biased codon and amino acid usage

As mentioned in the main paper, the GC content (Fig. 2C) was lower in both CHUG members and seven other PRC genomes than in non-PRC members. We further investigated whether these genomes showed codon usage bias for the 18 amino acids that are each encoded by more than one codon (Fig. S2). CHUG members tended to use more adenine/thymine (A/T) in the synonymous codons encoding 11 and 12 amino acids when compared to their sister group and the outgroup, respectively (Fig. S2). For example, among the four synonymous codons for proline (CCA, CCT, CCG, CCC), CCA and CCT were used more frequently in CHUG (20.9% and 23.7%, respectively) than in the sister group (7.4% and 8.3%, respectively; p < 0.05 for each) and the outgroup (7.1%, p > 0.05 for CCA; 7.5%, p < 0.01 for CCT). In contrast, the frequencies of the remaining codons (CCG and CCC) were lower in CHUG (22.6% and 32.6%, respectively) than in the sister group (42.2% and 41.9%, respectively; p < 0.05 for each) and the outgroup (45.8%, p < 0.05 for CCG; 39.4%, p > 0.05 for CCC).
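The per-amino-acid codon frequencies compared above can be computed along the following lines. This is a hypothetical sketch rather than the authors' script; it assumes in-frame coding sequences given as DNA strings and reports each synonymous codon as a fraction of all codons for that amino acid.

```python
from collections import Counter


def codon_fractions(cds_sequences, codons):
    """Fraction of each listed synonymous codon among all occurrences of
    those codons in a set of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_sequences:
        cds = cds.upper()
        # Walk the sequence codon by codon, ignoring any trailing partial codon.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in codons:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("none of the requested codons were found")
    return {codon: counts[codon] / total for codon in codons}


# Toy sequence for illustration only: ATG CCA CCT CCG CCA
proline = codon_fractions(["ATGCCACCTCCGCCA"], ["CCA", "CCT", "CCG", "CCC"])
```

In this toy sequence CCA accounts for two of the four proline codons and CCC for none, so the fractions sum to one across the four synonymous codons, as do the percentages quoted for each group in the text.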

Next, we investigated amino acid usage bias within the oligotrophic CHUG and seven other PRC members (Fig. S3). The 20 amino acids can be divided into six groups based on their volume and polarity (Fig. S3) (53). Among the nonpolar and relatively small amino acids, CHUG members tended to use more isoleucine (5.7%) and less valine (6.8%) than the outgroup (4.9% for isoleucine and 7.3% for valine; p < 0.05). One possible explanation is that the codons for isoleucine (ATT, ATC and ATA) generally contain less C and G, and thus save more nitrogen, than the codons for valine (GTT, GTC, GTA, GTG) (54).

Fig. S1. The IQ-TREE maximum likelihood phylogenetic tree based on 16S rRNA gene sequences. The sequences include those extracted from the 87 genomes used in the present study, those collected from the SILVA database that clustered with CHUG, and those from the wild Roseobacter clusters named in a previous study (30). Solid circles in the phylogeny indicate nodes with bootstrap values >95%.

Fig. S2. Codon usage frequency (%) in CHUG, its sister group, the outgroup, seven other PRC members, and other reference roseobacters. Significance levels for comparisons of codon usage frequency between CHUG and the other four groups are shown in red, and those between the seven other PRC members and the remaining three groups are shown in blue. The markers * and ** denote p < 0.05 and p < 0.01 (phylANOVA analysis), respectively. The 20 amino acids are arranged by their number of carbon atoms per amino-acid-residue side chain (C-ARSC) and number of nitrogen atoms per amino-acid-residue side chain (N-ARSC).

Fig. S3. Amino acid usage frequency (%) in CHUG, its sister group, the outgroup, seven other PRC members, and other reference roseobacters. Significance levels for comparisons of amino acid usage frequency between CHUG and the other four groups are shown in red, and those between the seven other PRC members and the remaining three groups are shown in blue. The markers * and ** denote p < 0.05 and p < 0.01 (phylANOVA analysis), respectively. The amino acids are arranged by their number of carbon atoms per amino-acid-residue side chain (C-ARSC) and number of nitrogen atoms per amino-acid-residue side chain (N-ARSC). The six amino acid groups based on volume and polarity are shaded in different colors.

Fig. S4. The relative abundance of CHUG and other PRC members in the bacterioplankton communities based on recruitment analysis using the Red Sea (A) and Kwangyang Bay (B) sequencing samples.

(B) Classification of amino acids by charge
Neutral: Alanine, Asparagine, Cysteine, Glutamine, Glycine, Isoleucine, Leucine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
Positive: Arginine, Histidine, Lysine
Negative: Aspartic acid, Glutamic acid

(C) Classification of amino acids by volume and polarity
Special: Cysteine
Neutral and small: Alanine, Glycine, Proline, Serine, Threonine
Nonpolar and relatively small: Isoleucine, Leucine, Methionine, Valine
Nonpolar and relatively large: Phenylalanine, Tryptophan, Tyrosine
Polar and relatively small: Aspartic acid, Glutamic acid, Asparagine, Glutamine
Polar and relatively large: Histidine, Lysine, Arginine

Fig. S5. (A) Analysis of the dR/dC ratios of CHUG members and their sister group compared to their outgroup. These ratios are calculated based on the physicochemical classification of the 20 amino acids by charge (left) and by volume and polarity (right), respectively (48). Bars indicate one standard deviation of the mean. Results were obtained using RCCalculator (http://www.geneorder.com/RCCalculator) with GC corrections based on codon frequency (blue square), amino acid (AA) composition (red square), and the uncorrected method HON-NEW (grey square). (B) Classification of amino acids by charge. (C) Classification of amino acids by volume and polarity.

References

1. Mak SST, Gopalakrishnan S, Carøe C, Geng C, Liu S, Sinding M-HS et al. Comparative performance of the BGISEQ-500 vs Illumina HiSeq2500 sequencing platforms for palaeogenomic sequencing. Gigascience 2017; 6(8):1–13.

2. Lui W-Y, Yuen C-K, Li C, Wong WM, Lui P-Y, Lin C-H et al. SMRT sequencing revealed the diversity and characteristics of defective interfering RNAs in influenza A (H7N9) virus infection. Emerg Microbes Infect 2019; 8(1):662–74.

3. Connon SA, Giovannoni SJ. High-throughput methods for culturing microorganisms in very-low-nutrient media yield diverse new marine isolates. Appl Environ Microbiol 2002; 68(8):3878–85.

4. Song J, Oh H-M, Cho J-C. Improved culturability of SAR11 strains in dilution-to-extinction culturing from the East Sea, West Pacific Ocean. FEMS Microbiol Lett 2009:141–7.

5. Yang S-J, Kang I, Cho J-C. Expansion of cultured bacterial diversity by large-scale dilution-to-extinction culturing from a single seawater sample. Microb Ecol 2016:29–43.

6. Henson MW, Pitre DM, Weckhorst JL, Lanclos VC, Webber AT, Thrash JC. Artificial seawater media facilitate cultivating members of the microbial majority from the Gulf of Mexico. mSphere 2016; 1(2).

7. Henson MW, Lanclos VC, Pitre DM, Weckhorst JL, Lucchesi AM, Cheng C et al. Expanding the diversity of bacterioplankton isolates and modeling isolation efficacy with large-scale dilution-to-extinction cultivation. Appl Environ Microbiol 2020; 86(17).

8. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30(15):2114–20.

9. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19(5):455–77.

10. Babraham Bioinformatics. FastQC: a quality control tool for high throughput sequence data. Cambridge, UK: Babraham Institute 2011.

11. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017; 13(6):e1005595.

12. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015; 25(7):1043–55.

13. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 2018; 9(1):5114.

14. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014; 30(14):2068–9.

15. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 2015; 5:8365.

16. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000; 28(1):27–30.

17. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003; 4:41.

18. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR et al. Pfam: the protein families database. Nucleic Acids Res 2014; 42(Database issue):D222-30.

19. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003; 31(1):371–3.

20. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 2011; 39(Database issue):D225-9.

21. Price M, Deutschbauer AM, Arkin AP. GapMind: Automated annotation of amino acid biosynthesis; 2019.

22. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 2018; 36(10):996–1004.

23. Simon M, Scheuner C, Meier-Kolthoff JP, Brinkhoff T, Wagner-Döbler I, Ulbrich M et al. Phylogenomics of Rhodobacteraceae reveals evolutionary adaptation to marine and non-marine habitats. ISME J 2017; 11(6):1483–99.

24. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013; 30(4):772–80.

25. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009; 25(15):1972–3.

26. Nguyen L-T, Schmidt HA, Haeseler A von, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 2015; 32(1):268–74.

27. Kalyaanamoorthy S, Minh BQ, Wong TKF, Haeseler A von, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods 2017; 14(6):587–9.

28. Hoang DT, Chernomor O, Haeseler A von, Minh BQ, Le Vinh S. UFBoot2: Improving the ultrafast bootstrap approximation. Mol Biol Evol 2018; 35(2):518–22.

29. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 2019.

30. Buchan A, Hadden M, Suzuki MT. Development and application of quantitative-PCR tools for subgroups of the Roseobacter clade. Appl Environ Microbiol 2009; 75(23):7542–7.

31. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 2013; 41(Database issue):D590–6.

32. Billerbeck S, Wemheuer B, Voget S, Poehlein A, Giebel H-A, Brinkhoff T et al. Biogeography and environmental genomics of the Roseobacter-affiliated pelagic CHAB-I-5 lineage. Nat Microbiol 2016; 1(7):16063.

33. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 2019; 20(1):238.

34. Librado P, Vieira FG, Rozas J. BadiRate: estimating family turnover rates by likelihood-based methods. Bioinformatics 2012; 28(2):279–81.

35. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2017; 2(11):1533–42.

36. Chu X, Li S, Wang S, Luo D, Luo H. Gene loss through pseudogenization contributes to the ecological diversification of a generalist Roseobacter lineage. ISME J 2020.

37. Revell LJ. phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution 2012; 3(2):217–23.

38. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2019; 35(3):526–8.

39. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G et al. Ocean plankton. Structure and function of the global ocean microbiome. Science 2015; 348(6237):1261359.

40. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh H-J, Cuenca M et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 2019; 179(5):1068-1083.e21.

41. Vargas C de, Audic S, Henry N, Decelle J, Mahé F, Logares R et al. Eukaryotic plankton diversity in the sunlit ocean. Science 2015; 348(6237):1261605.

42. Coello-Camba A, Diaz-Rua R, Duarte CM, Irigoien X, Pearman JK, Alam IS et al. Picocyanobacteria community and cyanophage infection responses to nutrient enrichment in a mesocosms experiment in oligotrophic waters. Front Microbiol 2020; 11.

43. Kim Y, Jeon J, Kwak MS, Kim GH, Koh I, Rho M. Photosynthetic functions of Synechococcus in the ocean microbiomes of diverse salinity and seasons. PLoS ONE 2018; 13(1):e0190266.

44. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9(4):357–9.

45. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009; 25(16):2078–9.

46. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215(3):403–10.

47. Harrell Jr FE. Package ‘Hmisc’. CRAN2018 2019; 2019:235–6.

48. Luo H, Huang Y, Stepanauskas R, Tang J. Excess of non-conservative amino acid changes in marine bacterioplankton lineages with reduced genomes. Nat Microbiol 2017; 2:17091.

49. Zhang J. Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. J Mol Evol 2000; 50(1):56–68.

50. Moran MA, Buchan A, González JM, Heidelberg JF, Whitman WB, Kiene RP et al. Genome sequence of Silicibacter pomeroyi reveals adaptations to the marine environment. Nature 2004; 432(7019):910–3.

51. Lidbury I, Kimberley G, Scanlan DJ, Murrell JC, Chen Y. Comparative genomics and mutagenesis analyses of choline metabolism in the marine Roseobacter clade. Environ Microbiol 2015; 17(12):5048–62.

52. Kanagawa T, Dazai M, Fukuoka S. Degradation of O,O-dimethyl phosphorodithioate by Thiobacillus thioparus TK-1 and Pseudomonas AK-2. Agricultural and Biological Chemistry 1982; 46(10):2571–8.

53. Miyata T, Miyazawa S, Yasunaga T. Two types of amino acid substitutions in protein evolution. J Mol Evol 1979; 12(3):219–36.

54. Bohlin J, Brynildsrud O, Vesth T, Skjerve E, Ussery DW. Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS ONE 2013; 8(7):e69878.