Supplementary Information
Mechanisms Driving Genome Reduction of a Novel Roseobacter Lineage Showing
Vitamin B12 Auxotrophy
Xiaoyuan Feng, Xiao Chu, Yang Qian, Michael W. Henson, V. Celeste Lanclos, Fang Qin,
Yanlin Zhao, J. Cameron Thrash, Haiwei Luo

This PDF file includes:
Text 1. Supplementary methods
Text 2. Supplementary results
Figures S1 to S5
References

Text 1. Supplementary methods
1.1 Sampling, bacterial cultivation and genome sequencing
1.2 Genome assembly and annotation
1.3 Phylogenomic analysis
1.4 Phylogenetic analysis based on 16S rRNA genes
1.5 Genome content analysis
1.6 Statistical analysis for genomic features
1.7 Recruitment analysis using public metagenomic sequencing data
1.8 dR/dC ratio calculation
1.9 Vitamin B12 assay
Text 2. Supplementary results
2.1 Biased codon and amino acid usage

Text 1. Supplementary methods
1.1 Sampling, bacterial cultivation and genome sequencing
HKCC strains were isolated from ambient seawater samples from both the coral
Platygyra acuta ecosystem (lat. N22.52°, long. E114.3173°) and the brown alga Sargassum
hemiphyllum ecosystem, collected in 2017 by scuba diving near Hong Kong, China.
The samples were stored in 50 mL microcentrifuge tubes at 4°C. Marine basal medium
(MBM) was modified with 1 mM dimethylsulfoniopropionate (DMSP) as the sole carbon
source. A 100-fold serial dilution of seawater was prepared, and a 100 μL aliquot of the
diluted sample was spread on the MBM-DMSP medium. The isolation plates were incubated
at room temperature. Colonies were selected after one week and repeatedly re-streaked on
2216E marine agar (BD Difco, USA) to collect biomass for DNA extraction. Genomic
DNA was extracted using the E.Z.N.A. Bacterial DNA Kit and sent to Qingdao Huada Gene
Biotechnology Co., Ltd for library preparation and genome sequencing on the BGISEQ-500
platform (PE100) following the standard protocol (1). In addition, biomass of HKCCA1288
was sent to Qingdao Huada Gene Biotechnology Co., Ltd for library preparation and genome
sequencing on the PacBio Sequel platform following a previous study (2) to obtain a complete
and closed genome.
FZCC strains were isolated from the coastal water of Pingtan Island (lat. N25.43°, long.
E119.78°) in May 2017 using the high-throughput dilution-to-extinction cultivation (HTC)
method (3–5). The seawater sample was diluted with seawater-based media (4, 5) and
dispensed into 24-well polystyrene microplates (Corning Incorporated) at a final
inoculation density of 3 cells/well. The plates were incubated at 20°C in the dark for four
weeks and counted by flow cytometry after staining with SYBR Green I. Wells with at least
10⁵ cells/mL were defined as positive cultures. Cells (500 μL) from positive cultures were
collected by centrifugation (10,000 rpm, 30 min) and identified by 16S rRNA gene sequence
identity as described previously (5). FZCC strains used for genome sequencing were grown
in polycarbonate flasks (50 mL culture volume) to a cell density of >10⁶ cells/mL and
collected by centrifugation. Genomic DNA was extracted using the DNeasy Blood & Tissue Kit
(Qiagen, Valencia, CA, USA) and sent for genome sequencing on the Illumina HiSeq
2500 (PE150) at the Hubbard Center for Genome Studies, University of New Hampshire.
LSUCC strains were isolated as previously described (6, 7), with the exception of
LSUCC1028, for which we used a modified JW1 medium (6) containing no carbon sources
except for fulvic acids (Santa Cruz Biotechnology, Inc, TX, USA; CAS 479-66-3) at 1 mg/L
(“JW1FA”). LSUCC1028 was isolated from surface water collected from Terrebonne Bay,
LA (lat. N29.1872°, long. W90.62822°) in July 2015. Briefly, the collected water was diluted
in JW1FA and inoculated into a 96-well Teflon plate at a density of 2 cells/well. The plate
was incubated in the dark at 24°C for two weeks and counted by flow cytometry. Each well
was transferred into a new plate containing fresh medium and counted after one week. This was
repeated once more for a total of three plates. Any wells that remained positive (10⁴ cells/mL)
after three consecutive plates were subjected to three rounds of serial dilution to ensure
culture purity. Cultures were then grown in polycarbonate flasks (50 mL culture volume) to a
cell density of 10⁶ cells/mL and identified by 16S rRNA gene sequence identity as previously
described (6, 7). Genomic DNA was extracted from LSUCC0031 via the DNeasy kit
(Qiagen), from LSUCC0246 and LSUCC0387 via phenol-chloroform, and from LSUCC1028
via the PowerWater kit (Mo Bio). Genomic DNA was sent for genome sequencing on the
Illumina HiSeq 2500 (PE150) at the Hubbard Center for Genome Studies, University of New
Hampshire.

1.2 Genome assembly and annotation
The Illumina sequencing raw reads were quality trimmed with Trimmomatic v0.36 (8)
with options ‘SLIDINGWINDOW:4:15 MAXINFO:40:0.9 MINLEN:40’ and assembled
using SPAdes v3.10.1 (9) with the ‘--careful’ option. Only contigs with length >2,000 bp and
sequencing depth >5x were retained.
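
The contig filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: it assumes the SPAdes contig naming convention (`NODE_<n>_length_<bp>_cov_<x>`) and uses the k-mer coverage in the header as a stand-in for sequencing depth, which may differ from the depth measure actually used. The function names are hypothetical.

```python
import re

# SPAdes names contigs like "NODE_1_length_52341_cov_23.7". The "cov" field is
# k-mer coverage; using it as sequencing depth is an assumption of this sketch.
HEADER_RE = re.compile(r"length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=2000, min_depth=5.0):
    """Keep contigs whose header reports length > min_length and depth > min_depth."""
    kept = []
    for header, seq in read_fasta(lines):
        match = HEADER_RE.search(header)
        if match is None:
            continue  # header is not in the assumed SPAdes format
        length, depth = int(match.group(1)), float(match.group(2))
        if length > min_length and depth > min_depth:
            kept.append((header, seq))
    return kept
```

Passing the lines of an open FASTA file to `filter_contigs` returns only the contigs that pass both thresholds; both comparisons are strict, matching the ">" in the text.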
The quality of the PacBio reads for HKCCA1288 was checked using FastQC
v0.11.4 (10). This strain was assembled by combining the Illumina short reads and the
PacBio long reads using Unicycler v0.4.6 (11) with default parameters.
Genome completeness, contamination, and strain heterogeneity (Table S1) were
calculated using CheckM v1.0.7 (12). Three marker genes (PF05958, PF06723 and PF07991)
were excluded from the calculation because they were either absent or present in multiple
copies in the closed genome of HKCCA1288. The ANI between genomes was calculated using
fastANI v1.3 (13) with the default parameters.
Protein-coding genes were predicted with the Prokka annotation pipeline v1.12 (14).
Protein sequences were annotated using the online RAST (15) and KEGG (16) servers. They
were further searched against the COG (17), Pfam (18), TIGRFAM (19) and CDD (20)
databases, all of which were downloaded in February 2020. Proteins involved in amino acid
biosynthesis were inferred using the online tool GapMind (21).

1.3 Phylogenomic analysis
To place the CHUG lineage into the phylogeny of the Roseobacter group, we performed
a phylogenomic analysis based on 120 bacterial marker genes (22). Eight CHUG genomes,
one newly sequenced reference genome (LSUCC0031), and 78 additional reference genomes
from a previous study (23) were used for phylogenomic tree construction. Marker genes were
each aligned at the amino acid sequence level using MAFFT v7.222 (24) and trimmed using
trimAl v1.4.rev15 (25) with ‘-resoverlap 0.55 -seqoverlap 60’ options. The trimmed
alignments were concatenated using a custom script
(https://github.com/luolab-cuhk/CHUG-genome-reduction-project) into a super-alignment
with 456,904 sites. The maximum likelihood (ML) phylogenomic tree was built using
IQ-TREE v1.6.2 (26) with ModelFinder (27) selecting the best substitution model, and a
total of 1,000 ultrafast bootstrap replicates were sampled to assess the robustness of the
phylogeny (28). The phylogeny was visualized using iTOL (29).
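
The concatenation step can be illustrated with a short Python sketch. This is not the custom script from the repository cited above; it only shows one common way to build a super-alignment, under the assumption that a genome lacking a marker gene is padded with gap characters for that gene. The function name and the toy sequences are hypothetical.

```python
def concatenate_alignments(alignments):
    """Concatenate per-gene alignments into one super-alignment.

    alignments: list of dicts mapping genome name -> aligned sequence.
    Genomes missing from a gene alignment are padded with gaps ("-")
    so that every genome ends up with the same total length.
    """
    genomes = sorted({name for aln in alignments for name in aln})
    parts = {name: [] for name in genomes}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for name in genomes:
            parts[name].append(aln.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy example with two marker genes; genome "B" lacks the second gene.
gene1 = {"A": "MK-L", "B": "MKVL"}
gene2 = {"A": "GG"}
super_alignment = concatenate_alignments([gene1, gene2])
```

The total number of columns in the result is the sum of the individual alignment widths, which is how a site count for the super-alignment arises.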

1.4 Phylogenetic analysis based on 16S rRNA genes
To demonstrate that the CHUG lineage represents a distinct phylogenetic branch in the
Roseobacter group, we constructed an ML phylogeny based on 16S rRNA gene sequences of
both cultured and uncultivated roseobacters following the same method mentioned above.
The cultured roseobacters’ 16S rRNA genes were extracted from the above-mentioned 87
genomes, and the uncultivated roseobacters’ sequences were retrieved from a previous study
(30). To broadly search uncultivated 16S rRNA gene sequences that are closely related to the
cultured CHUG members, we built a preliminary 16S rRNA gene tree (data not shown) with
all Roseobacter sequences from the SILVA database (31) and those from the CHUG
genomes, and identified three SILVA sequences closely related to the CHUG isolates. These
three sequences (JN119120, JQ197701 and GQ342302) were also included in the
above-mentioned IQ-TREE phylogenetic analysis.

1.5 Genome content analysis
To identify the shared genomic content between CHUG and the previously reported
pelagic Roseobacter cluster (PRC) (32), we reconstructed a dendrogram based on the
presence and absence patterns of orthologous families. Orthologous gene families were
identified using OrthoFinder v2.2.1 (33) with ‘-S diamond -M msa’ options. The binary
matrix of presence and absence patterns for each orthologous gene family was used for
dendrogram construction using IQ-TREE v1.6.2 (26), and a total of 1,000 ultrafast bootstrap
replicates were sampled to assess the robustness of the dendrogram (28). The dendrogram
was visualized using iTOL (29).
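
As an illustration of the binary matrix described above, the following sketch converts per-family gene counts into presence/absence rows, one per genome. It is a minimal example under the assumption that the input is a table of gene counts per orthologous family and genome; the data structure and function names are hypothetical and not taken from the authors' pipeline.

```python
def presence_absence_matrix(gene_counts):
    """Convert per-family gene counts into a binary presence/absence matrix.

    gene_counts: dict mapping family ID -> dict of genome name -> gene count.
    Returns (genomes, families, rows), where rows[i][j] is 1 if genome i has
    at least one gene in family j and 0 otherwise.
    """
    families = sorted(gene_counts)
    genomes = sorted({g for counts in gene_counts.values() for g in counts})
    rows = [
        [1 if gene_counts[fam].get(genome, 0) > 0 else 0 for fam in families]
        for genome in genomes
    ]
    return genomes, families, rows


def to_binary_strings(genomes, rows):
    """Render each genome's row as a string of 0/1 characters."""
    return {g: "".join(str(v) for v in row) for g, row in zip(genomes, rows)}


# Toy example: two families, two genomes.
counts = {"OG1": {"A": 2, "B": 0}, "OG2": {"B": 1}}
genomes, families, rows = presence_absence_matrix(counts)
```

Each 0/1 string then plays the role of an aligned sequence of binary characters, one column per orthologous family.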
To help understand the evolutionary process giving rise to the CHUG lineage, we
reconstructed the genome content of the ancestral nodes related to CHUG, its sister lineage
and their outgroups, and also inferred the gene gain and loss events using BadiRate v1.35
(34) with ‘-anc -bmodel FR -rmodel BDI -ep CSP’ options. A pruned subtree of the
phylogeny inferred based on 120 bacterial marker genes (Fig. 1A) and a table of gene counts
for each orthologous gene family predicted by OrthoFinder (33) were used as the inputs. The
ancestral genome sizes were estimated based on the number of orthologous gene families
under a linear regression model (R² = 0.97, p < 0.01) in R.
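
The regression step can be sketched as an ordinary least-squares fit. The example below is written in Python rather than R and rests on an assumed reading of the text: the line is fitted on extant genomes (genome size against number of orthologous gene families) and then applied to the family counts inferred for ancestral nodes. All numbers are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def predict(slope, intercept, x_new):
    """Apply the fitted line to a new x value."""
    return slope * x_new + intercept


# Hypothetical extant genomes: family counts and genome sizes (Mbp).
families = [1000, 2000, 3000]
sizes_mbp = [1.1, 2.1, 3.1]
slope, intercept, r2 = fit_line(families, sizes_mbp)

# Hypothetical ancestral node with 2,500 inferred families.
ancestral_size = predict(slope, intercept, 2500)
```

With real data the fit would not be perfect; the R² reported in the text plays the role of the third return value here.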

1.6 Statistical analysis for genomic features
The assembled genome size, gene number, coding density and GC content were obtained
using CheckM v1.0.7 (12), then the estimated genome size was adjusted as (assembled
genome size)/(completeness + contamination) (35). Pseudogenes were predicted following
our recent study (36). The number of carbon atoms per amino-acid-residue side chain
(C-ARSC), the number of nitrogen atoms per amino-acid-residue side chain (N-ARSC), amino
acid usage, and codon usage were retrieved using custom scripts
(https://github.com/luolab-cuhk/CHUG-genome-reduction-project). The number of orthologous
families and the mean number of genes per orthologous family were calculated based on the
gene families clustered by OrthoFinder v2.2.1 (33). Phylogenetic ANOVA analyses were
performed to compare these genomic features between different Roseobacter lineages using
the ‘phylANOVA’ function in the ‘phytools’ R package, which allows controlling for
phylogenetic impact on these metrics (37).
Next, we identified genes that were either enriched or depleted in CHUG and other PRC
members compared to other roseobacters. Briefly, the phylogenetic signal of functional genes
(coxL, pdo, sox) or traits (light utilization) was checked with the ‘phylosig’ function of the
‘phytools’ R package (37). A strong phylogenetic signal was identified in the distribution of
pdo, sox and the light utilization trait (lambda > 0.99, p < 0.001 for each), so the subsequent
tests of whether they were associated with a particular category (PRC or non-PRC) were
controlled for evolutionary history using the ‘binaryPGLMM’ function in the ‘ape’ R
package (38). On the other hand, there was no phylogenetic signal for the coxL gene (lambda
< 0.01), so a χ² test was used to test whether it was associated with a particular category (PRC
or non-PRC).
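
The genome-size adjustment given at the start of this section can be written out as a small function. This sketch assumes that completeness and contamination are supplied as percentages (as CheckM reports them) and converted to fractions before the division; the function name and the example values are hypothetical.

```python
def estimated_genome_size(assembled_bp, completeness_pct, contamination_pct):
    """Estimated size = assembled size / (completeness + contamination).

    Completeness and contamination are given in percent and converted
    to fractions (an assumption of this sketch).
    """
    fraction = (completeness_pct + contamination_pct) / 100.0
    if fraction <= 0:
        raise ValueError("completeness + contamination must be positive")
    return assembled_bp / fraction


# Hypothetical genome: 2.97 Mbp assembled, 98% complete, 1% contaminated.
size = estimated_genome_size(2_970_000, 98.0, 1.0)
```

For a closed genome with 100% completeness and no contamination, the function returns the assembled size unchanged.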

1.7 Recruitment analysis using public metagenomic sequencing data
To identify the global occurrence and activity of CHUG members, the TARA Ocean
metagenomic and metatranscriptomic sequencing data (39–41) were downloaded and mapped
to the genomes. Two additional metagenomic datasets sampled from the Red Sea (42)
and Kwangyang Bay (43) were also included because an increased abundance of CHUG
members was identified in a preliminary analysis. These public sequencing data were quality
trimmed with Trimmomatic v0.36 (8) with options ‘SLIDINGWINDOW:4:15
MAXINFO:40:0.9 MINLEN:40’ and were subsequently mapped to the 89 Roseobacter
genomes using Bowtie 2 v2.3.2 (44) with the parameter ‘--very-sensitive-local’ for a quick
screening. Mapped reads were extracted using SAMtools v1.4.1 (45) and more precisely
searched against the 89 genomes using BLASTN (46) with the setting ‘-evalue 1e-5
-perc_identity 95 -qcov_hsp_perc 80’. Thereby only those mapped reads that shared >95%
identity and >80% coverage with a reference genome were kept for further calculations. The
relative abundances of CHUG and other PRC members were represented using Reads Per
Kilobase per Million mapped reads (RPKM) and compared using the Wilcoxon test in the
‘ggplot2’ R package. The correlation analysis was performed using the ‘rcorr’ function in the
‘Hmisc’ R package (47), and the significance level was adjusted using the stringent Bonferroni
correction. These analyses were not performed for each CHUG genome individually because
these genomes are closely related (95.4 ± 2.4% ANI) and some reads were equally mapped to
multiple genomes.
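
The RPKM normalization used above can be made concrete with a short sketch. It assumes that the denominator uses the total number of mapped reads in the sample, as the name of the metric suggests; the function name and the example numbers are hypothetical.

```python
def rpkm(reads_mapped, genome_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    reads_mapped: reads recruited to the genome (or genome group)
    genome_length_bp: reference length in base pairs
    total_mapped_reads: all mapped reads in the sample (assumed denominator)
    """
    kilobases = genome_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return reads_mapped / (kilobases * millions)


# Hypothetical sample: 500 reads recruited to a 2.5 Mbp genome
# out of 10 million mapped reads.
value = rpkm(500, 2_500_000, 10_000_000)
```

Dividing by genome length makes values comparable between references of different sizes, and dividing by the read total makes them comparable between samples of different sequencing depth.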

1.8 dR/dC ratio calculation
We calculated the ratio of radical nonsynonymous nucleotide substitutions per radical
nonsynonymous site (dR) versus conservative nonsynonymous nucleotide substitutions per
conservative nonsynonymous site (dC) following our previous protocol (48). Briefly,
orthologous gene families from CHUG members and their sister group were compared to
those from the outgroup, respectively. The 20 amino acids were categorized into three groups
based on charge or six groups according to volume and polarity (48). For each method of
categorization, a nonsynonymous nucleotide substitution was considered ‘conservative’ if it
led to a within-group replacement of amino acids and ‘radical’ if it resulted in a
between-group replacement of amino acids. The dR/dC ratio was calculated using three
methods: GC-corrections based on codon frequency, GC-corrections based on amino acid
composition, and the traditional uncorrected method (49), all of which were implemented in
RCCalculator (48).
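
The conservative/radical distinction defined above can be illustrated with a small classifier. The two groupings follow the classifications listed in Fig. S5B and S5C; the code itself is only a sketch with hypothetical names, and it does not compute dR/dC (which also requires counting radical and conservative sites, as RCCalculator does).

```python
# Three groups by charge (Fig. S5B), one-letter amino acid codes.
CHARGE_GROUPS = {
    "neutral": "ACFGILMNPQSTVWY",
    "positive": "RHK",
    "negative": "DE",
}

# Six groups by volume and polarity (Fig. S5C).
VOLUME_POLARITY_GROUPS = {
    "special": "C",
    "neutral and small": "AGPST",
    "nonpolar and relatively small": "ILMV",
    "nonpolar and relatively large": "FWY",
    "polar and relatively small": "DENQ",
    "polar and relatively large": "HKR",
}


def _index(groups):
    """Map each amino acid to the name of its group."""
    return {aa: name for name, members in groups.items() for aa in members}


def classify_replacement(aa_from, aa_to, groups):
    """Return 'conservative' for within-group and 'radical' for between-group changes."""
    if aa_from == aa_to:
        raise ValueError("identical residues are not a replacement")
    lookup = _index(groups)
    same_group = lookup[aa_from] == lookup[aa_to]
    return "conservative" if same_group else "radical"
```

The same replacement can fall into different classes under the two schemes: aspartic acid to asparagine is radical by charge but conservative by volume and polarity.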

1.9 Vitamin B12 assay
To validate the vitamin B12 auxotrophy in CHUG members, a growth assay was
performed with HKCCA1288 as the experimental CHUG strain and the model roseobacter
strain Ruegeria pomeroyi DSS-3 (50) as the positive control. The starter culture was prepared
in 2216 marine broth (Difco) prior to the growth experiments. In the vitamin B12 assay, the
defined marine ammonium mineral salts (MAMS) medium was supplemented with ribose (30
mM, Sigma) as the sole carbon source and with SL-10 trace metals solution (1 mL/L) (51).
Vitamin mixture in the presence or absence of vitamin B12 was added as described previously
(52). Strains were cultivated in MAMS medium for 96 h, and samples were collected every
12 h for cell counting. The collected cells were stained with SYBR Green I (Life
Technologies, USA) dye for 15 min, and cell numbers were counted using a flow cytometer
(Guava EasyCyte Plus, MA, USA) equipped with a fluorescence detector. All experiments in
this section were performed in triplicate.

Text 2. Supplementary results
2.1 Biased codon and amino acid usage
As mentioned in the main paper, the GC content (Fig. 2C) was lower in both CHUG
members and seven other PRC genomes compared to non-PRC members. We further
investigated whether there was codon usage bias within these genomes for the 18 amino acids
each encoded by more than one codon (Fig. S2). CHUG members tended to use more
adenine/thymine (A/T) in the synonymous codons encoding 11 and 12 amino acids when
compared to their sister group and the outgroup, respectively (Fig. S2). For example, among the
four synonymous codons for proline (CCA, CCT, CCG, CCC), CCA and CCT were more
frequently used in CHUG (20.9% and 23.7%, respectively) compared to its sister group
(7.4% and 8.3%, respectively; p < 0.05 for each) and outgroup (7.1%, p > 0.05 for CCA;
7.5%, p < 0.01 for CCT). In contrast, the frequencies of the remaining codons (CCG and CCC)
were reduced in CHUG (22.6% and 32.6%, respectively) compared to its sister group (42.2%
and 41.9%, respectively; p < 0.05 for each) and outgroup (45.8%, p < 0.05 for CCG; 39.4%,
p > 0.05 for CCC).
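
The proline percentages quoted above sum to roughly 100% within each group, which suggests that usage is expressed as the share of each codon among the synonymous codons of one amino acid. Under that assumption, a minimal Python sketch of the calculation looks as follows; it is not the authors' custom script, and the function name and toy sequence are hypothetical.

```python
from collections import Counter

PROLINE_CODONS = ("CCA", "CCT", "CCG", "CCC")


def synonymous_codon_usage(cds_list, codons):
    """Percentage of each codon among the given synonymous codons.

    cds_list: iterable of in-frame coding sequences (DNA strings).
    codons: the synonymous codons of one amino acid.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in codons:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in codons}
    return {codon: 100.0 * counts[codon] / total for codon in codons}


# Toy gene: ATG CCA CCT CCA CCG TAA
usage = synonymous_codon_usage(["ATGCCACCTCCACCGTAA"], PROLINE_CODONS)
```

Running the same function with the codon set of another amino acid gives the corresponding panel-style percentages for that amino acid.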
Next, we investigated the amino acid usage frequency bias within the oligotrophic
CHUG and seven other PRC members (Fig. S3). The 20 amino acids can be divided into six
groups based on their volume and polarity (Fig. S3) (53). Among the nonpolar and relatively
small amino acids, CHUG members tended to use more isoleucine (5.7%) and less valine
(6.8%) compared to the outgroup (4.9% for isoleucine and 7.3% for valine; p < 0.05). One
possible explanation is that the codons for isoleucine (ATT, ATC and ATA) generally use
fewer C and G bases and thus save more nitrogen than the codons for valine (GTT, GTC, GTA,
GTG) (54).
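
The nitrogen argument can be checked with simple arithmetic. The sketch below relies on standard base chemistry that is not stated in the source: adenine and guanine each contain five nitrogen atoms, cytosine three and thymine two, so an A:T base pair carries 7 nitrogen atoms and a G:C base pair carries 8. It also assumes, for simplicity, that the synonymous codons are used equally often.

```python
# Nitrogen atoms per base pair in double-stranded DNA:
# A (5 N) pairs with T (2 N) -> 7; G (5 N) pairs with C (3 N) -> 8.
NITROGEN_PER_BASE_PAIR = {"A": 7, "T": 7, "G": 8, "C": 8}

ILE_CODONS = ("ATT", "ATC", "ATA")
VAL_CODONS = ("GTT", "GTC", "GTA", "GTG")


def gc_fraction(codons):
    """Fraction of G or C bases across a set of codons (equal codon weights)."""
    bases = "".join(codons)
    return sum(base in "GC" for base in bases) / len(bases)


def mean_nitrogen_per_codon(codons):
    """Mean nitrogen atoms per codon, counted over both DNA strands."""
    total = sum(NITROGEN_PER_BASE_PAIR[base] for codon in codons for base in codon)
    return total / len(codons)


ile_gc, val_gc = gc_fraction(ILE_CODONS), gc_fraction(VAL_CODONS)
ile_n, val_n = mean_nitrogen_per_codon(ILE_CODONS), mean_nitrogen_per_codon(VAL_CODONS)
```

Under these assumptions the isoleucine codons contain one G/C base out of nine, the valine codons six out of twelve, and a valine codon carries on average slightly more than one additional nitrogen atom per codon position in the double helix.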

Fig. S1. The IQ-TREE maximum likelihood phylogenetic tree based on 16S rRNA gene sequences. The sequences include those extracted from the 87 genomes used in the present study, those collected from the SILVA database and clustered with CHUG, and those from the wild Roseobacter clusters named in a previous study (30). Solid circles in the phylogeny indicate nodes with bootstrap values >95%.

Fig. S2. Codon usage frequency between CHUG, its sister group, the outgroup, seven other PRC members, and other reference roseobacters. The significance levels of the differences in codon usage frequency between CHUG and the other four groups are shown in red, while those between the seven other PRC members and the remaining three groups are shown in blue. The markers * and ** denote p < 0.05 and p < 0.01 (phylANOVA analysis), respectively. The 20 amino acids are placed based on their number of carbon atoms per amino-acid-residue side chain (C-ARSC) and number of nitrogen atoms per amino-acid-residue side chain (N-ARSC).

Fig. S3. Amino acid usage frequency between CHUG, its sister group, the outgroup, seven other PRC members, and other reference roseobacters. The significance levels of the differences in amino acid usage frequency between CHUG and the other four groups are shown in red, while those between the seven other PRC members and the remaining three groups are shown in blue. The markers * and ** denote p < 0.05 and p < 0.01 (phylANOVA analysis), respectively. The amino acids are placed based on their number of carbon atoms per amino-acid-residue side chain (C-ARSC) and number of nitrogen atoms per amino-acid-residue side chain (N-ARSC). The six amino acid groups based on volume and polarity are shaded with different colors.

Fig. S4. The relative abundance of CHUG and other PRC members in the bacterioplankton communities based on recruitment analysis using the Red Sea (A) and Kwangyang Bay (B) sequencing samples.

Fig. S5. (A) Analysis of the dR/dC ratios of CHUG members and their sister group compared to their outgroup. These ratios are calculated based on the physicochemical classification of the 20 amino acids by charge (left) and by volume and polarity (right), respectively (48). Bars indicate one standard deviation of the mean. Results were obtained using RCCalculator (http://www.geneorder.com/RCCalculator) with GC-corrections based on codon frequency (blue square), amino acid (AA) composition (red square), and uncorrected method HON-NEW (grey square). (B) Classification of amino acids by charge. (C) Classification of amino acids by volume and polarity.

(B) Charge: Amino acid
Neutral: Alanine, Asparagine, Cysteine, Glutamine, Glycine, Isoleucine, Leucine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
Positive: Arginine, Histidine, Lysine
Negative: Aspartic acid, Glutamic acid

(C) Volume and polarity: Amino acid
Special: Cysteine
Neutral and small: Alanine, Glycine, Proline, Serine, Threonine
Nonpolar and relatively small: Isoleucine, Leucine, Methionine, Valine
Nonpolar and relatively large: Phenylalanine, Tryptophan, Tyrosine
Polar and relatively small: Aspartic acid, Glutamic acid, Asparagine, Glutamine
Polar and relatively large: Histidine, Lysine, Arginine

References
1. Mak SST, Gopalakrishnan S, Carøe C, Geng C, Liu S, Sinding M-HS et al. Comparative performance of the BGISEQ-500 vs Illumina HiSeq2500 sequencing platforms for palaeogenomic sequencing. Gigascience 2017; 6(8):1–13.
2. Lui W-Y, Yuen C-K, Li C, Wong WM, Lui P-Y, Lin C-H et al. SMRT sequencing revealed the diversity and characteristics of defective interfering RNAs in influenza A (H7N9) virus infection. Emerg Microbes Infect 2019; 8(1):662–74.
3. Connon SA, Giovannoni SJ. High-throughput methods for culturing microorganisms in very-low-nutrient media yield diverse new marine isolates. Appl Environ Microbiol 2002; 68(8):3878–85.
4. Song J, Oh H-M, Cho J-C. Improved culturability of SAR11 strains in dilution-to-extinction culturing from the East Sea, West Pacific Ocean. FEMS Microbiol Lett 2009:141–7.
5. Yang S-J, Kang I, Cho J-C. Expansion of cultured bacterial diversity by large-scale dilution-to-extinction culturing from a single seawater sample. Microb Ecol 2016:29–43.
6. Henson MW, Pitre DM, Weckhorst JL, Lanclos VC, Webber AT, Thrash JC. Artificial seawater media facilitate cultivating members of the microbial majority from the Gulf of Mexico. mSphere 2016; 1(2).
7. Henson MW, Lanclos VC, Pitre DM, Weckhorst JL, Lucchesi AM, Cheng C et al. Expanding the diversity of bacterioplankton isolates and modeling isolation efficacy with large-scale dilution-to-extinction cultivation. Appl Environ Microbiol 2020; 86(17).
8. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30(15):2114–20.
9. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19(5):455–77.
10. Babraham Bioinformatics. FastQC: a quality control tool for high throughput sequence data. Cambridge, UK: Babraham Institute 2011.
11. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017; 13(6):e1005595.
12. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015; 25(7):1043–55.
13. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 2018; 9(1):5114.
14. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014; 30(14):2068–9.
15. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 2015; 5:8365.
16. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000; 28(1):27–30.
17. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003; 4:41.
18. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR et al. Pfam: the protein families database. Nucleic Acids Res 2014; 42(Database issue):D222-30.
19. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003; 31(1):371–3.
20. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 2011; 39(Database issue):D225-9.
21. Price M, Deutschbauer AM, Arkin AP. GapMind: Automated annotation of amino acid biosynthesis; 2019.
22. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 2018; 36(10):996–1004.
23. Simon M, Scheuner C, Meier-Kolthoff JP, Brinkhoff T, Wagner-Döbler I, Ulbrich M et al. Phylogenomics of Rhodobacteraceae reveals evolutionary adaptation to marine and non-marine habitats. ISME J 2017; 11(6):1483–99.
24. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013; 30(4):772–80.
25. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009; 25(15):1972–3.
26. Nguyen L-T, Schmidt HA, Haeseler A von, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 2015; 32(1):268–74.
27. Kalyaanamoorthy S, Minh BQ, Wong TKF, Haeseler A von, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods 2017; 14(6):587–9.
28. Hoang DT, Chernomor O, Haeseler A von, Minh BQ, Le Vinh S. UFBoot2: Improving the ultrafast bootstrap approximation. Mol Biol Evol 2018; 35(2):518–22.
29. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 2019.
30. Buchan A, Hadden M, Suzuki MT. Development and application of quantitative-PCR tools for subgroups of the Roseobacter clade. Appl Environ Microbiol 2009; 75(23):7542–7.
31. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 2013; 41(Database issue):D590-6.
32. Billerbeck S, Wemheuer B, Voget S, Poehlein A, Giebel H-A, Brinkhoff T et al. Biogeography and environmental genomics of the Roseobacter-affiliated pelagic CHAB-I-5 lineage. Nat Microbiol 2016; 1(7):16063.
33. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 2019; 20(1):238.
34. Librado P, Vieira FG, Rozas J. BadiRate: estimating family turnover rates by likelihood-based methods. Bioinformatics 2012; 28(2):279–81.
35. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2017; 2(11):1533–42.
36. Chu X, Li S, Wang S, Luo D, Luo H. Gene loss through pseudogenization contributes to the ecological diversification of a generalist Roseobacter lineage. ISME J 2020.
37. Revell LJ. phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution 2012; 3(2):217–23.
38. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2019; 35(3):526–8.
39. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G et al. Ocean plankton. Structure and function of the global ocean microbiome. Science 2015; 348(6237):1261359.
40. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh H-J, Cuenca M et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 2019; 179(5):1068-1083.e21.
41. Vargas C de, Audic S, Henry N, Decelle J, Mahé F, Logares R et al. Eukaryotic plankton diversity in the sunlit ocean. Science 2015; 348(6237):1261605.
42. Coello-Camba A, Diaz-Rua R, Duarte CM, Irigoien X, Pearman JK, Alam IS et al. Picocyanobacteria community and cyanophage infection responses to nutrient enrichment in a mesocosms experiment in oligotrophic waters. Front. Microbiol. 2020; 11.
43. Kim Y, Jeon J, Kwak MS, Kim GH, Koh I, Rho M. Photosynthetic functions of Synechococcus in the ocean microbiomes of diverse salinity and seasons. PLoS ONE 2018; 13(1):e0190266.
44. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9(4):357–9.
45. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009; 25(16):2078–9.
46. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215(3):403–10.
47. Harrell Jr FE. Package ‘Hmisc’. CRAN2018 2019; 2019:235–6.
48. Luo H, Huang Y, Stepanauskas R, Tang J. Excess of non-conservative amino acid changes in marine bacterioplankton lineages with reduced genomes. Nat Microbiol 2017; 2:17091.
49. Zhang J. Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. J Mol Evol 2000; 50(1):56–68.
50. Moran MA, Buchan A, González JM, Heidelberg JF, Whitman WB, Kiene RP et al. Genome sequence of Silicibacter pomeroyi reveals adaptations to the marine environment. Nature 2004; 432(7019):910–3.
51. Lidbury I, Kimberley G, Scanlan DJ, Murrell JC, Chen Y. Comparative genomics and mutagenesis analyses of choline metabolism in the marine Roseobacter clade. Environ Microbiol 2015; 17(12):5048–62.
52. Kanagawa T, Dazai M, Fukuoka S. Degradation of O,O-dimethyl phosphorodithioate by Thiobacillus thioparus TK-1 and Pseudomonas AK-2. Agricultural and Biological Chemistry 1982; 46(10):2571–8.
53. Miyata T, Miyazawa S, Yasunaga T. Two types of amino acid substitutions in protein evolution. J Mol Evol 1979; 12(3):219–36.
54. Bohlin J, Brynildsrud O, Vesth T, Skjerve E, Ussery DW. Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS ONE 2013; 8(7):e69878.