Ecological and functional capabilities of an uncultured Kordia sp.

Royo-Llonch M1,4, Sánchez P1, González JM2, Pedrós-Alió C3, Acinas SG1*

1 Department of Marine Biology and Oceanography, Institut de Ciències del Mar (ICM), CSIC, Barcelona, Spain
2 Department of Microbiology, University of La Laguna, La Laguna, Spain
3 Systems Biology Program, Centro Nacional de Biotecnología (CNB), CSIC, Madrid, Spain
4 Departament de Genètica i Microbiologia, Facultat de Biociències, Universitat Autònoma de Barcelona, Bellaterra, Spain

*Correspondence:
Dr. Silvia G. Acinas [email protected]

Keywords: SAG, Kordia, co-assembly, photoheterotrophy, metagenomics

Abstract
Cultivable bacteria are only a fraction of the diversity in microbial communities. However, the official procedures for classification and characterization of a novel prokaryotic species still rely on isolates. Thanks to Single Cell Genomics, it is possible to retrieve genomes from environmental samples by sequencing individual cells, and to assign specific genes to a specific taxon, regardless of its ability to grow in culture. In this study, we performed a complete description of the uncultured Kordia sp. TARA_039_SRF, a proposed novel species within the genus Kordia, using culture-independent techniques. The type material was a high-quality draft genome (94.97% complete, 4.65% gene redundancy) co-assembled from 10 nearly identical Single Amplified Genomes (SAGs) from surface seawater in the North Indian Ocean, sampled during the Tara Oceans Expedition. The assembly process was optimized to obtain the best possible assembly metrics and a less fragmented genome. Its closest relative is Kordia periserrulae, sharing 97.56% similarity of the 16S rRNA gene, 75% of their orthologs and 89.13% average nucleotide identity. We describe the functional potential of the proposed novel species, which includes proteorhodopsin, the ability to incorporate nitrate, cytochrome oxidases with high affinity for oxygen and CAZymes that are unique features within the genus. Its abundance at different depths and size fractions was also evaluated together with its functional annotation, suggesting that its putative ecological niche is particles of phytoplanktonic origin. It can attach to these particles and consume them while sinking to the deeper, oxygen-depleted layers of the North Indian Ocean.

Introduction
There is no consensus on which ecologically coherent units should be used as a proxy for bacterial species in environmental communities [23,52,67,72]. Moreover, for a new bacterial species to be officially recognized, its isolation in pure culture is still a requirement. It is therefore necessary to redefine bacterial “species” to fit the reality of the (still uncultured) majority of bacteria. The validation of novel uncultured high-quality genomes also needs updating, as they are given a provisional Candidatus status even after a thorough description [37]. Recent efforts have emerged to facilitate the standardization of novel uncultured taxa [37], like the Microbial Genome Atlas (MiGA), which infers genome-based taxonomy and quality assessment across genomes from different environments [66].
Nowadays, uncultured genomes can be used to expand knowledge of the genetic and evolutionary differences between representatives of the same species or close relatives, as well as to enrich knowledge of microbial diversity. Uncultured genomes can be retrieved by: i) co-assembling metagenomes (Metagenome-Assembled Genomes or MAGs) or ii) single cell genomics (Single Amplified Genomes or SAGs). In marine microbiology, assembling MAGs has yielded a vast number of genomes belonging to novel bacterial and archaeal phyla, unveiling key players in the biogeochemistry of the oceans [18,56,57]. Alternatively, Single Cell Genomics allows the assignment of specific functional traits to a specific taxon, an outstanding level of resolution for microdiversity studies. Single Cell Genomics has helped link functional roles to relevant taxonomic groups, which has changed the current understanding of predominant marine metabolisms (such as chemolithoautotrophy or the role of nitrite-oxidizing bacteria in the deep ocean) [53,77]. While MAGs tend to be longer and more complete than SAGs, MAGs result from a population of genomes and there is still a lack of consensus about which taxonomic units they reflect. Nevertheless, multiple SAGs can be co-assembled to retrieve more complete genomes, provided they are closely related phylogenetically. The comparison between MAGs and SAGs assembled from the same Baltic Sea water samples revealed very high nucleotide identities between the corresponding pairs but differences in size and completeness between the genomes [4]. SAGs have also been used in pangenome analyses, which had previously been mostly restricted to cultured microorganisms (especially pathogenic bacteria) [49,51,79]. Comparative genomics and the development of the “pan-genome” concept offered an alternative for understanding the genetic extent and dynamics of bacterial species [49,51,79], dramatically changing the description of a species by studying multiple genomes belonging to the same defined taxon [34,38,65]. In the marine environment, SAG-based pangenome analyses have mostly focused on the highly studied Prochlorococcus (revealing hundreds of co-existing Prochlorococcus populations [31] and the link between their hypervariable genomic islands and the environment [17]) or on Prochlorococcus together with SAR11 (defining endemic gene-level adaptations to specific locations like the Red Sea [80]).
In the present study, we co-assembled 10 SAGs from a nearly clonal population of a novel species in the genus Kordia to generate a high-quality genome complete enough to: i) allow its putative functional description, ii) determine its preferred niche in the water column from which it was retrieved, and iii) describe the novel species in comparison with the available sequenced relatives of the genus Kordia. The genus Kordia belongs to the family Flavobacteriaceae and was first proposed with the isolation in culture of Kordia algicida, which lyses algal cells and feeds on phytoplanktonic bloom exudates [75]. Members of this genus are Gram-negative, strictly aerobic or facultatively anaerobic, non-motile or motile by gliding, and rod-shaped, with a DNA G+C content of 34-37% [33]. In the last 15 years, new species have been added to the genus, all of them isolated from aquatic environments: K. zhangzhouensis thrives in freshwater [41], K. jejudonensis was isolated from the interface between seawater and freshwater springs [54], K. antarctica and K. aquimaris were isolated from Antarctic and Taiwanese seawater, respectively [6,26], K. ulvae, K. zosterae and Kordia sp. SMS9 were retrieved from the surface of marine algae [33,59,62] and K. periserrulae was found in the gut of a tidal flat polychaete [14]. Kordia sp. NORP58 is a MAG assembled from a marine subsurface aquifer [81]. At the time of writing, the genus consists of eight species with validly published names, of which four have had their genomes completely sequenced.

Despite the fundamental insights into microbial ecology and evolution provided by the analyses of uncultured bacterial and archaeal genomes, there is not yet a proper taxonomy for the uncultured majority. We aim to provide a good example of a complete description of a novel species of the genus Kordia, analyzed via culture-independent methods to infer its ecology and functional potential.

Materials and Methods
Single Amplified Genome generation and phylogeny analysis
A total of 98 Single Amplified Genomes were generated as detailed in [77] from a surface (SRF) seawater sample from the North Indian Ocean (latitude 18.59ºN, longitude 66.62ºE) collected at station TARA_039 (Sample ID: TARA_G000000266) during the Tara Oceans circumnavigation expedition [30] (SM Table 1).
Multi Locus Sequencing Analysis (MLSA) of the generated SAGs revealed that 84% of them belonged to a novel species of the genus Kordia (hereafter referred to as Kordia sp. TARA_039_SRF) and that the amplified genes (16S rRNA, partial 23S rRNA, partial RNA polymerase subunit B and partial proteorhodopsin genes, as well as the Internal Transcribed Spacer) were identical in the whole Kordia SAG population. More details on the phylogeny and distribution of these Kordia SAGs can be found in [69].
SAG selection
We chose 10 out of the 98 generated Kordia SAGs for Whole Genome Sequencing. They were selected based on: i) their Multiple Displacement Amplification (MDA) Cp values, which were the shortest (ranging from 7:15 to 8:15), and ii) the successful amplification of any of the markers tested in the previous MLSA.
Sequencing and read treatment
Sequencing of the ten SAGs was carried out in two batches. Reads for the first set of five were obtained with Illumina MiSeq 2x300 bp technology, and reads for the second set with Illumina HiSeq 2x250 bp. Quality assessment of the raw reads was performed using FastQC, and Illumina PhiX174 adapters were removed using Bowtie2 v2.2.9 [43] and Samtools v1.3.1 [46]. Afterwards, reads were normalized by coverage for each individual SAG using DOE JGI’s BBnorm, setting the maximum coverage at 40X. The normalized paired-end libraries were trimmed and filtered with Trimmomatic [10] and paired reads were merged with PEAR v0.9.6 [85]. The workable read dataset consisted of merged reads, read pairs that did not merge and unpaired orphan reads (Figure 1A). The final non-redundant set accounted for 1.75% of the original raw read dataset (3,746,468 reads out of a total of 213,457,114).
Co-assembly optimization and quality control
To obtain the co-assembly, we chose a strategy focused on reducing gene redundancy and genome fragmentation. As a first approach, 36 co-assemblies were generated with the assembler Ray v2.2.0, each using a single k-mer length value from 17 to 133, in order to find those that performed best (Figure 1A). The three k-mer values that generated the co-assemblies with the best metrics were combined using the assembler SPAdes v3.10 [7] with options --careful and --sc (recommended for single cell genome assembly). Another co-assembly was generated with the SPAdes default combination of k-mer length values (21, 33, 55) to check whether the custom k-mer length combination resulted in a better co-assembly.
Normalizing the pool of reads by coverage and merging them before assembly significantly reduced the memory requirements and duration of the assembly. Using the SPAdes assembler in default single-cell mode with the full read dataset, the co-assembly took ~62 hours and consumed 75 Gb of memory. The same process with the workable read dataset (normalized and merged) and optimized k-mer lengths took 11 min and 12 Gb.
Metrics for each co-assembly were generated using a custom Perl script. Genome completeness and contamination (assessed as single copy marker gene redundancy) were calculated with CheckM v1.0.11 [55]. The reference marker gene database chosen was “Family Flavobacteriaceae”, which was the closest to the taxonomic annotation of the co-assembly. However, the marker gene PF07659.6 was excluded from the database as it was absent in all complete genomes of isolated Kordia. The assembly accuracy was assessed with the Assembly Likelihood Estimator (ALE) [15]. The three k-mer length values that showed the best metrics (higher genome completeness, longer co-assembly length, smaller number of contigs and larger N50) and the best ALE scores were used in combination in a final co-assembly (K=55, 61, 117). Overlapping regions among scaffolds were detected after a visual check of gene synteny in their functional annotations. Several refining steps were taken to resolve redundancies in the co-assembly (Figure 1B). First, the scaffolds were assembled in Geneious R11, with default settings in the High Sensitivity assembly algorithm. Second, the resulting contigs were split into different datasets by setting a cut-off for minimum contig length. The final co-assembly was the contig dataset with the highest genome completeness, the highest number of single copy marker genes present only once and a contamination lower than 5% [11,37]. Detailed metrics and evaluation of the different co-assemblies are in SM Table 2.
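The assembly metrics used to rank the co-assemblies can be illustrated with a short sketch. The study computed them with a custom Perl script that is not reproduced here, so the Python below is only an assumed stand-in: the function names `n50` and `assembly_metrics` and the `min_length` cut-off parameter are hypothetical, and completeness, contamination and ALE scores (which come from CheckM and ALE) are not covered.

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together hold at least half of the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def assembly_metrics(contig_lengths, min_length=0):
    """Summarize an assembly after discarding contigs shorter than
    min_length (hypothetical cut-off, mirroring the splitting of
    contigs into datasets by minimum contig length)."""
    kept = [n for n in contig_lengths if n >= min_length]
    return {
        "contigs": len(kept),
        "total_bp": sum(kept),
        "n50": n50(kept),
        "longest": max(kept),
    }
```

Calling `assembly_metrics` on the contig lengths of each candidate co-assembly would give the contig count, total length and N50 that are compared alongside completeness and contamination.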
Alignment of individual SAGs against the co-assembly
Mappings of individual SAGs to the co-assembly were done with the same read dataset as for the co-assembly (clean, trimmed, merged and normalized by coverage reads) using Bowtie2 v2.2.9. The percentage of mapped reads was extracted from Bowtie2’s log file. The vertical coverage of the co-assembly by each SAG individually was calculated using the genomecov function of Bedtools v2.27.1. The sum of bases of the co-assembly covered by each SAG was divided by the total length of the co-assembly to obtain the contribution of each SAG to the co-assembly (Table 1).
Ggplot2 [24] was used to visualize the mapping of reads of every individual SAG against the generated co-assembly (SM Figure 2).
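The contribution of a SAG to the co-assembly, as defined above, is the number of covered bases divided by the co-assembly length. A minimal sketch follows, assuming the coverage is available as BedGraph-like intervals (contig, start, end, depth), such as those written by `bedtools genomecov -bga`; the exact genomecov options used in the study are not stated, and the function name `covered_fraction` is hypothetical.

```python
def covered_fraction(bedgraph_lines, assembly_length):
    """Fraction of the assembly covered by at least one read.

    bedgraph_lines: iterable of whitespace-separated records
    'contig start end depth' with 0-based, half-open intervals
    (assumed input format).
    """
    covered = 0
    for line in bedgraph_lines:
        if not line.strip():
            continue  # skip empty lines
        contig, start, end, depth = line.split()[:4]
        if float(depth) > 0:
            covered += int(end) - int(start)
    return covered / assembly_length
```

Multiplying the result by 100 gives a percentage comparable to the per-SAG contributions listed in Table 1.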
Phylogeny based on the 16S rRNA gene
The complete 16S rRNA gene of the co-assembly was queried against the Living Tree Project [83] database using SILVA’s SINA ARB v1.2.11 online tool [61]. Its Flavobacteriaceae best hits were exported and aligned in Geneious R11, together with Kordia amplicon OTUs generated from the Malaspina dataset in [50], the 16S rRNA genes of other Kordia spp. deposited in GenBank and two outgroups from a different phylum (SM Table 1). MEGA v7 [40] was used for a Maximum-Likelihood phylogeny reconstruction using the GTR+G model and 500 bootstrap replicates. The tree was later processed in iTOL [45].
Phylogeny based on single copy genes
A multi-locus (40 conserved single copy genes) phylogenetic placement of the co-assembly was also carried out using the same taxa as in the 16S rRNA gene phylogeny, also including uncultured genomes like MAGs that belong to the family Flavobacteriaceae. Gene prediction was done with Prodigal v2.6.3 [27]. FetchMG v1.0 [39] was used with option -v (recommended for reference genomes) to extract the 40 conserved COGs, whose amino acid sequences were concatenated and later aligned with Muscle v3.8.31 [20]. Neighbor-Joining phylogenetic reconstruction was done with MEGA v7.0 and the resulting tree was processed in iTOL.
Sequence composition identities against other Flavobacteriaceae
Average Nucleotide Identities (ANI), Average Amino acid Identities (AAI) and tetranucleotide frequency comparisons were done between the co-assembly and the genomes of its closest relatives using fastANI v1.1 [28], compareM v0.0.23 [https://github.com/dparks1134/CompareM] and pyani v0.2.7 [https://github.com/widdowquinn/pyani], respectively.
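To make the idea of a tetranucleotide signature comparison concrete, the sketch below counts 4-mer frequencies in two sequences and correlates them. It is a deliberately simplified illustration and not the statistic implemented in pyani: it uses plain frequencies on one strand and a Pearson correlation, and the function names `tetra_freqs` and `pearson` are hypothetical.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequency of each unambiguous 4-mer in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips 4-mers with ambiguity codes
            counts[kmer] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no unambiguous 4-mers")
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Two genomes with nearly identical composition would give a correlation close to 1 under this simplified measure.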
Functional annotation of Kordia sp. TARA_039_SRF
Gene prediction and a basic annotation of the co-assembled contigs were carried out with Prokka v1.12 [70]. PFAMs and TIGRFAMs were annotated in the predicted genes with hmmscan from HMMER v3.1b2 (hmmer.org). KEGG orthologs (KO) were predicted online with BLASTKOALA v2.1 [29] and transporters were predicted using the Transporter Classification (TC) database [21]. The annotation of carbohydrate-active enzymes was based on dbCAN’s CAZymes database [84] and peptidases were annotated using the Merops database [63]. Polysaccharide Utilization Loci (PULs) were located manually by looking for pairs of SusC/SusD genes encoded next to each other. Genomic Islands were predicted with the IslandViewer v4 online tool [9]. Insertion sequences (IS) were predicted and estimated with the ISsaga tool [82] and prophage regions were predicted with PHASTER [5]. Secretion systems were detected using the MacSyFinder-based TXSScan [1,2]. KEGGmapper Search&Color pathway v3.1 was used to describe the functional potential of the co-assembly, for those gene calls that could be associated with a KO value.
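The PUL search described above was done manually. Purely as an illustration of the criterion (a SusC gene and a SusD gene encoded next to each other), the sketch below scans an ordered gene list for adjacent SusC/SusD pairs on the same contig. The input layout (contig, gene identifier, annotation label) and the function name `find_sus_pairs` are assumptions, not part of the original workflow.

```python
def find_sus_pairs(genes):
    """Return (gene_id, gene_id) tuples for adjacent SusC/SusD genes.

    genes: list of (contig, gene_id, label) tuples ordered by
    position along each contig (assumed input layout).
    """
    pairs = []
    for (c1, id1, f1), (c2, id2, f2) in zip(genes, genes[1:]):
        if c1 != c2:
            continue  # neighbours must lie on the same contig
        if {f1, f2} == {"SusC", "SusD"}:
            pairs.append((id1, id2))
    return pairs
```

Each reported pair would then be a candidate PUL whose flanking genes (glycosyl hydrolases, carbohydrate-binding modules and so on) can be inspected by hand.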
Phylogeny based on the proteorhodopsin gene
The complete amino acid sequence of the co-assembly’s proteorhodopsin gene was aligned in Geneious R11 together with its BLAST best hits against NCBI's Protein database and two outgroup sequences (a proteorhodopsin from a SAR86 group bacterium and one from a “Candidatus Pelagibacter ubique” bacterium). MEGA v7 [40] was used for a Maximum-Likelihood phylogeny reconstruction using the LG+G model and 500 bootstrap replicates. The tree was later processed in iTOL [45].
Distribution of Kordia sp. TARA_039_SRF in TARA_039 station with competitive Fragment Recruitment Analysis
The vertical and size fraction distribution of Kordia sp. TARA_039_SRF was assessed in all available metagenomes for Tara Oceans station TARA_039, as it was the marine sample from which the Kordia SAGs were obtained. Metagenomes from TARA_039 were sampled at three different depths (surface/SRF 5 m, Deep Chlorophyll Maximum/DCM 25 m and mesopelagic/MES 270 m), size-fractionated for viruses (<0.22 µm), giruses (0.1-0.22 µm), bacteria (0.22-1.6 µm) and protists (0.8-5 µm, 5-20 µm, 20-180 µm and 180-2000 µm), and generated following different protocols (SM Table 1) [3]. All merged reads (PEAR v0.9.6) from each described metagenome were mapped against all available Kordia genomes (Kordia sp. TARA_039_SRF, Kordia algicida, Kordia jejudonensis, Kordia zhangzhouensis, Kordia periserrulae, Kordia sp. SMS9 and Kordia sp. NORP58) (SM Table 1) with blastn (BLAST 2.2.8+; options -max_target_seqs 1 -perc_identity 70 -evalue 0.0001). The output was filtered in R by: i) keeping hits with coverage between query and subject >90%, ii) removing duplicated reads and iii) removing reads mapping to the ribosomal operons [12]. Recruited reads were split based on their nucleotide identities against the reference genome. Those mapping at identities >95% were assumed to belong to the same species as the reference genome and those between 70 and 95% to co-occurring close relatives of the reference genome [12,65]. The recruited reads were normalized by the sequencing depth of each metagenome, considering the complete genomic length.
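The filtering and identity binning described above were done in R. The Python sketch below restates that logic under explicit assumptions: each hit is given as (read identifier, read length, alignment length, percent identity), coverage is taken as alignment length over read length, the first hit per read is kept, and the removal of reads mapping to ribosomal operons and the final normalization are left out because the source does not give enough detail to reproduce them. The function name `classify_recruited_reads` is hypothetical.

```python
def classify_recruited_reads(hits, min_coverage=0.90):
    """Bin recruited reads by nucleotide identity.

    hits: iterable of (read_id, read_length, aln_length, pident)
    tuples, e.g. parsed from tabular blastn output (assumed layout).
    """
    seen = set()
    counts = {"same_species": 0, "close_relatives": 0}
    for read_id, read_length, aln_length, pident in hits:
        if read_id in seen:
            continue  # duplicated read: keep only the first hit
        if aln_length / read_length <= min_coverage:
            continue  # alignment must span >90% of the read
        seen.add(read_id)
        if pident > 95:
            counts["same_species"] += 1      # identity >95%
        elif pident >= 70:
            counts["close_relatives"] += 1   # identity 70-95%
    return counts
```

The two counts correspond to the species-level recruitment and the recruitment by co-occurring close relatives, before any normalization.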
Comparative genomics analysis
The Anvi’o v4 pangenomic workflow was used to organize all the genes from the five Kordia genomes (Kordia sp. TARA_039_SRF, Kordia algicida, Kordia jejudonensis, Kordia zhangzhouensis and Kordia periserrulae) into gene clusters based on protein sequence similarity. Gene calling was done with Prodigal, amino acid sequence search with NCBI’s blastp v2.7.1, gene clustering with MCL [19] and sequence alignment with Muscle. Using the program anvi-summarize, we exported the list of gene clusters together with the genomes in which they were encoded. These results were plotted using anvi’o and the R packages UpSetR v1.1.3 [16] and ggplot2.
The Anvi’o phylogenomics workflow was used to produce a phylogenomic tree (FastTree v2.1.11 [60]) based on the amino acid sequences of the single copy genes shared by all five Kordia genomes.
Accession numbers
The genome of the novel Kordia sp. TARA_039_SRF has been deposited in NCBI’s Bioproject PRJNA524487. The co-assembly’s accession number is SMNH02000000 and its Biosample is SAMN11028936. Raw reads of the 10 individual SAGs are: AB-193_M23 (SRR8655105), AB-193_M19 (SRR8655106), AB-193_N20 (SRR8655104), AB-193_O13 (SRR8655103), AB-194_L04 (SRR8655108), AB-193_M03 (SRR8655107), AB-193_O22 (SRR8655102), AB-193_I20 (SRR8655109), AB-193_I04 (SRR8655110) and AB-193_P22 (SRR8655101). A complete list of accession numbers for all the genomes and metagenomes used in this study can be found in SM Table 1.

Results
Co-assembly of the Kordia sp. TARA_039_SRF genome
Ten SAGs belonging to the genus Kordia, provisionally named Kordia sp. TARA_039_SRF, were selected for genome co-assembly and description of gene content. These SAGs were tested for microdiversity through MLSA, resulting in 100% identity at the nucleotide level in each marker [69]. The five markers were the 16S rRNA gene, the partial 23S rRNA gene and the Internal Transcribed Spacer, which could be amplified in the 10 SAGs; the partial RNA polymerase subunit B, amplified in 5 SAGs; and the partial proteorhodopsin, amplified in 7 SAGs.
Individual SAG assemblies varied in genome completeness and metrics (Table 1), recovering at most 45% of the estimated complete genome. GC content was consistent among the different SAGs (35.3-36%).
Considering the high genetic similarity among the individual SAGs based on the previous MLSA, their co-assembly into one single genome was the strategy chosen to obtain a more complete genome. All the reads from the 10 SAGs were pooled together, normalized by coverage and merged. They were afterwards assembled using a wide range of k-mer length values in order to find the best metrics and quality values (SM Table 2). The combination of k-mer length values chosen was 55, 61 and 117 (Figure 1B). After refinement of the optimized co-assembly, the resulting genome was 4,594,716 bp long, fragmented into 27 contigs with an N50 of 429,544 bp. The genome was 94.97% complete using a database of Flavobacteriaceae conserved single copy genes, and the contamination based on gene redundancy of these conserved genes was 4.65% (Table 1).
There were 985 ambiguities spread across the final co-assembly (217 Y, 247 W, 92 S, 158 R, 32 N, 140 M and 99 K), representing 0.02% of the total nucleotides.
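The ambiguity tally above can be checked with a few lines of code. The sketch below counts IUPAC ambiguity codes in a sequence and expresses a tally as a percentage of the genome length; the function names are hypothetical, and the `REPORTED` dictionary simply restates the counts given in the text.

```python
from collections import Counter


def ambiguity_counts(seq):
    """Count every base that is not an unambiguous A, C, G or T."""
    return Counter(b for b in seq.upper() if b not in "ACGT")


def ambiguity_percent(counts, genome_length):
    """Ambiguous positions as a percentage of the genome length."""
    return 100 * sum(counts.values()) / genome_length


# Counts and genome length as reported for the final co-assembly.
REPORTED = {"Y": 217, "W": 247, "S": 92, "R": 158, "N": 32, "M": 140, "K": 99}
GENOME_LENGTH = 4_594_716
```

With these values, `sum(REPORTED.values())` gives the 985 ambiguities and `ambiguity_percent(REPORTED, GENOME_LENGTH)` rounds to the reported 0.02%.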
The proportion of reads that mapped back to the refined co-assembly varied between 97.6 and 98.1% depending on the SAG. The contribution of each SAG to the co-assembly ranged between 21 and 50% (Table 1). The average nucleotide identities of the individual SAGs against the co-assembly were higher than 99.9% in all cases (Table 1). Tetranucleotide frequency was almost identical between each SAG and the co-assembly, with similarities ranging between 99.6 and 99.8%.
Comparing the completeness and gene redundancy of all the possible combinations of SAGs to co-assemble (1023 for 10 SAGs), the highest completeness value (97.48%) was reached with the co-assembly of all 10 SAGs, with 8.48% contamination (SM Figure 1; SM Table 3). Nevertheless, completeness values saturated at ~94-95% starting at combinations of seven and eight SAGs. Their contamination values were also higher than the standards for high-quality draft genomes (>5%), with the exception of one combination of eight SAGs that produced a co-assembly 94.07% complete with 4.92% contamination. Considering that genome amplification in SAGs is random [76], it is impossible to know which part of the genome was amplified prior to sequencing. Although we sequenced 10 SAGs, the co-assembly of just eight of them would have been enough to reach completeness values similar to those of the refined co-assembly.
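The figure of 1023 possible combinations is the number of non-empty subsets of 10 SAGs (2^10 - 1). A minimal sketch of how such combinations could be enumerated is shown below; the function name `nonempty_subsets` and the SAG labels in the usage note are placeholders, and how each combination was actually assembled and evaluated is not reproduced here.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty combination of items, smallest first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)
```

For example, with `sags = [f"SAG{i:02d}" for i in range(1, 11)]`, iterating over `nonempty_subsets(sags)` visits all 1023 combinations, 45 of which contain exactly eight SAGs.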
Novelty of Kordia sp. TARA_039_SRF
The novelty of Kordia sp. TARA_039_SRF was confirmed with phylogenies of the 16S rRNA gene and 40 conserved single copy genes (Figure 2), as well as through whole genome sequence composition comparisons of ANI, AAI and tetranucleotide signature (SM Table 4).
Both phylogenies placed Kordia sp. TARA_039_SRF within the family Flavobacteriaceae, and more precisely within the genus Kordia, in 100% of the bootstrap replicates. Its closest described relative is Kordia periserrulae, with 97.56% identity in the 16S rRNA gene (97% coverage) and 89.13% ANI (75% coverage). Otu5835, amplified from Malaspina expedition seawater samples, also falls within the Kordia sp. TARA_039_SRF and K. periserrulae clade.
Functional description of Kordia sp. TARA_039_SRF
Having an almost complete genome (94.97%) allowed us to investigate the potential functional capabilities and energy metabolism of this novel species (Figure 3; SM Table 5). Nevertheless, 2357 of 4125 predicted protein coding genes (57.1%) were annotated as hypothetical proteins of unknown function.
Central metabolism
The co-assembled Kordia sp. TARA_039_SRF can avoid the loss of CO2 of the regular TCA cycle by using the glyoxylate shunt, like other Kordia spp., and it is also able to replenish certain intermediates of the TCA cycle through anaplerotic routes: it encodes both PEP carboxykinase and PEP carboxylase, which convert PEP to oxaloacetate using ATP and CO2, and H2O and HCO3-, respectively. This mechanism is also found in other Kordia spp. Bicarbonate uptake can be achieved through specific bicarbonate membrane transporters (BicA), of which four copies were found. One of these copies is next to a carbonic anhydrase gene, whose product interconverts CO2 and HCO3-. Kordia sp. TARA_039_SRF encodes a third anaplerotic strategy driven by the malic enzyme, which converts pyruvate into L-malate with NADPH and CO2. One of the copies is encoded in the core genome of Kordia whereas another copy is located in a genomic island.
Unlike other Kordia spp., Kordia sp. TARA_039_SRF encodes several enzymes that allow a complete biosynthesis of CoA from pyruvate. This cofactor plays a key role in the biosynthesis and breakdown of fatty acids, as well as in the biosynthesis of polyketides and non-ribosomal peptides [8].
Kordia sp. TARA_039_SRF shows several mechanisms for oxidative phosphorylation, some of which are shared with other Kordia spp.: i) succinate dehydrogenase / fumarate reductase, ii) cytochrome c oxidase, cbb3-type, iii) NADH dehydrogenase NQR, iv) NADH dehydrogenase NDH-I, v) cytochrome bd subunits CydA and CydB; and some of which are unique: i) cytochrome c oxidase aa3-type genes CoxABC, QoxA, COX10 and COX11, and ii) cytochrome c oxidase bo-type genes CyoDW.
Carbohydrate-active enzymes, PULs, surface adhesion and motility
Kordia sp. TARA_039_SRF encodes a CAZymes array most similar to that of K. periserrulae and larger than those of the rest of the Kordia spp. included in the study (SM Table 6). It shows the second highest density of carbohydrate-active enzymes (41 enzymes per genomic Mbp) after that of K. zhangzhouensis. It encodes 34 non-catalytic carbohydrate-binding modules (CBM), mostly specific for complex carbohydrates like cellulose, chitin, mannan, beta-glucans, starch, and glycogen. The novel species also encodes 1.1 glycosyltransferases per genomic Mb (a total of 52). These outer membrane proteins are involved in the synthesis of surface adhesion polysaccharides. A total of 51 predicted adhesion-related genes, such as fasciclin, fibronectin, or lectins (SM Table 7), were found; these are also present in other Flavobacteria [25]. The novel Kordia sp. TARA_039_SRF does not encode a complete gliding motility complex.
The co-assembly encodes four Polysaccharide Utilization Loci (PULs) (SM Table 5), three of them surrounded by 6 families of glycosyl hydrolases, one family of glycosyl transferases, one family of carbohydrate esterases, two families of carbohydrate-binding modules and peptidases like serine aminopeptidase and prolyl oligopeptidase.
Sulfur and nitrogen metabolism
Another unique feature of Kordia sp. TARA_039_SRF within its genus is that it codes for all the genes involved in the assimilatory sulfate reduction pathway that converts sulfate to sulfide, and it is also able to transform thiosulfate to sulfite.
The genome of Kordia sp. TARA_039_SRF is the only Kordia genome sequenced so far that encodes both NasA, a nitrate/nitrite transporter, and NasF, the substrate binding protein in the nitrate/nitrite transport system. The bacterium is predicted to assimilate nitrate for further incorporation into amino acids, as it encodes the nitrate reductase/nitrite oxidoreductase. Moreover, like other Kordia spp., it has the potential to use environmental ammonium as a nitrogen source, importing it through an ammonium channel transporter.
Light sensing and environmental information processing
The co-assembled Kordia encodes a copy of a green-light absorbing proteorhodopsin, like Kordia periserrulae and Kordia sp. SMS9. Their phylogenetic reconstruction clusters the co-assembly’s and K. periserrulae’s proteins in the same clade with a 100% bootstrap value (SM Figure 3), and their pairwise alignment shows that they are identical in 95.5% of the sites (232 of 243 aligned positions).
It also codes for other putative light sensors usually found in proteorhodopsin-containing marine bacteria: one (6-4) photolyase, seven copies of phytochrome-like proteins, one cryptochrome DASH domain and one cryptochrome-like protein.
The co-assembly codes for DNA repair systems like the entire nucleotide excision repair (NER) complex UvrABC, which recognizes and removes light-induced DNA lesions.
The two-component systems encoded in the co-assembly have been described to detect the following environmental signals: phosphate, ferric ion, oxygen, nitrate/nitrite and chemotaxis-related attractant/repellent substances. It also codes for the sigma 54 factor and putative quorum-quenching lactonases, which have been related in some cases to virulence and biofilm formation.
Transporters
The co-assembly encodes 95 different transporter families and ~88 transporters per Mbp. We found 4 different types of pore-forming toxins, phage-related transporters like the FadL outer membrane protein and a holin belonging to the Bacillus subtilis phage φ29 transporter family, two different light-absorption driven transporters and uptake systems specific for amino acids, potassium, fatty acids, nitrate, dissolved inorganic carbon, iron and magnesium (Figure 3; SM Table 8).
Kordia sp. TARA_039_SRF codes for two complete secretory systems: T1SS and the Bacteroidetes-specific T9SS (SM Table 9). There are 85 peptides encoded in Kordia sp. TARA_039_SRF that contain the Por secretion system domain (SM Table 10), meaning that they are secreted through the T9SS to the extracellular environment or the cell surface. Their most abundant functions relate to surface adhesion, protein degradation, and carbohydrate hydrolysis. Out of the 184 peptidases found in the co-assembly, 8 are putatively secreted (SM Table 11). Other secreted peptides show domains that have a role in carbohydrate binding, environmental sensing through two-component systems, and quorum sensing. Additionally, we found proteins with domains associated with prophage endonucleases, prokaryotic endonucleases, bacterial toxins and proteinase inhibitors.
Pigment and vitamin synthesis
Kordia sp. TARA_039_SRF is the only described Kordia sp. that can potentially synthesize beta-carotene without relying on any exogenous intermediates, as it codes for the complete terpenoid backbone synthesis. Another exclusive feature is the putative ability to synthesize biotin from Malonyl-ACP and Pimeloyl-CoA. Like other Kordia spp., Kordia sp. TARA_039_SRF can synthesize riboflavin (vitamin B2) and it also has genes related to the synthesis of folate (B9), pantothenate (B5) and pyridoxine (B6). It can putatively import vitamins from the environment with specific binding proteins and an outer membrane channel coupled to a TonB transporter.
Genomic islands, mobile elements and prophages
The genome of Kordia sp. TARA_039_SRF contains 12 predicted genomic islands with a total
length of 252,407 bp (5.5% of the total genome and 28% of the unique accessory
genome). They code for virulence-related genes such as those for non-ribosomal peptides (NRPs),
porins, toxins and invasion proteins, and for other genes such as those involved in oxidative
phosphorylation, nitrate/nitrite transport, and sulfur carrier proteins (SM Table 12). There
are 16 different kinds of transporters encoded in ten genomic islands, some of them
represented by several copies, related to oxidative phosphorylation pathways, secretory
pathways, OmpF-OmpA environmental oxygen sensing and bacteriocin-releasing
systems. There are also 15 complete insertion sequences (IS) belonging to 8 different
transposase families (SM Table 13) and three predicted incomplete prophage regions,
with a total size of 26,876 bp (~0.6% of the total genome length) (SM Table 14).
Abundance of Kordia sp. TARA_039_SRF in the North Indian Ocean
Read recruitment by Kordia sp. TARA_039_SRF at the species level (alignment identities
>95%) did not exceed 0.01% in station TARA_039 (Figure 4A, SM Table 15).
Nevertheless, surface and DCM metagenomic reads of the 20-180 µm fraction mapped
against 40 and 53% of the co-assembly’s predicted genes, respectively (8.5 and 13% of
horizontal genomic coverage). Relative abundances in these samples are low (0.0018
and 0.0033%), but reads map homogeneously across the genome (Figure 4B) and
against both the core genome and the accessory genome (SM Table 16). Reads
also map homogeneously in the 0.8-5 µm and 0.1-0.22 µm fractions of the
mesopelagic layer, with abundances of 0.0016 and 0.0026%, respectively, but the number
of matched genes and the horizontal coverage are smaller in both cases (24-28% of
genes and ~5% of genomic coverage). The highest abundance of recruited reads was
found in the mesopelagic 0.22-1.6 µm size fraction (0.009%), but these reads are spread across only 8%
of the co-assembly’s ORFs, especially in housekeeping genes such as transcriptional
regulators, RNA polymerase subunits and a serine aminopeptidase.
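As an illustration only (this is not the recruitment pipeline used in this study), the two quantities reported above can be sketched as follows. We assume here that relative abundance is the percentage of metagenomic reads recruited by the co-assembly, and that horizontal coverage is the percentage of genome positions covered by at least one recruited read. The function names and the half-open (start, end) coordinate convention are hypothetical choices made for this sketch.

```python
def relative_abundance(recruited_reads: int, total_reads: int) -> float:
    """Percentage of metagenomic reads recruited by the reference genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * recruited_reads / total_reads


def horizontal_coverage(intervals, genome_length: int) -> float:
    """Percentage of genome positions covered by at least one aligned read.

    intervals: iterable of (start, end) half-open alignment coordinates.
    Overlapping or adjacent alignments are merged so that each position
    is counted only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / genome_length
```

Under these assumptions, a sample can show a very low relative abundance and still a substantial horizontal coverage when the few recruited reads are spread evenly along the genome, which is the pattern described for the 20-180 µm fraction.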
None of the other Kordia spp. used in the competitive Fragment Recruitment Analysis
recruited enough reads at the species level to be considered present at the time of
sampling. Nevertheless, recruitment at identities of 80-90%, which increased with depth
for all Kordia genomes, might suggest that a co-occurring close relative of the
co-assembly is also part of the community (SM Figures 4, 5 and 6).
Comparative genomics of the genus Kordia
The five Kordia genomes included in this comparative study belong to five different
species: four Kordia genomes available in public databases, retrieved from isolated
cultures, and our novel Kordia co-assembled genome (Table 2). The ANIm values
between genome pairs were ~79-89% (with alignment coverages of 44-75% of
the genes). Similarities at the 16S rRNA gene level ranged from 96.34 to 98.00% (with
alignment coverage values of 95-100%). In both cases, the closest relative of the
co-assembled Kordia sp. was Kordia periserrulae. Genome length ranged between 4.59
and 5.35 Mbp for the seawater genomes, while that of K. zhangzhouensis, retrieved from the
interface between seawater and freshwater, was the shortest (4.03 Mbp). Kordia sp.
TARA_039_SRF had the second highest GC content (35.8%), following K. periserrulae
(36.2%). This value ranged between 33.8 and 34.2% for the other Kordia spp. The mean
coding density of the studied genomes was 88.8%.
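For readers less familiar with these summary metrics, a minimal sketch of how GC content and coding density can be computed is given below. It is not the software used in this study. The function names are hypothetical, ambiguous bases are ignored in the GC calculation, and genes are assumed not to overlap, so the coding density is a simple sum of gene lengths over genome length.

```python
def gc_content(sequence: str) -> float:
    """GC percentage computed over unambiguous bases (A, C, G, T) only."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def coding_density(gene_coords, genome_length: int) -> float:
    """Percentage of the genome occupied by predicted genes.

    gene_coords: iterable of (start, end) half-open coordinates.
    Assumption for this sketch: genes do not overlap.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    coding = sum(end - start for start, end in gene_coords)
    return 100.0 * coding / genome_length
```

With these definitions, the non-coding fraction of a genome is simply 100 minus its coding density.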
The core genome of the analyzed Kordia spp. comprised a mean of 57% of the total
number of gene clusters of each Kordia genome (Figure 5, SM Table 17), and
~53% of them could be classified into a Cluster of Orthologous Genes (COG) category.
A phylogenomic analysis of these gene clusters defined two clades, one grouping K.
zhangzhouensis, K. algicida and K. jejudonensis and another with the co-assembled
Kordia sp. and K. periserrulae (Figure 5).
Of the remaining gene clusters in these Kordia spp., a mean of 24% are shared between
two, three or four genomes. These are defined as the shared accessory genome. The
highest number of gene clusters in this category is shared between K. periserrulae and
the co-assembled Kordia sp. TARA_039_SRF (204).
Between 18 and 28% of gene clusters are unique to each species (Table 2, Figure 5),
constituting the flexible genome. Of those classified into COG categories, 10-15%
are classified into the cellular processes and signaling category, 6-11% are related
to information storage and processing, 7-12% belong to metabolic functions and
0.2-2.3% are related to mobilome elements. The flexible gene set of the co-assembly
contains the lowest number of peptidases among the compared genomes (one), together
with 7 transporters (mainly channels and pores and electrochemical potential-driven
transporters) and 5 CAZymes (3 different CBMs and 2 glycosyl hydrolases).
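The three pangenome categories used in this section (core, shared accessory and unique or flexible) can be expressed as a simple presence/absence rule. The sketch below is illustrative only and is not the pangenomic workflow used for Figure 5. The function name, the genome labels and the cluster identifiers are hypothetical, and the input is assumed to be a mapping from each gene cluster to the set of genomes in which it occurs.

```python
from typing import Dict, List, Set


def partition_gene_clusters(
    clusters: Dict[str, Set[str]], n_genomes: int
) -> Dict[str, List[str]]:
    """Split gene clusters into core, shared accessory and unique sets."""
    partition: Dict[str, List[str]] = {
        "core": [],
        "shared_accessory": [],
        "unique": [],
    }
    for cluster_id, genomes in clusters.items():
        n_present = len(genomes)
        if n_present == n_genomes:
            # Present in every genome of the comparison.
            partition["core"].append(cluster_id)
        elif n_present == 1:
            # Found in a single genome only.
            partition["unique"].append(cluster_id)
        elif n_present > 1:
            # Shared by some, but not all, genomes.
            partition["shared_accessory"].append(cluster_id)
    return partition


# Toy example with hypothetical cluster identifiers (not data from this study).
GENOMES = {
    "TARA_039_SRF",
    "K_periserrulae",
    "K_algicida",
    "K_jejudonensis",
    "K_zhangzhouensis",
}
toy_clusters = {
    "GC_001": set(GENOMES),
    "GC_002": {"TARA_039_SRF", "K_periserrulae"},
    "GC_003": {"TARA_039_SRF"},
}
toy_partition = partition_gene_clusters(toy_clusters, n_genomes=len(GENOMES))
```

In this toy example, GC_001 falls in the core genome, GC_002 in the shared accessory genome and GC_003 in the unique set. With five genomes, the shared accessory category corresponds to clusters present in two, three or four of them.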
The co-assembly of Kordia sp. TARA_039_SRF contains a high number of hypothetical
proteins of unknown function. They account for 41% of its core genome, 6% of its shared
accessory genome and up to 89% of its strain-specific accessory genome.
Discussion
The aim of this study was to characterize the novel uncultured species Kordia sp.
TARA_039_SRF and to shed light on its ecological and functional potential in ocean waters
through an optimized co-assembly of ten co-occurring and nearly identical marine Kordia
SAGs.
We have co-assembled 10 Kordia SAGs previously classified as nearly identical
genomes [69] into a high-quality draft genome, which is 94.83% complete and meets the
set standards regarding contamination (<5%) [11,37]. Recent studies have shown that
co-assembling very closely related SAGs helps overcome the main drawback of their
genome amplification step: that the recovered sequence rarely exceeds 60% of the
estimated genome length [36,48]. We have taken the assembly procedure one step further
in order to optimize contig length, reduce genome fragmentation, and significantly
decrease the need for computational power and processing time. Read redundancy was
expected to be high considering the near genomic clonality of the ten SAGs. After
normalization by coverage, the workable dataset was reduced to 1.75% of the original,
which substantially decreased the computational requirements and time of the assembly.
The use of merged reads, together with unmerged ones, contributed to the reliability of
predicted gene synteny in contigs [73]. The combination of different k-mer lengths during
co-assembly generated remarkably longer contigs than default assembly parameters [13].
The best combination of k-mer sizes was selected by comparing assemblies
generated with individual k-mer length values and identifying the best performers.
These contigs were assembled again, as gene redundancy was observed at the ends of
some of them (probably due to variability hotspots in some of the SAGs that forced a
split in the assembly process). The resulting reference genome had a small number of
very long contigs (27 contigs, N50 0.43 Mb) of high quality (best Assembly Likelihood
Evaluation score), with which the ten individual Kordia SAGs shared a pairwise ANIm of
99.9%. We think this co-assembly can serve as a reference genome for comparative
genomics and ecology studies, just like a draft genome assembled from bulk DNA of a
monoclonal culture.
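Because contig N50 is one of the metrics used above to compare assemblies, a minimal sketch of its usual definition is given here: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical and the code is not taken from the assembly or evaluation tools used in this study.

```python
def n50(contig_lengths) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest, and the function returns
    the length of the contig at which the running total first reaches at
    least half of the summed contig length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")
```

A fragmented assembly made of many short contigs yields a low N50, whereas an assembly of the same total length concentrated in a few long contigs yields a high one, which is why this value is commonly reported together with the number of contigs.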
The co-assembly’s taxonomic novelty was confirmed through phylogenies based on the
16S rRNA gene and highly conserved single-copy genes [39], and also through whole-genome
comparisons such as ANI and AAI. The closest relative is Kordia
periserrulae, isolated from the gut of the marine polychaete Periserrula leucophryna.
Its genome-based functional analysis suggests that this taxon is a novel photoheterotrophic,
non-motile bacterium that contains proteorhodopsin and oxidative
phosphorylation machinery (together with membrane oxygen sensors) for aerobic and
microaerobic environments. It has the putative ability to sense limiting nutrients
and elements in the environment such as iron, nitrite, and nitrate. Kordia sp. TARA_039_SRF
presents CAZymes that are exclusive within its genus, especially several types of
carbohydrate-binding modules with an affinity for phytoplankton exopolymers such as cellulose or
glucomannan, which would promote their hydrolysis before uptake from the extracellular
environment [74]. It also encodes Polysaccharide Utilization Loci, surrounded by
CAZymes as previously described in Bacteroidetes (carbohydrate-binding modules,
glycosyl hydrolases, glycosyl transferases, carbohydrate esterases) [44]. Moreover, the
genome encodes almost twice as many copies of the outer membrane TonB-dependent
receptor/transport systems as its relatives, hinting at the
relevance of importing extracellular compounds for the survival of this species. The
presence of virulence factors such as NRPs, antibiotics, proteases, porins, toxins and
prophage-related enzymes in the genome of Kordia sp. TARA_039_SRF (some of them
exported through the Bacteroidetes-specific secretion system T9SS) suggests that this
novel taxon is an active player in niche colonization. It also presents features common
to surface seawater bacteria: seven copies of genes of the light-induced DNA damage
repair system [35], the green-light-absorbing proteorhodopsin and other light-sensing proteins.
Such functional capabilities match the environmental characteristics of the SAGs’
original habitat, as well as the abundance of the co-assembly in specific size fractions
and depths of the water column. The SAGs co-assembled as Kordia sp. TARA_039_SRF were retrieved from a
surface seawater sample from the Arabian Sea, in the North Indian Ocean (TARA_039),
pre-filtered through a 200 µm mesh. This oceanic region is prone to seasonal surface
phytoplankton blooms [42] and has one of the largest and most intense Oxygen Minimum
Zones (OMZ) [58]. A diatom bloom was registered a month prior to the sampling period,
and Roullier et al. [68] thoroughly described the particle distribution in the water column
across the area, as well as seawater nitrate concentrations and the size of the OMZ. By the
time of sampling in TARA_039, surface waters showed a generally decreasing
concentration of large particulate matter (LPM; >100 µm) of 15 particles m⁻² [68]. This is
the size range where we find the highest read recruitment by the co-assembly in the
available sea surface and DCM metagenomes. Even though the abundance of recruited
reads in this size fraction is low, the reads spread homogeneously throughout the
genome. The co-assembly codes for a large number of enzymes that would allow growth
on particles that result from diatom exudates, even on those where microaerobic
conditions may arise. Moreover, expression of proteorhodopsin in surface waters would
provide extra energy and therefore an ecological advantage for rapid particle
colonization. In fact, the very low microdiversity within the SAG population may be due
to the sorting of a disaggregated particle, as proposed in [69]. The carbohydrate-binding
modules, adhesins and internalization transporters encoded in the co-assembly would
provide it with the ability to feed on the phytoplankton particles as they sink, regardless
of decreasing environmental light. An intermediate nepheloid layer (INL) formed only by
particles < 200 µm was observed in TARA_039 at 250-300 m. This depth also corresponded to the
oxygen minimum zone and had the highest concentration of nitrate in the sampled water
column. Of the available metagenomic samples at this mesopelagic depth, the
co-assembly was most abundant in the free-living size fraction (0.22-1.6 µm). This could be
related to the novel species’ putative ability to cope with low-oxygen conditions and its
potential for nitrate assimilation, and therefore to a switch in lifestyle from
particle-associated (> 1.6 µm size fractions) at the surface to free-living in the OMZ. Metabolic
versatility has previously been suggested to be related to a high number of transporters
[64], a feature apparently characteristic of the genus Kordia, in which the co-assembly
shows the highest ratio of encoded transporters per genomic Mb. A lifestyle alternation
between attachment to particles and free-living has already been proposed for the
marine flavobacterium Polaribacter sp. MED152 [25].
The genomic characteristics of the novel Kordia sp. TARA_039_SRF contrast with most
of the common genomic signatures described in free-living bacterioplankton SAGs by
Swan et al. (2013), in accordance with the predicted attachment to particles as the main
lifestyle of Kordia sp. TARA_039_SRF. Free-living bacterial SAGs were characterized by:
i) shorter genomes (also described as characteristic of proteorhodopsin-coding
Bacteroidetes [22]), ii) genome streamlining with fewer non-coding regions and iii) lower
GC content [78]. In contrast, the co-assembled genome is relatively large (~4.5 Mb),
shows the second highest GC value of the tested Kordia spp. and has a high proportion of
non-coding DNA (12%).
Comparative genomics of Kordia sp. TARA_039_SRF, Kordia periserrulae, Kordia algicida,
Kordia jejudonensis and Kordia zhangzhouensis reveals an inter-species core genome that
constitutes ~57% of the gene clusters of the genomes included in the analysis. This value is consistent with
the findings in [71], where core genome size is analyzed at the intra-species level but
also between different species of the same genus. The result is also consistent with the
genus-level pangenomes of marine Alteromonas [47] and Prochlorococcus [32].
The low abundance of reads recruited by the co-assembly in the samples from which
the SAGs were retrieved, and their low genomic coverage, suggest that this novel taxon
would have remained undetected using other culture-independent techniques such as
metagenomic genome assembly, supporting the usefulness of single-cell genomics for
unveiling novel microbial taxa.
Description of “Candidatus Kordia photophila” sp. nov.
“Candidatus Kordia photophila” (pho.to'phi.la. Gr. neut. n. phos, photos light; N.L. masc.
adj. philus (from Gr. masc. adj. philos) loving; N.L. fem. adj. photophila light-loving).
This bacterium is proposed to grow photoheterotrophically attached to phytoplankton-derived
particles. Its genome-based functional annotation suggests that the taxon is
non-motile and that it encodes green-light-absorbing proteorhodopsin and oxidative phosphorylation
machinery (together with membrane oxygen sensors) for aerobic and microaerobic
environments. It is the first Kordia species that encodes genes from cytochrome c
oxidase types aa3 and bo. The former shows higher affinity for O2 than the cbb3-type found
in other Kordia spp. and the latter is expressed under iron limitation conditions. It is the
first representative to encode both the nitrate/nitrite transporter NasA and the
substrate-binding protein NasF. Other unique features within the genus Kordia are the putative
ability to perform assimilatory sulfate reduction and the encoding of the cellulose- and
glucomannan-specific CBM 16, 22 and 04, the N-acetylmuramidase GH108 and the unsaturated
rhamnogalacturonyl hydrolase GH types 105 and 88. Regarding membrane
transporters absent in other Kordia spp., it encodes: i) 4 types of channels and pores
(Phospholemman family, Type B Influenza Virus NB Channel family, General Bacterial
Porin family and the channel-forming Colicin family), ii) 2 types of electrochemical
potential-driven transporters (Betaine/Carnitine/Choline transporter family and the K+
uptake permease KUP) and iii) the slow voltage-gated K+ channel accessory protein
MinK.
The type material of the proposed “Candidatus Kordia photophila” is its genome
sequence, co-assembled from 10 SAGs, which can be found under accession number
SMNH00000000. The version described in this paper is version SMNH02000000. Its
NCBI taxonomy ID is NCBI:txid2530203.
Acknowledgements
High-performance computing analyses were run at the Marine Bioinformatics Core
Facility (MARBITS) of the Institut de Ciències del Mar (ICM-CSIC) in Barcelona. We
thank Shook Studio for assistance with figure design and execution. We thank Aharon
Oren for help with the etymology of the Candidatus name, and Dr. Ramon Rosselló-Móra and
two anonymous reviewers for helping us improve our manuscript. We thank
our fellow scientists, the crew and chief scientists of the different cruise legs involved in
Tara Oceans for collecting the samples used in this study. This is Tara Oceans
contribution number XXX.
Funding
Funding was provided by the Spanish Ministry of Economy and Competitiveness through grant
MAGGY (CTM2017-87736-R) to SGA and grant CTM2016-80095-C2 to CPA and JMG. PS is
funded by the MAGGY project and MR-L held a Ph.D. fellowship FPI (BES-2014-068285)
funded by the Spanish Ministry of Economy and Competitiveness.
Authors’ contributions
SGA designed this research. MR-L performed the laboratory work and analyses and wrote
the paper. PS and JMG helped with the data analyses. SGA funded this research with
contributions from CPA and JMG. All authors were involved in the critical reading of
the paper.
Conflicts of interest
The authors declare no conflicts of interest.
Bibliography
[1] Abby, S.S., Néron, B., Ménager, H., Touchon, M., Rocha, E.P.C. (2014) MacSyFinder: A program to mine genomes for molecular systems with an application to CRISPR-Cas systems. PLoS One 9(10), Doi: 10.1371/journal.pone.0110726.
[2] Abby, S.S., Rocha, E.P.C. (2017) Identification of protein secretion systems in bacterial genomes using MacSyFinder. Methods Mol. Biol. 1615(February), 1–21, Doi: 10.1007/978-1-4939-7033-9_1.
[3] Alberti, A., Poulain, J., Engelen, S., Labadie, K., Romac, S., Ferrera, I., Albini, G., Aury, J.-M., Belser, C., Bertrand, A., Cruaud, C., Da Silva, C., Dossat, C., Gavory, F., Gas, S., Guy, J., Haquelle, M., Jacoby, E., Jaillon, O., Lemainque, A., Pelletier, E., Samson, G., Wessner, M., Bazire, P., Beluche, O., Bertrand, L., Besnard-Gonnet, M., Bordelais, I., Boutard, M., Dubois, M., Dumont, C., Ettedgui, E., Fernandez, P., Garcia, E., Aiach, N.G., Guerin, T., Hamon, C., Brun, E., Lebled, S., Lenoble, P., Louesse, C., Mahieu, E., Mairey, B., Martins, N., Megret, C., Milani, C., Muanga, J., Orvain, C., Payen, E., Perroud, P., Petit, E., Robert, D., Ronsin, M., Vacherie, B., Acinas, S.G., Royo-Llonch, M., Cornejo-Castillo, F.M., Logares, R., Fernández-Gómez, B., Bowler, C., Cochrane, G., Amid, C., Hoopen, P. Ten., De Vargas, C., Grimsley, N., Desgranges, E., Kandels-Lewis, S., Ogata, H., Poulton, N., Sieracki, M.E., Stepanauskas, R., Sullivan, M.B., Brum, J.R., Duhaime, M.B., Poulos, B.T., Hurwitz, B.L., Acinas, S.G., Bork, P., Boss, E., Bowler, C., De Vargas, C., Follows, M., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Karp-Boss, L., Karsenti, E., Not, F., Ogata, H., Pesant, S., Raes, J., Sardet, C., Sieracki, M.E., Speich, S., Stemmann, L., Sullivan, M.B., Sunagawa, S., Wincker, P., Pesant, S., Karsenti, E., Wincker, P. (2017) Viral to metazoan marine plankton nucleotide sequences from the Tara Oceans expedition. Sci. Data 4, 170093, Doi: 10.1038/sdata.2017.93.
[4] Alneberg, J., Karlsson, C.M.G., Divne, A.-M., Bergin, C., Homa, F., Lindh, M. V., Hugerth, L.W., Ettema, T.J.G., Bertilsson, S., Andersson, A.F., Pinhassi, J. (2018) Genomes from uncultivated prokaryotes: a comparison of metagenome-assembled and single-amplified genomes. Microbiome 6(1), 173, Doi: 10.1186/s40168-018-0550-0.
[5] Arndt, D., Grant, J.R., Marcu, A., Sajed, T., Pon, A., Liang, Y., Wishart, D.S. (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 44(W1), W16–21, Doi: 10.1093/nar/gkw387.
[6] Baek, K., Choi, A., Kang, I., Lee, K., Cho, J.C. (2013) Kordia antarctica sp. nov., isolated from Antarctic seawater. Int. J. Syst. Evol. Microbiol. 63(Pt 10), 3617–22, Doi: 10.1099/ijs.0.052738-0.
[7] Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A. (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19(5), 455–77, Doi: 10.1089/cmb.2012.0021.
[8] Begley, T.P., Kinsland, C., Strauss, E. (2001) The biosynthesis of coenzyme A in bacteria. Vitam. Horm. 61, 157–71, Doi: 10.1016/s0083-6729(01)61005-7.
[9] Bertelli, C., Laird, M.R., Williams, K.P., Lau, B.Y., Hoad, G., Winsor, G.L., Brinkman, F.S.L. (2017) IslandViewer 4: Expanded prediction of genomic islands for larger-scale datasets. Nucleic Acids Res. 45(W1), W30–5, Doi: 10.1093/nar/gkx343.
[10] Bolger, A.M., Lohse, M., Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15), 2114–20, Doi: 10.1093/bioinformatics/btu170.
[11] Bowers, R.M., Kyrpides, N.C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T.B.K., Schulz, F., Jarett, J., Rivers, A.R., Eloe-Fadrosh, E.A., Tringe, S.G., Ivanova, N.N., Copeland, A., Clum, A., Becraft, E.D., Malmstrom, R.R., Birren, B., Podar, M., Bork, P., Weinstock, G.M., Garrity, G.M., Dodsworth, J.A., Yooseph, S., Sutton, G., Glöckner, F.O., Gilbert, J.A., Nelson, W.C., Hallam, S.J., Jungbluth, S.P., Ettema, T.J.G., Tighe, S., Konstantinidis, K.T., Liu, W.-T., Baker, B.J., Rattei, T., Eisen, J.A., Hedlund, B., McMahon, K.D., Fierer, N., Knight, R., Finn, R., Cochrane, G., Karsch-Mizrachi, I., Tyson, G.W., Rinke, C., Kyrpides, N.C., Schriml, L., Garrity, G.M., Hugenholtz, P., Sutton, G., Yilmaz, P., Meyer, F., Glöckner, F.O., Gilbert, J.A., Knight, R., Finn, R., Cochrane, G., Karsch-Mizrachi, I., Lapidus, A., Meyer, F., Yilmaz, P., Parks, D.H., Eren, A.M., Schriml, L., Banfield, J.F., Hugenholtz, P., Woyke, T. (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35(8), 725–31, Doi: 10.1038/nbt.3893.
[12] Caro-Quintero, A., Konstantinidis, K.T. (2012) Bacterial species may exist, metagenomics reveal. Environ. Microbiol. 14(2), 347–55, Doi: 10.1111/j.1462-2920.2011.02668.x.
[13] Chikhi, R., Medvedev, P. (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1), 31–7, Doi: 10.1093/bioinformatics/btt310.
[14] Choi, A., Oh, H.-M., Yang, S.-J., Cho, J.-C. (2011) Kordia periserrulae sp. nov., isolated from a marine polychaete Periserrula leucophryna, and emended description of the genus Kordia. Int. J. Syst. Evol. Microbiol. 61(Pt 4), 864–9, Doi: 10.1099/ijs.0.022764-0.
[15] Clark, S.C., Egan, R., Frazier, P.I., Wang, Z. (2013) ALE: A generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics 29(4), 435–43, Doi: 10.1093/bioinformatics/bts723.
[16] Conway, J.R., Lex, A., Gehlenborg, N. (2017) UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33(18), 2938–40, Doi: 10.1093/bioinformatics/btx364.
[17] Delmont, T.O., Eren, A.M. (2018) Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. PeerJ, Doi: 10.7717/peerj.4320.
[18] Delmont, T.O., Quince, C., Shaiber, A., Esen, O.C., Lee, S.T., Lücker, S., Murat Eren, A. (2018) Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in the surface ocean. Nat. Microbiol. 3(July), Doi: 10.1101/129791.
[19] van Dongen, S., Abreu-Goodger, C. (2012) Using MCL to Extract Clusters from Networks. In: van Helden, J., Toussaint, A., Thieffry, D., (Eds.), Bacterial Molecular Networks: Methods and Protocols, Springer New York, New York, NY, pp. 281–95.
[20] Edgar, R.C. (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–7, Doi: 10.1093/nar/gkh340.
[21] Elbourne, L.D.H., Tetu, S.G., Hassan, K.A., Paulsen, I.T. (2017) TransportDB 2.0: a database for exploring membrane transporters in sequenced genomes from all domains of life. Nucleic Acids Res. 45(D1), D320–4, Doi: 10.1093/nar/gkw1068.
[22] Fernández-Gómez, B., Richter, M., Schüler, M., Pinhassi, J., Acinas, S.G., González, J.M., Pedrós-Alió, C. (2013) Ecology of marine Bacteroidetes: a comparative genomics approach. ISME J. 7(5), 1026–37, Doi: 10.1038/ismej.2012.169.
[23] Fraser, C., Alm, E.J., Polz, M.F., Spratt, B.G., Hanage, W.P. (2009) The Bacterial Species Challenge: Making Sense of Genetic and Ecological Diversity. Science 323(February), 741–6, Doi: 10.1126/science.1159388.
[24] Gómez-Rubio, V. (2017) ggplot2 - Elegant Graphics for Data Analysis (2nd Edition). J. Stat. Softw. 77(Book Review 2), 3–5, Doi: 10.18637/jss.v077.b02.
[25] Gonzalez, J.M., Fernandez-Gomez, B., Fernandez-Guerra, A., Gomez-Consarnau, L., Sanchez, O., Coll-Llado, M., del Campo, J., Escudero, L., Rodriguez-Martinez, R., Alonso-Saez, L., Latasa, M., Paulsen, I., Nedashkovskaya, O., Lekunberri, I., Pinhassi, J., Pedros-Alio, C. (2008) Genome analysis of the proteorhodopsin-containing marine bacterium Polaribacter sp. MED152 (Flavobacteria). Proc. Natl. Acad. Sci. 105(25), 8724–9, Doi: 10.1073/pnas.0712027105.
[26] Hameed, A., Shahina, M., Lin, S.Y., Cho, J.C., Lai, W.A., Young, C.C. (2013) Kordia aquimaris sp. nov., a zeaxanthin-producing member of the family Flavobacteriaceae isolated from surface seawater, and emended description of the genus Kordia. Int. J. Syst. Evol. Microbiol. 63(Pt 12), 4790–6, Doi: 10.1099/ijs.0.056051-0.
[27] Hyatt, D., Chen, G., Locascio, P.F., Land, M.L., Larimer, F.W., Hauser, L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119.
[28] Jain, C., Rodriguez-R, L.M., Phillippy, A.M., Konstantinidis, K.T., Aluru, S. (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9(5114), 1–8, Doi: 10.1038/s41467-018-07641-9.
[29] Kanehisa, M., Sato, Y., Morishima, K. (2016) BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J. Mol. Biol. 428(4), 726–31, Doi: 10.1016/j.jmb.2015.11.006.
[30] Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De Vargas, C., Raes, J., Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.-M., Follows, M., Gorsky, G., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not, F., Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., Velayoudon, D., Weissenbach, J., Wincker, P. (2011) A holistic approach to marine eco-systems biology. PLoS Biol. 9(10), e1001177, Doi: 10.1371/journal.pbio.1001177.
[31] Kashtan, N., Roggensack, S.E., Rodrigue, S., Thompson, J.W., Biller, S.J., Coe, A., Ding, H., Marttinen, P., Malmstrom, R.R., Stocker, R., Follows, M.J., Stepanauskas, R., Chisholm, S.W. (2014) Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344(6182), 416–20, Doi: 10.1126/science.1248575.
[32] Kettler, G.C., Martiny, A.C., Huang, K., Zucker, J., Coleman, M.L., Rodrigue, S., Chen, F., Lapidus, A., Ferriera, S., Johnson, J., Steglich, C., Church, G.M., Richardson, P., Chisholm, S.W. (2007) Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 3(12), 2515–28, Doi: 10.1371/journal.pgen.0030231.
[33] Kim, D.I., Lee, J.H., Kim, M.S., Seong, C.N. (2017) Kordia zosterae sp. nov., isolated from the seaweed, Zostera marina. Int. J. Syst. Evol. Microbiol. 67(11), 4790–5, Doi: 10.1099/ijsem.0.002379.
[34] Kim, M., Oh, H.S., Park, S.C., Chun, J. (2014) Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 64(Pt 2), 346–51, Doi: 10.1099/ijs.0.059774-0.
[35] Kisker, C., Kuper, J., Van Houten, B. (2013) Prokaryotic Nucleotide Excision Repair. Cold Spring Harb. Perspect. Biol. 5(3), a012591, Doi: 10.1101/cshperspect.a012591.
[36] Kogawa, M., Hosokawa, M., Nishikawa, Y., Mori, K., Takeyama, H. (2018) Obtaining high-quality draft genomes from uncultured microbes by cleaning and co-assembly of single-cell amplified genomes. Sci. Rep. 8(1), 1–11, Doi: 10.1038/s41598-018-20384-3.
[37] Konstantinidis, K.T., Rosselló-Móra, R., Amann, R. (2017) Uncultivated microbes in need of their own taxonomy. ISME J., 1–8, Doi: 10.1038/ismej.2017.113.
[38] Konstantinidis, K.T., Tiedje, J.M. (2005) Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. U. S. A. 102(7), 2567–72, Doi: 10.1073/pnas.0409727102.
[39] Kultima, J.R., Sunagawa, S., Li, J., Chen, W., Chen, H., Mende, D.R., Arumugam, M., Pan, Q., Liu, B., Qin, J., Wang, J., Bork, P. (2012) MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. PLoS One 7(10), 1–6, Doi: 10.1371/journal.pone.0047656.
[40] Kumar, S., Stecher, G., Tamura, K. (2016) MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol. Biol. Evol. 33(7), 1870–4, Doi: 10.1093/molbev/msw054.
[41] Lai, Q., Shao, Z., Liu, Y., Xie, Y., Du, J., Dong, C. (2015) Kordia zhangzhouensis sp. nov., isolated from surface freshwater. Int. J. Syst. Evol. Microbiol. 65(10), 3379–83, Doi: 10.1099/ijsem.0.000424.
[42] Landry, M.R., Brown, S.L., Campbell, L., Constantinou, J., Liu, H. (1998) Spatial patterns in phytoplankton growth and microzooplankton grazing in the Arabian Sea during monsoon forcing. Deep. Res. Part II Top. Stud. Oceanogr. 45(10–11), 2353–68, Doi: 10.1016/S0967-0645(98)00074-5.
[43] Langmead, B., Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–9, Doi: 10.1038/nmeth.1923.
[44] Lapébie, P., Lombard, V., Drula, E., Terrapon, N. (2019) Bacteroidetes use thousands of enzyme combinations to break down glycans. Nat. Commun., Doi: 10.1038/s41467-019-10068-5.
[45] Letunic, I., Bork, P. (2019) Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47(W1), W256–9, Doi: 10.1093/nar/gkz239.
[46] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–9, Doi: 10.1093/bioinformatics/btp352.
[47] López-Pérez, M., Rodriguez-Valera, F. (2016) Pangenome evolution in the marine bacterium Alteromonas. Genome Biol. Evol. 8(5), evw098, Doi: 10.1093/gbe/evw098.
[48] Mangot, J., Logares, R., Sánchez, P., Latorre, F., Seeleuthner, Y., Mondy, S., Sieracki, M.E., Jaillon, O., Wincker, P., Vargas, C. de., Massana, R. (2017) Accessing the genomic information of unculturable oceanic picoeukaryotes by combining multiple single cells. Sci. Rep. 7(January), 41498, Doi: 10.1038/srep41498.
[49] Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R. (2005) The microbial pan-genome. Curr. Opin. Genet. Dev. 15(6), 589–94, Doi: 10.1016/j.gde.2005.09.006.
[50] Mestre, M., Ruiz-González, C., Logares, R., Duarte, C.M., Gasol, J.M. (2018) Sinking particles promote vertical connectivity in the ocean microbiome. Proc. Natl. Acad. Sci. 115(29), 6799–807, Doi: 10.1073/pnas.1802470115.
861 [51] Mira, A., Martín-Cuadrado, A.B., D’Auria, G., Rodríguez-Valera, F. (2010) The 862 bacterial pan-genome:a new paradigm in microbiology. Int. Microbiol. 13(2), 45– 863 57, Doi: 10.2436/20.1501.01.110.
864 [52] Ochman, H., Lerat, E., Daubin, V. (2005) Examining bacterial species under the 865 specter of gene transfer and exchange. Proc. Natl. Acad. Sci. 102(Supplement 866 1), 6595–9, Doi: 10.1073/pnas.0502035102.
867 [53] Pachiadaki, M.G., Sintes, E., Bergauer, K., Brown, J.M., Record, N.R., Swan, 868 B.K., Mathyer, M.E., Hallam, S.J., Lopez-Garcia, P., Takaki, Y., Nunoura, T., 869 Woyke, T., Herndl, G.J., Stepanauskas, R. (2017) Major role of nitrite-oxidizing 870 bacteria in dark ocean carbon fixation. Science (80-. ). 358(6366), 1046–51, Doi: 871 10.1126/science.aan8260.
872 [54] Park, S., Jung, Y.-T., Yoon, J.-H. (2014) Kordia jejudonensis sp. nov., isolated 873 from the junction between the ocean and a freshwater spring, and emended 874 description of the genus Kordia. Int. J. Syst. Evol. Microbiol. 64(Pt 2), 657–62, 875 Doi: 10.1099/ijs.0.058776-0.
876 [55] Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W. (2015) 877 CheckM: assessing the quality of microbial genomes recovered from isolates, 878 single cells, and metagenomes. Genome Res. 25(7), 1043–55, Doi: 879 10.1101/gr.186072.114.
880 [56] Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., 881 Evans, P.N., Hugenholtz, P., Tyson, G.W. (2017) Recovery of nearly 8,000 882 metagenome-assembled genomes substantially expands the tree of life. Nat. 883 Microbiol. 2(11), 1533–42, Doi: 10.1038/s41564-017-0012-7.
884 [57] Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., 885 Evans, P.N., Hugenholtz, P., Tyson, G.W. (2018) Author Correction: Recovery of 886 nearly 8,000 metagenome-assembled genomes substantially expands the tree 887 of life. Nat. Microbiol. 3(2), 253–253, Doi: 10.1038/s41564-017-0083-5.
888 [58] Paulmier, A., Ruiz-Pino, D. (2009) Oxygen minimum zones (OMZs) in the 889 modern ocean. Prog. Oceanogr. 80(3–4), 113–28, Doi: 890 10.1016/j.pocean.2008.08.001.
891 [59] Pinder, M.I.M., Johansson, O.N., Almstedt, A., Kourtchenko, O., Clarke, A.K., 892 Godhe, A., Töpel, M. (2019) Genome Sequence of Kordia sp. Strain SMS9
37 893 Identified in a Non-Axenic Culture of the Diatom Skeletonema marinoi . J. 894 Genomics 7, 46–9, Doi: 10.7150/jgen.35061.
895 [60] Price, M.N., Dehal, P.S., Arkin, A.P. (2010) FastTree 2 – Approximately 896 Maximum-Likelihood Trees for Large Alignments 5(3), Doi: 897 10.1371/journal.pone.0009490.
898 [61] Pruesse, E., Peplies, J., Glöckner, F.O., Editor, A., Wren, J. (2012) SINA : 899 Accurate high-throughput multiple sequence alignment of ribosomal RNA genes 900 28(14), 1823–9, Doi: 10.1093/bioinformatics/bts252.
901 [62] Qi, F., Shao, Z., Li, D., Huang, Z., Lai, Q. (2016) Kordia ulvae sp. nov., a 902 bacterium isolated from the surface of green marine algae Ulva sp. Int. J. Syst. 903 Evol. Microbiol. 66(7), 2623–8, Doi: 10.1099/ijsem.0.001098.
904 [63] Rawlings, N.D., Waller, M., Barrett, A.J., Bateman, A. (2014) MEROPS : the 905 database of proteolytic enzymes , their substrates and inhibitors 42(July 2011), 906 503–9, Doi: 10.1093/nar/gkt953.
907 [64] Ren, Q., Pauisers, J.T. (2005) Comparative analyses of fundamental differences 908 in membrane transport Capabilities in Prokaryotes and Eukaryotes. PLoS 909 Comput. Biol. 1(3), 0190–201, Doi: 10.1371 journal.pcbi.0010027.
910 [65] Richter, M., Rosselló-Móra, R. (2009) Shifting the genomic gold standard for the 911 prokaryotic species definition. Proc. Natl. Acad. Sci. U. S. A. 106(45), 19126–31, 912 Doi: 10.1073/pnas.0906412106.
913 [66] Rodriguez-R, L.M., Gunturu, S., Harvey, W.T., Rossell, R., Tiedje, J.M., Cole, 914 J.R., Konstantinidis, K.T. (2018) The Microbial Genomes Atlas ( MiGA ) 915 webserver : taxonomic and gene diversity analysis of Archaea and (June), 1–7, 916 Doi: 10.1093/nar/gky467.
917 [67] Rosselló-Mora, R., Amann, R. (2001) The species concept for prokaryotes. 918 FEMS Microbiol. Rev. 25(1), 39–67.
919 [68] Roullier, F., Berline, L., Guidi, L., Durrieu De Madron, X., Picheral, M., Sciandra, 920 A., Pesant, S., Stemmann, L. (2014) Particle size distribution and estimated 921 carbon flux across the Arabian Sea oxygen minimum zone. Biogeosciences 922 11(16), 4541–57, Doi: 10.5194/bg-11-4541-2014.
923 [69] Royo-Llonch, M., Ferrera, I., Cornejo-Castillo, F.M., Sánchez, P., Salazar, G., 924 Stepanauskas, R., González, J.M., Sieracki, M.E., Speich, S., Stemmann, L., 925 Pedrós-Alió, C., Acinas, S.G. (2017) Exploring microdiversity in novel Kordia sp. 926 (Bacteroidetes) with proteorhodopsin from the tropical Indian Ocean via Single
38 927 Amplified Genomes. Front. Microbiol. 8(JUL), Doi: 10.3389/fmicb.2017.01317.
928 [70] Seemann, T. (2014) Prokka: Rapid prokaryotic genome annotation. 929 Bioinformatics 30(14), 2068–9, Doi: 10.1093/bioinformatics/btu153.
930 [71] Segerman, B. (2012) The genetic integrity of bacterial species: the core genome 931 and the accessory genome, two different stories. Front. Cell. Infect. Microbiol. 932 2(September), 1–8, Doi: 10.3389/fcimb.2012.00116.
933 [72] Shapiro, B.J., Polz, M.F. (2014) Ordering microbial diversity into ecologically and 934 genetically cohesive units. Trends Microbiol. 22(5), 235–47, Doi: 935 10.1016/j.tim.2014.02.006.
936 [73] Sharon, I., Kertesz, M., Hug, L.A., Pushkarev, D., Blauwkamp, T.A., Castelle, 937 C.J., Amirebrahimi, M., Thomas, B.C., Burstein, D., Tringe, S.G., Williams, K.H., 938 Banfield, J.F. (2015) Accurate , multi-kb reads resolve complex populations and 939 detect rare microorganisms. Genome Res., 534–43, Doi: 940 10.1101/gr.183012.114.
941 [74] Shoseyov, O., Shani, Z., Levy, I. (2006) Carbohydrate Binding Modules: 942 Biochemical Properties and Novel Applications. Microbiol. Mol. Biol. Rev. 70(2), 943 283–95, Doi: 10.1128/MMBR.00028-05.
944 [75] Sohn, J.H., Lee, J.H., Yi, H., Chun, J., Bae, K.S., Ahn, T.Y., Kim, S.J. (2004) 945 Kordia algicida gen. nov., sp. nov., an algicidal bacterium isolated from red tide. 946 Int. J. Syst. Evol. Microbiol. 54(3), 675–80, Doi: 10.1099/ijs.0.02689-0.
947 [76] Stepanauskas, R. (2012) Single cell genomics: An individual look at microbes. 948 Curr. Opin. Microbiol. 15(5), 613–20, Doi: 10.1016/j.mib.2012.09.001.
949 [77] Swan, B.K., Martinez-Garcia, M., Preston, C.M., Sczyrba, A., Woyke, T., Lamy, 950 D., Reinthaler, T., Poulton, N.J., Masland, E.D.P., Gomez, M.L., Sieracki, M.E., 951 DeLong, E.F., Herndl, G.J., Stepanauskas, R. (2011) Potential for 952 chemolithoautotrophy among ubiquitous bacteria lineages in the dark ocean. 953 Science 333(6047), 1296–300, Doi: 10.1126/science.1203690.
954 [78] Swan, B.K., Tupper, B., Sczyrba, A., Lauro, F.M., Martinez-Garcia, M., 955 González, J.M., Luo, H., Wright, J.J., Landry, Z.C., Hanson, N.W., Thompson, 956 B.P., Poulton, N.J., Schwientek, P., Acinas, S.G., Giovannoni, S.J., Moran, M.A., 957 Hallam, S.J., Cavicchioli, R., Woyke, T., Stepanauskas, R. (2013) Prevalent 958 genome streamlining and latitudinal divergence of planktonic bacteria in the 959 surface ocean. Proc. Natl. Acad. Sci. U. S. A. 110(28), 11463–8, Doi: 960 10.1073/pnas.1304246110.
39 961 [79] Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008) Comparative genomics: the 962 bacterial pan-genome. Curr. Opin. Microbiol. 11(5), 472–7, Doi: 963 10.1016/j.mib.2008.09.006.
964 [80] Thompson, L.R., Haroon, M.F., Shibl, A.A., Cahill, M.J., Ngugi, D.K., Williams, 965 G.J., Morton, J.T., Knight, R., Goodwin, K.D., Stingl, U. (2019) Red Sea SAR11 966 and Prochlorococcus Single-Cell Genomes Reflect Globally Distributed 967 Pangenomes. Appl. Environ. Microbiol. 85(13), 1–18, Doi: 10.1128/AEM.00369- 968 19.
969 [81] Tully, B.J., Wheat, C.G., Glazer, B.T., Huber, J.A. (2018) A dynamic microbial 970 community with high functional redundancy inhabits the cold, oxic subseafloor 971 aquifer. ISME J. 12(1), 1–16, Doi: 10.1038/ismej.2017.187.
972 [82] Varani, A.M., Siguier, P., Gourbeyre, E., Charneau, V., Chandler, M. (2011) 973 ISsaga is an ensemble of web-based methods for high throughput identification 974 and semi-automatic annotation of insertion sequences in prokaryotic genomes. 975 Genome Biol. 12(3), R30, Doi: 10.1186/gb-2011-12-3-r30.
976 [83] Yarza, P., Richter, M., Euzeby, J., Amann, R., Schleifer, K., Ludwig, W., Glo, 977 F.O., Rossello, R. (2020) The All-Species Living Tree project : A 16S rRNA- 978 based phylogenetic tree of all sequenced type strains 31(2008), 241–50, Doi: 979 10.1016/j.syapm.2008.07.001.
980 [84] Yin, Y., Mao, X., Yang, J., Chen, X., Mao, F., Xu, Y. (2012) dbCAN: a web 981 resource for automated carbohydrate-active enzyme annotation. Nucleic Acids 982 Res. 40(W1), W445–51, Doi: 10.1093/nar/gks479.
983 [85] Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014) PEAR: a fast and 984 accurate Illumina Paired-End reAd mergeR. Bioinformatics 30(5), 614–20, Doi: 985 10.1093/bioinformatics/btt593.
986
987
988
Figures

Figure 1. Workflow used to co-assemble the novel Kordia sp. TARA_039_SRF high-quality draft genome from ten Kordia SAGs. The first step comprises read processing and co-assembly optimization (A). Raw reads are cleaned and merged to produce a workable read dataset consisting of merged reads and clean paired-end reads that did not merge. These are co-assembled multiple times with different individual k-mer sizes, ranging between 17 and 133 bp. N50 length, number of contigs and assembly evaluation score (ALE score) are the parameters used to determine the three best k-mer sizes, which are then combined in the co-assembly used afterwards. The second step aims to reduce the redundancy found at the contig ends of this co-assembly and to reduce contamination (B). It relies on re-assembling the co-assembly's contigs, measuring genome completeness and contamination, and discarding the shorter contigs that add extra copies of single-copy genes to the co-assembly, down to a final contamination value <5%.
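The k-mer selection in step A can be illustrated with a minimal sketch. The caption names the three criteria (N50, number of contigs, ALE score) but not how they were weighed against each other, so the rank-sum rule below is an assumption, as are the `AssemblyStats` and `best_kmers` names and every number in `trials`, which are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyStats:
    kmer: int          # k-mer size used for one trial co-assembly
    n50: int           # contig N50 in bp (higher is better)
    n_contigs: int     # number of contigs (lower is better)
    ale_score: float   # ALE score (higher, i.e. less negative, is better)


def rank_positions(trials, key, reverse):
    """Map each k-mer size to its rank (0 = best) under a single metric."""
    ordered = sorted(trials, key=key, reverse=reverse)
    return {t.kmer: i for i, t in enumerate(ordered)}


def best_kmers(trials, top=3):
    """Return the `top` k-mer sizes with the lowest summed rank (assumed rule)."""
    by_n50 = rank_positions(trials, lambda t: t.n50, reverse=True)
    by_contigs = rank_positions(trials, lambda t: t.n_contigs, reverse=False)
    by_ale = rank_positions(trials, lambda t: t.ale_score, reverse=True)
    total = {t.kmer: by_n50[t.kmer] + by_contigs[t.kmer] + by_ale[t.kmer]
             for t in trials}
    # Ties are broken by the smaller k-mer size.
    return sorted(total, key=lambda k: (total[k], k))[:top]


# Hypothetical metrics, not values from this study.
trials = [
    AssemblyStats(21, 12_000, 900, -5.2e7),
    AssemblyStats(55, 30_000, 400, -4.1e7),
    AssemblyStats(77, 41_000, 310, -3.9e7),
    AssemblyStats(99, 43_000, 290, -4.0e7),
    AssemblyStats(127, 25_000, 520, -4.6e7),
]
selected = best_kmers(trials)
print(selected)
```

A rank sum avoids having to put N50, contig counts and ALE scores on a common scale; any other weighting would fit the same structure.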
[Figure 2 image: two phylogenetic trees, panels A and B, with tips labeled by category: SAG co-assembly, isolate, OTU, MAG.]

Figure 2. Phylogenetic reconstruction of Kordia sp. TARA_039_SRF and its closest relatives based on the 16S rRNA gene (A) and 40 conserved single copy genes (B).
Figure 3. Theoretical model with highlights of the functional potential of Kordia sp. TARA_039_SRF. Membrane transporters are colored depending on their TC Family classification and their simplified structure has been inferred from TC database and related bibliography.
Figure 4. Relative abundances of recruited reads in all available metagenomes from station TARA_039 in the North Indian Ocean (A). Unavailable metagenomic samples are depicted as slashed cells in the heatmap; samples with no read recruitment whatsoever appear in white. Horizontal genomic coverage of mapped reads at 95-100% in TARA_039 metagenomes (B). The total of 4193 loci encoded in the co-assembly have been collapsed into 420 bins of 10 loci each, and each bin is colored by the number of reads mapping to it. No read mapping is shown in white.
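The binning described for panel B can be sketched as follows. This is only an illustration: the caption does not say whether reads per bin were summed or averaged, so summing is an assumption, `bin_loci` is a hypothetical helper, and the per-locus counts are placeholders rather than data from the study.

```python
def bin_loci(read_counts, bin_size=10):
    """Sum per-locus read counts over consecutive bins of `bin_size` loci.

    The last bin is shorter when the number of loci is not a multiple
    of `bin_size`.
    """
    return [sum(read_counts[i:i + bin_size])
            for i in range(0, len(read_counts), bin_size)]


n_loci = 4193                 # number of loci stated in the caption
counts = [1] * n_loci         # placeholder: one mapped read per locus
bins = bin_loci(counts)
print(len(bins), bins[0], bins[-1])
```

With 4193 loci and 10 loci per bin, the last bin holds only the 3 remaining loci, which is how 4193 loci yield 420 bins.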
Figure 5. Comparative genomics of Kordia sp. TARA_039_SRF, K. algicida, K. jejudonensis and K. zhangzhouensis. The top left dendrogram represents clusters of genes ordered by the number of genomes encoding them, regardless of copy number. The phylogenomic tree connecting the Kordia spp. labels was built from their core genome. The right bar plot quantifies the total number of genes that are unique and shared in all possible combinations between the five Kordia genomes. The bottom heatmap quantifies the number of genes classified to a COG category, peptidases, Transport Classification family or CAZymes category for each genome combination and for the unique flexible genome. Each genome has a specific color that is maintained in the different sections of the figure.

Tables
Table 1. Metrics of the individual assemblies and the co-assembly of the 10 Kordia SAGs.

                                                        Individual SAG genomes (AB-193_)                                                Refined Kordia sp. TARA_039_SRF
                                                        I04     I20     L04     M03     M19     M23     N20     O13     O22     P22     co-assembly
Assembly metrics
Total N of contigs                                      179     111     367     124     197     244     256     179     221     220     27
Contig N50 (Kbp)                                        20.9    30.0    33.0    54.2    40.2    38.4    27.5    42.7    49.8    25.5    42.9
Mean contig length (Kbp)                                5.9     8.6     5.6     10.2    10.4    9.3     5.9     9.5     9.1     6.6     170.1
Min. contig length (bp)                                 118     118     118     118     118     118     118     118     118     118     14743
Max. contig length (Kbp)                                59.5    80.7    113.4   135     137     139.7   107.7   93.4    106.1   113     682.4
Total assembly size (Mbp)                               1.05    0.96    2.05    1.27    2.06    2.27    1.51    1.7     2.01    1.47    4.59
Quality parameters
Genome completeness (%)                                 24.36   11.15   39.94   26.14   44.11   43.32   33.02   36.96   44.78   32.31   94.97
Contamination (%)                                       0.65    0.00    1.14    0.49    1.14    2.16    1.84    0.65    0.43    0.69    4.65
Strain heterogeneity                                    100.00  0.00    0.00    50.00   0.00    0.00    0.00    0.00    0.00    0.00    0.00
Read mappings against co-assembly
Contribution to co-assembly (%)                         24.56   21.70   45.96   28.75   45.30   50.04   34.56   37.82   44.99   32.84   -
Mapped clean reads (%)                                  97.84   97.64   97.76   98.21   98.38   98.40   98.50   98.11   98.48   98.15   -
Genomic features
Predicted ORFs                                          1012    944     1855    1142    1891    2118    1357    1548    1864    1334    4192
GC (%)                                                  35.4    35.5    35.3    35.4    35.6    35.5    35.4    35.5    35.8    36.0    36.33
ANIm against co-assembly (%)                            99.98   99.98   99.97   99.99   99.98   99.98   99.97   99.98   99.96   99.98   100
Identity of tetranucleotide freq. with co-assembly (%)  99.53   99.42   99.83   99.64   99.87   99.84   99.76   99.83   99.85   99.56   100

Table 2. Description of the five Kordia genomes used in the comparative genomics analysis.

                                                          Kordia sp. TARA_039_SRF  K. algicida        K. jejudonensis    K. zhangzhouensis  K. periserrulae
Genomic features
Length (Mbp)                                              4.59                     5.01               5.35               4.03               4.72
GC (%)                                                    35.8                     34.2               33.9               33.8               36.2
Coding density (%)                                        88.45                    87.58              88.09              90.98              88.93
Total number of ORFs                                      4192                     4492               4782               3585               4105
Predicted protein coding genes (CDS)                      4124                     4411               4714               3527               4051
Completeness (%)                                          94.83                    100                100                100                100
Pangenome
# clusters of core genes (% of total)                     2286 (57.4%)             2286 (55.4%)       2286 (50.3%)       2286 (65.8%)       2286 (58.1%)
# clusters of genes shared with Kordia spp. (% of total)  971 (24.4%)              1038 (25.1%)       969 (21.3%)        772 (22.2%)        1050 (26.6%)
  (accessory shared genome)
# clusters of unique flexible genes (% of total)          723 (18.1%)              801 (19.4%)        1287 (28.3%)       414 (11.9%)        599 (15.2%)
  (accessory strain-specific genome)
Total # of gene clusters                                  3980                     4125               4542               3472               3935
ANIm (%)
Kordia sp. TARA_039_SRF                                   100                      -                  -                  -                  -
Kordia algicida                                           81.13                    100                -                  -                  -
Kordia jejudonensis                                       79.83                    81.47              100                -                  -
Kordia zhangzhouensis                                     79.62                    79.56              79.23              100                -
Kordia periserrulae                                       89.13                    80.97              79.83              79.63              100
AAI (%)
Kordia sp. TARA_039_SRF                                   100                      -                  -                  -                  -
Kordia algicida                                           81.05                    100                -                  -                  -
Kordia jejudonensis                                       77.96                    80.14              100                -                  -
Kordia zhangzhouensis                                     79.93                    80.62              78.06              100                -
Kordia periserrulae                                       92.12                    81.03              77.97              80.06              100
Ribosomal RNA
16S rRNA id. % with Kordia sp. TARA_039_SRF               100                      -                  -                  -                  -
16S rRNA id. % with K. algicida (coverage)                97.03 (100%)             100                -                  -                  -
16S rRNA id. % with K. jejudonensis (coverage)            98.00 (95%)              96.69 (95%)        100                -                  -
16S rRNA id. % with K. zhangzhouensis (coverage)          96.85 (98%)              95.85 (98%)        97.24 (100%)       100                -
16S rRNA id. % with K. periserrulae (coverage)            97.56 (97%)              96.74 (97%)        97.24 (98%)        96.34 (99%)        100
# of tRNA types                                           21                       21                 20                 19                 20
# of rRNA operons                                         1                        3                  1                  1                  1
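
As a worked check of the pangenome rows in Table 2, the short sketch below recomputes the total number of gene clusters and the core-genome share for each genome from the counts in the table. The dictionary layout and the `core_share` helper are illustrative assumptions, not part of the original analysis; only the core percentages are recomputed here.

```python
# Gene-cluster counts copied from Table 2:
# (core, shared accessory, unique flexible, reported total).
table2 = {
    "Kordia sp. TARA_039_SRF": (2286, 971, 723, 3980),
    "K. algicida": (2286, 1038, 801, 4125),
    "K. jejudonensis": (2286, 969, 1287, 4542),
    "K. zhangzhouensis": (2286, 772, 414, 3472),
    "K. periserrulae": (2286, 1050, 599, 3935),
}


def core_share(core, shared, unique):
    """Percentage of a genome's gene clusters that belong to the core genome."""
    return 100.0 * core / (core + shared + unique)


# The three categories should add up to the reported total for every genome.
totals_match = all(c + s + u == t for c, s, u, t in table2.values())

core_pct = {name: round(core_share(c, s, u), 1)
            for name, (c, s, u, _total) in table2.items()}

print(totals_match)
for name, pct in core_pct.items():
    print(f"{name}: {pct}% core")
```

Because the core genome is fixed at 2286 clusters, its share is largest in the smallest genome (K. zhangzhouensis) and smallest in the largest one (K. jejudonensis).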