Ecological and functional capabilities of an uncultured Kordia sp.

Royo-Llonch M1,4, Sánchez P1, González JM2, Pedrós-Alió C3, Acinas SG1*

1 Department of Marine Biology and Oceanography, Institut de Ciències del Mar (ICM), CSIC, Barcelona, Spain

2 Department of Microbiology, University of La Laguna, La Laguna, Spain

3 Systems Biology Program, Centro Nacional de Biotecnología (CNB), CSIC, Madrid, Spain

4 Departament de Genètica i Microbiologia, Facultat de Biociències, Universitat Autònoma de Barcelona, Bellaterra, Spain

*Correspondence:

Dr. Silvia G. Acinas, [email protected]

Keywords: SAG, Kordia, co-assembly, photoheterotrophy, metagenomics

Abstract

Cultivable microorganisms represent only a fraction of the diversity in microbial communities. However, the official procedures for classification and characterization of a novel prokaryotic species still rely on isolates. Thanks to Single Cell Genomics, it is possible to retrieve genomes from environmental samples by sequencing cells individually, and to assign specific genes to a specific taxon, regardless of its ability to grow in culture. In this study, we performed a complete description of the uncultured Kordia sp. TARA_039_SRF, a proposed novel species within the genus Kordia, using culture-independent techniques. The type material was a high-quality draft genome (94.97% complete, 4.65% gene redundancy) co-assembled using 10 nearly identical Single Amplified Genomes (SAGs) from surface seawater in the North Indian Ocean from the Tara Oceans Expedition. The assembly process was optimized to obtain the best possible assembly metrics and a less fragmented genome. Its closest relative is Kordia periserrulae, sharing 97.56% similarity of the 16S rRNA gene, 75% of their orthologs and 89.13% average nucleotide identity. We describe the functional potential of the proposed novel species, which includes proteorhodopsin, the ability to incorporate nitrate, cytochrome oxidases with high affinity for oxygen and CAZymes that are unique features within the genus. Its abundance at different depths and size fractions was also evaluated together with its functional annotation, revealing that its putative ecological niche seems to be particles of phytoplanktonic origin. It can attach to these particles and consume them while sinking to the deeper and oxygen-depleted layers of the North Indian Ocean.

Introduction

There is no consensus on which ecologically coherent units should be used as a proxy for bacterial species in environmental communities [23,52,67,72]. Moreover, for a new bacterial species to be officially recognized, its isolation in pure culture is still a requirement. It has become necessary to redefine bacterial “species” to fit the reality of the (still uncultured) majority of bacteria. An update of the validation of novel uncultured high-quality genomes is also needed, as they are given a provisional Candidatus status even after a thorough description [37]. Recent efforts have emerged to facilitate the standardization of novel uncultured taxa [37], like the Microbial Genomes Atlas (MiGA), which provides genome-based taxonomic classification and quality assessment across genomes from different environments [66].

Nowadays, uncultured genomes can be used to expand knowledge of the genetic and evolutionary differences between representatives of the same species or close relatives, as well as to enrich knowledge of microbial diversity. Uncultured genomes can be retrieved by: i) co-assembling metagenomes (Metagenomic Assembled Genomes or MAGs) or ii) single cell genomics (Single Amplified Genomes or SAGs). In marine microbiology, assembling MAGs has yielded a vast number of genomes belonging to novel bacterial and archaeal phyla, unveiling key players in the biogeochemistry of the oceans [18,56,57]. Alternatively, Single Cell Genomics allows the assignment of specific functional traits to a specific taxon, an outstanding resolution level for microdiversity studies. Single Cell Genomics has helped link functional roles to relevant taxonomic groups, changing the current understanding of predominant marine metabolisms (such as chemolithoautotrophy or the role of nitrite-oxidizing bacteria in the deep ocean) [53,77]. While MAGs tend to be longer and more complete than SAGs, MAGs result from a population of genomes and there is still a lack of consensus about which taxonomic units they reflect. Nevertheless, multiple SAGs can be co-assembled to retrieve more complete genomes, provided they are closely related phylogenetically. The comparison between MAGs and SAGs assembled from the same Baltic Sea water samples revealed very high nucleotide identities between the corresponding pairs but a difference in size and completeness between the genomes [4]. SAGs have also been used in pangenome analyses, which had previously been mostly restricted to cultured microorganisms (especially pathogenic bacteria) [49,51,79]. Comparative genomics and the development of the “pan-genome” concept offered an alternative for understanding the genetic extent and dynamics of bacterial species [49,51,79], dramatically changing the description of a species by studying multiple genomes belonging to the same defined taxa [34,38,65]. In the marine environment, SAG-based pangenome analyses have mostly focused on the highly studied Prochlorococcus (revealing hundreds of co-existing Prochlorococcus populations [31] and the link between their hypervariable genomic islands and the environment [17]) or on Prochlorococcus together with SAR11 (defining endemic gene-level adaptations to specific locations like the Red Sea [80]).

In the present study, we carried out the co-assembly of 10 SAGs of a nearly clonal population of a novel species in the genus Kordia to generate a high-quality genome complete enough to: i) allow its putative functional description, ii) determine its preferred niche in the water column from which it was retrieved, and iii) describe the novel species in comparison with the available sequenced relatives of the genus Kordia. The genus Kordia belongs to the family Flavobacteriaceae and was first proposed with the isolation in culture of Kordia algicida, which lyses algal cells and feeds on phytoplanktonic bloom exudates [75]. Members of this genus are Gram-negative, strictly aerobic or facultatively anaerobic, non-motile or motile by gliding, rod-shaped and have a DNA G+C content of 34-37% [33]. In the last 15 years, new species have been added to the genus, all of them isolated from samples found in the aquatic environment: K. zhangzhouensis thrives in freshwater [41], K. jejudonensis was isolated from the interface between seawater and freshwater springs [54], K. antarctica and K. aquimaris were isolated from Antarctic and Taiwanese seawater, respectively [6,26], K. ulvae, K. zosterae and Kordia sp. SMS9 were retrieved from the surface of marine algae [33,59,62] and K. periserrulae was found in the gut of a tidal flat polychaete [14]. Kordia sp. NORP58 is a MAG assembled from a marine subsurface aquifer [81]. At the time of writing, the genus consists of eight species with validly published names, of which four have had their genomes completely sequenced.

Despite the fundamental insights into microbial ecology and evolution provided by the analyses of uncultured bacterial and archaeal genomes, there is not yet a proper taxonomy for the uncultured majority. We aim to provide an example of a complete description of a novel species of the genus Kordia, analyzed via culture-independent methods to infer its ecological and functional traits.

Materials and Methods

Single Amplified Genome generation and phylogeny analysis

A total of 98 Single Amplified Genomes were generated as detailed in [77] from a surface (SRF) seawater sample from the North Indian Ocean (latitude 18.59ºN – longitude 66.62ºE) during the circumnavigation expedition Tara Oceans [30], station TARA_039 (Sample ID: TARA_G000000266) (SM Table 1).

Multi Locus Sequencing Analysis (MLSA) of the generated SAGs revealed that 84% of them belonged to a novel species of the genus Kordia (hereafter referred to as Kordia sp. TARA_039_SRF) and that the amplified genes (16S rRNA, partial 23S rRNA, partial RNA polymerase subunit B and partial proteorhodopsin genes, as well as the Internal Transcribed Spacer) were identical in the whole Kordia SAG population. More details on the phylogeny and distribution of these Kordia SAGs can be found in [69].

SAG selection

We chose 10 out of the 98 generated Kordia SAGs for Whole Genome Sequencing. They were selected based on: i) their Multiple Displacement Amplification (MDA) Cp values, which were the shortest, ranging from 7:15 to 8:15, and ii) successful amplification of any of the markers tested in the previous MLSA.

Sequencing and read treatment

Sequencing of the ten SAGs was carried out in two batches. For the first set of five, the sequence reads were obtained by Illumina MiSeq 2x300 bp technology, and for the second set by Illumina HiSeq 2x250 bp. Quality assessment of the raw reads was performed using FastQC, and Illumina PhiX174 adapters were removed using Bowtie2 v2.2.9 [43] and Samtools v1.3.1 [46]. Afterwards, reads were normalized by coverage for each individual SAG using DOE JGI’s BBnorm, setting the maximum coverage at 40X. The normalized paired-end libraries were trimmed and filtered with Trimmomatic [10] and paired reads were merged with PEAR v0.9.6 [85]. The workable read dataset consisted of merged reads, pairs of reads that did not merge and unpaired orphan reads (Figure 1A). The final non-redundant set accounted for 1.75% of the original raw read dataset (3,746,468 reads out of a total of 213,457,114).

Co-assembly optimization and quality control

We chose a strategy that focused on reducing gene redundancy and genome fragmentation to obtain the co-assembly. As a first approach, 36 co-assemblies were generated with the assembler Ray v2.2.0 using individual k-mer length values, from 17 to 133, in order to find those that performed best (Figure 1A). The three k-mer values that generated the co-assemblies with the best metrics were combined using the assembler SPAdes v3.10 [7] with options --careful and --sc (recommended for single cell genome assembly). Another co-assembly was generated with the SPAdes default combination of k-mer length values (21, 33, 55) to confirm whether the custom k-mer length combination resulted in a better co-assembly.

Normalizing the pool of reads by coverage and merging them before assembly significantly reduced the memory requirements and duration of the assembly. Using the default single-cell mode of the SPAdes assembler with the full read dataset, the co-assembly lasted ~62 hours and consumed 75 GB of memory. The same process with the workable read dataset (normalized and merged) and optimized k-mer lengths took 11 min and 12 GB.

Metrics for each co-assembly were generated using a custom Perl script. Genome completeness and contamination (assessed as single copy marker gene redundancy) were calculated with CheckM v1.0.11 [55]. The reference marker gene database chosen was “Family Flavobacteriaceae”, which was the closest to the taxonomic annotation of the co-assembly. Nevertheless, the marker gene PF07659.6 was excluded from the database as it was absent in all complete isolated Kordia genomes. The assembly accuracy was assessed with the Assembly Likelihood Estimator ALE [15]. The three k-mer length values that showed the best metrics (higher genome completeness, longer co-assembly length, smaller number of contigs and larger N50) and the best ALE score were used as a combination in a final co-assembly (K=55, 61, 117). Overlapping regions among scaffolds were detected after a visual check of gene synteny in their functional annotations. Several refining steps were taken to resolve redundancies in the co-assembly (Figure 1B). First, the scaffolds were assembled in Geneious R11, with default settings in the High Sensitivity assembly algorithm. Second, the resulting contigs were split into different datasets by setting a cut-off for a minimum contig length. The final co-assembly was the contig dataset with the highest genome completeness, the highest number of marker genes present only once and a percent contamination lower than 5% [11,37]. Detailed metrics and evaluation of the different co-assemblies are in SM Table 2.
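The contiguity metrics named above (co-assembly length, number of contigs, N50) can be illustrated with a short Python sketch. This is not the custom Perl script used in the study, which is not shown here; the function names `n50` and `assembly_metrics` and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


def assembly_metrics(contig_lengths):
    """Summarize the contiguity metrics used to compare co-assemblies."""
    return {
        "total_length": sum(contig_lengths),
        "n_contigs": len(contig_lengths),
        "n50": n50(contig_lengths),
        "longest": max(contig_lengths),
    }


# Toy example with hypothetical contig lengths (bp)
print(assembly_metrics([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Completeness, redundancy and likelihood scores would still have to come from dedicated tools such as CheckM and ALE; this sketch only covers the length-based metrics.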

Alignment of individual SAGs against the co-assembly

Mappings of individual SAGs to the co-assembly were done with the same read dataset as for the co-assembly (clean, trimmed, merged and coverage-normalized reads) using Bowtie2 v2.2.9. The percentage of mapped reads was extracted from Bowtie2’s log file. The vertical coverage of the co-assembly by each SAG individually was calculated using the genomecov function of Bedtools v2.27.1. The sum of bases of the co-assembly covered by each SAG was divided by the total length of the co-assembly to obtain the contribution of each SAG to the co-assembly (Table 1).

Ggplot2 [24] was used to visualize the mapping of reads of every individual SAG against the generated co-assembly (Figure SM2).
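The contribution of each SAG, described above as covered bases divided by co-assembly length, could be computed along these lines. This is a minimal sketch that assumes per-interval depth in BedGraph form with zero-depth intervals included (as produced by `bedtools genomecov -bga`). The options actually used in the study are not stated, and the function name `breadth_of_coverage` is hypothetical.

```python
def breadth_of_coverage(bedgraph_lines, min_depth=1):
    """Fraction of reference bases covered at >= min_depth.

    Each line is expected as: contig, start, end, depth
    (0-based, half-open intervals; zero-depth intervals included).
    """
    covered = 0
    total = 0
    for line in bedgraph_lines:
        if not line.strip():
            continue  # skip empty lines
        contig, start, end, depth = line.split()[:4]
        span = int(end) - int(start)
        total += span
        if float(depth) >= min_depth:
            covered += span
    if total == 0:
        raise ValueError("no intervals given")
    return covered / total


# Toy example: 150 of 200 reference bases are covered by at least one read
example = ["c1\t0\t50\t0", "c1\t50\t100\t3", "c2\t0\t100\t1"]
print(breadth_of_coverage(example))
```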

Phylogeny based on the 16S rRNA gene

The complete 16S rRNA gene of the co-assembly was queried against the Living Tree Project [83] database using SILVA’s SINA ARB v1.2.11 online tool [61]. Its Flavobacteriaceae best hits were exported and aligned in Geneious R11, together with Kordia amplicon OTUs generated from the Malaspina dataset in [50], the 16S rRNA genes of other Kordia spp. deposited in GenBank and two outgroups from a different phylum (SM Table 1). MEGA v7 [40] was used for a Maximum-Likelihood phylogeny reconstruction using the GTR+G model and 500 bootstrap replicates. The tree was later processed in iTOL [45].

Phylogeny based on single copy genes

A multi-locus (40 conserved single copy genes) phylogenetic placement of the co-assembly was also carried out using the same taxa as in the 16S rRNA gene phylogeny, also including uncultured genomes like MAGs that belong to the family Flavobacteriaceae. Gene prediction was done with Prodigal v2.6.3 [27]. The software FetchMG v1.0 [39] was used with option -v (recommended for reference genomes) to extract the 40 conserved COGs, whose amino acid sequences were concatenated and later aligned with Muscle v3.8.31 [20]. Neighbor-Joining phylogenetic reconstruction was done with MEGA v7.0 and the resulting tree was processed in iTOL.

Sequence composition identities against other Flavobacteriaceae

Average Nucleotide Identities (ANI), Average Amino acid Identities (AAI) and tetranucleotide frequency comparisons were done between the co-assembly and the genomes of its closest relatives using fastANI v1.1 [28], compareM v0.0.23 [https://github.com/dparks1134/CompareM] and pyani v0.2.7 [https://github.com/widdowquinn/pyani], respectively.
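To make the idea of a tetranucleotide frequency comparison concrete, the sketch below counts overlapping 4-mers in two sequences and correlates the resulting frequency vectors. It is a simplified illustration only: it uses raw frequencies on a single strand and a plain Pearson correlation, whereas pyani's own calculation may differ (for example by working on z-scores). The function names are hypothetical.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides in a fixed order
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequency of each overlapping 4-mer (ambiguous k-mers skipped)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no unambiguous 4-mers")
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (non-constant input assumed)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Toy sequences; real input would be whole genomes or contigs
a = tetra_freqs("ACGTACGTTTGACCA")
b = tetra_freqs("ACGTACGTTAGACCA")
print(pearson(a, b))
```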

Functional annotation of Kordia sp. TARA_039_SRF

Gene prediction and a basic annotation of the co-assembled contigs were carried out in Prokka v1.12 [70]. PFAMs and TIGRFAMs were annotated in the predicted genes with hmmscan from HMMER v3.1b2 (hmmer.org). KEGG orthologs (KO) were predicted online with BLASTKOALA v2.1 [29] and transporters were predicted using the Transporter Classification (TC) database [21]. Annotation of carbohydrate-active enzymes was based on dbCAN’s CAZymes database [84] and peptidases were annotated using the Merops database [63]. Polysaccharide Utilization Loci (PULs) were manually located by looking for pairs of SusC/SusD genes encoded next to each other. Genomic Islands were predicted with the IslandViewer v4 online tool [9]. Insertion sequences (IS) were predicted and estimated with the ISsaga tool [82] and prophage regions were predicted with PHASTER [5]. Secretion systems were detected using MacSyFinder-based TXSScan [1,2]. KEGGmapper Search&Color pathway v3.1 was used to describe the functional potential of the co-assembly, for those gene calls that could be associated with a KO value.

Phylogeny based on the proteorhodopsin gene

The complete amino acid sequence of the co-assembly’s proteorhodopsin gene was aligned in Geneious R11 together with its BLAST best hits against NCBI's Protein database and two outgroup sequences (a proteorhodopsin from a SAR86 group bacterium and one from a “Candidatus Pelagibacter ubique” bacterium). MEGA v7 [40] was used for a Maximum-Likelihood phylogeny reconstruction using the LG+G model and 500 bootstrap replicates. The tree was later processed in iTOL [45].

Distribution of Kordia sp. TARA_039_SRF in TARA_039 station with competitive Fragment Recruitment Analysis

The vertical and size fraction distribution of Kordia sp. TARA_039_SRF was assessed in all available metagenomes for Tara Oceans station TARA_039, as it was the marine sample from which the Kordia SAGs were obtained. Metagenomes from TARA_039 were sampled at three different depths (surface/SRF 5 m, Deep Chlorophyll Maximum/DCM 25 m and mesopelagic/MES 270 m) and size-fractionated for viruses (<0.22 µm), giruses (0.1-0.22 µm), bacteria (0.22-1.6 µm) and protists (0.8-5 µm, 5-20 µm, 20-180 µm and 180-2000 µm), and were generated following different protocols (SM Table 1) [3]. All merged reads (PEAR v0.9.6) from each described metagenome were mapped against all genomes available from Kordia spp. (Kordia sp. TARA_039_SRF, Kordia algicida, Kordia jejudonensis, Kordia zhangzhouensis, Kordia periserrulae, Kordia sp. SMS9 and Kordia sp. NORP58) (SM Table 1) with blastn (BLAST 2.2.8+; options -max_target_seqs 1 -perc_identity 70 -evalue 0.0001). The output was filtered in R by: i) keeping hits with coverage between query and subject >90%, ii) removing duplicated reads and iii) removing reads mapping to the ribosomal operons [12]. Recruited reads were split based on their nucleotide identities against the reference genome. Those mapping at identities >95% were assumed to belong to the same species as the reference genome and those between 70-95% to co-occurring close relatives of the reference genome [12,65]. The recruited reads were normalized by the sequencing depth of each metagenome and by the complete genomic length.
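The filtering and binning logic described above can be sketched in Python (the study itself filtered the blastn output in R, with a script that is not shown). The tuple layout of `hits`, the function names and the exact normalization formula (reads per Mbp of genome per million metagenome reads) are assumptions for illustration; removal of reads mapping to ribosomal operons is omitted.

```python
def classify_hits(hits, min_cov=0.90):
    """Bin recruited reads by percent identity.

    hits: iterable of (read_id, percent_identity, alignment_length, read_length).
    Keeps hits whose alignment covers more than min_cov of the read,
    keeps the best identity per read (removing duplicates), then bins reads
    as 'species' (>95% identity) or 'relatives' (70-95% identity).
    """
    best = {}
    for read_id, pident, aln_len, read_len in hits:
        if aln_len / read_len <= min_cov:
            continue  # alignment covers too little of the read
        if read_id not in best or pident > best[read_id]:
            best[read_id] = pident
    counts = {"species": 0, "relatives": 0}
    for pident in best.values():
        if pident > 95:
            counts["species"] += 1
        elif pident >= 70:
            counts["relatives"] += 1
    return counts


def normalize(count, metagenome_reads, genome_length_bp):
    """Assumed normalization: reads per Mbp of genome per million metagenome reads."""
    return count / (genome_length_bp / 1e6) / (metagenome_reads / 1e6)


# Toy hits (hypothetical read IDs and values)
hits = [
    ("r1", 99.0, 100, 100),  # species-level hit
    ("r1", 96.0, 100, 100),  # duplicate read, lower identity
    ("r2", 80.0, 95, 100),   # close relative
    ("r3", 99.0, 50, 100),   # fails the coverage filter
]
print(classify_hits(hits))
```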

Comparative genomics analysis

The Anvi’o v4 pangenomic workflow was used to organize all the genes from the five Kordia genomes (Kordia sp. TARA_039_SRF, Kordia algicida, Kordia jejudonensis, Kordia zhangzhouensis and Kordia periserrulae) into gene clusters based on protein sequence similarity. Gene calling was done with Prodigal, amino acid sequence search with NCBI’s blastp v2.7.1, gene clustering with MCL [19] and sequence alignment with Muscle. Using the program anvi-summarize, we exported the list of gene clusters and the genomes in which they were encoded. These results were plotted using anvi’o and the R packages UpSetR v1.1.3 [16] and ggplot2.

The Anvi’o phylogenomics workflow was used to produce a phylogenomic tree (FastTree v2.1.11 [60]) based on the amino acid sequences of the single copy genes shared by all five Kordia genomes.
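Once gene clusters and the genomes encoding them have been exported, splitting them into core, accessory and genome-specific sets is a simple set operation. The sketch below assumes a mapping from cluster ID to the set of genomes containing it; this layout, the function name and the toy labels are hypothetical and do not reproduce the anvi'o summary format.

```python
def partition_clusters(cluster_to_genomes, all_genomes):
    """Split gene clusters into core (present in all genomes), unique
    (present in exactly one genome) and accessory (everything in between)."""
    n_genomes = len(set(all_genomes))
    partition = {"core": [], "accessory": [], "unique": []}
    for cluster_id, genomes in sorted(cluster_to_genomes.items()):
        n_present = len(set(genomes))
        if n_present == n_genomes:
            partition["core"].append(cluster_id)
        elif n_present == 1:
            partition["unique"].append(cluster_id)
        else:
            partition["accessory"].append(cluster_id)
    return partition


# Toy example with three hypothetical genomes and four gene clusters
genomes = ["genome_A", "genome_B", "genome_C"]
clusters = {
    "GC_1": {"genome_A", "genome_B", "genome_C"},
    "GC_2": {"genome_A"},
    "GC_3": {"genome_A", "genome_C"},
    "GC_4": {"genome_B"},
}
print(partition_clusters(clusters, genomes))
```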

Accession numbers

The genome of the novel Kordia sp. TARA_039_SRF has been deposited in NCBI’s Bioproject PRJNA524487. The co-assembly’s accession number is SMNH02000000 and Biosample SAMN11028936. Raw reads of the 10 individual SAGs are: AB-193_M23 (SRR8655105), AB-193_M19 (SRR8655106), AB-193_N20 (SRR8655104), AB-193_O13 (SRR8655103), AB-194_L04 (SRR8655108), AB-193_M03 (SRR8655107), AB-193_O22 (SRR8655102), AB-193_I20 (SRR8655109), AB-193_I04 (SRR8655110) and AB-193_P22 (SRR8655101). A complete list of accession numbers for all the genomes and metagenomes used in this study can be found in SM Table 1.

Results

Co-assembly of the Kordia sp. TARA_039_SRF genome

Ten SAGs belonging to the genus Kordia, named provisionally Kordia sp. TARA_039_SRF, were selected for genome co-assembly and description of gene content. These SAGs were tested for microdiversity through MLSA analyses, resulting in 100% identity at the nucleotide level in each marker [69]. The five markers were the 16S rRNA gene, partial 23S rRNA gene and the Internal Transcribed Spacer, which could be amplified in the 10 SAGs; the partial RNA polymerase subunit B, amplified in 5 SAGs; and partial proteorhodopsin, amplified in 7 SAGs.

Individual SAG assemblies varied in genome completeness and metrics (Table 1), recovering at most 45% of the estimated complete genome. GC content was consistent across the different SAGs (35.3-36%).

Considering the high genetic similarity among the individual SAGs based on the previous MLSA analyses, their co-assembly into one single genome was the strategy chosen to obtain a more complete genome. All the reads from the 10 SAGs were pooled together, normalized by coverage and merged. They were afterwards assembled using a wide range of k-mer length values to find the best metrics and quality values (SM Table 2). The combination of k-mer length values chosen was 55, 61 and 117 (Figure 1B). After refinement of the optimized co-assembly, the resulting genome was 4,594,716 bp long, fragmented into 27 contigs with an N50 of 429,544 bp. The genome was 94.97% complete using a database of Flavobacteriaceae conserved single copy genes and the contamination based on gene redundancy of these conserved genes was 4.65% (Table 1).

There were 985 ambiguities spread across the final co-assembly (217 Y, 247 W, 92 S, 158 R, 32 N, 140 M and 99 K), representing 0.02% of the total nucleotides.
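As a quick consistency check of the figures above, the reported per-code counts can be summed and expressed as a percentage of the genome length. The sketch below also shows how such counts could be tallied from a sequence; the function names are hypothetical and this is not the procedure used in the study.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes (everything except A, C, G, T)
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def count_ambiguities(seq):
    """Tally IUPAC ambiguity codes in a nucleotide sequence."""
    return Counter(base for base in seq.upper() if base in AMBIGUITY_CODES)


def ambiguity_percent(counts, genome_length):
    """Ambiguous positions as a percentage of the genome length."""
    return 100 * sum(counts.values()) / genome_length


# Counts and genome length as reported in the text
reported = {"Y": 217, "W": 247, "S": 92, "R": 158, "N": 32, "M": 140, "K": 99}
print(sum(reported.values()))
print(round(ambiguity_percent(reported, 4_594_716), 2))
```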

The proportion of reads that mapped back to the refined co-assembly varied between 97.6 and 98.1% depending on the SAG. The contribution of each SAG to the co-assembly ranged between 21 and 50% (Table 1). The average nucleotide identities of the individual SAGs against the co-assembly were higher than 99.9% in all cases (Table 1). Tetranucleotide frequency was almost identical between each SAG and the co-assembly, with similarities ranging between 99.6 and 99.8%.

Comparing the completeness and gene redundancy of all the possible SAG combinations to co-assemble (1023 for 10 SAGs), the highest completeness value (97.48%) was reached with the co-assembly of all 10 SAGs, with 8.48% contamination (SM Figure 1; SM Table 3). Nevertheless, completeness values saturated at ~94-95% starting at combinations of seven and eight SAGs. Their contamination values were also higher than the standards for high-quality draft genomes (>5%), with the exception of a combination of eight SAGs that produced a co-assembly 94.07% complete with 4.92% contamination. Considering that genome amplification in SAGs is random [76], it is impossible to know which part of the genome was amplified prior to sequencing. Although we sequenced 10 SAGs, the co-assembly of just eight of them would have been enough to reach completeness values similar to those of the refined co-assembly.
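The figure of 1023 combinations corresponds to all non-empty subsets of 10 SAGs, that is 2^10 - 1. A short Python sketch that enumerates them is shown below; the SAG labels are placeholders and the helper name is hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty combination of the given items, smallest first."""
    items = list(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


# Placeholder labels for the ten sequenced SAGs
sags = [f"SAG_{i}" for i in range(1, 11)]
subsets = list(nonempty_subsets(sags))
print(len(subsets))                           # 2**10 - 1
print(sum(1 for s in subsets if len(s) == 8)) # combinations of eight SAGs
```

Each subset would then be co-assembled and scored for completeness and contamination, which is the expensive step not covered here.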

Novelty of Kordia sp. TARA_039_SRF

The novelty of Kordia sp. TARA_039_SRF was confirmed with phylogenies of the 16S rRNA gene and 40 conserved single copy genes (Figure 2), as well as through whole genome sequence composition comparisons of ANI, AAI and tetranucleotide signature (SM Table 4).

Both phylogenies placed Kordia sp. TARA_039_SRF within the family Flavobacteriaceae, more precisely within the genus Kordia, in 100% of the bootstrap replicates. Its closest described relative is Kordia periserrulae, with 97.56% identity in the 16S rRNA gene (97% coverage) and 89.13% ANI (75% coverage). Otu5835, amplified from Malaspina expedition seawater samples, also falls within the Kordia sp. TARA_039_SRF and K. periserrulae clade.

Functional description of Kordia sp. TARA_039_SRF

Having an almost complete genome (94.97%) allowed us to investigate the potential functional capabilities and energy metabolism of this novel species (Figure 3; SM Table 5). Nevertheless, 2357 of 4125 predicted protein coding genes (57.1%) were annotated as hypothetical proteins of unknown function.

Central metabolism

The co-assembled Kordia sp. TARA_039_SRF can bypass the CO2 loss of the regular TCA cycle through the glyoxylate shunt, like other Kordia spp., and it is also able to replenish certain TCA cycle intermediates through anaplerotic routes: it encodes both PEP carboxykinase and PEP carboxylase, which convert PEP to oxaloacetate using ATP and CO2, and H2O and HCO3-, respectively. This mechanism is also found in other Kordia spp. Bicarbonate uptake can be achieved through specific bicarbonate membrane transporters (BicA), of which four copies were found. One of these copies is next to a carbonic anhydrase gene, whose product interconverts CO2 and HCO3-. Kordia sp. TARA_039_SRF encodes a third anaplerotic strategy driven by the malic enzyme, which converts pyruvate into L-malate with NADPH and CO2. One of the copies is encoded in the core genome of Kordia whereas another copy is located in a genomic island.

Unlike other Kordia spp., Kordia sp. TARA_039_SRF encodes several enzymes that allow a complete biosynthesis of CoA from pyruvate. This cofactor plays a key role in the biosynthesis and breakdown of fatty acids, as well as in the biosynthesis of polyketides and non-ribosomal peptides [8].

Kordia sp. TARA_039_SRF shows several mechanisms for oxidative phosphorylation, some of which are also common in other Kordia spp.: i) succinate dehydrogenase / fumarate reductase, ii) cytochrome c oxidase, cbb3-type, iii) NADH dehydrogenase NQR, iv) NADH dehydrogenase NDH-I, v) cytochrome bd subunits CydA and CydB; and some of which are unique: i) cytochrome c oxidase aa3-type genes CoxABC, QoxA, COX10 and COX11, and ii) cytochrome c oxidase bo-type genes CyoDW.

Carbohydrate-active enzymes, PULs, surface adhesion and motility

Kordia sp. TARA_039_SRF encodes a CAZymes array most similar to that of K. periserrulae and larger than those of the rest of the Kordia spp. included in the study (SM Table 6). It shows the second highest density of carbohydrate-active enzymes (41 enzymes per genomic Mbp) after that of K. zhangzhouensis. It encodes 34 non-catalytic carbohydrate-binding modules (CBM), mostly specific for complex carbohydrates like cellulose, chitin, mannan, beta-glucans, starch, and glycogen. The novel species also encodes 1.1 glycosyltransferases per genomic Mb (a total of 52). These outer membrane proteins are involved in the synthesis of surface adhesion polysaccharides. A total of 51 predicted adhesion-related genes, such as fasciclin, fibronectin, or lectins (SM Table 7), were found; these are also present in other Flavobacteria [25]. The novel Kordia sp. TARA_039_SRF does not encode a complete gliding motility complex.

The co-assembly encodes four Polysaccharide Utilization Loci (PULs) (SM Table 5), three of which are surrounded by six families of glycosyl hydrolases, one family of glycosyl transferases, one family of carbohydrate esterases, two families of carbohydrate-binding modules and peptidases like serine aminopeptidase and prolyl oligopeptidase.

Sulfur and nitrogen metabolism

Another unique feature of Kordia sp. TARA_039_SRF within its genus is that it codes for all the genes involved in the assimilatory sulfate reduction pathway that converts sulfate to sulfide, and it is also able to transform thiosulfate to sulfite.

The genome of Kordia sp. TARA_039_SRF is the only Kordia genome sequenced so far that encodes both NasA, a nitrate/nitrite transporter, and NasF, the substrate binding protein in the nitrate/nitrite transport system. The bacterium is predicted to assimilate nitrate for further incorporation into amino acids, as it encodes the nitrate reductase/nitrite oxidoreductase. Moreover, like other Kordia spp., it has the potential to use environmental ammonium as a nitrogen source, importing it through an ammonium channel transporter.

Light sensing and environmental information processing

The co-assembled Kordia encodes a copy of a green-light absorbing proteorhodopsin, like Kordia periserrulae and Kordia sp. SMS9. Their phylogenetic reconstruction clusters the co-assembly’s and K. periserrulae’s proteins in the same clade with a 100% bootstrap value (SM Figure 3), and their pairwise alignment shows that they are identical in 95.5% of the sites (232 of 243 positions).

It also codes for other putative light sensors usually found in proteorhodopsin-containing marine bacteria: one (6-4) photolyase, seven copies of phytochrome-like proteins, one cryptochrome DASH domain and one cryptochrome-like protein.

The co-assembly codes for DNA repair systems like the entire nucleotide excision repair (NER) complex UvrABC, which recognizes and removes light-induced DNA lesions.

The two-component systems encoded in the co-assembly have been described to detect the following environmental signals: phosphate, ferric ion, oxygen, nitrate/nitrite and chemotaxis-related attractant/repellent substances. It also codes for the sigma 54 factor and putative quorum-quenching lactonases, which have been related in some cases to virulence and biofilm formation.

Transporters

The co-assembly encodes 95 different transporter families and ~88 transporters per Mbp. We found 4 different types of pore-forming toxins, phage-related transporters like the FadL outer membrane protein and a holin belonging to the Bacillus subtilis phage φ29 transporter family, two different light-absorption driven transporters and uptake systems specific for amino acids, potassium, fatty acids, nitrate, dissolved inorganic carbon, iron and magnesium (Figure 3; SM Table 8).

Kordia sp. TARA_039_SRF codes for two complete secretory systems: T1SS and the specific T9SS (SM Table 9). There are 85 peptides encoded in Kordia sp. TARA_039_SRF that contain the Por secretion system domain (SM Table 10), meaning that they are secreted through the T9SS to the extracellular environment or the cell surface. Their most abundant functions relate to surface adhesion, protein degradation, and carbohydrate hydrolysis. Out of the 184 peptidases found in the co-assembly, 8 are putatively secreted (SM Table 11). Other secreted peptides show domains that have a role in carbohydrate binding, environmental sensing through two-component systems, and quorum sensing. Additionally, we found proteins with domains associated with prophage endonucleases, prokaryotic endonucleases, bacterial toxins and proteinase inhibitors.

420 Pigment and vitamin synthesis

421 Kordia sp. TARA_039_SRF is the only described Kordia species that can potentially

422 synthesize beta-carotene without relying on any exogenous intermediates as it codes for

423 the complete terpenoid backbone synthesis. Another exclusive feature is the putative

424 ability to synthesize biotin from malonyl-ACP and pimeloyl-CoA. Like other Kordia spp.,

425 Kordia sp. TARA_039_SRF can synthesize riboflavin (vitamin B2) and it also has genes

426 related to the synthesis of folate (B9), pantothenate (B5) and pyridoxine (B6). It can

427 putatively import vitamins from the environment with specific binding proteins and an

428 outer membrane channel coupled to a TonB transporter.

429 Genomic islands, mobile elements and prophages

430 Kordia sp. TARA_039_SRF’s genome contains 12 predicted genomic islands with a total

431 length of 252,407 bp (5.5% of the total genome and 28% of the unique accessory

432 genome). They code for virulence genes such as non-ribosomal peptides (NRPs),

433 porins, toxins and invasion proteins and other genes such as those involved in oxidative

434 phosphorylation, nitrate/nitrite transport, and sulfur carrier proteins (SM Table 12). There

435 are 16 different kinds of transporters encoded in ten genomic islands, some of them

436 represented by several copies, related to oxidative phosphorylation pathways, secretory

437 pathways, Ompf-OmpA environmental oxygen sensing and bacteriocin releasing

438 systems. There are also 15 complete insertion sequences (IS) belonging to 8 different

439 transposase families (SM Table 13) and three predicted incomplete prophage regions,

440 with a total size of 26,876 bp (~0.6% of the total genome length) (SM Table 14).

441 Abundance of Kordia sp. TARA_039_SRF in the North Indian Ocean

442 Kordia sp. TARA_039_SRF’s read recruitments at the species level (alignment identities

443 >95%) did not exceed 0.01% in station TARA_039 (Figure 4A, SM Table 15).

444 Nevertheless, surface and DCM metagenomic reads of the 20-180 µm fraction mapped

445 against 40 and 53% of the co-assembly’s predicted genes, respectively (8.5 and 13% of

446 horizontal genomic coverage). Relative abundances in these samples are low (0.0018

447 and 0.0033%) but reads map homogeneously across the genome (Figure 4B) and

448 against both the core genome and the accessory genome (SM Table 16). Read

449 mappings also occur homogeneously in the 0.8-5 µm and 0.1-0.22 µm fractions of the

450 mesopelagic layer, with abundances of 0.0016 and 0.0026%, respectively, but the number

451 of matched genes and the horizontal coverage are smaller in both cases (24-28% of

452 genes and ~5% of genomic coverage). The highest abundance of recruited reads was

453 found in the mesopelagic 0.22-1.6 µm size fraction (0.009%), but these reads map to 8%

454 of the co-assembly’s ORFs, especially to housekeeping genes such as transcriptional

455 regulators, RNA polymerase subunits and a serine aminopeptidase.
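The two quantities reported above, relative abundance of recruited reads and horizontal genomic coverage, can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the tuple layout of an alignment and the toy numbers are all hypothetical, and the only elements taken from the text are the species-level cutoff (alignment identity >95%) and the definitions of the two metrics.

```python
from typing import Iterable, List, Tuple

# Hypothetical alignment record: (start, end, percent_identity),
# with 0-based, half-open coordinates on the reference genome.
Alignment = Tuple[int, int, float]


def species_level(alignments: Iterable[Alignment],
                  min_identity: float = 95.0) -> List[Alignment]:
    """Keep only alignments above the species-level identity cutoff (>95%)."""
    return [a for a in alignments if a[2] > min_identity]


def relative_abundance(recruited_reads: int, total_reads: int) -> float:
    """Recruited reads as a percentage of all reads in the metagenome."""
    return 100.0 * recruited_reads / total_reads


def horizontal_coverage(alignments: Iterable[Alignment],
                        genome_length: int) -> float:
    """Percentage of genome positions covered by at least one read."""
    covered = 0
    cur_start = cur_end = None
    for start, end, _identity in sorted(alignments):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent read: extend the current interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / genome_length


# Toy example: four reads on a 1,000 bp genome, one below the cutoff.
reads = [(0, 100, 99.0), (50, 150, 97.5), (900, 1000, 96.0), (400, 500, 85.0)]
kept = species_level(reads)
print(len(kept), horizontal_coverage(kept, 1000))
```

A low relative abundance combined with reads that merge into many separate intervals along the genome is the pattern described here as homogeneous mapping, whereas the same abundance concentrated in a few intervals would give a much lower horizontal coverage.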

456 None of the other Kordia spp. used in the competitive Fragment Recruitment Analysis

457 recruited enough reads at the species level to consider them present at the moment of

458 sampling. Nevertheless, all Kordia genomes recruited reads at identities of 80-90%,

459 increasingly so with depth, which might suggest that a co-occurring close relative of the

460 co-assembly is also part of the community (SM Figures 4, 5 and 6).

461 Comparative genomics of the genus Kordia

462 The five Kordia genomes included in this comparative study belong to five different

463 species: four Kordia genomes available in public databases retrieved from isolated

464 cultures and our novel Kordia co-assembled genome (Table 2). The ANIm values

465 between genome pairs were ~79-89% (with alignment coverages between 44-75% of

466 the genes). Similarities at the 16S rRNA gene level ranged from 96.34 to 98.00% (with

467 alignment coverage values between 95-100%). In both cases, the closest relative to the

468 co-assembled Kordia sp. was Kordia periserrulae. Genomic length ranged between 4.59

469 and 5.35 Mbp for the seawater genomes, while K. zhangzhouensis, retrieved from the

470 interface between seawater and freshwater, was the shortest (4.03 Mbp). Kordia sp.

471 TARA_039_SRF had the second highest GC content (35.8%), following K. periserrulae

472 (36.2%). This value ranged between 33.8 and 34.2% for the other Kordia spp. The mean

473 coding density of the studied genomes was 88.8%.

474 The core genome of the analyzed Kordia spp. represented a mean of 57% of the total

475 number of gene clusters in each Kordia genome (Figure 5, SM Table 17) and

476 ~53% of them could be classified into a Clusters of Orthologous Groups (COG) category.

477 A phylogenomic analysis of these gene clusters defined two clades, one grouping K.

478 zhangzhouensis, K. algicida and K. jejudonensis and another with the co-assembled

479 Kordia sp. and K. periserrulae (Figure 5).

480 Of the remaining gene clusters in these Kordia spp., a mean of 24% are shared between

481 two, three or four genomes. These are defined as the shared accessory genome. The

482 highest number of gene clusters in this category is shared between K. periserrulae and

483 the co-assembled Kordia sp. TARA_039_SRF (204).

484 Between 18 and 28% of gene clusters are unique to each species (Table 2, Figure 5),

485 constituting the flexible genome. Of those classified into COG categories, between 10-

486 15% are classified into the cellular processes and signaling category, 6-11% are related

487 to information storage and processing, 7-12% belong to metabolic functions and 0.2-

488 2.3% are related to mobilome elements. The co-assembly has the lowest number of

489 peptidases in its flexible gene dataset (one), 7 transporters (mainly channels and pores

490 and electrochemical potential-driven transporters) and 5 CAZymes (3 different CBM and

491 2 glycosyl hydrolases).
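The three categories used in this comparison (core, shared accessory and unique or flexible gene clusters) follow directly from how many genomes contribute to each cluster. The Python sketch below illustrates that bookkeeping under stated assumptions: the input is a hypothetical mapping from gene cluster identifiers to the set of genomes in which they occur, and the function and variable names are illustrative only; it does not reproduce the clustering software used in this study.

```python
from collections import Counter
from typing import Dict, Set


def classify_clusters(clusters: Dict[str, Set[str]], n_genomes: int) -> Dict[str, str]:
    """Label each gene cluster by the number of genomes it occurs in.

    core             -> present in all n_genomes genomes
    unique           -> present in exactly one genome
    shared_accessory -> present in two to n_genomes - 1 genomes
    """
    labels = {}
    for cluster_id, genomes in clusters.items():
        k = len(genomes)
        if k == 0:
            raise ValueError(f"cluster {cluster_id} has no member genomes")
        if k == n_genomes:
            labels[cluster_id] = "core"
        elif k == 1:
            labels[cluster_id] = "unique"
        else:
            labels[cluster_id] = "shared_accessory"
    return labels


def genome_profile(clusters: Dict[str, Set[str]], genome: str, n_genomes: int) -> Dict[str, float]:
    """Percentage of one genome's gene clusters that falls in each category."""
    labels = classify_clusters(clusters, n_genomes)
    own = [cid for cid, genomes in clusters.items() if genome in genomes]
    if not own:
        raise ValueError(f"genome {genome} has no gene clusters")
    counts = Counter(labels[cid] for cid in own)
    return {category: 100.0 * n / len(own) for category, n in counts.items()}


# Toy example with five genomes labelled A to E (placeholders, not real data).
toy = {
    "gc1": {"A", "B", "C", "D", "E"},  # core
    "gc2": {"A", "B"},                 # shared accessory
    "gc3": {"E"},                      # unique
    "gc4": {"B", "C", "D"},            # shared accessory
}
print(classify_clusters(toy, n_genomes=5))
print(genome_profile(toy, "A", n_genomes=5))
```

Computing the profile per genome, rather than over the whole pangenome, corresponds to the way the percentages are reported here, as means or ranges across the five Kordia genomes.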

492 The co-assembly Kordia sp. TARA_039_SRF contains a high number of hypothetical proteins

493 of unknown function. They account for 41% of its core genome, 6% of its shared

494 accessory genome and up to 89% of its strain-specific accessory genome.

495

496

497 Discussion

498 The aim of this study was to characterize the novel uncultured species Kordia sp.

499 TARA_039_SRF and shed light on its ecological and functional potential in ocean waters

500 through an optimized co-assembly of ten co-occurring and nearly identical marine Kordia

501 SAGs.

502 We have co-assembled 10 Kordia SAGs previously classified as nearly identical

503 genomes [69] into a high-quality draft genome, which is 94.83% complete and meets the

504 set standards regarding contamination (<5%) [11,37]. Recent studies have shown that

505 co-assembling very closely related SAGs helps overcome the main drawback of their

506 genome amplification step: that it rarely exceeds 60% of estimated genome length

507 [36,48]. We have taken the assembly procedure one step further in order to optimize

508 contig length, reduce genome fragmentation, and significantly decrease the requirements for

509 computational power and processing time. Read redundancy was expected to be high

510 considering the near genomic clonality of the ten SAGs. After normalization by coverage

511 the workable dataset was reduced to 1.75% of the original, which substantially

512 decreased assembly computational needs and time. The use of merged reads, together

513 with those unmerged, contributed to the reliability of predicted gene synteny in contigs

514 [73]. The combination of different k-mer lengths during co-assembly generated

515 remarkably longer contigs than default assembly parameters [13]. Selecting the best

516 combination of k-mer sizes was approached by comparing different assemblies

517 generated with different individual k-mer length values and finding the best performers.

518 These contigs were assembled again as gene redundancy was observed in the ends of

519 some of them (probably due to variability hotspots in some of the SAGs that forced a

520 split in the assembly process). The resulting reference genome had a small number of

521 very long contigs (27 contigs, N50 0.43 Mb) of high quality (best Assembly Likelihood

522 Evaluation score) to which the ten individual Kordia SAGs shared a pairwise ANIm of

523 99.9%. We think this co-assembly can serve as a reference genome for comparative

524 genomics and ecology studies just like a draft genome assembled from bulk DNA of a

525 monoclonal culture.
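The comparison of assemblies generated with different individual k-mer lengths can be sketched with the N50 statistic mentioned above. The Python example below is only an illustration: the contig lengths are invented, the function names are hypothetical, and ranking by N50 (with fewer contigs as a tie-breaker) is a simplification, since the assemblies in this study were also evaluated with the Assembly Likelihood Evaluation score.

```python
from typing import Dict, List


def n50(contig_lengths: List[int]) -> int:
    """Length of the contig at which the cumulative sum of contigs,
    sorted from longest to shortest, reaches half of the assembly size."""
    if not contig_lengths:
        raise ValueError("assembly has no contigs")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return lengths[-1]  # not reached: the full sum always exceeds half


def best_kmer(assemblies: Dict[int, List[int]]) -> int:
    """Pick the k-mer size whose assembly has the highest N50;
    ties are broken in favour of the assembly with fewer contigs."""
    return max(assemblies, key=lambda k: (n50(assemblies[k]), -len(assemblies[k])))


# Invented contig lengths (bp) for three single-k assemblies.
toy_assemblies = {
    21: [100, 80, 60, 40, 20],
    55: [200, 60, 40],
    77: [150, 150],
}
for k, lengths in sorted(toy_assemblies.items()):
    print(k, n50(lengths), len(lengths))
print("best k:", best_kmer(toy_assemblies))
```

The best-performing individual values found this way would then be the candidates for the combined k-mer list of the final co-assembly.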

526 The co-assembly’s taxonomic novelty was confirmed through phylogenies based on the

527 16S rRNA gene and highly conserved single copy genes [39] and also using whole

528 genome composition comparisons like ANI and AAI. The closest relative is Kordia

529 periserrulae, isolated from the gut of the marine polychaete Periserrula leucophryna.

530 Its genome-based functional analysis suggests that this taxon is a novel photo-

531 heterotrophic, non-motile, bacterium that contains proteorhodopsin and oxidative

532 phosphorylation machinery (together with membrane oxygen sensors) for aerobic and

533 microaerobic environments. It has the putative ability to sense other limiting nutrients

534 and elements in the environment like iron, nitrite, and nitrate. Kordia sp. TARA_039_SRF

535 presents exclusive CAZymes in its genus, especially several types of carbohydrate

536 binding modules with an affinity for phytoplankton exopolymers like cellulose or

537 glucomannan, that would promote their hydrolysis before uptake from the extracellular

538 environment [74]. It also encodes Polysaccharide Utilization Loci, surrounded by

539 CAZymes as previously described in Bacteroidetes (carbohydrate-binding modules,

540 glycosyl hydrolases, glycosyl transferases, carbohydrate esterases) [44]. Moreover, the

541 genome encodes almost twice the number of copies of the outer membrane TonB

542 dependent receptor/transport systems than its relatives, hinting at the apparent

543 relevance of importing extracellular compounds for the survival of this species. The

544 presence of virulence factors such as NRPs, antibiotics, proteases, porins, toxins and

545 prophage related enzymes in Kordia sp. TARA_039_SRF’s genome (some of them

546 exported through the Bacteroidetes-specific secretion system T9SS), suggests that this

547 novel taxon is an active player in niche colonization. It also presents features common

548 to surface seawater bacteria: seven copies of DNA light-induced damage repair system

549 genes [35], the green-light absorbing proteorhodopsin and other light-sensing proteins.

550 Such functional capabilities match the environmental characteristics of the SAGs’

551 original habitat, as well as its abundance in specific size fractions and depths of the water

552 column. The SAGs co-assembled as Kordia sp. TARA_039_SRF were retrieved from a

553 surface seawater sample from the Arabian Sea, in the North Indian Ocean (TARA_039),

554 pre-filtered through a 200 µm mesh. This oceanic region is prone to seasonal surface

555 phytoplankton blooms [42] and has one of the largest and most intense Oxygen Minimum

556 Zones (OMZ) [58]. A diatom bloom was registered a month prior to the sampling period

557 and Roullier et al. [68] thoroughly described the particle distribution in the water column

558 across the area, as well as seawater nitrate concentrations and size of the OMZ. By the

559 time of sampling in TARA_039, surface waters showed a general decreasing

560 concentration of large particulate matter of 15 particles m-2 (LPM; >100 µm) [68]. This is

561 the size range where we find highest read recruitments of the co-assembly in the

562 available sea surface and DCM metagenomes. Even though the abundance of recruited

563 reads in this size fraction is low, the reads spread homogeneously throughout the

564 genome. The co-assembly codes for a large number of enzymes that would allow growth

565 on particles that result from diatom exudates, even on those where microaerobic

566 conditions may arise. Moreover, expression of proteorhodopsin in surface waters would

567 provide extra energy and therefore an ecological advantage for rapid particle

568 colonization. In fact, the very low microdiversity within the SAG population may be due

569 to the sorting of a disaggregated particle, as proposed in [69]. The carbohydrate-binding

570 modules, adhesins and internalization transporters encoded in the co-assembly would

571 provide it with the ability to feed on the phytoplankton particles as they sink, regardless

572 of decreasing environmental light. An intermediate nepheloid layer (INL) formed only by

573 particles < 200 µm was observed in TARA_039, at 250-300 m. This depth was also the

574 oxygen minimum zone and had the highest concentration of nitrate in the sampled water

575 column. Of the available metagenomic samples at this mesopelagic depth, the co-

576 assembly was more abundant in the free-living size fraction (0.22-1.6 µm). This could be

577 related to the novel species’ putative ability to cope with low oxygen conditions and the

578 potential for nitrate assimilation, and therefore to a switch in lifestyle from

579 particle-associated (> 1.6 µm size fractions) at the surface to free-living at the OMZ. Metabolic

580 versatility had previously been suggested to be related to a high number of transporters

581 [64], a feature apparently characteristic of the genus Kordia, in which the co-assembly

582 shows the highest ratio of encoded transporters per genomic Mb. A lifestyle alternation

583 between attachment to particles and free-living has already been proposed for the

584 marine flavobacterium Polaribacter sp. MED152 [25].

585 The genomic characteristics of the novel Kordia sp. TARA_039_SRF contrast with most

586 of the common genomic signatures described in free-living bacterioplankton SAGs by

587 Swan et al. (2013), in accordance with the predicted attachment to particles of Kordia

588 sp. TARA_039_SRF as main lifestyle. Free-living bacteria SAGs were characterized by:

589 i) shorter genomes (also described as characteristic of proteorhodopsin coding

590 Bacteroidetes [22]), ii) genome streamlining with fewer non-coding regions and iii) lower

591 GC content [78]. In contrast, the co-assembled genome is relatively large (~4.5 Mb),

592 shows the second highest GC value of the tested Kordia spp. and has a high portion of

593 non-coding DNA (12%).

594 Comparative genomics of Kordia sp. TARA_039_SRF, Kordia periserrulae, Kordia algicida, Kordia

595 jejudonensis and Kordia zhangzhouensis reveals an inter-species core genome that

596 constitutes ~57% of the genomes included in the analysis. This value is consistent with

597 the findings in [71], where core genome size is analyzed at the intra-species level but

598 also between different species of the same genus. The result is also consistent with the

599 genus-level pangenomes of marine Alteromonas [47] and Prochlorococcus [32].

600 The low abundance of read recruitment by the co-assembly in the samples from which

601 the SAGs were retrieved and their low genomic coverage suggest that this novel taxon

602 would have remained unseen using other culture-independent techniques like

603 metagenomic genome assembly, highlighting the value of single cell genomics for

604 unveiling novel microbial taxa.

605 Description of “Candidatus Kordia photophila” sp. nov.

606 “Candidatus Kordia photophila” (pho.to'phi.la. Gr. neut. n. phos, photos light; N.L. masc.

607 adj. philus (from Gr. masc. adj. philos) loving; N.L. fem. adj. photophila light-loving).

608 This bacterium is proposed to grow photoheterotrophically attached to

609 phytoplankton-derived particles. Its genome-based functional annotation suggests that the taxon is

610 non-motile and encodes green-light absorbing proteorhodopsin and oxidative phosphorylation

611 machinery (together with membrane oxygen sensors) for aerobic and microaerobic

612 environments. It is the first Kordia species that encodes genes from cytochrome c

613 oxidase types aa3 and bo. The former shows higher affinity for O2 than cbb3-type found

614 in other Kordia spp. and the latter is expressed under iron limitation conditions. It is the

615 first representative to encode both the nitrate/nitrite transporter NasA and the substrate

616 binding protein NasF. Other unique features within the Kordia genus are the putative

617 ability to perform assimilatory sulfate reduction, encoding cellulose and glucomannan

618 specific CBM 16, 22 and 04, N-acetylmuramidase GH108 and the unsaturated

619 rhamnogalacturonyl hydrolase GH types 105 and 88. Regarding membrane

620 transporters absent in other Kordia spp., it encodes: i) 4 types of channels and pores

621 (Phospholemman family, Type B Influenza Virus NB Channel family, General Bacterial

622 Porin family and the channel-forming Colicin family), ii) 2 types of electrochemical

623 potential-driven transporters (Betaine/Carnitine/Choline transporter family and the K+

624 uptake permease KUP) and iii) the slow voltage-gated K+ channel accessory protein

625 MinK.

626 The type material of the proposed “Candidatus Kordia photophila” is its genome

627 sequence, co-assembled from 10 SAGs, that can be found under accession number

628 SMNH00000000. The version described in this paper is version SMNH02000000. Its

629 NCBI taxonomy ID is NCBI:txid2530203.

630 631

632

633 Acknowledgements

634 High-Performance computing analyses were run at the Marine Bioinformatics Core

635 Facility (MARBITS) of the Institut de Ciències del Mar (ICM-CSIC) in Barcelona. We

636 thank Shook Studio for assistance with figure design and execution. We thank Aharon

637 Oren for help with the etymology of the Candidatus name and Dr. Ramon Rosselló-Móra and

638 two anonymous reviewers for helping us improve our manuscript. We thank

639 our fellow scientists, the crew and chief scientists of the different cruise legs involved in

640 Tara Oceans for collecting the samples used in this study. This is Tara Oceans

641 contribution number XXX.

642 Funding

643 Funding was provided by the Spanish Ministry of Economy and Competitiveness grant

644 MAGGY (CTM2017-87736-R) to SGA and CTM2016-80095-C2 to CPA and JMG. PS is

645 funded by the MAGGY project and MR-L held a Ph.D. Fellowship FPI (BES-2014-068285)

646 funded by the Spanish Ministry of Economy and Competitiveness.

647 Authors’ contributions

648 SGA designed this research. MR-L performed the laboratory work and analyses and wrote

649 the paper. PS and JMG helped with the data analyses. SGA funded this research with

650 contributions from CPA and JMG. All authors contributed to the critical reading of

651 the paper.

652 Conflicts of interest

653 The authors declare no conflicts of interest.

654 Bibliography

655 [1] Abby, S.S., Néron, B., Ménager, H., Touchon, M., Rocha, E.P.C. (2014) 656 MacSyFinder: A program to mine genomes for molecular systems with an 657 application to CRISPR-Cas systems. PLoS One 9(10), Doi: 658 10.1371/journal.pone.0110726.

659 [2] Abby, S.S., Rocha, E.P.C. (2017) Identification of protein secretion systems in 660 bacterial genomes using MacSyFinder. Methods Mol. Biol. 1615(February), 1– 661 21, Doi: 10.1007/978-1-4939-7033-9_1.

662 [3] Alberti, A., Poulain, J., Engelen, S., Labadie, K., Romac, S., Ferrera, I., Albini, 663 G., Aury, J.-M., Belser, C., Bertrand, A., Cruaud, C., Da Silva, C., Dossat, C., 664 Gavory, F., Gas, S., Guy, J., Haquelle, M., Jacoby, E., Jaillon, O., Lemainque, 665 A., Pelletier, E., Samson, G., Wessner, M., Bazire, P., Beluche, O., Bertrand, L., 666 Besnard-Gonnet, M., Bordelais, I., Boutard, M., Dubois, M., Dumont, C., 667 Ettedgui, E., Fernandez, P., Garcia, E., Aiach, N.G., Guerin, T., Hamon, C., 668 Brun, E., Lebled, S., Lenoble, P., Louesse, C., Mahieu, E., Mairey, B., Martins, 669 N., Megret, C., Milani, C., Muanga, J., Orvain, C., Payen, E., Perroud, P., Petit, 670 E., Robert, D., Ronsin, M., Vacherie, B., Acinas, S.G., Royo-Llonch, M., 671 Cornejo-Castillo, F.M., Logares, R., Fernández-Gómez, B., Bowler, C., 672 Cochrane, G., Amid, C., Hoopen, P. Ten., De Vargas, C., Grimsley, N., 673 Desgranges, E., Kandels-Lewis, S., Ogata, H., Poulton, N., Sieracki, M.E., 674 Stepanauskas, R., Sullivan, M.B., Brum, J.R., Duhaime, M.B., Poulos, B.T., 675 Hurwitz, B.L., Acinas, S.G., Bork, P., Boss, E., Bowler, C., De Vargas, C., 676 Follows, M., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., 677 Kandels-Lewis, S., Karp-Boss, L., Karsenti, E., Not, F., Ogata, H., Pesant, S., 678 Raes, J., Sardet, C., Sieracki, M.E., Speich, S., Stemmann, L., Sullivan, M.B., 679 Sunagawa, S., Wincker, P., Pesant, S., Karsenti, E., Wincker, P. (2017) Viral to 680 metazoan marine plankton nucleotide sequences from the Tara Oceans 681 expedition. Sci. Data 4, 170093, Doi: 10.1038/sdata.2017.93.

682 [4] Alneberg, J., Karlsson, C.M.G., Divne, A.-M., Bergin, C., Homa, F., Lindh, M. V., 683 Hugerth, L.W., Ettema, T.J.G., Bertilsson, S., Andersson, A.F., Pinhassi, J. 684 (2018) Genomes from uncultivated prokaryotes: a comparison of metagenome- 685 assembled and single-amplified genomes. Microbiome 6(1), 173, Doi: 686 10.1186/s40168-018-0550-0.

687 [5] Arndt, D., Grant, J.R., Marcu, A., Sajed, T., Pon, A., Liang, Y., Wishart, D.S. 688 (2016) PHASTER: a better, faster version of the PHAST phage search tool.

689 Nucleic Acids Res. 44(W1), W16–21, Doi: 10.1093/nar/gkw387.

690 [6] Baek, K., Choi, A., Kang, I., Lee, K., Cho, J.C. (2013) Kordia antarctica sp. nov., 691 isolated from Antarctic seawater. Int. J. Syst. Evol. Microbiol. 63(PART10), 692 3617–22, Doi: 10.1099/ijs.0.052738-0.

693 [7] Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., 694 Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A. V., Sirotkin, 695 A. V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A. (2012) SPAdes: A 696 New Genome Assembly Algorithm and Its Applications to Single-Cell 697 Sequencing. J. Comput. Biol. 19(5), 455–77, Doi: 10.1089/cmb.2012.0021.

698 [8] Begley, T.P., Kinsland, C., Strauss, E. (2001) The biosynthesis of coenzyme A 699 in bacteria. Vitam. Horm. 61, 157–71, Doi: 10.1016/s0083-6729(01)61005-7.

700 [9] Bertelli, C., Laird, M.R., Williams, K.P., Lau, B.Y., Hoad, G., Winsor, G.L., 701 Brinkman, F.S.L. (2017) IslandViewer 4: Expanded prediction of genomic islands 702 for larger-scale datasets. Nucleic Acids Res. 45(W1), W30–5, Doi: 703 10.1093/nar/gkx343.

704 [10] Bolger, A.M., Lohse, M., Usadel, B. (2014) Trimmomatic: a flexible trimmer for 705 Illumina sequence data. Bioinformatics 30(15), 2114–20, Doi: 706 10.1093/bioinformatics/btu170.

707 [11] Bowers, R.M., Kyrpides, N.C., Stepanauskas, R., Harmon-Smith, M., Doud, D., 708 Reddy, T.B.K., Schulz, F., Jarett, J., Rivers, A.R., Eloe-Fadrosh, E.A., Tringe, 709 S.G., Ivanova, N.N., Copeland, A., Clum, A., Becraft, E.D., Malmstrom, R.R., 710 Birren, B., Podar, M., Bork, P., Weinstock, G.M., Garrity, G.M., Dodsworth, J.A., 711 Yooseph, S., Sutton, G., Glöckner, F.O., Gilbert, J.A., Nelson, W.C., Hallam, 712 S.J., Jungbluth, S.P., Ettema, T.J.G., Tighe, S., Konstantinidis, K.T., Liu, W.-T., 713 Baker, B.J., Rattei, T., Eisen, J.A., Hedlund, B., McMahon, K.D., Fierer, N., 714 Knight, R., Finn, R., Cochrane, G., Karsch-Mizrachi, I., Tyson, G.W., Rinke, C., 715 Kyrpides, N.C., Schriml, L., Garrity, G.M., Hugenholtz, P., Sutton, G., Yilmaz, P., 716 Meyer, F., Glöckner, F.O., Gilbert, J.A., Knight, R., Finn, R., Cochrane, G., 717 Karsch-Mizrachi, I., Lapidus, A., Meyer, F., Yilmaz, P., Parks, D.H., Eren, A.M., 718 Schriml, L., Banfield, J.F., Hugenholtz, P., Woyke, T. (2017) Minimum 719 information about a single amplified genome (MISAG) and a metagenome- 720 assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35(8), 721 725–31, Doi: 10.1038/nbt.3893.

722 [12] Caro-Quintero, A., Konstantinidis, K.T. (2012) Bacterial species may exist, 723 metagenomics reveal. Environ. Microbiol. 14(2), 347–55, Doi: 10.1111/j.1462-

724 2920.2011.02668.x.

725 [13] Chikhi, R., Medvedev, P. (2014) Informed and automated k-mer size selection 726 for genome assembly. Bioinformatics 30(1), 31–7, Doi: 727 10.1093/bioinformatics/btt310.

728 [14] Choi, A., Oh, H.-M., Yang, S.-J., Cho, J.-C. (2011) Kordia periserrulae sp. nov., 729 isolated from a marine polychaete Periserrula leucophryna, and emended 730 description of the genus Kordia. Int. J. Syst. Evol. Microbiol. 61(Pt 4), 864–9, 731 Doi: 10.1099/ijs.0.022764-0.

732 [15] Clark, S.C., Egan, R., Frazier, P.I., Wang, Z. (2013) ALE: A generic assembly 733 likelihood evaluation framework for assessing the accuracy of genome and 734 metagenome assemblies. Bioinformatics 29(4), 435–43, Doi: 735 10.1093/bioinformatics/bts723.

736 [16] Conway, J.R., Lex, A., Gehlenborg, N. (2017) UpSetR: an R package for the 737 visualization of intersecting sets and their properties. Bioinformatics 33(18), 738 2938–40, Doi: 10.1093/bioinformatics/btx364.

739 [17] Delmont, T.O., Eren, A.M. (2018) Linking pangenomes and metagenomes: the 740 Prochlorococcus metapangenome. PeerJ, Doi: 10.7717/peerj.4320.

741 [18] Delmont, T.O., Quince, C., Shaiber, A., Esen, O.C., Lee, S.T., Lücker, S., Murat 742 Eren, A. (2018) Nitrogen-fixing populations of Planctomycetes and 743 Proteobacteria are abundant in the surface ocean. Nat. Microbiol. 3(July), Doi: 744 10.1101/129791.

745 [19] van Dongen, S., Abreu-Goodger, C. (2012) Using MCL to Extract Clusters from 746 Networks. In: van Helden, J., Toussaint, A., Thieffry, D., (Eds.), Bacterial 747 Molecular Networks: Methods and Protocols, Springer New York, New York, NY, 748 pp. 281–95.

749 [20] Edgar, R.C. (2004) MUSCLE: Multiple sequence alignment with high accuracy 750 and high throughput. Nucleic Acids Res. 32(5), 1792–7, Doi: 751 10.1093/nar/gkh340.

752 [21] Elbourne, L.D.H., Tetu, S.G., Hassan, K.A., Paulsen, I.T. (2017) TransportDB 753 2.0: a database for exploring membrane transporters in sequenced genomes 754 from all domains of life. Nucleic Acids Res. 45(D1), D320–4, Doi: 755 10.1093/nar/gkw1068.

756 [22] Fernández-Gómez, B., Richter, M., Schüler, M., Pinhassi, J., Acinas, S.G., 757 González, J.M., Pedrós-Alió, C. (2013) Ecology of marine Bacteroidetes: a

758 comparative genomics approach. ISME J. 7(5), 1026–37, Doi: 759 10.1038/ismej.2012.169.

760 [23] Fraser, C., Alm, E.J., Polz, M.F., Spratt, B.G., Hanage, W.P. (2009) The 761 Bacterial Species Challenge: Making Sense of Genetic and Ecological Diversity. 762 Science 323(February), 741–6, Doi: 10.1126/science.1159388.

763 [24] Gómez-Rubio, V. (2017) ggplot2 - Elegant Graphics for Data Analysis (2nd 764 Edition). J. Stat. Softw. 77(Book Review 2), 3–5, Doi: 10.18637/jss.v077.b02.

765 [25] Gonzalez, J.M., Fernandez-Gomez, B., Fernandez-Guerra, A., Gomez- 766 Consarnau, L., Sanchez, O., Coll-Llado, M., del Campo, J., Escudero, L., 767 Rodriguez-Martinez, R., Alonso-Saez, L., Latasa, M., Paulsen, I., 768 Nedashkovskaya, O., Lekunberri, I., Pinhassi, J., Pedros-Alio, C. (2008) 769 Genome analysis of the proteorhodopsin-containing marine bacterium 770 Polaribacter sp. MED152 (Flavobacteria). Proc. Natl. Acad. Sci. 105(25), 8724– 771 9, Doi: 10.1073/pnas.0712027105.

772 [26] Hameed, A., Shahina, M., Lin, S.Y., Cho, J.C., Lai, W.A., Young, C.C. (2013) 773 Kordia aquimaris sp. nov., a zeaxanthin-producing member of the family 774 Flavobacteriaceae isolated from surface seawater, and emended description of 775 the genus Kordia. Int. J. Syst. Evol. Microbiol. 63(PART 12), 4790–6, Doi: 776 10.1099/ijs.0.056051-0.

777 [27] Hyatt, D., Chen, G., Locascio, P.F., Land, M.L., Larimer, F.W., Hauser, L.J. 778 (2010) Prodigal: prokaryotic gene recognition and translation initiation site 779 identification. BMC Bioinformatics 11, 119, Doi: 10.1186/1471-2105-11-119.

780 [28] Jain, C., Rodriguez-R, L.M., Phillipy, A.M., Konstantinidis, K.T., Aluru, S. (2018) 781 High throughput ANI analysis of 90K prokaryotic genomes reveals clear species 782 boundaries. Nat. Commun. 9(5114), 1–8, Doi: 10.1038/s41467-018-07641-9.

783 [29] Kanehisa, M., Sato, Y., Morishima, K. (2016) BlastKOALA and GhostKOALA: 784 KEGG Tools for Functional Characterization of Genome and Metagenome 785 Sequences. J. Mol. Biol. 428(4), 726–31, Doi: 10.1016/j.jmb.2015.11.006.

786 [30] Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De Vargas, C., Raes, J., 787 Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.-M., Follows, M., Gorsky, G., 788 Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not, F., 789 Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., 790 Velayoudon, D., Weissenbach, J., Wincker, P. (2011) A holistic approach to 791 marine eco-systems biology. PLoS Biol. 9(10), e1001177, Doi:

792 10.1371/journal.pbio.1001177.

793 [31] Kashtan, N., Roggensack, S.E., Rodrigue, S., Thompson, J.W., Biller, S.J., Coe, 794 A., Ding, H., Marttinen, P., Malmstrom, R.R., Stocker, R., Follows, M.J., 795 Stepanauskas, R., Chisholm, S.W. (2014) Single-cell genomics reveals 796 hundreds of coexisting subpopulations in wild Prochlorococcus. Science 797 344(6182), 416–20, Doi: 10.1126/science.1248575.

798 [32] Kettler, G.C., Martiny, A.C., Huang, K., Zucker, J., Coleman, M.L., Rodrigue, S., 799 Chen, F., Lapidus, A., Ferriera, S., Johnson, J., Steglich, C., Church, G.M., 800 Richardson, P., Chisholm, S.W. (2007) Patterns and implications of gene gain 801 and loss in the evolution of Prochlorococcus. PLoS Genet. 3(12), 2515–28, Doi: 802 10.1371/journal.pgen.0030231.

803 [33] Kim, D.I., Lee, J.H., Kim, M.S., Seong, C.N. (2017) Kordia zosterae sp. nov., 804 isolated from the seaweed, Zostera marina. Int. J. Syst. Evol. Microbiol. 67(11), 805 4790–5, Doi: 10.1099/ijsem.0.002379.

806 [34] Kim, M., Oh, H.S., Park, S.C., Chun, J. (2014) Towards a taxonomic coherence 807 between average nucleotide identity and 16S rRNA gene sequence similarity for 808 species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 64(PART 2), 809 346–51, Doi: 10.1099/ijs.0.059774-0.

810 [35] Kisker, C., Kuper, J., Van Houten, B. (2013) Prokaryotic Nucleotide Excision 811 Repair. Cold Spring Harb. Perspect. Biol. 5(3), a012591–a012591, Doi: 812 10.1101/cshperspect.a012591.

813 [36] Kogawa, M., Hosokawa, M., Nishikawa, Y., Mori, K., Takeyama, H. (2018) 814 Obtaining high-quality draft genomes from uncultured microbes by cleaning and 815 co-assembly of single-cell amplified genomes. Sci. Rep. 8(1), 1–11, Doi: 816 10.1038/s41598-018-20384-3.

817 [37] Konstantinidis, K.T., Rosselló-Móra, R., Amann, R. (2017) Uncultivated 818 microbes in need of their own taxonomy. ISME J., 1–8, Doi: 819 10.1038/ismej.2017.113.

820 [38] Konstantinidis, K.T., Tiedje, J.M. (2005) Genomic insights that advance the 821 species definition for prokaryotes. Proc. Natl. Acad. Sci. U. S. A. 102(7), 2567– 822 72, Doi: 10.1073/pnas.0409727102.

823 [39] Kultima, J.R., Sunagawa, S., Li, J., Chen, W., Chen, H., Mende, D.R., 824 Arumugam, M., Pan, Q., Liu, B., Qin, J., Wang, J., Bork, P. (2012) MOCAT: A 825 Metagenomics Assembly and Gene Prediction Toolkit. PLoS One 7(10), 1–6, Doi:

826 10.1371/journal.pone.0047656.

827 [40] Kumar, S., Stecher, G., Tamura, K. (2016) MEGA7: Molecular Evolutionary 828 Genetics Analysis Version 7.0 for Bigger Datasets. Mol. Biol. Evol. 33(7), 829 1870–4, Doi: 10.1093/molbev/msw054.

830 [41] Lai, Q., Shao, Z., Liu, Y., Xie, Y., Du, J., Dong, C. (2015) Kordia zhangzhouensis 831 sp. nov., isolated from surface freshwater. Int. J. Syst. Evol. Microbiol. 65(10), 832 3379–83, Doi: 10.1099/ijsem.0.000424.

833 [42] Landry, M.R., Brown, S.L., Campbell, L., Constantinou, J., Liu, H. (1998) Spatial 834 patterns in phytoplankton growth and microzooplankton grazing in the Arabian 835 Sea during monsoon forcing. Deep. Res. Part II Top. Stud. Oceanogr. 45(10– 836 11), 2353–68, Doi: 10.1016/S0967-0645(98)00074-5.

837 [43] Langmead, B., Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. 838 Nat. Methods 9(4), 357–9, Doi: 10.1038/nmeth.1923.

839 [44] Lapébie, P., Lombard, V., Drula, E., Terrapon, N. (n.d.) Bacteroidetes use 840 thousands of enzyme combinations to break down glycans. Nat. Commun. 841 (2019), Doi: 10.1038/s41467-019-10068-5.

842 [45] Letunic, I., Bork, P. (2019) Interactive Tree Of Life ( iTOL ) v4 : recent updates 843 and 47(April), 256–9, Doi: 10.1093/nar/gkz239.

844 [46] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., 845 Abecasis, G., Durbin, R. (2009) The Sequence Alignment/Map format and 846 SAMtools. Bioinformatics 25(16), 2078–9, Doi: 10.1093/bioinformatics/btp352.

847 [47] López-Pérez, M., Rodriguez-Valera, F. (2016) Pangenome evolution in the 848 marine bacterium Alteromonas. Genome Biol. Evol. 8(5), evw098, Doi: 849 10.1093/gbe/evw098.

850 [48] Mangot, J., Logares, R., Sánchez, P., Latorre, F., Seeleuthner, Y., Mondy, S., 851 Sieracki, M.E., Jaillon, O., Wincker, P., Vargas, C. de., Massana, R. (2017) 852 Accessing the genomic information of unculturable oceanic picoeukaryotes by 853 combining multiple single cells. Sci. Rep. 7(January), 41498, Doi: 854 10.1038/srep41498.

855 [49] Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R. (2005) The 856 microbial pan-genome. Curr. Opin. Genet. Dev. 15(6), 589–94, Doi: 857 http://dx.doi.org/10.1016/j.gde.2005.09.006.

858 [50] Mestre, M., Ruiz-gonzález, C., Logares, R., Duarte, C.M., Gasol, J.M. (2018)

36 859 Sinking particles promote vertical connectivity in the ocean microbiome 115(29), 860 6799–807, Doi: 10.1073/pnas.1802470115.

861 [51] Mira, A., Martín-Cuadrado, A.B., D’Auria, G., Rodríguez-Valera, F. (2010) The 862 bacterial pan-genome:a new paradigm in microbiology. Int. Microbiol. 13(2), 45– 863 57, Doi: 10.2436/20.1501.01.110.

864 [52] Ochman, H., Lerat, E., Daubin, V. (2005) Examining bacterial species under the 865 specter of gene transfer and exchange. Proc. Natl. Acad. Sci. 102(Supplement 866 1), 6595–9, Doi: 10.1073/pnas.0502035102.

867 [53] Pachiadaki, M.G., Sintes, E., Bergauer, K., Brown, J.M., Record, N.R., Swan, 868 B.K., Mathyer, M.E., Hallam, S.J., Lopez-Garcia, P., Takaki, Y., Nunoura, T., 869 Woyke, T., Herndl, G.J., Stepanauskas, R. (2017) Major role of nitrite-oxidizing 870 bacteria in dark ocean carbon fixation. Science (80-. ). 358(6366), 1046–51, Doi: 871 10.1126/science.aan8260.

872 [54] Park, S., Jung, Y.-T., Yoon, J.-H. (2014) Kordia jejudonensis sp. nov., isolated 873 from the junction between the ocean and a freshwater spring, and emended 874 description of the genus Kordia. Int. J. Syst. Evol. Microbiol. 64(Pt 2), 657–62, 875 Doi: 10.1099/ijs.0.058776-0.

876 [55] Parks, D.H., Imelfort, M., Skennerton, C.T., Hugenholtz, P., Tyson, G.W. (2015) 877 CheckM: assessing the quality of microbial genomes recovered from isolates, 878 single cells, and metagenomes. Genome Res. 25(7), 1043–55, Doi: 879 10.1101/gr.186072.114.

880 [56] Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., 881 Evans, P.N., Hugenholtz, P., Tyson, G.W. (2017) Recovery of nearly 8,000 882 metagenome-assembled genomes substantially expands the tree of life. Nat. 883 Microbiol. 2(11), 1533–42, Doi: 10.1038/s41564-017-0012-7.

884 [57] Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., 885 Evans, P.N., Hugenholtz, P., Tyson, G.W. (2018) Author Correction: Recovery of 886 nearly 8,000 metagenome-assembled genomes substantially expands the tree 887 of life. Nat. Microbiol. 3(2), 253–253, Doi: 10.1038/s41564-017-0083-5.

888 [58] Paulmier, A., Ruiz-Pino, D. (2009) Oxygen minimum zones (OMZs) in the 889 modern ocean. Prog. Oceanogr. 80(3–4), 113–28, Doi: 890 10.1016/j.pocean.2008.08.001.

891 [59] Pinder, M.I.M., Johansson, O.N., Almstedt, A., Kourtchenko, O., Clarke, A.K., 892 Godhe, A., Töpel, M. (2019) Genome Sequence of Kordia sp. Strain SMS9

37 893 Identified in a Non-Axenic Culture of the Diatom Skeletonema marinoi . J. 894 Genomics 7, 46–9, Doi: 10.7150/jgen.35061.

895 [60] Price, M.N., Dehal, P.S., Arkin, A.P. (2010) FastTree 2 – Approximately 896 Maximum-Likelihood Trees for Large Alignments 5(3), Doi: 897 10.1371/journal.pone.0009490.

898 [61] Pruesse, E., Peplies, J., Glöckner, F.O., Editor, A., Wren, J. (2012) SINA : 899 Accurate high-throughput multiple sequence alignment of ribosomal RNA genes 900 28(14), 1823–9, Doi: 10.1093/bioinformatics/bts252.

901 [62] Qi, F., Shao, Z., Li, D., Huang, Z., Lai, Q. (2016) Kordia ulvae sp. nov., a 902 bacterium isolated from the surface of green marine algae Ulva sp. Int. J. Syst. 903 Evol. Microbiol. 66(7), 2623–8, Doi: 10.1099/ijsem.0.001098.

904 [63] Rawlings, N.D., Waller, M., Barrett, A.J., Bateman, A. (2014) MEROPS : the 905 database of proteolytic enzymes , their substrates and inhibitors 42(July 2011), 906 503–9, Doi: 10.1093/nar/gkt953.

907 [64] Ren, Q., Pauisers, J.T. (2005) Comparative analyses of fundamental differences 908 in membrane transport Capabilities in Prokaryotes and Eukaryotes. PLoS 909 Comput. Biol. 1(3), 0190–201, Doi: 10.1371 journal.pcbi.0010027.

910 [65] Richter, M., Rosselló-Móra, R. (2009) Shifting the genomic gold standard for the 911 prokaryotic species definition. Proc. Natl. Acad. Sci. U. S. A. 106(45), 19126–31, 912 Doi: 10.1073/pnas.0906412106.

913 [66] Rodriguez-R, L.M., Gunturu, S., Harvey, W.T., Rossell, R., Tiedje, J.M., Cole, 914 J.R., Konstantinidis, K.T. (2018) The Microbial Genomes Atlas ( MiGA ) 915 webserver : taxonomic and gene diversity analysis of Archaea and (June), 1–7, 916 Doi: 10.1093/nar/gky467.

917 [67] Rosselló-Mora, R., Amann, R. (2001) The species concept for prokaryotes. 918 FEMS Microbiol. Rev. 25(1), 39–67.

919 [68] Roullier, F., Berline, L., Guidi, L., Durrieu De Madron, X., Picheral, M., Sciandra, 920 A., Pesant, S., Stemmann, L. (2014) Particle size distribution and estimated 921 carbon flux across the Arabian Sea oxygen minimum zone. Biogeosciences 922 11(16), 4541–57, Doi: 10.5194/bg-11-4541-2014.

923 [69] Royo-Llonch, M., Ferrera, I., Cornejo-Castillo, F.M., Sánchez, P., Salazar, G., 924 Stepanauskas, R., González, J.M., Sieracki, M.E., Speich, S., Stemmann, L., 925 Pedrós-Alió, C., Acinas, S.G. (2017) Exploring microdiversity in novel Kordia sp. 926 (Bacteroidetes) with proteorhodopsin from the tropical Indian Ocean via Single

38 927 Amplified Genomes. Front. Microbiol. 8(JUL), Doi: 10.3389/fmicb.2017.01317.

928 [70] Seemann, T. (2014) Prokka: Rapid prokaryotic genome annotation. 929 Bioinformatics 30(14), 2068–9, Doi: 10.1093/bioinformatics/btu153.

930 [71] Segerman, B. (2012) The genetic integrity of bacterial species: the core genome 931 and the accessory genome, two different stories. Front. Cell. Infect. Microbiol. 932 2(September), 1–8, Doi: 10.3389/fcimb.2012.00116.

933 [72] Shapiro, B.J., Polz, M.F. (2014) Ordering microbial diversity into ecologically and 934 genetically cohesive units. Trends Microbiol. 22(5), 235–47, Doi: 935 10.1016/j.tim.2014.02.006.

936 [73] Sharon, I., Kertesz, M., Hug, L.A., Pushkarev, D., Blauwkamp, T.A., Castelle, 937 C.J., Amirebrahimi, M., Thomas, B.C., Burstein, D., Tringe, S.G., Williams, K.H., 938 Banfield, J.F. (2015) Accurate , multi-kb reads resolve complex populations and 939 detect rare microorganisms. Genome Res., 534–43, Doi: 940 10.1101/gr.183012.114.

941 [74] Shoseyov, O., Shani, Z., Levy, I. (2006) Carbohydrate Binding Modules: 942 Biochemical Properties and Novel Applications. Microbiol. Mol. Biol. Rev. 70(2), 943 283–95, Doi: 10.1128/MMBR.00028-05.

944 [75] Sohn, J.H., Lee, J.H., Yi, H., Chun, J., Bae, K.S., Ahn, T.Y., Kim, S.J. (2004) 945 Kordia algicida gen. nov., sp. nov., an algicidal bacterium isolated from red tide. 946 Int. J. Syst. Evol. Microbiol. 54(3), 675–80, Doi: 10.1099/ijs.0.02689-0.

947 [76] Stepanauskas, R. (2012) Single cell genomics: An individual look at microbes. 948 Curr. Opin. Microbiol. 15(5), 613–20, Doi: 10.1016/j.mib.2012.09.001.

949 [77] Swan, B.K., Martinez-Garcia, M., Preston, C.M., Sczyrba, A., Woyke, T., Lamy, 950 D., Reinthaler, T., Poulton, N.J., Masland, E.D.P., Gomez, M.L., Sieracki, M.E., 951 DeLong, E.F., Herndl, G.J., Stepanauskas, R. (2011) Potential for 952 chemolithoautotrophy among ubiquitous bacteria lineages in the dark ocean. 953 Science 333(6047), 1296–300, Doi: 10.1126/science.1203690.

954 [78] Swan, B.K., Tupper, B., Sczyrba, A., Lauro, F.M., Martinez-Garcia, M., 955 González, J.M., Luo, H., Wright, J.J., Landry, Z.C., Hanson, N.W., Thompson, 956 B.P., Poulton, N.J., Schwientek, P., Acinas, S.G., Giovannoni, S.J., Moran, M.A., 957 Hallam, S.J., Cavicchioli, R., Woyke, T., Stepanauskas, R. (2013) Prevalent 958 genome streamlining and latitudinal divergence of planktonic bacteria in the 959 surface ocean. Proc. Natl. Acad. Sci. U. S. A. 110(28), 11463–8, Doi: 960 10.1073/pnas.1304246110.

39 961 [79] Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008) Comparative genomics: the 962 bacterial pan-genome. Curr. Opin. Microbiol. 11(5), 472–7, Doi: 963 10.1016/j.mib.2008.09.006.

964 [80] Thompson, L.R., Haroon, M.F., Shibl, A.A., Cahill, M.J., Ngugi, D.K., Williams, 965 G.J., Morton, J.T., Knight, R., Goodwin, K.D., Stingl, U. (2019) Red Sea SAR11 966 and Prochlorococcus Single-Cell Genomes Reflect Globally Distributed 967 Pangenomes. Appl. Environ. Microbiol. 85(13), 1–18, Doi: 10.1128/AEM.00369- 968 19.

969 [81] Tully, B.J., Wheat, C.G., Glazer, B.T., Huber, J.A. (2018) A dynamic microbial 970 community with high functional redundancy inhabits the cold, oxic subseafloor 971 aquifer. ISME J. 12(1), 1–16, Doi: 10.1038/ismej.2017.187.

972 [82] Varani, A.M., Siguier, P., Gourbeyre, E., Charneau, V., Chandler, M. (2011) 973 ISsaga is an ensemble of web-based methods for high throughput identification 974 and semi-automatic annotation of insertion sequences in prokaryotic genomes. 975 Genome Biol. 12(3), R30, Doi: 10.1186/gb-2011-12-3-r30.

976 [83] Yarza, P., Richter, M., Euzeby, J., Amann, R., Schleifer, K., Ludwig, W., Glo, 977 F.O., Rossello, R. (2020) The All-Species Living Tree project : A 16S rRNA- 978 based phylogenetic tree of all sequenced type strains 31(2008), 241–50, Doi: 979 10.1016/j.syapm.2008.07.001.

980 [84] Yin, Y., Mao, X., Yang, J., Chen, X., Mao, F., Xu, Y. (2012) dbCAN: a web 981 resource for automated carbohydrate-active enzyme annotation. Nucleic Acids 982 Res. 40(W1), W445–51, Doi: 10.1093/nar/gks479.

983 [85] Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014) PEAR: a fast and 984 accurate Illumina Paired-End reAd mergeR. Bioinformatics 30(5), 614–20, Doi: 985 10.1093/bioinformatics/btt593.

986

987

988

Figures


Figure 1. Workflow used to co-assemble the novel Kordia sp. TARA_039_SRF high-quality draft genome from ten Kordia SAGs. The first step (A) covers read processing and co-assembly optimization. Raw reads are cleaned and merged, yielding a working dataset of merged reads plus clean paired-end reads that did not merge. These are co-assembled multiple times with different individual k-mer sizes, ranging between 17 and 133 bp. N50 length, number of contigs and assembly evaluation score (ALE score) are the parameters used to select the three best k-mer sizes, which are then combined in the co-assembly used downstream. The second step (B) aims to reduce the redundancy found at the contig ends of this co-assembly and to reduce contamination. It relies on reassembling the co-assembly's contigs, measuring genome completeness and contamination, and discarding the shorter contigs that add extra copies of single-copy genes, until the final contamination value is <5%.
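The k-mer selection in step A can be sketched as follows. The caption names the three criteria but not how they are combined, so the rank-sum rule below is an assumption, as are the names `AssemblyStats`, `n50` and `best_kmers` and the convention that a higher ALE score is better. It is a minimal illustration, not the authors' pipeline.

```python
from typing import List, NamedTuple


class AssemblyStats(NamedTuple):
    """Summary of one single-k-mer co-assembly (hypothetical container)."""
    kmer: int
    n50: int          # contig N50 in bp
    n_contigs: int    # number of contigs
    ale_score: float  # assumed: higher (less negative) is better


def n50(contig_lengths: List[int]) -> int:
    """Length of the contig at which half of the assembly size is reached."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def best_kmers(stats: List[AssemblyStats], top: int = 3) -> List[int]:
    """Rank assemblies on each criterion and return the k-mers with the lowest rank sum."""
    def ranks(key, reverse):
        ordered = sorted(stats, key=key, reverse=reverse)
        return {s.kmer: i for i, s in enumerate(ordered)}

    by_n50 = ranks(lambda s: s.n50, True)             # longer N50 is better
    by_contigs = ranks(lambda s: s.n_contigs, False)  # fewer contigs is better
    by_ale = ranks(lambda s: s.ale_score, True)       # assumed orientation
    total = {s.kmer: by_n50[s.kmer] + by_contigs[s.kmer] + by_ale[s.kmer]
             for s in stats}
    # Ties are broken by the smaller k-mer size (arbitrary choice).
    return sorted(total, key=lambda k: (total[k], k))[:top]
```

Calling `best_kmers` on one `AssemblyStats` record per tested k-mer size returns the three sizes to combine. Any other weighting of the three criteria would slot into the same structure.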

[Figure 2 image: phylogenetic trees in panels A and B. Tip categories in the legend: SAG co-assembly, Isolate, OTU, MAG.]

Figure 2. Phylogenetic reconstruction of Kordia sp. TARA_039_SRF and its closest relatives based on the 16S rRNA gene (A) and 40 conserved single-copy genes (B).

Figure 3. Theoretical model with highlights of the functional potential of Kordia sp. TARA_039_SRF. Membrane transporters are colored according to their TC Family classification, and their simplified structure has been inferred from the TC database and related bibliography.


Figure 4. Relative abundances of recruited reads in all available metagenomes from station TARA_039 in the North Indian Ocean (A). Unavailable metagenomic samples are depicted as slashed in the heatmap; samples with no read recruitment appear in white. Horizontal genomic coverage of mapped reads at 95-100% in TARA_039 metagenomes (B). The 4193 loci encoded in the co-assembly have been collapsed into 420 bins of 10 loci each, and each bin is colored by the number of reads mapping to it. No read mapping is shown in white.
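The binning in panel B can be illustrated with a short sketch. It assumes the per-locus read counts are available as a list in genome order and that each bin reports the sum of its loci. The function name `bin_loci` is hypothetical, and the caption does not say whether counts were summed or averaged.

```python
from typing import List


def bin_loci(read_counts: List[int], bin_size: int = 10) -> List[int]:
    """Collapse consecutive loci into bins and sum the mapped reads per bin."""
    return [sum(read_counts[i:i + bin_size])
            for i in range(0, len(read_counts), bin_size)]
```

With 4193 loci and a bin size of 10 this gives 420 bins, the last of which holds the remaining 3 loci.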


Figure 5. Comparative genomics of Kordia sp. TARA_039_SRF, K. algicida, K. jejudonensis and K. zhangzhouensis. The top left dendrogram represents clusters of genes ordered by the number of genomes encoding them, regardless of copy number. The phylogenomic tree connecting the Kordia spp. labels was built from their core genome. The right bar plot quantifies the total number of genes that are unique and shared in all possible combinations between the five Kordia genomes. The bottom heatmap quantifies the number of genes assigned to a COG category, peptidases, Transport Classification family or CAZymes category for each genome combination and for the unique flexible genome. Each genome has a specific color that is maintained across the sections of the figure.


Tables

Table 1. Metrics of the individual assemblies and the co-assembly of the 10 Kordia SAGs.

Columns I04 to P22 are the individual SAGs (identifiers prefixed AB-193_); the last column is the refined Kordia sp. TARA_039_SRF co-assembly.

| Metric | I04 | I20 | L04 | M03 | M19 | M23 | N20 | O13 | O22 | P22 | Co-assembly |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Assembly metrics | | | | | | | | | | | |
| Total N of contigs | 179 | 111 | 367 | 124 | 197 | 244 | 256 | 179 | 221 | 220 | 27 |
| Contig N50 (Kbp) | 20.9 | 30.0 | 33.0 | 54.2 | 40.2 | 38.4 | 27.5 | 42.7 | 49.8 | 25.5 | 42.9 |
| Mean contig length (Kbp) | 5.9 | 8.6 | 5.6 | 10.2 | 10.4 | 9.3 | 5.9 | 9.5 | 9.1 | 6.6 | 170.1 |
| Min. contig length (bp) | 118 | 118 | 118 | 118 | 118 | 118 | 118 | 118 | 118 | 118 | 14743 |
| Max. contig length (Kbp) | 59.5 | 80.7 | 113.4 | 135 | 137 | 139.7 | 107.7 | 93.4 | 106.1 | 113 | 682.4 |
| Total assembly size (Mbp) | 1.05 | 0.96 | 2.05 | 1.27 | 2.06 | 2.27 | 1.51 | 1.7 | 2.01 | 1.47 | 4.59 |
| Quality parameters | | | | | | | | | | | |
| Genome completeness (%) | 24.36 | 11.15 | 39.94 | 26.14 | 44.11 | 43.32 | 33.02 | 36.96 | 44.78 | 32.31 | 94.97 |
| Contamination (%) | 0.65 | 0.00 | 1.14 | 0.49 | 1.14 | 2.16 | 1.84 | 0.65 | 0.43 | 0.69 | 4.65 |
| Strain heterogeneity | 100.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Read mappings against co-assembly | | | | | | | | | | | |
| Contribution to co-assembly (%) | 24.56 | 21.70 | 45.96 | 28.75 | 45.30 | 50.04 | 34.56 | 37.82 | 44.99 | 32.84 | - |
| Mapped clean reads (%) | 97.84 | 97.64 | 97.76 | 98.21 | 98.38 | 98.40 | 98.50 | 98.11 | 98.48 | 98.15 | - |
| Genomic features | | | | | | | | | | | |
| Predicted ORFs | 1012 | 944 | 1855 | 1142 | 1891 | 2118 | 1357 | 1548 | 1864 | 1334 | 4192 |
| GC (%) | 35.4 | 35.5 | 35.3 | 35.4 | 35.6 | 35.5 | 35.4 | 35.5 | 35.8 | 36.0 | 36.33 |
| ANIm against co-assembly (%) | 99.98 | 99.98 | 99.97 | 99.99 | 99.98 | 99.98 | 99.97 | 99.98 | 99.96 | 99.98 | 100 |
| Identity of tetranucleotide frequencies with co-assembly (%) | 99.53 | 99.42 | 99.83 | 99.64 | 99.87 | 99.84 | 99.76 | 99.83 | 99.85 | 99.56 | 100 |


Table 2. Description of the four Kordia genomes used in the comparative genomics analysis.

| | Kordia sp. TARA_039_SRF | K. algicida | K. jejudonensis | K. zhangzhouensis | K. periserrulae |
|---|---|---|---|---|---|
| Genomic features | | | | | |
| Length (Mbp) | 4.59 | 5.01 | 5.35 | 4.03 | 4.72 |
| GC (%) | 35.8 | 34.2 | 33.9 | 33.8 | 36.2 |
| Coding density (%) | 88.45 | 87.58 | 88.09 | 90.98 | 88.93 |
| Total number of ORFs | 4192 | 4492 | 4782 | 3585 | 4105 |
| Predicted protein coding genes (CDS) | 4124 | 4411 | 4714 | 3527 | 4051 |
| Completeness (%) | 94.83 | 100 | 100 | 100 | 100 |
| Pangenome | | | | | |
| # clusters of core genes (% of total) | 2286 (57.4%) | 2286 (55.4%) | 2286 (50.3%) | 2286 (65.8%) | 2286 (58.1%) |
| Accessory shared genome: # clusters of genes shared with Kordia spp. (% of total) | 971 (24.4%) | 1038 (25.1%) | 969 (21.3%) | 772 (22.2%) | 1050 (26.6%) |
| Accessory strain specific genome: # clusters of unique flexible genes (% of total) | 723 (18.1%) | 801 (19.4%) | 1287 (28.3%) | 414 (11.9%) | 599 (15.2%) |
| Total # of gene clusters | 3980 | 4125 | 4542 | 3472 | 3935 |
| ANIm (%) | | | | | |
| Kordia sp. TARA_039_SRF | 100 | - | - | - | - |
| Kordia algicida | 81.13 | 100 | - | - | - |
| Kordia jejudonensis | 79.83 | 81.47 | 100 | - | - |
| Kordia zhangzhouensis | 79.62 | 79.56 | 79.23 | 100 | - |
| Kordia periserrulae | 89.13 | 80.97 | 79.83 | 79.63 | 100 |
| AAI (%) | | | | | |
| Kordia sp. TARA_039_SRF | 100 | - | - | - | - |
| Kordia algicida | 81.05 | 100 | - | - | - |
| Kordia jejudonensis | 77.96 | 80.14 | 100 | - | - |
| Kordia zhangzhouensis | 79.93 | 80.62 | 78.06 | 100 | - |
| Kordia periserrulae | 92.12 | 81.03 | 77.97 | 80.06 | 100 |

| Ribosomal RNA | Kordia sp. TARA_039_SRF | K. algicida | K. jejudonensis | K. zhangzhouensis | K. periserrulae |
|---|---|---|---|---|---|
| 16S rRNA id. % with Kordia sp. TARA_039_SRF | 100 | - | - | - | - |
| 16S rRNA id. % with K. algicida (coverage) | 97.03 (100%) | 100 | - | - | - |
| 16S rRNA id. % with K. jejudonensis (coverage) | 98.00 (95%) | 96.69 (95%) | 100 | - | - |
| 16S rRNA id. % with K. zhangzhouensis (coverage) | 96.85 (98%) | 95.85 (98%) | 97.24 (100%) | 100 | - |
| 16S rRNA id. % with K. periserrulae (coverage) | 97.56 (97%) | 96.74 (97%) | 97.24 (98%) | 96.34 (99%) | 100 |
| # of tRNA types | 21 | 21 | 20 | 19 | 20 |
| # of rRNA operons | 1 | 3 | 1 | 1 | 1 |
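The pangenome rows of Table 2 can be checked with a few lines of arithmetic: for each genome, the core, shared accessory and unique flexible clusters should add up to the total number of gene clusters, and each percentage is that count divided by the total. The sketch below copies the counts from the table; the names `TABLE2` and `pangenome_fractions` are illustrative only.

```python
from typing import Dict, Tuple

# Gene-cluster counts copied from Table 2: (core, shared, unique, reported total)
TABLE2: Dict[str, Tuple[int, int, int, int]] = {
    "Kordia sp. TARA_039_SRF": (2286, 971, 723, 3980),
    "K. algicida": (2286, 1038, 801, 4125),
    "K. jejudonensis": (2286, 969, 1287, 4542),
    "K. zhangzhouensis": (2286, 772, 414, 3472),
    "K. periserrulae": (2286, 1050, 599, 3935),
}


def pangenome_fractions(core: int, shared: int, unique: int):
    """Return the total cluster count and each partition as a percentage of it."""
    total = core + shared + unique
    percentages = tuple(round(100 * n / total, 1) for n in (core, shared, unique))
    return total, percentages


for genome, (core, shared, unique, reported_total) in TABLE2.items():
    total, pct = pangenome_fractions(core, shared, unique)
    # The three partitions should reproduce the reported total exactly.
    assert total == reported_total, genome
    print(genome, total, pct)
```

The totals match for all five genomes. Percentages recomputed this way agree with the tabulated ones to within 0.1 percentage point, which suggests that some of the published values were truncated rather than rounded.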
