bioRxiv preprint doi: https://doi.org/10.1101/2020.09.16.299339; this version posted September 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

An expanded gene catalog of the mouse gut metagenome

Jiahui Zhu1,2,3,#, Huahui Ren3,4,#, Huanzi Zhong3,4, Xiaoping Li3,4, Yuanqiang Zou3,4, Mo Han3,4, Minli Li1,2, Lise Madsen3,4,5, Karsten Kristiansen3,4,6, Liang Xiao3,6,7,*

1School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
2State key laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
3BGI-Shenzhen, Shenzhen 518083, China
4Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, 2100 Copenhagen, Denmark
5Institute of Marine Research, P.O. Box 7800, 5020 Bergen, Norway
6Qingdao-Europe Advanced Institute for Life Sciences, BGI-Shenzhen, Qingdao 266555, China
7Shenzhen Engineering Laboratory of Detection and Intervention of Human Intestinal Microbiome, Shenzhen 518083, China

#These authors contributed equally to this paper.
*Corresponding author:
Dr. Liang Xiao
BGI-Shenzhen, Building 11, Beishan Industrial Zone, Yantian District, Shenzhen, China
Email: [email protected]

Abstract

High-quality and comprehensive reference gene catalogs are essential for metagenomic research. The rather low diversity of samples used to construct existing catalogs of mouse gut metagenomes limits the size of these catalogs and the number of genes they contain. We therefore established an expanded gene catalog of the mouse gut metagenome (EMGC) containing >5.8 million genes by integrating 88 newly sequenced samples, 86 mouse-gut-related bacterial genomes and 3 existing gene catalogs. EMGC increases the number of non-redundant genes by more than one million compared to the most extensive catalog so far. More than 50% of the genes in EMGC were taxonomically assigned and 30% were functionally annotated. Based on the EMGC, 902 metagenomic species (MGSs) assigned to 122 taxa were identified. The EMGC-based analysis of samples from groups of mice originating from different animal providers, housing laboratories and genetic strains substantiated that diet is a major contributor to differences in composition and functional potential of the gut microbiota irrespective of differences in environment and genetic background. We envisage that EMGC will serve as an efficient and resource-saving reference dataset for future metagenomic studies in mice.

Introduction

Mouse models are among the most widely used animal models for biomedical studies to decipher the complex interplay between the gut microbiota and host phenotypes (Nguyen et al., 2015; Justice & Dhillon, 2016; Perlman, 2016; Hugenholtz & de Vos, 2018). Amplicon sequencing of the 16S rRNA gene has been widely used for analyses of the gut microbiota due to its low cost and short analysis cycle. However, the taxonomic information is in most cases limited to the genus level, and amplicon sequencing generally provides limited information on function (Morgan & Huttenhower, 2014; Wang & Jia, 2016b). Key to the use of mouse models for detailed functional analyses of the gut microbiota is the availability of comprehensive catalogs of microbial genes and derived metagenomic species (MGSs)/metagenome-assembled genomes (MAGs). The first catalog of genes in the mouse gut microbiome included 2.6 million nonredundant genes from fecal samples of 184 mice (Xiao et al., 2015). Subsequent studies further explored the diversity and functional potential of the mouse gut microbiota by isolating and sequencing an increasing number of bacterial strains from the mouse gut (Wannemuehler et al., 2014; Lagkouvardos et al., 2016; Brugiroux et al., 2017; Liu et al., 2020) and by establishing a Mouse Intestinal Bacterial Collection (miBC), depositing bacterial strains and associated genomes from the mouse gut (Lagkouvardos et al., 2016). Recently, Lesker et al. generated an integrated mouse gut metagenome catalog (iMGMC), comprising 4.6 million unique genes and 830 high-quality MAGs, and by linking MAGs to reconstructed 16S rRNA gene sequences they provided a pipeline enabling improved prediction of functional potentials based on 16S rRNA gene amplicon sequencing (Lesker et al., 2020).

Here we constructed an expanded mouse gut metagenome catalog (EMGC) by integrating 3 published gene catalogs, namely the gene catalog of the mouse gut metagenome (MGGC) released in 2015 comprising 2,571,074 genes (Xiao et al., 2015), a Feed & Diet Gene Catalog for mice (FDGC) (Xiao et al., 2017) and the integrated mouse gut metagenome catalog (iMGMC) (Lesker et al., 2020), together with 72 available sequenced mouse gut-related bacterial genomes (Wannemuehler et al., 2014; Lagkouvardos et al., 2016; Brugiroux et al., 2017; Liu et al., 2020), 14 high-quality genomes assembled from published sequencing data of isolates (Lagkouvardos et al., 2016), and 88 newly shotgun sequenced samples. Our new non-redundant reference gene catalog comprises 5,862,027 genes and was annotated using the NR (released on 5th Jan 2019) and KEGG (release 89) databases (Kanehisa & Goto, 2000). Finally, we generated 902 MGSs from the EMGC-based gene abundance profiles of 326 laboratory mice, and compared these MGSs with the high-quality MAG collection (Lesker et al., 2020) and the recent collection of genomes of bacteria isolated from the mouse gut (Liu et al., 2020). By combining these individual datasets we increased the number of sequenced bacterial genes of the mouse gut microbiome by more than one million genes, and significantly increased the mapping ratio of reads obtained by shotgun sequencing of mouse fecal samples, providing a resource for future studies on the mouse gut microbiota.

MATERIALS and METHODS

Data acquisition

DNA from 88 stool samples of C57BL/6J wild-type male mice, collected from the laboratory of BGI-Wuhan, was extracted and shotgun sequenced using the BGISEQ-500 platform and PE100 sequencing as described (Fang et al., 2018). An optimized sequencing quality filter for the cPAS-based BGISEQ platform, OAs1 (Fang et al., 2018), was applied in the quality control step, followed by host removal with SOAP2 (Li et al., 2009). Metagenomic reads from each sample were assembled individually using metaSPAdes v3.14.0 (parameters: -k 49, all other parameters set to default) (Kultima et al., 2012; Nurk et al., 2017). Genes were predicted by MetaGeneMark (W. Zhu et al., 2010) from metagenome-assembled contigs with length >500 bp, and predicted genes were retained when their length was >100 bp. Redundant predicted genes were removed by CD-HIT (v4.5.7, parameters: -G 0 -n 8 -aS 0.9 -c 0.95 -d 0 -r 1 -g 1) (Fu et al., 2012) in order to generate a sub mouse gut gene catalog (PMGC).
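The redundancy thresholds in the CD-HIT parameters above (-c 0.95 for sequence identity, -aS 0.9 for alignment coverage of the shorter sequence) can be illustrated with a minimal Python sketch. It only shows the pairwise criterion; the function name and inputs are hypothetical, and CD-HIT itself applies the criterion inside a greedy incremental clustering that is not reproduced here.

```python
def is_redundant(identity: float, aligned_len: int, len_a: int, len_b: int,
                 min_identity: float = 0.95, min_short_cov: float = 0.90) -> bool:
    """Return True if two genes would count as redundant under the thresholds.

    identity    -- fraction of identical positions in the alignment (cf. -c 0.95)
    aligned_len -- number of aligned bases on the shorter gene
    len_a/len_b -- full lengths of the two genes in bp
    The alignment must cover at least 90% of the shorter gene (cf. -aS 0.9).
    """
    shorter = min(len_a, len_b)
    coverage = aligned_len / shorter
    return identity >= min_identity and coverage >= min_short_cov


# Hypothetical example: 97% identity over 950 bp of a 1,000 bp gene.
print(is_redundant(0.97, 950, 1000, 1500))
```

Under this criterion a short fragment can be merged into a longer gene, whereas two genes that share only a conserved domain (low coverage of the shorter gene) are kept apart.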

Raw reads of 43 unassembled bacterial genomes from the miBC (EBI Project ID PRJEB10572) were downloaded from EBI and filtered by Trimmomatic (v0.39) (Bolger et al., 2014). Draft genomes were assembled separately by SPAdes (-k 29,39,49,69 --careful) (Bankevich et al., 2012; Nurk et al., 2013) and assessed with CheckM (Parks et al., 2015). After applying the criteria completeness >90% and contamination <5%, the remaining 14 genomes were used for gene prediction by GeneMarkS-2 (Lomsadze et al., 2018).
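The genome quality filter described above amounts to two threshold checks. A minimal Python sketch follows; the record type, genome names and values are hypothetical placeholders rather than data from the study, and the completeness and contamination values are assumed to have been read from a CheckM summary beforehand.

```python
from typing import List, NamedTuple


class GenomeQC(NamedTuple):
    name: str             # hypothetical genome identifier
    completeness: float   # percent, as reported by the quality assessment
    contamination: float  # percent, as reported by the quality assessment


def passes_qc(g: GenomeQC, min_completeness: float = 90.0,
              max_contamination: float = 5.0) -> bool:
    """Keep genomes with completeness >90% and contamination <5%."""
    return g.completeness > min_completeness and g.contamination < max_contamination


def high_quality(genomes: List[GenomeQC]) -> List[str]:
    """Return the names of genomes passing both thresholds."""
    return [g.name for g in genomes if passes_qc(g)]


# Hypothetical values for illustration only.
drafts = [
    GenomeQC("isolate_A", 98.7, 0.5),
    GenomeQC("isolate_B", 85.2, 1.1),   # fails on completeness
    GenomeQC("isolate_C", 95.0, 7.3),   # fails on contamination
]
print(high_quality(drafts))
```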

A total of 72 genomes, including 24 sequenced strains of miBC (Lagkouvardos et al., 2016), 8 genomes of the Altered Schaedler Flora (Wannemuehler et al., 2014) (PRJNA175999-176003, 213740, 213743) and 40 genomes of mGMB (Liu et al., 2020) (released before 26th Feb 2019, PRJNA486904), as well as their coding sequences (CDSs) and translated CDSs, were downloaded from the NCBI RefSeq database and the Integrated Microbial Genomes (IMG) database (Chen et al., 2019). We gathered CDSs from the 86 bacterial genomes (72 downloaded and 14 newly assembled) and filtered out genes shorter than 100 bp. We clustered the CDSs using CD-HIT (v4.5.7, parameters: -G 0 -n 8 -aS 0.9 -c 0.95 -d 0 -r 1 -g 1) (Fu et al., 2012), establishing a gene catalog termed the Mouse Intestinal Cultured Bacteria gene set (MiCB). Detailed information on the included genomes is provided in Supplementary Table 3.

The five public mouse-related microbial datasets used in this study include: (i) 184 host-free sequenced mouse gut microbiomes (EBI Project ID PRJEB7759) and the catalog of the mouse gut metagenome (MGGC, GigaDB http://dx.doi.org/10.5524/100114); (ii) 54 sequenced mouse gut microbiomes (EBI Project ID PRJEB10308) and the related gene catalog (FDGC, GigaDB http://dx.doi.org/10.5524/100271); (iii) 830 high-quality de-replicated MAGs and the iMGMC catalog from the iMGMC study (GitHub repository https://github.com/tillrobin/iMGMC); (iv) 40 sequenced mouse fecal metagenomes from EBI Project PRJEB6996; (v) 34 sequenced mouse cecum metagenomes from MG-RAST Project mgp6153.

Construction of EMGC and selection of new genes

All downloaded genes were filtered by length >100 bp and integrated to construct the EMGC using CD-HIT (Fu et al., 2012) (parameters: -G 0 -n 8 -aS 0.9 -c 0.95 -d 0 -r 1 -g 1). The resulting EMGC clusters were analyzed to generate a list of new genes in EMGC that are not present in iMGMC or MGGC. Metagenomes were mapped to gene catalogs by SOAP2 (parameters: -m 0 -x 1000 -c 0.95).
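One way to derive such a list of new genes is to scan the cluster file that CD-HIT writes alongside the representative sequences and keep the clusters that contain no member from the earlier catalogs. The Python sketch below is an illustration under assumptions, not the authors' script: the gene ID prefixes ("MGGC_", "iMGMC_", "PMGC_") are hypothetical, and the parser relies only on the usual .clstr layout (a ">Cluster N" header followed by member lines containing ">gene_id...").

```python
import re
from typing import Dict, Iterable, List, Tuple

# Member lines look like: "0\t1500nt, >MGGC_000001... *"
MEMBER_ID = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster name to the list of member gene IDs."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]            # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = MEMBER_ID.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def new_clusters(clusters: Dict[str, List[str]],
                 known_prefixes: Tuple[str, ...] = ("MGGC_", "iMGMC_")) -> List[str]:
    """Return clusters without any member from the previously published catalogs."""
    return [name for name, members in clusters.items()
            if not any(m.startswith(known_prefixes) for m in members)]


# Hypothetical cluster file content for illustration only.
example = """>Cluster 0
0\t1500nt, >MGGC_000001... *
1\t1420nt, >PMGC_000123... at +/97.50%
>Cluster 1
0\t900nt, >PMGC_000456... *
""".splitlines()

parsed = parse_clstr(example)
print(new_clusters(parsed))
```

In this toy input only the second cluster counts as new, because the first one contains a gene that already exists in an earlier catalog.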

Mapping rates between groups and catalogs were compared by Wilcoxon rank-sum test. The profile of relative gene abundances for the 326 laboratory mice (see Supplementary Table 1 for an overview of these mice) was calculated using EMGC based on the method of Qin et al. (2012). Richness was estimated by the Chao2 index and the incidence-based coverage estimator (ICE), both calculated from the gene abundance profiles (Li et al., 2014). The occurrence and average abundance of new genes were calculated using the relative gene abundance profiles.
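As a rough illustration of a relative gene abundance profile, the sketch below normalizes the number of reads mapped to each gene by gene length and then scales the values so that they sum to one within a sample. This is our reading of the length-normalized approach commonly attributed to Qin et al. (2012); the exact read-counting rules of the original pipeline (for example, the handling of multi-mapped reads) are not reproduced, and all names and numbers are hypothetical.

```python
from typing import Dict


def relative_gene_abundance(read_counts: Dict[str, int],
                            gene_lengths: Dict[str, int]) -> Dict[str, float]:
    """Length-normalize read counts and scale them to sum to 1 within a sample."""
    # Step 1: reads per base for each gene (a proxy for gene copy number).
    per_base = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(per_base.values())
    if total == 0:
        return {g: 0.0 for g in per_base}
    # Step 2: divide by the sample total to obtain relative abundances.
    return {g: value / total for g, value in per_base.items()}


# Hypothetical sample: equal read counts, but gene_2 is half as long as gene_1.
counts = {"gene_1": 100, "gene_2": 100}
lengths = {"gene_1": 1000, "gene_2": 500}
profile = relative_gene_abundance(counts, lengths)
print(profile)
```

With equal read counts, the shorter gene receives the higher relative abundance (two thirds versus one third here), which is the purpose of the length normalization.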

Taxonomic and functional annotation

Genes predicted from metagenome assemblies were taxonomically annotated by Kaiju (Menzel et al., 2016) using the NCBI-NR database (released on Jan 5th, 2019) with the parameters '-a greedy -e 5 -E 0.01 -v -z 4'. For genes from mouse gut-related bacterial genomes, we kept the original taxonomic information of the genomes and assigned it to the corresponding genes. All genes were searched against KEGG (Kanehisa & Goto, 2000) (version 89) by BLAST (Altschul et al., 1990; Gish & States, 1993) for functional annotation. Parameters were set to '-e 0.01 -F T -b 100 -K 1' and results were filtered by score >60. BLAST results were converted into functional annotations based on the information provided by KEGG to create a gene-KO list for the generation of KO relative abundance profiles. The taxonomic and functional information of the new genes in EMGC was extracted for further analysis.
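The step from filtered BLAST hits to KO relative abundance profiles can be sketched as follows. This is an assumption-laden illustration: hits are given as (gene, KO, score) tuples, the best-scoring hit above the threshold is taken as the annotation of a gene, and the KO abundance is the sum of the relative abundances of its genes. The source states the score threshold but not how ties or multiple hits were resolved; the gene names and KO identifiers below are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def best_ko_hits(hits: Iterable[Tuple[str, str, float]],
                 min_score: float = 60.0) -> Dict[str, str]:
    """Build a gene-to-KO list from hits, keeping the best hit with score > min_score."""
    best: Dict[str, Tuple[str, float]] = {}
    for gene, ko, score in hits:
        if score > min_score and (gene not in best or score > best[gene][1]):
            best[gene] = (ko, score)
    return {gene: ko for gene, (ko, _) in best.items()}


def ko_profile(gene_abundance: Dict[str, float],
               gene_to_ko: Dict[str, str]) -> Dict[str, float]:
    """Sum gene relative abundances per KO; unannotated genes are skipped."""
    profile: Dict[str, float] = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        ko = gene_to_ko.get(gene)
        if ko is not None:
            profile[ko] += abundance
    return dict(profile)


# Hypothetical hits and abundances for illustration only.
hits = [
    ("g1", "K00001", 120.0),
    ("g1", "K00002", 80.0),    # weaker hit for the same gene, ignored
    ("g2", "K00001", 75.0),
    ("g3", "K00003", 50.0),    # below the score threshold, dropped
]
gene_to_ko = best_ko_hits(hits)
kos = ko_profile({"g1": 0.5, "g2": 0.25, "g3": 0.25}, gene_to_ko)
print(gene_to_ko, kos)
```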

Evaluation of the effect of diet on the gut microbiota

The differences between sample groups regarding mapping rate were assessed by Wilcoxon rank-sum test (R ggpubr package). The significance threshold for mapping rate was set at p <0.05. To evaluate the consistent effect of diet among providers and mouse strains on the taxonomic and functional composition of gut metagenomes, samples from 7 groups fed high-fat (HF) or low-fat (LF) diets and representing different providers and mouse strains were selected (Supplementary Table 1). The calculation of relative abundance profiles for taxa and KOs was based on Qin et al. (2012). A permutational multivariate analysis of variance (PERMANOVA) test was applied to determine the influence of diet on gut metagenomes within different groups. The Shannon index of the relative abundance profiles was used to estimate the alpha diversity of the samples. Principal Coordinates Analysis (PCoA) of selected samples was performed based on the relative abundance profiles using Bray-Curtis distance (R ape4 package) to visualize the effect of diet on the bacterial composition of the gut microbiota. The Wilcoxon rank-sum test was used to analyze differences in genus and KO relative abundance profiles. P-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg (BH) method. A BH-adjusted P-value < 0.05 was considered statistically significant.
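For readers unfamiliar with these quantities, the following self-contained Python sketch shows the Shannon index, the Bray-Curtis dissimilarity and the Benjamini-Hochberg adjustment in their textbook forms. The analyses in this study were run in R; this sketch is not that code, it assumes the natural logarithm for the Shannon index (the source does not state the base), and the input values are hypothetical.

```python
import math
from typing import List, Sequence


def shannon(abundances: Sequence[float]) -> float:
    """Shannon index H = -sum(p * ln p) over features with non-zero abundance."""
    total = sum(abundances)
    return -sum((x / total) * math.log(x / total) for x in abundances if x > 0)


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical inputs for illustration only.
print(shannon([0.5, 0.5]))
print(bray_curtis([1.0, 0.0], [0.0, 1.0]))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Two samples with no shared features have a Bray-Curtis dissimilarity of 1, and identical samples have a dissimilarity of 0; a matrix of such pairwise values is the input to PCoA.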

Metagenomic species clustering

The gene relative abundance profiles of 326 laboratory mice were clustered using the co-abundance Canopy algorithm (Nielsen et al., 2014). Co-abundance genomes (CAGs) that were present in >90% of the samples were chosen, and CAGs with >700 genes were considered Metagenomic Species (MGSs) (Nielsen et al., 2014). CAGs and MGSs were assigned to a given taxon when >50% of their genes belonged to that specific taxon (Nielsen et al., 2014). The taxonomic distribution of MGSs was calculated using the R package 'phytools' (Revell, 2012). The Shannon index was calculated based on the MGS profiles of samples from the 7 selected mouse groups, and the PCoA was generated based on the same profiles.
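The size threshold and the majority rule for taxonomic assignment can be expressed in a few lines. The sketch below is illustrative only: the Canopy clustering itself is not reproduced, and using all genes of a CAG (including unannotated ones) as the denominator of the >50% rule is our assumption, since the source does not specify it.

```python
from collections import Counter
from typing import List, Optional


def is_mgs(n_genes: int, min_genes: int = 700) -> bool:
    """A CAG with more than 700 genes is treated as a metagenomic species."""
    return n_genes > min_genes


def assign_taxon(gene_taxa: List[Optional[str]],
                 min_fraction: float = 0.5) -> Optional[str]:
    """Return the taxon carried by more than half of the genes, else None.

    gene_taxa holds one taxon label per gene; None marks an unannotated gene.
    """
    if not gene_taxa:
        return None
    counts = Counter(t for t in gene_taxa if t is not None)
    if not counts:
        return None
    taxon, n = counts.most_common(1)[0]
    return taxon if n / len(gene_taxa) > min_fraction else None


# Hypothetical CAG with 10 genes: 6 Firmicutes, 3 Bacteroidetes, 1 unannotated.
labels = ["Firmicutes"] * 6 + ["Bacteroidetes"] * 3 + [None]
print(is_mgs(850), assign_taxon(labels))
```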

All MGSs were searched against the 830 high-quality metagenome-assembled genomes (MAGs) and the 115 mGMB genomes by MUMmer3 (v3.23) (Kurtz et al., 2004) for calculation of MUMi values (Bäckhed et al., 2015; Deloger et al., 2009). If the MUMi value for two items was >0.54, these two items were considered to belong to the same species (Bäckhed et al., 2015). The result of the comparison between MGSs and MAGs was hierarchically clustered using the R function 'hclust' with the weighted pair group method with averaging (WPGMA) and then visualized in an annotated cladogram using the R package 'ggtree' (Yu, 2020; Yu et al., 2017). The result of the comparison between MGSs and mGMB genomes is presented in Supplementary Table 9.

Results

Construction and evaluation of EMGC

Fecal samples from 88 C57BL/6J male mice were sequenced using the BGISEQ-500 platform, providing 1,098 gigabases (Gb) of high-quality host-free data with an average of 12.47 Gb per sample (Supplementary Table 2), from which we generated a catalog comprising 2,602,584 non-redundant genes (PMGC). We next used 72 mouse gut-related bacterial genomes (Wannemuehler et al., 2014; Lagkouvardos et al., 2016; Brugiroux et al., 2017; Liu et al., 2020) from IMG and NCBI RefSeq and 14 high-quality genomes (completeness >90% and contamination <5%) assembled from reads accessible from PRJEB10572 (Lagkouvardos et al., 2016) (Supplementary Table 3) to generate the Mouse Intestinal Cultured Bacteria gene set (MiCB). Together with MGGC (Xiao et al., 2015), FDGC (Xiao et al., 2017) and iMGMC (Lesker et al., 2020), downloaded from GigaDB and the GitHub repository, respectively, all gene catalogs were integrated to construct an expanded non-redundant mouse gut bacterial gene catalog (EMGC) (Figure 1). The expanded catalog comprises 5,862,027 genes, which is more than twice the number of genes in the MGGC (Xiao et al., 2015) and more than one million genes more than in the iMGMC catalog (Lesker et al., 2020) (Table 1). Thus, 18.93% of the genes in EMGC are not represented in either the iMGMC or the MGGC (Supplementary Figure 1).

To compare the performance of EMGC relative to MGGC and iMGMC, we mapped sequencing reads from the FDGC, MGGC, and PMGC studies to the three catalogs. Of the sequencing reads from PMGC, 55.72% were mapped to MGGC and 56.56% to iMGMC, whereas the EMGC allowed mapping of 79.52% of the reads (Figure 2A), close to the maximum achievable mapping rate in prokaryotes (Li et al., 2014).

Comparing mapping rates of reads from the 326 fecal samples obtained from different mouse strains and providers to the EMGC demonstrated that mice from the Laboratory Animal Center at Sun Yat-Sen University exhibited a lower mapping rate than samples from the other providers, for which the median mapping rates were higher than 80% (Figure 2B). Median mapping rates of reads obtained from all mouse strains were also higher than 80% (Supplementary Figure 2A). Richness estimated by Chao2 indicated that our EMGC covered 98.24% of the genes in the 326 fecal samples (Supplementary Figure 2B), whereas ICE suggested that 97.49% of the genes were covered.
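To make the coverage figures easier to interpret, the sketch below computes the classic Chao2 incidence-based richness estimate and the resulting coverage (observed genes divided by estimated total genes). It uses the textbook form of the estimator with its bias-corrected fallback; whether the study used exactly this variant is not stated in the text, and the toy incidence counts are hypothetical.

```python
from typing import Sequence


def chao2(incidence_counts: Sequence[int], n_samples: int) -> float:
    """Classic Chao2 estimate of total richness.

    incidence_counts[i] is the number of samples in which gene i was detected.
    Q1 and Q2 are the numbers of genes detected in exactly one and two samples.
    """
    s_obs = sum(1 for c in incidence_counts if c > 0)
    q1 = sum(1 for c in incidence_counts if c == 1)
    q2 = sum(1 for c in incidence_counts if c == 2)
    scale = (n_samples - 1) / n_samples
    if q2 > 0:
        return s_obs + scale * q1 * q1 / (2 * q2)
    # Bias-corrected form used when no gene occurs in exactly two samples.
    return s_obs + scale * q1 * (q1 - 1) / 2


def coverage(incidence_counts: Sequence[int], n_samples: int) -> float:
    """Fraction of the estimated gene richness that has been observed."""
    s_obs = sum(1 for c in incidence_counts if c > 0)
    return s_obs / chao2(incidence_counts, n_samples)


# Hypothetical toy data: 6 genes surveyed across 5 samples.
toy = [1, 1, 2, 3, 5, 0]
print(chao2(toy, 5), coverage(toy, 5))
```

A coverage close to 1 means that few genes are seen in only one or two samples, so additional samples from the same population are expected to contribute few new genes.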

To further evaluate the quality of the EMGC, we mapped the metagenomic data obtained from 40 fecal samples from control mice and mice that had consumed non-caloric artificial sweeteners (Suez et al., 2014) and metagenomic data obtained from 34 cecal samples from control mice and mice treated with prebiotics (Everard et al., 2014). For the reads obtained in the study of Suez et al., 63.07% mapped to the MGGC and 61.04% to iMGMC, whereas 72.94% mapped to the EMGC (Figure 2C; Supplementary Table 4). For the reads obtained from the study of Everard et al., 61.99% mapped to MGGC and 60.97% mapped to iMGMC, but 73.01% of the reads mapped to the EMGC (Figure 2D; Supplementary Table 5). Together these results demonstrate a significantly increased mapping rate of reads using the EMGC as a reference.

Taxonomic and functional characteristics of EMGC

We taxonomically annotated the genes of EMGC using Kaiju (Menzel et al., 2016) and the NCBI NR database to provide an overview of the taxonomic composition, visualized by a Krona plot (Ondov et al., 2011). This plot revealed that 67% of the genes could be annotated (Supplementary Figure 3). We assigned 54.20% of the genes to the phylum level and 44.30% of the genes to the family level (Supplementary Figure 4A, 4B). We next annotated the genes in the EMGC using the KEGG (release 89) database (Kanehisa & Goto, 2000) and identified 6571 KEGG functional orthologs (KOs) and 292 KEGG pathways (Supplementary Figure 5).

To further examine the quality of the EMGC, we calculated the occurrence frequency and average abundance of the 1,109,381 genes not present in the previous MGGC and iMGMC. As shown in Figure 3A, 40.41% of these genes exhibited an occurrence frequency higher than 0.1 and a mean abundance higher than 10⁻⁸. We also extracted taxonomic and functional information for these new genes. Annotation of the genes not present in the MGGC at the species level revealed that the top 5 species could be assigned to Oscillibacter sp. 1-3, bacterium ASF500, Acetatifactor muris, bacterium 1xD42-67 and Eubacterium plexicaudatum (Figure 3B), all isolated from the mouse gut according to the NCBI BioSample Database. In relation to functions, the general distribution of KEGG pathways in these additional genes is similar to the overall distribution in EMGC (Figure 3C). We identified 190 KOs in EMGC that are not present in either MGGC or iMGMC. Furthermore, 46 KEGG pathways are covered by additional KOs in EMGC (Supplementary Table 6). For 4 KEGG pathways, namely betalain biosynthesis, xylene degradation, neomycin, kanamycin and gentamicin biosynthesis, and lipopolysaccharide biosynthesis, we found that more than 5% of the additional KOs are represented only in EMGC compared to iMGMC (Figure 3D, Supplementary Table 6).

Changes in the microbiota composition and functional potential

We have earlier reported that the mouse gut metagenome is affected by animal providers and housing as well as strain and diet (Xiao et al., 2015). In order to investigate to what extent diet affected the gut metagenomes independently of strain and providers, we selected 7 groups (G1-G7) of samples from different strains from different providers fed a low-fat (LF) diet or a high-fat (HF) diet and housed in the same facility (details in Supplementary Table 1). We estimated the impact of diet on the variation of the gut microbiota based on the relative abundance profiles of genera and KOs using PERMANOVA. The analyses indicated that diet explained at least 33.9% (P value=0.003) of the total variation at the genus level and 47.3% (P value=0.006) at the KO level (Figure 4A, Supplementary Figure 6A). Compared to mice fed a LF diet, mice fed a HF diet exhibited an increase in alpha diversity at the genus level independent of housing laboratories, strains and providers (Figure 4B). In contrast, at the KO level, mice fed a LF diet generally exhibited a higher alpha diversity than mice fed a HF diet, with group 7 as the exception (Supplementary Figure 6B). Principal coordinate analysis (PCoA) similarly confirmed that diet strongly influenced the genus profile (Figure 4C) and the KO profile (Supplementary Figure 6C).

To further examine diet-induced changes, we examined genera and KOs enriched in samples from either HF or LF fed mice by Wilcoxon rank-sum test. In all 7 groups, the 36 genera found at higher abundance in samples from HF fed mice belong to Firmicutes, whereas four genera within the Bacteroidetes phylum and one genus within the Fibrobacteres phylum were found at higher abundance in samples from LF fed mice (Figure 4D, Supplementary Table 7). However, whereas 324 KOs were enriched in LF-fed mice, only 233 KOs were detected at higher abundance in HF than in LF fed mice (Supplementary Table 8). To investigate which taxa contributed to the disparate response to LF and HF diet at the taxonomic and the functional levels, we identified the taxa at the phylum level that contributed to the enrichment of KOs. Whereas genera within the Bacteroidetes phylum were the main contributors to KOs enriched in LF-fed mice, genera within the Firmicutes phylum contributed to HF-enriched KOs (Supplementary Figure 7).

Construction of metagenomic species

We identified 902 Metagenomic Species (MGSs, >700 genes) using the relative gene abundances based on 326 fecal samples obtained from different mouse strains and providers, applying MGS Canopy Clustering and taxonomic annotation as described previously (Nielsen et al., 2014). The 902 MGSs were assigned to 122 taxa (Supplementary Figure 8; Supplementary Table 9). We also generated MGS profiles for the 7 groups of mice fed a LF or a HF diet. The Shannon indices and PCoA plot revealed a clear effect of diet, independent of mouse strain and provider (Supplementary Figure 9A & 9B).

We next compared the 902 MGSs with the 830 high-quality non-redundant MAGs generated in the iMGMC project (Lesker et al., 2020) and 115 bacterial genomes from the mGMB project (Liu et al., 2020). In total, 559 (61.97%) MGSs were classified as the same species as MAGs from the iMGMC project (maximal unique match index (MUMi) value (Bäckhed et al., 2015; Deloger et al., 2009; J. Li et al., 2020) >0.54, Supplementary Table 9). As shown in Figure 5, Firmicutes and Bacteroidetes were the most prevalent phyla among all MGSs and MAGs. We also identified 56 MGSs representing genomes of species from the mGMB project, and of these, 8 MGSs could be identified as mGMB genomes, but not as MAGs (Supplementary Table 9). Of note, for more than one third of the MGSs we were unable to identify corresponding entities in the MAG collection or in the cultured genomes collection.

Discussion

The EMGC represents the most comprehensive catalog of genes in the mouse gut microbiome to date. It covers samples from feces and cecum, different mouse strains fed different diets, obtained from different providers and housed in different laboratories. The majority of the genes identified in this study were assigned to known species, which might improve the coverage of known species and the detection of low-abundance taxa. The improvement in KO coverage of a number of pathways will enhance the functional characterization of the mouse gut microbiota. In addition, the analysis of samples from different mouse strains from different animal providers and different housing laboratories confirms the pronounced effect of diet on the taxonomic and functional composition of the gut microbiota.

In spite of the increased number of genes in EMGC, there are still some limitations. The sample size and the variety of sample types in the EMGC are still small. The majority of samples included in EMGC were collected from C57BL/6 mice, which might affect the applicability in studies on other laboratory mouse strains and wild-caught mice. Many confounding factors other than those addressed in the present study will most probably impinge on the gut microbiota (Hugenholtz & de Vos, 2018; Laukens et al., 2016; Nguyen et al., 2015; Perlman, 2016). It therefore seems important to include more samples from other mouse strains to gain further insights into the effect of confounding factors, which may lead to pronounced variability in the mouse gut microbiota, which in turn might limit the reproducibility of biomedical research using mouse models (Laukens et al., 2016; Stappenbeck & Virgin, 2016). Although both culture-independent and culture-dependent studies on the mouse gut microbiota have been carried out to improve the understanding of host-microbe interaction in mouse models, the majority of mouse gut metagenome members still remain relatively uncharacterized (Lagkouvardos et al., 2016; Lesker et al., 2020; Liu et al., 2020; Xiao et al., 2015).

Even though more and more metagenomic analysis methods are being developed at an increasing speed, gene catalogs are still necessary resources, not only to provide reliable and consistent taxonomic annotation, but also to reduce the gap between phylogenetic and functional biases (Li et al., 2014; Nayfach & Pollard, 2016). The high-quality reference gene catalog, EMGC, together with the 902 metagenomic species, can support and improve accurate metagenome-wide association analyses using mouse models, which may assist in the functional characterization of observed correlations between the microbiota composition and functional potential in relation to host phenotypes.

Acknowledgements
This work was funded by the National Natural Science Foundation of China (Grant No. 81670606), the National Key Research and Development Program of China (No. 2018YFC1313800) and the Shenzhen Engineering Laboratory of Detection and Intervention of Human Intestinal Microbiome (No. DRC-SZ [2015]162). This work was supported by China National GeneBank and CNSA (CNGB Nucleotide Sequence Archive) of CNGBdb. We thank the sequencing and bioinformatic staff of BGI-Shenzhen, especially Dr. Dan Wang, Ms. Ying Xu, Ms. Fangming Yang, Mr. Jie Zhu, Dr. Jinghong Yu and Dr. Guangwen Luo, for help and advice in our work.

Data availability
The host-free data sequenced in this study have been deposited in CNSA (CNGB Nucleotide Sequence Archive) of CNGBdb under accession number CNP0000619 (metagenomic sequencing reads and assemblies of 88 mouse samples, together with the EMGC dataset).

Author contributions
J.Z., X.L. and Liang X. managed the project. X.L., Y.Z., H.Z., M.L. and Liang X. supervised the project. X.L. provided samples. J.Z., H.R., H.Z., G.L. and Liang X. designed the analysis. J.Z., H.R. and G.L. performed the analysis. J.Z. wrote the paper. J.Z., H.R., H.Z., X.L., Y.Z., M.H., M.L., K.K. and Liang X. revised the paper.

Competing financial interests
The authors declare no competing financial interests.

Table 1 General features of gene catalogs

| Catalog | Sample size | Total number of ORFs | Total length (bp) | Average length (bp) | N50 | N90 | Max length (bp) | Minimum length (bp) |
|---|---|---|---|---|---|---|---|---|
| PMGC† | 88 | 2,602,584 | 1,920,079,578 | 737.76 | 981 | 396 | 120489 | 102 |
| MGGC‡ | 184 | 2,572,074 | 1,959,483,705 | 761.83 | 1014 | 408 | 120489 | 102 |
| FDGC§ | 54 | 793,847 | 585,096,360 | 737.04 | 978 | 396 | 23610 | 102 |
| MiCB¶ | NA | 267,801 | 251,087,538 | 937.59 | 1206 | 504 | 79287 | 100 |
| iMGMC†† | 292 | 4,499,720 | 3,505,479,714 | 779.04 | 1107 | 390 | 120399 | 102 |
| EMGC‡‡ | 434 | 5,862,027 | 4,542,473,508 | 774.90 | 1104 | 393 | 120489 | 100 |


†PMGC: a sub mouse gut gene catalog from 88 mouse gut metagenomes
‡MGGC: the gene catalog of the mouse gut metagenome released in 2015
§FDGC: a Feed & Diet Gene Catalog for mice
¶MiCB: Mouse intestinal Cultured Bacteria gene set
††iMGMC: the integrated Mouse Gut Metagenome Catalog
‡‡EMGC: an expanded gene catalog of mouse gut metagenomes
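The summary statistics reported in Table 1 can be illustrated with a small sketch. It assumes the conventional definition of Nx (the largest length L such that sequences of length at least L together cover at least x% of all bases); the source does not spell out its definition. The function names and the toy lengths are hypothetical and are not taken from any of the catalogs.

```python
def nx_length(lengths, fraction):
    """Return the Nx length for the given fraction (0.5 for N50, 0.9 for N90).

    Assumed definition: walk the sequences from longest to shortest and
    return the length at which the running total first reaches
    `fraction` of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)  # only reached if fraction > 1


def catalog_summary(lengths):
    """Compute the per-catalog columns of a Table 1-style summary."""
    total = sum(lengths)
    return {
        "total_number": len(lengths),
        "total_length": total,
        "average_length": round(total / len(lengths), 2),
        "n50": nx_length(lengths, 0.5),
        "n90": nx_length(lengths, 0.9),
        "max_length": max(lengths),
        "min_length": min(lengths),
    }


# Hypothetical toy ORF lengths in bp, for illustration only.
toy_lengths = [102, 300, 450, 600, 900, 1200]
summary = catalog_summary(toy_lengths)
print(summary)
```

Because N50 and N90 weight each ORF by its length, they shift more than the average does when a few very long ORFs are added.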

[Figure 1 flowchart. Metagenomic data preprocessing: mouse gut microbiota (n=88) → quality control [OAs1] → host removal [SOAP2] → individual assembly [SPAdes v3.14.0] → gene prediction [MetaGeneMark] → predicted CDSs of metagenomes → gene clustering [CD-HIT v4.5.7] → PMGC. Genomic data preprocessing: PRJEB10572 raw reads (n=43 samples) → quality control [Trimmomatic v0.39] → assembly [SPAdes v3.14.0] → assessment [CheckM v1.0.11] → qualified genomes (n=14) → gene prediction [GeneMarkS-2] → predicted CDSs of genomes, combined with CDSs of genomes downloaded from NCBI RefSeq & IMG (miBC, n=24; ASF8, n=8; mGMB, n=40) → gene clustering [CD-HIT v4.5.7] → MiCB. Downloaded data: MGGC, FDGC, iMGMC. PMGC, MiCB, MGGC, FDGC and iMGMC → gene clustering [CD-HIT v4.5.7] → EMGC.]

Figure 1 Construction of the EMGC. Metagenomic sequencing data of 88 mouse gut metagenomes were processed by the pipeline as displayed to generate the non-redundant genes of PMGC. Unassembled strains of miBC (under BioProject PRJEB10572) were assembled, and the assembled genomes were filtered by genome quality (completeness >90%, contamination <5%). Qualified genomes were used for gene prediction. CDSs from assembled genomes and downloaded genomes were gathered and clustered into MiCB. PMGC and MiCB, together with three downloaded gene sets (FDGC, MGGC and iMGMC), were merged to generate EMGC.
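The genome quality filter named in the Figure 1 caption (completeness >90%, contamination <5%) can be sketched as follows. This is a minimal illustration that assumes the two estimates are already available as percentages, for example parsed from a quality-assessment report. The `GenomeQuality` record, the helper names and the strain values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GenomeQuality:
    """Hypothetical record holding quality estimates for one assembled genome."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


# Thresholds as stated in the Figure 1 caption.
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


def is_qualified(genome):
    """Keep a genome only if both strict inequalities from the caption hold."""
    return (genome.completeness > MIN_COMPLETENESS
            and genome.contamination < MAX_CONTAMINATION)


def filter_qualified(genomes):
    """Return the genomes that pass the quality filter, in input order."""
    return [g for g in genomes if is_qualified(g)]


# Hypothetical example values, not results from the study.
genomes = [
    GenomeQuality("strain_A", 98.5, 0.8),   # passes both thresholds
    GenomeQuality("strain_B", 88.0, 1.2),   # completeness too low
    GenomeQuality("strain_C", 95.1, 7.4),   # contamination too high
    GenomeQuality("strain_D", 90.0, 0.0),   # exactly 90% is not > 90%
]
qualified_names = [g.name for g in filter_qualified(genomes)]
print(qualified_names)
```

The strict inequalities mirror the caption; whether boundary values such as exactly 90% completeness were kept in the original analysis is not stated in the source.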

Figure 2 Performance of the EMGC. (A) Comparison of mapping rates between MGGC, iMGMC and EMGC. **** represents p values <0.0001 by Wilcoxon rank-sum test. (B) Mapping rates by sample provider, including the Wallenberg Laboratory for Cardiovascular and Metabolic Research (CMR), the Jackson Laboratory in the US (JUS), the Laboratory Animal Center of Southern Medical University (SMU), the Laboratory Animal Center of Sun Yat-Sen University (SYSU), and Taconic in Denmark (TDK) and the US (TUS). The dashed line represents a mapping rate of 80%. (C) & (D) Density curves of the mapping rates of fecal metagenomes and cecal metagenomes that were not included in gene catalog construction. Dashed lines represent the average mapping rate of each gene catalog.


[Figure 3 image. Panel axes: (A) occurrence of gene vs. relative abundance of gene (log10); (C) gene count per KEGG second-level pathway category, colored by first-level category (Cellular Processes, Environmental Information Processing, Genetic Information Processing, Metabolism); (D) % coverage per pathway, with legend entries MGGC, improvement in iMGMC and improvement in EMGC.]

Figure 3 Description of new genes included in EMGC. (A) 2D density histogram showing the distribution of occurrence and mean relative abundance of new genes. (B) Overview of the taxonomic composition of new genes, displayed with Krona. (C) Frequency of functional pathways associated with the new genes. (D) Stacked histogram of the KO coverage of functional pathways improved in EMGC, compared to MGGC and iMGMC. Coverage in a given pathway is calculated as $(\text{number of annotated KOs} / \text{total number of KOs}) \times 100\%$.
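The KO coverage used in panel (D) can be illustrated with a short sketch. It assumes that coverage is the percentage of a pathway's KOs that are annotated in a given catalog; the function name and the KO identifiers are hypothetical placeholders.

```python
def ko_coverage(annotated_kos, pathway_kos):
    """Percentage of the KOs in one pathway that are annotated in a catalog.

    KOs annotated in the catalog but not belonging to the pathway are ignored.
    """
    pathway = set(pathway_kos)
    if not pathway:
        raise ValueError("pathway_kos must not be empty")
    covered = pathway & set(annotated_kos)
    return 100.0 * len(covered) / len(pathway)


# Hypothetical KO identifiers, for illustration only.
pathway = {"K00001", "K00002", "K00003", "K00004"}
older_catalog = {"K00001", "K00002", "K09999"}
newer_catalog = {"K00001", "K00002", "K00003"}

coverage_old = ko_coverage(older_catalog, pathway)
coverage_new = ko_coverage(newer_catalog, pathway)
improvement = coverage_new - coverage_old
print(coverage_old, coverage_new, improvement)
```

Under this reading, the stacked bars in panel (D) correspond to a baseline coverage plus the increments gained by the larger catalogs.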


[Figure 4 image. Panel axes: (A) variance explained by diet (%) for groups G1 to G7; (B) Shannon index for each group and diet (G1_LF to G7_HF); (C) PCoA1 (56.37%) vs. PCoA2 (17.53%); (D) relative abundance (log10) of genera, grouped by phylum (Firmicutes, Fibrobacteres, Bacteroidetes).]

Figure 4 The influence of diet on the composition of the microbiota. (A) PERMANOVA to estimate the influence of diet on the composition of gut metagenomes among all 7 sample groups. G1: C57/BL6 mice provided by Taconic Denmark (TDK) and hosted at the National Institute of Nutrition and Seafood Research of Norway (NIFES); G2: Sv129 mice provided by TDK and hosted at NIFES; G3: C57/BL6 mice provided by the Jackson Laboratory in the US (JUS) and hosted at Pfizer-I; G4: C57/BL6 mice provided by Taconic US (TUS) and hosted at Pfizer-I; G5: C57/BL6 mice provided by TDK and hosted at the University of Copenhagen (KU); G6: Sv129 mice provided by TDK and hosted at KU; G7: C57/BL6 mice provided by the Laboratory Animal Center of Sun Yat-Sen University (SYSU). (B) Shannon index (** represents p value <0.01, Wilcoxon rank-sum test) and (C) PCoA based on the genus profile for the 7 groups fed the HF and LF diets. (D) Genera differentially enriched (BH-adjusted p-value <0.05, Wilcoxon rank-sum test; relative abundance >1e-5) between mice fed the HF and LF diets among all 7 groups. Genera in light red are enriched in HF-fed mice, while genera in light green are enriched in LF-fed mice.
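The Shannon index shown in panel (B) can be computed from a genus-level abundance profile as in the sketch below. It assumes the natural logarithm and the standard form H = -Σ p_i ln p_i; the source does not state the log base used. The function name and the example profiles are hypothetical.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with p_i > 0.

    `abundances` may be counts or relative abundances; they are
    renormalised to proportions before the index is computed.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    h = 0.0
    for a in abundances:
        if a > 0:  # taxa with zero abundance contribute nothing
            p = a / total
            h -= p * math.log(p)
    return h


# Hypothetical genus profiles, for illustration only.
even_profile = [25, 25, 25, 25]      # four equally abundant genera
skewed_profile = [97, 1, 1, 1]       # one dominant genus

h_even = shannon_index(even_profile)
h_skewed = shannon_index(skewed_profile)
print(h_even, h_skewed)
```

An even community of four genera reaches the maximum ln(4), while a profile dominated by one genus scores much lower, which is the kind of contrast the per-group comparison in panel (B) tests.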


3 M e

0 MGS00335 e e 4 x G 8 n l C 6 t 6 0 4 1 i r 3 9 G t 1 2 i e 7 S MGS00227 8 2 n r S 7 x G 0 i 6 l 1 6 a C x 9

3 xt n t 3 G r 3 G S M 8 g 7 n G e 0 6 a 8 G 4 S MGS00234 4 G r 4 6 7 057 6 a 8 t g M 0 8 r 0 3 t 0 0 9 S C 7 C 3 g 1 a w 1 r S l h 0 9 g 3 0 r

7 0 r S

G 2 S 2 0 2 0 e MGS00361 2 M S 0 2 l a 0 S 8 0 S 0 9 7 0 a 1 l S 9 e 0 G 0 i xt 0 2 l 3 0 w 4 S i M e 0 M 2 3 1 S 0 4 0 n e 0 MGS00403 0 0 0 S l r 4 0 0 0 2 0 0 0 M 2 S 0 0 S 0 e 7 0 MGS00112 1 8 0 i S 0

1 2 2 2 0 a 0 M C i C 0 0 S 0 S G C G 3 0 0 7 0 7 C 0 G S i 8 0 l S 0 M 0 m xt 7 7 0 2 G S 0 w MGS00209 2 1 1 2 C 9 S 0 C G 1 4 7 S h 0 MGS00042 0 S S 1 0 e M S G 2 F M 4 4 M 5 m M 7 M G 1 1 7 i 5 7 i G i i C 7 1 r e M 4 2 i h 3

1 G l 6 7 9 1 M G n C 2 G 4 5 G 9 G G 7 M 6

e C 5 G 6 8 9 8 i e i M 6 4 5 i 5 xt M 0 r a n 9 3 m M 1 h

6 S M 3 e 5 2 M M i M M MGS00050 e MGS00866 3 e M a i 2 2 1 7 m i 0 r e 3 C 9 i i 3 6 1 5 1 m 0

a 3 6 1 1

l i 1 xt a 6 C e r 3 i 3 0 S m 3

r 2 4 t 3 1 1 e l S 4 2 t i 4 4 1 6 m 4 S x w 5 4 6 1 l e x 0 2

C i 3 8 2 2 w 9 e a 7 M 4 5 a l e 0 e a r 6 1 i 4 7 r 3 1 1 w a e t r 7 5 4 l t t 3 r

1 1 1 1 1 2 2 x w e t 1 r 5 x l 5 g x 1 0 1 9 9 e 2 2

e x C 4 e 8 g 5 n l e

2

xt 3 0 1

i 9 e

e

7

n l g h e 0 3 s i

8

7 n g 7 s 0 i 2

n

s 7 1 i

2

s 1

6 1

m

p

3

3

Figure 5 Phylogenetic tree of the 902 MGSs and 830 high-quality iMGMC MAGs. MUMi distances for MGSs and MAGs were used to construct the phylogenetic tree by hierarchical clustering. MAGs are shown as red branches and MGSs as grey branches. The outer ring shows the relation between MGSs and MAGs: MGSs with a MUMi value >0.54 are marked as “Ann_MGS” in green blocks, and all other MGSs are in purple blocks. MAGs are all in grey blocks. Colored blocks in the inner ring indicate the phyla assigned to MGSs and MAGs.
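As an illustration of how a tree can be built from pairwise distances by hierarchical clustering, the following minimal sketch applies average-linkage agglomerative clustering to a precomputed distance matrix. The genome labels and distance values are hypothetical toy data, not MUMi values from this study, and the choice of average linkage is an assumption, since the caption does not state which linkage was used.

```python
from itertools import combinations


def average_linkage_tree(labels, dist):
    """Agglomerative clustering (average linkage) on precomputed distances.

    labels: list of leaf names.
    dist:   dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a nested tuple giving the topology of the resulting rooted tree.
    """
    # Each active cluster is keyed by its (nested-tuple) subtree and
    # stores the list of leaves it contains.
    clusters = {name: [name] for name in labels}

    def mean_distance(c1, c2):
        # Average of all leaf-to-leaf distances between the two clusters.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: mean_distance(*pair))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members

    return next(iter(clusters))


# Hypothetical toy distances between two MGSs and two MAGs.
labels = ["MGS_a", "MGS_b", "MAG_x", "MAG_y"]
toy = {
    ("MGS_a", "MAG_x"): 0.10,
    ("MGS_b", "MAG_y"): 0.20,
    ("MGS_a", "MGS_b"): 0.80,
    ("MGS_a", "MAG_y"): 0.85,
    ("MGS_b", "MAG_x"): 0.90,
    ("MAG_x", "MAG_y"): 0.95,
}
dist = {frozenset(pair): d for pair, d in toy.items()}

tree = average_linkage_tree(labels, dist)
print(tree)
```

In this toy example each MGS is grouped with its closest MAG before the two pairs are joined at the root. For matrices of realistic size (here 902 MGSs and 830 MAGs), an optimized library implementation of hierarchical clustering would be preferable to this quadratic-per-merge sketch.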
