<<

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

1 TITLE: Evaluation of the effects of library preparation procedure and sample 2 characteristics on the accuracy of metagenomic profiles 3 4 Christopher A Gaulkea,b,c,#*, Emily R Schmeltzera*, Mark Dasenkod, Brett M. Tylerd,e, 5 Rebecca Vega Thurbera, Thomas J Sharptona,d,f 6 7 *These authors contributed equally. 8 9 Affiliations 10 aDepartment of Microbiology, Oregon State University, Corvallis, OR 97331 11 bDepartment of Pathobiology, University of Illinois Urbana-Champaign, Urbana, IL 12 61802 13 cCarl R. Woese Institute for Genomic Biology, University of Illinois Urbana- 14 Champaign, Urbana, IL 61802 15 dCenter for Genome Research and Biocomputing, Oregon State University, Corvallis, 16 OR 97331 17 eDepartment of Botany and Plant Pathology, Oregon State University, Corvallis, OR 18 97331 19 fDepartment of Statistics, Oregon State University, Corvallis, OR 97331

20 21 Corresponding Author Information 22 #Christopher A Gaulke: [email protected] 23 24 ABSTRACT

25 Shotgun metagenomic sequencing has transformed our understanding of microbial

26 community ecology. However, preparing metagenomic libraries for high-throughput DNA

27 sequencing remains a costly, labor-intensive, and time-consuming procedure, which in

28 turn limits the utility of metagenomes. Several library preparation procedures have

29 recently been developed to offset these costs, but it is unclear how these newer

30 procedures compare to current standards in the field. In particular, it is not clear if all such

31 procedures perform equally well across different types of microbial communities, or if

32 features of the biological samples being processed (e.g., DNA amount) impact the

33 accuracy of the approach. To address these questions, we assessed how five different

34 shotgun DNA sequence library preparation methods, including the commonly used

35 Nextera® Flex kit, perform when applied to metagenomic DNA. We measured each

1 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

36 method’s ability to produce metagenomic data that accurately represents the underlying

37 taxonomic and genetic diversity of the community. We performed these analyses across

38 a range of microbial community types (e.g., soil, coral-associated, mouse-gut-associated)

39 and input DNA amounts. We find that the type of community and amount of input DNA

40 influence each method’s performance, indicating that careful consideration may be

41 needed when selecting between methods, especially for low complexity communities.

42 However, cost-effective preparation methods we assessed are generally comparable to

43 the current gold standard Nextera® DNA Flex kit for high-complexity communities.

44 Overall, the results from this analysis will help expand and even facilitate access to

45 metagenomic approaches in future studies.

46

47 IMPORTANCE

48 Metagenomic library preparation methods and sequencing technologies continue to

49 advance rapidly, allowing researchers to characterize microbial communities in previously

50 underexplored environmental samples and systems. However, widely-accepted

51 standardized library preparation methods can be cost-prohibitive. Newly available

52 approaches may be less expensive, but their efficacy in comparison to standardized

53 methods remains unknown. In this study, we compared five different metagenomic library

54 preparation methods. We evaluated each method across a range of microbial

55 communities varying in complexity and quantity of input DNA. Our findings demonstrate

56 the importance of considering sample properties, including community type, composition,

57 and DNA amount, when choosing the most appropriate metagenomic library preparation

58 method.

2 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

59 INTRODUCTION

60 Recent advancements in high-throughput sequencing have revolutionized

61 genomic discovery and unlocked new insights regarding the diversity and function of

62 microbial communities (1–4). For example, shotgun metagenomic sequencing has

63 clarified how the functional capacity of the gut microbiome links to human health (5–8),

64 improved the efficacy of antibiotic resistance gene discovery (9–12), identified beneficial

65 soil microbes for agricultural use (13–15), and uncovered novel, medically relevant

66 biosynthetic gene clusters in marine microbes (16–18). However, while metagenomes

67 offer rich opportunity to transform discovery, the financial cost of producing metagenomic

68 data limits their application. Because much of this cost is associated with the preparation

69 of metagenomic DNA for high-throughput sequencing, there is hope that emergent

70 economical products and procedures can expand the scope of metagenomic

71 investigations.

72 Illumina’s Nextera® XT and DNA Flex kits (the latter now known as the “Illumina®

73 DNA Prep”) have been the most widely used methods for preparing metagenomic

74 libraries and have effectively served as industry standard approaches. Indeed, Illumina

75 DNA sequencing platforms remain the most widely utilized for generating genomic and

76 metagenomic data, and their library preparation kits are accordingly used to prepare

77 samples for sequencing. Due to their frequent use, these kits are subject to extensive

78 evaluation and refinement. For example, Illumina recently released an updated version

79 of their “gold standard” Nextera® XT kit, which was rebranded as the Nextera® DNA Flex

80 (and now Illumina® DNA Prep). This new kit allows greater flexibility across a wider range

81 of genomes, from small genomes (microbial and amplicons) to more complex genomes

3 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

82 found in eukaryotic and human systems. The Flex kit also resolved sequencing biases

83 identified in the Nextera® XT kit that occur in genomic regions with extreme GC-content

84 (19). These features of the Nextera® DNA Flex kit have contributed to its broad adoption

85 in metagenomic investigations.

86 One downside to the Nextera® DNA Flex kit is its relatively high price, which

87 presently costs roughly $46 per sample. While this cost may be reasonable considering

88 the demand for the product and its observed efficacy, it is high enough that it limits the

89 scale of many metagenomic investigations. For example, studies performing high-

90 throughput analyses on hundreds or thousands of samples may be forced to utilize non-

91 metagenomic approaches (e.g., 16S rRNA gene sequencing) due to this library

92 preparation expense. In the effort to circumvent this challenge, several alternative and

93 competitive preparation methods have recently been developed and

94 applied to metagenomic investigations. These approaches fall into two categories:

95 methods that increase the economy of the Illumina Nextera® by modifying various

96 aspects of the manufacturing protocols (e.g. Baym et al. 2015 (20), and those that use

97 entirely different technologies (e.g. seqWell plexWell™ 96). These approaches hold great

98 promise to improve the throughput of metagenomic investigations by reducing library

99 preparation costs. For example, the recent method known as ‘Hackflex’ achieves an

100 eleven-fold decrease in per sample reagent costs compared to the Illumina kit protocols

101 (21).

102 Although several alternative library preparation approaches have been assessed

103 from the perspective of , very little is known about their

104 accuracy and precision when applied to metagenomic investigations. It is crucial that the

4 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

105 performance of novel library preparation procedures be specifically assessed in diverse

106 metagenomic communities as different community types provide the unique sequencing

107 challenges not common to traditional whole genome sequencing. For example,

108 metagenomic communities vary in complexity, with some communities having few distinct

109 taxa (e.g., insect gut) while others are very highly diverse (e.g., soil). Library preparation

110 procedures may vary in their ability to unbiasedly sample DNA across the different

111 genomes present in the community. Biological samples vary in their biomass, which

112 affects the amount of whole community DNA that is subject to the library preparation

113 approach. The sensitivity of these approaches to the amount of input DNA may hence

114 impact study outcomes (25–28).

115 To advance the utility of low cost metagenomic library preparation methods, we

116 quantified the performance of five recently developed approaches. Our investigation

117 assessed how different features of metagenomic samples, including community

118 complexity, and biomass impacts the performance of these procedures. In particular, we

119 compared the Illumina Nextera® DNA Flex, a modified DNA Flex protocol (20), QIAGEN®

120 QIAseq FX DNA, Perkin Elmer NEXTFLEX® Rapid DNA-Seq 2.0, and seqWell

121 plexWell™ 96 library preparation methods using community acquired DNA obtained from

122 three different types of microbial communities: low complexity communities (as

123 represented by Acropora hyacinthus microbiomes), moderately complex communities (as

124 represented by Mus musculus mouse fecal microbiomes), and a highly complex

125 community (as represented by a soil microbiome). We also evaluated how each approach

126 performs on a commercially available mock community comprised of ten microbial

127 species. Our analysis clarifies the performance of these approaches across these

5 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

128 different sample conditions and the results will assist investigators in identifying

129 appropriate approaches for their metagenomic investigations.

130

131 RESULTS

132 Library preparation procedure, community type, and input concentration

133 influence metagenomic library characteristics. To determine how metagenomic

134 library characteristics (e.g., insert size, millions of sequences generated) varied across

135 different metagenomic library preparations, we regressed each library characteristic on

136 community type, library preparation, and input DNA concentration (Table 1). We found

137 that these predictor variables statistically affected the following characteristics: median

2 -16 138 fragment size (F(31,48) = 29.94, R = 0.90, P < 2.2x10 ), library concentration (F(31,48) =

2 -14 2 - 139 12.44, R = 0.82, P = 3.51x10 ), library molarity (F(31,48) =11.75, R 0.81, P = 1.09x10

13 2 -16 140 ), sequence read length (F(39,60)= 56.81, R = 0.96, p < 2.2x10 ), number of reads

2 -16 2 141 generated (F(39,60) = 11.69, R = 0.81, P < 2.2x10 ), read GC content (F(39,60 ) = 2285, R

-16 2 142 = 0.99, P < 2.2x10 ), duplication rate (F(8,91) = 2.494, R = 0.11, P = 0.02), and

2 -16 143 percentage of reads filtered (F(39,60) = 1716, R = 0.99, P < 2.2x10 ). Sequence

-04 144 duplication rate was only sensitive to community type (F(3,91) = 5.93, P = 9.0x10 ).

145 Specifically, communities with low microbial diversity such as the coral (t = 2.885, P =

146 4.88x10-03) and mock communities (t = 2.16, P = 0.03) had elevated duplication rates.

147 All other library characteristics were sensitive to interactions between community type,

148 library preparation, and input DNA concentration, and many characteristics were also

149 impacted by the independent effects of these variables. For example, library preparation

6 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

-24 - 150 method (F(3,48) = 148.85, P = 2.61x10 ) and community type (F(3,48) = 11.83, P = 6.39x10

151 06) affected median fragment size independent of the interaction between these

-05 152 parameters and DNA input (F(9,48) = 5.93, P = 1.56x10 ). Library concentration was

-03 153 impacted by community type (F(3,48) = 5.72, P = 1.98 ), library preparation (F(3,48) = 20.75,

-09 -16 154 P = 9.22x10 ), input DNA concentration (F(1,48) = 186.86, P < 2.2x10 ) and their

-05 155 interaction (F(9,48) = 6.09, P = 1.16x10 ). Library molarity was similarly impacted by

-03 156 community type (F(3,48) = 5.63, P = 2.18x10 ), library preparation (F(3,48) = 35.69, P =

-12 -15 157 2.82x10 ), input DNA concentration (F(1,48) = 126.21, P = 4.89x10 ) and their interaction

-04 158 (F(9,48) = 4.40, P = 3.16x10 ).

159 The number of sequences generated was sensitive to preparation procedure

-16 -19 160 (F(4,60) = 41.68, P = 1.10x10 ), community type (F(3,60) = 67.67, P = 3.09x10 ), and DNA

-02 161 input concentration (F(1,60) = 5.62, P = 2.09x10 ) as well as the interaction between these

-02 162 variables (F(12,60) = 2.25, P = 2.00x10 ). The number of sequence reads that were quality

163 filtered and derived from the host genome was also significantly affected by this

-06 164 interaction (F(12,60) = 5.16, P = 7.44x10 ). Metagenome libraries constructed for coral

165 communities had increased levels of quality filtering (t = 62.11, P = 3.66x10-56) while

166 filtering in soil (t = -6.65, P = 9.84x10-09) and mock (t = -6.56, P = 1.39-08) communities

167 was decreased when compared to fecal samples. Read filtering was also consistently

168 increased in samples prepared with the Nextera® Flex Reduced method (t = -12.38, P =

169 3.59-18) and reduced in the samples prepared using the QIASeqFX procedure (t = -4.89,

170 P = 7.84-06).

7 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

171 Metagenomic read characteristics were also significantly impacted by the examined

172 variables. For example, the GC content of reads was significantly dependent on

-95 173 community type (F(3,60) = 28,941.39, P = 9.38x10 ). GC content was also affected by

-44 174 library preparation method (F(4,60) = 442.80, P = 8.71x10 ); a significant effect of the

175 interaction between library preparation, community type, and input DNA concentration

-03 176 (F(12,60) = 2.68, P = 5.98x10 ) was also observed for GC content. Finally, average read

-18 177 length after quality filtering varied by both community type (F(3,60) = 60.78, P = 3.54x10 )

-42 178 and preparation procedure (F(4,60) = 404.25, P = 1.21x10 ). Collectively, these findings

179 indicate that metagenomic library preparation procedures yield distinct library

180 characteristics for different community types.

181 Different library preparation methods result in similar taxonomic profiles of

182 a standardized mock community. While library preparation procedures varied in the

183 resulting metagenomic library and sequence characteristics, it is unclear if this variation

184 results in different downstream assessment of community composition. To address this

185 question, we quantified how each library preparation method predicted the taxonomic

186 composition of a defined mock community. We compared the taxonomic composition

187 generated by each library preparation to the ZymoBIOMICS™ Microbial Community

188 standard’s defined taxonomic composition of the mock community. Strong correlations (r

189 = 0.93 - 0.97, P = 1.29x10-6 - P < 2.2x10-16, fdr < 1.0x10-5) were observed between the

190 MetaPhlAn2 inferred taxonomic abundances and theoretical taxonomic abundances of

191 taxa present in the mock community (Figure 1). To confirm that this was not due to bias

192 in the MetaPhlAn2 database, we also compared the inferred taxonomic abundances

8 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

193 using Kraken2 and observed similar taxonomic associations (r = 0.93 - 0.96, all P <

194 2.2x10-16, fdr < 2.2x10-16) and abundance profiles (Supp. Fig. 1).

195 MetaPhlAn2 results showed that the Nextera® Flex Full and Nextera® Flex

196 Reduced methods, which are widely used in metagenomic studies, had the lowest

197 correlations (r = 0.93 – 0.96, P = 1.29x10-6 - 2.66x10-8) with the theoretical composition

198 of the mock community. The lower correlations produced by these two library preparation

199 procedures are driven by an underestimation of Lactobacillus fermentum and an

200 overestimation of the abundance of Staphylococcus aureus and Enterococcus faecalis.

201 The strongest correlations were observed in the QIASeqFX (r = 0.97, P = 1.21x10-8 -

202 5.79x10-9), plexWell96 (r = 0.96 - 0.97, P = 2.58x10-8 - 1.24x10-8), and NEXTFLEX Rapid

203 (r = 0.96 - 0.97, P = 1.02x10-7 - 2.38x10-8) methods (Figure 1). Together these data

204 indicate that library preparation methods subtly influence some taxonomic estimates but

205 that all methods examined overall performed well at recapitulating simple, defined

206 microbial communities.

207 Community taxonomic profiles are significantly impacted by library

208 preparation procedure and input concentration. Next, we quantified the impact of

209 library preparation methods and input concentrations on the taxonomic profiles of soil,

210 coral, mock, and fecal metagenomes. Library preparation procedure significantly

211 associated with resulting taxonomic microbiome profiles as measured by PERMANOVA

212 in coral (R2 = 0.70, P = 2.00x10-4), feces (R2 = 0.85, P = 2.00x10-4), mock (R2 = 0.72, P =

213 4.00x10-4), and soil (R2 = 0.72, P = 4.00x10-4) (Figure 2A) communities. In the fecal, soil,

9 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

214 and mock communities, no association was found between input concentration and

215 microbiome beta-diversity. However, in coral communities we identified a significant

216 interaction effect between input concentration and metagenomic preparation method (R2

217 = 0.13, P = 3.0x10-2). The taxonomic abundance profiles of each library were highly

218 correlated (r = 0.63 – 1, fdr < 2.2x10-16) across library preparation methods (Figure 2B)

219 suggesting that preparation methodology associates with distinct community profiles.

220 To quantify how variable individual replicates were across different library

221 preparation methods, we examined the dissimilarity in taxonomic beta-diversity within

222 each library preparation method. Methods that yield low intra-concentration (i.e., 1 ng/µL

223 input) dissimilarities indicate high levels of reproducibility, while methods with low inter-

224 concentration dissimilarities (i.e., all concentrations) would indicate that the taxonomic

225 profiles generated using this method are robust to variation in library input concentration.

226 We found that variation in intra-concentration taxonomic dissimilarity was low and

227 significant differences were not observed across library preparation methodologies

228 (Figure 2C). Inter-concentration dissimilarity was also low in fecal, mock, and soil samples

229 but varied significantly across preparation methods in coral (H = 12.28, P = 0.02)

230 potentially due to elevated dissimilarity in the QIASeqFX libraries. Together these data

231 suggest that the taxonomic profiles generated using the methods under investigation are

232 similarly reproducible, but the robustness varied across methods.

233 Community gene abundance profiles are sensitive to library preparation

234 procedures and input DNA concentration. Metagenomic investigations frequently seek

235 to define the genetic diversity of microbial communities. Using the number of distinct gene

10 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

236 families observed in the data (i.e., gene family richness) as well as the functional

237 composition of the community (i.e., gene family beta-diversity), we measured how

238 different library preparation procedures affected determination of a community’s

239 functional capacity. Gene family richness varied by library preparation method and input

2 -08 2 240 concentration in coral (F(9,15) = 31.90, R = 0.92, P = 3.47 x10 ), feces (F(9,15) = 9.13, R

-04 2 -04 241 = 0.75, P = 1.20 x10 ), and mock (F(9,15) = 8.78, R = 0.74, P = 1.51 x10 ) communities,

2 242 while soil richness (F(9,15) = 1.41, R = 0.13, P =0.27) was less sensitive to these effects

243 (Figure 3A). We also observed a significant interaction between library preparation

244 procedure and DNA input on the predicted functional profiles of coral (F(4,15) = 8.00, P =

-03 -03 245 1.16 x 10 ), feces (F(4,15) = 5.64, P = 5.60 x 10 ), and mock microbiomes (F(4,15) = 7.78,

246 P = 1.33 x 10-03). However, these associations were not always consistent across different

247 library preparation procedures. For example, gene richness was elevated in coral

248 samples prepared with the plexWell™96 method (t = 5.53, P = 5.77x10-05), while a

249 contrasting pattern was observed in both fecal (t = -6.38, P = 1.23 x10-05) and soil (t = -

250 1.958, P = 6.91 x10-02) samples, and no difference was identified in mock community

251 samples (t = -0.58, P = 0.57). Similar effects of library preparation method and input

252 concentration were observed on Shannon entropy (Figure 3B). Specifically, significant

2 -05 253 effects were observed for coral (F(9,15) = 11.09, R = 0.79, P = 3.75 x10 ), feces (F(9,15) =

2 -10 2 -05 254 62.29, R = 0.96, P = 1.89 x10 ), mock (F(9,15) = 11.04, R = 0.79, P = 3.85 x10 ), and

2 -03 255 soil (F(9,15) = 5.06, R = 0.60, P = 2.95 x10 ).

256 As measured by PERMANOVA, we found that gene family beta-diversity (Bray

257 Curtis) was significantly associated with library preparation method for coral (F(4,15) = 3.62,

11 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

2 -04 2 -04 258 R = 0.44, P = 2.00 x10 ), fecal (F(4,15) = 7.82, R = 0.60, P = 2.00 x10 ), mock (F(4,15)

2 -04 2 -04 259 = 3.36, R = 0.41, P = 2.00 x10 ), and soil (F(4,15) = 3.39, R = 0.40, P = 2.00 x10 )

260 communities (Figure 4A). However, the association between library preparation

261 procedure and gene family beta-diversity is muted in comparison to taxonomic beta-

262 diversity (Figure 2A), possibly as a result of the increased overall similarity in gene family

263 abundances across samples of the same community type (r = 0.99 – 1.00, P < 2.2x10-

264 16; Figure 4B). Despite this increased similarity between library preparation methods for

265 gene family abundances, the beta-diversity of technical replicates did vary across library

266 preparation methods for coral (H = 11.43, P = 2.21 x 10-02), fecal (H = 9.03, P = 6.02 x

267 10-02), mock (H = 12.1, P = 1.66 x 10-02), and soil (H = 11.5, P = 2.15 x 10-02) samples.

268 However, significant differences were not detected between individual library preparation

269 methods (Figure 4C). The robustness of metagenome beta-diversity to input

270 concentration differed across library preparation methods for coral (H = 24.33, P = 6.86 x

271 10-05), fecal (H = 24.53, P = 6.26 x 10-05), mock (H = 25.65, P = 3.72 x 10-05), and soil (H

272 = 13.83, P = 7.85 x 10-03) samples (Figure 4D). Notably, the plexWell™96 method had

273 elevated variability compared to the other library prep methods in coral, fecal, and mock

274 community samples. In soil, these patterns were mitigated (Figure 4D). Overall, these

275 data demonstrate that, similar to the taxonomic profiles, library preparation methods affect

276 gene diversity profile predictions.

277

278 DISCUSSION

279 Taxonomic and functional profiles predictions are similar across

280 methodology. Although the Nextera® kits are widely used and considered the ‘gold

12 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

281 standard’ for metagenomic sample preparation, their cost can limit researchers from

282 conducting expansive project aims. As applications for metagenomic sequencing

283 continue to increase, researchers are left with the difficult task of balancing the need for

284 high-quality data with the cost of its generation. The development of new protocols that

285 modify the standard Nextera® kit protocol as well as several new economical library

286 preparation kits have the potential to dramatically alter the field by expanding the

287 accessibility of shotgun . However, the quality of libraries prepared using

288 more economical methods varies substantially (29). While prior studies have

289 demonstrated that different library preparation procedures can affect metagenome

290 characteristics (30–33), these studies did not evaluate contemporary procedures nor did

291 they consider the sensitivity of the approaches to different metagenome sample types.

292 Here we demonstrate that library quality as well as taxonomic and functional profiles vary

293 as a function of environmental community type and biomass. Our findings suggest that

294 while researchers need to be aware of differences between kits, overall taxonomic and

295 functional profiles produced by these kits are similar.

296 Several investigations have identified key differences in library characteristics

297 across metagenomic library preparation procedures, often by incorporating multiple study

298 designs. These variations can result in substantial changes in the quality of the

299 metagenomic library and are important considerations in preparation method selection.

300 For example, Baym et al. demonstrated that a custom Nextera® XT protocol yielded a

301 substantially reduced insert fragment length (20). Smaller fragments have higher

302 proportions of adapter contamination in reads, while fragments that are too large may be

303 preferentially lost during the Illumina cluster generation process (34). We observed

13 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

304 significant effects of community type, library preparation procedure, and input DNA

305 concentration on fragment size. In our hands, the Nextera® Flex protocols generated the

306 largest library insert sizes, while the QIAseq FX and plexWell™96 procedures

307 consistently produced the smallest. However, the Nextera® procedures also produced

308 libraries with the lowest average GC content when compared to the other procedures

309 examined. This reduced representation of GC could impact the representation of genes

310 with high GC content and skew both taxonomic and functional profiles (29, 35). The

311 interacting effects of library preparation procedure, community type, and DNA input on

312 GC content further indicates that specific library preparation procedures may have distinct

313 insertion site biases.

314 Comparing library characteristics across environmental sample types, samples

315 with low relative diversity (i.e., coral) had both a higher percentage of duplicate reads and

316 a high number of reads filtered and removed from the resulting libraries regardless of

317 input concentration. This high level of filtering is likely due to the extreme levels of host

318 DNA contaminants relative to the other sample types, and additionally may point to the

319 larger issue of host sequence contamination, regardless of library preparation method, in

320 similar research studies. However, coral samples also had similar levels of precision

321 across kit types, with the exception of lower precision with QIASeq FX, demonstrating

322 that different kit types may still be viable options in other low complexity study systems.

323 Samples with moderate (fecal) and high (soil) microbial diversity had much lower

324 respective average percentage of sequencing reads filtered than coral samples, but with

325 a higher average GC content across libraries than coral samples. Fecal sample libraries

326 had the highest variability in precision between kit types, with the exception of Nextera®

14 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

327 Flex Reduced, likely due to the more complex community composition. However, our

328 study also had a relatively low average sequencing depth across all samples due to

329 sample number and financial constraints; the high intra-community variability in precision

330 that we observed may be resolved with a higher sequencing coverage (37). It is also

331 possible that longer insert fragment sizes introduce greater variability due to lower base

332 quality in produced community composition regardless of sample type (38), though this

333 was not the case for coral, soil, or mock libraries with the longest fragment sizes.

334 In successfully recapitulating the taxonomic profiles of a mock microbial

335 community, all library preparation methods performed similarly overall, however, variation

336 in taxonomic profiles for the environmental sample types showed subtle differences

337 between methods. While higher levels of intra-community variation per method could

338 again be due to low sequencing depth, our results of higher variation for the coral sample

339 types with lower relative diversity are consistent with previous findings that library

340 coverage is increased for highly complex microbial communities. Furthermore, while it

341 may appear that all preparation methods perform poorly in both taxonomic and functional

342 resolution for low (coral) and high (soil) diversity sample types, it must be noted that these

343 profiles may only be as complete as the reference databases used for assignment, and it

344 is well known that these databases are preferentially curated with human microbiome

345 sequences and studies in mind (39).

346 Financial and opportunity costs of metagenomic preparation methods differ.

347 Decreasing the costs of kits and reagents associated with library preparation improves

348 access to metagenomic approaches. The Nextera® DNA Flex Full preparation actualized

349 cost remains the most expensive of the five methods tested, with the NEXTFLEX Rapid

15 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

350 XP and QIASeq FX in the median relative expense range, and Nextera® Flex Reduced

351 prep and plexWell™96 as the most economical choices for metagenomic library

352 generation. However, due to the noted effect of specificity of environmental sample type

353 on performance of preparation method, neither the most economical choice nor the most

354 expensive may necessarily suit every study or generate the highest quality libraries. Due

355 to the effects of preparation procedure, community type, and DNA input on fragment size

356 and both taxonomic and functional profiles of metagenomic samples, comparing

357 communities across multiple study designs may require additional covariates in statistical

358 design. For future studies we recommend incorporating library preparation technique as

359 a potential covariate in statistical design to account for these known differences and

360 potential biases.

361 Conclusion. Collectively, these findings demonstrate that no single metagenomic library

362 preparation approach performed the best across all community types and conditions

363 evaluated. Rather, the performance of approaches varied as function of sample and the

364 amount of input DNA. Consequently, researchers should consider these variables when

365 selecting library preparation approaches, especially when attempting to optimize data

366 quality, accuracy, and precision. To aid in this effort, we provide Figure 5 as a reference

367 guide to aid in choosing preparation methods with cost and performance in mind. Our

368 hope is that this information helps improve the accessibility and utility of metagenomic

369 investigations. Further study is needed to determine what community properties (e.g., GC

370 content, taxonomic diversity, etc.) dictate these differences in library procedure

371 performance in order to generate more generalizable guidance of procedure selection.

372 That said, our results show that the different approaches generally produced relatively

16 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

373 consistent taxonomic and gene family diversity profiles, which indicates that selecting

374 approaches based on cost and ease of implementation may be appropriate for some

375 studies (namely those in which the loss of accuracy and precision is tolerable). However,

376 we recommend careful consideration of community type and the amount of input DNA

377 when selecting a metagenomic library preparation procedure to ensure optimal

378 performance.

379

380 MATERIALS AND METHODS

381 Genomic DNA extraction. Prior to metagenomic library construction, genomic

382 DNA was extracted from environmental samples originating from: soil from the North

383 American Project to Evaluate Soil Health Measurements (40), coral (Acropora

384 hyacinthus), and mammalian feces (Mus musculus; C57BL/6) using methods outlined

385 below. In addition to environmental samples, we used the ZymoBIOMICS™ Microbial

386 Community DNA Standard to more efficiently assess bias associated with library

387 preparation methods on a standard mock community.

388 For coral slurry, coral nubbins preserved in RNA/DNA Shield (ZymoBIOMICS™)

389 were vortexed in 15ml tubes with a combination of ceramic and garnet bead lysing matrix

390 at ~2500 RPM for 25 minutes. DNA was extracted from 300µl of the resulting coral slurry

391 using ZymoBIOMICS™ DNA/RNA Miniprep (Zymo Research Corp., Irvine, CA, USA)

392 following an additional 2-step enzyme incubation to increase recovery of bacterial DNA:

393 1) addition of 30µl chicken egg-white lysozyme (10mg/ml, Novagen® ), 1.8µL mutanolysin

394 (50KU/ml from Streptomyces globisporus ATCC 21553, Sigma-Aldrich), 1.8µL

395 lysostaphin (4KU/ml from Staphylococcus staphylolyticus, Sigma-Aldrich) and incubation

17 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

396 at 37ºC for 1 hr and 2) 1 hour incubation at 50ºC following addition of 15µl Proteinase K

397 (20 mg/ml, Thermo Scientific™) and 30µl Proteinase K digestion buffer (0.1M NaCl,

398 10mM Tris pH 9.0, 1mM EDTA, 0.5% SDS, nuclease-free water). Following digestion,

399 one volume of kit-specific DNA/RNA Lysis Buffer was added in order to proceed with the

400 manufacturer's recommended extraction protocol.

401 For soil, the sample was taken on 2/27/2019 at the Virginia Tech Eastern Shore

402 Agricultural Research and Extension Center. Samples were collected as 12 composite

403 knife slices of soil to a depth of 15 cm, and each of the 12 slices were passed through an

404 8mm filter. Detailed sampling methods can be reviewed in Norris et al. 2020 (40).

405 Following collection, 0.25g aliquots of soil were stored at -80ºC after overnight shipment

406 from the collection site. Soil aliquots were then extracted following the Earth Microbiome

407 Protocol (41) using a KingFisher™ Flex (Thermo Fisher®).

408 For mouse feces, DNA was isolated from a single fecal pellet using the DNeasy

409 PowerSoil isolation kit (QIAGEN®) following the manufacturer’s instructions. An

410 additional 10-minute incubation at 65˚C directly before bead beating was added to

411 enhance microbial cell lysis. The samples were then homogenized using a Vortex-Genie

412 2 and vortex adapter (QIAGEN®) at the highest setting for 10 minutes.

413

414 Metagenomic library preparation and sequencing. Environmental DNA samples were

415 prepared for metagenomic sequencing following manufacturers’ protocol using four

416 commercially available kits: 1) Illumina Nextera® DNA Flex Library Kit, 2) QIAGEN®

417 QIAseq FX DNA Library Kit, 3) Perkin Elmer NEXTFLEX® Rapid DNA-Seq Kit 2.0, and

418 4) seqWell plexWell™ 96. In addition, we included a fifth preparation method using the

18 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

419 modified “reduced” protocol established by Baym et al. to increase the number of libraries

420 the Nextera® DNA Flex could generate (20). Genomic DNA was quantified using a Qubit

421 1X dsDNA HS Assay Kit for soil, fecal, and coral communities. Mock community DNA

422 concentration was not quantified as ZymoBIOMICS™ manufacturer information provided

423 a known concentration of 100ng/µl. Following quantification, all samples prepared using

424 Nextera®Flex Full, QIASeq FX, and NEXTFLEX® Rapid XP were diluted with water to

425 0.2ng/µl. To determine how DNA input affected library generation, each standardized

426 DNA concentration was then added to obtain respective 0.5ng, 1.0ng, and 5.0ng inputs.

427 Samples prepared using plexWell™96 were diluted to 0.25ng/µl with water and

428 appropriate additive volumes made to obtain 0.5ng and 1.0ng input concentrations. For

429 samples with 5.0ng inputs, samples were diluted to 1.25ng/µl and then 4µl of sample was

430 used to obtain 5.0ng input concentration. For the Nextera®Flex Reduced reaction, all

431 samples were diluted to 5.0ng/µl with nuclease-free water. A 1µl aliqout of this dilution

432 was used for 5.0ng input libraries. For 0.5ng and 1.0ng input libraries, sufficient water

433 was added to the 1µl dilution to bring respective concentrations to 0.5ng/µl and 1.0ng/µl.

434 Library insert size was assessed for the Nextera® DNA Flex (full and reduced) and

435 plexWell™96 methods using the Agilent TapeStation 4200 high sensitivity D5000 DNA

436 ScreenTape. Insert size for the QIAseq FX and NEXTFLEX® Rapid XP methods was

437 quantified using the Agilent Bioanalyzer 2100 high-sensitivity DNA chip as these libraries

438 are more prone to have adapter dimers which are poorly resolved using the TapeStation.

439 Library concentration was assessed with the Qubit 1X high-sensitivity dsDNA

440 quantification kit (ThermoFisher). Resulting libraries were diluted, pooled, and sequenced

441 for paired-end reads of 150bp on Illumina HiSeq3000.

19 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

442

443 Microbial community gene family abundance and taxonomic diversity. Quality

444 filtering, adaptor removal, and host read filtering were performed using shotcleaner v0.1

445 (42) with default parameters. For mouse fecal samples, host reads were removed by

446 aligning to the mouse reference genome (GRCm38). A similar procedure was used for

447 coral samples except that these reads were filtered against a concatenated version of the

448 coral (Acropora millepora, Genbank: QTZP00000000.1 (43)) and symbiont

449 (Symbiodiniaceae sp. clade A MAC-Cass KB8 (Tax ID: 671378, UniProt) genomes.

450 Quality controlled sequence reads were input into HUMAnN 2.0 (44) for taxonomic and

451 functional classification using the UniRef90 database and default parameters. HUMAnN

452 2.0 outputs for each community type were combined and renormalized to counts per

453 million using HUMAnN 2.0 utility scripts before downstream analysis. High-quality reads

454 were also taxonomically classified using Kraken2 v2.0.8-beta (45, 46) and a custom

455 reference database that included sequences from all human, mouse (GRCm38), UniVec

456 Core, , archaea, virus, fungi, and protozoa in the NCBI RefSeq database

457 (accessed October 8, 2019) as well as the Symbiodinacaea sp. clade A MAC-Cass KB8

458 and A. millepora genomes.

459

460 Statistical Analyses. Independent linear models (R::stats::lm) were used to determine

461 how community type, library preparation method, and input DNA concentration affect

462 variance of the resulting library characteristics including: the number of reads generated,

463 median fragment size, library concentration, library molarity, mean read length, read

464 duplication rate, mean read GC content, and total reads filtered and removed. Since we

20 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

465 reasoned it was likely that interactions between the predictors exist, we employed a model

466 selection procedure to identify the most parsimonious model for each characteristic

467 examined. For each characteristic we built a set of models of increasing complexity: 1) a

468 reduced model with only additive effects (eq. 1), 2) a model with interaction terms for

469 community type, library preparation procedure, and DNA input concentration (eq. 2).

470 Eq. 1: Characteristic = �0 + �1(community) + �2 (library preparation) + �3(DNA input) +�

471 Eq. 2: Characteristic = �0 + �1(community) * �2 (library preparation) * �3(DNA input) +�

472 We then used Akaike information criterion (AIC) to select the most parsimonious model

473 and analysis of variance (ANOVA) to determine the significance of each term in the

474 selected model. Because sequencing libraries produced from distinct samples are pooled

475 as part of the plexWell™ library preparation protocol, a single value is available for median

476 fragment size, sequence library concentration, sequence library molarity for this method.

477 The similarity between taxonomic profiles generated by library preparation

478 methods and the known taxonomic composition of the ZymoBIOMICS™ Microbial

479 Community DNA Standard was assessed by Pearson’s correlation test

480 (R::stats::cor.test). To measure the variation in taxonomic profiles generated by different

481 library preparation methods across soil, coral, and fecal communities, we calculated the

482 Pearson’s correlation coefficient of generated taxonomic abundance profiles for each pair

483 of samples. This analysis was conducted using Kraken2, a sensitive read binning tool, as

484 MetaPhlAn2, a marker-gene-based abundance estimation tool to eliminate the possibility

485 that mammalian biases in marker gene databases would skew results in environmental

486 samples. We accounted for the effects of multiple correlation tests using false discovery

487 rate (R::stats::p.adjust, method = fdr).

21 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

488 The additive and interactive statistical effects of library preparation and DNA input

489 concentration on microbiome composition, as measured by the Bray-Curtis dissimilarity

490 metric, were evaluated using PERMANOVA analysis (R::vegan::adonis, permutations =

491 5000, method= bray) and visualized using an ordination of principal components analysis

492 for each community. Differences in the Bray-Curtis dissimilarity of taxonomic abundance

493 profiles within and across library preparation methods were measured using Kruskal-

494 Wallis tests (R::stats::kruskal.test) with a post-hoc pairwise-Wilcoxon-test

495 (R::stats::pairwise.wilcox.test). A holm correction was used to control Wilcoxon-test

496 family-wise error rates.

497 Shannon entropy and gene richness were calculated for HUMAnN 2.0 gene

498 abundance profiles using R and vegan. Linear regression to quantified associations

499 between gene level alpha diversity and library preparation and input DNA concentration

500 for each community type. Associations with gene level Bray-Curtis dissimilarity,

501 preparation method and DNA input were quantified using PERMANOVA

502 (R::vegan::adonis, permutations = 5000, method = bray). Differences in gene

503 abundances and metagenomic dissimilarity were quantified as with taxonomy.

504

505 ACKNOWLEDGEMENTS

506 Research reported in this publication was supported by the National Science Foundation

507 under Grant No. OCE-1933165 and BIO-2025457, as well as the National Institute of

508 Allergy and Infectious Diseases under award number R21AI135641 and the National

509 Institute of Environmental Health Sciences under award number R01ES030226, and a

510 contract to the Center for Genome Research and Biocomputing from the Soil Health

22 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

511 Institute. Library preparation kits used in this study were generously donated by

512 PerkinElmer, Inc. and soil samples were provided by Elizabeth Rieke (Soil Health

513 Institute).

514 515 REFERENCES 516 517 1. Paszkiewicz K, Studholme DJ. 2010. De novo assembly of short sequence 518 reads. Brief Bioinform 11:457–472. 519 2. Brumfield KD, Huq A, Colwell RR, Olds JL, Leddy MB. 2020. Microbial resolution 520 of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly 521 available NEON data. PLoS ONE 15. 522 3. Yoon SS, Kim E-K, Lee W-J. 2015. Functional genomic and metagenomic 523 approaches to understanding gut microbiota–animal mutualism. Curr Opin Microbiol 524 24:38–46. 525 4. Quince C, Walker AW, Simpson JT, Loman NJ, Segata N. 2017. Shotgun 526 metagenomics, from sampling to analysis. 9. Nat Biotechnol 35:833–844. 527 5. Laudadio I, Fulci V, Palone F, Stronati L, Cucchiara S, Carissimi C. 2018. 528 Quantitative Assessment of Shotgun Metagenomics and 16S rDNA Amplicon 529 Sequencing in the Study of Human Gut Microbiome. Omics J Integr Biol 22:248–254. 530 6. Jovel J, Patterson J, Wang W, Hotte N, O’Keefe S, Mitchel T, Perry T, Kao D, 531 Mason AL, Madsen KL, Wong GK-S. 2016. Characterization of the Gut Microbiome 532 Using 16S or Shotgun Metagenomics. Front Microbiol 7. 533 7. Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, 534 Creasy HH, Earl AM, FitzGerald MG, Fulton RS, Giglio MG, Hallsworth-Pepin K, Lobos 535 EA, Madupu R, Magrini V, Martin JC, Mitreva M, Muzny DM, Sodergren EJ, Versalovic 536 J, Wollam AM, Worley KC, Wortman JR, Young SK, Zeng Q, Aagaard KM, Abolude 537 OO, Allen-Vercoe E, Alm EJ, Alvarado L, Andersen GL, Anderson S, Appelbaum E, 538 Arachchi HM, Armitage G, Arze CA, Ayvaz T, Baker CC, Begg L, Belachew T, Bhonagiri 539 V, Bihan M, Blaser MJ, Bloom T, Bonazzi V, Paul Brooks J, Buck GA, Buhay CJ, 540 Busam DA, Campbell JL, Canon SR, Cantarel BL, Chain PSG, Chen I-MA, Chen L, 541 Chhibba S, Chu K, Ciulla DM, Clemente JC, Clifton SW, Conlan S, Crabtree J, Cutting 542 MA, Davidovics NJ, Davis CC, DeSantis TZ, Deal C, Delehaunty KD, Dewhirst FE, 543 Deych E, Ding Y, Dooling DJ, Dugan SP, Michael Dunne W, Scott Durkin A, Edgar RC, 544 Erlich RL, Farmer CN, Farrell RM, Faust K, Feldgarden M, Felix VM, Fisher S, Fodor 545 AA, Forney LJ, Foster L, Di Francesco V, Friedman J, Friedrich DC, Fronick CC, Fulton 546 LL, Gao H, Garcia N, Giannoukos G, Giblin C, Giovanni MY, Goldberg JM, Goll J, 547 Gonzalez A, Griggs A, Gujja S, Kinder Haake S, Haas BJ, Hamilton HA, Harris EL, 548 Hepburn TA, Herter B, Hoffmann DE, Holder ME, Howarth C, Huang KH, Huse SM, 549 Izard J, Jansson JK, Jiang H, Jordan C, Joshi V, Katancik JA, Keitel WA, Kelley ST,

23 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

550 Kells C, King NB, Knights D, Kong HH, Koren O, Koren S, Kota KC, Kovar CL, Kyrpides 551 NC, La Rosa PS, Lee SL, Lemon KP, Lennon N, Lewis CM, Lewis L, Ley RE, Li K, 552 Liolios K, Liu B, Liu Y, Lo C-C, Lozupone CA, Dwayne Lunsford R, Madden T, Mahurkar 553 AA, Mannon PJ, Mardis ER, Markowitz VM, Mavromatis K, McCorrison JM, McDonald 554 D, McEwen J, McGuire AL, McInnes P, Mehta T, Mihindukulasuriya KA, Miller JR, Minx 555 PJ, Newsham I, Nusbaum C, O’Laughlin M, Orvis J, Pagani I, Palaniappan K, Patel SM, 556 Pearson M, Peterson J, Podar M, Pohl C, Pollard KS, Pop M, Priest ME, Proctor LM, 557 Qin X, Raes J, Ravel J, Reid JG, Rho M, Rhodes R, Riehle KP, Rivera MC, Rodriguez- 558 Mueller B, Rogers Y-H, Ross MC, Russ C, Sanka RK, Sankar P, Fah Sathirapongsasuti 559 J, Schloss JA, Schloss PD, Schmidt TM, Scholz M, Schriml L, Schubert AM, Segata N, 560 Segre JA, Shannon WD, Sharp RR, Sharpton TJ, Shenoy N, Sheth NU, Simone GA, 561 Singh I, Smillie CS, Sobel JD, Sommer DD, Spicer P, Sutton GG, Sykes SM, Tabbaa 562 DG, Thiagarajan M, Tomlinson CM, Torralba M, Treangen TJ, Truty RM, Vishnivetskaya 563 TA, Walker J, Wang L, Wang Z, Ward DV, Warren W, Watson MA, Wellington C, 564 Wetterstrand KA, White JR, Wilczek-Boney K, Wu Y, Wylie KM, Wylie T, Yandava C, 565 Ye L, Ye Y, Yooseph S, Youmans BP, Zhang L, Zhou Y, Zhu Y, Zoloth L, Zucker JD, 566 Birren BW, Gibbs RA, Highlander SK, Methé BA, Nelson KE, Petrosino JF, Weinstock 567 GM, Wilson RK, White O, The Human Microbiome Project Consortium. 2012. Structure, 568 function and diversity of the healthy human microbiome. 7402. Nature 486:207–214. 569 8. Armour CR, Nayfach S, Pollard KS, Sharpton TJ. 2019. A Metagenomic Meta- 570 analysis Reveals Functional Signatures of Health and Disease in the Human Gut 571 Microbiome. mSystems 4. 572 9. Guo J, Li J, Chen H, Bond PL, Yuan Z. 2017. Metagenomic analysis reveals 573 wastewater treatment plants as hotspots of antibiotic resistance genes and mobile 574 genetic elements. Water Res 123:468–478. 575 10. Jadeja NB, Purohit HJ, Kapley A. 2019. Decoding microbial community 576 intelligence through metagenomics for efficient wastewater treatment. Funct Integr 577 Genomics 19:839–851. 578 11. Li A-D, Li L-G, Zhang T. 2015. Exploring antibiotic resistance genes and metal 579 resistance genes in metagenomes from wastewater treatment plants. Front 580 Microbiol 6. 581 12. Wang Z, Zhang X-X, Huang K, Miao Y, Shi P, Liu B, Long C, Li A. 2013. 582 Metagenomic Profiling of Antibiotic Resistance Genes and Mobile Genetic Elements in 583 a Tannery Wastewater Treatment Plant. PLOS ONE 8:e76079. 584 13. Fierer N, Lauber CL, Ramirez KS, Zaneveld J, Bradford MA, Knight R. 2012. 585 Comparative metagenomic, phylogenetic and physiological analyses of soil microbial 586 communities across nitrogen gradients. 5. ISME J 6:1007–1017. 587 14. Blaya J, Marhuenda FC, Pascual JA, Ros M. 2016. Microbiota Characterization 588 of Compost Using Omics Approaches Opens New Perspectives for Phytophthora Root 589 Rot Control. PLOS ONE 11:e0158048.

24 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

590 15. Lutz S, Thuerig B, Oberhaensli T, Mayerhofer J, Fuchs JG, Widmer F, Freimoser 591 FM, Ahrens CH. 2020. Harnessing the Microbiomes of Suppressive Composts for Plant 592 Protection: From Metagenomes to Beneficial Microorganisms and Reliable Diagnostics. 593 Front Microbiol 11. 594 16. Blockley A, Elliott DR, Roberts AP, Sweet M. 2017. Symbiotic Microbes from 595 Marine Invertebrates: Driving a New Era of Natural Product Drug Discovery. 4. Diversity 596 9:49. 597 17. Trindade M, van Zyl LJ, Navarro-Fernández J, Abd Elrazak A. 2015. Targeted 598 metagenomics as a tool to tap into marine natural product diversity for the discovery 599 and production of drug candidates. Front Microbiol 6. 600 18. Kennedy J, Marchesi JR, Dobson ADW. 2007. Metagenomic approaches to 601 exploit the biotechnological potential of the microbial consortia of marine sponges. Appl 602 Microbiol Biotechnol 75:11–20. 603 19. Sato MP, Ogura Y, Nakamura K, Nishida R, Gotoh Y, Hayashi M, Hisatsune J, 604 Sugai M, Takehiko I, Hayashi T. 2019. Comparison of the sequencing bias of currently 605 available library preparation kits for Illumina sequencing of bacterial genomes and 606 metagenomes. DNA Res 26:391–398. 607 20. Baym M, Kryazhimskiy S, Lieberman TD, Chung H, Desai MM, Kishony R. 2015. 608 Inexpensive Multiplexed Library Preparation for Megabase-Sized Genomes. PLOS ONE 609 10:e0128036. 610 21. Gaio D, To J, Liu M, Monahan L, Anantanawat K, Darling AE. 2019. Hackflex: 611 low cost Illumina sequencing library construction for high sample counts. bioRxiv 612 779215. 613 22. Couto N, Schuele L, Raangs EC, Machado MP, Mendes CI, Jesus TF, 614 Chlebowicz M, Rosema S, Ramirez M, Carriço JA, Autenrieth IB, Friedrich AW, Peter S, 615 Rossen JW. 2018. Critical steps in clinical shotgun metagenomics for the concomitant 616 detection and typing of microbial pathogens. 1. Sci Rep 8:13767. 617 23. Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall AB, Brady A, 618 Creasy HH, McCracken C, Giglio MG, McDonald D, Franzosa EA, Knight R, White O, 619 Huttenhower C. 2017. Strains, functions and dynamics in the expanded Human 620 Microbiome Project. 7674. Nature 550:61–66. 621 24. Pereira-Marques J, Hout A, Ferreira RM, Weber M, Pinto-Ribeiro I, van Doorn L- 622 J, Knetsch CW, Figueiredo C. 2019. Impact of Host DNA and Sequencing Depth on the 623 Taxonomic Resolution of Whole Metagenome Sequencing for Microbiome Analysis. 624 Front Microbiol 10. 625 25. Marine R, Polson SW, Ravel J, Hatfull G, Russell D, Sullivan M, Syed F, Dumas 626 M, Wommack KE. 2011. Evaluation of a Transposase Protocol for Rapid Generation of 627 Shotgun High-Throughput Sequencing Libraries from Nanogram Quantities of DNA. 628 Appl Environ Microbiol 77:8071–8079.

25 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

629 26. Solonenko SA, Ignacio-Espinoza JC, Alberti A, Cruaud C, Hallam S, 630 Konstantinidis K, Tyson G, Wincker P, Sullivan MB. 2013. Sequencing platform and 631 library preparation choices impact viral metagenomes. BMC Genomics 14:320. 632 27. Chafee M, Maignien L, Simmons SL. 2015. The effects of variable sample 633 biomass on comparative metagenomics. Environ Microbiol 17:2239–2253. 634 28. Duhaime MB, Deng L, Poulos BT, Sullivan MB. 2012. Towards quantitative 635 metagenomics of wild viruses and other ultra-low concentration DNA samples: a 636 rigorous assessment and optimization of the linker amplification method. Environ 637 Microbiol 14:2526–2537. 638 29. Jones MB, Highlander SK, Anderson EL, Li W, Dayrit M, Klitgord N, Fabani MM, 639 Seguritan V, Green J, Pride DT, Yooseph S, Biggs W, Nelson KE, Venter JC. 2015. 640 Library preparation methodology can influence genomic and functional predictions in 641 human microbiome research. Proc Natl Acad Sci 112:14024–14029. 642 30. Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. 2015. 643 Analysis of composition of microbiomes: a novel method for studying microbial 644 composition. Microb Ecol Health Dis 26:27663. 645 31. Simon C, Daniel R. 2011. Metagenomic Analyses: Past and Future Trends. Appl 646 Environ Microbiol 77:1153–1161. 647 32. Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. 2015. Modeling and 648 Analysis of Compositional Data. John Wiley & Sons. 649 33. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. 2017. Microbiome 650 Datasets Are Compositional: And This Is Not Optional. Front Microbiol 8. 651 34. Kircher M, Heyn P, Kelso J. 2011. Addressing challenges in the production and 652 analysis of illumina sequencing data. BMC Genomics 12:382. 653 35. Browne PD, Nielsen TK, Kot W, Aggerholm A, Gilbert MTP, Puetz L, Rasmussen 654 M, Zervas A, Hansen LH. 2020. GC bias affects genomic and metagenomic 655 reconstructions, underrepresenting GC-poor organisms. GigaScience 9. 656 36. Huptas C, Scherer S, Wenning M. 2016. Optimized Illumina PCR-free library 657 preparation for bacterial whole genome sequencing and analysis of factors influencing 658 de novo assembly. BMC Res Notes 9:269. 659 37. Rodriguez-R LM, Konstantinidis KT. 2014. Estimating coverage in metagenomic 660 data sets and why it matters. 11. ISME J 8:2349–2351. 661 38. Tan G, Opitz L, Schlapbach R, Rehrauer H. 2019. Long fragments achieve lower 662 base quality in Illumina paired-end sequencing. 1. Sci Rep 9:2856. 663 39. Lindgreen S, Adair KL, Gardner PP. 2016. An evaluation of the accuracy and 664 speed of metagenome analysis tools. 1. Sci Rep 6:19233. 665 40. Norris CE, Bean GM, Cappellazzi SB, Cope M, Greub KLH, Liptzin D, Rieke EL, 666 Tracy PW, Morgan CLS, Honeycutt CW. 2020. Introducing the North American project 667 to evaluate soil health measurements. Agron J 112:3195–3215.

26 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

668 41. Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, 669 Tripathi A, Gibbons SM, Ackermann G, Navas-Molina JA, Janssen S, Kopylova E, 670 Vázquez-Baeza Y, González A, Morton JT, Mirarab S, Zech Xu Z, Jiang L, Haroon MF, 671 Kanbar J, Zhu Q, Jin Song S, Kosciolek T, Bokulich NA, Lefler J, Brislawn CJ, 672 Humphrey G, Owens SM, Hampton-Marcell J, Berg-Lyons D, McKenzie V, Fierer N, 673 Fuhrman JA, Clauset A, Stevens RL, Shade A, Pollard KS, Goodwin KD, Jansson JK, 674 Gilbert JA, Knight R, The Earth Microbiome Project Consortium, Rivera JLA, Al- 675 Moosawi L, Alverdy J, Amato KR, Andras J, Angenent LT, Antonopoulos DA, Apprill A, 676 Armitage D, Ballantine K, Bárta J, Baum JK, Berry A, Bhatnagar A, Bhatnagar M, Biddle 677 JF, Bittner L, Boldgiv B, Bottos E, Boyer DM, Braun J, Brazelton W, Brearley FQ, 678 Campbell AH, Caporaso JG, Cardona C, Carroll J, Cary SC, Casper BB, Charles TC, 679 Chu H, Claar DC, Clark RG, Clayton JB, Clemente JC, Cochran A, Coleman ML, Collins 680 G, Colwell RR, Contreras M, Crary BB, Creer S, Cristol DA, Crump BC, Cui D, Daly SE, 681 Davalos L, Dawson RD, Defazio J, Delsuc F, Dionisi HM, Dominguez-Bello MG, Dowell 682 R, Dubinsky EA, Dunn PO, Ercolini D, Espinoza RE, Ezenwa V, Fenner N, Findlay HS, 683 Fleming ID, Fogliano V, Forsman A, Freeman C, Friedman ES, Galindo G, Garcia L, 684 Garcia-Amado MA, Garshelis D, Gasser RB, Gerdts G, Gibson MK, Gifford I, Gill RT, 685 Giray T, Gittel A, Golyshin P, Gong D, Grossart H-P, Guyton K, Haig S-J, Hale V, Hall 686 RS, Hallam SJ, Handley KM, Hasan NA, Haydon SR, Hickman JE, Hidalgo G, 687 Hofmockel KS, Hooker J, Hulth S, Hultman J, Hyde E, Ibáñez-Álamo JD, Jastrow JD, 688 Jex AR, Johnson LS, Johnston ER, Joseph S, Jurburg SD, Jurelevicius D, Karlsson A, 689 Karlsson R, Kauppinen S, Kellogg CTE, Kennedy SJ, Kerkhof LJ, King GM, Kling GW, 690 Koehler AV, Krezalek M, Kueneman J, Lamendella R, Landon EM, Lane-deGraaf K, 691 LaRoche J, Larsen P, Laverock B, Lax S, Lentino M, Levin II, Liancourt P, Liang W, Linz 692 AM, Lipson DA, Liu Y, Lladser ME, Lozada M, Spirito CM, MacCormack WP, MacRae- 693 Crerar A, Magris M, Martín-Platero AM, Martín-Vivaldi M, Martínez LM, Martínez-Bueno 694 M, Marzinelli EM, Mason OU, Mayer GD, McDevitt-Irwin JM, McDonald JE, McGuire KL, 695 McMahon KD, McMinds R, Medina M, Mendelson JR, Metcalf JL, Meyer F, Michelangeli 696 F, Miller K, Mills DA, Minich J, Mocali S, Moitinho-Silva L, Moore A, Morgan-Kiss RM, 697 Munroe P, Myrold D, Neufeld JD, Ni Y, Nicol GW, Nielsen S, Nissimov JI, Niu K, Nolan 698 MJ, Noyce K, O’Brien SL, Okamoto N, Orlando L, Castellano YO, Osuolale O, Oswald 699 W, Parnell J, Peralta-Sánchez JM, Petraitis P, Pfister C, Pilon-Smits E, Piombino P, 700 Pointing SB, Pollock FJ, Potter C, Prithiviraj B, Quince C, Rani A, Ranjan R, Rao S, 701 Rees AP, Richardson M, Riebesell U, Robinson C, Rockne KJ, Rodriguezl SM, Rohwer 702 F, Roundstone W, Safran RJ, Sangwan N, Sanz V, Schrenk M, Schrenzel MD, Scott 703 NM, Seger RL, Seguin-Orlando A, Seldin L, Seyler LM, Shakhsheer B, Sheets GM, 704 Shen C, Shi Y, Shin H, Shogan BD, Shutler D, Siegel J, Simmons S, Sjöling S, Smith 705 DP, Soler JJ, Sperling M, Steinberg PD, Stephens B, Stevens MA, Taghavi S, Tai V, 706 Tait K, Tan CL, Tas¸ N, Taylor DL, Thomas T, Timling I, Turner BL, Urich T, Ursell LK, 707 van der Lelie D, Van Treuren W, van Zwieten L, Vargas-Robles D, Thurber RV,

27 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

708 Vitaglione P, Walker DA, Walters WA, Wang S, Wang T, Weaver T, Webster NS, 709 Wehrle B, Weisenhorn P, Weiss S, Werner JJ, West K, Whitehead A, Whitehead SR, 710 Whittingham LA, Willerslev E, Williams AE, Wood SA, Woodhams DC, Yang Y, 711 Zaneveld J, Zarraonaindia I, Zhang Q, Zhao H. 2017. A communal catalogue reveals 712 Earth’s multiscale microbial diversity. Nature 551:457–463. 713 42. Sharpton T. 2017. sharpton/shotcleaner. Perl6. 714 43. Ying H, Hayward DC, Cooke I, Wang W, Moya A, Siemering KR, Sprungala S, 715 Ball EE, Forêt S, Miller DJ. 2019. The Whole-Genome Sequence of the Coral Acropora 716 millepora. Genome Biol Evol 11:1374–1379. 717 44. Franzosa EA, McIver LJ, Rahnavard G, Thompson LR, Schirmer M, Weingart G, 718 Lipson KS, Knight R, Caporaso JG, Segata N, Huttenhower C. 2018. Species-level 719 functional profiling of metagenomes and metatranscriptomes. Nat Methods 15:962–968. 720 45. Wood DE, Salzberg SL. 2014. Kraken: ultrafast metagenomic sequence 721 classification using exact alignments. Genome Biol 15:R46. 722 46. Wood DE, Lu J, Langmead B. 2019. Improved metagenomic analysis with 723 Kraken 2. Genome Biol 20:257. 724 725 726 TABLES, FIGURES, AND LEGENDS 727

28 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

728 729 730

29 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

731

30 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

732 733

31 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

734

32 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

735 736 737 738 739 740 741 742

33 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

743

34 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439578; this version posted April 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

744 745 746

35