<<

bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases 2 3 Abigail Basson1,2, Fabio Cominelli1,2,3,4, Alex Rodriguez-Palacios*1,2,3,4 4 5 6 1 Division of Gastroenterology & Liver Diseases, Case Western Reserve University School of Medicine, 7 Cleveland, OH, USA. 8 2 Digestive Health Research Institute, University Hospitals Cleveland Medical Center, Cleveland, OH, 9 USA. 10 3 Mouse Models Core, Silvio O’Conte Cleveland Digestive Diseases Research Core Center, Cleveland, 11 OH, USA. 12 4 Germ-free and Gut Microbiome Core, Digestive Health Research Institute, Case Western Reserve 13 University, Cleveland, OH, USA. 14 15 Corresponding Author: Alex Rodriguez-Palacios ([email protected])

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

16 Highlights 17 18 • Multimodal diseases are those in which affected individuals can be divided into subtypes (or ‘ 19 modes’); for instance, ‘mild’ vs. ‘severe’, based on (unknown) modifiers of disease severity. 20 21 • The role of the microbiome in multimodal diseases has been studied in animals; however, findings 22 are often deemed irreproducible, or unreasonably biased, with pathogenic roles in 95% of reports. 23 24 • As a solution to repeatably, investigators have been told to seek funds to increase the number of 25 human-microbiome donors (N) to increase the reproducibility of animal studies. 26 27 • Herein, we illustrate that although increasing N could help identify statistical effects (patterns of 28 analytical irreproducibility), clinically-relevant information will not always be identified. 29 30 • Depending on which diseases need to be compared, ‘random ’ alone leads to reproducible 31 ‘patterns of analytical irreproducibility’ in multimodal disease simulations. 32 33 • Instead of solely increasing N, we illustrate how disease multimodality could be understood, 34 visualized and used to guide the study of diseases by selecting and focusing on ‘disease modes’. 35 36 37 Abstract 38 Multimodal diseases are those in which affected individuals can be divided into subtypes (or ‘data 39 modes’); for instance, ‘mild’ vs. ‘severe’, based on (unknown) modifiers of disease severity. Studies have 40 shown that despite the inclusion of a large number of subjects, the causal role of the microbiome in 41 human diseases remains uncertain. The role of the microbiome in multimodal diseases has been studied 42 in animals; however, findings are often deemed irreproducible, or unreasonably biased, with pathogenic 43 roles in 95% of reports. As a solution to repeatability, investigators have been told to seek funds to 44 increase the number of human-microbiome donors (N) to increase the reproducibility of animal studies 45 (doi:10.1016/j.cell.2019.12.025). Herein, through simulations, we illustrate that increasing N will not 46 uniformly/universally enable the identification of consistent statistical differences (patterns of analytical 47 irreproducibility), due to random sampling from a population with ample variability in disease and the 48 presence of ‘disease data subtypes’ (or modes). We also found that studies do not use cluster 49 when needed (97.4%, 37/38, 95%CI=86.5,99.5), and that scientists who increased N, concurrently 50 reduced the number of mice/donor (y=-0.21x, R2=0.24; and vice versa), indicating that statistically, 51 scientists replace the disease in mice by the variance of human disease. Instead of assuming 52 that increasing N will solve reproducibility and identify clinically-predictive findings on causality, we 53 propose the visualization of data distribution using kernel-density-violin plots (rarely used in rodent 54 studies; 0%, 0/38, 95%CI=6.9e-18,9.1) to identify ‘disease data subtypes’ to self-correct, guide and 55 promote the personalized investigation of disease subtype mechanisms. 56 57 Keywords: violin plots, random sampling, analytical irreproducibility, microbiome, fecal matter 58 transplantation, data disease subtypes

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 2 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

59 Introduction 60 Multimodal diseases are those in which affected individuals can be divided into subtypes (or ‘data 61 modes’); for instance, ‘mild’ vs. ‘severe’, based on (unknown) modifiers of disease severity. Since the 62 availability of DNA-sequencing platforms, there have been major advances in our understanding of the 63 human microbiome, its ecological complexity, and temporal oscillations. However, to differentiate the 64 causal connection between microbiome alterations and human diseases (from that of secondary 65 alterations due to disease), animal models, primarily germ-free rodents transplanted with human gut/fecal 66 microbiota (hGM-FMT), have been critical as in vivo phenotyping tools for human diseases. 67 Unfortunately, despite considerable efforts from organizations and guidelines to help scientists design 68 and report preclinical (e.g., ARRIVE)1,2, there are still concerns of study reproducibility. 69 Studies have described novel technical sources of ‘artificial’ microbiome heterogeneity that could 70 explain why hGM-FMT study results vary2-6. In our own work2, we discovered that scientists lacked 71 appropriate methods for the description and analysis of cage-clustered data. To help scientists to self- 72 correct issues on rodent experimentation, we identified ‘six action themes’ and provided examples, and 73 statistical code, on how to use and compute ‘study power’ as a reproducible parameter that could enable 74 inter-laboratory comparisons and improve the planning of human clinical trials based on preclinical data2. 75 In this regard, a recent perspective article on hGM-associated rodent studies by Walter et al.7 76 (“Establishing or Exaggerating Causality for the Gut Microbiome: Lessons from Human Microbiota- 77 Associated Rodents”; published in Cell, January 23rd 2020) recommended to scientists seek additional 78 funding to increase the number of human donors (N) as a main solution to improve experimental rigor and 79 reproducibility, and to determine the causal role of the hGM in disease. Given that large disease 80 variability is experimentally problematic for both humans and animals, we hypothesized that increasing N 81 would not ensure consistent results due to the aleatory effects of random sampling of subjects from a 82 population with multimodal disease distributions (i.e., multimodal: >2 types of modes or ‘subtypes of 83 disease data’ can be seen in a population; the most, and the least diseased). To verify this hypothesis in 84 the context of hGM and N, we used published (observed) preclinical distribution (disease variability) 85 estimates to conduct a statistical and visualization analysis of the impact of repeated random sampling on 86 the significance of statistical comparisons between simulated disease groups, at various N. 87 Underscoring the importance of the (which can be visualized in[8]), 88 simulations indicate that more studies addressing disease multimodality (independent of N; personalized 89 disease subtyping studies) are preferable than fewer studies with larger N that do not address disease 90 multimodality. After examining the statistical content of 38 studies8-45 listed in Walter et al,7 we found that 91 scientists who increased N, concurrently reduced the number of MxD, indicating that statistically, 92 scientists replace the disease variance in mice by the disease variance in humans in their hGM-FMT 93 studies. Further, studies lacked proper clustered-data statistics to control for animal density; which is a 94 major source of misleading results (false-positive, or false-negative), especially when scientists prefer to 95 house many rodents per cage, and when the number of mice per is low2,46. 96 Herein, we provide a conceptual framework that illustrates various patterns of analytical 97 irreproducibility by simulating and integrating the dynamics of: N, random sampling, group , sample 98 variance, and the population disease diversity that could be visualized as unimodal, bimodal or 99 multimodal, through the use of kernel-based violin density plots for the identification of data subtypes. 100 Simulations and provided examples could help scientists i) visualize the dynamics of random sampling 101 from a heterogeneous population of healthy and diseased subjects, ii) decide on N once preclinical data 102 are generated, and iii) improve experimental rigor in hGM-FMT studies. 103 104 RESULTS 105 ‘Disease subtypes’ occur in simulations using published data and UNIMODAL distributions 106 In microbiome rodent studies, the selection of a sufficient number of both human donors (N), and 107 the number of mice required to test each human donor (MxD), is critical to account for the effects of

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

108 random sampling, which exist when the hGM induces variable disease severity in humans and 109 rodents. Thus, to visualize the variability of disease severity (data subtypes/modes) in rodents, and the 110 effect of N on the reproducibility of pairwise statistical comparisons between groups of hypothetically, 111 randomly selected human donors, we first conducted a series of simulations using the ±SD 112 (observed data) from hGM-FMT mice in Basson et al.46 (note the dispersed overlapping , SD in 113 Figure 1A). Using the observed data we generated random datasets using functions designed to draw 114 numbers from an inversed Gaussian distribution (with unimodal normal continuous probability; 0,∞). We 115 demonstrate how the random selection of donors (sampled as groups for each of three iterative datasets) 116 influence the direction and significance level in pairwise comparative statistics (Figure 1B). 117 Simulations showed that the number of MxD is important because mice have various response 118 patterns to the hGM (i.e., disease severity, data subtypes/modes), which can be consistently detected 119 depending on the MxD and thus the variability introduced by random sampling. Simulations showed that 120 for the three group datasets (plotted as ‘Dis1’, ‘Dis2’ and ‘Healthy’), it was possible to reproducibly identify 121 two-to-three unique donor disease severity subtypes (data modes) in mice induced by the hGM (‘high’, 122 ‘middle’, and ‘low’ disease severity). Simulation plots made it visually evident that testing <4-5 MxD 123 yielded mean values more likely to be affected by intrinsic variability of random sampling; thus, making 124 studies with >6 MxD more stable and preferable. Conversely, studies with 1-2 MxD are at risk of being 125 strongly dependent on . Iterative simulations showed that the mean effect (e.g., ileal 126 histology) in transplanted mice varies minimally (i.e., stabilizes) after 7±2 MxD, depending on the random 127 dataset iterated. Beyond that, increasing MxD becomes less cost-effective/unnecessary if the focus is the 128 human donors (Figure 1C). 129 130 Random sampling from overlapping diseases yield ‘linear patterns of analytical irreproducibility’ 131 Often, published literature contains figures and statistical analysis conducted with 3 donors per 132 disease group. Thus, to mimic this scenario and to examine the role of random sampling on the 133 reproducibility of pairwise statistical results (‘significant’ vs. ‘non-significant’), we conducted, i) multiple 3- 134 donor/group (‘trio-trio’) pairwise comparisons, and ii) a simultaneous overall analysis for the cumulative 135 sum of all the 3-donor trios simulated for each disease group. That is, we monitored and quantified 136 whether results for each random iteration were significant (using univariate Student’s t-statistics p<0.05) 137 or non-significant (p>0.05) for groups of simulated donor datasets (‘Dis1’, ‘Dis2, and ‘Healthy’). Assessing 138 the effect of random sampling at various N, and also as N accumulated, we were able to illustrate that 139 pairwise trio-trio comparisons between the simulated datasets almost always produce non-significant 140 results when iterative trios were compared (due to large SD overlapping; see bars in Figure 1D 141 representing 21 sets of pairwise trio-trio p-values). However, as N increases by the cumulative addition of 142 all (mostly ‘non-significant’) donor trios (i.e., N increases in multiples of 3, for a of N between 3 and 143 63 donors/group; [3, 6, 9, 12…63]), pairwise statistical comparisons between the simulated datasets did 144 not produce consistent results (see line plots in Figure 1D representing p-value for cumulative addition of 145 donors when sampling iterations were simulated). 146 Results are clinically relevant because the simulated N, being much larger (63 donors/group) than 147 the largest N tested by one of the studies reviewed by Walter et al7 (21 donors/group)40 demonstrates that 148 the analysis of randomly selected patients would not always yield reproducible results due to the chance 149 of sampling aleatory sets of individuals with varying degrees of disease severity, regardless of how many 150 donors are recruited in an study. To provide a specific example, using ‘Dis1’ as a referent, cumulative 151 pairwise comparisons (vs. ‘Dis2’, and vs. ‘Healthy’) revealed at least five different patterns of 152 ‘irreproducible’ statistical results as N increased between 3 and 63 per group. Figure 1D illustrates four of 153 these variable cumulative linear patterns of analytical irreproducibility, in which, remarkably, i) ‘Dis1’ 154 becomes significantly different vs. Dis2, and vs. ‘Healthy’, as N increases, ii) ‘Dis1’ becomes significantly 155 different from ‘Dis2’ but not vs. ‘Healthy’, iii) ‘Dis1’ was significantly different from healthy but not vs. 156 ‘Dis2’, and iv) ‘Dis1’ never becomes significantly different despite sampling up to 63 donors/group. See

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 4 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

157 Supplementary Figure 1 for complementary plots illustrating linearity of patterns (R2, mean 0.51±0.23, 158 21 simulations) Hence, results clearly illustrate that seeking funds to recruit more donors is not a prudent 159 statistical solution to the problem of understanding disease causality of widely variable conditions in both 160 humans and animals. By analytical irreproducibility, herein, we refer to the inability to reproduce the 161 direction and of a test effect when analyses are conducted between groups created 162 by the random selection of subjects from distributions defined based on observed (mean±SD) data. 163 164 100,000 Monte Carlo simulations illustrate the effect of randomness on analytical reproducibility 165 To summarize the overall significance of the inconsistent patterns observed via random sampling, 166 we computed an aggregate ‘cumulative probability of being a significant simulation’ for 50 pairwise 167 statistical simulation sets fulfilling the 4 linear patterns described above. Emphasizing the concept that 168 increasing N is not a reproducible solution, Figure 1E shows that only 35.3±4.0% of comparisons 169 between ‘Dis1’ and ‘Dis2’, and 58.8±3.3% for ‘Dis1’ and ‘Healthy’ were significant. 170 Expanding the validity of these inverted-Gaussian simulations for N=63 donors/group, we then 171 conducted i) Monte Carlo adjusted Student’s unpaired t-tests, and ii) Monte-Carlo adjusted one-way 172 ANOVA with Tukey correction for family errors and multiple comparisons. Monte Carlo simulations used 173 data drawn from a normal (non-inverted) Gaussian distribution around the group means with a pooled SD 174 of ±4, and were conducted using GraphPad, a popular statistical software in published studies. To 175 estimate a probability closer to the real expectation (narrower confidence intervals), 100,000 simulations 176 were performed. Supporting the observations above (based on inverse normal simulations), Monte Carlo 177 Gaussian simulations showed that, using pairwise comparison, ‘Dis1’ would be significantly different from 178 Dis 2 (adjusted T-test p<0.05) only 57.7% of the time (95%CI=58-57.4), with 1540 simulations producing 179 negative (contradictory) mean differences between the groups. Compared to ‘Healthy’, ‘Dis1’ and ‘Dis2’ 180 were significant only 9.1% (95%CI=9.2-8.9) and 78.3% (95%CI=78.6-78.1) of the time, respectively. 181 Under the ‘Weak Law of Large Numbers’47-49, and principles, it is almost always 182 possible to detect some level of statistical significance(s) and mean group differences when asymptotic 183 mathematical methods based on numerous simulations are used, for example, as a surrogate for multiple 184 experiments which are not possible in real research settings. However, in this case, the mean simulated 185 differences yielding from 100,000 simulations were minuscule (1.6 for ‘Dis1’-‘Dis2’; -1.97 ‘Healthy’-‘Dis2’, 186 and 0.42, ‘Healthy’-‘Dis1’). Compared to the range of disease variance for each disease, such minuscule 187 differences may not be clinically relevant to explain disease variance at the individual level. Note that the 188 SD was 4, therefore it is intuitive to visualize in a numerical context such as small differences across 189 greatly overlapping unimodal simulations. Correcting for family errors, One-way ANOVA corrected with 190 10,000 Monte Carlo simulations with N=63/group, showed that at least one of the three groups would be 191 statistically different in approximately only 67.2% of the simulations (95%CI=64.2-70.0), whereas in 192 32.8% (95%CI=64.2-70.0) of simulations, the groups would appear as statistically similar (see Table 1 for 193 estimations after 100,000 Monte Carlo simulations; note narrower CI as simulations increase). The 194 comparison of ‘Dis1’ vs. ‘Dis2’ in Table 1, clearly demonstrates that the percentage of cases, in which a 195 simulation could be significant, depends on the degree of data dispersion. For example, simulations with 196 SD of 4, compared to SD of 10, produce significant results less often, illustrating how data with larger 197 dispersions contribute to poor statistical reproducibility, which cannot necessarily be corrected by 198 increasing N. 199 200 Random sampling can lead to ‘erratic patterns of analytical irreproducibility’ as N increases 201 To increase the external validity of our observations, we next simulated the mean±SD data 202 published from a hGM-FMT study on colorectal cancer conducted by Baxter et al16. In agreement with 203 Basson et al, Baxter et al revealed comparably bimodal colorectal cancer phenotypes in mice resulting 204 from both the diseased (colorectal cancer) patients and healthy human donors (Figure 1F). Equally 205 important, we observed for both Basson et al46 and Baxter et al16, what we describe as the fifth ‘pattern of

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 5 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

206 analytical irreproducibility’ in this report. That is, in some cases, the steady addition of donor trios/group 207 (as simulations proceeded for increasing values of N) made it possible to identify simulations where 208 erratic changes in the statistical significance for group comparisons switched randomly, yet gradually, 209 from being significant to non-significant as more donor trios were ‘recruited’ into the simulations (Figure 210 1G). Clinically relevant, simulations indicated (in a reproducible manner), that adding extra patients could 211 at times actually invert the overall cumulative effect of the p-value, possibly due to the variable distribution 212 and multimodal nature of the human and rodent responses to experimental interventions. As such, 213 simulations indicate that it is advisable to conduct several a-priori determined interim results in clinical 214 trials to ensure that significance is numerically stable (p<0.05), as well as the relevance of personalized 215 analysis to examine disease variance in populations. Unfortunately, there are no guidelines or examples 216 available to assist in determining how many donors would be sufficient, and to visualize the effect of 217 random sampling of individuals from a vastly heterogeneous population of healthy and diseased subjects, 218 once rodent preclinical data is generated. 219 220 Violin plots and statistical methods for visualization of MULTIMODAL ‘disease data subtypes’ 221 To visualize the underlying mechanisms that could explain the ‘linear and erratic patterns of 222 analytical irreproducibility’ introduced by random sampling, we first used dot plots based on observed and 223 simulated data, followed by kernel-based statistics and plots. appearance and one-way ANOVA 224 statistics showed that when N is increased, significant results, when present for largely overlapping 225 phenotypes, are primarily due to small differences between sample means (Figure 2A-B). Simulations 226 that compared 3 groups of 65 donors/group almost always yielded a significantly different group; 227 however, dot plots show that the significant differences between means are just a small fraction of the 228 total disease variability as verified with Monte Carlo simulations above. That is, as N increases, 229 comparisons can become significant (see plot with 65 donors in Figure 2C). In this context, a significant 230 difference of such a narrow magnitude may not be clinically relevant, or generalizable, to explain the 231 presence of a disease phenotype in a population, especially for those individuals at the extreme ranges of 232 the disease distribution. 233 Mechanistically, the detection of significant comparisons can be attributed to the effect that 234 ‘increasing N’ has on the data mean and variance, which increases at a higher rate for the variance as 235 shown in Figure 2D. Instead of increasing N as a general solution, we propose to scientists to use violin 236 plots, over other plots commonly encouraged by publishers50 (e.g., bar, boxplot and dot plots), because 237 violin plots provide an informative approach, at the group-sample level, for making inferences about 238 ‘disease data subtypes’ in the population (see ‘subtypes’ shown with arrows in Figure 2E). 239 Violin plots are similar to a , as they show a marker for the data , interquartile 240 ranges, and the individual data points51. However, as a unique feature, violin plots show the probability 241 density of the data at different values, usually smoothed by a kernel density estimator. The idea of a 242 kernel average smoother is that within a range of aligned data points, for each data point to be 243 smoothened (X0), there is a constant distance size (λ) of choice for the kernel window (radius, or width), 244 around which a weighted average for nearby data points are estimated. Weighted analysis gives data 245 points that are closer to X0 higher weights within the kernel window, thus identifying areas with higher 246 data densities (which correspond to the disease data modes). As an example of the benefits of using 247 violin plots, Figure 2F illustrates that as N increases, so does the ability of scientists to subjectively infer 248 the presence of disease subtypes. To strengthen the reproducibility of ‘subtype’ identification, 249 herein we recommend the use of statistical methods to identify disease data modes (e.g., see the modes 250 in Methods and Discussion), because as N increases, the visual detection of modes becomes 251 increasingly more subjective as shown in Figure 2F.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 6 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

252 Kernel density violin plots help guide subtype analysis to identify biologically significant results 253 Violin plots and kernel density distribution curves in Figure 3 illustrate why comparing groups of 254 randomly sampled individuals may not yield biologically relevant information, even though statistical 255 analysis identifies that the mean values differ between compared groups. Figure 3A illustrates the 256 different patterns of potential donor subtypes (i.e., data modes, visualized in violin plots as disease 257 data/curve ‘shoulders’) that would yield significant results in a single experiment depending on the donors 258 sampled. However, the kernel density plots in Figure 3B show that significant findings do not necessarily 259 indicate/yield clinically relevant thresholds or parameters to differentiate between the populations (due to 260 the overlapping and inflation of data ‘shoulders’ in some subjects within the samples). To contrast the 261 data simulated from Basson et al., we replaced data from ‘Dis1’ dataset with a Gaussian distributed 262 sample of random numbers (within 13.5±3.5, labeled as ‘fake disease X’; vs. 6.4±4.3, and 4.5±2.5 for 263 ‘Dis2’ and ‘Healthy’, respectively) to illustrate how a kernel plot would appear when significant differences 264 have a clinically relevant impact in differentiating disease subtypes (Figure 3C-D). 265 Collectively, simulations indicate that the uneven random sampling of subtypes across a disease 266 group would be an important factor in determining the direction of significance if studies were repeated, 267 owing primarily to the probability of sampling data ‘shoulders’ or ‘valleys’ in both healthy and diseased 268 populations. 269 270 Simulation of BIMODAL diseases illustrate mechanism of analytical irreproducibility 271 In our report thus far, we have used unimodal simulations to show how random sampling affects 272 statistical results. However, there has been an increased interest in understanding data multimodality in 273 various biological processes52,53 for which new statistical approaches have been proposed. Methods to 274 simulate multimodal distributions are however not trivial, in part due to the unknown nature of 275 multimodality in biological processes. To facilitate the understanding of the conceptual mechanisms that 276 influence the effect of data multimodality and random sampling on statistical significance, Figure 4 277 schematically contextualizes the statistical and data distribution principles that can interfere with 278 reproducibility of statistical results when simulations are repeated. 279 Random simulations from unimodal distributions work on the assumption that numbers (e.g., 280 donors’ disease severity) are drawn from a population, independently from one another. That is, the 281 probability of sampling or drawing a number from a population is not influenced by the number that was 282 selected prior. While this form of random sampling is very useful in deterministic mathematics, it does not 283 capture the dependence of events that occur in biology. That is, in biology, the probability of an event to 284 occur depends on the nature of the preceding events. To increase the external validity of our report, we 285 thus conducted simulations based on three strategies to draw density curves resembling bimodal 286 distributions. Figure 5A depicts distributions derived from both ‘truncated beta’, and the combination of 287 two ‘mixed unimodal’ distribution functions (e.g., two independent Gaussian curves in one plot), which are 288 illustrative of multimodality, but not necessarily reliable methods to examine the effects from dependent 289 random sampling in multimodality. 290 Thus, we used ‘Random walk Markov chain Metropolis-Hastings algorithms’ to simulate random 291 sampling, accounting for the hypothetical dependence between two different disease subtypes. To 292 simulate the statistical comparison of two these two hypothetical bimodal diseases, we i) ran Markov 293 Chain Monte Carlo (MCMC) simulations (Figure 5B), ii) used the ‘dip test’ to determine if the simulated 294 data were statistically multimodal Figure 5C, and iii) used the Student’s t-test to determine the statistical 295 significance, the mean differences and directions for the simulated distributions, using N=100. The MCMC 296 simulations clearly illustrate how random sampling of two bimodal hypothetical diseases lead to 297 inconsistent patterns of statistical results when compared. Notice that the data dispersion increases as N 298 increases; see in Figure 5D. 299 Conclusively, MCMC illustrations emphasize that increasing N in the study of multimodal 300 diseases in a single study should not be assumed to provide results that can be directly extrapolated to

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 7 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

301 the population, but rather, MCMC emphasize that the target study of data subtypes could lead to the 302 identification of mechanisms which could explain why diseases vary within biological systems (e.g., 303 humans and mice). 304 305 Personalized ‘data disease subtyping’ must be combined with proper ‘cage-cluster’ statistics 306 One important caveat to consider across animal studies is that increasing N alone is futile if 307 clustered-data statistics are not used to control for animal cage-density (>1 mouse/cage), which our group 308 showed contributes to ‘artificial heterogeneity’, ‘cyclical microbiome bias’, and false-positive/false- 309 negative conclusions2,54. To infer the role of scientific decision on the need for particular statistical 310 methods, we examined the studies reviewed by Walter et al.7 for ‘animal density’ and ‘statistical’ content 311 (see Methods). Of note, only one of the 38 studies (2.6%, 95%CI=0.1-13.8%) used proper statistical 312 methods (mixed-models) to control for cage-clustering18. Although on average, studies tested 6.6 patients 313 and 6.4 controls/group (range=1-21), most studies were below the average (65.7%, 25/38, 95%CI=48.6- 314 80.4%), with 14 having <4 donors/group (Figure 6A). However, of interest, the number of human donors 315 included in a study was inversely correlated with the number of mice/per donor used in the FMT 316 experiments Figure 6B. 317 Unfortunately, the majority of studies (25/38, 65.8%, 95%CI=48.6-80.4%) did not report animal 318 density, consistent with previous analyses2; while 10.5% of the studies (4/38, 95%CI=2.9-24.8%) housed 319 their mice individually, which is advantageous because study designs are free of ICC, eliminating the 320 need for cage-cluster statistics (Figure 6C). Our review of the statistical methods used across the 38 321 studies also revealed that most scientists used GraphPad chiefly for graphics and univariate analysis of 322 mouse phenotype data. This finding suggests an underutilization of the available functions in statistical 323 software, for example, Monte Carlo simulations, to help understand the effect of random sampling on the 324 reproducibility and significance of observed study results, and the likelihood of repeatability by others 325 (Monte Carlo adjusted 95% confidence intervals) (Figure 6D). 326 327 DISCUSSION 328 Despite the inclusion of large numbers of human subjects in microbiome studies, the causal role 329 of the human microbiome in disease remains uncertain. Exemplifying that a large N is not necessarily 330 informative with complex human diseases, a large metanalysis55 of raw hGM data from obese and IBD 331 patients showed that human disease phenotypes do not always yield reproducible inter-laboratory 332 predictive biological signatures. Even when hundreds of individuals are studied, especially, if the ‘effect 333 size for the disease of interest’ is narrow (i.e., in obesity; larger in IBD) relative to the variability of the 334 disease. For the human IBD subtypes (i.e., ulcerative colitis, and Crohn’s disease), the metanalysis55 335 concluded that only the ileal form of Crohn’s disease showed consistent hGM signatures compared to 336 both healthy control donors and patients with either colonic Crohn’s disease or ulcerative colitis,56 but no 337 consistent signatures were observed for obesity. In this context, herein we present observations derived 338 from simulation analysis to highlight that Walter et al7’s recommendation to scientists to seek further 339 funding to recruit more human donors (increasing N) is an imperfect solution to increase study 340 reproducibly. 341 Using a simple strategy of assuming random numbers drawn from an observed sample 342 distribution, we have analytically illustrated that increasing N yields aberrant and/or conflicting statistical 343 predictions, which depend on the patterns of disease variability and presence of disease subtypes (data 344 modes). Specifically, our simulations revealed that the number of discernable data subtypes may wax and 345 wane as N increases, and that increasing N does not uniformly enable the identification of statistical 346 differences between groups. Further, subjects randomly selected from a multimodal diseased population 347 may create groups with differences that do not always have the same direction. Especially, i) if the human 348 disease of interest exhibits variable phenotypes (e.g., cancer, obesity, asthma), and ii) if multivariable

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 8 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

349 cage-clustered data analyses are not used to account for ICC of phenotypes within/between animal 350 cages. 351 Under the ‘weak law of large numbers’ principle in mathematics (Bernoulli's theorem47-49; see ref 352 for further illustration57), as N increases, the distribution of the study/sample means approximates the 353 mean of the actual population, which facilitates the identification of statistically significant (but not 354 biologically meaningful) differences between otherwise overlapping sample datasets. Commonly used 355 statistical methods (e.g., t-tests; parametric vs. nonparametric) are designed to quantify differences 356 around the sample centers (mean, median) and range of dispersion (standard errors or deviation) of two 357 groups. However, these methods do not account for the distribution shape (unimodal vs. bi/multimodal) of 358 the compared datasets. With arbitrary increases in N, what is insignificant becomes significant, thus 359 increasing the tendency for the null hypothesis to be rejected despite clinically negligible differences58,59. 360 To guide the selection of sufficient N (cases) or disease data subtype, herein we highlight the use 361 of two simple statistical steps, i) to first determine if the shape of the dataset is unimodal (e.g., dip test), 362 and if not unimodal, then ii) to use statistical simulations and tests to determine the number of 363 modes/data values of interest. By doing so will facilitate the objective design of personalized/disease 364 subtyping experiments. Although comparisons between group means is important because some 365 diseases are truly different, findings from our own hGM-FMT46, and others16,18 highlight the relevance of 366 studying disease subtypes and the sources of variability by personalizing the functional analysis of the 367 hGM in mice (i.e., that both ‘pathological’ and ‘beneficial’ effects can be seen in hGM-FMT mice 368 independent of donor disease status). For example, in our own work, the functional characterization of 369 ‘beneficial’ or ‘non-beneficial’ disease microbiome subtypes in IBD patients at times of remission could 370 lead to the identification of an ideal patient fecal sample for future autologous transplantation during times 371 of active disease. Therefore, personalized research has the potential to identify different functional 372 microbiome subtypes (on a given outcome, e.g., assay or hGM-FMT mice) for one individual. 373 With respect to determining , easily implementable tests are available in 374 (diptest and mode; proprietary and community contributed) and R (Package ‘multimode’, community 375 contributed)60. The dip test61 quantifies departures from unimodality and does not require a priori 376 knowledge of potential multimodality and thus information can be easily interpreted from the test statistics 377 and the P-value 62,63. Although reports and comparative analysis of statistical performance have been 378 described for various multimodality tests (e.g., Dip test, Bimodality test, Silverman’s test and likelihood 379 ratio test64, and kernel methods), including simpler alternatives that use benchmarks to determine the 380 influence of data outliers 52,53,62,65, it is important to emphasize that every method depends on its intended 381 application and data set (and data shape),66 and therefore must be accompanied by the inspection of the 382 data distributions (‘shoulders’, ‘bumps’, and respective ‘valleys’). 383 In conclusion, by conducting a series of simulations and a review of statistical methods in current 384 hGM-FMT literature, we extensively illustrate the constraints of increasing N as a main solution to identify 385 causal links between the hGM and disease. We also highlight the integral role of multivariable cage- 386 clustered data analyses, as previously described by our group2. Herein, we provided a conceptual 387 framework that integrates the dynamics of sample center means and range of dispersion from the 388 compared datasets with kernel and violin plots to identify ‘data disease subtypes’. Biological insights from 389 well-controlled, analyzed and personalized analyses will lead to precise ‘person-specific’ principles of 390 disease, or identification of anti-inflammatory hGM, that could explain clinical/treatment outcomes in 391 patients with certain disease subtypes, and self-correct, guide and promote the personalized investigation 392 of disease subtype mechanisms. 393

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 9 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

394 AUTHOR CONTRIBUTIONS. 395 A.R.P. proposed analytical arguments, conducted statistical analysis and simulations. A.B. and A.R.P 396 conducted the survey, interpreted/analyzed data, and wrote manuscript. F.C. contributed with scientific 397 suggestions and revisions to the manuscript. All authors approved manuscript. 398 399 AVAILABILITY OF DATA AND MATERIALS. 400 All data analyzed for this report are included in the following published articles7,16,46 and/or their 401 supplementary information files. The authors will make all statistical codes available upon request. 402 403 DECLARATION OF INTERESTS 404 The authors declare no competing interests.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 10 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

405 Table 1. Comparative percentages of simulations that yielded significant results for two statistical 406 approaches based on randomly simulated data derived from unimodal distributions Inverse Normal Gaussian Monte Carlo Normal Gaussian

Simulation, n= and 50 T-tests 100,000 Adjusted T-tests 100,000 Adjusted One Way with statistical test (significant cumulative linear (overall significance with multiple comparison Tukey testb pattern*) (95%CI=) N=63/group)(95%CI=)a Dis1 vs Dis2 35.3% (22.9, 50.8) 57.7% (57.4, 58.0) 37.8% (37.5, 38.1) Dis1 vs Healthy 58.8% (43.2, 71.8) 9.1% (8.9, 9.3) 3.8% (3.7, 3.9) Dis2 vs Healthy ND 78.3% (78.0, 78.6) 59.6% (59.3, 59.9) One Way ANOVA p<0.05 68.1% (68.4 to 67.8) p>0.05 31.9% (32.2 to 31.6) 407 *Not overall p-value at N=63. ND, not determined. 408 a,b Notice that the percentage of simulations achieving significance is inflated when analysis for three groups is conducted with T- 409 tests (instead of ANOVA) which does not control for false positives due to family errors. Proper comparison between >2 groups 410 should be performed with methods to control for such family errors (e.g., ANOVA-post-hoc Tukey statistics). Note that the 411 percentage is different as illustrated in Figure 1D because the patterns with non-linear behavior are not considered.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 11 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a Simulation of published hGM-FMT data b Simulation of outcome for mice and human data Mean outcome due Generate pseudo- 18.0 Dis1 Dis2 Healthy

1trio/group Simulation of to hGM random numbers

10 incremental with disease N=3(trio)/group 8 distribution for statistical 6 defined by 9.0 simulated comparisons, e.g.,:

mean&SD 4 hGM- Simulate data … 6 mice/donor + + FMT 2 effect of N on Healthy

statistical diff & 3 3 e.g.,

0 0.0

sample

PUBLISHED distributions + + …

HealthyDis2 Dis1 DIsease1DIsease2 Number of donor of per Number mice Dis1 Healthy Outcome severity severity Outcome d 3 3 Cumulative mean responsesDis1.donor1 Dis1.donor2 is Dis1.donor3 moreDis2.donor1 Dis2.donor2 Dis2.donor3 Irreproducibility, random sampling, increasing N variable when groups have small number of Healthy.donor1 Healthy.donor2 Healthy.donor3 Bars: comparing two random trios of donors/group e Dis1 vs. Dis2 individuals, but mean for gruop stabilizes Line: comparing all cumulative n of trios/group 1100 c Simulated diseaseafter 8-10 subtypes individuals/group in mice due to hGM 90 14.0 Mean probabilities: 4

- 80 from ‘Dis1’ donors vs mice/donor Trio Dis1 vs Dis 2 Dis1 vs Healthy Non-Significant 70 T-test statistics P-value - univariate between 12.0 T-test statistics P-value - univariate between 60 disease1 vs disease2 comparing 3 disease1 vs healthy comparing 3 64.7% (4.0) hGM from

N donors/gpatients/group 3 21 (bars) and42 cumulative p-63 3 21 42 63 patients/group (bars) and cumulative p- 50 values for all 'three-patient' pairs (red line) values for all 'three-patient' pairs (red line) Donor Trio order 1 7 14 21 40 11 3 5 7 9 11 131415 17 19 2121 1 3 5 7 9 11 13 15 17 19 21 10.0 30 1 1.0000 1.0000 Significant high 1 20 patterns 1 patterns 10 35.3% (4.0) 8.0 subtype 0.1000 0.10000.1 0 ** 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50

Dis1Dis1vsDis2-prob.of.Nonsignif.sim vs. Healthy value 0.0100 100 Dis1vsDis2-prob.of.signif.sim 6.0 - 0.010.0100 Donor middle

P 90 T-test statistics P-value - univariate Non-Significant 2 subtype T-test statistics P-value - univariate 0.0010 80 0.0010.0010between disease1 vs disease2 comparing between disease1 vs healthy comparing 3 4.0 n.s. Signif. 70 3 patients/group (bars) and cumulative p- patients/group (bars) and cumulative p- 41.2% (3.3) values for all 'three-patient' pairs (red line) values for all 'three-patient' pairs (red line) 0.0001 0.0001 linear simulation 1 pattern 60 0.0001 1 3 5 7 9 11 13 15 17 19 21 1 3 5 7 9 11 13 15 17 19 21 Series1 Series2 50 2.0 low 1.0000 Series1 Series2 1.0000 1 40 subtype

Disease severity in mice severity Disease in 30 Donor 3 0.1000 0.1000 Significant 0.0 0.1 a being of probability cumulative 20 significant simulation simulation significant 0 3 6 9 12 15 18 21 10 58.8% (3.3) T-test statistics P-value - univariate 0.01000.01 0.0100 0 0 human disease1 donor1mice/donordonor2 donor3 1516 19 Nonsignif.22simulations25 28 31 Simulations34 37 40 43 465049 between disease1 vs disease2 comparing 0.0010.0010T-test statistics P-value - univariate between 0.0010T-test statistics P-value - univariate between Signif. Simulations 3 patients/group (bars) and cumulative p- disease1 vs disease2 comparing 3 disease1 vs healthy comparing 3 Pattern 5: erratic p value trend as N increases patients/group (bars) and cumulative p- patients/group (bars) and cumulative p- g values for all 'three-patient' pairs (red line) values for all 'three-patient'Signif. pairs (red line) n.s. 0.0001 values for all 'three -patient' pairs (red line) 2 pattern Dis1 vs Dis 2 CRC vs Healthy 0.0001 1 3 5 7 9 11 13 15 17 19 21 0.0001 1 3 5 7 9 11 13 15 17 19 21 *From Baxter et al. 1 3 5 7 9 11 13 15 17 19 21 1.00001 Series1 Series2 1.0000 Series1 Series2 3 21 42 63 3 21 42 63 f 1.0000 30.0 CRC Healthy 1 0.10000.1 0.1000

value

25.0 - 0.01000.01 0.0100

P 0.1000 T-test statistics P-value - univariate between T-test statistics P-value - univariate between 20.0 0.0010 0.0010 0.001 disease1 vs disease2 comparing 3 disease1 vs healthy comparing 3 0.1 patients/group (bars) and cumulativen.s. p- patients/group (bars) and cumulative p- values for all 'three-patient' pairs (red line) valuesn.s. for all 'three -patient' pairs (red line) 15.0

0.00010.0001 0.00013 pattern 0.05

1 3 5 7 9 11 13 15 17 19 21 1 3 5 7 9 11 13 15 17 19 21 Series1 Series2 1.00001 Series1 Series2 1.0000 10.0 0.0100 0.10000.1 0.1000 tumors of N 5.0 0.01 0.01000.01 0.0100 0.0 0.0010 0.0010.0010 0.0010 Simulated: Simulated: Signif. Signif. *From Basson et al. *From Baxter et al.

pattern 4 pattern 412 0.00010.0001 0.0001 0.001 Series1 Series2 Series1 Series2 0.0001

413 Figure 1. Random sampling from overlappingDis1.donor1 diseasesDis1.donor2 Dis1.donor3 yield ‘linear patterns of analyticalSeries1 Series2

414 irreproducibility’. Simulations on observed data fromHealthy.donor1 BassonHealthy.donor2 Healthy.donor3 et al46 to visualize naturally/highly variable 415 disease/healthy datasets. a) Method overview to generate pseudo-random numbers and simulations from 416 published (observed) data. b) Visualization of simulated outcome from random integers generated based 417 on 3 donors/group for Disease 1 (‘Dis1’), Disease 2 (‘Dis2’), and healthy groups. c) Simulation of hGM 418 transplanted into mice yields reproducible simulated ‘disease data subtypes’ from 6 mice/group. d) Four 419 patterns of analytical irreproducibility. Representative simulations comparing 2 groups of donors, with N 420 ranging from 3 (trio)donors/group to 63, in multiples of 3 (cumulative addition of new trios per group). Y 421 axis, p-value of differences using 2-group Student-t test. Notice as N increases, the cumulative 422 significance (red line) exhibit different linear patterns due to variance introduced by random sampling. e) 423 Cumulative probability of a simulation to yield a significant difference (blue; significant, black; non- 424 significant; parentheses, std. dev.). A comparison was deemed significant, if at least one p-value<0.05 425 across simulations with N between 3 and 63 donors/group. f) Visualization of simulated outcome using 426 observed data from Baxter et al16. g) Random simulations illustrate two other possible analytical patterns. 427 Notice as N increases, group differences become more significant, until an inflection point, where adding 428 more donors makes the significance disappear. See Supplementary Figure 1 for additional examples 429 and computed R2 value to illustrate the linearity of the correlation between N and statistical significance.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 12 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a Observed b Simulated Effect of number of individuals/group on plot appearance of statistics 1trio/group1trio 1trio/group 1trio/group 1trio/group 20 6 donors/group6 donors 15 6 donors/group6 donors (3 donors)/ (randomDnorm) (randomDnorm) 10 group 10 15 15 15 1-way ANOVA P=0.9011 8 8 P=0.4241 10 10 10 6 6 10

4 4 5 5 5 5 disease severity 2 2 disease severity

0 0 0 0 0 0 Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy Healthy DIsease1DIsease2 Healthy Healthy Healthy DIsease1DIsease2 DIsease1DIsease2 DIsease1DIsease2 Healthy Healthy c Simulated DIsease1DIsease2 DIsease1DIsease2 65 donors/group 20 9 donors/group9 donors 20 21 donors/group21 donors 20 65 donors d P=0.0477 .2 .2 N=630 P=0.4948 P=0.6371 N=63 15 15 15 N=630.15 .15 .08 .08

N=63 N=12 .1 .1 N=6 .06 .06 N=12 10 10 10 N=3 N=6 .05 .05 .04 .04 N=3 5 5 5 disease severity disease severity disease severity Kernel density .02

.02 0 0 -10 0-10 100 2010 3020 30 x x 0 0 0 0 0 (M) mean -20 -200 020 2040 4060 kdensity6080 mean3kdensity80 mean3 kdensity mean63kdensity mean63 Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy x x (V) variance kdensity mean630kdensity mean630 kdensity mean6kdensity mean6 kdensitykdensity variance3 variance3 kdensitykdensity variance63 variance63 Violin plot / densityHealthy curves overlap Violin plot / densityHealthy curves overlap Healthy kdensity variance630 kdensitykdensity variance6 mean12kdensity mean12 eDIsease1 DIsease2 DIsease1DIsease2 DIsease11000DIsease2 Monte Carlo kdensity variance630 kdensity variance6 Mean diff: 1-W ANOVA 30 MeanVIOLIN diff: 1-W ANOVA P<0.0001 30 Mean diff:DOT 1-W ANOVA P<0.0001 simulations kdensitykdensity variance12 variance12P=0.4946 Violin plotsViolin plot allows / density curves overlapvisualization of data subtypes FakeDisX vs Dis2 f 25 SD diff: SD diff: as N increases Bartlet P= 0.0002 Bartlet P= 0.0002 20 B-F P= 0.0013 B-F P= 0.0013 20 20 15 densitycurves to show overlap SDev (2500) 10 Mean diff: 1-W ANOVA P=0.4946 5 Violin plot / density curves overlap 1-W ANOVA,

simulations 30 15 10 10 1000 Monte Carlo 0 P=0.5 P > 0.05 P < 0.05 10 Dis2 (n=3) Dis2 (n=6) FakeDisX vs Healthy Dis2 (n=12) Dis2 (n=24) Dis2 (n=48) Dis2 (n=96)

0 0 20 5 Dis1 Dis2 Healthy Dis1 Dis2 Healthy Violin plot / density curves overlap Violin plot / density curves overlap Pairwise diff <0.0001 0.0055 Pairwise diff <0.0001 0.0055 0 20Tukey P= Mean diff:BAR 1-W ANOVA P<0.0001 30Tukey P= Mean diff:BOX 1-W ANOVA P<0.0001

<0.0001 <0.0001 simulations Dis2 (n=3) Dis2 (n=6) P > 0.05 P < 0.05 Dis2 (n=12) Dis2 (n=24) Dis2 (n=48) Dis2 (n=96) SD diff: SD diff: 1000 Monte Carlo 0.0002 0.0002 Bartlet P= Bartlet P= 10 Arrows: data subtypes 15 B-F P= 0.0013 B-F P= 0.0013 20 Dis2 vs Healthy

10 Outcome measure Outcome 10 0 5

0 0 unpaired P < 0.05 Dis2 (n=3) Dis2 (n=6) Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis2 (n=12)Dis2 (n=24)Dis2 (n=48)Dis2 (n=96) t-test DisX Dis2 Healthy DisX Dis2 Healthy P > 0.05 Dis2 (n=2500) 430 Pairwise diff <0.0001 0.0055 Pairwise diff <0.0001 0.0055 Tukey P= Tukey P= 1000 Monte Carlo 431 Figure 2. Violin<0.0001 plots enable visualization<0.0001 of datasimulations subtypes in simulations of random sampling as 432 a function of N. Observed raw data derived from Basson et al published data. a-b) Dot plots (mean, 433 range) of observed (1 trio; 3 donors/group), and simulated data (3 and 6 donors/group; panel B). Note 434 that differences are not significant because of the variability between diseases. c) Dot plots (mean, range) Mean 435 of simulated data for 9, 21 and 65 donors per group. Note that simulated mean effects became significant 436 with 65 donors/group. However, the mean difference is small compared to the variance of the groups and 437 the difference is not biologically different because it is a function of the total variance (23%). d) Kernel 438 density simulations (10,000) based on observed (n=3) and simulated data. Note that as N increases the 439 mean becomes more narrow while the variance widens. See 100,000 Monte Carlo simulations in Table 1. 440 e) Comparison of visual appearance and data display for simulated data to illustrate ‘disease data 441 subtypes’ in the population (i.e., ‘disease data shoulders’/subtypes shown as arrows). f) Violin plots allow 442 visualization of data subtypes as N increases (‘disease data shoulders’/subtypes shown as arrows).

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 13 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.22.002469; this version posted March 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a Violin plot / density curves overlap Violin plot / density curves overlap Violin plot / density curves overlap Violin plot / density curves overlap 20 Mean diff: 1-W ANOVA P<0.0001 20 Mean diff: 1-W ANOVA P=0.0238 20 Mean diff: 1-W ANOVA P=0.0128 20 Mean diff: 1-W ANOVA P=0.0051

SD diff: SD diff: SD diff: SD diff:

Bartlet P= 0.0635 Bartlet P= 0.0008 Bartlet P= 0.0399 Bartlet P= 0.1962 15 B-F P= 0.0481 15 B-F P= 0.0002 15 B-F P= 0.1512 15 B-F P= 0.2453

10 10 10 10

Outcome severity 5 5 5 5

0 0 0 0 Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy Dis1 Dis2 Healthy

Pairwise diff 0.0031 0.0001 Pairwise diff 0.2302 0.0182 Pairwise diff 0.1192 0.0107 Pairwise diff 0.6342 0.0046 Tukey P= Tukey P= Tukey P= Tukey P= 0.6632 0.5143 0.6142 0.0603

b

Dis1 Dis2 Dis1 Dis2 Healthy Healthy population population

SUBTYPES SUBTYPES SUBTYPES SUBTYPES SUBTYPES SUBTYPES SUBTYPES SUBTYPES withinsampled withinsampled

.15

No clear threshold. All variances overlap. .15 Overall mean of Not biologically useful to differentiate Threshold not

Dis2 was p<0.05 37.4% donors at population level biologically n=63 simulation donors/group diff Dis2 kdensity healthy relevant vs Dis1 & .1 Overallkdensity meandis1 of .1 Healthy. 31.3% Dis2 was p<0.05 diff kdensity dis2 However, Dis1 vs poor or Dis1 & Healthy predictor 31.3% healthy .05 .05 Population density density Population kdensity dis1 kdensity dis2

(kernel estimate ofsimulated data) kdensity healthy 0 0 0 5 10 15 0 5 10 15 20 Outcome - disease severity from functional studies in mice Outcome - disease severity from functional studies in mice

c Violin plot / density curves overlap Biologically/clinically relevant threshold. .2 d 30 Mean diff: 1-W ANOVA P<0.0001 kdensity fake_dis If outcome= 14.5 kdensity dis2 SD diff: if a donor has an outcome 90% Dis.X 0.0002 kdensity healthy Bartlet P= of >14.5 he/she would

B-F P= 0.0013 .15 20 have >90% of chance of having ‘mock diseaseX’ and not ‘disease2’ .1 (<12%),

10 and 0% chance of being 90% healthy Outcome severity Population density density Population Dis.X .05

0 Dis1 Dis2 Healthy (kernel estimate ofsimulated data) 0

10% 10% Pairwise diff <0.0001 0.0055 Tukey P= 0 5 10 14.5 15 20 25 Dis2 443 <0.0001 Outcome - Disease severity across sampled population n=65/group 444 Figure 3. Violin plots illustrate that statistical differences with large N may not have clinical 445 predictive utility at individual level. Violin and kernel plots illustrate statistical vs. biologically relevant 446 differences. a) Violin plots of four simulated random number sets illustrate that each set of donors may 447 have unique subtypes of disease illustrated with arrowheads (disease severity scores with higher number 448 of simulated donors). Arrows indicate ‘disease data shoulders/subtypes’ vary with every simulation of 63 449 donors/group. b) Kernel density curves illustrate large overlap of sample population from simulated data 450 (see panel A). Significant differences are highlighted by shaded area. Note the threshold does not have 451 distinctive separation for the plots indicating that it is not biologically useful as a predictor of outcome. 452 c) Violin plots illustrate meaningful statistical difference for population (compared to panel 3b). ‘Fake 453 disease X’ (‘DisX’) was generated as a ‘mock’ disease following Gaussian distribution around the mean. 454 Monte Carlo simulations were significant 96.5% (upper limit 97.6, lower limit 95.4%). d) Kernel density 455 curves of panel 3c illustrates example of distribution separation with both statistical difference and 456 biological relevance.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 14 Population: universe of Samples: Larger samples leads to Even if real population Statistical methods individuals healthy & diseased Individuals in more precise means and distributions sampled determine if mean and SD studies narrower standard dev. in a study have non- of the samples/groups (SD) around the mean. Gaussian distributions differ and overlap Study 1 Size samples, Gaussian Significantly Large Unimodal different medium small Not different, Bimodal but depends on Study n sample size The data distribution within a sample from the Methods population (even if ‘randomly sampled”)’ may SD Multimodal suboptimal if be heterogeneous, depending on distribution multimodal of unknown disease/health subtypes Sample mean

Violin plots enable the Analysis of multimodal data samples/groups may yield variable results depending on subtypes sampled visualization of data Given, Dis1 and Dis2 If study a samples mostly And compares data distribution shapes, and See MCMC multimodal diseases ‘low subtype’ Dis1 donors with Dis2 ‘high data subtypes multimodal Dis1 subtype’ donors SIMULATIONS And population mean,SD in Fig 5 Dis1 has Dis1 Dis1 Dis2 Dis2 Kernel two data low low Dis2 Dis2 Vs. Then, there will density plot subtypes: Dis1 Dis1 high high be ‘confounded’ low Dis1 low Dis2 Dis2 significant differences High high high high Dis1 Dis2 Dis1 Violin plots High

Use kernel- Dis1 density based High statistics Low low Low 457 458 Figure 4. Conceptual framework of the effect of random sampling from a multimodal disease 459 population on the reproducibility of study results. Schematic conceptualization of random sampling 460 from a bi/multi-modal disease distribution (subtypes) and utility of violin and kernel density plots to 461 visualize disease subtypes.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 15 BETA(.5, .4) N=100 Normal Mixture Density Density Density Density

Multimodal Simulations Markov Chain Monte Carlo bi-modal distributions c Statistical tests to define multimodality a Truncated Mixture b N=100,000 N=100 N=100 Beta Normal BETA(.5, .4) N=100 Normal distribution .08 2.5 0.2 .06

N=100 N=100 N=100 Density Density

Density .04 Density Density Density

.02 0.0 0.5 1.0 1.5 2.0 2.5 0.00 0.05 0.10 0.15 0.20 0.0 0.5 1.0 1.5 2.0 2.5 00.0 0.5 1.0 1.5 2.0 2.5 0 0.00 0.05 0.10 0.15 0.20 0.0 0.2 0.4 0.6 0.8 1.0 0.002 0.05 4 0.106 8 0.15 10 12 0.20

x1 x2 0 0 1 2 12 Outcome of disease severity 30 40 50 60 70 quality 0.0 0.2 0.4 0.6 0.8 1.0 2 4 6 8 10 12 d 0.0 0.2 0.4 0.6 0.8 1.0 Effect of2 4N 6 on 8 10 variable 12 significance across simulations and pairwise group direction of effect N=3 N=6 N=12 N=50 N=1,000 N=100,000 (T-test, P=0.16) NS (P=0.048)** (P=0.174) NS (P=7.1e-7)*** (P=2.2e-16)*** (P=0.45) NS x1 x2 Dip P= sL: 0.11 sL: 0.09 sL: 0.04 sL: 0.02 sL: 0.02 sL: 0.16 dL: 0.18 dL: 0.10 dL: 0.04 dL: 0.01 dL: 0.01 dL: 0.16

462 Outcome of disease severity (all plots are scaled between -10 and 10, with reference line at 0 for comparison) 463 Figure 5. Markov chain Monte Carlo (MCMC) simulations emphasize targeted study of subtypes in 464 the study of multimodal diseases. Random walk Markov chain Metropolis-Hastings’ algorithm to 465 simulate random sampling accounting for the hypothetical dependence of two different disease subtypes. 466 a) Mathematical function that allows distribution of number sampling if numbers that follow a bi-modal 467 distribution. Simulation depicts distributions derived from ‘truncated beta’ and the combination of two 468 ‘mixed unimodal’ distribution functions. b) Random sampling iterations after Markov chain simulations 469 (N=100) comparing two hypothetical bimodal data distributions (red dotted line vs. blue solid line) for an 470 outcome of disease severity, wherein 100,000 simulations represent approximately the real distribution 471 (grey line; zero). As a stochastic model, the Markov chain algorithm considers biologically relevant 472 sampling dependence. Notice how random sampling for two bimodal distributions can yield non- 473 consistent statistical results variability between iterations, in this case N=100 (-5, SD of 3; 7, SD of 3). c) 474 Example of a Hartigan-Hartigan (Hartigans’) unimodality dip test and a modes test 61,62 showing a 475 multimodal data distribution of a hypothetical dataset (black dotted line) compared to a normal univariate 476 density plot (red line). To identify data subtypes (modes), the dip test61 computes a p-value to help 477 determine if a data is unimodal or multimodal and does not require a priori knowledge of potential 478 multimodality and can be interpreted from its test statistics and P-value (p<0.05 indicates data is not 479 unimodal, p>0.05-1.0 indicating at least one data mode exists in dataset). d) Effect of increasing N on t- 480 test significance (direction, arrow), and dip test for MCMC simulations using the Markov chain simulation 481 scripts (panel C) comparing two hypothetical bimodal data distributions (red dotted line vs. blue solid line), 482 controlling for randomness (set.seed 101) while increasing N. Plots scaled between -10 and 10 483 illustrate increased data dispersion as N increases (grey line; zero). See Supplementary Figure 2 484 for wider range of N and the examples for dip test and modes analysis using STATA and R 485 commands.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 16

a Estimated Average d overall statistical methods across 38 hGM−FMT mouse studies From reported data in 36/38 studies 2020 1515 1010 55 0 0 donors/study mice/donor 20 20 15 15 10 10 5 5 0 0 20 15 10 5 0 21.021 10.010

18.0 Yes Unclear No 15.0 12.0 5. 0 9. 0

n of D.Hypothesis.(related.to.FMT)* 6. 0 20 patients 3. 0 M.Spearman_corr mice/patient 0. 0 0. 0 M.Benjamini_Hochberg_fdr 0 0. 0 3. 0 6. 0 9. 0 12.015.018.02121.0 00. 00 5. 00 1010.00 M.Kruskal controls mice/control M.Dunn_nonparametric Patients 6.6 m/Patient 5.1 M.Linear.Mixed M.multivariateanalysis 15 Controls 6.4 m/Control 4.7 hypothesis_fmt abar.cohoused abar.multivariateanalysis_cage two_way_ANOVA Bonferroni_correction_param Linear_Mixed_effects_models Anosim_Adonis_test ROC univariable_logistic_reg multivariable_Holm_Sidak Logistic_reg Holm_Sidak Cluster_analysis MRPP_test_biolog DS_FDR_otu Diagonal_covariance_matrices Linear_Reg Dunnett_pairw_ANOVA Shapiro_Wilk transform_rootsq Pearsons_chisquared ICC_x16s F_test Dagostino_Pearson_normality sas stata rsoftware Fischers_exact permanova Tukey_post_tests spss Grubb_Smirnov_Grubbs_outliers abar.micecage Spearman_corr Benjamini_Hochberg_fdr Wilcoxon Kruskal Dunn_nonparametric n_hum_dis abar.miceperpatient Pseudoreplication graphpad mann_whitney t_test_students_Welch one_way_ANOVA M.two_way_ANOVA study29 M.Bonferroni_correction study11 hypothesis_fmt abar.cohoused abar.multivariateanalysis_cage hypothesis_fmt two_way_ANOVA abar.cohoused Bonferroni_correction_param abar.multivariateanalysis_cage Linear_Mixed_effects_models two_way_ANOVA Anosim_Adonis_test Bonferroni_correction_param ROC Linear_Mixed_effects_models univariable_logistic_reg Anosim_Adonis_test multivariable_Holm_Sidak ROC Logistic_reg univariable_logistic_reg Holm_Sidak multivariable_Holm_Sidak Cluster_analysis Logistic_reg MRPP_test_biolog Holm_Sidak DS_FDR_otu Cluster_analysis Diagonal_covariance_matrices MRPP_test_biolog Linear_Reg DS_FDR_otu Dunnett_pairw_ANOVA Diagonal_covariance_matrices Shapiro_Wilk Linear_Reg transform_rootsq Dunnett_pairw_ANOVA Pearsons_chisquared Shapiro_Wilk ICC_x16s transform_rootsq F_test Pearsons_chisquared Dagostino_Pearson_normality ICC_x16s sas F_test stata Dagostino_Pearson_normality rsoftware sas Fischers_exact stata permanova rsoftware Tukey_post_tests Fischers_exact spss permanova Grubb_Smirnov_Grubbs_outliers Tukey_post_tests abar.micecage spss Spearman_corr Grubb_Smirnov_Grubbs_outliers Benjamini_Hochberg_fdr abar.micecage Wilcoxon Spearman_corr Kruskal Benjamini_Hochberg_fdr Dunn_nonparametric Wilcoxon Kruskal n_hum_dis Dunn_nonparametric abar.miceperpatient n_hum_dis Pseudoreplication abar.miceperpatient graphpad Pseudoreplication mann_whitney t_test_students_Welch graphpad one_way_ANOVA mann_whitney t_test_students_Welch one_way_ANOVA hypothesis_fmt abar.cohoused abar.multivariateanalysis_cage two_way_ANOVA Bonferroni_correction_param Linear_Mixed_effects_models Anosim_Adonis_test M.Tukey_post_testsROC univariable_logistic_reg multivariable_Holm_Sidak Logistic_reg Holm_Sidak Cluster_analysis MRPP_test_biolog DS_FDR_otu Diagonal_covariance_matrices Linear_Reg Dunnett_pairw_ANOVA Shapiro_Wilk transform_rootsq Pearsons_chisquared ICC_x16s F_test Dagostino_Pearson_normality sas stata rsoftware Fischers_exact permanova Tukey_post_tests spss Grubb_Smirnov_Grubbs_outliers abar.micecage Spearman_corr Benjamini_Hochberg_fdr Wilcoxon Kruskal Dunn_nonparametric n_hum_dis abar.miceperpatient Pseudoreplication graphpad mann_whitney t_test_students_Welch one_way_ANOVA study7study29study29study29 S.spss study17study11study11study11 Donors vs mice/donor M.Grubb_Smirnov_outliers b study2study7study7study7 D.mice.per.cage 10 study33 20.0020 study17study17study17 M.Anosim_Adonis_test study20study2study2study2 M.Fishers_exact study13study33study33 y = -0.211x + 7.8 study19study33 M.ROC study18 study20study20study20 R² = 0.24 M.univariable_logistic_reg study9study13study13study13 study34study19study19study19 M.multivariable_Holm_Sidak study25study18study18study18 15.00 M.Logistic.reg 5 study16study9study9study9 y = -2.96ln(x) + 11.8 M.Holm_Sidak study37study34study34study34 study30study25study25study25 R² = 0.36 M.Cluster_analysis study31study16study16study16 M.MRPP_test_biolog study37study37study37 study24study30study30study30 M.DS_FDR_otu study10study31study31study31 10.00 M.Diagonal_covariance study28 0 study8study24study24study24 M.Linear.Reg study6study10study10study10 y = 16.6x-0.68 M.Dunnett_pairw_ANOVA study23study28study28study28 M.Shapiro_Wilk study15study8study8study8 study27study6study6study6 R² = 0.41 M.transform_rootsq study36study23study23study23 5.00 M.Pearsons_chisquared study12study15study15study15 M.ICC_x16s study14study27study27study27 n of mice/human donor study35study36study36study36 M.F_test study1study12study12study12 M.DagostinoPearson_normality study21study14study14study14 study5study35study35study35 M.Linear.Mixed.CAGE.randomeffect study4study1study1study1 S.stata study38study21study21study21 0.00 study22study5study5study5 0 N of donors/study 40 S.sas study32study4study4study4 study26study38study38study38

0.0 8.016.024.032.040.0 statistical methods in hGM0FMT mouse studies overall D.cohoused_YesNoUnclear study22study22study22 study3study32study32study32 S.Rsoftware study26study26study26 overall statistical methods in hGM0FMT mouse studies overall statistical methods in hGM0FMT mouse studies overall 5 MxCg statistical methods in hGM0FMT mouse studies overall M.Wilcoxon study3study3study3 c 10.5%

D.Human.N.donors.Disease*

2 MxCg D.meanhumanctrl 13.2% S.Graphpad_versions.5through8 M.mann_whitney M.t_test_students_Welch M.one_way_ANOVA

1MxCg 65.8% M.Pseudoreplication* 10.5% D.maxMice_Distotalog* D.miceperpatient Most studies do not report animal density animal report D.miceperctrl study03 study06 study27 study32 study09 study19 study13 study14 study07 study36 study08 study12 study28 study10 study24 study29 study21 study22 study38 study11 study33 study31 study18 study20 study35 study26 study16 study04 study05 study25 study34 study15 study23 study30 study37 study02 study01 study17 Reporting cage-clustered effects 486 N of mice/cage | % of 38 studies 487 488 Figure 6. Study design and statistical methods among 38 hGM-FMT studies reveal lack of cage- 489 clustered analysis and dominance of univariate analysis. Analysis of 38 studies as reviewed by 490 Walter et al7. a) Average and correlation for number of donors (left plot; patients with disease vs. healthy 491 control) and average number of mice per human donor in each study (right plot). b) Correlation plot with 492 exponential, logarithmic and linear fits shows that scientists tend to use less animals when more donors 493 are tested, creating a ‘trade-off’ between data uncertainty due to variance in human disease with that of 494 variance in animal models for disease of interest. c) shows distribution of studies reporting mice 495 per cage (MxCg) attributing to cage-clustered effects. (keywords: cage/cluster*, individual/house*, mice 496 per*, density*, mixed/random/fix/methods/stat*, P=). Note that most studies do not report MxCg (i.e., 497 animal density). d) Heatmap illustrates the overall statistical methods (M), statistical software (S) and 498 study design (D) used by the 38 reviewed studies. Note that only one study (‘study 6’)18 used linear 499 mixed methods to control for the random effects of cage clustering and that the majority of studies limited 500 analysis of datasets to univariate-based approaches. *Asterix indicate variables examined by Walter. 501 Notice the dichotomy between software, the GUI interface R statistical software and GraphPad. 502 Highlighted areas shown for reference; notice the cluster around GraphPad and univariate analysis.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 17 Erratic effect of N on p-value Nonsig>sign>nonsig

Erratic effect of N on p-value Nonsig>sign>nonsig

Erratic effect of N on p-value Nonsig>sign>nonsig

Erratic effect of N on p-value Nonsig>sign>nonsig

503 504 Supplementary Figure 1. Further examples for data simulations with R2 value illustrate linearity as 505 illustrated in Figure 1D. Computed R2 value (mean 0.51±0.23, 20 simulations) illustrate the linearity of 506 the correlation between N and statistical significance. Y axis, p-value of the differences using 2-group 507 Student-t test.

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 18 N=3 N=6 N=12 N=24 a (P=0.16) NS (P=0.048)** (P=0.174) NS (P=0.0004)*** Dip P= sL: 0.11 sL: 0.09 sL: 0.08 sL: 0.16 dL: 0.18 dL: 0.10 dL: 0.07 dL: 0.16

N=50 N=100 N=1,000 N=100,000 (P=7.1e-7)*** (P=6.1e-13)*** (P=2.2e-16)*** (Tt, P=0.45) NS sL: 0.04 sL: 0.02 sL: 0.02 Dip P= dL: 0.04 dL: 0.04 dL: 0.01 sL: 0.02 dL: 0.01

b .04 .03 .02 Density .01 0 30 40 50 60 70 quality

c

508

509 Supplementary Figure 2. Markov chain Monte Carlo (MCMC) simulations and examples of dip test. 510 Random walk Markov chain Metropolis-Hastings’ algorithm to simulate random sampling accounting for 511 the hypothetical dependence of two different disease subtypes (complement to plots presented in Figure 512 5).

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 19 513 METHODS 514 Preclinical hGM-FMT (observed) data used for simulations. To facilitate the visualization of how 515 random sampling and disease variability influences study conclusions (significant vs. non-significant p- 516 values) in the context of N, we conducted a series of simulations from preclinical hGM-FMT disease 517 phenotyping data estimates from our own IBD studies (Basson et al)46 and that of Baxter et al16 (a study 518 reviewed by Walter et al[7]). In brief, by transplanting feces from inflammatory bowel disease (IBD), 519 namely Crohn’s disease (‘Dis1’) and ulcerative colitis (‘Dis2’), and ‘Healthy’ donors (n=3 donors for each 520 ’disease/healthy’ state) into a germ-free spontaneous mouse model of cobblestone/ileal Crohn’s disease 521 (SAMP1/YitFc)46,67, Basson et al46 observed with ~90% engraftment of human microbial taxa after 60 522 days, that the hGM-FMT effect on mouse IBD-phenotype was independent of the disease state of the 523 donor. Specifically, samples from some IBD patients and some healthy donors did not affect the severity 524 of intestinal inflammation in mice, while the remaining donors exacerbated inflammation. The overlapping 525 presence of both pro-inflammatory and non-inflammatory hGM in the disease phenotype of the mice for 526 IBD and healthy human donors, indicate the presence of data bimodality. Comparably, Baxter et al[17] 527 found that differences in the number of tumors resulting in a hGM-FMT mouse model of chemically 528 induced colorectal cancer (CRC) was independent of the cancer status of the human donors (n=3 529 colorectal cancer, n=3 healthy individuals). 530 531 Simulations from preclinical hGM-FMT data. Simulations were first conducted using random and 532 inverse random normal functions using the mean±SD data from published data16,46. In all depicted 533 illustrations, the randomly generated numbers used computer software/automated pseudorandom 534 (seeded and unseeded) methods. Unless described otherwise, the numbers generated were restricted to 535 be confined within biologically meaningful data boundaries based on published data (for example, 0 as 536 minimum for normal histological score or intestinal inflammation, and 80 as arbitrary ~3-fold the maximum 537 possible histological score). For illustration purposes, the x- or y-axes in plots were generically labeled as 538 ‘outcome disease severity’. Simulating a situation where a scientist would recruit a trio of donors (3 539 donors) per group at a time, and was interested in conducting interim statistical analysis following the 540 addition of every trio of donors to the study, we summarized the pairwise group analysis for the simulated 541 disease comparisons, for various N, and for consecutively added donors as an aggregate ‘cumulative 542 probability of being a significant simulation’ . Comparisons were deemed significant if at least one 543 p-value was <0.05 across simulations. Student’s unpaired t-test and/or One-way ANOVA with Tukey 544 statistical comparisons were also adjusted using Monte Carlo simulations to determine the adjusted p 545 values and the % of simulations to demonstrate that not all disease group comparisons would be 546 reproducible. As a control normal (unimodal) simulation, we created several datasets, including one 547 depicted in illustrations as ‘fake disease X’). Lastly, to illustrate the effect of random sampling from data 548 simulations from functions, unconstrained-parameter simulations of two mixed yet 549 separate normal distributions, were performed using the Random Walk Metropolis-Hasting algorithm, a 550 form of doing dependent sampling from a proposed posterior distribution, as a well-established method of 551 Markov Chain Monte Carlo (MCMC) simulations, using R, and STATA (v15.1). In the latter, the MCMC 552 sampling of a new individual is dependent on the of being part of a mode within a bimodal 553 distribution, instead of being completely random from a unimodal distribution, using a log-likelihood 554 correction to prevent negative sigma values and also allow for asymmetrical distributions. This method is 555 beneficial as it asymptotically converges to the true proposal distribution, and so represents a more 556 robust method of data simulation of other potential alternatives of simulating sampling from bimodal 557 distributions (i.e., binomial, and mixed normal distributions). 558 559 Variability of statistical methods in hGM-FMT rodent studies. To determine the sources of statistical 560 methods variability in hGM-FMT rodent studies, we reviewed the content of 38 studies listed in Walter et 561 al.7 For computation purposes, we searched each article for the following keywords “cage,” “stat*”,

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 20 562 “housed,” “multiple,” “multivariable,” “cluster,” “mixed,” “individual*”, “random*” and appropriately extracted 563 details to additional inserted columns of an excel file. Detailed statistical tests and software used, focused 564 on assessing the effect of the hGM in the rodent phenotypes, were extracted to determine if studies used 565 proper cluster statistical analysis, and/or controlled for random effects introduced by caging, when 566 needed; that is, if scientists housed more than one mouse per cage. Integer numbers, including 567 descriptions of animal density, were assigned to the sourced keywords to allow for statistical analysis. If a 568 range was provided for N or animal density (e.g., 1-5), estimations were computed using the median 569 value within the range, as well as the minimum and maximum values. Average of estimated center values 570 were used for analysis and graphical summaries 571 572 Statistical Analysis. for parametric data were employed if assumptions were 573 fulfilled (e.g., 1-way ANOVA). Non-fulfilled assumptions were addressed with nonparametric methods 574 (e.g., Kruskal-Wallis). The 95% confidence intervals are reported to account for sample size and for 575 external validity. The test of multimodality was conducted using the dip test (which measures the 576 departure of a sample from unimodality, using as reference the uniform distribution as a worst case) and 577 STATA61, with packages available in R68. The tabulation of modes from a variable in a dataset was 578 computed using the modes and hsmode function in STATA.69,70 .Statistical and simulation analyses were 579 conducted or plotted with Excel, R, Stata, and GraphPad. 580 581 582 REFERENCES 583 1 Kilkenny, C., Browne, W. J., Cuthill, I. C., Emerson, M. & Altman, D. G. Improving bioscience 584 research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol 8, e1000412, 585 doi:10.1371/journal.pbio.1000412 (2010). 586 2 Basson, A. et al. Artificial microbiome heterogeneity spurs six practical action themes and 587 examples to increase study power-driven reproducibility Sci Rep 10 Article number: 5039, 588 doi:10.1038/s41598-020-60900-y (2019). 589 3 Franklin, C. L. & Ericsson, A. C. Microbiota and reproducibility of rodent models. Lab Animal 46, 590 114-122, doi:10.1038/laban.1222 (2017). 591 4 Ericsson, A. C. et al. The influence of caging, bedding, and diet on the composition of the 592 microbiota in different regions of the mouse gut. Sci Rep 8, doi:10.1038/s41598-018-21986-7 593 (2018). 594 5 Stappenbeck, T. S. & Virgin, H. W. Accounting for reciprocal host-microbiome interactions in 595 experimental science. Nature 534, 191-199, doi:10.1038/nature18285 (2016). 596 6 Arrieta, M. C., Walter, J. & Finlay, B. B. Human Microbiota-Associated Mice: A Model with 597 Challenges. Cell Host Microbe 19, 575-578, doi:10.1016/j.chom.2016.04.014 (2016). 598 7 Walter, J., Armet, A. M., Finlay, B. B. & Shanahan, F. Establishing or Exaggerating Causality for 599 the Gut Microbiome: Lessons from Human Microbiota-Associated Rodents. Cell 180, 221-232, 600 doi:10.1016/j.cell.2019.12.025 (2020). 601 8 Soderborg, T. K. et al. The gut microbiota in infants of obese mothers increases inflammation and 602 susceptibility to NAFLD. Nat Commun 9, 4462, doi:10.1038/s41467-018-06929-0 (2018). 603 9 Liu, R. et al. Neuroinflammation in Murine Cirrhosis Is Dependent on the Gut Microbiome and Is 604 Attenuated by Fecal Transplant. Hepatology 71, 611-626, doi:10.1002/hep.30827 (2020). 605 10 Fielding, R. A. et al. Muscle strength is increased in mice that are colonized with microbiota from 606 high-functioning older adults. Exp Gerontol 127, 110722, doi:10.1016/j.exger.2019.110722 607 (2019). 608 11 Maeda, Y. et al. Dysbiosis Contributes to Arthritis Development via Activation of Autoreactive T 609 Cells in the Intestine. Arthritis Rheumatol 68, 2646-2661, doi:10.1002/art.39783 (2016). 610 12 Stoll, M. L. et al. Akkermansia muciniphila is permissive to arthritis in the K/BxN mouse model of 611 arthritis. Immun 20, 158-166, doi:10.1038/s41435-018-0024-1 (2019). 612 13 Petursdottir, D. H. et al. Early-Life Human Microbiota Associated With Childhood Allergy 613 Promotes the T Helper 17 Axis in Mice. Front Immunol 8, 1699, doi:10.3389/fimmu.2017.01699 614 (2017).

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 21 615 14 Feehley, T. et al. Healthy infants harbor intestinal bacteria that protect against food allergy. Nat 616 Med 25, 448-453, doi:10.1038/s41591-018-0324-z (2019). 617 15 Battaglioli, E. J. et al. Clostridioides difficile uses amino acids associated with gut microbial 618 dysbiosis in a subset of patients with diarrhea. Sci Transl Med 10, 619 doi:10.1126/scitranslmed.aam7019 (2018). 620 16 Baxter, N. T., Zackular, J. P., Chen, G. Y. & Schloss, P. D. Structure of the gut microbiome 621 following colonization with human feces determines colonic tumor burden. Microbiome 2, 20, 622 doi:10.1186/2049-2618-2-20 (2014). 623 17 Wong, S. H. et al. Gavage of Fecal Samples From Patients With Colorectal Cancer Promotes 624 Intestinal Carcinogenesis in Germ-Free and Conventional Mice. Gastroenterology 153, 1621- 625 1633 e1626, doi:10.1053/j.gastro.2017.08.022 (2017). 626 18 Tomkovich, S. et al. Human colon mucosal biofilms from healthy or colon cancer hosts are 627 carcinogenic. J Clin Invest 130, 1699-1712, doi:10.1172/JCI124196 (2019). 628 19 Kelly, J. R. et al. Transferring the blues: Depression-associated gut microbiota induces 629 neurobehavioural changes in the rat. J Psychiatr Res 82, 109-118, 630 doi:10.1016/j.jpsychires.2016.07.019 (2016). 631 20 Zheng, P. et al. Gut microbiome remodeling induces depressive-like behaviors through a pathway 632 mediated by the host's metabolism. Mol Psychiatry 21, 786-796, doi:10.1038/mp.2016.44 (2016). 633 21 Fujii, Y. et al. Fecal metabolite of a gnotobiotic mouse transplanted with gut microbiota from a 634 patient with Alzheimer's disease. Biosci Biotechnol Biochem 83, 2144-2152, 635 doi:10.1080/09168451.2019.1644149 (2019). 636 22 Zheng, P. et al. The gut microbiome from patients with schizophrenia modulates the glutamate- 637 glutamine-GABA cycle and schizophrenia-relevant behaviors in mice. Sci Adv 5, eaau8317, 638 doi:10.1126/sciadv.aau8317 (2019). 639 23 Li, S. X. et al. Gut microbiota from high-risk men who have sex with men drive immune activation 640 in gnotobiotic mice and in vitro HIV infection. PLoS Pathog 15, e1007611, 641 doi:10.1371/journal.ppat.1007611 (2019). 642 24 Smith, M. I. et al. Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science 643 339, 548-554, doi:10.1126/science.1229000 (2013). 644 25 Kau, A. L. et al. Functional characterization of IgA-targeted bacterial taxa from undernourished 645 Malawian children that produce diet-dependent enteropathy. Sci Transl Med 7, 646 doi:10.1126/scitranslmed.aaa4877 (2015). 647 26 Wagner, V. E. et al. Effects of a gut pathobiont in a gnotobiotic mouse model of childhood 648 undernutrition. Sci Transl Med 8, 366ra164, doi:10.1126/scitranslmed.aah4669 (2016). 649 27 Natividad, J. M. et al. Ecobiotherapy Rich in Firmicutes Decreases Susceptibility to Colitis in a 650 Humanized Gnotobiotic Mouse Model. Inflamm Bowel Dis 21, 1883-1893, 651 doi:10.1097/MIB.0000000000000422 (2015). 652 28 Nagao-Kitamoto, H. et al. Functional Characterization of Inflammatory Bowel Disease-Associated 653 Gut Dysbiosis in Gnotobiotic Mice. Cell Mol Gastroenterol Hepatol 2, 468-481, 654 doi:10.1016/j.jcmgh.2016.02.003 (2016). 655 29 De Palma, G. et al. Transplantation of fecal microbiota from patients with irritable bowel 656 syndrome alters gut function and behavior in recipient mice. Sci Transl Med 9, 657 doi:10.1126/scitranslmed.aaf6397 (2017). 658 30 Touw, K. et al. Mutual reinforcement of pathophysiological host-microbe interactions in intestinal 659 stasis models. Physiol Rep 5, doi:10.14814/phy2.13182 (2017). 660 31 Chen, Y. J. et al. Parasutterella, in association with irritable bowel syndrome and intestinal 661 chronic inflammation. J Gastroenterol Hepatol 33, 1844-1852, doi:10.1111/jgh.14281 (2018). 662 32 Britton, G. J. et al. Microbiotas from Humans with Inflammatory Bowel Disease Alter the Balance 663 of Gut Th17 and RORgammat(+) Regulatory T Cells and Exacerbate Colitis in Mice. Immunity 50, 664 212-224 e214, doi:10.1016/j.immuni.2018.12.015 (2019). 665 33 Torres, J. et al. Infants born to mothers with IBD present with altered gut microbiome that 666 transfers abnormalities of the adaptive immune system to germ-free mice. Gut 69, 42-51, 667 doi:10.1136/gutjnl-2018-317855 (2020). 668 34 Sampson, T. R. et al. Gut Microbiota Regulate Motor Deficits and Neuroinflammation in a Model 669 of Parkinson's Disease. Cell 167, 1469-1480 e1412, doi:10.1016/j.cell.2016.11.018 (2016).

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 22 670 35 Berer, K. et al. Gut microbiota from multiple sclerosis patients enables spontaneous autoimmune 671 encephalomyelitis in mice. Proc Natl Acad Sci U S A 114, 10719-10724, 672 doi:10.1073/pnas.1711233114 (2017). 673 36 Cekanaviciute, E. et al. Gut bacteria from multiple sclerosis patients modulate human T cells and 674 exacerbate symptoms in mouse models. Proc Natl Acad Sci U S A 114, 10713-10718, 675 doi:10.1073/pnas.1711235114 (2017). 676 37 Sharon, G. et al. Human Gut Microbiota from Autism Spectrum Disorder Promote Behavioral 677 Symptoms in Mice. Cell 177, 1600-1618 e1617, doi:10.1016/j.cell.2019.05.004 (2019). 678 38 Koren, O. et al. Host Remodeling of the Gut Microbiome and Metabolic Changes during 679 Pregnancy. Cell 150, 470-480, doi:10.1016/j.cell.2012.07.008 (2012). 680 39 Ridaura, V. K. et al. Gut Microbiota from Twins Discordant for Obesity Modulate Metabolism in 681 Mice. Science 341, 1079-U1049, doi:10.1126/science.1241214 (2013). 682 40 Goodrich, J. K. et al. Human genetics shape the gut microbiome. Cell 159, 789-799, 683 doi:10.1016/j.cell.2014.09.053 (2014). 684 41 Kovatcheva-Datchary, P. et al. Dietary Fiber-Induced Improvement in Glucose Metabolism Is 685 Associated with Increased Abundance of Prevotella. Cell Metab 22, 971-982, 686 doi:10.1016/j.cmet.2015.10.001 (2015). 687 42 Chiu, C. C. et al. Nonalcoholic Fatty Liver Disease Is Exacerbated in High-Fat Diet-Fed 688 Gnotobiotic Mice by Colonization with the Gut Microbiota from Patients with Nonalcoholic 689 Steatohepatitis. Nutrients 9, doi:10.3390/nu9111220 (2017). 690 43 Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 691 5, 14, doi:10.1186/s40168-016-0222-x (2017). 692 44 Zhang, L. B., MI.; Roager, HM.; Fonvig, CE.; Hellgren, LI.; Frandsen, HL.; Pedersen, O.; Holm, 693 J-C.; Hansen, T.; Licht, TR. Environmental spread of microbes impacts the development of 694 metabolic phenotypes in mice transplanted with microbial communities from humans. The ISME 695 Journal 11, 14, doi:doi.org/10.1038/ismej.2016.151 (2017). 696 45 Ge, X. et al. Potential role of fecal microbiota from patients with slow transit constipation in the 697 regulation of gastrointestinal motility. Sci Rep 7, 441, doi:10.1038/s41598-017-00612-y (2017). 698 46 Basson, A. et al. Human Gut Microbiome Transplantation in Ileitis Prone Mice: A Tool for the 699 Functional Characterization of the Microbiota in Inflammatory Bowel Disease Patients. Inflamm 700 Bowel Dis, doi:10.1093/ibd/izz242 (2019). 701 47 Papoulis, A. Probability, Random Variables, and Stochastic Processes. 2nd edn, 69-71 702 (McGraw-Hill, 1984). 703 48 Feller, W. in An Introduction to Probability Theory and Its Applications Vol. 3rd 69-71 (Wiley, 704 1971). 705 49 Weisstein, E. W. Weak Law of Large Numbers. From MathWorld--A Wolfram Web Resource. 706 Accessed March 4, 2020. http://mathworld.wolfram.com/WeakLawofLargeNumbers.html. 707 50 Krzywinski, M. & Altman, N. Visualizing samples with box plots. Nat Methods 11, 119-120, 708 doi:10.1038/nmeth.2813 (2014). 709 51 Hintze, J. & Nelson, R. Violin Plots: A Box Plot-Density Trace Synergism. The American 710 52, 181-184, doi:10.1080/00031305.1998.10480559 (1998). 711 52 Johnsson, K., Linderoth, M. & Fontes, M. What is a "unimodal" cell population? Using statistical 712 tests as criteria for unimodality in automated gating and . Cytometry A 91, 908-916, 713 doi:10.1002/cyto.a.23173 (2017). 714 53 Testroet, E. D. et al. A novel and robust method for testing bimodality and characterizing porcine 715 adipocytes of adipose tissue of 5 purebred lines of pig. Adipocyte 6, 102-111, 716 doi:10.1080/21623945.2017.1304870 (2017). 717 54 Rodriguez-Palacios, A. et al. 'Cyclical Bias' in Microbiome Research Revealed by A Portable 718 Germ-Free Housing System Using Nested Isolation. Sci Rep 8, 18, doi:10.1038/s41598-018- 719 20742-1 (2018). 720 55 Walters, W. A., Xu, Z. & Knight, R. Meta-analyses of human gut microbes associated with obesity 721 and IBD. FEBS Lett 588, 4223-4233, doi:10.1016/j.febslet.2014.09.039 (2014). 722 56 Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn's disease. Cell Host 723 Microbe 15, 382-392, doi:10.1016/j.chom.2014.02.005 (2014). 724 57 Hanck, C., Arnold, M., Gerber, A. & Schmelzer, M. in Introduction to with R 725 (Germany, 2019).

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 23 726 58 Biau, D. J., Kerneis, S. & Porcher, R. Statistics in brief: the importance of sample size in the 727 planning and interpretation of medical research. Clin Orthop Relat Res 466, 2282-2288, 728 doi:10.1007/s11999-008-0346-9 (2008). 729 59 Faber, J. & Fonseca, L. M. How sample size influences research outcomes. Dental Press J 730 Orthod 19, 27-29, doi:10.1590/2176-9451.19.4.027-029.ebo (2014). 731 60 Package ‘multimode’ v. 1.4 (2018). 732 61 Hartigan, J. A. & Hartigan, P. M. The Dip Test of Unimodality. The Annals of Statistics 13, 70-84 733 (1985). 734 62 Freeman, J. B. & Dale, R. Assessing bimodality to detect the presence of a dual cognitive 735 process. Behav Res Methods 45, 83-97, doi:10.3758/s13428-012-0225-x (2013). 736 63 Stanbro, M. Hartigan's Dip Test of Unimodality Applied on Terrestrial Gamma--ray Flashes. 737 64 Wolfe, J. H. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 738 5, 329-350 (1970). 739 65 Xu, L., Bedrick, E. J., Hanson, T. & Restrepo, C. A Comparison of Statistical Tools for Identifying 740 Modality in Body Mass Distributions. Journal of Data Science 12, 175-196 (2014). 741 66 Kang, Y.-J. & Noh, Y. Development of Hartigan’s Dip Statistic with Bimodality Coefficient to 742 Assess Multimodality of Distributions. Mathematical Problems in Engineering Article ID 4819475, 743 17 (2019). 744 67 Rodriguez-Palacios, A. et al. Stereomicroscopic 3D-pattern profiling of murine and human 745 intestinal inflammation reveals unique structural phenotypes. Nat Commun 6, 7577, 746 doi:10.1038/ncomms8577 (2015). 747 68 Hartigan's Dip Test Statistic for Unimodality - Corrected v. 0.75-7 (R Foundation for Statistical 748 Computing, 2016). 749 69 Cox, N. J. sg113_2: Tabulation of modes. Stata Journal 3, 26-27. Reprinted in Stata Technical 750 Bulletin Reprints, vol.29, pp. 180-181 (2009). 751 70 Bickel, D. R. & Fruhwirth, R. On a fast, robust estimator of the mode: Comparisons to other 752 robust estimators with applications. Computational Statistics & Data Analysis 50, 3500-3530 753 (2006). 754

Basson AR, Cominelli F, Rodriguez-Palacios A. Patterns of ‘Analytical Irreproducibility’ in Multimodal Diseases. 2020. Under review. 24